Old 05-14-2005, 01:09 PM  
darkone
Disabled
 
Join Date: Dec 2001
Posts: 2,230

Quote:
Originally posted by FTPServerTools
Do me a favor and try my dupechecker with that amount of files. Upload checking against 60000 files should take about 17 reads in the file, meaning within 0.2 seconds it can check a dupe out of 60000 files, and yet it is still a simple SORTED!! list. It works on dirs, but I can extend it to work on files. The drawback is that after a save it takes more time to process... Have you considered using sqlite with an index in tcl? There is a tcl extension for sqlite that might give you the stuff you need. You would need some tcl scripting then, though. 60000 files with an average of, let's say, 14 characters is only a measly 830K file, which is basically a small file. Reading such a file can be done super quickly. You can use DupeLister to make new dupelists. Please let me know if it is fast enough for you. I have been given reports of 400000 entries being handled within 2 seconds, so I assume in your case it'd be fast enough. If it is, let me know and I'll see if I can add file support to it as well (shouldn't be hard at all).
DupeSearch and DupeLister are what you need for testing. OnDirCreated does dupe dir blocking; I can make an OnPreFileUpload or something like that which tests a filename against a list like the one created with DupeLister.
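
(For reference, the sqlite-with-an-index idea mentioned above could look something like this; sketched with Python's sqlite3 module rather than the Tcl extension, and with made-up table and column names, just to show the shape of it.)

Code:
# Rough sketch of the "sqlite with an index" idea; Python's sqlite3 is used
# here instead of the Tcl extension, and the table/column names are made up.
import sqlite3

con = sqlite3.connect("dupes.db")
con.execute("CREATE TABLE IF NOT EXISTS files (name TEXT)")
# The unique index is what makes lookups logarithmic instead of a full scan,
# and it doubles as the dupe constraint.
con.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_name ON files (name)")

def is_dupe(name):
    # Indexed point lookup; fast even with hundreds of thousands of rows.
    return con.execute("SELECT 1 FROM files WHERE name = ?",
                       (name,)).fetchone() is not None

def add_file(name):
    try:
        con.execute("INSERT INTO files (name) VALUES (?)", (name,))
        con.commit()
        return True
    except sqlite3.IntegrityError:   # already in the list -> dupe
        return False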
Consider splitting the data across more than one file when the file (database) size grows too large. I'm assuming you're using a method similar to the binary-search-on-a-sorted-file algorithm that I posted a while ago.

Here's a simple example of how the contents of the files could look:

filedb_1.dat
[number of files in database]
[min value of database part 1]
[filename of database part 1]
[min value of database part ...]
[filename of database part ...]
...
[min value of database part N]
[filename of database part N]
[database contents part 1]

filedb_....dat
[database contents part ...]

filedb_N.dat
[database contents part N]
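
A rough sketch of building that layout (Python just to show the idea; RECORD_LEN, PART_SIZE and write_dupedb are made-up names, the first header line is read here as the number of part files, and entries are padded to a fixed length as suggested below):

Code:
# Sketch only: split a sorted list of names into part files of at most
# PART_SIZE entries and write the header block into filedb_1.dat.
# RECORD_LEN, PART_SIZE and write_dupedb are made-up names.
RECORD_LEN = 64          # fixed entry length in bytes, newline included
PART_SIZE = 5000         # maximum entries per part file

def pad(name):
    # Fixed-length records let the reader seek straight to entry N.
    return name.encode()[:RECORD_LEN - 1].ljust(RECORD_LEN - 1) + b"\n"

def write_dupedb(sorted_names):
    if not sorted_names:
        return
    parts = [sorted_names[i:i + PART_SIZE]
             for i in range(0, len(sorted_names), PART_SIZE)]
    # Header: part count first (one reading of the layout above), then a
    # (min value, filename) pair for every part.
    header = [str(len(parts))]
    for n, part in enumerate(parts, 1):
        header += [part[0], "filedb_%d.dat" % n]
    with open("filedb_1.dat", "wb") as f:
        f.write(("\n".join(header) + "\n").encode())
        for name in parts[0]:
            f.write(pad(name))
    for n, part in enumerate(parts[1:], 2):
        with open("filedb_%d.dat" % n, "wb") as f:
            for name in part:
                f.write(pad(name))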

When a file grows beyond e.g. 5000 entries, it's split in two and the header information in filedb_1.dat is updated. With 1,000,000 entries you'd end up with 200 files. That equals a maximum of 8 comparisons (binary search variant on the min values) + 13 comparisons (binary search within the file). Neat and very efficient. It might also be wise to limit filenames to a fixed length and use a read buffer twice that size, so you always get a full entry no matter where the read lands.
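
A sketch of the lookup itself, assuming the files were written as in the sketch above (is_dupe and read_header are again made-up names):

Code:
# Sketch of the two-level lookup, assuming files written as in the sketch
# above (fixed RECORD_LEN-byte entries, header in filedb_1.dat).
# read_header and is_dupe are made-up names.
import bisect, os

def read_header():
    with open("filedb_1.dat", "rb") as f:
        count = int(f.readline().decode())
        mins, names = [], []
        for _ in range(count):
            mins.append(f.readline().rstrip(b"\n").decode())
            names.append(f.readline().rstrip(b"\n").decode())
        # f.tell() is the byte offset where part 1's records start.
        return mins, names, f.tell()

def is_dupe(name):
    mins, files, part1_offset = read_header()
    # Level 1: binary search on the min values picks the right part file
    # (<= 8 comparisons for 200 parts).
    i = max(bisect.bisect_right(mins, name) - 1, 0)
    offset = part1_offset if i == 0 else 0
    entries = (os.path.getsize(files[i]) - offset) // RECORD_LEN
    lo, hi = 0, entries - 1
    with open(files[i], "rb") as f:
        # Level 2: binary search on the fixed-length records in that file
        # (<= 13 comparisons for 5000 entries).
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(offset + mid * RECORD_LEN)
            entry = f.read(RECORD_LEN).rstrip().decode()
            if entry == name:
                return True
            if entry < name:
                lo = mid + 1
            else:
                hi = mid - 1
    return False

With exact fixed-length records you can seek straight to a record boundary, so the double-sized read buffer is really only needed if you seek to arbitrary byte offsets instead of multiples of the record length.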

If you need further information, just message me on IRC.