Re: [Dar-support] A lot of files slows dar down
From: Denis C. <dar...@fr...> - 2007-05-02 18:41:46
Hello Erik,

After several different implementations and tests, your implementation is
the best I could find, far ahead of the others. I could not remove the
vector<>, because dar needs to read directory entries sequentially, in the
order they were added to the directory (to avoid changing slices
unnecessarily when extracting/testing/diffing/merging). So I kept your
implementation almost word for word, except for one point specific to the
std::map class: "map[key]" creates the corresponding entry when it is not
yet in the map, so you need to use the find() method instead to check
whether a particular entry exists.

The performance improvement is very good, and the memory overhead is worth
it. However, I added a --disable-large-dir option to the configure script
to keep the original implementation available for those who have tight
memory constraints.

Regards,
Denis.

Denis Corbin wrote:
> Erik Wasser wrote:
> [...]
>>>> The main problem was that incremental backups of a directory with a
>>>> large number of files (>90,000) take a long time. I know about this
>>>> drawback of the filesystem, but I can't change the software producing
>>>> this mess. B-(
>>> Yep. However, note that while we can certainly speed up dar on this
>>> point, we cannot speed up the filesystem operations. I guess
>>> optimizing dar will only be noticeable when you want to restore a few
>>> files :-/; otherwise the filesystem will impose the rhythm.
>> Incremental backups are a good test case too. Reading a directory is (I
>> think) O(n), but comparing the current content with the content of the
>> reference archive is O(n^2). I do backups very often but only a few
>> restores, so this is quite a common case for me.
>
> Yep, that's right.
>
>>> The hash can also be done by hand; this was my idea in the (old)
>>> reference you mentioned above. Thinking again about this point, I
>>> would design a flexible tree-like class:
>> The tree structure is a good idea, but it seems too complicated for me
>> to implement in C++.
>
> I guess it is not that complicated.
>
>> So I did it the lazy way. I added the following line to catalogue.hpp:
>>
>>   std::map<std::string, nomme *> fils_map;
>>
>> I think you can already see where this is going. I use the variable
>> 'fils' for the iteration stuff and 'fils_map' for the lookup in the
>> method 'directory::search_children'. So I only had to add some
>> 'fils_map.clear()' and some 'fils_map[foo] = bar' statements.
>
> That's not a bad idea at all, in fact. :-) I just wonder why I had not
> thought of it before. Maybe because I don't have 90,000 files in a
> single directory? ;-)
>
>> I made the classical tradeoff between speed and memory, in favor of
>> more speed and more memory. B-)
>
> lol
>
>> The patch is already done, and with only 10,000 (empty) files in one
>> directory the time goes down from 34 seconds to 4 seconds. I think
>> that is quite okay. (Increasing the number of files to 20,000 reduces
>> the time from 136 seconds to 10 seconds.)
>
> Seeing the efficiency of your solution, I just wonder whether it would
> be worth developing the tree-like class, to gain maybe nothing compared
> to your solution based on std::map.
>
>> I've attached some files to this message and I hope they will pass
>> through the mailing list.
>>
>> Attachments:
>>
>> a) The patch itself. As you can see, it was made against the last
>> stable version, 2.2.3.
>
> Cool. Briefly looking at the patch, I just wonder why not completely
> replace the std::vector with the std::map (instead of keeping both)?
>
>> b) A Perl script that creates a lot of files, named with the scheme
>> "filename-X.ext", in the current(!) directory. The first and only
>> parameter is the number of files.
>>
>> c) A small shell script for testing dar.
>> The script creates a full archive of the given directory and an
>> (empty) incremental backup against the first one, and prints the time
>> taken.
>>
>> Some thoughts:
>>
>> a) I was not sure what the member variable 'it' (class 'directory') is
>> about, but I didn't touch it.
>
> 'it' is an iterator (see it like a pointer). It points to the next item
> to be returned by the directory::read_children() method.
>
>> b) Maybe it would be possible to add a configure flag for supporting
>> large directories, so people can choose whether they want to make the
>> speed<->memory tradeoff.
>
> I must double-check the way to read a std::map sequentially, but my
> guess is that it is possible, and that replacing the std::vector with a
> std::map would not add much memory requirement while bringing a great
> speed improvement in the situation you are in.
>
>> c) The threshold idea can be combined with my patch: adding more than
>> X files into 'fils' could bring 'fils_map' to life and use it for the
>> rest of the running instance.
>
> My guess is that std::map is probably very well optimized. However,
> splitting directory entries over several std::map objects instead of
> keeping them all in one might give some minor improvement for lookup
> operations. I am not convinced the delta would be worth it. When I have
> more time I will try...
>
>>> If you like, you could first specify and then implement the class
>>> whose data structure I described above. I will just be sad not having
>>> been able to do it myself, but... that's no problem, there are still
>>> plenty of features to add to dar. ;-)
>> As mentioned above, I think I'm not able to implement either the
>> 'flexible' or the 'tree-like' structure. To me my patch seems a "good"
>> solution, but there is more than one way. At least I hope I don't
>> break anything in dar.
>> B-)
>
> I also think that your solution is a good one, except that it could
> save some memory by completely removing the use of the std::vector.
>
>> Comments are welcome.
>
> Thanks for your patch and feedback.
>
> Kind Regards,
> Denis.