Re: [Dar-support] A lot of files slows dar down
From: Denis C. <dar...@fr...> - 2007-05-02 18:41:46
Hello Erik,

After several different implementations and tests, your implementation is
the best I could find, far ahead of the others. I could not remove the
vector<>, because dar needs to read directory entries sequentially, in the
order they were added to the directory (to avoid changing slices
unnecessarily when extracting/testing/diffing/merging). So I kept your
implementation almost word for word, except for one point specific to the
std::map class: "map[key]" creates the corresponding entry when it is not
yet in the map, so you need to use the find() method instead to check
whether a particular entry exists.

The performance improvement is very good, and the memory overhead is worth
it. However, I added a --disable-large-dir option to the configure script
to keep the original implementation available for those who have tight
memory constraints.

Regards,
Denis.

Denis Corbin wrote:
> Erik Wasser wrote:
> [...]
>>>> The main problem was that incremental backups of a directory with a
>>>> large number of files (>90,000) take a long time. I know about this
>>>> drawback of the filesystem, but I can't change the software producing
>>>> this mess. B-(
>>> Yep. However, note that while we can certainly speed up dar on this
>>> point, we cannot speed up the filesystem operations. I guess
>>> optimizing dar will only be noticeable when you want to restore a few
>>> files :-/; otherwise the filesystem will impose the rhythm.
>> Incremental backups are a good test case too. Reading a directory is (I
>> think) O(n), but comparing the current content with the content of the
>> reference archive is O(n^2). I do backups very often but only a few
>> restores, so this is quite a common case for me.
>
> Yep, that's right.
>
>>> The hash can also be done by hand; this was my idea in the (old)
>>> reference you mentioned above. Thinking again about this point, I
>>> would design a flexible tree-like class:
>> The tree structure is a good idea, but it seems too complicated for me
>> to implement in C++.
>
> I guess it is not that complicated.
>
>> So I did it the lazy way. I added the following line to catalogue.hpp:
>>
>>   std::map<std::string, nomme *> fils_map;
>>
>> I think you can already see where this is going. I use the variable
>> 'fils' for the iteration stuff and 'fils_map' for the lookup in the
>> method 'directory::search_children'. So I only had to add some
>> 'fils_map.clear()' and some 'fils_map[foo] = bar' statements.
>
> That's not a bad idea at all, in fact. :-) I just wonder why I had not
> thought of it before. Maybe because I don't have 90,000 files in a
> single directory? ;-)
>
>> I made the classical tradeoff between speed and memory, in favor of
>> more speed and more memory. B-)
>
> lol
>
>> The patch is already done, and with only 10,000 (empty) files in one
>> directory the time goes down from 34 seconds to 4 seconds. I think
>> that is quite okay. (Increasing the number of files to 20,000 reduces
>> the time from 136 seconds to 10 seconds.)
>
> Seeing the efficiency of your solution, I just wonder whether it would
> be worth developing the tree-like class, to gain maybe nothing compared
> to your solution based on std::map.
>
>> I've attached some files to this message and I hope they will pass
>> through the mailing list.
>>
>> Attachments:
>>
>> a) The patch itself. As you can see, it was made against the last
>> stable version, 2.2.3.
>
> Cool. Briefly looking at the patch, I just wonder why not completely
> replace the std::vector with the std::map (instead of keeping both)?
>
>> b) A Perl script that creates a lot of files, named with the scheme
>> "filename-X.ext", in the current(!) directory. The first and only
>> parameter is the number of files.
>>
>> c) A small shell script for testing dar.
>> The script creates a full archive of the given directory and an
>> (empty) incremental backup against the first one, and prints the time
>> taken.
>>
>> Some thoughts:
>>
>> a) I was not sure what the member variable 'it' (class 'directory') is
>> about, but I didn't touch it.
>
> 'it' is an iterator (see it like a pointer). It points to the next item
> to be returned by the directory::read_children() method.
>
>> b) Maybe it would be possible to add a configure flag for supporting
>> large directories, so people can choose whether they want to make the
>> speed<->memory tradeoff.
>
> I must double-check the way to read a std::map sequentially, but my
> guess is that it is possible, and that replacing the std::vector with a
> std::map would not add much memory requirement while bringing a great
> speed improvement in the situation you are in.
>
>> c) The threshold idea can be combined with my patch: adding more than
>> X files into 'fils' could bring 'fils_map' to life and use it for the
>> rest of the running instance.
>
> My guess is that std::map is probably very well optimized. However,
> splitting directory entries over several std::map objects instead of
> keeping them all in one might give some minor improvement for lookup
> operations. I am not convinced the delta would be worth it. When I have
> more time I will try...
>
>>> If you like, you could first specify and then implement the class
>>> whose data structure I described above. I will just be sad not having
>>> been able to do it myself, but... that's no problem, there are still
>>> plenty of features to add to dar. ;-)
>> As mentioned above, I think I'm not able to implement either the
>> 'flexible' or the 'tree-like' structure. To me my patch seems a "good"
>> solution, but there is more than one way. At least I hope I don't
>> break anything in dar.
>> B-)
>
> I also think that your solution is a good one, except that it could
> save some memory by completely removing the use of the std::vector.
>
>> Comments are welcome.
>
> Thanks for your patch and feedback.
>
> Kind Regards,
> Denis.