From: cppjavaperl <cpp...@ya...> - 2005-04-03 05:28:51
--- Craig Barratt <cba...@us...> wrote:

> Marlin Prowell writes:
>
> > > 1) add the MD5 sum to the XferLOG line for each file in a
> > > numbered backup.  Actually, now that I think of it, instead
> > > of just the MD5 sum, it seems it should be the "modified" MD5
> > > sum, that is the MD5 string plus whatever suffix leads to the
> > > correct file in the pool (for the case of multiple files
> > > matching the same sum -- that way you know exactly which
> > > file in the pool you need to link to).
> >
> > This is a good idea, but does not quite work.  I looked at this
> > code briefly, and I recall seeing comments about *renaming* pool
> > files with identical MD5 values so that all the MD5 collision
> > names were kept in sequential order.  If the middle file of 5
> > collisions is deleted, then xxx_4 and xxx_5 are renamed so the
> > files were named _1 through _4.
> >
> > It means that the pool file names are not permanently assigned.
> > The pool file name at dump time may not be the pool file name at
> > backup time.
>
> You know the code well!  The original approach has merit, but all
> you would store in the log file is the md5 sum and not the _nnn
> extension.  The mirror/archive script would still need to do an
> inode comparison to make sure it has the right _nnn file, in the
> rare cases when there are md5 collisions for that file.
>
> However, the original approach does have some other flaws.  There
> are cases where files are linked to without knowing their md5
> digest (meaning this information cannot easily be written to the
> XferLOG).  The two cases that come to mind are:
>
>   - when a file fails to transfer correctly (eg: smb fills a file
>     with 0x0 when it can't read it due to a WinXX lock), that file
>     is removed and the same file in a previous backup is linked to,
>     without knowing the md5 digest.
>
>   - when XferMethod rsync notices a file is identical (based on
>     rsync's block/file checksum algorithm) the previous file is
>     linked to, again without knowing the md5 digest.

Craig,

In these cases couldn't you store the actual path linked to in the
XferLOG instead of the MD5 sum?  MD5 sums won't ever have a slash
('/') or backslash ('\') character in them, right?  So the
replicating/backup script could just parse for those when processing
the XferLOG, to determine whether to use a literal path as opposed to
an MD5 sum.  Of course, some other type of mechanism (a single
character flag at the beginning of the MD5/path string, for example)
could be used.  Actually, you could just do a literal path for *all*
files, since the MD5 sum's path is mostly the MD5 sum with a few path
separators in it :-)

I suspect that even though all these extra paths would increase the
size of the XferLOG files, it would not be too bad, since the paths
should compress pretty well.  I'm not saying the size increase would
be insignificant, I'm just saying I think it would be livable.

Another approach might be to have a separate "link log" which would
be updated whenever files are linked or deleted.  Of course, this
would affect performance, and it would mean redundant data (the
"target" paths would be in both this log and the XferLOG).
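Just to make the parsing idea concrete, here is a rough Python sketch
of what the replicating script might do with each "link target" field.
The pool location, the XferLOG field format, and the _nnn suffix
numbering are all guesses on my part (I think the pool path is built
from the first three hex digits of the digest, but correct me if I'm
wrong), so treat this as an illustration, not working code:

    import os

    POOL = "/var/lib/backuppc/cpool"    # made-up location; use $TopDir

    def pool_candidates(digest):
        """Yield the pool paths that could hold this digest: the bare
        digest name plus the _nnn collision suffixes, stopping at the
        first gap.  (I'm guessing the suffixes start at _0; check the
        BackupPC code.)"""
        base = os.path.join(POOL, digest[0], digest[1], digest[2], digest)
        if os.path.exists(base):
            yield base
        n = 0
        while os.path.exists("%s_%d" % (base, n)):
            yield "%s_%d" % (base, n)
            n += 1

    def resolve(field, backup_file):
        """Turn one XferLOG field into a pool path.  A field with a
        path separator is a literal path; anything else is an MD5
        digest, disambiguated by comparing inodes against the file in
        the backup tree."""
        if "/" in field or "\\" in field:
            return field                # literal path case
        want = os.stat(backup_file).st_ino
        for cand in pool_candidates(field):
            if os.stat(cand).st_ino == want:
                return cand
        return None                     # chain renamed since the dump?

The nice part is that the inode comparison makes the _nnn renaming
problem Craig described a non-issue for the script: it never trusts
the suffix, only the inode.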
Craig, you mentioned the 'inotify' utility before.  It occurs to me
that this might actually be a solution to consider.  If inotify will
track all the links created, as well as every time a file is deleted,
a script could be written to parse that data and recreate the links
on the replicated pool.

There are a couple of nice things that come to mind about this:

1) It would be perfectly sequential (all logs for links/deletes would
   be in the exact order they actually occurred).  If BackupPC tried
   to log all of its links/deletes itself, you could run into trouble
   because of multiple backups running simultaneously -- or at least
   it seems that way to me (correct me if I'm wrong).

2) BackupPC itself would not have to be modified (the entire
   replication process could be done outside of BackupPC).

Cons for this might include:

1) If the inotify daemon is ever killed, shut down, or dies while
   BackupPC is still running, you're out of luck: the record of every
   link/delete that BackupPC does would be lost for that time period.
   Perhaps this could be partially resolved by adding an option that
   requires BackupPC to check that the inotify daemon is running
   before performing any backups.  Still, that would not fix problems
   that occur with the daemon while BackupPC is already doing a
   backup.

2) Lack of integration with BackupPC means you have to work out
   issues related to which parts of the link/delete log are relevant
   to the current pool.  Ideally, you would not want to process the
   whole log every time you do a replication.  And you would probably
   like to toss old logs that are no longer relevant (keeping some
   history, probably).

(A very rough sketch of what the watcher side might look like is in
the P.S. below.)

Spare time is hard to come by for me these days, but if I can find
some I might look into this a little more.  If any of you guys/gals
have comments/suggestions about this, I'd love to hear them.  I agree
with Paul Fox that BackupPC really needs a good solution for this.  I
also would be willing to take somewhat of a performance hit on
backups to make pool replication a possibility.

I have lost 3 hard drives in the last 3 1/2 months (no, I don't think
I have a 'dirty power' problem; I think this is just one of those
"things fail in groups" things).  Fortunately for me, I had been
doing some rsync-based backups to what is now my BackupPC server, so
I didn't lose the most important stuff.  Unfortunately, I hadn't
finished setting all that up before the first major drive failure
occurred.  I am much better off now, because I have set up most (if
not all) of the machines I need backed up with BackupPC.  However,
should the day come that lightning hits the building, or some power
issue cooks the drives in all my machines, I will be toast without a
backup to some type of media.  I would prefer that be DVD-R(W).  I
could do partition copying to a USB drive, but I don't have a USB
drive, and I already have the DVD-RW drive and media to do the job.
Plus, I like the idea of having a backup on something that has no
moving parts :-)  And yes, I am aware of the issues with DVD-R(W)
media, and the questions about how long it stays reliable, etc.  I
have taken that into consideration, and will take whatever
precautions I feel are necessary to make sure I have good backups.

Looking forward to others' further discussion on this.

cppjavaperl
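P.S.  To give the inotify idea a little more shape, here is a very
rough sketch of what the watcher side might look like, using the
third-party pyinotify Python module.  The paths and log format are
made up.  One open issue: inotify only reports that a name appeared
or disappeared, not which inode it points at, so the replication
script would still need to stat() new names and match inodes on its
own.

    import pyinotify

    WATCH_DIRS = ["/var/lib/backuppc/cpool",   # made-up paths;
                  "/var/lib/backuppc/pc"]      # use $TopDir
    LOG = open("/var/log/backuppc-linklog", "a")

    # IN_CREATE fires for new hardlinks as well as new files, and
    # IN_DELETE when a name is removed.  The _nnn reshuffling shows
    # up as IN_MOVED_FROM/IN_MOVED_TO pairs.
    MASK = (pyinotify.IN_CREATE | pyinotify.IN_DELETE |
            pyinotify.IN_MOVED_FROM | pyinotify.IN_MOVED_TO)

    class LinkLogger(pyinotify.ProcessEvent):
        def process_default(self, event):
            # event.maskname is e.g. "IN_CREATE"; event.pathname is
            # the full path the event applies to.
            LOG.write("%s %s\n" % (event.maskname, event.pathname))
            LOG.flush()

    wm = pyinotify.WatchManager()
    notifier = pyinotify.Notifier(wm, LinkLogger())
    for d in WATCH_DIRS:
        wm.add_watch(d, MASK, rec=True, auto_add=True)
    notifier.loop()   # runs until killed

Events get appended in the order the kernel delivers them, which is
what gives you the "perfectly sequential" property above -- but as
noted, if the watcher dies, there is a hole in the log.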