From: cppjavaperl <cpp...@ya...> - 2005-04-03 05:28:51
--- Craig Barratt <cba...@us...> wrote:

> Marlin Prowell writes:
>
> > > 1) add the MD5 sum to the XferLOG line for each file in a
> > > numbered backup.  Actually, now that I think of it, instead
> > > of just the MD5 sum, it seems it should be the "modified" MD5
> > > sum, that is the MD5 string plus whatever suffix leads to the
> > > correct file in the pool (for the case of multiple files
> > > matching the same sum -- that way you know exactly which
> > > file in the pool you need to link to).
> >
> > This is a good idea, but does not quite work.  I looked at this
> > code briefly, and I recall seeing comments about *renaming* pool
> > files with identical MD5 values so that all the MD5 collision
> > names were kept in sequential order.  If the middle file of 5
> > collisions is deleted, then xxx_4 and xxx_5 are renamed so the
> > files were named _1 through _4.
> >
> > It means that the pool file names are not permanently assigned.
> > The pool file name at dump time may not be the pool file name at
> > backup time.
>
> You know the code well!  The original approach has merit, but all
> you would store in the log file is the md5 sum and not the _nnn
> extension.  The mirror/archive script would still need to do an
> inode comparison to make sure it has the right _nnn file, in the
> rare cases when there are md5 collisions for that file.
>
> However, the original approach does have some other flaws.  There
> are cases where files are linked to without knowing their md5
> digest (meaning this information cannot easily be written to the
> XferLOG).  The two cases that come to mind are:
>
>   - when a file fails to transfer correctly (eg: smb fills a file
>     with 0x0 when it can't read it due to a WinXX lock), that file
>     is removed and the same file in a previous backup is linked to,
>     without knowing the md5 digest.
>
>   - when XferMethod rsync notices a file is identical (based on
>     rsync's block/file checksum algorithm) the previous file is
>     linked to, again without knowing the md5 digest.

Craig,

In these cases couldn't you store the actual path linked to in the
XferLOG instead of the MD5 sum?  MD5 sums won't ever have a slash
('/') or backslash ('\') character in them, right?  So the
replicating/backup script could just parse for those when processing
the XferLOG, to determine whether to use a literal path as opposed to
an MD5 sum.  Of course, some other type of mechanism (a single
character flag at the beginning of the MD5/path string, for example)
could be used.  Actually, you could just do a literal path for *all*
files, since the MD5 sum's path is mostly the MD5 sum with a few path
separators in it :-)

I suspect that even though all these extra paths would increase the
size of the XferLOG files, it would not be too bad, since the paths
should compress pretty well.  I'm not saying the size increase would
be insignificant, I'm just saying I think it would be livable.

Another approach might be to have a separate "link log" which would
be updated whenever files are linked or deleted.  Of course, this
would affect performance, and it would mean redundant data (the
"target" paths would be in both this log and the XferLOG).
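Just to make the parsing idea concrete, here is a rough Python sketch
of what the replicating script might do with each "link target" field.
The pool location, the XferLOG field format, and the _nnn suffix
numbering are all guesses on my part (I think the pool path is built
from the first three hex digits of the digest, but correct me if I'm
wrong), so treat this as an illustration, not working code:

    import os

    POOL = "/var/lib/backuppc/cpool"    # made-up location; use $TopDir

    def pool_candidates(digest):
        """Yield the pool paths that could hold this digest: the bare
        digest name plus the _nnn collision suffixes, stopping at the
        first gap.  (I'm guessing the suffixes start at _0; check the
        BackupPC code.)"""
        base = os.path.join(POOL, digest[0], digest[1], digest[2], digest)
        if os.path.exists(base):
            yield base
        n = 0
        while os.path.exists("%s_%d" % (base, n)):
            yield "%s_%d" % (base, n)
            n += 1

    def resolve(field, backup_file):
        """Turn one XferLOG field into a pool path.  A field with a
        path separator is a literal path; anything else is an MD5
        digest, disambiguated by comparing inodes against the file in
        the backup tree."""
        if "/" in field or "\\" in field:
            return field                # literal path case
        want = os.stat(backup_file).st_ino
        for cand in pool_candidates(field):
            if os.stat(cand).st_ino == want:
                return cand
        return None                     # chain renamed since the dump?

The nice part is that the inode comparison makes the _nnn renaming
problem Craig described a non-issue for the script: it never trusts
the suffix, only the inode.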
Craig, you mentioned the 'inotify' utility before.  It occurs to me
that this might actually be a solution to consider.  If inotify will
track all the links created, as well as every time a file is deleted,
a script could be written to parse that data and recreate the links
on the replicated pool.

There are a couple of nice things that come to mind about this:

1) It would be perfectly sequential (all logs for links/deletes would
   be in the exact order they actually occurred).  If BackupPC tried
   to log all of its links/deletes itself, you could run into trouble
   because of multiple backups running simultaneously -- or at least
   it seems that way to me (correct me if I'm wrong).

2) BackupPC itself would not have to be modified (the entire
   replication process could be done outside of BackupPC).

Cons for this might include:

1) If the inotify daemon is ever killed, shut down, or dies while
   BackupPC is still running, you're out of luck: the record of every
   link/delete that BackupPC does would be lost for that time period.
   Perhaps this could be partially resolved by adding an option that
   requires BackupPC to check that the inotify daemon is running
   before performing any backups.  Still, that would not fix problems
   that occur with the daemon while BackupPC is already doing a
   backup.

2) Lack of integration with BackupPC means you have to work out
   issues related to which parts of the link/delete log are relevant
   to the current pool.  Ideally, you would not want to process the
   whole log every time you do a replication.  And you would probably
   like to toss old logs that are no longer relevant (keeping some
   history, probably).

(A very rough sketch of what the watcher side might look like is in
the P.S. below.)

Spare time is hard to come by for me these days, but if I can find
some I might look into this a little more.  If any of you guys/gals
have comments/suggestions about this, I'd love to hear them.  I agree
with Paul Fox that BackupPC really needs a good solution for this.  I
also would be willing to take somewhat of a performance hit on
backups to make pool replication a possibility.

I have lost 3 hard drives in the last 3 1/2 months (no, I don't think
I have a 'dirty power' problem; I think this is just one of those
"things fail in groups" things).  Fortunately for me, I had been
doing some rsync-based backups to what is now my BackupPC server, so
I didn't lose the most important stuff.  Unfortunately, I hadn't
finished setting all that up before the first major drive failure
occurred.  I am much better off now, because I have set up most (if
not all) of the machines I need backed up with BackupPC.  However,
should the day come that lightning hits the building, or some power
issue cooks the drives in all my machines, I will be toast without a
backup to some type of media.  I would prefer that be DVD-R(W).  I
could do partition copying to a USB drive, but I don't have a USB
drive, and I already have the DVD-RW drive and media to do the job.
Plus, I like the idea of having a backup on something that has no
moving parts :-)  And yes, I am aware of the issues with DVD-R(W)
media, and the questions about how long it stays reliable, etc.  I
have taken that into consideration, and will take whatever
precautions I feel are necessary to make sure I have good backups.

Looking forward to others' further discussion on this.

cppjavaperl
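P.S.  To give the inotify idea a little more shape, here is a very
rough sketch of what the watcher side might look like, using the
third-party pyinotify Python module.  The paths and log format are
made up.  One open issue: inotify only reports that a name appeared
or disappeared, not which inode it points at, so the replication
script would still need to stat() new names and match inodes on its
own.

    import pyinotify

    WATCH_DIRS = ["/var/lib/backuppc/cpool",   # made-up paths;
                  "/var/lib/backuppc/pc"]      # use $TopDir
    LOG = open("/var/log/backuppc-linklog", "a")

    # IN_CREATE fires for new hardlinks as well as new files, and
    # IN_DELETE when a name is removed.  The _nnn reshuffling shows
    # up as IN_MOVED_FROM/IN_MOVED_TO pairs.
    MASK = (pyinotify.IN_CREATE | pyinotify.IN_DELETE |
            pyinotify.IN_MOVED_FROM | pyinotify.IN_MOVED_TO)

    class LinkLogger(pyinotify.ProcessEvent):
        def process_default(self, event):
            # event.maskname is e.g. "IN_CREATE"; event.pathname is
            # the full path the event applies to.
            LOG.write("%s %s\n" % (event.maskname, event.pathname))
            LOG.flush()

    wm = pyinotify.WatchManager()
    notifier = pyinotify.Notifier(wm, LinkLogger())
    for d in WATCH_DIRS:
        wm.add_watch(d, MASK, rec=True, auto_add=True)
    notifier.loop()   # runs until killed

Events get appended in the order the kernel delivers them, which is
what gives you the "perfectly sequential" property above -- but as
noted, if the watcher dies, there is a hole in the log.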