From: Jeffrey J. K. <bac...@ko...> - 2010-08-30 02:35:39
Craig Barratt wrote at about 18:45:05 -0700 on Sunday, August 29, 2010:
 > Although I probably won't support it initially in 4.0, my plan would be
 > to use the --checksum option to pre-match potential files in the pool
 > if there isn't an existing file with the same path (ie: for new or
 > renamed files).

Sounds like a good idea.

 > In 4.x, --checksum would allow an efficient transfer of any new file
 > that was already in the pool. I would plan to use "--checksum" for
 > full backups (it's too expensive on the client for incrementals). I'll
 > probably make it a user-configured option whether a "full" does this
 > shortcut based only on the full-file MD5 matching (ie: skips block
 > digest checking), or whether a full also requires block digest matching
 > too even if the full-file MD5 matches; it could be a probability so
 > that any corruption or digest collisions (very unlikely with full-file
 > MD5, although examples are now well known) are slowly fixed.

If you also compared the file size, I imagine that a non-maliciously constructed collision would be even more unlikely, since the size and the MD5 are, I would imagine, "relatively" independent checks.

 > If you are comfortable with a full backup just comparing full-file
 > MD5 digests (and all file attributes too), then there is a massive
 > reduction in server load since the MD5 digest is now stored in the
 > attribute file (since it's the path to the pool file; no hardlinks
 > remember) - it's essentially no more effort to compare MD5 digests
 > as it is comparing the other file attributes. Basically the client
 > does most of the work for a full since it needs to read every file
 > computing the full-file MD5 digests. But the server has no more
 > work to do than an incremental if files haven't changed.

Sounds great.

 > If you are more cautious you could increase the "block-digest-check"
 > probability to, eg, 1%, 10%, or 100%. The last case would make it
 > behave like 3.x - every file in a full does block digest checking
 > (and consequently full file digest checking too). However, the
 > client load will be higher since each file will be read twice
 > in this case.

I guess if you are really paranoid, there is a vanishingly small chance that the block checksums could also collide...

I think for the average user, the real question is whether, setting aside maliciously-created collisions, the real-world probability of a collision is large enough to worry about relative to the probabilities of the other failure points. The back-of-the-envelope calculations I did last year seem to suggest that the chance of a random collision is vanishingly small for any reasonable number of files. And again, if the file size could also be checked, I would imagine that the chance of collision would go down by another couple of orders of magnitude (assuming MD5 collisions are not highly correlated with file size).
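For what it's worth, here is a rough sketch (in Python) of the kind of back-of-the-envelope birthday-bound estimate I mean. The pool size of 10^9 files and the "~20 extra bits from the file size" figure are just illustrative assumptions on my part, not measured numbers:

    # Birthday-bound approximation for n uniformly random digests of the
    # given bit length: P(at least one collision) ~ n*(n-1) / 2^(bits+1)
    def collision_probability(n, bits=128):
        return n * (n - 1) / 2.0 ** (bits + 1)

    # Example: a pool of one billion files, full-file MD5 (128 bits).
    print(collision_probability(10**9))            # ~1.5e-21

    # If the file size really is roughly independent of the MD5 and adds,
    # say, ~20 bits of entropy (an assumption), the estimate drops further:
    print(collision_probability(10**9, bits=148))  # ~1.4e-27

Even the first number is many orders of magnitude below the probability of ordinary disk or memory errors, which is what I mean by "vanishingly small" for any reasonable number of files.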