From: Jeffrey J. K. <bac...@ko...> - 2010-08-30 02:35:39
Craig Barratt wrote at about 18:45:05 -0700 on Sunday, August 29, 2010:
 > Although I probably won't support it initially in 4.0, my plan would be
 > to use the --checksum option to pre-match potential files in the pool
 > if there isn't an existing file with the same path (ie: for new or
 > renamed files).

Sounds like a good idea.

 > In 4.x, --checksum would allow an efficient transfer of any new file
 > that was already in the pool. I would plan to use "--checksum" for
 > full backups (it's too expensive on the client for incrementals). I'll
 > probably make it a user-configured option whether a "full" does this
 > shortcut based only on the full-file MD5 matching (ie: skips block
 > digest checking), or whether a full also requires block digest matching
 > too even if the full-file MD5 matches; it could be a probability so
 > that any corruption or digest collisions (very unlikely with full-file
 > MD5, although examples are now well known) are slowly fixed.

If you also compared the file size, I imagine that a non-maliciously constructed collision would be even more unlikely, since the size and the MD5 are, I would imagine, "relatively" independent checks.

 > If you are comfortable with a full backup just comparing full-file
 > MD5 digests (and all file attributes too), then there is a massive
 > reduction in server load since the MD5 digest is now stored in the
 > attribute file (since it's the path to the pool file; no hardlinks
 > remember) - it's essentially no more effort to compare MD5 digests
 > as it is comparing the other file attributes. Basically the client
 > does most of the work for a full since it needs to read every file
 > computing the full-file MD5 digests. But the server has no more
 > work to do than an incremental if files haven't changed.

Sounds great.

 > If you are more cautious you could increase the "block-digest-check"
 > probability to, eg, 1%, 10%, or 100%. The last case would make it
 > behave like 3.x - every file in a full does block digest checking
 > (and consequently full file digest checking too). However, the
 > client load will be higher since each file will be read twice
 > in this case.

I guess if you are really paranoid, there is a vanishingly small chance that the block checksums could also collide...

I think for the average user, the real question is whether, setting aside maliciously-created collisions, the real-world probability of a collision is large enough to worry about relative to the probabilities of the other failure points. The back-of-the-envelope calculations I did last year seem to suggest that the chance of a random collision is vanishingly small for any reasonable number of files. And again, if the file size could also be checked, I would imagine that the chance of collision would go down by another couple of orders of magnitude (assuming MD5 collisions are not highly correlated with file size).
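For what it's worth, here is a rough sketch (in Python) of the kind of back-of-the-envelope birthday-bound estimate I mean. The pool size of 10^9 files and the "~20 extra bits from the file size" figure are just illustrative assumptions on my part, not measured numbers:

    # Birthday-bound approximation for n uniformly random digests of the
    # given bit length: P(at least one collision) ~ n*(n-1) / 2^(bits+1)
    def collision_probability(n, bits=128):
        return n * (n - 1) / 2.0 ** (bits + 1)

    # Example: a pool of one billion files, full-file MD5 (128 bits).
    print(collision_probability(10**9))            # ~1.5e-21

    # If the file size really is roughly independent of the MD5 and adds,
    # say, ~20 bits of entropy (an assumption), the estimate drops further:
    print(collision_probability(10**9, bits=148))  # ~1.4e-27

Even the first number is many orders of magnitude below the probability of ordinary disk or memory errors, which is what I mean by "vanishingly small" for any reasonable number of files.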