From: Craig B. <cba...@us...> - 2005-08-19 09:23:30
Roy Keene writes:

> Can you describe what is hashed and using which algorithm is used
> to determine the pool hash name ?

Sorry about the delay in replying - I'm on vacation this week.

It's a little arcane, but here it is.  The MD5 digest is used on the
following data:

  - for files <= 256K we use the file size and the whole file, ie:

        MD5([4 byte file size, file contents])

  - for files > 256K and <= 1M we use the file size, the first 128K
    and the last 128K.

  - for files > 1M, we use the file size, the first 128K and the
    8th 128K (ie: the 128K up to 1MB).

See the Buffer2MD5() function in lib/BackupPC/Lib.pm.

One thing that is not clear is what perl does when the fileSize is
bigger than 4GB.  In particular, we start off with:

    $md5->add($fileSize);

I suspect that this will be the real file size modulo 2^32 (ie: the
lower 4 bytes of the file size).

The path name is then the hex MD5 digest with the first 3 hex
characters made into directories, eg:

    0a40ec51a8a079d95c1ee48436fb06bf
        -> $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf

Note also that to handle collisions (ie: different files with the same
digest - easy to create since the MD5 digest doesn't include the
entire file), an underscore and number are appended.  Eg: these four
files have the same digest, and should all be considered for possible
matches:

    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_0
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_1
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_2

The same suffix scheme is also used when a pool file reaches the
maximum number of hardlinks.  Matching that pool file should fail,
since a new hardlink to that file will fail.  So another (in this
case identical) pool file will be created.

See MD52Path() in lib/BackupPC/Lib.pm and lib/BackupPC/PoolWrite.pm.

Note that the backup itself doesn't add new files to the pool - it
only adds links to existing files.  BackupPC_link adds new pool
entries.

Craig
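
A minimal sketch of the rules described above, written in Python rather than the Perl of Buffer2MD5() and MD52Path(). The names digest_ranges, pool_digest and pool_path are hypothetical. The way the file size is fed to MD5 here (4 big-endian bytes holding the size modulo 2^32) is an assumption based on the "4 byte file size" description, not something checked against lib/BackupPC/Lib.pm, so digests from this sketch should not be expected to match a real pool.

```python
import hashlib

CHUNK = 128 * 1024  # 128K


def digest_ranges(file_size):
    """Return the (offset, length) byte ranges that are digested after the size."""
    if file_size <= 2 * CHUNK:
        # <= 256K: the whole file
        return [(0, file_size)]
    if file_size <= 8 * CHUNK:
        # <= 1M: first 128K and last 128K
        return [(0, CHUNK), (file_size - CHUNK, CHUNK)]
    # > 1M: first 128K and the 8th 128K (the 128K ending at 1MB)
    return [(0, CHUNK), (7 * CHUNK, CHUNK)]


def pool_digest(data):
    """Hex MD5 over the size prefix plus the selected ranges of data."""
    size = len(data)
    md5 = hashlib.md5()
    # ASSUMPTION: size encoded as the lower 32 bits, 4 bytes, big-endian.
    md5.update((size % 2**32).to_bytes(4, "big"))
    for offset, length in digest_ranges(size):
        md5.update(data[offset:offset + length])
    return md5.hexdigest()


def pool_path(digest, data_dir="$DataDir", index=None):
    """Build the cpool path; index adds the _N suffix used for same-digest files."""
    name = digest if index is None else f"{digest}_{index}"
    return f"{data_dir}/cpool/{digest[0]}/{digest[1]}/{digest[2]}/{name}"
```

With this sketch, pool_path("0a40ec51a8a079d95c1ee48436fb06bf") gives $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf, and two 2MB buffers that differ only in their last byte get the same digest, which is the kind of collision the _0, _1, ... suffixes are there to handle.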