Re: [BackupPC-devel] Hash (MD4?) Algorithm used for Pool

Roy Keene writes:

>  	Can you describe what is hashed and using which algorithm is used 
> to determine the pool hash name ?

Sorry about the delay in replying - I'm on vacation this week.

It's a little arcane, but here it is.  The MD5 digest is computed
over the following data:

   - for files <= 256K we use the file size and the whole file, ie:

        MD5([4 byte file size, file contents])

   - for files > 256K and <= 1M we use the file size, the first 128K
     and the last 128K.

   - for files > 1M, we use the file size, the first 128K and
     the 8th 128K (ie: the 128K that ends at the 1MB offset).

See the Buffer2MD5() function in lib/BackupPC/Lib.pm.
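
As a rough illustration of the three cases above, here is a minimal
sketch in Python (not the Perl code in Buffer2MD5()).  The function
name pool_digest and the size_prefix argument are hypothetical.  The
caller supplies the already-encoded file size, because the exact bytes
that end up in the digest for the size depend on how Perl passes
$fileSize to add().

```python
import hashlib

CHUNK = 128 * 1024            # 128K
WHOLE_FILE_MAX = 256 * 1024   # files up to this size are hashed in full
TAIL_LIMIT = 1024 * 1024      # the second chunk never extends past 1M


def pool_digest(data: bytes, size_prefix: bytes) -> str:
    """Illustrative partial-file digest following the rules described above.

    size_prefix is an assumption: the encoded file size, supplied by the
    caller, is fed to MD5 before any file data.
    """
    size = len(data)
    md5 = hashlib.md5()
    md5.update(size_prefix)
    if size <= WHOLE_FILE_MAX:
        # Small file: hash the whole contents.
        md5.update(data)
    else:
        # Larger file: the first 128K, plus the 128K that ends at
        # min(file size, 1M).  Up to 1M that is the last 128K of the
        # file; above 1M it is the 8th 128K chunk.
        end = min(size, TAIL_LIMIT)
        md5.update(data[:CHUNK])
        md5.update(data[end - CHUNK:end])
    return md5.hexdigest()
```

With this rule, two files of the same size that differ only outside
the sampled regions get the same digest.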

One thing that is not clear is what Perl does when $fileSize
is bigger than 4GB.  In particular, we start off with:

    $md5->add($fileSize);

I suspect that this will be the real file size modulo 2^32 (ie: the
lower 4 bytes of the file size).
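
If that suspicion is right, two sizes that differ by a multiple of
4GB would give the same size prefix.  A small Python sketch of the
arithmetic only; the helper name is hypothetical and the little-endian
byte order is an assumption made just for illustration:

```python
import struct


def low_32_bits(file_size: int) -> bytes:
    """Return the size modulo 2**32 packed into 4 bytes.

    The byte order (little-endian) is an assumption for this sketch.
    """
    return struct.pack("<I", file_size % 2**32)


# A 5GB size and a 1GB size share the same lower 4 bytes.
same_prefix = low_32_bits(5 * 2**30) == low_32_bits(1 * 2**30)
```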

The path name is then the hex MD5 digest with the first 3 hex
characters made into directories, eg:

    0a40ec51a8a079d95c1ee48436fb06bf
                -> $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf
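
The mapping from digest to path can be sketched like this in Python.
It is an illustration of the layout shown above, not the MD52Path()
code; the function and parameter names are hypothetical.

```python
import posixpath


def md5_to_pool_path(data_dir: str, digest: str, pool_dir: str = "cpool") -> str:
    """Build the pool path: the first 3 hex characters become directories."""
    return posixpath.join(
        data_dir, pool_dir, digest[0], digest[1], digest[2], digest
    )


path = md5_to_pool_path("/data", "0a40ec51a8a079d95c1ee48436fb06bf")
```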

Note also that to handle collisions (ie: different files with the
same digest - easy to create since the MD5 digest doesn't include
the entire file), an underscore and a number are appended to the
file name.  Eg: these four files have the same digest, and should
all be considered for possible matches:

    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_0
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_1
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_2
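
Walking such a chain of candidates might look like the following
Python sketch.  The function name is hypothetical, and it assumes the
numbered suffixes are contiguous (_0, _1, _2, ... with no gaps), which
is an assumption of this sketch rather than something stated here.

```python
import os


def chain_candidates(base_path: str):
    """Yield the base pool file, then base_0, base_1, ... while they exist.

    Assumes the suffix numbers have no gaps; stops at the first missing one.
    """
    if not os.path.exists(base_path):
        return
    yield base_path
    n = 0
    while os.path.exists(f"{base_path}_{n}"):
        yield f"{base_path}_{n}"
        n += 1
```

Each candidate would then need a full content comparison, since the
digest alone does not prove that the files are identical.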

This is also used for the case where the maximum number of hardlinks
to a pool file is hit.  Matching that pool file should fail, since
creating a new hardlink to that file will fail.  So another (in this
case identical) pool file will be created.
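
A hedged Python sketch of that check: a pool file is only usable as a
match if it can still accept another hardlink.  The function name and
the max_links parameter are hypothetical; the real limit depends on
the file system and configuration.

```python
import os


def can_link_to(pool_file: str, max_links: int) -> bool:
    """Return True if the file's current link count is below max_links."""
    return os.stat(pool_file).st_nlink < max_links
```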

See MD52Path() in lib/BackupPC/Lib.pm and lib/BackupPC/PoolWrite.pm.

Note that the backup itself doesn't add new files to the pool - it
only adds links to existing files.  BackupPC_link adds new pool
entries.

Craig