From: Craig B. <cba...@us...> - 2005-08-19 09:23:30
|
Roy Keene writes:

> Can you describe what is hashed and which algorithm is used
> to determine the pool hash name ?

Sorry about the delay in replying - I'm on vacation this week.

It's a little arcane, but here it is. The MD5 digest is computed
over the following data:

  - for files <= 256K we use the file size and the whole file, ie:

        MD5([4 byte file size, file contents])

  - for files <= 1M we use the file size, the first 128K and
    the last 128K.

  - for files > 1M, we use the file size, the first 128K and
    the 8th 128K (ie: the 128K up to 1MB).

See the Buffer2MD5() function in lib/BackupPC/Lib.pm.

One thing that is not clear is what perl does when the fileSize
is bigger than 4GB. In particular, we start off with:

    $md5->add($fileSize);

I suspect that this will be the real file size modulo 2^32 (ie: the
lower 4 bytes of the file size).

The path name is then the hex MD5 digest with the first 3 hex
characters made into directories, eg:

    0a40ec51a8a079d95c1ee48436fb06bf
    -> $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf

Note also that to avoid collisions (ie: different files with the
same digest - easy to create since the MD5 digest doesn't include
the entire file), an underscore and number are appended. Eg: these
four files have the same digest, and should all be considered as
possible matches:

    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_0
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_1
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_2

This is also used for the case where the maximum number of hardlinks
is hit. Matching that pool file should fail, since a new hardlink to
that file would fail. So another (in this case identical) pool file
is created.

See MD52Path() in lib/BackupPC/Lib.pm and lib/BackupPC/PoolWrite.pm.

Note that the backup itself doesn't add new files to the pool - it
only adds links to existing files. BackupPC_link adds new pool
entries.

Craig
|
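For readers following along, the scheme Craig describes can be sketched in Python. This is an illustrative re-implementation, not the real Buffer2MD5() code, and it assumes the file size is fed to MD5 as a decimal string (the encoding is still an open question at this point in the thread; it is confirmed later for files under 4GB):

```python
import hashlib
import os

CHUNK = 128 * 1024  # 128K

def pool_digest(path):
    """Sketch of BackupPC's pool-name digest (hypothetical helper,
    mirroring the Buffer2MD5() description in the thread)."""
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    # The size goes in first; Perl's Digest::MD5 stringifies its
    # argument, so we use the decimal string here (an assumption,
    # confirmed later in the thread for sizes < 4GB).
    md5.update(str(size).encode("ascii"))
    with open(path, "rb") as f:
        if size <= 2 * CHUNK:                 # <= 256K: whole file
            md5.update(f.read())
        elif size <= 8 * CHUNK:               # <= 1M: first + last 128K
            md5.update(f.read(CHUNK))
            f.seek(size - CHUNK)
            md5.update(f.read(CHUNK))
        else:                                 # > 1M: first + 8th 128K
            md5.update(f.read(CHUNK))
            f.seek(7 * CHUNK)                 # bytes 896K..1024K
            md5.update(f.read(CHUNK))
    return md5.hexdigest()

def md5_to_path(data_dir, digest):
    """First 3 hex chars become directory levels, as in MD52Path()."""
    return "%s/cpool/%s/%s/%s/%s" % (
        data_dir, digest[0], digest[1], digest[2], digest)
```

Note this yields only the base pool path; the `_0`, `_1`, ... collision suffixes described above still have to be probed separately.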
From: Craig B. <cba...@us...> - 2005-08-20 02:11:10
|
Roy Keene writes:

> The answers to my questions are:
>
> The hash is of the uncompressed data, using the uncompressed file
> size. The file size is represented as a string (not NUL-terminated,
> no line termination).
>
> To generate the hash you'd do (less than 256KB so we hash the entire
> file, for simplicity):
>
>     # ( echo -n `/data/backups/software/bin/BackupPC_zcat 0008712a1e41433c7875ef8fad542fc8 | wc -c` ; /data/backups/software/bin/BackupPC_zcat 0008712a1e41433c7875ef8fad542fc8 ) | md5sum
>     0008712a1e41433c7875ef8fad542fc8 -
>
> I assumed the "4 byte file size" implied it was written in binary, but
> that is not the case.

Thanks for clarifying that. I hadn't looked at perl's MD5 module to
see what it does. A string is the right thing, since the binary form
wouldn't be portable across a change in endianness. We still need to
check how it behaves for very large files (eg: >4GB) so we can make
sure your calculation matches exactly.

Craig
|
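Craig's remaining question about files over 4GB comes down to how the size is serialized. A quick illustration (Python standing in for Perl's Digest::MD5, which stringifies the argument to add()): a decimal-string size survives intact above 2^32, while a packed 4-byte size would be truncated modulo 2^32 and would differ between byte orders.

```python
import hashlib
import struct

size = 5 * 2**30  # a 5GB file, i.e. larger than 2^32 bytes

# As a decimal string (what $md5->add($fileSize) effectively feeds in),
# the full value survives -- no modulo-2^32 truncation:
as_string = str(size).encode("ascii")       # b"5368709120"

# A packed 4-byte binary size, by contrast, can only hold size % 2^32,
# and its bytes depend on the host's endianness:
as_le = struct.pack("<I", size % 2**32)     # little-endian
as_be = struct.pack(">I", size % 2**32)     # big-endian

# The two encodings produce different digests, and the binary form
# is not portable across a change in endianness:
assert hashlib.md5(as_string).hexdigest() != hashlib.md5(as_le).hexdigest()
assert as_le != as_be
```

Whether Perl's stringified size exactly matches this for >4GB files depends on the perl build's integer width, which is the open question in the thread.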
From: Roy K. <rk...@ps...> - 2005-08-19 13:27:40
|
Craig,

Does the "$fileSize" variable describe the size of the file in bytes
as a string, or as an integer ? If it's an integer, is it in host or
network byte order ?

Roy Keene
Planning Systems Inc
++++++++[->++++<]>[->+>++>+++>++++<<<<]>>++++++++++++++++.>++++++
++++++.-----------.+++++++++++++..-----.+++++.-------.<<.>+++.>>-
------.------.+.<--.>-------.++++++.<<<.>----------.>>-----.<--.<
<<++++++++++.

On Fri, 19 Aug 2005, Craig Barratt wrote:

> Roy Keene writes:
>
>> Can you describe what is hashed and which algorithm is used
>> to determine the pool hash name ?
>
> [...]
>
> One thing that is not clear is what perl does when the fileSize
> is bigger than 4GB. In particular, we start off with:
>
> $md5->add($fileSize);
>
> I suspect that this will be the real file size modulo 2^32 (ie: the
> lower 4 bytes of the file size).
>
> [...]

|
From: Roy K. <rk...@ps...> - 2005-08-19 13:41:12
|
Craig,

Also, if the BackupPC_link process does the linking into the pool,
what am I helping by distributing the checksum calculations to all
the BackupPC clients ? Is it stored somewhere other than the pool
filename where I could inject it ? Even then, the whole file would
need to be compared because of collisions...

Roy Keene
Planning Systems Inc

On Fri, 19 Aug 2005, Craig Barratt wrote:

> Roy Keene writes:
>
>> Can you describe what is hashed and which algorithm is used
>> to determine the pool hash name ?
>
> [...]
>
> Note also that to avoid collisions (ie: different files with the
> same digest - easy to create since the MD5 digest doesn't include
> the entire file), an underscore and number are appended.
>
> [...]
>
> Note that the backup itself doesn't add new files to the pool - it
> only adds links to existing files. BackupPC_link adds new pool
> entries.

|
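The collision handling Roy is asking about means the digest alone can never prove a match: every pool file sharing the digest (the bare name plus the `_0`, `_1`, ... suffixes) must be compared byte-for-byte. Enumerating those candidates can be sketched as follows (hypothetical helper; this assumes the suffixes are allocated sequentially with no gaps, per Craig's description):

```python
import os

def pool_candidates(data_dir, digest):
    """Yield every existing pool file sharing this digest:
    the bare name first, then _0, _1, ... until the first gap."""
    base = os.path.join(data_dir, "cpool",
                        digest[0], digest[1], digest[2], digest)
    if os.path.exists(base):
        yield base
    n = 0
    while True:
        cand = "%s_%d" % (base, n)
        if not os.path.exists(cand):
            break
        yield cand
        n += 1
```

A client would still have to read and compare each candidate's (uncompressed) contents before concluding the file is already pooled.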
From: Roy K. <rk...@ps...> - 2005-08-19 19:56:51
|
Craig,

Is this hash of the uncompressed or compressed stream ?

I have not been able to duplicate this hash value.

I have tried every combination of the following (all done on a
75KB/80KB (compressed/uncompressed) file):

  * The file size as a 32-bit int in host byte order, in network
    byte order, as a string, and excluded entirely.
  * The data compressed and uncompressed.

That is:

    COMPRFILESIZE(32bit int host)<compressed data>
    COMPRFILESIZE(32bit int net)<compressed data>
    COMPRFILESIZE(string)<compressed data>
    <compressed data>
    UNCOMPRFILESIZE(32bit int host)<data>
    UNCOMPRFILESIZE(32bit int net)<data>
    UNCOMPRFILESIZE(string)<data>
    <data>

Any ideas ?

Roy Keene
Planning Systems Inc

On Fri, 19 Aug 2005, Craig Barratt wrote:

> Roy Keene writes:
>
>> Can you describe what is hashed and which algorithm is used
>> to determine the pool hash name ?
>
> Sorry about the delay in replying - I'm on vacation this week.
>
> It's a little arcane, but here it is. The MD5 digest is used
> on the following data:
>
> [...]

|
From: Roy K. <rk...@ps...> - 2005-08-19 21:08:02
|
The answers to my questions are:

The hash is of the uncompressed data, using the uncompressed file
size. The file size is represented as a string (not NUL-terminated,
no line termination).

To generate the hash you'd do (less than 256KB so we hash the entire
file, for simplicity):

    # ( echo -n `/data/backups/software/bin/BackupPC_zcat 0008712a1e41433c7875ef8fad542fc8 | wc -c` ; /data/backups/software/bin/BackupPC_zcat 0008712a1e41433c7875ef8fad542fc8 ) | md5sum
    0008712a1e41433c7875ef8fad542fc8 -

I assumed the "4 byte file size" implied it was written in binary, but
that is not the case.

Roy Keene
Planning Systems Inc

On Fri, 19 Aug 2005, Roy Keene wrote:

> Craig,
>
> Is this hash of the uncompressed or compressed stream ?
>
> I have not been able to duplicate this hash value.
>
> [...]

|
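Roy's shell pipeline can be reproduced in Python for a small (<= 256K) file; any uncompressed byte string stands in for the `BackupPC_zcat` output here:

```python
import hashlib

def small_file_pool_name(data):
    """Pool digest for a file <= 256K: MD5 of the decimal size string
    followed by the full uncompressed contents -- the same bytes that
    ( echo -n `... | wc -c` ; ... ) | md5sum hashes in Roy's example."""
    md5 = hashlib.md5()
    # Size as a plain decimal string: no NUL terminator, no newline.
    md5.update(str(len(data)).encode("ascii"))
    md5.update(data)
    return md5.hexdigest()
```

For the shell pipeline and this function to agree, `echo -n` matters: a plain `echo` would append a newline after the size and change the digest.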