From: Craig B. <cba...@us...> - 2005-08-19 09:23:30
|
Roy Keene writes:

> Can you describe what is hashed and which algorithm is used
> to determine the pool hash name ?

Sorry about the delay in replying - I'm on vacation this week.

It's a little arcane, but here it is. The MD5 digest is computed
over the following data:

  - for files <= 256K we use the file size and the whole file, ie:

        MD5([4 byte file size, file contents])

  - for files <= 1M we use the file size, the first 128K and
    the last 128K.

  - for files > 1M, we use the file size, the first 128K and
    the 8th 128K (ie: the 128K up to 1MB).

See the Buffer2MD5() function in lib/BackupPC/Lib.pm.

One thing that is not clear is what perl does when the fileSize
is bigger than 4GB. In particular, we start off with:

    $md5->add($fileSize);

I suspect that this will be the real file size modulo 2^32 (ie: the
lower 4 bytes of the file size).

The path name is then the hex MD5 digest with the first 3 hex
characters made into directories, eg:

    0a40ec51a8a079d95c1ee48436fb06bf
    -> $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf

Note also that to avoid collisions (ie: different files with the
same digest - easy to create since the MD5 digest doesn't include
the entire file), an underscore and number are appended. Eg: these
four files have the same digest, and should all be considered as
possible matches:

    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_0
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_1
    $DataDir/cpool/0/a/4/0a40ec51a8a079d95c1ee48436fb06bf_2

This is also used for the case where the maximum number of hardlinks
is hit. Matching that pool file should fail, since a new hardlink to
that file would fail. So another (in this case identical) pool file
is created.

See MD52Path() in lib/BackupPC/Lib.pm and lib/BackupPC/PoolWrite.pm.

Note that the backup itself doesn't add new files to the pool - it
only adds links to existing files. BackupPC_link adds new pool
entries.

Craig
|
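For readers following along, the scheme Craig describes can be sketched in Python. This is an illustrative re-implementation, not the real Buffer2MD5() code, and it assumes the file size is fed to MD5 as a decimal string (the encoding is still an open question at this point in the thread; it is confirmed later for files under 4GB):

```python
import hashlib
import os

CHUNK = 128 * 1024  # 128K

def pool_digest(path):
    """Sketch of BackupPC's pool-name digest (hypothetical helper,
    mirroring the Buffer2MD5() description in the thread)."""
    size = os.path.getsize(path)
    md5 = hashlib.md5()
    # The size goes in first; Perl's Digest::MD5 stringifies its
    # argument, so we use the decimal string here (an assumption,
    # confirmed later in the thread for sizes < 4GB).
    md5.update(str(size).encode("ascii"))
    with open(path, "rb") as f:
        if size <= 2 * CHUNK:                 # <= 256K: whole file
            md5.update(f.read())
        elif size <= 8 * CHUNK:               # <= 1M: first + last 128K
            md5.update(f.read(CHUNK))
            f.seek(size - CHUNK)
            md5.update(f.read(CHUNK))
        else:                                 # > 1M: first + 8th 128K
            md5.update(f.read(CHUNK))
            f.seek(7 * CHUNK)                 # bytes 896K..1024K
            md5.update(f.read(CHUNK))
    return md5.hexdigest()

def md5_to_path(data_dir, digest):
    """First 3 hex chars become directory levels, as in MD52Path()."""
    return "%s/cpool/%s/%s/%s/%s" % (
        data_dir, digest[0], digest[1], digest[2], digest)
```

Note this yields only the base pool path; the `_0`, `_1`, ... collision suffixes described above still have to be probed separately.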
From: Craig B. <cba...@us...> - 2005-08-20 02:11:10
|
Roy Keene writes:

> The answers to my questions are:
>
> The hash is of the uncompressed data, using the uncompressed file
> size. The file size is represented as a string (not NUL-terminated,
> no line termination).
>
> To generate the hash you'd do (less than 256KB so we hash the entire
> file, for simplicity):
>
>     # ( echo -n `/data/backups/software/bin/BackupPC_zcat 0008712a1e41433c7875ef8fad542fc8 | wc -c` ; /data/backups/software/bin/BackupPC_zcat 0008712a1e41433c7875ef8fad542fc8 ) | md5sum
>     0008712a1e41433c7875ef8fad542fc8 -
>
> I assumed the "4 byte file size" implied it was written in binary, but
> that is not the case.

Thanks for clarifying that. I hadn't looked at perl's MD5 module to
see what it does. A string is the right thing, since the binary form
wouldn't be portable across a change in endianness. We still need to
check how it behaves for very large files (eg: >4GB) so we can make
sure your calculation matches exactly.

Craig
|
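Craig's remaining question about files over 4GB comes down to how the size is serialized. A quick illustration (Python standing in for Perl's Digest::MD5, which stringifies the argument to add()): a decimal-string size survives intact above 2^32, while a packed 4-byte size would be truncated modulo 2^32 and would differ between byte orders.

```python
import hashlib
import struct

size = 5 * 2**30  # a 5GB file, i.e. larger than 2^32 bytes

# As a decimal string (what $md5->add($fileSize) effectively feeds in),
# the full value survives -- no modulo-2^32 truncation:
as_string = str(size).encode("ascii")       # b"5368709120"

# A packed 4-byte binary size, by contrast, can only hold size % 2^32,
# and its bytes depend on the host's endianness:
as_le = struct.pack("<I", size % 2**32)     # little-endian
as_be = struct.pack(">I", size % 2**32)     # big-endian

# The two encodings produce different digests, and the binary form
# is not portable across a change in endianness:
assert hashlib.md5(as_string).hexdigest() != hashlib.md5(as_le).hexdigest()
assert as_le != as_be
```

Whether Perl's stringified size exactly matches this for >4GB files depends on the perl build's integer width, which is the open question in the thread.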
From: Roy K. <rk...@ps...> - 2005-08-19 13:27:40
|
Craig,

Does the "$fileSize" variable describe the size of the file in bytes
as a string, or as an integer ? If it's an integer, is it in host or
network byte order ?

Roy Keene
Planning Systems Inc
++++++++[->++++<]>[->+>++>+++>++++<<<<]>>++++++++++++++++.>++++++
++++++.-----------.+++++++++++++..-----.+++++.-------.<<.>+++.>>-
------.------.+.<--.>-------.++++++.<<<.>----------.>>-----.<--.<
<<++++++++++.

On Fri, 19 Aug 2005, Craig Barratt wrote:

> Roy Keene writes:
>
>> Can you describe what is hashed and which algorithm is used
>> to determine the pool hash name ?
>
> [...]
>
> One thing that is not clear is what perl does when the fileSize
> is bigger than 4GB. In particular, we start off with:
>
> $md5->add($fileSize);
>
> I suspect that this will be the real file size modulo 2^32 (ie: the
> lower 4 bytes of the file size).
>
> [...]

|
From: Roy K. <rk...@ps...> - 2005-08-19 13:41:12
|
Craig,

Also, if the BackupPC_link process does the linking into the pool,
what am I helping by distributing the checksum calculations to all
the BackupPC clients ? Is it stored somewhere other than the pool
filename where I could inject it ? Even then, the whole file would
need to be compared because of collisions...

Roy Keene
Planning Systems Inc

On Fri, 19 Aug 2005, Craig Barratt wrote:

> Roy Keene writes:
>
>> Can you describe what is hashed and which algorithm is used
>> to determine the pool hash name ?
>
> [...]
>
> Note also that to avoid collisions (ie: different files with the
> same digest - easy to create since the MD5 digest doesn't include
> the entire file), an underscore and number are appended.
>
> [...]
>
> Note that the backup itself doesn't add new files to the pool - it
> only adds links to existing files. BackupPC_link adds new pool
> entries.

|
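The collision handling Roy is asking about means the digest alone can never prove a match: every pool file sharing the digest (the bare name plus the `_0`, `_1`, ... suffixes) must be compared byte-for-byte. Enumerating those candidates can be sketched as follows (hypothetical helper; this assumes the suffixes are allocated sequentially with no gaps, per Craig's description):

```python
import os

def pool_candidates(data_dir, digest):
    """Yield every existing pool file sharing this digest:
    the bare name first, then _0, _1, ... until the first gap."""
    base = os.path.join(data_dir, "cpool",
                        digest[0], digest[1], digest[2], digest)
    if os.path.exists(base):
        yield base
    n = 0
    while True:
        cand = "%s_%d" % (base, n)
        if not os.path.exists(cand):
            break
        yield cand
        n += 1
```

A client would still have to read and compare each candidate's (uncompressed) contents before concluding the file is already pooled.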
From: Roy K. <rk...@ps...> - 2005-08-19 19:56:51
|
Craig,

Is this hash of the uncompressed or compressed stream ?

I have not been able to duplicate this hash value.

I have tried every combination of the following (all done on a
75KB/80KB (compressed/uncompressed) file):

  * The file size as a 32-bit int in host byte order, in network
    byte order, as a string, and excluded entirely.
  * The data compressed and uncompressed.

That is:

    COMPRFILESIZE(32bit int host)<compressed data>
    COMPRFILESIZE(32bit int net)<compressed data>
    COMPRFILESIZE(string)<compressed data>
    <compressed data>
    UNCOMPRFILESIZE(32bit int host)<data>
    UNCOMPRFILESIZE(32bit int net)<data>
    UNCOMPRFILESIZE(string)<data>
    <data>

Any ideas ?

Roy Keene
Planning Systems Inc

On Fri, 19 Aug 2005, Craig Barratt wrote:

> Roy Keene writes:
>
>> Can you describe what is hashed and which algorithm is used
>> to determine the pool hash name ?
>
> Sorry about the delay in replying - I'm on vacation this week.
>
> It's a little arcane, but here it is. The MD5 digest is used
> on the following data:
>
> [...]

|
From: Roy K. <rk...@ps...> - 2005-08-19 21:08:02
|
The answers to my questions are:

The hash is of the uncompressed data, using the uncompressed file
size. The file size is represented as a string (not NUL-terminated,
no line termination).

To generate the hash you'd do (less than 256KB so we hash the entire
file, for simplicity):

    # ( echo -n `/data/backups/software/bin/BackupPC_zcat 0008712a1e41433c7875ef8fad542fc8 | wc -c` ; /data/backups/software/bin/BackupPC_zcat 0008712a1e41433c7875ef8fad542fc8 ) | md5sum
    0008712a1e41433c7875ef8fad542fc8 -

I assumed the "4 byte file size" implied it was written in binary, but
that is not the case.

Roy Keene
Planning Systems Inc

On Fri, 19 Aug 2005, Roy Keene wrote:

> Craig,
>
> Is this hash of the uncompressed or compressed stream ?
>
> I have not been able to duplicate this hash value.
>
> [...]

|
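Roy's shell pipeline can be reproduced in Python for a small (<= 256K) file; any uncompressed byte string stands in for the `BackupPC_zcat` output here:

```python
import hashlib

def small_file_pool_name(data):
    """Pool digest for a file <= 256K: MD5 of the decimal size string
    followed by the full uncompressed contents -- the same bytes that
    ( echo -n `... | wc -c` ; ... ) | md5sum hashes in Roy's example."""
    md5 = hashlib.md5()
    # Size as a plain decimal string: no NUL terminator, no newline.
    md5.update(str(len(data)).encode("ascii"))
    md5.update(data)
    return md5.hexdigest()
```

For the shell pipeline and this function to agree, `echo -n` matters: a plain `echo` would append a newline after the size and change the digest.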