
Method used to calculate checksums for data and names

  • OVERKILL27 - 2023-12-06

    I use 7z h to generate checksums for a large number of files and folders. I pipe its output to text files, which I keep for my records.

    At the end of the output, it lists the aggregate sums for just the data, as well as for the data and names. I am looking for a way to recreate these values using precalculated checksums and the names of the files.

    I want to be able to make my own list of files that includes only the file names, file sizes, and their checksums. I then want to be able to calculate my own aggregate sums for the data, as well as for the data and names, and I want them to match the aggregate sums that 7-Zip would generate if 7z h or the 7-Zip File Manager’s "CRC SHA" command were used on those files and folders.

    I have been able to recreate the aggregate sums for just the data by using information provided in other topics; namely, that it is simply an arithmetic sum of the files' individual checksums, capped to the same number of bits that the original checksums use, with the aggregate sum's overflow appended to the end (separated by a hyphen).

    So, I have written a PowerShell script for SHA-256 checksums that:
    1. Takes the precomputed file checksum hex values from a CSV file
    2. Converts them into big-endian byte arrays
    3. Converts the byte arrays into bigints
    4. Sums them all together to get the aggregate sum
    5. Converts that sum into a little-endian hex string, with the 32-bit overflow at the end, separated by a hyphen, just like 7-Zip does (see the sketch after this list).
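
    In Python terms, the logic amounts to something like this (a minimal, illustrative sketch rather than my actual script; the names are made up, and digests is assumed to hold raw digest bytes, e.g. bytes.fromhex of each CSV value):

    def aggregate_sum(digests, digest_size=32):
        # Reading each digest with its first byte as least significant is what
        # makes the final hex string come out "little-endian", as in step 5.
        total = sum(int.from_bytes(d, "little") for d in digests)
        bits = digest_size * 8
        low = total & ((1 << bits) - 1)   # the digest-sized part of the sum
        overflow = total >> bits          # the 32-bit overflow
        # Format like 7-Zip: digest bytes in order, hyphen, 8-digit overflow.
        return low.to_bytes(digest_size, "little").hex() + "-" + format(overflow, "08X")

    As a sanity check, a list containing a single file should produce that file's own digest followed by -00000000.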

    This is working well for me. I've tested it on a few different sets of files, with one of them containing tens of thousands of folders and over a million files, and so far it has always come out with the same sum that 7-Zip outputs.

    I can now use this method to freely add, modify, and remove any files in the list and quickly compute the aggregate sum for that new set of files.

    However, I’m now having some issues figuring out how the sums for the data and names work. I found this topic that mentions that 7-Zip simply adds a sum of the file name checksums to the aggregate sum for the data, but I have been unable to reproduce this.

    Also, it doesn’t seem to make sense to me. Checksums generated by a secure hash function like SHA-256 should have digests uniformly distributed between the minimum and maximum possible values, so the mean SHA-256 digest should be roughly 0x7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF. Therefore the 32-bit overflow for the data-only aggregate should be about half the file count, and the 32-bit overflow for the data-and-names aggregate should be about double that (i.e. roughly equal to the file count). However, in my experience, while the data-and-names overflows are usually higher than the data-only overflows, they are nowhere near double.

    As an example, I ran 7z h -scrcSHA256 on a folder containing 291,274 files and got the following two aggregates:

    SHA256 for data:              c6adf429fb9f3ef2ce41d5e0bc4c9e7f24512628378997bb90e1fc4766b38e0c-00022AFF
    SHA256 for data and names:    ec8de621a856e564ef0cba1150eb766231dcf8d42b3f653298fd2f904805d6eb-000252B4
    

    291,274 ÷ 2 = 145,637 = 0x000238E5, which is within about 2.5% of the data aggregate's overflow (0x00022AFF = 142,079), so that's close enough. The overflow for the data-and-names aggregate (0x000252B4 = 152,244) is nowhere near double that, though; it is only about 7% higher. So, there must be something more going on than a simple arithmetic sum of the file names' individual checksums being added to the data's aggregate.

    I’ve tried to make sense of the source code, but after a couple of hours of going through it, I couldn’t really find an answer to this question (I think that HashCalc.cpp contains the source for this?).

    Also, are the "names" used for this the fully qualified paths of the files, or are they relative paths? Since the output of 7z h shows the names as relative paths from the working directory, I'm guessing that's what is being fed into the hashing algorithms? I am also aware that 7-Zip recently changed to using only forward slashes (/) as path separators when making these sums, to keep the values consistent between different operating systems.


    Last edit: OVERKILL27 2023-12-06
  • Igor Pavlov - 2023-12-06
    CPP\7zip\UI\Common\HashCalc.cpp
    void CHashBundle::Final(bool isDir, bool isAltStream, const UString &path)
    
      Byte pre[16];
      memset(pre, 0, sizeof(pre));
      if (isDir)
        pre[0] = 1;
    
        h.Hasher->Update(pre, sizeof(pre));
        h.Hasher->Update(h.Digests[0], h.DigestSize); // data digest
    
      h.Hasher->Update(utf_16_path)
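
    In other words: the digest that goes into the names sum is computed over the 16-byte prefix block, then the data digest, then the path bytes. A rough Python sketch of that reading (illustrative only; SHA-256 assumed to match the examples above, and what exactly is fed as the data digest for a directory is not shown in the snippet):

    import hashlib

    def file_name_digest(data_digest, path, is_dir=False):
        pre = bytearray(16)              # 16 zero bytes ...
        if is_dir:
            pre[0] = 1                   # ... except pre[0] = 1 for a directory
        h = hashlib.sha256()
        h.update(pre)
        h.update(data_digest)            # the digest of the file's data
        # path as UTF-16 LE; '\' normalized to '/' for cross-OS consistency
        h.update(path.replace("\\", "/").encode("utf-16-le"))
        return h.digest()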
    

    Last edit: Igor Pavlov 2023-12-06
    • OVERKILL27 - 2023-12-07

      Thanks for the hint Igor!

      Yes, I expected that CPP\7zip\UI\Common\HashCalc.cpp would be a good place to look for the answer... though I'm still having some trouble really understanding it. Admittedly, C++ isn't really my area of expertise, so comprehending the details of what is going on is proving to be a challenge for me...

      The last line of your comment, h.Hasher->Update(utf_16_path): is it referring to line 232, h.Hasher->Update(temp, 2);?

      I'm not sure if I'm reading it right, but to me it looks like it is generating a separate hash for each of the path's UTF-16 LE code units by splitting each one into an array of 2 bytes? Surely I'm misinterpreting the source code here...

      Would it be possible for you to give me a bit more of a hint?

      • Igor Pavlov - 2023-12-07

        Yes, I use UTF-16 little-endian encoding in 7-Zip, because Windows also uses that encoding.
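
        For example, each UTF-16 code unit is stored low byte first; this is exactly what the Byte temp[2] construction in HashCalc.cpp builds by hand. A quick Python illustration:

        assert "a/b".encode("utf-16-le") == b"a\x00/\x00b\x00"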

        • OVERKILL27 - 2023-12-08

          Using UTF-16 LE to encode the path strings makes perfect sense; that is not what confused me.

          What did confuse me is that it seems like a hash is being generated for each code unit:

          // 7z2301-src\CPP\7zip\UI\Common\HashCalc.cpp
          219 for (unsigned k = 0; k < path.Len(); k++)
          220 {
          221   wchar_t c = path[k];
          222   
          231   Byte temp[2] = { (Byte)(c & 0xFF), (Byte)((c >> 8) & 0xFF) };
          232   h.Hasher->Update(temp, 2);
          233 }
          

          Although, having now had another go at trying to understand what is going on, it seems like h.Hasher->Update(temp, 2); on line 232 is actually just feeding the bytes into the hasher's running state, and a single digest is only produced after all of the path's code units have been loaded in?
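
          That would match how streaming hash APIs behave in general: Update() only feeds bytes into the hasher's running state, and the digest is produced once at the end. For example, the same idea in Python's hashlib (used here purely for illustration):

          import hashlib

          h = hashlib.sha256()
          h.update(b"ab")   # fed in pieces, like the per-code-unit Update(temp, 2) calls
          h.update(b"cd")
          # identical to hashing the concatenation in one shot:
          assert h.digest() == hashlib.sha256(b"abcd").digest()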

          So, I now think that what happens to generate the sums for the data and names is:
          1. A digest is generated for a file's data
          2. The file data's digest is loaded into a buffer as a Byte[72]
          3. Each of the path's UTF-16 LE code units is converted into a little-endian Byte[2] (line 231) and then added to the buffer (line 232)
          4. A single hash is generated from all of the buffered bytes (this hash represents a single file's data as well as its path)
          5. A simple arithmetic sum aggregates each file's data-and-names hash, in the same way as is done for the data-only hashes

          Is this understanding correct?
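
          If it is, the whole data-and-names aggregate could be sketched end to end like this (an unverified, illustrative Python sketch; the names are made up, and the 16-byte prefix block from Igor's snippet is included even though step 2 above leaves it implicit):

          import hashlib

          def data_and_names_aggregate(files, digest_size=32):
              # files: iterable of (path, data_bytes) pairs
              total = 0
              for path, data in files:
                  data_digest = hashlib.sha256(data).digest()             # step 1
                  h = hashlib.sha256()
                  h.update(bytes(16))                                     # prefix block (pre[0] = 1 for dirs)
                  h.update(data_digest)                                   # step 2
                  h.update(path.replace("\\", "/").encode("utf-16-le"))   # steps 3 and 4
                  total += int.from_bytes(h.digest(), "little")           # step 5
              bits = digest_size * 8
              low = total & ((1 << bits) - 1)
              overflow = total >> bits
              return low.to_bytes(digest_size, "little").hex() + "-" + format(overflow, "08X")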

