
sha-1 collision files

2017-02-26
  • Igor Pavlov

    Igor Pavlov - 2017-02-26

    What do you think about the SHA-1 collision problem?
    How can it be solved for WIM archives?

     
  • synchronicity

    synchronicity - 2017-02-26

    Hi Igor,

    [In the future please use the new forums at https://wimlib.net/forums]

    Yes, the current behavior in the event of a SHA-1 collision is not very good;
    you will silently get the wrong file contents. AFAICS wimlib, 7-Zip, and DISM
    all behave that way, which is not surprising because SHA-1 is built into the WIM
    file format as the way to identify and deduplicate file contents. Currently I'm
    not sure what should be done yet, but here are some ideas:

    Idea 1: create a new version of the WIM file format that uses a different
    checksum, such as SHA-256 or BLAKE2. This is probably the best solution
    long-term, but for compatibility reasons wimlib won't be able to create such
    archives by default anytime soon.

    Idea 2: extend the WIM file format to store a SHA-256 or BLAKE2 checksum
    alongside each SHA-1, in a way such that old software would see the SHA-1 only.
    wimlib would compute, store, and check both checksums. This may be a way to
    solve the problem in the shorter term, but it seems like a complicated hack.

    Idea 3: without changing the on-disk format, just try to detect SHA-1
    collisions and warn the user about them so they can address the problem
    themselves. One way this could be done is by computing the SHA-256 or BLAKE2
    for each added file temporarily, just for keeping in-memory, and noticing when
    the SHA-1 is the same but the stronger checksum is different. This should be
    pretty straightforward to implement, though of course it's not a full solution,
    and it could detect collisions only among newly added files, not with files
    already in the archive (unless the latter had the stronger checksum stored
    somewhere, of course).
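
    A minimal Python sketch of Idea 3, assuming files small enough to hash in
    memory. The class and the "new" / "duplicate" / "collision" status names are
    hypothetical and not part of wimlib; the weak hash is injectable only so a
    collision can be simulated without real colliding files.

```python
import hashlib
from typing import Callable, Dict


def sha1_digest(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def sha256_digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class CollisionDetector:
    """Hypothetical helper: remembers SHA-1 -> SHA-256 for files added this session."""

    def __init__(self,
                 weak: Callable[[bytes], bytes] = sha1_digest,
                 strong: Callable[[bytes], bytes] = sha256_digest) -> None:
        self._weak = weak
        self._strong = strong
        # In-memory only; nothing here is written to the archive.
        self._strong_by_weak: Dict[bytes, bytes] = {}

    def add(self, data: bytes) -> str:
        weak = self._weak(data)
        strong = self._strong(data)
        if weak not in self._strong_by_weak:
            self._strong_by_weak[weak] = strong
            return "new"
        if self._strong_by_weak[weak] == strong:
            return "duplicate"
        # Same weak hash, different strong hash: warn the user.
        return "collision"
```

    Since the table lives only in memory, it covers files added in the current
    session; blobs that were already in the archive are known only by SHA-1.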

    For now I don't have time to decide on anything specific, so I am just going
    to wait and see.

     
  • Igor Pavlov

    Igor Pavlov - 2017-02-27

    It's simpler for me to use this forum. I don't want to create an additional account on another forum.

    As I understand it, there are two main cases where WIM is used:
    1) Windows installers - a SHA-1 collision is not a problem in that case.
    2) Backup and archive tasks - a SHA-1 collision can be a problem in some cases. Some undesirable effects may even be possible. I suppose you test DISM for different cases, so you could check the current DISM with collision files.

    BTW, some old WIM version uses the index of the file instead of SHA-1. So maybe it's possible to use something like that instead of SHA-1.
    Do we really need the reference via hash?
    Maybe a simple file index is OK?
    We can still store the SHA-1 for the data, but files would link to the data via index.

     

    Last edit: Igor Pavlov 2017-02-27
  • synchronicity

    synchronicity - 2017-02-28

    Yes, I did test DISM; it behaves the same as wimlib and 7-Zip, which is to
    incorrectly link the colliding files to the same contents.

    It would be possible to identify file contents by index, but to be compatible it
    would have to be the second part of the identifier after the SHA-1. In other
    words, each SHA-1 would identify a set of file contents (of size > 1 only if
    there is a SHA-1 collision), and the index would identify a specific file
    contents in that set.
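
    A rough Python sketch of the (SHA-1, index) identifier, under the assumption
    that contents are kept in memory and compared byte for byte. BlobStore and
    its methods are hypothetical names, not wimlib code; the weak hash is a
    parameter only so a collision can be forced for demonstration.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# Identifier: (SHA-1 digest, index within the set of contents sharing that digest)
BlobId = Tuple[bytes, int]


def sha1_digest(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


class BlobStore:
    """Hypothetical store where each SHA-1 names a set of file contents."""

    def __init__(self, weak: Callable[[bytes], bytes] = sha1_digest) -> None:
        self._weak = weak
        self._sets: Dict[bytes, List[bytes]] = {}

    def add(self, data: bytes) -> BlobId:
        key = self._weak(data)
        candidates = self._sets.setdefault(key, [])
        # Without a stored strong checksum, only a full comparison
        # can confirm that two contents are really identical.
        for index, existing in enumerate(candidates):
            if existing == data:
                return (key, index)
        candidates.append(data)
        return (key, len(candidates) - 1)

    def get(self, blob_id: BlobId) -> bytes:
        key, index = blob_id
        return self._sets[key][index]
```

    Index 0 is what software that looks only at the SHA-1 would resolve to; a
    set grows past one entry only when a collision actually occurs.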

    But the real problem is that you cannot efficiently identify duplicates
    without a strong checksum, since without one you'd need to compare the full
    file contents before you know it's really a duplicate.

    As I mentioned, you could compute a strong checksum, like SHA-256 or BLAKE2,
    temporarily while adding files and use it to identify duplications. But unless
    you actually store the strong checksum in the archive, you will not be able
    to reliably identify a duplication between a new file and a file that already
    existed in the archive. I don't think there's a way around that, other than
    actually storing the strong checksums in the archive somewhere.

    Note that it's pretty easy to add custom fields to WIM directory entries using
    the "tagged items" feature (see tagged_items.c in the wimlib source code), and
    any unrecognized custom fields get ignored by DISM / WIMGAPI. However, there
    doesn't appear to be a way to add fields to the blob descriptors (a.k.a.
    "lookup table entries"), as they have a fixed size of 50 bytes.
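
    To illustrate the fixed-size constraint, here is a Python sketch of a
    50-byte blob descriptor. The field order and widths are an assumption based
    on my reading of the on-disk structures in the wimlib source (a 24-byte
    resource header, a 2-byte part number, a 4-byte reference count, and the
    20-byte SHA-1); verify against the source before relying on it. The
    function name is hypothetical.

```python
import struct

# Assumed little-endian layout with no padding:
#   7s  size in WIM (7 bytes)      B  flags (1 byte)
#   Q   offset in WIM (8 bytes)    Q  uncompressed size (8 bytes)
#   H   part number (2 bytes)      I  reference count (4 bytes)
#   20s SHA-1 digest (20 bytes)
BLOB_DESCRIPTOR = struct.Struct("<7sBQQHI20s")


def pack_descriptor(size_in_wim: int, flags: int, offset: int,
                    uncompressed_size: int, part_number: int,
                    refcnt: int, sha1: bytes) -> bytes:
    """Pack one descriptor under the assumed layout (hypothetical helper)."""
    if len(sha1) != 20:
        raise ValueError("the hash field holds exactly 20 bytes")
    return BLOB_DESCRIPTOR.pack(size_in_wim.to_bytes(7, "little"), flags,
                                offset, uncompressed_size, part_number,
                                refcnt, sha1)
```

    Every byte of the 50 is already assigned, so a 32-byte SHA-256 has nowhere
    to go in this record without changing its size.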

     
