What do you think about the SHA-1 collision problem?
How can it be solved for WIM archives?
Hi Igor,
[In the future please use the new forums at https://wimlib.net/forums]
Yes, the current behavior in the event of a SHA-1 collision is not very good;
you will silently get the wrong file contents. AFAICS wimlib, 7-Zip, and DISM
all behave that way, which is not surprising because SHA-1 is built into the WIM
file format as the way to identify and deduplicate file contents. I'm not yet
sure what should be done, but here are some ideas:
Idea 1: create a new version of the WIM file format that uses a different
checksum, such as SHA-256 or BLAKE2. This is probably the best solution
long-term, but for compatibility reasons wimlib won't be able to create such
archives by default anytime soon.
Idea 2: extend the WIM file format to store a SHA-256 or BLAKE2 checksum
alongside each SHA-1, in a way such that old software would see the SHA-1 only.
wimlib would compute, store, and check both checksums. This may be a way to
solve the problem in the shorter term, but it seems like a complicated hack.
Idea 3: without changing the on-disk format, just try to detect SHA-1
collisions and warn the user about them so they can address the problem
themselves. One way this could be done is by computing the SHA-256 or BLAKE2
for each added file temporarily, just for keeping in-memory, and noticing when
the SHA-1 is the same but the stronger checksum is different. This should be
pretty straightforward to implement, though of course it's not a full solution:
it would detect collisions only among newly added files, not against files
already in the archive (unless the latter had the stronger checksum stored
somewhere, of course).
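A minimal sketch of Idea 3, assuming an in-memory table that lives only for the duration of one capture. The class name `CollisionDetector` and its `weak`/`strong` parameters are hypothetical and are not part of wimlib; the hash functions are injectable only so the collision path can be exercised without a real SHA-1 collision.

```python
import hashlib


def sha1_digest(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def sha256_digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class CollisionDetector:
    """Remembers the strong checksum of the first file seen for each SHA-1.

    Hypothetical helper: nothing here is written to the archive.
    """

    def __init__(self, weak=sha1_digest, strong=sha256_digest):
        self._weak = weak
        self._strong = strong
        self._seen = {}  # weak digest -> strong digest of the first file seen

    def add(self, data: bytes) -> str:
        """Return "new", "duplicate", or "collision" for this file's contents."""
        weak = self._weak(data)
        strong = self._strong(data)
        known = self._seen.get(weak)
        if known is None:
            self._seen[weak] = strong
            return "new"
        if known == strong:
            return "duplicate"
        # Same SHA-1, different strong checksum: warn the user.
        return "collision"
```

Because the table starts out empty, files already stored in the archive are never compared against, which is the limitation described above.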
For now I don't have time to decide on anything specific, so I am just going
to wait and see.
It's simpler for me to use this forum. I don't want to create an additional account at another forum.
As I understand it, there are two main cases where WIM is used:
1) Windows installers - a SHA-1 collision is not a problem for that case.
2) Backup and archive tasks - a SHA-1 collision can be a problem in some cases. Maybe even some undesirable effects are possible. I suppose you test DISM for different cases, so you can check the current DISM with colliding files.
BTW, some old WIM version uses a file index instead of SHA-1. So maybe it's possible to use something like that instead of SHA-1.
Do we really need the reference via hash?
Maybe a simple file index is OK?
We can still store the SHA-1 for the data, but files will link to the data via index.
Last edit: Igor Pavlov 2017-02-27
Yes, I did test DISM; it behaves the same as wimlib and 7-Zip, which is to
incorrectly link the colliding files to the same contents.
It would be possible to identify file contents by index, but to be compatible it
would have to be the second part of the identifier after the SHA-1. In other
words, each SHA-1 would identify a set of file contents (of size > 1 only if
there is a SHA-1 collision), and the index would identify a specific file
contents in that set.
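To make the two-part identifier concrete, here is a rough sketch in which each SHA-1 maps to a list of distinct contents and the index selects one entry in that list. `BlobStore` and its methods are made-up names for illustration, not an existing wimlib or 7-Zip API, and the `weak` parameter exists only so a collision can be simulated.

```python
import hashlib
from collections import defaultdict


def sha1_digest(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


class BlobStore:
    """Identifies contents by (SHA-1, index within the set sharing that SHA-1)."""

    def __init__(self, weak=sha1_digest):
        self._weak = weak
        self._sets = defaultdict(list)  # SHA-1 -> list of distinct contents

    def add(self, data: bytes):
        key = self._weak(data)
        bucket = self._sets[key]
        # Deciding whether this is a true duplicate requires comparing
        # the full contents against every entry with the same SHA-1.
        for index, existing in enumerate(bucket):
            if existing == data:
                return key, index
        bucket.append(data)
        return key, len(bucket) - 1

    def get(self, key: bytes, index: int) -> bytes:
        return self._sets[key][index]
```

In the normal case every set has exactly one entry and the index is always 0, so software that looks only at the SHA-1 part would still resolve to a valid entry.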
But the real problem is that you cannot efficiently identify duplications
without a strong checksum, since without one you'd need to compare the full
file contents before you know it's really a duplicate.
As I mentioned, you could compute a strong checksum, like SHA-256 or BLAKE2,
temporarily while adding files and use it to identify duplications. But unless
you actually store the strong checksum in the archive, you will not be able
to reliably identify a duplication between a new file and a file that already
existed in the archive. I don't think there's a way around that, other than
actually storing the strong checksums in the archive somewhere.
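The gap can be shown with a small sketch in which blobs loaded from an existing archive carry only a SHA-1, so their strong checksum is unknown (`None`). The function `classify` and the shape of `index` are assumptions made for this example only.

```python
import hashlib
from typing import Dict, Optional


def sha1_digest(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def sha256_digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def classify(data: bytes,
             index: Dict[bytes, Optional[bytes]],
             weak=sha1_digest,
             strong=sha256_digest) -> str:
    """index maps SHA-1 -> strong checksum, or None for pre-existing blobs."""
    key = weak(data)
    if key not in index:
        index[key] = strong(data)
        return "new"
    stored = index[key]
    if stored is None:
        # Blob came from the archive, which stores only the SHA-1.
        # A match here cannot be told apart from a collision.
        return "unverified"
    return "duplicate" if stored == strong(data) else "collision"
```

Every file whose SHA-1 matches a pre-existing blob lands in the "unverified" case, which is why the strong checksum would need to be stored in the archive for the check to be reliable.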
Note that it's pretty easy to add custom fields to WIM directory entries using
the "tagged items" feature (see tagged_items.c in the wimlib source code), and
any unrecognized custom fields get ignored by DISM / WIMGAPI. However, there
doesn't appear to be a way to add fields to the blob descriptors (a.k.a.
"lookup table entries"), as they have a fixed size of 50 bytes.
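For illustration, the fixed 50-byte size can be reproduced with a packed little-endian layout. Only the 50-byte total comes from the post above; the field order and widths below are an assumption based on my reading of how wimlib lays out a blob descriptor, and `pack_descriptor` is a hypothetical helper.

```python
import struct

# Assumed on-disk layout of one blob descriptor (little-endian, no padding):
#   7 bytes   size of the resource as stored in the WIM
#   1 byte    resource flags
#   8 bytes   offset of the resource in the WIM
#   8 bytes   uncompressed size
#   2 bytes   part number
#   4 bytes   reference count
#   20 bytes  SHA-1 of the uncompressed contents
BLOB_DESCRIPTOR = struct.Struct("<7sBQQHI20s")


def pack_descriptor(stored_size: int, flags: int, offset: int,
                    uncompressed_size: int, part_number: int,
                    refcount: int, sha1: bytes) -> bytes:
    """Pack one descriptor; the 7-byte size is encoded by hand."""
    return BLOB_DESCRIPTOR.pack(
        stored_size.to_bytes(7, "little"), flags, offset,
        uncompressed_size, part_number, refcount, sha1)
```

Under this layout the 20-byte SHA-1 fills the end of the record exactly, leaving no spare bytes where a longer checksum could go.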