#48 Different SHA for identical archives

Milestone: v1.0 (example)
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2016-04-05
Created: 2016-03-01
Private: No

Hello,

I would like to report that the SHA of two identical archives created from the same set of input files differs.
Why does this happen? Is this a bug? Why can't we create two archives with the same SHA from a given set of files? Could this be resolved?
For comparison, 7-Zip archives created from identical files have the same SHA.
Below is a way to reproduce the issue:

$ touch 1.c
$ touch 2.c
$ zip -ry ../test1.zip .
adding: 1.c (stored 0%)
adding: 2.c (stored 0%)
$ zip -ry ../test2.zip .
adding: 1.c (stored 0%)
adding: 2.c (stored 0%)
$ cd ..
$ sha1sum test1.zip
2ceee47f846f7b35a28cc8a072aaafe2c3eaa4a4 test1.zip
$ sha1sum test2.zip
df8550366dc943110c9b9d781c2fab73f9511cb5 test2.zip
$

Thanks and Regards
Shirsh

Discussion

  • Steven Schweda

    Steven Schweda - 2016-03-01

    To investigate this properly, some basic information would help:

      Computer type and operating system (and version).
    
      Program version. (Ideally, a whole "unzip -v" or "zip -v" report.)
    

    Thanks.

     
  • Shirsh Kumar

    Shirsh Kumar - 2016-03-01

    Hi,
    Below is my system information.
    $ uname -a
    Linux ubuntu 3.11.0-26-generic #45~precise1-Ubuntu SMP Tue Jul 15 04:04:35 UTC 2014 i686 i686 i386 GNU/Linux

    zip -v output:
    $ zip -v
    Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
    This is Zip 3.0 (July 5th 2008), by Info-ZIP.
    Currently maintained by E. Gordon. Please send bug reports to
    the authors using the web page at www.info-zip.org; see README for details.

    Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip,
    as of above date; see http://www.info-zip.org/ for other sites.

    Compiled with gcc 4.6.1 for Unix (Linux ELF) on Jun 11 2011.

    Zip special compilation options:
    ASM_CRC
    ASMV
    USE_EF_UT_TIME (store Universal Time)
    BZIP2_SUPPORT (bzip2 library version 1.0.6, 6-Sept-2010)
    bzip2 code and library copyright (c) Julian R Seward
    (See the bzip2 license for terms of use)
    SYMLINK_SUPPORT (symbolic links supported)
    LARGE_FILE_SUPPORT (can read and write large files on file system)
    ZIP64_SUPPORT (use Zip64 to store large files in archives)
    UNICODE_SUPPORT (store and read UTF-8 Unicode paths)
    STORE_UNIX_UIDs_GIDs (store UID/GID sizes/values using new extra field)
    UIDGID_NOT_16BIT (old Unix 16-bit UID/GID extra field not used)
    [encryption, version 2.91 of 05 Jan 2007] (modified for Zip 3)

    Encryption notice:
    The encryption code of this program is not copyrighted and is
    put in the public domain. It was originally written in Europe
    and, to the best of our knowledge, can be freely distributed
    in both source and object forms from any country, including
    the USA under License Exception TSU of the U.S. Export
    Administration Regulations (section 740.13(e)) of 6 June 2002.

    Zip environment options:
    ZIP: [none]
    ZIPOPT: [none]

    Adding a report with the -v option below:

    $ ls
    1.c 2.c
    $ zip -rv ../test1.zip .
    adding: 1.c (in=273) (out=22) (deflated 92%)
    adding: 2.c (in=463) (out=29) (deflated 94%)
    total bytes=736, compressed=51 -> 93% savings
    $ zip -rv ../test2.zip .
    adding: 1.c (in=273) (out=22) (deflated 92%)
    adding: 2.c (in=463) (out=29) (deflated 94%)
    total bytes=736, compressed=51 -> 93% savings
    $ sha1sum ../test1.zip
    49e4905eba54f4281eb262b72bb7afc68b09cb5f ../test1.zip
    $ sha1sum ../test2.zip
    4803f1e326ae2f9525f9faefba87b94c35b08ed8 ../test2.zip

    As you can see, the SHAs of the archives test1.zip and test2.zip differ for the same set of input files.

     

    Last edit: Shirsh Kumar 2016-03-01
    • Greg Roelofs

      Greg Roelofs - 2016-03-01

      As you can see, the SHAs of the archives test1.zip and test2.zip differ for the same set of input files.

      Do cmp -l (IIRC) to see the specific bytes that differ, but I'd guess it's
      the timestamp of last access in the "UT" extra field.

      Actually, an even simpler way would be:

       diff -u <(zipinfo -v test1.zip) <(zipinfo -v test2.zip)
      

      (which probably is bash syntax, but you have that). ZipInfo's -v option
      doesn't actually print the access time anywhere, so if that diff shows
      nothing but cmp -l shows a one-byte or two-byte difference in two places
      per file, that's the likely reason.
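
As a rough sketch of the cmp -l check described above, the following shell functions list and count the byte offsets at which two files differ. The function names diff_offsets and diff_count are made up for this illustration; only cmp -l itself comes from the discussion.

```shell
# Print the 1-based offset of every differing byte, one offset per line.
# cmp -l prints "offset octal1 octal2" for each difference; keep column 1.
diff_offsets() {
    cmp -l "$1" "$2" | awk '{ print $1 }'
}

# Print how many bytes differ between the two files.
# tr strips the padding that some wc implementations add.
diff_count() {
    diff_offsets "$1" "$2" | wc -l | tr -d ' '
}
```

Running diff_count test1.zip test2.zip on two archives of equal size should then give a small number if only stored timestamps differ.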

      I think Zip has an option to omit the extra fields.

      Greg

       
  • Steven Schweda

    Steven Schweda - 2016-03-01

    Thanks for the additional data.

    It appears that the difference you're seeing between the two archives
    is the difference in the member file access times, which are normally
    stored in the archive. For example (on a Mac, but should be similar):

    mba$ touch 1.c
    mba$ touch 2.c

    mba$ ls -lT # Modification time
    total 0
    -rw-r--r-- 1 sms staff 0 Feb 29 23:09:32 2016 1.c
    -rw-r--r-- 1 sms staff 0 Feb 29 23:09:41 2016 2.c

    mba$ ls -luT # Access time (same as mod time, now)
    total 0
    -rw-r--r-- 1 sms staff 0 Feb 29 23:09:32 2016 1.c
    -rw-r--r-- 1 sms staff 0 Feb 29 23:09:41 2016 2.c

    mba$ zip -ry ../t1_1.zip .
    adding: 1.c (stored 0%)
    adding: 2.c (stored 0%)

    mba$ ls -lT # Modification time (same as before)
    total 0
    -rw-r--r-- 1 sms staff 0 Feb 29 23:09:32 2016 1.c
    -rw-r--r-- 1 sms staff 0 Feb 29 23:09:41 2016 2.c

    mba$ ls -luT # Access time (now different)
    total 0
    -rw-r--r-- 1 sms staff 0 Feb 29 23:10:46 2016 1.c
    -rw-r--r-- 1 sms staff 0 Feb 29 23:10:46 2016 2.c

    When it creates the first archive, Zip accesses the member files, and
    that updates the access times of the files. When creating the second
    archive, Zip sees these different access times, so you get different
    (access) times in the second archive.

    Currently, the only way I see to keep these access time data out of
    the archive is to use the "-X" option, which inhibits Zip from storing
    almost all extra-field data (including the "UT" extra block, which holds
    the UNIX access, creation, and modification times). For example:

    mba$ zipx -ry -X ../t_X_1.zip .
    adding: 1.c (stored 0%)
    adding: 2.c (stored 0%)

    mba$ zipx -ry -X ../t_X_2.zip .
    adding: 1.c (stored 0%)
    adding: 2.c (stored 0%)

    mba$ diff ../t_X_1.zip ../t_X_2.zip
    mba$

    The latest Zip beta version, 3.1d, claims to offer a useful-sounding
    "-tn" option. From the extended ("-hh") help:

    -tn prevents storage of univeral time. -X prevents storage of most extra
    fields, including universal time.

    Sadly, although "-tn" is accepted on many system types, it seems to
    be ignored except on Windows. That may be improved in the next Zip
    version, but, for now, "-X" may be the only way. Because "-X" will
    eliminate the "UT" extra blocks from the archive, only local
    modification date-time data will be stored. You may or may not care.
    "unzip -Zv <your_archive>" should show which extra blocks are in
    <your_archive>. Around here, that seems to be the following:

    The central-directory extra field contains:
    - A subfield with ID 0x5455 (universal time) and 5 data bytes.
    The local extra field has UTC/GMT modification/access times.
    - A subfield with ID 0x7875 (Unix UID/GID (any size)) and 11 data bytes:
    01 04 f5 01 00 00 04 14 00 00 00.

    Thus, "-X" will also strip out the Unix UID/GID data along with the
    UT data.

     

    Last edit: Steven Schweda 2016-03-01
  • Shirsh Kumar

    Shirsh Kumar - 2016-03-01

    Yes, you are right.

    $ diff -u <(zipinfo -v test1.zip) <(zipinfo -v test2.zip)
    --- /dev/fd/63 2016-03-01 11:37:49.289270083 +0530
    +++ /dev/fd/62 2016-03-01 11:37:49.289270083 +0530
    @@ -1,4 +1,4 @@
    -Archive: test1.zip
    +Archive: test2.zip
    There is no zipfile comment.

    End-of-central-directory record:

    $ cmp -l test1.zip test2.zip
    43 214 202
    104 214 204

    With the -X switch I can generate identical archives (in terms of SHA) from the same set of files.
    Thanks a lot.

     
  • Shirsh Kumar

    Shirsh Kumar - 2016-03-03

    Hello Steven,

    Actually, I have two zip files (old and new), and we are creating a diff between the uncompressed files of old and new.
    I then apply the patch to the uncompressed old files and zip them back up to create the new zip.
    Even though the contents match exactly, the SHA is different for the original new and the patched new zip files.
    When I compare the zipinfo output of the two files (original new and patched new), the order in which the files are listed is different, which I think is causing the SHA to differ.
    Could you suggest how I should proceed to handle this?

    PS: the zip contains files and directories.

     
  • Steven Schweda

    Steven Schweda - 2016-03-03

    When I compare the zipinfo of the two files (org new and patched new),
    the order in which the files are listed is different which I think is
    causing SHA to be different.

    Oh, yes. .zip archive data include multiple member offsets.
    Changing the order of the member files in an archive would normally
    change these offsets, which would most likely change any competent
    checksum.

    Could you suggest how I should proceed to handle this ?

    Perhaps. Zip 3.0 takes the files in a directory in the order in
    which readdir() returns them. On a UNIX(-like) system, I believe that
    readdir() results do not have a reliable order. So, if you
    change/delete/create files, the order seen by Zip could easily change,
    so the order of member files in a Zip archive could also easily change.

    Interestingly, I believe that Zip 3.1d (the latest beta version) is
    supposed to sort these file paths alphabetically. (By default; define
    the C macro NO_UNIX_DIR_SCAN_SORT to get the old behavior.) So, if you
    use Zip 3.1d, you might find that it solves your problem. (I thought
    that this sorting was a waste of time and effort, but the Zip maintainer
    insisted on adding the feature. What do I know?)

    The Zip 3.1d beta source kit should be available at:

      ftp://ftp.info-zip.org/pub/infozip/beta/zip31d.zip
    

    If Zip 3.1d does not do what you need, then the only other solution
    which leaps to mind would be to make your own list of files to be
    archived, sort it yourself, and then give that sorted list to Zip. (The
    "-@" option may help for that.)

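The "sort the list yourself" suggestion above could look roughly like the sketch below. It is an illustration under assumptions, not a tested recipe for this use case: the function names list_members_sorted and zip_sorted are invented here, file names containing newlines are not handled, and it only fixes the member order (stored modification times still have to match for the archives to be identical).

```shell
# Print every path below directory $1, relative to it, in byte-wise order.
# LC_ALL=C keeps the sort order independent of the user's locale.
list_members_sorted() {
    (
        cd "$1" || exit 1
        find . ! -path . -print | sed 's|^\./||' | LC_ALL=C sort
    )
}

# Build archive $2 from directory $1 using the sorted list.
# -X omits the extra fields (including access times), -y stores symbolic
# links as links, and -@ makes Zip read the member names from stdin.
zip_sorted() {
    src=$1
    case $2 in
        /*) out=$2 ;;        # absolute path: use as given
        *)  out=$PWD/$2 ;;   # relative path: resolve before changing directory
    esac
    list_members_sorted "$src" | (cd "$src" && zip -X -y -@ "$out")
}
```

A call such as zip_sorted patched_tree new.zip (both names hypothetical) would then add the members in the same order on every run.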
     
  • Shirsh Kumar

    Shirsh Kumar - 2016-03-31

    Hello,

    As you know, ZipSplit can split a zip archive into multiple files.
    But for our use case described above, it would be great if we could split a zip archive into individual files (not based on the -n option, but one archive per file).
    Is there a way to achieve this?

    Also, is it possible to unzip/decompress an archive to memory, modify its contents, and zip it back from memory?

     
  • Ed Gordon

    Ed Gordon - 2016-04-02

    Beta Zip 3.1e02 now supports -tn=a to remove just access times. -tn (and -tn=a) now works on Unix as well as Windows. Support on other ports will require some assistance from those familiar with those ports.

    As noted above, Zip 3.1d and later sort Unix entries, so the order should be the same every time. I felt that the random order that Unix generally provides was creating issues.

    Zip 3.0 and later allows copying of archive entries from an old archive to a new archive. That feature can be used to move an entry from a source archive to a new archive. For instance:

    zip old.zip --out new.zip --copy -i myfile.txt

    will open old.zip, create (or overwrite) new.zip, and copy myfile.txt from old.zip to new.zip.

    Are you asking for ZipSplit to be able to create a new archive for each file in the source archive, i.e. if there are 100 files in the source archive, are you wanting 100 archives each with one file? If so, ZipSplit can't do that now but that would be a simple feature to add. Just want to be clear this is what is wanted.

    We are considering allowing in memory operations on archives, but only through the LIB/DLL interface. If your OS provides some form of "RAM disk", that may allow what you want.

     
  • Shirsh Kumar

    Shirsh Kumar - 2016-04-05

    Hello Gordon,
    Thanks a lot for your inputs.

    1. Using zip old.zip --out new.zip --copy -i myfile.txt
      I'm able to copy a file from the old archive to the new archive, but all files already in the new archive are deleted, i.e. I am not able to update a single file from the old archive into the new archive.
      If I extract the file and then update, there is a problem with time stamps.

    2. Using ZipSplit
      Yes, I want to use ZipSplit to create individual archives (100 entries, 100 zip archives).
      This will enable me to create diffs for only the changed files/archives and create the new archive from them.

    Ex: (I am not sure whether doing it as below helps me.)
    Zip1 (with 10 files) ==> Split ==> 10 archives
    Zip2 (with 12 files) ==> Split ==> 12 archives

    Now, using the 2 new split archives created from Zip2, we can recreate Zip2 from Zip1, i.e. merge 10 (from old) + 2 (from new).

    3. I have one new query: we just extract a file and update it back.
      step 1: unzip -X Tpk1.tpk file
      step 2: zip Tpk1.tpk file

    BEFORE step 2 (using zipinfo):

    file last modified on (DOS date/time): 2015 Nov 6 22:39:54
    file last modified on (UT extra field modtime): 2015 Nov 6 19:09:54 local
    file last modified on (UT extra field modtime): 2015 Nov 6 13:39:54 UTC

    AFTER step 2 (DOS date/time gets modified):

    file last modified on (DOS date/time): 2015 Nov 6 19:09:54
    file last modified on (UT extra field modtime): 2015 Nov 6 19:09:54 local
    file last modified on (UT extra field modtime): 2015 Nov 6 13:39:54 UTC

     
  • Ed Gordon

    Ed Gordon - 2016-04-05

    I guess I still don't follow what you're trying to do. If you could step through your process from start to finish, it may help us understand what you need.

    Also, how are you merging the archives? Are you just trying to combine selected entries from two archives into another archive?

    When you extract and then rezip an entry, you need to worry about timezone. The DOS time is local and the UT time is fixed to Universal Time. If the timezone is different, unzip and zip will account for that when unzipping and zipping and you will get a different DOS to UT offset. However, given the above, it may be a DOS/UT bug in either zip or unzip. Can you verify the file times of the extracted files on the file system? That would help determine if it's zip or unzip that may be the problem.
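
To illustrate the point about local DOS time versus fixed UT time: the instant 2015 Nov 6 13:39:54 UTC from the zipinfo listing above is 1446817194 seconds after the epoch, and rendering it at two different UTC offsets reproduces both DOS times that were posted. This is only a sketch. The helper name local_time_at and the zone labels IST and KST are made up for the example, the +9:00 offset is inferred from the posted timestamps rather than confirmed, and date -d @N is a GNU/BusyBox extension rather than POSIX.

```shell
# Print the wall-clock time of an instant in a given zone.
# $1 = POSIX TZ string (the sign is inverted: "IST-5:30" means UTC+5:30)
# $2 = seconds since 1970-01-01 00:00:00 UTC
local_time_at() {
    TZ=$1 date -d "@$2" '+%Y-%m-%d %H:%M:%S'
}

local_time_at UTC0     1446817194   # 2015-11-06 13:39:54 (UT modtime)
local_time_at IST-5:30 1446817194   # 2015-11-06 19:09:54 (DOS time after re-zipping)
local_time_at KST-9    1446817194   # 2015-11-06 22:39:54 (DOS time before re-zipping)
```

If that reading is right, the original archive was created on a machine at UTC+9:00 and re-zipped on one at UTC+5:30, which would explain the changed DOS time without any bug in Zip or UnZip.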

     

    Last edit: Ed Gordon 2016-04-05
  • Steven Schweda

    Steven Schweda - 2016-04-05

    Using zip old.zip --out new.zip --copy -i myfile.txt
    I'm able to copy a file from the old archive to the new archive, but all
    files already in the new archive are deleted, i.e. I am not able to update
    a single file from the old archive into the new archive.

    You asked for a way (using ZipSplit) to get a set of single-member
    archives from one multi-member archive. Currently ZipSplit can't do
    that, but running a Zip command like that one for each (useful) archive
    member should do that job.
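
A loop along those lines might look like the following sketch. It assumes Zip 3.0 or later (for --copy and --out) and zipinfo on the PATH; the function names and the scheme for naming the output archives are invented for this example. Zip treats the member name given to -i as a pattern, so names containing wildcard characters may need extra care.

```shell
# Derive a flat archive name from a member path: "src/1.c" -> "src_1.c.zip".
member_to_archive_name() {
    printf '%s.zip\n' "$(printf '%s' "$1" | tr '/' '_')"
}

# Copy each file member of archive $1 into its own archive under directory $2.
split_archive() {
    src=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # zipinfo -1 lists the member names only, one per line.
    zipinfo -1 "$src" | while IFS= read -r member; do
        case $member in
            */) continue ;;   # skip directory entries
        esac
        # Same form as the command suggested earlier in this thread.
        zip "$src" --out "$outdir/$(member_to_archive_name "$member")" \
            --copy -i "$member" </dev/null
    done
}
```

For instance, split_archive old.zip parts would be expected to leave one single-member archive per file in the (hypothetical) directory parts.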

    If you wish to update one archive member in an archive, then that
    should be easy:

      zip archive.zip myfile.txt
    

    If you want to preserve the original archive, then:

     zip archive.zip --out new_archive.zip myfile.txt
    

    If I extract the file and then update, there is a problem with time stamps.

    As Ed said, this sounds like a timezone problem.

    You seem to be trying to use a zip archive as a general-purpose data
    base or versioning file system. This may be possible in some cases, but
    it's not really what Zip and UnZip were designed to do. This may lead
    to disappointment.

     
