Info-ZIP project / Bugs / #48 Different SHA for identical archives

Steven Schweda - 2016-03-01

To investigate this properly, some basic information would help:

Computer type and operating system (and version). Program version. (Ideally, a whole "unzip -v" or "zip -v" report.)

Thanks.

Shirsh Kumar - 2016-03-01

Hi,
Below is my system information.
$ uname -a
Linux ubuntu 3.11.0-26-generic #45~precise1-Ubuntu SMP Tue Jul 15 04:04:35 UTC 2014 i686 i686 i386 GNU/Linux

zip -v output:
$ zip -v
Copyright (c) 1990-2008 Info-ZIP - Type 'zip "-L"' for software license.
This is Zip 3.0 (July 5th 2008), by Info-ZIP.
Currently maintained by E. Gordon. Please send bug reports to
the authors using the web page at www.info-zip.org; see README for details.

Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip,
as of above date; see http://www.info-zip.org/ for other sites.

Compiled with gcc 4.6.1 for Unix (Linux ELF) on Jun 11 2011.

Zip special compilation options:
ASM_CRC
ASMV
USE_EF_UT_TIME (store Universal Time)
BZIP2_SUPPORT (bzip2 library version 1.0.6, 6-Sept-2010)
bzip2 code and library copyright (c) Julian R Seward
(See the bzip2 license for terms of use)
SYMLINK_SUPPORT (symbolic links supported)
LARGE_FILE_SUPPORT (can read and write large files on file system)
ZIP64_SUPPORT (use Zip64 to store large files in archives)
UNICODE_SUPPORT (store and read UTF-8 Unicode paths)
STORE_UNIX_UIDs_GIDs (store UID/GID sizes/values using new extra field)
UIDGID_NOT_16BIT (old Unix 16-bit UID/GID extra field not used)
[encryption, version 2.91 of 05 Jan 2007] (modified for Zip 3)

Encryption notice:
The encryption code of this program is not copyrighted and is
put in the public domain. It was originally written in Europe
and, to the best of our knowledge, can be freely distributed
in both source and object forms from any country, including
the USA under License Exception TSU of the U.S. Export
Administration Regulations (section 740.13(e)) of 6 June 2002.

Zip environment options:
ZIP: [none]
ZIPOPT: [none]

Adding report with -v option below :

$ ls
1.c 2.c
$ zip -rv ../test1.zip .
adding: 1.c (in=273) (out=22) (deflated 92%)
adding: 2.c (in=463) (out=29) (deflated 94%)
total bytes=736, compressed=51 -> 93% savings
$ zip -rv ../test2.zip .
adding: 1.c (in=273) (out=22) (deflated 92%)
adding: 2.c (in=463) (out=29) (deflated 94%)
total bytes=736, compressed=51 -> 93% savings
$ sha1sum ../test1.zip
49e4905eba54f4281eb262b72bb7afc68b09cb5f ../test1.zip
$ sha1sum ../test2.zip
4803f1e326ae2f9525f9faefba87b94c35b08ed8 ../test2.zip

As you can see, the SHA-1 sums of test1.zip and test2.zip differ for the same set of input files.

Last edit: Shirsh Kumar 2016-03-01

- Greg Roelofs - 2016-03-01
  
  Adding report with -v option below :
  
  $ ls
  1.c 2.c
  $ zip -rv ../test1.zip .
  adding: 1.c (in=273) (out=22) (deflated 92%)
  adding: 2.c (in=463) (out=29) (deflated 94%)
  total bytes=736, compressed=51 -> 93% savings
  $ zip -rv ../test2.zip .
  adding: 1.c (in=273) (out=22) (deflated 92%)
  adding: 2.c (in=463) (out=29) (deflated 94%)
  total bytes=736, compressed=51 -> 93% savings
  $ sha1sum ../test1.zip
  49e4905eba54f4281eb262b72bb7afc68b09cb5f ../test1.zip
  $ sha1sum ../test2.zip
  4803f1e326ae2f9525f9faefba87b94c35b08ed8 ../test2.zip
  
  As you can see, the SHA-1 sums of test1.zip and test2.zip differ for the same set of input files.
  
  Do cmp -l (IIRC) to see the specific bytes that differ, but I'd guess it's
  the timestamp of last access in the "UT" extra field.
  
  Actually, an even simpler way would be:
  
  diff -u <(zipinfo -v test1.zip) <(zipinfo -v test2.zip)
  
  (which probably is bash syntax, but you have that). ZipInfo's -v option
  doesn't actually print the access time anywhere, so if that diff shows
  nothing but cmp -l shows a one-byte or two-byte difference in two places
  per file, that's the likely reason.
  
  I think Zip has an option to omit the extra fields.
  
  Greg
  

Steven Schweda - 2016-03-01

Thanks for the additional data.

It appears that the difference you're seeing between the two archives
is the difference in the member file access times, which are normally
stored in the archive. For example (on a Mac, but should be similar):

mba$ touch 1.c
mba$ touch 2.c

mba$ ls -lT # Modification time
total 0
-rw-r--r-- 1 sms staff 0 Feb 29 23:09:32 2016 1.c
-rw-r--r-- 1 sms staff 0 Feb 29 23:09:41 2016 2.c

mba$ ls -luT # Access time (same as mod time, now)
total 0
-rw-r--r-- 1 sms staff 0 Feb 29 23:09:32 2016 1.c
-rw-r--r-- 1 sms staff 0 Feb 29 23:09:41 2016 2.c

mba$ zip -ry ../t1_1.zip .
adding: 1.c (stored 0%)
adding: 2.c (stored 0%)

mba$ ls -lT # Modification time (same as before)
total 0
-rw-r--r-- 1 sms staff 0 Feb 29 23:09:32 2016 1.c
-rw-r--r-- 1 sms staff 0 Feb 29 23:09:41 2016 2.c

mba$ ls -luT # Access time (now different)
total 0
-rw-r--r-- 1 sms staff 0 Feb 29 23:10:46 2016 1.c
-rw-r--r-- 1 sms staff 0 Feb 29 23:10:46 2016 2.c

When it creates the first archive, Zip accesses the member files, and
that updates the access times of the files. When creating the second
archive, Zip sees these different access times, so you get different
(access) times in the second archive.

Currently, the only way I see to keep these access time data out of
the archive is to use the "-X" option, which inhibits Zip from storing
almost all extra-field data (including the "UT" extra block, which holds
the UNIX access, creation, and modification times). For example:

mba$ zipx -ry -X ../t_X_1.zip .
adding: 1.c (stored 0%)
adding: 2.c (stored 0%)

mba$ zipx -ry -X ../t_X_2.zip .
adding: 1.c (stored 0%)
adding: 2.c (stored 0%)

mba$ diff ../t_X_1.zip ../t_X_2.zip
mba$

The latest Zip beta version, 3.1d, claims to offer a useful-sounding
"-tn" option. From the extended ("-hh") help:

-tn prevents storage of univeral time. -X prevents storage of most extra
fields, including universal time.

Sadly, although "-tn" is accepted on many system types, it seems to
be ignored except on Windows. That may be improved in the next Zip
version, but, for now, "-X" may be the only way. Because "-X" will
eliminate the "UT" extra blocks from the archive, only local
modification date-time data will be stored. You may or may not care.
"unzip -Zv <your_archive>" should show which extra blocks are in
<your_archive>. Around here, that seems to be the following:

The central-directory extra field contains:
- A subfield with ID 0x5455 (universal time) and 5 data bytes.
The local extra field has UTC/GMT modification/access times.
- A subfield with ID 0x7875 (Unix UID/GID (any size)) and 11 data bytes:
01 04 f5 01 00 00 04 14 00 00 00.

Thus, "-X" will also strip out the Unix UID/GID data along with the
UT data.
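
As a rough illustration of the two extra blocks listed above, here is a minimal Python sketch that splits an extra field into its subfields and decodes the 0x5455 ("UT") and 0x7875 (UID/GID) blocks. The function names are hypothetical, and the block layouts (a flags byte followed by 4-byte little-endian times for "UT"; version, size-prefixed UID, size-prefixed GID for 0x7875) are assumptions based on the Info-ZIP extra-field notes, not something taken from the Zip source.

```python
import struct

def parse_extra(extra):
    """Split a ZIP extra field into (header_id, data) subfields."""
    subfields = []
    pos = 0
    while pos + 4 <= len(extra):
        # Each subfield starts with a 2-byte ID and a 2-byte data size.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        subfields.append((header_id, extra[pos + 4:pos + 4 + size]))
        pos += 4 + size
    return subfields

def decode_ut(data):
    """Decode a 0x5455 block into the Unix times actually present in it."""
    flags = data[0]
    times = {}
    pos = 1
    for bit, name in ((1, "mtime"), (2, "atime"), (4, "ctime")):
        # The central-directory copy carries only the first time,
        # so stop reading when the data runs out.
        if flags & bit and pos + 4 <= len(data):
            times[name] = struct.unpack_from("<i", data, pos)[0]
            pos += 4
    return times

def decode_ux(data):
    """Decode a 0x7875 block into (uid, gid); data[0] is a version byte."""
    uid_size = data[1]
    uid = int.from_bytes(data[2:2 + uid_size], "little")
    gid_size = data[2 + uid_size]
    gid = int.from_bytes(data[3 + uid_size:3 + uid_size + gid_size], "little")
    return uid, gid

# The 11 data bytes shown in the "unzip -Zv" report above.
UX_BYTES = bytes.fromhex("01 04 f5 01 00 00 04 14 00 00 00")
print(decode_ux(UX_BYTES))  # (501, 20)
```

With the 5-byte central-directory form of the "UT" block, decode_ut() returns only the modification time, which matches the report: the access time lives in the local extra field.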

Last edit: Steven Schweda 2016-03-01


Shirsh Kumar - 2016-03-01

Yes you are right.

$ diff -u <(zipinfo -v test1.zip) <(zipinfo -v test2.zip)
--- /dev/fd/63 2016-03-01 11:37:49.289270083 +0530
+++ /dev/fd/62 2016-03-01 11:37:49.289270083 +0530
@@ -1,4 +1,4 @@
-Archive: test1.zip
+Archive: test2.zip
There is no zipfile comment.

End-of-central-directory record:

$ cmp -l test1.zip test2.zip
43 214 202
104 214 204

With the -X switch I can generate identical archives (in terms of SHA) from the same set of files.
Thanks a lot.
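
For reference, "cmp -l" prints the 1-based offset of each differing byte followed by the two byte values in octal. A minimal Python sketch of the same comparison follows; the function names are hypothetical, and the sketch only compares up to the length of the shorter input.

```python
def cmp_l(a, b):
    """Return (1-based offset, byte in a, byte in b) for each differing byte."""
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

def format_cmp_l(diffs):
    """Render differences roughly as cmp -l does: offset, then octal values."""
    return ["%d %o %o" % (off, x, y) for off, x, y in diffs]

# The two differences reported above, with the octal values made explicit.
reported = [(43, 0o214, 0o202), (104, 0o214, 0o204)]
for off, x, y in reported:
    print(off, x, y)  # decimal byte values: 140 vs 130, then 140 vs 132
```

Only two single bytes differ between the archives, which fits a small timestamp change rather than a difference in the compressed data.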


Shirsh Kumar - 2016-03-03

Hello Steven,

Actually, I have two zip files (old and new), and we are creating a diff between the uncompressed files of old and new.
I then apply the patch to the uncompressed old files and zip them back up to create the new zip.
Even though the contents match exactly, the SHA is different for the original new and the patched new zip files.
When I compare the zipinfo output of the two files (original new and patched new), the order in which the files are listed is different, which I think is causing the SHA to be different.
Could you suggest how I should proceed to handle this?

PS : the zip contains files and directories


Steven Schweda - 2016-03-03

When I compare the zipinfo output of the two files (original new and
patched new), the order in which the files are listed is different,
which I think is causing the SHA to be different.

Oh, yes. .zip archive data include multiple member offsets.
Changing the order of the member files in an archive would normally
change these offsets, which would most likely change any competent
checksum.

Could you suggest how I should proceed to handle this ?

Perhaps. Zip 3.0 takes the files in a directory in the order in
which readdir() returns them. On a UNIX(-like) system, I believe that
readdir() results do not have a reliable order. So, if you
change/delete/create files, the order seen by Zip could easily change,
so the order of member files in a Zip archive could also easily change.

Interestingly, I believe that Zip 3.1d (the latest beta version) is
supposed to sort these file paths alphabetically. (By default; define
the C macro NO_UNIX_DIR_SCAN_SORT to get the old behavior.) So, if you
use Zip 3.1d, you might find that it solves your problem. (I thought
that this sorting was a waste of time and effort, but the Zip maintainer
insisted on adding the feature. What do I know?)

The Zip 3.1d beta source kit should be available at:

ftp://ftp.info-zip.org/pub/infozip/beta/zip31d.zip

If Zip 3.1d does not do what you need, then the only other solution
which leaps to mind would be to make your own list of files to be
archived, sort it yourself, and then give that sorted list to Zip. (The
"-@" option may help for that.)
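
A minimal sketch of the "make your own sorted list" idea, in Python. The function name is hypothetical, and the sketch assumes that only regular files need to be listed (directory entries are left out) and that a plain code-point sort is an acceptable order.

```python
import os

def sorted_member_list(root):
    """Return the relative paths of all files under root, in a fixed order."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Use "/" separators, as stored in zip archives.
            paths.append(os.path.relpath(full, root).replace(os.sep, "/"))
    # sorted() on str compares code points, so the order does not depend
    # on the locale or on the order in which the directory was scanned.
    return sorted(paths)

# Print one path per line, the form that "zip -@" reads from standard input.
print("\n".join(sorted_member_list(".")))
```

If this were saved as, say, list_files.py (a hypothetical name), its output could be piped into Zip from the directory being archived, for example "python3 list_files.py | zip -X -@ ../archive.zip".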

Shirsh Kumar - 2016-03-31

Hello,

As you know, ZipSplit can split a zip archive into multiple files.
But for our use case given above, it would be great if we could split a zip archive into individual files (NOT based on the -n option, but one archive per file).
Is there a way to achieve this?

Also, is it possible to unzip/decompress an archive file to memory, modify its contents, and zip it back from memory?


Ed Gordon - 2016-04-02

Beta Zip 3.1e02 now supports -tn=a to remove just access times. -tn (and -tn=a) now works on Unix as well as Windows. Support on other ports will require some assistance from those familiar with those ports.

As noted above, Zip 3.1d and later now sorts Unix entries, so the order should be the same every time. I felt that the random order that Unix generally provides was creating issues.

Zip 3.0 and later allows copying of archive entries from an old archive to a new archive. That feature can be used to move an entry from a source archive to a new archive. For instance:

zip old.zip --out new.zip --copy -i myfile.txt

will open old.zip, create (or overwrite) new.zip, and copy myfile.txt from old.zip to new.zip.

Are you asking for ZipSplit to be able to create a new archive for each file in the source archive, i.e. if there are 100 files in the source archive, are you wanting 100 archives each with one file? If so, ZipSplit can't do that now but that would be a simple feature to add. Just want to be clear this is what is wanted.

We are considering allowing in memory operations on archives, but only through the LIB/DLL interface. If your OS provides some form of "RAM disk", that may allow what you want.
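
Outside of Info-ZIP, an in-memory round trip can be sketched with Python's standard zipfile module. This is not the Zip LIB/DLL interface mentioned above, only an illustration of the idea; replace_member is a hypothetical helper, it recompresses every member, and it carries over only the DOS date/time, compression method, and external attributes (extra fields such as "UT" are dropped).

```python
import io
import zipfile

def replace_member(archive, member, new_data):
    """Return a new archive (bytes) in which one member has new contents."""
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as src, \
            zipfile.ZipFile(out_buf, "w") as dst:
        for info in src.infolist():
            data = new_data if info.filename == member else src.read(info)
            # Build a fresh header that keeps the original DOS timestamp,
            # so the rewritten archive does not pick up the current time.
            fresh = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            fresh.compress_type = info.compress_type
            fresh.external_attr = info.external_attr
            dst.writestr(fresh, data)
    return out_buf.getvalue()
```

Members are written in their original order, so the result is repeatable for the same input on the same Python and zlib versions.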


Shirsh Kumar - 2016-04-05

Hello Gordon,
Thanks a lot for your inputs.

Using zip old.zip --out new.zip --copy -i myfile.txt
I'm able to copy a file from the old archive to the new archive. But all files in the NEW archive are deleted, i.e. I am not able to update a single file from the old archive into the new archive.
If I extract the file and then update, there is a problem with the time stamps.

Using Zipsplit
Yes, I want to use ZipSplit to create individual archives (100 entries, 100 zip archives).
This would let me create diffs for only the changed files/archives and build the new archive from them.

Ex: (I am not sure whether the following would help me.)
Zip1 (with 10 files) ==> Split ==> 10 archives
Zip2 (with 12 files) ==> Split ==> 12 archives

Now, using the 2 new split archives created from Zip2, we can recreate Zip2 from Zip1, i.e. merge 10 (from old) + 2 (from new).

I have one new query: we just extract a file and update it back.
step 1: unzip -X Tpk1.tpk file
step 2: zip Tpk1.tpk file

BEFORE step 2 [using zipinfo] :

file last modified on (DOS date/time): 2015 Nov 6 22:39:54
file last modified on (UT extra field modtime): 2015 Nov 6 19:09:54 local
file last modified on (UT extra field modtime): 2015 Nov 6 13:39:54 UTC

AFTER step 2 (DOS date/time gets modified):

file last modified on (DOS date/time): 2015 Nov 6 19:09:54
file last modified on (UT extra field modtime): 2015 Nov 6 19:09:54 local
file last modified on (UT extra field modtime): 2015 Nov 6 13:39:54 UTC

Ed Gordon - 2016-04-05

I guess I still don't follow what you're trying to do. If you could step through your process from start to finish, it may help us understand what you need.

Also, how are you merging the archives? Are you just trying to combine selected entries from two archives into another archive?

When you extract and then rezip an entry, you need to worry about timezone. The DOS time is local and the UT time is fixed to Universal Time. If the timezone is different, unzip and zip will account for that when unzipping and zipping and you will get a different DOS to UT offset. However, given the above, it may be a DOS/UT bug in either zip or unzip. Can you verify the file times of the extracted files on the file system? That would help determine if it's zip or unzip that may be the problem.
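
The DOS-to-UT offsets in the zipinfo listings above can be worked out directly. The following Python sketch is plain arithmetic on the reported values; the function name is hypothetical, and reading the results as time zones is an inference, not something the archive records.

```python
from datetime import datetime

def dos_minus_ut_hours(dos_local, ut_utc):
    """Offset, in hours, between the local DOS time and the UT modtime."""
    return (dos_local - ut_utc).total_seconds() / 3600

# Values copied from the zipinfo reports above.
ut = datetime(2015, 11, 6, 13, 39, 54)  # UT extra field modtime (UTC)
before = dos_minus_ut_hours(datetime(2015, 11, 6, 22, 39, 54), ut)
after = dos_minus_ut_hours(datetime(2015, 11, 6, 19, 9, 54), ut)
print(before, after)  # 9.0 5.5
```

So the original DOS time is consistent with an archive created in a UTC+9 zone, while the rezipped entry is consistent with UTC+5:30, the same +0530 offset that appears in the diff output earlier in this thread. The UT modification time itself is unchanged.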

Last edit: Ed Gordon 2016-04-05


Steven Schweda - 2016-04-05

Using zip old.zip --out new.zip --copy -i myfile.txt
I'm able to copy a file from the old archive to the new archive. But all
files in the NEW archive are deleted, i.e. I am not able to update a single
file from the old archive into the new archive.

You asked for a way (using ZipSplit) to get a set of single-member
archives from one multi-member archive. Currently ZipSplit can't do
that, but running a Zip command like that one for each (useful) archive
member should do that job.
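
A minimal sketch of "one Zip command per member", in Python. It only builds the command lines, following the "--out ... --copy -i" pattern shown earlier in this thread, and does not run them. The function name and the output naming scheme are hypothetical, and the sketch assumes that member names contain no characters that Zip would treat as wildcards in the "-i" pattern.

```python
def copy_commands(source_archive, member_names):
    """Build one 'zip --copy' command (as an argv list) per file member."""
    commands = []
    for name in member_names:
        if name.endswith("/"):
            continue  # skip directory entries
        # Hypothetical naming scheme: flatten the path into one file name.
        out_name = name.replace("/", "_") + ".zip"
        commands.append(
            ["zip", source_archive, "--out", out_name, "--copy", "-i", name]
        )
    return commands

for cmd in copy_commands("old.zip", ["1.c", "src/", "src/2.c"]):
    print(" ".join(cmd))
```

Each list could be passed to subprocess.run() once the generated commands have been checked by hand.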

If you wish to update one archive member in an archive, then that
should be easy:

zip archive.zip myfile.txt

If you want to preserve the original archive, then:

zip archive.zip --out new_archive.zip myfile.txt

If I extract the file and then update, there is a problem with the time stamps.

As Ed said, this sounds like a timezone problem.

You seem to be trying to use a zip archive as a general-purpose data
base or versioning file system. This may be possible in some cases, but
it's not really what Zip and UnZip were designed to do. This may lead
to disappointment.