#40 memory corruption error with unzip 6.10b

Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-11-24
Created: 2012-11-05
Creator: George Vlahavas
Private: No

I'm using unzip 6.10b on 64-bit Slackware Linux 14.0.

Here is the output of unzip -v:

UnZip 6.10b BETA of 10 Dec 10, by Info-ZIP. Maintained by C. Spieler. Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for details.

Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip/ ;
see ftp://ftp.info-zip.org/pub/infozip/UnZip.html for other sites.

Compiled with gcc 4.7.1 for Unix (GNU/Linux x86_64) on Oct 24 2012.

UnZip special compilation options:
COPYRIGHT_CLEAN (PKZIP 0.9x unreducing method not supported)
SET_DIR_ATTRIB
SYMLINKS (symbolic links supported, if RTL and file system permit)
TIMESTAMP
UNIXBACKUP (-B creates backup files)
USE_EF_UT_TIME
USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)
USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
UNICODE_SUPPORT [wide-chars, char coding: UTF-8] (handle UTF-8 paths)
MBCS-support (multibyte character support, MB_CUR_MAX = 6)
LARGE_FILE_SUPPORT (large files over 2 GiB supported)
ZIP64_SUPPORT (archives using Zip64 for large files supported)
VMS_TEXT_CONV
[decryption, version 2.11 of 05 Jan 2007]

UnZip and ZipInfo environment options:
UNZIP: [none]
UNZIPOPT: [none]
ZIPINFO: [none]
ZIPINFOOPT: [none]

unzip 6.10b has a bug with certain zip files. Here is an example file:
http://pnboy.pinguix.com/gapan/PLI24-2012-2013-OSS1%20-%20Yliko%20OSS%20-%20Meros%20A.zip
(I cannot attach it here, as it is 1.5 MB in size.)

I get this when trying to use unzip with it:

$ unzip -l PLI24-2012-2013-OSS1\ -\ Yliko\ OSS\ -\ Meros\ A.zip
Archive: PLI24-2012-2013-OSS1 - Yliko OSS - Meros A.zip
Length Date Time Name
--------- ---------- ----- ----
141312 10-18-2012 01:56 *** glibc detected *** unzip: malloc(): memory corruption: 0x000000000194d720 ***

and then I have to kill the unzip process.

unzip 6.0 can extract from it, but it doesn't recognize the charset used for the compressed filenames (it should be cp737 I think) and the filenames it produces are garbage. Here's what happens with unzip 6.0:

$ unzip -l PLI24-2012-2013-OSS1\ -\ Yliko\ OSS\ -\ Meros\ A.zip
Archive: PLI24-2012-2013-OSS1 - Yliko OSS - Meros A.zip
Length Date Time Name
--------- ---------- ----- ----
141312 10-18-2012 01:56 03 - ???24-2012-2013-???1 - ???? ? - ?????????? Java.ppt
1080320 10-18-2012 01:58 04 - ???24-2012-2013-???1 - ???? ? - NetBeans.ppt
293888 10-18-2012 01:27 01 - ???24-2012-2013-???1 - ??????? ?????????? ???? ????????????? ??????.ppt
579584 10-18-2012 01:29 02 - ???24-2012-2013-???1 - ???? ? - ??????????.ppt
--------- -------
2095104 4 files

All the question marks are supposed to be Greek letters.

Discussion

(Page 1 of 2)
  • Steven Schweda
    2012-11-06

    Thanks for the problem report (and especially the complete "-v"
    info).

    I know approximately nothing about Unicode or any other exotic
    character codes ("It's all Greek to me"), but there was some code in
    extract.c:fnfilter(), which looks to me to be bad. If UnZip was built
    with UNICODE_SUPPORT, then fnfilter() treated everything as Unicode, and
    so simple CP737 character sequences (which were invalid UTF-8) caused
    bad things to happen (SIGBUS, SIGSEGV, %SYSTEM-F-ACCVIO, ...).

    I've changed the development code to check the value returned from
    mbstowcs(), and avoid the wide-character conversions when it fails. I
    doubt that this is a complete fix, but it should help to stop a complete
    program failure like this one. The characters in the displayed member
    names are still checked using is[w]print(), and replaced (typically by
    "?") when is[w]print() says that they're not printable. Note that
    is[w]print() depends on the program's current locale, so the displayed
    names may change when the user's locale changes.
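
The guard described above can be illustrated outside of UnZip. This is a minimal Python sketch, not the actual extract.c change: the function name filter_name is hypothetical, a strict UTF-8 decode stands in for the mbstowcs() call, and str.isprintable() stands in for is[w]print() (unlike the C functions, it does not depend on the current locale).

```python
def filter_name(raw: bytes) -> str:
    """Return a displayable form of a raw archive member name.

    Hypothetical helper: it mirrors the idea of checking whether the
    multibyte conversion succeeded before using its result.
    """
    try:
        # Stand-in for a successful mbstowcs() call.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Conversion failed, so do not treat the bytes as UTF-8 at all.
        # Fall back to a byte-wise check and keep printable ASCII only.
        return "".join(chr(b) if 0x20 <= b < 0x7F else "?" for b in raw)
    # Conversion worked: replace non-printable characters one by one.
    return "".join(ch if ch.isprintable() else "?" for ch in text)


# CP737 bytes (invalid as UTF-8) no longer break anything; they are
# simply shown as question marks.
print(filter_name(b"\x83\xa6\xa1\xa0\xa3\xe3.txt"))
```

With this approach a name that is not valid UTF-8 degrades to question marks in the display instead of being pushed through a conversion that has already failed.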

    Instead of trying mbstowcs() for every name (when UNICODE_SUPPORT is
    defined), should we be using some other criterion (like, say, looking at
    bit 11 in the general-purpose bit flags) to decide whether a name is
    Unicode or not?
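
For reference, bit 11 (mask 0x0800) of the general-purpose bit flag is what the ZIP format uses to mark a member name as UTF-8. The following Python sketch reads that bit with the standard zipfile module. The helper name utf8_flags and the in-memory sample archive are made up for illustration and say nothing about how UnZip itself would implement the check.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x0800  # bit 11 of the general-purpose bit flag


def utf8_flags(archive_bytes):
    """Map each member name to True if bit 11 is set for that member."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
                for info in zf.infolist()}


# Build a tiny sample archive in memory. Python's zipfile sets bit 11
# by itself when a member name is not plain ASCII.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", "test")
    zf.writestr("\u0394\u03bf\u03ba\u03b9\u03bc\u03ae.txt", "test")

flags = utf8_flags(buf.getvalue())
for name, is_utf8 in flags.items():
    print(ascii(name), "flag set" if is_utf8 else "flag clear")
```

An archiver that stores names in an OEM code page such as CP737 would normally leave this bit clear, which is why it is a candidate criterion for skipping the UTF-8 path.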

    Some experimental (internal-only) code should be available at:

    http://antinode.info/ftp/info-zip/unzip610c08a_l_sD.zip

     
  • Thank you for your response and the quick fix!

    The experimental code you posted seems to work. The problematic zip file extracts, albeit with question marks all over the place. Other zip files using the same character codes for filenames appear to work fine (extracting with the proper characters in place). An example of such a file that extracts properly with your experimental code (and also with 6.10b, but not with 6.0) is here if you want it: http://pnboy.pinguix.com/gapan/23-10-2012-b-fasi-eaep.zip

    Is there anything fundamentally wrong with this experimental code that it should not be used?

    As far as I am concerned this bug report can be closed, although the ideal solution would be for unzip to always extract using the proper character codes.

     
  • Steven Schweda
    2012-11-06

    > [...] The problematic zip file
    > extracts, albeit with question marks all over the place. [...]

    Are the extracted file names (in the file system) bad, too, or only
    the names in the messages? As I said, I'd expect the question marks in
    the displayed (message) names to depend on your locale, because we're
    using is[w]print() to decide what's printable. If your locale is
    different from (and/or incompatible with) the code set used in the
    archive, then is[w]print() may get many things wrong -- too many
    question marks, or not enough.

    > [...] An example of such a file
    > that extracts properly with your experimental code (and also with 6.10b,
    > but not with 6.0) is here if you want it:
    > http://pnboy.pinguix.com/gapan/23-10-2012-b-fasi-eaep.zip

    Those seem to be Unicode characters, not the CP737 characters used in
    the first example. They may look the same in print, but the internal
    representations are different. (I claim. But I still know nothing.)
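
The difference in internal representation can be made concrete with a short Greek word. This Python sketch encodes the same text both ways and compares the bytes. The sample word ("Δοκιμή", written with escapes so the source stays ASCII) is chosen only for illustration.

```python
# "Δοκιμή": delta, omicron, kappa, iota, mu, eta with tonos
word = "\u0394\u03bf\u03ba\u03b9\u03bc\u03ae"

as_utf8 = word.encode("utf-8")    # Unicode text stored as UTF-8
as_cp737 = word.encode("cp737")   # same text in the DOS Greek code page

print("UTF-8 :", as_utf8.hex(" "))   # two bytes per letter here
print("CP737 :", as_cp737.hex(" "))  # one byte per letter

# The CP737 bytes are not valid UTF-8, so code that assumes UTF-8
# everywhere cannot decode them.
try:
    as_cp737.decode("utf-8")
    cp737_is_valid_utf8 = True
except UnicodeDecodeError:
    cp737_is_valid_utf8 = False

print("CP737 bytes valid as UTF-8:", cp737_is_valid_utf8)
```

The two byte strings render as the same glyphs only when each is decoded with its own encoding.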

    > [...] the ideal solution would be for unzip to always extract using
    > the proper character codes.

    That appears to be easier said than done.

    > Is there anything fundamentally wrong with this experimental code that
    > it should not be used?

    There are no known (serious) problems, but I would have said that
    before your problem report, too. There are some junk/obsolete files in
    that kit which I haven't removed. Also, that kit contains the AES
    source code, which we plan to distribute separately. History.610
    describes the significant changes. As usual, many things could change
    between now and the next real beta kit (command-line options,
    documentation, features, general behavior, ...). If you find any (more)
    problems, we'd like to know.

     
  • > Are the extracted file names (in the file system) bad, too, or only
    > the names in the messages?

    The extracted file names in the file system are bad too. But at least it extracts. Since my original bug report for this was unzip crashing, should I close this bug and make another one for the bad filenames? I could probably find more example zip files, with different encodings too. It seems different Windows versions and different Windows archivers use different settings. Could a solution be for unzip to have a command line option to specify encoding (both input and output probably)?

    > Those seem to be Unicode characters, not the CP737 characters used in
    > the first example.

    Yes, exactly. These are Unicode and that's the reason filenames show up properly.

    > That appears to be easier said than done.

    Yes, I understand the files themselves are a mess, but they are very common, as most zip files are created on Windows.

    > If you find any (more) problems, we'd like to know.

    Of course, I'll keep using your experimental code and see how it goes. Thanks a lot!

     
  • Steven Schweda
    2012-11-08

    > The extracted file names in the file system are bad too.

    Swell. I'll try to look into how that might happen.

    > [...] Since my original bug report for this was unzip crashing, should
    > I close this bug and make another one for the bad filenames?

    It doesn't matter to me. The problems are all related.

    > [...] I could probably find more example zip files, with different
    > encodings too.

    More is better. (Smaller is also better, of course.)

    > [...] Could a solution be for unzip to have a command line
    > option to specify encoding (both input and output probably)?

    UnZip has (poorly documented) -I and -O options which might be
    involved here. Again, I know nothing.

     
  • Steven Schweda
    2012-11-12

    A slightly revised informal source kit should be available here:

    http://antinode.info/ftp/info-zip/unzip610c08a_l_sE.zip

    It should make it a little easier to enable the -I and -O options --
    just add "ICONV=1" to the "make [...] generic" command. I still know
    nothing, but there is some related code in unix/unix.c, which mentions
    some ISO/OEM code pages. None of them looks Greek, but perhaps adding
    what you need would be possible.

     
  • I'm sorry but setting ICONV=1 doesn't really do anything. I'm getting "short option 'I' not supported". This is how I'm using it:

    make -f unix/Makefile generic ICONV=1

    In any case, here's another file exhibiting the same problem, this time a lot smaller in size.
    http://pnboy.pinguix.com/gapan/Created_on_WinXP_Greek.zip
    It was created on a Greek Windows XP system. It exhibits the same problem: the one file in it extracts with question marks all over the place. The actual file name should be "Δοκιμή.txt". If you're not seeing this text properly either, the characters are: capital Greek delta, omicron, kappa, iota, mu, eta with a stress accent.

    And here's another file, this time Hungarian, with the same problem (wherever non-ASCII characters are used):
    http://pnboy.pinguix.com/gapan/hungarian.zip

    In unix/unix.c the Greek locale is actually the "el" one, and it is set to CP869, which is a Greek encoding. I also tried changing it to CP737, which is another Greek encoding, or to CP1253 (the Windows counterpart of ISO 8859-7) for that matter, but it didn't help.

     
  • Steven Schweda
    2012-11-16

    > [...] setting ICONV=1 doesn't really do anything. [...]

    Oops. Sorry. Those changes were not well tested. They worked where
    I was, but failed on some other system types. I've replaced that
    unzip610c08a_l_sE.zip kit with a newer one where the build should work
    better. (And I've actually built it on a Debian GNU/Linux system.) I
    still don't know if it does anything useful, but you should be able to
    get a "-v" report which includes:
    ICONV_MAPPING (ISO/OEM (iconv) conversion supported)

     
  • Steven Schweda
    2012-11-16

    > http://pnboy.pinguix.com/gapan/Created_on_WinXP_Greek.zip

    A dump of that archive shows a name which looks to me more like CP737
    than CP869 (as I read http://en.wikipedia.org/wiki/Code_page_XXX). My
    UnZip produces a file with the same byte values as the archive:

    deb4# ls -b *.txt
    \203\246\241\240\243\343.txt

    which agrees with my dump:

    deb4# od -t o1 Created_on_WinXP_Greek.zip
    0000000 120 113 003 004 012 000 000 000 000 000 216 173 160 101 014 176
                                                                    vvvvvvv
    0000020 177 330 004 000 000 000 004 000 000 000 012 000 000 000 203 246
            vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    0000040 241 240 243 343 056 164 170 164 164 145 163 164 120 113 001 002
    [...]

    What would "ls -b" say for the correct name on the GNU/Linux system?
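
The octal dump above can be decoded mechanically. The following Python sketch takes the 48 bytes shown in the dump, unpacks the fixed 30-byte ZIP local file header, and slices out the stored name. The variable names are illustrative; the field order follows the standard local file header layout.

```python
import struct

# The three lines of "od -t o1" output quoted above, without offsets.
OD_DUMP = """
120 113 003 004 012 000 000 000 000 000 216 173 160 101 014 176
177 330 004 000 000 000 004 000 000 000 012 000 000 000 203 246
241 240 243 343 056 164 170 164 164 145 163 164 120 113 001 002
"""
raw = bytes(int(token, 8) for token in OD_DUMP.split())

# Local file header: signature, version needed, general-purpose flags,
# compression method, mod time, mod date, CRC-32, compressed size,
# uncompressed size, name length, extra field length (little-endian).
HEADER = struct.Struct("<4sHHHHHIIIHH")
(sig, version, flags, method, mtime, mdate,
 crc, csize, usize, name_len, extra_len) = HEADER.unpack_from(raw)

name = raw[HEADER.size:HEADER.size + name_len]
data_start = HEADER.size + name_len + extra_len
payload = raw[data_start:data_start + csize]

print("signature :", sig)
print("bit 11 set:", bool(flags & 0x0800))
print("name bytes:", " ".join("%03o" % b for b in name))
print("payload   :", payload)
```

The name length field (012 000, i.e. 10) accounts for the ten marked bytes, and the general-purpose flags are zero, so this member is not flagged as UTF-8.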

     
  • Steven Schweda
    2012-11-17

    I still know nothing, so I'm still bumbling around, and I don't have
    all the right X fonts in all the right places here, but, using the
    latest unzip610c08a_l_sE.zip kit, if I do (on my Debian system):

    export LANG=el_GR.utf8
    ../unzip610c08a_l_sE/unzip -l -O CP737 Created_on_WinXP_Greek.zip

    I get what looks like the right Greek name in the listing, and without
    the "-l", I get what looks like the right Greek name in an "ls" report
    for the extracted file.
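
As a rough illustration of why naming the code page matters, here is a Python sketch that converts the stored name bytes to UTF-8 under two different assumptions about the source code page. It is an assumption that this resembles what the -O option does internally; the function convert_name is hypothetical, and only the byte values come from this ticket.

```python
# Name bytes as stored in Created_on_WinXP_Greek.zip (octal values
# taken from the "ls -b" output earlier in this thread).
raw_name = bytes([0o203, 0o246, 0o241, 0o240, 0o243, 0o343]) + b".txt"


def convert_name(raw, source_codepage, target="utf-8"):
    """Decode raw name bytes with one code page, re-encode for the target.

    Bytes that the source code page does not define become U+FFFD.
    """
    return raw.decode(source_codepage, errors="replace").encode(target)


as_cp737 = convert_name(raw_name, "cp737")
as_cp869 = convert_name(raw_name, "cp869")

print("assuming CP737:", as_cp737.decode("utf-8"))
print("assuming CP869:", as_cp869.decode("utf-8"))
```

Only the CP737 assumption yields the intended name; the same bytes read as CP869 give different characters, which is consistent with the "el" to CP869 mapping not helping for this archive.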

     