
#10 Unzip UTF-8 zip files

open
nobody
None
5
2012-05-15
2012-05-15
Felix Wong
No

On Windows, I used Java to create a zip file that contains non-English characters in the directory names. The names were encoded as UTF-8. UnZip 6.0 cannot extract these non-English characters correctly, although it claims to support UTF-8. Are there any options that I need to specify?

Discussion

  • Steven Schweda

    Steven Schweda - 2012-05-15

    My ignorance of Unicode is great, but I believe that there were some
    Unicode-related changes ("-I" and "-O" options) in the latest UnZip beta
    kit. You might try that, and see if it works better.

    ftp://ftp.info-zip.org/pub/infozip/beta/unzip610b.zip

     
  • Ed Gordon

    Ed Gordon - 2012-05-16

    Can you provide an example archive (with nothing private or sensitive in it) and tell us the language or the non-English characters that are not extracting correctly? Also, "unzip -v" information showing the enabled features would be helpful. There have been considerable updates to Unicode handling in the mentioned beta, but there may still be issues in how Windows handles Unicode.

     
  • Felix Wong

    Felix Wong - 2012-05-16

    I created a zip file with Chinese characters in file names and directory names using the jar command on a Windows machine. If I use unzip to extract the files, the Chinese characters are corrupted, but jar can extract them correctly.

     
  • Felix Wong

    Felix Wong - 2012-05-16

    How do I attach a file here? I created a file called zh_TW.zip with jar. This was the output from unzip 6.0.

    C:\temp>unzip60.exe zh_TW.zip
    Archive: zh_TW.zip
    creating: test/
    checkdir error: cannot create test/???
    Invalid argument
    unable to process test/???/.
    checkdir error: cannot create test/???
    Invalid argument
    unable to process test/???/.
    checkdir error: cannot create test/???
    Invalid argument
    unable to process test/???/????????.

    C:\temp>unzip60.exe -v
    UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler. Send
    bug reports using http://www.info-zip.org/zip-bug.html; see README for details.

    Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip/ ;
    see ftp://ftp.info-zip.org/pub/infozip/UnZip.html for other sites.

    Compiled with Microsoft C 16.00 (Visual C++ 10.0) for
    Windows 9x / Windows NT/2K/XP/2K3 (32-bit) on Apr 26 2012.

    UnZip special compilation options:
    COPYRIGHT_CLEAN (PKZIP 0.9x unreducing method not supported)
    NTSD_EAS
    SET_DIR_ATTRIB
    TIMESTAMP
    UNIXBACKUP
    USE_EF_UT_TIME
    USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)
    USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
    UNICODE_SUPPORT [wide-chars] (handle UTF-8 paths)
    MBCS-support (multibyte character support, MB_CUR_MAX = 2)
    LARGE_FILE_SUPPORT (large files over 2 GiB supported)
    ZIP64_SUPPORT (archives using Zip64 for large files supported)
    VMS_TEXT_CONV
    [decryption, version 2.11 of 05 Jan 2007]

    UnZip and ZipInfo environment options:
    UNZIP: [none]
    UNZIPOPT: [none]
    ZIPINFO: [none]
    ZIPINFOOPT: [none]

    C:\temp>

     
  • Ed Gordon

    Ed Gordon - 2012-05-17

    There's a link to attach files at the bottom of this page. I want to pull the archive apart and see just how the file name information is stored.

     
  • Felix Wong

    Felix Wong - 2012-05-17

    File attached

     
  • Steven Schweda

    Steven Schweda - 2012-05-21

    Thanks for the problem report and test archive.

    There is a problem with using UnZip on a "jar" archive like this one.
    "jar" programs follow the ZIP archive format, but not very carefully.
    Some of the header data in the archive are apparently filled in with
    some details omitted, and the result can confuse UnZip. Specifically,
    there's a "version made by" field which should identify the OS (file
    system) where the archive was created. In the test archive, this
    host-type sub-field is zero. According to the .ZIP standard (and to
    UnZip), this signifies an MS-DOS:FAT/VFAT/FAT32 file system.

    http://www.pkware.com/documents/casestudies/APPNOTE.TXT
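For anyone who wants to check an archive for this condition without pulling it apart by hand, here is a minimal Python sketch. It is not part of UnZip. It relies on the standard `zipfile` module, whose `ZipInfo.create_system` attribute exposes the host-type byte of the "version made by" field. The helper names `made_by_hosts` and `build_sample` are made up for this illustration, and the sample archive is built in memory rather than taken from the attached zh_TW.zip.

```python
import io
import zipfile

# Two of the host-type values defined in APPNOTE.TXT; the list is not complete.
HOST_NAMES = {0: "MS-DOS and OS/2 (FAT / VFAT / FAT32)", 3: "UNIX"}


def made_by_hosts(source):
    """Return {entry name: host-type byte} for a path or binary file object."""
    with zipfile.ZipFile(source) as zf:
        return {info.filename: info.create_system for info in zf.infolist()}


def build_sample(host):
    """Build a one-entry archive in memory that claims the given host type."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        info = zipfile.ZipInfo("test/readme.txt")
        info.create_system = host  # overrides the platform default
        zf.writestr(info, b"hello")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Hypothetical stand-in for a real archive; pass a file path instead.
    for name, host in made_by_hosts(build_sample(0)).items():
        print(name, host, HOST_NAMES.get(host, "other"))
```

Passing the path of a real archive to `made_by_hosts` should list the host type recorded for each entry; a value of 0 on every entry would match the situation described above.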

    This leads UnZip to translate archive file names using rules
    appropriate for MS-DOS, not for UTF-8. The result is to damage file
    names which include UTF-8 code bytes greater than 127, including the CJK
    characters in this archive.
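To make the kind of damage concrete, the following Python sketch shows what happens when the UTF-8 bytes of a name are read as single-byte DOS text. It is only an approximation: code page 437 is an assumption chosen for the example, the function names are invented, and the translation UnZip actually performs on Windows depends on the build and on the system code pages.

```python
def misread_as_dos(name, codepage="cp437"):
    """Show a UTF-8 encoded name as it looks when read as single-byte DOS text."""
    return name.encode("utf-8").decode(codepage)


def high_bytes(name):
    """Count the UTF-8 bytes above 127, the ones exposed to translation."""
    return sum(1 for b in name.encode("utf-8") if b > 127)


if __name__ == "__main__":
    original = "test/\u4e2d\u6587"  # "test/" followed by two CJK characters
    damaged = misread_as_dos(original)
    # Each CJK character becomes three unrelated single-byte characters.
    print(len(original), len(damaged), high_bytes(original))
    # In this sketch the bytes are only reinterpreted, so the step can be undone.
    assert damaged.encode("cp437").decode("utf-8") == original
```

Plain ASCII names pass through unchanged, which is consistent with the problem only showing up for names that contain non-English characters.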

    I believe that there is no command-line option which will work around
    this inappropriate name translation. In UnZip 6.0, it should be
    possible to build UnZip using an empty definition for the C macro
    "Ext_ASCII_TO_Native". I don't know how to do that on Windows, but
    adding:
    #define Ext_ASCII_TO_Native
    before:
    #ifndef Ext_ASCII_TO_Native
    in unzpriv.h might do it. That should disable the normal MS-DOS name
    translation.

    We're thinking about different ways to deal with this "jar" problem
    in the next UnZip beta release, but nothing has been decided yet.

     
  • Felix Wong

    Felix Wong - 2012-05-22

    Thanks for looking into this. I would like to know when a fix for this is available.

     
