7-Zip / Discussion / Open Discussion: UTF-8 Unicode normalization problem in OS X

ChangBeom Park - 2009-07-13

Hello

I had some problems with OS X's Unicode canonical equivalence.
( http://unicode.org/reports/tr15/#Introduction )

Hangul syllables consist of choseong (L), jungseong (V), and jongseong (T), or of just choseong (L) and jungseong (V).
So 각 consists of ㄱ, ㅏ, and ㄱ.
The important point is that the former representation (각) is the visually correct form.
Unicode has two ways of storing such characters: decomposed and precomposed.
각 is the precomposed form; ㄱㅏㄱ is the decomposed form.
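
To make the relationship between the two forms concrete, here is a minimal sketch in C of the arithmetic the Unicode Standard defines for composing conjoining jamo into one precomposed syllable. The function name hangul_compose and the macro names are illustrative only and do not come from 7-Zip or p7zip. Note that the conjoining jamo live in U+1100..U+11FF; the ㄱ and ㅏ typed above are compatibility jamo used only for display.

```c
#include <assert.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul syllable composition rules. */
#define S_BASE  0xAC00u  /* first precomposed syllable */
#define L_BASE  0x1100u  /* first leading consonant (choseong) */
#define V_BASE  0x1161u  /* first vowel (jungseong) */
#define T_BASE  0x11A7u  /* one before the first trailing consonant (jongseong) */
#define L_COUNT 19u
#define V_COUNT 21u
#define T_COUNT 28u      /* includes the "no trailing consonant" case */

/* Compose conjoining jamo into one precomposed syllable.
   Pass t = 0 when the syllable has no trailing consonant. */
uint32_t hangul_compose(uint32_t l, uint32_t v, uint32_t t)
{
    assert(l >= L_BASE && l < L_BASE + L_COUNT);
    assert(v >= V_BASE && v < V_BASE + V_COUNT);
    assert(t == 0 || (t > T_BASE && t < T_BASE + T_COUNT));

    uint32_t l_index = l - L_BASE;
    uint32_t v_index = v - V_BASE;
    uint32_t t_index = (t == 0) ? 0 : t - T_BASE;
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index;
}
```

For example, hangul_compose(0x1100, 0x1161, 0x11A8) gives 0xAC01, which is 각.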

The HFS+ filesystem internally uses the Unicode 3.2 character set and UTF-16 encoding.
It also uses a canonical decomposition form (specifically, Apple's own variant of Normalization Form D).
http://developer.apple.com/technotes/tn/tn1150.html#CanonicalDecomposition says:

"In addition, the Korean Hangul characters with codes in the range u+AC00 through u+D7A3 are illegal and must be replaced with the equivalent sequence of conjoining jamos, as described in the Unicode 2.0 book, section 3.10."

That means they store Hangul in decomposed form, like ㄱㅏㄱ, not as 각.
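
The replacement that the quoted passage describes can be sketched as follows. This is only an illustration of the Hangul decomposition arithmetic from the Unicode Standard, not Apple's code; the name hangul_decompose and the macros are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define S_BASE  0xAC00u           /* first precomposed syllable */
#define S_LAST  0xD7A3u           /* last precomposed syllable */
#define L_BASE  0x1100u           /* first leading consonant */
#define V_BASE  0x1161u           /* first vowel */
#define T_BASE  0x11A7u           /* one before the first trailing consonant */
#define T_COUNT 28u               /* trailing slots, including "none" */
#define N_COUNT (21u * T_COUNT)   /* syllables per leading consonant: 588 */

/* Split one precomposed syllable into conjoining jamo.
   Returns the number of jamo written to out (2 or 3),
   or 0 if s is outside U+AC00..U+D7A3. */
int hangul_decompose(uint32_t s, uint32_t out[3])
{
    assert(out != NULL);
    if (s < S_BASE || s > S_LAST)
        return 0;

    uint32_t s_index = s - S_BASE;
    uint32_t t_index = s_index % T_COUNT;

    out[0] = L_BASE + s_index / N_COUNT;
    out[1] = V_BASE + (s_index % N_COUNT) / T_COUNT;
    if (t_index == 0)
        return 2;                 /* no trailing consonant */
    out[2] = T_BASE + t_index;
    return 3;
}
```

For 각 (U+AC01) this produces U+1100, U+1161, U+11A8, which is the three-jamo sequence HFS+ would store.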

But other operating systems (everything except OS X) generally use Normalization Form C (NFC). That means they deal with precomposed characters.

If I share my files with a Windows user, this causes the problem I described.
First, I archive some Korean filenames using p7zip 9.04b on OS X, so the filenames are stored in NFD.
Then, when someone extracts the archive on Windows, the filenames come out as decomposed characters.
Some Latin characters, such as u with umlaut, are decomposed the same way on OS X.

http://img37.imageshack.us/img37/4467/picture1pio.png

http://img268.imageshack.us/img268/1/picture2zeu.png

You can see the uncomposed Korean Hangul. Ironically, Windows renders decomposed Latin letters the same way as precomposed ones, so Explorer shows two Latin filenames that look the same but are actually different (one NFD, one NFC).

Finally, I would like to fix the situation:
1. Use a 7-Zip version on Windows that supports NFD-to-NFC conversion.
2. Use a 7-Zip version on OS X that supports NFD-to-NFC conversion.
(see http://devworld.apple.com/qa/qa2001/qa1235.html )

So I tried to modify the 7-Zip source code to convert filenames from UTF-8 NFD to UTF-8 NFC.
I reviewed p7zip's source code, but it is hard to find the places where the filename has to be converted.

Could you give me a hint for exploring the source code, or point me to the places I must modify?

Best regards,
ChangBeom Park

P.S. Sorry for my poor English. : )
I attached test files.

http://www.mediafire.com/file/5jnjgmyjfzi/7zip_testset.zip

- my p7zip - 2009-07-13
  
  > I had some problems with OS X's Unicode canonical equivalence.
  
  p7zip (Unix/MacOSX ...) handles Unicode with wchar_t as UCS32 (each character is 32 bits).
  But the 7z format uses UTF16.
  
  Now p7zip translates UCS32 to/from UTF16 with a simple "cast".
  
  It's correct only for values from 0 to 65535 ...
  
  According to http://en.wikipedia.org/wiki/UTF-16/UCS-2
  UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair.
  
  p7zip needs to be fixed, but I didn't have a valid sample.
  
  Is "test(winxp 7zip-9.04b).7z" a valid sample?
  
  Can you give me a picture of a valid display of the characters of this archive ?
  (Do your two pictures mix good characters and bad characters?)
  
  Remark :
  I don't understand "range u+AC00 through u+D7A3 are illegal"
  
  According to http://en.wikipedia.org/wiki/UTF-16/UCS-2
  To allow safe use of simple word-oriented string processing,
  separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first,
  most significant surrogate (marked brown) and 0xDC00-0xDFFF for the second, least significant surrogate (marked azure).
  
  The range U+AC00 through U+D7A3 is not in either of the two surrogate areas?!
  
  - Bulat Ziganshin - 2009-07-13
    
    The following code is used in FreeArc:
    
    // Converts UTF-8 string to UTF-16
    WCHAR *utf8_to_utf16 (const char *utf8, WCHAR *_utf16)
    {
    WCHAR *utf16 = _utf16;
    do {
        BYTE c = utf8[0];   UINT c32;
             if (c<=0x7F)   c32 = c;
        else if (c<=0xBF)   c32 = '?';
        else if (c<=0xDF)   c32 = ((c&0x1F) << 6) + (utf8[1]&0x3F), utf8++;
        else if (c<=0xEF)   c32 = ((c&0x0F) <<12) + ((utf8[1]&0x3F) << 6) + (utf8[2]&0x3F), utf8+=2;
        else                c32 = ((c&0x0F) <<18) + ((utf8[1]&0x3F) <<12) + ((utf8[2]&0x3F) << 6) + (utf8[3]&0x3F), utf8+=3;
    
        // Now c32 represents full 32-bit Unicode char
        if (c32 <= 0xFFFF) *utf16++ = c32;
        else                c32-=0x10000, *utf16++ = c32/0x400 + 0xd800, *utf16++ = c32%0x400 + 0xdc00;
    
    } while (*utf8++);
    return _utf16;
    }
    
    // Converts UTF-16 string to UTF-8
    char *utf16_to_utf8 (const WCHAR *utf16, char *_utf8)
    {
    char *utf8 = _utf8;
    do {
        UINT c = utf16[0];
        if (0xd800<=c && c<=0xdbff && 0xdc00<=utf16[1] && utf16[1]<=0xdfff)
          c = (c - 0xd800)*0x400 + (UINT)(*++utf16 - 0xdc00) + 0x10000;
    
        // Now c represents full 32-bit Unicode char
             if (c<=0x7F)   *utf8++ = c;
        else if (c<=0x07FF) *utf8++ = 0xC0|(c>> 6)&0x1F, *utf8++ = 0x80|(c>> 0)&0x3F;
        else if (c<=0xFFFF) *utf8++ = 0xE0|(c>>12)&0x0F, *utf8++ = 0x80|(c>> 6)&0x3F, *utf8++ = 0x80|(c>> 0)&0x3F;
        else                *utf8++ = 0xF0|(c>>18)&0x0F, *utf8++ = 0x80|(c>>12)&0x3F, *utf8++ = 0x80|(c>> 6)&0x3F, *utf8++ = 0x80|(c>> 0)&0x3F;
    
    } while (*utf16++);
    return _utf8;
    }
    
    The second part of the first function and the first part of the second one convert between UTF-16 and Unicode code points. I haven't tested it on real files, though.
    
- ChangBeom Park - 2009-07-14
  
  > Can you give me a picture of a valid display of the characters of this archive ?
  
  This one shows the correct visual form of the Korean Hangul syllables, like '테스트 폴더':
  http://img124.imageshack.us/img124/9309/picture2u.png
  
  The one below shows the incorrect visual form of Hangul, like 'ㅌㅔㅅㅡㅌㅡ ㅍㅗㄹㄷㅓ':
  http://img217.imageshack.us/img217/6034/picture1jra.png
  
  The former directory name (테스트 폴더) is the composed form of the latter name (ㅌㅔㅅㅡㅌㅡ ㅍㅗㄹㄷㅓ).
  Koreans write Hangul in the composed form.
  
  As you can see, ASCII and Latin-1 characters are displayed normally.
  Actually, only the ASCII characters have the same Unicode values; the others are different.
  So two Latin filenames that look the same (éàÇ◌̧äâÂÃ.txt) are shown in the same folder.
  http://img37.imageshack.us/img37/4467/picture1pio.png
  
  Imagine this situation:
  if you write u umlaut, do you write u and then write the two dots separately after the u?
  The answer is no. But that is how the Unicode decomposed form works.
  
  The problem is that Windows does not display decomposed Korean filenames the way it does decomposed Latin characters.
  I mean that both Korean names should be displayed in the former style.
  
  But other OSes don't use NFD, so converting NFD to NFC might be the better solution.
  
  > I don't understand "range u+AC00 through u+D7A3 are illegal"
  
  It means OS X's HFS+ file system is not allowed to store the composed form of Hangul characters.
  There is another Hangul block, U+1100 ~ U+11FF, that represents Hangul in decomposed form.
  HFS+ uses it instead of U+AC00 ~ U+D7A3.
  One Hangul syllable consists of two or three conjoining jamo. The U+1100 ~ U+11FF block is the conjoining jamo area.
  가 = ㄱ + ㅏ , 한 = ㅎ + ㅏ + ㄴ
  U+AC00 ~ U+D7A3 is the area of precomposed Hangul syllables. There are 11,172 such syllables.
  
  Apple doesn't use Unicode's NFD exactly; they use their own variant, also known as UTF-8-MAC.
  http://developer.apple.com/technotes/tn/tn1150.html#CanonicalDecomposition
  http://developer.apple.com/technotes/tn/tn1150table.html
  
  They have an API for converting between normalization forms: CFStringNormalize()
  http://developer.apple.com/qa/qa2001/qa1235.html
  
  So I want to change 7-Zip's filename header to NFC on OS X.
  
  I found part of a function that uses the filename, in
  "CPP/7zip/Archive/7z/7zOut.cpp":
  
      /* ---------- Names ---------- */
  
      int numDefined = 0;
      size_t namesDataSize = 0;
      for (int i = 0; i < db.Files.Size(); i++)
      {
        const UString &name = db.Files[i].Name;
        if (!name.IsEmpty())
          numDefined++;
        namesDataSize += (name.Length() + 1) * 2;
      }
  
  But if I convert that "db.Files[i].Name" to NFC using CFStringNormalize(), what unexpected problems might occur?
  I think it would cause problems when updating an existing archive, among other things.
  It is hard to tell how many places I would have to change.
  
  - my p7zip - 2009-07-14
    
    So OS X's HFS+ file system has its own rules.
    
    But we must find out if the Unix API on MacOSX follows the Unix rules or other rules.
    
    To read a file or a directory I use the Unix API :
    open/read/write/close
    opendir/readdir/closedir
    
    These functions use 8-bit characters.
    
    You said that, for example, "éàÇ◌̧äâÂÃ" will be encoded differently on Linux and MacOSX?
    
    I un-archived "test(winxp 7zip-9.04b).7z" on Ubuntu 9.04 (locale = UTF8)
    
    The filenames seem correct according to your image.
    
    I used the following program :
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #include <sys/types.h>
    #include <dirent.h>
    
    int main(void)
    {
    
        const char * a_dir = "basic test";
        DIR *dirp;
        struct dirent *dp;
    
        dirp = opendir(a_dir);
    
        if (dirp)
        {
            while ((dp = readdir(dirp)) != NULL)
            {
                size_t i,len = strlen(dp->d_name);
    
                printf("%s (%d)\n",dp->d_name,(int)len);
                printf("\t");
    
                for(i=0;i<len;i++)
                {
                    printf(" %02x",(unsigned)(dp->d_name[i] & 0xFF));
                }
    
                printf("\n");
    
            }
            closedir(dirp);
        }
    
        return 0;
    }
    
    compile and launch it :
    gcc listing.c
    ./a.out
    
    It displayed :
    
    테스트 폴더 (16)
         ed 85 8c ec 8a a4 ed 8a b8 20 ed 8f b4 eb 8d 94
    éàÇ◌̧äâÂÃ.txt (23)
         c3 a9 c3 a0 c3 87 e2 97 8c cc a7 c3 a4 c3 a2 c3 82 c3 83 2e 74 78 74
    ひらがな-カタカナ.txt (29)
         e3 81 b2 e3 82 89 e3 81 8c e3 81 aa 2d e3 82 ab e3 82 bf e3 82 ab e3 83 8a 2e 74 78 74
    English.txt (11)
         45 6e 67 6c 69 73 68 2e 74 78 74
    . (1)
         2e
    .DS_Store (9)
         2e 44 53 5f 53 74 6f 72 65
    .. (2)
         2e 2e
    똠방각하 펲시콜라 아햏햏.txt (39)
         eb 98 a0 eb b0 a9 ea b0 81 ed 95 98 20 ed 8e b2 ec 8b 9c ec bd 9c eb 9d bc 20 ec 95 84 ed 96 8f ed 96 8f 2e 74 78 74
    
    Please, Can you do the same on your MacOSX ?
    
    Remark :
    If it's different, could you give me a tar of "basic test" :
    tar cf test.tar "basic test"
    
    With this test.tar, I will be able to extract these files on a MacOSX with good filenames ;)
    With this good sample, I will be able to make some fixes and tests ...
    
    - Igor Pavlov - 2009-07-14
      
      - With this good sample, I will be able to make some fixes and tests ...
      
      The problem is complex.
      Different Unicode character sequences (short (Composed), long (Decomposed)) can look the same. MacOS uses Decomposed Unicode.
      
      Do we need to store Composed Unicode in .7z?
      That solves the Windows problems. But it requires Composed->Decomposed and Decomposed->Composed conversions in p7zip for Mac. I can't say now which places in the code need these conversions.
      
      If we store Decomposed Unicode in .7z, we need Decomposed->Composed code in the text output code (and maybe in the code that writes to the file system) in 7-Zip for any system that
      doesn't support Decomposed Unicode (like Windows).
      Can you confirm that some version of Linux supports Decomposed Unicode?
      At what point does Linux convert Decomposed->Composed?
      Is it when you write a file to disk or when you print a string to the screen?
      
      - my p7zip - 2009-07-14
        
        7z format use the UTF-16 as defined by Windows.
        
        We must not change the format !
        
        Now 7za/7z/7zr in p7zip are compiled without
        -DUNICODE -D_UNICODE
        
        I plan to add these 2 flags for p7zip.
        
        With this modification, I think that
        only the "fct(wchar_t *)" will be called.
        
        In FileDir.*, FileFind.*, FileIO.*
        p7zip will need to convert "char *" to/from "wchar_t *".
        
        For MacOSX, p7zip will use some special MacOSX functions.
        
        For other systems, p7zip will use the mbstowcs/wcstombs functions.
        (but It does not solve the case where unicode > 0XFFFF,
        p7zip need a real UCS32<->UTF16 conversion)
        
        Remark: on "not too old Linux", the "char *" is indeed UTF-8 (without Decomposed Unicode).
        
        But on other Unix systems, the "char *" is Latin-1 for Western Europe, and other locales
        for other countries ...
        
        I really think that the code of 7-zip/p7zip must be cleaned up
        so that wchar_t is used in all the code except in FileDir.*, FileFind.*, FileIO.*,
        where the Unix API uses "char *", or when p7zip needs to write to the screen.
        
        For the zip/tar formats, we should use libiconv (http://www.gnu.org/software/libiconv/)
        when the user gives the encoding.
        
        For example, a zip file built on Windows (Windows-1252)
        and extracted on Linux (with a UTF-8 locale or an ISO-8859-1 locale)
        would use a command like:
        7za x -lang=CP1252 archive.7z
        
      - Igor Pavlov - 2009-07-14
        
        - 7z format use the UTF-16 as defined by Windows.
        
        The problem is that there are several ways to write text in UTF-16:
        1) Composed (Windows / NTFS)
        2) Decomposed (HFS filesystem in macOS)
        So maybe it's simpler to convert MacOS name to Composed form after we get that name from Filesystem.
        
        - p7zip need a real UCS32<->UTF16 conversion
        
        I thing these symbols (n >= 0x10000) are rare.
        Did you see any real names with these characters?
        
      - my p7zip - 2009-07-14
        
        > So maybe it's simpler to convert MacOS name to Composed form after we get that name from Filesystem.
        
        That's the idea:
        use special MacOSX functions to convert UTF-16 to/from MacUTF8 when using
        I/O functions (fopen/open/opendir/mkdir/chmod/utimes/...) .
        
        > - p7zip need a real UCS32<->UTF16 conversion
        > I thing these symbols (n >= 0x10000) are rare.
        > Did you see any real names with these characters?
        
        No, I don't.
        
        I tried to create such filenames with a C program, but
        the filenames were not displayed correctly in the Windows Explorer
        or the Linux Explorer ...
        
- Igor Pavlov - 2009-07-14
  
  The conversion can create new problems. We must support name comparison for the update operation and so on. If we write C to .7z, then for "update" name comparison we must convert .7z/C to D or MacOS/D to C.
  
  Is there some reference code for D to C conversion for Korean?
  Why doesn't Windows support it? Did you check it in other versions of Windows (including Windows 7 RC)?
  And can Koreans read decomposed text? Or is it too unusual?
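  
  A minimal sketch of D to C conversion for Hangul only, assuming the name has already been decoded to an array of code points. It composes each L V [T] run of conjoining jamo using the arithmetic from the Unicode Standard and copies everything else unchanged. The function and macro names are hypothetical; the sketch does not handle Latin combining marks or a precomposed LV syllable followed by a T jamo, so it is not a full NFC implementation (on OS X, CFStringNormalize(), mentioned earlier in this thread, covers the general case).
  
  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>
  
  #define S_BASE  0xAC00u  /* first precomposed syllable */
  #define L_BASE  0x1100u  /* first leading consonant */
  #define V_BASE  0x1161u  /* first vowel */
  #define T_BASE  0x11A7u  /* one before the first trailing consonant */
  #define L_COUNT 19u
  #define V_COUNT 21u
  #define T_COUNT 28u
  
  static int is_l(uint32_t c) { return c >= L_BASE && c < L_BASE + L_COUNT; }
  static int is_v(uint32_t c) { return c >= V_BASE && c < V_BASE + V_COUNT; }
  static int is_t(uint32_t c) { return c > T_BASE && c < T_BASE + T_COUNT; }
  
  /* Compose L V [T] jamo runs in place and return the new length.
     The output never grows, so writing back into cp is safe. */
  size_t hangul_compose_in_place(uint32_t *cp, size_t len)
  {
      assert(cp != NULL || len == 0);
      size_t out = 0;
      size_t i = 0;
      while (i < len) {
          uint32_t c = cp[i++];
          if (is_l(c) && i < len && is_v(cp[i])) {
              /* leading consonant + vowel: start a syllable */
              c = S_BASE + ((c - L_BASE) * V_COUNT + (cp[i++] - V_BASE)) * T_COUNT;
              if (i < len && is_t(cp[i]))
                  c += cp[i++] - T_BASE;  /* optional trailing consonant */
          }
          cp[out++] = c;
      }
      return out;
  }
  ```
  
  For the "update" comparison discussed above, one option would be to run both the archive name and the file system name through the same conversion before comparing them.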
  
  - Igor Pavlov - 2009-07-14
    
    And a question about Mac applications: do you need to call some D->C function before sending text to the screen on a Mac?
    
- ChangBeom Park - 2009-07-14
  
  Thanks for your endless consideration. : )
  
  Actually, everything except the ASCII characters is encoded differently.
  
  In the case of Latin characters, the names may look the same in Windows Explorer, but they have different Unicode values (NFD and NFC forms). So you can see two filenames that look the same in Windows Explorer! : (
  http://img37.imageshack.us/img37/4467/picture1pio.png
  
  In Windows 7, I checked that NFD Hangul characters are displayed correctly. But the same situation remains: there are two visually identical filenames with different Unicode values.
  http://www.appleforum.com/attachment.php?attachmentid=29287&d=1247560956
  
  I ran some tests on OS X.
  OS X's HFS+ uses UTF-16 and its own NFD rules internally, so we don't need to care about that.
  OS X's BSD API layer uses the UTF-8-MAC encoding. As I said, it's a little bit different from Unicode's NFD in UTF-8.
  
  HFS+ allows only UTF-8-MAC filenames. All illegal filenames (including NFC Hangul characters) are converted according to Apple's NFD rules.
  If I extract on OS X an archive that was created on Windows, it produces the same NFD filenames as the ones I archived on OS X.
  So 7-Zip asks me what I want to do: overwrite, rename, and so on.
  
  But on Windows, it's different.
  As you know, Windows and Linux file systems don't enforce any restriction on filename normalization.
  So if I extract on Windows an archive created on OS X, it's possible to end up with filenames in both normalization forms.
  As I already said, 7-Zip asks whether to overwrite only when it extracts ASCII filenames.
  
  The result is two files on Windows that look the same but are actually different.
  What do you think about it?
  I think it's not a good idea.
  
  > Please, Can you do the same on your MacOSX ?
  
  http://www.mediafire.com/file/otwtjk3mmgn/result.txt
  
  . (1)
           2e
  .. (2)
           2e 2e
  .DS_Store (9)
           2e 44 53 5f 53 74 6f 72 65
  English.txt (11)
           45 6e 67 6c 69 73 68 2e 74 78 74
  éàÇ◌̧äâÂÃ.txt (30)
           65 cc 81 61 cc 80 43 cc a7 e2 97 8c cc a7 61 cc 88 61 cc 82 41 cc 82 41 cc 83 2e 74 78 74
  똠방각하 펲시콜라 아햏햏.txt (93)
           e1 84 84 e1 85 a9 e1 86 b7 e1 84 87 e1 85 a1 e1 86 bc e1 84 80 e1 85 a1 e1 86 a8 e1 84 92 e1 85 a1 20 e1 84 91 e1 85 a6 e1 87 81 e1 84 89 e1 85 b5 e1 84 8f e1 85 a9 e1 86 af e1 84 85 e1 85 a1 20 e1 84 8b e1 85 a1 e1 84 92 e1 85 a2 e1 87 82 e1 84 92 e1 85 a2 e1 87 82 2e 74 78 74
  테스트 폴더 (34)
           e1 84 90 e1 85 a6 e1 84 89 e1 85 b3 e1 84 90 e1 85 b3 20 e1 84 91 e1 85 a9 e1 86 af e1 84 83 e1 85 a5
  ひらがな-カタカナ.txt (32)
           e3 81 b2 e3 82 89 e3 81 8b e3 82 99 e3 81 aa 2d e3 82 ab e3 82 bf e3 82 ab e3 83 8a 2e 74 78 74
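  
  To compare this dump with the Ubuntu one at the code point level, a small UTF-8 decoder helps. The following is a minimal sketch; utf8_to_code_points is a hypothetical helper, not part of p7zip, and it does not reject overlong encodings or encoded surrogates.
  
  ```c
  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>
  
  /* Decode UTF-8 bytes into code points.
     Returns the number of code points written, or (size_t)-1 if the
     input is malformed or out is too small. */
  size_t utf8_to_code_points(const unsigned char *in, size_t in_len,
                             uint32_t *out, size_t out_cap)
  {
      assert(in != NULL && out != NULL);
      size_t n = 0;
      size_t i = 0;
      while (i < in_len) {
          unsigned char b = in[i++];
          uint32_t c;
          int extra;  /* number of continuation bytes expected */
  
          if (b < 0x80)                   { c = b;        extra = 0; }
          else if (b >= 0xC0 && b < 0xE0) { c = b & 0x1F; extra = 1; }
          else if (b >= 0xE0 && b < 0xF0) { c = b & 0x0F; extra = 2; }
          else if (b >= 0xF0 && b < 0xF8) { c = b & 0x07; extra = 3; }
          else return (size_t)-1;  /* stray continuation or invalid lead byte */
  
          while (extra-- > 0) {
              if (i >= in_len || (in[i] & 0xC0) != 0x80)
                  return (size_t)-1;  /* truncated or bad continuation byte */
              c = (c << 6) | (uint32_t)(in[i++] & 0x3F);
          }
          if (n >= out_cap)
              return (size_t)-1;
          out[n++] = c;
      }
      return n;
  }
  ```
  
  The first six bytes of the folder name above, e1 84 90 e1 85 a6, decode to U+1110 U+1166 (two conjoining jamo), while the Ubuntu dump's ed 85 8c decodes to the single precomposed syllable U+D14C (테).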
  
  >Remark :
  >If it's different, could you give me a tar of "basic test" :
  >tar cf test.tar "basic test"
  
  http://www.mediafire.com/file/qwo4qomtnjf/test.tar
  
  I know that Samba uses some code to convert NFD <-> NFC.
  
  http://sourcejam.com/jp/samba-3.0.25b/charset__macosxfs_8c-source.html
  
  And the latest Samba 3.4.0 charset_macosxfs.c file is here:
  http://pastebin.com/m8683674
  http://www.mediafire.com/file/nnxytoazezh/charset_macosxfs.c
  
  I don't know whether iconv on other platforms handles the UTF-8-MAC encoding.
  OS X's iconv supports it.
  
  I'm sorry that I'm not a good English speaker, so I will answer the remaining questions later. : )
  Please forgive me.
  
  Sincerely
  Chang-Beom Park.
  
  - my p7zip - 2009-07-14
    
    I will try to do something with UTF-8-MAC encoding.
    
    But now I need :
    - spare time
    - a MacOSX machine.
    
    So don't expect a fix in the near future :(
    
- ChangBeom Park - 2009-07-14
  
  For Windows, read the links below:
  http://blogs.msdn.com/michkap/archive/2004/11/29/271476.aspx
  http://msdn.microsoft.com/en-us/library/dd374126(VS.85).aspx
  
- ChangBeom Park - 2009-07-14
  
  > But now I need :
  > - spare time
  > - a MacOSX machine.
  
  How about using an SSH account on my Mac?
  If you want, I will give you one.
  It's a laptop, but I can leave it turned on all day. : )
  