UTF-8 Unicode normalization problem in OS X

2009-07-13
2012-12-08
  • ChangBeom Park

    ChangBeom Park - 2009-07-13

    Hello

    I ran into some problems with Unicode canonical equivalence on OS X.
    ( http://unicode.org/reports/tr15/#Introduction )

    A Hangul syllable consists of choseong (L), jungseong (V) and jongseong (T), or of just choseong (L) and jungseong (V).
    So 각 consists of ㄱ ㅏ ㄱ.
    The important point is that the former (각) is the visually correct way to write Hangul.
    In Unicode there are two ways of storing such characters: decomposed and precomposed.
    각 is the precomposed form; ㄱㅏㄱ is the decomposed form.

    The HFS+ filesystem internally uses the Unicode 3.2 character set in UTF-16 encoding.
    It also uses a canonical decomposition form (specifically, Apple's own variant of Normalization Form D).
    http://developer.apple.com/technotes/tn/tn1150.html#CanonicalDecomposition says:

    "In addition, the Korean Hangul characters with codes in the range u+AC00 through u+D7A3 are illegal and must be replaced with the equivalent sequence of conjoining jamos, as described in the Unicode 2.0 book, section 3.10."

    That means HFS+ stores Hangul in decomposed form, like ㄱㅏㄱ, not as 각.
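
    To make the decomposition concrete, here is a minimal sketch in C of the arithmetic that maps a precomposed syllable to its conjoining jamo. It assumes the Hangul syllable constants from the Unicode Standard; the function name hangul_decompose is made up for illustration and is not part of 7zip or of any Apple API. It covers Hangul only, not the other characters that HFS+ decomposes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants of the Unicode Hangul syllable arithmetic. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,   /* 588 */
    S_COUNT = 19 * N_COUNT         /* 11172 precomposed syllables */
};

/* Hypothetical helper: decompose one precomposed syllable (U+AC00..U+D7A3)
 * into 2 or 3 conjoining jamo. Returns the number of code points written
 * to out, or 0 if cp is not a precomposed Hangul syllable. */
size_t hangul_decompose(uint32_t cp, uint32_t out[3])
{
    assert(out != NULL);
    if (cp < S_BASE || cp >= S_BASE + S_COUNT)
        return 0;

    uint32_t s = cp - S_BASE;
    uint32_t t = s % T_COUNT;                    /* jongseong index, 0 = none */

    out[0] = L_BASE + s / N_COUNT;               /* choseong (L)  */
    out[1] = V_BASE + (s % N_COUNT) / T_COUNT;   /* jungseong (V) */
    if (t == 0)
        return 2;
    out[2] = T_BASE + t;                         /* jongseong (T) */
    return 3;
}
```

    For example, 각 (U+AC01) yields U+1100 U+1161 U+11A8, which is what HFS+ stores instead of the single precomposed code point.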

    But other OSes (everything except OS X) generally use Normalization Form C (NFC), which means they work with precomposed characters.

    So when I share files with Windows users, the problem I described appears.
    First I zipped some files with Korean filenames using p7zip-9.04b on OS X; the filenames were stored in NFD.
    Then someone unzipped the archive on Windows, and the filenames came out as decomposed characters.
    Some Latin characters, such as u-umlaut, are decomposed on OS X in the same way.

    http://img37.imageshack.us/img37/4467/picture1pio.png

    http://img268.imageshack.us/img268/1/picture2zeu.png

    You can see the uncomposed Korean Hangul. Ironically, the decomposed Latin letters are displayed correctly in both forms on Windows, so Explorer shows two Latin filenames that look the same (they are actually different: one is NFD, the other NFC).

    Finally, I would like to fix this situation. There are two options:
    1. Use a 7zip version on Windows that supports NFD-to-NFC conversion.
    2. Use a 7zip version on OS X that supports NFD-to-NFC conversion.
    (using http://devworld.apple.com/qa/qa2001/qa1235.html )

    So I tried to modify the 7zip source code to support normalization form conversion from UTF-8 NFD to UTF-8 NFC.
    I reviewed p7zip's source code, but it is too hard to find the place where the filename has to be converted.

    Could you give me a hint for exploring the source code, or tell me which points I must modify?

    Best regards,
    ChangBeom Park

    P.S. Sorry for my poor English. : )
    I attached test files.

    http://www.mediafire.com/file/5jnjgmyjfzi/7zip_testset.zip

     
    • my p7zip

      my p7zip - 2009-07-13

      > I had some problems with OS X's Unicode canonical equivalence. 

      p7zip (Unix/MacOSX ...) handles Unicode with wchar_t as UCS32 (each character is 32 bits).
      But the 7z format uses UTF16.

      Currently p7zip translates UCS32 to/from UTF16 with a simple "cast".

      That is correct only for values from 0 to 65535 ...

      According to http://en.wikipedia.org/wiki/UTF-16/UCS-2
      UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair.

      p7zip needs to be fixed, but I did not have a valid sample.

      Is "test(winxp 7zip-9.04b).7z" a valid sample ?

      Can you give me a picture of a valid display of the characters of this archive ?
      (your 2 pictures mix good characters and bad characters ?)

      Remark :
      I don't understand "range u+AC00 through u+D7A3 are illegal"

      According to http://en.wikipedia.org/wiki/UTF-16/UCS-2
      To allow safe use of simple word-oriented string processing,
      separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first,
      most significant surrogate (marked brown) and 0xDC00-0xDFFF for the second, least significant surrogate (marked azure).

      The range u+AC00 through u+D7A3 is not in either of the two surrogate areas ?!

       
      • Bulat Ziganshin

        Bulat Ziganshin - 2009-07-13

        The following code is used in FreeArc:

        // Converts UTF-8 string to UTF-16
        WCHAR *utf8_to_utf16 (const char *utf8, WCHAR *_utf16)
        {
          WCHAR *utf16 = _utf16;
          do {
            BYTE c = utf8[0];   UINT c32;
                 if (c<=0x7F)   c32 = c;
            else if (c<=0xBF)   c32 = '?';
            else if (c<=0xDF)   c32 = ((c&0x1F) << 6) +  (utf8[1]&0x3F),  utf8++;
            else if (c<=0xEF)   c32 = ((c&0x0F) <<12) + ((utf8[1]&0x3F) << 6) +  (utf8[2]&0x3F),  utf8+=2;
            else                c32 = ((c&0x0F) <<18) + ((utf8[1]&0x3F) <<12) + ((utf8[2]&0x3F) << 6) + (utf8[3]&0x3F),  utf8+=3;

            // Now c32 represents full 32-bit Unicode char
            if (c32 <= 0xFFFF)  *utf16++ = c32;
            else                c32-=0x10000, *utf16++ = c32/0x400 + 0xd800, *utf16++ = c32%0x400 + 0xdc00;

          } while (*utf8++);
          return _utf16;
        }

        // Converts UTF-16 string to UTF-8
        char *utf16_to_utf8 (const WCHAR *utf16, char *_utf8)
        {
          char *utf8 = _utf8;
          do {
            UINT c = utf16[0];
            if (0xd800<=c && c<=0xdbff && 0xdc00<=utf16[1] && utf16[1]<=0xdfff)
              c = (c - 0xd800)*0x400 + (UINT)(*++utf16 - 0xdc00) + 0x10000;

            // Now c represents full 32-bit Unicode char
                 if (c<=0x7F)   *utf8++ = c;
            else if (c<=0x07FF) *utf8++ = 0xC0|(c>> 6)&0x1F,  *utf8++ = 0x80|(c>> 0)&0x3F;
            else if (c<=0xFFFF) *utf8++ = 0xE0|(c>>12)&0x0F,  *utf8++ = 0x80|(c>> 6)&0x3F,  *utf8++ = 0x80|(c>> 0)&0x3F;
            else                *utf8++ = 0xF0|(c>>18)&0x0F,  *utf8++ = 0x80|(c>>12)&0x3F,  *utf8++ = 0x80|(c>> 6)&0x3F,  *utf8++ = 0x80|(c>> 0)&0x3F;

          } while (*utf16++);
          return _utf8;
        }

        The second part of the first function and the first part of the second one convert between UTF-16 and Unicode code points. I haven't tested it on real files, though.

         
    • ChangBeom Park

      ChangBeom Park - 2009-07-14

      > Can you give me a picture of a valid display of the characters of this archive ?

      The one with the correct visual form of Korean Hangul syllables, like '테스트 폴더':
      http://img124.imageshack.us/img124/9309/picture2u.png

      The one below shows the incorrect visual form of Hangul, like 'ㅌㅔㅅㅡㅌㅡ ㅍㅗㄹㄷㅓ':
      http://img217.imageshack.us/img217/6034/picture1jra.png

      The former directory name (테스트 폴더) is the composed form of the latter name (ㅌㅔㅅㅡㅌㅡ ㅍㅗㄹㄷㅓ).
      Koreans write Hangul in the composed way.

      As you can see, ASCII and Latin-1 characters are displayed normally.
      Actually, only the ASCII characters have the same Unicode values in both forms; the others are different.
      So two Latin filenames that look the same (éàÇ◌̧äâÂÃ.txt) are shown in the same folder.
      http://img37.imageshack.us/img37/4467/picture1pio.png

      Imagine this situation:
      when you write u-umlaut, do you write u and then write the two dots separately after the u character?
      The answer is no, but that is how the Unicode decomposed form works.

      The problem is that Windows does not display decomposed Korean filenames the way it does decomposed Latin characters.
      I mean that both Korean names should be displayed in the former style.

      But other OSes don't use NFD, so converting NFD to NFC might be the better solution.

      > I don't understand "range u+AC00 through u+D7A3 are illegal"

      It means that OS X's HFS+ file system is not allowed to store the composed form of Hangul characters.
      There is another Hangul block, U+1100 ~ U+11FF, that represents Hangul in the decomposed way.
      HFS+ uses it instead of U+AC00 ~ U+D7A3.
      One Hangul syllable consists of two or three conjoining jamo. The U+1100 ~ U+11FF block is the conjoining jamo area.
      가 = ㄱ + ㅏ , 한 = ㅎ + ㅏ + ㄴ
      U+AC00 ~ U+D7A3 is the area of precomposed Hangul syllables. There are 11172 syllables in total.

      Apple doesn't use Unicode's NFD exactly; they use their own variant, also known as UTF-8-MAC.
      http://developer.apple.com/technotes/tn/tn1150.html#CanonicalDecomposition
      http://developer.apple.com/technotes/tn/tn1150table.html

      They have an API for converting between normalization forms: CFStringNormalize()
      http://developer.apple.com/qa/qa2001/qa1235.html

      So I want to convert the filenames in 7zip's header to NFC on OS X.

      I found a part of the code that uses the filename:
      "CPP/7zip/Archive/7z/7zOut.cpp"

          /* ---------- Names ---------- */

          int numDefined = 0;
          size_t namesDataSize = 0;
          for (int i = 0; i < db.Files.Size(); i++)
          {
            const UString &name = db.Files[i].Name;
            if (!name.IsEmpty())
              numDefined++;
            namesDataSize += (name.Length() + 1) * 2;
          }

      But if I convert "db.Files[i].Name" to NFC using CFStringNormalize(), what unexpected problems could happen?
      I think it would cause problems when updating an existing archive, among other cases.
      It is hard to tell how many places I would have to change.

       
      • my p7zip

        my p7zip - 2009-07-14

        So OS X's HFS+ file system has its own rules.

        But we must find out if the Unix API on MacOSX follows the Unix rules or other rules.

        To read a file or a directory I use the Unix API :
        open/read/write/close
        opendir/readdir/closedir

        These functions use 8-bit characters.

        You said that, for example, "éàÇ◌̧äâÂÃ" will be encoded differently on Linux and MacOSX ?

        I un-archived  "test(winxp 7zip-9.04b).7z" on Ubuntu 9.04 (locale = UTF8)

        The filenames seem correct according to your image.

        I used the following program :

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #include <sys/types.h>
        #include <dirent.h>

        int main(void)
        {

            const char * a_dir = "basic test";
            DIR *dirp;
            struct dirent *dp;

            dirp = opendir(a_dir);

            if (dirp)
            {
                while ((dp = readdir(dirp)) != NULL)
                {
                    size_t i,len = strlen(dp->d_name);

                    printf("%s (%d)\n",dp->d_name,(int)len);
                    printf("\t");

                    for(i=0;i<len;i++)
                    {
                        printf(" %02x",(unsigned)(dp->d_name[i] & 0xFF));
                    }

                    printf("\n");

                }
                closedir(dirp);
            }

            return 0;
        }

        compile and launch it :
        gcc listing.c
        ./a.out

        It displayed :

        테스트 폴더 (16)
             ed 85 8c ec 8a a4 ed 8a b8 20 ed 8f b4 eb 8d 94
        éàÇ◌̧äâÂÃ.txt (23)
             c3 a9 c3 a0 c3 87 e2 97 8c cc a7 c3 a4 c3 a2 c3 82 c3 83 2e 74 78 74
        ひらがな-カタカナ.txt (29)
             e3 81 b2 e3 82 89 e3 81 8c e3 81 aa 2d e3 82 ab e3 82 bf e3 82 ab e3 83 8a 2e 74 78 74
        English.txt (11)
             45 6e 67 6c 69 73 68 2e 74 78 74
        . (1)
             2e
        .DS_Store (9)
             2e 44 53 5f 53 74 6f 72 65
        .. (2)
             2e 2e
        똠방각하 펲시콜라 아햏햏.txt (39)
             eb 98 a0 eb b0 a9 ea b0 81 ed 95 98 20 ed 8e b2 ec 8b 9c ec bd 9c eb 9d bc 20 ec 95 84 ed 96 8f ed 96 8f 2e 74 78 74

        Please, can you do the same on your MacOSX ?

        Remark :
        If it's different, could you give me a tar of "basic test" :
        tar cf test.tar  "basic test"

        With this test.tar, I will be able to extract these files on a MacOSX with good filenames ;)
        With this good sample, I will be able to make some fixes and tests ...

         
        • Igor Pavlov

          Igor Pavlov - 2009-07-14

          - With this good sample, I will be able to make some fixes and tests ...

          The problem is complex.
          Different Unicode character sequences (short (Composed) and long (Decomposed)) can look the same. MacOS uses Decomposed Unicode.

          Do we need to store Composed Unicode in .7z?
          It solves the Windows problems. But it requires Composed->Decomposed and Decomposed->Composed conversions in p7zip for Mac. I can't say now in which places in the code these conversions are required.

          If we store Decomposed Unicode in .7z, we need Decomposed->Composed code in the text output code (and maybe in the code that writes to the file system) in 7-Zip for any system that doesn't support Decomposed Unicode (like Windows).
          Can you confirm that some version of Linux supports Decomposed Unicode?
          At what point does Linux convert Decomposed->Composed?
          Is it when you write a file to disk or when you print a string to the screen?

           
          • my p7zip

            my p7zip - 2009-07-14

            The 7z format uses UTF-16 as defined by Windows.

            We must not change the format !

            Currently 7za/7z/7zr in p7zip are compiled without
                -DUNICODE  -D_UNICODE

            I plan to add these 2 flags for p7zip.

            With this modification, I think that
            only the "fct(wchar_t *)" will be called.

            In FileDir.*, FileFind.*, FileIO.*
            p7zip will need to convert "char *" to/from "wchar_t *".

            For MacOSX, p7zip will use some special MacOSX functions.

            For other systems, p7zip will use the mbstowcs/wcstombs functions.
            (but this does not solve the case of code points > 0xFFFF;
            p7zip needs a real UCS32<->UTF16 conversion)

            Remark : on "not too old Linux", the "char *" is indeed UTF8 (without Decomposed Unicode).

            But on other Unix systems, the "char *" is latin1 for Western Europe, and other encodings
            for other countries ...

            I really think that the code of 7-zip/p7zip must be cleaned up
            in order to have wchar_t in all the code except in FileDir.*, FileFind.*, FileIO.*
            where the Unix API uses "char *" or when p7zip needs to write to the screen.

            For zip/tar format, we should use libiconv (http://www.gnu.org/software/libiconv/)
            when the user gives the encoding.

            For example, a zip file built on Windows (Windows-1252)
            and extracted on Linux (with a UTF8 locale or ISO-8859-1 locale)
            would use a command like :
            7za x -lang=CP1252 archive.7z
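
            On the code side, such a conversion through iconv could look roughly like the sketch below. iconv_open(), iconv() and iconv_close() are the standard POSIX calls; the wrapper name name_to_utf8 and the idea of passing the user's encoding name (for example "CP1252") straight through are assumptions for illustration, and which encoding names are accepted depends on the platform's iconv.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: convert a NUL-terminated name from the encoding
 * `from` to UTF-8. Returns 0 on success, -1 on failure (unknown encoding,
 * invalid input, or output buffer too small). */
int name_to_utf8(const char *from, const char *name, char *out, size_t outsize)
{
    assert(from != NULL && name != NULL && out != NULL && outsize > 0);

    iconv_t cd = iconv_open("UTF-8", from);   /* (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)name;        /* iconv() takes a non-const pointer */
    size_t inleft = strlen(name);
    char *outp = out;
    size_t outleft = outsize - 1;    /* keep room for the terminator */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;

    *outp = '\0';
    return 0;
}
```

            Note that this only changes the byte encoding. It does not compose or decompose anything, so the NFD/NFC question for MacOSX stays a separate step.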

             
            • Igor Pavlov

              Igor Pavlov - 2009-07-14

              - 7z format use the UTF-16 as defined by Windows.

              The problem is that there are several ways to write text in UTF-16:
              1) Composed (Windows / NTFS)
              2) Decomposed (HFS filesystem in macOS)
              So maybe it's simpler to convert the MacOS name to Composed form after we get that name from the file system.

              - p7zip need a real UCS32<->UTF16 conversion

              I think these symbols (n >= 0x10000) are rare.
              Have you seen any real names with these characters?

               
              • my p7zip

                my p7zip - 2009-07-14

                > So maybe it's simpler to convert MacOS name to Composed form after we get that name from Filesystem.

                That's the idea :
                use special MacOSX functions to convert UTF-16 to/from MacUTF8 when using
                I/O functions (fopen/open/opendir/mkdir/chmod/utimes/...) .

                > - p7zip need a real UCS32<->UTF16 conversion
                > I think these symbols (n >= 0x10000) are rare.
                > Did you see any real names with these characters?

                No, I haven't.

                I tried to create such filenames with a C program, but
                the filenames were not displayed correctly in the Windows Explorer
                or the Linux Explorer ...

                 
    • Igor Pavlov

      Igor Pavlov - 2009-07-14

      The conversion can create new problems. We must support name comparison for the update operation and so on. If we write C to .7z, then for "update" name comparison we must convert .7z/C to D or MacOS/D to C.

      Is there some reference code for D to C conversion for Korean?
      Why doesn't Windows support it? Did you check it in other versions of Windows (including Windows 7 RC)?
      And can Koreans read decomposed text? Or is it too unusual?
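
      For Hangul specifically, D to C conversion is pure arithmetic and needs no tables. The sketch below follows the Hangul syllable composition arithmetic from the Unicode Standard and composes conjoining jamo in place in an array of code points. The function name hangul_compose is hypothetical, and this is not a full NFC implementation: Latin combining marks would still need the Unicode composition tables or a system API such as CFStringNormalize().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants of the Unicode Hangul syllable arithmetic. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
    S_COUNT = L_COUNT * V_COUNT * T_COUNT   /* 11172 */
};

/* Hypothetical helper: compose L+V and LV+T sequences in place.
 * s holds len code points; returns the new (shorter or equal) length. */
size_t hangul_compose(uint32_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        uint32_t c = s[i];
        if (out > 0) {
            uint32_t last = s[out - 1];

            /* choseong + jungseong -> LV syllable */
            if (last >= L_BASE && last < L_BASE + L_COUNT &&
                c >= V_BASE && c < V_BASE + V_COUNT) {
                s[out - 1] = S_BASE +
                    ((last - L_BASE) * V_COUNT + (c - V_BASE)) * T_COUNT;
                continue;
            }
            /* LV syllable + jongseong -> LVT syllable */
            if (last >= S_BASE && last < S_BASE + S_COUNT &&
                (last - S_BASE) % T_COUNT == 0 &&
                c > T_BASE && c < T_BASE + T_COUNT) {
                s[out - 1] = last + (c - T_BASE);
                continue;
            }
        }
        s[out++] = c;   /* anything else is copied unchanged */
    }
    return out;
}
```

      With this, U+1100 U+1161 U+11A8 becomes the single code point U+AC01 (각), and non-Hangul code points pass through untouched.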

       
      • Igor Pavlov

        Igor Pavlov - 2009-07-14

        And a question about Mac applications: do you need to call some D->C function before sending text to the screen on Mac?

         
    • ChangBeom Park

      ChangBeom Park - 2009-07-14

      Thanks for your endless consideration.  : )

      Actually, everything except ASCII is encoded differently.

      Latin characters may look the same in Windows Explorer, but they have different Unicode values in NFD and NFC form. So you can see two filenames that look the same in Windows Explorer! : (
      http://img37.imageshack.us/img37/4467/picture1pio.png

      On Windows 7, I checked that NFD Hangul characters are displayed correctly. But the situation remains the same: there are two visually identical filenames with different Unicode values.
      http://www.appleforum.com/attachment.php?attachmentid=29287&d=1247560956

      I did some tests on OS X.
      OS X's HFS+ uses UTF-16 and its own NFD rules internally, so we don't need to care about that.
      OS X's BSD API layer uses the UTF-8-MAC encoding. As I already said, it is slightly different from Unicode's NFD in UTF-8.

      HFS+ only allows UTF-8-MAC filenames. All illegal filenames (including NFC Hangul characters) are converted according to Apple's NFD rules.
      If I extract, on OS X, an archive that was created on Windows, it produces the same NFD filenames as the ones I archived on OS X.
      So 7zip asks me what I want to do: overwrite, rename, and so on.

      But on Windows it's different.
      As you know, the Windows and Linux file systems don't enforce any restriction on filename normalization.
      So if I extract, on Windows, an archive created on OS X, it's possible to end up with filenames in both normalization forms.
      7zip only asks whether to overwrite when it extracts ASCII filenames, as I already said.

      The result is two files on Windows that look the same but are actually different.
      What do you think about it?
      I think it's not a good idea.

      > Please, Can you do the same on your MacOSX ?

      http://www.mediafire.com/file/otwtjk3mmgn/result.txt

      . (1)
               2e
      .. (2)
               2e 2e
      .DS_Store (9)
               2e 44 53 5f 53 74 6f 72 65
      English.txt (11)
               45 6e 67 6c 69 73 68 2e 74 78 74
      éàÇ◌̧äâÂÃ.txt (30)
               65 cc 81 61 cc 80 43 cc a7 e2 97 8c cc a7 61 cc 88 61 cc 82 41 cc 82 41 cc 83 2e 74 78 74
      똠방각하 펲시콜라 아햏햏.txt (93)
               e1 84 84 e1 85 a9 e1 86 b7 e1 84 87 e1 85 a1 e1 86 bc e1 84 80 e1 85 a1 e1 86 a8 e1 84 92 e1 85 a1 20 e1 84 91 e1 85 a6 e1 87 81 e1 84 89 e1 85 b5 e1 84 8f e1 85 a9 e1 86 af e1 84 85 e1 85 a1 20 e1 84 8b e1 85 a1 e1 84 92 e1 85 a2 e1 87 82 e1 84 92 e1 85 a2 e1 87 82 2e 74 78 74
      테스트 폴더 (34)
               e1 84 90 e1 85 a6 e1 84 89 e1 85 b3 e1 84 90 e1 85 b3 20 e1 84 91 e1 85 a9 e1 86 af e1 84 83 e1 85 a5
      ひらがな-カタカナ.txt (32)
               e3 81 b2 e3 82 89 e3 81 8b e3 82 99 e3 81 aa 2d e3 82 ab e3 82 bf e3 82 ab e3 83 8a 2e 74 78 74
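
      These byte dumps are easier to compare as code points. As a rough aid, here is a minimal UTF-8 decoder sketch; the name utf8_decode is hypothetical, and it only checks the shape of each sequence (it does not reject overlong forms or surrogates), so it is meant for inspecting dumps like these rather than for validating input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode up to cap code points from len bytes of
 * UTF-8. Malformed or truncated sequences become U+FFFD.
 * Returns the number of code points written to out. */
size_t utf8_decode(const unsigned char *in, size_t len,
                   uint32_t *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t n = 0, i = 0;

    while (i < len && n < cap) {
        unsigned char c = in[i++];
        uint32_t cp;
        size_t extra;   /* continuation bytes still expected */

        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }

        /* Fold in continuation bytes of the form 10xxxxxx. */
        while (extra > 0 && i < len && (in[i] & 0xC0) == 0x80) {
            cp = (cp << 6) | (uint32_t)(in[i] & 0x3F);
            i++;
            extra--;
        }
        if (extra > 0)
            cp = 0xFFFD;   /* sequence ended too early */

        out[n++] = cp;
    }
    return n;
}
```

      Decoding the first bytes of the OS X listing for 테스트 폴더 (e1 84 90 e1 85 a6) gives U+1110 U+1166, two conjoining jamo, while the first bytes of the Ubuntu listing (ed 85 8c) give the single precomposed syllable U+D14C.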

      >Remark :
      >If it's different, could you give me a tar of "basic test" :
      >tar cf test.tar "basic test"

      http://www.mediafire.com/file/qwo4qomtnjf/test.tar

      I know that SAMBA has some code to convert NFD <-> NFC.

      http://sourcejam.com/jp/samba-3.0.25b/charset__macosxfs_8c-source.html

      The charset_macosxfs.c file from the latest SAMBA, 3.4.0, is here:
      http://pastebin.com/m8683674
      http://www.mediafire.com/file/nnxytoazezh/charset_macosxfs.c

      I don't know whether iconv on other platforms handles the UTF-8-MAC encoding.
      The one on OS X supports it.

      I'm sorry that I'm not a good English speaker, so I will answer the remaining questions later. : )
      Forgive my foolishness.

      Sincerely
      Chang-Beom Park.

       
      • my p7zip

        my p7zip - 2009-07-14

        I will try to do something with UTF-8-MAC encoding.

        But now I need :
        - spare time
        - a MacOSX machine.

        So don't expect a fix in the near future :(

         
    • ChangBeom Park

      ChangBeom Park - 2009-07-14

      > But now I need :
      > - spare time
      > - a MacOSX machine.

      How about using an SSH account on my Mac?
      If you want, I will give you one.
      It is a laptop, but I can keep it turned on all day. : )

       
