
#23 Unicode surrogate pair handling

open
nobody
None
5
2014-12-31
2012-05-09
No

7z archives whose file names contain characters that require a surrogate pair are not handled correctly on Linux. For example, a file name of U+2004E is stored as U+004E ("N") in a 7z archive created on Linux. On Windows, U+2004E is correctly encoded into the header as a surrogate pair.

This problem is caused by a difference in wchar_t. On Linux, sizeof(wchar_t) is 4, so mbstowcs converts the UTF-8 encoding of U+2004E into a single wchar_t. However, the code that writes the 7z header assumes sizeof(wchar_t) is 2, so only the low 16 bits survive and U+2004E becomes U+004E. On Windows, sizeof(wchar_t) is 2 and U+2004E is already represented as a surrogate pair, so the problem does not occur there.
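The arithmetic behind the truncation can be sketched as follows. This is only an illustration, not code from p7zip: the helper name truncate_to_utf16_unit is hypothetical, and it assumes the header writer simply stores the low 16 bits of each 32-bit wchar_t value.

```cpp
#include <cstdint>

// Hypothetical helper (not part of p7zip): models what happens when a
// 32-bit wchar_t value is stored into a 16-bit UTF-16 code unit without
// any conversion. Only the low 16 bits are kept.
constexpr std::uint16_t truncate_to_utf16_unit(std::uint32_t code_point) {
    return static_cast<std::uint16_t>(code_point & 0xFFFFu);
}

// U+2004E loses its upper bits and turns into U+004E ("N").
static_assert(truncate_to_utf16_unit(0x2004Eu) == 0x004Eu,
              "U+2004E is truncated to U+004E");

// Code points inside the Basic Multilingual Plane are unaffected,
// which is why the problem only shows up for characters above U+FFFF.
static_assert(truncate_to_utf16_unit(0x9673u) == 0x9673u,
              "BMP code points survive unchanged");
```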

My patch adds a conversion step that emits a surrogate pair when necessary. With this patch, a correct 7z archive can be created on Linux for file names such as U+2004E.
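For reference, the standard UTF-16 encoding of a code point above U+FFFF looks roughly like the sketch below. It is not the attached patch: the names Utf16Units and encode_utf16 are hypothetical, and the sketch assumes its input is a valid Unicode scalar value (it does not reject values above U+10FFFF or lone surrogates).

```cpp
#include <cstdint>

// Hypothetical result type: one or two UTF-16 code units.
struct Utf16Units {
    std::uint16_t first;   // the only unit, or the high surrogate
    std::uint16_t second;  // the low surrogate, or 0 when count == 1
    int count;             // number of valid units (1 or 2)
};

// Hypothetical helper: encodes one code point as UTF-16.
// Assumes cp is a valid Unicode scalar value; no validation is done.
constexpr Utf16Units encode_utf16(std::uint32_t cp) {
    return cp < 0x10000u
        // BMP code point: fits in a single 16-bit unit.
        ? Utf16Units{static_cast<std::uint16_t>(cp), 0, 1}
        // Supplementary code point: subtract 0x10000, then split the
        // remaining 20 bits into a high and a low surrogate.
        : Utf16Units{
              static_cast<std::uint16_t>(0xD800u + ((cp - 0x10000u) >> 10)),
              static_cast<std::uint16_t>(0xDC00u + ((cp - 0x10000u) & 0x3FFu)),
              2};
}

// U+2004E is encoded as the surrogate pair D840 DC4E.
static_assert(encode_utf16(0x2004Eu).count == 2, "two code units");
static_assert(encode_utf16(0x2004Eu).first == 0xD840u, "high surrogate");
static_assert(encode_utf16(0x2004Eu).second == 0xDC4Eu, "low surrogate");
```

A converter built on this idea would walk the 4-byte wchar_t string and append one or two 16-bit units per character before the header is written.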

Discussion

  • ChangZhuo Chen (陳昌倬)

    Patch for unicode surrogate pair handling

     
  • ChangZhuo Chen (陳昌倬)

    sample archive

     
  • my p7zip

    my p7zip - 2014-12-31

    Thank you for your patch.

    The next version of p7zip will handle Unicode file names that contain surrogate pairs.

     
