#23 Unicode surrogate pair handling


7z archives containing Unicode file names that require a surrogate pair are not handled correctly on Linux. For example, a file name consisting of U+2004E is stored as U+004E ("N") in a 7z archive created on Linux. On Windows, however, U+2004E is correctly encoded into the header as a surrogate pair.

This problem is caused by a difference in wchar_t. On Linux, sizeof(wchar_t) is 4, and mbstowcs converts the UTF-8 encoding of U+2004E into a single wchar_t. However, the code that writes the 7z header assumes sizeof(wchar_t) is 2, so U+2004E is truncated to U+004E. On Windows, sizeof(wchar_t) is 2 and U+2004E is already represented as a surrogate pair, so the problem does not occur there.
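The truncation can be reproduced in isolation. The following is only a sketch of the effect, not the actual p7zip header-writing code; NaiveToUtf16Unit is a hypothetical helper that stands in for any code path that stores a 4-byte wide character into one 16-bit unit.

```cpp
#include <cstdint>

// Hypothetical helper (not from p7zip): stores one code point into a single
// 16-bit unit, as code assuming sizeof(wchar_t) == 2 effectively does.
constexpr std::uint16_t NaiveToUtf16Unit(char32_t codePoint) {
    // The cast keeps only the low 16 bits; the upper bits are silently lost.
    return static_cast<std::uint16_t>(codePoint);
}

// U+2004E loses its high bits and collapses to U+004E ("N").
static_assert(NaiveToUtf16Unit(0x2004E) == 0x004E, "U+2004E is truncated");
// Code points inside the Basic Multilingual Plane are unaffected.
static_assert(NaiveToUtf16Unit(0x4E2D) == 0x4E2D, "BMP code point survives");
```

This is why only file names outside the Basic Multilingual Plane are affected.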

My patch adds another conversion step that emits a surrogate pair when necessary. With this patch, Linux can create a correct 7z archive containing a file name such as U+2004E.
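For readers without the attachment, the kind of conversion involved can be sketched as follows. This is the standard UTF-16 encoding rule, not the patch itself; the function name EncodeUtf16 is an assumption made for illustration, and the sketch does not validate its input (for example, it does not reject values above U+10FFFF).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not the code from the patch): encodes one Unicode
// code point as one or two UTF-16 code units.
std::vector<std::uint16_t> EncodeUtf16(char32_t codePoint) {
    std::vector<std::uint16_t> units;
    if (codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
        // Supplementary plane: split the 20-bit offset into two 10-bit halves.
        const char32_t offset = codePoint - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (offset >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF))); // low surrogate
    } else {
        // Basic Multilingual Plane: a single code unit is enough.
        units.push_back(static_cast<std::uint16_t>(codePoint));
    }
    return units;
}
```

With this rule, EncodeUtf16(0x2004E) yields the two units 0xD840 and 0xDC4E, while a BMP character such as U+004E still yields a single unit.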


  • ChangZhuo Chen

    ChangZhuo Chen - 2012-05-09

    Patch for unicode surrogate pair handling

  • ChangZhuo Chen

    ChangZhuo Chen - 2012-05-09

    sample archive

  • my p7zip

    my p7zip - 2014-12-31

    Thank you for your patch.

    The next version of p7zip will handle Unicode file names with surrogate pairs.

