#23 Unicode surrogate pair handling

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2014-12-31

Created: 2012-05-09

Creator: ChangZhuo Chen (陳昌倬)

Private: No

7z does not handle Unicode file names that require a surrogate pair correctly on Linux. For example, a file named U+2004E is stored as U+004E ("N") in a 7z archive created on Linux. On Windows, the same name is correctly encoded in the header as a surrogate pair.

The problem is caused by a difference in the size of wchar_t. On Linux, sizeof(wchar_t) is 4, so mbstowcs converts the UTF-8 encoding of U+2004E into a single wchar_t. However, the code that writes the 7z header assumes sizeof(wchar_t) is 2, so only the low 16 bits are kept and U+2004E becomes U+004E. On Windows, sizeof(wchar_t) is 2 and U+2004E is already represented as a surrogate pair, so the problem does not occur.
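The truncation can be illustrated with a minimal sketch. The helper name TruncateTo16Bits is hypothetical and is not taken from the p7zip sources; it only mimics a writer that stores each wide character as one 16-bit unit.

```cpp
#include <cstdint>

// Hypothetical helper (not p7zip code): mimics a header writer that
// stores each wide character as a single 16-bit unit and silently
// drops everything above bit 15.
std::uint16_t TruncateTo16Bits(char32_t codePoint)
{
    return static_cast<std::uint16_t>(codePoint & 0xFFFF);
}
```

Under this assumption, TruncateTo16Bits(0x2004E) yields 0x004E, which is the "N" seen in the archive, while code points inside the Basic Multilingual Plane pass through unchanged.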

My patch adds another conversion step that inserts surrogate pairs when necessary. With this patch, p7zip on Linux creates a correct 7z archive for file names such as U+2004E.
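The following is only a sketch of the general UTF-32 to UTF-16 technique, not the contents of unicode.patch. The function name EncodeUtf16 and the choice to replace out-of-range values with U+FFFD are assumptions made for illustration.

```cpp
#include <string>

// Sketch of a UTF-32 -> UTF-16 conversion (assumed helper, not the
// actual patch). Code points above U+FFFF are split into a high and a
// low surrogate; everything else is copied as one 16-bit unit.
std::u16string EncodeUtf16(const std::u32string &src)
{
    std::u16string dst;
    for (char32_t c : src) {
        if (c > 0x10FFFF) {
            // Assumption: values outside the Unicode range become U+FFFD.
            dst.push_back(static_cast<char16_t>(0xFFFD));
        } else if (c >= 0x10000) {
            c -= 0x10000;  // leaves a 20-bit value
            dst.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));   // high surrogate: top 10 bits
            dst.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF))); // low surrogate: bottom 10 bits
        } else {
            dst.push_back(static_cast<char16_t>(c));
        }
    }
    return dst;
}
```

For U+2004E this produces the two units 0xD840 and 0xDC4E. The sketch does not check for lone surrogates in the input, which a real implementation might also want to handle.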

Discussion

ChangZhuo Chen (陳昌倬) - 2012-05-09

Patch for Unicode surrogate pair handling

unicode.patch

ChangZhuo Chen (陳昌倬) - 2012-05-09

Sample archive

archive.7z

my p7zip - 2014-12-31

Thank you for your patch.

The next version of p7zip will handle Unicode file names with surrogate pairs.
