p7zip / Discussion / Open Discussion: Issues caused due inconsistent encoding forms

Issues caused due inconsistent encoding forms

Forum: Open Discussion

Creator: Anonymous

Created: 2011-06-02

Updated: 2013-05-28

Anonymous - 2011-06-02

There is an issue with the way Unicode filenames are handled in the RAR handler. The filenames are not normalized to a consistent form (either precomposed or decomposed), which can lead to issues. Here's a sample scenario that I encountered:

Say a RAR archive contains the file path "Café foobar/quux.txt" in decomposed form (i.e., the é is actually the ASCII character e followed by a combining acute accent).

From an external source (say, a list file), you read that the user wants to strip "Café foobar" from the path during extraction, and this string is in precomposed form (i.e., a single character é).

Now, when the extract callback's getstream compares the components of the removePathParts variable using MyStringCompareNoCase, the comparison does not match (because é != e), and the operation fails.
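To make the mismatch concrete, here is a minimal standalone sketch. It does not use the p7zip string classes; SamePathPart is a hypothetical stand-in for a code-point-wise comparison (unlike MyStringCompareNoCase it is case-sensitive, which does not matter for this example), and the two constants are just the two spellings from the scenario above.

```cpp
#include <cassert>
#include <string>

// Precomposed spelling: é is the single code point U+00E9.
const std::u32string kPrecomposed = U"Caf\u00E9 foobar";

// Decomposed spelling: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const std::u32string kDecomposed = U"Cafe\u0301 foobar";

// Hypothetical stand-in for a comparison that works code point by code point
// and knows nothing about Unicode normalization.
bool SamePathPart(const std::u32string &a, const std::u32string &b)
{
  return a == b;
}

// Self-check: the two spellings render identically but do not compare equal,
// and the decomposed one is one code point longer.
void DemonstrateMismatch()
{
  assert(!SamePathPart(kPrecomposed, kDecomposed));
  assert(kDecomposed.size() == kPrecomposed.size() + 1);
}
```

Any comparison that looks only at individual code points will behave like this, regardless of case folding.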

I've tested this on Mac OS X, and it appears to occur only with RAR archives, because for other formats MultiByteToUnicodeString is used, which in turn calls CFStringNormalize.

On OS X, a quick and naive fix for this would be to add this line to RarIn.cpp:

    void CInArchive::ReadName(CItemEx &item, int nameSize)
    {
      ...
    + item.UnicodeName = MultiByteToUnicodeString(UnicodeStringToMultiByte(item.UnicodeName, 0), 0);
      ...
    }

(Ideally, this normalization would occur in a more "core" location, so that removePathParts and other strings are all in a consistent form.)
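As a rough illustration of "normalize both sides, then compare", here is a toy sketch. ComposeAcuteSketch and SamePathPartNormalized are hypothetical names, not p7zip functions, and the composition table covers only five lowercase vowels with an acute accent. It is not a real Unicode normalizer; actual code would rely on a proper one (such as CFStringNormalize on OS X, as mentioned above).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy composition step (NOT a full Unicode normalizer): folds a base letter
// followed by U+0301 COMBINING ACUTE ACCENT into its precomposed code point,
// for the five letters in the table only. Everything else is copied as-is.
std::u32string ComposeAcuteSketch(const std::u32string &s)
{
  const char32_t kCombiningAcute = 0x0301;
  static const char32_t kPairs[][2] = {
    { U'a', 0x00E1 }, { U'e', 0x00E9 }, { U'i', 0x00ED },
    { U'o', 0x00F3 }, { U'u', 0x00FA }
  };

  std::u32string out;
  for (std::size_t i = 0; i < s.size(); i++)
  {
    char32_t c = s[i];
    if (i + 1 < s.size() && s[i + 1] == kCombiningAcute)
    {
      for (const auto &p : kPairs)
      {
        if (p[0] == c)
        {
          c = p[1]; // replace the base letter with the precomposed form
          i++;      // and consume the combining accent
          break;
        }
      }
    }
    out += c;
  }
  assert(out.size() <= s.size()); // composing never lengthens the string
  return out;
}

// Hypothetical comparison that brings both sides to the same form first.
bool SamePathPartNormalized(const std::u32string &a, const std::u32string &b)
{
  return ComposeAcuteSketch(a) == ComposeAcuteSketch(b);
}
```

With this, SamePathPartNormalized(U"Cafe\u0301 foobar", U"Caf\u00E9 foobar") returns true, while an unaccented "Cafe foobar" still does not match "Café foobar". The point is only that both the archive name and the user-supplied path part go through the same normalization before they are compared.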

http://developer.apple.com/library/mac/#qa/qa2001/qa1235.html

Igor Pavlov - 2011-06-03

I think these composed/decomposed problems exist in many places in the 7-zip code.
I'm not ready to work on that problem right now. Maybe later.
