Issues caused by inconsistent Unicode normalization forms

  • Anonymous - 2011-06-02

    There is an issue with the way Unicode filenames are handled in the RAR handler. The filenames are not normalized to a consistent form (either precomposed or decomposed), so two names that look identical can fail to compare as equal. Here's a sample scenario that I encountered:

    Say a RAR archive contains the file path "Café foobar/quux.txt" in decomposed form (i.e., the é is actually the ASCII character e followed by a combining acute accent, U+0301).

    From an external source (say, a list file), you read that the user wants to strip "Café foobar" from the path during extraction, and this string is in precomposed form (i.e., é is the single character U+00E9).

    Now, when the extract callback's GetStream compares the path components against the removePathParts components using MyStringCompareNoCase, the comparison does not match (because the precomposed é is not equal to e followed by a combining acute), and the operation fails.
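
    To make the mismatch concrete, here is a minimal C++ sketch. EqualNoCaseSketch is a hypothetical stand-in that compares code point by code point with ASCII-only case folding; it is not 7-Zip's MyStringCompareNoCase, and it only assumes that the real comparison does no normalization either.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: ASCII-only lowercasing, no Unicode normalization.
static char32_t LowerAsciiSketch(char32_t c)
{
  return (c >= U'A' && c <= U'Z') ? static_cast<char32_t>(c + 32) : c;
}

// Hypothetical stand-in for a case-insensitive comparison that works
// on raw code points. Not the actual 7-Zip implementation.
static bool EqualNoCaseSketch(const std::u32string &a, const std::u32string &b)
{
  if (a.size() != b.size())
    return false;
  for (std::size_t i = 0; i < a.size(); i++)
    if (LowerAsciiSketch(a[i]) != LowerAsciiSketch(b[i]))
      return false;
  return true;
}

// Precomposed form: é is the single code point U+00E9.
static const std::u32string kPrecomposed = U"Caf\u00E9 foobar";
// Decomposed form: e followed by the combining acute accent U+0301.
static const std::u32string kDecomposed = U"Cafe\u0301 foobar";
```

    With these definitions, EqualNoCaseSketch(kPrecomposed, kDecomposed) returns false even though both strings render as "Café foobar": the decomposed string is one code point longer.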

    I've tested this on Mac OS X, and it appears to occur only with RAR archives, because for other formats MultiByteToUnicodeString is used, which in turn calls CFStringNormalize.

    On OS X, a quick and naive fix for this would be to add this line to RarIn.cpp:

    void CInArchive::ReadName(CItemEx &item, int nameSize)
        +item.UnicodeName = MultiByteToUnicodeString(UnicodeStringToMultiByte(item.UnicodeName, 0), 0);

    (Ideally, this normalization would occur in a more "core" location, so that removePathParts and all other strings end up in one consistent form.)
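
    As a rough illustration of what "normalize both sides before comparing" means, here is a self-contained C++ sketch. NormalizeSketch is a toy: it only composes an ASCII vowel followed by U+0301 into its precomposed form, and it is not a real NFC implementation. All names here are hypothetical; real code would call a proper normalizer (CFStringNormalize on OS X, for example).

```cpp
#include <cstddef>
#include <string>

static const char32_t kCombiningAcute = 0x0301;

// Toy table: returns the precomposed form of an ASCII vowel plus acute,
// or 0 if this sketch does not know the combination.
static char32_t ComposeAcuteSketch(char32_t base)
{
  switch (base)
  {
    case U'a': return 0x00E1;
    case U'e': return 0x00E9;
    case U'i': return 0x00ED;
    case U'o': return 0x00F3;
    case U'u': return 0x00FA;
    case U'A': return 0x00C1;
    case U'E': return 0x00C9;
    case U'I': return 0x00CD;
    case U'O': return 0x00D3;
    case U'U': return 0x00DA;
    default: return 0;
  }
}

// Toy composition pass (assumption: only vowel + U+0301 pairs matter).
static std::u32string NormalizeSketch(const std::u32string &s)
{
  std::u32string out;
  for (std::size_t i = 0; i < s.size(); i++)
  {
    const char32_t c = s[i];
    if (i + 1 < s.size() && s[i + 1] == kCombiningAcute)
    {
      const char32_t composed = ComposeAcuteSketch(c);
      if (composed != 0)
      {
        out += composed;
        i++; // skip the combining mark that was just consumed
        continue;
      }
    }
    out += c;
  }
  return out;
}

// Compare two path components after bringing both to the same form.
static bool SamePathPartSketch(const std::u32string &a, const std::u32string &b)
{
  return NormalizeSketch(a) == NormalizeSketch(b);
}
```

    The point of the design is that both the archive's names and the externally supplied strings pass through the same function once, at a single place, so later comparisons never see mixed forms.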

  • Igor Pavlov - 2011-06-03

    I think these composed/decomposed problems exist in many places in the 7-Zip code.
    I'm not ready to work on that problem right now. Maybe later.

