Issues caused by inconsistent encoding forms

Anonymous
2011-06-02
2013-05-28

  • Anonymous
    2011-06-02

    There is an issue with the way Unicode filenames are handled in the RAR handler. The filenames are not normalized to a consistent form (either precomposed or decomposed), which can cause path comparisons to fail. Here's a sample scenario that I encountered:

    Say a RAR archive contains the file path "Café foobar/quux.txt" in decomposed form (i.e., the é is actually the ASCII character e followed by a combining acute accent).

    From an external source (say, a list file), you read that the user wants to strip "Café foobar" from the path during extraction, and this string is in precomposed form (i.e., a single character é).

    Now, when the extract callback's GetStream compares the components of the removePathParts variable using MyStringCompareNoCase, the comparison does not match (because the precomposed é is not equal to e followed by a combining acute), and the operation fails.
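
    The mismatch can be reproduced outside 7-Zip with a minimal sketch. CompareNoCaseSketch is a hypothetical stand-in for MyStringCompareNoCase (assumed here to compare code unit by code unit after upper-casing; the real implementation may differ), and the two constants are the precomposed and decomposed spellings of the directory name:

```cpp
#include <cassert>
#include <cstddef>
#include <cwctype>
#include <string>

// Hypothetical stand-in for a case-insensitive compare such as
// MyStringCompareNoCase: upper-cases each code unit and compares them
// one by one. No Unicode normalization is performed.
static int CompareNoCaseSketch(const std::wstring &a, const std::wstring &b)
{
  const std::size_t n = a.size() < b.size() ? a.size() : b.size();
  for (std::size_t i = 0; i < n; i++)
  {
    const std::wint_t ca = std::towupper(a[i]);
    const std::wint_t cb = std::towupper(b[i]);
    if (ca != cb)
      return ca < cb ? -1 : 1;
  }
  if (a.size() == b.size())
    return 0;
  return a.size() < b.size() ? -1 : 1;
}

// Precomposed spelling: U+00E9 (one code point for the accented letter).
static const wchar_t *const kPrecomposed = L"Caf\u00E9 foobar";

// Decomposed spelling: 'e' followed by U+0301 (combining acute accent).
static const wchar_t *const kDecomposed = L"Cafe\u0301 foobar";
```

    Calling CompareNoCaseSketch(kPrecomposed, kDecomposed) returns a non-zero value even though both strings display identically, which is the situation described above.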

    I've tested this on Mac OS X, and it appears to occur only with RAR archives, because for other formats MultiByteToUnicodeString is used, which in turn calls CFStringNormalize.

    On OS X, a quick and naive fix for this would be to add this line to RarIn.cpp:

    void CInArchive::ReadName(CItemEx &item, int nameSize)
    {
    ...    
        +item.UnicodeName = MultiByteToUnicodeString(UnicodeStringToMultiByte(item.UnicodeName, 0), 0);
    ...
    }
    

    (Ideally, this normalization would occur in a more "core" location, so that removePathParts and other strings are all in a consistent form.)
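
    To illustrate the idea of normalizing both sides before comparing, here is a toy sketch. NormalizeSketch and SamePathPartSketch are hypothetical names, not 7-Zip functions, and the composition table covers only vowels followed by U+0301. A real fix would rely on a full normalizer (CFStringNormalize on OS X, for example) rather than a hand-written table, and this sketch compares case-sensitively for brevity:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy composition table (assumption for illustration only): maps a base
// vowel to its precomposed acute form, or returns 0 if it is not covered.
static wchar_t ComposeWithAcuteSketch(wchar_t base)
{
  switch (base)
  {
    case L'a': return 0x00E1;
    case L'e': return 0x00E9;
    case L'i': return 0x00ED;
    case L'o': return 0x00F3;
    case L'u': return 0x00FA;
    case L'A': return 0x00C1;
    case L'E': return 0x00C9;
    case L'I': return 0x00CD;
    case L'O': return 0x00D3;
    case L'U': return 0x00DA;
    default:   return 0;
  }
}

// Rewrites "base letter + U+0301" pairs into the precomposed character,
// so that both spellings end up in one consistent form.
static std::wstring NormalizeSketch(const std::wstring &s)
{
  std::wstring out;
  out.reserve(s.size());
  for (std::size_t i = 0; i < s.size(); i++)
  {
    const wchar_t c = s[i];
    if (i + 1 < s.size() && s[i + 1] == 0x0301)
    {
      const wchar_t composed = ComposeWithAcuteSketch(c);
      if (composed != 0)
      {
        out += composed;
        i++; // skip the combining accent that was just consumed
        continue;
      }
    }
    out += c;
  }
  return out;
}

// Compares two path components after bringing both to the same form.
static bool SamePathPartSketch(const std::wstring &a, const std::wstring &b)
{
  return NormalizeSketch(a) == NormalizeSketch(b);
}
```

    With this in place, the path component read from the archive and the one read from the list file compare equal regardless of which form each arrived in, while an unaccented "Cafe foobar" still compares as different.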

    http://developer.apple.com/library/mac/#qa/qa2001/qa1235.html

     
  • Igor Pavlov
    2011-06-03

    I think these composed/decomposed problems exist in many places in the 7-Zip code.
    I'm not ready to work on that problem right now. Maybe later.