Non-ASCII characters in filenames get broken in obfuscated JAR
Opening the original JAR with 7-Zip, WinRAR or InfoZip's unzip.exe shows filenames like '3-Introducción.html' correctly, but after obfuscation they show as '3-Introducci+?n.html', '3-Introducci├│n.html' or something else, depending on the program used to open the JAR as a zip file.
The Java application does indeed fail to find the file at runtime.
ProGuard (through the ZipOutput class) uses the character encoding set in Java. You should specify it with e.g.
-Dfile.encoding=...
so it corresponds to your system's character encoding (or the encoding the tools expect). The default in Java should be your system's character encoding, so maybe it is changed in some configuration.
I believe there might be something else to this. I recently discovered that -adaptresourcefilecontents mostly works correctly with UTF-8 characters, but some multi-byte characters seem to have their second byte incorrectly modified (seen on 6.1.1, applied to translation .properties files).
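To make the encoding point concrete, here is a minimal sketch of how the same character turns into different bytes depending on the charset used, and how a charset that cannot represent it yields 0x3F ('?'). The class and method names are made up for illustration and are not part of ProGuard.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingBytesDemo {

    /** Formats bytes as space-separated upper-case hex, e.g. "C3 B3". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the text with the given charset and returns the bytes as hex. */
    static String encodedHex(String text, Charset charset) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte for characters it cannot encode.
        return hex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        // \u00f3 is 'ó'; the escape keeps this source file ASCII-only.
        String text = "\u00f3";
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8:      " + encodedHex(text, StandardCharsets.UTF_8));      // C3 B3
        System.out.println("ISO-8859-1: " + encodedHex(text, StandardCharsets.ISO_8859_1)); // F3
        System.out.println("US-ASCII:   " + encodedHex(text, StandardCharsets.US_ASCII));   // 3F
    }
}
```

Running it with different -Dfile.encoding values only changes the first output line; the other three name their charset explicitly, which is the safer habit when the bytes end up in a file or archive.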
At first I thought I saw a pattern, something like every time the second byte has its highest bit set, but now that I have looked at more examples I am unsure, because this does not hold true (my last example has CF 83 being fine, but CF 81 being transformed incorrectly).
Examples:
like in the original post: ó (0xC3 0xB3 becomes 0xC3 0x3F)
č (0xC4 0x8D becomes 0xC4 0x3F)
But other characters are fine. Here's a very concrete example, the phrase:
Πληροφορίες διαγραφής εξαχθέντος περιστατικού
(CE A0 CE BB CE B7 CF 81 CE BF CF 86 CE BF CF 81 CE AF CE B5 CF 82 20 CE B4 CE B9 CE B1 CE B3 CF 81 CE B1 CF 86 CE AE CF 82 20 CE B5 CE BE CE B1 CF 87 CE B8 CE AD CE BD CF 84 CE BF CF 82 20 CF 80 CE B5 CF 81 CE B9 CF 83 CF 84 CE B1 CF 84 CE B9 CE BA CE BF CF 8D)
Becomes
(CE A0 CE BB CE B7 CF 3F CE BF CF 86 CE BF CF 3F CE AF CE B5 CF 82 20 CE B4 CE B9 CE B1 CE B3 CF 3F CE B1 CF 86 CE AE CF 82 20 CE B5 CE BE CE B1 CF 87 CE B8 CE AD CE BD CF 84 CE BF CF 82 20 CF 80 CE B5 CF 3F CE B9 CF 83 CF 84 CE B1 CF 84 CE B9 CE BA CE BF CF 3F)
Affected here are CF 81 and CF 8D; there is also C9 9D in a couple of Greek phrases I have.
I don't see the pattern at the moment, but it does feel like a bug in the UTF-8 handling somewhere, more than just setting the encoding.
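One hypothesis that would fit most of these bytes, offered only as a guess and not confirmed anywhere in this thread: the content is decoded and re-encoded with a single-byte charset such as windows-1252. In that code page 0x81, 0x8D and 0x9D are unassigned, while 0x83 is assigned. The ó case from the original post (second byte 0xB3, which is assigned) would not be explained this way. A minimal sketch to test the guess; the class and method names are hypothetical, and the choice of windows-1252 is itself an assumption:

```java
import java.nio.charset.Charset;

final class Cp1252RoundTripDemo {

    // Assumption: windows-1252 is only a guess at the charset involved.
    static final Charset GUESSED = Charset.forName("windows-1252");

    /** Decodes raw bytes with the guessed charset, then encodes them again. */
    static byte[] roundTrip(byte[] raw) {
        return new String(raw, GUESSED).getBytes(GUESSED);
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Convenience wrapper so callers can pass bytes as 0xCF, 0x81, ... */
    static String roundTripHex(int... unsignedBytes) {
        byte[] raw = new byte[unsignedBytes.length];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) unsignedBytes[i];
        }
        return hex(roundTrip(raw));
    }

    public static void main(String[] args) {
        System.out.println(roundTripHex(0xCF, 0x83)); // 0x83 is assigned, expected to survive
        System.out.println(roundTripHex(0xCF, 0x81)); // CF 3F if 0x81 is treated as unmappable
        System.out.println(roundTripHex(0xC4, 0x8D)); // C4 3F if 0x8D is treated as unmappable
    }
}
```

If the output matches the corrupted dumps, that points at a single-byte decode/encode step somewhere in the resource handling rather than at the UTF-8 data itself.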
The problem with filenames containing UTF-8 characters could be reproduced. It occurs because the general purpose bit flag is not set up properly, and the "version made by" field also has to be adapted to support this properly; see: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
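For reference, APPNOTE.TXT defines bit 11 of the general purpose bit flag as the marker that an entry's filename and comment are UTF-8. The sketch below uses the JDK's own java.util.zip classes purely as a stand-in (it is not ProGuard's ZipOutput class, and the helper names are made up) to write a one-entry archive in memory and read that flag back from the first local file header. It does not touch the "version made by" field.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipUtf8FlagDemo {

    /** Bit 11 of the general purpose bit flag: names and comments are UTF-8. */
    static final int UTF8_FLAG = 1 << 11;

    /** Writes a one-entry archive in memory and returns its raw bytes. */
    static byte[] zipWithEntry(String entryName, Charset nameCharset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out, nameCharset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("demo".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reads the little-endian flag field at offset 6 of the first local file header. */
    static int firstEntryFlag(byte[] zipBytes) {
        return (zipBytes[6] & 0xFF) | ((zipBytes[7] & 0xFF) << 8);
    }

    /** True if the first entry declares its name as UTF-8. */
    static boolean hasUtf8Flag(byte[] zipBytes) {
        return (firstEntryFlag(zipBytes) & UTF8_FLAG) != 0;
    }

    public static void main(String[] args) {
        // \u00f3 is 'ó'; the escape keeps this source file ASCII-only.
        String name = "3-Introducci\u00f3n.html";
        System.out.println("UTF-8 names:      flag set = "
                + hasUtf8Flag(zipWithEntry(name, StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1 names: flag set = "
                + hasUtf8Flag(zipWithEntry(name, StandardCharsets.ISO_8859_1)));
    }
}
```

The same firstEntryFlag check can be pointed at the first bytes of an obfuscated JAR to see whether its entries declare UTF-8 names; archive tools that find the bit unset are free to interpret the name bytes in a legacy code page.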