#760 Non-ascii characters in filenames get broken in obfuscated JAR

Milestone: v6.1

Status: open-works-for-me

Owner: Eric Lafortune

Labels: None

Priority: Medium

Updated: 2019-12-06

Created: 2019-07-30

Creator: mar goli

Private: No

Opening the original JAR with 7-Zip, WinRAR, or Info-ZIP's unzip.exe shows filenames like '3-Introducción.html' correctly, but after obfuscation they show as '3-Introducci+?n.html', '3-Introducci├│n.html', or something else, depending on the program used to open the JAR as a ZIP file.
The Java application does indeed fail to find the file at runtime.

Discussion

Eric Lafortune - 2019-08-10

ProGuard (through the ZipOutput class) uses the character encoding set in Java. You should specify it with e.g. -Dfile.encoding=... so it corresponds to your system's character encoding (or the encoding the tools expect). The default in Java should be your system's character encoding, so maybe it is changed in some configuration.
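To check which encoding a given JVM actually picks up, a small sketch like the following can help. The class name DefaultCharsetCheck is made up for illustration; it only prints what the JVM reports and does not touch ProGuard itself. Running it once plainly and once with e.g. -Dfile.encoding=UTF-8 on the java command line shows whether the option takes effect in your setup.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of ProGuard: reports the encoding settings
// of the JVM it runs in.
class DefaultCharsetCheck {

    // Returns the raw system property and the charset the JVM resolved from it.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the reported charset is not the one your ZIP tools expect, the same -Dfile.encoding=... option would then be added to the java command that launches ProGuard.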

Eric Lafortune - 2019-08-10

status: open --> open-works-for-me

assigned_to: Eric Lafortune

Drakkoon - 2019-08-16

I believe there might be something else to this. I recently discovered that -adaptresourcefilecontents mostly works correctly with Unicode (UTF-8) characters, but some multi-byte glyphs seem to have their second byte incorrectly modified (on 6.1.1, applied to translation .properties files).

At first I thought I saw a pattern, something like every time the second byte has the highest bit set, but now that I have looked at more examples I am unsure, because this does not hold true (my last example has CF 83 being fine, but CF 81 being transformed incorrectly).

Examples:
like in the original post: ó (0xC3 0xB3 becomes 0xC3 0x3F)
č (0xC4 0x8D becomes 0xC4 0x3F)
But other characters are fine. Here's a very concrete example, the phrase:

Πληροφορίες διαγραφής εξαχθέντος περιστατικού
(CE A0 CE BB CE B7 CF 81 CE BF CF 86 CE BF CF 81 CE AF CE B5 CF 82 20 CE B4 CE B9 CE B1 CE B3 CF 81 CE B1 CF 86 CE AE CF 82 20 CE B5 CE BE CE B1 CF 87 CE B8 CE AD CE BD CF 84 CE BF CF 82 20 CF 80 CE B5 CF 81 CE B9 CF 83 CF 84 CE B1 CF 84 CE B9 CE BA CE BF CF 8D)
Becomes
(CE A0 CE BB CE B7 CF 3F CE BF CF 86 CE BF CF 3F CE AF CE B5 CF 82 20 CE B4 CE B9 CE B1 CE B3 CF 3F CE B1 CF 86 CE AE CF 82 20 CE B5 CE BE CE B1 CF 87 CE B8 CE AD CE BD CF 84 CE BF CF 82 20 CF 80 CE B5 CF 3F CE B9 CF 83 CF 84 CE B1 CF 84 CE B9 CE BA CE BF CF 3F)
We see CF 81 and CF 8D here; there is also C9 9D in a couple of Greek phrases I have.

I don't see the pattern at the moment, but it does feel like a bug in the UTF-8 handling somewhere, more than just setting the encoding.
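One possible explanation, which is an assumption and not something confirmed in this thread: the bytes are decoded and re-encoded with a single-byte charset such as windows-1252 instead of UTF-8. In Java's windows-1252, the bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no mapping, so they decode to U+FFFD and are written back as '?' (0x3F), while other bytes such as 0x83 survive the round trip. That would match CF 81, CF 8D, C4 8D and C9 9D, but not the ó example, since 0xB3 is mapped in windows-1252. The sketch below (class and method names are made up) reproduces the effect:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: round-trips UTF-8 bytes through windows-1252,
// which is only an assumed cause of the corruption described above.
class RoundTripSketch {

    // Encodes text as UTF-8, then decodes and re-encodes it with windows-1252.
    static String throughWindows1252(String text) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(utf8, cp1252); // unmapped bytes become U+FFFD
        return hex(misdecoded.getBytes(cp1252));      // U+FFFD is written as '?'
    }

    // Formats bytes as space-separated upper-case hex, e.g. "CF 81".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(throughWindows1252("\u03C1")); // rho,     UTF-8 CF 81 -> CF 3F
        System.out.println(throughWindows1252("\u03C3")); // sigma,   UTF-8 CF 83 -> CF 83
        System.out.println(throughWindows1252("\u03CD")); // upsilon with tonos, CF 8D -> CF 3F
        System.out.println(throughWindows1252("\u010D")); // c with caron, C4 8D -> C4 3F
    }
}
```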

T. Neidhart - 2019-12-06

The problem with filenames containing UTF-8 characters could be reproduced. This is because the general purpose flag is not set up properly, and the "version made by" field also has to be adapted to support that properly; see: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
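For reference, the flag in question is bit 11 (0x0800) of the general purpose bit flag, which APPNOTE.TXT defines as marking the file name and comment as UTF-8. The sketch below does not use ProGuard's ZipOutput class; it uses the JDK's own java.util.zip.ZipOutputStream purely as an illustration and reads the flag back from the local file header it writes. The class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch based on the JDK zip classes, not on ProGuard code.
class Utf8ZipNameSketch {

    // Bit 11 of the general purpose bit flag: names and comments are UTF-8.
    static final int UTF8_NAME_FLAG = 0x0800;

    // Writes an in-memory ZIP with one entry, encoding the name as UTF-8.
    static byte[] writeZip(String entryName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer, StandardCharsets.UTF_8)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("placeholder".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Local file header layout: signature (4 bytes), version needed (2),
    // general purpose bit flag (2), all little-endian.
    static int generalPurposeFlag(byte[] zipBytes) {
        return (zipBytes[6] & 0xFF) | ((zipBytes[7] & 0xFF) << 8);
    }

    // Reads the first entry name back, decoding names as UTF-8.
    static String firstEntryName(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(
                new ByteArrayInputStream(zipBytes), StandardCharsets.UTF_8)) {
            return zip.getNextEntry().getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] zipBytes = writeZip("3-Introducci\u00F3n.html");
        int flag = generalPurposeFlag(zipBytes);
        System.out.printf("flag=0x%04X, UTF-8 bit set: %b, name: %s%n",
                flag, (flag & UTF8_NAME_FLAG) != 0, firstEntryName(zipBytes));
    }
}
```

Tools that honor this bit decode the stored name as UTF-8; without it, they typically fall back to a legacy code page, which is consistent with the differing garbled names in the original report.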
