I want to change the encoding of messages_ja.xml from Shift_JIS to UTF-8.
However, PluginLoader.addCollection() ignores the encoding given in the XML declaration.
Because the default charset depends on the OS, the Japanese messages are garbled when I convert the file to UTF-8.
Environment: Windows 7 (Windows-31J), FindBugs 1.3.9
Can you provide me with a suggestion here? Should I change the InputStreamReader to explicitly use UTF-8?
Looking inside the XML to find its character encoding poses a chicken-and-egg problem: the file has to be decoded before its declaration can be read.
Rather than hard-coding UTF-8 in the InputStreamReader, I would prefer that the encoding declared in the XML be parsed and honored.
Is it usual to ignore the encoding in the XML declaration when parsing an XML document?
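On the chicken-and-egg concern: for ASCII-compatible encodings such as Shift_JIS and UTF-8, the XML declaration itself consists only of ASCII bytes, so it can be read before the real encoding is known (Appendix F of the XML 1.0 specification describes this autodetection). Below is a minimal sketch of the idea, not FindBugs or dom4j code. The names DeclaredEncodingSniffer and declaredEncoding are hypothetical, and the sketch ignores byte order marks and non-ASCII-compatible encodings such as UTF-16, which a real parser handles.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; a real XML parser does this internally.
class DeclaredEncodingSniffer {
    // Matches the encoding pseudo-attribute of an XML declaration at the start of the input.
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the encoding name declared in the first bytes of an XML file,
     * or null if there is no declaration or no encoding pseudo-attribute.
     * Assumes an ASCII-compatible encoding and no byte order mark.
     */
    static String declaredEncoding(byte[] head) {
        int n = Math.min(head.length, 128);
        // ISO-8859-1 maps every byte to one char, so this decoding never fails
        // and leaves the ASCII declaration intact.
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prefix);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] head = "<?xml version=\"1.0\" encoding=\"Shift_JIS\"?><a/>"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredEncoding(head));
    }
}
```

In practice there is no need to write this by hand: handing the parser the raw bytes lets it do the detection itself.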
You should not decode the byte stream into a character stream yourself with InputStreamReader.
I think it is better to use SAXReader#read(InputStream) or SAXReader#read(URL) instead of SAXReader#read(Reader) in PluginLoader.addCollection() and similar places.
The SAX parser will then detect the character encoding as the XML specification describes, whether it is Shift_JIS or UTF-8.
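To illustrate the difference, here is a small sketch that uses only the JDK's built-in parser (javax.xml.parsers) rather than dom4j's SAXReader, so it runs without extra libraries. The class name XmlEncodingDemo, the sample document, and the use of ISO-8859-1 as a stand-in for a mismatched platform default charset are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of FindBugs.
class XmlEncodingDemo {
    // Sample UTF-8 document containing Japanese text (written with Unicode escapes).
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?><msg>\u30d0\u30b0</msg>";

    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    /** The parser receives raw bytes and determines the encoding itself. */
    static String parseFromBytes(byte[] bytes) throws Exception {
        Document doc = newBuilder().parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    /** The bytes are decoded up front with the given charset; the parser only sees characters. */
    static String parseFromReader(byte[] bytes, Charset charset) throws Exception {
        InputSource source = new InputSource(
            new InputStreamReader(new ByteArrayInputStream(bytes), charset));
        Document doc = newBuilder().parse(source);
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = XML.getBytes(StandardCharsets.UTF_8);
        String expected = "\u30d0\u30b0";
        // Byte stream: the declared encoding is honored.
        System.out.println(expected.equals(parseFromBytes(bytes)));
        // Reader with a mismatched charset: the text comes back garbled.
        System.out.println(expected.equals(parseFromReader(bytes, StandardCharsets.ISO_8859_1)));
    }
}
```

When a character stream is supplied, the parser disregards the encoding declaration, which is consistent with the garbling reported above.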
Thank you for the advice.
It worked as expected when I rewrote the method as follows.
    private void addCollection(List<Document> messageCollectionList, String filename) throws PluginException {
        URL messageURL = getResource(filename);
        if (messageURL != null) {
            SAXReader reader = new SAXReader();
            try {
                // Pass the URL itself so the parser reads raw bytes and honors the declared encoding
                Document messageCollection = reader.read(messageURL);
                messageCollectionList.add(messageCollection);
            } catch (Exception e) {
                throw new PluginException("Couldn't parse \"" + messageURL + "\"", e);
            }
        }
    }
Same problem and same fix philosophy as bug ID 2928350 (and fix uploaded in 3113643).
It looks like this was fixed long ago. At least, messages_ja is now in UTF-8 and is parsed correctly. Please report a new bug if any problems remain.