
#858 Problem UTF-8 Japanese messages

Group: 3.0.1
Status: closed-fixed
Labels: Core (142)
Priority: 5
Updated: 2015-02-22
Created: 2010-04-27
Creator: orihalcon
Private: No

I want to change the encoding of messages_ja.xml from Shift_JIS to UTF-8.
However, PluginLoader.addCollection() ignores the encoding given in the XML declaration.
Because the default charset depends on the OS, the Japanese messages are garbled when I convert the file to UTF-8.

Environment: Windows 7 (default charset Windows-31J), FindBugs 1.3.9
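
The garbling described above can be reproduced without FindBugs. The following is a minimal sketch, assuming a JDK that ships the windows-31j charset; the class name, method name, and sample text are hypothetical and only illustrate what happens when UTF-8 bytes are decoded with the platform charset of a Japanese Windows system.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Hypothetical sample text standing in for a Japanese message:
    // the katakana word "bagu" (bug), written with Unicode escapes.
    static final String SAMPLE = "\u30D0\u30B0";

    // Encodes the text as UTF-8 and decodes the bytes with the given charset,
    // which mirrors what an InputStreamReader built with that charset does.
    static String decodeUtf8BytesAs(String text, Charset readerCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, readerCharset);
    }

    public static void main(String[] args) {
        // Assumption: windows-31j is available (it is an extended charset).
        Charset windows31j = Charset.forName("windows-31j");
        String asUtf8 = decodeUtf8BytesAs(SAMPLE, StandardCharsets.UTF_8);
        String asWindows31j = decodeUtf8BytesAs(SAMPLE, windows31j);
        System.out.println("decoded as UTF-8 matches original: " + asUtf8.equals(SAMPLE));
        System.out.println("decoded as Windows-31J matches original: " + asWindows31j.equals(SAMPLE));
    }
}
```

The same bytes yield different text depending on the charset the reader assumes, which is why a reader that relies on the OS default cannot handle a UTF-8 message file on this platform.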

Discussion

  • William Pugh

    William Pugh - 2010-05-24

    Can you provide me with a suggestion here? Should I change the InputStreamReader to explicitly use UTF-8?

    Looking in the XML to get the character set encoding poses a chicken-and-egg problem.

     
  • William Pugh

    William Pugh - 2010-05-24
    • assigned_to: nobody --> wpugh
    • status: open --> open-accepted
     
  • orihalcon

    orihalcon - 2010-05-25

    Rather than hard-coding UTF-8 in the InputStreamReader, I would prefer that the encoding be parsed from the file.
    Is it usual to ignore the encoding in the XML declaration when parsing an XML document?

     
  • haccy

    haccy - 2010-05-26

    You should not decode the byte stream into a character stream with InputStreamReader.
    I think it is better to use SAXReader#read(InputStream) or SAXReader#read(URL) instead of SAXReader#read(Reader) in PluginLoader.addCollection() and similar places.
    The SAX parser will detect the character encoding as described in the XML specification, whether it is Shift_JIS or UTF-8.

     
  • orihalcon

    orihalcon - 2010-05-27

    Thank you for the advice.
    It worked as expected when I rewrote the method as follows.

    private void addCollection(List<Document> messageCollectionList, String filename) throws PluginException {
        URL messageURL = getResource(filename);
        if (messageURL != null) {
            SAXReader reader = new SAXReader();
            try {
                Document messageCollection = reader.read(messageURL);
                messageCollectionList.add(messageCollection);
            } catch (Exception e) {
                throw new PluginException("Couldn't parse \"" + messageURL + "\"", e);
            }
        }
    }

     
  • Damien MORCELLET

    Same problem and same fix philosophy as bug ID 2928350 (and fix uploaded in 3113643).

     
  • Tagir Valeev

    Tagir Valeev - 2015-02-22
    • status: open-accepted --> closed-fixed
    • Group: --> 3.0.1
     
  • Tagir Valeev

    Tagir Valeev - 2015-02-22

    Looks like this was fixed long ago. At least messages_ja is now in UTF-8 and is parsed correctly. Please report a new bug if there are still any problems.
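
For reference, the point made in the discussion (that an XML parser given raw bytes detects the encoding itself, while a parser given a Reader uses whatever charset the Reader already applied) can be checked with the JDK's built-in parser. This is a sketch under stated assumptions, not the FindBugs code: it uses javax.xml.parsers rather than dom4j's SAXReader, it assumes the Shift_JIS charset is available in the JDK, and the class name, helper names, and sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlEncodingDetectionDemo {
    // Hypothetical message text: the katakana word "bagu" (bug).
    static final String MESSAGE = "\u30D0\u30B0";

    // Builds a tiny XML document whose declaration names the charset
    // that is actually used to encode its bytes.
    static byte[] sampleXml(Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
                + "<message>" + MESSAGE + "</message>";
        return xml.getBytes(charset);
    }

    // Byte-stream input: the parser reads the XML declaration and decodes accordingly.
    static String parseFromBytes(byte[] xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    // Character-stream input: the bytes are already decoded by the Reader,
    // so the encoding named in the XML declaration has no effect.
    static String parseFromReader(byte[] xml, Charset readerCharset) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Reader reader = new InputStreamReader(new ByteArrayInputStream(xml), readerCharset);
            Document doc = builder.parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        System.out.println("UTF-8 bytes, parser detects encoding: "
                + parseFromBytes(sampleXml(StandardCharsets.UTF_8)).equals(MESSAGE));
        System.out.println("Shift_JIS bytes, parser detects encoding: "
                + parseFromBytes(sampleXml(shiftJis)).equals(MESSAGE));
        System.out.println("UTF-8 bytes read through a Shift_JIS Reader: "
                + parseFromReader(sampleXml(StandardCharsets.UTF_8), shiftJis).equals(MESSAGE));
    }
}
```

This also addresses the chicken-and-egg concern raised earlier: the XML declaration consists of ASCII characters, so a parser can read it from the raw bytes before committing to a decoder for the rest of the document.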

     
