Bug in auto detection of encoding

Antony
2013-01-18
2013-01-19
  • Antony

    Antony - 2013-01-18

    Using v3.3 on Windows with JDK6.

    I have a UTF-16LE encoded file and I am passing its InputStream to Source. It detects the BOM in StreamEncodingDetector.init() and calls

    setEncoding(UTF_16,"UTF-16 little-endian Byte Order Mark (FF EE)",2,b3==-1);

    When EncodingDetector.openReader() is called, it creates a new InputStreamReader with "UTF-16", which actually reads the file as UTF-16BE. I guess this is because the BOM is no longer available in the InputStream, as it was read and skipped in the earlier setEncoding() call. If I manually set the charset to UTF-16LE in the debugger, it works.
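
    The behaviour can be reproduced with the JDK alone. This is a minimal sketch, not code from Jericho: the class name Utf16BomDemo, the decode() helper and the sample bytes are made up for illustration. It relies on the documented behaviour of Java's "UTF-16" charset, which uses a leading BOM to pick the byte order and falls back to big-endian when there is none.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class Utf16BomDemo {
    // Hypothetical helper: decodes whatever is left in the stream using the named charset.
    static String decode(InputStream in, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(in, charsetName);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Renders each char as a hex code unit so that mis-decoded text is easy to spot.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "Hi" encoded as UTF-16LE, preceded by the little-endian BOM (bytes FF FE).
        byte[] file = {(byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0};

        // BOM still in the stream: "UTF-16" detects little-endian and strips the BOM.
        System.out.println(hex(decode(new ByteArrayInputStream(file), "UTF-16")));

        // BOM already consumed (offset 2): "UTF-16" falls back to big-endian.
        System.out.println(hex(decode(new ByteArrayInputStream(file, 2, 4), "UTF-16")));

        // BOM already consumed, byte order named explicitly: decodes correctly.
        System.out.println(hex(decode(new ByteArrayInputStream(file, 2, 4), "UTF-16LE")));
    }
}
```

    The second call is the case I am hitting: once the two BOM bytes have been skipped, the reader has to be opened with an explicit byte order ("UTF-16LE") rather than plain "UTF-16".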

     
  • Martin Jericho

    Martin Jericho - 2013-01-19

    Hi Antony,

    Thanks for the bug report. It will be fixed in v3.4.

    Until version 3.4 is officially released, the development version is available here:
    http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip

    Regards
    Martin

     
