Jericho HTML Parser / Discussion / Open Discussion: Bug in auto detection of encoding

Bug in auto detection of encoding

Forum: Open Discussion

Creator: Antony

Created: 2013-01-18

Updated: 2013-01-19

Antony - 2013-01-18

Using v3.3 on Windows with JDK6.

I have a UTF-16LE encoded file and I am passing its InputStream to Source. It detects the BOM in StreamEncodingDetector.init() and calls

setEncoding(UTF_16,"UTF-16 little-endian Byte Order Mark (FF EE)",2,b3==-1);

When EncodingDetector.openReader() is then called, it creates a new InputStreamReader with "UTF-16", which actually reads the file as UTF-16BE. I guess this is because the InputStream no longer has the BOM available, since it was read and skipped in the earlier setEncoding() call. If I manually set the charset to UTF-16LE in the debugger, it works.
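
The behaviour can be reproduced without Jericho. The following is a minimal sketch using only the JDK, not the library's actual code: the class name Utf16BomDemo and its helper methods are made up for illustration. It relies on the fact that Java's "UTF-16" decoder falls back to big-endian when the stream has no BOM, so once the two BOM bytes have been consumed, the little-endian bytes for "A" (41 00) are decoded as the single character U+4100.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class Utf16BomDemo {
    // Hypothetical sample data: the UTF-16LE BOM (FF FE) followed by "A" (41 00).
    static InputStream sample() {
        return new ByteArrayInputStream(
            new byte[] {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00});
    }

    // Decodes whatever is left in the stream using the named charset.
    static String readAll(InputStream in, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(in, charsetName);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Mimics a detector that reads and skips the BOM before the reader is opened.
    static String decodeAfterSkippingBom(String charsetName) throws IOException {
        InputStream in = sample();
        in.read(); // consume FF
        in.read(); // consume FE
        return readAll(in, charsetName);
    }

    public static void main(String[] args) throws IOException {
        // BOM still in the stream: "UTF-16" uses it to pick little-endian.
        System.out.println(readAll(sample(), "UTF-16"));   // prints "A"

        // BOM already consumed: "UTF-16" defaults to big-endian.
        String wrong = decodeAfterSkippingBom("UTF-16");
        System.out.println(Integer.toHexString(wrong.charAt(0))); // prints "4100"

        // BOM already consumed, explicit byte order: decoded correctly.
        System.out.println(decodeAfterSkippingBom("UTF-16LE")); // prints "A"
    }
}
```

This suggests that whenever the BOM is consumed before the reader is opened, the reader needs the explicit byte order (UTF-16LE or UTF-16BE) rather than the generic "UTF-16" name.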

Martin Jericho - 2013-01-19

Hi Antony,

Thanks for the bug report. It will be fixed in v3.4.

Until version 3.4 is officially released, the development version is available here:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip

Regards
Martin
