Using v3.3 on Windows with JDK6.
I have a UTF-16LE encoded file and I am passing its InputStream to Source. It detects the BOM in StreamEncodingDetector.init() and calls
setEncoding(UTF_16,"UTF-16 little-endian Byte Order Mark (FF EE)",2,b3==-1);
When EncodingDetector.openReader() is called, it creates a new InputStreamReader with "UTF-16", which actually reads the file as UTF-16BE. I guess this is because the InputStream no longer has the BOM available, since it was read and skipped in the earlier setEncoding() call. If I manually set the charset to UTF-16LE in the debugger, it works.
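For anyone who wants to reproduce this outside the library, here is a minimal standalone sketch using only the JDK. It does not use any Jericho classes; the class name Utf16BomDemo and its helper methods are made up for illustration, and the two read() calls only mimic a detector that has already consumed the BOM (FF FE for little-endian). It relies on the documented JDK behaviour that the "UTF-16" charset falls back to big-endian when no BOM is present.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class Utf16BomDemo {
    // Decodes everything left in the stream using the named charset.
    static String decode(InputStream in, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(in, charsetName);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // "Hi" encoded as UTF-16LE, preceded by the little-endian BOM (FF FE).
    static byte[] sample() {
        return new byte[] {(byte) 0xFF, (byte) 0xFE, 'H', 0, 'i', 0};
    }

    // Hypothetical stand-in for a detector that reads and skips the 2-byte BOM.
    static InputStream afterBom() throws IOException {
        InputStream in = new ByteArrayInputStream(sample());
        in.read();
        in.read();
        return in;
    }

    public static void main(String[] args) throws IOException {
        // BOM still in the stream: "UTF-16" picks little-endian from the BOM.
        System.out.println(decode(new ByteArrayInputStream(sample()), "UTF-16"));
        // BOM already consumed: "UTF-16" assumes big-endian and garbles the text.
        System.out.println(decode(afterBom(), "UTF-16").equals("Hi"));
        // BOM already consumed, explicit byte order: decodes correctly.
        System.out.println(decode(afterBom(), "UTF-16LE"));
    }
}
```

If this matches what happens inside the library, passing the endian-specific charset name ("UTF-16LE" or "UTF-16BE") to the reader once the BOM has been skipped should avoid the problem, which is consistent with the debugger workaround above.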
Hi Antony,
Thanks for the bug report. It will be fixed in v3.4.
Until version 3.4 is officially released, the development version is available here:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip
Regards
Martin