It seems that HotSAX doesn't support Chinese HTML pages!
Status: Pre-Alpha
Brought to you by:
ulysees2001
When I tried to debug xhtmlMaker.java using an HTML page that contains Chinese text, I got an ArraysOutOfBounds error!
It's quite frustrating!
I hope there is a solution.
If you know how to fix it, please email xiao7cn@126.com. Thanks.
When reporting bugs, please save the sample HTML page and post it with the report.
The issue is that the HtmlLexer.flex file specifies the '%full' option, which tells the generated lexer to use an 8-bit (single-byte) character set, i.e. character codes 0-255 only. Chinese sites use multi-byte encodings, so their characters fall outside that range.
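To illustrate why this shows up as an out-of-bounds error, here is a minimal sketch of the failure mode. It is not HotSAX or JFlex code: the class CharMapDemo, its 256-entry table, and both method names are made up for illustration, on the assumption that an 8-bit scanner indexes a fixed-size table by character code.

```java
public class CharMapDemo {
    // Hypothetical stand-in for an 8-bit scanner's character table:
    // one entry per character code 0..255 (not the real generated table).
    private static final int[] EIGHT_BIT_MAP = new int[256];

    /** Returns true if every char in the input has a code in 0..255. */
    public static boolean fitsIn8Bit(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) > 255) {
                return false;
            }
        }
        return true;
    }

    /**
     * Looks up each char directly in the 256-entry table, the way an
     * 8-bit scanner would. Any char code above 255 indexes past the end
     * of the array and throws ArrayIndexOutOfBoundsException.
     */
    public static int scanWith8BitTable(String input) {
        int sum = 0;
        for (int i = 0; i < input.length(); i++) {
            sum += EIGHT_BIT_MAP[input.charAt(i)];
        }
        return sum;
    }

    public static void main(String[] args) {
        String ascii = "<p>hello</p>";
        // "\u4e2d\u6587" is the word for "Chinese", written with escapes
        // so the source file itself stays plain ASCII.
        String chinese = "<p>\u4e2d\u6587</p>";

        System.out.println("ASCII fits in 8 bits: " + fitsIn8Bit(ascii));
        System.out.println("Chinese fits in 8 bits: " + fitsIn8Bit(chinese));

        scanWith8BitTable(ascii); // fine, all codes are below 256
        try {
            scanWith8BitTable(chinese);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Out of bounds on Chinese input: " + e.getMessage());
        }
    }
}
```

A check like fitsIn8Bit can also be a quick way to confirm that a sample page contains characters the 8-bit lexer cannot handle before posting it with a bug report.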
The fix, which I have tested on a Chinese website, is to replace '%full' with '%unicode' in HtmlLexer.flex and then regenerate HtmlLexer.java with the JFlex tool.
The JFlex documentation is at http://jflex.de/manual.html; search for "Character sets" for an explanation of this point.
Working code is discussed in the comments on issue https://sourceforge.net/tracker/?func=detail&aid=1913288&group_id=29085&atid=395047