#13 It seems that HotSAX doesn't support Chinese HTML pages!

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2007-01-26
Created: 2007-01-26
Creator: Anonymous
Private: No

When I try to debug xhtmlMaker.java using an HTML page that contains Chinese words, I get an ArrayIndexOutOfBoundsException!

It's quite frustrating!

I hope there is a solution.

If you know how to fix it, please email xiao7cn@126.com. Thanks.

Discussion

  • Simon Massey - 2012-01-05

    When reporting bugs, please save the sample HTML and post it.

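    The issue is that HtmlLexer.flex declares the '%full' option, which tells the generated lexer to expect an 8-bit (single-byte) character set, i.e. character codes 0 to 255. Chinese sites use multi-byte encodings, and their characters have codes well above 255, so the lexer's lookup tables get indexed out of bounds.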
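To illustrate why an 8-bit table fails on Chinese text, here is a minimal sketch. It is not HotSAX or JFlex code: `EightBitTableDemo`, `CHAR_CLASS`, `classify` and `fitsEightBit` are hypothetical names, and the 256-entry array only stands in for the kind of table an 8-bit lexer might use.

```java
/** Hypothetical stand-in for a lexer built around an 8-bit character table. */
final class EightBitTableDemo {
    // One slot per character code 0..255, as an 8-bit scanner would assume.
    private static final int[] CHAR_CLASS = new int[256];

    /** Looks up a character class by indexing the table directly with the char code. */
    static int classify(char c) {
        return CHAR_CLASS[c]; // throws ArrayIndexOutOfBoundsException when c > 255
    }

    /** Returns true if every character of the input can be looked up in the 8-bit table. */
    static boolean fitsEightBit(String input) {
        for (int i = 0; i < input.length(); i++) {
            try {
                classify(input.charAt(i));
            } catch (ArrayIndexOutOfBoundsException e) {
                return false; // character code is outside 0..255
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(fitsEightBit("<p>hello</p>"));        // ASCII-only markup
        System.out.println(fitsEightBit("<p>\u4e2d\u6587</p>")); // markup with two Chinese characters
    }
}
```

The first call succeeds because every character is below 256; the second fails on the first Chinese character, which matches the kind of error reported above.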

    The fix, which I have tested against a Chinese website, is to replace '%full' in HtmlLexer.flex with '%unicode' and then regenerate HtmlLexer.java with the JFlex tool.

    Documentation for JFlex is at http://jflex.de/manual.html; search for "Character sets" for an explanation of this point.
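A related point, sketched under assumptions: a lexer only sees the characters its `Reader` hands it, so the page bytes need to be decoded with the page's actual charset. The example below uses only standard `java.io` and `java.nio.charset` classes; `DecodedReaderDemo`, `open` and `maxCharCode` are hypothetical names, and how HotSAX itself opens its input is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing how the chosen charset changes what a lexer would see. */
final class DecodedReaderDemo {
    /** Wraps raw page bytes in a Reader that decodes them with the given charset. */
    static Reader open(byte[] pageBytes, Charset charset) {
        return new InputStreamReader(new ByteArrayInputStream(pageBytes), charset);
    }

    /** Reads everything and returns the largest char code seen, or -1 for empty input. */
    static int maxCharCode(Reader reader) throws IOException {
        int max = -1;
        int c;
        while ((c = reader.read()) != -1) {
            max = Math.max(max, c);
        }
        return max;
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<p>\u4e2d\u6587</p>".getBytes(StandardCharsets.UTF_8);
        // Decoded correctly, the Chinese characters arrive with codes far above 255.
        System.out.println(maxCharCode(open(page, StandardCharsets.UTF_8)));
        // Decoded as a single-byte charset, every code stays below 256 but the text is garbled.
        System.out.println(maxCharCode(open(page, StandardCharsets.ISO_8859_1)));
    }
}
```

With correct decoding the lexer must cope with codes above 255, which is what the '%unicode' setting is for.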
