I am using v1.4.1 with Java 1.4.2 on Red Hat Linux, setting up a Parser to extract nodes against a filter. I am unsure why my setup of the Parser causes it to skip over markup in foreign-encoded documents. I set up the parser as follows:
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;

URL u = new URL("http://search.yahoo.com/search?ei=UTF-8&n=20&vm=p&va=visa");
HttpURLConnection con = (HttpURLConnection) u.openConnection();
Parser parser = new Parser(con);
NodeList list = parser.extractAllNodesThatMatch(nodeFilter);
I placed print statements in the body of my filter's accept method to inspect the text of the Node passed as a parameter. It appears that when foreign text is present in the HTML (such as in the URL I used above; see result #20), some Nodes are never passed to the filter, though most are.
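Roughly, the diagnostic filter looks like this (a simplified sketch against the org.htmlparser NodeFilter interface, not my exact filter; it just prints and accepts everything):

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;

// Simplified diagnostic filter: print whatever text the parser offers,
// then accept the node. Any node whose text never prints was never
// handed to accept() at all.
NodeFilter nodeFilter = new NodeFilter() {
    public boolean accept(Node node) {
        System.out.println(node.getText());
        return true;
    }
};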
I have searched the documentation and the groups to no avail, and I can reproduce the problem consistently with any document containing non-ASCII characters (Japanese, Chinese, Russian, etc.). Is there a problem with my implementation?
-Bond7456
I want to clarify that the problem isn't the parsing of the foreign characters themselves; rather, the presence of foreign characters on a page seems to corrupt the results for other nodes.
Yes, it seems to be failing on some Japanese text extracted from the visa.co.jp website:
<a class=yschttl href="http://rds.yahoo.com/S=2766679/K=visa/v=2/SID=e/l=WS1/R=14/H=0/SHE=0/*-http://www.visa.co.jp/">ビザ・インターナショナル公式ホームページ</a>
The parser never sees the </a> end tag and runs off the end of the file.
I'm not sure what the solution is; the page indicates it's UTF-8, but those bytes sure don't look like UTF-8 to me. The originating site also claims a UTF-8 encoding but then intermixes kanji and katakana characters, presumably using Shift-JIS or something similar, since Mozilla seems to understand it.
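One possible workaround, if the real encoding can be guessed: override the declared charset before parsing. A minimal sketch, assuming the Parser.setEncoding method is available in this version and that the page is really Shift_JIS (both are guesses on my part):

import org.htmlparser.Parser;

Parser parser = new Parser("http://www.visa.co.jp/");
// Replace the page's (apparently wrong) UTF-8 declaration with the
// suspected real charset; "Shift_JIS" is an assumption based on the
// symptoms above, not something the page declares.
parser.setEncoding("Shift_JIS");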
It might be best to log this as a bug, so it can be tracked.