I am using v1.4.1 with Java 1.4.2 on Red Hat Linux, setting up a Parser to extract nodes against a filter. I am unsure why my setup of the Parser causes it to skip over markup in foreign-encoded documents. I set up the parser as follows:
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;

URL u = new URL("http://search.yahoo.com/search?ei=UTF-8&n=20&vm=p&va=visa");
HttpURLConnection con = (HttpURLConnection) u.openConnection();
Parser parser = new Parser(con);
NodeList list = parser.extractAllNodesThatMatch(nodeFilter);
I placed print statements in the body of my filter's accept method to inspect the text of the Node passed as a parameter. It appears that when foreign text is present in the HTML (such as in the URL I used above; see result #20), some Nodes are never passed to the filter, though most are.
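Roughly, the diagnostic filter looks like this (a simplified sketch against the org.htmlparser NodeFilter interface, not my exact filter; it just prints and accepts everything):

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;

// Simplified diagnostic filter: print whatever text the parser offers,
// then accept the node. Any node whose text never prints was never
// handed to accept() at all.
NodeFilter nodeFilter = new NodeFilter() {
    public boolean accept(Node node) {
        System.out.println(node.getText());
        return true;
    }
};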
I have searched the documentation and the groups to no avail, and I can reproduce the problem consistently with any document containing non-ASCII characters (Japanese, Chinese, Russian, etc.). Is there a problem with my implementation?
-Bond7456
I want to clarify that the problem isn't the parsing of the foreign characters themselves; rather, the presence of foreign characters on a page seems to corrupt the results for other nodes.
Yes, it seems to be failing on some Japanese text extracted from the visa.co.jp website:
<a class=yschttl href="http://rds.yahoo.com/S=2766679/K=visa/v=2/SID=e/l=WS1/R=14/H=0/SHE=0/*-http://www.visa.co.jp/">ビザ・インターナショナル公式ホームページ</a>
The parser never sees the </a> end tag and runs off the end of the file.
I'm not sure what the solution is; the page indicates it's UTF-8, but those bytes sure don't look like UTF-8 to me. The originating site also claims a UTF-8 encoding but then intermixes kanji and katakana characters, presumably using Shift-JIS or something similar, since Mozilla seems to understand it.
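One possible workaround, if the real encoding can be guessed: override the declared charset before parsing. A minimal sketch, assuming the Parser.setEncoding method is available in this version and that the page is really Shift_JIS (both are guesses on my part):

import org.htmlparser.Parser;

Parser parser = new Parser("http://www.visa.co.jp/");
// Replace the page's (apparently wrong) UTF-8 declaration with the
// suspected real charset; "Shift_JIS" is an assumption based on the
// symptoms above, not something the page declares.
parser.setEncoding("Shift_JIS");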
It might be best to log this as a bug, so it can be tracked.