Parser.extractAllNodesThatMatch skipping code

Help
bond7456
2004-07-19
2004-07-20
  • bond7456

    bond7456 - 2004-07-19

    I am using v1.41 with Java 1.42 on RH Linux and
    setting up a Parser to extract nodes that match a filter.
    I am unsure why my Parser setup causes it to
    skip over code in foreign-encoded documents.  I set
    up the parser as follows:

    URL u = new URL("http://search.yahoo.com/search?ei=UTF-8&n=20&vm=p&va=visa");

    HttpURLConnection con = (HttpURLConnection) u.openConnection();
    Parser parser = new Parser(con);
    NodeList list = parser.extractAllNodesThatMatch(nodeFilter);
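
    Since the issue seems tied to encoding, one thing that may be worth checking before handing the connection to the Parser is which charset the server actually declares. The following is only a minimal sketch using the standard library; the class name DeclaredCharset and the helper charsetFromContentType are hypothetical names made up for illustration and are not part of the parser library.

```java
import java.util.Locale;

class DeclaredCharset {
    // Hypothetical helper: pull the charset parameter out of a Content-Type
    // value such as "text/html; charset=UTF-8". Returns null if none is declared.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample header values; a real run would pass con.getContentType().
        System.out.println(charsetFromContentType("text/html; charset=UTF-8"));
        System.out.println(charsetFromContentType("text/html"));
    }
}
```

    With the connection above, the call would be charsetFromContentType(con.getContentType()); a null result would mean the server did not declare a charset in the header.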

    I placed print statements in the body of the accept
    method of my filter to print the text of each
    Node passed in as a parameter.  It appears that when
    foreign text is in the HTML (such as in the URL I used
    above, see result #20), some Nodes are never
    passed to the filter at all, though most are.

    I have searched the documentation and groups to no avail, and have
    been able to reproduce the problem consistently with
    any document containing 'strange' chars (Japanese, Chinese, Russian, etc.).

    Is there a problem with my implementation?

    -Bond7456

     
    • bond7456

      bond7456 - 2004-07-19

      I want to clarify that the problem isn't with the parsing of the foreign chars themselves, but that if there are foreign chars on a page it seems to mess up other results.

       
      • Derrick Oswald

        Derrick Oswald - 2004-07-20

        Yes, it seems it is failing on some Japanese text extracted from the visa.co.jp website:

        <a class=yschttl href="http://rds.yahoo.com/S=2766679/K=visa/v=2/SID=e/l=WS1/R=14/H=0/SHE=0/*-http://www.visa.co.jp/">&#12499;&#12470;&#65381;&#12452;&#12531;&#12479;&#12540;&#12490;&#12471;&#12519;&#12490;&#12523;&#20844;&#24335;&#12507;&#12540;&#12512;&#12506;&#12540;&#12472;</a>

        It never sees the </a> end tag and runs off the end of the file.
        I'm not sure what the solution is; the page indicates it's UTF-8, but those sure don't look like UTF-8 to me.  The originating site also says it has a UTF-8 encoding but then intermixes kanji and katakana characters, presumably using SHIFT-JIS or something, because Mozilla seems to understand it.

        It might be best to log this as a bug, so it can be tracked.
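
        For anyone who wants to see what the link text in the anchor above contains, here is a minimal sketch that assumes the &#NNNNN; sequences are decimal numeric character references, which are plain ASCII in the page source whatever the page encoding is. It uses only the standard library; the class name NumericEntityDecoder and its decode method are hypothetical names for illustration, not part of the parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntityDecoder {
    // Matches decimal references such as &#12499; (at most 7 digits).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    // Replace each decimal reference with the character it names.
    // References that are not valid Unicode code points are left untouched.
    static String decode(String text) {
        Matcher m = DECIMAL_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The first few references from the anchor text quoted above.
        String sample = "&#12499;&#12470;&#65381;&#12452;&#12531;&#20844;&#24335;";
        String decoded = decode(sample);
        // Print each code point with the Unicode block it belongs to.
        decoded.codePoints().forEach(cp ->
                System.out.printf("U+%04X %s%n", cp, Character.UnicodeBlock.of(cp)));
    }
}
```

        Running main prints the code point and Unicode block for each reference, which is a quick way to check whether the text is katakana, kanji, or something else before deciding where the parsing goes wrong.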

         
