Parser & foreign language

Help
2005-03-02
2013-04-27
  • javacodeforger

    javacodeforger - 2005-03-02

    Hello,

    I'm trying to parse a Google search result page. If you go to www.google.com and search for the keyword "arabic", you'll see that some of the results display Arabic words, like:
    ------------------------------
    BBCArabic.com | الصفحة الرئيسية
    Home of the BBC on the Internet News, Sport, Weather, World Service, Languages, نصوص فقط, مساعدة. BBCArabic.com استمع ...
    ------------------------

    However, fetching the same page with the parser gives

    ---------------------------------------
    BBCArabic.com | ?????? ????????
    <i>The summary for this Arabic page contains characters that cannot be correctly displayed in this language/character set.</i>
    http://www.bbc.co.uk/arabic
    --------------------------------------------

    for this code:

    Parser parser = new Parser ("http://www.google.com.my/search?num=10&as_q=arabic");

    Node htmlNode = parser.elements().nextNode();
    System.out.println(htmlNode.toString());

    I checked the encoding of the Google result page; it's ISO-8859-1.

    I even tried
    parser.setEncoding("ISO-8859-1");

    before calling parser.elements().nextNode();

    It doesn't work with either ISO-8859-1 or ISO-8859-6.

    So somehow the browser is able to retrieve the foreign-encoded content and display it, but the Parser can't.

    Has anyone encountered this problem?
    It doesn't matter which language you're parsing for.

    I'm working on a metasearch and will need to support a number of other languages as well.

    rgds.
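
    One thing that may be worth ruling out, independent of what the server sends: ISO-8859-1 has no Arabic letters at all, so any step that encodes Arabic text into it (forcing the parser encoding, or printing to a Latin-1 console) turns every letter into a question mark. A minimal sketch using only the JDK; the class and method names are made up for illustration and are not part of HTML Parser:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMarkSketch {
    // Encode text to bytes in the given charset, then decode it again.
    // Characters the charset cannot represent are replaced while encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Two Arabic words, written with Unicode escapes so the
        // source file encoding does not matter.
        String arabic = "\u0646\u0635\u0648\u0635 \u0641\u0642\u0637";

        // Latin-1 cannot represent Arabic letters: they come back as '?'.
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent them, so the text survives unchanged.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));
    }
}
```

    If the question marks show up only when printing, the console encoding rather than the parser may be the culprit.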

     
    • javacodeforger

      javacodeforger - 2005-03-02

      I'm thinking that this could be due to the content negotiation phase between the Parser and the Google web server.

      Is there a way to set the supported languages and character sets in the Parser, so that the Google web server returns the content instead of a message saying that the page contains characters that cannot be correctly displayed in this language/character set?

      rgds

       
      • Derrick Oswald

        Derrick Oswald - 2005-03-02

        With recent integration builds, you can use
          parser.getConnectionManager ().setDefaultRequestProperties ();
        to alter the negotiation.

        By default it only has "User-Agent" and "Accept-Encoding", but you could add "Accept-Charset" to the Hashtable with an appropriate value (comma separated list of acceptable character sets I think), which is probably what you want.
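
        To illustrate the idea without depending on a particular HTML Parser build, here is a rough sketch that applies a Hashtable of request headers to a plain java.net.URLConnection. This is not the ConnectionManager API; the class name, the open() helper, the header values and the example.invalid URL are all assumptions for illustration. Opening the connection object does not contact the server, so this only shows how the headers get attached.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.Hashtable;
import java.util.Map;

class RequestHeaderSketch {
    // Create (but do not connect) a URLConnection and copy every
    // key/value pair from the map onto it as a request header.
    static URLConnection open(String url, Map<String, String> headers) throws IOException {
        URLConnection connection = URI.create(url).toURL().openConnection();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            connection.setRequestProperty(header.getKey(), header.getValue());
        }
        return connection;
    }

    public static void main(String[] args) throws IOException {
        Hashtable<String, String> headers = new Hashtable<>();
        // Comma-separated lists; the q-values express preference order.
        headers.put("Accept-Charset", "UTF-8, ISO-8859-1;q=0.5");
        headers.put("Accept-Language", "ar, en;q=0.5");

        // Placeholder URL; nothing is fetched here.
        URLConnection connection = open("http://example.invalid/search", headers);
        System.out.println(connection.getRequestProperty("Accept-Charset"));
    }
}
```

        Asking for UTF-8 first seems a reasonable default for multilingual pages, but which charsets a given site honours is something to check per site.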

         
        • javacodeforger

          javacodeforger - 2005-03-03

          Which build has this feature, and how do I get it?

          Thanks

           
          • javacodeforger

            javacodeforger - 2005-03-03

            Found it.
            Please ignore my last post.
            Thanks

             
    • javacodeforger

      javacodeforger - 2005-03-03

      Hi Derrick,
      I've downloaded the integration build and modified my code as below.
      However, I get the same message:

      <i>The summary for this English page contains characters that cannot be correctly displayed in this language/character set.</i>

      Please help.

      -----------------

      Parser parser = new Parser();

      // Start from the default request headers and add the charsets to accept.
      Hashtable ht = parser.getConnectionManager().getDefaultRequestProperties();

      ht.put("Accept-Charset", "ISO-8859-1, ISO-8859-6, Windows-1256");
      ht.put("Accept-Encoding", "*");

      parser.getConnectionManager().setDefaultRequestProperties(ht);

      parser.setURL("http://some.url");

       
      • Derrick Oswald

        Derrick Oswald - 2005-03-04

        I think you have to figure out what works.
        Use a browser maybe, and see what character set the site wants to send.
        Then add that to the list of accept-charset.

         
    • javacodeforger

      javacodeforger - 2005-03-09

      Hello all,

      I tried looking at the ConnectionManager class.

      I noticed it's the "mRequestProperties" variable that is used to set the request headers.

      Note line 578:

                          properties = getRequestProperties ();
                          if (null != properties)
                              for (enumeration = properties.keys (); enumeration.hasMoreElements ();)
                              {
                                  key = (String)enumeration.nextElement ();
                                  value = (String)properties.get (key);
                                  ret.setRequestProperty (key, value);
                              }

      Anyway, I tried to set the hashtable containing the Accept-Charset and Accept-Language headers here.

      When I set these headers, I get the message:

      <i>The summary for this English page contains characters that cannot be correctly displayed in this language/character set.</i>

      although the summary is actually in Russian.

      However, when I don't send anything extra in the header and just use the defaults, I get an error when trying to parse the description.

      My code is as follows:

            ht.put("Accept-Charset","UTF-8,KOI8-R");
            ht.put("Accept-Language","ar, ru");
            cm.setRequestProperties(ht);
            Parser.setConnectionManager(cm);
            Parser parser = new Parser();
            parser.setURL(stringUrlBuffer.toString());

      rgds,
      pls help

       
      • Derrick Oswald

        Derrick Oswald - 2005-03-09

        What is the URL you are trying to fetch?

         
    • javacodeforger

      javacodeforger - 2005-03-15

      Hello Derrick,

      I noticed something today with the test I've written.

      When I set Accept-Charset and Accept-Language to *, the code

            Parser parser = new Parser (stringUrlBuffer.toString());
           
            Node htmlNode = null;

            //look for the html node
            for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
            {
              htmlNode = e.nextNode();
              if(htmlNode.getText().equals("html"))
              {
                break;
              }
            }

            System.out.println(htmlNode.toHtml());

      returns:-

      ......
      <a class=yschttl  href="http://rds.yahoo.com/S=2766679/K=bbc+arabic/v=2/SID=e/l=WS1/R=1/IPC=us/SHE=0/H=0/SIG=11dhgfv25/EXP=1110955544/*-http%3A//www.bbcarabic.com/">BBCArabic.com |  </a></div></li></ol></div>

      Actually, it shows a square box symbol after the pipe symbol |.

      So it looks like when an Arabic character is returned, the parser is unable to accept it and ends right there.

      Is this something to do with the way the parser detects the end of the returned stream?

      It's like the parser cuts off the input as soon as it receives a foreign character.

      My request URL is:

      http://search.yahoo.com/search?n=10&ei=UTF-8&va=bbc+arabic

      I'm testing with Yahoo now.

      rgds.

       
      • Derrick Oswald

        Derrick Oswald - 2005-03-15

        The square box is a zero.
        The underlying reader, that has an associated character set, came upon a sequence of bytes that couldn't be converted into a character in the current encoding (there is no glyph for that code point), so it substituted zero. My mozilla browser puts a little square box filled with the unrecognized codes (in hex) in place of these unknown characters, but HTML parser just relies on the underlying Java implementation. By default nio.charset.CharsetDecoder replaces characters it cannot represent in the current encoding with zero.

        See Bug #1121401 No Parsing with yahoo!  fixed in Integration Build 1.5-20050313
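
        A small sketch of that decoder behaviour, using only the JDK (the class and method names are hypothetical). One caveat: as far as I know, current JDKs substitute U+FFFD, the Unicode replacement character, rather than a literal zero, so the exact substitute may differ from what is described above. The strict variant shows one way to detect that a byte stream does not match the assumed charset instead of silently getting placeholders.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecodeSketch {
    // Lenient decode: bytes that are invalid in the charset are
    // silently replaced by the decoder's replacement string.
    static String lenientDecode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Strict decode: report bad input instead of replacing it,
    // so a charset mismatch can be detected.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Arabic text as a server would send it in UTF-8
        // (Unicode escapes keep the source file ASCII-only).
        byte[] utf8 = "\u0646\u0635\u0648\u0635".getBytes(StandardCharsets.UTF_8);

        // Reading those bytes with the wrong charset yields placeholders.
        String wrong = lenientDecode(utf8, StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(wrong.charAt(0)));

        System.out.println(decodesCleanly(utf8, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(utf8, StandardCharsets.US_ASCII));
    }
}
```

        Either way, the placeholder is a symptom that the reader's charset does not match the bytes on the wire; for the Yahoo URL above, which requests ei=UTF-8, reading the response as UTF-8 would be the first thing to try.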

         
