Hello,
I'm trying to parse a Google search result page. If you go to
www.google.com
and search with the keyword
arabic
you'll see that some of the results display Arabic words, like
------------------------------
BBCArabic.com | الصفحة الرئيسية
Home of the BBC on the Internet News, Sport, Weather, World Service, Languages, نصوص فقط, مساعدة. BBCArabic.com استمع ...
------------------------
However, the parser gives
---------------------------------------
BBCArabic.com | ?????? ????????
<i>The summary for this Arabic page contains characters that cannot be correctly displayed in this language/character set.</i>
http://www.bbc.co.uk/arabic
--------------------------------------------
for the code
Parser parser = new Parser ("http://www.google.com.my/search?num=10&as_q=arabic");
Node htmlNode = parser.elements().nextNode();
System.out.println(htmlNode.toString());
I've tried to retrieve the Google result page's encoding;
it's ISO-8859-1.
I even tried
parser.setEncoding("ISO-8859-1");
before calling parser.elements().nextNode().
It doesn't work with either ISO-8859-1 or ISO-8859-6.
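For what it's worth, forcing Arabic text through ISO-8859-1 is guaranteed to lose it: the Latin-1 encoder has no mapping for Arabic letters and substitutes '?', which is exactly what the "??????" output above looks like. A small pure-JDK illustration (the sample letters are arbitrary):

```java
public class Latin1LossDemo {
    public static void main(String[] args) throws Exception {
        // Two Arabic letters: ALEF (U+0627) and LAM (U+0644).
        String arabic = "\u0627\u0644";
        // ISO-8859-1 cannot represent these code points; String.getBytes
        // replaces each unmappable character with the charset's default
        // replacement byte, which for ISO-8859-1 is '?'.
        byte[] bytes = arabic.getBytes("ISO-8859-1");
        System.out.println(new String(bytes, "ISO-8859-1")); // prints "??"
    }
}
```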
So what happens here is that somehow the
browser is able retrieve the foreign encoded content & display them
but the Parser can't
Has anyone encountered this problem?
It doesn't matter which language u're parsing for.
I'm working on a metasearch & will need to support a number of other languages as well.
rgds.
I'm thinking that this could be due to the
content negotiation phase
between the Parser and the Google web server.
Is there a way to set the supported languages and character sets in the Parser,
so that the Google web server will return the content instead of a message saying that the
page contains characters that cannot be correctly displayed in this language/character set?
rgds
With recent integration builds, you can use
parser.getConnectionManager ().setDefaultRequestProperties ();
to alter the negotiation.
By default it only has "User-Agent" and "Accept-Encoding", but you could add "Accept-Charset" to the Hashtable with an appropriate value (a comma-separated list of acceptable character sets, I think), which is probably what you want.
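As a minimal sketch of the idea (the header values here are illustrative assumptions, not the library's actual defaults), building such a request-property table might look like this:

```java
import java.util.Hashtable;

public class AcceptCharsetSketch {
    // Build a request-property table of the shape described above:
    // the default entries plus an Accept-Charset entry holding a
    // comma-separated list of acceptable character sets.
    static Hashtable buildRequestProperties() {
        Hashtable ht = new Hashtable();
        ht.put("User-Agent", "HTMLParser/1.5");        // assumed default
        ht.put("Accept-Encoding", "gzip, deflate");    // assumed default
        ht.put("Accept-Charset", "UTF-8, ISO-8859-6, windows-1256");
        return ht;
    }

    public static void main(String[] args) {
        Hashtable ht = buildRequestProperties();
        System.out.println(ht.get("Accept-Charset"));
        // With htmlparser, the table would then be installed via:
        // parser.getConnectionManager().setDefaultRequestProperties(ht);
    }
}
```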
Which build has this feature,
and how do I get it?
Thanks.
Found it.
Please ignore my last post.
Thank you.
Hi Derrick,
I've downloaded the integration build and modified my code as below.
However, I get the same thing:
<i>The summary for this English page contains characters that cannot be correctly displayed in this language/character set.</i>
Please help.
-----------------
Parser parser = new Parser ();
Hashtable ht = parser.getConnectionManager ().getDefaultRequestProperties ();
ht.put ("Accept-Charset", "ISO-8859-1, ISO-8859-6, windows-1256");
ht.put ("Accept-Encoding", "*");
parser.getConnectionManager ().setDefaultRequestProperties (ht);
parser.setURL ("http://some.url");
I think you have to figure out what works.
Use a browser maybe, and see what character set the site wants to send.
Then add that to the list of accept-charset.
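One way to do that programmatically rather than by eyeballing a browser: the character set the server wants to send is carried in the Content-Type response header, e.g. "text/html; charset=windows-1256". A small plain-JDK helper to pull it out (the class and method names here are hypothetical, just for illustration):

```java
public class CharsetSniffer {
    // Extract the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=windows-1256" -> "windows-1256".
    // Returns null if no charset parameter is present.
    static String charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            part = part.trim();
            if (part.toLowerCase().startsWith("charset="))
                return part.substring("charset=".length()).trim();
        }
        return null;
    }

    public static void main(String[] args) {
        // In real use the header would come from, e.g.,
        // connection.getContentType() on a java.net.URLConnection.
        System.out.println(charsetOf("text/html; charset=windows-1256"));
    }
}
```

Whatever charset comes back is the one to add to the Accept-Charset list.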
Hello all,
I tried looking at the
ConnectionManager class.
I noticed it's the "mRequestProperties" variable that is used to set the request headers.
Notice line 578:
properties = getRequestProperties ();
if (null != properties)
    for (enumeration = properties.keys (); enumeration.hasMoreElements (); )
    {
        key = (String)enumeration.nextElement ();
        value = (String)properties.get (key);
        ret.setRequestProperty (key, value);
    }
Anyway, I tried to set the Hashtable containing the Accept-Charset and Accept-Language here.
When I try to set these headers,
I get the message:
<i>The summary for this English page contains characters that cannot be correctly displayed in this language/character set.</i>
although the summary is actually in Russian.
However, when I don't send anything in the header,
just using the defaults,
I get an error when trying to parse the description.
My code is as follows:
Hashtable ht = new Hashtable ();
ht.put ("Accept-Charset", "UTF-8, KOI8-R");
ht.put ("Accept-Language", "ar, ru");
ConnectionManager cm = Parser.getConnectionManager ();
cm.setRequestProperties (ht);
Parser.setConnectionManager (cm);
Parser parser = new Parser ();
parser.setURL (stringUrlBuffer.toString ());
rgds,
please help
What is the URL you are trying to fetch?
Hi,
http://www.google.com.my/search?num=10&as_q=russian
This is the URL for the page with Russian content.
http://www.google.com.my/search?num=10&as_q=arabic
This one contains Arabic content.
Hello Derrick,
I noticed something today with the test I've written.
When I set Accept-Charset and Accept-Language to *,
the code
Parser parser = new Parser (stringUrlBuffer.toString ());
Node htmlNode = null;
// look for the html node
for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
{
    htmlNode = e.nextNode ();
    if (htmlNode.getText ().equals ("html"))
    {
        break;
    }
}
System.out.println (htmlNode.toHtml ());
returns:-
......
<a class=yschttl href="http://rds.yahoo.com/S=2766679/K=bbc+arabic/v=2/SID=e/l=WS1/R=1/IPC=us/SHE=0/H=0/SIG=11dhgfv25/EXP=1110955544/*-http%3A//www.bbcarabic.com/">BBCArabic.com | </a></div></li></ol></div>
Actually it shows a square box symbol after the pipe symbol (|).
So it looks like when an Arabic character is returned, the parser is unable to accept it and ends right there.
Is this something to do with the way the parser detects the end of the returned stream?
It's like the parser cuts off the input as soon as it receives a foreign character.
My request URL is:
http://search.yahoo.com/search?n=10&ei=UTF-8&va=bbc+arabic
I'm testing with Yahoo now.
rgds.
The square box is a zero.
The underlying reader, which has an associated character set, came upon a sequence of bytes that couldn't be converted into a character in the current encoding (there is no glyph for that code point), so it substituted zero. My Mozilla browser puts a little square box filled with the unrecognized codes (in hex) in place of these unknown characters, but HTML Parser just relies on the underlying Java implementation. By default, java.nio.charset.CharsetDecoder replaces characters it cannot represent in the current encoding with zero.
See Bug #1121401 "No Parsing with yahoo!", fixed in Integration Build 1.5-20050313.
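A small pure-JDK demonstration of this replacement behavior (one caveat: in the JVMs I'm familiar with, the decoder's default replacement character is actually U+FFFD rather than zero, and U+FFFD is exactly the kind of square box described above):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class DecoderReplacementDemo {
    public static void main(String[] args) throws Exception {
        // Arabic letter ALEF (U+0627) encoded as UTF-8: bytes 0xD8 0xA7.
        byte[] arabicBytes = "\u0627".getBytes("UTF-8");

        // Decode those bytes with a US-ASCII decoder, configured (like the
        // decoders behind InputStreamReader) to REPLACE rather than fail.
        CharsetDecoder decoder = Charset.forName("US-ASCII").newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer out = decoder.decode(ByteBuffer.wrap(arabicBytes));

        // Each byte >= 0x80 is malformed in US-ASCII, so each is replaced.
        for (int i = 0; i < out.length(); i++)
            System.out.printf("U+%04X%n", (int) out.charAt(i));
    }
}
```

Either way, the visible symptom is the same: a character the chosen charset cannot represent comes out as a placeholder, not as the original Arabic letter.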