Hello,
I'm trying to parse a Google search result page. If you go to
www.google.com
and search with the keyword
arabic
you'll see that some of the results display Arabic words, like
------------------------------
BBCArabic.com | الصفحة الرئيسية
Home of the BBC on the Internet News, Sport, Weather, World Service, Languages, نصوص فقط, مساعدة. BBCArabic.com استمع ...
------------------------
However, the parser gives
---------------------------------------
BBCArabic.com | ?????? ????????
<i>The summary for this Arabic page contains characters that cannot be correctly displayed in this language/character set.</i>
http://www.bbc.co.uk/arabic
--------------------------------------------
for the code
Parser parser = new Parser ("http://www.google.com.my/search?num=10&as_q=arabic");
Node htmlNode = parser.elements().nextNode();
System.out.println(htmlNode.toString());
I've tried to retrieve the Google result page's encoding;
it's ISO-8859-1.
I even tried
parser.setEncoding("ISO-8859-1");
before calling parser.elements().nextNode().
It doesn't work with either ISO-8859-1 or ISO-8859-6.
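For what it's worth, forcing Arabic text through ISO-8859-1 is guaranteed to lose it: the Latin-1 encoder has no mapping for Arabic letters and substitutes '?', which is exactly what the "??????" output above looks like. A small pure-JDK illustration (the sample letters are arbitrary):

```java
public class Latin1LossDemo {
    public static void main(String[] args) throws Exception {
        // Two Arabic letters: ALEF (U+0627) and LAM (U+0644).
        String arabic = "\u0627\u0644";
        // ISO-8859-1 cannot represent these code points; String.getBytes
        // replaces each unmappable character with the charset's default
        // replacement byte, which for ISO-8859-1 is '?'.
        byte[] bytes = arabic.getBytes("ISO-8859-1");
        System.out.println(new String(bytes, "ISO-8859-1")); // prints "??"
    }
}
```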
So what happens here is that somehow the
browser is able retrieve the foreign encoded content & display them
but the Parser can't
Has anyone encountered this problem?
It doesn't matter which language u're parsing for.
I'm working on a metasearch & will need to support a number of other languages as well.
rgds.
I'm thinking that this could be due to the
content negotiation phase
between the Parser and the Google web server.
Is there a way to set the supported languages and character sets in the Parser,
so that the Google web server will return the content instead of a message saying that the
page contains characters that cannot be correctly displayed in this language/character set?
rgds
With recent integration builds, you can use
parser.getConnectionManager ().setDefaultRequestProperties ();
to alter the negotiation.
By default it only has "User-Agent" and "Accept-Encoding", but you could add "Accept-Charset" to the Hashtable with an appropriate value (a comma-separated list of acceptable character sets, I think), which is probably what you want.
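As a minimal sketch of the idea (the header values here are illustrative assumptions, not the library's actual defaults), building such a request-property table might look like this:

```java
import java.util.Hashtable;

public class AcceptCharsetSketch {
    // Build a request-property table of the shape described above:
    // the default entries plus an Accept-Charset entry holding a
    // comma-separated list of acceptable character sets.
    static Hashtable buildRequestProperties() {
        Hashtable ht = new Hashtable();
        ht.put("User-Agent", "HTMLParser/1.5");        // assumed default
        ht.put("Accept-Encoding", "gzip, deflate");    // assumed default
        ht.put("Accept-Charset", "UTF-8, ISO-8859-6, windows-1256");
        return ht;
    }

    public static void main(String[] args) {
        Hashtable ht = buildRequestProperties();
        System.out.println(ht.get("Accept-Charset"));
        // With htmlparser, the table would then be installed via:
        // parser.getConnectionManager().setDefaultRequestProperties(ht);
    }
}
```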
Which build has this feature,
and how do I get it?
Thanks.
Found it.
Please ignore my last post.
Thank you.
Hi Derrick,
I've downloaded the integration build and modified my code as below.
However, I get the same thing:
<i>The summary for this English page contains characters that cannot be correctly displayed in this language/character set.</i>
Please help.
-----------------
Parser parser = new Parser ();
Hashtable ht = parser.getConnectionManager ().getDefaultRequestProperties ();
ht.put ("Accept-Charset", "ISO-8859-1, ISO-8859-6, windows-1256");
ht.put ("Accept-Encoding", "*");
parser.getConnectionManager ().setDefaultRequestProperties (ht);
parser.setURL ("http://some.url");
I think you have to figure out what works.
Use a browser maybe, and see what character set the site wants to send.
Then add that to the list of accept-charset.
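One way to do that programmatically rather than by eyeballing a browser: the character set the server wants to send is carried in the Content-Type response header, e.g. "text/html; charset=windows-1256". A small plain-JDK helper to pull it out (the class and method names here are hypothetical, just for illustration):

```java
public class CharsetSniffer {
    // Extract the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=windows-1256" -> "windows-1256".
    // Returns null if no charset parameter is present.
    static String charsetOf(String contentType) {
        for (String part : contentType.split(";")) {
            part = part.trim();
            if (part.toLowerCase().startsWith("charset="))
                return part.substring("charset=".length()).trim();
        }
        return null;
    }

    public static void main(String[] args) {
        // In real use the header would come from, e.g.,
        // connection.getContentType() on a java.net.URLConnection.
        System.out.println(charsetOf("text/html; charset=windows-1256"));
    }
}
```

Whatever charset comes back is the one to add to the Accept-Charset list.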
Hello all,
I tried looking at the
ConnectionManager class.
I noticed it's the "mRequestProperties" variable that is used to set the request headers.
Notice line 578:
properties = getRequestProperties ();
if (null != properties)
    for (enumeration = properties.keys (); enumeration.hasMoreElements (); )
    {
        key = (String)enumeration.nextElement ();
        value = (String)properties.get (key);
        ret.setRequestProperty (key, value);
    }
Anyway, I tried to set the Hashtable containing the Accept-Charset and Accept-Language here.
When I try to set these headers,
I get the message:
<i>The summary for this English page contains characters that cannot be correctly displayed in this language/character set.</i>
although the summary is actually in Russian.
However, when I don't send anything in the header,
just using the defaults,
I get an error when trying to parse the description.
My code is as follows:
Hashtable ht = new Hashtable ();
ht.put ("Accept-Charset", "UTF-8, KOI8-R");
ht.put ("Accept-Language", "ar, ru");
ConnectionManager cm = Parser.getConnectionManager ();
cm.setRequestProperties (ht);
Parser.setConnectionManager (cm);
Parser parser = new Parser ();
parser.setURL (stringUrlBuffer.toString ());
rgds,
please help
What is the URL you are trying to fetch?
Hi,
http://www.google.com.my/search?num=10&as_q=russian
This is the URL for the page with Russian content.
http://www.google.com.my/search?num=10&as_q=arabic
This one contains Arabic content.
Hello Derrick,
I noticed something today with the test I've written.
When I set Accept-Charset and Accept-Language to *,
the code
Parser parser = new Parser (stringUrlBuffer.toString ());
Node htmlNode = null;
// look for the html node
for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
{
    htmlNode = e.nextNode ();
    if (htmlNode.getText ().equals ("html"))
    {
        break;
    }
}
System.out.println (htmlNode.toHtml ());
returns:-
......
<a class=yschttl href="http://rds.yahoo.com/S=2766679/K=bbc+arabic/v=2/SID=e/l=WS1/R=1/IPC=us/SHE=0/H=0/SIG=11dhgfv25/EXP=1110955544/*-http%3A//www.bbcarabic.com/">BBCArabic.com | </a></div></li></ol></div>
Actually it shows a square box symbol after the pipe symbol (|).
So it looks like when an Arabic character is returned, the parser is unable to accept it and ends right there.
Is this something to do with the way the parser detects the end of the returned stream?
It's like the parser cuts off the input as soon as it receives a foreign character.
My request URL is:
http://search.yahoo.com/search?n=10&ei=UTF-8&va=bbc+arabic
I'm testing with Yahoo now.
rgds.
The square box is a zero.
The underlying reader, which has an associated character set, came upon a sequence of bytes that couldn't be converted into a character in the current encoding (there is no glyph for that code point), so it substituted zero. My Mozilla browser puts a little square box filled with the unrecognized codes (in hex) in place of these unknown characters, but HTML Parser just relies on the underlying Java implementation. By default, java.nio.charset.CharsetDecoder replaces characters it cannot represent in the current encoding with zero.
See Bug #1121401 "No Parsing with yahoo!", fixed in Integration Build 1.5-20050313.
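A small pure-JDK demonstration of this replacement behavior (one caveat: in the JVMs I'm familiar with, the decoder's default replacement character is actually U+FFFD rather than zero, and U+FFFD is exactly the kind of square box described above):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class DecoderReplacementDemo {
    public static void main(String[] args) throws Exception {
        // Arabic letter ALEF (U+0627) encoded as UTF-8: bytes 0xD8 0xA7.
        byte[] arabicBytes = "\u0627".getBytes("UTF-8");

        // Decode those bytes with a US-ASCII decoder, configured (like the
        // decoders behind InputStreamReader) to REPLACE rather than fail.
        CharsetDecoder decoder = Charset.forName("US-ASCII").newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer out = decoder.decode(ByteBuffer.wrap(arabicBytes));

        // Each byte >= 0x80 is malformed in US-ASCII, so each is replaced.
        for (int i = 0; i < out.length(); i++)
            System.out.printf("U+%04X%n", (int) out.charAt(i));
    }
}
```

Either way, the visible symptom is the same: a character the chosen charset cannot represent comes out as a placeholder, not as the original Arabic letter.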