Thread: Re: [Htmlparser-user] Character Encoding
|
From: Derrick O. <der...@ro...> - 2008-05-13 01:17:42
|
That <meta> tag doesn't look like the problem.
If you use the built-in ConnectionMonitor on the parser, you can see
the header:
C:>java -classpath parser\target\htmlparser.jar;lexer\target\htmllexer.jar org.htmlparser.Parser http://cbc.ca
INFO: GET http://cbc.ca HTTP/1.1
Accept-Encoding: gzip, deflate
User-Agent: HTMLParser/2.0
INFO: HTTP/1.1 301 Moved Permanently
Date: Tue, 13 May 2008 01:12:31 GMT
Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
Location: http://www.cbc.ca/
Cache-Control: max-age=120
Expires: Tue, 13 May 2008 01:14:31 GMT
Content-Length: 226
Keep-Alive: timeout=15, max=150
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
INFO: GET http://www.cbc.ca/ HTTP/1.1
Accept-Encoding: gzip, deflate
User-Agent: HTMLParser/2.0
INFO: HTTP/1.1 200 OK
Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
Accept-Ranges: bytes
Content-Type: text/html
Cache-Control: max-age=61
Expires: Tue, 13 May 2008 01:13:32 GMT
Date: Tue, 13 May 2008 01:12:31 GMT
Content-Length: 28625
Connection: keep-alive
----- Original Message ----
From: "bo...@ti..." <bo...@ti...>
To: htm...@li...
Sent: Monday, May 12, 2008 7:55:56 AM
Subject: [Htmlparser-user] Character Encoding
Thanks Derrick,
The page in question includes the following tags:
<META http-equiv=Content-Type content="text/html; charset=utf-8">
<META http-equiv=content-type>
I don't understand why the second one is there but it really is. With
that information can you suggest a resolution? I am not entirely sure
how to verify your point (1).
Best Regards
Brian
-----------------------------------------------------------------------------
There are two possibilities.
1) The HTTP server is/is not serving up content type meta information
in the HTTP header like so:
text/html; charset=utf-8
2) The source HTML does/does not contain a meta tag like so:
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
You need to determine which one so the appropriate 'fix' can be
applied.
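A note on checking the two possibilities above: both the HTTP Content-Type header and the content attribute of the meta tag use the same "type; charset=name" syntax, so one small helper can read the charset from either. The following is only a sketch using plain Java, not HTMLParser itself; the class name ContentTypeCharset and the method charsetOf are made up for illustration.

```java
import java.util.Locale;
import java.util.Optional;

public class ContentTypeCharset {

    /**
     * Returns the charset parameter of a Content-Type value such as
     * "text/html; charset=utf-8", or an empty Optional if none is declared.
     */
    public static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The value may optionally be wrapped in double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The two Content-Type values seen in the ConnectionMonitor output above.
        System.out.println(charsetOf("text/html; charset=iso-8859-1").orElse("(none declared)"));
        System.out.println(charsetOf("text/html").orElse("(none declared)"));
    }
}
```

If the header and the meta tag report different charsets, that disagreement is the thing to investigate first.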
----- Original Message ----
From: "bo...@ti..." <bo...@ti...>
To: htm...@li...
Sent: Monday, May 12, 2008 7:31:39 AM
Subject: [Htmlparser-user] Character Encoding
Hi,
I have a strange problem and I can’t get my head around it. Hopefully
someone can point me in the right direction. I’m using the following
code with HTMLParser 1.6 to retrieve web pages:
parser = new Parser (URL);
ThePage = parser.parse (null);
MyPage = ThePage.toHtml();
On some pages (not all…) if the HTML page contains:
£10 Free
“My Page” contains “?10 Free” on other pages it works fine.
I guess it has something to do with character encoding? Can someone
suggest what I add where to get this to work correctly (I would like
to keep the “£10 Free”)
Thanks in advance
Brian
_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
|
From: <bo...@ti...> - 2008-05-13 07:34:11
|
Thanks Derrick,
The relevant section of the ConnectionMonitor output is:
INFO: HTTP/1.1 200 OK
Cache-Control: private
Content-Type: text/html; charset=ISO-8859-1
Transfer-Encoding: chunked
Does that help?
Thanks
Brian
|
From: <bo...@ti...> - 2008-05-14 08:23:47
|
All the pages which don't work come from the same source, and they all
have these meta tags. I believe there is an option to force decoding
with a different character set, but the way I retrieve the pages doesn't
seem to give me the opportunity to do so. If someone can give me a few
lines of sample code showing how to do that, I would appreciate it.
What I do at the moment is:
parser = new Parser(URL);
ThePage = parser.parse(null);
MyPage = ThePage.toHtml();
And that doesn't give me the opportunity to change the decoding. I believe
you can read the page and then "force" decoding with a different
character set, but I can't figure out how to do that. Is there an
example somewhere of how to do this?
Thanks again
Brian
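For reference, one library-independent way to "force" a character set is to fetch the raw bytes yourself and decode them with the charset you choose. The sketch below uses only the Java standard library; the class name ForcedDecode and its methods are hypothetical, and it does not show HTMLParser's own mechanism. The Parser class is believed to offer a setEncoding(String) method for the same purpose, but that is an assumption to verify against the Javadoc for your version.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.Charset;

public class ForcedDecode {

    /** Reads the stream to the end and returns the undecoded bytes. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    /** Decodes the bytes with the charset the caller names, ignoring any declared one. */
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ForcedDecode <page-url> <charset-name>
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            System.out.println(decode(readAll(in), args[1]));
        }
    }
}
```

Trying the same page with ISO-8859-1 and then UTF-8 shows which charset the bytes were really written in: the right one keeps the pound sign intact.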
----- Original Message ----
There might be an issue between ISO-8859-1 and UTF-8.
Here's a random explanation - out of many on the net -
http://www.stanford.edu/~laurik/fsmbook/faq/utf8.html
You'll have to determine if the character you want has an encoding in
ISO-8859-1.
The parser should switch to interpreting in UTF-8 when it encounters
the meta tag.
Do all pages have the meta tag, or just the ones that are OK?
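To make the ISO-8859-1 versus UTF-8 point concrete, here is a small stand-alone sketch in plain Java (no HTMLParser involved; the class name PoundSignDemo is made up). It checks whether a character exists in ISO-8859-1 and shows what happens to "£10 Free" when bytes written in one charset are read as the other. Whether this is what actually happens on the pages in question is an assumption; the thread does not confirm it.

```java
import java.nio.charset.StandardCharsets;

public class PoundSignDemo {

    /** True if the character has an encoding in ISO-8859-1. */
    public static boolean canEncodeInLatin1(char c) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(c);
    }

    /** Text stored as ISO-8859-1 bytes but decoded as UTF-8. */
    public static String latin1BytesReadAsUtf8(String text) {
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Text stored as UTF-8 bytes but decoded as ISO-8859-1. */
    public static String utf8BytesReadAsLatin1(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Prints non-ASCII characters as \\uXXXX so the console encoding cannot hide them. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00A310 Free"; // the pound sign is U+00A3
        System.out.println("pound sign in ISO-8859-1: " + canEncodeInLatin1('\u00A3'));
        System.out.println("ISO-8859-1 bytes read as UTF-8: " + escape(latin1BytesReadAsUtf8(original)));
        System.out.println("UTF-8 bytes read as ISO-8859-1: " + escape(utf8BytesReadAsLatin1(original)));
    }
}
```

In the first mismatch the lone byte 0xA3 is not valid UTF-8, so Java substitutes the replacement character U+FFFD, which many consoles and fonts display as "?". That would match the "?10 Free" symptom if the page bytes are really ISO-8859-1 while the meta tag claims utf-8. The opposite mismatch produces a stray "Â" before the pound sign instead.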
|