Re: [Htmlparser-user] Character Encoding

All the pages that don't work come from the same source, and they all 
have these meta tags. I believe there is an option to force decoding 
with a different character set, but the way I retrieve the pages doesn't 
seem to give me the opportunity to do so. If someone can give me a few 
lines of sample code showing how to do that, I would appreciate it.

What I do at the moment is:

    parser  = new Parser(URL);
    ThePage = parser.parse(null);
    MyPage  = ThePage.toHtml();

And that doesn't give me the opportunity to change the decoding. I 
believe you can read the page and then "force" decoding with a different 
character set, but I can't figure out how to do that. Is there an 
example somewhere of how to do this?
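
For anyone reading this in the archive: below is a minimal plain-JDK sketch of what "forcing" a character set means, assuming you can get at the raw bytes of the page yourself. It does not use HTMLParser at all; the class name ForcedDecode and the method readAs are made up for illustration, and the byte array stands in for a real page body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HTMLParser.
public class ForcedDecode {

    /** Reads the whole stream and decodes it with the charset the caller chooses. */
    static String readAs(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the page body: a pound sign followed by "10 Free", as UTF-8 bytes.
        byte[] body = "\u00a310 Free".getBytes(StandardCharsets.UTF_8);

        // Decode as UTF-8 regardless of what any header claims.
        String page = readAs(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        System.out.println(page.equals("\u00a310 Free")); // prints "true"
    }
}
```

With a real page, the ByteArrayInputStream would be replaced by the stream from new java.net.URL(address).openStream(). Whether the decoded string can then be handed back to the parser depends on the HTMLParser version, so check its Javadoc.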

Thanks again

Brian    

----- Original Message ----
There might be a mismatch between ISO-8859-1 and UTF-8.
Here's a random explanation - out of many on the net - 
http://www.stanford.edu/~laurik/fsmbook/faq/utf8.html
You'll have to determine whether the character you want has an encoding 
in ISO-8859-1.
The parser should switch to interpreting in UTF-8 when it encounters 
the meta tag.
Do all pages have the meta tag, or just the ones that are OK?
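
The ISO-8859-1 versus UTF-8 point can be checked with a few lines of plain JDK code. This is only an illustrative sketch; the class PoundSignDemo and its two helpers are made-up names, and nothing here touches HTMLParser. It shows that £ does exist in ISO-8859-1, that UTF-8 bytes read as ISO-8859-1 produce the "Â£" pattern, and that pushing the text through a charset without £ (US-ASCII here) yields the "?" seen in the output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HTMLParser.
public class PoundSignDemo {

    /** Encodes text with one charset, then decodes the same bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    /** Reports whether the charset has an encoding for the character. */
    static boolean canRepresent(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        String text = "\u00a310 Free"; // pound sign followed by "10 Free"

        System.out.println(canRepresent(StandardCharsets.ISO_8859_1, '\u00a3')); // prints "true"
        System.out.println(canRepresent(StandardCharsets.US_ASCII, '\u00a3'));   // prints "false"

        // UTF-8 bytes (0xC2 0xA3) read as ISO-8859-1 become two characters.
        String misread = reinterpret(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals("\u00c2\u00a310 Free")); // prints "true"

        // A charset with no pound sign substitutes '?' when encoding.
        String lossy = reinterpret(text, StandardCharsets.US_ASCII, StandardCharsets.US_ASCII);
        System.out.println(lossy); // prints "?10 Free"
    }
}
```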

----- Original Message ----
From: "bo...@ti..." <bo...@ti...>
To: bo...@ti...; htm...@li...
Sent: Tuesday, May 13, 2008 3:33:57 AM
Subject: Re: [Htmlparser-user] Character Encoding

Thanks Derrick,

The relevant section of the ConnectionMonitor output is:

INFO: HTTP/1.1 200 OK
Cache-Control: private
Content-Type: text/html; charset=ISO-8859-1
Transfer-Encoding: chunked

Does that help?

Thanks

Brian

----- Original Message ----

That <meta> tag doesn't look like the problem.

If you use the built in ConnectionMonitor on the parser, you can see 
the header:

C:>java -classpath parser\target\htmlparser.jar;lexer\target\htmllexer.jar org.htmlparser.Parser http://cbc.ca
INFO: GET http://cbc.ca HTTP/1.1
Accept-Encoding: gzip, deflate
User-Agent: HTMLParser/2.0

INFO: HTTP/1.1 301 Moved Permanently
Date: Tue, 13 May 2008 01:12:31 GMT
Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
Location: http://www.cbc.ca/
Cache-Control: max-age=120
Expires: Tue, 13 May 2008 01:14:31 GMT
Content-Length: 226
Keep-Alive: timeout=15, max=150
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

INFO: GET http://www.cbc.ca/ HTTP/1.1
Accept-Encoding: gzip, deflate
User-Agent: HTMLParser/2.0

INFO: HTTP/1.1 200 OK
Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
Accept-Ranges: bytes
Content-Type: text/html
Cache-Control: max-age=61
Expires: Tue, 13 May 2008 01:13:32 GMT
Date: Tue, 13 May 2008 01:12:31 GMT
Content-Length: 28625
Connection: keep-alive

----Original Message----
From: bo...@ti...
Date: 12/05/2008 12:55 
To: <htm...@li...>
Subj: [Htmlparser-user] Character Encoding

Thanks Derrick,

The page in question includes the following tags:

<META http-equiv=Content-Type content="text/html; charset=utf-8">
<META http-equiv=content-type>

I don't understand why the second one is there but it really is. With 
that information can you suggest a resolution? I am not entirely sure 
how to verify your point (1).
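
One way to check point (1) without the parser is to look at the Content-Type value the server sends and pull out its charset parameter. The following is a rough plain-JDK sketch under that assumption; ContentTypeCharset and charsetOf are made-up names, and the header strings in main are just sample values.

```java
import java.util.Locale;

// Hypothetical helper, not part of HTMLParser.
public class ContentTypeCharset {

    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1")); // prints "ISO-8859-1"
        System.out.println(charsetOf("text/html"));                     // prints "null"
    }
}
```

For a live page, the header value can be obtained with new java.net.URL(address).openConnection().getContentType(). If the charset reported there disagrees with the one in the page's meta tag, that mismatch is a likely suspect.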

Best Regards

Brian
-----------------------------------------------------------------------------

There are two possibilities.

1) The HTTP server is/is not serving up content type meta information 
in the HTTP header like so:
text/html; charset=utf-8

2) The source HTML does/does not contain a meta tag like so:

----- Original Message ----
From: "bo...@ti..." <bo...@ti...>
To: htm...@li...
Sent: Monday, May 12, 2008 7:31:39 AM
Subject: [Htmlparser-user] Character Encoding

Hi,

I have a strange problem and I can’t get my head around it. Hopefully 
someone can point me in the right direction. I’m using the following 
code with HTMLParser 1.6 to retrieve web pages:

    parser  = new Parser(URL);
    ThePage = parser.parse(null);
    MyPage  = ThePage.toHtml();

On some pages (not all…) if the HTML page contains:

Â£10 Free

“MyPage” contains “?10 Free”. On other pages it works fine.

I guess it has something to do with character encoding? Can someone 
suggest what I need to add, and where, to get this to work correctly? 
(I would like to keep the “Â£10 Free”.)

Thanks in advance

Brian
