Re: [Htmlparser-user] Encoding problem when parsing html

Martin Sturm <msturm10 <at> gmail.com> writes:

> 
> Hello,
> 
> I'm using HTMLParser to extract text from an HTML page in order to
> index it with a full-text search engine.
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One example is
> http://www.microsoft.com.
> I think the problem is that the HTTP headers declare the encoding
> as UTF-8, but the HTML META tags change it to UTF-16.
> This can be handled by catching the EncodingChangeException, but that
> doesn't prevent the textual content of the site from being interpreted
> incorrectly.
> 

The Microsoft site contains the following snippet:

<head><META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Microsoft Corporation</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

It seems they change the content encoding just for the title (God knows why).
The second change, back to UTF-8, causes things to fall over.
I found a fix at
http://osdir.com/ml/parsers.htmlparser.user/2006-03/msg00033.html
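
If you want to spot pages like this before handing them to the parser, something along these lines might help. It is only a rough sketch using plain java.util.regex, not HTMLParser itself; the class and method names (MetaCharsetScan, declaredCharsets, hasConflictingCharsets) are made up for illustration, and a regex scan like this will miss unusual markup.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser: lists the charsets that
// META tags declare, so a page with conflicting declarations can be flagged.
class MetaCharsetScan {
    // Finds "charset=<name>" inside a <meta ...> tag, ignoring case.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset names, lower-cased, in document order.
    static List<String> declaredCharsets(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = META_CHARSET.matcher(html);
        while (m.find()) {
            found.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return found;
    }

    // True when the page declares more than one distinct charset.
    static boolean hasConflictingCharsets(String html) {
        return new HashSet<>(declaredCharsets(html)).size() > 1;
    }

    public static void main(String[] args) {
        // The snippet from the Microsoft page quoted above.
        String head =
                "<head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-16\">\n"
                + "<title>Microsoft Corporation</title>\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        System.out.println(declaredCharsets(head));
        System.out.println(hasConflictingCharsets(head));
    }
}
```

For the snippet above this reports utf-16 followed by utf-8, i.e. a conflict. What to do with such a page (trust the HTTP header, take the last declaration, and so on) is a separate decision that this sketch does not make.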