Re: [Htmlparser-user] Encoding problem when parsing html


2007/1/16, Martin Sturm <mst...@gm...>:
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One of these webpages is for example
> http://www.microsoft.com.
> I think the problem is that according to the HTTP headers, the
> encoding is in UTF-8, but in HTML META tags this is changed to UTF-16.

Today I decided to find out exactly what was going wrong. It
turned out that the HTML of www.microsoft.com defines the charset
twice using meta tags (http-equiv="Content-Type"): first as utf-16
and then as utf-8. The actual encoding is apparently utf-8, because
if I remove the utf-16 meta tag from the HTML, the page is parsed
correctly.
Below is the offending HTML code:

<html lang="en" dir="ltr">
<head>
  <META http-equiv="Content-Type" content="text/html; charset=utf-16">
  <title>Microsoft Corporation</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta name="SearchTitle" content="Microsoft.com">
  <meta name="SearchDescription" content="Microsoft.com Homepage">

I'm not sure whether the HTML specification allows the charset to be
defined multiple times, but I guess not. I don't think this is really a
bug in HTMLParser, but if it used the last defined charset (utf-8), it
would parse the site correctly. Why doesn't HTMLParser do this?
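
For illustration only, here is a rough standalone sketch of the "last
declared charset wins" idea. It is plain Python using regular expressions,
not HTMLParser's API, and it says nothing about how HTMLParser itself picks
an encoding. The names declared_charsets and pick_charset are made up for
this example, and the sample bytes are a shortened copy of the snippet above.

```python
import re

# Match <meta ...> tags and the pieces we care about, case-insensitively.
# Working on raw bytes is safe here because the declarations are ASCII.
META_TAG = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
HTTP_EQUIV = re.compile(rb"http-equiv\s*=\s*[\"']?content-type", re.IGNORECASE)
CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def declared_charsets(raw):
    """Return every charset declared via <meta http-equiv="Content-Type">,
    in document order, lowercased."""
    found = []
    for tag in META_TAG.findall(raw):
        if not HTTP_EQUIV.search(tag):
            continue  # e.g. <meta name="SearchTitle" ...>
        match = CHARSET.search(tag)
        if match:
            found.append(match.group(1).decode("ascii").lower())
    return found


def pick_charset(declared, policy="last"):
    """Hypothetical policy switch: 'last' takes the final declaration,
    anything else takes the first. Returns None if nothing was declared."""
    if not declared:
        return None
    return declared[-1] if policy == "last" else declared[0]


SAMPLE = (
    b'<html lang="en" dir="ltr"> <head> '
    b'<META http-equiv="Content-Type" content="text/html; charset=utf-16"> '
    b"<title>Microsoft Corporation</title> "
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> '
    b'<meta name="SearchTitle" content="Microsoft.com">'
)

charsets = declared_charsets(SAMPLE)
print(charsets)                         # both declarations, in order
print(pick_charset(charsets, "last"))   # the later one
print(pick_charset(charsets, "first"))  # the earlier one
```

With the sample above, the "last" policy picks utf-8 and the "first" policy
picks utf-16. This only shows that the two policies disagree on this page;
which one a parser should follow is a separate question.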