Re: [Htmlparser-user] Encoding problem when parsing html

2007/1/25, Martin Sturm <mst...@gm...>:
> I'm not sure whether the HTML specification allows the charset to be
> defined multiple times, but I guess not. I don't think it is really a
> bug in HTMLParser, but if it took the last defined charset (utf-8) it
> would parse the site correctly. Why doesn't HTMLParser do this?

I did some more research on this issue. The W3C specification for
HTML 4.01 (which applies to this document, because it is an HTML 4
document according to its first line) says:

To sum up, conforming user agents must observe the following
priorities when determining a document's character encoding (from
highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a
value set for "charset".
3. The charset attribute set on an element that designates an external resource.
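The priority list above can be sketched in a few lines of Java. This is
only an illustration of the rule as quoted, not code from HTMLParser; the
class name CharsetPriority and the method resolve are hypothetical, and a
null argument stands for "not declared".

```java
import java.util.Optional;

// Hypothetical helper illustrating the HTML 4.01 priority order quoted above.
class CharsetPriority {
    /**
     * Returns the first declared charset, checking the sources from
     * highest to lowest priority. A null argument means "not declared".
     */
    static Optional<String> resolve(String httpCharset,
                                    String metaCharset,
                                    String linkCharset) {
        if (httpCharset != null) {
            return Optional.of(httpCharset);   // 1. HTTP Content-Type header
        }
        if (metaCharset != null) {
            return Optional.of(metaCharset);   // 2. META http-equiv declaration
        }
        return Optional.ofNullable(linkCharset); // 3. charset attribute on the referring element
    }
}
```

With an HTTP charset of utf-8 and a META charset of UTF-16, resolve
returns utf-8, because the HTTP header takes precedence.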

I looked into the source code of HTMLParser 2.0, and the current
behaviour of HTMLParser is:
- Use the charset defined by the "Content-Type" field in the HTTP header.
- Change to the charset defined by a META declaration with
"http-equiv" if it differs from the charset defined by the HTTP header.
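To make the effect of these two steps concrete, here is a small model of
the behaviour as I understand it. It is not HTMLParser's actual code; the
class MetaSwitchModel and the method trace are made up for illustration,
and I am assuming that each META charset is compared with the charset
currently in effect.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the behaviour described above, not HTMLParser source.
class MetaSwitchModel {
    /**
     * Returns the sequence of charsets that are in effect while the page is
     * read: the HTTP charset first, then every META charset that differs
     * from the one currently in effect.
     */
    static List<String> trace(String httpCharset, List<String> metaCharsets) {
        List<String> inEffect = new ArrayList<>();
        String current = httpCharset;
        inEffect.add(current);
        for (String meta : metaCharsets) {
            // Assumption: charset names are compared case-insensitively.
            if (!meta.equalsIgnoreCase(current)) {
                current = meta;
                inEffect.add(current);
            }
        }
        return inEffect;
    }
}
```

For an HTTP charset of utf-8 followed by META declarations of UTF-16 and
UTF-8, the model switches twice, so part of the page is decoded as UTF-16.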

This last step is causing the error in the Microsoft.com example. The
HTTP headers define the charset as UTF-8, the first META declaration
changes this to UTF-16, and the second META declaration (which comes
after the TITLE tag, however) changes it back to UTF-8.

I think the correct behaviour should be: use the charset defined by
the HTTP header if it differs from the default charset (ISO-8859-1,
aka Latin-1), and only use the charset defined by a META declaration
if the HTTP headers define no charset or the default (ISO-8859-1).
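
A minimal sketch of the rule I am proposing, in Java. The names
ProposedCharsetRule and choose are hypothetical and this is not a patch
against HTMLParser. The rule does not say which META declaration to use
when a page has several, so the sketch simply takes one META charset
(or null) from the caller.

```java
// Hypothetical sketch of the proposed rule, not HTMLParser source.
class ProposedCharsetRule {
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /**
     * Prefers an explicit, non-default HTTP charset. Falls back to the META
     * charset only when the HTTP header gives no charset or the default.
     * A null argument means "not declared".
     */
    static String choose(String httpCharset, String metaCharset) {
        boolean httpIsExplicit = httpCharset != null
                && !httpCharset.equalsIgnoreCase(DEFAULT_CHARSET);
        if (httpIsExplicit) {
            return httpCharset;
        }
        if (metaCharset != null) {
            return metaCharset;
        }
        return DEFAULT_CHARSET;
    }
}
```

Under this rule a page served with an HTTP charset of utf-8 stays utf-8
no matter what its META declarations say.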

--
Martin Sturm