Re: [Htmlparser-user] Encoding problem when parsing html
From: Martin S. <mst...@gm...> - 2007-01-25 13:22:24
2007/1/25, Martin Sturm <mst...@gm...>:
> I'm not sure if it is allowed by the html-specifications to define the
> charset multiple time, but I guess not. I don't think it is really a
> bug in HTMLParser, but if it takes the last defined charset (utf-8) it
> would parse the site correctly. Why doesn't HTMLParser not do this?

I did some more research on this issue. The W3C specification for HTML 4.01 (which applies to this document, because it is an HTML 4 document according to its first line) says:

  To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

I looked into the source code of HTMLParser 2.0, and its current behaviour is:

- Use the charset defined by the "Content-Type" field in the HTTP header.
- Change to the charset defined by a META declaration with "http-equiv" if it differs from the charset defined by the HTTP header.

This last step is what causes the error in the Microsoft.com example. The HTTP headers define the charset as UTF-8, the first META declaration changes this to UTF-16, and the second META declaration (which, however, comes after the TITLE tag) changes it back to UTF-8.

I think the correct behaviour should be: use the charset defined by the HTTP header if it differs from the default charset (ISO-8859-1, aka Latin-1), and only use the charset defined by a META declaration if the HTTP headers define no charset or define the default (ISO-8859-1).

--
Martin Sturm
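
To make the proposed rule concrete, here is a minimal sketch in Java. It is not HTMLParser's actual code: the class `CharsetChooser` and its method `choose` are hypothetical names made up for illustration, and the sketch assumes the caller has already extracted the charset from the HTTP header and from a META declaration (passing null when one is absent). It does not address which META declaration should win when a page contains several.

```java
/** Hypothetical helper illustrating the proposed rule; not part of HTMLParser. */
final class CharsetChooser {

    /** Default charset assumed when the HTTP header names none. */
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /**
     * Picks the charset to decode a page with.
     *
     * @param httpCharset charset from the HTTP Content-Type header, or null
     * @param metaCharset charset from a META http-equiv declaration, or null
     */
    static String choose(String httpCharset, String metaCharset) {
        // A non-default charset in the HTTP header always wins.
        if (isSet(httpCharset)
                && !httpCharset.trim().equalsIgnoreCase(DEFAULT_CHARSET)) {
            return httpCharset.trim();
        }
        // Header is missing or names the default: fall back to META.
        if (isSet(metaCharset)) {
            return metaCharset.trim();
        }
        // Neither source gives anything more specific than the default.
        return DEFAULT_CHARSET;
    }

    private static boolean isSet(String s) {
        return s != null && !s.trim().isEmpty();
    }
}
```

With this rule, a page like the Microsoft.com one (HTTP header utf-8, first META UTF-16) stays in utf-8, because `CharsetChooser.choose("utf-8", "UTF-16")` ignores the META value as soon as the header names a non-default charset.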