Re: [Htmlparser-user] Encoding problem when parsing html
From: Martin S. <mst...@gm...> - 2007-01-25 15:16:34
2007/1/25, Martin Sturm <mst...@gm...>:
>
> I looked into the source code of HTMLParser 2.0, and the current
> behaviour of HTMLParser is:
> - use the charset defined by the "Content-Type" field in the HTTP header
> - change to the charset defined by a META declaration with
>   "http-equiv" if it differs from the charset defined by the HTTP header.
>
> This last step is causing the error in the Microsoft.com example. The
> HTTP headers define the charset as UTF-8, the first META declaration
> changes this to UTF-16, and the second META declaration (which,
> however, comes after the TITLE tag) changes it back to UTF-8.
> I think the correct behaviour should be: use the charset defined by
> the HTTP header if it differs from the default charset (which is
> ISO-8859-1, aka Latin-1), and only use the charset defined by a META
> declaration if the HTTP headers define no charset or the default
> (ISO-8859-1).

I've created a small patch that implements this behaviour. See
http://sourceforge.net/support/tracker.php?aid=1644504

This solves my problem and, as far as I can see, closes bug
http://sourceforge.net/tracker/index.php?func=detail&aid=1592517&group_id=24399&atid=381399

--
Martin Sturm
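
For readers who want the proposed precedence rule spelled out, here is a minimal sketch in Java. It is not the patch from the tracker and does not use HTMLParser's API; the class name CharsetPolicy and the method choose are hypothetical, and the inputs are assumed to be charset names already extracted from the HTTP header and the META declaration (null when absent). Charset aliases such as "latin1" are deliberately not handled.

```java
/** Hypothetical helper illustrating the proposed rule; not part of HTMLParser. */
final class CharsetPolicy {
    /** Default charset named in the message above. */
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    private CharsetPolicy() {}

    /**
     * Picks the charset to decode a page with.
     *
     * @param headerCharset charset from the HTTP Content-Type header, or null if none
     * @param metaCharset   charset from a META http-equiv declaration, or null if none
     */
    static String choose(String headerCharset, String metaCharset) {
        // An explicit, non-default HTTP header charset always wins.
        if (isPresent(headerCharset)
                && !headerCharset.trim().equalsIgnoreCase(DEFAULT_CHARSET)) {
            return headerCharset.trim();
        }
        // The header is missing or only states the default: let META decide.
        if (isPresent(metaCharset)) {
            return metaCharset.trim();
        }
        // Neither source says anything useful: fall back to the default.
        return DEFAULT_CHARSET;
    }

    private static boolean isPresent(String name) {
        return name != null && !name.trim().isEmpty();
    }
}
```

With the Microsoft.com example, CharsetPolicy.choose("utf-8", "UTF-16") keeps the header value "utf-8", so a conflicting META declaration no longer switches the decoder. When the header carries no charset or only the default, the META value is used instead.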