Re: [Htmlparser-user] Encoding problem when parsing html
From: Martin S. <mst...@gm...> - 2007-01-25 11:37:52
2007/1/16, Martin Sturm <mst...@gm...>:
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One of these webpages is for example
> http://www.microsoft.com.
> I think the problem is that according to the HTTP headers, the
> encoding is in UTF-8, but in HTML META tags this is changed to UTF-16.

Today I decided to find out exactly what was going wrong. It turns out that the HTML of www.microsoft.com declares the charset twice using meta tags (http-equiv="Content-Type"), first as utf-16 and then as utf-8. The actual encoding is apparently utf-8, because when I remove the utf-16 meta tag from the HTML, the page is parsed correctly.

Below is the offending HTML code:

    <html lang="en" dir="ltr">
    <head>
    <META http-equiv="Content-Type" content="text/html; charset=utf-16">
    <title>Microsoft Corporation</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="SearchTitle" content="Microsoft.com">
    <meta name="SearchDescription" content="Microsoft.com Homepage">

I'm not sure whether the HTML specifications allow the charset to be declared multiple times, but I guess not. I don't think this is really a bug in HTMLParser, but if it took the last declared charset (utf-8), it would parse the site correctly. Why doesn't HTMLParser do this?
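
To illustrate the "take the last declared charset" idea, here is a minimal sketch in plain Java. It does not use or reflect HTMLParser's own API. The class name LastMetaCharset and the method lastDeclaredCharset are hypothetical, and the regular expressions are a simplification that assumes the charset appears as charset=VALUE inside a meta tag.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser: scans raw HTML text and
// reports the charset named by the LAST meta tag that declares one.
final class LastMetaCharset {
    // Matches a complete <meta ...> tag, ignoring case (<META> or <meta>).
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches charset=VALUE inside a tag; VALUE may start with a quote.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
                        Pattern.CASE_INSENSITIVE);

    // Returns the lower-cased charset of the last declaring meta tag,
    // or null if no meta tag declares a charset.
    static String lastDeclaredCharset(String html) {
        String last = null;
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            Matcher cs = CHARSET.matcher(tag.group());
            if (cs.find()) {
                // Later declarations overwrite earlier ones.
                last = cs.group(1).toLowerCase(Locale.ROOT);
            }
        }
        return last;
    }

    public static void main(String[] args) {
        String html =
            "<html lang=\"en\" dir=\"ltr\"><head>"
            + "<META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-16\">"
            + "<title>Microsoft Corporation</title>"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
            + "<meta name=\"SearchTitle\" content=\"Microsoft.com\">";
        System.out.println(lastDeclaredCharset(html)); // prints "utf-8"
    }
}
```

With a value like this in hand, one could decode the page bytes with that charset before handing the text to the parser. Whether "last wins" is the right policy in general is a separate question; the sketch only shows that it would pick utf-8 for the page above.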