Thread: [Htmlparser-user] Encoding problem when parsing html
From: Martin S. <mst...@gm...> - 2007-01-16 11:34:15
|
Hello,

I'm using HTMLParser for extracting text from an HTML page in order to
index it using a full text search engine.
During the testing phase, I discovered that some web pages are not
parsed correctly by HTMLParser. One of these webpages is for example
http://www.microsoft.com.
I think the problem is that according to the HTTP headers, the
encoding is UTF-8, but in the HTML META tags this is changed to UTF-16.
This can be handled by catching the EncodingChangeException, but this
doesn't prevent the textual content of the site from being interpreted
incorrectly.

A concrete example to see the problem:

        StringBean sb = new StringBean();
        sb.setURL("http://www.microsoft.com");
        System.out.println (sb.getStrings());

The output of the above code snippet is:
䑏䍔奐䔠桴浬⁐啂䱉䌠∭⼯圳䌯⽄呄⁈呍䰠㐮〠呲慮獩瑩潮 ....

Not really what I was expecting.

Am I missing something, or is this a bug in the HTMLParser?

Martin
|
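For readers wondering where the CJK characters in the output above come from: they are consistent with UTF-8 bytes being decoded as UTF-16. The following is a minimal sketch using only the Java standard library, not HTMLParser itself. The class and method names are made up for illustration, and big-endian UTF-16 is an assumption chosen because it reproduces the start of the reported output.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HTMLParser.
final class MisdecodeDemo {

    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as
    // big-endian UTF-16, so every pair of ASCII bytes becomes one character.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // "DO" is 0x44 0x4F, which read as one UTF-16 unit is U+444F, and so on.
        System.out.println(misdecode("DOCTYPE html"));
    }
}
```

Here `misdecode("DOCTYPE html")` returns the six characters 䑏䍔奐䔠桴浬, which is how the reported output begins. In other words, the garbage appears to be the page's own DOCTYPE line read two bytes at a time.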
From: Martin S. <mst...@gm...> - 2007-01-25 11:37:52
|
2007/1/16, Martin Sturm <mst...@gm...>:
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One of these webpages is for example
> http://www.microsoft.com.
> I think the problem is that according to the HTTP headers, the
> encoding is in UTF-8, but in HTML META tags this is changed to UTF-16.

Today, I decided I wanted to know exactly what was going wrong. It
turned out that in the HTML code of www.microsoft.com, the charset is
defined twice using META tags (http-equiv="Content-Type"), first as
utf-16 and then as utf-8. The actual encoding is apparently utf-8,
because if I remove the utf-16 META tag from the HTML, the page is
parsed correctly.

Below is the offending HTML code:

        <html lang="en" dir="ltr">
        <head>
        <META http-equiv="Content-Type" content="text/html; charset=utf-16">
        <title>Microsoft Corporation</title>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <meta name="SearchTitle" content="Microsoft.com">
        <meta name="SearchDescription" content="Microsoft.com Homepage">

I'm not sure whether the HTML specification allows the charset to be
defined multiple times, but I guess not. I don't think it is really a
bug in HTMLParser, but if it took the last defined charset (utf-8) it
would parse the site correctly. Why doesn't HTMLParser do this?
|
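To check another page for the same double declaration, the META tags can be listed in document order. This is only a rough sketch with a regular expression from the Java standard library; it is not how HTMLParser scans tags, the class and method names are hypothetical, and the pattern assumes the attribute order and double quoting used in the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of HTMLParser.
final class MetaCharsetScan {

    // Matches <meta http-equiv="Content-Type" content="...charset=X...">
    // and captures X. Assumes http-equiv comes before content.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s+http-equiv=\"Content-Type\"\\s+content=\"[^\"]*charset=([^\";\\s]+)[^\"]*\"",
        Pattern.CASE_INSENSITIVE);

    // Returns every charset declared via META, in the order it appears.
    static List<String> declaredCharsets(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = META_CHARSET.matcher(html);
        while (m.find()) {
            found.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return found;
    }
}
```

For the head section quoted above, `declaredCharsets` returns `[utf-16, utf-8]`: two conflicting declarations, with the wrong one first.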
From: Martin S. <mst...@gm...> - 2007-01-25 13:22:24
|
2007/1/25, Martin Sturm <mst...@gm...>:
> I'm not sure if it is allowed by the html-specifications to define the
> charset multiple times, but I guess not. I don't think it is really a
> bug in HTMLParser, but if it takes the last defined charset (utf-8) it
> would parse the site correctly. Why doesn't HTMLParser do this?

I did some more research on this issue. The W3C specification for
HTML 4.01 (which applies to this document, because it is an HTML 4
document according to its first line) says:

  To sum up, conforming user agents must observe the following
  priorities when determining a document's character encoding (from
  highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a
     value set for "charset".
  3. The charset attribute set on an element that designates an
     external resource.

I looked into the source code of HTMLParser 2.0, and the current
behaviour of HTMLParser is:
- Use the charset defined by the "Content-Type" field in the HTTP header.
- Change to the charset defined using a META declaration with
  "http-equiv" if it differs from the charset defined by the HTTP header.

This last step is causing the error in the Microsoft.com example. The
HTTP headers define the charset as UTF-8, the first META declaration
changes this to UTF-16, and the second META declaration (which comes
after the TITLE tag) changes it back to UTF-8.
I think the correct behaviour should be: use the charset defined by
the HTTP header if it differs from the default charset (which is
ISO-8859-1, aka Latin-1), and only use the charset defined by a META
declaration if the HTTP headers define no charset or the default
(ISO-8859-1).

--
Martin Sturm
|
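The behaviour proposed in the message above can be written down as a small decision function. This is a sketch of the rule as described in the post, not the patch itself and not HTMLParser's code; the class name, method name, and the use of null for "not declared" are assumptions made for illustration.

```java
// Hypothetical helper illustrating the proposed priority rule.
final class CharsetChoice {

    // The default the post refers to (ISO-8859-1, aka Latin-1).
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // httpCharset: charset from the HTTP Content-Type header, or null if absent.
    // metaCharset: charset from a META http-equiv declaration, or null if absent.
    static String choose(String httpCharset, String metaCharset) {
        boolean httpIsExplicit = httpCharset != null
            && !httpCharset.equalsIgnoreCase(DEFAULT_CHARSET);
        if (httpIsExplicit) {
            return httpCharset;      // a non-default HTTP charset always wins
        }
        if (metaCharset != null) {
            return metaCharset;      // otherwise fall back to the META declaration
        }
        return DEFAULT_CHARSET;      // nothing declared anywhere
    }

    public static void main(String[] args) {
        // The Microsoft.com case: header says UTF-8, first META says UTF-16.
        System.out.println(choose("UTF-8", "UTF-16"));
    }
}
```

With this rule the Microsoft.com page stays in UTF-8, because `choose("UTF-8", "UTF-16")` returns `UTF-8`, while a page served without a charset, or with the ISO-8859-1 default, still honours its META tag. The comparison only matches the literal name, so aliases such as "latin1" would need extra handling.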
From: Martin S. <mst...@gm...> - 2007-01-25 15:16:34
|
2007/1/25, Martin Sturm <mst...@gm...>:
>
> I looked into the source code of HTMLParser 2.0 and the current
> behaviour of HTMLParser is:
> - use the charset defined by the "Content-Type" field in the HTTP header
> - Change to the charset defined using a META declaration with
> "http-equiv" if it differs from the charset defined by the HTTP header.
>
> This last step is causing the error in the Microsoft.com example. The
> http headers define a charset utf-8, the first META declaration
> changes this to UTF-16 and the second META declaration (however, this
> declaration is after the TITLE tag) changes this back to UTF-8.
> I think the correct behaviour should be: use the charset defined by
> the HTTP header if it differs from the default charset (which is:
> ISO-8859-1 aka Latin-1), and only use the charset defined by a META
> declaration if the HTTP headers define no charset or the default
> (ISO-8859-1).

I've created a small patch which implements this behavior. See
http://sourceforge.net/support/tracker.php?aid=1644504

This solves my problem and closes (as far as I can see) bug
http://sourceforge.net/tracker/index.php?func=detail&aid=1592517&group_id=24399&atid=381399

--
Martin Sturm
|
From: MitchH <m2...@mi...> - 2007-10-12 14:17:52
|
Martin Sturm <msturm10 <at> gmail.com> writes:
>
> Hello,
>
> I'm using HTMLParser for extracting text from an HTML page in order to
> index it using a full text search engine.
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One of these webpages is for example
> http://www.microsoft.com.
> I think the problem is that according to the HTTP headers, the
> encoding is in UTF-8, but in HTML META tags this is changed to UTF-16.
> This can be handled by catching the EncodingChangeException, but this
> doesn't prevent the textual content of the site from being interpreted
> incorrectly.
>

The Microsoft site contains the following snippet:

        <head><META http-equiv="Content-Type" content="text/html; charset=utf-16">
        <title>Microsoft Corporation</title>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

It seems they change the content encoding just for the title (god
knows why). The second change, back to utf-8, causes things to fall
over.

I found a fix at
http://osdir.com/ml/parsers.htmlparser.user/2006-03/msg00033.html
|