Thread: [Htmlparser-user] Encoding problem when parsing html

[Htmlparser-user] Encoding problem when parsing html

From: Martin S. <mst...@gm...> - 2007-01-16 11:34:15

Hello,

I'm using HTMLParser for extracting text from a HTML page in order to
index it using a full text search engine.
During the testing phase, I discovered that some web pages are not
parsed correctly by HTMLParser. One of these webpages is for example
http://www.microsoft.com.
I think the problem is that according to the HTTP headers, the
encoding is in UTF-8, but in HTML META tags this is changed to UTF-16.
This can be handled by catching the EncodingChangeException, but this
doesn't prevent the textual content of the site interpreted
incorrectly.

A concrete example to see the problem:

        StringBean sb = new StringBean();
        sb.setURL("http://www.microsoft.com");
        System.out.println (sb.getStrings());

The output of the above code snippet is:
䑏䍔奐䔠桴浬⁐啂䱉䌠∭⼯圳䌯⽄呄⁈呍䰠㐮〠呲慮獩瑩潮 ....

Not really what I was expecting.

Am I missing something, or is this a bug in the HTMLParser?

Martin
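
Output made of CJK-looking characters where Latin text was expected is the typical result of decoding UTF-8 (or plain ASCII) bytes as UTF-16: every pair of one-byte characters is read as a single 16-bit code unit. The start of the output reported in this message appears consistent with the text "DOCTYPE html PUBLIC ..." read that way. The sketch below reproduces the effect with the JDK only. The class and method names are made up for illustration, and big-endian UTF-16 is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text with the charset the bytes really use, then decode the
    // same bytes with the charset the reader wrongly assumes.
    static String reinterpret(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        String original = "DOCTYPE html";

        // 12 one-byte characters become 6 two-byte UTF-16 code units.
        String garbled = reinterpret(original, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        System.out.println(garbled);

        // Reversing the mistake recovers the original text.
        String recovered = reinterpret(garbled, StandardCharsets.UTF_16BE, StandardCharsets.UTF_8);
        System.out.println(recovered);
    }
}
```

The round trip only works here because the sample has an even number of bytes and none of the resulting code units is a surrogate.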

Re: [Htmlparser-user] Encoding problem when parsing html

From: Martin S. <mst...@gm...> - 2007-01-25 11:37:52

2007/1/16, Martin Sturm <mst...@gm...>:
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One of these webpages is for example
> http://www.microsoft.com.
> I think the problem is that according to the HTTP headers, the
> encoding is in UTF-8, but in HTML META tags this is changed to UTF-16.

Today, I decided I wanted to know exactly what was going wrong. It
turned out that in the HTML code of www.microsoft.com, the charset is
defined two times using meta-tags (http-equiv="Content-Type"), first
as utf-16 and after that as utf-8. The actual encoding is apparently
utf-8, because if I remove the meta-tag for utf-16 of the html, the
page is parsed correctly.
Below is the offending HTML-code:

<html lang="en" dir="ltr"> <head>  <META http-equiv="Content-Type"
content="text/html; charset=utf-16">  <title>Microsoft
Corporation</title>  <meta http-equiv="Content-Type"
content="text/html; charset=utf-8">  <meta name="SearchTitle"
content="Microsoft.com">  <meta name="SearchDescription"
content="Microsoft.com Homepage">

I'm not sure if it is allowed by the html-specifications to define the
charset multiple times, but I guess not. I don't think it is really a
bug in HTMLParser, but if it took the last defined charset (utf-8) it
would parse the site correctly. Why doesn't HTMLParser do this?
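
To check a page for this kind of conflict before parsing it, the META declarations can be listed with a simple scan. This is only a sketch using java.util.regex, not how HTMLParser itself finds them. The class name and pattern are illustrative, and the pattern assumes double-quoted attributes with http-equiv placed before content, as in the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetScan {
    // Simplified pattern: <meta http-equiv="Content-Type" content="...charset=NAME">
    // Assumes double quotes and this attribute order; real pages vary.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta\\s+http-equiv=\"Content-Type\"\\s+content=\"[^\"]*charset=([^\"\\s;]+)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns every declared charset, lower-cased, in document order.
    static List<String> declaredCharsets(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = META_CHARSET.matcher(html);
        while (m.find()) {
            found.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return found;
    }

    public static void main(String[] args) {
        String head = "<html lang=\"en\" dir=\"ltr\"> <head>  <META http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=utf-16\">  <title>Microsoft Corporation</title>  "
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        System.out.println(declaredCharsets(head)); // [utf-16, utf-8]
    }
}
```

More than one distinct entry in the returned list means the page contradicts itself.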

Re: [Htmlparser-user] Encoding problem when parsing html

From: Martin S. <mst...@gm...> - 2007-01-25 13:22:24

2007/1/25, Martin Sturm <mst...@gm...>:
> I'm not sure if it is allowed by the html-specifications to define the
> charset multiple times, but I guess not. I don't think it is really a
> bug in HTMLParser, but if it took the last defined charset (utf-8) it
> would parse the site correctly. Why doesn't HTMLParser do this?

I did some more research on this issue. The W3C specification for
HTML 4.01 (which applies to this document, because it is an HTML 4
document according to the first line) says:

To sum up, conforming user agents must observe the following
priorities when determining a document's character encoding (from
highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a
value set for "charset".
3. The charset attribute set on an element that designates an external resource.

I looked into the source code of HTMLParser 2.0 and the current
behaviour of HTMLParser is:
- Use the charset defined by the "Content-Type" field in the HTTP header.
- Change to the charset defined using a META declaration with
"http-equiv" if it differs from the charset defined by the HTTP header.

This last step is causing the error in the Microsoft.com example. The
http headers define a charset utf-8, the first META declaration
changes this to UTF-16 and the second META declaration (however, this
declaration is after the TITLE tag) changes this back to UTF-8.
I think the correct behaviour should be: use the charset defined by
the HTTP header if it differs from the default charset (which is:
ISO-8859-1 aka Latin-1), and only use the charset defined by a META
declaration if the HTTP headers define no charset or the default
(ISO-8859-1).
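
The two behaviours can be compared in a small model. This is only a sketch of the rules as described in this message, not code from HTMLParser or from any patch. The class and method names are invented, and passing only the first META charset to the proposed rule is an assumption, since the proposal does not say which declaration wins when there are several.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CharsetPolicy {
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Model of the behaviour described above: start with the HTTP charset
    // (assumed non-null here) and switch whenever a META declaration names
    // a charset different from the one currently in effect.
    static List<String> traceSwitching(String httpCharset, List<String> metaCharsets) {
        List<String> inEffect = new ArrayList<>();
        String current = httpCharset;
        inEffect.add(current);
        for (String meta : metaCharsets) {
            if (!meta.equalsIgnoreCase(current)) {
                current = meta;
                inEffect.add(current);
            }
        }
        return inEffect;
    }

    // Model of the proposed rule: trust the HTTP header unless it is absent
    // or only states the default, and fall back to META in that case.
    static String chooseProposed(String httpCharset, String firstMetaCharset) {
        boolean httpIsSpecific = httpCharset != null
            && !httpCharset.equalsIgnoreCase(DEFAULT_CHARSET);
        if (httpIsSpecific) {
            return httpCharset;
        }
        return firstMetaCharset != null ? firstMetaCharset : DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // Microsoft.com example: header says utf-8, METAs say utf-16, then utf-8.
        System.out.println(traceSwitching("utf-8", Arrays.asList("utf-16", "utf-8"))); // [utf-8, utf-16, utf-8]
        System.out.println(chooseProposed("utf-8", "utf-16")); // utf-8
        System.out.println(chooseProposed(null, "utf-16"));    // utf-16
    }
}
```

Under the proposed rule the UTF-16 declaration is never applied to the Microsoft.com page, because the HTTP header already names a non-default charset.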

--
Martin Sturm

Re: [Htmlparser-user] Encoding problem when parsing html

From: Martin S. <mst...@gm...> - 2007-01-25 15:16:34

2007/1/25, Martin Sturm <mst...@gm...>:
>
> I looked into the source code of HTMLParser 2.0 and the current
> behaviour of HTMLParser is:
> - Use the charset defined by the "Content-Type" field in the HTTP header.
> - Change to the charset defined using a META declaration with
> "http-equiv" if it differs from the charset defined by the HTTP header.
>
> This last step is causing the error in the Microsoft.com example. The
> http headers define a charset utf-8, the first META declaration
> changes this to UTF-16 and the second META declaration (however, this
> declaration is after the TITLE tag) changes this back to UTF-8.
> I think the correct behaviour should be: use the charset defined by
> the HTTP header if it differs from the default charset (which is:
> ISO-8859-1 aka Latin-1), and only use the charset defined by a META
> declaration if the HTTP headers define no charset or the default
> (ISO-8859-1).

I've created a small patch which includes this behavior. See
http://sourceforge.net/support/tracker.php?aid=1644504

This solves my problem and closes (as far as I can see) bug
http://sourceforge.net/tracker/index.php?func=detail&aid=1592517&group_id=24399&atid=381399

--
Martin Sturm

Re: [Htmlparser-user] Encoding problem when parsing html

From: MitchH <m2...@mi...> - 2007-10-12 14:17:52

Martin Sturm <msturm10 <at> gmail.com> writes:

> 
> Hello,
> 
> I'm using HTMLParser for extracting text from a HTML page in order to
> index it using a full text search engine.
> During the testing phase, I discovered that some web pages are not
> parsed correctly by HTMLParser. One of these webpages is for example
> http://www.microsoft.com.
> I think the problem is that according to the HTTP headers, the
> encoding is in UTF-8, but in HTML META tags this is changed to UTF-16.
> This can be handled by catching the EncodingChangeException, but this
> doesn't prevent the textual content of the site interpreted
> incorrectly.
> 

The microsoft site contains the following snippet:

<head><META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Microsoft Corporation</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

It seems they change the content encoding just for the title (god knows why).
The second change, back to utf-8, causes things to fall over.
I found a fix at
http://osdir.com/ml/parsers.htmlparser.user/2006-03/msg00033.html