HTML Parser / Bugs / #232 If a web page declares two charsets, the parser throws an error

macpin - 2006-11-08

This is the file I am parsing:

play.html


macpin - 2006-11-08

assigned_to: derrickoswald --> nobody

priority: 5 --> 9

milestone: --> 535725

macpin - 2006-11-08

Logged In: YES
user_id=1640299

The URL requested by the parser is: http://www.tvix.cn/play.php?v=VKm2qLblS1k


Martin Sturm - 2007-01-25

Logged In: YES
user_id=510190
Originator: NO

I think this is not really a bug. The EncodingChangeException indicates that the parser is changing the encoding because a META declaration defines a different charset from the one the parser was using. When this occurs, the parser reads the document again using the new encoding, and if any characters already read would be represented differently under the new encoding, the EncodingChangeException is thrown. Usually, it is sufficient to catch this exception, reset the parser, and retry what you were doing when the exception occurred.

In your case, the resulting Java code should be something like:
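----
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.EncodingChangeException;
import org.htmlparser.util.NodeList;

// vrl holds the resource to parse (presumably the page URL)
Parser parser = new Parser (vrl, Parser.DEVNULL);
TagNameFilter tf = new TagNameFilter ("textarea");
NodeList list;
try {
    list = parser.extractAllNodesThatMatch (tf);
} catch (EncodingChangeException e) {
    // The charset changed mid-parse: start over with the new encoding.
    parser.reset ();
    list = parser.extractAllNodesThatMatch (tf);
}
----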

Probably this will work (I have not tested the code).


Martin Sturm - 2007-01-25

Logged In: YES
user_id=510190
Originator: NO

I'm sorry, but my previous comment is not correct. I didn't notice that the original page source was available. I've tested my sample code, and it doesn't work.
The problem is similar to one I recently noticed when parsing the Microsoft website (see the mailing list for details). Microsoft also defines two charsets using META declarations, but it additionally provides a charset in the HTTP headers.

The site that triggered this bug does not define a charset in its HTTP headers (see the attached file with headers), so HTMLParser falls back to ISO-8859-1, which is correct behaviour. However, HTMLParser then keeps switching charsets for every META tag that defines a different charset, and that is the real problem.
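To make the failure mode concrete, here is a minimal sketch that lists the charsets named in a page's META tags. It is not HTMLParser code: the class MetaCharsetScan, the method declaredCharsets, the regular expression, and the sample markup (including the gb2312 and utf-8 values) are all assumptions made up for illustration. A page for which this returns more than one distinct charset is the kind of page that triggers the repeated switching described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of HTMLParser): collects every charset
// named inside a <meta ...> tag, in document order.
final class MetaCharsetScan {
    // Rough pattern: a <meta> tag whose attributes contain charset=NAME.
    // A regex is only an approximation of real HTML tokenizing.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    static List<String> declaredCharsets(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = META_CHARSET.matcher(html);
        while (m.find()) {
            // Normalize case so "GB2312" and "gb2312" compare equal.
            found.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed sample markup with two conflicting declarations.
        String html =
              "<html><head>"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
            + "</head><body></body></html>";
        System.out.println(declaredCharsets(html));
    }
}
```

Running such a scan over the saved page source before parsing is one way to check whether a given failure is caused by conflicting META declarations.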

Quote from W3C HTML 4.01 specification:

To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.

So, I think it is better that HTMLParser uses only the charset defined by the HTTP headers if one is provided. Only if the charset defined by the HTTP header is ISO-8859-1 and the HTML changes the charset using a META declaration should HTMLParser change the charset. Otherwise, the charset should remain the same, because it is otherwise impossible to guarantee bug-free behaviour of HTMLParser. Defining two charsets using META declarations is simply not allowed, and I think the extra declaration should be ignored.
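The rule proposed above can be sketched as a small decision function. This is only an illustration of the proposal, not the code of patch #1644504 or of HTMLParser itself: the class CharsetPrecedence and the method resolve are hypothetical names, and the choice to honour the first META declaration and ignore any later ones is my assumption about what "ignored" should mean.

```java
import java.util.List;

// Hypothetical sketch of the proposed precedence (not HTMLParser code).
final class CharsetPrecedence {
    // Fallback used when nothing declares a charset, as described in this thread.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // httpCharset: charset from the HTTP Content-Type header, or null if absent.
    // metaCharsets: charsets from META declarations, in document order.
    static String resolve(String httpCharset, List<String> metaCharsets) {
        // An explicit, non-default HTTP charset always wins.
        if (httpCharset != null && !httpCharset.equalsIgnoreCase(DEFAULT_CHARSET)) {
            return httpCharset;
        }
        // Otherwise switch at most once, to the first META declaration.
        // Assumption: any further META declarations are ignored.
        if (!metaCharsets.isEmpty()) {
            return metaCharsets.get(0);
        }
        // Nothing overrides the header value or the default.
        return httpCharset != null ? httpCharset : DEFAULT_CHARSET;
    }
}
```

With this rule the result never depends on a second META declaration, so the charset can change at most once per document. Note that it deliberately lets a META declaration override an HTTP charset of ISO-8859-1, which follows the proposal in this comment rather than the strict W3C ordering quoted above.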


Martin Sturm - 2007-01-25

Logged In: YES
user_id=510190
Originator: NO

I created a patch which solves this problem. See http://sourceforge.net/support/tracker.php?aid=1644504 (bug #1644504)


Derrick Oswald - 2007-03-04

Logged In: YES
user_id=605407
Originator: NO

Applied patch #1644504 ("Patch stopping HTMLParser from infinitely switching charset") to version 2 only.
See http://sourceforge.net/tracker/index.php?func=detail&aid=1644504&group_id=24399&atid=381401 for version 1.6 patch.


Derrick Oswald - 2007-03-04

milestone: 535725 --> v2.0

assigned_to: nobody --> derrickoswald

status: open --> pending-fixed

Derrick Oswald - 2007-03-04

Logged In: YES
user_id=605407
Originator: NO

The provided .cn URL is refusing connections from the parser,
and the Microsoft site is not showing the problem
(at least in version 2.0)
with patch #1644504 applied and the following code:

public void testTwoMeta () throws ParserException
{
    String url;
    Parser parser;
    TagNameFilter tf;
    NodeList list;

    url = "http://www.tvix.cn/play.php?v=VKm2qLblS1k";
    //url = "http://www.microsoft.com";
    parser = new Parser ();
    parser.setResource (url);
    //parser.setEncoding ("gb2312");
    tf = new TagNameFilter ("textarea");
    try
    {
        list = parser.extractAllNodesThatMatch (tf);
    }
    catch (EncodingChangeException e)
    {
        // Encoding changed mid-parse: reset and retry once.
        parser.reset ();
        list = parser.extractAllNodesThatMatch (tf);
    }
}

Switching to pending until someone comes up with another URL or failing test case.


SourceForge Robot - 2007-06-03

Logged In: YES
user_id=1312539
Originator: NO

This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 90 days (the time period specified by
the administrator of this Tracker).


SourceForge Robot - 2007-06-03

status: pending-fixed --> closed-fixed