Hi, I'm from China. I parsed a page whose head
tag includes two charset meta tags. While the parser
was running, it threw an exception like this:
--------------------------------------------
"org.htmlparser.util.EncodingChangeException:
character mismatch (new: · [0xb7] != old: [0x30fb?])
for encoding change from x-EUC-CN to GBK at character
offset 266"
--------------------------------------------
The head of the page source is:
-------------------------------------
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=gb2312" />
<title>
NIKE搞笑广告 · 分享视频 · TVix.cn · 免费视频分享平台
</title>
<meta http-equiv=Content-Type content="text/html;
charset=GBK">
<meta name="description" content="Tvix.cn · 免费视频分
享平台 · 精彩视频">
-------------------------------------
My Java code is:
-----------------------------------
Parser parser = new Parser(vrl, Parser.DEVNULL);
parser.setEncoding("GBK");
TagNameFilter tf = new TagNameFilter("textarea");
NodeList list = parser.extractAllNodesThatMatch(tf);
---------------------------------------
Whether I change parser.setEncoding("GBK") to
parser.setEncoding("gb2312") or to parser.setEncoding("GB2312"),
the parser still throws the same exception.
This is hard for me to debug. Please help me, or
just point me in the right direction. Thanks!
The attached file is the page I parsed.
Logged In: YES
user_id=1640299
The URL the parser requests is: http://www.tvix.cn/play.php?v=VKm2qLblS1k
Logged In: YES
user_id=510190
Originator: NO
I think this is not really a bug. The EncodingChangeException indicates that the parser changed the encoding because a META declaration defines a different charset from the one the parser was using. When this occurs, the parser reads the document again using the new encoding, and when characters are represented differently (because of the changed encoding), the EncodingChangeException is thrown. Usually it is sufficient to catch this exception, reset the parser, and retry what you were doing when the exception occurred.
In your case, the resulting Java code should be something like:
----
Parser parser = new Parser (vrl, Parser.DEVNULL);
TagNameFilter tf = new TagNameFilter ("textarea");
NodeList list;
try {
    list = parser.extractAllNodesThatMatch (tf);
} catch (EncodingChangeException e) {
    parser.reset ();
    list = parser.extractAllNodesThatMatch (tf);
}
----
Probably this will work (I have not tested the code).
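To illustrate the kind of character mismatch the exception reports, here is a minimal standalone sketch. It is not HTMLParser's actual implementation: the class EncodingMismatchDemo and the method firstMismatch are hypothetical names, and ISO-8859-1 and UTF-8 are used only because every Java runtime provides them. It decodes the same bytes under an "old" and a "new" charset and reports the first character offset at which the two results differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not part of HTMLParser.
class EncodingMismatchDemo {

    /**
     * Decodes the same bytes with two charsets and returns the index of the
     * first character that differs, or -1 if both decodings are identical.
     */
    static int firstMismatch(byte[] data, Charset oldCs, Charset newCs) {
        String before = new String(data, oldCs);
        String after = new String(data, newCs);
        int n = Math.min(before.length(), after.length());
        for (int i = 0; i < n; i++) {
            if (before.charAt(i) != after.charAt(i)) {
                return i;
            }
        }
        // Same prefix: they differ only if one decoding is longer.
        return before.length() == after.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // A title containing a middle dot (U+00B7), stored as UTF-8 bytes.
        byte[] page = "<title>TVix \u00b7 video</title>"
                .getBytes(StandardCharsets.UTF_8);
        int offset = firstMismatch(page,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("first mismatch at character offset " + offset);
    }
}
```

Pure ASCII input decodes identically under both charsets, so no mismatch is found; the problem only shows up once the page contains non-ASCII characters before the META tag that triggers the switch.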
Logged In: YES
user_id=510190
Originator: NO
I'm sorry, but my previous comment is not correct. I didn't notice that the original page source was available. I've tested my sample code, and it doesn't work.
The problem is similar to a problem I recently noticed when parsing the Microsoft website (see the mailing list for details). Microsoft also defines two charsets using META declarations, but in addition provides a charset using HTTP headers.
The site which caused this bug does not define a charset using HTTP headers (see attached file with headers), so HTMLParser falls back to ISO-8859-1 (which is correct behaviour). However, the problem is that HTMLParser keeps switching charsets for every META tag that defines another charset. And that is the real problem.
Quote from W3C HTML 4.01 specification:
To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.
So I think it would be better for HTMLParser to use only the charset defined by the HTTP headers when one is provided. Only if the charset from the HTTP headers is ISO-8859-1 and the HTML changes it using a META declaration should HTMLParser change the charset. Otherwise it should keep the same charset, because otherwise it is impossible to guarantee bug-free behaviour of HTMLParser. Defining two charsets using META declarations is simply not allowed and should be ignored, I think.
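The policy proposed here could be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the code of the patch: the class CharsetChooser and its methods are hypothetical names, the inputs are plain Content-Type strings rather than HTMLParser objects, and "ignore a second META declaration" is read here as "the first META charset wins".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of HTMLParser.
class CharsetChooser {

    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset parameter of a Content-Type value, or null. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                if (value.length() >= 2 && value.startsWith("\"")
                        && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /**
     * An HTTP charset other than the ISO-8859-1 default always wins.
     * Otherwise the first META charset is used (assumption: later META
     * declarations are ignored). With neither, fall back to the default.
     */
    static String choose(String httpContentType, List<String> metaContents) {
        String http = charsetOf(httpContentType);
        if (http != null && !http.equalsIgnoreCase(DEFAULT_CHARSET)) {
            return http;
        }
        for (String meta : metaContents) {
            String charset = charsetOf(meta);
            if (charset != null) {
                return charset;
            }
        }
        return http != null ? http : DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // The page from this report: no HTTP charset, two META declarations.
        List<String> metas = Arrays.asList(
                "text/html; charset=gb2312", "text/html; charset=GBK");
        System.out.println(choose(null, metas));
    }
}
```

With this rule the charset is decided at most once per document, so the parser cannot keep switching back and forth between gb2312 and GBK.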
Logged In: YES
user_id=510190
Originator: NO
I created a patch which solves this problem. See http://sourceforge.net/support/tracker.php?aid=1644504 (bug #1644504)
Logged In: YES
user_id=605407
Originator: NO
Applied patch #1644504 ("Patch stopping HTMLParser from infinitely switching charset") to version 2 only.
See http://sourceforge.net/tracker/index.php?func=detail&aid=1644504&group_id=24399&atid=381401 for version 1.6 patch.
Logged In: YES
user_id=605407
Originator: NO
The provided .cn URL is refusing connections from the parser,
and the Microsoft site is not showing the problem
(at least in version 2.0)
with patch #1644504 applied and the following code:
public void testTwoMeta () throws ParserException
{
    String url;
    Parser parser;
    TagNameFilter tf;
    NodeList list;
    url = "http://www.tvix.cn/play.php?v=VKm2qLblS1k";
    //url = "http://www.microsoft.com";
    parser = new Parser ();
    parser.setResource (url);
    //parser.setEncoding ("gb2312");
    tf = new TagNameFilter ("textarea");
    try
    {
        list = parser.extractAllNodesThatMatch (tf);
    }
    catch (EncodingChangeException e)
    {
        parser.reset ();
        list = parser.extractAllNodesThatMatch (tf);
    }
}
Switching to pending until someone comes up with another URL or failing test case.
Logged In: YES
user_id=1312539
Originator: NO
This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 90 days (the time period specified by
the administrator of this Tracker).