
#232 Parser throws an error if a web page declares two charsets

Milestone: v2.0
Status: closed-fixed
Priority: 9
Updated: 2007-06-03
Created: 2006-11-08
Creator: macpin
Private: No

Hi, I am from China. I am parsing a page whose head
section includes two charset meta tags. While the parser
is running, it throws an exception like this:
--------------------------------------------
"org.htmlparser.util.EncodingChangeException:
character mismatch (new: · [0xb7] != old: [0x30fb?])
for encoding change from x-EUC-CN to GBK at character
offset 266"
--------------------------------------------
The head of the page source is:
-------------------------------------
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=gb2312" />
<title>
NIKE搞笑广告 · 分享视频 · TVix.cn · 免费视频分享平台
</title>
<meta http-equiv=Content-Type content="text/html;
charset=GBK">
<meta name="description" content="Tvix.cn · 免费视频分
享平台 · 精彩视频">
-------------------------------------
My Java code is:
-----------------------------------
Parser parser = new Parser(vrl, Parser.DEVNULL);
parser.setEncoding("GBK");
TagNameFilter tf = new TagNameFilter("textarea");
NodeList list = parser.extractAllNodesThatMatch(tf);
---------------------------------------
Whether I change parser.setEncoding("GBK") to
parser.setEncoding("gb2312") or to parser.setEncoding("GB2312"),
the parser still throws the same exception.
This is hard for me to debug. Please help me, or
just point me in the right direction. Thanks!

Discussion

  • macpin

    macpin - 2006-11-08

    This is the file I am parsing.

     
  • macpin

    macpin - 2006-11-08
    • assigned_to: derrickoswald --> nobody
    • priority: 5 --> 9
    • milestone: --> 535725
     
  • macpin

    macpin - 2006-11-08

    Logged In: YES
    user_id=1640299

    The URL the parser requests is: http://www.tvix.cn/play.php?v=VKm2qLblS1k

     
  • Martin Sturm

    Martin Sturm - 2007-01-25

    Logged In: YES
    user_id=510190
    Originator: NO

    I think this is not really a bug. The EncodingChangeException indicates that the parser changed the encoding because a META declaration defines a different charset than the one the parser was using. When this occurs, the parser reads the document again using the new encoding, and if any characters already read come out differently under the new encoding, the EncodingChangeException is thrown. Usually, it is sufficient to catch this exception, reset the parser, and retry what you were doing when the exception occurred.

    In your case, the resulting Java code should be something like:
    ----
    Parser parser = new Parser (vrl, Parser.DEVNULL);
    TagNameFilter tf = new TagNameFilter ("textarea");
    NodeList list;
    try {
        list = parser.extractAllNodesThatMatch (tf);
    } catch (EncodingChangeException e) {
        parser.reset ();
        list = parser.extractAllNodesThatMatch (tf);
    }
    ----

    This will probably work (I have not tested the code).

     
  • Martin Sturm

    Martin Sturm - 2007-01-25

    Logged In: YES
    user_id=510190
    Originator: NO

    I'm sorry, but my previous comment is not correct. I didn't notice that the original page source was available. I've tested my sample code, and it doesn't work.
    The problem is similar to one I recently noticed when parsing the Microsoft website (see the mailing list for details). Microsoft also defines two charsets using META declarations, but it also provides a charset in the HTTP headers.

    The site which caused this bug does not define a charset using HTTP headers (see attached file with headers), so HTMLParser falls back to ISO-8859-1 (which is correct behaviour). However, the problem is that HTMLParser keeps switching charsets for every META tag that defines another charset. And that is the real problem.
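    To illustrate why switching charsets partway through a document is fragile, here is a minimal sketch that decodes the same bytes under two charsets and reports the first character where the results differ. It is not HTMLParser's actual check; the class and method names are hypothetical, and it assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HTMLParser.
class CharsetMismatchDemo {

    /**
     * Decodes the same bytes under two charsets and returns the index of the
     * first character that differs, or -1 if both decodings agree.
     */
    static int firstMismatch(byte[] data, Charset before, Charset after) {
        String oldText = new String(data, before);
        String newText = new String(data, after);
        int limit = Math.min(oldText.length(), newText.length());
        for (int i = 0; i < limit; i++) {
            if (oldText.charAt(i) != newText.charAt(i)) {
                return i;
            }
        }
        // The common prefix agrees; the decodings differ only if one is longer.
        return oldText.length() == newText.length() ? -1 : limit;
    }

    public static void main(String[] args) {
        // Assumption: the extended charsets (including GBK) are available.
        Charset gbk = Charset.forName("GBK");

        // Pure ASCII decodes identically under both charsets.
        byte[] ascii = "<title>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(firstMismatch(ascii, StandardCharsets.ISO_8859_1, gbk));

        // 'a' followed by a two-byte GBK sequence: ISO-8859-1 sees two
        // separate characters where GBK sees one.
        byte[] mixed = { 0x61, (byte) 0xA1, (byte) 0xA4 };
        System.out.println(firstMismatch(mixed, StandardCharsets.ISO_8859_1, gbk));
    }
}
```

    As long as everything read so far is plain ASCII, a switch is harmless. Once a multi-byte character (such as the dots in the page title) has been read under the old charset, a switch changes text that was already decoded.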

    Quote from W3C HTML 4.01 specification:

    To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

    1. An HTTP "charset" parameter in a "Content-Type" field.
    2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
    3. The charset attribute set on an element that designates an external resource.

    So I think it is better for HTMLParser to use only the charset defined by the HTTP headers if one is provided. Only if the charset defined by the HTTP header is ISO-8859-1 and the HTML changes the charset using a META declaration should HTMLParser change the charset. Otherwise it should keep the same charset, because otherwise it is impossible to guarantee bug-free behaviour of HTMLParser. Defining two charsets using META declarations is simply not allowed and should be ignored, I think.
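    The priority order quoted above could be sketched roughly as follows. This is only an illustration of the idea, not the patch itself and not HTMLParser's API: the class, method names, and regular expression are hypothetical. It assumes the first META declaration wins and later ones are ignored, and it does not model the special case where the HTTP header explicitly declares ISO-8859-1.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
class CharsetResolver {

    // Fallback when neither the HTTP header nor a META tag names a charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Matches the charset parameter of a Content-Type value.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Picks a charset: the HTTP header first, then the first META
     * declaration that names one, then the default.
     */
    static String resolve(String httpContentType, List<String> metaContents) {
        String fromHeader = charsetOf(httpContentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        for (String content : metaContents) {
            String fromMeta = charsetOf(content);
            if (fromMeta != null) {
                return fromMeta; // later META declarations are ignored
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // The two META content values from the page in this report.
        List<String> metas = Arrays.asList(
                "text/html; charset=gb2312", "text/html; charset=GBK");
        System.out.println(resolve("text/html", metas));
        System.out.println(resolve("text/html; charset=UTF-8", metas));
        System.out.println(resolve(null, Collections.<String>emptyList()));
    }
}
```

    With the two META tags from the reported page and no charset in the HTTP header, this sketch settles on gb2312 once and never switches to GBK.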

     
  • Martin Sturm

    Martin Sturm - 2007-01-25

    Logged In: YES
    user_id=510190
    Originator: NO

    I created a patch which solves this problem. See http://sourceforge.net/support/tracker.php?aid=1644504 (bug #1644504)

     
  • Derrick Oswald

    Derrick Oswald - 2007-03-04
    • milestone: 535725 --> v2.0
    • assigned_to: nobody --> derrickoswald
    • status: open --> pending-fixed
     
  • Derrick Oswald

    Derrick Oswald - 2007-03-04

    Logged In: YES
    user_id=605407
    Originator: NO

    The provided .cn URL is refusing connections from the parser,
    and the Microsoft site is not showing the problem
    (at least in version 2.0)
    with patch #1644504 applied and the following code:

    public void testTwoMeta () throws ParserException
    {
        String url;
        Parser parser;
        TagNameFilter tf;
        NodeList list;

        url = "http://www.tvix.cn/play.php?v=VKm2qLblS1k";
        //url = "http://www.microsoft.com";
        parser = new Parser ();
        parser.setResource (url);
        //parser.setEncoding ("gb2312");
        tf = new TagNameFilter ("textarea");
        try
        {
            list = parser.extractAllNodesThatMatch (tf);
        }
        catch (EncodingChangeException e)
        {
            parser.reset ();
            list = parser.extractAllNodesThatMatch (tf);
        }
    }

    Switching to pending until someone comes up with another URL or failing test case.

     
  • SourceForge Robot

    Logged In: YES
    user_id=1312539
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 90 days (the time period specified by
    the administrator of this Tracker).

     
  • SourceForge Robot

    • status: pending-fixed --> closed-fixed
     
