Hi, I'm from China. I am parsing a page whose head tag
contains two charset meta tags. While the parser is
running, it throws an exception like this:
--------------------------------------------
"org.htmlparser.util.EncodingChangeException:
character mismatch (new: · [0xb7] != old: [0x30fb?])
for encoding change from x-EUC-CN to GBK at character
offset 266"
--------------------------------------------
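As far as I can tell, the message means that the characters already read under the old encoding no longer match once the same bytes are decoded again under the new one. Here is a minimal sketch of that kind of check. It is not the library's actual code. The class name EncodingMismatch and the method firstMismatch are made up for illustration, and UTF-8 and ISO-8859-1 stand in for x-EUC-CN and GBK only because every JVM is guaranteed to ship them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    /**
     * Decodes the same bytes under two charsets and returns the index of the
     * first character that differs, or -1 if both decodings are identical.
     * Hypothetical helper, not part of org.htmlparser.
     */
    static int firstMismatch(byte[] bytes, Charset oldCs, Charset newCs) {
        String oldText = new String(bytes, oldCs);
        String newText = new String(bytes, newCs);
        int common = Math.min(oldText.length(), newText.length());
        for (int i = 0; i < common; i++) {
            if (oldText.charAt(i) != newText.charAt(i)) {
                return i;
            }
        }
        // If one decoding is a prefix of the other, they differ where the shorter one ends.
        return oldText.length() == newText.length() ? -1 : common;
    }

    public static void main(String[] args) {
        // "abc·" encoded as UTF-8: the middle dot becomes the two bytes 0xC2 0xB7.
        byte[] bytes = "abc\u00b7".getBytes(StandardCharsets.UTF_8);
        int offset = firstMismatch(bytes, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("first mismatch at character offset " + offset);
    }
}
```

To try it with the encodings from the exception, the two charsets could be looked up with Charset.forName("x-EUC-CN") and Charset.forName("GBK"), assuming the running JVM provides both.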
The head of the page source is:
-------------------------------------
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=gb2312" />
<title>
NIKE搞笑广告 · 分享视频 · TVix.cn · 免费视频分享平台
</title>
<meta http-equiv=Content-Type content="text/html;
charset=GBK">
<meta name="description" content="Tvix.cn · 免费视频分
享平台 · 精彩视频">
-------------------------------------
My Java code is:
-----------------------------------
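import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

Parser parser = new Parser(vrl, Parser.DEVNULL);
parser.setEncoding("GBK");
TagNameFilter tf = new TagNameFilter("textarea");
NodeList list = parser.extractAllNodesThatMatch(tf);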
---------------------------------------
Whether I change 'parser.setEncoding("GBK")' to
'parser.setEncoding("gb2312")' or to 'parser.setEncoding("GB2312")',
the parser still throws the same exception.
This is hard for me to debug. Please help me, or
just point me in the right direction. Thanks!
The URL the parser requests is:
http://www.tvix.cn/play.php?v=VKm2qLblS1k
---------------
By the way, can anybody here speak Chinese? ^^ As you can see, my English is quite poor -_-