[Htmlparser-user] some problem when trying to extract text from a html page
From: 林森 <sc...@gm...> - 2009-05-14 03:58:45
Hello,

I am using HtmlParser to extract the text from a web page (http://news.sina.com.cn/c/2009-05-13/024915613519s.shtml). I have written a function, "extractText", for this. It processes the nodes recursively and is designed not to visit LinkTag, ScriptTag, RemarkNode, StyleTag, etc., so it returns immediately when it encounters one of these nodes.

The result is not satisfactory, though. The extracted text still contains some script code, as follows:

    '; }else if(Id==1){ if(GetObj("hotwords_link").innerHTML == ""){ GetObj("hotwords").style.display = "none"; }else{ GetObj("hotwords").style.display = "block"; } GetObj("pbg").innerHTML = ''; } } }

I tried to find out which node contains that code and printed the siblings of that node. The result is:

------------------Previous sibling begin----------------------
    /a script type="text/javascript"
------------------Previous sibling end------------------------

------------------Next sibling begin--------------------------
    a href=" http://www.google.cn/webhp?client=aff-sina&ie=gb&oe=utf8&hl=zh-CN&channel=contentlogo" target="_blank" style="text-decoration:none;" '; } } } /script script type="text/javascript" script type="text/javascript" table cellspacing="0" width="589" 热搜代码 style type="text/css" div id="hotwords" style="height:20px; overflow:hidden; margin:10px 0 0 0; display:none;"
------------------Next sibling end----------------------------

My guess is that HtmlParser makes a mistake when it encounters the ' character. How can I solve this problem? I really want to exclude all of the script code.

Thanks.
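
For reference, the kind of traversal described in this message (recurse through the tree, return as soon as a link, script, remark, or style node is reached, and collect the remaining text) can be sketched roughly as below. This is only an illustration under stated assumptions: SimpleNode, SKIPPED, and samplePage are hypothetical stand-ins invented for this sketch, not HtmlParser's own classes, and the poster's actual extractText is not shown in the message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExtractTextSketch {

    // Hypothetical stand-in for a parsed node; this is NOT HtmlParser's Node interface.
    static final class SimpleNode {
        final String kind;  // e.g. "div", "a", "script", "style", "remark", "text"
        final String text;  // only used when kind is "text"
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String kind, String text, SimpleNode... kids) {
            this.kind = kind;
            this.text = text;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    // Node kinds whose whole subtree is ignored (assumed names for this sketch).
    static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("a", "script", "style", "remark"));

    // Depth-first walk: stop at skipped kinds, append trimmed text of text nodes.
    static void extractText(SimpleNode node, StringBuilder out) {
        if (SKIPPED.contains(node.kind)) {
            return; // do not descend into links, scripts, styles, or remarks
        }
        if ("text".equals(node.kind)) {
            String t = node.text.trim();
            if (!t.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(t);
            }
            return;
        }
        for (SimpleNode child : node.children) {
            extractText(child, out);
        }
    }

    static String extractText(SimpleNode root) {
        StringBuilder out = new StringBuilder();
        extractText(root, out);
        return out.toString();
    }

    // Small hand-built tree: two text nodes, one script, one link.
    static SimpleNode samplePage() {
        return new SimpleNode("div", null,
                new SimpleNode("text", "Headline"),
                new SimpleNode("script", null,
                        new SimpleNode("text", "GetObj(\"pbg\").innerHTML = '';")),
                new SimpleNode("a", null, new SimpleNode("text", "link label")),
                new SimpleNode("text", "Body text"));
    }

    public static void main(String[] args) {
        System.out.println(extractText(samplePage())); // prints "Headline Body text"
    }
}
```

Note that a skip like this only helps when the script text really sits inside a script node. If the parser ends the script element early, as the sibling dump in this message suggests may be happening, the leftover code arrives as an ordinary text node and no check on the node type will filter it out.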