Thread: [Htmlparser-user] some problem when trying to extract text from a html page
From: 林森 <sc...@gm...> - 2009-05-14 03:58:45
Hello,

I am using HtmlParser to extract text from a web page (http://news.sina.com.cn/c/2009-05-13/024915613519s.shtml). I have written a function, "extractText", for this. It processes nodes recursively and is designed not to visit LinkTag, ScriptTag, RemarkNode, StyleTag, etc., so it returns as soon as it encounters one of these nodes.

However, the result is not satisfactory. The extracted text still contains some script code, as follows:

'; }else if(Id==1){ if(GetObj("hotwords_link").innerHTML == ""){ GetObj("hotwords").style.display = "none"; }else{ GetObj("hotwords").style.display = "block"; } GetObj("pbg").innerHTML = ''; } } }

I tried to identify which node contains that code and printed its sibling nodes. The result is:

------------------Previous sibling begin----------------------:
/a
script type="text/javascript"
-------------------------Previous sibling end---------------------.
--------------------Next sibling begin:-----------------------
a href="http://www.google.cn/webhp?client=aff-sina&ie=gb&oe=utf8&hl=zh-CN&channel=contentlogo" target="_blank" style="text-decoration:none;"
';
}
}
}
/script
script type="text/javascript"
script type="text/javascript"
table cellspacing="0" width="589"
热搜代码
style type="text/css"
div id="hotwords" style="height:20px; overflow:hidden; margin:10px 0 0 0; display:none;"
----------------------------Next sibling end.---------------------------------------------

My guess is that HtmlParser makes a mistake when it encounters a ' character. How can I solve this problem? I really want to exclude all script code.

Thanks.
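For readers following the thread, the recursive approach described above can be sketched roughly as follows. This is not the poster's actual code and does not use HtmlParser's classes: `SimpleNode`, `SKIPPED`, and the tree layout are hypothetical stand-ins chosen so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ExtractTextSketch {

    /** Hypothetical stand-in for a parsed node; NOT an HtmlParser class. */
    static final class SimpleNode {
        final String tagName;  // null for a text node
        final String text;     // null for a tag node
        final List<SimpleNode> children = new ArrayList<>();

        private SimpleNode(String tagName, String text) {
            this.tagName = tagName;
            this.text = text;
        }

        static SimpleNode tag(String name, SimpleNode... kids) {
            SimpleNode n = new SimpleNode(name, null);
            n.children.addAll(Arrays.asList(kids));
            return n;
        }

        static SimpleNode text(String s) {
            return new SimpleNode(null, s);
        }
    }

    // Tags whose whole subtree is ignored (assumed list: links, scripts, styles).
    static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("a", "script", "style"));

    /** Recursively appends visible text, returning early on skipped tags. */
    static void extractText(SimpleNode node, StringBuilder out) {
        if (node.tagName == null) {
            String t = node.text.trim();
            if (!t.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(t);
            }
            return;
        }
        if (SKIPPED.contains(node.tagName.toLowerCase(Locale.ROOT))) {
            return; // do not descend into script/style/link subtrees
        }
        for (SimpleNode child : node.children) {
            extractText(child, out);
        }
    }

    static String extractText(SimpleNode root) {
        StringBuilder out = new StringBuilder();
        extractText(root, out);
        return out.toString();
    }
}
```

A filter like this only works if the parser keeps the script body inside the script node. If part of the script ends up as a sibling text node outside it, as the sibling dump above appears to show, a tag-based check alone will not catch it.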
From: Joshua K. <jo...@in...> - 2009-05-14 04:29:25
Have you considered using any of the Visitors in HtmlParser?

--jk

--
best regards,
jk

Industrial Logic, Inc.
Joshua Kerievsky
Founder, Extreme Programmer & Coach
http://industriallogic.com
866-540-8336 (toll free)
510-540-8336 (phone)
Berkeley, California

Learn Code Smells, Refactoring and TDD at http://industriallogic.com/elearning
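As an illustration of the visitor idea suggested in this reply, here is a minimal, self-contained sketch. It deliberately does not call HtmlParser's own visitor classes: `SkippingTextVisitor` and its `visitTag`, `visitEndTag`, and `visitText` callbacks are hypothetical names, and the skipped-tag list is an assumption. The point is only the technique: the traversal drives the callbacks, and the visitor keeps a depth counter so that text inside skipped tags is ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical visitor; the callback names are illustrative, not HtmlParser's API. */
class SkippingTextVisitor {
    // Assumed set of tags whose contents should not appear in the output.
    private static final Set<String> SKIPPED =
            new HashSet<>(Arrays.asList("a", "script", "style"));

    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside at least one skipped tag

    /** Called when an opening tag is encountered. */
    void visitTag(String name) {
        if (SKIPPED.contains(name.toLowerCase(Locale.ROOT))) {
            skipDepth++;
        }
    }

    /** Called when a closing tag is encountered. */
    void visitEndTag(String name) {
        if (skipDepth > 0 && SKIPPED.contains(name.toLowerCase(Locale.ROOT))) {
            skipDepth--;
        }
    }

    /** Called for each text node; keeps it only outside skipped tags. */
    void visitText(String s) {
        if (skipDepth > 0) {
            return;
        }
        String t = s.trim();
        if (t.isEmpty()) {
            return;
        }
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(t);
    }

    String getText() {
        return text.toString();
    }
}
```

A driver would call `visitTag("script")`, `visitText(...)`, and `visitEndTag("script")` in document order; only text seen while `skipDepth` is zero reaches `getText()`. Compared with the recursive version, the filtering state lives in the visitor rather than in the traversal code.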