Re: [Htmlparser-user] some problem when trying to extract text from a html page
From: Joshua K. <jo...@in...> - 2009-05-14 04:29:25
Have you considered using any of the Visitors in the htmlparser?

--jk

2009/5/13 林森 <sc...@gm...>

> Hello,
>
> I am using htmlparser to extract text from a webpage
> (http://news.sina.com.cn/c/2009-05-13/024915613519s.shtml).
> I have written a function "extractText" to deal with this problem.
> The function uses recursion to process nodes and is designed not to visit
> LinkTag, ScriptTag, RemarkNode, StyleTag, etc., so it returns when it
> encounters one of these nodes.
> But the result I got is not satisfactory.
> I find that some script code still shows up in the output, as follows:
>
> '; }else if(Id==1){ if(GetObj("hotwords_link").innerHTML == ""){
> GetObj("hotwords").style.display = "none"; }else{
> GetObj("hotwords").style.display = "block"; } GetObj("pbg").innerHTML = '';
> } } }
>
> I tried to find out which node contains that code and printed the sibling
> nodes of that node. The result is:
>
> ------------------Previous sibling begin----------------------:
> /a
> script type="text/javascript"
>
> -------------------------Previous sibling end---------------------.
> --------------------Next sibling begin:-----------------------
> a href="
> http://www.google.cn/webhp?client=aff-sina&ie=gb&oe=utf8&hl=zh-CN&channel=contentlogo"
> target="_blank" style="text-decoration:none;"
> ';
> }
> }
> }
>
> /script
>
> script type="text/javascript"
>
> script type="text/javascript"
>
> table cellspacing="0" width="589"
>
> 热搜代码
>
> style type="text/css"
>
> div id="hotwords" style="height:20px; overflow:hidden; margin:10px 0 0 0;
> display:none;"
>
> ----------------------------Next sibling end.---------------------------------------------
>
> I guess HtmlParser makes a mistake when it encounters the ' character.
> How can I solve this problem? I really want to exclude all of the script
> code.
>
> Thanks.
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user

--
best regards,
jk

Industrial Logic, Inc.
Joshua Kerievsky
Founder, Extreme Programmer & Coach
http://industriallogic.com
866-540-8336 (toll free)
510-540-8336 (phone)
Berkeley, California

Learn Code Smells, Refactoring and TDD at http://industriallogic.com/elearning
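
For readers following the thread, here is a minimal sketch of the visitor idea suggested in the reply. It deliberately does not use htmlparser itself: `Node`, `TagNode`, `TextNode`, `Visitor`, and `VisibleTextVisitor` are hypothetical stand-ins invented for this illustration, so that the example runs with the JDK alone. The library's own `NodeVisitor` class offers comparable callbacks (`visitTag`, `visitEndTag`, `visitStringNode`), so the same depth-counting approach should carry over, but treat that mapping as an assumption to check against the Javadoc. The sketch also assumes the tree is built correctly; if the parser ends a script element early, the leaked text sits outside the script node and a visitor alone will not hide it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: every type here is a hypothetical stand-in,
// not a class from the htmlparser library.
final class VisitorSketch {

    interface Visitor {
        void visitTag(String name);     // an element opens
        void visitEndTag(String name);  // an element closes
        void visitText(String text);    // a text node is reached
    }

    abstract static class Node {
        abstract void accept(Visitor v);
    }

    static final class TextNode extends Node {
        private final String text;
        TextNode(String text) { this.text = text; }
        @Override void accept(Visitor v) { v.visitText(text); }
    }

    static final class TagNode extends Node {
        private final String name;
        private final List<Node> children = new ArrayList<>();
        TagNode(String name, Node... kids) {
            this.name = name;
            for (Node k : kids) children.add(k);
        }
        @Override void accept(Visitor v) {
            v.visitTag(name);
            for (Node c : children) c.accept(v);
            v.visitEndTag(name);
        }
    }

    // Collects text, ignoring anything nested inside script, style, or a.
    static final class VisibleTextVisitor implements Visitor {
        private final StringBuilder out = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside a skipped element

        private static boolean skipped(String name) {
            return name.equalsIgnoreCase("script")
                || name.equalsIgnoreCase("style")
                || name.equalsIgnoreCase("a");
        }
        @Override public void visitTag(String name) { if (skipped(name)) skipDepth++; }
        @Override public void visitEndTag(String name) { if (skipped(name)) skipDepth--; }
        @Override public void visitText(String text) {
            String t = text.trim();
            if (skipDepth == 0 && !t.isEmpty()) {
                if (out.length() > 0) out.append(' ');
                out.append(t);
            }
        }
        String text() { return out.toString(); }
    }

    // Walks the tree with the visitor and returns the collected text.
    static String extract(Node root) {
        VisibleTextVisitor v = new VisibleTextVisitor();
        root.accept(v);
        return v.text();
    }

    public static void main(String[] args) {
        Node page = new TagNode("div",
            new TextNode("headline"),
            new TagNode("script", new TextNode("GetObj(\"pbg\").innerHTML = '';")),
            new TagNode("a", new TextNode("link text")),
            new TextNode("body text"));
        System.out.println(extract(page)); // prints "headline body text"
    }
}
```

Compared with a recursive function that returns early on certain node types, the visitor keeps the skip rule in one place (the `skipped` check) and leaves the traversal to the tree.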