Thread: [Htmlparser-user] some problem when trying to extract text from a html page


Hello,
I am using htmlparser to extract the text from a web page
(http://news.sina.com.cn/c/2009-05-13/024915613519s.shtml).
I have written a function, "extractText", for this purpose.
The function processes nodes recursively and is designed not to visit
LinkTag, ScriptTag, RemarkNode, StyleTag and similar nodes, so it returns as
soon as it encounters one of them.
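
A minimal sketch of this kind of recursive traversal is shown here; it is not the actual extractText function, which is not reproduced in this message. To keep it runnable on its own, it uses stand-in Node and Kind types (both hypothetical, not the real org.htmlparser classes) and a small hand-built tree in place of a parsed page.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; these are NOT the org.htmlparser classes.
class ExtractTextSketch {

    // Hypothetical node kinds, loosely mirroring the node types named above.
    enum Kind { TEXT, TAG, LINK, SCRIPT, STYLE, REMARK }

    static final class Node {
        final Kind kind;
        final String text; // only used for TEXT nodes
        final List<Node> children = new ArrayList<>();

        Node(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Appends the visible text below 'node' to 'out', skipping excluded subtrees.
    static void extractText(Node node, StringBuilder out) {
        if (node.kind == Kind.LINK || node.kind == Kind.SCRIPT
                || node.kind == Kind.STYLE || node.kind == Kind.REMARK) {
            return; // do not descend into excluded nodes
        }
        if (node.kind == Kind.TEXT) {
            String trimmed = node.text.trim();
            if (!trimmed.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(trimmed);
            }
            return;
        }
        for (Node child : node.children) {
            extractText(child, out);
        }
    }

    // Convenience wrapper that returns the collected text as a String.
    static String extractText(Node root) {
        StringBuilder out = new StringBuilder();
        extractText(root, out);
        return out.toString();
    }

    // Small hand-built tree standing in for a parsed page.
    static Node sampleTree() {
        return new Node(Kind.TAG, null)
            .add(new Node(Kind.TAG, null).add(new Node(Kind.TEXT, "Headline")))
            .add(new Node(Kind.SCRIPT, null).add(new Node(Kind.TEXT, "var x = 1;")))
            .add(new Node(Kind.LINK, null).add(new Node(Kind.TEXT, "click here")))
            .add(new Node(Kind.REMARK, "a comment"))
            .add(new Node(Kind.TAG, null).add(new Node(Kind.TEXT, "  Body text ")));
    }

    public static void main(String[] args) {
        System.out.println(extractText(sampleTree()));
    }
}
```

A traversal like this can only exclude script code that actually sits inside a script node. If the parser closes the script node too early, the remaining code becomes an ordinary text node and is extracted along with the page text.
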
However, the result is not satisfactory.
The extracted text still contains some script code, such as the following:

'; }else if(Id==1){ if(GetObj("hotwords_link").innerHTML == ""){
GetObj("hotwords").style.display = "none"; }else{
GetObj("hotwords").style.display = "block"; } GetObj("pbg").innerHTML = '';
} } }

I tried to identify which node contains that code and printed the sibling
nodes of that node. The result is:

------------------Previous sibling begin----------------------:
/a
script type="text/javascript"

-------------------------Previous sibling end---------------------.
--------------------Next sibling  begin:-----------------------
a href="
http://www.google.cn/webhp?client=aff-sina&ie=gb&oe=utf8&hl=zh-CN&channel=contentlogo"
target="_blank" style="text-decoration:none;"
';
                                                        }
                                                }
                                        }

/script

script type="text/javascript"

script type="text/javascript"

table cellspacing="0" width="589"

热搜代码 ("hot search code")

style type="text/css"

div id="hotwords" style="height:20px; overflow:hidden; margin:10px 0 0 0;
display:none;"

----------------------------Next sibling end.---------------------------------------------
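
A sibling dump like the one above could be produced roughly as sketched below. This is only an illustration under assumptions: the Node class, its label field and the sample tree are hypothetical stand-ins, not the org.htmlparser types, and the labels are shortened placeholders rather than the real tags from the page.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in tree type for illustration only; NOT the org.htmlparser Node.
class SiblingDumpSketch {

    static final class Node {
        final String label;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    // Labels of all siblings that come before 'node' under the same parent.
    static List<String> previousSiblings(Node node) {
        List<String> labels = new ArrayList<>();
        if (node.parent == null) {
            return labels;
        }
        List<Node> siblings = node.parent.children;
        int index = siblings.indexOf(node); // identity match: Node does not override equals
        for (int i = 0; i < index; i++) {
            labels.add(siblings.get(i).label);
        }
        return labels;
    }

    // Labels of all siblings that come after 'node' under the same parent.
    static List<String> nextSiblings(Node node) {
        List<String> labels = new ArrayList<>();
        if (node.parent == null) {
            return labels;
        }
        List<Node> siblings = node.parent.children;
        int index = siblings.indexOf(node);
        for (int i = index + 1; i < siblings.size(); i++) {
            labels.add(siblings.get(i).label);
        }
        return labels;
    }

    // Builds a tiny tree and returns the node that holds the leaked script text.
    static Node sampleSuspect() {
        Node suspect = new Node("leaked-text");
        new Node("div")
            .add(new Node("script"))
            .add(suspect)
            .add(new Node("a"))
            .add(new Node("table"));
        return suspect;
    }

    public static void main(String[] args) {
        Node suspect = sampleSuspect();
        System.out.println("Previous siblings: " + previousSiblings(suspect));
        System.out.println("Next siblings: " + nextSiblings(suspect));
    }
}
```
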

My guess is that HtmlParser goes wrong when it encounters the ' character
inside the script. How can I solve this problem? I really want to exclude all
of the script code.

Thanks.
