Hi all,
I am new to HtmlParser and don't know how to clean up the markup inside a composite tag, for example the "<p>", "<br>" and "&nbsp;" in a paragraph like this: <p>&nbsp;&nbsp;text1<br>&nbsp;&nbsp;text2</p>. Can anybody tell me how to do that? I tried my best but failed.
Here is my method:
public void getText() {
    try {
        Parser parser = new Parser("http://money.finance.sina.com.cn/corp/view/vCB_AllBulletinDetail.php?stockid=000002&id={3DAAE7D0-EB20-66F9-E040-640A12016145}");
        parser.setEncoding("gb2312");
        // Match only <div id="content">
        HasAttributeFilter attributeFilter = new HasAttributeFilter("id", "content");
        NodeFilter filter = new AndFilter(new TagNameFilter("div"), attributeFilter);
        NodeList nodeList = parser.extractAllNodesThatMatch(filter);
        for (int i = 0; i < nodeList.size(); i++) {
            // getStringText() returns the div's inner markup, tags and character references included
            String notices = ((Div) nodeList.elementAt(i)).getStringText();
            System.out.println(notices);
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
}
Any help would be appreciated!
wjsjw
2007-11-15
Use a StringBean as a NodeVisitor on the nodeList to eliminate the tags.
Then apply Translate.decode to change the character references to real characters.
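To make the two steps in this reply concrete, here is a minimal, dependency-free sketch of what they produce on the paragraph from the question. It is not HtmlParser's implementation: the class TextCleanupSketch and its methods stripTags, decodeReferences and clean are hypothetical names made up for illustration, the tag stripping is a naive regular expression rather than a real parser, and only a handful of named character references are handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative stand-in only; all names here are hypothetical, not HtmlParser API. */
class TextCleanupSketch {

    // Matches named (&nbsp;), decimal (&#65;) and hex (&#x41;) character references.
    private static final Pattern REFERENCE =
            Pattern.compile("&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z]+);");

    /** Step 1: turn <br> into a newline and drop every other tag (naive, regex-based). */
    static String stripTags(String html) {
        String withBreaks = html.replaceAll("(?i)<br\\s*/?>", "\n");
        return withBreaks.replaceAll("<[^>]*>", "");
    }

    /** Step 2: replace character references with the characters they stand for. */
    static String decodeReferences(String text) {
        Matcher m = REFERENCE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decodeOne(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String decodeOne(String body, String original) {
        if (body.startsWith("#")) {
            boolean hex = body.startsWith("#x") || body.startsWith("#X");
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            // Leave the reference untouched if it does not name a valid code point.
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : original;
        }
        switch (body) {
            case "nbsp": return "\u00a0";
            case "amp":  return "&";
            case "lt":   return "<";
            case "gt":   return ">";
            case "quot": return "\"";
            default:     return original; // unknown name: leave untouched
        }
    }

    /** Both steps, then non-breaking spaces become plain spaces and each line is trimmed. */
    static String clean(String html) {
        // Tags are stripped before decoding so that a decoded "&lt;p&gt;" is not mistaken for a tag.
        String decoded = decodeReferences(stripTags(html));
        StringBuilder out = new StringBuilder();
        for (String line : decoded.replace('\u00a0', ' ').split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(trimmed);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "text1" and "text2" on separate lines.
        System.out.println(clean("<p>&nbsp;&nbsp;text1<br>&nbsp;&nbsp;text2</p>"));
    }
}
```

With HtmlParser itself, the corresponding calls are the ones named in this reply: visit the nodeList with a StringBean to collect the text, then pass the result to Translate.decode.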
Dear Derrick Oswald,
Thank you for your help. It works well. :)
Code:
StringBean sb = new StringBean();
for (int i = 0; i < nodeList.size(); i++) {
    // String notices = ((Div)nodeList.elementAt(i)).getStringText();
    nodeList.visitAllNodesWith(sb);
    System.out.println(sb.getStrings());
}
You shouldn't need to loop over the nodeList; a single call to nodeList.visitAllNodesWith(sb) should do it.
Dear Oswald,
Thank you very much. I get it.
Code:
StringBean sb = new StringBean();
nodeList.visitAllNodesWith(sb);
System.out.println(sb.getStrings());
Best wishes,
wjsjw
2007-11-17