[Htmlparser-user] StringBean.getStrings() problem
From: Soumya <sou...@ya...> - 2007-01-18 13:41:12
Hi,
I have spent a couple of days getting htmlparser to work the way I need it to.
The problem I have now is that my page has nested tables (tables inside a table) and I want the
contents of each individual table.
Through a NodeList I am able to loop over and get the individual tables. Now I want only the
text and not the HTML, so I thought of following the remark by Nick Burch in the StringBean class,
which reads as follows:
* You can also use the StringBean as a NodeVisitor on your own parser,
* in which case you have to refetch your page if you change one of the
* properties because it resets the Strings property:</p>
* <pre>
* StringBean sb = new StringBean ();
* Parser parser = new Parser ("http://cbc.ca");
* parser.visitAllNodesWith (sb);
* String s = sb.getStrings ();
* sb.setLinks (true);
* parser.reset ();
* parser.visitAllNodesWith (sb);
* String sl = sb.getStrings ();
My Parser.getStringsForMe() code looks like the following:
public String getStringsForMe() {
    StringBuffer resultPage = new StringBuffer();
    this.setResource(pageURL);
    NodeList nl = this.parse(null);
    // Match every <table> tag. TagNameFilter.setName() would change the tag name
    // being matched, so it is not called here; the id check below picks totalTable.
    TagNameFilter tableFilter = new TagNameFilter("table");
    NodeList tableNodes = nl.extractAllNodesThatMatch(tableFilter, true);
    SimpleNodeIterator it = tableNodes.elements();
    StringBean sb = new StringBean(); // one StringBean, reused for every child table
    while (it.hasMoreNodes()) {
        Node node = it.nextNode();
        if (node.getText().contains("id=\"totalTable\"")) {
            String inputHTML = node.toHtml();
            this.setInputHTML(inputHTML); // set the content to parse
            NodeList totalTableChildren = this.parse(new TagNameFilter("table"));
            SimpleNodeIterator it1 = totalTableChildren.elements();
            while (it1.hasMoreNodes()) {
                Node childTableNode = it1.nextNode();
                // toHtml() gives the markup to re-parse; toString() gives a debug dump
                String tableContent = childTableNode.toHtml();
                this.setInputHTML(tableContent);
                this.visitAllNodesWith(sb);
                resultPage.append(sb.getStrings());
                this.reset();
                sb.setLinks(false);
            }
        }
    }
    return resultPage.toString(); // the method is declared to return String
}
The sb.getStrings() call returns the same text every time, because StringBean.getStrings() only
sets or updates the strings when mStrings is null. When the same StringBean is used repeatedly
to getStrings(), as above, mStrings stays non-null after the first visit, so the newly extracted
contents in StringBean.mBuffer are never appended to mStrings.
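To make the behaviour I am describing concrete, here is a minimal, self-contained sketch of it.
It does not use htmlparser at all: LazyTextCollector and StaleCacheDemo are hypothetical names
made up for illustration, and the caching logic is only my reading of what getStrings() does,
not the library source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a visitor that caches its extracted text. */
class LazyTextCollector {
    private final StringBuilder buffer = new StringBuilder(); // text gathered while visiting
    private String cached;                                    // stays null until first getText()

    /** Simulates one visit: the extracted text is appended to the buffer. */
    void visit(String text) {
        buffer.append(text);
    }

    /** Builds the result only while the cache is null; later calls return the cached value. */
    String getText() {
        if (cached == null) {
            cached = buffer.toString();
        }
        return cached;
    }
}

class StaleCacheDemo {
    /** Reuses one collector for every input, as the loop in my code does. */
    static List<String> reuseOneCollector(List<String> inputs) {
        LazyTextCollector collector = new LazyTextCollector();
        List<String> results = new ArrayList<String>();
        for (String input : inputs) {
            collector.visit(input);
            results.add(collector.getText()); // cache is non-null after the first pass
        }
        return results;
    }

    /** Creates a new collector per input, so no cached value carries over. */
    static List<String> freshCollectorPerInput(List<String> inputs) {
        List<String> results = new ArrayList<String>();
        for (String input : inputs) {
            LazyTextCollector collector = new LazyTextCollector();
            collector.visit(input);
            results.add(collector.getText());
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> tables = Arrays.asList("table one", "table two");
        System.out.println(reuseOneCollector(tables));      // [table one, table one]
        System.out.println(freshCollectorPerInput(tables)); // [table one, table two]
    }
}
```

In this sketch, a fresh collector per table avoids the stale result. Whether creating a new
StringBean for each child table is the intended usage in htmlparser is an assumption on my part.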
I have made it work for me, but I would like to know whether I am missing something or whether
this is a bug for cases like mine.
Thanks in advance,
Soumya