[Htmlparser-user] StringBean.getStrings() problem
From: Soumya <sou...@ya...> - 2007-01-18 13:41:12
Hi,

I have spent a couple of days getting htmlparser to work the way I need it to. The problem I have now is that my page has nested tables (tables inside a table) and I want the contents of each individual table. Through NodeList I am able to loop and get the individual tables. Now I want only the text and not the HTML, so I thought of following the remark by Nick Burch in the StringBean class:

     * You can also use the StringBean as a NodeVisitor on your own parser,
     * in which case you have to refetch your page if you change one of the
     * properties because it resets the Strings property:</p>
     * <pre>
     *     StringBean sb = new StringBean ();
     *     Parser parser = new Parser ("http://cbc.ca");
     *     parser.visitAllNodesWith (sb);
     *     String s = sb.getStrings ();
     *     sb.setLinks (true);
     *     parser.reset ();
     *     parser.visitAllNodesWith (sb);
     *     String sl = sb.getStrings ();

My Parser.getStringsForMe() code looks like the following:

    public String getStringsForMe() {
        StringBuffer resultPage = new StringBuffer();
        this.setResource (pageURL);
        NodeList nl = this.parse(null);
        TagNameFilter tableFilter = new TagNameFilter("table");
        tableFilter.setName("totalTable");
        NodeList tableNodes = nl.extractAllNodesThatMatch(filter, true);
        SimpleNodeIterator it = tableNodes.elements();
        StringBean sb = new StringBean();
        while (it.hasMoreNodes()) {
            Node node = it.nextNode();
            if (node.getText().contains("id=\"totalTable\"")) {
                String inputHTML = node.toHtml();
                this.setInputHTML(inputHTML); // set the content to parse
                NodeList totalTableChildren = this.parse(new TagNameFilter("table"));
                SimpleNodeIterator it1 = totalTableChildren.elements();
                while (it1.hasMoreNodes()) {
                    Node childTableNode = it1.nextNode();
                    String tableContent = childTableNode.toString();
                    this.setInputHTML(tableContent);
                    this.visitAllNodesWith(sb);
                    resultPage.append(sb.getStrings());
                    this.reset();
                    sb.setLinks(false);
                }
            }
        }
        return resultPage.toString();
    }

sb.getStrings() returns the same text every time, because StringBean.getStrings() checks whether mStrings is null and only then proceeds to set / update the strings. In a case like the one above, where the same StringBean is used repeatedly to call getStrings(), StringBean.mStrings remains non-null after the first visit, and hence the newly extracted contents in StringBean.mBuffer are not appended to mStrings.

I have made it work for me. I need to know whether there is anything else I am missing, or whether this is a bug for cases like mine.

Thanks in advance,
Soumya
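To make the symptom concrete without depending on the library, here is a minimal, self-contained sketch of the caching pattern described above. CachingCollector is a hypothetical stand-in that models only the "compute once while the cached field is null" behaviour; it is not the real StringBean source, and its buffer and strings fields merely play the roles of mBuffer and mStrings. The class and method names are made up for illustration. Using a fresh collector per table is shown as one possible workaround under that assumption, not necessarily the fix I applied.

```java
public class CachedStringsDemo {

    /** Hypothetical model of the caching described above; NOT the real StringBean. */
    static class CachingCollector {
        private final StringBuilder buffer = new StringBuilder(); // plays the role of mBuffer
        private String strings;                                   // plays the role of mStrings

        /** Simulates visiting a text node: the text is collected into the buffer. */
        void visitText(String text) {
            buffer.append(text);
        }

        /** Computes the result only while the cached value is still null. */
        String getStrings() {
            if (strings == null) {
                strings = buffer.toString();
            }
            return strings;
        }
    }

    /** Reuses one collector for all tables: the first result is returned every time. */
    static String collectReusing(String[] tables) {
        CachingCollector collector = new CachingCollector();
        StringBuilder out = new StringBuilder();
        for (String table : tables) {
            collector.visitText(table);
            out.append(collector.getStrings()).append('\n');
        }
        return out.toString();
    }

    /** Creates a fresh collector per table, so no stale cached value survives. */
    static String collectFresh(String[] tables) {
        StringBuilder out = new StringBuilder();
        for (String table : tables) {
            CachingCollector collector = new CachingCollector();
            collector.visitText(table);
            out.append(collector.getStrings()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] tables = {"first table", "second table"};
        System.out.print(collectReusing(tables)); // prints "first table" twice
        System.out.print(collectFresh(tables));   // prints "first table", then "second table"
    }
}
```

In the reused case the second table's text does reach the buffer, but getStrings() never looks at it again because the cached field is already set, which matches what I see with a single StringBean shared across the inner loop.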