[Htmlparser-user] Problem with HTMLParser - I can't extract any div's.
From: Jan S. <net...@gm...> - 2011-07-29 07:44:41
I've run into a small problem and would appreciate your help. I'm trying to use HTMLParser in my project.

Example page that I'm trying to process:
http://www.fanfiction.net/s/7229512/1/A_Horse_With_No_Name

Looking at the source, there's a div with id and class 'storytext' inside a div with id and class 'storytextp', and the 'storytext' div contains a lot of <p> tags. I want to extract the contents of that 'storytext' div as a plain-text string.

This is what I'm trying:

    // parser is an org.htmlparser.Parser created earlier (not shown)
    NodeList nodeList = new NodeList();
    // accept <div> tags that have a <p> child
    NodeFilter nodeFilter = new AndFilter(
            new TagNameFilter("div"),
            new HasChildFilter(new TagNameFilter("p")));
    for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
        e.nextNode().collectInto(nodeList, nodeFilter);
    }
    System.out.println(nodeList.toNodeArray().length);
    for (Node node : nodeList.toNodeArray()) {
        System.out.println(node.toPlainTextString());
    }

The result? The length of nodeList.toNodeArray() is zero, so I must be doing something wrong. I also tried RegexFilter("storytext"), but that didn't work either.

How should I do this? Please help, I've been trying to get this working for the past week :p
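
For comparison, here is a minimal sketch of the same extraction done without HTMLParser, using the HTML parser that ships with the JDK (javax.swing.text.html). This is a swapped-in technique, not HTMLParser's API, and it does not explain why the filter above returns nothing. The class name StoryTextExtractor, the method extractDivText, and the sample HTML are hypothetical; the sample only mimics the nesting described above (a 'storytext' div inside a 'storytextp' div).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; uses only JDK classes, not HTMLParser.
class StoryTextExtractor {

    /** Collects the text found inside any div whose id equals targetId. */
    static String extractDivText(String html, String targetId) throws IOException {
        StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Div nesting depth relative to the target div; 0 means "outside it".
            private int depth = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.DIV) {
                    return;
                }
                if (depth > 0) {
                    depth++; // a div nested inside the target div
                } else if (targetId.equals(attrs.getAttribute(HTML.Attribute.ID))) {
                    depth = 1; // entering the target div
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.DIV && depth > 0) {
                    depth--; // reaches 0 when the target div closes
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    text.append(data).append('\n'); // one line per text run
                }
            }
        };

        // The third argument tells the parser to ignore any charset declaration.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample that only mirrors the structure described in the post.
        String html = "<html><body><div id=\"storytextp\" class=\"storytextp\">"
                + "<div id=\"storytext\" class=\"storytext\">"
                + "<p>First paragraph.</p><p>Second paragraph.</p>"
                + "</div></div><div id=\"footer\"><p>Footer</p></div></body></html>";
        System.out.println(extractDivText(html, "storytext"));
    }
}
```

The callback counts nested divs so that collection stops exactly when the 'storytext' div closes, which keeps text from sibling divs out of the result.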