Re: [Htmlparser-user] Help on extracting clean body content from web page
From: James M. <jam...@a-...> - 2007-11-16 00:16:35
For more clarification, here is what I tried:

    Parser lParser = new Parser();
    try {
        lParser.setInputHTML(pHTML); // as instructed in the JavaDocs
    } catch (ParserException e) {
        mLogger.info("getContent():: Caught ParsingException...");
    }

    NodeList lDocumentNodeList;
    NodeList lNodes;
    try {
        lDocumentNodeList = lParser.parse(null); // I want to start with the entire document
        lNodes = lDocumentNodeList.extractAllNodesThatMatch(new TagNameFilter("BODY")); // I want the BODY tag
        mLogger.info("lNodes.size() = " + lNodes.size()); // Using Log4J, I see that the size returned is 0 when it should be 1.
        if (lNodes.size() > 0) {
            // None of this code executes, because size = 0.
            String lText = lNodes.toString(); // I'm not sure if I'm doing this right or not, but until the NodeList problem is resolved I can't troubleshoot it.
            String lasString = lNodes.asString();
            mLogger.info("lText = " + lText);
            mLogger.info("lasString = " + lasString);
        }
    } catch (ParserException e) {
        mLogger.info("ResponseParser:: Parsing exception caught.");
    }

Thanks again for your help.
--
James Mortensen
A-CTI Development Team