Re: [Htmlparser-user] Help on extracting clean body content from web page
From: James M. <jam...@a-...> - 2007-11-16 00:16:35
For more clarification, here is what I tried:
Parser lParser = new Parser();
try {
    lParser.setInputHTML(pHTML); // as instructed in the JavaDocs
} catch (ParserException e) {
    mLogger.info("getContent():: Caught ParserException...");
}

NodeList lDocumentNodeList;
NodeList lNodes;
try {
    // I want to start with the entire document
    lDocumentNodeList = lParser.parse(null);
    // I want the BODY tag
    lNodes = lDocumentNodeList.extractAllNodesThatMatch(new TagNameFilter("BODY"));
    // Using Log4J, I see that the size returned is 0 when it should be 1.
    mLogger.info("lNodes.size() = " + lNodes.size());
    // None of the following code executes because size = 0.
    if (lNodes.size() > 0) {
        // I'm not sure if I'm doing this right or not, but until the
        // NodeList problem is resolved I can't troubleshoot it.
        String lText = lNodes.toString();
        String lasString = lNodes.asString();
        mLogger.info("lText = " + lText);
        mLogger.info("lasString = " + lasString);
    }
} catch (ParserException e) {
    mLogger.info("ResponseParser:: Parsing exception caught.");
}
Thanks again for your help.
--
James Mortensen
A-CTI Development Team