Thread: Re: [Htmlparser-user] Help on extracting clean body content from web page
From: Derrick O. <der...@ro...> - 2007-11-13 11:58:12
You probably want the StringBean. The main() method of StringBean is an example of its use.

----- Original Message ----
From: cash cash <ca...@ya...>
To: htm...@li...
Sent: Tuesday, November 13, 2007 1:07:33 AM
Subject: [Htmlparser-user] Help on extracting clean body content from web page

Hi all,

I am new to htmlparser. I have downloaded it and tried a few examples. However, I am having trouble working out the "correct way" to achieve my goal.

I'm looking for a way to extract the body content from a web page, excluding all script sections. For example, given the following text:

    <html>
    <head><title>title</title>
    <style>
    css style
    </style>
    </head>
    <body>
    Hello world
    <?php phpinfo() ?>
    </body>

The correct code should extract only "Hello world".

Can anyone help me with this? Thanks in advance.

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
From: James M. <jam...@a-...> - 2007-11-16 00:08:14
Hello,

I'm trying to pull the body content from an HTML String using your parsing utilities. The problem I'm having is not how to GET the HTML. I have the HTML stored in a String. I am using Web Services, and the content that I need is provided to me via third-party code as a String object.

Therefore, I need your parser to take HTML as a String object, parse it for the body tag, and return the innerHTML of the body tag as a String.

Below is the content that I retrieve in a String object:

    <html><head></head>
    <body>Hello World</body>
    </html>

    String myHTML = myWebServices.getHTMLContent(); // this returns the above HTML in a String object
    ....
    ...
    ..
    // this is the missing piece, which is how to load the HTML into the parser and return the innerHTML of the BODY tag.
    ...
    ....
    String bodyContent = // This is the "Hello World" text that I'm looking for so that I can use it without the HTML.

The FAQ does not appear to address this question. Thanks in advance for your help in clearing up these issues.

James Mortensen

--
James Mortensen
A-CTI Development Team
From: James M. <jam...@a-...> - 2007-11-16 00:16:35
For more clarification, here is what I tried:

    Parser lParser = new Parser();
    try {
        lParser.setInputHTML(pHTML); // as instructed in the JavaDocs
    } catch (ParserException e) {
        mLogger.info("getContent():: Caught ParsingException...");
    }

    NodeList lDocumentNodeList;
    NodeList lNodes;
    try {
        lDocumentNodeList = lParser.parse(null); // I want to start with the entire document
        lNodes = lDocumentNodeList.extractAllNodesThatMatch(new TagNameFilter("BODY")); // I want the BODY tag
        mLogger.info("lNodes.size() = " + lNodes.size()); // Using Log4J, I see that the size returned is 0 when it should be 1.
        if (lNodes.size() > 0) { // none of this code executes because size = 0
            // I'm not sure if I'm doing this right or not, but until the
            // NodeList problem is resolved I can't troubleshoot it
            String lText = lNodes.toString();
            String lasString = lNodes.asString();
            mLogger.info("lTExt = " + lText);
            mLogger.info("lasString = " + lasString);
        }
    } catch (ParserException e) {
        mLogger.info("ResponseParser:: Parsing exception caught.");
    }

Thanks again for your help.

--
James Mortensen
A-CTI Development Team
From: James M. <jam...@a-...> - 2007-11-16 00:27:19
I found the solution to my problem!

    lNodes = lDocumentNodeList.extractAllNodesThatMatch(new TagNameFilter("BODY"), true);

The BODY tag was nested inside another element, and by default the boolean recursive flag is false, meaning that only the nodes at the top level of the list are checked and nested elements are not returned. After setting this flag (the second parameter) to true, the problem was resolved and I was able to retrieve my content.

Hope this helps someone!

--
James Mortensen
A-CTI Development Team
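The effect of the recursive flag described above can be illustrated with a toy model. The sketch below is not htmlparser code: `Node`, `findByName`, and `sampleDocument` are hypothetical names invented for illustration, and the tree is built by hand on the assumption that the top-level list holds only the HTML tag, with HEAD and BODY as its children. It only shows why a non-recursive search of the top-level list misses a nested BODY tag while a recursive search finds it.

```java
import java.util.ArrayList;
import java.util.List;

class RecursiveFlagDemo {

    // Toy stand-in for a parsed tag. Hypothetical; NOT an htmlparser class.
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    // Collects the nodes called `name` from `nodes`.
    // Children are searched only when `recursive` is true.
    static List<Node> findByName(List<Node> nodes, String name, boolean recursive) {
        List<Node> matches = new ArrayList<>();
        for (Node node : nodes) {
            if (node.name.equalsIgnoreCase(name)) {
                matches.add(node);
            }
            if (recursive) {
                matches.addAll(findByName(node.children, name, true));
            }
        }
        return matches;
    }

    // Assumed shape of the top-level list for
    // <html><head></head><body>Hello World</body></html>:
    // a single HTML node that contains HEAD and BODY.
    static List<Node> sampleDocument() {
        Node html = new Node("HTML", new Node("HEAD"), new Node("BODY"));
        List<Node> topLevel = new ArrayList<>();
        topLevel.add(html);
        return topLevel;
    }

    public static void main(String[] args) {
        List<Node> topLevel = sampleDocument();
        // Non-recursive: only HTML is inspected, so BODY is not found.
        System.out.println(findByName(topLevel, "BODY", false).size()); // prints 0
        // Recursive: the children of HTML are inspected as well.
        System.out.println(findByName(topLevel, "BODY", true).size());  // prints 1
    }
}
```

In this model the non-recursive call returns an empty list for the same reason the logged size in the earlier attempt was 0: BODY is not one of the top-level nodes.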