Re: [Htmlparser-user] parser help
From: Miguel A. M. <mig...@gm...> - 2012-08-27 08:23:18
Hello Ernest,

This is the function I use to extract the text. I hope it helps you.

    // Returns the text of the page at the given URL,
    // or null if the page cannot be parsed.
    public StringBuilder textExtractor(String URL) {
        StringBuilder textInPage = null;
        try {
            Parser parser = new Parser(URL);
            TextExtractingVisitor visitor = new TextExtractingVisitor();
            parser.visitAllNodesWith(visitor);
            textInPage = new StringBuilder(visitor.getExtractedText());
        } catch (ParserException ex) {
            // Log the failure; textInPage stays null in this case.
            Logger.getLogger(HTMLAnalizer.class.getName()).log(Level.SEVERE, null, ex);
        }
        return textInPage;
    }

Regards,
Miguel

On 24 August 2012 21:14, Ernest Cronin <ern...@gm...> wrote:

> Hi,
>
> I use the parser a lot for work. One thing I've noticed is that many
> news articles have comment sections containing plain text, but the
> parser doesn't pick them up. What is it about the comment sections
> that makes them unreadable? Is there a different class I should be
> using?
>
> Thank you,
> Ernest
>
> On Wed, Aug 17, 2011 at 4:25 PM, ernest cronin <ern...@gm...> wrote:
>
>> Hi,
>>
>> I have been trying to use the parser for some time and I have been
>> unable to get it to do exactly what I want, which is to gather only
>> the plain text, without JavaScript or style content. Here is the code
>> I've been running:
>>
>>     public class Test
>>     {
>>         public static void main (String[] args)
>>         {
>>             try
>>             {
>>                 Parser parser = new Parser (args[0]);
>>                 TextExtractingVisitor visitor = new TextExtractingVisitor();
>>                 parser.visitAllNodesWith(visitor);
>>                 String textInPage = visitor.getExtractedText();
>>                 System.out.println(textInPage);
>>             }
>>             catch (ParserException pe)
>>             {
>>                 pe.printStackTrace ();
>>             }
>>         }
>>     }
>>
>> I could really use some help with this!
>>
>> Thanks,
>> Ernest
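
For reference, here is a rough sketch of what "plain text without JavaScript or style content" means in practice. It does not use HTMLParser at all: it is a crude regex-based approximation built only on java.util.regex, and the class name PlainTextSketch and the method extractText are made up for this illustration. Regular expressions cannot handle every HTML page, so treat it as a quick diagnostic rather than a replacement for the parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser: a rough, regex-based
// approximation of plain-text extraction for quick checks.
class PlainTextSketch {

    // Whole <script>...</script> and <style>...</style> elements, bodies included.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // HTML comments such as <!-- ... -->.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Any remaining tag, opening or closing.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Runs of whitespace, collapsed to a single space at the end.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Strips script/style elements, comments, and tags, then normalizes whitespace.
    // Character entities (for example &amp;) are left undecoded.
    static String extractText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style>"
                + "<script>var x = 1;</script></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html)); // prints "Hello world"
    }
}
```

Running this over the raw HTML of an article is one way to check whether the comment text is in the downloaded page at all. If it is missing there too, one possible explanation (an assumption, not something confirmed in this thread) is that the site loads its comments with JavaScript after the page is fetched, in which case no visitor class would see them.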