Re: [Htmlparser-user] parser help
From: Miguel A. M. <mig...@gm...> - 2012-08-27 08:23:18
Hello Ernest,
This is the function I use in order to extract the text. I hope it helps
you.
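// Needs org.htmlparser.Parser, org.htmlparser.util.ParserException,
// org.htmlparser.visitors.TextExtractingVisitor,
// java.util.logging.Level and java.util.logging.Logger.
public StringBuilder textExtractor(String URL) {
    // Stays null if parsing fails, so callers must check for null.
    StringBuilder textInPage = null;
    try {
        Parser parser = new Parser(URL);
        TextExtractingVisitor visitor = new TextExtractingVisitor();
        // Walk every node of the page and let the visitor collect the text.
        parser.visitAllNodesWith(visitor);
        textInPage = new StringBuilder(visitor.getExtractedText());
    } catch (ParserException ex) {
        Logger.getLogger(HTMLAnalizer.class.getName()).log(Level.SEVERE, null, ex);
    }
    return textInPage;
}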
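For the original request (plain text only, without JavaScript or style content), here is a minimal sketch of one way to filter while parsing. It does not use HTMLParser at all: it swaps in the JDK's built-in javax.swing.text.html parser so that it runs without extra jars. The class name PlainTextSketch and the method extractText are hypothetical names chosen for illustration, and the sample HTML in main is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of HTMLParser): collects text chunks and
// ignores anything reported while inside a <script> or <style> element.
class PlainTextSketch {

    static String extractText(String html) {
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean skipping = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Script and style bodies cannot contain tags, so any other
                // start tag means we are back in ordinary content.
                skipping = (tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE) {
                    skipping = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (!skipping) {
                    // Separate chunks with a space so words do not run together.
                    out.append(data).append(' ');
                }
            }
        };

        try {
            // true = ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Made-up sample page for illustration only.
        String html = "<html><head><style>p { color: red; }</style>"
                + "<script>var hidden = 1;</script></head>"
                + "<body><p>Hello</p><p>World</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

This only filters what is present in the HTML the parser receives. Text that a page adds with JavaScript after loading, as comment sections often do, may not be in that HTML at all, in which case neither this sketch nor a visitor would see it.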
Regards,
Miguel
On 24 August 2012 21:14, Ernest Cronin <ern...@gm...> wrote:
> Hi,
>
> I use the parser a lot for work. One thing I've noticed is that many
> news articles have comment sections containing plain text, but the
> parser doesn't pick that text up. What is it about the comment
> sections that makes them unreadable? Is there a different class I should be
> using?
>
> Thank you,
> ernest
>
> On Wed, Aug 17, 2011 at 4:25 PM, ernest cronin <ern...@gm...> wrote:
>
>> Hi,
>>
>> I have been trying to use the parser for some time and I have been unable
>> to get it to do exactly what I want, which is to gather only the plaintext
>> without javascript or style stuff. Here is the code I've been running:
>>
>> public class Test
>> {
>>     public static void main (String[] args)
>>     {
>>         try
>>         {
>>             Parser parser = new Parser (args[0]);
>>             TextExtractingVisitor visitor = new TextExtractingVisitor();
>>             parser.visitAllNodesWith(visitor);
>>             String textInPage = visitor.getExtractedText();
>>             System.out.println(textInPage);
>>         }
>>         catch (ParserException pe)
>>         {
>>             pe.printStackTrace ();
>>         }
>>     }
>> }
>>
>> I could really use some help with this!
>>
>> Thanks,
>> Ernest
>>
>>
>
>