Thread: [Htmlparser-user] Only extract text from div tag with specific attribute

[Htmlparser-user] Only extract text from div tag with specific attribute

From: Jumbo P. <jum...@gm...> - 2008-04-01 18:54:09

Hello,

I'm trying to extract only the page text inside div tags with the attribute
class="body".  These divs contain other tags (h1, h2, p, etc.); the tags
themselves should be ignored, but their enclosed text should be included
with the rest of the body text.

I'm using extractAllNodesThatMatch, but I don't see how to limit it to
div tags with the attribute class="body".

Can anyone figure this out?

Re: [Htmlparser-user] Only extract text from div tag with specific attribute

From: Joshua K. <jo...@in...> - 2008-04-01 21:39:10

You could write your own NodeVisitor for this.   --jk


Re: [Htmlparser-user] Only extract text from div tag with specific attribute

From: Jumbo P. <jum...@gm...> - 2008-04-01 22:06:27

Thanks for the reply, Joshua.  I think that's what I'm trying to do.  The
part I'm stuck on is how to select only the div tag that has the attribute
class="body".  Here is my code:

String contents = null;

Parser parser = new Parser(url);
ObjectFindingVisitor visitor = new ObjectFindingVisitor(Div.class);
parser.visitAllNodesWith(visitor);

Node[] nodes = visitor.getTags(); // do I really want to use getTags() here?
for (int i = 0; i < nodes.length; i++)
{
    // if nodes[i] has attribute class="body", then get the page text
    // enclosed in the div tags
    // what to do here?
}

return contents;


Obviously I am new to htmlparser, so many thanks in advance.

Re: [Htmlparser-user] Only extract text from div tag with specific attribute

From: Joshua K. <jo...@in...> - 2008-04-01 23:38:51

You'll want to write your very own Visitor.

Something like this (I'm using an older version of htmlparser for this
example):

public class DivVisitor extends NodeVisitor {

    public void visitTag(Tag tag) {
        // see if the tag is a div tag here and then check its attributes
        // if it matches what you want, collect it into something that this
        // visitor can return via some getter method
    }
}

Send your DivVisitor into the parser as you were doing with the
ObjectFindingVisitor.

Hope that helps,
jk
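
The logic such a visitor needs can be sketched without the library. The class below is a stand-in, not htmlparser code: its callbacks are named after NodeVisitor's visitTag, visitStringNode, and visitEndTag, but they take plain strings and a Map instead of the library's Tag and text node types (an assumption made so the sketch compiles on its own). The class name DivBodyTextCollector and the getText() getter are hypothetical. The idea is to track how deep we are inside a matching div and to collect text only while that depth is above zero.

```java
import java.util.Map;

// Stand-in for a NodeVisitor subclass. The parameter types are simplified
// assumptions (String/Map) rather than htmlparser's own Tag and text types.
class DivBodyTextCollector {
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // greater than 0 while inside a matching div

    // Called for every opening tag.
    public void visitTag(String name, Map<String, String> attributes) {
        boolean isDiv = "div".equalsIgnoreCase(name);
        if (depth > 0) {
            if (isDiv) depth++; // nested div inside the target div
        } else if (isDiv && "body".equals(attributes.get("class"))) {
            depth = 1;          // entering a div with class="body"
        }
    }

    // Called for every run of text between tags.
    public void visitStringNode(String s) {
        if (depth > 0) {
            text.append(s);     // keep text, ignore the tags around it
        }
    }

    // Called for every closing tag.
    public void visitEndTag(String name) {
        if (depth > 0 && "div".equalsIgnoreCase(name)) depth--;
    }

    // Getter for the collected text (hypothetical name).
    public String getText() {
        return text.toString();
    }
}
```

Only div tags change the depth counter, so inner tags such as h1 or p are skipped while their text is still appended.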


Re: [Htmlparser-user] Only extract text from div tag with specific attribute

From: Jumbo P. <jum...@gm...> - 2008-04-02 19:31:49

I figured it out.  It's actually pretty simple.  Here is the code.  Thanks
anyway.

Parser p = new Parser(url);
NodeList list = p.extractAllNodesThatMatch(
    new AndFilter(new TagNameFilter("div"),
                  new HasAttributeFilter("class", "body")));
StringBean sb = new StringBean();
list.visitAllNodesWith(sb);
System.out.println(sb.getStrings());
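
Why this works can be illustrated with a library-free sketch. Everything below is a stand-in and not htmlparser's API: the Node class, tagName, hasAttribute, extractAll, and textOf are hypothetical names that loosely mirror TagNameFilter, HasAttributeFilter, extractAllNodesThatMatch, and the text collection done by StringBean. Predicate.and plays the role of AndFilter: a node is kept only when both conditions hold, and the text is then gathered from all of its descendants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class FilterSketch {
    // Hypothetical stand-in for a parsed node, not an htmlparser class.
    static final class Node {
        final String name;                 // null for a text node
        final Map<String, String> attrs;
        final String text;                 // null for a tag
        final List<Node> children;

        Node(String name, Map<String, String> attrs, String text, List<Node> children) {
            this.name = name;
            this.attrs = attrs;
            this.text = text;
            this.children = children;
        }

        static Node tagNode(String name, Map<String, String> attrs, Node... children) {
            return new Node(name, attrs, null, List.of(children));
        }

        static Node textNode(String text) {
            return new Node(null, Map.of(), text, List.of());
        }
    }

    // Accepts tags with the given name (case-insensitive); rejects text nodes.
    static Predicate<Node> tagName(String name) {
        return n -> name.equalsIgnoreCase(n.name);
    }

    // Accepts nodes whose attribute `key` has exactly the given value.
    static Predicate<Node> hasAttribute(String key, String value) {
        return n -> value.equals(n.attrs.get(key));
    }

    // Walks the whole tree and returns every node the filter accepts.
    static List<Node> extractAll(Node root, Predicate<Node> filter) {
        List<Node> out = new ArrayList<>();
        collect(root, filter, out);
        return out;
    }

    private static void collect(Node n, Predicate<Node> filter, List<Node> out) {
        if (filter.test(n)) out.add(n);
        for (Node child : n.children) collect(child, filter, out);
    }

    // Concatenates the text of a node and all of its descendants.
    static String textOf(Node n) {
        if (n.name == null) return n.text;
        StringBuilder sb = new StringBuilder();
        for (Node child : n.children) sb.append(textOf(child));
        return sb.toString();
    }
}
```

With this stand-in, `extractAll(page, tagName("div").and(hasAttribute("class", "body")))` returns only the matching divs, and `textOf` on each result yields the enclosed text with the inner tags dropped.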


