Thread: [Htmlparser-user] Only extract text from div tag with specific attribute
Brought to you by:
derrickoswald
From: Jumbo P. <jum...@gm...> - 2008-04-01 18:54:09
Hello,

I'm trying to extract only the page text inside div tags with the attribute class="body". Inside the div-body tags are other tags, e.g. h1, h2, p, etc., which themselves should be ignored but their enclosed text should be included with the rest of the body text.

I'm using extractAllNodesThatMatch but I don't see where I can limit it only to the div tag with the attribute class="body".

Can anyone figure this out?
From: Joshua K. <jo...@in...> - 2008-04-01 21:39:10
You could write your own NodeVisitor for this.

--jk

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
From: Jumbo P. <jum...@gm...> - 2008-04-01 22:06:27
Thanks for the reply, Joshua. I think that's what I'm trying to do. The part I'm stuck on is where to distinguish that I only want the div tag that has the attribute class="body". Here is my code:

    String contents = null;

    Parser parser = new Parser(url);
    ObjectFindingVisitor visitor = new ObjectFindingVisitor(Div.class);
    parser.visitAllNodesWith(visitor);

    Node[] nodes = visitor.getTags(); // do I really want to use getTags() here?
    for (int i = 0; i < nodes.length; i++)
    {
        // if nodes[i] has attribute class="body", then get the page text enclosed in the div tags
        // what to do here?
    }

    return contents;

Obviously I am new to htmlparser, so much thanks in advance.
From: Joshua K. <jo...@in...> - 2008-04-01 23:38:51
You'll want to write your very own Visitor.

Something like this (I'm using an older version of htmlparser for this example):

    public class DivVisitor extends NodeVisitor {

        public void visitTag(Tag tag) {
            // see if the tag is a div tag here and then check its attributes
            // if it matches what you want, collect it into something that this
            // visitor can return via some getter method
        }
    }

Send your DivVisitor into the parser as you were doing with the ObjectFindingVisitor.

Hope that helps,
jk
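For readers who want to see the visitor idea end to end, here is a minimal, library-independent sketch. It does not use the htmlparser classes: MiniNode, BodyDivVisitor and BodyDivVisitorDemo are hypothetical names made up for illustration, and the tree is built by hand instead of being parsed from a URL. The point is only the bookkeeping: the visitor counts how deep it is inside a div with class="body" and keeps text nodes only while that count is above zero, so the text of nested h1 and p tags is included while the tags themselves are ignored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Entry point: builds a small hand-made tree and runs the visitor over it.
class BodyDivVisitorDemo {
    static String run() {
        MiniNode page = MiniNode.tag("html",
            MiniNode.tag("div", MiniNode.text("Navigation")).attr("class", "nav"),
            MiniNode.tag("div",
                MiniNode.tag("h1", MiniNode.text("Title")),
                MiniNode.tag("p", MiniNode.text("First paragraph."))
            ).attr("class", "body"));
        BodyDivVisitor visitor = new BodyDivVisitor();
        visitor.visit(page);
        return visitor.getText();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "Title First paragraph."
    }
}

// Hypothetical stand-in for a parsed HTML node; NOT an htmlparser class.
// A node is either a tag (name set, text null) or a text node (name null).
class MiniNode {
    final String name;
    final String text;
    final Map<String, String> attributes = new HashMap<>();
    final List<MiniNode> children = new ArrayList<>();

    private MiniNode(String name, String text) {
        this.name = name;
        this.text = text;
    }

    static MiniNode tag(String name, MiniNode... kids) {
        MiniNode node = new MiniNode(name, null);
        for (MiniNode kid : kids) {
            node.children.add(kid);
        }
        return node;
    }

    static MiniNode text(String text) {
        return new MiniNode(null, text);
    }

    MiniNode attr(String key, String value) {
        attributes.put(key, value);
        return this;
    }
}

// Collects text only while the traversal is inside <div class="body">.
class BodyDivVisitor {
    private final StringBuilder collected = new StringBuilder();
    private int depthInsideBody = 0; // > 0 while under a matching div

    void visit(MiniNode node) {
        if (node.name == null) {
            // Text node: keep it only when some ancestor is a matching div.
            if (depthInsideBody > 0) {
                if (collected.length() > 0) {
                    collected.append(' ');
                }
                collected.append(node.text.trim());
            }
            return;
        }
        boolean matches = node.name.equalsIgnoreCase("div")
                && "body".equals(node.attributes.get("class"));
        if (matches) {
            depthInsideBody++;
        }
        for (MiniNode child : node.children) {
            visit(child);
        }
        if (matches) {
            depthInsideBody--;
        }
    }

    String getText() {
        return collected.toString();
    }
}
```

With a real parser the same counter would be raised when the matching opening tag is visited and lowered at its end tag; which callbacks to override for that depends on the htmlparser version in use.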
From: Jumbo P. <jum...@gm...> - 2008-04-02 19:31:49
I figured it out. It's actually pretty simple. Here is the code. Thanks anyway.

    Parser p = new Parser(url);
    NodeList list = p.extractAllNodesThatMatch(
        new AndFilter(
            new TagNameFilter("div"),
            new HasAttributeFilter("class", "body")));
    StringBean sb = new StringBean();
    list.visitAllNodesWith(sb);
    System.out.println(sb.getStrings());
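To illustrate why combining a tag-name filter with an attribute filter narrows the match to the right div, here is a small stand-alone sketch of the AND composition. It uses java.util.function.Predicate from the JDK rather than the htmlparser filter classes, and SimpleTag, AndFilterSketch, tagNameIs and hasAttribute are hypothetical names invented for this example. A tag is accepted only when both conditions hold, so a div with another class, or a p with class="body", is rejected.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Entry point: prints the ids of the tags accepted by the combined filter.
class AndFilterSketchDemo {
    public static void main(String[] args) {
        System.out.println(AndFilterSketch.matchingIds());
    }
}

// Hypothetical stand-in for a parsed tag; NOT an htmlparser class.
class SimpleTag {
    final String id;   // label used only to report which tags matched
    final String name;
    final Map<String, String> attributes = new HashMap<>();

    SimpleTag(String id, String name) {
        this.id = id;
        this.name = name;
    }

    SimpleTag attr(String key, String value) {
        attributes.put(key, value);
        return this;
    }
}

class AndFilterSketch {
    // Accepts tags with the given name, ignoring case.
    static Predicate<SimpleTag> tagNameIs(String name) {
        return tag -> tag.name.equalsIgnoreCase(name);
    }

    // Accepts tags whose attribute `key` has exactly the value `value`.
    static Predicate<SimpleTag> hasAttribute(String key, String value) {
        return tag -> value.equals(tag.attributes.get(key));
    }

    static List<String> matchingIds() {
        List<SimpleTag> tags = new ArrayList<>();
        tags.add(new SimpleTag("a", "div").attr("class", "body"));
        tags.add(new SimpleTag("b", "div").attr("class", "nav"));
        tags.add(new SimpleTag("c", "p").attr("class", "body"));
        tags.add(new SimpleTag("d", "div"));

        // Both conditions must hold, mirroring the AND of the two filters.
        Predicate<SimpleTag> both =
                tagNameIs("div").and(hasAttribute("class", "body"));

        List<String> ids = new ArrayList<>();
        for (SimpleTag tag : tags) {
            if (both.test(tag)) {
                ids.add(tag.id);
            }
        }
        return ids;
    }
}
```

Only the tag labelled "a" passes both tests. In the thread's solution the matched div nodes are then handed to a StringBean, which gathers the text they contain.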