Thread: [Htmlparser-user] how to extract text after a certain text
From: Dave <jav...@ya...> - 2006-11-04 09:52:24
I am new to htmlparser.

For example,

<div>Good Morning</div>
<h3>Description</h3>
<pre>
Text to extract Line1
Text to extract Line2
</pre>
<div>Good Morning</div>

My question: how do I extract

Text to extract Line1
Text to extract Line2

after "Description" using filters?

I tried the Sibling and HasChild filters, but they do not work. I also noticed that <pre> is not treated as a tag.

Thanks for help!
Dave
From: Dave <jav...@ya...> - 2006-11-04 11:52:33
<pre>
text1
</pre>

<table>
<tr><td>text2</td><tr>
</table>

>parse http://web-site table
show the whole table structure
>parse http://web-site pre
show the tag "pre" only, no text inside the pre tag.

It seems that pre is not treated as the parent node of "text1".

Is this a bug?

Thanks!
From: Derrick O. <Der...@Ro...> - 2006-11-04 21:52:20
Dave,

PRE has not been added as a tag because it very often is not closed by a /PRE. You can create your own "PRE" tag class derived from CompositeTag, and register it with a PrototypicalNodeFactory you give to the parser.

To answer your previous question about filters for:

<div>Good Morning</div>
<h3>Description</h3>
<pre>
*Text to extract Line1*
*Text to extract Line2*
</pre>
<div>Good Morning</div>

... find the H3 node (with Description as the contents),
... get its parent
... and extract all text from the parent's children (after the Heading)

so it would be something like
ExtractTextFromChildrenOf (HasSibling (And (TagName(H3), HasChild (String(Description)))))
This is a lot easier to construct with the FilterBuilder application.

... or alternatively I had thought of making a 'TriggerFilter' that would set a member flag when its subordinate filter went true, and after that would always return true because the flag was set... but then this member would need to be reset, or you would need to build the filter fresh for each parse.

Derrick
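The 'TriggerFilter' idea can be sketched as a latching predicate. This is a minimal, self-contained illustration and not htmlparser code: it uses java.util.function.Predicate over plain strings as a stand-in for the library's NodeFilter (whose method is accept(Node)), and the names TriggerFilterSketch, TriggerFilter, reset and keepAfter are hypothetical. It also assumes that the item which trips the trigger is itself rejected, a detail the description above leaves open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Sketch of a latching filter; all names here are hypothetical, not htmlparser API. */
public class TriggerFilterSketch {

    /** Rejects items until the wrapped predicate matches once, then accepts everything. */
    static final class TriggerFilter<T> implements Predicate<T> {
        private final Predicate<T> trigger; // the "subordinate filter"
        private boolean triggered;          // the "member flag"

        TriggerFilter(Predicate<T> trigger) {
            this.trigger = trigger;
        }

        @Override
        public boolean test(T item) {
            if (triggered) {
                return true; // flag already set: accept unconditionally
            }
            if (trigger.test(item)) {
                triggered = true; // latch; assumption: the triggering item is not accepted
            }
            return false;
        }

        /** Clears the flag so the same instance can be reused for another pass. */
        void reset() {
            triggered = false;
        }
    }

    /** Runs one pass over the items, resetting the latch first. */
    static List<String> keepAfter(List<String> items, TriggerFilter<String> filter) {
        filter.reset();
        List<String> kept = new ArrayList<>();
        for (String item : items) {
            if (filter.test(item)) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        TriggerFilter<String> afterDescription = new TriggerFilter<>("Description"::equals);
        List<String> texts = Arrays.asList(
                "Good Morning", "Description",
                "Text to extract Line1", "Text to extract Line2", "Good Morning");
        System.out.println(keepAfter(texts, afterDescription));
    }
}
```

Because the flag persists between calls, the sketch resets it at the start of every pass; building a fresh filter per parse would work just as well. Everything after the trigger passes, including the trailing "Good Morning", so the result still needs trimming.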
From: Dave <jav...@ya...> - 2006-11-05 09:32:01
Hi Derrick,

Thanks for help.

> ... find the H3 node (with Description as the contents),
> ... get its parent
> ... and extract all text from the parent's children (after the Heading)
> ExtractTextFromChildrenOf (HasSibling (And (TagName(H3), HasChild (String(Description)))))

I could not find the method ExtractTextFromChildrenOf(). Which class is it in? Does the extracted text include "Good Morning" or "Description"? I want only the text after the heading (Description).

Thanks!
Dave

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
From: Derrick O. <Der...@Ro...> - 2006-11-06 13:09:00
The StringBean is a NodeVisitor, so it can be applied to a NodeList to extract the text from a child list. I guess it's up to you to remove stuff you don't want.
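One way to "remove stuff you don't want" is to walk the parent's children in order and keep only the siblings that directly follow the heading. The following is a rough sketch under stated assumptions, not htmlparser code: Child is a hypothetical stand-in for a parsed node (a tag name plus its flattened text), whitespace-only text nodes between elements are assumed to have been dropped already, and <pre> is assumed to be available as a node that carries its text, which per the discussion above requires a custom tag class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: collect the text of the siblings that directly follow a heading. */
public class TextAfterHeadingSketch {

    /** Hypothetical stand-in for a parsed child node; not an htmlparser class. */
    static final class Child {
        final String tag;
        final String text;

        Child(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    /**
     * Returns the trimmed text of each consecutive wantedTag sibling that follows
     * the first headingTag child whose text equals headingText.
     */
    static List<String> textAfterHeading(List<Child> children, String headingTag,
                                         String headingText, String wantedTag) {
        List<String> result = new ArrayList<>();
        boolean afterHeading = false;
        for (Child child : children) {
            if (!afterHeading) {
                // Still searching for the heading; nothing before it is kept.
                afterHeading = child.tag.equalsIgnoreCase(headingTag)
                        && child.text.trim().equals(headingText);
                continue;
            }
            if (!child.tag.equalsIgnoreCase(wantedTag)) {
                break; // first unwanted sibling ends the run
            }
            result.add(child.text.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Child> children = Arrays.asList(
                new Child("div", "Good Morning"),
                new Child("h3", "Description"),
                new Child("pre", "Text to extract Line1\nText to extract Line2"),
                new Child("div", "Good Morning"));
        System.out.println(textAfterHeading(children, "h3", "Description", "pre"));
    }
}
```

Stopping at the first sibling that is not the wanted tag is what keeps the trailing "Good Morning" out of the result; a different stop rule (for example, the next heading) may suit other pages better.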
From: Ian M. <ian...@gm...> - 2006-11-06 09:07:54
One thing you could do is make a PRE tag class, if one does not currently exist. Ideally, you would then submit it to the project ;-)

Ian