Re: [Htmlparser-user] Regex Filter
From: Davide T. <da...@ta...> - 2008-05-11 08:10:08
Unfortunately, I think I need to remember the container tag. Let me try to explain my problem better.

My aim is to extract all the text included in a tag that contains a given substring. I have a list of excerpts from an RSS feed, and I need to extract the whole content of a web post knowing only the excerpt (the first sentence of the post).

For example, I have this excerpt:

"Davide Taibi, Luigi Lavazza, and Sandro Morasca* Università dell'Insubria* People and organizations that are considering the adoption of OSS..."

and I have to extract the content of this post:

http://www.taibi.it/?p=39

The first part of the excerpt is in a <strong> tag, while the second part is not. My idea is to find the container tag and then extract all of its content.

Which strategy should I use?

Thanks,
Davide

On Sun, May 11, 2008 at 5:28 AM, Derrick Oswald <der...@ro...> wrote:
> Do you want to keep the tags? If not just use the StringBean to extract all
> the text and then look for the string to get its position.
> If you need to keep the tags it is more difficult.
> Someone else had modified the StringBean to remember the node or offset of
> each piece of text added to the buffer.
> This list of nodes or offsets could be used after a straight string
> comparison on the text to figure out the start and end node or offsets. From
> there you can extract the complete html.
>
> ----- Original Message ----
> From: Davide Taibi <da...@ta...>
> To: htm...@li...
> Sent: Saturday, May 10, 2008 2:30:40 PM
> Subject: [Htmlparser-user] Regex Filter
>
> Dear all, I have a problem with regular expressions.
>
> I'd like to extract a block of text from an HTML page.
>
> I know how the text starts (the first 10 words), but I don't know if there
> are any tags inside.
>
> In other words, I have to find out whether a sentence "A" is written in an
> HTML page "B". My problem is that sentence "A" is plain text while page "B"
> is HTML, so the sentence could be nested in several nodes.
>
> Then...
> the first sentence can be written in the second with some HTML
> tags or extra spaces between words:
>
> Example:
>
> sentence a: "After hours of trying to sort the problem with uploading..."
> sentence b: "Dear All, <br/>After <i>hours of trying</i> to sort
> the <strong> problem with <a href="xxxxxx.html" >uploading
> pictures</a> </strong>to this thing I decided..."
>
> Sentence a should match sentence b at position 15.
>
> I've tried to do this, but it doesn't work:
>
> protected static String extractContent(String html, String searchText)
>         throws ParserException {
>     Page page = new Page(html);
>     Lexer lex = new Lexer(page);
>     Parser parser = new Parser(lex);
>     NodeList list = new NodeList();
>
>     NodeFilter filter = new RegexFilter(searchText);
>     for (NodeIterator it = parser.elements(); it.hasMoreNodes();) {
>         it.nextNode().collectInto(list, filter);
>     }
>     if (list.size() > 0) {
>         System.out.println("text found n." + list.size() + "times");
>         return Translate.decode(list.toHtml());
>     }
>     else
>         System.out.println("text not found");
>     return null;
> }
>
> Thanks in advance
>
> Davide Taibi
> http://www.taibi.it
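
For reference, the "remember the offset of each piece of text" idea from the reply above can be sketched in plain Java, without HTML Parser at all. This is only an illustration under stated assumptions, not the modified StringBean mentioned in the thread: the class name ExcerptLocator and the method indexOfExcerpt are made up for this example, tags are skipped with a naive '<' ... '>' scan (so a '>' inside an attribute value, comments, and scripts are not handled), entities are not decoded, and a trailing "..." on the excerpt is treated as an ellipsis and dropped before matching.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds a plain-text excerpt inside HTML, ignoring tags and extra spaces. */
final class ExcerptLocator {

    /**
     * Returns the offset in html of the first character of the excerpt,
     * or -1 if the excerpt does not occur in the visible text.
     */
    static int indexOfExcerpt(String html, String excerpt) {
        StringBuilder text = new StringBuilder();   // visible text, whitespace collapsed
        List<Integer> offsets = new ArrayList<>();  // offsets.get(i) = position in html of text.charAt(i)
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') inTag = false;        // naive: assumes no '>' inside attribute values
            } else if (c == '<') {
                inTag = true;
            } else if (Character.isWhitespace(c)) {
                // Collapse runs of whitespace (also across tags) into a single space.
                if (text.length() > 0 && text.charAt(text.length() - 1) != ' ') {
                    text.append(' ');
                    offsets.add(i);
                }
            } else {
                text.append(c);
                offsets.add(i);
            }
        }
        // Normalize the excerpt the same way; assumption: a trailing "..." is an ellipsis.
        String needle = excerpt.trim().replaceAll("\\s+", " ");
        if (needle.endsWith("...")) {
            needle = needle.substring(0, needle.length() - 3).trim();
        }
        if (needle.isEmpty()) return -1;
        int at = text.indexOf(needle);
        return at < 0 ? -1 : offsets.get(at);
    }
}
```

Calling ExcerptLocator.indexOfExcerpt(sentenceB, sentenceA) with the two sentences from the original message returns 15, the position asked for there. The same offsets list also gives the HTML position of the last matched character, from which a container tag could then be looked up with a real parser.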