Thread: Re: [Htmlparser-user] Regex Filter
Brought to you by:
derrickoswald
From: Derrick O. <der...@ro...> - 2008-05-11 03:29:16
Do you want to keep the tags? If not, just use the StringBean to extract all the text and then look for the string to get its position.

If you need to keep the tags it is more difficult. Someone else had modified the StringBean to remember the node or offset of each piece of text added to the buffer. This list of nodes or offsets could be used after a straight string comparison on the text to figure out the start and end node or offsets. From there you can extract the complete html.

----- Original Message ----
From: Davide Taibi <da...@ta...>
To: htm...@li...
Sent: Saturday, May 10, 2008 2:30:40 PM
Subject: [Htmlparser-user] Regex Filter

Dear all,

I have a problem with regular expressions.

I'd like to extract a block of text from an html page. I know how the text starts (the first 10 words) but I don't know if there are any tags inside.

In other words, I have to find out whether a sentence "A" is written in an Html page "B". My problem is that sentence "A" is written in plain text and the second one in html, where it could be nested in several nodes. So the first sentence can appear in the second with some html tags or spaces between words.

Example:

sentence a: "After hours of trying to sort the problem with uploading..."
sentence b: "Dear All, <br/>After <i>hours of trying</i> to sort the <strong> problem with <a href="xxxxxx.html" >uploading pictures</a> </strong>to this thing I decided..."

Sentence a should match b at position 15.

I've tried to do this but it doesn't work:

    protected static String extractContent(String html, String searchText)
            throws ParserException {
        // Build a parser over the in-memory HTML string
        Page page = new Page(html);
        Lexer lex = new Lexer(page);
        Parser parser = new Parser(lex);
        NodeList list = new NodeList();

        // Collect every node accepted by the filter
        NodeFilter filter = new RegexFilter(searchText);
        for (NodeIterator it = parser.elements(); it.hasMoreNodes();) {
            it.nextNode().collectInto(list, filter);
        }
        if (list.size() > 0) {
            System.out.println("text found n." + list.size() + " times");
            return Translate.decode(list.toHtml());
        } else {
            System.out.println("text not found");
        }
        return null;
    }

Thanks in advance

Davide Taibi
http://www.taibi.it
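
The offset idea in the reply above (extract the text, remember where each piece came from, then map a plain string match back to the html) can be sketched roughly as follows. This is not HTMLParser code and not the modified StringBean mentioned in the reply: it uses only the Java standard library, and the class OffsetTextFinder, its methods, and the naive tag stripping (skip everything between '<' and '>', collapse runs of whitespace) are assumptions made for illustration. Real pages with '>' inside attribute values, comments, scripts or character entities would need a proper parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: plain-text search that remembers HTML offsets. */
final class OffsetTextFinder {

    /** Plain text plus, for each text character, its offset in the HTML. */
    static final class Extracted {
        final String text;
        final int[] htmlOffset;

        Extracted(String text, int[] htmlOffset) {
            this.text = text;
            this.htmlOffset = htmlOffset;
        }
    }

    /** Naive extraction: drops tags and collapses whitespace runs to one space. */
    static Extracted extract(String html) {
        StringBuilder text = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') { inTag = true; continue; }
            if (c == '>' && inTag) { inTag = false; continue; }
            if (inTag) continue;
            if (Character.isWhitespace(c)) {
                // Skip leading whitespace and repeated whitespace
                if (text.length() == 0 || text.charAt(text.length() - 1) == ' ') continue;
                c = ' ';
            }
            text.append(c);
            offsets.add(i); // remember where this character sits in the HTML
        }
        int[] map = new int[offsets.size()];
        for (int i = 0; i < map.length; i++) map[i] = offsets.get(i);
        return new Extracted(text.toString(), map);
    }

    /** Returns {start, endExclusive} in the HTML for the first match, or null. */
    static int[] find(String html, String sentence) {
        if (sentence.isEmpty()) return null;
        Extracted e = extract(html);
        int at = e.text.indexOf(sentence);
        if (at < 0) return null;
        int last = at + sentence.length() - 1;
        return new int[] { e.htmlOffset[at], e.htmlOffset[last] + 1 };
    }
}
```

With sentence b from the original message as the html and the words of sentence a (without the trailing dots) as the search string, find returns a start offset of 15, and html.substring(start, end) gives the matched span with its tags still in place.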
From: Derrick O. <der...@ro...> - 2008-05-11 13:03:18
A brute force approach would be to generate the parse tree in a NodeList with Parser.parse(null). Then recursively traverse the tree, converting each sublist into text, until a plain text match occurs. In pseudocode the method would look something like this:

    findString (string, node_list)
        make a new StringBean
        apply visitAllNodesWith to the node list using the string_bean
        get the plain_text from the string_bean
        if string matches plain_text
            you are done, return the node_list
        else
            for each child in node_list
                try recursing into findString with the string and child

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
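
The findString pseudocode above can be sketched in Java along these lines. To stay self-contained it does not use HTMLParser: SketchNode is a hypothetical minimal tree type standing in for the parsed nodes, plainText() plays the role the StringBean has in the pseudocode, and ContainerFinder is likewise an invented name. One deliberate variation, which is an assumption about the intent: the sketch keeps descending while some child still contains the whole string, so it returns the deepest container rather than stopping at the first node that matches.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal tree node, standing in for a parsed HTML node. */
final class SketchNode {
    final String tag;   // tag name, or null for a text node
    final String text;  // own text for a text node, otherwise null
    final List<SketchNode> children = new ArrayList<>();

    private SketchNode(String tag, String text) {
        this.tag = tag;
        this.text = text;
    }

    static SketchNode ofText(String s) {
        return new SketchNode(null, s);
    }

    static SketchNode ofElement(String tag, SketchNode... kids) {
        SketchNode n = new SketchNode(tag, null);
        for (SketchNode k : kids) n.children.add(k);
        return n;
    }

    /** Concatenated text of this node and all of its descendants. */
    String plainText() {
        if (text != null) return text;
        StringBuilder sb = new StringBuilder();
        for (SketchNode c : children) sb.append(c.plainText());
        return sb.toString();
    }
}

final class ContainerFinder {
    /** Returns the deepest node whose plain text contains s, or null if none does. */
    static SketchNode findString(String s, SketchNode node) {
        if (!node.plainText().contains(s)) return null;
        for (SketchNode child : node.children) {
            SketchNode hit = findString(s, child);
            if (hit != null) return hit; // a child holds the whole string
        }
        return node; // the string spans several children, so this is the container
    }
}
```

For an excerpt whose first part sits inside a <strong> element and whose second part is a sibling text node, findString returns the element that encloses both, which is the container tag asked about in this thread. The sketch recomputes plainText() at every level, so it is quadratic in the worst case; that matches the brute force spirit of the suggestion.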
From: Davide T. <da...@ta...> - 2008-05-11 08:10:08
Unfortunately I think that I need to remember the container tag. I'll try to explain my problem better.

My aim is to extract all the text included in a tag that contains a substring. I have a list of excerpts from an RSS feed and I need to extract the whole content of a web post knowing only the excerpt (the first sentence of the post).

For example, I have this excerpt:

"Davide Taibi, Luigi Lavazza, and Sandro Morasca Università dell'Insubria People and organizations that are considering the adoption of OSS..."

and I have to extract the content of this post: http://www.taibi.it/?p=39

The first part of the excerpt is in a <strong> tag while the second is not. My idea is to find the container tag and then extract all the content.

Which strategy should I use?

Thanks
Davide