Re: [Htmlparser-user] Regex Filter


Unfortunately, I think I need to remember the container tag.

Let me try to explain my problem better.

My aim is to extract all the text inside the tag that contains a given
substring. I have a list of excerpts from an RSS feed, and I need to extract
the whole content of a web post knowing only its excerpt (the first sentence
of the post).

For example, I have this excerpt: "Davide Taibi, Luigi Lavazza, and Sandro
Morasca* Università dell'Insubria*  People and organizations that are
considering the adoption of OSS..."
and I have to extract the content of this post: http://www.taibi.it/?p=39

The first part of the excerpt is inside a <strong> tag, while the second part
is not.

My idea is to find the container tag and then extract all of its content.

Which strategy should I use?
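
One way the "find the container" idea could be sketched, assuming the start and end offsets of the excerpt inside the raw HTML are already known (from any matching step). This uses plain Java rather than the HTML Parser API; ContainerFinder, enclosingElement and the VOID list are hypothetical names, and the regex-based tag scan ignores comments, scripts and ">" inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContainerFinder {
    // Groups: 1 = "/" for a closing tag, 2 = tag name, 3 = "/" for a self-closing tag.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");
    // Elements that never have a closing tag (partial list, for illustration).
    private static final Set<String> VOID =
        Set.of("br", "hr", "img", "input", "meta", "link");

    private static final class Open {
        final String name;
        final int start;        // offset of '<' of the opening tag
        final int contentStart; // offset just after '>' of the opening tag
        Open(String name, int start, int contentStart) {
            this.name = name;
            this.start = start;
            this.contentStart = contentStart;
        }
    }

    /** Returns the HTML of the innermost element containing [start, end), or null. */
    static String enclosingElement(String html, int start, int end) {
        Deque<Open> stack = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (!closing) {
                if (!selfClosing && !VOID.contains(name)) {
                    stack.push(new Open(name, m.start(), m.end()));
                }
                continue;
            }
            // Ignore a closing tag that has no matching open element.
            if (stack.stream().noneMatch(o -> o.name.equals(name))) {
                continue;
            }
            // Pop unclosed children until the matching open element is found.
            Open open = stack.pop();
            while (!open.name.equals(name)) {
                open = stack.pop();
            }
            // Elements close innermost-first, so the first hit is the innermost.
            if (open.contentStart <= start && m.start() >= end) {
                return html.substring(open.start, m.end());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<div><p><strong>Davide Taibi</strong> People and organizations</p></div>";
        int start = html.indexOf("Davide");
        int end = html.indexOf("organizations") + "organizations".length();
        System.out.println(enclosingElement(html, start, end));
    }
}
```

The innermost element may be too narrow for a whole post (a single <p>, for instance); walking further up the stack to the first <div> would be a natural variation.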

Thanks

Davide

On Sun, May 11, 2008 at 5:28 AM, Derrick Oswald <der...@ro...>
wrote:

> Do you want to keep the tags? If not just use the StringBean to extract all
> the text and then look for the string to get its position.
> If you need to keep the tags it is more difficult.
> Someone else had modified the StringBean to remember the node or offset of
> each piece of text added to the buffer.
> This list of nodes or offsets could be used after a straight string
> comparison on the text to figure out the start and end node or offsets. From
> there you can extract the complete html.
>
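The offset-tracking approach described above could look roughly like the following. This is a plain-Java sketch, not the modified StringBean mentioned in the message: OffsetTextIndex and find are hypothetical names, tags are skipped with a naive "<" ... ">" scan, entities are not decoded, and a tag with no whitespace around it (as in "foo<br/>bar") does not separate words.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetTextIndex {
    private final StringBuilder text = new StringBuilder();
    // offsets.get(i) is the position in the HTML of the i-th character of text.
    private final List<Integer> offsets = new ArrayList<>();

    OffsetTextIndex(String html) {
        boolean inTag = false;
        boolean pendingSpace = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') inTag = false;
                continue;
            }
            if (c == '<') {
                inTag = true;
                continue;
            }
            if (Character.isWhitespace(c)) {
                // Collapse whitespace runs; drop leading whitespace entirely.
                pendingSpace = text.length() > 0;
                continue;
            }
            if (pendingSpace) {
                text.append(' ');
                offsets.add(i);
                pendingSpace = false;
            }
            text.append(c);
            offsets.add(i);
        }
    }

    /** Returns {start, end} offsets into the HTML (end exclusive), or null if absent. */
    int[] find(String sentence) {
        String needle = sentence.trim().replaceAll("\\s+", " ");
        if (needle.isEmpty()) return null;
        int start = text.indexOf(needle);
        if (start < 0) return null;
        int last = start + needle.length() - 1;
        return new int[] { offsets.get(start), offsets.get(last) + 1 };
    }

    public static void main(String[] args) {
        String html = "Dear All, <br/>After <i>hours of trying</i> to sort the <strong> problem</strong>";
        int[] span = new OffsetTextIndex(html).find("After hours of trying to sort");
        // Prints the raw HTML covered by the match, tags included.
        System.out.println(html.substring(span[0], span[1]));
    }
}
```

The returned offsets index the original HTML, so the matched span can be cut out with its tags intact or handed to a step that looks for the enclosing element.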
>
> ----- Original Message ----
> From: Davide Taibi <da...@ta...>
> To: htm...@li...
> Sent: Saturday, May 10, 2008 2:30:40 PM
> Subject: [Htmlparser-user] Regex Filter
>
> Dear all, I have a problem with regular expressions.
>
> I'd like to extract a block of text from an HTML page.
>
> I know how the text starts (the first 10 words), but I don't know whether
> there are any tags inside it.
>
> In other words, I have to find out whether a sentence "A" appears in an HTML
> page "B". My problem is that sentence "A" is plain text, while page "B" is
> HTML, so the sentence could be spread across several nested nodes.
>
> That is, the first sentence can appear in the second with HTML tags or
> extra spaces between its words:
>
> Example:
>
> sentence a: "After hours of trying to sort the problem with uploading..."
> sentence b: "Dear All, <br/>After <i>hours of trying</i> to sort
> the         <strong> problem with <a href="xxxxxx.html" >uploading
> pictures</a> </strong>to this thing I decided..."
>
> Sentence a should match sentence b at position 15.
>
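A regular expression can also be run directly against the raw HTML if the gaps between words are allowed to contain tags as well as whitespace. The sketch below is plain java.util.regex, not a RegexFilter; TagTolerantMatcher, compile and indexOf are hypothetical names, and a tag that falls in the middle of a word would still defeat it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagTolerantMatcher {
    // Between two words, accept one or more whitespace characters or tags.
    private static final String GAP = "(?:\\s|<[^>]*>)+";

    /** Builds a pattern that matches the sentence with tags or spaces between words. */
    static Pattern compile(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) regex.append(GAP);
            // Quote each word so characters such as '.' or '?' are taken literally.
            regex.append(Pattern.quote(words[i]));
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns the offset of the first match in the HTML, or -1 if there is none. */
    static int indexOf(String html, String sentence) {
        Matcher m = compile(sentence).matcher(html);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String html = "Dear All, <br/>After <i>hours of trying</i> to sort the         "
            + "<strong> problem with <a href=\"xxxxxx.html\" >uploading pictures</a> </strong>"
            + "to this thing I decided...";
        // Prints 15: "Dear All, " is 10 characters and "<br/>" is 5.
        System.out.println(indexOf(html, "After hours of trying to sort the problem with uploading"));
    }
}
```

A trailing "..." in an excerpt should be stripped before calling compile, since it is quoted and matched literally like any other word.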
>
> I've tried the following, but it doesn't work:
>
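> import org.htmlparser.NodeFilter;
> import org.htmlparser.Parser;
> import org.htmlparser.filters.RegexFilter;
> import org.htmlparser.lexer.Lexer;
> import org.htmlparser.lexer.Page;
> import org.htmlparser.util.NodeIterator;
> import org.htmlparser.util.NodeList;
> import org.htmlparser.util.ParserException;
> import org.htmlparser.util.Translate;
>
> protected static String extractContent(String html, String searchText)
>         throws ParserException {
>     Page page = new Page(html);
>     Lexer lex = new Lexer(page);
>     Parser parser = new Parser(lex);
>     NodeList list = new NodeList();
>
>     // RegexFilter tests each text node on its own, so a sentence that is
>     // split across several nodes by inline tags cannot match here.
>     NodeFilter filter = new RegexFilter(searchText);
>     for (NodeIterator it = parser.elements(); it.hasMoreNodes();) {
>         it.nextNode().collectInto(list, filter);
>     }
>     if (list.size() > 0) {
>         System.out.println("text found " + list.size() + " times");
>         return Translate.decode(list.toHtml());
>     }
>     System.out.println("text not found");
>     return null;
> }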
>
>
> Thanks in advance
>
> Davide Taibi
> http://www.taibi.it
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>
>