Re: [Htmlparser-user] Regex Filter


Do you want to keep the tags? If not, just use the StringBean to extract all the text and then search for the string to get its position.
If you need to keep the tags, it is more difficult.
Someone else modified the StringBean to remember the node or offset of each piece of text added to the buffer.
After a straight string comparison on the text, this list of nodes or offsets can be used to work out the start and end nodes or offsets. From there you can extract the complete HTML.
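
To make the offset idea concrete, here is a rough sketch in plain Java. It is not the modified StringBean mentioned above and it does not use the HTML Parser library at all: it is a hand-rolled stand-in that skips anything between '<' and '>', collapses whitespace, and remembers the HTML offset of every text character it keeps. The names OffsetTextSearch, IndexedText, index and find are made up for this illustration. It assumes no '<' or '>' inside attribute values, does not decode entities, and does not skip script, style or comment content, so treat it as an outline of the technique only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HTML Parser.
final class OffsetTextSearch {

    /** Visible text plus, for each character of that text, its offset in the HTML. */
    static final class IndexedText {
        final StringBuilder text = new StringBuilder();
        final List<Integer> offsets = new ArrayList<>();
    }

    /** Strips tags, collapses whitespace runs to one space, records source offsets. */
    static IndexedText index(String html) {
        IndexedText out = new IndexedText();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') inTag = false;
                continue;
            }
            if (c == '<') {
                inTag = true;
                continue;
            }
            if (Character.isWhitespace(c)) {
                int n = out.text.length();
                // Drop leading whitespace and collapse runs, even across tags.
                if (n == 0 || out.text.charAt(n - 1) == ' ') continue;
                c = ' ';
            }
            out.text.append(c);
            out.offsets.add(i);
        }
        return out;
    }

    /** Returns {start, end} offsets into html (end exclusive), or null if absent. */
    static int[] find(String html, String sentence) {
        IndexedText indexed = index(html);
        String needle = sentence.trim().replaceAll("\\s+", " ");
        if (needle.isEmpty()) return null;
        int at = indexed.text.indexOf(needle);   // straight string comparison
        if (at < 0) return null;
        int start = indexed.offsets.get(at);
        int end = indexed.offsets.get(at + needle.length() - 1) + 1;
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        String html = "Dear All, <br/>After <i>hours of trying</i> to sort the         "
                + "<strong> problem with <a href=\"xxxxxx.html\" >uploading pictures</a> "
                + "</strong>to this thing I decided...";
        int[] span = find(html, "After hours of trying to sort the problem with uploading");
        if (span != null) {
            System.out.println("start offset: " + span[0]);
            System.out.println(html.substring(span[0], span[1]));
        }
    }
}
```

The substring between the two offsets keeps whatever tags fall inside the match, so it may contain unbalanced tags. If you need well-formed HTML, widen the range to the enclosing nodes.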

----- Original Message ----
From: Davide Taibi <da...@ta...>
To: htm...@li...
Sent: Saturday, May 10, 2008 2:30:40 PM
Subject: [Htmlparser-user] Regex Filter

Dear all, I have a problem with regular expressions.

I'd like to extract a block of text from an HTML page.

I know how the text starts (the first 10 words) but I don't know if there are any tags inside.

In other words, I have to find out whether a sentence "A" appears in an HTML page "B". My problem is that sentence "A" is written in plain text and the second one is HTML, so "A" could be nested in several nodes.

So the first sentence can appear in the second with some HTML tags or extra spaces between words:

Example:

sentence a: "After hours of trying to sort the problem with uploading..."
sentence b: "Dear All, <br/>After <i>hours of trying</i> to sort the         <strong> problem with <a href="xxxxxx.html" >uploading pictures</a> </strong>to this thing I decided..."

Sentence a should match b at position 15.

I've tried to do this but it doesn't work:

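import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.RegexFilter;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.Translate;

protected static String extractContent(String html, String searchText) throws ParserException {
    Page page = new Page(html);
    Lexer lex = new Lexer(page);
    Parser parser = new Parser(lex);
    NodeList list = new NodeList();

    // Collect every node accepted by the regex filter
    NodeFilter filter = new RegexFilter(searchText);
    for (NodeIterator it = parser.elements(); it.hasMoreNodes();) {
        it.nextNode().collectInto(list, filter);
    }
    if (list.size() > 0) {
        System.out.println("text found " + list.size() + " times");
        return Translate.decode(list.toHtml());
    } else {
        System.out.println("text not found");
        return null;
    }
}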

Thanks in advance

Davide Taibi
http://www.taibi.it