Thread: [Htmlparser-user] Regex Filter

Dear all, I have a problem with regular expressions.

I'd like to extract a block of text from an HTML page.

I know how the text starts (the first 10 words), but I don't know whether
there are any tags inside it.

In other words, I need to find out whether a sentence "A" appears in an HTML
page "B". The problem is that sentence "A" is plain text, while page "B" is
HTML, so the sentence could be spread across several nested nodes.

In page "B", the sentence may therefore be interrupted by HTML tags or by
extra whitespace between words:

Example:

sentence a: "After hours of trying to sort the problem with uploading..."
sentence b: "Dear All, <br/>After *<i>**hours* of trying</i> to sort
the         <strong> problem with <a href="xxxxxx.html" >uploading
pictures</a> </strong>to this thing I decided..."

Sentence a should match sentence b starting at position 15.
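
To make the expected behaviour concrete, here is a rough sketch in plain Java
(java.util.regex only, no HTMLParser). It is an illustration under
assumptions, not a finished solution: the names SentenceFinder, indexOf and
GAP are made up for this example, it assumes that tags never split a single
word, it ignores entities such as &nbsp;, and it uses a simplified copy of
sentence b without the stray asterisks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class SentenceFinder {
    // Between two words, accept any run of whitespace and/or HTML tags.
    private static final String GAP = "(?:\\s|<[^>]*>)+";

    /** Returns the offset in html where sentence starts, or -1 if absent. */
    static int indexOf(String html, String sentence) {
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return -1;
        }
        StringBuilder regex = new StringBuilder();
        for (String word : trimmed.split("\\s+")) {
            if (regex.length() > 0) {
                regex.append(GAP);
            }
            regex.append(Pattern.quote(word)); // match each word literally
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(html);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String html = "Dear All, <br/>After <i>hours of trying</i> to sort the"
                + "         <strong> problem with <a href=\"xxxxxx.html\" >"
                + "uploading pictures</a> </strong>to this thing I decided...";
        String sentence =
                "After hours of trying to sort the problem with uploading";
        System.out.println(indexOf(html, sentence)); // prints 15
    }
}
```

The returned offset points into the original HTML string, so the matched
block can be cut out of the page from that position.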

I've tried to do this with HTMLParser's RegexFilter, but it doesn't work:

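import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.RegexFilter;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.Translate;

// Collects every node accepted by a RegexFilter built from searchText and
// returns the decoded HTML of the matches, or null if nothing matched.
protected static String extractContent(String html, String searchText)
        throws ParserException {
    Page page = new Page(html);
    Lexer lex = new Lexer(page);
    Parser parser = new Parser(lex);
    NodeList list = new NodeList();

    NodeFilter filter = new RegexFilter(searchText);
    // Walk the top-level nodes; collectInto also visits their children.
    for (NodeIterator it = parser.elements(); it.hasMoreNodes();) {
        it.nextNode().collectInto(list, filter);
    }
    if (list.size() > 0) {
        System.out.println("text found " + list.size() + " times");
        return Translate.decode(list.toHtml());
    }
    System.out.println("text not found");
    return null;
}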

Thanks in advance

Davide Taibi
http://www.taibi.it
