Re: [Htmlparser-user] Can we extract links matching a substring in StringFilter or even better matc
From: eugene k. <ku...@ya...> - 2008-02-27 17:56:12
<table cellspacing='0' cellpadding='0' border='0' ><tr><td style='font: inherit;'>Use LinkTag. I added a LinkRegexFilter to narrow the results to only the links I am interested in.<br><br> NodeList nodeList = parser.extractAllNodesThatMatch(new AndFilter(<br> new NodeClassFilter(LinkTag.class),<br> new LinkRegexFilter("^http://www\\.google\\..*&amp;(q|adurl)=http://(www\\.)?.*\\.(com|ca|org|net)")));<br><br><br>--- On <b>Wed, 2/27/08, Kamdar, Devang (MLITS) <i><Dev...@ml...></i></b> wrote:<br><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px;">From: Kamdar, Devang (MLITS) <Dev...@ml...><br>Subject: [Htmlparser-user] Can we extract links matching a substring in StringFilter or even better matching a substring in the HasAttributeFilter?<br>To: htm...@li...<br>Date: Wednesday, February 27, 2008, 12:05 PM<br><br><div id="yiv448988319"><title>Can we extract links matching a substring in StringFilter or even better matching a substring in the HasAttributeFilter?</title> <p><font face="Arial" size="2">Hi,</font> <br><font face="Arial" size="2">I am trying to parse an HTML page and extract all the links that are A tags and have the substring "#Entry" in their href attribute.</font></p> <p><font face="Arial" size="2">E.g. 
Here is the html file</font> </p> <p><font face="Arial" size="2"><html></font> <br><font face="Arial" size="2"><body></font> <br><font face="Arial" size="2"><A href="#Entry1">1) First Line</A></font> <br><font face="Arial" size="2"><A href="#Entry2">2) Second Line</A></font> <br><font face="Arial" size="2"><A href="#Entry3">3) Third Line</A></font> <br><font face="Arial" size="2"><A href="#Entry4">4) Fourth Line</A></font> <br><font face="Arial" size="2"></body></font> <br><font face="Arial" size="2"></html></font> </p> <p><font face="Arial" size="2">I need to extract a list of the links that each Entry represents.</font> </p> <p><font face="Arial" size="2">I tried using a combination of TagNameFilter("A") and StringFilter(")") in an AndFilter as follows:</font> <br><font color="#000000" face="Courier New" size="2">NodeList list = parser.extractAllNodesThatMatch(</font><b><font color="#7f0055" face="Courier New" size="2">new</font></b><font color="#000000" face="Courier New" size="2"> AndFilter(</font><b><font color="#7f0055" face="Courier New" size="2">new</font></b><font color="#000000" face="Courier New" size="2"> TagNameFilter(</font><font color="#2a00ff" face="Courier New" size="2">"A"</font><font color="#000000" face="Courier New" size="2">), </font><b><font color="#7f0055" face="Courier New" size="2">new</font></b><font color="#000000" face="Courier New" size="2"> StringFilter(</font><font color="#2a00ff" face="Courier New" size="2">")"</font><font color="#000000" face="Courier New" size="2">)));</font> </p> <p><font face="Arial" size="2">But I think StringFilter searches for whole strings, i.e. an exact match, and not substrings,</font> <br><font face="Arial" size="2">i.e. 
partly matching ) in </font> <br><font face="Arial" size="2"><A href="#Entry1">1) First Line</A></font> <br><font face="Arial" size="2">link for example.</font> </p> <p><font face="Arial" size="2">Or is there a filter where I can use the content of the attribute like #Entry1 or even better a substring of the content #Entry to filter out the tags?</font></p> <p><font face="Arial" size="2">Is there a way I can achieve what I am trying to do here?</font> </p> <p><font face="Arial" size="2">Any help is much appreciated.</font> </p> <p><font face="Arial" size="2">Thanks</font> <br><font face="Arial" size="2">Devang Kamdar</font> </p> <br> <br> </div><pre>_______________________________________________<br>Htmlparser-user mailing list<br>Htm...@li...<br>https://lists.sourceforge.net/lists/listinfo/htmlparser-user</pre></blockquote></td></tr></table>
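For readers who want to see the substring-matching idea from the question in isolation, here is a minimal plain-JDK sketch. It is not HTMLParser code: the class name `HrefSubstringDemo` and the method `hrefsContaining` are hypothetical, and a regular expression stands in for a real HTML parser, so it only recognizes double-quoted href values on A tags. It keeps each href that contains the given substring, such as "#Entry".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-JDK stand-in (assumption: NOT the HTMLParser API).
// Collects href values from <a> tags and keeps those containing a substring.
class HrefSubstringDemo {

    // Matches an opening <a ...> tag (any letter case) and captures a
    // double-quoted href value. Single-quoted or unquoted values are ignored.
    private static final Pattern ANCHOR_HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns, in document order, every captured href that contains needle.
    static List<String> hrefsContaining(String html, String needle) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR_HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            if (href.contains(needle)) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample page modelled on the one in the question, plus one
        // unrelated link that should be filtered out.
        String html = "<html><body>"
                + "<A href=\"#Entry1\">1) First Line</A>"
                + "<A href=\"#Entry2\">2) Second Line</A>"
                + "<A href=\"#Entry3\">3) Third Line</A>"
                + "<A href=\"#Entry4\">4) Fourth Line</A>"
                + "<A href=\"http://example.com/\">Other</A>"
                + "</body></html>";
        System.out.println(hrefsContaining(html, "#Entry"));
    }
}
```

With the library itself, the approach in the reply (a NodeClassFilter on LinkTag combined with a link filter) is the better fit, since it works on parsed nodes rather than raw text.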