Re: [Htmlparser-user] Best way to extract all the links from a HTML page
From: Stanislav O. <orl...@gm...> - 2010-10-12 21:13:38
Hi,

You may try to use filters (org.htmlparser.filters). This way you'll get all the link tags from the page:

    Parser parser = parserMain.getParser(parseURL);
    NodeList links = null;
    try {
        links = parser.parse(new TagNameFilter("a"));
    } catch (ParserException ex) {
        logger.error(null, ex);
    }
    for (SimpleNodeIterator sni = links.elements(); sni.hasMoreNodes();) {
        Node node = sni.nextNode();
        if (node instanceof LinkTag) {
            LinkTag lt = (LinkTag) node;
            // link text - lt.getLinkText()
            // link href - lt.getLink()
        }
    }

On Tue, 2010-10-12 at 17:50 -0300, Santiago Basulto wrote:
> Hello people.
>
> I'm starting with HTMLParser. It seems like a great library. I've been
> doing some benchmarking and it runs really fast.
>
> Now I'm trying to improve it a little bit.
>
> In my software, I use something like this to extract all the links:
>
> public class LinkVisitor extends NodeVisitor {
>     private Set<String> links = new HashSet<String>(100);
>
>     public LinkVisitor() {
>     }
>
>     public void visitTag(Tag tag) {
>         String name = tag.getTagName();
>         if ("a".equalsIgnoreCase(name)) {
>             String hrefValue = tag.getAttribute("href");
>             links.add(hrefValue);
>         }
>     }
>
>     public Set<String> getLinks() {
>         return this.links;
>     }
> }
>
> But, reading a little bit, I found other classes that may help, but I
> don't know how to use them. Can anyone help me out?
>
> The idea is to extract all the links from a String (that contains an
> HTML page already read from a URLConnection). Is there any way to
> "canonize" them? I mean, if the href says "/food/fruits/2", convert it
> to "http://www.foodsite.com/home/fruits/2"?
>
> Thanks a lot!
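Regarding the "canonize" question: HTMLParser aside, one plain-JDK way to turn a relative href into an absolute URL is to resolve it against the URL of the page it came from using java.net.URL. Here is a minimal sketch (the class name LinkCanonicalizer, the resolve() helper, and the example base URL are just placeholders for illustration, not part of HTMLParser):

    import java.net.MalformedURLException;
    import java.net.URL;

    public class LinkCanonicalizer {

        // Resolve a possibly relative href against the URL of the page it was found on.
        // e.g. resolve("http://www.foodsite.com/home/", "/food/fruits/2")
        //      -> "http://www.foodsite.com/food/fruits/2"
        public static String resolve(String pageUrl, String href) throws MalformedURLException {
            URL base = new URL(pageUrl);         // the page the link came from
            URL absolute = new URL(base, href);  // standard relative-reference resolution
            return absolute.toExternalForm();
        }
    }

You could then call something like resolve(parseURL, lt.getLink()) from the loop above, or apply it to hrefValue in your visitor, before adding each link to your set.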