[Htmlparser-user] Best way to extract all the links from an HTML page
From: Santiago B. <san...@gm...> - 2010-10-12 20:50:40
Hello people. I'm starting with HTMLParser. It seems like a great library; I've done some benchmarking and it runs really fast. Now I'm trying to improve my code a little. In my software, I use something like this to extract all the links:

    public class LinkVisitor extends NodeVisitor {

        private Set<String> links = new HashSet<String>(100);

        public LinkVisitor() {
        }

        public void visitTag(Tag tag) {
            String name = tag.getTagName();
            if ("a".equalsIgnoreCase(name)) {
                String hrefValue = tag.getAttribute("href");
                if (hrefValue != null) {
                    links.add(hrefValue);
                }
            }
        }

        public Set<String> getLinks() {
            return this.links;
        }
    }

But, reading around a little, I found other classes that might help, though I don't know how to use them. Can anyone help me out? The idea is to extract all the links from a String (one that contains an HTML page already read from a URLConnection). Also, is there any way to "canonicalize" the links? I mean, if the href says "/food/fruits/2", convert it to an absolute URL like "http://www.foodsite.com/food/fruits/2"?

Thanks a lot!

--
Santiago Basulto.-
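For the canonicalization part, this isn't HTMLParser-specific, but plain java.net.URL can resolve a relative href against the page's base URL via the two-argument URL(URL context, String spec) constructor. A minimal sketch (the class and method names here are my own, not from the library):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlResolver {

    // Resolve a possibly-relative href (e.g. "/food/fruits/2" or "fruits/2")
    // against the base URL of the page the link was found on.
    static String canonicalize(String base, String href) throws MalformedURLException {
        URL context = new URL(base);
        return new URL(context, href).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // An absolute-path href replaces the base URL's path entirely.
        System.out.println(canonicalize("http://www.foodsite.com/home/index.html", "/food/fruits/2"));
        // prints http://www.foodsite.com/food/fruits/2

        // A relative href is resolved against the base URL's directory.
        System.out.println(canonicalize("http://www.foodsite.com/home/", "fruits/2"));
        // prints http://www.foodsite.com/home/fruits/2
    }
}
```

You would call canonicalize once per extracted href, passing the URL you fetched the page from as the base. If I remember right, HTMLParser also ships a LinkBean (in org.htmlparser.beans) that extracts links from a page for you, so it may be worth checking the javadocs before rolling your own.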