[Htmlparser-user] Best way to extract all the links from an HTML page
From: Santiago B. <san...@gm...> - 2010-10-12 20:50:40
Hello people. I'm starting with HTMLParser. It seems like a great library; I've done some benchmarking and it runs really fast. Now I'm trying to improve my code a little. In my software, I use something like this to extract all the links:

    public class LinkVisitor extends NodeVisitor {

        private Set<String> links = new HashSet<String>(100);

        public LinkVisitor() {
        }

        public void visitTag(Tag tag) {
            String name = tag.getTagName();
            if ("a".equalsIgnoreCase(name)) {
                String hrefValue = tag.getAttribute("href");
                if (hrefValue != null) {
                    links.add(hrefValue);
                }
            }
        }

        public Set<String> getLinks() {
            return this.links;
        }
    }

But, reading around a little, I found other classes that might help, though I don't know how to use them. Can anyone help me out? The idea is to extract all the links from a String (one that contains an HTML page already read from a URLConnection). Also, is there any way to "canonicalize" the links? I mean, if the href says "/food/fruits/2", convert it to an absolute URL like "http://www.foodsite.com/food/fruits/2"?

Thanks a lot!

--
Santiago Basulto.-
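For the canonicalization part, this isn't HTMLParser-specific, but plain java.net.URL can resolve a relative href against the page's base URL via the two-argument URL(URL context, String spec) constructor. A minimal sketch (the class and method names here are my own, not from the library):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlResolver {

    // Resolve a possibly-relative href (e.g. "/food/fruits/2" or "fruits/2")
    // against the base URL of the page the link was found on.
    static String canonicalize(String base, String href) throws MalformedURLException {
        URL context = new URL(base);
        return new URL(context, href).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // An absolute-path href replaces the base URL's path entirely.
        System.out.println(canonicalize("http://www.foodsite.com/home/index.html", "/food/fruits/2"));
        // prints http://www.foodsite.com/food/fruits/2

        // A relative href is resolved against the base URL's directory.
        System.out.println(canonicalize("http://www.foodsite.com/home/", "fruits/2"));
        // prints http://www.foodsite.com/home/fruits/2
    }
}
```

You would call canonicalize once per extracted href, passing the URL you fetched the page from as the base. If I remember right, HTMLParser also ships a LinkBean (in org.htmlparser.beans) that extracts links from a page for you, so it may be worth checking the javadocs before rolling your own.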