Re: [Htmlparser-user] Best way to extract all the links from a HTML page
From: Stanislav O. <orl...@gm...> - 2010-10-12 21:13:38
Hi,

You may try to use filters (org.htmlparser.filters). This way you'll get all the link tags from the page:

    Parser parser = parserMain.getParser(parseURL);
    NodeList links = null;
    try {
        links = parser.parse(new TagNameFilter("a"));
    } catch (ParserException ex) {
        logger.error(null, ex);
    }
    for (SimpleNodeIterator sni = links.elements(); sni.hasMoreNodes();) {
        Node node = sni.nextNode();
        if (node instanceof LinkTag) {
            LinkTag lt = (LinkTag) node;
            // link text - lt.getLinkText()
            // link href - lt.getLink()
        }
    }

On Tue, 2010-10-12 at 17:50 -0300, Santiago Basulto wrote:
> Hello people.
>
> I'm starting with HTMLParser. It seems like a great library. I've been
> doing some benchmarking and it runs really fast.
>
> Now I'm trying to improve it a little bit.
>
> In my software, I use something like this to extract all the links:
>
> public class LinkVisitor extends NodeVisitor {
>     private Set<String> links = new HashSet<String>(100);
>
>     public LinkVisitor() {
>     }
>
>     public void visitTag(Tag tag) {
>         String name = tag.getTagName();
>         if ("a".equalsIgnoreCase(name)) {
>             String hrefValue = tag.getAttribute("href");
>             links.add(hrefValue);
>         }
>     }
>
>     public Set<String> getLinks() {
>         return this.links;
>     }
> }
>
> But, reading a little bit, I found other classes that may help, but I
> don't know how to use them. Can anyone help me out?
>
> The idea is to extract all the links from a String (that contains an
> HTML page already read from a URLConnection). Is there any way to
> "canonize" them? I mean, if the href says "/food/fruits/2", convert it
> to "http://www.foodsite.com/home/fruits/2"?
>
> Thanks a lot!
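Regarding the "canonize" question: HTMLParser aside, one plain-JDK way to turn a relative href into an absolute URL is to resolve it against the URL of the page it came from using java.net.URL. Here is a minimal sketch (the class name LinkCanonicalizer, the resolve() helper, and the example base URL are just placeholders for illustration, not part of HTMLParser):

    import java.net.MalformedURLException;
    import java.net.URL;

    public class LinkCanonicalizer {

        // Resolve a possibly relative href against the URL of the page it was found on.
        // e.g. resolve("http://www.foodsite.com/home/", "/food/fruits/2")
        //      -> "http://www.foodsite.com/food/fruits/2"
        public static String resolve(String pageUrl, String href) throws MalformedURLException {
            URL base = new URL(pageUrl);         // the page the link came from
            URL absolute = new URL(base, href);  // standard relative-reference resolution
            return absolute.toExternalForm();
        }
    }

You could then call something like resolve(parseURL, lt.getLink()) from the loop above, or apply it to hrefValue in your visitor, before adding each link to your set.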