Re: [Htmlparser-user] Best way to extract all the links from a HTML page
From: Stanislav O. <orl...@gm...> - 2010-10-12 21:13:38
Hi
You may try to use filters (org.htmlparser.filters). This way you'll
get all the link tags from the page:
Parser parser = parserMain.getParser(parseURL);
NodeList links = null;
try {
    links = parser.parse(new TagNameFilter("a"));
} catch (ParserException ex) {
    logger.error(null, ex);
}
if (links != null) {
    for (SimpleNodeIterator sni = links.elements(); sni.hasMoreNodes();) {
        Node node = sni.nextNode();
        if (node instanceof LinkTag) {
            LinkTag lt = (LinkTag) node;
            // link text - lt.getLinkText()
            // link href - lt.getLink()
        }
    }
}
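As for canonicalizing relative hrefs: if the parser was created from a URL, LinkTag.getLink() usually already resolves links against the page base. But if you parse from a raw String, you can resolve them yourself with plain java.net.URL. A minimal sketch (the class name and the base URL are just illustrative, not part of the library):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkCanonicalizer {

    // Resolve a possibly-relative href against the page's base URL.
    // Returns null when the combination cannot be parsed as a URL.
    public static String canonicalize(String baseUrl, String href) {
        try {
            return new URL(new URL(baseUrl), href).toString();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String base = "http://www.foodsite.com/home/index.html";
        // Absolute path: replaces the whole path, keeps the host.
        System.out.println(canonicalize(base, "/food/fruits/2"));
        // -> http://www.foodsite.com/food/fruits/2
        // Relative path: resolved against the /home/ directory.
        System.out.println(canonicalize(base, "fruits/2"));
        // -> http://www.foodsite.com/home/fruits/2
    }
}
```

java.net.URL handles "..", query strings and fragments for you, so it is safer than concatenating strings by hand.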
On Tue, 2010-10-12 at 17:50 -0300, Santiago Basulto wrote:
> Hello people.
>
> I'm starting with HTMLParser. It seems like a great library. I've done
> some benchmarking and it runs really fast.
>
> Now I'm trying to improve it a little bit.
>
> In my software, I use something like this to extract all links:
>
> public class LinkVisitor extends NodeVisitor {
>     private Set<String> links = new HashSet<String>(100);
>
>     public LinkVisitor() {
>     }
>
>     public void visitTag(Tag tag) {
>         String name = tag.getTagName();
>         if ("a".equalsIgnoreCase(name)) {
>             String hrefValue = tag.getAttribute("href");
>             if (hrefValue != null) {
>                 links.add(hrefValue);
>             }
>         }
>     }
>
>     public Set<String> getLinks() {
>         return this.links;
>     }
> }
>
> But, reading a little bit i found other classes that may help, but
> don't know how to use them. Can anyone help me out?
>
> The idea is to extract all the links from a String (that contains an
> HTML page already read from a URLConnection). Is there any way to
> "canonize" them? I mean, if the href says "/food/fruits/2", convert it
> to "http://www.foodsite.com/food/fruits/2"?
>
>
> Thanks a lot!
>