Re: [Htmlparser-user] Best way to extract all the links from a HTML page
From: Derrick O. <der...@gm...> - 2010-10-13 05:33:41
If you set the document base href on the page (see how BaseHrefTag handles it in doSemanticAction, basically page.setBaseUrl (base)), then the links you get back can be 'canonized', as you call it, by using the page's getAbsoluteURL (String link, boolean strict) method.

On Tue, Oct 12, 2010 at 10:50 PM, Santiago Basulto <san...@gm...> wrote:
> Hello people.
>
> I'm starting with HTMLParser. It seems like a great library. I've been
> doing some benchmarking and it runs really fast.
>
> Now I'm trying to improve my code a little bit.
>
> In my software, I use something like this to extract all links:
>
> import java.util.HashSet;
> import java.util.Set;
> import org.htmlparser.Tag;
> import org.htmlparser.visitors.NodeVisitor;
>
> public class LinkVisitor extends NodeVisitor {
>     private Set<String> links = new HashSet<String>(100);
>
>     public LinkVisitor() {
>     }
>
>     // Collect the raw href value of every <a> tag visited.
>     public void visitTag(Tag tag) {
>         String name = tag.getTagName();
>         if ("a".equalsIgnoreCase(name)) {
>             String hrefValue = tag.getAttribute("href");
>             if (hrefValue != null) {
>                 links.add(hrefValue);
>             }
>         }
>     }
>
>     public Set<String> getLinks() {
>         return this.links;
>     }
> }
>
> But, reading a little bit, I found other classes that may help, and I
> don't know how to use them. Can anyone help me out?
>
> The idea is to extract all the links from a String (that contains an
> HTML page already read from a URLConnection). Is there any way to
> "canonize" them? I mean, if the href says "/food/fruits/2", convert it
> to "http://www.foodsite.com/home/fruits/2"?
>
> Thanks a lot!
>
> --
> Santiago Basulto.-
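
To make the resolution step concrete, here is a minimal sketch of what turning an href into an absolute URL against a base looks like. It does not use HTMLParser at all: it relies on java.net.URI from the JDK as a stand-in for the getAbsoluteURL call mentioned above, so HTMLParser's own behavior (for example its strict flag) may differ. The class name LinkResolver, its methods, and the base URL "http://www.foodsite.com/home/index.html" are assumptions made up for the illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of HTMLParser: resolves hrefs against a base URL
// using only the JDK.
final class LinkResolver {

    // Returns the absolute form of href relative to baseUrl,
    // or null when href is missing or is not a syntactically valid URI.
    static String resolve(String baseUrl, String href) {
        if (href == null) {
            return null;
        }
        try {
            return URI.create(baseUrl).resolve(href.trim()).normalize().toString();
        } catch (IllegalArgumentException e) {
            return null; // malformed href (e.g. contains a raw space)
        }
    }

    // Resolves every href, dropping unusable ones and duplicates,
    // while keeping first-seen order.
    static Set<String> resolveAll(String baseUrl, List<String> hrefs) {
        Set<String> absolute = new LinkedHashSet<String>();
        for (String href : hrefs) {
            String resolved = resolve(baseUrl, href);
            if (resolved != null) {
                absolute.add(resolved);
            }
        }
        return absolute;
    }

    public static void main(String[] args) {
        String base = "http://www.foodsite.com/home/index.html"; // assumed page URL
        List<String> hrefs = Arrays.asList("/food/fruits/2", "fruits/2");
        for (String link : resolveAll(base, hrefs)) {
            System.out.println(link);
        }
        // prints:
        // http://www.foodsite.com/food/fruits/2
        // http://www.foodsite.com/home/fruits/2
    }
}
```

With this kind of resolution, a root-relative href such as "/food/fruits/2" is resolved from the host root, while "fruits/2" is resolved relative to the directory of the base document. In the visitor from the quoted message, the raw href values collected in visitTag would be passed through a step like this (or through the page's getAbsoluteURL) before being added to the set.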