Re: [Htmlparser-user] how to improve link extraction speed
From: Derrick O. <Der...@Ro...> - 2006-03-21 12:24:28
Wen,

I'm not sure it would be faster, but... If you don't care about nesting
or other types of nodes, you can supply the LinkTag as the only
prototype for the node factory:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory (new LinkTag ());
    Parser parser = new Parser ();
    parser.setNodeFactory (factory);
    NodeFilter filter = new NodeClassFilter (LinkTag.class);
    for (20 documents)
    {
        parser.setURL (url);
        NodeList links = parser.extractAllNodesThatMatch (filter);
        for (int in = 0; in < links.size (); in++)
            ...
    }

In this way there will be no attempt at nesting the tags, so it should
be faster. You also don't need to allocate the parser and filter within
your loop.

Derrick

Wen wrote:
> Hi,
>
> I'm using HTMLParser to extract links that point to a specific file
> type, e.g. PDF files. It works fine, but it takes around 20 seconds
> to parse 20 websites. I noticed that besides NodeFilter, a
> LinkExtractor or LinkRegexFilter may be able to achieve the same goal.
>
> Are there other ways to make the extraction faster than the way I'm
> using now? Here is my code:
>
>     for (20 documents) {
>         parser = new Parser (url);
>         NodeFilter filter = new NodeClassFilter (LinkTag.class);
>         NodeList links = new NodeList ();
>
>         for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
>             e.nextNode ().collectInto (links, filter);
>         for (int in = 0; in < links.size (); in++)
>         {
>             LinkTag linkTag = (LinkTag) links.elementAt (in);
>             if (linkTag.getLink ().endsWith (".PDF")) {
>                 doSomething;
>             }
>         }
>     }
>
> Thanks in advance.
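
[Editor's note: for reference, below is a minimal, self-contained sketch
of the approach Derrick describes, using the standard HTML Parser API
(Parser, PrototypicalNodeFactory, NodeClassFilter, LinkTag). The URL
list is a placeholder, and printing the matching link stands in for
Wen's "doSomething". It also uses a case-insensitive match, since
endsWith(".PDF") alone would miss lowercase ".pdf" links.]

    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.PrototypicalNodeFactory;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class PdfLinkExtractor
    {
        public static void main (String[] args) throws ParserException
        {
            // Placeholder list of documents to scan; substitute your own URLs.
            String[] urls =
            {
                "http://example.com/page1.html",
                "http://example.com/page2.html",
            };

            // Register LinkTag as the only prototype, so the parser builds
            // specialized nodes for links only and skips nesting other tags.
            PrototypicalNodeFactory factory = new PrototypicalNodeFactory (new LinkTag ());
            Parser parser = new Parser ();
            parser.setNodeFactory (factory);

            // Allocate the parser and filter once, outside the loop.
            NodeFilter filter = new NodeClassFilter (LinkTag.class);

            for (int i = 0; i < urls.length; i++)
            {
                parser.setURL (urls[i]);
                NodeList links = parser.extractAllNodesThatMatch (filter);
                for (int j = 0; j < links.size (); j++)
                {
                    LinkTag link = (LinkTag) links.elementAt (j);
                    // Case-insensitive check catches ".pdf" and ".PDF".
                    if (link.getLink ().toLowerCase ().endsWith (".pdf"))
                        System.out.println (link.getLink ());
                }
            }
        }
    }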