Re: [Htmlparser-user] how to improve link extraction speed
From: Derrick O. <Der...@Ro...> - 2006-03-21 12:24:28
Wen,

I'm not sure it would be faster, but... If you don't care about nesting
or other types of nodes, you can supply the LinkTag as the only
prototype for the node factory:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory (new LinkTag ());
    Parser parser = new Parser ();
    parser.setNodeFactory (factory);
    NodeFilter filter = new NodeClassFilter (LinkTag.class);
    for (20 documents)
    {
        parser.setURL (url);
        NodeList links = parser.extractAllNodesThatMatch (filter);
        for (int in = 0; in < links.size (); in++)
            ...
    }

In this way there will be no attempt at nesting the tags, so it should
be faster. You also don't need to allocate the parser and filter within
your loop.

Derrick

Wen wrote:
> Hi,
>
> I'm using HTMLParser to extract links that point to a specific file
> type, e.g. PDF files. It works fine, but it takes around 20 seconds
> to parse 20 websites. I noticed that besides NodeFilter, a
> LinkExtractor or LinkRegexFilter may be able to achieve the same goal.
>
> Are there other ways to make the extraction faster than the way I'm
> using now? Here is my code:
>
>     for (20 documents) {
>         parser = new Parser (url);
>         NodeFilter filter = new NodeClassFilter (LinkTag.class);
>         NodeList links = new NodeList ();
>
>         for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
>             e.nextNode ().collectInto (links, filter);
>         for (int in = 0; in < links.size (); in++)
>         {
>             LinkTag linkTag = (LinkTag) links.elementAt (in);
>             if (linkTag.getLink ().endsWith (".PDF")) {
>                 doSomething;
>             }
>         }
>     }
>
> Thanks in advance.
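
[Editor's note: for reference, below is a minimal, self-contained sketch
of the approach Derrick describes, using the standard HTML Parser API
(Parser, PrototypicalNodeFactory, NodeClassFilter, LinkTag). The URL
list is a placeholder, and printing the matching link stands in for
Wen's "doSomething". It also uses a case-insensitive match, since
endsWith(".PDF") alone would miss lowercase ".pdf" links.]

    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.PrototypicalNodeFactory;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class PdfLinkExtractor
    {
        public static void main (String[] args) throws ParserException
        {
            // Placeholder list of documents to scan; substitute your own URLs.
            String[] urls =
            {
                "http://example.com/page1.html",
                "http://example.com/page2.html",
            };

            // Register LinkTag as the only prototype, so the parser builds
            // specialized nodes for links only and skips nesting other tags.
            PrototypicalNodeFactory factory = new PrototypicalNodeFactory (new LinkTag ());
            Parser parser = new Parser ();
            parser.setNodeFactory (factory);

            // Allocate the parser and filter once, outside the loop.
            NodeFilter filter = new NodeClassFilter (LinkTag.class);

            for (int i = 0; i < urls.length; i++)
            {
                parser.setURL (urls[i]);
                NodeList links = parser.extractAllNodesThatMatch (filter);
                for (int j = 0; j < links.size (); j++)
                {
                    LinkTag link = (LinkTag) links.elementAt (j);
                    // Case-insensitive check catches ".pdf" and ".PDF".
                    if (link.getLink ().toLowerCase ().endsWith (".pdf"))
                        System.out.println (link.getLink ());
                }
            }
        }
    }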