Thread: [Htmlparser-user] how to improve link extraction speed

[Htmlparser-user] how to improve link extraction speed

From: Wen <log...@gm...> - 2006-03-21 04:27:16

Hi,

I'm using HTMLParser to extract links that point to a specific file type,
e.g. PDF files.
It works fine, but it takes around 20 seconds to parse 20 websites.
I noticed that, besides NodeFilter, LinkExtractor or LinkRegexFilter may be
able to achieve the same goal.

Are there other ways to make the extraction process faster than the way I'm
using now?

Here is my code:
for( 20 documents){
            // a new parser and filter are created for every document
            parser = new Parser(url);
            NodeFilter filter = new NodeClassFilter (LinkTag.class);
            NodeList links = new NodeList ();

            // walk the whole node tree and collect every LinkTag
            for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
                e.nextNode ().collectInto (links, filter);
            for (int in = 0; in < links.size (); in++)
            {
                LinkTag linkTag = (LinkTag)links.elementAt (in);
                if(linkTag.getLink().endsWith(".PDF")){
                        doSomething;
                }
            }
}

Thanks in advance.
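
A side note on the link test in the code above: endsWith(".PDF") is case-sensitive, so it would skip links ending in ".pdf", and it would also miss links that carry a query string or fragment after the file name. A minimal sketch of a more tolerant check follows. PdfLinks and isPdfLink are hypothetical names, not part of HTMLParser, and the sample URLs are made up for illustration.

```java
import java.util.Locale;

/** Hypothetical helper (not part of HTMLParser): decides whether a link points at a PDF. */
final class PdfLinks {
    private PdfLinks() {}

    /**
     * Returns true if the link ends in ".pdf" once any fragment ("#...")
     * and query string ("?...") are removed. The comparison ignores case.
     */
    static boolean isPdfLink(String link) {
        if (link == null) {
            return false;
        }
        String path = link;
        int cut = path.indexOf('#');          // drop the fragment first
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('?');              // then drop the query string
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        return path.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    public static void main(String[] args) {
        // Made-up sample links, for illustration only.
        String[] samples = {
            "http://example.com/report.PDF",
            "http://example.com/report.pdf?download=1",
            "http://example.com/index.html"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isPdfLink(s));
        }
    }
}
```

Inside the loop, the condition would then read if (PdfLinks.isPdfLink (linkTag.getLink ())), assuming the helper is on the classpath.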

Re: [Htmlparser-user] how to improve link extraction speed

From: Derrick O. <Der...@Ro...> - 2006-03-21 12:24:28

Wen,

I'm not sure it would be faster but...
If you don't care about nesting or other types of nodes, you can supply 
the LinkTag as the only prototype for the node factory:

  PrototypicalNodeFactory factory = new PrototypicalNodeFactory (new LinkTag ());
  Parser parser = new Parser ();
  parser.setNodeFactory (factory);
  NodeFilter filter = new NodeClassFilter (LinkTag.class);
  for (20 documents)
  {
    parser.setURL (url);
    NodeList links = parser.extractAllNodesThatMatch (filter);
    for (int in = 0; in < links.size (); in++)
      ...

In this way there will be no attempt at nesting the tags, so it should 
be faster.
You also don't need to allocate a parser and filter within your loop.

Derrick
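
Since the reply above is hedged ("I'm not sure it would be faster"), one way to check is to time both variants on the same set of documents. The following is a minimal sketch of such a harness using only the JDK. ExtractionTimer and its methods are hypothetical names, and the task in main is a placeholder where the real extraction loop would go.

```java
import java.util.concurrent.Callable;

/** Hypothetical timing helper (not part of HTMLParser). */
final class ExtractionTimer {
    private ExtractionTimer() {}

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Callable<?> task) throws Exception {
        long start = System.nanoTime();
        task.call();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    /** Runs the task the given number of times and returns the average time per run in milliseconds. */
    static double averageMillis(Callable<?> task, int runs) throws Exception {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.call();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder task: replace the body with the extraction loop being measured.
        Callable<Void> placeholder = () -> {
            Thread.sleep(5);
            return null;
        };
        System.out.println("single run (ms): " + timeMillis(placeholder));
        System.out.println("average of 10 runs (ms): " + averageMillis(placeholder, 10));
    }
}
```

Because each document is fetched over the network, timings may vary a lot between runs, so averaging several runs of each variant gives a fairer comparison than a single measurement.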


Re: [Htmlparser-user] how to improve link extraction speed

From: Wen <log...@ya...> - 2006-03-22 07:59:51

Hi Derrick,

Thank you for your reply. It does improve the speed. Thanks a lot.

Wen
