Thread: [Htmlparser-user] how to improve link extraction speed
From: Wen <log...@gm...> - 2006-03-21 04:27:16
|
Hi,

I'm using HTMLParser to extract links that point to a specific file type, e.g. PDF files.
It works fine, but it takes around 20 seconds to parse 20 websites.
I noticed that, besides NodeFilter, LinkExtractor or LinkRegexFilter may be able to achieve the same goal.

Are there other ways to make the extraction process faster than the way I'm using now?

Here is my code:

for( 20 documents){
    parser = new Parser(url);
    NodeFilter filter = new NodeClassFilter (LinkTag.class);
    NodeList links = new NodeList ();

    for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
        e.nextNode ().collectInto (links, filter);
    for (int in = 0; in < links.size (); in++)
    {
        LinkTag linkTag = (LinkTag)links.elementAt (in);
        if(linkTag.getLink().endsWith(".PDF")){
            doSomething;
        }
    }
}

Thanks in advance.
|
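A side note on the endsWith(".PDF") check in the loop above: String.endsWith is case-sensitive, so a link ending in ".pdf", or one followed by a query string or fragment, would be skipped. The helper below is a minimal sketch of a more tolerant check using only the Java standard library. The names PdfLinks and isPdfLink are hypothetical and not part of HTMLParser, and the rule "ignore everything after the first ? or #" is an assumption about what should count as a PDF link.

```java
import java.util.Locale;

// Hypothetical helper, not part of HTMLParser.
final class PdfLinks {
    private PdfLinks() {}

    // Returns true if the link, ignoring any query string or fragment,
    // ends in ".pdf" in any letter case.
    static boolean isPdfLink(String link) {
        if (link == null) {
            return false;
        }
        String s = link.trim();
        // Cut the link at the first '?' or '#', whichever comes first.
        int cut = s.length();
        int query = s.indexOf('?');
        int fragment = s.indexOf('#');
        if (query >= 0) {
            cut = Math.min(cut, query);
        }
        if (fragment >= 0) {
            cut = Math.min(cut, fragment);
        }
        // Locale.ROOT keeps lower-casing independent of the default locale.
        return s.substring(0, cut).toLowerCase(Locale.ROOT).endsWith(".pdf");
    }
}
```

In the loop, the condition would then read PdfLinks.isPdfLink(linkTag.getLink()) instead of the endsWith call.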
From: Derrick O. <Der...@Ro...> - 2006-03-21 12:24:28
|
Wen,

I'm not sure it would be faster, but...
If you don't care about nesting or other types of nodes, you can supply the LinkTag as the only prototype for the node factory:

PrototypicalNodeFactory factory = new PrototypicalNodeFactory (new LinkTag ());
Parser parser = new Parser ();
parser.setNodeFactory (factory);
NodeFilter filter = new NodeClassFilter (LinkTag.class);
for (20 documents)
{
    parser.setURL (url);
    NodeList links = parser.extractAllNodesThatMatch (filter);
    for (int in = 0; in < links.size (); in++)
        ...

In this way there will be no attempt at nesting the tags, so it should be faster.
You also don't need to allocate a parser and filter within your loop.

Derrick
|
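Since the reply above is hedged ("I'm not sure it would be faster"), one way to check is to time each document separately, before and after the change. The sketch below uses only the Java standard library. StageTimer, Timed, and time are hypothetical names, not part of HTMLParser, and the task passed in stands for whatever per-document work is being measured.

```java
import java.util.function.Supplier;

// Hypothetical timing helper, not part of HTMLParser.
final class StageTimer {
    private StageTimer() {}

    // Pairs a task's result with the wall-clock time it took.
    static final class Timed<T> {
        final T value;
        final long nanos;

        Timed(T value, long nanos) {
            this.value = value;
            this.nanos = nanos;
        }

        double millis() {
            return nanos / 1_000_000.0;
        }
    }

    // Runs the task once and measures elapsed time with System.nanoTime,
    // which is monotonic and intended for measuring intervals.
    static <T> Timed<T> time(Supplier<T> task) {
        long start = System.nanoTime();
        T value = task.get();
        return new Timed<>(value, System.nanoTime() - start);
    }
}
```

Because parser.setURL and extractAllNodesThatMatch declare the checked ParserException, the lambda passed to time would need to catch it or rethrow it as an unchecked exception. A single run per URL also includes network latency, so comparing several runs gives a fairer picture than one measurement.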
From: Wen <log...@ya...> - 2006-03-22 07:59:51
|
Hi Derrick,

Thank you for your reply. It does improve the speed.
Thanks a lot.

wen
|