htmlparser-user Mailing List for HTML Parser (Page 80)
Brought to you by:
derrickoswald
From: Rich W. <ri...@wi...> - 2003-04-02 22:05:56
|
What is needed is cookie-jar functionality: something that will send and accept cookies when making requests. There is no other way around it. Many sites use cookies to deter spidering.

rw

----- Original Message -----
From: "Navid H.Langaroudi" <na...@ya...>
To: <htm...@li...>
Sent: Wednesday, April 02, 2003 3:45 PM
Subject: Re: [Htmlparser-user] Integration Release 1.3-20030330 is out

> There is this site which uses Microsoft Commerce Server
> 2000, and attaches the cookie to the URL if the request is
> not from a browser [...]
>
> And once I try to access the same page with the same URL,
> every time I get a different page!
>
> Can anybody tell me why this is so, and how I can
> change my Java program to avoid it, or receive the
> correct page?
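The cookie-jar behaviour described in this reply can be sketched with the JDK's own `java.net.CookieManager`. Note the assumptions: this class was added in Java 6, so it was not available when this thread was written, and the host `www.example.com`, the shortened `MSCSProfile` value, and the class and method names below are illustrative only. The sketch stores a cookie from a simulated response and asks the jar what it would send back on the next request, without touching the network.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; CookieJarSketch and roundTrip are hypothetical names.
final class CookieJarSketch {

    /**
     * Feeds one Set-Cookie header value into a fresh cookie jar and returns
     * the Cookie header value the jar would attach to a later request to the
     * same URI (empty string if the jar would send nothing).
     */
    static String roundTrip(URI uri, String setCookieValue) {
        // In-memory store, accepting cookies only from the server that set them.
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        try {
            // Simulate the response headers of the first request.
            Map<String, List<String>> responseHeaders = new HashMap<>();
            responseHeaders.put("Set-Cookie", Collections.singletonList(setCookieValue));
            jar.put(uri, responseHeaders);

            // Ask the jar which cookies apply to the next request.
            Map<String, List<String>> toSend =
                    jar.get(uri, new HashMap<String, List<String>>());
            List<String> cookies = toSend.get("Cookie");
            return cookies == null ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://www.example.com/product.asp");
        System.out.println(roundTrip(uri, "MSCSProfile=ABC123; Path=/"));
    }
}
```

In a real crawler the jar would normally be installed once with `CookieHandler.setDefault(new CookieManager())`, after which `HttpURLConnection` consults it on every request; whether that is enough for a given site is something only testing against that site can confirm.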
From: Navid H.L. <na...@ya...> - 2003-04-02 20:45:44
|
Hi Somik and everybody else,

Things are moving fast here and getting interesting. It is a great job. I hope that once my program is completed, I can share it with others.

Well, I ran into a new problem yesterday. It may not be closely related to HTMLParser, but I would appreciate it if anyone could give me a hint.

My program uses HTMLParser classes to access sites and extract all URLs; then, in another run, it uses those URLs to extract data from the pages they point to.

There is a site that uses Microsoft Commerce Server 2000 and attaches the cookie to the URL if the request is not from a browser, something like this:

http://www.shoemall.com/product.asp?family%5Fid=2543&type=0&cat%5Fid=0&MSCSProfile=61E4CECF7275066FD87B9817DA5865CBE5EA506A04C53D8558451EC3D02BB577327CA398F52348946BD1631D503EA92FF120A8E45A336FAD8E7E4E31B1356470B79DDD041A4F98A5B403FC86D8A52985761A9F6CEA80

When I try to access the same page with the same URL, I get a different page every time!

Can anybody tell me why this is so, and how I can change my Java program to avoid it or receive the correct page?

I am also setting the User-Agent:

    connectionnew.setRequestProperty("User-Agent", "Mozilla/3.0(Windows NT 4.0; U) Opera 6.0 [en]");

but this still does not help!

Thank you,
Navid
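One way a program like the one described above could hand cookies back to the server by hand is to turn the `Set-Cookie` values of one response into a `Cookie` header for the next request. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, the cookie values are made up, and the helper deliberately ignores expiry, domain, and path rules that a real cookie jar would honour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; ManualCookieSketch and toCookieHeader are hypothetical names.
final class ManualCookieSketch {

    /**
     * Builds a Cookie request-header value from Set-Cookie response-header values.
     * Only the leading name=value pair of each value is kept; attributes such as
     * Path or Expires are dropped and not evaluated.
     */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            int semi = value.indexOf(';');
            String pair = (semi >= 0 ? value.substring(0, semi) : value).trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Made-up values standing in for what a server might send.
        List<String> received = Arrays.asList("MSCSProfile=ABC123; path=/", "lang=en");
        System.out.println(toCookieHeader(received));
    }
}
```

With `HttpURLConnection`, the input list would come from `getHeaderFields().get("Set-Cookie")` on the first response, and the result would be passed to `setRequestProperty("Cookie", ...)` on the next connection before it is opened. Whether the site in question then stops rewriting the URL is not something this sketch can guarantee.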
From: ope t. <op...@ho...> - 2003-03-31 21:08:04
|
Thanks a lot, it worked!

Sincerely,
Ope

>From: "Somik Raha" <so...@ya...>
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Re: Htmlparser-user digest, Vol 1 #226 - 2 msgs
>Date: Sat, 29 Mar 2003 22:18:18 -0800
>
>FYI, I've just found that the CompositeTagScanner had a bug, due to which
>the filters were not being set. Ope -->
>node.collectInto(nodeList, LinkTag.LINK_TAG_FILTER);
>
>will work in the next integration release.
>
>Regards,
>Somik
From: Somik R. <so...@ya...> - 2003-03-31 04:43:54
|
Hi Folks,

This week's integration release is packed with goodies! From the change log:

Integration Build 1.3 - 20030330
--------------------------------
[1] fixed bug (an enhancement really) 694477 quotes in content-type header
[2] fix bug #699886 and #707447 by using a buffered stream reader with infinite mark
[3] fixed bug in CompositeTagScanner, filter not being set correctly
[4] fixed thread safety issue in TagParser (bug 711073)
[5] fixed out of memory error when parsing custom composite tags (bug 709152)
[6] fixed bug 701159, 696455 - redesigned script scanner. Javascript parsing is now much more robust.

As you can see, a lot of bug fixes have gone in. There are three major fixes - one by Derrick Oswald (#2) addresses the charset issue. The parser should now be able to handle different charsets dynamically. We hope you can test this and give us feedback.

The second big change is a redesign of the way Javascript is handled by the parser. It had been riddled with problems for some time, so we've changed its internals. The new implementation is much more robust, and hopefully we can get some feedback on that too.

There were some thread safety issues (thanks to Joe Robbins for reporting this). These have been addressed in this release, and the parser should be totally thread-safe now.

Regards,
Somik
From: Somik R. <so...@ya...> - 2003-03-30 06:16:42
|
FYI, I've just found that the CompositeTagScanner had a bug, due to which the filters were not being set. Ope -->

    node.collectInto(nodeList, LinkTag.LINK_TAG_FILTER);

will work in the next integration release.

Regards,
Somik

----- Original Message -----
From: "Somik Raha" <so...@ya...>
To: <htm...@li...>
Sent: Thursday, March 27, 2003 2:38 PM
Subject: RE: [Htmlparser-user] Re: Htmlparser-user digest, Vol 1 #226 - 2 msgs

> Instead of this,
> > node.collectInto(nodeList,LinkTag.LINK_TAG_FILTER);
> use:
>
> node.collectInto(nodeList,LinkTag.class);
>
> Regards,
> Somik
From: Somik R. <so...@ya...> - 2003-03-27 22:38:36
|
Instead of this,

> node.collectInto(nodeList,LinkTag.LINK_TAG_FILTER);

use:

    node.collectInto(nodeList,LinkTag.class);

Regards,
Somik

--- Marc Novakowski <ma...@ke...> wrote:
> Try removing the following line from your code:
>
> nodeList.add(node);
>
> It's most likely adding non-LinkTag nodes into
> nodeList which causes the ClassCastException later
> on.
>
> Marc
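The ClassCastException discussed in this thread can be reproduced without the library. The sketch below is an analogy, not HTMLParser code: `Node`, `DoctypeTag`, and `LinkTag` are hypothetical stand-in classes defined locally, and the two collection methods only mimic the effect of adding every node to the list versus collecting link tags alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; none of these types are the org.htmlparser classes.
final class CollectIntoSketch {

    static class Node { }

    static final class DoctypeTag extends Node { }

    static final class LinkTag extends Node {
        private final String link;
        LinkTag(String link) { this.link = link; }
        String getLink() { return link; }
    }

    /** Mimics the reported loop: every node is added, and links are collected on top. */
    static List<Node> collectEverything(List<Node> parsed) {
        List<Node> out = new ArrayList<>();
        for (Node n : parsed) {
            out.add(n);                  // counterpart of nodeList.add(node)
            if (n instanceof LinkTag) {  // counterpart of the link filter
                out.add(n);
            }
        }
        return out;
    }

    /** Mimics the suggested fix: only link tags end up in the list. */
    static List<Node> collectLinksOnly(List<Node> parsed) {
        List<Node> out = new ArrayList<>();
        for (Node n : parsed) {
            if (n instanceof LinkTag) {
                out.add(n);
            }
        }
        return out;
    }

    /** Casts every element to LinkTag without checking, as the second loop in the thread does. */
    static List<String> extractLinks(List<Node> collected) {
        List<String> links = new ArrayList<>();
        for (Node n : collected) {
            links.add(((LinkTag) n).getLink());
        }
        return links;
    }

    public static void main(String[] args) {
        List<Node> parsed = Arrays.<Node>asList(
                new DoctypeTag(), new LinkTag("http://example.com/a"));
        System.out.println(extractLinks(collectLinksOnly(parsed)));
        try {
            extractLinks(collectEverything(parsed));
        } catch (ClassCastException e) {
            System.out.println("ClassCastException on the non-link node");
        }
    }
}
```

The design point is that the blind cast is only safe when the list is guaranteed to hold one node type, which is why removing the unconditional add is enough to fix the reported error.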
From: Marc N. <ma...@ke...> - 2003-03-27 22:19:58
|
Try removing the following line from your code:

    nodeList.add(node);

It's most likely adding non-LinkTag nodes into nodeList, which causes the ClassCastException later on.

Marc

-----Original Message-----
From: ope tomori [mailto:op...@ho...]
Sent: Thursday, March 27, 2003 1:31 PM
To: htm...@li...
Subject: [Htmlparser-user] Re: Htmlparser-user digest, Vol 1 #226 - 2 msgs

I figured out the part using the nodeList.collectInto. My debug output shows the right output, but when I try to process the link information, I get this error (this is part of the error):

    Exception occurred during event dispatching:
    java.lang.ClassCastException: org.htmlparser.tags.DoctypeTag

Thanks in advance for your help

Sincerely,
Ope T.

This is my code below:

    try{
        //create the parser with the url to be parsed
        parser = new Parser(urlAddressComplete,new DefaultParserFeedback());
        parser.registerScanners();
        nodeList = new NodeList();

        //to extract all the embedded links and images
        for (NodeIterator e = parser.elements();e.hasMoreNodes();) {
            Node node = (Node)e.nextNode();
            nodeList.add(node);
            //node.collectInto(nodeList,ImageTag.IMAGE_TAG_FILTER);
            node.collectInto(nodeList,LinkTag.LINK_TAG_FILTER);
        }//for

        System.out.print("CHECKING NODES.. " + nodeList.toString()+ "\n");

        //now process the links and images
        //this is the part that doesn't seem to work
        for (SimpleNodeIterator e = nodeList.elements();e.hasMoreNodes();) {
            LinkTag linkTag = (LinkTag)e.nextNode();

            //put the links and their texts into vectors
            allTextLinkVector.addElement(linkTag.getLinkText());
            allLinkVector.addElement(linkTag.getLink());
        }
        // System.out.print( "All Links " + "Size: "+ allTextLinkVector.size() + " "+ allTextLinkVector.toString()+ "\n");
    }//inner try
    catch (ParserException e) {
        System.err.println("Error, could not create parser object");
        e.printStackTrace();
    }//catch
    }// outer try
    catch(IOException ex) { ex.printStackTrace(); }

>Subject: Htmlparser-user digest, Vol 1 #226 - 2 msgs
>Date: Thu, 27 Mar 2003 12:49:39 -0800
>
>Today's Topics:
>
>1. Help with method --> node.collectInto() (ope tomori)
>2. RE: Help with method --> node.collectInto() (Marc Novakowski)
>
>--__--__--
>
>Message: 1
>From: "ope tomori"
>Date: Thu, 27 Mar 2003 15:00:17 +0000
>Subject: [Htmlparser-user] Help with method --> node.collectInto()
>
>Hi, I'm trying to use the method node.collectInto(...) to extract embedded
>links and images on webpages. I'm using the latest integration release, which
>means it's now Parser, not HTMLParser, nodeIterator, etc. and all the other
>changes.
>
>I followed the sample code:
>
>    HTMLParser parser = new HTMLParser("http://www.yahoo.com");
>    parser.registerScanners();
>    int i = 0;
>    Vector collectionVector = new Vector();
>    HTMLNode node;
>    for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
>        node = e.nextHTMLNode();
>        node.collectInto(collectionVector,HTMLLinkTag.LINK_TAG_FILTER);
>    }
>    // All items in the collection vector should be links
>    for (Enumeration e = collectionVector.elements();e.hasMoreElements();) {
>        HTMLLinkTag linkTag = (HTMLLinkTag)e.nextElement();
>        // you can now process the links as you like
>    }
>
>I'm getting an error because this line:
>
>    node.collectInto(collectionVector,HTMLLinkTag.LINK_TAG_FILTER);
>
>requires a nodeList and not a vector. I've tried changing it without any
>success: creating a nodelist instead of a vector.
>
>Can you please help me!
>
>Thanks
>Ope
>
>--__--__--
>
>Message: 2
>Subject: RE: [Htmlparser-user] Help with method --> node.collectInto()
>Date: Thu, 27 Mar 2003 08:30:54 -0800
>From: "Marc Novakowski"
>
>If you can paste the actual code you're trying to compile, I'd be more
>than happy to take a look at it.
>
>Marc
Be There!=20 >NetWorld+Interop Las Vegas 2003 -- Register today!=20 >http://ads.sourceforge.net/cgi-bin/redirect.pl?keyn0001en=20 >_______________________________________________ Htmlparser-user mailing = >list Htm...@li...=20 >https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > > >--__--__-- > >_______________________________________________ Htmlparser-user mailing = >list Htm...@li...=20 >https://lists.sourceforge.net/lists/listinfo/htmlparser-user > > >End of Htmlparser-user Digest _________________________________________________________________ The new MSN 8: advanced junk mail protection and 2 months FREE* =20 http://join.msn.com/?page=3Dfeatures/junkmail ------------------------------------------------------- This SF.net email is sponsored by: The Definitive IT and Networking Event. Be There! NetWorld+Interop Las Vegas 2003 -- Register today! http://ads.sourceforge.net/cgi-bin/redirect.pl?keyn0001en _______________________________________________ Htmlparser-user mailing list Htm...@li... https://lists.sourceforge.net/lists/listinfo/htmlparser-user |
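The cause Marc describes can be reproduced without the library. The sketch below is only an illustration under stated assumptions: `Node`, `DoctypeTag`, `LinkTag` and `LinkCollector` are hypothetical stand-ins written for this example, not the real org.htmlparser classes. It shows a list that holds a mix of node types and an `instanceof` guard that keeps the cast to `LinkTag` from ever failing.

```java
import java.util.ArrayList;
import java.util.List;

class LinkCollector {
    // Keep only the LinkTag entries, so the cast below can never fail.
    static List<String> linksOf(List<Node> nodes) {
        List<String> links = new ArrayList<>();
        for (Node node : nodes) {
            if (node instanceof LinkTag) {
                links.add(((LinkTag) node).getLink());
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<Node> mixed = new ArrayList<>();
        mixed.add(new DoctypeTag());                     // the kind of node an unfiltered add lets in
        mixed.add(new LinkTag("http://www.yahoo.com"));

        // An unconditional (LinkTag) cast on the first element would throw
        // ClassCastException; the guard skips that element instead.
        System.out.println(linksOf(mixed));
    }
}

// Hypothetical stand-ins for the parser's node types (assumptions for this
// sketch only; they are NOT the real org.htmlparser classes).
class Node {}

class DoctypeTag extends Node {}

class LinkTag extends Node {
    private final String link;

    LinkTag(String link) {
        this.link = link;
    }

    String getLink() {
        return link;
    }
}
```

In the posted code the simpler fix is the one Marc gives: drop the unfiltered `nodeList.add(node)` so that only the filtered `collectInto` call fills the list. The guard is just a second line of defence.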
From: ope t. <op...@ho...> - 2003-03-27 21:30:53
|
I figured out the part using the nodeList.collectInto. My debug output shows the right output, but when I try to process the link information, I get this error (this is part of the error):

Exception occurred during event dispatching:
java.lang.ClassCastException: org.htmlparser.tags.DoctypeTag

Thanks in advance for your help

Sincerely,
Ope T.

This is my code below:

    try{
        //create the parser with the url to be parsed
        parser = new Parser(urlAddressComplete,new DefaultParserFeedback());
        parser.registerScanners();
        nodeList = new NodeList();
        //to extract all the embedded links and images
        for (NodeIterator e = parser.elements();e.hasMoreNodes();) {
            Node node = (Node)e.nextNode();
            nodeList.add(node);
            //node.collectInto(nodeList,ImageTag.IMAGE_TAG_FILTER);
            node.collectInto(nodeList,LinkTag.LINK_TAG_FILTER);
        }//for
        System.out.print("CHECKING NODES.. " + nodeList.toString()+ "\n");
        //now process the links and images
        //this is the part that doesnt seem to work
        for (SimpleNodeIterator e = nodeList.elements();e.hasMoreNodes();) {
            LinkTag linkTag = (LinkTag)e.nextNode();
            //put the links and their texts into vectors
            allTextLinkVector.addElement(linkTag.getLinkText());
            allLinkVector.addElement(linkTag.getLink());
        }
        // System.out.print( "All Links " + "Size: "+ allTextLinkVector.size() + " "+ allTextLinkVector.toString()+ "\n");
    }//inner try
    catch (ParserException e) {
        System.err.println("Error, could not create parser object");
        e.printStackTrace();
    }//catch
    }// outer try
    catch(IOException ex) {
        ex.printStackTrace();
    }
 |
From: Marc N. <ma...@ke...> - 2003-03-27 16:31:00
|
If you can paste the actual code you're trying to compile, I'd be more than happy to take a look at it.

Marc

-----Original Message-----
From: ope tomori [mailto:op...@ho...]
Sent: Thursday, March 27, 2003 7:00 AM
To: htm...@li...
Subject: [Htmlparser-user] Help with method --> node.collectInto()

Hi

I'm trying to use the method node.collectInto(...) to extract embedded links and images on webpages. I'm using the latest integration release, which means it's now Parser, not HTMLParser, nodeIterator, etc. and all the other changes.

I followed the sample code:

    HTMLParser parser = new HTMLParser("http://www.yahoo.com");
    parser.registerScanners();
    int i = 0;
    Vector collectionVector = new Vector();
    HTMLNode node;
    for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
        node = e.nextHTMLNode();
        node.collectInto(collectionVector,HTMLLinkTag.LINK_TAG_FILTER);
    }
    // All items in the collection vector should be links
    for (Enumeration e = collectionVector.elements();e.hasMoreElements();) {
        HTMLLinkTag linkTag = (HTMLLinkTag)e.nextElement();
        // you can now process the links as you like
    }

I'm getting an error because this line:

    node.collectInto(collectionVector,HTMLLinkTag.LINK_TAG_FILTER);

requires a NodeList and not a Vector. I've tried changing it, creating a NodeList instead of a Vector, without any success.

Can you please help me!!

Thanks
Ope
 |
From: ope t. <op...@ho...> - 2003-03-27 15:00:29
|
Hi

I'm trying to use the method node.collectInto(...) to extract embedded links and images on webpages. I'm using the latest integration release, which means it's now Parser, not HTMLParser, nodeIterator, etc. and all the other changes.

I followed the sample code:

    HTMLParser parser = new HTMLParser("http://www.yahoo.com");
    parser.registerScanners();
    int i = 0;
    Vector collectionVector = new Vector();
    HTMLNode node;
    for (HTMLEnumeration e = parser.elements();e.hasMoreNodes();) {
        node = e.nextHTMLNode();
        node.collectInto(collectionVector,HTMLLinkTag.LINK_TAG_FILTER);
    }
    // All items in the collection vector should be links
    for (Enumeration e = collectionVector.elements();e.hasMoreElements();) {
        HTMLLinkTag linkTag = (HTMLLinkTag)e.nextElement();
        // you can now process the links as you like
    }

I'm getting an error because this line:

    node.collectInto(collectionVector,HTMLLinkTag.LINK_TAG_FILTER);

requires a NodeList and not a Vector. I've tried changing it, creating a NodeList instead of a Vector, without any success.

Can you please help me!!

Thanks
Ope
 |
From: Marc N. <ma...@ke...> - 2003-03-24 23:23:45
|
Somik,

Thanks for fixing 702614! Unfortunately I can't seem to get the latest build to work. It's throwing an OOM exception in my own code when using the NodeIterator returned by parser.elements(). I'm looking into this to make sure I'm not doing something stupid in my code. However, the library seems to be acting differently than previous releases even out-of-the-box. For example, the following used to return a list of the links on Yahoo (in the 0302 release):

    java -jar ./htmlparser.jar http://www.yahoo.com -l

In the 0323 release, however, it returns nothing.

Marc

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Sunday, March 23, 2003 5:24 PM
To: HTMLParser Announcement List; HTMLParser User List; HTMLParser Developer List
Subject: [Htmlparser-user] Integration Release 1.3-20030323 is out

Hi Folks,

This week's integration release has two important fixes:

Integration build 1.3 - 20030323
--------------------------------
[1] Fixed bug 702547 - single quotes parsed more robustly now
[2] Fixed bug 702614 - empty tags handled correctly now. Tag now has a method isEmptyXmlTag().

#2 refers to tags like <tag/>.

Thanks to Joe Robbins for a fine bug report that helped in putting in the fix for #1 faster. Thanks also to Marc Novakowski for the other report.

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the script scanning mechanism. The parser can currently handle script tags like:

    <script> <!-- code here --> </script>

But when the tags are like:

    <script> code here </script>

the parser is unable to identify the code and treats it like regular tags. Such pages are quite widespread and ought to be supported. I was curious if anyone has ideas on solving this - given the existing design - fresh ideas often lead to a better perspective. If you have some ideas, feel free to join the developer list (http://lists.sourceforge.net/lists/listinfo/htmlparser-developer) and post.

Regards,
Somik
 |
From: mohammad a. <re...@em...> - 2003-03-24 11:45:15
|
I didn't mean to jump on you or anyone else, or even to complain. I totally understand your situation; it's the same for me, I only have my free time to work on my personal projects.

What I meant was that this kind of bug, if you can call it a bug, should be easier and faster to fix, but that's only how I see it; it may be more complicated. What I've understood of the source code is that this only happens once, when meta-scanning starts, and therefore it should be easy to fix so that the meta tag can use different charsets.

When I say "stupid bug", I mean it shouldn't be there at all. I can't understand why the designers and developers would assume every page uses ISO charsets when there are so many others. But that's just my opinion. I hope you didn't misunderstand me about the "put everything down and fix the bug" thing; it's just that I see it as easy to fix, and it really would help me.

I've seen a new Integration Release, but what a disappointment that the charset bug is not fixed! I hope everyone has noticed the bug report for the META charset bug. As I said before, my solution was just temporary and is not a good one, for two reasons: I don't have enough skill in this matter to come up with good solutions, and I haven't yet checked through the whole code, which I consider important before suggesting fixes. I hope the bug report is enough to fix the problem.

rezamotori, Sweden
 |
From: Somik R. <so...@ya...> - 2003-03-24 01:22:13
|
Hi Folks,

This week's integration release has two important fixes:

Integration build 1.3 - 20030323
--------------------------------
[1] Fixed bug 702547 - single quotes parsed more robustly now
[2] Fixed bug 702614 - empty tags handled correctly now. Tag now has a method isEmptyXmlTag().

#2 refers to tags like <tag/>.

Thanks to Joe Robbins for a fine bug report that helped in putting in the fix for #1 faster. Thanks also to Marc Novakowski for the other report.

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the script scanning mechanism. The parser can currently handle script tags like:

    <script> <!-- code here --> </script>

But when the tags are like:

    <script> code here </script>

the parser is unable to identify the code and treats it like regular tags. Such pages are quite widespread and ought to be supported. I was curious if anyone has ideas on solving this - given the existing design - fresh ideas often lead to a better perspective. If you have some ideas, feel free to join the developer list (http://lists.sourceforge.net/lists/listinfo/htmlparser-developer) and post.

Regards,
Somik
 |
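One idea for the script problem, offered only as a rough sketch and not as how the parser works or should be changed: once an opening script tag is seen, stop tokenizing tags and take everything up to the next closing script tag as raw text. The class and method names below (`ScriptBodySketch`, `firstScriptBody`, `indexOfIgnoreCase`) are made up for this illustration, and the sketch ignores edge cases such as a `>` inside an attribute value or a literal `</script` inside a script string.

```java
class ScriptBodySketch {
    // Case-insensitive indexOf, so <SCRIPT> and <script> are both found.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the raw text between the first opening script tag and the next
    // closing script tag, or null if the page has no complete script element.
    static String firstScriptBody(String html) {
        int open = indexOfIgnoreCase(html, "<script", 0);
        if (open < 0) {
            return null;
        }
        int bodyStart = html.indexOf('>', open);   // end of the opening tag
        if (bodyStart < 0) {
            return null;
        }
        bodyStart++;
        int close = indexOfIgnoreCase(html, "</script", bodyStart);
        if (close < 0) {
            return null;
        }
        // Everything in between is script code, even if it contains '<' or '>'.
        return html.substring(bodyStart, close);
    }

    public static void main(String[] args) {
        String page = "<html><script>if (a < b) { document.write(\"<b>hi</b>\"); }</script></html>";
        System.out.println(firstScriptBody(page));
    }
}
```

The point of the design is that the script body is never handed to the tag scanner, so a `<` inside the code cannot be mistaken for the start of a tag.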
From: Somik R. <so...@ya...> - 2003-03-23 16:41:07
|
mohammad azadi wrote:
> I really think it's a stupid bug that all pages must use ISO-charset! Can't you just fix the damn thing and make it a patch so we can continue with our work??

If you're objecting to my request to file a bug report, then please note that I cannot devote weekdays to the project, only my personal time on weekends. And when I do get the time, I prefer not to search all emails on the user list to find what bugs need to be tackled.

As far as the bug in question being stupid: all bugs are stupid. It's just that one person does not have the time to find them all, and code is often written by more than one person. There are also development priorities - certain bugs take precedence - in my opinion, which I often base on feedback. Since this is not a paid project, you cannot expect me or any other developer to jump on an incomplete bug report; the least we expect is for the community to help out. However, if a certain bug hurts you and needs fixing, you could always make a polite request. Or solve it yourself and give it to the community, for which all of us will be grateful.

> My suggestion is to have a String[] containing all the common charsets, and enable it to expand for new charsets.
> I don't think it should take long to fix it. I've tried myself, but it was just a temporary fix.

Thank you for the suggestion. Perhaps you can give us the patch in question. And just so you don't think I am being sarcastic, I'd be happy to have you on our developer team - anyone who wants to improve the system earns a right to be on the dev team.

In general, I think it will be good to have guidelines for posting questions to make us a more effective community. I try to follow Eric Raymond's well-written paper: http://www.catb.org/%7Eesr/faqs/smart-questions.html

Regards,
Somik
 |
From: mohammad a. <re...@em...> - 2003-03-23 14:09:14
|
I really think it's a stupid bug that all pages must use ISO-charset! Can't you just fix the damn thing and make it a patch so we can continue with our work??

My suggestion is to have a String[] containing all the common charsets, and enable it to expand for new charsets. I don't think it should take long to fix it. I've tried myself, but it was just a temporary fix.

Rezamotori, Sweden
 |
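As a variation on the String[] idea, a hard-coded list may not be needed at all, because the JDK can already say whether it knows a charset name. The sketch below is an assumption-laden illustration, not the parser's code: `MetaCharsetSketch`, `charsetOf` and the ISO-8859-1 fallback are choices made for this example. It pulls the `charset=` parameter out of a META content value and asks `java.nio.charset.Charset` whether that name is supported.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSketch {
    // Fallback used when no usable charset is declared (an assumption of this sketch).
    static final String FALLBACK = "ISO-8859-1";

    // Matches charset=NAME, with optional quotes and spaces, in any letter case.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the canonical name of the declared charset if this JVM supports it,
    // otherwise the fallback.
    static String charsetOf(String content) {
        if (content == null) {
            return FALLBACK;
        }
        Matcher m = CHARSET_PARAM.matcher(content);
        if (!m.find()) {
            return FALLBACK;
        }
        String name = m.group(1);
        try {
            if (Charset.isSupported(name)) {
                return Charset.forName(name).name();
            }
        } catch (IllegalCharsetNameException e) {
            // The declared name contains illegal characters; use the fallback.
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=windows-1252"));
        System.out.println(charsetOf("text/html"));
    }
}
```

Whatever charsets the running JVM supports are accepted automatically, which covers the "expand for new charsets" part of the suggestion without maintaining a list by hand.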
From: Somik R. <so...@ya...> - 2003-03-21 19:48:40
|
You should be able to suppress all the feedback. Check http://htmlparser.sourceforge.net/docs/index.php/FeedbackMechanism

Regards,
Somik

--- Sean_Syslab <se...@sy...> wrote:
> Sorry, I misunderstood the return strings. The WARNING messages are not within the return strings of the methods, but are shown after that.
>
> Dear all:
>
> When I used the sample program to extract links or strings, there were sometimes WARNING messages shown within the return strings. I don't want those WARNING strings accompanied with the return value. What should I do...
>
> Yours, Sean
 |
From: Sean_Syslab <se...@sy...> - 2003-03-21 19:14:46
|
Sorry, I misunderstood the return strings. The WARNING messages are not within the return strings of the methods, but are shown after that.

Dear all:

When I used the sample program to extract links or strings, there were sometimes WARNING messages shown within the return strings. I don't want those WARNING strings accompanied with the return value. What should I do...

Yours, Sean
 |
From: Sean_Syslab <se...@sy...> - 2003-03-21 18:30:17
|
Dear all:

When I used the sample program to extract links or strings, there were sometimes WARNING messages shown within the return strings. I don't want those WARNING strings accompanied with the return value. What should I do...

Yours, Sean
 |
From: Somik R. <so...@ya...> - 2003-03-21 17:39:17
|
To log in to SourceForge, you need to have a SourceForge ID. Get one from http://sourceforge.net/account/register.php

Regards
Somik

--- Aminudin Khalid <ami...@mi...> wrote:
> Can somebody else help me to file this bug? I could not log in to SourceForge.
>
> Thanks :)
>
> --
> Mohd. Aminudin bin Mohd. Khalid
> Linux Programmer
> Asian Open Source Centre (http://www.asiaosc.org)
> Mimos Berhad (http://www.mimos.my)
 |
From: Aminudin K. <ami...@mi...> - 2003-03-21 00:48:57
|
Can somebody else help me to file this bug? I could not log in to SourceForge.

Thanks :)

Somik Raha wrote:
> Sounds like a bug.. Can you file a bug report at
> http://htmlparser.sourceforge.net
>
> Regards,
> Somik

--
Mohd. Aminudin bin Mohd. Khalid
Linux Programmer
Asian Open Source Centre (http://www.asiaosc.org)
Mimos Berhad (http://www.mimos.my)
 |
From: Sean_YZU90 <s9...@ma...> - 2003-03-20 17:15:59
|
The member who posted about the compilation problem set a correct classpath, I think. The problem is that he used the latest version of htmlparser, which doesn't contain the class HtmlNode... . So he should use htmlparser 1.2; then the sample LinkExtractor.java should compile correctly.

Yours, Sean
 |
From: Somik R. <so...@ya...> - 2003-03-19 06:26:08
|
Sounds like a bug.. Can you file a bug report at http://htmlparser.sourceforge.net

Regards,
Somik

----- Original Message -----
From: Aminudin Khalid
To: htm...@li...
Sent: Monday, March 17, 2003 6:42 PM
Subject: Re: [Htmlparser-user] Handling META tag

> It will help if you can post the stack trace.

I dunno how to do that.

Well, I think the error comes from the htmlparser.jar. Simply parse a file that contains the following code and you will notice the "error". Actually there is no error, it just doesn't parse the file correctly.

OK, I have a file (thisfile.html). Below is the HTML code inside thisfile.html.

    <html>
    <head>
    <meta http-equiv="content-type" content="text/html; charset=windows-1252">
    </head>
    </html>

Try to parse thisfile.html with htmlparser.jar.

    java -jar htmlparser.jar thisfile.html

Below is the only output (it doesn't detect the html code ????):

    HTMLParser v1.3 (Integration Build Mar 16, 2003)
    INFO: file://localhost/thisfile.html
    Parsing file://localhost/thisifle.html
 |
From: Aminudin K. <ami...@mi...> - 2003-03-18 02:41:56
|
> It will help if you can post the stack trace.

I dunno how to do that.

Well, I think the error comes from the htmlparser.jar. Simply parse a file that contains the following code and you will notice the "error". Actually there is no error, it just doesn't parse the file correctly.

OK, I have a file (thisfile.html). Below is the HTML code inside thisfile.html.

    <html>
    <head>
    <meta http-equiv="content-type" content="text/html; charset=windows-1252">
    </head>
    </html>

Try to parse thisfile.html with htmlparser.jar.

    java -jar htmlparser.jar thisfile.html

Below is the only output (it doesn't detect the html code ????):

    HTMLParser v1.3 (Integration Build Mar 16, 2003)
    INFO: file://localhost/thisfile.html
    Parsing file://localhost/thisifle.html
 |
From: Somik R. <so...@ya...> - 2003-03-17 21:35:52
|
It will help if you can post the stack trace.

Regards
Somik

--- Aminudin Khalid <ami...@mi...> wrote:
> I have a problem parsing HTML that contains the following META tag.
>
>     <html>
>     <head>
>     <meta http-equiv="content-type" content="text/html; charset=windows-1252">
>     </head>
>     </html>
>
> I wrote a visitor class to parse several web sites, but it failed to parse this kind of HTML. I also tried ( java -jar htmlparser.jar thisfile.html ); it also failed.
>
> I guess it couldn't read the http-equiv="content-type"
>
> Any idea?
 |
From: Aminudin K. <ami...@mi...> - 2003-03-17 08:46:01
|
I have a problem parsing HTML that contains the following META tag.

    <html>
    <head>
    <meta http-equiv="content-type" content="text/html; charset=windows-1252">
    </head>
    </html>

I wrote a visitor class to parse several web sites, but it failed to parse this kind of HTML. I also tried ( java -jar htmlparser.jar thisfile.html ); it also failed.

I guess it couldn't read the http-equiv="content-type"

Any idea?
 |