htmlparser-user Mailing List for HTML Parser (Page 88)
Brought to you by:
derrickoswald
From: Lou-yahoo <vas...@ya...> - 2002-12-13 11:25:26
|
Hi folks,

I'd like to use the parser API for some pretty straightforward HTML file parsing, but when I try your simple example (although I did have to change Enumeration to HTMLEnumeration)...

    public static void main(String[] args) {
        try {
            System.out.println("testpoint 0");
            HTMLParser parser = new HTMLParser("http://www.yahoo.com");
            //HTMLParser parser =
            //    new HTMLParser("http://www.yahoo.com", new DefaultHTMLParserFeedback());
            System.out.println("testpoint 1");
            System.exit(0);
            // In this example, we are registering all the common scanners
            parser.registerScanners();
            for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
                HTMLNode node = (HTMLNode) e.nextHTMLNode();
                node.print();
            }
        } catch (Exception e) {
            System.out.println("Exception Thrown:" + e.getMessage() + " : " + e);
            e.printStackTrace();
        }
        System.out.println("Done! yay");
    }

I get NoClassDefFoundError: sun/security/action/GetPropertyAction

What am I doing wrong?

Thanks,
Lou |
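A NoClassDefFoundError for a sun.* class usually points at the runtime rather than at the parser. One possible cause (an assumption; nothing in this thread confirms it) is that the program runs on a JVM, or with a boot class path, that lacks Sun's internal classes. A minimal diagnostic sketch in plain modern Java, with no dependency on the parser, is shown here. The class name JvmCheck and its two helper methods are made up for illustration.

```java
public class JvmCheck {
    /** Returns a short report of the JVM properties relevant to a missing sun.* class. */
    static String describeJvm() {
        StringBuilder sb = new StringBuilder();
        for (String key : new String[] {"java.vendor", "java.version", "java.home"}) {
            sb.append(key).append(" = ").append(System.getProperty(key)).append('\n');
        }
        return sb.toString();
    }

    /** True if the named class can be located by this class's loader (without initializing it). */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, JvmCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.print(describeJvm());
        // The class named in the error message reported above.
        String name = "sun.security.action.GetPropertyAction";
        System.out.println(name + " available: " + isClassAvailable(name));
    }
}
```

If the class is reported as unavailable, the vendor and java.home lines show which runtime is actually being picked up, which may differ from the one the code was compiled against.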
From: Somik R. <so...@ya...> - 2002-12-09 03:01:37
|
Leslie and Derrick have been added to the list of users in cvs. Claude, Kaarle and Dhaval are also on this list. You can add cvs watches by doing: cvs watch add * (I think), and you will be automatically emailed when anyone does a cvs update. Regards, Somik |
From: Somik R. <so...@ya...> - 2002-12-09 01:28:26
|
Hi Folks,

This week's release is Candidate 5. We've had talented developers joining us over the weekend, so you can expect improvements in quality in the coming weeks. Hopefully, we should have our production release ready by New Year's...

From the change log:

Integration Build 1.2 - 20021208
---------------------------------
[1] Fixed bug in base href scanner - would always expect href
[2] Refactored HTMLFormScanner
[3] Refactored HTMLRenderer to use the Visitor pattern - enabling connections with links and images
[4] HTMLStringNode returns a blank string in toPlainTextString()
[5] HTMLFormTag returns string information in toPlainTextString()

#5 is an important fix: we no longer lose any meaningful string info contained inside forms when we issue calls like node.toPlainTextString().

Get the latest release from http://htmlparser.sourceforge.net

The site update is continuing at an even pace. There is a new section on writing tests for HTMLParser. We're also trying to introduce a philosophy called "Communicate with TestCases". If you've found a bug, write a testcase for it, and submit that in your report. Of course, you don't have to do this, but if you do, we'll be able to make the fix much faster (and be more motivated to make it). Writing a testcase for the parser is super simple - you can check the philosophy and an example on the documentation page.

http://htmlparser.sourceforge.net/design/index.html

Regards,
Somik |
From: Somik R. <so...@ya...> - 2002-12-06 05:45:17
|
Hi Leslie,

Indeed, the <form> tag is a nightmare to work with. At one point, we had removed it from the basic set of scanners. We put it back in after our exception handling mechanism was in place, so now, if things get messy, you should get an exception. We can't possibly handle every bit of screwed-up HTML :), although we try really hard to.

> it would be better if the end-form tag could be 'assumed' so that the
> file could at least be parsed. that would mirror the behavior of commercial
> browsers.

I spent some time on this, tried it, and failed miserably. It turned out to be almost impossible to predict where a form tag should end, because of its expanse, and particularly because of its intermingling with <table>. It's quite possible I missed something, so if you have any innovative suggestions, they would be really helpful. I think we're in the realm of AI here :). Remember, there is a big constraint: we have a streaming, real-time parser, not a DOM-style parser where we have the whole document and can go back and forth. A good heuristic that really works will make our day.

By the way, maybe this discussion would be better held on the dev list. Feel free to join us as a dev (send me your SourceForge id).

Regards,
Somik

----- Original Message -----
From: "Leslie Rohde" <le...@op...>
To: <htm...@li...>
Sent: Thursday, December 05, 2002 5:25 PM
Subject: Re: [Htmlparser-user] how to deal with form tag following table tag

> actually, there are two problems in the case at hand, and i am not at all
> sure that the <table><form> construction is the worst of them.
>
> not only does hotbot produce this invalid sequence, but they also
> failed to close the form tag. it looks like HTMLFormScanner
> simply falls out of the loop at lines 136-154 looking for the end tag and
> throws an exception when not found.
>
> it would be better if the end-form tag could be 'assumed' so that the
> file could at least be parsed. that would mirror the behavior of commercial
> browsers.
|
From: Leslie R. <le...@op...> - 2002-12-06 00:40:13
|
actually, there are two problems in the case at hand, and i am not at all sure that the <table><form> construction is the worst of them.

not only does hotbot produce this invalid sequence, but they also failed to close the form tag. it looks like HTMLFormScanner simply falls out of the loop at lines 136-154 looking for the end tag and throws an exception when not found.

it would be better if the end-form tag could be 'assumed' so that the file could at least be parsed. that would mirror the behavior of commercial browsers.

Leslie Rohde wrote:
> the construction <table ...><form...> is not allowed in spec, but it
> does occur in such places as the hotbot search engine results page.
> currently, htmlparser delivers a flood of errors and exceptions when
> parsing a hotbot results page.
>
> how best to handle this?

--
Leslie Rohde
mailto:le...@op...
http://www.optitext.com |
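To make the "assumed end-form tag" idea concrete, here is a minimal single-pass sketch. It is not how HTMLFormScanner works and not a proposal from the list; it only illustrates the simplest forward-only rule: since forms cannot nest, treat a form as closed when the next form starts or when the input ends. It deliberately ignores the harder question of where the form should end relative to a surrounding table. The class ImpliedFormEnd, its methods, and the representation of tags as plain strings are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

public class ImpliedFormEnd {
    /** True for "<form", "<form>" or "<form ...>" (expects a lower-cased token). */
    static boolean isFormStart(String lowerToken) {
        if (!lowerToken.startsWith("<form")) return false;
        if (lowerToken.length() == 5) return true;
        char next = lowerToken.charAt(5);
        return next == '>' || Character.isWhitespace(next);
    }

    /**
     * Copies tokens to the output in order, inserting "</form>" when a form is
     * still open at the next form start tag or at the end of the stream.
     * Only forward reads are used, so it fits a streaming parser.
     */
    static List<String> closeOpenForms(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean formOpen = false;
        while (tokens.hasNext()) {
            String token = tokens.next();
            String lower = token.toLowerCase(Locale.ROOT);
            if (isFormStart(lower)) {
                if (formOpen) out.add("</form>"); // previous form was never closed
                formOpen = true;
            } else if (lower.startsWith("</form")) {
                formOpen = false;
            }
            out.add(token);
        }
        if (formOpen) out.add("</form>"); // end of input reached with a form open
        return out;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList(
            "<table>", "<form action=\"search\">", "<input>", "</table>");
        System.out.println(closeOpenForms(in.iterator()));
    }
}
```

With this rule the unclosed form simply runs to the end of the page, which lets parsing finish but may swallow more content into the form than a browser would.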
From: Leslie R. <le...@op...> - 2002-12-05 22:52:38
|
the construction <table ...><form...> is not allowed in spec, but it does occur in such places as the hotbot search engine results page. currently, htmlparser delivers a flood of errors and exceptions when parsing a hotbot results page.

how best to handle this?

--
Leslie Rohde
mailto:le...@op...
http://www.optitext.com |
From: <dha...@or...> - 2002-12-05 07:29:15
|
Navid,

You can easily get around this problem by removing the HTMLFormScanner from the scan list. The HTMLFormScanner is automatically registered when you use HTMLParser.registerScanners(). Use the removeScanner function to remove this scanner and then proceed with your code.

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-8290019 Extn. 1457

-----Original Message-----
From: nav911t [mailto:na...@ya...]
Sent: Thursday, December 05, 2002 10:24 AM
To: htmlparser-user
Cc: nav911t
Subject: Re: [Htmlparser-user] Extracting Images

Hi Somik,
Thank you. These examples are helpful. But the problem is that the image tags between <form> and </form> get skipped; I cannot extract them! Once I remove <form> and </form>, I can extract all image tags, but not when I put them back.

Navid |
From: Navid H.L. <na...@ya...> - 2002-12-05 04:59:26
|
Thank you.

--- Somik Raha <so...@ya...> wrote:
> for (Enumeration e = formTag.getAllNodesVector().elements(); e.hasMoreElements();) {
>     HTMLNode node = (HTMLNode) e.nextElement();
>     if (node instanceof HTMLImageTag) {
>         // ...
>     }
> }
|
From: Navid H.L. <na...@ya...> - 2002-12-05 04:54:23
|
Hi Somik,

Thank you. These examples are helpful. But the problem is that the image tags between <form> and </form> get skipped; I cannot extract them! Once I remove <form> and </form>, I can extract all image tags, but not when I put them back.

Navid

--- Somik Raha <so...@ya...> wrote:
> Hi Navid,
> If you want to extract images - you need to read this:
> http://htmlparser.sourceforge.net/samples/imageslinks.html
> If you want to extract text b/w two images, thats something quite
> different. Use toPlainTextString() to get the data out.
>
> Regards
> Somik
|
From: Somik R. <so...@ya...> - 2002-12-05 04:43:06
|
// Walk every node inside the form and handle the image tags.
// Enumeration is java.util.Enumeration.
for (Enumeration e = formTag.getAllNodesVector().elements(); e.hasMoreElements();) {
    HTMLNode node = (HTMLNode) e.nextElement();
    if (node instanceof HTMLImageTag) {
        // ...
    }
}

----- Original Message -----
From: "Navid H.Langaroudi" <na...@ya...>
To: <htm...@li...>
Sent: Wednesday, December 04, 2002 6:15 PM
Subject: [Htmlparser-user] How to parse a forms nodes

> Hi Somik,
> I know that HTMLFormTag has a method
> getAllNodesVector() that returns all nodes of form
> tag. But how can I use this vector to extract Tags,
> like image tags.
>
> Thanks
> Navid
|
From: Somik R. <so...@ya...> - 2002-12-05 04:41:59
|
Hi Navid,

If you want to extract images, you need to read this:
http://htmlparser.sourceforge.net/samples/imageslinks.html

If you want to extract the text between two images, that's something quite different. Use toPlainTextString() to get the data out.

Regards
Somik

----- Original Message -----
From: "Navid H.Langaroudi" <na...@ya...>
To: <htm...@li...>
Sent: Wednesday, December 04, 2002 4:33 PM
Subject: Re: [Htmlparser-user] Candidate Release 4 is out

> Hi Somik,
> I am trying to extract all images of a page. I can
> only extract part of it. I do not know what I am
> missing.
> Actually what I am doing is, trying to extract all
> text that appears between two images.
> Any suggestion?
>
> Thanks,
> Navid
|
From: Navid H.L. <na...@ya...> - 2002-12-05 02:15:06
|
Hi Somik,

I know that HTMLFormTag has a method getAllNodesVector() that returns all nodes of the form tag. But how can I use this vector to extract tags, such as image tags?

Thanks
Navid |
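The pattern suggested on the list for this (enumerate the vector and test each element with instanceof) can be tried without the library. The sketch below is self-contained and uses hypothetical stand-in classes, Node and ImageTag, in place of the library's HTMLNode and HTMLImageTag; only java.util.Vector and java.util.Enumeration are real APIs here, and the src field is an assumption made for the demonstration.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Vector;

public class FormImageFilter {
    // Hypothetical stand-ins for the library's HTMLNode and HTMLImageTag.
    static class Node { }

    static class ImageTag extends Node {
        final String src;
        ImageTag(String src) { this.src = src; }
    }

    /** Collects the image nodes from a vector of mixed nodes, preserving order. */
    static List<ImageTag> imagesIn(Vector<Node> nodes) {
        List<ImageTag> images = new ArrayList<>();
        for (Enumeration<Node> e = nodes.elements(); e.hasMoreElements();) {
            Node node = e.nextElement();
            if (node instanceof ImageTag) {
                images.add((ImageTag) node);
            }
        }
        return images;
    }

    public static void main(String[] args) {
        Vector<Node> nodes = new Vector<>();
        nodes.add(new Node());
        nodes.add(new ImageTag("logo.gif"));
        nodes.add(new Node());
        nodes.add(new ImageTag("banner.gif"));
        for (ImageTag img : imagesIn(nodes)) {
            System.out.println(img.src);
        }
    }
}
```

With the real library the same loop would presumably run over formTag.getAllNodesVector(), the method named in the question.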
From: Navid H.L. <na...@ya...> - 2002-12-05 00:33:28
|
Hi Somik,

I am trying to extract all the images on a page, but I can only extract some of them. I do not know what I am missing. What I am actually trying to do is extract all the text that appears between two images. Any suggestions?

Thanks,
Navid |
From: Somik R. <so...@ya...> - 2002-12-04 06:33:52
|
Hi Marcin,

Thanks for the bug report - it has been reproduced and fixed. By the way, when submitting reports, could you log in at SourceForge? Otherwise, the automatic response mechanism of the bug reporting system will not be able to notify you of our responses.

You can expect this fix in the next release (coming week).

Regards,
Somik

----- Original Message -----
From: "Marcin Pionnier" <mar...@so...>
To: <htm...@li...>
Sent: Monday, December 02, 2002 5:08 AM
Subject: [Htmlparser-user] Base tag problem

> Hi
>
> When I was parsing pages from google.com directory I found a problem:
> Inside a document there is such line:
> <base target="_top">
>
> but HtmlBaseHREFScanner assumes that there is HREF paramater:
> (Line 68) HTMLBaseHREFScanner.java)
> String baseUrl = (String)tag.getParameter("HREF");
> String absoluteBaseUrl = removeLastSlash(baseUrl.trim());
>
> so I got null pointer exception (of course it is very easy to fix it)
>
> Is that line from google malformed HTML or it is a bug in parser?
|
From: Somik R. <so...@ya...> - 2002-12-02 17:38:45
|
That sounds like a bug. Can you enter it in at http://htmlparser.sourceforge.net ? Thanks a lot.

Regards
Somik

--- Marcin Pionnier <mar...@so...> wrote:
> Hi
>
> When I was parsing pages from google.com directory I found a problem:
> Inside a document there is such line:
> <base target="_top">
>
> but HtmlBaseHREFScanner assumes that there is HREF paramater:
> (Line 68) HTMLBaseHREFScanner.java)
> String baseUrl = (String)tag.getParameter("HREF");
> String absoluteBaseUrl = removeLastSlash(baseUrl.trim());
>
> so I got null pointer exception (of course it is very easy to fix it)
>
> Is that line from google malformed HTML or it is a bug in parser?
|
From: Marcin P. <mar...@so...> - 2002-12-02 13:24:12
|
Hi

When I was parsing pages from the google.com directory I found a problem. Inside a document there is this line:

    <base target="_top">

but HTMLBaseHREFScanner assumes that there is an HREF parameter (line 68 of HTMLBaseHREFScanner.java):

    String baseUrl = (String)tag.getParameter("HREF");
    String absoluteBaseUrl = removeLastSlash(baseUrl.trim());

so I got a null pointer exception (of course it is very easy to fix).

Is that line from Google malformed HTML, or is it a bug in the parser?
|
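The null pointer exception in the report comes from calling trim() on a missing HREF value. Below is a minimal sketch of one way to guard against it. It is not the fix that went into the parser (the thread does not show that code): the Map stands in for tag.getParameter(...), the class and method names are hypothetical, and the behaviour of removeLastSlash is assumed from its name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the scanner logic; the Map plays the role of
// tag.getParameter(...) and is not the HTML Parser API.
class BaseHrefSketch {

    // Assumed behaviour of the removeLastSlash helper named in the report.
    static String removeLastSlash(String url) {
        return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
    }

    // Returns the cleaned base URL, or null when the tag carries no HREF.
    static String baseUrlOf(Map<String, String> tagParameters) {
        String baseUrl = tagParameters.get("HREF");
        if (baseUrl == null) {
            // e.g. <base target="_top">: there is no base URL to resolve against
            return null;
        }
        return removeLastSlash(baseUrl.trim());
    }

    public static void main(String[] args) {
        Map<String, String> targetOnly = new HashMap<>();
        targetOnly.put("TARGET", "_top");
        System.out.println(baseUrlOf(targetOnly));

        Map<String, String> withHref = new HashMap<>();
        withHref.put("HREF", " http://www.google.com/dir/ ");
        System.out.println(baseUrlOf(withHref));
    }
}
```

Returning null pushes the decision to the caller; a real scanner might instead keep the document's own URL as the base when no HREF is present.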
From: Somik R. <so...@ya...> - 2002-12-02 02:56:54
|
Hi Folks,

Candidate Release 4 is out. It contains a few minor API changes which won't affect your application, but have been done to improve the OO design of the system. HTMLFormScanner has been improved. The major work in this release went into refactoring 201 testcases, to make them more readable and to follow the Once-And-Only-Once paradigm. The package size dropped about 12KB (after zipping), so you can estimate how much refactoring was done. All tests are passing.

From the Change Log:

Integration Build 1.2 - 20021201
--------------------------------
[1] Refactored HTMLNode, API improved, now HTMLNode stores nodeBegin and nodeEnd.
[2] Refactored Testing framework - to reduce the code size substantially.
[3] HTMLFormScanner improved to include Input, TextArea, Select and Option scanners within

You can get it from http://htmlparser.sourceforge.net

There's an all-new Contributors Page (linked from the main site). Just in case I missed anybody, or you have info to add, please let me know.

Regards,
Somik
|
From: Amit R. <ami...@ya...> - 2002-11-28 01:09:16
|
> > Well, now I have reached the really difficult part. I am trying to extract meaningful data from some sites, such as a site's product names, descriptions and keywords, which are not always in meta tags.
> > Do you have any suggestions?
>
> You might do well to use Artificial Intelligence (whatever that is). :)
> If you define your goal more clearly, it would be easier to help you.

I used regular expressions for a similar job.
|
From: Somik R. <so...@ya...> - 2002-11-27 03:39:00
|
> Well, now I have reached the really difficult part. I am trying to extract meaningful data from some sites, such as a site's product names, descriptions and keywords, which are not always in meta tags.
> Do you have any suggestions?

You might do well to use Artificial Intelligence (whatever that is). :)
If you define your goal more clearly, it would be easier to help you.

Regards,
Somik
|
From: Navid H.L. <na...@ya...> - 2002-11-27 01:49:57
|
Thank you Somik,

It worked as I wanted. Also, using the parser classes, I could extract each part of the page data separately.

Well, now I have reached the really difficult part. I am trying to extract meaningful data from some sites, such as a site's product names, descriptions and keywords, which are not always in meta tags.
Do you have any suggestions?

Navid

--- Somik Raha <so...@ya...> wrote:
> Hi Navid,
> I ran the program, and it does exactly what I expected.
> But I see your doubt now. You want to suppress the exception messages. These are happening because of DefaultHTMLParserFeedback(). Please write your own NullHTMLParserFeedback() that does not print anything when it encounters an error, and use that to initialize the parser. Read the javadoc of HTMLParser.java carefully.
>
> I have written the modified program for you:
>
> public void testNullUrl() {
>     HTMLParser parser;
>     try {
>         parser = new HTMLParser("http://www.yahooeeeeee.com", new HTMLParserFeedback() {
>             /** @see org.htmlparser.util.HTMLParserFeedback#info(String) */
>             public void info(String message) {
>             }
>
>             /** @see org.htmlparser.util.HTMLParserFeedback#warning(String) */
>             public void warning(String message) {
>             }
>
>             /** @see org.htmlparser.util.HTMLParserFeedback#error(String, HTMLParserException) */
>             public void error(String message, HTMLParserException e) {
>             }
>         });
>         //assertTrue("Should have thrown an exception!",false);
>         parser.registerScanners();
>         parser.addScanner(new HTMLLinkScanner("-l"));
>     }
>     catch (HTMLParserException e) {
>         System.out.println("Can not connect the URL!");
>     }
> }
>
> Try this - it should give you what you want now.
>
> Regards
> Somik
>
> (The earlier mail went before I could complete it..)
|
From: Somik R. <so...@ya...> - 2002-11-26 06:39:53
|
Hi Folks,

Candidate 3 is out. You can get it from http://htmlparser.sourceforge.net

The website is getting an overhaul, though this is in progress. You will find a new samples page. If anyone wishes to contribute a simple program to add to the catalog, please feel free to come forward.

From the change log, in this release:

Integration Build 1.2 - 20021125
--------------------------------
[1] Incorporated Bug Fix for HTMLLinkProcessor to parse dynamic urls
[2] Refactored package names to org.htmlparser
[3] Added documentation
[4] Can handle url with spaces in it
[5] Fixed bug 643352 - going into infinite loop on bad img within link
[6] Refactored HTMLLinkTag - unnecessary boolean variables removed

Developers --> can you send me a brief bio, with a pic - I'd like to acknowledge everyone who has contributed to this project.

Regards,
Somik
|
From: Somik R. <so...@ya...> - 2002-11-25 19:51:14
|
Hi Navid,
I ran the program, and it does exactly what I expected.
But I see your doubt now. You want to suppress the exception messages. These are happening because of DefaultHTMLParserFeedback(). Please write your own NullHTMLParserFeedback() that does not print anything when it encounters an error, and use that to initialize the parser. Read the javadoc of HTMLParser.java carefully.

I have written the modified program for you:

    public void testNullUrl() {
        HTMLParser parser;
        try {
            parser = new HTMLParser("http://www.yahooeeeeee.com", new HTMLParserFeedback() {
                /** @see org.htmlparser.util.HTMLParserFeedback#info(String) */
                public void info(String message) {
                }

                /** @see org.htmlparser.util.HTMLParserFeedback#warning(String) */
                public void warning(String message) {
                }

                /** @see org.htmlparser.util.HTMLParserFeedback#error(String, HTMLParserException) */
                public void error(String message, HTMLParserException e) {
                }
            });
            //assertTrue("Should have thrown an exception!",false);
            parser.registerScanners();
            parser.addScanner(new HTMLLinkScanner("-l"));
        }
        catch (HTMLParserException e) {
            System.out.println("Can not connect the URL!");
        }
    }

Try this - it should give you what you want now.

Regards
Somik

(The earlier mail went before I could complete it..)
|
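For the crawling case in this thread (carry on with the remaining links when one URL cannot be opened), the usual pattern is to put the try/catch inside the loop rather than around it. The sketch below is self-contained and does not use the HTML Parser classes: OpenFailedException and PageOpener are hypothetical stand-ins for HTMLParserException and for constructing an HTMLParser on a URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CrawlLoopSketch {

    // Hypothetical stand-in for HTMLParserException.
    static class OpenFailedException extends Exception {
        OpenFailedException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for "construct a parser on this URL".
    interface PageOpener {
        void open(String url) throws OpenFailedException;
    }

    // Visits every URL in order; a failure is recorded and the loop carries on.
    static List<String> visitAll(List<String> urls, PageOpener opener) {
        List<String> failed = new ArrayList<>();
        for (String url : urls) {
            try {
                opener.open(url);
            } catch (OpenFailedException e) {
                failed.add(url); // skip this link, keep crawling the rest
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Simulated opener: only the misspelled host fails.
        PageOpener opener = url -> {
            if (url.contains("yahooeeeeee")) {
                throw new OpenFailedException("cannot connect to " + url);
            }
        };
        List<String> failed = visitAll(
                Arrays.asList("http://htmlparser.sourceforge.net", "http://www.yahooeeeeee.com"),
                opener);
        System.out.println("Skipped: " + failed);
    }
}
```

With the real library, the body of open would be the parser construction shown in the message, and the catch clause would trap HTMLParserException.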
From: Somik R. <so...@ya...> - 2002-11-25 19:50:29
|
Hi Navid,
I ran the program, and it does exactly what I expected.
But I see your doubt now. You want to suppress the exception messages. These are happening because of DefaultHTMLParserFeedback(). Please write your own NullHTMLParserFeedback() that does not print anything when it encounters an error, and use that to initialize the parser. Read the javadoc of HTMLParser.java carefully.

> --- Somik Raha <so...@ya...> wrote:
> > Hi Navid,
> >
> > > I still need your help.
> > > I am getting some exception errors in my program; it happens when it tries to open a URL to a non-existing page.
> > > I tried some try/catch, but still can't catch this one.
> > > It says
> > > Error: HTMLParser.oprnURLConnection(): Error in opening a URL connection to http://www.somenoneexistingurl.com .......
> > >
> > > How can I skip this? My program reads all URLs of a site and tries to go to the next page, and if the URL does not exist or is wrong, it terminates. I think I should control this in order to let the program carry on with the correct links.
> >
> > I wrote a testcase for this - in HTMLParserTest.java. This test proves that there is no bug in the parser. You can add this snippet and verify for yourself.
> >
> > public void testNullUrl() {
> >     HTMLParser parser;
> >     try {
> >         parser = new HTMLParser("http://someoneexisting.com");
> >         assertTrue("Should have thrown an exception!",false);
> >     }
> >     catch (HTMLParserException e) {
> >     }
> > }
> >
> > I can guess what you might be doing wrong though .. are you sure you are trying to trap HTMLParserException ?
> >
> > Regards,
> > Somik
|
From: Navid H.L. <na...@ya...> - 2002-11-25 19:28:52
|
Hi Somik,

I am still getting this error. I am sending sample code here; if you run it, you will get the error. Please see the attachment.

Thanks
Navid

--- Somik Raha <so...@ya...> wrote:
> Hi Navid,
>
> > I still need your help.
> > I am getting some exception errors in my program; it happens when it tries to open a URL to a non-existing page.
> > I tried some try/catch, but still can't catch this one.
> > It says
> > Error: HTMLParser.oprnURLConnection(): Error in opening a URL connection to http://www.somenoneexistingurl.com .......
> >
> > How can I skip this? My program reads all URLs of a site and tries to go to the next page, and if the URL does not exist or is wrong, it terminates. I think I should control this in order to let the program carry on with the correct links.
>
> I wrote a testcase for this - in HTMLParserTest.java. This test proves that there is no bug in the parser. You can add this snippet and verify for yourself.
>
> public void testNullUrl() {
>     HTMLParser parser;
>     try {
>         parser = new HTMLParser("http://someoneexisting.com");
>         assertTrue("Should have thrown an exception!",false);
>     }
>     catch (HTMLParserException e) {
>     }
> }
>
> I can guess what you might be doing wrong though .. are you sure you are trying to trap HTMLParserException ?
>
> Regards,
> Somik
|
From: Somik R. <so...@ya...> - 2002-11-23 19:17:32
|
Hi Steve,

I'm taking this to the list.

You can't use URLEncoder because URLEncoder replaces spaces with +. However, what works in the browser is %20. OTOH, there are still problems with replacing spaces with %20 when you come to mailto links.

For your case though, the parser now checks if a URL has spaces, and if so, it performs the correct encoding.

The latest codebase is in CVS.

Cheers,
Somik

----- Original Message -----
From: Stephen J. Harrington
To: Somik Raha
Sent: Thursday, November 21, 2002 10:23 AM
Subject: Another oddity

Somik,

Some of the URLs we parse have spaces in them, e.g. http://www.cnn.com/This is a test.html

I know this is non-standard, but I just parse what I am told to parse.....

My first response was to URL-encode the URL (URLEncoder.encode()) and pass it to the htmlparser. This didn't work!?!

So I then wrote a quick utility to replace all spaces with the correct URL-encoded character and this worked. Any idea why the parser would not like a URLEncoded URL?

--stephen
|
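The difference between the two approaches in this thread can be seen with the JDK alone. java.net.URLEncoder performs form encoding: it turns spaces into + and, when handed a whole URL, also escapes the : and / characters, so the result is no longer a usable address. The sketch below contrasts that with a plain space replacement. The helper names are hypothetical, and this is not the parser's own implementation, which the thread does not show.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SpaceEncodingSketch {

    // Replaces only literal spaces, leaving "://" and "/" intact.
    static String encodeSpaces(String url) {
        return url.replace(" ", "%20");
    }

    // Form encoding via the JDK; UTF-8 is always available, so the checked
    // exception is rethrown unchecked to keep callers simple.
    static String formEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.cnn.com/This is a test.html";
        System.out.println(formEncode(url));
        // prints http%3A%2F%2Fwww.cnn.com%2FThis+is+a+test.html
        System.out.println(encodeSpaces(url));
        // prints http://www.cnn.com/This%20is%20a%20test.html
    }
}
```

A plain replacement only handles spaces; other characters that are illegal in a URL would need their own escaping.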