htmlparser-developer Mailing List for HTML Parser (Page 18)
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2003-02-16 04:33:25
|
Hi Folks,

Integration release 1.3-20030215 is out. From the change log:

Integration build 1.3 - 20030215
--------------------------------
[1] Added HtmlScanner
[2] Removed Table, Div and Span from registry of scanners, can still be added individually
[3] Reference test directory of project home page to maybe cure some sporadic errors in BeanTest.
[4] Added setAttribute method
[5] Cleaned up HTMLNode interface (removed TYPE, getType() and print())

With HtmlScanner, you can now get the entire page, sort of a DOM model, in an Html object. Useful for testing.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-02-16 04:20:51
|
> Question: How I can extract links of a page as:
>
> http://www.google.com/search?q=universe

Don't know why this is happening. Derrick?

Regards,
Somik |
From: agente007 <e-a...@ex...> - 2003-02-13 21:22:56
|
Hi,

Question: how can I extract the links from a page such as:

http://www.google.com/search?q=universe

The parser can't read that URL!

JJ |
From: Somik R. <so...@ya...> - 2003-02-09 03:27:58
|
Hi,

Here's a testcase that proves there's no bug:

    public void testTagWithQuotes() throws Exception {
        String testHtml =
            "<img src=\"http://g-images.amazon.com/images/G/01/merchants/logos/marshall-fields-logo-20.gif\" width=87 height=20 border=0 alt=\"Marshall Field's\">";
        createParser(testHtml);
        parseAndAssertNodeCount(1);
        assertType("should be HTMLTag", HTMLTag.class, node[0]);
        HTMLTag tag = (HTMLTag) node[0];
        assertStringEquals("alt", "Marshall Field's", tag.getAttribute("ALT"));
        assertStringEquals(
            "html",
            "<IMG BORDER=\"0\" ALT=\"Marshall Field's\" WIDTH=\"87\" SRC=\"http://g-images.amazon.com/images/G/01/merchants/logos/marshall-fields-logo-20.gif\" HEIGHT=\"20\">",
            tag.toHTML()
        );
    }

This test is now in org.htmlparser.tests.parserHelperTests.TagParserTest

Regards,
Somik

----- Original Message -----
From: "Somik Raha" <so...@ya...>
To: <htm...@li...>
Sent: Saturday, February 08, 2003 6:44 PM
Subject: Re: [Htmlparser-developer] testStringBeanListener() consistently failing

> Derrick Oswald wrote:
> > ERROR: HTMLReader.readElement() : Error occurred while trying to
> > decipher the tag using scanners
> > at Line 686 : null
> >
> > ...and then it really starts to have problems. It seems the "xxxxxx's"
> > pattern causes grief as it reads the > in what it thinks is a single
> > quoted string and 'fixes' it.
>
> I doubt that this is the problem.. It was bcos of the TableScanner, Div
> Scanner, and Span scanners. I had taken a gamble by putting them in the
> current set of registered scanners - there's a lot of dirty html out there
> that don't close the div's or span's or table's. Instead of fixing this
> issue by adding more code, I want to try refactoring this logic from the
> link and form scanners and reuse it.
>
> For now, the above mentioned three scanners are not registered by default
> (the page gets parsed just fine after that).
>
> Regards,
> Somik |
From: Somik R. <so...@ya...> - 2003-02-09 02:42:49
|
Derrick Oswald wrote:
> ERROR: HTMLReader.readElement() : Error occurred while trying to
> decipher the tag using scanners
> at Line 686 : null
>
> ...and then it really starts to have problems. It seems the "xxxxxx's"
> pattern causes grief as it reads the > in what it thinks is a single
> quoted string and 'fixes' it.

I doubt that this is the problem. It was because of the TableScanner, DivScanner, and SpanScanner. I had taken a gamble by putting them in the current set of registered scanners: there's a lot of dirty HTML out there that doesn't close its divs, spans, or tables. Instead of fixing this issue by adding more code, I want to try refactoring this logic out of the link and form scanners and reusing it.

For now, the three scanners mentioned above are not registered by default (the page gets parsed just fine after that).

Regards,
Somik |
From: Brad P. <phi...@ho...> - 2003-02-07 07:04:52
|
Hi,

I ran the Robot test program with a depth of three. It got an error and crashed when it tried to crawl into a nonexistent website. How can I prevent this?

Here is another error message I received. I don't understand the cause of this error.

BEGIN ERROR
-----------
org.htmlparser.util.HTMLParserException: HTMLParserException at crawl(3);
org.htmlparser.util.HTMLParserException: Unexpected Exception occurred while reading http://www.brookscole.com, in nextHTMLNode
at Line 704 : <a href="javascript:transport();"><script>document.write("<img name=\"featureslider\" src=\""+features[randomImgNum(features.length-1)]+"\" width=\"215\" height=\"40\" border=\"0\">")</script></a></td>
Previous Line 703 : <td bgcolor="D7B78E" align="center" valign="middle" class="bold12" width="215">;
...
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
-----------
END ERROR

If anyone can offer any suggestions, I would appreciate it. Thanks.

-Brad |
From: James M. <jmo...@uc...> - 2003-02-06 21:26:39
|
Hello,

I found a bug in the current CVS version of the HTML Parser; the code below is the fix. It is a minor code change:

    // line 86 of StringParser in method find(...
    if (ignoreStateMode && (ch=='\'' || ch=='"')) {
        if (state==PARSE_IGNORE_STATE) state=PARSE_HAS_BEGUN_STATE;
        else {
            //-----> make sure we're not testing outside the length of input.
            if (i+1 < input.length() && input.charAt(i+1)=='<')
                state = PARSE_IGNORE_STATE;
        }
    }

-------------------------------------------------------------------

When parsing the HTML below, an index out of bounds exception gets thrown when the parser hits the last single quote in the word 'hello'.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <title>Untitled Document</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    </head>
    <script language="JavaScript" type="text/JavaScript">
    // if this fails, output a 'hello'
    if (true)
    {
    //something good...
    }
    </script>
    </body>
    </html>

James Moliere |
From: Somik R. <so...@ya...> - 2003-02-06 19:35:02
|
> public class StringParser {
> private final static int BEFORE_PARSE_BEGINS_STATE=0;

This is the state in which the StringParser begins its life.

> private final static int PARSE_HAS_BEGUN_STATE=1;

When the first valid character has been encountered, the machine enters this state.

> private final static int PARSE_COMPLETED_STATE=2;

When an invalid character has been encountered (one which cannot be a part of the string node), the machine enters a parse-completed state and returns.

> private final static int PARSE_IGNORE_STATE=3;

When quotes are encountered, the machine ignores characters which would normally throw it to PARSE_COMPLETED_STATE.

Regards,
Somik |
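The four states described in the reply above can be pictured as a small state machine. The following is only a minimal sketch under stated assumptions, not the actual StringParser code: the class name StringStateSketch and the method readText are hypothetical, only double quotes toggle the ignore state, and '<' is taken to be the only character that cannot be part of a string node.

```java
final class StringStateSketch {
    // Same state names and values as quoted in the thread.
    static final int BEFORE_PARSE_BEGINS_STATE = 0;
    static final int PARSE_HAS_BEGUN_STATE = 1;
    static final int PARSE_COMPLETED_STATE = 2;
    static final int PARSE_IGNORE_STATE = 3;

    /**
     * Hypothetical helper: returns the text run that starts at 'start'
     * and stops at the first '<' found outside double quotes.
     */
    static String readText(String input, int start) {
        int state = BEFORE_PARSE_BEGINS_STATE;
        StringBuilder text = new StringBuilder();
        for (int i = start; i < input.length() && state != PARSE_COMPLETED_STATE; i++) {
            char ch = input.charAt(i);
            if (ch == '"') {
                // A quote toggles between the ignore state and normal parsing.
                state = (state == PARSE_IGNORE_STATE) ? PARSE_HAS_BEGUN_STATE : PARSE_IGNORE_STATE;
                text.append(ch);
            } else if (ch == '<' && state != PARSE_IGNORE_STATE) {
                // An invalid character outside quotes ends the string node.
                state = PARSE_COMPLETED_STATE;
            } else {
                // Any other character is valid; the first one starts the parse.
                if (state == BEFORE_PARSE_BEGINS_STATE) {
                    state = PARSE_HAS_BEGUN_STATE;
                }
                text.append(ch);
            }
        }
        return text.toString();
    }
}
```

In this sketch a '<' inside quotes is kept as text, while the first '<' outside quotes completes the parse; the real class may differ in which characters it treats as invalid and in how it handles single quotes.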
From: James M. <jmo...@uc...> - 2003-02-06 19:11:09
|
Greetings,

Can someone elaborate on the parse states below? It would help me very much with the code. Thanks!

    public class StringParser {
        private final static int BEFORE_PARSE_BEGINS_STATE=0;
        private final static int PARSE_HAS_BEGUN_STATE=1;
        private final static int PARSE_COMPLETED_STATE=2;
        private final static int PARSE_IGNORE_STATE=3;

James Moliere |
From: Derrick O. <Der...@ro...> - 2003-02-06 02:16:46
|
Somik,

Hmmm, maybe a better paradigm than 'precedence' is to enter the in-double-quote state until encountering another double quote. Similarly, enter the in-single-quote state until encountering another single quote.

Derrick

Somik Raha wrote:
>Oh, thanks Derrick! I was actually hoping to get a
>testcase that we could plug into the system.
>
>Looks like we need to have some kind of precedence of
>double quotes over single quotes..
>
>Regards,
>Somik |
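The in-quote states suggested above can be sketched as a scan for the closing '>' of a tag that remembers which quote character opened the current attribute value. This is a minimal illustration, not the parser's code: QuoteAwareTagEnd and findTagEnd are hypothetical names, and the sketch assumes attribute values are delimited only by matching single or double quotes.

```java
final class QuoteAwareTagEnd {
    /**
     * Hypothetical helper: returns the index of the '>' that closes the tag
     * beginning at 'tagStart', or -1 if the tag is never closed.
     */
    static int findTagEnd(String html, int tagStart) {
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = tagStart; i < html.length(); i++) {
            char ch = html.charAt(i);
            if (quote != 0) {
                // Inside a quoted value: only the same quote character ends it,
                // so an apostrophe inside "..." is ordinary text.
                if (ch == quote) {
                    quote = 0;
                }
            } else if (ch == '"' || ch == '\'') {
                quote = ch; // enter the in-double-quote or in-single-quote state
            } else if (ch == '>') {
                return i;   // a '>' outside any quoted value closes the tag
            }
        }
        return -1;
    }
}
```

With the attribute alt="Marshall Field's" from this thread, the apostrophe is read while in the in-double-quote state and so does not open a single-quoted string; the scan reaches the real closing '>' of the tag.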
From: Somik R. <so...@ya...> - 2003-02-05 23:18:50
|
Oh, thanks Derrick! I was actually hoping to get a testcase that we could plug into the system.

Looks like we need to have some kind of precedence of double quotes over single quotes..

Regards,
Somik

--- Derrick Oswald <Der...@ro...> wrote:
> Somik,
>
> It's easily reproducible:
>
> java -jar ./release/htmlparser1_3/lib/htmlparser.jar http://www.amazon.com
>
> yields:
>
> Begin Tag : br; begins at : 0; ends at : 3
> WARNING: HTMLTagParser : Encountered > inside inverted commas in line
> <td><a href="/exec/obidos/tg/browse/-/1055940/ref=gw_mafb_/"><img
> src="http://g-images.amazon.com/images/G/01/merchants/logos/marshall-fields-logo-20.gif"
> width=87 height=20 border=0 alt="Marshall Field's"></a></td>, location 205
> ^
> Automatically corrected.
> ERROR: HTMLReader.readElement() : Error occurred while trying to
> decipher the tag using scanners
> at Line 686 : null
>
> ...and then it really starts to have problems. It seems the "xxxxxx's"
> pattern causes grief as it reads the > in what it thinks is a single
> quoted string and 'fixes' it.
>
> Derrick |
From: Derrick O. <Der...@ro...> - 2003-02-05 23:07:06
|
Somik,

It's easily reproducible:

    java -jar ./release/htmlparser1_3/lib/htmlparser.jar http://www.amazon.com

yields:

    Begin Tag : br; begins at : 0; ends at : 3
    WARNING: HTMLTagParser : Encountered > inside inverted commas in line
    <td><a href="/exec/obidos/tg/browse/-/1055940/ref=gw_mafb_/"><img src="http://g-images.amazon.com/images/G/01/merchants/logos/marshall-fields-logo-20.gif" width=87 height=20 border=0 alt="Marshall Field's"></a></td>, location 205
    ^
    Automatically corrected.
    ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
    at Line 686 : null

...and then it really starts to have problems. It seems the "xxxxxx's" pattern causes grief as it reads the > in what it thinks is a single quoted string and 'fixes' it.

Derrick

Somik Raha wrote:
>Hi Ling Ma
> It is very hard for us to help you with a vague
>request for help. Can you pls post your code, and the
>complete exception that you received ?
>
> If possible, submit a testcase showing the problem
>(check
>http://htmlparser.sourceforge.net/design/tests.html -
>Communicate with testcases). Fixes are usually fast,
>and we try to have a new release every week.
>
>Regards,
>Somik
>--- Mr LING MA <law...@ya...> wrote:
>
>>Dear Derick:
>>I am a graduate student, my name is Ling.
>>I just begin to use htmlparser 1.3, I try to parse
>>amazon pages, most of the time it gives me parsing
>>error and exited.
>>
>>Do you happen to know what was the reason? Is it
>>just
>>because amazon pages tags are not well closed? or
>>some
>>bugs in htmlparser?
>>
>>Thanks a lot
>>
>>Sincerely yours
>>Ling Ma |
From: Somik R. <so...@ya...> - 2003-02-05 22:30:49
|
Hi Ling Ma,

It is very hard for us to help you with a vague request for help. Can you please post your code and the complete exception that you received?

If possible, submit a testcase showing the problem (check http://htmlparser.sourceforge.net/design/tests.html - Communicate with testcases). Fixes are usually fast, and we try to have a new release every week.

Regards,
Somik

--- Mr LING MA <law...@ya...> wrote:
> Dear Derick:
> I am a graduate student, my name is Ling.
> I just begin to use htmlparser 1.3, I try to parse
> amazon pages, most of the time it gives me parsing
> error and exited.
>
> Do you happen to know what was the reason? Is it
> just
> because amazon pages tags are not well closed? or
> some
> bugs in htmlparser?
>
> Thanks a lot
>
> Sincerely yours
> Ling Ma |
From: Mr L. MA <law...@ya...> - 2003-02-05 22:11:59
|
Dear Derrick:

I am a graduate student, my name is Ling. I have just begun using htmlparser 1.3. When I try to parse Amazon pages, most of the time it gives me a parsing error and exits.

Do you happen to know the reason? Is it just because the tags on Amazon pages are not properly closed, or is it a bug in htmlparser?

Thanks a lot

Sincerely yours
Ling Ma |
From: Derrick O. <Der...@ro...> - 2003-02-05 01:29:11
|
Hi back,

Sorry, I've just been lurking the HTMLParser project lately. That's because I liked the open source experience I had with HTMLParser so much I started my own project: http://sourceforge.net/projects/connector/

I think the problem (although I didn't experience it) with testStringBeanListener() is that the fetch of the external URL (slashdot.org) may fail, and then the contents of the bean's string property won't change, meaning the listener isn't fired (my guess is you were not getting data back from slashdot.org consistently and then not at all for a period; it should be back to OK now).

So I switched to http://htmlparser.sourceforge.org/test/example.html, which is something I said I was going to do a long time ago anyway.

The change to a local string instead of a web page for the serialization test may have missed the point, since the intent is to be able to recover the connection and parse it after rehydration, so I switched it back to a URL (the same as above). I hope that's OK; it should go faster anyway because the page is smaller. If it's not, we'll have to investigate why it isn't.

Derrick

Somik Raha wrote:
>Hi Derrick,
> As of the last release, I'd noticed something
>peculiar - testStringBeanListener() would fail every
>once in a while. I ignored it, but then, after
>Kaarle's fix to the attribute bug today, I find that
>testStringBeanListener() fails every time.
>
> I am not sure if we caused something to break
>(probably one of us did). Could you take a look when
>you are free ? Bytway, I did modify a few of the
>tests, and the part which requires a parser to hold a
>url to be serializable (I didn't actually understand
>why that was necessary). Could that have caused the
>problem ?
>
>Regards,
>Somik |
From: Somik R. <so...@ya...> - 2003-02-04 17:35:39
|
Hi Derrick,

As of the last release, I'd noticed something peculiar - testStringBeanListener() would fail every once in a while. I ignored it, but then, after Kaarle's fix to the attribute bug today, I find that testStringBeanListener() fails every time.

I am not sure if we caused something to break (probably one of us did). Could you take a look when you are free? By the way, I did modify a few of the tests, and the part which requires a parser to hold a URL to be serializable (I didn't actually understand why that was necessary). Could that have caused the problem?

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-02-03 07:25:00
|
Hi Folks,

James Moliere of Intelesis has joined our dev team. Here's his bio:

"I'm a computer software engineer for Intelesis (www.intelesis.org) with 6 years working experience and a self-motivated interest in programming for about 10 years. I have a biochemistry degree from UCSD and am currently working on a Masters from San Diego State University in Computer Science. I primarily develop in Java but will do Macromedia Flash ActionScript and C/C++. I have done some development of an Ada compiler in C using Lex and Yacc -- a class project.

Htmlparser is planned to be used in a portal. I came across this tool from a colleague who found it using a search engine -- we need an HTML parser for re-writing URL links. We found this tool to be the best because of its SAX-like parsing and Java-based development -- open source helped a lot too!

I'm a bit under the gun to get this URL re-writer to work properly and I certainly will help as much as possible in the near term to get this parser to work properly. If we can get this parser to be used in this project, I'll certainly be more available in the long term."

James-> Welcome to the htmlparser-developer team.

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-02-03 07:21:35
|
Hi Folks,

Integration release 1.3-20030202 is out. From the change log:

Integration build 1.3 - 20030202
--------------------------------
[1] Renamed HTMLCompositeTagScanner to CompositeTagScanner
[2] Renamed HTMLTag.getParameter() to HTMLTag.getAttribute()
[3] Added TableScanner
[4] Added HtmlPage
[5] Added SpanScanner
[6] Added assertType in HTMLParserTestCase
[7] Added TextExtractingVisitor
[8] Added non-recursive visiting (flag in HTMLVisitor)
[9] Added DivScanner
[10] Modified collectInto to use NodeList
[11] Added collectInto(NodeList, Class)
[12] CompositeTagScanner can handle single xml-like tags e.g. <div/>
[13] Fixed bug 678969 - StringParser was not going into ignore mode on encountering double quotes
[14] Added LabelScanner

Dhaval Udani has contributed LabelScanner. (He has also contributed a BodyScanner, which will make it into next week's release.)

We've shipped this time with two tests failing - both tests replicate the same bug - 677874 - "mishandling of double quotes". I made this release for three reasons:

[1] This bug is not a new addition but was always there - it's a deep bug in AttributeParser (previously known as ParameterParser) - and it might take a little time to fix
[2] There are a lot of new additions which we'd like to get out there - we finally have a table scanner!
[3] Important bug fixes have been made which further stabilize the parser's performance (and at least one user was desperately waiting for the fix)

Notable addition - HTMLNode.collectInto() has a new mode of operation - using the class type. Suppose you need to get to a node (e.g. images) that is within a composite (like a table), you can do:

    NodeList imageList = new NodeList();
    tableTag.collectInto(imageList,HTMLImageTag.class);

You can also do this directly from the parser - like so:

    HTMLNode node [] = parser.extractAllNodesThatAre(HTMLLinkTag.class);

And here's some more news - we now have our own wiki (finally!). Go to http://htmlparser.sourceforge.net/docs/

This is a free-for-all wiki. It is a little too much for me to write the entire documentation on my own - so I'd highly appreciate it if the user/developer community pitches in - that would be a great benefit for the community. The current documentation on the site is already obsolete, and I am going to take it down soon (hopefully by the next release).

Regards,
Somik |
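The class-based collection mode announced above can be pictured with a small stand-in tree. This is a minimal sketch, not htmlparser code: SketchNode, SketchTable and SketchImage are hypothetical classes, and a java.util.List takes the place of the library's NodeList. It shows only the idea of walking a composite and keeping the nodes of a requested type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed node that may contain child nodes.
class SketchNode {
    final List<SketchNode> children = new ArrayList<>();

    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }

    // Adds this node, and every descendant, to 'out' when it is an instance of 'type'.
    void collectInto(List<SketchNode> out, Class<? extends SketchNode> type) {
        if (type.isInstance(this)) {
            out.add(this);
        }
        for (SketchNode child : children) {
            child.collectInto(out, type);
        }
    }
}

// Hypothetical node types used only for this illustration.
class SketchImage extends SketchNode {}

class SketchTable extends SketchNode {}
```

For example, calling collectInto(found, SketchImage.class) on a SketchTable that holds one image directly and another inside a nested node fills found with both images, in document order.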
From: <dha...@or...> - 2003-02-02 09:19:13
|
Hi, I was just having a look at the HTMLFormTag. It consists of a vector for input and textarea tags. However it has missed out on the select tags. I think that too needs to be added. Regards, Dhaval Udani Senior Analyst M-Line, QPEG OrbiTech Solutions Ltd. +91-22-28290019 Extn. 1457 |
From: <dha...@or...> - 2003-02-02 08:57:59
|
Hi all,

How about having a main method in all test cases with a default UI, so that each test case program can be run independently without the need to run all the test cases at once? Also, in case someone is not connected to the net (pretty common here in India!!!), he could avoid running those test cases. Typically a developer is making some changes and is essentially only interested in seeing whether his changes/enhancements are working or not, so initially it would be nice for him to be able to test only a particular test case rather than the whole suite.

So is it possible to just add the following in all the test cases:

    public static TestSuite suite() {
        return new TestSuite(<class-name>.class);
    }

    public static void main(String[] args) {
        new junit.awtui.TestRunner().start(new String[] {<class-name>.class.getName()});
    }

Regards,
Dhaval Udani
Senior Analyst
M-Line, QPEG
OrbiTech Solutions Ltd.
+91-22-28290019 Extn. 1457 |
From: agente007 <e-a...@ex...> - 2003-01-28 12:33:41
|
Hello.

I previously reported a problem with the Spanish language. That problem is still unsolved: the program does not produce correct output, because accented characters are not written correctly. For example, "máximo" (with an accent on the a) is written as "mximo", which is incorrect.

My Windows XP regional settings are Spanish (Traditional) and Spanish (International).

JJ |
From: Mr L. MA <law...@ya...> - 2003-01-28 05:30:58
|
Dear Somik:

I am a graduate student trying to use HTMLParser. I ran into a remark parsing bug when using version 1.3 with the StringExtractor program. I posted the HTML page as well as the bug report on SourceForge. It seems the remark parser cannot parse a very long remark: it always truncates the remark at about 395 characters.

Thanks
Ling Ma |
From: Somik R. <so...@ya...> - 2003-01-25 23:41:45
|
Hi Folks,

The next integration release is out. From the change log:

Integration build 1.3 - 20030125
--------------------------------
[1] HTMLCompositeTagScanner now takes an array of match strings
[2] toHTML(HTMLRenderer ...) was replaced by UrlModifyingVisitor
[3] Fixed NullPointerException in HTMLScriptTag.toString()
[4] Fixed bug in HTMLStringNode (breaking up empty lines into separate string nodes)
[5] Fixed thread safety issue and introduced parser helpers
[6] Fixed bug 664404 - spewing incorrect line breaks in HTMLRemarkNode.toHTML()
[7] Added assertXmlEquals() in HTMLParserTestCase
[8] Added better option tag support
[9] Replaced instanceof with getType() mechanism - much faster
[10] Incorporated NodeList instead of Vector in HTMLCompositeTag
[11] Added HTMLRemarkNode support in Visitor
[12] Fixed bug 673379 (infinite loop on encountering links like ".someurl.html")

Among the notable additions is assertXmlEquals() - this is present to enable us to perform XML testing. This method actually creates the parser and performs a node for node comparison.

Reconstruction has improved a lot - you will find that the parser now does not add unnecessary line breaks - and preserves the HTML as it came in.

One significant addition is the use of NodeList instead of Vector. The integration has been performed, so there should be a significant performance increase - check http://htmlparser.sourceforge.net/performance/simpleEnumerationPerformance.html

In the coming week, we will be setting up a wiki on SourceForge, where we can collaboratively create documentation - hopefully that will finally take the burden out of the documentation process.

Regards,
Somik |
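For readers who have not seen the change, here is a rough sketch of why a dedicated node list might outperform java.util.Vector. Vector synchronizes its methods and, before generics, returned Object, so every access paid for a lock and a cast. The names SimpleNode and SimpleNodeList below are hypothetical and are not the library's NodeList; this only illustrates the general idea of a typed, unsynchronized, array-backed list.

```java
import java.util.Arrays;

// Hypothetical stand-in for a parser node; not the library's HTMLNode.
interface SimpleNode {
    String toHtml();
}

// Minimal array-backed list: typed and unsynchronized,
// so element access needs neither a cast nor a lock.
final class SimpleNodeList {
    private SimpleNode[] nodes = new SimpleNode[8];
    private int size;

    // Appends a node, doubling the backing array when it is full.
    void add(SimpleNode node) {
        if (size == nodes.length) {
            nodes = Arrays.copyOf(nodes, size * 2);
        }
        nodes[size++] = node;
    }

    // Returns the node at the given position, already typed as SimpleNode.
    SimpleNode elementAt(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return nodes[index];
    }

    int size() {
        return size;
    }
}
```

The trade-off in a design like this is that the list is not safe to share between threads without external locking.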
From: Somik R. <so...@ya...> - 2003-01-21 21:51:31
|
Hi Derrick, > But what problem are you trying to solve? My sole intention was to get a performance improvement if "instanceof" turned out to be guilty of being a bottleneck. But it looks like the performance of instanceof is really good - so I'm now thinking of sticking to that (as you suggested as well). Regards, Somik |
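The two dispatch styles compared in this thread can be sketched roughly as follows. The class names (DemoNode, DemoStringNode, DemoRemarkNode, NodeCounter) are hypothetical and are not the HTMLParser classes. The sketch only shows that both styles select the same nodes; it makes no claim about which one is faster.

```java
import java.util.List;

// Hypothetical node hierarchy for illustration; not the library's HTMLNode classes.
abstract class DemoNode {
    static final int STRING = 1;
    static final int REMARK = 2;

    // Integer tag identifying the concrete node kind.
    abstract int getType();
}

class DemoStringNode extends DemoNode {
    int getType() { return STRING; }
}

class DemoRemarkNode extends DemoNode {
    int getType() { return REMARK; }
}

final class NodeCounter {
    // Counts string nodes by testing each node's runtime class.
    static int countWithInstanceof(List<DemoNode> nodes) {
        int count = 0;
        for (DemoNode node : nodes) {
            if (node instanceof DemoStringNode) {
                count++;
            }
        }
        return count;
    }

    // Counts string nodes by comparing the integer tag from getType().
    static int countWithGetType(List<DemoNode> nodes) {
        int count = 0;
        for (DemoNode node : nodes) {
            if (node.getType() == DemoNode.STRING) {
                count++;
            }
        }
        return count;
    }
}
```

One design difference worth noting: instanceof also matches subclasses of DemoStringNode, while the tag comparison matches only nodes that report exactly the STRING tag.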
From: Joshua K. <jo...@in...> - 2003-01-20 17:19:13
|
> Similarly, rationalization of toString(), getPlainTextString(), > getHTML() and any required new methods to return appropriate renditions > of the text within the node could eliminate the instanceof operations in > StringExtractor and elsewhere. Speaking of the StringExtractor, are there no tests for that class? I couldn't find any. thanks, jk |