Thread: [Htmlparser-developer] testStringBeanListener() consistently failing
From: Somik R. <so...@ya...> - 2003-02-04 17:35:39
|
Hi Derrick,

As of the last release, I'd noticed something peculiar: testStringBeanListener() would fail every once in a while. I ignored it, but after Kaarle's fix to the attribute bug today, I find that testStringBeanListener() fails every time.

I am not sure if we caused something to break (probably one of us did). Could you take a look when you are free? By the way, I did modify a few of the tests, and the part which requires a parser to hold a URL to be serializable (I didn't actually understand why that was necessary). Could that have caused the problem?

Regards,
Somik |
From: Derrick O. <Der...@ro...> - 2003-02-05 01:29:11
|
Hi back,

Sorry, I've just been lurking on the HTMLParser project lately. That's because I liked the open source experience I had with HTMLParser so much that I started my own project: http://sourceforge.net/projects/connector/

I think the problem with testStringBeanListener() (although I didn't experience it) is that the fetch of the external URL (slashdot.org) may fail. The contents of the bean's string property then don't change, which means the listener isn't fired. My guess is that you were not getting data back from slashdot.org consistently, and then not at all for a period; it should be back to OK now.

So I switched to http://htmlparser.sourceforge.org/test/example.html, which is something I said I was going to do a long time ago anyway.

The change to a local string instead of a web page for the serialization test may have missed the point, since the intent is to be able to recover the connection and parse it after rehydration, so I switched it back to a URL (the same as above). I hope that's OK; it should go faster anyway because the page is smaller. If it's not, we'll have to investigate why it isn't.

Derrick |
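To illustrate why an unchanged string property would leave the listener silent, here is a minimal sketch. It assumes the bean fires its events through java.beans.PropertyChangeSupport, which the thread does not confirm; TextBeanSketch, setStrings and countEvents are hypothetical names made up for this example and are not the real StringBean API. A failed fetch is simulated by handing the bean the same text it already holds.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

// Hypothetical stand-in for a bean with a bound "strings" property (not the real StringBean).
final class TextBeanSketch {
    private final PropertyChangeSupport support = new PropertyChangeSupport(this);
    private String strings = "";

    void addPropertyChangeListener(PropertyChangeListener listener) {
        support.addPropertyChangeListener(listener);
    }

    // Called after each simulated fetch; a failed fetch passes in the text already held.
    void setStrings(String fetched) {
        String old = strings;
        strings = fetched;
        // PropertyChangeSupport sends no event when old and new values are equal and non-null.
        support.firePropertyChange("strings", old, fetched);
    }

    // Counts how many events a listener receives for a sequence of fetch results.
    static int countEvents(String... fetchResults) {
        TextBeanSketch bean = new TextBeanSketch();
        int[] count = {0};
        bean.addPropertyChangeListener(event -> count[0]++);
        for (String result : fetchResults) {
            bean.setStrings(result);
        }
        return count[0];
    }
}
```

Under that assumption, a test that waits for a property change event would fail whenever the remote page returns nothing new, which matches the intermittent failures described above.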
From: Mr L. MA <law...@ya...> - 2003-02-05 22:11:59
|
Dear Derrick,

I am a graduate student; my name is Ling. I have just begun to use htmlparser 1.3. When I try to parse Amazon pages, most of the time it gives me a parsing error and exits.

Do you happen to know the reason? Is it because the tags in Amazon pages are not properly closed, or is it a bug in htmlparser?

Thanks a lot.

Sincerely yours,
Ling Ma |
From: Somik R. <so...@ya...> - 2003-02-05 22:30:49
|
Hi Ling Ma,

It is very hard for us to help you with a vague request for help. Can you please post your code and the complete exception that you received?

If possible, submit a testcase showing the problem (see "Communicate with testcases" at http://htmlparser.sourceforge.net/design/tests.html). Fixes are usually fast, and we try to have a new release every week.

Regards,
Somik |
From: Derrick O. <Der...@ro...> - 2003-02-05 23:07:06
|
Somik,

It's easily reproducible:

    java -jar ./release/htmlparser1_3/lib/htmlparser.jar http://www.amazon.com

yields:

    Begin Tag : br; begins at : 0; ends at : 3
    WARNING: HTMLTagParser : Encountered > inside inverted commas in line
    <td><a href="/exec/obidos/tg/browse/-/1055940/ref=gw_mafb_/"><img src="http://g-images.amazon.com/images/G/01/merchants/logos/marshall-fields-logo-20.gif" width=87 height=20 border=0 alt="Marshall Field's"></a></td>, location 205
    ^
    Automatically corrected.
    ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
    at Line 686 : null

...and then it really starts to have problems. It seems the "xxxxxx's" pattern causes grief, as it reads the > in what it thinks is a single-quoted string and 'fixes' it.

Derrick |
From: Somik R. <so...@ya...> - 2003-02-05 23:18:50
|
Oh, thanks Derrick! I was actually hoping to get a testcase that we could plug into the system.

Looks like we need to have some kind of precedence of double quotes over single quotes.

Regards,
Somik |
From: Derrick O. <Der...@ro...> - 2003-02-06 02:16:46
|
Somik,

Hmm, maybe a better paradigm than 'precedence' is to enter the in-double-quote state until encountering another double quote. Similarly, enter the in-single-quote state until encountering another single quote.

Derrick |
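The two-state idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual HTMLTagParser code: the class QuoteStateSketch and the method findTagEnd are hypothetical names, and the sketch ignores comments, scripts and other cases a real parser has to handle. While inside one kind of quote, the other kind of quote and any > are treated as ordinary characters.

```java
// Hypothetical sketch of quote-state tracking; not taken from the htmlparser sources.
final class QuoteStateSketch {
    private enum State { UNQUOTED, IN_DOUBLE_QUOTE, IN_SINGLE_QUOTE }

    // Returns the index of the '>' that closes the tag beginning at 'start', or -1 if none is found.
    static int findTagEnd(String html, int start) {
        State state = State.UNQUOTED;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case UNQUOTED:
                    if (c == '"') {
                        state = State.IN_DOUBLE_QUOTE;
                    } else if (c == '\'') {
                        state = State.IN_SINGLE_QUOTE;
                    } else if (c == '>') {
                        return i; // only an unquoted '>' ends the tag
                    }
                    break;
                case IN_DOUBLE_QUOTE:
                    // An apostrophe here is plain text; only another double quote leaves the state.
                    if (c == '"') {
                        state = State.UNQUOTED;
                    }
                    break;
                case IN_SINGLE_QUOTE:
                    // A double quote here is plain text; only another single quote leaves the state.
                    if (c == '\'') {
                        state = State.UNQUOTED;
                    }
                    break;
            }
        }
        return -1;
    }
}
```

With this approach the apostrophe in alt="Marshall Field's" never opens a single-quoted string, because the scanner is already in the in-double-quote state when it sees it, so no precedence rule between the two quote characters is needed.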
From: Somik R. <so...@ya...> - 2003-02-09 02:42:49
|
Derrick Oswald wrote:
> ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
> at Line 686 : null
>
> ...and then it really starts to have problems. It seems the "xxxxxx's"
> pattern causes grief as it reads the > in what it thinks is a single
> quoted string and 'fixes' it.

I doubt that this is the problem. It was because of the TableScanner, DivScanner, and SpanScanner. I had taken a gamble by putting them in the current set of registered scanners; there's a lot of dirty HTML out there that doesn't close its divs, spans, or tables. Instead of fixing this issue by adding more code, I want to try refactoring this logic out of the link and form scanners and reusing it.

For now, the three scanners mentioned above are not registered by default (the page gets parsed just fine after that).

Regards,
Somik |
From: Somik R. <so...@ya...> - 2003-02-09 03:27:58
|
Hi,

Here's a testcase that proves there's no bug:

    public void testTagWithQuotes() throws Exception {
        // An img tag whose double-quoted alt value contains an apostrophe.
        String testHtml =
            "<img src=\"http://g-images.amazon.com/images/G/01/merchants/logos/marshall-fields-logo-20.gif\" " +
            "width=87 height=20 border=0 alt=\"Marshall Field's\">";
        createParser(testHtml);
        // The whole tag should come back as a single node.
        parseAndAssertNodeCount(1);
        assertType("should be HTMLTag", HTMLTag.class, node[0]);
        HTMLTag tag = (HTMLTag) node[0];
        // The apostrophe must survive inside the attribute value.
        assertStringEquals("alt", "Marshall Field's", tag.getAttribute("ALT"));
        assertStringEquals(
            "html",
            "<IMG BORDER=\"0\" ALT=\"Marshall Field's\" WIDTH=\"87\" " +
            "SRC=\"http://g-images.amazon.com/images/G/01/merchants/logos/marshall-fields-logo-20.gif\" " +
            "HEIGHT=\"20\">",
            tag.toHTML()
        );
    }

This test is now in org.htmlparser.tests.parserHelperTests.TagParserTest

Regards,
Somik |