Thread: [Htmlparser-developer] Form tag should not be composite tag?
From: Mr L. MA <law...@ya...> - 2003-03-06 06:34:42
Hi all,

Do you think the form tag should not be a composite tag? As it stands, the parser cannot process a page like

http://money.cnn.com/services/glossary/a.html

which is missing one form end tag.

Ling Ma
From: Somik R. <so...@ya...> - 2003-03-06 15:05:15
Thanks very much for the sample page. My to-do list for this week:

[1] Refactor the correction logic in the link scanner into the composite scanner, so that it becomes available to all composite tags. That will solve the problem you mention.

[2] Work on Dhaval's suggestion. I have some ideas about switching off test cases that require the internet.

Regards,
Somik
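For readers following the thread, here is a minimal sketch of what correction logic in a generic composite scanner could look like. It is not HTMLParser code: the class CompositeSketch, its Composite type, and the token-list input are all hypothetical names chosen for illustration. The assumption is that a composite tag collects child tokens until its end tag, and that running out of input is treated as an implicit close rather than an error.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from HTMLParser. */
final class CompositeSketch {

    private CompositeSketch() {}

    /** A scanned composite tag: its name, its children, and whether the end tag had to be assumed. */
    static final class Composite {
        final String name;
        final List<String> children = new ArrayList<String>();
        boolean endTagMissing;

        Composite(String name) {
            this.name = name;
        }
    }

    /**
     * Scans the tokens that follow the opening tag at index start and collects
     * them as children until the matching end tag is seen. If the input runs
     * out first, the composite is closed implicitly instead of failing.
     */
    static Composite scan(List<String> tokens, int start, String name) {
        Composite tag = new Composite(name);
        String endTag = "</" + name + ">";
        for (int i = start + 1; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equalsIgnoreCase(endTag)) {
                return tag; // well-formed: the end tag was found
            }
            tag.children.add(token);
        }
        // Correction: end of page reached without an end tag.
        // Keep the children collected so far and record the repair.
        tag.endTagMissing = true;
        return tag;
    }
}
```

Calling CompositeSketch.scan on a token list whose form is never closed returns the children read so far with endTagMissing set, so the caller can decide whether to log the repair or carry on silently.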
From: Mr L. MA <law...@ya...> - 2003-03-06 17:31:08
One problem I had with the FormTag.toString() method is that the form tag should be treated like a body tag, since any other tag can be nested in it.

The ultimate htmlparser test would be the WebBase collection from Stanford. What I did was download a website with an offline browser (such as WebStripper). Running StringExtractor on the local collection gives many ParserExceptions. Sometimes I get lucky by running JTidy on a page before applying HTMLParser, sometimes not.

My focus is on using HTMLParser for text extraction, so I run into "dirty" pages on which HTMLParser gives errors. Is there a way to still get the remaining nodes even when readelements=null?

Ling Ma
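As an illustration of text extraction that does not depend on tags being balanced, the sketch below simply drops everything between '<' and '>' and keeps the rest. It is not HTMLParser's StringExtractor and does not use JTidy; LenientTextSketch and extractText are hypothetical names. It assumes plain markup only: comments, scripts, character entities, and a literal '<' in the text are not handled.

```java
/** Hypothetical fallback extractor; not part of HTMLParser. */
final class LenientTextSketch {

    private LenientTextSketch() {}

    /** Returns the text found outside <...> markup, with runs of whitespace collapsed. */
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
                text.append(' '); // a tag boundary also separates words
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                text.append(c);
            }
        }
        return text.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Because no end tag is ever looked for, a page with an unclosed form yields the same text as a well-formed one. The trade-off is that all structure is lost, so this only suits the plain text-extraction case described above.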
From: Somik R. <so...@ya...> - 2003-03-07 03:11:28
> One problem I had with FormTag.toString() method is
> that form tag should be treated as body tag since any
> other tags could be nested in it.
>
> The ultimate htmlparser test would be webase
> collection from stanford.

What you could really do to speed up our testing is to provide us with urls that cause breaks - and keep filing lots of bug reports. That would be a great help.

> Is there a way even with readelements=null I can still
> get the rest nodes?

This usually means the parser has reached the end of the page without finding a matching end tag. It is usually a fatal error. But this week, I am planning to improve robustness - systemwide. It would be good to have some nice bug reports before I start, though.

Regards,
Somik
From: Mr L. MA <law...@ya...> - 2003-03-09 23:08:54
If you have an FTP site, I can upload the pages that raise exceptions to it daily.

Ling Ma