Re: [Htmlparser-developer] charset
From: Somik R. <so...@ya...> - 2002-12-24 02:05:00
Hi Derrick,

Great going!

> 1) Why aren't the standard scanners registered by the HTMLParser
> constructor? It would seem that proper parsing of HTML would require
> these always be registered. Unless there's a set for HTML 3.2 and one
> for HTML 4.1 or something. Some of the tests rely on this 'empty scanner
> list' behaviour to return the expected number of nodes.

That is because we allow folks to custom-build their own parser. For example, if you are only interested in text content, you wouldn't wish to register any of the scanners at all. If you are interested only in links, you would register only the link scanner. Passing control to a scanner implies extra processing time (sometimes twice that of the no-scanner scenario).

> 2) Mark and reset are still supported somewhat. If you mark before
> calling elements() and reset() after exhausting the input stream, all is
> well. But mark() in the midst of parsing is useless and dangerous
> because of the BufferedReader. Perhaps the documentation should reflect
> this, or a way to do it properly worked out.

Hmm.. I thought I'd added this doc in. If it's not enough, feel free to modify.

> Aside notes:
>
> Nobody's checking that the contents of the document are "text/html",
> perhaps this check should be added.
>
> These changes still leave the 'Reader' constructors as a separate group
> which is used by a majority of the test cases. The 'new testing
> framework' should try to use the same code path as most of the
> real-world use cases. Perhaps, test cases could be loaded from
> htmlparser.sourceforge.net like HTMLParserTest.testURLWithSpaces() does.

That would cause a serious performance hit while running 284 tests. I'd simply stop testing if it took too long to run the tests.

Regards,
Somik
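
For anyone following the mark/reset point in (2), here is a minimal sketch using only java.io, not the parser itself. The class and method names (MarkResetSketch, drain, readTwice, markUnderlyingMidRead) are made up for illustration, and the second method shows only one plausible reading of why a mid-parse mark() is risky: the BufferedReader has already pulled more from the underlying reader than the caller has consumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class; it does not use any HTMLParser API.
class MarkResetSketch {

    /** Reads every remaining character from the given reader. */
    static String drain(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    /** Safe pattern: mark before reading anything, exhaust, then reset. */
    static String[] readTwice(String html) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader(html));
        // The read-ahead limit must exceed the number of characters read
        // before reset(), otherwise the mark may be invalidated.
        in.mark(html.length() + 1);
        String first = drain(in);
        in.reset();
        String second = drain(in);
        return new String[] { first, second };
    }

    /**
     * Risky pattern (assumed scenario): mark the underlying reader after a
     * BufferedReader wrapped around it has started reading. For input
     * smaller than the default buffer, the buffer has already consumed
     * everything, so the mark lands at the end of the input.
     */
    static String markUnderlyingMidRead(String html) throws IOException {
        StringReader raw = new StringReader(html);
        BufferedReader buffered = new BufferedReader(raw);
        buffered.read();          // caller has seen one character only
        raw.mark(html.length());  // but raw is already positioned at the end
        drain(buffered);
        raw.reset();
        return drain(raw);        // whatever raw still has after the reset
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>Hello</body></html>";
        String[] passes = readTwice(html);
        System.out.println("same content twice: " + passes[0].equals(passes[1]));
        System.out.println("after mid-read mark: '" + markUnderlyingMidRead(html) + "'");
    }
}
```

In the second method the reset recovers nothing, even though only one character had been handed to the caller when the mark was set. That is the kind of surprise the documentation would need to warn about.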