Re: [Htmlparser-developer] charset
From: Somik R. <so...@ya...> - 2002-12-24 02:00:26
Hi Derrick,

--- Derrick Oswald <Der...@ro...> wrote:
> g'day,
>
> I've dropped code to handle charset parameters, but this raised a
> couple of issues. First an explanation of the changes.
>
> HTMLParser constructors were added that take a URLConnection, so that
> the input stream could be reacquired as needed. For consistency the
> 'from file' handling is identical to a URL now.
>
> The HTTP header is now examined for the charset parameter and the
> input stream is converted to a reader with that encoding. However, a
> lot of sites lie. They don't specify the charset in the HTTP header,
> even though the HTML is encoded with a non-standard encoding. So, we
> have to pre-read the header portion (meta tags) and restart if the
> charset in the meta tags is different than in the HTTP header.
>
> Ya, ya, pre-reading is fraught with peril, and that's what I'm on
> about here...
>
> 1) Why aren't the standard scanners registered by the HTMLParser
> constructor? It would seem that proper parsing of HTML would require
> these always be registered. Unless there's a set for HTML 3.2 and one
> for HTML 4.1 or something. Some of the tests rely on this 'empty
> scanner list' behaviour to return the expected number of nodes.
>
> 2) Mark and reset are still supported somewhat. If you mark before
> calling elements() and reset() after exhausting the input stream, all
> is well. But mark() in the midst of parsing is useless and dangerous
> because of the BufferedReader. Perhaps the documentation should
> reflect this, or a way to do it properly worked out.
>
> Aside notes:
>
> Nobody's checking that the contents of the document are "text/html",
> perhaps this check should be added.
>
> These changes still leave the 'Reader' constructors as a separate
> group which is used by a majority of the test cases. The 'new testing
> framework' should try to use the same code path as most of the
> real-world use cases. Perhaps, test cases could be loaded from
> htmlparser.sourceforge.net like HTMLParserTest.testURLWithSpaces()
> does.
>
> Derrick
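
For illustration only, the "check the HTTP header, then pre-read the meta tags and restart if they disagree" decision described in the quoted message could be sketched roughly as below. This is not the code that was checked in: the class CharsetSniffer and its methods are hypothetical names, and the sketch assumes the Content-Type value has already been obtained (for example from URLConnection.getContentType()) and that the start of the document has been pre-read into a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
public class CharsetSniffer {
    // Matches charset=value, optionally quoted, ignoring case.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([^\\s\"';>]+)", Pattern.CASE_INSENSITIVE);

    // Matches one whole <meta ...> tag so the charset search stays inside it.
    private static final Pattern META_TAG = Pattern.compile(
        "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Charset named in a Content-Type value, or null if none is given.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Charset named in the first meta tag that carries one, or null.
    static String fromMetaTags(String preReadHead) {
        Matcher tag = META_TAG.matcher(preReadHead);
        while (tag.find()) {
            String found = fromContentType(tag.group());
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // True when the meta tags name a charset that differs from the header's,
    // i.e. the stream would have to be reopened with the new encoding.
    static boolean needsRestart(String headerCharset, String metaCharset) {
        return metaCharset != null
            && !metaCharset.equalsIgnoreCase(headerCharset);
    }
}
```

A caller would compare fromContentType(...) on the header with fromMetaTags(...) on the pre-read text and, when needsRestart(...) is true, reacquire the input stream and build a new reader with the meta charset. A real implementation would probably also compare against whatever default encoding is assumed when the header names none.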