[Htmlparser-developer] charset
From: Derrick O. <Der...@ro...> - 2002-12-23 20:47:40
G'day,

I've dropped code to handle charset parameters, but this raised a couple of issues. First, an explanation of the changes.

HTMLParser constructors were added that take a URLConnection, so that the input stream can be reacquired as needed. For consistency, the 'from file' handling is now identical to that of a URL. The HTTP header is now examined for the charset parameter, and the input stream is converted to a reader with that encoding.

However, a lot of sites lie. They don't specify the charset in the HTTP header, even though the HTML is encoded with a non-standard encoding. So we have to pre-read the header portion (meta tags) and restart if the charset in the meta tags is different from the one in the HTTP header. Ya, ya, pre-reading is fraught with peril, and that's what I'm on about here...

1) Why aren't the standard scanners registered by the HTMLParser constructor? It would seem that proper parsing of HTML would require that these always be registered, unless there's a set for HTML 3.2 and one for HTML 4.01 or something. Some of the tests rely on this 'empty scanner list' behaviour to return the expected number of nodes.

2) Mark and reset are still supported, somewhat. If you mark() before calling elements() and reset() after exhausting the input stream, all is well. But mark() in the midst of parsing is useless and dangerous because of the BufferedReader. Perhaps the documentation should reflect this, or a way to do it properly should be worked out.

Aside notes:

Nobody's checking that the contents of the document are "text/html"; perhaps this check should be added.

These changes still leave the 'Reader' constructors as a separate group, which is used by a majority of the test cases. The 'new testing framework' should try to use the same code path as most of the real-world use cases. Perhaps test cases could be loaded from htmlparser.sourceforge.net, like HTMLParserTest.testURLWithSpaces() does.

Derrick
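
For readers of the archive, here is a minimal sketch of the header-then-meta charset logic described in the message above. It is not the code that was committed to HTMLParser. The class name CharsetSketch, the helpers getCharset(), needsRestart() and openReader(), and the ISO-8859-1 fallback are all assumptions made for illustration, and only standard JDK classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical illustration only; none of these names come from HTMLParser.
class CharsetSketch {
    // Assumed fallback when no usable charset parameter is present.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Returns the charset parameter of a Content-Type value, or the fallback. */
    public static String getCharset(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                try {
                    if (!value.isEmpty() && Charset.isSupported(value)) {
                        return value;
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name: fall through to the default.
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    /** True when the meta-tag charset names a different encoding than the header's. */
    public static boolean needsRestart(String headerCharset, String metaCharset) {
        // Charset.forName resolves aliases, so "utf-8" and "UTF8" compare equal.
        return !Charset.forName(headerCharset).equals(Charset.forName(metaCharset));
    }

    /** Wraps a byte stream in a Reader using the charset found in contentType. */
    public static Reader openReader(InputStream in, String contentType) {
        return new InputStreamReader(in, Charset.forName(getCharset(contentType)));
    }

    public static void main(String[] args) throws IOException {
        String header = "text/html";                  // server sent no charset
        String meta = "text/html; charset=UTF-8";     // but the meta tag names one
        String fromHeader = getCharset(header);
        String fromMeta = getCharset(meta);
        System.out.println("header: " + fromHeader + ", meta: " + fromMeta);
        if (needsRestart(fromHeader, fromMeta)) {
            // A real parser would reacquire the stream here (e.g. reopen the
            // URLConnection) and decode it again with the meta-tag charset.
            byte[] bytes = "caf\u00e9".getBytes("UTF-8");
            try (Reader r = openReader(new ByteArrayInputStream(bytes), meta)) {
                StringBuilder sb = new StringBuilder();
                for (int c = r.read(); c != -1; c = r.read()) {
                    sb.append((char) c);
                }
                System.out.println("re-decoded " + sb.length() + " chars");
            }
        }
    }
}
```

The restart is only taken when the two charsets resolve to different encodings, so a page whose meta tag merely uses an alias of the header's charset would not be read twice.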