Thread: [Htmlparser-developer] charset
From: Derrick O. <Der...@ro...> - 2002-12-23 20:47:40
g'day,

I've dropped code to handle charset parameters, but this raised a couple of issues. First an explanation of the changes.

HTMLParser constructors were added that take a URLConnection, so that the input stream could be reacquired as needed. For consistency the 'from file' handling is identical to a URL now.

The HTTP header is now examined for the charset parameter and the input stream is converted to a reader with that encoding. However, a lot of sites lie. They don't specify the charset in the HTTP header, even though the HTML is encoded with a non-standard encoding. So, we have to pre-read the header portion (meta tags) and restart if the charset in the meta tags is different than in the HTTP header.

Ya, ya, pre-reading is fraught with peril, and that's what I'm on about here...

1) Why aren't the standard scanners registered by the HTMLParser constructor? It would seem that proper parsing of HTML would require these always be registered. Unless there's a set for HTML 3.2 and one for HTML 4.1 or something. Some of the tests rely on this 'empty scanner list' behaviour to return the expected number of nodes.

2) Mark and reset are still supported somewhat. If you mark before calling elements() and reset() after exhausting the input stream, all is well. But mark() in the midst of parsing is useless and dangerous because of the BufferedReader. Perhaps the documentation should reflect this, or a way to do it properly worked out.

Aside notes:

Nobody's checking that the contents of the document are "text/html", perhaps this check should be added.

These changes still leave the 'Reader' constructors as a separate group which is used by a majority of the test cases. The 'new testing framework' should try to use the same code path as most of the real-world use cases. Perhaps, test cases could be loaded from htmlparser.sourceforge.net like HTMLParserTest.testURLWithSpaces() does.

Derrick
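For readers following along, the header-then-meta check described in the message above might look roughly like the sketch below. It is not the code that was checked in: the class name `CharsetSniffer`, its method names, the regular expressions, and the ISO-8859-1 fallback are all assumptions made for illustration. The idea is to take the charset from the HTTP Content-Type header if present, scan a pre-read chunk of the document for a meta-tag charset, and signal a restart when the two disagree.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser.
class CharsetSniffer {
    // Assumed fallback when the HTTP header names no charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Matches the charset parameter of a Content-Type value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Matches a charset declared anywhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type header value, or null. */
    public static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = HEADER_CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the charset named in a meta tag within the pre-read bytes, or null. */
    public static String fromMetaTags(byte[] prefix) {
        // ISO-8859-1 maps every byte to a char, so ASCII markup survives
        // even if the real encoding turns out to be something else.
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** True when the meta tags name a charset other than the one in use. */
    public static boolean needsRestart(String headerCharset, String metaCharset) {
        if (metaCharset == null) {
            return false;
        }
        String current = headerCharset != null ? headerCharset : DEFAULT_CHARSET;
        // Plain name comparison; charset aliases are not resolved in this sketch.
        return !current.equalsIgnoreCase(metaCharset);
    }
}
```

A caller would pass `URLConnection.getContentType()` to `fromContentType`, pre-read the first part of the stream into a byte array for `fromMetaTags`, and reopen the connection with the meta-tag charset when `needsRestart` returns true.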
From: Somik R. <so...@ya...> - 2002-12-24 02:00:26
Hi Derrick,
From: Somik R. <so...@ya...> - 2002-12-24 02:05:00
Hi Derrick,

Great going!

> 1) Why aren't the standard scanners registered by the HTMLParser
> constructor? It would seem that proper parsing of HTML would require
> these always be registered. Unless there's a set for HTML 3.2 and one
> for HTML 4.1 or something. Some of the tests rely on this 'empty scanner
> list' behaviour to return the expected number of nodes.

That is because we allow folks to custom-build their own parser. For example, if you are only interested in text content, you wouldn't wish to register any of the scanners at all. If you are interested only in links, you would register only the link scanner. Passing control to a scanner implies extra processing time (sometimes twice that of the no-scanner scenario).

> 2) Mark and reset are still supported somewhat. If you mark before
> calling elements() and reset() after exhausting the input stream, all is
> well. But mark() in the midst of parsing is useless and dangerous
> because of the BufferedReader. Perhaps the documentation should reflect
> this, or a way to do it properly worked out.

Hmm, I thought I'd added this doc in. If it's not enough, feel free to modify.

> Aside notes:
>
> Nobody's checking that the contents of the document are "text/html",
> perhaps this check should be added.
>
> These changes still leave the 'Reader' constructors as a separate group
> which is used by a majority of the test cases. The 'new testing
> framework' should try to use the same code path as most of the
> real-world use cases. Perhaps, test cases could be loaded from
> htmlparser.sourceforge.net like HTMLParserTest.testURLWithSpaces() does.

That would cause a serious performance hit while running 284 tests - I'd simply stop testing if it took too long to run the tests.

Regards,
Somik
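The mark/reset concern in point 2 comes from general `java.io` behaviour, and a small sketch may make it concrete. This does not touch HTMLParser itself: the class `MarkResetDemo` and its method names are made up for illustration, and a `StringReader` stands in for the real input. A `BufferedReader` reads ahead, so after a single `readLine()` the underlying reader is already far past the logical parse position; marking before parsing and resetting after the input is exhausted still works.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of HTMLParser.
class MarkResetDemo {
    /**
     * Reads one line through a BufferedReader, then reads directly from the
     * underlying reader. For input smaller than the buffer, the read-ahead
     * has drained the source, so this returns -1 (end of stream).
     */
    public static int readUnderlyingAfterOneLine(String html) {
        try {
            StringReader source = new StringReader(html);
            BufferedReader buffered = new BufferedReader(source);
            buffered.readLine();   // the parser has consumed one line...
            return source.read();  // ...but the source has no characters left
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Marks the source before any buffering, exhausts the input, resets,
     * and returns the first line read the second time around.
     */
    public static String rereadFromStart(String html) {
        try {
            StringReader source = new StringReader(html);
            source.mark(0);  // StringReader ignores the read-ahead limit
            BufferedReader buffered = new BufferedReader(source);
            while (buffered.readLine() != null) {
                // drain the input, as a full parse would
            }
            source.reset();  // back to the position marked at the start
            return new BufferedReader(source).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A mark taken on the underlying reader mid-parse would record the read-ahead position rather than the point the parser has reached, which is why only the mark-before, reset-after pattern is safe.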