[Htmlparser-developer] charset

g'day,

I've checked in code to handle charset parameters, but this raised a couple 
of issues. First, an explanation of the changes.

HTMLParser constructors were added that take a URLConnection, so that 
the input stream can be reacquired as needed.
For consistency, the 'from file' handling is now identical to the URL handling.

The HTTP header is now examined for the charset parameter and the input 
stream is converted to a reader with that encoding.
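
A rough sketch of that header step, not the code that was checked in. It assumes the header value comes from something like URLConnection.getContentType(), and it assumes a fallback of ISO-8859-1 when no charset parameter is present. The class and method names are made up for illustration.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HeaderCharsetSketch {
    // Assumed fallback when the header carries no usable charset parameter.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Returns the default if none is found.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? DEFAULT_CHARSET : value;
            }
        }
        return DEFAULT_CHARSET;
    }

    // Wraps the raw stream in a reader using the header's charset,
    // falling back to ISO-8859-1 if the name is illegal or unsupported.
    static Reader openReader(InputStream in, String contentType) {
        Charset cs;
        try {
            cs = Charset.forName(charsetFromContentType(contentType));
        } catch (IllegalArgumentException e) {
            cs = StandardCharsets.ISO_8859_1;
        }
        return new InputStreamReader(in, cs);
    }
}
```

Typical use would be something like openReader(connection.getInputStream(), connection.getContentType()).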
However, a lot of sites lie. They don't specify the charset in the HTTP 
header, even though the HTML is encoded with a non-standard encoding.
So, we have to pre-read the HTML head (the meta tags) and restart if 
the charset in the meta tags is different from the one in the HTTP header.
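
To make the pre-read idea concrete, here is a minimal sketch under stated assumptions, not the committed implementation. It decodes the first bytes of the document as ISO-8859-1 (every byte maps to a character, so the ASCII markup survives), looks for a charset in a meta tag with a regular expression, and compares it with the header's charset. The names sniffMetaCharset and needsRestart are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSketch {
    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the first matching meta tag, or null.
    static String sniffMetaCharset(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // A restart is needed only when the meta tag names a charset and that
    // name differs (ignoring case) from the one already in use.
    static boolean needsRestart(String headerCharset, String metaCharset) {
        return metaCharset != null && !metaCharset.equalsIgnoreCase(headerCharset);
    }
}
```

This compares names only, so two aliases of the same encoding would still trigger a restart; comparing Charset objects would avoid that.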

Ya, ya, pre-reading is fraught with peril, and that's what I'm on about 
here...

1) Why aren't the standard scanners registered by the HTMLParser 
constructor? It would seem that proper parsing of HTML would require 
these always be registered. Unless there's a set for HTML 3.2 and one 
for HTML 4.01 or something. Some of the tests rely on this 'empty scanner 
list' behaviour to return the expected number of nodes.

2) Mark and reset are still supported somewhat. If you mark() before 
calling elements() and reset() after exhausting the input stream, all is 
well. But mark() in the midst of parsing is useless and dangerous 
because of the BufferedReader. Perhaps the documentation should reflect 
this, or a way to do it properly should be worked out.
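
One reading of the BufferedReader problem in 2), shown with plain java.io classes rather than the parser itself: the buffer reads ahead, so a mark placed on the underlying reader in mid-stream records a position well past what the consumer has actually seen. The class and method names are invented for the demo.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class MarkResetSketch {
    // Marks the UNDERLYING reader after the buffer has already read ahead.
    // Returns the character seen after reset(), or -1 at end of stream.
    static int markUnderlying(String text) {
        try {
            Reader underlying = new StringReader(text);
            BufferedReader buffered = new BufferedReader(underlying);
            buffered.read();      // consumer sees one char; the buffer drains much more
            underlying.mark(1);   // marks the read-ahead position, not the consumer's
            buffered.read();
            underlying.reset();
            return underlying.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Marks the BufferedReader itself, which tracks the consumer's position.
    static int markBuffered(String text) {
        try {
            BufferedReader buffered = new BufferedReader(new StringReader(text));
            buffered.read();
            buffered.mark(16);    // allow up to 16 chars of read-ahead before reset
            buffered.read();
            buffered.reset();
            return buffered.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a short input such as "abcdef", markUnderlying comes back at end of stream while markBuffered returns to 'b', which is the difference the documentation would need to spell out.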

Aside notes:

Nobody's checking that the content type of the document is "text/html"; 
perhaps this check should be added.
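
Such a check could be as small as the following sketch. It assumes the value comes from URLConnection.getContentType() and treats a missing header as "not HTML", which is a policy choice rather than anything decided here. The name isHtml is hypothetical.

```java
import java.util.Locale;

class ContentTypeCheckSketch {
    // True when the media type (ignoring parameters and case) is text/html.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // assumption: an unknown type is rejected
        }
        int semi = contentType.indexOf(';');
        String mediaType = (semi >= 0 ? contentType.substring(0, semi) : contentType)
            .trim()
            .toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html");
    }
}
```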

These changes still leave the 'Reader' constructors as a separate group 
that is used by a majority of the test cases. The 'new testing 
framework' should try to use the same code path as most of the 
real-world use cases. Perhaps test cases could be loaded from 
htmlparser.sourceforge.net, like HTMLParserTest.testURLWithSpaces() does.

Derrick