[Htmlparser-user] Charset and multiple reparsing questions


I have a few questions about the best way to convert repeatedly
between HTML stored as a String and the parsed (tree) form that
HTMLParser produces.

1) When I first parse the document (using Parser rather than Lexer,
since I need a tree), is there a way to pass in the charset (e.g.
UTF-8) that was specified in the HTTP headers? Do I need to do this if
the input is already decoded correctly? (I'm using Apache HttpClient,
which can return the response either as a byte[] or as a String
decoded using the charset from the headers, and I'm using the latter
option.)

2) Once I have done this, I'd want that charset to be overridden if a
meta http-equiv Content-Type tag specifies a different one. Can the
parser do this automatically, or do I have to read the tag myself?
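
To make question 2 concrete, this is roughly what I would end up
writing if I have to read the tag myself. It is a plain-Java sketch
that does not use HTMLParser at all; the class and method names
(CharsetSniffer, fromContentType, fromMeta, choose) are my own
invention, and the regular expressions are naive and assume a
reasonably well-formed meta tag.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser or HttpClient.
final class CharsetSniffer {
    // Finds "charset=..." (optionally quoted) in a Content-Type value or a meta tag.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Naive match for a whole <meta ...> tag; assumes no '>' inside attribute values.
    private static final Pattern META_TAG = Pattern.compile(
        "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Charset named in an HTTP Content-Type header value, if any.
    static Optional<String> fromContentType(String value) {
        if (value == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(value);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Charset named in the first meta http-equiv tag that carries one, if any.
    static Optional<String> fromMeta(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            if (tag.toLowerCase(Locale.ROOT).contains("http-equiv")) {
                Optional<String> charset = fromContentType(tag);
                if (charset.isPresent()) {
                    return charset;
                }
            }
        }
        return Optional.empty();
    }

    // Precedence as described above: meta tag first, then header, then a default.
    static String choose(String headerValue, String html, String fallback) {
        return fromMeta(html)
            .orElseGet(() -> fromContentType(headerValue).orElse(fallback));
    }
}
```

With this, CharsetSniffer.choose(headerValue, html, "ISO-8859-1") would
give me the name to use, but I would much rather the parser did this
for me if it can.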

3) Now that I have the body tag and a charset specified either by the
headers or by the meta tag (or, if neither, a sensible default), I want
to convert the document back into a String. Do I need to be concerned
about the charset again here, or do the Node/NodeList toString methods
handle this?

4) Finally, once I have a String produced by the steps above and I
want to convert it into an HTMLParser tree again, do I need to specify
the charset once more?
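
For context on questions 3 and 4, my working assumption (please
correct me if it is wrong) is that a charset only matters at the point
where bytes become a String or a String becomes bytes, as in the
plain-Java sketch below. The class name CharsetRoundTrip is made up
for illustration and nothing here touches HTMLParser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class CharsetRoundTrip {
    // Decoding is the step where the charset choice matters.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Encoding is the other step where the charset choice matters.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        byte[] utf8 = encode(original, StandardCharsets.UTF_8);

        // Right charset: the text comes back unchanged.
        String good = decode(utf8, StandardCharsets.UTF_8);
        // Wrong charset: each UTF-8 byte of the accented letter becomes its own char.
        String bad = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(good.equals(original)); // true
        System.out.println(bad.equals(original));  // false
        System.out.println(bad.length());          // 5
    }
}
```

If that assumption holds, then String to tree and back to String
should not need a charset at all, and what I am really asking is
whether HTMLParser behaves that way.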

Thanks

Ian