[Htmlparser-user] Charset and multiple reparsing questions
From: Ian M. <ian...@gm...> - 2006-06-02 19:11:19
I have a few questions about the best way to convert repeatedly between HTML stored as a String and the tree form produced by HTMLParser.

1) When first parsing (using Parser, not Lexer, since I need a tree), is there a way to pass in the charset (e.g. UTF-8) that was specified in the HTTP headers? Do I need to do this if the input is already correctly decoded? (I'm using Apache HTTPClient, which can return either a byte[] or a String decoded using the charset found in the headers, and I'm using the latter option.)

2) Once I have done this, I want that charset to be overridden if the meta http-equiv Content-Type tag gives a different one. Can the parser do this automatically, or do I have to read the tag myself?

3) Now that I have the body tag and a charset specified by either the headers or the meta tag (or, failing both, a sensible default), I want to convert the document back into a String. Do I need to be concerned about the charset here, or do the Node/NodeList toString methods handle this?

4) Finally, once I have a String produced by the steps above and want to convert it into an HTMLParser tree again, do I need to specify the charset again?

Thanks,
Ian
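
To make the precedence described in questions 1 to 3 concrete, here is a minimal sketch using only the Java standard library. It does not use HTMLParser's own API and says nothing about what the parser does internally. The class `CharsetChoice`, its methods, and the regex-based meta lookup are hypothetical names and a simplification introduced for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of HTMLParser or HttpClient.
class CharsetChoice {
    // Simplified lookup: finds charset=... inside the first matching <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared in a meta tag, or null if none is found.
    static String metaCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Precedence assumed from the post: meta tag, then HTTP header, then a default.
    static Charset choose(String headerCharset, String metaCharset, Charset fallback) {
        for (String name : new String[] { metaCharset, headerCharset }) {
            if (name == null) {
                continue;
            }
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: try the next candidate.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=ISO-8859-1\"></head>"
                + "<body>caf\u00e9</body></html>";

        // The header says UTF-8, but the meta tag wins under the assumed precedence.
        Charset chosen = choose("UTF-8", metaCharset(html), StandardCharsets.UTF_8);
        System.out.println(chosen);

        // A charset is needed only at the byte boundary: encoding to bytes and decoding back.
        byte[] bytes = html.getBytes(chosen);
        System.out.println(new String(bytes, chosen).equals(html));
    }
}
```

As a general Java point, independent of HTMLParser: a String already holds decoded characters, so a charset comes into play only when bytes are decoded into a String or a String is encoded back into bytes.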