Re: [Htmlparser-user] Charset and multiple reparsing questions


It's thrown in
  org.htmlparser.lexer.InputStreamSource.setEncoding (String)

Ian Macfarlane <ian...@gm...> wrote:

Derrick,

I can't see anywhere EncodingChangeException is thrown in the code,
perhaps this is not implemented yet?

Ian

On 6/5/06, Derrick Oswald wrote:
> Ian,
>
> If you have a String in Java, it's Unicode encoded in UTF-16 - no?
> (the trick of course, is in how it got to be a String, or how the String
> gets saved to a Stream)
> so I don't think you *need* to specify the encoding if you are passing
> in a String.
> Looking at the StringSource.java code, the encoding which may be passed
> in the constructor is just stored as a property.
> It doesn't appear to be used. But if set properly on the constructor it
> would avoid a retrace when the META tag is encountered.
> You would do something like this:
>    new Parser (new Lexer (new Page (my_string, my_encoding)))
>
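A minimal sketch of the point about Strings, using only the JDK and not HTMLParser itself: once text is a Java String it is already UTF-16 in memory, so a charset only matters at the moment bytes are turned into characters. The class and method names here (StringEncodingDemo, decode) are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HTMLParser.
final class StringEncodingDemo {

    /** Decodes raw bytes into a String using the given charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8 is 5 bytes: the 'é' takes two (C3 A9).
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);      // correct charset
        String wrong = decode(raw, StandardCharsets.ISO_8859_1); // wrong charset

        System.out.println(right.length()); // 4
        System.out.println(wrong.length()); // 5, the 'é' became two characters
    }
}
```

If the String was decoded correctly before it reaches the parser, there is nothing left for an encoding setting to fix; if it was decoded wrongly, the damage is already in the characters.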
> There is code in MetaTag.doSemanticAction() to set the page encoding
> based on the META tag.
> This mechanism wouldn't do anything under the hood if the input is a
> String (based on the fact that StringSource just stores the encoding).
> But, if the HttpClient incorrectly converted the stream to a String
> based on the HTTP header content type and the META tag actually has the
> correct encoding you have a problem (this is the reason for the
> EncodingChangeException thrown by the parser).
>
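The header-versus-META problem described above can be sketched with the JDK alone. This is not how HTMLParser does it internally; it only illustrates the idea of decoding provisionally with the HTTP header charset, looking for a charset in a META tag, and decoding again from the original bytes if the two disagree. It assumes you kept the raw response bytes rather than only the converted String. The class name, method names, and the simplified regular expression are all assumptions for this example.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of HTMLParser.
final class MetaCharsetDemo {

    // Simplified pattern (an assumption): finds charset=NAME inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:-]*)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a META tag, or null if none is found. */
    static String metaCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Decodes with the HTTP header charset first; if a META tag names a
     * different supported charset, decodes again from the raw bytes.
     */
    static String decode(byte[] raw, Charset headerCharset) {
        String provisional = new String(raw, headerCharset);
        String declared = metaCharset(provisional);
        if (declared != null && Charset.isSupported(declared)) {
            Charset meta = Charset.forName(declared);
            if (!meta.equals(headerCharset)) {
                return new String(raw, meta); // start over with the META charset
            }
        }
        return provisional;
    }
}
```

The provisional decode works because the META tag itself is plain ASCII, which most common charsets decode the same way.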
> Conversion from the parse tree to a String actually just regurgitates
> the characters read in, so the charset and encoding don't enter into it
> here.
>
> Submitting the String to be parsed again brings up the same issues as
> the first time.
>
> Derrick
>
> Ian Macfarlane wrote:
>
> >I have a few questions regarding the best way to perform multiple
> >parsing to and from HTML stored as a String and HTMLParser parsed
> >(tree) format.
> >
> >1) Firstly, when first parsing (using Parser not Lexer, I need a
> >tree), is there a way to pass it the charset (e.g. UTF-8) that was
> >specified in the HTTP headers? Do I need to do this if it is already
> >encoded correctly? (I'm using Apache HTTPClient which can convert into
> >a byte[] or a correctly encoded String using the headers found, and
> >I'm using the latter option).
> >
> >2) Once I have done this, I'd want it to be overridden if the Meta
> >http-equiv Content-Type gives me a different one. Can the parser
> >automatically do this? Or do I have to attempt to read it myself?
> >
> >3) Now I've got the body tag, and a charset specified either by the
> >headers or the meta tag (or if none, a sensible default), I want to
> >convert the document back into a String again. Do I need to be
> >concerned about the charset again here, or do the Node/NodeList
> >toString methods handle this?
> >
> >4) Finally, once I have a String that's a product of the above, and I
> >want to again convert it into an HTMLParser tree, do I need to specify
> >the charset again here?
> >
> >Thanks
> >
> >Ian
> >
> >
> >_______________________________________________
> >Htmlparser-user mailing list
> >Htm...@li...
> >https://lists.sourceforge.net/lists/listinfo/htmlparser-user
> >
> >
> >
>
>
>
