Re: [Htmlparser-user] Charset and multiple reparsing questions
From: Derrick O. <der...@ro...> - 2006-06-07 21:40:39
It's thrown in org.htmlparser.lexer.InputStreamSource.setEncoding (String)

Ian Macfarlane <ian...@gm...> wrote:

Derrick,

I can't see anywhere EncodingChangeException is thrown in the code,
perhaps this is not implemented yet?

Ian

On 6/5/06, Derrick Oswald wrote:
> Ian,
>
> If you have a String in Java, it's Unicode encoded in UTF-16 - no?
> (the trick of course, is in how it got to be a String, or how the String
> gets saved to a Stream)
> so I don't think you *need* to specify the encoding if you are passing
> in a String.
> Looking at the StringSource.java code, the encoding which may be passed
> in the constructor is just stored as a property.
> It doesn't appear to be used. But if set properly on the constructor it
> would avoid a retrace when the META tag is encountered.
> You would do something like this:
>     new Parser (new Lexer (new Page (my_string, my_encoding)))
>
> There is code in MetaTag.doSemanticAction() to set the page encoding
> based on the META tag.
> This mechanism wouldn't do anything under the hood if the input is a
> String (based on the fact the StringSource just stores the encoding).
> But, if the HttpClient incorrectly converted the stream to a String
> based on the HTTP header content type and the META tag actually has the
> correct encoding you have a problem (this is the reason for the
> EncodingChangeException thrown by the parser).
>
> Conversion from the parse tree to a String actually just regurgitates
> the characters read in, so the charset and encoding don't enter into it
> here.
>
> Submitting the String to be parsed again brings up the same issues as
> the first time.
>
> Derrick
>
> Ian Macfarlane wrote:
>
> >I have a few questions regarding the best way to perform multiple
> >parsing to and from HTML stored as a String and HTMLParser parsed
> >(tree) format.
> >
> >1) Firstly, when first parsing (using Parser not Lexer, I need a
> >tree), is there a way to pass it the charset (e.g. UTF-8) that was
> >specified in the HTTP headers? Do I need to do this if it is already
> >encoded correctly? (I'm using Apache HTTPClient which can convert into
> >a byte[] or a correctly encoded String using the headers found, and
> >I'm using the latter option).
> >
> >2) Once I have done this, I'd want it to be overridden if the Meta
> >http-equiv Content-Type gives me a different one. Can the parser
> >automatically do this? Or do I have to attempt to read it myself?
> >
> >3) Now I've got the body tag, and a charset specified either by the
> >headers or the meta tag (or if none, a sensible default), I want to
> >convert the document back into a String again. Do I need to be
> >concerned about the charset again here, or do the Node/NodeList
> >toString methods handle this?
> >
> >4) Finally, once I have a String that's a product of the above, and I
> >want to again convert it into an HTMLParser tree, do I need to specify
> >the charset again here?
> >
> >Thanks
> >
> >Ian

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
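
The failure case discussed in this thread (bytes converted to a String using the HTTP header charset while the META tag declares a different one) can be illustrated with plain JDK classes. The following is only a rough sketch and not HTMLParser's implementation: the class `CharsetSketch`, the helpers `sniffMetaCharset` and `decode`, and the regular expression are all hypothetical names made up for illustration. The assumption is that the raw bytes are still available, so they can be decoded a second time if the META tag disagrees with the header.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; HTMLParser's own handling lives in its
// lexer classes and is not reproduced here.
final class CharsetSketch {

    // Rough pattern for "charset=..." inside a META tag (an assumption,
    // far less thorough than a real HTML parser).
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared in a META tag, or null if none is found.
    static String sniffMetaCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Decodes with the header charset first; if the META tag names a
    // different supported charset, decodes the same bytes again with that one.
    static String decode(byte[] raw, String headerCharset) {
        Charset fromHeader = Charset.forName(headerCharset);
        String firstPass = new String(raw, fromHeader);
        String declared = sniffMetaCharset(firstPass);
        if (declared != null && Charset.isSupported(declared)) {
            Charset fromMeta = Charset.forName(declared);
            if (!fromMeta.equals(fromHeader)) {
                return new String(raw, fromMeta);
            }
        }
        return firstPass;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>"
                + "<body>caf\u00e9</body></html>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);

        // Decoding with the wrong header charset mangles the accented character.
        String naive = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println("naive decode intact:  " + naive.contains("caf\u00e9"));

        // Re-decoding from the original bytes after reading the META tag repairs it.
        String fixed = decode(raw, "ISO-8859-1");
        System.out.println("META-aware intact:    " + fixed.contains("caf\u00e9"));
    }
}
```

The sketch decodes from bytes twice rather than trying to repair an already-built String, which matches the point made above that a String handed to the parser is taken as-is. If only the String from HttpClient is kept, the second decode is not possible in general, so holding on to the byte[] form may be the safer option when the header and META tag can disagree.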