Re: [Htmlparser-user] Charset and multiple reparsing questions
From: Ian M. <ian...@gm...> - 2006-06-08 12:28:25
That will teach me to rely on Windows search. Bleh.

Ok, so if the headers kick the file out as one charset, then the meta tag states that it is a different one, I assume (based on the W3C recommendations and a quick peek at InputStreamSource) that if the new encoding is compatible (characters parsed so far are the same) it will just reparse the rest of the page with the new charset, otherwise it will throw an EncodingChangeException. Am I right so far?

Now if I walk through these two potential paths:

- If the exception is not thrown, is the parsed document encoded with the charset specified in the headers or in the meta tag? I.e. if I convert it back to a String from a NodeList etc, will it have the correct charset from the meta tag still?

- If the exception is thrown, can I reparse the entire document from the original String or would I have to go back to the original byte[] to do this?

Thanks,

Ian

On 6/7/06, Derrick Oswald <der...@ro...> wrote:
> It's thrown in
> org.htmlparser.lexer.InputStreamSource.setEncoding (String)
>
> Ian Macfarlane <ian...@gm...> wrote:
>
> Derrick,
>
> I can't see anywhere EncodingChangeException is thrown in the code,
> perhaps this is not implemented yet?
>
> Ian
>
> On 6/5/06, Derrick Oswald wrote:
> > Ian,
> >
> > If you have a String in Java, it's Unicode encoded in UTF-16 - no?
> > (the trick of course, is in how it got to be a String, or how the String
> > gets saved to a Stream)
> > so I don't think you *need* to specify the encoding if you are passing
> > in a String.
> > Looking at the StringSource.java code, the encoding which may be passed
> > in the constructor is just stored as a property.
> > It doesn't appear to be used. But if set properly on the constructor it
> > would avoid a retrace when the META tag is encountered.
> > You would do something like this:
> > new Parser (new Lexer (new Page (my_string, my_encoding)))
> >
> > There is code in MetaTag.doSemanticAction() to set the page encoding
> > based on the META tag.
> > This mechanism wouldn't do anything under the hood if the input is a
> > String (based on the fact the StringSource just stores the encoding).
> > But, if the HttpClient incorrectly converted the stream to a String
> > based on the HTTP header content type and the META tag actually has the
> > correct encoding you have a problem (this is the reason for the
> > EncodingChangeException thrown by the parser).
> >
> > Conversion from the parse tree to a String actually just regurgitates
> > the characters read in, so the charset and encoding don't enter into it
> > here.
> >
> > Submitting the String to be parsed again brings up the same issues as
> > the first time.
> >
> > Derrick
> >
> > Ian Macfarlane wrote:
> >
> > >I have a few questions regarding the best way to perform multiple
> > >parsing to and from HTML stored as a String and HTMLParser parsed
> > >(tree) format.
> > >
> > >1) Firstly, when first parsing (using Parser not Lexer, I need a
> > >tree), is there a way to pass it the charset (e.g. UTF-8) that was
> > >specified in the HTTP headers? Do I need to do this if it is already
> > >encoded correctly? (I'm using Apache HTTPClient which can convert into
> > >a byte[] or a correctly encoded String using the headers found, and
> > >I'm using the latter option).
> > >
> > >2) Once I have done this, I'd want it to be overridden if the Meta
> > >http-equiv Content-Type gives me a different one. Can the parser
> > >automatically do this? Or do I have to attempt to read it myself?
> > >
> > >3) Now I've got the body tag, and a charset specified either by the
> > >headers or the meta tag (or if none, a sensible default), I want to
> > >convert the document back into a String again. Do I need to be
> > >concerned about the charset again here, or do the Node/NodeList
> > >toString methods handle this?
> > >
> > >4) Finally, once I have a String that's a product of the above, and I
> > >want to again convert it into an HTMLParser tree, do I need to specify
> > >the charset again here?
> > >
> > >Thanks
> > >
> > >Ian
> > >
> > >_______________________________________________
> > >Htmlparser-user mailing list
> > >Htm...@li...
> > >https://lists.sourceforge.net/lists/listinfo/htmlparser-user
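
For anyone following the thread, the String versus byte[] question can be tried out with nothing but the Java standard library. The sketch below is not HTMLParser code, and the names in it (CharsetReparse, decode, metaCharset, decodeHonouringMeta) are made up for illustration. It assumes a page whose bytes are ISO-8859-1 while the HTTP header claimed UTF-8, and it uses a deliberately simplified regex to find the charset in the META tag. The point it tries to show: once the bytes have been decoded with the wrong charset, malformed sequences have already been replaced with U+FFFD, so the safe way to redo the decode is from the original bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Standard-library sketch; none of these names come from HTMLParser. */
public class CharsetReparse {

    // Simplified, hypothetical META sniffing: first charset=... found inside a <meta> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_\\-]*)",
            Pattern.CASE_INSENSITIVE);

    /** Decodes raw bytes; malformed input is replaced with U+FFFD, which loses data. */
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    /** Returns the charset named in a META tag, or null if none is found. */
    public static String metaCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /** Decodes with the header charset, then redoes it from the bytes if META disagrees. */
    public static String decodeHonouringMeta(byte[] raw, String headerCharset) {
        String first = decode(raw, headerCharset);
        String declared = metaCharset(first);
        if (declared == null || !Charset.isSupported(declared)) {
            return first; // nothing usable in the META tag, keep the header's view
        }
        if (Charset.forName(declared).equals(Charset.forName(headerCharset))) {
            return first; // header and META agree
        }
        return decode(raw, declared); // go back to the original bytes
    }

    public static void main(String[] args) {
        String page = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=ISO-8859-1\"></head>"
                + "<body>caf\u00e9</body></html>";
        byte[] raw = page.getBytes(StandardCharsets.ISO_8859_1);

        // Decode as the (wrong) header charset: the e-acute byte is malformed UTF-8.
        String wrong = decode(raw, "UTF-8");
        System.out.println("lossy decode: " + wrong.contains("\uFFFD"));

        // Trying to repair the already-decoded String does not bring the character back.
        String reencoded = new String(
                wrong.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println("fixable from String: " + reencoded.contains("caf\u00e9"));

        // Redoing the decode from the bytes with the META charset does.
        String fixed = decodeHonouringMeta(raw, "UTF-8");
        System.out.println("fixed from bytes: " + fixed.contains("caf\u00e9"));
    }
}
```

The META tag itself survives the wrong decode because it is plain ASCII, which is why it can still be read from the first pass. Whether a particular wrong decode is lossy depends on the charset pair, so treat this as an argument for keeping the byte[] around when the header and META tag might disagree, not as a description of what the parser does internally.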