Thread: [Htmlparser-user] Charset and multiple reparsing questions

[Htmlparser-user] Charset and multiple reparsing questions

From: Ian M. <ian...@gm...> - 2006-06-02 19:11:19

I have a few questions regarding the best way to perform multiple
parsing to and from HTML stored as a String and HTMLParser parsed
(tree) format.

1) Firstly, when first parsing (using Parser not Lexer, I need a
tree), is there a way to pass it the charset (e.g. UTF-8) that was
specified in the HTTP headers? Do I need to do this if it is already
encoded correctly? (I'm using Apache HTTPClient which can convert into
a byte[] or a correctly encoded String using the headers found, and
I'm using the latter option).

2) Once I have done this, I'd want it to be overridden if the Meta
http-equiv Content-Type gives me a different one. Can the parser
automatically do this? Or do I have to attempt to read it myself?

3) Now I've got the body tag, and a charset specified either by the
headers or the meta tag (or if none, a sensible default), I want to
convert the document back into a String again. Do I need to be
concerned about the charset again here, or do the Node/NodeList
toString methods handle this?

4) Finally, once I have a String that's a product of the above, and I
want to again convert it into an HTMLParser tree, do I need to specify
the charset again here?

Thanks

Ian

[Htmlparser-user] Charset and multiple reparsing questions

From: Ian M. <ian...@gm...> - 2006-06-06 01:46:32

(reposting as it doesn't seem to have gone through the first time)

Re: [Htmlparser-user] Charset and multiple reparsing questions

From: Derrick O. <Der...@Ro...> - 2006-06-06 01:48:33

Ian,

If you have a String in Java, it's Unicode encoded in UTF-16 - no?
(the trick of course, is in how it got to be a String, or how the String 
gets saved to a Stream)
so I don't think you *need* to specify the encoding if you are passing 
in a String.
Looking at the StringSource.java code, the encoding which may be passed 
in the constructor is just stored as a property.
It doesn't appear to be used. But if set properly on the constructor it 
would avoid a retrace when the META tag is encountered.
You would do something like this:
   new Parser (new Lexer (new Page (my_string, my_encoding)))

There is code in MetaTag.doSemanticAction() to set the page encoding 
based on the META tag.
This mechanism wouldn't do anything under the hood if the input is a
String (based on the fact that the StringSource just stores the encoding).
But, if the HttpClient incorrectly converted the stream to a String 
based on the HTTP header content type and the META tag actually has the 
correct encoding you have a problem (this is the reason for the 
EncodingChangeException thrown by the parser).

Conversion from the parse tree to a String actually just regurgitates 
the characters read in, so the charset and encoding don't enter into it 
here.

Submitting the String to be parsed again brings up the same issues as 
the first time.

Derrick
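
To illustrate the point above about "how it got to be a String", here is
a minimal sketch using only the JDK (no HTMLParser classes). The class
name DecodeDemo and the sample text are made up for illustration. It
decodes the same bytes with two different charsets; once the wrong one
has been used, the resulting String simply contains the wrong characters
and carries no record of the charset that produced it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decode raw bytes into a String using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, as it would arrive on the wire in UTF-8 (5 bytes).
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);      // 4 chars
        String wrong = decode(raw, StandardCharsets.ISO_8859_1); // 5 chars, mojibake

        System.out.println(right.length());      // prints 4
        System.out.println(wrong.length());      // prints 5
        System.out.println(right.equals(wrong)); // prints false
    }
}
```

Whatever is parsed afterwards only ever sees the chars in the String, so
a bad decode at this step cannot be repaired by the parser.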

Re: [Htmlparser-user] Charset and multiple reparsing questions

From: Ian M. <ian...@gm...> - 2006-06-07 20:14:24

Derrick,

I can't see anywhere EncodingChangeException is thrown in the code,
perhaps this is not implemented yet?

Ian

Re: [Htmlparser-user] Charset and multiple reparsing questions

From: Derrick O. <der...@ro...> - 2006-06-07 21:40:39

It's thrown in
  org.htmlparser.lexer.InputStreamSource.setEncoding (String)


Re: [Htmlparser-user] Charset and multiple reparsing questions

From: Ian M. <ian...@gm...> - 2006-06-08 12:28:25

That will teach me to rely on windows search. Bleh.

Ok, so if the headers kick the file out as one charset, then the meta
tag states that it is a different one, I assume (based on the W3C
recommendations and a quick peek at InputStreamSource) if the new
encoding is compatible (characters parsed so far are the same) it will
just reparse the rest of the page with the new charset, otherwise it
will throw an EncodingChangeException. Am I right so far?
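
The "compatible" test described above can be sketched with plain JDK
calls. This is not the code in InputStreamSource, only an illustration
of the idea under the assumption that the raw bytes read so far are
still available; the class and method names are hypothetical. It decodes
the already-consumed bytes with both charsets and compares the results.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PrefixCheck {
    // True if the first `consumed` bytes decode to the same characters
    // under both charsets, i.e. switching mid-stream changes nothing so far.
    static boolean samePrefix(byte[] raw, int consumed, Charset before, Charset after) {
        String a = new String(raw, 0, consumed, before);
        String b = new String(raw, 0, consumed, after);
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Pure ASCII decodes identically in ISO-8859-1 and UTF-8.
        byte[] ascii = "<html><head>".getBytes(StandardCharsets.US_ASCII);
        // A lone 0xE9 byte is an accented e in ISO-8859-1 but malformed in UTF-8.
        byte[] accented = {'<', 'p', '>', (byte) 0xE9};

        System.out.println(samePrefix(ascii, ascii.length,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));    // prints true
        System.out.println(samePrefix(accented, accented.length,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));    // prints false
    }
}
```

In the first case a caller could carry on with the new charset; in the
second, the characters already handed out were wrong, which is the
situation the exception is meant to report.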

Now if I walk through these two potential paths:

- If the exception is not thrown, is the parsed document encoded with
the charset specified in the headers or in the meta tag? I.e. if I
convert it back to a String from a NodeList etc, will it have the
correct charset from the meta tag still?

- If the exception is thrown, can I reparse the entire document from
the original String or would I have to go back to the original byte[]
to do this?
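
One way to frame the second question is a sketch that keeps the original
byte[] and decodes it again when the META tag disagrees with the header.
Everything here is an assumption for illustration: the class name, the
deliberately loose regular expression, and the policy that the META tag
wins. It uses only the JDK and is not how HTMLParser does it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Redecode {
    // Hypothetical, loose pattern for charset=... inside a META tag.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][\\w.:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Decode with the header charset first; if a META tag names a different,
    // supported charset, decode the original bytes again with that one.
    static String decode(byte[] raw, Charset headerCharset) {
        String first = new String(raw, headerCharset);
        Matcher m = META_CHARSET.matcher(first);
        if (m.find() && Charset.isSupported(m.group(1))) {
            Charset meta = Charset.forName(m.group(1));
            if (!meta.equals(headerCharset)) {
                return new String(raw, meta);
            }
        }
        return first;
    }

    public static void main(String[] args) {
        String page = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"><p>caf\u00e9</p>";
        byte[] raw = page.getBytes(StandardCharsets.UTF_8);
        // The header claimed ISO-8859-1, the META tag says UTF-8.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals(page)); // prints true
    }
}
```

The sniffing step works because the META tag itself is plain ASCII, so it
survives the first decode under most single-byte charsets.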

Thanks,

Ian

Re: [Htmlparser-user] Charset and multiple reparsing questions

From: Derrick O. <Der...@Ro...> - 2006-06-09 01:41:48

Ian,

Don't you just hate Windows Search - completely broken and it's been 
that way for a half dozen years.
But if you complain it doesn't get you anywhere...

Correct so far.

- The interpretation of the bytes from the input stream follows the META
  tag once that's encountered. A Java String doesn't really have a
  charset as far as I can tell - it's Unicode UTF-16 (I may be wrong
  here), so the answer is: it will have the 'correct' charset - whatever
  set it last, header or META.
  The regenerated toHtml() String will have the 'correct' charset
  because it's coming from an array of char, which (as far as I can
  tell) covers most of the possible charsets.
  [I say most because a 16-bit char can't hold every Unicode character;
  Java 1.5 (Tiger) represents the rest as surrogate pairs and added int
  code point methods for them, while char itself stays 16 bits.]
  Now if you want to write an array of bytes to disk or pass a string to
  another program with 8-bit chars, you need to choose an encoding that
  can accommodate your charset... whole new ballgame.
  Bottom line is the encoding only matters if it's converted to bytes (I
  think).
  Check out the 'Save As Unicode' option in Notepad: it doesn't ask for
  a charset (but then again, it may *know* the charset from user
  settings), but it does set the encoding (UTF-16 little-endian, I
  believe) for the file on disk.

- In a number of places this is exactly the processing used: reset()
  followed by a reparse; see for example StringBean.setStrings (). The
  point is that the client *must* rehandle the nodes it was given,
  usually by starting from scratch - I don't know any other way -
  because what it was given was erroneous. In the case of a String as
  input, a reparse won't yield any different characters (they just come
  from the String via charAt(), and that won't change because the String
  is immutable), so the reset is redundant, except that the StringSource
  will have its encoding (member variable) set correctly the second time
  so the hiccup won't happen twice. But the conversion from the byte
  stream to a String has to have been correct regardless of what the
  HTTP header says, otherwise you're stuck.
  So if the HttpClient gives you a String you have to ask:
  - did it look at the META tag?
  - is the META tag correct?
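
A small JDK-only sketch of the "encoding only matters if it's converted
to bytes" point, plus the 16-bit char caveat. The class and helper names
are made up for illustration. The same String yields different byte
sequences depending on the encoding chosen at write time, and a
character outside the 16-bit range occupies two chars.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    // Number of bytes the String occupies once encoded with the given charset.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    // Build a String holding a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // 4 chars in memory, whatever the source charset was

        System.out.println(byteLength(s, StandardCharsets.ISO_8859_1)); // prints 4
        System.out.println(byteLength(s, StandardCharsets.UTF_8));      // prints 5
        System.out.println(byteLength(s, StandardCharsets.UTF_16LE));   // prints 8

        // U+20000 is a CJK ideograph outside the 16-bit range.
        String han = fromCodePoint(0x20000);
        System.out.println(han.length());                         // prints 2 (surrogate pair)
        System.out.println(han.codePointCount(0, han.length())); // prints 1
    }
}
```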

If it sounds confused, that's because it probably is - in my own mind.

Derrick

Re: [Htmlparser-user] Charset and multiple reparsing questions

From: Ian M. <ian...@gm...> - 2006-06-09 10:19:47

Yup, internationalization issues can be pretty confusing :)

I've written a test file using windows-1251 source file and
windows-specific characters, then changing meta content and headers to
find out what happens in my web browser. The case is as follows:

- With no charset specified in headers or meta tags, it can't work out
the charset
- With no charset specified in headers but charset specified in meta
tag, the meta tag one is used
- When the charset is specified in the headers, it overrides anything
specified in the meta tag
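
The three observed cases boil down to a simple precedence rule, which
could be sketched like this. The class and method names are hypothetical,
and the fallback argument is an assumption on my part: the browser test
above only shows that nothing can be worked out when both are missing.

```java
public class CharsetPrecedence {
    // Header charset wins over the META charset; the fallback is used
    // only when neither is present. Null or empty means "not specified".
    static String resolve(String headerCharset, String metaCharset, String fallback) {
        if (headerCharset != null && !headerCharset.isEmpty()) {
            return headerCharset;
        }
        if (metaCharset != null && !metaCharset.isEmpty()) {
            return metaCharset;
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(resolve("UTF-8", "windows-1251", "ISO-8859-1")); // prints UTF-8
        System.out.println(resolve(null, "windows-1251", "ISO-8859-1"));    // prints windows-1251
        System.out.println(resolve(null, null, "ISO-8859-1"));              // prints ISO-8859-1
    }
}
```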

I'm going to eventually have to emulate that. I'm pretty sure that the
Apache HTTPClient just uses the charset specified in the http headers.
It does have a method to get the response charset specified in the
headers, but unfortunately it looks like if none is specified it
defaults to iso-8859-1 (or whichever is set as default), which means I
can't really tell if it's had one set or not.
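
One possible way around the "can't tell if it was set" problem is to
look at the raw Content-Type header value directly instead of asking for
a defaulted charset. The sketch below assumes the header value is
available as a plain String (how to obtain it from the HTTP client is
left out); the class name and the regular expression are illustrative
only. An empty Optional means no charset parameter was sent.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentTypeCharset {
    // Matches a charset parameter such as "; charset=UTF-8" or ';charset="UTF-8"'.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            ";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or empty if none is given.
    static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=windows-1251")); // prints Optional[windows-1251]
        System.out.println(charsetOf("text/html"));                       // prints Optional.empty
    }
}
```

With this, "header said nothing" and "header said iso-8859-1" become
distinguishable, so the META tag can be consulted only in the first case.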

So I think, for now, I'm going to ignore it and see if it turns out to
be a problem or not :)

Ian
