Can htmlparser use the "charset" in the response header to determine the stream charset? For example:
(Status-Line):HTTP/1.1 200 OK
Date:Fri, 26 Aug 2005 03:10:50 GMT
P3P:policyref="http://privacy.yahoo.co.jp/w3c/p3p.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV"
Expires:-1
Pragma:no-cache
Cache-Control:no-cache
Connection:close
Content-Type:text/html;charset=euc-jp
So far I haven't found any code that handles this, so I wrote some crude code to extract this information from the response header before initializing htmlparser. But that means making the connection twice, which is really a waste of resources. Is it possible to do this within one connection?
The charset processing is located in org.htmlparser.lexer.Page, specifically in the setConnection(URLConnection) method. It should already be handled automatically for you.
You can check the charset in effect indirectly by querying the Page via getEncoding(). This returns the encoding the Reader is using to convert bytes to characters, which should be the Java equivalent of the HTTP charset specified in the header.
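To illustrate the idea of reading the charset and the body from the same connection, here is a rough sketch using only the JDK. It is not the code in Page.setConnection; the class name HeaderCharset, the method names, and the ISO-8859-1 fallback are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URLConnection;

// Hypothetical helper, not part of htmlparser.
final class HeaderCharset {
    // Assumed fallback when the header names no charset.
    static final String FALLBACK = "ISO-8859-1";

    // Pull the charset parameter out of a Content-Type value such as
    // "text/html;charset=euc-jp"; return the fallback if there is none.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }

    // One connection: the header and the body stream both come from conn.
    static Reader open(URLConnection conn) throws IOException {
        String charset = charsetOf(conn.getContentType(), FALLBACK);
        return new InputStreamReader(conn.getInputStream(), charset);
    }
}
```

With the header from the question, charsetOf would return "euc-jp", and open would wrap the same connection's stream in a Reader for that charset, so no second request is needed. With htmlparser itself you should not need any of this, since the Page does the equivalent work when it is given the connection.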
Note that a <META> tag with a charset in the header of the HTML document overrides the one specified in the HTTP header, and the underlying bytes are rescanned from the beginning with the new charset to ensure there isn't a conflict. If there is, an EncodingChangeException is thrown.
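As a rough picture of what such a rescan could check, the sketch below decodes the bytes consumed so far under both charsets and compares the results. This is only one plausible reading of "conflict", not the library's actual implementation; RescanCheck and prefixAgrees are hypothetical names.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of htmlparser.
final class RescanCheck {
    // True if the first prefixBytes bytes decode to the same characters
    // under both charsets, i.e. switching mid-stream changes nothing
    // that has already been read.
    static boolean prefixAgrees(byte[] bytes, int prefixBytes,
                                String oldCharset, String newCharset) {
        String before = new String(bytes, 0, prefixBytes, Charset.forName(oldCharset));
        String after = new String(bytes, 0, prefixBytes, Charset.forName(newCharset));
        return before.equals(after);
    }
}
```

Pure ASCII markup ahead of the <META> tag decodes identically in most charsets, so the switch is usually harmless. A non-ASCII byte before the tag is the case where the two decodings can differ, which is presumably when the exception comes into play.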