About the "charset" in response header?

jerry_tian
2005-08-26
2013-04-27
  • jerry_tian

    jerry_tian - 2005-08-26

    Can htmlparser use the "charset" in the response header to determine the stream's charset? For example:

    (Status-Line):HTTP/1.1 200 OK
    Date:Fri, 26 Aug 2005 03:10:50 GMT
    P3P:policyref="http://privacy.yahoo.co.jp/w3c/p3p.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV"
    Expires:-1
    Pragma:no-cache
    Cache-Control:no-cache
    Connection:close
    Content-Type:text/html;charset=euc-jp

    So far I haven't found any code that handles this. I wrote some crude code to extract this information from the response header before initializing htmlparser, but it has to make the connection twice, which is a waste of resources. Is it possible to do this within one connection?

     
    • Derrick Oswald

      Derrick Oswald - 2005-08-26

      The charset processing is located in org.htmlparser.lexer.Page, specifically in the setConnection(URLConnection) method. It should already be handled automatically for you.

      You can check the charset in effect indirectly by calling getEncoding() on the Page. This returns the encoding the Reader is using to convert bytes to characters, which should be the Java equivalent of the HTTP charset specified in the header.
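
      For illustration only, here is roughly what reading the charset from the header and mapping it to a Java encoding amounts to, using just the JDK. This is a sketch, not htmlparser's actual code: the class HeaderCharset and both method names are made up for this example, and the fallback charset is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of htmlparser.
public class HeaderCharset {

    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetParameter(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type (e.g. text/html); parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="UTF-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Maps an HTTP charset name to a Java Charset, or returns the fallback if unknown. */
    static Charset toJavaCharset(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both illegal names and charsets this JVM does not support.
            return fallback;
        }
    }

    public static void main(String[] args) {
        String header = "text/html;charset=euc-jp"; // the header from the question
        System.out.println(charsetParameter(header));
    }
}
```

      With a single java.net.URLConnection, the header value is available from getContentType() on that same connection, so no second request is needed to read it.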

      Note that a <META> tag with a charset in the head of the HTML document overrides the one specified in the HTTP header. When such a tag is encountered, the underlying bytes are rescanned from the beginning with the new charset to ensure there isn't a conflict. If there is, an EncodingChangeException is thrown.
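
      The idea behind the rescan can be sketched with the JDK alone: decode the same bytes with the old and the new charset, and compare the characters that were already handed out. This is a simplified illustration under my own assumptions, not the library's implementation. RescanCheck, switchCharset and CharsetConflictException are hypothetical names, and the whole input is assumed to be in memory.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of "rescan with the new charset and compare".
public class RescanCheck {

    /** Stand-in for the exception raised when the two decodings disagree. */
    static class CharsetConflictException extends Exception {
        CharsetConflictException(String message) {
            super(message);
        }
    }

    /**
     * Re-decodes the bytes with newCs and checks that the first charsRead
     * characters match what oldCs produced. Returns the new decoding if they
     * match; otherwise the characters already consumed were wrong.
     */
    static String switchCharset(byte[] bytes, int charsRead, Charset oldCs, Charset newCs)
            throws CharsetConflictException {
        String before = new String(bytes, oldCs);
        String after = new String(bytes, newCs);
        int n = Math.min(charsRead, before.length());
        if (after.length() < n || !before.regionMatches(0, after, 0, n)) {
            throw new CharsetConflictException(
                "characters already read differ between " + oldCs + " and " + newCs);
        }
        return after;
    }

    public static void main(String[] args) throws Exception {
        // Pure ASCII decodes the same either way, so the switch is harmless.
        byte[] ascii = "<html><head>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(
            switchCharset(ascii, ascii.length, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```

      If everything read before the <META> tag is plain ASCII, which is the usual case, most charsets agree on those characters and the switch goes through without an exception.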

       
