#29 URL escape is UTF-8, not iso-8859-1

Status: open-accepted
Owner: nobody
Labels: None
Priority: 5
Updated: 2008-03-10
Created: 2008-03-07
Private: No

When the request URL includes URL escapes (%HH), they need to be decoded as UTF-8, not ISO-8859-1. See RFC 3986 section 2.5 for the authoritative specification.

I attached a patch to fix this issue.
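
For illustration only (this is not the attached patch), a minimal sketch of the difference using java.net.URLDecoder's form-style decoding:

    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;

    // Sketch only: the same escaped bytes decoded two ways.
    public class EscapeDecodingDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String raw = "%E3%81%82"; // the UTF-8 bytes of the hiragana あ (U+3042), percent-escaped
            // Decoding as UTF-8 recovers the original character.
            System.out.println(URLDecoder.decode(raw, "UTF-8"));
            // Decoding as ISO-8859-1 yields three garbled Latin-1 characters instead.
            System.out.println(URLDecoder.decode(raw, "ISO-8859-1"));
        }
    }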

Discussion

  • Kohsuke Kawaguchi

    Patch

     
  • Rick Knowles - 2008-03-08

    Unfortunately it's not that simple. I understand the spec says always UTF-8, but the browsers don't follow it.

    The browser encodes GET form parameters using the page encoding (hence the Tomcat hack URIEncoding="xxx" on the connector), so a Shift_JIS-encoded page with a form using method="get" will submit Shift_JIS-encoded parameters (confirmed for Firefox, IE 5/6/7 and Safari while I've been developing in Japanese).
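
    To make that concrete, a small sketch (plain JDK, nothing Winstone-specific) of how the same character gets escaped differently depending on the page encoding:

        import java.io.UnsupportedEncodingException;
        import java.net.URLEncoder;

        // Sketch only: what the browser puts in the query string depends on the page encoding.
        public class FormEncodingDemo {
            public static void main(String[] args) throws UnsupportedEncodingException {
                String kana = "\u3042"; // hiragana あ
                // A Shift_JIS page submits the Shift_JIS bytes: %82%A0
                System.out.println(URLEncoder.encode(kana, "Shift_JIS"));
                // A UTF-8 page submits the UTF-8 bytes: %E3%81%82
                System.out.println(URLEncoder.encode(kana, "UTF-8"));
            }
        }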

    I actually coded it the spec way in the beginning, but once it became apparent that the only way you would ever get a querystring following the spec was if someone built it manually, I figured that spec point was generally ignored by the people who matter.

    To be honest, it seems kind of a stupid requirement anyway. There's no reason for it, since parsing of the query string can easily be delayed until after the Content-Type header has been read (or even until request.getParameter or request.getQueryString is called), at which point we can treat it like the rest of the request body.
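
    A rough sketch of the delayed parsing I mean (hypothetical names, not Winstone's actual classes):

        import java.io.UnsupportedEncodingException;
        import java.net.URLDecoder;
        import java.util.HashMap;
        import java.util.Map;

        // Sketch only: keep the raw query string and decode it lazily, so whatever
        // encoding the Content-Type header (or setCharacterEncoding) establishes
        // can be used instead of a hard-coded UTF-8.
        class LazyQueryString {
            private final String raw;                // undecoded, as received on the request line
            private String encoding = "ISO-8859-1";  // servlet-spec default until overridden
            private Map<String, String> params;      // null until first access

            LazyQueryString(String raw) { this.raw = raw; }

            void setCharacterEncoding(String enc) { this.encoding = enc; }

            String getParameter(String name) throws UnsupportedEncodingException {
                if (params == null) {                // first access: decode with the final encoding
                    params = new HashMap<String, String>();
                    for (String pair : raw.split("&")) {
                        int eq = pair.indexOf('=');
                        String key = URLDecoder.decode(eq < 0 ? pair : pair.substring(0, eq), encoding);
                        String value = eq < 0 ? "" : URLDecoder.decode(pair.substring(eq + 1), encoding);
                        if (!params.containsKey(key)) {  // keep only the first value, for brevity
                            params.put(key, value);
                        }
                    }
                }
                return params.get(name);
            }
        }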

    Anyway, if you have an example of a page that produces a spec-compliant querystring without manual generation, please submit it and re-open the bug; otherwise this is old territory.

    Thanks,

    Rick

     
  • Rick Knowles - 2008-03-08
    • status: open --> closed-wont-fix
     
  • Kohsuke Kawaguchi

    OK, in that case, given that Winstone is for embedded use, I'd like this to be configurable: my webapp serves all its pages in UTF-8 (which is the norm in webapps), so Winstone decoding this in the system default encoding is not helping.

    I didn't quite follow the part where you said "it seems kind of a stupid requirement." The servlet spec seems pretty clear that the decoding is done by the container, so I don't think your proposed scheme would work.

     
  • Kohsuke Kawaguchi

    • status: closed-wont-fix --> open-wont-fix
     
  • Rick Knowles - 2008-03-10
    • status: open-wont-fix --> open-accepted
     
  • Rick Knowles - 2008-03-10

    I think there's an incorrect assumption here. The fact that it gets parsed as ISO-8859-1 is not because that's the system default; it's because that's the servlet-spec-mandated default charset for the request body when none is supplied. If your pages are using UTF-8 as the body encoding, then you really should have a filter that calls request.setCharacterEncoding("UTF-8"), or at least be able to rely on the Content-Type header being read to set the request encoding.

    Perhaps this is the thing Winstone is not doing correctly: implicitly picking up the character encoding from the request's Content-Type header. I'll look into this; in the meantime, a workaround would be adding a character encoding filter that defaults to UTF-8.
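
    For reference, such a filter is only a few lines (sketch only, not something that ships with Winstone):

        import java.io.IOException;
        import javax.servlet.*;

        // Sketch of the workaround: default the request encoding to UTF-8
        // whenever the client did not send a charset of its own.
        public class Utf8DefaultFilter implements Filter {
            public void init(FilterConfig config) {}

            public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                    throws IOException, ServletException {
                if (request.getCharacterEncoding() == null) {
                    request.setCharacterEncoding("UTF-8"); // must run before any getParameter() call
                }
                chain.doFilter(request, response);
            }

            public void destroy() {}
        }

    Mapped to /* in web.xml, it has to sit in front of anything that reads request parameters.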

    Regarding the stupid requirement bit: I was actually referring to the URI spec requirement you mentioned. Certainly the servlet spec makes it clear the container is responsible for decoding, but forcing UTF-8 as a separate, unalterable encoding on the querystring only seems like a misguided statelessness hack for the sake of stream-based servers that want to decode the query string before reading the Content-Type header three or four lines later. Probably some legacy thing from HTTP/1.0; in any case, all the browsers ignore it.

    Thanks - I hadn't noticed this; I'll look into it.

    Rick

    PS: surely you don't believe that UTF-8 is "the norm in webapps" from a global perspective, even now? From surfing Japanese and Chinese sites, I'd guess that UTF-8 penetration in the CJK market is still sub-20%, mainly because it inflates the size of CJK text by roughly half for no benefit (not to mention the poor support for extended charsets). Anyway, given the choice of sending 2 or 3 bytes per character, most CJK sites will pick the 2-byte option, so Shift_JIS, EUC-JP, Big5 and GB2312 are still entrenched.

     
