When a request URL includes URL escapes (%HH), they need to be decoded as UTF-8, not ISO-8859-1. See RFC 3986 section 2.5, which recommends UTF-8 as the encoding for percent-encoded character data.
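To make the report concrete, here is a minimal Java sketch of how the same escaped bytes decode under the two charsets. It is only an illustration: the class name `EscapeDecodeDemo` and the sample string are assumptions, not anything taken from Winstone.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class EscapeDecodeDemo {
    // Decode %HH escapes, interpreting the resulting bytes in the given charset.
    static String decode(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9 (e with acute accent), escaped as its UTF-8 bytes C3 A9.
        String escaped = "caf%C3%A9";
        // Decoded as UTF-8, the two bytes form the single character U+00E9.
        System.out.println(decode(escaped, "UTF-8"));
        // Decoded as ISO-8859-1, each byte becomes its own character (U+00C3, U+00A9).
        System.out.println(decode(escaped, "ISO-8859-1"));
    }
}
```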
Patch
Logged In: YES
user_id=716353
Originator: NO
Unfortunately it's not that simple. I understand the spec says always UTF-8, but the browsers don't follow the spec.
The browser encodes GET form parameters using the page encoding (hence the Tomcat hack URIEncoding="xxx" on the connector), so a form using method="get" on a Shift_JIS encoded page will submit Shift_JIS encoded parameters (confirmed for Firefox, IE5/6/7 and Safari while I've been developing in Japanese).
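As a rough illustration of the behaviour described above, the following Java sketch uses `URLEncoder` to stand in for the browser: the same form value is escaped once as a Shift_JIS page would submit it and once as a UTF-8 page would. The class name `PageEncodingDemo` and the sample value are assumptions, and it presumes the JDK ships the Shift_JIS charset (standard JDK distributions normally do).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class PageEncodingDemo {
    // Escape a form value the way a page in the given charset would submit it.
    static String encode(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + pageCharset, e);
        }
    }

    // Unescape a query value, interpreting the bytes in the given charset.
    static String decode(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String value = "\u65e5\u672c"; // two kanji (the word "Japan")
        String fromSjisPage = encode(value, "Shift_JIS");
        String fromUtf8Page = encode(value, "UTF-8");
        // The escaped forms differ, so the server cannot use one fixed charset for both.
        System.out.println(fromSjisPage);
        System.out.println(fromUtf8Page);
        // Only decoding with the page's own charset recovers the original value.
        System.out.println(decode(fromSjisPage, "Shift_JIS").equals(value));
        System.out.println(decode(fromSjisPage, "UTF-8").equals(value));
    }
}
```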
I actually coded it the spec way in the beginning, but once it became apparent that the only way you would ever get a request following the spec was if someone built a querystring manually, I figured that spec point was generally ignored by the people who matter.
To be honest, it seems kind of a stupid requirement anyway. There's no reason for it, since the parsing of the query string can easily be delayed until after the content-type header has been read in (or even until request.getParameter or request.getQueryString is called), at which time we can treat it like the rest of the request body.
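The delayed parsing described above could look roughly like the sketch below. This is not Winstone's actual code: `LazyQueryString` and its methods are hypothetical names, and the ISO-8859-1 default is an assumption chosen to mirror the servlet default for request bodies.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class LazyQueryString {
    private final String raw;                // query string exactly as received
    private String encoding = "ISO-8859-1";  // assumed default until told otherwise
    private Map<String, String> params;      // stays null until first parameter access

    LazyQueryString(String raw) {
        this.raw = raw;
    }

    // May be called any time before the first getParameter call,
    // e.g. once the headers have been read or from a filter.
    void setCharacterEncoding(String encoding) {
        if (params != null) {
            throw new IllegalStateException("Query string already parsed");
        }
        this.encoding = encoding;
    }

    // Parsing happens here, on first access, with whatever encoding is set by then.
    String getParameter(String name) {
        if (params == null) {
            params = parse();
        }
        return params.get(name);
    }

    private Map<String, String> parse() {
        Map<String, String> out = new LinkedHashMap<>();
        if (raw == null || raw.isEmpty()) {
            return out;
        }
        for (String pair : raw.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            try {
                // Keep the first value when a name is repeated.
                out.putIfAbsent(URLDecoder.decode(key, encoding),
                                URLDecoder.decode(value, encoding));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unknown charset: " + encoding, e);
            }
        }
        return out;
    }
}
```

The raw string is kept untouched, so nothing is lost by waiting: the charset only matters at the moment a parameter is first read.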
Anyway ... if you have an example of a page that results in a querystring that follows the spec behaviour without manual generation, please submit it and re-open the bug, but otherwise this is old territory.
Thanks,
Rick
Logged In: YES
user_id=179238
Originator: YES
OK, in that case, given that Winstone is for embedded use, I'd like this to be configurable. My webapp serves all its pages in UTF-8 (and this is the norm in webapps), so Winstone decoding this in the system default encoding is not helping.
I didn't quite follow the part where you said "it seems kind of a stupid requirement." I thought the servlet spec is pretty clear that the decoding is done by the container, so I don't think your proposed scheme would work.
Logged In: YES
user_id=716353
Originator: NO
I think there's an incorrect assumption here. The fact that it gets parsed as ISO-8859-1 is not because it's the system default. It's because that's the servlet spec mandated default charset for request body encoding if none is supplied. If your pages are using UTF-8 as a body encoding, then you really should have a filter that calls request.setCharacterEncoding("UTF-8"), or at least be able to rely on the content-type header being read in to set the request encoding.
Perhaps this is the thing Winstone is not doing correctly: implicitly picking up the character encoding from the request's content-type header. I'll look into this. In the meantime, a workaround would be adding a character encoding filter to default to UTF-8.
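Picking up the encoding from the content-type header, with a fallback to the servlet default, might be sketched in plain Java as follows. `RequestCharset` and `fromContentType` are hypothetical names for illustration; a real container or filter would pass the result to `request.setCharacterEncoding(...)` rather than return a string.

```java
import java.util.Locale;

final class RequestCharset {
    // Default request encoding when none is supplied, as stated in the comment above.
    static final String SERVLET_DEFAULT = "ISO-8859-1";

    // Return the charset parameter of a Content-Type header value,
    // or the fallback when the header is absent or carries no charset.
    static String fromContentType(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The parameter value may be quoted: charset="UTF-8"
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType(
                "application/x-www-form-urlencoded; charset=UTF-8", SERVLET_DEFAULT));
        System.out.println(fromContentType("text/html", SERVLET_DEFAULT));
    }
}
```

A filter-based workaround would amount to using "UTF-8" as the fallback instead of `SERVLET_DEFAULT`.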
Regarding the stupid requirement bit: I was actually referring to the spec point that you mentioned. Certainly the servlet spec makes it clear the container is responsible for decoding, but forcing UTF-8 as a separate, unalterable encoding on the querystring alone seems like a misguided statelessness hack for the sake of stream-based servers that want to decode the query string before reading the content-type header 3 or 4 lines later. Probably some legacy thing from HTTP/1.0; in any case, all the browsers ignore it.
Thanks - hadn't noticed this, will look into it.
Rick
PS: Surely you don't believe that UTF-8 is "the norm in webapps" from a global perspective, even now? From surfing Japanese and Chinese sites, I'd guess that UTF-8 penetration in the CJK market is still sub-20%, mainly because it needlessly inflates the text of your pages by about 50% (not to mention the poor support for extended charsets). Given the choice of sending 2 or 3 bytes per character, most CJK sites will pick the 2-byte option, so Shift_JIS, EUC-JP, Big5 and GB2312 are still entrenched.
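As a quick check of the per-character byte counts, this small Java sketch measures the same three-kanji string in UTF-8 and in Shift_JIS. The class name `CjkByteCount` and the sample text are assumptions, and it presumes the JDK provides the Shift_JIS charset.

```java
import java.nio.charset.Charset;

final class CjkByteCount {
    // Number of bytes needed to encode the text in the named charset.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c\u8a9e"; // three kanji (the word "Japanese")
        // Each of these kanji takes 3 bytes in UTF-8 and 2 bytes in Shift_JIS.
        System.out.println("UTF-8:     " + byteLength(text, "UTF-8") + " bytes");
        System.out.println("Shift_JIS: " + byteLength(text, "Shift_JIS") + " bytes");
    }
}
```

ASCII markup is one byte per character in both encodings, so the difference applies only to the CJK text itself.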