Sometimes you need to build a JSP page and send its contents to the client with a charset other than ISO-8859-1. This is done with a page directive like the one in the following box, which declares that you are sending the client an HTML page (MIME type text/html) and that the byte stream is encoded with the ISO-8859-5 charset, used to represent the Cyrillic alphabet. More on charsets, if you need it.
<%@ page contentType="text/html;charset=ISO-8859-5" %>
This all works. But if that page contains a <FORM> with some text fields, you can run into trouble. All the text a user types into the fields is sent to the server when the submit button is pressed. Users are not limited to standard ASCII characters and, clearly, if yours is a Russian site, they will want to write in Russian.
Which charset does the browser use to encode the user input? Normally it chooses the charset of the page itself. So if you send an ISO-8859-5 page, you get back ISO-8859-5 encoded data.
But, through a shortcoming common to all browsers, when they send back the form fields they don't include the charset in the request headers. So the request is missing the information "charset=ISO-8859-5". How does a server deal with such a request? We will take a look at Tomcat 5.0.28 and at its request implementation. Below is an extract:
private void mergeParameters() {
    if ((queryParamString == null) || (queryParamString.length() < 1))
        return;
    HashMap queryParameters = new HashMap();
    String encoding = getCharacterEncoding();
    if (encoding == null)
        encoding = "ISO-8859-1";
    try {
        RequestUtil.parseParameters(queryParameters, queryParamString, encoding);
    } catch (Exception e) {
        ;
    }
    [...]
As you can see, Tomcat tries to find an encoding indication inside the request and, if it can't find one, falls back to ISO-8859-1. Tomcat will never find a charset in a request, at least as long as it relies on the browser!
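To make the effect of that fallback concrete, here is a minimal sketch that uses only the JDK's URLEncoder and URLDecoder, not Tomcat's own RequestUtil. The class name FormEncodingSketch, its method names and the sample word are made up for illustration, and the sketch assumes the JVM provides the ISO-8859-5 charset, as common JDKs do.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, not part of Tomcat.
final class FormEncodingSketch {
    // A Russian word written with Unicode escapes, so the encoding of
    // this source file does not matter.
    static final String TYPED = "\u041F\u0440\u0438\u0432\u0435\u0442";

    // Roughly what a browser does for a form on an ISO-8859-5 page:
    // encode the text with the page charset, then percent-escape the bytes.
    static String asSentByBrowser(String text) {
        try {
            return URLEncoder.encode(text, "ISO-8859-5");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("ISO-8859-5 not available", e);
        }
    }

    // Decode the wire form with whatever charset the server assumes.
    static String asDecodedByServer(String wire, String assumedCharset) {
        try {
            return URLDecoder.decode(wire, assumedCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(assumedCharset + " not available", e);
        }
    }

    public static void main(String[] args) {
        String wire = asSentByBrowser(TYPED);
        System.out.println(wire); // %BF%E0%D8%D2%D5%E2
        // Decoding with the charset the browser used restores the text...
        System.out.println(TYPED.equals(asDecodedByServer(wire, "ISO-8859-5"))); // true
        // ...while the ISO-8859-1 fallback yields different characters.
        System.out.println(TYPED.equals(asDecodedByServer(wire, "ISO-8859-1"))); // false
    }
}
```

Nothing in the wire form says which charset produced the bytes, which is why the server has to guess.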
You might think of a workaround such as setting the encoding just before using the request inside your JSP or your servlet. But this is not reliable: the encoding has to be set before the request parameters are parsed, and once they have been parsed there is no way to force a reparse with a different charset encoding.
There are two possible solutions: the first is to re-decode the parameters on the fly, the second is to use a filter. The first is more of a patch than a good solution.
So, you have received some parameters encoded with ISO-8859-5 and Tomcat has decoded them as ISO-8859-1. You need to do a further conversion; look at this code:
String value = new String(request.getParameter("name").getBytes("ISO-8859-1"), "ISO-8859-5");
It extracts the byte "stream" of the original parameter called "name" (assuming that Tomcat has decoded it with the standard ISO-8859-1) and rebuilds the string using another charset, the one we know the browser used. You can write a utility method to do this work, but you need to change all your JSP code.
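Such a utility method might look like the sketch below. The class and method names (ParamRedecoder, redecode) are hypothetical, and the code uses java.nio.charset.StandardCharsets, which needs Java 7 or later and so is newer than what Tomcat 5 originally ran on. It relies on the assumption stated above, namely that the container decoded the parameter as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tomcat or the servlet API.
final class ParamRedecoder {
    private ParamRedecoder() {}

    /**
     * Re-decodes a parameter value that the container decoded as ISO-8859-1
     * although the browser sent it in another charset.
     * A missing parameter (null) is passed through unchanged.
     */
    static String redecode(String value, Charset actual) {
        if (value == null) {
            return null;
        }
        // ISO-8859-1 maps every byte to exactly one character, so this
        // recovers the original bytes without loss.
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }
}
```

In a JSP or servlet it would be called as ParamRedecoder.redecode(request.getParameter("name"), Charset.forName("ISO-8859-5")). The null check matters because request.getParameter() returns null for a missing parameter, which the one-line version would turn into a NullPointerException.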
Every servlet container can be extended with filters, which are fairly simple classes that preprocess requests or responses going into and out of the server. A request filter is called before basic operations such as parameter decoding, so we have a chance to modify something in the request so that Tomcat decodes it correctly.
In the Tomcat 5.0.28 sources there is a filter (SetCharacterEncodingFilter.java) which intercepts requests and checks whether they have an encoding specified. If not, it sets the request encoding to the one specified in its configuration.
So, for a whole web application, or even for only some of its pages, we can use this filter and ensure that request parameters are decoded with the correct charset. As an example, below you can find a simple web application in WAR format to deploy under Tomcat, just to see the effect. The example also includes a demonstration of the first method I discussed.
As you will see, the web application has a page which declares itself to be encoded as ISO-8859-5 (the charset for Cyrillic sites): /filtered/index.jsp. The page contains two forms, into which you can paste some Cyrillic characters, copying them from the text above the forms (on the same page).
One of the forms posts to a page (post.jsp) which is intercepted by the filter; the other does not. It's easy to see the difference: the filtered request is decoded correctly, while the other gives you question marks instead of Cyrillic characters.
But why exactly does this happen? It's simple: Tomcat receives some text with non-ASCII characters encoded with the ISO-8859-5 table. It parses the request and extracts some bytes that it assumes to be ISO-8859-1, then builds a string by converting the parameter bytes to Unicode characters (strings in Java are ALWAYS Unicode) using the wrong ISO-8859-1 table. This works, because bytes are just bytes and Tomcat knows nothing about our charset. When we try to print this string in a page (post.jsp) which is ISO-8859-5, Tomcat has to convert a Unicode string containing ISO-8859-1 characters that have no counterpart in ISO-8859-5, so it writes out a "?" for each character that doesn't match.
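The chain of conversions can be reproduced outside Tomcat with plain JDK calls. The following is only a sketch: the class and method names are invented for illustration, it assumes the JVM provides ISO-8859-5, and it uses String.getBytes(Charset) as a stand-in for the response writer, since that method also replaces unmappable characters with the charset's replacement byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only imitates the conversions described above.
final class CharsetMismatchDemo {
    static final Charset CYRILLIC = Charset.forName("ISO-8859-5");

    // Step 1: the browser encodes the typed text with the page charset.
    static byte[] sentByBrowser(String typed) {
        return typed.getBytes(CYRILLIC);
    }

    // Step 2: the container turns the bytes into a string, wrongly
    // assuming ISO-8859-1. Every byte becomes some Latin-1 character.
    static String decodedByContainer(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Step 3: the string is written to an ISO-8859-5 page. Characters with
    // no ISO-8859-5 code are replaced by '?'. Decoding the result again
    // shows what the user ends up seeing.
    static String shownOnPage(String text) {
        return new String(text.getBytes(CYRILLIC), CYRILLIC);
    }

    public static void main(String[] args) {
        String typed = "\u041F\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters
        String wrong = decodedByContainer(sentByBrowser(typed));
        System.out.println(shownOnPage(wrong)); // prints ??????
        // With the right charset nothing is lost on the way out.
        System.out.println(shownOnPage(typed).equals(typed)); // prints true
    }
}
```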