From: Stephane B. <ste...@be...> - 2005-03-04 14:12:43
I'm posting here a small document I usually keep around to help people
understand I18N issues in webapps rather than just hack everything
around without a clue of what they are doing. This is all the
information I sent to Yannick Q. to help him solve his problem last week.
There are also a couple of statements I use in the slides of my J2EE
training session.

I18N is quite complex for web applications, as we have to deal with many
issues (the database, for example), legacy behavior from browsers, and
strange behavior that may arise when your setup is not fully coherent.
Even when it is, it is still a challenge, as you will encounter various
issues. I believe Chinese and Japanese fellows know more about it than I
do, but here it is:

Enjoy.

INTRODUCTION
____________

The following document hopes to give tips on what to take care of while
doing I18N on webapps.
This information is subject to change at any time, but it should give
enough background to figure out where to look for such issues.

Articles:

Multibyte-character processing in J2EE
Develop J2EE applications with multibyte characters
http://www.javaworld.com/javaworld/jw-04-2004/jw-0419-multibytes.html

Tutorial: Character sets & encodings in XHTML, HTML and CSS
http://www.w3.org/International/tutorials/tutorial-char-enc.html

Tutorial: Using language information in XHTML, HTML and CSS
http://www.w3.org/International/tutorials/tutorial-lang/

SERVLET SPECIFICATIONS
======================

From servlet specs 2.3 (p.37)

SRV.4.9 Request data encoding

Currently, many browsers do not send a char encoding qualifier with the
Content-Type header, leaving open the determination of the character
encoding for reading HTTP requests. The default encoding of a request the
container uses to create the request reader and parse POST data must be
“ISO-8859-1”, if none has been specified by the client request.
However, in order to indicate to the developer in this case the failure of
the client to send a character encoding, the container returns null from
the getCharacterEncoding method.

If the client hasn’t set character encoding and the request data is
encoded with a different encoding than the default as described above,
breakage can occur. To remedy this situation, a new method
setCharacterEncoding(String enc) has been added to the ServletRequest
interface. Developers can override the character encoding supplied by the
container by calling this method. It must be called prior to parsing any
post data or reading any input from the request. Calling this method once
data has been read will not affect the encoding.

The following information was compiled by Mark Thomas on the tomcat-user
mailing list in late April 2004.

REQUESTS
========

There are a number of situations where there may be a requirement to use
non-US ASCII characters in a URI. These include:
- Parameters in the query string
- Servlet paths

There is a standard for encoding URIs
(http://www.w3.org/International/O-URL-code.html) but this standard is not
consistently followed by clients. This causes a number of problems.

The functionality provided by Tomcat (4 and 5) to handle this less than
ideal situation is described below.

1. The Coyote HTTP/1.1 connector has a useBodyEncodingForURI attribute
   which, if set to true, will use the request body encoding to decode the
   URI query parameters.
   - The default value is true for TC4 (breaks spec but gives consistent
     behaviour across TC4 versions)
   - The default value is false for TC5 (spec compliant but there may be
     migration issues for some apps)
2. The Coyote HTTP/1.1 connector has a URIEncoding attribute which
   defaults to ISO-8859-1.
3. The parameters class (o.a.t.u.http.Parameters) has a
   QueryStringEncoding field which defaults to the URIEncoding. It must be
   set before the parameters are parsed to have an effect.
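
To make the effect of the URI decoding charset concrete, here is a minimal
sketch using only java.net.URLEncoder and java.net.URLDecoder. It is not
Tomcat's own decoding code; the class name UriDecodingDemo and the sample
value are made up for illustration. It percent-encodes a value the way a
UTF-8 client would, then decodes it once as UTF-8 and once as ISO-8859-1
(the URIEncoding default mentioned above).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UriDecodingDemo {

    // Percent-encode a value the way a client sending UTF-8 would.
    static String encodeUtf8(String value) throws UnsupportedEncodingException {
        return URLEncoder.encode(value, "UTF-8");
    }

    // Decode a percent-encoded value with an explicit charset,
    // standing in for whatever charset the container is configured with.
    static String decode(String encoded, String charset) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) throws Exception {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        String encoded = encodeUtf8(original);

        System.out.println(encoded);
        // Same charset on both sides: the value round-trips.
        System.out.println(decode(encoded, "UTF-8").equals(original));
        // Client sent UTF-8, server decodes as ISO-8859-1: the value is corrupted.
        System.out.println(decode(encoded, "ISO-8859-1").equals(original));
    }
}
```

Running it prints caf%C3%A9, then true, then false: the two bytes of the
UTF-8 sequence are each turned into a separate ISO-8859-1 character, which
is the typical garbage seen when the client and the connector disagree.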

Things to note regarding the servlet API:
1. HttpServletRequest.setCharacterEncoding() normally only applies to the
   request body, NOT the URI.
2. HttpServletRequest.getPathInfo() is decoded by the web container.
3. HttpServletRequest.getRequestURI() is not decoded by the container.

Other tips:
1. Use POST with forms to return parameters, as the parameters are then
   part of the request body.

ADDITIONAL RECOMMENDATIONS
==========================

- Always specify the JSP page encoding (default is ISO-8859-1, except if
  you are writing an XML JSP document)
    <%@ page pageEncoding="UTF-8" %>

- Always specify the HTTP response charset within the content type
  (default ISO-8859-1)
    <%@ page contentType="text/html; charset=UTF-8" %>

- Use a meta tag at the very top of the head element (use an ending />
  for XHTML)
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" >

- Always specify the DOCTYPE at the VERY TOP of your document to keep the
  browser from rendering in quirks mode
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 ....>

- Always specify your CSS encoding (FIRST DIRECTIVE)
    @charset "utf-8";

- In your HTML form, specify the accept-charset="UTF-8" attribute

- Use a servlet filter to set the character encoding using
  setCharacterEncoding on the request/response

- For Tomcat, watch out for the HTTP Connector attributes URIEncoding and
  useBodyEncodingForURI

- With a JSP 2.0 servlet container, use <jsp-property-group> to set the
  encoding.
    <jsp-config>
      <jsp-property-group>
        <display-name>JSP Encoding Configuration</display-name>
        <url-pattern>*.jsp</url-pattern>
        <page-encoding>UTF-8</page-encoding>
        ...
      </jsp-property-group>
    </jsp-config>

- Use the JSTL facility to manage resource bundles
    <fmt:message key="button.cancel"/>

- Configure your webapp
  You can configure the default of everything via context-param in web.xml
    <web-app>
      ...
      <context-param>
        <param-name>javax.servlet.jsp.jstl.fmt.fallbackLocale</param-name>
        <param-value>en_US</param-value>
      </context-param>
      <context-param>
        <param-name>javax.servlet.jsp.jstl.fmt.locale</param-name>
        <param-value>fr_FR</param-value>
      </context-param>
      <context-param>
        <param-name>javax.servlet.jsp.jstl.fmt.localizationContext</param-name>
        <param-value>tutorial.j2ee.l10n.Messages</param-value>
      </context-param>
      ...
    </web-app>

- Do not forget that JSTL looks through the different scopes (page,
  request, session, application) before falling back to the default
  locale, so you can still have a per-user configuration. Use Config.set
  rather than a plain setAttribute, because JSTL stores scoped settings
  under a scope-specific attribute name:
    import javax.servlet.jsp.jstl.core.Config;
    //
    Config.set(httpSession, Config.FMT_LOCALE, locale);

- If you use Struts, it is of course highly advised to have both
  localization properties set equally
    import org.apache.struts.Globals;
    //
    httpSession.setAttribute(Globals.LOCALE_KEY, locale);

- Resource bundles (.properties files) are read as ISO-8859-1. If you
  write them in UTF-8, convert them with native2ascii (-encoding UTF-8) so
  that non-ASCII characters become \uXXXX escapes
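
The native2ascii advice can be checked with a small sketch, assuming only
the documented behaviour of java.util.Properties.load(InputStream), which
decodes its input as ISO-8859-1. The class name BundleEncodingDemo and the
key button.cancelled are hypothetical. It loads the same message twice:
once from raw UTF-8 bytes, and once from the escaped form that native2ascii
would produce.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class BundleEncodingDemo {

    // Load an in-memory properties "file" and return the value for one key.
    // Properties.load(InputStream) always decodes the bytes as ISO-8859-1.
    static String loadValue(byte[] fileBytes, String key) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(fileBytes));
        return props.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        String expected = "Annul\u00e9"; // ends with an accented e

        // File saved as UTF-8 without conversion (hypothetical key name).
        byte[] rawUtf8 = ("button.cancelled=" + expected)
                .getBytes(StandardCharsets.UTF_8);

        // Same file after conversion: the accented letter is written as a
        // six-character ASCII escape sequence (backslash, u, 00e9).
        byte[] escaped = "button.cancelled=Annul\\u00e9"
                .getBytes(StandardCharsets.US_ASCII);

        // Raw UTF-8 is misread, the escaped form is read correctly.
        System.out.println(expected.equals(loadValue(rawUtf8, "button.cancelled")));
        System.out.println(expected.equals(loadValue(escaped, "button.cancelled")));
    }
}
```

This prints false, then true: only the escaped file yields the intended
text when it is read as ISO-8859-1.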