From: Erik <eri...@bo...> - 2005-11-23 13:23:19
Hi

I have played with PageKit for some time now, and now I would like to have a site that uses UTF-8 internally. But how do I do that? The easy part is to have all files in UTF-8, save to the DB in UTF-8, and so on. PageKit is smart and sends the page in the encoding the browser prefers; that is not a problem. But how do I handle the input from a form? I mean, how do I know what character encoding the web browser is sending? I can't trust the outgoing encoding, because that is trivial to change in any browser. AFAIK there is no certain way to tell the encoding just by looking at the string.

What are you doing to fix this? On my previous site I "converted" everything to Latin-1, but that was just an ugly hack. utf8::is_utf8() and Encode::is_utf8() won't help; they return false on every string passed by Apache. :/

One way is to lock PageKit down and send everything in UTF-8, because most often the browser will send the reply in UTF-8... but that solution isn't bulletproof. The user can still send in e.g. Latin-1, or the browser may not handle UTF-8 (rare).

Any ideas?

--
/erikg

Erik Günther               eri...@bo...
System Developer           Bokus AB
+46 (0)40 - 35 21 19       icq: 160744619

Fortune:
'Course, I haven't weighed in yet.  :-)
             -- Larry Wall in <199...@wa...>
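(Editorial aside: the check Erik is missing is not Perl's internal UTF-8 flag, which utf8::is_utf8() reports, but whether the raw bytes from Apache form *valid* UTF-8. Below is a minimal sketch of that heuristic, shown in Python for brevity rather than the thread's Perl, where one would use Encode::decode with FB_CROAK; the function name is made up for the example. Note the caveat Erik already raised: some Latin-1 byte sequences also happen to be valid UTF-8, so this is a guess, not a guarantee.)

```python
def guess_decode(raw: bytes) -> str:
    """Heuristic: if the bytes are valid UTF-8, assume UTF-8;
    otherwise fall back to Latin-1 (which never fails, since
    every single byte is a valid Latin-1 character)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

# 'a-umlaut' sent as UTF-8 is the two bytes 0xC3 0xA4:
print(guess_decode(b"\xc3\xa4"))   # ä
# a lone 0xE4 is invalid UTF-8, so it falls back to Latin-1:
print(guess_decode(b"\xe4"))       # ä
```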
From: Shimon R. <sh...@ru...> - 2005-11-23 15:43:19
Erik,

Unfortunately, I don't think there is a perfect solution to this. The browser is supposed to submit any forms using the encoding you served the page in, but there are so many levels of second-guessing about character encodings that this isn't guaranteed.

For my site voo2do.com, I decided that if things weren't going to work perfectly, I might as well keep them simple. So I took the all-UTF-8 approach: I hacked PageKit to always send pages in UTF-8, regardless of what the browser requested. Now people whose browsers don't support UTF-8 can't use non-ASCII characters on my site... but the site has plenty of Javascript that won't work on old browsers anyway, so that probably doesn't hurt anyone.

This isn't perfect, but serving in non-Unicode is problematic too. With an unhacked PageKit, my site would be served using Latin-1, because my browser prefers it to UTF-8 for some reason. If I type some non-Latin-1 characters, my browser will send HTML entity codes. Of course, there is no way to distinguish whether the user actually meant to type %u10123 or whether that's a trick the browser pulled. So I think it's best to just make everything Unicode.

A reasonable alternative might be to hack PageKit to serve in UTF-8 as long as it's one of the browser-supported encodings (even when it's not the preferred one), and only recode if UTF-8 is simply unsupported. Then perhaps you have a slightly better chance of serving pre-UTF-8 browsers.

Good luck, and let us know how it goes.

shimon.

On 11/23/05, Erik Günther <eri...@bo...> wrote:
> [...]
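(Editorial aside: Shimon's "reasonable alternative" — serve UTF-8 whenever the browser lists it at all, even at lower priority — could look roughly like this. A sketch in Python rather than PageKit's Perl, ignoring q=0 exclusions and other Accept-Charset subtleties; the function name is invented for illustration.)

```python
def pick_charset(accept_charset: str, fallback: str = "ISO-8859-1") -> str:
    """Prefer UTF-8 whenever the browser lists it at all (even with a
    lower q-value than other charsets); only fall back when UTF-8 is
    absent. '*' or an empty header also admits UTF-8."""
    accepted = set()
    for part in accept_charset.split(","):
        name = part.split(";")[0].strip().lower()
        if name:
            accepted.add(name)
    if not accepted or "utf-8" in accepted or "*" in accepted:
        return "UTF-8"
    return fallback

# UTF-8 wins even though the browser ranks Latin-1 higher:
print(pick_charset("ISO-8859-1;q=1.0, utf-8;q=0.7"))  # UTF-8
# only when UTF-8 is not listed at all do we recode:
print(pick_charset("ISO-8859-1"))                     # ISO-8859-1
```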
From: Damyan I. <di...@cr...> - 2005-11-24 07:05:31
Shimon Rura wrote:
> Erik,
>
> Unfortunately, I don't think there is a perfect solution to this.

Whatever encoding a browser uses to send data, it is mandatory to supply a correct Content-Type header, right? Can't this be used when determining the request encoding?

(I am not into PK internals, so my suggestion may be well off-track)

dam
--
Damyan Ivanov                       Creditreform Bulgaria
di...@cr...                         http://www.creditreform.bg/
phone: +359(2)928-2611, 929-3993    fax: +359(2)920-0994
mob. +359(88)856-6067               da...@ja.../Gaim
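(Editorial aside: extracting a charset parameter from a Content-Type header is straightforward when it is present; here is a hypothetical sketch in Python. As Erik finds out in the next message, browsers generally omit the parameter on form posts, so in practice this returns nothing.)

```python
def charset_from_content_type(header: str):
    """Return the charset parameter of a Content-Type header, if any.
    Browsers typically send 'application/x-www-form-urlencoded'
    with no charset at all, which is why this idea falls short."""
    for param in header.split(";")[1:]:
        if "=" in param:
            key, _, value = param.partition("=")
            if key.strip().lower() == "charset":
                return value.strip().strip('"').lower()
    return None

print(charset_from_content_type("text/html; charset=UTF-8"))          # utf-8
print(charset_from_content_type("application/x-www-form-urlencoded")) # None
```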
From: Erik <eri...@bo...> - 2005-11-24 08:32:58
On Thu 2005-11-24 at 09:05 +0200, Damyan Ivanov wrote:
> Whatever encoding a browser uses to send data, it is mandatory to supply
> a correct Content-Type header, right? Can't this be used when determining
> the request encoding?

Hmmm, I did some checks on that, and the only Content-Type header is the one from the server to the browser. The other way around I can only find Accept-Charset, and that isn't the same thing.

So AFAIK there is only one way to do this the nice way, and that is to remember what encoding I sent the page in. That should be saved in the session. The bad thing is that _every_ request then needs a session. :/

But suppose we add a new option in the config file, e.g. output_charset (meaning: use this charset if you can, regardless of the priority the browser gives; only if the browser doesn't know this charset, fall back to the browser's priority). Then for browsers that don't accept cookies (and so can't use sessions) we could guess the charset to be output_charset. With this it should be possible, with a pretty good chance of success, to find the right charset encoding from the browser.

Implementation:

* New option in Config.xml that means "use this charset if possible":
  output_charset = "UTF-8"

* When a request arrives, we first look in the session for _encoding to see what encoding the request will most likely be in, and then convert the input to default_input_charset. This only needs to be done on requests with parameters (when QUERY_STRING is set or REQUEST_METHOD='post'). If no session or _encoding exists, then use output_charset.

If output_charset, default_input_charset and default_output_charset are all UTF-8, then there is a pretty small chance that a conversion is ever needed.

If no output_charset exists in the config file, then use the same behavior we have today, with no input conversion.

Ideas? Comments? If not, I'll try to do some/all of this this weekend.
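(Editorial aside: the decision logic Erik proposes might be sketched as below — in Python, standing in for the Perl that would actually go into PageKit. output_charset and the _encoding session key are the names from his proposal; the function names are made up, and the session is modeled as a plain dict, or None when the browser has no cookies.)

```python
def remember_output_encoding(session, charset):
    """Called when a page is served: record the charset so the next
    form submission can be decoded with it."""
    if session is not None:
        session["_encoding"] = charset

def request_encoding(session, output_charset, has_params):
    """Guess the encoding of incoming form data: use the charset we
    remembered serving the page in (stored as _encoding in the
    session), falling back to output_charset when there is no
    session or no recorded value."""
    if not has_params:
        return None                      # nothing to decode
    if session and "_encoding" in session:
        return session["_encoding"]
    return output_charset

sess = {}
remember_output_encoding(sess, "UTF-8")
print(request_encoding(sess, "UTF-8", has_params=True))   # UTF-8
print(request_encoding(None, "UTF-8", has_params=True))   # UTF-8 (guess)
```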
From: Damyan I. <di...@cr...> - 2005-11-24 09:31:51
Erik Günther wrote:
> Hmmm, I did some checks on that, and the only Content-Type header is the
> one from the server to the browser. The other way around I can only find
> Accept-Charset, and that isn't the same thing.

You're right. I should have checked this beforehand. I see a "Content-Type: application/x-www-form-urlencoded" header for POST requests, but even in it there's no charset mentioned. :-(

Here's an excerpt from RFC 2070 - Internationalization of the Hypertext Markup Language. No ideal solution, though :-/

   5.2. Form submission

   The HTML 2.0 form submission mechanism, based on the "application/x-
   www-form-urlencoded" media type, is ill-equipped with regard to
   internationalization. In fact, since URLs are restricted to ASCII
   characters, the mechanism is awkward even for ISO-8859-1 text.
   Section 2.2 of [RFC1738] specifies that octets may be encoded using
   the "%HH" notation, but text submitted from a form is composed of
   characters, not octets. Lacking a specification of a character
   encoding scheme, the "%HH" notation has no well-defined meaning.

   The best solution is to use the "multipart/form-data" media type
   described in [RFC1867] with the POST method of form submission. This
   mechanism encapsulates the value part of each name-value pair in a
   body-part of a multipart MIME body that is sent as the HTTP entity;
   each body part can be labeled with an appropriate Content-Type,
   including if necessary a charset parameter that specifies the
   character encoding scheme. The changes to the DTD necessary to
   support this method of form submission have been incorporated in the
   DTD included in this specification.

   A less satisfactory solution is to add a MIME charset parameter to
   the "application/x-www-form-urlencoded" media type specifier sent
   along with a POST method form submission, with the understanding that
   the URL encoding of [RFC1738] is applied on top of the specified
   character encoding, as a kind of implicit Content-Transfer-Encoding.

   One problem with both solutions above is that current browsers do not
   generally allow for bookmarks to specify the POST method; this should
   be improved. Conversely, the GET method could be used with the form
   data transmitted in the body instead of in the URL. Nothing in the
   protocol seems to prevent it, but no implementations appear to exist
   at present.

   How the user agent determines the encoding of the text entered by the
   user is outside the scope of this specification.

   NOTE -- Designers of forms and their handling scripts should be aware
   of an important caveat: when the default value of a field (the VALUE
   attribute) is returned upon form submission (i.e. the user did not
   modify this value), it cannot be guaranteed to be transmitted as a
   sequence of octets identical to that in the source document -- only
   as a possibly different but valid encoding of the same sequence of
   text elements. This may be true even if the encoding of the document
   containing the form and that used for submission are the same.

   Differences can occur when a sequence of characters can be
   represented by various sequences of octets, and also when a composite
   sequence (a base character plus one or more combining diacritics) can
   be represented by either a different but equivalent composite
   sequence or by a fully precomposed character. For instance, the UCS-2
   sequence 00EA+0323 (LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT +
   COMBINING DOT BELOW) may be transformed into 1EC7 (LATIN SMALL LETTER
   E WITH CIRCUMFLEX ACCENT AND DOT BELOW), into 0065+0302+0323 (LATIN
   SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT + COMBINING DOT BELOW),
   as well as into other equivalent composite sequences.

> * New option in Config.xml that means "use this charset if possible":
>   output_charset = "UTF-8"
>
> * When a request arrives, we first look in the session for _encoding to
>   see what encoding the request will most likely be in, and then convert
>   the input to default_input_charset. This only needs to be done on
>   requests with parameters (when QUERY_STRING is set or
>   REQUEST_METHOD='post'). If no session or _encoding exists, then use
>   output_charset.
>
> If output_charset, default_input_charset and default_output_charset are
> all UTF-8, then there is a pretty small chance that a conversion is ever
> needed.
>
> If no output_charset exists in the config file, then use the same
> behavior we have today, with no input conversion.

To me (and I don't use pkit extensively), your proposal seems appropriate; it should work except when the browser supports neither cookies nor UTF-8, which I guess is a very uncommon situation and is not handled right now either.

Greetings,
dam
--
Damyan Ivanov                       Creditreform Bulgaria
di...@cr...                         http://www.creditreform.bg/
phone: +359(2)928-2611, 929-3993    fax: +359(2)920-0994
mob. +359(88)856-6067               da...@ja.../Gaim
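(Editorial aside: the normalization caveat in the RFC excerpt above is easy to demonstrate. Python's standard unicodedata module shows that all three spellings of the same accented letter collapse to one NFC form; Perl offers the same check via Unicode::Normalize.)

```python
import unicodedata

decomposed  = "\u00ea\u0323"        # e-circumflex + combining dot below
precomposed = "\u1ec7"              # the single precomposed code point
fully_dec   = "\u0065\u0302\u0323"  # e + circumflex + dot below

# The raw code point sequences differ...
print(decomposed == precomposed)    # False
# ...but all three normalize (NFC) to the same precomposed form:
forms = {unicodedata.normalize("NFC", s)
         for s in (decomposed, precomposed, fully_dec)}
print(forms == {precomposed})       # True
```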
From: Boris Z. <bz...@2b...> - 2006-01-08 22:18:43
Hi All,

sorry, I'm somewhat late here. For some unknown reason I did not read the list for a while.

The answer is: a browser is free to change the charset in its reply, even if you know the charset you used to send the form. There is an attribute (accept-charset) on the form tag to hint a charset to the browser, but in my tests it did not work.

The only solution that always worked for me is to add a hidden field to the form with a char or word that is different in UTF-8 and your preferred charset(s). In my case I use UTF-8 and Latin-1. Then look at the length or value of the string in that hidden field: if that string is in UTF-8, all other form values are also in UTF-8. That's the whole trick. And best of all, to do this on the fly every time, just subclass Apache::Request::PageKit and add request_class = "MyCharsetFunPackage" to your config.

On Wednesday, 23 November 2005 14:23, Erik Günther wrote:
> [...]

--
Boris
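(Editorial aside: Boris's hidden-field probe can be sketched as below, in Python for brevity; his actual setup does this in Perl by subclassing Apache::Request::PageKit. The probe character 'ä' and the function name are chosen for illustration: in UTF-8 the probe arrives as the two bytes 0xC3 0xA4, in Latin-1 as the single byte 0xE4, so the raw bytes reveal the submission charset.)

```python
PROBE = "\u00e4"  # 'a-umlaut', embedded in a hidden form field

def detect_form_charset(probe_bytes: bytes) -> str:
    """Infer the charset of the whole submission from the raw bytes
    the browser sent back for the hidden probe field."""
    if probe_bytes == PROBE.encode("utf-8"):
        return "UTF-8"
    if probe_bytes == PROBE.encode("latin-1"):
        return "ISO-8859-1"
    return "unknown"

print(detect_form_charset(b"\xc3\xa4"))  # UTF-8
print(detect_form_charset(b"\xe4"))      # ISO-8859-1
```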