From: Kutter M. <mar...@si...> - 2004-09-08 10:06:31
Hi Andreas,

This looks like the default solution. Unfortunately, SPOPS::Tool::UTFConvert always assumes iso-8859-1 (Latin-1) as the originating charset, which is not necessarily true. So this does not work for charsets other than Latin-1.

The ability to grab the charset from the request, in conjunction with a slightly modified SPOPS::Tool::UTFConvert (use the request's charset, if given), would remove the problem completely.

Regards,
Martin Kutter

-----Original Message-----
From: ope...@li... [mailto:ope...@li...] On Behalf Of And...@Be...
Sent: Wednesday, 8 September 2004 11:53
To: ope...@li...
Subject: Re: [Openinteract-dev] Small i18n issue

Hi Martin,

We had the same problem with reading umlaut characters with LDAP for the user names. You can solve this by adding the following to the spops.perl file of the package in question:

rules_from => [ 'SPOPS::Tool::UTFConvert' ],

Hope that helped...

Best regards (Mit freundlichen Grüßen),

Andreas Nolte
Leitung IT (Head of IT)
-------------------------------------------------
arvato direct services
Olympiastraße 1
26419 Schortens
Germany
http://www.arvato.com/
and...@be...
Telefon +49 (0) 44 21 - 76-84002
Telefax +49 (0) 44 21 - 76-84111

-----Original Message-----
From: ope...@li... [mailto:ope...@li...]
Sent: Wednesday, 8 September 2004 11:33
To: 'ope...@li...'
Subject: [Openinteract-dev] Small i18n issue

Hi *!

I ran into a small internationalization issue today. I'm running an OI2 site with SPOPS::LDAP as backend. On storing non-ASCII characters in the LDAP directory server, the server complains that properties with non-ASCII characters have an "invalid syntax". I've been able to track this down to a charset problem.

LDAP expects directoryString attributes to be in UTF-8 encoding. The perl-ldap interface (Net::LDAP) does not provide UTF-8 conversions by default, so these have to be done by the application using Net::LDAP. This is no big deal - just a

use Encode;
$value = decode($charset, $value);

for all the fields to set - but one needs to know the request's charset.

The charset used in the HTTP request is specified by the "charset" attribute in the Content-Type header. Example:

Content-Type: multipart/form-data; boundary="--------------12345"; charset="EUC-JP"

The default is "iso-8859-1" if no charset is supplied.

The problem is that the only available way to get the charset used in the request is to grab it from the underlying Apache::Request or CGI::Request handle - not really easy and not really portable:

my $contentHeader = CTX->request->apache->headers_in()->{ 'Content-Type' };

As different charsets in HTTP requests are very likely to happen in i18n'ed environments, and the problem is very likely to occur in non-LDAP environments, too, I would suggest an extension to the OpenInteract2::Request class that provides access to the Content-Type HTTP header, like it already does with some other header fields.

Maybe an even more general approach - exposing all HTTP headers in the request object - could be suitable: this would remove the need to react to additional HTTP headers with code changes forever.

Regards,
Martin Kutter

_______________________________________________
openinteract-dev mailing list
ope...@li...
https://lists.sourceforge.net/lists/listinfo/openinteract-dev |
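[Editor's sketch] The header parsing Martin describes can be done in a few lines of standalone Perl. The helper name charset_from_content_type and the iso-8859-1 fallback are assumptions for illustration, not an API that OpenInteract2 actually provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the charset attribute out of a
# Content-Type header value, defaulting to iso-8859-1 as the
# HTTP spec (and this thread) describes.
sub charset_from_content_type {
    my ($content_type) = @_;
    return 'iso-8859-1' unless defined $content_type;
    if ( $content_type =~ /charset\s*=\s*"?([\w.-]+)"?/i ) {
        return lc $1;
    }
    return 'iso-8859-1';
}

my $header = 'multipart/form-data; boundary="--------------12345"; charset="EUC-JP"';
print charset_from_content_type($header), "\n";     # euc-jp
print charset_from_content_type('text/html'), "\n"; # iso-8859-1
```

Such a helper could live behind a CTX->request->charset() style accessor, which is essentially what the thread goes on to ask for.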
From: Kutter M. <mar...@si...> - 2004-09-08 12:22:31
Hi Teemu, hi *,

Looks like we just found a not-so-small issue... While SPOPS::Tool::UTFConvert can handle conversion for SPOPS backends, there's nothing like that for the OI2 frontends (say, Template Toolkit and the like).

My suggestion for the whole OI2 i18n charset encoding would be:

1. Get the charset from the request.
2. Encode all parameters as UTF-8 when fetching them in the request object (all but uploads).
3. Set the Content-Type charset="foo" for the response (if needed).
4. Encode all output in the Response object to the appropriate charset just before sending it (if needed).

Step 4 would probably be an issue for the Controller - OI2::Controller::Raw should never re-code anything, and alternative controllers (let's say for outputting PDFs) probably shouldn't recode their stuff, either.

This would allow OI2 to use UTF-8 only in its internal processing, but serve frontends with potentially different character encodings. It would also remove the need for charset conversions in SPOPS backends (as long as the backends are UTF-8 capable - most Perl modules should be): the data would be in the appropriate form already, and the number of supported charsets would go far beyond the current sad 'Latin-1'.

Regards,
Martin

-----Original Message-----
From: Teemu Arina [mailto:te...@di...]
Sent: Wednesday, 8 September 2004 11:56
To: ope...@li...
Cc: Kutter Martin; 'ope...@li...'
Subject: Re: [Openinteract-dev] Small i18n issue

> I've been able to track this down to a charset problem.
> LDAP expects directoryString attributes to be in UTF-8 encoding. The
> perl-ldap interface (Net::LDAP) does not provide UTF-8 conversions by
> default, so these are to be done by the application using Net::LDAP. This
> is no big deal - just a
>
> use Encode;
> $value = decode($charset, $value);

I had a similar problem with DBD::mysql and UTF-8. DBI has no general policy for UTF-8, so it has to be implemented by the DBDs themselves. DBD::mysql does nothing about the issue. If you store UTF-8 strings in the database and retrieve them back, these strings do not get marked as UTF-8. Later it might happen that your UTF-8 strings get encoded as UTF-8 a second time, because Perl didn't mark them as UTF-8 already =)

The Encode module helps fix this problem, for example in SPOPS::DBI::MySQL (define a post_fetch_action() that converts all data fields).

I wonder when you will be able to write UTF-8 compatible software without mucking with the internals on several layers... It has been around for so many years and still many module writers ignore it. It also wasn't until MySQL 4.x that they included UTF-8 support in character type fields.

> The problem is, that the only available solution to get the charset used in
> the request is to grab it from the underlying Apache::Request or
> CGI::Request handle - not really easy and not really portable:
>
> my $contentHeader = CTX->request->apache->headers_in()->{ Content-Type };

I noticed the same thing. I also found another way to set it:

CTX->response->content_type( 'text/html; charset=utf-8' )

CTX->response->charset() would be nice to have.

Greetings,
- Teemu |
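[Editor's sketch] Teemu's post_fetch_action() idea - re-flag raw UTF-8 octets from the database as Perl character data so they don't get encoded twice - looks roughly like this. The flat hash-of-fields object layout is an assumption for illustration; real SPOPS objects have their own internals.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sketch of a post-fetch hook: walk every scalar field and decode the
# raw UTF-8 octets into a character string with the internal UTF-8
# flag set, so Perl will not re-encode them later.
sub post_fetch_action {
    my ($self) = @_;
    for my $field ( keys %$self ) {
        next unless defined $self->{$field} && !ref $self->{$field};
        $self->{$field} = decode( 'UTF-8', $self->{$field} );
    }
    return 1;
}

my $row = { name => "J\xC3\xBCrgen" };  # raw UTF-8 octets for "Jürgen"
post_fetch_action($row);
print length( $row->{name} ), "\n";     # 6 characters, not 7 octets
```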
From: Teemu A. <te...@io...> - 2004-09-08 15:02:39
> Step 4 would probably be an issue for the Controller - OI2::Controller::Raw
> should never re-code anything, and alternative controllers like, let's say
> for outputting PDFs - probably shouldn't recode their stuff, too.

I do agree. Although content_type should not include charset=iso-8859-1 as it does now.

> This would allow OI2 to use UTF-8 only in it's internal processing, but
> serve frontends with potentially different character encodings.

UTF-8, at least for internal data representation, is the way to go. I would also pay attention that UTF-8 from backend to frontend is possible without losing any bits or forcing a certain character set in the middle, i.e. a Russian interface with Chinese content should be possible. This of course requires that all the backends speak UTF-8.

> 1. get the charset from the request
> 2. encode all parameters as UTF-8 when fetching them in the request object
> (all but uploads)
> 3. set the Content-type: charset="foo" for the response (if needed).
> 4. encode all output in the Response object to the appropriate charset just
> before sending it (if needed).

For full UTF-8 support, also:

5. Translation files (the I18N/maketext framework) should be in the standard GNU gettext (PO) format, which is binary safe, instead of the current system, which is poorly UTF-8 compatible.

6. Reading/writing OI2 configuration files should also be UTF-8 compatible, with something like:

open( INI, ">:utf8", "action.conf" );

7. Searching and text parsing should be UTF-8 compatible. This means "use utf8;" and support for UTF-8 on the database backends. At the moment =~ /\w/ doesn't work very well with UTF-8 content ;)

Full support most likely requires Perl 5.8.x =(... 5.6.x UTF-8 support just sucks (the Encode module, for example, does not work with 5.6.x).

--
--------------
Teemu Arina
www.dicole.org |
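[Editor's sketch] Points 6 and 7 can be demonstrated together in a few lines, assuming Perl >= 5.8: write a scratch config file through an encoding layer, read it back, and check that \w matches an accented character once the data is a real character string. The file name demo.conf is arbitrary.

```perl
use strict;
use warnings;
use utf8;   # this source file contains UTF-8 literals

# Point 6: read/write through an encoding IO layer.
open my $out, '>:encoding(UTF-8)', 'demo.conf' or die $!;
print {$out} "título = valor\n";
close $out;

open my $in, '<:encoding(UTF-8)', 'demo.conf' or die $!;
my $line = <$in>;
close $in;
unlink 'demo.conf';

# Point 7: on a decoded string, \w matches the accented letter too,
# so "key = value" parsing works for non-ASCII keys.
print "word-char match ok\n" if $line =~ /^\w+\s+=\s+\w+$/;
```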
From: Teemu A. <te...@io...> - 2004-09-08 15:22:27
Also a slight note: it seems that some broken browsers (tm) do not obey the charset of the document when posting forms. It might be a good idea to use a form tag like the following by default:

<form accept-charset="utf-8" enctype="application/x-www-form-urlencoded">

This sends the form to the server in UTF-8.

Notice also the weird problems with UTF-8 and Template Toolkit:
http://template-toolkit.org/pipermail/templates/2003-March/004314.html

Also see:
http://template-toolkit.org/pipermail/templates/2003-November/005342.html

I don't know the status of UTF-8 in the current version of TT.

--
--------------
Teemu Arina
Ionstream Oy / Dicole
Komeetankuja 4 A
02210 Espoo
FINLAND
Tel: +358-(0)50 - 555 7636
http://www.dicole.fi
"Discover, collaborate, learn." |
From: Kutter M. <mar...@si...> - 2004-09-08 15:24:48
Hi Teemu,

Your step 6 is not an issue with Perl >= 5.8.0. Perl >= 5.8 supports IO layers that can be used to filter/recode content almost transparently ("almost" means: exactly like your example).

I don't think that all backends need to support UTF-8 - if they don't, that's no harm as long as we can recode in the backend (currently we're recoding iso-8859-1 to UTF-8 for utf8-aware backends).

As for 7, using Perl's locale support should override the broad scope of utf8's \w meaning. And backends that support UTF-8 normally support searches etc. on it, too.

I agree with you that full UTF-8 support will probably require Perl >= 5.8. But as we're probably talking about years until everything's up & working, this should not be an issue (hey, there'll be Perl 5.10 out by then - and Perl 6 will arrive, too. Sure it will.).

Regards,
Martin Kutter

-----Original Message-----
From: ope...@li... [mailto:ope...@li...] On Behalf Of Teemu Arina
Sent: Wednesday, 8 September 2004 17:02
To: Kutter Martin
Cc: 'ope...@li...'
Subject: Re: [Openinteract-dev] Small i18n issue

[Quoted text identical to Teemu's message above.] |
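[Editor's sketch] The iso-8859-1-to-UTF-8 recoding step Martin mentions is, in essence, a decode/encode pair through the Encode module. A minimal sketch with hand-built byte strings:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Grüße" as iso-8859-1 bytes, as a legacy backend might deliver it.
my $latin1_bytes = "Gr\xFC\xDFe";

# Decode the legacy bytes into Perl characters, then re-encode as
# UTF-8 for a utf8-aware store or frontend.
my $chars      = decode( 'iso-8859-1', $latin1_bytes );
my $utf8_bytes = encode( 'UTF-8', $chars );

printf "latin1: %d bytes, utf8: %d bytes\n",
    length($latin1_bytes), length($utf8_bytes);  # latin1: 5 bytes, utf8: 7 bytes
```

With Perl >= 5.8, the same recoding can also be hung on a filehandle as an IO layer (e.g. ':encoding(iso-8859-1)'), which is the "almost transparent" mechanism Martin refers to.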