Re: [ppa-dev] Switching to UTF-8

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 31/12/2010 11:16, Alexey Baturin wrote:
> Hello Jehan,
> 
> I forgot there were some recent changes to lang files, here is an updated patch.
> 
> On Wed, Dec 29, 2010 at 11:38 PM, Jehan-Guillaume (ioguix) de Rorthais
> <io...@fr...> wrote:
>> About the comments stuff, it's not 100% clear in my mind. Before
>> considering moving everything to UTF8, we need to double check that it
>> will not actually break the text printed on the page.
>> We should pay attention to the encoding of *all* layers: the lang file,
>> the one sent to the browser, the one sent to the database, and the
>> encoding of the database itself.
> 
> Let's begin with small changes - language files =).

Yeah, but all these layers are related :/

>>>>> 1. I think it's not so good to have two Russian translations - so I
>>>>> replaced old translation in KOI8-R with a new one in UTF-8:
>>>>> russian_utf8.patch.bz2
>>
>> So I think we had both to be able to have a complete homogeneous
>> encoding stack and avoid character breakage.
> 
> What do you mean by "homogeneous encoding stack"? I've verified that
> UTF-8 works well with Russian translation for some time. If you're not
> sure it works well in some configuration - you could send me a
> screenshot - I'll verify it.

OK, I need a clear view of what happens behind the scenes. So here is
my understanding of what PPA does:

#0 the encoding we send to the browser depends on $lang['appcharset'].
See function in classes/Misc.php:376
#1 $lang['appcharset'] is set in every language file
#2 we overwrite $lang['appcharset'] depending on the database encoding
we are connected to. See block in libraries/lib.inc.php:226

As far as I understand it, #1 shouldn't be necessary.
We are using the recoded files, which are pure ASCII. We recode the
files from lang/ into lang/recoded using the command "recode $encoding..xml",
so the resulting files use XML character references as escape sequences,
e.g. &#233; for 'é'.

So, as long as the language files are plain 7-bit ASCII, sending UTF-8 to
browsers should be perfectly fine.
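
To illustrate the point about the recoded files, here is a minimal Python
sketch (not PPA code) that roughly mimics what the recode step produces. The
helper name and the sample string are made up for the example, and the real
recode output may differ in details such as how it treats '&' or '<'.

```python
# Sketch only: approximate the effect of "recode $encoding..xml" on one string.
# to_char_refs and the sample string are illustrative, not taken from PPA.

def to_char_refs(text: str) -> bytes:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace")

sample = "Résumé"
recoded = to_char_refs(sample)
print(recoded)  # b'R&#233;sum&#233;'

# Every byte is below 0x80, so the result is plain 7-bit ASCII ...
assert all(b < 0x80 for b in recoded)

# ... and it decodes to the same text whether the page is labelled UTF-8
# or one of the legacy 8-bit charsets.
assert recoded.decode("utf-8") == recoded.decode("koi8_r") == recoded.decode("latin-1")
```

Since the browser resolves the character references itself, the charset
announced for the page no longer matters for the translated strings, only for
the data coming from the database.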

#2 sets the client encoding according to the database encoding, so
PostgreSQL doesn't convert the data.
To be able to print data fetched from PostgreSQL, we overwrite
$lang['appcharset'] with this database encoding and use it as the HTML page
encoding (see #0).
According to this page, we can use UNICODE/UTF-8 as client_encoding for
every database encoding except MULE_INTERNAL:
  http://www.postgresql.org/docs/7.4/static/multibyte.html
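
To make steps #0 to #2 concrete, here is a hypothetical Python sketch of the
logic as I understand it. It is not the actual code from lib.inc.php or
Misc.php: the function names and the mapping table are assumptions for
illustration, and the table only lists a few encodings.

```python
from typing import Optional

# Assumed mapping from PostgreSQL encoding names to HTML charset labels.
# This table is illustrative and incomplete; it is not PPA's real code map.
PG_TO_HTML_CHARSET = {
    "UTF8": "utf-8",
    "UNICODE": "utf-8",      # older name for UTF8
    "LATIN1": "iso-8859-1",
    "KOI8R": "koi8-r",
}

def page_charset(lang_charset: str, db_encoding: Optional[str]) -> str:
    """Pick the charset announced to the browser.

    lang_charset stands in for $lang['appcharset'] from the language file (#1).
    It is overridden when a known database encoding is available (#2).
    """
    if db_encoding:
        return PG_TO_HTML_CHARSET.get(db_encoding.upper(), lang_charset)
    return lang_charset

def content_type_header(charset: str) -> str:
    """Build the header that tells the browser the page encoding (#0)."""
    return "Content-Type: text/html; charset=" + charset

print(content_type_header(page_charset("koi8-r", "UTF8")))
print(content_type_header(page_charset("koi8-r", None)))
```

If we set client_encoding to UTF-8 unconditionally, the override in #2 would
collapse to a constant and the per-language charset would no longer be needed.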

But anyway, I don't think we support MULE_INTERNAL correctly today.

In conclusion, if I didn't forget something along the way, then yes, we
could probably use UTF-8 everywhere, with some more investigation needed
for MULE_INTERNAL.

So I guess it's fairly safe to create a first patch then test it as much
as we can.

Comments? Warnings? What did I forget?

>>>>> 2. Do you remember a bad screenshot on the PPA site? I've checked
>>>>> Ukrainian translation - it doesn't work as expected: you can compare
>>>>> attached screenshots (bad symbols are in red circles). It seems that
>>>>> the reason is KOI8-R instead of KOI8-U everywhere. I have not managed
>>>>> to fix it, something goes wrong with Recode:
>>>>> =====
>>>>> Recoding ukrainian...
>>>>> recode: Untranslatable input in step `KOI8-U..ISO-10646-UCS-2'
>>>>> =====
>>>>> However I managed to switch fast to UTF-8 with iconv - patch is attached.
>>
>> So we need to make a full check to be sure it doesn't break data pages I
>> guess...
>> Could we list you as the Ukrainian and Russian maintainer?
> 
> As a Russian - sure you can, as a Ukrainian - you can't, I don't know
> Ukrainian well =). But I can see this obvious bug of Ukrainian
> translation.

Great :)
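
On the KOI8-R versus KOI8-U point quoted above, here is a small Python sketch
of why the Ukrainian pages would show bad symbols if KOI8-U bytes are labelled
as KOI8-R. The sample word is only an illustration and is not taken from the
translation file.

```python
# Sketch only: read KOI8-U bytes with the wrong (KOI8-R) charset label.

word = "Україна"                    # contains the Ukrainian letter "ї"
raw = word.encode("koi8_u")         # bytes as they would sit in a KOI8-U file

as_koi8_u = raw.decode("koi8_u")    # correct interpretation
as_koi8_r = raw.decode("koi8_r")    # wrong charset label

print(as_koi8_u == word)   # True
print(as_koi8_r == word)   # False: "ї" is shown as a different character

# Converting once to UTF-8 removes the ambiguity between the two charsets.
utf8_bytes = as_koi8_u.encode("utf-8")
assert utf8_bytes.decode("utf-8") == word
```

The two charsets agree on the Russian letters and differ only on a few code
points used by Ukrainian, which would explain why only some symbols are broken
in the screenshots.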

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk0d9yIACgkQxWGfaAgowiLwvwCggbYb84mz0mQfWZ9Sp+nS2UH+
WpAAnjRTV2r1RYv1GY4cNo5aeQcaxFxa
=pV9L
-----END PGP SIGNATURE-----