Re: [mined-editor] Re: uterm script

Thomas Wolff <mi...@to...> wrote:

>> > * It cannot distinguish between terminal encoding and preferred data
>> >   encoding. For mined, I decided to slightly interpret (or misuse,
>> >   if you want)
>
>> misuse, certainly misuse! This is against the POSIX standard.
> Not quite, see below.
>
>> >   the locale mechanism by allowing the following:
>> >   	LC_CTYPE=something.UTF-8
>> >   	LANG=something.gb18030
>> >   This would tell mined that the preferred encoding when editing text
>> >   is GB18030 while leaving the LC_CTYPE category indicating a UTF-8
>> >   terminal, so other applications are not confused, and CJK files
>> >   can easily be worked on in a UTF-8 terminal (there are options for
>> >   this in mined, too, and of course auto-detection...).
>
>> Having options for this in mined is the only possibility here.  You
>> cannot use the locale environment variables against the POSIX
>> standard.
> I'm actually not doing that.
> The POSIX requirement you quoted says that different character sets
> must not be used by the locale categories.
> But by rules of priority of the variables affecting each category,
> LANG doesn't affect anything if all other categories are set.
> So, when the example is modified slightly:
> 	LC_ALL=something.UTF-8
> 	LANG=something.gb18030
> this does not violate the POSIX locale standard. (For the POSIX
> locale mechanism, the value of LANG would have no effect here.)
>
> Maybe a little picky here, but this way at least one important
> missing configuration feature (distinguishing terminal encoding from
> data encoding) can be achieved if an application likes.

But this is completely nonstandard; no other application but mined
does anything like this. It is also very misleading, as it makes
the user believe that using different encodings in these variables
is OK. It is not.

You are right that LANG is ignored if LC_ALL is set, but it
is still misleading. A mined-specific variable such as
MINED_PREFERRED_FILE_ENCODING would be ignored by the POSIX locale
mechanism as well, and always, not only if LC_ALL is set.
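
The priority rule discussed here (LC_ALL first, then the category's own
variable, then LANG) can be sketched in plain sh. This is only an
illustration: effective_locale is a hypothetical helper name, and the
final fallback to "C" is an assumption, since POSIX leaves the default
locale implementation-defined.

```shell
# Print the locale name that applies to one category, following the
# priority LC_ALL > LC_<category> > LANG. Empty values count as unset.
# The category name is passed to eval, so only call this with trusted,
# fixed names such as LC_CTYPE or LC_PAPER.
effective_locale() {
    category=$1
    eval "category_value=\${$category:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: fall back to "C"; the real default is
        # implementation-defined.
        printf '%s\n' "C"
    fi
}
```

With LC_ALL=something.UTF-8 and LANG=something.gb18030, calling
effective_locale LC_CTYPE prints something.UTF-8, so the value of LANG
never reaches the locale mechanism.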

I still think a special mined option, be it via a mined-specific
environment variable or a config file option, is much clearer than
abusing the POSIX locale system in weird and funny ways.

>> And most encodings wouldn't work for German
>> anyway, i.e. de_DE.SJIS could not work because SJIS has no umlauts.
> Actually, Shift-JIS X0213 does maintain umlauts. Try it with mined,
> enter them in the terminal, ESC u will reveal the encoding. Maybe a
> later extension of Shift-JIS; I took the table from libiconv.

Yes, this is a later extension of shift_jis. The old shift_jis has no
umlauts:

mfabian@magellan:~$ echo -n ö | iconv -f utf-8 -t sjis >/dev/null
iconv: illegal input sequence at position 0
mfabian@magellan:~$

Only shift_jisx0213 has them:

mfabian@magellan:~$ echo -n ö | iconv -f utf-8 -t shift_jisx0213 >/dev/null
mfabian@magellan:~$

>> So it is much less than n*m.
> Sure, but locale installation is far too complicated for users
> (I don't know myself how it works because there is no decent documentation)
> - and it seems it's not even possible if you're not root -

It is possible if you are not root. You can install additional locales
in your home directory using the command "localedef".  You are right
though that it is not commonly known by ordinary users how to do that.
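
A minimal sketch of such a per-user installation, assuming glibc's
localedef and that the locale source (for example de_DE) and the UTF-8
charmap are already installed system-wide. The function name
install_user_locale and the default directory $HOME/.locale are
hypothetical choices for this illustration.

```shell
# Compile a UTF-8 locale into a private directory; no root needed.
# With glibc's localedef, an output name containing a slash is treated
# as the directory to write the compiled locale into.
install_user_locale() {
    lang=$1                        # locale source name, e.g. de_DE
    dir=${2:-"$HOME/.locale"}      # hypothetical default target directory
    mkdir -p "$dir" || return 1
    localedef -i "$lang" -f UTF-8 "$dir/$lang.UTF-8"
}
```

After something like "install_user_locale de_DE", glibc programs should
find the new locale once LOCPATH points at that directory, for example
LOCPATH=$HOME/.locale LC_ALL=de_DE.UTF-8 locale charmap. LOCPATH is
glibc-specific, so this will not carry over to other systems unchanged.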

>> >> The updated version is attached. Thank you very much for the review.
>>     mfabian@magellan:/tmp$ locale
>>     LANG=ja_JP.UTF-8
>>     LC_CTYPE="ja_JP.UTF-8"
>>     ...
>>     LC_PAPER="ja_JP.UTF-8"
>>     LC_ALL=
>>     mfabian@magellan:/tmp$ LANG=en_US.ISO-8859-1 LC_PAPER=de_DE@euro ./uterm
>>     20x20 font not found, using 9x18 with 18x18.
>
>> Now in the xterm which started:
>
>>     mfabian@magellan:/tmp$ locale
>>     LANG=en_US.UTF-8
>>     LC_CTYPE="en_US.UTF-8"
>>     ...
>>     LC_PAPER=de_DE@euro
>>     LC_ALL=
>>     mfabian@magellan:/tmp$
>
>> Illegal combination of UTF-8 and ISO-8859-1 encoding because LC_PAPER
>> is still set to de_DE@euro -> trouble ahead.
> Well, in this case you explicitly caused the "illegal combination"
> yourself with the above quoted command line settings of LANG and LC_PAPER
> in a UTF-8 locale environment, so any trouble caused would not be
> caused by uterm.

Sorry, I meant to write

 LANG=en_US.ISO-8859-1 LC_PAPER=de_DE.ISO-8859-1 ./uterm

which is perfectly fine because the encodings are the same.  But after
starting ./uterm there is a conflict because uterm changes only one
environment variable.

To avoid such conflicts, uterm should try to change every LANG and
LC_* variable that was previously set to a non-UTF-8 locale to a
UTF-8 locale, not only one of them.

I.e. in the above case the uterm script should change LANG to
en_US.UTF-8 and LC_PAPER to de_DE.UTF-8.

On SuSE Linux you can be sure that all locales also exist in UTF-8.
I.e. if there is an xx_YY.something then there is also an xx_YY.UTF-8.
On other systems you cannot be sure. You wrote yourself that on some
systems only en_US.UTF-8 exists.

That means in the above case you should try to set LC_PAPER to
de_DE.UTF-8 if possible and, if not, fall back to en_US.UTF-8.

You could improve your script by checking LANG and all LC_* variables
and trying to set each of them that was set when uterm was started to
the respective UTF-8 locale if possible, and otherwise to another
UTF-8 locale which exists as a fallback.
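
The renaming part of this suggestion could look roughly like the sketch
below. It is not taken from the uterm script; to_utf8_locale and
utf8ify_locale_env are hypothetical names. It rewrites names only and
does not check that the resulting locales are installed. Treating
"@euro" as an encoding hint to drop while keeping other modifiers is an
assumption of this sketch.

```shell
# Map a locale name to its UTF-8 counterpart:
#   en_US.ISO-8859-1 -> en_US.UTF-8,  de_DE@euro -> de_DE.UTF-8
# C, POSIX and empty values are returned unchanged.
to_utf8_locale() {
    case $1 in
        ''|C|POSIX) printf '%s\n' "$1"; return 0 ;;
    esac
    base=${1%%@*}       # strip "@modifier"
    base=${base%%.*}    # strip ".codeset"
    modifier=
    case $1 in
        *@*) modifier=${1##*@} ;;
    esac
    # Assumption: "@euro" only selects ISO-8859-15, so it is dropped;
    # other modifiers (e.g. "@latin") are kept.
    if [ "$modifier" = euro ]; then
        modifier=
    fi
    printf '%s\n' "$base.UTF-8${modifier:+@$modifier}"
}

# Rewrite every locale variable that is currently set and non-empty.
# The list covers the POSIX categories plus the glibc extensions.
utf8ify_locale_env() {
    for var in LANG LC_ALL LC_COLLATE LC_CTYPE LC_MESSAGES LC_MONETARY \
               LC_NUMERIC LC_TIME LC_PAPER LC_NAME LC_ADDRESS \
               LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION; do
        eval "value=\${$var:-}"
        if [ -n "$value" ]; then
            new_value=$(to_utf8_locale "$value")
            eval "$var=\$new_value"
            export "$var"
        fi
    done
}
```

Started with LANG=en_US.ISO-8859-1 and LC_PAPER=de_DE.ISO-8859-1, a call
to utf8ify_locale_env leaves LANG=en_US.UTF-8 and LC_PAPER=de_DE.UTF-8,
and variables that were unset stay unset.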

On glibc systems, you can easily check whether a certain UTF-8 locale
exists with the "locale" program:

mfabian@magellan:~$ LC_CTYPE=en_US.UTF-8 locale charmap
UTF-8
mfabian@magellan:~$

This locale exists, because the command returns UTF-8.

mfabian@magellan:~$ LC_CTYPE=aa_BB.UTF-8 locale charmap
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
ANSI_X3.4-1968
mfabian@magellan:~$

This locale does not exist, because the command does *not* return UTF-8.

I'm not sure how this can be done in a portable way.  Maybe the
"locale charmap" command is standardized, but I'm not sure at the
moment.
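
Wrapped into a script, the check with a fallback might look like this
sketch. The function names are hypothetical. It sets LC_ALL instead of
LC_CTYPE for the probe so that an LC_ALL already present in the
environment cannot mask the result, and it relies on the charmap being
reported exactly as "UTF-8", which is what glibc prints; other systems
may spell it differently.

```shell
# Succeed if the named locale can be selected and reports UTF-8.
utf8_locale_exists() {
    [ "$(LC_ALL=$1 locale charmap 2>/dev/null)" = "UTF-8" ]
}

# Print the first candidate locale that exists; fail if none does.
first_utf8_locale() {
    for candidate in "$@"; do
        if utf8_locale_exists "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

For the LC_PAPER case above, something like
LC_PAPER=$(first_utf8_locale de_DE.UTF-8 en_US.UTF-8) would pick
de_DE.UTF-8 where it is installed and en_US.UTF-8 otherwise.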

> I think the uterm script has the following properties:
> * It gives the user the opportunity to start an xterm which can display
>   a maximum of Unicode "out-of-the-box", in two respects:
>   * Enforcing UTF-8 even if the user's environment is misconfigured.
>     (not needed for SuSE)
>   * Choosing a most suitable font.
>     (also useful for SuSE default configuration, or with user's own
>     configuration active that might address traditional non-sufficient
>     fonts)

Yes, maybe.

-- 
Mike FABIAN   <mf...@su...>   http://www.suse.de/~mfabian
Lack of sleep is the enemy of good work.