[Tcl-bugs] [ tcl-Bugs-418645 ] Initial encoding selected incorrectly

Bugs item #418645, was updated on 2001-04-24 13:28
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=110894&aid=418645&group_id=10894

Category: Environment Variables
Group: 8.3.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Markus Kuhn (mkuhn)
Assigned to: Nobody/Anonymous (nobody)
Summary: Initial encoding selected incorrectly

Initial Comment:
The function unix/tclUnixInit.c:TclpSetInitialEncodings
contains an ugly hack to guess from the locale name the
multibyte encoding currently used on a Unix system.
This might work in some of the few special cases listed
in the provided table, but it fails badly in general.
For example under Linux (glibc 2.2), the locale de_DE
uses ISO 8859-1, the locale de_DE@euro uses ISO
8859-15, and the locale vi_VN uses UTF-8. None of these
is covered by your table.

Just extending localeTable[] is not the solution here,
because manufacturers change the encodings of locales
sometimes. Unix has an X/Open standardized API function
to determine the character set of the current locale! I
suggest that you drop the entire environment variable
parsing and table mechanics in TclpSetInitialEncodings.
Instead simply first call

  setlocale(LC_CTYPE, "");

so that the C library sets the locale from the environment, then call

  nl_langinfo(CODESET)

(on all platforms where langinfo.h is available) which
will return the name of the now used encoding. This
will be a string such as

  ISO-8859-1
  ISO-8859-15
  UTF-8
  EUC-JP
  KOI8-R
  SJIS

The command "locale -m" prints a list of all encodings
available on a system. These strings are unfortunately
not strictly standardized, and you will still need a
table to map these encoding names to those used by Tcl,
but the return value of nl_langinfo(CODESET) is a far
better starting point for finding the encoding currently
in use than the locale name.
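
A minimal sketch of the approach described above, assuming a
system that provides langinfo.h. The table contents and the
helper names (codeset_table, tcl_encoding_for_codeset,
detect_tcl_encoding) are hypothetical and chosen for
illustration only; the Tcl-side encoding names are assumptions
that would need checking against the encodings actually shipped
with Tcl:

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table mapping nl_langinfo(CODESET) names to
 * Tcl encoding names. The Tcl names are assumptions. */
struct codeset_map {
    const char *codeset;   /* as returned by nl_langinfo(CODESET) */
    const char *tcl_name;  /* assumed Tcl encoding name */
};

static const struct codeset_map codeset_table[] = {
    { "ISO-8859-1",  "iso8859-1"  },
    { "ISO-8859-15", "iso8859-15" },
    { "UTF-8",       "utf-8"      },
    { "EUC-JP",      "euc-jp"     },
    { "KOI8-R",      "koi8-r"     },
    { "SJIS",        "shiftjis"   },
};

/* Look up the Tcl name for a codeset string; NULL if unmapped. */
static const char *tcl_encoding_for_codeset(const char *codeset)
{
    size_t i;

    for (i = 0; i < sizeof codeset_table / sizeof codeset_table[0]; i++) {
        if (strcmp(codeset, codeset_table[i].codeset) == 0)
            return codeset_table[i].tcl_name;
    }
    return NULL;
}

/* Let the C library pick the locale from the environment, then
 * ask it for the character encoding of that locale. */
static const char *detect_tcl_encoding(void)
{
    setlocale(LC_CTYPE, "");
    return tcl_encoding_for_codeset(nl_langinfo(CODESET));
}
```

A caller would fall back to some default when
detect_tcl_encoding() returns NULL. Because the codeset strings
differ between vendors, a real table would probably need several
spellings per encoding.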

On some systems (including all with glibc 2.2, for
instance) you do not even have to map the output of
nl_langinfo(CODESET) to an encoding yourself. The
iconv function provides a comprehensive conversion
service that converts whatever encoding
nl_langinfo(CODESET) identified into "UTF-8".
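
As a rough illustration of the iconv route, assuming iconv.h is
available and the platform's iconv knows the codeset name it is
given. The helper names to_utf8 and locale_string_to_utf8 are
hypothetical, and the output buffer size is a generous guess
rather than a guaranteed bound:

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Convert a NUL-terminated string from from_codeset to UTF-8.
 * Returns a malloc'ed string, or NULL on any failure. */
static char *to_utf8(const char *from_codeset, const char *input)
{
    iconv_t cd;
    size_t in_left = strlen(input);
    size_t out_size = in_left * 4 + 1;  /* assumed to be enough */
    size_t out_left = out_size - 1;     /* keep room for the NUL */
    char *output = malloc(out_size);
    char *in_ptr = (char *) input;      /* iconv wants char ** */
    char *out_ptr = output;

    if (output == NULL)
        return NULL;

    cd = iconv_open("UTF-8", from_codeset);
    if (cd == (iconv_t) -1) {
        free(output);
        return NULL;
    }
    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t) -1) {
        iconv_close(cd);
        free(output);
        return NULL;
    }
    *out_ptr = '\0';
    iconv_close(cd);
    return output;
}

/* Convert a string in the current locale's encoding to UTF-8. */
static char *locale_string_to_utf8(const char *input)
{
    setlocale(LC_CTYPE, "");
    return to_utf8(nl_langinfo(CODESET), input);
}
```

With this route no table of encoding names is needed, since the
string returned by nl_langinfo(CODESET) is passed straight to
iconv_open. The caller frees the returned buffer.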

The matter is of some urgency, because SuSE Linux is
going to switch the default locales of most European
Union countries to ISO 8859-15 (for support of the Euro
symbol) soon, and then your assumption that ISO 8859-1
is a good default will fail for millions of Linux
users.

X/Open spec for nl_langinfo:

http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html
