[TCLCORE] Tcl utf-8, ucs-2, unicode, short and ints ... oh my!

In following up on a bug report for Tk, I ran into a whole cascade
of issues relating to Tcl's confused relationship with what
"unicode" means.  In order to unravel it, it is time to make some
choices about what unicode means to Tcl, and how that will affect
development going forward.  The following decisions were reached
primarily with Kevin Kenny on the Tclers chat.  The bug that
started this is here:

http://sf.net/tracker/?func=detail&atid=110894&aid=1122671&group_id=10894

A similar problem occurs here:

http://sf.net/tracker/?func=detail&aid=1004065&group_id=10894&atid=110894

The problem is rooted in the size of Tcl_UniChar, and the support
of characters that fall outside the 16-bit range.  This is expressed
here:

http://sf.net/tracker/?func=detail&aid=578030&group_id=10894&atid=110894

The problem with UtfToUnicodeProc and UnicodeToUtfProc in
tclEncoding.c (and the related procs in tkUnixFont.c) is their
invalid handling of pointer alignment.  To correct this, you
hard-code the size of Tcl_UniChar.  However, hard-coding is not
necessarily incorrect.  Part of the issue is that Tcl's "unicode"
encoding is sometimes used synonymously with the system encoding on
Windows NT, which is UCS-2 (+ surrogates - I'll get to that later).
Thus the procs should really replace "unicode" with "ucs-2".
This is definitely the case with the UtfToUcs2beProc in
tkUnixFont.c.  UCS-2 is 2 bytes per character, but that code didn't
consider that Tcl_UniChar might not be 2 bytes.
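
To illustrate what "hard-coding the size" can look like without
tripping over alignment, here is a minimal sketch.  The helper names
(Ucs2ReadLE and friends) are hypothetical and are not functions in
tclEncoding.c or tkUnixFont.c; the sketch only assumes that a UCS-2
code unit is always exactly two bytes, whatever sizeof(Tcl_UniChar)
happens to be.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not part of the Tcl core.  Each one moves
 * exactly two bytes, one byte at a time, so the buffer pointer never
 * has to be aligned and sizeof(Tcl_UniChar) never enters into it.
 */

/* Read one little-endian 16-bit code unit from a byte buffer. */
static uint16_t Ucs2ReadLE(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read one big-endian 16-bit code unit from a byte buffer. */
static uint16_t Ucs2ReadBE(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Write one 16-bit code unit in little-endian byte order. */
static void Ucs2WriteLE(unsigned char *p, uint16_t u)
{
    p[0] = (unsigned char)(u & 0xFF);
    p[1] = (unsigned char)(u >> 8);
}

/* Write one 16-bit code unit in big-endian byte order. */
static void Ucs2WriteBE(unsigned char *p, uint16_t u)
{
    p[0] = (unsigned char)(u >> 8);
    p[1] = (unsigned char)(u & 0xFF);
}
```

Byte-wise access like this sidesteps casting a char pointer to a
wider type, which is where the alignment trouble comes from.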

But now we get to the heart ... Tcl_UniChar should be declared as
2 bytes.  Solutions that require >16-bit wide chars should be
handled by surrogates.  IOW, Tcl's unicode should be cemented as
UCS-2 + surrogates (where the + surrogates part is "currently
nonfunctional" until we implement it).  Try to make Tcl_UniChar
4 bytes (which has been tried, and currently has some support in
the core ... although based on misplaced trust in the core's
handling of what "unicode" is) and you get into a whole host of
issues on unicode vs ucs-2 (le or be?) vs ucs-4.  Without having to
do a full code review, cementing things as ucs-2 + surrogates
should clarify things and point any mishandled areas in a clear
direction for fixing the code.
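
For anyone who hasn't looked at surrogates closely, the arithmetic
is small.  The following is only a sketch of the standard surrogate
pair mapping, with hypothetical function names; it is not code from
the core, where this part does not work yet.

```c
#include <stdint.h>

/*
 * Hypothetical helper: split a code point into 16-bit code units.
 * Returns the number of units written to out (1 or 2), or 0 if cp
 * is not a valid character (a lone surrogate value, or > U+10FFFF).
 */
static int EncodeSurrogates(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;           /* reserved for surrogates */
        }
        out[0] = (uint16_t)cp;  /* fits in one 16-bit unit */
        return 1;
    }
    if (cp > 0x10FFFF) {
        return 0;               /* beyond the Unicode range */
    }
    cp -= 0x10000;              /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/*
 * Hypothetical helper: combine a high/low surrogate pair back into
 * a code point.  The caller must already have checked that hi is in
 * D800..DBFF and lo is in DC00..DFFF.
 */
static uint32_t DecodeSurrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000
        + ((uint32_t)(hi - 0xD800) << 10)
        + (uint32_t)(lo - 0xDC00);
}
```

For example, U+1D11E comes out as the pair D834 DD1E, and decoding
that pair gives U+1D11E back.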

That said, this should be reflected properly at the Tcl level.  I
think ucs-2 should augment "unicode" in the built-in encodings, and
we should have ucs-2le and ucs-2be equivalents also at the core Tcl
level to make some operations clearer (from the programmer's
perspective).  That means going from the current "unicode" to
unicode, ucs-2, ucs-2le and ucs-2be, where ucs-2 and unicode both
point to one of ucs-2le or ucs-2be, as appropriate.  In the future,
this could be extended to support ucs-4* as a true output encoding
with a very clear meaning.
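
As a rough sketch of the aliasing described above, "unicode" and
"ucs-2" could resolve to whichever of ucs-2le or ucs-2be matches the
host byte order.  Both function names below are made up for
illustration, and this is not how the core's encoding table is
actually organized.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero when the host is little-endian. */
static int IsLittleEndian(void)
{
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);  /* lowest-addressed byte of probe */
    return first == 1;
}

/*
 * Hypothetical helper: map the byte-order-neutral names onto an
 * explicit one.  Any other name is returned unchanged.
 */
static const char *ResolveEncodingName(const char *name)
{
    if (strcmp(name, "unicode") == 0 || strcmp(name, "ucs-2") == 0) {
        return IsLittleEndian() ? "ucs-2le" : "ucs-2be";
    }
    return name;
}
```

A script that needs a fixed byte order on the wire would ask for
ucs-2le or ucs-2be by name and never depend on the alias.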

Feel free to comment.  Note that this isn't a change in behavior,
it is a clarification and correction of what the core has been
doing (and correcting misconceptions about what it might do).
These same clarifications also provide direction as to proper
support of larger character sets.

Jeff
