From: Jeff H. <jeffh@ActiveState.com> - 2005-02-23 02:21:06
In following up on a bug report for Tk, I ran into a whole cascade of issues relating to Tcl's confused relationship with what "unicode" means. In order to unravel it, it is time to make some choices about what unicode means to Tcl, and how that will affect development going forward. The following decisions were reached primarily with Kevin Kenny on the Tclers chat.

The bug that started this is here:
http://sf.net/tracker/?func=detail&atid=110894&aid=1122671&group_id=10894

A similar problem occurs here:
http://sf.net/tracker/?func=detail&aid=1004065&group_id=10894&atid=110894

The problem is rooted in the size of Tcl_UniChar, and the support of characters that fall outside the 16-bit range. This is expressed here:
http://sf.net/tracker/?func=detail&aid=578030&group_id=10894&atid=110894

The problem with the UtfToUnicodeProc and UnicodeToUtfProc in tclEncoding.c (and the related ones in tkUnixFont.c) is the invalid handling of pointer alignment. To correct this, you hard-code in the size of Tcl_UniChar. However, this is not necessarily incorrect. Part of the issue is that Tcl's "unicode" encoding is sometimes used synonymously with the system encoding on Windows NT, which is UCS-2 (+ surrogates; I'll get to that later). Thus the procs should really substitute "unicode" with "ucs-2". This is definitely the case with the UtfToUcs2beProc in tkUnixFont.c. UCS-2 is 2 bytes per character, but that code didn't consider that Tcl_UniChar might not be 2 bytes.

But now we get to the heart ...

Tcl_UniChar should be declared as 2 bytes. Characters that need more than 16 bits should be handled by surrogates. IOW, Tcl's unicode should be cemented as UCS-2 + surrogates (where the "+ surrogates" part is "currently nonfunctional" until we implement it). Try to make Tcl_UniChar 4 bytes (which has been tried, and currently has some support in the core ... although based on misplaced trust in the core's handling of what "unicode" is) and you get into a whole host of issues on unicode vs ucs-2 (le or be?) vs ucs-4. Without having to do a full code review, cementing things as ucs-2 + surrogates should clarify things and point any mishandled areas in a definite direction for fixing the code.

That said, this should be reflected properly at the Tcl level. I think ucs-2 should augment "unicode" in the built-in encodings, and we should also have ucs-2le and ucs-2be equivalents at the core Tcl level to make some operations clearer (from the programmer's perspective). That means going from the current "unicode" to unicode, ucs-2, ucs-2le and ucs-2be, where ucs-2 and unicode both point to one of ucs-2le or ucs-2be, as appropriate. In the future, this could be extended to support ucs-4* as a true output encoding with a very clear meaning.

Feel free to comment. Note that this isn't a change in behavior; it is a clarification and correction of what the core has been doing (and a correction of misconceptions about what it might do). These same clarifications also provide direction as to proper support of larger character sets.

Jeff
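
For illustration, here is a minimal sketch of what "UCS-2 + surrogates" could look like with a fixed 2-byte character type. It is not code from the Tcl core: the type UniChar16 and the helpers EncodeUnits, DecodePair, PutUnitBE and PutUnitLE are hypothetical names made up for this example. It shows the standard surrogate arithmetic for code points above 0xFFFF, and byte-wise stores that pick an explicit byte order (as ucs-2be and ucs-2le would) without casting the output buffer to a 16-bit pointer, which sidesteps the alignment question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 2-byte Tcl_UniChar; not the real Tcl typedef. */
typedef uint16_t UniChar16;

/*
 * Encode one code point as 16-bit units.
 * Returns the number of units written (1 or 2), or 0 if cp is not a
 * valid scalar value (out of range, or itself a surrogate).
 */
size_t EncodeUnits(uint32_t cp, UniChar16 out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (UniChar16) cp;        /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 significant bits remain */
    out[0] = (UniChar16) (0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (UniChar16) (0xDC00 + (cp & 0x3FF));   /* low surrogate */
    return 2;
}

/* Combine a high/low surrogate pair back into one code point. */
uint32_t DecodePair(UniChar16 hi, UniChar16 lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000 + ((uint32_t) (hi - 0xD800) << 10)
            + (uint32_t) (lo - 0xDC00);
}

/* Store one unit big-endian, one byte at a time (no alignment needed). */
void PutUnitBE(unsigned char *dst, UniChar16 u)
{
    dst[0] = (unsigned char) (u >> 8);
    dst[1] = (unsigned char) (u & 0xFF);
}

/* Store one unit little-endian, one byte at a time. */
void PutUnitLE(unsigned char *dst, UniChar16 u)
{
    dst[0] = (unsigned char) (u & 0xFF);
    dst[1] = (unsigned char) (u >> 8);
}
```

With these helpers, U+1D11E would be stored as the two units 0xD834 0xDD1E, and the same units can be written to an arbitrary byte offset in either byte order.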