From: <ke...@cr...> - 2005-02-23 15:46:02
Lar...@ma... said:
> what Tcl has now _is_ UCS-2, but the proper name for "UCS-2 with
> surrogates" is UTF-16, and if that is where you want to go then you
> should use the proper name for it.

Quite right - UTF-16 is what Jeff is proposing, and we should use its
proper name.

> Second, I think it would be a grave mistake to let technical aspects
> of how Unicode is implemented in Tcl shine through to the script
> level. A single character should be a string of length one regardless
> of whether that single character resides in the Basic Multilingual
> Plane or not, just in the same way as \u0100 (LATIN CAPITAL LETTER A
> WITH MACRON) has the same length as \xFF (LATIN SMALL LETTER Y WITH
> DIAERESIS) and \x7E (TILDE).

I believe that we're all in agreement on that point. A character has
a string length of one.

What Jeff is proposing is an intermediate step, where surrogates are
broken in some important ways (for instance, appearing as two
characters in certain contexts) but can at least be converted to and
from UTF-8 and UTF-16 in either byte order. That's a major step
forward, since it at least allows Tcl to exchange characters outside
the BMP with external applications.

Nobody is proposing that intermediate step as an endpoint; it is
merely an important step that can be taken immediately to clean up
some very obscure existing practice and pave the way for support of
full Unicode.

I'm quite in agreement with Jeff that using UCS-4 as an internal
encoding is a waste of space. Even the Unicode Consortium's official
documents state that in most applications, characters outside the BMP
will be uncommon.

There is currently some active discussion behind the scenes about how
best to achieve efficient character indexing and character ranges in
strings where the character's length in memory is variable. Adopting
a UCS-4 internal representation is simple, but wasteful.
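
To put rough numbers on the space argument above, here is a small
sketch in Python (not Tcl, and not Tcl's internals) that counts how
much storage the same text needs in UTF-8, UTF-16 and UCS-4. The
function name storage_cost and the sample strings are made up for
illustration only.

```python
def storage_cost(text: str) -> dict:
    """Return the size of `text` in several encodings (illustrative helper)."""
    utf16 = text.encode("utf-16-le")   # explicit byte order, so no BOM is added
    return {
        "characters": len(text),                    # what a script should see
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(utf16) // 2,        # surrogate pairs count twice
        "utf16_bytes": len(utf16),
        "ucs4_bytes": len(text.encode("utf-32-le")),
    }


# Sample strings are assumptions chosen for the demonstration.
bmp_only = "Tcl \u0100\u00ff~"       # every character lies inside the BMP
with_astral = "G clef: \U0001D11E"   # U+1D11E lies outside the BMP

print(storage_cost(bmp_only))
print(storage_cost(with_astral))

# The same non-BMP character as a UTF-16 surrogate pair, in both byte orders.
print("\U0001D11E".encode("utf-16-be").hex())   # d834dd1e
print("\U0001D11E".encode("utf-16-le").hex())   # 34d81edd
```

For the BMP-only string, UCS-4 takes exactly twice the bytes of
UTF-16. The second string has nine characters but ten UTF-16 code
units, which is the "appearing as two characters" effect described
above.
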
The Magic Tcl Gnomes are considering various more compact
alternatives; many of them change the cost of character indexing from
O(1) to O(log N), where N is the length of the string, but require
only 1/4 the memory or less.

--
73 de ke9tv/2, Kevin KENNY     GE Corporate Research & Development
ke...@cr...                    P. O. Box 8, Bldg. K-1, Rm. 5B36A
                               Schenectady, New York 12301-0008 USA
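
As a rough illustration of the O(1) versus O(log N) trade-off
mentioned above, here is one possible scheme, sketched in Python. It
is an assumption made for the sake of example, not necessarily any of
the alternatives actually under discussion: the string is stored as
UTF-16 code units, with a sorted side table recording which character
positions hold non-BMP characters. The class name Utf16String and its
methods are hypothetical.

```python
from array import array
from bisect import bisect_left


class Utf16String:
    """Hypothetical sketch: UTF-16 code units plus a sorted side table."""

    def __init__(self, text: str):
        self.units = array("H")   # unsigned 16-bit code units
        self.astral = []          # character indices of non-BMP chars, ascending
        for char_index, ch in enumerate(text):
            cp = ord(ch)
            if cp >= 0x10000:
                # Split into a high/low surrogate pair.
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))
                self.units.append(0xDC00 + (cp & 0x3FF))
                self.astral.append(char_index)
            else:
                self.units.append(cp)

    def __len__(self) -> int:
        # Length in characters: each non-BMP character used one extra unit.
        return len(self.units) - len(self.astral)

    def char_at(self, index: int) -> str:
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Binary search: how many non-BMP characters precede this one?
        k = bisect_left(self.astral, index)
        offset = index + k        # each earlier pair shifts the offset by one
        unit = self.units[offset]
        if k < len(self.astral) and self.astral[k] == index:
            low = self.units[offset + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)


s = Utf16String("a\U0001D11Eb\U00010000c")
print(len(s), len(s.units))                    # characters vs. code units
print([hex(ord(s.char_at(i))) for i in range(len(s))])
```

Each lookup costs one binary search over the side table, so indexing
is O(log K), where K is the number of non-BMP characters and K is at
most N. A string with no such characters keeps an empty table and
pays only two bytes per character.
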