Just in the past few days, a developer contacted me about a production
problem. They were processing some Unicode data in a Tcl script, using tdom.
The file contained 𝒜, a script A (U+1D49C). When tdom hit this character,
it replied:
"tcldom_AppendEscaped: can only handle UTF-8 chars up to 3 bytes length"
I contacted the tdom author, who tells me this is because the default Tcl
build only handles Unicode characters whose UTF-8 encoding is at most 3
bytes long.
With the increasing use of Unicode around the world, and Tcl having one of
the premier libraries for manipulating such data, are there any technical
reasons not to extend Tcl's support from 3-byte Unicode to whatever the
next level of Unicode character size might be?
I've been told that one could always build a custom version of Tcl by
modifying the relevant define in tcl.h (I believe). However, I was just
wondering what the consequences of that might be. If everything is certain
to work, then what forces are at play that would prevent Tcl from shipping
with the value set higher?
I'm trying to figure out what our next step should be. Tell authors and
publishers they can't use all of Unicode? Write the code in some other
language, or at least write some sort of pre-filter that encodes the bigger
characters before Tcl sees them?
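For what it's worth, that pre-filter idea could be sketched roughly as
follows (Python used purely for illustration; escape_astral is a made-up
name, and replacing supplementary-plane characters with XML numeric
character references assumes the XML parser will expand them on input):

```python
import re

# Match any character outside the Basic Multilingual Plane (> U+FFFF),
# i.e. anything whose UTF-8 encoding needs a 4th byte.
ASTRAL = re.compile(r"[\U00010000-\U0010FFFF]")

def escape_astral(text: str) -> str:
    """Replace supplementary-plane characters with XML numeric
    character references, e.g. U+1D49C -> &#x1D49C;."""
    return ASTRAL.sub(lambda m: f"&#x{ord(m.group(0)):X};", text)

print(escape_astral("script-A: \U0001D49C"))  # script-A: &#x1D49C;
```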
That latter option is of course off topic for this list; I'll do work on it
elsewhere. The on-topic question is in the first few lines of this message:
what impact should we expect, and what might be preventing the shipping Tcl
distribution from using a larger character size?
Tcl - The glue of a new generation. http://wiki.tcl.tk/
Larry W. Virden  http://www.purl.org/net/lvirden/  http://www.xanga.com/lvirden/
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.