From: Donald G P. <dg...@em...> - 2003-11-20 17:57:56

> There is one thing in Tcl that uses it: The encoding of the null byte.
> It is encoded as non shortest sequence. I don't know if it is sent over
> the net in this form if a socket is configured as -encoding utf-8

I think we should be careful in this discussion to distinguish the
working of the [encoding] command and the -encoding option to
[fconfigure] from the "internal encoding" used by Tcl's C API.

We've called that "internal encoding" UTF-8. That's never been
entirely true, AIUI, and as the true UTF-8 standard has evolved, it's
apparently less true now. Perhaps we should be more careful in our
descriptions and documentation (apparently CESU is a better name for
things closer to what we're doing) so that we don't mislead people,
but I don't see any reason Tcl should need to change its internals.
No standard body should have a care about how Tcl's internals are
organized.

What we perhaps do need to do is provide sufficient tools with our
[encoding] command and our -encoding option to allow Tcl application
programmers to create programs and libraries that conform to the UTF-8
spec laid down in RFC 3629. I think that creation of a new encoding,
"utf-8-rfc3629" might be sufficient to address that issue.

When using that new encoding, which presumably would not accept invalid
UTF-8 input, we'd need to sort out among Donal's options of how to
react to invalid UTF-8. Note that none of Tcl's current encodings
have any script-level reaction to invalid input. The TCL_CONVERT_SYNTAX
return code from Tcl_ExternalToUtf() is silently ignored.

| Don Porter          Mathematical and Computational Sciences Division |
| don...@ni...                       Information Technology Laboratory |
| http://math.nist.gov/~DPorter/                                  NIST |
|______________________________________________________________________|
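To make the distinction in the message concrete, here is a minimal sketch in Python (not Tcl) of how a strict, RFC 3629-style decoder treats the non-shortest encoding of the null byte. It assumes the two-byte form in question is 0xC0 0x80; the helper names `decode_strict` and `decode_lenient_nul` are hypothetical and exist only for this illustration. It says nothing about what Tcl itself sends over a socket.

```python
# Assumption: the non-shortest (overlong) form of U+0000 is the byte pair C0 80.
OVERLONG_NUL = b"\xc0\x80"


def decode_strict(data: bytes) -> str:
    """Decode with a strict UTF-8 decoder, which rejects overlong forms."""
    return data.decode("utf-8")  # errors="strict" is the default


def decode_lenient_nul(data: bytes) -> str:
    """Hypothetical lenient variant: accept C0 80 as U+0000, stay strict otherwise.

    The byte C0 never occurs in valid UTF-8, so the replacement cannot
    alter any well-formed sequence.
    """
    return data.replace(OVERLONG_NUL, b"\x00").decode("utf-8")


if __name__ == "__main__":
    sample = b"a" + OVERLONG_NUL + b"b"
    try:
        decode_strict(sample)
    except UnicodeDecodeError as err:
        print("strict decoder rejected the input:", err.reason)
    print(repr(decode_lenient_nul(sample)))
```

The two functions correspond to two of the possible reactions to invalid input that the message leaves open: report an error, or quietly map the one known non-conforming sequence to its standard form.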