From: Donald G P. <don...@ni...> - 2023-02-03 18:53:53
On 1/27/23 10:36, apnmbx-public--- via Tcl-Core wrote:
>
> I’ve written up my view of “state of Unicode in Tcl 9” at
> https://www.magicsplat.com/tcl9/tcl9unicode.html
>

Thank you for putting this together. You have a real talent for writing
that captures precise details without becoming too tedious. This is very
useful. The document reveals many things that are different from what I
assumed, and raises other shortcomings.

It appears to me that a good stress-testing use case is the task to
create and then later unpack a lossless archive of a directory in a
filesystem. This task intersects with many of the issues presented in
the document. It also touches on the ill-defined Tcl concept of the
"system encoding" and whether it needs revision.

A key command in any archive creation is [glob]. It seems that [glob]
has never been capable of handling file names that fall outside the
system encoding.* Revising Tcl 9 so that it can handle such file names
suggests a few possibilities:

1) Have all Tcl strings make use of PEP 383, or some alternative means
   to represent such names. This implies that the alphabet for Tcl
   strings must include symbols outside the set of Unicode scalar
   values. (PEP 383 uses lone surrogates); OR

2) Revise [glob] so it returns encoding information in addition to the
   list of filenames. The iso-8859-1 encoding is a way to capture every
   filename losslessly, but it is a poor universal solution. Design of
   such a reformed [glob] with reasonable compatibility is at least
   tricky; OR

3) Create a new [glob2]** that returns encoding information without
   messy compatibility constraints, and leave it up to scripts and
   extensions to move from the old command to the new one as they
   perceive a need to robustly handle these edge cases.

Both 2) and 3) may impose constraints and demand revision to the
Tcl_Filesystem interface and its Tcl_FSMatchInDirectoryProc slot. The
encoding to be used to interpret the bytes of a filename might better be
an attribute of a Tcl_Filesystem or of a mount point rather than an
application-wide (and not thread-stable?) notion of a system encoding
pulled in through a side channel.

For 1) if the alphabet for Tcl strings is larger than Unicode scalar
values, that provides a clear use and meaning for [string is unicode]
which has puzzled some people. Maybe a change to [string is usv] would
be clearer to the reader that the test is whether symbols outside the
set of Unicode scalar values are present. These are symbols that cannot
be properly encoded in the Unicode encodings utf-8, utf-16, utf-32.

Thanks again. Plenty to be done here.

*  The one use case I've seen presented for scripts to be able to set
   the system encoding with [encoding system $enc] has been the power to
   work around this problem.

** not a final name choice, just the concept of a new command.

-- 
| Don Porter            Applied and Computational Mathematics Division |
| don...@ni...                       Information Technology Laboratory |
| http://math.nist.gov/~DPorter/                                  NIST |
|______________________________________________________________________|
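
For readers who have not met PEP 383, here is a minimal sketch of the
trade-off between possibilities 1) and 2) above. It is written in Python,
where the PEP 383 mechanism is available as the "surrogateescape" error
handler; it says nothing about how Tcl 9 would implement either option.
The names RAW_NAME, decode_pep383, encode_pep383 and is_usv_only are
hypothetical, chosen only for this illustration, and is_usv_only is just
one guess at what a [string is usv] style test would check.

```python
# A file name whose bytes are NOT valid UTF-8 (0xE9 is "é" in iso-8859-1).
RAW_NAME = b"caf\xe9.txt"


def decode_pep383(raw: bytes) -> str:
    """Decode as UTF-8; each undecodable byte 0xNN becomes lone surrogate U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_pep383(name: str) -> bytes:
    """Inverse of decode_pep383: lone surrogates U+DCNN turn back into byte 0xNN."""
    return name.encode("utf-8", errors="surrogateescape")


def is_usv_only(s: str) -> bool:
    """True if every code point is a Unicode scalar value (no surrogates)."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


# Possibility 1): the string alphabet is widened to include lone surrogates.
name = decode_pep383(RAW_NAME)
round_trip = encode_pep383(name)   # original bytes are recovered exactly
has_only_usv = is_usv_only(name)   # False: the string holds a lone surrogate

# Such a string cannot be written out in strict UTF-8.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# iso-8859-1 maps every byte to a code point, so it is always lossless ...
latin1_name = RAW_NAME.decode("iso-8859-1")
# ... but a name that really was UTF-8 comes back as mojibake.
mojibake = "café.txt".encode("utf-8").decode("iso-8859-1")
```

The round trip through decode_pep383 and encode_pep383 preserves the
bytes, at the cost of strings that fail the scalar-value test and that
strict utf-8 refuses to encode. The iso-8859-1 route also preserves the
bytes, but the text the script sees is only right when the name really
was iso-8859-1, which is why it makes a poor universal answer.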