This is bugs 823330 and 956433.
On 10/07/2004, at 5:17, Bernhard Spinnler wrote:
> I have the problem that file names returned from [glob] that contain
> accented characters get not written to a file correctly. I remember a
> thread here on this list where someone explained that it has something
> to do with "decomposed" utf-8. But shouldn't those characters be
> converted to the system default when they are written to file?
The thing is that decomposed utf-8 _is_ the system default on Mac OS X.
Essentially, the issue is that in the absence of locale settings, the
default encoding of filenames for the posix file APIs in Mac OS X is
utf-8, hence the tcl [encoding system] of utf-8.
Now, the default filesystem on Mac OS X, HFS+, stores and returns all
filenames in decomposed unicode: it accepts input filenames in any
normalization form, but converts them to normalization form D (NFD,
decomposed unicode) before using them, and all filename output from
the filesystem is in NFD.
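To make the round trip concrete, here is a toy model in Python (chosen only because its standard library exposes normalization; it is not Tcl and does not touch a real filesystem). The class NfdDirectory and its methods are hypothetical names invented for this sketch, and real HFS+ decomposition differs in some details that are not reproduced here.

```python
import unicodedata


class NfdDirectory:
    """Toy stand-in for a directory that stores names in NFD (not real HFS+)."""

    def __init__(self):
        self._entries = {}

    def create(self, name: str, content: str = "") -> None:
        # Any normalization form is accepted, but the stored key is NFD.
        self._entries[unicodedata.normalize("NFD", name)] = content

    def exists(self, name: str) -> bool:
        # Lookups are normalized the same way, so both forms find the file.
        return unicodedata.normalize("NFD", name) in self._entries

    def listing(self) -> list:
        # Everything handed back to the caller is in NFD.
        return sorted(self._entries)


composed = "Am\u00e9lie.txt"   # name as typed, using precomposed U+00E9
d = NfdDirectory()
d.create(composed)
returned = d.listing()[0]      # same file, but now "e" + U+0301
```

In this model the file is found under either spelling, yet the name that comes back is one code point longer than the name that went in.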
Both NFC (composed) and NFD are perfectly valid unicode and under a
fully unicode compliant system should be treated as equal strings.
Unfortunately, tcl currently does not deal with unicode normalization,
so on Mac OS X a filename passed to the filesystem and the same
filename returned by it can be in different normalization forms and
compare unequal, e.g. in a [glob]
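As an illustration of the two forms, the following Python sketch (not Tcl) shows one accented character in each form, the different UTF-8 bytes that end up in a file, and a comparison that normalizes first. The helper name canonically_equal is an assumption made for this example.

```python
import unicodedata

composed = "\u00e9"      # NFC: one code point, U+00E9
decomposed = "e\u0301"   # NFD: U+0065 followed by combining acute U+0301


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed_bytes = composed.encode("utf-8")      # b'\xc3\xa9'
decomposed_bytes = decomposed.encode("utf-8")  # b'e\xcc\x81'
naive_equal = composed == decomposed           # False: plain comparison sees different code points
```

A program that compares or writes the strings without normalizing sees two different values, even though both render as the same character.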
For more on unicode normalization, see
My proposed solution to this is to set the tcl system encoding on Mac
OS X to a new encoding utf-8-nfd, which would compose anything coming
from the filesystem before passing it on to tcl.
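The idea can be sketched outside Tcl as a custom codec. The Python below registers an encoding under the name utf-8-nfd that composes to NFC when decoding; it works on whole buffers only, so it sidesteps the streaming problem entirely. Decomposing again on encode is an assumption of this sketch, as are the helper names _decode, _encode and _search.

```python
import codecs
import unicodedata


def _decode(data, errors="strict"):
    # Bytes from the filesystem: decode as UTF-8, then compose to NFC.
    raw = bytes(data)
    text = raw.decode("utf-8", errors)
    return unicodedata.normalize("NFC", text), len(raw)


def _encode(text, errors="strict"):
    # Assumption of this sketch: hand NFD back to the filesystem.
    return unicodedata.normalize("NFD", text).encode("utf-8", errors), len(text)


def _search(name):
    # Codec names are normalized differently across Python versions,
    # so accept both "utf-8-nfd" and "utf_8_nfd".
    if name.replace("-", "_") == "utf_8_nfd":
        return codecs.CodecInfo(encode=_encode, decode=_decode, name="utf-8-nfd")
    return None


codecs.register(_search)

from_filesystem = b"Ame\xcc\x81lie.txt"      # NFD bytes, as [glob] would see them
name = from_filesystem.decode("utf-8-nfd")   # composed before the program sees it
```

With such an encoding in place, strings inside the program are always composed, regardless of what the filesystem returns.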
I have made some progress on this but the effort has currently stalled,
mainly because an efficient implementation of utf-8 composition in a
streamed interface as needed in tcl encodings is much more difficult
than anticipated, and it looks like most of the unicode normalization
algorithm (which is highly non-trivial) needs to be implemented even
when OS unicode normalization facilities are available (which is the
case on Mac OS X with CoreFoundation).
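A small Python sketch (not Tcl, and not the stalled implementation) shows where the streaming difficulty comes from: a chunk boundary can fall between a base character and its combining mark, so chunks cannot be normalized independently. The class StreamingComposer is a hypothetical name. It holds back text from the last character with combining class 0 until more input arrives, which is a deliberate simplification: it is not correct for cases where two such characters compose with each other (Hangul jamo, for instance), and covering those is part of what the full algorithm requires.

```python
import codecs
import unicodedata


class StreamingComposer:
    """Incrementally decode UTF-8 and compose to NFC (simplified sketch)."""

    def __init__(self):
        # The incremental decoder copes with multi-byte sequences split across chunks.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._pending = ""

    def feed(self, data: bytes, final: bool = False) -> str:
        text = self._pending + self._decoder.decode(data, final)
        if final:
            self._pending = ""
            return unicodedata.normalize("NFC", text)
        # Find the last character with combining class 0 and hold back
        # everything from there on: later marks may still attach to it.
        cut = len(text)
        while cut > 0:
            cut -= 1
            if unicodedata.combining(text[cut]) == 0:
                break
        self._pending = text[cut:]
        return unicodedata.normalize("NFC", text[:cut])


encoded = "Ame\u0301lie.txt".encode("utf-8")
# Split after "e" and again in the middle of the two-byte combining accent.
chunks = [encoded[:3], encoded[3:4], encoded[4:]]

composer = StreamingComposer()
streamed = "".join(composer.feed(c) for c in chunks) + composer.feed(b"", final=True)

# Normalizing text chunks independently leaves the accent uncomposed.
naive = unicodedata.normalize("NFC", "Ame") + unicodedata.normalize("NFC", "\u0301lie.txt")
```

Even this reduced version needs state carried between calls, which a stateless per-chunk conversion cannot provide.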
I'm not anticipating returning to this soon due to lack of time, so
somebody else is most welcome to have a go at this...
OTOH, if you only need to normalize a tcl string that you have fully in
memory already, it'd be simple to write an extension using the
** Daniel A. Steffen ** "And now for something completely
** Dept. of Mathematics ** different" Monty Python
** Macquarie University ** <mailto:steffen@...>
** NSW 2109 Australia ** <http://www.maths.mq.edu.au/~steffen/>