#2727 [glob] fails conversion to internal encoding

obsolete: 8.5a1
closed-duplicate
5
2005-03-10
2004-05-19
Don Porter
No

On a Mac OS X system:

% encoding system
utf-8
% glob *
no files matched glob pattern "*"
% set s \u00e9
é
% string bytelength $s
2
% close [open $s w]
% set f [glob *]
é
% string bytelength $f
3
% binary scan [encoding convertto identity $s] c* bytes; set bytes
-61 -87
% binary scan [encoding convertto identity $f] c* bytes; set bytes
101 -52 -127

I start with the character \u00E9 in
its 2-byte internal encoding: xC3 xA9

I write a file with that name, retrieve
the file name with [glob], and get
a different representation: x65 xCC x81,
which is our internal encoding for
\u0065 \u0301

% format %s \u00E9
é
% format %s \u0065\u0301
é

Those two outputs look the same in the
interactive shell, but not in this
browser text entry.
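
For readers without a Mac OS X system at hand, the two byte sequences reported above can be reproduced outside Tcl. The following is a minimal sketch in Python, not part of the original report; it assumes that the filesystem returned the name in canonically decomposed form (NFD), and the helper name `signed_bytes` is made up for the illustration.

```python
import unicodedata

composed = "\u00e9"                                   # U+00E9, precomposed
decomposed = unicodedata.normalize("NFD", composed)   # U+0065 U+0301

def signed_bytes(text):
    """UTF-8 bytes as signed 8-bit integers, like Tcl's [binary scan ... c*]."""
    return [b - 256 if b > 127 else b for b in text.encode("utf-8")]

print(signed_bytes(composed))      # [-61, -87]
print(signed_bytes(decomposed))    # [101, -52, -127]
print(composed == decomposed)      # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

The two strings render identically but compare unequal until one of them is normalized, which matches the behaviour shown in the session above.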

Discussion

  • Don Porter

    Don Porter - 2004-05-19

    Logged In: YES
    user_id=80530

    A worse symptom is that
    [glob f??] and [glob f\u00E9?]
    both fail to return the file.

     
  • Don Porter

    Don Porter - 2004-05-19

    Logged In: YES
    user_id=80530

    oops, changed examples in mid-stream.

    I think I meant to complain that
    [glob ?] fails to return the file, but
    I'm not on a system where I can
    verify that.

    kbk suggests the answers to this
    issue may lie at:

    http://www.unicode.org/reports/tr15/

     
  • Vince Darley

    Vince Darley - 2004-05-19
    • assigned_to: vincentdarley --> das
     
  • Vince Darley

    Vince Darley - 2004-05-19

    Logged In: YES
    user_id=32170

    I believe this is a duplicate of bug 823330, which Daniel
    Steffen is the expert on!

     
  • Daniel A. Steffen

    Logged In: YES
    user_id=90580

    indeed, this is a symptom of bug 823330: tcl does not
    know about unicode normalization, in particular how to deal with
    decomposed unicode, which
    is the system encoding on Mac OS X, i.e. the encoding
    used by the posix API to communicate with the
    filesystem: composed unicode input to the filesystem will return as
    decomposed unicode on Mac OS X.
    My proposed solution is for the system encoding on OSX to compose
    all input before passing it higher up. Unfortunately, composing utf8
    (as opposed to utf32) is non-trivial in a stream interface as needed
    for a tcl encoding... My effort to implement this has currently
    stalled, somebody else is most welcome to have a go; otherwise I'll
    get back to it at some point, it's on the todo list...
    It might be worthwhile for somebody to devise a general solution to
    how tcl deals with unicode normalization, including all normalization
    forms, not just NFC and NFD like we need in this context.

     
  • Daniel A. Steffen

    • status: open --> closed
     
  • Daniel A. Steffen

    • status: closed --> closed-duplicate
     
  • Daniel A. Steffen

    Logged In: YES
    user_id=90580

    closed as duplicate of bug 823330:
    http://sourceforge.net/tracker/?func=detail&aid=823330&group_id=10894&atid=110894
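
The wildcard symptom raised in the discussion (a one-character pattern such as `?` not matching the file) follows from the decomposed name being two code points long. The sketch below illustrates this in Python with `fnmatch`, which is only a stand-in for Tcl's [glob] matching and not the same implementation; `match_normalized` is a hypothetical helper, and composing both sides to NFC is just one possible remedy.

```python
import fnmatch
import unicodedata

def match_normalized(name, pattern):
    """Hypothetical helper: compose name and pattern to NFC, then match."""
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return fnmatch.fnmatchcase(nfc(name), nfc(pattern))

stored = "e\u0301"   # name as assumed to come back from the filesystem (NFD)

print(fnmatch.fnmatchcase(stored, "?"))        # False: two code points, one "?"
print(fnmatch.fnmatchcase(stored, "\u00e9"))   # False: composed literal differs
print(match_normalized(stored, "?"))           # True
print(match_normalized(stored, "\u00e9"))      # True
```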

     
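The remark that composing is non-trivial in a stream interface can be illustrated as follows. This Python sketch works on already decoded code points, so it leaves out the additional UTF-8 byte-boundary problem mentioned in the comment; both function names are invented for the example, and the carry-over variant only handles a base character followed by combining marks, not compositions of two starters such as Hangul jamo.

```python
import unicodedata

def compose_per_chunk(chunks):
    """Naive approach: normalize each chunk to NFC on its own."""
    return "".join(unicodedata.normalize("NFC", c) for c in chunks)

def compose_with_carry(chunks):
    """Sketch: hold back the last base character and its combining marks
    until a later chunk (or end of input) shows the sequence is complete.
    Only base + combining-mark sequences are handled; compositions of two
    starters (for example Hangul jamo) are NOT covered."""
    out = []
    pending = ""
    for chunk in chunks:
        data = pending + chunk
        # Index of the last starter (canonical combining class 0), if any.
        cut = max((i for i, ch in enumerate(data)
                   if unicodedata.combining(ch) == 0), default=0)
        out.append(unicodedata.normalize("NFC", data[:cut]))
        pending = data[cut:]
    out.append(unicodedata.normalize("NFC", pending))
    return "".join(out)

# "cafés" in decomposed form, split between the "e" and U+0301.
chunks = ["cafe", "\u0301s"]

print(compose_per_chunk(chunks) == "caf\u00e9s")    # False: the mark was orphaned
print(compose_with_carry(chunks) == "caf\u00e9s")   # True
```

A chunk boundary that separates a base character from its combining mark defeats per-chunk normalization, so a streaming converter has to keep some state between calls.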
