
#1459 Passing list w/UTF-8 from C can fail

obsolete: 8.4.4
closed-fixed
5
2003-08-27
2001-03-28
No

On certain installations of Tcl/Tk 8.3.1, passing UTF-8 character
triplets ending in octal 240 (decimal 160, hex A0) interferes with list
delimitation when Tcl_AppendElement is used to return a result from a C
function. In particular, if a UTF-8 string ending in octal 240 is
appended to the result, and then another UTF-8 string is appended
afterwards, the octal 240 seems to be interpreted as a "forward delete"
character of some kind, with the result that the separation between the
two list elements is erased and they are interpreted as one.

The following C function, when called from Tcl, illustrates the problem.

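#include <string.h>
#include <tcl.h>

int sendCharList(ClientData clientData, Tcl_Interp *interp,
                 int argc, char **argv)
{
    /* Each buffer holds one 3-byte UTF-8 character plus the NUL. */
    char s1[5], s2[5], s3[5], s4[5];

    /* Every string ends in the byte \240 (hex A0). */
    strcpy(s1, "\345\220\240");
    strcpy(s2, "\345\214\240");
    strcpy(s3, "\351\235\240");
    strcpy(s4, "\347\264\240");

    Tcl_ResetResult(interp);

    /* Append each string to the result as a separate list element. */
    Tcl_AppendElement(interp, s1);
    Tcl_AppendElement(interp, s2);
    Tcl_AppendElement(interp, s3);
    Tcl_AppendElement(interp, s4);

    return TCL_OK;
}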

The Tcl calls:

set s6 [sendCharList]
puts "[llength $s6] , [string length $s6]"

should output "4 , 7": a list of 4 elements, each a single UTF-8
composite character, joined by 3 delimiters. On some systems it does.
On others, however, the output is "1 , 4", resulting from deletion of
the list delimiters somewhere during passage from C to Tcl. A complete
test program involving the above (plus some additional tests, and using
wish rather than tclsh) may be accessed at:
ftp://zakros.ucsd.edu/arobert/Temp/testTclBug.tgz (it is also attached).
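
The two outputs quoted above can be checked by hand. The following is a
minimal sketch, not part of the attached test program: the helper name
utf8_char_count is made up for illustration, and it assumes the input
is valid UTF-8. It counts characters the way "string length" would see
them, by skipping continuation bytes.

```c
#include <stddef.h>

/*
 * Count the characters in a NUL-terminated UTF-8 string by counting
 * every byte that is NOT a continuation byte (bit pattern 10xxxxxx).
 * Hypothetical helper for illustration; assumes valid UTF-8 input.
 */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

With the four strings from the report joined by single spaces, this
gives 7 (four characters plus three delimiters). With the delimiters
missing it gives 4, which matches the "1 , 4" symptom.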

A full application that exposes the bug (and led to its
discovery) may be found at:
http://freshmeat.net/projects/hanzim

Unfortunately, I have not been able to isolate why some installations
exhibit the bug and some don't. A default SUSE 7.0 Linux installation
of 8.3.1 had the problem, while a default Slackware 7.1 installation of
the same Tcl/Tk version did not. Maybe it is a compilation flag
difference... ?

I'm also not sure whether it persists in 8.3.2 or 8.4.

Discussion

1 2 3 > >> (Page 1 of 3)
  • Adrian Robert

    Adrian Robert - 2001-03-28

    test program and makefile

     
  • Don Porter

    Don Porter - 2001-03-28
    • assigned_to: nobody --> dgp
    • labels: 105681 --> 104239
    • status: open --> closed-invalid
     
  • Don Porter

    Don Porter - 2001-03-28

    Logged In: YES
    user_id=80530

    TclNeedSpace() is not UTF-8 aware. That's why routines
    that call it, like Tcl_AppendElement() are deprecated.
    (See the documentation.)

    Rewrite your command procedure like so:

    Tcl_Obj *resultPtr;
    ...
    Tcl_ResetResult(interp);
    resultPtr = Tcl_GetObjResult(interp);
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s1, -1));
    ...
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s4, -1));
    return TCL_OK;

     
  • Adrian Robert

    Adrian Robert - 2001-03-29
    • status: closed-invalid --> open-invalid
     
  • Adrian Robert

    Adrian Robert - 2001-03-29

    Logged In: YES
    user_id=146959

    Thanks very much for the response and proposed solution.
    However, the documentation in the man page unfortunately
    says nothing about this issue. It only says that it is
    best to use the object versions of the result-handling
    functions because doing so is "significantly more efficient".
    This is hardly incentive to go and learn a framework that
    is significantly more complex at first sight when all one
    wants to do is pass a string and everything has been
    running fast enough as it is. Since string handling is
    said to be fully Unicode-based in Tcl/Tk 8.1 and above,
    a developer's default assumption is that "string" means
    "internationalized, UTF-8, or what have you string", and
    that Tcl_AppendElement therefore does not present a problem.

    The real solution, it seems to me, is to repair the deficiency
    in TclNeedSpace(), but there may be other constraints,
    performance among them, that argue against this. If this
    repair is not made, the documentation for Tcl_AppendElement
    and "routines that call it" (how exactly is the typical
    Tcl/Tk end-developer supposed to know which those are?)
    should be updated to reflect the fact that they should not
    be used for anything but ASCII. Maybe there is some other
    documentation that says something about these issues, but
    it should be in the man page as well.

     
  • Adrian Robert

    Adrian Robert - 2001-03-29

    Logged In: YES
    user_id=146959

    Also, could you please post a pointer to the documentation
    you are referring to? It would help clear up other
    questions like whether Tcl_Merge is affected...

    For example, the docs at
    http://dev.scriptics.com/doc/howto/i18n.html do not so much
    as hint at the problem. They merely say that all the Tcl C
    APIs expect UTF-8 strings, and that everything should work
    perfectly if they get them...

     
  • Adrian Robert

    Adrian Robert - 2001-03-29
    • labels: 104239 --> 105681
     
  • Don Porter

    Don Porter - 2001-03-29
    • labels: 105681 --> 104239
     
  • Don Porter

    Don Porter - 2001-03-29

    Logged In: YES
    user_id=80530

    I was talking about the man page for Tcl_AppendElement():

    http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm

    Now, reading the I18N HOWTO, it looks like I was reading
    "deprecated" too strongly. Tcl_DStringAppendElement() and
    Tcl_DStringStartSublist() also rely on TclNeedSpace() and
    they have not been deprecated, so TclNeedSpace() needs to
    be fixed after all. This bug is re-opened.

    Looking at TclNeedSpace() explains the mysterious platform
    dependence. The buggy symptoms you report will be present
    on those platforms/locales for which isspace(0240) returns
    true.

    I've attached a patch that I think will correct the problem.
    It's possible that it has other undesirable side-effects, so
    I've assigned this report to one of the maintainers of
    generic/tclUtil.c for review.

    Meanwhile you can use the workaround I posted in the first
    comment.

    Tcl_Merge() is safe for UTF-8 strings.
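
The platform dependence described above can be modelled in a few lines.
This is a simplified sketch under stated assumptions, not the code in
generic/tclUtil.c: the names need_space_model, ascii_isspace and
latin1_like_isspace are invented for illustration, and the brace and
backslash handling of the real routine is left out. The whitespace test
is passed in as a parameter so that the behaviour of a locale where
isspace(0240) is true can be simulated deterministically.

```c
#include <string.h>

/* Hypothetical predicate type: nonzero if the byte counts as whitespace. */
typedef int (*space_pred)(int);

/* ASCII-only whitespace test; never consults the C locale. */
int ascii_isspace(int c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Simulates a locale (for example one based on ISO-8859-1) in which
 * isspace(0240) returns true. */
int latin1_like_isspace(int c)
{
    return ascii_isspace(c) || c == 0xA0;
}

/*
 * Simplified model of the decision "does the buffer need a separator
 * before the next list element?". A separator is needed unless the
 * buffer is empty or its last BYTE is whitespace according to is_space.
 */
int need_space_model(const char *buf, space_pred is_space)
{
    size_t len = strlen(buf);

    if (len == 0) {
        return 0;
    }
    return !is_space((unsigned char)buf[len - 1]);
}
```

For a buffer holding "\345\220\240", the model asks for a separator
when given ascii_isspace, but not when given latin1_like_isspace. In
the second case the next element is appended with no delimiter, which
is the reported symptom.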

     
  • Don Porter

    Don Porter - 2001-03-29
    • assigned_to: dgp --> msofer
     
  • miguel sofer

    miguel sofer - 2001-03-30

    Logged In: YES
    user_id=148712

    This bug is related to bugs #408568 and #227512.
    See TIP #20 at
    http://www.cs.man.ac.uk/fellowsd-bin/TIP/

     
  • Adrian Robert

    Adrian Robert - 2001-04-02
    • labels: 104239 --> 105681
     
  • Adrian Robert

    Adrian Robert - 2001-04-02

    Logged In: YES
    user_id=146959

    Yes, OK, the suggestion in TIP #20 just mentioned, of adding
    a locale-independent isspace() to Tcl and using that, would
    prevent the problem I had, which arises because 0240 is defined
    as a "no-break" space in a number of important character
    encodings, such as ISO-8859-1. This leads a great many locales,
    including en_US, to define 0240 as being in the whitespace
    category. Since many UTF-8 characters contain the byte 0240,
    this can lead to problems...
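
Why so many UTF-8 characters end in 0240 follows from the encoding
itself. In the sketch below (the function names are made up for
illustration) the final byte of any multi-byte sequence is 0x80 plus
the low six bits of the code point, so every code point at or above
U+0080 whose low six bits are 100000 ends in hex A0, the no-break
space U+00A0 included.

```c
/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Final byte of the UTF-8 encoding of code point cp. ASCII code points
 * encode as themselves; for everything else the last byte carries the
 * low six bits of cp behind the 10 marker.
 */
unsigned char utf8_last_byte(unsigned int cp)
{
    if (cp < 0x80) {
        return (unsigned char)cp;
    }
    return (unsigned char)(0x80 | (cp & 0x3F));
}
```

The four characters in the original report decode to U+5420, U+5320,
U+9760 and U+7D20, and utf8_last_byte returns 0xA0 for each of them,
as it does for U+00A0.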

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2001-05-03
    • status: open-invalid --> closed-invalid
     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2001-05-03

    Logged In: YES
    user_id=72656

    The basic answer at this point is that if you want space
    chars to be thought of as space chars in Tcl, you should
    restrict yourself to the ascii 7-bit set, of which \240
    isn't part. It works on some systems, where the locale
    isspace('\240') is 1, but that's not reliable.

     
  • Adrian Robert

    Adrian Robert - 2001-09-17

    Logged In: YES
    user_id=146959

    This is NOT a solution. If you don't want to change any
    code, you should at least clarify the documentation so that
    people in the future don't waste their time. The
    documentation should state at the very least that
    List-related methods should NOT be used with UTF-8 strings
    for communications between C and Tcl. Please see the
    comments submitted earlier for this bug for additional
    clarification. Thank you.

     
  • Jeffrey Hobbs

    Jeffrey Hobbs - 2001-09-18

    Logged In: YES
    user_id=72656

    Ah, but you are making a fatal flaw in your argument - you
    are *not* passing UTF-8 strings - you are passing
    incorrectly formed strings through Tcl. If you converted
    these to UTF-8 first (with Tcl_ExternalToUtf), this would
    not have happened. That isn't to say this still doesn't
    need fixing - but it is one of those areas in the core
    where the distinction between using utf-8 and raw data
    became important.

     
  • Don Porter

    Don Porter - 2001-09-18

    Logged In: YES
    user_id=80530

    Sorry if I'm being dense, but what is it about the
    strings in Adrian's example that makes them invalid
    UTF-8 strings? Is it the terminating null bytes?
    How would Tcl_ExternalToUtf be added to the
    reported example code to solve the problem?

     
  • miguel sofer

    miguel sofer - 2001-09-18
    • assigned_to: msofer --> hobbs
     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902

    Jeff just happens to be wrong. :^)

    The example code contains valid UTF-8 strings. The problem
    is that TclNeedSpace doesn't know anything about UTF-8, and
    therefore anything depending on it (Tcl_AppendElement,
    Tcl_DStringAppendElement and Tcl_DStringStartSublist, according
    to a search with grep, plus goodness knows how much in extensions,
    as the code is in the stub table) is *not* UTF-8 safe.

    Unfortunately, none of those three public functions (two of
    which are not deprecated at all) warns in its documentation
    that it is unsafe to pass UTF-8 strings to it. :^(

    The problems in TclNeedSpace are really the 'end--', which is
    fundamentally wrong on UTF-8 strings, and the way it detects
    what character it is looking at, which needs to be much more
    careful when looking at bytes outside \000-\177. Plus
    isspace is not usually Unicode-aware...
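
One way to address both points can be sketched as follows. This is an
illustration only, not the patch attached to this report: utf8_prev
and need_space_utf8 are hypothetical names, the input is assumed to be
valid UTF-8, the brace and backslash cases are ignored, and it assumes
that only the ASCII whitespace characters count as list separators.
Instead of a bare 'end--', the code steps back to the first byte of
the last character, and the whitespace test never calls isspace().

```c
#include <string.h>

/*
 * Return a pointer to the first byte of the character that ends just
 * before 'end', never moving before 'start'. Continuation bytes
 * (10xxxxxx) are skipped. Assumes valid UTF-8.
 */
const char *utf8_prev(const char *start, const char *end)
{
    if (end == start) {
        return start;
    }
    end--;
    while (end > start && ((unsigned char)*end & 0xC0) == 0x80) {
        end--;
    }
    return end;
}

/*
 * Hypothetical UTF-8-aware check: a separator is needed unless the
 * buffer is empty or its last CHARACTER is ASCII whitespace. The lead
 * byte of a multi-byte character is always >= 0x80, so it can never
 * be mistaken for whitespace here.
 */
int need_space_utf8(const char *buf)
{
    const char *end = buf + strlen(buf);
    unsigned char c;

    if (end == buf) {
        return 0;
    }
    c = (unsigned char)*utf8_prev(buf, end);
    return !(c == ' ' || (c >= '\t' && c <= '\r'));
}
```

For a buffer ending in "\345\220\240" the check looks at the lead byte
\345 rather than the trailing \240, so a separator is requested
regardless of the C locale.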

     