On certain installations of Tcl/Tk 8.3.1, passing three-byte UTF-8
characters whose final byte is octal 240 (decimal 160, hex A0)
interferes with list delimitation when Tcl_AppendElement is used to
return a result from a C function. In particular, if a UTF-8 string
ending in octal 240 is appended to the result, and then another UTF-8
string is appended afterwards, the octal 240 seems to be interpreted
as a "forward delete" character of some kind, with the result that the
separation between the two list elements is erased and they are
interpreted as one.
The following C function, when called from Tcl, illustrates the
problem.
#include <string.h>
#include <tcl.h>

int sendCharList(ClientData clientData, Tcl_Interp *interp,
                 int argc, char **argv)
{
    /* Four 3-byte UTF-8 characters, each ending in octal 240. */
    char s1[5], s2[5], s3[5], s4[5];
    strcpy(s1, "\345\220\240");
    strcpy(s2, "\345\214\240");
    strcpy(s3, "\351\235\240");
    strcpy(s4, "\347\264\240");
    Tcl_ResetResult(interp);
    /* Each call should add one list element to the string result. */
    Tcl_AppendElement(interp, s1);
    Tcl_AppendElement(interp, s2);
    Tcl_AppendElement(interp, s3);
    Tcl_AppendElement(interp, s4);
    return TCL_OK;
}
The Tcl calls:
    set s6 [sendCharList]
    puts "[llength $s6] , [string length $s6]"
should output "4 , 7" (4 list elements, each a single multi-byte UTF-8
character, plus 3 delimiters). On some systems it does. On others,
however, the output is "1 , 4", resulting from deletion of the list
delimiters somewhere during passage from C to Tcl. A complete test
program involving the above (plus some additional tests, and using
wish rather than tclsh) may be accessed at:
ftp://zakros.ucsd.edu/arobert/Temp/testTclBug.tgz
(it is also attached).
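The two outputs can be reproduced by hand, because string length in
Tcl counts characters rather than bytes. The following is a minimal
sketch, assuming valid UTF-8 input; utf8CharCount is a hypothetical
helper name and is not part of the attached test program.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the characters in a valid UTF-8 string.
 * Every character contributes exactly one byte that is not a
 * continuation byte (bit pattern 10xxxxxx), so counting those bytes
 * counts the characters.
 */
size_t utf8CharCount(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

Applied to the expected result (the four characters separated by three
spaces, 15 bytes in all) the function returns 7. Applied to the four
characters run together with no separators (12 bytes) it returns 4,
which matches the "1 , 4" output.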
A full application that exposes the bug (and led to its discovery) may
be found at:
http://freshmeat.net/projects/hanzim
Unfortunately, I have not been able to isolate why some installations
exhibit the bug and some don't. A default SUSE 7.0 Linux installation
of 8.3.1 had the problem, while a default Slackware 7.1 installation
of the same Tcl/Tk version did not. Maybe it is a compilation flag
difference...?
I'm also not sure whether it persists in 8.3.2 or 8.4.
test program and makefile
Logged In: YES
user_id=80530
TclNeedSpace() is not UTF-8 aware. That's why routines
that call it, like Tcl_AppendElement(), are deprecated.
(See the documentation.)
Rewrite your command procedure like so:
    Tcl_Obj *resultPtr;
    ...
    Tcl_ResetResult(interp);
    resultPtr = Tcl_GetObjResult(interp);
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s1, -1));
    ...
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s4, -1));
    return TCL_OK;
Logged In: YES
user_id=146959
Thanks very much for the response and proposed solution.
Unfortunately, the documentation in the man page says nothing
about this issue. It only says that it is best to use the
object versions of the result-handling functions because it
is "significantly more efficient". This is hardly an incentive
to go and learn a framework that is significantly more complex
at first sight when all one wants to do is pass a string and
everything has been running fast enough as it is. Since
string handling is said to be fully Unicode-based in Tcl/Tk 8.1
and above, a developer's default assumption is that "string"
means "internationalized, UTF-8, or what have you string",
and that Tcl_AppendElement therefore does not present a
problem.
The real solution, it seems to me, is to repair the deficiency
in TclNeedSpace(), but there may be other constraints,
performance among them, that argue against this. If this
repair is not made, the documentation for Tcl_AppendElement
and "routines that call it" (how exactly is the typical
Tcl/Tk end-developer supposed to know which those are?)
should be updated to reflect the fact that they should not
be used for anything but ASCII. Maybe there is some other
documentation that says something about these issues, but
it should be in the man page as well.
Logged In: YES
user_id=146959
Also, could you please post a pointer to the documentation
you are referring to? It would help clear up other
questions like whether Tcl_Merge is affected...
For example, the docs at
http://dev.scriptics.com/doc/howto/i18n.html do not so much
as hint at the problem. They merely say that all the Tcl C
APIs expect UTF-8 strings, and that everything should work
perfectly if they get them...
Logged In: YES
user_id=80530
I was talking about the man page for Tcl_AppendElement():
http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm
Now, reading the I18N HOWTO, it looks like I was reading
"deprecated" too strongly. Tcl_DStringAppendElement() and
Tcl_DStringStartSublist() also rely on TclNeedSpace() and
they have not been deprecated, so TclNeedSpace() needs to
be fixed after all. This bug is re-opened.
Looking at TclNeedSpace() explains the mysterious platform
dependence. The buggy symptoms you report will be present
on those platforms/locales for which isspace(0240) returns
true.
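The platform dependence can be checked outside Tcl with a few lines of
C. This is a minimal sketch: byte0240IsSpaceIn is a hypothetical helper
name, and which locale names are installed on a given system is an
assumption the caller has to supply.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Tcl): report whether the C
 * library's isspace() treats byte 0240 (0xA0) as whitespace under
 * the named locale. Returns 1 or 0, or -1 if the locale cannot be
 * selected on this system. Note that it leaves LC_CTYPE set to the
 * requested locale.
 */
int byte0240IsSpaceIn(const char *localeName)
{
    assert(localeName != NULL);
    if (setlocale(LC_CTYPE, localeName) == NULL) {
        return -1;
    }
    /*
     * isspace() expects a value in the range of unsigned char, so
     * pass 0xA0 directly rather than a possibly negative plain char.
     */
    return isspace(0xA0) != 0;
}
```

In the "C" locale the function returns 0. Passing "" selects the locale
from the environment; a result of 1 there would indicate a setup on
which the symptoms in this report can be expected.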
I've attached a patch that I think will correct the problem.
It's possible that it has other undesirable side-effects, so
I've assigned this report to one of the maintainers of
generic/tclUtil.c for review.
Meanwhile you can use the workaround I posted in the first
comment.
Tcl_Merge() is safe for UTF-8 strings.
Logged In: YES
user_id=148712
This bug is related to bugs #408568 and #227512.
See TIP #20 at
http://www.cs.man.ac.uk/fellowsd-bin/TIP/
Logged In: YES
user_id=146959
Yes, OK, the suggestion in TIP #20 just mentioned, adding a
locale-independent isspace() to Tcl and using that, would prevent
the problem I had, which arises because 0240 is defined as a
"no-break" space in a number of important character encodings,
such as ISO-8859-1. This leads a great many locales, including
en_US, to define 0240 as being in the whitespace category.
Since many UTF-8 characters have 0240 inside them, this can lead
to problems...
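A locale-independent test of the kind described could be as small as
the sketch below. It is an illustration only: asciiIsSpace is a
hypothetical name, not the function proposed in TIP #20, and it assumes
that only the six ASCII whitespace characters should be treated as
whitespace.

```c
/*
 * Hypothetical locale-independent replacement for isspace(): true
 * only for the six ASCII whitespace characters, whatever locale is
 * in effect. Bytes above 0177, including 0240, are never whitespace.
 */
int asciiIsSpace(int c)
{
    switch (c) {
    case ' ':
    case '\t':
    case '\n':
    case '\v':
    case '\f':
    case '\r':
        return 1;
    default:
        return 0;
    }
}
```

Because the result never depends on setlocale(), the trailing 0240 byte
of a multi-byte UTF-8 character cannot be mistaken for a separator.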
Logged In: YES
user_id=72656
The basic answer at this point is that if you want space
chars to be thought of as space chars in Tcl, you should
restrict yourself to the 7-bit ASCII set, of which \240
isn't part. It works on some systems, where isspace('\240')
returns 1 in the current locale, but that's not reliable.
Logged In: YES
user_id=146959
This is NOT a solution. If you don't want to change any
code, you should at least clarify the documentation so that
people in the future don't waste their time. The
documentation should state at the very least that
List-related methods should NOT be used with UTF-8 strings
for communications between C and Tcl. Please see the
comments submitted earlier for this bug for additional
clarification. Thank you.
Logged In: YES
user_id=72656
Ah, but there is a fatal flaw in your argument - you
are *not* passing UTF-8 strings - you are passing
incorrectly formed strings through Tcl. If you converted
these to UTF-8 first (with Tcl_ExternalToUtf), this would
not have happened. That isn't to say this doesn't still
need fixing - but it is one of those areas in the core
where the distinction between using UTF-8 and raw data
became important.
Logged In: YES
user_id=80530
Sorry if I'm being dense, but what is it about the
strings in Adrian's example that makes them invalid
UTF-8 strings? Is it the terminating null bytes?
How would Tcl_ExternalToUtf be added to the
reported example code to solve the problem?
Logged In: YES
user_id=79902
Jeff just happens to be wrong. :^)
The example code contains valid UTF-8 strings. The problem
is that TclNeedSpace doesn't know anything about UTF-8 and
therefore anything depending on it (Tcl_AppendElement,
Tcl_DStringAppendElement and Tcl_DStringStartSublist, according
to a search with grep, plus goodness knows how much in extensions,
as the code is in the stub table) is *not* UTF-8 safe.
Unfortunately, none of those three public functions (two of
which are not deprecated at all) warns in its documentation
that it is unsafe to pass UTF-8 strings to it. :^(
The problems in TclNeedSpace are really the 'end--', which is
fundamentally wrong on UTF-8 strings, and the way it detects
what character it is looking at, which needs to be much more
careful when looking at bytes outside \000-\177. Plus
isspace is not usually Unicode-aware...
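To illustrate why a plain 'end--' is wrong on UTF-8: stepping back one
byte from the end of a string lands on the last byte of the final
character, which for a multi-byte character is a continuation byte
such as 0240, not the character itself. The sketch below is not the
TclNeedSpace code or its patch; lastCharStart and isContinuationByte
are hypothetical names, and valid UTF-8 input is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int isContinuationByte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: return a pointer to the first byte of the
 * last character of the non-empty, valid UTF-8 string s. Instead of
 * a single 'end--', it keeps stepping back while it is still inside
 * a multi-byte character.
 */
const char *lastCharStart(const char *s)
{
    const char *p;

    assert(s != NULL && *s != '\0');
    p = s + strlen(s) - 1;
    while (p > s && isContinuationByte((unsigned char) *p)) {
        p--;
    }
    return p;
}
```

For "\345\220\240" the returned pointer is the start of the string (the
lead byte 0345), so a whitespace test applied there never sees the
trailing 0240 byte on its own.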