#1459 Passing list w/UTF-8 from C can fail

Adrian Robert - 2001-03-28

test program and makefile

testTclBug.tgz

Don Porter - 2001-03-28

assigned_to: nobody --> dgp

labels: 105681 --> 104239

status: open --> closed-invalid

Don Porter - 2001-03-28

Logged In: YES
user_id=80530

TclNeedSpace() is not UTF-8 aware. That's why routines
that call it, like Tcl_AppendElement(), are deprecated.
(See the documentation.)

Rewrite your command procedure like so:
    Tcl_Obj *resultPtr;
    ...
    Tcl_ResetResult(interp);
    resultPtr = Tcl_GetObjResult(interp);
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s1, -1));
    ...
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s4, -1));
    return TCL_OK;

Adrian Robert - 2001-03-29

status: closed-invalid --> open-invalid

Adrian Robert - 2001-03-29

Logged In: YES
user_id=146959

Thanks very much for the response and proposed solution.
However, the documentation in the man page unfortunately
says nothing about this issue. It only says that it is
best to use the object versions of the result-handling
functions because it is "significantly more efficient".
This is hardly an incentive to go and learn a framework that
is significantly more complex at first sight when all one
wants to do is pass a string and everything has been
running fast enough as it is. Since string handling is
said to be fully Unicode-based in Tcl/Tk 8.1 and above,
a developer's default assumption is that "string" means
"internationalized, UTF-8, or what have you string", and
that Tcl_AppendElement therefore does not present a problem.

The real solution, it seems to me, is to repair the deficiency
in TclNeedSpace(), but there may be other constraints,
performance among them, that argue against this. If this
repair is not made, the documentation for Tcl_AppendElement
and "routines that call it" (how exactly is the typical
Tcl/Tk end-developer supposed to know which those are?)
should be updated to state that they should not
be used for anything but ASCII. Maybe there is some other
documentation that says something about these issues, but
it should be in the man page as well.

Adrian Robert - 2001-03-29

Logged In: YES
user_id=146959

Also, could you please post a pointer to the documentation
you are referring to? It would help clear up other
questions like whether Tcl_Merge is affected...

For example, the docs at
http://dev.scriptics.com/doc/howto/i18n.html do not so much
as hint at the problem. They merely say that all the Tcl C
APIs expect UTF-8 strings, and that everything should work
perfectly if they get them...

Adrian Robert - 2001-03-29

labels: 104239 --> 105681

Don Porter - 2001-03-29

labels: 105681 --> 104239

Don Porter - 2001-03-29

tclUtil.c.patch

Don Porter - 2001-03-29

Logged In: YES
user_id=80530

I was talking about the man page for Tcl_AppendElement():

http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm

Now, reading the I18N HOWTO, it looks like I was reading
"deprecated" too strongly. Tcl_DStringAppendElement() and
Tcl_DStringStartSublist() also rely on TclNeedSpace() and
they have not been deprecated, so TclNeedSpace() needs to
be fixed after all. This bug is re-opened.

Looking at TclNeedSpace() explains the mysterious platform
dependence. The buggy symptoms you report will be present
on those platforms/locales for which isspace(0240) returns
true.
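
The platform dependence described above can be probed with a small standalone check. This is only a sketch using standard C library calls; the function name nbsp_is_space_in_locale is hypothetical, and which locale names are installed varies from system to system.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: reports how isspace() classifies byte 0240
 * under the named LC_CTYPE locale.
 * Returns 1 if it is whitespace, 0 if not, -1 if the locale cannot
 * be selected. The caller's LC_CTYPE setting is restored on return.
 */
int nbsp_is_space_in_locale(const char *locale_name)
{
    const char *current;
    char *saved;
    int result;

    assert(locale_name != NULL);

    /* Copy the current locale name; later setlocale() calls may
     * overwrite the buffer the returned pointer refers to. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL) {
        return -1;
    }
    saved = malloc(strlen(current) + 1);
    if (saved == NULL) {
        return -1;
    }
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        result = -1;                      /* locale not available */
    } else {
        result = isspace(0240) ? 1 : 0;   /* 0240 == 0xA0 */
    }

    setlocale(LC_CTYPE, saved);
    free(saved);
    return result;
}
```

In the "C" locale the result is 0. For a name such as "en_US.ISO-8859-1" the result may be 1, 0, or -1 depending on the platform, which is the kind of variation this comment attributes the symptoms to.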

I've attached a patch that I think will correct the problem.
It's possible that it has other undesirable side-effects, so
I've assigned this report to one of the maintainers of
generic/tclUtil.c for review.

Meanwhile you can use the workaround I posted in the first
comment.

Tcl_Merge() is safe for UTF-8 strings.

Don Porter - 2001-03-29

assigned_to: dgp --> msofer

miguel sofer - 2001-03-30

Logged In: YES
user_id=148712

This bug is related to bugs #408568 and #227512.
See TIP #20 at
http://www.cs.man.ac.uk/fellowsd-bin/TIP/

Adrian Robert - 2001-04-02

labels: 104239 --> 105681

Adrian Robert - 2001-04-02

Logged In: YES
user_id=146959

Yes, OK, the suggestion in TIP #20 just mentioned, of adding
a locale-independent isspace() to Tcl and using that, would
prevent the problem I had, which arises because 0240 is
defined as a "no-break" space in a number of important
character encodings, such as ISO-8859-1. This leads a great
many locales, including en_US, to define 0240 as being in
the whitespace category. Since the UTF-8 encodings of many
characters contain the byte 0240, this can lead to
problems...
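
To make the point about 0240 concrete, here is a minimal sketch of a UTF-8 encoder together with a check for that byte. Both function names are hypothetical and nothing here is Tcl API; it only illustrates that ordinary characters such as U+00E0 encode to byte sequences containing 0240.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: encode code point cp (at most 0x10FFFF) as
 * UTF-8 into out[0..3]. Returns the number of bytes written.
 */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Returns 1 if the UTF-8 encoding of cp contains the byte 0240. */
int utf8_contains_0240(unsigned long cp)
{
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    size_t i;

    for (i = 0; i < n; i++) {
        if (buf[i] == 0240) {
            return 1;
        }
    }
    return 0;
}
```

For example, U+00E0 (a with grave) encodes as the bytes 0303 0240, so a byte-wise isspace() test in a locale that treats 0240 as whitespace will misclassify its trailing byte.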

Jeffrey Hobbs - 2001-05-03

status: open-invalid --> closed-invalid

Jeffrey Hobbs - 2001-05-03

Logged In: YES
user_id=72656

The basic answer at this point is that if you want space
chars to be thought of as space chars in Tcl, you should
restrict yourself to the 7-bit ASCII set, of which \240
isn't part. It works on some systems, where the locale's
isspace('\240') returns 1, but that's not reliable.

Adrian Robert - 2001-09-17

Logged In: YES
user_id=146959

This is NOT a solution. If you don't want to change any
code, you should at least clarify the documentation so that
people in the future don't waste their time. The
documentation should state at the very least that
List-related methods should NOT be used with UTF-8 strings
for communications between C and Tcl. Please see the
comments submitted earlier for this bug for additional
clarification. Thank you.

Jeffrey Hobbs - 2001-09-18

Logged In: YES
user_id=72656

Ah, but there is a fatal flaw in your argument - you
are *not* passing UTF-8 strings - you are passing
incorrectly formed strings through Tcl. If you converted
these to UTF-8 first (with Tcl_ExternalToUtf), this would
not have happened. That isn't to say this doesn't still
need fixing - but it is one of those areas in the core
where the distinction between using UTF-8 and raw data
became important.
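
For readers unsure what "converting to UTF-8 first" means at the byte level, here is a minimal standalone sketch for the single case of ISO-8859-1 input. The function latin1_to_utf8 is hypothetical and is not a substitute for Tcl_ExternalToUtf, which handles arbitrary encodings; this only shows how one raw byte such as 0240 turns into a two-byte UTF-8 sequence.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert len bytes of ISO-8859-1 text in src
 * to UTF-8 in dst. dst must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the final NUL.
 */
size_t latin1_to_utf8(const unsigned char *src, size_t len,
                      unsigned char *dst)
{
    size_t i;
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c < 0x80) {
            /* 7-bit ASCII is unchanged in UTF-8. */
            dst[n++] = c;
        } else {
            /* U+0080..U+00FF become a lead byte plus one trail byte. */
            dst[n++] = (unsigned char)(0xC0 | (c >> 6));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    dst[n] = '\0';
    return n;
}
```

A lone ISO-8859-1 byte 0240 (no-break space) becomes the pair 0302 0240. Note that the converted form still contains the byte 0240, so conversion by itself does not protect code that inspects UTF-8 one byte at a time.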

Don Porter - 2001-09-18

Logged In: YES
user_id=80530

Sorry if I'm being dense, but what is it about the
strings in Adrian's example that makes them invalid
UTF-8 strings? Is it the terminating null bytes?
How would Tcl_ExternalToUtf be added to the
reported example code to solve the problem?

miguel sofer - 2001-09-18

assigned_to: msofer --> hobbs

Donal K. Fellows - 2001-09-18

Logged In: YES
user_id=79902

Jeff just happens to be wrong. :^)

The example code contains valid UTF-8 strings. The problem
is that TclNeedSpace doesn't know anything about UTF-8 and
therefore anything depending on it (Tcl_AppendElement,
Tcl_DStringAppendElement and Tcl_DStringStartSublist according
to a search with grep, plus goodness knows how much in extensions
as the code is in the stub table) is *not* UTF-8 safe.

Unfortunately, none of those three public functions (two of
which are not deprecated at all) warns in its documentation
that it is unsafe to pass UTF-8 strings to it. :^(

The problems in TclNeedSpace are really the 'end--', which is
fundamentally wrong on UTF-8 strings, and the way it detects
what character it is looking at, which needs to be much more
careful when looking at bytes outside \000-\177. Plus
isspace is not usually Unicode-aware...
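
The two problems named here can be illustrated with a standalone sketch: step back a whole UTF-8 character rather than a single byte, and classify only 7-bit bytes as whitespace. This is not the actual TclNeedSpace code or the attached patch; utf8_prev and ends_with_ascii_space are hypothetical names, and the input is assumed to be well-formed UTF-8. Tcl provides Tcl_UtfPrev for this kind of backward step; the helper below is a stand-in so the example compiles without Tcl.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return a pointer to the first byte of the
 * UTF-8 character that ends just before p, never moving before
 * start. Trail bytes have the bit pattern 10xxxxxx and are skipped.
 */
const char *utf8_prev(const char *start, const char *p)
{
    assert(start != NULL && p != NULL && p >= start);
    if (p == start) {
        return start;
    }
    do {
        p--;
    } while (p > start && (((unsigned char)*p) & 0xC0) == 0x80);
    return p;
}

/*
 * Returns 1 if the last character of buf[0..len) is one of the
 * 7-bit whitespace characters (space, \t, \n, \v, \f, \r), else 0.
 * No locale-dependent isspace() call is involved, so a trail byte
 * such as 0240 can never be mistaken for whitespace.
 */
int ends_with_ascii_space(const char *buf, size_t len)
{
    const char *last;
    unsigned char c;

    if (len == 0) {
        return 0;
    }
    last = utf8_prev(buf, buf + len);
    c = (unsigned char)*last;
    return (c == ' ' || (c >= '\t' && c <= '\r')) ? 1 : 0;
}
```

With the three-byte buffer "x" followed by 0303 0240, the backward step lands on the lead byte 0303 rather than on 0240, and the whitespace test returns 0 regardless of the current locale.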
