From: SourceForge.net <no...@so...> - 2003-08-25 22:08:01
Bugs item #411825, was opened at 2001-03-28 01:36
Message generated for change (Comment added) made by dgp
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=411825&group_id=10894

Category: 10. Objects
Group: = 8.3.1
Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Adrian Robert (arobert3434)
Assigned to: Donal K. Fellows (dkf)
Summary: Passing list w/UTF-8 from C can fail

Initial Comment:
On certain installations of Tcl/Tk 8.3.1, the passing of UTF-8 character-triplets ending in octal 240 (decimal 160, hex A0) interferes with list delimitation when Tcl_AppendElement is used to return a result from a C function. In particular, if a UTF-8 string ending in octal 240 is appended to the result, and then another UTF-8 string is appended afterwards, the octal 240 seems to be interpreted as a "forward delete" character of some kind, with the result that the separation between the two list elements is erased and they are interpreted as one.

The following C function, when called from Tcl, illustrates the problem.

    int sendCharList(ClientData clientData, Tcl_Interp *interp,
                     int argc, char **argv)
    {
        char s1[5], s2[5], s3[5], s4[5];

        strcpy(s1, "\345\220\240");
        strcpy(s2, "\345\214\240");
        strcpy(s3, "\351\235\240");
        strcpy(s4, "\347\264\240");
        Tcl_ResetResult(interp);
        Tcl_AppendElement(interp, s1);
        Tcl_AppendElement(interp, s2);
        Tcl_AppendElement(interp, s3);
        Tcl_AppendElement(interp, s4);
        return TCL_OK;
    }

The Tcl calls:

    set s6 [sendCharList]
    puts "[llength $s6] , [string length $s6]"

should output "4 , 7" (4 list elements, each a single UTF-8 composite character plus 3 delimiters). On some systems it does. On others, however, the output is "1 , 4", resulting from deletion of the list delimiters somewhere during passage from C to Tcl.

A complete test program involving the above (plus some additional tests and using wish not tclsh) may be accessed at: ftp://zakros.ucsd.edu/arobert/Temp/testTclBug.tgz (it is also attached).
A full application that exposes the bug (and led to its discovery) may be found at: http://freshmeat.net/projects/hanzim

Unfortunately, I have not been able to isolate why some installations exhibit the bug and some don't. A default SUSE 7.0 Linux installation of 8.3.1 had the problem, while a default Slackware 7.1 installation of the same Tcl/Tk version did not. Maybe it is a compilation flag difference... ? I'm also not sure whether it persists in 8.3.2 or 8.4.

----------------------------------------------------------------------

>Comment By: Don Porter (dgp)
Date: 2003-08-25 18:06

Message:
Logged In: YES
user_id=80530

Provide the C code that calls Tcl_AppendElement() and that gives results that are incorrect in either Tcl 8.4.4 or the HEAD.

----------------------------------------------------------------------

Comment By: Dossy Shiobara (dossy)
Date: 2003-08-25 17:42

Message:
Logged In: YES
user_id=21885

I'd really hate to pick at an old scab (this bug was closed back in 09/2001) but exactly what was "fixed" by dkf's commit? Against Tcl 8.4.4, using Tcl_AppendElement() which I know is deprecated, the problem is still occurring. I guess it has to do with this behavior:

    $ string is space [encoding convertfrom utf-8 \302\240]
    1

What's annoying is if you do:

    > set a foo\302\240
    fooà
    > set a [encoding convertfrom utf-8 foo\302\240]
    foo
    > lappend a bar
    foo bar
    > llength $a
    2
    > string bytelength $a
    9

That does the right thing. But if you Tcl_AppendElement(), you'll get "foo\302\240bar", which is bad.

-- Dossy

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2001-09-19 04:53

Message:
Logged In: YES
user_id=79902

Test and fix committed (SF seems to be working at the mo...)

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2001-09-18 17:23

Message:
Logged In: YES
user_id=80530

Assigning to dkf, since he can't log in and assign it himself.
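
Dossy's 2003-08-25 comment above guesses that the 8.4.4 behavior comes from \302\240 being a space character in its own right. A rough way to see how that case differs from the original report is to decode the final UTF-8 character of each string: "\345\220\240" ends in U+5420, while "foo\302\240" ends in U+00A0 (no-break space). The sketch below is plain C with no Tcl calls; last_code_point is a hypothetical helper written for this illustration, not a Tcl function, and it makes no claim about what TclNeedSpace() actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Tcl): return the code point of the
 * last UTF-8 character in s, or -1 if s is empty or its tail is not a
 * well-formed sequence. Overlong forms and surrogates are not rejected;
 * this is a sketch, not a validator.
 */
static long last_code_point(const char *s)
{
    const unsigned char *p;
    size_t n, start, len, i;
    long cp;

    assert(s != NULL);
    p = (const unsigned char *) s;
    n = strlen(s);
    if (n == 0) {
        return -1;
    }

    /* Step back over continuation bytes (10xxxxxx), at most three. */
    start = n - 1;
    while (start > 0 && (p[start] & 0xC0) == 0x80 && n - start < 4) {
        start--;
    }
    len = n - start;

    /* The lead byte must announce exactly the length we found. */
    if (p[start] < 0x80) {
        return (len == 1) ? (long) p[start] : -1;
    } else if ((p[start] & 0xE0) == 0xC0 && len == 2) {
        cp = p[start] & 0x1F;
    } else if ((p[start] & 0xF0) == 0xE0 && len == 3) {
        cp = p[start] & 0x0F;
    } else if ((p[start] & 0xF8) == 0xF0 && len == 4) {
        cp = p[start] & 0x07;
    } else {
        return -1;
    }

    /* Fold in six payload bits from each continuation byte. */
    for (i = start + 1; i < n; i++) {
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}
```

Looking only at the last byte, both strings end in 0240; looking at the last character, only Dossy's string ends in something Unicode classifies as a space.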
----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2001-09-18 13:10

Message:
Logged In: YES
user_id=80530

The bug is in TclNeedSpace(), in generic/tclUtil.c, part of the Objects Category. Is there a reason not to accept the patch already attached to this report? Will it break TclNeedSpace for its existing callers?

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2001-09-18 12:47

Message:
Logged In: YES
user_id=80530

Here's a sequence of Tcl commands broken by this bug.

    % interp create \u5420
    ?
    % interp create [list \u5420 foo]
    ? foo
    % interp alias {} fooset [list \u5420 foo] set
    fooset
    % interp target {} fooset
    ?foo

Re-opening the bug.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2001-09-18 12:17

Message:
Logged In: YES
user_id=80530

Self-explanatory and revealing.

I think you're missing the point, Jeff. Adrian's [sendCharList] command is trying to return the result [list \u5420 \u5320 \u9760 \u7d20] but it's failing because Tcl_AppendElement is mangling his UTF-8 characters that he has encoded "by hand".

If I can manage it, I'll post a Tcl script that demos the bug. I think such a script is possible. Tcl_AppendElement calls haven't been entirely banished from the Tcl source code.

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-09-18 11:03

Message:
Logged In: YES
user_id=72656

This should be self-explanatory:

    (hobbs) 50 % set var \345\220\240
    å
    (hobbs) 51 % string length $var
    3
    (hobbs) 52 % string bytelength $var
    6

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2001-09-18 10:34

Message:
Logged In: YES
user_id=79902

Jeff just happens to be wrong. :^) The example code contains valid UTF-8 strings.
The problem is that TclNeedSpace doesn't know anything about UTF-8 and therefore anything depending on it (Tcl_AppendElement, Tcl_DStringAppendElement and Tcl_DStringStartSublist, says a search with grep, plus goodness knows how much in extensions as the code is in the stub table) is *not* UTF-8 safe. Unfortunately, none of those three public functions (two of which are not deprecated at all) warns in its documentation that it is unsafe to pass UTF-8 strings to it. :^(

The problems in TclNeedSpace are really the 'end--' which is fundamentally wrong on UTF-8 strings, and the way it detects what character it is looking at which needs to be much more careful when looking at bytes outside \000-\177. Plus isspace is not usually Unicode-aware...

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2001-09-18 01:46

Message:
Logged In: YES
user_id=80530

Sorry if I'm being dense, but what is it about the strings in Adrian's example that makes them invalid UTF-8 strings? Is it the terminating null bytes? How would Tcl_ExternalToUtf be added to the reported example code to solve the problem?

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-09-17 20:06

Message:
Logged In: YES
user_id=72656

Ah, but you are making a fatal flaw in your argument - you are *not* passing UTF-8 strings - you are passing incorrectly formed strings through Tcl. If you converted these to UTF-8 first (with Tcl_ExternalToUtf), this would not have happened. That isn't to say this still doesn't need fixing - but it is one of those areas in the core where the distinction between using utf-8 and raw data became important.

----------------------------------------------------------------------

Comment By: Adrian Robert (arobert3434)
Date: 2001-09-17 19:58

Message:
Logged In: YES
user_id=146959

This is NOT a solution.
If you don't want to change any code, you should at least clarify the documentation so that people in the future don't waste their time. The documentation should state at the very least that List-related methods should NOT be used with UTF-8 strings for communications between C and Tcl. Please see the comments submitted earlier for this bug for additional clarification. Thank you.

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2001-05-03 17:08

Message:
Logged In: YES
user_id=72656

The basic answer at this point is that if you want space chars to be thought of as space chars in Tcl, you should restrict yourself to the ascii 7-bit set, of which \240 isn't part. It works on some systems, where the locale isspace('\240') is 1, but that's not reliable.

----------------------------------------------------------------------

Comment By: Adrian Robert (arobert3434)
Date: 2001-04-01 21:27

Message:
Logged In: YES
user_id=146959

Yes, OK, the suggestion in Tip #20 just mentioned of adding a locale-independent isspace() to Tcl and using that would prevent the problem I had, which arises because 0240 is defined as a "no-break" space in a number of important character encodings, such as ISO-8859-1. This leads a great many locales, including en_US, to define 0240 as being in the whitespace category.
Since many UTF characters have 0240 inside them, this can lead to problems...

----------------------------------------------------------------------

Comment By: miguel sofer (msofer)
Date: 2001-03-29 19:07

Message:
Logged In: YES
user_id=148712

This bug is related to bugs #408568 and #227512. See TIP #20 at http://www.cs.man.ac.uk/fellowsd-bin/TIP/

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2001-03-29 03:55

Message:
Logged In: YES
user_id=80530

I was talking about the man page for Tcl_AppendElement(): http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm

Now, reading the I18N HOWTO, it looks like I was reading "deprecated" too strongly. Tcl_DStringAppendElement() and Tcl_DStringStartSublist() also rely on TclNeedSpace() and they have not been deprecated, so TclNeedSpace() needs to be fixed after all. This bug is re-opened.

Looking at TclNeedSpace() explains the mysterious platform dependence. The buggy symptoms you report will be present on those platforms/locales for which isspace(0240) returns true.

I've attached a patch that I think will correct the problem. It's possible that it has other undesirable side-effects, so I've assigned this report to one of the maintainers of generic/tclUtil.c for review.

Meanwhile you can use the workaround I posted in the first comment. Tcl_Merge() is safe for UTF-8 strings.

----------------------------------------------------------------------

Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 01:50

Message:
Logged In: YES
user_id=146959

Also, could you please post a pointer to the documentation you are referring to? It would help clear up other questions like whether Tcl_Merge is affected... For example, the docs at http://dev.scriptics.com/doc/howto/i18n.html do not so much as hint at the problem. They merely say that all the Tcl C APIs expect UTF-8 strings, and that everything should work perfectly if they get them...
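
The platform dependence dgp describes in the 2001-03-29 03:55 comment above can be probed outside Tcl. The following is a minimal sketch in standard C, assuming nothing about TclNeedSpace() itself: probe_0240 asks the C library whether byte 0240 counts as whitespace under a given locale, and ascii_is_space is one possible locale-independent test of the kind TIP #20 is said to propose. All three function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: 1 if the C library classifies byte 0240 (0xA0)
 * as whitespace under the named locale, 0 if it does not, -1 if the
 * locale is not available. Note that it changes the process locale.
 */
static int probe_0240(const char *locale_name)
{
    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        return -1;
    }
    return isspace(0xA0) != 0;
}

/*
 * Hypothetical helper: whitespace test limited to 7-bit ASCII, so the
 * answer is the same in every locale.
 */
static int ascii_is_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n'
        || c == '\v' || c == '\f' || c == '\r';
}

/* Hypothetical helper: 1 if the last byte of s is ASCII whitespace. */
static int ends_in_ascii_space(const char *s)
{
    size_t n;

    assert(s != NULL);
    n = strlen(s);
    return n > 0 && ascii_is_space((unsigned char) s[n - 1]);
}
```

In the "C" locale the standard limits isspace() to the six ASCII whitespace characters, so probe_0240("C") is 0. Passing the name of an ISO-8859-1 locale instead (such names are platform-specific) is where a result of 1 would be expected on an affected system, while ends_in_ascii_space gives the same answer for "\345\220\240" everywhere.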
----------------------------------------------------------------------

Comment By: Adrian Robert (arobert3434)
Date: 2001-03-29 00:11

Message:
Logged In: YES
user_id=146959

Thanks very much for a response and proposed solution, however the documentation in the man page unfortunately says nothing about this issue. It only says that it is best to use the object versions of the result-handling functions because it is "significantly more efficient". This is hardly incentive to go and learn a framework that is significantly more complex at first sight when all one wants to do is pass a string and everything has been running fast enough as it is.

Since string-handling is said to be fully unicode-based in Tcl/Tk 8.1 and above, the default assumption on a developer's part is to assume that "string" means "internationalized, UTF-8, or what have you string", and that Tcl_AppendElement therefore does not present a problem.

The real solution it seems to me is to repair the deficiency in TclNeedSpace(), but there may be other constraints, performance among them, that argue against this. If this repair is not made, the documentation for Tcl_AppendElement, and "routines that call it" (how exactly is the typical Tcl/Tk end-developer supposed to know which those are) should be updated to reflect the fact that they should not be used for anything but ASCII.
Maybe there is some other documentation that says something about these issues, but it should be in the man page as well.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2001-03-28 17:31

Message:
Logged In: YES
user_id=80530

TclNeedSpace() is not UTF-8 aware. That's why routines that call it, like Tcl_AppendElement() are deprecated. (See the documentation.)

Rewrite your command procedure like so:

    Tcl_Obj *resultPtr;
    ...
    Tcl_ResetResult(interp);
    resultPtr = Tcl_GetObjResult(interp);
    Tcl_ListObjAppendElement(interp, resultPtr, Tcl_NewStringObj(s1, -1));
    ...
    Tcl_ListObjAppendElement(interp, resultPtr, Tcl_NewStringObj(s4, -1));
    return TCL_OK;

----------------------------------------------------------------------