From: <no...@so...> - 2001-04-02 01:28:42
Bugs item #411825, was updated on 2001-03-27 22:36
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=110894&aid=411825&group_id=10894

Category: UTF-8 Strings
Group: 8.3.1
Status: Open
Priority: 5
Submitted By: Adrian Robert (arobert3434)
Assigned to: miguel sofer (msofer)
Summary: Passing list w/UTF-8 from C can fail

Initial Comment:
On certain installations of Tcl/Tk 8.3.1, the passing of UTF-8 character-triplets ending in octal 240 (decimal 160, hex A0) interferes with list delimitation when Tcl_AppendElement is used to return a result from a C function. In particular, if a UTF-8 string ending in octal 240 is appended to the result, and then another UTF-8 string is appended afterwards, the octal 240 seems to be interpreted as a "forward delete" character of some kind, with the result that the separation between the two list elements is erased and they are interpreted as one.

The following C function, when called from Tcl, illustrates the problem.

    int sendCharList(ClientData clientData, Tcl_Interp *interp,
                     int argc, char **argv)
    {
        char s1[5], s2[5], s3[5], s4[5];

        strcpy(s1, "\345\220\240");
        strcpy(s2, "\345\214\240");
        strcpy(s3, "\351\235\240");
        strcpy(s4, "\347\264\240");

        Tcl_ResetResult(interp);
        Tcl_AppendElement(interp, s1);
        Tcl_AppendElement(interp, s2);
        Tcl_AppendElement(interp, s3);
        Tcl_AppendElement(interp, s4);

        return TCL_OK;
    }

The Tcl calls:

    set s6 [sendCharList]
    puts "[llength $s6] , [string length $s6]"

should output "4 , 7" (4 list elements, each a single UTF-8 composite character plus 3 delimiters). On some systems it does. On others, however, the output is "1 , 4", resulting from deletion of the list delimiters somewhere during passage from C to Tcl.

A complete test program involving the above (plus some additional tests and using wish not tclsh) may be accessed at:
ftp://zakros.ucsd.edu/arobert/Temp/testTclBug.tgz
(it is also attached).
A full application that exposes the bug (and led to its discovery) may be found at:
http://freshmeat.net/projects/hanzim

Unfortunately, I have not been able to isolate why some installations exhibit the bug and some don't. A default SUSE 7.0 Linux installation of 8.3.1 had the problem, while a default Slackware 7.1 installation of the same Tcl/Tk version did not. Maybe it is a compilation flag difference... ? I'm also not sure whether it persists in 8.3.2 or 8.4.

----------------------------------------------------------------------

>Comment By: Adrian Robert (arobert3434)
Date: 2001-04-01 18:27

Message:
Logged In: YES
user_id=146959

Yes, OK, the suggestion in TIP #20 just mentioned of adding a locale-independent isspace() to Tcl and using that would prevent the problem I had, which arises because 0240 is defined as a "no-break" space in a number of important character encodings, such as ISO-8859-1. This leads a great many locales, including en_US, to define 0240 as being in the whitespace category. Since many UTF-8 characters have 0240 inside them, this can lead to problems...

----------------------------------------------------------------------

Comment By: miguel sofer (msofer)
Date: 2001-03-29 16:07

Message:
Logged In: YES
user_id=148712

This bug is related to bugs #408568 and #227512.
See TIP #20 at http://www.cs.man.ac.uk/fellowsd-bin/TIP/

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2001-03-29 00:55

Message:
Logged In: YES
user_id=80530

I was talking about the man page for Tcl_AppendElement():
http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm

Now, reading the I18N HOWTO, it looks like I was reading "deprecated" too strongly. Tcl_DStringAppendElement() and Tcl_DStringStartSublist() also rely on TclNeedSpace() and they have not been deprecated, so TclNeedSpace() needs to be fixed after all. This bug is re-opened.

Looking at TclNeedSpace() explains the mysterious platform dependence. The buggy symptoms you report will be present on those platforms/locales for which isspace(0240) returns true.

I've attached a patch that I think will correct the problem. It's possible that it has other undesirable side-effects, so I've assigned this report to one of the maintainers of generic/tclUtil.c for review.

Meanwhile you can use the workaround I posted in the first comment. Tcl_Merge() is safe for UTF-8 strings.

----------------------------------------------------------------------

Comment By: Adrian Robert (arobert3434)
Date: 2001-03-28 22:50

Message:
Logged In: YES
user_id=146959

Also, could you please post a pointer to the documentation you are referring to? It would help clear up other questions like whether Tcl_Merge is affected... For example, the docs at http://dev.scriptics.com/doc/howto/i18n.html do not so much as hint at the problem. They merely say that all the Tcl C APIs expect UTF-8 strings, and that everything should work perfectly if they get them...

----------------------------------------------------------------------

Comment By: Adrian Robert (arobert3434)
Date: 2001-03-28 21:34

Message:
Logged In: YES
user_id=146959

Also, could you please post a pointer to the documentation you are referring to? It would help clear up other questions like whether Tcl_Merge is affected...
For example, the docs at http://dev.scriptics.com/doc/howto/i18n.html do not so much as hint at the problem. They merely say that all the Tcl C APIs expect UTF-8 strings, and that everything should work perfectly if they get them...

----------------------------------------------------------------------

Comment By: Adrian Robert (arobert3434)
Date: 2001-03-28 21:11

Message:
Logged In: YES
user_id=146959

Thanks very much for a response and proposed solution; however, the documentation in the man page unfortunately says nothing about this issue. It only says that it is best to use the object versions of the result-handling functions because doing so is "significantly more efficient". This is hardly incentive to go and learn a framework that is significantly more complex at first sight when all one wants to do is pass a string and everything has been running fast enough as it is.

Since string-handling is said to be fully Unicode-based in Tcl/Tk 8.1 and above, the default assumption on a developer's part is that "string" means "internationalized, UTF-8, or what have you string", and that Tcl_AppendElement therefore does not present a problem.

The real solution, it seems to me, is to repair the deficiency in TclNeedSpace(), but there may be other constraints, performance among them, that argue against this. If this repair is not made, the documentation for Tcl_AppendElement and "routines that call it" (how exactly is the typical Tcl/Tk end-developer supposed to know which those are?) should be updated to reflect the fact that they should not be used for anything but ASCII. Maybe there is some other documentation that says something about these issues, but it should be in the man page as well.
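
The TclNeedSpace() deficiency discussed in this thread can be illustrated with a small standalone sketch. This is not Tcl's actual code: is_ascii_space() and needs_separator() are hypothetical names, and the check is reduced to the single question "is the last byte of the buffer whitespace?". It only shows why a locale-dependent isspace() can misclassify the trailing 0240 byte of a UTF-8 sequence, and how an ASCII-only test avoids that.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of Tcl): true only for the six ASCII
 * whitespace bytes: space, \t, \n, \v, \f, \r. Independent of locale. */
static int is_ascii_space(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Simplified stand-in for a "does the list need a separator before the
 * next element?" check. Returns 1 if a space must be added, else 0.
 * use_locale != 0 selects the locale-dependent isspace(); use_locale == 0
 * selects the ASCII-only test above. */
static int needs_separator(const char *buf, size_t len, int use_locale)
{
    unsigned char last;

    assert(buf != NULL);
    if (len == 0) {
        return 0;               /* empty list: nothing to separate from */
    }
    last = (unsigned char) buf[len - 1];
    if (use_locale) {
        /* Result for bytes >= 0x80 depends on the current LC_CTYPE. */
        return !isspace(last);
    }
    return !is_ascii_space(last);
}
```

With use_locale set to 0, needs_separator("\345\220\240", 3, 0) always asks for a separator, because 0240 is not ASCII whitespace. With use_locale set to 1, the answer for the same bytes depends on the active locale, which would be consistent with the platform dependence reported above.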
----------------------------------------------------------------------

Comment By: Adrian Robert (arobert3434)
Date: 2001-03-28 21:09

Message:
Logged In: YES
user_id=146959

Thanks very much for a response and proposed solution; however, the documentation in the man page unfortunately says nothing about this issue. It only says that it is best to use the object versions of the result-handling functions because doing so is "significantly more efficient".
This is hardly incentive to go and learn a framework that is significantly more complex at first sight when all one wants to do is pass a string and everything has been running fast enough as it is.

Since string-handling is said to be fully Unicode-based in Tcl/Tk 8.1 and above, the default assumption on a developer's part is that "string" means "internationalized, UTF-8, or what have you string", and that Tcl_AppendElement therefore does not present a problem.

The real solution, it seems to me, is to repair the deficiency in TclNeedSpace(), but there may be other constraints, performance among them, that argue against this. If this repair is not made, the documentation for Tcl_AppendElement and "routines that call it" (how exactly is the typical Tcl/Tk end-developer supposed to know which those are?) should be updated to reflect the fact that they should not be used for anything but ASCII. Maybe there is some other documentation that says something about these issues, but it should be in the man page as well.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2001-03-28 14:31

Message:
Logged In: YES
user_id=80530

TclNeedSpace() is not UTF-8 aware. That's why routines that call it, like Tcl_AppendElement(), are deprecated. (See the documentation.)

Rewrite your command procedure like so:

    Tcl_Obj *resultPtr;
    ...
    Tcl_ResetResult(interp);
    resultPtr = Tcl_GetObjResult(interp);
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s1, -1));
    ...
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s4, -1));
    return TCL_OK;

----------------------------------------------------------------------