From: SourceForge.net <no...@so...> - 2004-11-12 23:44:21
Bugs item #1004065, was opened at 2004-08-05 09:51
Message generated for change (Comment added) made by hobbs
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1004065&group_id=10894

Category: 43. UTF-8 Strings
Group: current: 8.4.7
Status: Open
>Resolution: Fixed
Priority: 5
Submitted By: Martin v. Löwis (loewis)
>Assigned to: Don Porter (dgp)
Summary: UTF-8 encoding crashes in UCS-4 mode

Initial Comment:
I built Tcl 8.4.7 by setting TCL_UTF_MAX to 6 in tcl.h
(changed from 3). I then ran the command

set x [encoding convertfrom utf-8 \xf0\x9d\x99\xaf]

Tcl crashes with the traceback

#0 0x080994e7 in TableFromUtfProc (clientData=0x810acb0, src=0x8115960 "ð\235\231¯ð\213\030@# Default system startup file for Tcl-based applications. Defines\n# \unknown\ procedure and auto-load facilities.\n#\n# RCS: @(#) $Id: init.tcl,v 1.55.2.3 2004/05/03 14:28:59 dgp Exp $\n#\n# Copy"..., srcLen=4, flags=0, statePtr=0x8114b04, dst=0x8116988 "% ", dstLen=4111, srcReadPtr=0xbffff068, dstWrotePtr=0xbffff074, dstCharsPtr=0xbfffefe0) at ../generic/tclEncoding.c:2353
#1 0x08097a22 in Tcl_UtfToExternal (interp=0x0, encoding=0x810adc0, src=0x8115960 "ð\235\231¯ð\213\030@# Default system startup file for Tcl-based applications. Defines\n# \unknown\ procedure and auto-load facilities.\n#\n# RCS: @(#) $Id: init.tcl,v 1.55.2.3 2004/05/03 14:28:59 dgp Exp $\n#\n# Copy"..., srcLen=4, flags=0, statePtr=0x8114b04, dst=0x8116988 "% ", dstLen=4111, srcReadPtr=0xbffff068, dstWrotePtr=0xbffff074, dstCharsPtr=0xbfffefe0) at ../generic/tclEncoding.c:1091
#2 0x080b2848 in WriteChars (chanPtr=0x8115940, src=0x811a59c "", srcLen=0) at ../generic/tclIO.c:3170
#3 0x080b24b7 in Tcl_WriteObj (chan=0x8115940, objPtr=0x81140e8) at ../generic/tclIO.c:2960
#4 0x08055128 in Tcl_Main (argc=1, argv=0xbffff2f4, appInitProc=0x80549e6 <Tcl_AppInit>) at ../generic/tclMain.c:407
#5 0x080549dc in main (argc=1, argv=0xbffff2f4) at ../unix/tclAppInit.c:90

This is all on a Debian system.
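
For reference, a minimal Python sketch (not Tcl, and not part of the original report) of what the four octets in the reported command decode to. The helper name decode_utf8_4byte is hypothetical and the bit layout is the standard four-byte UTF-8 form. It suggests the sequence is a single code point above U+FFFF, which is presumably why the TCL_UTF_MAX setting matters for this input.

```python
# Octets taken from the reported command: \xf0\x9d\x99\xaf
REPORTED_BYTES = bytes([0xF0, 0x9D, 0x99, 0xAF])


def decode_utf8_4byte(data: bytes) -> int:
    """Decode exactly one four-byte UTF-8 sequence into a code point.

    Hypothetical helper for illustration only; it does not model
    Tcl's own decoder.
    """
    if len(data) != 4 or (data[0] & 0xF8) != 0xF0:
        raise ValueError("not a four-byte UTF-8 sequence")
    if any((b & 0xC0) != 0x80 for b in data[1:]):
        raise ValueError("bad continuation byte")
    # 3 payload bits from the lead byte, 6 from each continuation byte.
    return (
        ((data[0] & 0x07) << 18)
        | ((data[1] & 0x3F) << 12)
        | ((data[2] & 0x3F) << 6)
        | (data[3] & 0x3F)
    )


code_point = decode_utf8_4byte(REPORTED_BYTES)
print(f"U+{code_point:04X}")   # prints U+1D66F
print(code_point > 0xFFFF)     # prints True: outside the BMP
```

The hand-rolled result agrees with Python's built-in decoder, i.e. REPORTED_BYTES.decode("utf-8") == chr(code_point).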

----------------------------------------------------------------------

>Comment By: Jeffrey Hobbs (hobbs)
Date: 2004-11-12 15:44

Message:
Logged In: YES
user_id=72656

I've applied the patch to avoid the crash for 8.4.8 and 8.5a2,
but a full evaluation of the TCL_UTF_MAX==6 case is still required.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-08-06 01:50

Message:
Logged In: YES
user_id=21627

For the case of TCL_UTF_MAX==6 (which I refer to in this
report), there is no need for surrogates: non-BMP characters can
be represented just using a four-byte integer.

For the two-byte Unicode case, you have two options:
- use surrogates, or
- explicitly don't support surrogates and non-BMP characters

In the latter alternative, you essentially would tell people
that they need a four-byte Unicode installation if they want
non-BMP characters.

Of course, there will be additional issues externally, e.g.
when trying to pass non-BMP characters to the GUI platform
in Tk, when finding appropriate fonts, etc.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2004-08-06 00:47

Message:
Logged In: YES
user_id=79902

The question is really: how should we support Unicode chars
outside the BMP? (I suspect the answer involves surrogates
internally, and goodness knows what externally.)

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2004-08-05 11:51

Message:
Logged In: YES
user_id=80530

It looks like the encoding routines simply do not
support the #define TCL_UTF_MAX 6 variant currently.

The attached patch stops the reported crash, but
a more thorough review is in order to really address
the issue.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2004-08-05 11:34

Message:
Logged In: YES
user_id=80530

Confirmed on the HEAD.
Note this bug only shows up in an interactive tclsh. The problem
happens when converting to the system encoding for writing
the result to stdout.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2004-08-05 11:25

Message:
Logged In: YES
user_id=80530

Sorry, I misspoke before. I got my "convertfrom" and
"convertto" mixed up. Your demo script is fine.

The following demo script should be equivalent:

set x [encoding convertfrom utf-8 \u00f0\u009d\u0099\u00af]

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2004-08-05 10:50

Message:
Logged In: YES
user_id=21627

I believe it does what I want to do: explicitly invoke
decoding of a UTF-8 encoded string. I hope that the string I
specify contains four characters, which is then interpreted
as an octet string of four octets when passed to "encoding
convertfrom". This, in turn, should generate a string with a
single character.

Using the \u form is not possible, since it only supports
characters with numeric values up to U+FFFF. However, the
character above is U+0001D66F, MATHEMATICAL SANS-SERIF BOLD
ITALIC SMALL Z. Other languages support a \U notation for
non-BMP characters. Apparently, Tcl doesn't (which is not
really a problem if you could get the character by decoding
it from UTF-8).

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2004-08-05 10:23

Message:
Logged In: YES
user_id=80530

A crash is definitely a bug; thanks for the report.

That said, your snippet of code is likely not doing what
you intend. Each \xHH substitution produces one Unicode
character, not one byte. As a general rule, \xHH substitution
should be avoided in Tcl scripts. Use \u instead.

----------------------------------------------------------------------
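
The characters-versus-octets point discussed in this thread, and the surrogate option for two-byte builds, can be sketched in Python under stated assumptions. Tcl is not involved here. The mapping from characters to octets is modelled with Latin-1, which is an assumption about how the four characters are reinterpreted, and the names chars_to_octets and to_surrogate_pair are hypothetical. The surrogate arithmetic is the standard UTF-16 rule.

```python
def chars_to_octets(text: str) -> bytes:
    """Reinterpret characters U+0000..U+00FF as single octets.

    Assumption: Latin-1 stands in for the character-to-octet step
    described in the thread; this is not Tcl's implementation.
    """
    return text.encode("latin-1")


def to_surrogate_pair(code_point: int) -> tuple:
    """Split a non-BMP code point into a UTF-16 (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# Four separate characters, as produced by \u00f0\u009d\u0099\u00af.
four_chars = "\u00f0\u009d\u0099\u00af"
octets = chars_to_octets(four_chars)   # four octets: F0 9D 99 AF
decoded = octets.decode("utf-8")       # one character after decoding

print(len(four_chars), len(decoded))   # prints 4 1
print(f"U+{ord(decoded):04X}")         # prints U+1D66F

high, low = to_surrogate_pair(ord(decoded))
print(f"{high:04X} {low:04X}")         # prints D835 DE6F
```

In a four-byte representation the decoded value can be stored directly as one integer. A two-byte representation would need the pair shown on the last line, or would have to reject the character.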