From: SourceForge.net <no...@so...> - 2008-03-06 04:10:48
Bugs item #1908443, was opened at 2008-03-05 19:01
Message generated for change (Comment added) made by jenglish
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=112997&aid=1908443&group_id=12997

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: 99. Other
Group: current: 8.5.1
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Joe English (jenglish)
Assigned to: Joe English (jenglish)
Summary: Composed characters in UTF-8 locale

Initial Comment:
Observed on Debian Sarge after installing UTF-8 locales:
ISO8859-1 characters entered with compose key sequences end up wrong.

Setup:

    xmodmap -e 'keysym Super_L = Multi_key Super_L'

(this makes the Windows key into a Compose key).

    export LC_ALL=en_US.UTF-8
    export XMODIFIERS=@im=local

Run wish; verify that [encoding system] is utf-8

Press e.g., <Compose> <c> <comma>. This should turn into ç
(c-cedilla, \UE7). Instead, it shows up as \UFFE7.

I think I've narrowed this down to sometime between 8.4.12 and 8.4.13.
Problem appears to be in Tcl, not Tk.

This looks like improper sign extension.

----------------------------------------------------------------------

>Comment By: Joe English (jenglish)
Date: 2008-03-05 20:10

Message:
Logged In: YES
user_id=68433
Originator: YES

Quoth TFM: "The XmbLookupString and XwcLookupString functions return
text in the encoding of the locale bound to the input method of the
specified input context."

It appears that Xlib is using a different set of heuristics to
determine the encoding of a locale than Tcl (and glibc) does.
Xlib apparently uses the table in /usr/lib/X11/locale/locale.alias,
while Tcl uses nl_langinfo.
----------------------------------------------------------------------

Comment By: Joe English (jenglish)
Date: 2008-03-05 19:48

Message:
Logged In: YES
user_id=68433
Originator: YES

This is one of the nasty surprises in C89: when casting a (signed)
char to an unsigned short, it gets widened to a signed short first,
then converted to an unsigned short. Sign extension happens in the
first step.

Changing the line from:

    ch = (Tcl_UniChar) *src;

to

    ch = (unsigned char) *src;

prevents sign extension and makes things behave as expected.
(You don't need to say "(Tcl_UniChar)(unsigned char)*src", since the
usual integral promotions apply. MSVC might complain though.)

This masks the problem, but does not fix it: the real problem is that
UtfToUtfProc is getting called in the first place. XmbLookupString()
is apparently returning ISO8859-1 text, but Tk believes this is in the
"system" encoding, which is utf-8.

IOW, Xlib's idea of "the system encoding" is different from Tcl's.
More research required.

----------------------------------------------------------------------

Comment By: Joe English (jenglish)
Date: 2008-03-05 19:40

Message:
Logged In: YES
user_id=68433
Originator: YES

Specifically, this part:

generic/tclEncoding.c r1.16.2.9 -> r1.16.2.10

@@ -2083,13 +2083,23 @@ UtfToUtfProc(clientData, src, srcLen, flags, statePtr, dst, dstLen,
              */
             *dst++ = 0;
             src += 2;
+        } else if (!Tcl_UtfCharComplete(src, srcEnd - src)) {
+            /* Always check before using Tcl_UtfToUniChar. Not doing
+             * can so cause it run beyond the endof the buffer! If we
+             * * happen such an incomplete char its byts are made to *
+             * represent themselves.
+             */
+
+            ch = (Tcl_UniChar) *src;
                  ^^^^^^^^^^^^^^^^^^^ here
+            src += 1;
+            dst += Tcl_UniCharToUtf(ch, dst);
         } else {
             src += Tcl_UtfToUniChar(src, &ch);
             dst += Tcl_UniCharToUtf(ch, dst);
         }
     }

----------------------------------------------------------------------

Comment By: Joe English (jenglish)
Date: 2008-03-05 19:18

Message:
Logged In: YES
user_id=68433
Originator: YES

`git bisect` narrows it down to this commit:

Author: andreas_kupries <andreas_kupries>
Date:   Wed Apr 5 00:05:53 2006 +0000

    * generic/tclIO.c (ReadChars): Added check and panic and
      commentary to a piece of code which relies on BUFFER_PADDING
      to create enough space at the beginning of each buffer forthe
      insertion of partial multi-byte data at the beginning of a
      buffer. To explain why this code is ok, and as precaution if
      someone twiddled the BUFFER_PADDING into uselessness.

    * generic/tclIO.c (ReadChars): [SF Tcl Bug 1462248]. Added code
      temporarily suppress the use of TCL_ENCODING_END set when eof
      was reached while the buffer we are converting is not truly
      the last buffer in the queue. together with the Utf bug below
      it was possible to completely bollox the buffer data
      structures, eventually crashing Tcl.

    * generic/tclEncoding.c (UtfToUtfProc): Fixed problem where the
      function accessed memory beyond the end of the input buffer.
      When TCL_ENCODING_END is set and the last bytes of the buffer
      start a multi-byte sequence. This bug contributed to [SF Tcl
      Bug 1462248].

----------------------------------------------------------------------