[Tcl-bugs] [ tktoolkit-Bugs-1908443 ] Composed characters in UTF-8 locale

Bugs item #1908443, was opened at 2008-03-05 19:01
Message generated for change (Comment added) made by jenglish
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=112997&aid=1908443&group_id=12997

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 99. Other
Group: current: 8.5.1
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Joe English (jenglish)
Assigned to: Joe English (jenglish)
Summary: Composed characters in UTF-8 locale

Initial Comment:
Observed on Debian Sarge after installing UTF-8 locales: ISO8859-1 characters entered with compose key sequences end up wrong.

Setup: xmodmap -e 'keysym Super_L = Multi_key Super_L'
(this makes the Windows key into a Compose key).
export LC_ALL=en_US.UTF-8
export XMODIFIERS=@im=local

Run wish; verify that [encoding system] is utf-8
Press e.g., <Compose> <c> <comma>.  
This should turn into ç (c-cedilla, \UE7).  Instead, it shows up as \UFFE7.

I think I've narrowed this down to sometime between 8.4.12 and 8.4.13.  Problem appears to be in Tcl, not Tk.

This looks like improper sign extension.


----------------------------------------------------------------------

>Comment By: Joe English (jenglish)
Date: 2008-03-05 20:10

Message:
Logged In: YES 
user_id=68433
Originator: YES

Quoth TFM: "The XmbLookupString and XwcLookupString functions return text
in the encoding of the locale bound to the input method of the specified
input context."

It appears that Xlib is using a different set of heuristics to determine
the encoding of a locale than Tcl (and glibc) does.  Xlib apparently uses
the table in /usr/lib/X11/locale/locale.alias, while Tcl uses nl_langinfo.

----------------------------------------------------------------------

Comment By: Joe English (jenglish)
Date: 2008-03-05 19:48

Message:
Logged In: YES 
user_id=68433
Originator: YES

This is one of the nasty surprises in C89: when casting a (signed) char to
an unsigned short, it gets widened to a signed short first, then converted
to an unsigned short.  Sign extension happens in the first step.

Changing the line from:

    ch = (Tcl_UniChar) *src;

to 

    ch = (unsigned char) *src;

prevents sign extension and makes things behave as expected.  (You don't
need to say "(Tcl_UniChar)(unsigned char)*src", since the usual integral
promotions apply.  MSVC might complain though.)

This masks the problem, but does not fix it: the real problem is that
UtfToUtfProc is getting called in the first place.

XmbLookupString() is apparently returning ISO8859-1 text, but Tk believes
this is in the "system" encoding, which is utf-8.  IOW, Xlib's idea of "the
system encoding" is different from Tcl's.

More research required.

----------------------------------------------------------------------

Comment By: Joe English (jenglish)
Date: 2008-03-05 19:40

Message:
Logged In: YES 
user_id=68433
Originator: YES

Specifically, this part:

generic/tclEncoding.c r1.16.2.9 -> r1.16.2.10

@@ -2083,13 +2083,23 @@ UtfToUtfProc(clientData, src, srcLen, flags, statePtr, dst, dstLen,
             */
            *dst++ = 0;
            src += 2;
+       } else if (!Tcl_UtfCharComplete(src, srcEnd - src)) {
+           /* Always check before using Tcl_UtfToUniChar. Not doing
+            * can so cause it run beyond the endof the buffer!  If we
+            * happen such an incomplete char its byts are made to
+            * represent themselves.
+            */
+
+           ch = (Tcl_UniChar) *src;
                 ^^^^^^^^^^^^^^^^^^^  here
+           src += 1;
+           dst += Tcl_UniCharToUtf(ch, dst);
        } else {
            src += Tcl_UtfToUniChar(src, &ch);
            dst += Tcl_UniCharToUtf(ch, dst);
        }
     }


----------------------------------------------------------------------

Comment By: Joe English (jenglish)
Date: 2008-03-05 19:18

Message:
Logged In: YES 
user_id=68433
Originator: YES

`git bisect` narrows it down to this commit:

Author: andreas_kupries <andreas_kupries>
Date:   Wed Apr 5 00:05:53 2006 +0000

        * generic/tclIO.c (ReadChars): Added check and panic and
          commentary to a piece of code which relies on BUFFER_PADDING to
          create enough space at the beginning of each buffer forthe
          insertion of partial multi-byte data at the beginning of a
          buffer. To explain why this code is ok, and as precaution if
          someone twiddled the BUFFER_PADDING into uselessness.

        * generic/tclIO.c (ReadChars): [SF Tcl Bug 1462248]. Added code
          temporarily suppress the use of TCL_ENCODING_END set when eof
          was reached while the buffer we are converting is not truly the
          last buffer in the queue. together with the Utf bug below it
          was possible to completely bollox the buffer data structures,
          eventually crashing Tcl.

        * generic/tclEncoding.c (UtfToUtfProc): Fixed problem where the
          function accessed memory beyond the end of the input
          buffer. When TCL_ENCODING_END is set and the last bytes of the
          buffer start a multi-byte sequence. This bug contributed to [SF
          Tcl Bug 1462248].



----------------------------------------------------------------------



