From: Scott S. <st...@aj...> - 2000-08-21 18:59:26
Laurent Duperval said:
> Hmmm... that's what I thought... Hmm, I was getting behaviour that I wasn't
> expecting while implementing my -text and -underline patch. I thought I
> wasn't working with it properly but now it looks like it's a coding error on
> my part. I'll investigate further.

One thing to watch out for is that Tcl is liberal in what it will accept as input. In UTF-8, all characters > 127 are part of a multi-byte sequence. They begin with a lead byte and are followed by one or more trail bytes. There is a specific bit pattern used to indicate whether a byte is a lead or trail byte.

If you pass Latin-1 (or any other format) to Tcl, it will try to interpret it as UTF-8. ASCII characters retain their meaning in UTF-8. One of two things will happen to non-ASCII characters: if they happen to form a valid lead/trail byte sequence, then multiple characters in the original string will collapse into a single UTF-8 character; if they don't form a valid UTF-8 sequence, then Tcl interprets them as single Latin-1 characters.

So, if you pass Latin-1 text into Tcl, you may think you've done an encoding conversion when you haven't, because Tcl is accepting the malformed data and doing its best to handle it. This was done for backwards compatibility reasons, but in retrospect I think it was a mistake. It tends to mask errors, because most Latin-1 strings don't form valid UTF-8 sequences; the cases where multiple bytes happen to collapse to a single character are less common and may not show up during initial testing.

The simplest way to check whether you've done the encoding conversion correctly is to compare "string length" and "string bytelength" for a string that is known to contain characters > 127. The two lengths returned should be different in this case. If they are the same, then you are dealing with invalid UTF data.

--Scott
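
[A minimal sketch of the conversion and check described above, assuming Tcl 8.1 or later, where [encoding convertfrom] and [string bytelength] are available; the variable name latin1Bytes is only illustrative.]

    # Convert raw Latin-1 bytes explicitly instead of relying on Tcl's
    # lenient handling of malformed UTF-8 input.
    set converted [encoding convertfrom iso8859-1 $latin1Bytes]

    # For a string known to contain characters > 127, the character count
    # and the byte count of the internal UTF-8 representation should differ.
    if {[string length $converted] == [string bytelength $converted]} {
        puts "suspect: lengths match, possibly invalid UTF data"
    } else {
        puts "ok: [string length $converted] chars, [string bytelength $converted] bytes"
    }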