From: Scott S. <st...@aj...> - 2000-08-21 18:59:26
Laurent Duperval said:
> Hmmm... that's what I thought... Hmm, I was getting behaviour that I wasn't
> expecting while implementing my -text and -underline patch. I thought I
> wasn't working with it properly but now it looks like it's a coding error on
> my part. I'll investigate further.

One thing to watch out for is that Tcl is liberal in what it will accept as input. In UTF-8, all characters > 127 are part of a multi-byte sequence. They begin with a lead byte and are followed by one or more trail bytes. There is a specific bit pattern used to indicate whether a byte is a lead or trail byte.

If you pass Latin-1 (or any other format) to Tcl, it will try to interpret it as UTF-8. ASCII characters retain their meaning in UTF-8. One of two things will happen to non-ASCII characters: if they happen to form a valid lead/trail byte sequence, then multiple characters in the original string will collapse into a single UTF-8 character; if they don't form a valid UTF-8 sequence, then Tcl interprets them as single Latin-1 characters.

So, if you pass Latin-1 text into Tcl, you may think you've done an encoding conversion when you haven't, because Tcl is accepting the malformed data and doing its best to handle it. This was done for backwards compatibility reasons, but in retrospect I think it was a mistake. It tends to mask errors, because most Latin-1 strings don't form valid UTF-8 sequences; the cases where multiple bytes happen to collapse to a single character are less common and may not show up during initial testing.

The simplest way to check whether you've done the encoding conversion correctly is to compare "string length" and "string bytelength" for a string that is known to contain characters > 127. The two lengths returned should be different in this case. If they are the same, then you are dealing with invalid UTF data.

--Scott
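
[A minimal sketch of the conversion and check described above, assuming Tcl 8.1 or later, where [encoding convertfrom] and [string bytelength] are available; the variable name latin1Bytes is only illustrative.]

    # Convert raw Latin-1 bytes explicitly instead of relying on Tcl's
    # lenient handling of malformed UTF-8 input.
    set converted [encoding convertfrom iso8859-1 $latin1Bytes]

    # For a string known to contain characters > 127, the character count
    # and the byte count of the internal UTF-8 representation should differ.
    if {[string length $converted] == [string bytelength $converted]} {
        puts "suspect: lengths match, possibly invalid UTF data"
    } else {
        puts "ok: [string length $converted] chars, [string bytelength $converted] bytes"
    }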