From: Lars H. <Lar...@re...> - 2011-06-30 14:59:02
Bernard Desgraupes wrote 2011-06-30 10.33:
> Hi Lars,
>
> I added my own Makefile.in and pkgIndex.tcl.in and it compiles out of the box.
> I tested it from Alpha:
>
> Welcome to AlphaX's AlphaTcl shell.
> «» version
> AlphaX 8.2rc3, Monday, 27 June 2011
> «» package require UTF
> 0.1
> «» lsearch -exact [encoding names] UTF-8
> 34
> «» encoding convertfrom UTF-8 \303\266
> ö
> «» encoding convertto UTF-8 \u00F6
> ö
> etc...
>
> Great! Can we ship it with the next release of AlphaX (8.2) ?

"Hey, it compiles! Let's ship it!", eh? Well, I'm flattered, and licensing-wise I have no objection (consider the code BSD'ed), but I don't see that it would do much good for AlphaX. /Maybe/ it could be of use with Alphatk.

To clarify this, I suppose I'd better explain what the package does: It creates a Tcl encoding (programmatic rather than table-defined) named UTF-8. Tcl already comes with a utf-8 encoding built in, so how are the two different? Mainly in how they deal with four-byte UTF-8 sequences (i.e., characters outside the Basic Multilingual Plane from \u0000 to \uFFFF, which is the only range Tcl supports by default).

Tcl's standard utf-8 encoding decodes these, sees that they are outside the supported range, and substitutes a REPLACEMENT CHARACTER (I don't recall anymore whether it is \uFFFD or just ?). The UTF-8 encoding decodes these, sees that they are outside the supported range, and substitutes the corresponding surrogate pair of characters (something like \uD800\uDC00). The result is not the same to anyone doing a [string index], but it preserves the integrity of texts that just happen to contain some non-BMP character. Conversely, when one does [encoding convertto UTF-8], a surrogate pair is converted to one four-byte sequence (rather than two three-byte sequences, as utf-8 would produce).
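To make the difference concrete, here is a minimal sketch in Python of the surrogate-pair handling described above. It is not the package's C code; the function names are made up for illustration, and Python's "surrogatepass" error handler merely stands in for an encoder that writes each 16-bit unit separately.

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def is_pair(hi, lo):
    """True if hi/lo form a high surrogate followed by a low surrogate."""
    return 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF


def decode_keep_surrogates(data):
    """Decode UTF-8 bytes into 16-bit units; non-BMP characters become
    surrogate pairs instead of being replaced."""
    units = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp > 0xFFFF:
            units.extend(to_surrogate_pair(cp))
        else:
            units.append(cp)
    return units


def encode_joining_surrogates(units):
    """Encode 16-bit units as UTF-8, turning each surrogate pair into
    one four-byte sequence."""
    chars = []
    i = 0
    while i < len(units):
        if i + 1 < len(units) and is_pair(units[i], units[i + 1]):
            cp = 0x10000 + ((units[i] - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            chars.append(chr(cp))
            i += 2
        else:
            chars.append(chr(units[i]))
            i += 1
    # "surrogatepass" lets an unpaired surrogate through as three bytes.
    return "".join(chars).encode("utf-8", "surrogatepass")


def encode_each_unit_separately(units):
    """Encode every 16-bit unit on its own, so a surrogate pair ends up
    as two three-byte sequences."""
    return "".join(chr(u) for u in units).encode("utf-8", "surrogatepass")


four_bytes = b"\xf0\x90\x80\x80"              # U+10000 in UTF-8
units = decode_keep_surrogates(four_bytes)     # [0xD800, 0xDC00]
print([hex(u) for u in units])
print(encode_joining_surrogates(units).hex())    # f0908080
print(encode_each_unit_separately(units).hex())  # eda080edb080
```

The round trip through encode_joining_surrogates gives back the original four bytes, which is the point about preserving the integrity of a text; encoding the two units separately yields six bytes that a strict UTF-8 decoder will reject.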
For an AlphaCocoa that treats its windows as containing a sequence of UTF-16 strings, this UTF package behaviour makes perfect sense, as it is precisely what one wants to do when the file encoding is UTF-8. For AlphaX, which is still restricted to macRoman in the text windows, I don't see how it would make a difference.

In addition, I'm not entirely sure how stable this UTF package is: I think I had crashes when it was used as a channel encoding, and I can't recall whether I fixed those.

Finally, it provides only part of the functionality I had envisioned. For one thing, I also wanted to provide UTF-16BE, UTF-16LE, and UTF-16 (autodetection of Byte Order Mark) encodings; Tcl's built-in unicode encoding is platform-dependently UTF-16BE or UTF-16LE, which is a bit unreliable. For another, I wanted to add all the #ifdef'ery needed to compile the package also for a Tcl that supports full Unicode. (IIRC, Tcl should work if compiled that way, even if Tk does not.)

Lars Hellström

PS: For AlphaCocoa, it may seem scary to have surrogate pairs in the buffer that are supposed to render as just one character, since that complicates indexing. However, Unicode forces us to face that problem much earlier anyway, due to the existence of combining accent characters -- isn't internationalisation fun? ;-) On the other hand, we may already have taken the basic steps towards decoupling column from character index, since we don't consider tabs (\t) to be just one column wide.
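The indexing point in the PS can be sketched as follows, again in Python and with hypothetical helper names. The column count is a rough approximation that only accounts for tabs and combining marks; it assumes every other code point is one column wide and does not implement full grapheme cluster segmentation.

```python
import unicodedata


def utf16_unit_count(text):
    """Number of 16-bit units a UTF-16 buffer would need for text."""
    return len(text.encode("utf-16-le")) // 2


def display_width(text, tab_width=8):
    """Approximate column reached after text, starting from column 0.

    Tabs advance to the next tab stop, combining marks take no column,
    and every other code point is assumed to take exactly one.
    """
    col = 0
    for ch in text:
        if ch == "\t":
            col = (col // tab_width + 1) * tab_width
        elif unicodedata.combining(ch) != 0:
            continue
        else:
            col += 1
    return col


samples = {
    "precomposed ö": "\u00f6",
    "o + combining diaeresis": "o\u0308",
    "non-BMP U+10000": "\U00010000",
    "a, tab, b": "a\tb",
}
for label, s in samples.items():
    print(label, utf16_unit_count(s), len(s), display_width(s, tab_width=4))
```

In this sketch the combining sequence and the non-BMP character both occupy two 16-bit units but only one column, so a buffer that already maps tabs to columns needs the same kind of mapping for them.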