Re: The AlphaCocoa Project

Bernard Desgraupes skrev 2011-06-30 10.33:
> Hi Lars,
>
> I added my own Makefile.in and pkgIndex.tcl.in and it compiles out of the box.
> I tested it from Alpha:
>
> Welcome to AlphaX's AlphaTcl shell.
> «» version
> AlphaX 8.2rc3, Monday, 27 June 2011
> «» package require UTF
> 0.1
> «» lsearch -exact [encoding names] UTF-8
> 34
> «» encoding convertfrom UTF-8 \303\266
> ö
> «» encoding convertto UTF-8 \u00F6
> Ã¶
> etc...
>
> Great! Can we ship it with the next release of AlphaX (8.2) ?

"Hey, it compiles! Let's ship it!", eh?

Well, I'm flattered, and licensing-wise have no objection (consider the code 
BSD'ed), but I don't see that it would do much good for AlphaX. /Maybe/ it 
could be of use with Alphatk.

To clarify this, I suppose I'd better explain what the package does: It 
creates a Tcl encoding (programmatic rather than table-defined) named UTF-8. 
Tcl already comes with a utf-8 encoding built in, so how are the two 
different? Mainly in how they deal with four-byte UTF-8 sequences (i.e., 
characters outside the Basic Multilingual Plane from \u0000 to \uFFFF, which 
is the only range Tcl supports by default).

Tcl's standard utf-8 encoding decodes these, sees that they are outside the 
supported range, and substitutes a REPLACEMENT CHARACTER (I don't recall 
anymore if it is \uFFFD or just ?). The UTF-8 encoding decodes these, sees 
that they are outside the supported range, and substitutes the corresponding 
surrogate pair of characters (something like \uD800\uDC00). It's not the same 
to anyone doing a [string index], but it can preserve the integrity of texts 
just happening to contain some non-BMP character. Conversely, when one does 
[encoding convertto UTF-8], a surrogate pair is converted to one four-byte 
sequence (rather than two three-byte sequences, as utf-8 would).
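
To make the arithmetic concrete, here is a minimal sketch in Python (not the 
package's actual C code; all function names are made up for illustration). 
It splits a non-BMP code point into a surrogate pair, and compares the 
four-byte UTF-8 form with the six bytes one gets when each surrogate is 
encoded as its own three-byte sequence.

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogate_pair(hi, lo):
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def three_byte_form(unit):
    """Encode one 16-bit value in 0x0800..0xFFFF as three UTF-8-style bytes."""
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


cp = 0x1D11E                      # MUSICAL SYMBOL G CLEF, outside the BMP
hi, lo = to_surrogate_pair(cp)    # what ends up in a 16-bit string

# One four-byte sequence for the pair as a whole.
four_byte = chr(from_surrogate_pair(hi, lo)).encode("utf-8")

# Two three-byte sequences, one per surrogate.
six_byte = three_byte_form(hi) + three_byte_form(lo)

print(hex(hi), hex(lo), four_byte.hex(), six_byte.hex())
```

The six-byte form is not valid UTF-8 as far as strict decoders are 
concerned, which is why the round trip through a four-byte sequence matters 
for files that other programs will read.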

For an AlphaCocoa that treats its windows as containing a sequence of UTF-16 
strings, this UTF package behaviour makes perfect sense, as it is precisely 
what one wants to do when the file encoding is UTF-8. For AlphaX that is 
still restricted to macRoman in the text windows, I don't see how it would 
make a difference. In addition, I'm not entirely sure how stable this UTF 
package is: I think I had crashes when it was used as a channel encoding, 
and I can't recall whether I fixed those.

Finally, it only provides part of the functionality I had envisioned. 
For one thing, I also wanted to provide UTF-16BE, UTF-16LE, and UTF-16 
(autodetection of Byte Order Mark) encodings; Tcl's built-in Unicode 
encoding is platform-dependently UTF-16BE or UTF-16LE, which is a bit 
unreliable. For another, I wanted to add all the #ifdef'ery needed to 
compile the package also for a Tcl that supports full Unicode. (IIRC, Tcl 
should work if compiled that way, even if Tk does not.)
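
As a rough illustration of what BOM autodetection for a UTF-16 encoding 
amounts to, here is a small Python sketch. The function name is hypothetical, 
and falling back to big-endian when no BOM is present is an assumption of 
this sketch, not something the UTF package is known to do.

```python
import codecs


def decode_utf16_auto(data, default="utf-16-be"):
    """Decode UTF-16 bytes, picking the byte order from a leading BOM.

    The fallback byte order used when there is no BOM is an assumption.
    """
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return data[2:].decode("utf-16-le")
    return data.decode(default)


# The same character (o with diaeresis) in both byte orders.
print(decode_utf16_auto(b"\xfe\xff\x00\xf6"))
print(decode_utf16_auto(b"\xff\xfe\xf6\x00"))
```

Because the byte order is taken from the data instead of from the platform, 
the result does not depend on the machine doing the decoding.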

Lars Hellström

PS: For AlphaCocoa, it may seem scary to have surrogate pairs in the buffer 
that are supposed to render as just one character, since that complicates 
indexing. However, Unicode forces us to face that issue much earlier, due to 
the existence of combining accent characters -- isn't internationalisation 
fun? ;-) On the other hand, we may already have taken the basic steps 
towards decoupling column from character index, since we don't consider tabs 
(\t) to be just one column wide.
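
A sketch of what decoupling column from character index could look like, in 
Python and purely for illustration (the function name and the tab width of 8 
are assumptions, and this is not how Alpha computes columns). Tabs advance to 
the next tab stop and combining characters take no column of their own.

```python
import unicodedata


def visual_column(text, index, tab_width=8):
    """Return the display column reached just before text[index]."""
    col = 0
    for ch in text[:index]:
        if ch == "\t":
            # Advance to the next multiple of tab_width.
            col += tab_width - (col % tab_width)
        elif unicodedata.combining(ch):
            # A combining accent is drawn on the previous character.
            continue
        else:
            col += 1
    return col


print(visual_column("a\tb", 2))        # column after "a" and one tab
print(visual_column("e\u0301x", 2))    # column after "e" plus combining acute
```

In a buffer of 16-bit units, a surrogate pair would need the same kind of 
treatment: two index positions, one column.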