On 5/18/06, Juan Jose Garcia Ripoll <lisp@...> wrote:
> On Wed, 2006-05-17 at 11:31 +0900, Brian Spilsbury wrote:
> > My plan is to add three additional stream types which use iconv for
> > encoding translation, to extend the permitted range of character, and
> > discriminate base-char from character by the value, then separate
> > character and base-char strings.
> What would be the range for characters? I assume 16 bits is not enough
> for chinese, is it? In any case, the 21-bit encoding of UTF-32 will
> definitely fit in the cl_fixnum type.
Unicode code points run from 0 to 1,114,111 (#x10FFFF), but the
official (traditional and simplified) Chinese characters all fit into
the basic 16 bit set.
There are plenty of archaic and specialized characters, though.
If I read correctly, the current immediate character representation
has 30 data bits and 2 tag bits, so the full 21-bit range shouldn't
require any change there.
Instead of producing separate base-char and character classes, I'd
suggest partitioning the range of characters into 0..255 (ie,
base-char) and 256..1,114,111 (ie, extended-char) and then reporting
the type based on the char-code value.
Additional string types will be necessary, though. I'd suggest 8-bit
and 24-bit elements, possibly adding 16-bit later.
In order to keep the implementation simple, we can use a macro which
rebinds a dynamic variable to specify which class of string we want to
be produced by default:

  (with-strings-of base-char ; or standard-char?
    (format nil "..."))

with base-char being the default.
> A more controversial issue is how to handle isalpha(), and similar
> macros. This will depend on the size of the character type. For windows,
> I think wchar_t is 16 bit while for Linux wchar_t is 32 bit and there
> would be no problem translating the character to wchar_t and using the
> library functions.
The library functions are problematic in that the structure of wchar_t
is implementation- and locale-specific, but you could dispatch on that
at build time.
Alternately we could decide that base-char is dispatched as expected,
and then we can extend the character attribute registry by loading
modules for different ranges.
In particular, most people probably won't want the Asian character
ranges, which have a lot of extra information associated with them.
It's likely that several parallel modules would be useful there, for
information like traditional -> simplified mappings, Cantonese
readings, etc, etc.
> > I'm also considering support string streams with :external-format,
> > which would operate on base-char strings.
> That should be left to a second step, shouldn't it?
Maybe, but it would make testing easier, and should be trivial to implement.
My current idea is to add two new stream types -- an 'encoding stream'
and a 'decoding stream'.
These would be composed with streams which can read/write octets.
Now we just need to alter the stream constructors (open,
make-string-input-stream, make-string-output-stream) to do this
composition for us when presented with an :external-format argument,
and it should work.
Adding general support for read-sequence-no-hang would allow for more
efficient I/O, and the iconv library could do the translation.