Hi Christophe Rhodes,
> I believe Kevin has chosen option #2 for his Debian uploads -- in some
> sense, I think that covers that base; it seems to me that #4 is the
> option that best reflects the state of SBCL development at release
> time.
Indeed I've just installed SBCL 0.8.17-2 from Debian Unstable and
unfortunately it doesn't have large character support compiled in.
I find 8-bit strings that can store arbitrary octets useful, and I'm
wondering whether you would consider continuing to support them in preference
to the ASCII support you've mooted for the FFI. The fantastic property of
ISO-8859-1 and Unicode is that the first 256 characters of Unicode
map exactly onto ISO-8859-1. So if you define BASE-CHAR to be octets
(of ostensibly ISO-8859-1 encoded characters) then EXTENDED-CHAR naturally
maps from character code 256 onwards. In practice the only two useful
string element types will be BASE-CHAR and CHARACTER.
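
The mapping described above can be checked quickly. This is only a sketch in
Python rather than Lisp, and it assumes Python's "latin-1" codec as a
stand-in for an ISO-8859-1 BASE-CHAR string; the variable names are made up
for illustration.

```python
# Every octet value 0..255, in order.
octets = bytes(range(256))

# Decoding as ISO-8859-1 gives one character per octet.
text = octets.decode("latin-1")

# Each character's Unicode code point equals the original octet value.
codes = [ord(ch) for ch in text]

# Encoding back to ISO-8859-1 recovers the original octets unchanged.
round_trip = text.encode("latin-1")
```

Here codes is the list 0..255 and round_trip equals octets, which is the
property that lets EXTENDED-CHAR start at code 256.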
I have one situation where I'm encoding binary within strings over a
socket connection that can only transfer text. The lowest common
denominator is ASCII so I have 7 bits (effectively 6 bits to avoid control
characters). If the implementation supports strings of octets then I'm
only using (8-6)/6 = 33% more memory. If an implementation only supports
32-bit characters then I'm using (32-6)/6 = 433% more memory.
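
The arithmetic above, spelled out as a small Python sketch. The function name
overhead_percent is hypothetical; the 6 payload bits per character come from
the paragraph above.

```python
def overhead_percent(storage_bits, payload_bits=6):
    """Extra storage per character, as a percentage of the useful payload."""
    return round(100 * (storage_bits - payload_bits) / payload_bits)

# 8-bit characters carrying 6 useful bits each.
octet_overhead = overhead_percent(8)

# 32-bit characters carrying the same 6 useful bits each.
wide_overhead = overhead_percent(32)
```

This gives 33 for the octet case and 433 for the 32-bit case.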
It also sounds like ongoing support for strings of octets would help the
porting of sb-md5 (add :element-type 'base-char when making strings).
People could also store UTF-8 encoded text in the strings of octets.
There's no decoding overhead. You just store the octets as read instead of
the implementation having to decode the stream of octets and store
them as code points within the CHARACTER data structure.
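
A rough illustration of storing UTF-8 octets as-is, again in Python and again
assuming the "latin-1" codec as a stand-in for a string of octets. The sample
text is made up; in real use the octets would come from a stream.

```python
# Stand-in for octets read from a stream: UTF-8 encoded text.
raw = "na\u00efve caf\u00e9".encode("utf-8")

# Store one character per octet. No UTF-8 decoding happens here.
stored = raw.decode("latin-1")

# Decode to code points only when (and if) the text is actually needed.
decoded = stored.encode("latin-1").decode("utf-8")
```

The stored string has exactly as many characters as there were octets, and
every character code is below 256, so it would fit in a BASE-CHAR string
under the definition suggested above.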
It's also possible to encode extra information within the string such as
an escape character. Some octet values, such as 255 (0xFF), never occur in
well-formed UTF-8. Strings containing such octets can still be transferred
portably as ISO-8859-1, with the receiver unescaping them and interpreting
the remaining octets as UTF-8.
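
One possible reading of that idea, sketched in Python. The pack/unpack
helpers and the use of octet 255 as a field separator are assumptions made
for illustration, not a scheme taken from any existing library.

```python
ESCAPE = bytes([0xFF])  # octet 255 never appears in well-formed UTF-8

def pack(fields):
    """Join UTF-8 encoded fields, using the 0xFF octet as an out-of-band marker."""
    return ESCAPE.join(field.encode("utf-8") for field in fields)

def unpack(octets):
    """Split on the marker, then decode each remaining piece as UTF-8."""
    return [part.decode("utf-8") for part in octets.split(ESCAPE)]

# Sender: the octets travel as ISO-8859-1 text, one character per octet.
wire = pack(["caf\u00e9", "\u65e5\u672c"]).decode("latin-1")

# Receiver: recover the octets, strip the markers, interpret the UTF-8.
fields = unpack(wire.encode("latin-1"))
```

Because 0xFF cannot occur inside the encoded fields, the receiver can split
on it without any ambiguity and gets the original two fields back.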
There are many reasons why strings of octets would continue to be
useful in an implementation that has Unicode support. This advice may be
helpful but feel free to ignore it without explanation.
Regards,
Adam