Adam Warner <lists@...> writes:
>> I believe Kevin has chosen option #2 for his Debian uploads -- in some
>> sense, I think that covers that base; it seems to me that #4 is the
>> option that best reflects the state of SBCL development at release
> Indeed I've just installed SBCL 0.8.17-2 from Debian Unstable and
> unfortunately it doesn't have large character support compiled in.
> I find 8 bit strings that can store arbitrary octets useful and I'm
> wondering if you will consider continuing support for them in preference
> to the ASCII support you've mooted for the FFI. The fantastic property of
> ISO-8859-1 and Unicode is that the first 256 characters of Unicode
> map exactly onto ISO-8859-1. So if you define BASE-CHAR to be octets
> (of ostensibly ISO-8859-1 encoded characters) then EXTENDED-CHAR naturally
> maps from character code 256 onwards. In practice the only two useful
> encodings will be BASE-CHAR and CHARACTER.
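
The property described in the quoted paragraph is easy to check outside Lisp. The following is a minimal sketch in Python, not SBCL code; the helper name fits_in_one_octet is made up for illustration. It only shows that decoding an octet as ISO-8859-1 yields the Unicode code point with the same number, and that code point 256 is the first one that no longer fits.

```python
# Sketch only: checks the ISO-8859-1 / Unicode correspondence for all 256 octets.
all_octets = bytes(range(256))

# Decoding as ISO-8859-1 maps octet N to code point N.
decoded = all_octets.decode("iso-8859-1")
code_points = [ord(ch) for ch in decoded]

# Encoding back recovers the original octets unchanged.
round_tripped = decoded.encode("iso-8859-1")


def fits_in_one_octet(ch):
    """Hypothetical helper: True if ch has an ISO-8859-1 encoding."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(code_points == list(range(256)))   # identity mapping for 0..255
print(fits_in_one_octet(chr(255)), fits_in_one_octet(chr(256)))
```
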
For the FFI, my current working model is as follows:

  char *  <=>  (* sb-alien:char)    <=>  (array ([un]signed-byte 8) (*))
  char *  <=>  c-string             <=>  [base-]string
          <=>  iso-8859-1-string    <=>  [base-]string
          <=>  utf8-string          <=>  [base-]string

where the consequences "are undefined" if the alien code modifies the
contents of its arguments in the case of the foo-string types.
I could be wrong about this, but I think that there are different use
cases for arrays of octets and arrays of
things-which-are-going-to-be-treated-as-characters, and I don't think
that they overlap so much.
> I have one situation where I'm encoding binary within strings over a
> socket connection that can only transfer text. The lowest common
> denominator is ASCII so I have 7 bits (effectively 6 bits to avoid control
> characters). If the implementation supports strings of octets then I'm
> only using (8-6)/6 = 33% more memory. If one implementation only supports
> 32 bit characters then I'm using (32-6)/6 = 433% more memory.
I'm not sure I understand this as an argument against ASCII BASE-CHAR;
if the lowest common denominator is ASCII, then a BASE-CHAR=ASCII
representation gives you what you need, does it not? The base-string
type isn't going away; it's mandated by ANSI, and has (in the
Unicode-enabled SBCL) a 7-bit range including all of ASCII; it also
has the potential to use only 16% more memory. :-)
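
For reference, the overhead figures in this exchange all come from the same formula, (storage bits - payload bits) / payload bits, with 6 payload bits per character. A small sketch in Python (the function name overhead_percent is made up for illustration):

```python
# Sketch only: memory overhead of storing 6 useful bits per character
# in character cells of various widths.
def overhead_percent(storage_bits, payload_bits=6):
    """Extra memory used, as a percentage of the payload size."""
    return 100.0 * (storage_bits - payload_bits) / payload_bits


for bits in (7, 8, 32):
    print(f"{bits:2d}-bit characters: {overhead_percent(bits):.1f}% overhead")
```

The 7-bit case works out to roughly 16.7%, the 8-bit case to 33.3%, and the 32-bit case to 433.3%.
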
> It also sounds like ongoing support for strings of octets would help the
> porting of sb-md5 (add :element-type 'base-char when making strings).
No, I think this is completely orthogonal; in particular, I'm of
the firm opinion that there should be no guarantee, implicit or
explicit, that Lisp objects are laid out in a particular way.
Having the return values of sb-md5 depend on the
representation of the string is (I hope you'll agree, but maybe you
won't) completely misguided. (sb-md5:md5sum-string "a") should be the
same as (sb-md5:md5sum-string (coerce "a" 'base-string)). The problem
with sb-md5 as it stands is that we don't even have this interface at
A string-to-octets and octets-to-string pair (with an external-format
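
To illustrate the point about representation independence, here is a minimal sketch in Python rather than Lisp, so as not to presume any SBCL interface. The helper md5sum_string and its external_format parameter are hypothetical names chosen to echo the discussion. The digest is computed over the octets produced by an explicit encoding step, so it depends on the characters and the chosen external format, never on how the string happens to be stored in memory.

```python
import hashlib


def md5sum_string(text, external_format="utf-8"):
    """Hypothetical helper: MD5 of text after encoding it to octets."""
    octets = text.encode(external_format)   # the string-to-octets step
    return hashlib.md5(octets).hexdigest()


# Pure ASCII text encodes to the same octets in all three formats,
# so the digest is the same whichever one is chosen.
ascii_digests = {md5sum_string("a", fmt)
                 for fmt in ("ascii", "iso-8859-1", "utf-8")}
print(ascii_digests)

# Outside ASCII the octets differ between formats, and so do the digests.
print(md5sum_string("\u00e9", "iso-8859-1") == md5sum_string("\u00e9", "utf-8"))
```

Making the external format an explicit argument is the design choice that matters here: two callers who agree on the format get the same digest regardless of the string's internal layout.
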
> People could also store UTF-8 encoded text in the strings of octets.
> There's no decoding overhead. You just store the octets as read instead of
> the implementation having to decode the stream of octets and store
> them as code points within the CHARACTER data structure.
> It's also possible to encode extra information within the string such as
> an escape character. Some values in UTF-8 are undefined such as 255.
> Such strings with undefined UTF-8 sequences can be portably transferred as
> ISO-8859-1 while additionally interpreting and unescaping the UTF-8
> sequence at the other end.
All of this sounds like a good argument for having an underlying data
type with an 8-bit field size, but that exists already as an (array
(unsigned-byte 8) (*)) -- I can't see anything in the above argument
which relies on the array element type being a subtype of CHARACTER...
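
A minimal sketch of that argument, again in Python and purely illustrative (the payload and variable names are made up): the quoted scheme of UTF-8 text plus an out-of-band 0xFF marker can be carried entirely in an array of octets, and a character type is only needed at the point where the text part is finally decoded.

```python
# Sketch only: UTF-8 text followed by a 0xFF marker and extra data.
# 0xFF never occurs in well-formed UTF-8, so it can act as an escape octet.
payload = "caf\u00e9".encode("utf-8") + b"\xff" + b"tail"

# The payload as a whole is not valid UTF-8 ...
try:
    payload.decode("utf-8")
    whole_is_utf8 = True
except UnicodeDecodeError:
    whole_is_utf8 = False

# ... but it survives a trip through an ISO-8859-1 text channel unchanged,
# because every octet value has an ISO-8859-1 character.
wire_text = payload.decode("iso-8859-1")
received = wire_text.encode("iso-8859-1")

# At the other end, split on the marker and decode only the text part.
text_part, marker, rest = received.partition(b"\xff")
text = text_part.decode("utf-8")

print(whole_is_utf8, received == payload, text, rest)
```
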
> There are many reasons why strings of octets would continue to be
> useful in an implementation that has Unicode support. This advice may be
> helpful but feel free to ignore it without explanation.
Your thoughts are welcome. Thank you.