On Tue, Sep 25, 2001 at 02:55:27AM +1000, Brian Spilsbury wrote:
> William Harold Newman wrote:
> >On Mon, Sep 24, 2001 at 01:05:49PM +1000, Brian Spilsbury wrote:
> >>I've just implemented some preliminary unicode support in sbcl. So far
> >>this consists of:
> >>Adding a new immediate primitive type UNICODE-CHAR.
> >>Adding a bunch of functions, and patching a bunch of existing ones with (etypecase) mostly.
> >>In the longer term I'm tempted to look at phasing out base-char
> >>completely, and look at perhaps a (character 0 255) type to replace it,
> >>along a similar line to (integer 0 255). Then we can define UCS-2 (ugly
> >>thing) as (char 0 #xffff), etc.
> >BASE-CHAR is required by the ANSI standard, so I don't think it should
> >be phased out (?). Perhaps you mean eliminating the distinction between
> >BASE-CHAR and this (CHARACTER #.(CODE-CHAR 0) #.(CODE-CHAR 255)) type?
> >or eliminating the distinction between BASE-CHAR and UNICODE-CHAR?
> Well, in practical terms, I mean to eliminate the base-char immediate
> primitive type,
> and then to refigure it in terms of Unicode-Char, perhaps with some
> (mod) style specialisation.
> >If UCS-2 is the scheme where special digraphs are used to refer to
> >characters numbered >64K, I'm not sure it'll be practical to support
> >it cleanly in Common Lisp strings. The assumption of uniform indexing
> >of characters seems to be built pretty deeply into the Common Lisp
> >string operations. But that's just my superficial impression, not
> >something I've thought about deeply.
> UCS-2 is the nasty system that Java, etc. use, which only gives a range
> of #x0000-#xffff,
> i.e., 16 bits per character, linear. Since Unicode covers #x0000-#x10ffff
> this is a bit small,
> and has the further disadvantage of blowing out all your normal text two-fold.
> >>Unicode-Char is currently 21 bits, and covers the range of 0-#x10FFFF as
> >>per the Unicode 3.1 specification.
> >That should fit nicely into the Common Lisp string operations, I'd
> >think, although of course it will tend to make strings a little large.
> Well, I'm looking at using utf-8 encoded strings, which give no overhead
> for ascii,
> and then progressive levels as it gets more complex.
> UTF-8 has two issues for string encoding. Firstly, random access is
> slowed to O(index),
> which isn't as bad as O(octets-to-reach-index). Secondly, mutating a
> slot in a UTF-8 string may alter the length of the string.
> Ideally you'd implement immutable-utf-8 strings as simple vectors, and
> mutable ones in some tree-structure.
> In order to get around the random-access overhead, adding in primitive
> string iteration forms should help, since
> mostly we're doing random access in order to iterate - for printing,
> searching, etc.
> I don't think it's that big a deal for strings, but see how it goes.
I should warn you before you do too much work on this: it may be
difficult to persuade me to support data structures which can't
effectively be accessed using the ordinary ANSI string operations
(because the "right" way is to use nonstandard string iteration
forms). It's not impossible, but I think I'd need a pretty compelling
reason, and a small constant factor in space efficiency may not be
enough.
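
To make the access cost under discussion concrete, here is a minimal
sketch of indexed access into UTF-8 octets, next to the kind of one-pass
iteration form proposed above. The names UTF8-SEQUENCE-LENGTH,
UTF8-DECODE-AT, UTF8-CHAR-REF, MAP-UTF8-CODE-POINTS and *EXAMPLE-OCTETS*
are hypothetical, not anything in SBCL, and the code assumes well-formed
input with no error checking.

```lisp
;; Sketch only: every name defined here is hypothetical, not part of SBCL.
;; Assumes OCTETS holds well-formed UTF-8; no validation is done.
(defun utf8-sequence-length (lead-octet)
  "Number of octets in the UTF-8 sequence introduced by LEAD-OCTET."
  (cond ((< lead-octet #x80) 1)
        ((< lead-octet #xE0) 2)
        ((< lead-octet #xF0) 3)
        (t 4)))

(defun utf8-decode-at (octets pos)
  "Decode the sequence starting at POS. Returns the code point and the
position of the next sequence."
  (let* ((lead (aref octets pos))
         (len (utf8-sequence-length lead))
         ;; Keep only the payload bits of the lead octet.
         (code (if (= len 1)
                   lead
                   (logand lead (ash #xFF (- (1+ len)))))))
    ;; Each continuation octet contributes six more bits.
    (loop for k from 1 below len
          do (setf code (logior (ash code 6)
                                (logand (aref octets (+ pos k)) #x3F))))
    (values code (+ pos len))))

(defun utf8-char-ref (octets index)
  "Code point of character number INDEX. Must skip INDEX characters first,
so the cost grows with INDEX rather than being constant."
  (let ((pos 0))
    (loop repeat index
          do (incf pos (utf8-sequence-length (aref octets pos))))
    (values (utf8-decode-at octets pos))))

(defun map-utf8-code-points (function octets)
  "Call FUNCTION on each code point in OCTETS in a single linear pass."
  (let ((pos 0)
        (end (length octets)))
    (loop while (< pos end)
          do (multiple-value-bind (code next) (utf8-decode-at octets pos)
               (funcall function code)
               (setf pos next)))))

;; Three characters in six octets: a, e-acute, euro sign.
(defparameter *example-octets*
  (coerce '(#x61 #xC3 #xA9 #xE2 #x82 #xAC) '(vector (unsigned-byte 8))))
```

Calling UTF8-CHAR-REF once per index in a loop rescans from the start
each time, while MAP-UTF8-CODE-POINTS visits every character in one
pass. Note also that overwriting the first character (one octet) with
the euro sign (three octets) would change the length of the octet
vector, which is the mutation issue mentioned above.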
> >>When this settles down a little I'm intending to implement utf-8 string
> >>support, and general immutable string/vector support. Fix the reader to
> >>handle unicode properly, add input method support, etc.
> >What extensions do you plan to use to express immutable strings and
> >vectors? Is this a feature set in some other Lisp that you want to port,
> >or have you designed it yourself?
> Just seemed rather obvious, CL really needs more immutable types :)
> A type specifier of (and immutable (vector ...)) seems to be the sanest
> that I can think of at the moment, although that seems a little ugly. If
> you have an alternative, let me know. :)
> Immutable cons, immutable vectors, immutable structs, whatever.
> Immutability is great, you can destructure freely and so on. Lots of
> opportunities for optimisation.
I agree that constant data structures are great, although without lazy
evaluation there are some pretty severe limits on what you can do with
them. And when I came to Lisp from C++, I missed the 'const' qualifier
quite a lot. But now mostly I wish that there was more code in a pure
functional style, and I try to write code in a functional style when I
can, and then once things are pervasively constant I find that I don't
miss the const declarations that much.
On the other hand, one thing I like about Lisp is that it's hard to
corrupt the machine. Languages like C and assembly, where it's easy
for an error anywhere to corrupt the system and cause strange symptoms
later in some logically unrelated part of the system, are not nearly
as nice as Lisp, where it's hard for an error to corrupt the system,
or Eiffel, where it's (AFAIK) impossible. And if there are practical
ways to make it harder to corrupt the system (like using immutable
strings to implement SYMBOL-NAME) then I have some interest in it.
Immutable conses are probably much harder to do than immutable
strings, especially on a 64-bit port, since in the 32-bit
implementations I don't think there are enough type bits. Also,
they're kinda wimpy anyway unless you have lazy evaluation to help you
when you want to do cyclic data structures.
> The main problem is getting the damned thing to substitute properly (as
> into unicode-char-code, etc).
> This is what I'm using;
> (defun code-unicode-char (code)
> "Returns the character with the code CODE."
> (declare (type char-code/unicode code))
> (%primitive sb!c:make-other-immediate-type code sb!vm:unicode-char-type))
> (define-vop (unicode-char-code)
> (:translate unicode-char-code)
> (:args (char :scs (any-reg descriptor-reg) :target code))
> (:arg-types unicode-char)
> (:results (code :scs (unsigned-reg)))
> (:result-types positive-fixnum)
> (:generator 1
> (inst mov code char)))
You might consider changing course and navigating toward a variant of
SBCL which accepts an :SB-UNICODE value in *SHEBANG-FEATURES*, and
when that feature is used in the build, all characters in the
executable are 21-bit Unicode values, with 32-bit upgraded array
element type. Then:
* This would tend to dodge some of the complexities of introducing
new VOPs and transforms and making the compiler pick the ones
you want, because there'd be a one-to-one correspondence between
the operations in the #!+SB-UNICODE system and the #!-SB-UNICODE system.
* My guess is that this variant is only about 30% as hard to do
as an SBCL which accepts dynamically-typed 21-bit Unicode
STRING/CHARACTER and 8-bit BASE-STRING/BASE-CHAR.
* This variant all by itself is likely to be useful for many people
who care about non-ASCII character sets.
* Even if your ultimate objective is for SBCL to be able to support
several dynamically typed variants of STRING (at least ANSI's STRING,
SIMPLE-BASE-STRING, SIMPLE-STRING, and BASE-STRING) with a
distinction between 8-bit array element types for BASE-STRING and
32-bit array element types for STRING, I doubt that stabilizing
this intermediate version would be much extra work.
* Once you had a stable result, it would be a good base camp for
your assault on the problem of a full dynamically typed
BASE-STRING-vs-STRING implementation. Among other things, you
could just grep for all the SB-UNICODE stuff to see what needed
to be switched at runtime. So I don't think that stabilizing
the intermediate version would be wasted work, either.
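
As a rough illustration of the build-time switch described in the list
above, the sketch below selects a character code limit by feature. It
uses the standard #+ and #- reader conditionals against *FEATURES* so
that it loads in any Common Lisp; inside the SBCL build the
corresponding #!+ and #!- forms would be used instead. The
+EXAMPLE-...+ names are hypothetical.

```lisp
;; Sketch only: the +EXAMPLE-...+ names are hypothetical, and standard
;; #+/#- stands in for the #!+/#!- used while building SBCL itself.
(defconstant +example-char-code-limit+
  #+sb-unicode #x110000   ; code points 0 through #x10FFFF
  #-sb-unicode 256)       ; 8-bit characters only

;; Bits needed to hold the largest character code:
;; 21 when :SB-UNICODE is present, 8 otherwise.
(defconstant +example-char-bits+
  (integer-length (1- +example-char-code-limit+)))
```

Because every such choice is marked by the same feature name, a grep
for SB-UNICODE would list the places that a later dynamically typed
implementation would have to decide at runtime.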
One drawback to this approach, compared to the way you've already set
out, is that you'd need to get a lot of stuff working before you'd
have a target SBCL which would actually run. I don't think that would
be too horrible, though. Debugging cold init can be pretty horrible,
but I expect that mostly you'd be debugging the cross-compiler, which
isn't nearly so bad.
William Harold Newman <william.newman@...>
"Sometimes if you have a cappuccino and then try again it will work OK."
-- Dr. Brian Reid, 1992, quoted by mjr
"Sometimes one cappucino isn't enough."
-- mjr = Marcus J. "will do TCP/IP for food" Ranum <mjr@...>
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C B9 25 FB EE E0 C3 E5 7C