From: William H. N. <wil...@ai...> - 2001-10-05 14:38:48
On Fri, Oct 05, 2001 at 03:54:00PM +1000, Brian Spilsbury wrote:
> Well, after much stuffing about I changed direction and used what I
> learned implementing unicode-char to make base-char cover the unicode range.
>
> Then I made it so that writing to simple-string rudely truncated the
> base-char back to 8 bits worth of data.
>
> This is a bit nasty, but it now compiles and seems to work.
>
> By replacing *standard-output* with a utf-8 output stream, I can now
> (write-char #\u5000) etc to produce exciting weird characters under
> utf-8 enabled xterm.
>
> For the time being char-code-limit is incorrectly set to 256, with
> actual-char-code-limit being #10ffff, which allows the reader and so on
> to compile without exploding. char-code-limit also seems to leak in from
> the host environment, which is odd, have to track that down sometime.
>
> All the standard unicode character names work with #\name format, and
> #\uhex also produces characters as expected.
>
> What isn't done yet is fixing the string support.
>
> simple-string needs to be duplicated into simple-immutable-string and
> simple-byte-string, the latter for the ffi interface mostly.
  ^^^^^^^^^^^^^^^^^^

As long as there is any specialized support for 8-bit characters left
in the implementation, shouldn't they be BASE-CHARs, and then
shouldn't this be SIMPLE-BASE-STRING?

> then a (make-literal-string) method is necessary which produces a
> simple-immutable-string if all the characters qualify,
> or a utf8-immutable-string if they don't.

I'd prefer to split "make read only" and "squash into BASE-STRING if
possible" into separate orthogonal operations.
Perhaps
  * a new SB-EXT:READ-ONLY-STRING function analogous to STRING, which
    coerces its argument to a read-only string; or a :READ-ONLY option
    to STRING, as an alternate interface to the same thing
  * a new keyword argument (perhaps :COMPACT, I dunno) to functions
    like COPY-STRING, STRING, and READ-ONLY-STRING, to cause the
    squash-into-BASE-STRING-if-possible operation
(But also see my growing reservations about read-only-ness below.)

> All string literals and symbol names can then be represented in these
> forms, which should account for 99.9% of them.
>
> simple-string becomes a UCS-4 encoded string (we could use a 24 bit
> representation, but I'm not sure it wins much).

On some CPU architectures it's quite a chore to read non-word-aligned
values, so I agree that 32-bit arrays are the way to go here.

> complex-string is likewise UCS-4 encoded.
>
> I think that covers things pretty well, I'm not sure what kind of
> overhead we can expect on string dispatching though, and I don't know of
> an ansi typespecifier for an immutable string... which is annoying since
> they effectively specify them in the literals section.
>
> Comments?

Of course you're right that there's no ANSI type specifier for an
immutable string. (And you're also right that there are some annoying
things about the ANSI standard.:-) Some sort of extension would be
needed.

And now that I think about it, I'm afraid that might be a jumbo-size
can of worms. I don't see any very nice way to put read-only-ness in a
Common Lisp type specifier. The C++ idea that 'const char' is used
where you'd use 'char' doesn't seem like an obvious fit to Common Lisp
type syntax (and there might even be deep problems in extending the
semantics that way). You could avoid any deep syntax problems by
extending the Common Lisp type system to include READ-ONLY-ARRAY and
READ-ONLY-STRING and so forth, but I'm afraid it'd be quite a mess,
since the system is already big enough that it can be hard to remember
everything.
(READ-ONLY-SIMPLE-BASE-STRING? ugh..)

So it might be hard to put immutability in the type specifier.
Unfortunately, I think there are some good reasons that immutability
ought to show up there.
  * A lot of optimization code in the compiler works by passing around
    type information, so if you want to be able to compile things like
    (SETF AREF) efficiently in the presence of read-only-ness, you'll
    probably need to pass around read-only-ness (and writeable-ness)
    as part of the type.
  * The idea that (TYPE-OF FOO) could return a standard type like
    SIMPLE-STRING, but then when you tried to do
    (SETF (SCHAR FOO 0) #\0) you'd get a runtime error, makes me
    uneasy. (Although I might be able to live with it if it's only
    used to signal an error in the ANSI-undefined case
    (SETF (SCHAR (SYMBOL-NAME 'PRINT) 0) #\p).)

When some time ago I asked whether you were designing the extension
yourself or porting something from some other Lisp, I was mostly
worried about the problem of generalizing all the operators in Common
Lisp to do the right thing when presented with read-only inputs, and
to construct read-only outputs when wanted. (If I do DELETE-IF on an
immutable string, should I get back an immutable string?) But now that
I think about it, I'm even more worried about the problems of the type
system than I am about the problems with operators.

I was receptive to the idea of read-only-ness, even in the absence of
a complete design, because I sorta thought that even if a complete
design of operator behavior turned out to be hard, you could punt.
Just add a :READ-ONLY keyword argument to MAKE-ARRAY and MAKE-STRING
and COPY-STRING and declare victory! I still think that that simple
result would be somewhat useful (especially since it would make SBCL's
system data, e.g. SYMBOL-NAME values, a little harder to corrupt).
However, I had also assumed it would be complete, and now that I've
thought more about the type system issues, I'm no longer sure of that.
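
For comparison, here is roughly what careful portable code has to do
today. ANSI leaves the consequences undefined if the string returned
by SYMBOL-NAME is ever modified, so the defensive idiom is to copy
before writing. This is only a sketch: NAME-WITH-LOWERCASE-INITIAL is
a hypothetical helper made up for this illustration, not part of SBCL
or of the proposed extension.

```lisp
;; Hypothetical helper (an assumption for illustration only).
;; Returns a fresh string equal to SYMBOL's name with its first
;; character downcased. The symbol's own name is never written to.
(defun name-with-lowercase-initial (symbol)
  ;; COPY-SEQ yields a fresh simple string, so SCHAR is applicable
  ;; and the (SETF SCHAR) below cannot touch the system's data.
  (let ((name (copy-seq (symbol-name symbol))))
    (when (plusp (length name))
      (setf (schar name 0) (char-downcase (schar name 0))))
    name))
```

(NAME-WITH-LOWERCASE-INITIAL 'PRINT) hands back a modified copy and
leaves (SYMBOL-NAME 'PRINT) alone. The appeal of a read-only flag, as
I read the paragraph above, is that forgetting the COPY-SEQ could then
signal an error instead of silently corrupting the symbol.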
  * What will TYPE-OF return when it's called on a read-only string?
  * What will TYPE-OF return when it's called on an ordinary
    ANSI-standard string, one which supports (SETF SCHAR)?
      (TYPE-OF (MAKE-STRING 51))
  * (SUBTYPEP (TYPE-OF (SYMBOL-NAME 'FOO)) (TYPE-OF (MAKE-STRING 3))) => ?
  * (SUBTYPEP (TYPE-OF (MAKE-STRING 3)) (TYPE-OF (SYMBOL-NAME 'FOO))) => ?
  * If there are new type specifiers, how much code can be exposed
    to the new type specifiers by portable constructs, e.g.
      (CONCATENATE (TYPE-OF (SYMBOL-NAME SYM)) THIS THAT)
      (MAP (TYPE-OF (SYMBOL-NAME SYM)) *FROBBER* THOSE THESE)
      (COERCE MY-NAME (TYPE-OF (SYMBOL-NAME SYM)))
    and so must necessarily be made to work with the new type
    specifiers before the extended system is ANSI-compliant?
  * Is there a way (some sort of extension?) to explicitly declare
    a string argument to be writable, so that (SETF SCHAR) can be
    compiled efficiently? Will portable string-handling code,
    without such extended declarations, still be compiled reasonably
    efficiently?

--
William Harold Newman <wil...@ai...>
Where are we going and why am I in this handbasket?
  -- Daniel Demus <de...@so...>
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C B9 25 FB EE E0 C3 E5 7C
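
The TYPE-OF and SUBTYPEP questions in the list above can be poked at
in any unextended implementation. A minimal sketch, assuming only
standard operators; STRING-TYPE-REPORT is a hypothetical name chosen
for this illustration.

```lisp
;; Hypothetical helper (an assumption for illustration only).
;; Returns a three-element list for STRING:
;;   1. whatever TYPE-OF returns (implementation-dependent),
;;   2. whether STRING is a SIMPLE-STRING,
;;   3. whether the TYPE-OF result is a recognizable subtype of STRING.
(defun string-type-report (string)
  (let ((type (type-of string)))
    (list type
          (typep string 'simple-string)
          ;; VALUES keeps only SUBTYPEP's primary value.
          (values (subtypep type 'string)))))

;; An ordinary writable string and a symbol's name, side by side.
(string-type-report (make-string 3 :initial-element #\a))
(string-type-report (symbol-name 'foo))
```

The first element differs between implementations, so no result is
shown here. The point is that standard CL has no type that records
mutability, so neither report can say whether (SETF SCHAR) is
permitted on the string, which is the gap the questions are about.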