From: Brian S. <br...@de...> - 2001-10-05 15:12:24
William Harold Newman wrote:

>On Fri, Oct 05, 2001 at 03:54:00PM +1000, Brian Spilsbury wrote:
>
>>What isn't done yet is fixing the string support.
>>
>>simple-string needs to be duplicated into simple-immutable-string and
>>simple-byte-string, the latter for the ffi interface mostly.
>>
> ^^^^^^^^^^^^^^^^^^
>
>As long as there is any specialized support for 8-bit characters left
>in the implementation, shouldn't they be BASE-CHARs, and then
>shouldn't this be SIMPLE-BASE-STRING?
>

Well, base-char is now 21 bit, so... not sure, but yes, I've left it as
simple-string for now.

>>then a (make-literal-string) method is necessary which produces a
>>simple-immutable-string if all the characters qualify,
>>or a utf8-immutable-string if they don't.
>>
>
>I'd prefer separating "make read only" and "squash into BASE-STRING
>if possible" into separate orthogonal operations. Perhaps
>  * a new SB-EXT:READ-ONLY-STRING function analogous to STRING, which
>    coerces its argument to a read-only string; or a :READ-ONLY option
>    to STRING, as an alternate interface to the same thing
>  * a new keyword argument (perhaps :COMPACT, I dunno) to functions
>    like COPY-STRING, STRING, and READ-ONLY-STRING, to cause the
>    squash-into-BASE-STRING-if-possible operation
>(But also see my growing reservations about read-only-ness below.)
>

Yes, I was thinking of (make-immutable-string) as a pair to
(make-string), and having the reader naturally use it for string
literals. That would allow us to use the fat 32-bit char strings by
default without too much anguish, and to support coercion when the
type-specs are sorted out.

>>complex-string is likewise UCS-4 encoded.
>>
>>I think that covers things pretty well, I'm not sure what kind of
>>overhead we can expect on string dispatching though, and I don't know of
>>an ansi typespecifier for an immutable string... which is annoying since
>>they effectively specify them in the literals section.
>>
>>Comments?
>>
>
>Of course you're right that there's no ANSI type specifier for an
>immutable string. (And you're also right that there are some annoying
>things about the ANSI standard.:-) Some sort of extension would be
>needed. And now that I think about it, I'm afraid that might be a
>jumbo-size can of worms.
>

I don't think we need a type-specifier for ansi-cl code. In ansi code
the only immutables will be literals anyhow, and modifying those is
undefined behaviour.

However, we can add a non-ansi type specifier for sbcl-ish code, such as
immutable-simple-string etc, where we can afford some ugliness.
immutable-p might also be useful.

>
>I don't see any very nice way to put read-only-ness in a Common Lisp
>type specifier. The C++ idea that 'const char' is used where you'd use
>'char' doesn't seem like an obvious fit to Common Lisp type syntax
>(and there might even be deep problems in extending the semantics that
>way). You could avoid any deep syntax problems by extending the Common
>Lisp type system to include READ-ONLY-ARRAY and READ-ONLY-STRING and
>so forth, but I'm afraid it'd be quite a mess, since the system is
>already big enough that it can be hard to remember everything.
>(READ-ONLY-SIMPLE-BASE-STRING? ugh..)
>
>So it might be hard to put immutability in the type specifier.
>Unfortunately, I think there are some good reasons that immutability
>ought to show up there.
>  * A lot of optimization code in the compiler works by passing
>    around type information, so if you want to be able to compile
>    things like (SETF AREF) efficiently in the presence of
>    read-only-ness, you'll probably need to pass around
>    read-only-ness (and writeable-ness) as part of the type.
>  * The idea that (TYPE-OF FOO) could return a standard type like
>    SIMPLE-STRING, but then when you tried to do (SETF (SCHAR FOO 0) #\0)
>    you'd get a runtime error, makes me uneasy.
>    (Although I might be able to live with it if it's only used to
>    signal an error in the ANSI-undefined case
>    (SETF (SCHAR (SYMBOL-NAME 'PRINT) 0) #\p).)
>
>When some time ago I asked whether you were designing the extension
>yourself or porting something from some other Lisp, I was mostly
>worried about the problem of generalizing all the operators in Common
>Lisp to do the right thing when presented with read-only inputs, and
>to construct read-only outputs when wanted. (If I do DELETE-IF on an
>immutable string, should I get back an immutable string?) But now
>that I think about it, I'm even more worried about the problems of the
>type system than I am about the problems with operators.
>

My feeling is to simply raise a condition on any attempt to mutate an
immutable object. Initially, though, I will silently ignore attempts to
mutate such strings, for simplicity. Raising the condition seems to
satisfy the spec in a polite fashion, since the spec allows undefined
behaviour in that case anyhow.

>I was receptive to the idea of read-only-ness, even in the absence of
>a complete design, because I sorta thought that even if a complete
>design of operator behavior turned out to be hard, you could punt.
>Just add a READ-ONLY keyword argument to MAKE-ARRAY and MAKE-STRING
>and COPY-STRING and declare victory! I still think that that simple
>result would be somewhat useful (especially since it would make SBCL's
>system data, e.g. SYMBOL-NAME values, a little harder to corrupt).
>However, I had also assumed it would be complete, and now that I've
>thought more about the type system issues, I'm no longer sure of that.
>  * What will TYPE-OF return when it's called on a read-only string?
>  * What will TYPE-OF return when it's called on an ordinary
>    ANSI-standard string, one which supports (SETF SCHAR)?
>      (TYPE-OF (MAKE-STRING 51))
>  * (SUBTYPEP (TYPE-OF (SYMBOL-NAME 'FOO)) (TYPE-OF (MAKE-STRING 3))) => ?
>  * (SUBTYPEP (TYPE-OF (MAKE-STRING 3)) (TYPE-OF (SYMBOL-NAME 'FOO))) => ?
>  * If there are new type specifiers, how much code can be exposed
>    to the new type specifiers by portable constructs, e.g.
>      (CONCATENATE (TYPE-OF (SYMBOL-NAME SYM)) THIS THAT)
>      (MAP (TYPE-OF (SYMBOL-NAME SYM)) *FROBBER* THOSE THESE)
>      (COERCE MY-NAME (TYPE-OF (SYMBOL-NAME SYM)))
>    and so must necessarily be made to work with the new type specifiers
>    before the extended system is ANSI-compliant?
>  * Is there a way (some sort of extension?) to explicitly declare a
>    string argument to be writable, so that (SETF SCHAR) can be compiled
>    efficiently? Will portable string-handling code, without such
>    extended declarations, still be compiled reasonably efficiently?
>

Well, initially type-of will return weird names for the extended
strings, but that's a good question. It may be possible to transform the
weird type into a polite type at the surface level, while retaining the
true nature for the compiler and for those non-ansi programs which care
to use the weird names. I'm not sure about that though, I'll read the
hyperspec some more.

I believe that by default all strings which are not literals should be
mutable for ansi compatibility. This means that all our (make-string)
results will be ucs-4 encoded, and we pay in size, but we don't usually
do a huge amount of this.

String operations between representations are a bit messy, but it isn't
too bad. We can use the same string comparison operators for all of
simple-string, immutable-simple-string and immutable-utf8-string.
Comparison between simple-string and ucs-4-string is fairly
straightforward; comparing ucs-4-string and immutable-utf8-string is the
most expensive, since we need to decode the utf-8-string character by
character. Fortunately utf-8-string will mostly be in symbol-names (I
expect), where it will generally be compared with other symbol-names and
printed, which are operations that do not require random access.
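The character-by-character decode mentioned above can be sketched
roughly as follows. This is only an illustration in portable Common
Lisp, not code from SBCL: the names decode-utf8-char and string=-utf8
are made up for the example, the utf-8 string is modelled as a plain
vector of octets, and the input is assumed to be well-formed UTF-8 (no
validation is done).

```lisp
;; Hypothetical helpers for illustration only; not SBCL's actual API.
;; Assumes OCTETS holds well-formed UTF-8; malformed input is not checked.
(defun decode-utf8-char (octets index)
  "Decode one code point from OCTETS starting at INDEX.
Returns two values: the code point and the index of the next sequence."
  (let ((b0 (aref octets index)))
    (flet ((cont (k)
             ;; Low six payload bits of the K-th continuation octet.
             (ldb (byte 6 0) (aref octets (+ index k)))))
      (cond ((< b0 #x80)                ; 1 octet:  0xxxxxxx
             (values b0 (+ index 1)))
            ((< b0 #xE0)                ; 2 octets: 110xxxxx 10xxxxxx
             (values (logior (ash (ldb (byte 5 0) b0) 6)
                             (cont 1))
                     (+ index 2)))
            ((< b0 #xF0)                ; 3 octets: 1110xxxx + 2 continuations
             (values (logior (ash (ldb (byte 4 0) b0) 12)
                             (ash (cont 1) 6)
                             (cont 2))
                     (+ index 3)))
            (t                          ; 4 octets: 11110xxx + 3 continuations
             (values (logior (ash (ldb (byte 3 0) b0) 18)
                             (ash (cont 1) 12)
                             (ash (cont 2) 6)
                             (cont 3))
                     (+ index 4)))))))

(defun string=-utf8 (string octets)
  "True if STRING and the UTF-8 encoded OCTETS hold the same characters.
Walks OCTETS strictly left to right, so no random access is needed."
  (let ((pos 0)
        (end (length octets)))
    (loop for ch across string
          do (when (>= pos end)
               ;; OCTETS ran out before STRING did.
               (return-from string=-utf8 nil))
             (multiple-value-bind (code next) (decode-utf8-char octets pos)
               (unless (= code (char-code ch))
                 (return-from string=-utf8 nil))
               (setf pos next)))
    ;; Equal only if OCTETS was consumed exactly.
    (= pos end)))
```

For example, (string=-utf8 "abc" (vector 97 98 99)) returns T. The
cost is one decode step per character, which is the reason this pairing
is the expensive one, but the scan never has to index into the middle
of the utf-8 data.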
I believe with some care that the result can be acceptably efficient.

Regards,

Brian Spilsbury