From: <ra...@be...> - 2005-02-21 21:44:35
|
On Mac OS X (typep "hello" 'simple-base-string) returns T, but on Linux returns NIL. Which is correct? |
From: Christophe R. <cs...@ca...> - 2005-02-21 22:14:56
|
ra...@be... (Ralph Richard Cook) writes:

> On Mac OS X
> (typep "hello" 'simple-base-string)
> returns T, but on Linux returns NIL.
> Which is correct?

Either or both. Despite your subject line and the body of your e-mail, this isn't at all OS-dependent: in SBCL versions since about 0.8.16, it depends on whether your sbcl has been compiled with the :sb-unicode target feature. If it has, there are three specialized representations of simple-string, of which two are (simple-array character (*)) and simple-base-string; if it hasn't, there are only two, of which the only user-visible one is simple-base-string, which in these circumstances is equivalent to (simple-array character (*)).

Since this is an area of common confusion, be aware that the type SIMPLE-BASE-STRING does not mean "a SIMPLE-STRING which happens to contain only BASE-CHARs"; it means "a SIMPLE-STRING which is capable only of holding BASE-CHARs, not CHARACTERs in general".

Why this distinction? In an SBCL compiled with :sb-unicode, CHAR-CODE-LIMIT is #x110000 -- or, to put it another way, 21 bits are needed to uniquely identify a character. Therefore, storage for an object of type (SIMPLE-ARRAY CHARACTER (*)) -- which means "a SIMPLE-STRING which can have any CHARACTERs as element" -- is allocated as

  [ h e l l o \0]

while there is an obvious more efficient storage method for objects of type SIMPLE-BASE-STRING (which is the same type as (SIMPLE-ARRAY BASE-CHAR (*)))

  [hello\0]

So depending on your compilation options, the string which is read in from the #\" reader macro may have a packed or unpacked representation in memory.

Why does this matter to you?

Cheers, Christophe |
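Code that needs a SIMPLE-BASE-STRING can ask for one explicitly instead of relying on what the reader produces for a literal. The following is a minimal sketch under that assumption; the helper name ENSURE-SIMPLE-BASE-STRING is hypothetical and does not come from SBCL or from this thread.

```lisp
;; Hypothetical helper (the name is an assumption, not part of SBCL).
;; COERCE returns its argument unchanged when it already has the requested
;; type, and otherwise builds a new string with the same characters.
;; A character that is not a BASE-CHAR cannot be stored in the result,
;; so an input containing one is expected to signal an error.
(defun ensure-simple-base-string (string)
  "Return a SIMPLE-BASE-STRING with the same characters as STRING."
  (coerce string 'simple-base-string))

;; Build-dependent: may be true or false for a literal string.
(typep "hello" 'simple-base-string)

;; Independent of how the literal was represented by the reader.
(typep (ensure-simple-base-string "hello") 'simple-base-string)

;; In SBCL, non-NIL when the build includes the :sb-unicode feature.
(member :sb-unicode *features*)
```

The design choice is to convert at the boundary where the stricter type is required, so callers can keep passing ordinary quoted strings.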
From: Ralph R. C. <ra...@be...> - 2005-02-22 02:15:16
|
I've been looking at Prism, at http://www.radonc.washington.edu/medinfo/prism/. It has just about every variable typed, and in some places they explicitly check to see if parameters are simple-base-string's, erroring out if the parameter isn't one. Sometimes quoted strings are passed to these functions. |
From: Christophe R. <cs...@ca...> - 2005-02-22 07:36:44
|
Ralph Richard Cook <ra...@be...> writes:

> I've been looking at Prism, at
> http://www.radonc.washington.edu/medinfo/prism/.
> It has just about every variable typed, and in some places they
> explicitly check to see if parameters are simple-base-string's,
> erroring out if the parameter isn't one. Sometimes quoted strings are
> passed to these functions.

In that case, you need to find out whether there is a reason for that behaviour, or whether the authors made a mistake. (I rather strongly suspect that they did, but I haven't looked at their work at all, so maybe they didn't).

Cheers, Christophe |
From: Ralph R. C. <ra...@be...> - 2005-02-24 04:29:55
|
On Feb 21, 2005, at 5:08 PM, Christophe Rhodes wrote:

> Since this is an area of common confusion, be aware that the type
> SIMPLE-BASE-STRING does not mean "a SIMPLE-STRING which happens to
> contain only BASE-CHARs"; it means "a SIMPLE-STRING which is capable
> only of holding BASE-CHARs, not CHARACTERs in general". Why this
> distinction? In an SBCL compiled with :sb-unicode, CHAR-CODE-LIMIT is
> #x110000 -- or, to put it another way, 21 bits are needed to uniquely
> identify a character. Therefore, storage for an object of type
> (SIMPLE-ARRAY CHARACTER (*)) -- which means "a SIMPLE-STRING which can
> have any CHARACTERs as element" -- is allocated as
> [ h e l l o \0]
> while there is an obvious more efficient storage method for objects of
> type SIMPLE-BASE-STRING (which is the same type as
> (SIMPLE-ARRAY BASE-CHAR (*)))
> [hello\0]
> So depending on your compilation options, the string which is read in
> from the #\" reader macro may have a packed or unpacked representation
> in memory.
>

I did this in SBCL on Linux:

* (type-of "hello")

(SIMPLE-ARRAY CHARACTER (5))
* (type-of (aref "hello" 0))

STANDARD-CHAR

I guess this means that the #\" reader macro puts a STANDARD-CHAR in each place in the [ h e l l o \0], and not a unicode character? Or are they unicode characters in the array, and get converted by the aref?

Thanks, Ralph Richard Cook |
From: Harald Hanche-O. <ha...@ma...> - 2005-02-24 15:32:01
|
+ Ralph Richard Cook <ra...@be...>:

| I did this in SBCL on Linux:
|
| * (type-of "hello")
|
| (SIMPLE-ARRAY CHARACTER (5))
| * (type-of (aref "hello" 0))
|
| STANDARD-CHAR
|
| I guess this means that the #\" reader macro puts a STANDARD-CHAR in
| each place in the [ h e l l o \0], and not a unicode character?

What makes you think a standard-char is not a unicode character as well? After all, ASCII is nothing but the first 128 character positions in unicode, and I see no reason that Lisp should not support that viewpoint.

Here's an analogy: standard-char <-> fixnum, extended-char <-> bignum, character <-> integer. Does that make it clearer?

In the early history of multilingual support for emacs, they managed to screw up this sort of thing royally. Basically, charsets such as latin-1 and latin-9 were considered disjoint, even if they are in fact more equal than different. This used to cause all sorts of problems when someone would send me latin-9 coded mail and I would reply, quoting some of their mail: For I use latin-1, and suddenly emacs would complain that letters like ÆØÅ could not be encoded in latin-1. (I believe the brain damage is still there, but now there is clever code working around it so I don't see it manifesting anymore.)

In summary, be glad ASCII is a subset of Unicode and not totally disjoint from it.

- Harald |
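The point that one character belongs to several types at once can be checked at the REPL. This is a small sketch using only standard Common Lisp operators; no expected results are shown for SUBTYPEP and TYPE-OF, since what they print can vary between implementations.

```lisp
;; #\h is a member of all three types simultaneously; asking about one
;; of them does not convert the character into anything else.
(typep #\h 'standard-char)
(typep #\h 'base-char)
(typep #\h 'character)

;; The standard specifies that the types nest: every STANDARD-CHAR is a
;; BASE-CHAR, and every BASE-CHAR is a CHARACTER.
(subtypep 'standard-char 'base-char)
(subtypep 'base-char 'character)

;; TYPE-OF names some type the object belongs to, not the only one, so a
;; STANDARD-CHAR answer says nothing about how the string stores it.
(type-of #\h)
```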
From: Christophe R. <cs...@ca...> - 2005-02-24 16:04:23
|
Ralph Richard Cook <ra...@be...> writes:

>> > I did this in SBCL on Linux:

... by which I presume from context you mean "in an sbcl which has been compiled with :sb-unicode" -- because be aware that...

> * (type-of "hello")
>
> (SIMPLE-ARRAY CHARACTER (5))
> * (type-of (aref "hello" 0))
>
> STANDARD-CHAR

... I would expect these results from more or less any ANSI Common Lisp.

> I guess this means that the #\" reader macro puts a STANDARD-CHAR in
> each place in the [ h e l l o \0], and not a unicode character? Or are
> they unicode characters in the array, and get converted by the aref?

Aha. To the first question I answer "mu", and to the second "yes, no, and mu". Or, in other words, these are the wrong questions. Characters are characters; there is no way of turning a "standard-char h" into a "unicode character h" -- #\h is always #\h.

I'll try to illustrate by way of an analogy. I'm going to define a new reader macro for the character #\{, roughly as follows:

  (defun |{-READER| (stream char)
    (declare (ignore char))
    ;; READ-DELIMITED-LIST takes the delimiter first, then the stream.
    (coerce (read-delimited-list #\} stream t)
            '(simple-array (unsigned-byte 32) (*))))

so that reading {104 101 108 108 111 0} constructs an array which can hold word-sized integers, with six elements: 104, 101, 108, 108, 111 and 0. I hope that you wouldn't worry about whether the 104 that's the first element in that array is an (unsigned-byte 32) 104 or an (unsigned-byte 8) 104: it is just 104.

In much the same way, the #\h which is in a simple-base-string or a (simple-array character (*)) is not different -- in each case what is stored is #\h. The difference between a (simple-array character (*)) and a simple-base-string is not what /is/ stored at any point, but what is capable of being stored: the former can store any characters, while the latter can only store base-chars.

So we can have two objects which are strings, and contain the characters #\h, #\e, #\l, #\l and #\o, but on one of them (setf (aref string 0) (code-char 512)) is legal, and on the other it isn't -- in just the same way as we can have two arrays containing 104, 101, 108, 108 and 111, but one of them does not allow you to do (setf (aref array 0) 512) -- because there is no space to store 512.

To wrap up, maybe: all characters in SBCL are characters; some of them are also base-chars and standard-chars, but they all share a uniform representation. There is a difference in representation between subtypes of STRING, however: some of them can store any and all characters, while some have a limited character repertoire.

Does that help? (This is a fiendish area to get straight, and has confused many a person before now.)

Cheers, Christophe |
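The distinction between what a string contains and what it is able to contain can be sketched as follows. The names CAN-HOLD-P, *NARROW*, *WIDE* and *BIG* are hypothetical, introduced only for illustration; the outcome of the last two forms depends on the implementation and on whether it was built with a wide character repertoire.

```lisp
;; Hypothetical predicate: true if CHAR may be stored in STRING, judged
;; by the element type the string was actually allocated with.
(defun can-hold-p (string char)
  (typep char (array-element-type string)))

;; Two fresh strings with identical contents but different element types.
;; Fresh arrays are used so that no literal string is ever modified.
(defparameter *narrow*
  (make-array 5 :element-type 'base-char :initial-contents "hello"))
(defparameter *wide*
  (make-array 5 :element-type 'character :initial-contents "hello"))

;; The character with code 512, or NIL where CHAR-CODE-LIMIT is too small.
(defparameter *big*
  (and (< 512 char-code-limit) (code-char 512)))

;; The contents are indistinguishable...
(string= *narrow* *wide*)

;; ...but the capacity may differ. In an SBCL built with :sb-unicode the
;; first form is expected to be false and the second true; elsewhere the
;; answers can be different.
(can-hold-p *narrow* *big*)
(can-hold-p *wide* *big*)
```

Checking the element type before storing avoids relying on how a particular implementation reacts to an out-of-range store.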