On Fri, May 13, 2005 at 05:57:14PM -0500, Nathan Froyd wrote:
> On Sat, May 14, 2005 at 12:17:40AM +0200, R. Mattes wrote:
> > How is that? Doesn't a base-string consist entirely of base-chars (with
> > code-points <= 127)? How _can_ i construct an array of characters with
> > code-point <= 127 that has a different internal representation?
> (Assuming a SB-UNICODEd SBCL) You do this all the time, simply by
> typing strings at the REPL:
> CL-USER> (type-of "DAD")
> (SIMPLE-ARRAY CHARACTER (3))
> Such a string has a layout that looks roughly like:
> [tag] [length] [00 00 00 44] [00 00 00 41] [00 00 00 44] [00 00 00 00]
> where [...] is a 32-bit quantity, with values written in hexadecimal
> when necessary. If you instead said something like:
> CL-USER> (type-of (coerce "DAD" '(simple-array base-char (3))))
> (SIMPLE-BASE-STRING (3))
> The memory layout of such a string would look like:
> [tag] [length] [44 41 44 00]
> which is more like what you are expecting.
I was (falsely) expecting sbcl to use utf-8 encoding (where the first
string would look like the second). Where can i actually find notes on the
implementation of unicode in sbcl? I found a page on the sbcl-internals wiki and
listened to Christophe's talk in Amsterdam. Is there more?
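
To make the distinction concrete, here is a minimal sketch. It assumes an
SB-UNICODE build in which SB-EXT:STRING-TO-OCTETS takes an :EXTERNAL-FORMAT
keyword and knows :UTF-8; the variable names are made up for illustration.
The point is that the element type describes how the characters are stored
in memory, while an encoding only comes into play when octets are produced.

```lisp
;; Two strings with the same characters but different storage.
(defparameter *wide* (coerce "DAD" '(simple-array character (3))))
(defparameter *narrow* (coerce "DAD" 'simple-base-string))

;; The element types differ: CHARACTER versus BASE-CHAR.
(list (array-element-type *wide*)
      (array-element-type *narrow*))

;; Encoding looks at the characters, not the storage, so both strings
;; yield the same octets under the same external format.
(equalp (sb-ext:string-to-octets *wide* :external-format :utf-8)
        (sb-ext:string-to-octets *narrow* :external-format :utf-8))
```

So the in-memory layout never has to match what an external format would
write, which is why punning the string's storage as bytes is not portable.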
> But that doesn't mean that
> you get to pun such a string into being a sequence of bytes like C.
> > I _hope_ i don't sound stubborn but i somehow fail to see the
> > half-bakedness of this interface. Somehow i expect
> > (sb-md5:md5sum-sequence "Blah") to act equivalent to
> > (sb-md5:md5sum-sequence (string-to-octets "Blah" :encoding :default))
> > but i might be wrong.
> Don't expect; read the documentation! :)
> CL-USER> (documentation 'sb-md5:md5sum-sequence 'function)
> "Calculate the MD5 message-digest of data bounded by START and END
> in SEQUENCE , which must be a vector with element-type (UNSIGNED-BYTE
Yes, three bonus points to the SBCL developers for updating the documentation
together with the code :-) When i wrote my code it read:
"Calculate the MD5 message-digest of data in sequence. On CMU CL
this works for all sequences whose element-type is supported by the
underlying MD5 routines, on other implementations it only works for 1d
simple-arrays with such element types."
And, from the README:
the "high-level" entry points to the md5 algorithm are
MD5SUM-FILE, MD5SUM-STREAM and MD5SUM-SEQUENCE (despite its name,
the last only acts on vectors).
This is not in the spirit of "... The basic criteria are that the introduction
of Unicode should be invisible to existing code ..." (from the wiki page).
Since we all seem to agree that an MD5 digest of a string depends on the
sequence of characters (code points) as well as their encoding, why _do_ we
have to treat strings (which _are_ sequences according to the CL spec)
differently?
Christophe seems to fear that the encoding will confuse users; i think
a user would expect the equivalence of:
(sb-md5:md5sum-sequence "Blah") == (sb-md5:md5sum-sequence (string-to-octets "Blah" :encoding :default))
[being probably a bit too pragmatic i'd add an :encoding keyword to md5sum-sequence to
help in those cases where a digest needs to be compared against one of a file with
a known encoding - but that is syntactic sweetener].
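
A rough sketch of what i mean, written as a wrapper rather than a change to
sb-md5 itself. MD5SUM-STRING* and *SAMPLE* are hypothetical names of mine,
and i am assuming that SB-EXT:STRING-TO-OCTETS accepts :EXTERNAL-FORMAT with
:UTF-8 and :LATIN-1 and that MD5SUM-SEQUENCE takes the resulting octet
vector:

```lisp
(require :sb-md5)

;; MD5SUM-STRING* is a hypothetical helper, not part of SB-MD5.
(defun md5sum-string* (string &key (encoding :utf-8))
  "Encode STRING to octets using ENCODING, then digest those octets."
  (sb-md5:md5sum-sequence
   (sb-ext:string-to-octets string :external-format encoding)))

;; "Bl?h" with a-ring (code point 229) in place of the ?, built with
;; CODE-CHAR so the example does not depend on the source file encoding.
(defparameter *sample*
  (concatenate 'string "Bl" (string (code-char 229)) "h"))

;; Same characters, different octets, hence different digests.
(equalp (md5sum-string* *sample* :encoding :utf-8)
        (md5sum-string* *sample* :encoding :latin-1))
```

For pure ASCII input like "Blah" the two encodings produce the same octets
and therefore the same digest, which is exactly why the difference is easy
to overlook.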
Thanks, Ralf Mattes
> Nathan | From Man's effeminate slackness it begins. --Paradise Lost
> The last good thing written in C was Franz Schubert's Symphony Number 9.
> --Erwin Dieterich