From: R. M. <rm...@mh...> - 2005-05-13 13:27:22
|
Hello List,

can anyone test the following code with SBCL 0.9?

    (require 'sb-md5)
    (asdf:oos 'asdf:load-op :md5)

    (let ((a-string "/2005/19/L-Krimis")
          (b-string "/2005/19/L-Salter"))
      (format t "MD5 for A and B are ~:[distinct~;equal~] with MD5~%"
              (equal (format nil "~(~{~2,'0X~}~)" (coerce (md5:md5sum-sequence a-string) 'list))
                     (format nil "~(~{~2,'0X~}~)" (coerce (md5:md5sum-sequence b-string) 'list))))
      (format t "MD5 for A and B are ~:[distinct~;equal~] with SB-MD5~%"
              (equal (format nil "~(~{~2,'0X~}~)" (coerce (sb-md5:md5sum-sequence a-string) 'list))
                     (format nil "~(~{~2,'0X~}~)" (coerce (sb-md5:md5sum-sequence b-string) 'list)))))

This prints:

    MD5 for A and B are distinct with MD5
    MD5 for A and B are equal with SB-MD5

on both my development system (PPC/Linux) and our production server (i386/Linux). This happened after upgrading SBCL to .9.n (Unicode enabled).

Thanks for your help
Ralf Mattes
From: Christophe R. <cs...@ca...> - 2005-05-13 13:42:57
|
"R. Mattes" <rm...@mh...> writes: > (equal (format nil "~(~{~2,'0X~}~)" (coerce (sb-md5:md5sum-sequence a-string) 'list)) > (format nil "~(~{~2,'0X~}~)" (coerce (sb-md5:md5sum-sequence b-string) 'list))))) According to http://www.sbcl.org/manual/sb-md5.html#sb-md5, md5sum-sequence works on vectors of (unsigned-byte 8), not on strings. If you're hashing strings, use md5sum-string. (This was alluded to in the NEWS file for sbcl-0.8.19) I'll see if I can't convince it to give you a useful error message in future revisions. Cheers, Christophe |
From: R. M. <rm...@mh...> - 2005-05-13 14:19:50
|
On Fri, 13 May 2005 14:39:30 +0100, Christophe Rhodes wrote:

> "R. Mattes" <rm...@mh...> writes:
>
>> (equal (format nil "~(~{~2,'0X~}~)" (coerce (sb-md5:md5sum-sequence a-string) 'list))
>>        (format nil "~(~{~2,'0X~}~)" (coerce (sb-md5:md5sum-sequence b-string) 'list)))))
>
> According to http://www.sbcl.org/manual/sb-md5.html#sb-md5,
> md5sum-sequence works on vectors of (unsigned-byte 8), not on strings.
> If you're hashing strings, use md5sum-string. (This was alluded to in
> the NEWS file for sbcl-0.8.19)
>
> I'll see if I can't convince it to give you a useful error message in
> future revisions.

Hmm, so now we have a small but important semantic difference between sb-md5:md5sum-sequence and md5:md5sum-sequence. And, since (typep "string" 'sequence) is true, it makes conditional code rather more elaborate than necessary. What's the design rationale behind this? After all, strings _are_ sequences.

The "mysterious" thing is that md5sum-sequence does provide a result (and doesn't throw a condition), just a strange one. Given the use of md5 in security-relevant code I'm a bit worried.

Thanks
RalfD
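The advice quoted above (hash strings through md5sum-string rather than md5sum-sequence) can be sketched roughly as follows. This is only an illustration, assuming a current SBCL with the sb-md5 contrib available; the helper names hex-digest and string-md5-hex are hypothetical, and defaulting to :utf-8 is an assumption made here, not something the thread prescribes.

```lisp
;; Sketch only: assumes a current SBCL with the sb-md5 contrib.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :sb-md5))

;; HEX-DIGEST and STRING-MD5-HEX are hypothetical helper names.
(defun hex-digest (octets)
  "Render a digest (a vector of octets) as a lowercase hex string."
  (format nil "~(~{~2,'0X~}~)" (coerce octets 'list)))

(defun string-md5-hex (string &key (external-format :utf-8))
  "Hash STRING by encoding it explicitly instead of relying on its storage."
  (hex-digest (sb-md5:md5sum-string string :external-format external-format)))

;; The two paths from the original report now hash to different digests.
(format t "MD5 for A and B are ~:[distinct~;equal~] with SB-MD5~%"
        (string= (string-md5-hex "/2005/19/L-Krimis")
                 (string-md5-hex "/2005/19/L-Salter")))
```

Naming the external format at the call site keeps the digest independent of how the implementation happens to store the string.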
From: Christophe R. <cs...@ca...> - 2005-05-13 14:55:49
|
"R. Mattes" <rm...@mh...> writes: > Hmm, so now we have a small but important semantic difference between > sb-md5:md5sum-sequence and md5:md5sum-sequence.=20 Who is the maintainer of md5:md5sum-sequence, and do they know the implications of multiple string representations? > And, since (typep "string" 'sequence) is true it makes conditional > code rather more elaborate than necessary. What's the design > rationale behind this? After all strings _are_ sequences.=20 Strings are sequences, but the md5 algorithm is defined over octets, not characters. As such, the name md5sum-sequence is a little misleading, but is a relic of the days when sbcl only knew about 256 characters, and did not have any idea that there might be more than one possible encoding. The sb-md5:md5sum-string entry point acknowledges the existence of multiple encodings for character data, and the sb-md5:md5sum-sequence entry point will shortly no longer accept general strings to prevent the user from shooting themselves in the foot. Just to put this on a concrete footing, let me throw the question back to you: what, if md5sum-sequence works on strings, should=20 (md5sum-sequence "=A4" return) [ where the character in the string is a euro sign ] > The "mysterious" thing is that md5sum-sequence does provide a result > (and doesn't throw a condition) - just a strange one. Probably it has been compiled with a low safety setting relative to speed, so that the usual checks are not performed. > Given the use of md5 in security relevant code i'm a bit worried. md5 in security-relevant code should probably be avoided, as it has "only" 128 bits of hash, which wouldn't be a problem were it not for the fact that it has been broken fairly comprehensively. Cheers, Christophe |
From: <rm...@fa...> - 2005-05-13 19:12:11
|
On Fri, May 13, 2005 at 03:52:21PM +0100, Christophe Rhodes wrote:

> "R. Mattes" <rm...@mh...> writes:
>
> > Hmm, so now we have a small but important semantic difference between
> > sb-md5:md5sum-sequence and md5:md5sum-sequence.
>
> Who is the maintainer of md5:md5sum-sequence, and do they know the
> implications of multiple string representations?

Kevin M. Rosenberg is the maintainer of the Debian package, and since the ASDF link on Cliki points to his site I _assume_ it's him. The asdf file mentions him as an author.

> > And, since (typep "string" 'sequence) is true it makes conditional
> > code rather more elaborate than necessary. What's the design
> > rationale behind this? After all strings _are_ sequences.
>
> Strings are sequences, but the md5 algorithm is defined over octets,
> not characters. As such, the name md5sum-sequence is a little
> misleading, but is a relic of the days when sbcl only knew about 256
> characters, and did not have any idea that there might be more than
> one possible encoding.

Yes, that name itself seems to come from Pierre Mai and his code originally written for CMUCL.

> The sb-md5:md5sum-string entry point
> acknowledges the existence of multiple encodings for character data,
> and the sb-md5:md5sum-sequence entry point will shortly no longer
> accept general strings to prevent the user from shooting themselves in
> the foot.

Thanks, I think this is what I would have expected. I didn't shoot intentionally :-)

> Just to put this on a concrete footing, let me throw the question back
> to you: what, if md5sum-sequence works on strings, should
>   (md5sum-sequence "€")
> return [ where the character in the string is a euro sign ]?

I'd expect one of two possible results:

- first: throw an error/condition (maybe only iff the codepoints of the string's characters don't fit into 8 bits, taking advantage of the code point overlay of ASCII, latin-1 and Unicode).

- second: create the hash of the internal representation of the string. After all, the md5 algorithm is _always_ sensitive to the binary representation. Will there be a possible case in SBCL where the binary representation of two strings equal under string= will differ? If not, then I'd vote for this solution. One drawback of this solution: the md5 sum of a string would not necessarily match that of a file containing the same string.

- third (just to make a mathematician nervous): have md5sum-sequence accept a keyword :encoding. This would actually be backward compatible and (with :default as the default encoding) would work as solution 2.

> > The "mysterious" thing is that md5sum-sequence does provide a result
> > (and doesn't throw a condition) - just a strange one.
>
> Probably it has been compiled with a low safety setting relative to
> speed, so that the usual checks are not performed.

Hmmm, beats me. This is on a "standard" Debian box. How would I check this?

> > Given the use of md5 in security relevant code i'm a bit worried.
>
> md5 in security-relevant code should probably be avoided, as it has
> "only" 128 bits of hash, which wouldn't be a problem were it not for
> the fact that it has been broken fairly comprehensively.

Still, it's used in several security-relevant spots. Luckily, in my case it "only" totally messed up a knowledge base (where object IDs are generated with a hash of important properties, all 7-bit URIs).

Thanks for your input
ralfd
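The euro-sign question can be made concrete with a small sketch. It assumes a Unicode-enabled, reasonably recent SBCL in which sb-ext:string-to-octets accepts the :utf-8 and :iso-8859-15 external formats; the function names euro-octets and euro-md5 are hypothetical and exist only for this illustration.

```lisp
;; Sketch only: assumes a Unicode-enabled SBCL with the sb-md5 contrib.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :sb-md5))

;; EURO-OCTETS and EURO-MD5 are hypothetical names for this illustration.
(defun euro-octets (external-format)
  "Encode a one-character string holding U+20AC EURO SIGN."
  (sb-ext:string-to-octets (string (code-char #x20AC))
                           :external-format external-format))

(defun euro-md5 (external-format)
  "MD5 digest of the euro sign as encoded in EXTERNAL-FORMAT."
  (sb-md5:md5sum-sequence (euro-octets external-format)))

;; Same character, different octets, hence different digests.
(format t "utf-8 octets:       ~S~%" (euro-octets :utf-8))
(format t "iso-8859-15 octets: ~S~%" (euro-octets :iso-8859-15))
(format t "digests equal?      ~S~%"
        (equalp (euro-md5 :utf-8) (euro-md5 :iso-8859-15)))
```

Because the digest is computed over octets, a string has no single MD5 until an encoding has been chosen, which is also why the digest of a string may not match that of a file containing "the same" text.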
From: Christophe R. <cs...@ca...> - 2005-05-13 19:18:38
|
rm...@fa... writes:

>> Just to put this on a concrete footing, let me throw the question back
>> to you: what, if md5sum-sequence works on strings, should
>>   (md5sum-sequence "€")
>> return [ where the character in the string is a euro sign ]?
>
> I'd expect one of two possible results:
>
> - first: throw an error/condition (maybe only iff the codepoints of
>   the string's characters don't fit into 8 bits, taking advantage
>   of the code point overlay of ASCII, latin-1 and Unicode).

OK, this is basically what will happen, apart from the DWIMish part, which is not really in the spirit of SBCL. (For more information on this, you might take a look at the PRINCIPLES file distributed with the sbcl source.)

> - second: create the hash of the internal representation of the string.
>   After all the md5 algorithm is _always_ sensitive to the binary
>   representation. Will there be a possible case in SBCL where the
>   binary representation of two strings equal under string= will
>   differ?

Yes, there are. BASE-STRINGs and (ARRAY CHARACTER (*)) can have the same contents with different in-memory representations.

>   If not then i'd vote for this solution. One drawback of this
>   solution: the md5 sum of a string would not necessarily match
>   that of a file containing the same string.

Right, and with the added potential confusion over different in-memory representations, this is a non-starter.

> - third (just to make a mathematician nervous): have
>   md5sum-sequence accept a keyword :encoding. This would actually
>   be backward compatible and (with :default as the default
>   encoding) would work as solution 2.

Backwards compatibility, at this point, is not even remotely interesting to me. I'd much rather get an interface that we can collectively be happy to support in the long term, than deal with the headaches involved in supporting half-baked ones.

I'm sorry if that causes our current users problems, but our current users have made that choice by using a 0.x piece of software where the development culture is not focussed on interface stability. (Again, see the PRINCIPLES file, as well as, well, five years of commit logs :-)

Cheers,

Christophe
From: R. M. <rm...@mh...> - 2005-05-13 22:41:05
|
On Fri, 13 May 2005 20:14:58 +0100, Christophe Rhodes wrote:

> rm...@fa... writes:
>
>> [...snip ...]
>> - first: throw an error/condition (maybe only iff the codepoints of
>>   the string's characters don't fit into 8 bits, taking advantage
>>   of the code point overlay of ASCII, latin-1 and Unicode).
>
> OK, this is basically what will happen, apart from the DWIMish part,
> which is not really in the spirit of SBCL. (For more information on
> this, you might take a look at the PRINCIPLES file distributed with
> the sbcl source.)
>
>> - second: create the hash of the internal representation of the string.
>>   After all the md5 algorithm is _always_ sensitive to the binary
>>   representation. Will there be a possible case in SBCL where the
>>   binary representation of two strings equal under string= will
>>   differ?
>
> Yes, there are. BASE-STRINGs and (ARRAY CHARACTER (*)) can have the
> same contents with different in-memory representations.

How is that? Doesn't a base-string consist entirely of base-chars (with code-points <= 127)? How _can_ I construct an array of characters with code-point <= 127 that has a different internal representation?

>>   If not then i'd vote for this solution. One drawback of this
>>   solution: the md5 sum of a string would not necessarily match
>>   that of a file containing the same string.
>
> Right, and with the added potential confusion over different in-memory
> representations, this is a non-starter.
>
>> - third (just to make a mathematician nervous): have
>>   md5sum-sequence accept a keyword :encoding. This would actually
>>   be backward compatible and (with :default as the default
>>   encoding) would work as solution 2.
>
> Backwards compatibility, at this point, is not even remotely
> interesting to me.

Backwards compatibility isn't what I aim for, but if it can be achieved easily I don't mind :)

> I'd much rather get an interface that we can
> collectively be happy to support in the long term, than deal with the
> headaches involved in supporting half-baked ones.

I _hope_ I don't sound stubborn, but I somehow fail to see the half-bakedness of this interface [1]. Somehow I expect

  (sb-md5:md5sum-sequence "Blah") to act equivalent to
  (sb:md5sum-sequence (string-to-octets "Blah" :encoding :default))

but I might be wrong.

> I'm sorry if that
> causes our current users problems, but our current users have made
> that choice by using a 0.x piece of software where the development
> culture is not focussed on interface stability. (Again, see the
> PRINCIPLES file, as well as, well, five years of commit logs :-)

I never complained about non-stable interfaces. I'm actually _very_ thankful for the Unicode support. I was taken by surprise because the semantics of a public function changed (and, from what I understand, the missing warning/error was just an accident).

Thanks
RalfD

[1] I purposely won't quote "the other" languages as a reference or guide.
From: Nathan F. <fr...@cs...> - 2005-05-13 22:57:34
|
On Sat, May 14, 2005 at 12:17:40AM +0200, R. Mattes wrote:

> How is that? Doesn't a base-string consist entirely of base-chars (with
> code-points <= 127)? How _can_ i construct an array of characters with
> code-point <= 127 that has a different internal representation?

(Assuming a SB-UNICODEd SBCL) You do this all the time, simply by typing strings at the REPL:

  CL-USER> (type-of "DAD")
  (SIMPLE-ARRAY CHARACTER (3))

Such a string has a layout that looks roughly like:

  [tag] [length] [00 00 00 44] [00 00 00 41] [00 00 00 44] [00 00 00 00]

where [...] is a 32-bit quantity, with values written in hexadecimal when necessary. If you instead said something like:

  CL-USER> (type-of (coerce "DAD" '(simple-array base-char (3))))
  (SIMPLE-BASE-STRING 3)

The memory layout of such a string would look like:

  [tag] [length] [44 41 44 00]

which is more like what you are expecting. But that doesn't mean that you get to pun such a string into being a sequence of bytes like C.

> I _hope_ i don't sound stubborn but i somehow miss to see the half-
> bakedness of this interface [1]. Somehow i expect
>
> (sb-md5:md5sum-sequence "Blah") to act equivalent to
> (sb:md5sum-sequence (string-to-octets "Blah" :encoding :default))
>
> but i might be wrong.

Don't expect; read the documentation! :)

  CL-USER> (documentation 'sb-md5:md5sum-sequence 'function)
  "Calculate the MD5 message-digest of data bounded by START and END
  in SEQUENCE , which must be a vector with element-type (UNSIGNED-BYTE
  8)."

--
Nathan | From Man's effeminate slackness it begins. --Paradise Lost

The last good thing written in C was Franz Schubert's Symphony Number 9.
 --Erwin Dieterich
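The point about two representations of the same text can be checked with a short sketch. It assumes a Unicode-enabled SBCL; the function name same-text-different-storage is hypothetical. It does not inspect memory directly; it only shows that the two strings are string= and encode to the same octets while one of them is a base-string.

```lisp
;; SAME-TEXT-DIFFERENT-STORAGE is a hypothetical name for this illustration.
(defun same-text-different-storage ()
  "Return three values: are the strings STRING=, is the narrow one a
BASE-STRING, and do both encode to the same UTF-8 octets?"
  (let* ((wide   (coerce "DAD" '(simple-array character (*))))
         (narrow (coerce "DAD" 'simple-base-string)))
    (values (string= wide narrow)
            (typep narrow 'base-string)
            (equalp (sb-ext:string-to-octets wide   :external-format :utf-8)
                    (sb-ext:string-to-octets narrow :external-format :utf-8)))))

(multiple-value-bind (same-text narrow-is-base same-octets)
    (same-text-different-storage)
  (format t "string=: ~S  base-string: ~S  same octets: ~S~%"
          same-text narrow-is-base same-octets))
```

Going through an explicit encoding gives the same octets for both strings, which hashing the raw storage would not.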
From: Kevin R. <ke...@ro...> - 2005-05-13 19:43:20
|
rm...@fa... wrote:

> Kevin M. Rosenberg is the maintainer of the Debian package and since the
> ASDF link on Cliki points to his site i _assume_ it's him. The asdf file
> mentions him as an author.

I'll fix the assumption: I'm the author of only the .asd file, not of the md5sum program. I host the package on my server only so that it can be downloadable with asdf-install for my CLSQL package. I am no longer the maintainer of the Debian package.

--
Kevin Rosenberg
ke...@ro...
From: <rm...@fa...> - 2005-05-13 20:04:30
|
On Fri, May 13, 2005 at 01:42:38PM -0600, Kevin Rosenberg wrote:

> rm...@fa... wrote:
> > Kevin M. Rosenberg is the maintainer of the Debian package and since the
> > ASDF link on Cliki points to his site i _assume_ it's him. The asdf file
> > mentions him as an author.
>
> I'll fix the assumption: I'm the author of only the .asd file, not
> of the md5sum program. I host the package on my server only so that it
> can be downloadable with asdf-install for my CLSQL package. I am no
> longer the maintainer of the Debian package.

Ok, so Peter took over the responsibilities. Oh gosh, yet another bug report to him.

Thanks
Ralf Mattes
From: <rm...@mh...> - 2005-05-14 00:15:45
|
On Fri, May 13, 2005 at 05:57:14PM -0500, Nathan Froyd wrote:

> On Sat, May 14, 2005 at 12:17:40AM +0200, R. Mattes wrote:
> > How is that? Doesn't a base-string consist entirely of base-chars (with
> > code-points <= 127)? How _can_ i construct an array of characters with
> > code-point <= 127 that has a different internal representation?
>
> (Assuming a SB-UNICODEd SBCL) You do this all the time, simply by
> typing strings at the REPL:
>
>   CL-USER> (type-of "DAD")
>   (SIMPLE-ARRAY CHARACTER (3))
>
> Such a string has a layout that looks roughly like:
>
>   [tag] [length] [00 00 00 44] [00 00 00 41] [00 00 00 44] [00 00 00 00]
>
> where [...] is a 32-bit quantity, with values written in hexadecimal
> when necessary. If you instead said something like:
>
>   CL-USER> (type-of (coerce "DAD" '(simple-array base-char (3))))
>   (SIMPLE-BASE-STRING 3)
>
> The memory layout of such a string would look like:
>
>   [tag] [length] [44 41 44 00]
>
> which is more like what you are expecting.

I was (falsely) expecting sbcl to use utf-8 encoding (where the first string would look like the second). Where can I actually find notes on the implementation of unicode in sbcl? I found a page on the sbcl-internals wiki and listened to Christophe's talk in Amsterdam. Is there more?

> But that doesn't mean that
> you get to pun such a string into being a sequence of bytes like C.
>
> > I _hope_ i don't sound stubborn but i somehow miss to see the half-
> > bakedness of this interface [1]. Somehow i expect
> >
> > (sb-md5:md5sum-sequence "Blah") to act equivalent to
> > (sb:md5sum-sequence (string-to-octets "Blah" :encoding :default))
> >
> > but i might be wrong.
>
> Don't expect; read the documentation! :)
>
>   CL-USER> (documentation 'sb-md5:md5sum-sequence 'function)
>   "Calculate the MD5 message-digest of data bounded by START and END
>   in SEQUENCE , which must be a vector with element-type (UNSIGNED-BYTE
>   8)."

Yes, three bonus points to the SBCL developers for updating the documentation together with the code :-) When I wrote my code it read:

  "Calculate the MD5 message-digest of data in sequence. On CMU CL this works for all sequences whose element-type is supported by the underlying MD5 routines, on other implementations it only works for 1d simple-arrays with such element types."

And, from the README:

  the "high-level" entry points to the md5 algorithm are MD5SUM-FILE, MD5SUM-STREAM and MD5SUM-SEQUENCE (despite its name, the last only acts on vectors).

This is not "... The basic criteria are that the introduction of Unicode should be invisible to existing code ..." (from the wiki page).

Since we all seem to agree that an MD5 digest of a string depends on the sequence of characters (code points) as well as their encoding, why _do_ we have to treat strings (which _are_ sequences according to the CL specs) differently? Christophe seems to fear that the encoding will confuse users; I think a user will/would expect the equivalence of:

  (sb-md5:md5sum-sequence "Blah") == (sb:md5sum-sequence (string-to-octets "Blah" :encoding :default))

[being probably a bit too pragmatic, I'd add an :encoding keyword to md5sum-sequence to help in those cases where a digest needs to be compared against one of a file with a known encoding, but that is syntactic sweetener].

Thanks,
Ralf Mattes
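The equivalence asked for above can be written out with the entry points SBCL offers today. This is a sketch under assumptions: it relies on sb-ext:string-to-octets and the :external-format keyword of sb-md5:md5sum-string as found in current SBCL (the thread's own spelling, string-to-octets with :encoding, is not used), and the predicate name digests-agree-p is hypothetical.

```lisp
;; Sketch only: assumes a current SBCL with the sb-md5 contrib.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :sb-md5))

;; DIGESTS-AGREE-P is a hypothetical name for this illustration.
(defun digests-agree-p (string &optional (external-format :utf-8))
  "True when hashing STRING via MD5SUM-STRING matches hashing its
explicitly encoded octets via MD5SUM-SEQUENCE."
  (equalp (sb-md5:md5sum-string string :external-format external-format)
          (sb-md5:md5sum-sequence
           (sb-ext:string-to-octets string
                                    :external-format external-format))))

(format t "digests agree for \"Blah\": ~S~%" (digests-agree-p "Blah"))
```

The difference from the proposal in the message is only where the encoding is named: at the string entry point rather than as a keyword on md5sum-sequence.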
From: Christophe R. <cs...@ca...> - 2005-05-14 07:10:31
|
rm...@mh... (Le grand pinguin) writes:

> On Fri, May 13, 2005 at 05:57:14PM -0500, Nathan Froyd wrote:
>>   [tag] [length] [00 00 00 44] [00 00 00 41] [00 00 00 44] [00 00 00 00]
>>
>> where [...] is a 32-bit quantity, with values written in hexadecimal
>> when necessary. If you instead said something like:
>>
>>   CL-USER> (type-of (coerce "DAD" '(simple-array base-char (3))))
>>   (SIMPLE-BASE-STRING 3)
>>
>> The memory layout of such a string would look like:
>>
>>   [tag] [length] [44 41 44 00]
>>
>> which is more like what you are expecting.
>
> I was (falsely) expecting sbcl to use utf-8 encoding (where the first
> string would look like the second). Where actually can i find notes on the
> implementation of unicode in sbcl? I found a page on the sbcl-internals wiki and
> listened to Christophe's talk in Amsterdam. Is there more?

Well, <http://www.doc.gold.ac.uk/~mas01cr/talks/2005-04-24%20Amsterdam/presentation.html> are my slides from that talk, sadly minus the deadpan asides. The section from <http://www.doc.gold.ac.uk/~mas01cr/talks/2005-04-24%20Amsterdam/img15.html> is of particular interest, with img17.html discussing exactly this point.

(Actually, I would never expect anyone to use utf-8 as an internal representation for mutable strings, because you lose the O(1) access time; that's why it doesn't appear as one of the implementation options.)

> (sb-md5:md5sum-sequence "Blah") == (sb:md5sum-sequence (string-to-octets "Blah" :encoding :default))

Well, in my world, probably md5sum-sequence will end up going away altogether, being replaced by md5sum-octet-vector or somesuch. As you've observed, the name md5sum-sequence is confusing.

I would ask you what you actually _gain_ from the genericity you're asking for in the above: how often do you call a function on X, where X is either a string or a vector of (unsigned-byte 8) and you don't know which? This would seem to be only useful if you're confused in the first place.

Cheers,

Christophe