From: Nikodemus Siivola <nikodemus@ra...>  20070814 18:27:49

Two modular arithmetic issues have just come to my attention. They both like bugs to me, but maybe my understanding is wrong. 1. Initial binding confuses modular arithmetic. (defun test1 (a b) (let ((x a) (y b)) (declare (type (unsignedbyte 32) x y)) (macrolet ((store (expr) `(setf x (logand #xffffffff ,expr)))) (store (+ x 7)) (store (ash x 3)) (store ( x y)) (store (ash x 3)) (store (+ x 100)) (store (+ x 7)) (store (ash x 3)) (store ( x y)) (store (ash x 3)) (store (+ x 100))) x)) Doesn't use modular arithmetic. Declaring A and B as (UNSIGNEDBYTE 32), or removing the LET fixes things. Can anyone suggest where I should look for the cause of this? 2. 32 bit modular arithmetic not working on x8664. Is this to be expected, is this a regression, or am I doing something wrong? How hard would it be to make it work? Cheers,  Nikodemus 
From: Nathan Froyd <froydnj@gm...>  20070814 18:37:10

On 8/14/07, Nikodemus Siivola <nikodemus@...> wrote: > 2. 32 bit modular arithmetic not working on x8664. Is this to be expected, is > this a regression, or am I doing something wrong? How hard would it be to > make it work? This is expected and somewhat unfortunate. I posted a patch to make this work several months ago: http://thread.gmane.org/gmane.lisp.steelbank.devel/9155/ Bill and Christophe expressed reservations about the patch, so I did not commit it. Nathan 
From: Nikodemus Siivola <nikodemus@ra...>  20070816 08:57:06

Uh. I can't use email anymore. This was supposed to go to the list as well. Cheers,  Nikodemus  Forwarded message  From: Nikodemus Siivola <nikodemus@...> Date: Aug 15, 2007 12:09 AM Subject: Re: [Sbcldevel] Two modular arithmetic issues To: Nathan Froyd <froydnj@...> On 8/14/07, Nathan Froyd <froydnj@...> wrote: > On 8/14/07, Nikodemus Siivola <nikodemus@...> wrote: > > 2. 32 bit modular arithmetic not working on x8664. Is this to be expected, is > > this a regression, or am I doing something wrong? How hard would it be to > > make it work? > > This is expected and somewhat unfortunate. I posted a patch to make > this work several months ago: > > http://thread.gmane.org/gmane.lisp.steelbank.devel/9155/ > > Bill and Christophe expressed reservations about the patch, so I did > not commit it. Assuming the patch attached to that thread is the full one, I'm not sure I understand the reservations: it doesn't seem to add any complexity of note: new VOPs and movefuns, sure, but no new infrastructure. Maybe I'm missing something, but  (findif (lambda (itemwidth) (>= itemwidth width)) + (findif (lambda (itemwidth) (= itemwidth width)) seems like the only bit that actually changes the way things work as opposed to using the existing infra to support more special cases. While I cannot claim to understand the patch in its entirety yet, it doesn't seem all that bad to me, and I can vouch that there indeed are usecases for this, even if those uses cases aren't traditionally lispy. (No surprise there: lisp traditionally sucks at this kind of stuff, so people try to avoid this. Currently the 32 bit version of the function I posted is bad enough on x8664 that coding it in C and doing a foreign call is cheaper  not a practice I'd like to encourage.) Cheers,  Nikodemus 
From: Christophe Rhodes <csr21@ca...>  20070816 09:11:49

"Nikodemus Siivola" <nikodemus@...> writes: >> Bill and Christophe expressed reservations about the patch, so I did >> not commit it. > > Assuming the patch attached to that thread is the full one, I'm not > sure I understand the reservations: it doesn't seem to add any complexity > of note: new VOPs and movefuns, sure, but no new infrastructure. > > Maybe I'm missing something, but > >  (findif (lambda (itemwidth) (>= itemwidth width)) > + (findif (lambda (itemwidth) (= itemwidth width)) > > seems like the only bit that actually changes the way things work as > opposed to using the existing infra to support more special cases. This I think is the bit I was unhappy with. At the moment, on x8664, we have (modulo offbyone errors from me :) 64 (untagged) 61 (tagged) implementations of modular arithmetic. And with that regime, the >= test makes sense, as you only get an untagged representation when you were going to get one (or bignums) anyway. The proposed regime (as I understand it; I haven't reread the patch) would have 64 (untagged) 61 (tagged) 32 (untagged) implementations. This means that the the >= test is ruinously bad for byte sizes 131, as they could (probably) be implemented much more cheaply by staying with a tagged representation. (This isn't necessarily obvious, I think; it depends on how many times something like a lognot necessitates an untagged temporary clearing the tag bits.) So I think my preferred solution would be to be a little bit cleverer in declaring good widths: some widths are "good" only if they're exactly the ones requested, whereas others are sensible defaults for anything smaller. Implementing this makes the simple FINDIF above turn into something a little bit more complicated, but I think then that everyone can be happy. (I think I agree that there are interesting things that it would be nice to do with asefficientarithmeticaspossible, and that supporting it is worth some effort  just like supporting dynamicextent is worth some effort. Of course, as we've recently seen on x86/Solaris, supporting dynamicextent turns out to be not quite as effortless as maybe we thought...) Cheers, Christophe 
From: Nathan Froyd <froydnj@gm...>  20070816 12:17:45

On 8/16/07, Christophe Rhodes <csr21@...> wrote: > "Nikodemus Siivola" <nikodemus@...> writes: > > Assuming the patch attached to that thread is the full one, I'm not > > sure I understand the reservations: it doesn't seem to add any complexity > > of note: new VOPs and movefuns, sure, but no new infrastructure. > > > > Maybe I'm missing something, but > > > >  (findif (lambda (itemwidth) (>= itemwidth width)) > > + (findif (lambda (itemwidth) (= itemwidth width)) > > > > seems like the only bit that actually changes the way things work as > > opposed to using the existing infra to support more special cases. > > This I think is the bit I was unhappy with. At the moment, on x8664, > we have (modulo offbyone errors from me :) > 64 (untagged) > 61 (tagged) > implementations of modular arithmetic. And with that regime, the >= > test makes sense, as you only get an untagged representation when you > were going to get one (or bignums) anyway. This is not exactly true; the x86 backend goes through some contortions to get cases like: (logand (logxor x y) mostpositivefixnum) (where x and y are fixnums) to produce saneish looking code and we still wind up shifting x and y to untagged representationsat least last time I looked. (Furthermore, the x86 backend can also get "simple" shouldbefixnum cases "wrong" tooIIRC.) The x8664 backendagain, the last time I lookeddoesn't do the same sort of contortions and winds up with some awful code. The only time changing >= to = could cause Bad Things to happen is doing operations with widths between nfixnumbits and nwordbitsand I think SBCL's normal untagged smarts will kick in properly here. If you're proposing changing this = test to something like "= or between fixnum and wordbit sizes", I think that would be the best of both worlds: fixnum operations happen when the user expects (as Python did in the premodular arithmetic world), the user can force modular arithmetic when desired, and those rare notfixnumbutalmostword cases will work properly as well. Nathan 
From: Christophe Rhodes <csr21@ca...>  20070816 12:52:39

"Nathan Froyd" <froydnj@...> writes: > On 8/16/07, Christophe Rhodes <csr21@...> wrote: >> This I think is the bit I was unhappy with. At the moment, on x8664, >> we have (modulo offbyone errors from me :) >> 64 (untagged) >> 61 (tagged) >> implementations of modular arithmetic. And with that regime, the >= >> test makes sense, as you only get an untagged representation when you >> were going to get one (or bignums) anyway. > > This is not exactly true; the x86 backend goes through some > contortions to get cases like: > > (logand (logxor x y) mostpositivefixnum) > > (where x and y are fixnums) to produce saneish looking code and we > still wind up shifting x and y to untagged representationsat least > last time I looked. (Furthermore, the x86 backend can also get > "simple" shouldbefixnum cases "wrong" tooIIRC.) The x8664 > backendagain, the last time I lookeddoesn't do the same sort of > contortions and winds up with some awful code. OK, I don't understand this (and I don't see where in the code this happens). Maybe we should burrow down into this issue and sort it out? > The only time changing >= to = could cause Bad Things to happen is > doing operations with widths between nfixnumbits and > nwordbitsand I think SBCL's normal untagged smarts will kick in > properly here. I don't think I understand this either. Consider (ldb (byte 28 0) (+ (+ a b) (+ c d))) where a, b, c and d are declare as being of type (unsignedbyte 28). If there is no modular substitution of the + operators, then the outer + will return an (unsignedbyte 30) [ which on a 32bit platform is a bignum ], so needless testing and bignum consing will occur. Instead, I believe a more sensible evaluation should be (ldb (byte 28 0) (+/mod29 (+/mod29 a b) (+/mod29 c d))) no? What am I missing? (Or are we in agreement on this point?) Cheers, Christophe 
From: Alexey Dejneka <adejneka@to...>  20070816 17:36:12

Hello, Christophe Rhodes <csr21@...> writes: > I don't think I understand this either. Consider > > (ldb (byte 28 0) (+ (+ a b) (+ c d))) > > where a, b, c and d are declare as being of type (unsignedbyte 28). > If there is no modular substitution of the + operators, then the outer > + will return an (unsignedbyte 30) [ which on a 32bit platform is a > bignum ], so needless testing and bignum consing will occur. Instead, > I believe a more sensible evaluation should be > > (ldb (byte 28 0) (+/mod29 (+/mod29 a b) (+/mod29 c d))) Signedness of _intermediate_ results is completely unimportant. +/mod29 (for (unsignedbyte 29)) requires one extra instruction to clear the sign bit; a better implementation is +/smod30 (for (signedbyte 30)).  Regards, Alexey Dejneka "Alas, the spheres of truth are less transparent than those of illusion."  L.E.J. Brouwer 
From: Christophe Rhodes <csr21@ca...>  20070817 08:57:58

Alexey Dejneka <adejneka@...> writes: > Hello, Hi! Good to know you're still around. :) > Christophe Rhodes <csr21@...> writes: > >> I don't think I understand this either. Consider >> >> (ldb (byte 28 0) (+ (+ a b) (+ c d))) >> >> where a, b, c and d are declare as being of type (unsignedbyte 28). >> If there is no modular substitution of the + operators, then the outer >> + will return an (unsignedbyte 30) [ which on a 32bit platform is a >> bignum ], so needless testing and bignum consing will occur. Instead, >> I believe a more sensible evaluation should be >> >> (ldb (byte 28 0) (+/mod29 (+/mod29 a b) (+/mod29 c d))) > > Signedness of _intermediate_ results is completely unimportant. > +/mod29 (for (unsignedbyte 29)) requires one extra instruction to > clear the sign bit; a better implementation is +/smod30 (for > (signedbyte 30)). OK, I think I understand this  each +/mod29 must clear the top bit, whereas +/smod30 need not; this is because for +/smod30 the entire 30bit effective register is always valid (the two tag bits need to be cleared always, of course, though they remain clear automatically for +) while +/mod29 needs special treatment of the top bit, because the return value from that function must be nonnegative. I think, though, that the overall point remains: that refusing to do modular arithmetic on (ldb (byte 28 0) (+ (+ a b) (+ c d))) because 28 does not match one of the "good" modular widths exactly leads to suboptimal code. No? Cheers, Christophe 
From: Alexey Dejneka <adejneka@to...>  20070817 16:37:25

Hello, Christophe Rhodes <csr21@...> writes: >>> (ldb (byte 28 0) (+/mod29 (+/mod29 a b) (+/mod29 c d))) >> >> Signedness of _intermediate_ results is completely unimportant. >> +/mod29 (for (unsignedbyte 29)) requires one extra instruction to >> clear the sign bit; a better implementation is +/smod30 (for >> (signedbyte 30)). > > OK, I think I understand this  each +/mod29 must clear the top bit, > whereas +/smod30 need not; this is because for +/smod30 the entire > 30bit effective register is always valid (the two tag bits need to be > cleared always, of course, though they remain clear automatically for > +) while +/mod29 needs special treatment of the top bit, because the > return value from that function must be nonnegative. My point is that our separation of signed and unsigned modular arithmetic is wrong: the sign bit is important only for the outer operation. And integers of fixnum width can only be signed :( > I think, though, that the overall point remains: that refusing to do > modular arithmetic on > (ldb (byte 28 0) (+ (+ a b) (+ c d))) > because 28 does not match one of the "good" modular widths exactly > leads to suboptimal code. No? Yes. Another example: doing */mod32 is probably more efficient than */mod16 on MIPS (but not on X86?): (ldb (byte 16 0) (* (* (* a b) c) d))  Regards, Alexey Dejneka "Alas, the spheres of truth are less transparent than those of illusion."  L.E.J. Brouwer 
From: Christophe Rhodes <csr21@ca...>  20070911 12:43:35

From: Paul Khuong <pkhuong@gm...>  20070912 16:00:17
Attachments:
boundcheckingfastash.patch

On 9/11/07, Christophe Rhodes <csr21@...> wrote: > No test cases fail (on x86), and in the process I found a bug in > smod30 handling of ash, which I think I've fixed. Attached is a patch against 1.0.9.57 that fixes the same issue under x8664. Paul Khuong 
From: Christophe Rhodes <csr21@ca...>  20070912 16:14:44

"Paul Khuong" <pkhuong@...> writes: > On 9/11/07, Christophe Rhodes <csr21@...> wrote: >> No test cases fail (on x86), and in the process I found a bug in >> smod30 handling of ash, which I think I've fixed. > > Attached is a patch against 1.0.9.57 that fixes the same issue under > x8664. Thanks, merged as 1.0.9.59. Cheers, Christophe 
From: Christophe Rhodes <csr21@ca...>  20070928 09:34:59
Attachments:
Message as HTML
sbclmodular.diff

From: Christophe Rhodes <csr21@ca...>  20080307 21:17:11

Christophe Rhodes <csr21@...> writes: > I attach a patch which implements the strategy for choosing modular > versions that Nathan and I discussed a while back: prefer an exact > untagged match, followed by an approximate tagged match, followed by > an approximate untagged; this should allow implementation of mod32 > arithmetic on x8664. I have finally merged a patch (for all currentlybuilding architectures) for this new modular arithmetic strategy as sbcl1.0.15.16. (Free with this patch: better lognot/fixnum). Nathan, I hope this helps supporting 32bit modular arithmetic on x8664... Best, Christophe 
From: Christophe Rhodes <csr21@ca...>  20071231 11:49:27
Attachments:
Message as HTML
modular.diff

From: Nathan Froyd <froydnj@gm...>  20080103 04:44:09

On Dec 31, 2007 6:49 AM, Christophe Rhodes <csr21@...> wrote: > I've resurrected my work towards making modular arithmetic choose > which width to use more cleverly than before, from patches I sent to > sbcldevel in September. In particular, I have got > > * better support for signed modular arithmetic: all LOGFOO functions > are treated as "good"; > * preference for tagged modular arithmetic over untagged if the width > requested can be satisfied by both, unless the width matches > exactly. > > Nathan (or Nikodemus, or anyone else for that matter): are you in a > position to retry the x8664 32bit modular arithmetic on top of this? I may have some time in the next two weeks, but I will have to go back and review our IRC conversations about this to refresh my memory of what I was supposed to do. :) Nathan 
From: Christophe Rhodes <csr21@ca...>  20080103 08:15:52

"Nathan Froyd" <froydnj@...> writes: > On Dec 31, 2007 6:49 AM, Christophe Rhodes <csr21@...> wrote: >> I've resurrected my work towards making modular arithmetic choose >> which width to use more cleverly than before, from patches I sent to >> sbcldevel in September. In particular, I have got >> >> * better support for signed modular arithmetic: all LOGFOO functions >> are treated as "good"; >> * preference for tagged modular arithmetic over untagged if the width >> requested can be satisfied by both, unless the width matches >> exactly. >> >> Nathan (or Nikodemus, or anyone else for that matter): are you in a >> position to retry the x8664 32bit modular arithmetic on top of this? > > I may have some time in the next two weeks, but I will have to go back > and review our IRC conversations about this to refresh my memory of > what I was supposed to do. :) Heh. I think that my patch is at a position that you should be able to apply just the x8664 bits implementing 32bit arithmetic from <http://thread.gmane.org/gmane.lisp.steelbank.devel/9155/>;; it's possible that bits and pieces might break, though. It's more because you have useful test cases for this new behaviour that I'm trying to rope you in :) Cheers, Christophe 
From: Nathan Froyd <froydnj@gm...>  20080218 17:00:24

On Jan 3, 2008 3:15 AM, Christophe Rhodes <csr21@...> wrote: > Heh. I think that my patch is at a position that you should be able > to apply just the x8664 bits implementing 32bit arithmetic from > <http://thread.gmane.org/gmane.lisp.steelbank.devel/9155/>;; it's > possible that bits and pieces might break, though. It's more because > you have useful test cases for this new behaviour that I'm trying to > rope you in :) I finally had time to look at this; I stuck your patch and my patch together to produce a working SBCL. SBMD5::UPDATEMD5BLOCK looks right (almost; see below) and James's favorite testcase does the right thing. I haven't tried more complicated variations on James's testcase, but it seems plausible that they too will do the right thing. The only problem with SBMD5::UPDATEMD5BLOCK is that you get code like: MOV R9, R8 SHL R9, 3 MOV R8, RCX SHL R8, 3 MOV R8, RCX OR R10, R9 MOV R8, RDX SHL R8, 3 XOR R8, R10 SAR R8, 3 We want something more like: OR R8D, ECX XOR R8D, EDX The problem here is that we've been doing ADD/ROL/NOT in 32bit registers, but the only way SBCL knows how to do modular LOG{AND,IOR,XOR} operations is to do them on its usual triumvirate of fixnum/signedbyte64/unsignedbyte64. These gymnastics are converting from our 32bit registers to fixnums (since fixnum operations are "cheaper" than unsignedbyte64 operations...sigh), operating on fixnums, and then going back. So we can't rewrite the unsignedbyte64 vops to take UNSIGNED32REG inputs, because that vop will never be selected over the fixnum one. And we can't rewrite the fixnum ones to take UNSIGNED32REG inputs because that would just get messy. The simpleminded solution of (DEFINEMODULARFUN LOGIOR ...) doesn't work because LOGIOR is a :GOOD function. I don't know what the right thing to do here is. We like having these :GOOD functions because we're not making the user write even more (LOGAND #xffffffff ...) forms. But we also want a way to specify that these really should be done on a particular representation. Thoughts? Nathan 
From: Christophe Rhodes <csr21@ca...>  20080218 18:38:42

"Nathan Froyd" <froydnj@...> writes: > The simpleminded solution of (DEFINEMODULARFUN LOGIOR ...) doesn't > work because LOGIOR is a :GOOD function. > > I don't know what the right thing to do here is. We like having these > :GOOD functions because we're not making the user write even more > (LOGAND #xffffffff ...) forms. But we also want a way to specify that > these really should be done on a particular representation. My understanding is that :GOOD isn't about saving the user work; it's about saving us (the implementor) work: saying that we can just use the implementation's normal version rather than the modular version in all cases, and that things will magically work. If I'm right, then what would need to happen to make this work perfectly on x8664 is to remove the goodness from all three (LOGAND, LOGIOR and LOGXOR) and instead implement them as regular modular functions with implementations for unsignedbyte32, fixnum and unsignedbyte64. I don't think that this will cause the user to have to add in any more (LOGAND #xffffffff ...) forms, based on a monthold memory of what the code does. Cheers, Christophe 