From: Nikodemus S. <nik...@ra...> - 2007-08-14 18:27:49
|
Two modular arithmetic issues have just come to my attention. They both like bugs to me, but maybe my understanding is wrong. 1. Initial binding confuses modular arithmetic. (defun test1 (a b) (let ((x a) (y b)) (declare (type (unsigned-byte 32) x y)) (macrolet ((store (expr) `(setf x (logand #xffffffff ,expr)))) (store (+ x 7)) (store (ash x -3)) (store (- x y)) (store (ash x 3)) (store (+ x 100)) (store (+ x 7)) (store (ash x -3)) (store (- x y)) (store (ash x 3)) (store (+ x 100))) x)) Doesn't use modular arithmetic. Declaring A and B as (UNSIGNED-BYTE 32), or removing the LET fixes things. Can anyone suggest where I should look for the cause of this? 2. 32 bit modular arithmetic not working on x86-64. Is this to be expected, is this a regression, or am I doing something wrong? How hard would it be to make it work? Cheers, -- Nikodemus |
From: Nathan F. <fr...@gm...> - 2007-08-14 18:37:10
|
On 8/14/07, Nikodemus Siivola <nik...@ra...> wrote: > 2. 32 bit modular arithmetic not working on x86-64. Is this to be expected, is > this a regression, or am I doing something wrong? How hard would it be to > make it work? This is expected and somewhat unfortunate. I posted a patch to make this work several months ago: http://thread.gmane.org/gmane.lisp.steel-bank.devel/9155/ Bill and Christophe expressed reservations about the patch, so I did not commit it. -Nathan |
From: Nikodemus S. <nik...@ra...> - 2007-08-16 08:57:06
|
Uh. I can't use email anymore. This was supposed to go to the list as well. Cheers, -- Nikodemus ---------- Forwarded message ---------- From: Nikodemus Siivola <nik...@ra...> Date: Aug 15, 2007 12:09 AM Subject: Re: [Sbcl-devel] Two modular arithmetic issues To: Nathan Froyd <fr...@gm...> On 8/14/07, Nathan Froyd <fr...@gm...> wrote: > On 8/14/07, Nikodemus Siivola <nik...@ra...> wrote: > > 2. 32 bit modular arithmetic not working on x86-64. Is this to be expected, is > > this a regression, or am I doing something wrong? How hard would it be to > > make it work? > > This is expected and somewhat unfortunate. I posted a patch to make > this work several months ago: > > http://thread.gmane.org/gmane.lisp.steel-bank.devel/9155/ > > Bill and Christophe expressed reservations about the patch, so I did > not commit it. Assuming the patch attached to that thread is the full one, I'm not sure I understand the reservations: it doesn't seem to add any complexity of note: new VOPs and move-funs, sure, but no new infrastructure. Maybe I'm missing something, but - (find-if (lambda (item-width) (>= item-width width)) + (find-if (lambda (item-width) (= item-width width)) seems like the only bit that actually changes the way things work as opposed to using the existing infra to support more special cases. While I cannot claim to understand the patch in its entirety yet, it doesn't seem all that bad to me, and I can vouch that there indeed are use-cases for this, even if those uses cases aren't traditionally lispy. (No surprise there: lisp traditionally sucks at this kind of stuff, so people try to avoid this. Currently the 32 bit version of the function I posted is bad enough on x86-64 that coding it in C and doing a foreign call is cheaper -- not a practice I'd like to encourage.) Cheers, -- Nikodemus |
From: Christophe R. <cs...@ca...> - 2007-08-16 09:11:49
|
"Nikodemus Siivola" <nik...@ra...> writes: >> Bill and Christophe expressed reservations about the patch, so I did >> not commit it. > > Assuming the patch attached to that thread is the full one, I'm not > sure I understand the reservations: it doesn't seem to add any complexity > of note: new VOPs and move-funs, sure, but no new infrastructure. > > Maybe I'm missing something, but > > - (find-if (lambda (item-width) (>= item-width width)) > + (find-if (lambda (item-width) (= item-width width)) > > seems like the only bit that actually changes the way things work as > opposed to using the existing infra to support more special cases. This I think is the bit I was unhappy with. At the moment, on x86-64, we have (modulo off-by-one errors from me :-) 64 (untagged) 61 (tagged) implementations of modular arithmetic. And with that regime, the >= test makes sense, as you only get an untagged representation when you were going to get one (or bignums) anyway. The proposed regime (as I understand it; I haven't reread the patch) would have 64 (untagged) 61 (tagged) 32 (untagged) implementations. This means that the the >= test is ruinously bad for byte sizes 1-31, as they could (probably) be implemented much more cheaply by staying with a tagged representation. (This isn't necessarily obvious, I think; it depends on how many times something like a lognot necessitates an untagged temporary clearing the tag bits.) So I think my preferred solution would be to be a little bit cleverer in declaring good widths: some widths are "good" only if they're exactly the ones requested, whereas others are sensible defaults for anything smaller. Implementing this makes the simple FIND-IF above turn into something a little bit more complicated, but I think then that everyone can be happy. (I think I agree that there are interesting things that it would be nice to do with as-efficient-arithmetic-as-possible, and that supporting it is worth some effort -- just like supporting dynamic-extent is worth some effort. Of course, as we've recently seen on x86/Solaris, supporting dynamic-extent turns out to be not quite as effortless as maybe we thought...) Cheers, Christophe |
From: Nathan F. <fr...@gm...> - 2007-08-16 12:17:45
|
On 8/16/07, Christophe Rhodes <cs...@ca...> wrote: > "Nikodemus Siivola" <nik...@ra...> writes: > > Assuming the patch attached to that thread is the full one, I'm not > > sure I understand the reservations: it doesn't seem to add any complexity > > of note: new VOPs and move-funs, sure, but no new infrastructure. > > > > Maybe I'm missing something, but > > > > - (find-if (lambda (item-width) (>= item-width width)) > > + (find-if (lambda (item-width) (= item-width width)) > > > > seems like the only bit that actually changes the way things work as > > opposed to using the existing infra to support more special cases. > > This I think is the bit I was unhappy with. At the moment, on x86-64, > we have (modulo off-by-one errors from me :-) > 64 (untagged) > 61 (tagged) > implementations of modular arithmetic. And with that regime, the >= > test makes sense, as you only get an untagged representation when you > were going to get one (or bignums) anyway. This is not exactly true; the x86 backend goes through some contortions to get cases like: (logand (logxor x y) most-positive-fixnum) (where x and y are fixnums) to produce sane-ish looking code and we still wind up shifting x and y to untagged representations--at least last time I looked. (Furthermore, the x86 backend can also get "simple" should-be-fixnum cases "wrong" too--IIRC.) The x86-64 backend--again, the last time I looked--doesn't do the same sort of contortions and winds up with some awful code. The only time changing >= to = could cause Bad Things to happen is doing operations with widths between n-fixnum-bits and n-word-bits--and I think SBCL's normal untagged smarts will kick in properly here. If you're proposing changing this = test to something like "= or between fixnum and word-bit sizes", I think that would be the best of both worlds: fixnum operations happen when the user expects (as Python did in the pre-modular arithmetic world), the user can force modular arithmetic when desired, and those rare not-fixnum-but-almost-word cases will work properly as well. -Nathan |
From: Christophe R. <cs...@ca...> - 2007-08-16 12:52:39
|
"Nathan Froyd" <fr...@gm...> writes: > On 8/16/07, Christophe Rhodes <cs...@ca...> wrote: >> This I think is the bit I was unhappy with. At the moment, on x86-64, >> we have (modulo off-by-one errors from me :-) >> 64 (untagged) >> 61 (tagged) >> implementations of modular arithmetic. And with that regime, the >= >> test makes sense, as you only get an untagged representation when you >> were going to get one (or bignums) anyway. > > This is not exactly true; the x86 backend goes through some > contortions to get cases like: > > (logand (logxor x y) most-positive-fixnum) > > (where x and y are fixnums) to produce sane-ish looking code and we > still wind up shifting x and y to untagged representations--at least > last time I looked. (Furthermore, the x86 backend can also get > "simple" should-be-fixnum cases "wrong" too--IIRC.) The x86-64 > backend--again, the last time I looked--doesn't do the same sort of > contortions and winds up with some awful code. OK, I don't understand this (and I don't see where in the code this happens). Maybe we should burrow down into this issue and sort it out? > The only time changing >= to = could cause Bad Things to happen is > doing operations with widths between n-fixnum-bits and > n-word-bits--and I think SBCL's normal untagged smarts will kick in > properly here. I don't think I understand this either. Consider (ldb (byte 28 0) (+ (+ a b) (+ c d))) where a, b, c and d are declare as being of type (unsigned-byte 28). If there is no modular substitution of the + operators, then the outer + will return an (unsigned-byte 30) [ which on a 32-bit platform is a bignum ], so needless testing and bignum consing will occur. Instead, I believe a more sensible evaluation should be (ldb (byte 28 0) (+/mod29 (+/mod29 a b) (+/mod29 c d))) no? What am I missing? (Or are we in agreement on this point?) Cheers, Christophe |
From: Alexey D. <ade...@to...> - 2007-08-16 17:36:12
|
Hello, Christophe Rhodes <cs...@ca...> writes: > I don't think I understand this either. Consider > > (ldb (byte 28 0) (+ (+ a b) (+ c d))) > > where a, b, c and d are declare as being of type (unsigned-byte 28). > If there is no modular substitution of the + operators, then the outer > + will return an (unsigned-byte 30) [ which on a 32-bit platform is a > bignum ], so needless testing and bignum consing will occur. Instead, > I believe a more sensible evaluation should be > > (ldb (byte 28 0) (+/mod29 (+/mod29 a b) (+/mod29 c d))) Signedness of _intermediate_ results is completely unimportant. +/mod29 (for (unsigned-byte 29)) requires one extra instruction to clear the sign bit; a better implementation is +/smod30 (for (signed-byte 30)). -- Regards, Alexey Dejneka "Alas, the spheres of truth are less transparent than those of illusion." -- L.E.J. Brouwer |
From: Christophe R. <cs...@ca...> - 2007-08-17 08:57:58
|
Alexey Dejneka <ade...@to...> writes: > Hello, Hi! Good to know you're still around. :-) > Christophe Rhodes <cs...@ca...> writes: > >> I don't think I understand this either. Consider >> >> (ldb (byte 28 0) (+ (+ a b) (+ c d))) >> >> where a, b, c and d are declare as being of type (unsigned-byte 28). >> If there is no modular substitution of the + operators, then the outer >> + will return an (unsigned-byte 30) [ which on a 32-bit platform is a >> bignum ], so needless testing and bignum consing will occur. Instead, >> I believe a more sensible evaluation should be >> >> (ldb (byte 28 0) (+/mod29 (+/mod29 a b) (+/mod29 c d))) > > Signedness of _intermediate_ results is completely unimportant. > +/mod29 (for (unsigned-byte 29)) requires one extra instruction to > clear the sign bit; a better implementation is +/smod30 (for > (signed-byte 30)). OK, I think I understand this -- each +/mod29 must clear the top bit, whereas +/smod30 need not; this is because for +/smod30 the entire 30-bit effective register is always valid (the two tag bits need to be cleared always, of course, though they remain clear automatically for +) while +/mod29 needs special treatment of the top bit, because the return value from that function must be non-negative. I think, though, that the overall point remains: that refusing to do modular arithmetic on (ldb (byte 28 0) (+ (+ a b) (+ c d))) because 28 does not match one of the "good" modular widths exactly leads to suboptimal code. No? Cheers, Christophe |
From: Alexey D. <ade...@to...> - 2007-08-17 16:37:25
|
Hello, Christophe Rhodes <cs...@ca...> writes: >>> (ldb (byte 28 0) (+/mod29 (+/mod29 a b) (+/mod29 c d))) >> >> Signedness of _intermediate_ results is completely unimportant. >> +/mod29 (for (unsigned-byte 29)) requires one extra instruction to >> clear the sign bit; a better implementation is +/smod30 (for >> (signed-byte 30)). > > OK, I think I understand this -- each +/mod29 must clear the top bit, > whereas +/smod30 need not; this is because for +/smod30 the entire > 30-bit effective register is always valid (the two tag bits need to be > cleared always, of course, though they remain clear automatically for > +) while +/mod29 needs special treatment of the top bit, because the > return value from that function must be non-negative. My point is that our separation of signed and unsigned modular arithmetic is wrong: the sign bit is important only for the outer operation. And integers of fixnum width can only be signed :-( > I think, though, that the overall point remains: that refusing to do > modular arithmetic on > (ldb (byte 28 0) (+ (+ a b) (+ c d))) > because 28 does not match one of the "good" modular widths exactly > leads to suboptimal code. No? Yes. Another example: doing */mod32 is probably more efficient than */mod16 on MIPS (but not on X86?): (ldb (byte 16 0) (* (* (* a b) c) d)) -- Regards, Alexey Dejneka "Alas, the spheres of truth are less transparent than those of illusion." -- L.E.J. Brouwer |
From: Christophe R. <cs...@ca...> - 2007-09-11 12:43:35
Attachments:
small-smod-better-than-large-umod.diff
|
Alexey Dejneka <ade...@to...> writes: > Hello, > > Christophe Rhodes <cs...@ca...> writes: > >> OK, I think I understand this -- each +/mod29 must clear the top bit, >> whereas +/smod30 need not; this is because for +/smod30 the entire >> 30-bit effective register is always valid (the two tag bits need to be >> cleared always, of course, though they remain clear automatically for >> +) while +/mod29 needs special treatment of the top bit, because the >> return value from that function must be non-negative. > > My point is that our separation of signed and unsigned modular > arithmetic is wrong: the sign bit is important only for the outer > operation. And integers of fixnum width can only be signed :-( Right. Here's an implementation of probably an imperfect workaround: when optimizing logand (and related unsigned friends), prefer a suitable signed implementation of a smaller width to an unsigned implementation with larger width. Really this distinction is between tagged and untagged, preferring the untagged case, but Nathan's canonical test case (defun foom (x y) (declare (fixnum x y)) (logand (logxor x y) most-positive-fixnum)) now compiles to ; 0A6D5AAE: 8BD0 MOV EDX, EAX ; B0: 31FA XOR EDX, EDI ; B2: 81E2FCFFFF7F AND EDX, 2147483644 which I think is right, and better than the ; 0A7D320A: 8BD0 MOV EDX, EAX ; 0C: C1FA02 SAR EDX, 2 ; 0F: 8BCF MOV ECX, EDI ; 11: C1F902 SAR ECX, 2 ; 14: 31CA XOR EDX, ECX ; 16: 81E2FFFFFF1F AND EDX, 536870911 ; 1C: C1E202 SHL EDX, 2 No test cases fail (on x86), and in the process I found a bug in smod30 handling of ash, which I think I've fixed. This doesn't start addressing Nathan's point that we can also usefully do mod32 arithmetic on x86-64, and that that should probably only happen when a width of /exactly/ 32 is asked for, but I think this is better than what we currently have. |
From: Paul K. <pk...@gm...> - 2007-09-12 16:00:17
Attachments:
bound-checking-fast-ash.patch
|
On 9/11/07, Christophe Rhodes <cs...@ca...> wrote: > No test cases fail (on x86), and in the process I found a bug in > smod30 handling of ash, which I think I've fixed. Attached is a patch against 1.0.9.57 that fixes the same issue under x86-64. Paul Khuong |
From: Christophe R. <cs...@ca...> - 2007-09-12 16:14:44
|
"Paul Khuong" <pk...@gm...> writes: > On 9/11/07, Christophe Rhodes <cs...@ca...> wrote: >> No test cases fail (on x86), and in the process I found a bug in >> smod30 handling of ash, which I think I've fixed. > > Attached is a patch against 1.0.9.57 that fixes the same issue under > x86-64. Thanks, merged as 1.0.9.59. Cheers, Christophe |
From: Christophe R. <cs...@ca...> - 2007-09-28 09:34:59
Attachments:
sbcl-modular.diff
|
Christophe Rhodes <cs...@ca...> writes: > Really this distinction is between > tagged and untagged, preferring the untagged case, but Nathan's > canonical test case > (defun foom (x y) > (declare (fixnum x y)) > (logand (logxor x y) most-positive-fixnum)) > now compiles to > ; 0A6D5AAE: 8BD0 MOV EDX, EAX > ; B0: 31FA XOR EDX, EDI > ; B2: 81E2FCFFFF7F AND EDX, 2147483644 So, at least some of this goodnes is because we have fewer smod30 than mod32 functions, and in particular logxor blocks modular optimizations. A better test case to look at might be (defun foom2 (x y) (declare (fixnum x y)) (logand (logxor x (+ x y)) most-positive-fixnum)) [ which has a currently unoptimized + ] and also (defun foom3 (x y) (declare (fixnum x y)) (logand (+ x (logxor x y)) most-positive-fixnum)) I attach a patch which implements the strategy for choosing modular versions that Nathan and I discussed a while back: prefer an exact untagged match, followed by an approximate tagged match, followed by an approximate untagged; this should allow implementation of mod32 arithmetic on x86-64. I have only lightly tested it, and not at all on anything that's not an x86; in fact I'm sure it won't build anywhere else. I'd appreciate feedback on whether this is the right path, and whether we should be doing more work to support smod30 arithmetic. Cheers, Christophe |
From: Christophe R. <cs...@ca...> - 2008-03-07 21:17:11
|
Christophe Rhodes <cs...@ca...> writes: > I attach a patch which implements the strategy for choosing modular > versions that Nathan and I discussed a while back: prefer an exact > untagged match, followed by an approximate tagged match, followed by > an approximate untagged; this should allow implementation of mod32 > arithmetic on x86-64. I have finally merged a patch (for all currently-building architectures) for this new modular arithmetic strategy as sbcl-1.0.15.16. (Free with this patch: better lognot/fixnum). Nathan, I hope this helps supporting 32-bit modular arithmetic on x86-64... Best, Christophe |
From: Christophe R. <cs...@ca...> - 2007-12-31 11:49:27
Attachments:
modular.diff
|
Hi, "Nathan Froyd" <fr...@gm...> writes: > On 8/14/07, Nikodemus Siivola <nik...@ra...> wrote: >> 2. 32 bit modular arithmetic not working on x86-64. Is this to be expected, is >> this a regression, or am I doing something wrong? How hard would it be to >> make it work? > > This is expected and somewhat unfortunate. I posted a patch to make > this work several months ago: > > http://thread.gmane.org/gmane.lisp.steel-bank.devel/9155/ > > Bill and Christophe expressed reservations about the patch, so I did > not commit it. I've resurrected my work towards making modular arithmetic choose which width to use more cleverly than before, from patches I sent to sbcl-devel in September. In particular, I have got * better support for signed modular arithmetic: all LOGFOO functions are treated as "good"; * preference for tagged modular arithmetic over untagged if the width requested can be satisfied by both, unless the width matches exactly. Test case for before/after comparison: (defun foom (x y) (declare (fixnum x y)) (logand (logxor (+ x y) x) most-positive-fixnum)) This should allow two further pieces of work: * support for signed word-sized (i.e. untagged) arithmetic; * support for sub-word-sized untagged arithmetic, as in Nathan's original work. You can see my work-in-progress (and my inability to drive git usefully) at <http://www-jcsu.jesus.cam.ac.uk/~csr21/cgi-bin/gitweb.cgi>; I'm attaching a patch for those who, like me, are not quite comfortable with this particular VCS yet. Nathan (or Nikodemus, or anyone else for that matter): are you in a position to retry the x86-64 32-bit modular arithmetic on top of this? Best, Christophe |
From: Nathan F. <fr...@gm...> - 2008-01-03 04:44:09
|
On Dec 31, 2007 6:49 AM, Christophe Rhodes <cs...@ca...> wrote: > I've resurrected my work towards making modular arithmetic choose > which width to use more cleverly than before, from patches I sent to > sbcl-devel in September. In particular, I have got > > * better support for signed modular arithmetic: all LOGFOO functions > are treated as "good"; > * preference for tagged modular arithmetic over untagged if the width > requested can be satisfied by both, unless the width matches > exactly. > > Nathan (or Nikodemus, or anyone else for that matter): are you in a > position to retry the x86-64 32-bit modular arithmetic on top of this? I may have some time in the next two weeks, but I will have to go back and review our IRC conversations about this to refresh my memory of what I was supposed to do. :) -Nathan |
From: Christophe R. <cs...@ca...> - 2008-01-03 08:15:52
|
"Nathan Froyd" <fr...@gm...> writes: > On Dec 31, 2007 6:49 AM, Christophe Rhodes <cs...@ca...> wrote: >> I've resurrected my work towards making modular arithmetic choose >> which width to use more cleverly than before, from patches I sent to >> sbcl-devel in September. In particular, I have got >> >> * better support for signed modular arithmetic: all LOGFOO functions >> are treated as "good"; >> * preference for tagged modular arithmetic over untagged if the width >> requested can be satisfied by both, unless the width matches >> exactly. >> >> Nathan (or Nikodemus, or anyone else for that matter): are you in a >> position to retry the x86-64 32-bit modular arithmetic on top of this? > > I may have some time in the next two weeks, but I will have to go back > and review our IRC conversations about this to refresh my memory of > what I was supposed to do. :) Heh. I think that my patch is at a position that you should be able to apply just the x86-64 bits implementing 32-bit arithmetic from <http://thread.gmane.org/gmane.lisp.steel-bank.devel/9155/>; it's possible that bits and pieces might break, though. It's more because you have useful test cases for this new behaviour that I'm trying to rope you in :-) Cheers, Christophe |
From: Nathan F. <fr...@gm...> - 2008-02-18 17:00:24
|
On Jan 3, 2008 3:15 AM, Christophe Rhodes <cs...@ca...> wrote: > Heh. I think that my patch is at a position that you should be able > to apply just the x86-64 bits implementing 32-bit arithmetic from > <http://thread.gmane.org/gmane.lisp.steel-bank.devel/9155/>; it's > possible that bits and pieces might break, though. It's more because > you have useful test cases for this new behaviour that I'm trying to > rope you in :-) I finally had time to look at this; I stuck your patch and my patch together to produce a working SBCL. SB-MD5::UPDATE-MD5-BLOCK looks right (almost; see below) and James's favorite testcase does the right thing. I haven't tried more complicated variations on James's testcase, but it seems plausible that they too will do the right thing. The only problem with SB-MD5::UPDATE-MD5-BLOCK is that you get code like: MOV R9, R8 SHL R9, 3 MOV R8, RCX SHL R8, 3 MOV R8, RCX OR R10, R9 MOV R8, RDX SHL R8, 3 XOR R8, R10 SAR R8, 3 We want something more like: OR R8D, ECX XOR R8D, EDX The problem here is that we've been doing ADD/ROL/NOT in 32-bit registers, but the only way SBCL knows how to do modular LOG{AND,IOR,XOR} operations is to do them on its usual triumvirate of fixnum/signed-byte-64/unsigned-byte-64. These gymnastics are converting from our 32-bit registers to fixnums (since fixnum operations are "cheaper" than unsigned-byte-64 operations...sigh), operating on fixnums, and then going back. So we can't rewrite the unsigned-byte-64 vops to take UNSIGNED32-REG inputs, because that vop will never be selected over the fixnum one. And we can't rewrite the fixnum ones to take UNSIGNED32-REG inputs because that would just get messy. The simple-minded solution of (DEFINE-MODULAR-FUN LOGIOR ...) doesn't work because LOGIOR is a :GOOD function. I don't know what the right thing to do here is. We like having these :GOOD functions because we're not making the user write even more (LOGAND #xffffffff ...) forms. But we also want a way to specify that these really should be done on a particular representation. Thoughts? -Nathan |
From: Christophe R. <cs...@ca...> - 2008-02-18 18:38:42
|
"Nathan Froyd" <fr...@gm...> writes: > The simple-minded solution of (DEFINE-MODULAR-FUN LOGIOR ...) doesn't > work because LOGIOR is a :GOOD function. > > I don't know what the right thing to do here is. We like having these > :GOOD functions because we're not making the user write even more > (LOGAND #xffffffff ...) forms. But we also want a way to specify that > these really should be done on a particular representation. My understanding is that :GOOD isn't about saving the user work; it's about saving us (the implementor) work: saying that we can just use the implementation's normal version rather than the modular version in all cases, and that things will magically work. If I'm right, then what would need to happen to make this work perfectly on x86-64 is to remove the goodness from all three (LOGAND, LOGIOR and LOGXOR) and instead implement them as regular modular functions with implementations for unsigned-byte-32, fixnum and unsigned-byte-64. I don't think that this will cause the user to have to add in any more (LOGAND #xffffffff ...) forms, based on a month-old memory of what the code does. Cheers, Christophe |