[Ecls-list] computing results into preallocated bignums From: Nils Bruin - 2009-09-08 23:27 ```When the recent changes to ECL's use of GMP were made, one thing came up: If we could preallocate limb space for bignums in such a way that GMP doesn't need to resize it, one could avoid the use of registers altogether. Jason Moxham responded to the suggestion by making available macros that would guarantee space needs (it's pretty clear how much space an answer is going to need, but MPIR development wants to reserve the right to use a little more, should that ever be beneficial for speed) I decided to run a little experiment with addition to see if avoiding extra memcpy's is going to make a difference: In the file "num_arith.d" I tried out the following: OLD CODE in ecl_plus: ------------------------------------------------------------------------------ case t_bignum: z = _ecl_big_register0(); _ecl_big_add(z,x,y); return _ecl_big_register_normalize(z); ------------------------------------------------------------------------------ NEW CODE in ecl_plus: ------------------------------------------------------------------------------ case t_bignum: i = x->big.big_size; if (i < 0) i = -i; j = y->big.big_size; if (j < 0) j = -j; if (i > j ) dim = (i+1); else dim = (j+1); z = ecl_alloc_compact_object(t_bignum, (cl_index) (dim*sizeof(mp_limb_t))); z->big.big_limbs = ECL_COMPACT_OBJECT_EXTRA(z); z->big.big_size = 0; z->big.big_dim = dim; _ecl_big_add(z, x, y); ------------------------------------------------------------------------------ (so this only affects bignum+bignum). I tried out the following program: ------------------------------------------------------------------------- (defun fibo (n) (let ((a 1)(b 1)(c 0)) (loop for i from 1 to n do (setq c a) (setq a (+ a b)) (setq b c)) b)) (defun test (n m) (let ((t0 (get-internal-run-time))) (loop for i from 1 to m do (fibo n)) (- (get-internal-run-time) t0))) ------------------------------------------------------- On a 2391.454 MHz AMD Opteron(tm) Processor 250 I tried getting some timings from (test 4000 4000) It was hard to get stable results, so the following figures are indicative only: OLD: 15665, 16050, 16032 NEW: 15217, 15233, 15210 after (compile 'fibo) (compile 'test) OLD: 9182, 9189, 9181 [but also some much lower values at the start] NEW: 8156, 8144, 8144 so, I think the conclusion is: Yes, in some situations, the extra allocations/deallocations and memcpy involved in bignum arithmetic do have an influence on computation times. With macro's becoming available to guarantee allocation sizes, it may be worthwhile to rewrite bignum arithmetic to avoid extra register use as much as possible. (incidentally, the "ECL" executable is still dynamically linked to libecl, so renaming "ecl" to "ecl_old" and then recompiling ecl does NOT lead to a situation where one can compare two versions of ecl!) Kind regards, Nils ```
 Re: [Ecls-list] computing results into preallocated bignums From: Nils Bruin - 2009-09-08 23:36 ```On Tue, 8 Sep 2009, Nils Bruin wrote: > ------------------------------------------------------------------------------ > NEW CODE in ecl_plus: > ------------------------------------------------------------------------------ > case t_bignum: > i = x->big.big_size; > if (i < 0) i = -i; > j = y->big.big_size; > if (j < 0) j = -j; > if (i > j ) dim = (i+1); else dim = (j+1); > z = ecl_alloc_compact_object(t_bignum, (cl_index) (dim*sizeof(mp_limb_t))); > z->big.big_limbs = ECL_COMPACT_OBJECT_EXTRA(z); > z->big.big_size = 0; > z->big.big_dim = dim; > _ecl_big_add(z, x, y); > ------------------------------------------------------------------------------ My apologies. I mispasted. The code I tried did have the required "return z;" after that. ```
 Re: [Ecls-list] computing results into preallocated bignums From: Juan Jose Garcia-Ripoll - 2009-09-09 07:37 ```On Wed, Sep 9, 2009 at 1:27 AM, Nils Bruin wrote: > If we could preallocate limb space for bignums in such a way that GMP > doesn't need to resize it, one could avoid the use of registers > altogether. Jason Moxham responded to the suggestion by making available > macros that would guarantee space needs (it's pretty clear how much space > an answer is going to need, but MPIR development wants to reserve the > right to use a little more, should that ever be beneficial for speed) Indeed bignum operations and special function arithmetics is something yet to be optimized in ECL. Your suggestion is interesting and, if used, it should be somehow encapsulated, but it may be worth. I am only worried about one thing: the code would bind ECL to MPIR's macros and not all platforms use MPIR -- some still use GMP --. In other words, would the code still be backwards compatible? > I tried out the following program: [...] > > OLD: 9182, 9189, 9181 [but also some much lower values at the start] > NEW: 8156, 8144, 8144 Could you redo the timings using Common Lisp's TIME macro, like in (TIME (test ... )) This would be more stable because TIME takes care of performing garbage collection _outside_ the addition loop. It would also return numbers about consed memory, which we could compare. Juanjo -- Instituto de Física Fundamental, CSIC c/ Serrano, 113b, Madrid 28006 (Spain) http://juanjose.garciaripoll.googlepages.com ```
 Re: [Ecls-list] computing results into preallocated bignums From: Juan Jose Garcia-Ripoll - 2009-09-09 12:56 ```On Wed, Sep 9, 2009 at 1:27 AM, Nils Bruin wrote: > ------------------------------------------------------------------------------ >                case t_bignum: >                         i = x->big.big_size; >                         if (i < 0) i = -i; >                         j = y->big.big_size; >                         if (j < 0) j = -j; >                         if (i > j ) dim = (i+1); else dim = (j+1); >                         z = ecl_alloc_compact_object(t_bignum, (cl_index) (dim*sizeof(mp_limb_t))); >                         z->big.big_limbs = ECL_COMPACT_OBJECT_EXTRA(z); >                         z->big.big_size = 0; >                         z->big.big_dim = dim; >                         _ecl_big_add(z, x, y); > ------------------------------------------------------------- The only perhaps "corner" case in which this may lead to worse performance is when the result is not a bignum, but an integer that fits into a computer word. In that case "z" has to be converted to that, and the bignum should be discarded. Currently this leads to _no_ additional memory being "consed" or allocated precisely because of using preallocated registers. Juanjo -- Instituto de Física Fundamental, CSIC c/ Serrano, 113b, Madrid 28006 (Spain) http://juanjose.garciaripoll.googlepages.com ```
 Re: [Ecls-list] computing results into preallocated bignums From: Nils Bruin - 2009-09-09 15:57 ```OK, I redid the timings for the fibo example, without the time related commands in "test" and using the time macro and results are much more consistent. All tests were in a fresh ecl session: I only pasted in the required "defun"s and the "compile"s when required. Results seem quite stable now. I did 3 consecutive runs in each of the 4 sessions. Times are: Register code, uncompiled: 15.9 sec Register code, compiled: 9.1 sec Prealloc code, uncompiled: 14.8 sec Prealloc code, compiled: 8.8 sec So the gain is something like 6% or 3% for this very addition intensive example. Quite a bit smaller than I expected, but in line with what I measured initially. I find it odd that bytecompiled seems to benefit more than c compiled. You are not optimizing any bignum computation in "compile" are you? Some remarks: - this example computed the 4000(+-1)th fibonacci number, which has about 836 digits, or about 2777 bits. On 64 bit machine, I think this does not quite trigger register reallocation upon freeing yet (that would happen when the register hits 32*3*64=6144 bits). Indeed "cons"-ing seems to be on par. Is that what's measured by "cons"ed? That means that the penalty we're measuring here is solely from the memcpys and that the difference in computation time with ridiculously large integers could be bigger. That doesn't sound like a scenario to optimize for, though, so the present benchmark is perhaps a fair one. - Juanjo made a good point: You don't want to do this if the result is likely to fit in a fixnum. Hence, the right thing to do is probably to keep the register code in (fixnum,fixnum) operations (although the spare bits mean that for addition, that one can be done in a long anyway!). I'd think that "cancellation" is relatively rare, so when a bignum comes into play, I'd think you're making a good guess by assuming that the answer will be a bignum again. Any heuristics on the percentage of multiprecision operations that end up producing a small integer in "real life"? - backwards compatibility: should be good. You could just copy MPIR's macro definitions and use those if the macros are not defined. The length assumptions are currently documented in the mpz_array_init documentation of GMP 4.3.1 (which is binary compatible with 4.x and 3.x) - memory usage: the present code may allocate too much memory for a bignum (in fact, almost always will), especially when there is much cancellation (bit still a bignum result). It may be possible to let the memory manager "shrink" an allocated block (i.e., free the last few bytes of a block) relatively cheaply. Full results: ----------------------------------------- (defun fibo (n) (let ((a 1)(b 1)(c 0)) (loop for i from 1 to n do (setq c a) (setq a (+ a b)) (setq b c)) b)) (defun test (n m) (loop for i from 1 to m do (fibo n))) ------------------------------------------ //uncompiled, old code > (time (test 4000 4000)) real time : 15.892 secs run time : 15.882 secs gc count : 1380 times consed : 2277370957856 bytes NIL > (time (test 4000 4000)) real time : 15.920 secs run time : 15.908 secs gc count : 1380 times consed : 6802985308016 bytes NIL > (time (test 4000 4000)) real time : 15.946 secs run time : 15.937 secs gc count : 1378 times consed : 11312546734160 bytes NIL //compiled, old code > (time (test 4000 4000)) real time : 9.157 secs run time : 9.144 secs gc count : 896 times consed : 1483250487536 bytes NIL > (time (test 4000 4000)) real time : 9.149 secs run time : 9.134 secs gc count : 896 times consed : 4420817410864 bytes NIL > (time (test 4000 4000)) real time : 9.154 secs run time : 9.143 secs gc count : 896 times consed : 7358387430192 bytes NIL > 7931560345736 //uncompiled, new code > (time (test 4000 4000)) real time : 14.870 secs run time : 14.848 secs gc count : 1432 times consed : 2452083663360 bytes NIL > (time (test 4000 4000)) real time : 14.856 secs run time : 14.838 secs gc count : 1432 times consed : 7323742296272 bytes NIL > (time (test 4000 4000)) real time : 14.847 secs run time : 14.834 secs gc count : 1429 times consed : 12169306108296 bytes NIL //compiled, new code > (time (test 4000 4000)) real time : 8.827 secs run time : 8.811 secs gc count : 930 times consed : 1596897759552 bytes NIL > (time (test 4000 4000)) real time : 8.824 secs run time : 8.811 secs gc count : 930 times consed : 4759916280672 bytes NIL > (time (test 4000 4000)) real time : 8.805 secs run time : 8.791 secs gc count : 931 times consed : 7931560345736 bytes NIL ```
 Re: [Ecls-list] computing results into preallocated bignums From: Juan Jose Garcia-Ripoll - 2009-09-11 09:20 ```On Wed, Sep 9, 2009 at 5:57 PM, Nils Bruin wrote: > OK, I redid the timings for the fibo example, without the time related > commands in "test" and using the time macro and results are much more > consistent. > > Register code, uncompiled: 15.9 sec > Register code, compiled: 9.1 sec > > Prealloc code, uncompiled: 14.8 sec > Prealloc code, compiled: 8.8 sec That is definitely not much. I wonder whether your code does the equivalent of big_register_normalize(), that is detecting whether a bignum fits into a computer word or not and then output the value. I am also wondering whether instead of actual bignum optimization we are just seeing the effect of inlining code that before resided elsewhere. If this is the case then I would take it with a grain of salt and balance the benefits of code maintainability versus open-coding all bignum operations. > Some remarks: >  - this example computed the 4000(+-1)th fibonacci number, which has about > 836 digits, or about 2777 bits. On 64 bit machine, I think this does not > quite trigger register reallocation upon freeing yet (that would happen > when the register hits 32*3*64=6144 bits). Indeed "cons"-ing seems to be > on par. Is that what's measured by "cons"ed? Yes, "consing" is a lispy word for memory allocation. The numbers you see is how much has been allocated during the execution of the timed form. >  - Juanjo made a good point: You don't want to do this if the result is > likely to fit in a fixnum. Hence, the right thing to do is probably to > keep the register code in (fixnum,fixnum) operations (although the spare > bits mean that for addition, that one can be done in a long anyway!). I'd > think that "cancellation" is relatively rare, so when a bignum comes into > play, I'd think you're making a good guess by assuming that the answer > will be a bignum again. Any heuristics on the percentage of multiprecision > operations that end up producing a small integer in "real life"? Cancellations are more likely than you would expect in the sense that it will be quite frequent that we operate on bignums just because they have 1 or 2 more bits than fit in a lisp word. These numbers, or "fixnums" have 32-3=29 or 64-3=61 bits because they have to be distinguished from pointers, and thus most of the bignum operations that you see are due to those extra bits. This connects with the other remark you made: for most calculations 1024 bits is really enough and the bignum register is not being reallocated. If only extremely big computations are going to profit and small number computations might suffer from whatever optimization we make, then I would think it twice before changing the current code... Juanjo -- Instituto de Física Fundamental, CSIC c/ Serrano, 113b, Madrid 28006 (Spain) http://juanjose.garciaripoll.googlepages.com ```