From: Bill H. <goo...@go...> - 2009-03-02 16:41:53

I've now switched to zn_poly's newton inversion for polynomials of length > 2000. Times:

BEFORE:

```
Testing zmod_poly_divrem_newton()... Cpu = 2480 ms Wall = 2487 ms ... ok
Testing zmod_poly_divrem()... Cpu = 8510 ms Wall = 8506 ms ... ok
Testing zmod_poly_div_classical()... Cpu = 1160 ms Wall = 1166 ms ... ok
Testing zmod_poly_div()... Cpu = 4560 ms Wall = 4561 ms ... ok
Testing zmod_poly_divrem_divconquer()... Cpu = 1300 ms Wall = 1307 ms ... ok
Testing zmod_poly_div_divconquer()... Cpu = 1430 ms Wall = 1431 ms ... ok
Testing zmod_poly_div_divconquer_recursive()... Cpu = 1420 ms Wall = 1421 ms ... ok
Testing zmod_poly_newton_invert_basecase()... Cpu = 470 ms Wall = 468 ms ... ok
Testing zmod_poly_newton_invert()... Cpu = 9700 ms Wall = 9699 ms ... ok
Testing zmod_poly_div_series()... Cpu = 1200 ms Wall = 1202 ms ... ok
Testing zmod_poly_div_newton()... Cpu = 1090 ms Wall = 1088 ms ... ok
Testing zmod_poly_gcd_euclidean()... Cpu = 6300 ms Wall = 6300 ms ... ok
Testing zmod_poly_half_gcd_iter()... Cpu = 1870 ms Wall = 1868 ms ... ok
Testing zmod_poly_gcd_hgcd()... Cpu = 5240 ms Wall = 5243 ms ... ok
Testing zmod_poly_half_gcd()... Cpu = 1630 ms Wall = 1630 ms ... ok
Testing zmod_poly_gcd()... Cpu = 3070 ms Wall = 3068 ms ... ok
Testing zmod_poly_gcd_invert()... Cpu = 4410 ms Wall = 4410 ms ... ok
Testing zmod_poly_xgcd()... Cpu = 3700 ms Wall = 3704 ms ... ok
Testing zmod_poly_resultant_euclidean()... Cpu = 840 ms Wall = 839 ms ... ok
Testing zmod_poly_mulmod()... Cpu = 2340 ms Wall = 2339 ms ... ok
Testing zmod_poly_powmod()... Cpu = 3630 ms Wall = 3629 ms ... ok
Testing zmod_poly_isirreducible()... Cpu = 3730 ms Wall = 3731 ms ... ok
Testing zmod_poly_factor_berlekamp()... Cpu = 1390 ms Wall = 1390 ms ... ok
Testing zmod_poly_factor_square_free()... Cpu = 2740 ms Wall = 2734 ms ... ok
Testing zmod_poly_factor()... Cpu = 2360 ms Wall = 2359 ms ... ok
```

AFTER:

```
Testing zmod_poly_divrem_newton()... Cpu = 2970 ms Wall = 2970 ms ... ok
Testing zmod_poly_divrem()... Cpu = 9650 ms Wall = 9655 ms ... ok
Testing zmod_poly_div_classical()... Cpu = 1170 ms Wall = 1164 ms ... ok
Testing zmod_poly_div()... Cpu = 5240 ms Wall = 5242 ms ... ok
Testing zmod_poly_divrem_divconquer()... Cpu = 1310 ms Wall = 1316 ms ... ok
Testing zmod_poly_div_divconquer()... Cpu = 1440 ms Wall = 1438 ms ... ok
Testing zmod_poly_div_divconquer_recursive()... Cpu = 1430 ms Wall = 1430 ms ... ok
Testing zmod_poly_newton_invert_basecase()... Cpu = 470 ms Wall = 469 ms ... ok
Testing zmod_poly_newton_invert()... Cpu = 8470 ms Wall = 8466 ms ... ok
Testing zmod_poly_div_series()... Cpu = 1430 ms Wall = 1430 ms ... ok
Testing zmod_poly_div_newton()... Cpu = 1400 ms Wall = 1404 ms ... ok
Testing zmod_poly_gcd_euclidean()... Cpu = 6300 ms Wall = 6297 ms ... ok
Testing zmod_poly_half_gcd_iter()... Cpu = 1860 ms Wall = 1862 ms ... ok
Testing zmod_poly_gcd_hgcd()... Cpu = 5270 ms Wall = 5272 ms ... ok
Testing zmod_poly_half_gcd()... Cpu = 1620 ms Wall = 1621 ms ... ok
Testing zmod_poly_gcd()... Cpu = 3090 ms Wall = 3090 ms ... ok
Testing zmod_poly_gcd_invert()... Cpu = 4520 ms Wall = 4520 ms ... ok
Testing zmod_poly_xgcd()... Cpu = 3700 ms Wall = 3693 ms ... ok
Testing zmod_poly_resultant_euclidean()... Cpu = 830 ms Wall = 837 ms ... ok
Testing zmod_poly_mulmod()... Cpu = 2920 ms Wall = 2915 ms ... ok
Testing zmod_poly_powmod()... Cpu = 3960 ms Wall = 3966 ms ... ok
Testing zmod_poly_isirreducible()... Cpu = 3760 ms Wall = 3754 ms ... ok
Testing zmod_poly_factor_berlekamp()... Cpu = 1400 ms Wall = 1398 ms ... ok
Testing zmod_poly_factor_square_free()... Cpu = 3090 ms Wall = 3091 ms ... ok
Testing zmod_poly_factor()... Cpu = 2640 ms Wall = 2637 ms ... ok
```

I guess the FLINT crossover is wrong for zn_poly. The newton_invert time went down, but the time for everything else that relies on it went up. I'll have to have a fiddle this evening.

Bill.

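For readers following along, the crossover being fiddled with here is a length-based dispatch. A minimal sketch follows; only the "length > 2000" figure comes from the message above, and the names and signature are illustrative, not FLINT's actual identifiers.

```c
/* Sketch of a crossover dispatch (all names hypothetical). */
#define ZN_POLY_NEWTON_CROSSOVER 2000  /* length above which zn_poly is used */

/* Hypothetical wrappers for the two implementations. */
void newton_invert_zn_poly(unsigned long *res, const unsigned long *poly,
                           long n, unsigned long p);
void newton_invert_flint(unsigned long *res, const unsigned long *poly,
                         long n, unsigned long p);

void newton_invert(unsigned long *res, const unsigned long *poly,
                   long n, unsigned long p)
{
    if (n > ZN_POLY_NEWTON_CROSSOVER)
        newton_invert_zn_poly(res, poly, n, p);  /* zn_poly-backed path */
    else
        newton_invert_flint(res, poly, n, p);    /* original FLINT code */
}
```

As the timings above show, getting the constant wrong makes every routine built on top of the inversion (divrem_newton, div_newton, div_series) pay for it, even while the inversion itself gets faster.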
From: Bill H. <goo...@go...> - 2009-03-02 15:37:35

I've added zn_poly scalar_mul into zmod_poly. Here are the times of the relevant affected functions:

BEFORE:

```
Testing zmod_poly_scalar_mul()... Cpu = 220 ms Wall = 217 ms ... ok
Testing zmod_poly_make_monic()... Cpu = 430 ms Wall = 431 ms ... ok
Testing zmod_poly_divrem_classical()... Cpu = 1430 ms Wall = 1430 ms ... ok
Testing zmod_poly_divrem_newton()... Cpu = 2620 ms Wall = 2621 ms ... ok
Testing zmod_poly_divrem()... Cpu = 9420 ms Wall = 9421 ms ... ok
Testing zmod_poly_div_classical()... Cpu = 1690 ms Wall = 1689 ms ... ok
Testing zmod_poly_div()... Cpu = 5170 ms Wall = 5172 ms ... ok
Testing zmod_poly_divrem_divconquer()... Cpu = 1430 ms Wall = 1422 ms ... ok
Testing zmod_poly_div_divconquer()... Cpu = 1770 ms Wall = 1775 ms ... ok
Testing zmod_poly_div_divconquer_recursive()... Cpu = 1770 ms Wall = 1766 ms ... ok
Testing zmod_poly_newton_invert_basecase()... Cpu = 490 ms Wall = 495 ms ... ok
Testing zmod_poly_newton_invert()... Cpu = 9720 ms Wall = 9714 ms ... ok
Testing zmod_poly_div_series()... Cpu = 1230 ms Wall = 1235 ms ... ok
Testing zmod_poly_div_newton()... Cpu = 1120 ms Wall = 1121 ms ... ok
Testing zmod_poly_gcd_euclidean()... Cpu = 6600 ms Wall = 6603 ms ... ok
Testing zmod_poly_half_gcd_iter()... Cpu = 2070 ms Wall = 2069 ms ... ok
Testing zmod_poly_gcd_hgcd()... Cpu = 5590 ms Wall = 5591 ms ... ok
Testing zmod_poly_half_gcd()... Cpu = 1780 ms Wall = 1777 ms ... ok
Testing zmod_poly_gcd()... Cpu = 3280 ms Wall = 3281 ms ... ok
Testing zmod_poly_gcd_invert()... Cpu = 4710 ms Wall = 4709 ms ... ok
Testing zmod_poly_xgcd()... Cpu = 3850 ms Wall = 3848 ms ... ok
Testing zmod_poly_resultant_euclidean()... Cpu = 860 ms Wall = 857 ms ... ok
Testing zmod_poly_mulmod()... Cpu = 2370 ms Wall = 2373 ms ... ok
Testing zmod_poly_powmod()... Cpu = 4000 ms Wall = 4001 ms ... ok
Testing zmod_poly_isirreducible()... Cpu = 4380 ms Wall = 4381 ms ... ok
Testing zmod_poly_factor_berlekamp()... Cpu = 1540 ms Wall = 1536 ms ... ok
Testing zmod_poly_factor_square_free()... Cpu = 2980 ms Wall = 2979 ms ... ok
Testing zmod_poly_factor()... Cpu = 2500 ms Wall = 2508 ms ... ok
```

AFTER:

```
Testing zmod_poly_scalar_mul()... Cpu = 160 ms Wall = 164 ms ... ok
Testing zmod_poly_make_monic()... Cpu = 330 ms Wall = 326 ms ... ok
Testing zmod_poly_divrem_classical()... Cpu = 1010 ms Wall = 1009 ms ... ok
Testing zmod_poly_divrem_newton()... Cpu = 2480 ms Wall = 2487 ms ... ok
Testing zmod_poly_divrem()... Cpu = 8510 ms Wall = 8506 ms ... ok
Testing zmod_poly_div_classical()... Cpu = 1160 ms Wall = 1166 ms ... ok
Testing zmod_poly_div()... Cpu = 4560 ms Wall = 4561 ms ... ok
Testing zmod_poly_divrem_divconquer()... Cpu = 1300 ms Wall = 1307 ms ... ok
Testing zmod_poly_div_divconquer()... Cpu = 1430 ms Wall = 1431 ms ... ok
Testing zmod_poly_div_divconquer_recursive()... Cpu = 1420 ms Wall = 1421 ms ... ok
Testing zmod_poly_newton_invert_basecase()... Cpu = 470 ms Wall = 468 ms ... ok
Testing zmod_poly_newton_invert()... Cpu = 9700 ms Wall = 9699 ms ... ok
Testing zmod_poly_div_series()... Cpu = 1200 ms Wall = 1202 ms ... ok
Testing zmod_poly_div_newton()... Cpu = 1090 ms Wall = 1088 ms ... ok
Testing zmod_poly_gcd_euclidean()... Cpu = 6300 ms Wall = 6300 ms ... ok
Testing zmod_poly_half_gcd_iter()... Cpu = 1870 ms Wall = 1868 ms ... ok
Testing zmod_poly_gcd_hgcd()... Cpu = 5240 ms Wall = 5243 ms ... ok
Testing zmod_poly_half_gcd()... Cpu = 1630 ms Wall = 1630 ms ... ok
Testing zmod_poly_gcd()... Cpu = 3070 ms Wall = 3068 ms ... ok
Testing zmod_poly_gcd_invert()... Cpu = 4410 ms Wall = 4410 ms ... ok
Testing zmod_poly_xgcd()... Cpu = 3700 ms Wall = 3704 ms ... ok
Testing zmod_poly_resultant_euclidean()... Cpu = 840 ms Wall = 839 ms ... ok
Testing zmod_poly_mulmod()... Cpu = 2340 ms Wall = 2339 ms ... ok
Testing zmod_poly_powmod()... Cpu = 3630 ms Wall = 3629 ms ... ok
Testing zmod_poly_isirreducible()... Cpu = 3730 ms Wall = 3731 ms ... ok
Testing zmod_poly_factor_berlekamp()... Cpu = 1390 ms Wall = 1390 ms ... ok
Testing zmod_poly_factor_square_free()... Cpu = 2740 ms Wall = 2734 ms ... ok
Testing zmod_poly_factor()... Cpu = 2360 ms Wall = 2359 ms ... ok
```

Some nice improvements there. Now for newton inversion.

Bill.

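As an aside on why make_monic improves together with scalar_mul: making a polynomial monic amounts to one modular inversion of the leading coefficient followed by a scalar multiplication across every coefficient, so nearly all of its time is spent in the scalar_mul loop. A minimal sketch, with all names hypothetical rather than FLINT's actual internals:

```c
/* Sketch only: all names are hypothetical, not FLINT's internals. */
unsigned long mod_invert(unsigned long c, unsigned long p);        /* c^(-1) mod p */
void poly_scalar_mul(unsigned long *res, const unsigned long *poly,
                     long len, unsigned long c, unsigned long p);  /* the hot loop */

void poly_make_monic(unsigned long *poly, long len, unsigned long p)
{
    /* Divide through by the leading coefficient, i.e. multiply every
       coefficient by its modular inverse; the scalar multiplication is
       where almost all the work happens, hence the speedup above. */
    unsigned long inv = mod_invert(poly[len - 1], p);
    poly_scalar_mul(poly, poly, len, inv, p);
}
```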
From: Bill H. <goo...@go...> - 2009-03-02 02:14:34

Oh, I can probably also use zn_poly for scalar multiplication. I'll look into this tomorrow as well.

Note that some of the times have actually gone up. This particular machine is renowned for timing irregularities, so I wouldn't pay too much attention to those. factor_square_free does seem to have gone up an awful lot though. But it seems unlikely that FLINT just happened to be faster over the range of multiplications which are common in this test.

Bill.

2009/3/2 Bill Hart <goo...@go...>:
> I've done a partial inclusion of zn_poly into FLINT (devel).
> [snip]
From: Bill H. <goo...@go...> - 2009-03-02 01:58:10

I've done a partial inclusion of zn_poly into FLINT (devel).

There is now a directory in FLINT called zn_poly, which contains a largely unmodified copy of zn_poly-0.8. The FLINT makefile contains lines to build the main zn_poly object files. These are linked directly with FLINT when built, i.e. library calls to zn_poly are no longer made.

I've set zmod_poly up to use the zn_poly polynomial multiplication function. There is a #define at the top of zmod_poly.h called USE_ZN_POLY, which will switch in the zn_poly code when it is defined to be 1; otherwise all the original code is used.

I also made zmod_poly_mul_middle use zn_poly, though there seems to be a difference between how the two libraries implement this. Basically FLINT doesn't compute the first len1/2 coefficients, whereas zn_poly doesn't compute the first len2 - 1 coefficients. However, FLINT has the restriction that len2 - 1 < len1/2, thus by zeroing some of the coefficients that come from zn_poly it is always possible to get the same result as FLINT gives.

All the test functions in zmod_poly pass, and I actually ran a test which compared the results of all multiplications, whether top level or subordinate, between zn_poly and FLINT, and the results are always the same. So this validates the multiplication functions in both FLINT and zn_poly to a fairly high degree.

I cannot use the precomputed middle product functions yet, as FLINT is set up to be able to use the same precomputed FFT for an ordinary truncated multiplication and a middle product, whereas zn_poly-0.8 doesn't have ordinary truncated multiplication yet, as far as I can see.

I may be able to use the zn_poly function for inverting a polynomial however, which is the main place that the middle product is used. I'll look into this tomorrow. Once that is done, zn_poly will basically be used for everything it can be in FLINT at the moment. Below are timings for the test code from before and after the inclusion of zn_poly in FLINT. Note the zmod_poly_mul_middle test function was not present before the inclusion. I added it to ensure that this function returns the same thing before and after the inclusion.

Also note that functions such as mul_KS and mul_classical do not use zn_poly. However, all internal calls to such functions have been replaced with calls to zn_poly where appropriate.

It's also important to realise that in many cases the majority of time spent in the test code is not spent in the function that is being tested. So even a small improvement in the time of the test code represents a large improvement in the time of the function being tested.

Bill.

BEFORE:

```
Testing zmod_poly_to_from_string()... Cpu = 200 ms Wall = 185 ms ... ok
Testing __zmod_poly_normalise()... Cpu = 170 ms Wall = 175 ms ... ok
Testing zmod_poly_truncate()... Cpu = 170 ms Wall = 167 ms ... ok
Testing zmod_poly_reverse()... Cpu = 300 ms Wall = 305 ms ... ok
Testing zmod_poly_add_no_red()... Cpu = 80 ms Wall = 77 ms ... ok
Testing zmod_poly_addsub()... Cpu = 60 ms Wall = 62 ms ... ok
Testing zmod_poly_neg()... Cpu = 50 ms Wall = 43 ms ... ok
Testing zmod_poly_bits()... Cpu = 10 ms Wall = 17 ms ... ok
Testing zmod_poly_shift()... Cpu = 20 ms Wall = 21 ms ... ok
Testing zmod_poly_swap()... Cpu = 30 ms Wall = 22 ms ... ok
Testing zmod_poly_setequal()... Cpu = 10 ms Wall = 18 ms ... ok
Testing _zmod_poly_attach()... Cpu = 10 ms Wall = 9 ms ... ok
Testing _zmod_poly_attach_truncate()... Cpu = 10 ms Wall = 9 ms ... ok
Testing _zmod_poly_attach_shift()... Cpu = 10 ms Wall = 9 ms ... ok
Testing zmod_poly_derivative()... Cpu = 200 ms Wall = 195 ms ... ok
Testing zmod_poly_getset_coeff()... Cpu = 10 ms Wall = 10 ms ... ok
Testing zmod_poly_mul()... Cpu = 2660 ms Wall = 2668 ms ... ok
Testing zmod_poly_mul_trunc_n()... Cpu = 3880 ms Wall = 3876 ms ... ok
Testing zmod_poly_mul_classicalKS()... Cpu = 1290 ms Wall = 1292 ms ... ok
Testing zmod_poly_sqr()... Cpu = 1970 ms Wall = 1967 ms ... ok
Testing zmod_poly_sqr_classicalKS()... Cpu = 1030 ms Wall = 1032 ms ... ok
Testing zmod_poly_mul_classical_trunc()... Cpu = 3220 ms Wall = 3222 ms ... ok
Testing zmod_poly_mul_classical_trunc_left()... Cpu = 2200 ms Wall = 2200 ms ... ok
Testing zmod_poly_mul_trunc_left_n()... Cpu = 2600 ms Wall = 2601 ms ... ok
Testing zmod_poly_mul_KS_trunc()... Cpu = 1170 ms Wall = 1168 ms ... ok
Testing zmod_poly_mul_KS_middle()... Cpu = 3850 ms Wall = 3853 ms ... ok
Testing zmod_poly_mul_middle_precache()... Cpu = 2510 ms Wall = 2509 ms ... ok
Testing _zmod_poly_mul_KS_precache()... Cpu = 3430 ms Wall = 3427 ms ... ok
Testing zmod_poly_mul_precache()... Cpu = 7760 ms Wall = 7762 ms ... ok
Testing zmod_poly_mul_trunc_n_precache()... Cpu = 7180 ms Wall = 7179 ms ... ok
Testing zmod_poly_scalar_mul()... Cpu = 220 ms Wall = 218 ms ... ok
Testing zmod_poly_make_monic()... Cpu = 440 ms Wall = 440 ms ... ok
Testing zmod_poly_divrem_classical()... Cpu = 1640 ms Wall = 1645 ms ... ok
Testing zmod_poly_divrem_newton()... Cpu = 2720 ms Wall = 2717 ms ... ok
Testing zmod_poly_divrem()... Cpu = 10280 ms Wall = 10275 ms ... ok
Testing zmod_poly_div_classical()... Cpu = 1870 ms Wall = 1877 ms ... ok
Testing zmod_poly_div()... Cpu = 5760 ms Wall = 5759 ms ... ok
Testing zmod_poly_divrem_divconquer()... Cpu = 1720 ms Wall = 1722 ms ... ok
Testing zmod_poly_div_divconquer()... Cpu = 1950 ms Wall = 1949 ms ... ok
Testing zmod_poly_div_divconquer_recursive()... Cpu = 1940 ms Wall = 1935 ms ... ok
Testing zmod_poly_newton_invert_basecase()... Cpu = 530 ms Wall = 531 ms ... ok
Testing zmod_poly_newton_invert()... Cpu = 9090 ms Wall = 9091 ms ... ok
Testing zmod_poly_div_series()... Cpu = 1230 ms Wall = 1231 ms ... ok
Testing zmod_poly_div_newton()... Cpu = 1270 ms Wall = 1268 ms ... ok
Testing zmod_poly_gcd_euclidean()... Cpu = 6570 ms Wall = 6572 ms ... ok
Testing zmod_poly_half_gcd_iter()... Cpu = 2310 ms Wall = 2307 ms ... ok
Testing zmod_poly_gcd_hgcd()... Cpu = 6670 ms Wall = 6667 ms ... ok
Testing zmod_poly_half_gcd()... Cpu = 1970 ms Wall = 1974 ms ... ok
Testing zmod_poly_gcd()... Cpu = 3830 ms Wall = 3835 ms ... ok
Testing zmod_poly_gcd_invert()... Cpu = 5190 ms Wall = 5186 ms ... ok
Testing zmod_poly_xgcd()... Cpu = 4410 ms Wall = 4413 ms ... ok
Testing zmod_poly_resultant_euclidean()... Cpu = 830 ms Wall = 822 ms ... ok
Testing zmod_poly_mulmod()... Cpu = 2570 ms Wall = 2572 ms ... ok
Testing zmod_poly_powmod()... Cpu = 4470 ms Wall = 4470 ms ... ok
Testing zmod_poly_isirreducible()... Cpu = 4700 ms Wall = 4704 ms ... ok
Testing zmod_poly_factor_berlekamp()... Cpu = 1720 ms Wall = 1721 ms ... ok
Testing zmod_poly_factor_square_free()... Cpu = 2530 ms Wall = 2525 ms ... ok
Testing zmod_poly_factor()... Cpu = 2910 ms Wall = 2925 ms ... ok
Testing zmod_poly_2x2_mat_mul_classical_strassen()... Cpu = 790 ms Wall = 783 ms ... ok
Testing zmod_poly_2x2_mat_mul()... Cpu = 2280 ms Wall = 2285 ms ... ok
```

AFTER:

```
Testing zmod_poly_to_from_string()... Cpu = 180 ms Wall = 184 ms ... ok
Testing __zmod_poly_normalise()... Cpu = 180 ms Wall = 177 ms ... ok
Testing zmod_poly_truncate()... Cpu = 170 ms Wall = 171 ms ... ok
Testing zmod_poly_reverse()... Cpu = 310 ms Wall = 311 ms ... ok
Testing zmod_poly_add_no_red()... Cpu = 80 ms Wall = 77 ms ... ok
Testing zmod_poly_addsub()... Cpu = 60 ms Wall = 62 ms ... ok
Testing zmod_poly_neg()... Cpu = 50 ms Wall = 43 ms ... ok
Testing zmod_poly_bits()... Cpu = 10 ms Wall = 18 ms ... ok
Testing zmod_poly_shift()... Cpu = 20 ms Wall = 21 ms ... ok
Testing zmod_poly_swap()... Cpu = 30 ms Wall = 21 ms ... ok
Testing zmod_poly_setequal()... Cpu = 10 ms Wall = 19 ms ... ok
Testing _zmod_poly_attach()... Cpu = 10 ms Wall = 9 ms ... ok
Testing _zmod_poly_attach_truncate()... Cpu = 10 ms Wall = 8 ms ... ok
Testing _zmod_poly_attach_shift()... Cpu = 10 ms Wall = 10 ms ... ok
Testing zmod_poly_derivative()... Cpu = 200 ms Wall = 197 ms ... ok
Testing zmod_poly_getset_coeff()... Cpu = 10 ms Wall = 11 ms ... ok
Testing zmod_poly_mul()... Cpu = 2250 ms Wall = 2247 ms ... ok
Testing zmod_poly_mul_trunc_n()... Cpu = 3960 ms Wall = 3959 ms ... ok
Testing zmod_poly_mul_classicalKS()... Cpu = 1280 ms Wall = 1286 ms ... ok
Testing zmod_poly_sqr()... Cpu = 1850 ms Wall = 1849 ms ... ok
Testing zmod_poly_sqr_classicalKS()... Cpu = 1040 ms Wall = 1036 ms ... ok
Testing zmod_poly_mul_classical_trunc()... Cpu = 3200 ms Wall = 3202 ms ... ok
Testing zmod_poly_mul_classical_trunc_left()... Cpu = 2190 ms Wall = 2190 ms ... ok
Testing zmod_poly_mul_trunc_left_n()... Cpu = 2440 ms Wall = 2442 ms ... ok
Testing zmod_poly_mul_KS_trunc()... Cpu = 1170 ms Wall = 1174 ms ... ok
Testing zmod_poly_mul_KS_middle()... Cpu = 3880 ms Wall = 3873 ms ... ok
Testing zmod_poly_mul_middle()... Cpu = 3310 ms Wall = 3311 ms ... ok
Testing zmod_poly_mul_middle_precache()... Cpu = 2320 ms Wall = 2320 ms ... ok
Testing _zmod_poly_mul_KS_precache()... Cpu = 4200 ms Wall = 4202 ms ... ok
Testing zmod_poly_mul_precache()... Cpu = 8210 ms Wall = 8211 ms ... ok
Testing zmod_poly_mul_trunc_n_precache()... Cpu = 6700 ms Wall = 6701 ms ... ok
Testing zmod_poly_scalar_mul()... Cpu = 220 ms Wall = 217 ms ... ok
Testing zmod_poly_make_monic()... Cpu = 430 ms Wall = 431 ms ... ok
Testing zmod_poly_divrem_classical()... Cpu = 1430 ms Wall = 1430 ms ... ok
Testing zmod_poly_divrem_newton()... Cpu = 2620 ms Wall = 2621 ms ... ok
Testing zmod_poly_divrem()... Cpu = 9420 ms Wall = 9421 ms ... ok
Testing zmod_poly_div_classical()... Cpu = 1690 ms Wall = 1689 ms ... ok
Testing zmod_poly_div()... Cpu = 5170 ms Wall = 5172 ms ... ok
Testing zmod_poly_divrem_divconquer()... Cpu = 1430 ms Wall = 1422 ms ... ok
Testing zmod_poly_div_divconquer()... Cpu = 1770 ms Wall = 1775 ms ... ok
Testing zmod_poly_div_divconquer_recursive()... Cpu = 1770 ms Wall = 1766 ms ... ok
Testing zmod_poly_newton_invert_basecase()... Cpu = 490 ms Wall = 495 ms ... ok
Testing zmod_poly_newton_invert()... Cpu = 9720 ms Wall = 9714 ms ... ok
Testing zmod_poly_div_series()... Cpu = 1230 ms Wall = 1235 ms ... ok
Testing zmod_poly_div_newton()... Cpu = 1120 ms Wall = 1121 ms ... ok
Testing zmod_poly_gcd_euclidean()... Cpu = 6600 ms Wall = 6603 ms ... ok
Testing zmod_poly_half_gcd_iter()... Cpu = 2070 ms Wall = 2069 ms ... ok
Testing zmod_poly_gcd_hgcd()... Cpu = 5590 ms Wall = 5591 ms ... ok
Testing zmod_poly_half_gcd()... Cpu = 1780 ms Wall = 1777 ms ... ok
Testing zmod_poly_gcd()... Cpu = 3280 ms Wall = 3281 ms ... ok
Testing zmod_poly_gcd_invert()... Cpu = 4710 ms Wall = 4709 ms ... ok
Testing zmod_poly_xgcd()... Cpu = 3850 ms Wall = 3848 ms ... ok
Testing zmod_poly_resultant_euclidean()... Cpu = 860 ms Wall = 857 ms ... ok
Testing zmod_poly_mulmod()... Cpu = 2370 ms Wall = 2373 ms ... ok
Testing zmod_poly_powmod()... Cpu = 4000 ms Wall = 4001 ms ... ok
Testing zmod_poly_isirreducible()... Cpu = 4380 ms Wall = 4381 ms ... ok
Testing zmod_poly_factor_berlekamp()... Cpu = 1540 ms Wall = 1536 ms ... ok
Testing zmod_poly_factor_square_free()... Cpu = 2980 ms Wall = 2979 ms ... ok
Testing zmod_poly_factor()... Cpu = 2500 ms Wall = 2508 ms ... ok
Testing zmod_poly_2x2_mat_mul_classical_strassen()... Cpu = 450 ms Wall = 449 ms ... ok
Testing zmod_poly_2x2_mat_mul()... Cpu = 1510 ms Wall = 1510 ms ... ok
```

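The USE_ZN_POLY switch and the middle product fix-up described above might look roughly like the following sketch. USE_ZN_POLY is the real #define named in the message; the type and the two underlying multiplication routines are illustrative stand-ins, not FLINT's actual code.

```c
/* Sketch of the compile-time switch and the middle-product convention
   fix-up. zn_poly's middle product leaves out the first len2 - 1
   coefficients; FLINT's leaves out the first len1/2. Since FLINT
   guarantees len2 - 1 < len1/2, zeroing the coefficients in between
   makes the two conventions agree. */
#define USE_ZN_POLY 1

typedef struct { unsigned long *coeffs; long length; } poly_struct;
typedef poly_struct poly_t[1];

/* Hypothetical wrappers for the two implementations. */
void mul_middle_zn_poly(poly_t res, const poly_t a, const poly_t b);
void mul_middle_flint(poly_t res, const poly_t a, const poly_t b);

void mul_middle(poly_t res, const poly_t a, const poly_t b)
{
#if USE_ZN_POLY
    long i;
    mul_middle_zn_poly(res, a, b);
    /* Zero the coefficients zn_poly computes but FLINT's convention
       omits, i.e. indices len2 - 1 up to len1/2 - 1. */
    for (i = b->length - 1; i < a->length / 2; i++)
        res->coeffs[i] = 0;
#else
    mul_middle_flint(res, a, b);   /* original implementation */
#endif
}
```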
From: Bill H. <goo...@go...> - 2009-03-01 15:50:38

I have issued a bug release for FLINT. This tries to fix some dramatic slowdowns that occur in fmpz_poly_divrem and fmpz_poly_pseudo_divrem, noted by William Stein. Essentially the issue only showed up when calling these functions repeatedly with some of the same polynomials as parameters (perhaps permuted). This is common in Euclidean gcd-like algorithms (e.g. polynomial signature computation using Sturm sequences).

Along the way I noted a handful of potentially unsafe lines of code and fixed those. Unfortunately the unmanaged _fmpz_poly_left_shift function doesn't like length 0 polynomials with zero limbs allocated for coefficients, and this can't be fixed without radically altering lots of FLINT 1.1.x. The managed version of this function is safe. However, some of the division code uses the unmanaged version, in a couple of places where there may also be no limbs allocated. These could have caused segfaults, though I expect the chance of that was pretty low; it is probably possible to construct polynomials which would have triggered the issue.

Bill.

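For illustration only, a caller-side guard for the zero-length hazard described above could look like the sketch below. This is not the actual fix shipped in the release, and the signature of _fmpz_poly_left_shift is assumed; the point is simply that shifting the zero polynomial is a no-op, so the unmanaged function never needs to see a length 0 polynomial with no limbs allocated.

```c
/* Illustrative guard only, not FLINT's actual fix; signature assumed. */
void left_shift_checked(fmpz_poly_t res, const fmpz_poly_t poly,
                        unsigned long n)
{
    if (poly->length == 0)
    {
        fmpz_poly_zero(res);              /* nothing to shift */
        return;
    }
    _fmpz_poly_left_shift(res, poly, n);  /* unmanaged fast path */
}
```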
From: Bill H. <goo...@go...> - 2009-02-27 21:06:34

I got the dots at 237 bits. There were two reasons I didn't figure it out until now. One was that in the tuning code I was summing the max bits from both polynomials, so I needed to look for a tuning value around 474 bits. Secondly, the correct tuning was quite a way off: I had it set to 512 bits. Changing this crossover to 470 bits fixed the issue.

Still a couple of dots to go. Those seem nowhere near crossovers. Perhaps the FFT is at fault there. It's certainly not tuned for this machine. I need to write some auto tuning code.

Bill.

2009/2/27 Bill Hart <goo...@go...>:
> I found an old FLINT 1.0.x timing graph for this machine, and many of the same points occur.
> [snip]
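To spell out why 237 bits showed up as roughly 474 in the tuning code: the test sums the maximum coefficient bit counts of both operands before comparing against the crossover. A sketch follows; the 470 and 512 figures come from the message, while the names, structure and comparison direction are assumptions.

```c
/* Sketch of a tuning test that sums the max bit counts of BOTH operands,
   so a per-polynomial anomaly at 237 bits surfaces near 2 * 237 = 474.
   Names and comparison direction are illustrative. */
#define KS_BYTE_PACK_CROSSOVER 470   /* was 512 */

int past_crossover(long bits1, long bits2)
{
    return bits1 + bits2 >= KS_BYTE_PACK_CROSSOVER;
}
```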
From: Bill H. <goo...@go...> - 2009-02-27 17:54:01

I found an old FLINT 1.0.x timing graph for this machine, and many of the same points occur:

http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/polymul3.png

So it seems to be something that has been in the code for a long time. It did not appear in any graph for sage.math, but this is a different Opteron.

Tracking down the source of those points could be hard. I suppose I should check that the FFT is not playing up at that point. It could well be an FFT tuning issue at those specific points. The 237 line might be a red herring. Perhaps things are just slightly slower along that line for whatever reason, and this combined with some FFT tuning badness makes it bleed red blood at that point.

Bill.

2009/2/27 William Hart <ha...@ya...>:
> I've done some more work optimising the bit and byte packing and unpacking code, and making sure the maximum number of bits/limbs per coefficient is not counted more than once anywhere.
> [snip]
>
> --- On Fri, 2/27/09, Bill Hart <goo...@go...> wrote:
>
>> I've now fixed the karatsuba multiplication by switching to the ordinary algorithm when the lengths are equal.
>>
>> http://sage.math.washington.edu/home/wbhart/poly-KS.png
>>
>> Next I will try to figure out where those remaining red dots are coming from. The whole graph is actually a slightly darker blue than I expect. I did fix one issue, namely counting the maximum number of bits per coefficient outside the KS function and then again inside it. I now pass this data in.
>>
>> But I think the byte packing might not be optimal, as there is quite dark blue in some parts of the KS region.
>>
>> Bill.
>>
>> 2009/2/21 William Hart <ha...@ya...>:
>>> I figured out what the red line is on the left hand side of the multiplication graph:
>>>
>>> http://sage.math.washington.edu/home/wbhart/poly-KS.png
>>>
>>> It occurs for polynomials of length 15, where David Harvey's odd/even karatsuba is used. This algorithm is great for unbalanced operands, as it uses far fewer multiplies in general than ordinary karatsuba.
>>>
>>> Instead of dividing the polynomials into a first half and a second half, it works by dividing into odd exponent coefficients and even exponent coefficients.
>>>
>>> However, when you work it through, you need the lengths of the original polynomials (at every level of the recursion) to be even. So David's implementation (which I borrowed for FLINT 2.0) uses a trick called peeling to make everything work. Basically, at each level, if the length of a polynomial is odd, one of the coefficients is ignored and the karatsuba recursion is carried out on what is left (which is of even length), then at the end a fixup is done to handle the product of the ignored coefficient with the other polynomial.
>>>
>>> The problem is obvious. 15 is odd, so 1 is subtracted to make 14, which is 2*7. At the next level, 7 is odd, so 1 is subtracted to make 6, which is 2*3. At the next level 3 is odd, so 1 is subtracted to make 2, which is 2*1.
>>>
>>> So at every level, peeling occurs, i.e. 15 is a worst case for this algorithm.
>>>
>>> One way to fix this issue is to code up ordinary karatsuba for balanced operands. In that case the 15 x 15 polynomial multiplication gets split into two 8 x 8 multiplications and one 7 x 7. The 7 x 7 gets split into two 4 x 4 multiplications and one 3 x 3. The 3 x 3 gets split into two 2 x 2 and one 1 x 1 multiplication.
>>>
>>> With the odd/even method one has:
>>>
>>> 1 x 1 -> 1 mul
>>> 2 x 2 -> 3 mul
>>> 3 x 3 -> 8 mul
>>> 6 x 6 -> 24 mul
>>> 7 x 7 -> 37 mul
>>> 14 x 14 -> 111 mul
>>> 15 x 15 -> 140 mul
>>>
>>> With the ordinary method one has:
>>>
>>> 1 x 1 -> 1 mul
>>> 2 x 2 -> 3 mul
>>> 3 x 3 -> 7 mul
>>> 4 x 4 -> 9 mul
>>> 7 x 7 -> 25 mul
>>> 8 x 8 -> 27 mul
>>> 15 x 15 -> 79 mul
>>>
>>> nearly twice the speed for large coefficients.
>>>
>>> Bill.
>>>
>>> --- On Wed, 2/18/09, Bill Hart <goo...@go...> wrote:
>>>
>>>> The new F_mpz_poly module for FLINT 2.0 is progressing well. I've now got the byte packing code working and coded up the (non-truncating) polynomial multiplication functions. Here is the result of a profile comparing with Magma:
>>>>
>>>> http://sage.math.washington.edu/home/wbhart/poly-KS.png
>>>>
>>>> The dark red on the left is probably an indication that the processor clock speed is not exactly 2.8GHz on this machine (or the real time clock is slightly off), and is not important. The times should be identical with Magma until reaching the FFT region, as both FLINT and Magma must use GMP, and degree 0 polynomials are just integers.
>>>>
>>>> The red dots in the middle are probably real, with the dots at the bottom of the middle probably tuning issues. The ones on the right of the middle I am not sure about. They may just be timing irregularities (which this machine has many of).
>>>>
>>>> The red vertical line near the left is probably just a tuning issue. It's doubtlessly at the crossover between karatsuba and Schoenhage-Strassen, which I must have wrong. I'm surprised it is so red though.
>>>>
>>>> There are also some red dots near the right of the diagram. Those are probably significant. They lie in a region where Kronecker segmentation is used after byte packing the coefficients of the polynomials into two large integers.
>>>>
>>>> Note that below those dots is lots of strong blue. That is the bit packing region, and the problems we used to have there are completely gone. They were due to a cache issue related to the fact that Magma was storing a coefficient in 32 bits, not 2 limbs as the FLINT 1.0.x and 1.1.x series did. FLINT 2.0 uses just a single limb per coefficient, and the nice blue there vindicates this change.
>>>>
>>>> But the red dots in the byte packing region are almost certainly due to the fact that once one gets out of the region where the output coefficients fit into a limb, one must switch to mpz_t's (in the new F_mpz_t format). That switching costs some cycles for extra comparisons and branches, and so for small coefficients in the byte packing region, FLINT 2.0 is slower than earlier versions of FLINT. I've worked hard on getting overheads down. At one point I tried an optimisation which should have made a difference (removing some unnecessary branches from the bit packing code). However, this didn't improve things at all. So I think it's difficult to see how I could expect to remove those dots. They may go away by fiddling with the tuning of the FFT itself.
>>>>
>>>> At any rate, I'm really happy with how things have gone. The next step will be to work on the truncating multiplication routines, which should be very simple in comparison with the functions I just coded. Then I can move on to the division and GCD functions, etc.
>>>>
>>>> Although I coded up David Harvey's KS2 algorithm for the bitpacked region, I didn't include it in this profile yet. You wouldn't notice its effect probably, as things are already so blue at the bottom of the graph. But when I get around to it, I'll put up another graph with that included, and some of these other tuning issues resolved.
>>>>
>>>> Bill.
|
From: William H. <ha...@ya...> - 2009-02-27 17:27:50
|
I've done some more work optimising the bit and byte packing and unpacking code, and making sure the maximum number of bits/limbs per coefficient is not counted more than once anywhere. I think I've done all I can in terms of this general sort of optimisation now. Here is the result:

http://sage.math.washington.edu/home/wbhart/poly-KS.png

But there are still some red dots, and these are real.

1) They are not tuning issues - the FLINT 1.x tuning is identical at these points.

2) They are not timing irregularities - they show up in the same places every time the graph is drawn.

3) They do not represent especially fast points for Magma - the FLINT times go up dramatically at these points, sometimes even higher than for other nearby problems which are larger!

4) All points are in the KS region where byte packing is done.

Here is where some of the points lie:

  length   bits
     237    237
    1020    284
    5266     66

There's nothing special about any of these. They are not at any crossovers of any kind: bit, byte, limb, algorithm, etc. Note that at 237 bits there is a whole dark line across the graph. This is mystifying, as there is nothing special happening at 237 bits.

The output coefficients in the 237/237 case are 2*237 + 8 = 482 bits. Again, there just isn't anything special about that. Packing is done into fields of 61 bytes each. Each of the large integers is 115656 bits or 1808 limbs. That is just into the FFT region (the FFT starts at 1695 limbs).

The only thing that should be relevant is the size of the input and output coefficients, as there is a whole dark line at 237 bits. Input coefficients fit easily into 4 limbs, output coefficients into 8 limbs. There's just nothing special about any of these things. We aren't even near any boundaries in either case.

Bill.
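As a cross-check on the numbers above, the byte packing field width follows from the input sizes alone: a length n1 by length n2 product with b-bit input coefficients has output coefficients of at most 2*b + ceil(log2(min(n1, n2))) bits, rounded up to whole bytes. A minimal sketch of that bound (the helper name ks_field_bytes is mine for illustration, not FLINT's API):

    #include <stdio.h>

    /* Hypothetical helper, not FLINT API: bytes per packed field for a
       length n1 x length n2 product with b-bit input coefficients. Each
       output coefficient is a sum of at most min(n1, n2) products of two
       b-bit values, so it needs at most 2*b + ceil(log2(min(n1, n2))) bits. */
    static unsigned long ks_field_bytes(unsigned long n1, unsigned long n2,
                                        unsigned long b)
    {
        unsigned long nmin = (n1 < n2) ? n1 : n2;
        unsigned long log2n = 0;
        while ((1UL << log2n) < nmin) log2n++;   /* ceil(log2(nmin)) */
        return (2*b + log2n + 7) / 8;            /* round bits up to bytes */
    }

    int main(void)
    {
        /* The 237/237 case above: 2*237 + 8 = 482 bits, i.e. 61 bytes. */
        printf("%lu\n", ks_field_bytes(237, 237, 237));
        return 0;
    }

For the length 237, 237 bit case this reproduces the 61-byte fields and hence the 237*61*8 = 115656 bit integers quoted above.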
|
From: Bill H. <goo...@go...> - 2009-02-27 04:32:15
|
I've now fixed the karatsuba multiplication by switching to the ordinary algorithm when the lengths are equal.

http://sage.math.washington.edu/home/wbhart/poly-KS.png

Next I will try to figure out where those remaining red dots are coming from. The whole graph is actually a slightly darker blue than I expect. I did fix one issue, namely counting the maximum number of bits per coefficient outside the KS function and then again inside it; I now pass this data in.

But I think the byte packing might not be optimal, as there is quite dark blue in some parts of the KS region.

Bill.
|
From: William H. <ha...@ya...> - 2009-02-21 17:32:56
|
I figured out what the red line is on the left hand side of the multiplication graph:

http://sage.math.washington.edu/home/wbhart/poly-KS.png

It occurs for polynomials of length 15, where David Harvey's odd/even karatsuba is used. This algorithm is great for unbalanced operands, as it uses far fewer multiplies in general than ordinary karatsuba.

Instead of dividing the polynomials into a first half and a second half, it works by dividing them into odd exponent coefficients and even exponent coefficients.

However, when you work it through, you need the lengths of the original polynomials (at every level of the recursion) to be even. So David's implementation (which I borrowed for FLINT 2.0) uses a trick called peeling to make everything work. Basically, at each level, if the length of a polynomial is odd, one of the coefficients is ignored and the karatsuba recursion is carried out on what is left (which is of even length); then at the end a fixup is done to handle the product of the ignored coefficient with the other polynomial.

The problem is obvious. 15 is odd, so 1 is subtracted to make 14, which is 2*7. At the next level, 7 is odd, so 1 is subtracted to make 6, which is 2*3. At the next level 3 is odd, so 1 is subtracted to make 2, which is 2*1. So at every level peeling occurs, i.e. 15 is a worst case for this algorithm.

One way to fix this issue is to code up ordinary karatsuba for balanced operands. In that case the 15 x 15 polynomial multiplication gets split into two 8 x 8 multiplications and one 7 x 7. The 7 x 7 gets split into two 4 x 4 multiplications and one 3 x 3. The 3 x 3 gets split into two 2 x 2 and one 1 x 1 multiplication.

With the odd/even method one has:

     1 x  1 ->   1 mul
     2 x  2 ->   3 mul
     3 x  3 ->   8 mul
     6 x  6 ->  24 mul
     7 x  7 ->  37 mul
    14 x 14 -> 111 mul
    15 x 15 -> 140 mul

With the ordinary method one has:

     1 x  1 ->  1 mul
     2 x  2 ->  3 mul
     3 x  3 ->  7 mul
     4 x  4 ->  9 mul
     7 x  7 -> 25 mul
     8 x  8 -> 27 mul
    15 x 15 -> 79 mul

i.e. nearly half the multiplications, so nearly twice the speed for large coefficients.

Bill.
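The two recurrences behind these tables are easy to tabulate, and make it clear why 15 is a worst case. A minimal sketch (my own code, not David's or FLINT's): E(n) counts multiplies for odd/even karatsuba with peeling, where an odd length n costs E(n-1) plus n + (n-1) fixup products; K(n) counts multiplies for the ordinary balanced split into two ceil(n/2) x ceil(n/2) products and one floor(n/2) x floor(n/2) product.

    #include <stdio.h>

    /* Multiplies used by odd/even karatsuba with peeling on an n x n input.
       An odd length peels one coefficient: E(n-1) plus n + (n-1) extra
       products for the two fixup scalar multiplications. */
    static long E(long n)
    {
        if (n == 1) return 1;
        if (n % 2) return E(n - 1) + n + (n - 1);
        return 3 * E(n / 2);
    }

    /* Multiplies used by ordinary balanced karatsuba on an n x n input:
       two ceil(n/2) x ceil(n/2) products and one floor(n/2) x floor(n/2). */
    static long K(long n)
    {
        if (n == 1) return 1;
        return 2 * K((n + 1) / 2) + K(n / 2);
    }

    int main(void)
    {
        /* Reproduces the tables above, e.g. E(15) = 140, K(15) = 79. */
        for (long n = 1; n <= 15; n++)
            printf("%2ld x %2ld: odd/even %3ld, ordinary %3ld\n",
                   n, n, E(n), K(n));
        return 0;
    }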
|
From: Bill H. <goo...@go...> - 2009-02-18 05:18:49
|
The new F_mpz_poly module for FLINT 2.0 is progressing well. I've now got the byte packing code working and coded up the (non-truncating) polynomial multiplication functions. Here is the result of a profile comparing with Magma:

http://sage.math.washington.edu/home/wbhart/poly-KS.png

The dark red on the left is probably an indication that the processor clock speed is not exactly 2.8GHz on this machine (or the real time clock is slightly off), and is not important. The times should be identical with Magma until reaching the FFT region, as both FLINT and Magma must use GMP, and degree 0 polynomials are just integers.

The red dots in the middle are probably real, with the dots at the bottom of the middle probably tuning issues. The ones on the right of the middle I am not sure about. They may just be timing irregularities (which this machine has many of).

The red vertical line near the left is probably just a tuning issue. It's doubtless at the crossover between karatsuba and Schoenhage-Strassen, which I must have wrong. I'm surprised it is so red though.

There are also some red dots near the right of the diagram. Those are probably significant. They lie in a region where Kronecker segmentation is used after byte packing the coefficients of the polynomials into two large integers.

Note that below those dots is lots of strong blue. That is the bit packing region, and the problems we used to have there are completely gone. They were due to a cache issue related to the fact that Magma was storing a coefficient in 32 bits, not 2 limbs as the FLINT 1.0.x and 1.1.x series did. FLINT 2.0 uses just a single limb per coefficient, and the nice blue there vindicates this change.

But the red dots in the byte packing region are almost certainly due to the fact that once one gets out of the region where the output coefficients fit into a limb, one must switch to mpz_t's (in the new F_mpz_t format). That switching costs some cycles for extra comparisons and branches, and so for small coefficients in the byte packing region, FLINT 2.0 is slower than earlier versions of FLINT. I've worked hard on getting overheads down. At one point I tried an optimisation which should have made a difference (removing some unnecessary branches from the bit packing code). However, this didn't improve things at all. So I think it's difficult to see how I could expect to remove those dots. They may go away by fiddling with the tuning of the FFT itself.

At any rate, I'm really happy with how things have gone. The next step will be to work on the truncating multiplication routines, which should be very simple in comparison with the functions I just coded. Then I can move on to the division and GCD functions, etc.

Although I coded up David Harvey's KS2 algorithm for the bit packed region, I didn't include it in this profile yet. You probably wouldn't notice its effect, as things are already so blue at the bottom of the graph. But when I get around to it, I'll put up another graph with that included, and some of these other tuning issues resolved.

Bill.
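For readers who haven't seen the technique, here is a minimal sketch of Kronecker substitution using plain GMP, for the easy case of small non-negative coefficients (illustrative only; the FLINT code packs at bit or byte granularity and handles signed coefficients, which this sketch does not): pack each polynomial into one large integer by evaluating at 2^b, do a single big integer multiplication, and read the product coefficients back out of the b-bit fields.

    #include <stdio.h>
    #include <gmp.h>

    /* Pack poly[0..len-1] (each coefficient < 2^b, non-negative) into
       z = poly(2^b), working from the top coefficient down. */
    static void ks_pack(mpz_t z, const unsigned long *poly, long len,
                        unsigned long b)
    {
        mpz_set_ui(z, 0);
        for (long i = len - 1; i >= 0; i--)
        {
            mpz_mul_2exp(z, z, b);        /* z <<= b */
            mpz_add_ui(z, z, poly[i]);
        }
    }

    /* Unpack len b-bit fields of z into poly[0..len-1]; destroys z. */
    static void ks_unpack(unsigned long *poly, long len, mpz_t z,
                          unsigned long b)
    {
        mpz_t t;
        mpz_init(t);
        for (long i = 0; i < len; i++)
        {
            mpz_fdiv_r_2exp(t, z, b);     /* t = z mod 2^b */
            poly[i] = mpz_get_ui(t);
            mpz_fdiv_q_2exp(z, z, b);     /* z >>= b */
        }
        mpz_clear(t);
    }

    int main(void)
    {
        /* (1 + 2x + 3x^2) * (4 + 5x) with 16-bit fields, wide enough that
           output coefficients cannot overflow into their neighbours. */
        unsigned long f[3] = {1, 2, 3}, g[2] = {4, 5}, h[4];
        mpz_t zf, zg, zh;
        mpz_init(zf); mpz_init(zg); mpz_init(zh);
        ks_pack(zf, f, 3, 16);
        ks_pack(zg, g, 2, 16);
        mpz_mul(zh, zf, zg);              /* one big integer multiplication */
        ks_unpack(h, 4, zh, 16);
        for (int i = 0; i < 4; i++)
            printf("h[%d] = %lu\n", i, h[i]);   /* 4, 13, 22, 15 */
        mpz_clear(zf); mpz_clear(zg); mpz_clear(zh);
        return 0;
    }

The single mpz_mul is where all the time goes, which is why the polynomial multiplication inherits the asymptotics of the integer multiplication.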
|
From: Bill H. <goo...@go...> - 2009-02-15 20:45:42
|
The MPIR project is pleased to announce its first release, 0.9.0. MPIR is based on the GMP project (specifically version 4.2.1 of GMP, which was licensed LGPL v2+).

The main MPIR website is here: http://www.mpir.org/

A source tarball is available here: http://www.mpir.org/mpir-0.9.0.tar.gz

The main things we have been concentrating on for this release are improving build support for Apple and Sun systems (a project which will be ongoing), provision of a port to Microsoft MSVC, and incorporation of Jason Martin's Core 2 patches, Pierrick Gaudry's AMD64 patches and Niels Möller's GCD patches.

A full list of important changes can be found here: http://www.mpir.org/changes.html

A continuously expanding list of planned projects is available here: http://www.mpir.org/projects.html

We already have funding and promises of funding for some of these, and we hope that many new developers will join the project as volunteers in the coming months. Please ask on our development list if you are interested in contributing: http://groups.google.com/group/mpir-devel/topics?start=

We intend to have a very regular release schedule (at least every 3 months). Sage 3.3 will already use MPIR 0.9.0 (we have done extensive testing).

Please feel free to post this announcement on other lists where you think developers might be interested to hear about the MPIR release.

Bill.
|
From: Bill H. <goo...@go...> - 2009-02-12 19:19:17
|
I've now made the multi modular reduction functions in fmpz thread safe. This makes _fmpz_poly_mul_modular and fmpz_poly_mul_modular threadsafe. That means all of fmpz_poly is now threadsafe.

Bill.
|
From: Michael A. <mab...@go...> - 2009-02-11 20:30:44
|
On Wed, Feb 11, 2009 at 11:57 AM, Bill Hart <goo...@go...> wrote:

> FLINT has been linking with pthreads since about version 0.0.1, not
> that it is needed for anything.

Ok. No harm done here. pthread is available in 32 and 64 bit flavors for Windows, but I have doubts about the maturity.

> The only thing which needs to be defined by the compiler is __thread,
> and even then, I will disable this for non-gcc compatible compilers.
> This means that FLINT will not be threadsafe when compiled with
> non-gcc compatible compilers, unless they offer something similar to
> gcc in the way of thread local variables (most compilers do).

Ok. A FLINT without threads with MSVC is better than no FLINT at all. Similar things do apply to David's parallel Bernoulli code in Sage, which has compile issues on Cygwin at the moment, but for now I will likely just turn off threading until I fix the issue :)

> I don't know about Windows compilers. I've not checked if they offer
> thread local variables, and if so, how they do it.

Yes, I think TLS is doable with MSVC, but I won't claim to know it until I try.

> This is only the current development version, not the released
> version. It's only temporary. I will check for the gcc version, and if
> it is not 2.4 or higher, the user will not be able to use OpenMP with
> FLINT.

4.2, but yes.

> Whoopdidoo, they couldn't use OpenMP anyway, because it is
> broken before gcc 4.2.4.

And some people do claim that OpenMP with other compilers like the PathScale or Intel compiler is still superior to the version in gcc 4.3, too :)

> Obviously FLINT will still compile without OpenMP support on earlier
> gcc's.

Excellent, that is all I wanted to hear.

Cheers,

Michael
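The kind of compile-time guard being discussed here is straightforward with gcc's predefined version macros. A sketch (FLINT_USE_OPENMP is a made-up name for illustration; the actual FLINT configury may differ):

    #include <stdio.h>

    /* Enable OpenMP code paths only for gcc >= 4.2.4 built with -fopenmp
       (_OPENMP defined); FLINT_USE_OPENMP is a hypothetical macro name. */
    #if defined(_OPENMP) && defined(__GNUC__) && \
        (__GNUC__ > 4 || (__GNUC__ == 4 && \
         (__GNUC_MINOR__ > 2 || \
          (__GNUC_MINOR__ == 2 && __GNUC_PATCHLEVEL__ >= 4))))
    #define FLINT_USE_OPENMP 1
    #else
    #define FLINT_USE_OPENMP 0
    #endif

    int main(void)
    {
        printf("OpenMP paths enabled: %d\n", FLINT_USE_OPENMP);
        return 0;
    }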
|
From: Bill H. <goo...@go...> - 2009-02-11 19:57:16
|
2009/2/11 Michael Abshoff <mab...@go...>:

> Cool. Can one turn off the use of threads or is this code deeply
> interwoven with the rest of the code? I am asking because I am keeping
> an eye out for the native Windows port.

FLINT has been linking with pthreads since about version 0.0.1, not that it is needed for anything.

The only thing which needs to be defined by the compiler is __thread, and even then, I will disable this for non-gcc compatible compilers. This means that FLINT will not be threadsafe when compiled with non-gcc compatible compilers, unless they offer something similar to gcc in the way of thread local variables (most compilers do). I don't know about Windows compilers. I've not checked if they offer thread local variables, and if so, how they do it.

> Mhh, mandating gcc 4.2 or higher would be a problem for Sage since
> currently we only need C99 support, i.e. gcc 3.4.x or higher. If this
> dependency comes out of the need for OpenMP and it cannot be turned
> off via some config options this is a major issue.

This is only the current development version, not the released version. It's only temporary. I will check for the gcc version, and if it is not 2.4 or higher, the user will not be able to use OpenMP with FLINT. Whoopdidoo, they couldn't use OpenMP anyway, because it is broken before gcc 4.2.4. Obviously FLINT will still compile without OpenMP support on earlier gcc's.

Bill.
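To illustrate the __thread mechanism being relied on (a generic sketch, not FLINT code): gcc gives each thread its own copy of a __thread variable, so per-thread state such as a memory manager's bookkeeping can be kept without locking. Other compilers spell it differently, e.g. MSVC's __declspec(thread).

    #include <stdio.h>
    #include <pthread.h>

    /* Each thread sees its own copy of this counter (gcc thread-local
       storage), so the increments below need no lock. */
    static __thread unsigned long allocs = 0;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000; i++) allocs++;
        printf("thread %ld did %lu allocations\n", (long) arg, allocs);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *) i);
        for (long i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;   /* every thread prints 1000: no shared counter */
    }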
|
From: Michael A. <mab...@go...> - 2009-02-11 19:48:34
|
On Wed, Feb 11, 2009 at 11:21 AM, Bill Hart <goo...@go...> wrote:

Hi Bill,

> I've just released FLINT 1.1.1. This is a service release to fix some
> SERIOUS issues in FLINT 1.1.0. I strongly recommend upgrading!

I will get this into Sage 3.3.rc1 ASAP. As you pointed out, Sage is only affected by some of these issues, but there is no point in waiting to upgrade.

Cheers,

Michael
|
From: Michael A. <mab...@go...> - 2009-02-11 19:47:29
|
On Wed, Feb 11, 2009 at 11:39 AM, Bill Hart <goo...@go...> wrote:

Hi Bill,

> I have just committed changes to the development version of FLINT to
> make the test_support and memory_manager code thread safe.

Cool. Can one turn off the use of threads or is this code deeply interwoven with the rest of the code? I am asking because I am keeping an eye out for the native Windows port.

> Note gcc 4.2.4 or higher is currently needed to compile the
> development version of FLINT. I'll put some checking in for lower
> versions in FLINT 1.2 before releasing (and OpenMP stuff just won't be
> available to people with older versions of gcc).

Mhh, mandating gcc 4.2 or higher would be a problem for Sage since currently we only need C99 support, i.e. gcc 3.4.x or higher. If this dependency comes out of the need for OpenMP and it cannot be turned off via some config options this is a major issue.

> I haven't decided whether to have the tests run in parallel or not.
> Currently some of them run much faster in parallel, and some much
> slower.

:)

Cheers,

Michael
|
From: Bill H. <goo...@go...> - 2009-02-11 19:39:14
|
I have just committed changes to the development version of FLINT to make the test_support and memory_manager code thread safe.

I've verified that all tests in fmpz_poly-test can now run in parallel without issues and fixed numerous issues with thread safety in that module. All of the functions in fmpz_poly are fully threadsafe, with the exception of:

* _fmpz_poly_mul_modular is not threadsafe (it stores static information - Gonzalo and I are working on fixing it).

Some of the biggest issues were:

* mpn_random2 is not threadsafe and segfaults
* the stack based memory manager in FLINT was not threadsafe, as it did not set up per thread stacks
* use of randstate_t's in test support without initialising them per thread was not threadsafe
* everything needs cleaning up when joining threads (i.e. the per thread memory managers and random states), otherwise you get a memory leak - there is now a macro in flint.h for doing this cleanup, though OpenMP does not seem to have a mechanism for doing automatic cleanup on joining threads, so this is the user's responsibility (see the sketch below)

I have not checked the thread safety of any other part of FLINT; however, I am unaware of anything else that is not threadsafe except the random functions in long_extras and the stack based mpz_t allocator in the mpz_extras module (which no one uses anyway).

It is my hope that FLINT 1.2 will be fully threadsafe.

Note gcc 4.2.4 or higher is currently needed to compile the development version of FLINT. I'll put some checking in for lower versions in FLINT 1.2 before releasing (and OpenMP stuff just won't be available to people with older versions of gcc).

I haven't decided whether to have the tests run in parallel or not. Currently some of them run much faster in parallel, and some much slower.

Bill.
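The cleanup-on-join point deserves a concrete illustration. A minimal sketch (the names thread_init and thread_cleanup are mine, not the flint.h macro): since OpenMP provides no hook that runs when its worker threads are joined, each thread must release its own per-thread state before the parallel region ends.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Hypothetical per-thread state of the kind described above. */
    static __thread void *thread_heap = NULL;

    static void thread_init(void)    { thread_heap = malloc(1024); }
    static void thread_cleanup(void) { free(thread_heap); thread_heap = NULL; }

    int main(void)
    {
        #pragma omp parallel
        {
            thread_init();
            /* ... per-thread work using thread_heap ... */
            printf("thread %d has its own heap\n", omp_get_thread_num());

            /* No automatic hook runs on join, so each thread cleans up
               explicitly before the parallel region ends. */
            thread_cleanup();
        }
        return 0;
    }

Compile with gcc -fopenmp; omitting the thread_cleanup() call leaks one heap per worker thread, which is exactly the leak described above.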
|
From: Bill H. <goo...@go...> - 2009-02-11 19:21:47
|
Hi all,

I've just released FLINT 1.1.1. This is a service release to fix some SERIOUS issues in FLINT 1.1.0. I strongly recommend upgrading!

The issues fixed were as follows:

* fmpz_poly_gcd had a serious bug in it (actually in the function fmpz_poly_gcd_heuristic) which caused it to return the ****WRONG ANSWER**** in certain cases, specifically when aliasing one of the inputs with the output and where the heuristic gcd was not sufficient to compute the gcd. In this case the heuristic gcd would corrupt the output (which is aliased with an input) and pass the corrupted inputs to another gcd algorithm, which would then get things wrong. This occurs rarely, as the heuristic gcd fails to compute the gcd only in rare cases. I was unable to get it to return a wrong answer except in the case where the inputs both had a Gaussian content whose gcd was non-trivial. However, it may theoretically occur in other cases.

* fmpz_poly_gcd_subresultant needed a special case to deal with length 1 polynomials. This probably did not cause a bug, but was fixed anyway as a precaution (and for performance reasons).

* _fmpz_poly_scalar_mul_fmpz could cause a segmentation fault. This was due to the fact that GMP, when multiplying integers of n1 and n2 limbs, automatically takes n1 + n2 limbs to write the result, even if they are not all needed. I changed the interface of the precached FFT multiplication some time ago to use these same semantics, but did not adjust the scalar multiplication function accordingly.

* test__fmpz_poly_scalar_div_fmpz did not allocate sufficient space in the output and so could segfault in very rare cases.

* test_fmpz_poly_scalar_div_fmpz did not allocate sufficient space in the output and so could segfault in very rare cases.

* test_fmpz_poly_scalar_div_mpz did not allocate sufficient space in the output and so could segfault in very rare cases.

Bill.
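The gcd_heuristic failure mode is a generic aliasing pitfall and easy to reproduce in miniature. A toy sketch (not the actual FLINT code; machine-word gcds stand in for the polynomial routines): a fast path that writes partial results into its output before it is sure to succeed corrupts an aliased input, which the fallback then reads.

    #include <stdio.h>

    static long gcd(long a, long b)
    {
        while (b) { long t = a % b; a = b; b = t; }
        return a;
    }

    /* Toy heuristic: scribbles on *res before deciding, then gives up
       whenever b is even, forcing the caller onto a fallback path. */
    static int gcd_heuristic(long *res, const long *a, const long *b)
    {
        long x = *a, y = *b;
        *res = 0;                   /* partial work lands in the output... */
        if (y % 2 == 0) return 0;   /* ...and then the heuristic gives up */
        *res = gcd(x, y);
        return 1;
    }

    static void gcd_fallback(long *res, const long *a, const long *b)
    {
        *res = gcd(*a, *b);
    }

    /* BROKEN if res aliases a or b: the failed heuristic has already
       corrupted an input before the fallback reads it. */
    static void gcd_broken(long *res, const long *a, const long *b)
    {
        if (!gcd_heuristic(res, a, b)) gcd_fallback(res, a, b);
    }

    /* FIXED: compute into a temporary and copy out only at the end. */
    static void gcd_fixed(long *res, const long *a, const long *b)
    {
        long tmp;
        if (!gcd_heuristic(&tmp, a, b)) gcd_fallback(&tmp, a, b);
        *res = tmp;
    }

    int main(void)
    {
        long a = 12, b = 8;
        gcd_broken(&a, &a, &b);     /* output aliased with input a */
        printf("broken: %ld\n", a); /* prints 8 = gcd(0, 8): wrong */
        a = 12;
        gcd_fixed(&a, &a, &b);
        printf("fixed:  %ld\n", a); /* prints 4 = gcd(12, 8): right */
        return 0;
    }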
|
From: Bill H. <goo...@go...> - 2009-02-05 02:02:33
|
Hi all, Michael Abshoff suggested I post something here about what I've been working on as there may be some people with some expertise who can help me with a problem I have encountered, to do with heaps. I've been implementing Pearce and Monagan's heap algorithm for multiplying multivariate polynomials in FLINT. See their paper here: http://www.cecm.sfu.ca/~monaganm/papers/PseudoHeap.pdf Roman Pearce has also been quite helpful in giving me clues. However one aspect of the algorithm can still be improved and there may be people on this list who have some ideas. First I'll give the timings. I downloaded the latest sdmp and with Roman's help, did some timings of sdmp on sage.math (note sdmp is not open source and runs in Maple). The two benchmarks I have been focusing on are: 1) Fateman's benchmark: multiply f * g where f = (1+x+y+z+t)^20 and g = f + 1. 2) Very sparse 5 variable benchmark: f * g where f = (1+x+y^2+z^3+t^5+u^7)^12 and g = (1+u+t^2+z^3+y^5+x^7)^12 Currently sdmp takes 2.7s for Fateman's benchmark and 2.8s for the very sparse 5 variable benchmark on sage.math (a 2.66 GHz Xeon - Dunnington core). FLINT takes 2.8s for Fateman's benchmark and 4.8s for the very sparse 5 variable benchmark. Coefficient arithmetic is not relevant here. Without doing any coefficient arithmetic at all, and just computing the monomials of the output the 5 var benchmark still takes 4.2s in FLINT. The algorithm for computing the monomials uses a heap. Basically the heap stores pairs of pointers, one to a monomial of f and one to a monomial of g. At each iteration the root of the heap is removed and the corresponding monomials are multiplied and written out. After terms f_i and g_j are multiplied the pair f_i, g_{j+1} is then added to the heap to replace the pair f_i, g_j. The heap is basically implemented as an array indexed from m=1. The children of element m of the heap are at array indices 2*m and 2*m + 1. But to get really good timings one allows multiple pairs to "chain" at the same location in the heap. This happens if one is adding f_i, g_j to the heap where the product of those monomials will be equal to an existing pair in the heap. Chaining pairs which render equal product monomials in this way means that large numbers of equal monomials can be computed for the cost of a single heap extraction at basically the same time, for which there is a dramatic time saving. It also makes the heap much smaller, as now many entries can be chained all at the same location of the heap. That makes searching through the heap much less time consuming. Coefficient arithmetic can also be made faster by accumulating the sum of products yielding the same monomial in a single temporary. In our case, chaining is important from the point of view of making the heap smaller, but not from the point of view of speeding up coefficient arithmetic (as mentioned, coefficient arithmetic is clearly not the cause of the inefficiency in the current FLINT code). The lower number of heap extractions probably also helps, but in the sparse case very little chaining actually occurs, so it isn't clear if this is even relevant. Pearce and Monagan mention numerous improvements in their paper which are designed to improve the efficiency of heap chaining. I believe I have implemented all these correctly. The difference between the benchmarks 1 and 2 is the sparsity. Fateman's benchmark is quite dense. 
The difference between benchmarks 1 and 2 is the sparsity. Fateman's benchmark is quite dense. Each output term is equal to the sum of about 831 products of pairs of input monomials f_i * g_j on average, whereas for the very sparse 5 var bench only 3 pairs contribute to each output term on average.

Now, chaining is not supposed to be perfect. One doesn't get all 831 pairs chained at the root of the heap when it comes time to compute that output entry. However, all the contributing pairs will be chained in a small number of heap locations at the top of the heap (though not necessarily right on the root). As the heap is designed to deliver pairs in order, all pairs for a given output term will come out before pairs for the next term; they just may not all be chained at exactly the same (root) entry.

My hypothesis is that in the very sparse case, the generic case is that each output term is built from a small number (2-3) of **unchained** individual entries from the top of the heap. This means that in order to improve the performance, one must have really fast heap handling code. That is where someone else may be able to help me. It *is* possible that I am not getting sufficient heap chaining, but then the question would be, "why not"? How would one improve things so that more heap chaining occurs on average for the sparse benchmark? I am assuming in the following that this is not the problem.

Here is what I currently do when a new entry is added to the heap (a sketch of steps 1 and 2 appears at the end of this message).

1) As per the recommendation in Pearce and Monagan, after the previous term has been fully computed, there will be a single gap in the heap at the root. Instead of reheapifying, we insert a new term in this spot. This entry is then filtered down through the heap. At first we need to find where in the heap this term should go. We swap it down through the heap until it reaches a spot where it is bigger than its parent (unlike Pearce and Monagan, I made the arbitrary choice to have the smallest entries at the top of the heap). A number of comparisons are required at each level. We need to compare with both children of the current node. We want to know if the new term is equal to one of the children, or if it is perhaps less than both children. Essentially I determine the smallest child first, with a single comparison. If the new term is smaller than the smallest child, it is already in the right position. If not, then I do a comparison for equality with this smallest child. If bigger, I swap it down. Checking equality with the largest child slows things down, so I don't do that.

2) Now the new node is in the right place, except that something can go wrong. If the new term actually equals an existing term in the heap, it gets chained with that term. That leaves a gap in the heap. It is from this point on that I am unsure how to proceed. At present I fill this gap with the entry at the bottom of the heap. This then gets filtered up first, then back down until it reaches the bottom. Filtering up only costs one comparison per level, as one only needs to compare with the parent. Equality is checked on the way up, but not on the way back down, as that is apparently slower. If an equal entry is found we chain, leaving another gap, and so the same procedure has to be iterated, placing the heap bottom in the gap, etc., until no gap remains. This all seems very inefficient, but I don't see a better way of filling gaps. If anyone has any ideas, please let me know.

3) If more than one entry is added to the heap, pretty much the same procedure is applied, except this time we don't start with an empty spot at the top of the heap.
This time, if equality occurs then we are happy; however, if no equal terms are encountered on the way down, the new entry eventually pushes a term out the bottom, dislodging it. We extend the heap by one place and put the dislodged entry at the new heap bottom, then sift it up first, then back down again, into place. In this case it is faster not to do any tests for equality, so no chaining occurs at this step and we avoid having to fill gaps.

I am convinced the problem with the sparse benchmark lies with step 2: the filling of empty spots caused by chaining, and having to iterate this procedure until the empty spot is vanquished. All swaps in the above are done virtually (just move the entry in the heap into the gap and keep the entry we are trying to place in a temporary), so I believe these swaps have all been done efficiently.

Roman mentioned having to use particular compiler options to get his heap code compiled optimally, and he mentioned that gotos made some parts of the code faster. I've experimented with these, but that is only giving 1% or so, not a factor of nearly 2. Whatever is wrong is not minor, but a major pain in the proverbial. I'm pretty convinced it is number 2 above.

Did anyone pay enough attention in CS classes to know how to do this more efficiently?

Bill. |
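To make steps 1 and 2 concrete, here is a minimal C sketch, reusing heap_entry_t, pair_t and CMP from the layout sketch above. The names and details are illustrative, not FLINT's actual code; smaller monomials live at the top, per the convention stated in step 1.

    /* Step 1: insert entry e into the gap at slot m of a heap of size n
       and sift it down. Returns the slot of a new gap if e got chained
       to an equal entry, or 0 if no gap remains. */
    long sift_down(heap_entry_t *heap, long n, long m, heap_entry_t e)
    {
        while (2*m <= n)
        {
            long c = 2*m;        /* find the smaller child: one compare */
            if (c + 1 <= n && CMP(heap[c + 1].exp, heap[c].exp) < 0)
                c++;
            if (CMP(e.exp, heap[c].exp) < 0)
                break;                     /* e belongs at slot m */
            if (e.exp == heap[c].exp)      /* equal: chain e's pair here, */
            {                              /* leaving a gap at slot m */
                e.chain->next = heap[c].chain;
                heap[c].chain = e.chain;   /* assumes e carries one pair */
                return m;
            }
            heap[m] = heap[c];             /* virtual swap: child moves up */
            m = c;
        }
        heap[m] = e;
        return 0;
    }

    /* Step 2: fill the gap at slot m with the bottom heap entry, sifting
       it up (equality checked on the way up) and then back down; if it
       chains, a new gap appears and the procedure iterates. */
    void fill_gap(heap_entry_t *heap, long *n, long m)
    {
        while (m != 0)
        {
            if (m == *n)                   /* gap is the bottom slot */
            {
                (*n)--;
                return;
            }
            heap_entry_t e = heap[(*n)--]; /* take the bottom entry */

            while (m > 1 && CMP(e.exp, heap[m/2].exp) < 0)
            {
                heap[m] = heap[m/2];       /* virtual swap: parent down */
                m = m/2;
            }
            if (m > 1 && e.exp == heap[m/2].exp)
            {
                e.chain->next = heap[m/2].chain;  /* chain; gap stays at m */
                heap[m/2].chain = e.chain;
            }
            else
            {
                /* Sift back down into place. The text above says the real
                   code skips equality tests on the way down, so reusing
                   sift_down makes this sketch slightly more eager to
                   chain than the original. */
                m = sift_down(heap, *n, m, e);
            }
        }
    }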
|
From: William H. <ha...@ya...> - 2009-01-25 19:13:25
|
That should say multivariate polynomial multiplication, not arithmetic. Bill.

--- On Mon, 1/26/09, Bill Hart <goo...@go...> wrote:

> From: Bill Hart <goo...@go...>
> Subject: [flint-devel] Multivariate polynomial arithmetic
> To: "Development list for FLINT" <fli...@li...>
> Date: Monday, January 26, 2009, 6:12 AM
>
> I've implemented a basic algorithm for doing multivariate polynomial
> arithmetic (I'm thinking of including a module for this in FLINT 2.0
> - due out towards the end of the year).
>
> It computes Fateman's benchmark f*g where f = d^20, g = d^20 + 1 with
> d = (1+x+y+z+t) in 23.7s on a 2.66GHz Xeon. Pari takes about 38.5s
> and it's probably comparable if not better than Magma.
>
> Next I'll work on improving the cache locality of the algorithm.
>
> The real times to try and equal are those of Trip and sdmp, which can
> both do it in about 2.5s. Trip is/will be open source, I believe.
> sdmp is not, as far as I know. We don't really need to compete with
> either, as they aren't aimed at the same sorts of problems as FLINT
> (Trip is used for Celestial Mechanics and sdmp is available only as a
> binary to link against Maple and only performs a handful of
> operations). But it would be nice to have times vaguely comparable
> with theirs. Another nice open source package for doing this sort of
> thing is giac. I don't have an idea of times for that yet.
>
> I haven't committed the code yet. But presuming someone wanted to
> examine it, I would clean it up and commit it.
>
> Bill. |
|
From: Bill H. <goo...@go...> - 2009-01-25 19:12:24
|
I've implemented a basic algorithm for doing multivariate polynomial arithmetic (I'm thinking of including a module for this in FLINT 2.0 - due out towards the end of the year).

It computes Fateman's benchmark f*g where f = d^20, g = d^20 + 1 with d = (1+x+y+z+t) in 23.7s on a 2.66GHz Xeon. Pari takes about 38.5s and it's probably comparable if not better than Magma.

Next I'll work on improving the cache locality of the algorithm.

The real times to try and equal are those of Trip and sdmp, which can both do it in about 2.5s. Trip is/will be open source, I believe. sdmp is not, as far as I know. We don't really need to compete with either, as they aren't aimed at the same sorts of problems as FLINT (Trip is used for Celestial Mechanics and sdmp is available only as a binary to link against Maple and only performs a handful of operations). But it would be nice to have times vaguely comparable with theirs. Another nice open source package for doing this sort of thing is giac. I don't have an idea of times for that yet.

I haven't committed the code yet. But presuming someone wanted to examine it, I would clean it up and commit it.

Bill. |
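As a sanity check on the density of this benchmark: a dense polynomial of total degree 20 in 4 variables has binomial(24,4) = 10626 terms, and the degree-40 product has binomial(44,4) = 135751, so on average about 10626^2 / 135751, i.e. roughly 831, pairwise products land on each output term (the figure quoted in the heap discussion earlier in this archive). A quick sketch to verify:

    #include <stdio.h>

    /* A dense polynomial of total degree d in v variables has
       binomial(d + v, v) terms. Exact at every step: after iteration i,
       r holds binomial(n - k + i, i). */
    unsigned long binom(unsigned long n, unsigned long k)
    {
        unsigned long r = 1, i;
        for (i = 1; i <= k; i++)
            r = r * (n - k + i) / i;
        return r;
    }

    int main(void)
    {
        unsigned long tf = binom(24, 4);  /* terms of (1+x+y+z+t)^20: 10626 */
        unsigned long tp = binom(44, 4);  /* terms of the product: 135751   */
        printf("%lu^2 / %lu = %.1f products per output term\n",
               tf, tp, (double) tf * tf / tp);   /* about 831.8 */
        return 0;
    }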
|
From: Michael A. <mab...@go...> - 2009-01-16 03:12:52
|
On Thu, Jan 15, 2009 at 6:56 PM, Bill Hart <goo...@go...> wrote:
> 2009/1/16 Michael Abshoff <mab...@go...>:

<SNIP>

>> Ok, it seems like I should be working on the development version
>> then, and if we get that to work we can always back port the
>> relevant changes. This allows for more flexibility and won't impact
>> the stable branch.
>
> Any changes to trunk without changing flint-1.1 will not appear in
> Sage until FLINT 2.0 is released.

Yes, but I can see porting patches to FLINT 1.1 once they have been worked out in the FLINT 2.0 trunk.

> The vast majority of files are identical between the two at this
> moment in time (which diff can check for you). If you change one
> version, you may as well change the other. Of the remaining files,
> most appear only in trunk. There are a handful of files which appear
> in both trunk and flint-1.1 and differ, but these are small enough in
> number to deal with separately at present.

Ok, good to know.

>> But in the end I can certainly live with
>>
>> ulong
>> slong
>
> I used ulong because it is what David uses in zn_poly. It is
> convenient because it is short to type. I don't know what he uses for
> long. Probably he uses long.

That sounds like it will be another library to revisit. Let's hope he abstracted out all the types, but I haven't looked.

>> Ok. There is a new valgrind tool called exp-ptrcheck that catches
>> the use of 32 bit pointers when interacting with 64 bit pointers. It
>> doesn't run on anything but Linux, and I can't see a way yet to make
>> long on 64 bit platforms 32 bits without impacting all the other
>> things FLINT uses. There is no LLP Linux box AFAIK, so maybe using
>> purify on Windows might lend some assistance. In the end it might
>> have to be good old fashioned debugging :)
>
> Yeah, I don't have any easy answers here. The problem is, it is not
> going to be enough to just make sure all the tests pass. The tests
> are not designed to pick up problems with pointer arithmetic.
> Presumably the compiler will complain though. If it doesn't, send a
> strongly worded letter to the compiler writer.
>
> I suppose the compiler may not have a problem with using a 32 bit
> ulong for pointer arithmetic between 64 bit pointers, though. This
> would be a serious bug which would affect FLINT in real world use but
> would not fail a single test. The only way to fix that would be to
> check every occurrence of * in FLINT and every occurrence of every
> type which is typedef'd to some blah * or blah **. If that is the
> case, you're screwed, and see you in a year or two, I'm off to the
> Bahamas.

Well, this was meant to test if there is still any long pointer lurking around somewhere. The compiler should pick all of them up, but we will see.

> I expect the problem will be casts, which no doubt the compiler will
> be happy with regardless of whether they are correct. But casts
> should always be of the form (ulong *), (mp_limb_t *), etc. or
> (ulong), etc., depending on which direction the cast goes. So if you
> look for all occurrences of (ulong), (mp_limb_t), (F_mpz_t) and *)
> and **) you should get everything the compiler misses.

I guess learning about all the gory internals of FLINT can't be a bad thing :)

> Bill.

Cheers,

Michael |
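The pointer-width hazard discussed above, as a minimal sketch (assuming an LLP64-style platform where long is 32 bits but pointers are 64 bits; no FLINT code is reproduced here):

    #include <stddef.h>
    #include <stdint.h>

    /* On an LLP64 platform (e.g. 64 bit Windows) unsigned long is 32
       bits while pointers are 64 bits, so carrying a pointer difference
       in a 32 bit ulong silently truncates once the distance exceeds
       2^32 elements -- no test failure, and no compiler error is
       required for the cast. */
    typedef unsigned long ulong;        /* 32 bits under LLP64 */

    ptrdiff_t diff_ok(const uint64_t *hi, const uint64_t *lo)
    {
        return hi - lo;                 /* ptrdiff_t is 64 bits here */
    }

    ulong diff_lossy(const uint64_t *hi, const uint64_t *lo)
    {
        return (ulong) (hi - lo);       /* truncates large differences */
    }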
|
From: Michael A. <mab...@go...> - 2009-01-16 03:09:49
|
On Thu, Jan 15, 2009 at 5:36 PM, Bill Hart <goo...@go...> wrote:
> Damn, I was hoping to get to SD12, but I'm going to miss it. Oh well,
> have fun.
>
> Bill.

Well, it will be a 72-hour bug-squashing marathon, so the best kind of Sage Days :)

Cheers,

Michael |
|
From: Bill H. <goo...@go...> - 2009-01-16 02:56:59
|
2009/1/16 Michael Abshoff <mab...@go...>:
> On Thu, Jan 15, 2009 at 5:15 PM, Bill Hart <goo...@go...> wrote:
>
> Hi Bill,
>
>> It's ok to keep the discussion on list, but you are also welcome to
>> take it offline.
>
> I only meant to discuss the details of potentially meeting up at UW
> to work on this. All the technical discussion should happen on list.
>
>> You are welcome to check out FLINT and start modifying files if you
>> want. One complication is that modifications for the FLINT 1.1
>> series need to be made to trunk *and* the flint-1.1 branch. There is
>> no way to merge the changes to both automatically, as the two
>> directories differ. Note that there are more files in the trunk than
>> the flint-1.1 branch, but all files that appear in flint-1.1 also
>> appear in trunk.
>
> Ok, it seems like I should be working on the development version
> then, and if we get that to work we can always back port the relevant
> changes. This allows for more flexibility and won't impact the stable
> branch.

Any changes to trunk without changing flint-1.1 will not appear in Sage until FLINT 2.0 is released. The vast majority of files are identical between the two at this moment in time (which diff can check for you). If you change one version, you may as well change the other. Of the remaining files, most appear only in trunk. There are a handful of files which appear in both trunk and flint-1.1 and differ, but these are small enough in number to deal with separately at present.

>> I am the only person committing code at present and I am likely to
>> spend time only in the files F_zmod_poly.c/h, F_zmod_poly-test.c and
>> packed_vec.c/h (which only exist in trunk) for the next couple of
>> days. Otherwise you are welcome to edit the other files as you wish.
>> When it comes to modifying the files I am currently working on I'll
>> do a commit so you can work on them.
>
> Sure. As mentioned in the other mail, my time for the next 10 or so
> days is booked.
>
>> I will enable write access to the flint-svn via your sourceforge
>> username in a few moments. But can I ask that you tell me which
>> files you're modifying, two or three at a time, so I can complain if
>> necessary. :-)
>
> Thanks, will do.
>
>> You will also need to replace long with something. I don't know what
>> to call it. slong will probably have to do, though it is very close
>> to a rude Australian swear word,
>
> Not only Australian :)
>
> But I am wondering if using something closer to a C99 unsigned long
> and signed long type might make the nature of ulong and "slong" more
> clear. But I think this will cause confusion with the underlying gmp
> types. I need to think about this more. Either way, signed and
> unsigned long types should be consistent in naming convention if
> possible. But in the end I can certainly live with
>
> ulong
> slong

I used ulong because it is what David uses in zn_poly. It is convenient because it is short to type. I don't know what he uses for long. Probably he uses long.

>> i.e. FLINT should use ulong instead of unsigned long and slong
>> instead of long throughout. Finding where pointers are assigned to a
>> ulong will be difficult. Looking for all occurrences of **, fmpz_t
>> and ****alloc*** (e.g. flint_heap_alloc, malloc, flint_stack_alloc,
>> flint_stack_alloc_small, flint_stack_alloc_bytes, etc.) should get
>> many of them.
>
> Ok. There is a new valgrind tool called exp-ptrcheck that catches the
> use of 32 bit pointers when interacting with 64 bit pointers. It
> doesn't run on anything but Linux, and I can't see a way yet to make
> long on 64 bit platforms 32 bits without impacting all the other
> things FLINT uses. There is no LLP Linux box AFAIK, so maybe using
> purify on Windows might lend some assistance. In the end it might
> have to be good old fashioned debugging :)

Yeah, I don't have any easy answers here. The problem is, it is not going to be enough to just make sure all the tests pass. The tests are not designed to pick up problems with pointer arithmetic. Presumably the compiler will complain though. If it doesn't, send a strongly worded letter to the compiler writer.

I suppose the compiler may not have a problem with using a 32 bit ulong for pointer arithmetic between 64 bit pointers, though. This would be a serious bug which would affect FLINT in real world use but would not fail a single test. The only way to fix that would be to check every occurrence of * in FLINT and every occurrence of every type which is typedef'd to some blah * or blah **. If that is the case, you're screwed, and see you in a year or two, I'm off to the Bahamas.

I expect the problem will be casts, which no doubt the compiler will be happy with regardless of whether they are correct. But casts should always be of the form (ulong *), (mp_limb_t *), etc. or (ulong), etc., depending on which direction the cast goes. So if you look for all occurrences of (ulong), (mp_limb_t), (F_mpz_t) and *) and **) you should get everything the compiler misses.

Bill. |
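For concreteness, the convention being settled on might look something like this. A sketch only; the compile-time guard is my own assumption, not anything in FLINT:

    #include <limits.h>

    /* Central typedefs so there is one place to change if a platform
       defines long unusually (e.g. LLP64 Windows). */
    typedef unsigned long ulong;
    typedef long slong;

    /* Assumed guard (not FLINT's): refuse platforms where ulong cannot
       hold a limb-sized value, so limb and pointer code fails loudly at
       compile time instead of corrupting data at run time. */
    #if (ULONG_MAX >> 31) == 1
    #error "unsigned long is 32 bits; this code assumes it matches the limb size"
    #endif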