From: Wolfgang W. <ww...@gm...> - 2005-05-26 12:40:02
On Thursday 26 May 2005 06:19, ajc...@en... wrote:
> I've updated the tests
> - Changed constants single precision (neither float or double seems to
>   introduce any casts in the assembly, now)

Yes, as I verified, the compiler automatically does float -> double
widening at compile time, while double -> float conversion is done at
run time.

> - Added dependency between successive iterations for each test
>

I slightly changed that again: Instead of having the same variable on
both sides (like in "a=a*b"), I introduced 2 tests: one for "a*=b" and
one for (a=b*c, b=c*a, c=a*b). This is because the result of an
operation commonly does not overwrite one of the input variables but is
stored somewhere else, which has an impact on compiler-generated
temporaries. The overall picture of the benchmarks is not affected very
much by this.

> These changes make a substantial difference on my system:
>
> --AMD 64 3500+ 2.2Ghz
>

Ah, great - you have a fairly fast AMD64!
Could you please run the tests in src/lib/threads and send me the
output? (Also compiler & kernel version.) I'd be very much interested in
the switch and lock times on Linux/AMD64. (The timings on FreeBSD/AMD64
are not so good and the question is whether it's the architecture or the
operating system. I don't have access to an idle Linux/AMD64 system.)

> So, I suppose we can conclude that doubles really are preferred to floats
> for straight arithmetic, at least on AMD64, for algorithms with only
> additions and multiplications.
>

It seems logical that 64 bit doubles are the natural data type for
64 bit platforms and preferable over floats as long as storage size does
not matter. However, on the 32bit Athlon, float is considerably faster
than double in several of the tests (the number in parentheses in the
timings is the dbl/flt ratio).
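For illustration only, the two dependency patterns described above could look roughly like this. This is a minimal sketch and not the benchmark code from CVS: the names chain_inplace and chain_rotating are made up, the loops work on any type T that supports * and *= (plain scalars here rather than vector3 or matrix3), and the timing harness is left out.

```cpp
#include <cassert>

// Hypothetical helper (not from the original benchmark):
// in-place pattern "a *= b". Each iteration depends on the previous
// one because the result overwrites one of its own inputs.
template <typename T>
T chain_inplace(long iters, T a, T b) {
    assert(iters >= 0);
    for (long i = 0; i < iters; ++i)
        a *= b;
    return a;
}

// Hypothetical helper (not from the original benchmark):
// rotating pattern (a=b*c, b=c*a, c=a*b). Each result is stored in a
// variable that is not an input of that operation, yet the three
// statements still form a dependency chain across iterations.
template <typename T>
T chain_rotating(long iters, T a, T b, T c) {
    assert(iters >= 0);
    for (long i = 0; i < iters; ++i) {
        a = b * c;
        b = c * a;
        c = a * b;
    }
    return c;  // return the last value written so the work is not dead code
}
```

For example, chain_inplace<float>(10, 1.0f, 2.0f) yields 1024. Note that with scalar start values other than 1 the results grow or shrink very quickly, so a real test would need inputs that keep the values in a sane range.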
Here are updated timings:

--<AthlonXP 1.47GHz Linux-2.6.11 gcc-4.1.0, NPTL>----------------------------
(nothing)                : 0.358208 nsec/cyc
vector3::length          : flt: 37.1814  dbl: 37.179   nsec/cyc (1.00)
vector3*vector3          : flt: 17.4478  dbl: 25.3944  nsec/cyc (1.46)
vector3 x vector3        : flt: 10.6335  dbl: 15.1611  nsec/cyc (1.43)
matrix3*vector3          : flt: 20.4215  dbl: 32.536   nsec/cyc (1.59)
trafomatrix*vector3      : flt: 20.3866  dbl: 28.9683  nsec/cyc (1.42)
trafomatrix::inverse     : flt: 120.664  dbl: 112.873  nsec/cyc (0.94)
trafomatrix*trafomatrix  : flt: 88.0527  dbl: 97.4989  nsec/cyc (1.11)
trafomatrix*=trafomatrix : flt: 78.0263  dbl: 75.8576  nsec/cyc (0.97)
matrix3*matrix3          : flt: 65.0964  dbl: 109.145  nsec/cyc (1.68)
matrix3*=matrix3         : flt: 68.4487  dbl: 67.6107  nsec/cyc (0.99)

--<AMD64 1.8GHz FreeBSD-5.4 gcc-3.4.2>---------------------------------------
(nothing)                : 0.28741 nsec/cyc
vector3::length          : flt: 21.1455  dbl: 25.4859  nsec/cyc (1.21)
vector3*vector3          : flt: 3.86673  dbl: 3.79015  nsec/cyc (0.98)
vector3 x vector3        : flt: 8.95599  dbl: 8.94856  nsec/cyc (1.00)
matrix3*vector3          : flt: 17.1492  dbl: 14.7704  nsec/cyc (0.86)
trafomatrix*vector3      : flt: 16.4591  dbl: 13.9329  nsec/cyc (0.85)
trafomatrix::inverse     : flt: 58.0311  dbl: 61.0099  nsec/cyc (1.05)
trafomatrix*trafomatrix  : flt: 62.8014  dbl: 57.8739  nsec/cyc (0.92)
trafomatrix*=trafomatrix : flt: 65.3974  dbl: 55.8263  nsec/cyc (0.85)
matrix3*matrix3          : flt: 51.4784  dbl: 50.3268  nsec/cyc (0.98)
matrix3*=matrix3         : flt: 54.4237  dbl: 49.8042  nsec/cyc (0.92)

But now the bad news: Your changes somehow badly screwed up the
gcc-3.4/Pentium4 pair.
Look at that: (r1.6, r1.7 are the CVS revisions)

r1.6 (AJC)
(nothing)                : 0.116446 nsec/cyc
vector3::length          : flt: 2519.8   dbl: 2517.85  nsec/cyc (1.00)
vector3*vector3          : flt: 628.565  dbl: 628.868  nsec/cyc (1.00)
vector3 x vector3        : flt: 3032.07  dbl: 3033.61  nsec/cyc (1.00)
matrix3*vector3          : flt: 4726.15  dbl: 4727.36  nsec/cyc (1.00)
trafomatrix*vector3      : flt: 4718     dbl: 4732.51  nsec/cyc (1.00)
trafomatrix::inverse     : flt: 115.973  dbl: 73.5815  nsec/cyc (0.63)
trafomatrix*trafomatrix  : flt: 20309.1  dbl: 20257.4  nsec/cyc (1.00)
matrix3*matrix3          : flt: 14475.6  dbl: 14460.1  nsec/cyc (1.00)

My changes based on Andrew's don't make things better either...

r1.7 (WW)
(nothing)                : 0.11606 nsec/cyc
vector3::length          : flt: 2523.47  dbl: 2527.31  nsec/cyc (1.00)
vector3*vector3          : flt: 630.039  dbl: 628.347  nsec/cyc (1.00)
vector3 x vector3        : flt: 3055.63  dbl: 3057.65  nsec/cyc (1.00)
matrix3*vector3          : flt: 4725.59  dbl: 4722.21  nsec/cyc (1.00)
trafomatrix*vector3      : flt: 4722.81  dbl: 4727.08  nsec/cyc (1.00)
trafomatrix::inverse     : flt: 115.964  dbl: 74.8951  nsec/cyc (0.65)
trafomatrix*trafomatrix  : flt: 20279.7  dbl: 20276.5  nsec/cyc (1.00)
trafomatrix*=trafomatrix : flt: 20135.6  dbl: 20150.9  nsec/cyc (1.00)
matrix3*matrix3          : flt: 14451.3  dbl: 14454.8  nsec/cyc (1.00)
matrix3*=matrix3         : flt: 14400.8  dbl: 14450.7  nsec/cyc (1.00)

The "(nothing)" and inversion tests are okay but the rest is just
horrible. I will probably take a more in-depth look at the problem, but
let me first compile gcc-4.1.x on the P4 box...

Wolfgang