Re: [Ray-devel] FP benchmarking, float versus double

On Thursday 26 May 2005 06:19, ajc...@en... wrote:
> I've updated the tests
> - Changed constants single precision (neither float or double seems to
> introduce any casts in the assembly, now)
Yes, as I verified, the compiler will automatically do float -> double 
widening at compile time while double -> float conversion is done at 
run time. 
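
As an illustration of the promotion rules behind this (a minimal sketch, not
taken from the benchmark source; the function names are hypothetical): a float
multiplied by a double literal is computed in double, so storing the result in
a float needs a narrowing conversion. Whether that conversion shows up as an
instruction in the assembly depends on the compiler and its flags.

```cpp
#include <type_traits>

// Hypothetical helpers for illustration only; they are not part of the
// benchmark code discussed in this thread.

// float * float literal: the whole expression stays in single precision.
inline float scale_single(float x) { return x * 0.5f; }

// float * double literal: x is widened to double, the product is a double,
// and returning it as float requires a double -> float conversion.
inline float scale_mixed_stored(float x) { return static_cast<float>(x * 0.5); }

// The language-level result types can be checked at compile time.
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float literal yields float");
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double literal yields double");
```

Using the "f" suffix on constants in the float tests avoids the mixed case
entirely, which matches the change to single precision constants quoted above.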

> - Added dependency between successive iterations for each test
>
I changed that slightly again: instead of having the same variable on both 
sides (as in "a=a*b"), I introduced two tests: one for "a*=b" and one for 
(a=b*c, b=c*a, c=a*b). This is because the result of an operation commonly 
does not overwrite one of its input variables but is stored somewhere else, 
which affects the temporaries the compiler generates. 
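
The two dependency patterns can be sketched roughly like this (an assumption
about the shape of the loops, not the actual test code; the real tests operate
on vectors and matrices, while this sketch uses any type with operator* and
operator*=, and the function names are made up):

```cpp
#include <cstddef>

// Pattern 1: in-place update. The result overwrites one of the operands,
// so each iteration depends on the previous value of a.
template <typename T>
T chain_in_place(T a, T b, std::size_t iterations) {
    for (std::size_t i = 0; i < iterations; ++i)
        a *= b;
    return a;
}

// Pattern 2: rotating update. Each result is stored in a variable that is
// not an operand of that operation, but still feeds the next operation.
template <typename T>
T chain_rotating(T a, T b, T c, std::size_t iterations) {
    for (std::size_t i = 0; i < iterations; ++i) {
        a = b * c;
        b = c * a;
        c = a * b;
    }
    return c;  // returning the last result keeps the whole chain live
}
```

In both cases the result of one step is an input to the next, so the compiler
cannot hoist the operation out of the loop or drop it as unused.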

The overall picture of the benchmarks is not affected much by this. 

> These changes make a substantial difference on my system:
>
> --AMD 64 3500+ 2.2Ghz
>
Ah, great - you have a fairly fast AMD64!
Could you please run the tests in src/lib/threads and send me the output?
(Also compiler & kernel version.)

I'd be very much interested in the switch and lock times on Linux/AMD64. 
(The timings on FreeBSD/AMD64 are not so good and the question is whether 
it's the architecture or the operating system. I don't have access to an 
idle Linux/AMD64 system.)

> So, I suppose we can conclude that doubles really are preferred to floats
> for straight arithmetic, at least on AMD64, for algorithms with only
> additions and multiplications.
>
It seems logical that 64-bit doubles are the natural data type for 
64-bit platforms and preferable to floats as long as storage size does 
not matter. 

However, on the 32-bit Athlon, float is considerably faster than double in 
several of the tests. 

Here are updated timings: 

--<AthlonXP 1.47GHz Linux-2.6.11 gcc-4.1.0, NPTL>----------------------------
  (nothing)                : 0.358208 nsec/cyc
  vector3::length          : flt:  37.1814   dbl:   37.179  nsec/cyc  (1.00)
  vector3*vector3          : flt:  17.4478   dbl:  25.3944  nsec/cyc  (1.46)
  vector3 x vector3        : flt:  10.6335   dbl:  15.1611  nsec/cyc  (1.43)
  matrix3*vector3          : flt:  20.4215   dbl:   32.536  nsec/cyc  (1.59)
  trafomatrix*vector3      : flt:  20.3866   dbl:  28.9683  nsec/cyc  (1.42)
  trafomatrix::inverse     : flt:  120.664   dbl:  112.873  nsec/cyc  (0.94)
  trafomatrix*trafomatrix  : flt:  88.0527   dbl:  97.4989  nsec/cyc  (1.11)
  trafomatrix*=trafomatrix : flt:  78.0263   dbl:  75.8576  nsec/cyc  (0.97)
  matrix3*matrix3          : flt:  65.0964   dbl:  109.145  nsec/cyc  (1.68)
  matrix3*=matrix3         : flt:  68.4487   dbl:  67.6107  nsec/cyc  (0.99)

--<AMD64 1.8GHz FreeBSD-5.4 gcc-3.4.2>---------------------------------------
  (nothing)                : 0.28741 nsec/cyc
  vector3::length          : flt:  21.1455   dbl:  25.4859  nsec/cyc  (1.21)
  vector3*vector3          : flt:  3.86673   dbl:  3.79015  nsec/cyc  (0.98)
  vector3 x vector3        : flt:  8.95599   dbl:  8.94856  nsec/cyc  (1.00)
  matrix3*vector3          : flt:  17.1492   dbl:  14.7704  nsec/cyc  (0.86)
  trafomatrix*vector3      : flt:  16.4591   dbl:  13.9329  nsec/cyc  (0.85)
  trafomatrix::inverse     : flt:  58.0311   dbl:  61.0099  nsec/cyc  (1.05)
  trafomatrix*trafomatrix  : flt:  62.8014   dbl:  57.8739  nsec/cyc  (0.92)
  trafomatrix*=trafomatrix : flt:  65.3974   dbl:  55.8263  nsec/cyc  (0.85)
  matrix3*matrix3          : flt:  51.4784   dbl:  50.3268  nsec/cyc  (0.98)
  matrix3*=matrix3         : flt:  54.4237   dbl:  49.8042  nsec/cyc  (0.92)

But now the bad news: Your changes somehow heavily screwed up the 
gcc-3.4/Pentium4 pair. Look at that: (r1.6, r1.7 are the CVS revisions)

r1.6 (AJC)
  (nothing)                : 0.116446 nsec/cyc
  vector3::length          : flt:   2519.8   dbl:  2517.85  nsec/cyc  (1.00)
  vector3*vector3          : flt:  628.565   dbl:  628.868  nsec/cyc  (1.00)
  vector3 x vector3        : flt:  3032.07   dbl:  3033.61  nsec/cyc  (1.00)
  matrix3*vector3          : flt:  4726.15   dbl:  4727.36  nsec/cyc  (1.00)
  trafomatrix*vector3      : flt:     4718   dbl:  4732.51  nsec/cyc  (1.00)
  trafomatrix::inverse     : flt:  115.973   dbl:  73.5815  nsec/cyc  (0.63)
  trafomatrix*trafomatrix  : flt:  20309.1   dbl:  20257.4  nsec/cyc  (1.00)
  matrix3*matrix3          : flt:  14475.6   dbl:  14460.1  nsec/cyc  (1.00)

My changes based on Andrew's don't make things better either...

r1.7 (WW)
  (nothing)                : 0.11606 nsec/cyc
  vector3::length          : flt:  2523.47   dbl:  2527.31  nsec/cyc  (1.00)
  vector3*vector3          : flt:  630.039   dbl:  628.347  nsec/cyc  (1.00)
  vector3 x vector3        : flt:  3055.63   dbl:  3057.65  nsec/cyc  (1.00)
  matrix3*vector3          : flt:  4725.59   dbl:  4722.21  nsec/cyc  (1.00)
  trafomatrix*vector3      : flt:  4722.81   dbl:  4727.08  nsec/cyc  (1.00)
  trafomatrix::inverse     : flt:  115.964   dbl:  74.8951  nsec/cyc  (0.65)
  trafomatrix*trafomatrix  : flt:  20279.7   dbl:  20276.5  nsec/cyc  (1.00)
  trafomatrix*=trafomatrix : flt:  20135.6   dbl:  20150.9  nsec/cyc  (1.00)
  matrix3*matrix3          : flt:  14451.3   dbl:  14454.8  nsec/cyc  (1.00)
  matrix3*=matrix3         : flt:  14400.8   dbl:  14450.7  nsec/cyc  (1.00)

The "(nothing)" and inversion tests are okay but the rest is just horrible. 

I will probably take a more in-depth look at the problem, but let me first 
compile gcc-4.1.x on the P4 box...

Wolfgang