From: <ajc...@en...> - 2005-05-26 04:19:46
I've updated the tests:
- Changed constants to single precision (neither float nor double seems to introduce any casts in the assembly now)
- Added a dependency between successive iterations for each test

These changes make a substantial difference on my system:

AMD 64 3500+ 2.2GHz
(nothing)               : 0.250217 nsec/cyc
vector3::length         : flt: 18.2921 dbl: 22.4885 nsec/cyc (1.23)
vector3*vector3         : flt: 3.37153 dbl: 3.33858 nsec/cyc (0.99)
vector3 x vector3       : flt: 6.43606 dbl: 6.51059 nsec/cyc (1.01)
matrix3*vector3         : flt: 14.3812 dbl: 13.4929 nsec/cyc (0.94)
trafomatrix*vector3     : flt: 13.8013 dbl: 12.6405 nsec/cyc (0.92)
trafomatrix::inverse    : flt: 51.8531 dbl: 53.4506 nsec/cyc (1.03)
trafomatrix*trafomatrix : flt: 60.8745 dbl: 55.2666 nsec/cyc (0.91)
matrix3*matrix3         : flt: 46.5577 dbl: 51.3582 nsec/cyc (1.10)

It turns out I was somewhat wrong in stating that single-precision operations have a longer latency than double. At least on AMD64, additions and multiplications of floats and doubles have the same latency (4 cycles); only division and square root have a longer latency in double precision.

I spent a while trying to figure out why there are still cases where double outperforms float, without coming to any conclusion. My guess is that the remaining difference is due to unaligned accesses to some of the floats: the AMD64 docs state that some of the move operations are slower when a floating-point value is not 8-byte aligned.

So I suppose we can conclude that doubles really are preferable to floats for straight arithmetic, at least on AMD64, for algorithms with only additions and multiplications.

Quoting Wolfgang Wieser <ww...@gm...>:

> Hello Andrew!
>
> On Wednesday 25 May 2005 02:00, Andrew Clinton wrote:
> > The only way I think you'll be able to make sure this isn't
> > happening is to check the assembly code manually.
>
> The easiest thing is simply to USE the output (e.g. by summing it all up)
> and slightly vary the input (e.g. by adding 0.1 each iteration).
> This is what is actually done (right?)
>
> Another possibility is to look at execution-time scaling (which I did as
> well):
> - The basic operations are inline functions which are called 4 times in
>   each loop. If I comment out 2 of them, the time halves.
> - There is one no-op measurement, i.e. a loop with 4 inlines which do
>   nothing. This is the "(nothing)" line. Its execution time indicates the
>   time required if the stmts are optimized away and does not change when
>   2 of the calls are commented out.
>
> While only looking at and understanding the assembly gives a definite
> answer, the scaling already gives confidence, and using/varying the
> variables prevents the compiler from removing operations.
>
> I tried the explicit float casting (by adding an "f" to all FP constants)
> and it did not change the timings.
>
> > It looks like your code is subject to some of these problems. I've made
> > some attempts at fixing it but it is difficult to verify (looking
> > through lots of assembly).
>
> Did you see any change in the timings?
> Feel free to commit a "fixed" version; I'd like to have a look at it.
>
> > Especially problematic might be the production of a result value that
> > is never used again,
>
> Well, the results in the loop statements are usually summed up and hence
> "used". The calculated sum is in the end not used, but that should not
> make any difference.
>
> The fact that produced results are not directly needed for the next
> calculation may enable the compiler/processor to make better use of
> pipelining. I expect this to be beneficial to both types (float, double).
>
> > Maybe a better test would be to do something useful with each type of
> > arithmetic (intersect a ray with a sphere?).
>
> Sure.
>
> Let me tell you something about the history of why I did this: I just
> wanted to get a feeling for the time needed for typical floating-point
> operations as compared to the time required by thread-safe locking and
> context switching.
> So, as I wrote in the email the day before, I'd be very happy if you
> could post some real ray-shape intersection timings for your raytracer
> and a "typical" scene (i.e. lots of basic CSG / a big triangle mesh / ..).
>
> Because the point is basically that I would like to compare the cost of
> a ray-shape intersection to the cost of a coroutine switch. This is
> required to decide whether it makes sense to use ray-shape intersections
> as the request type or whether the introduced overhead is too high.
>
> Wolfgang
>
> _______________________________________________
> Ray-devel mailing list
> Ray...@li...
> https://lists.sourceforge.net/lists/listinfo/ray-devel