From: <ajc...@en...> - 2005-05-26 04:19:46
I've updated the tests:
- Changed constants to single precision (neither float nor double seems to introduce any casts in the assembly now)
- Added a dependency between successive iterations for each test

These changes make a substantial difference on my system:

AMD 64 3500+ 2.2GHz
(nothing)               : 0.250217 nsec/cyc
vector3::length         : flt: 18.2921 dbl: 22.4885 nsec/cyc (1.23)
vector3*vector3         : flt: 3.37153 dbl: 3.33858 nsec/cyc (0.99)
vector3 x vector3       : flt: 6.43606 dbl: 6.51059 nsec/cyc (1.01)
matrix3*vector3         : flt: 14.3812 dbl: 13.4929 nsec/cyc (0.94)
trafomatrix*vector3     : flt: 13.8013 dbl: 12.6405 nsec/cyc (0.92)
trafomatrix::inverse    : flt: 51.8531 dbl: 53.4506 nsec/cyc (1.03)
trafomatrix*trafomatrix : flt: 60.8745 dbl: 55.2666 nsec/cyc (0.91)
matrix3*matrix3         : flt: 46.5577 dbl: 51.3582 nsec/cyc (1.10)

It turns out I was somewhat wrong in stating that single-precision operations have a longer latency than double. At least on AMD64, additions and multiplications of floats and doubles have the same latency (4 cycles); only division and square root have a longer latency in double precision.

I spent a while trying to figure out why there are still cases where double outperforms float, without coming to any conclusion. My guess is that the remaining difference is due to unaligned accesses to some of the floats: the AMD64 docs state that some of the move operations are slower when a floating-point value is not 8-byte aligned.

So I suppose we can conclude that doubles really are preferable to floats for straight arithmetic, at least on AMD64, for algorithms with only additions and multiplications.

Quoting Wolfgang Wieser <ww...@gm...>:

> Hello Andrew!
>
> On Wednesday 25 May 2005 02:00, Andrew Clinton wrote:
> > The only way I think you'll be able to make sure this isn't
> > happening is to check the assembly code manually.
>
> The easiest thing is simply to USE the output (e.g. by summing it all up)
> and slightly vary the input (e.g. by adding 0.1 each iteration).
> This is what is actually done (right?)
>
> Another possibility is to look at execution-time scaling (which I did as
> well):
> - The basic operations are inline functions which are called 4 times in
>   each loop. If I comment out 2 of them, the time halves.
> - There is one no-op measurement, i.e. a loop with 4 inlines which do
>   nothing. This is the "(nothing)" line. Its execution time indicates the
>   time required if the stmts are optimized away and does not change when
>   2 of the calls are commented out.
>
> While only looking at and understanding the assembly gives a definite
> answer, the scaling already gives confidence, and using/varying the
> variables prevents the compiler from removing operations.
>
> I tried the explicit float casting (by adding an "f" to all FP constants)
> and it did not change the timings.
>
> > It looks like your code is subject to some of these problems. I've made
> > some attempts at fixing it but it is difficult to verify (looking
> > through lots of assembly).
>
> Did you see any change in the timings?
> Feel free to commit a "fixed" version; I'd like to have a look at it.
>
> > Especially problematic might be the production of a result value that
> > is never used again,
>
> Well, the results in the loop statements are usually summed up and hence
> "used". The calculated sum is in the end not used, but that should not
> make any difference.
>
> The fact that produced results are not directly needed for the next
> calculation may enable the compiler/processor to make better use of
> pipelining. I expect this to be beneficial to both types (float, double).
>
> > Maybe a better test would be to do something useful with each type of
> > arithmetic (intersect a ray with a sphere?).
>
> Sure.
>
> Let me tell you something about the history of why I did this: I just
> wanted to get a feeling for the time needed for typical floating-point
> operations as compared to the time required by thread-safe locking and
> context switching.
> So, as I wrote in the email the day before, I'd be very happy if you
> could post some real ray-shape intersection timings for your raytracer
> and a "typical" scene (i.e. lots of basic CSG / a big triangle mesh / ..).
>
> Because the point is basically that I would like to compare the cost of
> a ray-shape intersection to the cost of a coroutine switch. This is
> required to decide whether it makes sense to use ray-shape intersections
> as the request type or whether the introduced overhead is too high.
>
> Wolfgang
>
> _______________________________________________
> Ray-devel mailing list
> Ray...@li...
> https://lists.sourceforge.net/lists/listinfo/ray-devel