From: Rune Petersen <rune@me...>  2007-02-15 22:03:36

Roland Scheidegger wrote:
> Roland Scheidegger wrote:
>> Rune Petersen wrote:
>>> This patch:
>>> - Fixes COS.
>>> - Does range reductions for SIN & COS.
>>> - Adds SCS.
>>> - removes the optimized version of SIN & COS.
>>> - tweaked weight (should help on precision).
>>> - fixed a copy paste typo in emit_arith().
>>>
>>> Roland would you mind testing if the tweaked weight helped?
>> Well I didn't test it first time (just quoting the numbers from the
>> link you provided), but I guess that's fine too. I was actually
>> wondering myself if it's better to optimize for absolute or relative
>> error, so choosing a weight in between should work too (the
>> difference is not that big after all).
>>
>> A couple comments though: Since ((x + PI/2)/(2*PI))+0.5 is
>> (x/(2*PI)) + (1/4 + 0.5) you could optimize away the first mad for
>> the COS case.
>>
> Ah I see you're a bit short on consts, if you want to only use 2 (btw
> I'd say there should be 32 not only 16 but I have no idea why the driver
> restricts it to 16).
>
>> Also, the comments for SCS seem a bit off. That's a pity, because
>> without comments I can't really see what the code does at first sight
>> :). Looks like quite a few extra instructions though, are you sure
>> not more could be shared for calculating both sin and cos?
> I've looked a bit closer (this is an interesting optimization
> problem...) and I think it should be doable with fewer instructions,
> though ultimately I needed 2 temps instead of 1 (I don't think it's much
> of a problem, 32 is plenty, PS2.0 only exposes 12).
>
> Ok the equation was:
> Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2
>
> Simplified to:
> y = B * x + C * x * abs(x)
> y = P * (y * abs(y) - y) + y
>
> const0: B,C,pi,P
> const1: 0.5pi, 0.75, 1/(2pi), 2.0pi
>
> That's what I came up with with pseudocode:
> //should be 5 slots (I guess it might generate 6 due to force sameslot,
> //but that needs fixing elsewhere)
>
> //cos is even: cos(-x) = cos(x). So using simple trigofu
> //we get sin(neg(abs(x)) + pi/2) = cos(x), no comparison needed and all
> //values for sine stay inside [-pi, pi] ([-pi/2, pi/2], actually)
> //hope it's ok to use neg+abs simultaneously?
> temp.z = add(neg(abs(src)), const1.x)
> temp.w = mul(src, C)
>
> //temp.xy = B*x, C*x (cos), temp.w = C * x, temp2.w = B * x (sin)
> temp.xy = mul(temp.z, BC)
> temp2.w = mul(src, B)
>
> //do cos in alpha slot not sin due to restricted swizzling
> //sin y = B * x + C * x * abs(x)
> temp2.z = mad(temp.w, abs(src), temp2.w)
> //cos
> temp2.w = mad(temp.y, abs(temp.z), temp.x)
>
> temp.xy = mad(temp2.wzy, abs(temp2.wzy), neg(temp2.wzy))
> // now temp.x holds y * abs(y) - y for cos, temp.y same for sin
>
> dest.xy = mad(temp.xy, P, temp2.wzy)
>
> range reduction for cos:
> x = (x/(2*PI))+0.75
> x = frac(x)
> x = (x*2*PI)-PI
>
> sin:
> x = (x/(2*PI))+HALF
> x = frac(x)
> x = (x*2*PI)-PI
>
> Isn't that an elegant solution :) There may be any number of bugs, of
> course...

Very elegant I must say.

Thank you, I'll see about implementing this.

Rune Petersen
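
For anyone who wants to check the math from this thread outside the driver, here is a rough scalar sketch in Python of the two-step parabolic sine, the range reduction for SIN and COS, and the neg+abs trick for SCS. It is not the r300 fragment program code. The function names are made up for illustration, and the weight P = 0.225 is an assumption (the value usually quoted for minimizing absolute error); the tweaked weight from the patch is not given in this message.

```python
import math

B = 4.0 / math.pi            # coefficient of x in the first step
C = -4.0 / (math.pi ** 2)    # coefficient of x * abs(x) in the first step
P = 0.225                    # ASSUMED weight, not the tweaked value from the patch


def reduce_range(x, offset):
    """Map x into [-pi, pi): offset 0.5 for sin, 0.75 for cos."""
    t = x / (2.0 * math.pi) + offset
    t = t - math.floor(t)            # frac(t)
    return t * 2.0 * math.pi - math.pi


def parabolic_sin(x):
    """Approximate sin(x) for x already inside [-pi, pi]."""
    y = B * x + C * x * abs(x)
    return P * (y * abs(y) - y) + y


def approx_sin(x):
    return parabolic_sin(reduce_range(x, 0.5))


def approx_cos(x):
    # cos(x) = sin(x + pi/2); the extra quarter turn is folded into the offset
    return parabolic_sin(reduce_range(x, 0.75))


def approx_scs(x):
    """(cos, sin) for x in [-pi, pi], using cos(x) = sin(pi/2 - abs(x))."""
    return parabolic_sin(math.pi / 2.0 - abs(x)), parabolic_sin(x)
```

With the assumed P the result should stay within roughly 1e-3 of math.sin and math.cos; a different weight trades absolute error against relative error, as discussed above.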