Roland Scheidegger wrote:
> Rune Petersen wrote:
>> This patch:
>> - Fixes COS.
>> - Does range reductions for SIN & COS.
>> - Adds SCS.
>> - removes the optimized version of SIN & COS.
>> - tweaked weight (should help on precision).
>> - fixed a copy paste typo in
>> Roland would you mind testing if the tweaked weight helped?
> Well I didn't test it first time (just quoting the numbers from the
> link you provided), but I guess that's fine too. I was actually
> wondering myself if it's better to optimize for absolute or relative
> error, so choosing a weight in-between should work too (the
> difference is not that big after all).
> A couple comments though: Since ((x + PI/2)/(2*PI))+0.5 is (x/(2*PI))
> + (1/4 + 0.5) you could optimize away the first mad for the COS case.
Ah I see you're a bit short on consts, if you want to only use 2 (btw
I'd say there should be 32 not only 16 but I have no idea why the driver
restricts it to 16).
> Also, the comments for SCS seem a bit off. That's a pity, because
> without comments I can't really see what the code does at first sight
> :-). Looks like quite a few extra instructions though, are you sure
> more couldn't be shared for calculating both sin and cos?
I've looked a bit closer (this is an interesting optimization
problem...) and I think it should be doable with fewer instructions,
though ultimately I needed 2 temps instead of 1 (I don't think it's much
of a problem, 32 is plenty, PS2.0 only exposes 12).
Ok the equation was:
Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2
y = B * x + C * x * abs(x)
y = P * (y * abs(y) - y) + y
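
As a quick sanity check of the two-step form above, here is a minimal Python sketch. B = 4/pi and C = -4/pi^2 are read off the first equation, and the second step only matches it if Q = 1 - P. The weight P = 0.225 is an assumption (the value usually quoted for this approximation, not necessarily the tweaked one in the patch), and the function names are made up for illustration.

```python
import math

B = 4.0 / math.pi            # linear coefficient, from the 4/pi x term
C = -4.0 / (math.pi ** 2)    # quadratic coefficient, from the -4/pi^2 x^2 term
P = 0.225                    # ASSUMED weight; the patch may use a different value


def parabolic_sin(x):
    """Approximate sin(x) for x in [-pi, pi] with the two-step form."""
    y = B * x + C * x * abs(x)          # first parabola
    return P * (y * abs(y) - y) + y     # blend with its square (Q = 1 - P)


def max_abs_error(samples=2001):
    """Largest absolute deviation from math.sin over [-pi, pi]."""
    worst = 0.0
    for i in range(samples):
        x = -math.pi + 2.0 * math.pi * i / (samples - 1)
        worst = max(worst, abs(parabolic_sin(x) - math.sin(x)))
    return worst
```

Calling max_abs_error() with a few different values of P is a cheap way to see how the weight trades absolute against relative error.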
const1: 0.5pi, 0.75, 1/(2pi), 2.0pi
Here's what I came up with, in pseudo-code:
//should be 5 slots (I guess it might generate 6 due to force same-slot,
//but that needs fixing elsewhere)
//cos is even: cos(x) = cos(-x). So using simple trigo-fu
//we get sin(neg(abs(x)) + pi/2) = cos(x), no comparison needed and all
//values for sine stay inside [-pi,pi] ([-pi/2, pi/2], actually)
//hope it's ok to use neg+abs simultaneously?
temp.z = add(neg(abs(src)), const1.x)
temp.w = mul(src, C)
//temp.xy = B*x, C*x (cos), temp.w = C * x, temp2.w = B * x (sin)
temp.xy = mul(temp.z, BC)
temp2.w = mul(src, B)
//do cos in alpha slot not sin due to restricted swizzling
//sin y = B * x + C * x * abs(x)
temp2.z = mad(temp.w, abs(src), temp2.w)
temp2.w = mad(temp.y, abs(temp.z), temp.x)
temp.xy = mad(temp2.wzy, abs(temp2.wzy), neg(temp2.wzy))
// now temp.x holds y * abs(y) - y for cos, temp.y same for sin
dest.xy = mad(temp.xy, P, temp2.wzy)
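
To check the listing on a CPU, the sequence can be replayed scalar by scalar. This is only a sketch: it assumes src is already inside [-pi, pi], uses P = 0.225 as an assumed weight, and the Python names (mad, scs, temp_x and so on) are illustrative, not driver code.

```python
import math

B = 4.0 / math.pi
C = -4.0 / (math.pi ** 2)
P = 0.225                   # ASSUMED weight, not taken from the patch
HALF_PI = 0.5 * math.pi     # const1.x


def mad(a, b, c):
    """Multiply-add, a * b + c."""
    return a * b + c


def scs(src):
    """Return (cos, sin) for src in [-pi, pi], one statement per listing line."""
    temp_z = -abs(src) + HALF_PI                   # temp.z = add(neg(abs(src)), const1.x)
    temp_w = src * C                               # temp.w = mul(src, C)
    temp_x, temp_y = temp_z * B, temp_z * C        # temp.xy = mul(temp.z, BC)
    temp2_w = src * B                              # temp2.w = mul(src, B)
    temp2_z = mad(temp_w, abs(src), temp2_w)       # sin: B*x + C*x*abs(x)
    temp2_w = mad(temp_y, abs(temp_z), temp_x)     # cos: same form on pi/2 - abs(x)
    temp_x = mad(temp2_w, abs(temp2_w), -temp2_w)  # y*abs(y) - y for cos
    temp_y = mad(temp2_z, abs(temp2_z), -temp2_z)  # y*abs(y) - y for sin
    return mad(temp_x, P, temp2_w), mad(temp_y, P, temp2_z)  # dest.xy
```

For example, scs(0.0) gives a cosine of 1 (up to rounding) and a sine of exactly 0, and other inputs in range can be compared against math.cos and math.sin.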
range reduction for cos:
x = (x/(2*PI))+0.75
x = frac(x)
x = (x*2*PI)-PI
range reduction for sin:
x = (x/(2*PI))+HALF
x = frac(x)
x = (x*2*PI)-PI
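
A small Python sketch of the reductions above, assuming frac(x) means x - floor(x). The +HALF variant wraps x itself into [-pi, pi); the +0.75 variant wraps x + pi/2, so feeding its result to a sine gives cos(x). The function names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def frac(x):
    """Fractional part in [0, 1), assumed to be x - floor(x)."""
    return x - math.floor(x)


def reduce_for_sin(x):
    """Wrap x into [-pi, pi) without changing sin(x)."""
    return frac(x / TWO_PI + 0.5) * TWO_PI - math.pi


def reduce_for_cos(x):
    """Wrap x + pi/2 into [-pi, pi); the sine of the result equals cos(x)."""
    return frac(x / TWO_PI + 0.75) * TWO_PI - math.pi
```

The 0.75 is just 1/4 + 0.5 folded into one constant, which is what saves the extra mad in the COS case.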
Isn't that an elegant solution :-) There may be any number of bugs, of