Hello,
We have a requirement to support numerical intrinsics across various hardware architectures, and SLEEF's libm looks like a strong possibility, since it provides both scalar and vector (SIMD) implementations of all the core routines.
However, we need the scalar and vector routines to generate the same numerical results. Of immediate concern, it appears that some of the scalar and vector implementations of the same routine use (possibly) different operations. For example, the scalar version of cosf is coded as:
EXPORT CONST float xcosf(float d) {
  int q;
  float u, s, t = d;

  /* reduce the argument against multiples of pi/2; q is an odd integer */
  q = 1 + 2*(int)rintfk(d * (float)M_1_PI - 0.5f);
  d = mlaf(q, -PI_Af*0.5f, d);
  d = mlaf(q, -PI_Bf*0.5f, d);
  d = mlaf(q, -PI_Cf*0.5f, d);
  d = mlaf(q, -PI_Df*0.5f, d);

  s = d * d;
  if ((q & 2) == 0) d = -d;

  /* polynomial evaluation on the reduced argument */
  u = 2.6083159809786593541503e-06f;
  u = mlaf(u, s, -0.0001981069071916863322258f);
  u = mlaf(u, s, 0.00833307858556509017944336f);
  u = mlaf(u, s, -0.166666597127914428710938f);
  u = mlaf(s, u * d, d);

  /* special cases: huge arguments and infinities */
  if (fabsfk(t) > TRIGRANGEMAXf) u = 1.0f;
  if (xisinff(t)) u = NANf;

  return u;
}
And the vector is:
EXPORT CONST vfloat xcosf(vfloat d) {
  vint2 q;
  vfloat u, s, r = d;

  /* reduce the argument against multiples of pi/2; q is an odd integer */
  q = vrint_vi2_vf(vsub_vf_vf_vf(vmul_vf_vf_vf(d, vcast_vf_f((float)M_1_PI)), vcast_vf_f(0.5f)));
  q = vadd_vi2_vi2_vi2(vadd_vi2_vi2_vi2(q, q), vcast_vi2_i(1));
  u = vcast_vf_vi2(q);
  d = vmla_vf_vf_vf_vf(u, vcast_vf_f(-PI_Af*0.5f), d);
  d = vmla_vf_vf_vf_vf(u, vcast_vf_f(-PI_Bf*0.5f), d);
  d = vmla_vf_vf_vf_vf(u, vcast_vf_f(-PI_Cf*0.5f), d);
  d = vmla_vf_vf_vf_vf(u, vcast_vf_f(-PI_Df*0.5f), d);

  s = vmul_vf_vf_vf(d, d);

  /* flip the sign of d where (q & 2) == 0, by xoring with -0.0f */
  d = vreinterpret_vf_vm(vxor_vm_vm_vm(vand_vm_vo32_vm(veq_vo_vi2_vi2(vand_vi2_vi2_vi2(q, vcast_vi2_i(2)), vcast_vi2_i(0)), vreinterpret_vm_vf(vcast_vf_f(-0.0f))), vreinterpret_vm_vf(d)));

  /* the same polynomial as the scalar version */
  u = vcast_vf_f(2.6083159809786593541503e-06f);
  u = vmla_vf_vf_vf_vf(u, s, vcast_vf_f(-0.0001981069071916863322258f));
  u = vmla_vf_vf_vf_vf(u, s, vcast_vf_f(0.00833307858556509017944336f));
  u = vmla_vf_vf_vf_vf(u, s, vcast_vf_f(-0.166666597127914428710938f));

  /* note: an unfused multiply and add, unlike the scalar mlaf above */
  u = vadd_vf_vf_vf(vmul_vf_vf_vf(s, vmul_vf_vf_vf(u, d)), d);

  /* special cases: mask out huge arguments, set all bits (NaN) for infinities */
  u = vreinterpret_vf_vm(vandnot_vm_vo32_vm(vgt_vo_vf_vf(vabs_vf_vf(r), vcast_vf_f(TRIGRANGEMAXf)),
                                            vreinterpret_vm_vf(u)));
  u = vreinterpret_vf_vm(vor_vm_vo32_vm(visinf_vo_vf(d), vreinterpret_vm_vf(u)));

  return u;
}
Of particular concern are the terms:
u = mlaf(s, u * d, d);
and
u = vadd_vf_vf_vf(vmul_vf_vf_vf(s, vmul_vf_vf_vf(u, d)), d);
Are the two variants of the term intentional, or simply an oversight in implementation?
We would be building the scalar routines targeting both FMA and non-FMA processors. I also see that the x86-64 build flags for the scalar routines disable floating-point contraction (FMA), thus introducing another possible difference between the scalar and vector results.
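For instance (a hypothetical one-liner, not taken from the SLEEF sources), whether the compiler is allowed to contract determines how many roundings this performs:

/* With gcc/clang, -ffp-contract=fast permits compiling this as a single
   fused multiply-add (one rounding), while -ffp-contract=off forces a
   separate multiply and add (two roundings), which is what the scalar
   x86-64 build gets. */
float mla_demo(float x, float y, float z) { return x * y + z; }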
Best regards,
Dave
David Parks
NVIDIA
dparks@nvidia.com
Hi David,
FMA is not just a combination of multiplication and addition; the two operations are fused.
https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation#Fused_multiply.E2.80.93add
This property of FMA is exploited to speed up the computation of double-double arithmetic.
https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic
Because of this, it is impossible to make the FMA and non-FMA implementations produce exactly the same values. SLEEF defines specifications common to all implementations, but it allows each variant of the implementation to produce different results within those specifications. This is important since the availability of operations differs between architectures. The AVX512F implementation produces different results from that of AVX2, since AVX512F has some unique instructions that are handy for this kind of calculation.
Disabling FP contraction is also crucial, since the computation of the functions is so delicate; the DD calculation in particular can easily be hampered by FP contraction.
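For illustration, here is a minimal sketch of the error-free product at the heart of DD arithmetic (a sketch for this discussion, not SLEEF's actual DD code):

#include <math.h>

/* fma evaluates a*b exactly before a single final rounding, so lo
   recovers exactly the rounding error of hi = a * b. Without fma this
   takes a longer Dekker-style splitting whose intermediate roundings
   differ, which is why FMA and non-FMA DD results cannot match. */
typedef struct { double hi, lo; } double2;

static double2 two_prod_fma(double a, double b) {
  double2 r;
  r.hi = a * b;
  r.lo = fma(a, b, -r.hi);  /* exact residual of the product */
  return r;
}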
Dear Shibata-san,
Thank you for your response, but I don't think I explained our concern properly.
We are fully aware of the delicate nuances of FMA arithmetic - like when B = A*A; C = A*A - B evaluates to something other than zero :-)
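For concreteness, a tiny self-contained demo of that effect (illustrative values, not from SLEEF):

#include <math.h>
#include <stdio.h>

int main(void) {
  float A = 1.0f + 0x1.0p-12f;  /* exact A*A = 1 + 2^-11 + 2^-24 */
  float B = A * A;              /* rounded to float; the 2^-24 term is lost */
  float C = fmaf(A, A, -B);     /* fused: exact A*A - B, rounded once */
  printf("%a\n", C);            /* prints 0x1p-24, not 0 */
  return 0;
}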
But, if I understand your explanation, it is a conscious decision that the scalar and vector algorithms may not produce the same numerical results.
Going back to the two routines I used as examples earlier in the thread: the "mlaf" in the scalar version does not use FMA arithmetic, because contraction is disabled by the compiler, but the vector version (take AVX2) would use FMA operations (vmla_vf_vf_vf_vf(vfloat x, vfloat y, vfloat z) expands to return _mm256_fmadd_ps(x, y, z);).
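Roughly, the two helpers boil down to something like the following (paraphrased for illustration; the real definitions live in SLEEF's helper headers):

#include <immintrin.h>

typedef __m256 vfloat;

/* scalar helper: a plain multiply and add; with contraction disabled it
   compiles to a separate multiply/add pair and rounds twice */
static inline float mlaf(float x, float y, float z) { return x * y + z; }

/* AVX2 helper: a true fused multiply-add, rounding once */
static inline vfloat vmla_vf_vf_vf_vf(vfloat x, vfloat y, vfloat z) {
  return _mm256_fmadd_ps(x, y, z);
}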
Best regards,
Dave
On some architectures, a non-fused combined multiply-and-add is available. On architectures with FMA, it is safe to assume that FMA is faster than any other combination of multiplication and addition, so in order to speed up calculation, mla-to-FMA conversion is enabled.
I still don't understand your requirements, but would it be okay to just make the functions that do not use DD calculation produce the same numerical results? For the functions with DD calculations, there is no way to make them produce the same numerical results without entirely giving up on FMA.
Hi,
What would be most desirable is for the scalar and vector versions of a function to produce the same numerical results on a particular processor.
Ignoring the DD functions: if we use FMA for a term in the vector implementation, it is probably best to use FMA for the corresponding term in the scalar version of the intrinsic.
Again, thank you for your time and consideration.
Best regards,
Dave
It is possible for me to write a new helper file for scalar calculation. Would that satisfy your requirements?
I don't want to make this too difficult. I'm wondering about something along the lines of the following (for single precision). If FMA3 is available on x86-64 (an FMA4 version would also be needed), define mlaf to be:
float mlaf(float x, float y, float z) { asm ("vfmadd132ss %2, %1, %0" : "+x"(x) : "x"(z), "x"(y)); return x; }
The negative flavor of the FMA operation would also be needed.
But this comes back to one of my original questions: even with an FMA implementation of mlaf for scalar, there would still be one difference between the scalar and vector versions (using the original example), in the computation of the u term:
u = mlaf(s, u * d, d);
Maybe I just came across one of only a few differences between the scalar and vector implementations.
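Alternatively, instead of raw inline asm, something like fmaf might suffice. A sketch, assuming the scalar build is compiled with FMA enabled (e.g. -mfma, so fmaf lowers to a single FMA instruction rather than a libm call); mlanf is just a hypothetical name for the negative flavor:

#include <math.h>

#if defined(__FMA__)
/* fused paths: match what the AVX2 vector helpers do */
static inline float mlaf(float x, float y, float z) { return fmaf(x, y, z); }
static inline float mlanf(float x, float y, float z) { return fmaf(-x, y, z); }  /* -(x*y) + z */
#else
/* plain paths for non-FMA processors: round the product separately */
static inline float mlaf(float x, float y, float z) { return x * y + z; }
static inline float mlanf(float x, float y, float z) { return z - x * y; }
#endif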
It would be harder to maintain the scalar version in a way that produces the same results as the vectorized version. There was absolutely no intention to use the scalar version in that way, and there are so many subtle differences. Writing new helper files is not hard, and they are easier to maintain.
David,
We are now talking about what we should implement. Please join our discussion.
https://github.com/shibatch/sleef/issues/47