Naoki Shibata - 2017-08-07

In a few functions, SLEEF selects coefficients according to the argument passed to the kernel function. The default generic method does this with multiple blend operations, but on AVX2 a permutation is used instead because it is faster. I tried something similar on AArch64 using the tbl instruction, but found that it is actually slower than the generic method. Today I came across an article describing a similar experience:

http://www.cnx-software.com/2017/08/07/how-arm-nerfed-neon-permute-instructions-in-armv8/
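
For readers unfamiliar with the two selection strategies mentioned above, here is a minimal scalar sketch in C. It is not SLEEF's actual code: the table `kCoef`, the thresholds, and the function names are all hypothetical and chosen only for illustration. The blend version mirrors a chain of vector blends (one conditional select per range), while the lookup version computes an index and fetches the coefficient, which is roughly what a permute on AVX2 or tbl on AArch64 would do across vector lanes.

```c
#include <assert.h>

/* Hypothetical coefficient table, one entry per argument range.
 * The values are illustrative only and are not taken from SLEEF. */
#define NUM_RANGES 4
static const double kCoef[NUM_RANGES] = { 1.0, 0.5, 0.25, 0.125 };

/* Hypothetical range boundaries: [.., 1), [1, 2), [2, 4), [4, ..). */
static int range_index(double x) {
    int idx = 0;
    idx += (x >= 1.0);
    idx += (x >= 2.0);
    idx += (x >= 4.0);
    return idx;
}

/* Blend style: start from the first coefficient and overwrite it with
 * one conditional select per range. In vector code each line would be
 * a compare followed by a blend. */
double select_by_blend(double x) {
    double c = kCoef[0];
    c = (x >= 1.0) ? kCoef[1] : c;
    c = (x >= 2.0) ? kCoef[2] : c;
    c = (x >= 4.0) ? kCoef[3] : c;
    return c;
}

/* Lookup style: compute an index once and fetch the coefficient.
 * In vector code the fetch would be a permute or table-lookup
 * instruction applied to every lane. */
double select_by_lookup(double x) {
    return kCoef[range_index(x)];
}

/* Sanity check that both strategies pick the same coefficient. */
void check_strategies_agree(void) {
    static const double samples[] = { -3.0, 0.5, 1.0, 1.5, 2.0, 3.9, 4.0, 100.0 };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(select_by_blend(samples[i]) == select_by_lookup(samples[i]));
    }
}
```

Both functions return the same result, so which one is faster depends entirely on how cheaply the target executes blends versus table lookups, which is the point of the comparison above.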