TBL instruction slower on Cortex-A72 than on Cortex-A53

Vectorized libm

Brought to you by: shibatch

Forum: General Discussion

Creator: Naoki Shibata

Created: 2017-08-07

Updated: 2017-08-07

Naoki Shibata - 2017-08-07

A few functions in SLEEF select coefficients according to the argument passed to their kernel functions. The default generic method does this with multiple blend operations, while the AVX2 version uses a permutation instead, which is faster. I tried the same approach on AArch64 using the tbl instruction, but it turned out to be slower than the generic method. Today I found an article describing a similar experience.

http://www.cnx-software.com/2017/08/07/how-arm-nerfed-neon-permute-instructions-in-armv8/
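
For readers who want to see the two approaches side by side, here is a minimal sketch in C. It is not SLEEF's actual code: the function names (`coeff_blend`, `coeff_lookup`), the 4-entry table, and the assumption that every index is in the range 0..3 are all made up for illustration. On AArch64 it uses NEON intrinsics (a compare plus bitwise-select chain versus a single `vqtbl1q_u8` lookup); on other targets it falls back to a scalar model so the file still builds.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Generic method: a chain of compare + bitwise-select (blend) operations. */
static float32x4_t select_blend(uint32x4_t idx, const float tab[4]) {
    float32x4_t r = vdupq_n_f32(tab[0]);
    for (uint32_t k = 1; k < 4; k++) {
        uint32x4_t m = vceqq_u32(idx, vdupq_n_u32(k)); /* lanes where idx == k */
        r = vbslq_f32(m, vdupq_n_f32(tab[k]), r);      /* take tab[k] there    */
    }
    return r;
}

/* Permutation method: one TBL byte lookup into a 16-byte table. */
static float32x4_t select_tbl(uint32x4_t idx, const float tab[4]) {
    uint8x16_t t = vreinterpretq_u8_f32(vld1q_f32(tab));
    /* Expand lane index i into byte indices {4i, 4i+1, 4i+2, 4i+3}
       (assumes little-endian lane layout and idx in 0..3). */
    uint32x4_t b = vaddq_u32(vmulq_n_u32(idx, 0x04040404u),
                             vdupq_n_u32(0x03020100u));
    return vreinterpretq_f32_u8(vqtbl1q_u8(t, vreinterpretq_u8_u32(b)));
}

static void coeff_blend(const uint32_t idx[4], const float tab[4], float out[4]) {
    vst1q_f32(out, select_blend(vld1q_u32(idx), tab));
}

static void coeff_lookup(const uint32_t idx[4], const float tab[4], float out[4]) {
    vst1q_f32(out, select_tbl(vld1q_u32(idx), tab));
}

#else

/* Scalar model of the blend chain, for non-AArch64 builds. */
static void coeff_blend(const uint32_t idx[4], const float tab[4], float out[4]) {
    for (int i = 0; i < 4; i++) {
        float r = tab[0];
        for (uint32_t k = 1; k < 4; k++)
            r = (idx[i] == k) ? tab[k] : r;
        out[i] = r;
    }
}

/* Scalar model of the table lookup (idx assumed to be in 0..3). */
static void coeff_lookup(const uint32_t idx[4], const float tab[4], float out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = tab[idx[i]];
}

#endif
```

Both functions should return the same coefficients for valid indices, so the choice between them is purely about speed on a given core. To compare them on a Cortex-A72 and a Cortex-A53, each one would need to be timed in a loop on the actual hardware.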
