I am using fwsAdd_8u_Sfs(add1, add2, result, 16, 0) to add two blocks of 8-bit integers. When the program runs on a Pentium III processor, it falls back to the reference implementation, since the P3 does not support SSE2 or above. But the reference implementation is extremely slow: about 60 times slower than a naive loop of *c++ = (*a++) + (*b++), where a, b and c all point to 8-bit integers. That seems unbelievable for such a simple operation.
Does anyone know why the reference implementation is so slow?
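
For reference, here is a minimal sketch of the kind of plain loop I am comparing against, plus a saturating variant for a fairer comparison. The function names add_plain and add_saturate are made up for illustration and are not part of Framewave. As far as I understand the Sfs suffix, the library call scales and saturates the result, whereas the plain loop simply wraps on overflow, so the two are not strictly equivalent; the saturating version below assumes a scale factor of 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: the sum wraps modulo 256 on overflow (hypothetical helper). */
static void add_plain(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t len)
{
    while (len--) {
        *c++ = (uint8_t)(*a++ + *b++);
    }
}

/* Saturating loop: sums above 255 are clamped to 255.
   Assumes scale factor 0, i.e. no scaling (hypothetical helper). */
static void add_saturate(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        c[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Even with the clamp, the saturating loop is only a compare and a select per element, so saturation alone would not obviously account for a 60x difference.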