In light of the disappointing performance of the SSE2 code for small
vectors and matrices, I made a few small changes to the code.
Specifically, I changed the calculation of return values and the
handling of leftover elements to use SSE intrinsics instead of normal
The new timings (attached) show the results. SSE2 is now faster, or
only marginally slower, in all cases. Therefore, I have altered the
config so that SSE2 support is enabled by default if the hardware