[Numpy-discussion] Huge performance hit for NaNs with Intel P3, P4

Hi,

This is just to flag a problem I ran into with MATLAB: Pentium 3s and
4s have very slow standard floating point performance with NaN values.
For example, adding to a NaN value on my machine is about 22 times
slower than adding to a non-NaN value.  This can become a very big
problem for matrix multiplication if there are a significant number of
NaNs.  I explained the problem here, for MATLAB and the software I
have been working with:

http://www.mrc-cbu.cam.ac.uk/Imaging/Common/spm_intel_tune.shtml

To illustrate, I've attached a timing script, run against current svn
numpy linked with a standard P4-optimized ATLAS library.  It multiplies
(using dot) a 200x200 array of ones by a) another 200x200 array of
ones and b) a 200x200 array of NaNs:

ones * ones: 0.017460
ones * NaNs: 2.323742
proportion: 133.090452
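
For readers without the attachment, a minimal sketch of such a timing
test might look like the following.  It is not the original script:
the repeat count, the helper name time_dot, and the use of timeit and
np.full are all assumptions made for illustration.

```python
import timeit

import numpy as np

N = 200        # array size used in the post
REPEATS = 100  # assumed repeat count; the original script's value is not known


def time_dot(a, b, repeats=REPEATS):
    """Return the total seconds taken by `repeats` calls of np.dot(a, b)."""
    return timeit.timeit(lambda: np.dot(a, b), number=repeats)


ones = np.ones((N, N))
nans = np.full((N, N), np.nan)

t_ones = time_dot(ones, ones)
t_nans = time_dot(ones, nans)

# Same three-line report format as the output quoted in the post.
print("ones * ones: %f" % t_ones)
print("ones * NaNs: %f" % t_nans)
print("proportion: %f" % (t_nans / t_ones))
```

A proportion close to 1 suggests the linked BLAS has no NaN penalty;
a much larger value suggests it does.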

Happily, for the Pentium 4, you can solve the problem by forcing the
chip to do floating point math with the SSE instructions, which do not
have this NaN penalty.  So the only change needed was to recompile the
ATLAS libraries with extra gcc flags forcing the use of SSE math (see
the page above).  Alternatively, you can use the Intel Math Kernel
Library, which appears to use this trick already.  Here's output from
numpy linked to the recompiled ATLAS libraries:

ones * ones: 0.026638
ones * NaNs: 0.023987
proportion: 0.900473

I wonder if it would be worth distributing the recompiled libraries by
default in any binary releases?  Or including a test like this one in
the benchmarks to warn users about this problem?
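
As a rough sketch of what such a benchmark warning could look like:
the function name nan_penalty_detected and the threshold of 10 are
hypothetical, chosen only for illustration, and the timings fed in are
the two sets of figures quoted above.

```python
import warnings

NAN_SLOWDOWN_THRESHOLD = 10.0  # hypothetical cutoff, not taken from the post


def nan_penalty_detected(t_plain, t_nan, threshold=NAN_SLOWDOWN_THRESHOLD):
    """Warn and return True if t_nan is more than `threshold` times t_plain."""
    ratio = t_nan / t_plain
    if ratio > threshold:
        warnings.warn(
            "NaN arithmetic is %.1f times slower than non-NaN arithmetic; "
            "the linked BLAS may not be using SSE math" % ratio,
            RuntimeWarning,
        )
        return True
    return False


# Timings quoted in the post: standard ATLAS, then ATLAS rebuilt with SSE math.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demonstration output quiet
    standard_atlas = nan_penalty_detected(0.017460, 2.323742)
    sse_atlas = nan_penalty_detected(0.026638, 0.023987)

print(standard_atlas, sse_atlas)  # prints: True False
```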

Best,

Matthew
