#867 bad performance on i7 laptop

Milestone: Stable_(v3.10.x)
Status: closed-works-for-me
Priority: 5
Updated: 2012-12-05
Created: 2012-11-01
Private: No

Hello,

I am trying to use ATLAS 3.10 on my laptop with an Intel Core i7-2630QM CPU and gcc 4.7.2, but the benchmarks inside my own test program say that the performance is somehow really bad.

Here is the output (fast_prod = wrapper around gemv/gemm, prod/axpy_prod = naive product using boost::ublas):

Benchmarking matrix vector prod (768x 768)
fast_prod Ax: 2.59983
fast_prod A^Tx: 2.93648
prod Ax: 0.719953
prod A^Tx: 1.08993
Benchmarking matrix matrix prod for medium sized matrices (512x512)
fast_prod AX: 0.306646
fast_prod A^TX: 0.313313
fast_prod AX^T: 0.313313
fast_prod A^TX^T: 0.306646
axpy_prod AX: 0.633292
axpy_prod A^TX: 0.626626
axpy_prod AX^T: 1.22659
axpy_prod A^TX^T: 1.20992

To make sure that the error does not obviously lie in my code, I asked a friend to run the benchmark as well; his results were at least better:

Benchmarking matrix vector prod
fast_prod Ax: 0.863277
fast_prod A^Tx: 0.773283
prod Ax: 0.926606
prod A^Tx: 2.90981
Benchmarking matrix matrix prod for medium sized matrices
fast_prod AX: 0.179988
fast_prod A^TX: 0.179988
fast_prod AX^T: 0.176655
fast_prod A^TX^T: 0.176655
axpy_prod AX: 0.886609
axpy_prod A^TX: 0.963271
axpy_prod AX^T: 9.05941
axpy_prod A^TX^T: 9.23273

Since we use the same distro (Arch Linux), we should have roughly the same setup aside from the hardware. ATLAS is not a prebuilt package but is compiled during install.

I checked the build file, and this is the build command:

# fix SSE1 only bug, see https://sourceforge.net/tracker/?func=detail&aid=3554109&group_id=23725&atid=379482
patch -Np0 -i "$srcdir/fix_sse1.patch"
if [ "$CARCH" = "x86_64" ]; then
    ARCHITECTURE_BUILD_OPTS="-b 64" # for x86_64
else
    ARCHITECTURE_BUILD_OPTS="-b 32" # for i686
fi
../configure --prefix=/usr/ $ARCHITECTURE_BUILD_OPTS -Fa alg -fPIC \
    --with-netlib-lapack-tarfile="$srcdir/lapack-$_lapackver.tgz"

This looks reasonable. So I reinstalled ATLAS and got the same result. After that I tried adding -Si archdef 0, compiled for hours, and still got the same result. CPU throttling is of course disabled.

I still suspect the error is on my side. Is there a way I can diagnose the problem? A factor of 3 worse than the naive implementation is a lot.

Discussion

  • R. Clint Whaley - 2012-11-01
    • assigned_to: nobody --> rwhaley
     
  • R. Clint Whaley - 2012-11-01

    Yes, that looks bad! The only thing that leaps out at me is that you built using the CPU timer, which is very inaccurate. However, this shouldn't cause your GEMV to be terrible, just your timings to be very variable.

    I usually install with -D c -DPentiumCPS=<MHz> (my machine is 3.3 GHz, so I say -DPentiumCPS=3300).

    What are you comparing it to, the Fortran77 BLAS?

    One thing we can do is eliminate timing error. ATLAS should be auto-setup to compare ATLAS vs. the F77BLAS. In your BLDdir/bin directory, you can run "make xdl1blastst xdl2blastst xdl3blastst" to build the testers; each of these is a timer for the L1, L2, and L3 BLAS respectively. Passing --help will give you some info on usage, as will reading ATLAS/doc/TestTime.txt.

    To time all gemv sizes between 200 and 2000 using l2blastst, simply issue this command (my output is shown below):
    ./xdl2blastst -N 200 2000 200 -F 20 -R gemv

    ------------------------------- GEMV --------------------------------
    TST#  TR     M     N  ALPHA   LDA  INCX   BETA  INCY  TIME   MFLOP  SpUp   TEST
    ====  ==  ====  ====  =====  ====  ====  =====  ====  ====  ======  ====  =====
       0   N   200   200    1.0  2000     1    1.0     1  0.00  1398.5  1.00  -----
       0   N   200   200    1.0  2000     1    1.0     1  0.00  3440.0  2.46   PASS
       1   N   400   400    1.0  2000     1    1.0     1  0.00  1589.4  1.00  -----
       1   N   400   400    1.0  2000     1    1.0     1  0.00  3976.4  2.50   PASS
       2   N   600   600    1.0  2000     1    1.0     1  0.00  1665.5  1.00  -----
       2   N   600   600    1.0  2000     1    1.0     1  0.00  3089.5  1.85   PASS
       3   N   800   800    1.0  2000     1    1.0     1  0.00  1718.1  1.00  -----
       3   N   800   800    1.0  2000     1    1.0     1  0.00  3217.2  1.87   PASS
       4   N  1000  1000    1.0  2000     1    1.0     1  0.00  1739.5  1.00  -----
       4   N  1000  1000    1.0  2000     1    1.0     1  0.00  3290.6  1.89   PASS
       5   N  1200  1200    1.0  2000     1    1.0     1  0.00  1767.1  1.00  -----
       5   N  1200  1200    1.0  2000     1    1.0     1  0.00  3405.7  1.93   PASS
       6   N  1400  1400    1.0  2000     1    1.0     1  0.00  1790.0  1.00  -----
       6   N  1400  1400    1.0  2000     1    1.0     1  0.00  3394.6  1.90   PASS
       7   N  1600  1600    1.0  2000     1    1.0     1  0.00  1792.2  1.00  -----
       7   N  1600  1600    1.0  2000     1    1.0     1  0.00  3379.3  1.89   PASS
       8   N  1800  1800    1.0  2000     1    1.0     1  0.00  1824.0  1.00  -----
       8   N  1800  1800    1.0  2000     1    1.0     1  0.00  3521.0  1.93   PASS
       9   N  2000  2000    1.0  2000     1    1.0     1  0.00  1870.4  1.00  -----
       9   N  2000  2000    1.0  2000     1    1.0     1  0.00  3704.4  1.98   PASS

    10 tests run, 10 passed

    The first line of each pair is the timing of the F77BLAS, and the second is ATLAS's, on my Core i7 (2nd-gen Core i7 with AVX).

    You can do similar things for L1 BLAS and L3 BLAS.

    Let me know what this shows you,
    Clint

     
  • Oswin Krause - 2012-11-02

    Thanks for the answer. I'll try to add all the missing information:

    > I usually install with -D c -DPentiumCPS=<MHz> (my machine is 3.3 GHz, so I say -DPentiumCPS=3300).
    Okay, but as you said, this should not change the results much anyway. I will keep it in mind for the next build :)

    > What are you comparing it to, the Fortran77 BLAS?
    ublas, which boils down to a simple for loop. In fact, I replaced the ublas call with exactly such a loop and got the exact same result. See the benchmark code I am adding at the end :)

    > One thing we can do is eliminate timing error.
    In my previous tests I iterated 1000 times over the test to get more reliable timings. Now, to avoid being outsmarted by the compiler, I set the iteration count to one and increased the dimensionality to 784*8 instead. I also called ATLAS directly through the cblas interface (again, see below for the evaluation code).

    result:
    fast_prod Ax: 0.139991
    prod Ax: 0.049996

    Also, the order of the calls does not matter.

    >./xdl2blastst -N 200 2000 200 -F 20 -R gemv
    Here is the table:

    ------------------------------- GEMV --------------------------------
    TST#  TR     M     N  ALPHA   LDA  INCX   BETA  INCY  TIME   MFLOP  SpUp   TEST
    ====  ==  ====  ====  =====  ====  ====  =====  ====  ====  ======  ====  =====
       0   N   200   200    1.0  2000     1    1.0     1  0.00  1276.3  1.00  -----
       0   N   200   200    1.0  2000     1    1.0     1  0.00  3528.6  2.76   PASS
       1   N   400   400    1.0  2000     1    1.0     1  0.00  1428.2  1.00  -----
       1   N   400   400    1.0  2000     1    1.0     1  0.00  3998.8  2.80   PASS
       2   N   600   600    1.0  2000     1    1.0     1  0.00  1460.6  1.00  -----
       2   N   600   600    1.0  2000     1    1.0     1  0.00  3327.0  2.28   PASS
       3   N   800   800    1.0  2000     1    1.0     1  0.00  1498.6  1.00  -----
       3   N   800   800    1.0  2000     1    1.0     1  0.00  3526.3  2.35   PASS
       4   N  1000  1000    1.0  2000     1    1.0     1  0.00  1523.9  1.00  -----
       4   N  1000  1000    1.0  2000     1    1.0     1  0.00  3496.1  2.29   PASS
       5   N  1200  1200    1.0  2000     1    1.0     1  0.00  1529.3  1.00  -----
       5   N  1200  1200    1.0  2000     1    1.0     1  0.00  3508.6  2.29   PASS
       6   N  1400  1400    1.0  2000     1    1.0     1  0.00  1538.5  1.00  -----
       6   N  1400  1400    1.0  2000     1    1.0     1  0.00  3529.5  2.29   PASS
       7   N  1600  1600    1.0  2000     1    1.0     1  0.00  1577.0  1.00  -----
       7   N  1600  1600    1.0  2000     1    1.0     1  0.00  3525.1  2.24   PASS
       8   N  1800  1800    1.0  2000     1    1.0     1  0.00  1576.7  1.00  -----
       8   N  1800  1800    1.0  2000     1    1.0     1  0.00  3646.2  2.31   PASS
       9   N  2000  2000    1.0  2000     1    1.0     1  0.01  1557.3  1.00  -----
       9   N  2000  2000    1.0  2000     1    1.0     1  0.00  3841.2  2.47   PASS

    which doesn't seem to match my results.

    Here is my benchmark code (without the random initialisation of the matrices), so you can check that the benchmark itself is not too bad :)

    // time the ATLAS call via the cblas interface
    double start = Timer::now();
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 784*8, 784*8, 1.0, &A(0,0), 784*8,
                &x(0), 1, 1.0, &result(0), 1);
    double end = Timer::now();
    std::cout << "fast_prod Ax: " << end - start << std::endl;

    // time the naive loop
    start = Timer::now();
    for(std::size_t j = 0; j != 784*8; ++j){
        for(std::size_t k = 0; k != 784*8; ++k){
            result(j) += A(j,k) * x(k);
        }
    }
    end = Timer::now();
    std::cout << "prod Ax: " << end - start << std::endl;

    // use the result so the compiler cannot optimise the computations away
    double output = 0;
    output += inner_prod(result, result);
    std::cout << "anti optimization output: " << output << std::endl;

     
  • R. Clint Whaley - 2012-11-02

    Looks like a timer error to me. Without seeing the entire thing, I can't guess very well what is wrong, but one common pitfall is that when you have the loops laid out exactly inside the code, the compiler realizes that the multiple calls you are using for timing are useless, and reduces them to one call, for instance. You also aren't flushing the cache, so if your operands fit in cache, later calls will be much faster than earlier ones.
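    To make that concrete, here is a minimal sketch of a flushing, wall-clock timing loop. It is not ATLAS's actual timer; the matrix order, the number of operand copies, and the use of std::chrono::steady_clock are illustrative assumptions. The idea is simply to rotate through a working set much larger than the L3 cache, so no timed call finds its operands already cached, and to keep a result live so the compiler cannot drop the calls.

    #include <cblas.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int n    = 2000;   // matrix order (assumed for illustration)
        const int sets = 4;      // distinct operand sets to rotate through
        // One copy of A is n*n*8 bytes (~32 MB), already larger than a typical
        // Core i7 L3 cache; rotating through several copies keeps every timed
        // call cold.
        std::vector<std::vector<double> > A(sets, std::vector<double>(n * n, 0.001));
        std::vector<std::vector<double> > x(sets, std::vector<double>(n, 1.0));
        std::vector<std::vector<double> > y(sets, std::vector<double>(n, 0.0));

        auto t0 = std::chrono::steady_clock::now();
        for (int s = 0; s < sets; ++s)
            cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0,
                        A[s].data(), n, x[s].data(), 1, 1.0, y[s].data(), 1);
        auto t1 = std::chrono::steady_clock::now();

        double secs  = std::chrono::duration<double>(t1 - t0).count();
        double mflop = sets * (2.0 * n * n) / (secs * 1e6);
        std::printf("dgemv: %.4f s for %d cold calls, ~%.1f MFLOPS\n", secs, sets, mflop);
        std::printf("checksum: %f\n", y[0][0]);   // keep the result live
        return 0;
    }

    Something like "g++ -O3 -std=c++11 flush_timer.cpp -lcblas -latlas" should build it against ATLAS, though the exact library names depend on how your package installs them.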

    Are you timing a 6272x6272 matrix (784*8)?

    Another idea is to stick your loops in a function, and have l2blastst call it, so your code is timed by a known-good timer. You can do that by editing ATLAS/bin/l2blastst.c, and changing the macro test_gemv to call your routine rather than the stuff it calls presently. Then, when you run the l2blastst timer, the first call will be to your code, and the second to ATLAS, for a head-to-head comparison.
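    For the head-to-head idea, the routine you plug in is just your loops behind a GEMV-style argument list. A rough sketch (the routine name and this row-major, no-transpose-only argument list are my own assumptions; the actual test_gemv macro in l2blastst.c drives the column-major Fortran-style interface, so the glue would need adapting):

    // Naive GEMV, y = beta*y + alpha*A*x, row-major A, CBLAS-like arguments
    // (hypothetical name and signature, for illustration only).
    void naive_dgemv_rowmajor(int M, int N, double alpha,
                              const double *A, int lda,
                              const double *x, int incx,
                              double beta, double *y, int incy)
    {
        for (int i = 0; i < M; ++i) {
            double sum = 0.0;
            for (int j = 0; j < N; ++j)
                sum += A[i*lda + j] * x[j*incx];
            y[i*incy] = beta * y[i*incy] + alpha * sum;
        }
    }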

    Let me know,
    Clint

     
  • R. Clint Whaley - 2012-11-02
    • status: open --> open-works-for-me
     
  • Oswin Krause - 2012-11-03

    Thanks again for your fast reply.

    I couldn't reproduce my results with your timing code, even when I used the same compiler settings (-O3) as in my own code (I also had to account for the fact that your code is of course column-major while I use row-major arrays). So I think it might really be a timing issue on my side. The rest of this message is optional for you; I just want to find out what is wrong in my setup, since this really gives me a bad feeling about what is going on:

    The timer is getrusage (ru_utime and ru_stime added together). While not the highest precision, this should be enough for a matrix this big.
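    To rule the timer in or out, a small cross-check like the following could help; it reports both getrusage CPU time and wall-clock time around the same call (the matrix size is the one from my benchmark, everything else is just an illustrative sketch). If the two numbers disagree badly on an otherwise idle machine, the CPU timer is suspect; note that with a threaded ATLAS build the CPU time can also legitimately exceed the wall time.

    #include <cblas.h>
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <sys/resource.h>
    #include <vector>

    // user + system CPU time of this process, in seconds
    static double cpu_seconds()
    {
        rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec)
             + (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) * 1e-6;
    }

    int main()
    {
        const int n = 784 * 8;   // 6272, as in the benchmark (~315 MB for A)
        std::vector<double> A(std::size_t(n) * n, 0.001), x(n, 1.0), y(n, 0.0);

        double c0 = cpu_seconds();
        auto   w0 = std::chrono::steady_clock::now();
        cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0,
                    A.data(), n, x.data(), 1, 1.0, y.data(), 1);
        auto   w1 = std::chrono::steady_clock::now();
        double c1 = cpu_seconds();

        std::printf("getrusage CPU time: %.4f s\n", c1 - c0);
        std::printf("wall-clock time   : %.4f s\n",
                    std::chrono::duration<double>(w1 - w0).count());
        std::printf("checksum: %f\n", y[0]);   // keep the result live
        return 0;
    }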

    What bothers me is that the results are consistent over drastic changes to the setup. I get the same results when I:
    - set matrix size at runtime
    - reorder the function evaluations
    - fill the matrix with new random values between every evaluation
    - only activate one of the tests in the program and compile/run it 4 times with different settings of the used function.

    > the compiler realizes that the multiple calls you are using for timing are useless
    That should not happen in my case. I run only a single iteration on a huge matrix for every function. I initialise the matrix A at the beginning and copy its transpose into a matrix AT, so I alternate between these two matrices (first fast_prod with A, then with AT, then prod with A and with AT). I have now also added a step that reinitialises the matrices with new values between fast_prod and prod.
    Every function adds its result onto the result of the previous evaluation, an inner product of that result is computed, and it is printed out at the end of the test suite. In between there are library calls with side effects (getrusage). I don't see how a compiler could optimise any of this away.

     
  • R. Clint Whaley - 2012-12-05

    I tell people that when I first developed ATLAS, I spent more than a third of my time developing and writing the timers, and they don't understand that at all; unfortunately, you need to get a lot of things right to produce a tuning-quality timer.

    I don't know what's up with your timer, but most of what I know about getting decent timings is published in:
    http://www.cs.utsa.edu/~whaley/papers/timing_SPE08.pdf
    if you want to see what is behind my timers . . .

    I'm closing this support request, but feel free to re-open if you have more questions on this topic that I need to address.

    Cheers,
    Clint

     
  • R. Clint Whaley - 2012-12-05
    • status: open-works-for-me --> closed-works-for-me
     
