We are using ATLAS for our electronic structure code SPHInX and are really happy with its performance, in particular for non-square matrices. We presently use the most recent stable release, ATLAS 3.8.3.
Yet we have encountered very strange problems in some cases with zgemm (beta=0 (with your patch applied), alpha=1, no transposition, N=12, M=12, K=79204, and lda, ldb, ldc at their minimal values): the output contains NaNs with a very unusual signature (0x9...) compared to our own on-purpose NaNs (0x800...). This is reproducible within our code (several independent runs stop at the same point), but not in a separate executable with exactly the same input data, which is strange enough in itself.
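For reference, this is roughly how such NaN signatures can be inspected. It is only a minimal sketch: the helper names doubleBits and reportNans are made up for this post, and it assumes a 64-bit IEEE-754 double. A complex array of n entries can be scanned as 2n doubles, provided the complex type is laid out as two consecutive doubles (true for std::complex<double>; assumed, not checked, for SxComplex<double>).

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Return the raw bit pattern of a double (memcpy avoids aliasing problems).
std::uint64_t doubleBits(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// Print index and bit pattern of every NaN among n doubles and
// return how many were found. 'label' only tags the output lines.
std::size_t reportNans(const double *data, std::size_t n, const char *label)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::isnan(data[i])) {
            std::printf("%s[%zu] = NaN, bits 0x%016llx\n", label, i,
                        static_cast<unsigned long long>(doubleBits(data[i])));
            ++count;
        }
    }
    return count;
}
```

Calling reportNans on the inputs before the zgemm call and on the result afterwards shows whether the NaNs are already present in the input or only appear in the output.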
To be precise: I wrote the input data for the cblas_zgemm call to disk before the critical call, as well as the result I get from zgemm. Then I start a separate program that reads the data from disk and calls zgemm again. So far, the tester executable has always succeeded, even for data that failed in the main program. The result for successful calls is numerically correct compared to a "manual gemm" using axpy and zdot.
I ran valgrind 3.4.1 on the code, and it complains about uninitialized values allocated inside ATLAS (even for the tester):
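To illustrate the kind of cross-check meant here, a minimal sketch follows. It is not our actual code: it uses plain loops instead of axpy/zdot calls, the names referenceGemm and maxAbsDiff are made up for this post, and it assumes column-major storage with minimal leading dimensions (lda=M, ldb=K, ldc=M) and alpha=1, beta=0.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <limits>

typedef std::complex<double> Cplx;

// C (M x N) = A (M x K) * B (K x N), column-major, minimal leading dimensions.
// C is overwritten and never read, matching beta=0.
void referenceGemm(int M, int N, int K,
                   const Cplx *A, const Cplx *B, Cplx *C)
{
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < M; ++i) {
            Cplx sum(0.0, 0.0);
            for (int k = 0; k < K; ++k)
                sum += A[i + static_cast<std::size_t>(k) * M]
                     * B[k + static_cast<std::size_t>(j) * K];
            C[i + static_cast<std::size_t>(j) * M] = sum;
        }
    }
}

// Largest absolute element-wise difference between X and Y (n entries each).
// Returns infinity as soon as a NaN shows up, so NaNs cannot pass unnoticed.
double maxAbsDiff(const Cplx *X, const Cplx *Y, std::size_t n)
{
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double d = std::abs(X[i] - Y[i]);
        if (std::isnan(d))
            return std::numeric_limits<double>::infinity();
        worst = std::max(worst, d);
    }
    return worst;
}
```

The zgemm result is then compared against the referenceGemm result with maxAbsDiff, using a tolerance appropriate for a sum over K terms.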
==25897== by 0x40450F: main (sxmttst.cpp:41)
==25897== Address 0x4025008 is not stack'd, malloc'd or (recently) free'd
==25897== Uninitialised value was created by a heap allocation
==25897== at 0x4C2230B: malloc (vg_replace_malloc.c:207)
==25897== by 0x7CA2DF9: mmNMK (in /scratch/freysoldt/devel/numlibs/lib/libatlas.so.1.0.0)
==25897== by 0x7CA33F0: ATL_zmmJITcp (in /scratch/freysoldt/devel/numlibs/lib/libatlas.so.1.0.0)
==25897== by 0x7C884D5: ATL_zgemm (in /scratch/freysoldt/devel/numlibs/lib/libatlas.so.1.0.0)
==25897== by 0x676D978: matmult(SxComplex<double>*, SxComplex<double> const*, SxComplex<double> const*, int, int, int) (SxMath.cpp:689)
I am not sure whether this is conclusive, since you ATLAS guys might do some clever stuff that valgrind doesn't detect. Still, it might be a piece of the puzzle.
I've worked hard to exclude any other error on our side.
From the overall picture, an uninitialized value is a plausible explanation for the observed dependence on the call history (since the memory might have been used before by other parts of the program).
I know that this is not a great bug report because I cannot present a well-defined test case. I am short of ideas for how to pin the problem down better. Maybe you have some suggestions for how I could generate a test case with the problematic behavior.
Any suggestions for how to get around this problem would be highly welcome!
Is it worthwhile to test the latest developer version on this?