#126 strange nans from zgemm

Both
closed-fixed
5
2009-08-06
2009-07-06
No

Hi,

we are using ATLAS for our electronic structure code SPHInX, and are really happy about its performance, in particular for non-quadratic matrices. We presently use the most recent stable release, ATLAS 3.8.3

Yet, we have encountered very strange problems in some cases for zgemm (beta=0 (but with your patch), alpha=1, no trans, N=12, M=12, K=79204, lda,ldb,ldc with their minimal value), getting out nans with a very unusual signature (0x9...) compared to our own on-purpose nans (0x800...). This is reproducible within our code (several independent runs stop at the same point), but not in a separate executable with exactly the same input data => strange enough in itself.

To be precise: I wrote out the input data for the cblas_zgemm call on disk before the critical call, as well as the result I get from zgemm. Then I start a separate program that reads the data on disk and calls zgemm again. So far, the tester executable always succeeded, even for data that failed in the main program. The result for successful calls is numerically correct compared to "manual gemm" using axpy and zdot .

I had valgrind 3.4.1 running on the code and it complains about uninitialized values allocated in the ATLAS (even for tester):

==25897== by 0x40450F: main (sxmttst.cpp:41)
==25897== Address 0x4025008 is not stack'd, malloc'd or (recently) free'd
==25897== Uninitialised value was created by a heap allocation
==25897== at 0x4C2230B: malloc (vg_replace_malloc.c:207)
==25897== by 0x7CA2DF9: mmNMK (in /scratch/freysoldt/devel/numlibs/lib/libatlas.so.1.0.0)
==25897== by 0x7CA33F0: ATL_zmmJITcp (in /scratch/freysoldt/devel/numlibs/lib/libatlas.so.1.0.0)
==25897== by 0x7C884D5: ATL_zgemm (in /scratch/freysoldt/devel/numlibs/lib/libatlas.so.1.0.0)
==25897== by 0x676D978: matmult(SxComplex<double>*, SxComplex<double> const*, SxComplex<double> const*, int, int, int) (SxMath.cpp:689)

I am not sure if this is conclusive, since you ATLAS guys might do some clever stuff that valgrind doesn't detect. Still, might be a piece in the puzzle.

I've worked hard to exclude any other error from our side.

From the overall picture, an uninitialized value is a plausible explanation for the observed dependence on the call history (since the memory might have been used before by other parts of the program).

--

I know that this is no great bug report because I cannot present a well-defined test case. I am lacking ideas of how to better define the problem. Maybe you have some suggestions how I could generate a test case with the problematic behavior.

Any suggestions of how to get around this problems would be highly welcome!

Is it worthwhile to test the latest developer version on this?

Discussion

  • R. Clint Whaley

    R. Clint Whaley - 2009-07-06

    I take it the patch you have applied is:
    http://math-atlas.sourceforge.net/errata.html#JITB0

    >zgemm (beta=0 (but with your patch), alpha=1, no trans, N=12, M=12, K=79204

    Wow, I believe if you made the transpose settings 'NoTrans', "Trans' you would achieve the absolute worst-case for ATLAS performance :) Is ATLAS any better than the reference BLAS for this problem?

    Now, on the NaN, problem I do not understand the behavior you are seeing. Is the answer supposed to contain NaNs, just a different value of NaN, or is a valid problem sometimes producing NaNs?

    Obviously, if you cannot reproduce the problem w/o calling your code, I am free to believe that the error is in your code, which causes a memory overwrite that screws up ATLAS, rather than being in my own, perfect, code base :) This can still be the case even if it goes away when you use another BLAS, since a mem overwrite will cause different behaviors based on different mem footprints. Therefore, I have a perfect defense until you find an actual error in my code . . .

    However, as the above patch demonstrates, this is a crufty part of ATLAS that hasn't been tested as well as I would like, so I'm willing to believe there may be something there. Can you compile ATLAS as a .a using the -g flag, and see if valgrind will give you line numbers where uninit mem reads are occuring, so I can examine if they are real? If the error is truly in ATLAS, these uninit behaviors should occur always in ATLAS, so you can use the tester to show . . .

    The developer is definitely worth trying; it has all known patches applied, and might even parallelize that awful problem you are giving it . . .

    Cheers,
    Clint

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-06
    • assigned_to: nobody --> rwhaley
     
  • Christoph Freysoldt

    Small C++ program with a zgemm call

     
  • Christoph Freysoldt

    Clint,

    thanks for the answer to my vague bug report. I think I can back my case now with some hard facts.

    To clarify the unclear points:
    - yes, the patch applied is the "Complex gemm sometimes reads C for beta=0"
    - the problem is that zgemm sometimes yields a NaN-containing result, although the exact answer is well behaved. So far, I could give no condition when "sometimes" is, but I think I made some progress (see below). If the ATLAS zgemm produces valid (=no NaN) numbers, they are correct.

    ---
    I tried out the developers release 3.9.11, and also the 3.8.3 on a different machine (my computer at home). Same NaN behavior, same valgrind messages.

    I managed to produce the bug with a small tester program and valgrind. If I compile&link the attached tester.cpp into tester.x and run it with valgrind:

    valgrind --malloc-fill=ff tester.x

    I reproducibly get nans. I use valgrind 3.4.1 here to fill the memory allocated by ATLAS with 0xff bytes. If I don't initialize the memory or with something like 0x00, no nans appear in the result.
    I reduced K from 79204 (original case) to 3000 for the tester. Going much lower makes the bug disappear - probably some change in algorithm(?). If this algorithm threshold depends on the particular machine, higher Ks might be needed.

    I would prefer to get this case running without valgrind, but I have no other means of initializing the memory that ATLAS mallocates. Maybe you can simulate this behavior inside the ATLAS if you do not trust the valgrind CPU simulator.

    Unfortunately, valgrind does not complain about unitialized values until they affect some branching or output. Adding and multiplying around uninitialized values seems perfectly legitimate to the valgrind people. So, valgrind is not helpful for locating the first use of the uninitialized value within the ATLAS.

    I am sorry if I wasted your time with the initial report - I had been trying to hunt down this bug for more than a week and was really desperate. I hope this update is more useful.

    Christoph

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-07

    Unless C++ works differently than C, you need to pass "alpha, beta', not "&alpha, &beta". Applying a beta of a random mem location would indeed tend to produce NaNs . . .

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-07

    OK, I believe I can confirm that this is an error in ATLAS, though I don't yet know where. What I have done is fix the alpha/beta in your tester, and I still get the NaNs. Then I have ATLAS internally call the reference GEMM, and the problem goes away.

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-07
    • labels: 497307 -->
    • milestone: 148062 -->
    • status: open --> open-accepted
     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-07

    The error is in mmJITcp

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-07
    • labels: --> Incorrect answer
    • milestone: --> Both
     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-07

    Christoph,

    I am still trying to track this down, but for a quick fix, you can change line 185 of ATLAS3.8.3/src/blas/gemm/ATL_gemmXX.c from:
    if (K >= C2R_K || ((M < MB || N < NB) && K >= 4*KB))
    to:
    if (0)
    and you should get the right answer (though it should be slower).

    Let me know if this fixes the problem in your application (I already showed it does in the tester),
    Clint

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-08

    rough translation to C from C++

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-08

    In BLDdir/bin, you can edit the Makefile to build and run nantest.c with commands:

    xtst : force_build
    $(ICC) $(ICCFLAGS) -g -o xtst nantest.c $(CBLASlib) $(ATLASlib)
    valgrind --malloc-fill=ff ./xtst

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-08
    • status: open-accepted --> open-fixed
     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-08

    Christoph,

    I have found an error that seems to fix your tester. Can you verify that it fixes your actual code:
    http://math-atlas.sourceforge.net/errata.html#JITB02

    If you don't have time to do this fairly quickly, let me know, as I am hoping to wait on verification from you before sending out a message on the atlas-error list . . .

    Thanks,
    Clint

     
  • Christoph Freysoldt

    Clint,
    I'll try the patch within the next 12 hours.
    Christoph

     
  • Christoph Freysoldt

    Clint,

    your fix solves the nan problem in our code.

    Many, many thanks for your quick help!

    Christoph

    PS:
    about whether to call cblas_zgemm with &alpha or only alpha: to my own big surprise, it is valid C++ to put the address operator or leave it out for fixed-size arrays (the Stroustrup book on C++ states that an array name can(!) be used as a pointer to its first element). The same seems to be true for C, although I have no reference for that. For example, the following code prints out identical addresses:
    double alpha[2];
    printf ("%p\n", alpha);
    printf ("%p\n", &alpha);

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-07-09

    Christoph,

    Many thanks to you for reporting and providing a test for this error, and testing the fix. ATLAS is a big package, and w/o people willing to help, there is no way ATLAS would be as solid as it is. Since this involved a unitialized mem read in corner-case code, no telling how long this would have stayed in the package; it was already in both developer and stable . . .

    No, I had no idea that alpha and &alpha were the same!

    Thanks,
    Clint

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-08-06

    Should be fixed in 3.9.12. Please reopen if not.

     
  • R. Clint Whaley

    R. Clint Whaley - 2009-08-06
    • status: open-fixed --> closed-fixed
     

Log in to post a comment.