Menu

#172 Default behaviour of dgemm when BETA=0

Both
closed-fixed
5
2011-04-21
2011-04-14
Anonymous
No

Dear maintainers,
with the 3.8.4.x releases I'm having some problems in Octave, described in the thread here: https://mailman.cae.wisc.edu/pipermail/octave-maintainers/2011-April/023614.html. In few words, dgemm can read uninitialized values when BETA=0. The problem disappears if I revert the fix (w.r.t. 3.8.3) described here: http://math-atlas.sourceforge.net/errata.html#scal0. I'm sorry I cannot provide a small fortran or C file to show the issue.
Reference dgemm initializes C to zero when BETA=0 (and not multiplies C by BETA). Does ATLAS 3.8.4.x do the same?

Best regards,

Marco

Discussion

  • R. Clint Whaley

    R. Clint Whaley - 2011-04-14

    ATLAS's behaviour depends on a lot of things. Generally, it tries to avoid loading C rather than zero it, for performance reasons.

    However, the revert code indicates this is a case where I was zeroing, but depending on nonstandard behavior. Since you haven't narrowed the case at all, how are you sure this is coming from DGEMM?

    If you could find me a case of a call to ATLAS (as opposed to a call to an octave function I've never heard of), I think the problem will be easy to find . . .

    Thanks,
    Clint

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-14
    • assigned_to: nobody --> rwhaley
     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-14

    I cannot find anything in DGEMM that depends on the non-standard behaviour of scal mentioned in the fix you reverted. Are you sure that octave itself isn't just depending on the non-standard behaviour described in that fix?

    If it is in ATLAS, I am going to need a call to ATLAS routines that demonstrates the problems before I have a hope of finding and fixing.

    Thanks,
    Clint

     
  • Nobody/Anonymous

    Dear Clint,

    the Octave code showing the problem is a column vector times row vector multiplication. It is managed by the following lines in file dMatrix.cc of Octave:

    retval = Matrix (a_nr, b_nc); // a_nr = rows of a and b_nc = columns of b
    double *c = retval.fortran_vec ();
    if (b_nc == 1)
    {...}
    else if (a_nr == 1)
    {...}
    else
    {
    const char ctra = get_blas_trans_arg (tra);
    const char ctrb = get_blas_trans_arg (trb);
    F77_XFCN (dgemm, DGEMM, (F77_CONST_CHAR_ARG2 (&ctra, 1),
    F77_CONST_CHAR_ARG2 (&ctrb, 1),
    a_nr, b_nc, a_nc, 1.0, a.data (),
    lda, b.data (), ldb, 0.0, c, a_nr
    F77_CHAR_ARG_LEN (1)
    F77_CHAR_ARG_LEN (1)));
    }

    If I initialize c with

    for (int j = 0; j < a_nr; j++)
    for (int i = 0; i < b_nc; i++)
    retval.xelem (j,i) = 0.;

    before the call to DGEMM, everything is fine. If not, everything is fine only with ATLAS 3.8.3 or 3.8.4.x with the reverted fix. From this experiment, I thought ATLAS DGEMM is *no more* setting C to zero. I tried a simple fortran code, but I was not able to reproduce the problem. So, you are right I am not 100% sure it is a problem with ATLAS.

    Best regards,

    Marco

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-14

    Marco,

    I don't at all doubt the error is in ATLAS, I just need some help to find it. I originally wrote ATLAS with the design of SCAL that zeroed when called with alpha=0, but then a user pointed out that this violates the BLAS standard, which reads the vector even when alpha=0. So, I had to change it with the patch, but this doesn't mean I found all the cases where I am calling SCAL with alpha=0.

    However, I didn't see any calls to it inside of GEMM. There are a lot of subcases of GEMM, and I may just be missing it.

    So, can you track your octave to down to the DGEMM call, and give me the dimensions of the call?

    Thanks,
    Clint

     
  • Nobody/Anonymous

    Dear Clint,
    this is the fortran *equivalent* of my Octave code

    program test
    implicit none
    integer i,j,m
    parameter (m = 50)
    double precision a(m,1),b(1,m),c(m,m)
    do i = 1,m
    a(i,1) = (i-1) * 1.0d0
    b(1,i) = (i-1) * 1.0d0
    enddo
    call dgemm('N','N',m,m,1,1.0d0,a,m,b,1,0.0d0,c,m)
    do i = 1,m
    write(6,*) c(i,1) ! c(i,1) is -0 or, rarely, NaN instead of 0 in some random positions
    enddo
    stop
    end

    But, as I said, this fortran code does not give me problems. Anyway, in Octave I observe the problem at least for 45 <= m <=72. I hope this helps you.

    Marco

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-15

    Marco,

    Can you compile your example code with the gfortran flag "-finit-real=nan"? This should cause it to init any uninitialized values to NaN (you can print a elt of C before the call to make sure this worked). Then, if you have NaNs in the answer after the run, the error is in ATLAS DGEMM and I will look again to see what I missed.

    However, if you don't get NaNs even then, then I think it is likely the error is in octave, which is somewhere internally relying on the old nonstandard behaviour of dscal.

    Thanks,
    Clint

     
  • Nobody/Anonymous

    I got a very nice column of NANs ;-)
    And I double checked that everything is fine with 3.8.3. If it matters, I'm using Core232SSE3. I get NANs for 41 <= m <= 208.

    Marco

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-15

    Marco

    Excellent. Thanks for all the help. I'll try installing on my Core2 system and see if I can repeat ASAP.

    Thanks!
    Clint

     
  • Nobody/Anonymous

    Thank you Clint for your patience to guide me to a simple test case.

    Marco

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-15

    Just realized why I didn't find this problem initially: this is against 3.8.4.x, and I was looking in the 3.9 series. Will give 3.8 a scope, hopefully this weekend.

    Escalating to bug tracker.

    Cheers,
    Clint

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-15
    • milestone: 148062 -->
    • labels: 497307 -->
    • status: open --> open-accepted
     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-20

    Bug confirmed on Phenom, so not system specific.

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-20
    • milestone: --> Both
    • labels: --> Incorrect answer
    • status: open-accepted --> open-fixed
     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-20

    OK, I have posted a new tarfile at:
    http://www.cs.utsa.edu/~whaley/dload/atlas3.8.4.3.tar.bz2

    According to my testing, this fixes the problem. Can you please confirm this particular fix, as well as letting me know if you see further problems when you run the same tests you did to originally find this bug?

    Thanks,
    Clint

     
  • Marco Caliari

    Marco Caliari - 2011-04-20

    Dear Clint,

    your fix works and I don't see further problems.

    Thanks,
    Marco

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-21
    • status: open-fixed --> closed-fixed
     
  • R. Clint Whaley

    R. Clint Whaley - 2011-04-21

    Thanks for the quick reply! Closing as fixed.

     

Log in to post a comment.