Automatically Tuned Linear Algebra Soft. / Bugs / #172 Default behaviour of dgemm when BETA=0

#172 Default behaviour of dgemm when BETA=0

Milestone: Both

Status: closed-fixed

Owner: R. Clint Whaley

Labels: Incorrect answer (42)

Priority: 5

Updated: 2011-04-21

Created: 2011-04-14

Creator: Anonymous

Private: No

Dear maintainers,
with the 3.8.4.x releases I'm having some problems in Octave, described in the thread here: https://mailman.cae.wisc.edu/pipermail/octave-maintainers/2011-April/023614.html. In few words, dgemm can read uninitialized values when BETA=0. The problem disappears if I revert the fix (w.r.t. 3.8.3) described here: http://math-atlas.sourceforge.net/errata.html#scal0. I'm sorry I cannot provide a small fortran or C file to show the issue.
Reference dgemm initializes C to zero when BETA=0 (and not multiplies C by BETA). Does ATLAS 3.8.4.x do the same?

Best regards,

Marco

Discussion

R. Clint Whaley - 2011-04-14

ATLAS's behaviour depends on a lot of things. Generally, it tries to avoid loading C rather than zero it, for performance reasons.

However, the revert code indicates this is a case where I was zeroing, but depending on nonstandard behavior. Since you haven't narrowed the case at all, how are you sure this is coming from DGEMM?

If you could find me a case of a call to ATLAS (as opposed to a call to an octave function I've never heard of), I think the problem will be easy to find . . .

Thanks,
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-14

assigned_to: nobody --> rwhaley
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-14

I cannot find anything in DGEMM that depends on the non-standard behaviour of scal mentioned in the fix you reverted. Are you sure that octave itself isn't just depending on the non-standard behaviour described in that fix?

If it is in ATLAS, I am going to need a call to ATLAS routines that demonstrates the problems before I have a hope of finding and fixing.

Thanks,
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2011-04-14

Dear Clint,

the Octave code showing the problem is a column vector times row vector multiplication. It is managed by the following lines in file dMatrix.cc of Octave:

retval = Matrix (a_nr, b_nc); // a_nr = rows of a and b_nc = columns of b
double *c = retval.fortran_vec ();
if (b_nc == 1)
{...}
else if (a_nr == 1)
{...}
else
{
const char ctra = get_blas_trans_arg (tra);
const char ctrb = get_blas_trans_arg (trb);
F77_XFCN (dgemm, DGEMM, (F77_CONST_CHAR_ARG2 (&ctra, 1),
F77_CONST_CHAR_ARG2 (&ctrb, 1),
a_nr, b_nc, a_nc, 1.0, a.data (),
lda, b.data (), ldb, 0.0, c, a_nr
F77_CHAR_ARG_LEN (1)
F77_CHAR_ARG_LEN (1)));
}

If I initialize c with

for (int j = 0; j < a_nr; j++)
for (int i = 0; i < b_nc; i++)
retval.xelem (j,i) = 0.;

before the call to DGEMM, everything is fine. If not, everything is fine only with ATLAS 3.8.3 or 3.8.4.x with the reverted fix. From this experiment, I thought ATLAS DGEMM is *no more* setting C to zero. I tried a simple fortran code, but I was not able to reproduce the problem. So, you are right I am not 100% sure it is a problem with ATLAS.

Best regards,

Marco

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-14

Marco,

I don't at all doubt the error is in ATLAS, I just need some help to find it. I originally wrote ATLAS with the design of SCAL that zeroed when called with alpha=0, but then a user pointed out that this violates the BLAS standard, which reads the vector even when alpha=0. So, I had to change it with the patch, but this doesn't mean I found all the cases where I am calling SCAL with alpha=0.

However, I didn't see any calls to it inside of GEMM. There are a lot of subcases of GEMM, and I may just be missing it.

So, can you track your octave to down to the DGEMM call, and give me the dimensions of the call?

Thanks,
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2011-04-15

Dear Clint,
this is the fortran *equivalent* of my Octave code

program test
implicit none
integer i,j,m
parameter (m = 50)
double precision a(m,1),b(1,m),c(m,m)
do i = 1,m
a(i,1) = (i-1) * 1.0d0
b(1,i) = (i-1) * 1.0d0
enddo
call dgemm('N','N',m,m,1,1.0d0,a,m,b,1,0.0d0,c,m)
do i = 1,m
write(6,*) c(i,1) ! c(i,1) is -0 or, rarely, NaN instead of 0 in some random positions
enddo
stop
end

But, as I said, this fortran code does not give me problems. Anyway, in Octave I observe the problem at least for 45 <= m <=72. I hope this helps you.

Marco

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-15

Marco,

Can you compile your example code with the gfortran flag "-finit-real=nan"? This should cause it to init any uninitialized values to NaN (you can print a elt of C before the call to make sure this worked). Then, if you have NaNs in the answer after the run, the error is in ATLAS DGEMM and I will look again to see what I missed.

However, if you don't get NaNs even then, then I think it is likely the error is in octave, which is somewhere internally relying on the old nonstandard behaviour of dscal.

Thanks,
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2011-04-15

I got a very nice column of NANs ;-)
And I double checked that everything is fine with 3.8.3. If it matters, I'm using Core232SSE3. I get NANs for 41 <= m <= 208.

Marco

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-15

Marco

Excellent. Thanks for all the help. I'll try installing on my Core2 system and see if I can repeat ASAP.

Thanks!
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2011-04-15

Thank you Clint for your patience to guide me to a simple test case.

Marco

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-15

Just realized why I didn't find this problem initially: this is against 3.8.4.x, and I was looking in the 3.9 series. Will give 3.8 a scope, hopefully this weekend.

Escalating to bug tracker.

Cheers,
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-15

milestone: 148062 -->

labels: 497307 -->

status: open --> open-accepted
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-20

Bug confirmed on Phenom, so not system specific.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-20

milestone: --> Both

labels: --> Incorrect answer

status: open-accepted --> open-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-20

OK, I have posted a new tarfile at:
http://www.cs.utsa.edu/~whaley/dload/atlas3.8.4.3.tar.bz2

According to my testing, this fixes the problem. Can you please confirm this particular fix, as well as letting me know if you see further problems when you run the same tests you did to originally find this bug?

Thanks,
Clint

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Marco Caliari - 2011-04-20

Dear Clint,

your fix works and I don't see further problems.

Thanks,
Marco

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-21

status: open-fixed --> closed-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

R. Clint Whaley - 2011-04-21

Thanks for the quick reply! Closing as fixed.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Log in to post a comment.