#60 crash when using cblas_dgemm in 3.6.0

Milestone: Stable
Status: closed-fixed
Labels: Other (25)
Priority: 5
Updated: 2007-10-14
Created: 2004-04-16
Creator: Anonymous
Private: No

G'day Folks,

This is my first exposure (at least knowingly) to blas,
cblas and atlas, so my problem may be due to
something stupid I've done. If so, apologies right off
the bat.

I'm using atlas to implement some fast non-linear
constrained optimisation algorithms.

The problem comes when I call cblas_dgemm for some
particular problem sizes.

The following code causes a crash:

#include <stdio.h>
#include <cblas.h>
#include <stdlib.h>

int main()
{
    int nrows = 572;
    int ncols = 512;
    int i, j;

    double *J2 = calloc(nrows * ncols, sizeof(double));
    double *H = calloc(nrows * nrows, sizeof(double));

    for(i = 0; i < nrows; i++){
        for(j = 0; j < ncols; j++){
            J2[i * ncols + j] = 1.0/(i + j + 1);
        }
    }

    cblas_dgemm(CblasRowMajor, CblasNoTrans,
                CblasTrans, nrows, nrows, ncols, 1, J2, ncols, J2, ncols,
                0, H, nrows);

}

The value of ncols seems to be critical. If ncols is 512 I
get a segv; if ncols is 513, 1024, 256 or 12 there is
no crash.

I loaded the core file into gdb and it says that the stack
looks like:
#0 0x3ea44 in KLOOP ()
#1 0x545a48 in ?? ()
#2 0x3e350 in ATL_dpKBmm_b1 ()
#3 0x3e42c in ATL_dpKBmm ()
#4 0x16974 in ATL_dmmJIK2 ()
#5 0x17314 in ATL_dmmJIK ()
#6 0x134c0 in ATL_dGEMM2TN ()
#7 0x13a08 in ATL_dgemm ()
#8 0x12b84 in cblas_dgemm ()
#9 0x12554 in main () at test_dgemm.c:20

I decided to run the code under purify to see if there
was anything strange that I could find. When ncols =
512 purify reports a heap of uninitialised memory reads
and out of bounds array reads. I've included a sample
of the reports below - each type appears to involve the
same functions.

UMR: Uninitialized memory read
This is occurring while in:
*unknown func* [pc=0xba2f4]
ATL_dpKBmm_b1 [libatlas.a]
ATL_dpKBmm [libatlas.a]
ATL_dmmJIK2 [libatlas.a]
ATL_dmmJIK [libatlas.a]
ATL_dGEMM2TN [libatlas.a]
Reading 8 bytes from 0x61dc40 in the heap.
Address 0x61dc40 is 36624 bytes into a malloc'd
block at 0x614d30 of 36640 bytes.
This block was allocated from:
malloc [rtlib.o]
ATL_dmmJIK [libatlas.a]
ATL_dGEMM2TN [libatlas.a]
ATL_dgemm [libatlas.a]
cblas_dgemm [libcblas.a]
main [test_dgemm.c:20]

ABR: Array bounds read (11518 times)
This is occurring while in:
*unknown func* [pc=0xb9cf0]
ATL_dpKBmm_b1 [libatlas.a]
ATL_dpKBmm [libatlas.a]
ATL_dmmJIK2 [libatlas.a]
ATL_dmmJIK [libatlas.a]
ATL_dGEMM2TN [libatlas.a]
Reading 8 bytes from 0x61dc50 in the heap.
Address 0x61dc50 is 1 byte past end of a malloc'd
block at 0x614d30 of 36640 bytes.
This block was allocated from:
malloc [rtlib.o]
ATL_dmmJIK [libatlas.a]
ATL_dGEMM2TN [libatlas.a]
ATL_dgemm [libatlas.a]
cblas_dgemm [libcblas.a]
main [test_dgemm.c:20]

If ncols = 513 purify reports nothing.
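The addresses in the two Purify reports above can be checked by hand. The sketch below only does the pointer arithmetic on the values quoted in the reports; the helper names are hypothetical and nothing here calls into ATLAS. It suggests that the UMR read falls on the second-to-last 8-byte slot of the 36640-byte workspace, and that the ABR read starts at the first byte past its end.

```c
#include <assert.h>

/* Hypothetical helper: byte offset of addr from the start of the block. */
long offset_in_block(unsigned long addr, unsigned long base)
{
    assert(addr >= base);
    return (long)(addr - base);
}

/* Hypothetical helper: returns 1 if a read of len bytes at addr lies
 * entirely inside [base, base + size), and 0 otherwise. */
int read_in_bounds(unsigned long addr, unsigned long len,
                   unsigned long base, unsigned long size)
{
    if (addr < base)
        return 0;
    return (addr - base) + len <= size;
}
```

With the reported block at 0x614d30 of 36640 bytes, offset_in_block(0x61dc40, 0x614d30) gives 36624 and that 8-byte read is in bounds, while offset_in_block(0x61dc50, 0x614d30) gives 36640, so the second read begins exactly where the block ends.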

I downloaded a copy of the reference blas and cblas
interface, compiled them separately, and linked the
example code to those new libraries, and I got no crash.
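For cross-checking a result without any BLAS library at all, the same product can be written as a plain triple loop. This is a minimal sketch, not the reference BLAS code: ref_aat and max_abs_diff are hypothetical names, and the loop assumes the same row-major layout as the test program (A is nrows x ncols, H is nrows x nrows, H = A * A^T).

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference for H = A * A^T.
 * A is row-major nrows x ncols, H is row-major nrows x nrows. */
void ref_aat(const double *A, int nrows, int ncols, double *H)
{
    int i, j, k;

    assert(A != NULL && H != NULL && nrows > 0 && ncols > 0);
    for (i = 0; i < nrows; i++) {
        for (j = 0; j < nrows; j++) {
            double sum = 0.0;
            /* Dot product of row i and row j of A. */
            for (k = 0; k < ncols; k++)
                sum += A[(size_t)i * ncols + k] * A[(size_t)j * ncols + k];
            H[(size_t)i * nrows + j] = sum;
        }
    }
}

/* Largest absolute elementwise difference between two n-element arrays. */
double max_abs_diff(const double *x, const double *y, size_t n)
{
    double worst = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double d = x[i] - y[i];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

In the test program, ref_aat(J2, nrows, ncols, Href) could fill a second buffer, and max_abs_diff(H, Href, (size_t)nrows * nrows) would then show how far a library result is from the naive one. Small differences from floating-point summation order are expected.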

I haven't had a chance to try atlas 3.7.3 yet but I
looked at the changes log and I didn't find anything
that seemed related to this problem.

Compiler: gcc 3.2.2
Atlas version: 3.6.0
Architecture: Sun Sparc sun4u
OS: solaris 2.8

Cheers and thanks for your help
Herman
Herman.Ferra@team.telstra.com

Discussion

    • labels: 360155 -->
    • status: open --> open-accepted
     
    • assigned_to: nobody --> rwhaley
    • milestone: --> Stable
    • labels: --> Other
     
  • Logged In: YES
    user_id=182470

    Herman,

    I have confirmed the seg fault, and I believe you have found
    a bug. I need to scope it further to figure it out for
    sure, but suspect the problem is in the K cleanup. Note
    that this problem can be reproduced using the ATLAS tester:
    ./xdmmtst -m 168 -n 168 -k 512
    ./xdmmtst -m 168 -n 168 -k 344
    ./xdmmtst -m 168 -n 168 -k 176
    all seg fault. All of these are K sizes with K%NB = 8, a
    case that is handled by a routine that I'm scoping now.
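    The K%NB = 8 observation can be checked with a few lines of arithmetic. The thread does not state the value of NB, so the sketch below treats it as unknown and lists every block size up to a chosen limit for which all three failing K values leave remainder 8. The function names are hypothetical and nothing here uses ATLAS itself.

```c
#include <assert.h>
#include <stddef.h>

/* Number of entries K in ks[0..n-1] with K % nb == rem. */
int count_with_remainder(const int *ks, int n, int nb, int rem)
{
    int i, count = 0;

    assert(ks != NULL && nb > 0);
    for (i = 0; i < n; i++)
        if (ks[i] % nb == rem)
            count++;
    return count;
}

/* Writes to out every nb in (rem, max_nb] for which all K in ks leave
 * remainder rem, and returns how many were found.
 * out must have room for at least max_nb entries. */
int candidate_block_sizes(const int *ks, int n, int rem, int max_nb, int *out)
{
    int nb, found = 0;

    assert(out != NULL);
    for (nb = rem + 1; nb <= max_nb; nb++)
        if (count_with_remainder(ks, n, nb, rem) == n)
            out[found++] = nb;
    return found;
}
```

    For K = 512, 344 and 176 with remainder 8, the candidates up to 200 are the divisors of 168 larger than 8, which include 56. With NB = 56 (an assumption, not something the thread confirms), none of the non-crashing sizes from the original report (513, 1024, 256, 12) leave remainder 8.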

    Thanks for the report, I will update when I know more.

    Clint

     
  • Logged In: YES
    user_id=182470

    Herman,

    OK, I've posted a fix that seems to resolve the problem at:
    http://math-atlas.sourceforge.net/errata.html#USCU
    I'm not sure how this slipped through testing, but thanks
    for finding it.

    Cheers,
    Clint
    NOTE: I have fixed in basefile, so should be fixed in next
    dev release.

     
  • Logged In: YES
    user_id=182470

    NOTE TO SELF: Need to update prebuilt binary as soon as
    machine comes up

     
  • Logged In: YES
    user_id=182470
    Originator: NO

    fixed in 3.8.0

     
    • status: open-accepted --> closed-fixed