#866 dsyrk gives floating point exception

Milestone: Stable_(v3.10.x)
Status: open
Labels: Other (122)
Priority: 5
Updated: 2014-08-19
Created: 2012-10-30
Creator: mrca
Private: No

dsyrk produces a floating point exception when running dsyrkx_atlas. Source code found in dsyrkx.f (attached)
When using standard blas it runs correctly (takes a very long time). It works correctly when using MKL 11.0.

We are running really big matrices. A is 1000 by 100000 real*8. We are using dsyrk to compute the cross-product (A**T)*A, so the resulting matrix is 100000x100000 (8 x 10^10 bytes, about 75 GiB). We have 256 GB of memory, so everything fits. To reproduce, you will need a machine with at least 80 GB of memory.

System is running CentOS 6.3.
16 Intel processors (Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz)
compiler gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)

We are using ATLAS (single threaded) as the BLAS library. ATLAS was built with:
../configure --shared --with-netlib-lapack-tarfile=/home/lapack/lapack-3.4.1.tgz -Fa alg '-fPIC' -b 64

$ gfortran dsyrkx.f -L/usr/local/atlas/lib/ -lsatlas -o dsyrkx_atlas

Debug run
$ gdb dsyrkx_atlas
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-56.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/mimi/blas/dsyrkx_atlas...(no debugging symbols found)...done.
(gdb) go
Command requires an argument.
(gdb) run
Starting program: /home/mrca/blas/dsyrkx_atlas
[Thread debugging using libthread_db enabled]

D S Y R K EXAMPLE PROGRAM

INPUT DATA
N=* K=*
ALPHA=1.0 BETA=0.0
UPLO=U TRANS=T

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff791aadc in ATL_dpcol2blk_a1 () from /usr/local/atlas/lib/libsatlas.so
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.5.x86_64 libgcc-4.4.6-4.el6.x86_64 libgfortran-4.4.6-4.el6.x86_64
(gdb) where
#0 0x00007ffff791aadc in ATL_dpcol2blk_a1 () from /usr/local/atlas/lib/libsatlas.so
#1 0x00007ffff791b3df in ATL_dpmmJIKF () from /usr/local/atlas/lib/libsatlas.so
#2 0x00007ffff791c8ae in ATL_dprankK () from /usr/local/atlas/lib/libsatlas.so
#3 0x00007ffff791a898 in ATL_dgpmm () from /usr/local/atlas/lib/libsatlas.so
#4 0x00007ffff791f03c in ATL_rk_recUT () from /usr/local/atlas/lib/libsatlas.so
#5 0x00007ffff791ef2d in ATL_rk_recUT () from /usr/local/atlas/lib/libsatlas.so
#6 0x00007ffff791ef2d in ATL_rk_recUT () from /usr/local/atlas/lib/libsatlas.so
#7 0x00007ffff791f8e2 in ATL_dsprk_rK () from /usr/local/atlas/lib/libsatlas.so
#8 0x00007ffff791ea9e in ATL_dsprk () from /usr/local/atlas/lib/libsatlas.so
#9 0x00007ffff791fda2 in ATL_dsyrk () from /usr/local/atlas/lib/libsatlas.so
#10 0x00007ffff77e4a9f in atl_f77wrap_dsyrk_ () from /usr/local/atlas/lib/libsatlas.so
#11 0x00007ffff77e41cd in dsyrk_ () from /usr/local/atlas/lib/libsatlas.so
#12 0x00000000004010ab in MAIN__ ()
#13 0x000000000040119a in main ()

Discussion

  • mrca

    mrca - 2012-10-30

    source file to produce problem

     
  • R. Clint Whaley

    R. Clint Whaley - 2012-11-01

    Is this the smallest problem that displays the error? Unfortunately, my largest machine has 64GB of memory, and cannot run it.

    Dying at such large problems has historically been due to a failure to cast integer operations involving lda to size_t (didn't used to be a problem when a lot of ATLAS code was written).

    Since I can't reproduce the problem, what say I give you a couple of blind fixes to try, and see if the obvious ideas can fix things for you?

    Edit ATLAS/include/atlas_pkblas.h, and change lines 78-79 from:
    #define Mpld(uplo_,J_,lda_) (uplo_) == PackUpper ? (lda_)+(J_) : \
       ( (uplo_) == PackLower ? (lda_)-(J_) : (lda_) )
    to:
    #define Mpld(uplo_,J_,lda_) (uplo_) == PackUpper ? ((size_t)(lda_))+(J_) : \
       ( (uplo_) == PackLower ? ((size_t)(lda_))-(J_) : (lda_) )

    and see if that fixes it. Unfortunately, you'll have to manually run:
    touch ATLAS/src/blas/pklevel3/sprk/*.c
    to force all the needed compiles, since the makefile does not seem to check the dependence on the include file properly. After the touch doing "make xdl3blastst" in BLDdir/bin should force the recompile.

    If that alone does not do the job, try declaring Nright, Nleft on line 71 of ATLAS/src/blas/pklevel3/sprk/ATL_sprk_rK.c as size_t instead of int, and tell me what happens.

    Thanks,
    Clint

     
  • R. Clint Whaley

    R. Clint Whaley - 2012-11-01
    • assigned_to: nobody --> rwhaley
    • milestone: --> Stable_(v3.10.x)
     
  • mrca

    mrca - 2012-11-07

    I tried the proposed fixes and I still get the same error.

    I realize this is very difficult to debug without being able to reproduce the problem given the size of the matrices. I have been trying to see if I can reproduce it with smaller matrices but so far I haven't.

    Are there any flags I can use to build atlas that would produce information that I can pass along that would help you?

     
  • R. Clint Whaley

    R. Clint Whaley - 2012-12-04

    OK, next thing to try, edit ATLAS/src/blas/pklevel3/gpmm/ATL_pcol2blk.c, and change ldainc, kb, nrb, NN to all be size_t instead of int, then force the recompile of those files, and let me know what happens.

     
  • mrca

    mrca - 2012-12-19

    Changed ldainc, kb, nrb and NN to be const size_t instead of const int in file ATLAS/src/blas/pklevel3/gpmm/ATL_pcol2blk.c. The compilation then failed, so to eliminate the compilation errors I also had to change these prototypes in ATLAS/include/atlas_pkblas.h (the old lines are kept as comments):

    void ATL_dpcol2blk_a1(const int M, const int N, const double alpha,
    const double *A, int lda, const size_t ldainc, double *V);
    // const double *A, int lda, const int ldainc, double *V);
    void ATL_dpcol2blk_aX(const int M, const int N, const double alpha,
    const double *A, int lda, const size_t ldainc, double *V);
    // const double *A, int lda, const int ldainc, double *V);

    Got the same error as before:

    Program received signal SIGFPE, Arithmetic exception.
    0x00007ffff791aadc in ATL_dpcol2blk_a1 () from /usr/local/atlas/lib/libsatlas.so
    Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 libgcc-4.4.6-4.el6.x86_64 libgfortran-4.4.6-4.el6.x86_64
    (gdb) where
    #0 0x00007ffff791aadc in ATL_dpcol2blk_a1 () from /usr/local/atlas/lib/libsatlas.so
    #1 0x00007ffff791b3df in ATL_dpmmJIKF () from /usr/local/atlas/lib/libsatlas.so
    #2 0x00007ffff791c8ae in ATL_dprankK () from /usr/local/atlas/lib/libsatlas.so
    #3 0x00007ffff791a898 in ATL_dgpmm () from /usr/local/atlas/lib/libsatlas.so
    #4 0x00007ffff791f054 in ATL_rk_recUT () from /usr/local/atlas/lib/libsatlas.so
    #5 0x00007ffff791ef34 in ATL_rk_recUT () from /usr/local/atlas/lib/libsatlas.so
    #6 0x00007ffff791ef34 in ATL_rk_recUT () from /usr/local/atlas/lib/libsatlas.so
    #7 0x00007ffff791f8ed in ATL_dsprk_rK () from /usr/local/atlas/lib/libsatlas.so
    #8 0x00007ffff791ea9e in ATL_dsprk () from /usr/local/atlas/lib/libsatlas.so
    #9 0x00007ffff791fdb2 in ATL_dsyrk () from /usr/local/atlas/lib/libsatlas.so
    #10 0x00007ffff77e4a9f in atl_f77wrap_dsyrk_ () from /usr/local/atlas/lib/libsatlas.so
    #11 0x00007ffff77e41cd in dsyrk_ () from /usr/local/atlas/lib/libsatlas.so
    #12 0x00000000004010ab in MAIN__ ()
    #13 0x000000000040119a in main ()

    Thanks!

     
