#186 Floating point exception on matrix multiplication

Milestone: Developer
Status: closed-fixed
Priority: 5
Updated: 2011-12-10
Created: 2011-12-09
Creator: Anonymous
Private: No

I have built Octave against ATLAS 3.9.56 and found some simple data that causes a floating point exception:
octave:1> load '/home/melrobin/research/nn/coredump.mat'
octave:2> X'*X
panic: Floating point exception -- stopping myself...
attempting to save variables to `octave-core'...
save to `octave-core' complete
Floating point exception (core dumped)

I did some more troubleshooting and found that (X+0)'*X does not produce the same error. I opened up a bug report at Octave and one developer suggested that there may be a problem with ATLAS. I'm also asking the folks at Octave about the differences between what the interpreter would send to the ATLAS routines in the cases of X'*X (which causes the fp exception) and (X+0)'*X (which does not).

To narrow things down even further, I rebuilt Octave entirely against the serial libraries, and this seems to have fixed the problem. Specifically, I configured Octave with the options --with-lapack=-llapack --with-blas=-latlas instead of --with-lapack=-lptlapack --with-blas=-ltatlas.

This narrows it down pretty well, but I would like to provide even more information if you will assist me with the huge gaps in my knowledge (abilities).

My machine is an Intel i7 Quad Core 2.8 GHz running Fedora 16.

Upon any floating point exceptions, ATLAS would have to report this back to Octave for further processing, right? Would it also dump to some other files?

If you want to put some hooks to try to catch this, I do not mind rebuilding everything to narrow down things further. If necessary I have the octave-core file which has the dataset.

Discussion

  • Anonymous

    Anonymous - 2011-12-09

    I just heard back from the lead Octave developer: Octave calls DSYRK to compute X'*X, but calls DGEMM to compute (X+0)'*X. Maybe this is a good start for narrowing the problem down in the multithreaded libraries.
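
To illustrate the distinction between the two calls: a SYRK-style routine exploits the symmetry of X'*X and computes only one triangle, while a GEMM-style routine treats its two operands as unrelated matrices. The following is a minimal reference sketch in plain C. It is not the ATLAS implementation and not the code Octave runs; the names `xtx_syrk_style` and `atb_gemm_style` are made up for illustration, and column-major storage (as in Fortran BLAS) is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major access: element (i, j) of a matrix with leading dimension ld. */
#define AT(M, ld, i, j) ((M)[(size_t)(j) * (size_t)(ld) + (size_t)(i)])

/* SYRK-style: C = X' * X, where X is k-by-n and C is n-by-n.
   Only the upper triangle is computed; it is then mirrored so the
   result can be compared directly with the GEMM-style version.
   (The real DSYRK writes only the requested triangle.) */
void xtx_syrk_style(int n, int k, const double *X, double *C)
{
    assert(n > 0 && k > 0);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i <= j; i++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += AT(X, k, p, i) * AT(X, k, p, j);
            AT(C, n, i, j) = sum;
            AT(C, n, j, i) = sum; /* mirror into the lower triangle */
        }
    }
}

/* GEMM-style: C = A' * B, where A and B are both k-by-n.
   No symmetry is assumed, so all n*n entries are computed. */
void atb_gemm_style(int n, int k, const double *A, const double *B, double *C)
{
    assert(n > 0 && k > 0);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += AT(A, k, p, i) * AT(B, k, p, j);
            AT(C, n, i, j) = sum;
        }
    }
}
```

Calling `xtx_syrk_style(n, k, X, C1)` and `atb_gemm_style(n, k, X, X, C2)` on the same X fills C1 and C2 with the same values. When only one of the two library paths crashes on identical data, that suggests the routine's implementation, rather than the data, is the place to look.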

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-09
    • assigned_to: nobody --> rwhaley
     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-09

    Can you create a C or Fortran tester that shows the problem?

    If not, can you get octave to report the exact parameters it called DSYRK with in the failure?

    Thanks!
    Clint
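
One way to capture the exact parameters of a DSYRK call is to log the scalar arguments immediately before the call is made. The sketch below is only an illustration of that idea: `describe_dsyrk_call` is a hypothetical helper, not part of Octave or ATLAS, and it takes the arguments by value even though the Fortran BLAS interface passes them by reference.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical logging helper (not an Octave or ATLAS function).
   Formats the scalar arguments of a DSYRK call into buf and returns
   the number of characters snprintf would have written. */
int describe_dsyrk_call(char *buf, size_t len, char uplo, char trans,
                        int n, int k, double alpha, int lda,
                        double beta, int ldc)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "DSYRK uplo=%c trans=%c n=%d k=%d alpha=%g lda=%d beta=%g ldc=%d",
                    uplo, trans, n, k, alpha, lda, beta, ldc);
}
```

For example, `describe_dsyrk_call(buf, sizeof buf, 'U', 'T', 19, 1768, 1.0, 1768, 0.0, 19)` describes a call with n = 19 and k = 1768; the uplo, trans, and leading-dimension values in that example are guesses, not values reported by Octave.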

     
  • Anonymous

    Anonymous - 2011-12-09

    Yes, I'll work on a small C tester for this problem.

     
  • Anonymous

    Anonymous - 2011-12-09
     
  • Anonymous

    Anonymous - 2011-12-09

    Zipped data file

     
  • Anonymous

    Anonymous - 2011-12-09

    I think I have reproduced the problem. With this compilation:
    gcc -g -o test ctester.c -lf77blas -latlas -lgfortran
    Here is the output:
    [melrobin@melrobin nn]$ ./test
    [melrobin@melrobin nn]$

    With this compilation:
    gcc -g -o test ctester.c -lptlapack -ltatlas

    I receive this output:
    [melrobin@melrobin nn]$ ./test
    Floating point exception (core dumped)
    [melrobin@melrobin nn]$

    I can try to send you the core file, but I did not find it in the directory.

     
  • Anonymous

    Anonymous - 2011-12-09

    I'm still looking for the core file on Fedora 16, but I ran the tester under GDB, which gives a little more information:

    Program received signal SIGFPE, Arithmetic exception.
    0x00007ffff7c59673 in ATL_tsyrkdecomp_K ()
    from /usr/local/atlas/lib/libtatlas.so

    I'm still thinking about how to dig further, but I'm about at the limit of my abilities.
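
As general background, and not as a diagnosis of this bug: on x86 Linux, SIGFPE ("Arithmetic exception") is also the signal raised for integer division by zero, whereas IEEE floating-point division by zero returns infinity by default without raising a signal. A SIGFPE inside a work-partitioning routine can therefore come from integer arithmetic rather than from the matrix data. The sketch below assumes IEEE 754 doubles with default (non-trapping) settings; `split_work` and `fp_div_by_zero_is_inf` are hypothetical names and are not ATLAS code.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper, not ATLAS code: number of items per worker when
   `total` items are divided among `nparts` workers, rounded up.
   Returns -1 instead of dividing when nparts <= 0, because integer
   division by zero raises SIGFPE on x86 Linux. */
int split_work(int total, int nparts)
{
    assert(total >= 0);
    if (nparts <= 0)
        return -1;
    return (total + nparts - 1) / nparts;
}

/* Floating-point division by zero. Under default IEEE 754 settings the
   result is +infinity (larger than DBL_MAX) and no signal is raised,
   so this returns 1. The volatile qualifiers keep the compiler from
   folding the division away at compile time. */
int fp_div_by_zero_is_inf(void)
{
    volatile double zero = 0.0;
    volatile double r = 1.0 / zero;
    return r > DBL_MAX;
}
```

Guarding the divisor, as `split_work` does, is one way to turn such a trap into an error code the caller can handle.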

     
  • Anonymous

    Anonymous - 2011-12-09

    Stack backtrace

     
  • Anonymous

    Anonymous - 2011-12-09

    I just provided a stack backtrace.

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-09

    OK, I was able to get the floating point exception with this link line:
    gcc -g -o xtst ctester.c -lptf77blas -latlas -lgfortran -lpthread -L./

    The serial case, as you point out, works fine.

    I am moving this to the bug tracker, and will see if I can figure out what is going on.

    Thank you very much for the report and writing the test case so I can confirm.

    Thanks,
    Clint

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-09
    • milestone: 148063 -->
    • labels: 497307 -->
    • status: open --> open-accepted
     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-09
    • milestone: --> Developer
    • labels: --> Incorrect answer
     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-09

    Can reproduce on any machine with:
    ./xdl3blastst_pt -R syrk -n 19 -k 1768

     
  • Anonymous

    Anonymous - 2011-12-09

    Just so I can start learning, what does xdl3blastst_pt do? Maybe it will help me to troubleshoot further sometime. Is there a doc that I should start reading?

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-09

    OK, I have fixed this problem in the basefiles; the fix will be in 3.9.57 (which I should release soon).

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-10

    Can you try 3.9.57? I believe this problem is fixed.

     
  • Anonymous

    Anonymous - 2011-12-10

    Yes, this did fix the problem. Thanks for fixing this. I will now rebuild Octave and continue research. I would close the report, but I don't know if it is appropriate for me to do so.

     
  • Anonymous

    Anonymous - 2011-12-10

    I have confirmed that the build did fix the problem in Octave.

     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-10
    • status: open-accepted --> closed-fixed
     
  • R. Clint Whaley

    R. Clint Whaley - 2011-12-10

    Great, thanks for the report!

     
