#184 Segfault with large matrices

Developer
closed-works-for-me
None
8
2012-04-04
2011-12-06
Emmanuel Bertin
No

Dear developers,
are the clapack functions implemented in the ATLAS libraries supposed to work with large matrices (i.e. larger than 65536x65536)? Calling the ATLAS functions clapack_dposv() and clapack_dgesv() with larger sizes generates a segfault after running for about 1 minute, on both 3.8.4 stable and 3.9.51 devel here. The test was done on a 1TB machine. ATLAS was compiled with gcc 4.6.2 and -DATL_USE64BITS. I first checked that there was no problem on my side allocating such large matrices and accessing matrix elements with size_t array indices on the same platform.
I read somewhere that the netlib Lapack itself is supposed to support matrix sizes in excess of 100,000x100,000. So is the limitation in the ATLAS Lapack subset or in the ATLAS implementation of BLAS?
Thanks in advance for your help and keep up the good work!

Discussion

(Page 1 of 4)
  • Yes, the clapack functions should certainly work with N=64K.

    Can you reproduce the problem by running the testers in OBJdir/bin? xdslvtst is the serial routine, and xdslvtst_pt is the parallel.

    Thanks,
    Clint

     
    • assigned_to: nobody --> rwhaley
    • labels: 360155 --> 497307
     
  • Thanks for your quick reply.
    Testing with the 3.8.4 stable version, I get a segfault within 2 seconds for both ./xdslvtst -n 66000 and ./xdslvtst_pt -n 66000. Interestingly, for 16383 < n < 65536 I get a
    "assertion A && X && B && ipiv failed, line 668 of file (...)/ATLAS/build/..//bin/slvtst.c"
    It works fine for n < 16384, though.
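
The two thresholds reported above line up with powers of two in the size arithmetic: an n x n matrix of doubles reaches 2^31 bytes at n = 16384 and 2^32 elements at n = 65536. The thread does not confirm that integer overflow plays any role here, so the following is only a sketch of the arithmetic, assuming 8-byte doubles. The helper names are hypothetical and are not part of ATLAS.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers for illustration only; none of these exist in ATLAS. */

/* Element count of an n x n matrix, computed in 64 bits. */
static uint64_t matrix_elems(uint64_t n)
{
    return n * n;
}

/* Byte count of an n x n matrix of doubles, computed in 64 bits. */
static uint64_t matrix_bytes(uint64_t n)
{
    return matrix_elems(n) * (uint64_t)sizeof(double);
}

/* 1 if the byte count still fits in a signed 32-bit integer, else 0. */
static int bytes_fit_int32(uint64_t n)
{
    return matrix_bytes(n) <= (uint64_t)INT32_MAX;
}

/* 1 if the element count still fits in an unsigned 32-bit integer, else 0. */
static int elems_fit_uint32(uint64_t n)
{
    return matrix_elems(n) <= (uint64_t)UINT32_MAX;
}

/* Print the sizes for one value of n. */
static void report(uint64_t n)
{
    printf("n=%" PRIu64 ": %" PRIu64 " elements, %" PRIu64 " bytes, "
           "bytes fit int32: %d, elements fit uint32: %d\n",
           n, matrix_elems(n), matrix_bytes(n),
           bytes_fit_int32(n), elems_fit_uint32(n));
}
```

Calling report() for 16383, 16384, 65535 and 65536 shows both limits being crossed at exactly the sizes where the tester's behaviour changes, which may be worth keeping in mind when reading the rest of the discussion.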

     
  • Thank you very much for tracking this down and reporting it. I can repeat your problem on my Linux/sandy bridge machine.

    I now have to figure out whether the problem is due to an error in ATLAS or to the way the system handles very large mallocs. I will keep this page updated with what I figure out, and will escalate to the bug tracker if I find it is an error in ATLAS.

    Thanks!
    Clint

     
    • status: open --> open-accepted
     
    • priority: 5 --> 8
     
    • milestone: 111737 -->
    • labels: 497307 -->
     
  • OK, this is indeed a bug, but not in the ATLAS library. It occurs because *slvtst* (which is not built into the lib) failed to check for a NULL return from malloc at the right spot.

    However, that raises the question of whether this error is the same one you are seeing in your code!

    So, in your original code, can you make sure you are getting non-NULL for your array allocation?

    Thanks,
    Clint

    P.S.: I have fixed the problem in the basefiles, so this will be fixed in 3.9.56.
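
The check asked for above can be sketched as follows. This is a minimal illustration, not the fix that went into the ATLAS basefiles and not the reporter's test code: alloc_square_matrix() and elem() are hypothetical helpers. The allocation refuses sizes whose byte count would overflow size_t and passes any NULL from malloc back to the caller, and element addresses are computed with size_t arithmetic in column-major order.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of ATLAS: allocate an n x n matrix of
 * doubles. Returns NULL if the byte count would overflow size_t or if
 * malloc itself fails, so the caller has a single condition to test. */
static double *alloc_square_matrix(size_t n)
{
    if (n == 0)
        return NULL;                     /* nothing to allocate */
    if (n > SIZE_MAX / n)
        return NULL;                     /* n * n would overflow */
    size_t elems = n * n;
    if (elems > SIZE_MAX / sizeof(double))
        return NULL;                     /* byte count would overflow */
    return malloc(elems * sizeof(double));
}

/* Hypothetical helper: address of element (i, j) in a column-major matrix
 * with leading dimension lda, computed entirely in size_t. */
static double *elem(double *A, size_t lda, size_t i, size_t j)
{
    return &A[i + j * lda];
}
```

A caller would test the returned pointer before handing the matrix to a solver such as clapack_dposv(), and report the requested size if the pointer is NULL instead of continuing.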

     
  • Test code

     
    Attachments
  • Please find attached the small test code, which generally crashes in less than one minute when used with large n; for instance:
    ./test 66000 1000 15000
    Thanks!

     