#184 Segfault with large matrices

Group: Developer
Status: closed-works-for-me
Owner: None
Priority: 8
Updated: 2012-04-04
Created: 2011-12-06
Creator: Emmanuel Bertin
Private: No

Dear developers,
are the clapack functions implemented in the ATLAS libraries supposed to work with large matrices (i.e., larger than 65536x65536)? Calling the ATLAS functions clapack_dposv() and clapack_dgesv() with larger sizes generates a segfault after running for about a minute, on both 3.8.4 stable and 3.9.51 devel here. The test was done on a 1TB machine. ATLAS was compiled with gcc 4.6.2 and -DATL_USE64BITS. I first checked that there was no problem on my side allocating such large matrices and accessing matrix elements with size_t array indices on the same platform.
I read somewhere that the netlib Lapack itself is supposed to support matrix sizes in excess of 100,000x100,000. So is the limitation in the ATLAS Lapack subset or in the ATLAS implementation of BLAS?
Thanks in advance for your help and keep up the good work!
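
For reference, the failing call pattern is essentially the following (a minimal sketch, not my actual code: the matrix here is just a unit diagonal, and the prototype is the clapack_dposv() one from ATLAS's clapack.h):

    /* Sketch: allocate an n x n SPD matrix with 64-bit size arithmetic
     * and hand it to ATLAS's clapack_dposv().  Note the (size_t) casts:
     * n itself fits in the int arguments, but n*n does not fit in 32 bits. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>
    #include <clapack.h>

    int main(int argc, char **argv)
    {
        int    n = (argc > 1) ? atoi(argv[1]) : 66000, info;
        size_t nn = (size_t)n * (size_t)n, i;
        double *a = malloc(nn * sizeof(double));
        double *b = malloc((size_t)n * sizeof(double));

        if (!a || !b) { fprintf(stderr, "malloc failed\n"); return 1; }
        for (i = 0; i < nn; i++) a[i] = 0.0;
        for (i = 0; i < (size_t)n; i++) {
            a[i * (size_t)n + i] = 1.0;   /* unit diagonal => SPD */
            b[i] = 1.0;
        }
        printf("Solving!\n");
        info = clapack_dposv(CblasColMajor, CblasUpper, n, 1, a, n, b, n);
        printf("dposv returned %d\n", info);
        free(a); free(b);
        return 0;
    }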

Discussion

  • Yes, the clapack functions should certainly work with N=64K.

    Can you reproduce the problem by running the testers in OBJdir/bin? xdslvtst is the serial routine, and xdslvtst_pt is the parallel.

    Thanks,
    Clint

    • assigned_to: nobody --> rwhaley
    • labels: 360155 --> 497307
  • Thanks for your quick reply.
    Testing with the 3.8.4 stable version, I get a segfault within 2 seconds for both ./xdslvtst -n 66000 and ./xdslvtst_pt -n 66000. Interestingly, for 16383 < n < 65536 I get an
    "assertion A && X && B && ipiv failed, line 668 of file (...)/ATLAS/build/..//bin/slvtst.c"
    It works fine for n < 16384, though.

  • Thank you very much for tracking this down and reporting it. I can repeat your problem on my Linux/Sandy Bridge machine.

    I now have to figure out if the problem is due to an error in ATLAS or the way the system handles very large mallocs. I will keep this page updated with what I figure out, and will escalate to the bug tracker if I find it is an error in ATLAS.

    Thanks!
    Clint

    • status: open --> open-accepted
    • priority: 5 --> 8
    • milestone: 111737 -->
    • labels: 497307 -->
  • OK, this is indeed a bug, but not in the ATLAS library: *slvtst* (which is not built into the lib) failed to check for a NULL return from malloc at the right spot.

    However, that raises the question of whether this error is the same one you are seeing in your code!

    So, in your original code, can you make sure you are getting a non-NULL pointer back for your array allocations?

    Thanks,
    Clint

    P.S.: I have fixed the problem in the basefiles, so this will be fixed in 3.9.56.
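
    The fix amounts to something like this (an illustrative sketch only; the actual basefile change may differ):

        #include <stdio.h>
        #include <stdlib.h>

        /* Sketch of the kind of check slvtst was missing: verify every
         * large allocation before use, computing byte counts in size_t
         * so the products cannot overflow a 32-bit int. */
        static int alloc_problem(int N, double **A, double **X,
                                 double **B, int **ipiv)
        {
            *A    = malloc((size_t)N * (size_t)N * sizeof(double));
            *X    = malloc((size_t)N * sizeof(double));
            *B    = malloc((size_t)N * sizeof(double));
            *ipiv = malloc((size_t)N * sizeof(int));
            if (!*A || !*X || !*B || !*ipiv) {
                fprintf(stderr, "out of memory for N=%d\n", N);
                free(*A); free(*X); free(*B); free(*ipiv);
                return 0;   /* caller must skip the solver call */
            }
            return 1;
        }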

  • Please find attached the small test code (attachment: "Test code") that generally crashes in less than one minute when used with large n; for instance:
    ./test 66000 1000 15000
    Thanks!

  • When I run it, I get:
    ./test: alpha (ncoeff*ncoeff elements) !: Unknown error 4886585
    which I believe means that NULL was returned by malloc in your funky QMALLOC wrapper.

    Are you running under Linux? If so, I wonder if this isn't the well-known Linux bug where a non-NULL malloc return really means NULL (memory overcommit).

    If you are running Linux, can you log in as root, then issue the command
    echo 2 > /proc/sys/vm/overcommit_memory
    to turn off this Linux misfeature, and rerun your program?

    BTW, this command comes from "man malloc" on a Linux system, if you want to know what I'm talking about.
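
    (If you want to check from inside a program whether an allocation is really usable under overcommit, one illustrative trick, not something ATLAS itself does, is to touch every page right after malloc, so a failure shows up immediately instead of deep inside the library call:)

        #include <string.h>
        #include <stdlib.h>

        /* Illustrative only: with Linux overcommit on, malloc can return
         * non-NULL memory that the kernel cannot actually back.  Writing
         * to every byte forces the pages to be committed up front. */
        static void *checked_alloc(size_t nbytes)
        {
            void *p = malloc(nbytes);
            if (p)
                memset(p, 0, nbytes);   /* fault in every page now */
            return p;
        }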

    Let me know,
    Clint

  • I am running Linux: Linux xxx.xxx.xxx 2.6.18-274.3.1.el5 #1 SMP Tue Sep 6 20:13:52 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
    In the "real" program where I first detected the problem, my own code manages to access the whole matrix without any trouble. The crash occurs only in the ATLAS call.
    I am not root on this machine, but /proc/sys/vm/overcommit_memory already contains "2".
    I fixed my funky QCALLOC() macro (which was using my personal error() routine), but I still get the segfault...
    Best wishes

  • Well, damn. The largest case I can run on my machine with your tester is:
    ./test 60000 1000 15000
    which runs just shy of forever, and then exits without error. Any larger matrix (first number) and I get the NULL return error.

    I have a couple of machines with more physical memory; I'll try to see what happens on those machines ASAP.

    In the meantime, can you do your link w/o the -lptcblas and -lpthread args (i.e., build the purely serial code), and see if that also fails in the same way? (It will run even longer if successful, obviously, since it will be serial the whole way.) The parallel codes do take more memory internally . . .

    I will update as I investigate further. If none of my machines can reproduce the problem, I don't suppose you can arrange access to your machine?

    Many thanks,
    Clint

  • Yep, I already started the test with the serial version (it has not crashed yet; this is a 64-core machine, and I have not checked the speedup on this particular example, but I guess it will take about 1-2 hours).
    I am afraid it will be difficult to arrange access to this machine; I do not manage it, and I myself have only limited access to it. But I am keen on running tests if it helps.
    BTW, is there some information somewhere about how the various routines implemented in ATLAS roughly scale in terms of memory?
    Thanks!

  • I have very good news: I can duplicate the error on one of my machines (etl-corei112)! It has 32GB of memory, which is apparently enough. So, I should now be able to track this down. Unfortunately, it is going to take a while.

    BTW, if you really want top-of-the-line parallel performance (and you should, with 64 processors!), you will want to link to the parallel LAPACK as well as the parallel BLAS. So the link line would look like:
    -lptlapack -lptcblas -lcblas -latlas -lm -lpthread

    I will update with further info as I discover it. Thank you very much for submitting this bug report and helping with tracking it down!

    Clint

  • OK great! Thanks for the ptlapack tip. And good luck with the debug!

  • On etl-corei112, the link line (from the lib dir) is:
    gcc -O -o test test.c -llapack -lptcblas -lcblas -latlas -lm -lpthread -I../../include/ -I../include/ -L./
    And the command that failed is:
    >etl-corei112:~/TEST/ATLAS3.9.52.0/obj64/lib> ./test 66000 1000 15000
    >Solving!
    >Segmentation fault

    Am presently running serial.

    Need to install a debug-enabled ATLAS and see where the fault is coming from.

  • I got the segfault on the serial version too. Sounds like something happening near the end of the processing (free() of a bad pointer?).

  • I just (after 21 minutes) got the seg fault in serial as well!

    I am now running with the debugger on so I can see where it happens.

    Any bets as to how long this will take? :)

    Cheers,
    Clint

  • OK, the debugger ran until it died, so no information.

    I'm trying to figure this out, and am having difficulty understanding your tester. What are the 2nd and 3rd params for?

    Can you get this error to happen with slvtst?

    Can you get it to happen by calling potrf directly? Even better, did you say it also occurred with LU?

    Thanks,
    Clint

  • > What are the 2nd and 3rd params for?

    What I do here is create a positive-definite matrix consisting of non-zero diagonal elements plus a small sub-matrix with non-diagonal elements (creating a whole dense matrix would take much more time). The 2nd parameter is the size of the sub-matrix and the 3rd parameter is the position of the sub-matrix inside the big one.
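
    In code, the construction is essentially this (a simplified sketch, not the attached test.c, which differs in its details):

        #include <stdlib.h>

        /* Sketch: unit diagonal everywhere, plus a small dense block of
         * size 'sub' starting at row/column 'pos'.  The block is kept
         * diagonally dominant, so the whole matrix stays symmetric
         * positive definite. */
        static void fill_test_matrix(double *a, size_t n, size_t sub,
                                     size_t pos)
        {
            size_t i, j;
            for (i = 0; i < n * n; i++)
                a[i] = 0.0;
            for (i = 0; i < n; i++)
                a[i * n + i] = 1.0;            /* non-zero diagonal */
            for (i = pos; i < pos + sub; i++)  /* small dense block */
                for (j = pos; j < pos + sub; j++)
                    a[i * n + j] += (i == j) ? (double)sub : 0.01;
        }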

    > Can you get this error to happen with slvtst?

    I would need some guidance here. I could not find a rule to create an slvtst executable in the Makefile. Do you mean xsslvtst? Also, you mentioned that you fixed a bug in "*slvtst*" that would lead to a segfault. I tried to find an updated version of slvtst.c in the SourceForge CVS repository; there is a lot of stuff there, but I could not find any relevant file. Where is it? In any case, I get a segfault with the "old" xsslvtst too.

    > Can you get it to happen by calling potrf directly?

    Yes.

    > did you say it also occured with LU?

    Yes.

  • Hi Clint,
    I just checked that test.c linked with the MKL (LAPACKE interface) completes without crashing and without an error message (even with n=200,000).
    Cheers
    - Emmanuel.

  • Emmanuel,

    Yes, yes, maybe you'd like to taunt me with MKL's superior performance as well :)

    > What are the 2nd and 3rd params for?

    So, you are saying that:
    (1) the first parameter is the order of the matrix; this value should not be passed to lapack except as lda
    (2) The 2nd parameter is the size of the matrix that will be passed to lapack (in this case the size of the positive def matrix).
    (3) The third parameter is the offset of the lapack matrix inside the first
    --> para1 == (para2+para3)

    Is this correct?

    > Can you get this error to happen with slvtst?
    Sorry, slvtst is the generic name. The actual name is x[s,d,c,z]slvtst (and the same name with _pt appended for the parallel versions).

    You can't find stuff under CVS/sourceforge because I don't use it anymore (they left it broken for several months and then announced they were discontinuing support). All development is now done on github!

    The only change to slvtst was to check for a NULL return value when calling the matrix generator.

    > did you say it also occured with LU?
    Can I get you to post a tester showing this? I work with LU a lot more than Cholesky, and can get into the weeds much more easily and quickly there.

    Many thanks,
    Clint

  • > Yes, yes, maybe you'd like to taunt me with MKL's superior performance as
    well :)

    Don't know yet. Does not seem obvious. Will check when ATLAS stops crashing :-P. In any case I generally prefer to work with open source solutions.

    > So, you are saying that:
    > (1) the first parameter is the order of the matrix; this value should
    > not be passed to lapack except as lda
    > (2) The 2nd parameter is the size of the matrix that will be passed to
    > lapack (in this case the size of the positive def matrix).
    > (3) The third parameter is the offset of the lapack matrix inside the
    > first
    > --> para1 == (para2+para3)

    > Is this correct?

    No, forget about the 2nd and 3rd parameters; they are just there to include non-diagonal elements in the matrix. What is passed to Lapack is only the 1st one. Actually, ATLAS crashes just the same on a purely diagonal matrix.

    > Can I get you to post a tester showing this?

    Sorry, I did not get that (English is not my native language). What do you mean?

    Thanks.
    - Emmanuel.

  • Emmanuel,

    I was wondering if you could post a tester showing the seg fault calling DGETRF (LU factor) rather than POSV?
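
    Something along these lines would do (an illustrative sketch of the kind of tester I mean, not code from this thread; it factors a large diagonal matrix with clapack_dgetrf() as declared in ATLAS's clapack.h):

        #include <stdio.h>
        #include <stdlib.h>
        #include <cblas.h>
        #include <clapack.h>

        /* Sketch: LU-factor a large diagonal matrix and report the
         * return code; n*n is computed in size_t throughout. */
        int main(int argc, char **argv)
        {
            int    n = (argc > 1) ? atoi(argv[1]) : 66000, info;
            size_t nn = (size_t)n * (size_t)n, i;
            double *a = malloc(nn * sizeof(double));
            int   *ipiv = malloc((size_t)n * sizeof(int));

            if (!a || !ipiv) { fprintf(stderr, "malloc failed\n"); return 1; }
            for (i = 0; i < nn; i++) a[i] = 0.0;
            for (i = 0; i < (size_t)n; i++) a[i * (size_t)n + i] = 2.0;
            printf("Factoring!\n");
            info = clapack_dgetrf(CblasColMajor, n, n, a, n, ipiv);
            printf("dgetrf returned %d\n", info);
            free(a); free(ipiv);
            return 0;
        }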

    Thanks,
    Clint
