#17 GS2 crashes for medium sized problems on Blue Joule

GS2
Status: closed
Labels: bluejoule (1)
Priority: 5
Updated: 2014-02-04
Created: 2014-01-30
Owner: Edmund Highcock
Private: No

Title says it all really. This bug was detected by Ferdinand over the summer, and I've moved discussion of it onto the bug tracker. The input file which causes the crash is attached. It crashes when placed in the 4h128 queue... haven't tried anything else yet.

We had some thoughts that it was to do with the diagnostics, but it is not, because the crash is reproduced even when loop_diagnostics is commented out.

The error messages are varied; both signal 9 (SIGKILL) and signal 11 (SIGSEGV) have been seen.

The smaller the problem, the longer it runs before crashing, which does indicate some kind of memory leak.

1 Attachments

Related

Bugs: #17

Discussion

  • Is there any guidance as to what signals 9 and 11 mean (searching only yields walltime-exceeded errors, which I guess this is not)?

    Other than signal 9, do you get any other abnormal messages in the output (job output, job error, *.error, etc.)?

    In the case where you remove loop_diagnostics, does the job crash after the same number of steps as the case with loop_diagnostics? If I remember correctly, you previously mentioned that changing nwrite changed when the crash happened, which is what first suggested it might be related to the diagnostics. Since the number of steps achieved before crashing depends upon nwrite but the crashes don't go away with nwrite=0, I'd guess the problem must lie in some fairly standard code (unless the difference in the number of steps with nwrite is <= nwrite).

    It's certainly possible that there's a memory leak in the MPI library, so things like the velocity space integration may cause issues. Fixing such problems often relies on fixes to the underlying base library (which can be slow), but occasionally replacing things like MPI_IN_PLACE can help. It should be fairly straightforward to make a small test program which links to GS2's mp.o and uses the various MPI routines; you can then run this through valgrind or similar to see if there are any leaks.

    If the problem is not in the MPI then I think the best way to shed some light on where it lies is to run with something like valgrind; in fact I'd suggest this as a good first step. To detect the memory leak you don't need to run the full problem until it crashes; you can start with a small problem that just uses the same features as the large problem (e.g. try ny=4, nx=32 with nstep=5*nwrite or similar). Similarly, you may want to compile with DEBUG=on and enable core dumps so that you can use a debugger to find out exactly where you are when the code exits. This won't tell you exactly where the leak is occurring, but it may give some pointers as to where you are in the call tree, which may give some idea as to what's wrong.

    If valgrind or similar is not available then tracking this down will be quite tricky. It is usually possible to spawn system commands from within Fortran, so in a small test program you could measure memory usage from inside the code to verify whether there is a leak, then comment out one section at a time and repeat until the leak disappears. This is likely to be quite slow and not particularly accurate, but may help! A rough sketch of the idea follows below.
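
    As a rough, Linux-specific sketch of that idea (an assumption-laden illustration only: rather than spawning an external command it reads /proc/self/status directly, so the numbers refer to the right process; the program and routine names are made up):

    program memory_probe
      implicit none
      call report_memory('before suspect section')
      ! ... call the section of code under suspicion here, many times ...
      call report_memory('after suspect section')
    contains
      subroutine report_memory(label)
        character(len=*), intent(in) :: label
        character(len=256) :: line
        integer :: unit, ios
        ! newunit= is Fortran 2008; with a strict f90 compiler use a fixed unit number
        open (newunit=unit, file='/proc/self/status', status='old', &
              action='read', iostat=ios)
        if (ios /= 0) return
        do
           read (unit, '(a)', iostat=ios) line
           if (ios /= 0) exit
           ! VmRSS is this process's resident set size in kB
           if (index(line, 'VmRSS') == 1) then
              write (*, '(3a)') trim(label), ': ', trim(line)
              exit
           end if
        end do
        close (unit)
      end subroutine report_memory
    end program memory_probe

    If VmRSS grows every time the suspect section runs, that section is leaking.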

     
  • Hi David,

    Thanks for the suggestions. In fact, changing nwrite does not change the number of completed timesteps... I don't think it has anything to do with the diagnostics.

    Here are some of the error messages... I'm pretty sure it's a memory leak. I'm going to start by running different sections of the code a very large number of times, under the assumption that the leak lies within a particular subroutine. I'll also take your suggestion and try testing the mp functions. I'm just getting to grips with which debuggers work here... supposedly they have TotalView.

    MPIR_Reduce_redscat_gather(293): Unable to allocate 6397440 bytes of memory for receive buffer (probably out of memory)
    
    MPIR_Allreduce_intra(264): Unable to allocate 456960 bytes of memory for temporary buffer (probably out of memory)
    

    Now of course, these errors could have been the result of simply not having enough memory for the problem, but that is ruled out by the fact that the number of completed time steps is inversely proportional to the problem size.

     
    Those errors are talking about relatively small amounts of memory, so I guess the leak must be fairly gradual, i.e. only a small amount of memory is lost per iteration. It's also interesting that the errors are both coming from within MPI routines. I'd expect the largest allocations to be associated with GS2 arrays, so you'd expect to be more likely to hit the unable-to-allocate problem in one of those allocates.

    I've attached a very basic program to test a few of the main MPI routines used in GS2. To compile, I first did

    make -j2 gs2
    

    and then used

    mpif90 $(make test_make | grep F90FLAGS | awk -F "F90FLAGS is" '{print $2}') -o mptest mp_test.f90 mp.o file_utils.o command_line.o
    

    where mpif90 should be replaced with whatever is appropriate for your system (on a side note the test_make target should probably report the compiler command).

    I could then run this with valgrind --leak-check=full mpirun -n 2 ./mptest to get the valgrind report.
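
    For reference, a rough sketch of the sort of loop such a test contains is below. It calls the raw MPI routines directly rather than linking against GS2's mp.o, so treat it purely as an illustration of the pattern; the attached mp_test.f90 is the version to actually use.

    program mp_leak_sketch
      use mpi
      implicit none
      integer :: ierr, rank, nproc, i
      real, dimension(1000) :: buf
      real :: total

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

      buf = real(rank + 1)
      do i = 1, 10000
         ! in-place reduction, the sort of call GS2's reduction wrappers make
         call MPI_Allreduce(MPI_IN_PLACE, buf, size(buf), MPI_REAL, &
              MPI_SUM, MPI_COMM_WORLD, ierr)
         total = sum(buf)
         call MPI_Bcast(total, 1, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
         buf = buf / real(nproc)   ! keep values bounded between iterations
      end do

      if (rank == 0) write (*,*) 'final total = ', total
      call MPI_Finalize(ierr)
    end program mp_leak_sketch

    Running many iterations of the same calls makes even a small per-call leak visible to valgrind or to a memory probe.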

     
    Attachments
  • Thanks very much for this, I will give it a whirl.

    I have determined that the memory leak occurs within the collisions routines (or within the MPI calls within the collisions routines). Basically I ran each of the commands within timeadv the same number of times as the number of time steps before the run crashed. I also ran getfield the same number of times.

    Everything else ran, but vspace_derivatives crashed after about 9600 calls with the same error.

    I now have to teach, will be back on this soon.

     
  • PS I am not using any of the optimisations for vspace_derivatives.

     
  • There are two bits to the MPI in collisions:

    • The redistributes used to go from g_lo to any of the v-space layouts.
    • The integrate_moments calls used to calculate the conservation terms.

    If you add use_le_layouts=.true. to the input file then the velocity-space integrals should all be local, and the only MPI comes from the redistributes (which will likely be larger).

    The integrate_moments routine is very similar to the integrate_species routine used in getfield, so if getfield didn't crash I'd suggest the culprit must be either the redistributes or a non-MPI part of the code. Whilst redistributes are also used in calls to the nonlinear term, as you have layout='xyles' and are probably not using enough cores to start splitting xy very much, the NL redistributes are probably mostly just local copying. If you change to layout='lexys' and repeat your tests you may find that the call to add_nonlinear_terms is now what crashes; if not, then it's probably not related to the redistributes.
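
    For concreteness, the two suggested experiments might look like the snippet below in the input file. The namelist group names here are assumptions (they differ between GS2 versions), so check them against your own input file rather than copying verbatim.

    &layouts_knobs            ! assumed group name
      layout = 'lexys'        ! was 'xyles'
    /
    &collisions_knobs         ! assumed location of use_le_layouts
      use_le_layouts = .true.
    /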

     
  • Fixed... there was a gtmp not being deallocated (see [r2662])

    I tried those suggestions... the problem still persisted with use_le_layouts... so it had to be something reasonably simple.

    Thanks for all the help!

    Index: collisions.fpp
    ===================================================================
    --- collisions.fpp      (revision 2661)
    +++ collisions.fpp      (working copy)
    @@ -2735,7 +2735,7 @@
            end do
         end do
    
    -    deallocate (vns, v0y0, v1y1, v2y2)
    +    deallocate (vns, v0y0, v1y1, v2y2, gtmp)
    
       end subroutine conserve_lorentz_standard_layout
    
    @@ -2887,7 +2887,7 @@
            gle(:,:,ile) = gle(:,:,ile) - ieqzip(it,ik)*w0le(:,:,ile) * v2y2(ile)
         end do
    
    -    deallocate (vpanud, v0y0, v1y1, v2y2)
    +    deallocate (gtmp, vpanud, v0y0, v1y1, v2y2)
    
       end subroutine conserve_lorentz_le_layout
    
     

    Related

    Commit: [r2662]


    Last edit: David Dickinson 2014-02-03
    • status: open --> closed
     
  • Well done!

    That's very strange; an allocatable array should always be automatically deallocated on exit from the subroutine (provided it doesn't have the save attribute). What compiler do you use on the Blue Gene?
    Edit: Actually, I think automatic deallocation only appeared in Fortran 95, so if you have a strict Fortran 90 compiler it may not give us automatic deallocation.

    If this caused a problem here then there are likely to be other places where it can cause a problem on this system. For example, I think the diagnostics don't always deallocate their arrays. This will be less damaging than in the collisions routines, as the diagnostics are only called every nwrite steps (rather than collisions, which is called roughly 2*nstep+nfield*(2*ntgrid+1)*ntheta0*naky times).
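
    As a minimal illustration of the behaviour (a sketch only; the routine name and sizes are made up, not taken from collisions.fpp):

    subroutine leak_demo(n)
      implicit none
      integer, intent(in) :: n
      real, dimension(:,:), allocatable :: gtmp
      allocate (gtmp(n, n))
      gtmp = 0.0
      ! ... work with gtmp ...
      ! Under Fortran 95 rules gtmp is deallocated automatically on return
      ! (it has no save attribute); under strict Fortran 90 behaviour it is
      ! not, so without the explicit deallocate below the routine leaks
      ! n*n reals on every call, which is exactly the pattern fixed in [r2662].
      deallocate (gtmp)
    end subroutine leak_demo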

     
    Last edit: David Dickinson 2014-02-03
  • Yes, I see that on bluejoule the compiler is XL; specifically FC = xlf90_r, which means that the f90 standard is the default, i.e. -qxlf90=noautodealloc is assumed (see https://computing.llnl.gov/tutorials/ibm_sp/man/xlf.txt). By setting -qxlf90=autodealloc, allocatables will be automatically deallocated on subroutine exit, as with most other compilers.

    see [r2673]

     

    Related

    Commit: [r2673]


    Last edit: David Dickinson 2014-02-04
    • Bill Dorland
      2014-02-04

      Great conversation, gentlemen. Sorry for the bug.



