
many trajectories eventually segfault

  • Matthew Egbert

    Matthew Egbert - 2013-10-08

    Hello,

    I have been working on integrating PyDSTool with the inspyred
    genetic algorithm package, to allow me to use machine learning
    algorithms to tune dynamical neural networks to solve particular
    tasks. The Radau and Dopri integrators seem to work most of the
    time, but after generating many trajectories (an inconsistent,
    unpredictable number, sometimes ~1000, sometimes ~100000), the program
    just dies with a segfault and no information about the origin of the
    segfault.

    I am using 64 bit ubuntu, and have modified the -m32 flags to be
    -m64 in Generators/Dopri_ODEsystem.py, Generators/Radau_ODEsystem.py
    and PyCont/ContClass.py.  (Simply removing the flags didn't seem to
    work…but because of the unpredictable nature of the segfaults, I'm
    not sure if there is a difference between removing the -m32 flags and
    replacing them with -m64 flags).

    When I run the run_all_tests.py script, the final output says:

    Test scripts that failed:
            test_variable_traj.py
            poly_interp_test.py
            PyCont_Hopfield.py
            interp_dopri_test.py
            PyCont_MorrisLecar_TypeI.py
            PyCont_MorrisLecar_TypeII.py
            PyCont_HindmarshRose.py
    Summary:
    Basic PyDSTool functions: appears to be broken on your system
    Map related modules: appears to work on your system
    VODE related modules: appears to be broken on your system
    Symbolic differentiation module: appears to work on your system
    Parameter estimation module: appears to work on your system
    PyCont: appears to be broken on your system
    Dopri ODE systems: appears to be broken on your system
    Radau ODE systems: appears to work on your system
    Parameter estimation module with external compilers: appears to work on your system
    PyCont interface to AUTO: appears to be broken on your system
    

    But, when I rerun some of these tests, there is no problem. Specifically:

    The following tests fail (not from segfaults, but because they cannot find
    some libraries).  I think I saw that these are known bugs? If so, I have not
    included the details of their output, but if it would be helpful, I am happy
    to do so (just let me know).

            test_variable_traj.py
            poly_interp_test.py
            PyCont_Hopfield.py

    But the following test seems to pass without problem, generates a nice
    plot and exits without warnings or errors:

            interp_dopri_test.py

    The following tests all segfault when I run them with python, but not
    when I run them with ipython!

            PyCont_MorrisLecar_TypeI.py
            PyCont_MorrisLecar_TypeII.py
            PyCont_HindmarshRose.py

    They all seem to do so at the same point, close to the end of the
    test.  See the output at http://www.rhthm.com/pydstool, where I have
    indicated where the test fails.

    Running my program with ipython or python successfully generates
    thousands (if not 10s or 100s of thousands) of trajectories before
    segfaulting or reporting

    Fatal Python error: deallocating None
    Aborted (core dumped)
    

    I have confirmed that it is not the evolutionary optimisation of
    parameters that is causing the segfault, by replacing the system of
    equations with a set of very simple equations that have no parameters
    that are influenced by evolution.

    My questions are:

    1. What is likely to be causing the segfault? Is there a way to
    configure pydstool / the compilation of the C integrators to provide
    me with more information?

    2. Does anything stand out to anyone as a possible cause for the
    different behaviour between python and ipython?  Are there any library
    path variables that I should check?

    3. Is there anything that needs to be done when using the same
    generator to create many, many trajectories? I have tried recreating
    the generator between the production of every trajectory, but that does
    not help.  When I run top during execution, I have not noticed any
    memory leaks.

    Basically, I have tried everything that I can think of to debug the
    error. Source code in a .tz archive is available at the following URL.

    You will need to install the inspyred package, available through pip.

    http://www.rhthm.com/pydstool

    Any help you can provide would be much appreciated!  And if you need
    any further information from me, please don't hesitate to ask.

    Cheers,
    Matthew

     
  • Matthew Egbert

    Matthew Egbert - 2013-10-08

    I am using the latest version of PyDSTool

    PyDSTool-0.88.121202.zip

    I have also tried the following, with no improvement:

    PyDSTool-0.88.120504.zip

    In : PyDSTool.__version__
    Out: '0.88'

    In : numpy.__version__
    Out: '1.7.1'

    In : scipy.__version__
    Out: '0.12.0'

     
  • Rob Clewley

    Rob Clewley - 2013-10-08

    Hi,

    Sorry that this has been a problem. We have been aware of the slow memory leak causing problems for long integrations for some time but have been unable to trace its origin (it's something either in the SWIG interfacing that is opaque to us or otherwise inherent in the C behind the array libraries). We *might* have inadvertently fixed it in the minor update that is posted here, but we haven't had a way to test this for the kind of case you have:

    https://dl.dropboxusercontent.com/u/2816627/PyDSTool-0.88.130726.zip

    Please try it with this version and let us know here whether it has solved the problem. The other issue is that 64 bit systems have been troublesome - sometimes they've worked fine and other times not. We also have not yet been able to trace that but someone is looking into it.

    I don't know of any simple fix except to break up your long runs into multiple pieces. You can also split up many short runs (which can also cause it) by starting different threads each running PyDSTool for different groups of initial conditions. That, at least, avoids the problem.
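
    A minimal sketch of that idea (mine, not code from this thread; I use worker
    processes rather than threads, since memory leaked by the integrator is only
    reclaimed when a process exits, and the self-contained Vode generator stands
    in for a Dopri/Radau one):

    import multiprocessing as mp
    import PyDSTool as dst

    def run_group(ic_values):
        # Build a fresh generator inside this worker process and integrate
        # once per initial condition in the group.
        DSargs = dst.args(name='decay')
        DSargs.varspecs = {'x': '-0.1 * x'}
        DSargs.tdata = [0, 20]
        DSargs.ics = {'x': ic_values[0]}
        DS = dst.Generator.Vode_ODEsystem(DSargs)
        finals = []
        for x0 in ic_values:
            DS.set(ics={'x': x0})
            traj = DS.compute('run')
            finals.append(traj.sample()['x'][-1])   # final value of x
        return finals

    if __name__ == '__main__':
        # 8 groups of 100 initial conditions, spread over 4 worker processes
        groups = [[0.1 * k + 0.01 * j for j in range(100)] for k in range(8)]
        pool = mp.Pool(processes=4)
        results = pool.map(run_group, groups)
        pool.close()
        pool.join()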

     
  • Matthew Egbert

    Matthew Egbert - 2013-10-10

    Hello Rob,

    Thanks for your response. Unfortunately the latest version that you posted did not resolve the problem.  I only did one test (more will follow) and it did seem to run for longer than before.  But just as soon as I got my hopes up…

    Fatal Python error: deallocating None
    Aborted (core dumped)

    Hit me up if you want me to run another test of some kind to help you debug what's going on.

    In terms of breaking the run into multiple pieces: is there a way to do that from within python (i.e. without restarting the python interpreter between the pieces)?

    Thanks again.
    All the best,
    Matthew

     
  • Rob Clewley

    Rob Clewley - 2013-10-11

    Sorry that didn't work out. The only tests that could be more helpful would be if you had installed Python with the debug symbols included, and then ran this through the new gdb with the python bindings. I don't presently have such a setup ready to do such a test. If you happen to know enough about using gdb and would be willing to help us track down this elusive problem then it would be a great benefit to our project. You can email me privately if you'd like to get involved.

    You can start new processes within python by using the os library to run arbitrary system commands (e.g. to run a python script), or the subprocess library to call python and wait on a return code.
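
    For example, a minimal sketch of the subprocess route (assuming a hypothetical
    worker script run_chunk.py that computes one piece of the run and saves its
    final state to disk for the next piece to load):

    import subprocess
    import sys

    for chunk in range(10):
        # Each piece runs in a fresh python process, so any memory leaked by
        # the integrator is released when that process exits.
        ret = subprocess.call([sys.executable, 'run_chunk.py', str(chunk)])
        if ret != 0:
            raise RuntimeError('chunk %d failed with return code %d' % (chunk, ret))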

     
  • Anonymous

    Anonymous - 2013-10-16

    I wonder if this is related…

    If I run the following code, memory usage keeps increasing indefinitely (I am losing about 2 MB/second).

    Could there be a memory leak even before calls to any of the integrators?

    I ran into this issue while trying to integrate a system, feeding in many (>10**4) initial conditions in a for loop.
    After about 1-2 hours of work my computer runs out of memory (~16 GB).

    #!/usr/bin/env python
    import PyDSTool as dst
    def code():
        """ Runs the simulation for a single day of growth. Returning all the dynamics. """
        icdict = {'x': 1}
        x_rhs = '-0.1 * x'
        vardict = {'x': x_rhs}
        DSargs = dst.args()                   # create an empty object instance of the args class, call it DSargs
        DSargs.name = 'SHM'               # name our model
        DSargs.ics = icdict               # assign the icdict to the ics attribute
        DSargs.tdata = [0, 20]            # declare how long we expect to integrate for
        DSargs.varspecs = vardict         # assign the vardict dictionary to the 'varspecs' attribute of DSargs
        DS = dst.Generator.Vode_ODEsystem(DSargs)
        del DS
        return
    def main():
        for i in range(10**8):
            code()
    if __name__ == '__main__':
        main()
    

    $ uname -a
    Linux marbles 3.8.0-31-generic #46-Ubuntu SMP Tue Sep 10 20:03:44 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

    In : import numpy

    In : numpy.__version__
    Out: '1.7.1'

    In : import scipy

    In : scipy.__version__
    Out: '0.11.0'

    This is true both for this link: https://dl.dropboxusercontent.com/u/2816627/PyDSTool-0.88.130726.zip
    and the version available for download from sourceforge.

     
  • Rob Clewley

    Rob Clewley - 2013-10-16

    Thanks for looking at this. I don't think it's related. The Vode generator wraps the scipy module and has nothing to do with our Dopri/Radau SWIG-interfaced code, that's for sure. On the python side, the initialization of the generator most likely commits memory to the pre-compiled Vode DLL that is loaded. Creating the DS object repeatedly will probably assign new memory to the DLL, and I believe that Python does not actually unload DLLs. So that, I suspect, is the source of that memory increase. There's nothing really I can do about that, but repeatedly creating the same Generator is obviously not recommended anyway.

    The segfault with Dopri comes from repeatedly calling into the integrator, not recreating its Generator wrapper (i.e. not reloading the DLL). Although, you'll probably get the same memory increase if you repeatedly recreate a Dopri-based Generator in your loop too…

    Let me know if my logic is off or you notice anything else.
    Thanks!

     
  • Alexandre Foncelle

    Hello,

    I was wondering if you have made any progress on this memory leak. Indeed, when I re-create a Radau-based Generator in my loop, memory starts growing.

    Thank you !

    Alexandre

     
  • Rob Clewley

    Rob Clewley - 2016-02-25

    I don't think there has been any progress on the full rewrite of the interfaces to Radau and Dopri, alas. I think the work has begun, though. As an interim measure, you could break a large trajectory calculation into smaller pieces and see whether the memory still grows when the integrator is restarted many times instead. If it still does, a small script could manage a loop that starts and kills a second python process to compute each part of the trajectory.

     
    • Evgenij Gr.

      Evgenij Gr. - 2017-06-23

      Excuse me, what way of restarting the integrator did you mean exactly? I've encountered the same memory leak problem with Dopri/Radau, and that approach could be extremely useful to me. Somehow the Generator's cleanupMemory method doesn't work at all (from the description I got the idea that it might solve the problem, but no success), and I can't recompile the RHS and recreate the generator in a loop within a single script (the first iteration is OK; at the second iteration something strange happens with the paths and compilation fails).

       

      Last edit: Evgenij Gr. 2017-06-23
      • Rob Clewley

        Rob Clewley - 2017-06-26

        Sorry for the difficulty with this. cleanupMemory was our early attempt to
        mitigate the problem but I agree it doesn't seem to have helped. You
        definitely can't recreate the generator object -- the DLL will remain the
        same under the hood.

        I simply meant that, in my experience, there are fewer segfaults if one
        splits the time domain into chunks and computes trajectories over those
        restricted domains consecutively. That way, the same blocks of memory will
        be overwritten with each new call to compute, rather than running out of
        memory with one giant allocation for the entire domain.
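
        As a rough illustration of this chunking (a sketch of mine, not code from
        the thread; it uses the self-contained Vode generator, but the same pattern
        applies to a Dopri/Radau generator), the last point of each chunk is fed
        back in as the initial condition of the next:

        import PyDSTool as dst

        DSargs = dst.args(name='chunked_decay')
        DSargs.varspecs = {'x': '-0.1 * x'}
        DSargs.ics = {'x': 1.0}
        DSargs.tdata = [0, 10]
        DS = dst.Generator.Vode_ODEsystem(DSargs)

        t0, chunk_len, n_chunks = 0.0, 10.0, 20
        ics = {'x': 1.0}
        for k in range(n_chunks):
            # Restrict the time domain to one chunk and restart from the end of
            # the previous chunk, so the same memory is reused on every call.
            DS.set(tdata=[t0, t0 + chunk_len], ics=ics)
            traj = DS.compute('chunk%d' % k)
            pts = traj.sample()
            ics = {'x': pts['x'][-1]}    # end point becomes the next IC
            t0 += chunk_len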

        You can avoid the segfaults altogether if you use a control script with
        os.popen to call separate python scripts, passing each one the final IC from
        the last domain chunk. It's a kludge but it's also not hard to put together,
        and we never got to the bottom of where the leak originates, despite having
        put some hours in with valgrind back in the day.


         
        • Maurizio De Pitta'

          Hi Rob,
          just in case it helps: in the meantime I have recoded both the Radau and Dopri integrators in C++ from Hairer's original sources. They do not use the PyDSTool interface and are standalone integrators to be wired to your own model, but they run smoothly without segfaults. So, back in the day, I came to believe that the segfault is somehow subtle and must be generated by a memory leak or some wrong array assignment in the python interface, or maybe in the fortran code itself.

          If you deem it helpful, I can share my code. Just ping me privately.

          Cheers,
          M

           
        • Evgenij Gr.

          Evgenij Gr. - 2017-06-29

          Thank you for your answer! Yeah, calling a separate script is the workaround that I currently have. I'll check your suggestion about splitting the domain. I don't know whether it will work, because mostly I integrate until the trajectory returns to a cross-section (I use a terminal event for that). Maybe the return time is not as tame as I expected and your suggestion can fix this.

          Can I ask about recreating the generator here? What I meant is that I delete the folders with the generated source code and library, recompile them from scratch and recreate the generator after that. This was my first workaround; I was going to do this at each iteration. Could this work (theoretically, at least), or is it a dead end?

          Update: I've checked the return time to the cross-section and it's pretty consistent, about 40 units of PyDSTool time. So what I am usually doing is computing a lot (>10000) of these short trajectories, each ending at some terminal event.

           

          Last edit: Evgenij Gr. 2017-06-29
          • Rob Clewley

            Rob Clewley - 2017-06-29

            Well, you can't unload and reload the same named module (i.e. the DLL
            created from the C code). So, you have to start at least a whole new
            python process to be able to make that visible after a delete/recreate.
            At the python level, merely recreating the generator is not sufficient
            to do a deep-level restart of the DLL and its memory allocation.

            You can time-domain split based on state-dependent events just as easily as
            by literal time. Just spawn subprocesses that run the next part until
            either a time limit is hit or the next event is hit, and so on.
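
            As a rough sketch of the event-based variant (my own example; it uses
            makeZeroCrossEvent as in the online tutorials, with an arbitrary
            threshold and the Vode generator for self-containment), a terminal
            event ends each piece when the trajectory reaches the cross-section,
            and the event point would seed the next piece or subprocess:

            import PyDSTool as dst

            # Terminal event: stop integration when x crosses 0.5 from above.
            ev = dst.makeZeroCrossEvent('x - 0.5', -1,
                                        {'name': 'cross_section',
                                         'eventtol': 1e-6,
                                         'term': True},
                                        varnames=['x'],
                                        targetlang='python')

            DSargs = dst.args(name='event_chunked')
            DSargs.varspecs = {'x': '-0.1 * x'}
            DSargs.ics = {'x': 1.0}
            DSargs.tdata = [0, 100]          # generous time limit per piece
            DSargs.events = [ev]
            DS = dst.Generator.Vode_ODEsystem(DSargs)

            traj = DS.compute('to_section')
            pts = traj.sample()
            # The last sampled point is (approximately) where the terminal event
            # fired; it would become the initial condition for the next piece.
            next_ic = {'x': pts['x'][-1]}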


             
            • Evgenij Gr.

              Evgenij Gr. - 2017-07-02

              Thanks for the reply! I had some hopes for that way of handling the problem, but okay, so much for that idea.

               
  • Maurizio De Pitta'

    Hi folks,
    I am also currently using Dopri and Radau within the framework of a minimization problem, which requires the computation of a potentially very large number of trajectories. Unfortunately the problem got to a halting point, insofar as after say 10 trajectories, memory resources saturate and error 137 (SIGKILL) is issued. So it does not exactly confirm this post, but I am afraid it adds support to the possibility that, somewhere, there is a memory leak in the integrators...

    M

     
