
Segmentation fault for ParallelHCDipolar.py

Technical
2018-07-30
2018-08-09
  • Juergen Salk

    Juergen Salk - 2018-07-30

    Hi there,

    I followed the installation instructions and successfully installed the parallel
    version of OSMPS version 2.1 on our HPC cluster. But when I try to run the ParallelHCDipolar.py
    test case from the Example folder, one of the MPI processes of the
    "Execute_MPSParallelMain" executable always segfaults with an invalid memory
    reference after some unpredictable run time (sometimes after tens of minutes,
    sometimes after hours).

    This is the workflow that I use to run the test case:

    $ python ParallelHCDipolar.py
    [...]
    $ cat HCDipolar.moab
    #!/bin/sh
    #MOAB -l nodes=1:ppn=16
    #MOAB -l walltime=24:00:00
    #MOAB -N HCDipolar
    #------------------------
    # Load modules as necessary
    module load phys/osmps/.2.1_parallel
    
    # Change to submission directory of this job
    cd "$MOAB_SUBMITDIR"
    
    # Run the simulation
    mpiexec Execute_MPSParallelMain TMP/HCDipolar > HCDipolarout
    
    $ msub HCDipolar.moab
    [...]
    

    And this is what I get in the stderr output of the job when compiled with
    OpenMPI (the "Loading module dependency: ..." lines come from autoloaded
    software modules and can probably be ignored):

    --- snip ---
    
    Loading module dependency 'numlib/mkl/11.1.4'.
    Loading module dependency 'mpi/openmpi/3.0-gnu-4.8'.
    Loading module dependency 'numlib/python_numpy/1.9.1'.
    Loading module dependency 'numlib/python_scipy/0.15.0'.
    Loading module dependency 'lib/matplotlib/1.5.3'.
    Loading module dependency 'office/texlive/2016'.
    
    Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
    
    Backtrace for this error:
    --------------------------------------------------------------------------
    A process has executed an operation involving a call to the
    "fork()" system call to create a child process.  Open MPI is currently
    operating in a condition that could result in memory corruption or
    other system errors; your job may hang, crash, or produce silent
    data corruption.  The use of fork() (or system() or other calls that
    create child processes) is strongly discouraged.
    
    The process that invoked fork was:
    
    Local host:          [[51793,1],6] (PID 10915)
    
    If you are *absolutely sure* that your application will successfully
    and correctly survive a call to fork(), you may disable this warning
    by setting the mpi_warn_on_fork MCA parameter to 0.
    --------------------------------------------------------------------------
    #0  0x2B91C2AA76F7
    #1  0x2B91C2AA7D3E
    #2  0x2B91C353926F
    #3  0x2B91C35844DC
    #4  0x594FDD in __variationalops_MOD_orthogonalize_mps
    #5  0x5D12A5 in __variationalops_MOD_infinitemps_mps
    #6  0x6BD03D in __pyinterface_MOD_runmps
    #7  0x6C229F in __parallelops_MOD_worker
    --------------------------------------------------------------------------
    mpiexec noticed that process rank 6 with PID 0 on node n1522 exited on signal 11 (Segmentation fault).
    --------------------------------------------------------------------------
    
    --- snip ---
    

    In case it matters: This is on RHEL 7 and OSMPS has been compiled with the default system compiler GNU Fortran (GCC) version 4.8.5.

    What could be the problem here? If I can add more information for you to tackle the problem, please let me know.

    Best regards
    Juergen

     

    Last edit: Juergen Salk 2018-07-30
  • Daniel Jaschke

    Daniel Jaschke - 2018-07-31

    Hello Jürgen,

    I'll make some test runs on our cluster, matching your compiler version as closely as possible, and get back to you. Thank you for your patience.

    Best regards,

    Daniel

     
  • Juergen Salk

    Juergen Salk - 2018-08-04

    Hello Daniel,

    thank you very much for taking care of this.

    For what it's worth, I have also run quite a number of tests with different build and runtime environments (GNU compiler vs. Intel compiler, ATLAS vs. MKL library, OpenMPI vs. Intel MPI, home-built NumPy/SciPy modules vs. stock RPM packages from the official RHEL 7 repositories), but I always ended up with some MPI process of Execute_MPSParallelMain crashing with an invalid memory reference at some point.

    Best regards,
    Jürgen

    PS: Maybe I should have filed a ticket rather than raising this issue in a discussion forum?

     
  • Daniel Jaschke

    Daniel Jaschke - 2018-08-06

    Hello Jürgen,

    I think either the discussion forum or a ticket is fine; since we already started here, we'll keep it in the forum. I could reproduce the error with gfortran and OpenMPI and found an initial problem, but I still have to check everything. I'll let you know once I make further progress. Sorry, I am quite busy right now.

    Best regards,

    Daniel

     
  • Daniel Jaschke

    Daniel Jaschke - 2018-08-07

    Hello Jürgen,

    The following change should fix it, in the file MPSFortLib/Mods/template/VariationalOps_InfiniteMPS.f90 at line 1382:

    OLD VERSION

    IF(lerrcode.ne.0.and.rerrcode.ne.0) THEN
        DEALLOCATE(evs%elem)
        RETURN
    END IF
    

    FIX FOR SEGMENTATION FAULT

    IF(lerrcode.ne.0.and.rerrcode.ne.0) THEN
        if(lerrcode == 0) DEALLOCATE(evs%elem)
        RETURN
    END IF
    

    I will integrate it into svn for version 2 at the end of the week, and likewise for version 3, which most likely has the same problem. Thank you for pointing this issue out, and please let me know if this does not fix your problem.

    Best regards,

    Daniel

     
  • Juergen Salk

    Juergen Salk - 2018-08-07

    Hello Daniel,

    thank you very much for the patch. I will run a test as soon as possible and report back whether this fixes the segfault issues described above.

    Best regards
    Jürgen

     
  • Juergen Salk

    Juergen Salk - 2018-08-07

    Wait. Doesn't this effectively prevent DEALLOCATE(evs%elem) from ever being executed?

    Best regards
    Jürgen

     
  • Juergen Salk

    Juergen Salk - 2018-08-07

    Dear Daniel,

    by looking into the code of Orthogonalize_MPS_TYPE(), it seems safe
    to just remove the DEALLOCATE(evs%elem) in line 1382 entirely,
    without introducing a memory leak, as this vector only gets
    allocated when TransferMatrixWrapper_MPS_TYPE() in line 1322
    returns lerrcode equal to 0. So there is probably no point in
    deallocating it when lerrcode is nonzero anyway. Please let me know
    if you disagree.

    Anyway, I've just recompiled with line 1382 removed from the original
    MPSFortLib/Mods/templates/VariationalOps_InfiniteMPS.f90 and was finally able
    to run the ParallelHCDipolar.py test case without any segfaults.

    Thank you very much.

    Best regards
    Jürgen

     

    Last edit: Juergen Salk 2018-08-07
  • Daniel Jaschke

    Daniel Jaschke - 2018-08-09

    Hello Jürgen,

    Yes, you are correct, and removing the line does the job. The two if conditions I had are mutually exclusive, but I was focusing on matching the original if condition of the allocation and overlooked it ...

    Thanks for confirming that it runs without segfaults.

    Best regards,

    Daniel

     
