Hi there,
I followed the installation instructions and successfully installed the parallel version of OSMPS version 2.1 on our HPC cluster. But when I try to run the ParallelHCDipolar.py test case from the Example folder, the "Execute_MPSParallelMain" executable always segfaults one of its running MPI processes with an invalid memory reference after some unpredictable run time (sometimes after tens of minutes, sometimes after hours).
This is the workflow that I use to run the test case:
$ python ParallelHCDipolar.py
[...]
$ cat HCDipolar.moab
#!/bin/sh
#MOAB -l nodes=1:ppn=16
#MOAB -l walltime=24:00:00
#MOAB -N HCDipolar
#------------------------
# Load modules as necessary
module load phys/osmps/.2.1_parallel
# Change to submission directory of this job
cd "$MOAB_SUBMITDIR"
# Run the simulation
mpiexec Execute_MPSParallelMain TMP/HCDipolar > HCDipolarout
$ msub HCDipolar.moab
[...]
And this is what I get in the stderr output of the job when compiled with
OpenMPI (the "Loading module dependency: ..." lines come from autoloaded
software modules and can probably be ignored):
--- snip ---
Loading module dependency 'numlib/mkl/11.1.4'.
Loading module dependency 'mpi/openmpi/3.0-gnu-4.8'.
Loading module dependency 'numlib/python_numpy/1.9.1'.
Loading module dependency 'numlib/python_scipy/0.15.0'.
Loading module dependency 'lib/matplotlib/1.5.3'.
Loading module dependency 'office/texlive/2016'.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: [[51793,1],6] (PID 10915)
If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
#0 0x2B91C2AA76F7
#1 0x2B91C2AA7D3E
#2 0x2B91C353926F
#3 0x2B91C35844DC
#4 0x594FDD in __variationalops_MOD_orthogonalize_mps
#5 0x5D12A5 in __variationalops_MOD_infinitemps_mps
#6 0x6BD03D in __pyinterface_MOD_runmps
#7 0x6C229F in __parallelops_MOD_worker
--------------------------------------------------------------------------
mpiexec noticed that process rank 6 with PID 0 on node n1522 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
--- snip ---
In case it matters: This is on RHEL 7 and OSMPS has been compiled with the default system compiler GNU Fortran (GCC) version 4.8.5.
What could be the problem here? If I can provide any more information to help you tackle the problem, please let me know.
Best regards
Juergen
Last edit: Juergen Salk 2018-07-30
Hello Jürgen,
I'll make some test runs on our cluster trying to match the compiler version as closely as possible and get back to you. Thank you for your patience.
Best regards,
Daniel
Hello Daniel,
thank you very much for taking care of this.
For what it's worth, I have also run quite a number of tests with different build and runtime environments (GNU compiler vs. Intel compiler, ATLAS vs. MKL library, OpenMPI vs. Intel MPI, home-built NumPy/SciPy modules vs. stock RPM packages from the official RHEL7 RPM repos) but always ended up with some MPI process of Execute_MPSParallelMain crashing with an invalid memory reference at some point.
Best regards,
Jürgen
PS: Maybe I should have filed a ticket rather than raising this issue in a discussion forum?
Hello Jürgen,
I think either the discussion forum or a ticket is ok; since we already started here, we'll keep it in the forum. I could reproduce the error with gfortran and openmpi and found an initial problem, but I still have to check everything. I'll let you know once I have made further progress. Sorry, I am quite busy right now.
Best regards,
Daniel
Hello Jürgen,
The following change should fix it, in the file MPSFortLib/Mods/template/VariationalOps_InfiniteMPS.f90 at line 1382:
OLD VERSION
IF(lerrcode.ne.0.and.rerrcode.ne.0) THEN
  DEALLOCATE(evs%elem)
  RETURN
END IF
FIX FOR SEGMENTATION FAULT
IF(lerrcode.ne.0.and.rerrcode.ne.0) THEN
  if(lerrcode == 0) DEALLOCATE(evs%elem)
  RETURN
END IF
I will integrate it into svn for version 2 at the end of the week, and the same for version 3, which most likely has the same problem. Thank you for pointing this issue out, and please let me know if this does not fix your problem.
Best regards,
Daniel
Hello Daniel,
thank you very much for the patch. I will run a test asap and report back if this fixes the segfault issues described above.
Best regards
Jürgen
Wait. Doesn't this effectively prevent DEALLOCATE(evs%elem) from ever being executed?
Best regards
Jürgen
Dear Daniel,
By looking into the code of Orthogonalize_MPS_TYPE(), it seems safe to just remove the DEALLOCATE(evs%elem) in line 1382 completely without introducing any kind of memory leak, as this vector only gets allocated in case TransferMatrixWrapper_MPS_TYPE() in line 1322 returned lerrcode equal to 0. So there is probably no point in deallocating if lerrcode is not equal to 0 anyway. Please let me know if you disagree.
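To illustrate the point with a minimal standalone sketch (this is not OSMPS code; the derived type and names are invented for the example, and I am assuming an allocatable component here, whereas for a pointer component the same mistake is undefined behaviour and can segfault outright): deallocating something that was never allocated is an error in Fortran unless it is caught with stat= or guarded with ALLOCATED().
! Minimal standalone sketch (not OSMPS code): why an unconditional
! DEALLOCATE of a never-allocated component is dangerous, and how a
! guard with ALLOCATED() avoids it.
program dealloc_demo
  implicit none
  type :: eigvec_t                      ! invented stand-in for the evs structure
     real, allocatable :: elem(:)
  end type eigvec_t
  type(eigvec_t) :: evs
  integer :: ierr
  ! Simulate the error branch: elem was never allocated because the
  ! preceding call failed. An unguarded DEALLOCATE(evs%elem) here would
  ! abort the program; with stat= it just returns a nonzero code.
  deallocate(evs%elem, stat=ierr)
  print *, 'deallocate of unallocated component, stat =', ierr
  ! Safe pattern: only deallocate what is actually allocated.
  if (allocated(evs%elem)) deallocate(evs%elem)
  print *, 'guarded deallocate is a no-op when nothing was allocated'
end program dealloc_demo
Guarding the DEALLOCATE with ALLOCATED() would be the defensive alternative to dropping the line entirely.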
Anyway, I've just recompiled with line 1382 removed from the original MPSFortLib/Mods/templates/VariationalOps_InfiniteMPS.f90 and was finally able to run the ParallelHCDipolar.py test case without any segfaults.
Thank you very much.
Best regards
Jürgen
Last edit: Juergen Salk 2018-08-07
Hello Jürgen,
Yes, you are correct, and removing the line does the job. The two if conditions I had exclude each other; I was focusing on matching the original if-case of the allocation and overlooked that.
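For reference, with the DEALLOCATE line dropped, the block at line 1382 presumably reduces to just the early return (reconstructed from the snippet quoted above, not copied from the repository):
IF(lerrcode.ne.0.and.rerrcode.ne.0) THEN
  RETURN
END IF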
Thanks for confirming that it runs without segfaults.
Best regards,
Daniel