From: Rohan G. <ro...@cc...> - 2016-10-14 17:26:04
Hi Sara,

Could you please re-try after applying the following patch to the DMTCP source?

diff --git a/src/util_misc.cpp b/src/util_misc.cpp
index f5bc84a..86650cf 100644
--- a/src/util_misc.cpp
+++ b/src/util_misc.cpp
@@ -633,6 +633,7 @@ bool Util::isNscdArea(const ProcMapsArea& area)
   if (strStartsWith(area.name, "/run/nscd") || // OpenSUSE (newer)
       strStartsWith(area.name, "/var/run/nscd") || // OpenSUSE (older)
       strStartsWith(area.name, "/var/cache/nscd") || // Debian/Ubuntu
+      strStartsWith(area.name, "/ram/var/run/nscd") || // CentOS-6.8
       strStartsWith(area.name, "/var/db/nscd")) { // RedHat/Fedora
     return true;
   }

Thanks,
Rohan

On Fri, Oct 14, 2016 at 07:02:04AM +0000, Sara Salem Hamouda wrote:
> Hi Rohan,
>
> I am using the latest release on github, which is DMTCP-2.4.5. I get the same error with mpirun.
>
> I tried another MPI implementation, called OpenMPI-ULFM (https://bitbucket.org/icldistcomp/ulfm), which I use in my research, and I got the same error:
>
> [40000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> area.name = /ram/var/run/nscd/dbuYHRnM
> orterun (40000): Terminating...
> ssh659@raijin3:~/dmtcp/dir_ckpt$ [41000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> area.name = /ram/var/run/nscd/dbCEJazi
> dummy.ulfm (41000): Terminating...
> [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> area.name = /ram/var/run/nscd/dbCEJazi
> dummy.ulfm (42000): Terminating...
> [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> area.name = /ram/var/run/nscd/dbCEJazi
> dummy.ulfm (43000): Terminating...
>
> The HANDSHAKE error appeared with MPICH, but not with OpenMPI-ULFM.
>
> Best Regards,
> Sara
>
> Sara S. Hamouda
> PhD Candidate (Computer Systems Group)
> College of Engineering and Computer Science
> The Australian National University
> ________________________________
> From: Rohan Garg <ro...@cc...>
> Sent: Friday, October 14, 2016 7:11:12 AM
> To: Sara Salem Hamouda
> Cc: dmt...@li...
> Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
>
> Hi Sara,
>
> What version of DMTCP were you using? DMTCP-3.0 has some known issues
> with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
> DMTCP-2.5.
>
> Also, could you try launching your MPI program with mpirun instead of
> mpiexec?
>
> Thanks,
> Rohan
>
> On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> > Dear DMTCP team,
> >
> > I would appreciate your support regarding the issue below.
> >
> > I am using a single machine to learn DMTCP. The operating system is "CentOS release 6.8", and it uses a network file system. I run a simple MPI program (dummy.c), using mpich V3.2.
> >
> > On terminal-1:
> >
> > dmtcp_coordinator
> >
> > On terminal-2:
> >
> > dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
> >
> > While dummy is running in terminal-2, I move to terminal-1 and press 'c', then 'q' to exit.
> >
> > To restart, I run the generated dmtcp_restart_script.sh script, but I get the error below. Would you please advise on a possible fix for this issue?
> >
> > (P.S. I tried the same steps on another machine (with Ubuntu 14.04 OS) that has a local file system, and the restart worked successfully. Is there a specific configuration I should use with network file systems?)
> >
> > size = 1
> > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > area.name = /ram/var/run/nscd/dbbxzrxW
> > dummy.mpich2 (43000): Terminating...
> > [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > area.name = /ram/var/run/nscd/dbbxzrxW
> > dummy.mpich2 (44000): Terminating...
> > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > area.name = /ram/var/run/nscd/dbbxzrxW
> > dummy.mpich2 (42000): Terminating...
> > [40000] ERROR at connectionidentifier.h:96 in assertValid; REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > sign =
> > Message: read invalid message, signature mismatch. (External socket?)
> > mpiexec.hydra (40000): Terminating...
> > [41000] ERROR at connectionidentifier.h:96 in assertValid; REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > sign =
> > Message: read invalid message, signature mismatch. (External socket?)
> > hydra_pmi_proxy (41000): Terminating...
> >
> > Best Regards,
> > Sara
> >
> > Sara S. Hamouda
> > PhD Candidate (Computer Systems Group)
> > College of Engineering and Computer Science
> > The Australian National University
>
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmt...@li...
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
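
For anyone reading this thread later: the one-line patch in the reply adds "/ram/var/run/nscd" to the list of path prefixes that Util::isNscdArea() accepts, which is the prefix of every area.name in the failing JASSERT messages. Below is a rough standalone sketch of that prefix check so the effect can be tried outside DMTCP. It is not the DMTCP code: startsWith and looksLikeNscdArea are hypothetical names, and the function takes a plain path string instead of a ProcMapsArea, whose definition is not shown in this thread.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for DMTCP's strStartsWith helper; the real
// helper's signature is not shown in the thread.
static bool startsWith(const std::string& s, const char* prefix) {
  // compare() looks only at the first strlen(prefix) characters of s.
  return s.compare(0, std::strlen(prefix), prefix) == 0;
}

// Illustrative only: mirrors the prefix list from the patched
// Util::isNscdArea, but operates on a path string.
bool looksLikeNscdArea(const std::string& name) {
  static const char* const kNscdPrefixes[] = {
    "/run/nscd",          // OpenSUSE (newer)
    "/var/run/nscd",      // OpenSUSE (older)
    "/var/cache/nscd",    // Debian/Ubuntu
    "/ram/var/run/nscd",  // CentOS-6.8 (the prefix added by the patch)
    "/var/db/nscd",       // RedHat/Fedora
  };
  for (const char* prefix : kNscdPrefixes) {
    if (startsWith(name, prefix)) {
      return true;
    }
  }
  return false;
}
```

With the added entry, a call such as looksLikeNscdArea("/ram/var/run/nscd/dbuYHRnM") matches; without it, none of the other four prefixes would, since the path begins with "/ram". How DMTCP then treats a matching region during restart is not described in this thread.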