From: Jiajun C. <ji...@cc...> - 2016-12-07 23:39:57
Which dmtcp version are you using? Could you try the following patch,
please?

https://github.com/jiajuncao/dmtcp/commit/8d693636e4a0fce87fb4d96e685e4336831d50ea
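In case it helps: one way to apply that commit to a dmtcp source checkout
is via GitHub's .patch view (just a sketch, assuming git and curl are
available on the build machine; whether it applies cleanly depends on the
dmtcp version you are building from):

  $ cd dmtcp     # your dmtcp source tree
  $ curl -L https://github.com/jiajuncao/dmtcp/commit/8d693636e4a0fce87fb4d96e685e4336831d50ea.patch | git am
  $ ./configure && make && make install     # then rebuild and reinstall as usual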
Best,
Jiajun

On Mon, Dec 5, 2016 at 5:52 PM, Maksym Planeta <mpl...@os...> wrote:
> I was running the application inside an interactive job allocation. One
> shell was running the coordinator, another one was launching the
> application.
>
> Both shells were in the same working directory.
>
> Normally I use mpirun_rsh to launch applications. If I use srun, I have
> to provide --mpi=pmi2 additionally.
>
> MVAPICH is configured for mpirun.
>
> $ srun --version
> slurm 16.05.5-Bull.1.1-20161010-0700
>
> $ mpiname -a
> MVAPICH2 2.2 Thu Sep 08 22:00:00 EST 2016 ch3:mrail
>
> Compilation
> CC: gcc -g -O0
> CXX: g++ -g -O0
> F77: gfortran -L/lib -L/lib -g -O0
> FC: gfortran -g -O0
>
> Configuration
> --enable-fortran=all --enable-cxx --enable-timing=none --enable-debuginfo
> --enable-mpit-pvars=all --enable-check-compiler-flags
> --enable-threads=multiple --enable-weak-symbols
> --disable-dependency-tracking --enable-fast-install --disable-rdma-cm
> --with-pm=mpirun:hydra --with-rdma=gen2 --with-device=ch3:mrail
> --enable-alloca --enable-hwloc --disable-fast --enable-g=dbg
> --enable-error-messages=all --enable-error-checking=all --prefix=<dir>
>
> On 12/05/2016 11:39 PM, Jiajun Cao wrote:
>> Hi Maksym,
>>
>> Thanks for writing to us. Can you provide the following info:
>>
>> DMTCP version, Slurm version, MVAPICH2 version, and is MVAPICH2
>> configured with srun as the process launcher?
>>
>> Also, how did you run the jobs? Did you do it by submitting scripts or
>> by running interactive jobs?
>>
>> Best,
>> Jiajun
>>
>> On Mon, Dec 5, 2016 at 2:21 PM, Maksym Planeta <mpl...@os...> wrote:
>>
>> Dear DMTCP developers,
>>
>> I'm trying to set up checkpoint/restart of MPI applications using
>> MVAPICH.
>>
>> I tried several options to launch DMTCP with MVAPICH, but none
>> succeeded.
>>
>> I use the marker ****** around lengthy dumps of debugging information.
>>
>> I show my most successful attempt here; I can report the results of
>> other attempts on request.
>>
>> In the end, the restart script seems to complain about a shared memory
>> file that it cannot open. Could you tell me how I can work around this
>> issue?
>>
>> First I launch dmtcp_coordinator in a separate window, then I start the
>> application as follows:
>>
>> ******
>> $ dmtcp_launch --rm --ib srun --mpi=pmi2 ./wrapper.sh ./bin/lu.A.2
>> [40000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>> [40000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>> [40000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under SLURM!'
>> [40000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
>> [40000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
>> [42000] TRACE at rm_slurm.cpp:131 in print_args; REASON='Init CMD:'
>> cmdline = /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 ./wrapper.sh ./bin/lu.A.2
>> [42000] TRACE at rm_slurm.cpp:160 in patch_srun_cmdline; REASON='Expand dmtcp_launch path'
>> dmtcpCkptPath = dmtcp_launch
>> [42000] TRACE at rm_slurm.cpp:253 in execve; REASON='How command looks from exec*:'
>> [42000] TRACE at rm_slurm.cpp:254 in execve; REASON='CMD:'
>> cmdline = dmtcp_srun_helper dmtcp_nocheckpoint /opt/slurm/current/bin/srun_slurm/srun --mpi=pmi2 dmtcp_launch --coord-host 127.0.0.1 --coord-port 7779 --ckptdir /home/s9951545/dmtcp-app/NPB3.3/NPB3.3-MPI --infiniband --batch-queue --explicit-srun ./wrapper.sh ./bin/lu.A.2
>> [42000] restore_libc.c:214 in TLSInfo_GetTidOffset; REASON= tid_offset: 720
>> [42000] restore_libc.c:244 in TLSInfo_GetPidOffset; REASON= pid_offset: 724
>> [42000] TRACE at rm_torque.cpp:99 in probeTorque; REASON='Start'
>> [42000] TRACE at rm_slurm.cpp:52 in probeSlurm; REASON='Start'
>> [42000] TRACE at rm_slurm.cpp:54 in probeSlurm; REASON='We run under SLURM!'
>>
>> NAS Parallel Benchmarks 3.3 -- LU Benchmark
>>
>> Size: 64x 64x 64
>> Iterations: 250
>> Number of processes: 2
>>
>> Time step 1
>> Time step 20
>> Time step 40
>> Time step 60
>> [42000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start, internal pmi capable'
>> [40000] TRACE at rm_pmi.cpp:161 in rm_shutdown_pmi; REASON='Start, internal pmi capable'
>> [42000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets left'
>> [40000] TRACE at jsocket.cpp:581 in monitorSockets; REASON='no sockets left'
>> [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>> [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>> Time step 80
>> ******
>>
>> I manage to create a checkpoint, but when I try to restart, the restart
>> script stops at this point:
>>
>> ******
>> $ ./dmtcp_restart_script.sh
>> <SKIPPED>
>> dir = /tmp/dmtcp-s9951545@taurusi4043
>> [45000] TRACE at jfilesystem.cpp:172 in mkdir_r; REASON='Directory already exists'
>> dir = /tmp/dmtcp-s9951545@taurusi4043
>> [45000] WARNING at fileconnlist.cpp:192 in resume; REASON='JWARNING(unlink(missingUnlinkedShmFiles[i].name) != -1) failed'
>> missingUnlinkedShmFiles[i].name = /dev/shm/cm_shmem-1003236.42-taurusi4043-1074916.tmp
>> (strerror((*__errno_location ()))) = No such file or directory
>> Message: The file was unlinked at the time of checkpoint. Unlinking it after restart failed
>> [42000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
>> filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-42000-5845bc0a
>> [44000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
>> filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-44000-323cf5bc0749
>> [40000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
>> filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-40000-323cd8b79b6f
>> [45000] TRACE at rm_slurm.cpp:74 in slurm_restore_env; REASON='Cannot open SLURM environment file. Environment won't be restored!'
>> filename = /tmp/dmtcp-s9951545@taurusi4043/slurm_env_4b324242916bf6c4-45000-5845bc0a
>> [44000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>> [42000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>> [42000] TRACE at rm_slurm.cpp:522 in slurmRestoreHelper; REASON='This is srun helper. Restore it'
>> [40000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>> [45000] TRACE at rm_pmi.cpp:183 in rm_restore_pmi; REASON='Start, internal pmi capable'
>> lu.A.2: ibvctx.c:273: query_qp_info: Assertion `size == sizeof(ibv_qp_id_t)' failed.
>> ******
>>
>> Before starting, I set up the following environment variables for
>> MVAPICH:
>>
>> export MV2_USE_SHARED_MEM=0 # This one is probably the most relevant
>> export MV2_USE_BLOCKING=0
>> export MV2_ENABLE_AFFINITY=0
>> export MV2_RDMA_NUM_EXTRA_POLLS=1
>> export MV2_CM_MAX_SPIN_COUNT=1
>> export MV2_SPIN_COUNT=100
>> export MV2_DEBUG_SHOW_BACKTRACE=1
>> export MV2_DEBUG_CORESIZE=unlimited
>>
>> --
>> Regards,
>> Maksym Planeta
>
> --
> Regards,
> Maksym Planeta
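The wrapper.sh referenced above is not shown in the thread; presumably it
is a thin script that exports the MVAPICH2 settings on every node and then
execs the real rank binary. A minimal sketch along those lines (its actual
contents are an assumption, not Maksym's script):

  #!/bin/sh
  # Sketch of a wrapper.sh: export the MVAPICH2 settings quoted in the
  # mail above on each node, then exec the real MPI program.
  export MV2_USE_SHARED_MEM=0        # intra-node shared memory off ("probably the most relevant")
  export MV2_USE_BLOCKING=0
  export MV2_ENABLE_AFFINITY=0
  export MV2_RDMA_NUM_EXTRA_POLLS=1
  export MV2_CM_MAX_SPIN_COUNT=1
  export MV2_SPIN_COUNT=100
  export MV2_DEBUG_SHOW_BACKTRACE=1
  export MV2_DEBUG_CORESIZE=unlimited
  exec "$@"                          # e.g. ./bin/lu.A.2

Because of the exec, the rank binary replaces the wrapper shell, so srun
(and DMTCP) see the MPI process directly rather than an extra shell in
between.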