[Scalablecr-discuss] Running SCR with MVAPICH2 and BLCR
From: Arjun J R. <rec...@gm...> - 2013-12-26 05:00:03
I understand that SCR was built to be used with custom application codes that write their own checkpoints from within the application. However, MVAPICH2 claims to have integrated SCR so that the checkpoints written by BLCR can later be written to the parallel file system in a scalable manner. For now I am not writing the checkpoints to a central location, but to a local disk for testing.

I first installed BLCR and SLURM. Then I installed MVAPICH2 with the following options:

    ./configure --enable-ckpt --with-scr --with-pm=no --with-pmi=slurm

However, compiling a simple MPI program with mpicc and then running it with srun yields the following errors:

    srun -N2 -n12 MPIExecutable
    SCR v1.1-8 ABORT : rank 1 on machine2: Failed to create store descriptor for control directory @ src/mpid/ch3/channels/common/src/scr/scr_storedesc.c:299
    In: PMI_Abort(0,application called MPI_Abort(MPI_COMM_WORLD,0) - process 1)
    SCR v1.1-8 ABORT : rank 0 on machine1: Failed to create store descriptor for control directory @ src/mpid/ch3/channels/common/src/scr/scr_storedesc.c:299
    In: PMI_Abort(0,application called MPI_Abort(MPI_COMM_WORLD,0) - process 0)
    . . .
    SCR v1.1-8 ERROR: rank 3 on machine1: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
    SCR v1.1-8 ERROR: rank 1 on machine1: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
    . . .
    slurmd[machine1]: ***STEP 195.1 KILLED AT 2013-12-24T10:11:23 WITH SIGNAL 9***
    srun: Job step aborted: Waiting upto 2 seconds for job step to finish
    slurmd[machine1]: *** STEP 195.1 KILLED AT 2013-12-24T10:11:23 WITH SIGNAL 9****
    slurmd[machine2]: *** STEP 195.1 KILLED AT 2013-12-24T10:11:23 WITH SIGNAL 9****
    . . .

There was actually a lot more output, but I have included only one instance of each message type. I have set SCR_CNTL_BASE, SCR_RUNS, SCR_CACHE_BASE, SCR_PREFIX and SCR_FLUSH.

What could be wrong with the configuration or the environment of SCR to yield such errors?
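
One way to narrow this down is to check, on every node, whether the directories named by the SCR variables exist and accept new files. The following is a minimal sketch in Python, not part of SCR or MVAPICH2. It assumes the failure is related to a missing or unwritable directory, which the log does not confirm. The function names check_dir and report are made up for this illustration. Note also that, according to the error message above, SCR ignores SCR_CNTL_BASE when it comes from the environment, so the control directory actually in use may differ from the one tested here.

```python
import os
import tempfile

# Directory-valued variables mentioned in the post. SCR_RUNS and SCR_FLUSH
# are left out because they are not directory paths.
DIR_VARS = ["SCR_CNTL_BASE", "SCR_CACHE_BASE", "SCR_PREFIX"]


def check_dir(path):
    """Return (ok, reason): can a file be created under `path`?"""
    if not path:
        return False, "not set"
    if not os.path.isdir(path):
        return False, "not a directory"
    try:
        # Creating and auto-deleting a temp file tests real write access.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        return False, f"not writable: {exc}"
    return True, "ok"


def report(env=os.environ):
    """Check every variable in DIR_VARS against the given environment."""
    return {name: check_dir(env.get(name, "")) for name in DIR_VARS}


if __name__ == "__main__":
    host = os.uname().nodename
    for name, (ok, reason) in report().items():
        print(f"{host} {name}={os.environ.get(name)!r}: {reason}")
```

Saved under a hypothetical name such as check_scr_dirs.py, it could be launched once per node with something like "srun -N2 -n2 python check_scr_dirs.py" so that both machine1 and machine2 report their own view of the directories.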