Re: [Scalablecr-discuss] Running SCR with MVAPICH2 and BLCR
From: Kathryn M. <ka...@ll...> - 2014-01-14 00:53:40
Hi Arjun,

Sorry for the late reply -- I took some vacation.

Just as a first guess, did you make sure that you have write permissions on the location of the control directory? I believe (and the error message supports this) that SCR_CNTL_BASE can't be set in the environment or in a user configuration file, but has to be set in the SCR system configuration file. It defaults to /tmp. Do you have a /tmp?

Kathryn

On Dec 25, 2013, at 8:59 PM, Arjun J Rao <rec...@gm...> wrote:

> I understand that SCR was built to be used with custom application codes that
> write their own checkpoints from within the application. However, MVAPICH2
> claims to have integrated SCR such that the checkpoints written by BLCR can
> be written to the parallel file system in a scalable manner later. However, I
> am not currently writing out the checkpoints to a central location, but to a
> local disk for testing.
>
> I first installed BLCR and SLURM. Then I installed MVAPICH2 with the
> following options:
> ./configure --enable-ckpt --with-scr --with-pm=no --with-pmi=slurm
>
> However, taking a simple MPI program and compiling using mpicc and then
> running using srun yields the following errors:
>
> srun -N2 -n12 MPIExecutable
> SCR v1.1-8 ABORT : rank 1 on machine2: Failed to create store descriptor for control directory @ src/mpid/ch3/channels/common/src/scr/scr_storedesc.c:299
> In: PMI_Abort(0,application called MPI_Abort(MPI_COMM_WORLD,0) - process 1)
> SCR v1.1-8 ABORT : rank 0 on machine1: Failed to create store descriptor for control directory @ src/mpid/ch3/channels/common/src/scr/scr_storedesc.c:299
> In: PMI_Abort(0,application called MPI_Abort(MPI_COMM_WORLD,0) - process 0)
> .
> .
> .
> SCR v1.1-8 ERROR: rank 3 on machine1: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
> SCR v1.1-8 ERROR: rank 1 on machine1: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
> .
> .
> .
> slurmd[machine1]: ***STEP 195.1 KILLED AT 2013-12-24T10:11:23 WITH SIGNAL 9***
> srun: Job step aborted: Waiting upto 2 seconds for job step to finish
>
> slurmd[machine1]: *** STEP 195.1 KILLED AT 2013-12-24T10:11:23 WITH SIGNAL 9****
> slurmd[machine2]: *** STEP 195.1 KILLED AT 2013-12-24T10:11:23 WITH SIGNAL 9****
> .
> .
> .
>
> There was actually a lot of output, but I've printed only one version
> of each of the message types in the output. I have set SCR_CNTL_BASE, SCR_RUNS,
> SCR_CACHE_BASE, SCR_PREFIX and SCR_FLUSH. What could be wrong with the
> configuration or the environment of SCR to yield such errors?

_________________________________________________________________
Kathryn Mohror, ka...@ll..., http://scalability.llnl.gov/
Scalability Team @ Lawrence Livermore National Laboratory, Livermore, CA, USA