Re: [Scalablecr-discuss] Checkpointing using SCR
From: Kathryn M. <ka...@ll...> - 2013-12-06 02:20:47
Hi Rommel,

Do you have SCR_RUNS set in your job environment? It looks to me like the scr_srun script is checking to see if the number of runs is exhausted, and since it thinks it is, it starts exit procedures. You might try setting it to 3 to see what happens.

Kathryn

On Dec 5, 2013, at 3:42 AM, AJR <ico...@gm...> wrote:

> I had BLCR in mind when I asked the question. Yea, MVAPICH2 does claim that they have integrated SCR with MVAPICH. But, when I was previously able to easily run a Hello World MPI job using srun -N2 -n4 ./MPIHello, after I installed MVAPICH2's SCR support (Using the --with-scr flag during configuration) I get the following output while I run the same command. Maybe they have made every call to srun call scr_srun or something similar. My question is about what could be causing the following kind of error. Seems to be some kind of configuration glitch with SCR.
>
> scr_srun: Started: Wed Dec 4 19:06:23 IST 2013
> scr_srun: prerun: Wed Dec 4 19:06:24 IST 2013
> scr_prerun: Started: Wed Dec 4 19:06:24 IST 2013
> scr_prerun: Ended: Wed Dec 4 19:06:24 IST 2013
> scr_prerun: secs: 0
> scr_prerun: exit code: 0
> scr_srun: RUN 1: Wed Dec 4 19:06:24 IST 2013
> SCR v1.1-8 ERROR: rank 1 on qdr3: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
> SCR v1.1-8 ERROR: rank 3 on qdr4: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
> SCR v1.1-8 ERROR: rank 0 on qdr3: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
> SCR v1.1-8 WARNING: rank 0 on qdr3: Failed to record cluster name @ src/mpid/ch3/channels/common/src/scr/scr.c:367
> scr_srun: $SCR_RUNS exhausted, ending run.
> scr_srun: postrun: Wed Dec 4 19:06:25 IST 2013
> scr_postrun: Started: Wed Dec 4 19:06:25 IST 2013
> scr_postrun: UPNODES: qdr[3-4]
> scr_postrun: Looking for latest checkpoint set id
> scr_postrun: Found no checkpoint set to flush on qdr3
> scr_postrun: Ended: Wed Dec 4 19:06:26 IST 2013
> scr_postrun: secs: 1
> scr_postrun: exit code: 1
> scr_srun: ERROR: Command failed: scr_postrun -p /home/arjun/Checkpoint
> scr_srun: Ended: Wed Dec 4 19:06:26 IST 2013
>
> On Tue, Dec 3, 2013 at 2:01 AM, Moody, Adam T. <mo...@ll...> wrote:
> Hi Rommel,
> Most of the applications using SCR contain logic to write their own checkpoints, so we don't have a lot of experience yet with third-party checkpointing packages. Having said that, we know that the MVAPICH MPI team has integrated BLCR (Berkeley Lab Checkpoint Restart) with SCR.
>
> http://mvapich.cse.ohio-state.edu
>
> Also, there is a research-level checkpointing package at LLNL which implements a persistent heap (variables stored in this region are checkpointed with a call to msync). This package has also been integrated with SCR.
>
> http://e-reports-ext.llnl.gov/pdf/754806.pdf
>
> Is there a particular checkpointing package you have in mind?
> -Adam
>
> From: AJR [ico...@gm...]
> Sent: Sunday, December 01, 2013 11:44 PM
> To: sca...@li...
> Subject: [Scalablecr-discuss] Checkpointing using SCR
>
> Hi,
> You mention in your manuals that SCR is meant to work well with checkpointing packages that write a file per MPI process. Could you give some examples of the kind of checkpointing packages that will work well with SCR?
>
> Rommel Lucas
>
> _______________________________________________
> Scalablecr-discuss mailing list
> Sca...@li...
> https://lists.sourceforge.net/lists/listinfo/scalablecr-discuss

______________________________________________________________
Kathryn Mohror, ka...@ll..., http://people.llnl.gov/mohror1
CASC @ Lawrence Livermore National Laboratory, Livermore, CA, USA
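
For reference, the suggestion at the top of this message (setting SCR_RUNS to 3 in the job environment before rerunning) could be tried roughly as follows. This is a minimal sketch, not a confirmed fix: the value 3 and the srun command line are taken from the thread, while the check for whether srun is installed is an added assumption so the snippet is harmless on a machine without SLURM. The thread does not confirm that this clears the "$SCR_RUNS exhausted" message.

```shell
#!/bin/sh
# Sketch only: export SCR_RUNS as suggested in the reply, then rerun the job.
# Whether this resolves the "$SCR_RUNS exhausted" message is not confirmed here.
export SCR_RUNS=3
echo "SCR_RUNS=${SCR_RUNS}"

# Assumption: only launch when srun is actually available on this host.
if command -v srun >/dev/null 2>&1; then
    # Same command line as in the quoted message.
    srun -N2 -n4 ./MPIHello
else
    echo "srun not found; skipping job launch"
fi
```

In a batch job, the export line would go in the job script before the srun call so that the launched processes inherit the variable.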