[Scalablecr-discuss] Fwd: Checkpointing using SCR
From: AJR <ico...@gm...> - 2013-12-05 11:42:35
I had BLCR in mind when I asked the question. Yes, MVAPICH2 does claim that they have integrated SCR with MVAPICH. But whereas I was previously able to easily run a Hello World MPI job using srun -N2 -n4 ./MPIHello, after I installed MVAPICH2's SCR support (using the --with-scr flag during configuration) I get the following output when I run the same command. Maybe they have made every call to srun call scr_srun or something similar.

My question is about what could be causing the following kind of error. It seems to be some kind of configuration glitch with SCR.

scr_srun: Started: Wed Dec 4 19:06:23 IST 2013
scr_srun: prerun: Wed Dec 4 19:06:24 IST 2013
scr_prerun: Started: Wed Dec 4 19:06:24 IST 2013
scr_prerun: Ended: Wed Dec 4 19:06:24 IST 2013
scr_prerun: secs: 0
scr_prerun: exit code: 0
scr_srun: RUN 1: Wed Dec 4 19:06:24 IST 2013
SCR v1.1-8 ERROR: rank 1 on qdr3: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
SCR v1.1-8 ERROR: rank 3 on qdr4: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
SCR v1.1-8 ERROR: rank 0 on qdr3: SCR_CNTL_BASE cannot be set in the environment or user configuration file, ignoring setting
SCR v1.1-8 WARNING: rank 0 on qdr3: Failed to record cluster name @ src/mpid/ch3/channels/common/src/scr/scr.c:367
scr_srun: $SCR_RUNS exhausted, ending run.
scr_srun: postrun: Wed Dec 4 19:06:25 IST 2013
scr_postrun: Started: Wed Dec 4 19:06:25 IST 2013
scr_postrun: UPNODES: qdr[3-4]
scr_postrun: Looking for latest checkpoint set id
scr_postrun: Found no checkpoint set to flush on qdr3
scr_postrun: Ended: Wed Dec 4 19:06:26 IST 2013
scr_postrun: secs: 1
scr_postrun: exit code: 1
scr_srun: ERROR: Command failed: scr_postrun -p /home/arjun/Checkpoint
scr_srun: Ended: Wed Dec 4 19:06:26 IST 2013

On Tue, Dec 3, 2013 at 2:01 AM, Moody, Adam T. <mo...@ll...> wrote:
> Hi Rommel,
> Most of the applications using SCR contain logic to write their own
> checkpoints, so we don't have a lot of experience yet with third-party
> checkpointing packages. Having said that, we know that the MVAPICH MPI
> team has integrated BLCR (Berkeley Lab Checkpoint Restart) with SCR.
>
> http://mvapich.cse.ohio-state.edu
>
> Also, there is a research-level checkpointing package at LLNL which
> implements a persistent heap (variables stored in this region are
> checkpointed with a call to msync). This package has also been integrated
> with SCR.
>
> http://e-reports-ext.llnl.gov/pdf/754806.pdf
>
> Is there a particular checkpointing package you have in mind?
> -Adam
>
> ------------------------------
> *From:* AJR [ico...@gm...]
> *Sent:* Sunday, December 01, 2013 11:44 PM
> *To:* sca...@li...
> *Subject:* [Scalablecr-discuss] Checkpointing using SCR
>
> Hi,
> You mention in your manuals that SCR is meant to work well with
> checkpointing packages that write a file per MPI process. Could you give
> some examples of the kind of checkpointing packages that will work well
> with SCR?
>
> Rommel Lucas
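
The SCR_CNTL_BASE error in the log above says the setting was found in the environment or in a user configuration file. A quick way to narrow that down might be a small shell check like the one below. This is only a sketch: the function name check_scr_cntl_base is made up for illustration, the configuration file path is whatever you pass in (no default location is assumed), and it only inspects the shell it runs in, not the environment the MPI ranks actually see on each node.

```shell
#!/bin/sh
# Hypothetical helper (not part of SCR or MVAPICH2): report where
# SCR_CNTL_BASE is set, so the offending setting can be removed.
# $1 is an optional path to a user configuration file chosen by the
# caller; this script does not assume any default location for it.
check_scr_cntl_base() {
    conf_file="$1"
    found=0

    # Case 1: the variable is present in the current environment
    # (the +x form also catches a variable set to the empty string).
    if [ -n "${SCR_CNTL_BASE+x}" ]; then
        echo "environment: SCR_CNTL_BASE=$SCR_CNTL_BASE"
        found=1
    fi

    # Case 2: the name appears in the given configuration file.
    # grep -n prints each matching line with its line number.
    if [ -n "$conf_file" ] && [ -f "$conf_file" ]; then
        if grep -n 'SCR_CNTL_BASE' "$conf_file"; then
            found=1
        fi
    fi

    if [ "$found" -eq 0 ]; then
        echo "SCR_CNTL_BASE not set in the places checked"
    fi

    # Exit status 0 means nothing was found, 1 means at least one hit.
    return "$found"
}
```

Calling check_scr_cntl_base with no argument checks only the environment; passing a file path also searches that file. If the variable turns up in the environment, running unset SCR_CNTL_BASE before srun would be one thing to try, though whether that clears the error in this MVAPICH2 build is not something the log confirms.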