Re: [Scalablecr-discuss] Running SCR examples
From: Kathryn M. <ka...@ll...> - 2013-11-11 18:00:40
Hi Jorge,

Is the environment variable SLURM_NODELIST set in your job environment? That's where SCR gets the 'nodeset' from.

Also, the interpose library is not well supported anymore and may not be fully functional. We recommend using the test_api program in the examples directory for initial testing.

Thanks,
Kathryn

On Nov 11, 2013, at 8:01 AM, Jorge Bellon <jb...@bs...> wrote:

> Hi,
> I am trying to check how SCR works in a test environment I've set up, which is
> made up of two compute nodes and one controller running SLURM.
>
> First, I installed all the required software, by following the README and
> INSTALL instructions. SLURM seems to work since I could queue some jobs and
> they executed successfully.
>
> Then, after installing and reading the SCR documentation, I tried to launch
> some examples to see how the checkpoints perform. These examples are
> included by default in "/usr/local/tools/scr-1.1/examples". I compiled all the
> binaries with the make command and then edited the scr_interpose.moab script in
> order to fit my system architecture:
>
> #!/bin/bash
> #MSUB -l partition=debug
> #MSUB -l nodes=2
> #MSUB -l resfailpolicy=ignore
>
> # above, tell MOAB / SLURM to not kill job allocation upon a node failure
>
> # specify what the name of a checkpoint file looks like
> export SCR_CHECKPOINT_PATTERN="rank_[0-9]+.ckpt"
>
> # specify where checkpoint directories should be written
> export SCR_PREFIX=/home/jbellon/checkpoints
>
> # instruct SCR to flush to the file system every 20 checkpoints
> export SCR_FLUSH=20
>
> # exit if there is less than an hour remaining (3600 seconds)
> export SCR_HALT_SECONDS=3600
>
> # attempt to run the job up to 3 times
> export SCR_RUNS=3
>
> # run the job with scr_srun
> /usr/local/tools/scr-1.1/bin/scr_srun -n16 -N2 ./test_interpose
>
> However, when I execute this script, the following error is shown:
> root@frontend-0:/usr/local/tools/scr-1.1/examples# ./scr_interpose.moab
> scr_srun: Started: Mon Nov 11 15:56:40 GMT 2013
> scr_srun: ERROR: Could not identify nodeset
>
> This is the output from 'sinfo' command:
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 30 down* compute-[3-32]
> debug* up infinite 2 idle compute-[1-2]
>
> Is there anything I am missing?
>
> Thanks in advance,
> Jorge

______________________________________________________________
Kathryn Mohror, ka...@ll..., http://people.llnl.gov/mohror1
CASC @ Lawrence Livermore National Laboratory, Livermore, CA, USA
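
The SLURM_NODELIST question above can be checked from the shell before scr_srun is called. The following is only a sketch: check_nodelist is a hypothetical helper name, not part of SCR, and the assumption is that the script is meant to run inside a SLURM allocation (for example one obtained with salloc or sbatch), which is where SLURM normally exports this variable. Running the script directly from a login shell would typically leave it unset.

```shell
#!/bin/bash
# Hypothetical helper (not part of SCR): report whether the SLURM node
# list is visible in the current environment.
check_nodelist() {
    if [ -z "${SLURM_NODELIST:-}" ]; then
        # Variable is unset or empty: probably not inside an allocation.
        echo "SLURM_NODELIST is not set; run inside a SLURM allocation" >&2
        return 1
    fi
    # Variable is present: print it so it can be compared with sinfo output.
    echo "SLURM_NODELIST=${SLURM_NODELIST}"
}
```

Placing `check_nodelist || exit 1` just before the scr_srun line in the job script would make the script stop early with a clearer message when the variable is missing.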