[Scalablecr-discuss] Running SCR examples
From: Jorge B. <jb...@bs...> - 2013-11-11 16:21:18
Hi,

I am trying to check how SCR works in a test environment I've set up, which is made up of two compute nodes and one controller running SLURM.

First, I installed all the required software by following the README and INSTALL instructions. SLURM seems to work, since I could queue some jobs and they executed successfully.

Then, after installing SCR and reading its documentation, I tried to launch some examples to see how checkpointing performs. These examples are included by default in "/usr/local/tools/scr-1.1/examples". I compiled all the binaries with make and then edited the scr_interpose.moab script to fit my system:

    #!/bin/bash
    #MSUB -l partition=debug
    #MSUB -l nodes=2
    #MSUB -l resfailpolicy=ignore
    # above, tell MOAB / SLURM to not kill job allocation upon a node failure

    # specify what the name of a checkpoint file looks like
    export SCR_CHECKPOINT_PATTERN="rank_[0-9]+.ckpt"

    # specify where checkpoint directories should be written
    export SCR_PREFIX=/home/jbellon/checkpoints

    # instruct SCR to flush to the file system every 20 checkpoints
    export SCR_FLUSH=20

    # exit if there is less than an hour remaining (3600 seconds)
    export SCR_HALT_SECONDS=3600

    # attempt to run the job up to 3 times
    export SCR_RUNS=3

    # run the job with scr_srun
    /usr/local/tools/scr-1.1/bin/scr_srun -n16 -N2 ./test_interpose

However, when I execute this script, the following error is shown:

    root@frontend-0:/usr/local/tools/scr-1.1/examples# ./scr_interpose.moab
    scr_srun: Started: Mon Nov 11 15:56:40 GMT 2013
    scr_srun: ERROR: Could not identify nodeset

This is the output from the 'sinfo' command:

    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    debug*    up    infinite     30 down* compute-[3-32]
    debug*    up    infinite      2 idle  compute-[1-2]

Is there anything I am missing?

Thanks in advance,
Jorge
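
One thing that may be worth checking, given that the script was started directly from a root shell on the front end: whether that shell is inside a SLURM job allocation at all. When a script is run directly with bash, the #MSUB lines are plain comments and no allocation is requested. The sketch below only inspects standard SLURM environment variables (SLURM_JOB_ID, SLURM_JOB_NODELIST, and the older SLURM_NODELIST). The helper name in_slurm_allocation is hypothetical and not part of SCR, and the idea that scr_srun needs an existing allocation to identify its nodeset is an assumption, not something confirmed in this message.

```shell
#!/bin/bash
# Hypothetical helper (not part of SCR): succeed only if the current shell
# appears to be running inside a SLURM job allocation.
in_slurm_allocation() {
    # SLURM exports a job ID and a node list inside an allocation;
    # SLURM_NODELIST is the older name for SLURM_JOB_NODELIST.
    [ -n "${SLURM_JOB_ID:-}" ] &&
        [ -n "${SLURM_JOB_NODELIST:-${SLURM_NODELIST:-}}" ]
}

if in_slurm_allocation; then
    echo "inside allocation: job ${SLURM_JOB_ID} on ${SLURM_JOB_NODELIST:-${SLURM_NODELIST:-}}"
else
    echo "no SLURM allocation detected"
fi
```

If this reports no allocation, a possible next step would be to run scr_interpose.moab from a shell obtained with salloc (for example `salloc -N2 -p debug`) or to submit it with sbatch, and see whether the nodeset error persists.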