Hi,
> I am running SLURM and all the other stuff needed to run SCR on a single computer, as a way to evaluate SCR. I have a simple C counting program that I wish to run and take periodic checkpoints of. My executable name is "count". How can I get SCR to take periodic checkpoints? I have specified two separate directories for the two kinds of checkpoint images.
>
> When I installed SCR, I don't have any command such as scr_srun on my system.
> When I type in scr_srun, I get the error "Command not found"
Possibly you did these steps already, but just as a sanity check:
Did you do a 'make install'?
Did you set your path to point to the installation directory for SCR?
setenv PATH /usr/local/tools/scr-1.1/bin:${PATH}
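The setenv line above is csh/tcsh syntax. If your login shell is bash or another sh-style shell, a rough equivalent, plus a quick check that the shell can now find scr_srun, might look like the sketch below. It assumes the install prefix from your message; SCR_BIN is just a variable name I made up for illustration.

```shell
# Assumed install location, taken from the path mentioned in this thread.
SCR_BIN=/usr/local/tools/scr-1.1/bin

# sh/bash equivalent of the csh 'setenv PATH ...' line.
export PATH="${SCR_BIN}:${PATH}"

# Check whether the shell can now locate scr_srun.
if command -v scr_srun >/dev/null 2>&1; then
  echo "scr_srun found at $(command -v scr_srun)"
else
  echo "scr_srun is still not on PATH; check that 'make install' completed"
fi
```

To make this stick across logins, the export line would normally go in your shell startup file.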
>
> The executables are stored in /usr/local/tools/scr-1.1/bin/ and when I execute just scr_srun using ./scr_srun I get
> scr_srun: Started: Mon Sep 30 12:58:22 IST 2013
> scr_srun: ERROR: Could not identify node set
scr_srun assumes you are executing inside an allocation given to you by SLURM and that the environment variable SLURM_NODELIST is set. You can get an interactive allocation with salloc, e.g.
salloc -N 1 -p <queuename>
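Once salloc has given you a shell, you can confirm that the variable scr_srun looks for is actually set before trying again. This is only a small sketch for sh-style shells; check_alloc is a helper name I made up, not part of SCR or SLURM.

```shell
# Hypothetical helper: reports whether this shell is inside a SLURM allocation,
# judged only by whether SLURM_NODELIST is set and non-empty.
check_alloc() {
  if [ -z "${SLURM_NODELIST:-}" ]; then
    echo "SLURM_NODELIST is not set; start an allocation with salloc first"
    return 1
  fi
  echo "Allocated nodes: ${SLURM_NODELIST}"
}

# Run the check; the fallback message avoids aborting a script on failure.
check_alloc || echo "Not inside a SLURM allocation"
```

If it prints a node list, scr_srun should get past the "Could not identify node set" error.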
If I recall correctly, you may have problems running on a single node. SCR will want to find locations for storing redundant checkpoints that are not on the same node. That way if a node goes down, the checkpoints are protected on the other node. I think that it will exit with an error message if you run on a single node.
Hope that helps!
Kathryn
>
> Nandaka Jojha
>
______________________________________________________________
Kathryn Mohror, ka...@ll..., http://people.llnl.gov/mohror1
CASC @ Lawrence Livermore National Laboratory, Livermore, CA, USA