========================================================================
Scalable Checkpoint / Restart (SCR) Library
========================================================================
The Scalable Checkpoint / Restart (SCR) library enables MPI applications
to utilize distributed storage on Linux clusters to attain high file I/O
bandwidth for checkpointing and restarting large-scale jobs. With SCR,
jobs run more efficiently, lose less work upon a failure, and reduce
load on critical shared resources such as the parallel file system and
the network infrastructure.
In the current SCR implementation, application checkpoint files are
cached in storage local to the compute nodes, and a redundancy scheme is
applied such that the cached files can be recovered even after a failure
disables part of the system. SCR supports the use of spare nodes such
that it is possible to restart a job directly from its cached
checkpoint, provided the redundancy scheme holds and provided there are
sufficient spares.
On current platforms, SCR scales linearly with the number of compute
nodes. It has been benchmarked at 1216GB/s on 992 nodes of Coastal at
Lawrence Livermore National Laboratory (LLNL). At this scale, it is two
orders of magnitude faster than the parallel file system.
The SCR library is written in C, and it currently only provides a C
interface. It has been used in production at LLNL since late 2007 using
RAM disk on Linux / x86-64 / Infiniband clusters. Production runs are
now underway using solid-state drives on the same cluster architecture.
The implementation assumes SLURM is used as the resource manager. The
library is designed and intended to be ported to other platforms and
resource managers. SCR is an open source project under a BSD license
hosted at: http://scalablecr.sourceforge.net.
Detailed usage is provided in the user manual (scr_users_manual.pdf),
and implementation details are in the developer's manual
(scr_developers_manual.pdf).
-----------------------------
Dependencies
-----------------------------
SCR depends on a number of other software packages.
Required:
pdsh -- parallel remote shell command (pdsh, dshbak)
http://sourceforge.net/projects/pdsh
e.g.,
wget http://downloads.sourceforge.net/project/pdsh/pdsh/pdsh-2.18/pdsh-2.18.tar.bz2
bunzip2 pdsh-2.18.tar.bz2
tar -xf pdsh-2.18.tar
pushd pdsh-2.18
./configure --prefix=/tmp
make
make install
Date::Manip -- Perl module for date/time interpretation
http://search.cpan.org/~sbeck/Date-Manip-5.54/lib/Date/Manip.pod
SLURM -- resource manager
http://sourceforge.net/projects/slurm
Optional:
libyogrt -- your one get remaining time library
Not currently available outside of LLNL.
Enables a running job to determine how much time it has left
in its resource allocation.
io-watchdog -- tool to catch and terminate hanging jobs
http://code.google.com/p/io-watchdog
Spawns a thread that can kill an MPI process (and hence
the job) when it detects that the process has not written
data in some specified amount of time. This is very useful
for catching and killing hanging jobs, so that SCR can restart
the job or fetch the latest checkpoint.
MySQL -- for logging events
http://www.mysql.com
SCR is designed to be ported to other resource managers, so with some
work, one may replace SLURM. Likewise, libyogrt is optional and can be
replaced with another mechanism.
-----------------------------
Configuration
-----------------------------
Most of the default settings in the SCR library can be adjusted. One
may customize default settings via configure options, compile-time
defines, or a system configuration file.
In particular, the following parameters likely need to be adjusted for
other platforms:
Cache and control directories:
It is critical to configure SCR with directory paths it should use
for cache and control directories. Each of these directories should
correspond to a path leading to a storage device that is local to the
compute node. For more information, see the SCR directories section
below.
scr.conf:
A number of options can (and some must) be specified for an SCR installation.
A convenient method to set these parameters is to create a system
configuration file. To help you get started, an example config file
is provided in scr.conf. Copy this file, customize it with your
desired settings, and configure your SCR build to know where to look
for it using the --with-scr-config-file configure option, e.g.,
configure --with-scr-config-file=/etc/scr.conf ...
libyogrt support:
If libyogrt is available, specify --with-yogrt[=PATH] during the
configure.
The function of libyogrt is to enable a job to determine how much
time it has left in its current allocation. While libyogrt is not
available outside of LLNL, other computer centers often have some
mechanism that provides similar functionality. It should be
straightforward to replace calls to libyogrt with something else. One
must simply modify the scr_env_seconds_remaining() implementation in
src/scr_env.c to return the number of seconds left in the job
allocation.
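For example, if a site exposes the allocation end time through an
environment variable, the replacement might look like the following
sketch. MY_JOB_END_TIME is a made-up variable name, and the return
type and error convention should be matched to the existing code in
src/scr_env.c:

  #include <stdlib.h>
  #include <time.h>

  /* hypothetical replacement: compute seconds left in the allocation
   * from an environment variable holding the end time (seconds since
   * the epoch) */
  long scr_env_seconds_remaining(void)
  {
    const char* value = getenv("MY_JOB_END_TIME"); /* made-up variable */
    if (value == NULL) {
      return -1; /* unknown; use whatever "unknown" value SCR expects */
    }
    long end  = atol(value);
    long now  = (long) time(NULL);
    long left = end - now;
    return (left > 0) ? left : 0;
  }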
MySQL DB Logging:
SCR can log events to a MySQL database. However, this feature is
still in development. To enable logging to MySQL, configure with
--with-mysql. If you would like to give it a try, see the
scr.mysql file for further instructions.
The compile-time define variables are specified in src/scr_conf.h. One
may modify the settings in this file before building to set different
defaults. In many cases, this method makes it possible to avoid
creating a system configuration file at all.
-----------------------------
Build and install
-----------------------------
Use the standard configure/make/make install sequence to configure,
build, and install the library, e.g.:
./configure \
--prefix=/usr/local/tools/scr-1.1 \
--with-scr-config-file=/etc/scr.conf
make
make install
To uninstall the installation:
make uninstall
SCR is layered on top of MPI, so it must be built for the MPI library
installed on the system. It uses the standard mpicc compiler wrapper.
If there are multiple MPI libraries, a separate SCR library must be
built for each MPI library.
If you are using the scr.conf file, you'll also need to edit this file
to match the settings on your system. Please open this file and make
any necessary changes -- it is self-documented with comments. After
modifying this file, copy it to the location specified in your configure
step. The configure option simply tells the SCR installation where to
look for the file; it does not modify or install the file.
-----------------------------
SCR directories
-----------------------------
There are three types of directories where SCR manages files.
(1) Cache directory
SCR instructs the application to write its checkpoint files to a cache
directory, and SCR computes and stores redundancy data here. SCR can
be configured to use more than one cache directory in a single run.
The cache directory must be local to the compute node.
The cache directory must be large enough to hold at least one full
application checkpoint. However, it is better if the cache is large
enough to store at least two checkpoints. When using storage local
to the compute node, the cache must be roughly 0.5x-3x the size of
the node's memory to be effective.
LLNL currently uses RAM disk for the cache directory, and LLNL is now
exploring SSDs.
The library is hard-coded to use /tmp as the cache directory. To
change this location, do any of the following:
* change the SCR_CACHE_BASE value in src/scr_conf.h,
* specify the new path during configure using the
--with-scr-cache-base=PATH option,
* or set the value in a system configuration file.
To enable multiple cache directories, one must use a system
configuration file.
(2) Control directory
SCR stores files to track its internal state in the control directory.
It also uses this directory to read commands from the user. Currently,
this directory must be local to the compute node. Files written to the
control directory are small and are read and written frequently,
so RAM disk works well here. There are a variety of files kept in
the control directory. For some of these files, all processes maintain
their own file, and for others, only rank 0 accesses the file.
The library is hard-coded to use /tmp as the control directory. To
change this location, do any of the following:
* change the SCR_CNTL_BASE value in src/scr_conf.h,
* specify the new path during configure using the
--with-scr-cntl-base=PATH option,
* or set the value in a system configuration file.
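For example, the compile-time route for either directory is a small edit
to src/scr_conf.h along these lines. The paths are illustrative only,
and the exact form of the existing defines should be matched to what is
already in that file:

  /* illustrative values only; match the existing defines in src/scr_conf.h */
  #define SCR_CNTL_BASE  "/tmp"   /* control directory on node-local storage */
  #define SCR_CACHE_BASE "/ssd"   /* cache directory, e.g., a node-local SSD */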
(3) Parallel file system prefix directory
This is the directory on the parallel file system from which SCR
fetches checkpoint sets and to which it flushes them.
The prefix directory defaults to the current working directory of the
application. To override this, the user can set the SCR_PREFIX
parameter at run time.
-----------------------------
RAM disk
-----------------------------
On systems where compute nodes are diskless, RAM disk is often the only
option for a checkpoint cache. When using RAM disk, one should
configure the hard limit to be around two-thirds of the amount of memory
on a node. This is a high limit; however, Linux only allocates space as
needed, so it is safe to use a high setting, and jobs which do not use
RAM disk will not be negatively affected by it. This limit must be set
by the system administrators.
-----------------------------
SCR library and commands
-----------------------------
There are two primary components that make up SCR.
(1) Library
First, there's the library the application calls. This library is
written in C and it makes calls to MPI and POSIX I/O functions to
manage the cached checkpoint files on the cluster. The library informs
the application where it should write checkpoint files, and then the
library applies the redundancy scheme to those checkpoint files after
the application has written them. It also supports a rebuild of lost
files upon a restart, and it supports flushes and fetches between the
cache and the prefix directory on the parallel file system.
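The following sketch shows roughly how an application drives this
process through the library's C interface. It is only an illustration;
see scr_users_manual.pdf for the actual prototypes and semantics, and
the checkpoint file name here is made up:

  #include <stdio.h>
  #include "mpi.h"
  #include "scr.h"

  int main(int argc, char** argv)
  {
    MPI_Init(&argc, &argv);
    SCR_Init();                    /* initialize SCR after MPI_Init */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 100; step++) {
      /* ... compute ... */

      int need = 0;
      SCR_Need_checkpoint(&need);  /* ask SCR whether to checkpoint now */
      if (need) {
        SCR_Start_checkpoint();

        /* ask SCR where to write this rank's checkpoint file;
         * SCR returns a path in its cache directory */
        char name[256];
        char file[SCR_MAX_FILENAME];
        sprintf(name, "ckpt.%d.step%d", rank, step);
        SCR_Route_file(name, file);

        int valid = 1;
        FILE* fp = fopen(file, "w");
        if (fp != NULL) {
          /* ... write this rank's checkpoint data to fp ... */
          fclose(fp);
        } else {
          valid = 0;
        }

        /* SCR applies its redundancy scheme once all ranks complete */
        SCR_Complete_checkpoint(valid);
      }
    }

    SCR_Finalize();                /* shut down SCR before MPI_Finalize */
    MPI_Finalize();
    return 0;
  }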
(2) Commands
Second, there are commands that must be run before and after the job.
After a resource allocation has been granted but before a run is
launched, these commands check that the cache and control directories
are available and healthy on each node. Nodes can be excluded or the
job can exit if this is not the case. If multiple runs are attempted
within the same resource allocation (e.g., start run / node fails /
start second run in what's left of the allocation), it's best to repeat
this check of the cache and control directories before each run, to
catch any node that used to be healthy but has since gone bad.
At the end of a job, commands must be run to drain the latest
checkpoint from the cache directory to the parallel file system.
This is necessary since the library may not have flushed its latest
checkpoint before the last run was killed.
The commands are just as vital as the library for SCR to function.
In particular, after a node failure, the commands must be able to
access the cache and control directories of the nodes which are still
healthy. Thus, the resource manager must be configured to allow these
commands to run on the remaining nodes of an allocation that has
sustained one or more failed nodes.
-----------------------------
Nodeset format
-----------------------------
Several of the commands process the nodeset allocated to the job. The
node names are expected to be of the form
clustername#
where clustername is a string of letters common to all nodes,
and # is the node number, e.g., atlas35 and atlas173.
The scr_hostlist.pm Perl module compresses lists of such node names
into a form like:
atlas[31-43,45,48-51]
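The underlying compression is simple: group consecutive node numbers
sharing the common prefix into ranges. A standalone C sketch of the
idea (not taken from scr_hostlist.pm, which is written in Perl):

  #include <stdio.h>

  /* print a sorted list of node numbers with a common prefix in the
   * bracketed range form, e.g. {31,32,33,35} -> "atlas[31-33,35]" */
  static void print_hostlist(const char* prefix, const int* nums, int count)
  {
    printf("%s[", prefix);
    int i = 0;
    while (i < count) {
      /* extend the run while numbers are consecutive */
      int j = i;
      while (j + 1 < count && nums[j + 1] == nums[j] + 1) j++;

      if (i > 0) printf(",");
      if (j > i) printf("%d-%d", nums[i], nums[j]);
      else       printf("%d", nums[i]);

      i = j + 1;
    }
    printf("]\n");
  }

  int main(void)
  {
    int nodes[] = {31,32,33,34,35,36,37,38,39,40,41,42,43,45,48,49,50,51};
    /* prints atlas[31-43,45,48-51] */
    print_hostlist("atlas", nodes, sizeof(nodes) / sizeof(nodes[0]));
    return 0;
  }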
-----------------------------
IPv4
-----------------------------
Currently, SCR expects to run on a system whose nodes have IPv4
addresses, and each node should have a unique IPv4 address. The
library uses this address to determine which processes are running
on the same compute node.
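The details are internal to the library, but the basic technique can be
illustrated with a standalone MPI program (a sketch of the idea, not SCR
source): each process resolves its node's IPv4 address, the addresses
are exchanged, and ranks reporting the same address are grouped onto the
same node.

  #include <mpi.h>
  #include <netdb.h>
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <unistd.h>
  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char** argv)
  {
    MPI_Init(&argc, &argv);

    int rank, ranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ranks);

    /* resolve this node's IPv4 address from its hostname */
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    struct hostent* he = gethostbyname(hostname);
    if (he == NULL) {
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
    struct in_addr addr;
    memcpy(&addr, he->h_addr_list[0], sizeof(addr));

    /* exchange addresses among all ranks */
    struct in_addr* all = malloc(ranks * sizeof(struct in_addr));
    MPI_Allgather(&addr, sizeof(addr), MPI_BYTE,
                  all,   sizeof(addr), MPI_BYTE, MPI_COMM_WORLD);

    /* ranks reporting the same address share a compute node */
    int peers = 0;
    for (int i = 0; i < ranks; i++) {
      if (all[i].s_addr == addr.s_addr) peers++;
    }
    printf("rank %d on %s shares its node with %d ranks (including itself)\n",
           rank, inet_ntoa(addr), peers);

    free(all);
    MPI_Finalize();
    return 0;
  }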
-----------------------------
SLURM dependencies
-----------------------------
The current implementation assumes it is running on a system using
SLURM as the resource manager. However, it should be straightforward
to extend SCR to work with other resource managers. Here are the
locations where SCR currently depends on SLURM.
scripts/scr_env.in
  --nodes option
    reads SLURM_NODELIST to get the nodeset allocated to a job
  --jobid option
    reads SLURM_JOBID to get the jobid
  --down option
    reads SLURM_NODELIST and invokes "sinfo" to identify known down
    nodes
scripts/scr_halt.in
  --all and --user options
    invoke "squeue" to determine jobids
  --immediate option
    calls "scontrol" and "scancel" to determine the SLURM jobstep and
    to stop the job
  all options
    call "squeue" to verify the jobid is really running
    call "squeue" to get the nodeset for the jobid to run pdsh
scripts/scr_srun.in
  calls "srun" to run the job
  attempts to avoid running on bad nodes via "srun -x"
  attempts to ensure rank 0 runs on the same node via "srun -w"
src/scr_env.c
  scr_env_jobid() - reads SLURM_JOBID to get the jobid string
Additionally, the current implementation depends on SLURM for the
following features:
* To clean the cache and control directories between job allocations
via an rm command executed in the SLURM prolog or epilog.
* To kill a job which has suffered a failure.
* To allow the batch script to continue running in an allocation
which has lost one or more nodes.
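As with libyogrt, porting mostly amounts to swapping the environment
variables and commands listed above for their equivalents. For
instance, scr_env_jobid() might be pointed at a different variable with
a change along these lines; OTHER_RM_JOBID is a made-up name, and the
exact signature and memory convention should be matched to the existing
code in src/scr_env.c:

  #include <stdlib.h>
  #include <string.h>

  /* hypothetical port: return the jobid string from another resource
   * manager's environment variable instead of SLURM_JOBID */
  char* scr_env_jobid(void)
  {
    const char* value = getenv("OTHER_RM_JOBID"); /* made-up variable */
    if (value == NULL) {
      return NULL;
    }
    return strdup(value);
  }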
-----------------------------
pdsh dependencies
-----------------------------
The current implementation uses pdsh to run commands on the nodes
allocated to the job. With some work, one could replace pdsh with
another remote execution tool. Here are the locations where pdsh
is used.
scripts/scr_flush.in
  invokes pdsh to run "scr_copy" on each node to copy files from
  cache to the parallel file system
scripts/scr_halt.in
  invokes pdsh to run "scr_halt_cntl" on each node to update the halt
  file
scripts/scr_inspect.in
  invokes pdsh to run "scr_inspect_cache" on each node to identify
  which checkpoints need to be flushed, if any
scripts/scr_list_down_nodes.in
  invokes pdsh to run "scr_check_node" on each node
-----------------------------
Dependency on files in common directory
-----------------------------
Some commands must be able to read / write files also accessed by
the library.
scripts/scr_env.in
  --runnodes option
    reads "nodes.scrinfo" from the control directory to report the
    number of nodes used in the last run
scripts/scr_flush_file.in
  reads the "flush.scrinfo" file from the control directory to
  identify whether any checkpoints need to be flushed from cache
src/scr_inspect_cache.c
  reads "filemap.scrinfo" and "filemap_#.scrinfo" files to locate
  files in cache
src/scr_copy.c
  reads "filemap.scrinfo" and "filemap_#.scrinfo" files to locate
  files in cache
src/scr_halt_cntl.c
  reads and writes "halt.scrinfo"
scripts/scr_retries_halt.in
  reads the "halt.scrinfo" file to determine whether the current run
  should halt
src/scr_transfer.c
  reads and writes "transfer.scrinfo" to get the list of files for
  asynchronous flush
Here is how the control directory files are currently accessed.
File               Library                    Commands
---------------------------------------------------------------------
halt.scrinfo       rank 0                     scr_halt_cntl,
                                              scr_retries_halt
nodes.scrinfo      master rank on each node   scr_env
flush.scrinfo      master rank on each node   scr_flush_file
filemap.scrinfo    master rank on each node   scr_inspect_cache,
                                              scr_copy
filemap_#.scrinfo  every rank                 scr_inspect_cache,
                                              scr_copy
transfer.scrinfo   master rank on each node   scr_transfer
-----------------------------
Run location and program language
-----------------------------
Different commands may run in different places.
Compute node only:
  scripts/scr_check_node.in        perl
  src/scr_inspect_cache.c          C
  src/scr_copy.c                   C
  src/scr_halt_cntl.c              C
  src/scr_transfer.c               C
Batch script only:
  scripts/scr_prerun.in            bash
  scripts/scr_srun.in              bash
  scripts/scr_postrun.in           bash
  scripts/scr_flush.in             perl
  scripts/scr_env.in               perl
  scripts/scr_cntl_dir.in          perl
  scripts/scr_list_down_nodes.in   perl
  src/scr_log_event.c              C
  src/scr_log_transfer.c           C
  src/scr_retries_halt.c           C
  src/scr_flush_file.c             C
  src/scr_nodes_file.c             C
Within or external to batch script:
  scripts/scr_halt.in              perl
  scripts/scr_glob_hosts.in        perl
  src/scr_index.c                  C
  src/scr_rebuild_xor.c            C
  src/scr_crc32.c                  C
  src/scr_print_hash_file.c        C