From: Bryan F P. <bf...@pu...> - 2013-11-04 19:54:09
Hi Artem, I've had a bit of success with this, but if I log off the nodes
where I'm running the job and attempt to restart it at a later time, I'm
seeing errors. Here are some more details.

The successful part:
1) I start up a multi-node PBS session.
2) In a separate window, I start up "dmtcp_coordinator".
3) In the PBS session, I then start up the job using
       dmtcp_checkpoint -h `hostname` --rm mpiexec -np 4 ./matmat2
4) In the coordinator window, I'm now able to checkpoint and kill the
   job, using "c" or "k".
5) In the PBS session, I'm able to successfully restart the job using
   ./dmtcp_restart_script.sh, and the job successfully restarts on the
   nodes that were assigned by the PBS session.

However, if I now log out of the PBS session and start a new one, I'm
not able to get the job to restart:
1) I again start up a multi-node PBS session.
2) In a separate window, I again start up "dmtcp_coordinator".
3) In the PBS session, I now run ./dmtcp_restart_script.sh. However, I
   now see the error:

which: no dmtcp_discover_rm in (/apps/rhel6/openmpi/1.6.3/intel-13.1.1.163/bin:/apps/rhel6/intel/composer_xe_2013.3.163/bin/intel64:/opt/intel/mic/bin:/apps/rhel6/intel/inspector_xe_2013/bin64:/apps/rhel6/intel/advisor_xe_2013/bin64:/apps/rhel6/intel/vtune_amplifier_xe_2013/bin64:/apps/rhel6/intel/opencl-1.2-3.0.67279/bin:/home/bfp/bin:.:/home/bfp/dmtcp-trunk/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/hpss/bin:/opt/hsi/bin:/opt/bin:/usr/pbs/bin:/opt/moab/bin:/usr/site/rcac/scripts)

----- Original Message -----
And of course you need to include "--rm":
    dmtcp_checkpoint -h <frontend> --rm mpiexec ./hellompi

2013/11/4 Artem Polyakov < art...@gm... >
Bryan! There is one more error that I made in the sample scripts I sent
you! We need to run everything under DMTCP control, mpirun/mpiexec too!
So the line in the job script should be the following:
    dmtcp_checkpoint -h <frontend> mpiexec ./hellompi
I believe that the README contains a correct example.
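A note on that failure mode: the "which: no dmtcp_discover_rm" message means the restart script could not find dmtcp_discover_rm anywhere on the new session's PATH, even though /home/bfp/dmtcp-trunk/bin is on it, so the binary may simply not have been installed there (in the source tree it is built under contrib/rm). A hedged pre-flight check, with the contrib/rm location as an assumption:

```shell
# Before running the restart script in a fresh PBS session, make sure
# dmtcp_discover_rm is actually reachable on PATH:
if ! command -v dmtcp_discover_rm >/dev/null 2>&1; then
    # Assumed location: dmtcp_discover_rm is built under contrib/rm in
    # the DMTCP source tree; adjust if it was installed elsewhere.
    export PATH=$HOME/dmtcp-trunk/contrib/rm:$PATH
fi
./dmtcp_restart_script.sh
```

This is only a sketch; the right fix may instead be to copy dmtcp_discover_rm into the bin directory that the restart script already searches.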
Sorry once again.

2013/11/2 Artem Polyakov < art...@gm... >
Hello, Bryan.

2013/11/2 Bryan F Putnam < bf...@pu... >
Hi Gene and Artem, I suppose what would be useful for me to see would be
a sample PBS runscript that starts a multi-node session (e.g.
nodes=4:ppn=2), starts up a simple MPI job using
dmtcp_coordinator/dmtcp_launch, etc., and checkpoints the job, for
example, every 60 seconds. And then, assuming that job is killed or
terminated, a second PBS runscript that will restart that job. For
example, Artem has attempted to do that with his two scripts:

#PBS -N hellompi
#PBS -l nodes=4:ppn=2
#PBS -j oe
cd $PBS_O_WORKDIR
export PATH=<path-to-dmtcp>:$PATH
mpiexec dmtcp_checkpoint -h jet ./hellompi

#PBS -N hellompi
#PBS -l nodes=4:ppn=2
#PBS -j oe
cd $PBS_O_WORKDIR
export PATH=<path-to-dmtcp>:$PATH
export DMTCP_HOST=jet
./dmtcp_restart_script.sh

However, I'm confused about what the first script is doing. Is it
immediately checkpointing a job? I see for example something similar to

    dmtcp_launch mpiexec -np 8 ./a.out
    dmtcp_command --checkpoint
    ./dmtcp_restart_script.sh

The first script should be slightly modified, since "jet" is the name of
the frontend on my cluster. You should put your cluster's frontend name
there:

#PBS -N hellompi
#PBS -l nodes=4:ppn=2
#PBS -j oe
cd $PBS_O_WORKDIR
export PATH=<path-to-dmtcp>:$PATH
mpiexec dmtcp_checkpoint -h <frontend> ./hellompi

The same goes for the second script. Next, dmtcp_checkpoint is a command
to run your program under DMTCP control, BUT!! I also suggested that you
run the DMTCP coordinator in a separate terminal on the frontend. This
is good for debugging. In the interactive coordinator you can checkpoint
the application using one-character commands like 'c' <Enter>. So you do
the following:
1. Start the coordinator on the frontend in a separate terminal.
2. Run the batch script with the -h option pointing to the frontend.
3. In the coordinator window you should see that processes are connected
   and running. If not, something is already wrong.
You'll also see on what hosts they are running, which is also valuable
information.
4. If everything is OK, you can checkpoint by hitting the 'c' key and
   pressing Enter.
5. Check the status of the jobs after the checkpoint: they should move
   into the RUNNING state again.
6. Kill the job.
Next you submit the restart script, setting the DMTCP_HOST variable to
the frontend just as you did with the '-h' option of dmtcp_checkpoint.
Other actions should be similar.

in the QUICK-START file, and which does actually work for a single-node
job, but I don't see any description of a command of the form:
    mpiexec dmtcp_command ...
In the file dmtcp-2.0/contrib/rm/README I see a third method of doing
the same thing, for example,
    dmtcp_launch --rm (without using mpiexec or mpirun at all)
and then doing
    dmtcp_coordinator &
    dmtcp_restart_script.sh
to restart the job. Anyway, I'll look some more into this, and the
documentation, and let you know if I can give any helpful suggestions.
Thanks! Bryan

----- Original Message -----
> Hi Gene,
>
> Yes, I can take a look at the documentation and try to give some
> suggestions. I'll get back to you soon on that.
>
> Our mvapich2 builds are configured with
>     --with-device=ch3:mrail \
>     --with-rdma=gen2 \
> and they don't run on TCP/IP networks, only IB. However, we do have
> some mpich2 and mpich-3 builds (upon which mvapich2 is based) which
> use TCP/IP, and I'm able to successfully checkpoint, kill, and restart
> parallel mpich2 jobs as long as they are using only a single node.
>
> In general, DMTCP appears to be working well for me, as long as the
> job is running on a single node. I can checkpoint, kill the job, and
> restart it, and it will restart again, even on a different node. It's
> just that when more than one node is involved, DMTCP doesn't appear
> to be retaining information about the remote nodes, and it restarts
> everything on whatever localhost it is restarted on.
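Putting those pieces together, a sketch of a single initial job script that also covers the "checkpoint every 60 seconds" request. This is hedged: <frontend> and <path-to-dmtcp> are placeholders as elsewhere in this thread, and it assumes dmtcp_checkpoint's -i/--interval option for automatic periodic checkpoints:

```shell
#PBS -N hellompi
#PBS -l nodes=4:ppn=2
#PBS -j oe
cd $PBS_O_WORKDIR
export PATH=<path-to-dmtcp>:$PATH
# -h points at a coordinator already running on the frontend;
# -i 60 (an assumption: DMTCP's checkpoint-interval option) asks for an
# automatic checkpoint every 60 seconds instead of manual 'c' commands.
dmtcp_checkpoint -h <frontend> -i 60 --rm mpiexec ./hellompi
```

The matching restart script would be the DMTCP_HOST variant already shown in this thread.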
Perhaps I'm just
> missing something simple; I'm having difficulty understanding the use
> of the "rm" plugin.
>
> I was also able to checkpoint and restart a parallel Gaussian09 job,
> which doesn't use MPI at all. But again, it only worked when the
> parallel job was a single-node job.
>
> Thanks,
> Bryan
>
> ----- Original Message -----
> > Hi Bryan,
> > Also, we've been thinking about how to improve the documentation
> > for the resource managers (Torque and SLURM). We always get good
> > insights on this by looking at people seeing it for the first time.
> > If you should have the time, could you make some rough notes on
> > how we can improve our documentation (what to emphasize, extra
> > pointers to include, etc.)?
> >
> > As for InfiniBand, we're now tracking down still one more bug (a
> > race condition). For InfiniBand, please continue updating from our
> > svn:
> >     svn co svn://svn.code.sf.net/p/dmtcp/code/trunk dmtcp-trunk
> > We're hoping to have the last bugs out of InfiniBand sometime this
> > next week.
> >
> > You also mention mvapich2. Does that work for you with ordinary
> > Ethernet? If it fails for you even in that case, would you mind
> > letting us know (either informally, or a bug report -- whichever
> > you like).
> >
> > Thanks,
> > - Gene
> >
> > On Fri, Nov 01, 2013 at 03:00:08PM -0400, Bryan F Putnam wrote:
> > > Thanks for the examples, Artem. Let me take a look at these, and
> > > also your instructions in
> > >     .../dmtcp-2.0/contrib/rm/README
> > > and see if I can come up with something that works with Torque-4.
> > > If not, I'll contact my supervisor and I'm sure he'd be happy to
> > > let us set up an account for you on one of our clusters. So far
> > > I've tried using both openmpi and mpich2 (and mpich-3) but am
> > > seeing the same problems with not being able to specify a specific
> > > set of nodes on restarting.
> > > I've also tried mvapich2, but that fails for different reasons,
> > > and I do see that InfiniBand is not fully supported.
> > >
> > > Please feel free to play around with my Fortran code "matmat2.f".
> > > It's a simple matrix multiply inside a loop. If it doesn't run
> > > long enough for you, just modify the variable "niter". The
> > > iteration is printed as the job proceeds, so it's easy to see that
> > > the job is picking up where it left off after being checkpointed
> > > and restarted.
> > >
> > > Thanks,
> > > Bryan
> > >
> > > ----- Original Message -----
> > > Bryan,
> > > The resource manager plugin is installed by default. As far as I
> > > can see, you execute the application correctly.
> > > Just in case, I am attaching initial and restart batch scripts to
> > > this e-mail for reference. What is inside: at this moment (for
> > > debugging) I usually start dmtcp_coordinator on the frontend and
> > > use DMTCP options to point to it. We already have a solution for
> > > running the coordinator in batch manner too, but until you get
> > > correct behavior this is not reasonable.
> > > We test DMTCP with Open MPI mostly. A different MPI implementation
> > > can also be the reason, but we need to check whether that is so.
> > >
> > > 1. I need to additionally check the Torque plugin myself. This
> > >    will take a few days.
> > > 2. What application do you run, and is it possible for me to get
> > >    it for testing, with instructions on how to run it exactly as
> > >    you do?
> > > 3. I have access to Torque 2.x installations, and we didn't test
> > >    Torque 4.x. Is it possible for me to have access on your system
> > >    for testing and debugging?
> > >
> > > 2013/10/29 Bryan F Putnam < bf...@pu... >
> > > Hi Artem, thanks for writing back.
> > > We're using DMTCP-2.0 and Torque-4.1.5.1.
> > > I'm a bit confused as to how to install a DMTCP plugin, or
> > > whether the Torque plugin is in fact already installed by default.
> > > For example, if I start up a nodes=2:ppn=2 PBS session, my
> > > $PBS_NODEFILE may look something like:
> > >
> > >     host1
> > >     host1
> > >     host2
> > >     host2
> > >
> > > I then do:
> > >
> > >     dmtcp_launch --rm mpiexec -np 4 ./a.out
> > >         (4-processor job successfully runs on 2 processors on
> > >          each of 2 nodes)
> > >     dmtcp_command --checkpoint   (in a separate window)
> > >     dmtcp_command --kill         (in a separate window)
> > >     dmtcp_restart ckpt*.dmtcp
> > >
> > > After the last step, the job successfully restarts, but all 4
> > > processes are now running on the localhost (host1), nothing is
> > > running on host2, and the $PBS_NODEFILE appears to be ignored.
> > >
> > > Thanks for any tips!
> > > Bryan
> > >
> > > Hello, Bryan.
> > > What version of DMTCP/Torque do you use?
> > >
> > > 2013/10/29 gene < ge...@cc... >
> > > > Perhaps this is something that is handled by the Torque plugin?
> > > Yes, that's correct. You'll need to use the DMTCP plugin for
> > > Torque. Artem Polyakov is supporting that, and I'm cc'ing him.
> > > Among other issues, mount points can change and network addresses
> > > can change on restart. The plugin tries to handle that.
> > > Please let us know if you have any trouble using the Torque
> > > plugin.
> > >
> > > Best,
> > > - Gene
> > >
> > > On Mon, Oct 28, 2013 at 03:10:51PM -0400, Bryan F Putnam wrote:
> > > > Dear DMTCP developers,
> > > >
> > > > I've found that when restarting a multi-node job, dmtcp_restart
> > > > only appears to be aware of the local host.
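A note on the session quoted above (a sketch, using only commands that appear in this thread): restarting with bare dmtcp_restart hands every checkpoint image to the local host, because it knows nothing about $PBS_NODEFILE; the generated dmtcp_restart_script.sh is the piece that consults the resource manager (it invokes dmtcp_discover_rm, as the "which" error earlier in the thread shows). A minimal restart sequence under that assumption:

```shell
# Inside a fresh nodes=2:ppn=2 PBS session, with the ckpt_*.dmtcp images
# and dmtcp_restart_script.sh in the current directory:
dmtcp_coordinator --daemon     # or an interactive one in another window

# NOT:  dmtcp_restart ckpt_*.dmtcp
#   -- that restarts every process on the local host and ignores
#      $PBS_NODEFILE.

# The generated script redistributes processes across the allocated
# nodes (via dmtcp_discover_rm) before restarting them:
./dmtcp_restart_script.sh
```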
Is it possible to
> > > > tell dmtcp_restart which hosts are currently available for a
> > > > job restart, whether it's the same set of multiple hosts or a
> > > > completely different set of hosts?
> > > >
> > > > Typically our hosts are contained in $PBS_NODEFILE, since we
> > > > use Torque. Perhaps this is something that is handled by the
> > > > Torque plugin?
> > > >
> > > > Thanks,
> > > > Bryan
> > > >
> > > > --
> > > > Bryan Putnam
> > > > Senior Scientific Applications Analyst
> > > > Rosen Center for Advanced Computing, Purdue University
> > > > Young Hall (Rm. 910)
> > > > 155 S. Grant St.
> > > > West Lafayette, IN 47907-2114
> > > > Ph 765-496-8225  Fax 765-496-2275
> > > > bf...@pu...
> > > > www.purdue.edu/itap
> > >
> > > --
> > > С Уважением, Поляков Артем Юрьевич
> > > Best regards, Artem Y. Polyakov

[attachment: matmat2.f]

c************************************************************************
c matmat.f - matrix-matrix multiply, C = A*B
c simple self-scheduling version
c************************************************************************
      program matmat

      include 'mpif.h'
c     use mpi

      integer MAX_AROWS, MAX_ACOLS, MAX_BCOLS
c     parameter (MAX_AROWS = 20, MAX_ACOLS = 1000, MAX_BCOLS = 20)
c     parameter (MAX_AROWS = 200, MAX_ACOLS = 1000, MAX_BCOLS = 200)
      parameter (MAX_AROWS = 2000, MAX_ACOLS = 2000, MAX_BCOLS = 2000)
c     parameter (MAX_AROWS = 4000, MAX_ACOLS = 4000, MAX_BCOLS = 4000)
      double precision a(MAX_AROWS,MAX_ACOLS), b(MAX_ACOLS,MAX_BCOLS)
      double precision c(MAX_AROWS,MAX_BCOLS)
      double precision buffer(MAX_ACOLS), ans(MAX_BCOLS)
      double precision start_time, stop_time

      integer myid, master, numprocs, ierr, status(MPI_STATUS_SIZE)
      integer i, j, numsent, numrcvd, sender
      integer anstype, row, arows, acols, brows, bcols, crows, ccols
      integer errorcode
      integer niter, iter

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if (numprocs .lt. 2) then
         print *, "Must have at least 2 processes!"
         errorcode = 1
         call MPI_ABORT(MPI_COMM_WORLD, errorcode, ierr)
         stop
      endif
      print *, "Process ", myid, " of ", numprocs, " is alive"

      arows = MAX_AROWS
      acols = MAX_ACOLS
      brows = MAX_ACOLS
      bcols = MAX_BCOLS
      crows = MAX_AROWS
      ccols = MAX_BCOLS

      master = 0

      niter = 400
c     niter = 100
c     niter = 20
c     niter = 800
c     niter = 4
      do 900 iter = 1, niter

      if ( myid .eq. master ) then
c master initializes and then dispatches
c initialization of a and b, broadcast of b
c a(i,j) = i + j
         do 22 i = 1, arows
            do 22 j = 1, acols
               a(i,j) = dble(i+j)
 22      continue

         do 20 i = 1, brows
            do 20 j = 1, bcols
               b(i,j) = dble(i+j)
 20      continue

         start_time = MPI_WTIME()
c        start_time = mclock()
         if ( numprocs .lt. 2 ) then
            do 46 j = 1,ccols
               do 46 i = 1,crows
                  c(i,j) = 0.0
                  do 46 k = 1,acols
                     c(i,j) = c(i,j) + a(i,k)*b(k,j)
 46         continue
            go to 200
         endif

         do 25 i = 1,bcols
            call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
     $           MPI_COMM_WORLD, ierr)
 25      continue

         numsent = 0
         numrcvd = 0

c send a row of a to each other process; tag with row number
         do 40 i = 1,numprocs-1
            do 30 j = 1,acols
               buffer(j) = a(i,j)
 30         continue
            call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION,
     $           i, i, MPI_COMM_WORLD, ierr)
            numsent = numsent+1
 40      continue

         do 70 i = 1,crows
            call MPI_RECV(ans, ccols, MPI_DOUBLE_PRECISION,
     $           MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
     $           status, ierr)
            sender = status(MPI_SOURCE)
            anstype = status(MPI_TAG)

            do 45 j = 1,ccols
               c(anstype,j) = ans(j)
 45         continue
            if (numsent .lt. arows) then
               do 50 j = 1,acols
                  buffer(j) = a(numsent+1,j)
 50            continue
               call MPI_SEND(buffer, acols, MPI_DOUBLE_PRECISION,
     $              sender, numsent+1, MPI_COMM_WORLD, ierr)
               numsent = numsent+1
            else
               call MPI_SEND(1.0, 1, MPI_DOUBLE_PRECISION, sender, 0,
     $              MPI_COMM_WORLD, ierr)
            endif
 70      continue

      else
c slaves receive b, then compute dot products until done message
         do 85 i = 1,bcols
            call MPI_BCAST(b(1,i), brows, MPI_DOUBLE_PRECISION, master,
     $           MPI_COMM_WORLD, ierr)
 85      continue
 90      continue
         call MPI_RECV(buffer, acols, MPI_DOUBLE_PRECISION, master,
     $        MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
         if (status(MPI_TAG) .eq. 0) then
            go to 200
         else
            row = status(MPI_TAG)
            do 100 i = 1,bcols
               ans(i) = 0.0
               do 95 j = 1,acols
                  ans(i) = ans(i) + buffer(j)*b(j,i)
 95            continue
 100        continue
            call MPI_SEND(ans, bcols, MPI_DOUBLE_PRECISION, master,
     $           row, MPI_COMM_WORLD, ierr)
            go to 90
         endif
      endif

 200  continue
c print out the answer
c     do 80 i = 1,crows
c        do 80 j = 1,ccols
c           print *, "c(", i, j ") = ", c(i,j)
c 80  continue

      if ( myid .eq. master ) then
         stop_time = MPI_WTIME()
c        stop_time = mclock()
         print *, 'Time is ', stop_time - start_time,
     &        ' seconds for iteration ', iter
      endif

 900  continue

      call MPI_FINALIZE(ierr)
      stop
      end

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov