From: Kapil A. <ka...@cc...> - 2011-01-31 15:49:25
Hello Javier,

Thanks for your feedback on DMTCP.

To restart a computation that spans multiple nodes, use the dmtcp_restart_script.sh script, which automatically recreates the processes on the nodes where they were running before the checkpoint. It also accepts an argument for specifying a machinefile/hostfile. Run "./dmtcp_restart_script.sh --help" to check the usage.

The dmtcp_restart command is useful when working on a single node or when you want to restart processes explicitly by hand; in all other cases, use dmtcp_restart_script.sh.

dmtcp_restart_script.sh is created by the dmtcp_coordinator process, so it is located in the directory from which dmtcp_coordinator was started. Note that dmtcp_restart_script.sh is a symbolic link to the actual restart script, which has a name of the form dmtcp_restart_script_<alphanum>-<pid>-<timestamp>.sh.

Please let us know if you have any more questions.

Thanks,
- Kapil

On Mon, Jan 31, 2011 at 10:18 AM, Javier Martinez Canillas <mar...@gm...> wrote:
> Hello,
>
> First of all, congratulations on your work. Implementing a
> checkpoint/restart library in userspace that works that well and covers
> so many use cases is something really amazing.
>
> We are using dmtcp to checkpoint and restart MPI processes. Everything
> works pretty well; the only problem we have is that our processes lose
> their mapping on restart. We use a machinefile to map the processes to
> different cores, but on restart all of them are executed on the node
> where they are restarted.
>
> I searched the documentation, FAQ and mailing list archives for an
> answer, but couldn't find one.
>
> Below I explain the steps I followed to start, checkpoint and restart
> our processes:
>
> jmartinez@headnode $ echo "export DMTCP_HOST=node1" >> ~/.bashrc
> jmartinez@headnode $ ssh node1
> jmartinez@node1 $ dmtcp_coordinator
> jmartinez@headnode $ for i in {1..4}; do echo node$i >> machinefile; done
> jmartinez@headnode $ dmtcp_checkpoint mpirun -np 4 -machinefile machinefile mpi_app
> jmartinez@headnode $ dmtcp_command -c
>
> At this point, checkpoints for the four processes, the four orted
> instances and the orterun process are created.
>
> The problem arises when I restart the processes:
>
> jmartinez@headnode $ dmtcp_restart ckpt_*
>
> The processes are not mapped as they were before the checkpoint;
> instead, I have 9 mtcp_restart processes on the node where I did the
> restart (i.e., the cluster head node):
>
> jmartinez@headnode $ ps -u jmartinez | grep mtcp_restart | wc -l
> 9
>
> These are the four MPI processes, the four orted processes and the
> orterun process. The application is restarted cleanly and finishes
> execution. The only problem is the mapping.
>
> Please point out what I'm doing wrong. If dmtcp currently supports
> mapping preservation on restart, I will be more than glad to help
> document it.
>
> Thanks a lot for your help.
>
> Best regards,
>
> --
> -----------------------------------------
> Javier Martínez Canillas
> (+34) 682 39 81 69
> PhD Student in High Performance Computing
> Computer Architecture and Operating System Department (CAOS)
> Universitat Autònoma de Barcelona
> Barcelona, Spain
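
The reply in this thread notes that dmtcp_restart_script.sh is a symbolic link created in the directory where dmtcp_coordinator was started. The following is a minimal shell sketch for checking which concrete restart script that link points to before running it. The function name resolve_restart_script and the directory argument are hypothetical names chosen for illustration and are not part of DMTCP; the sketch relies only on the standard readlink utility.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): print the file that the
# dmtcp_restart_script.sh symlink in a given directory points to.
# $1 is assumed to be the directory where dmtcp_coordinator was started.
resolve_restart_script() {
    if [ ! -L "$1/dmtcp_restart_script.sh" ]; then
        # No symlink here: probably not the coordinator's start directory.
        echo "no dmtcp_restart_script.sh symlink in $1" >&2
        return 1
    fi
    # readlink prints the link target exactly as stored in the symlink.
    readlink "$1/dmtcp_restart_script.sh"
}
```

Assuming the coordinator was started in the current directory, `resolve_restart_script .` would print the timestamped script name. The restart itself would then be run from that directory with ./dmtcp_restart_script.sh, and the exact name of the machinefile/hostfile argument should be taken from the output of "./dmtcp_restart_script.sh --help" rather than from this sketch.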