From: Gene C. <ge...@cc...> - 2014-02-07 19:32:39
Hi Basma,

First of all, I forgot to ask which MPI you are using, and which version of that MPI. That will also help us reproduce locally any behavior that you are seeing.

Next, I'll answer your questions:

> should not I do "make install " also in order to ...

Yes, it is best to run 'make install'. I wasn't sure whether you had root privilege on your cluster.

> does version 1.2.5 have problems in case of 16 process on 4 nodes cluster
> it worked fine in these cases:

It's possible that version 1.2.5 has problems. But the larger reason is that even if version 1.2.5 works well, it will be easier for us to communicate if we are both looking at the same version. Most of the people on the DMTCP project are now concentrating on DMTCP 2.1 and the svn repository, so it will be easier for us to analyze your results if we all look at DMTCP 2.1.

To answer your other question, MPI implementations have continued to move forward with more efficient internal engines. So in DMTCP, we have added improvements to cover the newer system services and parameters that MPI implementations tend to use.

Best wishes,
- Gene

On Fri, Feb 07, 2014 at 07:14:44PM +0200, basma a.azeem wrote:
> Hi Gene/Kapil
> thank you so much for your help
>
> about your question:
>
> ./dmtcp_restart_script.sh
> (yes, this is the way I was invoking restart for dmtcp-1.2.5)
>
> does version 1.2.5 have problems in the case of 16 processes on a 4-node cluster? it worked fine in these cases:
>
> 1- single node for 4 processes and 16 processes
> 2- 4 nodes cluster for 4 processes
>
> about this part:
> Building it should be easy: ./configure && make
> should I not also do "make install" in order to find all the required files on all nodes of the cluster?
>
> thank you
>
> > Date: Thu, 6 Feb 2014 23:03:00 -0500
> > From: ge...@cc...
> > To: bas...@ho...
> > CC: dmt...@li...
> > Subject: Re: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
> >
> > Hi Basma,
> > Would you mind re-doing this experiment with DMTCP 2.1 (the latest version)?
> > You'll find it at: http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.1/
> > Building it should be easy: ./configure && make
> > We renamed the launch command. It will now be:
> > bin/dmtcp_launch mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
> > Then to restart, it should be the same as before:
> > ./dmtcp_restart_script.sh
> > (Is this the way that you were invoking restart for dmtcp-1.2.5?)
> >
> > If this still gives you any problems, please do write back.
> >
> > Best wishes,
> > - Gene
> >
> > ----- Original Message -----
> > From: basma a.azeem <bas...@ho...>
> > To: dmt...@li...
> > Sent: Thu, 6 Feb 2014 21:39:17 -0500 (EST)
> > Subject: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
> >
> > From: bas...@ho...
> > To: ka...@cc...
> > Subject: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
> > Date: Fri, 7 Feb 2014 04:37:58 +0200
> >
> > i am trying dmtcp version 1.2.5 with open mpi
> > i use a 4 node cluster
> >
> > when i checkpoint and restart an exe that was compiled for 4 processes, the checkpoint works well; at restart it gives me a "Segmentation fault (core dumped)" and then the restart also works correctly
> >
> > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
> >
> > but when i checkpoint and restart an exe that was compiled for 16 processes, the checkpoint works well but at restart it gives this output and hangs.
> > it stops forever
> >
> > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 16 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.16
> >
> > it looks like i am missing a simple detail
> >
> > here is the output i had :
> >
> > -------------------------------------------------------
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> >
> > dmtcp_coordinator starting...
> > Port: 7779
> > Checkpoint Interval: disabled (checkpoint manually instead)
> > Exit on last client: 1
> > Backgrounding...
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 18af1fad8d756-6416-52f43ea3(99072)
> > Message: Bind failed.
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 18af1fad8d756-6419-52f43ea3(99092)
> > Message: Bind failed.
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 18af1fad8d756-6422-52f43ea3(99112)
> > Message: Bind failed.
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> >
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> >
> > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
> > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
> > This program comes with ABSOLUTELY NO WARRANTY.
> > This is free software, and you are welcome to redistribute it
> > under certain conditions; see COPYING file for details.
> > (Use flag "-q" to hide this message.)
> >
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e707-3257-52f43ea3(99074)
> > Message: Bind failed.
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e707-3261-52f43ea3(99094)
> > Message: Bind failed.
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e707-3265-52f43ea3(99114)
> > Message: Bind failed.
> > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e708-2483-52f43ea3(99074)
> > Message: Bind failed.
> > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e708-2487-52f43ea3(99094)
> > Message: Bind failed.
> > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e708-2491-52f43ea3(99114)
> > Message: Bind failed.
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e709-2475-52f43ea3(99076)
> > Message: Bind failed.
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e709-2479-52f43ea3(99096)
> > Message: Bind failed.
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
> > (strerror((*__errno_location ()))) = Address already in use
> > id() = 20385667ca0e709-2483-52f43ea3(99116)
> > Message: Bind failed.
> > Segmentation fault (core dumped)
> > Segmentation fault (core dumped)
> > Segmentation fault (core dumped)
> > [[6422] mtcp_restart_nolibc.c:[929 read_shared_memory_area_from_file:
> > mapping 6416/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master] mtcp_restart_nolibc.c with data from ckpt image
> > 6419:929 read_shared_memory_area_from_file:
> > ] mtcp_restart_nolibc.cmapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master:929 with data from ckpt image
> > read_shared_memory_area_from_file:
> > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with data from ckpt image
> > [6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image
> > [6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image
> > [6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
> > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image
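
The reply at the top of this thread asks which MPI and which version is in use. The snippet below is a minimal sketch for collecting that information on each node before writing back. It assumes that `mpirun` and `dmtcp_launch` both accept a `--version` flag; this is believed to hold for Open MPI's `mpirun` and for DMTCP 2.x, but treat it as an assumption for other MPI implementations or other DMTCP releases. `report_version` is a hypothetical helper name, not part of DMTCP or MPI.

```shell
#!/bin/sh
# Sketch: print one version line per tool, or say that the tool is missing.
# Assumption: each tool answers "--version"; only the first output line is kept.

report_version() {
    tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        # Merge stderr into stdout because some tools print their version there.
        printf '%s: %s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
    else
        printf '%s: not found in PATH\n' "$tool"
    fi
}

report_version mpirun
report_version dmtcp_launch
```

Running this on the master and on each of node001 through node003 would also show whether every node sees the same MPI and DMTCP installation, which is the concern behind the 'make install' question.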