From: basma a.a. <bas...@ho...> - 2014-02-07 02:39:24

From: bas...@ho...
To: ka...@cc...
Subject: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
Date: Fri, 7 Feb 2014 04:37:58 +0200

I am trying DMTCP version 1.2.5 with Open MPI on a 4-node cluster.

When I checkpoint and restart an executable that was compiled for 4 processes, the checkpoint works. At restart it prints "Segmentation fault (core dumped)", but the restart then works correctly as well:

ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4

But when I checkpoint and restart an executable that was compiled for 16 processes, the checkpoint works, but at restart it prints the output below and then hangs forever:

ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 16 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.16

It looks like I am missing a simple detail.

Here is the output I had:

-------------------------------------------------------
dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

dmtcp_coordinator starting...
Port: 7779
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 1
Backgrounding...
[6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 18af1fad8d756-6416-52f43ea3(99072)
Message: Bind failed.
[6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 18af1fad8d756-6419-52f43ea3(99092)
Message: Bind failed.
[6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 18af1fad8d756-6422-52f43ea3(99112)
Message: Bind failed.
dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

dmtcp_checkpoint (DMTCP + MTCP) 1.2.5
Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and
Gene Cooperman
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see COPYING file for details.
(Use flag "-q" to hide this message.)

[3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e707-3257-52f43ea3(99074)
Message: Bind failed.
[3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e707-3261-52f43ea3(99094)
Message: Bind failed.
[3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e707-3265-52f43ea3(99114)
Message: Bind failed.
[2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e708-2483-52f43ea3(99074)
Message: Bind failed.
[2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e708-2487-52f43ea3(99094)
Message: Bind failed.
[2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e708-2491-52f43ea3(99114)
Message: Bind failed.
[2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e709-2475-52f43ea3(99076)
Message: Bind failed.
[2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e709-2479-52f43ea3(99096)
Message: Bind failed.
[2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
(strerror((*__errno_location ()))) = Address already in use
id() = 20385667ca0e709-2483-52f43ea3(99116)
Message: Bind failed.
Segmentation fault (core dumped)
Segmentation fault (core dumped)
Segmentation fault (core dumped)
[[6422] mtcp_restart_nolibc.c:[929 read_shared_memory_area_from_file:
mapping 6416/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master] mtcp_restart_nolibc.c with data from ckpt image
6419:929 read_shared_memory_area_from_file:
] mtcp_restart_nolibc.cmapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master:929 with data from ckpt image
read_shared_memory_area_from_file:
mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with data from ckpt image
[6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image
[6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image
[6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image

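For readers unfamiliar with the "Address already in use" text in the warnings above: it is the standard strerror string for errno EADDRINUSE, which bind() returns when another socket already holds the requested address and port. The following is a minimal Python sketch of that condition only. It is not DMTCP code, the function name try_double_bind is made up for illustration, and it says nothing about why the restart in this thread hit the error.

```python
import errno
import socket


def try_double_bind():
    """Bind two TCP sockets to the same address; return the second bind's errno (0 on success)."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
        first.listen(1)
        addr = first.getsockname()     # the (host, port) pair actually assigned
        try:
            second.bind(addr)          # same address and port as `first`
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = try_double_bind()
    # Print the symbolic errno name if one is known, else the raw value.
    print(errno.errorcode.get(code, code))
```

Neither socket sets SO_REUSEADDR or SO_REUSEPORT here, so the second bind is expected to be rejected while the first socket is still open.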
From: Gene C. <ge...@cc...> - 2014-02-07 04:03:14

Hi Basma,
Would you mind re-doing this experiment with DMTCP 2.1 (the latest version)?
You'll find it at: http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.1/
Building it should be easy: ./configure && make
We renamed the way to start. It will now be:
  bin/dmtcp_launch mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
Then to restart, it should be the same as before:
  ./dmtcp_restart_script.sh
(Is this the way that you were invoking restart for dmtcp-1.2.5?)

If this still gives you any problems, please do write back.

Best wishes,
- Gene

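The 4-process and 16-process runs in this thread differ only in the -np value and the benchmark binary, so the launch line can be assembled from those two inputs. Below is a small Python sketch that only builds the command as a list and prints it; it does not run DMTCP. The helper name build_launch_command is hypothetical, and the only options used (-np, -H) are the mpirun options already shown in the messages above.

```python
import shlex


def build_launch_command(num_procs, hosts, executable, launcher="bin/dmtcp_launch"):
    """Return the argv list for launching an MPI job under the DMTCP 2.1 launcher."""
    return [
        launcher,
        "mpirun",
        "-np", str(num_procs),      # total number of MPI processes
        "-H", ",".join(hosts),      # comma-separated host list, as in the thread
        executable,
    ]


if __name__ == "__main__":
    hosts = ["master", "node001", "node002", "node003"]
    for nprocs in (4, 16):
        exe = f"/home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.{nprocs}"
        # shlex.join (Python 3.8+) renders the list as a shell-safe command line.
        print(shlex.join(build_launch_command(nprocs, hosts, exe)))
```

Keeping the command as a list makes it easy to pass to subprocess.run later without shell quoting problems, should one want to script both cases.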
From: basma a.a. <bas...@ho...> - 2014-02-07 17:14:52

Hi Gene/Kapil,

Thank you so much for your help.

About your question: yes, ./dmtcp_restart_script.sh is the way I was invoking restart for dmtcp-1.2.5.

Does version 1.2.5 have problems in the case of 16 processes on a 4-node cluster? It worked fine in these cases:

1- single node, for 4 processes and 16 processes
2- 4-node cluster, for 4 processes

About this part:
Building it should be easy: ./configure && make
Shouldn't I also run "make install", so that all the required files can be found on all nodes of the cluster?

Thank you

From: Gene C. <ge...@cc...> - 2014-02-07 19:32:39

Hi Basma,
First of all, I forgot to ask you which MPI you are using, and which version of that MPI. That will also help us locally reproduce any behavior that you are seeing.

Next, I'll answer your questions:

> should not I do "make install " also in order to ...

Yes, it will be best if you do 'make install'. I wasn't sure if you had root privilege on your cluster.

> does version 1.2.5 have problems in case of 16 process on 4 nodes cluster
> it worked fine in these cases:

It's possible that version 1.2.5 has problems. But the larger reason is that even if version 1.2.5 works well, it will be easier for us to communicate if we are both looking at the same version. Most of the people on the DMTCP project are now concentrating on version DMTCP 2.1 and the svn repository. So, it will be easier for us to analyze your results if we all look at DMTCP 2.1.

To answer your other question, MPI implementations have continued to move forward with more efficient internal engines. So in DMTCP, we have added improvements to cover the newer system services and parameters that MPI implementations tend to use.

Best wishes,
- Gene

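When comparing restart output between DMTCP versions, as this thread does, it can help to condense the long runs of warnings into a count per process and source location. The following Python sketch is a hypothetical helper, not part of DMTCP; its regular expression is inferred only from the message format quoted in this thread and may not match other DMTCP output.

```python
import re
from collections import Counter

# Matches the first line of an entry such as:
#   [6506] WARNING at connection.cpp:625 in restore; REASON='...'
ENTRY_RE = re.compile(r"\[(\d+)\] (WARNING|ERROR) at ([\w.]+):(\d+) in (\w+);")


def summarize(log_text):
    """Count WARNING/ERROR entries per (pid, level, file:line, function)."""
    counts = Counter()
    for pid, level, filename, line, func in ENTRY_RE.findall(log_text):
        counts[(int(pid), level, f"{filename}:{line}", func)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "[6506] WARNING at connection.cpp:625 in restore; REASON='...'\n"
        "Message: Bind failed.\n"
        "[6506] WARNING at connection.cpp:625 in restore; REASON='...'\n"
        "Message: Bind failed.\n"
        "[40000] ERROR at fileconnection.cpp:399 in postRestart; REASON='...'\n"
    )
    for key, n in sorted(summarize(sample).items()):
        print(n, key)
```

Interleaved lines from concurrent processes, such as the garbled mtcp_restart_nolibc.c output, do not match the pattern and are simply skipped.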
From: basma a.a. <bas...@ho...> - 2014-02-07 17:49:31
|
I tried version 2.1. In the single-node case with 16 processes, restart gives me this error: ./dmtcp_restart_script.sh dmtcp_restart (DMTCP + MTCP) 2.1 Copyright (C) 2006-2014 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman License LGPLv3+: GNU LGPL version 3 or later <http://gnu.org/licenses/lgpl.html>. This program comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it under certain conditions; see COPYING file for details. (Use flag "-q" to hide this message.) [40000] ERROR at fileconnection.cpp:399 in postRestart; REASON='JASSERT(tempfd >= 0) failed' tempfd = -1 controllingTty = /dev/pts/0 (strerror((*__errno_location ()))) = Permission denied Message: Error Opening the terminal attached with the process orterun (40000): Terminating... ubuntu@ip-10-43-154-61:~$ [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-41000-52f519db(99080) Message: Bind failed. [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-41000-52f519db(99081) Message: Bind failed. [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-41000-52f519db(99097) Message: Bind failed. [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-41000-52f519db(99098) Message: Bind failed. 
[53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-53000-52f519dc(99201) Message: Bind failed. [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-53000-52f519dc(99202) Message: Bind failed. [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-53000-52f519dc(99218) Message: Bind failed. [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-53000-52f519dc(99219) Message: Bind failed. [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-45000-52f519db(99121) Message: Bind failed. [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-45000-52f519db(99122) Message: Bind failed. [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-45000-52f519db(99138) Message: Bind failed. 
[45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-45000-52f519db(99139) Message: Bind failed. [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-54000-52f519dc(99211) Message: Bind failed. [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-54000-52f519dc(99212) Message: Bind failed. [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-54000-52f519dc(99228) Message: Bind failed. [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-54000-52f519dc(99229) Message: Bind failed. [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-49000-52f519db(99161) Message: Bind failed. [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-49000-52f519db(99162) Message: Bind failed. 
[49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-49000-52f519db(99178) Message: Bind failed. [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-49000-52f519db(99179) Message: Bind failed. [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-42000-52f519db(99091) Message: Bind failed. [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-42000-52f519db(99092) Message: Bind failed. [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-42000-52f519db(99108) Message: Bind failed. [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-42000-52f519db(99109) Message: Bind failed. [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-56000-52f519dc(99231) Message: Bind failed. 
[56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-56000-52f519dc(99232) Message: Bind failed. [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-56000-52f519dc(99248) Message: Bind failed. [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-56000-52f519dc(99249) Message: Bind failed. [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-43000-52f519db(99101) Message: Bind failed. [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-43000-52f519db(99102) Message: Bind failed. [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-43000-52f519db(99118) Message: Bind failed. [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-43000-52f519db(99119) Message: Bind failed. 
[44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-44000-52f519db(99111) Message: Bind failed. [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-44000-52f519db(99112) Message: Bind failed. [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-44000-52f519db(99128) Message: Bind failed. [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-44000-52f519db(99129) Message: Bind failed. [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-46000-52f519db(99131) Message: Bind failed. [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-46000-52f519db(99132) Message: Bind failed. [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-46000-52f519db(99148) Message: Bind failed. 
[46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-46000-52f519db(99149) Message: Bind failed. [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-52000-52f519dc(99191) Message: Bind failed. [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-52000-52f519dc(99192) Message: Bind failed. [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-48000-52f519db(99151) Message: Bind failed. [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-52000-52f519dc(99208) Message: Bind failed. [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-52000-52f519dc(99209) Message: Bind failed. [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-48000-52f519db(99152) Message: Bind failed. 
[48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-48000-52f519db(99168) Message: Bind failed. [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-55000-52f519dc(99221) Message: Bind failed. [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-48000-52f519db(99169) Message: Bind failed. [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-55000-52f519dc(99222) Message: Bind failed. [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-55000-52f519dc(99238) Message: Bind failed. [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-55000-52f519dc(99239) Message: Bind failed. [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-47000-52f519db(99141) Message: Bind failed. 
[47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-47000-52f519db(99142) Message: Bind failed. [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-47000-52f519db(99158) Message: Bind failed. [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-47000-52f519db(99159) Message: Bind failed. [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-51000-52f519db(99181) Message: Bind failed. [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-51000-52f519db(99182) Message: Bind failed. [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-51000-52f519db(99198) Message: Bind failed. [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-51000-52f519db(99199) Message: Bind failed. 
[50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-50000-52f519db(99171) Message: Bind failed. [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-50000-52f519db(99172) Message: Bind failed. [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-50000-52f519db(99188) Message: Bind failed. [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' (strerror((*__errno_location ()))) = Address already in use id() = 6da2961af00014aa-50000-52f519db(99189) Message: Bind failed. [41000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99049) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (41000): Terminating... [49000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99129) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (49000): Terminating... 
[46000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99099) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (46000): Terminating... [55000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99189) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (55000): Terminating... [43000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99069) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (43000): Terminating... [47000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99109) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (47000): Terminating... [42000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99059) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (42000): Terminating... [45000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99089) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (45000): Terminating... 
[54000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99179) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (54000): Terminating... [51000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99149) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (51000): Terminating... [50000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99139) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (50000): Terminating... [44000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99079) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (44000): Terminating... [52000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99159) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (52000): Terminating... [56000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99199) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (56000): Terminating... 
[53000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99169) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (53000): Terminating... [48000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' id = 6da2961af00014aa-40000-52f519db(99119) (strerror((*__errno_location ()))) = Invalid argument Message: failed to restore connection lu.A.16 (48000): Terminating... From: bas...@ho... To: ge...@cc...; dmt...@li... Subject: RE: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ? Date: Fri, 7 Feb 2014 19:14:44 +0200 Hi Gene/Kapil, thank you so much for your help. About your question: yes, ./dmtcp_restart_script.sh is how I was invoking restart for dmtcp-1.2.5. Does version 1.2.5 have problems in the case of 16 processes on a 4-node cluster? It worked fine in these cases: 1- single node, for 4 processes and for 16 processes; 2- 4-node cluster, for 4 processes. About this part: "Building it should be easy: ./configure && make": shouldn't I also run "make install", so that all the required files can be found on all nodes of the cluster? Thank you. > > Date: Thu, 6 Feb 2014 23:03:00 -0500 > From: ge...@cc... > To: bas...@ho... > CC: dmt...@li... > Subject: Re: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ? > > Hi Basma, > Would you mind re-doing this experiment with DMTCP 2.1 (the latest version)? > You'll find it at: http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.1/ > Building it should be easy: ./configure && make > We renamed the way to start. 
It will now be: > bin/dmtcp_launch mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4 > Then to restart, it should be the same as before: > ./dmtcp_restart_script.sh > (Is this the way that you were invoking restart for dmtcp-1.2.5?) > > If this still gives you any problems, please do write back. > > Best wishes, > - Gene |
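Editorial note on the "Bind failed ... Address already in use" warnings in the restart logs of this thread: that strerror text is the standard message for the EADDRINUSE error, which bind() returns when the requested address/port pair is already held by another socket. The sketch below is not DMTCP code and makes no claim about why DMTCP hits the error on restart; it only reproduces the same OS-level condition on the loopback interface. The helper names (try_bind, demo_bind_conflict) are made up for this illustration.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind a fresh TCP socket; return (sock, None) on success or (None, errno) on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


def demo_bind_conflict():
    """Return (errno seen while the port is held, whether a rebind works after release)."""
    # Hold a port: let the kernel choose a free one, then listen on it.
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind(("127.0.0.1", 0))
    holder.listen(1)
    port = holder.getsockname()[1]

    # A second bind to the same address/port fails while the first socket is open.
    second, conflict_errno = try_bind(port)
    if second is not None:
        second.close()

    # Once the holder is closed (no connections were made), the port can be bound again.
    holder.close()
    third, _ = try_bind(port)
    rebind_ok = third is not None
    if third is not None:
        third.close()
    return conflict_errno, rebind_ok


if __name__ == "__main__":
    conflict_errno, rebind_ok = demo_bind_conflict()
    print("errno while port is held:", errno.errorcode.get(conflict_errno))
    print("rebind after release succeeded:", rebind_ok)
```

In the logs these entries are logged at WARNING level and the processes keep running afterwards; the entries that end with "Terminating..." are the separate ERROR lines from fileconnection.cpp and connectionrewirer.cpp.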
From: Gene C. <ge...@cc...> - 2014-02-07 19:43:05
|
Hi Basma, I'm reading the e-mails in chronological order. :-) The excerpt below seems interesting. > controllingTty = /dev/pts/0 > (strerror((*__errno_location ()))) = Permission denied Kapil tells me that another group had reported a similar bug, but we were unable to reproduce their bug locally. Would it be possible for us to get an account on your cluster? That will be the shortest path to analyzing this bug. If that is not possible for you, we can also propose a screen-sharing session, so that we can watch you as you exhibit the bug. (If a guest account is possible, then we'll work with the DMTCP-2.1 that you've already installed. We certainly won't need any privileges.) Best wishes, - Gene On Fri, Feb 07, 2014 at 07:49:21PM +0200, basma a.azeem wrote: > > > i tried version 2.1 > > in single node case for 16 processes at restart it gives me this error: > > > ./dmtcp_restart_script.sh > > dmtcp_restart (DMTCP + MTCP) 2.1 > > Copyright (C) 2006-2014 Jason Ansel, Michael Rieker, Kapil Arya, and > Gene Cooperman > License LGPLv3+: GNU LGPL version 3 or later <http://gnu.org/licenses/lgpl.html>. > This program comes with ABSOLUTELY NO WARRANTY. > This is free software, and you are welcome to redistribute it > under certain conditions; see COPYING file for details. > (Use flag "-q" to hide this message.) > > [40000] ERROR at fileconnection.cpp:399 in postRestart; REASON='JASSERT(tempfd >= 0) failed' > tempfd = -1 > controllingTty = /dev/pts/0 > (strerror((*__errno_location ()))) = Permission denied > Message: Error Opening the terminal attached with the process > orterun (40000): Terminating... > ubuntu@ip-10-43-154-61:~$ [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-41000-52f519db(99080) > Message: Bind failed. 
> [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-41000-52f519db(99081) > Message: Bind failed. > [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-41000-52f519db(99097) > Message: Bind failed. > [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-41000-52f519db(99098) > Message: Bind failed. > [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-53000-52f519dc(99201) > Message: Bind failed. > [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-53000-52f519dc(99202) > Message: Bind failed. > [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-53000-52f519dc(99218) > Message: Bind failed. > [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-53000-52f519dc(99219) > Message: Bind failed. 
> [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-45000-52f519db(99121) > Message: Bind failed. > [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-45000-52f519db(99122) > Message: Bind failed. > [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-45000-52f519db(99138) > Message: Bind failed. > [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-45000-52f519db(99139) > Message: Bind failed. > [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-54000-52f519dc(99211) > Message: Bind failed. > [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-54000-52f519dc(99212) > Message: Bind failed. > [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-54000-52f519dc(99228) > Message: Bind failed. 
> [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-54000-52f519dc(99229) > Message: Bind failed. > [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-49000-52f519db(99161) > Message: Bind failed. > [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-49000-52f519db(99162) > Message: Bind failed. > [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-49000-52f519db(99178) > Message: Bind failed. > [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-49000-52f519db(99179) > Message: Bind failed. > [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-42000-52f519db(99091) > Message: Bind failed. > [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-42000-52f519db(99092) > Message: Bind failed. 
> [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-42000-52f519db(99108) > Message: Bind failed. > [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-42000-52f519db(99109) > Message: Bind failed. > [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-56000-52f519dc(99231) > Message: Bind failed. > [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-56000-52f519dc(99232) > Message: Bind failed. > [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-56000-52f519dc(99248) > Message: Bind failed. > [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-56000-52f519dc(99249) > Message: Bind failed. > [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-43000-52f519db(99101) > Message: Bind failed. 
> [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-43000-52f519db(99102) > Message: Bind failed. > [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-43000-52f519db(99118) > Message: Bind failed. > [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-43000-52f519db(99119) > Message: Bind failed. > [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-44000-52f519db(99111) > Message: Bind failed. > [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-44000-52f519db(99112) > Message: Bind failed. > [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-44000-52f519db(99128) > Message: Bind failed. > [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-44000-52f519db(99129) > Message: Bind failed. 
> [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-46000-52f519db(99131) > Message: Bind failed. > [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-46000-52f519db(99132) > Message: Bind failed. > [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-46000-52f519db(99148) > Message: Bind failed. > [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-46000-52f519db(99149) > Message: Bind failed. > [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-52000-52f519dc(99191) > Message: Bind failed. > [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-52000-52f519dc(99192) > Message: Bind failed. > [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-48000-52f519db(99151) > Message: Bind failed. 
> [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-52000-52f519dc(99208) > Message: Bind failed. > [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-52000-52f519dc(99209) > Message: Bind failed. > [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-48000-52f519db(99152) > Message: Bind failed. > [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-48000-52f519db(99168) > Message: Bind failed. > [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-55000-52f519dc(99221) > Message: Bind failed. > [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-48000-52f519db(99169) > Message: Bind failed. > [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-55000-52f519dc(99222) > Message: Bind failed. 
> [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-55000-52f519dc(99238) > Message: Bind failed. > [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-55000-52f519dc(99239) > Message: Bind failed. > [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-47000-52f519db(99141) > Message: Bind failed. > [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-47000-52f519db(99142) > Message: Bind failed. > [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-47000-52f519db(99158) > Message: Bind failed. > [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-47000-52f519db(99159) > Message: Bind failed. > [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-51000-52f519db(99181) > Message: Bind failed. 
> [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-51000-52f519db(99182) > Message: Bind failed. > [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-51000-52f519db(99198) > Message: Bind failed. > [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-51000-52f519db(99199) > Message: Bind failed. > [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-50000-52f519db(99171) > Message: Bind failed. > [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-50000-52f519db(99172) > Message: Bind failed. > [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-50000-52f519db(99188) > Message: Bind failed. > [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-50000-52f519db(99189) > Message: Bind failed. 
> [41000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99049) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (41000): Terminating... > [49000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99129) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (49000): Terminating... > [46000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99099) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (46000): Terminating... > [55000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99189) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (55000): Terminating... > [43000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99069) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (43000): Terminating... > [47000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99109) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (47000): Terminating... 
> [42000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99059) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (42000): Terminating... > [45000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99089) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (45000): Terminating... > [54000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99179) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (54000): Terminating... > [51000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99149) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (51000): Terminating... > [50000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99139) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (50000): Terminating... > [44000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99079) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (44000): Terminating... 
> [52000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99159) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (52000): Terminating... > [56000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99199) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (56000): Terminating... > [53000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99169) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (53000): Terminating... > [48000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99119) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (48000): Terminating... > > > > > From: bas...@ho... > To: ge...@cc...; dmt...@li... > Subject: RE: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ? 
> Date: Fri, 7 Feb 2014 19:14:44 +0200 > > > > > Hi Gene/Kapil > thank you so much for your help > > about your question: > > ./dmtcp_restart_script.sh > (yes , this Is the way by which i was invoking restart for dmtcp-1.2.5) > > does version 1.2.5 have problems in case of 16 process on 4 nodes cluster it worked fine in these cases: > > 1- single node for 4 processes and 16 processes > 2- 4 nodes cluster for 4 processes > > > about this part: > Building it should be easy: ./configure && make > should not i do "make install " also in order to find all the required files in all nodes of the cluster ? > > thank you > > > > > Date: Thu, 6 Feb 2014 23:03:00 -0500 > > From: ge...@cc... > > To: bas...@ho... > > CC: dmt...@li... > > Subject: Re: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ? > > > > Hi Basma, > > Would you mind re-doing this experiment with DMTCP 2.1 (the latest version)? > > You'll find it at: http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.1/ > > Building it should be easy: ./configure && make > > We renamed the way to start. It will now be: > > bin/dmtcp_launch mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4 > > Then to restart, it should be the same as before: > > ./dmtcp_restart_script.sh > > (Is this the way that you were invoking restart for dmtcp-1.2.5?) > > > > If this still gives you any problems, please do write back. > > > > Best wishes, > > - Gene > > > > ----- Original Message ----- > > From: basma a.azeem <bas...@ho...> > > To: dmt...@li... > > Sent: Thu, 6 Feb 2014 21:39:17 -0500 (EST) > > Subject: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ? > > > > > > From: bas...@ho... > > To: ka...@cc... > > Subject: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ? 
> > Date: Fri, 7 Feb 2014 04:37:58 +0200 > > > > > > > > > > i am trying dmtcp version 1.2.5 with open mpi > > i use a 4 node cluster > > > > when i try to check point and restart an exe that was compiler 4 processes it works good at checkpoint and at restart it gives me an ""Segmentation fault (core dumped)" " then it works correctly also at restart > > > > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4 > > > > but when i try to check point and restart an exe that was compiler 16 processes it works good at checkpoint but at restart it gives this output and hangs . it stops for ever > > > > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 16 -H > > master,node001,node002,node003 > > /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.16 > > > > it looks like i am missing a simple detail > > > > here is the output i had : > > > > ------------------------------------------------------- > > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5 > > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and > > Gene Cooperman > > This program comes with ABSOLUTELY NO WARRANTY. > > This is free software, and you are welcome to redistribute it > > under certain conditions; see COPYING file for details. > > (Use flag "-q" to hide this message.) > > > > dmtcp_coordinator starting... > > Port: 7779 > > Checkpoint Interval: disabled (checkpoint manually instead) > > Exit on last client: 1 > > Backgrounding... > > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 18af1fad8d756-6416-52f43ea3(99072) > > Message: Bind failed. 
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 18af1fad8d756-6419-52f43ea3(99092) > > Message: Bind failed. > > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 18af1fad8d756-6422-52f43ea3(99112) > > Message: Bind failed. > > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5 > > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and > > Gene Cooperman > > This program comes with ABSOLUTELY NO WARRANTY. > > This is free software, and you are welcome to redistribute it > > under certain conditions; see COPYING file for details. > > (Use flag "-q" to hide this message.) > > > > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5 > > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and > > Gene Cooperman > > This program comes with ABSOLUTELY NO WARRANTY. > > This is free software, and you are welcome to redistribute it > > under certain conditions; see COPYING file for details. > > (Use flag "-q" to hide this message.) > > > > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5 > > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and > > Gene Cooperman > > This program comes with ABSOLUTELY NO WARRANTY. > > This is free software, and you are welcome to redistribute it > > under certain conditions; see COPYING file for details. > > (Use flag "-q" to hide this message.) > > > > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e707-3257-52f43ea3(99074) > > Message: Bind failed. 
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e707-3261-52f43ea3(99094) > > Message: Bind failed. > > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e707-3265-52f43ea3(99114) > > Message: Bind failed. > > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e708-2483-52f43ea3(99074) > > Message: Bind failed. > > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e708-2487-52f43ea3(99094) > > Message: Bind failed. > > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e708-2491-52f43ea3(99114) > > Message: Bind failed. > > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e709-2475-52f43ea3(99076) > > Message: Bind failed. > > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e709-2479-52f43ea3(99096) > > Message: Bind failed. 
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e709-2483-52f43ea3(99116) > > Message: Bind failed. > > Segmentation fault (core dumped) > > Segmentation fault (core dumped) > > Segmentation fault (core dumped) > > [[6422] mtcp_restart_nolibc.c:[929 read_shared_memory_area_from_file: > > mapping 6416/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master] mtcp_restart_nolibc.c with data from ckpt image > > 6419:929 read_shared_memory_area_from_file: > > ] mtcp_restart_nolibc.cmapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master:929 with data from ckpt image > > read_shared_memory_area_from_file: > > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with data from ckpt image > > [6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: > > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image > > [6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: > > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image > > [6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: > > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image > > > > > > > > > |
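The DMTCP 2.1 log quoted in the message above stops at `controllingTty = /dev/pts/0` with `Permission denied` while opening the terminal. As a quick check outside DMTCP, the sketch below tries to open a terminal device as a fresh read/write descriptor and reports the errno name if that fails. It is a diagnostic aid under stated assumptions, not DMTCP's own logic: the helper names `try_open_terminal` and `controlling_tty_path` are hypothetical, and the terminal is taken from stdin.

```python
import errno
import os


def try_open_terminal(path):
    """Try to open `path` read/write as a new descriptor.

    Returns (True, None) on success, or (False, errno_name) on failure.
    O_NOCTTY keeps the open from changing the caller's controlling terminal.
    """
    try:
        fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return True, None


def controlling_tty_path():
    """Return the terminal device behind stdin, or None if stdin is not a terminal."""
    try:
        return os.ttyname(0)
    except OSError:
        return None


if __name__ == "__main__":
    tty = controlling_tty_path()
    if tty is None:
        print("stdin is not attached to a terminal")
    else:
        owner_uid = os.stat(tty).st_uid
        ok, err = try_open_terminal(tty)
        print(f"{tty}: owner uid={owner_uid}, my uid={os.getuid()}, "
              f"open read/write: {'ok' if ok else err}")
```

Run from the same shell used for `./dmtcp_restart_script.sh`, an `EACCES` result together with an owner uid that differs from the current uid would be consistent with the `Permission denied` line in the log.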
From: Gene C. <ge...@cc...> - 2014-02-12 19:13:48
Hi Basma,

I'm pleased to say that Kapil and I were running on a different computer today, and saw your bug. If you check out the latest DMTCP 'svn', we believe this will fix the DMTCP bug that you reported. The safest thing will be to download svn revision 2566 (with the bug fix):
http://sourceforge.net/code-snapshots/svn/d/dm/dmtcp/code/dmtcp-code-2566-trunk.zip
This bug fix will also be part of DMTCP version 2.2 (the next release, perhaps in about a month).

In general, for background, this was the issue: We suspect that you were initially starting a process on a terminal as root. So, your controlling terminal was owned by root. We suspect that you then did 'su username'. At this point, you inherited the file descriptor of root for your controlling terminal. But on restart, DMTCP tries to open its own fresh file descriptor to the controlling terminal. If the terminal is owned by root, this is not possible. Hence the 'Permission denied' message you saw for the controlling terminal.

The piece of information we were missing is that you had probably opened a new terminal as one user (as 'root' or other), and then had done an su after starting work at that terminal. Is this what had happened?

We hope this fixes everything.

Best wishes,
- Gene and Kapil

On Fri, Feb 07, 2014 at 07:49:21PM +0200, basma a.azeem wrote:
> > > i tried version 2.1 > > in single node case for 16 processes at restart it gives me this error: > > > ./dmtcp_restart_script.sh > > dmtcp_restart (DMTCP + MTCP) 2.1 > > Copyright (C) 2006-2014 Jason Ansel, Michael Rieker, Kapil Arya, and > Gene Cooperman > License LGPLv3+: GNU LGPL version 3 or later <http://gnu.org/licenses/lgpl.html>. > This program comes with ABSOLUTELY NO WARRANTY. > This is free software, and you are welcome to redistribute it > under certain conditions; see COPYING file for details. > (Use flag "-q" to hide this message.)
> > [40000] ERROR at fileconnection.cpp:399 in postRestart; REASON='JASSERT(tempfd >= 0) failed' > tempfd = -1 > controllingTty = /dev/pts/0 > (strerror((*__errno_location ()))) = Permission denied > Message: Error Opening the terminal attached with the process > orterun (40000): Terminating... > ubuntu@ip-10-43-154-61:~$ [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-41000-52f519db(99080) > Message: Bind failed. > [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-41000-52f519db(99081) > Message: Bind failed. > [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-41000-52f519db(99097) > Message: Bind failed. > [41000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-41000-52f519db(99098) > Message: Bind failed. > [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-53000-52f519dc(99201) > Message: Bind failed. > [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-53000-52f519dc(99202) > Message: Bind failed. 
> [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-53000-52f519dc(99218) > Message: Bind failed. > [53000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-53000-52f519dc(99219) > Message: Bind failed. > [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-45000-52f519db(99121) > Message: Bind failed. > [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-45000-52f519db(99122) > Message: Bind failed. > [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-45000-52f519db(99138) > Message: Bind failed. > [45000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-45000-52f519db(99139) > Message: Bind failed. > [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-54000-52f519dc(99211) > Message: Bind failed. 
> [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-54000-52f519dc(99212) > Message: Bind failed. > [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-54000-52f519dc(99228) > Message: Bind failed. > [54000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-54000-52f519dc(99229) > Message: Bind failed. > [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-49000-52f519db(99161) > Message: Bind failed. > [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-49000-52f519db(99162) > Message: Bind failed. > [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-49000-52f519db(99178) > Message: Bind failed. > [49000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-49000-52f519db(99179) > Message: Bind failed. 
> [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-42000-52f519db(99091) > Message: Bind failed. > [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-42000-52f519db(99092) > Message: Bind failed. > [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-42000-52f519db(99108) > Message: Bind failed. > [42000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-42000-52f519db(99109) > Message: Bind failed. > [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-56000-52f519dc(99231) > Message: Bind failed. > [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-56000-52f519dc(99232) > Message: Bind failed. > [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-56000-52f519dc(99248) > Message: Bind failed. 
> [56000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-56000-52f519dc(99249) > Message: Bind failed. > [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-43000-52f519db(99101) > Message: Bind failed. > [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-43000-52f519db(99102) > Message: Bind failed. > [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-43000-52f519db(99118) > Message: Bind failed. > [43000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-43000-52f519db(99119) > Message: Bind failed. > [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-44000-52f519db(99111) > Message: Bind failed. > [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-44000-52f519db(99112) > Message: Bind failed. 
> [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-44000-52f519db(99128) > Message: Bind failed. > [44000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-44000-52f519db(99129) > Message: Bind failed. > [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-46000-52f519db(99131) > Message: Bind failed. > [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-46000-52f519db(99132) > Message: Bind failed. > [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-46000-52f519db(99148) > Message: Bind failed. > [46000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-46000-52f519db(99149) > Message: Bind failed. > [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-52000-52f519dc(99191) > Message: Bind failed. 
> [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-52000-52f519dc(99192) > Message: Bind failed. > [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-48000-52f519db(99151) > Message: Bind failed. > [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-52000-52f519dc(99208) > Message: Bind failed. > [52000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-52000-52f519dc(99209) > Message: Bind failed. > [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-48000-52f519db(99152) > Message: Bind failed. > [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-48000-52f519db(99168) > Message: Bind failed. > [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-55000-52f519dc(99221) > Message: Bind failed. 
> [48000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-48000-52f519db(99169) > Message: Bind failed. > [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-55000-52f519dc(99222) > Message: Bind failed. > [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-55000-52f519dc(99238) > Message: Bind failed. > [55000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-55000-52f519dc(99239) > Message: Bind failed. > [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-47000-52f519db(99141) > Message: Bind failed. > [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-47000-52f519db(99142) > Message: Bind failed. > [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-47000-52f519db(99158) > Message: Bind failed. 
> [47000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-47000-52f519db(99159) > Message: Bind failed. > [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-51000-52f519db(99181) > Message: Bind failed. > [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-51000-52f519db(99182) > Message: Bind failed. > [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-51000-52f519db(99198) > Message: Bind failed. > [51000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-51000-52f519db(99199) > Message: Bind failed. > [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-50000-52f519db(99171) > Message: Bind failed. > [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-50000-52f519db(99172) > Message: Bind failed. 
> [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-50000-52f519db(99188) > Message: Bind failed. > [50000] WARNING at socketconnection.cpp:504 in postRestart; REASON='JWARNING(sock.bind((sockaddr*) &_bindAddr,_bindAddrlen)) failed' > (strerror((*__errno_location ()))) = Address already in use > id() = 6da2961af00014aa-50000-52f519db(99189) > Message: Bind failed. > [41000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99049) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (41000): Terminating... > [49000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99129) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (49000): Terminating... > [46000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99099) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (46000): Terminating... > [55000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99189) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (55000): Terminating... 
> [43000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99069) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (43000): Terminating... > [47000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99109) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (47000): Terminating... > [42000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99059) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (42000): Terminating... > [45000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99089) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (45000): Terminating... > [54000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99179) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (54000): Terminating... > [51000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99149) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (51000): Terminating... 
> [50000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99139) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (50000): Terminating... > [44000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99079) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (44000): Terminating... > [52000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99159) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (52000): Terminating... > [56000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99199) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (56000): Terminating... > [53000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99169) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (53000): Terminating... > [48000] ERROR at connectionrewirer.cpp:89 in doReconnect; REASON='JASSERT(_real_connect(fd, (sockaddr*) &remoteAddr.addr, remoteAddr.len) == 0) failed' > id = 6da2961af00014aa-40000-52f519db(99119) > (strerror((*__errno_location ()))) = Invalid argument > Message: failed to restore connection > lu.A.16 (48000): Terminating... 
>
> From: bas...@ho...
> To: ge...@cc...; dmt...@li...
> Subject: RE: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
> Date: Fri, 7 Feb 2014 19:14:44 +0200
>
> Hi Gene/Kapil
> thank you so much for your help
>
> about your question:
>
> ./dmtcp_restart_script.sh
> (yes, this is the way I was invoking restart for dmtcp-1.2.5)
>
> Does version 1.2.5 have problems in the case of 16 processes on a 4-node cluster? It worked fine in these cases:
>
> 1- single node for 4 processes and 16 processes
> 2- 4 nodes cluster for 4 processes
>
> about this part:
> Building it should be easy: ./configure && make
> Shouldn't I also do "make install" so that all the required files are found on all nodes of the cluster?
>
> thank you
>
> > Date: Thu, 6 Feb 2014 23:03:00 -0500
> > From: ge...@cc...
> > To: bas...@ho...
> > CC: dmt...@li...
> > Subject: Re: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
> >
> > Hi Basma,
> > Would you mind re-doing this experiment with DMTCP 2.1 (the latest version)?
> > You'll find it at: http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.1/
> > Building it should be easy: ./configure && make
> > We renamed the way to start. It will now be:
> > bin/dmtcp_launch mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4
> > Then to restart, it should be the same as before:
> > ./dmtcp_restart_script.sh
> > (Is this the way that you were invoking restart for dmtcp-1.2.5?)
> >
> > If this still gives you any problems, please do write back.
> >
> > Best wishes,
> > - Gene
> >
> > ----- Original Message -----
> > From: basma a.azeem <bas...@ho...>
> > To: dmt...@li...
> > Sent: Thu, 6 Feb 2014 21:39:17 -0500 (EST)
> > Subject: [Dmtcp-forum] FW: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ?
> >
> > From: bas...@ho...
> > To: ka...@cc...
> > Subject: ver 1.2.5 - restart fail for 16 processes on a 4 nodes cluster ? > > Date: Fri, 7 Feb 2014 04:37:58 +0200 > > > > > > > > > > i am trying dmtcp version 1.2.5 with open mpi > > i use a 4 node cluster > > > > when i try to check point and restart an exe that was compiler 4 processes it works good at checkpoint and at restart it gives me an ""Segmentation fault (core dumped)" " then it works correctly also at restart > > > > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 4 -H master,node001,node002,node003 /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.4 > > > > but when i try to check point and restart an exe that was compiler 16 processes it works good at checkpoint but at restart it gives this output and hangs . it stops for ever > > > > ubuntu@master:~$ dmtcp-1.2.5/bin/dmtcp_checkpoint mpirun -np 16 -H > > master,node001,node002,node003 > > /home/ubuntu/NPB3.3/NPB3.3-MPI/bin/lu.A.16 > > > > it looks like i am missing a simple detail > > > > here is the output i had : > > > > ------------------------------------------------------- > > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5 > > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and > > Gene Cooperman > > This program comes with ABSOLUTELY NO WARRANTY. > > This is free software, and you are welcome to redistribute it > > under certain conditions; see COPYING file for details. > > (Use flag "-q" to hide this message.) > > > > dmtcp_coordinator starting... > > Port: 7779 > > Checkpoint Interval: disabled (checkpoint manually instead) > > Exit on last client: 1 > > Backgrounding... > > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 18af1fad8d756-6416-52f43ea3(99072) > > Message: Bind failed. 
> > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 18af1fad8d756-6419-52f43ea3(99092) > > Message: Bind failed. > > [6506] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 18af1fad8d756-6422-52f43ea3(99112) > > Message: Bind failed. > > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5 > > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and > > Gene Cooperman > > This program comes with ABSOLUTELY NO WARRANTY. > > This is free software, and you are welcome to redistribute it > > under certain conditions; see COPYING file for details. > > (Use flag "-q" to hide this message.) > > > > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5 > > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and > > Gene Cooperman > > This program comes with ABSOLUTELY NO WARRANTY. > > This is free software, and you are welcome to redistribute it > > under certain conditions; see COPYING file for details. > > (Use flag "-q" to hide this message.) > > > > dmtcp_checkpoint (DMTCP + MTCP) 1.2.5 > > Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and > > Gene Cooperman > > This program comes with ABSOLUTELY NO WARRANTY. > > This is free software, and you are welcome to redistribute it > > under certain conditions; see COPYING file for details. > > (Use flag "-q" to hide this message.) > > > > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e707-3257-52f43ea3(99074) > > Message: Bind failed. 
> > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e707-3261-52f43ea3(99094) > > Message: Bind failed. > > [3314] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e707-3265-52f43ea3(99114) > > Message: Bind failed. > > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e708-2483-52f43ea3(99074) > > Message: Bind failed. > > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e708-2487-52f43ea3(99094) > > Message: Bind failed. > > [2528] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e708-2491-52f43ea3(99114) > > Message: Bind failed. > > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e709-2475-52f43ea3(99076) > > Message: Bind failed. > > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e709-2479-52f43ea3(99096) > > Message: Bind failed. 
> > [2520] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed' > > (strerror((*__errno_location ()))) = Address already in use > > id() = 20385667ca0e709-2483-52f43ea3(99116) > > Message: Bind failed. > > Segmentation fault (core dumped) > > Segmentation fault (core dumped) > > Segmentation fault (core dumped) > > [[6422] mtcp_restart_nolibc.c:[929 read_shared_memory_area_from_file: > > mapping 6416/tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master] mtcp_restart_nolibc.c with data from ckpt image > > 6419:929 read_shared_memory_area_from_file: > > ] mtcp_restart_nolibc.cmapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master:929 with data from ckpt image > > read_shared_memory_area_from_file: > > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_pool.master with data from ckpt image > > [6416] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: > > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image > > [6422] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: > > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image > > [6419] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: > > mapping /tmp/openmpi-sessions-ubuntu@master_0/16976/1/shared_mem_btl_module.master with data from ckpt image > > > > > > > > > |
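The controlling-terminal explanation in the reply above (a terminal opened as one user, followed by 'su') can be checked by hand before checkpointing. The following is only a rough sketch: the function name can_reopen_tty is made up for illustration and is not part of DMTCP, and it only tests the permission bits that the current user sees on the terminal device, which approximates, but does not guarantee, what happens when the restarted process reopens the terminal.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): report whether the current user
# appears able to open a fresh read/write descriptor on a terminal device.
# $1 is the device path, for example the output of the `tty` command.
can_reopen_tty() {
    dev="$1"
    if [ -e "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "ok"
    else
        # Typical after 'su username' on a terminal owned by another user.
        echo "denied"
    fi
}
```

Running can_reopen_tty "$(tty)" in the shell that will launch the job, together with ls -l "$(tty)" to see who owns the device, should show whether the terminal belongs to a different user. If it prints "denied", starting the job from a terminal opened directly as the working user avoids the situation described in the reply.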