From: Nicholas H. <he...@se...> - 2002-10-22 21:22:28
On Tue, 22 Oct 2002 15:33:16 -0400 Nicholas Henke <he...@se...> wrote:

I am seeing a ton of these errors on our test cluster, but not on a different cluster. I have tried to find the discrepancies between the two, but no luck. Do you have any idea what might be causing this?

$ strace mpirun -d -p --np 2 ./cpi
[snip]
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({5, 0}, {5, 0}) = 0
accept(3, {sin_family=AF_INET, sin_port=htons(33105), sin_addr=inet_addr("192.168.2.10")}}, [16]) = 4
read(4, "\3048\0\0", 4) = 4
read(4, "P\201\0\0", 4) = 4
accept(3, {sin_family=AF_INET, sin_port=htons(32976), sin_addr=inet_addr("192.168.2.11")}}, [16]) = 5
read(5, "\3058\0\0", 4) = 4
read(5, "\317\200\0\0", 4) = 4
write(1, " 0 0 14532 192.168.2.10 3"..., 37 0 0 14532 192.168.2.10 33104
) = 37
write(1, " 1 1 14533 192.168.2.11 3"..., 37 1 1 14533 192.168.2.11 32975
) = 37
write(4, "\2\0\0\0", 4) = 4
write(4, "\0\0\0\0", 4) = 4
write(4, "\3048\0\0", 4) = 4
write(4, "\300\250\2\n", 4) = 4
write(4, "\201P", 2) = 2
write(4, "\3058\0\0", 4) = 4
write(4, "\300\250\2\v", 4) = 4
write(4, "\200\317", 2) = 2
close(4) = 0
write(5, "\2\0\0\0", 4) = 4
write(5, "\1\0\0\0", 4) = 4
write(5, "\3048\0\0", 4) = 4
write(5, "\300\250\2\n", 4) = 4
write(5, "\201P", 2) = 2
write(5, "\3058\0\0", 4) = 4
write(5, "\300\250\2\v", 4) = 4
write(5, "\200\317", 2) = 2
close(5) = 0
wait4(-1, Process 0 on node1.internal.org
[WIFSIGNALED(s) && WTERMSIG(s) == SIGINT], 0, NULL) = 14533
--- SIGCHLD (Child exited) ---
write(2, "rank 1 pid=14533 exited with sig"..., 38rank 1 pid=14533 exited with signal 2
) = 38
wait4(-1, xm_14532: p4_error: net_recv read: probable EOF on socket: 1
Connection failed for reason: : Connection refused
Connection failed for reason: : Connection refused
[WIFSIGNALED(s) && WTERMSIG(s) == SIGPIPE], 0, NULL) = 14532
--- SIGCHLD (Child exited) ---
write(2, "rank 0 pid=14532 exited with sig"..., 39rank 0 pid=14532 exited with signal 13
) = 39
wait4(-1, 0xbffff9f8, 0, NULL) = -1 ECHILD (No child processes)
munmap(0x40017000, 4096) = 0
_exit(0)
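
The byte strings in the trace can be decoded and checked against the table that mpirun prints. This is a minimal sketch in Python. The field layout is an inference from the trace, not taken from the mpirun source: it assumes each 4-byte read() is a little-endian integer (PID first, then listener port), and that the later writes carry the IPv4 address as 4 raw bytes and the port as 2 bytes in network byte order. The function name decode_rank is made up for this example.

```python
import socket
import struct


def decode_rank(pid_raw, port_raw, addr_raw, net_port_raw):
    """Decode one rank's fields from the byte strings shown by strace.

    Assumed layout (inferred from the trace, not from the mpirun source):
      pid_raw      4 bytes, little-endian unsigned int (first read)
      port_raw     4 bytes, little-endian unsigned int (second read)
      addr_raw     4 bytes, raw IPv4 address (4-byte write)
      net_port_raw 2 bytes, network byte order port (2-byte write)
    """
    pid = struct.unpack("<I", pid_raw)[0]
    port = struct.unpack("<I", port_raw)[0]
    addr = socket.inet_ntoa(addr_raw)
    net_port = struct.unpack("!H", net_port_raw)[0]
    return pid, port, addr, net_port


# Byte strings copied from the strace output; strace prints octal escapes,
# which Python bytes literals accept unchanged.
rank0 = decode_rank(b"\3048\0\0", b"P\201\0\0", b"\300\250\2\n", b"\201P")
rank1 = decode_rank(b"\3058\0\0", b"\317\200\0\0", b"\300\250\2\v", b"\200\317")

print(rank0)
print(rank1)
```

Under those assumptions the decoded values agree with the two table rows in the trace (PID 14532 at 192.168.2.10 port 33104, and PID 14533 at 192.168.2.11 port 32975), so the data exchanged during startup looks internally consistent.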