From: Nicholas H. <he...@se...> - 2003-04-10 16:55:56
On Wed, 9 Apr 2003 16:04:19 -0600 er...@he... wrote:

Ok -- here is another node. I tried to find the process information on the head node in case that helps.

ps -zxf for node25 on the head node:

28460 ?  S   0:00 bpsh -n 25 subtaskInvoker /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similarity/fin
28462 ?  SW  0:00  \_ [subtaskInvoker]
28463 ?  SW  0:00  \_ [blastSimilarity]
11803 ?  SW  0:00  \_ [sh]
11804 ?  SW  0:00  \_ [blastx]
11823 ?  SW  0:00  \_ [blastx]
11825 ?  SW  0:00  \_ [blastx]

On the node before the kill:

10509 ?  S   0:00  \_ /bin/sh /proc/self/fd/3 /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similar
10510 ?  S   0:00  \_ /usr/bin/perl /home/sfischer/gushome/bin/blastSimilarity --blastBinDir /genomics/share/pkg/bio/wu-blast/curren
11825 ?  S   0:00  \_ sh -c /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask=s
11826 ?  S   0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg
11839 ?  S   0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask
11840 ?  S   0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -word

Hrm -- this one seemed to have killed them all. With the head node information I managed to trim the trace from where the bpsh <...> started to the end. The trace is at: http://www.liniac.upenn.edu/~henken/bproc/node25trace

One more after this -- just for fun :)

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Nicholas H. <he...@se...> - 2003-04-10 16:48:23
On Wed, 9 Apr 2003 16:04:19 -0600 er...@he... wrote:

> I suspect there's a problem in the signal forwarding and the remote
> system call stuff that the slave side does. That code *looks* ok to
> me but maybe there's a problem. Seeing a message trace for the PIDs
> involved should shed some light on this.
>
> Also, process 5377 reparenting to bpslave is normal. bpslave is the
> "child reaper" (instead of init) for bproc managed processes on the
> nodes. This is necessary for ptrace to work properly. I think the
> parents exited and it didn't so that reparent is correct.

Ok -- the traces are huge -- and frankly I could not discern the interesting parts -- I have placed them on our web server for your fun and amusement.

Here is the ps output before and after the kill -9:

 569 ?  S  0:02 /usr/sbin/bpslave -m /scratch/bpslave_new.strace -r 192.168.0.223 2223
 570 ?  S  0:00  \_ /usr/sbin/bpslave -m /scratch/bpslave_new.strace -r 192.168.0.223 2223
 624 ?  S  0:00  \_ mond -d
3271 ?  S  0:00  \_ /bin/sh /proc/self/fd/3 /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similarity/f
3272 ?  S  0:00  \_ /usr/bin/perl /home/sfischer/gushome/bin/blastSimilarity --blastBinDir /genomics/share/pkg/bio/wu-blast/current --d
4923 ?  S  0:00  \_ sh -c /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask=seg+xn
4924 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu
4937 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+
4938 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask

kill -9 4924 4937 4938

 569 ?  S  0:02 /usr/sbin/bpslave -m /scratch/bpslave_new.strace -r 192.168.0.223 2223
 570 ?  S  0:00  \_ /usr/sbin/bpslave -m /scratch/bpslave_new.strace -r 192.168.0.223 2223
 624 ?  S  0:00  \_ mond -d
4938 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu W 3 T 1000 B
3271 ?  S  0:00  \_ /bin/sh /proc/self/fd/3 /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similarity/f
3272 ?  S  0:00  \_ /usr/bin/perl /home/sfischer/gushome/bin/blastSimilarity --blastBinDir /genomics/share/pkg/bio/wu-blast/current --d
4923 ?  Z  0:00  \_ [sh <defunct>]

The trace is located at http://www.liniac.upenn.edu/~henken/bproc/node48bpslave.strace. I will send another node's info just for comparison.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: J.A. M. <jam...@ab...> - 2003-04-09 22:58:24
On 04.10, er...@he... wrote:
> On Wed, Apr 09, 2003 at 05:40:01PM -0400, Nicholas Henke wrote:
> > On Wed, 9 Apr 2003 14:48:07 -0600 er...@he... wrote:
> >
> > > Usually the only reason you would get a sigstop from the OS is
> > > terminal related and then it should be TSTP.

Can you try with the below patch? I am currently running this kernel:

http://giga.cps.unizar.es/~magallon/linux/kernel/old/2.4.21-pre5-jam1.tar.bz2

and it works with threads. I run multithreaded (pthreads) programs on the nodes via bpsh, and they work fine. Not too much stress, just create one thread per processor (2) and crunch numbers in parallel, with some occasional locking for serializing some ops.

This patch is related to the possibility of detached threads receiving a new exit signal while they are exiting due to a previous one, and ending up in a loop. The author is Ingo Molnar <mi...@el...>; ask him for the exact details.

--- linux/kernel/exit.c.orig	Mon Sep  9 14:06:05 2002
+++ linux/kernel/exit.c	Mon Sep  9 14:06:25 2002
@@ -369,7 +369,7 @@
 	 *
 	 */
-	if(current->exit_signal != SIGCHLD &&
+	if(current->exit_signal != SIGCHLD && current->exit_signal != -1 &&
 	    ( current->parent_exec_id != t->self_exec_id  ||
 	      current->self_exec_id  != current->parent_exec_id)
 	    && !capable(CAP_KILL))

--
J.A. Magallon <jam...@ab...>      \  Software is like sex:
werewolf.able.es                     \  It's better when it's free

Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-pre7-jam1 (gcc 3.2.2 (Mandrake Linux 9.2 3.2.2-5mdk))
From: Nicholas H. <he...@se...> - 2003-04-09 22:51:56
On Wed, 9 Apr 2003 16:04:19 -0600 er...@he... wrote:

> Hrm.... The only plausible reason I can think of for the kill -9 to
> not work is that it's actually blocked in kernel space somewhere. It
> could be that the process is getting signaled while it's waiting for
> some remote request to complete. Most likely in a bpr_rsyscall. I
> think a message trace for all the pids involved would be very
> interesting here. If that's the case we need to figure out what that
> remote request is and (of course) why it's not completing in a
> reasonable amount of time.

Sure, will do.

> I suspect there's a problem in the signal forwarding and the remote
> system call stuff that the slave side does. That code *looks* ok to
> me but maybe there's a problem. Seeing a message trace for the PIDs
> involved should shed some light on this.
>
> Also, process 5377 reparenting to bpslave is normal. bpslave is the
> "child reaper" (instead of init) for bproc managed processes on the
> nodes. This is necessary for ptrace to work properly. I think the
> parents exited and it didn't so that reparent is correct.

Ok -- I just thought it weird that it reparented but didn't die from kill -9. I will get those message traces to you as soon as I can (most likely in the morning).

Thanks a ton, Erik :)

Nic
--
Nicholas Henke
Linux Cluster Systems Programmer
Liniac Project - Univ. of Pennsylvania
From: <er...@he...> - 2003-04-09 22:35:14
On Wed, Apr 09, 2003 at 05:40:01PM -0400, Nicholas Henke wrote:
> On Wed, 9 Apr 2003 14:48:07 -0600 er...@he... wrote:
>
> > Usually the only reason you would get a sigstop from the OS is
> > terminal related and then it should be TSTP.
> >
> > That strace is pretty strange. There's a lot of rt_sigaction w/
> > SIGPIPE but the slave daemon code only does that once with SIGPIPE and
> > it sets it to SIG_IGN. Anyway, this might have something to do with
> > the 'exit signal' on a process. That's about the only way I can think
> > to signal the slave daemon...
>
> Hrm -- ok. Is there any more information I can provide about this?
> I have the user running jobs for me so I can catch them in the 'wild' so to speak.
>
> Here are a few things I am seeing:
>
> Hung state:
>  567 ?  S  0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
>  568 ?  S  0:00  \_ /usr/sbin/bpslave -r 192.168.0.223 2223
>  622 ?  S  0:00  \_ mond -d
> 3054 ?  S  0:00  \_ /bin/sh /proc/self/fd/3 /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similarity/f
> 3055 ?  S  0:01  \_ /usr/bin/perl /home/sfischer/gushome/bin/blastSimilarity --blastBinDir /genomics/share/pkg/bio/wu-blast/current --d
> 5362 ?  S  0:00  \_ sh -c /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask=seg+xn
> 5363 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu
> 5376 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+
> 5377 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask
>
> OK -- so blast is hung -- let's kill it: kill -9 5377 5376 5363
> Now the rest is hung.... Notice how the blastx has jumped parents -- that doesn't seem right.
>
>  567 ?  S  0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
>  568 ?  S  0:00  \_ /usr/sbin/bpslave -r 192.168.0.223 2223
>  622 ?  S  0:00  \_ mond -d
> 5377 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu W 3 T 1000 B
> 3054 ?  S  0:00  \_ /bin/sh /proc/self/fd/3 /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similarity/f
> 3055 ?  S  0:01  \_ /usr/bin/perl /home/sfischer/gushome/bin/blastSimilarity --blastBinDir /genomics/share/pkg/bio/wu-blast/current --d
> 5362 ?  Z  0:00  \_ [sh <defunct>]
>
> Odd -- ok, so: kill -9 3055
>
>  567 ?  S  0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
>  568 ?  S  0:00  \_ /usr/sbin/bpslave -r 192.168.0.223 2223
>  622 ?  S  0:00  \_ mond -d
> 5377 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu W 3 T 1000 B
>
> Hrm -- ok, still won't die: kill -9 5377
>
>  567 ?  S  0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
>  568 ?  S  0:00  \_ /usr/sbin/bpslave -r 192.168.0.223 2223
>  622 ?  S  0:00  \_ mond -d
>
> And... sometimes that last process dies, and sometimes it doesn't.
> I tried reverting to an older version of glibc to make sure that wasn't the culprit.
> Any ideas?

Hrm.... The only plausible reason I can think of for the kill -9 to not work is that it's actually blocked in kernel space somewhere. It could be that the process is getting signaled while it's waiting for some remote request to complete -- most likely in a bpr_rsyscall. I think a message trace for all the pids involved would be very interesting here. If that's the case, we need to figure out what that remote request is and (of course) why it's not completing in a reasonable amount of time.

I suspect there's a problem in the signal forwarding and the remote system call stuff that the slave side does. That code *looks* ok to me, but maybe there's a problem. Seeing a message trace for the PIDs involved should shed some light on this.

Also, process 5377 reparenting to bpslave is normal. bpslave is the "child reaper" (instead of init) for bproc managed processes on the nodes. This is necessary for ptrace to work properly. I think the parents exited and it didn't, so that reparent is correct.

- Erik
From: Nicholas H. <he...@se...> - 2003-04-09 21:36:00
On Wed, 9 Apr 2003 14:48:07 -0600 er...@he... wrote:

> Usually the only reason you would get a sigstop from the OS is
> terminal related and then it should be TSTP.
>
> That strace is pretty strange. There's a lot of rt_sigaction w/
> SIGPIPE but the slave daemon code only does that once with SIGPIPE and
> it sets it to SIG_IGN. Anyway, this might have something to do with
> the 'exit signal' on a process. That's about the only way I can think
> to signal the slave daemon...

Hrm -- ok. Is there any more information I can provide about this? I have the user running jobs for me so I can catch them in the 'wild', so to speak.

Here are a few things I am seeing:

Hung state:

 567 ?  S  0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
 568 ?  S  0:00  \_ /usr/sbin/bpslave -r 192.168.0.223 2223
 622 ?  S  0:00  \_ mond -d
3054 ?  S  0:00  \_ /bin/sh /proc/self/fd/3 /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similarity/f
3055 ?  S  0:01  \_ /usr/bin/perl /home/sfischer/gushome/bin/blastSimilarity --blastBinDir /genomics/share/pkg/bio/wu-blast/current --d
5362 ?  S  0:00  \_ sh -c /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask=seg+xn
5363 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu
5376 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+
5377 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask

OK -- so blast is hung -- let's kill it: kill -9 5377 5376 5363

Now the rest is hung.... Notice how the blastx has jumped parents -- that doesn't seem right.

 567 ?  S  0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
 568 ?  S  0:00  \_ /usr/sbin/bpslave -r 192.168.0.223 2223
 622 ?  S  0:00  \_ mond -d
5377 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu W 3 T 1000 B
3054 ?  S  0:00  \_ /bin/sh /proc/self/fd/3 /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similarity/f
3055 ?  S  0:01  \_ /usr/bin/perl /home/sfischer/gushome/bin/blastSimilarity --blastBinDir /genomics/share/pkg/bio/wu-blast/current --d
5362 ?  Z  0:00  \_ [sh <defunct>]

Odd -- ok, so: kill -9 3055

 567 ?  S  0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
 568 ?  S  0:00  \_ /usr/sbin/bpslave -r 192.168.0.223 2223
 622 ?  S  0:00  \_ mond -d
5377 ?  S  0:00  \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu W 3 T 1000 B

Hrm -- ok, still won't die: kill -9 5377

 567 ?  S  0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
 568 ?  S  0:00  \_ /usr/sbin/bpslave -r 192.168.0.223 2223
 622 ?  S  0:00  \_ mond -d

And... sometimes that last process dies, and sometimes it doesn't. I tried reverting to an older version of glibc to make sure that wasn't the culprit. Any ideas?

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: <er...@he...> - 2003-04-09 21:19:03
On Wed, Apr 09, 2003 at 01:47:51PM -0400, Nicholas Henke wrote:
> On Wed, 9 Apr 2003 10:18:35 -0600 er...@he... wrote:
>
> > Signal stuff *should* be local to the node and basically the same as
> > w/o BProc for pthreads stuff. If it relies on process group stuff
> > that might not be true but I don't *think* it should be doing that.
>
> Ok -- I just thought it was really weird that bpslave was getting
> SIGSTOP. It seems to me that getting that signal might be causing the
> rest of the problems.
> Is there any reason that bpslave would get SIGSTOP -- does the OS send
> that in a strange condition?

Usually the only reason you would get a SIGSTOP from the OS is terminal related, and then it should be TSTP.

That strace is pretty strange. There's a lot of rt_sigaction w/ SIGPIPE, but the slave daemon code only does that once with SIGPIPE, and it sets it to SIG_IGN. Anyway, this might have something to do with the 'exit signal' on a process. That's about the only way I can think to signal the slave daemon...

- Erik
From: Nicholas H. <he...@se...> - 2003-04-09 17:47:30
On Wed, 9 Apr 2003 13:47:51 -0400 Nicholas Henke <he...@se...> wrote:

> It looks to me that it pops off a couple of threads, and then starts
> working. It looks like it either hangs right after thread creation, or
> after the program has exited.

Let me clarify this -- I have seen 2 error cases here: the blast process will hang right after thread creation, or bpslave looks hung (SIGSTOP) after the program has exited -- the latter leaves the perl & sh processes that started the blast process hanging around.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Nicholas H. <he...@se...> - 2003-04-09 17:44:17
On Wed, 9 Apr 2003 10:20:50 -0600 er...@he... wrote:

> It should have a message trace option like the master but other than
> that, no, I don't think it should spew anything.

OK -- it didn't, which is why I asked.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Nicholas H. <he...@se...> - 2003-04-09 17:43:54
On Wed, 9 Apr 2003 10:18:35 -0600 er...@he... wrote:

> Signal stuff *should* be local to the node and basically the same as
> w/o BProc for pthreads stuff. If it relies on process group stuff
> that might not be true but I don't *think* it should be doing that.

Ok -- I just thought it was really weird that bpslave was getting SIGSTOP. It seems to me that getting that signal might be causing the rest of the problems. Is there any reason that bpslave would get SIGSTOP -- does the OS send that in a strange condition?

> Is it possible to characterize what the app is doing wrt pthreads?
> Does this app do a lot of thread creation and cleanup or does it kick
> off a few and they get stuck later? From the trace backs it looks
> like it's sticking in the mutex or condition variable code. I tried
> to stress that stuff a bit but it hasn't been breaking for me so far.

It looks to me that it pops off a couple of threads, and then starts working. It looks like it either hangs right after thread creation, or after the program has exited.

> As usual, it's really hard to say what's going on if I can't reproduce
> it. A small test program that did it would be fantastic. As usual, a
> message trace might be interesting. In particular, it might be useful
> to know if there's BProc traffic related to those processes while it's
> running. If it's creating a lot of threads while it runs, the answer
> will be yes but it might still be interesting if it's anything other
> than fork and wait messages.

I do not think it runs many threads -- just 2 to do its work, and then it dies. I ran an strace on all of the bpslave processes during the user's last run, and it looks to me like there is some strange SIG* stuff in there -- I am not sure what you should see normally. I have attached the 'grep -n SIG | grep -v CHLD' output -- if you would like to see the full logs, holler, and I will try to get them to you -- they are 6.4G of logs total :)

I will have the next run use the -m flag for bpslave to see if there is any interesting message traffic.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: <er...@he...> - 2003-04-09 16:51:43
On Mon, Apr 07, 2003 at 01:57:59PM -0400, Nicholas Henke wrote:
> On Mon, 7 Apr 2003 13:43:20 -0400 Nicholas Henke <he...@se...> wrote:
>
> > On Mon, 7 Apr 2003 11:06:44 -0600 er...@he... wrote:
> >
> > > One of the bpslave processes is the interesting one, the other is an
> > > IO forwarder. It might be useful to run bpslave with -d and -v and
> > > just watch it.
> >
> > Is this supposed to output anything during normal operations? -- I
> > just got the IO forwarder started.

It should have a message trace option like the master, but other than that, no, I don't think it should spew anything.

- Erik
From: <er...@he...> - 2003-04-09 16:49:28
On Tue, Apr 08, 2003 at 05:23:23PM -0400, Nicholas Henke wrote:
> OK -- upgrading to 2.4.20 and bproc-3.2.4 seems to have solved the
> problem ...at least so far.
>
> I am now seeing a _really_ strange error that I am trying to find the
> root of. All indicators seem to point to bproc -- the user is running
> their job under ssh right now to make sure.
>
> Problem: The user is running ncbi and wu-blast -- programs that do
> genomics/bioinformatics sequence 'stuff'. The program is invoked using
> bpsh, after which it gets to the node and, using pthreads, forks off a
> couple of threads. Now, apparently at random, all 2 of the threads get
> stuck, and provide similar tracebacks from gdb -- with respect to the
> mutex_lock and sigsuspend...

Signal stuff *should* be local to the node and basically the same as w/o BProc for pthreads stuff. If it relies on process group stuff that might not be true, but I don't *think* it should be doing that.

Is it possible to characterize what the app is doing wrt pthreads? Does this app do a lot of thread creation and cleanup, or does it kick off a few and they get stuck later? From the tracebacks it looks like it's sticking in the mutex or condition variable code. I tried to stress that stuff a bit, but it hasn't been breaking for me so far.

As usual, it's really hard to say what's going on if I can't reproduce it. A small test program that did it would be fantastic. As usual, a message trace might be interesting. In particular, it might be useful to know if there's BProc traffic related to those processes while it's running. If it's creating a lot of threads while it runs, the answer will be yes, but it might still be interesting if it's anything other than fork and wait messages.

- Erik

> Loaded symbols for /lib/i686/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> 0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> 45      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
>         in ../sysdeps/unix/sysv/linux/sigsuspend.c
> (gdb) bt
> #0  0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> #1  0x40048179 in __pthread_wait_for_restart_signal (self=0x59969be0) at pthread.c:978
> #2  0x40049ee9 in __pthread_alt_lock (lock=0x40189720, self=0x0) at restart.h:34
> #3  0x40046cf6 in __pthread_mutex_lock (mutex=0x40189710) at mutex.c:120
> #4  0x400d53e8 in __libc_free (mem=0x82b0ef0) at malloc.c:3152
> #5  0x0817a316 in Nlm_MemFree ()
> #6  0x0804ac06 in NlmThreadWrapper ()
> #7  0x40045c3f in pthread_start_thread (arg=0x59969be0) at manager.c:284
>
> This would not be a terrible problem, except that one of the programs
> will refuse to pay attention to kill -9 -- some of them will die, but
> one of the threads stuck in sigsuspend will not go away.
>
> Is it possible that bpslave is dropping the wait_for_restart_signal? I
> would appreciate any info or direction you could provide -- this is
> really odd. BTW -- the same thing happens on 2.4.19 and bproc-3.2.3
> (all on RH 7.2).
>
> Nic
> --
> Nicholas Henke
> Penguin Herder & Linux Cluster System Programmer
> Liniac Project - Univ. of Pennsylvania
From: Nicholas H. <he...@se...> - 2003-04-08 21:39:48
Following are the backtraces in gdb from each of the threads. I also noticed that these processes obeyed kill -9 (from the node side -- all this has been), but if gdb is attached, the process remains in state 'T' and ignores kill -9. Is there a good way to attach gdb to LWPs?

0x40082bb5 in __sigsuspend (set=0xbffff390) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
45      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
        in ../sysdeps/unix/sysv/linux/sigsuspend.c
(gdb) bt
#0  0x40082bb5 in __sigsuspend (set=0xbffff390) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1  0x40048179 in __pthread_wait_for_restart_signal (self=0x40050f20) at pthread.c:978
#2  0x40044bac in pthread_cond_wait (cond=0x82ac8ec, mutex=0x82ac8d4) at restart.h:34
#3  0x0804b17a in NlmSemaWait ()
#4  0x0804af5e in NlmThreadJoinAll ()
#5  0x08088da1 in RPSBlastSearchMT ()
#6  0x0804a957 in Nlm_Main ()
#7  0x08178470 in main ()
#8  0x40070657 in __libc_start_main (main=0x8178454 <main>, argc=11, ubp_av=0xbffff5b4, init=0x80496d4 <_init>, fini=0x818a164 <_fini>, rtld_fini=0x4000dcd4 <_dl_fini>, stack_end=0xbffff5ac) at ../sysdeps/generic/libc-start.c:129

0x4013a427 in __poll (fds=0x82ae784, nfds=1, timeout=2000) at ../sysdeps/unix/sysv/linux/poll.c:63
63      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
        in ../sysdeps/unix/sysv/linux/poll.c
(gdb) bt
#0  0x4013a427 in __poll (fds=0x82ae784, nfds=1, timeout=2000) at ../sysdeps/unix/sysv/linux/poll.c:63
#1  0x400458f0 in __pthread_manager (arg=0x7) at manager.c:140

0x40082bb5 in __sigsuspend (set=0x599ac88c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
45      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
        in ../sysdeps/unix/sysv/linux/sigsuspend.c
(gdb) bt
#0  0x40082bb5 in __sigsuspend (set=0x599ac88c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1  0x40048179 in __pthread_wait_for_restart_signal (self=0x599acbe0) at pthread.c:978
#2  0x400499cc in __pthread_lock (lock=0x40188bf0, self=0x599acbe0) at spinlock.c:149
#3  0x40046d16 in __pthread_mutex_lock (mutex=0x40188be0) at mutex.c:109
#4  0x400d0f70 in _IO_link_in (fp=0x82bb5f8) at genops.c:97
#5  0x400d0bb0 in _IO_new_file_init (fp=0x82bb5f8) at fileops.c:150
#6  0x400c5db3 in _IO_new_fopen (filename=0x82b3180 "rpsblast.log", mode=0x81b1664 "a+") at iofopen.c:63
#7  0x08177c0a in Nlm_FileOpen ()
#8  0x08174bf8 in Nlm_ErrSetLogfile ()
#9  0x0804abfa in NlmThreadWrapper ()
#10 0x40045c3f in pthread_start_thread (arg=0x599acbe0) at manager.c:284

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Nicholas H. <he...@se...> - 2003-04-08 21:30:37
More info -- I get a ping timeout on the master to node30 -- so I ssh to node30 to see what is going on. bpslave is still running 2 threads, and there is one process running under bproc. If I run strace on the lower of the 2 processes, this is what I get. After strace exits, the bpslave process is dead.

< from ps auxwwww >
root  577  0.0  0.0  1416  532 ?  T  14:46  0:01 /usr/sbin/bpslave -r 192.168.0.223 2223

[root@node30 root]# strace -p 577
--- SIGSTOP (Stopped (signal)) ---
wait4(-1, NULL, WNOHANG, NULL)          = 0
time(NULL)                              = 1049833543
time(NULL)                              = 1049833543
select(5, [3 4], [3], NULL, {15, 0})    = 3 (in [3 4], out [3], left {15, 0})
read(4, "\323B\0\0\17\0\1\3\276\f\0\0\276\f\0\0\0\0\0\0\0\0\0\0"..., 152) = 152
write(3, "\0\0\0\0\27\0\2\2\0\0\0\0\377\377\377\377\0\0\0\0\4\20"..., 152) = 152
write(3, "\323B\0\0\17\0\1\3\276\f\0\0\276\f\0\0\0\0\0\0\0\0\0\0"..., 152) = 152
read(3, "\270\341\2\0\4\0\2\1\377\377\377\377\271\f\0\0\0\0\0\0"..., 152) = 152
time(NULL)                              = 1049833543
wait4(-1, NULL, WNOHANG, NULL)          = 0
time(NULL)                              = 1049833543
time(NULL)                              = 1049833543
select(5, [3 4], [4], NULL, {15, 0})    = 3 (in [3 4], out [4], left {15, 0})
write(4, "\270\341\2\0\4\0\2\1\377\377\377\377\271\f\0\0\0\0\0\0"..., 152) = 152
read(4, "\324B\0\0\r\0\1\3\275\f\0\0\275\f\0\0\0\0\0\0\0\0\0\0\0"..., 152) = 152
read(3, "\271\341\2\0\4\0\2\1\377\377\377\377\271\f\0\0\0\0\0\0"..., 152) = 152
time(NULL)                              = 1049833543
wait4(-1, NULL, WNOHANG, NULL)          = 0
time(NULL)                              = 1049833543
time(NULL)                              = 1049833543
select(5, [3 4], [3 4], NULL, {15, 0})  = 4 (in [3 4], out [3 4], left {15, 0})
write(4, "\271\341\2\0\4\0\2\1\377\377\377\377\271\f\0\0\0\0\0\0"..., 152) = 152
read(4, "\325B\0\0\17\0\1\3\275\f\0\0\275\f\0\0\0\0\0\0\0\0\0\0"..., 152) = 152
write(3, "\324B\0\0\r\0\1\3\275\f\0\0\275\f\0\0\0\0\0\0\0\0\0\0\0"..., 152) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) ---

[root@node30 root]# strace -p 577
attach: ptrace(PTRACE_ATTACH, ...): No such process

I find it a bit curious that bpslave is getting SIGSTOP -- ??

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: Nicholas H. <he...@se...> - 2003-04-08 21:19:48
OK -- upgrading to 2.4.20 and bproc-3.2.4 seems to have solved the problem... at least so far.

I am now seeing a _really_ strange error that I am trying to find the root of. All indicators seem to point to bproc -- the user is running their job under ssh right now to make sure.

Problem: the user is running ncbi and wu-blast -- programs that do genomics/bioinformatics sequence 'stuff'. The program is invoked using bpsh, after which it gets to the node and, using pthreads, forks off a couple of threads. Now, apparently at random, all 2 of the threads get stuck, and provide similar tracebacks from gdb -- with respect to the mutex_lock and sigsuspend...

Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
45      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
        in ../sysdeps/unix/sysv/linux/sigsuspend.c
(gdb) bt
#0  0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1  0x40048179 in __pthread_wait_for_restart_signal (self=0x59969be0) at pthread.c:978
#2  0x40049ee9 in __pthread_alt_lock (lock=0x40189720, self=0x0) at restart.h:34
#3  0x40046cf6 in __pthread_mutex_lock (mutex=0x40189710) at mutex.c:120
#4  0x400d53e8 in __libc_free (mem=0x82b0ef0) at malloc.c:3152
#5  0x0817a316 in Nlm_MemFree ()
#6  0x0804ac06 in NlmThreadWrapper ()
#7  0x40045c3f in pthread_start_thread (arg=0x59969be0) at manager.c:284

This would not be a terrible problem, except that one of the programs refuses to pay attention to kill -9 -- some of them will die, but one of the threads stuck in sigsuspend will not go away.

Is it possible that bpslave is dropping the wait_for_restart_signal? I would appreciate any info or direction you could provide -- this is really odd. BTW -- the same thing happens on 2.4.19 and bproc-3.2.3 (all on RH 7.2).

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
From: rwillis <rw...@ct...> - 2003-04-07 23:51:22
Hi,

I made a simple test program for MPI to test it on my cluster. The program simply sends a simple message from any non-zero node to node-0; node-0 then prints it. Anyway, I compile it (using mpicc) and invoke it by typing:

<path>/mpirun -d -G -p 2 ./hello

The program does not run, but I get this back:

rank 1 pid=6681 exited with signal 13
[0] Error: inconsistancy in collected data!
rank 0 pid=6680 exited with signal 13

I get the same thing when trying to run Netpipe (make mpi). I don't know where the "inconsistancy" error is coming from; I have not found it in any source. Any ideas? Has anyone seen this before? Is exiting with signal 13 bad or good?

BTW, I put in some debugging into mpirun to look at parameters going into bproc_vexecmove_io(), and a NULL is being passed into the function for the program name when I use 'hello' instead of './hello' as an argument to mpirun. I don't know if this is a bug or not, but I thought I would report it anyway.

Thanks all,
- Richard
From: <er...@he...> - 2003-04-07 19:18:56
|
On Mon, Apr 07, 2003 at 11:59:12AM -0700, rwillis wrote:
> Hi,
> I'm not using the entire toolset that makes up Clustermatic, as I already
> have a nice working node bootup system running. I have BProc up and running,
> and I am using MPICH-GM, so I would like to use your version of mpirun. I
> downloaded, built and installed CMtools 1.1, but the file
> /var/beowulf/nodeinfo is missing. I have not found out yet what makes this
> file. If anybody knows I would be interested in hearing from you.

nodeinfo is normally created by our boot-up stuff. It looks something like:

(node 0 (cpus 2) (gm 10) (hz 2395000000) (mem 2119340032))
(node 1 (cpus 2) (gm 11) (hz 2395000000) (mem 2119340032))
(node 2 (cpus 2) (gm 12) (hz 2395000000) (mem 2119340032))
(node 3 (cpus 2) (gm 13) (hz 2395000000) (mem 2119340032))
(node 4 (cpus 2) (gm 14) (hz 2395000000) (mem 2119340032))
(node 5 (cpus 2) (gm 15) (hz 2395000000) (mem 2119340032))
(node 6 (cpus 2) (gm 16) (hz 2395000000) (mem 2119340032))
(node 7 (cpus 2) (gm 17) (hz 2395000000) (mem 2119340032))
(node 8 (cpus 2) (gm 18) (hz 2395000000) (mem 2119340032))
(node 9 (cpus 2) (gm 19) (hz 2395000000) (mem 2119340032))

> Also, there was a patchfile in the mpirun directory
> (mpich-1.2.4..8a-gm-bproc.patch). I manually patched the gmpi_conf.c file in
> the mpich-1.2.5..9 directory with it. I am just curious why it is needed,
> and how the environment variables it requires are created. Is it even
> required?
>
> The environment variables that cmtools (and the patched mpich-gm) need
> are: GMPI_CONF, BPROC_RANK & NODES. Do I set these manually?

These are all set by the mpirun that comes with cmtools. NODES is optional; the others are required if you're using that patch.

- Erik
|
From: rwillis <rw...@ct...> - 2003-04-07 18:59:27
|
Hi, I'm not using the entire toolset that makes up Clustermatic, as I already have a nice working node bootup system running. I have BProc up and running, and I am using MPICH-GM, so I would like to use your version of mpirun. I downloaded, built and installed CMtools 1.1, but the file /var/beowulf/nodeinfo is missing. I have not found out yet what makes this file. If anybody knows, I would be interested in hearing from you.

Also, there was a patchfile in the mpirun directory (mpich-1.2.4..8a-gm-bproc.patch). I manually patched the gmpi_conf.c file in the mpich-1.2.5..9 directory with it. I am just curious why it is needed, and how the environment variables it requires are created. Is it even required?

The environment variables that cmtools (and the patched mpich-gm) need are: GMPI_CONF, BPROC_RANK & NODES. Do I set these manually?

- Richard
|
From: Nicholas H. <he...@se...> - 2003-04-07 17:54:40
|
On Mon, 7 Apr 2003 13:43:20 -0400 Nicholas Henke <he...@se...> wrote:
> On Mon, 7 Apr 2003 11:06:44 -0600 er...@he... wrote:
> > One of the bpslave processes is the interesting one, the other is an
> > IO forwarder. It might be useful to run bpslave with -d and -v and
> > just watch it.
> Is this supposed to output anything during normal operations?

-- I just got the IO forwarder started.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
|
From: Nicholas H. <he...@se...> - 2003-04-07 17:39:59
|
On Mon, 7 Apr 2003 11:06:44 -0600 er...@he... wrote:
> Is it possible to sic gdb on both of the processes until one dies?
> That might be impractical because of signal handling issues. Are
> there any kernel log messages?

Nope -- no kernel logs. Gdb does get a bit wonky with regards to signals, so I cannot do that.

> One of the bpslave processes is the interesting one, the other is an
> IO forwarder. It might be useful to run bpslave with -d and -v and
> just watch it.

Will try. It is a bit of a problem, as this is a 32-node cluster and the nodes are dying at random -- I will see what kind of info I can get.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
|
From: <er...@he...> - 2003-04-07 17:37:12
|
On Mon, Apr 07, 2003 at 01:27:53PM -0400, Nicholas Henke wrote:
> Hey Erik--
> I am running bproc-3.2.3 on a 2.4.20 kernel, and am experiencing a
> strange bpslave problem. Every so often -- with no apparent trigger --
> one of the 2 bpslave processes on the nodes will die. The second stays
> alive, and strace shows it sitting on a select. Any ideas on how to
> debug this?

Is it possible to sic gdb on both of the processes until one dies? That might be impractical because of signal handling issues. Are there any kernel log messages?

One of the bpslave processes is the interesting one; the other is an IO forwarder. It might be useful to run bpslave with -d and -v and just watch it.

- Erik
|
From: Nicholas H. <he...@se...> - 2003-04-07 17:24:38
|
Hey Erik--
I am running bproc-3.2.3 on a 2.4.20 kernel, and am experiencing a strange bpslave problem. Every so often -- with no apparent trigger -- one of the 2 bpslave processes on the nodes will die. The second stays alive, and strace shows it sitting on a select. Any ideas on how to debug this?

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
|
From: Kimitoshi T. <kt...@cl...> - 2003-04-04 11:25:17
|
Thank you for the reply. We will examine the license of each component. As for the Myrinet stuff, we will exclude their software, because our small cluster will only use Gigabit Ethernet.

# BTW, I'm very impressed by BProc.
# I hope we can somehow contribute to the development of
# the software in the future.

Thanks,
Kimitoshi Takahashi
Cluster Computing Inc., Japan

> The license for just about everything on there is the GPL. That means
> you can freely re-distribute and modify it, but any modifications you
> decide to distribute must also be distributed under the terms of the
> GPL.
>
> The only exception I can think of off the top of my head is the
> Myrinet stuff. I believe you will need permission to redistribute for
> commercial purposes.
>
> - Erik
|
From: <er...@he...> - 2003-04-03 22:02:55
|
On Thu, Apr 03, 2003 at 04:23:21PM +0900, Kimitoshi Takahashi wrote:
> Hello all,
>
> I'm a founder of a startup company called "Cluster Computing Inc."
> which aims to sell small (around 10-CPU) clusters in Japan.
>
> After some testing, we found the BProc-based suite Clustermatic feasible
> for our products.
>
> Do we need to get any permission for selling systems with Clustermatic
> (with minor modifications) preinstalled? If so, from where?

The license for just about everything on there is the GPL. That means you can freely re-distribute and modify it, but any modifications you decide to distribute must also be distributed under the terms of the GPL.

The only exception I can think of off the top of my head is the Myrinet stuff. I believe you will need permission to redistribute for commercial purposes.

- Erik
|
From: Kimitoshi T. <kt...@cl...> - 2003-04-03 07:25:36
|
Hello all,

I'm a founder of a startup company called "Cluster Computing Inc." which aims to sell small (around 10-CPU) clusters in Japan.

After some testing, we found the BProc-based suite Clustermatic feasible for our products.

Do we need to get any permission for selling systems with Clustermatic (with minor modifications) preinstalled? If so, from where?

Please forgive me if this question is not appropriate for this mailing list. Any suggestion is appreciated.

Thank you,
Kimitoshi Takahashi
Cluster Computing Inc., Japan
|