From: <er...@he...> - 2003-04-09 22:35:14
On Wed, Apr 09, 2003 at 05:40:01PM -0400, Nicholas Henke wrote:
> On Wed, 9 Apr 2003 14:48:07 -0600
> er...@he... wrote:
>
> > Usually the only reason you would get a sigstop from the OS is
> > terminal related and then it should be TSTP.
> >
> > That strace is pretty strange. There's a lot of rt_sigaction w/
> > SIGPIPE but the slave daemon code only does that once with SIGPIPE and
> > it sets it to SIG_IGN. Anyway, this might have something to do with
> > the 'exit signal' on a process. That's about the only way I can think
> > to signal the slave daemon...
>
> Hrm -- ok. Is there any more information I can provide about this ?
> I have the user running jobs for me so I can catch them in the 'wild' so to speak.
>
> Here are a few things I am seeing:
>
> Hung state:
> 567 ? S 0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
> 568 ? S 0:00 \_ /usr/sbin/bpslave -r 192.168.0.223 2223
> 622 ? S 0:00 \_ mond -d
> 3054 ? S 0:00 \_ /bin/sh /proc/self/fd/3 /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similarity/f
> 3055 ? S 0:01 \_ /usr/bin/perl /home/sfischer/gushome/bin/blastSimilarity --blastBinDir /genomics/share/pkg/bio/wu-blast/current --d
> 5362 ? S 0:00 \_ sh -c /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask=seg+xn
> 5363 ? S 0:00 \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu
> 5376 ? S 0:00 \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+
> 5377 ? S 0:00 \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask
>
> OK -- so blast is hung -- lets kill it:
> kill -9 5377 5376 5363
> Now the rest is hung.... -- Notice how the blastx has jumped parents -- that don't seem right.
>
> 567 ? S 0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
> 568 ? S 0:00 \_ /usr/sbin/bpslave -r 192.168.0.223 2223
> 622 ? S 0:00 \_ mond -d
> 5377 ? S 0:00 \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu W 3 T 1000 B
> 3054 ? S 0:00 \_ /bin/sh /proc/self/fd/3 /scratch/user/sfischer/slot_1/result /genomics/binf/scratch/dotsBuilds/nicTest/mus/similarity/f
> 3055 ? S 0:01 \_ /usr/bin/perl /home/sfischer/gushome/bin/blastSimilarity --blastBinDir /genomics/share/pkg/bio/wu-blast/current --d
> 5362 ? Z 0:00 \_ [sh <defunct>]
>
> Odd -- ok so : kill -9 3055
>
> 567 ? S 0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
> 568 ? S 0:00 \_ /usr/sbin/bpslave -r 192.168.0.223 2223
> 622 ? S 0:00 \_ mond -d
> 5377 ? S 0:00 \_ /genomics/share/pkg/bio/wu-blast/current/blastx /scratch/user/sfischer/prodom.fsa seqTmp -wordmask seg+xnu W 3 T 1000 B
>
> hrm -- ok still wont die: kill -9 5377
>
> 567 ? S 0:01 /usr/sbin/bpslave -r 192.168.0.223 2223
> 568 ? S 0:00 \_ /usr/sbin/bpslave -r 192.168.0.223 2223
> 622 ? S 0:00 \_ mond -d
>
> And... sometimes that last process dies, or sometimes it doesn't.
> I tried reverting to an older version of glibc to make sure that wasn't the culprit.
> Any ideas ?

Hrm.... The only plausible reason I can think of for the kill -9 to
not work is that it's actually blocked in kernel space somewhere. It
could be that the process is getting signaled while it's waiting for
some remote request to complete. Most likely in a bpr_rsyscall. I
think a message trace for all the pids involved would be very
interesting here. If that's the case we need to figure out what that
remote request is and (of course) why it's not completing in a
reasonable amount of time.

I suspect there's a problem in the signal forwarding and the remote
system call stuff that the slave side does. That code *looks* ok to
me but maybe there's a problem. Seeing a message trace for the PIDs
involved should shed some light on this.

Also, process 5377 reparenting to bpslave is normal. bpslave is the
"child reaper" (instead of init) for bproc managed processes on the
nodes. This is necessary for ptrace to work properly. I think the
parents exited and it didn't so that reparent is correct.

- Erik
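
While waiting on the message trace, one cheap cross-check is to look at what the kernel itself reports for a stuck PID. The sketch below is not part of bproc and is no substitute for the message trace; it assumes a Linux node that exposes /proc/<pid>/status (and, on newer kernels, /proc/<pid>/wchan). The helper names parse_status, decode_sigmask and describe are made up for illustration. It reports the process state, the current parent, which signals are blocked or ignored, and the kernel function the process is sleeping in when that is available.

```python
from pathlib import Path


def parse_status(text):
    """Split the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def decode_sigmask(hex_mask):
    """Turn a hex mask such as SigIgn into a list of signal numbers.

    In these masks bit (n - 1) stands for signal n, so 0x1000 is
    signal 13 (SIGPIPE on Linux).
    """
    mask = int(hex_mask, 16)
    return [n for n in range(1, 65) if mask & (1 << (n - 1))]


def describe(pid):
    """Collect the fields relevant to a process that ignores kill -9."""
    fields = parse_status(Path(f"/proc/{pid}/status").read_text())
    try:
        # Assumption: /proc/<pid>/wchan exists; older kernels may lack it.
        wchan = Path(f"/proc/{pid}/wchan").read_text().strip()
    except OSError:
        wchan = "?"
    return {
        "state": fields.get("State", "?"),
        "ppid": fields.get("PPid", "?"),
        "blocked": decode_sigmask(fields.get("SigBlk", "0")),
        "ignored": decode_sigmask(fields.get("SigIgn", "0")),
        "wchan": wchan,
    }
```

Calling describe(5377) on the node while the blastx process is stuck would show whether its PPid really is the bpslave child (568 in the listing above) and what state it is in. Every process in the quoted listing shows S, so the wchan value, where the kernel provides one, is probably the more useful hint about where it is waiting.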