From: <er...@he...> - 2003-07-09 18:03:44
On Wed, Jul 09, 2003 at 01:41:11PM -0400, Nicholas Henke wrote:
> On Mon, 2003-07-07 at 15:20, er...@he... wrote:
> > I have a hunch about what might be going on here. There's some
> > potential for badness in exit_notify with BProc. kill_pg and
> > is_orphaned_pgrp might end up setting the process state back to
> > RUNNING instead of ZOMBIE. Then they could get hung up because the
> > ghost is gone because it's already exited.
> >
> > I've attached a revised patch which I think should fix that. Can you
> > try it and see if it helps?
>
> It seems to have helped, but not solved the problem. It seems like more
> of the processes are running, and not getting hung, but there were a few
> that did hang. I was able to do a 'top->bottom' kill -9 with a 'sleep 1'
> between, and in one case it worked, but in another, I had to go to the
> node again and kill -9 the process there.

Well, that's encouraging at least. There's definitely some bogosity
fixed by that patch but I guess there's more.

Are you still getting processes which are unkillable w/ signal 9? The
processes that you had to go and kill on the node, they were gone from
the master's process tree, right?

Here's another thing to try as a diagnostic. Comment out this line in
daemons/master.c:

    do_parent_exit(req);

I think there might be something sketchy going on with that code,
although I don't know exactly what. Removing the "parent exit" stuff
will have some implications for correctness - getppid() might return
the wrong answer if your parent was on another node and it has exited.
It *shouldn't* have any implications beyond that, however.

I need to try to replicate this problem again.

- Erik