From: Nicholas H. <he...@se...> - 2003-07-09 18:21:06
On Wed, 2003-07-09 at 13:52, er...@he... wrote:
> Well, that's encouraging at least. There's definitely some bogosity
> fixed by that patch but I guess there's more.

Yeah -- after a while, all of them are hung again.

> Are you still getting processes which are unkillable w/ signal 9? The
> processes that you had to go and kill on the node, they were gone from
> the master's process tree, right?

Nope -- see attached.

> Here's another thing to try as a diagnostic. Comment out this line in
> daemons/master.c
>
> do_parent_exit(req);
>
> I think there might be something sketchy going on with that code
> although I'm don't know exactly what. Removing the "parent exit"
> stuff will have some implications for correctness - getppid() might
> return the wrong answer if your parent was on another node and it has
> exited. It *shouldn't* have any implications beyond that, however.
>
> I need to try to replicate this problem again.

I will see what this does....

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania