From: Nicholas H. <he...@se...> - 2003-07-10 19:26:53
On Wed, 2003-07-09 at 13:52, er...@he... wrote:
>
> Well, that's encouraging at least.  There's definitely some bogosity
> fixed by that patch but I guess there's more.
>
> Are you still getting processes which are unkillable w/ signal 9?  The
> processes that you had to go and kill on the node, they were gone from
> the master's process tree, right?

Nope -- they get reparented to init, but the calling process is still
waiting for them, and I can kill -9 them from the head node, after which
the calling process (the 'sh') exits.

>
> Here's another thing to try as a diagnostic.  Comment out this line in
> daemons/master.c
>
>     do_parent_exit(req);

Tried that -- it would appear that 6 of the 7 nodes hung at almost the
same time; I could not tell if it was _exactly_ the same time.

Also -- just as an observation, all of the pids were in the range of 400
to 700 -- could this be a pid wraparound problem?

Also -- I have noticed that after doing a kill -9 on the processes,
sometimes the bpslave process will get nuked as well, leaving just one
bpslave process on the node. Restarting bpslave allows the node to come
back up, and bpsh to that node then works fine.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania