From: <er...@he...> - 2003-07-01 22:48:29
On Tue, Jul 01, 2003 at 04:03:28PM -0400, Nicholas Henke wrote:
> Ok -- So I have managed to find the change in versions that isolates the
> problem, unfortunately, it is a kernel version change that triggers it,
> not a bproc one.
>
> FYI -- The working combination is 2.4.18 patched for bproc 3.2.3 -- I
> used the diff in the patches to backport the 2.4.19 patch for 3.2.3 to
> 2.4.18
>
> The 'bad' combination is 2.4.19 with bproc 3.2.3.
>
> So, the behavior that I am seeing now, is that a program is bpsh'd to a
> node, where it uses pthreads to create a few threads to do the work. At
> some point, the threads hang, and it takes a 'kill -9' to kill them.
> Most of the time this will work, but I have noticed that I will have to
> go to the node and 'kill -9' them there for the process to die all of
> the way, if not, and I kill -9 from the front-end, the processes will be
> removed from the front-end ps output, but when I ssh to the remote node,
> it is still there, and needs another kill -9 to kill it. There is also
> the case where the process on the remote node just refuses to die --
> kill -9 will not pull it out of wherever it is stuck.
>
> What else can I provide ? Would it be possible to get a patch for bproc
> 3.2.3 for kernel 2.4.20 to see if I get the same behavior there ?
>
> Here is a traceback for when the threads hang. This is the same traceback
> as when the process ignores the kill -9.

I think user-land backtraces are probably useless here, since this is some kind of weird kernel-land problem. Judging by the message traces you've sent me before, the procs are getting caught somewhere in exit (i.e. signal received and *trying* to exit). It doesn't look to me like much changed between 2.4.18 and 2.4.19, but some of the process tree handling code in the exit path did. The examples you sent me a while back all show several threads/processes being killed at once.
I have a sneaking suspicion that this is somehow a race related to many things exiting and getting re-parented at the same time. I have no idea how that's getting hung up, but maybe we can determine whether it's really such a race or not.

To make a long story short, can you try the following: kill the threads one at a time and see if they still get hung up in that weird state. Half a second between kills should be more than enough. Then maybe trying bottom-up or top-down order might be interesting.

I apologize if I've asked this before: when the threads are hung, does the system seem healthy otherwise? Specifically, are there any problems creating or killing other processes?

- Erik

P.S. I've attached a quick port of the 3.2.3 patch to 2.4.20. I think it should work.
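
For the one-at-a-time kill test described above, something like the following Python sketch might save some typing. It is only an illustration, not part of bproc: the function name `kill_one_at_a_time` and its parameters are made up here, and it assumes each hung thread shows up with its own PID and that the script runs where those PIDs are valid. The order of the list is the kill order, so passing it reversed covers the bottom-up versus top-down variation.

```python
import os
import signal
import time


def kill_one_at_a_time(pids, delay=0.5, sig=signal.SIGKILL,
                       kill=os.kill, sleep=time.sleep):
    """Send `sig` to each PID in list order, pausing `delay` seconds between kills.

    Hypothetical helper for illustration only. `kill` and `sleep` are
    injectable so the loop can be exercised without touching real processes.
    Returns a list of (pid, outcome) pairs.
    """
    results = []
    for index, pid in enumerate(pids):
        if index > 0:
            sleep(delay)  # leave a gap so the exits do not overlap
        try:
            kill(pid, sig)
            outcome = "signalled"
        except ProcessLookupError:
            outcome = "no such process"  # already gone
        except PermissionError:
            outcome = "not permitted"
        results.append((pid, outcome))
    return results
```

Calling `kill_one_at_a_time(pids)` with the list of hung thread PIDs sends SIGKILL to each in turn with a half-second gap. Note that "signalled" only means the signal was delivered; whether the process actually went away still has to be checked with ps afterwards.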