Re: [BProc] pthreads & bproc, round 2


On Tue, Jul 01, 2003 at 04:03:28PM -0400, Nicholas Henke wrote:
> Ok -- So I have managed to find the change in versions that isolates the
> problem. Unfortunately, it is a kernel version change that triggers it,
> not a bproc one.
> 
> FYI -- The working combination is 2.4.18 patched for bproc 3.2.3 -- I
> used the diff in the patches to backport the 2.4.19 patch for 3.2.3 to
> 2.4.18
> 
> The 'bad' combination is 2.4.19 with bproc 3.2.3. 
> 
> So, the behavior that I am seeing now is that a program is bpsh'd to a
> node, where it uses pthreads to create a few threads to do the work. At
> some point, the threads hang, and it takes a 'kill -9' to kill them.
> Most of the time this will work, but I have noticed that I have to
> go to the node and 'kill -9' them there for the process to die all of
> the way. If I instead kill -9 from the front-end, the processes are
> removed from the front-end ps output, but when I ssh to the remote node,
> the process is still there and needs another kill -9 to kill it. There
> is also the case where the process on the remote node just refuses to
> die -- kill -9 will not pull it out of wherever it is stuck.
> 
> What else can I provide ? Would it be possible to get a patch for bproc
> 3.2.3 for kernel 2.4.20 to see if I get the same behavior there ?
> 
> Here is a traceback for when the threads hang. This is the same traceback
> as when the process ignores the kill -9.

I think user-land back-traces are probably useless since this is some
kind of weird kernel-land problem - and judging by the message
traces you've sent me before, the procs are getting caught somewhere
in exit (i.e. signal received and *trying* to exit).

It doesn't look to me like much changed between 2.4.18 and 2.4.19, but
some of the process tree handling code in the exit path did.  The examples
you sent me a while back all show several threads/processes being
killed at once.  I have a sneaking suspicion that this is somehow a
race related to many things exiting and getting re-parented at the
same time.

I have no idea how that's getting hung up but maybe we can determine
if it's really such a race or not.  To make a long story short, can
you try the following:

Kill the threads one at a time and see if they still get hung up in that
weird state.  Half a second between kills should be more than
enough.  After that, varying the order (bottom-up vs. top-down through
the process tree) might be interesting.
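
For what it's worth, the staggered kill could be scripted along these
lines.  This is only a rough sketch: it assumes each thread shows up as
its own PID (as with LinuxThreads on 2.4), that you pass the PIDs in
parent-first order, and that a Python interpreter is available where you
run it.  The names kill_one_at_a_time and top_down are made up for
illustration; they are not part of bproc.

```python
import os
import signal
import time


def kill_one_at_a_time(pids, delay=0.5, sig=signal.SIGKILL, top_down=True,
                       kill=os.kill, sleep=time.sleep):
    """Signal each PID in turn, pausing `delay` seconds between kills.

    `pids` is assumed to be listed parent first.  top_down=True keeps
    that order; top_down=False reverses it (bottom-up).
    `kill` and `sleep` are injectable so the logic can be dry-run.
    Returns the PIDs that were actually signalled.
    """
    ordered = list(pids) if top_down else list(reversed(pids))
    signalled = []
    for i, pid in enumerate(ordered):
        if i > 0:
            sleep(delay)  # spread the exits out to avoid simultaneous re-parenting
        try:
            kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            # The process already exited; nothing left to signal.
            pass
    return signalled


# Example (hypothetical PIDs, parent first):
#   kill_one_at_a_time([1201, 1202, 1203, 1204], top_down=False)
```

If the threads exit cleanly when killed this way but hang when killed
all at once, that would point towards the race.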

I apologize if I've asked this before: When the threads are hung, does
the system seem healthy otherwise?  Specifically, no problems creating
or killing other processes?
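
A quick way to check that on the node, as a sketch only: start a
throwaway process, kill it, and see whether it is reaped within a few
seconds.  The function name can_create_and_kill and the 5 second timeout
are arbitrary choices for illustration, and this assumes Python 3 is
available on the node.

```python
import subprocess
import sys


def can_create_and_kill(timeout=5.0):
    """Spawn a sleeping child, SIGKILL it, and report whether it was reaped.

    Returns True if the child exited within `timeout` seconds of the
    kill, False if it was still around afterwards.
    """
    # The child just sleeps; it only exists so there is something to kill.
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    proc.kill()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return True


if __name__ == "__main__":
    print("create/kill ok" if can_create_and_kill() else "child did not die")
```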

- Erik

P.S.  I've attached a quick port of the 3.2.3 patch to 2.4.20.  I
think it should work.