From: Nicholas H. <he...@se...> - 2003-07-02 14:37:28
On Tue, 2003-07-01 at 18:39, er...@he... wrote:
> I think user land back-traces are probably useless since this is some
> kind of weird kernel-land problem - and judging by the message
> traces you've sent me before, the procs are getting caught somewhere
> in exit (i.e. signal received and *trying* to exit).

Ahhh.. that would make sense.

>
> It doesn't look like much changed to me between 2.4.18 and 2.4.19 but
> some of the process tree handling code in the exit code did. The examples
> you sent me a while back all show several threads/processes being
> killed at once. I have a sneaking suspicion that this is somehow a
> race related to many things exiting and getting re-parented at the
> same time.

Ew -- and that is my official opinion of that.

>
> I have no idea how that's getting hung up but maybe we can determine
> if it's really such a race or not. To make a long story short, can
> you try the following:

Sure -- I have attached a text file with the results -- slightly more
readable than limiting it to 80 chars in email.

>
> Kill the threads one at a time and see if they still get hung up in that
> weird state. Half a second in between kills should be more than
> enough. Then maybe bottom up or top->bottom might be interesting.

Basically -- top->bottom: screwed; bottom->top+sleep: ok;
bottom->top+nosleep: screwed.

>
> I apologize if I've asked this before: When the threads are hung, does
> the system seem healthy otherwise? Specifically, no problems creating
> or killing other processes?

Yes it does -- I have no problems ssh'ing or bpsh'ing in and running
anything.

>
> P.S. I've attached a quick port of the 3.2.3 patch to 2.4.20. I
> think it should work.

Thanks! I will see what this produces as well.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
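
For anyone wanting to repeat the kill-ordering test described above, here is a rough sketch of how it could be scripted. This is not the script used for the attached results; the function name `kill_in_order`, its parameters, and the idea of passing in a pre-ordered PID list are all assumptions made for illustration. The caller decides the order (bottom->top or top->bottom) and the delay between kills (0.5 for the "sleep" case, 0 for "nosleep").

```python
import os
import signal
import time


def kill_in_order(pids, delay=0.5, sig=signal.SIGTERM,
                  send=os.kill, pause=time.sleep):
    """Signal each PID in the given order, pausing `delay` seconds between kills.

    `pids` is assumed to be ordered already by the caller, e.g. leaves first
    for bottom->top. `send` and `pause` default to os.kill and time.sleep and
    are parameters only so the ordering can be checked without killing anything.
    Returns the PIDs that were signalled successfully.
    """
    signalled = []
    for i, pid in enumerate(pids):
        try:
            send(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            # The process is already gone; nothing to signal.
            pass
        # Sleep between kills, but not after the last one.
        if delay > 0 and i < len(pids) - 1:
            pause(delay)
    return signalled
```

Something like `kill_in_order(leaf_first_pids, delay=0.5)` would correspond to the bottom->top+sleep case, and `delay=0` to the nosleep case; `leaf_first_pids` is a hypothetical list that has to be collected separately (for instance from ps output).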