From: Nicholas H. <he...@se...> - 2003-04-09 17:43:54
On Wed, 9 Apr 2003 10:18:35 -0600 er...@he... wrote:

> Signal stuff *should* be local to the node and basically the same as
> w/o BProc for pthreads stuff. If it relies on process group stuff
> that might not be true but I don't *think* it should be doing that.

Ok -- I just thought it was really weird that bpslave was getting SIGSTOP. It seems to me that getting that signal might be causing the rest of the problems. Is there any reason that bpslave would get SIGSTOP -- does the OS send that under some unusual condition?

>
> Is it possible to characterize what the app is doing wrt pthreads?
> Does this app do a lot of thread creation and cleanup or does it kick
> off a few and they get stuck later? From the trace backs It looks
> like it's sticking in the mutex or condition variable code. I tried
> to stress that stuff a bit but it hasn't been breaking for me so far.

It looks to me like it pops off a couple of threads and then starts working. It seems to hang either right after thread creation or after the program has exited.

>
> As usual, it's really hard to say what's going on if I can't reproduce
> it. A small test program that did it would be fantastic. As usual, a
> message trace might be interesting. In particular, it might be useful
> to know if there's BProc traffic related to those processes while it's
> running. If it's creating a lot of threads while it runs, the answer
> will be yes but it might still be interesting if it's anything other
> than fork and wait messages.

I do not think it runs many threads -- just 2 to do its work, and then it dies. I ran an strace on all of the bpslave processes during the user's last run, and it looks to me like there is some strange SIG* stuff in there -- I am not sure what you should see normally.

I have attached the 'grep -n SIG | grep -v CHLD' output -- if you would like to see the full logs, holler, and I will try to get them to you -- they are 6.4G of logs total :)

I will have the next run use the -m flag for bpslave to see if there is any interesting message traffic.

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania