From: <er...@he...> - 2003-04-09 16:49:28
On Tue, Apr 08, 2003 at 05:23:23PM -0400, Nicholas Henke wrote:
> OK -- upgrading to 2.4.20 and bproc-3.2.4 seems to have solved the
> problem ...at least so far.
>
> I am now seeing a _really_ strange error that I am trying to find the
> root of. All indicators seem to point to bproc -- the user is running
> their job under ssh to make sure right now.
>
> Problem: The user is running ncbi and wu-blast -- programs that do
> genomics/bioinformatics sequence 'stuff' The program is invoked using
> bpsh, after which it gets to the node, and using pthreads, forks off a
> couple of threads. Now, apparently at random, all 2 of the threads get
> stuck, and provide similar tracebacks from gdb -- with respect to the
> mutex_lock and sigsuspend...

Signal stuff *should* be local to the node and basically the same as
w/o BProc for pthreads stuff. If it relies on process group stuff that
might not be true, but I don't *think* it should be doing that.

Is it possible to characterize what the app is doing wrt pthreads?
Does this app do a lot of thread creation and cleanup, or does it kick
off a few and they get stuck later? From the tracebacks it looks like
it's sticking in the mutex or condition variable code. I tried to
stress that stuff a bit but it hasn't been breaking for me so far.

As usual, it's really hard to say what's going on if I can't reproduce
it. A small test program that did it would be fantastic.

As usual, a message trace might be interesting. In particular, it
might be useful to know if there's BProc traffic related to those
processes while it's running. If it's creating a lot of threads while
it runs, the answer will be yes, but it might still be interesting if
it's anything other than fork and wait messages.

- Erik

> Loaded symbols for /lib/i686/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> 0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> 45 ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
> in ../sysdeps/unix/sysv/linux/sigsuspend.c
> (gdb) bt
> #0 0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> #1 0x40048179 in __pthread_wait_for_restart_signal (self=0x59969be0) at pthread.c:978
> #2 0x40049ee9 in __pthread_alt_lock (lock=0x40189720, self=0x0) at restart.h:34
> #3 0x40046cf6 in __pthread_mutex_lock (mutex=0x40189710) at mutex.c:120
> #4 0x400d53e8 in __libc_free (mem=0x82b0ef0) at malloc.c:3152
> #5 0x0817a316 in Nlm_MemFree ()
> #6 0x0804ac06 in NlmThreadWrapper ()
> #7 0x40045c3f in pthread_start_thread (arg=0x59969be0) at manager.c:284
>
>
> This would not be a terrible problem, except that one of the programs
> will refuse to pay attention to kill -9 -- some of them will die, but
> one of the threads stuck in sigsuspend will not go away.
>
> Is it possible that bpslave is dropping the wait_for_restart_signal ? I
> would appreciate any info or direction you could provide -- this is
> really odd. BTW -- The same thing happens on 2.4.19 and bproc-3.2.3 (
> all on RH 7.2 ).
>
> Nic
> --
> Nicholas Henke
> Penguin Herder & Linux Cluster System Programmer
> Liniac Project - Univ. of Pennsylvania
>
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users
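
For reference, here is a minimal sketch of the kind of small test program asked for above. It is only an assumption about what might exercise the same path as the backtrace (free() taking the allocator lock, plus an ordinary pthread mutex), and it is not known to reproduce the hang. The names run_stress, stress_worker, struct stress_shared and MAX_THREADS are hypothetical and do not come from the application or from BProc. The rounds parameter covers the "lots of thread creation and cleanup" case, while a single round with many iterations covers the "kick off a few and they get stuck later" case.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 64            /* hypothetical upper bound for this sketch */

struct stress_shared {
    pthread_mutex_t lock;         /* contended user-level mutex */
    long counter;                 /* total completed iterations */
    int iterations;               /* iterations per thread per round */
};

/* Each worker hammers malloc/free (which lock inside libc) and a shared mutex. */
static void *stress_worker(void *arg)
{
    struct stress_shared *s = arg;
    assert(s != NULL);

    for (int i = 0; i < s->iterations; i++) {
        size_t size = 16 + (size_t)(i % 1024);
        char *buf = malloc(size);
        if (buf == NULL)
            return (void *)1;     /* report allocation failure to the joiner */
        memset(buf, 0, size);
        free(buf);

        pthread_mutex_lock(&s->lock);
        s->counter++;
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/*
 * Run `rounds` create/join cycles of `nthreads` workers each.
 * Returns the total number of completed iterations, or -1 on bad
 * arguments or if any thread could not be created or reported failure.
 */
long run_stress(int nthreads, int iterations, int rounds)
{
    pthread_t tids[MAX_THREADS];
    struct stress_shared s;
    int failed = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS || iterations < 0 || rounds < 0)
        return -1;

    pthread_mutex_init(&s.lock, NULL);
    s.counter = 0;
    s.iterations = iterations;

    for (int r = 0; r < rounds && !failed; r++) {
        int started = 0;

        while (started < nthreads) {
            if (pthread_create(&tids[started], NULL, stress_worker, &s) != 0) {
                failed = 1;
                break;
            }
            started++;
        }
        /* Join only the threads that were actually started. */
        for (int i = 0; i < started; i++) {
            void *ret = NULL;
            if (pthread_join(tids[i], &ret) != 0 || ret != NULL)
                failed = 1;
        }
    }

    pthread_mutex_destroy(&s.lock);
    return failed ? -1 : s.counter;
}
```

Built with cc -pthread and called from a small main (for example run_stress(2, 100000, 50)), the same binary can be started once under bpsh and once under ssh. If it hangs only in the bpsh case, that would give something concrete to attach gdb and a message trace to.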