Re: pthreads+bproc (was Re: [BProc] Re: bpslave dies)


On Tue, Apr 08, 2003 at 05:23:23PM -0400, Nicholas Henke wrote:
> OK -- upgrading to 2.4.20 and bproc-3.2.4 seems to have solved the
> problem ...at least so far.
> 
> I am now seeing a _really_ strange error that I am trying to find the
> root of. All indicators seem to point to bproc -- the user is running
> their job under ssh to make sure right now.
> 
> Problem: The user is running ncbi and wu-blast -- programs that do
> genomics/bioinformatics sequence 'stuff'. The program is invoked using
> bpsh, after which it gets to the node, and using pthreads, forks off a
> couple of threads. Now, apparently at random, both of the threads get
> stuck, and provide similar tracebacks from gdb -- with respect to the
> mutex_lock and sigsuspend...

Signal stuff *should* be local to the node and basically the same as
w/o BProc for pthreads stuff.  If it relies on process group stuff
that might not be true but I don't *think* it should be doing that.

Is it possible to characterize what the app is doing wrt pthreads?
Does this app do a lot of thread creation and cleanup or does it kick
off a few and they get stuck later?  From the tracebacks it looks
like it's sticking in the mutex or condition variable code.  I tried
to stress that stuff a bit but it hasn't been breaking for me so far.

As usual, it's really hard to say what's going on if I can't reproduce
it.  A small test program that did it would be fantastic.  A message
trace might also be interesting.  In particular, it might be useful
to know if there's BProc traffic related to those processes while it's
running.  If it's creating a lot of threads while it runs, the answer
will be yes but it might still be interesting if it's anything other
than fork and wait messages.
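
For illustration, here is a minimal sketch of the kind of small test
program meant above.  It is not the real application: the names
run_stress, worker and worker_arg are hypothetical, and it assumes the
hang can be provoked just by several threads contending on malloc/free,
the path shown in the quoted backtrace (__libc_free into
__pthread_mutex_lock).  That assumption may well be wrong.

```c
/* Hypothetical stress sketch; all names here are illustrative only. */
#include <pthread.h>
#include <stdlib.h>

struct worker_arg {
    long iterations;   /* malloc/free pairs this thread should perform */
    long completed;    /* pairs this thread actually completed */
};

/* Each worker hammers malloc/free so that the threads contend on the
 * allocator's internal lock, as in the backtrace. */
static void *worker(void *p)
{
    struct worker_arg *arg = p;

    for (long i = 0; i < arg->iterations; i++) {
        size_t size = 16 + (size_t)(i % 1024);
        char *buf = malloc(size);
        if (buf == NULL)
            break;
        /* Touch the block so the malloc/free pair is not optimized away. */
        *(volatile char *)buf = (char)i;
        free(buf);
        arg->completed++;
    }
    return NULL;
}

/* Start nthreads workers, join them all, and return the total number of
 * completed malloc/free pairs, or -1 on allocation or thread-creation
 * failure. */
long run_stress(int nthreads, long iterations)
{
    pthread_t *tids = calloc((size_t)nthreads, sizeof *tids);
    struct worker_arg *args = calloc((size_t)nthreads, sizeof *args);
    long total = -1;
    int started = 0;

    if (tids == NULL || args == NULL)
        goto out;

    for (; started < nthreads; started++) {
        args[started].iterations = iterations;
        if (pthread_create(&tids[started], NULL, worker, &args[started]) != 0)
            break;
    }

    /* Join whatever was started, even if creation failed part way. */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    if (started == nthreads) {
        total = 0;
        for (int i = 0; i < nthreads; i++)
            total += args[i].completed;
    }

out:
    free(tids);
    free(args);
    return total;
}
```

Calling something like run_stress(2, 1000000) from main() and starting
the binary on a node through bpsh, the way the real job is started,
would show whether plain allocator contention is enough to wedge the
threads.  If it never hangs, that at least narrows things down.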

- Erik

> Loaded symbols for /lib/i686/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> 0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> 45      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
>         in ../sysdeps/unix/sysv/linux/sigsuspend.c
> (gdb) bt
> #0  0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
> #1  0x40048179 in __pthread_wait_for_restart_signal (self=0x59969be0) at pthread.c:978
> #2  0x40049ee9 in __pthread_alt_lock (lock=0x40189720, self=0x0) at restart.h:34
> #3  0x40046cf6 in __pthread_mutex_lock (mutex=0x40189710) at mutex.c:120
> #4  0x400d53e8 in __libc_free (mem=0x82b0ef0) at malloc.c:3152
> #5  0x0817a316 in Nlm_MemFree ()
> #6  0x0804ac06 in NlmThreadWrapper ()
> #7  0x40045c3f in pthread_start_thread (arg=0x59969be0) at manager.c:284
> 
> 
> This would not be a terrible problem, except that one of the programs
> will refuse to pay attention to kill -9 -- some of them will die, but
> one of the threads stuck in sigsuspend will not go away. 
> 
> Is it possible that bpslave is dropping the wait_for_restart_signal ? I
> would appreciate any info or direction you could provide -- this is
> really odd. BTW -- The same thing happens on 2.4.19 and bproc-3.2.3 (
> all on RH 7.2 ).
> 
> Nic
> -- 
> Nicholas Henke
> Penguin Herder & Linux Cluster System Programmer
> Liniac Project - Univ. of Pennsylvania
> 
> 
> _______________________________________________
> BProc-users mailing list
> BPr...@li...
> https://lists.sourceforge.net/lists/listinfo/bproc-users