From: Nicholas H. <he...@se...> - 2003-04-08 21:19:48
OK -- upgrading to 2.4.20 and bproc-3.2.4 seems to have solved the problem ... at least so far.

I am now seeing a _really_ strange error that I am trying to find the root of. All indicators seem to point to bproc -- the user is running their job under ssh right now to make sure.

Problem: The user is running ncbi and wu-blast -- programs that do genomics/bioinformatics sequence 'stuff'. The program is invoked using bpsh, after which it gets to the node and, using pthreads, forks off a couple of threads. Now, apparently at random, both of the threads get stuck, and provide similar tracebacks from gdb -- with respect to the mutex_lock and sigsuspend...

Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
45      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
        in ../sysdeps/unix/sysv/linux/sigsuspend.c
(gdb) bt
#0  0x40082bb5 in __sigsuspend (set=0x5996991c) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1  0x40048179 in __pthread_wait_for_restart_signal (self=0x59969be0) at pthread.c:978
#2  0x40049ee9 in __pthread_alt_lock (lock=0x40189720, self=0x0) at restart.h:34
#3  0x40046cf6 in __pthread_mutex_lock (mutex=0x40189710) at mutex.c:120
#4  0x400d53e8 in __libc_free (mem=0x82b0ef0) at malloc.c:3152
#5  0x0817a316 in Nlm_MemFree ()
#6  0x0804ac06 in NlmThreadWrapper ()
#7  0x40045c3f in pthread_start_thread (arg=0x59969be0) at manager.c:284

This would not be a terrible problem, except that one of the programs will refuse to pay attention to kill -9 -- some of them will die, but one of the threads stuck in sigsuspend will not go away.

Is it possible that bpslave is dropping the restart signal that __pthread_wait_for_restart_signal is waiting on?

I would appreciate any info or direction you could provide -- this is really odd.
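For anyone who wants to poke at this without BLAST, here is a minimal sketch of a stress test. It assumes that the trigger is simply two pthreads contending on the malloc lock, which is what the backtrace suggests but which has not been confirmed. The names alloc_worker, alloc_stress and struct worker_arg are made up for illustration and have nothing to do with the NCBI code.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical test harness, not taken from BLAST or bproc. */
struct worker_arg {
    int iterations;  /* number of malloc/free rounds to run */
    int done;        /* set to 1 once the worker finishes its loop */
};

/* Each worker allocates and frees small blocks in a tight loop so that
 * the threads contend on the allocator lock (the __libc_free ->
 * __pthread_mutex_lock path seen in the backtrace). */
static void *alloc_worker(void *p)
{
    struct worker_arg *arg = p;
    int i;

    for (i = 0; i < arg->iterations; i++) {
        size_t n = 16 + (size_t)(i % 64) * 8;
        char *buf = malloc(n);

        if (buf == NULL)
            return NULL;
        memset(buf, 0, n);
        free(buf);
    }
    arg->done = 1;
    return NULL;
}

/* Start nthreads workers, join them all, and return how many ran to
 * completion. Returns -1 if the arguments or the setup are bad. If the
 * hang reproduces, this call never returns. */
int alloc_stress(int nthreads, int iterations)
{
    pthread_t *tids;
    struct worker_arg *args;
    int started = 0, finished = 0, i;

    if (nthreads <= 0)
        return -1;

    tids = malloc((size_t)nthreads * sizeof *tids);
    args = calloc((size_t)nthreads, sizeof *args);
    if (tids == NULL || args == NULL) {
        free(tids);
        free(args);
        return -1;
    }

    for (i = 0; i < nthreads; i++) {
        args[i].iterations = iterations;
        if (pthread_create(&tids[i], NULL, alloc_worker, &args[i]) != 0)
            break;
        started++;
    }

    /* Only join the threads that were actually created. */
    for (i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        finished += args[i].done;
    }

    free(tids);
    free(args);
    return finished;
}
```

Calling alloc_stress(2, 1000000) from a small main(), building with gcc -pthread, and running the result both under bpsh and under ssh might show whether the hang depends on bproc at all. Two threads matches the stuck job; the iteration count is a guess.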
BTW -- The same thing happens on 2.4.19 and bproc-3.2.3 (all on RH 7.2).

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania