From: Nicholas H. <he...@se...> - 2003-07-01 20:09:23
OK, so I have managed to find the version change that isolates the problem. Unfortunately, it is a kernel version change that triggers it, not a bproc one.

FYI: the working combination is 2.4.18 patched for bproc 3.2.3. I used the diff in the patches to backport the 2.4.19 patch for 3.2.3 to 2.4.18. The 'bad' combination is 2.4.19 with bproc 3.2.3.

The behavior I am seeing now is that a program is bpsh'd to a node, where it uses pthreads to create a few threads to do the work. At some point the threads hang, and it takes a 'kill -9' to kill them. Most of the time this works, but I have noticed that I will sometimes have to go to the node and 'kill -9' them there for the process to die all of the way. If I don't, and I kill -9 from the front end instead, the processes are removed from the front-end ps output, but when I ssh to the remote node the process is still there and needs another kill -9 to kill it. There is also the case where the process on the remote node just refuses to die: kill -9 will not pull it out of wherever it is stuck.

What else can I provide? Would it be possible to get a patch for bproc 3.2.3 for kernel 2.4.20, to see if I get the same behavior there?

Here is a traceback for when the threads hang. This is the same traceback as when the process ignores the kill -9.

[root@test6 root]# gdb genomics/share/testsuite/test_software/ncbiblast_2000-10-31/rpsblast 17154
GNU gdb Red Hat Linux (5.2-2)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...genomics/share/testsuite/test_software/ncbiblast_2000-10-31/rpsblast: No such file or directory.
Attaching to process 17154
Reading symbols from /mnt/io1/genomics/share/testsuite/test_software/ncbiblast_2000-10-31/rpsblast...done.
Reading symbols from /lib/i686/libm.so.6...done.
[henken@test6 henken]$ ps -jxf
 PPID   PID  PGID   SID TTY      TPGID STAT   UID   TIME COMMAND
17156 17157 17157 17157 pts/0    17200 S    27659   0:00 -bash
17157 17200 17200 17157 pts/0    17200 R    27659   0:00 ps -jxf
  568 17024   568   568 ?           -1 S    27659   0:00 /bin/sh /proc/self/fd/3 /scratch/user/henken/slot_1/result /genomics/share/testsuite/tests/blastSim
17024 17025   568   568 ?           -1 S    27659   0:00 /usr/bin/perl /genomics/share/testsuite/test_software/gus/gushome_06-05-03/bin/blastSimilarity --bl
17025 17151   568   568 ?           -1 S    27659   0:00  \_ sh -c /genomics/share/testsuite/test_software/ncbiblast_2000-10-31/rpsblast -d /scratch/user/he
17151 17152   568   568 ?           -1 S    27659   0:00      \_ /genomics/share/testsuite/test_software/ncbiblast_2000-10-31/rpsblast -d /scratch/user/henk
17152 17153   568   568 ?           -1 S    27659   0:00          \_ /genomics/share/testsuite/test_software/ncbiblast_2000-10-31/rpsblast -d /scratch/user/
17153 17154   568   568 ?           -1 S    27659   0:00              \_ /genomics/share/testsuite/test_software/ncbiblast_2000-10-31/rpsblast -d /scratch/u
[henken@test6 henken]$ strace -p 17154
attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
[henken@test6 henken]$ su -
Password:
[root@test6 root]# strace -p 17154
[root@test6 root]# strace -p 17153
getppid()                               = 511
poll([{fd=7, events=POLLIN}], 1, 2000)  = 0
getppid()                               = 511
poll( <unfinished ...>
[root@test6 root]# strace -p 17154
Loaded symbols for /lib/i686/libm.so.6
Reading symbols from /lib/i686/libpthread.so.0...done.
[New Thread 1024 (LWP 511)]
Error while reading shared library symbols:
Can't attach LWP 511: No such process
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
0x40080bb5 in __sigsuspend (set=0x597697bc) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
45      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
        in ../sysdeps/unix/sysv/linux/sigsuspend.c
(gdb) bt
#0  0x40080bb5 in __sigsuspend (set=0x597697bc) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#1  0x400461d9 in __pthread_wait_for_restart_signal (self=0x59769be0) at pthread.c:971
#2  0x40047f49 in __pthread_alt_lock (lock=0x8297ab0, self=0x0) at restart.h:34
#3  0x40044d26 in __pthread_mutex_lock (mutex=0x8297aa0) at mutex.c:120
#4  0x0804b7aa in s_MutexLock ()
#5  0x0804b83d in NlmMutexLockEx ()
#6  0x0817794c in Nlm_GetAppParam ()
#7  0x0817583f in GetAppErrInfo ()
#8  0x08174ba1 in Nlm_ErrSetLogfile ()
#9  0x0804abfa in NlmThreadWrapper ()
#10 0x40043c6f in pthread_start_thread (arg=0x59769be0) at manager.c:284

--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania