|
From: George N. <gna...@ll...> - 2006-03-24 22:39:34
|
I've searched the archives and the docs for help to no avail, so I bring my plea to the mailing list. Please forgive me if I've overlooked something obvious in the docs.

I'm using Valgrind-3.1.1 on Redhat Enterprise Linux 3 on . Valgrind is run with these switches: --db-attach=yes --suppressions=foo
See the end of this message for my /proc/cpuinfo and output from uname -a (appendices A and B).

I'm attempting to debug a problem in my app which causes a crash when run without Valgrind. However, when I do run with Valgrind it hangs long before getting to the crash.

It appears to be hanging inside of a library that I have source to, but which (for obnoxious contractual reasons beyond my control) I am unable to modify. So using printfs to pinpoint the exact location of the hang is impossible. Attempting to get the vendor of this library to help is impractical because they are much too slow, and I need to fix this problem quickly.

The main thread (thread 1 in gdb parlance) seems to get stuck for unknown reasons. In other words, I call a function in this third-party library from that thread which usually returns after a couple of seconds, and it never returns (I've waited several hours). When running the app without Valgrind, it always returns in a reasonable amount of time (never longer than 10 seconds). Other threads continue to run normally.

I've tried a variety of tricks to get Valgrind to tell me what function the main thread is in when it's hung: sending the process a signal [footnote 1] and creating a thread that performs an invalid write [footnote 2] do get Valgrind to wake up and give me a stack trace, but not in the correct thread. Attaching with gdb from the --db-attach prompt and trying to switch threads is not productive [end of footnote 2].

I cannot attach to a running Valgrind-instrumented process with gdb and get a meaningful stack trace: it's just unknown symbols all the way down in every thread [footnote 3].

My questions are:
1) Is there a way to get a stack trace out of Valgrind for the main thread when I don't control what's going on in that thread?
2) If I have to modify Valgrind itself to make this happen, can anyone suggest where to start? I've made trivial changes to Valgrind in the past, but this is way beyond my experience, so hints would be appreciated.

Thanks for your help,
George

Footnotes:

1. Here's what valgrind says when I send it a signal with the default signal handler installed:
==19016==
==19016== ---- Attach to debugger ? --- [Return/N/n/Y/y/C/c] ---- y
==19016== starting debugger with cmd: /usr/local/bin/gdb -nw /proc/19036/fd/1014 19036
[gdb chatter cut...]
(gdb) bt
#0 0xb001b635 in ?? ()
#1 0x042f23aa in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2 0x043c9528 in timer_scheduler (arg=0x47b4180) at ../../../foo.c:394
#3 0x042efdac in start_thread () from /lib/tls/libpthread.so.0
#4 0x045999ea in clone () from /lib/tls/libc.so.6
(gdb) info threads
14 Thread 86719408 (LWP 19017) 0xb0040c3c in ?? ()
13 Thread 97209264 (LWP 19018) 0xb0019e99 in ?? ()
12 Thread 107699120 (LWP 19019) 0xb0040c3c in ?? ()
11 Thread 134437808 (LWP 19023) 0xb0040c3c in ?? ()
10 Thread 156449712 (LWP 19024) 0xb0040c3c in ?? ()
9 Thread 166939568 (LWP 19025) 0xb0019e99 in ?? ()
8 Thread 177429424 (LWP 19026) 0xb0040c3c in ?? ()
7 Thread 187980720 (LWP 19027) 0xb0040c3c in ?? ()
6 Thread 210983856 (LWP 19028) 0xb0040c3c in ?? ()
5 Thread 221473712 (LWP 19029) 0xb0040c3c in ?? ()
4 Thread 231963568 (LWP 19030) 0xb0040c3c in ?? ()
3 Thread 252943280 (LWP 19033) 0xb0019e99 in ?? ()
2 Thread 242453424 (LWP 19034) 0xb0019e99 in ?? ()
1 Thread 75179008 (LWP 19016) 0xb0019e99 in ?? ()

2. Here's what happens when I trigger the thread to perform an invalid write:

found the file. killing myself
==11704==
==11704== Thread 14:
==11704== Invalid write of size 1
==11704==    at 0x805C6B7: george (hb_main.c:137)
==11704==    by 0x1BBD3DAB: start_thread (in /lib/tls/libpthread-0.60.so)
==11704==    by 0x1BE7D9E9: clone (in /lib/tls/libc-2.3.2.so)
==11704== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==11704==
==11704== ---- Attach to debugger ? --- [Return/N/n/Y/y/C/c] ---- y
starting debugger
==11704== starting debugger with cmd: /usr/local/bin/gdb -nw /proc/11722/fd/1015 11722
GNU gdb 6.4.0.20051202-cvs
[snip gdb chatter]
Attaching to program: /proc/11722/fd/1015, process 11722
[snip more gdb chatter]
0x0805c6b7 in george (data=0x0) at hb_main.c:137
137 *x=1;
(gdb) bt
#0 0x0805c6b7 in george (data=0x0) at hb_main.c:137
#1 0x1bbd3dac in start_thread () from /lib/tls/libpthread.so.0
#2 0x1be7d9ea in clone () from /lib/tls/libc.so.6
(gdb) info threads
14 Thread 616156080 (LWP 11705) 0xb004cb08 in ?? ()
13 Thread 626645936 (LWP 11706) 0xb0021a4e in ?? ()
12 Thread 637135792 (LWP 11707) 0xb004cb08 in ?? ()
11 Thread 663894960 (LWP 11709) 0xb004cb08 in ?? ()
10 Thread 674397104 (LWP 11710) 0xb004cb08 in ?? ()
9 Thread 684886960 (LWP 11711) 0xb0021a4e in ?? ()
8 Thread 695376816 (LWP 11712) 0xb004cb08 in ?? ()
7 Thread 705928112 (LWP 11713) 0xb004cb08 in ?? ()
6 Thread 726404016 (LWP 11715) 0xb004cb08 in ?? ()
5 Thread 736893872 (LWP 11716) 0xb004cb08 in ?? ()
4 Thread 747383728 (LWP 11717) 0xb004cb08 in ?? ()
3 Thread 768363440 (LWP 11720) 0xb0021a4e in ?? ()
2 Thread 757873584 (LWP 11721) 0xb0021a4e in ?? ()
1 Thread 469738592 (LWP 11704) 0xb0021a4e in ?? ()
(gdb) thread 1
[Switching to thread 1 (Thread 469738592 (LWP 11704))]#0 0xb0021a4e in ?? ()
(gdb) bt
#0 0xb0021a4e in ?? ()
(gdb)

3. Attaching to a running valgrind process is not informative:
% gdb /usr/local/bin/valgrind 11681
[...gdb chatter snipped out...]
Attaching to program: /usr/local/bin/valgrind, process 11681
0xb115b85b in ?? ()
(gdb) bt
#0 0xb115b85b in ?? ()
#1 0x00000001 in ?? ()
#2 0x00000001 in ?? ()
#3 0xb115b8f4 in ?? ()
#4 0x52bf1923 in ?? ()
#5 0x52bf1923 in ?? ()
#6 0x00000001 in ?? ()
#7 0xb11b489c in ?? ()
#8 0xb115bd26 in ?? ()
#9 0x52bf1923 in ?? ()
#10 0x00000001 in ?? ()
#11 0x000000ff in ?? ()
#12 0xffffffff in ?? ()
#13 0x00000000 in ?? ()
(gdb) info threads
(gdb)

Appendices:

A:
% cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 1
cpu MHz : 3192.133
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 3
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm nx lm
bogomips : 6370.09

B:
% uname -a
Linux llamas.foo.com 2.4.21-20.TB #1 Tue Mar 8 21:38:52 PST 2005 i686 i686 i386 GNU/Linux |
|
From: Julian S. <js...@ac...> - 2006-03-24 22:49:25
|
Hmm, looks like you've tried quite hard to see where it hung.

One thing worth trying is to figure out the pid of the hanging thread, and doing 'strace -p <pid>', so as to see what syscall it's hung in. Can you try that?

Another thing worth trying is to run with --smc-check=all; that rules out any weirdness to do with self-modifying code.

There is one known problem area, which is signal handlers which throw exceptions (or do pthread_cancel or something). I don't remember the details right now. Does this hang have anything to do with signal handlers and exception throwing/catching, as far as you can tell?

J |
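A minimal sketch of the strace suggestion, assuming a Linux kernel that exposes per-thread entries under /proc/<pid>/task (not verified for the 2.4.21 kernel in this thread). The helper name list_threads is hypothetical. It prints each thread ID (LWP) of a process with its state and, where the kernel provides it, its wait channel; any listed ID can then be handed to strace -p. In George's gdb output, thread 1 is LWP 19016, the same number as the process ID in the ==19016== prefix, so the main thread can probably be traced directly by PID.

```shell
#!/bin/sh
# list_threads PID (hypothetical helper): print "TID STATE WCHAN" for every
# thread of PID. Assumes /proc/<pid>/task exists on this kernel.
list_threads() {
    pid="$1"
    for t in /proc/"$pid"/task/*; do
        [ -d "$t" ] || continue            # skip if the glob matched nothing
        tid=${t##*/}                       # directory name is the thread ID (LWP)
        state=$(awk '/^State:/ {print $2}' "$t/status" 2>/dev/null)
        wchan=$(cat "$t/wchan" 2>/dev/null) # may be missing or unreadable
        printf '%s %s %s\n' "$tid" "${state:-?}" "${wchan:-?}"
    done
}
```

For example, list_threads 19016 would list the LWPs of the hung run, and strace -p 19016 would then attach to the main thread to show which syscall, if any, it is blocked in.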
|
From: Francois-Xavier 'F. K. <fra...@hp...> - 2006-04-01 15:29:37
|
George Nachman wrote:
> [...]
>
> I cannot attach to a running Valgrind-instrumented process with gdb
> and get a meaningful stack trace: it's just unknown symbols all the
> way down in every thread [footnote 3].

FYI, RHEL3 comes with 2 debugger versions. The default gdb is 6.3, which is often unable to decode a crash backtrace. RHEL3 also comes with an older gdb version (6.1) that generally gives better results out of a core for a multi-threaded application.

rpm -Uvh --oldpackage gdb61-...

HTH.
--
Francois-Xavier "FiX" KOWALSKI
Everything is disclaimed, including disclaimer. |