From: Nikolaus R. <nr...@tr...> - 2015-12-08 17:50:01
Hello,

I'm having a problem using massif. When looking at the ms_print output, I'm getting entries like this:

->10.77% (11,146,544B) 0xF4AC9C: H5FL_reg_calloc (in /work/nrath/issue_2014_q2d_mem/part_agmg0_4_massif_fixed/LR_model)
| ->10.58% (10,946,896B) 0x1000D57: H5S_copy (in /work/nrath/issue_2014_q2d_mem/part_agmg0_4_massif_fixed/LR_model)
| | ->10.57% (10,940,640B) 0xEE0A80: H5A_create (in /work/nrath/issue_2014_q2d_mem/part_agmg0_4_massif_fixed/LR_model)
| | | ->10.57% (10,940,640B) 0xEDAAAF: H5Acreate2 (in /work/nrath/issue_2014_q2d_mem/part_agmg0_4_massif_fixed/LR_model)
| | | ->10.57% (10,940,640B) 0xEC985C: h5acreate_c_ (in /work/nrath/issue_2014_q2d_mem/part_agmg0_4_massif_fixed/LR_model)
| | | ->10.57% (10,940,640B) 0xEC3C55: h5a_mp_h5acreate_f_ (in /work/nrath/issue_2014_q2d_mem/part_agmg0_4_massif_fixed/LR_model)
| | | ->10.57% (10,940,640B) 0xB99ED5: taehdf5_mp_h5append_data_double_0d_ (taehdf5.f90:1936)
| | | ->04.24% (4,387,296B) in 40 places, all below massif's threshold (01.00%)

However, line 1936 of taehdf5.f90 is actually inside a different subroutine, defined as:

subroutine h5dump_attr_int(loc_id,f,name)
  [..]
  ! next line is 1936
  call h5acreate_f(loc_id,name,H5T_NATIVE_INTEGER,space_id,attr_id,hdferr)
  [...]
end subroutine

The h5append_data_double_0d subroutine is actually defined much later, starting at line 4120 with:

subroutine h5append_data_double_0d(group_id,f,name)

...and it does not contain any calls to h5acreate_f. So I think that the line number is probably correct, but the function name is not.

Does anyone know what might cause this?

Best,
-Nikolaus
From: Philippe W. <phi...@sk...> - 2015-12-08 22:24:49
On Tue, 2015-12-08 at 09:08 -0800, Nikolaus Rath wrote:
> I'm having a problem using massif. When looking at the ms_print output, I'm getting entries like this:
> [...]
> So I think that the line number is probably correct, but the function name is not.
>
> Does anyone know what might cause this?

What is the platform (version of valgrind/os/gcc, which cpu, ...)?

Strange stacktraces can be produced when the compiler inlines some calls, or when the Valgrind unwinder has a bug. You can check that using gdb+vgdb: put a break on H5FL_reg_calloc. When the break is hit, do:

(gdb) backtrace
(gdb) monitor v.info scheduler

This lets you compare:
 * the backtrace produced by gdb,
 * the stacktrace produced by Valgrind (using inline info), and
 * the stacktrace produced by massif in its output file.

If the first two stacktraces are as expected but the massif stacktrace is not, that is probably because massif does not use inline info to produce its output. If the gdb backtrace is ok but the monitor stacktrace is not, then the Valgrind unwinder does not properly unwind your code.

Philippe

Some more background info about stacktraces and inline info. If you run:

valgrind --read-inline-info=no ./memcheck/tests/inlinfo

then the last stacktrace produced is:

==2277==    at 0x80485C4: fun_noninline_o (inlinfo.c:40)
==2277==    by 0x804867D: fun_noninline_n (inlinfo.c:48)
==2277==    by 0x804880D: main (inlinfo.c:72)

There is effectively a call to fun_noninline_n at line 72. However, at line 48 we are inside the function fun_f, while the stacktrace above shows fun_noninline_n. When running with the (default) --read-inline-info=yes, the stacktrace is:

==2302==    at 0x80485C4: fun_noninline_o (inlinfo.c:40)
==2302==    by 0x804867D: fun_f (inlinfo.c:48)
==2302==    by 0x804867D: fun_e (inlinfo.c:54)
==2302==    by 0x804867D: fun_noninline_n (inlinfo.c:60)
==2302==    by 0x804880D: main (inlinfo.c:72)

which is then understandable: when there is inlining, the line number corresponds to the (last) inlined function, but the function name corresponds to the 'inlining' function.
Massif does not expand inlined function calls (so it produces stacktraces as if --read-inline-info=no was given). See the call to VG_(describe_IP) in ms_main.c: it has a NULL second argument.
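To see why this "right line, wrong name" combination appears, consider a minimal C analogue of the inlinfo test above (the file name and line numbers are invented for illustration):

    /* demo.c (hypothetical) */
    static inline int leaf(int x)      /* inlined into caller() below */
    {
        return x * 2;                  /* line 4: the event happens here */
    }

    int caller(int x)
    {
        return leaf(x) + 1;            /* line 9: the inlined call site */
    }

Without inline info, an address inside the inlined body is mapped to demo.c:4 by the line table, while the only symbol covering that address is caller. The report therefore reads "caller (demo.c:4)": the line of the inlined function paired with the name of the inlining function - the same pattern as in the massif output above.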
From: Nikolaus R. <nr...@tr...> - 2015-12-08 23:05:47
On 12/08/2015 02:26 PM, Philippe Waroquiers wrote:
> On Tue, 2015-12-08 at 09:08 -0800, Nikolaus Rath wrote:
> [...]
> What is the platform (version of valgrind/os/gcc, which cpu, ...)?

$ uname -a
Linux ibm-h1.localcluster 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/cpuinfo | head -10
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping        : 4
cpu MHz         : 1200.000
cache size      : 20480 KB
physical id     : 0
siblings        : 8

$ valgrind --version
valgrind-3.8.1

The binary I'm analyzing has been compiled with:

$ ifort --version
ifort (IFORT) 15.0.0 20140723
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

> Strange stacktraces can be produced when the compiler inlines some
> calls, or when the Valgrind unwinder has a bug.

Inlining was what came to my mind at first too. But in my experience this only causes a difference of a few lines - not more than 2000.

> You can check that using gdb+vgdb: put a break on H5FL_reg_calloc.
> When the break is hit, do:
> (gdb) backtrace

Hmm. This is interesting. If I ask gdb for a backtrace, it just hangs:

(gdb) b H5FL_reg_calloc
Breakpoint 4 at 0xf4abf0
(gdb) c
Continuing.

Breakpoint 4, 0x0000000000f4abf0 in H5FL_reg_calloc ()
(gdb) backtrace
#0  0x0000000000f4abf0 in H5FL_reg_calloc ()
#1  0x0000000000fada52 in H5O_attr_shared_decode ()
#2  0x0000000000fd0f7c in H5O_msg_iterate_real ()
#3  0x0000000000fb411b in H5O_attr_exists ()
#4  0x0000000000ee08ab in H5A_create ()
#5  0x0000000000edaab1 in H5Acreate2 ()
#6  0x0000000000ec985e in h5acreate_c_ ()
#7  0x0000000000ec3c57 in h5a_mp_h5acreate_f_ ()
#8  0x0000000000ba1ec8 in h5dump_attr_double (loc_id=1,

After pressing Ctrl-C:

#8  0x0000000000ba1ec8 in h5dump_attr_double (loc_id=1,
^C    f=<error reading variable: Cannot access memory at address 0xa000008>, name=Quit

Might this have something to do with the problem?

That said, I don't think the above backtrace is the right one, because H5FL_reg_calloc is called via a ton of different code paths. Is there a way to break in this function only when entered via the code path that valgrind gave above?

Thanks for your help,
-Nikolaus
From: Philippe W. <phi...@sk...> - 2015-12-08 23:10:24
On Tue, 2015-12-08 at 14:39 -0800, Nikolaus Rath wrote:
> $ valgrind --version
> valgrind-3.8.1

This is more than 3 years old. It would be (much) better to upgrade to the latest version (3.11). In any case, if the problem is related to inlining, you need a recent valgrind version to see it (inline info was implemented in 3.10).

> Inlining was what came to my mind at first too. But in my experience
> this only causes a difference of a few lines - not more than 2000.

The difference will only depend on where the inlining and inlined functions are, and these do not need to be close to each other.

> Hmm. This is interesting. If I ask gdb for a backtrace, it just hangs:
> [...]
> Might this have something to do with the problem?

Maybe: it is possible that ifort generates unusual code/debug info that gdb and/or the valgrind unwinder do not understand. You might also upgrade gdb (if you are using an old gdb version).

> That said, I don't think the above backtrace is the right one, because
> H5FL_reg_calloc is called via a ton of different code paths. Is there a
> way to break in this function only when entered via the code path that
> valgrind gave above?

No (easy) way that I know of. So you will have to put a break in the function that is most likely to be called only in the interesting stacktrace, and then put a break on H5FL_reg_calloc and continue.

Philippe
From: Nikolaus R. <nr...@tr...> - 2015-12-09 18:00:14
Hi Philippe,

I found that I can work around the problem of gdb failing to produce backtraces by compiling with -O0. Switching to -O1 or higher is enough to cause issues. I also experimented with dwarf-2, dwarf-3, and dwarf-4 debug information, but that did not seem to matter.

I tried to narrow down the problem with -O1, -gdwarf-2, newer valgrind, and newer gdb:

$ valgrind --tool=massif --vgdb-error=0 ../../Q2D/LamyRidge/src/model/LR_model
==4881== Massif, a heap profiler
==4881== Copyright (C) 2003-2013, and GNU GPL'd, by Nicholas Nethercote
==4881== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==4881== Command: ../../Q2D/LamyRidge/src/model/LR_model
[...]

$ gdb ../../Q2D/LamyRidge/src/model/LR_model
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
[...]
(gdb) target remote | /usr/lib/valgrind/../../bin/vgdb --pid=4881
Remote debugging using | /usr/lib/valgrind/../../bin/vgdb --pid=4881
[...]
(gdb) b taehdf5.f90:1936
(gdb) c
(gdb) c
(gdb) b H5FL_reg_calloc
(gdb) c
Continuing.

Breakpoint 2, 0x00000000009a7da0 in H5FL_reg_calloc ()
(gdb) monitor v.info scheduler
[...]
Thread 1: status = VgTs_Runnable
==4881==    at 0x9A7DA0: H5FL_reg_calloc (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==4881==    by 0x93DB39: H5A_create (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==4881==    by 0x937C60: H5Acreate2 (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==4881==    by 0x926A0D: h5acreate_c_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==4881==    by 0x920E06: h5a_mp_h5acreate_f_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==4881==    by 0x6CC1E0: taehdf5_mp_h5dump_attr_int_ (taehdf5.f90:1936)
==4881==    by 0x6D2EAB: taehdf5_mp_h5append_data_double_0d_ (taehdf5.f90:4193)
==4881==    by 0x6A23F6: plot_m_mp_plots_ (plot_hdf5.f:144)
==4881==    by 0x6AC10F: lr_mod_m_mp_check_dt_ (LR_model.F:487)
==4881==    by 0x6A4E96: lr_mod_m_mp_lr_step_ (LR_model.F:252)
==4881==    by 0x6AC38F: MAIN__ (LR_model.F:544)
==4881==    by 0x406F3D: main (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
client stack range: [0xFFEBFE000 0xFFF000FFF] client SP: 0xFFEC2CF78
valgrind stack top usage: 12424 of 1048576

(gdb) bt
#0  0x00000000009a7da0 in H5FL_reg_calloc ()
#1  0x000000000093db3a in H5A_create ()
#2  0x0000000000937c61 in H5Acreate2 ()
#3  0x0000000000926a0e in h5acreate_c_ ()
#4  0x0000000000920e07 in h5a_mp_h5acreate_f_ ()
#5  0x00000000006cc1e1 in taehdf5::h5dump_attr_int (loc_id=1,
[..hangs for a while...]
    f=<error reading variable: Cannot access memory at address 0xa000008>,
    name=<error reading variable: Cannot access memory at address 0x832f5f9>,
    .tmp.NAME.len_V$1086=272) at /home/nrath/Q2D/utils/src/taehdf5.f90:1936
#6  0x00000000006d2eac in taehdf5::h5append_data_double_0d (group_id=1,
    f=<error reading variable: Cannot access memory at address 0xa000008>,
    name=<error reading variable: Cannot access memory at address 0x832f5f9>,
    .tmp.NAME.len_V$1cd8=272) at /home/nrath/Q2D/utils/src/taehdf5.f90:4193
#7  0x00000000006a23f7 in plot_m::plots (idt=1) at /home/nrath/Q2D/LamyRidge/src/model/plot_hdf5.f:144
#8  0x00000000006ac110 in lr_mod_m::check_dt (idt=1) at /home/nrath/Q2D/LamyRidge/src/model/LR_model.F:487
#9  0x00000000006a4e97 in lr_mod_m::lr_step (idt=1,
    dt_r=<error reading variable: Cannot access memory at address 0xa000008>, t_r=0)
    at /home/nrath/Q2D/LamyRidge/src/model/LR_model.F:252
#10 0x00000000006ac390 in lr_model () at /home/nrath/Q2D/LamyRidge/src/model/LR_model.F:544

So as far as I can tell, valgrind is getting the backtrace right. Is this correct?

If so, I guess the only explanation is that I am not setting the breakpoint at the time where massif takes the snapshot?

Best,
-Nikolaus
From: Nikolaus R. <nr...@tr...> - 2015-12-09 18:32:13
On 12/09/2015 10:00 AM, Nikolaus Rath wrote:
> I found that I can work around the problem of gdb failing to produce
> backtraces by compiling with -O0. Switching to -O1 or higher is enough
> to cause issues.
> [...]
> So as far as I can tell, valgrind is getting the backtrace right. Is this correct?
>
> If so, I guess the only explanation is that I am not setting the breakpoint
> at the time where massif takes the snapshot?

Ok, I fell into a trap. I assumed that whatever causes gdb to hang when trying to print a backtrace also causes valgrind to produce wrong stacktraces. But that is not the case.

So, when compiling with -O1 -gdwarf-2, the valgrind and gdb backtraces agree. However, when compiling with -O3 -gdwarf-2, there is a difference. Valgrind thinks:

(gdb) monitor v.info scheduler
[...]
Thread 1: status = VgTs_Runnable
==5489==    at 0x1010750: H5FL_reg_calloc (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==5489==    by 0xFA64E9: H5A_create (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==5489==    by 0xFA0610: H5Acreate2 (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==5489==    by 0xF8F3BD: h5acreate_c_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==5489==    by 0xF897B6: h5a_mp_h5acreate_f_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==5489==    by 0xB99FC6: taehdf5_mp_h5append_data_double_0d_ (taehdf5.f90:1936)
==5489==    by 0xB248E6: plot_m_mp_plots_ (plot_hdf5.f:144)
==5489==    by 0xB3B722: lr_mod_m_mp_check_dt_ (LR_model.F:487)
==5489==    by 0xB272E3: lr_mod_m_mp_lr_step_ (LR_model.F:252)
==5489==    by 0xB261DD: MAIN__ (LR_model.F:544)
==5489==    by 0x406E3D: main (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
client stack range: [0xFFEBFE000 0xFFF000FFF] client SP: 0xFFEC2CFC8
valgrind stack top usage: 12424 of 1048576

But gdb says:

(gdb) bt
#0  0x0000000001010750 in H5FL_reg_calloc ()
#1  0x0000000000fa64ea in H5A_create ()
#2  0x0000000000fa0611 in H5Acreate2 ()
#3  0x0000000000f8f3be in h5acreate_c_ ()
#4  0x0000000000f897b7 in h5a_mp_h5acreate_f_ ()
#5  0x0000000000b99fc7 in h5dump_attr_int (loc_id=<optimized out>, f=<optimized out>, name=...,
    .tmp.NAME.len_V$1086=<optimized out>) at /home/nrath/Q2D/utils/src/taehdf5.f90:1936
#6  h5append_data_double_0d (group_id=1,
    f=<error reading variable: Cannot access memory at address 0xa000008>, name=...,
    .tmp.NAME.len_V$1cd8=272) at /home/nrath/Q2D/utils/src/taehdf5.f90:4193
#7  0x0000000000b248e7 in plot_m::plots (idt=1) at /home/nrath/Q2D/LamyRidge/src/model/plot_hdf5.f:144
#8  0x0000000000b3b723 in lr_mod_m::check_dt (idt=1) at /home/nrath/Q2D/LamyRidge/src/model/LR_model.F:487
#9  0x0000000000b272e4 in lr_mod_m::lr_step (idt=1,
    dt_r=<error reading variable: Cannot access memory at address 0xa000008>, t_r=0)
    at /home/nrath/Q2D/LamyRidge/src/model/LR_model.F:252
#10 0x0000000000b261de in lr_model () at /home/nrath/Q2D/LamyRidge/src/model/LR_model.F:544

Interestingly enough, both stacktraces are incorrect: gdb is missing the call to taehdf5_mp_h5append_data_double_0d_, and valgrind is missing the call to h5dump_attr_int.

This is with valgrind 3.10.0 and gdb 7.7.1 (as above). (I also tried compiling with just "-O3" (which should use dwarf-3), with "-O3 -gdwarf-4", and with just "-O2", but the stacktrace difference was there in every case.)

Short of only using -O1 and -O0, is there a way to fix this?

Best,
-Nikolaus
From: Philippe W. <phi...@sk...> - 2015-12-09 20:33:22
On Wed, 2015-12-09 at 10:32 -0800, Nikolaus Rath wrote:
> Interestingly enough, both stacktraces are incorrect: gdb is missing
> the call to taehdf5_mp_h5append_data_double_0d_, and valgrind is
> missing the call to h5dump_attr_int.

The missing valgrind entry has the same "look" as inlining that was not understood: we see a function, but with a source location that belongs to another function. Maybe you could try recompiling with an option that fully disables inlining?

> This is with valgrind 3.10.0 and gdb 7.7.1 (as above).

At least for valgrind, it would be better to upgrade to the latest released version (3.11).

> (I also tried compiling with just "-O3" (which should use dwarf-3), with
> "-O3 -gdwarf-4", and with just "-O2", but the stacktrace difference was
> there in every case.)

If this is related to inlining, then as far as I know you need at least dwarf-3.

> Short of only using -O1 and -O0, is there a way to fix this?

Without more info on the origin of the wrong stack trace, that is difficult to say. You might try using -O1 and then adding individual optimisation flags one by one, until you see which optimisation flag causes the wrong stacktrace (assuming your compiler is like gcc, i.e. has very fine-grained optimisation control).

Alternatively, you could investigate by using the valgrind monitor command

(gdb) monitor v.info unwind <addr> <len>

to inspect the unwind info around the missing stacktrace entry, and/or by activating the debug traces of valgrind, e.g.

    --trace-symtab=no|yes      show symbol table details? [no]
    --trace-symtab-patt=<patt> limit debuginfo tracing to obj name <patt>
    --trace-cfi=no|yes         show call-frame-info details? [no]

and/or by using objdump to look at the dwarf info.

Philippe
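For instance, one way to look for the relevant DWARF entries with objdump (the binary name is the one from this thread; the grep pattern is just one possibility):

    $ objdump --dwarf=info LR_model | grep -B2 -A8 h5dump_attr_int

Entries carrying DW_AT_inline, or DW_TAG_inlined_subroutine children, show where the compiler recorded inlining.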
From: Philippe W. <phi...@sk...> - 2015-12-09 20:43:22
On Wed, 2015-12-09 at 10:32 -0800, Nikolaus Rath wrote:
> (gdb) bt
> #0  0x0000000001010750 in H5FL_reg_calloc ()
> [...]
> #5  0x0000000000b99fc7 in h5dump_attr_int (loc_id=<optimized out>, ...)
>     at /home/nrath/Q2D/utils/src/taehdf5.f90:1936
> #6  h5append_data_double_0d (group_id=1, ...)
>     at /home/nrath/Q2D/utils/src/taehdf5.f90:4193
> [...]
>
> Interestingly enough, both stacktraces are incorrect: gdb is missing
> the call to taehdf5_mp_h5append_data_double_0d_, and valgrind is
> missing the call to h5dump_attr_int.

The gdb stacktrace does contain a call to h5append_data_double_0d (frame 6). So, as far as I can see, the gdb stacktrace is similar with -O1 and -O2 (if we ignore small details like the namespace taehdf5::). (I know nothing about Fortran, so no idea if that is a namespace or something else.)

When using gdb, you might find more info using e.g.

(gdb) info frame 5
(gdb) info frame 6
(gdb) info frame 7

Philippe
From: Nikolaus R. <nr...@tr...> - 2015-12-09 21:10:19
On 12/09/2015 12:44 PM, Philippe Waroquiers wrote:
> The gdb stacktrace does contain a call to h5append_data_double_0d (frame 6).

Oh, indeed. I missed that because it didn't start with the long hex address.

> So, as far as I can see, the gdb stacktrace is similar with -O1 and -O2
> (if we ignore small details like the namespace taehdf5::)

Yes. But that makes it even more confusing to me: apparently gdb picks up the debug information about the inlined function just fine - but valgrind doesn't.

I'll update valgrind to the most recent version and see if I can explicitly disable inlining.

Thanks,
-Nikolaus
From: Nikolaus R. <nr...@tr...> - 2015-12-09 21:41:29
On 12/09/2015 12:35 PM, Philippe Waroquiers wrote:
> The missing valgrind entry has the same "look" as inlining that was not
> understood: we see a function, but with a source location that belongs
> to another function. Maybe you could try recompiling with an option that
> fully disables inlining?
>
> At least for valgrind, it would be better to upgrade to the latest
> released version (3.11).

Ok, I tried again with version 3.11.0. The problem is still there. However, I can confirm that it's caused by inlining: if I explicitly disable inlining, I can still compile with -O3 and get correct stacktraces.

> Alternatively, you could investigate by using the valgrind monitor command
> (gdb) monitor v.info unwind <addr> <len>
> [...]

Hmm. Unfortunately the output doesn't tell me anything. Here's what I get:

Correct stacktrace:

Thread 1: status = VgTs_Runnable (lwpid 1654)
==1654==    at 0xF1F5D0: H5FL_reg_calloc (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==1654==    by 0xEB5369: H5A_create (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==1654==    by 0xEAF490: H5Acreate2 (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==1654==    by 0xE9E23D: h5acreate_c_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==1654==    by 0xE98636: h5a_mp_h5acreate_f_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==1654==    by 0xAE9C68: taehdf5_mp_h5dump_attr_int_ (taehdf5.f90:1936)
==1654==    by 0xAE58A3: taehdf5_mp_h5append_data_double_0d_ (taehdf5.f90:4193)
==1654==    by 0xA8A063: plot_m_mp_plots_ (plot_hdf5.f:144)
==1654==    by 0xAA0578: lr_mod_m_mp_check_dt_ (LR_model.F:487)
==1654==    by 0xA8D059: lr_mod_m_mp_lr_step_ (LR_model.F:252)
==1654==    by 0xA8BF33: MAIN__ (LR_model.F:544)
==1654==    by 0x406E3D: main (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
client stack range: [0xFFEBFE000 0xFFF000FFF] client SP: 0xFFEC2CF88
valgrind stack top usage: 12312 of 1048576

(gdb) monitor v.info unwind 0x0000000000ae9c69
[0xae9c69 .. 0xae9c69]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)
(gdb) monitor v.info unwind 0x0000000000ae58a4
[0xae58a4 .. 0xae58a4]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)

Incorrect stacktrace:

Thread 1: status = VgTs_Runnable (lwpid 2047)
==2047==    at 0xFE9640: H5FL_reg_calloc (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==2047==    by 0xF7F3D9: H5A_create (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==2047==    by 0xF79500: H5Acreate2 (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==2047==    by 0xF682AD: h5acreate_c_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==2047==    by 0xF626A6: h5a_mp_h5acreate_f_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==2047==    by 0xB99FC6: taehdf5_mp_h5append_data_double_0d_ (taehdf5.f90:1936)
==2047==    by 0xB248E6: plot_m_mp_plots_ (plot_hdf5.f:144)
==2047==    by 0xB3B722: lr_mod_m_mp_check_dt_ (LR_model.F:487)
==2047==    by 0xB272E3: lr_mod_m_mp_lr_step_ (LR_model.F:252)
==2047==    by 0xB261DD: MAIN__ (LR_model.F:544)
==2047==    by 0x406E3D: main (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
client stack range: [0xFFEBFE000 0xFFF000FFF] client SP: 0xFFEC2CFC8
valgrind stack top usage: 12312 of 1048576

(gdb) monitor v.info unwind 0xF626A6
[0xf626a6 .. 0xf626a6]: let cfa=oldSP+48 in RA=*(cfa+-8) SP=cfa+0 BP=Same
(gdb) monitor v.info unwind 0xB99FC6
[0xb99fc6 .. 0xb99fc6]: let cfa=oldBP+16 in RA=*(cfa+-8) SP=cfa+0 BP=*(cfa+-16)

> and/or by using objdump to look at the dwarf info.

I believe this is the relevant objdump output - but again, I don't understand it. Does it tell you anything?

<1><a8f8c>: Abbrev Number: 42 (DW_TAG_subprogram)
   <a8f8d>   DW_AT_decl_line   : 1916
   <a8f8f>   DW_AT_decl_file   : 1
   <a8f90>   DW_AT_declaration : 1
   <a8f91>   DW_AT_name        : (indirect string, offset: 0xa5fc): h5dump_attr_int
   <a8f95>   DW_AT_external    : 1
   <a8f96>   DW_AT_inline      : 1 (inlined)

Best,
-Nikolaus
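An aside on the v.info unwind notation above: each rule describes how to recover the caller's registers from the current frame on x86-64, and none of it encodes inline information. The frame-pointer-based rule reads, annotated:

    let cfa = oldBP + 16     ; CFA: this frame's %rbp plus 16
        RA  = *(cfa - 8)     ; return address saved just below the CFA
        SP  = cfa            ; caller's stack pointer is the CFA itself
        BP  = *(cfa - 16)    ; caller's %rbp, saved below the return address

The rule at 0xf626a6 instead derives the CFA from the stack pointer (oldSP+48) with BP=Same, i.e. that function keeps no conventional frame pointer. Inline information lives in .debug_info (the DW_TAG_inlined_subroutine entries), not in this unwind table, which is why these rules look unremarkable even for the incorrect stacktrace.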
From: Philippe W. <phi...@sk...> - 2015-12-09 21:57:23
On Wed, 2015-12-09 at 13:41 -0800, Nikolaus Rath wrote:
> I believe this is the relevant objdump output - but again, I don't
> understand it. Does it tell you anything?
>
> <1><a8f8c>: Abbrev Number: 42 (DW_TAG_subprogram)
>    <a8f8d>   DW_AT_decl_line   : 1916
>    <a8f8f>   DW_AT_decl_file   : 1
>    <a8f90>   DW_AT_declaration : 1
>    <a8f91>   DW_AT_name        : (indirect string, offset: 0xa5fc): h5dump_attr_int
>    <a8f95>   DW_AT_external    : 1
>    <a8f96>   DW_AT_inline      : 1 (inlined)

So, we have now clearly confirmed that the wrong stacktrace is caused by valgrind not understanding that h5dump_attr_int has been inlined.

What you could do is modify storage.c line 668 to put a (1) instead of the (0):

if (0) VG_(message)
   (Vg_DebugMsg,
    "addInlInfo: fn %s inlined as addr_lo %#lx,addr_hi %#lx,"
    "caller fndn_ix %u %s:%d\n",
    inlinedfn, addr_lo, addr_hi, fndn_ix,
    ML_(fndn_ix2filename) (di, fndn_ix), lineno);

This will then trace the inlined call information. If this inline info is properly understood and recorded, you should see some output telling you that h5dump_attr_int has been inlined around the program counter of:

==2047==    by 0xB99FC6: taehdf5_mp_h5append_data_double_0d_ (taehdf5.f90:1936)

If this is not recorded, then the valgrind dwarf reader likely has a problem. Otherwise, it is the unwinder which does not properly use the inline info.

Philippe
From: Nikolaus R. <nr...@tr...> - 2015-12-09 22:57:44
On 12/09/2015 01:59 PM, Philippe Waroquiers wrote:
> What you could do is modify storage.c line 668 to put a (1) instead of the (0):
> [...]
> This will then trace the inlined call information. If this inline info is
> properly understood and recorded, you should see some output telling you
> that h5dump_attr_int has been inlined around the program counter of:
> ==2047==    by 0xB99FC6: taehdf5_mp_h5append_data_double_0d_ (taehdf5.f90:1936)

The output is still the same:

Thread 1: status = VgTs_Runnable (lwpid 18278)
==18278==    at 0xFE9640: H5FL_reg_calloc (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==18278==    by 0xF7F3D9: H5A_create (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==18278==    by 0xF79500: H5Acreate2 (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==18278==    by 0xF682AD: h5acreate_c_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==18278==    by 0xF626A6: h5a_mp_h5acreate_f_ (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
==18278==    by 0xB99FC6: taehdf5_mp_h5append_data_double_0d_ (taehdf5.f90:1936)
==18278==    by 0xB248E6: plot_m_mp_plots_ (plot_hdf5.f:144)
==18278==    by 0xB3B722: lr_mod_m_mp_check_dt_ (LR_model.F:487)
==18278==    by 0xB272E3: lr_mod_m_mp_lr_step_ (LR_model.F:252)
==18278==    by 0xB261DD: MAIN__ (LR_model.F:544)
==18278==    by 0x406E3D: main (in /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model)
client stack range: [0xFFEBFE000 0xFFF000FFF] client SP: 0xFFEC2CF48

Best,
-Nikolaus
From: Philippe W. <phi...@sk...> - 2015-12-09 22:13:38
On Wed, 2015-12-09 at 22:59 +0100, Philippe Waroquiers wrote:
> If this is not recorded, then the valgrind dwarf reader likely has a
> problem. Otherwise, it is the unwinder which does not properly use the
> inline info.

What you can also do is use:

(gdb) monitor v.info location <address>

When the address is in code, it will print the function/file/line number (and if the address corresponds to inlined calls, it should describe everything that is inlined). For example, for the test memcheck/tests/inlinfo, I obtain:

(gdb) mo v.info loc 0x80486DC
Address 0x80486dc is in the Text segment of memcheck/tests/inlinfo
==8004==    at 0x80486DC: fun_d (inlinfo.c:7)
==8004==    by 0x80486DC: fun_c (inlinfo.c:15)
==8004==    by 0x80486DC: fun_b (inlinfo.c:21)
==8004==    by 0x80486DC: fun_a (inlinfo.c:27)
==8004==    by 0x80486DC: main (inlinfo.c:66)

If the inline info is recorded and used properly, the above command should give the inlining and inlined functions for the relevant program counter.

Philippe
From: Nikolaus R. <nr...@tr...> - 2015-12-09 22:59:20
On 12/09/2015 02:15 PM, Philippe Waroquiers wrote:
> What you can also do is use:
> (gdb) monitor v.info location <address>
> [...]
> If the inline info is recorded and used properly, the above command
> should give the inlining and inlined functions for the relevant
> program counter.

Apparently the info is not used correctly then:

(gdb) monitor v.info location 0xB99FC6
Address 0xb99fc6 is in the Text segment of /mnt/nfs-home/nrath/Q2D/LamyRidge/src/model/build/LR_model
==18278==    at 0xB99FC6: taehdf5_mp_h5append_data_double_0d_ (taehdf5.f90:1936)
(gdb)

Best,
-Nikolaus
From: Matthias S. <zz...@ge...> - 2015-12-10 18:21:23
On 09.12.2015 at 22:10, Nikolaus Rath wrote:
> Yes. But that makes it even more confusing to me: apparently gdb picks
> up the debug information about the inlined function just fine - but
> valgrind doesn't.
>
> I'll update valgrind to the most recent version and see if I can
> explicitly disable inlining.

Do you do your experiments with massif or with memcheck?

If it is massif, then --read-inline-info defaults to "no". An explicit --read-inline-info=yes could at least make the valgrind core save the extra information. But I do not know if massif will care about this information.

Regards
Matthias
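For reference, the flag goes on the valgrind command line next to the tool selection, e.g. with the binary from this thread:

    $ valgrind --tool=massif --read-inline-info=yes ./LR_model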
From: Nikolaus R. <nr...@tr...> - 2015-12-10 19:06:55
On 12/10/2015 10:20 AM, Matthias Schwarzott wrote:
> Do you do your experiments with massif or with memcheck?
>
> If it is massif, then --read-inline-info defaults to "no". An explicit
> --read-inline-info=yes could at least make the valgrind core save the
> extra information.

I am using massif - and bingo, using --read-inline-info=yes makes everything work. (So much time wasted experimenting, when a careful study of the manpage would have been enough... grmbl.)

Thanks!
-Niko
From: Philippe W. <phi...@sk...> - 2015-12-10 21:09:59
On Thu, 2015-12-10 at 11:06 -0800, Nikolaus Rath wrote:
> On 12/10/2015 10:20 AM, Matthias Schwarzott wrote:
> > Do you do your experiments with massif or with memcheck?
> >
> > If it is massif, then --read-inline-info defaults to "no". An explicit
> > --read-inline-info=yes could at least make the valgrind core save the
> > extra information.

Oops, yes, I forgot that inline info was not activated for the massif tool.

> I am using massif - and bingo, using --read-inline-info=yes makes
> everything work.

This is strange. --read-inline-info=yes will ensure that e.g. 'v.info scheduler' stacktraces show inlined calls, or that 'v.info location <address>' describes inlined calls. But I do not see how the massif output would use it: to be used, a non-NULL iipc argument must be given to VG_(describe_IP), and the massif snapshot code passes NULL (see ms_main.c:2149).

Are you sure the massif output shows the inlined calls?

Philippe
From: Nikolaus R. <nr...@tr...> - 2015-12-10 21:20:45
On 12/10/2015 01:11 PM, Philippe Waroquiers wrote:
> This is strange. --read-inline-info=yes will ensure that e.g. 'v.info
> scheduler' stacktraces show inlined calls, or that 'v.info location
> <address>' describes inlined calls. But I do not see how the massif
> output would use it.
>
> Are you sure the massif output shows the inlined calls?

Oh, no, sorry. I was silently assuming that if 'v.info scheduler' produces the correct output, massif would produce the correct output as well.

So the answer is then that massif simply does not support inlined calls?

Best,
-Nikolaus
From: Philippe W. <phi...@sk...> - 2015-12-10 21:38:58
On Thu, 2015-12-10 at 13:20 -0800, Nikolaus Rath wrote:
> So the answer is then that massif simply does not support inlined calls?

Effectively: today, massif does not support outputting inlined calls. pp_snapshot_SXPt in ms_main.c would have to be modified to print inlined function calls.

Philippe
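For anyone finding this thread later: a minimal sketch of the kind of change described, built on the inline-aware iteration API from include/pub_tool_debuginfo.h. The helper name and its integration into massif's snapshot printing are hypothetical; only the VG_(new_IIPC)/VG_(next_IIPC)/VG_(delete_IIPC)/VG_(describe_IP) calls are the actual 3.11-era API:

    /* Hypothetical helper: print one code address, one line per
       inlined frame, the way 'v.info scheduler' does.            */
    static void print_ip_expanding_inlines(Addr ip)
    {
       /* NULL when no inline info is available at ip (e.g. with
          --read-inline-info=no); VG_(describe_IP) copes with a NULL
          cursor by collapsing inlined frames into a single entry. */
       InlIPCursor* iipc = VG_(new_IIPC)(ip);
       do {
          const HChar* desc = VG_(describe_IP)(ip, iipc);
          VG_(printf)("%s\n", desc);
       } while (VG_(next_IIPC)(iipc));  /* False once all inlined
                                           frames are consumed     */
       VG_(delete_IIPC)(iipc);
    }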