|
From: John R.
Attachments:
co-resident.patch
|
Hi, In order for valgrind to tell other tools (debugger, profiler, ...) about the modules it loads (stage2, vgtool_memcheck, ld-linux, ...), I have had success using the attached patches for valgrind-3.0.1 (co-resident.patch, 4370 bytes.) Synopsys (refer to <elf.h> and <link.h>): Have a PT_DYNAMIC that contains a DT_DEBUG with a datum that points to a struct r_debug. For each module loaded: add a struct link_map onto the front of the list that is maintained through .l_next, .l_prev. Use .r_brk as a function pointer, and call it as appropriate: with .r_state=RT_ADD just before loading some modules, and with .r_state=RT_CONSISTENT just after loading some modules. Set .r_brk to the address of a no-op subroutine if 0==.r_brk. A debugger or co-resident tool may intercept this subroutine, so make it easy to breakpoint or overwrite (allow at least 16 bytes.) Using names _r_debug and _dl_debug_state() may help older [or lazy] debuggers that don't understand the PT_DYNAMIC + DT_DEBUG protocol. I have used this with a subroutine profiler that is in development. The current state is sufficient to measure memcheck on simple programs. Example output is http://BitWagon.com/valgrind/valgrind.tsprof.bz2 (76636 bytes; expands to 497089 bytes, 6084 lines.) -- |
|
From: Nicholas N. <nj...@cs...> - 2005-09-13 21:47:49
|
On Sun, 4 Sep 2005, John Reiser wrote: > In order for valgrind to tell other tools (debugger, profiler, ...) > about the modules it loads (stage2, vgtool_memcheck, ld-linux, ...), > I have had success using the attached patches for valgrind-3.0.1 > (co-resident.patch, 4370 bytes.) > > Synopsys (refer to <elf.h> and <link.h>): > Have a PT_DYNAMIC that contains a DT_DEBUG with a datum that points > to a struct r_debug. For each module loaded: add a struct link_map > onto the front of the list that is maintained through .l_next, .l_prev. > Use .r_brk as a function pointer, and call it as appropriate: > with .r_state=RT_ADD just before loading some modules, > and with .r_state=RT_CONSISTENT just after loading some modules. > Set .r_brk to the address of a no-op subroutine if 0==.r_brk. > A debugger or co-resident tool may intercept this subroutine, > so make it easy to breakpoint or overwrite (allow at least 16 bytes.) > Using names _r_debug and _dl_debug_state() may help older [or lazy] > debuggers that don't understand the PT_DYNAMIC + DT_DEBUG protocol. > > I have used this with a subroutine profiler that is in development. > The current state is sufficient to measure memcheck on simple programs. > Example output is http://BitWagon.com/valgrind/valgrind.tsprof.bz2 > (76636 bytes; expands to 497089 bytes, 6084 lines.) The profile output looks really interesting. There's an explanation of the format at http://bitwagon.com/tsprof/tsprof.html. Here's an interesting snippet: Black box 4.766 seconds ('+' excludes Recursive entries) count ticks millisec @millisec % [id] module:name [...] 2599 9055199077 4527.5995 1.7421 95 [4] 2:LibVEX_Translate 2599 37727305 18.8637 0.0073 0 [4] 2:LibVEX_Translate (self) 2599 2641400997 1320.7005 0.5082 29 [4]->[5] 2:doRegisterAllocation 10396 1482524941 741.2625 0.0713 16 [4]->[6] 2:sanityCheckIRBB 2599 905329930 452.6650 0.1742 10 [4]->[11] 2:iselBB_AMD64 2599 856546420 428.2732 0.1648 9 [4]->[13] 2:do_iropt_BB 2599 689222695 344.6113 0.1326 8 [4]->[19] 5:vgMemCheck_instrument 2599 688523386 344.2617 0.1325 8 [4]->[20] 2:do_treebuild_BB 2599 490806029 245.4030 0.0944 5 [4]->[17] 2:cprop_BB 185354 445410431 222.7052 0.0012 5 [4]->[26] 2:emit_AMD64Instr 2599 373706079 186.8530 0.0719 4 [4]->[33] 2:bb_to_IR 7797 335553017 167.7765 0.0215 4 [4]->[27] 2:do_deadcode_BB 2599 93003070 46.5015 0.0179 1 [4]->[87] 2:vg_SP_update_pass 2599 14737922 7.3690 0.0028 0 [4]->[211] 2:getAllocableRegs_AMD64 5198 706855 0.3534 0.0001 0 [4]->[506] 2:vexClearTEMP This shows that 95% of time accounted for was spent in LibVEX_Translate, and that 29% of that 95% was in doRegisterAllocation, and 16% of that 95% was in sanitycheckIRBB. This is a small program, it would be interesting to see what the numbers are like for a bigger program. Presumably the translation costs would drop as a proportion of total time. John, will generated code be covered by this profile? I see this entry: 1 994533995 497.2670 497.2670 10 [9] 2:0x70057c53 1 0 0.0000 0.0000 0 [9] 2:0x70057c53 (self) Is this a generated basic block? Also, Valgrind seg faulted during the run -- is this due to a bad interaction with tsprof? Nick |
|
From: Oswald B. <os...@kd...> - 2005-09-14 15:56:13
|
On Tue, Sep 13, 2005 at 04:47:43PM -0500, Nicholas Nethercote wrote: > This is a small program, it would be interesting to see what the > numbers are like for a bigger program. Presumably the translation > costs would drop as a proportion of total time. > i dare to claim the opposite ... big == many translations. cpu boundness of the app would mean underproportional translation costs. try starting a simple kde app (maybe kclac) - it will probably execute a few hundred kilobytes of unique code before it even shows up. -- Hi! I'm a .signature virus! Copy me into your ~/.signature, please! -- Chaos, panic, and disorder - my work here is done. |
|
From: Nicholas N. <nj...@cs...> - 2005-09-14 16:15:41
|
On Wed, 14 Sep 2005, Oswald Buddenhagen wrote: > On Tue, Sep 13, 2005 at 04:47:43PM -0500, Nicholas Nethercote wrote: >> This is a small program, it would be interesting to see what the >> numbers are like for a bigger program. Presumably the translation >> costs would drop as a proportion of total time. >> > i dare to claim the opposite ... big == many translations. cpu boundness > of the app would mean underproportional translation costs. > try starting a simple kde app (maybe kclac) - it will probably execute > a few hundred kilobytes of unique code before it even shows up. I should have been clearer: I meant "longer-running". Nick |
|
From: John R.
|
Nicholas Nethercote wrote: > John, will generated code be covered by this profile? Runtime-generated code will be counted as if it were inlined into the nearest code [in calling stack order] that was subject to execve() or dlopen(). If runtime-generated code can be described as a ELF module that the profiler finds out about via intercepting (*_r_debug.r_brk)(), then runtime-generated code also will be profiled. I see this entry: > > 1 994533995 497.2670 497.2670 10 [9] 2:0x70057c53 > 1 0 0.0000 0.0000 0 [9] 2:0x70057c53 > (self) > > Is this a generated basic block? No, it's a subroutine whose name could not be found. There was a CALL to that address, but no dynamic symbol, static symbol, or debug symbol had that address. [Or, perhaps it's a profiler bug. :-)] > > Also, Valgrind seg faulted during the run -- is this due to a bad > interaction with tsprof? It was the profiler that segfaulted: a bug in profiling syscall __NR_exit or __NR_exit_group. Subroutines that have not exited must be "shut down." Fixed in current version. As for profiling a longer-running program, please suggest suitable testcases. -- John Reiser, jreiser@BitWagon.com |
|
From: Nicholas N. <nj...@cs...> - 2005-09-15 18:26:43
|
On Wed, 14 Sep 2005, John Reiser wrote: > As for profiling a longer-running program, please suggest suitable > testcases. Konqueror is a good one, as it's quite big but is all a single process (unlike OpenOffice, for example). An interesting first run would be to start up, view a couple of websites and then close it. Vim or xedit or another editor would also be interesting -- again start up, do some stuff, shut down. Nick |
|
From: John R.
|
>> As for profiling a longer-running program, please suggest suitable >> testcases. > > > Konqueror is a good one, as it's quite big but is all a single process > (unlike OpenOffice, for example). An interesting first run would be to > start up, view a couple of websites and then close it. > > Vim or xedit or another editor would also be interesting -- again start > up, do some stuff, shut down. http://BitWagon.com/valgrind/vg-konq.tsprof.bz2 96331 bytes ==> 620595 bytes, 7463 lines http://BitWagon.com/valgrind/vg-vim.tsprof.bz2 80711 bytes ==> 513247 bytes, 6217 lines Run on 2.0GHz AMD64 using valgrind-3.0.1 minimally modified as per previously-posted patch, which identifies modules dynamically loaded by valgrind, so that other tools can "see" them. Time is measured by thread-virtual hardware TSC cycle counter provided by perfctr patches, as used by tsprof, a proprietary profiling application. The previous report on 2005-09-04 used non-virtualized wallclock time on an "otherwise-idle" machine. In the reports, "<unknown subroutine>" corresponds to VEX-generated code. For instance, the vim-under-valgrind run shows this excerpt: ----- count ticks millisec @millisec % [id] module:name 11442209 19088608055 9497.7130 0.0008 37 [10] 0:<unknown subroutine> 11442209 6634858092 3301.2348 0.0003 35 [10] 0:<unknown subroutine> (self) 6768800 3323591125 1653.6834 0.0002 17 [10]->[16] 2:amd64g_calculate_condition 1824246 2599201091 1293.2565 0.0007 14 [10]->[19] 2:vgPlain_unknown_SP_update 3402866 1620918313 806.5029 0.0002 8 [10]->[28] 2:amd64g_calculate_rflags_c 4732061 1377005011 685.1415 0.0001 7 [10]->[32] 5:vgMemCheck_helperc_LOADV1 5567188 1040584819 517.7526 0.0001 5 [10]->[38] 5:vgMemCheck_helperc_LOADV8le 4443735 698827803 347.7082 0.0001 4 [10]->[58] 5:vgMemCheck_helperc_STOREV8le 1776344 474429732 236.0569 0.0001 2 [10]->[70] 5:vgMemCheck_helperc_STOREV1 1462144 374469654 186.3208 0.0001 2 [10]->[85] 5:vgMAC_die_mem_stack_8 2044409 356894445 177.5761 0.0001 2 [10]->[88] 5:vgMemCheck_helperc_MAKE_STACK_UNINIT 1027349 233359193 116.1100 0.0001 1 [10]->[116] 5:vgMemCheck_helperc_LOADV4le 1292385 214058138 106.5066 0.0001 1 [10]->[122] 5:vgMAC_new_mem_stack_8 455598 101726912 50.6152 0.0001 1 [10]->[163] 5:vgMemCheck_helperc_STOREV4le 20989 18884921 9.3964 0.0004 0 [10]->[321] 5:mc_LOADVn_slow 34723 9040793 4.4983 0.0001 0 [10]->[378] 5:vgMemCheck_helperc_LOADV2le 14962 4322899 2.1509 0.0001 0 [10]->[430] 5:vgMemCheck_helperc_STOREV2le 2827 4156037 2.0679 0.0007 0 [10]->[433] 5:mc_STOREVn_slow 9 1466390 0.7296 0.0811 0 [10]->[517] 5:mc_record_value_error 331 225631 0.1123 0.0003 0 [10]->[680] 5:vgMAC_new_mem_stack_16 2 190098 0.0946 0.0473 0 [10]->[692] 2:sync_signalhandler 152 171809 0.0855 0.0006 0 [10]->[698] 5:vgMAC_new_mem_stack_32 170 135610 0.0675 0.0004 0 [10]->[722] 5:vgMAC_die_mem_stack_32 63 62011 0.0309 0.0005 0 [10]->[783] 5:vgMAC_die_mem_stack_16 9 11246 0.0056 0.0006 0 [10]->[936] 5:vgMemCheck_helperc_value_check0_fail 21 8910 0.0044 0.0002 0 [10]->[949] 2:amd64g_dirtyhelper_RDTSC 10 7372 0.0037 0.0004 0 [10]->[970] 2:amd64g_calculate_rflags_all ----- where execution of generated instructions accounted for 0.37 * 0.35 = 13% of all cycles, and subroutines called by generated code accounted for 0.37 * (1 - 0.35) = 24% of all cycles. Both figures include profiling overheads of a few hundred cycles per CALL+RET; YMMV. [The measurement overhead can be reduced to zero by considering the called routines as "inlined" into their callers, but then the entry counts are unavailable, too.] The profiled run of vim-under-valgrind (26 seconds) completed normally. The profiled run of konqueror-under-valgrind (55 seconds) got as far as indicated below. konqueror runs under either valgrind or the profiler separately. The blame for the Assertion during konqueror-under-valgrind-under-tsprof has not yet been investigated. ----- Detaching after fork from child process 4840. kbuildsycoca running... not in syscall (0x7005DEBA - 0x7005DF3E) valgrind: syswrap-main.c:570 (vgPlain_client_syscall): Assertion 'sci->status.what == SsIdle' failed. ==4835== at 0x7001AC21: panic (m_libcassert.c:175) ==4835== by 0x70EEEFFF: ??? ==4835== by 0x80B1412BE9: ??? sched status: running_tid=1 Thread 1: status = VgTs_Runnable ==4835== at 0x1476BF22: __write_nocancel (in /lib64/tls/libpthread-2.3.4.so) ==4835== by 0x12DBBC10: KProcessController::theSigCHLDHandler(int) (in /usr/lib64/libkdecore.so.4.2.0) ==4835== by 0x1476D31F: (within /lib64/tls/libpthread-2.3.4.so) ----- -- |
|
From: Nicholas N. <nj...@cs...> - 2005-11-11 17:51:15
|
On Sun, 6 Nov 2005, John Reiser wrote: > http://BitWagon.com/valgrind/vg-konq.tsprof.bz2 96331 bytes ==> 620595 bytes, 7463 lines > http://BitWagon.com/valgrind/vg-vim.tsprof.bz2 80711 bytes ==> 513247 bytes, 6217 lines > > Run on 2.0GHz AMD64 using valgrind-3.0.1 minimally modified as per previously-posted patch, > which identifies modules dynamically loaded by valgrind, so that other tools can "see" them. > Time is measured by thread-virtual hardware TSC cycle counter provided by perfctr patches, > as used by tsprof, a proprietary profiling application. The previous report on 2005-09-04 > used non-virtualized wallclock time on an "otherwise-idle" machine. Thanks for the info. That's a very useful tool! Nick |