From: Jeremy F. <je...@go...> - 2002-10-04 15:52:24

On Fri, 2002-10-04 at 03:21, Julian Seward wrote:
> Good news. After peering at weird segfaults on Red Hat Null (8.0 beta)
> last night, I can see that it might be possible to assemble enough
> hacks so that the stable branch will work on RH 8. Assuming that they
> haven't changed the threading model used in the transition between the
> "null" final beta and 8.0 itself, which doesn't sound likely. I thought
> that 8.0 would use glibc-2.3, but apparently it is only at 2.2.93, so
> we don't have to deal yet with big threading changes.

How has threading changed in RH8 and/or glibc 2.3? Have they dropped
LinuxThreads?

> I had expected only to be able to support RH 8 on the head, using the
> LDT/GDT support, but it seems that might not be necessary. Vague plan
> therefore is to assemble this and various other bugfixes

What fixes do you intend putting in the 1.0.X branch?

J
From: Jeremy F. <je...@go...> - 2002-10-04 15:44:12

On Fri, 2002-10-04 at 03:51, Julian Seward wrote:
> How much faster is "significantly faster" ?

I haven't measured it in detail, but the frame rate increased from about
1100ms/frame to 800-900ms/frame. I'll do some more scientific measurements
soon.

> So, my main point. I think this patch is unsafe and will lead to
> hard-to-find problems down the line. The difficulty is that it allows
> the simulated FPU state to hang around in the real FPU for long
> periods, up to a whole basic block's worth of execution (if I
> understand it right). We only need a skin to call out to a helper
> function which modifies the real FPU state on some obscure path, and
> we're hosed. Since we don't have any control over what skins people
> might plug in, this seems like an unsafe modification to the core.
>
> The modification I had in mind for a while was a lot more conservative,
> and more along the lines of a peephole optimisation. Essentially if we
> see a FPU-no-mem op followed by another FPU-no-mem op we can skip the
> save at the end of the first and the restore at the start of the
> second.

What I'm doing is not conceptually different from caching an ArchReg in a
RealReg for the lifetime of a basic block. The general idea is that the FP
state is pulled in just before the first FPU/FPU_[RW] instruction, and
saved again just before:

  - JMP
  - CCALL
  - any skin UInstr

I can't see how a skin can introduce any instrumentation which would be
able to catch the FP state unsaved (is there any way for a skin to do
instrumentation or call a C function without using either CCALL or its own
UInstr?).

Your idea is basically the same, except we add a fourth saving condition:

  - any non-FPU instruction

This would only be necessary if you imagine a non-FPU instruction which
can inspect the architectural state of the FPU (in other words, a memory
access offset into the baseBlock: something which skins can't generate
directly).

In summary, I think this is actually pretty conservative, simple and safe.

J
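
To pin down the discipline described above, here is a standalone toy model
of it. The opcode names echo the thread, but everything else is
illustrative C, not Valgrind code: the simulated FP state is pulled into
the real FPU lazily, and flushed only where control could leave the
generated code.

   #include <stdbool.h>
   #include <stdio.h>

   typedef enum { FPU, FPU_R, FPU_W, JMP, CCALL, SKIN_OP, ADD } Op;

   static bool fplive = false;   /* does the real FPU hold sim state? */

   static void emit(Op op) {
      switch (op) {
         case FPU: case FPU_R: case FPU_W:
            /* first FPU op since the last flush: load the state */
            if (!fplive) { puts("emit_get_fpu_state()"); fplive = true; }
            printf("emit FPU op %d\n", (int)op);
            break;
         case JMP: case CCALL: case SKIN_OP:
            /* anything that can run foreign code must see saved state */
            if (fplive) { puts("emit_put_fpu_state()"); fplive = false; }
            printf("emit op %d\n", (int)op);
            break;
         default:
            /* plain integer ops never touch the FPU: state stays live */
            printf("emit op %d\n", (int)op);
            break;
      }
   }

   int main(void) {
      /* 4 FPU ops, but only 2 get/put pairs instead of 4 */
      Op bb[] = { FPU, FPU, ADD, FPU_R, CCALL, FPU, JMP };
      for (unsigned i = 0; i < sizeof bb / sizeof bb[0]; i++)
         emit(bb[i]);
      return fplive ? 1 : 0;   /* must be 0: saved before block end */
   }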
From: Nicholas N. <nj...@ca...> - 2002-10-04 10:48:27

On Fri, 4 Oct 2002, Josef Weidendorfer wrote:

> The only problem I saw was that I need a valgrind version of the LIBC
> "unlink", which I already mailed to Nick...

I just added VG_(unlink) to head; it's untested, hopefully I got it right.

> Regarding the valgrind skin architecture: Shouldn't it be possible to
> "stack" skins? At the moment, for my skin I have to include all the
> cachegrind code again. And if the cachegrind skin decides to simulate a
> 3rd level cache, I have to copy it.

Hmm, the LD_PRELOADing of two shared objects (skin.so + core.so) is
already a bit fragile; having multiple .so's feels like a bad idea to
me... Anyway, aren't Cachegrind and your patched version dissimilar enough
that it wouldn't be easy to "stack" them in a sensible way?

A better way might be to factor out the common code which gets included in
both skin .so's, if you see what I mean. This should be done with
addrcheck and memcheck at some stage because they share a lot of identical
code.

Thinking longer term, your version of Cachegrind could entirely replace
the original Cachegrind one day, since AFAICT your Cachegrind's
functionality is a strict superset of my Cachegrind's.

> In fact: Call tree logging should be totally orthogonal to event
> logging. Shouldn't we have some general support for expandable cost
> centers in the core?

Maybe; I thought about this but didn't get any further. It would help if
you could give some specific suggestions as to the form it might take.

> Perhaps you have some suggestions for my problem with recursive calls:
>
> Suppose a call chain starting from A: A calls B and C; C calls A again.
> [...]
> Suggestions?

My brain is melting. Do you know how gprof handles it?

N
From: Julian S. <js...@ac...> - 2002-10-04 10:44:56

Cobbling together a response to this from the archives, since I didn't get
it via the normal routes.

> This patch makes FPU state changes lazy, so there should only be one
> save/restore pair per basic block. With this change in place,
> FPU-intensive programs (in my case, some 3D code using OpenGL) are
> significantly faster.

Interesting. This is something I'd wondered about doing at the time I did
the FPU stuff in the first place. How much faster is "significantly
faster" ?

So, my main point. I think this patch is unsafe and will lead to
hard-to-find problems down the line. The difficulty is that it allows the
simulated FPU state to hang around in the real FPU for long periods, up to
a whole basic block's worth of execution (if I understand it right). We
only need a skin to call out to a helper function which modifies the real
FPU state on some obscure path, and we're hosed. Since we don't have any
control over what skins people might plug in, this seems like an unsafe
modification to the core.

The modification I had in mind for a while was a lot more conservative,
and more along the lines of a peephole optimisation. Essentially if we see
a FPU-no-mem op followed by another FPU-no-mem op we can skip the save at
the end of the first and the restore at the start of the second. Looking
at the stable branch vg_from_ucode.c and the codegen cases for FPU, FPU_R
and FPU_W, it's clear we can also do the same for FPU_R/W followed by FPU,
since there are no calls to helpers in the gap between these two. Or am I
missing something?

It would definitely be good to speed up the FPU stuff a bit, but I need to
be convinced that you've got this 100% tied down in a not-too-complex way,
in the face of arbitrary actions carried out by skins-not-invented-yet.

J
From: Josef W. <Jos...@gm...> - 2002-10-04 10:18:46

Hi,

On Friday 04 October 2002 03:01, Jeremy Fitzhardinge wrote:
> Hi,
>
> Do you have a patch for the current CVS version of valgrind? I finally
> got enough of KDE installed on my laptop to compile kcachegrind, so I'm
> keen to try it out.
>
> Thanks,
> J

Sorry: I don't have much time at the moment...

Last wednesday I looked for the first time at valgrind-HEAD. It seems to
be quite easy to port my patch to a skin. The only problem I saw was that
I need a valgrind version of the LIBC "unlink", which I already mailed to
Nick...

Regarding the valgrind skin architecture: Shouldn't it be possible to
"stack" skins? At the moment, for my skin I have to include all the
cachegrind code again. And if the cachegrind skin decides to simulate a
3rd level cache, I have to copy it. In fact: Call tree logging should be
totally orthogonal to event logging. Shouldn't we have some general
support for expandable cost centers in the core? Then I could use these to
add/subtract costs without even knowing which ones are logged... To be
honest, I haven't thought much about this idea yet...

Regarding KCachegrind: I still have a problem with visualizing recursive
calls. This seems to involve changes in the Cachegrind patch, too. So I
first have to solve this one before I'm making any new patch/release...

Aside from that: I switched to GCC 3.2 with an update to Suse 8.1, and now
I have a lot of problems with lost debugging info :-(

Question: What are the exact problems with GCC 3.x, that it's not
officially supported in Valgrind?

Perhaps you have some suggestions for my problem with recursive calls:

Suppose a call chain starting from A: A calls B and C; C calls A again.
The recursively called A only calls B. Say the cost of B is always 100,
and the self cost of each A and C is 10. So the cumulative cost of C will
be 120 (C=>A=>B), and the one of the first call to A will be 230. I log
(cumulative) costs for the call A=>B only once, so this call gets cost
200.

The problem: The callee list of A shows a call to B with cost 200 and a
call to C with cost 120, but A itself only has cumulative cost 230 !?!
This is confusing to the user, and really makes problems drawing the
TreeMap... The real problem is that KCachegrind can't see that the cost of
A=>B is included in the call cost A=>C and thus shown twice in the callee
list of A. And in the Treemap, I simply stop drawing recursive calls: this
leaves empty space where there should be none, and it looks like a
performance problem to the user where there is none!!

The first ad-hoc solution was to distinguish among calls from recursively
called functions, i.e. the 2 calls A=>B will be logged independently from
each other: this makes the example look quite fine again. But this makes
the real problem disappear for a few simple examples (as the above one)
only; it's still there for deeper recursions, and cumulative costs of
calls always include ALL recursion costs inside of this call: I log the
cost counter difference at entering/leaving the function.

The real solution (without the ad-hoc one) would be: The callee list of A
shows a call to B with cost 200. This is correct: B is called twice from
A, leading to cost 200 for calls to B. But the call A=>C should be 20
only, skipping costs from any recursive A inside (perhaps stating that
cost 100 is already included in other calls). And this would make the
Treemap drawing fine again.

So the only question I have: HOW to calculate this value (20) in the
general case?!? I suppose I can't calculate it at post-processing time,
but have to log it in Cachegrind somehow (that is, the skipped cost of 100
in the example above).

Suggestions?

Josef
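
One way to draw that example, with the numbers spelled out (all figures
taken from the message above):

   A  (self 10, cumulative 230 = 10 + 100 + 120)
   |-- B                         cost 100  \  logged as one call A=>B,
   `-- C  (self 10, cum 120)                \ total cost 200
       `-- A' (self 10, cum 110 = 10 + 100) /
           `-- B                 cost 100  /

   callee list of A as logged:  A=>B 200,  A=>C 120   (sum 320 > 230)
   what Josef wants reported:   A=>B 200,  A=>C  20   (= 120 - 100 already
                                                       counted under A=>B)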
From: Julian S. <js...@ac...> - 2002-10-04 10:14:55

Good news. After peering at weird segfaults on Red Hat Null (8.0 beta)
last night, I can see that it might be possible to assemble enough hacks
so that the stable branch will work on RH 8. Assuming that they haven't
changed the threading model used in the transition between the "null"
final beta and 8.0 itself, which doesn't sound likely. I thought that 8.0
would use glibc-2.3, but apparently it is only at 2.2.93, so we don't have
to deal yet with big threading changes.

I had expected only to be able to support RH 8 on the head, using the
LDT/GDT support, but it seems that might not be necessary.

Vague plan therefore is to assemble this and various other bugfixes into
1.0.4 and release that within about a week. I still plan to make a
snapshot release of the head as 1.1.0 in the next couple of days, for the
more adventurous.

J
From: Nicholas N. <nj...@ca...> - 2002-10-03 19:23:10

http://sourceforge.net/projects/gnogrind/

N
From: Julian S. <js...@ac...> - 2002-10-03 11:59:10

Having created this list I forgot to subscribe to it. Duh.

J
From: Nicholas N. <nj...@ca...> - 2002-10-03 08:57:42

On 2 Oct 2002, Jeremy Fitzhardinge wrote:

> Surely it should be:
>
> -#define ALL_RREGS_LIVE (1 << (VG_MAX_REALREGS-1))     /* 0011...11b */
> +#define ALL_RREGS_LIVE ((1 << VG_MAX_REALREGS)-1)     /* 0011...11b */

Absolutely. I was wondering how anything worked at all with such an
egregious error... turns out this value is only being used to initialise
the reg liveness info, and is always overwritten by the liveness analysis.

I just fixed it. Thanks.

N
From: Nicholas N. <nj...@ca...> - 2002-10-03 08:51:17

On 2 Oct 2002, Jeremy Fitzhardinge wrote:

> To solve this, prev_bb needs to be a per-thread value rather than a
> global one. It seems to me that a clean way of solving this is to
> introduce a mechanism which is analogous to VG_(register_*_helper)
> which allows a skin to allocate space in the baseBlock, with a change
> to the scheduler to save and restore the values on context switch and
> some way to generate uInstr code to load and store them.

Do you need to store this information in baseBlock? You could do it with
global variables in your skin. There's a UCode-generating function
VG_(set_global_var) that might be useful for this.

N
From: Jeremy F. <je...@go...> - 2002-10-03 04:48:31

On Wed, 2002-10-02 at 21:42, Jeremy Fitzhardinge wrote:
> Hi,
>
> This patch makes FPU state changes lazy, so there should only be one
> save/restore pair per basic block.

Oh, for safety's sake, it should also probably have:

       default:
          if (VG_(needs).extended_UCode) {
+            if (fplive) {
+               emit_put_fpu_state();
+               fplive = False;
+            }
             SK_(emit_XUInstr)(u, regs_live_before);
          } else {
             VG_(printf)("\nError:\n"
                         "  unhandled opcode: %u.  Perhaps "
                         "VG_(needs).extended_UCode should be set?\n",
                         u->opcode);
             VG_(pp_UInstr)(0,u);
             VG_(core_panic)("emitUInstr: unimplemented opcode");
          }

J
From: Jeremy F. <je...@go...> - 2002-10-03 04:42:13

Hi,

This patch makes FPU state changes lazy, so there should only be one
save/restore pair per basic block. With this change in place,
FPU-intensive programs (in my case, some 3D code using OpenGL) are
significantly faster.

Rather than adding the fplive argument to emitUInstr(), I considered
adding another bit to regs_live_before/after which signifies FP state
liveness. That was a little more invasive, and it wasn't clear whether I
should maintain such a bit in emitUInstr or add the logic to the register
allocator.

J

Index: coregrind/vg_from_ucode.c
===================================================================
RCS file: /cvsroot/valgrind/valgrind/coregrind/vg_from_ucode.c,v
retrieving revision 1.15
diff -u -r1.15 vg_from_ucode.c
--- coregrind/vg_from_ucode.c	2 Oct 2002 13:26:34 -0000	1.15
+++ coregrind/vg_from_ucode.c	3 Oct 2002 04:38:21 -0000
@@ -1808,18 +1808,14 @@
                               UChar second_byte_masked,
                               Int reg )
 {
-   emit_get_fpu_state();
    emit_fpu_regmem ( first_byte, second_byte_masked, reg );
-   emit_put_fpu_state();
 }
 
 
 static void synth_fpu_no_mem ( UChar first_byte,
                                UChar second_byte )
 {
-   emit_get_fpu_state();
    emit_fpu_no_mem ( first_byte, second_byte );
-   emit_put_fpu_state();
 }
 
 
@@ -1961,7 +1957,7 @@
    return (u->flags_w != FlagsEmpty);
 }
 
-static void emitUInstr ( UCodeBlock* cb, Int i, RRegSet regs_live_before )
+static Bool emitUInstr ( UCodeBlock* cb, Int i, RRegSet regs_live_before, Bool fplive )
 {
    Int old_emitted_code_used;
    UInstr* u = &cb->instrs[i];
@@ -2299,6 +2295,10 @@
       case JMP: {
          vg_assert(u->tag2 == NoValue);
          vg_assert(u->tag1 == RealReg || u->tag1 == Literal);
+         if (fplive) {
+            emit_put_fpu_state();
+            fplive = False;
+         }
          if (u->cond == CondAlways) {
             switch (u->tag1) {
                case RealReg:
@@ -2353,6 +2353,10 @@
          vg_assert(u->size == 0);
          if (readFlagUse ( u ))
             emit_get_eflags();
+         if (fplive) {
+            emit_put_fpu_state();
+            fplive = False;
+         }
          VG_(synth_call) ( False, u->val1 );
          if (writeFlagUse ( u ))
             emit_put_eflags();
@@ -2375,6 +2379,10 @@
          else
             vg_assert(u->tag3 == NoValue);
          vg_assert(u->size == 0);
+         if (fplive) {
+            emit_put_fpu_state();
+            fplive = False;
+         }
          VG_(synth_ccall) ( u->lit32, u->argc, u->regparms_n, argv, tagv,
                             ret_reg, regs_live_before, u->regs_live_after );
          break;
@@ -2397,6 +2405,10 @@
       case FPU_W:
          vg_assert(u->tag1 == Lit16);
          vg_assert(u->tag2 == RealReg);
+         if (!fplive) {
+            emit_get_fpu_state();
+            fplive = True;
+         }
          synth_fpu_regmem ( (u->val1 >> 8) & 0xFF,
                             u->val1 & 0xFF,
                             u->val2 );
@@ -2407,6 +2419,10 @@
          vg_assert(u->tag2 == NoValue);
          if (readFlagUse ( u ))
             emit_get_eflags();
+         if (!fplive) {
+            emit_get_fpu_state();
+            fplive = True;
+         }
          synth_fpu_no_mem ( (u->val1 >> 8) & 0xFF,
                             u->val1 & 0xFF );
          if (writeFlagUse ( u ))
@@ -2430,6 +2446,8 @@
    vg_assert(u->opcode < 100);
    histogram[u->opcode].counts++;
    histogram[u->opcode].size += (emitted_code_used - old_emitted_code_used);
+
+   return fplive;
 }
 
 
@@ -2439,17 +2457,17 @@
 {
    Int i;
    UChar regs_live_before = 0;   /* No regs live at BB start */
-
+   Bool fplive = False;          /* FPU state not loaded */
+
    emitted_code_used = 0;
    emitted_code_size = 500;  /* reasonable initial size */
    emitted_code = VG_(arena_malloc)(VG_AR_JITTER, emitted_code_size);
 
    if (dis) VG_(printf)("Generated x86 code:\n");
-
+
    for (i = 0; i < cb->used; i++) {
       UInstr* u = &cb->instrs[i];
       if (cb->instrs[i].opcode != NOP) {
-
          /* Check on the sanity of this insn. */
          Bool sane = VG_(saneUInstr)( False, False, u );
          if (!sane) {
@@ -2457,10 +2475,12 @@
             VG_(up_UInstr)( i, u );
          }
          vg_assert(sane);
-         emitUInstr( cb, i, regs_live_before );
+         fplive = emitUInstr( cb, i, regs_live_before, fplive );
       }
       regs_live_before = u->regs_live_after;
    }
+   vg_assert(!fplive);   /* FPU state must be saved by end of BB */
+
    if (dis) VG_(printf)("\n");
 
    /* Returns a pointer to the emitted code.  This will have to be
From: Jeremy F. <je...@go...> - 2002-10-03 04:35:57

Surely it should be:

Index: include/vg_skin.h
===================================================================
RCS file: /cvsroot/valgrind/valgrind/include/vg_skin.h,v
retrieving revision 1.13
diff -u -r1.13 vg_skin.h
--- include/vg_skin.h	2 Oct 2002 13:26:34 -0000	1.13
+++ include/vg_skin.h	3 Oct 2002 04:34:20 -0000
@@ -601,7 +603,7 @@
 typedef UInt RRegSet;
 
 #define ALL_RREGS_DEAD 0                              /* 0000...00b */
-#define ALL_RREGS_LIVE (1 << (VG_MAX_REALREGS-1))     /* 0011...11b */
+#define ALL_RREGS_LIVE ((1 << VG_MAX_REALREGS)-1)     /* 0011...11b */
 #define UNIT_RREGSET(rank)            (1 << (rank))
 #define IS_RREG_LIVE(rank,rregs_live) (rregs_live & UNIT_RREGSET(rank))
 

J
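
The off-by-one is easy to see with concrete numbers. A throwaway program
(taking VG_MAX_REALREGS as 6 purely for illustration) prints 0x20 for the
old macro, i.e. only the highest-ranked register marked live, versus 0x3f
for the corrected one:

   #include <stdio.h>

   #define VG_MAX_REALREGS 6   /* illustrative value */

   #define OLD_ALL_RREGS_LIVE (1 << (VG_MAX_REALREGS-1))   /* buggy */
   #define NEW_ALL_RREGS_LIVE ((1 << VG_MAX_REALREGS)-1)   /* fixed */

   int main(void) {
      /* old: 0x20 = 0b100000, only rank 5 live;
         new: 0x3f = 0b111111, ranks 0..5 all live */
      printf("old ALL_RREGS_LIVE = 0x%02x\n", OLD_ALL_RREGS_LIVE);
      printf("new ALL_RREGS_LIVE = 0x%02x\n", NEW_ALL_RREGS_LIVE);
      return 0;
   }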
From: Jeremy F. <je...@di...> - 2002-10-02 20:59:49

I've noticed that on my laptop the rdtsc calibration often fails with
"impossible MHz". I think this is because the TSC only advances when
there's something actually happening, as part of the power management.

I've attached a patch (against HEAD) to make it spin rather than sleep for
20ms as part of the calibration. This makes the MHz estimate accurate and
solves the panics, but it does imply that the TSC is not a reliable
timebase for other time measurements, which is a larger problem to solve.

J
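
The spin-based calibration amounts to something like the following sketch
(helper names are invented; this illustrates the idea, not the attached
patch):

   #include <stdint.h>
   #include <sys/time.h>

   /* Read the x86 time-stamp counter. */
   static inline uint64_t rdtsc(void) {
      uint32_t lo, hi;
      __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
      return ((uint64_t)hi << 32) | lo;
   }

   static uint64_t usecs_now(void) {
      struct timeval tv;
      gettimeofday(&tv, 0);
      return (uint64_t)tv.tv_sec * 1000000 + tv.tv_usec;
   }

   /* Busy-wait for ~20ms so a power-managed TSC keeps ticking at full
      speed, then divide elapsed cycles by elapsed microseconds. */
   unsigned estimate_mhz(void) {
      uint64_t t0 = usecs_now();
      uint64_t c0 = rdtsc();
      while (usecs_now() - t0 < 20000)
         ;   /* spin instead of sleeping */
      return (unsigned)((rdtsc() - c0) / (usecs_now() - t0));
   }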
From: Jeremy F. <je...@go...> - 2002-10-02 20:53:04

On Wed, 2002-10-02 at 11:33, Josef Weidendorfer wrote:
> What do you think about making data structure cost centers, and
> relating them to the functions? Even much more information available
> ;-)

You mean storing which code touches what memory as part of the profile?
An excellent idea.

> More serious: With C++, you have constructors, and that's a nice way to
> name malloced areas. Together with some debug info, it should be easy
> to give out a list of all C++ classes, and read/write access numbers
> for each offset (or with annotated class definition from source). If
> the constructor is defined in a shared lib (as for all QT/KDE classes),
> you don't even need debug info for this: The object start address is
> always the first arg to the constructor, only question: how to detect
> the object size?

Well, it seems to me that the best name for an allocated block is some
portion of the stack backtrace leading to its allocation. If you want to
parse the mangled names, you can easily tell what the class is, and group
all class instances together. The object size should be easy - it is the
size of the allocated memory, surely?

> > I looked at the screenshots and decided it is very pretty, but I
> > haven't actually tried it out yet.
> >
> > I've actually done a first cut of a gprof skin now, which generates
> > correctly formed gprof gmon.out files. Unfortunately gprof itself is
> > too broken to deal with them (it wants a single histogram array for
> > the whole address space; I'm teaching it to work with a sparse
> > array).
>
> Cool. Sorry, I couldn't follow the discussion. Can gmon.out hold other
> events than sample counts? Do you log calls, too?

Yes, it can record a histogram (in any units/event types you like, but
the standard tools only generate time histograms), entry counts for each
basic block and BB-to-BB control flow counts. gprof can display output
either on a function-by-function basis or at the basic block level
(including annotating source).

> I'm not quite sure I understand the benefit of creating gmon.out files.
> Are there other frontends for this format than gprof? (There's a KProf,
> but that "only" shows the info from gprof). I want to add a gmon.out
> reader for KCachegrind some day for quick browsing and TreeMap
> generation for gprof-profiled apps.

I'm doing this work to instrument a piece of software which lots of
developers are working on, most of whom are familiar with gprof. I also
think most developers would welcome a friendly UI like kcachegrind, so
I'm very keen to try it out soon - I just want to get the basics working
first.

> I think the cachegrind.out format is quite nice: Although I added a
> lot, I still can read the original cachegrind.out files without
> problem. Nick: Can you add some versioning to this format to
> distinguish some format variants? (I added a line "version: xxx").

The gprof format could have been nice, but it's somewhat broken. They
extended it to be a tagged format so you can add extra sections - but
forgot to include a length with each tag, so you can't parse the file
unless you understand all the tag types. A lost opportunity there.

> > I'm also going to extend the core slightly; I'd like to add some way
> > of extracting more information about the segments described in the
> > SegInfo list. I'd like to be able to walk the list so I can include a
> > table of mapped shared objects and what address range they cover.
>
> A problem here could be the dynamic behaviour of mappings...

Yes, but for now the code I'm instrumenting loads a lot of shared
libraries, but doesn't really unload or reload on the fly.

J
From: Jeremy F. <je...@go...> - 2002-10-02 20:41:31

On Wed, 2002-10-02 at 12:25, Nicholas Nethercote wrote:
> Cachegrind stores variable-sized basic-block information. It is pretty
> low-level and dirty: it allocates a flat array in which cost centres of
> different sizes are all packed in together, with different cost centre
> types distinguished by a tag. The basic blocks' arrays are stored in a
> hash table.

Yes, I've got that. I have a hash which keeps per-basic-block
information. But what I also want is a hash which keeps a count of
control flow edges between basic blocks. That is, the key of the hash is
not orig_eip, but the tuple (from_bb, to_bb).

The way I maintain this is by inserting an assignment to a global
variable "prev_bb" (ie, code to do prev_bb = cur_eip) just before each
JMP instruction (conditional or otherwise). Then, at the start of each
basic block, I update the edge count structure by looking up (and
possibly creating) the tuple (prev_bb, cur_eip).

The trouble with this scheme is that if the dispatch loop decides that it
is time to switch threads, prev_bb will have been set by the previous
thread, and therefore the control flow graph will have spurious edges
which represent context switches. While this isn't completely
undesirable, it isn't what I want to measure at the moment.

To solve this, prev_bb needs to be a per-thread value rather than a
global one. It seems to me that a clean way of solving this is to
introduce a mechanism which is analogous to VG_(register_*_helper) which
allows a skin to allocate space in the baseBlock, with a change to the
scheduler to save and restore the values on context switch and some way
to generate uInstr code to load and store them.

J
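
A minimal sketch of that edge bookkeeping, with the hash layout invented
for illustration (the real skin's data structures may differ):

   #include <stdlib.h>

   typedef struct Edge {
      unsigned     from, to;   /* the (prev_bb, cur_eip) pair keying it */
      unsigned     count;
      struct Edge* next;
   } Edge;

   #define N_BUCKETS 4096
   static Edge*    buckets[N_BUCKETS];
   static unsigned prev_bb;   /* set to %EIP just before each JMP */

   /* Called at the entry of every translated basic block: bump the
      counter for the edge (prev_bb -> cur_eip), creating it if new. */
   void count_edge(unsigned cur_eip) {
      unsigned h = (prev_bb * 31 + cur_eip) % N_BUCKETS;
      Edge* e;
      for (e = buckets[h]; e != NULL; e = e->next)
         if (e->from == prev_bb && e->to == cur_eip) { e->count++; return; }
      e = malloc(sizeof(*e));
      e->from  = prev_bb;
      e->to    = cur_eip;
      e->count = 1;
      e->next  = buckets[h];
      buckets[h] = e;
   }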
From: Nicholas N. <nj...@ca...> - 2002-10-02 19:25:22

On 2 Oct 2002, Jeremy Fitzhardinge wrote:

> At present I'm using a single global, which means that I'll be creating
> spurious edges when there's context switches between threads. The
> obvious place to store the information is in the baseBlock, and have it
> copied to/from the thread state on context switch. I didn't see a
> mechanism for allocating variable space in the baseBlock, nor a way of
> conveniently addressing baseBlock offsets directly. Should I add it?
> Or some other way of storing per-thread information?

Cachegrind stores variable-sized basic-block information. It is pretty
low-level and dirty: it allocates a flat array in which cost centres of
different sizes are all packed in together, with different cost centre
types distinguished by a tag. The basic blocks' arrays are stored in a
hash table. Josef's patch uses the same basic mechanisms, but does more
complicated stuff with the hash tables.

So there's not really any built-in mechanism, but you can certainly
allocate yourself some space for each basic block in SK_(instrument). As
for addressing baseBlock offsets directly, I'm not sure what you mean --
the orig_addr is passed in to SK_(instrument); is that not enough? I'm
also not sure how threads ("per-thread information") relate to this.

N
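
The packed-array idea looks roughly like this; the tags, types and layout
are invented for illustration and are not Cachegrind's actual code:

   #include <stddef.h>

   typedef enum { CC_NONE = 0, CC_INSTR, CC_READ, CC_WRITE } CCTag;

   /* Cost centres of different sizes, each starting with its tag. */
   typedef struct { CCTag tag; unsigned long addr; unsigned long Ir; } InstrCC;
   typedef struct { CCTag tag; unsigned long addr; unsigned long Ir, Dr; } ReadCC;
   typedef struct { CCTag tag; unsigned long addr; unsigned long Ir, Dw; } WriteCC;

   static size_t cc_size(CCTag t) {
      switch (t) {
         case CC_INSTR: return sizeof(InstrCC);
         case CC_READ:  return sizeof(ReadCC);
         case CC_WRITE: return sizeof(WriteCC);
         default:       return 0;
      }
   }

   /* Walk one basic block's flat cost-centre array: read the tag,
      dispatch on it, and advance by that entry's size. */
   void walk_ccs(char* arr, size_t len, void (*visit)(CCTag, void*)) {
      size_t i = 0;
      while (i < len) {
         CCTag t = *(CCTag*)(arr + i);   /* entries start with their tag */
         size_t sz = cc_size(t);
         if (sz == 0) break;             /* unknown tag: stop walking */
         visit(t, arr + i);
         i += sz;
      }
   }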
From: Josef W. <Jos...@gm...> - 2002-10-02 18:32:36

Hi,

just want to say hello to the Valgrind Developers mailing list...

On Wednesday 02 October 2002 18:04, Jeremy Fitzhardinge wrote:
> On Wed, 2002-10-02 at 04:31, Nicholas Nethercote wrote:
> > As for gprof stuff, have you seen Josef Weidendorfer's Cachegrind
> > patch and KCachegrind visualisation tool?
> > (www.weidendorfers.de/kcachegrind/) It contains loads of that sort
> > of thing, more than my brain can handle in one sitting :)

What do you think about making data structure cost centers, and relating
them to the functions? Even much more information available ;-)

More serious: With C++, you have constructors, and that's a nice way to
name malloced areas. Together with some debug info, it should be easy to
give out a list of all C++ classes, and read/write access numbers for
each offset (or with annotated class definition from source). If the
constructor is defined in a shared lib (as for all QT/KDE classes), you
don't even need debug info for this: The object start address is always
the first arg to the constructor, only question: how to detect the object
size?

> I looked at the screenshots and decided it is very pretty, but I
> haven't actually tried it out yet.
>
> I've actually done a first cut of a gprof skin now, which generates
> correctly formed gprof gmon.out files. Unfortunately gprof itself is
> too broken to deal with them (it wants a single histogram array for the
> whole address space; I'm teaching it to work with a sparse array).

Cool. Sorry, I couldn't follow the discussion. Can gmon.out hold other
events than sample counts? Do you log calls, too?

I'm not quite sure I understand the benefit of creating gmon.out files.
Are there other frontends for this format than gprof? (There's a KProf,
but that "only" shows the info from gprof). I want to add a gmon.out
reader for KCachegrind some day for quick browsing and TreeMap generation
for gprof-profiled apps.

I think the cachegrind.out format is quite nice: Although I added a lot,
I still can read the original cachegrind.out files without problem. Nick:
Can you add some versioning to this format to distinguish some format
variants? (I added a line "version: xxx").

> I'm also going to extend the core slightly; I'd like to add some way of
> extracting more information about the segments described in the SegInfo
> list. I'd like to be able to walk the list so I can include a table of
> mapped shared objects and what address range they cover.

A problem here could be the dynamic behaviour of mappings...

> J

J :-)
From: Jeremy F. <je...@go...> - 2002-10-02 16:24:33

On Wed, 2002-10-02 at 04:23, Nicholas Nethercote wrote:
> Best way I can think of doing it, which only requires skin changes
> rather than core changes, is this: using the `extended_UCode' need,
> add a new UInstr PRE_JCC, which gets inserted by SK_(instrument) before
> conditional JMPs, evaluates the condition, and calls a C function (or
> whatever) if it's true. This would duplicate the condition evaluation
> but that shouldn't matter since they're trivial (just checking an
> EFLAGS bit I think). It's a bit nasty that something as simple as this
> requires a new UInstr...

Well, I've actually come up with a simpler approach. Since what I want is
to get the (from, to) pair for a BB graph edge, I'm simply updating a
global (bb_from) with %EIP before each jump, and then creating/updating
the edge (bb_from, %EIP) at the entry to each BB.

At present I'm using a single global, which means that I'll be creating
spurious edges when there's context switches between threads. The obvious
place to store the information is in the baseBlock, and have it copied
to/from the thread state on context switch. I didn't see a mechanism for
allocating variable space in the baseBlock, nor a way of conveniently
addressing baseBlock offsets directly. Should I add it? Or some other way
of storing per-thread information?

J
From: Jeremy F. <je...@go...> - 2002-10-02 16:05:05

On Wed, 2002-10-02 at 04:31, Nicholas Nethercote wrote:
> As for gprof stuff, have you seen Josef Weidendorfer's Cachegrind patch
> and KCachegrind visualisation tool? (www.weidendorfers.de/kcachegrind/)
> It contains loads of that sort of thing, more than my brain can handle
> in one sitting :)

I looked at the screenshots and decided it is very pretty, but I haven't
actually tried it out yet.

I've actually done a first cut of a gprof skin now, which generates
correctly formed gprof gmon.out files. Unfortunately gprof itself is too
broken to deal with them (it wants a single histogram array for the whole
address space; I'm teaching it to work with a sparse array).

I'm also going to extend the core slightly; I'd like to add some way of
extracting more information about the segments described in the SegInfo
list. I'd like to be able to walk the list so I can include a table of
mapped shared objects and what address range they cover.

J
From: Nicholas N. <nj...@ca...> - 2002-10-02 11:23:31

On 30 Sep 2002, Jeremy Fitzhardinge wrote:

> I'm writing a skin to generate gprof-like output, so I need to see all
> the edges in the control flow graph. In particular, I'd like to insert
> some instrumentation code which is run IFF a conditional branch is
> taken.
>
> I see a few options:
>
>  * something to properly represent uInstr sequences with conditionals
>    within the ucode for one real instruction (ie, some way of
>    representing jumps to real addresses rather than simulated
>    addresses). Sounds messy.
>  * Intercept the jump target address and generate a completely new
>    piece of code at some place within the simulated address space.
>    Ugly.
>  * Introduce a new exceptional value for ebp when it is passed back
>    into the dispatcher to trigger a call into the skin. Would need
>    some way to attach some kind of argument values for the call
>    (encode in %edx?). Seems like the least nasty.
>
> Any opinions?

Best way I can think of doing it, which only requires skin changes rather
than core changes, is this: using the `extended_UCode' need, add a new
UInstr PRE_JCC, which gets inserted by SK_(instrument) before conditional
JMPs, evaluates the condition, and calls a C function (or whatever) if
it's true. This would duplicate the condition evaluation, but that
shouldn't matter since they're trivial (just checking an EFLAGS bit, I
think). It's a bit nasty that something as simple as this requires a new
UInstr...

Oh, and apologies for the delay in replying.

N
From: Jeremy F. <je...@go...> - 2002-10-01 06:49:53

I'm writing a skin to generate gprof-like output, so I need to see all
the edges in the control flow graph. In particular, I'd like to insert
some instrumentation code which is run IFF a conditional branch is taken.

I see a few options:

 * something to properly represent uInstr sequences with conditionals
   within the ucode for one real instruction (ie, some way of
   representing jumps to real addresses rather than simulated addresses).
   Sounds messy.
 * Intercept the jump target address and generate a completely new piece
   of code at some place within the simulated address space. Ugly.
 * Introduce a new exceptional value for ebp when it is passed back into
   the dispatcher to trigger a call into the skin. Would need some way to
   attach some kind of argument values for the call (encode in %edx?).
   Seems like the least nasty.

Any opinions?

J
From: Jeremy F. <je...@go...> - 2002-09-30 21:03:09

Well, I decided to look into my suggestion for maintaining the call stack
by "simply" tracking pairs of call/ret instructions. I've decided it is
either completely non-viable, or too complex to bother with.

While a fine idea in theory, it relies on real code using call/return and
not playing too much with the return address on the stack. I had thought
the main violations of the call/return rule would be some or all of:

	/* A - PIC code - dealt with already in vg_to_ucode */
	call 1f
1:	popl %reg

	/* B - common idiom */
	push code_addr
	ret

	/* C - manual call */
	pop %reg
	push some_addr
	jmp *%reg

	/* D - manual return */
	pop %reg
	jmp *%reg

Unfortunately I completely underestimated the twisty-turnyness of the
glibc dynamic linker, which basically does arbitrary stack manipulations
to manually build stack frames, manually call and manually return, and
also does delights such as:

	xchg %eax, 8(%esp)
	ret $8

In other words, to keep track of this stuff, Valgrind would have to keep
track of all stack accesses (optimistically, accesses based off %esp;
pessimistically, all memory accesses which happen to be in the area), and
maintain the call stack that way. I don't think this is viable.

A possibly more viable, but complex, approach is to use a hybrid
technique: maintain a call stack, and also walk up the call frames using
esp/ebp. If they agree, then all is well; if not, use the call stack to
resync the call frame walker, rather than using its information directly.
I think that would take rather more effort than I was prepared to spend
to get slightly more precision in a backtrace, so I'm not going to
explore it for now.

J