From: Jeremy F. <je...@go...> - 2002-10-11 05:48:59
|
On Thu, 2002-10-10 at 12:25, Jeremy Fitzhardinge wrote:
> This means that if a particular program under study has callbacks for
> different skins, they will end up doing the wrong thing if run with the
> wrong skin. It seems to me that there needs to be a systematic way for
> skins to distinguish their callback numbers from each other. Maybe
> encode some kind of skin ID into the top 16 bits of the request number?
Well, since there seems to be a de-facto standard for each skin to have a
two-letter code, I implemented a scheme that uses it as a way for skins to
distinguish their client requests.
I also made it not an error for a callback to be unimplemented: since
Valgrind is getting more capable with more skins, it will be common to
have a program with client requests inserted for multiple skins. As part
of this, I changed the interface to SK_(handle_client_request) so that a
skin can indicate that it did not handle the request, allowing the proper
default return value to be returned.
Patch against CVS head.
J |
|
From: Jeremy F. <je...@go...> - 2002-10-10 19:27:23
|
At the moment, not many skins have client callbacks (only memcheck, I
think). However, if other skins follow memcheck's example of starting
callbacks at VG_USERREQ__FINAL_DUMMY_CLIENT_REQUEST + 1, they will all
end up with overlapping numbers.
This means that if a particular program under study has callbacks for
different skins, they will end up doing the wrong thing if run with the
wrong skin. It seems to me that there needs to be a systematic way for
skins to distinguish their callback numbers from each other. Maybe
encode some kind of skin ID into the top 16 bits of the request number?
J
|
|
From: Jeremy F. <je...@go...> - 2002-10-10 19:20:27
|
My skin needs to register baseblock helpers after parsing its command
line options (so that it knows what it needs to do). Unfortunately the
baseblock is currently being set up before calling SK_(post_clo_init).
This patch rearranges things.
J
|
|
From: Josef W. <Jos...@gm...> - 2002-10-10 10:31:32
|
On Thursday 10 October 2002 03:15, Jeremy Fitzhardinge wrote:
> On Mon, 2002-10-07 at 03:38, Josef Weidendorfer wrote:
> > So my question: Do you think this exact logging is useful?
>
> It seems to me that you're confusing two levels of abstraction: the
> general call-graph summary of program execution vs. an exact trace of
> program execution. If I understand what you're saying correctly, you
I'm still not convinced that I'm mixing these levels ;-) I simply have a
more exact call graph than gprof as input: in addition, I already get the
weights for the call arcs (directed edges). The gprof algorithm
calculates estimations for these by using accumulated self costs and
distributing these costs to the graph edges using call counts (by using a
virtual value "cost per call"). Why should I throw away the more exact
weights and recalculate an estimation?
Cycles need special handling. And this was the source of my problem, as I
didn't handle them specially. This is the postprocessing I have to do in
KCachegrind: cycles are in fact superfunctions. I have to handle the real
functions in them like BBs of normal functions, and I'm fine again. That
is, I throw away call arcs inside of cycles.
A nice side effect of logging call costs already in Cachegrind: I let the
user create trace parts: he can decide when to make a dump (I know of at
least one successful use of this feature because of a user report...).
That way I get call arcs with call count 0 (these are "active" calls).
The gprof algorithm would calculate a weight of 0 for such an arc, which
is obviously wrong...
> say that for each invocation of function A, you record the exact counter
> deltas for that invocation, and presumably the caller. You can
> therefore say that every time B calls A, the sum of the counter deltas
> is X. However, by only recording the immediate caller, you're already
X is the weight of the call arc from B to A in the general call graph.
> throwing information away, and I suspect this lack of information is
> causing the trouble.
If I visualize the children of a call A=>B, I use the cumulative cost of
this call for the area size and scale down the inner visualisation of
whole A, as I don't have the call costs from B when called from A. This
of course is a pure estimation...
If I would log call costs depending not only on the immediate caller but
also on the caller of the caller, I still don't think I'm mixing
abstraction levels: for each function, I simply have many call graph
nodes, depending on the caller. I don't have the self costs of these
nodes, but these can be calculated from the edge weights.
> I think you may be better off by either just adopting the gprof
> algorithm (certainly for the purposes of visualising what's going on),
> and/or implement a complete program trace for more detailed
> post-processing. You may find the paper "Encoding Program Executions" a
> useful source of ideas
> (http://citeseer.nj.nec.com/reiss01encoding.html).
Thanks. That could be a project for the future... You would need ways to
keep the traced data low, and that's another topic. And this has nothing
to do with KCachegrind, as I can't see a usage for a TreeMap.
> > Note: All the hassle with recursion detection in my calltree patch
> > would not be needed if I didn't log exact call costs. But on the
> > other side: I already have everything in place now in my patch.
>
> How much effort do you put into detecting recursion? I still think
> you'd be better off just recording everything and leave all analysis to
> post-processing.
For each active function (i.e. currently called somewhere on the stack),
I have a CallEntry struct, put into a hash table with the function start
address as key. This way I can look up very fast, on each call happening,
whether this call is a recursive one. In a CallEntry, I have a recursion
count for the function, and thus can remove the CallEntry if the
recursion count reaches zero on function return.
The problem with this concept are returns from functions: I can't be sure
these are done by executing a RET instruction. A program can do a longjmp
(e.g. C++ exception handling) and modify the stack pointer in arbitrary
ways, thus returning implicitly from active functions.
So the above algorithm is a little fragile, and I need to have my own
call stack (an array of elements, each consisting of a pointer to a
CallEntry struct and the real ESP at call time). At each CALL/RETURN
instruction I keep the real stack in sync with my stack by looking at the
stored/real ESP value and popping (returning from functions) as needed.
> > Regarding profiling performance: Even without a need for recursion
> > detection I think I will need hash table lookups...
>
> Is that the expensive part of doing the trace?
Don't know ;-) I'm adding around 20% to pure cachegrind.
> TreeMap doesn't need to be exact, it just needs to be an accurate guide
> for a programmer to tell where the time is going.
Yes. But it can make a difference if weights are distributed the wrong
way among call arcs, and the TreeMap shows call weights, not self costs.
> > I'm wondering if I should abstract from the term "function" in
> > KCachegrind. It makes a lot of sense to handle BBs the same way
> > (Jeremy: I wondered why you do profiling on a BB granularity!): Jumps
> > between BBs are in fact calls with implicit return on function
> > return: If there are loops in the function, we have lots of "cycles"
> > (i.e. recursive BB calls) for the BBs of a function. "Cumulative
> > cost" of a BB is the cost until the end of the function. We even can
> > extend this abstraction, not only down (from functions to BBs), but
> > also up to C++ classes and ELF objects: It makes sense to see
> > cumulative costs of all methods of a class, or calls among classes;
> > the same holds for ELF objects. What do you think about this?
>
> Well, I've been putting some effort into my skin to try to distinguish
> function calls from other kinds of jumps, simply because recording
What's the difficulty?
> everything at the BB level generates too much information to be
> reasonably used, and despite the claims in the documentation, gprof
> generates almost useless output as a result (if you record the call and
> return of A calling B as separate edges, gprof shows it as A calls B
> who calls A).
If you handle BBs as separate nodes in a call graph of BBs, you get a LOT
of cycles, and not really much in addition to the coarser function call
graph, I suppose?
> So, I'm now just recording calls, and also only instrumenting basic
> blocks in the text section (so that I don't get spurious edges calling
> a shared library, because it calls into the PLT, which then jumps into
> the function in the target library).
I already solved this by trapping jumps to "_dl_runtime_resolve" (this is
the function first called from a PLT stub), and attributing all the cost
of the PLT entry to the real function. After relocation, you have a jump
to the shared library function from the PLT entry: I have an alias hash
which holds the relations between these stubs and shared-library
functions...
> I think recording to the BB level makes a lot more sense with better
> tools to interpret the output, so it definitely makes sense for you to
> consider the idea.
A TreeMap visualisation of the BBs of a function could be cool... You
would see all the loops happening, and the loop bodies as nested BBs. But
you always need a correct mapping of BBs to source lines: so entitle the
areas for BBs with the source code lines?!
> But don't get too elaborate. The idea of a profiling tool is to
> generate a comprehensible abstraction of program execution, not a
> complete encoding of what happened.
I'm sure not looking at exact tracing of program execution ;-)
Josef |
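The ESP-synchronised shadow call stack Josef describes (frames paired
with the stack pointer at call time, popped whenever the real stack
pointer moves back above them, which also handles longjmp-style implicit
returns) can be sketched like this; all names are invented for
illustration:

```c
#include <assert.h>

/* Illustrative sketch of a shadow call stack: each CALL pushes the
   callee plus the real ESP at call time; on RET (or any control
   transfer) we pop every frame whose saved ESP is below the current
   ESP.  Because x86 stacks grow downward, a frame with saved ESP less
   than the current ESP has implicitly returned, whether by RET,
   longjmp, or a C++ exception unwind. */
typedef struct {
    unsigned long fn_addr;  /* function entry (stands in for a CallEntry*) */
    unsigned long esp;      /* real ESP at the time of the call */
} ShadowFrame;

static ShadowFrame shadow[1024];
static int depth = 0;

void on_call(unsigned long fn_addr, unsigned long esp) {
    shadow[depth].fn_addr = fn_addr;
    shadow[depth].esp = esp;
    depth++;
}

/* Returns how many frames were unwound by this stack-pointer change. */
int sync_with_real_stack(unsigned long current_esp) {
    int popped = 0;
    while (depth > 0 && shadow[depth - 1].esp < current_esp) {
        depth--;            /* explicit or implicit return from this frame */
        popped++;
    }
    return popped;
}
```

For example, after calls recorded at ESP 1000, 900 and 800, a longjmp
that restores ESP to 950 unwinds the two innermost frames in one sync.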
|
From: Julian S. <js...@ac...> - 2002-10-09 18:40:48
|
Something is seriously snafu'd with the sourceforge mailing lists; Nick
sent this at 11am Monday and it arrived here sometime Wednesday
afternoon.
> > and it all looks pretty straightforward. Once I get some profiling
> > machinery going, I'm definitely interested in working on helgrind.
>
> That would be really good. It's been a while since I looked at it. My
> first attempt at the lockset stuff was very simple and probably wasn't
> going to scale at all well, but then Julian improved the lockset stuff
> a bit (a lot?) so that might not be such a problem yet.
Yes, please hack on! Re my changes, all I did was debug the set
operations, so it's stable on large programs. I also restructured
hg_main.c a bit, but I didn't change the efficiency of it at all. The cvs
log of hg_main.c should tell you more.
J |
|
From: Jeremy F. <je...@go...> - 2002-10-08 01:09:56
|
On Mon, 2002-10-07 at 11:00, Julian Seward wrote:
> On Monday 07 October 2002 1:14 am, Jeremy Fitzhardinge wrote:
> > Can you go into a few more details about what's wrong with helgrind?
> > Does it get the Eraser algorithm wrong, or does it just need some
> > more refinement?
> >
> > I read the Eraser paper and the extension suggested in "Visual
> > Threads", and it all looks pretty straightforward. Once I get some
> > profiling machinery going, I'm definitely interested in working on
> > helgrind.
>
> It generates bazillions of errors. We don't know why.
> The only _known_ cause for this is that glibc has a counter for
> the number of relocations done by the dynamic linker, and this is
> incremented without locking. That's allegedly harmless because
> it's only for stats purposes, but it's still something that needs
> to be suppressed or worked around. The problem with suppression is
> that there is no fixed call stack at the error point; it happens
> at the first use of every dynamic symbol (or something), so the
> normal stack-based identification of the suppression mechanism
> won't work. We haven't thought of a way round this yet.
I had a play with this last night, and noticed the dynamic linker count
thing. Several things occurred to me:
* Obviously, helgrind needs to have suppression implemented. I
suspect it needs dynamic suppression (hints inserted in the
code) as well as static suppression rules.
* I noticed that the symtab stuff only collects function
symbols, but we're really going to want static and global data
symbols as well, so they can be used for both reports and
suppressions.
* Also, a fair amount of the memcheck error report stuff needs
to be reused as well, so dynamic memory can be identified in
terms of its allocation site.
* I think the algorithm needs a new state, which means "can be
touched by any thread for any reason". This would mainly be
for error suppression, so that you can annotate it that way,
and so that once you've reported that location you can silence
further errors. Perhaps you could use "exclusive" state with
a magic thread ID meaning "everyone".
* The Eraser paper didn't cover the case where a mutex identity
can be reused (say the mutex is part of an object in dynamic
memory which is reallocated). I'm guessing that on freeing
some memory, you have to go through all the mutex sets looking
for a mutex in the freed memory range, and poison that set
somehow.
* The Eraser paper also used a resolution of 4 bytes because
that's the smallest the alpha guarantees for atomic
operations. The x86 goes down to 1 byte; I wonder if we need
more resolution (ie, are there neighbouring memory locations
being updated by different threads, which are causing spurious
errors?).
J
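For reference, the Eraser lockset refinement being debated in this thread
can be sketched as a small per-location state machine. This is a toy
rendering with invented names (locksets as bitmasks), not helgrind's
actual code; the extra "touched by anyone" state Jeremy proposes would be
one more enum value:

```c
#include <assert.h>

/* Toy sketch of the Eraser algorithm: each shadowed location moves
   VIRGIN -> EXCLUSIVE -> SHARED / SHARED_MODIFIED, and the candidate
   lockset is intersected with the locks held on each access by a
   second thread.  A race is flagged when a written location's
   candidate lockset becomes empty. */
typedef enum { VIRGIN, EXCLUSIVE, SHARED, SHARED_MODIFIED } State;

typedef struct {
    State state;
    int owner;          /* owning thread while EXCLUSIVE */
    unsigned lockset;   /* candidate locks, one bit per mutex */
} Shadow;

/* Returns 1 if this access should be reported as a possible race. */
int eraser_access(Shadow *s, int tid, unsigned locks_held, int is_write) {
    switch (s->state) {
    case VIRGIN:
        s->state = EXCLUSIVE;
        s->owner = tid;
        return 0;
    case EXCLUSIVE:
        if (tid == s->owner)
            return 0;                       /* still single-threaded */
        s->lockset = locks_held;            /* first second-thread access */
        s->state = is_write ? SHARED_MODIFIED : SHARED;
        break;
    case SHARED:
        s->lockset &= locks_held;           /* refine candidate set */
        if (is_write)
            s->state = SHARED_MODIFIED;
        break;
    case SHARED_MODIFIED:
        s->lockset &= locks_held;
        break;
    }
    return s->state == SHARED_MODIFIED && s->lockset == 0;
}
```

The glibc relocation counter discussed above is exactly the case where
two threads write with no common lock, so the candidate set empties and
an error fires.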
|
|
From: Julian S. <js...@ac...> - 2002-10-07 17:53:47
|
On Monday 07 October 2002 1:14 am, Jeremy Fitzhardinge wrote:
> Can you go into a few more details about what's wrong with helgrind?
> Does it get the Eraser algorithm wrong, or does it just need some more
> refinement?
>
> I read the Eraser paper and the extension suggested in "Visual
> Threads", and it all looks pretty straightforward. Once I get some
> profiling machinery going, I'm definitely interested in working on
> helgrind.
It generates bazillions of errors. We don't know why. The only _known_
cause for this is that glibc has a counter for the number of relocations
done by the dynamic linker, and this is incremented without locking.
That's allegedly harmless because it's only for stats purposes, but it's
still something that needs to be suppressed or worked around. The problem
with suppression is that there is no fixed call stack at the error point;
it happens at the first use of every dynamic symbol (or something), so
the normal stack-based identification of the suppression mechanism won't
work. We haven't thought of a way round this yet. If you want to
investigate, that would be great. Having a usable race detector would be
really excellent.
J |
|
From: Nicholas N. <nj...@ca...> - 2002-10-07 10:10:37
|
On 6 Oct 2002, Jeremy Fitzhardinge wrote:
> Can you go into a few more details about what's wrong with helgrind?
> Does it get the Eraser algorithm wrong, or does it just need some more
> refinement?
>
> I read the Eraser paper and the extension suggested in "Visual
> Threads", and it all looks pretty straightforward. Once I get some
> profiling machinery going, I'm definitely interested in working on
> helgrind.
That would be really good. It's been a while since I looked at it. My
first attempt at the lockset stuff was very simple and probably wasn't
going to scale at all well, but then Julian improved the lockset stuff a
bit (a lot?), so that might not be such a problem yet. I think the
problem is more that I never tried it on any program larger than about 20
lines. It could find easy data races in these short programs. But it also
reported quite a few races in glibc. One of these we found to be genuine
(an unsafe update of a totally trivial statistics counter only printed
for debugging), but for the others it was unclear whether they were
genuine or if Helgrind was wrong. I certainly tried to implement the
Eraser algorithm faithfully. So to summarise: it probably needs to be
robustified to handle large programs, and it hasn't been verified yet as
to whether it's finding real races.
N |
|
From: Jeremy F. <je...@go...> - 2002-10-07 00:14:47
|
Can you go into a few more details about what's wrong with helgrind? Does
it get the Eraser algorithm wrong, or does it just need some more
refinement?
I read the Eraser paper and the extension suggested in "Visual Threads",
and it all looks pretty straightforward. Once I get some profiling
machinery going, I'm definitely interested in working on helgrind.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-10-06 12:06:38
|
On Sun, 6 Oct 2002, Julian Seward wrote:
> Hi. I just put 1.1.0 at http://developer.kde.org/~sewardj along with
> a long description thereof. You might want to look at it.
One minor thing -- Cachegrind is called Cachegrind everywhere now, not
Cachesim.
> > Me:
> > - Allow memcheck,addrcheck:Cond style multi-skin suppressions?
>
> Interestingly, this was the only thing that really bit. Some of the
> memcheck suppressions also apply to addrcheck, and not having them
> makes the addrcheck thing spew out a lot of garbage. Any chance you
> can push this along?
Sure, how should I proceed -- ok to hack on HEAD?
N |
|
From: Julian S. <js...@ac...> - 2002-10-06 11:13:42
|
On Friday 04 October 2002 5:50 pm, Jeremy Fitzhardinge wrote:
> On Fri, 2002-10-04 at 09:22, Julian Seward wrote:
> > The important change is to do with thread-specific data. The new
> > scheme on x86 at least is that the linux kernel creates a GDT entry
> > which points at a thread's local storage; the thread can then access
> > that merely by using a suitable segment override prefix on any old
> > memory-referencing insn. The kernel swizzles the GDT entry when
> > scheduling threads, and there is a new syscall by which a thread can
> > tell the kernel where its TLD is [I assume because the GDT can't be
> > swizzled from user-space] and because the kernel needs to know this
> > anyway.
>
> But if all this is just an implementation detail of libpthreads and
> you're replacing all of libpthreads, then surely you wouldn't have to
> deal with it anyway?
Not exactly true, because gcc >= 3.2 is roped into this game as well, and
generates TLD-accessing code directly for variables declared with the
__thread qualifier.
> > - possibly your patches of 25 Sept if they don't cause build problems
> > on any test platform I have
>
> Oh, you mean the poll/select name issue and the msgsnd/msgrcv
> implementation?
Yes.
J |
|
From: Julian S. <js...@ac...> - 2002-10-06 03:57:04
|
Hi. I just put 1.1.0 at http://developer.kde.org/~sewardj along with a
long description thereof. You might want to look at it. Fixed various
small probs stopping it from working right across multiple distros
(VMware is a tremendous help in testing).
> Julian:
> - --run-libc-freeres option
done
> - docs:
>   - read skin docs
>   - split up core/memcheck docs
>   - for addrcheck
> - factor out addrcheck/memcheck commonality?
not done
> Me or Julian:
> - docs:
>   - index describing each skin, basic overview.
>   - document skin_name:supp_name stuff
not done
> Me:
> - Allow memcheck,addrcheck:Cond style multi-skin suppressions?
Interestingly, this was the only thing that really bit. Some of the
memcheck suppressions also apply to addrcheck, and not having them makes
the addrcheck thing spew out a lot of garbage. Any chance you can push
this along?
------
Vague plan (vague because now I don't know when I will have time to get
back to this):
- fix up all the above
- implement logging to a network socket
- make it work on RH 8
and after that I'd like to freeze and stabilise for 1.2.0. Dunno exactly
when we should start the 1_2_0 branch.
J |
|
From: Jeremy F. <je...@go...> - 2002-10-04 23:02:59
|
On Fri, 2002-10-04 at 13:55, Julian Seward wrote:
> Hmm, not bad. What's the numbers for --skin=memcheck and
> --skin=addrcheck? I know the improvement factor will be a lot less,
> but I'd still like to know what it is.
            baseline   lazy fp
addrcheck     77.33     41.10
memcheck      82.98     44.76
J |
|
From: Julian S. <js...@ac...> - 2002-10-04 20:48:32
|
On Friday 04 October 2002 9:43 pm, Jeremy Fitzhardinge wrote:
> On Fri, 2002-10-04 at 03:51, Julian Seward wrote:
> > How much faster is "significantly faster"?
>
> OK, I've quantified this now.
>
> Using the attached test program (a matrix multiply extracted from
> Mesa), I'm getting the following timings (fptest compiled with -O;
> 600MHz PIII laptop; --skin=none):
>
> native execution:   0.38s
> baseline valgrind: 65.35s
> lazy-fp valgrind:   4.05s
>
> In other words, Valgrind is currently 172 times slower than running
> native for FP-intensive code; the lazy save-restore improves this by a
> factor of 16 or so to make the valgrind overhead only about 11 times
> slower than native.
Hmm, not bad. What's the numbers for --skin=memcheck and
--skin=addrcheck? I know the improvement factor will be a lot less, but
I'd still like to know what it is.
/me is suitably impressed, just in case you were getting any other
impression, btw.
J |
|
From: Jeremy F. <je...@go...> - 2002-10-04 20:42:41
|
On Fri, 2002-10-04 at 03:51, Julian Seward wrote:
> How much faster is "significantly faster"?
OK, I've quantified this now.
Using the attached test program (a matrix multiply extracted from Mesa),
I'm getting the following timings (fptest compiled with -O; 600MHz PIII
laptop; --skin=none):
native execution:   0.38s
baseline valgrind: 65.35s
lazy-fp valgrind:   4.05s
In other words, Valgrind is currently 172 times slower than running
native for FP-intensive code; the lazy save-restore improves this by a
factor of 16 or so, making the valgrind overhead only about 11 times
slower than native. I'd say that's significant.
I'm attaching my current diff against HEAD; the previous one left out
saving before JIFZ.
J |
|
From: Jeremy F. <je...@go...> - 2002-10-04 20:29:25
|
On Fri, 2002-10-04 at 11:24, Josef Weidendorfer wrote:
> > index % time self children called name
> > [1] 100.0 0.03 0.23 1+2 <cycle 1 as a whole> [1]
> > 0.02 0.23 2 A <cycle 1> [3]
>
> Why isn't "B <cycle1>" shown as child of <cycle1> ?
Because [1] is the entry for "cycle 1 as a whole"; A is considered to be
the head of the cycle (which is then described in [3]).
[2] is the linear part of the call graph to A through main.
> > [Handwave mode:] For the purposes of displaying a profile as a TreeMap,
> > it seems to me that you need to break the graph up into strongly
> > connected subgraphs, and display each of those as a node in your tree.
>
> The subgraphs being the cycles with its children?
Yes. A "strongly connected subgraph" is graph-theory talk for a subgraph
in which every node can reach every other node, i.e. one containing
internal cycles. For example:
     A                A'
     |       -->      |
     v                v
   B <=> C           BC'
B <=> C is a simple cycle, and so is a strongly connected subgraph. You
can rewrite the overall graph so that strongly connected subgraphs are
collapsed into a single node.
> I think I'm doing exactly this: Stop drawing at recursions...
> And the problem is: I don't know what the cost of the recursive function
> should be. I don't have the gprof results, but only my own logging.
> I think I will have to dive into gprof algorithms a little bit...
> As I understand at the moment, gprof never talks about recursive calls, but
> makes "cycles" with subobjects.
http://citeseer.nj.nec.com/graham82gprof.html should explain the
algorithm.
J
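The "break the graph into strongly connected subgraphs and collapse each
into one node" step can be illustrated with a minimal Tarjan SCC pass
over the A/B/C example from this thread (A calls B and C; C calls A, so
{A, C} form a cycle while B stands alone). This is an illustrative
sketch, not gprof's or KCachegrind's code:

```c
#include <assert.h>

/* Minimal Tarjan SCC over a tiny adjacency matrix.  comp[v] ends up
   holding the component id of node v; nodes sharing an id belong to
   one cycle and would be collapsed into a single "cycle as a whole"
   node, as in the gprof output quoted above. */
#define N 3
enum { A, B, C };
static const int adj[N][N] = {
    /* A */ {0, 1, 1},   /* A calls B and C */
    /* B */ {0, 0, 0},
    /* C */ {1, 0, 0},   /* C calls back into A */
};

static int idx[N], low[N], on_stack[N], comp[N];
static int stk[N], sp, counter, ncomp;

static void strongconnect(int v) {
    idx[v] = low[v] = counter++;
    stk[sp++] = v;
    on_stack[v] = 1;
    for (int w = 0; w < N; w++) {
        if (!adj[v][w]) continue;
        if (idx[w] < 0) {                   /* tree edge: recurse */
            strongconnect(w);
            if (low[w] < low[v]) low[v] = low[w];
        } else if (on_stack[w] && idx[w] < low[v]) {
            low[v] = idx[w];                /* back edge into the stack */
        }
    }
    if (low[v] == idx[v]) {                 /* v roots a component */
        int w;
        do {
            w = stk[--sp];
            on_stack[w] = 0;
            comp[w] = ncomp;
        } while (w != v);
        ncomp++;
    }
}

/* Returns the number of strongly connected components. */
int scc(void) {
    sp = counter = ncomp = 0;
    for (int v = 0; v < N; v++) idx[v] = -1;
    for (int v = 0; v < N; v++)
        if (idx[v] < 0) strongconnect(v);
    return ncomp;
}
```

Here the pass finds two components: {A, C} (the cycle) and {B}, matching
the A/BC' collapsed graph drawn above.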
|
|
From: Josef W. <Jos...@gm...> - 2002-10-04 18:41:17
|
On Friday 04 October 2002 18:17, Jeremy Fitzhardinge wrote:
> On Fri, 2002-10-04 at 03:10, Josef Weidendorfer wrote:
> > Suppose a call chain starting from A: A calls B and C; C calls A
> > again.
>
> I don't think you should calculate anything at runtime; just record
> enough so you can do it all afterwards. There shouldn't be a huge
> problem in working all this out: gprof does it, after all.
Perhaps I'm already logging too much with the "cumulative" costs of
calls. But I don't understand how to calculate this with self costs and
call tree information alone, especially for functions taking part in
multiple cycles...
> Does this do what your example describes?
> ...
Yes :-)
> gprof generates this flat profile:
> ...
Seems quite correct :-)
> ...
> index % time self children called name
> [1] 100.0 0.03 0.23 1+2 <cycle 1 as a whole> [1]
>           0.02 0.23   2 A <cycle 1> [3]
Why isn't "B <cycle 1>" shown as a child of <cycle 1>?
> ...
> [Handwave mode:] For the purposes of displaying a profile as a TreeMap,
> it seems to me that you need to break the graph up into strongly
> connected subgraphs, and display each of those as a node in your tree.
The subgraphs being the cycles with their children?
> When you want to descend into that node (ie display its sub-parts), you
> remove all the backwards edges (recursive calls) and treat it as a
> plain tree. When you remove the backwards edge, you can recompute the
> cost of the subgraph without the recursion; the difference between the
> original cost and the new cost is how much you should attribute to the
> function which is making the recursive call.
I think I'm doing exactly this: stop drawing at recursions...
And the problem is: I don't know what the cost of the recursive function
should be. I don't have the gprof results, but only my own logging.
I think I will have to dive into the gprof algorithms a little bit...
As I understand it at the moment, gprof never talks about recursive
calls, but makes "cycles" with subobjects.
Thanks anyway! Josef |
|
From: Josef W. <Jos...@gm...> - 2002-10-04 18:41:16
|
Forgot to send to the list...
On Friday 04 October 2002 12:48, Nicholas Nethercote wrote:
> On Fri, 4 Oct 2002, Josef Weidendorfer wrote:
> > The only problem I saw was that I need a valgrind version of the LIBC
> > "unlink", which I already mailed to Nick...
>
> I just added VG_(unlink) to head; it's untested, hopefully I got it
> right.
Thanks!
> > Regarding the valgrind skin architecture: Shouldn't it be possible to
> > "stack" skins? At the moment, for my skin I have to include all the
> > cachegrind code again. And if the cachegrind skin decides to simulate=
a
> > 3rd level cache, I have to copy it.
>
> Hmm, the LD_PRELOADing of two shared objects (skin.so + core.so) is
> already a bit fragile, having multiple .so's feels like a bad idea to
> me... Anyway, aren't Cachegrind and your patched version dissimilar
> enough that it wouldn't be easy to "stack" them in a sensible way? A
> better way might be to factor out the common code which gets included i=
n
> both skin .so's, if you see what I mean. This should be done with
> addrcheck and memcheck at some stage because they share a lot of identi=
cal
> code.
Yes. It seems to be the simplest way.
Two skins stacked on each other seems strange: The 1st does instrumentati=
on,=20
and the UCode outcome is instrumentated by the 2nd. And so on...
Regarding the LD_PRELOADing: can't this made be explicit by runtime-loadi=
ng of=20
the skins from valgrind.so (i.e. a plugin architecture)?
But a general "cost center API" would be nice to have for skins counting=20
events, linked as library to any such skin.
We need cost types (can be an array of subcost types), and cost center ta=
rget=20
types (e.g. instruction, basic block, call, function, ELF object, memory=20
access etc..) using these cost types. For this, we need register_XXX=20
functions, supplying call backs for zeroing/adding/writing ASCII version/=
=2E..
For each cost center a skin creates, it registers it and links it either =
to=20
some other cost center target ("parent") or to an existing object of the=20
valgrind core (e.g. basic block, memory area, ...).
Dumping out the profile could be fully done in a generic way: We need a "=
cost=20
center position" structure and go through all available positions of=20
registered cost centers (using e.g. the "parent" relation).
I'm sure the file format of dumps would change to be a lot more generic.
Support for per-thread cost centers could be generic, too.
I still need some time to think about it.
> Thinking longer term, your version of Cachegrind could entirely replace
> the original Cachegrind one day, since AFAICT your Cachegrind's
> functionality is a strict superset of my Cachegrind's.
Yes :-)
But it's still your baby, I only extended it. I can make small patches fo=
r=20
independent features separately (e.g. jump cost centers, recursion detect=
ion,=20
shared lib support [alias hash/_dl_resolve_runtime], "compressed" profile=
=20
format, threading support), and you decide if some modification is needed=
or=20
we can put it in as it is...
> > Perhaps you have some suggestions for my problem with recursive calls:
> >
> > Suppose a call chain starting from A: A calls B and C; C calls A again.
> > [...]
> > Suggestions?
>
> My brain is melting. Do you know how gprof handles it?
If I understand correctly:
gprof makes an additional virtual function, a "cycle", out of the
"A=>C=>A" chain, with subcycle objects A and C.
I would draw this cycle as an area, splitting it up totally among A and C as
subareas. Unfortunately this doesn't show the function where the cycle was
entered from the outside. So I can rename the cycle back to the function
where the cycle was entered, and I'm back to my original proposal :-)
Still the question is how gprof calculates its results.
(see the reply to Jeremy's mail, too)
|
|
From: Jeremy F. <je...@go...> - 2002-10-04 16:50:11
|
On Fri, 2002-10-04 at 09:22, Julian Seward wrote:
The important change is to do with thread-specific data. New scheme on
x86s at least is that the linux kernel creates a GDT entry which points
at a thread's local storage; the thread can then access that merely
by using a suitable segment override prefix on any old memory referencing
insn. Kernel swizzles the GDT entry when scheduling threads, and
there is a new syscall by which a thread can tell the kernel
where its TLD is [I assume because the GDT can't be swizzled from user-space]
and because the kernel needs to know this anyway.
But if all this is just implementation detail of libpthreads and you're
replacing all of libpthreads, then surely you wouldn't have to deal with
it anyway?
- possibly your patches of 25 Sept if they don't cause build problems
on any test platform I have
Oh, you mean the poll/select name issue and the msgsnd/msgrcv
implementation?
J
|
|
From: Jeremy F. <je...@go...> - 2002-10-04 16:47:14
|
On Fri, 2002-10-04 at 09:17, Jeremy Fitzhardinge wrote:
On Fri, 2002-10-04 at 03:10, Josef Weidendorfer wrote:
Regarding the valgrind skin architecture: Shouldn't it be possible to "stack"
skins?
I've been thinking about that too. If the skins have non-overlapping
functionality and instrument the code in non-overlapping ways, then
maybe. It seems to me that you'd want to avoid having multiple skins
instrumenting each other's instrumentation. I guess you could add a
couple of new core UInstrs which limit CALLM_[SE], except for delimiting
s/limit/are like/
J
|
|
From: Jeremy F. <je...@go...> - 2002-10-04 16:17:41
|
On Fri, 2002-10-04 at 03:10, Josef Weidendorfer wrote:
Regarding the valgrind skin architecture: Shouldn't it be possible to "stack"
skins?
I've been thinking about that too. If the skins have non-overlapping
functionality and instrument the code in non-overlapping ways, then
maybe. It seems to me that you'd want to avoid having multiple skins
instrumenting each other's instrumentation. I guess you could add a
couple of new core UInstrs which limit CALLM_[SE], except for delimiting
a skin's instrumentation so other skins can avoid each other, but
there's then the problem of making sure the generated code is actually
correct...
But I still like the idea because my gprof skin is very non-intrusive
(it inserts just one call per basic block and a memory store per jump),
and could easily be composed with memcheck.
Suppose a call chain starting from A: A calls B and C; C calls A again. The
recursively called A only calls B. Say the cost of B is always 100, and the
self cost of each of A and C is 10. So the cumulative cost of C will be 120
(C=>A=>B), and that of the first call to A will be 230. I log (cumulative)
costs for the call A=>B only once, so this call gets cost 200.
The problem: the callee list of A shows a call to B with cost 200 and a call
to C with cost 120, but A itself only has cumulative cost 230 !?! This is
confusing to the user, and really causes problems when drawing the TreeMap...
The real problem is that KCachegrind can't see that the cost of A=>B is
included in the call cost A=>C and thus shown twice in the callee list of A.
And in the TreeMap, I simply stop drawing recursive calls: this leaves empty
space where there should be none, and it looks like a performance problem to
the user where there is none !!
The real solution (without the ad-hoc one) would be:
The callee list of A shows a call to B with cost 200. This is correct: B is
called twice from A, leading to cost 200 for calls to B. But the call A=>C
should be 20 only, skipping costs from any recursive A inside (perhaps
stating that cost 100 is already included in other calls). And this would
make the Treemap drawing fine again.
So the only question I have:
HOW to calculate this value (20) in the general case ?!?
I suppose I can't calculate it at post-processing time, but have to log it in
Cachegrind somehow (that is, the skipped cost of 100 in the example above).
I don't think you should calculate anything at runtime; just record
enough so you can do it all afterwards. There shouldn't be a huge
problem in working all this out: gprof does it, after all.
Does this do what your example describes?
#define COST(N) { int i; for(i = 0; i < N*1000000; i++); }

void A(int);
void B(void);
void C(void);

int main()
{
        A(0);
        return 0;
}

void A(int a)
{
        COST(10);
        B();
        if (!a)
                C();
}

void B(void)
{
        COST(100);
}

void C(void)
{
        COST(10);
        A(1);
}
gprof generates this flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
88.46      0.23     0.23         2 115000.00 115000.00  B
 7.69      0.25     0.02         2  10000.00 125000.00  A
 3.85      0.26     0.01         1  10000.00  10000.00  C
and this call graph profile:
index % time self children called name
[1] 100.0 0.03 0.23 1+2 <cycle 1 as a whole> [1]
0.02 0.23 2 A <cycle 1> [3]
-----------------------------------------------
<spontaneous>
[2] 100.0 0.00 0.26 main [2]
0.03 0.23 1/1 A <cycle 1> [3]
-----------------------------------------------
1 C <cycle 1> [5]
0.03 0.23 1/1 main [2]
[3] 96.2 0.02 0.23 2 A <cycle 1> [3]
0.23 0.00 2/2 B [4]
1 C <cycle 1> [5]
-----------------------------------------------
0.23 0.00 2/2 A <cycle 1> [3]
[4] 88.5 0.23 0.00 2 B [4]
-----------------------------------------------
1 A <cycle 1> [3]
[5] 3.8 0.01 0.00 1 C <cycle 1> [5]
1 A <cycle 1> [3]
-----------------------------------------------
So basically this splits the call graph into 3 pieces (indices 3, 4, 5):
3 is A seen as part of the cycle through A->C
4 is A calling B
5 is C seen as part of the cycle through C->A
[Handwave mode:] For the purposes of displaying a profile as a TreeMap,
it seems to me that you need to break the graph up into strongly
connected subgraphs, and display each of those as a node in your tree.
When you want to descend into that node (ie display its sub-parts), you
remove all the backwards edges (recursive calls) and treat it as a plain
tree. When you remove the backwards edge, you can recompute the cost of
the subgraph without the recursion; the difference between the original
cost and the new cost is how much you should attribute to the function
which is making the recursive call.
Does this help at all?
J
|
|
From: Julian S. <js...@ac...> - 2002-10-04 16:15:47
|
On Friday 04 October 2002 4:52 pm, Jeremy Fitzhardinge wrote:
> On Fri, 2002-10-04 at 03:21, Julian Seward wrote:
> > Good news. After peering at weird segfaults on Red Hat Null (8.0 beta)
> > last night, I can see that it might be possible to assemble enough
> > hacks so that the stable branch will work on R H 8. Assuming that they
> > haven't changed the threading model used in the transition between the
> > "null" final beta and 8.0 itself, which doesn't sound likely. I thought
> > that 8.0 would use glibc-2.3, but apparently it is only at 2.2.93, so we
> > don't have to deal yet with big threading changes.
>
> How has threading changed in RH8 and/or glibc 2.3? Have they dropped
> LinuxThreads?

The important change is to do with thread-specific data. The new scheme, on
x86 at least, is that the linux kernel creates a GDT entry which points at a
thread's local storage; the thread can then access that merely by using a
suitable segment override prefix on any old memory-referencing insn. The
kernel swizzles the GDT entry when scheduling threads, and there is a new
syscall by which a thread can tell the kernel where its TLD is [I assume
because the GDT can't be swizzled from user-space], and because the kernel
needs to know this anyway. This means accesses to TLD can be done in a
single insn, which is dozens of times faster and more convenient than doing
pthread_key_get or whatever it's really called.

However, this stuff isn't in R H Null, and I would be surprised if something
that fundamental was changed in the transition from Null (the final RH8
beta) to RH8 itself. I'm downloading RH8 now, but it takes ~4 days on a
64kbit/sec cable modem.

LinuxThreads will be replaced by "Native Posix Threads" (I think) in due
course, although that will be binary-compatible with the current
libpthread.so.

> I had expected only to be able to support R H 8 on the head, using the
> LDT/GDT support, but it seems that might not be necessary.
> > Vague plan therefore is to assemble this and various other bugfixes
>
> What fixes do you intend putting in the 1.0.X branch?

For 1.0.4:
- Support for RH8 if practical; mostly this means some (old-style) TLD
  improvements to do with thread-specific locales
- add in the gcc-3.2 support patch available on the web page
- someone sent a helpful patch to use dynamic symbols in elf
  executables/so's; this improves backtraces sometimes
- the usual round of minor unimplemented opcodes, syscalls and ioctls
- possibly your patches of 25 Sept if they don't cause build problems
  on any test platform I have

No new functionality; that stuff is for the head. Far more important that
the stable branch is stable.

J
|
|
From: Jeremy F. <je...@go...> - 2002-10-04 15:52:24
|
On Fri, 2002-10-04 at 03:21, Julian Seward wrote:
Good news. After peering at weird segfaults on Red Hat Null (8.0 beta)
last night, I can see that it might be possible to assemble enough hacks
so that the stable branch will work on R H 8. Assuming that they haven't
changed the threading model used in the transition between the "null"
final beta and 8.0 itself, which doesn't sound likely. I thought that
8.0 would use glibc-2.3, but apparently it is only at 2.2.93, so we
don't have to deal yet with big threading changes.
How has threading changed in RH8 and/or glibc 2.3? Have they dropped
LinuxThreads?
I had expected only to be able to support R H 8 on the head, using the
LDT/GDT support, but it seems that might not be necessary.
Vague plan therefore is to assemble this and various other bugfixes
What fixes do you intend putting in the 1.0.X branch?
J
|
|
From: Jeremy F. <je...@go...> - 2002-10-04 15:44:12
|
On Fri, 2002-10-04 at 03:51, Julian Seward wrote:
How much faster is "significantly faster" ?
I haven't measured it in detail, but the frame time dropped from about
1100 ms/frame to 800-900 ms/frame. I'll do some more scientific
measurements soon.
So, my main point. I think this patch is unsafe and will lead to hard
to find problems down the line. The difficulty is that it allows the
simulated FPU state to hang around in the real FPU for long periods,
up to a whole basic block's worth of execution (if I understand it
right).
We only need a skin to call out to a helper function which modifies
the real FPU state on some obscure path, and we're hosed. Since we don't
have any control over what skins people might plug in, this seems like
an unsafe modification to the core.
The modification I had in mind for a while was a lot more conservative,
and more along the lines of a peephole optimisation. Essentially
if we see an FPU-no-mem op followed by another FPU-no-mem op, we can
skip the save at the end of the first and the restore at the start of
the second.
What I'm doing is not conceptually different from caching an ArchReg in
a RealReg for the lifetime of a basic block. The general idea is that
the FP state is pulled in just before the first FPU/FPU_[RW]
instruction, and saved again just before:
- JMP
- CCALL
- any skin UInstr
I can't see how a skin can introduce any instrumentation which would be
able to catch the FP state unsaved (is there any way for a skin to do
instrumentation or call a C function without using either CCALL or its
own UInstr?).
Your idea is basically the same, except we add a fourth saving
condition:
- any non FPU instruction
This would only be necessary if you imagine a non-FPU instruction which
can inspect the architectural state of the FPU (in other words, is a
memory access offset into the baseBlock: something which skins can't
generate directly).
In summary, I think this is actually pretty conservative, simple and
safe.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-10-04 10:48:27
|
On Fri, 4 Oct 2002, Josef Weidendorfer wrote:

> The only problem I saw was that I need a valgrind version of the LIBC
> "unlink", which I already mailed to Nick...

I just added VG_(unlink) to head; it's untested, hopefully I got it right.

> Regarding the valgrind skin architecture: Shouldn't it be possible to
> "stack" skins? At the moment, for my skin I have to include all the
> cachegrind code again. And if the cachegrind skin decides to simulate a
> 3rd level cache, I have to copy it.

Hmm, the LD_PRELOADing of two shared objects (skin.so + core.so) is already
a bit fragile; having multiple .so's feels like a bad idea to me... Anyway,
aren't Cachegrind and your patched version dissimilar enough that it
wouldn't be easy to "stack" them in a sensible way? A better way might be
to factor out the common code which gets included in both skin .so's, if
you see what I mean. This should be done with addrcheck and memcheck at
some stage because they share a lot of identical code.

Thinking longer term, your version of Cachegrind could entirely replace the
original Cachegrind one day, since AFAICT your Cachegrind's functionality
is a strict superset of my Cachegrind's.

> In fact: Call tree logging should be totally orthogonal to event
> logging. Shouldn't we have some general support for expandable cost
> centers in the core?

Maybe; I thought about this but didn't get any further. It would help if
you could give some specific suggestions as to the form it might take.

> Perhaps you have some suggestions for my problem with recursive calls:
>
> Suppose a call chain starting from A: A calls B and C; C calls A again.
> [...]
> Suggestions?

My brain is melting. Do you know how gprof handles it?

N
|