From: Josef W. <Jos...@gm...> - 2002-10-10 10:31:32
On Thursday 10 October 2002 03:15, Jeremy Fitzhardinge wrote:
> On Mon, 2002-10-07 at 03:38, Josef Weidendorfer wrote:
> > So my question: Do you think this exact logging is useful?
>
> It seems to me that you're confusing two levels of abstraction: the
> general call-graph summary of program execution vs. an exact trace of
> program execution. If I understand what you're saying correctly, you

I'm still not convinced that I'm mixing these levels ;-) I simply have a
more exact call graph than gprof as input: in addition, I already get the
weights for the call arcs (directed edges).

The gprof algorithm calculates estimations for these by taking accumulated
self costs and distributing them to the graph edges according to call
counts (via a virtual value "cost per call"). Why should I throw away the
more exact weights and recalculate an estimation?

Cycles need special handling, and this was the source of my problem, as I
didn't handle them specially. This is the postprocessing I have to do in
KCachegrind: cycles are in fact superfunctions. I have to handle the real
functions inside them like the BBs of normal functions, and I'm fine
again. That is, I throw away call arcs inside of cycles.

A nice side effect of logging call costs directly in Cachegrind: I let the
user create trace parts. He can decide when to make a dump (I know of at
least one successful use of this feature from a user report...). That way
I can get call arcs with call count 0 (these are "active" calls). The
gprof algorithm would calculate a weight of 0 for such an arc, which is
obviously wrong...

> say that for each invocation of function A, you record the exact counter
> deltas for that invocation, and presumably the caller. You can
> therefore say that every time B calls A, the sum of the counter deltas
> is X. However, by only recording the immediate caller, you're already

X is the weight of the call arc from B to A in the general call graph.
> throwing information away, and I suspect this lack of information is
> causing the trouble.

If I visualize the children of a call A=>B, I use the cumulative cost of
this call for the area size and scale down the inner visualisation of the
whole of A, as I don't have the call costs from B when called from A. This
of course is a pure estimation...

If I logged call costs depending not only on the immediate caller, but
also on the caller of the caller, I still don't think I'd be mixing
abstraction levels: for each function, I would simply have multiple call
graph nodes, depending on the caller. I don't have the self costs of these
nodes, but they can be calculated from the edge weights.

> I think you may be better off by either just adopting the gprof
> algorithm (certainly for the purposes of visualising what's going on),
> and/or implement a complete program trace for more detailed
> post-processing. You may find the paper "Encoding Program Executions" a
> useful source of ideas
> (http://citeseer.nj.nec.com/reiss01encoding.html).

Thanks. That could be a project for the future... You would need ways to
keep the amount of traced data low, and that's another topic. And this has
nothing to do with KCachegrind, as I can't see a use for a TreeMap there.

> > Note: All the hassle with recursion detection in my calltree patch
> > would not be needed if I didn't log exact call costs. But on the
> > other side: I already have everything in place now in my patch.
>
> How much effort do you put into detecting recursion? I still think
> you'd be better off just recording everything and leave all analysis to
> post-processing.

For each active function (i.e. currently called somewhere on the stack), I
have a CallEntry struct, put into a hash table with the function start
address as key. This way, on each call that happens, I can look up very
fast whether this call is a recursive one.
In a CallEntry, I have a recursion count for the function, and thus can
remove the CallEntry when the recursion count reaches zero on function
return.

The problem with this concept is returns from functions: I can't be sure
these are done by executing a RET instruction. A program can do a longjmp
(e.g. C++ exception handling) and modify the stack pointer in arbitrary
ways, thus returning implicitly from active functions.

So the above algorithm is a little fragile, and I need to keep my own call
stack (an array of elements, each consisting of a pointer to a CallEntry
struct and the real ESP at call time). At each CALL/RETURN instruction I
keep my stack in sync with the real stack by looking at the stored/real
ESP values and popping (returning from functions) as needed.

> > Regarding profiling performance: Even without a need for recursion
> > detection I think I will need hash table lookups...
>
> Is that the expensive part of doing the trace?

Don't know ;-) I'm adding around 20% to pure Cachegrind.

> TreeMap doesn't need to be exact, it just needs to be an accurate guide
> for a programmer to tell where the time is going.

Yes. But it can make a difference if weights are distributed the wrong way
among call arcs, and the TreeMap shows call weights, not self costs.

> > I'm wondering if I should abstract from the term "function" in
> > KCachegrind. It makes a lot of sense to handle BBs the same way
> > (Jeremy: I wondered why you do profiling on a BB granularity!): Jumps
> > between BBs are in fact calls with implicit return on function
> > return: If there are loops in the function, we have lots of "cycles"
> > (i.e. recursive BB calls) for the BBs of a function. "Cumulative
> > cost" of a BB is the cost until the end of the function. We can even
> > extend this abstraction, not only down (from functions to BBs), but
> > also up to C++ classes and ELF objects: It makes sense to see
> > cumulative costs of all methods of a class, or calls among classes;
> > the same holds for ELF objects. What do you think about this?
>
> Well, I've been putting some effort into my skin to try to distinguish
> function calls from other kinds of jumps, simply because recording

What's the difficulty?

> everything at the BB level generates too much information to be
> reasonably used, and despite the claims in the documentation, gprof
> generates almost useless output as a result (if you record the call and
> return of A calling B as separate edges, prof shows it as A calls B who
> calls A).

If you handle BBs as separate nodes in a call graph of BBs, you get a LOT
of cycles, and not really much in addition to the coarser function call
graph, I suppose?

> So, I'm now just recording calls, and also only instrumenting basic
> blocks in the text section (so that I don't get spurious edges calling
> a shared library, because it calls into the PLT, which then jumps into
> the function in the target library).

I already solved this by trapping jumps to "_dl_runtime_resolve" (this is
the function first called from a PLT stub) and attributing all the cost of
the PLT entry to the real function. After relocation, you have a jump from
the PLT entry to the shared library function: I have an alias hash which
holds the relations between these stubs and shared library functions...

> I think recording to the BB level makes a lot more sense with better
> tools to interpret the output, so it definitely makes sense for you to
> consider the idea.

A TreeMap visualisation of the BBs of a function could be cool... You
would see all the loops happening, and the loop bodies as nested BBs. But
you always need a correct mapping of BBs to source lines: so perhaps
label the areas for BBs with the source code lines?!
> But don't get too elaborate. The idea of a profiling tool is to
> generate a comprehensible abstraction of program execution, not a
> complete encoding of what happened.

I'm surely not looking at exact tracing of program execution ;-)

Josef