Thread: [Valgrind-developers] Re: kcachegrind patch for valgrind-HEAD?

Brought to you by: njn, sewardj, wielaard

valgrind-developers

[Valgrind-developers] Re: kcachegrind patch for valgrind-HEAD?

From: Josef W. <Jos...@gm...> - 2002-10-04 10:18:46

Hi,

On Friday 04 October 2002 03:01, Jeremy Fitzhardinge wrote:
> Hi,
>
> Do you have a patch for the current CVS version of valgrind?  I finally
> got enough of KDE installed on my laptop to compile kcachegrind, so I'm
> keen to try it out.
>
> Thanks,
> 	J

Sorry: I don't have much time at the moment...
Last wednesday I looked for the first time at valgrind-HEAD. It seems to be=
=20
quite easy to port my patch to a skin. The only problem I saw was that I ne=
ed=20
a valgrind version of the LIBC "unlink", which I already mailed to Nick...=
=20

Regarding the valgrind skin architecture: Shouldn't it be possible to "stac=
k"=20
skins? At the moment, for my skin I have to include all the cachegrind code=
=20
again. And if the cachegrind skin decides to simulate a 3rd level cache, I=
=20
have to copy it. In fact: Call tree logging should be totally orthogonal to=
=20
event logging. Shouldn't we have some general support for expandable cost=20
centers in the core? Then I could use these to add/subtract costs without=20
even knowing which ones are logged...
To be honest, I didn't thought much about this idea yet...

Regarding KCachegrind: I still have a problem with visualizing recursive=20
calls. This seems to involve changes in the Cachegrind patch, too.
So I first have to solve this one before I'm making any new patch/release...
Aside from that: I switched to GCC 3.2 with an update to Suse 8.1, and now
I have a lot of problems with lost debugging info :-(

Question: What are the exact problems with GCC 3.x, that it's not officiall=
y=20
supported in Valgrind ?

Perhaps you have some suggestions for my problem with recursive calls:

Suppose a call chain starting from A: A calls B and C; C calls A again. The=
=20
recursively called A only calls B. Say the cost of B is always 100, the sel=
f=20
cost of each A and C are 10. So the cumulative cost of C will be 120=20
(C=3D>A=3D>B), and the one of the first call to A will be 230. I log (cumul=
ative)=20
costs for the call A=3D>B only once, so this call gets cost 200.
The problem: The callee list of A shows a call to B with cost 200 and a cal=
l=20
to C with cost 120, but A itself only has cumulative cost 230 !?! This is=20
confusing to the user, and really makes problems drawing the TreeMap...
The real problem is that KCachegrind can't see that the cost of A=3D>B is=20
included in the call cost A=3D>C and thus shown twice in the callee list of=
 A.
And in the Treemap, I simple stop drawing recursive calls: This leaves empty
space where there should be none and it looks like a performance problem for
the user where there is none !!

The first ad-hoc solution was to distinguish among calls from recursively=20
called functions, i.e. the 2 calls A=3D>B will be logged independent from e=
ach=20
other: this makes the example looking quite fine again.
But this makes the real problem disappear for a few simple examples (as the=
=20
above one) only; it's still there for deeper recursions, and cumulative cos=
ts=20
of calls always include ALL recursion costs inside of this call: I log the=
=20
cost counter difference at entering/leaving the function.

The real solution (without the ad-hoc one) would be:
The callee list of A shows a call to B with cost 200. This is correct: B is=
=20
called twice from A, leading to cost 200 for calls to B. But the call A=3D>=
C=20
should be 20 only, skipping costs from any recursive A inside (perhaps=20
stating that cost 100 is already included in other calls). And this would=20
make the Treemap drawing fine again.

So the only question I have:
HOW to calculate this value (20) in the general case ?!?
I suppose I can't calulate it at post-processing time, but have to log it i=
n=20
Cachegrind somehow (that is, the skipped cost of 100 in the example above).

Suggestions?

Josef

Re: [Valgrind-developers] Re: kcachegrind patch for valgrind-HEAD?

From: Josef W. <Jos...@gm...> - 2002-10-04 18:41:17

On Friday 04 October 2002 18:17, Jeremy Fitzhardinge wrote:
> On Fri, 2002-10-04 at 03:10, Josef Weidendorfer wrote:
> >    Suppose a call chain starting from A: A calls B and C; C calls A a=
gain.
>
> I don't think you should calculate anything at runtime; just record
> enough so you can do it all afterwards.  There shouldn't be a huge
> problem in working all this out: gprof does it, after all.

Perhaps I'm already logging to much with the "cumulative" costs of calls.
But I don't understand how to calculate this with self costs and call tre=
e=20
information alone, especially for functions taking part in multiple cycle=
s...

> Does this do what your example describes?
> ...

Yes :-)

> gprof generates this flat profile:
> ...

Seems quite correct :-)

> ...
> index % time    self  children    called     name
> [1]    100.0    0.03    0.23       1+2       <cycle 1 as a whole> [1]
>                 0.02    0.23       2             A <cycle 1> [3]

Why isn't "B <cycle1>" shown as child of <cycle1> ?

> ...
> [Handwave mode:] For the purposes of displaying a profile as a TreeMap,
> it seems to me that you need to break the graph up into strongly
> connected subgraphs, and display each of those as a node in your tree.

The subgraphs being the cycles with its children?

> When you want to descend into that node (ie display its sub-parts), you
> remove all the backwards edges (recursive calls) and treat it as a plai=
n
> tree.  When you remove the backwards edge, you can recompute the cost o=
f
> the subgraph without the recursion; the difference between the original
> cost and the new cost is how much you should attribute to the function
> which is making the recursive call.

I think I'm doing exactly this: Stop drawing at recursions...
And the problem is: I don't know what the cost of the recursive function=20
should be. I don't have the gprof results, but only my own logging.
I think I will have to dive into gprof algorithms a little bit...
As I understand at the moment, gprof never talks about recursive calls, b=
ut
makes "cycles" with subobjects.

Thanks anyway!

Josef

Re: [Valgrind-developers] Re: kcachegrind patch for valgrind-HEAD?

From: Jeremy F. <je...@go...> - 2002-10-04 20:29:25

On Fri, 2002-10-04 at 11:24, Josef Weidendorfer wrote:
> > index % time    self  children    called     name
> > [1]    100.0    0.03    0.23       1+2       <cycle 1 as a whole> [1]
> >                 0.02    0.23       2             A <cycle 1> [3]
> 
> Why isn't "B <cycle1>" shown as child of <cycle1> ?

Because [1] is the entry for "cycle 1 as a whole"; A is considered to be
the head of the cycle (which is then described in [3]).

[2] is the linear part of the call graph to A through main.

> > [Handwave mode:] For the purposes of displaying a profile as a TreeMap,
> > it seems to me that you need to break the graph up into strongly
> > connected subgraphs, and display each of those as a node in your tree.
> 
> The subgraphs being the cycles with its children?

Yes.  "Strongly connected graph" is graph-theory talk for a graph with
internal cycles.  For example:

    A                           A'
    |            -->            |
    v                           v
    B <=> C                    BC'

B <=> C is a simple cycle, and so is a strongly connected subgraph. You
can rewrite the overall graph so that strongly connected subgraphs are
collapsed into a single node.

> I think I'm doing exactly this: Stop drawing at recursions...
> And the problem is: I don't know what the cost of the recursive function 
> should be. I don't have the gprof results, but only my own logging.
> I think I will have to dive into gprof algorithms a little bit...
> As I understand at the moment, gprof never talks about recursive calls, but
> makes "cycles" with subobjects.

http://citeseer.nj.nec.com/graham82gprof.html should explain the
algorithm.

	J

Re: [Valgrind-developers] Re: kcachegrind patch for valgrind-HEAD?

From: Nicholas N. <nj...@ca...> - 2002-10-04 10:48:27

On Fri, 4 Oct 2002, Josef Weidendorfer wrote:

> The only problem I saw was that I need a valgrind version of the LIBC
> "unlink", which I already mailed to Nick...

I just added VG_(unlink) to head;  it's untested, hopefully I got it
right.

> Regarding the valgrind skin architecture: Shouldn't it be possible to "stack"
> skins? At the moment, for my skin I have to include all the cachegrind code
> again. And if the cachegrind skin decides to simulate a 3rd level cache, I
> have to copy it.

Hmm, the LD_PRELOADing of two shared objects (skin.so + core.so) is
already a bit fragile, having multiple .so's feels like a bad idea to
me...  Anyway, aren't Cachegrind and your patched version dissimilar
enough that it wouldn't be easy to "stack" them in a sensible way?  A
better way might be to factor out the common code which gets included in
both skin .so's, if you see what I mean.  This should be done with
addrcheck and memcheck at some stage because they share a lot of identical
code.

Thinking longer term, your version of Cachegrind could entirely replace
the original Cachegrind one day, since AFAICT your Cachegrind's
functionality is a strict superset of my Cachegrind's.

> In fact: Call tree logging should be totally orthogonal to event
> logging. Shouldn't we have some general support for expandable cost
> centers in the core?

Maybe, I thought about this but didn't get any further.  If would help if
you could give some specific suggestions as to the form it might take.

> Perhaps you have some suggestions for my problem with recursive calls:
>
> Suppose a call chain starting from A: A calls B and C; C calls A again.
> [...]
> Suggestions?

My brain is melting.  Do you know how gprof handles it?

N

Re: [Valgrind-developers] Re: kcachegrind patch for valgrind-HEAD?

From: Josef W. <Jos...@gm...> - 2002-10-04 18:41:16

Forgot to send to the list...

On Friday 04 October 2002 12:48, Nicholas Nethercote wrote:
> On Fri, 4 Oct 2002, Josef Weidendorfer wrote:
> > The only problem I saw was that I need a valgrind version of the LIBC
> > "unlink", which I already mailed to Nick...
>
> I just added VG_(unlink) to head;  it's untested, hopefully I got it
> right.

Thanks!

> > Regarding the valgrind skin architecture: Shouldn't it be possible to
> > "stack" skins? At the moment, for my skin I have to include all the
> > cachegrind code again. And if the cachegrind skin decides to simulate=
 a
> > 3rd level cache, I have to copy it.
>
> Hmm, the LD_PRELOADing of two shared objects (skin.so + core.so) is
> already a bit fragile, having multiple .so's feels like a bad idea to
> me...  Anyway, aren't Cachegrind and your patched version dissimilar
> enough that it wouldn't be easy to "stack" them in a sensible way?  A
> better way might be to factor out the common code which gets included i=
n
> both skin .so's, if you see what I mean.  This should be done with
> addrcheck and memcheck at some stage because they share a lot of identi=
cal
> code.

Yes. It seems to be the simplest way.
Two skins stacked on each other seems strange: The 1st does instrumentati=
on,=20
and the UCode outcome is instrumentated by the 2nd. And so on...
Regarding the LD_PRELOADing: can't this made be explicit by runtime-loadi=
ng of=20
the skins from valgrind.so (i.e. a plugin architecture)?

But a general "cost center API" would be nice to have for skins counting=20
events, linked as library to any such skin.

We need cost types (can be an array of subcost types), and cost center ta=
rget=20
types (e.g. instruction, basic block, call, function, ELF object, memory=20
access etc..) using these cost types. For this, we need register_XXX=20
functions, supplying call backs for zeroing/adding/writing ASCII version/=
=2E..
For each cost center a skin creates, it registers it and links it either =
to=20
some other cost center target ("parent") or to an existing object of the=20
valgrind core (e.g. basic block, memory area, ...).

Dumping out the profile could be fully done in a generic way: We need a "=
cost=20
center position" structure and go through all available positions of=20
registered cost centers (using e.g. the "parent" relation).
I'm sure the file format of dumps would change to be a lot more generic.
Support for per-thread cost centers could be generic, too.

I still need some time to think about it.

> Thinking longer term, your version of Cachegrind could entirely replace
> the original Cachegrind one day, since AFAICT your Cachegrind's
> functionality is a strict superset of my Cachegrind's.

Yes :-)
But it's still your baby, I only extended it. I can make small patches fo=
r=20
independent features separately (e.g. jump cost centers, recursion detect=
ion,=20
shared lib support [alias hash/_dl_resolve_runtime], "compressed" profile=
=20
format, threading support), and you decide if some modification is needed=
 or=20
we can put it in as it is...

> > Perhaps you have some suggestions for my problem with recursive calls=
:
> >
> > Suppose a call chain starting from A: A calls B and C; C calls A agai=
n.
> > [...]
> > Suggestions?
>
> My brain is melting.  Do you know how gprof handles it?

If I understand correctly:

gprof makes an additional virtual function, calling in "cycle" out of the
"A=3D>C=3D>A" chain, with subcycle objects A and C.

I would draw this cycle as area, splitting it up totally among A and C as
subareas. Unfortunately this doesn't show the function where the cycle wa=
s=20
entered from the outside. So I can rename the cycle back to the function=20
where the cycle was entered, and I'm back to my original proposal :-)

Still the question is how gprof calculates its results.
(see reply to the mail of Jeremy, too)

[Valgrind-developers] Re: kcachegrind patch for valgrind-HEAD?

From: Jeremy F. <je...@go...> - 2002-10-04 16:17:41

On Fri, 2002-10-04 at 03:10, Josef Weidendorfer wrote:
    Regarding the valgrind skin architecture: Shouldn't it be possible to "stack" 
    skins? 

I've been thinking about that too.  If the skins have non-overlapping
functionality and instrument the code in non-overlapping ways, then
maybe.  It seems to me that you'd want to avoid having multiple skins
instrumenting each other's instrumentation.  I guess you could add a
couple of new core UInstrs which limit CALLM_[SE], except for delimiting
a skin's instrumentation so other skins can avoid each other, but
there's then the problem of making sure the generated code is actually
correct...

But I still like the idea because my gprof skin is very non-intrusive
(it inserts just one call per basic block and a memory store per jump),
and could easily be composed with memcheck.

    Suppose a call chain starting from A: A calls B and C; C calls A again. The 
    recursively called A only calls B. Say the cost of B is always 100, the self 
    cost of each A and C are 10. So the cumulative cost of C will be 120 
    (C=>A=>B), and the one of the first call to A will be 230. I log (cumulative) 
    costs for the call A=>B only once, so this call gets cost 200.
    The problem: The callee list of A shows a call to B with cost 200 and a call 
    to C with cost 120, but A itself only has cumulative cost 230 !?! This is 
    confusing to the user, and really makes problems drawing the TreeMap...
    The real problem is that KCachegrind can't see that the cost of A=>B is 
    included in the call cost A=>C and thus shown twice in the callee list of A.
    And in the Treemap, I simple stop drawing recursive calls: This leaves empty
    space where there should be none and it looks like a performance problem for
    the user where there is none !!

    The real solution (without the ad-hoc one) would be:
    The callee list of A shows a call to B with cost 200. This is correct: B is 
    called twice from A, leading to cost 200 for calls to B. But the call A=>C 
    should be 20 only, skipping costs from any recursive A inside (perhaps 
    stating that cost 100 is already included in other calls). And this would 
    make the Treemap drawing fine again.
    
    So the only question I have:
    HOW to calculate this value (20) in the general case ?!?
    I suppose I can't calulate it at post-processing time, but have to log it in 
    Cachegrind somehow (that is, the skipped cost of 100 in the example above).

I don't think you should calculate anything at runtime; just record
enough so you can do it all afterwards.  There shouldn't be a huge
problem in working all this out: gprof does it, after all.

Does this do what your example describes?

#define COST(N)	{ int i; for(i = 0; i < N*1000000; i++); }

void A(int);
void B(void);
void C(void);

int main()
{
	A(0);

	return 0;
}

void A(int a)
{
	COST(10);
	B();
	if (!a)
		C();
}

void B(void)
{
	COST(100);
}

void C(void)
{
	COST(10);
	A(1);
}

gprof generates this flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 88.46      0.23     0.23        2 115000.00 115000.00  B
  7.69      0.25     0.02        2 10000.00 125000.00  A
  3.85      0.26     0.01        1 10000.00 10000.00  C

and this call graph profile:

index % time    self  children    called     name
[1]    100.0    0.03    0.23       1+2       <cycle 1 as a whole> [1]
                0.02    0.23       2             A <cycle 1> [3]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.26                 main [2]
                0.03    0.23       1/1           A <cycle 1> [3]
-----------------------------------------------
                                   1             C <cycle 1> [5]
                0.03    0.23       1/1           main [2]
[3]     96.2    0.02    0.23       2         A <cycle 1> [3]
                0.23    0.00       2/2           B [4]
                                   1             C <cycle 1> [5]
-----------------------------------------------
                0.23    0.00       2/2           A <cycle 1> [3]
[4]     88.5    0.23    0.00       2         B [4]
-----------------------------------------------
                                   1             A <cycle 1> [3]
[5]      3.8    0.01    0.00       1         C <cycle 1> [5]
                                   1             A <cycle 1> [3]
-----------------------------------------------

So basically this splits call graph into 3 pieces (indices 3, 4, 5):
	3 is A seen as part of the cycle through A->C
	4 is A calling B
	5 is C seen as part of the cycle through C->A


[Handwave mode:] For the purposes of displaying a profile as a TreeMap,
it seems to me that you need to break the graph up into strongly
connected subgraphs, and display each of those as a node in your tree. 
When you want to descend into that node (ie display its sub-parts), you
remove all the backwards edges (recursive calls) and treat it as a plain
tree.  When you remove the backwards edge, you can recompute the cost of
the subgraph without the recursion; the difference between the original
cost and the new cost is how much you should attribute to the function
which is making the recursive call.

Does this help at all?

	J

Re: [Valgrind-developers] Re: kcachegrind patch for valgrind-HEAD?

From: Jeremy F. <je...@go...> - 2002-10-04 16:47:14

On Fri, 2002-10-04 at 09:17, Jeremy Fitzhardinge wrote:
    On Fri, 2002-10-04 at 03:10, Josef Weidendorfer wrote:
        Regarding the valgrind skin architecture: Shouldn't it be possible to "stack" 
        skins? 
    
    I've been thinking about that too.  If the skins have non-overlapping
    functionality and instrument the code in non-overlapping ways, then
    maybe.  It seems to me that you'd want to avoid having multiple skins
    instrumenting each other's instrumentation.  I guess you could add a
    couple of new core UInstrs which limit CALLM_[SE], except for delimiting

s/limit/are like/

	J