|
From: Philippe W. <phi...@sk...> - 2009-05-14 22:34:05
|
I am working on callgrind to add the capture of alloc/free information events.
I now have something which starts to work (at least on small examples).
This then allows using callgrind/kcachegrind to examine, e.g. graphically,
the heap memory usage. In other words, this is similar to massif, but integrated
within callgrind and its graphical tool. The result on small examples
looks nice in kcachegrind.
The events I am currently capturing are:
allocSize (increased by N when N bytes are allocated),
  associated in the call graph with the call stack that did the alloc
freeSize (increased by N when N bytes are freed),
  associated in the call graph with the call stack that did the free
(this allows seeing who allocates and/or who frees a lot).
To this, I would like to add a "releasedSize" event.
The releasedSize will be increased similarly to freeSize, but the idea is to
associate this cost with the same call stack that allocated the memory being
freed.
With this "releasedSize", the current heap memory usage can be examined in
kcachegrind by using the cost expression "allocSize - releasedSize".
This will make it possible to see which stack traces "keep" a lot of memory.
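The three events could be tracked per calling context roughly as in the
following minimal sketch (all names here are invented for illustration;
Callgrind's real bookkeeping lives in its bbcc/context structures):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-context event counters; not Callgrind's actual types. */
typedef struct {
    unsigned long alloc_size;    /* bytes allocated by this context        */
    unsigned long free_size;     /* bytes freed by this context            */
    unsigned long released_size; /* bytes freed that this context had
                                    allocated earlier                      */
} HeapEvents;

static void on_alloc(HeapEvents *alloc_ctx, size_t n)
{
    alloc_ctx->alloc_size += n;
}

static void on_free(HeapEvents *freeing_ctx, HeapEvents *alloc_ctx, size_t n)
{
    freeing_ctx->free_size    += n; /* charged to the freeing call stack   */
    alloc_ctx->released_size  += n; /* charged back to the allocating one  */
}

/* Live heap attributable to a context: allocSize - releasedSize. */
static unsigned long live_bytes(const HeapEvents *ctx)
{
    return ctx->alloc_size - ctx->released_size;
}
```

The point of the separate released_size counter is exactly the
"allocSize - releasedSize" expression: a context that allocates much but is
eventually "released" ends up with small live_bytes.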
I have tried a naive (and incorrect) approach to implement releasedSize:
at allocation time, I save somewhere the pointer and the bbcc used at the
allocation; when the pointer is released, I retrieve the corresponding bbcc
and add the size of the freed block to the releasedSize cost of that bbcc.
At the end of the run, I see that the total releasedSize is incremented (but
by twice the number of bytes freed). Moreover, nothing appears in the "self"
or "inclusive" columns for this event.
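The naive scheme above (remember the allocating context per block, charge it
back at free time) can be sketched as follows; "Bbcc" is a stand-in struct and
the fixed-size probing table is a toy, not Callgrind's code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the per-context cost record. */
typedef struct { unsigned long released_size; } Bbcc;

#define NSLOTS 1024
typedef struct { void *ptr; Bbcc *alloc_bbcc; size_t size; } AllocRec;
static AllocRec alloc_map[NSLOTS];  /* toy pointer -> (bbcc, size) map */

static size_t slot_of(void *p) { return ((uintptr_t)p >> 4) % NSLOTS; }

/* At allocation time: remember which bbcc allocated this block. */
static void remember_alloc(void *p, size_t size, Bbcc *bbcc)
{
    size_t i = slot_of(p);
    while (alloc_map[i].ptr != NULL)      /* linear probing */
        i = (i + 1) % NSLOTS;
    alloc_map[i].ptr = p;
    alloc_map[i].alloc_bbcc = bbcc;
    alloc_map[i].size = size;
}

/* At free time: look the block up and charge releasedSize back to the
 * allocating bbcc (assumes the block was recorded earlier). */
static void record_free(void *p)
{
    size_t i = slot_of(p);
    while (alloc_map[i].ptr != p)
        i = (i + 1) % NSLOTS;
    alloc_map[i].alloc_bbcc->released_size += alloc_map[i].size;
    alloc_map[i].ptr = NULL;              /* forget the block */
}
```

This only updates a plain field in the bbcc, which, as discussed in the reply
below, is why the self cost may appear but the inclusive cost cannot.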
I guess that I need to do something more complex, like:
* saving a full chain of bbccs (or something like that) at allocation time
* and then propagating the releasedSize along this chain of bbccs
Or maybe save the stack trace (execontext) and "simulate" a call to propagate
the releasedSize?
I have read the callgrind code here and there, but it is not clear to me how
to attack this.
Any advice/pointers on how to proceed / what to read in detail / ...?
Thanks for any help
|
|
From: Josef W. <Jos...@gm...> - 2009-05-16 21:42:36
|
On Friday 15 May 2009, Philippe Waroquiers wrote:

> I am working on callgrind to add the capture of alloc/free information events.
>
> I have now something which starts to work (at least on small examples).
> This allows then to use the callgrind/kcachegrind to examine e.g. graphically
> the heap memory usage. In other words, this is similar to massif, but
> integrated within callgrind and its graphical tool. The result on small
> examples looks nice in kcachegrind.
>
> The events I am currently capturing is:
> allocSize (increased by N when N bytes are allocated)
>   in the callgraph, associated to the callstack that did the alloc
> freeSize (increased by N when N bytes are freed)
>   in the callgraph, associated to the callstack that did the free
> (this allows to see who allocates and/or who frees a lot).

Nice.

> To this, I would like to add a "releasedSize" event.
> The releasedSize will be increased similarly to freeSize but the idea is to
> associate this cost to the same callstack that allocated the memory being
> freed.

Ah, OK. So the idea is to attach an event to an execution context from the
past. Unfortunately, this does not play nicely with the way inclusive costs
are currently captured (see below).

> With this "releasedSize", the current memory usage of the heap can be
> examined in kcachegrind by using the cost "allocSize - releasedSize".
> This will allow to see which stack trace "keeps" a lot of memory.

Yes. Note that the minus operator currently is not supported in KCachegrind,
but that should be easy to fix...

> I have tried a naive (and incorrect) approach to implement releasedSize:
> At allocation time, I am saving somewhere the pointer and the bbcc used at
> allocation. When the pointer is released, I am retrieving the corresponding
> bbcc and adding the size of the block freed to the releasedSize cost of this
> bbcc.
>
> At the end of the run, I see that the total releasedSize is incremented (but
> twice the nr of bytes freed).
> Moreover, nothing appears in the "self" or "inclusive" columns for this
> event.

Hmm... Not sure about the "twice" and the "self" (that should work). However,
"inclusive" cannot work. In callgrind, inclusive cost is measured using a
global, always-increasing counter array. For a function f, the inclusive cost
is the global counter values when leaving f minus the values when entering it.
Now, your update of releasedSize never changes any such global counter, so any
inclusive cost will always be zero. But more problematic: there is simply no
way to increase a global counter such that this results in meaningful data
with the current strategy. BTW, the "cache use" metrics have the same problem:
only self cost is useful. It would be nice to have a general solution for
this.

> I guess that I need to do something more complex, like:
> * saving a full chain of bbcc (or something like that) at allocation time

Something similar to this is already available. A bbcc is related to a
"context", which is unique for a call chain of functions (the default is
length 1, i.e. only the currently running function). If you look at new_cxt()
in callgrind/context.c, this is the "size" of a context, which currently is
given as a setting stored for every function. It would be nice to have every
context specifying the full call chain from main(), but this is simply not
feasible, as the number of such contexts explodes with huge codes such as
firefox/openoffice; hence the limit on the call-chain length for contexts.
Anyway, the inclusive costs are not stored in BBCCs, but in JCCs (with from/to
tuples of BBCCs), so to adjust the inclusive costs you would need to take
these from the current call stack, which is an array of call_entry's (see type
call_stack in callgrind/global.h).

> * and then propagate the releasedSize on this chain of bbcc

Yes, adjust "cost" in the JCCs. However, this can potentially be very
expensive.
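The counter-delta scheme described above can be illustrated with a toy model
(types and names here are invented for illustration; Callgrind's real
implementation uses a counter array and its own structures):

```c
#include <assert.h>

/* Toy model of inclusive-cost measurement: a single global,
 * only-increasing counter is snapshotted at function entry, and the
 * inclusive cost is the delta observed when the function is left. */
static unsigned long global_counter = 0; /* e.g. instructions executed */
static unsigned long released_size  = 0; /* updated "out of band"      */

typedef struct { unsigned long entry_snapshot; } CallEntry;

static void enter_fn(CallEntry *e)
{
    e->entry_snapshot = global_counter;
}

static unsigned long inclusive_cost_on_leave(const CallEntry *e)
{
    return global_counter - e->entry_snapshot;
}
```

An event that only touches a side counter, as the releasedSize update does,
never moves global_counter, so the entry/exit delta (and hence the inclusive
cost) stays zero no matter how much is "released" inside the call.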
A completely different approach would be to not adjust any inclusive costs in
Callgrind itself for "releasedSize", but to do the propagation of costs at
postprocessing time in KCachegrind; for this to be possible, a more exact
calling context for the self costs has to be put into the output file.

> Or maybe save the stack trace (execontext) and "simulate" a call to
> propagate the releasedSize ?

"Simulating a call" will not help, as no propagation up the call stack is done
at all.

> I have read here and there callgrind code, but it is not clear to me how to
> attack this.

To be honest, the best way is not really clear to me either. Storing the call
stack for every allocation and propagating "releasedSize" at free time seems
to be a very resource-expensive way (both in space and time), and probably
does not work at all for big codes. In contrast, relating allocations to the
call context, writing out more detailed contexts with the according self cost,
and doing the propagation in KCachegrind seems somewhat better, and is
probably closer to the massif way. But then you only estimate the real
inclusive cost (if not every context has the full call chain from main()), and
it will not be directly comparable to "allocSize".

> Any advice/pointers about how to go/ what to read in detail/ ... ?

I would like to extend KCachegrind to come up with inclusive costs itself,
using calling contexts, if they are not provided by the measurement tool. This
is needed to visualize data from other measurement tools (such as sampling
tools, which can provide length-limited calling contexts for every sample),
so...

Perhaps there is not much to do in Callgrind at all. Just write more detailed
contexts from Callgrind (as is already possible with --separate-callers=<n>),
and ignore the issue of inclusive costs for "releasedSize". However,
KCachegrind needs to be extended as written above. KCachegrind currently makes
no use at all of calling-context information.
It just prints out the concatenation of symbol names in the call chains, as
given by Callgrind...

Josef
|
|
From: Philippe W. <phi...@sk...> - 2009-05-16 23:15:06
|
>> With this "releasedSize", the current memory usage of the heap can be
>> examined in kcachegrind by using the cost "allocSize - releasedSize".
>> This will allow to see which stack trace "keeps" a lot of memory.
>
> Yes. Note that the minus operator currently is not supported in KCachegrind,
> but that should be easy to fix...

With the versions I am using, the minus operator seems to be supported: I have
tested kcachegrind 0.4.6 on portable ubuntu for windows (an ubuntu running as
a windows process), and the minus operator works, at least in an event type
defined inside kcachegrind with "new event type". I will retry with the
kcachegrind 0.5.0 version on fedora 10 but, IIRC, it also works under fedora.

> Hmm.. Not sure about the "twice" and the "self" (should work).

I have somewhat cleaned up the first quick-and-dirty implementation I did (for
allocSize and freeSize). With your info that it should work at least for the
self cost, I will re-introduce releasedSize and double-check what it gives.

> To be honest, the best way is not really clear for me, too. Storing the
> callstack for every allocation, and propagating "releasedSize" at free time,
> seems to be a very resource-expensive way (both in space & time), and
> probably is not working at all for big codes.

For the heap-related events (malloc/free), the explosion should be reasonable,
as this is the technique used by memcheck/massif to report errors: both
associate with each heap block a pointer to a stack trace. So, at least in
terms of stack-trace size, I imagine that a call stack of JCCs for each
allocation should work (using a hash table to share identical JCC call stacks
and so avoid duplication). But this solution can only work for a "small
subset" of all the possible stack traces in a huge program (and even small
programs can create a huge number of different stack traces; cf. the test
program in valgrind bug 191182).
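The sharing of identical call stacks via a hash table could look roughly like
the following sketch (purely illustrative; JCC is treated as an opaque
stand-in and none of these names come from Callgrind):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct JCC JCC;       /* opaque stand-in for the real JCC type */

/* One interned (canonical) call stack; identical stacks share one copy. */
typedef struct Trace {
    struct Trace *next;       /* hash-chain link                        */
    int           depth;
    JCC         **frames;     /* owned copy of the frame array          */
} Trace;

#define NBUCKETS 4096
static Trace *buckets[NBUCKETS];

static unsigned hash_trace(JCC **frames, int depth)
{
    unsigned h = 2166136261u;               /* FNV-1a over the pointers */
    for (int i = 0; i < depth; i++)
        h = (h ^ (unsigned)(uintptr_t)frames[i]) * 16777619u;
    return h % NBUCKETS;
}

/* Return the canonical copy of this stack, creating it on first sight. */
static Trace *intern_trace(JCC **frames, int depth)
{
    unsigned b = hash_trace(frames, depth);
    for (Trace *t = buckets[b]; t != NULL; t = t->next)
        if (t->depth == depth &&
            memcmp(t->frames, frames, depth * sizeof *frames) == 0)
            return t;                       /* already seen: share it   */
    Trace *t  = malloc(sizeof *t);
    t->frames = malloc(depth * sizeof *frames);
    memcpy(t->frames, frames, depth * sizeof *frames);
    t->depth = depth;
    t->next  = buckets[b];
    buckets[b] = t;
    return t;
}
```

Each heap block then only needs one Trace pointer, so repeated allocations
from the same call stack cost no extra trace storage.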
I understand there is a need for a more general solution (such as the one
below) that will work for other cases having a much bigger number of stack
traces than heap events.

> I would like to extend KCachegrind to come up with inclusive costs itself,
> if they are not provided by the measurement tool, using calling contexts.
> This is needed to visualize data from other measurement tools (such as
> sampling tools which can provide length-limited calling contexts for every
> sample), so ...
>
> Perhaps there is not much to do in Callgrind at all. Just writing more
> detailed contexts from Callgrind (as is already possible with
> --separate-callers=<n>), and ignore the issue of inclusive costs for
> "releasedSize".
>
> However, KCachegrind needs to be extended as written above.
>
> KCachegrind currently makes no use at all from calling context information.
> It just prints out the concatenation of symbol names in the call chains, as
> given by Callgrind...

I am not too sure I understand the kcachegrind-based solution you describe; I
will re-read it, sleep on it, and get back to you, probably with more
questions.

Thanks for your help/advice (and thanks for the callgrind/kcachegrind tools)

Philippe
|
|
From: Philippe W. <phi...@sk...> - 2009-05-21 23:11:25
|
> So, at least in terms of stack trace size, I imagine that a callstack of JCC
> for each allocation should work. (using an hash table to share the identical
> JCC callstacks to avoid duplication).

I have now implemented this solution (but using an OSetGen of JCC call stacks
rather than a hash table). Testing this on firefox startup gives a reasonable
result: there are about 72000 different JCC stack traces, with 4,000,000
elements in these stack traces.
=> so, on x86, this needs about 20Mb of memory

I still have a few things to clean up/improve. I will then send the current
state of the code as a basis for discussion/review (there are for sure points
to improve/change/enhance/...).

Philippe
|
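As a rough sanity check on the 20Mb figure: assuming each stack element is one
4-byte JCC pointer on 32-bit x86 and guessing a per-trace bookkeeping overhead
of 56 bytes (both assumptions are the editor's, not measured values):

```c
#include <assert.h>

/* Back-of-envelope storage estimate for interned stack traces.
 * ptr_size and per_trace_overhead are assumptions, not measured data. */
static unsigned long estimated_bytes(unsigned long n_traces,
                                     unsigned long n_elements,
                                     unsigned long ptr_size,
                                     unsigned long per_trace_overhead)
{
    return n_elements * ptr_size + n_traces * per_trace_overhead;
}
```

With 4,000,000 elements at 4 bytes each (16,000,000 bytes) plus 72000 traces
of overhead, the total lands close to the reported 20Mb.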
|
From: Philippe W. <phi...@sk...> - 2009-05-22 11:02:56
|
>> Testing this on firefox startup gives a reasonable result:
>> there are about 72000 different JCC stack traces, with 4,000,000 elements
>> in these stack traces.
>> => so, on x86, this needs about 20Mb of memory
>
> So there are around 72000 different calling contexts for heap allocations?
> That seems to be manageable, yes. Do you do anything special for contexts
> with recursion?

Nothing special is done for recursion: the "raw" stack of JCCs is stored.

> So if you have this, it would be easy to try whether this is practical, at
> least for some cases, also for the cacheline usage...
>
>> I still have a few things to clean up/improve. I will then send the
>> current state of the code as a basis for discussion/review (there are for
>> sure points to improve/change/enhance/...)
>
> Thanks, I am interested ;-)

The JCC stack-trace code will be easy to restructure so that it can be re-used
for other things than heap allocations. But I think this will be better done
once you have had the occasion to do a first review: there are a few pieces of
code that I wrote by trial and error, without fully understanding the logic
behind them :).

Philippe
|