From: Josef W. <Jos...@gm...> - 2010-11-18 17:35:00
On Thursday 18 November 2010, edA-qa mort-ora-y wrote:
> On 11/18/2010 04:12 PM, Josef Weidendorfer wrote:
> > Ok. But even if these other threads do a lot of stuff not
> > interesting to you, they will change the cache state, ie. evict
> > data used by the thread interesting to you. So they are important.
>
> That's interesting. Does that imply valgrind assumes a single core
> system for its simulation, or does it model the number of cores on the
> host machine?

Callgrind/Cachegrind currently simulates a single cache hierarchy, so
the accesses of every thread go through the same two cache levels.
Further, as Valgrind currently serializes thread execution, it very much
looks like a single-core system, yes.

However, the last level cache behavior is the most important, and for
this, all modern multicore processors use a shared cache. Thus,
Callgrind's/Cachegrind's model matches that quite well, if you only look
at last level hit/miss event counts.

We could extend the simulator to assume separate L1 caches per thread,
assuming a fixed pinning of each thread to its own core...

> Also, you mentioned you just record events, not costs. When using
> KCachegrind then what exactly is the number it is showing? How fair is
> it in representing how much time a part of the code actually took?

In KCachegrind, you can either select the directly collected event
counts, or select a derived event type, whose formula is given in the
"event type" pane. You can add or change the formulas to your liking.

There is a derived event type "Cycle estimation", which by default
assumes that
- each executed instruction takes 1 cycle,
- each L1 cache miss takes 10 cycles,
- each LL (last level) cache miss takes 100 cycles,
- each mispredicted branch takes 10 cycles,
- each global bus event (e.g. CAS) takes 20 cycles.

Only events that actually were simulated add up in the cycle estimation;
e.g. for branch predictor misses, it needs "--branch-sim=yes".

This gives a rather rough estimation, but it usually highlights the
bottlenecks. If you do not have cache issues (ie. a large number of
misses) anyway, the estimation of course will be quite wrong.

Josef
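
To illustrate how those per-event weights combine, here is a small C
sketch that applies the same arithmetic to a set of event counts. This
is not Callgrind code; the struct, field names and the numbers in main()
are made up for illustration, only the weights come from the list above.

    /* Rough sketch of the "Cycle estimation" derived event. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t instructions;   /* executed instructions (Ir)       */
        uint64_t l1_misses;      /* L1 instruction + data misses     */
        uint64_t ll_misses;      /* last level (LL) cache misses     */
        uint64_t branch_misses;  /* mispredicted branches            */
        uint64_t bus_events;     /* global bus events, e.g. CAS      */
    } event_counts;

    static uint64_t cycle_estimation(const event_counts *e)
    {
        return e->instructions          /*   1 cycle each   */
             + 10  * e->l1_misses       /*  10 cycles each  */
             + 100 * e->ll_misses       /* 100 cycles each  */
             + 10  * e->branch_misses   /*  10 cycles each  */
             + 20  * e->bus_events;     /*  20 cycles each  */
    }

    int main(void)
    {
        /* Example counts, purely illustrative. */
        event_counts e = { 1000000, 20000, 1500, 5000, 0 };
        printf("estimated cycles: %llu\n",
               (unsigned long long)cycle_estimation(&e));
        return 0;
    }

To have the corresponding counts collected in the first place, the
profile run would be started with something like
"valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./prog";
events that were not simulated simply contribute nothing to the
estimation.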