From: Josef W. <Jos...@gm...> - 2010-11-18 17:35:00
On Thursday 18 November 2010, edA-qa mort-ora-y wrote:
> On 11/18/2010 04:12 PM, Josef Weidendorfer wrote:
> > Ok. But even if these other threads do a lot of stuff not
> > interesting to you, they will change the cache state, ie. evict
> > data used by the thread interesting to you. So they are important.
>
> That's interesting. Does that imply valgrind assumes a single core
> system for its simulation, or does it model the number of cores on the
> host machine?

Callgrind/Cachegrind currently simulates a single cache hierarchy, so
the accesses of every thread go through the same two cache levels.
Further, as Valgrind currently serializes thread execution, it very much
looks like a single-core system, yes.

However, the last level cache behavior is the most important, and for
this, all modern multicore processors use a shared cache. Thus,
Callgrind's/Cachegrind's model matches that quite well, if you only look
at last level hit/miss event counts.

We could extend the simulator to assume separate L1 caches per thread,
assuming a fixed pinning of each thread to its own core...

> Also, you mentioned you just record events, not costs. When using
> KCachegrind then what exactly is the number it is showing? How fair is
> it in representing how much time a part of the code actually took?

In KCachegrind, you can either select the directly collected event
counts, or select a derived event type, whose formula is given in the
"event type" pane. You can add or change the formulas to your liking.

There is a derived event type "Cycle estimation", which by default
assumes that
- each executed instruction takes 1 cycle,
- each L1 cache miss takes 10 cycles,
- each LL (last level) cache miss takes 100 cycles,
- each mispredicted branch takes 10 cycles,
- each global bus event (e.g. CAS) takes 20 cycles.

Only events that actually were simulated add up in the cycle estimation;
e.g. for branch predictor misses, it needs "--branch-sim=yes".

This gives a rather rough estimation, but it usually highlights the
bottlenecks. If you do not have cache issues (ie. a large number of
misses) anyway, the estimation of course will be quite wrong.

Josef
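
To illustrate how those per-event weights combine, here is a small C
sketch that applies the same arithmetic to a set of event counts. This
is not Callgrind code; the struct, field names and the numbers in main()
are made up for illustration, only the weights come from the list above.

    /* Rough sketch of the "Cycle estimation" derived event. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t instructions;   /* executed instructions (Ir)       */
        uint64_t l1_misses;      /* L1 instruction + data misses     */
        uint64_t ll_misses;      /* last level (LL) cache misses     */
        uint64_t branch_misses;  /* mispredicted branches            */
        uint64_t bus_events;     /* global bus events, e.g. CAS      */
    } event_counts;

    static uint64_t cycle_estimation(const event_counts *e)
    {
        return e->instructions          /*   1 cycle each   */
             + 10  * e->l1_misses       /*  10 cycles each  */
             + 100 * e->ll_misses       /* 100 cycles each  */
             + 10  * e->branch_misses   /*  10 cycles each  */
             + 20  * e->bus_events;     /*  20 cycles each  */
    }

    int main(void)
    {
        /* Example counts, purely illustrative. */
        event_counts e = { 1000000, 20000, 1500, 5000, 0 };
        printf("estimated cycles: %llu\n",
               (unsigned long long)cycle_estimation(&e));
        return 0;
    }

To have the corresponding counts collected in the first place, the
profile run would be started with something like
"valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./prog";
events that were not simulated simply contribute nothing to the
estimation.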