|
From: Stephen H. <Ste...@sl...> - 2006-08-02 09:51:42
|
Hi,
I have a fairly simple question but I was unable to find any mention of
it in the documentation; please forgive me if it has been asked before.
After running cachegrind, the output report provides a value for data
cache accesses ("D refs:"). What I would like to do is derive a figure
for the memory bandwidth of regions in my code, so I would like to take
this figure for cache accesses and convert it to a figure for the number
of bytes transferred.
Is there a constant relating the number of bytes transferred to the
number of cache references?
Best regards,
Stephen Henry
|
|
From: Josef W. <Jos...@gm...> - 2006-08-02 14:27:29
|
On Wednesday 02 August 2006 11:47, Stephen Henry wrote:
> Hi,
>
> I have a fairly simple question but I was unable to find any mention of
> it in the documentation; please forgive me if it has been asked before.
>
> After running cachegrind, the output report provides a value for data
> cache accesses ("D refs:"). What I would like to do is derive a figure
> for the memory bandwidth of regions in my code, so I would like to take
> this figure for cache accesses and convert it to a figure for the number
> of bytes transferred.
>
> Is there a constant relating the number of bytes transferred to the
> number of cache references?
Cachegrind simulates a blocking cache in which only one load can be
outstanding at a time. Thus, each cache miss reported by Cachegrind
corresponds to exactly one cache line loaded. The cache line size is given
in the first few lines of Cachegrind's output file; it is taken from the
actual processor. On x86, it is usually 64 bytes.
However, there are some complications, as there are two transfer directions.
When the CPU does a store, what actually gets transferred depends on the
cache's write policy: should the store bypass a cache level, and if not,
should the modified data be written directly to the next level
(write-through), or be kept as a modification until the cache line is
evicted (write-back)?
For a store, a load, or a modification, Cachegrind checks whether the
containing line is in L1; if not, it chooses a line to evict and updates the
cache state to hold the new line; then (on an L1 miss) it does the same with
L2. Effectively, this means that Cachegrind simulates a cache which
allocates a line on a store (write-allocate) and always writes through.
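The access procedure above can be sketched roughly as follows (a hypothetical
Python sketch for illustration only: it uses fully associative LRU levels and
made-up capacities, whereas the real simulator models the actual cache
geometry):

```python
LINE = 64  # illustrative line size in bytes

class Level:
    """One fully associative cache level with LRU replacement (a simplification)."""
    def __init__(self, lines):
        self.capacity = lines          # number of cache lines this level holds
        self.tags = []                 # LRU order: front = oldest

    def access(self, addr):
        """Return True on hit; on a miss, evict the LRU line and install the new one."""
        tag = addr // LINE
        if tag in self.tags:
            self.tags.remove(tag)
            self.tags.append(tag)      # move to most-recently-used position
            return True
        if len(self.tags) >= self.capacity:
            self.tags.pop(0)           # evict least-recently-used line
        self.tags.append(tag)          # allocate on load AND store (write-allocate)
        return False

def simulate(accesses, l1, l2):
    """Count misses; L2 is only consulted when L1 misses, as described above."""
    l1_miss = l2_miss = 0
    for addr in accesses:
        if not l1.access(addr):
            l1_miss += 1
            if not l2.access(addr):
                l2_miss += 1
    return l1_miss, l2_miss
```

For example, streaming over more distinct lines than L1 can hold makes every
L1 access a miss while L2 may still hit on revisits.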
You cannot calculate the transfer between the CPU and L1, because Cachegrind
only gives you the number of accesses, not the size of each access.
To get the amount of data transferred from L2 to L1, multiply the number of
L1 misses by 64; to get the amount transferred from L1 to L2, multiply the
number of L1 write misses by 64.
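With hypothetical counts from a Cachegrind report (the numbers below are
made up, and 64 is the line size taken from the output header), the
conversion is just a multiplication:

```python
LINE_SIZE = 64                  # bytes per line, from the Cachegrind output header

# Hypothetical example counts from a Cachegrind run
d1_misses = 1_200_000           # total L1 data misses (loads + stores)
d1_write_misses = 300_000       # L1 data write misses only

bytes_l2_to_l1 = d1_misses * LINE_SIZE        # lines fetched into L1
bytes_l1_to_l2 = d1_write_misses * LINE_SIZE  # write-allocate traffic toward L2
print(bytes_l2_to_l1, bytes_l1_to_l2)
```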
The same holds for the transfers between L2 and main memory. But there you
have a problem if you want to compare this with real machines: L2 caches
typically use a write-back strategy, which produces less store traffic. You
can use Callgrind's simulator with "--simulate-wb=yes" to simulate L2
write-back; it will give you additional numbers, e.g.
"L2 read miss evicting a dirty line".
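Under a write-back L2, a rough traffic estimate at the L2/memory boundary
might then look like this (a hypothetical sketch with made-up counts; the
dirty-eviction figure is the kind of extra number the write-back simulation
reports, and the accounting here is my assumption, not Callgrind's exact
formula):

```python
LINE_SIZE = 64                  # bytes per line, from the output header

# Hypothetical counts for the L2 <-> main memory boundary
l2_read_misses = 50_000         # lines fetched from memory on read misses
l2_write_misses = 20_000        # write misses also fetch the line (write-allocate)
dirty_evictions = 15_000        # dirty lines written back to memory on eviction

bytes_mem_to_l2 = (l2_read_misses + l2_write_misses) * LINE_SIZE
bytes_l2_to_mem = dirty_evictions * LINE_SIZE   # only dirty lines travel back
print(bytes_mem_to_l2, bytes_l2_to_mem)
```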
Further complications:
* Cachegrind always attributes "modifications" (inc, ...) as loads
(I forget why; in Callgrind, I handle them as stores)
* In reality, most caches incorporate some prefetching scheme, which leads
to higher transfer volumes because of wrong predictions (but lower real
miss counts, because prefetched data often already resides in the cache
when it is accessed)
* x86 has some instructions which are handled "wrong" by the simulation
(e.g. non-temporal stores bypass cache, and software prefetching is ignored)
Regarding bandwidth: this is really difficult, because Cachegrind/Callgrind
cannot measure exact time. You can only make very rough estimates...
Hope this helps,
Josef
|