From: Josef W. <Jos...@gm...> - 2008-06-06 17:37:59
On Friday 06 June 2008, Nicholas Nethercote wrote:
> On Thu, 5 Jun 2008, Josef Weidendorfer wrote:
> > The LRU list of L1 now is (a1, a2), and the one of L2 is still (a2, a1).
>
> In short: when an L1 hit occurs, the L1 MRU list is updated, but the L2 MRU
> list is not, right?
Yes.
> Hmm, I think your analysis is correct. I hope anyone
> that is using Cachegrind for serious analysis is plugging their own cache
> simulator into it.
>
> Making the simulation correct so that it is properly inclusive seems like a
> good thing to do.
Yes, but of course this depends on both the performance cost and the effect on the results.
> Would you be able to work up a patch that does this,
First try:
Modify the simulation macro to add a HIT_TREATMENT parameter, and for I1/D1, do an L2 reference on a hit, the same way as on a miss. This adjusts the L2 MRU lists.
With an inclusive cache, an L2 reference after an L1 hit should always give a
hit in L2, too...
So with the macro definition (and corresponding additions in the macro body)
#define CACHESIM(L, MISS_TREATMENT, HIT_TREATMENT) ...
we can check for the inclusion property with
CACHESIM(D1, { (*m1)++; cachesim_L2_doref(a, size, m1, m2); },
{ cachesim_L2_doref(a, size, 0, 0); } );
Note that for an L1 hit, we call the L2 reference with m1/m2 as null pointers.
Thus, if there is an L2 miss instead, we get a segfault.
And indeed, this already crashes with small programs on my Pentium-M laptop :-(
So even with this addition, the cache simulation is not really fully inclusive...
I can only assume what is going on:
On my laptop, the Pentium-M has 8-way associative L1D, L1I and L2 caches (which
Cachegrind also uses as its parameters). For addresses mapping into the same set,
L1D and L1I *each* can hold 8 cache lines, while the unified L2 can hold only 8 lines in total.
So there is the possibility that lines are evicted from L2 which still are in
L1D or L1I, even with adjusted MRU lists.
To really get inclusive behavior, every time a cache line is evicted (from L1D, L1I or
L2), we would need to invalidate it in the other caches. This would probably get
really messy.
In other words, I do not think that the above change is worth it.
Here are some numbers with running bzip2 on a 1.2MB binary file
("time valgrind --tool=cachegrind bzip2 -c libc.so.6 >/dev/null"):
ORIGINAL Cachegrind:
==11957== L2 refs: 16,674,882 ( 9,690,259 rd + 6,984,623 wr)
==11957== L2 misses: 1,504,180 ( 888,988 rd + 615,192 wr)
==11957== L2 miss rate: 0.0% ( 0.0% + 0.4% )
real 0m43.337s
user 0m42.019s
sys 0m0.088s
MODIFIED Cachegrind (L2 ref also on L1 hit, but these refs are not counted in the
results):
==12708== L2 refs: 16,674,882 ( 9,690,259 rd + 6,984,623 wr)
==12708== L2 misses: 1,506,936 ( 890,131 rd + 616,805 wr)
==12708== L2 miss rate: 0.0% ( 0.0% + 0.4% )
real 0m53.941s
user 0m53.603s
sys 0m0.128s
One can see that there really are changes in the L2 misses: 0.18% more.
But the "better" simulation takes 27% more time :(
Of course, "bzip2" is a bad example for this "extension": before the change, most
accesses were L1 hits. With the modification, there always is a lookup in L2, too.
IMHO the easiest way to handle this is to note in the manual that the
"inclusive" cache property is not really met by Cachegrind's simulation,
but that the results should be quite close (close enough for the purpose of
analyzing cache performance).
Josef