From: Tom H. <to...@co...> - 2011-06-03 11:09:09
Attachments:
cache.patch
The latest Intel i7s have finally switched to a saner method of
recording the cache details in the cpuid that doesn't rely on a lookup
table to get the details.
The attached patch implements support for this, but there is one problem
(which is why I haven't committed it yet), namely that, at least on the
machine I am testing on, the L3 cache is 12MB but 16-way associative,
which means that it causes cachegrind/callgrind to assert because the
set count (12288) is not a power of two.
The full configuration of the caches on this machine is:
I1: 32KB, 4-way, 64B lines (128 sets)
D1: 32KB, 8-way, 64B lines (64 sets)
L2: 256KB, 8-way, 64B lines (512 sets)
L3: 12MB, 16-way, 64B lines (12288 sets)
Suggestions on the best way to deal with this gratefully received...
Tom
--
Tom Hughes (to...@co...)
http://compton.nu/
From: Julian S. <js...@ac...> - 2011-06-03 11:43:53
On Friday, June 03, 2011, Tom Hughes wrote:
> The latest Intel i7's have finally switched to a saner method of
> recording the cache details in the cpuid that doesn't rely on a lookup
> table to get the details.

'bout time too! It's good that you're looking into this.

> The attached patch implements support for this, but there is one problem
> (which is why I haven't committed it yet) namely that, at least on the
> machine I am testing on, the L3 cache is 12MB but 16-way associative
> which means that it causes cachegrind/callgrind to assert because the
> set count (12288) is not a power of two.

I've seen the same failure when running on a 6-core AMD, which has a
48-way associative L3 as a result. It's as if each core contributes 8
ways to the L3.

> Suggestions on the best way to deal with this gratefully received...

Well, one obvious sick hack is to round both the size and associativity
down to the next lowest power of 2, so everything is power-of-2 sized.

That'd be pretty bad though, since eg Atom has a 24k D1 (or I1? I
forget) and we'd wind up simulating only a 16k cache.

Better is maybe if you can massage the associativity number somehow
without messing with the size. Presumably the loss of accuracy induced
by messing with the associativity alone is much smaller than that
induced by messing with the size. If you see what I mean.

That said, I guess the best fix is to fix the cache simulators
themselves, but that's beyond my ken -- Nick's/Josef's department.

J
From: Josef W. <Jos...@gm...> - 2011-06-03 12:42:04
On Friday 03 June 2011, Julian Seward wrote:
>
> On Friday, June 03, 2011, Tom Hughes wrote:
> > The latest Intel i7's have finally switched to a saner method of
> > recording the cache details in the cpuid that doesn't rely on a lookup
> > table to get the details.
>
> 'bout time too! It's good that you're looking into this.
>
> > The attached patch implements support for this, but there is one problem
> > (which is why I haven't committed it yet) namely that, at least on the
> > machine I am testing on, the L3 cache is 12MB but 16-way associative
> > which means that it causes cachegrind/callgrind to assert because the
> > set count (12288) is not a power of two.
>
> I've seen the same failure when running on a 6-core AMD, which
> has a 48 way associative L3 as a result. It's as if each core
> contributes 8 ways to the L3.
Hmm... already 2 cases then.
> > Suggestions on the best way to deal with this gratefully received...
>
> Well, one obvious sick hack is to round both the size and associativity
> down to the next lowest power of 2, so everything is power-of-2 sized.
>
> That'd be pretty bad though, since eg Atom has a 24k D1 (or I1 ? I forget)
> and we'd wind up simulating only a 16k cache.
Atom works: it has 6-way associativity for the 24k L1 data cache => 64 sets.
> Better is maybe if you can massage the associativity number somehow
> without messing with the size. Presumably the loss of accuracy
> induced by messing with the associativity alone is much smaller than
> that induced by messing with the size. If you see what I mean.
>
> That said, I guess the best fix is fix the cache simulators themselves,
> but that's beyond my ken -- Nick's/Josef's department.
Currently, when calculating the set number from the address, the simulator
does bit shifting. With non-power-of-2 number of sets, that is not possible.
We can use modulo to get the set from the address with the following patch:
--- a/cachegrind/cg_sim.c
+++ b/cachegrind/cg_sim.c
@@ -98,8 +98,8 @@ __attribute__((always_inline)) \
static __inline__ \
void cachesim_##L##_doref(Addr a, UChar size, ULong* m1, ULong *mL) \
{ \
- UInt set1 = ( a >> L.line_size_bits) & (L.sets_min_1); \
- UInt set2 = ((a+size-1) >> L.line_size_bits) & (L.sets_min_1); \
+ UInt set1 = ( a >> L.line_size_bits) % (L.sets); \
+ UInt set2 = ((a+size-1) >> L.line_size_bits) % (L.sets); \
UWord tag = a >> L.tag_shift; \
UWord tag2; \
Int i, j; \
@@ -139,7 +139,7 @@ void cachesim_##L##_doref(Addr a, UChar size, ULong* m1, ULong *mL) \
\
/* Second case: word straddles two lines. */ \
/* Nb: this is a fast way of doing ((set1+1) % L.sets) */ \
- } else if (((set1 + 1) & (L.sets-1)) == set2) { \
+ } else if (((set1 + 1) % (L.sets)) == set2) { \
set = &(L.tags[set1 * L.assoc]); \
However, this has a huge performance impact in all cases. Before (on my
Core i5):
~> time valgrind --tool=cachegrind ls /usr/bin >/dev/null
...
==4991== D1 misses: 57,898 ( 41,713 rd + 16,185 wr)
==4991== LLd misses: 17,322 ( 4,840 rd + 12,482 wr)
...
real 0m0.668s
user 0m0.600s
sys 0m0.060s
After, with patch:
time ./cachegrind-amd64-linux --tool=cachegrind ls /usr/bin >/dev/null
...
==5042== D1 misses: 57,888 ( 41,709 rd + 16,179 wr)
==5042== LLd misses: 17,326 ( 4,846 rd + 12,480 wr)
...
real 0m1.108s
user 0m1.080s
sys 0m0.020s
:(
Not sure yet where the slight changes in D1/LLd misses come from. I assume
the recompilation of cachegrind leads to minimal changes in the address space
layout?
In real hardware, instead of modulo (which obviously takes a lot of time),
you probably would do some hashing of a subset of address bits. However,
that leads to a small non-uniformity in how the address space is spread
over the sets.
We probably need to do something similar, and make it a special case
that is used only when needed.
Josef
From: Julian S. <js...@ac...> - 2011-06-03 21:13:52
On Friday, June 03, 2011, Josef Weidendorfer wrote:
> In real hardware, instead of modulo (which obviously takes a lot of time),
> you probably would do some hashing of a subset of address bits.
>
> We probably need to do something similar. And make this a special case only
> if needed.

Do you have any suggestion for how to implement such a hash fn?

One good thing is that these strange caches generally appear only at L3
or LL, and so the number of references we'd have to process is
relatively small compared to (eg) D1 or I1. That makes the extra costs
not so bad.

re my earlier suggestion about kludging this by simulating a different
associativity ..

If I remember Hennessy & Patterson correctly, associativity above 4
makes very little difference to the hit rate. That's why I suggested
it. eg

> L3: 12MB, 16-way, 64B lines (12288 sets)

If we instead pretended it was 12-way, then there would be 16384 sets.

I would be happy with such an approximation. I don't know if it's
always possible though -- need to think about the address arithmetic
more.

Any preferences (or other proposals)? I don't mind either way. I only
care that the performance hit is minimal.

J
From: Josef W. <Jos...@gm...> - 2011-06-05 17:09:03
On Friday 03 June 2011, Julian Seward wrote:
> On Friday, June 03, 2011, Josef Weidendorfer wrote:
> > In real hardware, instead of modulo (which obviously takes a lot of time),
> > you probably would do some hashing of a subset of address bits.
> >
> > We probably need to do something similar. And make this a special case only
> > if needed.
>
> Do you have any suggestion for how to implement such a hash fn?

Hmm. It should approximate "<memory block> % sets", perhaps using a
precalculated 256-byte lookup table "set[<memory block> & 255]", or
"<memory block> - sets * int(<memory block> * 1/sets)". And do
"* 1/sets" with "*(256/sets) >> 8" or by converting to float and back
again? I'll have to check what is fastest.

The only thing is that our approximation for sure will not be the same
as what some HW does in that case.

> One good thing is that these strange caches generally appear only
> at L3 or LL,

Yes. In HW, it also adds latency to a critical path, so I suppose it is
done only for shared last-level caches. I suppose it is needed there to
be flexible about the number of LL cache partitions (with a tile-based
die layout, an L3 partition would be part of a tile). So if you have a
chip with 6 cores, you would have 6 L3 cache partitions, and get a
multiple of 6 as the total set count even if each L3 partition has a
power-of-2 number of sets.

This strange number of sets has the benefit that an access stream which
always competes for the same set (with lots of conflict cases) is less
probable.

> and so the number of references we'd have to process
> is relatively small compared to (eg) D1 or I1. That makes the
> extra costs not so bad.

Probably. I still think we want to fall back to the fast way if we have
a power-of-2 set count even in LL. The best way is to instrument
different handler calls depending on the LL cache configuration, i.e. a
duplication of all the log_ handlers?

> re my earlier suggestion about kludging this by simulating a different
> associativity ..
>
> If I remember Hennessy & Patterson correctly, associativity above 4 makes
> very little difference to the hit rate. That's why I suggested it. eg
>
> > L3: 12MB, 16-way, 64B lines (12288 sets)
>
> If we instead pretended it was 12-way, then there would be 16384 sets.

Perhaps this indeed is the best default behavior. To actually be able
to check this, it would be nice to still be able to enforce 12288 sets
if requested.

> I would be happy with such an approximation. I don't know if it's
> always possible though -- need to think about the address arithmetic
> more.

In general, for sure not. But for practical numbers, it could work out.

Josef