|
From: Julian S. <js...@ac...> - 2012-12-28 10:52:45
|
Hi Josef,

IIRC you posted a message some time back about some experiments you
did, which allow handling of strange (non-power-of-2) sized LLs in a
more generic way, and would allow the removal of existing kludges that
change the associativity vs #lines.

What's the status on this? If the perf loss is small, it would be good
to get this landed, so that we don't have future breakage on strange
sized caches.

Thanks,
J
|
From: Josef W. <Jos...@gm...> - 2013-01-04 20:21:36
Attachments:
Message as attachment
|
On 28.12.2012 11:52, Julian Seward wrote:
>
> Hi Josef,
>
> IIRC you posted a message some time back about some experiments you
> did, which allow handling of strange (non-power-of-2) sized LLs in a
> more generic way, and would allow the removal of existing kludges that
> change the associativity vs #lines.

Yes. I had to search a bit myself. It's attached.
The patch simply uses the modulo operation for mapping addresses to
cache sets for the LL.

> What's the status on this?

I was undecided on what to do. On average, it slows down cachegrind by
5% on amd64, with the worst case being 17% for ffbench (see below).

> If the perf loss is small, it would be good
> to get this landed, so that we don't have future breakage on strange
> sized caches.

I agree, but just unconditionally always using modulo for LL (as the
patch does now) seems not to be the best solution. Alternatives below.

Josef

> Thanks,
>
> J
|
From: John R. <jr...@bi...> - 2013-01-04 23:28:03
|
> The patch simply uses the modulo operation for mapping addresses to
> cache sets for the LL.

> I was undecided on what to do.

Shift right by log2(line_size), then do a table lookup using the bottom
K bits as index. K and the table are computed during initialization,
and never modified after that.

Typical line size is 32, 64, or 128 bytes; thus shift is 5, 6, or 7.
Typical K is 6, 7, or 8. Statically allocate unsigned char table[512],
and just ignore what is not used.

Using compile-time constants 64==line_size and 6==K (computing only the
table at runtime) does OK for many of today's caches.

--
|
From: Josef W. <Jos...@gm...> - 2013-01-08 10:01:32
|
On 05.01.2013 00:29, John Reiser wrote:
>> The patch simply uses the modulo operation for mapping addresses to
>> cache sets for the LL.
>
>> I was undecided on what to do.
>
> Shift right by log2(line_size), then do a table lookup using the bottom
> K bits as index.

Ok, that's a trade-off between approximation accuracy and slow-down
because of table size (cache pollution). I think I already tried
something like that, but it was more complicated. One needs to check
the difference in result between such an approximation and modulo.

"char" as table entry type is too small if it should result directly
in the set number. If you assume an 8 MB L3 cache with associativity 8
and 64-byte lines, you get 16384 sets. So, using short may be fine.

Ah... wait. You cannot directly store the set number in the table, as
with 16384 sets, you already would need a table of size 16384 to map to
each set only once. The "more complex" solution I mentioned above did
bit shuffling to allow for a small table. In the end, this was slower
than modulo :(

With <block> being the address shifted right by log2(line_size),
instead of

  set = block % number_of_sets

the table lookup looks like

  set = (block & lower_X_bits) | (table[(block >> X) & lower_K_bits] << X)

E.g. with 5*1024 sets, X would be 10, such that the table lookup with
K==6 would approximate a uniform mapping from 64 to 5. Thus, the upper
20% of the sets get mapped only 60/5 times instead of 65/5 (for the
perfect mapping), which gives an error of 7% (= 1/13).

But this shifting and masking was slower on my i5 Nehalem laptop than
just using modulo. Perhaps this is because the modulo operation can
partly be hidden by the out-of-order machinery, while the bit masking
just produces too many instructions (?). Or did I misunderstand your
solution?

Josef

> K and the table are computed during initialization,
> and never modified after that.
> Typical line size is 32, 64, or 128 bytes; thus shift is 5, 6, or 7.
> Typical K is 6, 7, or 8. Statically allocate unsigned char table[512],
> and just ignore what is not used.
>
> Using compile-time constants 64==line_size and 6==K
> (computing only the table at runtime) does OK for many of today's caches.
|
From: John R. <jr...@bi...> - 2013-01-08 23:04:09
|
> With <block> being the address shifted right by log2(line_size),
> instead of
>
>   set = block % number_of_sets
>
> the table lookup looks like
>
>   set = (block & lower_X_bits) | (table[(block >> X) & lower_K_bits] << X)

Whatever the mapping is: compute it (or an approximation) once during
initialization, store the result into a linear table of length 2**K,
and thereafter perform table lookup using the bottom K bits of 'set'.
In general there will be an approximation error. Choose how much error
against the access cost (including caching) to the table.

The cache itself does not perform modulo. Instead the cache adjusts
the associativity. If the cache has 96K lines, then (except for
fully-associative) that is almost certainly 3x 32K lines (or 6x 16K
lines, or 64K+32K lines, etc.), and the associativity is something like
3 times the associativity of each 32K-line piece. "Something like"
because the hierarchy might not be fully distributive; and if so then
only an exact simulation of the hardware will be error-free, and even
the hardware designer may have a hard time telling you exactly what the
cache does: only the VHDL/Verilog/etc. knows for sure.

--
|
From: Josef W. <Jos...@gm...> - 2013-01-09 18:05:31
|
On 09.01.2013 00:05, John Reiser wrote:
>> With <block> being the address shifted right by log2(line_size),
>> instead of
>>
>> set = block % number_of_sets
>>
>> the table lookup looks like
>>
>> set = (block & lower_X_bits) | (table[(block >> X) & lower_K_bits]<<X)
>
> Whatever the mapping is: compute it (or an approximation) once during
> initialization, store the result into a linear table of length 2**K,
> and thereafter perform table lookup using the bottom K bits of 'set'.
Do you mean 'block' here, with the definition from above ('set' is the
result)? Can you be more precise about the expression you would use?
> In general there will be an approximation error. Choose how much
> error against the access cost (including caching) to the table.
>
> The cache itself does not perform modulo. Instead the cache adjusts
> the associativity.
Usually, yes. With the result that the number of sets is a power of 2.
Cachegrind/Callgrind can handle this quite fine, e.g. Intel Atom
has 24kB L1D, with associativity 6. No problem.
But what if the processor tells me that L3 has an associativity which
makes the number of sets not a power of two?
Would you say that if CPUID tells me an associativity of 32, but
the CPU at hand has e.g. only 3 cores and thus L3 distributed on the
3 tiles of the chip, the associativity of 32 must be a lie?
> If the cache has 96K lines, then (except for fully-
> associative) that is almost certainly 3x 32K lines (or 6x 16K lines,
> or 64K+32K lines, etc.),
> and the associativity is something like 3 times the associativity
> of each 32K-line piece. "Something like" because the hierarchy
> might not be fully distributive;
This would mean that different sets can have a different number of ways
each, wouldn't it?
As we do not know the hardware in that case anyway, I think 'modulo' is
the simplest thing to do to approximate any possible real hardware.
Josef
> and if so then only an
> exact simulation of the hardware will be error-free,
> and even the hardware designer may have a hard time telling you
> exactly what the cache does: only the VHDL/Verilog/etc. knows for sure.
|
|
From: John R. <jr...@bi...> - 2013-01-10 13:55:27
|
On 01/09/2013 10:05 AM, Josef Weidendorfer wrote:
> On 09.01.2013 00:05, John Reiser wrote:
>>> With <block> being the address shifted right by log2(line_size),
>>> instead of
>>>
>>> set = block % number_of_sets
>>>
>>> the table lookup looks like
>>>
>>> set = (block & lower_X_bits) | (table[(block >> X) & lower_K_bits]<<X)
>>
>> Whatever the mapping is: compute it (or an approximation) once during
>> initialization, store the result into a linear table of length 2**K,
>> and thereafter perform table lookup using the bottom K bits of 'set'.
>
> Do you mean 'block' here, with the definition from above ('set' is the
> result) ? Can you be more precise about the expression you would use?
The schema is:
bits = table[mask0 & (address >> shift0)];
t1 = mask1 & (bits >> shift1);
t2 = mask2 & (bits >> shift2);
The initialization phase should compute the masks and the shifts,
initialize the table, and generate a subroutine which computes
t1 and t2 from address and the table; with "constants" mask{0,1,2}
and shift{0,1,2} propagated
so that the only reference to data memory is indexing into the table.
Two values t1 and t2 are returned much like division returns
quotient and remainder.
> [snip]
> But if the processor tells me that L3 has an associativity which makes
> the number of sets being not a power of two?
> Would you say that if CPUID tells me an associativity of 32, but
> the CPU at hand has e.g. only 3 cores and thus L3 distributed on the
> 3 tiles of the chip, the associativity of 32 must be a lie?
Either the chip has some "secret sauce" which handles this,
or L3 is on the "other side" of the bus: L3 is a _memory_ cache,
L1 and L2 are _CPU_ caches. Yes, the cores contend (and queue)
for such a memory cache.
--
|
|
From: Josef W. <Jos...@gm...> - 2013-01-10 15:56:30
|
On 10.01.2013 14:56, John Reiser wrote:
> On 01/09/2013 10:05 AM, Josef Weidendorfer wrote:
>> On 09.01.2013 00:05, John Reiser wrote:
>>>> With <block> being the address shifted right by log2(line_size),
>>>> instead of
>>>>
>>>> set = block % number_of_sets
>>>>
>>>> the table lookup looks like
>>>>
>>>> set = (block & lower_X_bits) | (table[(block >> X) & lower_K_bits]<<X)
>>>
>>> Whatever the mapping is: compute it (or an approximation) once during
>>> initialization, store the result into a linear table of length 2**K,
>>> and thereafter perform table lookup using the bottom K bits of 'set'.
>>
>> Do you mean 'block' here, with the definition from above ('set' is the
>> result) ? Can you be more precise about the expression you would use?
>
> The schema is:
> bits = table[mask0 & (address >> shift0)];
> t1 = mask1 & (bits >> shift1);
> t2 = mask2 & (bits >> shift2);
>
> The initialization phase should compute the masks and the shifts,
> initialize the table, and generate a subroutine which computes
> t1 and t2 from address and the table;
Why two values t1 and t2? I want to get the set number back,
to search for tag matches in this set. So t2 (i.e. the remainder)
should be enough to return from this function.
> with "constants" mask{0,1,2}
> and shift{0,1,2} propagated
> so that the only reference to data memory is indexing into the table.
I think this actually is the crucial point. Some shift/mask values
depend on cache parameters only known at runtime. To make use of
the fact that these actually are "constant" within a run, we need
to generate that function.
While Valgrind allows generating code, this currently only works
as part of the translation step of a superblock. Of course the
compiler can produce variants for fixed mask/shift values, and at
runtime, callbacks could be instrumented to handlers using these fixed
variants as needed. But already in Cachegrind, this would lead to
an explosion of dirty handlers.
I actually do not care much about 'modulo' making the simulator slower
in the case of strange cache parameters. But currently my idea was to
always use 'modulo' or that table-lookup stuff, even if the cache
parameters are nice. But this also makes the nice cases slower than
now, and that is bad.
For this case, it probably would be good to make VEX usable as a code
generator for arbitrary functions prepared by tools. But on the other
hand, such a generated function of course cannot be inlined, making
the benefit unclear to me if it is to be called from compiler-generated
dirty handlers.
Josef
> Two values t1 and t2 are returned much like division returns
> quotient and remainder.
>
>> [snip]
>> But if the processor tells me that L3 has an associativity which makes
>> the number of sets being not a power of two?
>> Would you say that if CPUID tells me an associativity of 32, but
>> the CPU at hand has e.g. only 3 cores and thus L3 distributed on the
>> 3 tiles of the chip, the associativity of 32 must be a lie?
>
> Either the chip has some "secret sauce" which handles this,
> or L3 is on the "other side" of the bus: L3 is a _memory_ cache,
> L1 and L2 are _CPU_ caches. Yes, the cores contend (and queue)
> for such a memory cache.
>
|
|
From: Julian S. <js...@ac...> - 2013-01-10 16:15:43
|
On 01/10/2013 04:56 PM, Josef Weidendorfer wrote:

> I think this actually is the crucial point. Some shift/mask values
> depend on cache parameters only known at runtime.

Since the cache parameters are known at tool-startup time, can Callgrind
compute table(s) containing magic numbers at startup, and use the
table(s) from within the cache functions?

J
|
From: Josef W. <Jos...@gm...> - 2013-01-10 18:19:21
|
On 10.01.2013 17:14, Julian Seward wrote:
> On 01/10/2013 04:56 PM, Josef Weidendorfer wrote:
>
>> I think this actually is the crucial point. Some shift/mask values
>> depend on cache parameters only known at runtime.
>
> Since the cache parameters are known at tool-startup time, can Callgrind
> compute table(s) containing magic numbers at startup, and use the
> table(s) from within the cache functions?

Sure, but e.g. this way, the function calculating the set number from
an address never will be just one memory access into the table, as the
shift and mask values for the table lookup still are variables and
cannot be kept in registers.

As mentioned before, in my limited experiments, the shifting/masking/
table-lookup stuff was slower than simply doing a 'modulo'. I think I
need to revive that version to present numbers.

If you check the cachegrind code, you will see that, as most memory
accesses fall within one cache line, there is only one 'modulo'
operation per memory access, and only when it misses L1. I.e. for a
'modulo' operation to happen, we already searched all ways of an L1 set
for a match. Still, using 'modulo' slows down cachegrind by 7% on
average.

Do you see a chance to use VEX to generate functions callable in dirty
helpers? Can VEX generate code into a supplied buffer, with correct
calling convention?

Josef