|
From: Florian K. <br...@ac...> - 2012-09-21 17:01:27
|
We had a discussion about this a few weeks back. Here are my thoughts.
Objective:
----------
Have coregrind query the properties of the host's cache system. Make
this information available in a simple interface that hides all
architecture-specific details, e.g. the existence of a cpuid instruction.
Benefits:
---------
This is conceptually cleaner than the status quo. Detection of cache
properties does not belong in the realm of the tools. Additionally,
if several tools required information about caches, there would be
code duplication.
Representation of cache information:
------------------------------------
/* The various kinds of caches */
typedef enum {
DATA_CACHE,
INSN_CACHE,
DATA_INSN_CACHE // combined data and insn cache
} cache_kind;
/* Information about a particular cache */
typedef struct {
cache_kind kind;
UInt level; /* level this cache is at, e.g. 1 for L1 cache */
UInt sizeB; /* size of this cache in bytes */
UInt line_sizeB; /* cache line size in bytes */
UInt associativity;
} cache_t;
/* Information about the cache system as a whole */
typedef struct {
UInt num_levels;
UInt num_caches;
/* Unordered array of caches for this host. NULL if there are
no caches. Users can assume that the array contains at most one
cache of a given kind per cache level. */
cache_t *caches;
} cacheinfo_t;
What is shown here has all the info that is needed for cachegrind
and callgrind purposes. It can trivially be extended to provide more
detail if that becomes necessary. Clearly, arranging the data this
way will be easy on some architectures and more cumbersome on others.
The objective is that it should be easy for the users to extract what
they need. The code to build up this data would reside in m_machine.c.
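To make the intended usage concrete, here is a sketch of how a consumer might query the proposed structures. The `find_cache` helper is hypothetical (not part of the proposal); the type definitions follow the draft above, and the example values are the s390 numbers quoted later in this thread.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int UInt;

/* The various kinds of caches (as proposed; "unified" naming was
   later preferred for the combined kind) */
typedef enum {
   DATA_CACHE,
   INSN_CACHE,
   DATA_INSN_CACHE   // combined data and insn cache
} cache_kind;

/* Information about a particular cache */
typedef struct {
   cache_kind kind;
   UInt level;
   UInt sizeB;
   UInt line_sizeB;
   UInt associativity;
} cache_t;

/* Information about the cache system as a whole */
typedef struct {
   UInt num_levels;
   UInt num_caches;
   cache_t *caches;
} cacheinfo_t;

/* Hypothetical helper: find the cache of a given kind at a given
   level. Returns NULL if not present. Relies on the documented
   invariant that the array contains at most one cache of a given
   kind per level, so the first match is the only match. */
static const cache_t *find_cache(const cacheinfo_t *ci,
                                 cache_kind kind, UInt level)
{
   for (UInt i = 0; i < ci->num_caches; i++) {
      if (ci->caches[i].kind == kind && ci->caches[i].level == level)
         return &ci->caches[i];
   }
   return NULL;
}
```

Because the array is unordered, consumers always scan it rather than indexing by level.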
Users of cache information:
---------------------------
I found at least these (there may be more):
(1) cachegrind / callgrind
These are perfectly served by cacheinfo_t as shown above.
(2) VG_(invalidate_icache)
We need to extend the above representation, by, say, adding a
"Bool icaches_maintain_coherence;" to cacheinfo_t
(3) VEX: VexArchInfo contains ppc_cache_line_szB
The cache line size is needed to implement the icbi insn.
Can be obtained from cacheinfo_t
(4) VEX: functions returning VexInvalRange
The returned address range indicates whether some cache invalidation
needs to occur later. What is returned here may, in general, depend
on the particular machine model of a given architecture. So we need
to query the cache info before returning anything.
A different possibility, which may even be cleaner, is to make these
functions *always* return the address range for the insns that were
patched. In that case we would not need cache information here. The
call site would decide about invalidation.
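The second possibility can be sketched as follows. The field names of VexInvalRange are guessed here (the real definition lives in VEX's headers and may differ), and the policy function is purely illustrative of "the call site decides":

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long HWord;

/* Assumed shape of the range returned by the patchers; the actual
   VEX definition may differ. */
typedef struct {
   HWord start;
   HWord len;
} VexInvalRange;

/* Hypothetical call-site policy: the patcher always reports what it
   patched, and the caller decides -- from cache information it
   queried elsewhere -- whether an icache flush is actually needed. */
static bool needs_icache_flush(VexInvalRange vir,
                               bool icaches_maintain_coherence)
{
   if (vir.len == 0)
      return false;   /* nothing was patched */
   return !icaches_maintain_coherence;
}
```

This keeps all machine-model knowledge out of the patchers themselves, which is the attraction of the scheme.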
How to make cache information available:
----------------------------------------
Ideally we want the cache information to be provided in one spot only;
be that a function call returning it or some persistent data structure
containing it. A related question is where the definition of cacheinfo_t
resides. If it does not reside in VEX then VEX would be dependent on
coregrind and that's a no-go. So, adding a cacheinfo_t typed member
to VexArchInfo looks natural. It could be filled in the same way hwcaps
are currently filled in in m_machine.c.
The type definitions (cache_t, cacheinfo_t etc) would be included in
libvex.h
Tools can call the existing VG_(machine_get_VexArchInfo) to get the
cache information. The function would have to be exposed through
pub_tool_machine.h so it becomes available.
Alternatively, VexArchInfo could be passed to the "instrument" function
of the tools.
Cache information will be determined after hwcaps and machine models
have been determined. The rationale is that not all cache information
can be figured out automatically (e.g. on s390 we cannot figure out
whether icaches are coherent). Some such information may depend on
the machine model (part of hwcaps); for those cases we simply know
what the value is and can fill it in.
I can work on an implementation for this, but first there should be
agreement:
- about the data structures to represent cache info
- how tools access it (function call or passed to the instrument
function)
Florian
|
|
From: Josef W. <Jos...@gm...> - 2012-09-21 18:20:45
|
On 21.09.2012 19:01, Florian Krohm wrote:
> DATA_INSN_CACHE // combined data and insn cache

The standard term here is "unified", ie. UNIFIED_CACHE ?

Josef
|
|
From: Florian K. <br...@ac...> - 2012-09-29 17:50:36
|
On 09/21/2012 02:18 PM, Josef Weidendorfer wrote:
> On 21.09.2012 19:01, Florian Krohm wrote:
>> DATA_INSN_CACHE // combined data and insn cache
>
> The standard term here is "unified", ie. UNIFIED_CACHE ?

Hello Josef,

I finally have a bit of time to look at this again.
Sure, UNIFIED_CACHE sounds good.

I read the cachegrind documentation. It mentions that for a machine
with 3 levels of cache, the L3 cache will be used instead of L2.
What about a machine with 4 cache levels such as the one below (s390)?
Should the L4 cache be used instead of L2?

Florian

L1 topology: separate data and instruction; private
L1 cache line size data: 256
L1 cache line size insn: 256
L1 total cachesize data: 131072
L1 total cachesize insn: 65536
L1 set. assoc. data: 8
L1 set. assoc. insn: 4
L2 topology: unified data and instruction; private
L2 cache line size: 256
L2 total cachesize: 1572864
L2 set. assoc.: 12
L3 topology: unified data and instruction; shared
L3 cache line size: 256
L3 total cachesize: 25165824
L3 set. assoc.: 12
L4 topology: unified data and instruction; shared
L4 cache line size: 256
L4 total cachesize: 201326592
L4 set. assoc.: 24
|
|
From: Josef W. <Jos...@gm...> - 2012-10-01 12:50:06
|
On 29.09.2012 19:50, Florian Krohm wrote:
> On 09/21/2012 02:18 PM, Josef Weidendorfer wrote:
>> On 21.09.2012 19:01, Florian Krohm wrote:
>>> DATA_INSN_CACHE // combined data and insn cache
>>
>> The standard term here is "unified", ie. UNIFIED_CACHE ?
>
> Hello Josef,
>
> I finally have a bit of time to look at this again.
> Sure, UNIFIED_CACHE sounds good.
>
> I read the cachegrind documentation. It mentions that for a machine
> with 3 levels of cache, the L3 cache will be used instead of L2.
> What about a machine with 4 cache levels such as the one below (s390)?
> Should the L4 cache be used instead of L2?

Short answer: Yes, I think so.

Long answer:
The cache simulation model in Cachegrind/Callgrind currently is a
synchronous (ie. one access at a time), 2-level, (not-strictly)
inclusive cache hierarchy with private caches for instruction and data
at L1, and a unified L2. All caches do LRU replacement and are
write-allocate. The model does not care about write-back vs.
write-through, as this difference would not change hit/miss numbers.
All threads go through this single hierarchy (ie. no coherency issues
possible, no concurrency miss numbers, and no MESI/MOESI/MESIF...).
Callgrind's model optionally adds a simple prefetcher at the L2 level.

As far as I know, this model was motivated by Intel processors with two
cache levels. With such processors, the mapping of real cache parameters
is clear. When processors with 3 levels came up, there was some
discussion about useful mappings of real parameters to the simulator
model. The mapping used should be able to catch cache-inefficient
memory access behavior. The most important issue is of course not using
the on-chip caches at all, which becomes visible when looking at the
behaviour of the last-level cache. Thus, the last-level cache should
always be mapped to the L2 in the model.

Another important problem is a huge number of conflict misses due to
low associativity. This can already happen in L1 (last-level cache
associativity often is much higher, and just looking at that level
would not show up such a problem). So it seems right to use real L1
parameters for the L1 in the model.

The usefulness of a 3-level cache hierarchy comes from the wish to
share the last level among multiple cores. For that, you want to reduce
the number of references coming from the cores, resulting in better
scalability of the shared cache level. As L1 has to be small to be
fast, it's good to have an L2 to reduce the references to the shared
level. As Cachegrind's current model does not have a private L1 per
core anyway, it seemed not really that important to extend the
simulator to 3 levels. The same argumentation seems fine for 4 levels.

On the other hand, from my point of view, we can always think about
making the cache model more flexible, e.g. for detecting cache-line
bouncing between private L1 caches (ie. concurrency misses), or for
understanding NUMA issues. However, for such issues, it is important to
see slowdowns because of limited bandwidths of shared resources. This
needs time simulation, and makes everything a little bit more
complex :-)

Do you know if the L4 on s390 covers all memory modules, or is it
partitioned for separate modules (as it was with Sun/Oracle Niagara)?
This can make a difference: in the latter case, if all accesses go to
one module, the cache size effectively gets much smaller.

Josef

> Florian
>
> L1 topology: separate data and instruction; private
> L1 cache line size data: 256
> L1 cache line size insn: 256
> L1 total cachesize data: 131072
> L1 total cachesize insn: 65536
> L1 set. assoc. data: 8
> L1 set. assoc. insn: 4
> L2 topology: unified data and instruction; private
> L2 cache line size: 256
> L2 total cachesize: 1572864
> L2 set. assoc.: 12
> L3 topology: unified data and instruction; shared
> L3 cache line size: 256
> L3 total cachesize: 25165824
> L3 set. assoc.: 12
> L4 topology: unified data and instruction; shared
> L4 cache line size: 256
> L4 total cachesize: 201326592
> L4 set. assoc.: 24
|
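The LRU, set-associative behaviour described in the long answer can be illustrated with a tiny single-set model. This is a sketch for exposition only; the real simulator in Cachegrind is organized quite differently for speed:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_ASSOC 32

/* One cache set, MRU entry at index 0, LRU at index used-1.
   Fills always go to the front and evictions happen at the back,
   so the valid entries form a prefix of the array. */
typedef struct {
   unsigned assoc;
   unsigned long tags[MAX_ASSOC];
   unsigned used;   /* number of valid entries */
} CacheSet;

/* Access one tag; returns true on hit. Implements LRU replacement
   by moving the touched entry to the MRU position. */
static bool set_access(CacheSet *s, unsigned long tag)
{
   for (unsigned i = 0; i < s->used; i++) {
      if (s->tags[i] == tag) {
         /* hit: shift entries 0..i-1 down one slot, tag goes to MRU */
         memmove(&s->tags[1], &s->tags[0], i * sizeof s->tags[0]);
         s->tags[0] = tag;
         return true;
      }
   }
   /* miss: insert at MRU; if the set was full, the LRU entry at the
      back falls off */
   if (s->used < s->assoc)
      s->used++;
   memmove(&s->tags[1], &s->tags[0], (s->used - 1) * sizeof s->tags[0]);
   s->tags[0] = tag;
   return false;
}
```

Since the model is synchronous and single-hierarchy, this is all the state a set needs: no coherency bits, no timing.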
|
From: Josef W. <Jos...@gm...> - 2012-10-01 13:08:53
|
On 29.09.2012 19:50, Florian Krohm wrote:
> Should the L4 cache be used instead of L2 ?

Another thing: I thought the simulator does not talk about L2 any
longer, but about LL (last level). At least, this is the case for the
event names. To not confuse users, using L4 parameters is the right
thing.

Josef
|
|
From: Julian S. <js...@ac...> - 2012-09-24 09:56:15
|
I agree with almost all of this, with only minor comments.
> /* The various kinds of caches */
> typedef enum {
> DATA_CACHE,
> INSN_CACHE,
> DATA_INSN_CACHE // combined data and insn cache
> } cache_kind;
Like JosefW I would prefer COMBINED_CACHE to DATA_INSN_CACHE.
> /* Information about a particular cache */
> typedef struct {
> cache_kind kind;
> UInt level; /* level this cache is at, e.g. 1 for L1 cache */
> UInt sizeB; /* size of this cache in bytes */
> UInt line_sizeB; /* cache line size in bytes */
> UInt associativity;
> } cache_t;
>
> /* Information about the cache system as a whole */
> typedef struct {
> UInt num_levels;
> UInt num_caches;
> /* Unordered array of caches for this host. NULL if there are
> no caches. Users can assume that the array contains at most one
> cache of a given kind per cache level. */
> cache_t *caches;
> } cacheinfo_t;
IIUC, num_levels is the number of levels the machine really has, and
num_caches <= num_levels is the number of caches for which we have
descriptions. Which may be less than the number of levels. Yes?
> (1) cachegrind / callgrind
> These are perfectly served by cacheinfo_t as shown above.
>
> (2) VG_(invalidate_icache)
> We need to extend the above representation, by, say, adding a
> "Bool icaches_maintain_coherence;" to cacheinfo_t
Am I right to understand (from your comments later) that some s390s
have icaches which are coherent, and some do not? So you need to know
the model number in order to decide whether or not
VG_(invalidate_icache) is a no-op?
> (3) VEX: VexArchinfo contains ppc_cache_line_szB
> The cache line size is needed to implement the icbi insn.
> Can be obtained from cacheinfo_t
>
> (4) VEX: functions returning VexInvalRange
> The returned address range indicates whether some cache invalidation
> needs to occur later. What is returned here may, in general, depend
> on the particular machine model of a given architecture. So we need
> to query the cache info before returning anything.
> A different possibility, which may even be cleaner, is to make these
> functions *always* return the address range for the insns that were
> patched. In that case we would not need cache information here.
Yes. I would prefer this solution. The patchers just return the address
range they patched, and it is the call site's problem to figure out what to
do about cache coherence.
> How to make cache information available:
> ----------------------------------------
> Ideally we want the cache information to be provided in one spot only;
> be that a function call returning it or some persistent data structure
> containing it. A related question is where the definition of cacheinfo_t
> resides. If it does not reside in VEX then VEX would be dependent on
> coregrind and that's a no-go. So, adding a cacheinfo_t typed member
> to VexArchInfo looks natural.
Yes.
> It could be filled in the same way hwcaps
> are currently filled in in m_machine.c.
> The type definitions (cache_t, cacheinfo_t etc) would be included in
> libvex.h
>
> Tools can call the existing VG_(machine_get_VexArchInfo) to get the
> cache information. The function would have to be exposed through
> pub_tool_machine.h so it becomes available.
> Alternatively, VexArchInfo could be passed to the "instrument" function
> of the tools.
I would prefer both, in fact -- pass it to the instrument function, and
allow tools to make ad-hoc calls to get it, if they want.
> I can work on an implementation for this but first there should be
> agreement
>
> - about the data structures to represent cache info
What you propose seems good to me.
> - how tools access it (function call or passed to the instrument
> function)
Both!
---------
A couple of other comments:
* pls can we use type names beginning w/ capitals, as in the rest
of the code base
* another piece of complexity to be aware of (a small one, though) is
that ARM sets its hwcaps not only by the normal game of trying insns
and seeing which get SIGILL (in m_machine.c), but also by looking at
the AUXV entries on the stack at startup.
Overall, sounds good. It's quite a tricky area to tidy up since I think
it has grown without much planning, over the years.
J
|
|
From: Florian K. <br...@ac...> - 2012-09-24 12:22:06
|
On 09/24/2012 05:55 AM, Julian Seward wrote:
>
>> /* The various kinds of caches */
>> typedef enum {
>> DATA_CACHE,
>> INSN_CACHE,
>> DATA_INSN_CACHE // combined data and insn cache
>> } cache_kind;
>
> Like JosefW I would prefer COMBINED_CACHE to DATA_INSN_CACHE.
Josef did suggest UNIFIED_CACHE as it is the standard term. I think I've
seen "unified cache" more often than "combined cache", so I'm going to
run with that unless you object.
>> /* Information about the cache system as a whole */
>> typedef struct {
>> UInt num_levels;
>> UInt num_caches;
>> /* Unordered array of caches for this host. NULL if there are
>> no caches. Users can assume that the array contains at most one
>> cache of a given kind per cache level. */
>> cache_t *caches;
>> } cacheinfo_t;
>
> IIUC, num_levels is the number of levels the machine really has,
Yes
> and
> num_caches <= num_levels is the number of caches for which we have
> descriptions. Which may be less than the number of levels. Yes?
Yes, num_caches is the number of caches for which we have descriptions.
But it can be >= num_levels. Think of a machine that has L1 data and L1
insn cache and L2 data and L2 insn cache. So num_caches == 4 and
num_levels == 2.
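That relationship can be sketched in code: num_caches is simply the array length, while num_levels is the deepest level present. The helper below is hypothetical, not proposed API, and uses only the fields it needs:

```c
#include <assert.h>

typedef unsigned int UInt;

typedef enum { DATA_CACHE, INSN_CACHE, DATA_INSN_CACHE } cache_kind;

/* Only the fields relevant to this illustration */
typedef struct {
   cache_kind kind;
   UInt level;
} cache_entry;

/* num_levels is the maximum level found in the (unordered) array;
   num_caches may exceed it when levels are split into separate data
   and insn caches. */
static UInt max_level(const cache_entry *caches, UInt num_caches)
{
   UInt m = 0;
   for (UInt i = 0; i < num_caches; i++)
      if (caches[i].level > m)
         m = caches[i].level;
   return m;
}
```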
> Am I right to understand (from your comments later) that some s390s
> have icaches which are coherent, and some do not? So you need to know
> the model number in order to decide whether or not VG_(invalid_icache)
> is a no-op?
No that is not the case for s390. But it might be for some architecture.
I was trying to propose something to cover the general case.
>> (4) VEX: functions returning VexInvalRange
>> The returned address range indicates whether some cache invalidation
>> needs to occur later. What is returned here may, in general, depend
>> on the particular machine model of a given architecture. So we need
>> to query the cache info before returning anything.
>> A different possibility, which may even be cleaner, is to make these
>> functions *always* return the address range for the insns that were
>> patched. In that case we would not need cache information here.
>
> Yes. I would prefer this solution. The patchers just return the address
> range they patched, and it is the call site's problem to figure out what to
> do about cache coherence.
Excellent.
>> Tools can call the existing VG_(machine_get_VexArchInfo) to get the
>> cache information. The function would have to be exposed through
>> pub_tool_machine.h so it becomes available.
>> Alternatively, VexArchInfo could be passed to the "instrument" function
>> of the tools.
>
> I would prefer both, in fact -- pass it to the instrument function, and
> allow tools to make ad-hoc calls to get it, if they want.
OK. I can see how it might be convenient to have a functional interface.
Might eliminate the need for a global variable in the tools.
> A couple of other comments:
>
> * pls can we use type names beginning w/ capitals, as in the rest
> of the code base
Sure, will do.
>
> * another piece of complexity to be aware of (a small one, though) is
> that ARM sets its hwcaps not only by the normal game of trying insns
> and seeing which get SIGILL (in m_machine.c), but also by looking at
> the AUXV entries on the stack at startup.
I think ppc does this, too.
Florian
|
|
From: Christian B. <bor...@de...> - 2012-09-24 10:09:37
|
On 21/09/12 19:01, Florian Krohm wrote:
> Cache information will be determined after hwcaps and machine models
> have been determined. The rationale is that not all cache information
> can be figured out automatically (e.g. on s390 we cannot figure out
> whether icaches are coherent). Some such information may depend on
> the machine model (part of hwcaps) for which we just know what it is
> and can fill it in.

Florian,

AFAIK icaches on s390 are always coherent.

Christian
|
|
From: Carl E. L. <ce...@li...> - 2012-09-24 18:20:12
|
On Fri, 2012-09-21 at 13:01 -0400, Florian Krohm wrote:
> We had a discussion about this a few weeks back. Here are my thoughts.
>
>
> Objective:
> ----------
> Have coregrind query the properties of the host's cache system. Make
> this information available in a simple interface that hides all
> architecture-specific details e.g. the existence of a cpuid instruction.
>
>
> Benefits:
> ---------
> This is conceptually cleaner than the status quo. Detection of cache
> properties does not belong in the realm of the tools. Additionally,
> if several tools required information about caches there would be code
> duplication.
>
>
> Representation of cache information:
> ------------------------------------
>
> /* The various kinds of caches */
> typedef enum {
> DATA_CACHE,
> INSN_CACHE,
> DATA_INSN_CACHE // combined data and insn cache
> } cache_kind;
>
> /* Information about a particular cache */
> typedef struct {
> cache_kind kind;
> UInt level; /* level this cache is at, e.g. 1 for L1 cache */
> UInt sizeB; /* size of this cache in bytes */
> UInt line_sizeB; /* cache line size in bytes */
> UInt associativity;
> } cache_t;
>
> /* Information about the cache system as a whole */
> typedef struct {
> UInt num_levels;
> UInt num_caches;
> /* Unordered array of caches for this host. NULL if there are
> no caches. Users can assume that the array contains at most one
> cache of a given kind per cache level. */
> cache_t *caches;
> } cacheinfo_t;
>
I was just looking into the POWER architectures a bit more to make sure
they would be reasonably easy to support. Just to be clear, are there
any Valgrind restrictions on the cache sizes, specifically must be a
power of 2?
I see the Power 5 has an L2 unified cache of 1.875MB and an L3 unified,
shared cache of size 36MB. I was doing some cache studies last year and
I remember there being issues where the cache size must be a power of 2.
I don't remember what tool it was now that had that restriction.
Similarly, must the Valgrind cache associativity be a power of 2? The
POWER 5 processor's L2 cache is 10-way set associative.
<snip>
|
|
From: Josef W. <Jos...@gm...> - 2012-09-24 18:43:56
|
On 24.09.2012 20:18, Carl E. Love wrote:
> I was just looking into the POWER architectures a bit more to make sure
> they would be reasonably easy to support. Just to be clear, are there
> any Valgrind restrictions on the cache sizes, specifically must be a
> power of 2?

The simulator in Cachegrind/Callgrind currently has various constraints
on the sizes:
* the cache line size in bytes must be a power of two (at least 16B)
* the number of sets (= cache size / line size / associativity) must be
  a power of two. This is used to allow fast set calculation from the
  address of a memory access.

It is already the case that values are sometimes adjusted (with a
corresponding warning printed) to make the simulator happy. If that
seems impossible, the simulator will error out and ask the user to
specify parameters via the command line. However, all of this is the
responsibility of the tool. The interface asking for hardware
parameters seems fine to me, and should be able to return any numbers.

> I see the Power 5 has an L2 unified cache of 1.875MB and an L3 unified,
> shared cache of size 36MB. I was doing some cache studies last year and
> I remember there being issues where the cache size must be a power of 2.
> I don't remember what tool it was now that had that restriction.
>
> Similarly, must the Valgrind cache associativity be a power of 2?

No.

> The POWER 5 processor's L2 cache is 10-way set associative.

Josef
|
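Josef's set-count constraint can be checked mechanically. Note that a non-power-of-two associativity can still yield a power-of-two set count: the s390 L2 quoted earlier in the thread gives 1572864 / 256 / 12 = 512 sets. A sketch (the line size for the POWER5 case in the test is an assumption, picked only for illustration):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int UInt;

static bool is_pow2(UInt x)
{
   return x != 0 && (x & (x - 1)) == 0;
}

/* Number of sets as defined above: cache size / line size / assoc.
   Returns 0 if the parameters do not divide evenly. */
static UInt num_sets(UInt sizeB, UInt line_sizeB, UInt assoc)
{
   if (line_sizeB == 0 || sizeB % line_sizeB != 0)
      return 0;
   UInt lines = sizeB / line_sizeB;
   if (assoc == 0 || lines % assoc != 0)
      return 0;
   return lines / assoc;
}
```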
|
From: Julian S. <js...@ac...> - 2012-09-24 18:35:50
|
On Monday, September 24, 2012, Carl E. Love wrote:
> I was just looking into the POWER architectures a bit more to make sure
> they would be reasonably easy to support. Just to be clear, are there
> any Valgrind restrictions on the cache sizes, specifically must be a
> power of 2?
>
> I see the Power 5 has an L2 unified cache of 1.875MB and an L3 unified,
> shared cache of size 36MB. I was doing some cache studies last year and
> I remember there being issues where the cache size must be a power of 2.
> I don't remember what tool it was now that had that restriction.
>
> Similarly, must the Valgrind cache associativity be a power of 2? The
> POWER 5 processor's L2 cache is 10-way set associative.

The same thing happens with server-level CPUs from Intel and AMD.

I think Florian is presenting a general mechanism that allows recording
of the cache details, regardless of whether they are something the
various tools can handle, or not. And I think that's the right approach.

As you say, though, some of the cache simulators have problems with
non-power-of-2 sizes or associativities (can't remember which), so that
the number of cache sets isn't a power of 2. So far that has been
kludged up by postprocessing the cache info so as (in the right
circumstances) to increase the stated associativity by 50% (eg, a
factor of 3/2) and decrease the number of lines by the same factor, so
as to make the number of lines a power of 2 whilst not changing the
overall capacity of the cache that is simulated.

This kind of gets around the problem for cache sizes of (eg) 12MB
(viz, 3/2 * 8MB) but does not fix it for cache sizes of (eg) 10MB since
there is no code to do rescaling for the ratio 5/4.

This stuff (+ big comment) is in get_caches_from_CPUID in
cachegrind/cg-x86-amd64.c. If you or anybody else wants to do the 5/4
rescaling case, pls feel free :)

I suppose this stuff should get lifted out, as part of Florian's reorg,
and made general.

J
|
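The 3/2 rescaling Julian describes can be sketched as follows. This is a simplified rendering of the idea only; the real logic, with its caveats, lives in cachegrind/cg-x86-amd64.c:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int UInt;

static bool is_pow2u(UInt x) { return x != 0 && (x & (x - 1)) == 0; }

typedef struct {
   UInt sizeB;
   UInt line_sizeB;
   UInt assoc;
} CacheParams;

/* If the number of sets is 3 * 2^n, multiply the associativity by 3/2:
   the set count shrinks by the same factor, becoming a power of two,
   while total capacity is unchanged. Returns true if no tweak was
   needed or the 3/2 tweak applied. There is no rescaling for ratios
   like 5/4, so e.g. 10MB-style sizes still fail. */
static bool tweak_to_pow2_sets(CacheParams *p)
{
   UInt sets = p->sizeB / p->line_sizeB / p->assoc;
   if (is_pow2u(sets))
      return true;
   if (sets % 3 == 0 && is_pow2u(sets / 3) && p->assoc % 2 == 0) {
      p->assoc = p->assoc * 3 / 2;   /* sets becomes sets * 2/3 */
      return true;
   }
   return false;
}
```

For a 12MB cache this works because 12MB is 3/2 of 8MB, exactly as the mail says; a 10MB cache falls through to the failure case.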
|
From: Josef W. <Jos...@gm...> - 2012-09-24 19:15:05
|
On 24.09.2012 20:34, Julian Seward wrote:
> On Monday, September 24, 2012, Carl E. Love wrote:
>
>> I was just looking into the POWER architectures a bit more to make sure
>> they would be reasonably easy to support. Just to be clear, are there
>> any Valgrind restrictions on the cache sizes, specifically must be a
>> power of 2?
>>
>> I see the Power 5 has an L2 unified cache of 1.875MB and an L3 unified,
>> shared cache of size 36MB. I was doing some cache studies last year and
>> I remember there being issues where the cache size must be a power of 2.
>> I don't remember what tool it was now that had that restriction.
>>
>> Similarly, must the Valgrind cache associativity be a power of 2? The
>> POWER 5 processor's L2 cache is 10-way set associative.
>
> The same thing happens with server-level CPUs from Intel and AMD.
>
> I think Florian is presenting a general mechanism that allows recording
> of the cache details, regardless of whether they are something the
> various tools can handle, or not. And I think that's the right approach.
>
> As you say, though, some of the cache simulators have problems with
> non-power-of-2 sizes or associativities (can't remember which), so that
> the number of cache sets isn't a power of 2. So far that has been
> kludged up by postprocessing the cache info so as (in the right
> circumstances) increase the stated associativity by 50% (eg, a factor
> of 3/2) and decreasing the number of lines by the same factor, so as
> to make the number of lines be a power of 2 whilst not changing the
> overall capacity of the cache that is simulated.
>
> This kind of gets around the problem for cache sizes of (eg) 12MB
> (viz, 3/2 * 8MB) but does not fix it for cache sizes of (eg) 10MB
> since there is no code to do rescaling for the ratio 5/4.
>
> This stuff (+ big comment) is in get_caches_from_CPUID in
> cachegrind/cg-x86-amd64.c. If you or anybody else wants to do the 5/4
> rescaling case, pls feel free :)

This is not possible while keeping the cache size the same and the
number of sets a power of two.

It is important for the performance of the simulator to have a fast way
to get from an address to a set number. Bit mangling is fine, but
modulo is prohibitively slow. Better are small lookup tables.

If the number of sets is a power of two, the "best" way is somewhat
straightforward: every cache should cover the address space in as
uniform a way as possible. So let's take the lowest bits of the address
(not counting the bits needed for the offset inside a cache line). For
every other number of sets, the best solution is not obvious.

Josef
|
|
From: Carl E. L. <ce...@li...> - 2012-09-24 19:11:45
|
On Mon, 2012-09-24 at 20:34 +0200, Julian Seward wrote:
> On Monday, September 24, 2012, Carl E. Love wrote:
>
>> I was just looking into the POWER architectures a bit more to make sure
>> they would be reasonably easy to support. Just to be clear, are there
>> any Valgrind restrictions on the cache sizes, specifically must be a
>> power of 2?
>>
>> I see the Power 5 has an L2 unified cache of 1.875MB and an L3 unified,
>> shared cache of size 36MB. I was doing some cache studies last year and
>> I remember there being issues where the cache size must be a power of 2.
>> I don't remember what tool it was now that had that restriction.
>>
>> Similarly, must the Valgrind cache associativity be a power of 2? The
>> POWER 5 processor's L2 cache is 10-way set associative.
>
> The same thing happens with server-level CPUs from Intel and AMD.
>
> I think Florian is presenting a general mechanism that allows recording
> of the cache details, regardless of whether they are something the
> various tools can handle, or not. And I think that's the right approach.
>
> As you say, though, some of the cache simulators have problems with
> non-power-of-2 sizes or associativities (can't remember which), so that
> the number of cache sets isn't a power of 2. So far that has been
> kludged up by postprocessing the cache info so as (in the right
> circumstances) increase the stated associativity by 50% (eg, a factor
> of 3/2) and decreasing the number of lines by the same factor, so as
> to make the number of lines be a power of 2 whilst not changing the
> overall capacity of the cache that is simulated.
>
> This kind of gets around the problem for cache sizes of (eg) 12MB
> (viz, 3/2 * 8MB) but does not fix it for cache sizes of (eg) 10MB
> since there is no code to do rescaling for the ratio 5/4.
>
> This stuff (+ big comment) is in get_caches_from_CPUID in
> cachegrind/cg-x86-amd64.c. If you or anybody else wants to do the 5/4
> rescaling case, pls feel free :)
>
> I suppose this stuff should get lifted out, as part of Florian's reorg,
> and made general.
>
> J

Yup, from the general recording of cache size, line size, type, and
associativity, the data structures seem to cover all of the info needed
for POWER. The hope would be to see if some of the lower-level code
requirements on powers of two could be removed with the code
restructuring. I haven't really dived into the cachegrind and other
tool implementations to see why the restrictions are there or what it
would take to change them.

Good to know that the power of 2 issue is not just a POWER problem.
|
|
From: Josef W. <Jos...@gm...> - 2012-11-07 14:15:47
|
On 24.09.2012 21:11, Carl E. Love wrote:
> Good to know that the power of 2 issue is not just a POWER problem.

Not sure if you are working on that: before you come up with changing
the associativity for specific cases, I think it makes sense to relax
this "power of 2 issue" just for the LL simulation.

In an old experiment, I replaced the bit masking with a modulo
operation to find the cache set which needs to be checked for a hit. If
you do that for all levels simulated (L1I, L1D, LL), cachegrind can
easily slow down by a factor of 2. But I recently checked the behavior
if you do that only for LL, and I could not see much slowdown. The
modulo operation needs to be done only if the access misses the L1, and
that already seems to be enough work that the modulo operation for LL
simulation is not really relevant any more.

I'll try to come up with a patch & measurements.

Josef
|
|
From: Josef W. <Jos...@gm...> - 2012-11-16 21:18:48
Attachments:
relaxsets.patch
|
On 07.11.2012 15:15, Josef Weidendorfer wrote:
> I think it makes sense to relax from this "power of 2 issue" just
> for the LL simulation.

I just did that by using modulo (%) just for the LL simulation, where
it is used in mapping an address to a set number. See function
block2set in the attached patch.

It allows getting rid of "maybe_tweak_LLc", but shows a performance hit
of 5% on average on my laptop with cachegrind (on amd64). The worst
case happens when an access misses the L1, but finds a match in the LL
set on the first check (ie. at the most-recently-used spot). ffbench
seems to expose this case.

Thus, unconditionally using modulo for LL seems to be a bad idea.
Instead, one can check for the power-of-two case in block2set(), and
use bit masking or modulo depending on that. But this just gets rid of
the worst-case scenario in ffbench, and makes the other cases worse.

The best would be to have two implementations, and choose the right one
at runtime, depending on the cache parameters. As far as I see, this
choice is best done by instrumenting calls to dirty helpers
implementing either one or the other version. However, this needs
duplication of all helpers :-(

It really would be cool to use VEX's code generation feature for
functions which can be called from C -- just to generate the
"block2set" function in the attached patch, doing either bit masking or
modulo.

Does it make sense to look into this? Or does anybody have another
idea?

Josef

-- Running tests in trunk/perf ----------------------------------------
-- bigcode1 --
bigcode1          trunk     :0.14s  ca: 4.6s (32.6x, -----)
bigcode1          relaxsets :0.14s  ca: 4.6s (32.6x,  0.0%)
-- bigcode2 --
bigcode2          trunk     :0.14s  ca: 8.6s (61.1x, -----)
bigcode2          relaxsets :0.14s  ca: 8.6s (61.7x, -1.1%)
-- bz2 --
bz2               trunk     :0.66s  ca:13.3s (20.1x, -----)
bz2               relaxsets :0.66s  ca:13.9s (21.0x, -4.4%)
-- fbench --
fbench            trunk     :0.28s  ca: 3.8s (13.4x, -----)
fbench            relaxsets :0.28s  ca: 3.9s (13.9x, -3.7%)
-- ffbench --
ffbench           trunk     :0.25s  ca: 4.9s (19.4x, -----)
ffbench           relaxsets :0.25s  ca: 5.7s (22.7x,-16.9%)
-- heap --
heap              trunk     :0.10s  ca: 3.9s (39.4x, -----)
heap              relaxsets :0.10s  ca: 4.1s (41.4x, -5.1%)
-- heap_pdb4 --
heap_pdb4         trunk     :0.14s  ca: 4.4s (31.5x, -----)
heap_pdb4         relaxsets :0.14s  ca: 4.7s (33.3x, -5.7%)
-- many-loss-records --
many-loss-records trunk     :0.01s  ca: 0.8s (78.0x, -----)
many-loss-records relaxsets :0.01s  ca: 0.9s (86.0x,-10.3%)
-- many-xpts --
many-xpts         trunk     :0.05s  ca: 1.2s (23.4x, -----)
many-xpts         relaxsets :0.05s  ca: 1.2s (24.0x, -2.6%)
-- sarp --
sarp              trunk     :0.02s  ca: 1.0s (51.5x, -----)
sarp              relaxsets :0.02s  ca: 1.1s (55.5x, -7.8%)
-- tinycc --
tinycc            trunk     :0.22s  ca: 9.1s (41.5x, -----)
tinycc            relaxsets :0.22s  ca: 9.6s (43.4x, -4.7%)
-- Finished tests in trunk/perf ----------------------------------------
|
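For readers without the attachment: the mask-vs-modulo choice discussed above can be sketched as below. This is a stand-in for the patch's block2set, not the patch itself; the struct and field names are hypothetical.

```c
#include <assert.h>

typedef unsigned long Addr;
typedef unsigned int UInt;

typedef struct {
   UInt num_sets;
   UInt line_size_bits;   /* log2 of the line size */
   UInt sets_mask;        /* num_sets - 1; valid only if sets_are_pow2 */
   int  sets_are_pow2;
} LLCache;

/* Map an address to a set number: bit masking on the fast path when
   the set count is a power of two, modulo otherwise. The runtime
   branch here is exactly what the mail proposes avoiding, by instead
   instrumenting calls to one of two specialized helpers. */
static UInt block2set(const LLCache *c, Addr a)
{
   Addr block = a >> c->line_size_bits;
   if (c->sets_are_pow2)
      return (UInt)(block & c->sets_mask);
   return (UInt)(block % c->num_sets);
}
```

The masking path costs one AND; the modulo path costs an integer division, which is the 5%-on-average hit reported above.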