From: Josef W. <Jos...@gm...> - 2012-09-24 19:15:05
Am 24.09.2012 20:34, schrieb Julian Seward:
> On Monday, September 24, 2012, Carl E. Love wrote:
>
>> I was just looking into the POWER architectures a bit more to make sure
>> they would be reasonably easy to support. Just to be clear, are there
>> any Valgrind restrictions on the cache sizes? Specifically, must they
>> be a power of 2?
>>
>> I see the POWER5 has an L2 unified cache of 1.875MB and an L3 unified,
>> shared cache of size 36MB. I was doing some cache studies last year and
>> I remember there being issues where the cache size had to be a power
>> of 2. I don't remember now which tool had that restriction.
>>
>> Similarly, must the Valgrind cache associativity be a power of 2? The
>> POWER5 processor's L2 cache is 10-way set associative.
>
> The same thing happens with server-level CPUs from Intel and AMD.
>
> I think Florian is presenting a general mechanism that allows recording
> of the cache details, regardless of whether they are something the
> various tools can handle or not. And I think that's the right approach.
>
> As you say, though, some of the cache simulators have problems with
> non-power-of-2 sizes or associativities (I can't remember which), such
> that the number of cache sets isn't a power of 2. So far that has been
> kludged up by postprocessing the cache info so as (in the right
> circumstances) to increase the stated associativity by 50% (i.e. by a
> factor of 3/2) and decrease the number of lines by the same factor, so
> as to make the number of lines a power of 2 whilst not changing the
> overall capacity of the simulated cache.
>
> This kind of gets around the problem for cache sizes of, e.g., 12MB
> (viz, 3/2 * 8MB), but does not fix it for cache sizes of, e.g., 10MB,
> since there is no code to do the rescaling for the ratio 5/4.
>
> This stuff (+ big comment) is in get_caches_from_CPUID in
> cachegrind/cg-x86-amd64.c.
> If you or anybody else wants to do the 5/4 rescaling case, pls feel
> free :)

This is not possible while keeping the cache size the same and the
number of sets a power of two. For the performance of the simulator it
is important to have a fast way to get from an address to a set number.
Bit mangling is fine, but a modulo operation is prohibitively slow;
small lookup tables are a better option.

If the number of sets is a power of two, the "best" mapping is fairly
straightforward: every cache should cover the address space as
uniformly as possible, so we take the lowest bits of the address (not
counting the bits needed for the offset inside a cache line). For any
other number of sets, the best solution is not obvious.

Josef
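P.S. The address-to-set mapping for a power-of-two set count can be
sketched as below. This is only an illustration of the "bit mangling"
point; the constants and names are made up and not from the Valgrind
sources.

```c
#include <assert.h>

/* Illustrative sketch: with a power-of-two number of sets, the set
   index is a simple bit mask of the address after dropping the
   line-offset bits.  All names here are hypothetical. */
enum { LINE_BITS = 6,        /* 64-byte cache lines */
       NUM_SETS  = 8192 };   /* must be a power of two for this trick */

static unsigned set_of(unsigned long addr)
{
    /* Cheap bit mangling: shift out the line offset, mask in the set
       bits.  With a non-power-of-two set count this would need a
       modulo (slow) or a small lookup table instead. */
    return (unsigned)((addr >> LINE_BITS) & (NUM_SETS - 1));
}
```

Consecutive cache lines map to consecutive sets, wrapping around after
NUM_SETS lines, which is the uniform coverage mentioned above.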
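P.P.S. For reference, the 3/2 rescaling kludge Julian describes could
look roughly like this. This is a hypothetical sketch, not the actual
code from get_caches_from_CPUID in cachegrind/cg-x86-amd64.c; the
function and variable names are invented.

```c
#include <assert.h>

static int is_pow2(unsigned x) { return x != 0 && (x & (x - 1)) == 0; }

/* Sketch of the kludge: try to make the number of sets a power of two
   without changing total capacity, by assoc *= 3/2 and sets *= 2/3.
   Returns 1 if the set count is (or becomes) a power of two. */
static int maybe_rescale_3_2(unsigned size, unsigned line_size,
                             unsigned *assoc)
{
    unsigned sets = size / (*assoc * line_size);
    if (is_pow2(sets))
        return 1;                    /* nothing to do */
    if (sets % 3 == 0 && is_pow2(sets / 3 * 2)) {
        *assoc = *assoc * 3 / 2;     /* capacity stays the same */
        return 1;
    }
    return 0;  /* e.g. a 10MB cache would need the missing 5/4 case */
}
```

For a 12MB, 16-way cache with 64-byte lines this turns 12288 sets into
8192 (a power of two) by going to 24-way; for a 10MB cache it fails,
matching Julian's point about the unimplemented 5/4 ratio.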