From: Josef W. <Jos...@gm...> - 2012-10-01 12:50:06
On 29.09.2012 19:50, Florian Krohm wrote:
> On 09/21/2012 02:18 PM, Josef Weidendorfer wrote:
>> On 21.09.2012 19:01, Florian Krohm wrote:
>>> DATA_INSN_CACHE // combined data and insn cache
>>
>> The standard term here is "unified", ie. UNIFIED_CACHE ?
>
> Hello Josef,
>
> I finally have a bit of time to look at this again.
> Sure, UNIFIED_CACHE sounds good.
>
> I read the cachegrind documentation. It mentions that for a machine
> with 3 levels of cache, the L3 cache will be used instead of L2.
> What about a machine with 4 cache levels such as the one below (s390)?
> Should the L4 cache be used instead of L2?

Short answer: Yes, I think so.

Long answer: The cache simulation model in Cachegrind/Callgrind currently is a synchronous (ie. one access at a time), 2-level, (not strictly) inclusive cache hierarchy with private L1 caches for instructions and data, and a unified L2. All caches use LRU replacement and are write-allocate. The model does not care about write-back vs. write-through, as this difference would not change the hit/miss numbers. All threads go through this single hierarchy (ie. no coherency issues are possible, no concurrency miss numbers, and no MESI/MOESI/MESIF...). Callgrind's model optionally adds a simple prefetcher at the L2 level.

As far as I know, this model was motivated by Intel processors with two cache levels. With such processors, the mapping of real cache parameters is clear. When processors with 3 levels came up, there was some discussion about useful mappings of the real parameters to the simulator model. The mapping used should be able to catch cache-inefficient memory access behavior. The most important issue is of course not using the on-chip caches at all, which becomes visible when looking at the behaviour of the last-level cache. Thus, the last-level cache should always be mapped to the L2 in the model.

Another important problem is a huge number of conflict misses due to low associativity.
This can already happen in L1 (last-level cache associativity often is much higher, and looking only at that level would not show up such a problem). So it seems right to use the real L1 parameters for the L1 in the model.

The usefulness of a 3-level cache hierarchy comes from the wish to share the last level among multiple cores. For that, you want to reduce the number of references coming from the cores, resulting in better scalability of the shared cache level. As L1 has to be small to be fast, it is good to have an L2 to reduce the references to the shared level. As Cachegrind's current model does not have a private L1 per core anyway, it did not seem that important to extend the simulator to 3 levels. The same argument seems fine for 4 levels.

On the other hand, from my point of view, we can always think about making the cache model more flexible, e.g. for detecting cache-line bouncing between private L1 caches (ie. concurrency misses), or for understanding NUMA issues. However, for such issues it is important to see slowdowns due to the limited bandwidth of shared resources. This needs timing simulation, and makes everything a little bit more complex :-)

Do you know if the L4 on s390 covers all memory modules, or is it partitioned for separate modules (as it was with Sun's/Oracle's Niagara)? This can make a difference: in the latter case, if all accesses go to one module, the cache size effectively gets much smaller.

Josef

> Florian
>
> L1 topology: separate data and instruction; private
> L1 cache line size data: 256
> L1 cache line size insn: 256
> L1 total cachesize data: 131072
> L1 total cachesize insn: 65536
> L1 set. assoc. data: 8
> L1 set. assoc. insn: 4
> L2 topology: unified data and instruction; private
> L2 cache line size: 256
> L2 total cachesize: 1572864
> L2 set. assoc.: 12
> L3 topology: unified data and instruction; shared
> L3 cache line size: 256
> L3 total cachesize: 25165824
> L3 set. assoc.: 12
> L4 topology: unified data and instruction; shared
> L4 cache line size: 256
> L4 total cachesize: 201326592
> L4 set. assoc.: 24