From: Josef W. <Jos...@gm...> - 2005-10-15 15:18:55
On Saturday 15 October 2005 08:58, you wrote:
> Nicholas Nethercote writes:
> > If this part of the simulation is taking a lot of time (and I'd like to
> > see profiling evidence) I'd suggest keeping the current algorithm and
> > having several specialised versions, for associativities of 1, 2, 4, and
> > 8. It should be ugly but doable with some macro magic.
>
> My point was that if the cpu we are running on uses this pseudo-LRU
> algorithm, cachegrind should use it too, because it will give results
> that more accurately reflect what the actual hardware will do.

I do not think it matters much. If we wanted to match the hardware in
every detail, we would have to change the simulation in other ways too:
e.g. the trace cache for L1I in the P4, the "sectored" caches on Intel
(a read always prefetches the neighbouring line as well), the write-back
behavior of L2, the hardware prefetcher, the multiple outstanding loads
because of out-of-order execution, and so on. There is not much point in
adding all these things, as a lot of it is not really documented, and it
makes the simulation much slower. You can have a look at callgrind: I
optionally added write-back events, and put in a hardware prefetcher to
get a best-case scenario.

The nice thing about LRU is that it is quite a simple model, the results
are somewhat meaningful, and it is easy to check, given a specially
tailored program, whether the simulator is really doing LRU or is buggy.
With a pseudo-LRU, that is more difficult to check.

> I'm
> pretty sure the PPC970 (G5) uses pseudo-LRU replacement for the L2
> cache, which is 8-way set associative.

Is this officially documented somewhere?

> I have no idea what intel or
> AMD cpus use, though.

They are quite probably doing pseudo-LRU too, but I have never seen
anything that hints at how this is implemented in detail.

So I would say switching to a pseudo-LRU scheme is worth it if it really
speeds up the simulation. It would be nice to be able to choose between
the schemes via a command line option, and default to the pseudo-LRU.
This way, you can compare the two schemes.

Julian: I would really like to add handling of software prefetch
instructions. Is there a way for a tool to get some hints from the IR in
this regard? The same goes for non-temporal loads. These things can make
quite a huge difference for cache simulation. If there is a cache
invalidation instruction on ppc32, that would also be a candidate for
something tools should be able to detect.

Josef

>
> Paul.
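
For reference, here is a minimal sketch of the two replacement schemes
discussed above, for a single 8-way set. This is not cachegrind's or
callgrind's actual code: the names (lru_set, plru_set, lru_access,
plru_access, and so on) are made up for illustration, and the tree-based
pseudo-LRU shown is only one common variant. Nothing here claims that
the PPC970, Intel or AMD parts implement it this way.

```c
#include <stdint.h>

#define WAYS 8  /* associativity; fixed at 8 for this sketch (assumption) */

/* True LRU: tags are kept ordered from most to least recently used. */
typedef struct {
    uint64_t tag[WAYS];  /* tag[0] is the most recently used entry */
    int      used;       /* number of valid entries */
} lru_set;

/* Returns 1 on a hit, 0 on a miss. */
static int lru_access(lru_set *s, uint64_t tag)
{
    int i, pos = -1;
    for (i = 0; i < s->used; i++)
        if (s->tag[i] == tag) { pos = i; break; }
    int hit = (pos >= 0);
    if (!hit) {
        if (s->used < WAYS) s->used++;
        pos = s->used - 1;  /* a fresh slot, or the LRU entry to evict */
    }
    /* Move-to-front: shift entries [0, pos) down and put tag first. */
    for (i = pos; i > 0; i--)
        s->tag[i] = s->tag[i - 1];
    s->tag[0] = tag;
    return hit;
}

/* Tree pseudo-LRU: 7 bits form a binary tree over the 8 ways.
   Bit n (1..7) belongs to tree node n, with children 2n and 2n+1;
   a set bit means "the next victim is in the right subtree". */
typedef struct {
    uint64_t tag[WAYS];
    uint8_t  valid[WAYS];
    uint8_t  bits;
} plru_set;

/* Mark a way as recently used: every node on the path to it is made
   to point away from the direction that was just taken. */
static void plru_touch(plru_set *s, int way)
{
    int node = 1;
    for (int level = 2; level >= 0; level--) {  /* log2(WAYS) = 3 levels */
        int right = (way >> level) & 1;
        if (right) s->bits &= (uint8_t)~(1u << node);
        else       s->bits |= (uint8_t)(1u << node);
        node = 2 * node + right;
    }
}

/* Follow the bits from the root down to the way to evict. */
static int plru_victim(const plru_set *s)
{
    int node = 1;
    while (node < WAYS)
        node = 2 * node + ((s->bits >> node) & 1);
    return node - WAYS;
}

/* Returns 1 on a hit, 0 on a miss. */
static int plru_access(plru_set *s, uint64_t tag)
{
    int way;
    for (way = 0; way < WAYS; way++)
        if (s->valid[way] && s->tag[way] == tag) {
            plru_touch(s, way);
            return 1;
        }
    /* Miss: fill an invalid way first, otherwise evict the victim. */
    for (way = 0; way < WAYS; way++)
        if (!s->valid[way]) break;
    if (way == WAYS) way = plru_victim(s);
    s->tag[way] = tag;
    s->valid[way] = 1;
    plru_touch(s, way);
    return 0;
}
```

Both sets are expected to start zero-initialised. With this sketch, the
tag sequence 0, 1, ..., 7, 0, 8, 1 is an example of a tailored access
pattern that separates the schemes: true LRU evicts tag 1 when 8 comes
in and so misses on the final access, while the tree variant evicts
tag 4 instead and hits.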