From: Josef W. <Jos...@gm...> - 2005-11-11 11:27:35
On Friday 11 November 2005 05:05, you wrote:
> On Wed, 9 Nov 2005, Josef Weidendorfer wrote:
> Hi,
>
> I am running Callgrind with the options -v --log-file=summary --simulate-cache=yes
> --simulate-hwpref=yes --cacheuse=yes. The summary log file contains the
> lines:
>
> Prefetch Up: 0
> Prefetch Down: 0

Oh, someone who is using the more advanced (and probably not that well tested) code! Very good. It would be nice if you could tell me whether these features are useful for you.

You cannot use --simulate-hwpref=yes and --cacheuse=yes at the same time. These are implemented in separate simulator code. I will change the code to print a warning about this, thanks.

If I run e.g.

  callgrind -v --simulate-hwpref=yes ls

(this option also switches on cache simulation), I get

  --12922-- Prefetch Up: 1507
  --12922-- Prefetch Down: 36

so I think this still works fine.

> What do these lines mean? From what I understand, --simulate-hwpref=yes
> simulates a hardware prefetcher, as is found in the Intel Pentium 4
> processor.

Yes. The P4 (and P-M) automatically detects upward and downward streaming, stopping at 4 kB boundaries (a stream of virtual addresses becomes a disrupted stream of physical addresses at 4 kB boundaries because of virtual memory). A nice thing is that the Pentium-M has hardware performance counters for exactly the Prefetch/Up and Prefetch/Down events, i.e. you can observe the hardware prefetcher of the Pentium-M in action by using OProfile/Perfex/PAPI, and compare the results with those from Callgrind.

With --simulate-hwpref=yes I add this heuristic, and assume that every line loaded by the hardware prefetcher will give a hit when accessed later on. Note that this is not always the case: the real access could come so early that you would still get a miss in reality, even if the hardware prefetcher has caught the line. Unfortunately, callgrind has no way to get a simulated wall clock time, which would be needed to detect such cases.
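
As a rough illustration of the heuristic described above, here is a minimal Python sketch of an up/down stream detector that stops at 4 kB boundaries and treats every prefetched line as a later hit. It is not Callgrind's simulator code: the class name, the 64-byte line size, the "two sequential steps" trigger and the one-line prefetch distance are all assumptions made for the example.

```python
LINE_SIZE = 64    # assumed cache line size in bytes
PAGE_SIZE = 4096  # streams stop at 4 kB page boundaries


class StreamPrefetcher:
    """Toy up/down stream detector (hypothetical, not Callgrind's code)."""

    def __init__(self):
        self.last_line = {}      # page -> last line index touched in that page
        self.direction = {}      # page -> +1 or -1 after a sequential step, else 0
        self.prefetched = set()  # line indices assumed to be in the cache
        self.prefetch_up = 0
        self.prefetch_down = 0

    def access(self, addr):
        """Feed one address; return True if its line was prefetched earlier."""
        line = addr // LINE_SIZE
        page = addr // PAGE_SIZE
        hit = line in self.prefetched  # best case: a prefetched line always hits
        last = self.last_line.get(page)
        if last is not None and line != last:
            step = line - last
            if step not in (1, -1):
                self.direction[page] = 0  # not sequential: forget the stream
            else:
                # Second step in the same direction: prefetch the next line.
                if self.direction.get(page) == step:
                    self._prefetch(line + step, page, step)
                self.direction[page] = step
        self.last_line[page] = line
        return hit

    def _prefetch(self, line, page, step):
        # Never follow a stream across a page boundary.
        if (line * LINE_SIZE) // PAGE_SIZE != page:
            return
        self.prefetched.add(line)
        if step > 0:
            self.prefetch_up += 1
        else:
            self.prefetch_down += 1


# Walk one page upward, one cache line at a time.
pf = StreamPrefetcher()
hits = sum(pf.access(a) for a in range(0, PAGE_SIZE, LINE_SIZE))
print(pf.prefetch_up, pf.prefetch_down, hits)
```

The model has no notion of time, so a line counts as a hit as soon as it has been prefetched, which is exactly the best-case assumption discussed above.
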
So callgrind --simulate-hwpref=yes gives the best case possible for the prefetcher. In reality, the result lies between the results without and with this option.

The intended usage is to compare results with and without the prefetcher. For functions where you see a big difference, the prefetcher is working quite well, i.e. any micro-optimizations to bring down the usual callgrind results (without prefetcher) will not lead to any real improvement. But in code regions where the results are not really different, you see that the prefetching heuristic of the P4/P-M is not working, and you can try to add software prefetch instructions (or otherwise change the code).

A drawback is that callgrind does not take software prefetch instructions into account, as Valgrind does not feed these instructions to the tool, but ignores them. But if there really are users for this simulator enhancement, we can try to include them in the VG core (e.g. cachegrind). To make the comparison of the two runs easier, I should include a compare mode in KCachegrind.

> Also, does --cacheuse=yes collect cache line utilization statistics i.e.
> what percentage of a line is utilized after being brought into cache and
> before being evicted from the cache? Where can this information be viewed?

Yes. The number of bytes never used in a cache line is attributed to the instruction which triggered the load. These are the events SpLoss1 (for L1) and, more importantly, SpLoss2 (for L2). The total number of bytes loaded by an instruction is the number of L1 or L2 misses attributed to this instruction, multiplied by the cache line size. In KCachegrind, add new derived events with the formulas "64 L1m" and "64 L2m" to directly get the numbers to compare against.

You can view this information with KCachegrind. Unfortunately, KCachegrind had a hardcoded maximum of 10 event types up to KDE 3.4.x, and --cacheuse=yes gives you 12 event types, leading to a load error.
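
The SpLoss arithmetic described above can be written out as a small Python sketch. The function name and the sample numbers are made up for illustration; the 64-byte line size matches the "64 L1m" formula and has to be adjusted if the simulated cache uses a different line size.

```python
def line_utilization(misses, sp_loss, line_size=64):
    """Fraction of loaded bytes that were actually used before eviction.

    misses  : L1 or L2 misses attributed to an instruction (L1m / L2m)
    sp_loss : bytes never used in the loaded lines (SpLoss1 / SpLoss2)
    """
    loaded = misses * line_size  # total bytes brought into the cache
    if loaded == 0:
        return None              # nothing was loaded, utilization is undefined
    return 1.0 - sp_loss / loaded


# Hypothetical numbers: 1000 L2 misses with 16000 bytes never touched.
print(line_utilization(1000, 16000))
```
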
The 10-event limit is gone in the KCachegrind version shipped with KDE 3.5; alternatively, use the newest one from the website (kcachegrind.sf.net).

Theoretically, callgrind_annotate should be able to show these results, too. For it to cope with the format, you have to additionally provide --compress-pos=no --compress-strings=no on the callgrind command line. Even then, it fails with

  Line xxxx: summary event and total event mismatch

Oh yeah, it is time to provide a better command line tool...

Josef

>
> Thanks,
> Aniruddha