From: Jason P. <met...@gm...> - 2012-12-07 19:47:16
Hi, I'm trying to compare the number of cache misses (D1, I1, and LL) between what Perf reports on the hardware itself and what Cachegrind thinks the number of misses should be. The server machine has two Sandy Bridge Intel Xeon E5-2430 CPUs. I'm running the PARSEC 3.0 suite (compiled in gcc-serial format, single threaded) through Cachegrind to obtain simulated D1, I1, and LL miss counts, and running the same benchmark binaries through Perf to count the real D1 load and store misses as well as I1 misses on the hardware. The ratio of Perf misses to Cachegrind misses generally holds to about a factor of 1-2x. However, some benchmarks, like ferret, show a much higher number of misses on Perf than on Cachegrind. Has anyone else compared Perf results against Cachegrind simulated results by running benchmarks on both? The machines are running RedHat 5. Thanks for any information.
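For reference, a comparison like the one described above can be collected along these lines. This is only a sketch: the binary name and arguments are placeholders, and the generic perf event names may be spelled differently or be unavailable depending on the perf and kernel version.

```shell
# Simulated misses: Cachegrind prints I1/LL instruction-fetch misses and
# D1/LL data read/write misses in its summary, and writes cachegrind.out.<pid>
valgrind --tool=cachegrind ./blackscholes <benchmark args>

# Hardware misses for the same binary via perf's generic cache events
perf stat -e L1-dcache-load-misses,L1-dcache-store-misses,L1-icache-load-misses \
          -e LLC-load-misses,LLC-store-misses \
          ./blackscholes <benchmark args>
```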
From: Josef W. <Jos...@gm...> - 2012-12-07 20:06:57
|
On 07.12.2012 20:46, Jason Palaszewski wrote:
> Hi, I'm trying to compare the number of cache misses (D1, I1, and LL)
> between what Perf gives me (on the hardware itself) vs. what
> Cachegrind thinks the number of misses should be.
> [...]
> A ratio of the Perf misses to Cachegrind misses holds to about a
> factor of 1-2x. However, some benchmarks like ferret have a much
> higher number of misses on Perf than on Cachegrind.

Is this LL misses, or L1 misses? For L1 misses, you may observe many more misses, as real caches are asynchronous, i.e. consecutive loads to the same line can each count as a miss, while in Cachegrind, after the first miss, all further accesses to that line are hits.

Hm. Is ferret pure user space, or does it trigger work in the kernel? It may be that the kernel side evicts a lot of data from the cache, and this becomes visible as user-level cache misses.

Another possibility is that hardware prefetching is too clever and evicts lines to make room for prefetched data which is not actually used.

Josef

> Has anyone else done analysis on Perf results vs. Cachegrind
> simulated results by running benchmarks on both of these? The
> machines are also running RedHat 5. Thanks for any information.
>
> ------------------------------------------------------------------------------
> LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
> Remotely access PCs and mobile devices and provide instant support
> Improve your efficiency, and focus on delivering more value-add services
> Discover what IT Professionals Know. Rescue delivers
> http://p.sf.net/sfu/logmein_12329d2d
> _______________________________________________
> Valgrind-users mailing list
> Val...@li...
> https://lists.sourceforge.net/lists/listinfo/valgrind-users
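One practical check when comparing against real hardware: Cachegrind auto-detects the cache configuration via CPUID, but you can also set it explicitly to match (or deliberately vary) the machine. A sketch, assuming the common Sandy Bridge L1 geometry of 32 KB, 8-way, 64-byte lines; verify the actual numbers first, and note the 15 MB L3 of the E5-2430 is approximated below with a power-of-two size:

```shell
# Inspect the real cache geometry
cat /sys/devices/system/cpu/cpu0/cache/index*/size
cat /sys/devices/system/cpu/cpu0/cache/index*/ways_of_associativity

# Run the simulation with explicit size,associativity,line-size triples
valgrind --tool=cachegrind \
         --I1=32768,8,64 --D1=32768,8,64 --LL=16777216,16,64 \
         ./ferret <benchmark args>
```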
From: Jason P. <met...@gm...> - 2012-12-07 20:14:37
|
Josef,

The difference with ferret is in the D1 misses. Even with the native input set, there are almost no LL misses on this machine. I'm unsure whether ferret runs much of its workload in kernel space. As for prefetching, my impression from the Valgrind documentation is that Cachegrind does not simulate a prefetcher.

Another, larger issue I'm having: even a benchmark as simple as blackscholes shows only 1523 I1 misses in the cache simulation, yet when run on real hardware, Perf reports an enormous 54,259,977 I1 load misses. This is another question I would like to pose.

Thanks,

Jason

On Fri, Dec 7, 2012 at 3:06 PM, Josef Weidendorfer <Jos...@gm...> wrote:
> Is this LL misses, or L1 misses? For L1 misses, you may observe much
> more misses as real caches are asynchronous, ie. consecutive loads to
> the same line will give as much misses as loads, while in Cachegrind
> after the first miss all others will be hits.
>
> Hm. Is ferret pure user-space, or does it trigger work in the kernel?
> It may be that the kernel side evicts a lot of data from the cache,
> and this becomes visible via user-level cache misses.
>
> Another possibility is that hardware prefetching is too clever, and
> evicts lines to enable prefetching of data which is not actually used.
>
> Josef
From: Josef W. <Jos...@gm...> - 2012-12-13 15:38:45
Hi Jason,

On 07.12.2012 21:14, Jason Palaszewski wrote:
> The difference with ferret is in the D1 misses. Even with a native
> input set of data, there are almost no LL misses on this machine.

Ok. For L1 misses, there are various effects which can explain differing results.

> As far as prefetching goes, I am under the impression from the
> Valgrind documentation that Cachegrind does not include a prefetch
> algorithm.

Yes. But this explains differences between simulation and reality: the simulation does not do any prefetching, while real hardware does. In Callgrind, you can optionally switch on a simple stream prefetcher for the last level (LL).

> Another larger issue I'm having is that even a benchmark as simple as
> blackscholes is showing only 1523 I1 misses in the cache sim, and then
> when run on real hardware, Perf reports an enormous 54,259,977 I1 load
> misses.

Do you have other load on the system? Does the number go down if you pin the execution of blackscholes to one core? Any task switching/migration results in invalidating and reloading L1I.

Your machine has a uops cache for 1.5k uops (that's below L1). The idea is to save energy, as the x86 decoder can often be switched off. For that reason, I would expect L1I misses to be _lower_ than given by Cachegrind. If the above count includes misses of the uops cache, this may explain the high number.

The thing is: interpretation of performance counter measurement data is always quite tricky. You need to check what you are actually measuring, e.g. which event type is used for Perf's I1 load misses. Looking up the Intel SDM, volume 3 (http://download.intel.com/products/processor/manual/325384.pdf), for available events on Sandy Bridge (Table 19-6), I see only an event 80 / umask 02 for "ICACHE.MISSES", described as "Number of Instruction Cache, Streaming Buffer and Victim Cache Misses. Includes UC accesses." I do not know which streaming buffer and victim caches they are talking about here, and there is nothing about the uops cache. Further, the count probably includes fetches due to speculative execution. It is unclear to me how to compare the absolute number from this counter with the result you get from Cachegrind, when it is already difficult to understand what is being measured.

With Cachegrind, it's quite easy: it simulates a cache with a given size, associativity, and line length, using an LRU replacement scheme. If Cachegrind says that the execution of blackscholes produces hardly any I1 misses, you at least know that the "active code part" fits into 32 kB, and there is nothing to improve in that regard.

Best,
Josef
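A few concrete commands for the checks suggested in this thread. These are sketches, not verified on this exact setup: the raw code r0280 encodes umask 0x02 / event 0x80 in perf's r<umask><event> hex format, and --simulate-hwpref is Callgrind's optional LL stream prefetcher switch; check the manual of your Valgrind version.

```shell
# Pin to one core to rule out migration-induced L1I invalidation/reloads
taskset -c 0 perf stat -e L1-icache-load-misses ./blackscholes <benchmark args>

# Count the raw ICACHE.MISSES event from SDM Table 19-6 (event 0x80, umask 0x02)
perf stat -e r0280 ./blackscholes <benchmark args>

# Callgrind with cache simulation plus the simple LL stream prefetcher
valgrind --tool=callgrind --cache-sim=yes --simulate-hwpref=yes \
         ./blackscholes <benchmark args>
```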