|
From: Mehmet B. <mb...@gm...> - 2007-11-19 22:10:59
|
Hi Everyone,

I thought it would be interesting to compare valgrind (cachegrind) cache event results to those of PAPI, which uses real HW counters. Despite all my efforts to configure Valgrind properly, I am unable to get comparable results. I hope a Valgrind guru might give me a hand here :)

I run Valgrind using:

  valgrind --tool=cachegrind --I1=65536,2,64 --D1=65536,2,64 --L2=1048576,16,16384 ./mycode

The characteristics of the machine I use, which is a 64-bit Opteron, are attached at the bottom of this message. I believe my settings on the valgrind line are correct. Here's what I get from PAPI for the code section that I am exploring:

  total: L1 Access= 30582440, Hit= 28659298, Miss= 1923142
  total: L2 Access= 1923365, Hit= 1823712, Miss= 99653

And here's what I read from Valgrind (please also see the attached image below):

  1,918,926 total L1 misses (compare to 1,923,142 in PAPI)  ---- Close enough!! ----
      6,360 total L2 misses (compare to    99,653 in PAPI)  ---- NOT even close?? ----

Am I doing anything wrong while configuring the L2 cache? I will appreciate any comments... Thanks a lot in advance!

-Memo

============= ADDITIONAL INFO ============

The specs of the Opteron are as follows:

OPTERON Test case: Memory Information.
------------------------------------------------------------------------
L1 Instruction TLB:   Number of Entries: 512; Associativity: 4
L1 Data TLB:          Number of Entries: 512; Associativity: 4
L1 Instruction Cache: Total size: 64KB; Line size: 64B; Number of Lines: 1024; Associativity: 2
L1 Data Cache:        Total size: 64KB; Line size: 64B; Number of Lines: 1024; Associativity: 2
L2 Unified Cache:     Total size: 1024KB; Line size: 64B; Number of Lines: 16384; Associativity: 16
|
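[Editor's sketch, not part of the original mail] As a quick arithmetic sanity check, the PAPI totals quoted above are internally consistent, and the L1 miss count nearly matches the L2 access count, as expected when every L1 miss becomes an L2 access:

```python
# Sanity-check the PAPI totals quoted above (editor's sketch).
l1_access, l1_hit, l1_miss = 30582440, 28659298, 1923142
l2_access, l2_hit, l2_miss = 1923365, 1823712, 99653

assert l1_access == l1_hit + l1_miss  # L1 bookkeeping adds up
assert l2_access == l2_hit + l2_miss  # L2 bookkeeping adds up

# L1 misses should roughly equal L2 accesses (each L1 miss goes to L2);
# the small gap is typical measurement noise around the counted region.
print(l2_access - l1_miss)  # -> 223
```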
|
From: Josef W. <Jos...@gm...> - 2007-11-20 10:10:30
|
On Monday 19 November 2007, Mehmet Belgin wrote:
> I run Valgrind using:
> valgrind --tool=cachegrind --I1=65536,2,64 --D1=65536,2,64
> --L2=1048576,16,16384 ./mycode

Are you sure you want to simulate an L2 cache line size of 16384?

In general, cachegrind uses the parameters of your CPU (detected by cpuid), so if you just want to simulate the cache of your CPU, there is no need to specify the parameters explicitly.

Josef
|
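[Editor's sketch] Cachegrind's `--L2=size,assoc,line_size` triple implies a set count of size / (assoc * line_size). Plugging in the values from the command line above shows why 16384 is implausible as a line size (it was in fact the *number of lines* reported by papi_mem):

```python
def cache_sets(size, assoc, line_size):
    """Number of sets in a set-associative cache: size / (assoc * line_size)."""
    sets, rem = divmod(size, assoc * line_size)
    assert rem == 0, "size must be a multiple of assoc * line_size"
    return sets

# As typed: --L2=1048576,16,16384 gives 16 KB "lines" and only 4 sets.
print(cache_sets(1048576, 16, 16384))  # -> 4
# As intended for this Opteron: 64-byte lines, 1024 sets.
print(cache_sets(1048576, 16, 64))     # -> 1024
```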
|
From: Mehmet B. <mb...@gm...> - 2007-11-20 20:26:05
|
Thanks a lot for the reply, Josef! You are right, a cache line of 16384 doesn't make sense... I wasn't thinking, and just passed on what papi_mem() reported to me. I checked the specs of the CPU (Opteron 240) and learned that both L1 and L2 cache lines are just 64 bytes, which sounds more likely. Valgrind incorrectly assumes that the L2 cache is 8-way associative when it should be 16-way; that's why I don't use its default settings. I don't have cpuid as a command, maybe that's what confuses Valgrind...

This time, Valgrind overestimates the number of cache misses, even with the corrected settings:

  PAPI     :    99,653 L2 misses
  Valgrind : 1,609,350 L2 misses !!!

I am really out of ideas. I very much want to use valgrind to explain my results where PAPI can't provide enough detail...

Thanks a lot,
-Memo

On Nov 20, 2007 5:10 AM, Josef Weidendorfer <Jos...@gm...> wrote:
> On Monday 19 November 2007, Mehmet Belgin wrote:
> > I run Valgrind using:
> > valgrind --tool=cachegrind --I1=65536,2,64 --D1=65536,2,64
> > --L2=1048576,16,16384 ./mycode
>
> Are you sure you want to simulate a L2 cache linesize of 16384?
>
> In general, cachegrind is using the parameters of your CPU (detected
> by cpuid), so if you just want to simulate the cache of your CPU,
> there is no requirement to specify the parameters explicitly.
>
> Josef
|
|
From: Nicholas N. <nj...@cs...> - 2007-11-20 21:33:18
|
On Tue, 20 Nov 2007, Mehmet Belgin wrote:
> This time, Valgrind overestimates the number of cache misses, even with the
> corrected settings:
>
> PAPI : 99,653 L2 miss
> Valgrind : 1,609,350 L2 miss !!!

That really doesn't sound right. Can you please provide the full command line you used, and the summary output (all hit/miss counts) that Cachegrind prints to stderr on exit. Thanks.

Nick
|
|
From: Nicholas N. <nj...@cs...> - 2007-11-20 22:03:35
|
On Tue, 20 Nov 2007, Mehmet Belgin wrote:
> valgrind -v --tool=cachegrind --I1=65536,2,64 --D1=65536,2,64
> --L2=1048576,16,64 ./my_code input_matrix.mtx
>
> ==1451== I refs:        373,488,662
> ==1451== I1 misses:         857,325
> ==1451== L2i misses:          1,583
> ==1451== I1 miss rate:         0.22%
> ==1451== L2i miss rate:        0.00%
> ==1451==
> ==1451== D refs:        148,862,676 (113,784,326 rd + 35,078,350 wr)
> ==1451== D1 misses:       2,143,107 (  2,075,640 rd +     67,467 wr)
> ==1451== L2d misses:      1,770,365 (  1,725,279 rd +     45,086 wr)
> ==1451== D1 miss rate:          1.4% (        1.8%  +       0.1%  )
> ==1451== L2d miss rate:         1.1% (        1.5%  +       0.1%  )
> ==1451==
> ==1451== L2 refs:         3,000,432 (  2,932,965 rd +     67,467 wr)
> ==1451== L2 misses:       1,771,948 (  1,726,862 rd +     45,086 wr)
> ==1451== L2 miss rate:          0.3% (        0.3%  +       0.1%  )

That doesn't look unreasonable. What are PAPI's stats?

Nick

ps: it's easier to read these emails if you post in plain text instead of HTML.
|
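[Editor's sketch] The percentages in this summary can be reproduced from the raw counts. Note that Cachegrind computes the combined L2 miss rate over *all* references (I refs + D refs), which is why it looks so small:

```python
# Reproduce Cachegrind's reported miss rates from its raw counts (editor's sketch).
i_refs    = 373_488_662
d_refs    = 148_862_676
d1_misses = 2_143_107
l2_misses = 1_771_948

# D1 miss rate = D1 misses / D refs
print(round(100 * d1_misses / d_refs, 1))             # -> 1.4
# Combined L2 miss rate = L2 misses / (I refs + D refs)
print(round(100 * l2_misses / (i_refs + d_refs), 1))  # -> 0.3
```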
|
From: Nicholas N. <nj...@cs...> - 2007-11-21 00:10:52
|
On Tue, 20 Nov 2007, Mehmet Belgin wrote:
> I only count events (with PAPI) in a limited region of the code. PAPI
> reports the following:
>
> total: L1 Access= 30582440, Hit= 28659298, Miss= 1923142, Rate=93.711614
> total: L2 Access= 1923365, Hit= 1823712, Miss= 99653, Rate=94.818820
>
> And here's what Cachegrind reports for the same code region (I have
> attached the entire list as well, in case the list below is not
> readable):
>
>          .  . .          .         .         .       .      .      .  asm (STACKED_STRING(begin_spmv)":");
>    120,240  1 1     90,090         0         0      90      0      0  for (i=0; i < n; ++i) {
>     90,000  0 0     90,000       900        30       0      0      0  k1 = ia[i] - 1;
>          .  . .          .         .         .       .      .      .  k2 = ia[i + 1] - 2;
>  6,007,020  1 1     60,000     1,890     1,890       0      0      0  for (k=k1; k < k2 + 1; ++k) {
>          .  . .          .         .         .       .      .      .  #define SUM_IS_aa_TIMES_P(_si) \
>          .  . .          .         .         .       .      .      .  sum##_si += si2aabase##_si[k] * _IL_(p, (ja[k] - 1), curstackdepth, _si)
> 50,225,610  4 4 27,202,350 1,897,169 1,609,440       0      0      0  STACKED_OPERATION(SUM_IS_aa_TIMES_P);
>          .  . .          .         .         .       .      .      .  }
>          .  . .          .         .         .       .      .      .  #define MOVE_SUM_TO_q(_si)\
>          .  . .          .         .         .       .      .      .  _NL_(q, n, i, _si) = sum##_si
>    390,000  1 1    180,000     1,470         0 120,000 17,490 10,770  STACKED_OPERATION(MOVE_SUM_TO_q);
>    120,000  0 0          0         0         0       0      0      0  STACKED_ASSIGNMENT_CONST(SUM_I, 0.);
>          .  . .          .         .         .       .      .      .  }
>          .  . .          .         .         .       .      .      .  asm (STACKED_STRING(end_spmv)":");
In the past, I checked Cachegrind against performance counters and found
that it was pretty good (eg. 95% right) for L1, but much worse for L2, see
http://www.cs.mu.oz.au/~njn/pubs/cache-large-lazy2002.ps.
Cachegrind has various inaccuracies, you can see Chapter 3 of
http://www.valgrind.org/docs/phd2004.pdf for the details, and the cache it
simulates may well be different to what your machine has.
Also, it's not clear to me that the figures Cachegrind and PAPI are
reporting are directly comparable -- are they really referring to exactly
the same lines of source code?
In short, I don't think there's a simple single answer for the differences.
Nick
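[Editor's sketch] For readers unfamiliar with what Cachegrind is simulating: at its core it models set-associative caches with LRU replacement. A minimal illustration of that kind of model (not Cachegrind's actual code) shows why a sequential sweep over data larger than the cache misses once per line:

```python
from collections import OrderedDict

class Cache:
    """Minimal set-associative LRU cache model (tracks addresses only)."""
    def __init__(self, size, assoc, line_size):
        self.assoc = assoc
        self.line_size = line_size
        self.n_sets = size // (assoc * line_size)
        self.sets = [OrderedDict() for _ in range(self.n_sets)]
        self.misses = 0

    def access(self, addr):
        tag = addr // self.line_size
        s = self.sets[tag % self.n_sets]
        if tag in s:
            s.move_to_end(tag)        # LRU update on hit
            return True
        self.misses += 1
        if len(s) >= self.assoc:
            s.popitem(last=False)     # evict the least recently used line
        s[tag] = None
        return False

# A sequential sweep over 8 MB of doubles misses once per 64-byte line
# in a 1 MB 16-way cache: 8 MB / 64 B = 131072 misses.
l2 = Cache(1024 * 1024, 16, 64)
for addr in range(0, 8 * 1024 * 1024, 8):  # 8-byte accesses
    l2.access(addr)
print(l2.misses)  # -> 131072
```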
|
|
From: Josef W. <Jos...@gm...> - 2007-11-21 11:01:40
|
On Tuesday 20 November 2007, Mehmet Belgin wrote:
> Thanks a lot for the reply, Josef! You are right, a cache line of 16384
> doesn't make sense... I wasn't thinking, so just passed what papi_mem()
> reported to me... I checked the specs of the CPU (Opteron 240) and learned
> that both L1 and L2 cache lines are just 64 bytes, which sounds more likely.
> The Valgrind incorrectly assumes that the L2 cache is 8-way associative,
> which should be 16-way instead, that's why I don't use its default settings.

That is a bug then. I just checked http://www.sandpile.org/impl/k8.htm and you are right: every AMD processor of the 8th generation (including your Opteron 240) has a 16-way associative L2. Can you please file a bug for this?

Hmm. For AMD, the L2 associativity can be read out directly with CPUID. BUT... according to http://www.sandpile.org/ia32/cpuid.htm, the numbers cannot be used directly. E.g. a value of 8 means 16-way :-(

> I don't have cpuid as a command, maybe that's what confuses Valgrind...

No, cpuid is an x86 assembly instruction, which is used to identify the processor type. Valgrind calculates the cache parameters from the cpuid results.

> This time, Valgrind overestimates the number of cache misses, even with the
> corrected settings:
>
> PAPI : 99,653 L2 miss
> Valgrind : 1,609,350 L2 miss !!!
>
> I am really out of ideas, I am very much willing to use valgrind to explain
> my results where PAPI can't provide enough details...

According to the source in the other post, you go over the arrays sequentially. As cachegrind gives almost the same number of L2 misses (1.7M) as L1 misses (2.1M), these arrays (si2aabase##_si and ja) appear to be too big for the cache. It is quite possible that the hardware prefetcher of the Opteron detects the stream access and succeeds in prefetching here. I assume that PAPI only counts the demand misses, and not the ones triggered by the hardware prefetcher (can you check this?).
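[Editor's sketch] The encoding Josef describes can be written as a lookup table. The mapping below is the editor's reading of the sandpile.org table he cites (CPUID leaf 0x80000006, L2 associativity field) and should be treated as illustrative:

```python
# AMD encodes the L2 associativity field of CPUID leaf 0x80000006;
# the raw value is NOT the way count (editor's sketch of the cited table).
AMD_L2_ASSOC = {
    0x0: 0,     # L2 disabled
    0x1: 1,     # direct mapped
    0x2: 2,
    0x4: 4,
    0x6: 8,
    0x8: 16,    # a raw value of 8 means 16-way: the source of the bug
    0xF: None,  # fully associative
}
print(AMD_L2_ASSOC[0x8])  # -> 16
```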
If, instead of PAPI, you use a lower-level API like perfctr or perfmon2, you can configure the counter to measure the misses from the hardware prefetcher, too (check the processor manual for the event/event masks). You can also look up the source code of PAPI to see the actual counter configuration it uses for your Opteron in your case.

BTW, the cache simulator in callgrind is quite similar to cachegrind's, but it also has a best-case hardware stream prefetcher simulation (switch on with "--simulate-hwpref=yes"). Perhaps this helps to clarify your difference.

Josef
|
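[Editor's sketch] Josef's hypothesis can be illustrated with a deliberately simplistic model: with a best-case next-line prefetcher that always runs one line ahead of a sequential sweep, almost no misses are demand misses, even though the same number of lines is fetched from memory. This is consistent with a demand-miss counter (PAPI) reporting far fewer L2 misses than a simulator with no prefetcher (Cachegrind):

```python
# Count demand misses for a sequential sweep over n_lines cache lines,
# with and without a best-case next-line prefetcher (editor's sketch;
# the cache is modeled as infinite for simplicity).
def demand_misses(n_lines, prefetch=True):
    present = set()   # lines currently in the cache
    demand = 0
    for line in range(n_lines):
        if line not in present:
            demand += 1               # a miss the program has to wait for
        present.add(line)
        if prefetch:
            present.add(line + 1)     # prefetcher stays ahead of the sweep
    return demand

print(demand_misses(131072, prefetch=False))  # -> 131072 (every line misses)
print(demand_misses(131072, prefetch=True))   # -> 1 (only the very first line)
```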