I think I have it figured out... I ran some tests with perfex, and the
numbers I'm getting seem valid to me. I don't have a patch for PAPI or libpfm,
but I suspect people who are familiar with the internals will be able to
create a patch out of this easily...
I measured L1 cache misses as follows on the Pentium 4 machines
available to me:
perfex -e 0x3B000/0x12000204@0x8000000C --p4pe=0x1000001 --p4pmv=0x1
L2 cache miss rates are trivial from this, just change --p4pe to
The fields of the -e argument break down as follows.
0x3B000 (CCCR):
  bits 16-17 ('3'): measure for any active thread
  bits 12-15 ('B'): bit 12 enables the counter, bits 13-15 select the ESCR
  These settings are the same for the instr_completed event, no surprise.
0x12000204 (ESCR):
  bits 24-31 ('12'): bits 25-30 select event 09h, being replay_event
  bits 8-11 ('2'): bit 9 set to count NBOGUS tagged µops
  bits 0-3 ('4'): bit 2 set, enabling counting for thread 0 at user level
0x8000000C (counter):
  bits 28-31 ('8'): bit 31 enables fast rdpmc
  bits 0-3 ('C'): 0Ch, which corresponds to MSR_IQ_COUNTER0
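As a sanity check, the three hex values can be decoded with a few lines of
Python. This is only a sketch: the field positions are my assumption based on
the ESCR/CCCR layout in the Intel manual, and the helper and key names are
made up for illustration (they are not part of perfex or PAPI).

```python
# Sketch only: field positions are assumed from the Intel P4 ESCR/CCCR layout.

def field(value, lo, hi):
    """Extract bits lo..hi (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

CCCR = 0x3B000        # first part of the -e argument
ESCR = 0x12000204     # part after the '/'
PMC = 0x8000000C      # part after the '@'

decoded = {
    "cccr_enable":        field(CCCR, 12, 12),
    "cccr_escr_select":   field(CCCR, 13, 15),
    "cccr_active_thread": field(CCCR, 16, 17),
    "escr_t0_usr":        field(ESCR, 2, 2),
    "escr_event_mask":    field(ESCR, 9, 24),   # bit 0 of the mask = NBOGUS
    "escr_event_select":  field(ESCR, 25, 30),  # 09h = replay_event
    "pmc_index":          field(PMC, 0, 3),     # 0Ch = MSR_IQ_COUNTER0
    "pmc_fast_rdpmc":     field(PMC, 31, 31),
}

for name, val in decoded.items():
    print(f"{name:20s} = {val:#x}")
```

If the decoded event select is not 0x9 or the event mask is not 0x1, the
layout assumed here does not match the value passed to perfex.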
This specifies counting replay_event on an appropriate counter, but only
tagged µops will be counted. Tagging is specified by setting the appropriate
bits in IA32_PEBS_ENABLE and MSR_PEBS_MATRIX_VERT (see Table A-10 in the Intel
docs). Using perfex, this is done with --p4pe and --p4pmv respectively.
In IA32_PEBS_ENABLE, bits 0 and 24 need to be set, resulting in
0x1000001. Table A-10 in the Intel docs says to also enable bit 25, but that's
only needed when actually using PEBS (and we are not in this case).
MSR_PEBS_MATRIX_VERT only needs bit 0 to be set, according to Table A-10,
hence 0x1.
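The two tagging values can be rebuilt from the bit numbers to double-check the
arithmetic. Again just a sketch: set_bits and the variable names are
hypothetical helpers, and the command string only reuses the perfex options
already shown above.

```python
def set_bits(*positions):
    """Return an integer with exactly the given bit positions set."""
    value = 0
    for pos in positions:
        value |= 1 << pos
    return value

# IA32_PEBS_ENABLE: bits 0 and 24 (bit 25 left out, since PEBS itself is unused)
pebs_enable = set_bits(0, 24)
# MSR_PEBS_MATRIX_VERT: bit 0 only
pebs_matrix_vert = set_bits(0)

# Value Table A-10 would give if bit 25 were also set, for comparison only
pebs_enable_with_bit25 = set_bits(0, 24, 25)

command = (
    "perfex -e 0x3B000/0x12000204@0x8000000C "
    f"--p4pe={pebs_enable:#x} --p4pmv={pebs_matrix_vert:#x}"
)
print(command)
print(f"with bit 25: {pebs_enable_with_bit25:#x}")
```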
If something isn't clear in the details above, please let me know, and
I'll try and explain.
Now, for the validation of this, I used two SPEC CPU2000 benchmarks,
art and mcf, which are notorious for having a large number of cache misses.
I've also measured cache miss rates for these on an Opteron 244 and a Core 2
Duo (same statically linked binaries used on all machines, compiled/linked with
gcc 4.1.2 -O2 -static). The graphs are uploaded at
http://www.elis.ugent.be/~kehoste/PAPI_cache_misses.
If you want these for future reference, make a local copy, because I can't
guarantee they will be up there forever. To me, these numbers make perfect
sense.
Two notes I should make. First, the L2 misses for the Core 2 Duo machine are
so low that they do not show up in the graph. Second, one thing which might
seem strange at first is that the L1 miss rate for art on the model 2
Pentium 4 (8K L1-D) is _lower_ than on the model 3/4 Pentium 4s (16K L1-D).
I think this can be explained because the latter models probably have more
aggressive instruction prefetching, which causes more L1 data entries to be
pushed out, and hence more L1-D cache misses.
Any comments on this are highly appreciated.