Thread: [perfmon2] L1 data cache misses on Pentium 4
From: Kenneth H. <ken...@ug...> - 2007-12-10 15:16:26
Hello,

I know this has come up before on this list, but this time I'm determined to solve this mystery :-) Could somebody point me to some documentation or previous attempts to get reasonable counts for L1 data cache misses on the Pentium 4? I'd like to see what people have tried in the past, what their reasoning was, and why they concluded the counts they were getting were wrong. That way, I can see where the errors have slipped in (maybe in the Intel documentation itself, as was the case with the instr_completed event).

greetings,

Kenneth

--
Computer Science is no more about computers than astronomy is about telescopes. (E. W. Dijkstra)

Kenneth Hoste
ELIS - Ghent University
email: ken...@el...
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste
From: Philip M. <mu...@cs...> - 2007-12-12 12:31:52
Hi Kenneth,

We've tried various combinations of this metric over time and we finally gave up. We were testing it with an app whose memory footprint we knew, and the numbers were way off at all problem sizes. Even taking prefetching and speculation into account, we still couldn't get the numbers to match. If you solve this one, we'll give you a star on the PAPI walk of fame...

Phil

On Dec 10, 2007, at 4:16 PM, Kenneth Hoste wrote:
[quoted message trimmed]
From: Kenneth H. <ken...@ug...> - 2007-12-14 10:04:15
Hi,

I think I have it figured out... I ran some tests with perfex, and the numbers I'm getting seem valid to me. I don't have a patch for PAPI or libpfm, but I suspect people who are familiar with their internals will be able to create a patch out of this easily.

I measured L1 cache misses as follows on the Pentium 4 machines available to me:

perfex -e 0x3B000/0x12000204@0x8000000C --p4pe=0x1000001 --p4pmv=0x1

L2 cache miss rates are trivial from this; just change --p4pe to 0x1000002.

Breaking this down:

CCCR: 0x3B000
  bits 16-17 ('3'): measure for any active thread
  bits 12-15 ('B'): bit 12 enables the counter, bits 13-15 select ESCR 05h
These settings are the same as for the instr_completed event, no surprise there.

ESCR: 0x12000204
  bits 25-30 (the '12' in the top byte): event select 09h, which is replay_event
  bit 9 (the '2' in 0x204): event mask bit NBOGUS, i.e. count non-bogus tagged µops
  bit 2 (the '4' in 0x204): enable user-level counting for thread 0

counter spec: 0x8000000C
  bit 31 (the '8' in the top nibble): enables fast RDPMC
  bits 0-6 ('C'): 0Ch, which corresponds to MSR_IQ_COUNTER0

This specifies counting replay_event on an appropriate counter, but only tagged µops will be counted. Tagging is specified by setting the appropriate bits in IA32_PEBS_ENABLE and MSR_PEBS_MATRIX_VERT (see Table A-10 in the Intel docs). Using perfex, this is done with --p4pe and --p4pmv respectively.

In IA32_PEBS_ENABLE, bits 0 and 24 need to be set, resulting in 0x1000001. Table A-10 in the Intel docs says to also enable bit 25, but that's only needed when actually using PEBS (and we are not in this case). MSR_PEBS_MATRIX_VERT only needs bit 0 set, according to Table A-10, hence 0x1.

If something isn't clear in the details above, please let me know, and I'll try to explain.

Now, for the validation: I used two SPEC CPU2000 benchmarks, art and mcf, which are notorious for their large numbers of cache misses. I've also measured cache miss rates for these on an Opteron 244 and a Core 2 Duo (the same statically linked binaries were used on all machines, compiled/linked with gcc 4.1.2 -O2 -static). The graphs are uploaded at http://www.elis.ugent.be/~kehoste/PAPI_cache_misses. If you want these for future reference, make sure to make a local copy, because I can't guarantee they will be up there forever. To me, these numbers make perfect sense.

Two notes I should make: the L2 misses for the Core 2 Duo machine are so low that they don't show in the graph; and one thing which might seem strange at first is that the L1 miss rate for art on the model 2 Pentium 4 (8K L1-D) is _lower_ than on the model 3/4 Pentium 4s (16K L1-D). I think this can be explained by the latter models probably having more aggressive prefetching, which causes more L1 data entries to be pushed out, and hence more L1-D cache misses.

Any comments on this are highly appreciated.

K.

--
Computer Science is no more about computers than astronomy is about telescopes. (E. W. Dijkstra)

Kenneth Hoste
ELIS - Ghent University
email: ken...@el...
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste
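[Editor's note: the nibble-by-nibble breakdown in the message above can be checked mechanically. The following Python sketch is an illustration added for this archive, not part of perfex or any tool in the thread; it just extracts the named fields from the three hex values.]

```python
def bits(value, lo, hi):
    """Extract bits lo..hi (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

cccr, escr, counter = 0x3B000, 0x12000204, 0x8000000C

# CCCR: thread qualifier, enable bit, and ESCR select
assert bits(cccr, 16, 17) == 0x3      # count for any active thread
assert bits(cccr, 12, 12) == 1        # counter enable
assert bits(cccr, 13, 15) == 0x5      # ESCR select = 05h

# ESCR: event select, event mask, privilege bits
assert bits(escr, 25, 30) == 0x09     # event 09h = replay_event
assert bits(escr, 9, 9) == 1          # event mask: NBOGUS
assert bits(escr, 2, 2) == 1          # T0_USR: user-level, thread 0

# perfex counter spec: fast-RDPMC flag plus counter number
assert bits(counter, 31, 31) == 1     # fast RDPMC enabled
assert bits(counter, 0, 6) == 0x0C    # MSR_IQ_COUNTER0

print("all field checks pass")
```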
From: Kenneth H. <ken...@ug...> - 2007-12-20 08:55:46
Nobody has comments on this? Do the settings seem reasonable? Or am I just dreaming I got this right?

K.

On 14 Dec 2007, at 11:04, Kenneth Hoste wrote:
[quoted message trimmed]

--
Computer Science is no more about computers than astronomy is about telescopes. (E. W. Dijkstra)

Kenneth Hoste
ELIS - Ghent University
email: ken...@el...
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste
From: Philip M. <mu...@cs...> - 2007-12-20 22:51:55
Sorry, we didn't mean to ignore you. This is great stuff; we've needed these definitions for a long time. Are the cache misses data cache, instruction cache, or both? It's worth digging through libpfm to see if this can even be specified symbolically. Dan, you're our PAPI P4 expert, any thoughts?

Phil

On Dec 20, 2007, at 3:55 AM, Kenneth Hoste wrote:
[quoted message trimmed]
From: Dan T. <ter...@ee...> - 2007-12-21 22:39:32
Kenneth –

Didn't mean to ignore this; I just didn't get a chance to take a close look before the holidays got in the way. I still haven't looked closely enough to offer an informed opinion. It's on my TODO list...

- d

From: Philip Mucci
Sent: Thursday, December 20, 2007 5:52 PM
Subject: Re: [perfmon2] [Ptools-perfapi] L1 data cache misses on Pentium 4
[quoted message trimmed]
From: Dan T. <ter...@ee...> - 2008-01-25 15:27:19
Kenneth –

I want to echo Phil's comments below, a month later. I also wanted to provide a bit more background, and some development news.

As it turns out, the implementation you suggest below was one of the first things we tried, about 3 years ago. You can track that history in the PAPI cvs viewer at http://icl.cs.utk.edu/viewcvs/viewcvs.cgi/PAPI/papi/ - you need to look in p4_events.c to see what we did. It's now in the attic, but still viewable at:
http://icl.cs.utk.edu/viewcvs/viewcvs.cgi/PAPI/papi/src/p4_events.c?hideattic=0&view=log
Check out the tables in version 1.49.

Bottom line: we tried implementing cache events both with replay_event, as you discuss below, and with the BSQ_cache_reference event for L2 and L3 events. As I recall, we got significantly varying numbers with both approaches, and had the further problem that the replay_event implementation used a shared resource (the PEBS registers). This resulted in duplicate counts if someone tried to measure, for example, L1_LD_MISS and L1_ST_MISS in the same event set. Even worse, the last event added propagated its counts to all other related events.

Having said all that by way of history, I've finally concluded that it is still better to have *some* way to measure L1 events, even if unpredictable, than to have no way at all.

The problem now is that we've switched to using perfmon2's libpfm to specify our native event names, and libpfm has no support for using the PEBS registers with replay_event.

Over the last few days, I've implemented support for the replay_event modifiers (as described in the Intel Developers Guide, Table A-10) in the libpfm library, and will be submitting a patch to Stephane later today. These modifiers are supported as 'virtual' mask bits that behave the same way logically as any other unit mask bits. The difference is that they program the PEBS registers instead of the unit mask field of the ESCR register. Since there is only one pair of shared PEBS registers, this means that although the virtual masks can be logically OR'd for any single event, multiple replay_event instances with different virtual masks cannot be measured simultaneously. This is the same restriction PAPI suffered from several years ago. Even with this restriction, I think your contribution is valuable enough to become part of the release for both libpfm and PAPI.

I hope Stephane accepts the patch for these changes. If/when he does, I will commit my PAPI changes to cvs and let you know.

Thanks for your work on this,
- dan

From: Philip Mucci
Sent: Thursday, December 20, 2007 5:52 PM
Subject: Re: [perfmon2] [Ptools-perfapi] L1 data cache misses on Pentium 4
[quoted message trimmed]
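[Editor's note: the shared-register conflict Dan describes can be illustrated with a small sketch. This is an added illustration, not code from the thread; the (PEBS_ENABLE, PEBS_MATRIX_VERT) values are taken from the p4_replay_regs table in Dan's patch later in this thread, and the simplification is that tagging matches whenever any programmed condition holds.]

```python
# Each replay metric needs specific bits in the single, shared pair of
# tagging registers (values per Intel SDM Vol 3B, Table A-10).
METRICS = {
    "L1_LD_MISS":   {"pebs_enable": 0x03000001, "matrix_vert": 0x1},
    "DTLB_ST_MISS": {"pebs_enable": 0x03000004, "matrix_vert": 0x2},
}

def program(selected):
    """Programming two metrics at once can only OR their bits into the
    one shared register pair -- there is no per-counter copy."""
    enable = vert = 0
    for name in selected:
        enable |= METRICS[name]["pebs_enable"]
        vert |= METRICS[name]["matrix_vert"]
    return enable, vert

# Alone, each metric tags exactly its own condition:
print(program(["L1_LD_MISS"]))           # enable 0x03000001, vert 0x1
# Together, every replay counter sees the union of both conditions --
# the "duplicate counts" problem described above:
print(program(["L1_LD_MISS", "DTLB_ST_MISS"]))
```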
From: Dan T. <ter...@ee...> - 2008-01-25 15:47:34
|
Stephane -

Enclosed is a patch to implement support for 'virtual masks' on the replay_event for Pentium4 as discussed in the previous mail below. These changes include:

- The addition of virtual mask definitions to replay_event in pentium4_events.h as defined by Intel Developers Guide Vol 3B, Table A-10.
- The addition of a new data structure definition in pfmlib_pentium4_priv.h to describe the bit settings of the PEBS registers for each of these virtual masks.
- The addition of a data structure and support logic in the dispatch_events routine of pfmlib_pentium4.c to decode the replay_event virtual masks.

Please contact me if you have any questions about these changes.

Thanks,
- dan

+++++++++++++++++++++++++++++++++++++++++++++++++++++++
diff -wB C:/Documents and Settings/terpstra/Local Settings/Temp/pentium4_events_1.1.1.1_3.h C:/papiHEAD/src/libpfm-3.y/lib/pentium4_events.h
1781a1782,1817
> {.name = "L1_LD_MISS",
>  .desc = "Virtual mask for L1 cache load miss replays.",
>  .bit = 2,
> },
> {.name = "L2_LD_MISS",
>  .desc = "Virtual mask for L2 cache load miss replays.",
>  .bit = 3,
> },
> {.name = "DTLB_LD_MISS",
>  .desc = "Virtual mask for DTLB load miss replays.",
>  .bit = 4,
> },
> {.name = "DTLB_ST_MISS",
>  .desc = "Virtual mask for DTLB store miss replays.",
>  .bit = 5,
> },
> {.name = "DTLB_ALL_MISS",
>  .desc = "Virtual mask for all DTLB miss replays.",
>  .bit = 6,
> },
> {.name = "BR_MSP",
>  .desc = "Virtual mask for tagged mispredicted branch replays.",
>  .bit = 7,
> },
> {.name = "MOB_LD_REPLAY",
>  .desc = "Virtual mask for MOB load replays.",
>  .bit = 8,
> },
> {.name = "SP_LD_RET",
>  .desc = "Virtual mask for split load replays. Use with load_port_replay event.",
>  .bit = 9,
> },
> {.name = "SP_ST_RET",
>  .desc = "Virtual mask for split store replays. Use with store_port_replay event.",
>  .bit = 10,
> },
1967a2004
> #define PME_REPLAY_EVENT 37
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
diff -wB C:/Documents and Settings/terpstra/Local Settings/Temp/pfmlib_pentium4_1.1.1.1_4.c C:/papiHEAD/src/libpfm-3.y/lib/pfmlib_pentium4.c
115a116,118
> #define PMC_PEBS_MATRIX_VERT 63
> #define PMC_PEBS_ENABLE      64
>
136a140,185
> /* This array provides values for the PEBS_ENABLE and PEBS_MATRIX_VERT
>    registers to support a series of metrics for replay_event.
>    The first two entries are dummies; the remaining 9 correspond to
>    virtual bit masks in the replay_event definition and map onto Intel
>    documentation.
> */
>
> #define P4_REPLAY_REAL_MASK 0x00000003
> #define P4_REPLAY_VIRT_MASK 0x00000FFC
>
> static pentium4_replay_regs_t p4_replay_regs[]={
> /* 0 */  {.enb      = 0,           /* dummy */
>           .mat_vert = 0,
>          },
> /* 1 */  {.enb      = 0,           /* dummy */
>           .mat_vert = 0,
>          },
> /* 2 */  {.enb      = 0x03000001,  /* 1stL_cache_load_miss_retired */
>           .mat_vert = 0x00000001,
>          },
> /* 3 */  {.enb      = 0x03000002,  /* 2ndL_cache_load_miss_retired */
>           .mat_vert = 0x00000001,
>          },
> /* 4 */  {.enb      = 0x03000004,  /* DTLB_load_miss_retired */
>           .mat_vert = 0x00000001,
>          },
> /* 5 */  {.enb      = 0x03000004,  /* DTLB_store_miss_retired */
>           .mat_vert = 0x00000002,
>          },
> /* 6 */  {.enb      = 0x03000004,  /* DTLB_all_miss_retired */
>           .mat_vert = 0x00000003,
>          },
> /* 7 */  {.enb      = 0x03018001,  /* Tagged_mispred_branch */
>           .mat_vert = 0x00000010,
>          },
> /* 8 */  {.enb      = 0x03000200,  /* MOB_load_replay_retired */
>           .mat_vert = 0x00000001,
>          },
> /* 9 */  {.enb      = 0x03000400,  /* split_load_retired */
>           .mat_vert = 0x00000001,
>          },
> /* 10 */ {.enb      = 0x03000400,  /* split_store_retired */
>           .mat_vert = 0x00000002,
>          },
> };
>
406a456,480
> /* Special processing for the replay event:
>    Remove virtual mask bits from actual mask;
>    scan mask bit list and OR bit values for each virtual mask
>    into the PEBS ENABLE and PEBS MATRIX VERT registers */
> if (event == PME_REPLAY_EVENT) {
>     escr_value.bits.event_mask &= P4_REPLAY_REAL_MASK;  /* remove virtual mask bits */
>     if (event_mask & P4_REPLAY_VIRT_MASK) {  /* find a valid virtual mask */
>         output->pfp_pmcs[j].reg_value   = 0;
>         output->pfp_pmcs[j].reg_num     = PMC_PEBS_ENABLE;
>         output->pfp_pmcs[j].reg_addr    = p4_pmc_regmap[PMC_PEBS_ENABLE].addr;
>         output->pfp_pmcs[j+1].reg_value = 0;
>         output->pfp_pmcs[j+1].reg_num   = PMC_PEBS_MATRIX_VERT;
>         output->pfp_pmcs[j+1].reg_addr  = p4_pmc_regmap[PMC_PEBS_MATRIX_VERT].addr;
>         for (n = 0; n < input->pfp_events[i].num_masks; n++) {
>             mask = input->pfp_events[i].unit_masks[n];
>             if (mask > 1 && mask < 11) {  /* process each valid mask we find */
>                 output->pfp_pmcs[j].reg_value   |= p4_replay_regs[mask].enb;
>                 output->pfp_pmcs[j+1].reg_value |= p4_replay_regs[mask].mat_vert;
>             }
>         }
>         j += 2;
>         output->pfp_pmc_count += 2;
>     }
> }
>
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
diff -wB C:/Documents and Settings/terpstra/Local Settings/Temp/pfmlib_pentium4_priv_1.1.1.1_5.h C:/papiHEAD/src/libpfm-3.y/lib/pfmlib_pentium4_priv.h
89a90,111
> * pentium4_replay_regs_t
> *
> * Describe one pair of PEBS registers for use with the replay_event event.
> *
> * "p4_replay_regs" is a flat array of these structures
> * that defines all the PEBS pairs per Table A-10 of
> * the Intel System Programming Guide Vol 3B.
> *
> * @enb:      value for the PEBS_ENABLE register for a given replay metric.
> * @mat_vert: value for the PEBS_MATRIX_VERT register for a given metric.
> *            The replay_event event defines a series of virtual mask bits
> *            that serve as indexes into this array. The values at that index
> *            provide information programmed into the PEBS registers to count
> *            specific metrics available to the replay_event event.
> **/
>
> typedef struct {
>     int enb;
>     int mat_vert;
> } pentium4_replay_regs_t;
>
> /**
+++++++++++++++++++++++++++++++++++++++++++++++++++++++

_____

From: pto...@cs... [mailto:pto...@cs...] On Behalf Of Dan Terpstra
Sent: Friday, January 25, 2008 10:27 AM
To: 'Kenneth Hoste'
Cc: per...@li...; 'papi list'
Subject: Re: [Ptools-perfapi] [perfmon2] L1 data cache misses on Pentium 4

Kenneth -

I want to echo Phil's comments below, a month later. I also wanted to provide a bit more background, and some development news.

As it turns out, the implementation you suggest below was one of the first that we tried about 3 years ago. You can track that history by going to the PAPI cvs viewer at: http://icl.cs.utk.edu/viewcvs/viewcvs.cgi/PAPI/papi/

You need to look in p4_events.c to see what we did. It's now in the attic, but still viewable at:
http://icl.cs.utk.edu/viewcvs/viewcvs.cgi/PAPI/papi/src/p4_events.c?hideattic=0&view=log

Check out the tables in version 1.49. Bottom line: we tried implementing cache events with both the replay_event, as you discuss below, and with the BSQ_cache_reference event for L2 and L3 events. As I recall, we got significantly varying numbers with both approaches, and had the further problem that the replay_event implementation used a shared resource (the PEBS registers). This resulted in duplicate counts if someone tried to measure, for example, L1_LD_MISS and L1_ST_MISS in the same event set. Even worse, the last event added propagated its counts to all other related events.
Having said all that by way of history, I've finally concluded that it is still better to have *some* way to measure L1 events, even if unpredictable, than to have no way at all.

Now the problem became that we've switched to using perfmon2's libpfm to specify our native event names, and libpfm has no support for using the PEBS registers with replay_event. Over the last few days, I've implemented support for the replay_event modifiers (as described in the Intel Developers Guide, Table A-10) in the libpfm library and will be submitting a patch to Stephane later today. These modifiers are supported as 'virtual' mask bits that behave the same way logically as any other unit mask bits. The difference is that they program the PEBS registers instead of the unit mask field of the ESCR register. Since there is only one pair of shared PEBS registers, this means that although the virtual masks can be logically OR'd for any single event, multiple replay_event instances with different virtual masks cannot be measured simultaneously. This is the same restriction PAPI suffered several years ago. Even with this restriction I think your contribution is valuable enough to become part of the release for both libpfm and PAPI.

I hope Stephane accepts the patch for these changes. If/When he does, I will commit my PAPI changes to cvs and let you know.

Thanks for your work on this,
- dan

_____

From: per...@li... [mailto:per...@li...] On Behalf Of Philip Mucci
Sent: Thursday, December 20, 2007 5:52 PM
To: Kenneth Hoste
Cc: papi list; per...@li...
Subject: Re: [perfmon2] [Ptools-perfapi] L1 data cache misses on Pentium 4

Sorry, we didn't mean to ignore you. This is great stuff. We've needed these definitions for a long time. Are the cache misses data cache or I-cache or both? It's worth digging through libpfm to see if this even can be specified symbolically.
Dan, you're our PAPI P4 expert, any thoughts?

Phil

On Dec 20, 2007, at 3:55 AM, Kenneth Hoste wrote:

Nobody has comments on this? Do the settings seem reasonable? Or am I just dreaming I got this right?

K.

On 14 Dec 2007, at 11:04, Kenneth Hoste wrote:

Hi,

I think I have it figured out... I ran some tests with perfex, and the numbers I'm getting seem valid to me. I don't have any patch for PAPI or libpfm, but I suspect people who are familiar with the insides of it will be able to create a patch out of this easily...

I measured L1 cache misses as follows on the Pentium 4 machines available to me:

perfex -e 0x3B000/0x12000204@0x8000000C --p4pe=0x1000001 --p4pmv=0x1

L2 cache miss rates are trivial from this; just change --p4pe to 0x1000002.

Breaking this down:

CCCR: 0x3B000

bits 16-17 ('3'): measure for any active thread
bits 12-15 ('B'): bit 12 enables the counters, bits 13-15 select ESCR 05h

These settings are the same as for the instr_completed event, no surprise there.

ESCR: 0x12000204

bits 24-31 ('12'): bits 25-30 select 09h, being replay_event
bits 8-11 ('2'): bit 9 set, to count NBOGUS tagged µops
bits 0-3 ('4'): bit 2 set, enabling counting at user level for thread 0

counter: 0x8000000C

bits 28-31 ('8'): enables fast rdpmc
bits 0-3 ('C'): 0Ch, which corresponds to MSR_IQ_COUNTER0

This specifies counting replay_event on an appropriate counter, but only tagged µops will be counted. Tagging is specified by setting the appropriate bits in IA32_PEBS_ENABLE and MSR_PEBS_MATRIX_VERT (see Table A-10 in the Intel docs). Using perfex, this is done with --p4pe and --p4pmv respectively.

In IA32_PEBS_ENABLE, bits 0 and 24 need to be set, resulting in 0x1000001. Table A-10 in the Intel docs says to also enable bit 25, but that's only needed when using PEBS (and we are not in this case). MSR_PEBS_MATRIX_VERT only needs bit 0 to be set, according to Table A-10, hence 0x1.
If something isn't clear in the details above, please let me know, and I'll try and explain.

Now, for the validation of this, I used two SPEC CPU2000 benchmarks, art and mcf, which are notorious for having a large number of cache misses. I've also measured cache miss rates for these on an Opteron 244 and a Core 2 Duo (same statically linked binaries used on all machines, compiled/linked with gcc 4.1.2 -O2 -static). The graphs are uploaded at http://www.elis.ugent.be/~kehoste/PAPI_cache_misses. If you want these for future reference, make sure to make a local copy, because I can't guarantee they will be up there forever. To me, these numbers make perfect sense.

Two notes I should make: the L2 misses for the Core 2 Duo machine are so low that they don't show in the graph; and one thing which might seem strange at first is that the L1 miss rate for art on the model 2 Pentium 4 (8K L1-D) is _lower_ than on the model 3/4 Pentium 4s (16K L1-D). I think this can be explained because the latter models probably have more aggressive instruction prefetching, which causes more L1 data entries to be pushed out, and hence more L1-D cache misses.

Any comments on this are highly appreciated.

K.

--

Computer Science is no more about computers than astronomy is about telescopes. (E. W. Dijkstra)

Kenneth Hoste
ELIS - Ghent University
email: ken...@el...
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste

_______________________________________________
perfmon2-devel mailing list
per...@li...
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

--

Computer Science is no more about computers than astronomy is about telescopes. (E. W. Dijkstra)

Kenneth Hoste
ELIS - Ghent University
email: ken...@el...
blog: http://www.elis.ugent.be/~kehoste/blog
website: http://www.elis.ugent.be/~kehoste

_______________________________________________
Ptools-perfapi mailing list
Pto...@cs...
http://lists.cs.utk.edu/listinfo/ptools-perfapi
|
From: stephane e. <er...@go...> - 2008-01-29 10:13:13
|
Dan,

Patch was applied. Thanks.

2008/1/25 Dan Terpstra <ter...@ee...>:
> [snip]
|
From: Dan T. <ter...@ee...> - 2008-02-05 16:17:37
|
Stephane -

There's a bug in the tag bit support for Pentium4. I screwed up the levels of indirection when I first implemented it months ago. I finally got a machine running to test on. Below is a *tested* fix.

- dan

322c322
<         unsigned int tag_value, tag_enable;
---
>         unsigned int bit, tag_value, tag_enable;
413,415c413,416
<                 if (mask < EVENT_MASK_BITS &&
<                     pentium4_events[event].event_masks[mask].name) {
<                         event_mask |= (1 << pentium4_events[event].event_masks[mask].bit);
---
>                 bit = pentium4_events[event].event_masks[mask].bit;
>                 if (bit < EVENT_MASK_BITS &&
>                     pentium4_events[event].event_masks[mask].name) {
>                         event_mask |= (1 << bit);
417,419c418,420
<                 if (mask >= EVENT_MASK_BITS &&
<                     pentium4_events[event].event_masks[mask].name) {
<                         tag_value |= (1 << (pentium4_events[event].event_masks[mask].bit - EVENT_MASK_BITS));
---
>                 if (bit >= EVENT_MASK_BITS &&
>                     pentium4_events[event].event_masks[mask].name) {
>                         tag_value |= (1 << (bit - EVENT_MASK_BITS));
|
From: stephane e. <er...@go...> - 2008-02-06 21:03:30
|
Dan,

On Feb 5, 2008 5:17 PM, Dan Terpstra <ter...@ee...> wrote:
> Stephane -
> There's a bug in the tag bit support for Pentium4.
> I screwed up levels of indirection when I first implemented it months ago.
> I finally got a machine running to test on.
> Below is a *tested* fix.
> - dan

Could you, please, resubmit as an attachment? Your mailer truncated the lines, it seems.

Thanks.

> [snip]
|
From: Dan T. <ter...@ee...> - 2008-02-06 21:15:07
Attachments:
pfmlib_pentium4.c.patch
|
Sorry. Probably because of so many nested <cr>s...
Patch enclosed.
- d

> -----Original Message-----
> From: stephane eranian [mailto:er...@go...]
> Sent: Wednesday, February 06, 2008 4:03 PM
> To: Dan Terpstra
> Cc: per...@li...
> Subject: Re: Pentium 4 tag bits
>
> Dan,
>
> On Feb 5, 2008 5:17 PM, Dan Terpstra <ter...@ee...> wrote:
> > Stephane -
> > There's a bug in the tag bit support for Pentium4.
> > I screwed up levels of indirection when I first implemented it months ago.
> > I finally got a machine running to test on.
> > Below is a *tested* fix.
> > - dan
>
> Could you, please, resubmit as an attachment? Your mailer truncated
> the lines, it seems.
>
> Thanks.
>
> [snip]
|
From: stephane e. <er...@go...> - 2008-02-06 23:14:01
|
Dan,

Applied. Thanks.

On Feb 6, 2008 10:14 PM, Dan Terpstra <ter...@ee...> wrote:
> Sorry. Probably because of so many nested <cr>s...
> Patch enclosed.
> - d
>
> [snip]
|
From: Dan T. <ter...@ee...> - 2008-02-16 20:59:46
Attachments:
pfmlib_pentium4.c.patch
|
There's a bug in the data table of pebs_enable values for the replay_event virtual unit masks. Intel docs suggest turning on bits 24 and 25, but bit 25 is needed only if PEBS itself is enabled, which isn't the case here. The enclosed patch fixes the table.
- dan

> -----Original Message-----
> From: per...@li... [mailto:perfmon2-devel-bo...@li...] On Behalf Of Dan Terpstra
> Sent: Wednesday, February 06, 2008 4:15 PM
> To: 'stephane eranian'
> Cc: per...@li...
> Subject: Re: [perfmon2] Pentium 4 tag bits
>
> Sorry. Probably because of so many nested <cr>s...
> Patch enclosed.
> - d
>
> [snip]
|
From: stephane e. <er...@go...> - 2008-02-20 17:43:21
|
Dan,

Patch applied.

On Sat, Feb 16, 2008 at 9:59 PM, Dan Terpstra <ter...@ee...> wrote:
> There's a bug in the data table for pebs_enable values for the replay_event
> virtual unit masks. Intel docs suggest turning on bits 24 and 25, but 25 is
> only if PEBS is enabled. That isn't true in this case. The enclosed patch
> fixes the table.
> - dan
>
> > -----Original Message-----
> > From: per...@li... [mailto:perfmon2-devel-bo...@li...] On Behalf Of Dan Terpstra
> > Sent: Wednesday, February 06, 2008 4:15 PM
> > To: 'stephane eranian'
> > Cc: per...@li...
> > Subject: Re: [perfmon2] Pentium 4 tag bits
> >
> > Sorry. Probably because of so many nested <cr>s...
> > Patch enclosed.
> > - d
> >
> > > [earlier quoted messages and line-wrapped patch hunks snipped]
|