From: Will C. <wc...@re...> - 2004-01-30 21:09:00
|
I have been experimenting with making OProfile support for the Pentium M processor. The pentium M is a p6 core with tweaks and P4-like bus interface. I have some patches for OProfile and the 2.6 kernel which allow OProfile to use the performance monitoring hardware on the Pentium M. This is still work in progress. I have been trying this out on a Pentium M development machine, not a traditional laptop. There is likely to be some issue with this kernel OProfile support and the power management support on the laptop (nmi vs. smm/power-management bios). The kernel patch (kernel-pentiumm.diff) applied cleanly to the kernel-source-2.6.1-1.141 in the Red Hat Raw hide. I built the kernel with the attached config file (config-20040129) because the Pentium M processors do not run the SMP kernels, so the default config files in the kernel-source RPM are either going to build a kernel without OProfile support or a kernel that doesn't boot on pentium M. The OProfile source patch (pentiumm3.diff) applies cleanly to the current oprofile cvs checkout. I verified that the code builds on a i386 platform. 2004-01-30 Will Cohen <wc...@re...> * events/i386/pentiumm/events: * events/i386/pentiumm/unit_masks: * events/Makefile.am: * libop/op_cpu_type.c: * libop/op_cpu_type.h: * libop/op_events.c: * utils/op_help.c: Add Pentium M support. -Will |
From: Philippe E. <ph...@wa...> - 2004-01-31 01:49:31
|
On Fri, 30 Jan 2004 at 16:08 +0000, Will Cohen wrote: > I have been experimenting with making OProfile support for the Pentium M > processor. The pentium M is a p6 core with tweaks and P4-like bus > interface. I have some patches for OProfile and the 2.6 kernel which > allow OProfile to use the performance monitoring hardware on the Pentium > M. This is still work in progress. I have been trying this out on a It look like complete, a few comments. > Pentium M development machine, not a traditional laptop. There is likely > to be some issue with this kernel OProfile support and the power > management support on the laptop (nmi vs. smm/power-management bios). > > The kernel patch (kernel-pentiumm.diff) applied cleanly to the > kernel-source-2.6.1-1.141 in the Red Hat Raw hide. I built the kernel > with the attached config file (config-20040129) because the Pentium M > processors do not run the SMP kernels, so the default config files in > the kernel-source RPM are either going to build a kernel without > OProfile support or a kernel that doesn't boot on pentium M. > > The OProfile source patch (pentiumm3.diff) applies cleanly to the > current oprofile cvs checkout. I verified that the code builds on a i386 > platform. > 2004-01-30 Will Cohen <wc...@re...> > > * events/i386/pentiumm/events: > * events/i386/pentiumm/unit_masks: Any place you use pentiumm, Pentium M etc. I prefer p6_mobile for filename, p6-mobile for string and CP_P6_MOBILE for the enum. I'm unsure for user visible string "Pentium M (P6 core)" ? > +++ linux-2.6.1-1.141pentm/arch/i386/oprofile/op_model_ppro.c 2004-01-29 16:21:32.000000000 -0500 > + /* P4/Pentium M quirk: you have to re-unmask the apic vector */ > + apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED); > + Did you verify if it works w/o this ? If it's needed I prefer a more specific comments. /* Only P6 based Pentium M need to re-unmask the apic vector but it * doesn't hurt other P6 variant */ > +++ oprofile.pentiumm/events/i386/pentiumm/events 2004-01-29 10:17:47.000000000 -0500 > +event:0xd0 counters:0,1 um:zero minimum:6000 name:INST_DECODED : number of instructions decoded This one marked in an errata as not accurate see 25266507.pdf must we add a comment ? > +event:0xd8 counters:0,1 um:sse_sse2_inst_retired minimum:3000 name:EMON_SSE_SSE2_INST_RETIRED : Streaming SIMD Extensions Instructions Retired > +event:0xd9 counters:0,1 um:sse_sse2_inst_retired minimum:3000 name:EMON_SSE_SSE2_COMP_INST_RETIRED : Computational SSE Instructions Retired I'm lost with these coming from A-8, they are called event modified from P6 for Pentium Mobile, events added are in A-7, but they never existed on PIII afaics and does intel added sse2 to pentium mobile P6 core based ? > +event:0x59 counters:0,1 um:zero minimum:3000 name:EMON_THERMAL_TRIP : Duration/Occurrences in thermal trip This one too is marked as not accurate. > +++ oprofile.pentiumm/events/i386/pentiumm/unit_masks 2004-01-30 15:27:23.000000000 -0500 > +name:zero type:mandatory default:0x0 > + 0x0 No unit mask > +name:mesi type:bitmask default:0x0f > + 0x08 (M)odified cache state > + 0x04 (E)xclusive cache state > + 0x02 (S)hared cache state > + 0x01 (I)nvalid cache state > + 0x0f All cache states > +# not currently implemented for mesi > +# bits 5:4 > +#00h excluding hardware prefetched lines > +#01h hardware prefetched lines only > +#02h/03h all(hw prefetched lines and no hw -prefetched lines) I'm unsure if it isn't equivalent to: 0x10 HW prefetched line only 0x20 all prefetched line w/o regarding mask 0x10. I don't understand exactly the relationship with bit 0:3 -- regards, Philippe Elie |
From: Will C. <wc...@re...> - 2004-02-02 15:16:39
|
Philippe Elie wrote: > On Fri, 30 Jan 2004 at 16:08 +0000, Will Cohen wrote: > > >>I have been experimenting with making OProfile support for the Pentium M >>processor. The pentium M is a p6 core with tweaks and P4-like bus >>interface. I have some patches for OProfile and the 2.6 kernel which >>allow OProfile to use the performance monitoring hardware on the Pentium >>M. This is still work in progress. I have been trying this out on a > > > It look like complete, a few comments. It does allow me to collect some data on Pentium M, but as you point out there are some improvements possible. >>Pentium M development machine, not a traditional laptop. There is likely >>to be some issue with this kernel OProfile support and the power >>management support on the laptop (nmi vs. smm/power-management bios). >> >>The kernel patch (kernel-pentiumm.diff) applied cleanly to the >>kernel-source-2.6.1-1.141 in the Red Hat Raw hide. I built the kernel >>with the attached config file (config-20040129) because the Pentium M >>processors do not run the SMP kernels, so the default config files in >>the kernel-source RPM are either going to build a kernel without >>OProfile support or a kernel that doesn't boot on pentium M. >> >>The OProfile source patch (pentiumm3.diff) applies cleanly to the >>current oprofile cvs checkout. I verified that the code builds on a i386 >>platform. > > >>2004-01-30 Will Cohen <wc...@re...> >> >> * events/i386/pentiumm/events: >> * events/i386/pentiumm/unit_masks: > > > Any place you use pentiumm, Pentium M etc. I prefer p6_mobile for > filename, p6-mobile for string and CP_P6_MOBILE for the enum. I'm > unsure for user visible string "Pentium M (P6 core)" ? p6_mobile could be confused for the mobile pentium II and mobile pentium III. However, it is likely that Intel will do the same with Pentium M as they did with Xeon. There are Xeons based on both multiple different architectures. Your approach does make sense given previous Intel naming schemes. >>+++ linux-2.6.1-1.141pentm/arch/i386/oprofile/op_model_ppro.c 2004-01-29 16:21:32.000000000 -0500 > > >>+ /* P4/Pentium M quirk: you have to re-unmask the apic vector */ >>+ apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED); >>+ > > > Did you verify if it works w/o this ? If it's needed I prefer a more > specific comments. > > /* Only P6 based Pentium M need to re-unmask the apic vector but it > * doesn't hurt other P6 variant */ > I verified that I only got 1 nmi interrupt sample without this. However, I did not verify that the code still functions on older P6 machines. I will check that it still functions on older machines. I can also make the comment more descriptive. >>+++ oprofile.pentiumm/events/i386/pentiumm/events 2004-01-29 10:17:47.000000000 -0500 > > >>+event:0xd0 counters:0,1 um:zero minimum:6000 name:INST_DECODED : number of instructions decoded > > > This one marked in an errata as not accurate see 25266507.pdf must we add a > comment ? I didn't go through the errata. I just went through the table in the current volume 3 to get the additional events. I saved going throught he errata for a second pass (after I got the data collection working on the Pentium M). >>+event:0xd8 counters:0,1 um:sse_sse2_inst_retired minimum:3000 name:EMON_SSE_SSE2_INST_RETIRED : Streaming SIMD Extensions Instructions Retired >>+event:0xd9 counters:0,1 um:sse_sse2_inst_retired minimum:3000 name:EMON_SSE_SSE2_COMP_INST_RETIRED : Computational SSE Instructions Retired > > > I'm lost with these coming from A-8, they are called event modified from > P6 for Pentium Mobile, events added are in A-7, but they never existed > on PIII afaics and does intel added sse2 to pentium mobile P6 core based ? From piii events event:0xd8 counters:0,1 um:kni_inst_retired minimum:3000 name:EMON_KNI_INST_RETIRED : number of KNI instructions retired event:0xd9 counters:0,1 um:kni_inst_retired minimum:3000 name:EMON_KNI_COMP_INST_RET : number of KNI computation instructions retired >>+event:0x59 counters:0,1 um:zero minimum:3000 name:EMON_THERMAL_TRIP : Duration/Occurrences in thermal trip > > > This one too is marked as not accurate. errata stuff to be added. >>+++ oprofile.pentiumm/events/i386/pentiumm/unit_masks 2004-01-30 15:27:23.000000000 -0500 > > >>+name:zero type:mandatory default:0x0 >>+ 0x0 No unit mask >>+name:mesi type:bitmask default:0x0f >>+ 0x08 (M)odified cache state >>+ 0x04 (E)xclusive cache state >>+ 0x02 (S)hared cache state >>+ 0x01 (I)nvalid cache state >>+ 0x0f All cache states >>+# not currently implemented for mesi >>+# bits 5:4 >>+#00h excluding hardware prefetched lines >>+#01h hardware prefetched lines only >>+#02h/03h all(hw prefetched lines and no hw -prefetched lines) > > > I'm unsure if it isn't equivalent to: > > 0x10 HW prefetched line only > 0x20 all prefetched line w/o regarding mask 0x10. > > I don't understand exactly the relationship with bit 0:3 Reading the documentation it looks like with 0x20 that 0x10 is a don't care. So what you have above appears reasonable. -Will |
From: Philippe E. <ph...@wa...> - 2004-02-02 18:13:29
|
On Mon, 02 Feb 2004 at 10:16 +0000, Will Cohen wrote: > >>2004-01-30 Will Cohen <wc...@re...> > >> > >> * events/i386/pentiumm/events: > >> * events/i386/pentiumm/unit_masks: > > > > > >Any place you use pentiumm, Pentium M etc. I prefer p6_mobile for > >filename, p6-mobile for string and CP_P6_MOBILE for the enum. I'm > >unsure for user visible string "Pentium M (P6 core)" ? > > p6_mobile could be confused for the mobile pentium II and mobile pentium > III. However, it is likely that Intel will do the same with Pentium M as > they did with Xeon. There are Xeons based on both multiple different > architectures. Your approach does make sense given previous Intel naming > schemes. ok I dunno there was a pii_mobile so piii_mobile make sense, we know than piii is based on p6 core. The thing to avoid is using Intel marketing name internally. > >>+++ linux-2.6.1-1.141pentm/arch/i386/oprofile/op_model_ppro.c 2004-01-29 > >>16:21:32.000000000 -0500 > > > > > >>+ /* P4/Pentium M quirk: you have to re-unmask the apic vector */ > >>+ apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED); > >>+ > > > > > >Did you verify if it works w/o this ? If it's needed I prefer a more > >specific comments. > > > > /* Only P6 based Pentium M need to re-unmask the apic vector but it > > * doesn't hurt other P6 variant */ > > > > I verified that I only got 1 nmi interrupt sample without this. However, > I did not verify that the code still functions on older P6 machines. I > will check that it still functions on older machines. I can also make > the comment more descriptive. nevermind, if we want to be sure it works just add a global var: int need_lvtpc_unmask; ... if (ned_lvtpc_unmask) apic_write(...); it's just too boring to test all cpu variant and can help future port. > >>+event:0xd8 counters:0,1 um:sse_sse2_inst_retired minimum:3000 > >>name:EMON_SSE_SSE2_INST_RETIRED : Streaming SIMD Extensions Instructions > >>Retired > > > > > >I'm lost with these coming from A-8, they are called event modified from > >P6 for Pentium Mobile, events added are in A-7, but they never existed > >on PIII afaics and does intel added sse2 to pentium mobile P6 core based ? > > From piii events > > event:0xd8 counters:0,1 um:kni_inst_retired minimum:3000 > name:EMON_KNI_INST_RETIRED : number of KNI instructions retired Ok, I looked by name not by number and didn't notice they re-used an event number and change the name regards, Phil |
From: Will C. <wc...@re...> - 2004-02-10 22:15:09
Attachments:
pentiumm6.diff
kernel-pentiumm2.diff
|
Philippe Elie wrote: > On Mon, 02 Feb 2004 at 10:16 +0000, Will Cohen wrote: > > >>>>2004-01-30 Will Cohen <wc...@re...> >>>> >>>> * events/i386/pentiumm/events: >>>> * events/i386/pentiumm/unit_masks: >>> >>> >>>Any place you use pentiumm, Pentium M etc. I prefer p6_mobile for >>>filename, p6-mobile for string and CP_P6_MOBILE for the enum. I'm >>>unsure for user visible string "Pentium M (P6 core)" ? >> >>p6_mobile could be confused for the mobile pentium II and mobile pentium >>III. However, it is likely that Intel will do the same with Pentium M as >>they did with Xeon. There are Xeons based on both multiple different >>architectures. Your approach does make sense given previous Intel naming >>schemes. > > > ok I dunno there was a pii_mobile so piii_mobile make sense, we know than > piii is based on p6 core. The thing to avoid is using Intel marketing name > internally. I have tweaked the patch based on the comments, avoid using the "pentiumm" market name given Intel's variations on the Xeon family of processors. The patch applies cleanly to the current checkout of oprofile and builds. Okay to merge into the development oprofile cvs? 2004-02-10 Will Cohen <wc...@re...> * events/i386/p6_mobile/events: * events/i386/p6_mobile/unit_masks: * events/Makefile.am: * libop/op_cpu_type.c: * libop/op_cpu_type.h: * libop/op_events.c: * utils/op_help.c: Add support for Pentium M (Centrino). The kernel-pentiumm2.diff is the revised patch to generate the appropriate /dev/oprofile/cpu_type. It also resets the APIC in the same way as the P4 nmi handler code. It works on pentium M machine. I have have also verified this works without problem on a PII. It will need to be sent upstream for the kernel. -Will >>>>+++ linux-2.6.1-1.141pentm/arch/i386/oprofile/op_model_ppro.c 2004-01-29 >>>>16:21:32.000000000 -0500 >>> >>> >>>>+ /* P4/Pentium M quirk: you have to re-unmask the apic vector */ >>>>+ apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED); >>>>+ >>> >>> >>>Did you verify if it works w/o this ? If it's needed I prefer a more >>>specific comments. >>> >>> /* Only P6 based Pentium M need to re-unmask the apic vector but it >>> * doesn't hurt other P6 variant */ >>> >> >>I verified that I only got 1 nmi interrupt sample without this. However, >>I did not verify that the code still functions on older P6 machines. I >>will check that it still functions on older machines. I can also make >>the comment more descriptive. > > > nevermind, if we want to be sure it works just add a global var: > > int need_lvtpc_unmask; > ... > if (ned_lvtpc_unmask) > apic_write(...); > > it's just too boring to test all cpu variant and can help future port. > > > >>>>+event:0xd8 counters:0,1 um:sse_sse2_inst_retired minimum:3000 >>>>name:EMON_SSE_SSE2_INST_RETIRED : Streaming SIMD Extensions Instructions >>>>Retired >>> >>> >>>I'm lost with these coming from A-8, they are called event modified from >>>P6 for Pentium Mobile, events added are in A-7, but they never existed >>>on PIII afaics and does intel added sse2 to pentium mobile P6 core based ? >> >>From piii events >> >>event:0xd8 counters:0,1 um:kni_inst_retired minimum:3000 >>name:EMON_KNI_INST_RETIRED : number of KNI instructions retired > > > Ok, I looked by name not by number and didn't notice they re-used an > event number and change the name > > > regards, > Phil |
From: John L. <le...@mo...> - 2004-02-11 00:53:25
|
On Tue, Feb 10, 2004 at 05:14:48PM -0500, Will Cohen wrote: > oprofile and builds. Okay to merge into the development oprofile cvs? Looks fine to me, thanks john |
From: Philippe E. <ph...@wa...> - 2004-02-11 02:48:15
|
On Tue, 10 Feb 2004 at 17:14 +0000, Will Cohen wrote: > I have tweaked the patch based on the comments, avoid using the > "pentiumm" market name given Intel's variations on the Xeon family of > processors. The patch applies cleanly to the current checkout of > oprofile and builds. Okay to merge into the development oprofile cvs? > > 2004-02-10 Will Cohen <wc...@re...> > > * events/i386/p6_mobile/events: > * events/i386/p6_mobile/unit_masks: > * events/Makefile.am: > * libop/op_cpu_type.c: > * libop/op_cpu_type.h: > * libop/op_events.c: > * utils/op_help.c: Add support for Pentium M (Centrino). > > The kernel-pentiumm2.diff is the revised patch to generate the > appropriate /dev/oprofile/cpu_type. It also resets the APIC in the same > way as the P4 nmi handler code. It works on pentium M machine. I have > have also verified this works without problem on a PII. It will need to > be sent upstream for the kernel. the userspace part is ok to commit, I'll send the kernel part to Andrew. John nothing against the unmasking of LVPTC for ppro based core ? > --- linux-2.6.1-1.141/arch/i386/oprofile/op_model_ppro.c 2004-01-09 01:59:46.000000000 -0500 > +++ linux-2.6.1-1.141pentm/arch/i386/oprofile/op_model_ppro.c 2004-01-29 16:21:32.000000000 -0500 > @@ -13,6 +13,7 @@ > #include <linux/oprofile.h> > #include <asm/ptrace.h> > #include <asm/msr.h> > +#include <asm/apic.h> > > #include "op_x86_model.h" > #include "op_counter.h" > @@ -101,6 +102,10 @@ > } > } > > + /* Only P6 based Pentium M need to re-unmask the apic vector but it > + * doesn't hurt other P6 variant */ > + apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED); > + > /* We can't work out if we really handled an interrupt. We > * might have caught a *second* counter just after overflowing > * the interrupt for this counter then arrives |
From: John L. <le...@mo...> - 2004-02-11 17:16:07
|
On Wed, Feb 11, 2004 at 03:22:59AM +0000, Philippe Elie wrote: > the userspace part is ok to commit, I'll send the kernel part to Andrew. > John nothing against the unmasking of LVPTC for ppro based core ? How slow is it? john |
From: Philippe E. <ph...@wa...> - 2004-02-12 17:04:54
|
On Wed, 11 Feb 2004 at 17:14 +0000, John Levon wrote: > On Wed, Feb 11, 2004 at 03:22:59AM +0000, Philippe Elie wrote: > > > the userspace part is ok to commit, I'll send the kernel part to Andrew. > > John nothing against the unmasking of LVPTC for ppro based core ? > > How slow is it? no idea and no HW to test that until 15 days. It's potentially costly because it's an UC write (UC write serialize too ?). The aternative is to: if (current_cpu_data.x86 == 9) APIC_WRITE(..); where current_cpu_data expand to cpu_data[smp_processor_id()] humm ? regards, Phil |
From: Philippe E. <ph...@wa...> - 2004-02-12 17:12:09
|
On Thu, 12 Feb 2004 at 18:03 +0000, Philippe Elie wrote: > no idea and no HW to test that until 15 days. It's potentially costly because > it's an UC write (UC write serialize too ?). The aternative is to: > > if (current_cpu_data.x86 == 9) ermm I mean if (current_cpu_data.x86_model == 9) |
From: Will C. <wc...@re...> - 2004-02-12 20:01:21
|
Philippe Elie wrote: > On Wed, 11 Feb 2004 at 17:14 +0000, John Levon wrote: > > >>On Wed, Feb 11, 2004 at 03:22:59AM +0000, Philippe Elie wrote: >> >> >>>the userspace part is ok to commit, I'll send the kernel part to Andrew. >>>John nothing against the unmasking of LVPTC for ppro based core ? >> >>How slow is it? > > > no idea and no HW to test that until 15 days. It's potentially costly because > it's an UC write (UC write serialize too ?). The aternative is to: There are the following operations: uncacheable read from local apic, the "and" operation, uncachable write to local apic. However, the uncacheable read and write shouldn't be as expensive as a regular bus operation. Need to get another experiment to find this cost. > if (current_cpu_data.x86 == 9) > APIC_WRITE(..); > > where current_cpu_data expand to cpu_data[smp_processor_id()] > > humm ? > > regards, > Phil I did some experiments on a dual processor Pentium II machine. With and without the high overhead line. I didn't see a noticable difference between the two. However, the apic_write() is done after the counters are restarted, so that overhead probably won't be noticed in changes in the sample rate. Ran two copies of the program run below with the c and script below to keep both processors busy. I compared the estimate from time and the count from the oprofile samples. $ more cycleeater.c void f() { /* just an empty function. */ } int main(int argc, char *argv[]) { int i; int j; while (1){ for (i=0; i<100000; ++i) for (j=0; j<100000; ++j) f(); } return 0; } $ more cscript #!/bin/bash # ./cycleeater & ./cycleeater & sleep $1 killall cycleeater --------------------------------------------------------------- For driver with apic code for pentium M support $ time ./cscript 100 ./cscript: line 6: 2122 Terminated ./cycleeater ./cscript: line 7: 2123 Terminated ./cycleeater real 1m40.221s user 3m18.762s sys 0m0.167s $ opreport image:/home/wcohen/cycleeater CPU: PII, speed 232.718 MHz (estimated) Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000 CPU_CLK_UNHALT...| samples| %| ------------------ 454333 100.000 cycleeater 198.762 seconds user time clock rate =232.718MHz time according to oprofile 454333*100000/232.718e6=195.22 seconds 198.762-195.22=3.533 seconds missing due to interrupt overhead 3.533s/454333 samples = 7.78 microseconds per interrupt 1809 cycles per interrupt ---------------------------------------------------- For driver without apic code for pentium M $ time ./cscript 100 ./cscript: line 6: 4024 Terminated ./cycleeater ./cscript: line 7: 4025 Terminated ./cycleeater real 1m40.222s user 3m18.721s sys 0m0.173s $ opreport image:/home/wcohen/cycleeater CPU: PII, speed 232.718 MHz (estimated) Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000 CPU_CLK_UNHALT...| samples| %| ------------------ 454165 100.000 cycleeater 454165*100000/232.718e6=195.157 (3*60+18.721)-195.157=3.564s missing from oprofile samples 3.564s/454165=7.847 microseconds per interrupt 1821 cycles per interrupt. |
From: Will C. <wc...@re...> - 2004-02-12 21:58:01
|
John Levon wrote: > On Wed, Feb 11, 2004 at 03:22:59AM +0000, Philippe Elie wrote: > > >>the userspace part is ok to commit, I'll send the kernel part to Andrew. >>John nothing against the unmasking of LVPTC for ppro based core ? > > > How slow is it? > > john I made /dev/oprofile/apic_time in the oprofile file system which measured the time of the operation in a loop with the time stamp counter. With loop overhead it was about 66 cycles per iteration on the Pentium II. -Will |
From: John L. <le...@mo...> - 2004-02-14 20:06:16
|
On Thu, Feb 12, 2004 at 04:56:32PM -0500, Will Cohen wrote: > I made /dev/oprofile/apic_time in the oprofile file system which > measured the time of the operation in a loop with the time stamp > counter. With loop overhead it was about 66 cycles per iteration on the > Pentium II. So how ugly is it to change the function pointer depending on model? john -- "Spammers get STABBED by GOD." - Ron Echeverri |
From: Will C. <wc...@re...> - 2004-02-16 19:43:06
Attachments:
kernel-pentiumm3.diff
|
John Levon wrote: > On Thu, Feb 12, 2004 at 04:56:32PM -0500, Will Cohen wrote: > > >>I made /dev/oprofile/apic_time in the oprofile file system which >>measured the time of the operation in a loop with the time stamp >>counter. With loop overhead it was about 66 cycles per iteration on the >>Pentium II. > > > So how ugly is it to change the function pointer depending on model? Here is a example where I duplicated the interrupt handler code for the Pentium M (with the added LVTPC unmask), added a table, and set up the use of the table when Pentium M is detected. I am not wild about it, but it should reduce the overhead of the interrupt handling on regular p6 processors. -Will |
From: John L. <le...@mo...> - 2004-02-16 20:07:21
|
On Mon, Feb 16, 2004 at 02:39:05PM -0500, Will Cohen wrote: > Here is a example where I duplicated the interrupt handler code for the > Pentium M (with the added LVTPC unmask), added a table, and set up the > use of the table when Pentium M is detected. I am not wild about it, but > it should reduce the overhead of the interrupt handling on regular p6 > processors. > How about adding a ->unmask_lvtpc() operation to the model spec. Then you don't duplicate any code. <normal check_ctrs function> > +{ > + unsigned int low, high; > + int i; > + unsigned long eip = instruction_pointer(regs); > + int is_kernel = !user_mode(regs); > + > + for (i = 0 ; i < NUM_COUNTERS; ++i) { > + CTR_READ(low, high, msrs, i); > + if (CTR_OVERFLOWED(low)) { > + oprofile_add_sample(eip, is_kernel, i, cpu); > + CTR_WRITE(reset_value[i], msrs, i); > + } > + } if (model->unmask_lvtpc) model->unmask_lvtpc(); > + /* We can't work out if we really handled an interrupt. We > + * might have caught a *second* counter just after overflowing > + * the interrupt for this counter then arrives > + * and we don't find a counter that's overflowed, so we > + * would return 0 and get dazed + confused. Instead we always > + * assume we found an overflow. This sucks. > + */ > + return 1; or whatever regards joohn |
From: Will C. <wc...@re...> - 2004-02-16 20:30:16
|
John Levon wrote: > On Mon, Feb 16, 2004 at 02:39:05PM -0500, Will Cohen wrote: > > >>Here is a example where I duplicated the interrupt handler code for the >>Pentium M (with the added LVTPC unmask), added a table, and set up the >>use of the table when Pentium M is detected. I am not wild about it, but >>it should reduce the overhead of the interrupt handling on regular p6 >>processors. >> > > > How about adding a ->unmask_lvtpc() operation to the model spec. Then > you don't duplicate any code. > > <normal check_ctrs function> > >>+{ >>+ unsigned int low, high; >>+ int i; >>+ unsigned long eip = instruction_pointer(regs); >>+ int is_kernel = !user_mode(regs); >>+ >>+ for (i = 0 ; i < NUM_COUNTERS; ++i) { >>+ CTR_READ(low, high, msrs, i); >>+ if (CTR_OVERFLOWED(low)) { >>+ oprofile_add_sample(eip, is_kernel, i, cpu); >>+ CTR_WRITE(reset_value[i], msrs, i); >>+ } >>+ } > > > if (model->unmask_lvtpc) > model->unmask_lvtpc(); This seems like overkill. Having another function to implement this will incur prologue and epilogue overhead plus there is the overhead of the indirect call. It would probably be almost same cost as the original inline code. Maybe if (model->reset_int_hw) apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED); The unconditional version code is the easiest to understand. The problem is it has an undesirable 66 cycle cost associated with it. I have been thinking about how to measure the cost of the interrupt handler. When I did the experiment to find out the cost of the unmask LVTPC operation, I built another special file to get that information by running a function when the special file was read. The function just benchmarked the a loop with the unmask LVTPC operation as the body. It should be fairly straightforward to adapt that code to get the cost of the actual nmi interrupt handler. This would give us a better handle on the overall cost of adding samples. It would also give us an idea how much a performance difference this type of change would make in the interrupt routine. >>+ /* We can't work out if we really handled an interrupt. We >>+ * might have caught a *second* counter just after overflowing >>+ * the interrupt for this counter then arrives >>+ * and we don't find a counter that's overflowed, so we >>+ * would return 0 and get dazed + confused. Instead we always >>+ * assume we found an overflow. This sucks. >>+ */ >>+ return 1; > > > or whatever > > regards > joohn regards, -Will |
From: John L. <le...@mo...> - 2004-02-16 22:32:00
|
On Mon, Feb 16, 2004 at 03:26:21PM -0500, Will Cohen wrote: > Maybe > > if (model->reset_int_hw) > apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & > ~APIC_LVT_MASKED); That would be fine. > should be fairly straightforward to adapt that code to get the cost of > the actual nmi interrupt handler. This would give us a better handle on > the overall cost of adding samples. It would also give us an idea how > much a performance difference this type of change would make in the > interrupt routine. I've been meeaning to do something like this for a long, long time regards john |
From: Will C. <wc...@re...> - 2004-02-16 22:41:17
Attachments:
kernel-time_int.diff
|
John Levon wrote: > On Mon, Feb 16, 2004 at 03:26:21PM -0500, Will Cohen wrote: > > >>Maybe >> >> if (model->reset_int_hw) >> apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & >> ~APIC_LVT_MASKED); > > > That would be fine. > > >>should be fairly straightforward to adapt that code to get the cost of >>the actual nmi interrupt handler. This would give us a better handle on >>the overall cost of adding samples. It would also give us an idea how >>much a performance difference this type of change would make in the >>interrupt routine. > > > I've been meeaning to do something like this for a long, long time I have some grungy code that accumulates the number of cycles spent in [model]_check_ctrs() function into /dev/oprofile/status/cpu[0-9]+/int_time. One can compute the average time spent in [model]_check_ctrs() by dividing by the number of samples. It is a rough estimate. I have attached the current work to this email. I have checked the resulting kernel works on ppro (pentium M) machine. -Will |
From: Philippe E. <ph...@wa...> - 2004-02-16 23:08:12
|
On Mon, 16 Feb 2004 at 22:28 +0000, John Levon wrote: > On Mon, Feb 16, 2004 at 03:26:21PM -0500, Will Cohen wrote: > > > Maybe > > > > if (model->reset_int_hw) > > apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & > > ~APIC_LVT_MASKED); > > That would be fine. > > > should be fairly straightforward to adapt that code to get the cost of > > the actual nmi interrupt handler. This would give us a better handle on > > the overall cost of adding samples. It would also give us an idea how > > much a performance difference this type of change would make in the > > interrupt routine. > > I've been meeaning to do something like this for a long, long time nobody can do a kernel compile test with and w/o this chunk ? And anyway if you want to optimize this function it's better to use rdpmc rather rdmsr, it's orthogonal but results: http://marc.theaimsgroup.com/?l=oprofile-list&m=105840481716898&w=2 show it save 60 cycles on P6 core and http://marc.theaimsgroup.com/?l=oprofile-list&m=105840481716899&w=2 show kernel compile doesn't improve, it's only measurable on bz2 test, so I guess it's not really worth to add a conditinal branch here (it'll surely not good for pentium mobile, the branch will be mispredicted, if we add a likely or unlikely we will pessimize either P6 or pentium mobile) We increase pressure on BTB buffer too. if (model->int_hw_reset) In short word I'm not conveincing we need to optimize that. regards, Phil |
From: Will C. <wc...@re...> - 2004-02-17 17:53:48
|
Philippe Elie wrote: > On Mon, 16 Feb 2004 at 22:28 +0000, John Levon wrote: > > >>On Mon, Feb 16, 2004 at 03:26:21PM -0500, Will Cohen wrote: >> >> >>>Maybe >>> >>> if (model->reset_int_hw) >>> apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & >>> ~APIC_LVT_MASKED); >> >>That would be fine. >> >> >>>should be fairly straightforward to adapt that code to get the cost of >>>the actual nmi interrupt handler. This would give us a better handle on >>>the overall cost of adding samples. It would also give us an idea how >>>much a performance difference this type of change would make in the >>>interrupt routine. >> >>I've been meeaning to do something like this for a long, long time > > > nobody can do a kernel compile test with and w/o this chunk ? I ran some experiments with the patched version of the kernel to get a an idea of the cost of the [model]_check_ctrs function: Have the instrumented version of OProfile running on p6, athlon, and p4. Have some idea of the cost of the interrupt routine for the various platforms. Computed the average cycles per sample taken. The difference with and without the re-umask works out to about 50 cycles. P6 (Dual 233MHz Pentium II) (WITH re-unmask line in ppro_check_ctrs) 88211797/139890=631 (CPU_CLK_UNHALTED) 396748650/633470=626 P6 (Dual 233MHz Pentium II) (WITHOUT re-unmask line in ppro_check_ctrs) 139455/247=565 (CPU_CLK_UNHALTED) 141196584/245479=575 P6_mobile (1.6GHz) 449588720/655041=686 (CPU_CLK_UNHALTED) 488144911/711151=686 athlon (1.6GHz) 2408254736/3548855 = 678 (CPU_CLK_UNHALTED) 2529484971/3738657 = 676 p4 (2.4GHz) 1393387120/3700006=3766 (GLOBAL_POWER_EVENTS) 4010478328/11125069=3596 The mechanism on the P4 seems pretty expensive. There are more iterations of the loop (8 vs. 4 athlon or 2 p6). Code is more compilicated for reading the registers due to the HT support; need stagger computation to determine which processor this is in a HT processor. Read two registers and extract bits to find whether register overflowed. Once triggering register found on P4 do the following to reset the counter: CTR_WRITE(reset_value[i], real); CCCR_CLEAR_OVF(low); CCCR_WRITE(low, high, real); CTR_WRITE(reset_value[i], real); Finally have to re-unmask the apic vector so nmi will trigger again. > And anyway if you want to optimize this function it's better to use > rdpmc rather rdmsr, it's orthogonal but results: > > http://marc.theaimsgroup.com/?l=oprofile-list&m=105840481716898&w=2 > > show it save 60 cycles on P6 core and > > http://marc.theaimsgroup.com/?l=oprofile-list&m=105840481716899&w=2 > > show kernel compile doesn't improve, it's only measurable on bz2 test, > so I guess it's not really worth to add a conditinal branch here > (it'll surely not good for pentium mobile, the branch will be > mispredicted, if we add a likely or unlikely we will pessimize either > P6 or pentium mobile) We increase pressure on BTB buffer too. I haven't tried the rdpmc instruction yet. Could the rdpmc be used for the p4? The p4 has higher overhead and could use a reduction in the time spent in it. > if (model->int_hw_reset) > > In short word I'm not conveincing we need to optimize that. The local apic re-unmask operation for the p6 appears to add about <10% to the cost of the interrupt handling (figure there are other costs besides ppro_check_ctr body for handling nmi). -Will |
From: John L. <le...@mo...> - 2004-02-17 19:26:23
|
On Tue, Feb 17, 2004 at 12:49:14PM -0500, Will Cohen wrote: > The local apic re-unmask operation for the p6 appears to add about <10% > to the cost of the interrupt handling (figure there are other costs > besides ppro_check_ctr body for handling nmi). We have an unfortunately heavyweight path from the NMI. But even 3% is probably worth it. It's not like it's complicated code. rdpmc would be interesting. Also calling do_nmi() from interrupt context so it can be profiled by oprofile itself (you'd need to hack it so that the real oprofile uses timer mode and the NMI handler only ever generates stuff for counter #1 or something, but I don't think it'd be hard) regards john -- "Spammers get STABBED by GOD." - Ron Echeverri |
From: Philippe E. <ph...@wa...> - 2004-02-17 19:48:06
|
On Tue, 17 Feb 2004 at 12:49 +0000, Will Cohen wrote: > >nobody can do a kernel compile test with and w/o this chunk ? > > I ran some experiments with the patched version of the kernel to get a > an idea of the cost of the [model]_check_ctrs function: [snp: nice number usefull to know if tried optimization are worth. > The mechanism on the P4 seems pretty expensive. There are more > iterations of the loop (8 vs. 4 athlon or 2 p6). Code is more We can avoid uneeded loop by adding a level of indirection. static void p4_check_ctrs(unsigned int const cpu, ... ... - for (i = 0; i < num_counters; ++i) { - if (!sysctl.ctr[i].event) - continue; + for (index = 0; index < num_counters_setup; ++index) { + int i = setup_counter[i]; num_counter_setup and setup_counter array must be init in p2_setup_ctrs() This will be a big win if we use only a few counter and don't hurt the case where the 8 counter are used. > compilicated for reading the registers due to the HT support; need > stagger computation to determine which processor this is in a HT get_stagger() is not inlined it'll help a bit especially on UP kernel where it expand to return 0; > I haven't tried the rdpmc instruction yet. Could the rdpmc be used for > the p4? The p4 has higher overhead and could use a reduction in the time > spent in it. Yes but we can avoid only one rdmsr multiplied by number of loop, with your measure we will know if it's worth. iirc graydon measured P4 rdpmc vs rdsmr and it was not worth but I think he didn't try the fast rdpmc available on p4 (see Vol 2B rdpmc) > >if (model->int_hw_reset) > > > >In short word I'm not conveincing we need to optimize that. > > The local apic re-unmask operation for the p6 appears to add about <10% > to the cost of the interrupt handling (figure there are other costs > besides ppro_check_ctr body for handling nmi). Nowaday userspace daemon overhead is more costly than kernel once, I dunno exactly what is the kernel overhead coming from the xxx_check_ctrs() function compared to the remaining nmi handler code (at least on P4 we know it probably dominate the nmi overhead) cheers, Phil |
From: Will C. <wc...@re...> - 2004-02-18 14:53:46
Attachments:
kernel-nmireduce.diff
|
Philippe Elie wrote: > On Tue, 17 Feb 2004 at 12:49 +0000, Will Cohen wrote: > > >>>nobody can do a kernel compile test with and w/o this chunk ? >> >>I ran some experiments with the patched version of the kernel to get a >>an idea of the cost of the [model]_check_ctrs function: > > > > [snp: nice number usefull to know if tried optimization are worth. > > >>The mechanism on the P4 seems pretty expensive. There are more >>iterations of the loop (8 vs. 4 athlon or 2 p6). Code is more > > > We can avoid uneeded loop by adding a level of indirection. > > static void p4_check_ctrs(unsigned int const cpu, ... > ... > - for (i = 0; i < num_counters; ++i) { > - if (!sysctl.ctr[i].event) > - continue; > + for (index = 0; index < num_counters_setup; ++index) { > + int i = setup_counter[i]; > > num_counter_setup and setup_counter array must be init > in p2_setup_ctrs() > > This will be a big win if we use only a few counter and don't hurt > the case where the 8 counter are used. > > >>compilicated for reading the registers due to the HT support; need >>stagger computation to determine which processor this is in a HT > > > get_stagger() is not inlined it'll help a bit especially on UP kernel > where it expand to return 0; > > > >>I haven't tried the rdpmc instruction yet. Could the rdpmc be used for >>the p4? The p4 has higher overhead and could use a reduction in the time >>spent in it. > > > Yes but we can avoid only one rdmsr multiplied by number of loop, with > your measure we will know if it's worth. I measured the rdpmc vs. rdmsr on p6 and athlon. the rdmsr is a serializing instruction which slows down the loop quite a bit. The rdpmc is not serializing, a big win for athlon. Saves about 220 cycles on athlon. The savings on the p6 is smaller, only 2 uses of the instruction about 70 cycles save. using rdmsr on p6 P6 (Dual 233MHz Pentium II) (WITHOUT re-unmask line in ppro_check_ctrs) 139455/247=565 (CPU_CLK_UNHALTED) 141196584/245479=575 using rdpmc on p6 (without LVTPC remask) 1214847/2402=505 using rdmsr on Athlon athlon (1.6GHz) 2408254736/3548855 = 678 (CPU_CLK_UNHALTED) 2529484971/3738657 = 676 using rdpmc on Athlon: 987328/2235=442 263518778/586642=449 306291004/678405=451 > iirc graydon measured P4 rdpmc vs rdsmr and it was not worth but > I think he didn't try the fast rdpmc available on p4 (see Vol 2B > rdpmc) I read through the documentation on the p4, but I haven't tried revising the p4 code yet. >>>if (model->int_hw_reset) >>> >>>In short word I'm not conveincing we need to optimize that. >> >>The local apic re-unmask operation for the p6 appears to add about <10% >>to the cost of the interrupt handling (figure there are other costs >>besides ppro_check_ctr body for handling nmi). > > > Nowaday userspace daemon overhead is more costly than kernel once, I dunno > exactly what is the kernel overhead coming from the xxx_check_ctrs() > function compared to the remaining nmi handler code (at least on P4 we know > it probably dominate the nmi overhead) That is an interesting question: What is the overall overhead for collecting a sample with the interrupt and processing it in the userspace daemon? I don't know the overhead of the nmi wrapper code that call xxx_check_ctrs(). But I am guessing that it is less than the thousands of clock cycles for the p4_check_ctrs(). -Will |