From: William C. <wc...@re...> - 2006-03-27 16:10:09
Attachments:
oprof_perfmon2-20060327.diff
perfmon2_oprof20060327.diff
I have gotten oprofile to make use of the new perfmon2 mechanism to collect samples. I currently have this running on my AMD64 laptop. oprof_perfmon2-20060327.diff patches the oprofile user-space code and perfmon2_oprof20060327.diff is for the kernel. The patches are still "work in progress" and there are certainly things that need to be corrected. The patches borrow heavily from the previous ia64 oprofile/perfmon support.

Due to the different sampling mechanisms that could be used for x86, /dev/oprofile/implement has been added so the sampling mechanism being used can be identified.

Rather than directly setting up the bits for the performance monitoring hardware, libpfm is used to map the event name to the appropriate bits. For processors with complicated constraints on the performance monitoring hardware this makes more sense than trying to duplicate the constraints mechanism in oprofile.

Below are issues that still need to be fixed in the various areas of the oprofile/perfmon2 monitoring.

kernel:
- separating oprofile's processor id code from i386 nmi mechanism setup
- have oprofile/perfmon2 identify cpu for real (currently just hardwired to amd64)
- oprofile always uses perfmon2 if kernel configured with perfmon
- module installation a bit odd:
  -install oprofile modules
  -opcontrol reads information to determine if perfmon2 used
  -opcontrol installs appropriate perfmon module
- oprofile lies that it needs buffer space (perfmon_get_size()) so perfmon2 actually calls oprofile's perfmon_handler()

oprofile:
- make translation of event names to bit patterns more robust: can hang if event is not found
- verify that the event masking support works
- get rid of fatal_error() function in opd_perfmon.c
- ophelp get the available events from libpfm when possible

libpfm:
-make event mapping complete (lots of events missing for various processors)
-libpfm isn't available on some processors that perfmon supports (e.g. p4/ppc64)

-Will
From: Stephane E. <er...@hp...> - 2006-03-29 13:16:45
Will,

On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:
> I have gotten oprofile to make use of the new perfmon2 mechanism to
> collect samples. I currently have this running on my AMD64 laptop. The
> oprof_perfmon2-20060327.diff patches the oprofile user space code and
> perfmon2_oprof20060327.diff is for the kernel. The patches are still
> "work in progress" and there are certainly things that need to be
> corrected. The patches borrow heavily from the previous ia64
> oprofile/perfmon support.

Looking at /arch/i386/oprofile/perfmon.c, it is identical to the IA-64 version and the experimental i386 version I developed. I think we can move this format into the generic perfmon code in perfmon/. This way we only have one version to maintain.

> Due to the different sampling mechanisms that could be used for x86,
> /dev/oprofile/implement has been added so the sampling mechanism being
> used can be identified.
>

Yes. I think there are things to do in this area. Perfmon2 does not support NMI-based sampling. On Itanium there is no NMI. On other architectures, if I understand correctly, NMI is used because it provides better coverage of kernel code. NMI cannot be masked, therefore you can collect samples in code sections where interrupts are masked.

Is that the ONLY motivation for this?

> Rather than directly setting up the bits for the performance monitoring
> hardware libpfm is used to map the name to the appropriate bits. For
> processors with complicated constraints on the performance monitoring
> hardware this makes more sense than trying to duplicate the constraints
> mechanism in oprofile.
>

Yes, you could use libpfm to simplify this part of the job. My understanding here is that there is already that logic about events/encodings/constraints in OProfile. The only missing piece would be how to map the OProfile register naming scheme to the perfmon2 naming scheme. Using libpfm just for this may look like overkill in a sense. I need to look at how register names are handled across the various architectures OProfile supports. Maybe there is a simpler way that would not introduce a dependency on libpfm.

> Below are issues that still need to be fixed in the various areas of the
> oprofile/perfmon2 monitoring.
> kernel:
> - separating oprofile's processor id code from i386 nmi mechanism setup
> - have oprofile/perfmon2 identify cpu for real (currently just hardwired
> to amd64)

This is something I don't quite understand in OProfile. Why is it that user code relies on CPU detection done by the OProfile kernel code? The user code could as well detect the CPU model (via cpuid or equivalent). If you assume that the kernel code probes on init and disables itself if the CPU is not supported, then nothing bad can happen.

> - oprofile always uses perfmon2 if kernel configured with perfmon

I think we have to do this, otherwise we may have PMU access conflicts.

> - module installation a bit odd:
> -install oprofile modules
> -opcontrol reads information to determine if perfmon2 used

Yes, that makes sense.

> -opcontrol installs appropriate perfmon module

Yes, or it could be builtin.

> - oprofile lies that it needs buffer space (perfmon_get_size()) so
> perfmon2 actually calls oprofile's perfmon_handler()

I fixed that. This was a bug. The format detection code was wrong.

>
> oprofile:
> - make translation of event names to bit patterns more robust:
> can hang if event is not found
> - verify that the event masking support works
> - get rid of fatal_error() function in opd_perfmon.c
> - ophelp get the available events from libpfm when possible
>
> libpfm:
> -make event mapping complete (lots of events missing for various processors)
> -libpfm isn't available on some processors that perfmon supports (e.g.
> p4/ppc64)

Yes, I know that for non-Itanium, there are some events missing, sometimes because of umask combinations.

Thanks for your patches.

--
-Stephane
From: William C. <wc...@re...> - 2006-03-29 14:14:41
Stephane Eranian wrote:
> Will,
>
> On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:
>
>>I have gotten oprofile to make use of the new perfmon2 mechanism to
>>collect samples. I currently have this running on my AMD64 laptop. The
>>oprof_perfmon2-20060327.diff patches the oprofile user space code and
>>perfmon2_oprof20060327.diff is for the kernel. The patches are still
>>"work in progress" and there are certainly things that need to be
>>corrected. The patches borrow heavily from the previous ia64
>>oprofile/perfmon support.
>
> Looking at /arch/i386/oprofile/perfmon.c, it is identical to the
> IA-64 version and the experimental i386 version I developed. I think
> we can move this format into the generic perfmon code in perfmon/.
> This way we only have one version to maintain.

Yes, the changes for /arch/i386/oprofile/perfmon.c were pretty straightforward and would be the same for other architectures. Factoring out the code and making it common to the platforms is reasonable.

>>Due to the different sampling mechanisms that could be used for x86,
>>/dev/oprofile/implement has been added so the sampling mechanism being
>>used can be identified.
>>
>
> Yes. I think there are things to do in this area. Perfmon2 does not support
> NMI-based sampling. On Itanium there is no NMI. On other architectures,
> if I understand correctly, NMI is used because it provides better coverage
> of kernel code. NMI cannot be masked, therefore you can collect samples
> in code sections where interrupts are masked.
>
> Is that the ONLY motivation for this?

Depending on which kernel someone is using, the same oprofile code for i386 and x86-64 platforms could use either the original oprofile mechanism or perfmon2 to access the performance monitoring hardware. It seemed easiest to have a file in /dev/oprofile that explicitly states the mechanism being used. This could also be used by GUIs and other tools to directly determine the profiling mechanism. I wanted to avoid inferring the mechanism in use by looking at a bunch of files.

The native OProfile driver on x86-64 and i386 uses the NMI. This does allow sampling in IRQ routines. However, we need to make sure that the amount of time spent in the NMI handler is limited. Using the NMI routine appears to cause problems on some machines (e.g. laptops where the NMI could happen when the BIOS is doing some power management operation).

Is there some idea of the overhead in the perfmon2 timer interval and sampling mechanisms?

>>Rather than directly setting up the bits for the performance monitoring
>>hardware libpfm is used to map the name to the appropriate bits. For
>>processors with complicated constraints on the performance monitoring
>>hardware this makes more sense than trying to duplicate the constraints
>>mechanism in oprofile.
>>
>
> Yes, you could use libpfm to simplify this part of the job. My understanding
> here is that there is already that logic about events/encodings/constraints
> in OProfile. The only missing piece would be how to map the OProfile register
> naming scheme to the perfmon2 naming scheme. Using libpfm just for this may
> look like overkill in a sense. I need to look at how register names are
> handled across the various architectures OProfile supports. Maybe there is a
> simpler way that would not introduce a dependency on libpfm.

OProfile has event and unit_mask files for each of the supported architectures in /usr/share/oprofile/{arch}/{model}. For example, an x86-64 amd64 machine would use the event and unit_mask files in /usr/share/oprofile/x86-64/hammer.

The constraints are much more complicated for the Pentium 4 and Power processors. I would expect that libpfm will be able to do a better job there, once support for them is in libpfm. For the Pentium 4, OProfile made a number of simplifications and reduced the available counters to 8 independent counters on non-HT processors and 4 independent counters on HT processors. There are also tagging events that are not handled by OProfile's mechanism. The Power (ppc64) processors' event selection mechanism is relatively complex. OProfile does have events for it, but it isn't ideal.

The goal here is to factor out the event mapping logic and have it in one place.

>>Below are issues that still need to be fixed in the various areas of the
>>oprofile/perfmon2 monitoring.
>
>>kernel:
>>- separating oprofile's processor id code from i386 nmi mechanism setup
>>- have oprofile/perfmon2 identify cpu for real (currently just hardwired
>>to amd64)
>
> This is something I don't quite understand in OProfile. Why is it that user
> code relies on CPU detection done by the OProfile kernel code? The user
> code could as well detect the CPU model (via cpuid or equivalent). If you
> assume that the kernel code probes on init and disables itself if the CPU
> is not supported, then nothing bad can happen.

The cpu identification is required for two purposes:

1) figure out how the oprofile module accesses the performance monitoring hardware. There are different methods of accessing the performance monitoring registers in ppro/p2/p3, p4, and athlon.

2) the user space needs to get the correct list of events to map event names to number and unit masks.

User space could find out the cpuid on its own, but the oprofile native driver has to determine the information anyway.

How would perfmon2 tools handle the case of multiple architectures? Do the cpuid in user space and modprobe the appropriate module? What happens if an attempt is made to load the wrong perfmon kernel module? Is there a check in the initialization to make sure that it will work on the processor?

>>- oprofile always uses perfmon2 if kernel configured with perfmon
>
> I think we have to do this, otherwise we may have PMU access conflicts.

I was thinking about the case where someone would prefer to use one of the other sampling mechanisms, e.g. the nmi or timer mechanism. On OProfile you can force the timer mechanism to be used.

>>- module installation a bit odd:
>> -install oprofile modules
>> -opcontrol reads information to determine if perfmon2 used
>
> Yes, that makes sense.
>
>> -opcontrol installs appropriate perfmon module
>
> Yes, or it could be builtin.

Has built-in perfmon2 been verified to work with multiple architectures? We don't want to have different kernels for EM64T and AMD64, or for P6, Pentium M, and P4. Is there some way of identifying that perfmon2 is available on the machine? Right now the oprofile/perfmon2 patch assumes it is always a module.

>>- oprofile lies that it needs buffer space (perfmon_get_size()) so
>> perfmon2 actually calls oprofile's perfmon_handler()
>
> I fixed that. This was a bug. The format detection code was wrong.

Excellent.

>>oprofile:
>>- make translation of event names to bit patterns more robust:
>> can hang if event is not found
>>- verify that the event masking support works
>>- get rid of fatal_error() function in opd_perfmon.c
>>- ophelp get the available events from libpfm when possible
>>
>>libpfm:
>>-make event mapping complete (lots of events missing for various processors)
>>-libpfm isn't available on some processors that perfmon supports (e.g.
>>p4/ppc64)
>
> Yes, I know that for non-Itanium, there are some events missing, sometimes
> because of umask combinations.
>
> Thanks for your patches.
>

Thanks for perfmon2.

-Will
From: Stephane E. <er...@hp...> - 2006-03-30 07:38:12
Will,

On Wed, Mar 29, 2006 at 09:12:17AM -0500, William Cohen wrote:
> >This is something I don't quite understand in OProfile. Why is it that user
> >code relies on CPU detection done by the OProfile kernel code? The user
> >code could as well detect the CPU model (via cpuid or equivalent). If you
> >assume that the kernel code probes on init and disables itself if the CPU
> >is not supported, then nothing bad can happen.
>
> The cpu identification is required for two purposes:
>
> 1) figure out how the oprofile module accesses the performance
> monitoring hardware. There are different methods of accessing the
> performance monitoring registers in ppro/p2/p3, p4, and athlon.
>

This is about the /dev/oprofile stuff, isn't it?

> 2) the user space needs to get the correct list of events to map event
> names to number and unit masks.
>
> User space could find out the cpuid on its own, but the oprofile
> native driver has to determine the information anyway.
>

The driver does not deal with the events, just the type of PMU, i.e., the registers.

> How would perfmon2 tools handle the case of multiple
> architectures? Do the cpuid in user space and modprobe the appropriate
> module? What happens if an attempt is made to load the wrong perfmon
> kernel module? Is there a check in the initialization to make sure that
> it will work on the processor?
>

For a processor family, take i386 for instance, there is arch-specific perfmon2 code built into the kernel. But the PMU description table, which describes the mapping from the logical PMU registers, i.e., PMC/PMD, to the actual PMU registers, is implemented as a kernel module. We call this the PMU description module. Each module must provide a probe routine which is responsible for verifying that the host PMU matches what the module describes. In other words, the Pentium M module does not work on Pentium 4.

At any time, there can be AT MOST one such module inserted. That guarantees that there cannot be conflicts.

If the Pentium M module is inserted but user-level code thinks it is on a Pentium 4, very likely the logical PMU will not match expectations and pfm_write_pmcs() will fail. Worst case, the application does not measure what it thinks it should.

User-level code may verify what the kernel is using by checking the content of /sys/kernel/perfmon/pmu_model. This file is the equivalent of /dev/oprofile/cpu_type.
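The user-level check described above could look roughly like the following. This is only a sketch: it assumes the file holds a single line of text, the helper names `read_model` and `check_pmu` are made up for illustration, and the model string a real kernel writes there is not given in this thread, so the caller has to supply the expected value.

```python
from pathlib import Path
from typing import Optional

# Path named in the message above; it is absent on kernels without perfmon2.
PMU_MODEL = Path("/sys/kernel/perfmon/pmu_model")


def read_model(path: Path = PMU_MODEL) -> Optional[str]:
    """Return the stripped contents of the file, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def check_pmu(expected: str, path: Path = PMU_MODEL) -> bool:
    """True only if the kernel reports the PMU model the tool expects."""
    return read_model(path) == expected
```

A tool would call `check_pmu(...)` before programming any registers and refuse to run on a mismatch, rather than discovering the problem when pfm_write_pmcs() fails.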
From: Stephane E. <er...@hp...> - 2006-03-30 16:57:14
Will,

On Wed, Mar 29, 2006 at 09:12:17AM -0500, William Cohen wrote:
> The native OProfile driver on x86-64 and i386 uses the NMI. This does
> allow sampling in IRQ routines. However, we need to make sure that the
> amount of time spent in the NMI handler is limited. Using the NMI
> routine appears to cause problems on some machines (e.g. laptops where
> the NMI could happen when the BIOS is doing some power management
> operation).
>
> Is there some idea of the overhead in the perfmon2 timer interval and
> sampling mechanisms?
>

I maintain some statistics per cpu in /sys/devices/system/cpu/cpu*/perfmon/. Keep in mind that the code has not been optimized at this point.

On 1.5GHz Itanium2, it takes about 800 cycles to record a sample, knowing there is an incompressible 200 cycles or so to get in and out of the kernel and to/from C code.

--
-Stephane
From: William C. <wc...@re...> - 2006-03-30 17:54:14
Stephane Eranian wrote:
> Will,
>
> On Wed, Mar 29, 2006 at 09:12:17AM -0500, William Cohen wrote:
>
>>The native OProfile driver on x86-64 and i386 uses the NMI. This does
>>allow sampling in IRQ routines. However, we need to make sure that the
>>amount of time spent in the NMI handler is limited. Using the NMI
>>routine appears to cause problems on some machines (e.g. laptops where
>>the NMI could happen when the BIOS is doing some power management
>>operation).
>>
>>Is there some idea of the overhead in the perfmon2 timer interval and
>>sampling mechanisms?
>>
>
> I maintain some statistics per cpu in /sys/devices/system/cpu/cpu*/perfmon/
> Keep in mind that the code has not been optimized at this point.
>
> On 1.5GHz Itanium2, it takes about 800 cycles to record a sample, knowing
> there is an incompressible 200 cycles or so to get in and out of the kernel
> and to/from C code.
>

Thanks for the info. Looking at the information in that directory on an amd64 machine for oprofile using perfmon2:

fmt_handler_calls:14815
fmt_handler_cycles:7793412
handle_timeout_count:0
ovfl_intr_all_count:14815
ovfl_intr_cycles:68857305
ovfl_intr_regular_count:14815
ovfl_intr_replay_count:0
ovfl_intr_spurious_count:0
set_switch_count:0
set_switch_cycles:0

If I understand correctly, below would be the averages per interrupt:
about 526 cycles for fmt_handler
about 4648 cycles for ovfl_intr

-Will
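The two averages quoted above can be reproduced from the pasted counters. The sketch below assumes the statistics are available as `name:value` text exactly as pasted in the message; the function names are illustrative and not part of oprofile or perfmon2.

```python
def parse_stats(text: str) -> dict:
    """Parse 'name:value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep:
            stats[name] = int(value)
    return stats


def average(stats: dict, cycles_key: str, count_key: str) -> float:
    """Average cycles per event; 0.0 when nothing was counted."""
    count = stats[count_key]
    return stats[cycles_key] / count if count else 0.0


# Counters copied from the message above.
SAMPLE = """\
fmt_handler_calls:14815
fmt_handler_cycles:7793412
ovfl_intr_all_count:14815
ovfl_intr_cycles:68857305
"""

stats = parse_stats(SAMPLE)
fmt_avg = average(stats, "fmt_handler_cycles", "fmt_handler_calls")
ovfl_avg = average(stats, "ovfl_intr_cycles", "ovfl_intr_all_count")
print(round(fmt_avg), round(ovfl_avg))  # prints: 526 4648
```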
From: Stephane E. <er...@hp...> - 2006-03-30 21:38:27
Will,

On Thu, Mar 30, 2006 at 12:51:52PM -0500, William Cohen wrote:
> >On 1.5GHz Itanium2, it takes about 800 cycles to record a sample, knowing
> >there is an incompressible 200 cycles or so to get in and out of the
> >kernel and
> >to/from C code.
> >
>
> Thanks for the info. Looking at the information in that directory on an
> amd64 machine for oprofile using perfmon2:
>
> fmt_handler_calls:14815
> fmt_handler_cycles:7793412
> handle_timeout_count:0
> ovfl_intr_all_count:14815
> ovfl_intr_cycles:68857305
> ovfl_intr_regular_count:14815
> ovfl_intr_replay_count:0
> ovfl_intr_spurious_count:0
> set_switch_count:0
> set_switch_cycles:0
>
> If I understand correctly, below would be the averages per interrupt:
> about 526 cycles for fmt_handler
> about 4648 cycles for ovfl_intr
>

Yes. Note that this is using rdtsc on AMD and other i386 variants. I seem to recall that this is not necessarily reliable on some processors, especially laptops. Feel free to suggest something better or add anything to the inline function to make it more reliable.

--
-Stephane
From: William C. <wc...@nc...> - 2006-03-31 03:36:13
Stephane Eranian wrote:
> Will,
>
> On Thu, Mar 30, 2006 at 12:51:52PM -0500, William Cohen wrote:
>
>>>On 1.5GHz Itanium2, it takes about 800 cycles to record a sample, knowing
>>>there is an incompressible 200 cycles or so to get in and out of the
>>>kernel and
>>>to/from C code.
>>>
>>
>>Thanks for the info. Looking at the information in that directory on an
>>amd64 machine for oprofile using perfmon2:
>>
>>fmt_handler_calls:14815
>>fmt_handler_cycles:7793412
>>handle_timeout_count:0
>>ovfl_intr_all_count:14815
>>ovfl_intr_cycles:68857305
>>ovfl_intr_regular_count:14815
>>ovfl_intr_replay_count:0
>>ovfl_intr_spurious_count:0
>>set_switch_count:0
>>set_switch_cycles:0
>>
>>If I understand correctly, below would be the averages per interrupt:
>>about 526 cycles for fmt_handler
>>about 4648 cycles for ovfl_intr
>>
>
> Yes. Note that this is using rdtsc on AMD and other i386 variants.
> I seem to recall that this is not necessarily reliable on some
> processors, especially laptops. Feel free to suggest something
> better or add anything to the inline function to make it more
> reliable.
>

My understanding is that the clock frequency is adjusted for power management on many processors. Thus, the cycles on the AMD64 processor I am using don't always map to .5ns (2GHz). There are some settings that can be made to lock the clock to the max frequency to make these measurements more accurate.

-Will
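To make the cycles-to-time relationship concrete, here is a small sketch of the conversion. The 2 GHz figure and the 4648-cycle average come from the thread; the 1 GHz value is a made-up stand-in for a power-managed clock, and `cycles_to_ns` is an illustrative helper, not an existing function.

```python
def cycles_to_ns(cycles: float, freq_hz: float) -> float:
    """Convert a cycle count to nanoseconds at a fixed clock frequency."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return cycles * 1e9 / freq_hz


# At the nominal 2 GHz, one cycle is 0.5 ns.
print(cycles_to_ns(1, 2e9))     # 0.5
print(cycles_to_ns(4648, 2e9))  # 2324.0

# Hypothetical: the same cycle count with the clock scaled down to 1 GHz.
print(cycles_to_ns(4648, 1e9))  # 4648.0
```

The same cycle count therefore corresponds to a different amount of wall-clock time whenever the frequency changes, which is why locking the clock makes the per-interrupt figures easier to interpret.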
From: John L. <le...@mo...> - 2006-03-29 15:20:10
On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:

> I have gotten oprofile to make use of the new perfmon2 mechanism to
> collect samples. I currently have this running on my AMD64 laptop. The

What actual benefits does this bring? AIUI perfmon2 is not yet sufficiently ported so that we can throw away all our near-duplicate code, and neither is it clear that the patches as they stand are going to be merged into Linus's kernel.

> corrected. The patches borrow heavily from the previous ia64
> oprofile/perfmon support.

Does this mean we have to run around doing userspace IPIs still? I still want that fixed in perfmon.

> Rather than directly setting up the bits for the performance monitoring
> hardware libpfm is used to map the name to the appropriate bits. For
> processors with complicated constraints on the performance monitoring
> hardware this makes more sense than trying to duplicate the constraints
> mechanism in oprofile.

Does this support HT properly? That is, can it be made aware of the requirement that we need to separate out the samples for each of the 2 threads?

> libpfm:
> -make event mapping complete (lots of events missing for various processors)
> -libpfm isn't available on some processors that perfmon supports (e.g.
> p4/ppc64)

What happened with the naming synchronisation effort?

> + # need to get the appropriate perfmon module installed
> + # FIXME need to remove them when they are not needed

Why isn't this done automatically in the kernel???

> +#define op_pfm_unload_context(fd) \
> + perfmonctl(fd, PFM_UNLOAD_CONTEXT, NULL, 0)
> +
> +#else
> +
> +/* wrapper to allow older perfmon interface to be used */
> +#define op_pfm_create_context(ctx) pfm_create_context(ctx, NULL, 0)

So the new pfm hasn't bothered to provide a proper API??

regards
john
From: Stephane E. <er...@hp...> - 2006-03-29 16:03:00
John,

On Wed, Mar 29, 2006 at 10:19:59AM -0500, John Levon wrote:
> On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:
>
> > I have gotten oprofile to make use of the new perfmon2 mechanism to
> > collect samples. I currently have this running on my AMD64 laptop. The
>
> What actual benefits does this bring? AIUI perfmon2 is not yet
> sufficiently ported so that we can throw away all our near-duplicate
> code, and neither is it clear that the patches as they stand are going
> to be merged into Linus's kernel.
>

It is not about throwing away code. It is about experimenting to verify that this could be made to work. Concerning OProfile, my goal has never been to drop it. Instead, I have designed the perfmon interface such that the bulk of it could be re-used without modifications.

> > corrected. The patches borrow heavily from the previous ia64
> > oprofile/perfmon support.
>
> Does this mean we have to run around doing userspace IPIs still? I still
> want that fixed in perfmon.
>

I was under the impression that OProfile uses one sample buffer per CPU. Then samples are pushed into a single buffer which is read by user code. That single buffer also stores the OS events such as exit, fork, and library unmap used to correlate samples. Is that correct?

> > Rather than directly setting up the bits for the performance monitoring
> > hardware libpfm is used to map the name to the appropriate bits. For
> > processors with complicated constraints on the performance monitoring
> > hardware this makes more sense than trying to duplicate the constraints
> > mechanism in oprofile.
>
> Does this support HT properly? That is, can it be made aware of the
> requirement that we need to separate out the samples for each of the 2
> threads?
>

The P4/Xeon perfmon code supports HT. The design is such that if HT is enabled, half of the PMU registers are exposed to each thread. The kernel takes care of remapping the registers onto their respective half on context switch. From the point of view of the tool, this is transparent. PEBS is not supported with HT due to HW limitations.

> > libpfm:
> > -make event mapping complete (lots of events missing for various processors)
> > -libpfm isn't available on some processors that perfmon supports (e.g.
> > p4/ppc64)
>
> What happened with the naming synchronisation effort?

On the AMD side, all the changes submitted by Ray have been integrated into libpfm.

> > + # need to get the appropriate perfmon module installed
> > + # FIXME need to remove them when they are not needed
>
> Why isn't this done automatically in the kernel???
>

This can be done automatically by the kernel, i.e. with the OProfile format module compiled in. This could also be done at boot time by a script. My choice would be to have it builtin given how simple it is.

> > +#define op_pfm_unload_context(fd) \
> > + perfmonctl(fd, PFM_UNLOAD_CONTEXT, NULL, 0)
> > +
> > +#else
> > +
> > +/* wrapper to allow older perfmon interface to be used */
> > +#define op_pfm_create_context(ctx) pfm_create_context(ctx, NULL, 0)
>
> So the new pfm hasn't bothered to provide a proper API??
>

The new perfmon code base uses one system call per command. This is the API that tools should now use. On IA-64 only, and for backward compatibility, we also support the old perfmonctl() system call.
From: John L. <le...@mo...> - 2006-03-29 16:10:09
[removed closed list]

On Wed, Mar 29, 2006 at 07:58:02AM -0800, Stephane Eranian wrote:

> It is not about throwing away code. It is about experimenting to verify
> that this could be made to work. Concerning OProfile, my goal

But this is eventually what we want: one implementation of stuff that programs perf counters and deals with naming etc.

> I was under the impression that OProfile uses one sample buffer per CPU.
> Then samples are pushed into a single buffer which is read by user code.
> That single buffer also stores the OS events such as exit, fork, library
> unmap used to correlate samples. Is that correct?
>

Right. But we still have this silly requirement that each CPU must be programmed separately, on that CPU. I completely fail to understand your objection to supporting "put this config on all CPUs, please".

> > > +/* wrapper to allow older perfmon interface to be used */
> > > +#define op_pfm_create_context(ctx) pfm_create_context(ctx, NULL, 0)
> >
> > So the new pfm hasn't bothered to provide a proper API??
> >
> The new perfmon code base uses one system call per command. This is the API
> that tools should now use. On IA-64 only, and for backward compatibility, we
> also support the old perfmonctl() system call.

The comment seems to imply that pfm_create_context() et al are the /old/ method?

regards
john
From: William C. <wc...@re...> - 2006-03-29 21:49:09
John Levon wrote:
> On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:
>
>>I have gotten oprofile to make use of the new perfmon2 mechanism to
>>collect samples. I currently have this running on my AMD64 laptop. The
>
> What actual benefits does this bring? AIUI perfmon2 is not yet
> sufficiently ported so that we can throw away all our near-duplicate
> code, and neither is it clear that the patches as they stand are going
> to be merged into Linus's kernel.

Mainly to see how far off the current perfmon2 patches were from supporting instrumentation tools such as OProfile. As I mentioned, there are lots of rough edges in the current implementation. This was to get some feedback on concrete code and come up with something better.

>>corrected. The patches borrow heavily from the previous ia64
>>oprofile/perfmon support.
>
> Does this mean we have to run around doing userspace IPIs still? I still
> want that fixed in perfmon.
>
>>Rather than directly setting up the bits for the performance monitoring
>>hardware libpfm is used to map the name to the appropriate bits. For
>>processors with complicated constraints on the performance monitoring
>>hardware this makes more sense than trying to duplicate the constraints
>>mechanism in oprofile.
>
> Does this support HT properly? That is, can it be made aware of the
> requirement that we need to separate out the samples for each of the 2
> threads?
>
>>libpfm:
>>-make event mapping complete (lots of events missing for various processors)
>>-libpfm isn't available on some processors that perfmon supports (e.g.
>>p4/ppc64)
>
> What happened with the naming synchronisation effort?

The naming inconsistencies between libpfm and oprofile are becoming apparent. For the events without unit masks there should be an agreed-upon name. Within the performance monitoring documentation there has been inconsistent naming.

There is also the philosophical difference that libpfm has no concept of event masks. There are just event names in libpfm. Thus, DISPATCHED_FPU_OPS with "Add pipe ops" in oprofile becomes "DISPATCHED_FPU_OPS_ADD" in libpfm.

>>+ # need to get the appropriate perfmon module installed
>>+ # FIXME need to remove them when they are not needed
>
> Why isn't this done automatically in the kernel???

I would like to do that in the kernel. However, I didn't know offhand how to do that when there are different modules that it could depend on based on the processor.

>>+#define op_pfm_unload_context(fd) \
>>+ perfmonctl(fd, PFM_UNLOAD_CONTEXT, NULL, 0)
>>+
>>+#else
>>+
>>+/* wrapper to allow older perfmon interface to be used */
>>+#define op_pfm_create_context(ctx) pfm_create_context(ctx, NULL, 0)
>
> So the new pfm hasn't bothered to provide a proper API??

This was to use the documented interface without removing the old perfmon interface.

-Will
From: John L. <le...@mo...> - 2006-03-29 21:50:43
On Wed, Mar 29, 2006 at 04:48:58PM -0500, William Cohen wrote:

> Mainly to see how far off the current perfmon2 patches were from
> supporting instrumentation tools such as OProfile.

OK.

> There is also the philosophical difference that libpfm has no concept of
> event masks. There are just event names in libpfm. Thus,
> DISPATCHED_FPU_OPS with "Add pipe ops" in oprofile becomes
> "DISPATCHED_FPU_OPS_ADD" in libpfm.

This seems broken. Is there some special magic to allow libpfm to "or" in such events, or did they just not realise that some of these unit masks aren't exclusive choices?

regards
john
From: Stephane E. <er...@hp...> - 2006-03-29 22:22:28
|
Will, John, On Wed, Mar 29, 2006 at 04:50:36PM -0500, John Levon wrote: > On Wed, Mar 29, 2006 at 04:48:58PM -0500, William Cohen wrote: > > > There is also the philisophical difference that libpfm has no concept of > > event masks. There are just event names in libpfm. Thus, > > DISPATCHED_FPU_OPS with "Add pip ops" in oprofile becomes > > "DISPATCHED_FPU_OPS_ADD in libpfm. > > This seems broken. Is there some special magic to allow libpfm to "or" > in such events, or did they just not realise that some of these unit > masks aren't exclusive choices? > There is no magic, all combinations must be provided in the event table. This has been like that since the beginning. On Itanium, there are events which support unit mask combinations and we provide all combinations. The libpfm interface is designed to be very generic. That means the same generic call is used to pass an event list (by names) to the library and to get back a list of PMC register (index,value) pairs to program. Extended features such as opcode filters on Itanium, or inversion on AMD, are passed into model specific parameters. My goal was to get basic counting counting of any event going without necessarily using a model-specific extension. The main call is: int pfm_dispatch_events(gen_inp *, model_inp *, gen_outp *, model_outp *); A user comes in with "CPU_CYCLES", then does: pfm_get_event_code("CPU_CYCLES", &code); gen_inp.pfp_event[0] = code; Here code is an opaque descriptor for the event, (in reality the index of the event in the event table). Then, the call to pfm_dispatch_events(). In return, the user gets: gen_outp.pfp_pmc[0].reg_num = 4; gen_outp.pfp_pmc[0].reg_value = 0x400128; Those values are then copied into perfmon2 specific data structures are passed to pfm_write_pmcs(). There is no explicit call from libpfm to pfm_write_pmcs(). 
The only dependency that exists between libpfm and perfmon2 is in the naming of the PMC registers, i.e., libpfm PMC4 corresponds to perfmon2 PMC4 on Itanium for instance. This is more a convenience than a requirement.

With the unit mask separated, you'd have to systematically use a model-specific argument because you never really know how many levels of unit mask you have. On P4, there is more than the umask that can be configured. On Montecito you have the unit mask, the MESI bits, and the .me/.all for some events.

I am not necessarily against splitting, but this only becomes convenient for events with unit mask combinations, which are not that many. It is more tedious for other events with only a small series of unit masks because now tools need to handle event names and unit masks separately. For instance with pfmon, I can name any event with a single string. Now, I would have to split with event_name1:unit mask, event_name2:unit mask. What would be the format of unit mask? What about a PMU where there is more than a unit mask, maybe a sub unit mask? It becomes more difficult for a novice user to measure certain basic events.

I am open to suggestions on this, make a proposal. The good thing about libpfm is that you are not required to use it to invoke the perfmon2 interface.

--
-Stephane |
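The name-to-register flow described in this message can be sketched with a toy model. Everything below is a stand-in: the toy_* types and functions are hypothetical and are not libpfm's API, and the one-entry table simply reuses the register index and value quoted in the message. The real pfm_dispatch_events() also performs register assignment, which this sketch skips.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins only; none of these types or functions belong to libpfm. */
struct toy_event { const char *name; unsigned reg_num; unsigned long reg_value; };
struct toy_pmc   { unsigned reg_num; unsigned long reg_value; };

/* Illustrative table: one entry, reusing the values quoted in the message. */
static const struct toy_event toy_events[] = {
    { "CPU_CYCLES", 4, 0x400128UL },
};
#define TOY_NEVENTS (sizeof(toy_events) / sizeof(toy_events[0]))

/* Name -> opaque code (here the table index). Returns 0, or -1 if unknown. */
int toy_get_event_code(const char *name, unsigned *code)
{
    assert(name != NULL && code != NULL);
    for (unsigned i = 0; i < TOY_NEVENTS; i++) {
        if (strcmp(toy_events[i].name, name) == 0) {
            *code = i;
            return 0;
        }
    }
    return -1; /* report the miss instead of looping or guessing */
}

/* Codes -> (register index, value) pairs for the caller to program. */
int toy_dispatch_events(const unsigned *codes, size_t n, struct toy_pmc *out)
{
    for (size_t i = 0; i < n; i++) {
        if (codes[i] >= TOY_NEVENTS)
            return -1;
        out[i].reg_num = toy_events[codes[i]].reg_num;
        out[i].reg_value = toy_events[codes[i]].reg_value;
    }
    return 0;
}
```

A caller would look up "CPU_CYCLES", pass the resulting code to toy_dispatch_events(), and copy the returned pairs into whatever kernel interface it uses; the library half never touches the kernel.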
From: John L. <le...@mo...> - 2006-03-29 22:35:41
|
On Wed, Mar 29, 2006 at 02:16:37PM -0800, Stephane Eranian wrote:

> > This seems broken. Is there some special magic to allow libpfm to "or"
> > in such events, or did they just not realise that some of these unit
> > masks aren't exclusive choices?
> >
> There is no magic, all combinations must be provided in the event table.

Wow. Combinatorial explosion ahoy!! So for an event like:

    event:0x29 counters:0,1 um:mesi minimum:500 name:L2_LD : number of L2 data loads

where the unit mask is:

    name:mesi type:bitmask default:0x0f
        0x08 (M)odified cache state
        0x04 (E)xclusive cache state
        0x02 (S)hared cache state
        0x01 (I)nvalid cache state
        0x0f All cache states

You explicitly give names to all combinations?

> instance with pfmon, I can name any event with a single string. Now, I would have to split with
> event_name1:unit mask, event_name2:unit mask. What would be the format of unit mask? What about
> a PMU where there is more than a unit mask, maybe a sub unit mask? It becomes more difficult for a
> novice user to measure certain basic events.

So provide a default.

> I am open to suggestions on this, make a proposal. The good thing about libpfm is that you
> are not required to use it to invoke the perfmon2 interface.

Yes, fine, but we /do/ want to use it.

regards,
john |
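To put a number on the explosion for the mesi mask above: four independent bits give 2^4 - 1 = 15 non-empty combinations, each of which would need its own table entry if every combination is named. A small sketch follows; the "L2_LD_<letters>" naming scheme and both function names are made up for illustration and are not taken from libpfm or oprofile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* The four mesi bits from the unit mask description above. */
struct um_bit { unsigned value; char letter; };
static const struct um_bit mesi_bits[] = {
    { 0x08, 'M' }, { 0x04, 'E' }, { 0x02, 'S' }, { 0x01, 'I' },
};
#define MESI_NBITS (sizeof(mesi_bits) / sizeof(mesi_bits[0]))

/* Table entries needed if every non-empty combination gets its own name. */
unsigned combo_count(unsigned nbits)
{
    return (1u << nbits) - 1u;
}

/* Build a hypothetical per-combination name, e.g. "L2_LD_MS" for 0x0a. */
void combo_name(const char *event, unsigned mask, char *buf, size_t len)
{
    assert(event != NULL && buf != NULL && len > 0);
    int w = snprintf(buf, len, "%s_", event);
    if (w < 0 || (size_t)w >= len)
        return; /* truncated; snprintf already terminated the string */
    size_t n = (size_t)w;
    for (size_t i = 0; i < MESI_NBITS && n + 1 < len; i++) {
        if (mask & mesi_bits[i].value)
            buf[n++] = mesi_bits[i].letter;
    }
    buf[n] = '\0';
}
```

With a separate bitmask the same information is one event name plus an OR of the bits (0x08 | 0x02 for the "MS" case), so the table grows linearly with the number of bits rather than exponentially.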
From: Philip M. <mu...@cs...> - 2006-04-06 06:36:02
|
Hi again,

I have a question about the statement below. If libpfm is designed to be separate from perfmon2, what exactly is its purpose? Is its goal to replace PAPI or some other API? I'd like to understand where it fits and whether Stephane is going to disrupt my stream of consulting income funding my travels with it. ha haa ;-)

Personally, I use libpfm as a nice 'helper' library only for the event description tables it provides as part of the IA64 support. In fact, PAPI implements all those separately on other platforms, but we didn't want to reinvent the wheel since Stephane did all the hard work. However, for other things, like register allocation, the algorithms in libpfm are fixed and not portable...they must be re-written for each platform, whereas the scheme in PAPI is in portable code.

Anyways, I think you guys understand what I'm getting at...I would like to understand what the group sees as the division of functionality between PAPI and libpfm.

Regards, and Namaste from Nepal,

Phil

> I am open to suggestions on this, make a proposal. The good thing about libpfm is that you
> are not required to use it to invoke the perfmon2 interface.
> |
From: Stephane E. <er...@hp...> - 2006-04-06 08:45:44
|
Phil,

On Tue, Apr 04, 2006 at 10:27:58AM +0545, Philip Mucci wrote:
>
> I have a question about the statement below. If libpfm is designed to be
> separate from perfmon2, what exactly is its purpose? Is its goal to
> replace PAPI or some other API? I'd like to understand where it fits and
> whether Stephane is going to disrupt my stream of consulting income
> funding my travels with it. ha haa ;-)
>
No, libpfm serves a different purpose. The goal is not to present a uniform interface and set of events to tools. It does present a uniform interface across platforms but it exposes the real events. Just like it is today on IA-64, PAPI could live on top of libpfm.

The reason I keep libpfm separate from the perfmon kernel interface is because I want to minimize the dependencies whenever possible. You say it yourself below. Libpfm is a helper library for performance tools. It solves the difficult event assignment problem. You say "I want to measure event X, Y, Z and use features A, B (e.g., opcode filters)" and the library returns a valid assignment that you then copy over to the perfmon2 interface specific parameters or any other interface for that matter.

I do not want libpfm making perfmon kernel calls. The reason is simple: if you do this you will run into problems with tools, as they don't all do things the same way. I'd rather see another library side by side with libpfm. The latest libpfm shows this with the system-wide helper library libpfms.

Furthermore, I do not want libpfm becoming a required component to use perfmon2. Take HP Caliper, for instance: it does not use the library, yet it invokes the interface easily.

It is not the role of libpfm to hide some of the aspects of the kernel interface, like what is done by perfctr for instance. Tools should be exposed to the real kernel interface. If they need help, then people can design a new library. The point being that this library will serve a different goal from libpfm.

To summarize, libpfm does not have the same goal as PAPI. PAPI can be layered on top of libpfm. PAPI brings another set of value-adds to tools, such as generic event names across architectures.

> Personally, I use libpfm as a nice 'helper' library only for the event
> description tables it provides as part of the IA64 support. In fact,
> PAPI implements all those separately on other platforms, but we didn't
> want to reinvent the wheel since Stephane did all the hard work. However,
> for other things, like register allocation, the algorithms in libpfm are
> fixed and not portable...they must be re-written for each platform,
> whereas the scheme in PAPI is in portable code.
>
> Anyways, I think you guys understand what I'm getting at...I would like
> to understand what the group sees as the division of functionality
> between PAPI and libpfm.
>
> Regards, and Namaste from Nepal,
>
> Phil
>
> > I am open to suggestions on this, make a proposal. The good thing about libpfm is that you
> > are not required to use it to invoke the perfmon2 interface.
> >

--
-Stephane |
From: William C. <wc...@re...> - 2006-03-31 20:20:54
|
John Levon wrote:
> On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:

>>+ # need to get the appropriate perfmon module installed
>>+ # FIXME need to remove them when they are not needed
>
>
> Why isn't this done automatically in the kernel???

Hi Stephane,

I found that the request_module() function can be used to pull in the required code. This should eliminate the check in opcontrol. Stephane, do you have a suggestion on a check that a perfmon module, e.g. perfmon_amd, was successfully loaded? What wouldn't work before the machine-specific performance monitoring hardware support is loaded, but will work after it is loaded?

-Will |
From: Stephane E. <er...@hp...> - 2006-03-31 21:01:17
|
Will,

On Fri, Mar 31, 2006 at 03:18:32PM -0500, William Cohen wrote:
>
> I found that the request_module() function can be used to pull in the
> required code. This should eliminate the check in opcontrol. Stephane,

I have never looked at request_module() myself. That could be interesting.

> do you have a suggestion on a check that a perfmon module, e.g.
> perfmon_amd, was successfully loaded? What wouldn't work before the
> machine-specific performance monitoring hardware support is loaded, but will
> work after it is loaded?

Keep in mind that there can only be one PMU descriptor module inserted at a time. Modules are required to probe to check if they support the hardware. If they fail, insmod fails. When there is no module inserted, you cannot create any perfmon context with pfm_create_context(). You'll get ENOSYS. Another way to check is to look in /sys/kernel/perfmon/pmu_model. If it says "Unknown", then there is nothing inserted.

--
-Stephane |
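From user space, the sysfs check mentioned above could look roughly like the sketch below. The path and the "Unknown" string come from the message; the function names are made up, and the assumption that the file holds a single line of text is mine rather than something the thread states.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PMU_MODEL_PATH "/sys/kernel/perfmon/pmu_model"

/* Returns 1 if the text names a PMU model, 0 if it is empty or "Unknown". */
int pmu_model_is_loaded(const char *text)
{
    assert(text != NULL);
    size_t len = strcspn(text, "\r\n"); /* ignore a trailing newline */
    if (len == 0)
        return 0;
    return !(len == strlen("Unknown") && strncmp(text, "Unknown", len) == 0);
}

/* Reads the first line of path: 1 = loaded, 0 = not loaded, -1 = unreadable. */
int pmu_model_check(const char *path)
{
    char buf[128];
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *line = fgets(buf, sizeof(buf), f);
    fclose(f);
    return line != NULL ? pmu_model_is_loaded(buf) : -1;
}
```

Calling pmu_model_check(PMU_MODEL_PATH) after loading a module would distinguish "nothing inserted" from "file not there at all", the latter suggesting a kernel without perfmon2.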
From: William C. <wc...@re...> - 2006-03-31 21:43:43
|
Stephane Eranian wrote:
> Will,
>
> On Fri, Mar 31, 2006 at 03:18:32PM -0500, William Cohen wrote:
>
>>I found that the request_module() function can be used to pull in the
>>required code. This should eliminate the check in opcontrol. Stephane,
>
>
> I have never looked at request_module() myself. That could be interesting.
>
>
>>do you have a suggestion on a check that a perfmon module, e.g.
>>perfmon_amd, was successfully loaded? What wouldn't work before the
>>machine-specific performance monitoring hardware support is loaded, but will
>>work after it is loaded?
>
>
> Keep in mind that there can only be one PMU descriptor module inserted
> at a time. Modules are required to probe to check if they support the
> hardware. If they fail, insmod fails. When there is no module
> inserted, you cannot create any perfmon context with pfm_create_context().
> You'll get ENOSYS. Another way to check is to look in /sys/kernel/perfmon/pmu_model.
> If it says "Unknown", then there is nothing inserted.
>

Yes, the code looks up which module to install based on the cpu information. There are a couple of gotchas in the current code. There isn't a driver for Athlons (perfmon only has perfmon_amd for 64-bit) and there isn't a distinction between p4 and em64t machines.

The detection method needs to be runnable in the kernel module. Isn't pfm_create_context() the system call from user space?

Would it be possible to read the information from pfm_pmu_conf? Do something like the following:

    if ((pfm_pmu_conf != NULL) && (strcmp(pfm_pmu_conf->pmu_name, "Unknown") != 0)) {
        /* performance hardware support installed. */
    }

-Will |
From: Stephane E. <er...@hp...> - 2006-03-31 21:55:14
|
Will,

On Fri, Mar 31, 2006 at 04:41:18PM -0500, William Cohen wrote:
>
> Yes, the code looks up which module to install based on the cpu
> information. There are a couple of gotchas in the current code. There isn't
> a driver for Athlons (perfmon only has perfmon_amd for 64-bit) and there
> isn't a distinction between p4 and em64t machines.
>
There is no driver for Athlons at this point. As for P4 vs. EM64T, there is indeed a check for EM64T and vice-versa. Look closely at the probe routines of the two, you'll see that a test is reversed.

> The detection method needs to be runnable in the kernel module. Isn't
> pfm_create_context() the system call from user space?
>
Ah, yes that won't work.

> Would it be possible to read the information from pfm_pmu_conf? Do
> something like the following:
>
> if ((pfm_pmu_conf != NULL) && (strcmp(pfm_pmu_conf->pmu_name, "Unknown") != 0)) {
>     /* performance hardware support installed. */
> }
>
Are you doing this from the OProfile module or the perfmon_amd module? For the former you need to grab the pfm_pmu_conf_lock spinlock to be safe.

--
-Stephane |
From: William C. <wc...@re...> - 2006-03-31 22:04:42
|
Stephane Eranian wrote:
> Will,
>
> On Fri, Mar 31, 2006 at 04:41:18PM -0500, William Cohen wrote:
>
>>Yes, the code looks up which module to install based on the cpu
>>information. There are a couple of gotchas in the current code. There isn't
>>a driver for Athlons (perfmon only has perfmon_amd for 64-bit) and there
>>isn't a distinction between p4 and em64t machines.
>>
>
> There is no driver for Athlons at this point. As for P4 vs. EM64T, there is
> indeed a check for EM64T and vice-versa. Look closely at the probe
> routines of the two, you'll see that a test is reversed.

Is that check the only difference between the two? It seems kind of overboard to have such similar hardware handled by nearly duplicate pieces of code.

>
>>The detection method needs to be runnable in the kernel module. Isn't
>>pfm_create_context() the system call from user space?
>>
>
> Ah, yes that won't work.
>
>
>>Would it be possible to read the information from pfm_pmu_conf? Do
>>something like the following:
>>
>>if ((pfm_pmu_conf != NULL) && (strcmp(pfm_pmu_conf->pmu_name, "Unknown") != 0)) {
>>    /* performance hardware support installed. */
>>}
>>
>
> Are you doing this from the OProfile module or the perfmon_amd module?
> For the former you need to grab the pfm_pmu_conf_lock spinlock to be safe.
>

Doing this from the oprofile module because it is the one actually requesting that a specific perfmon module be installed. Want to make sure that the module request was successful.

-Will |
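One detail worth spelling out about the proposed test: strcmp() returns 0 on a match, so the "support installed" case is the one where the comparison against "Unknown" is non-zero. A user-space sketch of that logic follows, using a mock structure in place of the kernel's pfm_pmu_conf. Only the pmu_name field is modelled, the function names are hypothetical, and the locking Stephane mentions appears only as a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock of the kernel's PMU description; only the field used here. */
struct mock_pmu_conf { const char *pmu_name; };

/* Returns 1 when a PMU description other than "Unknown" is present. */
int pmu_description_loaded(const struct mock_pmu_conf *conf)
{
    /* In the kernel, pfm_pmu_conf_lock would be held around this read. */
    if (conf == NULL || conf->pmu_name == NULL)
        return 0;
    return strcmp(conf->pmu_name, "Unknown") != 0;
}

/* The three cases the check has to tell apart. */
void pmu_description_selfcheck(void)
{
    const struct mock_pmu_conf none = { "Unknown" };
    const struct mock_pmu_conf some = { "hypothetical PMU" };
    assert(!pmu_description_loaded(NULL));
    assert(!pmu_description_loaded(&none));
    assert(pmu_description_loaded(&some));
}
```

In the oprofile module this would run right after the module request, so that a failed or mismatched load is caught before any counters are set up.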
From: John L. <le...@mo...> - 2006-04-01 02:53:29
|
[snipping closed list]

On Fri, Mar 31, 2006 at 05:02:24PM -0500, William Cohen wrote:

> Doing this from the oprofile module because it is the one actually
> requesting that a specific perfmon module be installed. Want to make
> sure that the module request was successful.

I fail to understand why perfmon isn't doing all of this for us. It knows about its modules, and it can certainly know what CPU we are on.

regards
john |
From: Stephane E. <er...@hp...> - 2006-04-01 05:16:34
|
On Sat, Apr 01, 2006 at 03:53:20AM +0100, John Levon wrote:
>
> [snipping closed list]
>
> On Fri, Mar 31, 2006 at 05:02:24PM -0500, William Cohen wrote:
>
> > Doing this from the oprofile module because it is the one actually
> > requesting that a specific perfmon module be installed. Want to make
> > sure that the module request was successful.
>
> I fail to understand why perfmon isn't doing all of this for us. It
> knows about its modules, and it can certainly know what CPU we are on.
>
Yes, perfmon can do this for you, two ways:

- have all the PMU description modules compiled in. During kernel boot they will all be called to probe the CPU. The first to succeed gets control.
- have an initscript to insert the right module

--
-Stephane |
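The first option, "all descriptions compiled in, first successful probe wins", amounts to a loop over a table of probe callbacks. The sketch below is a toy model of that idea only: the CPU identifiers, description names, and functions are all invented, and the real perfmon2 registration code is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Invented CPU identifiers, for the sake of the example. */
enum toy_cpu { TOY_CPU_A, TOY_CPU_B, TOY_CPU_OTHER };

struct toy_pmu_desc {
    const char *name;
    int (*probe)(enum toy_cpu cpu); /* returns 0 if the CPU is supported */
};

static int probe_a(enum toy_cpu cpu) { return cpu == TOY_CPU_A ? 0 : -1; }
static int probe_b(enum toy_cpu cpu) { return cpu == TOY_CPU_B ? 0 : -1; }

/* Every built-in description, in the order they would be probed. */
static const struct toy_pmu_desc toy_descs[] = {
    { "toy_pmu_a", probe_a },
    { "toy_pmu_b", probe_b },
};

/* Returns the first description whose probe succeeds, or NULL if none do. */
const struct toy_pmu_desc *toy_select_pmu(enum toy_cpu cpu)
{
    for (size_t i = 0; i < sizeof(toy_descs) / sizeof(toy_descs[0]); i++) {
        assert(toy_descs[i].probe != NULL);
        if (toy_descs[i].probe(cpu) == 0)
            return &toy_descs[i];
    }
    return NULL;
}
```

With this shape no user-space component has to know which description matches the machine; a NULL result corresponds to the "Unknown" state described earlier in the thread.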
From: John L. <le...@mo...> - 2006-04-01 13:48:51
|
On Fri, Mar 31, 2006 at 09:11:34PM -0800, Stephane Eranian wrote:

> - have an initscript to insert the right module

Nope, still not getting it. Why is userspace getting involved at all in deciding which module to use?

john |