Thread: Patches to get oprofile to work with perfmon2 on amd64

oprofile-list

Patches to get oprofile to work with perfmon2 on amd64

From: William C. <wc...@re...> - 2006-03-27 16:10:09

Attachments: oprof_perfmon2-20060327.diff perfmon2_oprof20060327.diff

I have gotten oprofile to make use of the new perfmon2 mechanism to 
collect samples. I currently have this running on my AMD64 laptop. The 
oprof_perfmon2-20060327.diff patches the oprofile user space code and 
perfmon2_oprof20060327.diff is for the kernel. The patches are still 
"work in progress" and there are certainly things that need to be 
corrected. The patches borrow heavily from the previous ia64 
oprofile/perfmon support.

Because different sampling mechanisms could be used on x86, 
/dev/oprofile/implement has been added so that tools can identify which 
sampling mechanism is being used to collect the samples.

Rather than directly setting up the bits for the performance monitoring 
hardware, libpfm is used to map the event name to the appropriate bits. For 
processors with complicated constraints on the performance monitoring 
hardware, this makes more sense than trying to duplicate the constraints 
mechanism in oprofile.

Below are issues that still need to be fixed in the various areas of the 
oprofile/perfmon2 monitoring.

kernel:
- separating oprofile's processor id code from the i386 nmi mechanism setup
- have oprofile/perfmon2 identify cpu for real (currently just hardwired 
to amd64)
- oprofile always uses perfmon2 if kernel configured with perfmon
- module installation is a bit odd:
	-install oprofile modules
	-opcontrol reads information to determine if perfmon2 is used
	-opcontrol installs the appropriate perfmon module
- oprofile lies that it needs buffer space (perfmon_get_size()) so
	perfmon2 actually calls oprofile's perfmon_handler()

oprofile:
- make translation of event names to bit patterns more robust:
	can hang if event is not found
- verify that the event masking support works
- get rid of fatal_error() function in opd_perfmon.c
- have ophelp get the available events from libpfm when possible

libpfm:
- make event mapping complete (lots of events missing for various processors)
- libpfm isn't available on some processors that perfmon supports (e.g. 
p4/ppc64)


-Will

Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-03-29 13:16:45

Will,

On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:
> I have gotten oprofile to make use of the new perfmon2 mechanism to 
> collect samples. I currently have this running on my AMD64 laptop. The 
> oprof_perfmon2-20060327.diff patches the oprofile user space code and 
> perfmon2_oprof20060327.diff is for the kernel. The patches are still 
> "work in progress" and there are certainly things that need to be 
> corrected. The patches borrow heavily from the previous ia64 
> oprofile/perfmon support.

Looking at /arch/i386/oprofile/perfmon.c, it is identical to the
IA-64 version and the experimental i386 version I developed.  I think 
we can move this format into the generic perfmon code in perfmon/.
This way we only have one version to maintain.

> Due to the different sampling mechanism that could be used for x86, 
> /dev/oprofile/implement has been added so the sampling mechanism being 
> used can be identify how the samples are being collected.
> 

Yes. I think there are things to do in this area. Perfmon2 does not support
NMI-based sampling. On Itanium there is no NMI. On other architectures,
if I understand correctly, NMI is used because it provides better coverage
of kernel code. An NMI cannot be masked, so you can collect samples
in code sections where interrupts are masked.

Is that the ONLY motivation for this?

> Rather than directly setting up the bits for the performance monitoring 
> hardware libpfm is used to map the name to the appropriate bits. For 
> processors with complicated constraints on the performance monitoring 
> hardware this makes more sense than trying to duplicate the constraints 
> mechanism in oprofile.
> 

Yes, you could use libpfm to simplify this part of the job. My understanding
here is that the logic for events/encodings/constraints already exists
in OProfile. The only missing piece would be how to map the OProfile register
naming scheme to the perfmon2 naming scheme. Using libpfm just for this may look
like overkill. I need to look at how register names are handled across
the various architectures OProfile supports. Maybe there is a simpler way that
would not introduce a dependency on libpfm.

> Below are issues that still need to be fixed in the various areas of the 
> oprofile/perfmon2 monitoring.

> kernel:
> - separating oprofiles processor id code from i386 nmi mechanism setup
> - have oprofile/perfmon2 identify cpu for real (currently just hardwired 
> to amd64)

This is something I don't quite understand in OProfile. Why is it that user
code relies on CPU detection done by the OProfile kernel code? The user
code could just as well detect the CPU model (via cpuid or equivalent). If you
assume that the kernel code probes on init and disables itself if the CPU
is not supported, then nothing bad can happen.

> - oprofile always uses perfmon2 if kernel configured with perfmon

I think we have to do this otherwise we may have PMU access conflicts.

> - module installation a bit odd:
> 	-install oprofile modules
> 	-opcontrol reads information to determine if perfmon2 used

Yes that makes sense.

> 	-opcontrol install appropropriate perfmon module

Yes, or it could be built in.

> - oprofile lies that it needs buffer space (perfmon_get_size()) so
> 	perfmon2 actually calls oprofile's perfmon_handler()

I fixed that. This was a bug. The format detection code was wrong.

> 
> oprofile:
> - make translation of events names to bit patterns more robust:
> 	can hang if event is not found
> - verify that the event masking support works
> - get rid of fatal_error() function in opd_perfmon.c
> - ophelp get the available events from libpfm when possible
> 
> libpfm:
> -make event mapping complete (lots of events missing for various processors)
> -libpfm isn't available on some procesors that perfmon supports (e.g. 
> p4/ppc64)

Yes, I know that for non-Itanium processors, there are some events missing,
sometimes because of umask combinations.

Thanks for your patches.

-- 
-Stephane

Re: Patches to get oprofile to work with perfmon2 on amd64

From: William C. <wc...@re...> - 2006-03-29 14:14:41

Stephane Eranian wrote:
> Will,
> 
> On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:
> 
>>I have gotten oprofile to make use of the new perfmon2 mechanism to 
>>collect samples. I currently have this running on my AMD64 laptop. The 
>>oprof_perfmon2-20060327.diff patches the oprofile user space code and 
>>perfmon2_oprof20060327.diff is for the kernel. The patches are still 
>>"work in progress" and there are certainly things that need to be 
>>corrected. The patches borrow heavily from the previous ia64 
>>oprofile/perfmon support.
> 
> 
> Looking at /arch/i386/oprofile/perfmon.c, it is identical to the
> IA-64 version and the experimental i386 version I developed.  I think 
> we can move this format into the generic perfmon code in perfmon/.
> This way we only have one version to maintain.

Yes, the changes for /arch/i386/oprofile/perfmon.c were pretty 
straightforward and would be the same for other architectures. Factoring 
out the code and making it common to the platforms is reasonable.

>>Due to the different sampling mechanism that could be used for x86, 
>>/dev/oprofile/implement has been added so the sampling mechanism being 
>>used can be identify how the samples are being collected.
>>
> 
> 
> Yes. I think there are things to do in this area. Perfmon2 does not support
> NMI-based sampling. On Itanium there is no NMI. On other architectures,
> if I understand clearly, NMI is used because it provides better coverage
> of kernel code. NMI cannot be masked therefore you can collect samples
> in code sections were interrupts are masked.
> 
> Is that the ONLY motivation for this?

Depending on which kernel someone is using, the same oprofile code for i386 
and x86-64 platforms could use either the original oprofile mechanism or 
perfmon2 to access the performance monitoring hardware. It seemed easiest to 
have /dev/oprofile contain a file that explicitly states the mechanism being 
used. This could also be used by GUIs and other tools to directly 
determine the profiling mechanism. I wanted to avoid inferring the mechanism 
in use by looking at a bunch of files.

The native OProfile driver on x86-64 and i386 uses the NMI. This does 
allow sampling in IRQ routines. However, we need to make sure that the 
amount of time spent in the NMI handler is limited. Using the NMI 
routine appears to cause problems on some machines (e.g. laptops where 
the NMI could happen when the BIOS is doing some power management 
operation).

Is there some idea of the overhead in the perfmon2 timer interval and 
sampling mechanisms?

>>Rather than directly setting up the bits for the performance monitoring 
>>hardware libpfm is used to map the name to the appropriate bits. For 
>>processors with complicated constraints on the performance monitoring 
>>hardware this makes more sense than trying to duplicate the constraints 
>>mechanism in oprofile.
>>
> 
> 
> Yes, you could use libpfm to simplify this part of the job. My understanding
> here is that there is already that logic about events/encodings/constraints
> in Oprofile. The only missing piece would be out to map OProfile register naming
> scheme to the perfmon2 naming scheme. Using libpfm just for this may look
> overkill in a sense. I need to look at how rgister names are handled across
> the various architectures OProfile supports. May be there is a simpler way that
> would not introduce a dependency on libpfm.

OProfile has event and unit_mask files for each of the supported 
architectures in /usr/share/oprofile/{arch}/{model}. For example, the 
x86-64 amd64 machine would use the event and unit_mask files in 
/usr/share/oprofile/x86-64/hammer.
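
As a minimal sketch of the lookup described above, the following C helper 
builds the path to a per-model file from a cpu_type string such as 
"x86-64/hammer". The helper name op_event_path and the file names "events" 
and "unit_masks" are assumptions made for illustration; only the 
/usr/share/oprofile/{arch}/{model} layout comes from the description above.

```c
#include <stdio.h>
#include <string.h>

/* Base directory taken from the text; the file names passed in are assumed. */
#define OP_EVENT_DIR "/usr/share/oprofile"

/*
 * Build "<OP_EVENT_DIR>/<cpu_type>/<file>" into buf.
 * cpu_type is expected in "{arch}/{model}" form, e.g. "x86-64/hammer".
 * Returns 0 on success, -1 if an argument is bad or buf is too small.
 */
int op_event_path(const char *cpu_type, const char *file,
                  char *buf, size_t size)
{
    int n;

    if (!cpu_type || !file || !buf || size == 0)
        return -1;
    /* Require at least one '/' (arch/model) and reject ".." components. */
    if (!strchr(cpu_type, '/') || strstr(cpu_type, ".."))
        return -1;

    n = snprintf(buf, size, "%s/%s/%s", OP_EVENT_DIR, cpu_type, file);
    if (n < 0 || (size_t)n >= size)
        return -1;      /* output error or truncated path */
    return 0;
}
```

Calling op_event_path("x86-64/hammer", "events", buf, sizeof buf) would fill 
buf with a path under /usr/share/oprofile/x86-64/hammer; a string without an 
arch/model separator is rejected rather than guessed at.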

The constraints are much more complicated for the Pentium 4 and 
Power processors. I would expect that libpfm will be able to do a better 
job there, once support is in libpfm for them. For the Pentium 4, OProfile 
made a number of simplifications and reduced the available counters to 8 
independent counters on non-HT processors and 4 independent counters on HT 
processors. There are also tagging events that are not handled by 
OProfile's mechanism. The event selection mechanism of the Power (ppc64) 
processors is relatively complex. OProfile does have events for it, but it 
isn't ideal.

The goal here is to factor out the event mapping logic and have it in 
one place.

>>Below are issues that still need to be fixed in the various areas of the 
>>oprofile/perfmon2 monitoring.
> 
>  
> 
>>kernel:
>>- separating oprofiles processor id code from i386 nmi mechanism setup
>>- have oprofile/perfmon2 identify cpu for real (currently just hardwired 
>>to amd64)
> 
> 
> This is something I don't quite understand in OProfile. Why is it that user
> code relies on CPU detection done by the OPRofile kernel code? The user
> code could as well detect the CPU model (via cpuid or equivalent). If you
> assume that the kernel code probes on init and disables itself if the CPU
> is not supported, then nothing bad can happen.

The cpu identification is required for two purposes:

1) figure out how the oprofile module accesses the performance 
monitoring hardware. There are different methods of accessing the 
performance monitoring registers in ppro/p2/p3, p4, and athlon.

2) the user space needs to get the correct list of events to map event 
names to numbers and unit masks.

User space could find out the cpuid on its own, but the oprofile 
native driver has to determine the information anyway.

How would perfmon2 tools handle the case of multiple 
architectures? Do the cpuid in user space and modprobe the appropriate 
module? What happens if an attempt is made to load the wrong perfmon kernel 
module? Is there a check in the initialization to make sure that it 
will work on the processor?

>>- oprofile always uses perfmon2 if kernel configured with perfmon
> 
> 
> I think we have to do this otherwise we may have PMU access conflicts.

I was thinking about the case where someone would prefer to use one of 
the other sampling mechanisms, e.g. the nmi or timer mechanism. In 
OProfile you can force the timer mechanism to be used.

>>- module installation a bit odd:
>>	-install oprofile modules
>>	-opcontrol reads information to determine if perfmon2 used
> 
> 
> Yes that makes sense.
> 
> 
>>	-opcontrol install appropropriate perfmon module
> 
> 
> Yes, or it could be builtin.

Has built-in perfmon2 been verified to work with multiple architectures? 
We don't want to have different kernels for EM64T and AMD64 or P6, Pentium 
M, P4.

Is there some way of identifying that perfmon2 is available on the 
machine? Right now the oprofile/perfmon2 patch assumes it is always a 
module.

>>- oprofile lies that it needs buffer space (perfmon_get_size()) so
>>	perfmon2 actually calls oprofile's perfmon_handler()
> 
> 
> I fixed that. This was a bug. The format detection code was wrong.

Excellent.

>>oprofile:
>>- make translation of events names to bit patterns more robust:
>>	can hang if event is not found
>>- verify that the event masking support works
>>- get rid of fatal_error() function in opd_perfmon.c
>>- ophelp get the available events from libpfm when possible
>>
>>libpfm:
>>-make event mapping complete (lots of events missing for various processors)
>>-libpfm isn't available on some procesors that perfmon supports (e.g. 
>>p4/ppc64)
> 
> 
> Yes, I know that for non Itanium, there are some events missing, sometimes
> because of umask combinations.
> 
> Thanks for your patches.
> 

Thanks for perfmon2.

-Will

Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-03-30 07:38:12

Will,

On Wed, Mar 29, 2006 at 09:12:17AM -0500, William Cohen wrote:
> >
> >This is something I don't quite understand in OProfile. Why is it that user
> >code relies on CPU detection done by the OPRofile kernel code? The user
> >code could as well detect the CPU model (via cpuid or equivalent). If you
> >assume that the kernel code probes on init and disables itself if the CPU
> >is not supported, then nothing bad can happen.
> 
> The cpu identification is required for two purposes:
> 
> 1) figure out how the oprofile module accesses the performance 
> monitoring hardware. There are different methods of accessing the 
> performance monitoring registers in ppro/p2/p3, p4, and athlon.
> 

This is about the /dev/oprofile stuff, isn't it?

> 2) the user space needs to get the correct list of events to map event 
> names to number and unit masks.
> 
> The user-space could do find out the cpuid on it's own, but the oprofile 
> native driver has to determine the information anyway.
> 

The driver does not deal with the events, just the type of PMU, i.e., the
registers.


> How would perfmon2 tools handle the case of multiple multiple 
> architectures? Do the cpuid in user space and modprobe the appropriate 
> module? What happens if the wrong perfmon kernel module is attepted to 
> be loaded? Is there a check in the initalizaiton to make sure that it 
> will works on the processor?
> 
For a processor family, take i386 for instance, there is arch-specific
perfmon2 code built into the kernel. But the PMU description table which
describes the mapping from the logical PMU registers, i.e., PMC/PMD, to
the actual PMU registers is implemented as a kernel module. We call this
the PMU description module. Each module must provide a probe routine
which is responsible for verifying that the host PMU matches what the
module describes. In other words, the Pentium M module does not work
on Pentium 4. At any time, there can be AT MOST one such module inserted.
That guarantees that there cannot be conflicts.
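
The probe-and-single-slot rule described above can be sketched in C as 
follows. This is only an illustration of the idea, not the perfmon2 kernel 
code: struct pmu_desc, pmu_register(), pmu_unregister() and the two probe 
routines are hypothetical names, and the host PMU is represented by a plain 
string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a PMU description module. */
struct pmu_desc {
    const char *name;                      /* e.g. "Pentium M" */
    int (*probe)(const char *host_model);  /* 0 if the host PMU matches */
};

/* At most one description may be active at any time. */
static const struct pmu_desc *active_pmu;

/* Returns 0 on success, -1 if a module is already active or probe fails. */
int pmu_register(const struct pmu_desc *d, const char *host_model)
{
    if (!d || !d->probe || !host_model)
        return -1;
    if (active_pmu)                  /* another module is already inserted */
        return -1;
    if (d->probe(host_model) != 0)   /* module does not describe this PMU */
        return -1;
    active_pmu = d;
    return 0;
}

/* Release the slot, but only if d is the module that holds it. */
void pmu_unregister(const struct pmu_desc *d)
{
    if (active_pmu == d)
        active_pmu = NULL;
}

/* Example probe routines for two hypothetical modules. */
int probe_pentium_m(const char *host)
{
    return strcmp(host, "Pentium M") == 0 ? 0 : -1;
}

int probe_p4(const char *host)
{
    return strcmp(host, "Pentium 4") == 0 ? 0 : -1;
}
```

With this shape, inserting the Pentium M description on a Pentium 4 host 
fails in the probe, and a second insertion fails because the slot is taken.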

If the Pentium M module is inserted but user level code thinks it is 
on a Pentium 4, very likely the logical PMU will not match expectations
and  pfm_write_pmcs() will fail. Worst case, the application does
not measure what it thinks it should. User level code may verify what the
kernel is using by checking the content of /sys/kernel/perfmon/pmu_model.
This file is the equivalent of /dev/oprofile/cpu_type.
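
A user-level check of that file could look roughly like the sketch below. 
The helper names read_first_line() and pmu_model_matches() are hypothetical, 
and the exact strings the kernel writes to pmu_model are not given here, so 
the caller has to supply whatever model string it expects.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a text file into buf, dropping the trailing
 * newline. Returns 0 on success, -1 if the file cannot be read.
 */
int read_first_line(const char *path, char *buf, size_t size)
{
    FILE *f;

    if (!path || !buf || size == 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)size, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/*
 * Returns 1 if the file reports the model the tool expects, 0 if it
 * reports a different one, -1 if the file is unavailable.
 */
int pmu_model_matches(const char *path, const char *expected)
{
    char model[128];

    if (!expected || read_first_line(path, model, sizeof model) != 0)
        return -1;
    return strcmp(model, expected) == 0;
}
```

A tool would pass "/sys/kernel/perfmon/pmu_model" as the path and refuse to 
program any registers unless the result is 1.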

Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-03-30 16:57:14

Will,

On Wed, Mar 29, 2006 at 09:12:17AM -0500, William Cohen wrote:
> The native OProfile driver on x86-64 and i386 use the NMI. This does 
> allow sampling in IRQ routines. However, need to make sure that the 
> amount of time spent in the NMI handler is limited. Using the NMI 
> routine appears to cause problems on some machines (e.g. laptops where 
> the NMI could happen when the BIOS is doing some power management 
> operation).
> 
> Is there some idea of the overhead in the perfmon2 timer interval and 
> sampling mechanisms?
> 
I maintain some statistics per cpu in /sys/devices/system/cpu/cpu*/perfmon/
Keep in mind that the code has not been optimized at this point.

On a 1.5GHz Itanium2, it takes about 800 cycles to record a sample. Note that 
there is an incompressible cost of 200 cycles or so to get in and out of the 
kernel and to/from C code.

-- 
-Stephane

Re: Patches to get oprofile to work with perfmon2 on amd64

From: William C. <wc...@re...> - 2006-03-30 17:54:14

Stephane Eranian wrote:
> Will,
> 
> On Wed, Mar 29, 2006 at 09:12:17AM -0500, William Cohen wrote:
> 
>>The native OProfile driver on x86-64 and i386 use the NMI. This does 
>>allow sampling in IRQ routines. However, need to make sure that the 
>>amount of time spent in the NMI handler is limited. Using the NMI 
>>routine appears to cause problems on some machines (e.g. laptops where 
>>the NMI could happen when the BIOS is doing some power management 
>>operation).
>>
>>Is there some idea of the overhead in the perfmon2 timer interval and 
>>sampling mechanisms?
>>
> 
> I maintain some statistics per cpu in /sys/devices/system/cpu/cpu*/perfmon/
> Keep in mind that the code has not been optimized at this point.
> 
> On 1.5GHz Itanium2, it takes about 800 cycles to record a sample. Knowing 
> there is an uncompressible 200 cycles or so to get in and out of the kernel and
> to/from C code.
> 

Thanks for the info. Looking at the information in that directory on 
an amd64 machine for oprofile using perfmon2:

fmt_handler_calls:14815
fmt_handler_cycles:7793412
handle_timeout_count:0
ovfl_intr_all_count:14815
ovfl_intr_cycles:68857305
ovfl_intr_regular_count:14815
ovfl_intr_replay_count:0
ovfl_intr_spurious_count:0
set_switch_count:0
set_switch_cycles:0

If I understand correctly, the averages per interrupt would be:
about 526 cycles for fmt_handler
about 4648 cycles for ovfl_intr
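
The arithmetic behind those two averages can be reproduced with a small C 
sketch that pulls "name:value" pairs out of the statistics listed above and 
divides cycles by calls. The function names stat_lookup() and avg_cycles() 
are made up for this example, and rounding to the nearest cycle is an 
assumption about how the figures were obtained.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find "name:value" in a newline-separated stats dump and store the value.
 * Returns 0 on success, -1 if the name is absent or the value is malformed.
 */
int stat_lookup(const char *stats, const char *name, unsigned long long *out)
{
    size_t len = strlen(name);
    const char *p = stats;

    while (p && *p) {
        /* p always points at the start of an entry here. */
        if (strncmp(p, name, len) == 0 && p[len] == ':')
            return sscanf(p + len + 1, "%llu", out) == 1 ? 0 : -1;
        p = strchr(p, '\n');
        if (p)
            p++;        /* step past the newline to the next entry */
    }
    return -1;
}

/* Average cycles per call, rounded to the nearest integer; 0 if no calls. */
unsigned long long avg_cycles(unsigned long long cycles,
                              unsigned long long calls)
{
    return calls ? (cycles + calls / 2) / calls : 0;
}
```

With the numbers above, 7793412 / 14815 rounds to 526 and 68857305 / 14815 
rounds to 4648.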

-Will

Re: [perfmon] Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-03-30 21:38:27

Will,

On Thu, Mar 30, 2006 at 12:51:52PM -0500, William Cohen wrote:
> >On 1.5GHz Itanium2, it takes about 800 cycles to record a sample. Knowing 
> >there is an uncompressible 200 cycles or so to get in and out of the 
> >kernel and
> >to/from C code.
> >
> 
> Thanks for the info. Looking at the information in that directory on 
> amd64 machine for oprofile using perfmon2:
> 
> fmt_handler_calls:14815
> fmt_handler_cycles:7793412
> handle_timeout_count:0
> ovfl_intr_all_count:14815
> ovfl_intr_cycles:68857305
> ovfl_intr_regular_count:14815
> ovfl_intr_replay_count:0
> ovfl_intr_spurious_count:0
> set_switch_count:0
> set_switch_cycles:0
> 
> If I understand correctly below would be the average per interrupt.
> about 526 cycles for fmt_handler
> about 4648 cycles for ovfl_intr
> 
Yes. Note that this is using rdtsc on AMD and other i386 variants.
I seem to recall that this is not necessarily reliable on some
processors, especially laptops. Feel free to suggest something
better or add anything to the inline function to make it more
reliable.

-- 

-Stephane

Re: [perfmon] Re: Patches to get oprofile to work with perfmon2 on amd64

From: William C. <wc...@nc...> - 2006-03-31 03:36:13

Stephane Eranian wrote:
> Will,
> 
> On Thu, Mar 30, 2006 at 12:51:52PM -0500, William Cohen wrote:
> 
>>>On 1.5GHz Itanium2, it takes about 800 cycles to record a sample. Knowing 
>>>there is an uncompressible 200 cycles or so to get in and out of the 
>>>kernel and
>>>to/from C code.
>>>
>>
>>Thanks for the info. Looking at the information in that directory on 
>>amd64 machine for oprofile using perfmon2:
>>
>>fmt_handler_calls:14815
>>fmt_handler_cycles:7793412
>>handle_timeout_count:0
>>ovfl_intr_all_count:14815
>>ovfl_intr_cycles:68857305
>>ovfl_intr_regular_count:14815
>>ovfl_intr_replay_count:0
>>ovfl_intr_spurious_count:0
>>set_switch_count:0
>>set_switch_cycles:0
>>
>>If I understand correctly below would be the average per interrupt.
>>about 526 cycles for fmt_handler
>>about 4648 cycles for ovfl_intr
>>
> 
> Yes. Not that this is using rdtsc on AMD, and other i386 variants.
> I seem to recall that this is not necessarily reliable on some
> processors, especially laptops. Feel free to suggest something
> better or and anything to the inline function to make it more
> reliable.
> 

My understanding is that the clock frequency is adjusted for power 
management on many processors. Thus, the cycles on the AMD64 processor I 
am using don't always map to 0.5ns (2GHz). There are some settings that 
can be made to lock the clock to the max frequency to make these 
measurements more accurate.
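
To make the dependence on clock frequency concrete, here is a small C sketch 
of the cycles-to-time conversion. The function name cycles_to_ns() is 
invented for this example, and the result is only meaningful if the clock 
really ran at the given frequency for the whole interval.

```c
/*
 * Convert a cycle count to nanoseconds at a given clock frequency in Hz.
 * Returns a negative value if hz is not positive.
 */
double cycles_to_ns(unsigned long long cycles, double hz)
{
    if (hz <= 0.0)
        return -1.0;
    return (double)cycles * 1e9 / hz;
}
```

At 2GHz one cycle is 0.5ns, so the roughly 4648 cycles per overflow interrupt 
quoted above come to about 2324ns. If the clock had been scaled down to, say, 
1GHz (a hypothetical value), the same cycle count would correspond to twice 
that time.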

-Will

Re: Patches to get oprofile to work with perfmon2 on amd64

From: John L. <le...@mo...> - 2006-03-29 15:20:10

On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:

> I have gotten oprofile to make use of the new perfmon2 mechanism to 
> collect samples. I currently have this running on my AMD64 laptop. The 

What actual benefits does this bring? AIUI perfmon2 is not yet
sufficiently ported so that we can throw away all our near-duplicate
code, and neither is it clear that the patches as they stand are going
to be merged into Linus's kernel.

> corrected. The patches borrow heavily from the previous ia64 
> oprofile/perfmon support.

Does this mean we have to run around doing userspace IPIs still? I still
want that fixed in perfmon.

> Rather than directly setting up the bits for the performance monitoring 
> hardware libpfm is used to map the name to the appropriate bits. For 
> processors with complicated constraints on the performance monitoring 
> hardware this makes more sense than trying to duplicate the constraints 
> mechanism in oprofile.

Does this support HT properly? That is, can it be made aware of the
requirement that we need to separate out the samples for each of the 2
threads?

> libpfm:
> -make event mapping complete (lots of events missing for various processors)
> -libpfm isn't available on some procesors that perfmon supports (e.g. 
> p4/ppc64)

What happened with the naming synchronisation effort?

> +		    # need to get the appropriate perfmon module installed
> +		    # FIXME need to remove them when they are not needed

Why isn't this done automatically in the kernel???

> +#define op_pfm_unload_context(fd) \
> +	perfmonctl(fd, PFM_UNLOAD_CONTEXT, NULL, 0)
> +
> +#else
> +
> +/* wrapper to allow older perfmon interface to be used */
> +#define op_pfm_create_context(ctx) pfm_create_context(ctx, NULL, 0)

So the new pfm hasn't bothered to provide a proper API??

regards
john

Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-03-29 16:03:00

John,

On Wed, Mar 29, 2006 at 10:19:59AM -0500, John Levon wrote:
> On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:
> 
> > I have gotten oprofile to make use of the new perfmon2 mechanism to 
> > collect samples. I currently have this running on my AMD64 laptop. The 
> 
> What actual benefits does this bring? AIUI perfmon2 is not yet
> sufficiently ported so that we can throw away all our near-duplicate
> code, and neither is it clear that the patches as they stand are going
> to be merged into Linus's kernel.
> 

It is not about throwing away code. It is about experimenting to verify
that this could be made to work. Concerning OProfile, my goal
has never been to drop it. Instead, I have designed the perfmon
interface such that the bulk of it could be re-used without
modifications.


> > corrected. The patches borrow heavily from the previous ia64 
> > oprofile/perfmon support.
> 
> Does this mean we have to run around doing userspace IPIs still? I still
> want that fixed in perfmon.
> 
I was under the impression that OProfile uses one sample buffer per CPU.
Then samples are pushed into a single buffer which is read by user code.
That single buffer also stores the OS events such as exit, fork, library
unmap used to correlate samples. Is that correct?


> > Rather than directly setting up the bits for the performance monitoring 
> > hardware libpfm is used to map the name to the appropriate bits. For 
> > processors with complicated constraints on the performance monitoring 
> > hardware this makes more sense than trying to duplicate the constraints 
> > mechanism in oprofile.
> 
> Does this support HT properly? That is, can it be made aware of the
> requirement that we need to separate out the samples for each of the 2
> threads?
> 

The P4/Xeon perfmon code supports HT. The design is such that if HT is enabled,
half of the PMU registers are exposed to each thread. The kernel takes care of
remapping the registers onto their respective half on context switch. From the
point of view of the tool, this is transparent. PEBS is not supported with HT
due to HW limitations.

> > libpfm:
> > -make event mapping complete (lots of events missing for various processors)
> > -libpfm isn't available on some procesors that perfmon supports (e.g. 
> > p4/ppc64)
> 
> What happened with the naming synchronisation effort?

On the AMD side, all the changes submitted by Ray have been integrated into libpfm.

> 
> > +		    # need to get the appropriate perfmon module installed
> > +		    # FIXME need to remove them when they are not needed
> 
> Why isn't this done automatically in the kernel???
> 

This can be done automatically by the kernel, i.e., with the OProfile format
module compiled in. This could also be done at boot time by a script. My choice
would be to have it built in given how simple it is.

> > +#define op_pfm_unload_context(fd) \
> > +	perfmonctl(fd, PFM_UNLOAD_CONTEXT, NULL, 0)
> > +
> > +#else
> > +
> > +/* wrapper to allow older perfmon interface to be used */
> > +#define op_pfm_create_context(ctx) pfm_create_context(ctx, NULL, 0)
> 
> So the new pfm hasn't bothered to provide a proper API??
> 
The new perfmon code base uses one system call per command. This is the API
that tools should now use. On IA-64 only, for backward compatibility, we also
support the old perfmonctl() system call.

Re: Patches to get oprofile to work with perfmon2 on amd64

From: John L. <le...@mo...> - 2006-03-29 16:10:09

[removed closed list]

On Wed, Mar 29, 2006 at 07:58:02AM -0800, Stephane Eranian wrote:

> It is not about throwing code. It is about experimenting to verify
> that this could be made to work. Concerning OProfile, my goal

But this is eventually what we want: one implementation of stuff that
programs perf counters and deals with naming etc.

> I was under the impression that Oprofile uses one sample buffer per CPU.
> then samples are pushed into a single buffer which is read by user code.
> That single buffer also stores the OS events such as exit, fork, library
> unmap used to correlate samples. Is that correct?
> 

Right. But we still have this silly requirement that each CPU must be
programmed separately, on that CPU. I completely fail to understand your
objection to supporting "put this config on all CPUS, please".

> > > +/* wrapper to allow older perfmon interface to be used */
> > > +#define op_pfm_create_context(ctx) pfm_create_context(ctx, NULL, 0)
> > 
> > So the new pfm hasn't bothered to provide a proper API??
> > 
> The new perfmon code base uses one system call per command. This is the API
> that tools should now use. On Ia-64 only and for backward compatibility, we also
> support the old perfmonctl() system call.

The comment seems to imply that pfm_create_context() et al are the /old/
method?

regards
john

Re: Patches to get oprofile to work with perfmon2 on amd64

From: William C. <wc...@re...> - 2006-03-29 21:49:09

John Levon wrote:
> On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:
> 
> 
>>I have gotten oprofile to make use of the new perfmon2 mechanism to 
>>collect samples. I currently have this running on my AMD64 laptop. The 
> 
> 
> What actual benefits does this bring? AIUI perfmon2 is not yet
> sufficiently ported so that we can throw away all our near-duplicate
> code, and neither is it clear that the patches as they stand are going
> to be merged into Linus's kernel.

Mainly to see how far off the current perfmon2 patches were from 
supporting instrumentation tools such as OProfile.

As I mentioned there are lots of rough edges in the current 
implementation. This was to get some feedback on concrete code and come 
up with something better.

>>corrected. The patches borrow heavily from the previous ia64 
>>oprofile/perfmon support.
> 
> 
> Does this mean we have to run around doing userspace IPIs still? I still
> want that fixed in perfmon.
> 
> 
>>Rather than directly setting up the bits for the performance monitoring 
>>hardware libpfm is used to map the name to the appropriate bits. For 
>>processors with complicated constraints on the performance monitoring 
>>hardware this makes more sense than trying to duplicate the constraints 
>>mechanism in oprofile.
> 
> 
> Does this support HT properly? That is, can it be made aware of the
> requirement that we need to separate out the samples for each of the 2
> threads?
> 
> 
>>libpfm:
>>-make event mapping complete (lots of events missing for various processors)
>>-libpfm isn't available on some procesors that perfmon supports (e.g. 
>>p4/ppc64)
> 
> 
> What happened with the naming synchronisation effort?

The naming inconsistencies between libpfm and oprofile are becoming 
apparent. For the events without unit masks there should be an agreed-upon 
name. Within the performance monitoring documentation there has 
been inconsistent naming.

There is also the philosophical difference that libpfm has no concept of 
event masks. There are just event names in libpfm. Thus, 
DISPATCHED_FPU_OPS with "Add pipe ops" in oprofile becomes 
"DISPATCHED_FPU_OPS_ADD" in libpfm.

>>+		    # need to get the appropriate perfmon module installed
>>+		    # FIXME need to remove them when they are not needed
> 
> 
> Why isn't this done automatically in the kernel???

I would like to do that in the kernel. However, I didn't know offhand how 
to do that when there are different modules that it could depend on, 
based on the processor.

>>+#define op_pfm_unload_context(fd) \
>>+	perfmonctl(fd, PFM_UNLOAD_CONTEXT, NULL, 0)
>>+
>>+#else
>>+
>>+/* wrapper to allow older perfmon interface to be used */
>>+#define op_pfm_create_context(ctx) pfm_create_context(ctx, NULL, 0)
> 
> 
> So the new pfm hasn't bothered to provide a proper API??

This was to use the documented interface without removing the old 
perfmon interface.

-Will

Re: Patches to get oprofile to work with perfmon2 on amd64

From: John L. <le...@mo...> - 2006-03-29 21:50:43

On Wed, Mar 29, 2006 at 04:48:58PM -0500, William Cohen wrote:

> To mainly see how far off the current perfmon2 patches were from 
> supporting instrumentation tools such as OProfile.

OK.

> There is also the philosophical difference that libpfm has no concept of 
> event masks. There are just event names in libpfm. Thus, 
> DISPATCHED_FPU_OPS with "Add pipe ops" in oprofile becomes 
> "DISPATCHED_FPU_OPS_ADD" in libpfm.

This seems broken. Is there some special magic to allow libpfm to "or"
in such events, or did they just not realise that some of these unit
masks aren't exclusive choices?

regards
john

Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-03-29 22:22:28

Will, John,

On Wed, Mar 29, 2006 at 04:50:36PM -0500, John Levon wrote:
> On Wed, Mar 29, 2006 at 04:48:58PM -0500, William Cohen wrote:
> 
> > There is also the philosophical difference that libpfm has no concept of 
> > event masks. There are just event names in libpfm. Thus, 
> > DISPATCHED_FPU_OPS with "Add pipe ops" in oprofile becomes 
> > "DISPATCHED_FPU_OPS_ADD" in libpfm.
> 
> This seems broken. Is there some special magic to allow libpfm to "or"
> in such events, or did they just not realise that some of these unit
> masks aren't exclusive choices?
> 

There is no magic, all combinations must be provided in the event table.
This has been like that since the beginning. On Itanium, there are events
which support unit mask combinations and we provide all combinations.

The libpfm interface is designed to be very generic. That means the
same generic call is used to pass an event list (by names) to the
library and to get back a list of PMC register (index,value) pairs to
program. Extended features such as opcode filters on Itanium, or
inversion on AMD, are passed into model specific parameters. My goal
was to get basic counting of any event going without necessarily
using a model-specific extension.

The main call is:
	 int pfm_dispatch_events(gen_inp *, model_inp *, gen_outp *, model_outp *);

A user comes in with "CPU_CYCLES", then does:

	pfm_get_event_code("CPU_CYCLES", &code);
	gen_inp.pfp_event[0] = code;

Here code is an opaque descriptor for the event, (in reality the index of the event in the
event table). Then, the call to pfm_dispatch_events(). In return, the user gets:
	gen_outp.pfp_pmc[0].reg_num = 4;
	gen_outp.pfp_pmc[0].reg_value = 0x400128;

Those values are then copied into perfmon2-specific data structures that are passed to pfm_write_pmcs().
There is no explicit call from libpfm to pfm_write_pmcs(). The only dependency that exists
between libpfm and perfmon2 is in the naming of the PMC register, i.e., libpfm PMC4 corresponds
to perfmon2 PMC4 on Itanium for instance. This is more a convenience than a requirement.
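
The flow described above (event name to opaque code, then dispatch to
register/value pairs) can be sketched with a toy event table. Everything
in this sketch is hypothetical: toy_get_event_code(), toy_dispatch_events(),
the struct names, and the second table row are stand-ins for illustration
and are not the libpfm API. The CPU_CYCLES row reuses the register number
and value quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical event table entry: one name maps to one PMC programming. */
struct toy_event {
    const char *name;
    unsigned int reg_num;      /* which PMC register to program */
    unsigned long reg_value;   /* value to write into that register */
};

static const struct toy_event toy_events[] = {
    { "CPU_CYCLES",             4, 0x400128UL }, /* values quoted in the message */
    { "DISPATCHED_FPU_OPS_ADD", 5, 0x400100UL }, /* made-up values */
};

#define TOY_NUM_EVENTS (sizeof(toy_events) / sizeof(toy_events[0]))

struct toy_pmc {
    unsigned int reg_num;
    unsigned long reg_value;
};

/* Name -> opaque code (the table index). Returns 0 on success, -1 if unknown. */
int toy_get_event_code(const char *name, unsigned int *code)
{
    size_t i;

    for (i = 0; i < TOY_NUM_EVENTS; i++) {
        if (strcmp(toy_events[i].name, name) == 0) {
            *code = (unsigned int)i;
            return 0;
        }
    }
    return -1;
}

/* Codes -> (reg_num, reg_value) pairs. Returns 0 on success, -1 on a bad code. */
int toy_dispatch_events(const unsigned int *codes, size_t count,
                        struct toy_pmc *out)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (codes[i] >= TOY_NUM_EVENTS)
            return -1;
        out[i].reg_num = toy_events[codes[i]].reg_num;
        out[i].reg_value = toy_events[codes[i]].reg_value;
    }
    return 0;
}
```

The caller would then copy each (reg_num, reg_value) pair into whatever
structure the kernel interface expects. The toy library itself never talks
to the kernel, which mirrors the separation described above.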

With unit masks separated, you'd have to systematically use a model-specific argument because
you never really know how many levels of unit mask you have. On P4, there is more than the
umask that can be configured. On Montecito you have unit mask, the MESI bits, and
the .me/.all  for some events.

I am not necessarily against splitting, but it is only convenient for events with
unit mask combinations, of which there are not that many. It is more tedious for other events with only small
series of unit masks because now tools need to handle event names and unit mask separately. For
instance with pfmon, I can name any event with a single string. Now, I would have to split with
event_name1:unit mask, event_name2:unit mask. What would be the format of unit mask? What about
PMUs where there is more than a unit mask, maybe a sub unit mask? It becomes more difficult for a
novice user to measure certain basic events. 

I am open to suggestions on this, make a proposal. The good thing about libpfm, is that you 
are not required to use it to invoke the perfmon2 interface.

-- 
-Stephane

Re: Patches to get oprofile to work with perfmon2 on amd64

From: John L. <le...@mo...> - 2006-03-29 22:35:41

On Wed, Mar 29, 2006 at 02:16:37PM -0800, Stephane Eranian wrote:

> > This seems broken. Is there some special magic to allow libpfm to "or"
> > in such events, or did they just not realise that some of these unit
> > masks aren't exclusive choices?
> > 
> There is no magic, all combinations must be provided in the event table.

Wow. Combinatorial explosion ahoy!! So for an event like:

event:0x29 counters:0,1 um:mesi minimum:500 name:L2_LD : number of L2 data loads

where the unit mask is:

name:mesi type:bitmask default:0x0f
        0x08 (M)odified cache state
        0x04 (E)xclusive cache state
        0x02 (S)hared cache state
        0x01 (I)nvalid cache state
        0x0f All cache states

You explicitly give names to all combinations?
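
To put a number on it: a four-bit bitmask such as mesi has 15 non-empty
combinations. A minimal sketch, assuming a flat naming scheme of "L2_LD_"
followed by the state letters (the scheme and the function names are
hypothetical, not taken from libpfm):

```c
#include <stdio.h>
#include <string.h>

/* MESI unit mask bits as listed above. */
static const struct {
    unsigned int bit;
    char letter;
} mesi_bits[] = {
    { 0x08, 'M' }, { 0x04, 'E' }, { 0x02, 'S' }, { 0x01, 'I' },
};

/*
 * Build a flat event name for one mask value, e.g. 0x0c -> "L2_LD_ME".
 * len must be at least 1; the name is truncated if buf is too small.
 */
void mesi_event_name(unsigned int mask, char *buf, size_t len)
{
    size_t i, n;

    snprintf(buf, len, "L2_LD_");
    n = strlen(buf);
    for (i = 0; i < 4 && n + 1 < len; i++) {
        if (mask & mesi_bits[i].bit)
            buf[n++] = mesi_bits[i].letter;
    }
    buf[n] = '\0';
}

/* Non-empty combinations of an nbits-wide bitmask: 2^nbits - 1 (nbits < 32). */
unsigned int count_combinations(unsigned int nbits)
{
    return (1u << nbits) - 1u;
}

/* Print one table entry per non-empty mask value. */
void print_all_mesi_events(void)
{
    unsigned int mask;
    char name[32];

    for (mask = 1; mask <= 0x0f; mask++) {
        mesi_event_name(mask, name, sizeof(name));
        printf("%s um=0x%02x\n", name, mask);
    }
}
```

Each additional independent mask bit doubles the number of table entries
needed for a single event.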

> instance with pfmon, I can name any event with a single string. Now, I would have to split with
> event_name1:unit mask, event_name2:unit mask. What would be the format of unit mask? What about
> PMUs where there is more than a unit mask, maybe a sub unit mask? It becomes more difficult for a
> novice user to measure certain basic events. 

So provide a default.

> I am open to suggestions on this, make a proposal. The good thing about libpfm, is that you 
> are not required to use it to invoke the perfmon2 interface.

Yes, fine, but we /do/ want to use it.

regards,
john

Re: [perfmon] Re: Patches to get oprofile to work with perfmon2 on amd64

From: Philip M. <mu...@cs...> - 2006-04-06 06:36:02

Hi again,

I have a question about the below statement. If libpfm is designed to be
separate from perfmon2, what exactly is its purpose? Is its goal to
replace PAPI or some other API? I'd like to understand where it fits and
whether Stephane is going to disrupt my stream of consulting income
funding my travels with it. ha haa ;-)

Personally, I use libpfm as a nice 'helper' library only for event
description tables it provides as part of the IA64 support. In fact,
PAPI implements all those separately on other platforms, but we didn't
want to reinvent the wheel since Stephane did all the hard work. However,
for other things, like register allocation, the algorithms in libpfm are
fixed and not portable...they must be re-written for each platform,
whereas the scheme in PAPI is in portable code.

Anyways, I think you guys understand what I'm getting at...I would like
to understand what the group sees as the division of functionality
between PAPI and libpfm.

Regards, and Namaste from Nepal,

Phil

> I am open to suggestions on this, make a proposal. The good thing about libpfm, is that you 
> are not required to use it to invoke the perfmon2 interface.
>

Re: [perfmon] Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-04-06 08:45:44

Phil,

On Tue, Apr 04, 2006 at 10:27:58AM +0545, Philip Mucci wrote:
> 
> I have a question about the below statement. If libpfm is designed to be
> separate from perfmon2, what exactly is its purpose? Is its goal to
> replace PAPI or some other API? I'd like to understand where it fits and
> whether Stephane is going to disrupt my stream of consulting income
> funding my travels with it. ha haa ;-)
> 

No, libpfm serves a different purpose. The goal is not to present a uniform
interface and set of events to tools. It does present a uniform interface
across platforms but it exposes the real events. Just like it is today
on IA-64, PAPI could live on top of libpfm. 

The reason I keep libpfm separate from the perfmon kernel interface is because
I want to minimize the dependencies whenever possible. You say it yourself
below. Libpfm is a helper library for performance tools. It solves the
difficult event assignment problem. You say "I want to measure event X, Y, Z and
use features A, B (e.g., opcode filters)" and the library returns a valid
assignment that you then copy over to the perfmon2 interface specific 
parameters or any other interface for that matter.

I do not want libpfm making perfmon kernel calls. The reason is simple, if
you do this you will run into problems with tools as they don't all do
things the same way. I'd rather see another library side by side with libpfm.
The latest libpfm shows this with the system-wide helper library libpfms.

Furthermore, I do not want libpfm becoming a required component to use
perfmon2. Take HP Caliper, for instance, it does not use the library, yet
it invokes the interface easily. It is not the role of libpfm to hide
some of the aspects of the kernel interface, like what is done
by perfctr for instance. Tools should be exposed to the real kernel
interface. If they need help, then people can design a new library. The point
being that this library will serve a different goal from libpfm.

To summarize, libpfm does not have the same goal as PAPI. PAPI can
be layered on top of libpfm. PAPI brings another set of value-adds
to tools, such as generic event names across architectures.

> Personally, I use libpfm as a nice 'helper' library only for event
> description tables it provides as part of the IA64 support. In fact,
> PAPI implements all those separately on other platforms, but we didn't
> want to reinvent the wheel since Stephane did all the hard work. However,
> for other things, like register allocation, the algorithms in libpfm are
> fixed and not portable...they must be re-written for each platform,
> whereas the scheme in PAPI is in portable code.
> 
> Anyways, I think you guys understand what I'm getting at...I would like
> to understand what the group sees as the division of functionality
> between PAPI and libpfm.
> 
> Regards, and Namaste from Nepal,
> 
> Phil
> 
> > I am open to suggestions on this, make a proposal. The good thing about libpfm, is that you 
> > are not required to use it to invoke the perfmon2 interface.
> > 

-- 

-Stephane

Re: Patches to get oprofile to work with perfmon2 on amd64

From: William C. <wc...@re...> - 2006-03-31 20:20:54

John Levon wrote:
> On Mon, Mar 27, 2006 at 11:09:57AM -0500, William Cohen wrote:

>>+		    # need to get the appropriate perfmon module installed
>>+		    # FIXME need to remove them when they are not needed
> 
> 
> Why isn't this done automatically in the kernel???

Hi Stephane,

I found that the request_module() function can be used to pull in the 
required code. This should eliminate the check in opcontrol. Stephane, 
do you have a suggestion on a check that a perfmon module, e.g. 
perfmon_amd, was successfully loaded? What wouldn't work before the 
machine specific performance monitoring hardware is loaded, but will 
work after it is loaded?

-Will

Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-03-31 21:01:17

Will,

On Fri, Mar 31, 2006 at 03:18:32PM -0500, William Cohen wrote:
> 
> I found that the request_module() function can be used to pull in the 
> required code. This should eliminate the check in opcontrol. Stephane, 

I have never looked at request_module() myself. That could be interesting.

> do you have a suggestion on a check that a perfmon module, e.g. 
> perfmon_amd, was successfully loaded? What wouldn't work before the 
> machine specific performance monitoring hardware is loaded, but will 
> work after it is loaded?

Keep in mind that there can only be one PMU descriptor module inserted
at a time. Modules are required to probe to check if they support the
hardware. If they fail, insmod fails. When there is no module
inserted, you cannot create any perfmon context with pfm_create_context().
You'll get ENOSYS. Another way to check is to look in /sys/kernel/perfmon/pmu_model.
If it says "Unknown", then there is nothing inserted.
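
From user space, the second check can be sketched as below. This is a
minimal sketch, assuming the file holds a single line and that "Unknown"
is the exact string reported when no module is inserted; the function name
pmu_model_loaded() is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Returns 1 if the file at path names a PMU model other than "Unknown",
 * 0 if it says "Unknown", and -1 if the file cannot be read.
 */
int pmu_model_loaded(const char *path)
{
    char model[128];
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    if (fgets(model, sizeof(model), fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    model[strcspn(model, "\n")] = '\0';   /* strip the trailing newline */
    return strcmp(model, "Unknown") != 0;
}
```

A caller would pass "/sys/kernel/perfmon/pmu_model" as the path. This only
helps a user-space tool such as opcontrol; it is not usable from inside a
kernel module.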

-- 
-Stephane

Re: Patches to get oprofile to work with perfmon2 on amd64

From: William C. <wc...@re...> - 2006-03-31 21:43:43

Stephane Eranian wrote:
> Will,
> 
> On Fri, Mar 31, 2006 at 03:18:32PM -0500, William Cohen wrote:
> 
>>I found that the request_module() function can be used to pull in the 
>>required code. This should eliminate the check in opcontrol. Stephane, 
> 
> 
> I have never looked at request_module() myself. That could be interesting.
> 
> 
>>do you have a suggestion on a check that a perfmon module, e.g. 
>>perfmon_amd, was successfully loaded? What wouldn't work before the 
>>machine specific performance monitoring hardware is loaded, but will 
>>work after it is loaded?
> 
> 
> Keep in mind that there can only be one PMU descriptor module inserted
> at a time. Modules are required to probe to check if they support the
> hardware. If they fail, insmod fails. When there is no module
> inserted, you cannot create any perfmon context with pfm_create_context().
> You'll get ENOSYS. Another way to check is to look in /sys/kernel/perfmon/pmu_model.
> If it says "Unknown", then there is nothing inserted.
> 

Yes, the code looks up which module to install based on the cpu 
information. There are a couple of gotchas in the current code. There isn't 
a driver for Athlons (perfmon only has perfmon_amd for 64-bit) and there 
isn't a distinction between p4 and em64t machine.

The detection method needs to be runnable in the kernel module. Isn't 
pfm_create_context() the system call from user space?

Would it be possible to read the information from pfm_pmu_conf? Do 
something like the following:

if ((pfm_pmu_conf != NULL) && (strcmp(pfm_pmu_conf->pmu_name, "Unknown") != 0)) {
	/* performance hardware support installed. */
}

-Will

Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-03-31 21:55:14

Will,

On Fri, Mar 31, 2006 at 04:41:18PM -0500, William Cohen wrote:
> >
> 
> Yes, the code looks up which module to install based on the cpu 
> information. There are a couple of gotchas in the current code. There isn't 
> a driver for Athlons (perfmon only has perfmon_amd for 64-bit) and there 
> isn't a distinction between p4 and em64t machine.
> 
There is no driver for Athlons at this point. As for P4 vs. EM64T, there is
indeed a check for EM64T and vice-versa. Look closely at the probe
routine of the two, you'll see that a test is reversed.

> The detection method needs to be runnable in the kernel module. Isn't 
> pfm_create_context() the system call from user space?
> 
Ah, yes that won't work.

> Would it be possible to read the information from pfm_pmu_conf? Do 
> something like the following:
> 
> if ((pfm_pmu_conf != NULL) && (strcmp(pfm_pmu_conf->pmu_name, "Unknown") != 0)) {
> 	/* performance hardware support installed. */
> }
> 
Are you doing this from the OProfile module or the perfmon_amd module?
For the former you need to grab the pfm_pmu_conf_lock spinlock to be safe.

-- 
-Stephane

Re: Patches to get oprofile to work with perfmon2 on amd64

From: William C. <wc...@re...> - 2006-03-31 22:04:42

Stephane Eranian wrote:
> Will,
> 
> On Fri, Mar 31, 2006 at 04:41:18PM -0500, William Cohen wrote:
> 
>>Yes, the code looks up which module to install based on the cpu 
>>information. There are a couple of gotchas in the current code. There isn't 
>>a driver for Athlons (perfmon only has perfmon_amd for 64-bit) and there 
>>isn't a distinction between p4 and em64t machine.
>>
> 
> There is no driver for Athlons at this point. As for P4 vs. EM64T, there is
> indeed a check for EM64T and vice-versa. Look closely at the probe
> routine of the two, you'll see that a test is reversed.

Is that check the only difference between the two? It seems kind of 
overboard to have such similar hardware handled by nearly duplicate 
pieces of code.
>
>>The detection method needs to be runnable in the kernel module. Isn't 
>>pfm_create_context() the system call from user space?
>>
> 
> Ah, yes that won't work.
> 
> 
>>Would it be possible to read the information from pfm_pmu_conf? Do 
>>something like the following:
>>
>>if ((pfm_pmu_conf != NULL) && (strcmp(pfm_pmu_conf->pmu_name, "Unknown") != 0)) {
>>	/* performance hardware support installed. */
>>}
>>
> 
> Are you doing this from the OProfile module or the perfmon_amd module?
> For the former you need to grab the pfm_pmu_conf_lock spinlock to be safe.
> 

Doing this from the oprofile module because it is the one actually 
requesting that a specific perfmon module be installed. Want to make 
sure that the module request was successful.

-Will

Re: Patches to get oprofile to work with perfmon2 on amd64

From: John L. <le...@mo...> - 2006-04-01 02:53:29

[snipping closed list]

On Fri, Mar 31, 2006 at 05:02:24PM -0500, William Cohen wrote:

> Doing this from the oprofile module because it is the one actually 
> requesting that a specific perfmon module be installed. Want to make 
> sure that the module request was successful.

I fail to understand why perfmon isn't doing all of this for us. It
knows about its modules, and it can certainly know what CPU we are on.

regards
john

Re: Patches to get oprofile to work with perfmon2 on amd64

From: Stephane E. <er...@hp...> - 2006-04-01 05:16:34

On Sat, Apr 01, 2006 at 03:53:20AM +0100, John Levon wrote:
> 
> [snipping closed list]
> 
> On Fri, Mar 31, 2006 at 05:02:24PM -0500, William Cohen wrote:
> 
> > Doing this from the oprofile module because it is the one actually 
> > requesting that a specific perfmon module be installed. Want to make 
> > sure that the module request was successful.
> 
> I fail to understand why perfmon isn't doing all of this for us. It
> knows about its modules, and it can certainly know what CPU we are on.
> 

Yes, perfmon can do this for you, two ways:

	- have all the PMU description modules compiled in. During kernel
	  boot they will all be called to probe the CPU. The first to succeed
	  gets control.

	- have an initscript to insert the right module
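
The first option (every description probes, the first success wins) can be
sketched as follows. All names here (toy_pmu_desc, toy_select_pmu(), the
toy_cpu enum, the probe routines) are hypothetical and are not perfmon2
code; the CPU identification is reduced to an enum for illustration.

```c
#include <stddef.h>

/* Stand-in for whatever CPU identification the kernel really uses. */
enum toy_cpu { TOY_CPU_AMD64, TOY_CPU_P4, TOY_CPU_OTHER };

/* Hypothetical PMU description: a name plus a probe routine. */
struct toy_pmu_desc {
    const char *name;
    int (*probe)(enum toy_cpu cpu);   /* returns 0 if the CPU is supported */
};

static int probe_amd64(enum toy_cpu cpu)
{
    return cpu == TOY_CPU_AMD64 ? 0 : -1;
}

static int probe_p4(enum toy_cpu cpu)
{
    return cpu == TOY_CPU_P4 ? 0 : -1;
}

/* All descriptions are compiled in; order decides ties. */
static const struct toy_pmu_desc toy_pmus[] = {
    { "toy_amd64", probe_amd64 },
    { "toy_p4",    probe_p4 },
};

/* Call each probe in turn; the first one to succeed gets control. */
const char *toy_select_pmu(enum toy_cpu cpu)
{
    size_t i;

    for (i = 0; i < sizeof(toy_pmus) / sizeof(toy_pmus[0]); i++) {
        if (toy_pmus[i].probe(cpu) == 0)
            return toy_pmus[i].name;
    }
    return NULL;   /* no description matched this CPU */
}
```

With this arrangement the selection needs no help from user space, at the
cost of carrying every description in the kernel image.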

-- 
-Stephane

Re: Patches to get oprofile to work with perfmon2 on amd64

From: John L. <le...@mo...> - 2006-04-01 13:48:51

On Fri, Mar 31, 2006 at 09:11:34PM -0800, Stephane Eranian wrote:

> 	- have an initscript to insert the right module

Nope still not getting it. Why is userspace getting involved at all in
deciding which module to use?

john
