From: Andreas K. <kr...@li...> - 2011-11-02 15:10:20
|
With this patch the OProfile Basic Mode Sampling support for System z is enhanced with a counter file system. That way hardware sampling can be enabled/disabled using different events (TIMER and HWSAMPLING). This patch breaks existing OProfile user space tools. The cpu_type will now indicate that the hardware sampling facility is available. Existing user space tools will complain about an unknown cpu type. If you encounter this problem please update your OProfile user space tools. In the counter file system only the values of 'event', 'kernel', and 'user' are evaluated by the kernel module. Everything else is either ignored or must contain fixed values. The 'kernel' and 'user' flags can now be used to filter for samples. Signed-off-by: Andreas Krebbel <kr...@li...> --- arch/s390/oprofile/hwsampler.c | 7 ++++++- arch/s390/oprofile/init.c | 37 +++++++++++++++++++++++++++++++++++-- arch/s390/oprofile/op_counter.h | 25 +++++++++++++++++++++++++ 3 files changed, 66 insertions(+), 3 deletions(-) --- a/arch/s390/oprofile/hwsampler.c +++ b/arch/s390/oprofile/hwsampler.c @@ -22,6 +22,7 @@ #include <asm/irq.h> #include "hwsampler.h" +#include "op_counter.h" #define MAX_NUM_SDB 511 #define MIN_NUM_SDB 1 @@ -896,6 +897,8 @@ static void add_samples_to_oprofile(unsi if (sample_data_ptr->P == 1) { /* userspace sample */ unsigned int pid = sample_data_ptr->prim_asn; + if (!counter_config.user) + goto skip_sample; rcu_read_lock(); tsk = pid_task(find_vpid(pid), PIDTYPE_PID); if (tsk) @@ -903,6 +906,8 @@ static void add_samples_to_oprofile(unsi rcu_read_unlock(); } else { /* kernelspace sample */ + if (!counter_config.kernel) + goto skip_sample; regs = task_pt_regs(current); } @@ -910,7 +915,7 @@ static void add_samples_to_oprofile(unsi oprofile_add_ext_hw_sample(sample_data_ptr->ia, regs, 0, !sample_data_ptr->P, tsk); mutex_unlock(&hws_sem); - + skip_sample: sample_data_ptr++; } } --- a/arch/s390/oprofile/init.c +++ b/arch/s390/oprofile/init.c @@ -22,6 +22,7 @@ extern void s390_backtrace(struct pt_reg #ifdef CONFIG_64BIT #include "hwsampler.h" +#include "op_counter.h" #define DEFAULT_INTERVAL 4127518 @@ -40,6 +41,8 @@ static int hwsampler_running; /* start_m static struct oprofile_operations timer_ops; +struct op_counter_config counter_config; + static int oprofile_hwsampler_start(void) { int retval; @@ -112,10 +115,15 @@ static const struct file_operations hwsa static int oprofile_create_hwsampling_files(struct super_block *sb, struct dentry *root) { - struct dentry *hw_dir; + struct dentry *hw_dir, *dir; /* reinitialize default values */ hwsampler_file = 1; + counter_config.count = 0; + counter_config.enabled = 1; + counter_config.unit_mask = 0; + counter_config.kernel = 1; + counter_config.user = 1; hw_dir = oprofilefs_mkdir(sb, root, "hwsampling"); if (!hw_dir) @@ -131,6 +139,30 @@ static int oprofile_create_hwsampling_fi oprofilefs_create_ulong(sb, hw_dir, "hw_sdbt_blocks", &oprofile_sdbt_blocks); + /* Create the counter file system. A single virtual counter + * is created which can be used to enable/disable hardware + * sampling dynamically from user space. The user space will + * configure a single counter with two events 0 and 1. Using + * event 0 enables use of timer based sampling while event 1 + * turns on hardware sampling. These values have to stay like + * this since the /dev/oprofile/0/event always has to match + * /dev/oprofile/hwsampling/hwsampler content in order to + * support the existing interface in parallel. Apart from + * 'event' only the 'kernel' and 'user' flags are evaluated by + * the kernel code. */ + + dir = oprofilefs_mkdir(sb, root, "0"); + if (!dir) + return -EINVAL; + + oprofilefs_create_ulong(sb, dir, "enabled", &counter_config.enabled); + oprofilefs_create_file(sb, dir, "event", &hwsampler_fops); + oprofilefs_create_ulong(sb, dir, "count", &counter_config.count); + oprofilefs_create_ulong(sb, dir, "unit_mask", &counter_config.unit_mask); + oprofilefs_create_ulong(sb, dir, "kernel", &counter_config.kernel); + oprofilefs_create_ulong(sb, dir, "user", &counter_config.user); + oprofilefs_create_ulong(sb, dir, "extra", &counter_config.extra); + return 0; } @@ -158,13 +190,14 @@ static int oprofile_hwsampler_init(struc if (oprofile_timer_init(ops)) return -ENODEV; - printk(KERN_INFO "oprofile: using hardware sampling\n"); + printk(KERN_INFO "OProfile: System z hardware sampling facility found.\n"); memcpy(&timer_ops, ops, sizeof(timer_ops)); ops->start = oprofile_hwsampler_start; ops->stop = oprofile_hwsampler_stop; ops->create_files = oprofile_create_hwsampling_files; + ops->cpu_type = "s390x/basic_mode_sampling_v1"; return 0; } --- /dev/null +++ b/arch/s390/oprofile/op_counter.h @@ -0,0 +1,25 @@ +/** + * arch/s390/oprofile/op_counter.h + * + * Copyright (C) 2011 IBM Deutschland Entwicklung GmbH, IBM Corporation + * Author(s): Andreas Krebbel (kr...@li...) + * + * @remark Copyright 2011 OProfile authors + */ + +#ifndef OP_COUNTER_H +#define OP_COUNTER_H + +struct op_counter_config { + /* `event' maps to the hwsampler_file variable */ + unsigned long count; /* unused */ + unsigned long enabled; /* unused */ + unsigned long kernel; + unsigned long user; + unsigned long unit_mask; /* always zero */ + unsigned long extra; /* unused */ +}; + +extern struct op_counter_config counter_config; + +#endif /* OP_COUNTER_H */ |
From: Andreas K. <kr...@li...> - 2011-11-03 11:56:16
|
With this patch the OProfile Basic Mode Sampling support for System z is enhanced with a counter file system. That way hardware sampling can be enabled/disabled using different events (TIMER and HWSAMPLING). This patch breaks existing OProfile user space tools. The cpu_type will now indicate that the hardware sampling facility is available. Existing user space tools will complain about an unknown cpu type. If you encounter this problem please update your OProfile user space tools. In the counter file system only the values of 'event', 'count', 'kernel', and 'user' are evaluated by the kernel module. Everything else is either ignored or must contain fixed values. The 'event' value is used to disable (0) or enabled (1) hardware sampling. The 'count' value specifies the hardware sampling rate as it is passed to the CPU measurement facility. The 'kernel' and 'user' flags can now be used to filter for samples. Signed-off-by: Andreas Krebbel <kr...@li...> --- arch/s390/oprofile/hwsampler.c | 7 ++++++- arch/s390/oprofile/init.c | 36 ++++++++++++++++++++++++++++++++++-- arch/s390/oprofile/op_counter.h | 25 +++++++++++++++++++++++++ 3 files changed, 65 insertions(+), 3 deletions(-) --- a/arch/s390/oprofile/hwsampler.c +++ b/arch/s390/oprofile/hwsampler.c @@ -22,6 +22,7 @@ #include <asm/irq.h> #include "hwsampler.h" +#include "op_counter.h" #define MAX_NUM_SDB 511 #define MIN_NUM_SDB 1 @@ -896,6 +897,8 @@ static void add_samples_to_oprofile(unsi if (sample_data_ptr->P == 1) { /* userspace sample */ unsigned int pid = sample_data_ptr->prim_asn; + if (!counter_config.user) + goto skip_sample; rcu_read_lock(); tsk = pid_task(find_vpid(pid), PIDTYPE_PID); if (tsk) @@ -903,6 +906,8 @@ static void add_samples_to_oprofile(unsi rcu_read_unlock(); } else { /* kernelspace sample */ + if (!counter_config.kernel) + goto skip_sample; regs = task_pt_regs(current); } @@ -910,7 +915,7 @@ static void add_samples_to_oprofile(unsi oprofile_add_ext_hw_sample(sample_data_ptr->ia, regs, 0, !sample_data_ptr->P, tsk); mutex_unlock(&hws_sem); - + skip_sample: sample_data_ptr++; } } --- a/arch/s390/oprofile/init.c +++ b/arch/s390/oprofile/init.c @@ -22,6 +22,7 @@ extern void s390_backtrace(struct pt_reg #ifdef CONFIG_64BIT #include "hwsampler.h" +#include "op_counter.h" #define DEFAULT_INTERVAL 4127518 @@ -40,6 +41,8 @@ static int hwsampler_running; /* start_m static struct oprofile_operations timer_ops; +struct op_counter_config counter_config; + static int oprofile_hwsampler_start(void) { int retval; @@ -112,10 +115,14 @@ static const struct file_operations hwsa static int oprofile_create_hwsampling_files(struct super_block *sb, struct dentry *root) { - struct dentry *hw_dir; + struct dentry *hw_dir, *dir; /* reinitialize default values */ hwsampler_file = 1; + counter_config.enabled = 1; + counter_config.unit_mask = 0; + counter_config.kernel = 1; + counter_config.user = 1; hw_dir = oprofilefs_mkdir(sb, root, "hwsampling"); if (!hw_dir) @@ -131,6 +138,30 @@ static int oprofile_create_hwsampling_fi oprofilefs_create_ulong(sb, hw_dir, "hw_sdbt_blocks", &oprofile_sdbt_blocks); + /* Create the counter file system. A single virtual counter + * is created which can be used to enable/disable hardware + * sampling dynamically from user space. The user space will + * configure a single counter with two events 0 and 1. Using + * event 0 enables use of timer based sampling while event 1 + * turns on hardware sampling. These values have to stay like + * this since the /dev/oprofile/0/event always has to match + * /dev/oprofile/hwsampling/hwsampler content in order to + * support the existing interface in parallel. Only the + * 'event', 'count', 'kernel', and 'user' flags are evaluated + * by the kernel code. */ + + dir = oprofilefs_mkdir(sb, root, "0"); + if (!dir) + return -EINVAL; + + oprofilefs_create_ulong(sb, dir, "enabled", &counter_config.enabled); + oprofilefs_create_file(sb, dir, "event", &hwsampler_fops); + oprofilefs_create_ulong(sb, dir, "count", &oprofile_hw_interval); + oprofilefs_create_ulong(sb, dir, "unit_mask", &counter_config.unit_mask); + oprofilefs_create_ulong(sb, dir, "kernel", &counter_config.kernel); + oprofilefs_create_ulong(sb, dir, "user", &counter_config.user); + oprofilefs_create_ulong(sb, dir, "extra", &counter_config.extra); + return 0; } @@ -158,13 +189,14 @@ static int oprofile_hwsampler_init(struc if (oprofile_timer_init(ops)) return -ENODEV; - printk(KERN_INFO "oprofile: using hardware sampling\n"); + printk(KERN_INFO "OProfile: System z hardware sampling facility found.\n"); memcpy(&timer_ops, ops, sizeof(timer_ops)); ops->start = oprofile_hwsampler_start; ops->stop = oprofile_hwsampler_stop; ops->create_files = oprofile_create_hwsampling_files; + ops->cpu_type = "s390x/basic_mode_sampling_v1"; return 0; } --- /dev/null +++ b/arch/s390/oprofile/op_counter.h @@ -0,0 +1,25 @@ +/** + * arch/s390/oprofile/op_counter.h + * + * Copyright (C) 2011 IBM Deutschland Entwicklung GmbH, IBM Corporation + * Author(s): Andreas Krebbel (kr...@li...) + * + * @remark Copyright 2011 OProfile authors + */ + +#ifndef OP_COUNTER_H +#define OP_COUNTER_H + +struct op_counter_config { + /* `event' maps to the hwsampler_file variable */ + /* `count' maps to the oprofile_hw_interval variable */ + unsigned long enabled; /* unused */ + unsigned long kernel; + unsigned long user; + unsigned long unit_mask; /* always zero */ + unsigned long extra; /* unused */ +}; + +extern struct op_counter_config counter_config; + +#endif /* OP_COUNTER_H */ |
From: Robert R. <rob...@am...> - 2011-11-03 13:08:30
|
On 03.11.11 07:56:03, Andreas Krebbel wrote: > With this patch the OProfile Basic Mode Sampling support for System z > is enhanced with a counter file system. That way hardware sampling > can be enabled/disabled using different events (TIMER and HWSAMPLING). > > This patch breaks existing OProfile user space tools. The cpu_type > will now indicate that the hardware sampling facility is available. > Existing user space tools will complain about an unknown cpu type. If > you encounter this problem please update your OProfile user space > tools. > > In the counter file system only the values of 'event', 'count', > 'kernel', and 'user' are evaluated by the kernel module. Everything > else is either ignored or must contain fixed values. > > The 'event' value is used to disable (0) or enabled (1) hardware > sampling. > > The 'count' value specifies the hardware sampling rate as it is passed > to the CPU measurement facility. > > The 'kernel' and 'user' flags can now be used to filter for samples. > > Signed-off-by: Andreas Krebbel <kr...@li...> > --- > arch/s390/oprofile/hwsampler.c | 7 ++++++- > arch/s390/oprofile/init.c | 36 ++++++++++++++++++++++++++++++++++-- > arch/s390/oprofile/op_counter.h | 25 +++++++++++++++++++++++++ > 3 files changed, 65 insertions(+), 3 deletions(-) What is the difference to your previous post? -Robert -- Advanced Micro Devices, Inc. Operating System Research Center |
From: Andreas K. <kr...@li...> - 2011-11-03 13:55:55
|
> What is the difference to your previous post? With the new version the 'count' value of the oprofile event is used as input for oprofile_hw_interval. So it is basically just that line: + oprofilefs_create_ulong(sb, dir, "count", &oprofile_hw_interval); -Andreas- |
From: Andreas K. <kr...@li...> - 2011-11-08 17:49:27
|
Hi Robert, did you have time to look at the kernel changes? I would like to get this into 3.3. Bye, -Andreas- |
From: Robert R. <rob...@am...> - 2011-11-10 13:31:04
|
On 08.11.11 18:48:17, Andreas Krebbel wrote: > did you have time to look at the kernel changes? I would like to get this into 3.3. Yes, there is some time for 3.3. Will look at it. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center |
From: Robert R. <rob...@am...> - 2011-11-15 09:47:10
|
On 02.11.11 11:10:06, Andreas Krebbel wrote: > With this patch the OProfile Basic Mode Sampling support for System z > is enhanced with a counter file system. That way hardware sampling > can be enabled/disabled using different events (TIMER and HWSAMPLING). > > This patch breaks existing OProfile user space tools. The cpu_type > will now indicate that the hardware sampling facility is available. > Existing user space tools will complain about an unknown cpu type. If > you encounter this problem please update your OProfile user space > tools. You should add a kernel parameter for backward compatibility. See below. > > In the counter file system only the values of 'event', 'kernel', and > 'user' are evaluated by the kernel module. Everything else is either > ignored or must contain fixed values. > > The 'kernel' and 'user' flags can now be used to filter for samples. Does this work for both modes (TIMER and HWSAMPLING)? > @@ -131,6 +139,30 @@ static int oprofile_create_hwsampling_fi > oprofilefs_create_ulong(sb, hw_dir, "hw_sdbt_blocks", > &oprofile_sdbt_blocks); > > + /* Create the counter file system. A single virtual counter > + * is created which can be used to enable/disable hardware > + * sampling dynamically from user space. The user space will > + * configure a single counter with two events 0 and 1. Using > + * event 0 enables use of timer based sampling while event 1 > + * turns on hardware sampling. These values have to stay like > + * this since the /dev/oprofile/0/event always has to match > + * /dev/oprofile/hwsampling/hwsampler content in order to > + * support the existing interface in parallel. Apart from > + * 'event' only the 'kernel' and 'user' flags are evaluated by > + * the kernel code. */ In your commit message you write that this change is not backward compatible, here it looks like you support the existing hwsampler interface too. What do you mean? I guess /dev/oprofile/hwsampling/ becomes obsolete if "s390x/basic_mode_sampling_v1" is reported. Please fix coding style for multiline comments: /* * <comment> */ > + > + dir = oprofilefs_mkdir(sb, root, "0"); > + if (!dir) > + return -EINVAL; > + > + oprofilefs_create_ulong(sb, dir, "enabled", &counter_config.enabled); > + oprofilefs_create_file(sb, dir, "event", &hwsampler_fops); > + oprofilefs_create_ulong(sb, dir, "count", &counter_config.count); > + oprofilefs_create_ulong(sb, dir, "unit_mask", &counter_config.unit_mask); > + oprofilefs_create_ulong(sb, dir, "kernel", &counter_config.kernel); > + oprofilefs_create_ulong(sb, dir, "user", &counter_config.user); > + oprofilefs_create_ulong(sb, dir, "extra", &counter_config.extra); What is the reason for creating entries that are unused? Please show and describe the new layout of oprofilefs in your commit message. > + > return 0; > } > > @@ -158,13 +190,14 @@ static int oprofile_hwsampler_init(struc > if (oprofile_timer_init(ops)) > return -ENODEV; > > - printk(KERN_INFO "oprofile: using hardware sampling\n"); > + printk(KERN_INFO "OProfile: System z hardware sampling facility found.\n"); Use 'oprofile:' like in other printks. > > memcpy(&timer_ops, ops, sizeof(timer_ops)); > > ops->start = oprofile_hwsampler_start; > ops->stop = oprofile_hwsampler_stop; > ops->create_files = oprofile_create_hwsampling_files; > + ops->cpu_type = "s390x/basic_mode_sampling_v1"; Why are you using "s390x" instead of "s390"? Also, the long name looks ugly. Any better idea? You could consider to use a kernel parameter to support "timer" cpu type for backward compatibility, something like oprofile.cpu_type=timer (or possibly oprofile.timer=1) For x86 this switch already exists. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center |
From: Andreas K. <kr...@li...> - 2011-11-15 17:10:45
|
On 11/15/2011 10:31 AM, Robert Richter wrote: > On 02.11.11 11:10:06, Andreas Krebbel wrote: >> >> The 'kernel' and 'user' flags can now be used to filter for samples. > > Does this work for both modes (TIMER and HWSAMPLING)? No. This only works for HWSAMPLING. > >> @@ -131,6 +139,30 @@ static int oprofile_create_hwsampling_fi >> oprofilefs_create_ulong(sb, hw_dir, "hw_sdbt_blocks", >> &oprofile_sdbt_blocks); >> >> + /* Create the counter file system. A single virtual counter >> + * is created which can be used to enable/disable hardware >> + * sampling dynamically from user space. The user space will >> + * configure a single counter with two events 0 and 1. Using >> + * event 0 enables use of timer based sampling while event 1 >> + * turns on hardware sampling. These values have to stay like >> + * this since the /dev/oprofile/0/event always has to match >> + * /dev/oprofile/hwsampling/hwsampler content in order to >> + * support the existing interface in parallel. Apart from >> + * 'event' only the 'kernel' and 'user' flags are evaluated by >> + * the kernel code. */ > > In your commit message you write that this change is not backward > compatible, here it looks like you support the existing hwsampler > interface too. What do you mean? I guess /dev/oprofile/hwsampling/ > becomes obsolete if "s390x/basic_mode_sampling_v1" is reported. There are two compatibility aspects involved: 1.) With the current patch the existing OProfile user space will not work with hardware sampling being enabled in the machine since the tools do not know about the s390x/basic_mode_sampling_v1 cpu type. So with that regard the patch breaks compatibilty. But as I understand it this would be resolved by adding the backward compatibility module parameter which forces the module to always return "timer" as cpu type. So this would be a non-issue then. 2.) The new filesystem entries for the event interface co-exist with the old filesystem. The user can still use the echo ... > /dev/oprofile/hwsampling/... commands he is used to. So with the current patch the module stays compatible to the existing interface. >> + >> + dir = oprofilefs_mkdir(sb, root, "0"); >> + if (!dir) >> + return -EINVAL; >> + >> + oprofilefs_create_ulong(sb, dir, "enabled", &counter_config.enabled); >> + oprofilefs_create_file(sb, dir, "event", &hwsampler_fops); >> + oprofilefs_create_ulong(sb, dir, "count", &counter_config.count); >> + oprofilefs_create_ulong(sb, dir, "unit_mask", &counter_config.unit_mask); >> + oprofilefs_create_ulong(sb, dir, "kernel", &counter_config.kernel); >> + oprofilefs_create_ulong(sb, dir, "user", &counter_config.user); >> + oprofilefs_create_ulong(sb, dir, "extra", &counter_config.extra); > > What is the reason for creating entries that are unused? We are somewhat abusing the oprofile event file system in order to allow configuring and enabling of the hwsampling facility. That's the reason not all of the files are actually used. However, the user space seems to expect their existence. Providing dummy files helps avoiding special cases in the user space tools by making it think we have a full blown event interface. > > Please show and describe the new layout of oprofilefs in your commit > message. Ok will do. >> + >> return 0; >> } >> >> @@ -158,13 +190,14 @@ static int oprofile_hwsampler_init(struc >> if (oprofile_timer_init(ops)) >> return -ENODEV; >> >> - printk(KERN_INFO "oprofile: using hardware sampling\n"); >> + printk(KERN_INFO "OProfile: System z hardware sampling facility found.\n"); > > Use 'oprofile:' like in other printks. > >> >> memcpy(&timer_ops, ops, sizeof(timer_ops)); >> >> ops->start = oprofile_hwsampler_start; >> ops->stop = oprofile_hwsampler_stop; >> ops->create_files = oprofile_create_hwsampling_files; >> + ops->cpu_type = "s390x/basic_mode_sampling_v1"; > > Why are you using "s390x" instead of "s390"? Also, the long name looks > ugly. Any better idea? The hardware sampling is only available in z/Architecture mode and s390x stands for a S/390 machine running in that mode. Mmmh I've no really better idea for the long name. The sampling mode is called "basic mode sampling" in the hardware documentation and I thought that name should match it. And I've appended a '_v1' in case that mode will be changed in the future. > > You could consider to use a kernel parameter to support "timer" cpu > type for backward compatibility, something like > > oprofile.cpu_type=timer (or possibly oprofile.timer=1) > > For x86 this switch already exists. Ok. I'll add this. Bye, -Andreas- |
From: Robert R. <rob...@am...> - 2011-11-16 21:42:17
|
On 15.11.11 18:07:23, Andreas Krebbel wrote: > On 11/15/2011 10:31 AM, Robert Richter wrote: > > On 02.11.11 11:10:06, Andreas Krebbel wrote: > >> > >> The 'kernel' and 'user' flags can now be used to filter for samples. > > > > Does this work for both modes (TIMER and HWSAMPLING)? > > No. This only works for HWSAMPLING. The daemon may not set the flags then in timer mode. I can't remember how it is implemented in x86, but either we need to check the flags and throw an -EINVAL or at least write back the actually used values so that the daemon/user may read it. > There are two compatibility aspects involved: > > 1.) With the current patch the existing OProfile user space will not work with hardware > sampling being enabled in the machine since the tools do not know about the > s390x/basic_mode_sampling_v1 cpu type. So with that regard the patch breaks compatibilty. > But as I understand it this would be resolved by adding the backward compatibility module > parameter which forces the module to always return "timer" as cpu type. So this would be a > non-issue then. > > 2.) The new filesystem entries for the event interface co-exist with the old filesystem. > The user can still use the echo ... > /dev/oprofile/hwsampling/... commands he is used to. > So with the current patch the module stays compatible to the existing interface. Can't we drop /dev/oprofile/hwsampling/ and only support it with "timer" forced by a kernel parameter to be backward compatible with older oprofile daemons? Would make live easier. > >> + oprofilefs_create_ulong(sb, dir, "enabled", &counter_config.enabled); > >> + oprofilefs_create_file(sb, dir, "event", &hwsampler_fops); > >> + oprofilefs_create_ulong(sb, dir, "count", &counter_config.count); > >> + oprofilefs_create_ulong(sb, dir, "unit_mask", &counter_config.unit_mask); > >> + oprofilefs_create_ulong(sb, dir, "kernel", &counter_config.kernel); > >> + oprofilefs_create_ulong(sb, dir, "user", &counter_config.user); > >> + oprofilefs_create_ulong(sb, dir, "extra", &counter_config.extra); > > > > What is the reason for creating entries that are unused? > > We are somewhat abusing the oprofile event file system in order to allow configuring and > enabling of the hwsampling facility. That's the reason not all of the files are actually > used. However, the user space seems to expect their existence. Providing dummy files helps > avoiding special cases in the user space tools by making it think we have a full blown > event interface. If possible I would rather prefer to not have it at all, but if the daemon requires it anyway we should at least only allow certain values to be used. The kernel should check this. In the end this would be only unit_mask and extra which should be set to 0 or so. A check is required also for kernel and user flags, so it shouldn't be a big deal to implement this for all. > >> + ops->cpu_type = "s390x/basic_mode_sampling_v1"; > > > > Why are you using "s390x" instead of "s390"? Also, the long name looks > > ugly. Any better idea? > > The hardware sampling is only available in z/Architecture mode and s390x stands for a > S/390 machine running in that mode. > > Mmmh I've no really better idea for the long name. The sampling mode is called "basic mode > sampling" in the hardware documentation and I thought that name should match it. And I've > appended a '_v1' in case that mode will be changed in the future. Usually we take <arch>/<model> where <arch> is as uname reports it, but this is not always consistent. Doing so it would be "s390/systemz", but not sure if this would fit here. Various supported features could be detected by the existence of file/directories in oprofilefs. This could be used if there is a new version or so of hardware sampling. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center |
From: Andreas K. <kr...@li...> - 2011-11-17 19:35:22
|
On 11/16/2011 10:41 PM, Robert Richter wrote: > Can't we drop /dev/oprofile/hwsampling/ and only support it with > "timer" forced by a kernel parameter to be backward compatible with > older oprofile daemons? Would make live easier. No. The /dev/oprofile/hwsampling file system is needed also with the new interface. The user space tools need the /dev/oprofile/hwsampling/hw_min|max_interval values in order to do proper parameter checking. These values can only be provided by the kernel module. > If possible I would rather prefer to not have it at all, but if the > daemon requires it anyway we should at least only allow certain values > to be used. The kernel should check this. In the end this would be > only unit_mask and extra which should be set to 0 or so. A check is > required also for kernel and user flags, so it shouldn't be a big > deal to implement this for all. Ok done. I've also removed the `extra' file. This does not seem to be needed. > Usually we take <arch>/<model> where <arch> is as uname reports it, > but this is not always consistent. Doing so it would be > "s390/systemz", but not sure if this would fit here. This would be "s390x/z10" then. Unfortunately the availability does not only depend on the CPU level. It also depends on the millicode-level and the machine configuration. The feature also is the same way available on the z196 and this probably stays like this for a while. Bye, -Andreas- |
From: Andreas K. <kr...@li...> - 2011-11-17 19:37:33
|
From: Andreas Krebbel <kr...@li...> With this patch the OProfile Basic Mode Sampling support for System z is enhanced with a counter file system. That way hardware sampling can be enabled/disabled using different events (TIMER and HWSAMPLING). With the patch by default a new cpu_type is returned in order to indicate that the hardware sampling facility is available. Existing user space tools will complain about an unknown cpu type. In order to be compatible with existing user space tools the `force_timer_cputype' module parameter has been added. Setting the parameter to 1 will force the module to return `timer' as cpu_type. The module will still try to use hardware sampling if available and the hwsampling virtual filesystem will be also be available for configuration. So this has a different effect than using the generic oprofile module parameter `timer=1'. The following new files are provided as part of the OProfile virtual filesystem: /dev/oprofile/0/enabled /dev/oprofile/0/event /dev/oprofile/0/count /dev/oprofile/0/unit_mask /dev/oprofile/0/kernel /dev/oprofile/0/user In the counter file system only the values of 'event', 'count', 'kernel', and 'user' are evaluated by the kernel module. Everything else must contain fixed values. The 'event' value is used to disable (0) or enable (1) hardware sampling. The 'count' value specifies the hardware sampling rate as it is passed to the CPU measurement facility. The 'kernel' and 'user' flags can now be used to filter for samples when using hardware sampling. Signed-off-by: Andreas Krebbel <kr...@li...> --- arch/s390/oprofile/hwsampler.c | 7 + arch/s390/oprofile/init.c | 240 +++++++++++++++++++++++++++++++++++++++- arch/s390/oprofile/op_counter.h | 22 +++ 3 files changed, 263 insertions(+), 6 deletions(-) --- a/arch/s390/oprofile/hwsampler.c +++ b/arch/s390/oprofile/hwsampler.c @@ -22,6 +22,7 @@ #include <asm/irq.h> #include "hwsampler.h" +#include "op_counter.h" #define MAX_NUM_SDB 511 #define MIN_NUM_SDB 1 @@ -896,6 +897,8 @@ static void add_samples_to_oprofile(unsi if (sample_data_ptr->P == 1) { /* userspace sample */ unsigned int pid = sample_data_ptr->prim_asn; + if (!counter_config.user) + goto skip_sample; rcu_read_lock(); tsk = pid_task(find_vpid(pid), PIDTYPE_PID); if (tsk) @@ -903,6 +906,8 @@ static void add_samples_to_oprofile(unsi rcu_read_unlock(); } else { /* kernelspace sample */ + if (!counter_config.kernel) + goto skip_sample; regs = task_pt_regs(current); } @@ -910,7 +915,7 @@ static void add_samples_to_oprofile(unsi oprofile_add_ext_hw_sample(sample_data_ptr->ia, regs, 0, !sample_data_ptr->P, tsk); mutex_unlock(&hws_sem); - + skip_sample: sample_data_ptr++; } } --- a/arch/s390/oprofile/init.c +++ b/arch/s390/oprofile/init.c @@ -2,7 +2,7 @@ * arch/s390/oprofile/init.c * * S390 Version - * Copyright (C) 2003 IBM Deutschland Entwicklung GmbH, IBM Corporation + * Copyright (C) 2003-2011 IBM Deutschland Entwicklung GmbH, IBM Corporation * Author(s): Thomas Spatzier (ts...@de...) * Author(s): Mahesh Salgaonkar (ma...@li...) * Author(s): Heinz Graalfs (gr...@li...) @@ -14,6 +14,7 @@ #include <linux/init.h> #include <linux/errno.h> #include <linux/fs.h> +#include <linux/module.h> #include "../../../drivers/oprofile/oprof.h" @@ -22,6 +23,7 @@ extern void s390_backtrace(struct pt_reg #ifdef CONFIG_64BIT #include "hwsampler.h" +#include "op_counter.h" #define DEFAULT_INTERVAL 4127518 @@ -38,8 +40,15 @@ static unsigned long oprofile_sdb_blocks static int hwsampler_file; static int hwsampler_running; /* start_mutex must be held to change */ +static int force_timer_cputype = 0; +module_param(force_timer_cputype, int, 0); +MODULE_PARM_DESC(force_timer_cputype, "1 forces the oprofile kernel module " + "to return \"timer\" as cpu_type."); + static struct oprofile_operations timer_ops; +struct op_counter_config counter_config; + static int oprofile_hwsampler_start(void) { int retval; @@ -72,6 +81,8 @@ static void oprofile_hwsampler_stop(void return; } +/* /dev/oprofile/hwsampling/hwsampler file ops. */ + static ssize_t hwsampler_read(struct file *file, char __user *buf, size_t count, loff_t *offset) { @@ -109,21 +120,205 @@ static const struct file_operations hwsa .write = hwsampler_write, }; +/* + * /dev/oprofile/hwsampling/hw_interval file ops. + * Check that the value is within the hardware range. + */ + +static ssize_t hw_interval_read(struct file *file, char __user *buf, + size_t count, loff_t *offset) +{ + return oprofilefs_ulong_to_user(oprofile_hw_interval, buf, + count, offset); +} + +static ssize_t hw_interval_write(struct file *file, char const __user *buf, + size_t count, loff_t *offset) +{ + unsigned long val; + int retval; + + if (*offset) + return -EINVAL; + retval = oprofilefs_ulong_from_user(&val, buf, count); + if (retval) + return retval; + if (val < oprofile_min_interval || val > oprofile_max_interval) + return -EINVAL; + oprofile_hw_interval = val; + + return count; +} + +static const struct file_operations hw_interval_fops = { + .read = hw_interval_read, + .write = hw_interval_write, +}; + +/* + * /dev/oprofile/0/enabled file ops. + * This is a dummy file needed by the user space tools. + * No value other than 1 is accepted or returned. + */ + +static ssize_t hwsampler_enabled_read(struct file *file, char __user *buf, + size_t count, loff_t *offset) +{ + return oprofilefs_ulong_to_user(1, buf, count, offset); +} + +static ssize_t hwsampler_enabled_write(struct file *file, char const __user *buf, + size_t count, loff_t *offset) +{ + unsigned long val; + int retval; + + if (*offset) + return -EINVAL; + + retval = oprofilefs_ulong_from_user(&val, buf, count); + if (retval) + return retval; + if (val != 1) + return -EINVAL; + return count; +} + +static const struct file_operations enabled_fops = { + .read = hwsampler_enabled_read, + .write = hwsampler_enabled_write, +}; + +/* + * /dev/oprofile/0/unit_mask file ops. + * This is a dummy file needed by the user space tools. + * No value other than 0 is accepted or returned. + */ + +static ssize_t hwsampler_unit_mask_read(struct file *file, char __user *buf, + size_t count, loff_t *offset) +{ + return oprofilefs_ulong_to_user(0, buf, count, offset); +} + +static ssize_t hwsampler_unit_mask_write(struct file *file, + char const __user *buf, + size_t count, loff_t *offset) +{ + unsigned long val; + int retval; + + if (*offset) + return -EINVAL; + + retval = oprofilefs_ulong_from_user(&val, buf, count); + if (retval) + return retval; + if (val != 0) + return -EINVAL; + return count; +} + +static const struct file_operations unit_mask_fops = { + .read = hwsampler_unit_mask_read, + .write = hwsampler_unit_mask_write, +}; + +/* + * /dev/oprofile/0/kernel file ops. + * This file can only be set to 0 for hardware sampling mode. + */ + +static ssize_t hwsampler_kernel_read(struct file *file, char __user *buf, + size_t count, loff_t *offset) +{ + return oprofilefs_ulong_to_user(counter_config.kernel, + buf, count, offset); +} + +static ssize_t hwsampler_kernel_write(struct file *file, char const __user *buf, + size_t count, loff_t *offset) +{ + unsigned long val; + int retval; + + if (*offset) + return -EINVAL; + + retval = oprofilefs_ulong_from_user(&val, buf, count); + if (retval) + return retval; + + /* No filtering with timer based sampling. */ + if (!hwsampler_file && val == 0) + return -EINVAL; + + counter_config.kernel = val; + + return count; +} + +static const struct file_operations kernel_fops = { + .read = hwsampler_kernel_read, + .write = hwsampler_kernel_write, +}; + +/* + * /dev/oprofile/0/user file ops. + * This file can only be set to 0 for hardware sampling mode. + */ + +static ssize_t hwsampler_user_read(struct file *file, char __user *buf, + size_t count, loff_t *offset) +{ + return oprofilefs_ulong_to_user(counter_config.user, + buf, count, offset); +} + +static ssize_t hwsampler_user_write(struct file *file, char const __user *buf, + size_t count, loff_t *offset) +{ + unsigned long val; + int retval; + + if (*offset) + return -EINVAL; + + retval = oprofilefs_ulong_from_user(&val, buf, count); + if (retval) + return retval; + + /* No filtering with timer based sampling. */ + if (!hwsampler_file && val == 0) + return -EINVAL; + + counter_config.user = val; + + return count; +} + +static const struct file_operations user_fops = { + .read = hwsampler_user_read, + .write = hwsampler_user_write, +}; + + static int oprofile_create_hwsampling_files(struct super_block *sb, struct dentry *root) { - struct dentry *hw_dir; + struct dentry *hw_dir, *dir; /* reinitialize default values */ hwsampler_file = 1; + counter_config.kernel = 1; + counter_config.user = 1; hw_dir = oprofilefs_mkdir(sb, root, "hwsampling"); if (!hw_dir) return -EINVAL; oprofilefs_create_file(sb, hw_dir, "hwsampler", &hwsampler_fops); - oprofilefs_create_ulong(sb, hw_dir, "hw_interval", - &oprofile_hw_interval); + oprofilefs_create_file(sb, hw_dir, "hw_interval", &hw_interval_fops); oprofilefs_create_ro_ulong(sb, hw_dir, "hw_min_interval", &oprofile_min_interval); oprofilefs_create_ro_ulong(sb, hw_dir, "hw_max_interval", @@ -131,6 +326,31 @@ static int oprofile_create_hwsampling_fi oprofilefs_create_ulong(sb, hw_dir, "hw_sdbt_blocks", &oprofile_sdbt_blocks); + /* + * Create the counter file system. A single virtual counter + * is created which can be used to enable/disable hardware + * sampling dynamically from user space. The user space will + * configure a single counter with two events 0 and 1. Using + * event 0 enables use of timer based sampling while event 1 + * turns on hardware sampling. These values have to stay like + * this since the /dev/oprofile/0/event always has to match + * /dev/oprofile/hwsampling/hwsampler content in order to + * support the existing interface in parallel. Only the + * 'event', 'count', 'kernel', and 'user' flags are evaluated + * by the kernel code. + */ + + dir = oprofilefs_mkdir(sb, root, "0"); + if (!dir) + return -EINVAL; + + oprofilefs_create_file(sb, dir, "enabled", &enabled_fops); + oprofilefs_create_file(sb, dir, "event", &hwsampler_fops); + oprofilefs_create_file(sb, dir, "count", &hw_interval_fops); + oprofilefs_create_file(sb, dir, "unit_mask", &unit_mask_fops); + oprofilefs_create_file(sb, dir, "kernel", &kernel_fops); + oprofilefs_create_file(sb, dir, "user", &user_fops); + return 0; } @@ -158,7 +378,7 @@ static int oprofile_hwsampler_init(struc if (oprofile_timer_init(ops)) return -ENODEV; - printk(KERN_INFO "oprofile: using hardware sampling\n"); + printk(KERN_INFO "oprofile: System z hardware sampling facility found.\n"); memcpy(&timer_ops, ops, sizeof(timer_ops)); @@ -166,6 +386,16 @@ static int oprofile_hwsampler_init(struc ops->stop = oprofile_hwsampler_stop; ops->create_files = oprofile_create_hwsampling_files; + /* + * If the user space tools do not support the hardware + * sampling cpu type, the force_timer_cputype module parameter + * can be used to always return \"timer\" as cpu type. + */ + if (force_timer_cputype) { + printk(KERN_INFO "oprofile: cpu_type forced to \"timer\"\n"); + ops->cpu_type = "timer"; + } else + ops->cpu_type = "s390x/basic_mode_sampling_v1"; return 0; } --- /dev/null +++ b/arch/s390/oprofile/op_counter.h @@ -0,0 +1,22 @@ +/** + * arch/s390/oprofile/op_counter.h + * + * Copyright (C) 2011 IBM Deutschland Entwicklung GmbH, IBM Corporation + * Author(s): Andreas Krebbel (kr...@li...) + * + * @remark Copyright 2011 OProfile authors + */ + +#ifndef OP_COUNTER_H +#define OP_COUNTER_H + +struct op_counter_config { + /* `event' maps to the hwsampler_file variable */ + /* `count' maps to the oprofile_hw_interval variable */ + unsigned long kernel; + unsigned long user; +}; + +extern struct op_counter_config counter_config; + +#endif /* OP_COUNTER_H */ |
From: Robert R. <rob...@am...> - 2011-11-21 16:01:23
|
CC'ing Maynard and Suravee, please look esp. at my comments about ops->cpu_type = "s390x/basic_mode_sampling_v1"; and how userland should determine min/max value for the sampling period. On 17.11.11 20:37:23, Andreas Krebbel wrote: > From: Andreas Krebbel <kr...@li...> > Existing > user space tools will complain about an unknown cpu type. In order to > be compatible with existing user space tools the `force_timer_cputype' > module parameter has been added. Setting the parameter to 1 will > force the module to return `timer' as cpu_type. The module will still > try to use hardware sampling if available and the hwsampling virtual > filesystem will be also be available for configuration. So this has a > different effect than using the generic oprofile module parameter > `timer=1'. Please use oprofile.cpu_type=timer here and also update kernel-parameters.txt, e.g.: oprofile.cpu_type= Force an oprofile cpu type This might be useful if you have an older oprofile userland or if you want common events. Format: { arch_perfmon } arch_perfmon: [X86] Force use of architectural perfmon on Intel CPUs instead of the CPU specific event set. timer: [X86] Force use of architectural NMI timer mode (see also oprofile.timer for generic hr timer mode) [s390] Force legacy basic mode sampling (report cpu_type "timer") There are already changes, to avoid conflicts, please use: git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git core > The following new files are provided as part of the OProfile virtual > filesystem: > > /dev/oprofile/0/enabled > /dev/oprofile/0/event > /dev/oprofile/0/count > /dev/oprofile/0/unit_mask > /dev/oprofile/0/kernel > /dev/oprofile/0/user > > In the counter file system only the values of 'event', 'count', > 'kernel', and 'user' are evaluated by the kernel module. Everything > else must contain fixed values. > > The 'event' value is used to disable (0) or enable (1) hardware > sampling. > > The 'count' value specifies the hardware sampling rate as it is passed > to the CPU measurement facility. > > The 'kernel' and 'user' flags can now be used to filter for samples > when using hardware sampling. The meaning of the flags is inverted to have 0 as defaults, see also below. > > Signed-off-by: Andreas Krebbel <kr...@li...> > --- > arch/s390/oprofile/hwsampler.c | 7 + > arch/s390/oprofile/init.c | 240 +++++++++++++++++++++++++++++++++++++++- > arch/s390/oprofile/op_counter.h | 22 +++ > 3 files changed, 263 insertions(+), 6 deletions(-) > > --- a/arch/s390/oprofile/hwsampler.c > +++ b/arch/s390/oprofile/hwsampler.c > @@ -22,6 +22,7 @@ > #include <asm/irq.h> > > #include "hwsampler.h" > +#include "op_counter.h" > > #define MAX_NUM_SDB 511 > #define MIN_NUM_SDB 1 > @@ -896,6 +897,8 @@ static void add_samples_to_oprofile(unsi > if (sample_data_ptr->P == 1) { > /* userspace sample */ > unsigned int pid = sample_data_ptr->prim_asn; > + if (!counter_config.user) > + goto skip_sample; > rcu_read_lock(); > tsk = pid_task(find_vpid(pid), PIDTYPE_PID); > if (tsk) > @@ -903,6 +906,8 @@ static void add_samples_to_oprofile(unsi > rcu_read_unlock(); > } else { > /* kernelspace sample */ > + if (!counter_config.kernel) > + goto skip_sample; As said, counter_config.kernel means not to profile the kernel. Same for .../user. > regs = task_pt_regs(current); > } > > @@ -910,7 +915,7 @@ static void add_samples_to_oprofile(unsi > oprofile_add_ext_hw_sample(sample_data_ptr->ia, regs, 0, > !sample_data_ptr->P, tsk); > mutex_unlock(&hws_sem); > - > + skip_sample: > sample_data_ptr++; > } > } > --- a/arch/s390/oprofile/init.c > +++ b/arch/s390/oprofile/init.c > @@ -2,7 +2,7 @@ > * arch/s390/oprofile/init.c > * > * S390 Version > - * Copyright (C) 2003 IBM Deutschland Entwicklung GmbH, IBM Corporation > + * Copyright (C) 2003-2011 IBM Deutschland Entwicklung GmbH, IBM Corporation > * Author(s): Thomas Spatzier (ts...@de...) > * Author(s): Mahesh Salgaonkar (ma...@li...) > * Author(s): Heinz Graalfs (gr...@li...) > @@ -14,6 +14,7 @@ > #include <linux/init.h> > #include <linux/errno.h> > #include <linux/fs.h> > +#include <linux/module.h> > > #include "../../../drivers/oprofile/oprof.h" > > @@ -22,6 +23,7 @@ extern void s390_backtrace(struct pt_reg > #ifdef CONFIG_64BIT > > #include "hwsampler.h" > +#include "op_counter.h" > > #define DEFAULT_INTERVAL 4127518 > > @@ -38,8 +40,15 @@ static unsigned long oprofile_sdb_blocks > static int hwsampler_file; > static int hwsampler_running; /* start_mutex must be held to change */ > > +static int force_timer_cputype = 0; > +module_param(force_timer_cputype, int, 0); > +MODULE_PARM_DESC(force_timer_cputype, "1 forces the oprofile kernel module " > + "to return \"timer\" as cpu_type."); oprofile.cpu_type=timer > + > static struct oprofile_operations timer_ops; > > +struct op_counter_config counter_config; > + > static int oprofile_hwsampler_start(void) > { > int retval; > @@ -72,6 +81,8 @@ static void oprofile_hwsampler_stop(void > return; > } > > +/* /dev/oprofile/hwsampling/hwsampler file ops. */ > + > static ssize_t hwsampler_read(struct file *file, char __user *buf, > size_t count, loff_t *offset) > { > @@ -109,21 +120,205 @@ static const struct file_operations hwsa > .write = hwsampler_write, This function should only allow the values 0 or 1 to be able to extend this. Currently other values may be passed. > }; > > +/* > + * /dev/oprofile/hwsampling/hw_interval file ops. > + * Check that the value is within the hardware range. > + */ > + > +static ssize_t hw_interval_read(struct file *file, char __user *buf, > + size_t count, loff_t *offset) > +{ > + return oprofilefs_ulong_to_user(oprofile_hw_interval, buf, > + count, offset); > +} > + > +static ssize_t hw_interval_write(struct file *file, char const __user *buf, > + size_t count, loff_t *offset) > +{ > + unsigned long val; > + int retval; > + > + if (*offset) > + return -EINVAL; > + retval = oprofilefs_ulong_from_user(&val, buf, count); > + if (retval) > + return retval; > + if (val < oprofile_min_interval || val > oprofile_max_interval) > + return -EINVAL; Instead of returning -EINVAL you should set the min/max value here and assign it to oprofile_hw_interval. Userland then should read the value back and fail if they are different. Doing so we no longer need /dev/oprofile/hwsampling/. > + oprofile_hw_interval = val; > + > + return count; > +} > +static ssize_t hwsampler_kernel_read(struct file *file, char __user *buf, > + size_t count, loff_t *offset) > +{ > + return oprofilefs_ulong_to_user(counter_config.kernel, > + buf, count, offset); Must return 0 for !hwsampler_file. > +} > + > +static ssize_t hwsampler_kernel_write(struct file *file, char const __user *buf, > + size_t count, loff_t *offset) > +{ > + unsigned long val; > + int retval; > + > + if (*offset) > + return -EINVAL; > + > + retval = oprofilefs_ulong_from_user(&val, buf, count); > + if (retval) > + return retval; > + > + /* No filtering with timer based sampling. */ > + if (!hwsampler_file && val == 0) > + return -EINVAL; > + > + counter_config.kernel = val; Only allow 0 and 1 here. > + > + return count; > +} > + > +static const struct file_operations kernel_fops = { > + .read = hwsampler_kernel_read, > + .write = hwsampler_kernel_write, > +}; > + > +/* > + * /dev/oprofile/0/user file ops. > + * This file can only be set to 0 for hardware sampling mode. > + */ > + > +static ssize_t hwsampler_user_read(struct file *file, char __user *buf, > + size_t count, loff_t *offset) > +{ > + return oprofilefs_ulong_to_user(counter_config.user, > + buf, count, offset); Should return 0 for !hwsampler_file. > +} > + > +static ssize_t hwsampler_user_write(struct file *file, char const __user *buf, > + size_t count, loff_t *offset) > +{ > + unsigned long val; > + int retval; > + > + if (*offset) > + return -EINVAL; > + > + retval = oprofilefs_ulong_from_user(&val, buf, count); > + if (retval) > + return retval; > + > + /* No filtering with timer based sampling. */ > + if (!hwsampler_file && val == 0) > + return -EINVAL; > + > + counter_config.user = val; Same here, allow only 0 and 1 values. > + > + return count; > +} > + > +static const struct file_operations user_fops = { > + .read = hwsampler_user_read, > + .write = hwsampler_user_write, > +}; > + > + > static int oprofile_create_hwsampling_files(struct super_block *sb, > struct dentry *root) > { > - struct dentry *hw_dir; > + struct dentry *hw_dir, *dir; > > /* reinitialize default values */ > hwsampler_file = 1; > + counter_config.kernel = 1; > + counter_config.user = 1; Both parameters usually mean 'exclude_user/exclude_kernel' to have the defaults 0. Thus, bits must be inverted in the code. [...] > + ops->cpu_type = "s390x/basic_mode_sampling_v1"; We want to keep the cpu_type naming scheme of oprofile consistent which is s390/<model> "s390" is the arch string in Linux. Please choose a model name. We want to support multiple features with the interface and thus can't use feature names here. This would break then. Userland should check for available system features by checking for files in oprofilefs which wouldn't exist if the feature is missing. So if s390/<model> is reported you could rely on the existence of basic mode sampling if /dev/oprofile/0 exists. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center |
From: Andreas K. <kr...@li...> - 2011-11-21 16:29:06
|
On 11/21/2011 05:01 PM, Robert Richter wrote: > Instead of returning -EINVAL you should set the min/max value here and > assign it to oprofile_hw_interval. Userland then should read the value > back and fail if they are different. > > Doing so we no longer need /dev/oprofile/hwsampling/. We currently also need it for setting the hardware sampling buffer size by writing to /dev/oprofile/hwsampling/hw_sdbt_blocks. Should I move this file to /dev/oprofile/0/ ? -Andreas- |
From: Robert R. <rob...@am...> - 2011-11-21 17:22:21
|
On 21.11.11 17:25:00, Andreas Krebbel wrote: > On 11/21/2011 05:01 PM, Robert Richter wrote: > > Instead of returning -EINVAL you should set the min/max value here and > > assign it to oprofile_hw_interval. Userland then should read the value > > back and fail if they are different. > > > > Doing so we no longer need /dev/oprofile/hwsampling/. > > We currently also need it for setting the hardware sampling buffer size by writing to > /dev/oprofile/hwsampling/hw_sdbt_blocks. Should I move this file to /dev/oprofile/0/ ? Yeah, no problem with additional files there. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center |
From: Andreas K. <kr...@li...> - 2011-11-22 10:24:23
|
On 11/21/2011 05:01 PM, Robert Richter wrote: >> + ops->cpu_type = "s390x/basic_mode_sampling_v1"; > > We want to keep the cpu_type naming scheme of oprofile consistent > which is > > s390/<model> > > "s390" is the arch string in Linux. Please choose a model name. > > We want to support multiple features with the interface and thus can't > use feature names here. This would break then. Userland should check > for available system features by checking for files in oprofilefs > which wouldn't exist if the feature is missing. > > So if s390/<model> is reported you could rely on the existence of > basic mode sampling if /dev/oprofile/0 exists. As far as I understand it this wouldn't be possible without a major uglification on the user space part. On s390 we currently have two cpus out there (s390/z10 and s390/z196) which do support the cpu measurement facility which in turn provides basic mode sampling and hardware counters (while saying this s390/cpumf probably would have been a much better name in the first place). Unfortunately there seems to be a lot of code which assumes an unconditional coupling between the cpu_type and the number of available events and counters. I see two ways to deal with the S/390 scheme considering the kernel module returns s390/z10 and s390/z196 depending on the machine level: Alternative 1 I could change libop/op_cpu_type.c to switch to CPU_TIMER_INT if cpu_type is "s390/z10" or "s390/z196" but /dev/oprofile/0 does not exists. This way we could get away with only minor changes but how we would add new features to these cpu levels (e.g. support hardware counters)? We would then need new cpu types: s390/z10a and s390/z196a for hwsampling + new feature s390/z10b and s390/z196b for !hwsampling, new feature And this mess would go on with a new CPU and/or new features arriving. Alternative 2 We could make every cpu_type check to also look for the availability of files in /dev/oprofile but there seem to be a lot of these. It would start with libop/op_cpu_type.c. The functions returning the cpu strings and the number of counters would have to check if /dev/oprofile/0 really exists in case the cpu type is CPU_S390_... The text files describing the events and their unit masks are located by appending the string in cpu_type to the events directory. This would have to change as well. If /dev/oprofile/0 is not there the files need to be ignored. There are several places where cpu_type is directly compared to CPU_TIMER_INT. For S/390 this then would need to something like: if (cpu_type == CPU_TIMER_INT || (cpu_type == CPU_S390_... && ! exist "/dev/oprofile/0")) (Alternative 3 Rearrange how OProfile deals with CPU levels. Instead of only identifying the CPU itself a vector of available facilities should be transported. Quite similiar to what the kernel is already doing in /proc/cpuinfo with things like sse, mmx etc. . But this would be a major rework which I'm not able to cover.) While I understand that the cpu_type logic matches other CPUs I really do not see how S/390 fits in there smoothly. The problem is that currently the CPU level is the only information used to identify the capabilities what prevents single features from being enabled/disabled dynamically. On S/390 all OProfile should ever be interested in is the version number of the cpu measurement facility and this to my opinion can be expressed by something like s390/cpumf. I'm aware that adding support for a new feature also would require adding new cpu types but at least the number does not also multiply with the number of CPU levels since the features will usually continue to exists with the next cpu level. Any suggestions how to continue with that? - and sorry for the bulky email :( -Andreas- |
From: Suthikulpanit, S. <Sur...@am...> - 2011-11-22 17:15:26
|
Andreas, My assumption here is the performance counter on the S390 system is pretty much the same as other architectures (i.e. require enabled, count, event, unit_mask, kernel, user). It has 1 counter and 1 event and 0 unit_mask. Also, if I understand your interface correctly, these are the requirements: 1. If the CPU doesn't support the performance measurement facility (cpumf), it would default to "timer" mode. 2. If the CPU support cpumf, it should be able to choose between using the "timer" or "cpumf". What you are proposing is basically the ability to do "timer" and "hardware-PMC" profile without having to reboot the kernel with different kernel argument to force the "timer" mode. If this is correct, I would like to suggest making this as another general enhancement to OProfile. Oprofile typically uses "timer" as a back-up mode when the CPU is not supported by the driver. This is a generic feature which other architecture should also be able to use as well. We have had users trying to use hardware event i.e. CPU_CLK_UNHALTED to mimic the time-based measurement. This has become issue when processors can scale the clock frequency with different workload. In this case, we could introduce the following interface: 1. The driver would introduce /dev/oprofile/timer/{enabled, period, kernel, user} virtual filesystem. This will be available regardless of the value in /dev/oprofile/cpu_type. 2. If /dev/oprofile/cpu_type is "timer" (i.e. the kernel doesn't support the CPU or cpumf is not available), the /dev/oprofile/{0,1,2....} will not be available. The /dev/oprofile/timer will be used. 3. The events file should only list the events which are "hw-counter-based". Typically, it require the "counters" fileld. And we should try to maintain this. 4. This will simplify the logic for cheking/enabling CPU_TIMER_INT mode and maintain the scheme we have been using for /dev/oprofile/cpu_type. Suravee -----Original Message----- From: Andreas Krebbel [mailto:kr...@li...] Sent: Tuesday, November 22, 2011 4:24 AM To: Richter, Robert Cc: opr...@li...; Suthikulpanit, Suravee; Maynard Johnson Subject: Re: [PATCH] S/390: Add event interface to the System z hardware On 11/21/2011 05:01 PM, Robert Richter wrote: >> + ops->cpu_type = "s390x/basic_mode_sampling_v1"; > > We want to keep the cpu_type naming scheme of oprofile consistent > which is > > s390/<model> > > "s390" is the arch string in Linux. Please choose a model name. > > We want to support multiple features with the interface and thus can't > use feature names here. This would break then. Userland should check > for available system features by checking for files in oprofilefs > which wouldn't exist if the feature is missing. > > So if s390/<model> is reported you could rely on the existence of > basic mode sampling if /dev/oprofile/0 exists. As far as I understand it this wouldn't be possible without a major uglification on the user space part. On s390 we currently have two cpus out there (s390/z10 and s390/z196) which do support the cpu measurement facility which in turn provides basic mode sampling and hardware counters (while saying this s390/cpumf probably would have been a much better name in the first place). Unfortunately there seems to be a lot of code which assumes an unconditional coupling between the cpu_type and the number of available events and counters. I see two ways to deal with the S/390 scheme considering the kernel module returns s390/z10 and s390/z196 depending on the machine level: Alternative 1 I could change libop/op_cpu_type.c to switch to CPU_TIMER_INT if cpu_type is "s390/z10" or "s390/z196" but /dev/oprofile/0 does not exists. This way we could get away with only minor changes but how we would add new features to these cpu levels (e.g. support hardware counters)? We would then need new cpu types: s390/z10a and s390/z196a for hwsampling + new feature s390/z10b and s390/z196b for !hwsampling, new feature And this mess would go on with a new CPU and/or new features arriving. Alternative 2 We could make every cpu_type check to also look for the availability of files in /dev/oprofile but there seem to be a lot of these. It would start with libop/op_cpu_type.c. The functions returning the cpu strings and the number of counters would have to check if /dev/oprofile/0 really exists in case the cpu type is CPU_S390_... The text files describing the events and their unit masks are located by appending the string in cpu_type to the events directory. This would have to change as well. If /dev/oprofile/0 is not there the files need to be ignored. There are several places where cpu_type is directly compared to CPU_TIMER_INT. For S/390 this then would need to something like: if (cpu_type == CPU_TIMER_INT || (cpu_type == CPU_S390_... && ! exist "/dev/oprofile/0")) (Alternative 3 Rearrange how OProfile deals with CPU levels. Instead of only identifying the CPU itself a vector of available facilities should be transported. Quite similiar to what the kernel is already doing in /proc/cpuinfo with things like sse, mmx etc. . But this would be a major rework which I'm not able to cover.) While I understand that the cpu_type logic matches other CPUs I really do not see how S/390 fits in there smoothly. The problem is that currently the CPU level is the only information used to identify the capabilities what prevents single features from being enabled/disabled dynamically. On S/390 all OProfile should ever be interested in is the version number of the cpu measurement facility and this to my opinion can be expressed by something like s390/cpumf. I'm aware that adding support for a new feature also would require adding new cpu types but at least the number does not also multiply with the number of CPU levels since the features will usually continue to exists with the next cpu level. Any suggestions how to continue with that? - and sorry for the bulky email :( -Andreas- |
From: Robert R. <rob...@am...> - 2011-11-22 17:57:55
|
On 22.11.11 18:15:11, Suthikulpanit, Suravee wrote: > My assumption here is the performance counter on the S390 system is > pretty much the same as other architectures (i.e. require enabled, > count, event, unit_mask, kernel, user). It has 1 counter and 1 > event and 0 unit_mask. Also, if I understand your interface > correctly, these are the requirements: > 1. If the CPU doesn't support the performance measurement facility > (cpumf), it would default to "timer" mode. > 2. If the CPU support cpumf, it should be able to choose between > using the "timer" or "cpumf". > > What you are proposing is basically the ability to do "timer" and > "hardware-PMC" profile without having to reboot the kernel with > different kernel argument to force the "timer" mode. If this is > correct, I would like to suggest making this as another general > enhancement to OProfile. Oprofile typically uses "timer" as a > back-up mode when the CPU is not supported by the driver. This is a > generic feature which other architecture should also be able to use > as well. We have had users trying to use hardware event > i.e. CPU_CLK_UNHALTED to mimic the time-based measurement. This has > become issue when processors can scale the clock frequency with > different workload. > > In this case, we could introduce the following interface: > 1. The driver would introduce /dev/oprofile/timer/{enabled, period, > kernel, user} virtual filesystem. This will be available > regardless of the value in /dev/oprofile/cpu_type. > 2. If /dev/oprofile/cpu_type is "timer" (i.e. the kernel doesn't > support the CPU or cpumf is not available), the > /dev/oprofile/{0,1,2....} will not be available. The > /dev/oprofile/timer will be used. > 3. The events file should only list the events which are > "hw-counter-based". Typically, it require the "counters" fileld. > And we should try to maintain this. > 4. This will simplify the logic for cheking/enabling CPU_TIMER_INT > mode and maintain the scheme we have been using for > /dev/oprofile/cpu_type. Suravee, thanks for coming up with this idea. This solution solves several issues we tried to 'hack' around when implementing a proper kernel/user interface for the above. It also would solve issues other users came up recently like adjusting the sampling period in timer mode. >From my current view on it the above wouldn't change much in the latest post of the kernel implementation for s390 basic mode sampling (cpumf). Basically only the event selection must be dropped for cpumf mode. For basic timer mode functionality we only need: /dev/oprofile/timer/enabled All remaining files for timer mode could be added later. We need locking to only allow either timer or cpumf mode to be enabled. I admit the above requires some more changes in the daemon, but I think the effort spent is worth the changes in favour of a clean interface that other archs can benefit too. To summarize the proposed file layout for cpufm, it would look like the following: /dev/oprofile/0/enabled # enable cpumf, timer must be 0 to enable /dev/oprofile/0/event # 0 /dev/oprofile/0/count # sample period /dev/oprofile/0/unit_mask # 0 /dev/oprofile/0/kernel # exclude kernel samples /dev/oprofile/0/user # exclude user samples /dev/oproifle/timer/enabled # enable timer, cpumf must be 0 to enable /dev/oproifle/cpu_type # report s390/{z10,z196} /dev/oprofile/... # existing standard files (unchanged) If cpumf does not exist on the system /dev/oprofile/0/ also would not exist or fall back to /dev/oproifle/cpu_type=timer. Sounds resonable. Unfortunatly Maynard may not give us his opinion here before end of November. But I would start going that direction anyway. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center |
From: Andreas K. <kr...@li...> - 2011-11-23 08:46:37
|
On 11/22/2011 06:57 PM, Robert Richter wrote: > From my current view on it the above wouldn't change much in the > latest post of the kernel implementation for s390 basic mode sampling > (cpumf). Basically only the event selection must be dropped for cpumf > mode. For basic timer mode functionality we only need: > > /dev/oprofile/timer/enabled > > All remaining files for timer mode could be added later. > > We need locking to only allow either timer or cpumf mode to be > enabled. Since the intention is to support this on all oprofile platforms anyway wouldn't this belong into the oprofile common code so that the platform kernel modules don't have to care about it? This probably would mean that there has to be an interface for preventing other counters from being enabled while timer/enabled is '1'. Perhaps we could chose the semantics of the enabled attributes clever to reduce the necessary synchronization to a minimum: 1.) An enabled hardware counter always overrules the timer mode. So the enabled attribute of a hardware counter could be '1' while timer/enabled is '1' as well. But the hw counter will always win in that case. Using timer mode would then require to write '0' to all hw counter enabled attributes and write '1' to timer/enabled 2.) Modifying the enabled attributes could also modify other enabled attributes. Writing to an enabled attribute of a hw counter could set timer/enabled to '0' and setting timer/enabled to '1' would set all enabled attributes of hw counter to '0'. What do you think? Bye, -Andreas- > > We need locking to only allow either timer or cpumf mode to be > enabled. > > I admit the above requires some more changes in the daemon, but I > think the effort spent is worth the changes in favour of a clean > interface that other archs can benefit too. > > To summarize the proposed file layout for cpufm, it would look like > the following: > > /dev/oprofile/0/enabled # enable cpumf, timer must be 0 to enable > /dev/oprofile/0/event # 0 > /dev/oprofile/0/count # sample period > /dev/oprofile/0/unit_mask # 0 > /dev/oprofile/0/kernel # exclude kernel samples > /dev/oprofile/0/user # exclude user samples > /dev/oproifle/timer/enabled # enable timer, cpumf must be 0 to enable > /dev/oproifle/cpu_type # report s390/{z10,z196} > /dev/oprofile/... # existing standard files (unchanged) > > If cpumf does not exist on the system /dev/oprofile/0/ also would not > exist or fall back to /dev/oproifle/cpu_type=timer. > > Sounds resonable. > > Unfortunatly Maynard may not give us his opinion here before end of > November. But I would start going that direction anyway. > > -Robert > |
From: Andreas K. <kr...@li...> - 2011-11-23 08:34:02
|
Suravee, being able to switch to timer mode dynamically is one point I was trying to address with my patches. It worked well with the counter/event interface so far but I agree that there should be a way for all platforms to do this and your idea looks like a good way to achieve that - Thanks! However, the major concern from my last email remains I think - but is more related to the user space part. There is still the tight correlation of the number of counters and events to the cpu type all over the user space code which is inappropriate for how CPUMF works on s390. Instead of having a fixed table (libop/op_cpu_type.c) which maps a cpu type to a number of available counters the code should enumerate the /dev/oprofile/0..x/ files in order to get the number of available counters. This is necessary to have a "s390/z10" cpu_type without /dev/oprofile/0/... files. For S/390 it isn't needed but perhaps someone also would like to extend this to the availability of events. Currently the available events are determined by a textfile which is chosen based on the cpu_type string. For full generality the user space should be able to enumerate the available events. Since for CPUMF the only available counter would simply disappear if the facility isn't available this should be no issue here. These issues are more related to the user space part. So I agree with Robert that this will cause quite some work there. I'll try to come up with a kernel patch implementing the interface you described. But I'm not sure how far I'll get with reworking the user space components. Bye, -Andreas- On 11/22/2011 06:15 PM, Suthikulpanit, Suravee wrote: > Andreas, > > My assumption here is the performance counter on the S390 system is pretty much the same as other architectures (i.e. require enabled, count, event, unit_mask, kernel, user). It has 1 counter and 1 event and 0 unit_mask. Also, if I understand your interface correctly, these are the requirements: > 1. If the CPU doesn't support the performance measurement facility (cpumf), it would default to "timer" mode. > 2. If the CPU support cpumf, it should be able to choose between using the "timer" or "cpumf". > > What you are proposing is basically the ability to do "timer" and "hardware-PMC" profile without having to reboot the kernel with different kernel argument to force the "timer" mode. If this is correct, I would like to suggest making this as another general enhancement to OProfile. Oprofile typically uses "timer" as a back-up mode when the CPU is not supported by the driver. This is a generic feature which other architecture should also be able to use as well. We have had users trying to use hardware event i.e. CPU_CLK_UNHALTED to mimic the time-based measurement. This has become issue when processors can scale the clock frequency with different workload. > > In this case, we could introduce the following interface: > 1. The driver would introduce /dev/oprofile/timer/{enabled, period, kernel, user} virtual filesystem. This will be available regardless of the value in /dev/oprofile/cpu_type. > 2. If /dev/oprofile/cpu_type is "timer" (i.e. the kernel doesn't support the CPU or cpumf is not available), the /dev/oprofile/{0,1,2....} will not be available. The /dev/oprofile/timer will be used. > 3. The events file should only list the events which are "hw-counter-based". Typically, it require the "counters" fileld. And we should try to maintain this. > 4. This will simplify the logic for cheking/enabling CPU_TIMER_INT mode and maintain the scheme we have been using for /dev/oprofile/cpu_type. > > Suravee > > > -----Original Message----- > From: Andreas Krebbel [mailto:kr...@li...] > Sent: Tuesday, November 22, 2011 4:24 AM > To: Richter, Robert > Cc: opr...@li...; Suthikulpanit, Suravee; Maynard Johnson > Subject: Re: [PATCH] S/390: Add event interface to the System z hardware > > On 11/21/2011 05:01 PM, Robert Richter wrote: >>> + ops->cpu_type = "s390x/basic_mode_sampling_v1"; >> >> We want to keep the cpu_type naming scheme of oprofile consistent >> which is >> >> s390/<model> >> >> "s390" is the arch string in Linux. Please choose a model name. >> >> We want to support multiple features with the interface and thus can't >> use feature names here. This would break then. Userland should check >> for available system features by checking for files in oprofilefs >> which wouldn't exist if the feature is missing. >> >> So if s390/<model> is reported you could rely on the existence of >> basic mode sampling if /dev/oprofile/0 exists. > > As far as I understand it this wouldn't be possible without a major uglification on the user space part. > > On s390 we currently have two cpus out there (s390/z10 and s390/z196) which do support the cpu measurement facility which in turn provides basic mode sampling and hardware counters (while saying this s390/cpumf probably would have been a much better name in the first place). > > Unfortunately there seems to be a lot of code which assumes an unconditional coupling between the cpu_type and the number of available events and counters. I see two ways to deal with the S/390 scheme considering the kernel module returns s390/z10 and s390/z196 depending on the machine level: > > Alternative 1 > > I could change libop/op_cpu_type.c to switch to CPU_TIMER_INT if cpu_type is "s390/z10" or "s390/z196" but /dev/oprofile/0 does not exists. This way we could get away with only minor changes but how we would add new features to these cpu levels (e.g. support hardware counters)? We would then need new cpu types: > s390/z10a and s390/z196a for hwsampling + new feature s390/z10b and s390/z196b for !hwsampling, new feature > > And this mess would go on with a new CPU and/or new features arriving. > > > Alternative 2 > > We could make every cpu_type check to also look for the availability of files in /dev/oprofile but there seem to be a lot of these. > > It would start with libop/op_cpu_type.c. The functions returning the cpu strings and the number of counters would have to check if /dev/oprofile/0 really exists in case the cpu type is CPU_S390_... > > The text files describing the events and their unit masks are located by appending the string in cpu_type to the events directory. This would have to change as well. If > /dev/oprofile/0 is not there the files need to be ignored. > > There are several places where cpu_type is directly compared to CPU_TIMER_INT. For S/390 this then would need to something like: > if (cpu_type == CPU_TIMER_INT || (cpu_type == CPU_S390_... && ! exist "/dev/oprofile/0")) > > > (Alternative 3 > > Rearrange how OProfile deals with CPU levels. Instead of only identifying the CPU itself a vector of available facilities should be transported. Quite similiar to what the kernel is already doing in /proc/cpuinfo with things like sse, mmx etc. . > > But this would be a major rework which I'm not able to cover.) > > > > While I understand that the cpu_type logic matches other CPUs I really do not see how > S/390 fits in there smoothly. The problem is that currently the CPU level is the only information used to identify the capabilities what prevents single features from being enabled/disabled dynamically. > > On S/390 all OProfile should ever be interested in is the version number of the cpu measurement facility and this to my opinion can be expressed by something like s390/cpumf. > I'm aware that adding support for a new feature also would require adding new cpu types but at least the number does not also multiply with the number of CPU levels since the features will usually continue to exists with the next cpu level. > > > > Any suggestions how to continue with that? - and sorry for the bulky email :( > > -Andreas- > > > |
From: Suthikulpanit, S. <Sur...@am...> - 2011-11-23 09:20:43
|
Andreas, If the S390 doesn't need the "/dev/oprofile/0", what kind of interface does it need in the driver ? What type of performance data the cpumf generate? Actually, you don't need to use the events file and /dev/oprofile/0 if you don't want to. These are designed mainly to support the architectures with multiple performance counters and performance events. If S390's cpufm doesn't fit this model, it should define a different interface for this. For the OProfile user-space, another possible approach might be to use the OProfile extended interface which was added to OProfile when we introduce the AMD Instruction-based sampling (IBS). This interface was design so that we could add features which are non-traditional performance counters architecture. The IBS uses this OProfile extended interface (please see daemon/opd_extended.[h,c] and daemon/opd_ibs.[h,c] for more information). Also, for architecture which supports IBS, the driver added /dev/oprofile/ibs_fetch/[enable, max_count, ...] and /dev/oprofile/ibs_op/[enable, max_count ....]. These are IBS-specific configuration interface. In this case, we could probably use the "/dev/oprofile/cpumf/....". Suravee -----Original Message----- From: Andreas Krebbel [mailto:kr...@li...] Sent: Wednesday, November 23, 2011 2:34 AM To: Suthikulpanit, Suravee Cc: Richter, Robert; opr...@li...; Maynard Johnson Subject: Re: [PATCH] S/390: Add event interface to the System z hardware Suravee, being able to switch to timer mode dynamically is one point I was trying to address with my patches. It worked well with the counter/event interface so far but I agree that there should be a way for all platforms to do this and your idea looks like a good way to achieve that - Thanks! However, the major concern from my last email remains I think - but is more related to the user space part. There is still the tight correlation of the number of counters and events to the cpu type all over the user space code which is inappropriate for how CPUMF works on s390. Instead of having a fixed table (libop/op_cpu_type.c) which maps a cpu type to a number of available counters the code should enumerate the /dev/oprofile/0..x/ files in order to get the number of available counters. This is necessary to have a "s390/z10" cpu_type without /dev/oprofile/0/... files. For S/390 it isn't needed but perhaps someone also would like to extend this to the availability of events. Currently the available events are determined by a textfile which is chosen based on the cpu_type string. For full generality the user space should be able to enumerate the available events. Since for CPUMF the only available counter would simply disappear if the facility isn't available this should be no issue here. These issues are more related to the user space part. So I agree with Robert that this will cause quite some work there. I'll try to come up with a kernel patch implementing the interface you described. But I'm not sure how far I'll get with reworking the user space components. Bye, -Andreas- On 11/22/2011 06:15 PM, Suthikulpanit, Suravee wrote: > Andreas, > > My assumption here is the performance counter on the S390 system is pretty much the same as other architectures (i.e. require enabled, count, event, unit_mask, kernel, user). It has 1 counter and 1 event and 0 unit_mask. Also, if I understand your interface correctly, these are the requirements: > 1. If the CPU doesn't support the performance measurement facility (cpumf), it would default to "timer" mode. > 2. If the CPU support cpumf, it should be able to choose between using the "timer" or "cpumf". > > What you are proposing is basically the ability to do "timer" and "hardware-PMC" profile without having to reboot the kernel with different kernel argument to force the "timer" mode. If this is correct, I would like to suggest making this as another general enhancement to OProfile. Oprofile typically uses "timer" as a back-up mode when the CPU is not supported by the driver. This is a generic feature which other architecture should also be able to use as well. We have had users trying to use hardware event i.e. CPU_CLK_UNHALTED to mimic the time-based measurement. This has become issue when processors can scale the clock frequency with different workload. > > In this case, we could introduce the following interface: > 1. The driver would introduce /dev/oprofile/timer/{enabled, period, kernel, user} virtual filesystem. This will be available regardless of the value in /dev/oprofile/cpu_type. > 2. If /dev/oprofile/cpu_type is "timer" (i.e. the kernel doesn't support the CPU or cpumf is not available), the /dev/oprofile/{0,1,2....} will not be available. The /dev/oprofile/timer will be used. > 3. The events file should only list the events which are "hw-counter-based". Typically, it require the "counters" fileld. And we should try to maintain this. > 4. This will simplify the logic for cheking/enabling CPU_TIMER_INT mode and maintain the scheme we have been using for /dev/oprofile/cpu_type. > > Suravee > > > -----Original Message----- > From: Andreas Krebbel [mailto:kr...@li...] > Sent: Tuesday, November 22, 2011 4:24 AM > To: Richter, Robert > Cc: opr...@li...; Suthikulpanit, Suravee; Maynard > Johnson > Subject: Re: [PATCH] S/390: Add event interface to the System z > hardware > > On 11/21/2011 05:01 PM, Robert Richter wrote: >>> + ops->cpu_type = "s390x/basic_mode_sampling_v1"; >> >> We want to keep the cpu_type naming scheme of oprofile consistent >> which is >> >> s390/<model> >> >> "s390" is the arch string in Linux. Please choose a model name. >> >> We want to support multiple features with the interface and thus >> can't use feature names here. This would break then. Userland should >> check for available system features by checking for files in >> oprofilefs which wouldn't exist if the feature is missing. >> >> So if s390/<model> is reported you could rely on the existence of >> basic mode sampling if /dev/oprofile/0 exists. > > As far as I understand it this wouldn't be possible without a major uglification on the user space part. > > On s390 we currently have two cpus out there (s390/z10 and s390/z196) which do support the cpu measurement facility which in turn provides basic mode sampling and hardware counters (while saying this s390/cpumf probably would have been a much better name in the first place). > > Unfortunately there seems to be a lot of code which assumes an unconditional coupling between the cpu_type and the number of available events and counters. I see two ways to deal with the S/390 scheme considering the kernel module returns s390/z10 and s390/z196 depending on the machine level: > > Alternative 1 > > I could change libop/op_cpu_type.c to switch to CPU_TIMER_INT if cpu_type is "s390/z10" or "s390/z196" but /dev/oprofile/0 does not exists. This way we could get away with only minor changes but how we would add new features to these cpu levels (e.g. support hardware counters)? We would then need new cpu types: > s390/z10a and s390/z196a for hwsampling + new feature s390/z10b and > s390/z196b for !hwsampling, new feature > > And this mess would go on with a new CPU and/or new features arriving. > > > Alternative 2 > > We could make every cpu_type check to also look for the availability of files in /dev/oprofile but there seem to be a lot of these. > > It would start with libop/op_cpu_type.c. The functions returning the cpu strings and the number of counters would have to check if /dev/oprofile/0 really exists in case the cpu type is CPU_S390_... > > The text files describing the events and their unit masks are located > by appending the string in cpu_type to the events directory. This > would have to change as well. If > /dev/oprofile/0 is not there the files need to be ignored. > > There are several places where cpu_type is directly compared to CPU_TIMER_INT. For S/390 this then would need to something like: > if (cpu_type == CPU_TIMER_INT || (cpu_type == CPU_S390_... && ! exist > "/dev/oprofile/0")) > > > (Alternative 3 > > Rearrange how OProfile deals with CPU levels. Instead of only identifying the CPU itself a vector of available facilities should be transported. Quite similiar to what the kernel is already doing in /proc/cpuinfo with things like sse, mmx etc. . > > But this would be a major rework which I'm not able to cover.) > > > > While I understand that the cpu_type logic matches other CPUs I really > do not see how > S/390 fits in there smoothly. The problem is that currently the CPU level is the only information used to identify the capabilities what prevents single features from being enabled/disabled dynamically. > > On S/390 all OProfile should ever be interested in is the version number of the cpu measurement facility and this to my opinion can be expressed by something like s390/cpumf. > I'm aware that adding support for a new feature also would require adding new cpu types but at least the number does not also multiply with the number of CPU levels since the features will usually continue to exists with the next cpu level. > > > > Any suggestions how to continue with that? - and sorry for the bulky > email :( > > -Andreas- > > > |
From: Andreas K. <kr...@li...> - 2011-11-23 09:49:55
|
On 11/23/2011 10:20 AM, Suthikulpanit, Suravee wrote: > Andreas, > > If the S390 doesn't need the "/dev/oprofile/0", what kind of interface does it need in the driver ? What type of performance data the cpumf generate? Actually, you don't need to use the events file and /dev/oprofile/0 if you don't want to. These are designed mainly to support the architectures with multiple performance counters and performance events. If S390's cpufm doesn't fit this model, it should define a different interface for this. > > For the OProfile user-space, another possible approach might be to use the OProfile extended interface which was added to OProfile when we introduce the AMD Instruction-based sampling (IBS). This interface was design so that we could add features which are non-traditional performance counters architecture. The IBS uses this OProfile extended interface (please see daemon/opd_extended.[h,c] and daemon/opd_ibs.[h,c] for more information). > > Also, for architecture which supports IBS, the driver added /dev/oprofile/ibs_fetch/[enable, max_count, ...] and /dev/oprofile/ibs_op/[enable, max_count ....]. These are IBS-specific configuration interface. In this case, we could probably use the "/dev/oprofile/cpumf/....". > > Suravee Suravee, the CPUMF facility provides counters similiar to other platforms but we do not exploit these yet. So far we only support hardware sampling also provided with CPUMF. With hardware sampling the hardware periodically writes the current instruction address into a buffer provided by the operating system. When enabling the feature this happens in the background. The CPUs only get involved if the buffers are full in order to flush them. So this feature is not actually a performance counter. It was the proposal from Maynard to nevertheless support it using the counter/event file system in order to keep the impact on user space low. I think this has been a reasonable choice. The current patch uses the /dev/oprofile/0/count, kernel, user, and event files in order to configure that feature. The match is not perfect but good enough I would say. Additionally we need another file there to configure the hardware buffer sizes. We also considered the IBS interface but thought it would be overkill here since our hardware sampling from a user space perspective behaves exactly like timer based sampling. We only need a way to enable/disable it dynamically, set the sampling interval, and the size of the hardware sampling buffers. The counter/event file system seems to be sufficient to achieve this with only little user space changes. I don't think that the problem I described below with the static number of counters for a given cpu type is related to the way our hardware sampling works. It would apply the same to our hardware counters. These are a perfect match for the event interface but nevertheless they would not be available on a "cpu_type" if not configured properly or if a too old millicode level is installed. So also with this we would need the user space to be able to figure out the number of available counters dynamically. Bye, -Andreas- > > > -----Original Message----- > From: Andreas Krebbel [mailto:kr...@li...] > Sent: Wednesday, November 23, 2011 2:34 AM > To: Suthikulpanit, Suravee > Cc: Richter, Robert; opr...@li...; Maynard Johnson > Subject: Re: [PATCH] S/390: Add event interface to the System z hardware > > Suravee, > > being able to switch to timer mode dynamically is one point I was trying to address with my patches. It worked well with the counter/event interface so far but I agree that there should be a way for all platforms to do this and your idea looks like a good way to achieve that - Thanks! > > However, the major concern from my last email remains I think - but is more related to the user space part. There is still the tight correlation of the number of counters and events to the cpu type all over the user space code which is inappropriate for how CPUMF works on s390. Instead of having a fixed table (libop/op_cpu_type.c) which maps a cpu type to a number of available counters the code should enumerate the /dev/oprofile/0..x/ files in order to get the number of available counters. This is necessary to have a "s390/z10" > cpu_type without /dev/oprofile/0/... files. > > For S/390 it isn't needed but perhaps someone also would like to extend this to the availability of events. Currently the available events are determined by a textfile which is chosen based on the cpu_type string. For full generality the user space should be able to enumerate the available events. Since for CPUMF the only available counter would simply disappear if the facility isn't available this should be no issue here. > > These issues are more related to the user space part. So I agree with Robert that this will cause quite some work there. > > I'll try to come up with a kernel patch implementing the interface you described. But I'm not sure how far I'll get with reworking the user space components. > > Bye, > > -Andreas- > > > On 11/22/2011 06:15 PM, Suthikulpanit, Suravee wrote: >> Andreas, >> >> My assumption here is the performance counter on the S390 system is pretty much the same as other architectures (i.e. require enabled, count, event, unit_mask, kernel, user). It has 1 counter and 1 event and 0 unit_mask. Also, if I understand your interface correctly, these are the requirements: >> 1. If the CPU doesn't support the performance measurement facility (cpumf), it would default to "timer" mode. >> 2. If the CPU support cpumf, it should be able to choose between using the "timer" or "cpumf". >> >> What you are proposing is basically the ability to do "timer" and "hardware-PMC" profile without having to reboot the kernel with different kernel argument to force the "timer" mode. If this is correct, I would like to suggest making this as another general enhancement to OProfile. Oprofile typically uses "timer" as a back-up mode when the CPU is not supported by the driver. This is a generic feature which other architecture should also be able to use as well. We have had users trying to use hardware event i.e. CPU_CLK_UNHALTED to mimic the time-based measurement. This has become issue when processors can scale the clock frequency with different workload. >> >> In this case, we could introduce the following interface: >> 1. The driver would introduce /dev/oprofile/timer/{enabled, period, kernel, user} virtual filesystem. This will be available regardless of the value in /dev/oprofile/cpu_type. >> 2. If /dev/oprofile/cpu_type is "timer" (i.e. the kernel doesn't support the CPU or cpumf is not available), the /dev/oprofile/{0,1,2....} will not be available. The /dev/oprofile/timer will be used. >> 3. The events file should only list the events which are "hw-counter-based". Typically, it require the "counters" fileld. And we should try to maintain this. >> 4. This will simplify the logic for cheking/enabling CPU_TIMER_INT mode and maintain the scheme we have been using for /dev/oprofile/cpu_type. >> >> Suravee >> >> >> -----Original Message----- >> From: Andreas Krebbel [mailto:kr...@li...] >> Sent: Tuesday, November 22, 2011 4:24 AM >> To: Richter, Robert >> Cc: opr...@li...; Suthikulpanit, Suravee; Maynard >> Johnson >> Subject: Re: [PATCH] S/390: Add event interface to the System z >> hardware >> >> On 11/21/2011 05:01 PM, Robert Richter wrote: >>>> + ops->cpu_type = "s390x/basic_mode_sampling_v1"; >>> >>> We want to keep the cpu_type naming scheme of oprofile consistent >>> which is >>> >>> s390/<model> >>> >>> "s390" is the arch string in Linux. Please choose a model name. >>> >>> We want to support multiple features with the interface and thus >>> can't use feature names here. This would break then. Userland should >>> check for available system features by checking for files in >>> oprofilefs which wouldn't exist if the feature is missing. >>> >>> So if s390/<model> is reported you could rely on the existence of >>> basic mode sampling if /dev/oprofile/0 exists. >> >> As far as I understand it this wouldn't be possible without a major uglification on the user space part. >> >> On s390 we currently have two cpus out there (s390/z10 and s390/z196) which do support the cpu measurement facility which in turn provides basic mode sampling and hardware counters (while saying this s390/cpumf probably would have been a much better name in the first place). >> >> Unfortunately there seems to be a lot of code which assumes an unconditional coupling between the cpu_type and the number of available events and counters. I see two ways to deal with the S/390 scheme considering the kernel module returns s390/z10 and s390/z196 depending on the machine level: >> >> Alternative 1 >> >> I could change libop/op_cpu_type.c to switch to CPU_TIMER_INT if cpu_type is "s390/z10" or "s390/z196" but /dev/oprofile/0 does not exists. This way we could get away with only minor changes but how we would add new features to these cpu levels (e.g. support hardware counters)? We would then need new cpu types: >> s390/z10a and s390/z196a for hwsampling + new feature s390/z10b and >> s390/z196b for !hwsampling, new feature >> >> And this mess would go on with a new CPU and/or new features arriving. >> >> >> Alternative 2 >> >> We could make every cpu_type check to also look for the availability of files in /dev/oprofile but there seem to be a lot of these. >> >> It would start with libop/op_cpu_type.c. The functions returning the cpu strings and the number of counters would have to check if /dev/oprofile/0 really exists in case the cpu type is CPU_S390_... >> >> The text files describing the events and their unit masks are located >> by appending the string in cpu_type to the events directory. This >> would have to change as well. If >> /dev/oprofile/0 is not there the files need to be ignored. >> >> There are several places where cpu_type is directly compared to CPU_TIMER_INT. For S/390 this then would need to something like: >> if (cpu_type == CPU_TIMER_INT || (cpu_type == CPU_S390_... && ! exist >> "/dev/oprofile/0")) >> >> >> (Alternative 3 >> >> Rearrange how OProfile deals with CPU levels. Instead of only identifying the CPU itself a vector of available facilities should be transported. Quite similiar to what the kernel is already doing in /proc/cpuinfo with things like sse, mmx etc. . >> >> But this would be a major rework which I'm not able to cover.) >> >> >> >> While I understand that the cpu_type logic matches other CPUs I really >> do not see how >> S/390 fits in there smoothly. The problem is that currently the CPU level is the only information used to identify the capabilities what prevents single features from being enabled/disabled dynamically. >> >> On S/390 all OProfile should ever be interested in is the version number of the cpu measurement facility and this to my opinion can be expressed by something like s390/cpumf. >> I'm aware that adding support for a new feature also would require adding new cpu types but at least the number does not also multiply with the number of CPU levels since the features will usually continue to exists with the next cpu level. >> >> >> >> Any suggestions how to continue with that? - and sorry for the bulky >> email :( >> >> -Andreas- >> >> >> > > > |
From: Andreas K. <kr...@li...> - 2011-11-23 10:00:08
|
Suravee, I probably forgot to make one point clearer in my last email. While we need the /dev/oprofile/0/... files with my current patches when using hardware sampling these files will/should not be there if CPUMF isn't available. But the cpu_type will be the same in both cases! Here are the combinations we have with the current version of the patches: cpu_type force_cpu_type CPUMF extra files s390/z10 0 on /dev/oprofile/0/... s390/z10 timer on /dev/oprofile/hwsampling/... s390/z10 0 off - s390/z10 timer off - and the same for s390/z196 So the problem is that with s390/z10 we sometimes have /dev/oprofile/0/... and sometimes not. And the user space part is not able to deal with this yet. Bye, -Andreas- On 11/23/2011 10:20 AM, Suthikulpanit, Suravee wrote: > Andreas, > > If the S390 doesn't need the "/dev/oprofile/0", what kind of interface does it need in the driver ? What type of performance data the cpumf generate? Actually, you don't need to use the events file and /dev/oprofile/0 if you don't want to. These are designed mainly to support the architectures with multiple performance counters and performance events. If S390's cpufm doesn't fit this model, it should define a different interface for this. > > For the OProfile user-space, another possible approach might be to use the OProfile extended interface which was added to OProfile when we introduce the AMD Instruction-based sampling (IBS). This interface was design so that we could add features which are non-traditional performance counters architecture. The IBS uses this OProfile extended interface (please see daemon/opd_extended.[h,c] and daemon/opd_ibs.[h,c] for more information). > > Also, for architecture which supports IBS, the driver added /dev/oprofile/ibs_fetch/[enable, max_count, ...] and /dev/oprofile/ibs_op/[enable, max_count ....]. These are IBS-specific configuration interface. In this case, we could probably use the "/dev/oprofile/cpumf/....". > > Suravee > > > -----Original Message----- > From: Andreas Krebbel [mailto:kr...@li...] > Sent: Wednesday, November 23, 2011 2:34 AM > To: Suthikulpanit, Suravee > Cc: Richter, Robert; opr...@li...; Maynard Johnson > Subject: Re: [PATCH] S/390: Add event interface to the System z hardware > > Suravee, > > being able to switch to timer mode dynamically is one point I was trying to address with my patches. It worked well with the counter/event interface so far but I agree that there should be a way for all platforms to do this and your idea looks like a good way to achieve that - Thanks! > > However, the major concern from my last email remains I think - but is more related to the user space part. There is still the tight correlation of the number of counters and events to the cpu type all over the user space code which is inappropriate for how CPUMF works on s390. Instead of having a fixed table (libop/op_cpu_type.c) which maps a cpu type to a number of available counters the code should enumerate the /dev/oprofile/0..x/ files in order to get the number of available counters. This is necessary to have a "s390/z10" > cpu_type without /dev/oprofile/0/... files. > > For S/390 it isn't needed but perhaps someone also would like to extend this to the availability of events. Currently the available events are determined by a textfile which is chosen based on the cpu_type string. For full generality the user space should be able to enumerate the available events. Since for CPUMF the only available counter would simply disappear if the facility isn't available this should be no issue here. > > These issues are more related to the user space part. So I agree with Robert that this will cause quite some work there. > > I'll try to come up with a kernel patch implementing the interface you described. But I'm not sure how far I'll get with reworking the user space components. > > Bye, > > -Andreas- > > > On 11/22/2011 06:15 PM, Suthikulpanit, Suravee wrote: >> Andreas, >> >> My assumption here is the performance counter on the S390 system is pretty much the same as other architectures (i.e. require enabled, count, event, unit_mask, kernel, user). It has 1 counter and 1 event and 0 unit_mask. Also, if I understand your interface correctly, these are the requirements: >> 1. If the CPU doesn't support the performance measurement facility (cpumf), it would default to "timer" mode. >> 2. If the CPU support cpumf, it should be able to choose between using the "timer" or "cpumf". >> >> What you are proposing is basically the ability to do "timer" and "hardware-PMC" profile without having to reboot the kernel with different kernel argument to force the "timer" mode. If this is correct, I would like to suggest making this as another general enhancement to OProfile. Oprofile typically uses "timer" as a back-up mode when the CPU is not supported by the driver. This is a generic feature which other architecture should also be able to use as well. We have had users trying to use hardware event i.e. CPU_CLK_UNHALTED to mimic the time-based measurement. This has become issue when processors can scale the clock frequency with different workload. >> >> In this case, we could introduce the following interface: >> 1. The driver would introduce /dev/oprofile/timer/{enabled, period, kernel, user} virtual filesystem. This will be available regardless of the value in /dev/oprofile/cpu_type. >> 2. If /dev/oprofile/cpu_type is "timer" (i.e. the kernel doesn't support the CPU or cpumf is not available), the /dev/oprofile/{0,1,2....} will not be available. The /dev/oprofile/timer will be used. >> 3. The events file should only list the events which are "hw-counter-based". Typically, it require the "counters" fileld. And we should try to maintain this. >> 4. This will simplify the logic for cheking/enabling CPU_TIMER_INT mode and maintain the scheme we have been using for /dev/oprofile/cpu_type. >> >> Suravee >> >> >> -----Original Message----- >> From: Andreas Krebbel [mailto:kr...@li...] >> Sent: Tuesday, November 22, 2011 4:24 AM >> To: Richter, Robert >> Cc: opr...@li...; Suthikulpanit, Suravee; Maynard >> Johnson >> Subject: Re: [PATCH] S/390: Add event interface to the System z >> hardware >> >> On 11/21/2011 05:01 PM, Robert Richter wrote: >>>> + ops->cpu_type = "s390x/basic_mode_sampling_v1"; >>> >>> We want to keep the cpu_type naming scheme of oprofile consistent >>> which is >>> >>> s390/<model> >>> >>> "s390" is the arch string in Linux. Please choose a model name. >>> >>> We want to support multiple features with the interface and thus >>> can't use feature names here. This would break then. Userland should >>> check for available system features by checking for files in >>> oprofilefs which wouldn't exist if the feature is missing. >>> >>> So if s390/<model> is reported you could rely on the existence of >>> basic mode sampling if /dev/oprofile/0 exists. >> >> As far as I understand it this wouldn't be possible without a major uglification on the user space part. >> >> On s390 we currently have two cpus out there (s390/z10 and s390/z196) which do support the cpu measurement facility which in turn provides basic mode sampling and hardware counters (while saying this s390/cpumf probably would have been a much better name in the first place). >> >> Unfortunately there seems to be a lot of code which assumes an unconditional coupling between the cpu_type and the number of available events and counters. I see two ways to deal with the S/390 scheme considering the kernel module returns s390/z10 and s390/z196 depending on the machine level: >> >> Alternative 1 >> >> I could change libop/op_cpu_type.c to switch to CPU_TIMER_INT if cpu_type is "s390/z10" or "s390/z196" but /dev/oprofile/0 does not exists. This way we could get away with only minor changes but how we would add new features to these cpu levels (e.g. support hardware counters)? We would then need new cpu types: >> s390/z10a and s390/z196a for hwsampling + new feature s390/z10b and >> s390/z196b for !hwsampling, new feature >> >> And this mess would go on with a new CPU and/or new features arriving. >> >> >> Alternative 2 >> >> We could make every cpu_type check to also look for the availability of files in /dev/oprofile but there seem to be a lot of these. >> >> It would start with libop/op_cpu_type.c. The functions returning the cpu strings and the number of counters would have to check if /dev/oprofile/0 really exists in case the cpu type is CPU_S390_... >> >> The text files describing the events and their unit masks are located >> by appending the string in cpu_type to the events directory. This >> would have to change as well. If >> /dev/oprofile/0 is not there the files need to be ignored. >> >> There are several places where cpu_type is directly compared to CPU_TIMER_INT. For S/390 this then would need to something like: >> if (cpu_type == CPU_TIMER_INT || (cpu_type == CPU_S390_... && ! exist >> "/dev/oprofile/0")) >> >> >> (Alternative 3 >> >> Rearrange how OProfile deals with CPU levels. Instead of only identifying the CPU itself a vector of available facilities should be transported. Quite similiar to what the kernel is already doing in /proc/cpuinfo with things like sse, mmx etc. . >> >> But this would be a major rework which I'm not able to cover.) >> >> >> >> While I understand that the cpu_type logic matches other CPUs I really >> do not see how >> S/390 fits in there smoothly. The problem is that currently the CPU level is the only information used to identify the capabilities what prevents single features from being enabled/disabled dynamically. >> >> On S/390 all OProfile should ever be interested in is the version number of the cpu measurement facility and this to my opinion can be expressed by something like s390/cpumf. >> I'm aware that adding support for a new feature also would require adding new cpu types but at least the number does not also multiply with the number of CPU levels since the features will usually continue to exists with the next cpu level. >> >> >> >> Any suggestions how to continue with that? - and sorry for the bulky >> email :( >> >> -Andreas- >> >> >> > > > > > ------------------------------------------------------------------------------ > All the data continuously generated in your IT infrastructure > contains a definitive record of customers, application performance, > security threats, fraudulent activity, and more. Splunk takes this > data and makes sense of it. IT sense. And common sense. > http://p.sf.net/sfu/splunk-novd2d > _______________________________________________ > oprofile-list mailing list > opr...@li... > https://lists.sourceforge.net/lists/listinfo/oprofile-list > |
From: Andreas K. <kr...@li...> - 2011-11-23 14:40:25
|
On 11/21/2011 05:01 PM, Robert Richter wrote: >> + counter_config.kernel = 1; >> + counter_config.user = 1; > > Both parameters usually mean 'exclude_user/exclude_kernel' to have the > defaults 0. Thus, bits must be inverted in the code. Are you sure? User space writes 1s in order to enable profiling of kernel/user address spaces and the values seem to get written without inversion into the kernel files. And it also appears to be documented that way. Also the default events for other platforms always have both values set to 1. -Andreas- |
From: Robert R. <rob...@am...> - 2011-11-24 11:51:00
|
On 23.11.11 15:39:36, Andreas Krebbel wrote: > On 11/21/2011 05:01 PM, Robert Richter wrote: > >> + counter_config.kernel = 1; > >> + counter_config.user = 1; > > > > Both parameters usually mean 'exclude_user/exclude_kernel' to have the > > defaults 0. Thus, bits must be inverted in the code. > > Are you sure? User space writes 1s in order to enable profiling of kernel/user address > spaces and the values seem to get written without inversion into the kernel files. And it > also appears to be documented that way. Also the default events for other platforms always > have both values set to 1. No, checked this again and you are right. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center |