From: Maynard J. <may...@us...> - 2011-10-18 22:23:53
Attachments:
op-port.tar
Hello, oprofile community,

Below is a high-level proposal for how we might re-implement parts of oprofile to use the Linux Performance Events Subsystem. I attached a tar file containing a pdf file that might be easier to read, but I also wanted to have inline text for ease of reviewing. Any and all feedback is welcome.

Thanks.
-Maynard

============================================================================
Porting OProfile to Linux Performance Events

1. History

OProfile is a popular and widely-distributed performance tool for the Linux operating system. OProfile consists of a kernel driver and userspace tools. Over the years, support has been added to OProfile for virtually every processor type with a performance monitoring unit capable of running Linux.

In 2008, the Linux kernel community finally agreed to add a general purpose performance monitoring API to the kernel. The end result was the "Linux Performance Events Subsystem" (aka "perf_events"). One of the main reasons for developing this new API and subsystem was to enable performance monitoring tools to be built without custom kernel patches. But since the kernel community is generally reluctant to accept a major new feature without there being a ready consumer, the 'perf' tool was developed simultaneously with the subsystem. The perf tool source code resides in the kernel tree.

2. Porting OProfile to perf_events: Rationale

The main advantage of perf over oprofile (IMHO) is that a user can profile a process that they own without having to be root. The perf tool also supports other options, like event counting, profiling on software events, tracing, etc. The proliferation of sub-commands to perf can be daunting to new users, and, unfortunately, documentation is scarce, incomplete, and often out-of-date.

But perhaps the main downside to using perf is that in order to profile on native events, the user must pass a hex code to perf. For Intel, those codes are documented in the PMU user manuals. But for some architectures, the hex codes are not documented (at least not at the time of this writing), and the best way to obtain them is by using libpfm4's showevtinfo command. Unfortunately, the perf maintainers have steadfastly refused to allow libpfm4 to be linked to perf, and so this usability issue will remain a problem for the foreseeable future. While it's true that oprofile has its own usability issues, at least users can list the native events for their processor and see descriptions of the events without having to obtain the appropriate hardware manual.

The main complaint about oprofile is that the user must have root authority to run the profiler. This is because oprofile is a system-wide profiler, and there are security issues in allowing non-authorized users to collect profile information about all apps running on a system.

Some perf proponents believe that perf has made oprofile obsolete, but I believe there's room in the Linux community for more than one performance analysis tool. For one thing, perf_events does not support as many different architectures and processor types as OProfile. Also, OProfile has a large user base and a community mailing list where people who are new to profiling can get advice, not just about how to run OProfile, but about profiling techniques in general. OProfile can be made even more beneficial to the community by enhancing it to allow profiling by non-root users. One way to do this is to port oprofile to the perf_events subsystem.
Then users would have the ability to profile a single application (non-root) or system-wide (root).

3. Porting to perf_events – kernel choices

There are two ways to approach the porting task with regards to the oprofile kernel driver:

  1. Bypass the kernel driver completely by changing the userspace to mmap the perf_events ring buffer where profiling data is recorded
  2. Keep (and expand) the user-to-kernel driver interface and let the kernel driver invoke perf_events to do the bulk of the work

Advantages to option 1:
  - The kernel driver would essentially go into maintenance mode, since it wouldn't be used at all on systems where perf_events exists.
  - This is a cleaner and more natural way to interface with perf_events.

Advantages to option 2:
  - Fewer userspace changes would be needed – the user-to-kernel interface remains largely the same, and the sample data format doesn't change.
  - Retain all the statistics (e.g., buffer overflows), which are an added value of the kernel driver over perf_events.

The design documented herein assumes option #1, which implies we'll use the natural perf_events API. With this approach, the existing oprofile kernel driver can be bypassed, eliminating the need for much (if any) new development work in the driver. However, the oprofile kernel driver will probably live on for quite some time to be used by older processors that won't be supported by perf_events.

Current oprofile statistics differ from the statistics that perf_events can gather (see below). Some of this is due to differences in implementation details, but it's obvious that oprofile simply gathers more statistics than does perf_events. If we find a need for more statistics that should (i.e., need to) be exposed to users, we can work with the kernel community to add them.

3.1. OProfile statistics

enum {
	OPD_SAMPLES,          /**< nr. samples */
	OPD_KERNEL,           /**< nr. kernel samples */
	OPD_PROCESS,          /**< nr. userspace samples */
	OPD_NO_CTX,           /**< nr. samples lost due to not knowing if in the kernel or not */
	OPD_LOST_KERNEL,      /**< nr. kernel samples lost */
	OPD_LOST_SAMPLEFILE,  /**< nr. samples for which sample file can't be opened */
	OPD_LOST_NO_MAPPING,  /**< nr. samples lost due to no mapping */
	OPD_DUMP_COUNT,       /**< nr. of times buffer is read */
	OPD_DANGLING_CODE,    /**< nr. partial code notifications (buffer overflow) */
	OPD_MAX_STATS         /**< end of stats */
};

Additionally, top-level stats from /dev/oprofile/stats:
  - event_lost_overflow
  - sample_lost_no_mapping
  - bt_lost_no_mapping
  - sample_lost_no_mm

and CPU-level stats from /dev/oprofile/stats:
  - sample_lost_overflow
  - samples_lost_task_exit
  - samples_received
  - backtrace_aborted
  - sample_invalid_eip

3.2. Perf_events statistics

struct events_stats {
	u64 total_period;
	u64 total_lost;
	u64 total_invalid_chains;
	u32 nr_events[PERF_RECORD_HEADER_MAX];
	u32 nr_unknown_events;
	u32 nr_invalid_chains;
	u32 nr_unknown_id;
};

Question: Does a non-zero value for total_lost indicate a need to bump mmap_pages?

4. Determining perf_events availability

Userspace oprofile should be built with support for both the legacy oprofile kernel driver and for perf_events. A config check for perf_events.h should be sufficient to enable the building of perf_events-specific code. This implies such code should be surrounded by an #ifdef.

The oprofile userspace tools will also need to know at runtime whether or not the running processor has perf_events support.
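As a rough illustration (a sketch of my own, not proposed final code), the runtime probe could be as simple as attempting a minimal perf_event_open call and checking errno; the meaning of the two interesting error codes is discussed in the next paragraph:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	long fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;
	attr.exclude_kernel = 1;	/* so the probe also works for non-root users */

	fd = perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
	if (fd < 0) {
		if (errno == ENOSYS)
			printf("kernel has no perf_events support\n");
		else if (errno == ENOENT)
			printf("perf_events present, but this processor is not supported\n");
		else
			printf("perf_event_open failed: %s\n", strerror(errno));
		return errno;
	}
	close(fd);
	printf("perf_events is available\n");
	return 0;
}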
From discussions with Robert Richter, it appears that the only way to reliably determine if perf_events support is available for the running processor is to try to invoke the perf_event_open syscall. If it returns -ENOSYS, the syscall is not defined (ergo, no perf_events support); or if it returns -ENOENT, it means the running processor type is not yet supported by perf_events.

5. User interface changes for profile setup and run

This design document is intended to be fairly high level, so many of the details regarding existing user interface options will remain to be investigated during implementation. This document focuses mainly on new interface options and extensions to existing options.

5.1. New options

We'll use the 'perf record' tool (from here on, referred to as "perf-record") as a model for the new oprofile user interface options for setup and profiling. Perf-record has some options that are obvious candidates to be added to oprofile and some that are not.

The obvious options are:
  - all-cpus
  - pid
  - command (i.e., application to be profiled)

The options of questionable use are:
  - tid : (post-processing can select on an individual thread using a tid specification)
  - realtime
  - no-delay
  - no-inherit : (always have children threads inherit counter config; if you only want to see parent profile data, you can filter on that with a tgid specification in post-processing)
  - freq : (oprofile has always used sample period (i.e., event 'count'), and, IMHO, the freq stuff in perf just isn't that useful or easy to understand)
  - cpu : (use --separate=cpu and then use a cpu specification in post-processing)
  - raw-samples
  - stat : (showing inherited stats is something that can be done at post-processing time using a tgid specification)
  - no-samples
  - no-buildid-cache

Options already implemented by oprofile (in some manner or another) are:
  - append : (not doing a --reset)
  - force : (--reset)
  - quiet/verbose : (with or without --verbose)
  - count : (--event=<evt_name>:<count>)
  - call-graph : (--callgraph)
  - output (i.e., output filename) : (--save)
  - mmap-pages : (--buffer-size ?)

And options that are candidates for future enhancements are:
  - data
  - cgroup
  - timestamp

The upshot is that, initially, we need to add support for only three new options to opcontrol – all-cpus, pid, and command (I think we should rename the "command" option to "app"). For perf-record, the command option is required, but I think it makes sense to leave it optional when used in conjunction with all-cpus. With opcontrol, the command option would be passed along with the start option.

5.2. Obsolete options

The following existing options to opcontrol will no longer be necessary (or possible) when using perf_events:
  - init
  - start-daemon
  - dump
  - deinit
  - buffer-watershed
  - cpu-buffer-size
  - image : (there's no practical way to do filtering of samples during the profiling phase when using perf_events)
  - vmlinux : (I propose to move this option to the post-processing tools to correspond with the perf-report way of doing things)
  - no-vmlinux : (I propose we do away with this option. If the user's event specification does not preclude collecting kernel samples, then post-processing should, by default, use kallsyms to do addr-to-symbol resolution for kernel samples; otherwise use the passed vmlinux image.)
  - kernel-range : (do we really need this option? Who uses it and what for? If we need to keep it, it will be moved to the post-processing tools since filtering during profiling is not feasible.)
  - xen
  - active-domains

All of these options will be deprecated for at least one release (i.e., display a deprecation message, but not fail).

6. Internal changes for setup and profiling

6.1. opcontrol basics

The opcontrol script will have to be enhanced to handle the case where perf_events support is available, but will also need to retain its ability to interface with older kernels – i.e., the oprofile kernel driver. In order to perform actions appropriate for the level of kernel support, opcontrol must determine the availability of perf_events support. A simple C program that attempts the perf_event_open syscall and returns the errno value can be used to make this determination.

OProfile also needs to know the processor type for 'opcontrol --list-events', as well as for validating a requested native event. Currently, the cpu type is given by the oprofile kernel driver via /dev/oprofile/cpu_type. But the kernel driver won't be active on perf_events-enabled systems, so we need an alternate mechanism to determine cpu type.

Question: Is /proc/cpuinfo the best we can do? This would be sufficient for ppc64, but how about other architectures? Is there another alternative?

6.2. opcontrol setup parameters

The opcontrol script sets up profiling parameters in oprofilefs (/dev/oprofile) in order to communicate information to the oprofile kernel driver. When using perf_events, we'll need to communicate more or less the same information to perf_events, but we'll do that by way of the perf_event_open syscall.

The oprofile config file (/root/.oprofile/daemonrc) caches profiling parameters, containing much of the same profiling setup information as is stored in /dev/oprofile. Since the perf_events-enabled oprofile (from here on referred to as "oprofile/PE") will often be run by non-root users, my initial thought was to move this config file to ~/.oprofile, where the home dir is that of the real user ID executing the opcontrol command. However, there's the very real possibility that several users could be logged into the same machine with the same user ID and running oprofile in per-process mode. With current oprofile, only one user can be profiling at a time, so there isn't much chance for collision. But the oprofile/PE model opens up the possibility that one user's 'opcontrol --setup' operations could overwrite another user's setup. To make this less likely to happen, we should do one of the following:
  - Save the oprofile config file in the current directory
or
  - Eliminate the config file and use the single-command model of perf-record

I propose we do both. We could deprecate the opcontrol command, along with its separate "setup" and "start" capability. We would also create a new command (perhaps simply called 'oprofile') that would operate in the mode where all profiling parameters are passed in one command along with the pathname of the application to profile. This "single-command mode" would not need or use a config file.

Note: The "deprecation" of opcontrol would have to be conditional – i.e., if perf_events is not supported for the running processor type, opcontrol will not be deprecated and there would be no deprecation message.

6.3. Event specification

OProfile users currently specify the event(s) to profile using symbolic names such as BR_MISS_PRED_RETIRED. The perf_events API requires a hex code, which, for some architectures, is the same code as is found in oprofile's 'events' file (e.g., 0xc5 for BR_MISS_PRED_RETIRED).
For other architectures (e.g., ppc64), the codes in the events file are not appropriate for use with perf_events. In such cases, I believe those codes are unused, so we could simply update the events files with the proper perf_events codes. That way, all architectures would be handled the same way.

6.4. The oprofile daemon

The current oprofile daemon is a separate process that reads sample data from the kernel driver's event buffer and stores the data in a special format in sample files in the filesystem. With oprofile/PE, the daemon function will be replaced with a process that forks a child process to execute the passed app, makes the perf_event_open syscall, and reads the sample data from perf_events by mmap'ing the kernel buffer associated with the file descriptor returned from the syscall.

The sample data from perf_events is in a very different format from what we currently get from the oprofile kernel driver. Further investigation is needed to make the best choice of how to process this data. There are at least two methods to choose from:
  - At the end of the profiling run, convert perf_events raw sample data to the sample file format we currently use (stored in /var/lib/oprofile/samples/current). The resulting sample files would be stored in <current_dir>/oprofile_data/samples/current.
or
  - Develop a new post-processing method for parsing the perf_events sample data on-demand.

7. Changes for post-processing tools

Theoretically, perf_events should provide us with all the same profiling data that we get from the oprofile kernel driver, with the exception of certain statistics. If we choose to convert the perf_events raw sample data to oprofile-format sample files, then very little effort will be needed for post-processing aside from outputting different stats and adding the use of kallsyms for addr-to-symbol resolution for kernel samples. But if we choose the option of parsing perf_events sample data at post-processing time, then we'll need a second version of arrange_profiles(), as well as perf_events versions of its callees that access the sample files.

============================================================================
From: Robert R. <rob...@am...> - 2011-10-24 15:44:00
Maynard,

thanks for bringing this up. See my comments below.

On 18.10.11 18:11:40, Maynard Johnson wrote:
> 2. Porting OProfile to perf_events:
> But perhaps the main downside to using perf is that in order to profile on native events, the user must pass a hex code to perf. For Intel, those codes are documented in the PMU user manuals. But for some architectures, the hex codes are not documented (at least not at the time of this writing), and the best way to obtain them is by using libpfm4's showevtinfo command. Unfortunately, the perf maintainers have steadfastly refused to allow libpfm4 to be linked to perf, and so this usability issue will remain a problem for the foreseeable future.

Do you refer to this?

 http://lkml.org/lkml/2011/8/8/146

If we want to have something like this in the kernel/perf tool, we must discuss this on the list. There was no discussion at all, so no wonder why it isn't in.

There are some more reasons to migrate to perf_events:

* OProfile/perf coexistence: There is no way of allocating pmu resources (e.g. counters) in the kernel. Oprofile and perf may run exclusively only. But with more and more usage of perf it becomes much harder to disable perf in the system while running oprofile. In some cases oprofile isn't even aware of running perf counters and the measurement data gets corrupt. Esp. with the in-kernel perf api it is possible to set up perf counters by any kernel component without any control to disable it.

* Decreased development and maintenance effort for hardware kernel drivers: If oprofile uses perf there would be less effort to implement and maintain hardware features. We could then concentrate on developing pmu features for perf only. Currently we have separate but similar pmu implementations, so there is duplicate work. Or, there are missing features either in perf or oprofile. With a migration to perf, oprofile could benefit from perf development efforts, as features that are perf-only could be used by oprofile too.

> 3. Porting to perf_events – kernel choices
> The design documented herein assumes option #1, which implies we'll use the natural perf_events API.

Yes, that's what I prefer too.

> 4. Determining perf_events availability
> Userspace oprofile should be built with support for both the legacy oprofile kernel driver and for perf_events. A config check for perf_events.h should be sufficient to enable the building of perf_events-specific code. This implies such code should be surrounded by an #ifdef.
>
> The oprofile userspace tools will also need to know at runtime whether or not the running processor has perf_events support.

I would support both kernel interfaces with the same binary and do the check at runtime, e.g. by setting up the perf syscall. This decreases config effort, improves code readability, etc. Otherwise you need different oprofile packages for different kernels.

> 5.1. New options
> The upshot is that, initially, we need to add support for only three new options to opcontrol – all-cpus, pid, and command (I think we should rename the "command" option to "app"). For perf-record, the command option is required, but I think it makes sense to leave it optional when used in conjunction with all-cpus. With opcontrol, the command option would be passed along with the start option.

What about extending the command to run as argument list to opcontrol?
 opcontrol [ options ] [ <command> ]

Btw, I use perf record in system-wide mode mostly like this:

 perf record -a sleep 10

> 6. Internal changes for setup and profiling
> Question: Is /proc/cpuinfo the best we can do? This would be sufficient for ppc64, but how about other architectures? Is there another alternative?

Some architectures provide a perf function to get a unique pmu id/name. This is already used in the kernel driver for arm: armpmu_get_pmu_id(). There is also the perf_pmu_name() kernel function that is used by sh. If it isn't yet there, the pmu name could be exposed via sysfs.

For the x86 architecture the family/model from /proc is not sufficient; we also need cpuid for feature detection (e.g. AMD IBS, Intel Architectural Perfmon). Feature detection could potentially also be done with the perf open syscall (e.g. by trying to set up IBS).

> 6.2. opcontrol setup parameters
> But the oprofile/PE model opens up the possibility that one user's 'opcontrol --setup' operations could overwrite another user's setup. To make this less likely to happen, we should do one of the following:
> - Save the oprofile config file in the current directory
> or
> - Eliminate the config file and use the single-command model of perf-record

I would stick with the config file option only for system-wide mode, keeping the legacy oprofile daemon functionality. For per-task monitoring it is not worth it to implement support for separate config files. This metadata also caused some trouble in the past (remember the question "Have you tried with /root/.oprofile/daemonrc removed?").

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
From: Maynard J. <may...@us...> - 2011-10-25 23:08:57
On 10/24/2011 10:28 AM, Robert Richter wrote:
> Maynard,
>
> thanks for bringing this up. See my comments below.

Robert, thanks very much for taking the time to review this design. I'm looking forward to hearing from others. I have responses to some of your comments below.

> On 18.10.11 18:11:40, Maynard Johnson wrote:
> [snip]
>> 4. Determining perf_events availability
>>
>> Userspace oprofile should be built with support for both the legacy oprofile kernel driver and for perf_events. A config check for perf_events.h should be sufficient to enable the building of perf_events-specific code. This implies such code should be surrounded by an #ifdef.
>>
>> The oprofile userspace tools will also need to know at runtime whether or not the running processor has perf_events support.
>
> I would support both kernel interfaces with the same binary and do the

Yes, the point of the #ifdef's in the code would be to allow legacy oprofile to be built on older kernels. But if the perf_event.h file exists, then we build both legacy and oprofile/PE. We would need a LOUD WARNING in our configure if the perf_event.h file isn't found, telling the user to install the kernel headers package if they want to have perf_events-enabled oprofile.

> check at runtime, e.g. by setting up the perf syscall. This decreases
> config effort, improves code readability, etc. Otherwise you need
> different oprofile packages for different kernels.
>
>> 5.1. New options
>>
>> The upshot is that, initially, we need to add support for only three new options to opcontrol – all-cpus, pid, and command (I think we should rename the "command" option to "app"). For perf-record, the command option is required, but I think it makes sense to leave it optional when used in conjunction with all-cpus. With opcontrol, the command option would be passed along with the start option.
>
> What about extending the command to run as argument list to opcontrol?

Yeah, I guess I wasn't clear. That's what I am suggesting.

>
> opcontrol [ options ] [<command> ]
>
> Btw, I use perf record in system wide mode mostly like this:
>
> perf record -a sleep 10

I am not a big fan of this method, as you may decide later that you'd like to run longer. But I'm not sure my proposal for system-wide profiling is so good either, since we need to keep the profiling daemon running somehow. Perhaps we could just call sleep on its own, and continue doing so until the user does ctl-C to stop profiling.

>
>> 6. Internal changes for setup and profiling
>>
>> Question: Is /proc/cpuinfo the best we can do? This would be sufficient for ppc64, but how about other architectures? Is there another alternative?
>
> Some architectures provide a perf function to get a unique pmu
> id/name. This is already used in the kernel driver for arm:
> armpmu_get_pmu_id(). There is also perf_pmu_name() kernel function
> that is used by sh. If it isn't yet there, the pmu name could be
> exposed via sysfs.

I don't understand what this has to do with cpu_type. Is there some cpu information encoded in the pmu id/name?

>
> For x86 architecture the family/model from /proc is not sufficient, we
> need also cpuid for feature detection (e.g. AMD IBS, Intel

OK, cpuid could easily be incorporated, but we'd need to know when we should call it. I see "GenuineIntel" in the vendor_id field of an Intel Xeon system. What would it say on an AMD system? Is there a vendor_id field? There's no vendor_id field on POWER systems. Too bad there's no structure to cpuinfo.
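For the x86 half of this question, the vendor string can be read directly with the CPUID instruction instead of scraping /proc/cpuinfo -- it reports "GenuineIntel" on Intel and "AuthenticAMD" on AMD. A minimal, x86-only sketch (illustration only, not proposed oprofile code):

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>	/* GCC helper for the CPUID instruction */
#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	char vendor[13];

	/* CPUID leaf 0 returns the vendor string in EBX, EDX, ECX (in that order). */
	if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
		return 1;
	memcpy(vendor, &ebx, 4);
	memcpy(vendor + 4, &edx, 4);
	memcpy(vendor + 8, &ecx, 4);
	vendor[12] = '\0';

	/* e.g. "GenuineIntel" or "AuthenticAMD"; further leaves give family/model
	 * and feature bits (useful for the feature detection Robert mentions). */
	printf("%s\n", vendor);
	return 0;
}
#endif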
I thought about the aux vector, too, but here again, there's no structure -- it's free-form. I think this is why cpu_type exists in /dev/oprofile -- because there's no easy way of determining it. We need a solution for this. I know there are a lot of people out there who know Linux a lot better than me. Hopefully someone can come up with a solution.

> Architectural Perfmon). Feature detection could potentially be done
> also with the perf open syscall (e.g. by trying to setup IBS).
>
>> 6.2. opcontrol setup parameters
>>
>> But the oprofile/PE model opens up the possibility that one user's 'opcontrol --setup' operations could overwrite another user's setup. To make this less likely to happen, we should do one of the following:
>> - Save the oprofile config file in the current directory
>> or
>> - Eliminate the config file and use the single-command model of perf-record
>
> I would stick with the config file option only for system wide mode
> keeping the legacy oprofile daemon functionality. For per-task
> monitoring it is not worth to implement support for separate config
> files. This metadata also caused some trouble in the past (remember
> the question "Have you tried with /root/.oprofile/daemonrc removed?").

I think you're agreeing with me, mostly. Again, my wording probably wasn't clear. We would keep the daemonrc file for legacy oprofile mode. If perf_events is available on the system, both per-process and system-wide profiling would be done with the new single-command-mode 'oprofile' command, which has no need for cached config information.

One additional thought I had . . . I think a "--force-legacy-mode" option might be useful to have. The intent would be that such an option should only be used if a problem is suspected in oprofile/PE, and we should document the option that way.

Thanks again for your feedback.
-Maynard

>
> -Robert
From: Maynard J. <may...@us...> - 2011-10-26 14:38:11
Maynard Johnson wrote:
> On 10/24/2011 10:28 AM, Robert Richter wrote:
>> Maynard,
>>
>> thanks for bringing this up. See my comments below.
>
> Robert, thanks very much for taking the time to review this design. I'm looking forward to hearing from others. I have responses to some of your comments below.
>
>> On 18.10.11 18:11:40, Maynard Johnson wrote:
>> [snip]
>>> 5.1. New options
>>>
>>> The upshot is that, initially, we need to add support for only three new options to opcontrol – all-cpus, pid, and command (I think we should rename the "command" option to "app"). For perf-record, the command option is required, but I think it makes sense to leave it optional when used in conjunction with all-cpus. With opcontrol, the command option would be passed along with the start option.
>>
>> What about extending the command to run as argument list to opcontrol?
> Yeah, I guess I wasn't clear. That's what I am suggesting.

I'm having second thoughts about the approach I suggested in my design regarding changes to opcontrol. To reiterate, I proposed to add two new options (--all-cpus and --pid) to opcontrol, as well as allowing a <command> (i.e., app to be profiled) to be passed as the last argument. Then opcontrol would call the newly-developed oprofile program (let's call this program 'op-perf' for now), passing all setup parameters to it. The op-perf program would validate options and then do all of the perf_events-related work to set up and do the profiling of the passed app. This program would also be able to be directly invoked by the user, bypassing opcontrol entirely -- but the user would have to pass all required profiling parameters to op-perf on the command line.

The rationale for that approach was to allow users to transition to a perf_events-enabled oprofile without having to change their mode of operation right away. The intent was to deprecate opcontrol, and disallow its use at some later date (or release) on systems where perf_events is available.

This approach makes some sense when using the "--all-cpus" option, which is equivalent to the existing system-wide mode of oprofile. Assuming no "command" is passed as the last argument, an "--all-cpus" mode of operation would invoke op-perf as a daemon, akin to the existing oprofiled. Then 'opcontrol --shutdown' could be used to halt op-perf. We could even support the concept of 'opcontrol --stop' by signalling op-perf to disable the perf_events counters it has open.

But this approach seems less than ideal when applied to profiling a single application. For users who may already be accustomed to the 'perf' mode of doing things, the profiler should simply end when the application is done (or when they do ctl-C). But if we run op-perf as a daemon, the ctl-C option isn't available, and they must run 'opcontrol --shutdown' to stop profiling prior to the end of the profiled app. Nevertheless, the approach is workable and can be implemented.

I am concerned, however, that the above approach adds more complexity to an already complex opcontrol command. It also retains the use of the daemonrc config file to hold setup information that is later passed to op-perf, when we would not need this file at all if the user simply invoked op-perf directly.

My modified proposal is to develop the op-perf program initially as a standalone program intended to be invoked directly. I will not make any major modifications to opcontrol initially, meaning that it would continue to operate in legacy mode.
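To make the intended split concrete, direct invocation of the standalone program might look roughly like this (hypothetical syntax only -- option names and defaults are not settled, and 'my_app' is just a placeholder):

 op-perf --event=<evt_name>:<count> ./my_app           (per-process profiling, non-root)
 op-perf --all-cpus --event=<evt_name>:<count>         (system-wide profiling, root)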
We could make a small change to opcontrol to detect whether or not perf_events is available on the system and to output a message recommending the use of op-perf instead. Obviously, op-perf would also need to verify that perf_events is available, and if not, then tell the user to use opcontrol.

This simpler approach will allow us to get an implementation developed and available for testing much sooner than the original approach. If the community feels there is a need to add an automatic roll-over to perf_events when using opcontrol, we can do that as an enhancement.

Thoughts? Opinions?

-Maynard

>>
>> opcontrol [ options ] [<command> ]
>>
> [snip]
> -Maynard
>
>>
>> -Robert
>>
From: Arnaldo C. de M. <ac...@gh...> - 2011-10-26 10:57:06
Em Tue, Oct 25, 2011 at 06:09:00PM -0500, Maynard Johnson escreveu:
> On 10/24/2011 10:28 AM, Robert Richter wrote:
> > perf record -a sleep 10
> I am not a big fan of this method, as you may decide later that you'd like to
> run longer. But I'm not sure my proposal for system-wide profiling is so good
> either, since we need to keep the profiling daemon running somehow. Perhaps
> could just call sleep on its own, and continue doing so until the user does
> ctl-C to stop profiling.

You don't have to run sleep if you want to run till the user presses control+c, just use:

 perf record -a

Using sleep is just a way of providing an upper bound when you don't want to fill your disk :-)

So doing:

 perf record -a sleep 10m

and pressing control+c after 30 seconds, will produce the same result as:

 perf record -a

+ control+c after 30 seconds.

I.e. there is one profiling "daemon"; it's set up at the start of the profiling session and terminated when it finishes, either by the provided timeout (using -a + sleep) or when the user says he/she is done with profiling.

The upper bound can as well be provided in terms of some other event, not just time; just write an app/workload-specific "trigger" that exits when profiling is not interesting anymore, kinda like:

 perf record -a ./profile-till-this-thing-happens

- Arnaldo
From: Maynard J. <may...@us...> - 2011-10-26 13:21:22
Arnaldo Carvalho de Melo wrote:
> Em Tue, Oct 25, 2011 at 06:09:00PM -0500, Maynard Johnson escreveu:
>> On 10/24/2011 10:28 AM, Robert Richter wrote:
>>> perf record -a sleep 10
>> I am not a big fan of this method, as you may decide later that you'd like to
>> run longer. But I'm not sure my proposal for system-wide profiling is so good
>> either, since we need to keep the profiling daemon running somehow. Perhaps
>> could just call sleep on its own, and continue doing so until the user does
>> ctl-C to stop profiling.
>
> You don't have to run sleep if you want to run till the user presses
> control+c, just use:
>
> perf record -a
>
> Using sleep is just a way of providing an upper bound when you don't
> want to fill your disk :-)
>
> So doing:
>
> perf record -a sleep 10m
>
> and pressing control+c after 30 seconds, will produce the same result
> as:
>
> perf record -a
>
> + control+c after 30 seconds.
>
> I.e. there is one profiling "daemon", its setup at the start of the
> profiling session and terminated when it finishes, either by the
> provided timeout (using -a + sleep) or when the user says he/she is done
> with profiling.
>
> The upper bound can as well be provided in terms of other event, not
> just time, just write an app/workload specific "trigger" that exits when
> profiling is not interesting anymore, kinda like:
>
> perf record -a ./profile-till-this-thing-happens

Thanks for the tips, Arnaldo!

-Maynard

>
> - Arnaldo
>
From: Maynard J. <may...@us...> - 2011-10-26 13:19:32
Andi Kleen wrote:
>>>
>>> For x86 architecture the family/model from /proc is not sufficient, we
>>> need also cpuid for feature detection (e.g. AMD IBS, Intel
>>
>> OK, cpuid could easily be incorporated, but we'd need to know when we
>> should call it. I see "GenuineIntel" in the vendor_id field of an Intel
>> Xeon system. What would it say on an AMD system? Is there a vendor_id
>> field? There's no vendor_id field on POWER systems. Too bad there's no
>> structure to cpuinfo. I thought about the aux vector, too, but here again,
>> there's no structure -- it's free-form. I think this is why cpu_type
>> exists in /dev/oprofile -- because there's no easy way of determining it.
>> We need a solution for this. I know there are a lot of people out there
>> who know Linux a lot better than me. Hopefully someone can come up with a
>> solution.
>
> perf doesn't solve this problem, you simply have to do it in user space.
> In fact we already have the code needed to select the correct eventlist,
> so can just follow that.

Andi, yes, I realize that. What I'm looking for is an architecture-independent way of at least determining what architecture we're running on. Then, once we know that, we can do the cpuid if we know we're on an Intel system. Because of the free-form nature of /proc/cpuinfo, we'd need all kinds of special-case code to ascertain the architecture (and, in some cases, the specific processor type). I considered that 'uname -m' might be a better approach, but I don't think that will give us the distinction we need to select between events/i386 and events/x86-64.

Other ideas . . . anyone?

-Maynard

>
> -Andi
From: Maynard J. <may...@us...> - 2011-10-28 00:01:15
1. Introduction

The design documented herein assumes oprofile userspace will use the perf_events API. With this approach, the existing oprofile kernel driver can be bypassed, eliminating the need for much (if any) new development work in the driver. However, the oprofile kernel driver will probably exist for quite some time to be used by older processors that won't be supported by perf_events.

2. Statistics

Current oprofile statistics differ from the statistics that perf_events can gather (see below). Some of this is due to differences in implementation details, but it's obvious that oprofile simply gathers more statistics than does perf_events. If we find a need for more statistics that should (i.e., need to) be exposed to users, we can work with the kernel community to add them.

2.1. OProfile statistics

enum {
	OPD_SAMPLES,          /**< nr. samples */
	OPD_KERNEL,           /**< nr. kernel samples */
	OPD_PROCESS,          /**< nr. userspace samples */
	OPD_NO_CTX,           /**< nr. samples lost due to not knowing if in the kernel or not */
	OPD_LOST_KERNEL,      /**< nr. kernel samples lost */
	OPD_LOST_SAMPLEFILE,  /**< nr. samples for which sample file can't be opened */
	OPD_LOST_NO_MAPPING,  /**< nr. samples lost due to no mapping */
	OPD_DUMP_COUNT,       /**< nr. of times buffer is read */
	OPD_DANGLING_CODE,    /**< nr. partial code notifications (buffer overflow) */
	OPD_MAX_STATS         /**< end of stats */
};

Additionally, top-level stats from /dev/oprofile/stats:
  - event_lost_overflow
  - sample_lost_no_mapping
  - bt_lost_no_mapping
  - sample_lost_no_mm

and CPU-level stats from /dev/oprofile/stats:
  - sample_lost_overflow
  - samples_lost_task_exit
  - samples_received
  - backtrace_aborted
  - sample_invalid_eip

2.2. Perf_events statistics

struct events_stats {
	u64 total_period;
	u64 total_lost;
	u64 total_invalid_chains;
	u32 nr_events[PERF_RECORD_HEADER_MAX];
	u32 nr_unknown_events;
	u32 nr_invalid_chains;
	u32 nr_unknown_id;
};

3. Determining perf_events availability

Userspace oprofile should be built with support for both the legacy oprofile kernel driver and for perf_events. A config check for perf_events.h should be sufficient to enable the building of perf_events-specific code. This implies such code should be surrounded by an #ifdef. If configure does not find perf_events.h, it could simply be that the kernel-headers package isn't installed, not that the running kernel doesn't support perf_events. The configure check should issue a LOUD WARNING (at the end, so it's clearly visible) to indicate that if the kernel version is 2.6.32 or later, the user should install the kernel headers and re-run configure. If perf_events.h is found by configure, then both legacy and perf_events-enabled oprofile code will be built.

The oprofile userspace tools will also need to know at runtime whether or not the running processor has perf_events support. The best way to determine if perf_events support is available for the running processor is to try to invoke the perf_event_open syscall. If it returns -ENOSYS, the syscall is not defined (ergo, no perf_events support). If it returns -ENOENT, it means the syscall exists, but the running processor type is not yet supported by perf_events.
4. User interface changes for profile setup and run

We'll use the 'perf record' tool (from here on, referred to as "perf-record") as a model for the new oprofile user interface for setup and profiling. The intent is to develop a new front-end profiling program -- 'operf' -- that can be used in place of opcontrol. Initially, opcontrol and legacy functionality will still be available even on systems where perf_events is supported. But users will be strongly encouraged to use operf for most circumstances and to use legacy oprofile only for purposes such as comparison to operf output.

Perf-record has some options that are obvious candidates to be used by operf and some that are not. The obvious options are:
  - all-cpus
  - pid
  - force (not a very intuitive name; suggest 'reset' instead)
  - verbose
  - event/count (suggest to combine as follows: '--event=<evt_spec>:<count>')
  - call-graph
  - output (i.e., output filename. Depending on the decision of how we store sample data (see section 6.3), we may have either a "--output-filename" option or a "--session-dir" option.)
  - mmap-pages (sample buffer size)
  - command (i.e., application to be profiled)

The options of questionable use for us are:
  - tid (post-processing can select on an individual thread using a tid specification)
  - realtime
  - append (doesn't seem to be necessary, since 'perf record' appends by default unless "--force" is used)
  - no-delay
  - quiet
  - no-inherit (we should always have children threads inherit counter config; if you only want to see parent profile data, you can filter on that with a tgid specification in post-processing)
  - freq (oprofile has always used sample period (i.e., event 'count'), and, IMHO, the freq stuff in perf just isn't that useful or easy to understand)
  - cpu (use a cpu specification in post-processing)
  - raw-samples
  - stat (showing inherited stats is something that can be done at post-processing time using a tgid specification)
  - no-samples
  - no-buildid-cache

New options we will add:
  - list-events*
  - help*

*NOTE: No other options are allowed when passing this option to operf.

And options that are candidates for future enhancements are:
  - data
  - cgroup
  - timestamp

5. Incompatible options

The following opcontrol options will not be necessary (or possible) when using operf:
  - init
  - start-daemon
  - dump
  - deinit
  - buffer-watershed
  - cpu-buffer-size
  - image (There's no practical way to do filtering of samples during the profiling phase when using perf_events.)
  - vmlinux (I propose to move this option to the post-processing tools to correspond with the perf-report way of doing things.)
  - no-vmlinux (I propose we do away with this option. Post-processing should, by default, use kallsyms to do addr-to-symbol resolution for kernel samples; otherwise use the passed vmlinux image.)
  - kernel-range (Do we really need this option? Who uses it and what for? If we need to keep it, it will be moved to the post-processing tools since filtering during profiling is not feasible.)
  - xen
  - active-domains

6. Internal changes for setup and profiling

Both opcontrol and operf must determine the availability of perf_events support. For opcontrol, if perf_events support is available, an informational message should be printed, recommending use of operf and indicating that opcontrol will be eliminated in a future release. For operf, detection of perf_events availability is, of course, needed for basic operation.

The new operf program needs to know the processor type for the '--list-events' option, as well as for validating a requested native event and obtaining the proper raw hex code. Currently, the cpu type is given by the oprofile kernel driver via /dev/oprofile/cpu_type. But since the kernel driver is activated only when a user runs 'opcontrol', which will not be necessary (or recommended) when using operf, we'll need an alternate mechanism to determine cpu type.
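The obvious (if fragile) starting point is /proc/cpuinfo; a naive sketch of that approach follows, purely as illustration -- as the discussion elsewhere on this list shows, the file is free-form and its field names differ by architecture, so this cannot be the whole answer:

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/cpuinfo", "r");
	char line[256];

	if (!f) {
		perror("/proc/cpuinfo");
		return 1;
	}
	/* Report the first "vendor_id" (x86) or "cpu" (ppc) line we find. */
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "vendor_id", 9) ||
		    (!strncmp(line, "cpu", 3) && (line[3] == '\t' || line[3] == ' '))) {
			fputs(line, stdout);	/* e.g. "vendor_id : GenuineIntel" */
			break;
		}
	}
	fclose(f);
	return 0;
}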
**NOTE**: All architecture maintainers in the oprofile community are asked to submit an algorithm by which operf can determine processor type in order to map to the specific events and unit_masks files in events/<arch>/<processor_type>.

6.1. Event specification

OProfile users currently specify the event(s) to profile using symbolic names such as BR_MISS_PRED_RETIRED. The perf_events API requires a hex code, which, for some architectures, is the same code as is found in oprofile's 'events' file (e.g., 0xc5 for BR_MISS_PRED_RETIRED). For other architectures (e.g., ppc64), the codes in the events file are not appropriate for use with perf_events. In such cases, I believe those codes are unused, so we could simply update the events files with the proper perf_events codes. That way, all architectures would be handled the same way.

6.2. Multiplexing of events

Another advantage of perf_events is its intrinsic ability to multiplex events when more events are requested for monitoring than there are physical performance counters. The minor downside to this feature is that the number of samples occurring for a given event may not be accurate, since that event may not have been scheduled on a performance counter for the entire profiling run. This really should *not* be an issue in profiling, since the main idea in analyzing a profile is to look for relative "hot" locations. If knowing the absolute number of events that occur is important, then the user should use an event counting tool, like 'perf stat'. But if there's a way we can detect that multiplexing has taken place and at least issue a warning message, we should do so.

6.3. The profiling daemon

The current oprofile daemon is a separate process that reads sample data from the kernel driver's event buffer and stores the data in a special format in sample files in the filesystem. With oprofile/PE, the daemon function will be replaced with operf, which will execute the passed app, make the perf_event_open syscall, and read the sample data from perf_events by mmap'ing the kernel buffer associated with the file descriptor returned from the syscall.

The sample data from perf_events is in a very different format from what we currently get from the oprofile kernel driver. There are at least two methods to choose from:
  - At the end of the profiling run, convert perf_events raw sample data to the sample file format we currently use (which we now store in /var/lib/oprofile/samples/current). The resulting sample files would be stored in <current_dir>/oprofile_data/samples/current.
or
  - Store perf_events sample data in the "natural" format in a file akin to perf's "perf.data" file. We could call our file "operf.data". Then we'd need to develop a new post-processing mechanism for parsing operf.data on-demand.

One thing to consider when choosing between these options is how we currently support profiling Java. We look for "anon" sample files in the samples directory, then create an ELF file named <pid>.jo and place it in the anon sample directory. If we chose alternative 2 from above (i.e., not storing samples in a <current_dir>/oprofile_data/samples/current directory), the <pid>.jo would have to be stored somewhere independent of operf.data. Then we'd have a synchronization issue to contend with -- i.e., do the *.jo files the post-processing tools find actually correspond with the operf.data file?
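To make the perf_events side of this concrete, below is a rough, self-contained sketch of the per-process flow. It is an illustration only: the raw event code 0xc5 and the 100000 sample period are placeholders, the child/parent synchronization is simplified, and real code would walk the PERF_RECORD_* records in the ring buffer while the app runs rather than just reporting data_head at the end. It also reads the time-enabled/time-running counts, which is one possible way to detect the multiplexing discussed in 6.2:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

#define MMAP_DATA_PAGES 128	/* data pages; must be a power of two */

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
	struct perf_event_attr attr;
	struct perf_event_mmap_page *meta;
	unsigned long long counts[3];	/* value, time_enabled, time_running */
	size_t map_len;
	int fd, status, pipefd[2];
	pid_t child;
	char go;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <app> [args...]\n", argv[0]);
		return 1;
	}

	/* Child waits until the counter is set up, then execs the passed app. */
	if (pipe(pipefd) < 0)
		return 1;
	child = fork();
	if (child == 0) {
		close(pipefd[1]);
		read(pipefd[0], &go, 1);
		execvp(argv[1], argv + 1);
		_exit(127);
	}
	close(pipefd[0]);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_RAW;		/* native event by hex code... */
	attr.config = 0xc5;			/* ...placeholder, e.g. BR_MISS_PRED_RETIRED */
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;
	attr.disabled = 1;
	attr.enable_on_exec = 1;		/* start counting when the child execs */
	attr.exclude_kernel = 1;		/* keep the sketch non-root friendly */

	fd = perf_event_open(&attr, child, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* Map the sample ring buffer: one metadata page plus the data pages. */
	map_len = (MMAP_DATA_PAGES + 1) * sysconf(_SC_PAGESIZE);
	meta = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (meta == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	close(pipefd[1]);			/* release the child */
	waitpid(child, &status, 0);

	/* Real code would consume PERF_RECORD_SAMPLE records between data_tail
	 * and data_head as the app runs; here we only report the final head. */
	printf("ring buffer data_head: %llu bytes\n",
	       (unsigned long long)meta->data_head);

	/* Multiplexing check (6.2): time scheduled on a counter < time enabled. */
	if (read(fd, counts, sizeof(counts)) == sizeof(counts) &&
	    counts[2] < counts[1])
		fprintf(stderr, "warning: event was multiplexed (%llu of %llu ns)\n",
			counts[2], counts[1]);

	munmap(meta, map_len);
	close(fd);
	return 0;
}

Real operf would additionally need to handle multiple events, counter inheritance for child threads, and draining the buffer while the app runs, but the set of perf_events calls involved is essentially the above.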
6.4. Supporting multiple users simultaneously

Unlike legacy oprofile, where only one user can run oprofile at a time, operf will allow any number of users to run simultaneously. However, early implementations of perf_events (or perhaps the perf tool?) were buggy in the handling of multiple users when one or more of those users was running system-wide profiling. (I have not yet determined if this problem is actually fixed in later implementations, since I've only recently become aware of it.) We may want to consider disallowing this scenario, or at least warning that the results cannot be trusted.

7. Changes for post-processing tools

Theoretically, perf_events should provide us with all the same profiling data that we get from the oprofile kernel driver, with the exception of certain statistics. If we choose to convert the perf_events raw sample data to oprofile-format sample files, then very little effort will be needed for post-processing aside from outputting different stats and adding the use of kallsyms for addr-to-symbol resolution for kernel samples. But if we choose the option of parsing perf_events sample data at post-processing time, then we'll need a second version of arrange_profiles(), as well as perf_events versions of its callees that access the sample files.

============================================================================

-Maynard
From: Andi K. <an...@fi...> - 2011-10-28 01:15:45
> 2. Statistics
> Current oprofile statistics differ from the statistics that perf_events can
> gather (see below). Some of this is due to differences in implementation
> details, but it's obvious that oprofile simply gathers more information
> stats than does perf_events. If we find a need for more statistics that
> should (i.e., need to) be exposed to users, we can work with the kernel
> community to add them.

The other difference is the lack of profile_pc() in perf: when a spinlock is hit oprofile accounts it in the parent, while perf doesn't. I sent a patch to add it to perf some time ago, but it wasn't accepted.

This generally results in many profiles looking quite different.

> 4. User interface changes for profile setup and run
> We'll use the 'perf record' tool (from here on, referred to as
> "perf-record") as a model for the new oprofile user interface for setup and
> profiling. The intent is to develop a new front-end profiling program --
> 'operf' -- that can be used in place of opcontrol. Initially, opcontrol
> and legacy functionality will still be available even on systems where
> perf_events is supported. But users will be strongly encouraged to use
> operf for most circumstances and to use legacy oprofile only for purposes
> such as comparison to operf output.

So will operf still control a daemon? IMHO that's a valuable property of the old oprofile and one of the basic ideas behind the original DEC "continuous profiling" work oprofile was based on.

> 6.1. Event specification
> OProfile users currently specify the event(s) to profile with using
> symbolic names such as BR_MISS_PRED_RETIRED. The perf_events API requires
> a hex code, which, for some architectures, is the same code as is found in
> oprofile's 'events' file (e.g., 0xc5 for BR_MISS_PRED_RETIRED). For other
> architectures (e.g., ppc64), the codes in the events file are not
> appropriate for use with perf_events. In such cases, I believe those codes
> are unused, so we could simply update the events files with the proper
> perf_events codes. That way, all architectures would be handled the same
> way.

Updating all the event files would be a lot of work. Probably very hard. It would be better to somehow reuse the old event files. At least on x86 it's possible.

> 6.3. The profiling daemon
> The current oprofile daemon is a separate process that reads sample data
> from the kernel driver's event buffer and stores the data in special format
> in sample files in the filesystem. With oprofile/PE, the daemon function
> will be replaced with operf, which will execute the passed app, make the
> perf_event_open syscall and read the sample data from perf_events by
> mmap'ing to the kernel buffer associated with the file descriptor returned
> from the syscall.

I think completely removing the daemon model is a mistake.

If you don't want a daemon, why use oprofile in the first place? Without it oprofile would just be an inferior clone of perf, and people would likely choose the original.

The whole point of the "continuous profiling" work oprofile was based on was to have it always on, and that needs a daemon.

That said, of course, replacing opcontrol with something that doesn't need continuous workarounds from the user would be a good thing.

-Andi

--
ak...@li... -- Speaking for myself only.
From: Maynard J. <may...@us...> - 2011-10-28 18:37:46
Andi Kleen wrote:
>> 2. Statistics
>> Current oprofile statistics differ from the statistics that perf_events can
>> gather (see below). Some of this is due to differences in implementation
>> details, but it's obvious that oprofile simply gathers more information
>> stats than does perf_events. If we find a need for more statistics that
>> should (i.e., need to) be exposed to users, we can work with the kernel
>> community to add them.
>
> The other difference is the lack of profile_pc() in perf: when
> a spinlock is hit oprofile accounts it in the parent, while perf doesn't.
> I sent a patch to add it to perf some time ago, but it wasn't
> accepted.

A good reason to still have the ability to use legacy oprofile -- until such time as equivalent function is added to perf_events.

>
> This generally results in many profiles looking quite different.
>
>> 4. User interface changes for profile setup and run
>> We'll use the 'perf record' tool (from here on, referred to as
>> "perf-record") as a model for the new oprofile user interface for setup and
>> profiling. The intent is to develop a new front-end profiling program --
>> 'operf' -- that can be used in place of opcontrol. Initially, opcontrol
>> and legacy functionality will still be available even on systems where
>> perf_events is supported. But users will be strongly encouraged to use
>> operf for most circumstances and to use legacy oprofile only for purposes
>> such as comparison to operf output.
>
> So will operf still control a daemon? IMHO that's a valuable property
> of the old oprofile and one of the basic ideas behind the original
> DEC "continuous profiling" work oprofile was based on.

Although that was not a feature included in my design, we could certainly add it if need be. Do you know of oprofile users that do such "continuous profiling"? I've not been exposed to this concept before. Is this something intended to be run on a production system? Or during development? Is system-wide the preferred mode? Or would per-process profiling be desired (maybe in addition to system-wide)? I would want to see some use cases that could not be satisfied with the simpler mode of operation already described. If the community consensus is that we need something more, it could be a future enhancement.

>
>> 6.1. Event specification
>> OProfile users currently specify the event(s) to profile with using
>> symbolic names such as BR_MISS_PRED_RETIRED. The perf_events API requires
>> a hex code, which, for some architectures, is the same code as is found in
>> oprofile's 'events' file (e.g., 0xc5 for BR_MISS_PRED_RETIRED). For other
>> architectures (e.g., ppc64), the codes in the events file are not
>> appropriate for use with perf_events. In such cases, I believe those codes
>> are unused, so we could simply update the events files with the proper
>> perf_events codes. That way, all architectures would be handled the same
>> way.
>
> Updating all the event files would be a lot of work. Probably very hard.
> It would be better to somehow reuse the old event files. At least on x86
> it's possible.

Yes, as I said, I do believe reusing existing event files for x86 should work. As for ppc64 (any others?), where the event codes in oprofile events files aren't the ones needed for perf_events . . . well, a good script writer could probably do it in a couple of days. Probably take me a week. ;-)
>> 6.3. The profiling daemon
>> The current oprofile daemon is a separate process that reads sample data
>> from the kernel driver's event buffer and stores the data in special format
>> in sample files in the filesystem. With oprofile/PE, the daemon function
>> will be replaced with operf, which will execute the passed app, make the
>> perf_event_open syscall and read the sample data from perf_events by
>> mmap'ing to the kernel buffer associated with the file descriptor returned
>> from the syscall.
>
> I think completely removing the daemon model is a mistake.
>
> If you don't want a daemon, why use oprofile in the first place?
> Without it oprofile would just be an inferior clone of perf, and people
> would likely choose the original.

I've given my rationale for doing this port to perf_events. If you don't agree with that, that's your opinion.

-Maynard

>
> The whole point of the "continuous profiling" work oprofile was based
> on was to have it always on, and that needs a daemon.
>
> That said, of course, replacing opcontrol with something that doesn't
> need continuous workarounds from the user would be a good thing.
>
>
> -Andi
From: Andi K. <an...@fi...> - 2011-10-29 09:46:06
On Fri, Oct 28, 2011 at 01:37:24PM -0500, Maynard Johnson wrote:
> Although that was not a feature included in my design, we could certainly add it if need be. Do you know of oprofile users that do such "continuous profiling"? I've not been exposed to this concept before. Is this something intended to be run on a production system? Or during development? Is system-wide the preferred mode? Or would per-process profiling be desired (maybe in addition to system-wide)? I would want to see some use cases that could not be satisfied with the simpler mode of operation already described. If the community consensus is that we need something more, it could be a future enhancement.

When John announced oprofile originally he basically wrote it was a clone of the Digital Unix continuous profiling infrastructure. I'm sure he wouldn't have gone through all the trouble with the daemon otherwise.

Here's the original paper from DEC:

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-016A.pdf

And here's a modern paper on how it is being used:

http://research.google.com/pubs/archive/36575.pdf

IMHO a lot of the conflict between oprofile users and perf ("too hard to use") was caused by this philosophical conflict between continuous profiling and developer debugging. The developers are probably already lost to perf, but it would make sense to continue filling the continuous profiling niche well.

If you want to add some buzzwords, oprofile and CPI were essentially for cloud profiling, not individual development. Right now you basically seem to kill that aspect, which largely obsoletes oprofile.

-Andi