Thread: [RFC] Porting oprofile to perf_events

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hello, oprofile community,
Below is a high-level proposal for how we might re-implement parts of oprofile to use the Linux Performance Events Subsystem.  I attached a tar file containing a pdf file that might be easier to read, but I also wanted to have inline text for ease of reviewing.  Any and all feedback is welcome.

Thanks.
-Maynard

============================================================================

Porting OProfile to Linux Performance Events 

1.History
OProfile is a popular and widely-distributed performance tool for the Linux operating system.  OProfile consists of a kernel driver and userspace tools.  Over the years, support has been added to OProfile for virtually every processor type with a performance monitoring unit capable of running Linux.

In 2008, the Linux kernel community finally agreed to add a general purpose performance monitoring API to the kernel.  The end result was the “Linux Performance Events Subsystem” (aka “perf_events”).  One of the main reasons for developing this new API and subsystem was to enable performance monitoring tools to be built without custom kernel patches.  But since the kernel community is generally reluctant to accept a major new feature without there being a ready consumer, the ‘perf’ tool was developed simultaneously with the subsystem.  The perf tool source code resides in the kernel tree.

2.Porting OProfile to perf_events: Rationale
The main advantage of perf over oprofile (IMHO) is that a user can profile a process that they own without having to be root.  The perf tool also supports other options, like event counting, profiling on software events, tracing, etc.  The proliferation of sub-commands to perf can be daunting to new users, and, unfortunately, documentation is scarce, incomplete, and often out-of-date.  But perhaps the main downside to using perf is that in order to profile on native events, the user must pass a hex code to perf.  For Intel, those codes are documented in the PMU user manuals.  But for some architectures, the hex codes are not documented (at least not at the time of this writing), and the best way to obtain them is by using libpfm4’s showevtinfo command.  Unfortunately, the perf maintainers have steadfastly refused to allow libpfm4 to be linked to perf, and so this usability issue will remain a problem for the foreseeable future.

While it’s true that oprofile has its own usability issues, at least users can list the native events for their processor and see descriptions of the events without having to obtain the appropriate hardware manual.  The main complaint about oprofile is that the user must have root authority to run the profiler.  This is because oprofile is a system-wide profiler, and there are security issues in allowing non-authorized users to collect profile information about all apps running on a system.

Some perf proponents believe that perf has made oprofile obsolete, but I believe there’s room in the Linux community for more than one performance analysis tool.  For one thing, perf_events does not support as many different architectures and processor types as OProfile.  Also, OProfile has a large user base and a community mailing list where people who are new to profiling can get advice, not just about how to run OProfile, but about profiling techniques in general.

OProfile can be made even more beneficial to the community by enhancing it to allow profiling by non-root users.  One way to do this is to port oprofile to the perf_events subsystem.  Then users would have the ability of profiling a single application (non-root) or system-wide (root).

3.Porting to perf_events – kernel choices
There are two ways to approach the porting task with regards to the oprofile kernel driver:
   1.Bypass the kernel driver completely by changing the userspace to mmap the perf_events ring buffer where profiling data is recorded
   2.Keep (and expand) the user-to-kernel driver interface and let the kernel driver invoke perf_events to do the bulk of the work

Advantages to option 1:
   - The kernel driver would essentially go into maintenance mode, since it wouldn’t be used at all on systems where perf_events exists.
   - This is a cleaner and more natural way to interface with perf_events.

Advantages to option 2:
Fewer userspace changes would be needed – the user-to-kernel interface remains largely the same, and the sample data format doesn’t change.
Retain all the statistics (e.g, buffer overflows), which are an added value of the kernel driver over perf_events.

The design documented herein assumes option #1, which implies we’ll use the natural perf_events API.  With this approach, the existing oprofile kernel driver can be bypassed, eliminating the need for much (if any) new development work in the driver.  However, the oprofile kernel driver will probably live on for quite some time to be used by older processors that won’t be supported by perf_events.

Current oprofile statistics differ from the statistics that perf_events can gather (see below).  Some of this is due to differences in implementation details, but it’s obvious that oprofile simply gathers more information stats than does perf_events.  If we find a need for more statistics that should (i.e., need to) be exposed to users, we can work with the kernel community to add them.
3.1.OProfile statistics
enum {  OPD_SAMPLES, /**< nr. samples */
        OPD_KERNEL, /**< nr. kernel samples */
        OPD_PROCESS, /**< nr. userspace samples */
        OPD_NO_CTX, /**< nr. samples lost due to not knowing if in the kernel or not */
        OPD_LOST_KERNEL,  /**< nr. kernel samples lost */
        OPD_LOST_SAMPLEFILE, /**< nr samples for which sample file can't be opened */
        OPD_LOST_NO_MAPPING, /**< nr samples lost due to no mapping */
        OPD_DUMP_COUNT, /**< nr. of times buffer is read */
        OPD_DANGLING_CODE, /**< nr. partial code notifications (buffer overflow */
        OPD_MAX_STATS /**< end of stats */
};

Additionally, top level stats from /dev/oprofile/stats:
   -event_lost_overflow
   -sample_lost_no_mapping
   -bt_lost_no_mapping
   -sample_lost_no_mm

and CPU-level status from /dev/oprofile/stats:
   -sample_lost_overflow
   -samples_lost_task_exit
   -samples_received
   -backtrace_aborted
   -sample_invalid_eip

3.2.Perf_events statistics
struct events_stats {
	u64 total_period;
	u64 total_lost;
	u64 total_invalid_chains;
	u32 nr_events[PERF_RECORD_HEADER_MAX];
	u32 nr_unknown_events;
	u32 nr_invalid_chains;
	u32 nr_unknown_id;
};

   Question: Does a non-zero value for total_lost indicate a need to bump mmap_pages?

4.Determining perf_events availability
Userspace oprofile should be built with support for both the legacy oprofile kernel driver and for perf_events.  A config check for perf_events.h should be sufficient to enable the building of perf_events-specific code.  This implies such code should be surrounded by an #ifdef.

The oprofile userspace tools will also need to know at runtime whether or not the running processor has perf_events support.  From discussions with Robert Richter, it appears that the only way to reliably determine if perf_events support is available for the running processor is to try to invoke the perf_event_open syscall.  If it returns -ENOSYS, the syscall is not defined (ergo, no perf_events support); or if it returns –ENOENT, it means the running processor type is not yet supported by perf_events.

5.User interface changes for profile setup and run
This design document is intended to be fairly high level, so many of the details regarding existing user interface options will remain to be investigated during implementation.  This document focuses mainly on new interface options and extensions to existing options.

5.1.New options
We’ll use the ‘perf record’ tool (from here on, referred to as “perf-record”) as a model for the new oprofile user interface options for setup and profiling.  Perf-record has some options that are obvious candidates to be added to oprofile and some that not.

The obvious options are:
   -all-cpus
   -pid
   -command (i.e., application to be profiled)

The options of questionable use are:
tid : (post-processing can select on an individual thread using a tid specification)
   -realtime
   -no-delay
   -no-inherit : (always have children threads inherit counter config; if you only want to see parent profile data, you can filter on that with a tgid specification in post-processing)
   -freq : (oprofile has always used sample period (i.e., event ‘count’), and, IMHO, the freq stuff in perf just isn’t that useful or easy to understand)
   -cpu : (use –separate=cpu and then use a cpu specification in post-processing)
   -raw-samples
   -stat : (showing inherited stats is something that can be done at post-processing time using tgid specification)
   -no-samples
   -no-buildid-cache

Options already implemented by oprofile (in some manner or another) are:
   -append : (not doing a –reset)
   -force : (--reset)
   -quiet/verbose : (with or without --verbose)
   -count : (--event=<evt_name>:<count>)
   -call-graph : (--callgraph)
   -output (i.e., output filename) : (--save)
   -mmap-pages : (--buffer-size ?)

And options that are candidates for future enhancements are:
   -data
   -cgroup
   -timestamp

The upshot is that, initially, we need to add support for only three new options to opcontrol – all-cpus, pid, and command (I think we should rename the “command” option to “app”).  For perf-record, the command option is required, but I think it makes sense to leave it optional when used in conjunction with all-cpus.  With opcontrol, the command option would be passed along with the start option.

5.2.Obsolete options
The following existing options to opcontrol will no longer be necessary (or possible) when using perf_events:
   -init
   -start-daemon
   -dump
   -deinit
   -buffer-watershed
   -cpu-buffer-size
   -image : (there's no way practical way to do filtering of samples during the profiling phase when using perf_events)
   -vmlinux : (I propose to move this option to the post-processing tools to correspond with the perf-report way of doing things)
   -no-vmlinux : (I propose we do away with this option.  If the user's event specification does not preclude collecting kernel samples, then post-processing should, by default, use kallsyms to do addr-to-symbol resolution for kernel samples; otherwise use the passed vmlinux image.)
   -kernel-range : (do we really need this option?  Who uses it and what for?  If we need to keep it, it will be moved to the post-processing tools since filtering during profiling is not feasible.)
   -xen
   -active-domains

All of these options will be deprecated for at least one release (i.e., display a deprecation message, but not fail).

6.Internal changes for setup and profiling
6.1.opcontrol basics
The opcontrol script will have to be enhanced to handle the case where perf_events support is available, but will also need to retain its ability to interface with older kernels – i.e., the oprofile kernel driver.

In order to perform actions appropriate for the level of kernel support, opcontrol must determine the availability of perf_events support.  A simple C program that attempts the perf_event_open syscall and returns the errno value can be used to make this determination.

OProfile also needs to know the processor type for 'opcontrol --list-events', as well as for validating a requested native event.  Currently, the cpu type is given by the oprofile kernel driver via /dev/oprofile/cpu_type.  But the kernel driver won't be active on perf_events-enabled systems, so we need an alternate mechanism to determine cpu type.

   Question:  Is /proc/cpuinfo the best we can do?  This would be sufficient for ppc64, but how about other architectures?  Is there another alternative?

6.2.opcontrol setup parameters
The opcontrol script sets up profiling parameters in oprofilefs (/dev/oprofile) in order to communicate information to the oprofile kernel driver.  When using perf_events, we’ll need to communicate more or less the same information to perf_events, but we’ll do that by way of the perf_event_open syscall.

The oprofile config file (/root/.oprofile/daemonrc) caches profiling parameters, containing much of the same profiling setup information as is stored in /dev/oprofile.  Since the  perf_events-enabled oprofile (from here on referred to as “oprofile/PE”) will often be run by non-root users, my initial thought was to move this config file to:
        ~/.oprofile
where the home dir is that of the real user ID executing the opcontrol command.  However, there's the very real possibility that several users could be logged into the same machine with the same user ID and running oprofile in per-process mode.  With current oprofile, only one user can be profiling at a time, so there isn't much chance for collision.  But the oprofile/PE model opens up the possibility that one user's 'opcontrol –setup' operations could overwrite another user's setup.  To make this less likely to happen, we should do one of the following:
   - Save the oprofile config file in the current directory
                  or
   - Eliminate the config file and use the single-command model of perf-record

I propose we do both.  We could deprecate the opcontrol command, along with its separate “setup” and “start” capability.  We would also create a new command (perhaps simply called 'oprofile') that would operate in the mode where all profiling parameters are passed in one command along with the pathname of the application to profile.  This “single-command mode” would not need or use a config file.
Note:  The “deprecation” of opcontrol would have to be conditional – i.e., if perf_events is not supported for the running processor type, opcontrol will not be deprecated and there would be no deprecation message.

6.3.Event specification
OProfile users currently specify the event(s) to profile with using symbolic names such as BR_MISS_PRED_RETIRED.  The perf_events API requires a hex code, which, for some architectures, is the same code as is found in oprofile's 'events' file (e.g., 0xc5 for  BR_MISS_PRED_RETIRED).  For other architectures (e.g., ppc64), the codes in the events file are not appropriate for use with perf_events.  In such cases, I believe those codes are unused, so we could simply update the events files with the proper perf_events codes.  That way, all architectures would be handled the same way.

6.4.The oprofile daemon
The current oprofile daemon is a separate process that reads sample data from the kernel driver's event buffer and stores the data in special format in sample files in the filesystem.  With oprofile/PE, the daemon function will be replaced with a process that forks a child process to execute the passed app, makes the perf_event_open syscall and reads the sample data from perf_events by mmap'ing to the kernel buffer associated with the file descriptor returned from the syscall.  The sample data from perf_events is a very different format from what we currently get from the oprofile kernel driver.  Further investigation is needed to make the best choice of how to process this data.  There are at least two methods to choose from:
   - At the end of the profiling run, convert perf_events raw sample data
     to the sample file format we currently use (stored in 
     /var/lib/oprofile/samples/current).  The resulting sample files
     would be stored in <current_dir>/oprofile_data/samples/current.
             or
   - Develop a new post-processing method for parsing the perf_events sample
     data on-demand.

7.Changes for post-processing tools
Theoretically, perf_events should provide us with all the same profiling data that we get from the oprofile kernel driver with the exception of certain statistics.  If we choose to convert the perf_events raw sample data to oprofile-format sample files, then very little effort will be needed for post-processing aside from outputting different stats and adding the use of kallsyms for addr-to-symbol resolution for kernel samples.  But if we choose the option of parsing perf_events sample data at post-processing time, then we'll need a second version of arrange_profiles(), as well as perf_events versions for its callees that access the sample files.

============================================================================

Thread: [RFC] Porting oprofile to perf_events

oprofile-list