1. Introduction
The design documented herein assumes oprofile userspace will use the perf_events API.  With this approach, the existing oprofile kernel driver can be bypassed, eliminating the need for much (if any) new development work in the driver.  However, the oprofile kernel driver will probably exist for quite some time to be used by older processors that won’t be supported by perf_events.

2. Statistics
Current oprofile statistics differ from the statistics that perf_events can gather (see below).  Some of this is due to differences in implementation details, but oprofile simply gathers more statistics than perf_events does.  If we find a need for more statistics that need to be exposed to users, we can work with the kernel community to add them.

2.1. OProfile statistics
enum {  OPD_SAMPLES, /**< nr. samples */
        OPD_KERNEL, /**< nr. kernel samples */
        OPD_PROCESS, /**< nr. userspace samples */
        OPD_NO_CTX, /**< nr. samples lost due to not knowing if in the kernel or not */
        OPD_LOST_KERNEL,  /**< nr. kernel samples lost */
        OPD_LOST_SAMPLEFILE, /**< nr. samples for which sample file can't be opened */
        OPD_LOST_NO_MAPPING, /**< nr. samples lost due to no mapping */
        OPD_DUMP_COUNT, /**< nr. of times buffer is read */
        OPD_DANGLING_CODE, /**< nr. partial code notifications (buffer overflow) */
        OPD_MAX_STATS /**< end of stats */
};

Additionally, top level stats from /dev/oprofile/stats:
   -event_lost_overflow
   -sample_lost_no_mapping
   -bt_lost_no_mapping
   -sample_lost_no_mm


and CPU-level stats from /dev/oprofile/stats:
   -sample_lost_overflow
   -samples_lost_task_exit
   -samples_received
   -backtrace_aborted
   -sample_invalid_eip

2.2. Perf_events statistics
struct events_stats {
    u64 total_period;
    u64 total_lost;
    u64 total_invalid_chains;
    u32 nr_events[PERF_RECORD_HEADER_MAX];
    u32 nr_unknown_events;
    u32 nr_invalid_chains;
    u32 nr_unknown_id;
};

3. Determining perf_events availability
Userspace oprofile should be built with support for both the legacy oprofile kernel driver and perf_events.  A configure check for the perf_events header (linux/perf_event.h) should be sufficient to enable the building of perf_events-specific code.  This implies such code should be surrounded by an #ifdef.  If configure does not find the header, it could simply be that the kernel-headers package isn't installed, not that the running kernel doesn't support perf_events.  In that case, configure should issue a LOUD WARNING (at the end, so it's clearly visible) indicating that if the kernel version is 2.6.32 or later, the user should install the kernel headers and re-run configure.  If the header is found by configure, then both legacy and perf_events-enabled oprofile code will be built.
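
A minimal sketch of the compile-time guard described above.  The macro name HAVE_PERF_EVENTS and the helper op_build_mode() are hypothetical; the real macro would be whatever the configure check ends up defining.

```c
/* HAVE_PERF_EVENTS is a hypothetical macro that configure would define
 * when the perf_events header is found. */
#ifdef HAVE_PERF_EVENTS
#include <linux/perf_event.h>
#define OP_BUILT_WITH_PERF_EVENTS 1
#else
#define OP_BUILT_WITH_PERF_EVENTS 0
#endif

/* Report which code paths were compiled into this build. */
const char *op_build_mode(void)
{
    return OP_BUILT_WITH_PERF_EVENTS ? "legacy + perf_events" : "legacy only";
}
```

Built without the macro defined, op_build_mode() reports "legacy only"; the same guard would wrap every perf_events-specific source file or function.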

The oprofile userspace tools will also need to know at runtime whether or not the running processor has perf_events support.  The best way to determine if perf_events support is available for the running processor is to try to invoke the perf_event_open syscall.  If it returns -ENOSYS, the syscall is not defined (ergo, no perf_events support).  If it returns -ENOENT, it means the syscall exists, but the running processor type is not yet supported by perf_events.
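
The runtime probe could look roughly like the sketch below.  From userspace, a failing syscall() returns -1 and puts the error code in errno, so the check is made against errno.  The enum and function names are hypothetical, and the choice of a disabled, user-only cycles counter as the probe event is an assumption, not a settled design.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Hypothetical result codes for the probe. */
enum pe_support {
    PE_AVAILABLE,       /* syscall succeeded */
    PE_NO_SYSCALL,      /* ENOSYS: kernel has no perf_event_open */
    PE_CPU_UNSUPPORTED, /* ENOENT: syscall exists, processor not supported */
    PE_OTHER_ERROR      /* anything else (permissions, etc.) */
};

/* Map the (fd, errno) pair from the probe to a result code. */
enum pe_support classify_probe(long fd, int err)
{
    if (fd >= 0)
        return PE_AVAILABLE;
    switch (err) {
    case ENOSYS: return PE_NO_SYSCALL;
    case ENOENT: return PE_CPU_UNSUPPORTED;
    default:     return PE_OTHER_ERROR;
    }
}

/* Try to open a disabled cycles counter on the calling thread. */
enum pe_support probe_perf_events(void)
{
    struct perf_event_attr attr;
    long fd;
    int err;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        /* never actually start counting */
    attr.exclude_kernel = 1;  /* assumption: lowers the privilege needed */

    /* pid = 0 (this thread), cpu = -1 (any), no group, no flags */
    fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    err = errno;
    if (fd >= 0)
        close((int)fd);
    return classify_probe(fd, err);
}
```

Anything other than the two errors named above is lumped into PE_OTHER_ERROR here; operf would probably want to report those cases separately.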

4. User interface changes for profile setup and run
We’ll use the ‘perf record’ tool (from here on, referred to as “perf-record”) as a model for the new oprofile user interface for setup and profiling. The intent is to develop a new front-end profiling program -- 'operf' -- that can be used in place of opcontrol.  Initially, opcontrol and legacy functionality will still be available even on systems where perf_events is supported.  But users will be strongly encouraged to use operf in most circumstances and to use legacy oprofile only for purposes such as comparison to operf output.

Perf-record has some options that are obvious candidates to be used by operf and some that are not.

The obvious options are:
   -all-cpus
   -pid
   -force (not a very intuitive name; suggest 'reset' instead)
   -verbose
   -event/count (suggest to combine as follows: '--event=<evt_spec>:<count>')
   -call-graph
   -output (i.e., output filename.  Depending on the decision of how we store sample data (see section 6.3), we may have either a "--output-filename" option or a "--session-dir" option.)
   -mmap-pages (sample buffer size)
   -command (i.e., application to be profiled)
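
As an illustration of the combined '--event=<evt_spec>:<count>' form suggested above, here is a minimal parsing sketch.  The function name is hypothetical, and splitting at the last ':' is an assumption (it leaves room for colons inside <evt_spec> itself).

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Split "<evt_spec>:<count>" at the last ':'.
 * On success, copies the event spec into name, stores the count, and
 * returns 0.  Returns -1 on malformed input; outputs are then untouched. */
int parse_event_arg(const char *arg, char *name, size_t name_len,
                    unsigned long *count)
{
    const char *colon = strrchr(arg, ':');
    unsigned long val;
    char *end;
    size_t n;

    if (!colon || colon == arg)
        return -1;                      /* no ':' or empty event spec */
    if (!isdigit((unsigned char)colon[1]))
        return -1;                      /* count missing or signed */
    n = (size_t)(colon - arg);
    if (n >= name_len)
        return -1;                      /* event spec too long */

    errno = 0;
    val = strtoul(colon + 1, &end, 10);
    if (errno != 0 || *end != '\0' || val == 0)
        return -1;                      /* overflow, trailing junk, or zero */

    memcpy(name, arg, n);
    name[n] = '\0';
    *count = val;
    return 0;
}
```

For example, "BR_MISS_PRED_RETIRED:100000" yields the name BR_MISS_PRED_RETIRED and a count of 100000, while an argument with no count is rejected.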

The options of questionable use for us are:
   -tid (post-processing can select on an individual thread using a tid specification)
   -realtime
   -append (doesn't seem to be necessary, since 'perf record' appends by default unless "--force" is used)
   -no-delay
   -quiet
   -no-inherit (we should always have child threads inherit counter config; if you only want to see parent profile data, you can filter on that with a tgid specification in post-processing)
   -freq (oprofile has always used sample period (i.e., event ‘count’), and, IMHO, the freq stuff in perf just isn’t that useful or easy to understand)
   -cpu (use a cpu specification in post-processing)
   -raw-samples
   -stat (showing inherited stats is something that can be done at post-processing time using tgid specification)
   -no-samples
   -no-buildid-cache

New options we will add:
   -list-events*
   -help*

*NOTE: No other options allowed when passing this option to operf.


And options that are candidates for future enhancements are:
   -data
   -cgroup
   -timestamp

5. Incompatible options
The following opcontrol options will not be necessary (or possible) when using operf:
   -init
   -start-daemon
   -dump
   -deinit
   -buffer-watershed
   -cpu-buffer-size
   -image (There's no practical way to filter samples during the profiling phase when using perf_events.)
   -vmlinux (I propose to move this option to the post-processing tools to correspond with the perf-report way of doing things.)
   -no-vmlinux (I propose we do away with this option.  Post-processing should, by default, use kallsyms to do addr-to-symbol resolution for kernel samples; otherwise use the passed vmlinux image.)
   -kernel-range (Do we really need this option?  Who uses it and what for?  If we need to keep it, it will be moved to the post-processing tools since filtering during profiling is not feasible.)
   -xen
   -active-domains

6. Internal changes for setup and profiling
Both opcontrol and operf must determine the availability of perf_events support.  For opcontrol, if perf_events support is available, an informational message should be printed, recommending use of operf and indicating that opcontrol will be eliminated in a future release.  For operf, detection of perf_events availability is, of course, needed for basic operation.

The new operf program needs to know the processor type for the '--list-events' option, as well as for validating a requested native event and obtaining the proper raw hex code.  Currently, the cpu type is given by the oprofile kernel driver via /dev/oprofile/cpu_type.  But since the kernel driver is activated only when a user runs 'opcontrol', which will not be necessary (or recommended) when using operf, we'll need an alternate mechanism to determine cpu type.

*NOTE*:  All architecture maintainers in the oprofile community are asked to submit an algorithm by which operf can determine processor type in order to map to the specific events and unit_masks files in events/<arch>/<processor_type>

6.1. Event specification
OProfile users currently specify the event(s) to profile using symbolic names such as BR_MISS_PRED_RETIRED.  The perf_events API requires a hex code, which, for some architectures, is the same code as is found in oprofile's 'events' file (e.g., 0xc5 for BR_MISS_PRED_RETIRED).  For other architectures (e.g., ppc64), the codes in the events file are not appropriate for use with perf_events.  In such cases, I believe those codes are unused, so we could simply update the events files with the proper perf_events codes.  That way, all architectures would be handled the same way.
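
A rough sketch of how a symbolic name might be turned into a perf_events request.  The struct, table, and function names are hypothetical; a real table would be loaded from the events files rather than hard-coded, and unit masks are left out entirely.  Using PERF_TYPE_RAW with the hex code in 'config' is an assumption about how operf would pass native events.

```c
#include <string.h>
#include <linux/perf_event.h>

struct op_event {
    const char *name;
    unsigned long long code;
};

/* Illustrative table only.  0xc5 for BR_MISS_PRED_RETIRED is the
 * example given in the text above. */
static const struct op_event op_events[] = {
    { "BR_MISS_PRED_RETIRED", 0xc5 },
};

/* Look up 'name' and set up 'attr' for sampling every 'count' events.
 * Returns 0 on success, -1 if the name is unknown. */
int op_fill_attr(const char *name, unsigned long long count,
                 struct perf_event_attr *attr)
{
    size_t i;

    for (i = 0; i < sizeof(op_events) / sizeof(op_events[0]); i++) {
        if (strcmp(op_events[i].name, name) != 0)
            continue;
        memset(attr, 0, sizeof(*attr));
        attr->size = sizeof(*attr);
        attr->type = PERF_TYPE_RAW;          /* native (raw) event code */
        attr->config = op_events[i].code;
        attr->sample_period = count;         /* oprofile-style 'count' */
        attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
        return 0;
    }
    return -1;
}
```

The filled-in attr would then be handed to the perf_event_open syscall.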

6.2. Multiplexing of events
Another advantage of perf_events is its intrinsic ability to multiplex events when more events are requested for monitoring than there are physical performance counters.  The minor downside to this feature is that the number of samples occurring for a given event may not be accurate, since that event may not have been scheduled on a performance counter for the entire profiling run.  This really should *not* be an issue in profiling, since the main idea in analyzing a profile is to look for relative "hot" locations.  If knowing the absolute number of events that occur is important, then the user should use an event counting tool, like 'perf stat'.  But if there's a way we can detect that multiplexing has taken place and at least issue a warning message, we should do so.
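
One possible way to detect multiplexing, sketched under assumptions: if the event is opened with read_format set to PERF_FORMAT_TOTAL_TIME_ENABLED | PERF_FORMAT_TOTAL_TIME_RUNNING, a read() on the file descriptor returns the count followed by the two times, and a running time shorter than the enabled time means the event was not on a counter for the whole run.  The struct and function names below are hypothetical.

```c
#include <stdint.h>
#include <unistd.h>

/* Layout returned by read() for a single (non-group) event when
 * read_format requests both total-time fields. */
struct pe_read_values {
    uint64_t value;         /* event count */
    uint64_t time_enabled;  /* ns the event was enabled */
    uint64_t time_running;  /* ns the event was actually on a counter */
};

/* Read the current values; returns 0 on success, -1 on a short read. */
int pe_read_values(int fd, struct pe_read_values *v)
{
    return read(fd, v, sizeof(*v)) == (ssize_t)sizeof(*v) ? 0 : -1;
}

/* Non-zero if the event spent part of its enabled time off the PMU. */
int pe_was_multiplexed(const struct pe_read_values *v)
{
    return v->time_running < v->time_enabled;
}
```

operf could run this check for each event at the end of the profiling run and print a warning for any event where it fires.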

6.3. The profiling daemon
The current oprofile daemon is a separate process that reads sample data from the kernel driver's event buffer and stores the data in a special format in sample files in the filesystem.  With oprofile/PE (i.e., oprofile using perf_events), the daemon function will be replaced with operf, which will execute the passed app, make the perf_event_open syscall and read the sample data from perf_events by mmap'ing the kernel buffer associated with the file descriptor returned from the syscall.
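
A small sketch of the mmap step.  perf_events expects the mapping to be one metadata page plus a power-of-two number of data pages, so the length has to be computed accordingly.  The function names are hypothetical, and 'data_pages' would presumably come from an option such as mmap-pages.

```c
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

/* Length of the mapping: 1 metadata page + data_pages data pages.
 * Returns 0 if data_pages is not a power of two. */
size_t pe_mmap_length(size_t data_pages, size_t page_size)
{
    if (data_pages == 0 || (data_pages & (data_pages - 1)) != 0)
        return 0;
    return (data_pages + 1) * page_size;
}

/* Map the sample buffer for a perf_event_open file descriptor.
 * Returns the base address (the metadata page) or NULL on failure. */
void *pe_map_buffer(int fd, size_t data_pages, size_t *len_out)
{
    size_t len = pe_mmap_length(data_pages, (size_t)sysconf(_SC_PAGESIZE));
    void *base;

    if (len == 0)
        return NULL;
    /* PROT_WRITE lets userspace update the read position (data_tail). */
    base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return NULL;
    *len_out = len;
    return base;
}
```

With 8 data pages of 4096 bytes, the mapping is 9 pages (36864 bytes); the first page holds the buffer metadata and the rest hold the records.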


The sample data from perf_events is in a very different format from what we currently get from the oprofile kernel driver.  There are at least two methods to choose from:

   - At the end of the profiling run, convert perf_events raw sample data to the sample file format we currently use (which we now store in /var/lib/oprofile/samples/current).  The resulting sample files would be stored in <current_dir>/oprofile_data/samples/current.
             or
   - Store perf_events sample data in the "natural" format in a file akin to perf's "perf.data" file.  We could call our file "operf.data".  Then we'd need to develop a new post-processing mechanism for parsing operf.data on-demand.
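
Whichever method is chosen, the raw perf_events data has to be walked record by record; each record starts with a struct perf_event_header giving its type and total size.  The sketch below tallies record types in a flat buffer.  The struct and function names are hypothetical, and it assumes the records have already been copied out of the ring buffer into contiguous memory.

```c
#include <stddef.h>
#include <string.h>
#include <linux/perf_event.h>

struct op_record_counts {
    unsigned long samples;  /* PERF_RECORD_SAMPLE records */
    unsigned long lost;     /* PERF_RECORD_LOST records (not lost events) */
    unsigned long mmaps;    /* PERF_RECORD_MMAP records */
    unsigned long comms;    /* PERF_RECORD_COMM records */
    unsigned long other;    /* everything else */
};

/* Walk 'len' bytes of records.  Returns 0 if the buffer holds a whole
 * number of well-formed records, -1 if it is truncated or corrupt. */
int op_count_records(const unsigned char *buf, size_t len,
                     struct op_record_counts *c)
{
    size_t off = 0;

    memset(c, 0, sizeof(*c));
    while (off + sizeof(struct perf_event_header) <= len) {
        struct perf_event_header hdr;

        memcpy(&hdr, buf + off, sizeof(hdr));  /* avoid unaligned access */
        if (hdr.size < sizeof(hdr) || hdr.size > len - off)
            return -1;
        switch (hdr.type) {
        case PERF_RECORD_SAMPLE: c->samples++; break;
        case PERF_RECORD_LOST:   c->lost++;    break;
        case PERF_RECORD_MMAP:   c->mmaps++;   break;
        case PERF_RECORD_COMM:   c->comms++;   break;
        default:                 c->other++;   break;
        }
        off += hdr.size;
    }
    return off == len ? 0 : -1;
}
```

A converter to oprofile-format sample files and an on-demand parser for an operf.data-style file would both be built around a loop of this shape, differing only in what they do with each record.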

One thing to consider when choosing between these options is how we currently support profiling Java.  We look for "anon" sample files in the samples directory, then create an ELF file named <pid>.jo and place it in the anon sample directory.  If we chose alternative 2 from above (storing samples in operf.data, with no <current_dir>/oprofile_data/samples/current directory), the <pid>.jo file would have to be stored somewhere independent of operf.data.  Then we'd have a synchronization issue to contend with -- i.e., do the *.jo files the post-processing tools find actually correspond with the operf.data file?

6.4. Supporting multiple users simultaneously
Unlike legacy oprofile where only one user can run oprofile at a time, operf will allow any number of users to run simultaneously.  However, early implementations of perf_events (or perhaps the perf tool?) were buggy in the handling of multiple users when one or more of those users was running system-wide profiling.  (I have not yet determined if this problem is actually fixed in later implementations since I've only recently become aware of it.)  We may want to consider disallowing this scenario, or at least warning that the results cannot be trusted.

7. Changes for post-processing tools
Theoretically, perf_events should provide us with all the same profiling data that we get from the oprofile kernel driver with the exception of certain statistics.  If we choose to convert the perf_events raw sample data to oprofile-format sample files, then very little effort will be needed for post-processing aside from outputting different stats and adding the use of kallsyms for addr-to-symbol resolution for kernel samples.  But if we choose the option of parsing perf_events sample data at post-processing time, then we'll need a second version of arrange_profiles(), as well as perf_events versions for its callees that access the sample files.
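
For the kallsyms part, a minimal sketch of addr-to-symbol resolution.  Each line of /proc/kallsyms has the form "<hex address> <type> <name>", optionally followed by a module name in brackets.  The struct and function names are hypothetical, and the linear lookup is only for illustration; a real tool would sort the table and search it.  Note that the kernel may show all-zero addresses to unprivileged readers, which this sketch does not handle.

```c
#include <stddef.h>
#include <stdio.h>

struct ksym {
    unsigned long long addr;
    char type;
    char name[128];
};

/* Parse one kallsyms line; returns 0 on success, -1 if malformed. */
int parse_kallsyms_line(const char *line, struct ksym *out)
{
    return sscanf(line, "%llx %c %127s",
                  &out->addr, &out->type, out->name) == 3 ? 0 : -1;
}

/* Find the symbol with the highest start address that is <= addr.
 * Returns NULL if addr lies below every symbol. */
const struct ksym *ksym_lookup(const struct ksym *syms, size_t n,
                               unsigned long long addr)
{
    const struct ksym *best = NULL;
    size_t i;

    for (i = 0; i < n; i++)
        if (syms[i].addr <= addr && (!best || syms[i].addr > best->addr))
            best = &syms[i];
    return best;
}
```

Post-processing would load the table once and call ksym_lookup() for each kernel sample address when no vmlinux image is passed.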

============================================================================

-Maynard