From: Maynard J. <may...@us...> - 2011-10-28 00:01:15
1. Introduction

The design documented herein assumes oprofile userspace will use the perf_events API. With this approach, the existing oprofile kernel driver can be bypassed, eliminating the need for much (if any) new development work in the driver. However, the oprofile kernel driver will probably exist for quite some time to be used by older processors that won't be supported by perf_events.

2. Statistics

Current oprofile statistics differ from the statistics that perf_events can gather (see below). Some of this is due to differences in implementation details, but it's obvious that oprofile simply gathers more stats than does perf_events. If we find a need for more statistics that should (i.e., need to) be exposed to users, we can work with the kernel community to add them.

2.1. OProfile statistics

    enum {
        OPD_SAMPLES,         /**< nr. samples */
        OPD_KERNEL,          /**< nr. kernel samples */
        OPD_PROCESS,         /**< nr. userspace samples */
        OPD_NO_CTX,          /**< nr. samples lost due to not knowing if in the kernel or not */
        OPD_LOST_KERNEL,     /**< nr. kernel samples lost */
        OPD_LOST_SAMPLEFILE, /**< nr. samples for which sample file can't be opened */
        OPD_LOST_NO_MAPPING, /**< nr. samples lost due to no mapping */
        OPD_DUMP_COUNT,      /**< nr. of times buffer is read */
        OPD_DANGLING_CODE,   /**< nr. partial code notifications (buffer overflow) */
        OPD_MAX_STATS        /**< end of stats */
    };

Additionally, top-level stats from /dev/oprofile/stats:
-event_lost_overflow
-sample_lost_no_mapping
-bt_lost_no_mapping
-sample_lost_no_mm

and CPU-level stats from /dev/oprofile/stats:
-sample_lost_overflow
-samples_lost_task_exit
-samples_received
-backtrace_aborted
-sample_invalid_eip

2.2. Perf_events statistics

    struct events_stats {
        u64 total_period;
        u64 total_lost;
        u64 total_invalid_chains;
        u32 nr_events[PERF_RECORD_HEADER_MAX];
        u32 nr_unknown_events;
        u32 nr_invalid_chains;
        u32 nr_unknown_id;
    };

3. Determining perf_events availability

Userspace oprofile should be built with support for both the legacy oprofile kernel driver and for perf_events. A configure check for perf_events.h should be sufficient to enable the building of perf_events-specific code. This implies such code should be surrounded by an #ifdef. If configure does not find perf_events.h, it could simply be that the kernel-headers package isn't installed, not that the running kernel doesn't support perf_events. The configure check should issue a LOUD WARNING (at the end, so it's clearly visible) to indicate that if the kernel version is 2.6.32 or later, the user should install the kernel headers and re-run configure. If perf_events.h is found by configure, then both legacy and perf_events-enabled oprofile code will be built.

The oprofile userspace tools will also need to know at runtime whether or not the running processor has perf_events support. The best way to determine this is to try to invoke the perf_event_open syscall. If it returns -ENOSYS, the syscall is not defined (ergo, no perf_events support). If it returns -ENOENT, the syscall exists, but the running processor type is not yet supported by perf_events.

4. User interface changes for profile setup and run

We'll use the 'perf record' tool (from here on referred to as "perf-record") as a model for the new oprofile user interface for setup and profiling. The intent is to develop a new front-end profiling program -- 'operf' -- that can be used in place of opcontrol. Initially, opcontrol and legacy functionality will still be available even on systems where perf_events is supported. But users will be strongly encouraged to use operf for most circumstances and to use legacy oprofile only for purposes such as comparison with operf output. Perf-record has some options that are obvious candidates for use by operf and some that are not.
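For reference, the runtime detection described in section 3 might be sketched as follows. This is only a sketch, not settled operf code; the helper name and the decision to treat errno values other than ENOSYS/ENOENT as "unavailable" are my own assumptions:

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Sketch of the runtime perf_events availability check from section 3.
 * Returns 1 if perf_events is usable, 0 if the kernel lacks the syscall
 * (or some other error occurred), -1 if the syscall exists but the
 * running processor type is not yet supported by perf_events. */
static int check_perf_events_support(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	/* Monitor our own process on any CPU; no group fd, no flags. */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd >= 0) {
		close(fd);
		return 1;	/* syscall succeeded: perf_events available */
	}
	if (errno == ENOSYS)
		return 0;	/* no perf_event_open in this kernel */
	if (errno == ENOENT)
		return -1;	/* syscall exists, but CPU type unsupported */
	return 0;		/* other errors (e.g. EACCES): treat as unavailable */
}
```

Presumably operf would run something like this once at startup and print a suitable diagnostic for the unsupported-processor case.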
The obvious options are:
-all-cpus
-pid
-force (not a very intuitive name; suggest 'reset' instead)
-verbose
-event/count (suggest combining as follows: '--event=<evt_spec>:<count>')
-call-graph
-output (i.e., output filename. Depending on the decision of how we store sample data (see section 6.3), we may have either a "--output-filename" option or a "--session-dir" option.)
-mmap-pages (sample buffer size)
-command (i.e., application to be profiled)

The options of questionable use for us are:
-tid (post-processing can select an individual thread using a tid specification)
-realtime
-append (doesn't seem to be necessary, since 'perf record' appends by default unless "--force" is used)
-no-delay
-quiet
-no-inherit (we should always have child threads inherit the counter config; if you only want to see the parent's profile data, you can filter on that with a tgid specification in post-processing)
-freq (oprofile has always used sample period (i.e., event 'count'), and, IMHO, the freq stuff in perf just isn't that useful or easy to understand)
-cpu (use a cpu specification in post-processing)
-raw-samples
-stat (showing inherited stats is something that can be done at post-processing time using a tgid specification)
-no-samples
-no-buildid-cache

New options we will add:
-list-events*
-help*

*NOTE: No other options are allowed when either of these options is passed to operf.

And options that are candidates for future enhancements are:
-data
-cgroup
-timestamp

5. Incompatible options

The following opcontrol options will not be necessary (or possible) when using operf:
-init
-start-daemon
-dump
-deinit
-buffer-watershed
-cpu-buffer-size
-image (There's no practical way to do filtering of samples during the profiling phase when using perf_events.)
-vmlinux (I propose to move this option to the post-processing tools to correspond with the perf-report way of doing things.)
-no-vmlinux (I propose we do away with this option. Post-processing should, by default, use kallsyms to do addr-to-symbol resolution for kernel samples; otherwise use the passed vmlinux image.)
-kernel-range (Do we really need this option? Who uses it and for what? If we need to keep it, it will be moved to the post-processing tools since filtering during profiling is not feasible.)
-xen
-active-domains

6. Internal changes for setup and profiling

Both opcontrol and operf must determine the availability of perf_events support. For opcontrol, if perf_events support is available, an informational message should be printed recommending use of operf and indicating that opcontrol will be eliminated in a future release. For operf, detection of perf_events availability is, of course, needed for basic operation.

The new operf program needs to know the processor type for the '--list-events' option, as well as for validating a requested native event and obtaining the proper raw hex code. Currently, the cpu type is given by the oprofile kernel driver via /dev/oprofile/cpu_type. But since the kernel driver is activated only when a user runs 'opcontrol', which will not be necessary (or recommended) when using operf, we'll need an alternate mechanism to determine cpu type.

**NOTE**: All architecture maintainers in the oprofile community are asked to submit an algorithm by which operf can determine processor type in order to map to the specific events and unit_masks files in events/<arch>/<processor_type>.

6.1. Event specification

OProfile users currently specify the event(s) to profile using symbolic names such as BR_MISS_PRED_RETIRED. The perf_events API requires a hex code, which, for some architectures, is the same code as is found in oprofile's 'events' file (e.g., 0xc5 for BR_MISS_PRED_RETIRED). For other architectures (e.g., ppc64), the codes in the events file are not appropriate for use with perf_events.
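For the architectures where the events-file code already matches what perf_events expects, the hand-off from symbolic name to syscall is direct. A sketch (the helper name, sample period, sample_type choices, and the 0xc5 value are illustrative, carried over from the example above):

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Sketch: request a native event by raw hex code (section 6.1).
 * The symbolic-name-to-hex lookup from the events file is assumed
 * to have already happened.  Returns the perf_events fd, or -1. */
static int open_raw_event(unsigned long long raw_code,
                          unsigned long long sample_period, pid_t pid)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_RAW;       /* raw, CPU-specific event code */
	attr.config = raw_code;          /* e.g. 0xc5 for BR_MISS_PRED_RETIRED */
	attr.sample_period = sample_period;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
	attr.disabled = 1;               /* enable explicitly after setup */
	attr.inherit = 1;                /* children inherit, per section 4 */

	/* Profile 'pid' on any CPU; no group fd, no flags. */
	return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}
```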
In such cases, I believe those codes are unused, so we could simply update the events files with the proper perf_events codes. That way, all architectures would be handled the same way.

6.2. Multiplexing of events

Another advantage of perf_events is its intrinsic ability to multiplex events when more events are requested for monitoring than there are physical performance counters. The minor downside to this feature is that the number of samples recorded for a given event may not be accurate, since that event may not have been scheduled on a performance counter for the entire profiling run. This really should *not* be an issue in profiling, since the main idea in analyzing a profile is to look for relative "hot" locations. If knowing the absolute number of events that occur is important, then the user should use an event counting tool, like 'perf stat'. But if there's a way we can detect that multiplexing has taken place and at least issue a warning message, we should do so.

6.3. The profiling daemon

The current oprofile daemon is a separate process that reads sample data from the kernel driver's event buffer and stores the data in a special format in sample files in the filesystem. With oprofile/PE, the daemon function will be replaced with operf, which will execute the passed app, make the perf_event_open syscall, and read the sample data from perf_events by mmap'ing the kernel buffer associated with the file descriptor returned from the syscall. The sample data from perf_events is in a very different format from what we currently get from the oprofile kernel driver. There are at least two methods to choose from:

- At the end of the profiling run, convert perf_events raw sample data to the sample file format we currently use (which we now store in /var/lib/oprofile/samples/current). The resulting sample files would be stored in <current_dir>/oprofile_data/samples/current.

- Store perf_events sample data in its "natural" format in a file akin to perf's "perf.data" file. We could call our file "operf.data". Then we'd need to develop a new post-processing mechanism for parsing operf.data on demand.

One thing to consider when choosing between these options is how we currently support profiling Java. We look for "anon" sample files in the samples directory, then create an ELF file named <pid>.jo and place it in the anon sample directory. If we chose alternative 2 from above (not storing samples in the <current_dir>/oprofile_data/samples/current directory), the <pid>.jo files would have to be stored somewhere independent of operf.data. Then we'd have a synchronization issue to contend with -- i.e., do the *.jo files the post-processing tools find actually correspond with the operf.data file?

6.4. Supporting multiple users simultaneously

Unlike legacy oprofile, where only one user can run oprofile at a time, operf will allow any number of users to run simultaneously. However, early implementations of perf_events (or perhaps the perf tool?) were buggy in the handling of multiple users when one or more of those users was running system-wide profiling. (I have not yet determined whether this problem is actually fixed in later implementations, since I've only recently become aware of it.) We may want to consider disallowing this scenario, or at least warning that the results cannot be trusted.

7. Changes for post-processing tools

Theoretically, perf_events should provide us with all the same profiling data that we get from the oprofile kernel driver, with the exception of certain statistics. If we choose to convert the perf_events raw sample data to oprofile-format sample files, then very little effort will be needed for post-processing aside from outputting different stats and adding the use of kallsyms for addr-to-symbol resolution for kernel samples.
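The kallsyms-based resolution mentioned just above can be quite simple. A sketch (illustrative only; a real implementation would read /proc/kallsyms once and binary-search a sorted table rather than rescanning the file per sample, and restricted kernels may report zeroed addresses to non-root users):

```c
#include <stdio.h>

/* Sketch of addr-to-symbol resolution for kernel samples via
 * /proc/kallsyms (section 7).  Finds the symbol whose start address
 * is the highest one <= addr.  Returns 1 and fills 'name' on
 * success, 0 otherwise. */
static int kallsyms_lookup(unsigned long long addr, char *name, size_t len)
{
	FILE *f = fopen("/proc/kallsyms", "r");
	unsigned long long best = 0, sym_addr;
	char sym_type, sym_name[256];
	int found = 0;

	if (!f)
		return 0;
	/* Each line: <hex addr> <type char> <symbol> [module] */
	while (fscanf(f, "%llx %c %255s", &sym_addr, &sym_type, sym_name) == 3) {
		int c;
		/* Skip the rest of the line (module name, if present). */
		while ((c = fgetc(f)) != '\n' && c != EOF)
			;
		if (sym_addr <= addr && sym_addr >= best) {
			best = sym_addr;
			found = 1;
			snprintf(name, len, "%s", sym_name);
		}
	}
	fclose(f);
	return found;
}
```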
But if we choose the option of parsing perf_events sample data at post-processing time, then we'll need a second version of arrange_profiles(), as well as perf_events versions of its callees that access the sample files.
============================================================================
-Maynard