[perfmon2] perfmon2 syscall interface rationale

Hello everyone,

I intend to send the following description to LKML and a few LKML developers
to explain the reasoning behind the current syscall interface for perfmon2.

I know there have been a lot of doubts and misunderstandings as to why we need
so many syscalls and how they could be extended. I have tried to address those
concerns here.

Please feel free to comment or add to it.

Thanks.

-----------------------------------------------------------------------------------------------------------------------

1) monitoring session breakdown

A monitoring session can be decomposed into a sequence of fundamental actions,
which are as follows:
       - create the session
       - program registers
       - attach to target thread or CPU
       - start monitoring
       - stop monitoring
       - read results
       - detach from thread or CPU
       - terminate session

The order is not necessarily as shown. For instance, the programming may happen
after the session has been attached. Obviously, the start/stop operations may be
repeated before results are read, and results can be read multiple times.

In the next sections, we examine each action separately.

2) session creation

  Perfmon2 supports 2 types of sessions: per-thread or per-CPU (so-called
  system-wide).

  During the creation of the session, certain attributes are set; they remain
  until the session is terminated. For instance, the per-CPU attribute cannot
  be changed.

  During creation, the kernel state to support the session is allocated and
  initialized. No PMU hardware is actually accessed. Permissions to create a
  session may be checked. Resource limits are also validated and memory
  consumption is accounted for.

  The software state of the PMU is initialized, i.e., all configuration
  registers are set to a quiescent value. Data registers are initialized to
  zero whenever possible.

  On success, the kernel returns a unique identifier, which is to be used for
  all subsequent actions on the session.

3) programming the registers

  Programming of the PMU registers can occur at any time during the lifetime of
  a session; the session does not need to be attached to a thread or CPU.

  It may be necessary to change the settings, e.g., to monitor another event or
  to reset the counts when sampling at the user level. Thus, the writing of the
  registers MUST be decoupled from the creation of the session.

  Similarly, the writing of configuration and data registers must also be
  decoupled, as data registers may be reprogrammed independently of their
  configuration registers, for instance when sampling.

  The number of registers varies a lot from one PMU to another. The
  relationships between configuration and data registers can be more complex
  than just one-to-one. On most PMUs, writing the PMU registers requires
  running at the most privileged level, i.e., in the kernel. To amortize the
  cost of a system call, it is desirable to be able to program multiple
  registers in one call. Thus, it must be possible to pass vector arguments.
  Of course, for security reasons, the system administrator may impose a limit
  on how big vectors can actually be. The advantage is that vectors can vary in
  size, and thus the amount of data passed between the application and the
  kernel can be kept to the minimum needed. System call data needs to be copied
  into the kernel memory space before it can be used.
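
  As a rough illustration of the vector argument idea, here is a minimal
  user-space sketch in C. It is not the perfmon2 kernel code. The names
  model_write_regs, MODEL_NUM_REGS and MODEL_MAX_VECTOR, the error values, and
  the choice to validate the whole vector before applying any of it are all
  assumptions made for this example, and memcpy() merely stands in for the copy
  into kernel memory.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model, not the perfmon2 kernel code. */
#define MODEL_NUM_REGS   16   /* registers in the modeled PMU (assumed) */
#define MODEL_MAX_VECTOR  8   /* administrator-imposed limit (assumed) */

struct model_reg {
    uint16_t reg_num;
    uint64_t reg_value;
};

uint64_t model_pmc[MODEL_NUM_REGS];   /* software copy of the config registers */

/* Program n registers in one call; returns 0 or a negative errno value. */
int model_write_regs(const struct model_reg *user_vec, size_t n)
{
    struct model_reg kbuf[MODEL_MAX_VECTOR];
    size_t i;

    assert(user_vec != NULL);
    if (n == 0)
        return -EINVAL;
    if (n > MODEL_MAX_VECTOR)
        return -E2BIG;

    /* Stand-in for copying the arguments into kernel memory */
    memcpy(kbuf, user_vec, n * sizeof(kbuf[0]));

    /* Validate every entry first so a bad one leaves no partial update */
    for (i = 0; i < n; i++)
        if (kbuf[i].reg_num >= MODEL_NUM_REGS)
            return -EINVAL;

    for (i = 0; i < n; i++)
        model_pmc[kbuf[i].reg_num] = kbuf[i].reg_value;
    return 0;
}
```

  In this model, a caller programming two registers passes a two-element
  vector and pays for one call; only the two entries are copied.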

4) attachment and detachment

 A session can be attached to a kernel-visible thread or a CPU. If there is
 attachment, then it must be possible to detach the session, possibly to
 re-attach it to another thread or CPU. Detachment should not require
 destroying the session.

 There are 3 possibilities for attachment:
       - when the session is created
       - when the monitoring is activated
       - with a dedicated call

 If the attachment is done during the creation of the session, then the target
 (thread or CPU) needs to exist at that time. For a CPU-wide session, this
 means that the session must be created while executing on that CPU. This does
 not seem unreasonable, especially on NUMA systems.

 For a per-thread session, however, this is a bit more problematic, as it means
 it is not possible to prepare the session and the PMU registers before the
 thread exists. When monitoring across fork and pthread_create, it is important
 to minimize overhead. Creation of a session can trigger complex memory
 allocations in the kernel. Thus, it may be useful to prepare a batch of
 ready-to-go sessions, which just need to be attached when the fork or
 pthread_create notification arrives.

 If the attachment is coupled with the creation of the session, it implies, by
 symmetry, that the detachment is coupled with its destruction. Coupling
 detachment with termination is problematic for both per-thread and CPU-wide
 modes. With the former, the termination of a thread is usually totally
 asynchronous with the termination of the session by the monitoring tool. The
 only case where they are synchronized is for self-monitored threads. When a
 tool is monitoring a thread in another process, the termination of that thread
 will cause the kernel to detach the session. But the session must not be
 closed, because the tool likely wants to read the results and also because the
 session still exists for the tool. For CPU-wide mode, there is also an issue
 when a monitored CPU is taken offline dynamically. The session would be
 detached by the kernel, yet it would still be live in the tool, whose
 controlling thread would have been migrated off that CPU.

 If the attachment is done when monitoring is activated, then the detachment is
 done when monitoring is deactivated. The following relationships are therefore
 enforced:

       attached => activated
       stopped  => detached

 It is expected that start/stop operations could be very frequent for
 self-monitored workloads. When used to monitor small sections of critical
 code, e.g., loop kernels, it is important to minimize overhead, thus the
 start/stop should be as simple as possible.

 Attaching requires loading the PMU machine state onto the PMU hardware.
 Conversely, detaching implies flushing the PMU state to memory so results can
 be read even after the termination of a thread, for instance. Both operations
 are expensive due to the high cost of accessing the PMU registers.

 Furthermore, there are certain PMU models, e.g., Intel Itanium, where it is
 possible to let user-level code start/stop monitoring with a single
 instruction. To minimize overhead, it is very important to allow this
 mechanism for self-monitored programs. Yet the session would have to be
 attached/detached somehow. With dedicated attach/detach calls, this can be
 supported transparently. One possible workaround with the coupled calls would
 be to require a system call to attach the session and do the initial
 activation; subsequent start/stop could use the lightweight instruction. The
 session would be stopped and detached with a system call.

 The dedicated attach/detach calls offer a maximum level of flexibility. They
 let applications create sessions in advance or on demand. The actions on the
 session, start/stop and attach/detach, are perfectly symmetrical. The
 termination of the monitored target can cause its detachment, but the session
 remains accessible. Issuing the detach call on a session already detached by
 the kernel is harmless.

 The cost of start/stop is not impacted.

 The following properties are enforced:
       upon attachment   => monitoring stopped
       during detachment => monitoring stopped
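
 The two properties above, together with the harmless repeated detach, can be
 sketched as a tiny user-space state model in C. This is only an illustration
 under stated assumptions: struct model_session, the model_* function names and
 the error values are hypothetical, and the rule that start/stop fail on a
 detached session is a choice made for the sketch, not something taken from
 the perfmon2 sources.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a session; not perfmon2 kernel code. */
struct model_session {
    bool attached;   /* bound to a thread or CPU */
    bool running;    /* monitoring active */
    int  target;     /* thread id or CPU number, meaningful when attached */
};

int model_attach(struct model_session *s, int target)
{
    assert(s != NULL);
    if (s->attached)
        return -EBUSY;           /* must detach before re-attaching (assumed) */
    s->attached = true;
    s->target = target;
    s->running = false;          /* upon attachment => monitoring stopped */
    return 0;
}

int model_start(struct model_session *s)
{
    if (!s->attached)
        return -EINVAL;          /* nothing to monitor yet (assumed rule) */
    s->running = true;           /* no state load here: start stays cheap */
    return 0;
}

int model_stop(struct model_session *s)
{
    if (!s->attached)
        return -EINVAL;
    s->running = false;
    return 0;
}

int model_detach(struct model_session *s)
{
    s->running = false;          /* during detachment => monitoring stopped */
    s->attached = false;         /* harmless if already detached */
    return 0;
}
```

 Because attach and detach carry the expensive state transfer in this model,
 start and stop reduce to flipping a flag, and a detached session can be
 re-attached to a different target without being recreated.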

5) start and stop

 It must be possible for an application to start and stop monitoring at will
 and at any moment. Start and stop can be called very frequently and not just
 at the beginning and end of a session. This is especially likely for
 self-monitored threads, where it is customary to monitor execution of only one
 function or loop. Thus those operations can be on the critical path, and they
 must therefore be as lightweight as possible. See the discussion in the
 section about attachment and detachment.

6) reading the results

 The results are extracted by reading the PMU registers containing data (as
 opposed to configuration). The number of registers of interest can vary based
 on the PMU model, the type of measurement, and the events measured.

 Reading can occur at regular intervals, e.g., time-based user-level sampling,
 and can therefore be on the critical path. Thus, it must be as lightweight as
 possible. Given that the cost is dominated by the latency of accessing the PMU
 registers, it is important to read only the registers that are used. Thus, the
 call must provide vector arguments, just like the calls to program the PMU.

 It must be possible to read the registers while the session is detached but
 also when it is attached to a thread or CPU.
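
 To make the cost argument concrete, the following C sketch models a vectored
 read that touches only the registers the caller names. Everything in it is
 hypothetical: the array model_hw_pmd stands in for the hardware data
 registers, model_hw_reads counts the simulated (expensive) accesses, and the
 function name and return values are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NUM_PMDS 8                /* size of the modeled PMU (assumed) */

struct model_pmd {
    uint16_t reg_num;
    uint64_t reg_value;
};

uint64_t model_hw_pmd[MODEL_NUM_PMDS];  /* stand-in for hardware data registers */
unsigned model_hw_reads;                /* counts simulated register accesses */

/* Fill reg_value for each requested register; returns 0, or -1 on a bad number. */
int model_read_pmds(struct model_pmd *vec, size_t n)
{
    size_t i;

    assert(vec != NULL || n == 0);

    /* Reject the whole request before touching any register */
    for (i = 0; i < n; i++)
        if (vec[i].reg_num >= MODEL_NUM_PMDS)
            return -1;

    for (i = 0; i < n; i++) {
        vec[i].reg_value = model_hw_pmd[vec[i].reg_num];
        model_hw_reads++;               /* one access per requested register */
    }
    return 0;
}
```

 A tool sampling two counters on an eight-register PMU would incur two
 simulated accesses per read in this model rather than eight.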

7) termination

 Termination of a session means all the associated resources are either
 released to the free pool or destroyed. After termination, no state remains.
 Termination implies stopping monitoring and detaching the session if
 necessary.

 For the purpose of termination, one has to differentiate between the monitored
 entity and the controlling entity. When a tool monitors a thread in another
 process, all the threads of the tool are controlling entities, and the
 monitored thread is the monitored entity. Any entity can vanish at any time.

 If the monitored entity terminates voluntarily, i.e., normal exit, or
 involuntarily, e.g., core dump, the kernel simply detaches the session; the
 session is not destroyed.

 Until the last controlling entity disappears, the session remains accessible.

 There are situations where all the controlling entities disappear before the
 monitored entity. In this case, the session becomes useless because results
 can no longer be extracted, so the session enters the zombie state. It will
 eventually be detached and its resources will be reclaimed by the kernel,
 i.e., the session will be terminated.
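
 The rules above amount to a small state machine, sketched here in C as a
 user-space model. The type and function names (life_session,
 life_controller_gone, life_detach) are hypothetical, and counting controlling
 entities with a plain integer is an assumption made for the example, not a
 description of the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

enum life_state { LIFE_ACTIVE, LIFE_ZOMBIE, LIFE_TERMINATED };

struct life_session {
    int  controllers;     /* controlling entities still holding the session */
    bool attached;        /* still attached to a monitored entity */
    enum life_state state;
};

/* Recompute the state from the two facts that determine it. */
static void life_update(struct life_session *s)
{
    if (s->controllers > 0)
        s->state = LIFE_ACTIVE;       /* results can still be read */
    else if (s->attached)
        s->state = LIFE_ZOMBIE;       /* unreachable, cleanup pending */
    else
        s->state = LIFE_TERMINATED;   /* resources reclaimed */
}

/* A controlling entity drops its reference or exits. */
void life_controller_gone(struct life_session *s)
{
    assert(s->controllers > 0);
    s->controllers--;
    life_update(s);
}

/* The monitored entity exits, or the kernel finally detaches a zombie. */
void life_detach(struct life_session *s)
{
    s->attached = false;
    life_update(s);
}
```

 In this model, a monitored thread exiting first leaves the session active
 until the tool lets go of it, while the tool exiting first produces a zombie
 that is terminated as soon as the detach happens.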

8) extensibility

  There is already a vast diversity among existing PMU models, and this is
  unlikely to change. Quite to the contrary, it is envisioned that the PMU will
  become a true value-add and that vendors will therefore try to differentiate
  themselves from one another. Moreover, the PMU will remain closely tied to
  the underlying micro-architecture. Therefore, it is very important to ensure
  that the monitoring interface will be able to adapt easily to future PMU
  models and their extended features, i.e., what is offered beyond counting
  events.

  It is important to realize that extensibility is not limited to supporting
  more PMU registers. It also includes supporting advanced sampling features or
  socket-level PMUs as opposed to just core-level PMUs.

  It may be necessary to extend the system calls with new generic or
  architecture-specific parameters, and to do so without simply adding new
  system calls.

9) current perfmon2 interface

  The perfmon2 interface design is guided by the principles described in the
  previous sections. We now explain each call in detail.

  a) session creation

     int pfm_create_session(struct pfarg_ctx *ctx, char *smpl_name,
                            void *smpl_arg, size_t arg_size);

     The function creates the perfmon session and returns a file descriptor
     used to manipulate the session thereafter.

     The call takes several parameters, which are as follows:
        - ctx: encapsulates all session parameters in a pfarg_ctx structure
          (see below)
        - smpl_name: used when sampling to designate which format to use
        - smpl_arg: points to format-specific arguments
        - arg_size: size of the structure passed in smpl_arg

     The pfarg_ctx structure is defined as follows:
        - flags: generic and arch-specific flags for the session
        - reserved: reserved for future extensions

     To provide for future extensions, the pfarg_ctx structure contains
     reserved fields. Reserved fields must be zeroed.
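
     The point of the zeroing rule is that a later version can give a reserved
     field a meaning without misreading arguments from older programs. A
     minimal C sketch of such a check follows. The layout of struct model_ctx,
     the size of its reserved array, the MODEL_KNOWN_FLAGS mask and the
     additional rejection of unknown flag bits are all assumptions for
     illustration; the real pfarg_ctx layout is not given here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: field sizes and the reserved count are assumed. */
struct model_ctx {
    uint32_t flags;
    uint32_t reserved[7];
};

#define MODEL_KNOWN_FLAGS 0x1u   /* flag bits this model understands (assumed) */

/* Reject arguments that use bits or fields this version does not define. */
int model_check_ctx(const struct model_ctx *ctx)
{
    size_t i;

    assert(ctx != NULL);
    if (ctx->flags & ~MODEL_KNOWN_FLAGS)
        return -EINVAL;
    for (i = 0; i < sizeof(ctx->reserved) / sizeof(ctx->reserved[0]); i++)
        if (ctx->reserved[i] != 0)
            return -EINVAL;
    return 0;
}
```

     A caller that zero-initializes the whole structure before setting flags
     passes this check today and keeps passing it when reserved fields are
     later given a meaning.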

     To create a per-CPU session, the value PFM_CTX_SYSTEM_WIDE must be passed
     in flags.

     When in-kernel sampling is not used, smpl_name, smpl_arg, and arg_size
     must be 0.

  b) programming the registers

     int pfm_write_pmcs(int fd, struct pfarg_pmc *pmcs, int n);
     int pfm_write_pmds(int fd, struct pfarg_pmd *pmds, int n);

     The calls are provided to program the configuration and data registers
     respectively. The parameters are as follows:
        - fd: file descriptor identifying the session
        - pmcs: pointer to pfarg_pmc structures
        - pmds: pointer to pfarg_pmd structures
        - n: number of elements in the pmcs or pmds vector

     It is possible to pass a vector of pfarg_pmc or pfarg_pmd structures. The
     minimum size is 1; the maximum size is determined by the system
     administrator.

     The pfarg_pmc structure is defined as follows:
     struct pfarg_pmc {
        u16 reg_num;
        u64 reg_value;
        u64 reserved[];
     };

     The pfarg_pmd structure is defined as follows:
     struct pfarg_pmd {
        u16 reg_num;
        u64 reg_value;
        u64 reserved[];
     };

     Although both structures are currently identical, they will differ as more
     functionality is added, so it is better to create two versions from the
     start.

     Provisions for extensions are provided by the reserved field in each
     structure.

  c) attachment and detachment

     int pfm_load_context(int fd, struct pfarg_load *ld);
     int pfm_unload_context(int fd);

     The session is identified by the file descriptor, fd.

     To attach, the targeted thread or CPU must be provided. For extensibility
     purposes, the target is passed in a structure which is defined as follows:
     struct pfarg_load {
        u32 target;
        u64 reserved[];
     };
     In per-thread mode, the target field must be set to the kernel thread
     identifier (as returned by gettid()).

     In per-CPU mode, the target field must be set to the logical CPU
     identifier as seen by the kernel. Furthermore, the caller must be running
     on the CPU to monitor, otherwise the call fails.

     Extensions can be implemented using the reserved field.

  d) start and stop

     int pfm_start(int fd);
     int pfm_stop(int fd);

     The session is identified by the file descriptor fd.

     Currently no other parameters are supported for those calls.

  e) reading results

     int pfm_read_pmds(int fd, struct pfarg_pmd *pmds, int n);

     The session is identified by the file descriptor fd.

     Just like for programming the registers, it is possible to pass vectors of
     structures in pmds. The number of elements is passed in n.

  f) termination

     int close(int fd);

     To terminate a session, the file descriptor has to be closed. The
     semantics of file descriptor sharing apply, so if another reference to the
     session, i.e., another file descriptor, exists, the session will only be
     effectively destroyed once that reference disappears.
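
     These are the ordinary POSIX descriptor semantics, which can be observed
     with any kernel object. The C sketch below uses a pipe as a stand-in for
     a session, purely as an analogy: the function name last_close_releases is
     hypothetical, and no perfmon2 call is involved. It checks that the object
     survives the close of one reference and is released only after the last
     one is closed.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Returns 1 if the write side of a pipe outlives the close of one descriptor
 * and disappears only after the last one, 0 if not, -1 on error.
 */
int last_close_releases(void)
{
    int p[2];
    int dupfd;
    char c = 'x';
    ssize_t n;

    if (pipe(p) < 0)
        return -1;

    dupfd = dup(p[1]);          /* second reference to the write end */
    close(p[1]);                /* first reference gone */
    if (dupfd < 0) {
        close(p[0]);
        return -1;
    }

    /* The object is still live: data flows through the remaining reference */
    if (write(dupfd, &c, 1) != 1 || read(p[0], &c, 1) != 1) {
        close(dupfd);
        close(p[0]);
        return -1;
    }

    close(dupfd);               /* last reference gone */
    n = read(p[0], &c, 1);      /* with no writer left, read() reports EOF */
    close(p[0]);
    return n == 0;
}
```

     By analogy, a tool that duplicates or passes on a session descriptor keeps
     the session alive until every copy has been closed.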

     Of course, the kernel closes all file descriptors on process termination,
     thus the associated sessions will eventually be destroyed.

     In per-CPU mode, it is not necessary, though recommended, to be on the
     monitored CPU to issue this call.