[perfmon2] perfmon2 syscall interface rationale
From: stephane e. <er...@go...> - 2008-07-01 16:41:29
Hello everyone,

I intend to send the following description to LKML and a few LKML
developers to try to explain the reasoning behind the current syscall
interface for perfmon2. I know there have been a lot of doubts and
misunderstandings as to why we need so many syscalls and how they could
be extended. I have tried to address those concerns here.

Please feel free to comment on it or add to it.

Thanks.

------------------------------------------------------------------------

1) monitoring session breakdown

A monitoring session can be decomposed into a sequence of fundamental
actions, which are as follows:
  - create the session
  - program registers
  - attach to target thread or CPU
  - start monitoring
  - stop monitoring
  - read results
  - detach from thread or CPU
  - terminate session

The order is not necessarily as shown. For instance, the programming may
happen after the session has been attached. Obviously, the start/stop
operations may be repeated before results are read, and results can be
read multiple times.

In the next sections, we examine each action separately.

2) session creation

Perfmon2 supports 2 types of sessions: per-thread or per-CPU (so-called
system-wide).

During the creation of the session, certain attributes are set; they
remain in effect until the session is terminated. For instance, the
per-CPU attribute cannot be changed.

During creation, the kernel state needed to support the session is
allocated and initialized. No PMU hardware is actually accessed.
Permissions to create a session may be checked. Resource limits are also
validated and memory consumption is accounted for.

The software state of the PMU is initialized, i.e., all configuration
registers are set to a quiescent value. Data registers are initialized
to zero whenever possible.

Upon return, the kernel returns a unique identifier which is to be used
for all subsequent actions on the session.
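
The creation step described in section 2 can be pictured with a small
user-space model. This is only a sketch, not perfmon2 code: every name
in it (model_session, model_create_session, MODEL_MAX_SESSIONS,
MODEL_NUM_REGS, ...) is hypothetical, the limits are arbitrary, and the
quiescent value of a configuration register is assumed to be 0. It shows
creation touching only software state, fixing the session type for the
lifetime of the session, enforcing a resource limit, and returning an
identifier for later actions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_SESSIONS 4  /* hypothetical resource limit */
#define MODEL_NUM_REGS     8  /* hypothetical number of PMU registers */

enum model_mode { MODEL_PER_THREAD, MODEL_PER_CPU };

struct model_session {
    int             in_use;
    enum model_mode mode;                 /* set once, at creation */
    uint64_t        pmc[MODEL_NUM_REGS];  /* software copy of config regs */
    uint64_t        pmd[MODEL_NUM_REGS];  /* software copy of data regs */
};

static struct model_session model_table[MODEL_MAX_SESSIONS];

/* Create a session: only software state is touched, no hardware access.
 * Returns a session identifier (>= 0), or -1 if the limit is reached. */
int model_create_session(enum model_mode mode)
{
    for (int id = 0; id < MODEL_MAX_SESSIONS; id++) {
        struct model_session *s = &model_table[id];
        if (s->in_use)
            continue;
        /* Assumption: 0 is the quiescent value for config registers;
         * data registers also start at zero. */
        memset(s, 0, sizeof(*s));
        s->in_use = 1;
        s->mode = mode;
        return id;
    }
    return -1;
}

/* The mode chosen at creation; there is deliberately no setter. */
enum model_mode model_session_mode(int id)
{
    assert(id >= 0 && id < MODEL_MAX_SESSIONS && model_table[id].in_use);
    return model_table[id].mode;
}

/* Terminate a session: release its slot so that no state remains. */
int model_terminate_session(int id)
{
    if (id < 0 || id >= MODEL_MAX_SESSIONS || !model_table[id].in_use)
        return -1;
    memset(&model_table[id], 0, sizeof(model_table[id]));
    return 0;
}
```

In this model the identifier is a table index; in the interface
described later in this message it is a file descriptor.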

3) programming the registers

Programming of the PMU registers can occur at any time during the
lifetime of a session; the session does not need to be attached to a
thread or CPU.

It may be necessary to change the settings, e.g., to monitor another
event or to reset the counts when sampling at the user level. Thus, the
writing of the registers MUST be decoupled from the creation of the
session. Similarly, the writing of configuration registers and of data
registers must also be decoupled, as data registers may be reprogrammed
independently of their configuration registers, for instance when
sampling.

The number of registers varies a lot from one PMU to another. The
relationships between configuration and data registers can be more
complex than just one-to-one. On most PMUs, writing the PMU registers
requires running at the most privileged level, i.e., in the kernel. To
amortize the cost of a system call, it is useful to be able to program
multiple registers in one call. Thus, it must be possible to pass vector
arguments. Of course, for security reasons, the system administrator may
impose a limit on how big vectors can actually be. The advantage is that
vectors can vary in size, and thus the amount of data passed between
application and kernel can be kept to the minimum needed. System call
data needs to be copied into the kernel memory space before it can be
used.

4) attachment and detachment

A session can be attached to a kernel-visible thread or a CPU. If there
is attachment, then it must be possible to detach the session, possibly
to re-attach it to another thread or CPU. Detachment should not require
destroying the session.

There are 3 possibilities for attachment:
  - when the session is created
  - when the monitoring is activated
  - with a dedicated call

If the attachment is done during the creation of the session, then the
target (thread or CPU) needs to exist at that time.
For a CPU-wide session, this means that the session must be created
while executing on that CPU. This does not seem unreasonable, especially
on NUMA systems. For a per-thread session, however, this is more
problematic, as it means it is not possible to prepare the session and
the PMU registers before the thread exists. When monitoring across fork
and pthread_create, it is important to minimize overhead. Creation of a
session can trigger complex memory allocations in the kernel. Thus, it
may be useful to prepare a batch of ready-to-go sessions, which just
need to be attached when the fork or pthread_create notification
arrives.

If the attachment is coupled with the creation of the session, it
implies, by symmetry, that the detachment is coupled with its
destruction. Coupling of detachment with termination is problematic for
both per-thread and CPU-wide modes. With the former, the termination of
a thread is usually totally asynchronous with the termination of the
session by the monitoring tool. The only case where they are
synchronized is for self-monitored threads. When a tool is monitoring a
thread in another process, the termination of that thread will cause the
kernel to detach the session. But the session must not be closed,
because the tool likely wants to read the results and also because the
session still exists for the tool. For CPU-wide mode, there is also an
issue when a monitored CPU is put off-line dynamically. The session
would be detached by the kernel, yet the session would still be live in
the tool, whose controlling thread would have been migrated off of that
CPU.

If the attachment is done when monitoring is activated, then the
detachment is done when monitoring is deactivated. The following
relationships are therefore enforced:

    attached => activated
    stopped  => detached

It is expected that start/stop operations could be very frequent for
self-monitored workloads.
When used to monitor small sections of critical code, e.g., loop
kernels, it is important to minimize overhead, thus the start/stop
should be as simple as possible. Attaching requires loading the PMU
machine state onto the PMU hardware. Conversely, detaching implies
flushing the PMU state to memory so results can be read even after the
termination of a thread, for instance. Both operations are expensive due
to the high cost of accessing the PMU registers. Furthermore, there are
certain PMU models, e.g., Intel Itanium, where it is possible to let
user level code start/stop monitoring with a single instruction. To
minimize overhead, it is very important to allow this mechanism for
self-monitored programs. Yet the session would have to be
attached/detached somehow. With dedicated attach/detach calls, this can
be supported transparently. One possible work-around with the coupled
calls would be to require a system call to attach the session and do the
initial activation; subsequent start/stop could use the lightweight
instruction. The session would be stopped and detached with a system
call.

The dedicated attach/detach calls offer a maximum level of flexibility.
They let applications create sessions in advance or on demand. The
actions on the session, start/stop and attach/detach, are perfectly
symmetrical. The termination of the monitored target can cause its
detachment, but the session remains accessible. Issuing the detach call
on a session already detached by the kernel is harmless. The cost of
start/stop is not impacted. The following properties are enforced:

    upon attachment   => monitoring stopped
    during detachment => monitoring stopped

5) start and stop

It must be possible for an application to start and stop monitoring at
will and at any moment. Start and stop can be called very frequently and
not just at the beginning and end of a session.
This is especially likely for self-monitored threads, where it is
customary to monitor execution of only one function or loop. Thus those
operations can be on the critical path and they must therefore be as
lightweight as possible. See the discussion in the section about
attachment and detachment.

6) reading the results

The results are extracted by reading the PMU registers containing data
(as opposed to configuration). The number of registers of interest can
vary based on the PMU model, the type of measurement, and the events
measured. Reading can occur at regular intervals, e.g., time-based user
level sampling, and can therefore be on the critical path. Thus it must
be as lightweight as possible. Given that the cost is dominated by the
latency of accessing the PMU registers, it is important to only read the
registers that are used. Thus, the call must provide vector arguments,
just like the calls to program the PMU.

It must be possible to read the registers while the session is detached
but also when it is attached to a thread or CPU.

7) termination

Termination of a session means all the associated resources are either
released to the free pool or destroyed. After termination, no state
remains. Termination implies stopping monitoring and detaching the
session if necessary.

For the purpose of termination, one has to differentiate between the
monitored entity and the controlling entity. When a tool monitors a
thread in another process, all the threads from the tool are controlling
entities, and the monitored thread is the monitored entity. Any entity
can vanish at any time. If the monitored entity terminates voluntarily,
i.e., normal exit, or involuntarily, e.g., core dump, the kernel simply
detaches the session but does not destroy it. Until the last controlling
entity disappears, the session remains accessible. There are situations
where all the controlling entities disappear before the monitored
entity.
In this case, the session becomes useless, since results cannot be
extracted; thus the session enters the zombie state. It will eventually
be detached and its resources will be reclaimed by the kernel, i.e., the
session will be terminated.

8) extensibility

There is already a vast diversity among existing PMU models, and this is
unlikely to change. Quite to the contrary, it is envisioned that the PMU
will become a true value-add and that vendors will therefore try to
differentiate themselves from one another. Moreover, the PMU will remain
closely tied to the underlying micro-architecture. Therefore, it is very
important to ensure that the monitoring interface will be able to adapt
easily to future PMU models and their extended features, i.e., what is
offered beyond counting events.

It is important to realize that extensibility is not limited to
supporting more PMU registers. It also includes supporting advanced
sampling features or socket-level PMUs as opposed to just core-level
PMUs.

It may be necessary to extend the system calls with new generic or
architecture-specific parameters, and to do so without simply adding new
system calls.

9) current perfmon2 interface

The perfmon2 interface design is guided by the principles described in
the previous sections. We now explain each call in detail.

a) session creation

    int pfm_create_session(struct pfarg_ctx *ctx, char *smpl_name,
                           void *smpl_arg, size_t arg_size);

The function creates the perfmon session and returns a file descriptor
used to manipulate the session thereafter.
The call takes several parameters, which are as follows:
  - ctx: pointer to a pfarg_ctx structure encapsulating all session
    parameters (see below)
  - smpl_name: used when sampling to designate which format to use
  - smpl_arg: points to format-specific arguments
  - arg_size: size of the structure passed in smpl_arg

The pfarg_ctx structure is defined as follows:
  - flags: generic and arch-specific flags for the session
  - reserved: reserved for future extensions

To provide for future extensions, the pfarg_ctx structure contains
reserved fields. Reserved fields must be zeroed.

To create a per-CPU session, the value PFM_CTX_SYSTEM_WIDE must be
passed in flags.

When in-kernel sampling is not used, smpl_name, smpl_arg and arg_size
must all be 0.

b) programming the registers

    int pfm_write_pmcs(int fd, struct pfarg_pmc *pmcs, int n);
    int pfm_write_pmds(int fd, struct pfarg_pmd *pmds, int n);

These calls are provided to program the configuration and data
registers, respectively. The parameters are as follows:
  - fd: file descriptor identifying the session
  - pmcs: pointer to pfarg_pmc structures
  - pmds: pointer to pfarg_pmd structures
  - n: number of elements in the pmcs or pmds vector

It is possible to pass vectors of pfarg_pmc or pfarg_pmd structures. The
minimal size is 1; the maximum size is determined by the system
administrator.

The pfarg_pmc structure is defined as follows:

    struct pfarg_pmc {
        u16 reg_num;
        u64 reg_value;
        u64 reserved[];
    };

The pfarg_pmd structure is defined as follows:

    struct pfarg_pmd {
        u16 reg_num;
        u64 reg_value;
        u64 reserved[];
    };

Although both structures are currently identical, they will differ as
more functionality is added, so it is better to create two versions from
the start.

Provisions for extensions are provided by the reserved field in each
structure.

c) attachment and detachment

    int pfm_load_context(int fd, struct pfarg_load *ld);
    int pfm_unload_context(int fd);

The session is identified by the file descriptor, fd.

To attach, the targeted thread or CPU must be provided.
For extensibility purposes, the target is passed in a structure which is
defined as follows:

    struct pfarg_load {
        u32 target;
        u64 reserved[];
    };

In per-thread mode, the target field must be set to the kernel thread
identification (gettid()).

In per-CPU mode, the target field must be set to the logical CPU
identification as seen by the kernel. Furthermore, the caller must be
running on the CPU to monitor, otherwise the call fails.

Extensions can be implemented using the reserved field.

d) start and stop

    int pfm_start(int fd);
    int pfm_stop(int fd);

The session is identified by the file descriptor fd.

Currently no other parameters are supported for those calls.

e) reading results

    int pfm_read_pmds(int fd, struct pfarg_pmd *pmds, int n);

The session is identified by the file descriptor fd.

Just like for programming the registers, it is possible to pass vectors
of structures in pmds. The number of elements is passed in n.

f) termination

    int close(int fd);

To terminate a session, the file descriptor has to be closed. The
semantics of file descriptor sharing apply, so if another reference to
the session exists, i.e., another file descriptor, the session will only
be effectively destroyed once that reference disappears.

Of course, the kernel does close all file descriptors on process
termination, thus the associated sessions will eventually be destroyed.

In per-CPU mode, it is not necessary, though recommended, to be on the
monitored CPU to issue this call.
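
To make the ordering of the calls in section 9 concrete, here is a
sketch of a self-monitored session run against a tiny in-process mock.
It is not the perfmon2 implementation and does not issue any system
call: all mock_* names are hypothetical stand-ins that only mirror the
shape of the prototypes above, the length of the reserved arrays is an
arbitrary choice (the text leaves it unspecified), the thread id 1234 is
made up, and "counting whenever configuration register 0 is non-zero" is
a convention of the mock. A stand-in for pfm_write_pmds is omitted
because it would mirror mock_pfm_write_pmcs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#define MOCK_NUM_REGS 4  /* hypothetical PMU size */
#define MOCK_RSV      4  /* arbitrary: the text leaves reserved[] unsized */
#define MOCK_FD       3  /* fake descriptor returned by the mock */

struct mock_pfarg_ctx  { uint32_t flags;  uint64_t reserved[MOCK_RSV]; };
struct mock_pfarg_load { uint32_t target; uint64_t reserved[MOCK_RSV]; };
struct mock_pfarg_pmc { uint16_t reg_num; uint64_t reg_value; uint64_t reserved[MOCK_RSV]; };
struct mock_pfarg_pmd { uint16_t reg_num; uint64_t reg_value; uint64_t reserved[MOCK_RSV]; };

/* Hypothetical mock: one in-process "session" standing in for kernel state. */
static struct {
    int open, attached, running;
    uint64_t pmc[MOCK_NUM_REGS], pmd[MOCK_NUM_REGS];
} mock;

static int mock_ok(int fd) { return mock.open && fd == MOCK_FD; }

int mock_pfm_create_session(struct mock_pfarg_ctx *ctx, char *smpl_name,
                            void *smpl_arg, size_t arg_size) {
    if (!ctx || mock.open || smpl_name || smpl_arg || arg_size) return -1;
    memset(&mock, 0, sizeof(mock));  /* one session, no sampling support */
    mock.open = 1;
    return MOCK_FD;
}

int mock_pfm_write_pmcs(int fd, struct mock_pfarg_pmc *pmcs, int n) {
    if (!mock_ok(fd) || n < 1) return -1;
    for (int i = 0; i < n; i++) {
        if (pmcs[i].reg_num >= MOCK_NUM_REGS) return -1;
        mock.pmc[pmcs[i].reg_num] = pmcs[i].reg_value;
    }
    return 0;
}

int mock_pfm_read_pmds(int fd, struct mock_pfarg_pmd *pmds, int n) {
    if (!mock_ok(fd) || n < 1) return -1;
    for (int i = 0; i < n; i++) {
        if (pmds[i].reg_num >= MOCK_NUM_REGS) return -1;
        pmds[i].reg_value = mock.pmd[pmds[i].reg_num];
    }
    return 0;
}

int mock_pfm_load_context(int fd, struct mock_pfarg_load *ld) {
    if (!mock_ok(fd) || !ld || mock.attached) return -1;
    mock.attached = 1;  /* running is 0 here: attached but stopped */
    return 0;
}

int mock_pfm_unload_context(int fd) {
    if (!mock_ok(fd)) return -1;
    mock.attached = mock.running = 0;  /* detaching twice is harmless */
    return 0;
}

int mock_pfm_start(int fd) {
    if (!mock_ok(fd) || !mock.attached) return -1;
    mock.running = 1;
    return 0;
}
int mock_pfm_stop(int fd) {
    if (!mock_ok(fd)) return -1;
    mock.running = 0;
    return 0;
}
int mock_close(int fd) {
    if (!mock_ok(fd)) return -1;
    memset(&mock, 0, sizeof(mock));  /* implies stop and detach */
    return 0;
}

/* Stand-in for monitored code: counts only while running and programmed. */
void mock_work(uint64_t events) {
    if (mock.running && mock.pmc[0] != 0) mock.pmd[0] += events;
}

uint64_t mock_self_monitoring_demo(void) {
    struct mock_pfarg_ctx  ctx = { 0 };  /* reserved fields zeroed */
    struct mock_pfarg_pmc  pc  = { .reg_num = 0, .reg_value = 1 };
    struct mock_pfarg_pmd  pd  = { .reg_num = 0 };
    struct mock_pfarg_load ld  = { .target = 1234 };  /* made-up thread id */
    int fd = mock_pfm_create_session(&ctx, NULL, NULL, 0);
    if (fd < 0 || mock_pfm_write_pmcs(fd, &pc, 1)
        || mock_pfm_load_context(fd, &ld) || mock_pfm_start(fd))
        return UINT64_MAX;
    mock_work(5);                    /* counted */
    mock_pfm_stop(fd);
    mock_work(7);                    /* not counted: monitoring stopped */
    mock_pfm_unload_context(fd);
    mock_pfm_read_pmds(fd, &pd, 1);  /* reading while detached is allowed */
    mock_close(fd);
    return pd.reg_value;
}
```

In this mock, mock_self_monitoring_demo() returns 5: the 7 events
generated after the stop are not counted, and the read succeeds even
though the session has already been detached.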