Please forgive the late comments on this thread. I've been trying to
extract myself from Nepal... After 100 km of walking, I'm on a plane for
I agree with Eric and the others that in order to do true 'system-wide'
monitoring, perfmon expects far too much from the user code.
I think that we should consider supporting two modes, system-wide-per-CPU
and system-wide-all-CPUs. Perhaps these two could be encapsulated into a
CPU bitmask used to create the context... where the kernel would do all the
IPIs and/or affinity setting for the user. It's pretty clear that each
mode has its advantages, both for different user communities (oprofile
for example) and for potential performance impacts on large NUMA/SMP
architectures. But to leave out IPI-enabled true system-wide monitoring
seems to be a mistake at this stage of the game...
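Roughly what I have in mind, as a sketch only (the ctx_cpumask field and the
way it is consumed below are made up for illustration; nothing like it exists
in the current perfmon2 API):

/* hypothetical: the caller hands the kernel a cpumask and the kernel does
 * whatever IPI and/or affinity work is needed behind the scenes
 * (needs _GNU_SOURCE and <sched.h> for cpu_set_t)
 */
cpu_set_t cpus;
CPU_ZERO(&cpus);
CPU_SET(3, &cpus);                  /* system-wide-per-CPU: one bit set */
/* ... or set every online CPU for system-wide-all-CPUs */

ctx.ctx_flags = PFM_FL_SYSTEM_WIDE;
memcpy(&ctx.ctx_cpumask, &cpus, sizeof(cpus));   /* invented field */
ret = pfm_create_context(&ctx);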
My 2 rupees,
On Tue, 2006-04-11 at 07:49 -0700, Stephane Eranian wrote:
> On Mon, Apr 03, 2006 at 09:08:24AM -0700, Stephane Eranian wrote:
> > We are having multiple conversations in parallel. I'd like
> > to summarize the important points/comments/questions still pending:
> > 1- justify current setup for system-wide monitoring, i.e., no IPI
> System-wide monitoring means that you are monitoring all threads across
> all CPUs in the system. The PMU setup is not part of the thread's machine
> state; it survives context switches. This is used to get a global view
> of how the system is doing. Measuring system-wide does not mean that you
> only look at the kernel; you can measure at the user level as well.
> OProfile is an example of a system-wide profiler that can filter the
> collected information down to individual processes.
> System-wide monitoring is supported by the perfmon2 interface. However
> the approach is different from the one taken by OProfile, perfctr,
> or VTUNE.
> Perfmon2 knows ONLY about CPU-wide monitoring, i.e., monitoring
> all activities on one CPU only. To cover a 4-way machine, it is necessary
> to create 4 perfmon contexts and bind each to a particular CPU. In other
> words, the perfmon2 implementation never propagates PMU programming
> across PMUs, and there is no IPI. Applications implement system-wide monitoring
> by forking or by creating threads and pinning one on each CPU to monitor.
> At first, this may look like a burden on applications, but it can easily be
> encapsulated into a library.
> How does this work?
> A system-wide context is bound to a particular CPU during the call
> to pfm_load_context(). There is no explicit parameter saying, I want
> to bind to CPUx. Instead, we use the CPU that pfm_load_context() executes
> on. This means that the caller must ensure it runs on the right CPU
> before making the call. This may look kind of risky but we did not have
> to invent yet another affinity call (note: as a safety, we could pass the
> CPU to monitor and have the kernel check that the thread is running on
> that CPU).
> All calls requiring access to the actual PMU registers must be executed
> on the monitored CPU. In other words, after the pfm_load_context(),
> the caller thread can migrate anywhere, but it has to return to
> the monitored CPU to read/write PMU registers or start/stop monitoring.
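> For instance, to program the registers and then start counting, a tool would
> do something along these lines (sketch only; fd stands for the context file
> descriptor and cpu_mask for the monitored CPU, as in the example below, and
> the pfm_write_pmcs()/pfm_write_pmds()/pfm_start() calls are assumed to keep
> their usual perfmon2 names):
>
> /* come back to the monitored CPU before touching its PMU */
> sched_setaffinity(gettid(), sizeof(cpu_mask), &cpu_mask);
>
> pfm_write_pmcs(fd, pmcs, npmcs);   /* program the config (PMC) registers */
> pfm_write_pmds(fd, pmds, npmds);   /* program the data (PMD) registers   */
> pfm_start(fd, NULL);               /* start counting on this CPU         */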
> When sampling, the sampling buffer is allocated during the call
> to pfm_create_context(). As such, the affinity must be adjusted
> before the call to ensure that the buffer is allocated on the right
> cell in a NUMA system. Supposing CPU3 is to be monitored:
> pfarg_ctx_t  ctx;                   /* perfmon2 context/load arguments */
> pfarg_load_t load;
> unsigned long cpu_mask = 1UL << 3;  /* monitor CPU3 */
>
> /* pin ourselves on CPU3 so the context and its buffer are created there */
> sched_setaffinity(gettid(), sizeof(cpu_mask), &cpu_mask);
>
> memset(&ctx, 0, sizeof(ctx));
> ctx.ctx_flags = PFM_FL_SYSTEM_WIDE;
> ret = pfm_create_context(&ctx);
> memset(&load, 0, sizeof(load));
> pfm_load_context(ctx.ctx_fd, &load);
> This setup must be repeated on all CPUs to monitor. Therefore, there
> is one sampling buffer per CPU and it is allocated in cell-local
> memory (assuming the right numactl settings).
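> A monitoring library can hide this repetition behind one pinned thread per
> monitored CPU, roughly as follows (sketch only; monitor_one_cpu() is a
> made-up helper that would run the create/load sequence above for the CPU
> it is pinned on):
>
> static void *monitor_one_cpu(void *arg)
> {
>         int cpu = (int)(long)arg;
>         unsigned long mask = 1UL << cpu;
>
>         /* pin this thread on the CPU it is in charge of */
>         sched_setaffinity(gettid(), sizeof(mask), &mask);
>
>         /* then pfm_create_context()/pfm_load_context() as shown above,
>          * followed by the tool's sampling loop
>          */
>         return NULL;
> }
>
> for (cpu = 0; cpu < ncpus_to_monitor; cpu++)
>         pthread_create(&tids[cpu], NULL, monitor_one_cpu, (void *)(long)cpu);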
> For profiles, should the tool need to aggregate results, it can do so
> at the user level. Sampling on each CPU is totally isolated from sampling
> on the other CPUs. Buffer overflows are also treated independently.
> To resume monitoring after a buffer overflow, the pfm_restart() call is
> used and it must be executed on the monitored CPU (because it accesses
> PMU registers).
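> Each per-CPU monitoring thread can therefore run its own notification loop,
> along these lines (sketch only; assumes the overflow message is read from the
> context file descriptor as a pfm_msg_t and process_samples() stands for the
> tool's own buffer parsing):
>
> pfm_msg_t msg;
>
> for (;;) {
>         /* block until the kernel posts an overflow notification for this CPU */
>         if (read(fd, &msg, sizeof(msg)) != sizeof(msg))
>                 break;
>
>         process_samples();   /* walk this CPU's sampling buffer */
>         pfm_restart(fd);     /* resume monitoring; runs on the monitored CPU */
> }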
> Overall, the architecture is designed to enforce CPU affinity and data
> locality. We believe this is a good compromise between complexity
> and scalability.
> What motivated this architecture?
> 1- desire to keep the kernel code as simple as possible
> Keep the code as close as possible between per-thread and CPU-wide modes.
> No IPI broadcasting across processors or cells in NUMA machines. No copying
> of data structures around. No kernel-level synchronization.
> 2- do not assume that all CPUs are always monitored
> As the number of cores increases, it is very likely that they will not all
> run the same workload. The IPI model works well when we assume that all CPUs
> monitor the same thing at the same time, i.e., just a broadcast.
> Tools may be interested only in monitoring a subset of CPUs where the workload
> of interest actually runs. With a large number of cores, it is likely that workloads
> will use strong affinity anyway.
> Allowing subsets of CPUs to be monitored also cuts down on the amount
> of data collected, especially when sampling. You just monitor what is of interest.
> Multiple non-overlapping sessions can co-exist.
> It is also possible to monitor different things at the same time for a given
> multi-threaded workload and thus avoid multiple runs.
> No automatic aggregation of counts or samples. Leave this to the user level.
> 3 - desire to scale well on NUMA machines
> On NUMA machines, it is important to enforce CPU and memory locality as remote
> accesses are very expensive.