Re: [Lse-tech] Re: perfmon interface

On Thu, Apr 14, 2005 at 03:30:57AM -0700, Stephane Eranian wrote:
> David,
> 
> > > I went back and forth on this. If you look at the latest version of the
> > > document, I think I try to explain why it is not very efficient to treat
> > > the notification message queue as a byte stream. Because it forces the
> > > application to issue two reads to extract a message: 1 to get the type
> > > and a second to read the body. 
> > 
> > Well, that's a consequence of merging the two buffers, thereby making
> > the messages of variable size, not a consequence of treating the
> > notification queue as a byte stream per se.
> > 
> > But if we allowed mmap() access to the notification buffer, then we
> > don't need two syscalls() after all.  Just read the data from memory,
> > then one lseek() to consume the message.
> > 
> It is true that if the message queue was mmapped, the 
> two read() calls would be replaced by a load for the type and possibly
> loads for the rest. 
> 
> In the case where the application does all the sampling at the user
> level, it would mmap the notification queue, that's a page right there.
> If the application is using a buffer format with buffer remapping, then
> it would issue an mmap for the notification queue and one for the buffer.
> OTOH, and I think that is your point, you could say the first case is like
> the second but where the buffer size is actually zero. In other words,
> the mmap offset for the buffer and the message queue would be, in
> effect, the same. That sounds appealing to me. Now my problem would be to
> support legacy IA-64 applications which would still use the read model,
> but I think this could work. I will see how this can be made to work
> with the code I have.

Ok, I'm no longer entirely sure we're talking about the same thing
here, and I'm not sure which things you think are a good idea and
which you don't.  To clarify, there are two separate ideas here:

1) To allow the notification queue to be accessed with mmap(), and
consumed with lseek() as a lower-overhead way of getting the messages.  I
wouldn't envisage removing the read() mechanism - an application could
use either approach to read the messages from the stream, depending
which was suitable (or even a combination of both).

2) To merge the notification queue and sampling buffer into a single
data stream.  The only connection with idea 1 is that using both
together would mitigate some of the potential performance problems
with idea 2.

My impression is that you're convinced (1) is a good idea, but I'm not
so sure what your current position on (2) is.  For (2) supporting
legacy ia64 programs would, indeed, be a tricky problem.
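
For concreteness, idea (1) might look roughly like this from the
application side.  This is only a sketch under assumptions: the header
layout (struct msg_hdr) and the helper names peek_msg() and
consume_msg() are made up for illustration and are not part of any
existing perfmon interface; only lseek() itself is a real call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical message header; the real layout is not settled here. */
struct msg_hdr {
	uint32_t type;	/* message type */
	uint32_t size;	/* total message size in bytes, header included */
};

/*
 * Look at the next message in the mmap()ed queue without a syscall.
 * 'map' points at the current read position, 'avail' is the number of
 * valid bytes from there.  Returns the number of bytes to consume, or
 * 0 if no complete, well-formed message is available yet.
 */
static inline uint32_t peek_msg(const unsigned char *map, size_t avail,
				struct msg_hdr *hdr)
{
	assert(map && hdr);
	if (avail < sizeof(*hdr))
		return 0;
	memcpy(hdr, map, sizeof(*hdr));
	if (hdr->size < sizeof(*hdr) || hdr->size > avail)
		return 0;
	return hdr->size;
}

/* Consume 'n' bytes from the stream with a single syscall. */
static inline off_t consume_msg(int fd, uint32_t n)
{
	return lseek(fd, (off_t)n, SEEK_CUR);
}
```

The application would select() or poll() on the descriptor as before,
call peek_msg() on the mapping, process the body straight from memory,
and then call consume_msg() once.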

> > Obviously the sample formats would have to be written such that the
> > single sample buffer has enough meta-information that it can be
> > unambiguously parsed, but that's no big deal.
> > 
> Yes.
> 
> > > Also without a read, I wonder how the application would be waiting/polling
> > > for any new events. Perfmon does not use signals to notify of new events.
> > > The read/poll/select can be used. Should a signal be necessary, then it is
> > > set up normally by requesting ownership of the resource via some fcntl().
> > > In the end you can receive a SIGIO on notification events, then you know
> > > there is something in the notification message queue.
> > 
> > Well, the stream is still there, so the application can still use
> > select().  It's just that after the select() or poll() returns it will
> > get the data from the mmap()ed region and then lseek() instead of
> > doing a read().  Or alternatively, if it suits the application's
> > structure, it could do a small (blocking) read() to get the header of
> > the next block of sampling data, then read any extra information from
> > the mmap.
> 
> Yes, I think you could still poll/select and then go to the mmaped area
> to get the actual data.
> 
> > > I would add, it is useful for self-monitoring where it is possible
> > > to read the PMD from the user level directly. Moreover, it is only
> > > useful if we know that the counter is going to exceed its hardware
> > > width, i.e. a direct read is not enough.
> > 
> > Or if we're using the start/sum approach to avoid unnecessary writes
> > to the PMDs.  According to Mikael this is very important on some x86
> > models.
> 
> I am not sure I understand the start/sum approach you are talking about.
> The x86 variants all have this problem that it is very expensive to
> read and write the PMU registers. Writes are a killer on context switch out
> because there is no other way to stop monitoring but to write
> the perfsel registers. Maybe on P4, there is something else.

The impression I've gotten from Mikael and the perfctr code is that
apparently on at least some CPUs, reads are substantially less
expensive than writes.  Hence, the code is organized so as to avoid
writes, replacing them with reads.  So, to maintain a virtualized
counter for a particular task/thread, instead of writing the
virtualized value back to the hardware counter whenever the task becomes
active, it uses the following sequence:

When switching to a monitored task:
	start = hardware pmc value

When switching away from a monitored task:
	sum += (hardware pmc value - start)

When sampling a monitored task during its time slice:
	sum += (hardware pmc value - start)
	start = hardware pmc value

At any point when the monitored task is running, the current
virtualized counter value is (sum + (hardware value - start)).  This
achieves a fully virtualized counter value without any writes to the
hardware counter.  The perfctr mmap()ed window gives both the start
and sum values for each active counter, so a self-monitoring task can
compute the full software counter value itself.
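
As a rough C sketch of the sequence above: the struct and function
names here are made up for illustration and are not perfctr's actual
code, and the hardware value is assumed to be passed in as a
free-running 64-bit number (for a narrower counter the difference
would have to be masked to the counter width).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-counter state, one per virtualized counter. */
struct vcounter {
	uint64_t start;	/* hardware value at last switch-in or sample */
	uint64_t sum;	/* events accumulated over finished intervals */
};

/* Switching to the monitored task: remember the hardware value. */
static inline void vcounter_switch_in(struct vcounter *c, uint64_t hw)
{
	assert(c != NULL);
	c->start = hw;
}

/* Switching away from the monitored task: fold the interval into sum. */
static inline void vcounter_switch_out(struct vcounter *c, uint64_t hw)
{
	c->sum += hw - c->start;
}

/* Sampling during the time slice: fold in, then restart the interval. */
static inline void vcounter_sample(struct vcounter *c, uint64_t hw)
{
	c->sum += hw - c->start;
	c->start = hw;
}

/* Current virtualized value while the task runs; nothing is written. */
static inline uint64_t vcounter_read(const struct vcounter *c, uint64_t hw)
{
	return c->sum + (hw - c->start);
}
```

None of these helpers writes to the hardware counter; vcounter_read()
is the computation a self-monitoring task would do from the mmap()ed
start and sum values.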

Obviously this technique cannot be used for counters generating
overflow interrupts, since the hardware and software values must be
congruent in that case.  For such counters (i-mode counters, in
perfctr terminology) perfctr falls back to writing the hardware
counter value when switching to the monitored task.

The same technique can be used where counter writes are not just
expensive, but impossible, such as for the tsc/timebase.

> > > BTW, I got rid of PFM_SET_CONFIG/PFM_GET_CONFIG as you suggested.
> > > I use a combination of /proc and /proc/sys. 
> > 
> > Excellent.  Is there somewhere I can grab these latest test versions?
> > 
> I will try to make the current sources available next week. It would be nice
> if you could attempt the ppc32/ppc64 port. I am confident the core perfmon
> would support this now.

Excellent.  I will be occupied at linux.conf.au most of next week, but
I'll see what I can do.  With any luck I will make some progress in
the week after.

> > > I will be changing the type of reg_value for pfarg_pmc_t
> > > to "unsigned long" from "uint64_t". I think for all the PMUs I looked at, the
> > > PMCs are always as wide as an unsigned long. For PMDs, the type must
> > > remain uint64_t. Can you confirm this fact for PPC32 and PPC64?
> > 
> > No, no no!  Don't do this!  Yes, things would fit on ppc32 and ppc64,
> > but having the size here variable is a really bad idea.  It means that
> > 32-bit applications on a 64-bit kernel (very common on ppc64 and
> > x86_64) will have a different notion of the structure to the kernel,
> > so we would need to implement an ugly translation layer to support
> > them.  And even that wouldn't work properly for >32bit registers being
> > accessed from 32-bit apps.  Using fixed-width types everywhere will
> > save us a lot of pain in the ABI later.  Speaking of which if there
> > are any unsigned longs in the interface at present, get rid of them.
> > In fact get rid of any unsigned ints as well - I believe they're
> > reliably 32-bit on all current platforms, but there's no guarantee
> > that will always be the case.
> > 
> Well, I would like you to take a look at the data structures defined in the
> document. I tried hard to make them fixed size. Yet some of them use size_t.
> The mmap offset would introduce an off_t. Both are likely defined as "unsigned long".
> There is also a big problem with the sample entry structure for the default
> format. Think about the instruction pointer, this one has to be defined
> as unsigned long (or uintptr_t). This works for both ILP32 and LP64 systems.
> However we have a problem for LP64 kernels with ILP32 applications, such as a
> ppc32 monitoring tool trying to decode the mmap sampling buffer written by a ppc64
> kernel. There is no automatic way to tell a compiler: "I am an ILP32 application but
> I would like to use the 64-bit version of certain structures". At best you need to have some
> #define to force the pre-processor to pick up the right data structure. In the
> case of ILP32 running on LP64, it would need to pick up a sample entry format
> where the instruction pointer is defined as uint64_t. But having it forced
> to uint64_t would be overkill (space) when running on an ILP32 kernel.
> If you think of the mmap offset, most likely there is 32<->64 bit emulation
> for an ILP32 running on an LP64 kernel. The offset can only be 32 bit in that
> mode.

Ah, yes, all those things could cause problems.  I'll take a closer
look when I get the chance.
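
To make the fixed-width option concrete, here is a sketch of a sample
entry whose layout is the same for an ILP32 application and an LP64
kernel.  The struct and its field names are hypothetical, chosen only
for illustration; they are not taken from the perfmon document.  The
cost is the one you mention: ip takes 8 bytes even on a 32-bit kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample entry using only fixed-width types. */
struct sample_entry_fixed {
	uint64_t ip;		/* instruction pointer, always 64 bits */
	uint64_t tstamp;	/* timestamp */
	uint32_t pid;		/* process id */
	uint32_t cpu;		/* cpu the sample was taken on */
};

/*
 * With the 64-bit fields first and the 32-bit fields paired up, there
 * is no padding, so both ABIs agree on size and offsets.
 */
static_assert(sizeof(struct sample_entry_fixed) == 24,
	      "sample entry must be 24 bytes on every ABI");
static_assert(offsetof(struct sample_entry_fixed, pid) == 16,
	      "pid must sit at offset 16 on every ABI");
```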

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/people/dgibson