From: Niels C. <nc...@us...> - 2001-12-03 15:23:58
Hi Bill, don't you just love these rainy Sundays? I have been discussing this with Randy Heisch (LinuxPPC) and Luc Smolders (AIX). Randy does the entry/exit instrumentation from mcount because he can use the Power architecture to do so, as we've discussed earlier. To adjust for the instrumentation itself, he measures an empty call loop to record the cycles it takes and the CPI, then adjusts his results by what the empty loop shows. Randy can do this because he always has the same path length, a consequence of his implementation, which simply uses vast memory buffers and never writes to the trace file during tracing. (If Randy were to use the same type of instrumentation as I do, the path length would vary, he says, because the code inserted by that instrumentation is not always the same on PPC. On IA32 it is the same, but I don't know yet what it is on IA64, which is one of my reasons to postpone the calibration.)

I could do the same as Randy, but because of the enormous data flow, most of our benchmarks would not look anything like themselves if I simply allocated correspondingly huge buffers and postponed the writing to disk. I have not ruled out simply ignoring the cases where the trace code has to handle buffer-full conditions (infrequent in the big picture) and starting every trace with 100 empty function calls, but if I calibrate at recording time I have to be careful not to be interrupted, or the calibration would be useless. I have thought about doing the calibration in user space at post-processing time, but that would make it impossible to do the post-processing on a different machine.

The best solution is probably to do the empty loops to a well-known empty function and do the calibration in the post-processing. That would give me more freedom, reduce the influence on the benchmark, and make the calibration immune to interrupts during data collection. But, for now, I do nothing, which in my opinion isn't too bad, actually.
Most performance work is done by comparative analysis, by which I mean that you try to pinpoint "hot" areas, where "hot" is loosely defined as "something that burns a lot of resources", "something that burns more resources than something else", or maybe "something that has too much unproductive elapsed time". For this type of analysis, we don't really need to adjust for profiling overhead, so we can use what LinProf gives us. I do realize, of course, that for other types of analysis, we can't accept having this overhead included in the measurements.

When I (hopefully soon) get the IA64 port done, I will know more about how this would work on that platform. When I have both Intel platforms working, I promise to implement something like Randy's calibration -- except that I won't just do CPI. I will do whichever counters were selected, but I have to emphasize that the calibration is only reasonably valid until the tracer runs out of buffer space. That, in turn, is another reason to wait for the IA64 port: I expect to have much larger buffers at my disposal on IA64, so I can depend on trace scenarios where we stop tracing when we run out of buffer space -- as is done on LinuxPPC and, in most situations, on AIX.

If this really bothers you, I could postpone the IA64 port and implement the calibration in the IA32 version. The only problem is that I have just one work-week left in 2001. But, I gotta tell ya, Katz' never closes, and the IA64 port is a real challenge, with several of the functions I need to instrument having been moved from C to assembler. And it is a completely different assembler, which I'm only starting to comprehend. Why can't everybody just stick to /370 assembler? There are also some interesting possibilities in the IA64 performance counter implementation. If I can figure out how to take advantage of those, we could get more accurate profiling.
Hopefully, I can also do things that help the LTC IA64 team in their quest to make gcc produce more efficient code. Eventually we'll get there. As I keep annoying u'all by saying: I only just started!

Maybe I should add that when Kernprof is driven by a performance counter, it likewise has no adjustment for the instrumentation. Neither does it calibrate for instrumentation overhead in its flat profiling. In fact, I have never used a tool that adjusted for instrumentation overhead in flat profiling. But the lack of calibration is less important in Kernprof, obviously, since Kernprof cannot report per-function counter values or per-function cycle counts.

I think I'll make curried pork tenderloin with apples for dinner, served over steamed rice and with mango chutney on the side. It is Sunday, after all...

Niels Christiansen
IBM LTC, Kernel Performance

>My instinct tells me that this will reduce the usefulness of the entry/exit perf cntr support.
>Can the tool be modified so that the event counts do not include the tracing overhead ?
>
>Bill Hartner
>IBM Linux Technology Center - Linux Kernel Performance lead
>
>|Yes.
>|
>|Niels
>|>
>|>> John Hawkes and Ray Bryant know I'm working on something. I told
>|>> John I was implementing entry/exit instrumentation, which is part
>|>> of the new tool. Because it is trace-based, I can record stuff
>|>> that Kernprof can not, such as performance counter and cycle
>|>> counter deltas per function. It is pretty cool, actually. The
>|>> overhead is the same order of magnitude as Kernprof acg profiling
>|>> but output is richer. Entry/exit instrumentation is optional, of
>|>> course, and so is every other group of trace points.
>|>
>|>You indicated that entry/exit "overhead is the same order of magnitude as Kernprof".
>|>
>|>Do the performance counter values collected by entry/exit include the overhead you mentioned above ?
>|>
>|>Bill Hartner
>|>IBM Linux Technology Center - Linux Kernel Performance lead