From: Niels C. <nc...@us...> - 2001-12-03 15:23:58
Hi Bill, don't you just love these rainy Sundays? I have been discussing this with Randy Heisch (LinuxPPC) and Luc Smolders (AIX). Randy does the entry/exit instrumentation from mcount because he can use the Power architecture to do so, as we've discussed earlier. To adjust for the instrumentation itself, he measures an empty call loop to record the cycles it takes and the CPI, then adjusts his results by what the empty loop shows. Randy can do this because he always has the same path length, a consequence of his implementation, which simply uses vast memory buffers and never writes to the trace file during tracing. (If Randy were to use the same type of instrumentation as I do, the path length would vary, he says, because the code inserted by that instrumentation is not always the same on PPC. On IA32 it is the same, but I don't know yet what it is on IA64, which is one of my reasons to postpone the calibration.)

I could do the same as Randy, but because of the enormous data flow, most of our benchmarks would not look anything like themselves if I simply allocated correspondingly huge buffers and postponed the writing to disk. I have not ruled out simply ignoring the cases where the trace code has to handle buffer-full conditions (infrequent in the big picture) and starting every trace with 100 empty function calls, but if I calibrate at recording time I have to be careful not to be interrupted, or the calibration would be useless. I have thought about doing the calibration in user space at post-processing time, but that would make it impossible to do the post-processing on a different machine.

The best solution is probably to do the empty loops to a well-known empty function and do the calibration in the post-processing. That would give me more freedom, reduce the influence on the benchmark, and make the calibration immune to interrupts during data collection. But, for now, I do nothing, which in my opinion isn't too bad, actually.
Most performance work is done by comparative analysis, by which I mean that you try to pinpoint "hot" areas, where "hot" is loosely defined as "something that burns a lot of resources", "something that burns more resources than something else", or maybe "something that has too much unproductive elapsed time". For this type of analysis, we don't really need to adjust for profiling overhead, so we can use what LinProf gives us. I do realize, of course, that for other types of analysis, we can't accept having this overhead included in the measurements.

When I (hopefully soon) get the IA64 port done, I will know more about how this would work on that platform. When I have both Intel platforms working, I promise to implement something like Randy's calibration -- except that I won't just do CPI. I will do whichever counters were selected, but I have to emphasize that the calibration is only reasonably valid until the tracer runs out of buffer space. That, in turn, is another reason to wait for the IA64 port: I expect to have much larger buffers at my disposal on IA64, so I can depend on trace scenarios where we stop tracing when we run out of buffer space -- as is done on LinuxPPC and, in most situations, on AIX.

If this really bothers you, I could postpone the IA64 port and implement the calibration in the IA32 version. The only problem is that I have just one work-week left in 2001. But, I gotta tell ya, Katz' never closes, and the IA64 port is a real challenge, with several of the functions I need to instrument having been moved from C to assembler. And it is a completely different assembler, which I'm only starting to comprehend. Why can't everybody just stick to /370 assembler? There are also some interesting possibilities in the IA64 performance counter implementation. If I can figure out how to take advantage of those, we could get more accurate profiling.
Hopefully, I can also do things that help the LTC IA64 team in their quest to make gcc produce more efficient code. Eventually we'll get there. As I keep annoying u'all by saying: I only just started!

Maybe I should add that when Kernprof is driven by a performance counter, it likewise has no adjustment for the instrumentation. Neither does it calibrate for instrumentation overhead in its flat profiling. In fact, I have never used a tool that adjusted for instrumentation overhead in flat profiling. But the lack of calibration is less important in Kernprof, obviously, since Kernprof cannot report per-function counter values or per-function cycle counts.

I think I'll make curried pork tenderloin with apples for dinner, served over steamed rice and with mango chutney on the side. It is Sunday, after all...

Niels Christiansen
IBM LTC, Kernel Performance

>My instinct tells me that this will reduce the usefulness of the entry/exit perf cntr support.
>Can the tool be modified so that the event counts do not include the tracing overhead ?
>
>Bill Hartner
>IBM Linux Technology Center - Linux Kernel Performance lead
>
>|Yes.
>|
>|Niels
>|>
>|>> John Hawkes and Ray Bryant know I'm working on something. I told
>|>> John I was implementing entry/exit instrumentation, which is part
>|>> of the new tool. Because it is trace-based, I can record stuff
>|>> that Kernprof can not, such as performance counter and cycle
>|>> counter deltas per function. It is pretty cool, actually. The
>|>> overhead is the same order of magnitude as Kernprof acg profiling
>|>> but output is richer. Entry/exit instrumentation is optional, of
>|>> course, and so is every other group of trace points.
>|>
>|>You indicated that entry/exit "overhead is the same order of magnitude as Kernprof".
>|>
>|>Do the performance counter values collected by entry/exit include the overhead you mentioned above ?
>|>
>|>Bill Hartner
>|>IBM Linux Technology Center - Linux Kernel Performance lead