From: Naveen P. <nav...@gm...> - 2009-01-06 17:47:17
Rick, thanks for the informative reply. The small time granularity of the measured function seems to be the key issue. The table below shows the important numbers. All numbers were generated on the same test example; a few were quoted in my previous mail.

                    | PerfSuite      | PAPI high-level | PAPI low-level
--------------------+----------------+-----------------+----------------
PAPI_TOT_CYC        | 31,257,943,365 | 8,628,425,548   | 7,712,025,538
Corresp. time (s)   | 13.02          | 3.60            | 3.21

PerfSuite wall-clock time (s): 2.33
Time measured through independent rdtsc (s): 2.02

The overhead shows up in the PAPI_TOT_CYC measurements. The PAPI high-level interface also has some overhead over the low-level interface. For my purposes, where the time granularity is very small, I am inclined to write a custom interface for measuring events to improve accuracy. Note that the independent time measurement (and hence the implied TOT_CYC) obtained through rdtsc is the lowest, with minimal overhead.

The second point, involving the difference between PerfSuite CPU time and wall-clock time, can, as you stated, be explained by "time the counters" versus "count the timers". As you said, choosing one approach depends on the objective. In my case, either approach will still carry some overhead even for the most accurate measurements, but I can imagine scenarios where changing from "count the timers" to "time the counters" might be helpful.

Please note that the numbers in this email were generated on a full-blown voice recognition system. Earlier, when I said "small test program" I meant "small test example"; sorry for any confusion this may have created. Unfortunately, due to license issues involving the speech database, I can't share the experimental setup.
-Naveen

On Mon, Jan 5, 2009 at 4:40 PM, Rick Kufrin <rk...@nc...> wrote:

> Naveen,
>
> Thanks for digging into this more. I hope this message gets out to the mailing list (along with yours), as your experiments are interesting and useful. I think your tests reported below are a slightly different matter than the earlier one (CPU > wall clock), but they are both related.
>
> These more recent test results seem to point more to overhead introduced in libpshwpc, especially when measuring at fine granularity, which it sounds like you have in your tests. There is definitely additional work introduced when using the libpshwpc API (above and beyond any work going on in PAPI).
>
> A second point that we have been discussing following your earlier reports arises from an issue that comes up when one is measuring two things simultaneously (as libpshwpc does): the hardware counters and the time stamp counter. These have to be ordered in some way, and the choice is made - I will paraphrase Luiz DeRose of Cray here - whether one wants to "time the counters" or "count the timers". That is, do you first read the counter values and then the time stamp, or vice versa? Reviewing the relevant code in libpshwpc (the relevant source file, by the way, is src/libpshwpc/hwpc.c), the current implementation chooses to "count the timers". For most purposes this gives reasonable results, but I think you are pushing the envelope here. I would be interested in hearing your viewpoint on whether the reverse should be done, which may affect the CPU vs. wall clock issue.
>
> On the more general overhead issue: we will try to reproduce by writing test cases here. If it would be possible to send me a copy of the small test program you are using, that would help us work with you on this.
>
> Thanks again for following up on this,
> Rick
>
> Naveen Parihar wrote:
>
>> Rick,
>>
>> Thanks for the quick response and suggestions.
>>
>> It seems like a bug in PerfSuite and not PAPI. Here are the numbers for PAPI_TOT_CYC from a small test program.
>>
>> Using PerfSuite libpshwpc <http://perfsuite.ncsa.uiuc.edu/libpshwpc/>: 31,257,943,365
>> Using PAPI high-level interface: 8,628,425,548
>>
>> Please note that ps_hwpc_start() and ps_hwpc_suspend() are called thousands of times within a large program to benchmark a specific function. The time granularity of this function is very small, and hence I have been using rdtsc to get independent timing measurements. The timing measurements using independent rdtsc are closer to those measured directly using PAPI.
>>
>> I also observed similar behavior with PAPI_TOT_INS, so this bug likely influences other events as well.
>>
>> I do not mind investing some time in finding and fixing this bug in the PerfSuite code base, and would appreciate it if you can tell me the potential places where I should be looking.
>>
>> -Naveen
>>
>> On Mon, Jan 5, 2009 at 3:30 PM, Rick Kufrin <rk...@nc...> wrote:
>>
>> Naveen,
>>
>> Thanks for sending the document. We (of course) get the same results from post-processing it here as you do, but cannot easily reproduce this behavior with our own executables/tests on systems we have available.
>>
>> There are two suggestions I can think of to try offhand:
>>
>> 1. Add the "time" command to your run, i.e. "time psrun a.out", to get a third-party opinion of the wall clock and CPU time.
>>
>> 2. PerfSuite supports using the "gettimeofday" system call for wall-clock timing (by default, on a machine like yours, it uses the "rdtsc" asm instruction). To use gettimeofday, one has to reconfigure PerfSuite with the option "--enable-rtc=gettimeofday", then make clean and remake as normal.
>>
>> If you come across any more info looking into this, we are of course very interested to address it. Thanks for reporting.
>>
>> Rick
>>
>> Naveen Parihar wrote:
>>
>> Rick,
>>
>> The XML document is attached with this email. I'm trying to debug the problem by directly using the high-level PAPI interface. Will let you know my findings later.
>>
>> -Naveen
>>
>> On Mon, Jan 5, 2009 at 2:00 PM, Rick Kufrin <rk...@nc...> wrote:
>>
>> Naveen,
>>
>> That certainly does sound like a bug, or at least unexpected behavior...
>>
>> Would you please send a copy of the XML document that you used to obtain this output? It may help to look at its contents closer to track down what is going on.
>>
>> Rick
>>
>> Naveen Parihar wrote:
>>
>> Dear PerfSuite users,
>>
>> I'm a new user of PerfSuite/PAPI and would appreciate comments on my query below.
>>
>> On a quad-core Intel machine running Fedora Core 6 (kernel 2.6.18), I get the following numbers while running a *single* threaded program:
>>
>> CPU time (seconds): 47.966
>> Wall clock time (seconds): 30.582
>>
>> Since the CPU time > wall clock time, one conclusion might be a bug in PerfSuite/PAPI or somewhere else. I double-checked the CPU time by dividing the PAPI_TOT_CYC count by the CPU frequency, and I arrive at the same number. Any ideas on what might be going on, or what might be the best approach to debugging this problem, are appreciated.
>>
>> Thanks,
>> -Naveen
>>
>> _______________________________________________
>> PerfSuite-users mailing list
>> Per...@li...
>> https://lists.sourceforge.net/lists/listinfo/perfsuite-users