From: Naveen P. <nav...@gm...> - 2009-01-17 17:36:47
Rui,

Thanks for generating these comprehensive benchmarks. I appreciate your
good work.

Naveen

On Fri, Jan 16, 2009 at 4:10 PM, Rui Liu <ru...@nc...> wrote:
> Hi Naveen,
>
> Thanks a lot for reporting this!
>
> Rick wrote:
> > I understand. We have been able to reproduce the behavior you reported
> > with a small test case and are in the process of running a number of
> > experiments to summarize the behavior under different conditions, using
> > both the PerfSuite library as well as the PAPI high- and low-level
> > APIs. Results will be posted to this mailing list when complete.
>
> We reproduced the issue and measured the costs of the different layers on
> an NCSA Intel x86_64 HPC system.
>
> - Summary of the investigation results:
>
> 1. When the PerfSuite HWPC (hereafter PSHWPC) API is used, as with any
> measurement tool, the granularity of the measured function should be large
> enough to minimize perturbation by the tool/API. Typically, when the
> granularity of a measured function is 10-100 times the cost of the PSHWPC
> API itself, or larger, the measurement is relatively accurate. Otherwise,
> the measured values can be significantly distorted by the cost of the
> measurement itself.
>
> 2. A lighter-weight alternative within the PSHWPC API is to use
> ps_hwpc_read() instead of ps_hwpc_start() and ps_hwpc_suspend(). The cost
> of this call is relatively small, but the flip side is that the user has to
> manage the counter values themselves. On an NCSA HPC system with the
> perfctr patch, we found that with 1 PAPI event, rdpmc (which is used by
> perfctr) costs 53 cycles, PAPI_read (which calls perfctr) costs 164 cycles,
> and ps_hwpc_read (which calls PAPI_read) costs 184 cycles.
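>
> As an illustration of this lighter-weight approach, a minimal sketch could
> look like the following. This is not taken from the attached files; the
> header name and exact prototypes are assumptions on my part, so please
> check the PerfSuite documentation for your installation (error checking is
> omitted for brevity):
>
>     /* Keep the counters running and take differences of ps_hwpc_read()
>        values around the region of interest. */
>     #include <stdio.h>
>     #include <pshwpc.h>            /* assumed PerfSuite HWPC header name */
>
>     #define NEVENTS 1              /* e.g. PAPI_TOT_CYC via the XML event file */
>     #define NUM_OPS 1000
>
>     /* the kind of routine measured below: integer multiply/add loop */
>     static void func(int num_ops)
>     {
>         volatile int s = 0;
>         int i;
>         for (i = 0; i < num_ops; i++)
>             s += i * 3;
>     }
>
>     int main(void)
>     {
>         long long before[NEVENTS], after[NEVENTS];
>
>         ps_hwpc_init();
>         ps_hwpc_start();              /* counters stay enabled from here on */
>
>         ps_hwpc_read(before);         /* ~184 cycles/call on the system below */
>         func(NUM_OPS);
>         ps_hwpc_read(after);
>
>         printf("event 0 delta: %lld\n", after[0] - before[0]);
>
>         ps_hwpc_stop("lightweight");  /* assumed to take an output prefix */
>         ps_hwpc_shutdown();
>         return 0;
>     }
>
> The bookkeeping of the counter deltas is entirely up to the caller here,
> which is exactly the trade-off mentioned above.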
>
> - Details: (Please note that these values are specific to this particular
> hardware/software combination, but they can be used as a guideline.)
>
> 1. Measured cost of PSHWPC API calls:
>
> Approximate cost for 1 PAPI event (PAPI_TOT_CYC):
>     ps_hwpc_init():                7.92 M cycles
>     ps_hwpc_stop():                4.03 M cycles
>     ps_hwpc_shutdown():            446 K cycles
>     first ps_hwpc_start():         19 K cycles
>     first ps_hwpc_suspend():       1750 cycles
>     subsequent ps_hwpc_start():    473 cycles
>     subsequent ps_hwpc_suspend():  376 cycles
>     func(NUM_OPS):                 7 * NUM_OPS cycles
>
> 2. Comparison of the "wall ticks" and "PAPI_TOT_CYC" numbers in the
> PSHWPC-generated XML files with direct PAPI high- and low-level API
> measurements:
>
> ------------------------------------------------------------------------
> COUNT    NUM_OPS   rtc_delta   PS_ticks   PS_TOT_CYC   PAPI h     PAPI l
> ------------------------------------------------------------------------
> 100 K    1         80 M        11 M       46 M         2.4 M      2.3 M
> 100 K    10        98 M        22 M       56 M         8.1 M      8.1 M
> 100 K    100       152 M       83 M       118 M        73.7 M     73.9 M
> 100 K    1 K       784 M       716 M      748 M        704 M      704 M
> 100 K    10 K      7295 M      7223 M     7052 M       7005 M     7005 M
> 100 K    100 K     71.809 G    71.723 G   70.063 G     70.019 G   69.974 G
> ------------------------------------------------------------------------
>
> In the above table:
> - COUNT is how many times the loop (ps_hwpc_start, func(NUM_OPS),
>   ps_hwpc_suspend) is executed,
> - NUM_OPS is the granularity of the measured function (a loop of integer
>   multiplications and additions),
> - rtc_delta is the number of cycles for the whole program from beginning to
>   end, including ps_hwpc_init, stop, shutdown, etc.,
> - PS_ticks is the number of wall ticks reported in the PSHWPC-generated XML,
> - PS_TOT_CYC is the value of PAPI_TOT_CYC in the PSHWPC-generated XML,
> - PAPI h is the value of PAPI_TOT_CYC measured with the PAPI high-level API,
> - PAPI l is the value of PAPI_TOT_CYC measured with the PAPI low-level API.
>
> From the table, one can observe:
>
> 1) From the PS_ticks and PS_TOT_CYC columns: there is indeed an issue where
> PS_TOT_CYC > PS_ticks when the granularity of the measured function is
> small (NUM_OPS <= 1 K). This produces the "CPU time > wall clock time"
> issue in the psprocess output, as observed by Naveen. It is due to the
> released version of PerfSuite "counting the timer" rather than "timing the
> counter"; this is being addressed and a fix will be available in a later
> PerfSuite version.
>
> 2) When the granularity of the measured function is relatively small
> (NUM_OPS <= 100), the cost of ps_hwpc_start() + ps_hwpc_suspend() on top of
> PAPI significantly distorts the PAPI_TOT_CYC values compared with direct
> PAPI measurement. When the granularity is extremely small (NUM_OPS = 1,
> i.e. about 7 cycles), the cost of ps_hwpc_start() + ps_hwpc_suspend()
> dominates, and PS_TOT_CYC is almost 20 times the directly measured PAPI
> value. When the granularity is large enough (NUM_OPS = 10 K, i.e. about
> 70 K cycles or 30 microseconds), PS_TOT_CYC and the PAPI values differ by
> only about 0.7%. This is similar to what Naveen reported.
>
> 3. Comparison of the costs of rdpmc, PAPI_read, and ps_hwpc_read (in CPU
> cycles):
>
> ---------------------------------------------
> # counters   rdpmc   PAPI_read   ps_hwpc_read
> ---------------------------------------------
> 1            53*     164         184
> 2            106*    224         248
> ---------------------------------------------
>
> *: The rdpmc cost was taken from the perfctr init test output in the Linux
> kernel boot messages ("/var/log/messages" or dmesg output). The value 106
> was extrapolated as 53 * 2. The PAPI_read and ps_hwpc_read values were
> measured in this investigation.
>
> 4. Test setup details:
>
> 1) Hardware/software:
>    A login node of the NCSA Intel x86_64 HPC system (Abe) --
>    honest3.ncsa.uiuc.edu.
>    Linux 2.6.18 patched with perfctr 2.6.37.
>    Intel Xeon E5345, 2.33 GHz (2327.506 MHz), L1 I cache: 32K, L1 D cache:
>    32K, L2 unified cache: 4 MB, 8 CPUs.
>    gcc version 3.4.6 20060404 (Red Hat 3.4.6-3), no compile-time
>    optimization (-O0 was in effect).
>    PAPI version 3.6.2.
>    PerfSuite version 0.6.2b1.
>
> 2) How the PSHWPC API call costs were measured:
>    A program loops over "ps_hwpc_start(); func(NUM_OPS); ps_hwpc_suspend();"
>    for many iterations, and ps_rtc() is used to time the individual API
>    calls.
>
> 3) How the PS_ticks and PS_TOT_CYC numbers were obtained:
>    Using almost the same program as above, but with the ps_rtc() calls
>    around the PSHWPC calls removed, leaving only the ps_rtc() calls at the
>    beginning and end of the program to measure its entire duration
>    (rtc_delta in the table). The "wallticks" and "PAPI_TOT_CYC" values were
>    then taken from the generated XML files.
>
> 4) How the PAPI high- and low-level API numbers were obtained:
>    Based on papi-3.6.2/src/examples/high-level.c, we wrote a program with a
>    loop calling func(NUM_OPS), wrapped with PAPI_start_counters() and
>    PAPI_read_counters() for the high-level API measurement, and with
>    PAPI_start() and PAPI_stop() for the low-level API measurement (a sketch
>    of this wrapping appears after this list).
>
> 5) How the rdpmc, PAPI_read, and ps_hwpc_read costs were obtained:
>    The rdpmc cost was taken from the perfctr self-test output in the Linux
>    kernel boot messages. Based on
>    perfctr-2.6.37/linux/drivers/perfctr/x86_tests.c, we wrote a program
>    that uses the rdtscll() assembly call to time PAPI_read and ps_hwpc_read
>    over 1024000 iterations and averages the result.
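>
> As an illustration of the wrapping described in 4), a minimal sketch of the
> PAPI high- and low-level measurements could look like the following. This
> is not the actual program used for the numbers above; the single call to
> func(), the fixed NUM_OPS, and the lack of error checking are
> simplifications:
>
>     #include <stdio.h>
>     #include <papi.h>
>
>     #define NUM_OPS 1000
>
>     /* the measured routine: an integer multiply/add loop, as in the tests */
>     static void func(int num_ops)
>     {
>         volatile int s = 0;
>         int i;
>         for (i = 0; i < num_ops; i++)
>             s += i * 3;
>     }
>
>     int main(void)
>     {
>         int events[1] = { PAPI_TOT_CYC };
>         long long high[1], low[1], scratch[1];
>         int es = PAPI_NULL;
>
>         PAPI_library_init(PAPI_VER_CURRENT);
>
>         /* high-level API: wrap func() with start/read of the counters */
>         PAPI_start_counters(events, 1);
>         func(NUM_OPS);
>         PAPI_read_counters(high, 1);       /* reads and resets the counters */
>         PAPI_stop_counters(scratch, 1);
>
>         /* low-level API: explicit event set wrapped with start/stop */
>         PAPI_create_eventset(&es);
>         PAPI_add_event(es, PAPI_TOT_CYC);
>         PAPI_start(es);
>         func(NUM_OPS);
>         PAPI_stop(es, low);
>
>         printf("PAPI high level: %lld cycles\n", high[0]);
>         printf("PAPI low  level: %lld cycles\n", low[0]);
>         return 0;
>     }
>
> In the actual measurement, func(NUM_OPS) was called in a loop of COUNT
> iterations, as in the table above, rather than just once.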
>
> The files used to measure the PAPI_read and ps_hwpc_read costs
> (measure-cost.c, Makefile, papi_1event.xml, papi_2events.xml) are attached.
> Please note that when measuring 2 PAPI events, the source code
> (measure-cost.c) needs to be changed (by uncommenting 2 lines in
> setup_papi()) and recompiled for the PAPI API, while no recompilation is
> needed for the PSHWPC API.
>
> Please let me know if you have any questions or comments. Thanks!
>
> Thanks,
> Rui Liu
> NCSA / PerfSuite team