From: Rui L. <ru...@nc...> - 2009-01-17 00:34:41
Hi Naveen,

Thanks a lot for reporting this!

Rick wrote:
> I understand. We have been able to reproduce the behavior you reported
> with a small test case and are in the process of running a number of
> experiments to summarize the behavior under different conditions, using
> both the PerfSuite library as well as the PAPI high- and low-level
> APIs. Results will be posted to this mailing list when complete.

We have reproduced the issue and measured the costs of the different layers
on an NCSA Intel x86_64 HPC system.

- Summary of the investigation results:

1. When the PerfSuite HWPC (hereafter PSHWPC) API is used, as with any
measurement tool, the granularity of the measured function should be large
enough to minimize perturbation by the tool/API. Typically, when the
granularity of the measured function is 10-100 times the cost of the PSHWPC
API calls or more, the measurement is relatively accurate. Otherwise, the
measured values can be significantly distorted by the cost of the
measurement itself.

2. A lighter-weight measurement method with the PSHWPC API is to use
ps_hwpc_read() instead of ps_hwpc_start() and ps_hwpc_suspend(). The cost
of this function is relatively small; the flip side is that the user has to
manage the counter values him/herself (a minimal sketch of this pattern
follows the table discussion below). On an NCSA HPC system with the perfctr
patch, we found that with 1 PAPI event, rdpmc (which is used by perfctr)
costs 53 cycles, PAPI_read (which calls perfctr) costs 164 cycles, and
ps_hwpc_read (which calls PAPI_read) costs 184 cycles.

- Details:

(Please note that these values are specific to this particular
hardware/software combination, but they can be used as a guideline.)

1. Measured cost of the PSHWPC API calls, approximate, for 1 PAPI event
(PAPI_TOT_CYC):

   ps_hwpc_init():                7.92 M cycles
   ps_hwpc_stop():                4.03 M cycles
   ps_hwpc_shutdown():            446 K cycles
   first ps_hwpc_start():         19 K cycles
   first ps_hwpc_suspend():       1750 cycles
   subsequent ps_hwpc_start():    473 cycles
   subsequent ps_hwpc_suspend():  376 cycles
   func(NUM_OPS):                 7 * NUM_OPS cycles

2. Comparison of the numbers of "wall ticks" and "PAPI_TOT_CYC" reported in
the PSHWPC-generated XML files against direct PAPI high- and low-level API
measurements:

   ----------------------------------------------------------------------
   COUNT    NUM_OPS   rtc_delta   PS_ticks   PS_TOT_CYC   PAPI h    PAPI l
   ----------------------------------------------------------------------
   100 K    1         80 M        11 M       46 M         2.4 M     2.3 M
   100 K    10        98 M        22 M       56 M         8.1 M     8.1 M
   100 K    100       152 M       83 M       118 M        73.7 M    73.9 M
   100 K    1 K       784 M       716 M      748 M        704 M     704 M
   100 K    10 K      7295 M      7223 M     7052 M       7005 M    7005 M
   100 K    100 K     71.809 G    71.723 G   70.063 G     70.019 G  69.974 G
   ----------------------------------------------------------------------

   In the above table:
   - COUNT is the number of times the loop (ps_hwpc_start, func(NUM_OPS),
     ps_hwpc_suspend) is executed,
   - NUM_OPS is the granularity of the measured function (a loop of integer
     multiplication and summation),
   - rtc_delta is the number of cycles for the program from beginning to
     finish, including ps_hwpc_init, stop, shutdown, etc.,
   - PS_ticks is the number of wall ticks in the PSHWPC-generated XML,
   - PS_TOT_CYC is the value of PAPI_TOT_CYC in the PSHWPC-generated XML,
   - PAPI h is the value of PAPI_TOT_CYC using the PAPI high-level API,
   - PAPI l is the value of PAPI_TOT_CYC using the PAPI low-level API.

From the table, one can observe:

1) From the PS_ticks and PS_TOT_CYC columns: there is indeed an issue where
PS_TOT_CYC > PS_ticks when the granularity of the measured function is small
(NUM_OPS <= 1 K). This produces the "CPU time > wall clock time" issue in
the psprocess output, as observed by Naveen. It is due to the released
PerfSuite implementation "counting the timer" instead of "timing the
counter"; this is being addressed, and the fix will be available in a later
PerfSuite version.

2) When the granularity of the measured function is relatively small
(NUM_OPS <= 100), the costs of ps_hwpc_start() + ps_hwpc_suspend() on top of
PAPI significantly distort the PAPI_TOT_CYC values compared with direct PAPI
measurement. When the granularity is extremely small (NUM_OPS = 1, i.e., 7
cycles), the costs of ps_hwpc_start() + ps_hwpc_suspend() dominate, and
PS_TOT_CYC is almost 20 times the value of the direct PAPI measurement. When
the granularity is large enough (NUM_OPS = 10 K, i.e., 70 K cycles or about
30 microseconds), PS_TOT_CYC and the PAPI values differ by only about 0.7%.
This is similar to what Naveen reported.
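To make the read-based method in summary point 2 concrete, here is a
minimal sketch (not the attached measure-cost.c, just an illustration of
the pattern). It assumes the usual PSHWPC C conventions: int return codes
(error checks omitted for brevity), ps_hwpc_read() filling a caller-supplied
long long array whose length matches the number of events in the PSHWPC XML
configuration file, and a header name that may differ in your installation:

    #include <stdio.h>
    #include "perfsuite.h"            /* assumed header name */

    #define NUM_EVENTS 1              /* must match the XML config, e.g. PAPI_TOT_CYC */

    static volatile long sink;
    static void func(long num_ops)    /* measured kernel: integer multiply and add */
    {
        long i, s = 0;
        for (i = 1; i <= num_ops; i++)
            s += i * 3;
        sink = s;                     /* keep the loop from being optimized away */
    }

    int main(void)
    {
        long long before[NUM_EVENTS], after[NUM_EVENTS];
        long long total = 0;
        int i;

        ps_hwpc_init();
        ps_hwpc_start();              /* counters run continuously from here on */

        for (i = 0; i < 100000; i++) {
            ps_hwpc_read(before);     /* ~184 cycles on the test system */
            func(1000);
            ps_hwpc_read(after);
            total += after[0] - before[0];   /* the caller manages the deltas */
        }

        printf("accumulated PAPI_TOT_CYC: %lld\n", total);
        ps_hwpc_stop("readtest");     /* writes the XML report */
        ps_hwpc_shutdown();
        return 0;
    }

Because the counters keep running between the two reads, each delta pays
only about one ps_hwpc_read() of overhead instead of a full start/suspend
pair, at the price of doing the bookkeeping yourself.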
3. Comparison of the costs of rdpmc, PAPI_read, and ps_hwpc_read (in CPU
cycles):

   ---------------------------------------------
   # counters   rdpmc   PAPI_read   ps_hwpc_read
   ---------------------------------------------
   1            53*     164         184
   2            106*    224         248
   ---------------------------------------------

   *: The rdpmc cost was obtained from the perfctr init-test output in the
   Linux kernel boot messages (in "/var/log/messages" or the dmesg output).
   The value 106 was extrapolated as 53 * 2. The PAPI_read and ps_hwpc_read
   values were measured in this investigation.

4. Test setup details:

1) Hardware/software:
   - a login node of the NCSA Intel x86_64 HPC system (Abe),
     honest3.ncsa.uiuc.edu
   - Linux 2.6.18 patched with perfctr 2.6.37
   - Intel Xeon E5345, 2.33 GHz (2327.506 MHz); L1 I cache: 32K;
     L1 D cache: 32K; L2 unified cache: 4 MB; 8 CPUs
   - gcc version 3.4.6 20060404 (Red Hat 3.4.6-3), no compile-time
     optimization (-O0 was in effect)
   - PAPI version 3.6.2
   - PerfSuite version 0.6.2b1

2) How the PSHWPC API call costs were measured: a program loops over
"ps_hwpc_start(); func(NUM_OPS); ps_hwpc_suspend()" for many iterations, and
ps_rtc() is used to time the API calls (see the first sketch after this
list).

3) How the PS_ticks and PS_TOT_CYC numbers were obtained: almost the same
program as above, but with the ps_rtc() calls around the PSHWPC calls
removed, leaving only the ps_rtc() calls at the beginning and end of the
program to measure its entire duration (rtc_delta in the table). The
"wallticks" and "PAPI_TOT_CYC" values were then taken from the generated
XML files.

4) How the PAPI high- and low-level API numbers were obtained: based on
papi-3.6.2/src/examples/high-level.c, we wrote a program with a loop calling
func(NUM_OPS), wrapped with PAPI_start_counters() and PAPI_read_counters()
for the high-level measurement, and with PAPI_start() and PAPI_stop() for
the low-level measurement (see the second sketch after this list).

5) How the rdpmc, PAPI_read, and ps_hwpc_read costs were obtained: the rdpmc
cost comes from the perfctr init-test output in the Linux kernel boot
messages. Based on perfctr-2.6.37/linux/drivers/perfctr/x86_tests.c, we
wrote a program that uses the rdtscll() assembly call to time PAPI_read and
ps_hwpc_read over 1024000 iterations and averages the result (see the third
sketch after this list). The files used to measure the PAPI_read and
ps_hwpc_read costs (measure-cost.c, Makefile, papi_1event.xml,
papi_2events.xml) are attached.
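First sketch (setup item 2): timing the individual PSHWPC calls with
ps_rtc(). This is an illustration, not the exact test program; it assumes
ps_rtc() returns a free-running real-time tick count as a long long (the
report quotes the results in cycles), and for simplicity it folds the more
expensive first start/suspend pair into the averages, which the real
experiment reported separately:

    #include <stdio.h>
    #include "perfsuite.h"            /* assumed header name */

    #define COUNT   100000
    #define NUM_OPS 1000

    static volatile long sink;
    static void func(long num_ops)
    {
        long i, s = 0;
        for (i = 1; i <= num_ops; i++)
            s += i * 3;               /* ~7 cycles per iteration on the test system */
        sink = s;
    }

    int main(void)
    {
        long long t0, t1, t2, t3;
        long long start_cost = 0, suspend_cost = 0;
        int i;

        ps_hwpc_init();

        for (i = 0; i < COUNT; i++) {
            t0 = ps_rtc();
            ps_hwpc_start();
            t1 = ps_rtc();
            func(NUM_OPS);
            t2 = ps_rtc();
            ps_hwpc_suspend();
            t3 = ps_rtc();
            start_cost   += t1 - t0;
            suspend_cost += t3 - t2;
        }

        printf("avg ps_hwpc_start():   %lld ticks\n", start_cost / COUNT);
        printf("avg ps_hwpc_suspend(): %lld ticks\n", suspend_cost / COUNT);

        ps_hwpc_stop("cost");
        ps_hwpc_shutdown();
        return 0;
    }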
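Second sketch (setup item 4): the direct PAPI measurement, modeled on
papi-3.6.2/src/examples/high-level.c. The high-level path uses
PAPI_start_counters()/PAPI_read_counters() as described above; compiling
with -DUSE_LOW_LEVEL switches to the low-level event-set path with
PAPI_start()/PAPI_stop(). The compile-time switch is an illustration device
(the original measurements were presumably separate runs), and error checks
are omitted for brevity:

    #include <stdio.h>
    #include "papi.h"

    #define COUNT   100000
    #define NUM_OPS 1000

    static volatile long sink;
    static void func(long num_ops)
    {
        long i, s = 0;
        for (i = 1; i <= num_ops; i++)
            s += i * 3;
        sink = s;
    }

    int main(void)
    {
        long_long values[1];           /* PAPI's 64-bit counter type */
        int i;

    #ifdef USE_LOW_LEVEL
        int es = PAPI_NULL;
        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_TOT_CYC);
        PAPI_start(es);
        for (i = 0; i < COUNT; i++)
            func(NUM_OPS);
        PAPI_stop(es, values);         /* low level: stop and read */
    #else
        int events[1] = { PAPI_TOT_CYC };
        PAPI_start_counters(events, 1);
        for (i = 0; i < COUNT; i++)
            func(NUM_OPS);
        PAPI_read_counters(values, 1); /* high level: read (and reset) */
    #endif

        printf("PAPI_TOT_CYC: %lld\n", values[0]);
        return 0;
    }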
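Third sketch (setup item 5): the TSC-based harness that averages the
per-call cost of ps_hwpc_read(); the same loop around PAPI_read() gives the
PAPI number. The original code uses the kernel's rdtscll() macro; an
equivalent x86_64 inline-assembly helper is written out here instead, loop
overhead is neglected, and the header name is again an assumption:

    #include <stdio.h>
    #include "perfsuite.h"            /* assumed header name */

    #define ITERS 1024000

    /* x86_64 equivalent of the kernel's rdtscll() macro */
    static inline unsigned long long rdtsc_now(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        long long values[1];          /* one event: PAPI_TOT_CYC */
        unsigned long long t0, t1;
        long i;

        ps_hwpc_init();
        ps_hwpc_start();              /* counters must be running */

        t0 = rdtsc_now();
        for (i = 0; i < ITERS; i++)
            ps_hwpc_read(values);
        t1 = rdtsc_now();

        printf("avg ps_hwpc_read(): %llu cycles\n", (t1 - t0) / ITERS);

        ps_hwpc_stop("readcost");
        ps_hwpc_shutdown();
        return 0;
    }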
Please note that when measuring 2 PAPI events, the source code
(measure-cost.c) needs to be changed (by uncommenting 2 lines in
setup_papi()) and recompiled for the PAPI API, while no recompilation is
needed for the PSHWPC API.

Please let me know if you have any questions or comments. Thanks!

Thanks,
Rui Liu
NCSA / PerfSuite team