From: Naveen P. <nav...@gm...> - 2009-01-17 17:36:47
Rui,

Thanks for generating these comprehensive benchmarks. I appreciate your
good work.

Naveen

On Fri, Jan 16, 2009 at 4:10 PM, Rui Liu <ru...@nc...> wrote:
> Hi Naveen,
>
> Thanks a lot for reporting this!
>
> Rick wrote:
> > I understand. We have been able to reproduce the behavior you reported
> > with a small test case and are in the process of running a number of
> > experiments to summarize the behavior under different conditions, using
> > both the PerfSuite library as well as the PAPI high- and low-level
> > APIs. Results will be posted to this mailing list when complete.
>
> We reproduced the issue and measured the costs of the different layers on
> an NCSA Intel x86_64 HPC system.
>
> - Summary of the investigation results:
>
> 1. When the PerfSuite HWPC (hereafter PSHWPC) API is used, as with any
> measurement tool, the granularity of the measured function should be large
> enough to minimize perturbation by the tool/API. Typically, when the
> granularity of a measured function is 10-100 times the cost of the PSHWPC
> API itself, or larger, the measurement is relatively accurate. Otherwise,
> the measured values can be significantly distorted by the cost of the
> measurement itself.
>
> 2. A lighter-weight alternative within the PSHWPC API is to use
> ps_hwpc_read() instead of ps_hwpc_start() and ps_hwpc_suspend(). The cost
> of this call is relatively small, but the flip side is that the user has to
> manage the counter values themselves. On an NCSA HPC system with the
> perfctr patch, we found that with 1 PAPI event, rdpmc (which is used by
> perfctr) costs 53 cycles, PAPI_read (which calls perfctr) costs 164 cycles,
> and ps_hwpc_read (which calls PAPI_read) costs 184 cycles.
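>
> As an illustration of this lighter-weight approach, a minimal sketch could
> look like the following. This is not taken from the attached files; the
> header name and exact prototypes are assumptions on my part, so please
> check the PerfSuite documentation for your installation (error checking is
> omitted for brevity):
>
>     /* Keep the counters running and take differences of ps_hwpc_read()
>        values around the region of interest. */
>     #include <stdio.h>
>     #include <pshwpc.h>            /* assumed PerfSuite HWPC header name */
>
>     #define NEVENTS 1              /* e.g. PAPI_TOT_CYC via the XML event file */
>     #define NUM_OPS 1000
>
>     /* the kind of routine measured below: integer multiply/add loop */
>     static void func(int num_ops)
>     {
>         volatile int s = 0;
>         int i;
>         for (i = 0; i < num_ops; i++)
>             s += i * 3;
>     }
>
>     int main(void)
>     {
>         long long before[NEVENTS], after[NEVENTS];
>
>         ps_hwpc_init();
>         ps_hwpc_start();              /* counters stay enabled from here on */
>
>         ps_hwpc_read(before);         /* ~184 cycles/call on the system below */
>         func(NUM_OPS);
>         ps_hwpc_read(after);
>
>         printf("event 0 delta: %lld\n", after[0] - before[0]);
>
>         ps_hwpc_stop("lightweight");  /* assumed to take an output prefix */
>         ps_hwpc_shutdown();
>         return 0;
>     }
>
> The bookkeeping of the counter deltas is entirely up to the caller here,
> which is exactly the trade-off mentioned above.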
>
> - Details: (Please note that these values are specific to this particular
> hardware/software combination, but they can be used as a guideline.)
>
> 1. Measured cost of PSHWPC API calls:
>
> Approximate cost for 1 PAPI event (PAPI_TOT_CYC):
>     ps_hwpc_init():                7.92 M cycles
>     ps_hwpc_stop():                4.03 M cycles
>     ps_hwpc_shutdown():            446 K cycles
>     first ps_hwpc_start():         19 K cycles
>     first ps_hwpc_suspend():       1750 cycles
>     subsequent ps_hwpc_start():    473 cycles
>     subsequent ps_hwpc_suspend():  376 cycles
>     func(NUM_OPS):                 7 * NUM_OPS cycles
>
> 2. Comparison of the "wall ticks" and "PAPI_TOT_CYC" numbers in the
> PSHWPC-generated XML files with direct PAPI high- and low-level API
> measurements:
>
> ------------------------------------------------------------------------
> COUNT    NUM_OPS   rtc_delta   PS_ticks   PS_TOT_CYC   PAPI h     PAPI l
> ------------------------------------------------------------------------
> 100 K    1         80 M        11 M       46 M         2.4 M      2.3 M
> 100 K    10        98 M        22 M       56 M         8.1 M      8.1 M
> 100 K    100       152 M       83 M       118 M        73.7 M     73.9 M
> 100 K    1 K       784 M       716 M      748 M        704 M      704 M
> 100 K    10 K      7295 M      7223 M     7052 M       7005 M     7005 M
> 100 K    100 K     71.809 G    71.723 G   70.063 G     70.019 G   69.974 G
> ------------------------------------------------------------------------
>
> In the above table:
> - COUNT is how many times the loop (ps_hwpc_start, func(NUM_OPS),
>   ps_hwpc_suspend) is executed,
> - NUM_OPS is the granularity of the measured function (a loop of integer
>   multiplications and additions),
> - rtc_delta is the number of cycles for the whole program from beginning to
>   end, including ps_hwpc_init, stop, shutdown, etc.,
> - PS_ticks is the number of wall ticks reported in the PSHWPC-generated XML,
> - PS_TOT_CYC is the value of PAPI_TOT_CYC in the PSHWPC-generated XML,
> - PAPI h is the value of PAPI_TOT_CYC measured with the PAPI high-level API,
> - PAPI l is the value of PAPI_TOT_CYC measured with the PAPI low-level API.
>
> From the table, one can observe:
>
> 1) From the PS_ticks and PS_TOT_CYC columns: there is indeed an issue where
> PS_TOT_CYC > PS_ticks when the granularity of the measured function is
> small (NUM_OPS <= 1 K). This produces the "CPU time > wall clock time"
> issue in the psprocess output, as observed by Naveen. It is due to the
> released version of PerfSuite "counting the timer" rather than "timing the
> counter"; this is being addressed and a fix will be available in a later
> PerfSuite version.
>
> 2) When the granularity of the measured function is relatively small
> (NUM_OPS <= 100), the cost of ps_hwpc_start() + ps_hwpc_suspend() on top of
> PAPI significantly distorts the PAPI_TOT_CYC values compared with direct
> PAPI measurement. When the granularity is extremely small (NUM_OPS = 1,
> i.e. about 7 cycles), the cost of ps_hwpc_start() + ps_hwpc_suspend()
> dominates, and PS_TOT_CYC is almost 20 times the directly measured PAPI
> value. When the granularity is large enough (NUM_OPS = 10 K, i.e. about
> 70 K cycles or 30 microseconds), PS_TOT_CYC and the PAPI values differ by
> only about 0.7%. This is similar to what Naveen reported.
>
> 3. Comparison of the costs of rdpmc, PAPI_read, and ps_hwpc_read (in CPU
> cycles):
>
> ---------------------------------------------
> # counters   rdpmc   PAPI_read   ps_hwpc_read
> ---------------------------------------------
> 1            53*     164         184
> 2            106*    224         248
> ---------------------------------------------
>
> *: The rdpmc cost was taken from the perfctr init test output in the Linux
> kernel boot messages ("/var/log/messages" or dmesg output). The value 106
> was extrapolated as 53 * 2. The PAPI_read and ps_hwpc_read values were
> measured in this investigation.
>
> 4. Test setup details:
>
> 1) Hardware/software:
>    A login node of the NCSA Intel x86_64 HPC system (Abe) --
>    honest3.ncsa.uiuc.edu.
>    Linux 2.6.18 patched with perfctr 2.6.37.
>    Intel Xeon E5345, 2.33 GHz (2327.506 MHz), L1 I cache: 32K, L1 D cache:
>    32K, L2 unified cache: 4 MB, 8 CPUs.
>    gcc version 3.4.6 20060404 (Red Hat 3.4.6-3), no compile-time
>    optimization (-O0 was in effect).
>    PAPI version 3.6.2.
>    PerfSuite version 0.6.2b1.
>
> 2) How the PSHWPC API call costs were measured:
>    A program loops over "ps_hwpc_start(); func(NUM_OPS); ps_hwpc_suspend();"
>    for many iterations, and ps_rtc() is used to time the individual API
>    calls.
>
> 3) How the PS_ticks and PS_TOT_CYC numbers were obtained:
>    Using almost the same program as above, but with the ps_rtc() calls
>    around the PSHWPC calls removed, leaving only the ps_rtc() calls at the
>    beginning and end of the program to measure its entire duration
>    (rtc_delta in the table). The "wallticks" and "PAPI_TOT_CYC" values were
>    then taken from the generated XML files.
>
> 4) How the PAPI high- and low-level API numbers were obtained:
>    Based on papi-3.6.2/src/examples/high-level.c, we wrote a program with a
>    loop calling func(NUM_OPS), wrapped with PAPI_start_counters() and
>    PAPI_read_counters() for the high-level API measurement, and with
>    PAPI_start() and PAPI_stop() for the low-level API measurement (a sketch
>    of this wrapping appears after this list).
>
> 5) How the rdpmc, PAPI_read, and ps_hwpc_read costs were obtained:
>    The rdpmc cost was taken from the perfctr self-test output in the Linux
>    kernel boot messages. Based on
>    perfctr-2.6.37/linux/drivers/perfctr/x86_tests.c, we wrote a program
>    that uses the rdtscll() assembly call to time PAPI_read and ps_hwpc_read
>    over 1024000 iterations and averages the result.
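>
> As an illustration of the wrapping described in 4), a minimal sketch of the
> PAPI high- and low-level measurements could look like the following. This
> is not the actual program used for the numbers above; the single call to
> func(), the fixed NUM_OPS, and the lack of error checking are
> simplifications:
>
>     #include <stdio.h>
>     #include <papi.h>
>
>     #define NUM_OPS 1000
>
>     /* the measured routine: an integer multiply/add loop, as in the tests */
>     static void func(int num_ops)
>     {
>         volatile int s = 0;
>         int i;
>         for (i = 0; i < num_ops; i++)
>             s += i * 3;
>     }
>
>     int main(void)
>     {
>         int events[1] = { PAPI_TOT_CYC };
>         long long high[1], low[1], scratch[1];
>         int es = PAPI_NULL;
>
>         PAPI_library_init(PAPI_VER_CURRENT);
>
>         /* high-level API: wrap func() with start/read of the counters */
>         PAPI_start_counters(events, 1);
>         func(NUM_OPS);
>         PAPI_read_counters(high, 1);       /* reads and resets the counters */
>         PAPI_stop_counters(scratch, 1);
>
>         /* low-level API: explicit event set wrapped with start/stop */
>         PAPI_create_eventset(&es);
>         PAPI_add_event(es, PAPI_TOT_CYC);
>         PAPI_start(es);
>         func(NUM_OPS);
>         PAPI_stop(es, low);
>
>         printf("PAPI high level: %lld cycles\n", high[0]);
>         printf("PAPI low  level: %lld cycles\n", low[0]);
>         return 0;
>     }
>
> In the actual measurement, func(NUM_OPS) was called in a loop of COUNT
> iterations, as in the table above, rather than just once.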
>
> The files used to measure the PAPI_read and ps_hwpc_read costs
> (measure-cost.c, Makefile, papi_1event.xml, papi_2events.xml) are attached.
> Please note that when measuring 2 PAPI events, the source code
> (measure-cost.c) needs to be changed (by uncommenting 2 lines in
> setup_papi()) and recompiled for the PAPI API, while no recompilation is
> needed for the PSHWPC API.
>
> Please let me know if you have any questions or comments. Thanks!
>
> Thanks,
> Rui Liu
> NCSA / PerfSuite team