From: Dr. Vincent Keller <Vincent.Keller@sc...>  2009-10-07 13:53:47

Dear all,

First of all, I'm quite a newbie with perfmon. I hope my two questions are not too stupid, and I apologize if they are. Before writing to this mailing list, I searched Google and the list archive, without success.

I'm currently integrating a performance monitoring module into a C++ project. I need to get, in system-wide mode, the GFlops rate of a processor, core by core. To compute the GFlops (GigaFlops per second) rate, I count the FLOPs during a time interval dt and integrate. I based my implementation on an example provided with libpfm.

To validate the monitored values, I use two application kernels (a full matrix-matrix multiplication and a Poisson solver; both use X87 and SSE floating-point operations) for which I compute the exact number of FLOPs and the time. I tried my monitoring system on an Intel Core 2 Duo and on an Intel Harpertown without any problem:

Poisson (app using X87):

    vkeller@...:~/> ./mxv
    read n1 = 5 n2 = 1999 nn = 4000000
    0 1999 1999 4000000 25 4.67165751457214
    Exact result: 35987998.0000000 sum= 35987998.0000000
    Mflop/s = 385.302217550264

Poisson (monitored):

    [MM] Perf : 0.384496 [GFLOPS] for core0 at time 1254920517
    [MM] Perf : 0.009804 [GFLOPS] for core1 at time 1254920517

(it means that the app ran on core 0 at the correct rate)

Matrix-matrix multiplication (app using SSE2 instructions):

    vkeller@...:~/> ./mxm
    size = 1000
    k s t Mflop/s
    kji 0 0.1000000000D+10 0.8655E+00 0.2311E+04

Matrix-matrix multiplication (monitored):

    [MM] Perf : 0.014436 [GFLOPS] for core0 at time 1254920742
    [MM] Perf : 2.300236 [GFLOPS] for core1 at time 1254920742

I used the event FP_COMP_OPS_EXE to measure the number of FLOPs and the gettimeofday function for the timing.

But when I turn to Intel Nehalem, things go wrong. First of all, the single event FP_COMP_OPS_EXE no longer exists.
Instead:

    Umask-00 : 0x02 : [MMX] : MMX Uops
    Umask-01 : 0x80 : [SSE_DOUBLE_PRECISION] : SSE* FP double precision Uops
    Umask-02 : 0x04 : [SSE_FP] : SSE and SSE2 FP Uops
    Umask-03 : 0x10 : [SSE_FP_PACKED] : SSE FP packed Uops
    Umask-04 : 0x20 : [SSE_FP_SCALAR] : SSE FP scalar Uops
    Umask-05 : 0x40 : [SSE_SINGLE_PRECISION] : SSE* FP single precision Uops
    Umask-06 : 0x08 : [SSE2_INTEGER] : SSE2 integer Uops
    Umask-07 : 0x01 : [X87] : Computational floating-point operations executed

(output of pfmon -i FP_COMP_OPS)

As far as I understood, each event fits in one HW counter (3 are available on the nhm). My first idea is to sum all the values counted for the 8 sub-events of FP_COMP_OPS:

    FLOPS = FP_COMP_OPS:MMX + FP_COMP_OPS:SSE_DOUBLE_PRECISION + FP_COMP_OPS:SSE_FP
          + FP_COMP_OPS:SSE_FP_PACKED + FP_COMP_OPS:SSE_FP_SCALAR, etc...

So I measure the 8 events during dt and then integrate:

    do i = 1,8
      FLOPS = sum (8*event(i) during dt/8)
    end do
    FLOP_per_second = FLOPS/dt

But the result is totally wrong:

    [MM] Perf : 0.000010 [GFLOPS] for core0 at time 1254920956
    [MM] Perf : 0.000003 [GFLOPS] for core2 at time 1254920956
    [MM] Perf : 7.164090 [GFLOPS] for core4 at time 1254920956
    [MM] Perf : 0.000000 [GFLOPS] for core6 at time 1254920956

for a "real" performance of

    k s t Mflop/s
    kji 0 0.1000000000D+10 0.4570E+00 0.4377E+04

What is wrong? How can I measure the number of FLOPs using the FP_COMP_OPS:MMX, FP_COMP_OPS:SSE_DOUBLE_PRECISION, FP_COMP_OPS:SSE_FP, etc. values?

Secondly, I have another problem (of affinity?) with the Nehalem. I understood (thanks to http://perfmon2.sourceforge.net/pfmon_intel_corei7.html) that it was mandatory to specify the ANY_THREAD flag (for that I set the flag PFM_NHM_SEL_ANYTHR in the pfmlib_nhm_counter_t structure) to avoid the problem with HT (the Linux kernel "thinks" it has 2 physical cores instead of one). But the problem remains: it can happen that the module measures 0 FLOPs while an application is running.
My declaration is:

    memset(&mod_inp_nhm, 0, sizeof(mod_inp_nhm));
    for (int ctr = 0; ctr < PMU_NHM_NUM_COUNTERS; ctr++) {
        mod_inp_nhm.pfp_nhm_counters[ctr].flags = PFM_NHM_SEL_ANYTHR;
    }

The mod_inp_nhm structure is then passed to the pfm_dispatch_events function. And I measure the flops only on the even-numbered logical cores:

    for (int k = 0; k < number_of_cores; k++) {
        uint64_t value_flops = 0UL;
        double gflops = 0.0;
        if (k % 2 == 0) {
            value_flops = mm->getFlops(k, dt);
        }
    }

What is wrong in my understanding?

Thanks in advance.

Best regards
Vince

Dr. Vincent KELLER
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI
http://scai.fraunhofer.de
ADDRESS: Schloss Birlinghoven, D-53754 Sankt Augustin, Germany
PHONE : + 49 (0) 2241/142280
FAX : + 49 (0) 2241/142258
EMAIL : Vincent.Keller@...
From: Dr. Vincent Keller <Vincent.Keller@sc...>  2009-10-07 14:10:24

From: Caffey, Hugh M <hugh.m.caffey@in...>  2009-10-07 14:50:42

Hi,

First, note that these events count micro-operations (not full "macro" instructions) that *executed* in the floating-point unit (but did not, necessarily, retire). (The event names used below may not be exactly the same as those used by perfmon2.)

At the highest level on Core i7, total micro-operations executed in the FPU =

    ( FP_COMP_OPS_EXE.X87
    + FP_COMP_OPS_EXE.MMX
    + FP_COMP_OPS_EXE.SSE_FP
    + FP_COMP_OPS_EXE.SSE2_INTEGER )

(If you only care about actual fp operations, omit the .SSE2_INTEGER event.)

If you want more detail specifically about SSE fp operations, note the following relationships:

    FP_COMP_OPS_EXE.SSE_FP = FP_COMP_OPS_EXE.SSE_FP_PACKED ("vector" operations)
                           + FP_COMP_OPS_EXE.SSE_FP_SCALAR

also:

    FP_COMP_OPS_EXE.SSE_FP = FP_COMP_OPS_EXE.SSE_SINGLE_PRECISION
                           + FP_COMP_OPS_EXE.SSE_DOUBLE_PRECISION

Hope this helps.

From: stephane eranian <eranian@go...>  2009-10-07 17:18:11

Vincent,

Hugh is right! Be careful: on Core i7, micro-ops are counted, not instructions. Other users have also reported variations in the number of micro-ops reported for the same instruction. It depends on the floating-point values passed and whether or not they reach the limits of their types (e.g., denormals).

As for PFM_NHM_SEL_ANYTHR, it is not mandatory at all. In fact you probably don't want to use it. If you run pfmon on all logical cores (without --cpu-list), then you can compute total FLOPS by adding up the per-CPU counts. Alternatively you can use the --aggr option to have pfmon do it for you.

From: Dr. Vincent Keller <Vincent.Keller@sc...>  2009-10-07 17:30:17

Dear Stéphane,

stephane eranian wrote:
> Vincent,
>
> Hugh is right!
> Be careful: on Core i7, micro-ops are counted, not instructions.

OK.

> Other users have also reported variations in the number of
> micro-ops reported for the same instruction. It depends on
> the floating-point values passed and whether or not they
> reach the limits of their types (e.g., denormals).

OK, I modified the module according to what Hugh said and it seems to be quite OK (at least "enough" for a first prototype):

    MXM: Measured : Perf : 4.818 [GFLOPS]
         Computed : Perf : 4.383 [GFLOPS]
    MXV: Measured : Perf : 1.329 [GFLOPS]
         Computed : Perf : 1.172 [GFLOPS]

(the regular 10 % difference can easily be used as a correction)

> As for PFM_NHM_SEL_ANYTHR, it is not mandatory at all.
> In fact you probably don't want to use it. If you run pfmon
> on all logical cores (without --cpu-list), then you can compute
> total FLOPS by adding up the per-CPU counts. Alternatively
> you can use the --aggr option to have pfmon do it for you.

Hmm, maybe my first post was not clear, sorry for that. I don't use the "pfmon" tool; I implemented what I needed myself. My goal is to be able to measure all the (actual) FLOPs on each (physical) core of a processor, regardless of which application/thread/process is running on it. That is why I thought I had to specify the PFM_NHM_SEL_ANYTHR flag in the structure. Maybe I should have a look at the pfmon source code?

Thanks again for your help.

Cheers :)
Vince
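For reference, the sum Hugh gives and the per-CPU addition Stéphane suggests could be sketched in C++ roughly as below. This is only a sketch under assumptions: the struct FpCompOps, the functions total_fpu_uops and per_core_totals, and the core_of_cpu mapping (logical CPU index to physical core index) are all hypothetical names that are not part of libpfm, and the counts are assumed to have been read already. As noted in the thread, the result is a number of executed micro-ops, not of retired floating-point operations.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the four non-overlapping FP_COMP_OPS_EXE
// sub-events of Hugh's top-level sum, counted over one interval.
struct FpCompOps {
    std::uint64_t x87 = 0;
    std::uint64_t mmx = 0;
    std::uint64_t sse_fp = 0;        // equals packed + scalar, and single + double
    std::uint64_t sse2_integer = 0;
};

// Total micro-ops executed in the FPU. SSE_FP_PACKED, SSE_FP_SCALAR,
// SSE_SINGLE_PRECISION and SSE_DOUBLE_PRECISION are deliberately left out:
// they are breakdowns of SSE_FP, so adding them would count the same
// micro-ops more than once.
std::uint64_t total_fpu_uops(const FpCompOps& c, bool include_integer) {
    std::uint64_t total = c.x87 + c.mmx + c.sse_fp;
    if (include_integer) {
        total += c.sse2_integer;
    }
    return total;
}

// Add up per-logical-CPU counts into per-physical-core totals.
// core_of_cpu[i] is the (assumed known) physical core of logical CPU i.
std::vector<std::uint64_t> per_core_totals(
        const std::vector<std::uint64_t>& per_cpu,
        const std::vector<int>& core_of_cpu,
        std::size_t num_cores) {
    std::vector<std::uint64_t> totals(num_cores, 0);
    for (std::size_t cpu = 0;
         cpu < per_cpu.size() && cpu < core_of_cpu.size(); ++cpu) {
        const int core = core_of_cpu[cpu];
        if (core >= 0 && static_cast<std::size_t>(core) < num_cores) {
            totals[static_cast<std::size_t>(core)] += per_cpu[cpu];
        }
    }
    return totals;
}
```

With counts taken on every logical CPU and without the ANY_THREAD flag, summing the two hyper-thread siblings of a core in this way would give a per-physical-core figure. How the sibling mapping is obtained depends on the system and is left open here.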