From: Yaoping R. <yruan@CS.Princeton.EDU> - 2004-08-27 02:14:12
|
Hi,

Patching the 0.8 release with:

-	if (counter_config[i].event) {
+	if (counter_config[i].enabled) {

did solve some of the zero event values. But it still doesn't support the metrics available through front-end tagging, namely memory_loads and memory_stores. These are the only L1 data reference events available on the P4, and they are critical to understanding system behavior in many cases.

For these events, both FRONT_END_EVENT and UOP_TYPE need to be enabled. The current implementation uses 4 pairs of counters, and unfortunately these two events use the same pair. Thus in an HT environment, enabling these two events simultaneously generates the error "Couldn't allocate hardware counters for the selected events." I believe this is still an open bug; refer to:
http://sourceforge.net/tracker/index.php?func=detail&aid=841099&group_id=16191&atid=116191
(I am working on another patch to enable IA32_PEBS_ENABLE and MSR_PEBS_MATRIX_VERT for the other events mentioned in that thread.)

Attached is a tentative patch against the current CVS tree, which expands the counters to 5 pairs and lets UOP_TYPE use the new pair. A better solution would be to use the full 9 pairs, since other events, such as split_load_retired and split_store_retired (refer to IA-32 Intel Architecture Software Developer's Manual, Volume 3, Table A-6, page A-35), may need additional MSRs too.

Please review this and see if it can be applied.

Thanks
-Yaoping

********************
diff -rc oprofile.cvs/events/i386/p4/events oprofile/events/i386/p4/events
*** oprofile.cvs/events/i386/p4/events 2004-08-26 21:13:42.206833160 -0400
--- oprofile/events/i386/p4/events 2004-08-26 21:21:59.682205408 -0400
***************
*** 3,47 ****
  # NOTE: events cannot currently be 0x00 due to event binding checks in
  # driver
  #
! event:0x1d counters:0,4 um:global_power_events minimum:3000 name:GLOBAL_POWER_EVENTS : time during which processor is not stopped
! event:0x01 counters:3,7 um:branch_retired minimum:3000 name:BRANCH_RETIRED : retired branches
! event:0x02 counters:3,7 um:mispred_branch_retired minimum:3000 name:MISPRED_BRANCH_RETIRED : retired mispredicted branches
! event:0x04 counters:0,4 um:bpu_fetch_request minimum:3000 name:BPU_FETCH_REQUEST : instruction fetch requests from the branch predict unit
! event:0x05 counters:0,4 um:itlb_reference minimum:3000 name:ITLB_REFERENCE : translations using the instruction translation lookaside buffer
! event:0x06 counters:2,6 um:memory_cancel minimum:3000 name:MEMORY_CANCEL : cancelled requesets in data cache address control unit
! event:0x07 counters:2,6 um:memory_complete minimum:3000 name:MEMORY_COMPLETE : completed split
! event:0x08 counters:2,6 um:load_port_replay minimum:3000 name:LOAD_PORT_REPLAY : replayed events at the load port
! event:0x09 counters:2,6 um:store_port_replay minimum:3000 name:STORE_PORT_REPLAY : replayed events at the store port
! event:0x0a counters:0,4 um:mob_load_replay minimum:3000 name:MOB_LOAD_REPLAY : replayed loads from the memory order buffer
! event:0x0c counters:0,4 um:bsq_cache_reference minimum:3000 name:BSQ_CACHE_REFERENCE : cache references seen by the bus unit
  # intel doc vol 3 table A-1 P4 and xeon with cpuid signature < 0xf27 doen't allow MSR_FSB_ESCR1 so on only counter 0 is available
  event:0x0d counters:0 um:ioq minimum:3000 name:IOQ_ALLOCATION : bus transactions
  # FIXME the unit mask associated is known to get different behavior between cpu
  # step id, it need to be documented in P4 events doc
! event:0x0e counters:4 um:ioq minimum:3000 name:IOQ_ACTIVE_ENTRIES : number of entries in the IOQ which are active
  event:0x10 counters:0 um:bsq minimum:3000 name:BSQ_ALLOCATION : allocations in the bus sequence unit
! event:0x12 counters:3,7 um:x87_assist minimum:3000 name:X87_ASSIST : retired x87 instructions which required special handling
! event:0x1c counters:3,7 um:machine_clear minimum:3000 name:MACHINE_CLEAR : cycles with entire machine pipeline cleared
! event:0x1e counters:1,5 um:tc_ms_xfer minimum:3000 name:TC_MS_XFER : number of times uops deliver changed from TC to MS ROM
! event:0x1f counters:1,5 um:uop_queue_writes minimum:3000 name:UOP_QUEUE_WRITES : number of valid uops written to the uop queue
! event:0x20 counters:3,7 um:front_end_event minimum:3000 name:FRONT_END_EVENT : retired uops, tagged with front-end tagging
! event:0x21 counters:3,7 um:execution_event minimum:3000 name:EXECUTION_EVENT : retired uops, tagged with execution tagging
! event:0x22 counters:3,7 um:replay_event minimum:3000 name:REPLAY_EVENT : retired uops, tagged with replay tagging
! event:0x23 counters:3,7 um:instr_retired minimum:3000 name:INSTR_RETIRED : retired instructions
! event:0x24 counters:3,7 um:uops_retired minimum:3000 name:UOPS_RETIRED : retired uops
! event:0x25 counters:3,7 um:uop_type minimum:3000 name:UOP_TYPE : type of uop tagged by front-end tagging
! event:0x26 counters:1,5 um:branch_type minimum:3000 name:RETIRED_MISPRED_BRANCH_TYPE : retired mispredicted branched, selected by type
! event:0x27 counters:1,5 um:branch_type minimum:3000 name:RETIRED_BRANCH_TYPE : retired branches, selected by type
! event:0x03 counters:1,5 um:tc_deliver_mode minimum:3000 name:TC_DELIVER_MODE : duration (in clock cycles) in the trace cache and decode engine
! event:0x0b counters:0,4 um:page_walk_type minimum:3000 name:PAGE_WALK_TYPE : page walks by the page miss handler
! event:0x0f counters:0,4 um:fsb_data_activity minimum:3000 name:FSB_DATA_ACTIVITY : DRDY or DBSY events on the front side bus
! event:0x11 counters:4 um:bsq minimum:3000 name:BSQ_ACTIVE_ENTRIES : number of entries in the bus sequence unit which are active
! event:0x13 counters:2,6 um:flame_uop minimum:3000 name:SSE_INPUT_ASSIST : input assists requested for SSE or SSE2 operands
! event:0x14 counters:2,6 um:flame_uop minimum:3000 name:PACKED_SP_UOP : packed single precision uops
! event:0x15 counters:2,6 um:flame_uop minimum:3000 name:PACKED_DP_UOP : packed double precision uops
! event:0x16 counters:2,6 um:flame_uop minimum:3000 name:SCALAR_SP_UOP : scalar single precision uops
! event:0x17 counters:2,6 um:flame_uop minimum:3000 name:SCALAR_DP_UOP : scalar double presision uops
! event:0x18 counters:2,6 um:flame_uop minimum:3000 name:64BIT_MMX_UOP : 64 bit integer SIMD MMX uops
! event:0x19 counters:2,6 um:flame_uop minimum:3000 name:128BIT_MMX_UOP : 128 bit integer SIMD SSE2 uops
! event:0x1a counters:2,6 um:flame_uop minimum:3000 name:X87_FP_UOP : x87 floating point uops
! event:0x1b counters:2,6 um:x87_simd_moves_uop minimum:3000 name:X87_SIMD_MOVES_UOP : x87 FPU, MMX, SSE, or SSE2 loads, stores and reg-to-reg moves
--- 3,47 ----
  # NOTE: events cannot currently be 0x00 due to event binding checks in
  # driver
  #
! event:0x1d counters:0,5 um:global_power_events minimum:3000 name:GLOBAL_POWER_EVENTS : time during which processor is not stopped
! event:0x01 counters:3,8 um:branch_retired minimum:3000 name:BRANCH_RETIRED : retired branches
! event:0x02 counters:3,8 um:mispred_branch_retired minimum:3000 name:MISPRED_BRANCH_RETIRED : retired mispredicted branches
! event:0x04 counters:0,5 um:bpu_fetch_request minimum:3000 name:BPU_FETCH_REQUEST : instruction fetch requests from the branch predict unit
! event:0x05 counters:0,5 um:itlb_reference minimum:3000 name:ITLB_REFERENCE : translations using the instruction translation lookaside buffer
! event:0x06 counters:2,7 um:memory_cancel minimum:3000 name:MEMORY_CANCEL : cancelled requesets in data cache address control unit
! event:0x07 counters:2,7 um:memory_complete minimum:3000 name:MEMORY_COMPLETE : completed split
! event:0x08 counters:2,7 um:load_port_replay minimum:3000 name:LOAD_PORT_REPLAY : replayed events at the load port
! event:0x09 counters:2,7 um:store_port_replay minimum:3000 name:STORE_PORT_REPLAY : replayed events at the store port
! event:0x0a counters:0,5 um:mob_load_replay minimum:3000 name:MOB_LOAD_REPLAY : replayed loads from the memory order buffer
! event:0x0c counters:0,5 um:bsq_cache_reference minimum:3000 name:BSQ_CACHE_REFERENCE : cache references seen by the bus unit
  # intel doc vol 3 table A-1 P4 and xeon with cpuid signature < 0xf27 doen't allow MSR_FSB_ESCR1 so on only counter 0 is available
  event:0x0d counters:0 um:ioq minimum:3000 name:IOQ_ALLOCATION : bus transactions
  # FIXME the unit mask associated is known to get different behavior between cpu
  # step id, it need to be documented in P4 events doc
! event:0x0e counters:5 um:ioq minimum:3000 name:IOQ_ACTIVE_ENTRIES : number of entries in the IOQ which are active
  event:0x10 counters:0 um:bsq minimum:3000 name:BSQ_ALLOCATION : allocations in the bus sequence unit
! event:0x12 counters:3,8 um:x87_assist minimum:3000 name:X87_ASSIST : retired x87 instructions which required special handling
! event:0x1c counters:3,8 um:machine_clear minimum:3000 name:MACHINE_CLEAR : cycles with entire machine pipeline cleared
! event:0x1e counters:1,6 um:tc_ms_xfer minimum:3000 name:TC_MS_XFER : number of times uops deliver changed from TC to MS ROM
! event:0x1f counters:1,6 um:uop_queue_writes minimum:3000 name:UOP_QUEUE_WRITES : number of valid uops written to the uop queue
! event:0x20 counters:3,8 um:front_end_event minimum:3000 name:FRONT_END_EVENT : retired uops, tagged with front-end tagging
! event:0x21 counters:3,8 um:execution_event minimum:3000 name:EXECUTION_EVENT : retired uops, tagged with execution tagging
! event:0x22 counters:3,8 um:replay_event minimum:3000 name:REPLAY_EVENT : retired uops, tagged with replay tagging
! event:0x23 counters:3,8 um:instr_retired minimum:3000 name:INSTR_RETIRED : retired instructions
! event:0x24 counters:3,8 um:uops_retired minimum:3000 name:UOPS_RETIRED : retired uops
! event:0x25 counters:4,9 um:uop_type minimum:3000 name:UOP_TYPE : type of uop tagged by front-end tagging
! event:0x26 counters:1,6 um:branch_type minimum:3000 name:RETIRED_MISPRED_BRANCH_TYPE : retired mispredicted branched, selected by type
! event:0x27 counters:1,6 um:branch_type minimum:3000 name:RETIRED_BRANCH_TYPE : retired branches, selected by type
! event:0x03 counters:1,6 um:tc_deliver_mode minimum:3000 name:TC_DELIVER_MODE : duration (in clock cycles) in the trace cache and decode engine
! event:0x0b counters:0,5 um:page_walk_type minimum:3000 name:PAGE_WALK_TYPE : page walks by the page miss handler
! event:0x0f counters:0,5 um:fsb_data_activity minimum:3000 name:FSB_DATA_ACTIVITY : DRDY or DBSY events on the front side bus
! event:0x11 counters:5 um:bsq minimum:3000 name:BSQ_ACTIVE_ENTRIES : number of entries in the bus sequence unit which are active
! event:0x13 counters:2,7 um:flame_uop minimum:3000 name:SSE_INPUT_ASSIST : input assists requested for SSE or SSE2 operands
! event:0x14 counters:2,7 um:flame_uop minimum:3000 name:PACKED_SP_UOP : packed single precision uops
! event:0x15 counters:2,7 um:flame_uop minimum:3000 name:PACKED_DP_UOP : packed double precision uops
! event:0x16 counters:2,7 um:flame_uop minimum:3000 name:SCALAR_SP_UOP : scalar single precision uops
! event:0x17 counters:2,7 um:flame_uop minimum:3000 name:SCALAR_DP_UOP : scalar double presision uops
! event:0x18 counters:2,7 um:flame_uop minimum:3000 name:64BIT_MMX_UOP : 64 bit integer SIMD MMX uops
! event:0x19 counters:2,7 um:flame_uop minimum:3000 name:128BIT_MMX_UOP : 128 bit integer SIMD SSE2 uops
! event:0x1a counters:2,7 um:flame_uop minimum:3000 name:X87_FP_UOP : x87 floating point uops
! event:0x1b counters:2,7 um:x87_simd_moves_uop minimum:3000 name:X87_SIMD_MOVES_UOP : x87 FPU, MMX, SSE, or SSE2 loads, stores and reg-to-reg moves
diff -rc oprofile.cvs/events/i386/p4-ht/events oprofile/events/i386/p4-ht/events
*** oprofile.cvs/events/i386/p4-ht/events 2004-08-26 21:13:42.207833008 -0400
--- oprofile/events/i386/p4-ht/events 2004-08-26 21:21:31.572478736 -0400
***************
*** 23,28 ****
  event:0x22 counters:3 um:replay_event minimum:6000 name:REPLAY_EVENT : retired uops, tagged with replay tagging
  event:0x23 counters:3 um:instr_retired minimum:6000 name:INSTR_RETIRED : retired instructions
  event:0x24 counters:3 um:uops_retired minimum:6000 name:UOPS_RETIRED : retired uops
! event:0x25 counters:3 um:uop_type minimum:6000 name:UOP_TYPE : type of uop tagged by front-end tagging
  event:0x26 counters:1 um:branch_type minimum:6000 name:RETIRED_MISPRED_BRANCH_TYPE : retired mispredicted branched, selected by type
  event:0x27 counters:1 um:branch_type minimum:6000 name:RETIRED_BRANCH_TYPE : retired branches, selected by type
--- 23,28 ----
  event:0x22 counters:3 um:replay_event minimum:6000 name:REPLAY_EVENT : retired uops, tagged with replay tagging
  event:0x23 counters:3 um:instr_retired minimum:6000 name:INSTR_RETIRED : retired instructions
  event:0x24 counters:3 um:uops_retired minimum:6000 name:UOPS_RETIRED : retired uops
! event:0x25 counters:4 um:uop_type minimum:6000 name:UOP_TYPE : type of uop tagged by front-end tagging
  event:0x26 counters:1 um:branch_type minimum:6000 name:RETIRED_MISPRED_BRANCH_TYPE : retired mispredicted branched, selected by type
  event:0x27 counters:1 um:branch_type minimum:6000 name:RETIRED_BRANCH_TYPE : retired branches, selected by type
diff -rc oprofile.cvs/module/x86/op_model_p4.c oprofile/module/x86/op_model_p4.c
*** oprofile.cvs/module/x86/op_model_p4.c 2004-08-26 21:13:42.200834072 -0400
--- oprofile/module/x86/op_model_p4.c 2004-08-26 21:16:39.064946640 -0400
***************
*** 15,26 ****

  #define NUM_EVENTS 39

! #define NUM_COUNTERS_NON_HT 8
  #define NUM_ESCRS_NON_HT 45
  #define NUM_CCCRS_NON_HT 18
  #define NUM_CONTROLS_NON_HT (NUM_ESCRS_NON_HT + NUM_CCCRS_NON_HT)

! #define NUM_COUNTERS_HT2 4
  #define NUM_ESCRS_HT2 23
  #define NUM_CCCRS_HT2 9
  #define NUM_CONTROLS_HT2 (NUM_ESCRS_HT2 + NUM_CCCRS_HT2)
--- 15,26 ----

  #define NUM_EVENTS 39

! #define NUM_COUNTERS_NON_HT 10
  #define NUM_ESCRS_NON_HT 45
  #define NUM_CCCRS_NON_HT 18
  #define NUM_CONTROLS_NON_HT (NUM_ESCRS_NON_HT + NUM_CCCRS_NON_HT)

! #define NUM_COUNTERS_HT2 5
  #define NUM_ESCRS_HT2 23
  #define NUM_CCCRS_HT2 9
  #define NUM_CONTROLS_HT2 (NUM_ESCRS_HT2 + NUM_CCCRS_HT2)
***************
*** 73,92 ****
  #define CTR_MS_0 (1 << 1)
  #define CTR_FLAME_0 (1 << 2)
  #define CTR_IQ_4 (1 << 3)
! #define CTR_BPU_2 (1 << 4)
! #define CTR_MS_2 (1 << 5)
! #define CTR_FLAME_2 (1 << 6)
! #define CTR_IQ_5 (1 << 7)

  static struct p4_counter_binding p4_counters [NUM_COUNTERS_NON_HT] = {
  	{ CTR_BPU_0, MSR_P4_BPU_PERFCTR0, MSR_P4_BPU_CCCR0 },
  	{ CTR_MS_0, MSR_P4_MS_PERFCTR0, MSR_P4_MS_CCCR0 },
  	{ CTR_FLAME_0, MSR_P4_FLAME_PERFCTR0, MSR_P4_FLAME_CCCR0 },
  	{ CTR_IQ_4, MSR_P4_IQ_PERFCTR4, MSR_P4_IQ_CCCR4 },
  	{ CTR_BPU_2, MSR_P4_BPU_PERFCTR2, MSR_P4_BPU_CCCR2 },
  	{ CTR_MS_2, MSR_P4_MS_PERFCTR2, MSR_P4_MS_CCCR2 },
  	{ CTR_FLAME_2, MSR_P4_FLAME_PERFCTR2, MSR_P4_FLAME_CCCR2 },
! 	{ CTR_IQ_5, MSR_P4_IQ_PERFCTR5, MSR_P4_IQ_CCCR5 }
  };

  #define NUM_UNUSED_CCCRS NUM_CCCRS_NON_HT - NUM_COUNTERS_NON_HT
--- 73,96 ----
  #define CTR_MS_0 (1 << 1)
  #define CTR_FLAME_0 (1 << 2)
  #define CTR_IQ_4 (1 << 3)
! #define CTR_IQ_0 (1 << 4)
! #define CTR_BPU_2 (1 << 5)
! #define CTR_MS_2 (1 << 6)
! #define CTR_FLAME_2 (1 << 7)
! #define CTR_IQ_5 (1 << 8)
! #define CTR_IQ_2 (1 << 9)

  static struct p4_counter_binding p4_counters [NUM_COUNTERS_NON_HT] = {
  	{ CTR_BPU_0, MSR_P4_BPU_PERFCTR0, MSR_P4_BPU_CCCR0 },
  	{ CTR_MS_0, MSR_P4_MS_PERFCTR0, MSR_P4_MS_CCCR0 },
  	{ CTR_FLAME_0, MSR_P4_FLAME_PERFCTR0, MSR_P4_FLAME_CCCR0 },
  	{ CTR_IQ_4, MSR_P4_IQ_PERFCTR4, MSR_P4_IQ_CCCR4 },
+ 	{ CTR_IQ_0, MSR_P4_IQ_PERFCTR0, MSR_P4_IQ_CCCR0 },
  	{ CTR_BPU_2, MSR_P4_BPU_PERFCTR2, MSR_P4_BPU_CCCR2 },
  	{ CTR_MS_2, MSR_P4_MS_PERFCTR2, MSR_P4_MS_CCCR2 },
  	{ CTR_FLAME_2, MSR_P4_FLAME_PERFCTR2, MSR_P4_FLAME_CCCR2 },
! 	{ CTR_IQ_5, MSR_P4_IQ_PERFCTR5, MSR_P4_IQ_CCCR5 },
! 	{ CTR_IQ_2, MSR_P4_IQ_PERFCTR2, MSR_P4_IQ_CCCR2 }
  };

  #define NUM_UNUSED_CCCRS NUM_CCCRS_NON_HT - NUM_COUNTERS_NON_HT
***************
*** 96,103 ****
  	MSR_P4_BPU_CCCR1, MSR_P4_BPU_CCCR3,
  	MSR_P4_MS_CCCR1, MSR_P4_MS_CCCR3,
  	MSR_P4_FLAME_CCCR1, MSR_P4_FLAME_CCCR3,
! 	MSR_P4_IQ_CCCR0, MSR_P4_IQ_CCCR1,
! 	MSR_P4_IQ_CCCR2, MSR_P4_IQ_CCCR3
  };

  /* p4 event codes in libop/op_event.h are indices into this table. */
--- 100,106 ----
  	MSR_P4_BPU_CCCR1, MSR_P4_BPU_CCCR3,
  	MSR_P4_MS_CCCR1, MSR_P4_MS_CCCR3,
  	MSR_P4_FLAME_CCCR1, MSR_P4_FLAME_CCCR3,
! 	MSR_P4_IQ_CCCR1, MSR_P4_IQ_CCCR3
  };

  /* p4 event codes in libop/op_event.h are indices into this table. */
***************
*** 326,333 ****

  	{ /* UOP_TYPE */
  		0x02, 0x02,
! 		{ { CTR_IQ_4, MSR_P4_RAT_ESCR0},
! 		  { CTR_IQ_5, MSR_P4_RAT_ESCR1} }
  	},

  	{ /* RETIRED_MISPRED_BRANCH_TYPE */
--- 329,336 ----

  	{ /* UOP_TYPE */
  		0x02, 0x02,
! 		{ { CTR_IQ_0, MSR_P4_RAT_ESCR0},
! 		  { CTR_IQ_2, MSR_P4_RAT_ESCR1} }
  	},

  	{ /* RETIRED_MISPRED_BRANCH_TYPE */
|
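The allocation failure described in the first message can be sketched as a small counter-assignment check. This is a hypothetical illustration, not OProfile's actual allocator: each event carries a bitmask of the counters its events-file entry allows (bit N set means "counters:N"), and a first-fit search hands out the lowest free one. The two mask arrays model the per-thread view of the p4-ht events file before and after the patch (FRONT_END_EVENT on counter 3; UOP_TYPE first on counter 3, then on counter 4).

```c
#include <assert.h>

/*
 * Hypothetical sketch, not OProfile code: give each event the lowest free
 * counter its bitmask allows.  Returns 1 if every event got a counter,
 * 0 if some event found none (the "Couldn't allocate hardware counters"
 * case).  assigned[e] receives the counter index, or -1 on failure.
 */
static int allocate_counters(const unsigned *allowed, int nr_events,
			     int nr_counters, int *assigned)
{
	unsigned used = 0;	/* bitmask of counters already taken */
	int e, c;

	assert(nr_counters > 0 && nr_counters <= 32);

	for (e = 0; e < nr_events; e++) {
		assigned[e] = -1;
		for (c = 0; c < nr_counters; c++) {
			unsigned bit = 1u << c;
			if ((allowed[e] & bit) && !(used & bit)) {
				used |= bit;
				assigned[e] = c;
				break;
			}
		}
		if (assigned[e] < 0)
			return 0;
	}
	return 1;
}

/* Per-thread p4-ht view: index 0 = FRONT_END_EVENT, index 1 = UOP_TYPE. */
static const unsigned ht_before[2] = { 1u << 3, 1u << 3 }; /* both "counters:3" */
static const unsigned ht_after[2]  = { 1u << 3, 1u << 4 }; /* UOP_TYPE on "counters:4" */
```

Called with ht_before and 4 counters (the old NUM_COUNTERS_HT2) it returns 0, matching the reported error; with ht_after and 5 counters it assigns FRONT_END_EVENT to counter 3 and UOP_TYPE to counter 4. First-fit is only adequate here because each event allows a single counter per thread; events with several candidate counters would need a backtracking or matching search.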
From: John L. <le...@mo...> - 2004-08-27 03:36:29
|
On Thu, Aug 26, 2004 at 10:14:07PM -0400, Yaoping Ruan wrote:
> Attached is a tentative patch to the current CVS tree, which expands the

Your mail client ate it. Please include it inline without wrapping the patch (also, please use the -u unified diff format).

thanks
john
|
From: Yaoping R. <yruan@CS.Princeton.EDU> - 2004-08-27 03:59:39
Attachments:
oprofile.front_end_tagging.patch
|
Hope this time it works.

-Yaoping

On Fri, 27 Aug 2004, John Levon wrote:

> On Thu, Aug 26, 2004 at 10:14:07PM -0400, Yaoping Ruan wrote:
>
> > Attached is a tentative patch to the current CVS tree, which expands the
>
> Your mail client ate it. Please include it inline without wrapping the
> patch (also please use -u unified diff format)
>
> thanks
> john
|
From: John L. <le...@mo...> - 2004-08-27 18:12:02
|
On Thu, Aug 26, 2004 at 10:14:07PM -0400, Yaoping Ruan wrote:
> Attached is a tentative patch to the current CVS tree, which expands the

Can you please clarify which additional events are supported by this patch (by which I mean ones that you have tested sufficiently to be confident that they work as intended)? That is, what is the user-visible impact of the changes?

regards
john
|
From: Yaoping R. <yruan@CS.Princeton.EDU> - 2004-08-27 19:39:25
|
> Can you please clarify what additional events are supported by this
> patch (by which I mean ones that you have tested sufficiently to be
> confident that they work as intended). That is, what is the user-visible
> impact of the changes ?

Additional events:
  1. memory_loads
  2. memory_stores
  Note: they are derived from FRONT_END_EVENT and UOP_TYPE.

Importance of these events:
  These are the only L1 data reference events available on the P4 (BSQ_CACHE_REFERENCE only covers L2 and L3 events), and L1 data references are always among the most important features people want to look at.

Fixed bugs:
  - Zero values of FRONT_END_EVENT and UOP_TYPE when measured separately (they are not actually designed to be measured individually).
  - The "Couldn't allocate hardware counters..." error when they are measured simultaneously.

User-visible impact:

Without the patch (trying to measure memory loads):

  # opcontrol --event=FRONT_END_EVENT:100000:0x01 --event=UOP_TYPE:100000:0x02
  Couldn't allocate hardware counters for the selected events.
  #

(Problem: both FRONT_END_EVENT and UOP_TYPE are assigned to hardware counter No. 3.)

With the patch:

  # opcontrol --event=FRONT_END_EVENT:100000:0x01 --event=UOP_TYPE:100000:0x02
  # opcontrol --start
  Using 2.6+ OProfile kernel interface.
  Reading module info.
  Using log file /var/lib/oprofile/oprofiled.log
  Daemon started.
  Profiler running.
  # opcontrol --dump
  # opreport --merge=tgid
  CPU: P4 / Xeon with 2 hyper-threads, speed 1601.15 MHz (estimated)
  Counted FRONT_END_EVENT events (retired uops, tagged with front-end tagging) with a unit mask of 0x01 (count marked uops which are non-bogus) count 100000
  FRONT_END_EVEN...|
    samples|      %|
  ------------------
       1397 30.1598 perl
          FRONT_END_EVEN...|
            samples|      %|
          ------------------
               1123 80.3865 libperl.so
                206 14.7459 libc-2.3.3.so
                 53  3.7938 libpthread-0.61.so
                 15  1.0737 vmlinux
        829 17.8972 bash
          FRONT_END_EVEN...|
            samples|      %|
  .....

(Note: as specified in the Intel manual, UOP_TYPE doesn't increment by itself but is used to tag uops. The FRONT_END_EVENT value here indicates memory loads, since UOP_TYPE uses the TAGLOADS mask 0x02.)

Hope this helps
-Yaoping
|
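The pairing described above (UOP_TYPE with TAGLOADS 0x02 tags load uops, FRONT_END_EVENT with NBOGUS 0x01 counts the tagged retired uops) can be sketched as the two ESCR values a driver would program for thread 0. Assumptions to check against the IA-32 manual: the ESCR layout (event select in bits 30:25, event mask in bits 24:9, T0_OS bit 3, T0_USR bit 2) and FRONT_END_EVENT's event select of 0x08. UOP_TYPE's event select 0x02 is the value in the patch's table; the helper escr_value() is hypothetical, and the CCCR side of the programming is left out.

```c
#include <assert.h>
#include <stdint.h>

/* P4 ESCR fields as described in the IA-32 SDM, Vol. 3 (verify there). */
#define ESCR_EVENT_SELECT(sel)	(((uint32_t)(sel) & 0x3f) << 25)
#define ESCR_EVENT_MASK(mask)	(((uint32_t)(mask) & 0xffff) << 9)
#define ESCR_T0_OS		(1u << 3)	/* count in kernel mode, thread 0 */
#define ESCR_T0_USR		(1u << 2)	/* count in user mode, thread 0 */

/* Hypothetical helper: ESCR value for thread 0, user and kernel mode. */
static uint32_t escr_value(unsigned event_select, unsigned unit_mask)
{
	assert(event_select <= 0x3f && unit_mask <= 0xffff);
	return ESCR_EVENT_SELECT(event_select) | ESCR_EVENT_MASK(unit_mask) |
	       ESCR_T0_OS | ESCR_T0_USR;
}

/* Memory loads = UOP_TYPE tags loads + FRONT_END_EVENT counts tagged uops. */
#define UOP_TYPE_SELECT		0x02	/* from the patch's event table */
#define UOP_TYPE_TAGLOADS	0x02	/* unit mask used in this thread */
#define FRONT_END_SELECT	0x08	/* assumption: Intel manual value */
#define FRONT_END_NBOGUS	0x01	/* unit mask used in this thread */
```

With these assumptions the two values come out as 0x0400040C for UOP_TYPE and 0x1000020C for FRONT_END_EVENT. Only the FRONT_END_EVENT counter produces samples; the UOP_TYPE ESCR just has to be active so that loads get tagged.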
From: John L. <le...@mo...> - 2004-08-27 19:59:18
|
On Fri, Aug 27, 2004 at 03:39:19PM -0400, Yaoping Ruan wrote:
> # opcontrol --event=FRONT_END_EVENT:100000:0x01
> --event=UOP_TYPE:100000:0x02
> # opcontrol --dump
> # opreport --merge=tgid
> CPU: P4 / Xeon with 2 hyper-threads, speed 1601.15 MHz (estimated)
> Counted FRONT_END_EVENT events (retired uops, tagged with front-end
> tagging) with a unit mask of 0x01 (count marked uops which are non-bogus)
> count 100000
>
> (Note: as specified in the Intel manual that UOP_TYPE doesn't
> increment but is used to tag uops. The FRONT_END_EVENT value here
> indicates memory loads, since UOP_TYPE uses TAGLOADS mask 0x02)

OK, we need to be smarter here. We've encoded a hardware-specific feature straight into the events list. This is not how things should work. In particular, specifying --event twice to get a single column of output is not right. We attempt to present a usable interface to the user.

What we need here is some synthetic event, with a synthetic unit mask selection, that does the right thing in terms of programming the underlying two counters. For the example above, you might then have:

  # opcontrol --event=L1_LOADS:100000:0x01

where the unit mask chooses between bogus/non-bogus or whatever. Then the kernel driver needs some code to understand the synthetic event and program things suitably.

This is the /only/ way that such things will be usable. It's a pain to do, but it has to be done...

BTW, your patch seems to break compatibility between the 2.6 kernel and the userspace tools. Is it not possible to avoid this (especially given my comments here)? I do not believe we're in a position to demand upgrades from 2.6 users.

regards
john
|
From: John L. <le...@mo...> - 2004-08-27 20:03:17
|
On Fri, Aug 27, 2004 at 08:59:10PM +0100, John Levon wrote:
> # opcontrol --event=L1_LOADS:100000:0x01

BTW, we've wanted a set of synthetic events such as these for quite some time across all the arches. It probably makes most sense to handle the conversion from synthetic -> actual events in userspace where possible.

regards
john
|
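A userspace synthetic -> actual conversion along the lines proposed above could be table-driven. The sketch below is hypothetical: the table layout and expand_synthetic() are illustrative names, not OProfile code. L1_LOADS uses the UOP_TYPE TAGLOADS (0x02) tagging shown earlier in the thread; L1_STORES is an assumed name for the UOP_TYPE TAGSTORES (0x04) variant. Following the suggestion that the synthetic unit mask selects bogus/non-bogus, that mask is passed straight through to FRONT_END_EVENT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One real hardware event to program: its name and unit mask. */
struct hw_event {
	const char *name;
	unsigned unit_mask;
};

/* Hypothetical definition of a tagging-based synthetic event. */
struct synthetic_event {
	const char *name;
	const char *tag_event;		/* tags uops; never reported itself */
	unsigned tag_mask;
	const char *count_event;	/* the event whose count is reported */
};

static const struct synthetic_event synthetic_events[] = {
	{ "L1_LOADS",  "UOP_TYPE", 0x02, "FRONT_END_EVENT" },	/* TAGLOADS */
	{ "L1_STORES", "UOP_TYPE", 0x04, "FRONT_END_EVENT" },	/* TAGSTORES */
};

/*
 * Expand a synthetic event into the two hardware events to program.
 * The synthetic unit mask (0x01 non-bogus, 0x02 bogus) goes to the
 * counting event.  Returns 2 on success, 0 if the name is not synthetic
 * (the caller then treats it as a plain hardware event).
 */
static size_t expand_synthetic(const char *name, unsigned um,
			       struct hw_event out[2])
{
	size_t i;

	assert(name != NULL && out != NULL);
	for (i = 0; i < sizeof(synthetic_events) / sizeof(synthetic_events[0]); i++) {
		const struct synthetic_event *s = &synthetic_events[i];

		if (strcmp(s->name, name) != 0)
			continue;
		out[0].name = s->tag_event;
		out[0].unit_mask = s->tag_mask;
		out[1].name = s->count_event;
		out[1].unit_mask = um;
		return 2;
	}
	return 0;
}
```

For example, "L1_LOADS" with unit mask 0x01 expands to UOP_TYPE:0x02 plus FRONT_END_EVENT:0x01, which is the pair of --event options used by hand earlier in the thread. Validating the synthetic unit mask is left out of this sketch.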
From: Yaoping R. <yruan@CS.Princeton.EDU> - 2004-08-27 20:33:29
|
On Fri, 27 Aug 2004, John Levon wrote:

> On Fri, Aug 27, 2004 at 08:59:10PM +0100, John Levon wrote:
>
> > # opcontrol --event=L1_LOADS:100000:0x01
>
> BTW, we've wanted a set of synthetic events such as these for quite some
> time across all the arches. It probably makes most sense to handle the
> conversion from synthetic -> actual events in userspace where possible.
>

I agree that having the synthetic events in userspace is probably the best way to go. But from what I observed, and compared with other performance counter tools such as perfctr+PAPI, I thought OProfile didn't want to introduce synthetic events in the first place. Evidence of this is that every event in OProfile has exactly its original shape from the Intel manual. Otherwise, providing events such as REPLAY_EVENT, EXECUTION_EVENT, and FRONT_END_EVENT doesn't make any sense. They don't work alone at all; each of them needs to be set together with other events or extra counter registers. Putting them in the list without doing any real work is simply confusing. (To be honest, I spent quite a lot of time figuring out that they don't make sense alone.)

Well, of course I agree that providing users with the most straightforward events is the best way. But in that case, some of the unit masks won't be compatible with Intel's manual. In other words, some of the unit masks in OProfile will have a different meaning than the corresponding unit masks in the manual. So it is your decision now.

Maybe it is worth bringing up another topic now. There are a couple of new events on the P4 which need two more registers, IA32_PEBS_ENABLE and MSR_PEBS_MATRIX_VERT (please refer to the IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide, Table A-6), and both of them need a mask that differs from event to event. So maybe it is better to think about these events as well. Again, I'd better ask you about the user interface first.

Regards
-Yaoping
|
From: John L. <le...@mo...> - 2004-08-27 22:10:50
|
On Fri, Aug 27, 2004 at 04:33:20PM -0400, Yaoping Ruan wrote:
> I agree that having the synthetic events in userspace probably is the
> best way to go.

OK, good.

> But from what I observed, and compared to other performance counter
> tools such as perfctr+PAPI, and others, I though Oprofile doesn't want
> to bring out synthetic events to the first place.

This isn't the case.

> An evidence of this is that every event in Oprofile is exactly the
> original shape as in the Intel manual.

This is true, mainly because Intel managed to keep their perfctr configuration at least vaguely sane until the P4, where, as you know, they made it unusably complicated.

> Otherwise, providing these events
> such as REPLAY_EVNETS, EXECUTION_EVENTS, and FRONT_END_EVENTS doesn't
> make any sense. They don't work alone at all.

And indeed, this is an existing bug that was overlooked in the initial integration of P4 support. They're wrong and they have to go.

> in the list but doing any real work is simply confusing. (To be honest, I
> spent quite a lot time figuring it out that they doesn't make sense alone)

Minimally, they should be removed.

> well, of course I agree that providing users with the most straight
> forward events is the best way. But in that way, some of the unit mask
> won't be compatible with Intel's manual. Or in other words, some of the
> unit mask in Oprofile will have different meaning than the unit mask in
> the manual. So it is your dicision now.

We need to make clear that these are synthetic events, and document somewhere how they actually map onto the real hardware setup.

> Maybe it is worthy to bring another topic now. There are couple of new
> events on P4 which need two more registers, IA32_PENS_ENABLE and
> MSR_PENS_MATRIX_VERT, (please refer to Intel manual: IA-32 Intel
> Architecture Software Developer's Manual, Volume 3: System Programming
> Guide, Table A-6) and both of them needs some mask for different
> events. So maybe it is better to think about these events as well. Again,
> I'd better ask you for user interface first.

I haven't looked, but I imagine we just need some more synthetic events, again designed in a user-friendly manner.

As it is, it looks like your changes will need incompatible kernel changes (it's impossible to restrict them to pure additions in kernel space, right?). With the new kernel development policy, I have no idea when we can actually make this change. But when we do, we should make sure it only happens once.

So if you're interested in doing the work on these synthetic events to support what's currently not handled, I strongly encourage you to do so, and we'll work out a schedule for how it can be integrated.

One alternative, if deemed necessary, would be to version the /dev/oprofile/cpu_type file so we can pick between the older events setup and a new, fixed version.

regards
john
|
From: Yaoping R. <yruan@CS.Princeton.EDU> - 2004-08-30 20:40:05
|
Hi,

I did some more experiments over the weekend and successfully added some other new events which use the PEBS registers. So okay, I'd like to take on the work on these synthetic events to support what's currently not handled. As you suggested, let's work out a schedule for the interface first. The questions I have now are:

1. Unit masks for synthetic events

Not sure if you want to split the synthetic events into individual ones or still use a "mask" for the combination. E.g. memory_loads, which we discussed in the previous posts, could be measured by issuing an event named "MEMORY_LOADS", or by issuing a name like "MEMORY_REFERENCES" + mask 0x02 (please note we need a special explanation for the mask here).

The advantage of the first method is that it is easier for users, and the meaning of the mask (used for non-synthetic events) stays the same as in the Intel manual; the disadvantage is that we'll need more synthetic events to cover all of the combinations, though the userspace program can handle the complexity.

The advantage of the second method is that we leave the flexibility to users, and using event+mask is consistent with the current format, but we need to explain what each mask will give. There will be at least three different kinds of meaning for the masks once all events are supported.

2. Do you want any change to the currently available events?

In other words, do you want to keep them as "natural" events and use their masks as the current format does, or do you want to split the events with their mask combinations into individual ones? E.g. ITLB_REFERENCE with various masks could be divided into ITLB_HIT, ITLB_MISS, ITLB_HIT_UC and ITLB_REF.

3. Kernel patch

> As it is, it looks like your changes will need incompatible kernel
> changes (it's impossible to avoid just additions to kernel space, right
> ?). With the new kernel development policy, I have no idea when we
> can actually make this change. But when we do, we should make sure it
> only happens once.

When you say "incompatible kernel changes", did you mean it couldn't be applied to the current 2.6.x code? If so, I've actually posted another patch against the 2.6.8.1 release. You are right that it is unavoidable to patch the kernel source in order to support these synthetic events, but I can keep the kernel patch as minimal as possible and leave most of the job to userspace.

>
> So if you're interested in doing the work on these synthetic events to
> support what's currently not handled, I strongly encourage you to do so,
> and we'll work out a schedule for how it can be integrated.
>
> One alternative, if deemed necessary, would be to version the
> /dev/oprofile/cpu_type file so we can pick between the older events
> setup, and a new, fixed, version.
>

Regards
-Yaoping
|
From: John L. <le...@mo...> - 2004-08-30 20:57:34
|
On Mon, Aug 30, 2004 at 04:39:57PM -0400, Yaoping Ruan wrote: > I did some more experiments over the weekend and successfully added some > other new events which uses the PEBS registers. So okay, I'd like to take > the work on these synthetic events to support what's currently not > handled. Great. > Not sure if you want to split the synthetic events into individual > ones or still use "mask" for the combination. e.g. memory_loads, which > we discussed in the previous posts, could be measured by issuing an > event named "MEMORY_LOADS", or by issuing an name like > "MEMORY_REFERENCES" + mask 0x02 (please note we need special > explanation for mask here). The advantage of the first method is that > it is easier for users, and the meaning of mask (used for > non-synthetic events) is still the same as what's in the Intel manual, > but the disadvantage is that we'll have more synthetic events to cover > all of the combinations, though user space program can handle the > complexity. Can you give a complete list of what events we'd need if we do it the first way? "MEMORY_REFERENCES" might make sense, or it might not. I think it depends on case to case. > 2. Do you want any change to current available events? In other words, Not right now. As long as we don't need kernel-space support for them (and I don't believe we will) then these can wait. > When you say "incompatible kernel changes", did you mean it couldn't > be patched to current 2.6.x code? Yes. It's a stable series so we can't demand new oprofile userspace after a kernel upgrade. I think. > But I can have as minimal as possible to patch the kernel and leave > most of the job in user space. Yes, that's needed. BTW, regarding the setting of extra MSRs, have you been following the ppc64 thread from Carl Love? It seems that my suggested "mach_mapping" file might apply to some of these P4 cases too, e.g. event:0x<whatever> pens_enable:0x1 which would write to /dev/oprofile/0/pens_enable or something like that. 
regards john |
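As a rough illustration of the mapping-file idea above, here is a minimal C sketch that parses a line of the form `event:0x<code> <name>:0x<value> ...` and builds the `/dev/oprofile/<counter>/<name>` path each extra value would be written to. The line format, the key names `pebs_enable` and `pebs_matrix_vert`, and the helpers `parse_mapping()` and `extra_reg_path()` are all hypothetical; this is not oprofile's actual parser, and it deliberately stops short of writing to /dev/oprofile.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_EXTRA 4

/* One extra register setting, e.g. "pebs_enable:0x1000001" (hypothetical key). */
struct extra_reg {
    char name[32];
    unsigned long value;
};

/* A parsed mapping line: the event code plus any extra register values. */
struct mapping {
    unsigned long event;
    struct extra_reg extra[MAX_EXTRA];
    int nextra;
};

/* Parse "event:0x1d key:0xNN ..." into *m. Returns 0 on success, -1 on error. */
static int parse_mapping(const char *line, struct mapping *m)
{
    char buf[256];
    char *p = buf;
    int have_event = 0;

    assert(line != NULL && m != NULL);
    if (strlen(line) >= sizeof(buf))
        return -1;
    strcpy(buf, line);
    m->nextra = 0;

    while (*p) {
        char *tok, *colon, *end;
        unsigned long v;

        while (*p == ' ' || *p == '\t')   /* skip separators */
            p++;
        if (*p == '\0')
            break;
        tok = p;
        while (*p && *p != ' ' && *p != '\t')
            p++;
        if (*p)
            *p++ = '\0';                  /* terminate this token */

        colon = strchr(tok, ':');
        if (colon == NULL)
            return -1;
        *colon = '\0';
        v = strtoul(colon + 1, &end, 0);  /* base 0 accepts "0x..." */
        if (end == colon + 1 || *end != '\0')
            return -1;

        if (strcmp(tok, "event") == 0) {
            m->event = v;
            have_event = 1;
        } else {
            if (m->nextra == MAX_EXTRA || strlen(tok) >= sizeof(m->extra[0].name))
                return -1;
            strcpy(m->extra[m->nextra].name, tok);
            m->extra[m->nextra].value = v;
            m->nextra++;
        }
    }
    return have_event ? 0 : -1;
}

/* Format the control file for counter ctr, e.g. "/dev/oprofile/0/pebs_enable". */
static int extra_reg_path(char *out, size_t len, int ctr, const char *name)
{
    int n = snprintf(out, len, "/dev/oprofile/%d/%s", ctr, name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A line without an `event:` field, or with a non-numeric value, is rejected rather than guessed at; a real driver-side design might prefer a stricter, fixed set of keys per architecture.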
From: Yaoping R. <yr...@cs...> - 2004-09-12 05:01:58
|
Hi,

Sorry it took so long to reply. Summer is over and it's a transitional time.

> Can you give a complete list of what events we'd need if we do it the
> first way? "MEMORY_REFERENCES" might make sense, or it might not. I
> think it depends on case to case.

Here's a list I compiled from the Intel manual. Each entry is expressed in the format

new event: (current event 1 : mask, current event 2 : mask ||| extra registers : mask)

MEMORY_LOADS: (UOP_TYPE : 0x02, FRONT_END_EVENT : 0x01)
MEMORY_STORES: (UOP_TYPE : 0x04, FRONT_END_EVENT : 0x01)

PACKED_SP_RETIRED: (PACKED_SP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
PACKED_DP_RETIRED: (PACKED_DP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
SCALAR_SP_RETIRED: (SCALAR_SP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
SCALAR_DP_RETIRED: (SCALAR_DP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
64_BIT_MMX_RETIRED: (64_BIT_MMX_UOP : 0x8000, EXECUTION_EVENT : 0x01)
128_BIT_MMX_RETIRED: (128_BIT_MMX_UOP : 0x8000, EXECUTION_EVENT : 0x01)
X87_FP_RETIRED: (X87_FP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
X87_SIMD_MEMORY_MOVES_RETIRED: (X87_SIMD_MEMORY_MOVES_UOP : 0x08, EXECUTION_EVENT : 0x01)

L1_CACHE_LOAD_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000001, MSR_PEBS_MATRIX_VERT : 0x01)
L2_CACHE_LOAD_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000002, MSR_PEBS_MATRIX_VERT : 0x01)
DTLB_LOAD_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000004, MSR_PEBS_MATRIX_VERT : 0x01)
DTLB_STORE_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000004, MSR_PEBS_MATRIX_VERT : 0x02)
DTLB_ALL_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000004, MSR_PEBS_MATRIX_VERT : 0x03)
TAGGED_MISPRED_BRANCH: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1018000, MSR_PEBS_MATRIX_VERT : 0x10)
MOB_LOAD_REPLAY_RETIRED: (MOB_LOAD_REPLAY : 0x30, REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000200, MSR_PEBS_MATRIX_VERT : 0x01)
SPLIT_LOAD_RETIRED: (LOAD_PORT_REPLY: 0x02, REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000400, MSR_PEBS_MATRIX_VERT : 0x01)
SPLIT_STORE_RETIRED: (STORE_PORT_REPLY: 0x02, REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000400, MSR_PEBS_MATRIX_VERT : 0x02)

(Note: bit 25 of IA32_PEBS_ENABLE shouldn't be set, as you indicated in http://marc.theaimsgroup.com/?l=oprofile-list&m=106850982425202&w=2)

So how would you like to organize them?

> BTW, regarding the setting of extra MSRs, have you been following the
> ppc64 thread from Carl Love? It seems that my suggested "mach_mapping"
> file might apply to some of these P4 cases too, e.g.
>
> event:0x<whatever> pens_enable:0x1
>
> which would write to
>
> /dev/oprofile/0/pens_enable

This should work. Once we decide the interface, I can test on this.

-Yaoping |
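To make the "one synthetic event per name" option concrete, a user-space tool could carry a table that expands each synthetic name into its base events, unit masks, and extra MSR values. The sketch below transcribes a subset of the list above; the struct layout and the `find_synthetic()` helper are hypothetical and are not oprofile's events-file format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A base P4 event and the unit mask it is programmed with. */
struct base_event {
    const char *name;
    unsigned int mask;
};

/* An extra MSR value required by the PEBS-based events. */
struct msr_value {
    const char *msr;
    unsigned long value;
};

/* A synthetic event: up to two base events plus up to two extra MSRs. */
struct synthetic_event {
    const char *name;
    struct base_event base[2];
    size_t nbase;
    struct msr_value msrs[2];
    size_t nmsrs;
};

/* A subset of the entries listed above, transcribed as-is. */
static const struct synthetic_event synthetic_events[] = {
    { "MEMORY_LOADS",
      { { "UOP_TYPE", 0x02 }, { "FRONT_END_EVENT", 0x01 } }, 2,
      { { NULL, 0 } }, 0 },
    { "MEMORY_STORES",
      { { "UOP_TYPE", 0x04 }, { "FRONT_END_EVENT", 0x01 } }, 2,
      { { NULL, 0 } }, 0 },
    { "L1_CACHE_LOAD_MISS_RETIRED",
      { { "REPLAY_EVENT", 0x01 } }, 1,
      { { "IA32_PEBS_ENABLE", 0x1000001 }, { "MSR_PEBS_MATRIX_VERT", 0x01 } }, 2 },
    { "DTLB_ALL_MISS_RETIRED",
      { { "REPLAY_EVENT", 0x01 } }, 1,
      { { "IA32_PEBS_ENABLE", 0x1000004 }, { "MSR_PEBS_MATRIX_VERT", 0x03 } }, 2 },
    { "MOB_LOAD_REPLAY_RETIRED",
      { { "MOB_LOAD_REPLAY", 0x30 }, { "REPLAY_EVENT", 0x01 } }, 2,
      { { "IA32_PEBS_ENABLE", 0x1000200 }, { "MSR_PEBS_MATRIX_VERT", 0x01 } }, 2 },
};

/* Look up a synthetic event by name; returns NULL if it is not in the table. */
static const struct synthetic_event *find_synthetic(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(synthetic_events) / sizeof(synthetic_events[0]); i++)
        if (strcmp(synthetic_events[i].name, name) == 0)
            return &synthetic_events[i];
    return NULL;
}
```

Because each component keeps its Intel unit mask unchanged, "mask" retains its manual meaning for non-synthetic events, which was the stated advantage of this option; the cost is simply more table rows.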
From: John L. <le...@mo...> - 2004-09-13 00:12:24
|
On Sun, Sep 12, 2004 at 01:00:38AM -0400, Yaoping Ruan wrote:
> > BTW, regarding the setting of extra MSRs, have you been following the
> > ppc64 thread from Carl Love? It seems that my suggested "mach_mapping"
> > file might apply to some of these P4 cases too, e.g.
> >
> > event:0x<whatever> pens_enable:0x1
> >
> > which would write to
> >
> > /dev/oprofile/0/pens_enable
>
> This should work. Once we decide the interface, I can test on this.

I'll probably be merging their code after the release of 0.8.1, and then
we can work on the necessary changes.

> new event: (current event 1 : mask, current event 2 : mask ||| extra registers : mask)
>
> MEMORY_LOADS: (UOP_TYPE : 0x02, FRONT_END_EVENT : 0x01)
> MEMORY_STORES: (UOP_TYPE : 0x04, FRONT_END_EVENT : 0x01)
>
> PACKED_SP_RETIRED: (PACKED_SP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
> PACKED_DP_RETIRED: (PACKED_DP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
> SCALAR_SP_RETIRED: (SCALAR_SP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
> SCALAR_DP_RETIRED: (SCALAR_DP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
> 64_BIT_MMX_RETIRED: (64_BIT_MMX_UOP : 0x8000, EXECUTION_EVENT : 0x01)
> 128_BIT_MMX_RETIRED: (128_BIT_MMX_UOP : 0x8000, EXECUTION_EVENT : 0x01)
> X87_FP_RETIRED: (X87_FP_UOP : 0x8000, EXECUTION_EVENT : 0x01)
> X87_SIMD_MEMORY_MOVES_RETIRED: (X87_SIMD_MEMORY_MOVES_UOP : 0x08, EXECUTION_EVENT : 0x01)
>
> L1_CACHE_LOAD_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000001, MSR_PEBS_MATRIX_VERT : 0x01)
> L2_CACHE_LOAD_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000002, MSR_PEBS_MATRIX_VERT : 0x01)
> DTLB_LOAD_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000004, MSR_PEBS_MATRIX_VERT : 0x01)
> DTLB_STORE_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000004, MSR_PEBS_MATRIX_VERT : 0x02)
> DTLB_ALL_MISS_RETIRED: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000004, MSR_PEBS_MATRIX_VERT : 0x03)
> TAGGED_MISPRED_BRANCH: (REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1018000, MSR_PEBS_MATRIX_VERT : 0x10)
> MOB_LOAD_REPLAY_RETIRED: (MOB_LOAD_REPLAY : 0x30, REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000200, MSR_PEBS_MATRIX_VERT : 0x01)
> SPLIT_LOAD_RETIRED: (LOAD_PORT_REPLY: 0x02, REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000400, MSR_PEBS_MATRIX_VERT : 0x01)
> SPLIT_STORE_RETIRED: (STORE_PORT_REPLY: 0x02, REPLAY_EVENT : 0x01 ||| IA32_PEBS_ENABLE : 0x1000400, MSR_PEBS_MATRIX_VERT : 0x02)
>
> So how would you like to organize them?

These new events as specified by Intel seem usable already, right? I was
thinking that might not be the case. So it's obvious that our synthetic
event name needs to be e.g. "MEMORY_LOADS".

I'm not really in the mood for dealing with Intel docs. For, say,
MEMORY_LOADS, what actually needs programming? Are there two counters,
one of which we have to set up with UOP_TYPE, and one with
FRONT_END_EVENT? Or what?

cheers
john |
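On the question of what MEMORY_LOADS needs: per the list above and the original report, both UOP_TYPE (mask 0x02) and FRONT_END_EVENT (mask 0x01) have to be programmed; whether the tagging side needs a counter of its own or only its ESCR is worth confirming in the manual. The sketch below only shows how an ESCR value might be assembled from an event-select code and a unit mask. The field positions (event select in bits 30:25, event mask in bits 24:9, T0_OS at bit 3, T0_USR at bit 2) are my reading of the IA-32 manual and should be checked against it; `escr_value()` is a hypothetical helper, and the tag-enable/tag-value fields are deliberately not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed P4 ESCR field positions; verify against the IA-32 manual. */
#define ESCR_EVENT_SELECT_SHIFT 25   /* bits 30:25, 6-bit field */
#define ESCR_EVENT_MASK_SHIFT    9   /* bits 24:9, 16-bit field */
#define ESCR_T0_OS  (1u << 3)        /* count in kernel mode, thread 0 */
#define ESCR_T0_USR (1u << 2)        /* count in user mode, thread 0 */

/* Build a thread-0 ESCR value from an event-select code and a unit mask. */
static uint32_t escr_value(unsigned select, unsigned mask, int user, int kernel)
{
    uint32_t v = 0;

    assert(select <= 0x3f);
    assert(mask <= 0xffff);
    v |= (uint32_t)select << ESCR_EVENT_SELECT_SHIFT;
    v |= (uint32_t)mask << ESCR_EVENT_MASK_SHIFT;
    if (user)
        v |= ESCR_T0_USR;
    if (kernel)
        v |= ESCR_T0_OS;
    return v;
}
```

For MEMORY_LOADS one would call this twice, once with UOP_TYPE's event-select code and mask 0x02 and once with FRONT_END_EVENT's code and mask 0x01; the select codes themselves are left to the manual here rather than guessed.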
From: John L. <le...@mo...> - 2004-09-15 01:31:40
|
On Sun, Sep 12, 2004 at 01:00:38AM -0400, Yaoping Ruan wrote:
> This should work. Once we decide the interface, I can test on this.

So, I just committed the PPC64 work. The "event_mappings" stuff should be
a good start for you on this.

cheers
john |
From: Philippe E. <ph...@wa...> - 2004-08-27 20:12:46
|
On Fri, 27 Aug 2004 at 20:59 +0000, John Levon wrote:
> BTW, your patch seems to break compatibility between 2.6 kernel and
> userspace tools. Is it not possible to avoid this (especially given my
> comments here)? I do not believe we're in a position to demand upgrades
> from 2.6 users.

afaics backward compatibility can work by adding the two new counters at
the end of the counter list; it'll also reduce the patch size a lot.
It'll be compatible both ways (except, obviously, trying to use the new
counters with an old driver).

--
phe |
From: Yaoping R. <yruan@CS.Princeton.EDU> - 2004-08-27 20:42:19
|
On Fri, 27 Aug 2004, Philippe Elie wrote:
> On Fri, 27 Aug 2004 at 20:59 +0000, John Levon wrote:
>
> > BTW, your patch seems to break compatibility between 2.6 kernel and
> > userspace tools. Is it not possible to avoid this (especially given my
> > comments here)? I do not believe we're in a position to demand upgrades
> > from 2.6 users.
>
> afaics backward compatibility can work by adding the two new counters at
> the end of the counter list; it'll also reduce the patch size a lot.
> It'll be compatible both ways (except, obviously, trying to use the new
> counters with an old driver).

Well, I can provide the patch against the source in the 2.6 kernel. I also
noticed the differences between CVS and 2.6.x, but I didn't know which one
you are working on.

Did you mean putting the two new counters, CTR_IQ_0 and CTR_IQ_2, at the
end of the current counters? If so, it doesn't work for Hyper-Threading.
In an HT environment, the counters work in pairs, if I'm not mistaken. The
following shows the current 4 pairs: BPU_0 vs. BPU_2, MS_0 vs. MS_2 ...

#define CTR_BPU_0      (1 << 0)
#define CTR_MS_0       (1 << 1)
#define CTR_FLAME_0    (1 << 2)
#define CTR_IQ_4       (1 << 3)
#define CTR_BPU_2      (1 << 5)
#define CTR_MS_2       (1 << 6)
#define CTR_FLAME_2    (1 << 7)
#define CTR_IQ_5       (1 << 8)

Not sure if this is what you meant?

Regards
-Yaoping |
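The pairing objection can be checked with a toy model. Assuming the driver splits its counter table in half under HT, giving thread `stagger` the slot `i + per_thread * stagger` (my understanding of the P4 driver's virtual-counter mapping, not verified here), appending IQ_0/IQ_2 at the end misaligns the two halves, while placing each new counter at the end of its thread's half keeps them paired. The tables and the helpers `virt_ctr()`, `same_unit()` and `pairs_ok()` are hypothetical, and the model compares counter names only, not the real ESCR/CCCR constraints.

```c
#include <assert.h>
#include <string.h>

/* Counter slots as a flat table; names follow the CTR_* defines above. */
static const char *const appended[10] = {
    "BPU_0", "MS_0", "FLAME_0", "IQ_4",
    "BPU_2", "MS_2", "FLAME_2", "IQ_5",
    "IQ_0", "IQ_2"                                /* new pair tacked on at the end */
};

static const char *const interleaved[10] = {
    "BPU_0", "MS_0", "FLAME_0", "IQ_4", "IQ_0",   /* thread 0 half */
    "BPU_2", "MS_2", "FLAME_2", "IQ_5", "IQ_2"    /* thread 1 half */
};

/* Assumed HT mapping: thread `stagger` uses slot i + per_thread * stagger. */
static int virt_ctr(int stagger, int i, int per_thread)
{
    assert(stagger == 0 || stagger == 1);
    assert(i >= 0 && i < per_thread);
    return i + per_thread * stagger;
}

/* Same hardware unit if the names match up to the '_' (e.g. BPU_0 / BPU_2). */
static int same_unit(const char *a, const char *b)
{
    size_t la = strcspn(a, "_");
    return la == strcspn(b, "_") && strncmp(a, b, la) == 0;
}

/* True if every logical counter lands on the same unit for both threads. */
static int pairs_ok(const char *const *table, int per_thread)
{
    int i;

    for (i = 0; i < per_thread; i++)
        if (!same_unit(table[virt_ctr(0, i, per_thread)],
                       table[virt_ctr(1, i, per_thread)]))
            return 0;
    return 1;
}
```

In this model the first eight entries of `appended` (today's layout) pair correctly with 4 counters per thread, the full appended table fails with 5, and the interleaved table passes. The interleaved layout does shift the existing second-half slots by one, which may be where the compatibility concern resurfaces.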
From: Yaoping R. <yruan@CS.Princeton.EDU> - 2004-08-27 21:49:39
|
Hi,

I just want to clarify that no matter what user-space interface we end up
with, we need to expand the counter pairs to 5 in the kernel.

I split the patch into two: one against the Linux 2.6.8.1 release source,
which has the kernel portion, and the other against the Oprofile 0.8
release, which has the user-space event specification and counter
assignment portion.

Please review and let me know of any problems.

Thanks
-Yaoping

On Fri, 27 Aug 2004, Yaoping Ruan wrote:
> > On Fri, 27 Aug 2004 at 20:59 +0000, John Levon wrote:
> >
> > > BTW, your patch seems to break compatibility between 2.6 kernel and
> > > userspace tools. Is it not possible to avoid this (especially given my
> > > comments here)? I do not believe we're in a position to demand upgrades
> > > from 2.6 users. |