From: William C. <wc...@nc...> - 2002-12-11 17:00:50
|
I tried an experiment this morning with our 2.4 kernel with the backported 2.5 oprofile support and the lockless sync patch. I used the current oprofile cvs snapshot for daemon and support software. I built did a "make bootstrap" of current gcc cvs checkout on a uniprocessor P4 machine. Just collecting GLOBAL_POWER_EVENTS. op_time|grep oprofile 1252 0.0379 0.0000 /lib/modules/2.4.20-ac1/kernel/arch/i386/oprofile/oprofile.o 1344 0.0407 0.0000 /usr/local/bin/oprofiled Taking a closer look where those sample lie for /lib/modules/2.4.20-ac1/kernel/arch/i386/oprofile/oprofile.o oprofpp /lib/modules/2.4.20-ac1/kernel/arch/i386/oprofile/oprofile.o -l -r Cpu type: P4 / Xeon Cpu speed was (MHz estimation) : 2386.6 Counter 0 counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (count cycles when processor is active) count 1193000 vma samples %-age symbol name 00000910 711 56.7891 sync_buffer 00000a80 152 12.1406 add_event_entry 00000670 98 7.82748 lookup_dcookie 00000610 67 5.35144 get_exec_dcookie 00000810 56 4.47284 add_us_sample 000004e0 41 3.27476 mm_notify 000009d0 29 2.31629 sync_cpu_buffers 00000710 29 2.31629 add_cpu_switch 00000880 24 1.91693 add_sample 000008e0 16 1.27796 take_task_mm 000007e0 15 1.19808 add_sample_entry 000008c0 8 0.638978 release_mm 000007a0 3 0.239617 add_cookie_switch 00000a40 1 0.0798722 timer_ping 00000a30 1 0.0798722 wq_sync_buffers 00000750 1 0.0798722 add_ctx_switch Virtually all of them are in the sync_buffer. oprofpp /lib/modules/2.4.20-ac1/kernel/arch/i386/oprofile/oprofile.o -s sync_buffer Cpu type: P4 / Xeon Cpu speed was (MHz estimation) : 2386.6 Counter 0 counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (count cycles when processor is active) count 1193000 vma samples %-age symbol name 00000910 711 100 sync_buffer 00000910 11 1.54712 00000911 1 0.140647 00000914 1 0.140647 0000091b 7 0.984529 00000923 2 0.281294 00000937 1 0.140647 00000939 2 0.281294 00000940 1 0.140647 00000943 1 0.140647 00000949 29 4.07876 0000094b 1 0.140647 0000095a 1 0.140647 0000095d 1 0.140647 00000966 1 0.140647 0000096e 2 0.281294 00000974 1 0.140647 00000980 6 0.843882 00000986 592 83.263 00000988 3 0.421941 0000098c 4 0.562588 0000098f 3 0.421941 00000992 4 0.562588 00000993 2 0.281294 00000998 2 0.281294 0000099a 1 0.140647 0000099e 2 0.281294 000009a2 1 0.140647 000009b0 2 0.281294 000009b4 6 0.843882 000009b8 6 0.843882 000009bb 3 0.421941 000009c0 9 1.26582 000009c4 2 0.281294 Most of the samples are located right after 980, a lock instruction, which I believe is the result of the rmb(). Below is the section of code from objdump related to the samples in sync_buffer. 97b: e8 d0 fd ff ff call 750 <add_ctx_switch> 980: f0 83 44 24 00 00 lock addl $0x0,0x0(%esp,1) 986: 31 d2 xor %edx,%edx 988: 8b 46 04 mov 0x4(%esi),%eax 98b: 40 inc %eax 98c: 3b 46 08 cmp 0x8(%esi),%eax 98f: 0f 42 d0 cmovb %eax,%edx 992: 47 inc %edi It is still a bit hard to determine the overhead of the exising method because we don't separate out the samples lost to the buffer sync. The existing locks could be just as expensive, but we really don't know. I haven't tried it on SMP machines to see how much of a difference it makes. -Will John Levon wrote: > Have you run your lockless sync patch on 2.5 ? If so, where does our > stuff appear in the profiles ? > > Just ooi > > regards > john |