From: Julian S. <js...@ac...> - 2003-03-18 08:37:24
[2nd try at getting this to v-users list]

On Monday 17 March 2003 7:37 pm, Nicholas Nethercote wrote:
> On Mon, 17 Mar 2003, Jason Evans wrote:
> > > But the numbers are not good, actually a performance decrease against
> > > switch().
> >
> > I've recently been doing some experimentation with computed gotos in an
> > unrelated program, and I've also observed a slowdown in most cases. This
> > indicates to me that gcc typically does a fine job of optimizing switch
> > statements, and there isn't a whole lot to be gained by second guessing
> > it in such cases.

The computed goto thing is useful for speeding up bytecode interpreters
-- I've used it for that before now -- but this isn't such a case. It's
the switch in the x86 parser which switches on the opcodes being
examined. It is used only once per instruction which V translates and
so the cost difference (a few host insns) must be completely swamped by
the rest of the translation costs (000s of host insns per translated
insn, typically). And translation costs are usually small (0-15%)
compared to the cost of running the translation.

So I'm mystified where the 7% speedup number comes from.

J
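For readers who have not met the technique under discussion, here is a minimal sketch of the two dispatch styles being compared. It is not Valgrind code: the three-opcode bytecode and the names `run_switch`, `run_goto` and `NEXT` are made up for illustration. The second function relies on GCC's labels-as-values extension (`&&label` and `goto *ptr`), so it is not standard C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-opcode bytecode, for illustration only. */
enum { OP_INC = 0, OP_DEC = 1, OP_HALT = 2 };

/* Dispatch with an ordinary switch inside a loop. */
long run_switch(const unsigned char *code)
{
    long acc = 0;
    assert(code != NULL);
    for (;;) {
        switch (*code++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        default:      return acc;   /* unknown opcode: stop */
        }
    }
}

/* Dispatch with GCC's labels-as-values extension ("computed goto"):
   each handler ends with its own indirect jump to the next handler. */
long run_goto(const unsigned char *code)
{
    static void *const handlers[] = { &&do_inc, &&do_dec, &&do_halt };
    long acc = 0;
    assert(code != NULL);

#define NEXT() goto *handlers[*code++]   /* assumes every opcode is < 3 */
    NEXT();
do_inc:  acc++; NEXT();
do_dec:  acc--; NEXT();
do_halt: return acc;
#undef NEXT
}
```

In an interpreter the dispatch runs once per executed bytecode, which is why the style can matter there. In the case Julian describes, the switch runs once per translated instruction, so any per-dispatch saving is paid back far less often.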
From: Christian L. <chr...@le...> - 2003-03-18 10:51:15
On Tue, Mar 18, 2003 at 08:45:08AM +0000, Julian Seward wrote:
> The computed goto thing is useful for speeding up bytecode interpreters
> -- I've used it for that before now -- but this isn't such a case. It's
> the switch in the x86 parser which switches on the opcodes being
> examined. It is used only once per instruction which V translates and
> so the cost difference (a few host insns) must be completely swamped by
> the rest of the translation costs (000s of host insns per translated
> insn, typically). And translation costs are usually small (0-15%)
> compared to the cost of running the translation.
> So I'm mystified where
> the 7% speedup number comes from.

Yes, absolutely, it's very obscure.

Some small changes decreased the performance again; damn, I fooled
myself.

It could be some alignment effect.
So perhaps this stupid 7% only shows up on Athlon XPs.

But I don't know how to measure how long each function takes; with that
it should be easy to find the function where this difference is.

Christian Leber

--
"Omnis enim res, quae dando non deficit, dum habetur et non datur,
nondum habetur, quomodo habenda est." (Aurelius Augustinus)
Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>
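Julian's cost argument, quoted at the top of this message, can be put into numbers with a small sketch. The function name `max_speedup_fraction` is made up, and the sample values in the usage note (5 host insns saved, 1000 host insns per translated insn) are assumptions standing in for his "a few" and "000s"; only the 15% translation share comes from the thread.

```c
#include <assert.h>

/* Upper bound on the whole-run speedup, as a fraction of total run time,
   from shaving `saved_insns` host instructions off each translation.
   translation_share:     fraction of total time spent translating (0..1)
   insns_per_translation: host insns spent translating one guest insn */
double max_speedup_fraction(double translation_share,
                            double saved_insns,
                            double insns_per_translation)
{
    assert(translation_share >= 0.0 && translation_share <= 1.0);
    assert(insns_per_translation > 0.0);
    return translation_share * (saved_insns / insns_per_translation);
}
```

With the assumed values, `max_speedup_fraction(0.15, 5.0, 1000.0)` gives 0.00075, i.e. well under a tenth of a percent of total run time. That is about two orders of magnitude short of 7%, which is consistent with the suspicion that the measured difference comes from something else, such as alignment.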
From: Nicholas N. <nj...@ca...> - 2003-03-18 10:58:24
On Tue, 18 Mar 2003, Christian Leber wrote:
> > So I'm mystified where the 7% speedup number comes from.
>
> Yes, absolutely, it's very obscure.
>
> Some little changes decreased the performance again, damn, I fooled
> myself.
>
> It could be some alignment effect.
> So perhaps this stupid 7% are also only on athlon-xp's.
>
> But I don't know how to analyse what function takes how long, then it
> should be easy to find the function where this difference is.

If you #include "profile.c" into a skin and recompile, the --profile=yes
option turns on tick-based profiling. That's where Julian got his 0--15%
figure for translation from (note that's for all translation).

N
From: Christian L. <chr...@le...> - 2003-03-20 00:57:28
On Tue, Mar 18, 2003 at 10:58:20AM +0000, Nicholas Nethercote wrote:
> If you #include "profile.c" into a skin and recompile, the --profile=yes
> option turns on tick-based profiling. That's where Julian got his 0--15%
> figure for translation from (note that's for all translation).

Sorry, a little bit late.

I ran with both, but the "resolution" isn't good enough; the numbers
are OK and reproducible.

with the patch:

core:/home/ijuz/Mail/la# time nice -n -10 /work/dev/val/bin/valgrind --profile=yes gzip -9 politech_at_politechbot_com
==25094== Memcheck, a.k.a. Valgrind, a memory error detector for x86-linux.
==25094== Copyright (C) 2002, and GNU GPL'd, by Julian Seward.
==25094== Using valgrind-1.9.4, a program instrumentation system for x86-linux.
==25094== Copyright (C) 2000-2002, and GNU GPL'd, by Julian Seward.
==25094== Estimated CPU clock rate is 1551 MHz
==25094== For more details, rerun with: -v
==25094==
==25094==
==25094== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==25094== malloc/free: in use at exit: 0 bytes in 0 blocks.
==25094== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==25094== For a detailed leak analysis, rerun with: --leak-check=yes
==25094== For counts of detected errors, rerun with: -v

Profiling done, 1402 ticks
 0:    2 (  1 %%) ticks,        1 entries for unclassified
 1: 1225 (873 %%) ticks,     4759 entries for running
 2:    6 (  4 %%) ticks,        1 entries for scheduler
 3:    3 (  2 %%) ticks,    40568 entries for low-lev malloc/free
 4:    0 (  0 %%) ticks,        0 entries for client malloc/free
 5:    0 (  0 %%) ticks,        0 entries for adjust-stack
 6:    0 (  0 %%) ticks,     1190 entries for translate-main
 7:    0 (  0 %%) ticks,     1190 entries for to-ucode
 8:    3 (  2 %%) ticks,     1190 entries for from-ucode
 9:    0 (  0 %%) ticks,     1190 entries for improve
10:    1 (  0 %%) ticks,     1190 entries for reg-alloc
11:    0 (  0 %%) ticks,     1190 entries for liveness-analysis
12:    0 (  0 %%) ticks,        0 entries for do-lru
13:    0 (  0 %%) ticks,     2382 entries for slow-search-transtab
14:    0 (  0 %%) ticks,        1 entries for init-memory
15:    0 (  0 %%) ticks,        0 entries for exe-context
16:    0 (  0 %%) ticks,        0 entries for read-syms
17:    0 (  0 %%) ticks,        0 entries for search-syms
18:    0 (  0 %%) ticks,        0 entries for add-to-transtab
19:    0 (  0 %%) ticks,      459 entries for core-syscall-wrapper
20:    0 (  0 %%) ticks,        0 entries for demangle
21:    0 (  0 %%) ticks,     3329 entries for core-cheap-sanity
22:    3 (  2 %%) ticks,      134 entries for core-expensive-sanity
23:    0 (  0 %%) ticks,        0 entries for pre-clo-init
24:    0 (  0 %%) ticks,        0 entries for post-clo-init
25:    1 (  0 %%) ticks,     1190 entries for instrument
26:    0 (  0 %%) ticks,      478 entries for skin-syscall-wrapper
27:    0 (  0 %%) ticks,     3329 entries for skin-cheap-sanity
28:   30 ( 21 %%) ticks,      134 entries for skin-expensive-sanity
29:    0 (  0 %%) ticks,        0 entries for fini
30:    2 (  1 %%) ticks,     1436 entries for check-mem-perms
31:  126 ( 89 %%) ticks, 52566183 entries for set-mem-perms

real    0m15.607s
user    0m13.960s
sys     0m0.090s

and without:

core:/home/ijuz/Mail/la# time nice -n -10 /work/dev/val/bin/valgrind --profile=yes gzip -9 politech_at_politechbot_com
==25629== Memcheck, a.k.a. Valgrind, a memory error detector for x86-linux.
==25629== Copyright (C) 2002, and GNU GPL'd, by Julian Seward.
==25629== Using valgrind-1.9.4, a program instrumentation system for x86-linux.
==25629== Copyright (C) 2000-2002, and GNU GPL'd, by Julian Seward.
==25629== Estimated CPU clock rate is 1545 MHz
==25629== For more details, rerun with: -v
==25629==
==25629==
==25629== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==25629== malloc/free: in use at exit: 0 bytes in 0 blocks.
==25629== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==25629== For a detailed leak analysis, rerun with: --leak-check=yes
==25629== For counts of detected errors, rerun with: -v

Profiling done, 1528 ticks
 0:    1 (  0 %%) ticks,        1 entries for unclassified
 1: 1320 (863 %%) ticks,     4757 entries for running
 2:    9 (  5 %%) ticks,        1 entries for scheduler
 3:    2 (  1 %%) ticks,    40568 entries for low-lev malloc/free
 4:    0 (  0 %%) ticks,        0 entries for client malloc/free
 5:    0 (  0 %%) ticks,        0 entries for adjust-stack
 6:    0 (  0 %%) ticks,     1190 entries for translate-main
 7:    1 (  0 %%) ticks,     1190 entries for to-ucode
 8:    1 (  0 %%) ticks,     1190 entries for from-ucode
 9:    2 (  1 %%) ticks,     1190 entries for improve
10:    2 (  1 %%) ticks,     1190 entries for reg-alloc
11:    0 (  0 %%) ticks,     1190 entries for liveness-analysis
12:    0 (  0 %%) ticks,        0 entries for do-lru
13:    0 (  0 %%) ticks,     2380 entries for slow-search-transtab
14:    0 (  0 %%) ticks,        1 entries for init-memory
15:    0 (  0 %%) ticks,        0 entries for exe-context
16:    0 (  0 %%) ticks,        0 entries for read-syms
17:    0 (  0 %%) ticks,        0 entries for search-syms
18:    0 (  0 %%) ticks,        0 entries for add-to-transtab
19:    0 (  0 %%) ticks,      459 entries for core-syscall-wrapper
20:    0 (  0 %%) ticks,        0 entries for demangle
21:    0 (  0 %%) ticks,     3329 entries for core-cheap-sanity
22:    7 (  4 %%) ticks,      134 entries for core-expensive-sanity
23:    0 (  0 %%) ticks,        0 entries for pre-clo-init
24:    0 (  0 %%) ticks,        0 entries for post-clo-init
25:    3 (  1 %%) ticks,     1190 entries for instrument
26:    0 (  0 %%) ticks,      478 entries for skin-syscall-wrapper
27:    0 (  0 %%) ticks,     3329 entries for skin-cheap-sanity
28:   31 ( 20 %%) ticks,      134 entries for skin-expensive-sanity
29:    0 (  0 %%) ticks,        0 entries for fini
30:    6 (  3 %%) ticks,     1436 entries for check-mem-perms
31:  143 ( 93 %%) ticks, 52566183 entries for set-mem-perms

real    0m16.796s
user    0m15.230s
sys     0m0.090s

Christian Leber

--
"Omnis enim res, quae dando non deficit, dum habetur et non datur,
nondum habetur, quomodo habenda est." (Aurelius Augustinus)
Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>
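To read the two runs above side by side, the relative differences can be worked out with a small helper. This is only a sketch of the arithmetic; the name `percent_reduction` is made up, and the inputs are the tick counts and user times printed in the message. Note that the parenthesised column in the profile appears to be in tenths of a percent: 1225 of 1402 ticks is roughly 87%, printed as 873.

```c
#include <assert.h>

/* Relative reduction from `before` to `after`, in percent. */
double percent_reduction(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

Applied to the numbers above: total ticks 1528 to 1402 is about 8.2%, "running" ticks 1320 to 1225 is about 7.2%, and user time 15.230 s to 13.960 s is about 8.3%. Nearly all of the tick difference sits in "running" and "set-mem-perms", not in the translation phases (entries 6 to 11), which each account for at most a few ticks in either run.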
From: Christian L. <chr...@le...> - 2003-03-20 01:40:20
On Tue, Mar 18, 2003 at 10:58:20AM +0000, Nicholas Nethercote wrote:
> If you #include "profile.c" into a skin and recompile, the --profile=yes
> option turns on tick-based profiling. That's where Julian got his 0--15%
> figure for translation from (note that's for all translation).

Ok, sorry, forget all the junk I wrote.

make CC="gcc -falign-functions=16" install

and "running" was 1215 ticks; with 8 it's again 1315

(always 16-byte alignment)
speed:
function+loops (1349 ticks) < function (1215) < function+jump (1207)
< function+jump+label (1194) < function+jump+label+loops (1173)

compare it to the values in the other mail

At least the function alignment seems to be good on an Athlon. I don't
know about other systems, but I think it won't be bad there either.

Christian Leber

--
"Omnis enim res, quae dando non deficit, dum habetur et non datur,
nondum habetur, quomodo habenda est." (Aurelius Augustinus)
Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>
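Christian's build applies the alignment to every function through the compiler flag. For anyone who wants to try the effect on a single hot function instead, the sketch below uses a different mechanism, the GCC/Clang `aligned` function attribute, and checks the resulting entry address. The names `hot_path` and `entry_is_aligned` are made up for illustration; nothing here is taken from the Valgrind sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hot function; the attribute requests a minimum 16-byte
   alignment for its first instruction (GCC/Clang extension). */
__attribute__((aligned(16), noinline))
int hot_path(int x)
{
    return x * 2 + 1;
}

/* Returns 1 if the function's entry address is a multiple of `boundary`. */
int entry_is_aligned(int (*fn)(int), uintptr_t boundary)
{
    assert(fn != 0 && boundary != 0);
    return ((uintptr_t)fn % boundary) == 0;
}
```

Calling `entry_is_aligned(hot_path, 16)` should return 1 when the attribute is honoured. Comparing tick counts with and without the attribute on one function at a time is one way to narrow down which function is sensitive to alignment.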