|
From: Nicholas N. <nj...@cs...> - 2005-12-15 19:55:40
|
Hi,

Julian's commits r5345 and r5346 (avoiding the profiling in the
dispatcher, and using jumps instead of call/return) have the following
effect on my 3.0 GHz P4 Prescott.

Before and after, trunk:

-- bigcode1 --
bigcode1 trunk1 : 0.2s  nl: 6.5s (32.7x, -----)  mc:12.1s (60.7x, -----)
bigcode1 trunk5 : 0.2s  nl: 5.5s (27.7x, 15.3%)  mc: 9.4s (47.0x, 22.6%)
-- bigcode2 --
bigcode2 trunk1 : 0.2s  nl:13.1s (65.4x, -----)  mc:23.6s (117.8x, -----)
bigcode2 trunk5 : 0.2s  nl:11.4s (57.0x, 12.9%)  mc:20.6s (103.0x, 12.6%)
-- bz2 --
bz2      trunk1 : 1.3s  nl: 9.4s ( 7.3x, -----)  mc:25.9s (20.1x, -----)
bz2      trunk5 : 1.3s  nl: 7.2s ( 5.5x, 23.6%)  mc:22.5s (17.4x, 13.4%)
-- fbench --
fbench   trunk1 : 1.1s  nl: 5.0s ( 4.5x, -----)  mc:12.8s (11.4x, -----)
fbench   trunk5 : 1.1s  nl: 4.2s ( 3.8x, 15.5%)  mc:12.2s (10.9x,  4.5%)
-- ffbench --
ffbench  trunk1 : 0.8s  nl: 3.8s ( 4.5x, -----)  mc:11.2s (13.1x, -----)
ffbench  trunk5 : 0.8s  nl: 4.1s ( 4.8x, -6.8%)  mc:10.9s (12.8x,  2.2%)
-- gcc --
gcc      trunk1 : 0.3s  nl:12.3s (38.4x, -----)  mc:31.1s (97.3x, -----)
gcc      trunk5 : 0.3s  nl:10.8s (33.7x, 12.3%)  mc:30.0s (93.8x,  3.6%)
-- sarp --
sarp     trunk1 : 0.1s  nl: 0.9s (12.4x, -----)  mc:11.1s (158.4x, -----)
sarp     trunk5 : 0.1s  nl: 0.5s ( 6.4x, 48.3%)  mc:10.8s (154.9x,  2.3%)
-- Finished tests in perf ----------------------------------------------

Before and after, COMPVBITS:

-- bigcode1 --
bigcode1 compvbits : 0.2s  nl: 7.0s (34.8x, -----)  mc:10.6s (53.1x, -----)
bigcode1 compvbits3: 0.2s  nl: 5.6s (27.8x, 20.1%)  mc: 8.9s (44.3x, 16.6%)
-- bigcode2 --
bigcode2 compvbits : 0.2s  nl:12.7s (63.2x, -----)  mc:21.8s (109.0x, -----)
bigcode2 compvbits3: 0.2s  nl:11.5s (57.5x,  9.0%)  mc:19.6s (98.0x, 10.0%)
-- bz2 --
bz2      compvbits : 1.3s  nl: 9.4s ( 7.2x, -----)  mc:27.1s (20.9x, -----)
bz2      compvbits3: 1.3s  nl: 7.3s ( 5.6x, 22.5%)  mc:22.3s (17.2x, 17.8%)
-- fbench --
fbench   compvbits : 1.1s  nl: 5.0s ( 4.4x, -----)  mc:11.6s (10.3x, -----)
fbench   compvbits3: 1.1s  nl: 4.2s ( 3.7x, 15.6%)  mc:11.2s ( 9.9x,  3.4%)
-- ffbench --
ffbench  compvbits : 0.8s  nl: 3.8s ( 4.5x, -----)  mc: 9.1s (10.7x, -----)
ffbench  compvbits3: 0.8s  nl: 4.2s ( 5.0x,-10.4%)  mc: 8.8s (10.3x,  3.3%)
-- gcc --
gcc      compvbits : 0.3s  nl:12.1s (39.2x, -----)  mc:29.1s (94.0x, -----)
gcc      compvbits3: 0.3s  nl:10.8s (34.8x, 11.1%)  mc:28.3s (91.3x,  2.8%)
-- sarp --
sarp     compvbits : 0.1s  nl: 0.8s (12.0x, -----)  mc: 4.4s (62.4x, -----)
sarp     compvbits3: 0.1s  nl: 0.4s ( 6.3x, 47.6%)  mc: 4.1s (59.1x,  5.3%)
-- Finished tests in perf ----------------------------------------------

Before and after, trunk and COMPVBITS (the percentages here are all
relative to trunk1, which is the "before" version of the trunk). This
lets you compare the trunk against COMPVBITS:

-- bigcode1 --
bigcode1 trunk1    : 0.2s  nl: 6.5s (32.6x, -----)  mc:12.2s (60.9x, -----)
bigcode1 trunk5    : 0.2s  nl: 5.5s (27.7x, 15.0%)  mc: 9.4s (46.8x, 23.2%)
bigcode1 compvbits : 0.2s  nl: 7.0s (35.0x, -7.4%)  mc:10.5s (52.6x, 13.6%)
bigcode1 compvbits3: 0.2s  nl: 5.5s (27.7x, 15.0%)  mc: 8.8s (44.0x, 27.8%)
-- bigcode2 --
bigcode2 trunk1    : 0.2s  nl:13.1s (65.5x, -----)  mc:23.6s (118.1x, -----)
bigcode2 trunk5    : 0.2s  nl:11.4s (57.0x, 13.0%)  mc:20.6s (103.1x, 12.7%)
bigcode2 compvbits : 0.2s  nl:12.7s (63.6x,  2.9%)  mc:21.9s (109.3x,  7.4%)
bigcode2 compvbits3: 0.2s  nl:11.5s (57.4x, 12.4%)  mc:20.0s (99.9x, 15.4%)
-- bz2 --
bz2      trunk1    : 1.3s  nl: 9.3s ( 7.3x, -----)  mc:25.9s (20.4x, -----)
bz2      trunk5    : 1.3s  nl: 7.2s ( 5.7x, 22.6%)  mc:22.6s (17.8x, 12.8%)
bz2      compvbits : 1.3s  nl: 9.4s ( 7.4x, -0.8%)  mc:27.0s (21.2x, -4.2%)
bz2      compvbits3: 1.3s  nl: 7.3s ( 5.7x, 21.7%)  mc:22.3s (17.6x, 13.9%)
-- fbench --
fbench   trunk1    : 1.1s  nl: 5.0s ( 4.5x, -----)  mc:12.7s (11.3x, -----)
fbench   trunk5    : 1.1s  nl: 4.2s ( 3.8x, 16.0%)  mc:12.1s (10.7x,  5.0%)
fbench   compvbits : 1.1s  nl: 5.0s ( 4.4x,  1.2%)  mc:11.6s (10.2x,  9.1%)
fbench   compvbits3: 1.1s  nl: 4.2s ( 3.7x, 16.2%)  mc:11.3s (10.0x, 11.5%)
-- ffbench --
ffbench  trunk1    : 0.9s  nl: 4.2s ( 4.5x, -----)  mc:11.1s (11.8x, -----)
ffbench  trunk5    : 0.9s  nl: 4.0s ( 4.3x,  3.3%)  mc:10.9s (11.6x,  1.7%)
ffbench  compvbits : 0.9s  nl: 4.2s ( 4.4x,  0.5%)  mc: 9.0s ( 9.6x, 19.2%)
ffbench  compvbits3: 0.9s  nl: 4.0s ( 4.2x,  5.5%)  mc: 8.7s ( 9.3x, 21.6%)
-- gcc --
gcc      trunk1    : 0.3s  nl:12.4s (39.9x, -----)  mc:31.1s (100.3x, -----)
gcc      trunk5    : 0.3s  nl:10.8s (34.9x, 12.4%)  mc:30.0s (96.8x,  3.5%)
gcc      compvbits : 0.3s  nl:12.2s (39.2x,  1.6%)  mc:29.3s (94.5x,  5.8%)
gcc      compvbits3: 0.3s  nl:10.8s (34.9x, 12.5%)  mc:28.3s (91.2x,  9.1%)
-- sarp --
sarp     trunk1    : 0.1s  nl: 0.9s (12.3x, -----)  mc:11.1s (158.7x, -----)
sarp     trunk5    : 0.1s  nl: 0.4s ( 6.3x, 48.8%)  mc:10.9s (155.4x,  2.1%)
sarp     compvbits : 0.1s  nl: 0.8s (12.0x,  2.3%)  mc: 4.4s (62.3x, 60.8%)
sarp     compvbits3: 0.1s  nl: 0.4s ( 6.3x, 48.8%)  mc: 4.1s (58.9x, 62.9%)
-- Finished tests in perf ----------------------------------------------

The 'gcc' test is not in the repository; it's GCC compiling (but not
assembling or linking) a 2234 line pre-processed C program at -O3.

So overall it gives up to 20% improvements on Memcheck. ffbench under
Nulgrind is a little weird, no idea why it slows down, but it doesn't
seem important. And COMPVBITS is generally faster than the trunk, which
is good.

Nice work, Julian. Profiling is useful.

Nick
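
A note for anyone reading the columns for the first time: the figure before the "x" appears to be the time under the tool divided by the native time, and the percentage appears to be the reduction in run time relative to the first build listed for that program. The Python sketch below recomputes both from the rounded times printed above. The function names are made up for illustration, and because the inputs are already rounded to 0.1s the results can differ slightly from the printed figures.

```python
def slowdown(tool_secs: float, native_secs: float) -> float:
    """Slowdown factor: run time under the tool divided by native run time."""
    return tool_secs / native_secs


def improvement_pct(before_secs: float, after_secs: float) -> float:
    """Percentage reduction in run time relative to the 'before' build.

    Positive means the 'after' build is faster, negative means slower.
    """
    return (before_secs - after_secs) / before_secs * 100.0


if __name__ == "__main__":
    # bz2 under Nulgrind (nl), trunk1 vs trunk5, rounded times from the table.
    native, before, after = 1.3, 9.4, 7.2
    print(f"slowdown before: {slowdown(before, native):.1f}x")
    print(f"slowdown after:  {slowdown(after, native):.1f}x")
    print(f"improvement:     {improvement_pct(before, after):.1f}%")
```

A negative result, as for ffbench under Nulgrind, means the "after" build took longer than the "before" build.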
|
From: Julian S. <js...@ac...> - 2005-12-16 01:16:00
|
> Julian's commits r5345 and r5346 (avoiding the profiling in the
> dispatcher, and using jumps instead of call/return) have the following
> effect on my 3.0 GHz P4 Prescott.

There are similar (if slightly more modest) improvements on amd64, and
also ppc32 now. In fact ppc32 is shaping up to be, if anything, a
slightly more efficient target than x86 or amd64.

Here are before and after numbers for ppc32 on a 1.25GHz MPC7447. The
machine was not as quiet as one would like, so take the numbers with a
bit of caution. Nevertheless the direction is clear.

I should point out, all these speedups come from being able to do
self-hosting, and in particular from cachegrind pointing out
performance stupidities.

J

ppc32, trunk, before:

bigcode1 trunk : 0.5s  nl:14.3s (30.4x, -----)  mc:21.6s (45.9x, -----)
bigcode2 trunk : 0.5s  nl:22.7s (45.5x, -----)  mc:42.5s (84.9x, -----)
bz2      trunk : 2.1s  nl:17.2s ( 8.2x, -----)  mc:49.4s (23.5x, -----)
fbench   trunk : 1.6s  nl:16.6s (10.4x, -----)  mc:53.4s (33.6x, -----)
ffbench  trunk : 4.7s  nl: 6.8s ( 1.4x, -----)  mc:21.9s ( 4.7x, -----)
sarp     trunk : 0.1s  nl: 1.1s (12.6x, -----)  mc:16.7s (186.0x, -----)

ppc32, trunk, after:

bigcode1 trunk : 0.5s  nl:11.5s (23.9x, -----)  mc:18.9s (39.4x, -----)
bigcode2 trunk : 0.5s  nl:20.2s (40.5x, -----)  mc:39.0s (78.1x, -----)
bz2      trunk : 2.1s  nl:13.8s ( 6.6x, -----)  mc:45.5s (21.8x, -----)
fbench   trunk : 1.6s  nl:12.9s ( 8.1x, -----)  mc:49.7s (31.3x, -----)
ffbench  trunk : 4.6s  nl: 6.7s ( 1.5x, -----)  mc:22.0s ( 4.8x, -----)
sarp     trunk : 0.1s  nl: 0.9s (10.6x, -----)  mc:16.7s (185.3x, -----)
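
The ppc32 tables above give absolute times only, with no percentage column filled in. As a rough aid, the Python sketch below derives the percentage change per program from the before and after figures. The dictionary just copies the nl and mc times from the two tables; the names PPC32, improvement_pct and summarize are illustrative only. Given the caveat that the machine was not quiet, small differences are probably noise.

```python
# (before, after) run times in seconds, copied from the ppc32 tables above.
PPC32 = {
    "bigcode1": {"nl": (14.3, 11.5), "mc": (21.6, 18.9)},
    "bigcode2": {"nl": (22.7, 20.2), "mc": (42.5, 39.0)},
    "bz2":      {"nl": (17.2, 13.8), "mc": (49.4, 45.5)},
    "fbench":   {"nl": (16.6, 12.9), "mc": (53.4, 49.7)},
    "ffbench":  {"nl": (6.8, 6.7),   "mc": (21.9, 22.0)},
    "sarp":     {"nl": (1.1, 0.9),   "mc": (16.7, 16.7)},
}


def improvement_pct(before_secs: float, after_secs: float) -> float:
    """Percentage reduction in run time; negative means a slowdown."""
    return (before_secs - after_secs) / before_secs * 100.0


def summarize(table):
    """Return (program, {tool: percent improvement}) pairs in table order."""
    return [
        (prog, {tool: improvement_pct(b, a) for tool, (b, a) in tools.items()})
        for prog, tools in table.items()
    ]


if __name__ == "__main__":
    for prog, pcts in summarize(PPC32):
        print(f"{prog:9s} nl:{pcts['nl']:6.1f}%  mc:{pcts['mc']:6.1f}%")
```

The times are rounded to 0.1s, so the percentages for short runs such as sarp under Nulgrind are especially coarse.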
|
From: Dirk M. <dm...@gm...> - 2005-12-16 14:02:33
|
On Thursday 15 December 2005 20:55, Nicholas Nethercote wrote:
> Julian's commits r5345 and r5346 (avoiding the profiling in the
> dispatcher, and using jumps instead of call/return) have the following
> effect on my 3.0 GHz P4 Prescott.

What are the feelings about backporting this to the 3.1 branch?

Dirk
|
From: Julian S. <js...@ac...> - 2005-12-17 12:54:14
|
> What are the feelings about backporting this to the 3.1 branch?

Think gcc - the stable branch is for critical bugfixes only. I prefer
not to push potentially destabilising changes like this into it. If
people want the absolute max performance right now, they can try out
the svn trunk.

J
|
From: Dirk M. <dm...@gm...> - 2005-12-19 14:31:53
|
On Saturday 17 December 2005 13:54, Julian Seward wrote:
> people want the absolute max performance right now, they can try out
> the svn trunk.

What's the release schedule of trunk? :)

Dirk
|
From: Nicholas N. <nj...@cs...> - 2005-12-19 17:03:00
|
On Mon, 19 Dec 2005, Dirk Mueller wrote:
>> people want the absolute max performance right now, they can try out
>> the svn trunk.
>
> What's the release schedule of trunk? :)

From docs/internals/roadmap.txt:

-----------------------------------------------------------------------------
3.2.0
-----------------------------------------------------------------------------
Scheduled for end-Mar 06 (3.1.0 + 4 months) ?

In order of increasing speculativeness
--------------------------------------
* Add ppc64-linux support.

* Fold in the V bit compression stuff if it works well (early signs are
  promising) and get rid of Addrcheck.

* Get function wrapping working again. Reinstate basic thread checks.
  Reinstate Helgrind.

* Performance tuning:
  - faster register allocation in vex
  - improve stack-update pass
  - assess effect of branch misprediction in dispatchers

* Try to accelerate development for Darwin ?

Smaller things
--------------
* Consider using the following defaults:
    --leak-check=yes
    --num-callers=20

* Expose some of m_redir's functionality to tools so that Memcheck can
  replace strlen/strcmp on PPC32 (remove the 3.1.0 hack for this which
  checked in m_redir.c if the current tool was Memcheck).