From: Julian S. <js...@ac...> - 2012-02-20 23:05:29
I've been slowly putting together an implementation of translation chaining. This allows translations to be patched, so that guest conditional and unconditional branches to addresses known at JIT time are converted into jumps from one translation to the next, with no need to return to the dispatcher each time. This gets rid of the cache miss and branch mispredict caused by each trip through the dispatcher. Branches to addresses known only at runtime of course still need to be looked up, but those are relatively uncommon.

Getting this to work has proven a swamp of complexity, considering it needs to work for all architectures, and all the stuff that the dispatcher needs to do -- event checks, dealing with no-redirect translations -- needs to be JITted into the code. Deleting translations also becomes more complex, since any others that jump to the one to be deleted first need to be un-chained.

First performance numbers are below, on amd64-linux, relative to trunk, for tools none and memcheck. The speedups are good for "none", especially on integer code with a lot of short blocks and hence many branches (bz2, tinycc). Speedups are smaller (even in absolute terms, eg, bz2) for memcheck, which is a bit disappointing. Makes me think that the performance of memcheck for large programs is ultimately determined by the performance of the memory system, and removing instructions doesn't have much effect.

Currently this is very incomplete -- deletion of translations is not handled yet, and there are many rough edges to tidy up. When I have something that's functionally complete for amd64-linux I'll post a patch.
J

$ /usr/bin/perl perf/vg_perf --reps=5 --tools=none,memcheck --vg=/home/sewardj/VgTRUNK/tchain --vg=/home/sewardj/VgTRUNK/trunk perf/

-- Running tests in perf ----------------------------------------------
-- bigcode1 --
bigcode1           tchain :0.11s  no: 1.8s (16.6x, -----)  me: 3.6s (32.4x, -----)
bigcode1           trunk  :0.11s  no: 2.1s (19.3x,-15.8%)  me: 3.9s (35.0x, -8.1%)
-- bigcode2 --
bigcode2           tchain :0.11s  no: 4.3s (39.0x, -----)  me: 8.9s (81.4x, -----)
bigcode2           trunk  :0.11s  no: 4.5s (40.9x, -4.9%)  me: 9.2s (83.7x, -2.9%)
-- bz2 --
bz2                tchain :0.63s  no: 2.0s ( 3.1x, -----)  me: 6.6s (10.4x, -----)
bz2                trunk  :0.63s  no: 2.7s ( 4.2x,-35.0%)  me: 6.8s (10.8x, -3.3%)
-- fbench --
fbench             tchain :0.24s  no: 1.1s ( 4.5x, -----)  me: 3.9s (16.2x, -----)
fbench             trunk  :0.24s  no: 1.3s ( 5.3x,-17.6%)  me: 4.2s (17.5x, -7.7%)
-- ffbench --
ffbench            tchain :0.21s  no: 0.9s ( 4.4x, -----)  me: 3.0s (14.5x, -----)
ffbench            trunk  :0.21s  no: 0.9s ( 4.4x,  0.0%)  me: 3.1s (14.6x, -1.0%)
-- heap --
heap               tchain :0.09s  no: 0.6s ( 7.1x, -----)  me: 5.7s (63.4x, -----)
heap               trunk  :0.09s  no: 0.8s ( 9.0x,-26.6%)  me: 5.5s (61.3x,  3.3%)
-- heap_pdb4 --
heap_pdb4          tchain :0.11s  no: 0.8s ( 6.8x, -----)  me: 9.3s (84.5x, -----)
heap_pdb4          trunk  :0.11s  no: 1.0s ( 9.1x,-33.3%)  me: 9.9s (89.6x, -6.1%)
-- many-loss-records --
many-loss-records  tchain :0.02s  no: 0.2s (11.5x, -----)  me: 1.4s (71.0x, -----)
many-loss-records  trunk  :0.02s  no: 0.2s (12.0x, -4.3%)  me: 1.3s (67.0x,  5.6%)
-- many-xpts --
many-xpts          tchain :0.03s  no: 0.3s ( 9.3x, -----)  me: 1.8s (61.0x, -----)
many-xpts          trunk  :0.03s  no: 0.3s (11.0x,-17.9%)  me: 1.8s (60.7x,  0.5%)
-- sarp --
sarp               tchain :0.03s  no: 0.2s ( 7.3x, -----)  me: 2.4s (80.0x, -----)
sarp               trunk  :0.03s  no: 0.2s ( 7.3x,  0.0%)  me: 2.5s (83.7x, -4.6%)
-- tinycc --
tinycc             tchain :0.16s  no: 1.5s ( 9.2x, -----)  me: 9.7s (60.5x, -----)
tinycc             trunk  :0.16s  no: 2.2s (14.1x,-53.1%)  me:10.2s (64.1x, -5.9%)
-- Finished tests in perf ----------------------------------------------

== 11 programs, 44 timings =================
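
For readers new to the idea, the chaining scheme described at the top of this message can be illustrated with a toy model. The sketch below is not Valgrind code: ToyJit, Translation, dispatch and the three-block guest loop are all hypothetical names made up for illustration, and it assumes every branch target is known at JIT time. It only counts trips through a simulated dispatcher with and without chaining, and shows why deleting a translation means first un-chaining the translations that jump to it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Translation:
    """Hypothetical stand-in for one translated guest block."""
    guest_addr: int
    next_addr: int  # branch target, assumed known at JIT time
    chained_to: Optional["Translation"] = None  # direct jump, once patched
    incoming: Set[int] = field(default_factory=set)  # blocks chained to us


class ToyJit:
    """Toy model: counts how often control returns to the dispatcher."""

    def __init__(self, program: Dict[int, int], chaining: bool) -> None:
        self.program = program  # guest address -> successor guest address
        self.chaining = chaining
        self.cache: Dict[int, Translation] = {}
        self.dispatcher_trips = 0

    def dispatch(self, addr: int) -> Translation:
        # One trip through the dispatcher: look up (or create) a translation.
        self.dispatcher_trips += 1
        if addr not in self.cache:
            self.cache[addr] = Translation(addr, self.program[addr])
        return self.cache[addr]

    def run(self, start: int, blocks: int) -> None:
        t = self.dispatch(start)
        for _ in range(blocks - 1):
            if t.chained_to is not None:
                t = t.chained_to  # patched jump, dispatcher is bypassed
                continue
            nxt = self.dispatch(t.next_addr)
            if self.chaining:
                t.chained_to = nxt  # "patch" the branch
                nxt.incoming.add(t.guest_addr)
            t = nxt

    def delete(self, addr: int) -> None:
        # Un-chain every translation that jumps to the victim, then drop it.
        victim = self.cache.pop(addr)
        for src in victim.incoming:
            if src in self.cache:
                self.cache[src].chained_to = None
        if victim.chained_to is not None:
            victim.chained_to.incoming.discard(addr)


# A made-up guest loop of three short blocks, run for nine block executions.
LOOP = {0x100: 0x110, 0x110: 0x120, 0x120: 0x100}
plain = ToyJit(LOOP, chaining=False)
plain.run(0x100, 9)
chained = ToyJit(LOOP, chaining=True)
chained.run(0x100, 9)
print(plain.dispatcher_trips, chained.dispatcher_trips)  # prints: 9 4
```

In this model the chained run pays for the dispatcher only while the loop is first being linked up, which is the effect the "none" numbers above reflect for branch-heavy code. Event checks, no-redirect translations and runtime-computed targets are deliberately left out.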