From: Josef W. <Jos...@gm...> - 2012-11-16 21:18:48
On 07.11.2012 15:15, Josef Weidendorfer wrote:
> I think it makes sense to relax from this "power of 2 issue" just
> for the LL simulation.

I just did that by using modulo (%) only in the LL simulation, where it is used to map an address to a set number. See the function block2set in the attached patch. This makes it possible to get rid of "maybe_tweak_LLc", but it shows a performance hit of 5% on average on my laptop with cachegrind (on amd64).

The worst case happens when an access misses the L1 but finds a match in the LL set on the first check (i.e. at the most-recently-used spot). ffbench seems to expose this case.

Thus, unconditionally using modulo for LL seems to be a bad idea. Instead, one can check for the power-of-two case in block2set() and use bit masking or modulo depending on that. But this only gets rid of the worst-case scenario in ffbench, and makes the other cases worse.

The best would be to have two implementations and choose the right one at runtime, depending on the cache parameters. As far as I can see, this choice is best made by instrumenting calls to dirty helpers that implement either one version or the other. However, this requires duplicating all helpers :-(

It really would be cool to use VEX's code generation feature for functions which can be called from C, just to generate the "block2set" function in the attached patch, doing either bit masking or modulo.

Does it make sense to look into this? Or does anybody have another idea?
Josef

-- Running tests in trunk/perf ----------------------------------------
-- bigcode1 --
bigcode1          trunk     :0.14s  ca: 4.6s (32.6x, -----)
bigcode1          relaxsets :0.14s  ca: 4.6s (32.6x,  0.0%)
-- bigcode2 --
bigcode2          trunk     :0.14s  ca: 8.6s (61.1x, -----)
bigcode2          relaxsets :0.14s  ca: 8.6s (61.7x, -1.1%)
-- bz2 --
bz2               trunk     :0.66s  ca:13.3s (20.1x, -----)
bz2               relaxsets :0.66s  ca:13.9s (21.0x, -4.4%)
-- fbench --
fbench            trunk     :0.28s  ca: 3.8s (13.4x, -----)
fbench            relaxsets :0.28s  ca: 3.9s (13.9x, -3.7%)
-- ffbench --
ffbench           trunk     :0.25s  ca: 4.9s (19.4x, -----)
ffbench           relaxsets :0.25s  ca: 5.7s (22.7x,-16.9%)
-- heap --
heap              trunk     :0.10s  ca: 3.9s (39.4x, -----)
heap              relaxsets :0.10s  ca: 4.1s (41.4x, -5.1%)
-- heap_pdb4 --
heap_pdb4         trunk     :0.14s  ca: 4.4s (31.5x, -----)
heap_pdb4         relaxsets :0.14s  ca: 4.7s (33.3x, -5.7%)
-- many-loss-records --
many-loss-records trunk     :0.01s  ca: 0.8s (78.0x, -----)
many-loss-records relaxsets :0.01s  ca: 0.9s (86.0x,-10.3%)
-- many-xpts --
many-xpts         trunk     :0.05s  ca: 1.2s (23.4x, -----)
many-xpts         relaxsets :0.05s  ca: 1.2s (24.0x, -2.6%)
-- sarp --
sarp              trunk     :0.02s  ca: 1.0s (51.5x, -----)
sarp              relaxsets :0.02s  ca: 1.1s (55.5x, -7.8%)
-- tinycc --
tinycc            trunk     :0.22s  ca: 9.1s (41.5x, -----)
tinycc            relaxsets :0.22s  ca: 9.6s (43.4x, -4.7%)
-- Finished tests in trunk/perf ----------------------------------------
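
For readers without the attached patch, the block2set variants discussed in the message can be sketched roughly as follows. This is only an illustration and not the code from the patch: the struct cache_sketch_t, its fields, and every function name except the idea of "block2set" are hypothetical. It shows the always-modulo mapping, the pure bit-masking mapping, the per-call power-of-two check, and a one-time choice between the two implementations through a function pointer (a plain-C stand-in for choosing between two sets of dirty helpers).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cache descriptor; names are illustrative only and are
   not taken from the cachegrind sources or the attached patch. */
typedef struct {
    uint64_t sets;      /* number of sets; need not be a power of two */
    uint64_t sets_mask; /* sets - 1; only valid as a mask if is_pow2  */
    int      is_pow2;   /* nonzero if sets is a power of two          */
} cache_sketch_t;

static inline int is_power_of_two(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

static inline void cache_sketch_init(cache_sketch_t *c, uint64_t sets)
{
    assert(sets > 0);
    c->sets      = sets;
    c->sets_mask = sets - 1;
    c->is_pow2   = is_power_of_two(sets);
}

/* Variant 1: always modulo. Works for any set count, but pays for a
   division on every lookup. */
static inline uint64_t block2set_modulo(const cache_sketch_t *c, uint64_t block)
{
    return block % c->sets;
}

/* Variant 2: bit masking. Only correct when sets is a power of two. */
static inline uint64_t block2set_mask(const cache_sketch_t *c, uint64_t block)
{
    return block & c->sets_mask;
}

/* Variant 3: decide on every call. Avoids the division in the
   power-of-two case at the cost of an extra branch in all cases. */
static inline uint64_t block2set_checked(const cache_sketch_t *c, uint64_t block)
{
    return c->is_pow2 ? (block & c->sets_mask) : (block % c->sets);
}

/* Variant 4: choose the implementation once, based on the cache
   parameters, and call it through a pointer afterwards. */
typedef uint64_t (*block2set_fn)(const cache_sketch_t *, uint64_t);

static inline block2set_fn choose_block2set(const cache_sketch_t *c)
{
    return c->is_pow2 ? block2set_mask : block2set_modulo;
}
```

With 8 sets, block 13 maps to set 5 under both masking and modulo; with 12 sets only the modulo path is valid, and block 13 maps to set 1. Whether an indirect call is cheaper than the per-call check in practice is not something this sketch answers; that would have to be measured, as with the perf runs above.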