From: Julian S. <js...@ac...> - 2005-10-19 02:30:28
Check this. I was looking at the cg profiles for a self-hosted run on ppc32 and I noticed that of about 5.2M L2 write misses, it billed 3.8M of them to just one function: the 10-line fn invalidateFastCache in m_transtab.c.

So I halved the size of the fast cache, and got a good 8% speedup for nulgrind on ppc32. Not bad! I wonder if it carries over to x86/amd64 though -- most of the invalidations are due to simulating icbis.

J

Numbers for start/quit Qt designer on Mac Mini (1250 MHz, 512K L2 iirc)

With #define VG_TT_FAST_BITS 15 (halved size):

--32434-- tt/tc: 2,859,707 tt lookups requiring 9,163,097 probes
--32434-- tt/tc: 2,859,707 fast-cache updates, 6,151 flushes
--32434-- translate: new 106,953 (4,374,552 -> 20,515,084; ratio 46:10) [0 scs]
--32434-- translate: dumped 0 (0 -> ??)
--32434-- translate: discarded 38,951 (1,811,388 -> ??)
--32434-- scheduler: 197,875,648 jumps (bb entries).
--32434-- scheduler: 3,957/2,805,272 major/minor sched events.
--32434-- sanity: 3958 cheap, 159 expensive checks.
--32434-- exectx: 30,011 lists, 0 contexts (avg 0 per list)
--32434-- exectx: 0 searches, 0 full compares (0 per 1000)
--32434-- exectx: 0 cmp2, 0 cmp4, 0 cmpAll

real 1m15.269s
user 0m59.611s
sys 0m0.808s

With #define VG_TT_FAST_BITS 16 (default):

--32435-- tt/tc: 2,404,489 tt lookups requiring 7,730,431 probes
--32435-- tt/tc: 2,404,489 fast-cache updates, 6,143 flushes
--32435-- translate: new 106,801 (4,365,656 -> 20,476,948; ratio 46:10) [0 scs]
--32435-- translate: dumped 0 (0 -> ??)
--32435-- translate: discarded 38,879 (1,806,812 -> ??)
--32435-- scheduler: 197,369,308 jumps (bb entries).
--32435-- scheduler: 3,947/2,349,889 major/minor sched events.
--32435-- sanity: 3948 cheap, 158 expensive checks.
--32435-- exectx: 30,011 lists, 0 contexts (avg 0 per list)
--32435-- exectx: 0 searches, 0 full compares (0 per 1000)
--32435-- exectx: 0 cmp2, 0 cmp4, 0 cmpAll

real 1m17.949s
user 1m5.138s
sys 0m0.852s
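
For readers who don't have m_transtab.c to hand, here is a rough sketch of why flush cost follows the fast-cache size. It is not the actual Valgrind code. The names FAST_BITS, FastEntry, fast_update, fast_lookup, invalidate_fast_cache and fast_cache_bytes are all hypothetical, and the entry layout and indexing by low address bits are assumptions. The point is only that a direct-mapped table of 1 << BITS slots has to be written in full on every flush, so dropping from 16 bits to 15 halves the memory each flush touches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VG_TT_FAST_BITS; 15 is the halved size. */
#define FAST_BITS 15
#define FAST_SIZE (1u << FAST_BITS)
#define FAST_MASK (FAST_SIZE - 1u)

/* Assumed entry layout: guest address plus pointer to translated code. */
typedef struct {
    uintptr_t guest;
    void     *host;
} FastEntry;

static FastEntry     fast_cache[FAST_SIZE];
static unsigned long n_flushes;

/* Direct-mapped: the low bits of the guest address select the slot. */
static size_t fast_index(uintptr_t guest)
{
    return (size_t)(guest & FAST_MASK);
}

/* Record a translation, overwriting whatever occupied the slot. */
static void fast_update(uintptr_t guest, void *host)
{
    FastEntry *e = &fast_cache[fast_index(guest)];
    e->guest = guest;
    e->host  = host;
}

/* Return the cached translation, or NULL on a miss. */
static void *fast_lookup(uintptr_t guest)
{
    const FastEntry *e = &fast_cache[fast_index(guest)];
    return (e->host != NULL && e->guest == guest) ? e->host : NULL;
}

/* A flush writes every slot, so its cost scales with FAST_SIZE.
   Returns the number of bytes written by this flush. */
static size_t invalidate_fast_cache(void)
{
    size_t j;
    for (j = 0; j < FAST_SIZE; j++) {
        fast_cache[j].guest = 0;
        fast_cache[j].host  = NULL;
    }
    n_flushes++;
    return (size_t)FAST_SIZE * sizeof(FastEntry);
}

/* Table footprint in bytes for a given number of index bits. */
static size_t fast_cache_bytes(unsigned bits)
{
    return ((size_t)1 << bits) * sizeof(FastEntry);
}
```

In this sketch fast_cache_bytes(16) is exactly twice fast_cache_bytes(15). With roughly 6,100 flushes in each run above, the smaller table halves the bytes written by flushing, and the price is the extra tt lookups the 15-bit run shows.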