From: Julian S. <js...@ac...> - 2005-10-19 02:30:28

Check this.  I was looking at the cg profiles for a self hosted run
on ppc32 and I noticed that of about 5.2M L2 write misses, it billed
3.8M of them to just one function: the 10-line fn invalidateFastCache
in m_transtab.c.

So I halved the size of the fast cache, and got a good 8% speedup for
nulgrind on ppc32.  Not bad!  I wonder if it carries over to x86/amd64
though -- most of the invalidations are due to simulating icbis.

J


Numbers for start/quit Qt designer on Mac Mini (1250 MHz, 512K L2 iirc)

With #define VG_TT_FAST_BITS 15 (halved size):

--32434-- tt/tc: 2,859,707 tt lookups requiring 9,163,097 probes
--32434-- tt/tc: 2,859,707 fast-cache updates, 6,151 flushes
--32434-- translate: new 106,953 (4,374,552 -> 20,515,084; ratio 46:10) [0 scs]
--32434-- translate: dumped 0 (0 -> ??)
--32434-- translate: discarded 38,951 (1,811,388 -> ??)
--32434-- scheduler: 197,875,648 jumps (bb entries).
--32434-- scheduler: 3,957/2,805,272 major/minor sched events.
--32434-- sanity: 3958 cheap, 159 expensive checks.
--32434-- exectx: 30,011 lists, 0 contexts (avg 0 per list)
--32434-- exectx: 0 searches, 0 full compares (0 per 1000)
--32434-- exectx: 0 cmp2, 0 cmp4, 0 cmpAll

real    1m15.269s
user    0m59.611s
sys     0m0.808s


With #define VG_TT_FAST_BITS 16 (default):

--32435-- tt/tc: 2,404,489 tt lookups requiring 7,730,431 probes
--32435-- tt/tc: 2,404,489 fast-cache updates, 6,143 flushes
--32435-- translate: new 106,801 (4,365,656 -> 20,476,948; ratio 46:10) [0 scs]
--32435-- translate: dumped 0 (0 -> ??)
--32435-- translate: discarded 38,879 (1,806,812 -> ??)
--32435-- scheduler: 197,369,308 jumps (bb entries).
--32435-- scheduler: 3,947/2,349,889 major/minor sched events.
--32435-- sanity: 3948 cheap, 158 expensive checks.
--32435-- exectx: 30,011 lists, 0 contexts (avg 0 per list)
--32435-- exectx: 0 searches, 0 full compares (0 per 1000)
--32435-- exectx: 0 cmp2, 0 cmp4, 0 cmpAll

real    1m17.949s
user    1m5.138s
sys     0m0.852s

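To see why halving the table could matter for the flush cost described in the message above, here is a minimal sketch of a direct-mapped fast cache with a full-flush routine. It is not the code from m_transtab.c: the names FAST_BITS, fast_cache, invalidate_fast_cache and flush_bytes are hypothetical, and the one-pointer-per-slot layout is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a direct-mapped fast cache. The real
   table in m_transtab.c may be laid out differently. */
#define FAST_BITS 16
#define FAST_SIZE (1u << FAST_BITS)

/* Assumption: each slot holds a single pointer to translated code. */
static void *fast_cache[FAST_SIZE];

/* Full flush: every slot is overwritten, so the number of stores
   depends only on the table size, not on how many slots are in use. */
static void invalidate_fast_cache(void)
{
    for (size_t i = 0; i < FAST_SIZE; i++)
        fast_cache[i] = NULL;
}

/* Bytes written by one full flush, for a given index width and an
   assumed slot size in bytes. */
static unsigned long flush_bytes(unsigned bits, unsigned long slot_bytes)
{
    assert(bits < 32);
    return (1ul << bits) * slot_bytes;
}
```

Assuming 4-byte slots on a 32-bit target, flush_bytes(16, 4) is 262144 (256 KiB, half of the 512K L2 quoted above) and flush_bytes(15, 4) is 131072 (128 KiB). With roughly 6,100 flushes in each run, that per-flush difference is paid thousands of times, which would fit a flush routine dominating the L2 write misses. If the real slots are wider, the footprint grows in proportion.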
From: Nicholas N. <nj...@cs...> - 2005-10-19 14:39:59

On Wed, 19 Oct 2005, Julian Seward wrote:

> Check this.  I was looking at the cg profiles for a self hosted run
> on ppc32 and I noticed that of about 5.2M L2 write misses, it billed
> 3.8M of them to just one function: the 10-line fn invalidateFastCache
> in m_transtab.c.
>
> So I halved the size of the fast cache, and got a good 8% speedup for
> nulgrind on ppc32.  Not bad!  I wonder if it carries over to x86/amd64
> though -- most of the invalidations are due to simulating icbis.
>
>
> Numbers for start/quit Qt designer on Mac Mini (1250 MHz, 512K L2 iirc)
>
> With #define VG_TT_FAST_BITS 15 (halved size):
>
> real    1m15.269s
> user    0m59.611s
> sys     0m0.808s
>
>
> With #define VG_TT_FAST_BITS 16 (default):
>
> real    1m17.949s
> user    1m5.138s
> sys     0m0.852s

The user time is down 8%, but the real time is only down about 3%... what
does this mean?  Are these times consistent over multiple runs?

N

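The 8% and 3% figures in the message above can be checked directly from the quoted timings. The helpers below are a small sketch written for this purpose; to_seconds and percent_saved are hypothetical names, not part of any tool discussed in the thread.

```c
#include <assert.h>

/* Convert the minutes and seconds of a "1m5.138s" style reading
   into plain seconds. */
static double to_seconds(int minutes, double seconds)
{
    assert(minutes >= 0 && seconds >= 0.0);
    return minutes * 60.0 + seconds;
}

/* Percentage of time saved going from `before` to `after`. */
static double percent_saved(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

For user time, percent_saved(to_seconds(1, 5.138), to_seconds(0, 59.611)) comes to about 8.5%. For real time, percent_saved(to_seconds(1, 17.949), to_seconds(1, 15.269)) comes to about 3.4%. The gap is the wall-clock time not accounted for by user or sys: roughly 14.9 s in the 15-bit run against roughly 12.0 s in the 16-bit run. That extra time spent off the CPU in the faster run hides part of the user-time gain, and it may simply reflect other activity on the machine or variation in driving an interactive program.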
From: Julian S. <js...@ac...> - 2005-10-19 15:03:39

> > With #define VG_TT_FAST_BITS 15 (halved size):
> >
> > real    1m15.269s
> > user    0m59.611s
> > sys     0m0.808s
> >
> >
> > With #define VG_TT_FAST_BITS 16 (default):
> >
> > real    1m17.949s
> > user    1m5.138s
> > sys     0m0.852s
>
> The user time is down 8%, but the real time is only down about 3%... what
> does this mean?  Are these times consistent over multiple runs?

Machine wasn't particularly quiet at the time.  I re-ran a couple of
times to verify, and also tried a different program.  I think the 8%
is repeatable, but only on ppc32 (on x86 it seems to have minimal effect).

J