From: Julian S. <js...@ac...> - 2015-07-22 07:22:52

On 15/07/15 14:19, John Reiser wrote:
> [...] but instruction decode often is a bottleneck for memcheck. [...]

That's interesting. Can you expand on that? Do you have some
measurements or such, that show this?

J
From: <jr...@bi...> - 2015-07-23 03:26:13

On Wed, July 22, 2015, Julian Seward wrote:
> On 15/07/15 14:19, John Reiser wrote:
>> [...] but instruction decode often is a bottleneck for memcheck. [...]
>
> That's interesting. Can you expand on that? Do you have some
> measurements or such, that show this?

If a taken branch is mis-predicted, or if the branch is indirect through a
register or memory (such as "call *%rax"), then the prefetch+decoder starts
out behind, and stays behind until it has decoded a cumulative average of at
least 2 instructions per cycle. In particular, if the prefetch [8 bytes,
aligned] at the target does not contain 2 complete instructions, then it will
be behind for at least one more cycle. A data fetch from memory is the surest
opportunity to catch up, because the cache latency is 3 or 4 cycles.

For JIT code, it would pay to replicate the several most-frequent helper
targets at the beginning or end of a large block of JIT code, so that a CALL
to those helpers could be made by a 32-bit pc-relative displacement instead
of by "movabs $64-bits, %rax; call *%rax". A replicated JIT helper that is
within 31-bit range of the master can tail-merge into the master with a
"jmp displ32", preferably in the shadow of a cache fetch. A replicated JIT
helper that is not within 31-bit range of the master probably must use
equivalent code with carefully scheduled fetches of 64-bit pointers.
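The size difference behind this suggestion can be sketched concretely. The direct form "call rel32" encodes as E8 plus a signed 32-bit displacement (5 bytes), while the far form is "movabs $imm64, %rax" (48 B8 + 8 bytes) followed by "call *%rax" (FF D0), 12 bytes in total. Below is an illustrative Python sketch (not Valgrind code; the addresses and the helper name encode_call are made up) of the choice a JIT could make between the two encodings:

```python
def encode_call(site, target):
    """Return machine-code bytes for a call from address `site` to `target`.

    Uses the 5-byte pc-relative form when the displacement fits in a signed
    32 bits; otherwise falls back to movabs + indirect call (12 bytes).
    """
    # rel32 is relative to the address of the *next* instruction,
    # i.e. the end of the 5-byte call encoding.
    disp = target - (site + 5)
    if -2**31 <= disp < 2**31:
        # E8 <rel32>: direct near call, 5 bytes
        return b'\xe8' + disp.to_bytes(4, 'little', signed=True)
    # 48 B8 <imm64>: movabs $target, %rax  (10 bytes)
    # FF D0:         call *%rax            (2 bytes)
    return b'\x48\xb8' + target.to_bytes(8, 'little') + b'\xff\xd0'

near = encode_call(0x400000, 0x410000)        # within 31-bit range
far = encode_call(0x400000, 0x7f0000400000)   # out of 31-bit range
```

A helper replicated near the JIT block keeps every call on the 5-byte path, which is also the form the decoder and branch predictor handle best.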