From: Julian S. <js...@ac...> - 2017-08-31 08:25:14
> What would be the major challenges here?
> My preliminary idea was that trans-cache could request blocks either
> from VEX or from the pre-image.

I've thought about this a couple of times in the past but never did
anything about it. One of the reasons is that I thought it would be
difficult to do and actually get a win. Eg, for starting Firefox on
Memcheck, the JIT needs to process about 500,000 blocks, giving about
300MB of instrumented code. If we say (perhaps somewhat optimistically)
that the JIT can process about 10,000 blocks/sec, then that is 50
seconds of computation.

In order to get a win, we'd need to be able to at least compute a hash
of the block to be jitted (based on the instruction bytes), find the
offset of the block in our memory-mapped file, and pull in the relevant
translation, all in around 100 microseconds. I might be persuaded that
this is doable if the cache file is in the filesystem cache, but as
soon as we hit backing storage (especially if it's a rotating disk) I
think our prospects are poor.

None of that is the real reason I didn't pursue it, though. The real
reason is address space layout randomization. Because different
libraries get loaded at different addresses in subsequent runs, the hit
rate on the cache would be zero for the libraries involved. This
implies that the load address for the library somehow needs to be
incorporated in the cache keys that we're using. And that's true
because the front ends (guest_amd64_toIR.c, etc) bake into the IR
values derived from the program counter: branch target addresses, and
PC-relative load/store addresses.

I can't see any way around this without major re-engineering of the
JITs, because we'd need to somehow parameterise the cache so that we
could look up a translation independent of its load address, and then,
if found, patch up the old version so it works for the "new" address.
> If you are considering translating the entire program and caching it,
> I think that would be much faster,

Mhm, but then you have the problem of finding all the code that is part
of the program, which is equivalent to solving the halting problem.

-----

For these reasons, my preference is to make the JIT faster, and
ultimately to move to having a "two speed" JIT. That is, code is
initially instrumented using a fast, low-quality JIT, to reduce latency
and to gather branch and block-use statistics. When we decide a
particular path is hot enough, those blocks are given to a slower,
optimising JIT, so we ultimately get both low latency for cold paths
and high performance for hot paths. This seems to be the "modern way".

Also, the optimising JIT can run in a helper thread, so in effect we
never have to wait for it, because we can just use the unoptimised
version of a (super)block until the optimised version is ready.

J