From: Julian S. <js...@ac...> - 2002-11-20 09:01:58
[... numbers re translation chaining and trace caching ...]

> Oh, yes, I quite agree.  The results above make me think that doing
> naive trace caching probably won't help, but eliminating dispatch-loop
> calls probably will.

Good work.  It's pretty clear that t-chaining is worth having.  Not sure
about the others.

I'd still like to know why bzip2 runs at approximately a 10x slowdown.
If the translations are chained, there are no callouts to helpers with
--skin=none, so the only thing happening is jumping between
translations.  In that case I'd expect a 10x code size increase, but
it's more like 4:1.  So what's going on, I wonder?

It might be helpful to do this benchmarking with a simple
microbenchmark with only a couple of hot bbs in the inner loop and no
I/O, to factor out those effects.

-----------

Another obvious lemon is the INCEIP nonsense (of my own devising :)
For every insn executed, except for the last in each bb, there is a
corresponding INCEIP, which becomes something like

   addl $insn_size, 36(%ebp)

That's probably expensive; it's 3 microops on modern CPUs
(load-op-store).  All because we need an up-to-date %EIP in some very
rare circumstances: when taking a snapshot of the stack that might
conceivably get passed to the user, and when delivering signals.

I wonder if we could dispense with the per-insn %EIP updates and
instead associate with each bb a small table of offsets indicating how
to calculate the simulated %EIP from the value it was set to at the
start of the block and the current distance of the real %eip inside
the translation.  If you see what I mean.

It's easy to test the net effect: disable %EIP generation altogether
(a one-liner in vg_to_ucode.c).  I bet it gives another 10% or so.  I
would try it now, but I have to rush off and duke it out with ARM code
all day :-)

J
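For concreteness, a microbenchmark of the kind suggested above might look like the sketch below: a tight loop with only two hot basic blocks and no I/O, so any slowdown measured under the simulator reflects only translation and dispatch/chaining costs. This is my own illustrative example, not anything from the Valgrind tree.

```c
#include <stdio.h>

/* Hypothetical microbenchmark: one hot inner loop containing just
 * two basic blocks (the taken and not-taken arms of the branch),
 * no I/O inside the loop, no helper calls.  Slowdown under the
 * simulator should then be dominated by inter-translation jumps. */
unsigned long run(unsigned long n)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++) {
        if (i & 1)
            sum += i;          /* block A: odd iterations  */
        else
            sum += 2 * i;      /* block B: even iterations */
    }
    return sum;
}

int main(void)
{
    printf("%lu\n", run(100000000UL));
    return 0;
}
```

Timing this natively and then under `--skin=none` would factor out the I/O and cold-code effects that muddy the bzip2 numbers.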
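The per-bb offset-table idea can be sketched roughly as follows. All names here are invented for illustration; this is not Valgrind's actual data structure, just one plausible shape for "recover the simulated %EIP lazily from the bb's starting EIP plus how far the real %eip is into the translation".

```c
#include <stddef.h>

/* Hypothetical sketch: instead of an INCEIP store per guest insn,
 * each translated bb carries a small table mapping offsets within
 * the translation to guest-EIP deltas from the bb's starting EIP.
 * The simulated %EIP is reconstructed only in the rare cases that
 * need it (signal delivery, stack snapshots). */
typedef struct {
    unsigned short trans_offset;  /* offset into the translation      */
    unsigned short eip_delta;     /* guest bytes past the bb's EIP    */
} EipMapEntry;

typedef struct {
    unsigned long      start_eip; /* guest EIP at entry to the bb     */
    size_t             n_entries;
    const EipMapEntry *map;       /* sorted by trans_offset           */
} BBInfo;

/* Find the last map entry at or before the real %eip's current
 * offset inside the translation, and add its delta to the bb's
 * starting guest EIP. */
unsigned long recover_eip(const BBInfo *bb, unsigned trans_offset)
{
    unsigned long delta = 0;
    for (size_t i = 0; i < bb->n_entries; i++) {
        if (bb->map[i].trans_offset > trans_offset)
            break;
        delta = bb->map[i].eip_delta;
    }
    return bb->start_eip + delta;
}
```

The trade is a few bytes of table per bb and a short scan on the rare recovery path, against eliminating a load-op-store per guest instruction on the hot path.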