From: Jeremy F. <je...@go...> - 2002-11-21 17:42:28
On Thu, 2002-11-21 at 01:08, Julian Seward wrote:
> It's clear we need to do something to fix this. This is a proposal
> with two variants; a less ambitious variant (first) and a more ambitious
> one (second).
>
> Variant 1 (less ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create three new functions, all of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
> 3 Bool uinstr_maybe_reads_simdEIP ( UInstr* );
>
> which tell you whether or not a uinstr (more correctly, the code
> generated for that uinstr) **might** (1) alter the real machine's
> %eflags, (2) alter the real machine's FPU state, and (3) require
> to see the %EIP of the simulated machine.
>
> Getting a safe-if-conservative result for (1) and (2) is critical
> for translation correctness. For (3) it just affects the accuracy
> of stack traces. So the safe thing to return is True for all functions,
> yet we want to return False as often as we safely can. Classical
> compiler-analysis conservative-estimate stuff.
>
> Further, for all skins which extend ucode, we add suitable functions
> to the core/skin iface so that the same questions can be asked of the
> extended ucodes.
>
> With fn (3), the redundant INCEIP removal phase can be reinstated
> for all skins, and should work correctly. The reason it's commented
> out is precisely because at present it can't reliably ask this question.
>
> Fn (2) would be used to formally and cleanly support Jeremy's lazy FPU
> state save/restore, which I think is rolled into the code generation
> loop. I would like to be able to ship this optimisation with confidence
> that I understand the ramifications.
>
> Fn (1) would dually be used to support lazy %EFLAGS save/restore in the
> same manner as the FPU.
>
> None of this would be hard to implement, and they would support a useful
> bunch of optimisations.
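
A minimal sketch of the "return True unless provably safe" shape these predicates would take. The `Demo*` types and opcode names are hypothetical stand-ins, not the real ucode definitions, and which opcodes are actually flag-safe is an assumption made only for illustration.

```c
#include <stdbool.h>

/* Toy opcode set for illustration only; not the real ucode opcode list. */
typedef enum { DEMO_MOV, DEMO_ADD, DEMO_CALL_HELPER, DEMO_SKIN_EXT } DemoOpcode;

typedef struct { DemoOpcode opcode; } DemoUInstr;

/* Conservative predicate: answer false only for opcodes whose generated
   code is known not to touch the real %eflags; everything else is true. */
static bool demo_maybe_trashes_realEFLAGS(const DemoUInstr *u)
{
    switch (u->opcode) {
        case DEMO_MOV:
            return false;  /* assumed: emitted as plain moves, flags untouched */
        default:
            return true;   /* unknown or unanalysed: assume the worst */
    }
}
```

The default branch is what keeps skin-extended opcodes safe until a skin supplies its own answer.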
>
>
> Variant 2 (more ambitious)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> We create two new functions, both of them predicates on UInstrs:
>
> 1 Bool uinstr_maybe_trashes_realEFLAGS ( UInstr* );
> 2 Bool uinstr_maybe_trashes_realFPU ( UInstr* );
>
> We do the lazy FPU and EFLAGS save/restore optimisation exactly as
> above.
>
> Difference here is we do table-based EIP reconstruction as needed
> and completely nuke INCEIP (yay!)
>
> So, precisely how to reconstruct %EIP from %eip? Like this. Firstly
> %EIP must be made up-to-date at the start of each bb, but it is anyway,
> so that's free.
Only when using the dispatch loop. When using t-chaining, I currently
manually update EIP to the target EIP just before each chained jump. I
haven't measured how much this costs (because removing it completely
breaks context switches and makes things crash), but it would be nice to
get rid of it. It would need a separate eip->EIP translation table for
the start of basic blocks, then a second process to find the offset
within a basic block. If we don't do it too often, then simply
scanning the TTE table would be a reasonable way of doing the mapping
without extra datastructures.
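
As a rough illustration of the TTE-scan idea, the block lookup could look like the sketch below. The `TTEntry` layout and its field names (`orig_addr`, `trans_addr`, `trans_size`) are hypothetical, chosen for this example; the real translation-table structure may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical translation-table entry; field names are assumptions. */
typedef struct {
    uint32_t orig_addr;   /* %EIP of the first original instruction */
    uint32_t trans_addr;  /* %eip where the translation starts */
    uint32_t trans_size;  /* length of the translated code in bytes */
} TTEntry;

/* Linear scan: return the entry whose translation contains eip, or NULL. */
static const TTEntry *find_tte_for_eip(const TTEntry *tt, size_t n, uint32_t eip)
{
    for (size_t i = 0; i < n; i++) {
        if (eip >= tt[i].trans_addr &&
            eip - tt[i].trans_addr < tt[i].trans_size)
            return &tt[i];
    }
    return NULL;
}
```

The cost is linear in the number of translations, which is only acceptable if the lookup is rare, as noted above.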
> Now suppose we are in the middle of a bb and want to
> know the current %EIP. We actually know the block-start %EIP and the
> current %eip. Using the block-start %EIP we can look in the translation
> table (TT) to find info about this block: start address of its translation,
> and presumably whereabouts its %eip->%EIP mapping table is.
>
> Knowing where the translation starts (%eip for the start of the
> translated block) and knowing the current %eip, we can figure out how
> far inside the translation we are. That offset can then be used to
> index (somehow) the mapping table for this block, to give the corresponding
> %EIP offset, which, when added to the block-start %EIP, gives the current
> %EIP.
>
> Makes sense?
Yes. The interesting question is how far to go. If we want the ideal
of never having generated code update EIP, then everywhere where we use
%EIP we'd either need to use %eip or convert %eip to %EIP.
Since we already have a VG_(get_EIP) function, making that the clearing
house for all eip->EIP conversions should be pretty easy.
The interesting question is representing eip and EIP in the ThreadState
table. Unless we want to do the mapping on every context switch, the
ThreadStates would need to hold eip and compute EIP on demand. Of
course, since the eips can be invalidated when the translation cache is
cleaned up, we'd need to map all the eips to EIPs. Then, when
scheduling a new thread, we need to start it running at %eip (if valid)
or else %EIP.
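
A minimal sketch of that bookkeeping, under stated assumptions: `DemoThreadPC`, `demo_flush`, `demo_resume_addr` and the mapper callback are all hypothetical names, and `demo_map` is only a placeholder for a real eip->EIP conversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread PC state; not the real ThreadState layout. */
typedef struct {
    uint32_t eip;        /* position in translated code */
    int      eip_valid;  /* 0 once the translation may have been discarded */
    uint32_t EIP;        /* simulated PC; only meaningful when !eip_valid */
} DemoThreadPC;

typedef uint32_t (*DemoEipMapper)(uint32_t eip);

/* Placeholder mapping, standing in for a real eip->EIP lookup. */
static uint32_t demo_map(uint32_t eip) { return eip - 1000u + 0x4000u; }

/* Before the translation cache is cleaned: turn every live eip into an EIP. */
static void demo_flush(DemoThreadPC *ts, size_t n, DemoEipMapper to_EIP)
{
    for (size_t i = 0; i < n; i++) {
        if (ts[i].eip_valid) {
            ts[i].EIP = to_EIP(ts[i].eip);
            ts[i].eip_valid = 0;
        }
    }
}

/* When scheduling: returns 1 and the eip to resume at directly,
   or 0 and the EIP that has to go through the dispatch loop. */
static int demo_resume_addr(const DemoThreadPC *t, uint32_t *addr)
{
    if (t->eip_valid) { *addr = t->eip; return 1; }
    *addr = t->EIP;
    return 0;
}
```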
Then there's the tricky case of the dispatch loop working in terms of
EIP (as it has to, given it's the first person to see code that's never
been seen before) and everything else working in eip. Have to be sure
to get all those transitions right.
Maybe we should at least keep EIP up to date at the basic block level,
so we only need to deal with the intra-BB case.
> So the only question is how to compactly encode the mapping table, and
> perhaps where to store them, but that's not a big deal.
I was thinking of a table with one entry per original instruction, which
contains the length of the original instruction and the length of the
corresponding generated code. It may be possible to pack this into 2
nibbles per instruction, with an escape mechanism to deal with rare
cases (ie, if length == 0xF then get the details from the following
byte). That could be built as the code is generated then tacked onto
the end of the generated code (the TTE would contain the generated code
length plus the length of the whole translation, including the
instruction mapping table). That way doing the mapping is a simple
linear scan:
    cur_eip = tte->trans_addr;
    cur_EIP = tte->orig_addr;
    entry = 0;
    while (eip > cur_eip) {
        cur_eip += tab[entry].trans_len;
        cur_EIP += tab[entry].orig_len;
        entry++;
    }
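
The nibble packing with an escape could be sketched as follows. This is an illustration, not Valgrind code: the names `map_put` and `eip_to_EIP`, the byte layout (original length in the high nibble, translated length in the low nibble), and the rule that a nibble of 0xF is followed by one full-length byte are all assumptions, and lengths above 255 are not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Append one (orig_len, trans_len) pair to buf; returns bytes written (1..3).
   A length of 0xF or more is stored as nibble 0xF plus one extra byte. */
static size_t map_put(uint8_t *buf, uint8_t orig_len, uint8_t trans_len)
{
    size_t n = 1;
    uint8_t hi = orig_len  < 0xF ? orig_len  : 0xF;
    uint8_t lo = trans_len < 0xF ? trans_len : 0xF;
    buf[0] = (uint8_t)((hi << 4) | lo);
    if (hi == 0xF) buf[n++] = orig_len;   /* escape byte for original length */
    if (lo == 0xF) buf[n++] = trans_len;  /* escape byte for translated length */
    return n;
}

/* Same linear scan as above, but decoding the packed table on the fly. */
static uint32_t eip_to_EIP(const uint8_t *tab, size_t tab_len,
                           uint32_t trans_addr, uint32_t orig_addr,
                           uint32_t eip)
{
    uint32_t cur_eip = trans_addr;
    uint32_t cur_EIP = orig_addr;
    size_t i = 0;
    while (eip > cur_eip && i < tab_len) {
        uint32_t orig_len  = tab[i] >> 4;
        uint32_t trans_len = tab[i] & 0xF;
        i++;
        if (orig_len == 0xF && i < tab_len)  orig_len  = tab[i++];
        if (trans_len == 0xF && i < tab_len) trans_len = tab[i++];
        cur_eip += trans_len;
        cur_EIP += orig_len;
    }
    return cur_EIP;
}
```

As in the loop above, an eip strictly past the start of an instruction's translation yields the original address following that instruction, while an eip equal to the translation start yields the block-start EIP.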
> Comments?
>
> If we could do variant 2 rather than variant 1 that would be cool,
> considering it would give a bigger speedup overall.
Dropping EIP updates is a big win. Bigger than chaining indirect jumps,
according to some initial measurements (though I haven't tested anything
which I'd expect to have lots of indirect jumps).
J