|
From: Jeremy F. <je...@go...> - 2005-01-27 23:58:48
|
One of the things I've noticed lately is that there are a number of
places in the code which say things like:
if (tid is in baseBlock) {
// get registers from baseBlock
} else {
// get registers from ThreadState
}
This could be simplified by saying:
if (tid is in baseBlock)
save_registers_to_ThreadState()
// get registers from ThreadState
But it seems to me that the baseBlock is fairly redundant. We could
just make ThreadState.arch the home for all the VCPU state, and point
ebp at it while the thread runs. This would not only make getting at
the VCPU state much simpler and more consistent, it removes all the code
responsible for shuffling the state around, and all the code which sets
up the baseBlock in the first place.
I can see two objections to this:
But the baseBlock is configured differently for different Tools
True, but apart from pointers to helper functions (see
below), this is just a has shadow regs/doesn't have shadow regs
switch. The ThreadState always contains shadow state either
way.
But the baseBlock has all the pointers to helpers.
Yes; as I understand it, they're there for two reasons:
1. they let us generate more compact code, because we can
generate short-form call instructions, and
2. the calls don't need relocating when the code is copied
into the translation cache
But it seems to me that these indirect calls are not good for
performance; the CPU would prefer to see a simple predictable
direct call. And if the offset is >127 it ends up generating a
longer sequence anyway.
And it will slightly complicate codegen, since x86 calls are
relative, so we can't fully generate the call until we know
where it will be placed in the translation cache - but this is
already handled to deal with BB-chaining, so extending that
mechanism to handle relocating other calls shouldn't be a big
problem.
The big question to me is what the new codegen is doing with this stuff,
and whether it has some other dependency on/requirement for the
baseBlock. Julian?
J
|
|
From: Julian S. <js...@ac...> - 2005-01-28 10:52:31
|
I also arrived at the same conclusions a while back.
> But it seems to me that the baseBlock is fairly redundant.
In the new jit ("Vex"), the baseblock is gone, and guest state
is just a field in each thread state, which ebp is made to point
at when running that thread.
> This would not only make getting at
> the VCPU state much simpler and more consistent, it removes all the code
> responsible for shuffling the state around, and all the code which sets
> up the baseBlock in the first place.
Yup. I nuked lots of code.
Not only that, it makes V much faster in some cases where there was
previously a lot of copying stuff to/from the baseblock -- bear in mind
the SSEified cpu state == 512 bytes, so this had got expensive.
> But the baseBlock is configured differently for different Tools
> True, but but apart from pointers to helper functions (see
> below), this is just a has shadow regs/doesn't have shadow regs
> switch. The ThreadState always contains shadow state either
> way.
Vex imposes this requirement; doesn't seem a big deal to me.
> But the baseBlock has all the pointers to helpers.
> Yes; as I understand it, they're there for two reasons:
> 1. they let us generate more compact code, because we can
> generate short-form call instructions, and
> 2. the calls don't need relocating when the code is copied
> into the translation cache
>
> But it seems to me that these indirect calls are not good for
> performance; the CPU would prefer to see a simple predictable
> direct call. And if the offset is >128 it ends up generating a
> longer sequence anyway.
Vex disallows pointers to helpers; the vex-generated code has to
be self-contained. That makes it slightly more difficult on x86 to
generate a position-independent call to an absolute address, but overall
since vex is generating faster code anyway, I don't care.
And vex doesn't do translation chaining ... instead it chases across
bb boundaries (possibly even conditional branches, am still experimenting)
which (1) reduces the number of transitions significantly, and (2)
gives a benefit beyond merely removing the branch cost, namely allowing
IR level optimisation to propagate constants and then fold redundant
computations/specialise calls to helper functions between the two basic
blocks.
Additionally vex unrolls single-BB loops during IR optimisation up to 8 times,
which also helps.
J
|
|
From: Nicholas N. <nj...@ca...> - 2005-01-28 19:16:41
|
On Fri, 28 Jan 2005, Julian Seward wrote:
> And vex doesn't do translation chaining ... instead it chases across
> bb boundaries (possibly even conditional branches, am still experimenting)
Will that make function wrapping more difficult?
N
|
|
From: Jeremy F. <je...@go...> - 2005-01-29 01:16:33
|
On Fri, 2005-01-28 at 19:16 +0000, Nicholas Nethercote wrote:
> On Fri, 28 Jan 2005, Julian Seward wrote:
> > And vex doesn't do translation chaining ... instead it chases across
> > bb boundaries (possibly even conditional branches, am still experimenting)
> Will that make function wrapping more difficult?
Well, I hope the codegen can be reined in a bit when we need to do
special stuff. I'm assuming that it will record enough state to be able
to produce up-to-date VCPU state for any faulting instruction, but we
also need to do function wrapping, have some way of dealing with
self-modifying code, and any debugging facility will need breakpoints
and single-step.
J
|
|
From: Julian S. <js...@ac...> - 2005-01-29 02:12:05
|
On Saturday 29 January 2005 01:13, Jeremy Fitzhardinge wrote:
> On Fri, 2005-01-28 at 19:16 +0000, Nicholas Nethercote wrote:
> > On Fri, 28 Jan 2005, Julian Seward wrote:
> > > And vex doesn't do translation chaining ... instead it chases across
> > > bb boundaries (possibly even conditional branches, am still
> > > experimenting)
> > Will that make function wrapping more difficult?
No, I don't think so. Whenever it wants to chase over a bb boundary, and
specifically in the case of chasing a call insn, it first asks (via a
callback) if it is OK to do this. That means V itself can arbitrarily
stop vex chasing across any boundaries it feels like. I already have
to implement this in order that the existing redirection mechanism works.
> Well, I hope the codegen can be reined in a bit when we need to do
> special stuff. I'm assuming that it will record enough state to be able
> to produce up-to-date VCPU state for any faulting instruction,
You can ask for any arbitrary subset of the guest state to be up to date
at any point where a memory exception might occur.
> but we
> also need to do function wrapping, have some way of dealing with
> self-modifying code, and any debugging facility will need breakpoints
> and single-step.
In that sense it is neither better nor worse than UCode world since we
don't have good solutions for those problems in UCode either.
J
|
|
From: Jeremy F. <je...@go...> - 2005-01-29 02:45:14
|
On Sat, 2005-01-29 at 02:11 +0000, Julian Seward wrote:
> On Saturday 29 January 2005 01:13, Jeremy Fitzhardinge wrote:
> > On Fri, 2005-01-28 at 19:16 +0000, Nicholas Nethercote wrote:
> > > On Fri, 28 Jan 2005, Julian Seward wrote:
> > > > And vex doesn't do translation chaining ... instead it chases across
> > > > bb boundaries (possibly even conditional branches, am still
> > > > experimenting)
> > > Will that make function wrapping more difficult?
> No, I don't think so. Whenever it wants to chase over a bb boundary, and
> specifically in the case of chasing a call insn, it first asks (via a
> callback) if it is OK to do this. That means V itself can arbitrarily
> stop vex chasing across any boundaries it feels like. I already have
> to implement this in order that the existing redirection mechanism works.
> You can ask for any arbitrary subset of the guest state to be up to date
> at any point where a memory exception might occur.
Do you mean any kind of exception?
> > but we
> > also need to do function wrapping, have some way of dealing with
> > self-modifying code, and any debugging facility will need breakpoints
> > and single-step.
> In that sense it is neither better nor worse than UCode world since we
> don't have good solutions for those problems in UCode either.
True; they're problems I'd like to solve. If I can assert that "these
things should be done" and get you to solve them, then I'm happy ;)
With UCode, single stepping is just a matter of using the existing
--single-step=yes machinery and setting the dispatch counter to 1. I
expect we could do the same with vex, but it would be nice to have
something a bit less brute-force.
If vex can be convinced not to intermingle the side-effects of adjacent
instructions (i.e. to commit one instruction's results before starting
to commit the next instruction's, which I guess will be necessary anyway
to keep instruction side-effects in order), then there would be
well-defined points in the generated code delimiting the effects of
adjacent instructions, and therefore good places to stop when
single-stepping. Those same points would also be where you could put
breakpoints.
J
|
|
From: Julian S. <js...@ac...> - 2005-01-29 03:07:37
|
> > You can ask for any arbitrary subset of the guest state to be up to date
> > at any point where a memory exception might occur.
> Do you mean any kind of exception?
No, only memory exceptions. I guess with a bit of care, the IR optimiser
could be told that arbitrary other events (int division, all FP) might
also cause exceptions and so optionally to be careful to keep them in
order. However, it would take some thinking about, would reduce
performance, and I'm unclear what the benefit would be.
> > > but we
> > > also need to do function wrapping, have some way of dealing with
> > > self-modifying code, and any debugging facility will need breakpoints
> > > and single-step.
> > In that sense it is neither better nor worse than UCode world since we
> > don't have good solutions for those problems in UCode either.
> True; they're problems I'd like to solve. If I can assert that "these
> things should be done" and get you to solve them, then I'm happy ;)
For s-m-c, the issue is how to detect code overwrites. I guess we could
use the host's memory protection when appropriate, our own bitmaps if
needed, or some other scheme (self-checking translations?). It would be
good to know why this is an issue -- where is there a problem? You
mentioned something about sighandler returns before, but I don't
remember any details.
Re debugging facility, breakpoints, single-step, again, I'm not sure
what you want to achieve. Can you outline?
J
|
|
From: Jeremy F. <je...@go...> - 2005-01-29 08:57:20
|
On Sat, 2005-01-29 at 03:06 +0000, Julian Seward wrote:
> No, only memory exceptions. I guess with a bit of care, the IR optimiser
> could be told that arbitrary other events (int division, all FP) might
> also cause exceptions and so optionally to be careful to keep them in
> order. However, it would take some thinking about, would reduce performance,
> and I'm unclear what the benefit would be.
Real programs rely on getting precise exceptions of all kinds. We need
to make all exceptions precise. Performance is secondary to
correctness.
There are 3 types of exceptions:
* memory - covered
* FP - tricky
* always (int $XX, illegal instruction) - easy
FP is tricky because the FPU/SSE units are modal, and may or may not
generate exceptions. Writes to the mode register should be pretty easy
to see so we know when the modes change, but dealing with them is
tricky. But we still need to - the Mesa 3D driver, for example,
deliberately provokes SSE SIGFPE exceptions, and won't run unless we get
that right (it currently requires --single-step=yes). (Mesa does the
hardest-to-get-right thing: it provokes an exception, modifies the
machine state from within the signal handler, then returns from the
handler, expecting the instruction to restart with the newly modified
state; a big chunk of the signal handling changes were to get this stuff
right.)
Valgrind also relies on precise exceptions, which becomes noticeable
when it is running itself (again, it needs --single-step=yes now).
I've been thinking about adding a client request, something like
VALGRIND_HINT_PRECISE_EXCEPTIONS(0/1), so that a client can request
precision if it needs it. Bit of a hack, and I'd hope it would be
unnecessary.
> For s-m-c, the issue is how to detect code overwrites. I guess we could
> use the host's memory protection when appropriate, our own bitmaps if
> needed, or some other scheme (self-checking translations?)
I think self-checking is the way to go. Basically, if you see you're
translating code from a writable page, then assume it can be changed,
and generate a check to see the original code is still the same as it
was when the block was translated.
Except for the cases where we know a memory write will hit code, I don't
think there's much point in checking every (or any) writes, since writes
so rarely do touch code.
> It would be
> good to know why this is an issue -- where is there a problem? You mentioned
> something about sighandler returns before, but I don't remember any
> details.
Sigreturns are OK, but nested functions are a bit of a problem, since
the compiler dynamically generates code on the stack and jumps to it for
each call (bug 69511, our oldest). Also, due to the current structure of
the translation cache, the manual invalidate call is very slow, mostly
because of all the tracking down and unlinking of chained blocks.
> Re debugging facility, breakpoints, single-step, again, I'm not sure what
> you want to achieve. Can you outline?
I'm building a gdbstub into Valgrind so that you can attach gdb to the
VCPU and interactively debug a program while it runs under Valgrind.
The GDB remote protocol also supports the notion of multiple address
spaces, and it looks easy and convenient to expose the Tool metadata
through this interface. This means that you can do basic debugging with
an unmodified gdb, and also extend gdb to be Valgrind-aware and get a
much more powerful debugging environment.
Mostly this is fairly easy to integrate, but it does require that the
stub be able to control the VCPU's execution at the instruction level.
With UCode, single stepping is pretty easy with
VG_(clo_single_step)=True and making dispatch_ctr per-thread rather than
global, but it still isn't clear to me what the best way to do
breakpoints is. I'm tending towards just stomping an int3 on the first
byte of the translated instruction, and dealing with the trap
appropriately (since debugger mode is special, there's no problem with
generating a special BB prologue to leave space for it, rather than
overwriting something else).
J
|