|
From: Julian S. <js...@ac...> - 2002-11-15 08:25:42
|
[Hi Mark! How's it going? -- J]

> > > BTW, will Valgrind run on itself?
> >
> > No way, Hose! Shame; it would be nice to clean up v's memory
> > management ...
>
> I've been carving out chunks and running them in standalone test
> programs so I can run them under V, but I've got to the point where I
> have dependencies on the actual execution context, and I'm not sure if I
> want to try and simulate that standalone...
>
> What are the problems which prevent recursive execution? Dynamic
> codegen is going to be tricky, obviously, and I guess the dynamic linker
> hackery might get tricky. Even if V could only run a limited subset of
> programs when self-virtualized, it would help a lot. It seems so
> strange that we're restricted to primitive tools for use within the
> sophisticated tool...

Well, now you got me thinking. I think you're into a research-grade
question there. Here FWIW is a suggestion.

Part of the weakness of the existing V infrastructure is that we depend
on the existing dynamic linker and are accumulating a lot of hacks to
cope with that.

It would be nice if valgrind was "just another" bog-standard C or C++
program; no LD_PRELOAD hackery, no annoying restriction on the use of
libc functions or other libraries internally.

This could be achieved if V loaded and linked the executable to run,
itself. Rather than write our own complete ELF loader/linker, if we ask
really nicely (Mark? pretty please) we could "borrow" that machinery
from Mark's "bintrans" (http://www.complang.tuwien.ac.at/schani/bintrans).

Disadvantages are

1. it's yet more code to maintain in the codebase.

2. we'd have to be very careful to intercept all dlopen-style calls and
   do the right emulation. At the moment we ignore them because ld.so
   or whatever it is sorts it all out for us even when running on the
   simulated CPU.

Advantages are

1. we now have 100% control over the loading/linking process

2. we can use whatever libraries we want in building valgrind; no more
   mylibc.c

3. related to 2, it becomes possible to envisage using a higher-level
   language like C++ (no laughing please) to reconstruct the core
   dynamic translation, instrumentation, optimisation machinery. If
   you are interested in serious performance improvements, the core
   translation machinery will probably have to work across multiple
   basic blocks (for example, detect loops and do better on them) and
   doing more sophisticated stuff would probably be less painful in C++
   or OCaml or something.

4. [to answer your question, finally] it would make valgrind a peer of
   the programs it runs. Meaning: V would become just-another
   bog-standard linux executable which runs bog-standard linux
   executables, and so its chances of self-hosting IMO would be much
   improved. Currently it is a shell script and a bunch of .so's which
   get tacked onto the side of running programs, and so is not really a
   peer of them.

5. [minor point] V could itself be multithreaded, which might help our
   thread library simulation -- could make a thread to do asynch IO
   (am thinking of your recent comments re 03a-open patch).

The self-modifying code is not a big deal. If we want, this can be
transparently supported by checking all writes and invalidating TT/TC
as needed. Very old V snapshots (pre Jan 2002) had this and it "just
worked" (tm). There is [was?] even a program in the test suite to test
it.

For more immediate performance stuff I strongly suggest you read
"Shade: A Fast Instruction-Set Simulator for Execution Profiling"
(http://www.cs.washington.edu/research/compiler/papers.d/shade.html)
which is the most direct technical influence on valgrind. It contains
a lot of details about translation chaining, a simple-ish change
[compared to the above] which would save about 15 (real-machine) insns
per basic block executed.

J |
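The write-check scheme for self-modifying code that Julian mentions can be sketched in a few lines of C. This is only an illustration, not the old pre-2002 implementation; the names `TTEntry` and `invalidate_on_write` are invented.

```c
#include <assert.h>

/* Hypothetical miniature translation table: each entry remembers which
   range of original code it was translated from. */
typedef struct {
    unsigned long orig_addr;  /* start of original code range */
    unsigned long orig_len;   /* length of original code range */
    int valid;                /* 0 once discarded */
} TTEntry;

enum { TT_SIZE = 8 };
static TTEntry tt[TT_SIZE];

/* Called on every write the simulated CPU performs: discard any
   translation whose original code range overlaps the written range,
   forcing retranslation on next execution. */
void invalidate_on_write(unsigned long addr, unsigned long len)
{
    for (int i = 0; i < TT_SIZE; i++) {
        if (!tt[i].valid)
            continue;
        unsigned long lo = tt[i].orig_addr;
        unsigned long hi = lo + tt[i].orig_len;
        if (addr < hi && addr + len > lo)   /* ranges overlap */
            tt[i].valid = 0;
    }
}
```

Since every store pays the check, this is "transparent" but not free, which is presumably why later snapshots dropped it.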
|
From: Jeremy F. <je...@go...> - 2002-11-15 20:32:31
|
On Fri, 2002-11-15 at 00:32, Julian Seward wrote:
> > What are the problems which prevent recursive execution? Dynamic
> > codegen is going to be tricky, obviously, and I guess the dynamic linker
> > hackery might get tricky. Even if V could only run a limited subset of
> > programs when self-virtualized, it would help a lot. It seems so
> > strange that we're restricted to primitive tools for use within the
> > sophisticated tool...
>
> Well, now you got me thinking. I think you're into a research-grade
> question there. Here FWIW is a suggestion.
>
> Part of the weakness of the existing V infrastructure is that we depend
> on the existing dynamic linker and are accumulating a lot of hacks to
> cope with that.
>
> It would be nice if valgrind was "just another" bog-standard C or C++
> program; no LD_PRELOAD hackery, no annoying restriction on the use of
> libc functions or other libraries internally.

I've been thinking along similar lines, though mostly while doing
thought experiments in making Valgrind do cross-simulation (though it
looks like bintrans is already on that path).

> This could be achieved if V loaded and linked the executable to run,
> itself. Rather than write our own complete ELF loader/linker, if we
> ask really nicely (Mark? pretty please) we could "borrow" that machinery
> from Mark's "bintrans" (http://www.complang.tuwien.ac.at/schani/bintrans).

Well, a user-mode exec is reasonably easy - I've done that before. And
stolen bits of a dynamic linker should be pretty easy to glue in (and
bintrans does look like an interesting organ donor).

> Disadvantages are
>
> 1. it's yet more code to maintain in the codebase.
>
> 2. we'd have to be very careful to intercept all dlopen-style calls
>    and do the right emulation. At the moment we ignore them because
>    ld.so or whatever it is sorts it all out for us even when running
>    on the simulated CPU.

Really? If we do a full process emulation, then we take control before
ever running the first target instruction. It doesn't matter if the
target code is statically linked, dynamically linked or does its own
code loading. All we need to be careful about is managing the
generated code cache. We could still set LD_PRELOAD/LD_LIBRARY_PATH
for the target process to influence its choice of libraries, so we can
still intercept interesting library functions and do pthreads
emulation.

There's the interesting question of how to manage the address space,
though I suppose the current scheme would work OK (certainly if V
reserves a chunk of the user virtual address space for itself, and
makes sure the target process doesn't intrude). On the other hand, if
the target process lives in its own address space then we don't need
to worry about conflicts at all; but this would probably need a
software-emulated address translation mechanism, which might be a bit
of a performance hit (unless we can do something clever with
mmap/mremap to get the hardware to do the work).

> 3. related to 2, it becomes possible to envisage using a higher-level
>    language like C++ (no laughing please) to reconstruct the core dynamic
>    translation, instrumentation, optimisation machinery. If you are
>    interested in serious performance improvements, the core translation
>    machinery will probably have to work across multiple basic blocks
>    (for example, detect loops and do better on them) and doing more
>    sophisticated stuff would probably be less painful in C++ or OCaml
>    or something.

It would be nice to have some higher level tools to work with. C++
would be an OK match for the current machinery, and it would be
interesting to see what you could do with something like OCaml. I
think doing more complex global analysis is going to be hard regardless
of the tools you use, because one of V's advantages is working on
arbitrary machine code, which may not have any useful structure to take
advantage of (what I mean is that any performance advantage of doing
global analysis may be overwhelmed by all the checking needed to make
sure the assumptions are still valid).

> 5. [minor point] V could itself be multithreaded, which might help
>    our thread library simulation -- could make a thread to do asynch IO
>    (am thinking of your recent comments re 03a-open patch).

Yep.

> For more immediate performance stuff I strongly suggest you read
> "Shade: A Fast Instruction-Set Simulator for Execution Profiling"
> (http://www.cs.washington.edu/research/compiler/papers.d/shade.html)
> which is the most direct technical influence on valgrind. It contains
> a lot of details about translation chaining, a simple-ish change
> [compared to the above] which would save about 15 (real-machine) insns
> per basic block executed.

Ah, Shade! I spent quite a bit of time reading that paper when it
appeared. I was working on Byron Rakitzis's implementation of pico
(the image processing language) which did dynamic codegen for Sparc. I
was going to write a short description, but the 2nd Google hit is a
comp.compilers posting of mine describing it:
http://compilers.iecc.com/comparch/article/93-04-102

Anyway, I was thinking about chaining, and it doesn't look too hard
(meaning, I couldn't think of any completely insoluble problems). My
thinking was something like: 1. for each jump out of a basic block,
you'd generate code which looks something like:

J |
|
From: Jeremy F. <je...@go...> - 2002-11-15 20:58:31
|
Ahem. Ctrl-enter is send, not line break. Anyway...

On Fri, 2002-11-15 at 12:32, Jeremy Fitzhardinge wrote:
> > It contains
> > a lot of details about translation chaining, a simple-ish change
> > [compared to the above] which would save about 15 (real-machine) insns
> > per basic block executed.
>
> Anyway, I was thinking about chaining, and it doesn't look too hard
> (meaning, I couldn't think of any completely insoluble problems).
> My thinking was something like:

1. for each normal jump (jump rather than call, syscall, etc; also
   static rather than computed or indirect target address) out of a
   basic block, you'd generate code which looks something like:

	movl $next_bb_addr, %eax
	decl bbcount
	jnz 1f			# unless x86 has conditional ret
	ret			# normal path
1:	call codegen		# generate code at %eax
	nop; nop		# nops, if we need space for the jmp

2. the clever bit is that codegen generates the target code and then
   back-patches its own call site into a jmp to the generated code.
   If you still want to be able to move generated code around (still
   don't quite understand that), then it could be an indirect jump via
   some stable place. codegen() would then jump to the newly generated
   code.

3. each entry in the translation table would list the set (max of 2
   jumps per basic block?) of chained basic blocks. Cleaning up the
   translation table would have a mark/sweep pass first, so we don't
   clear out any code which is being chained to. Or we could re-patch
   the jump into a call to codegen.

It might also be possible to keep emulated registers in real registers
for longer, but I haven't thought about the details yet (well, I have,
and it just kept getting more complex the more I thought about it, so
it's probably a bad idea).

I should really read the Shade paper again. I skimmed it this morning,
so the stuff above is just me getting distracted and going off on a
tangent rather than something informed by the paper.

J |
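Jeremy's back-patching idea can be simulated in plain C using a function-pointer "jump slot" in place of real instruction patching (which needs writable, executable memory). All names here (`codegen_stub`, `run_block`, `target_block`) are invented; this only demonstrates the control flow, not the real code generation:

```c
#include <assert.h>

typedef void (*BlockFn)(void);

static void codegen_stub(void);

/* The patchable "call site": initially points at the slow path. */
static BlockFn slot = codegen_stub;
static int codegen_calls;   /* how often the slow path ran */

/* Stands in for an already-generated successor translation. */
static void target_block(void) { }

static void codegen_stub(void)
{
    codegen_calls++;          /* "generate" the target code... */
    slot = target_block;      /* ...back-patch the call site... */
    target_block();           /* ...and complete the interrupted jump */
}

/* Each execution of the basic block exits through the slot: the first
   run goes via codegen_stub, every later run jumps straight through. */
void run_block(void) { slot(); }
```

The key property, as in the email: the slow path runs once per jump site, after which the exit is direct.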
|
From: Julian S. <js...@ac...> - 2002-11-16 12:06:21
|
During a long train journey late this summer I worked through most of the
design details needed to support t-chaining cleanly. Then I forgot most of
them. I am inclined to agree, it's an obvious (perhaps overdue) optimisation
which should be looked into. Having said that, my priority is still to freeze
and ship 2.0 tho.
The basic idea is that each translation exists in one of two states:
chained and unchained, and can be moved back and forth between them as
needed.
- chained means that jumps out of it to known addresses jump directly
to the target translation.
- unchained means we always do a lookup in the orig->new code address
mapping, ie we go via the dispatcher
New translations are created in the unchained state. Permanently associated
with each translation is enough metadata to facilitate chaining or unchaining
it at will.
When an unchained translation wants to make a jump to a known (orig)address,
it pushes the orig-address it wants to call, and *calls* "patch_me"
which is a short piece of assembly code. This pops the args (orig-addr)
and also pops the return address -- which points just after the call
insn on the original translation. patch_me can arrange to find the
translation and patch the caller to jump directly to it.
There is some fiddly stuff to be sorted out here:
- how to most cleanly and robustly store info to enable chaining/unchaining
- how to minimise the number of magic assembly code sequences needed (these
amount, you'll notice, to an ultra-minimal runtime linker)
- how to cleanly deal with jumps to unknown addresses, which always require
a lookup
- how to deal with jumps which have "extra semantics", ie a JumpSyscall or
JumpClientReq, etc.
- how to handle the event-counter falling to zero in chained translations
My view was to allow chaining for boring ordinary jumps to known addrs, when
no special semantics are needed. This fast-cases the vast majority of jumps.
All other cases go through a single slow dispatcher which does whatever is
needed, including an old->new lookup, and never get chained. This is pretty
much identical to the vg_dispatch.S stuff as it stands now.
Finally -- and this is the last part of the trick -- whenever we want to
move or discard any translations, we first unchain *all* of them. This
makes them completely self-contained, so we can then mess with them as
we desire. In practice this only really occurs when doing an LRU discard pass,
and those are pretty darned rare. When execution resumes there will be a
little extra work as translations are re-chained, but LRU passes are so rare
that this is assumed to be asymptotically insignificant overhead.
All in all, this is the simplest scheme I could think of which would probably
give good performance.
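The two-state scheme above can be sketched by modelling each translation's exits as patchable slots rather than real code. This is an editorial illustration only; `Trans`, `chain` and `unchain_all` are invented names, and the real design patches machine code rather than integers:

```c
#include <assert.h>

#define MAX_TRANS 16
#define DISPATCH  0   /* slot value meaning: go via the dispatcher */

/* A translation with (at most) two exit jumps.  A slot holding
   DISPATCH means "unchained": do the orig->new lookup.  Any other
   value means "chained": jump directly to that translation. */
typedef struct {
    int exit_slot[2];   /* 0 = unchained, else 1 + index of target */
} Trans;

static Trans trans[MAX_TRANS];

/* Chain one exit of translation 'from' directly to translation 'to'. */
void chain(int from, int e, int to) { trans[from].exit_slot[e] = to + 1; }

/* Before moving or discarding any translation, make every translation
   self-contained again by resetting all exits to the dispatcher. */
void unchain_all(void)
{
    for (int i = 0; i < MAX_TRANS; i++)
        trans[i].exit_slot[0] = trans[i].exit_slot[1] = DISPATCH;
}
```

The "unchain *all* of them" step is what makes discard/move safe without tracking who chains to whom.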
There are a couple of other points:
- I tried to minimise the number of magic assembly code sequences needed.
It's much better to put as much of the fancy logic as possible in C-land.
It would be interesting to see how far this can be pushed. For example,
perhaps even the patch_me fragment and slow-path dispatcher could be done
completely in C.
A design in which the target-CPU-specific-aspects are minimised would
get you extra brownie points. Specifically, if V ever gets ported to
x86-64 / IA64 (god forbid), it would be nice not to have to rewrite reams
of assembly code. And it would be nice not to hardwire too deeply
CPU-specific knowledge of how to patch (chain/unchain) translations and
the exact requirements for padding bytes etc at the end of translations.
Sure, some of this is probably unavoidable, but minimising it is a
worthy design goal.
- Jeremy: you mentioned something about possibly calling _from_ translations
to the translator machinery if a target translation is missing. I prefer to
stick with the structure as it stands, in which translations run as the
highest point in the call stack, on the basis that the alternative is less
rugged. For all
exceptional situations (missing translation, JmpSyscall, etc) the
translations return to C land (the scheduler) which handles the situation.
Therefore the call stack looks like one of the following:
scheduler(C) -> run_thread_for_a_while(C) -> run_innerloop(ASM) ->
(translations)
or
scheduler(C) -> translation-generating-machinery(C)
But specifically I never have
scheduler(C) -> run_thread_for_a_while(C) -> run_innerloop(ASM) ->
some-translation -> translation-generating-machinery(C)
there is no case where C land runs a translation which calls back into C
land, and I think that is more robust.
[ok, not entirely true; translations call helper fns, but these are
pretty simple and don't mess with the global translation state at all]
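The "translations never call back into C land" invariant can be shown as a scheduler loop that handles every exceptional exit reason itself. A hedged sketch only: the reason codes and function names are invented, and the real vg_scheduler/vg_dispatch machinery differs (here `run_innerloop` is a stub driven by a scripted list of exits so the loop can be exercised):

```c
#include <assert.h>

enum Reason { R_MISSING_TRANS, R_SYSCALL, R_CLIENTREQ, R_DONE };

/* Scripted stand-in for the assembly inner loop: each call "runs
   translations for a while" and returns why it stopped. */
static const enum Reason script[] = { R_MISSING_TRANS, R_SYSCALL, R_DONE };
static int pc;
static enum Reason run_innerloop(void) { return script[pc++]; }

static int translated, syscalls;

/* All exceptional situations are handled here, at the top of the call
   stack; a translation never calls translation-generating code. */
int scheduler(void)
{
    for (;;) {
        switch (run_innerloop()) {
        case R_MISSING_TRANS: translated++; break; /* make translation */
        case R_SYSCALL:       syscalls++;   break; /* do the syscall   */
        case R_CLIENTREQ:                   break; /* service request  */
        case R_DONE:          return 0;            /* thread finished  */
        }
    }
}
```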
So: I have no time to chase up any of this stuff (apart from discuss possible
designs), but if you feel the need to do some feasibility-assessment hacking,
please do! It would be very interesting to know if the extra performance gain
is worth the complication.
If it can be done simply and cleanly I'm in favour. Generally my approach is
to shoot for 80% of the available performance for 20% of the complication.
This strikes me as good engineering for a resource-constrained small group.
See http://www.cs.princeton.edu/software/lcc for a strikingly effective
demonstration of the same attitude.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-17 11:12:11
|
Hi,

I can't add a lot to this discussion, but I have been reading about
Dynamo...

On Sat, 16 Nov 2002, Julian Seward wrote:
> - how to cleanly deal with jumps to unknown addresses, which always require
>   a lookup

Dynamo has some kind of table for these.

> Finally -- and this is the last part of the trick -- whenever we want to
> move or discard any translations, we first unchain *all* of them.

I think Dynamo actually flushes the entire code cache in this
situation. But it doesn't have LRU eviction, and their code cache
seems to be a lot smaller. So maybe that's not relevant.

Also, w.r.t. indirect jumps, I think Dynamo encodes a lot of them as
direct jumps and then checks before making the jump that the direct
jump is to the right place. This somehow turns out as a win...

N |
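The Dynamo trick Nicholas describes amounts to an inline cache: compile the indirect jump as "if the target equals the predicted address, take the cheap direct path, otherwise fall back to the lookup". A sketch under invented names (`predicted`, `dispatch_indirect`), with counters standing in for the two code paths:

```c
#include <assert.h>

/* Address most often observed at this jump site, baked in at
   translation time (hypothetical value). */
static unsigned long predicted = 0x4000;
static int fast_hits, slow_lookups;

void dispatch_indirect(unsigned long actual_target)
{
    if (actual_target == predicted) {
        fast_hits++;       /* one compare, then a direct jump */
    } else {
        slow_lookups++;    /* full orig->new table lookup */
    }
}
```

It wins when indirect jumps are strongly biased toward one target, which call-site-specific returns and virtual calls often are.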
|
From: Julian S. <js...@ac...> - 2002-11-17 12:30:58
|
> On Sat, 16 Nov 2002, Julian Seward wrote:
> > - how to cleanly deal with jumps to unknown addresses, which always
> >   require a lookup
>
> Dynamo has some kind of table for these.

That seems unavoidable. We'd probably just recycle the current
arrangement, or perhaps a simplified version of it, in this case. Bear
in mind that x86 RET instructions are going to pull an original address
off the stack and jump to it, so there will still be a significant
minority of such lookups needed.

> > Finally -- and this is the last part of the trick -- whenever we want to
> > move or discard any translations, we first unchain *all* of them.
>
> I think Dynamo actually flushes the entire code cache in this situation.
> But it doesn't have LRU eviction, and their code cache seems to be a lot
> smaller. So maybe that's not relevant.

Yes, there's a tradeoff between complexity of the cache management and
expense of translation. Since we've got a hugely expensive translator
by JIT standards (10s of 000s of cycles / byte of generated code) we
have to have a more complex arrangement designed to minimise the miss
rate.

> Also, w.r.t. indirect jumps, I think Dynamo encodes a lot of them as
> direct jumps and then checks before making the jump that the direct jump
> is to the right place. This somehow turns out as a win...

Yes, known trick. I think VMware probably does this too (it's a
lightweight JIT, in effect). (How VMware works is really described in
an earlier paper from Stanford; I can dig it out if anyone wants.)

Nevertheless I think this is an optimisation/complication which is
probably beyond what we'd want.

Mark, I hope we're not flooding you with junk mail. You haven't said a
word so far :) and so perhaps you have been abducted by aliens?

J |
|
From: Jeremy F. <je...@go...> - 2002-11-18 02:08:49
|
On Sun, 2002-11-17 at 04:37, Julian Seward wrote:
> > On Sat, 16 Nov 2002, Julian Seward wrote:
> > > - how to cleanly deal with jumps to unknown addresses, which always
> > > require a lookup
> >
> > Dynamo has some kind of table for these.
>
> That seems unavoidable. We'd probably just recycle the current
> arrangement, or perhaps a simplified version of it, in this case. Bear in
> mind that x86 RET instructions are going to pull an original address off
> the stack and jump to it, so there will still be a significant minority
> of such lookups needed.
It's likely that rets will be the majority of jumps with unknown
destinations (though very OO C++ code's use of virtual method
invocations will get close). I think the current mechanism is fine.
> > > Finally -- and this is the last part of the trick -- whenever we want to
> > > move or discard any translations, we first unchain *all* of them.
> >
> > I think Dynamo actually flushes the entire code cache in this situation.
> > But it doesn't have LRU eviction, and their code cache seems to be a lot
> > smaller. So maybe that's not relevant.
>
> Yes, there's a tradeoff between complexity of the cache management and
> expense of translation. Since we've got a hugely expensive translator
> by JIT standards (10s of 000s of cycles / byte of generated code) we
> have to have a more complex arrangement designed to minimise the miss
> rate.
My profiling doesn't show more than about 5% of time being spent in
codegen when there's a non-trivial skin running. Unfortunately oprofile
doesn't account for time spent in generated code, so it's hard to tell
what the time for code generation vs. time running generated code ratio
is.
> > Also, w.r.t. indirect jumps, I think Dynamo encodes a lot of them as
> > direct jumps and then checks before making the jump that the direct jump
> > is to the right place. This somehow turns out as a win...
>
> Yes, known trick. I think VMware probably does this too (it's a light
> weight JIT, in effect). (How VMware works is really described in an
> earlier paper from Stanford; I can dig it out if anyone wants).
Yes please.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-19 00:57:49
|
> > That seems unavoidable. We'd probably just recycle the current
> > arrangement, or perhaps a simplified version of it, in this case. Bear
> > in mind that x86 RET instructions are going to pull an original address
> > off the stack and jump to it, so there will still be a significant
> > minority of such lookups needed.
>
> It's likely that rets will be the majority of jumps with unknown
> destinations (though very OO C++ code's use of virtual method
> invocations will get close). I think the current mechanism is fine.

Agree. The only other major source is the implementation of switch
statements with dense ranges (jump tables).

> > Yes, there's a tradeoff between complexity of the cache management and
> > expense of translation. Since we've got a hugely expensive translator
> > by JIT standards (10s of 000s of cycles / byte of generated code) we
> > have to have a more complex arrangement designed to minimise the miss
> > rate.
>
> My profiling doesn't show more than about 5% of time being spent in
> codegen when there's a non-trivial skin running. Unfortunately oprofile
> doesn't account for time spent in generated code, so it's hard to tell
> what the time for code generation vs. time running generated code ratio
> is.

Hmm, that differs from what the internal valgrind profiler itself tells
us ... about 30-40% of the time starting up mozilla, on memcheck, is
translation. That also corresponds (non-quantitatively) with the speed
variations in the lines printed by -v -v when running stuff (try it).

> > Yes, known trick. I think VMware probably does this too (it's a
> > lightweight JIT, in effect). (How VMware works is really described in
> > an earlier paper from Stanford; I can dig it out if anyone wants.)
>
> Yes please.

Disco: Running Commodity Operating Systems on Scalable Multiprocessors.
Bugnion, Devine, Rosenblum (at least the last of these is a founder of
VMware, Inc). 1997-ish. I'm sure you can scag it from the web
somewhere; I'd start at citeseer. I only have a paper copy.

This says not a word about VMware, but it describes how to implement
basically the same thing on a mips-based multiprocessor, and IMO
clearly contains all the important ideas underlying vmware. Most
clearly in Section 4.

J |
|
From: Jeremy F. <je...@go...> - 2002-11-19 05:56:01
|
On Mon, 2002-11-18 at 17:04, Julian Seward wrote:
> > > That seems unavoidable. We'd probably just recycle the current
> > > arrangement, or perhaps a simplified version of it, in this case. Bear
> > > in mind that x86 RET instructions are going to pull an original address
> > > off the stack and jump to it, so there will still me a significant
> > > minority of such lookups needed.
> >
> > It's likely that rets will be the majority of jumps with unknown
> > destinations (though very OO C++ code's use of virtual method
> > invocations will get close). I think the current mechanism is fine.
>
> Agree. The only other major source is implementation of switch statements
> with dense ranges (jump tables).
Another optimisation I'm thinking of is
> Hmm, that differs from what the internal valgrind profiler itself tells
> us ... about 30-40% of the time starting up mozilla, on memcheck, is
> translation. That also corresponds (non-quantitatively) with the speed
> variations in the lines printed by -v -v when running stuff (try it).
Is there a set of benchmarks you've been using for this kind of stuff?
It would be useful to have a set for evaluating all the proposals that
have been flying around lately. It would be nice to have something less
interactive than Mozilla, for example. I'm particularly short of nice
examples of multithreaded programs which aren't all GUI.
> Disco: Running Commodity Operating systems on scalable multiprocessors.
> Bugnion, Devine, Rosenblum (at least the last of these is a founder of VMware,
> Inc). 1997 ish. I'm sure you can scag it from the web somewhere; I'd start
> at citeseer. I only have a paper copy.
Hm, I think I've read that paper, but I don't think I paid attention
to it. I've certainly read the Cellular Disco paper.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-19 06:01:14
|
On Mon, 2002-11-18 at 21:55, Jeremy Fitzhardinge wrote:
> > Agree. The only other major source is implementation of switch statements
> > with dense ranges (jump tables).
>
> Another optimisation I'm thinking of is
[This is what happens when you're composing two emails in parallel.]
J
|
|
From: Julian S. <js...@ac...> - 2002-11-19 13:51:02
|
> > Hmm, that differs from what the internal valgrind profiler itself tells
> > us ... about 30-40% of the time starting up mozilla, on memcheck, is
> > translation. That also corresponds (non-quantitatively) with the speed
> > variations in the lines printed by -v -v when running stuff (try it).
>
> Is there a set of benchmarks you've been using for this kind of stuff?
> It would be useful to have a set for evaluating all the proposals that
> have been flying around lately. It would be nice to have something less
> interactive than Mozilla, for example. I'm particularly short of nice
> examples of multithreaded programs which aren't all GUI.

Starting mozilla-1.0, which requires translating about 2.2M of original
code into ~45M of memcheck'd translations.

J |
|
From: Jeremy F. <je...@go...> - 2002-11-16 18:17:07
|
On Sat, 2002-11-16 at 04:09, Julian Seward wrote:
> During a long train journey late this summer I worked through most of the
> design details needed to support t-chaining cleanly. Then I forgot most of
> them. I am inclined to agree, it's an obvious (perhaps overdue) optimisation
> which should be looked into. Having said that, my priority is still to freeze
> and ship 2.0 tho.
>
> The basic idea is that each translation exists in one of two states:
> chained and unchained, and can be moved back and forth between them as
> needed.
>
> - chained means that jumps out of it to known addresses jump directly
>   to the target translation.
>
> - unchained means we always do a lookup in the orig->new code address
>   mapping, ie we go via the dispatcher
>
> New translations are created in the unchained state. Permanently associated
> with each translation is enough metadata to facilitate chaining or unchaining
> it at will.
>
> When an unchained translation wants to make a jump to a known (orig)address,
> it pushes the orig-address it wants to call, and *calls* "patch_me"
> which is a short piece of assembly code. This pops the args (orig-addr)
> and also pops the return address -- which points just after the call
> insn on the original translation. patch_me can arrange to find the
> translation and patch the caller to jump directly to it.
>
> There is some fiddly stuff to be sorted out here:
>
> - how to most cleanly and robustly store info to enable chaining/unchaining
>
> - how to minimise the number of magic assembly code sequences needed (these
>   amount, you'll notice, to an ultra-minimal runtime linker)

Well, I suppose there's two: there's the sequence the codegen generates
for jumps, and there's patch_me. Neither of those are complex.

> - how to cleanly deal with jumps to unknown addresses, which always require
>   a lookup
>
> - how to deal with jumps which have "extra semantics", ie a JumpSyscall or
>   JumpClientReq, etc.

Ignore them - just generate the code for them we generate now.

> - how to handle the event-counter falling to zero in chained translations

I think generate the decrement inline and fall into the dispatcher if
we hit 0.

> Finally -- and this is the last part of the trick -- whenever we want to
> move or discard any translations, we first unchain *all* of them.

OK, that's nice and simple.

> - Jeremy: you mentioned something about possibly calling _from_ translations
> to the translator machinery if a target translation is missing. I prefer to
> stick with the structure as it stands on the basis it's less rugged, in
> which translations run as the highest point in the call stack. For all
> exceptional situations (missing translation, JmpSyscall, etc) the
> translations return to C land (the scheduler) which handles the situation.
> Therefore the call stack looks like one of the following:
>
> scheduler(C) -> run_thread_for_a_while(C) -> run_innerloop(ASM) ->
> (translations)
>
> or
>
> scheduler(C) -> translation-generating-machinery(C)
>
> But specifically I never have
>
> scheduler(C) -> run_thread_for_a_while(C) -> run_innerloop(ASM) ->
> some-translation -> translation-generating-machinery(C)
>
> there is no case where C land runs a translation which calls back into C
> land, and I think that is more robust.
>
> [ok, not entirely true; translations call helper fns, but these are
> pretty simple and don't mess with the global translation state at all]

Well, OK, but you didn't address what patch_me would do if the target
address isn't present. Would it fall back into the dispatcher loop,
which would then trigger a codegen, and then the next time through this
BB we'd do the chaining? That seems reasonable to me.

> So: I have no time to chase up any of this stuff (apart from discuss possible
> designs), but if you feel the need to do some feasibility-assessment hacking,
> please do! It would be very interesting to know if the extra performance gain
> is worth the complication.

I'll look at it if I get a moment. I want to finish up everything I've
got open at the moment, with the expectation I'll have very little
hacking time available in a month or two (whereupon my first-born
appears and I get that harried new-parent look).

> If it can be done simply and cleanly I'm in favour. Generally my approach is
> to shoot for 80% of the available performance for 20% of the complication.
> This strikes me as good engineering for a resource-constrained small group.
> See http://www.cs.princeton.edu/software/lcc for a strikingly effective
> demonstration of the same attitude.

Yes, I like lcc's internals.

J |
|
From: Josef W. <Jos...@gm...> - 2002-11-17 00:26:05
|
On Saturday 16 November 2002 19:17, Jeremy Fitzhardinge wrote:
> On Sat, 2002-11-16 at 04:09, Julian Seward wrote:
> ...
> > When an unchained translation wants to make a jump to a known
> > (orig)address, it pushes the orig-address it wants to call, and *calls*
> > "patch_me" which is a short piece of assembly code. This pops the args
> > (orig-addr) and also pops the return address -- which points just after
> > the call insn on the original translation. patch_me can arrange to find
> > the translation and patch the caller to jump directly to it.

Just an idea: Why not simply use indirect jumps and patch the jump
address, the same way symbol resolving for calls to shared lib
functions does it? The jump address would be initialised to "patch_me"
and later to the translated code (by patch_me).

You wouldn't need chained/unchained versions or patching of generated
code, and whenever discarding a translation, you'd set the according
jump address back to "patch_me".

Seems way easier to me as a first step. Sorry if I'm totally wrong
here :-)

Josef |
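Josef's proposal mirrors how PLT/GOT lazy binding works: every jump goes indirectly through a slot, the slot starts out pointing at a resolver, and the resolver fills in the real target on first use. Discarding a translation just resets its slot. A C sketch with invented names (`jump_slot`, `patch_me`, `translated_code`), again using function pointers in place of real code addresses:

```c
#include <assert.h>

typedef void (*Code)(void);

static void patch_me(void);

/* The per-jump-site slot, like a GOT entry: born pointing at the
   resolver. */
static Code jump_slot = patch_me;
static int resolved, executed;

/* Stands in for the translation of the jump's target. */
static void translated_code(void) { executed++; }

static void patch_me(void)
{
    resolved++;                  /* look up / make the translation... */
    jump_slot = translated_code; /* ...bind the slot, like ld.so... */
    translated_code();           /* ...and complete this jump */
}

void do_jump(void) { jump_slot(); }           /* every jump is indirect */
void discard_translation(void) { jump_slot = patch_me; }  /* "unchain" */
```

The trade-off versus back-patching is one extra memory indirection per jump in exchange for never having to rewrite generated code.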
|
From: Julian S. <js...@ac...> - 2002-11-17 03:20:24
|
> > > When an unchained translation wants to make a jump to a known
> > > (orig)address, it pushes the orig-address it wants to call, and *calls*
> > > "patch_me" which is a short piece of assembly code. This pops the args
> > > (orig-addr) and also pops the return address -- which points just after
> > > the call insn on the original translation. patch_me can arrange to
> > > find the translation and patch the caller to jump directly to it.
>
> Just an idea: Why not simply use indirect jumps and patching the jump
> address, same as the symbol resolving with calls to shared lib functions is
> doing? The jump address would be initialised to "patch_me" and later to the
> translated code (by patch_me).
> You wouldn't need chained/unchained versions or patching generated code,
> and whenever discarding a translation, set the according jump address back
> to "patch_me".
> Seems way easier for me in a first step.
>
> Josef

I don't really know how the shared lib stuff works. Can you explain in a
little more detail how your proposal would work?

Thanks, J
|
|
From: Jeremy F. <je...@go...> - 2002-11-17 07:16:24
|
On Sat, 2002-11-16 at 15:25, Josef Weidendorfer wrote:
> On Saturday 16 November 2002 19:17, Jeremy Fitzhardinge wrote:
> > On Sat, 2002-11-16 at 04:09, Julian Seward wrote:
> > ...
> > > When an unchained translation wants to make a jump to a known
> > > (orig)address, it pushes the orig-address it wants to call, and *calls*
> > > "patch_me" which is a short piece of assembly code. This pops the args
> > > (orig-addr) and also pops the return address -- which points just after
> > > the call insn on the original translation. patch_me can arrange to find
> > > the translation and patch the caller to jump directly to it.
>
> Just an idea: Why not simply use indirect jumps and patching the jump address,
> same as the symbol resolving with calls to shared lib functions is doing?
> The jump address would be initialised to "patch_me" and later to the
> translated code (by patch_me).
> You wouldn't need chained/unchained versions or patching generated code, and
> whenever discarding a translation, set the according jump address back to
> "patch_me".
> Seems way easier for me in a first step.
So you mean have a structure per basic block, which contains as one of
its elements the entrypoint for its generated code, or if there is no
generated code, the entrypoint of "patch_me"? This is OK, but it would
be nice not to have an indirect jump in the generated code, since this
makes branch prediction much harder. A nice absolute unconditional jump
makes things easiest on the CPU.
J
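Josef's proposal, as restated here, can be simulated in plain C: each BB gets a slot that holds either the entry point of its generated code or the "patch_me" fallback, and every jump is an indirect call through the slot. All names (bb_entry, patch_me, jump_to_bb) are illustrative, not from Valgrind.

```c
#include <assert.h>
#include <stddef.h>

#define N_BBS 4

typedef int (*Entry)(void);

static int patch_me_hits;

/* Fallback taken when no translation exists yet (or was discarded). */
static int patch_me(void) { patch_me_hits++; return -1; }

/* Stand-in for one BB's generated code. */
static int bb0_code(void) { return 0; }

/* The per-BB table of entry points. */
static Entry bb_entry[N_BBS];

static void init_slots(void)
{
    patch_me_hits = 0;
    for (size_t i = 0; i < N_BBS; i++)
        bb_entry[i] = patch_me;   /* no translations yet */
}

/* "Jumping" to a BB is an indirect call through its slot -- this is
   the indirect jump Jeremy objects to on branch-prediction grounds. */
static int jump_to_bb(size_t idx) { return bb_entry[idx](); }

/* Discarding a translation just resets its slot; no code patching. */
static void discard(size_t idx) { bb_entry[idx] = patch_me; }
```

The appeal is exactly what Josef says: installing and discarding a translation is a single pointer store. The cost, per Jeremy's objection, is that every BB-to-BB transfer becomes an indirect branch through a data load.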
|
|
From: Julian S. <js...@ac...> - 2002-11-17 12:33:12
|
> So you mean have a structure per basic block, which contains as one of
> its elements the entrypoint for its generated code, or if there is no
> generated code, the entrypoint of "patch_me"? This is OK, but it would
> be nice not to have an indirect jump in the generated code, since this
> makes branch prediction much harder. A nice absolute unconditional jump
> makes things easiest on the CPU.

Um, yes, that (the ld.so scheme) requires a separate entry point per
jump-target, which is a complication we don't want. I think I declare a
red herring on this one.

J
|
|
From: Jeremy F. <je...@go...> - 2002-11-17 17:34:57
|
On Sun, 2002-11-17 at 04:40, Julian Seward wrote:
> > So you mean have a structure per basic block, which contains as one of
> > its elements the entrypoint for its generated code, or if there is no
> > generated code, the entrypoint of "patch_me"? This is OK, but it would
> > be nice not to have an indirect jump in the generated code, since this
> > makes branch prediction much harder. A nice absolute unconditional jump
> > makes things easiest on the CPU.
>
> Um, yes, that (the ld.so scheme) requires a separate entry point per
> jump-target, which is a complication we don't want.
Eh? I don't understand. Doesn't every BB have a distinct entry point
anyway?
J
|
|
From: Josef W. <Jos...@gm...> - 2002-11-17 14:25:50
|
On Sunday 17 November 2002 08:16, Jeremy Fitzhardinge wrote:
> On Sat, 2002-11-16 at 15:25, Josef Weidendorfer wrote:
> > On Saturday 16 November 2002 19:17, Jeremy Fitzhardinge wrote:
> > > On Sat, 2002-11-16 at 04:09, Julian Seward wrote:
> > > ...
> > > > When an unchained translation wants to make a jump to a known
> > > > (orig)address, it pushes the orig-address it wants to call, and
> > > > *calls* "patch_me" which is a short piece of assembly code. This
> > > > pops the args (orig-addr) and also pops the return address -- which
> > > > points just after the call insn on the original translation.
> > > > patch_me can arrange to find the translation and patch the caller to
> > > > jump directly to it.
> >
> > Just an idea: Why not simply use indirect jumps and patching the jump
> > address, same as the symbol resolving with calls to shared lib functions
> > is doing? The jump address would be initialised to "patch_me" and later
> > to the translated code (by patch_me).
> > You wouldn't need chained/unchained versions or patching generated code,
> > and whenever discarding a translation, set the according jump address
> > back to "patch_me".
> > Seems way easier for me in a first step.
>
> So you mean have a structure per basic block, which contains as one of
> its elements the entrypoint for its generated code, or if there is no

Yes. I meant just a table of entry points with indirect jumps at the end
of the translated BBs using this table.

> generated code, the entrypoint of "patch_me"? This is OK, but it would
> be nice not to have an indirect jump in the generated code, since this
> makes branch prediction much harder. A nice absolute unconditional jump
> makes things easiest on the CPU.

Is branch prediction with indirect jumps always a problem for the CPU
hardware? Perhaps register relative, loading the register a few cycles
before the jump? Then the pipeline shouldn't stall.
But that needs a register. OK, go with patching the generated code.

Another idea: If two BBs are to be executed unconditionally after each
other, couldn't you try to put them in the translation cache directly
behind each other with no jump at all? Perhaps padding with some NOPs if
a jump will be needed later on...

Josef
|
|
From: Jeremy F. <je...@go...> - 2002-11-18 02:42:18
|
On Sun, 2002-11-17 at 06:15, Josef Weidendorfer wrote:
> > generated code, the entrypoint of "patch_me"? This is OK, but it would
> > be nice not to have an indirect jump in the generated code, since this
> > makes branch prediction much harder. A nice absolute unconditional jump
>
> Is branch prediction with indirect jumps always a problem for the CPU
> hardware? Perhaps register relative, loading the register a few cycles before
> the jump? Then the pipeline shouldn't stall.
> But that needs a register. OK, go with patching the generated code.
That still doesn't really help. The P4 has a 20-stage pipeline; it gets
really expensive if there are any bubbles in it.
The worst kind of bubble is a cache miss, so you want to keep cache
pressure as low as possible. Building the target address into the
instruction stream rather than requiring a data cache line for every
branch is a win there.
The lesser bubble is when you get a branch mis-prediction, or a stall
because the jump's target can't otherwise be determined. The branch
target buffer (BTB) will help there, because in the absence of other
information it will predict that the branch will go to the last place it
went; unfortunately if every branch is indirect, it will put a lot of
pressure on the BTB and cause a high miss rate.
Even if the CPU could make use of the fact that you preloaded the target
into a register, you'd need to do it 20 instructions before the actual
jump, which would be very hard (in that most basic blocks aren't 20
instructions long, and it would cause quite a lot of register pressure
keeping all possible jump targets in the next 20 instructions in
registers).
The P4 has the interesting architectural feature of a trace cache rather
than a traditional icache. The trace cache ignores branches altogether,
and just encodes the uops making up the instruction stream. In other
words, it effectively builds branch prediction into the icache. I
suspect whenever there's a mis-predict or some complex jump (ie,
indirect) it will break the current trace, with some performance
impact. If we can give the trace cache unconditional jumps, it will get
nice long traces to work with.
As you suggest, we could emulate this by running together two basic
blocks which have an unconditional jump between them (we do this
implicitly anyway, to some extent, if we get a jump into the middle of
an existing BB). I don't think this would be worthwhile though.
The other advantage of allowing the P4 to do good branch prediction is
to allow it to speculate deeply into the future; this will generate data
cache prefetches, and therefore lower the observed data cache miss
rate. If it speculates into the wrong future, it will end up polluting
the data cache with useless stuff.
Pre-P4 CPUs are less sensitive to all this, I suspect, but would still
benefit from having nicely predictable jumps wherever possible (they
will still prefetch and speculate over the jump).
BTW, the current dispatch loop has about worst-possible behaviour with
regard to branch prediction. It is a single call instruction which is
indirect and always goes to a different address, making the BTB almost
completely ineffective; I would expect a near 0% prediction rate there.
At least the call-return sequence is matched so the return stack cache
will get a good hit rate.
Another useful thing to do (which may already be there, I haven't
looked) is to make sure that generated code always starts at a 16 or 32
byte boundary (depending on the CPU architecture), to make sure it
starts on a cache line. That way it saves the CPU from wasting memory
bandwidth fetching unused memory from before the start of the generated
code.
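The alignment suggestion above is a one-liner in the translation-cache allocator: round each translation's start address up to a 16- or 32-byte boundary so it begins on a fresh cache line. A minimal sketch, with an illustrative constant and helper name (not Valgrind's actual allocator):

```c
#include <assert.h>
#include <stdint.h>

/* Cache-line-ish alignment for generated code; 32 is illustrative,
   the right value depends on the CPU architecture. */
#define TC_ALIGN 32u

/* Round an allocation cursor up to the next TC_ALIGN boundary.
   Works because TC_ALIGN is a power of two. */
static uintptr_t align_up(uintptr_t p)
{
    return (p + TC_ALIGN - 1) & ~(uintptr_t)(TC_ALIGN - 1);
}
```

The allocator would call this on its bump pointer before emitting each new translation, wasting at most TC_ALIGN-1 bytes per translation in exchange for never fetching a cache line of someone else's tail bytes.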
> Another idea: If two BBs are to be executed unconditionally after each other,
> couldn't you try to put them in the translation cache direct behind each other
> with no jump at all? Perhaps padding with some NOPs if a jump will be needed
> later on...
As I said, this isn't all that important if you use unconditional jumps.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-18 08:45:20
|
On 17 Nov 2002, Jeremy Fitzhardinge wrote:

> The P4 has the interesting architectural feature of a trace cache rather
> than a traditional icache.

I'd be wary of trying to target a single architecture. For example, the
Athlon has a traditional I cache. And Valgrind should (hopefully!) be
around longer than the P4... Just my $0.02.

N
|
|
From: Jeremy F. <je...@go...> - 2002-11-18 16:59:40
|
On Mon, 2002-11-18 at 00:45, Nicholas Nethercote wrote:
> On 17 Nov 2002, Jeremy Fitzhardinge wrote:
>
> > The P4 has the interesting architectural feature of a trace cache rather
> > than a traditional icache.
>
> I'd be wary of trying to target a single architecture. For example, the
> Athlon has a traditional I cache. And Valgrind should (hopefully!) be
> around longer than the P4...
The P4 is a good example of a modern architecture. Anything that's good
for P4 will be good for Athlon (and Hammer) as well.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-17 18:10:17
|
On Sunday 17 November 2002 5:34 pm, Jeremy Fitzhardinge wrote:
> On Sun, 2002-11-17 at 04:40, Julian Seward wrote:
> > > So you mean have a structure per basic block, which contains as one of
> > > its elements the entrypoint for its generated code, or if there is no
> > > generated code, the entrypoint of "patch_me"? This is OK, but it would
> > > be nice not to have an indirect jump in the generated code, since this
> > > makes branch prediction much harder. A nice absolute unconditional
> > > jump makes things easiest on the CPU.
> >
> > Um, yes, that (the ld.so scheme) requires a separate entry point per
> > jump-target, which is a complication we don't want.
>
> Eh? I don't understand. Doesn't every BB have a distinct entry point
> anyway?

My understanding of the ld.so scheme (which may be wrong!) is:

Initially a call to (eg) printf is sent to a stub function (which perhaps
lives in the PLT, the procedure linkage table?). This invokes the dynamic
linker, to find the address of the real printf. The linker overwrites this
stub with a jump to the real address, and jumps onwards there.

Second and subsequent calls to printf still jump to the stub, but that
simply makes a second jump to the real code. So once the initial lookup is
done, there is a 1-insn overhead for all subsequent calls, which is not bad.

This scheme has the advantage that neither the caller nor callee has to be
patched, since that would negate the advantages of text segments shared
between multiple processes. Only each process's PLT(s?) need be modified,
but these are relatively small.

Does this clarify? Do you know if this is even correct?

J
|
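The lazy-binding scheme described above can be modelled in C with a function-pointer slot: the first call goes through a stub that plays the role of the dynamic linker, resolving the real function and overwriting the slot; every later call pays only the one extra indirection. This is an analogue of the i386 PLT/GOT mechanism, not the real thing; all names are invented for illustration.

```c
#include <assert.h>

static int resolver_calls;

/* The "real printf" that lazy binding eventually reaches. */
static int real_printf_like(void) { return 42; }

/* Forward declaration so the slot can point at the stub initially. */
static int lazy_stub(void);

/* The GOT-like slot: starts pointing at the resolver stub. */
static int (*plt_slot)(void) = lazy_stub;

/* The "dynamic linker": looks up the real symbol, patches the slot,
   and continues on to the real code -- only the per-process table is
   written, never the caller's or callee's text. */
static int lazy_stub(void)
{
    resolver_calls++;
    plt_slot = real_printf_like;
    return plt_slot();
}

/* What a compiled call site looks like: always go through the slot. */
static int call_via_plt(void) { return plt_slot(); }
```

This also shows why the scheme is a red herring for translation chaining: the indirection lives in a per-target table slot, which is exactly the indirect jump Jeremy wants to avoid in generated code.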
|
From: Jeremy F. <je...@go...> - 2002-11-18 02:53:26
|
On Sun, 2002-11-17 at 10:17, Julian Seward wrote:
> My understanding of the ld.so scheme (which may be wrong!) is:
>
> Initially a call to (eg) printf is sent to a stub function (which perhaps
> lives in the PLT, the procedure linkage table?). This
> invokes the dynamic linker, to find the address of the real printf.
> The linker overwrites this stub with a jump to the real address, and
> jumps onwards there.
>
> Second and subsequent calls to printf still jump to the stub, but that
> simply makes a second jump to the real code. So once the initial lookup
> is done, there is a 1-insn overhead for all subsequent calls, which is
> not bad.
>
> This scheme has the advantage that neither the caller nor callee has
> to be patched, since that would negate the advantages of text segments
> shared between multiple processes. Only each process's PLT(s?) need be
> modified, but these are relatively small.
>
> Does this clarify? Do you know if this is even correct?
Yes, that's correct, but it wasn't my understanding of Josef's proposal
(or more to the point, Josef's proposal is very close, but not quite the
same). Of course, I may have misunderstood.
I think he was proposing a per-BB table which encodes the address of
that BB's generated code. Jumps to that BB would therefore be something
like jmp *(bb_targets[idx]) (where idx is computed at compile time, and
therefore doesn't allow the table to be rearranged without invalidating
the translation cache). That table would be incrementally generated
while compiling basic blocks which contain jumps to that address.
Indirect and computed jumps would still have to go through the current
mechanism.
The distinction is that the PLT is a table in each caller's object file
which is either an array of instructions which is overwritten by the
dynamic linker (Sparc, Mips), or just an indirect jump through a table
of pointers to functions (i386, though that's a pretty bad idea these
days, for the reasons I talk about in the other mail I just sent). For
Josef's proposal, that table would be built incrementally in much the
same way as the dynamically generated code is built (in fact, it would
be like a wavefront just in advance of all possible predicted jumps in
the code).
However, I think we'd get the most benefit from chaining by effectively
compiling this table into the generated code, by having an instruction
which gets overwritten with either a chained jump or a return back into
the dispatch loop. I think this is actually simpler to implement as
well.
J
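The chaining approach favoured here -- overwriting an instruction in the generated code with either a direct jump to the target translation or a return into the dispatch loop -- comes down to writing a 5-byte x86 `jmp rel32` over the patch site. The sketch below only computes and checks the patch bytes in an ordinary buffer; a real implementation would write them into executable memory (and worry about making that memory writable). The function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* x86 jmp rel32: opcode 0xE9 followed by a 32-bit displacement,
   measured from the end of the 5-byte instruction. */
static void patch_jmp(uint8_t *insn, uintptr_t insn_addr, uintptr_t target)
{
    int32_t disp = (int32_t)(target - (insn_addr + 5));
    insn[0] = 0xE9;
    memcpy(insn + 1, &disp, 4);   /* assumes little-endian, as on x86 */
}
```

Unchaining all translations before moving or discarding any of them, as discussed earlier in the thread, would mean rewriting each such site back to the instruction that returns to the dispatcher.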
|