|
From: Nuno L. <nun...@sa...> - 2008-03-09 00:17:40
|
Hi,

I'm currently attending a Virtual Execution Environments class at the university this semester. For this class I need to do a project (with 2 colleagues) about something related to the class (VMs, etc.). The project duration is about 1 month.

My idea (still pending the teacher's authorization) was to do something with valgrind, more specifically to implement some binary code optimization over the VEX IR. Do you think this is feasible in the given time frame? And which optimizations are currently implemented, and which ones do you think would be interesting to implement?

We are also open to other ideas (as long as they fit the class subject). Of course we'll give the code back when we finish the project.

Thanks in advance,
Nuno
|
|
From: Nicholas N. <nj...@cs...> - 2008-03-10 20:52:09
|
On Sun, 9 Mar 2008, Nuno Lopes wrote:

> I'm currently attending a Virtual Execution Environments class at the
> university this semester. For this class I need to do a project (with 2
> colleagues) about something related to the class (VMs, etc.). The project
> duration is about 1 month.
> My idea (still pending the teacher's authorization) was to do something with
> valgrind, more specifically implement some binary code optimization over the
> VEX IR. Do you think this is feasible in the given time frame? And which
> optimizations are currently implemented and which ones do you think are
> interesting to implement?
>
> We are also open to other ideas (as long as they fit in the class subject).
> Of course we'll give back the code when we finish the project.

I think it's certainly feasible, and a nice idea for a project.

http://www.valgrind.org/docs/valgrind2007.pdf has some info on optimisations performed, in section 3.7. Note especially that there are two optimisation phases. You'll want to read the surrounding sections (and probably the whole paper) to make sense of it.

VEX/priv/ir/iropt.c is the main file controlling Vex optimisation, I believe. It has this comment near the top:

    Transformation order
    ~~~~~~~~~~~~~~~~~~~~
    There are three levels of optimisation, controlled by
    vex_control.iropt_level.  Define first:

    "Cheap transformations" are the following sequence:
       * Redundant-Get removal
       * Redundant-Put removal
       * Constant propagation/folding
       * Dead code removal
       * Specialisation of clean helper functions
       * Dead code removal

    "Expensive transformations" are the following sequence:
       * CSE
       * Folding of add/sub chains
       * Redundant-GetI removal
       * Redundant-PutI removal
       * Dead code removal

    Then the transformations are as follows, as defined by
    vex_control.iropt_level:

    Level 0:
       * Flatten into atomic form.

    Level 1: the following sequence:
       * Flatten into atomic form.
       * Cheap transformations.

    Level 2: the following sequence:
       * Flatten into atomic form.
       * Cheap transformations.
       * If block contains any floating or vector types, CSE.
       * If block contains GetI or PutI, Expensive transformations.
       * Try unrolling loops.  Three possible outcomes:
         - No effect: do nothing more.
         - Unrolled a loop, and block does not contain GetI or PutI:
           Do: * CSE
               * Dead code removal
         - Unrolled a loop, and block contains GetI or PutI:
           Do: * Expensive transformations
               * Cheap transformations

The comments in VEX/pub/libvex_ir.h are the best documentation for Vex.

As the Valgrind paper says, these are fairly heavyweight optimisations for a binary translation system. You could try to do some more compiler-style optimisations, but I think the scope for improvement there is not so great. But I could be wrong -- Julian, what do you think?

A very interesting project you could try would be to implement chaining for Vex -- the Valgrind paper (above) talks about this. (Section 2.3.6 of http://www.valgrind.org/docs/phd2004.pdf discusses the old implementation of chaining that was in pre-Vex Valgrind -- you can see the actual implementation in Valgrind 2.4.1 at http://www.valgrind.org/downloads/old.html.)

Another interesting one would be to try to stage the compiler somehow -- recompile hot code fragments more aggressively, as frameworks like Pin do.

Both of those would be quite challenging, though, because they would require changing how Valgrind manages its translated code blocks. Just doing new intra-block optimisations would be easier, because you'd just be adding more optimisation passes (but it is less likely to achieve interesting results, as I said).

An important question you should consider early is what your goals are. More specifically, are you trying to speed up Valgrind's code when no instrumentation is present? In one way, that's the obvious thing to do, but it's also the least interesting thing to do (the paper talks about this in section 5.4). A more interesting thing is to speed up any of the real existing tools, especially Memcheck, since that's the most widely used.

As for giving back the code you develop, there's a complication -- Julian is the sole author and thus owns the copyright for the Vex part of Valgrind. This means that accepting external contributions for Vex is more complicated than for the rest of Valgrind. He might have more to add.

If you have more questions, feel free to ask.

Nick
|
|
From: Nuno L. <nun...@sa...> - 2008-03-11 23:41:45
|
> http://www.valgrind.org/docs/valgrind2007.pdf has some info on
> optimisations performed in section 3.7. Note especially that there are
> two optimisation phases. You'll want to read the surrounding sections
> (and probably the whole paper) to make sense of it.
>
> VEX/priv/ir/iropt.c is the main file controlling Vex optimisation, I
> believe. It has this comment near the top:
>
> The comments in VEX/pub/libvex_ir.h are the best documentation for Vex.
>
> As the Valgrind paper says, these are fairly heavyweight optimisations for
> a binary translation system. You could try to do some more compiler-style
> optimisations, but I think the scope for improvement there is not so
> great. But I could be wrong -- Julian, what do you think?

Uhm, it seems that most of the easy optimizations are already implemented, as I would expect. Anyway, I'll take a closer look (maybe next week) at what is implemented, to see if we have any chance to improve anything at all. Inter-block optimizations (e.g. inlining, GCSE, ...) would be really cool, but I'm not sure how well Vex is suited for that.

> A very interesting project you could try would be to implement chaining
> for Vex -- the Valgrind paper (above) talks about this. (Section 2.3.6 of
> http://www.valgrind.org/docs/phd2004.pdf discusses the old implementation
> of chaining that was in pre-Vex Valgrind -- you can see the actual
> implementation in Valgrind 2.4.1 at
> http://www.valgrind.org/downloads/old.html.)

Sounds interesting, but maybe it requires too much knowledge of valgrind internals. I'll take a look into this as well.

> An important question you should consider early is what your goals are.
> More specifically, are you trying to speed up Valgrind's code when no
> instrumentation is present? In one way, that's the obvious thing to do,
> but it's also the least interesting thing to do (the paper talks about
> this in section 5.4). A more interesting thing is to speed up any of the
> real existing tools, especially Memcheck, since that's the most widely
> used.

Yes, but I'm not sure I can justify a project about dynamic program analysis in a virtual execution class. The class is targeted at dynamic binary translation, binary interpretation, intra-block optimizations, virtualization, and so on.

> As for giving back the code you develop, there's a complication -- Julian
> is the sole author and thus owns the copyright for the Vex part of
> Valgrind. This means that accepting external contributions for Vex is more
> complicated than for the rest of Valgrind. He might have more to add.

OK, we can take care of the copyright stuff later if we come up with something useful.

Thanks,
Nuno
|
|
From: Julian S. <js...@ac...> - 2008-03-12 11:59:42
|
Nuno,

Sorry to be slow replying. The slowness is because I don't have any good alternative suggestions.

> > As the Valgrind paper says, these are fairly heavyweight optimisations
> > for a binary translation system. You could try to do some more
> > compiler-style optimisations, but I think the scope for improvement there
> > is not so great. But I could be wrong -- Julian, what do you think?
>
> Uhm, it seems that most of the easy optimizations are already implemented,
> as I would expect. Anyway I'll take a closer look (maybe next week) at what
> is implemented to see if we have any chance to improve anything at all.
> Inter-block optimizations (e.g. inlining, GCSE, ...) would be really cool,
> but I'm not sure how Vex is suited for that.

I agree. Earlier in the life of the project there was a lot of effort put into doing good IR-level optimisation, and then before 3.3.0 there was another round of iropt and code generator tuning. So most of the easy and even the not-so-easy stuff is already done.

From a general usability perspective, when a Valgrind tool runs too slowly to be usable, that is almost always a problem with the tool and not with the core code generation mechanisms. So speeding up the core doesn't improve usability much. Hacking on the tools usually makes a much bigger difference.

It depends what you want to do. If you want to do some hacking on classical compiler transformations, then perhaps there are not so many interesting opportunities left in Valgrind.

> > A very interesting project you could try would be to implement
> > chaining for Vex -- the Valgrind paper (above) talks about this.
> > (Section 2.3.6 of http://www.valgrind.org/docs/phd2004.pdf discusses the
> > old implementation of chaining that was in pre-Vex Valgrind -- you can
> > see the actual implementation in Valgrind 2.4.1 at
> > http://www.valgrind.org/downloads/old.html.)
>
> Sounds interesting, but maybe it requires too much valgrind internals
> knowledge. I'll take a look into this as well.

Another thing you could chase is enhancing the superblock formation. Currently Vex follows unconditional branches and calls when forming superblocks, but stops at indirect and conditional branches. I experimented with the usual simple heuristic for conditional branches -- assume backward branches are taken and forward branches are not -- and extended it to follow conditional branches on that basis. That often made performance worse, though: although the translations might be a bit faster, they are also a lot bigger (more I1 misses), and the JIT of course runs more slowly too. Not worth the hassle, I reckon.

> > this in section 5.4). A more interesting thing is to speed up any of the
> > real existing tools, especially Memcheck, since that's the most widely
> > used.
>
> Yes, but I'm not sure I can justify such a project about dynamic program
> analysis in a virtual execution class. The class is targeted at dynamic
> binary translation, binary interpretation, intra-block optimizations,
> virtualization, and so on.

Yes, I see that. Nevertheless, if it does happen that working on a tool is OK, it might be worth looking at Callgrind. It's a great tool (I have used it quite a lot in '08) but a little slow, and from some initial profiles it looks like there are some possibilities for speedup in there. Also, porting the branch-mispredict profiling from Cachegrind into Callgrind would be a cool thing to do.

J
|
|
From: Nuno L. <nun...@sa...> - 2008-03-17 23:41:33
|
> Sorry to be slow replying. Slowness is because I don't have any good
> alternative suggestions.

Sorry now for my late answer, but I didn't have internet access over the weekend.

> I agree .. earlier in the life of the project there was a lot of effort
> put into doing good IR level optimisation; and then before 3.3.0 another
> round of iropt and code generator tuning. So most of the easy and
> even the not-so-easy stuff is already done.

OK. Anyway, I'll compare what's implemented with what's in the book :) If there's still something left, we'll implement it (and benchmark it as well).

> Another thing you could chase is to consider enhancing the superblock
> formation. Currently vex follows unconditional branches and calls when
> forming superblocks, but stops at indirect and conditional branches.
> I experimented with the usual simple heuristic for conditional branches:
> assume backwards branches taken and forward not taken, and extended it
> to follow conditional branches on that basis. Often made performance
> worse, though; although the translations might be a bit faster, they are
> also a lot bigger (more I1 misses) and the JIT of course runs more slowly
> too. Not worth the hassle I reckon.

Sounds interesting. I'll have a closer look at Vex's code to understand it, and I'll get back to you if (well, when..) I have questions.

BTW, does valgrind have some way to dump the superblock contents (e.g. the asm/vex instructions and jumps)?

Thanks,
Nuno
|
|
From: Nuno L. <nun...@sa...> - 2008-03-18 20:22:00
|
> BTW, does valgrind have some way to dump the superblock contents (e.g. the
> asm/vex instructions and jumps)?

OK, disregard the question. I already found the answer myself, e.g.:

valgrind --tool=none -v --trace-flags=11111111 --trace-children=yes --trace-notbelow=1 ls

Nuno
|
|
From: Julian S. <js...@ac...> - 2008-03-18 20:14:00
|
On Tuesday 18 March 2008 21:03, Nuno Lopes wrote:
> > BTW, does valgrind have some way to dump the superblock contents (e.g. the
> > asm/vex instructions and jumps)?
>
> OK, disregard the question. I already found the answer myself, e.g.:
> valgrind --tool=none -v --trace-flags=11111111 --trace-children=yes
> --trace-notbelow=1 ls

Also --profile-flags= is quite fun. It has the advantage that it shows you the "hot" blocks in an application. Without that kind of profiling facility, it is difficult to distinguish bad bits of code generation that only happen rarely from ones which happen often enough to be a performance problem.

Also the --tool= is important, since the instrumentation code added by tools is very varied. Try comparing the code generated by none vs cachegrind (lots of calls) vs memcheck (what _is_ all that stuff :-?)

If you can get your hands on a powerpc machine, also try on that. You'll see different effects again on that platform.

For some simple benchmarks, try the programs in perf, with --profile-flags. For example, ffbench.c contains almost all the action in the top 7 blocks or so, which means they are quite interesting to study in detail.

J
|
|
From: Nuno L. <nun...@sa...> - 2008-03-19 00:13:39
|
> On Tuesday 18 March 2008 21:03, Nuno Lopes wrote:
>> > BTW, does valgrind have some way to dump the superblock contents (e.g.
>> > the asm/vex instructions and jumps)?
>>
>> OK, disregard the question. I already found the answer myself, e.g.:
>> valgrind --tool=none -v --trace-flags=11111111 --trace-children=yes
>> --trace-notbelow=1 ls
>
> Also --profile-flags= is quite fun. It has the advantage that it
> shows you the "hot" blocks in an application. Without that kind of
> profiling facility, it is difficult to distinguish between bad bits
> of code generation that only happen rarely, from ones which happen
> enough to be a performance problem.
>
> Also the --tool= is important, since the instrumentation code added
> by tools is very varied. Try comparing the code generated by
> none vs cachegrind (lots of calls) vs memcheck (what _is_ all that
> stuff :-?)
>
> If you can get your hands on a powerpc machine, also try on that.
> You'll see different effects again on that platform.
>
> For some simple benchmarks, try the programs in perf, with
> --profile-flags. For example ffbench.c contains almost all the
> action in the top 7 blocks or so, which means they are quite
> interesting to study in detail.

Great, thanks for the tips!

Nuno
|
|
From: Nuno L. <nun...@sa...> - 2008-03-23 00:12:27
|
Hi,
So today I was looking a bit at the superblocks with and without
instrumentation, and I got the following:
------ IMark(0x80483CE, 7) ------
t130 = GET:I32(340)
t25 = GET:I32(20)
t132 = Left32(t130)
t24 = Add32(t25,0xFFFFFFF4:I32)
t134 = CmpNEZ32(t132)
DIRTY t134 RdFX-gst(16,4) RdFX-gst(60,4) :::
MC_(helperc_value_check4_fail){0x38006920}()
DIRTY 1:I1 RdFX-gst(16,4) RdFX-gst(60,4) :::
MC_(helperc_STOREV32le)[rp=2]{0x38006d40}(t24,0x0:I32)
STle(t24) = 0x2:I32
------ IMark(0x80483D5, 7) ------
PUT(60) = 0x80483D5:I32
t137 = Left32(t130)
t26 = Add32(t25,0xFFFFFFF8:I32)
t139 = CmpNEZ32(t137)
DIRTY t139 RdFX-gst(16,4) RdFX-gst(60,4) :::
MC_(helperc_value_check4_fail){0x38006920}()
DIRTY 1:I1 RdFX-gst(16,4) RdFX-gst(60,4) :::
MC_(helperc_STOREV32le)[rp=2]{0x38006d40}(t26,0x0:I32)
STle(t26) = 0x4:I32
------ IMark(0x80483DC, 3) ------
PUT(60) = 0x80483DC:I32
IR-NoOp
t144 = DIRTY 1:I1 RdFX-gst(16,4) RdFX-gst(60,4) :::
MC_(helperc_LOADV32le)[rp=1]{0x380070e0}(t26)
t30 = LDle:I32(t26)
At first glance I would say that the first PUT(60) could be removed, right?
It is basically a dead store (assuming the location can only be read/written
with GET/PUT). I didn't check further why the redundant-Put removal
optimization doesn't pick this up, so I'm checking here first whether this is sane.
Second question: can the last memcheck check be removed? I.e., the last
call to helperc_LOADV32le() is redundant, since a store has already been done
to that location and thus it is known that the operation is safe. Can this call
be removed? And what value do I assign to t144 then? VA_BITS32_DEFINED? (I'm
just considering the simpler case where #store bits >= #load bits.)
Allow me just one last question: is it safe to replace the 't30 =
LDle:I32(t26)' statement with 't30 = 0x4:I32'? In general I would say
it is safe, but I don't know about memory-mapped I/O, nor if/how valgrind
handles it. Maybe this can be done only on certain architectures?
Thanks in advance,
Nuno
|
|
From: Nicholas N. <nj...@cs...> - 2008-03-23 01:01:17
|
On Sun, 23 Mar 2008, Nuno Lopes wrote:
> ------ IMark(0x80483CE, 7) ------
> t130 = GET:I32(340)
> t25 = GET:I32(20)
> t132 = Left32(t130)
> t24 = Add32(t25,0xFFFFFFF4:I32)
> t134 = CmpNEZ32(t132)
> DIRTY t134 RdFX-gst(16,4) RdFX-gst(60,4) ::: MC_(helperc_value_check4_fail){0x38006920}()
> DIRTY 1:I1 RdFX-gst(16,4) RdFX-gst(60,4) ::: MC_(helperc_STOREV32le)[rp=2]{0x38006d40}(t24,0x0:I32)
> STle(t24) = 0x2:I32
> ------ IMark(0x80483D5, 7) ------
> PUT(60) = 0x80483D5:I32
> t137 = Left32(t130)
> t26 = Add32(t25,0xFFFFFFF8:I32)
> t139 = CmpNEZ32(t137)
> DIRTY t139 RdFX-gst(16,4) RdFX-gst(60,4) ::: MC_(helperc_value_check4_fail){0x38006920}()
> DIRTY 1:I1 RdFX-gst(16,4) RdFX-gst(60,4) ::: MC_(helperc_STOREV32le)[rp=2]{0x38006d40}(t26,0x0:I32)
> STle(t26) = 0x4:I32
> ------ IMark(0x80483DC, 3) ------
> PUT(60) = 0x80483DC:I32
> IR-NoOp
> t144 = DIRTY 1:I1 RdFX-gst(16,4) RdFX-gst(60,4) ::: MC_(helperc_LOADV32le)[rp=1]{0x380070e0}(t26)
> t30 = LDle:I32(t26)
>
>
> At first glance I would say that the first PUT(60) could be removed, right?
> It is basically a dead store (assuming it can only be read/write with
> GET/PUT). I didn't check further why the redundant removal optimization
> doesn't pick this, so I'm checking here first if this is sane.
It's not removed because there's another statement (the STle) between it and
the subsequent PUT(60) that could cause a memory exception. A signal
handler for such an exception might inspect the %eip value, so it has to be
up-to-date. Annoying, because it's so unlikely, but necessary.
> Second question is: can the last memcheck check be removed? i.e. the last
> call to helperc_LOADV32le() is redundant, since it has already done a store to
> that location and thus it knows that the operation is safe. Can this call be
> removed? And what value do I assign to t144 then? VA_BITS32_DEFINED? (I'm
> just considering the simpler case where the #store bits >= #load bits).
I think it is redundant. Since 0x0 (which is VA_BITS32_DEFINED) was the
shadow store value, you'd assign 0x0 to t144. And you're right that you
could do likewise if t144 was smaller than 32 bits.
> Allow me just a last question: is it safe to replace the 't30 =
> LDle:I32(t26)' statement with 't30 = 0x4:I32'? Well in general I would say it
> is safe, but I dunno about memory-mapped I/O nor if/how valgrind handles it.
> Maybe this can be done in only certain architectures?
If we're optimising away the shadow load, maybe it's reasonable to optimise
away the real load, but I'm really not sure about that one. Julian might
have more to add.
Nick
|
|
From: Julian S. <js...@ac...> - 2008-03-23 03:04:57
|
On Sunday 23 March 2008 02:01, Nicholas Nethercote wrote:
> > ------ IMark(0x80483D5, 7) ------
> > PUT(60) = 0x80483D5:I32
> > t137 = Left32(t130)
> > t26 = Add32(t25,0xFFFFFFF8:I32)
> > t139 = CmpNEZ32(t137)
> > DIRTY t139 RdFX-gst(16,4) RdFX-gst(60,4) :::
> > MC_(helperc_value_check4_fail){0x38006920}() DIRTY 1:I1 RdFX-gst(16,4)
> > RdFX-gst(60,4) ::: MC_(helperc_STOREV32le)[rp=2]{0x38006d40}(t26,0x0:I32)
> > STle(t26) = 0x4:I32
> > At first glance I would say that the first PUT(60) could be removed,
> > right? It is basically a dead store (assuming it can only be read/write
> > with GET/PUT). I didn't check further why the redundant removal
> > optimization doesn't pick this, so I'm checking here first if this is
> > sane.
>
> It's not removed because there's another statement (the STle) between it
> and the subsequent PUT(60) that could cause a memory exception. A signal
> handler for such an exception might inspect the %eip value, so it has to be
> up-to-date. Annoying, because it's so unlikely, but necessary.
Yes. But not only for the STle, also for the call to
MC_(helperc_value_check4_fail).
If you look at struct VexGuestX86State in libvex_guest_x86.h, you can
see that offset 60 is guest_EIP (the program counter). So the PUT is
updating the simulated program counter: it must be up to date should the
store generate an exception, but also (importantly) so that if
MC_(helperc_value_check4_fail) needs to make a stack backtrace, it will
start from the right point.
MC_(helperc_value_check4_fail) is the function that will, or at least
may, eventually lead to memcheck giving an error "Use of uninitialised
value of size 4".
If you look at priv/guest-x86/ghelpers.c, function
guest_x86_state_requires_precise_mem_exns tells the ir optimiser
(iropt.c) which parts of the guest state must be kept up to date at
all times, basically so that helper functions can get correct
stack backtraces. So it forces EBP, ESP and EIP to always be up
to date; all other guest registers can get completely out of date
and are only sync'd with reality at superblock transitions.
> > Second question is: can the last memcheck check be removed? i.e. the last
> > call to helperc_LOADV32le() is redundant, since it has already done a
> > store to that location and thus it knows that the operation is safe. Can
> > this call be removed? And what value do I assign to t144 then?
> > VA_BITS32_DEFINED? (I'm just considering the simpler case where the
> > #store bits >= #load bits).
>
> I think it is redundant. Since 0x0 (which is VA_BITS32_DEFINED) was the
> shadow store value, you'd assign 0x0 to t144. And you're right that you
> could do likewise if t144 was smaller than 32 bits.
>
> > Allow me just a last question: is it safe to replace the 't30 =
> > LDle:I32(t26)' statement with 't30 = 0x4:I32'? Well in general I would
> > say it is safe, but I dunno about memory-mapped I/O nor if/how valgrind
> > handles it. Maybe this can be done in only certain architectures?
>
> If we're optimising away the shadow load, maybe it's reasonable to optimise
> away the real load, but I'm really not sure about that one. Julian might
> have more to add.
Difficult question! I would comment first that it might be worth looking
at the original code to figure out why the compiler put a store and then
a load from the same location in the next insn. It might be that the
second instruction (0x80483DC) is a branch target, maybe the top of a
loop.
So is it safe to remove the real and shadow load and forward values
directly to it from the real and shadow store? Well, that would probably
work _most_ of the time, but it sounds dangerous if multiple threads are
doing spinlocks or something like that -- then changing the number of
memory references might have some bad effect.
J
|
|
From: Nuno L. <nun...@sa...> - 2008-03-23 15:32:03
|
Thank you both for your comprehensive answers. Some comments inline:
>> > ------ IMark(0x80483D5, 7) ------
>> > PUT(60) = 0x80483D5:I32
>> > t137 = Left32(t130)
>> > t26 = Add32(t25,0xFFFFFFF8:I32)
>> > t139 = CmpNEZ32(t137)
>> > DIRTY t139 RdFX-gst(16,4) RdFX-gst(60,4) :::
>> > MC_(helperc_value_check4_fail){0x38006920}() DIRTY 1:I1 RdFX-gst(16,4)
>> > RdFX-gst(60,4) :::
>> > MC_(helperc_STOREV32le)[rp=2]{0x38006d40}(t26,0x0:I32)
>> > STle(t26) = 0x4:I32
>
>> > At first glance I would say that the first PUT(60) could be removed,
>> > right? It is basically a dead store (assuming it can only be read/write
>> > with GET/PUT). I didn't check further why the redundant removal
>> > optimization doesn't pick this, so I'm checking here first if this is
>> > sane.
>>
>> It's not removed because there's another statement (the STle) between it
>> and the subsequent PUT(60) that could cause a memory exception. A signal
>> handler for such an exception might inspect the %eip value, so it has to
>> be
>> up-to-date. Annoying, because it's so unlikely, but necessary.
>
> Yes. But not only for the STle, also for the call to
> MC_(helperc_value_check4_fail).
>
> If you look at struct VexGuestX86State in libvex_guest_x86.h, you can
> see that offset 60 is guest_EIP (the program counter). So the PUT is
> updating the simulated program counter, so it is up to date should the
> store generate an exception, but also (importantly) so that if
> MC_(helperc_value_check4_fail) needs to make a stack backtrace, it will
> start from the right point.
That makes sense. So there's nothing to do here :S
>> > Second question is: can the last memcheck check be removed? i.e. the
>> > last
>> > call to helperc_LOADV32le() is redundant, since it has already done a
>> > store to that location and thus it knows that the operation is safe.
>> > Can
>> > this call be removed? And what value do I assign to t144 then?
>> > VA_BITS32_DEFINED? (I'm just considering the simpler case where the
>> > #store bits >= #load bits).
>>
>> I think it is redundant. Since 0x0 (which is VA_BITS32_DEFINED) was the
>> shadow store value, you'd assign 0x0 to t144. And you're right that you
>> could do likewise if t144 was smaller than 32 bits.
>>
>> > Allow me just a last question: is it safe to replace the 't30 =
>> > LDle:I32(t26)' statement with 't30 = 0x4:I32'? Well in general I would
>> > say it is safe, but I dunno about memory-mapped I/O nor if/how valgrind
>> > handles it. Maybe this can be done in only certain architectures?
>>
>> If we're optimising away the shadow load, maybe it's reasonable to
>> optimise
>> away the real load, but I'm really not sure about that one. Julian might
>> have more to add.
>
> Difficult question! I would comment first that it might be worth looking
> at the original code to figure out why the compiler put a store and then
> a load from the same location in the next insn. It might be that the
> second instruction (0x80483DC) is a branch target, maybe the top of a
> loop.
>
> So is it safe to remove the real and shadow load and forward values
> directly to it from the real and shadow store? Well, that would probably
> work _most_ of the time, but sounds dangerous if multiple threads are
> doing spinlocks or something like that - then it might be that changing
> the number of memory references might have some bad effect.
I see that this doesn't work. If the pointer points to some allocated memory,
another thread may free() it; the pointer will then point to freed memory, and
thus subsequent reads/writes need to be flagged as errors.
So, no luck this time. I'll keep searching :)
Thanks,
Nuno
|
|
From: Nicholas N. <nj...@cs...> - 2008-03-23 22:47:04
|
On Sun, 23 Mar 2008, Nuno Lopes wrote:

>>> > Allow me just a last question: is it safe to replace the 't30 =
>>> > LDle:I32(t26)' statement with 't30 = 0x4:I32'? Well, in general I would
>>> > say it is safe, but I dunno about memory-mapped I/O nor if/how valgrind
>>> > handles it. Maybe this can be done only on certain architectures?
>>
>> Difficult question! I would comment first that it might be worth looking
>> at the original code to figure out why the compiler put a store and then
>> a load from the same location in the next insn. It might be that the
>> second instruction (0x80483DC) is a branch target, maybe the top of a
>> loop.
>
> I see that this doesn't work. If the pointer points to some allocated
> memory, another thread may free() it; the pointer will then point to freed
> memory, and thus subsequent reads/writes need to be flagged as errors.

A lesson in binary translation/instrumentation/optimisation: handling correct programs is much easier than handling bogus programs, for exactly this kind of reason!

Nick
|
|
From: Filipe C. <fi...@gm...> - 2008-03-29 01:28:43
|
Hi
I'm one of Nuno's colleagues working on binary code optimizations in
valgrind.

One of the micro-optimizations we were thinking of was to keep track of the
guest instruction pointer state through the VEX basic block translation, so
that, instead of doing a movq of a 32-bit immediate to a 64-bit memory
location, we would only do a movb of an 8-bit quantity (or a movw, with 16
bits). That way we could shave off a few bytes (many times over, since
PUT(OFFSET_amd64_RIP) is used quite a lot).

I was looking at valgrind's code, but the only way I am able to do it seems
to be to create another variable and do a store, which would imply a movq to
a register and then a movb from the register to the memory location (which
is what happens when valgrind sees a movb <imm8>, <r/m8> in a "real"
program).

Should I add another type of instruction so VEX can do some stores without
passing through a register? Is there another way that I have missed?
The test I used was this:
$ cat ../a.c
int main(int argc, char**argv)
{
int i = 0;
while (++i<1000000) {
++argc;
asm volatile (
"movb $0x42, 0x20(%%rbp)\n\t"
: /* output */
: "r"(argc) /* input */
: "%rax" /* clobbered */ );
}
return argc;
}
$ ./vg-in-place --tool=none --profile-flags=00000001 --trace-notbelow=0 ../a
The movb is converted to a Store, but the Store doesn't use the right
mov, so it will have to use two movs.
-- STle(t22) = 0x42:I8
movq $0x42,%vR104
movb %vR104,(%vR22)
Which gets assembled (by valgrind) as (the interesting part):
leaq 0x20(%rsi),%r8
4C 8D 46 20
movq $0x42,%r9
49 C7 C1 42 00 00 00
movb %r9,(%r8)
45 88 08
Which takes quite a bit more effort than the simple
movb $0x42,0x20(%rbp)
c6 45 20 42
Thanks for the help
- Filipe Cabecinhas
P.S.: I noticed that only 32 bits of the guest RIP are being saved. Could
that lead to a problem? x86-64 doesn't have a movq <imm64>, <r/m64>, so the
PUT(OFFSET_amd64_RIP) (at least) may have to be altered if this is really a
problem.
P.P.S.: I'm keeping track of the RIP in the env parameter that gets passed
to the iselStmt() function. Is there a better place? This isn't really
platform-independent (and there's a comment saying ISelEnv is
platform-independent), but the x86 version wouldn't be much different (and I
don't know enough PPC asm to pronounce myself on that subject).
|
|
From: Nuno L. <nun...@sa...> - 2008-04-16 19:18:19
|
> A very interesting project you could try would be to implement chaining
> for Vex -- the Valgrind paper (above) talks about this. (Section 2.3.6 of
> http://www.valgrind.org/docs/phd2004.pdf discusses the old implementation
> of chaining that was in pre-Vex Valgrind -- you can see the actual
> implementation in Valgrind 2.4.1 at
> http://www.valgrind.org/downloads/old.html.)

Hi,

Sorry for the delay, but we have been working on other school projects. I was thinking of going ahead and trying to implement block chaining. I have two questions (for now):

- How did valgrind handle the case when a block is removed from the cache and other blocks jump directly into it (i.e. blocks that were previously patched)? Did valgrind purge the whole cache? Did it scan the remaining blocks for references and "unpatch" those blocks?

- The other question is what changes need to be made to the blocks in order to implement block chaining. I assume I would need to add some prologue to blocks to check for e.g. thread time-slice end and signal handling stuff?

Thanks,
Nuno
|
|
From: Julian S. <js...@ac...> - 2008-04-17 15:44:02
|
> - How did valgrind handle the case when a block is removed from the cache
> and it has other blocks that jump directly into it (i.e. blocks that were
> previously patched)? Did valgrind purge the whole cache? Did it scan the
> remaining blocks for references and "unpatch" them?

Honestly .. I can't remember. But you ask the right questions at least;
this is the #1 hard problem with chaining. Read the 2.4 sources ...

One problem to be aware of when constantly chaining and unchaining
translations is that it can lead to very expensive Icache-vs-Dcache
interlocks/invalidations by the hardware, especially on Pentium 4, iirc.

> - The other question is: what changes need to be done to the blocks in
> order to implement block chaining? I assume I would need to add some
> prologue to blocks to check for e.g. thread time-slice end and signal
> handling?

Yes. I'm sure 2.4.0 has a timeslice-and-event check at the start of each
translation. Try it and see - again, this is beginning to be a long time
ago.

What else: for most blocks, nothing. For blocks translated with
self-modifying-code support (try --smc-check=all), there is also a check
at the start to ensure that the CRC32 of the guest code for the block
matches a CRC32 made at the time the block was translated. If not, the
block exits immediately, requesting a retranslation.

btw, for lots of background on code cache management, you might be
interested to read Kim Hazelwood's PhD thesis:
http://www.cs.virginia.edu/kim/docs/phd-thesis.pdf

J
|
|
From: Nuno L. <nun...@sa...> - 2008-04-17 23:40:03
|
Thanks for the answers!

>> - How did valgrind handle the case when a block is removed from the
>> cache and it has other blocks that jump directly into it (i.e. blocks
>> that were previously patched)? Did valgrind purge the whole cache? Did
>> it scan the remaining blocks for references and "unpatch" them?
>
> Honestly .. I can't remember. But you ask the right questions at least;
> this is the #1 hard problem with chaining. Read the 2.4 sources ...

Ok, so as far as I understood, the cache is divided into 8 sectors,
which are managed in a FIFO way. When the oldest sector is purged,
valgrind searches all the other sectors' blocks for patched jumps and
unpatches them. This can be a good solution if the cache is not purged
very often; otherwise the cost may be prohibitive (whereas adding a
little table to each superblock with back-pointers to referencing blocks
would perform better, or Josef's idea in the other e-mail as well). Do
you have any stats to confirm if that's the case? I dunno about the
perftests, but I assume they are so small they don't even fill up the
whole cache, right?

> btw, for lots of background on code cache management, you might be
> interested to read Kim Hazelwood's PhD thesis:
> http://www.cs.virginia.edu/kim/docs/phd-thesis.pdf

Out of curiosity: why did you change from FIFO to LRU in valgrind 3? I'm
only asking this because that thesis suggests that FIFO may be better.

Thanks,

Nuno
|
|
From: Josef W. <Jos...@gm...> - 2008-04-17 20:29:48
|
On Thursday 17 April 2008, Julian Seward wrote:
> > - How did valgrind handle the case when a block is removed from the
> > cache and it has other blocks that jump directly into it (i.e. blocks
> > that were previously patched)? Did valgrind purge the whole cache? Did
> > it scan the remaining blocks for references and "unpatch" them?
>
> Honestly .. I can't remember. But you ask the right questions at least;
> this is the #1 hard problem with chaining. Read the 2.4 sources ...

Hmm... Wasn't the code cache partitioned into 8 or 16 parts, and
filled/flushed in a FIFO-like manner?

If chaining is allowed only between blocks inside one part, and all code
of a part is always flushed at once, there is no scanning needed...

Josef
|
|
From: Julian S. <js...@ac...> - 2008-04-17 22:46:58
|
On Thursday 17 April 2008 22:29, Josef Weidendorfer wrote:
> On Thursday 17 April 2008, Julian Seward wrote:
> > > - How did valgrind handle the case when a block is removed from the
> > > cache and it has other blocks that jump directly into it (i.e. blocks
> > > that were previously patched)? Did valgrind purge the whole cache? Did
> > > it scan the remaining blocks for references and "unpatch" them?
> >
> > Honestly .. I can't remember. But you ask the right questions at least;
> > this is the #1 hard problem with chaining. Read the 2.4 sources ...
To elaborate:
1. This is not only the #1 hard problem; really it's the only hard
   problem.
2. I'm not claiming the 2.4 solution is a good one. For a start it's
   only x86 specific, and now we need something that works well for
   ppc too. See below.
> Hmm... Wasn't the code cache partitioned into 8 or 16 parts, and
> filled/flushed in a FIFO like manner?
Yes, 8 parts ("sectors").
> If chaining is allowed only between blocks inside of one part,
Reasonable, but ..
> and all code of a part is always flushed at once, there is no scanning
> needed...
.. this is more of a problem. No problem when throwing away the
complete contents of a sector. But suppose we need to invalidate
only a few translations, then what? This isn't very common on x86/amd64,
but is very common on ppc (ld.so does one invalidate for each symbol
resolved, which means tens of thousands of invalidates for starting
OOo or for a KDE app.)
J
|
|
From: Julian S. <js...@ac...> - 2008-04-17 23:13:59
|
On Friday 18 April 2008 00:37, Nuno Lopes wrote:
> Ok, so as far as I understood, the cache is divided into 8 sectors,
> which are managed in a FIFO way.

Yes.

> When the oldest sector is purged, valgrind searches all the other
> sectors' blocks for patched jumps and unpatches them.

I'm confused. Are you asking how the current code works, or asking for
comments on proposed changes?

At the moment (V 3.X) there is no chaining at all; all translations are
completely independent, and they do not change once they are made. So
right now, there is nothing that needs to be done when throwing away a
sector - we just forget it exists and start filling it up again from
empty. (Not completely true: we need to tell the host machine to flush
its icache - the "icbi" insn on ppc.)

> cost may be prohibitive (whereas adding a little table to each superblock
> with back-pointers to referencing blocks

I think that is the "standard" solution.

> that the case? I dunno about the perftests, but I assume they are so
> small they don't even fill up the whole cache, right?

The cache is huge, really, very large. It rarely fills up. To fill it
up completely requires (eg) starting OpenOffice, loading a .ppt
presentation and a .doc document, so as to force V to translate all the
Powerpoint import code and all the Word import code.

For meaningful testing/evaluation, you will need to make it much smaller,
by changing N_TTES_PER_SECTOR in m_transtab.c to some small prime number
(use auxprogs/primes.c to produce prime numbers).

> Out of curiosity: why did you change from FIFO to LRU in valgrind 3? I'm
> only asking this because that thesis suggests that FIFO may be better.

V 3 uses FIFO; we haven't used LRU in a very long time. LRU has the
disadvantage that you need to make some kind of mark/stamp every time
a translation is used, which reduces performance. FIFO avoids that.

J
|
|
From: Nuno L. <nun...@sa...> - 2008-04-18 16:02:36
|
>> When the oldest sector is purged, valgrind searches all the other
>> sectors' blocks for patched jumps and unpatches them.
>
> I'm confused.

I was just trying to describe the valgrind 2.4 behaviour as a sanity
check. I think I read the code correctly, but I wanted to confirm it.

>> cost may be prohibitive (whereas adding a little table to each superblock
>> with back-pointers to referencing blocks
>
> I think that is the "standard" solution.
>
>> that the case? I dunno about the perftests, but I assume they are so
>> small they don't even fill up the whole cache, right?
>
> The cache is huge, really, very large. It rarely fills up. To fill it
> up completely requires (eg) starting OpenOffice, loading a .ppt
> presentation and a .doc document, so as to force V to translate all the
> Powerpoint import code and all the Word import code.
>
> For meaningful testing/evaluation, you will need to make it much smaller,
> by changing N_TTES_PER_SECTOR in m_transtab.c to some small prime number
> (use auxprogs/primes.c to produce prime numbers).

So that means that the more or less "naive" solution of scanning all
blocks is not that bad, right? Well, at least it saves some memory and
all the bookkeeping.

>> Out of curiosity: why did you change from FIFO to LRU in valgrind 3? I'm
>> only asking this because that thesis suggests that FIFO may be better.
>
> V 3 uses FIFO; we haven't used LRU in a very long time. LRU has the
> disadvantage that you need to make some kind of mark/stamp every time
> a translation is used, which reduces performance. FIFO avoids that.

My bad. I probably read that LRU thing in an old paper.

Thanks,

Nuno
|
|
From: Nicholas N. <nj...@cs...> - 2008-04-20 03:06:17
|
On Wed, 16 Apr 2008, Nuno Lopes wrote:
> Sorry for the delay, but we have been working on other school projects.
> I was thinking of going ahead and trying to implement block chaining. I
> have two questions (for now):
> - How did valgrind handle the case when a block is removed from the
> cache and it has other blocks that jump directly into it (i.e. blocks
> that were previously patched)? Did valgrind purge the whole cache? Did
> it scan the remaining blocks for references and "unpatch" them?
> - The other question is: what changes need to be done to the blocks in
> order to implement block chaining? I assume I would need to add some
> prologue to blocks to check for e.g. thread time-slice end and signal
> handling?

Section 2.3.6 of my dissertation
(http://www.valgrind.org/docs/phd2004.pdf) has a brief description of
chaining, but it doesn't talk about flushing. The 2.4.0 code is the best
thing to look at for definitive answers...

Nick
|
|
From: Nuno L. <nun...@sa...> - 2008-04-20 12:07:19
|
> On Wed, 16 Apr 2008, Nuno Lopes wrote:
>
>> Sorry for the delay, but we have been working on other school projects.
>> I was thinking of going ahead and trying to implement block chaining. I
>> have two questions (for now):
>> - How did valgrind handle the case when a block is removed from the
>> cache and it has other blocks that jump directly into it (i.e. blocks
>> that were previously patched)? Did valgrind purge the whole cache? Did
>> it scan the remaining blocks for references and "unpatch" them?
>> - The other question is: what changes need to be done to the blocks in
>> order to implement block chaining? I assume I would need to add some
>> prologue to blocks to check for e.g. thread time-slice end and signal
>> handling?
>
> Section 2.3.6 of my dissertation
> (http://www.valgrind.org/docs/phd2004.pdf) has a brief description of
> chaining, but it doesn't talk about flushing. The 2.4.0 code is the best
> thing to look at for definitive answers...

Ok, thank you!

Nuno
|