|
From: <sv...@va...> - 2005-10-08 19:58:51
|
Author: sewardj
Date: 2005-10-08 20:58:48 +0100 (Sat, 08 Oct 2005)
New Revision: 1418
Log:
Handle the out-of-range shift cases for slw/srw in a different way
which creates less IR and fewer insns at the back end. Worth about 2%
running bzip2 -d with --tool=none.
Modified:
trunk/priv/guest-ppc32/toIR.c
Modified: trunk/priv/guest-ppc32/toIR.c
===================================================================
--- trunk/priv/guest-ppc32/toIR.c 2005-10-08 11:28:16 UTC (rev 1417)
+++ trunk/priv/guest-ppc32/toIR.c 2005-10-08 19:58:48 UTC (rev 1418)
@@ -3290,12 +3290,24 @@
case 0x018: // slw (Shift Left Word, PPC32 p505)
DIP("slw%s r%d,r%d,r%d\n", flag_Rc ? "." : "",
Ra_addr, Rs_addr, Rb_addr);
- assign( sh_amt, binop(Iop_And8, mkU8(0x1F),
- unop(Iop_32to8, mkexpr(Rb))) );
- assign( Rs_sh, binop(Iop_Shl32, mkexpr(Rs), mkexpr(sh_amt)) );
- assign( rb_b5, binop(Iop_And32, mkexpr(Rb), mkU32(1<<5)) );
- assign( Ra, IRExpr_Mux0X( unop(Iop_32to8, mkexpr(rb_b5)),
- mkexpr(Rs_sh), mkU32(0) ));
+ /* Ra = Rs << Rb */
+ /* ppc32 semantics are:
+ slw(x,y) = (x << (y & 31)) -- primary result
+ & ~((y << 26) >>s 31) -- make result 0
+ for y in 32 .. 63
+ */
+ assign(Ra,
+ binop(
+ Iop_And32,
+ binop( Iop_Shl32,
+ mkexpr(Rs),
+ unop( Iop_32to8,
+ binop(Iop_And32, mkexpr(Rb), mkU32(31)))),
+ unop( Iop_Not32,
+ binop( Iop_Sar32,
+ binop(Iop_Shl32, mkexpr(Rb), mkU8(26)),
+ mkU8(31))))
+ );
break;
 
case 0x318: // sraw (Shift Right Algebraic Word, PPC32 p506)
@@ -3338,12 +3350,24 @@
case 0x218: // srw (Shift Right Word, PPC32 p508)
DIP("srw%s r%d,r%d,r%d\n", flag_Rc ? "." : "",
Ra_addr, Rs_addr, Rb_addr);
- assign( sh_amt, binop(Iop_And8, mkU8(0x1F),
- unop(Iop_32to8, mkexpr(Rb))) );
- assign( Rs_sh, binop(Iop_Shr32, mkexpr(Rs), mkexpr(sh_amt)) );
- assign( rb_b5, binop(Iop_And32, mkexpr(Rb), mkU32(1<<5)) );
- assign( Ra, IRExpr_Mux0X( unop(Iop_32to8, mkexpr(rb_b5)),
- mkexpr(Rs_sh), mkU32(0) ));
+ /* Ra = Rs >>u Rb */
+ /* ppc32 semantics are:
+ srw(x,y) = (x >>u (y & 31)) -- primary result
+ & ~((y << 26) >>s 31) -- make result 0
+ for y in 32 .. 63
+ */
+ assign(Ra,
+ binop(
+ Iop_And32,
+ binop( Iop_Shr32,
+ mkexpr(Rs),
+ unop( Iop_32to8,
+ binop(Iop_And32, mkexpr(Rb), mkU32(31)))),
+ unop( Iop_Not32,
+ binop( Iop_Sar32,
+ binop(Iop_Shl32, mkexpr(Rb), mkU8(26)),
+ mkU8(31))))
+ );
break;
 
default:
|
|
From: Josef W. <Jos...@gm...> - 2005-10-08 20:45:12
|
On Saturday 08 October 2005 21:58, sv...@va... wrote:
> Author: sewardj
> Date: 2005-10-08 20:58:48 +0100 (Sat, 08 Oct 2005)
> New Revision: 1418
>
> Log:
> Handle the out-of-range shift cases for slw/srw in a different way
> which creates less IR and fewer insns at the back end. Worth about 2%
> running bzip2 -d with --tool=none.

Hi Julian,

how did you find out about optimizing this? Obviously there doesn't exist any profiling tool which allows annotation of code in anonymous mappings, i.e. code generated by a VM or Valgrind.

A while ago I had the idea to make Valgrind's translation cache persistent (backed by a file with mmap). This way, a profiling tool can annotate generated code (as the code relates to an existing file). Do you think this is possible? If so, it would be even better if the TC could optionally lose its "cache character" and simply grow (if VM space allows this). If we additionally generate debug info, it should be possible to relate the generated code back to the original client code.

Hmmm... this leads to a further question: could a persistent translation cache speed up Valgrind, by simply executing the "pre-translated" code chunks from an earlier run? In multiple runs, shared objects can be mapped to different addresses, so the translation cache should be separated by shared objects. But this would need an (object/offset) tuple as lookup key instead of a simple code address.

Just ideas ;-)

Josef
|
From: Julian S. <js...@ac...> - 2005-10-08 21:20:53
|
> > Handle the out-of-range shift cases for slw/srw in a different way
> > which creates less IR and fewer insns at the back end. Worth about 2%
> > running bzip2 -d with --tool=none.
>
> how did you find out about optimizing this?
The new JIT does continuous low-overhead profiling of the bbs being
executed, on all architectures. I simply ran
valgrind --tool=none --profile-flags=10001000 bzip2 -tvv bigfile.bz2
and then read the immensely detailed result. The 1000.. stuff is the
same as for --trace-flags. This shows the initial code and the IR after
instrumentation and optimisation, for the most popular 100 translations.
To profile V more generally you can now do self-hosting and use cachegrind
(or calltree presumably). We had some fun with that a couple of weekends
ago -- I managed to run Qt designer running on valgrind --tool=none running
on valgrind --tool=cachegrind.
Before you ask ..
(1) Check out 2 trees, "inner" and "outer". "inner" runs the app
directly and is what you will be profiling. "outer" does the
profiling.
(2) Configure inner with --enable-inner and build/install as
usual.
(3) Configure outer normally and build/install as usual.
(4) Choose a very simple program (date) and try
outer/.../bin/valgrind --weird-hacks=enable-outer \
--tool=cachegrind -v inner/.../bin/valgrind --tool=none -v prog
It's fragile, confusing and slow, but it does work well enough for
you to get some useful performance data.
> Hmmm... this leads to a further question: Could a persistant translation
> cache speed up Valgrind
Very likely. Nobody has ever tried it afaik though.
J
|
|
From: Nicholas N. <nj...@cs...> - 2005-10-08 23:50:05
|
On Sat, 8 Oct 2005, Josef Weidendorfer wrote:

> Hmmm... this leads to a further question: Could a persistent translation
> cache speed up Valgrind, by simply executing the "pre-translated" code
> chunks from an earlier run?
> In multiple runs, shared objects can be mapped to different addresses. So
> the translation cache should be separated by shared objects. But this
> would need an (object/offset) tuple as lookup key instead of a simple
> code address.

See this paper from the recent WBIA workshop:

http://rogue.colorado.edu/draco/papers/wbia05-persistence.pdf

It's quite complicated to get it to work.

Nick
|
From: Oswald B. <os...@kd...> - 2005-10-09 09:52:31
|
On Sat, Oct 08, 2005 at 06:49:52PM -0500, Nicholas Nethercote wrote:
> On Sat, 8 Oct 2005, Josef Weidendorfer wrote:
>
> > Hmmm... this leads to a further question: Could a persistent
> > translation cache speed up Valgrind, by simply executing the
> > "pre-translated" code chunks from an earlier run? In multiple runs,
> > shared objects can be mapped to different addresses. So the
> > translation cache should be separated by shared objects. But this
> > would need an (object/offset) tuple as lookup key instead of a
> > simple code address.
>
> See this paper from the recent WBIA workshop:
>
> http://rogue.colorado.edu/draco/papers/wbia05-persistence.pdf

i have the impression that these guys were told to produce six pages, so they did ... with the content for max four pages. oh, well. :)

a critical aspect: valgrind is a debugging tool. that means that a particular binary is usually executed exactly once. consequently a wholesale consistency check would drastically lower the effect of persistence (there would still be a gain for separately cached dynamic objects). so we need function-level granularity, based on function signatures.

however, static linking will most often yield different binary images for unmodified functions, due to offset changes stemming from unrelated code changes. i *think* such locations are easy to spot - they are either dynamic relocations of some type, or they are involved in PLT or GOT references, and debug info certainly provides the info directly. they would be made wildcards in the signatures, and their values would have to be propagated into the translated code.

fetching from the cache could be done at the function translation stage, but i guess the global view (function order) one has at object load time would help reduce the overall overhead.

an orthogonal issue: V could detect tight loops and optimize them further (anybody willing to incorporate gcc's optimizer into V? :). with persistent translations, expensive optimizations could be applied more liberally.

--
Hi! I'm a .signature virus! Copy me into your ~/.signature, please!
--
Chaos, panic, and disorder - my work here is done.
|
From: Josef W. <Jos...@gm...> - 2005-10-09 11:12:29
|
On Sunday 09 October 2005 11:52, Oswald Buddenhagen wrote:
> On Sat, Oct 08, 2005 at 06:49:52PM -0500, Nicholas Nethercote wrote:
> > On Sat, 8 Oct 2005, Josef Weidendorfer wrote:
> > > Hmmm... this leads to a further question: Could a persistent
> > > translation cache speed up Valgrind ... ?
>
> a critical aspect: valgrind is a debugging tool. that means that a
> particular binary is usually executed exactly once.

Performance was not the original motivation for my suggestion of persistence; it was only about improving the ability to profile Valgrind itself (either with OProfile or by self-hosting).

Why do you need function-level consistency checks? Checking the modification date of a shared object should be enough when separating TCs by objects. You usually do not have self-modifying code in read-only pages backed by files.

And I am not sure whether the issue about absolute addresses generated by the binary translation engine actually happens with VEX.

Josef
|
From: Oswald B. <os...@kd...> - 2005-10-09 11:26:09
|
On Sun, Oct 09, 2005 at 01:11:53PM +0200, Josef Weidendorfer wrote:
> Performance was not the original motivation for my suggestion of
> persistence; it was only about improving the ability to profile
> Valgrind itself (either with OProfile or by self-hosting).

i know. but as you stole another thread, i stole yours. ;-P

> Why do you need function-level consistency checks? Checking the
> modification date of a shared object should be enough when separating
> TCs by objects. You usually do not have self-modifying code in
> read-only pages backed by files.

you completely missed the point ...

> And I am not sure whether the issue about absolute addresses generated
> by the binary translation engine actually happens with VEX.

i'm pretty sure it does. for the real data the mapping is 1:1. the shadow memory currently has a pretty straightforward mapping as well. code references are "somehow" different ... :}

--
Hi! I'm a .signature virus! Copy me into your ~/.signature, please!
--
Chaos, panic, and disorder - my work here is done.
|
From: Julian S. <js...@ac...> - 2005-10-09 13:28:47
|
> a critical aspect: valgrind is a debugging tool. that means that a
> particular binary is usually executed exactly once. consequently a
> wholesale consistency check would drastically lower the effect of
> persistence

Most large apps are composed mainly of .so's, and I bet 99.9% of the code does not change from run to run.

> an orthogonal issue: V could detect tight loops and optimize them
> further (anybody willing to incorporate gcc's optimizer into V? :).

Hercules, Augean stables, etc.

It might not help much - for complex tools like cachegrind and memcheck, most of the time is spent in the helper functions called from generated code. The new JIT is better than the old at optimising and will even unroll single-basic-block loops itself.

J
|
From: Oswald B. <os...@kd...> - 2005-10-09 13:57:43
|
On Sun, Oct 09, 2005 at 12:12:50PM +0100, Julian Seward wrote:
> > a critical aspect: valgrind is a debugging tool. that means that a
> > particular binary is usually executed exactly once. consequently a
> > wholesale consistency check would drastically lower the effect of
> > persistence
>
> Most large apps are composed mainly of .so's

most ...

> and I bet 99.9% of the code does not change from run to run.

yes, and at dso granularity that means the program must have 1000 dsos to match up to that ratio if one source file is modified. but somehow i suspect that at this size the granularity would not really matter anyway ...

> > an orthogonal issue: V could detect tight loops and optimize them
> > further (anybody willing to incorporate gcc's optimizer into V? :).
>
> Hercules, Augean stables, etc.

ENEEDMOREINPUT

> It might not help much - for complex tools like cachegrind and
> memcheck, most of the time is spent in the helper functions called
> from generated code.

hmm, that's an argument. depends on the tool, of course.

> The new JIT is better than the old at optimising and will even unroll
> single-basic-block loops itself.

i heard rumors like that already. will have to look myself finally. :)

--
Hi! I'm a .signature virus! Copy me into your ~/.signature, please!
--
Chaos, panic, and disorder - my work here is done.
|
From: Nicholas N. <nj...@cs...> - 2005-10-09 16:03:21
|
On Sun, 9 Oct 2005, Oswald Buddenhagen wrote:

> > > an orthogonal issue: V could detect tight loops and optimize them
> > > further (anybody willing to incorporate gcc's optimizer into V? :).
> >
> > Hercules, Augean stables, etc.
>
> ENEEDMOREINPUT

Julian's making a point about the level of effort required. That point holds for this whole discussion; there are plenty of things we could do with Valgrind, but we have only a small number of developers, so we have to choose our projects carefully. I don't think making translations persistent passes a basic cost/benefit analysis.

Nick
|
From: Julian S. <js...@ac...> - 2005-10-09 19:00:29
|
> > > > further (anybody willing to incorporate gcc's optimizer into V? :).
> > >
> > > Hercules, Augean stables, etc.
> >
> > ENEEDMOREINPUT

See http://ancienthistory.about.com/library/bl/bl_herc_lab5.htm

J
|
From: Oswald B. <os...@kd...> - 2005-10-10 12:05:01
|
On Sun, Oct 09, 2005 at 08:01:50PM +0100, Julian Seward wrote:
> > > > > further (anybody willing to incorporate gcc's optimizer into V? :).
> > > >
> > > > Hercules, Augean stables, etc.
> > >
> > > ENEEDMOREINPUT
>
> See http://ancienthistory.about.com/library/bl/bl_herc_lab5.htm

does this mean that you'll do it if i promise to contribute a tenth of the time saved by the feature to a project of your choice? :)))

--
Hi! I'm a .signature virus! Copy me into your ~/.signature, please!
--
Chaos, panic, and disorder - my work here is done.