Re: [Valgrind-developers] killing INCEIP (and jumping for joy)

Hi.  Nice hacking.  I made some measurements of the new stuff.
The test programs are the simple loop program discussed in previous
mail (25.2 million basic blocks), and bzip2 compressing a 700k .ps file
(77 million bbs).  Also ktuberling, starting and exiting a silly
(if somewhat amusing) children's game on KDE.

                loop     bzip2      bzip2      bzip2   ktuberling
            nulgrind  nulgrind  addrcheck   memcheck    addrcheck

native          0.25      0.69       0.69       0.69      0.61
nobbchain       2.56      7.77      11.75      17.29      8.09
bbchain         2.23      6.08      10.11      15.68      7.58
chindir         2.18      6.17      10.14      16.02      7.38
fastjcc         2.22      5.14       9.17      14.98      7.38
synceip         1.60      4.48       8.88      14.69      6.61
ALL-chindir     1.59      4.46       8.91      14.81

native is native.  nobbchain is with none of the recent opts.  bbchain 
adds bbchaining.  chindir adds indirect bb chaining.  fastjcc adds 
fastjcc.  synceip adds synceip (ie is all opts so far).  ALL-chindir 
is everything except chindir; I am a bit suspicious of that one and
wanted to see if it was sometimes slowing things down.

Measurements made on a noisy PIII (ie, D was hacking C++ at the
same time), although I made runs when it was pretty much idle, and
the numbers are the best of >= 3 runs.  Nevertheless there is some
level of noise, so don't take the above too precisely.

ktuberling's gains are smaller than the rest because it spends a
lot of time translating.  It only runs for 32 million bbs but it
does translate about 940k of original code, which is a lot really.
Also spends considerable time reading full debug info from the
qt and kde .so's (I built them -O -g).  Of course once it gets
going, I expect speed gains similar to the rest.

Just tried running konq on my 1.13 GHz P3 with full opts on addrcheck
and it's surprisingly usable.  Great!

Some points to note

- bbchain is always a win.  I'll move it into the head once I get
  a good LRU story figured out.

- fastjcc is probably always no effect or a win.  It is no effect
  in "loop" because that jumps back to the loop start with a case
  which isn't covered by fastjcc, unfortunately.  I was wondering
  how difficult it would be to cover the L/NGE, NL/GE, LE/NG and NLE/G
  cases -- exprs of the form ((SF xor OF) or ZF) == 1 or 0.  I can't
  think of a neat way to do xor of two bits alas, and your implementation
  is neat indeed.  Even if those cases took (eg) 4 insns instead of 1,
  it would probably be better than the 10+ cycle loss of popfl.

  For all real progs I expect it is a big win.

  I'll move this into the head too.  It is always beneficial, and has
  only minor and localised complexity.
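
  For what it's worth, here is one possible way to get the xor of SF
  and OF without popfl, written as plain C rather than as generated
  x86.  It is only a sketch: the bit positions are the standard x86
  %eflags ones, but the function names are made up for illustration
  and this is not how fastjcc currently does anything.  The idea is to
  shift SF (bit 7) up to OF's position (bit 11) and xor the word with
  itself.

```c
#include <assert.h>
#include <stdint.h>

/* Standard x86 %eflags bit masks. */
#define EFLAGS_ZF (1u << 6)
#define EFLAGS_SF (1u << 7)
#define EFLAGS_OF (1u << 11)

/* SF xor OF: shifting left by 4 moves SF (bit 7) onto bit 11, where
   OF lives, so bit 11 of (f ^ (f << 4)) is SF xor OF. */
static inline unsigned sf_xor_of(uint32_t f)
{
    return ((f ^ (f << 4)) >> 11) & 1u;
}

/* Hypothetical helper names, one per condition-code pair. */
static inline unsigned cond_l(uint32_t f)   /* L/NGE:  SF != OF          */
{ return sf_xor_of(f); }

static inline unsigned cond_nl(uint32_t f)  /* NL/GE:  SF == OF          */
{ return sf_xor_of(f) ^ 1u; }

static inline unsigned cond_le(uint32_t f)  /* LE/NG:  ZF or (SF != OF)  */
{ return sf_xor_of(f) | ((f >> 6) & 1u); }

static inline unsigned cond_nle(uint32_t f) /* NLE/G:  !ZF and SF == OF  */
{ return cond_le(f) ^ 1u; }

/* Small sanity check of the xor trick on the four SF/OF combinations. */
static inline void fastjcc_self_check(void)
{
    assert(sf_xor_of(0) == 0);
    assert(sf_xor_of(EFLAGS_SF) == 1);
    assert(sf_xor_of(EFLAGS_OF) == 1);
    assert(sf_xor_of(EFLAGS_SF | EFLAGS_OF) == 0);
}
```

  In machine code that is roughly a copy, a shift, an xor and a test,
  so it is in the "4 insns instead of 1" ballpark rather than free.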

- chindir looks suspiciously like it slows some things down, although
  I couldn't convince myself either way, even with the ALL-chindir
  measurements.  Maybe it's just measurement noise.

  The comment in vg_dispatch.S is good, but I still am a bit unclear
  as to the precise behaviour of the prediction mechanism.  My impression
  is that after two consecutive jumps to the same target, the translation
  is patched with a compare-and-jumpdirectly-or-go-via-lookup piece
  of code.  Also AIUI, there is no way to undo the patching and
  commit to some other target later, should the patched code start
  to consistently mispredict.

  Is my understanding correct?  If so doesn't it potentially generate
  permanent mispredictions for returns from any function called from
  many places, or for unpredictable switch statements?  Is there a 
  way to adjust this mechanism so (like all good prediction mechanisms)
  it eventually forgets about ancient history, so it can track changes
  in the current environment?

  I'd like to see a program where it gives a clear gain ... do you
  have one?
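
  To make the "forgets ancient history" idea concrete, here is a
  small C model of what I have in mind.  It is not the code in
  vg_dispatch.S and does no patching; JumpSite, MISS_LIMIT and
  jump_site_step are names invented for this sketch, and the
  threshold of 2 is just taken from my impression of the current
  behaviour.  Each indirect jump site keeps its predicted target
  plus a candidate; once the same wrong target has been seen
  MISS_LIMIT times in a row, the prediction is replaced.

```c
#include <assert.h>
#include <stdint.h>

#define MISS_LIMIT 2u   /* assumed threshold, not taken from the real code */

typedef struct {
    uintptr_t predicted;  /* target currently "patched in"; 0 means none */
    uintptr_t candidate;  /* most recent target that missed              */
    unsigned  misses;     /* consecutive misses that went to candidate   */
} JumpSite;

/* Model one execution of the jump.  Returns 1 if the prediction hit
   (direct jump), 0 if we had to go via the slow lookup.  Address 0 is
   assumed never to be a real jump target. */
static inline int jump_site_step(JumpSite *s, uintptr_t actual)
{
    if (s->predicted != 0 && actual == s->predicted) {
        s->misses = 0;            /* a hit breaks any run of misses */
        return 1;
    }
    if (actual == s->candidate) {
        s->misses++;
    } else {
        s->candidate = actual;    /* new wrong target: start counting again */
        s->misses = 1;
    }
    if (s->misses >= MISS_LIMIT) {
        s->predicted = actual;    /* "re-patch": commit to the new target */
        s->candidate = 0;
        s->misses = 0;
    }
    return 0;
}

/* Sanity check: a site learns A, then re-learns B after two misses. */
static inline void jump_site_self_check(void)
{
    JumpSite s = { 0, 0, 0 };
    assert(jump_site_step(&s, 0x1000) == 0);
    assert(jump_site_step(&s, 0x1000) == 0);
    assert(jump_site_step(&s, 0x1000) == 1);
    assert(jump_site_step(&s, 0x2000) == 0);
    assert(jump_site_step(&s, 0x2000) == 0);
    assert(jump_site_step(&s, 0x2000) == 1);
}
```

  With this scheme a return site that alternates between two callers
  keeps its first prediction (every hit resets the count), whereas one
  that moves to a new caller for good gets re-pointed after two misses.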

- SYNCEIP is a good idea.  Certainly I'll incorporate something like
  that, although I'm not sure of the final shape of it.  Two issues:
  (1) precise exceptions.  SYNCEIP doesn't give that as it stands.
  If a memory load/store should segfault and we wind up in the signal
  handler, we do not have the precise %EIP to hand at that point
  because there is no SYNCEIP before the LD/ST uinstr.  That can 
  cause problems in some obscure, if POSIXly-illegal, sighandling 
  cases.

  (2) not sure how SYNCEIP would interact with proposed lazy eflags
  save/restore. 

  Generally, should we stick with INCEIP+SYNCEIP, or have just SETEIP,
  or what?  And how do we establish exactly where to insert EIP updates?
  Should the skin itself insert them (as per SYNCEIP)?  Or should there
  be a redundant SETEIP-removal pass done by the core, which asks uinstr-
  adding skins whether a uinstr could need to know EIP?  How should we
  handle EIP updates needed by the core itself, specifically if we want
  to supply precise exceptions?  [probably disabled by default, btw]
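
  As a strawman for the "redundant SETEIP-removal pass" option, here
  is a sketch in C.  Everything in it is hypothetical: the UInstr
  struct, the opcode names and the needs_eip callback are simplified
  stand-ins, not the real core/skin interface.  It assumes a SETEIP
  has been emitted at every original instruction boundary, and that
  EIP need not be up to date at the end of the block because the
  final jump sets it anyway.  A SETEIP is kept only if some uinstr
  that claims to need EIP appears before the next SETEIP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified uinstr representation. */
typedef enum { OP_SETEIP, OP_LOAD, OP_STORE, OP_ALU, OP_CCALL } Opcode;

typedef struct {
    Opcode   op;
    unsigned eip;   /* only meaningful for OP_SETEIP */
} UInstr;

/* Asked of the core and/or skin: could this uinstr need a precise EIP? */
typedef int (*NeedsEipFn)(const UInstr *u);

/* Example policy for precise exceptions: memory accesses may fault. */
static inline int memory_ops_need_eip(const UInstr *u)
{
    return u->op == OP_LOAD || u->op == OP_STORE;
}

/* Compacts code[0..n) in place and returns the new length.  The most
   recent SETEIP is held back as "pending" (copied by value, since its
   slot may be overwritten) and emitted only just before the first
   uinstr that needs EIP. */
static inline size_t remove_redundant_seteip(UInstr *code, size_t n,
                                             NeedsEipFn needs_eip)
{
    size_t w = 0;
    int have_pending = 0;
    UInstr pending = { OP_SETEIP, 0 };

    for (size_t r = 0; r < n; r++) {
        UInstr u = code[r];
        if (u.op == OP_SETEIP) {
            pending = u;          /* supersedes any earlier pending one */
            have_pending = 1;
            continue;
        }
        if (have_pending && needs_eip(&u)) {
            code[w++] = pending;  /* w < r here, so code[r] is untouched */
            have_pending = 0;
        }
        code[w++] = u;
    }
    return w;
}

/* Sanity check: a SETEIP followed only by ALU ops is dropped. */
static inline void seteip_self_check(void)
{
    UInstr code[] = { { OP_SETEIP, 0x10 }, { OP_ALU, 0 } };
    size_t n = remove_redundant_seteip(code, 2, memory_ops_need_eip);
    assert(n == 1);
    assert(code[0].op == OP_ALU);
}
```

  With a predicate that always returns 0 every SETEIP disappears,
  which is presumably what you would want with precise exceptions
  switched off and a skin that never looks at EIP.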

------------

I think it will soon be time to "pull over" and consolidate what we've
got (which is some nice speedups), since:

1.  I'd like to get this thing out the door sometime this century :)

2.  Nick is disappearing from active hacking in about a week, really

3.  We're getting borkage (as is expected from change).  I was surfing
    just now with konqueror on addrcheck on all of Jeremy's opts, and it
    crapped out (exited unexpectedly, but cleanly) for no apparent reason,
    several times in a row.  Natively it's ok; on 1.0.X it's ok.  
    (un?)Fortunately it also craps out when running on the cvs head, so 
    we've got bogons somewhere.

J