|
From: Jeremy F. <je...@go...> - 2002-12-05 09:16:14
|
I found the last bug in lazy-eflags, which was preventing most code from
working under memcheck. It turned out it was Tag_Left[124]'s habit of
using %ebp as a temp if there were no other dead registers. It was
crashing when it tried to use %ebp while saving the flags value during
this sequence. I changed it to push/pop some other register, thereby
keeping %ebp unchanged.
Performance is pretty good. --skin=none is now under 10 times slower
than native (for my gcc3 benchmark), and memcheck is another factor of 5
slower (addrcheck is 3-4 times slower, which is worse than I would have
expected).
The patch (62-lazy-eflags) keeps track of the flags in one of three
states: UPD_Simd (the baseblock is up to date), UPD_Real (the CPU's
%eflags is up to date) and UPD_Both (both are current). UPD_Both isn't
terribly useful, because the only instructions which read flags but
don't set them are SETcc and Jcc, and Jcc is always at the end of a
basic block anyway (and SETcc isn't that common).
Even if an instruction doesn't use any flags, if it doesn't set all the
flags it must read the flag state so that no unexpected flag state leaks
into the emulated state. The big problem here is the D flag, which is
stored in the eflags register, but is functionally completely different
from the status registers.
Patch 62 depends on 61-special-d, which factors out the D flag from the
rest, meaning the live state of D is not in eflags. Almost all
arithmetic instructions overwrite all the remaining flags, meaning that
mostly a flags fetch is not needed for instructions which don't
explicitly use flags input. The only common instructions which don't
affect all flags are INC and DEC, and they aren't terribly common as the
first flags-using instruction in a basic block.
I'm thinking a small improvement might be gained by making the CPUs
eflags register the default home between basic blocks. The dispatch
loop can be responsible for saving it in the base block when dropping
out of execution, and chaining means that the dispatch loop isn't hit
all that much. (Hm, just noticed a bug which was sometimes failing to
fetch eflags for instructions which do a partial flags update. Two
interesting points: it didn't seem to affect any programs I tried, but
more interestingly, when fixed it slowed things down a fair bit, which
makes me think long-life eflags might help more than I thought.)
I suspect it would only really help with --skin=none; when there's a
real instrumenting skin in place, the flags will be saved out pretty
regularly anyway, and most flags uses are Jcc, which, thanks to
fast-jcc, mostly don't need the flags to be present in the CPU to work
well.
J
|
|
From: Julian S. <js...@ac...> - 2002-12-06 01:08:10
|
Hi. That's great! I just tried 01- and 61- and 62- and things run faster. bzip2, xedit, kate work. However OO 1.0.1 no longer starts up. You know when you start it, there is a splash screen, which sits above all other windows. If I now run it with --skin=none --trace-children=yes, the same patch of screen behaves as if it is the splash screen, except that the picture of the bluish sky with the stylised birds does not appear -- that rectangle of screen contains whatever was there before. Does that make any sense? Can you repro it? J |
|
From: Jeremy F. <je...@go...> - 2002-12-06 02:26:18
|
On Thu, 2002-12-05 at 17:15, Julian Seward wrote:
> However OO 1.0.1 no longer starts up. You know when you start it, there
> is a splash screen, which sits above all other windows. If I now run it
> with --skin=none --trace-children=yes, the same patch of screen behaves
> as if it is the splash screen, except that the picture of the bluish sky
> with the stylised birds does not appear -- that rectangle of screen contains
> whatever was there before.
>
> Does that make any sense? Can you repro it?
Hm, I don't even get the splash screen (but then I don't normally; maybe
I turned it off at one point). Looking into it. I have some suspicious
around flags and FP instructions.
J
|
|
From: Jeremy F. <je...@go...> - 2002-12-06 19:19:06
|
On Thu, 2002-12-05 at 17:15, Julian Seward wrote:
> I just tried 01- and 61- and 62- and things run faster. bzip2,
> xedit, kate work.
>
> However OO 1.0.1 no longer starts up. You know when you start it, there
> is a splash screen, which sits above all other windows. If I now run it
> with --skin=none --trace-children=yes, the same patch of screen behaves
> as if it is the splash screen, except that the picture of the bluish sky
> with the stylised birds does not appear -- that rectangle of screen contains
> whatever was there before.
>
> Does that make any sense? Can you repro it?
Well, I fiddled about a bit, and the symptom has gone away, without any
obvious changes on my part. The only significant thing I remember doing
was fixing a bug in 61-special-d which flipped the state of D over a
context switch (ie, when the dispatch counter dropped to 0). But when I
could reproduce it, it was only with 62 in place - it worked with the
buggy 61. Interestingly, I can still repro it with 63-chained-indirect
in place, even though that works for everything else (and when that
patch has bugs, they're rarely subtle).
So I'm somewhat confused. I wonder if its a timing/race problem in OO
itself, which we're sometimes hitting and sometimes not?
Anyway, tell me what you get with the current versions of 61 and 62.
J
|
|
From: Julian S. <js...@ac...> - 2002-12-06 20:19:25
|
Hmm, I tried the following, of course linked with libpthread and it works
perfectly. so no easy leads there :-(
J
#include <sys/poll.h>
#include <stdio.h>
int main ( void )
{
int fd;
struct pollfd tab[1];
fd = fileno(stdin);
tab[0].fd = fd;
tab[0].events = POLLIN;
printf("poll begins ... press return\n" );
poll( tab, 1, -1 );
printf("done\n");
return 0;
}
|
|
From: Julian S. <js...@ac...> - 2002-12-06 20:05:10
|
> Anyway, tell me what you get with the current versions of 61 and 62.
No improvement with OO.
I tried mozilla. It also won't start up, the simulated machine falling
into an endless sequence of poll() calls seperated by nanosleep(13
milliseconds), which afaics is the nonblocking poll() in vg_libpthread.c.
Trying OO with tracing on indicates it spins in the same place.
Hmm. This is very odd. I'm wondering if there is some problem with the
non-D flags (OSZACP) causing "if (res > 0) {" at line 2636 never to
get into the then-clause. Except that if there was such a problem,
most programs wouldn't work (I'd guess).
Opera works, so it's not threaded programs per se. Opera doesn't use
the nonblocking poll, tho.
J
|
|
From: Julian S. <js...@ac...> - 2002-12-06 22:29:44
|
Hi. I think I've found an anomaly. Dunno if its the problem, might be tho. With mozilla spinning and not proceeding, a bit of prodding about shows it is spinning on these original insns: -- ORIGINAL CODE 0x4017263c <js_DoubleToECMAInt32+80>: fprem 0x4017263e <js_DoubleToECMAInt32+82>: fnstsw %ax 0x40172640 <js_DoubleToECMAInt32+84>: sahf 0x40172641 <js_DoubleToECMAInt32+85>: jp 0x4017263c The translation is pretty dismal, due to calling helpers for both fnstsw and sahf, but "that's not important right now" :) The problem is the call the to latter's helper ... and specifically the add $0x4,%esp to clear the args off the stack. This trashes the live %eflags -- push %AH 1: x/i $eip 0x42377bb6: mov 0x0(%ebp),%ebx 1: x/i $eip 0x42377bb9: mov $0xff00,%ecx 1: x/i $eip 0x42377bbe: and %ecx,%ebx 1: x/i $eip 0x42377bc0: push %ebx -- GET EFLAGS 1: x/i $eip 0x42377bc1: pushl 0x20(%ebp) 1: x/i $eip 0x42377bc4: popf -- call helper 1: x/i $eip 0x42377bc5: call *0x17c(%ebp) 1: x/i $eip 0x4005cc84 <vgPlain_helper_SAHF>: push %eax 1: x/i $eip 0x4005cc85 <vgPlain_helper_SAHF+1>: mov 0x8(%esp,1),%eax 1: x/i $eip 0x4005cc89 <vgPlain_helper_SAHF+5>: sahf 1: x/i $eip 0x4005cc8a <vgPlain_helper_SAHF+6>: pop %eax 1: x/i $eip 0x4005cc8b <vgPlain_helper_SAHF+7>: ret -- %eflags is now live -- oops! 1: x/i $eip 0x42377bcb: add $0x4,%esp -- move EIP 1: x/i $eip 0x42377bce: movb $0x41,0x24(%ebp) -- PUT (polluted) eflags 1: x/i $eip 0x42377bd2: pushf 1: x/i $eip 0x42377bd3: popl 0x20(%ebp) -- jump (on result of %esp-adjust :) 1: x/i $eip 0x42377bd6: jnp 0x42377be5 Assuming this analysis is correct ... there's no convenient way to clear dead args off the real stack, unless we find a dead reg to dump it in. Umm, actually that's nonsense. Imagine we have a baseBlock slot purely for the purpose of receiving deal values, then we could do popl VGOFF_(dummySlot)(%ebp) Just occasionally, CISC is great! What do you think? Is the analysis correct? I might check of all places where %esp is involved in an arithmetic op, since those are all potential trashers. J -------------------------------------------------------------------------- The complete translation, in case you should want it, is ... -- preamble 1: x/i $eip 0x42377b84: decl 0x400bb72c 1: x/i $eip 0x42377b8a: jne 0x42377b92 -- GET fpustate 1: x/i $eip 0x42377b92: frstor 0x8c(%ebp) -- fprem 1: x/i $eip 0x42377b98: fprem -- advance EIP 1: x/i $eip 0x42377b9a: movb $0x3e,0x24(%ebp) -- push $0 (why?) 1: x/i $eip 0x42377b9e: xor %eax,%eax 1: x/i $eip 0x42377ba0: push %eax -- PUT fpustate 1: x/i $eip 0x42377ba1: fnsave 0x8c(%ebp) 1: x/i $eip 0x42377ba7: call *0x178(%ebp) 1: x/i $eip 0x4005cc6d <vgPlain_helper_fstsw_AX>: push %eax 1: x/i $eip 0x4005cc6e <vgPlain_helper_fstsw_AX+1>: push %esi 1: x/i $eip 0x4005cc6f <vgPlain_helper_fstsw_AX+2>: mov 0x400a08e0,%esi 1: x/i $eip 0x4005cc75 <vgPlain_helper_fstsw_AX+8>: frstor 0x0(%ebp,%esi,4) 1: x/i $eip 0x4005cc79 <vgPlain_helper_fstsw_AX+12>: fstsw %ax 1: x/i $eip 0x4005cc7a <vgPlain_helper_fstsw_AX+13>: fnstsw %ax 1: x/i $eip 0x4005cc7c <vgPlain_helper_fstsw_AX+15>: pop %esi 1: x/i $eip 0x4005cc7d <vgPlain_helper_fstsw_AX+16>: mov %ax,0x8(%esp,1) 1: x/i $eip 0x4005cc82 <vgPlain_helper_fstsw_AX+21>: pop %eax 1: x/i $eip 0x4005cc83 <vgPlain_helper_fstsw_AX+22>: ret 1: x/i $eip 0x42377bad: pop %eax -- %ax holds FPU status word (simd) -- PUT %AX 1: x/i $eip 0x42377bae: mov %ax,0x0(%ebp) -- advance %EIP 1: x/i $eip 0x42377bb2: movb $0x40,0x24(%ebp) -- push %AH 1: x/i $eip 0x42377bb6: mov 0x0(%ebp),%ebx 1: x/i $eip 0x42377bb9: mov $0xff00,%ecx 1: x/i $eip 0x42377bbe: and %ecx,%ebx 1: x/i $eip 0x42377bc0: push %ebx -- GET EFLAGS 1: x/i $eip 0x42377bc1: pushl 0x20(%ebp) 1: x/i $eip 0x42377bc4: popf -- call helper 1: x/i $eip 0x42377bc5: call *0x17c(%ebp) 1: x/i $eip 0x4005cc84 <vgPlain_helper_SAHF>: push %eax 1: x/i $eip 0x4005cc85 <vgPlain_helper_SAHF+1>: mov 0x8(%esp,1),%eax 1: x/i $eip 0x4005cc89 <vgPlain_helper_SAHF+5>: sahf 1: x/i $eip 0x4005cc8a <vgPlain_helper_SAHF+6>: pop %eax 1: x/i $eip 0x4005cc8b <vgPlain_helper_SAHF+7>: ret -- %eflags is now live 1: x/i $eip 0x42377bcb: add $0x4,%esp 1: x/i $eip 0x42377bce: movb $0x41,0x24(%ebp) 1: x/i $eip 0x42377bd2: pushf 1: x/i $eip 0x42377bd3: popl 0x20(%ebp) 1: x/i $eip 0x42377bd6: jnp 0x42377be5 1: x/i $eip 0x42377bd8: mov $0x4017263c,%eax 1: x/i $eip 0x42377bdd: mov %eax,0x24(%ebp) 1: x/i $eip 0x42377be0: jmp 0x42377b84 |
|
From: Jeremy F. <je...@go...> - 2002-12-06 23:27:24
|
On Fri, 2002-12-06 at 14:37, Julian Seward wrote:
> The translation is pretty dismal, due to calling helpers for both fnstsw and
> sahf, but "that's not important right now" :) The problem is the call the to
> latter's helper ... and specifically the add $0x4,%esp to clear the args off
> the stack. This trashes the live %eflags
Ah, yes, that's where I've seen moz spin. That's why I was suspecting
some badness in FP+flags interaction.
> Assuming this analysis is correct ... there's no convenient way to clear
> dead args off the real stack, unless we find a dead reg to dump it in.
Ewww, nasty.
> Umm, actually that's nonsense. Imagine we have a baseBlock slot purely
> for the purpose of receiving deal values, then we could do
>
> popl VGOFF_(dummySlot)(%ebp)
>
> Just occasionally, CISC is great!
>
> What do you think? Is the analysis correct?
Yes, that looks likely. The quick fix is to change the
VG_(new_emit)(True, ...) to False in emit_add_lit_to_esp, since that add
is not operating on Simd state. That will make it generate flag save
before trashing them. I agree the nicer solution is to fix the helper
calling convention to not trash the flags. How about changing the
convention to just use a real register for the value? Unfortunately for
helpers like CPUID, the stack does seem like the nicest way of doing the
passing (unless you want to allocate an array of slots in the bas
block).
It's a pity that SAHF/LAHF doesn't do the O flag; otherwise they'd be
ideal for flags saving/restoring - the P3 optim guide says they're 1
uop.
J
|
|
From: Jeremy F. <je...@go...> - 2002-12-06 22:41:26
|
On Fri, 2002-12-06 at 12:12, Julian Seward wrote:
> > Anyway, tell me what you get with the current versions of 61 and 62.
>
> No improvement with OO.
>
> I tried mozilla. It also won't start up, the simulated machine falling
> into an endless sequence of poll() calls seperated by nanosleep(13
> milliseconds), which afaics is the nonblocking poll() in vg_libpthread.c.
>
> Trying OO with tracing on indicates it spins in the same place.
>
> Hmm. This is very odd. I'm wondering if there is some problem with the
> non-D flags (OSZACP) causing "if (res > 0) {" at line 2636 never to
> get into the then-clause. Except that if there was such a problem,
> most programs wouldn't work (I'd guess).
I think that's a red herring. In OO's case, thread 3 is polling on IO
(I presume to the X server), thread 2 is blocked in a condvar_wait, and
thread 1 is spinning CPU-bound. I'm guessing that thread 2 is waiting
for either thread 1 or 3 to do something, and thread 1 is expected to do
something but isn't.
However, if I use "--trace-codegen=10001 --trace-signals=yes" and pipe
that into a "tail -100000" (my disk not being big enough to fit a
complete codegen trace) and then wait for it to stabilize (stop codegen
for new BBs, observed by looking at strace of the process), then it
tends to actually work. So that's no use.
Mozilla is similar. Sometimes it works, and sometimes it doesn't.
Sometimes it works, but takes a really long time. In particular, it
normally takes 30 user CPU seconds for moz to appear on my laptop with
--skin=none. Sometimes it never appears, and just seems to burn CPU.
Other times, it appears after 60 or more (wall-clock seconds), but after
still only using 30 CPU seconds (with nothing else using the CPU).
If you try just 61, do you see the same problem?
It does seem very fragile. I just added another patch to V, which
should be completely benign, but it changes the behaviour.
J
|
|
From: Julian S. <js...@ac...> - 2002-12-06 23:04:02
|
> > Hmm. This is very odd. I'm wondering if there is some problem with the
> > non-D flags (OSZACP) causing "if (res > 0) {" at line 2636 never to
> > get into the then-clause. Except that if there was such a problem,
> > most programs wouldn't work (I'd guess).
>
> I think that's a red herring.
I agree.
Re my analysis of stack-clearing add, I can't be exactly right, since
VG_(emit_add_lit_to_esp) begins with the correct call to new_emit.
So perhaps the analysis didn't think that we were in a UPD_Real state
at the point of the call to vgPlain_helper_SAHF. Except that as soon as
it has CLEARed the stack, it then acts to get into UPD_Simd/Both in
preparation for the conditional jump:
1: x/i $eip 0x42377bd2: pushf
1: x/i $eip 0x42377bd3: popl 0x20(%ebp)
Odd.
> In OO's case, thread 3 is polling on IO
> (I presume to the X server), thread 2 is blocked in a condvar_wait, and
> thread 1 is spinning CPU-bound. I'm guessing that thread 2 is waiting
> for either thread 1 or 3 to do something, and thread 1 is expected to do
> something but isn't.
Yes, so that fits together; at least we have a plausible explaination of
why it was spinning, if the moz problem also afflicts OO.
J
|
|
From: Julian S. <js...@ac...> - 2002-12-06 23:12:41
|
Apologies for mailbombing you even more. Quick prod with GDB shows OO is also spinning in a fstsw_AX .. SAHF ..conditional jump loop. J |
|
From: Jeremy F. <je...@go...> - 2002-12-06 23:33:17
|
On Fri, 2002-12-06 at 15:11, Julian Seward wrote:
> Re my analysis of stack-clearing add, I can't be exactly right, since
> VG_(emit_add_lit_to_esp) begins with the correct call to new_emit.
No, it isn't correct - True means "operate on Simd flags"; False means
"non-simd flags".
Fixing this seems to work, and even has a slight performance improvement
(OO starts up in 47 rather than 48 seconds). At very least it seems
performance neutral.
J
|
|
From: Julian S. <js...@ac...> - 2002-12-07 00:28:14
|
On Friday 06 December 2002 11:33 pm, Jeremy Fitzhardinge wrote: > On Fri, 2002-12-06 at 15:11, Julian Seward wrote: > > Re my analysis of stack-clearing add, I can't be exactly right, since > > VG_(emit_add_lit_to_esp) begins with the correct call to new_emit. > > No, it isn't correct - True means "operate on Simd flags"; False means > "non-simd flags". > > Fixing this seems to work, and even has a slight performance improvement > (OO starts up in 47 rather than 48 seconds). At very least it seems > performance neutral. Well, changing that True to False makes OO and moz work fine for me, which is great. So I guess that's pretty much the end of the matter? It seems like the right fix to me; does it also to you? -------- Now that this looks like this is going to work, is it worth keeping the fancy test stuff you did in synth_jcond_lit ? I ask because you have a better understanding of the ramifications of this latest eflags stuff, and so can probably make a better decision. If it's worth keeping ... I did actually write code to also do the 4 missing cases: (SF xor OF) == 0 or 1, ((SF xor OF) or ZF) == 0 or 1, and it did seem to help sometimes. But that was before your latest patch. If it's not worth keeping ... let's nuke it. ---------------- I've been wondering a bit about the dismal performance on P4s. One thing that occurred to me is that the preamble sequence "decl bbs_to_go; jnz ..." is going to hit the P4's partial-flag-write penalty (page 2-55 of the p4 optimisation guide, "Use of the inc and dec instructions"). It'd be interesting try changing it to a subl $1, ... as they recommend. Or perhaps not ... according to their tables the latency difference is only 0.5 cycle. J |
|
From: Jeremy F. <je...@go...> - 2002-12-07 02:32:11
|
On Fri, 2002-12-06 at 16:35, Julian Seward wrote:
> So I guess that's pretty much the end of the matter? It seems like the
> right fix to me; does it also to you?
Yes, that was the bug. It explains why it was so non-deterministic too
- it depended on the number of 1 bits in %esp.
It would be nice to change the helper calling convention into something
non-flag smashing.
> Now that this looks like this is going to work, is it worth keeping the
> fancy test stuff you did in synth_jcond_lit ? I ask because you have a
> better understanding of the ramifications of this latest eflags stuff,
> and so can probably make a better decision.
>
> If it's worth keeping ... I did actually write code to also do the
> 4 missing cases: (SF xor OF) == 0 or 1, ((SF xor OF) or ZF) == 0 or 1,
> and it did seem to help sometimes. But that was before your latest patch.
It doesn't make much difference for --skin=none, but it probably does
for skins which instrument. memcheck, for example, always trashes the
flags just before a conditional jump, so the fast-jcc path gets used
often. On the other hand, it may get lost in the overhead which the
instrumenting skins cause anyway (but that's probably the next focus for
performance improvement).
> I've been wondering a bit about the dismal performance on P4s. One thing
> that occurred to me is that the preamble sequence "decl bbs_to_go; jnz ..."
> is going to hit the P4's partial-flag-write penalty (page 2-55 of the p4
> optimisation guide, "Use of the inc and dec instructions"). It'd be
> interesting try changing it to a subl $1, ... as they recommend. Or perhaps
> not ... according to their tables the latency difference is only 0.5 cycle.
Well, I was thinking about putting something in to tune the instruction
selection depending on the real CPU. It might be worth it (but I
suspect not in this case).
I still think it has more to do with trashing the trace cache rather
than any of the minor instruction selection issues. I'm doing an
experimental patch which does speculative translation (ie, is a BB's
final instruction is a lit32 jump, then translate the jump target too)
to see if that helps. It should strike a balance between killing the
cache with every translation and the over-translation trace caching
would cause.
I still haven't had a chance to do any P4 instrumentation yet. Maybe
building Rabbit hooks into V would be a useful thing to do (oprofile
seems a bit crippled when faced with code without a file or symbols).
J
|
|
From: Jeremy F. <je...@go...> - 2002-12-07 07:27:31
|
On Fri, 2002-12-06 at 18:32, Jeremy Fitzhardinge wrote:
> It would be nice to change the helper calling convention into
> something non-flag smashing.
Duh. "lea N(%esp), %esp" is a perfectly good non-flag-smashing way to
fix the stack.
J
|