From: Jeremy F. <je...@go...> - 2002-12-11 01:51:11
On Tue, 2002-12-10 at 17:31, Julian Seward wrote:
> Hey? That seems like too many instructions to me. The idea is that the
> cache entries are arranged so as to cause lookup failures on misalignment,
> so that the testl and jnz are not needed.
Yep, you're right.
> This is not so good (trashes a second reg), so perhaps your code is better
> here. OTOH, providing enough spare regs exist, all reasonable machines
> have 2 ALUs capable of doing the andls in parallel, so the sequence should be
> fast.
I made the ACCESS UInstr take two args: the address and the rounded
address, so that I didn't have to scrounge for a pair of temps. It
would help if AND accepted a Lit32 argument though.
> movl %vv, %temp
> movl %vv, %temp2
> andl $MASK, %temp     -- cache index, as before
> andl $(~2), %temp2    -- dump bit 1 of address (~2 == 111...11101b)
> cmpl cache(%temp), %temp2
> jz done
> slow:
>
> The andl $(~2) is the subtlety. For the lowest two bits it gives the mapping
> 00 -> 00, 01 -> 01, 10 -> 00, 11 -> 01
> So if the address was 2-aligned (00, 10) it produces 00, which can potentially
> match the cache[] entry.
Nice.
J
From: Jeremy F. <je...@go...> - 2002-12-11 01:44:10
On Tue, 2002-12-10 at 16:59, Julian Seward wrote:
> The mozilla I was running was a 1.2.1 binary build (the straight .tar.gz)
> from ftp.mozilla.org, so egcs is not in the picture, and I would expect
> this problem to occur using that binary build on any distro -- the loop
> is in some .so supplied in the .tar.gz, so it'll be the same for everyone
> (I guess).
But you only see a problem under RH6.2? Is it this build:
http://ftp.mozilla.org/pub/mozilla/releases/mozilla1.2.1/mozilla-i686-pc-linux-gnu-1.2.1.tar.gz
It could still be some interesting interaction between the system
libraries and moz itself... I'll see if I can reproduce the problem.
> > BTW, I'm having a go at implementing your addrcheck cache idea. It
> > isn't working out quite as well as I'd like.
> You are?! I had better reply to your initial comments on it ...
My first impression is that cache maintenance overwhelms any benefit of
making the fast path faster. On the other hand, I may still be doing
something wrong. I'll put the patch up for inspection.
The much more interesting contribution is 72-jump, which adds a helper
mechanism for computing relative jump offsets rather than always having
to hand-compute them (and double-guess the emitters). I implemented it
out of necessity because I wanted to do a jump over a sync_ccall site,
but it turned out to work well in every other instance of a jcond_lit,
and it cleans things up nicely.
J
From: Julian S. <js...@ac...> - 2002-12-11 01:24:17
|
> So I guess the full code for size = 4 would be:
>
> testl $3, %a
> jnz slow
> movl %a, %r
> andl $MASK, %r
> cmpl cache(%r), %a
> jz done
> slow: call slow-path
> done:
Hey? That seems like too many instructions to me. The idea is that the
cache entries are arranged so as to cause lookup failures on misalignment,
so that the testl and jnz are not needed.
If a cache slot mentions (holds) some address a, this means that a .. a + 3
inclusive are addressable. Furthermore we require that a has 00 as its lowest
two bits. (**)
----------------
So a test for a 4-byte access at address vv is
movl %vv, %temp
andl $MASK, %temp
cmpl cache(%temp), %vv
jz done
slow:
where MASK is (CACHE_MASK << 2) and CACHE_MASK is ((1 << CACHE_BITS)-1).
If %vv ends in anything other than 00, it cannot match any cache[] value
as implied by ** above.
To mark the ith cache slot empty, we place in it the value ((~i) << 2).
That causes all checks to fail since the middle CACHE_BITS cannot ever
then match. It also observes (**).
----------------
The test for a 1-byte access at address vv is
movl %vv, %temp
movl %vv, %temp2
andl $MASK, %temp -- cache index, as before
andl $(~3), %temp2 -- dump bits 0 and 1 of address (~3 == 111...11100b)
cmpl cache(%temp), %temp2
jz done
slow:
This is not so good (trashes a second reg), so perhaps your code is better
here. OTOH, providing enough spare regs exist, all reasonable machines
have 2 ALUs capable of doing the andls in parallel, so the sequence should be
fast.
----------------
Finally 2-byte is a minor variant of the 1-byte version:
movl %vv, %temp
movl %vv, %temp2
andl $MASK, %temp -- cache index, as before
andl $(~2), %temp2 -- dump bit 1 of address (~2 == 111...11101b)
cmpl cache(%temp), %temp2
jz done
slow:
The andl $(~2) is the subtlety. For the lowest two bits it gives the mapping
00 -> 00, 01 -> 01, 10 -> 00, 11 -> 01
So if the address was 2-aligned (00, 10) it produces 00, which can potentially
match the cache[] entry.
If the address was not 2-aligned (01, 11) it produces 01, which can never
match and forces us to the slow case. It is true to say that this forces
addresses ending in 01 unnecessarily into the slow case whereas your
test-based code doesn't, but misaligned accesses are so rare I think
it's more important to accelerate the common case.
J
From: Julian S. <js...@ac...> - 2002-12-11 00:52:20
> So, this only happens with Mozilla on RH6.2, compiled with some version
> of egcs? Can you reproduce anything similar with other egcs-generated
> code?
The mozilla I was running was a 1.2.1 binary build (the straight .tar.gz)
from ftp.mozilla.org, so egcs is not in the picture, and I would expect
this problem to occur using that binary build on any distro -- the loop
is in some .so supplied in the .tar.gz, so it'll be the same for everyone
(I guess).
Thanks for 74-; I'll try it tomorrow evening. Almost out of time now.
> BTW, I'm having a go at implementing your addrcheck cache idea. It
> isn't working out quite as well as I'd like.
You are?! I had better reply to your initial comments on it ...
J
From: Jeremy F. <je...@go...> - 2002-12-11 00:44:48
On Tue, 2002-12-10 at 15:35, Julian Seward wrote:
> (mozilla-1.2.1 was looping with memcheck ...)
>
> > > It all _looks_ plausible. I'm a bit mystified. You sure this j[n]p
> > > trick in 69- has no strange side-effects? I can't think of any. Perhaps
> > > this is a red herring.
> >
> > Looks OK to me, but its a bit hard to tell without seeing the original
> > code.
> >
> > What happens if you change it back to the popf slow path? Still happen?
>
> I dunno; I removed the popf stuff.
>
> However, backing out 69- makes it work properly.
Try the attached (74-paranoid-flags) with 69- still applied and see if
it helps (try with --paranoid-flags=yes and no). I also found some code
passing the old args to new_emit, which may have been causing a problem.
J
From: Jeremy F. <je...@go...> - 2002-12-11 00:12:29
On Tue, 2002-12-10 at 15:35, Julian Seward wrote:
> However, backing out 69- makes it work properly.
Very mysterious.
> So I'm still mystified. One unedifying explanation is that this translation
> is correct, and the reason it is looping is that some earlier translation has
> written bogus values into memory, which the above loop is picking up and
> looping on. I don't fancy chasing that down.
Since the only code which cares about flags are the last two
instructions, and they look correct to me, it must be the data they're
operating on...
> I'm going to back out 69- from cvs until we have a clearer picture of what's
> going on. Do shout if you have any ideas at all. It makes me uneasy that
> I don't know what's going on here.
So, this only happens with Mozilla on RH6.2, compiled with some version
of egcs? Can you reproduce anything similar with other egcs-generated
code?
> > One possibility I've been thinking about is whether there's any code
> > which depends on the undefined flags behaviour of instructions. It
> > would be a (compiler?) bug, but it might change the behaviour of real
> > programs.
> Um, that's not good. Should I be concerned?
Dunno. It's easy to fix: just add a line into VG_(new_emit)() saying
something like:
	if (set_flags != FlagsEmpty)
		maybe_emit_get_flags();
which would always make sure that if anyone sets the flags, they start
with the simulated flags state in the CPU.
A lot of the arithmetic instructions have an undefined effect on some
set of flags. I interpret that as being the same as setting them (that
is to say, no correct program can rely on them being unchanged by the
instruction, so don't bother to preserve their values). It may be that
some code "knows" that undefined actually means unchanged, and relies
on that behaviour.
In which case the conservative thing for us to do is treat undefined as
meaning unchanged, and emit considerably more flags fetches (which
basically punts the problem to Intel/AMD/Via/Transmeta/etc, because the
CPU still has to have an interpretation of what undefined actually
means; there's probably a lot of lore about the detailed behaviour of
the instructions which goes way beyond their formal description in
Vol2). I'm not saying it has any bearing on the present problem, but it
would be an interesting experiment to try.
> Umm, I'm not sure what you mean by good. Memcheck is probably the most
> demanding in that nearly every original ucode is preceded by instrumentation
> which very likely trashes (real) eflags. Is that what you meant?
>
> If there's some way in which you could hack a skin to do a
> stress-test of your flags machinery, that would be very helpful.
Yes. I might put together a testbed skin.
BTW, I'm having a go at implementing your addrcheck cache idea. It
isn't working out quite as well as I'd like.
J
From: Julian S. <js...@ac...> - 2002-12-10 23:28:07
(mozilla-1.2.1 was looping with memcheck ...)
> > It all _looks_ plausible. I'm a bit mystified. You sure this j[n]p
> > trick in 69- has no strange side-effects? I can't think of any. Perhaps
> > this is a red herring.
>
> Looks OK to me, but its a bit hard to tell without seeing the original
> code.
>
> What happens if you change it back to the popf slow path? Still happen?
I dunno; I removed the popf stuff.
However, backing out 69- makes it work properly. I identified the
original code:
   0x40224f10   mov    0x4(%edi),%eax
   0x40224f13   mov    0x10(%eax),%eax
   0x40224f16   mov    %eax,0x4(%edi)
   0x40224f19   mov    0x10(%eax),%edx
   0x40224f1c   mov    0x4(%ecx),%eax
   0x40224f1f   cmp    0x4(%edx),%eax
   0x40224f22   jl     0x40224f10
Attached is the cleaned-up and annotated memcheck translation. The
stuff to do with cmp and jl looks OK to me; the %eflags value set by
the cmp (simulation) is correctly copied off to safety before the stuff
for the jl, and the relevant simd test for JL looks right.
So I'm still mystified. One unedifying explanation is that this
translation is correct, and the reason it is looping is that some
earlier translation has written bogus values into memory, which the
above loop is picking up and looping on. I don't fancy chasing that
down.
I'm going to back out 69- from cvs until we have a clearer picture of
what's going on. Do shout if you have any ideas at all. It makes me
uneasy that I don't know what's going on here.
> One possibility I've been thinking about is whether there's any code
> which depends on the undefined flags behaviour of instructions. It
> would be a (compiler?) bug, but it might change the behaviour of real
> programs.
Um, that's not good. Should I be concerned?
> The simulated CPU will leak lots of real flags into the undefined
> flags. The solution would be to add an undef_flags argument to
> new_emit, and add a --paranoid-flags=yes|no command line option;
> new_emit could then decide whether to fetch the flags or not.
> A quick test to see if that's happening in this case is to force
> new_emit to always make sure the simulated flags are current before
> every simulated instruction.
>
> At one point I hacked none to "instrument" the code to trash the real
> flags between every UInstr. Unfortunately I think I lost this (and it
> was a bit of a blight on none's purity). Is there a good existing skin
> for this kind of skulduggery?
Umm, I'm not sure what you mean by good. Memcheck is probably the most
demanding in that nearly every original ucode is preceded by
instrumentation which very likely trashes (real) eflags. Is that what
you meant?
If there's some way in which you could hack a skin to do a stress-test
of your flags machinery, that would be very helpful.
J
From: Julian S. <js...@ac...> - 2002-12-10 00:25:35
Results from this evening's testing of the head:
- Works OK on RH 7.2 (it builds; mozilla-1.0 and OO-1.0.1 run on all
  skins)
- Ditto RH 7.3, RH 8.0, SuSE 8.1
- After some futzing, got it to build again on RH 6.2 (our oldest
  supported platform). Two strange things:
  --skin=cachegrind causes an instant segfault at startup, before
  anything is printed. It's so quick I wonder if the dynamic linker is
  crashing.
  mozilla-1.2.1 (binary .tar.gz build downloaded from mozilla.org) runs
  OK on nulgrind and addrcheck, but spins after 100-million-ish bbs on
  memcheck, so it draws part of a window and never progresses. This
  could be a RH 6.2 problem or a virtual CPU problem which only shows
  up with that 1.2.1 build -- haven't checked on any other distros. I
  bet it's some kind of flag weirdness though, considering it works OK
  on some skins.
I got a quick trace with gdb and it's definitely in a loop. I'll have a
look at it perhaps late tomorrow night; out of time now. Trace is
attached.
J
From: Jeremy F. <je...@go...> - 2002-12-09 21:51:40
On Mon, 2002-12-09 at 11:32, Julian Seward wrote:
> [...]
> > That said, it has been a long while since I looked at that patch in
> > detail, so maybe there's some simple improvements. In particular, I
> > think it leaves some dead code, so that should be cleaned up.
>
> I'm just a bit disinclined to have two mechanisms for integer
> multiplication (the helper fns _and_ direct ucode). If the direct route
> covered all the bases, I'd take it. Not only does it allow scope for better
> instrumentation, the generated code is surely better too.
Well, there are really two kinds of multiply: the NxN->2N set, and the
NxN->N set. The latter has a UInstr, but the former are done with
helpers. Since the 2N forms are slower instructions which stomp
specific registers, they're not really desireable to generate all the
time; it seems to me that to support inline code generation for the 2N
forms pretty much requires separate opcodes, which leads for 4 being
used for multiply (though perhaps a flag can be used to distinguish
either N from 2N or signed from unsigned, though all the unsigned
multiplies are 2N).
I don't think the quality of the generated code is all that important
since the helpers aren't that expensive to call (push and pop are
cheap), and 2N forms are hardly ever used in code I've tested. Also,
making sure that everything is in the right register would kill a lot of
the potential efficiency gains (unless the regalloc can be changed to
make sure that specific temps end up in specific registers so that the
rearrangement happens at compile time rather than runtime - but that
sounds even more complex).
J
From: Julian S. <js...@ac...> - 2002-12-09 19:25:04
[...]
> That said, it has been a long while since I looked at that patch in
> detail, so maybe there's some simple improvements. In particular, I
> think it leaves some dead code, so that should be cleaned up.
I'm just a bit disinclined to have two mechanisms for integer
multiplication (the helper fns _and_ direct ucode). If the direct route
covered all the bases, I'd take it. Not only does it allow scope for
better instrumentation, the generated code is surely better too.
> > I've started to fix various end-user reported bugs, as part of
> > stabilisation efforts, as you'll see from the cvs mail.
> Yes. As you can see I've started making the attempt at packaging
> everything up. I think we should push out another dev snapshot soon so
> that we can get more eager testers.
I'll try building the current head on various distros, and if that
looks promising, I'll try and emit a 1.9.1 snapshot this evening.
J
From: Jeremy F. <je...@go...> - 2002-12-09 17:26:57
I notice you implemented the rest of the jccs. It struck me that a more
efficient pattern for the SF == OF and SF != OF (jnl/jl) tests would be:
testl $EFlagS|EFlagO, EFLAGS(%ebp)
j[n]p true
The ones which involve Z as well could use 2 jumps:
testl $EFlagZ, EFLAGS(%ebp)
j[n]z true
testl $EFlagS|EFlagO, EFLAGS(%ebp)
j[n]p true
I've tested the simple case - seems to work fine (69-simple-jlo). I
have no idea if two jumps is better or worse than the shifts and bitops,
but it does require a different code structure, so it isn't quite such a
simple patch.
J
From: Jeremy F. <je...@go...> - 2002-12-09 05:29:29
On Sat, 2002-12-07 at 05:43, Julian Seward wrote:
> -- work out the fiddly details of extending this to accesses of
> sizes 1, 2 and 8 (treat as 2 x 4 ?).
1 and 2 should be easy; just round the address down to the next multiple
of 4 and probe (since presence in the hash means that N .. N+3 are
valid). 8 should also be easy, as two probes, but probably isn't common
enough to spend lots of effort on. Misalignment where the access can
cross a multiple of 4 boundary is irritating, but see below:
> -- make sure that the working out gives correct, slow-case behaviour
> for all possible cases of misaligned addresses.
Misaligned accesses are the tricky bit. Detecting a mis-aligned access
is going to complicate the test site somewhat (another test and
conditional jump). You could make it fall into the slow path, but
testing the next hash entry up would be just as simple (and a hack to
make this slightly quicker: if you have a hash with 2^N entries, then
make the array 2^N+1 entries long, with the last entry always being a
copy of the first entry - that way you can always probe the next one up
without worrying about wrapping). On the other hand, there probably
aren't enough misaligned accesses to make it worth complicating the
inline fastpath.
Hm, so the details:
For size == 4, the access is aligned iff a & 3 == 0. So testing for
that is easy.
For size == 2, the access is aligned (as in not crossing a multiple-of-4
address) if ((a & 3) < 3), which can be tested with:
testl $3, %addr
jz aligned // addr is ....00
jnp aligned // addr is ....10 or ....01
call slowpath
jmp done
aligned:
// fastpath
done: ...
And fortunately, size==1 can't be misaligned.
So I guess the full code for size = 4 would be:
testl $3, %a
jnz slow
movl %a, %r
andl $MASK, %r
cmpl cache(%r), %a
jz done
slow: call slow-path
done:
For size == 2:
testl $3, %a
jz fast
jp slow
fast: movl %a, %r
andl $MASK, %r
cmpl cache(%r), %a
jz done
slow: call slow-path
done:
For size == 1:
> movl %a, %r -- %r := %a
> andl $(N_MASK << 2), %r -- %r := sizeof(cache-slot) *
> index(a)
> cmpl cache(%r), %a -- Z flag set iff cache hit
> jz fast-case-continuation
>
> call slow-case-helper
>
> fast-case-continuation:
> -- figure out how the cache interacts with it's backing store,
> ie the existing sparse array
> -- sanity check the entire story, including that about invalidating
> cache entries (I think my story is ok, but not 100% sure)
You mean putting ~a into cache[index(a)]. Seems reasonable to me; it could
only be a problem if (a & mask) can ever equal (~a & mask); the simple
case is where mask = ~0: can a == ~a?
> -- figure out how this impacts set_address_range_perms(), since that
> is a frequently-called function (every time the simulated machine's
> %esp changes!)
Shouldn't be too hard to write an efficient cache-stomper.
J
From: Jeremy F. <je...@go...> - 2002-12-09 03:48:02
On Sun, 2002-12-08 at 16:25, Julian Seward wrote:
> Have considered 01-partial-mul but am somewhat put off by the fact that it
> doesn't cover all smul+umul cases and therefore only patchily achieves its
> aim. How about modifying the UMUL/SMUL uinstrs so that they do a
> NxN -> 2N multiply for N=8/16/32 bits, taking two TempRegs, which are
> read as operands, and then have the double-length result written to them
> both? This simplifies the code generation too since you can just generate
> the NxN -> 2N x86 insn (IIRC; not sure if it is available for insns and
> signedness)?
>
> Or perhaps it's not worth the effort.
Well, in terms of frequency, I didn't find any of the other multiply
forms being used in real code. gcc can be convinced to use the 8 bit
multiply, but partial results don't matter there (at least, I haven't
found any uses of multiply which expect partial results from partial
arguments at the bit level).
In particular, I didn't find any uses of unsigned multiply, so I'm
really unsure about whether its worth adding a new UMUL UInstr just for
its sake. (I know I reserved an opcode for it, but there's no other
support for it.)
That said, it has been a long while since I looked at that patch in
detail, so maybe there's some simple improvements. In particular, I
think it leaves some dead code, so that should be cleaned up.
> I've started to fix various end-user reported bugs, as part of
> stabilisation efforts, as you'll see from the cvs mail.
Yes. As you can see I've started making the attempt at packaging
everything up. I think we should push out another dev snapshot soon so
that we can get more eager testers.
J
From: Julian S. <js...@ac...> - 2002-12-09 00:17:47
Hi. I merged
   61-special-d
   62-lazy-eflags
   67-dist
   65-fix-ldt
   55-ac-clientreq
Thanks as ever for them.
Have considered 01-partial-mul but am somewhat put off by the fact that
it doesn't cover all smul+umul cases and therefore only patchily
achieves its aim. How about modifying the UMUL/SMUL uinstrs so that
they do a NxN -> 2N multiply for N=8/16/32 bits, taking two TempRegs,
which are read as operands, and then have the double-length result
written to them both? This simplifies the code generation too since
you can just generate the NxN -> 2N x86 insn (IIRC; not sure if it is
available for insns and signedness)?
Or perhaps it's not worth the effort.
I've started to fix various end-user reported bugs, as part of
stabilisation efforts, as you'll see from the cvs mail.
Thx for your msg re meaning of new_emit, which just arrived.
J
From: Julian S. <js...@ac...> - 2002-12-07 13:36:04
Anybody fancy a nice self-contained xmas hack? -- J
Speeding up the "addrcheck" skin -- Julian Seward, 7 December 02
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What follows is an idea for an optimisation aimed at speeding up
one of the new tools in valgrind-2.0: the "addrcheck", fine-grained
address checker.
I don't have time to try this myself. But it would be a shame to ship
2.0 without at least trying this idea. It might make addrcheck a lot
faster, or it might have no effect. Either way, I'd like to know.
So here is the idea. Realistically, trying it out would take a hacker
experienced in valgrind internals, perhaps a weekend. A competent
hacker, with a good handle on low-level bit twiddling and x86
assembly, but no knowledge of valgrind, might take more like a week.
Or perhaps less. The nice thing about this one is it's very
self-contained, and you don't actually need to know anything much
about V to do it.
If you have the time, understanding and enthusiasm to try this out,
please go ahead, and let me know you're on the case. This work can be
carried out either on the 1.1.0 snapshot or the CVS head; either way
it can be installed into the CVS head easily enough, if successful.
Timescale: I'd like to ship valgrind-2.0 in late Jan, if possible, so
this would form an interesting xmas-break hack for someone, ideally.
Background
~~~~~~~~~~
The addrcheck skin (tool) is a new bug-detecting tool in valgrind-2.0.
It is a simplified version of the "traditional" valgrind checks that
1.0.X does, with one crucial detail different: there is no checking
for undefined values. Result is that addrcheck does fine-grained
address checking only: for every read and write, it checks that the
program is really allowed to read/write at that address.
This forms an interesting compromise from full-scale 1.0.X-style
valgrinding. It still picks up: reading/writing freed memory,
reading/writing off the start/end of malloc'd blocks, and passing
invalid addresses to system calls. These bugs are hard to find and do
cause crashes in practice. On the other hand, you don't find
undefined-value errors any more; you'll need the memcheck skin to do
that.
So, on the plus side, addrcheck runs about twice as fast as full-scale
valgrinding, whilst still picking up important bugs. On the minus side,
you lose undefined-value checking. No such thing as a free lunch.
How Addrcheck works at the moment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Pretty much everything you need to know is in addrcheck/ac_main.c, and
the entire experiment should be possible by hacking only that file.
Addrcheck keeps a bitmap, with (potentially) one bit for each byte of
the 4G address space. That bit indicates whether or not the
associated address is currently valid or not. The bitmap is
represented by a two-level sparse array, which is grown dynamically,
so as to keep its space usage sensible.
Every memory access has to be checked. For a 4-byte load or store --
by far the most common case -- the code generated by the valgrind
dynamic translator has to call ac_helperc_ACCESS4(), in
addrcheck/ac_main.c, passing it the address to be checked. This
function does the check, emits a warning if needed, and returns.
The speed of this operation is critical since it is very frequent.
The whole business looks like this. First we start off in
code generated by valgrind. The address to be checked is in %edx:
pushl %eax
pushl %edx
movl %edx, %eax
call * 36(%ebp)
This call takes us directly to ac_helperc_ACCESS4, the fast
(common) path through which is:
ac_helperc_ACCESS4:
pushl %ebx
movl %eax, %ebx
roll $18, %eax
andl $1048572, %eax
movl primary_map(%eax), %edx
movzwl %bx,%eax
shrl $3, %eax
movzbl (%eax,%edx), %eax
movl %ebx, %ecx
andl $4, %ecx
sarl %cl, %eax
testl $15, %eax
je .L314 (expected taken -- the no-error case)
....
.L314:
popl %ebx
ret
Now we're back in generated code, restore callee-save regs:
popl %edx
popl %eax
This means the fast case takes 4 + 15 + 2 = 21 instructions, which
isn't good. Also not good is the rotate and two shifts, both of which
are expensive on the P4 (4 cycle latency each).
The idea
~~~~~~~~
(It occurs to me that you need a rock-solid understanding of how a
direct-mapped cache works, to make sense of the following.)
I believe the common case can be done in 4 or 5 instructions, which
can be generated in-line in the translation, so it doesn't even
involve a call.
The basic idea is to add a simulated direct-mapped cache, which
doesn't hold any data -- we only care about missing vs not missing in
the cache. The cache is an array of 2^N 32-bit addresses, for some N
(the size can be tuned later).
The meaning of the cache is as follows. If we succeed in finding an
address in the cache, it means that all 4 bytes of the word
surrounding the address are accessible, which is the expected
fast-case. If any of the 4 bytes in a word surrounding the address
are not accessible, we arrange for the cache always to generate a miss
when presented with that address.
The cache-hit/miss test is done in-line in the generated code and
takes 4 or 5 instructions. If we hit, we just keep going and that's
the end of it. If we miss, then we have to call out to a helper
function to handle it, but that should be relatively rare.
Each entry in the cache array is simply a 4-byte-aligned address where
we guarantee that the address is valid. For example, if cache entry
[0] contains the value 0x12345600, this tells us that addresses
0x12345600 to 0x12345603 inclusive, are valid.
The cache (well, really the only part which exists is the tag array)
is indexed by bits (N+1 .. 2) inclusive of the address. That is,
define the function
#define N_MASK ((1 << N)-1)
index(a) = (a >> 2) & N_MASK
Any 4-byte address a will then have a hit in the cache exactly when
a == cache[ index(a) ]
ie, we look up in the relevant slot and find our own address.
Notice this is more subtle than it at first appears. Specifically,
how do we handle a being misaligned (very rare in practice, but we
still need to handle it) ? Well, we only allow the cache[] array to
hold addresses whose lowest two bits are zero. So if a is misaligned
(ie, its lowest two bits are not both zero), the comparison will
always fail.
Another subtlety is how we invalidate an entry. Suppose that the
cache says that some address a is valid, by having an entry satisfying
a == cache[ index(a) ]. And now we want to make that word invalid,
perhaps because it's part of a block of memory being released by the
simulated application calling free().
So we arrange that the value in the cache slot can never match
any address which might map to that slot. An easy way to do this is
to set cache[index(a)] = ~a (bitwise inversion of a).
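In the same hypothetical C rendering, invalidation is one store, and the reason ~a can never hit is worth spelling out in a comment:

```c
#include <assert.h>
#include <stdint.h>

#define N      16
#define N_MASK ((1u << N) - 1u)
static uint32_t cache[1u << N];

static inline uint32_t index_of(uint32_t a) { return (a >> 2) & N_MASK; }

/* Store ~a.  Any address b mapping to this slot has the same index
   bits (bits N+1..2) as a, but ~a carries those bits inverted, so
   b == ~a is impossible: the slot cannot hit until it is refilled. */
static inline void invalidate(uint32_t a) {
    cache[index_of(a)] = ~a;
}
```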
Now the really neat thing about this is that the fast-case
check a == cache[ index(a) ] translates to not much x86 code.
Let %a be a reg holding a, which we don't want to trash, and let
%r be some spare reg. The check is then:
movl %a, %r -- %r := %a
andl $(N_MASK << 2), %r -- %r := sizeof(cache-slot) * index(a)
cmpl cache(%r), %a -- Z flag set iff cache hit
jz fast-case-continuation
call slow-case-helper
fast-case-continuation:
Not bad! The (>> 2) in the index() calculation is exactly compensated
for by the implicit (<< 2) in the cache[...] array access, saving
insns and (crucially on a P4) any shifting operations. Valgrind's
code generator keeps track of free registers, so we can usually get a
suitable candidate for %r at no extra expense. If we're unlucky we
can push and pop some other register around the sequence, but even
that's pretty darn quick.
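The shift-cancellation claim is easy to check in C: masking with (N_MASK << 2) yields the slot's byte offset directly, which is exactly what the andl computes (slot_byte_offset is a made-up name for the check):

```c
#include <assert.h>
#include <stdint.h>

#define N      16
#define N_MASK ((1u << N) - 1u)

/* andl $(N_MASK << 2), %r gives sizeof(slot) * index(a) in one op:
   the (>> 2) of index() and the (<< 2) of the 4-byte slot scaling
   cancel, so no shift instruction is needed. */
static uint32_t slot_byte_offset(uint32_t a) {
    return a & (N_MASK << 2);
}
```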
Actually doing it
~~~~~~~~~~~~~~~~~
Implementing this idea doesn't mean writing a lot of code. It does
mean considerable preliminary thought, writing designs on paper, etc,
so as to:
-- work out the fiddly details of extending this to accesses of
sizes 1, 2 and 8 (treat as 2 x 4 ?).
-- make sure that the working out gives correct, slow-case behaviour
for all possible cases of misaligned addresses.
-- figure out how the cache interacts with its backing store,
ie the existing sparse array
-- sanity check the entire story, including that about invalidating
cache entries (I think my story is ok, but not 100% sure)
-- figure out how this impacts set_address_range_perms(), since that
is a frequently-called function (every time the simulated machine's
%esp changes!)
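One possible working-out of the first two points, purely as a sketch (is_fast_access is a hypothetical slow-path-side helper, not proposed emitted code): treat an access of any size as a check on every aligned word it touches, so misaligned and straddling accesses fall through naturally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N      16
#define N_MASK ((1u << N) - 1u)
static uint32_t cache[1u << N];

static bool word_hit(uint32_t a) {          /* a is 4-byte aligned */
    return cache[(a >> 2) & N_MASK] == a;
}

/* An access of sz bytes (1, 2, 4 or 8) at a is fast only if every
   aligned word it touches is cached; an 8-byte access is 2 x 4, and
   a word-straddling misaligned access needs both its words. */
static bool is_fast_access(uint32_t a, uint32_t sz) {
    uint32_t w    = a & ~3u;
    uint32_t last = (a + sz - 1u) & ~3u;
    for (;; w += 4u) {
        if (!word_hit(w)) return false;
        if (w == last)    return true;
    }
}
```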
Although performance is the aim here, the number one priority is
correctness. Valgrind is a debugging tool, so it is vital that this
address-check machinery works correctly. IOW, apart from doing the
hackery you'll also need to convince us that your hackery is
absolutely and totally correct under all circumstances!
Antidisirregardless!
A way to get started, without having to immerse yourself in the grotty
details of the x86 insn set, is to add this cache purely inside
ac_main.c and rewrite the *ACCESS* functions to use the cache. This
allows you to develop and debug your logic whilst operating in C land.
Making sure the logic is correct is the hard bit of this. Once that's
done it's easy to persuade V's code generator to emit the
(abovementioned) fast-case code fragments in-line.
Anyway, if you can make this fly, I'd love to hear the outcome.
Please contact me (js...@ac...) and cc to the developers
mailing list val...@li....
|
|
From: Jeremy F. <je...@go...> - 2002-12-07 08:46:47
|
I fixed the ldt problem of the other day. The LDT stuff wasn't dealing
properly with a child thread inheriting a copy of the parent's LDT
state.
I did a comparison between the relative slowdown of P3 native:valgrind,
vs P4 native:valgrind. Previously the P4 was about twice as slow as the
P3, proportionally (that is, for a given benchmark, on the P3 a program
may have run, say, 10 times slower, whereas the same test run on a P4
would be about 20 times slower).
I'm pleased to say that the P3 and P4 are now equally slow - they're
both 5-10 times slower than native when run under Valgrind
(--skin=none). I suspect this is mostly to do with flags handling
improvements; pushf/popf must be proportionally worse for P4 than P3.
I also tried some experiments to try to batch together larger chunks of
compilation. I added the idea of "speculative translation", where
translating one basic block would attempt to follow jumps and translate
their targets too. Not surprisingly, doing this to every jump was
somewhat slower.
What is surprising is that when the speculation was reduced to following
the only direct jump in a basic block (ie, a jump to a basic block which
*must* be executed next), it is still a speed loss. I would have thought
that translating multiple basic blocks at once would take advantage of
the compiler being in cache, and amortize the cost of various
self-modifying-code interlocks, etc.
I suspect that VG_(search_transtab) is the problem, since it collapses
into a linear scan of the entire TT when it is full and the address
you're searching for isn't present. Maybe some hashing will help.
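To make the suggestion concrete, a hashed lookup along those lines might look like the sketch below. Nothing here is Valgrind's real TT layout; TTEntry, tt_hash and the orig == 0 empty-slot sentinel are all invented for illustration. The point is just that a miss stops at the first empty probe instead of scanning the whole table.

```c
#include <assert.h>
#include <stdint.h>

#define TT_BITS 14
#define TT_SIZE (1u << TT_BITS)

typedef struct { uint32_t orig; uint32_t trans; } TTEntry;
static TTEntry tt[TT_SIZE];             /* orig == 0 marks an empty slot */

static inline uint32_t tt_hash(uint32_t orig) {
    return (orig ^ (orig >> TT_BITS)) & (TT_SIZE - 1u);
}

/* Open-addressed probe: return trans addr, or 0 on a miss.  A miss
   terminates at the first empty slot, not after a full scan. */
static uint32_t tt_lookup(uint32_t orig) {
    for (uint32_t i = tt_hash(orig), n = 0; n < TT_SIZE;
         n++, i = (i + 1u) & (TT_SIZE - 1u)) {
        if (tt[i].orig == orig) return tt[i].trans;
        if (tt[i].orig == 0)    return 0;
    }
    return 0;
}

static void tt_insert(uint32_t orig, uint32_t trans) {
    uint32_t i = tt_hash(orig);
    while (tt[i].orig != 0 && tt[i].orig != orig)
        i = (i + 1u) & (TT_SIZE - 1u);
    tt[i].orig = orig; tt[i].trans = trans;
}
```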
J
|
|
From: Jeremy F. <je...@go...> - 2002-12-07 07:27:31
|
On Fri, 2002-12-06 at 18:32, Jeremy Fitzhardinge wrote:
> It would be nice to change the helper calling convention into
> something non-flag smashing.
Duh. "lea N(%esp), %esp" is a perfectly good non-flag-smashing way to
fix the stack.
J
|
|
From: Jeremy F. <je...@go...> - 2002-12-07 02:32:11
|
On Fri, 2002-12-06 at 16:35, Julian Seward wrote:
> So I guess that's pretty much the end of the matter? It seems like the
> right fix to me; does it also to you?
Yes, that was the bug. It explains why it was so non-deterministic too
- it depended on the number of 1 bits in %esp.
It would be nice to change the helper calling convention into something
non-flag smashing.
> Now that it looks like this is going to work, is it worth keeping the
> fancy test stuff you did in synth_jcond_lit ? I ask because you have a
> better understanding of the ramifications of this latest eflags stuff,
> and so can probably make a better decision.
>
> If it's worth keeping ... I did actually write code to also do the
> 4 missing cases: (SF xor OF) == 0 or 1, ((SF xor OF) or ZF) == 0 or 1,
> and it did seem to help sometimes. But that was before your latest patch.
It doesn't make much difference for --skin=none, but it probably does
for skins which instrument. memcheck, for example, always trashes the
flags just before a conditional jump, so the fast-jcc path gets used
often. On the other hand, it may get lost in the overhead which the
instrumenting skins cause anyway (but that's probably the next focus for
performance improvement).
> I've been wondering a bit about the dismal performance on P4s. One thing
> that occurred to me is that the preamble sequence "decl bbs_to_go; jnz ..."
> is going to hit the P4's partial-flag-write penalty (page 2-55 of the P4
> optimisation guide, "Use of the inc and dec instructions"). It'd be
> interesting to try changing it to a subl $1, ... as they recommend. Or perhaps
> not ... according to their tables the latency difference is only 0.5 cycle.
Well, I was thinking about putting something in to tune the instruction
selection depending on the real CPU. It might be worth it (but I
suspect not in this case).
I still think it has more to do with trashing the trace cache rather
than any of the minor instruction selection issues. I'm doing an
experimental patch which does speculative translation (ie, if a BB's
final instruction is a lit32 jump, then translate the jump target too)
to see if that helps. It should strike a balance between killing the
cache with every translation and the over-translation trace caching
would cause.
I still haven't had a chance to do any P4 instrumentation yet. Maybe
building Rabbit hooks into V would be a useful thing to do (oprofile
seems a bit crippled when faced with code without a file or symbols).
J
|
|
From: Julian S. <js...@ac...> - 2002-12-07 00:28:14
|
On Friday 06 December 2002 11:33 pm, Jeremy Fitzhardinge wrote:
> On Fri, 2002-12-06 at 15:11, Julian Seward wrote:
> > Re my analysis of stack-clearing add, I can't be exactly right, since
> > VG_(emit_add_lit_to_esp) begins with the correct call to new_emit.
>
> No, it isn't correct - True means "operate on Simd flags"; False means
> "non-simd flags".
>
> Fixing this seems to work, and even has a slight performance improvement
> (OO starts up in 47 rather than 48 seconds). At very least it seems
> performance neutral.

Well, changing that True to False makes OO and moz work fine for me,
which is great.

So I guess that's pretty much the end of the matter? It seems like the
right fix to me; does it also to you?

--------

Now that it looks like this is going to work, is it worth keeping the
fancy test stuff you did in synth_jcond_lit ? I ask because you have a
better understanding of the ramifications of this latest eflags stuff,
and so can probably make a better decision.

If it's worth keeping ... I did actually write code to also do the
4 missing cases: (SF xor OF) == 0 or 1, ((SF xor OF) or ZF) == 0 or 1,
and it did seem to help sometimes. But that was before your latest patch.

If it's not worth keeping ... let's nuke it.

----------------

I've been wondering a bit about the dismal performance on P4s. One thing
that occurred to me is that the preamble sequence "decl bbs_to_go; jnz ..."
is going to hit the P4's partial-flag-write penalty (page 2-55 of the P4
optimisation guide, "Use of the inc and dec instructions"). It'd be
interesting to try changing it to a subl $1, ... as they recommend. Or
perhaps not ... according to their tables the latency difference is only
0.5 cycle.

J
|
|
From: Jeremy F. <je...@go...> - 2002-12-06 23:33:17
|
On Fri, 2002-12-06 at 15:11, Julian Seward wrote:
> Re my analysis of stack-clearing add, I can't be exactly right, since
> VG_(emit_add_lit_to_esp) begins with the correct call to new_emit.
No, it isn't correct - True means "operate on Simd flags"; False means
"non-simd flags".
Fixing this seems to work, and even has a slight performance improvement
(OO starts up in 47 rather than 48 seconds). At very least it seems
performance neutral.
J
|
|
From: Jeremy F. <je...@go...> - 2002-12-06 23:27:24
|
On Fri, 2002-12-06 at 14:37, Julian Seward wrote:
> The translation is pretty dismal, due to calling helpers for both fnstsw and
> sahf, but "that's not important right now" :) The problem is the call to the
> latter's helper ... and specifically the add $0x4,%esp to clear the args off
> the stack. This trashes the live %eflags
Ah, yes, that's where I've seen moz spin. That's why I was suspecting
some badness in FP+flags interaction.
> Assuming this analysis is correct ... there's no convenient way to clear
> dead args off the real stack, unless we find a dead reg to dump it in.
Ewww, nasty.
> Umm, actually that's nonsense. Imagine we have a baseBlock slot purely
> for the purpose of receiving dead values, then we could do
>
> popl VGOFF_(dummySlot)(%ebp)
>
> Just occasionally, CISC is great!
>
> What do you think? Is the analysis correct?
Yes, that looks likely. The quick fix is to change the
VG_(new_emit)(True, ...) to False in emit_add_lit_to_esp, since that add
is not operating on Simd state. That will make it generate a flag save
before trashing them. I agree the nicer solution is to fix the helper
calling convention to not trash the flags. How about changing the
convention to just use a real register for the value? Unfortunately for
helpers like CPUID, the stack does seem like the nicest way of doing the
passing (unless you want to allocate an array of slots in the bas
block).
It's a pity that SAHF/LAHF doesn't do the O flag; otherwise they'd be
ideal for flags saving/restoring - the P3 optim guide says they're 1
uop.
J
|
|
From: Julian S. <js...@ac...> - 2002-12-06 23:12:41
|
Apologies for mailbombing you even more.

Quick prod with GDB shows OO is also spinning in a fstsw_AX .. SAHF
.. conditional jump loop.

J
|
|
From: Julian S. <js...@ac...> - 2002-12-06 23:04:02
|
> > Hmm. This is very odd. I'm wondering if there is some problem with the
> > non-D flags (OSZACP) causing "if (res > 0) {" at line 2636 never to
> > get into the then-clause. Except that if there was such a problem,
> > most programs wouldn't work (I'd guess).
>
> I think that's a red herring.
I agree.
Re my analysis of stack-clearing add, I can't be exactly right, since
VG_(emit_add_lit_to_esp) begins with the correct call to new_emit.
So perhaps the analysis didn't think that we were in a UPD_Real state
at the point of the call to vgPlain_helper_SAHF. Except that as soon as
it has CLEARed the stack, it then acts to get into UPD_Simd/Both in
preparation for the conditional jump:
1: x/i $eip 0x42377bd2: pushf
1: x/i $eip 0x42377bd3: popl 0x20(%ebp)
Odd.
> In OO's case, thread 3 is polling on IO
> (I presume to the X server), thread 2 is blocked in a condvar_wait, and
> thread 1 is spinning CPU-bound. I'm guessing that thread 2 is waiting
> for either thread 1 or 3 to do something, and thread 1 is expected to do
> something but isn't.
Yes, so that fits together; at least we have a plausible explanation of
why it was spinning, if the moz problem also afflicts OO.
J
|
|
From: Jeremy F. <je...@go...> - 2002-12-06 22:41:26
|
On Fri, 2002-12-06 at 12:12, Julian Seward wrote:
> > Anyway, tell me what you get with the current versions of 61 and 62.
>
> No improvement with OO.
>
> I tried mozilla. It also won't start up, the simulated machine falling
> into an endless sequence of poll() calls separated by nanosleep(13
> milliseconds), which afaics is the nonblocking poll() in vg_libpthread.c.
>
> Trying OO with tracing on indicates it spins in the same place.
>
> Hmm. This is very odd. I'm wondering if there is some problem with the
> non-D flags (OSZACP) causing "if (res > 0) {" at line 2636 never to
> get into the then-clause. Except that if there was such a problem,
> most programs wouldn't work (I'd guess).
I think that's a red herring. In OO's case, thread 3 is polling on IO
(I presume to the X server), thread 2 is blocked in a condvar_wait, and
thread 1 is spinning CPU-bound. I'm guessing that thread 2 is waiting
for either thread 1 or 3 to do something, and thread 1 is expected to do
something but isn't.
However, if I use "--trace-codegen=10001 --trace-signals=yes" and pipe
that into a "tail -100000" (my disk not being big enough to fit a
complete codegen trace) and then wait for it to stabilize (stop codegen
for new BBs, observed by looking at strace of the process), then it
tends to actually work. So that's no use.
Mozilla is similar. Sometimes it works, and sometimes it doesn't.
Sometimes it works, but takes a really long time. In particular, it
normally takes 30 user CPU seconds for moz to appear on my laptop with
--skin=none. Sometimes it never appears, and just seems to burn CPU.
Other times, it appears after 60 or more (wall-clock seconds), but after
still only using 30 CPU seconds (with nothing else using the CPU).
If you try just 61, do you see the same problem?
It does seem very fragile. I just added another patch to V, which
should be completely benign, but it changes the behaviour.
J
|
|
From: Julian S. <js...@ac...> - 2002-12-06 22:29:44
|
Hi. I think I've found an anomaly. Dunno if it's the problem, might be tho.

With mozilla spinning and not proceeding, a bit of prodding about shows
it is spinning on these original insns:

-- ORIGINAL CODE
0x4017263c <js_DoubleToECMAInt32+80>: fprem
0x4017263e <js_DoubleToECMAInt32+82>: fnstsw %ax
0x40172640 <js_DoubleToECMAInt32+84>: sahf
0x40172641 <js_DoubleToECMAInt32+85>: jp 0x4017263c

The translation is pretty dismal, due to calling helpers for both fnstsw
and sahf, but "that's not important right now" :) The problem is the call
to the latter's helper ... and specifically the add $0x4,%esp to clear
the args off the stack. This trashes the live %eflags

-- push %AH
1: x/i $eip 0x42377bb6: mov 0x0(%ebp),%ebx
1: x/i $eip 0x42377bb9: mov $0xff00,%ecx
1: x/i $eip 0x42377bbe: and %ecx,%ebx
1: x/i $eip 0x42377bc0: push %ebx
-- GET EFLAGS
1: x/i $eip 0x42377bc1: pushl 0x20(%ebp)
1: x/i $eip 0x42377bc4: popf
-- call helper
1: x/i $eip 0x42377bc5: call *0x17c(%ebp)
1: x/i $eip 0x4005cc84 <vgPlain_helper_SAHF>: push %eax
1: x/i $eip 0x4005cc85 <vgPlain_helper_SAHF+1>: mov 0x8(%esp,1),%eax
1: x/i $eip 0x4005cc89 <vgPlain_helper_SAHF+5>: sahf
1: x/i $eip 0x4005cc8a <vgPlain_helper_SAHF+6>: pop %eax
1: x/i $eip 0x4005cc8b <vgPlain_helper_SAHF+7>: ret
-- %eflags is now live
-- oops!
1: x/i $eip 0x42377bcb: add $0x4,%esp
-- move EIP
1: x/i $eip 0x42377bce: movb $0x41,0x24(%ebp)
-- PUT (polluted) eflags
1: x/i $eip 0x42377bd2: pushf
1: x/i $eip 0x42377bd3: popl 0x20(%ebp)
-- jump (on result of %esp-adjust :)
1: x/i $eip 0x42377bd6: jnp 0x42377be5

Assuming this analysis is correct ... there's no convenient way to clear
dead args off the real stack, unless we find a dead reg to dump it in.

Umm, actually that's nonsense. Imagine we have a baseBlock slot purely
for the purpose of receiving dead values, then we could do

popl VGOFF_(dummySlot)(%ebp)

Just occasionally, CISC is great!

What do you think? Is the analysis correct?

I might check all places where %esp is involved in an arithmetic op,
since those are all potential trashers.

J

--------------------------------------------------------------------------

The complete translation, in case you should want it, is ...

-- preamble
1: x/i $eip 0x42377b84: decl 0x400bb72c
1: x/i $eip 0x42377b8a: jne 0x42377b92
-- GET fpustate
1: x/i $eip 0x42377b92: frstor 0x8c(%ebp)
-- fprem
1: x/i $eip 0x42377b98: fprem
-- advance EIP
1: x/i $eip 0x42377b9a: movb $0x3e,0x24(%ebp)
-- push $0 (why?)
1: x/i $eip 0x42377b9e: xor %eax,%eax
1: x/i $eip 0x42377ba0: push %eax
-- PUT fpustate
1: x/i $eip 0x42377ba1: fnsave 0x8c(%ebp)
1: x/i $eip 0x42377ba7: call *0x178(%ebp)
1: x/i $eip 0x4005cc6d <vgPlain_helper_fstsw_AX>: push %eax
1: x/i $eip 0x4005cc6e <vgPlain_helper_fstsw_AX+1>: push %esi
1: x/i $eip 0x4005cc6f <vgPlain_helper_fstsw_AX+2>: mov 0x400a08e0,%esi
1: x/i $eip 0x4005cc75 <vgPlain_helper_fstsw_AX+8>: frstor 0x0(%ebp,%esi,4)
1: x/i $eip 0x4005cc79 <vgPlain_helper_fstsw_AX+12>: fstsw %ax
1: x/i $eip 0x4005cc7a <vgPlain_helper_fstsw_AX+13>: fnstsw %ax
1: x/i $eip 0x4005cc7c <vgPlain_helper_fstsw_AX+15>: pop %esi
1: x/i $eip 0x4005cc7d <vgPlain_helper_fstsw_AX+16>: mov %ax,0x8(%esp,1)
1: x/i $eip 0x4005cc82 <vgPlain_helper_fstsw_AX+21>: pop %eax
1: x/i $eip 0x4005cc83 <vgPlain_helper_fstsw_AX+22>: ret
1: x/i $eip 0x42377bad: pop %eax
-- %ax holds FPU status word (simd)
-- PUT %AX
1: x/i $eip 0x42377bae: mov %ax,0x0(%ebp)
-- advance %EIP
1: x/i $eip 0x42377bb2: movb $0x40,0x24(%ebp)
-- push %AH
1: x/i $eip 0x42377bb6: mov 0x0(%ebp),%ebx
1: x/i $eip 0x42377bb9: mov $0xff00,%ecx
1: x/i $eip 0x42377bbe: and %ecx,%ebx
1: x/i $eip 0x42377bc0: push %ebx
-- GET EFLAGS
1: x/i $eip 0x42377bc1: pushl 0x20(%ebp)
1: x/i $eip 0x42377bc4: popf
-- call helper
1: x/i $eip 0x42377bc5: call *0x17c(%ebp)
1: x/i $eip 0x4005cc84 <vgPlain_helper_SAHF>: push %eax
1: x/i $eip 0x4005cc85 <vgPlain_helper_SAHF+1>: mov 0x8(%esp,1),%eax
1: x/i $eip 0x4005cc89 <vgPlain_helper_SAHF+5>: sahf
1: x/i $eip 0x4005cc8a <vgPlain_helper_SAHF+6>: pop %eax
1: x/i $eip 0x4005cc8b <vgPlain_helper_SAHF+7>: ret
-- %eflags is now live
1: x/i $eip 0x42377bcb: add $0x4,%esp
1: x/i $eip 0x42377bce: movb $0x41,0x24(%ebp)
1: x/i $eip 0x42377bd2: pushf
1: x/i $eip 0x42377bd3: popl 0x20(%ebp)
1: x/i $eip 0x42377bd6: jnp 0x42377be5
1: x/i $eip 0x42377bd8: mov $0x4017263c,%eax
1: x/i $eip 0x42377bdd: mov %eax,0x24(%ebp)
1: x/i $eip 0x42377be0: jmp 0x42377b84
|