From: Julian S. <js...@ac...> - 2002-12-07 13:36:04
Anybody fancy a nice self-contained xmas hack? -- J
Speeding up the "addrcheck" skin -- Julian Seward, 7 December 02
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What follows is an idea for an optimisation aimed at speeding up
one of the new tools in valgrind-2.0: the "addrcheck", fine-grained
address checker.
I don't have time to try this myself. But it would be a shame to ship
2.0 without at least trying this idea. It might make addrcheck a lot
faster, or it might have no effect. Either way, I'd like to know.
So here is the idea. Realistically, trying it out would take a hacker
experienced in valgrind internals, perhaps a weekend. A competent
hacker, with a good handle on low-level bit twiddling and x86
assembly, but no knowledge of valgrind, might take more like a week.
Or perhaps less. The nice thing about this one is it's very
self-contained, and you don't actually need to know anything much
about V to do it.
If you have the time, understanding and enthusiasm to try this out,
please go ahead, and let me know you're on the case. This work can be
carried out either on the 1.1.0 snapshot or the CVS head; either way
it can be installed into the CVS head easily enough, if successful.
Timescale: I'd like to ship valgrind-2.0 in late Jan, if possible, so
this would form an interesting xmas-break hack for someone, ideally.
Background
~~~~~~~~~~
The addrcheck skin (tool) is a new bug-detecting tool in valgrind-2.0.
It is a simplified version of the "traditional" valgrind checks that
1.0.X does, with one crucial detail different: there is no checking
for undefined values. Result is that addrcheck does fine-grained
address checking only: for every read and write, it checks that the
program is really allowed to read/write at that address.
This forms an interesting compromise compared with full-scale 1.0.X-style
valgrinding. It still picks up: reading/writing freed memory,
reading/writing off the start/end of malloc'd blocks, and passing
invalid addresses to system calls. These bugs are hard to find and do
cause crashes in practice. On the other hand, you don't find
undefined-value errors any more; you'll need the memcheck skin to do
that.
So, on the plus side, addrcheck runs about twice as fast as full-scale
valgrinding, whilst still picking up important bugs. On the minus side,
you lose undefined-value checking. No such thing as a free lunch.
How Addrcheck works at the moment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Pretty much everything you need to know is in addrcheck/ac_main.c, and
the entire experiment should be possible by hacking only that file.
Addrcheck keeps a bitmap, with (potentially) one bit for each byte of
the 4G address space. That bit indicates whether the
associated address is currently valid. The bitmap is
represented by a two-level sparse array, which is grown dynamically,
so as to keep its space usage sensible.
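As a rough illustration, the two-level scheme can be sketched in portable C. This is a hypothetical mock-up for the sake of the explanation, not the real ac_main.c code -- the helper names and the lazy calloc are my assumptions (the real implementation shares a distinguished read-only secondary for untouched regions):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the two-level sparse bitmap: one validity bit per byte of
   the 4G address space.  The primary map has 64K slots, one for each
   64KB chunk; each secondary holds 64K bits (8KB).  Secondaries are
   allocated lazily, so untouched chunks cost only a NULL pointer.   */

#define CHUNK_BYTES    (1u << 16)           /* 64KB per secondary    */
#define SEC_BITMAP_SZ  (CHUNK_BYTES / 8)    /* 8KB of validity bits  */

static uint8_t *primary_map[1u << 16];

static uint8_t *get_secondary(uint32_t a)
{
   uint32_t pm = a >> 16;
   if (primary_map[pm] == NULL)
      primary_map[pm] = calloc(SEC_BITMAP_SZ, 1);  /* all-invalid    */
   return primary_map[pm];
}

static void set_byte_valid(uint32_t a, int valid)
{
   uint8_t *sec = get_secondary(a);
   uint32_t off = a & (CHUNK_BYTES - 1);
   if (valid) sec[off >> 3] |=  (uint8_t)(1u << (off & 7));
   else       sec[off >> 3] &= (uint8_t)~(1u << (off & 7));
}

static int is_byte_valid(uint32_t a)
{
   uint8_t *sec = get_secondary(a);
   uint32_t off = a & (CHUNK_BYTES - 1);
   return (sec[off >> 3] >> (off & 7)) & 1;
}
```

A 4-byte check then has to test four consecutive bits of a secondary, which is exactly the per-access work the optimisation below tries to bypass.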
Every memory access has to be checked. For a 4-byte load or store --
by far the most common case -- the code generated by the valgrind
dynamic translator has to call ac_helperc_ACCESS4(), in
addrcheck/ac_main.c, passing it the address to be checked. This
function does the check, emits a warning if needed, and returns.
The speed of this operation is critical since it is very frequent.
The whole business looks like this. First we start off in
code generated by valgrind. The address to be checked is in %edx:
pushl %eax
pushl %edx
movl %edx, %eax
call * 36(%ebp)
This call takes us directly to ac_helperc_ACCESS4, the fast
(common) path through which is:
ac_helperc_ACCESS4:
pushl %ebx
movl %eax, %ebx
roll $18, %eax
andl $1048572, %eax
movl primary_map(%eax), %edx
movzwl %bx,%eax
shrl $3, %eax
movzbl (%eax,%edx), %eax
movl %ebx, %ecx
andl $4, %ecx
sarl %cl, %eax
testl $15, %eax
je .L314 (expected taken -- the no-error case)
....
.L314:
popl %ebx
ret
Now we're back in generated code, and restore the saved regs:
popl %edx
popl %eax
This means the fast case takes 4 + 15 + 2 = 21 instructions, which
isn't good. Also not good are the rotate and the two shifts, all of
which are expensive on the P4 (4-cycle latency each).
The idea
~~~~~~~~
(It occurs to me that you need a rock-solid understanding of how a
direct-mapped cache works, to make sense of the following.)
I believe the common case can be done in 4 or 5 instructions, which
can be generated in-line in the translation, so it doesn't even
involve a call.
The basic idea is to add a simulated direct-mapped cache, which
doesn't hold any data -- we only care about missing vs not missing in
the cache. The cache is an array of 2^N 32-bit addresses, for some N
(the size can be tuned later).
The meaning of the cache is as follows. If we succeed in finding an
address in the cache, it means that all 4 bytes of the word
surrounding the address are accessible, which is the expected
fast-case. If any of the 4 bytes in a word surrounding the address
are not accessible, we arrange for the cache always to generate a miss
when presented with that address.
The cache-hit/miss test is done in-line in the generated code and
takes 4 or 5 instructions. If we hit, we just keep going and that's
the end of it. If we miss, then we have to call out to a helper
function to handle it, but that should be relatively rare.
Each entry in the cache array is simply a 4-byte-aligned address where
we guarantee that the address is valid. For example, if cache entry
[0] contains the value 0x12345600, this tells us that addresses
0x12345600 to 0x12345603 inclusive, are valid.
The cache (well, really the only part which exists is the tag array)
is indexed by bits (N+1 .. 2) inclusive of the address. That is,
define the function
#define N_MASK ((1 << N)-1)
index(a) = (a >> 2) & N_MASK
Any 4-byte address a will then have a hit in the cache exactly when
a == cache[ index(a) ]
ie, we look up in the relevant slot and find our own address.
Notice this is more subtle than it at first appears. Specifically,
how do we handle a being misaligned (very rare in practice, but we
still need to handle it)? Well, we only allow the cache[] array to
hold addresses whose lowest two bits are zero. So if a is misaligned
(ie, its lowest two bits are not both zero), the comparison will
always fail.
Another subtlety is how we invalidate an entry. Suppose that the
cache says that some address a is valid, by having an entry satisfying
a == cache[ index(a) ]. And now we want to make that word invalid,
perhaps because it's part of a block of memory being released by the
simulated application calling free().
So we arrange that the value in the cache slot can never match
any address which might map to that slot. An easy way to do this is
to set cache[index(a)] = ~a (bitwise inversion of a).
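In C, the whole index/hit/invalidate story might look something like the following standalone sketch, useful for convincing yourself the logic holds. The names, the choice of N, and the init scheme are my assumptions, not valgrind code:

```c
#include <stdint.h>

#define N       16                       /* log2 of cache size; tunable */
#define N_MASK  ((1u << N) - 1u)

static uint32_t cache[1u << N];

static uint32_t slot_of(uint32_t a) { return (a >> 2) & N_MASK; }

/* Start every slot in a can't-match state: by the same complement
   argument as invalidation below, ~(i << 2) never indexes slot i.   */
static void cache_init(void)
{
   for (uint32_t i = 0; i <= N_MASK; i++)
      cache[i] = ~(i << 2);
}

/* Record that the aligned word at a (low two bits zero) is valid.   */
static void cache_mark_valid(uint32_t a) { cache[slot_of(a)] = a; }

/* Invalidate: ~a has the complemented slot index, so no later lookup
   mapping to slot_of(a) can ever match the stored value.            */
static void cache_invalidate(uint32_t a) { cache[slot_of(a)] = ~a; }

/* The fast-path test.  Misaligned addresses always miss, because the
   cache only ever holds values whose low two bits are zero.         */
static int cache_hit(uint32_t a) { return cache[slot_of(a)] == a; }
```

This is enough to exercise the subtle cases (misalignment, invalidation, aliasing slots) in plain C before going anywhere near the code generator.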
Now the really neat thing about this is that the fast-case
check a == cache[ index(a) ] translates to not much x86 code.
Let %a be a reg holding a, which we don't want to trash, and let
%r be some spare reg. The check is then:
movl %a, %r -- %r := %a
andl $(N_MASK << 2), %r -- %r := sizeof(cache-slot) * index(a)
cmpl cache(%r), %a -- Z flag set iff cache hit
jz fast-case-continuation
call slow-case-helper
fast-case-continuation:
Not bad! The (>> 2) in the index() calculation is exactly compensated
for by the implicit (<< 2) in the cache[...] array access, saving
insns and (crucially on a P4) any shifting operations. Valgrind's
code generator keeps track of free registers, so we can usually get a
suitable candidate for %r at no extra expense. If we're unlucky we
can push and pop some other register around the sequence, but even
that's pretty darn quick.
Actually doing it
~~~~~~~~~~~~~~~~~
Implementing this idea doesn't mean writing a lot of code. It does
mean considerable preliminary thought, writing designs on paper, etc,
so as to:
-- work out the fiddly details of extending this to accesses of
sizes 1, 2 and 8 (treat as 2 x 4 ?).
-- make sure that the working out gives correct, slow-case behaviour
for all possible cases of misaligned addresses.
-- figure out how the cache interacts with its backing store,
ie the existing sparse array
-- sanity check the entire story, including that about invalidating
cache entries (I think my story is ok, but not 100% sure)
-- figure out how this impacts set_address_range_perms(), since that
is a frequently-called function (every time the simulated machine's
%esp changes!)
Although performance is the aim here, the number one priority is
correctness. Valgrind is a debugging tool, so it is vital that this
address-check machinery works correctly. IOW, apart from doing the
hackery you'll also need to convince us that your hackery is
absolutely and totally correct under all circumstances!
Antidisirregardless!
A way to get started, without having to immerse yourself in the grotty
details of the x86 insn set, is to add this cache purely inside
ac_main.c and rewrite the *ACCESS* functions to use the cache. This
allows you to develop and debug your logic whilst operating in C land.
Making sure the logic is correct is the hard bit of this. Once that's
done it's easy to persuade V's code generator to emit the
(abovementioned) fast-case code fragments in-line.
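For instance, such a C-land prototype of the fast-path wrapper might look like this. It is entirely hypothetical: slow_check4 is a simplified stand-in for the existing sparse-bitmap lookup, and the counters exist only to measure hit rates during the experiment:

```c
#include <stdint.h>
#include <stdio.h>

#define N       16
#define N_MASK  ((1u << N) - 1u)

static uint32_t cache[1u << N];   /* NB: a real version must first be
                                     initialised to non-matching values;
                                     zero fill only mis-hits address 0 */

static uint32_t slot_of(uint32_t a) { return (a >> 2) & N_MASK; }

/* Simplified stand-in for the existing sparse-bitmap slow check:
   pretend addresses below 0x1000 are invalid, everything else valid. */
static int slow_check4(uint32_t a) { return a >= 0x1000; }

static long n_fast, n_slow;       /* hit-rate instrumentation         */

static void check_ACCESS4(uint32_t a)
{
   if (cache[slot_of(a)] == a) {  /* fast path: later done in-line    */
      n_fast++;
      return;
   }
   n_slow++;                      /* miss: consult the real bitmap    */
   if (!slow_check4(a & ~3u)) {
      fprintf(stderr, "invalid 4-byte access at %#x\n", (unsigned)a);
      return;
   }
   if ((a & 3u) == 0)             /* only aligned words may be cached */
      cache[slot_of(a)] = a;
}
```

Once the counters confirm the hit rate really is high on typical programs, the `cache[slot_of(a)] == a` line is the only part that needs to become generated x86.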
Anyway, if you can make this fly, I'd love to hear the outcome.
Please contact me (js...@ac...) and cc to the developers
mailing list val...@li....
From: Jeremy F. <je...@go...> - 2002-12-07 08:46:47
I fixed the ldt problem of the other day. The LDT stuff wasn't dealing
properly with a child thread inheriting a copy of the parent's LDT
state.
I did a comparison between the relative slowdown of P3 native:valgrind,
vs P4 native:valgrind. Previously the P4 was about twice as slow as the
P3, proportionally (that is, for a given benchmark, on the P3 a program
may have run, say, 10 times slower, whereas the same test run on a P4
would be about 20 times slower).
I'm pleased to say that the P3 and P4 are now equally slow - they're
both 5-10 times slower than native when run under Valgrind
(--skin=none). I suspect this is mostly to do with flags handling
improvements; pushf/popf must be proportionally worse for P4 than P3.
I also tried some experiments to try to batch together larger chunks of
compilation. I added the idea of "speculative translation", where
translating one basic block would attempt to follow jumps and translate
their targets too. Not surprisingly, doing this to every jump was
somewhat slower.
What is surprising is that when the speculation was reduced to following
the only direct jump in a basic block (ie, a jump to a basic block which
*must* be executed next), it is still a speed loss. I would have thought
that translating multiple basic blocks at once would take advantage of
the compiler being in the cache, and amortize the cost of various
self-modifying-code interlocks, etc.
I suspect that VG_(search_transtab) is the problem, since it collapses
into a linear scan of the entire TT when it is full and the address
you're searching for isn't present. Maybe some hashing will help.
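The kind of hashing that could replace the linear scan might be sketched like this. This is purely illustrative -- not the real VG_(search_transtab) data structures -- using open addressing with linear probing over guest-address/translation pairs:

```c
#include <stdint.h>
#include <stddef.h>

#define TT_HASH_SZ  (1u << 15)          /* power of two; tunable     */

typedef struct {
   uint32_t orig_addr;                  /* guest address; 0 = empty  */
   void    *trans_addr;                 /* host code translated from it */
} TTEntry;

static TTEntry tt_hash[TT_HASH_SZ];

static uint32_t tt_index(uint32_t a)
{
   /* Cheap multiplicative hash to spread mostly-aligned addresses.  */
   return (a * 2654435761u) & (TT_HASH_SZ - 1);
}

/* Assumes the table is never completely full; a real version would
   bound occupancy and evict or resize.                              */
static void tt_insert(uint32_t orig, void *trans)
{
   uint32_t i = tt_index(orig);
   while (tt_hash[i].orig_addr != 0)    /* linear probing            */
      i = (i + 1) & (TT_HASH_SZ - 1);
   tt_hash[i].orig_addr  = orig;
   tt_hash[i].trans_addr = trans;
}

static void *tt_search(uint32_t orig)
{
   uint32_t i = tt_index(orig);
   while (tt_hash[i].orig_addr != 0) {
      if (tt_hash[i].orig_addr == orig)
         return tt_hash[i].trans_addr;
      i = (i + 1) & (TT_HASH_SZ - 1);
   }
   return NULL;                         /* miss: needs translating   */
}
```

A miss then costs a short probe chain ending at an empty slot, instead of a scan of the whole TT.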
J
From: Jeremy F. <je...@go...> - 2002-12-07 07:27:31
On Fri, 2002-12-06 at 18:32, Jeremy Fitzhardinge wrote:
> It would be nice to change the helper calling convention into
> something non-flag smashing.
Duh. "lea N(%esp), %esp" is a perfectly good non-flag-smashing way to
fix the stack.
J
From: Jeremy F. <je...@go...> - 2002-12-07 02:32:11
On Fri, 2002-12-06 at 16:35, Julian Seward wrote:
> So I guess that's pretty much the end of the matter? It seems like the
> right fix to me; does it also to you?
Yes, that was the bug. It explains why it was so non-deterministic too
- it depended on the number of 1 bits in %esp.
It would be nice to change the helper calling convention into something
non-flag smashing.
> Now that this looks like this is going to work, is it worth keeping the
> fancy test stuff you did in synth_jcond_lit ? I ask because you have a
> better understanding of the ramifications of this latest eflags stuff,
> and so can probably make a better decision.
>
> If it's worth keeping ... I did actually write code to also do the
> 4 missing cases: (SF xor OF) == 0 or 1, ((SF xor OF) or ZF) == 0 or 1,
> and it did seem to help sometimes. But that was before your latest patch.
It doesn't make much difference for --skin=none, but it probably does
for skins which instrument. memcheck, for example, always trashes the
flags just before a conditional jump, so the fast-jcc path gets used
often. On the other hand, it may get lost in the overhead which the
instrumenting skins cause anyway (but that's probably the next focus for
performance improvement).
> I've been wondering a bit about the dismal performance on P4s. One thing
> that occurred to me is that the preamble sequence "decl bbs_to_go; jnz ..."
> is going to hit the P4's partial-flag-write penalty (page 2-55 of the p4
> optimisation guide, "Use of the inc and dec instructions"). It'd be
> interesting to try changing it to a subl $1, ... as they recommend. Or perhaps
> not ... according to their tables the latency difference is only 0.5 cycle.
Well, I was thinking about putting something in to tune the instruction
selection depending on the real CPU. It might be worth it (but I
suspect not in this case).
I still think it has more to do with trashing the trace cache rather
than any of the minor instruction selection issues. I'm doing an
experimental patch which does speculative translation (ie, if a BB's
final instruction is a lit32 jump, then translate the jump target too)
to see if that helps. It should strike a balance between killing the
cache with every translation and the over-translation that trace caching
would cause.
I still haven't had a chance to do any P4 instrumentation yet. Maybe
building Rabbit hooks into V would be a useful thing to do (oprofile
seems a bit crippled when faced with code without a file or symbols).
J
From: Julian S. <js...@ac...> - 2002-12-07 00:28:14
On Friday 06 December 2002 11:33 pm, Jeremy Fitzhardinge wrote:
> On Fri, 2002-12-06 at 15:11, Julian Seward wrote:
> > Re my analysis of stack-clearing add, I can't be exactly right, since
> > VG_(emit_add_lit_to_esp) begins with the correct call to new_emit.
>
> No, it isn't correct - True means "operate on Simd flags"; False means
> "non-simd flags".
>
> Fixing this seems to work, and even has a slight performance improvement
> (OO starts up in 47 rather than 48 seconds). At very least it seems
> performance neutral.

Well, changing that True to False makes OO and moz work fine for me,
which is great. So I guess that's pretty much the end of the matter?
It seems like the right fix to me; does it also to you?

--------

Now that this looks like this is going to work, is it worth keeping the
fancy test stuff you did in synth_jcond_lit ? I ask because you have a
better understanding of the ramifications of this latest eflags stuff,
and so can probably make a better decision.

If it's worth keeping ... I did actually write code to also do the
4 missing cases: (SF xor OF) == 0 or 1, ((SF xor OF) or ZF) == 0 or 1,
and it did seem to help sometimes. But that was before your latest patch.

If it's not worth keeping ... let's nuke it.

----------------

I've been wondering a bit about the dismal performance on P4s. One thing
that occurred to me is that the preamble sequence "decl bbs_to_go; jnz ..."
is going to hit the P4's partial-flag-write penalty (page 2-55 of the P4
optimisation guide, "Use of the inc and dec instructions"). It'd be
interesting to try changing it to a subl $1, ... as they recommend. Or
perhaps not ... according to their tables the latency difference is only
0.5 cycle.

J