From: Julian S. <js...@ac...> - 2002-11-20 23:45:05
This is the second of two messages about t-chaining. Get a coffee
before reading further; you'll need it.
I found it pretty surprising that bzip2 runs at a 10x slowdown
on nulgrind, even with t-chaining engaged. I know that program
pretty well :) and it's completely compute-bound, no I/O, no malloc,
no weird stuff.
So I've just started messing with the attached program, designed
to compute, have an easily-located inner loop :) and to minimise
absolutely all other effects.
On my PIII-1.13 GHz box, compiled -O -g, it takes 0.250 s natively
and 1.56 seconds on nulgrind, with bb chaining on and INCEIPs
not generated (i.e., our max-performance settings so far).
It runs for 25.2 million bbs, btw.
[Attached below are: the C code, the original, Ucode, and generated
code for the inner loop, and the same with intervening junk removed
for easier reading].
That's a lossage factor of 6.2:1. This _seems_ quite good, until
we start looking at the translation of the inner loop. The original
is 12 (action-packed) x86 insns. The expected-case through the
translation is 44 x86 insns + ud2; ud2; nop, possibly about 42
in total once the patching has been done.
Now I'm at a bit of a loss. Although insn counts are notoriously
crappy for predicting execution times, it seems a little weird that
increasing the number of insns by a factor of 3.5 increases the
cycle count by 6.2 times. The program should run, both natively
and on V, with zero misses in any caches. And, once chained, this
bb is jumping back to itself directly, so nothing else is going on.
So where the @*%£&?! are my missing cycles?
Particularly in view of the fact that the generated code is very
much in x86 RISC subset, and so should generate fewer uops/insn
than the original. For example, it contains several "movl (%esi), %esi".
Let's recap.
- There are no cache misses.
- There are no unpredictable branches.
- Each insn [in the resulting translation] needs either one
Address-Generation-Unit (leal), one Load-Store-Unit (movl) or one
ALU (add, sub, xor). So the bulk of it is unlikely to be stalling
due to a shortage of functional units.
- Perhaps stalls due to result forwarding? But the problem with
that is that (translation)
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
and original
addl 0xC(%reg1,%reg2),%reg3
I'd guess require the same functional units chained in the same order
(AGU -> LSU -> ALU) ... who knows?
- Decode limitations? PIII (P6 core) has this weird 4-1-1 rule, I don't
know if that's a problem. Surely, there are 8 groups of
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
and each would fit in a 4-1-1 uop pattern?
- So perhaps there are some rare insns which completely serialise the
machine? We've got two pushfls and two popfls (partially redundant, too)
and they are pretty rare, hence might get executed out of ucode ROM.
I'm mystified. Someone with a cycle-accurate PIII model needs to run this
and see where it stalls.
Nick, what sort of time ratios do you get on your Athlonic device for
this?
[useless numbers: 25.0 million trips thru this loop.
Natively, this is 12 orig insns/iteration, = 300 million
insns. In 0.250 seconds, this is 1200 MIPS; at 1133 MHz,
that's a CPI of 0.94, which is pretty much Not Bad.
Translated, this is approx 42 insns/iteration, = 1050
million insns. In 1.56 seconds, this is 673 MIPS; at 1133 MHz,
that's a CPI of 1.68. This is significantly worse for no
apparent reason.
]
I reckon there's a large heavy object lurking in the undergrowth
here. Now we just need to find it. Anyone fancy lifting out
the translated loop, giving it a suitable context and running it
on VTune, to find out? Is oprofile able to tell us exactly
which insn is stalling?
Congratulations to anyone who bothered to read this far.
J
---------------------------------------------------------------------------
Original C code:
#include <stdio.h>

int i_fit_comfortably_in_D1[1000];

int timewaster ( void )
{
   int i, j, sum;
   for (i = 0; i < 1000; i++) {
      i_fit_comfortably_in_D1[i] = i * 654321;
   }
   sum = 0;
   for (j = 0; j < 200000; j++) {
      for (i = 0; i < 1000; i += 8) {
         sum += i_fit_comfortably_in_D1[i];
         sum ^= i_fit_comfortably_in_D1[i+1];
         sum -= i_fit_comfortably_in_D1[i+2];
         sum += i_fit_comfortably_in_D1[i+3];
         sum += i_fit_comfortably_in_D1[i+4];
         sum ^= i_fit_comfortably_in_D1[i+5];
         sum -= i_fit_comfortably_in_D1[i+6];
         sum += i_fit_comfortably_in_D1[i+7];
      }
   }
   return sum;
}

int main ( void )
{
   printf ("the answer, you will be fascinated to know, is ... ");
   fflush(stdout);
   printf("0x%x\n", timewaster() );
   return 0;
}
---------------------------------------------------------------------------
Original x86 code for inner loop:
0x8048508: leal 0x0(,%ecx,4), %eax
0x804850F: addl (%eax,%ebx,1),%edx
0x8048512: xorl 4(%ebx,%eax,1),%edx
0x8048516: subl 8(%ebx,%eax,1),%edx
0x804851A: addl 12(%ebx,%eax,1),%edx
0x804851E: addl 16(%ebx,%eax,1),%edx
0x8048522: xorl 20(%ebx,%eax,1),%edx
0x8048526: subl 24(%ebx,%eax,1),%edx
0x804852A: addl 28(%ebx,%eax,1),%edx
0x804852E: addl $0x8, %ecx
0x8048531: cmpl $0x3E7, %ecx
0x8048537: jle-8 0x8048508
---------------------------------------------------------------------------
Cleaned-up ucode for inner loop:
0: GETL %ECX, t2
1: MOVL $0x0, t0
2: LEA2L 0(t0,t2,4), t0
3: PUTL t0, %EAX
4: GETL %EBX, t6
6: LEA2L 0(t0,t6,1), t4
7: LDL (t4), t4
8: ADDL %EDX, t4
12: LEA2L 4(t6,t0,1), t10
13: LDL (t10), t10
14: XORL t4, t10
18: LEA2L 8(t6,t0,1), t16
19: LDL (t16), t16
21: SUBL t16, t10
25: LEA2L 12(t6,t0,1), t24
26: LDL (t24), t24
27: ADDL t10, t24
31: LEA2L 16(t6,t0,1), t30
32: LDL (t30), t30
33: ADDL t24, t30
37: LEA2L 20(t6,t0,1), t36
38: LDL (t36), t36
39: XORL t30, t36
43: LEA2L 24(t6,t0,1), t42
44: LDL (t42), t42
46: SUBL t42, t36
50: LEA2L 28(t6,t0,1), t50
51: LDL (t50), t50
52: ADDL t36, t50
53: PUTL t50, %EDX
55: ADDL $0x8, t2
56: PUTL t2, %ECX
58: SUBL $0x3E7, t2 (-wOSZACP)
59: Jleo $0x8048508 (-rOSZACP)
60: JMPo $0x8048539 ($2)
---------------------------------------------------------------------------
Generated x86 code for inner loop:
0: FF 0D 0C FD 0D 40
decl (0x400DFD0C)
6: 75 06
jnz-8 %eip+6
8: BD 1D 00 00 00
movl $0x1D, %ebp
13: C3
ret
0: GETL %ECX, %eax [a-----]
14: 8B 45 04
movl 0x4(%ebp), %eax
1: MOVL $0x0, %ebx [ab----]
17: 31 DB
xorl %ebx, %ebx
2: LEA2L 0(%ebx,%eax,4), %ebx [ab----]
19: 8D 5C 83 00
leal 0x0(%ebx,%eax,4), %ebx
3: PUTL %ebx, %EAX [ab----]
23: 89 5D 00
movl %ebx, 0x0(%ebp)
4: GETL %EBX, %ecx [abc---]
26: 8B 4D 0C
movl 0xC(%ebp), %ecx
5: LEA2L 0(%ebx,%ecx,1), %edx [abcd--]
29: 8D 54 0B 00
leal 0x0(%ebx,%ecx,1), %edx
6: LDL (%edx), %edx [abcd--]
33: 8B 12
movl (%edx), %edx
7: ADDL %EDX, %edx [abcd--]
35: 03 55 08
addl 0x8(%ebp), %edx
8: LEA2L 4(%ecx,%ebx,1), %esi [abcdS-]
38: 8D 74 19 04
leal 0x4(%ecx,%ebx,1), %esi
9: LDL (%esi), %esi [abcdS-]
42: 8B 36
movl (%esi), %esi
10: XORL %edx, %esi [abc-S-]
44: 31 D6
xorl %edx, %esi
11: LEA2L 8(%ecx,%ebx,1), %edi [abc-SD]
46: 8D 7C 19 08
leal 0x8(%ecx,%ebx,1), %edi
12: LDL (%edi), %edi [abc-SD]
50: 8B 3F
movl (%edi), %edi
13: SUBL %edi, %esi [abc-S-]
52: 29 FE
subl %edi, %esi
14: LEA2L 12(%ecx,%ebx,1), %edx [abcdS-]
54: 8D 54 19 0C
leal 0xC(%ecx,%ebx,1), %edx
15: LDL (%edx), %edx [abcdS-]
58: 8B 12
movl (%edx), %edx
16: ADDL %esi, %edx [abcd--]
60: 01 F2
addl %esi, %edx
17: LEA2L 16(%ecx,%ebx,1), %esi [abcdS-]
62: 8D 74 19 10
leal 0x10(%ecx,%ebx,1), %esi
18: LDL (%esi), %esi [abcdS-]
66: 8B 36
movl (%esi), %esi
19: ADDL %edx, %esi [abc-S-]
68: 01 D6
addl %edx, %esi
20: LEA2L 20(%ecx,%ebx,1), %edx [abcdS-]
70: 8D 54 19 14
leal 0x14(%ecx,%ebx,1), %edx
21: LDL (%edx), %edx [abcdS-]
74: 8B 12
movl (%edx), %edx
22: XORL %esi, %edx [abcd--]
76: 31 F2
xorl %esi, %edx
23: LEA2L 24(%ecx,%ebx,1), %esi [abcdS-]
78: 8D 74 19 18
leal 0x18(%ecx,%ebx,1), %esi
24: LDL (%esi), %esi [abcdS-]
82: 8B 36
movl (%esi), %esi
25: SUBL %esi, %edx [abcd--]
84: 29 F2
subl %esi, %edx
26: LEA2L 28(%ecx,%ebx,1), %esi [a--dS-]
86: 8D 74 19 1C
leal 0x1C(%ecx,%ebx,1), %esi
27: LDL (%esi), %esi [a--dS-]
90: 8B 36
movl (%esi), %esi
28: ADDL %edx, %esi [a---S-]
92: 01 D6
addl %edx, %esi
29: PUTL %esi, %EDX [a-----]
94: 89 75 08
movl %esi, 0x8(%ebp)
30: ADDL $0x8, %eax [a-----]
97: 83 C0 08
addl $0x8, %eax
31: PUTL %eax, %ECX [a-----]
100: 89 45 04
movl %eax, 0x4(%ebp)
32: SUBL $0x3E7, %eax (-wOSZACP) [------]
103: FF 75 20 9D
pushl 32(%ebp) ; popfl
107: 81 E8 E7 03 00 00
subl $0x3E7, %eax
113: 9C 8F 45 20
pushfl ; popl 32(%ebp)
33: Jleo $0x8048508 (-rOSZACP) [------]
117: FF 75 20 9D
pushl 32(%ebp) ; popfl
121: 7F 0D
jnle-8 %eip+13
123: B8 08 85 04 08
movl $0x8048508, %eax
128: 89 45 24
movl %eax, 0x24(%ebp)
131: 0F 0B 0F 0B 90
ud2; ud2; nop
34: JMPo $0x8048539 ($2) [------]
136: B8 39 85 04 08
movl $0x8048539, %eax
141: 89 45 24
movl %eax, 0x24(%ebp)
144: 0F 0B 0F 0B 90
ud2; ud2; nop
---------------------------------------------------------------------------
Generated x86 code for inner loop, with crud removed, and bb start points
distinguished by a blank line.
decl (0x400DFD0C)
jnz-8 %eip+6
movl $0x1D, %ebp
ret
movl 0x4(%ebp), %eax
xorl %ebx, %ebx
leal 0x0(%ebx,%eax,4), %ebx
movl %ebx, 0x0(%ebp)
movl 0xC(%ebp), %ecx
leal 0x0(%ebx,%ecx,1), %edx
movl (%edx), %edx
addl 0x8(%ebp), %edx
leal 0x4(%ecx,%ebx,1), %esi
movl (%esi), %esi
xorl %edx, %esi
leal 0x8(%ecx,%ebx,1), %edi
movl (%edi), %edi
subl %edi, %esi
leal 0xC(%ecx,%ebx,1), %edx
movl (%edx), %edx
addl %esi, %edx
leal 0x10(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
leal 0x14(%ecx,%ebx,1), %edx
movl (%edx), %edx
xorl %esi, %edx
leal 0x18(%ecx,%ebx,1), %esi
movl (%esi), %esi
subl %esi, %edx
leal 0x1C(%ecx,%ebx,1), %esi
movl (%esi), %esi
addl %edx, %esi
movl %esi, 0x8(%ebp)
addl $0x8, %eax
movl %eax, 0x4(%ebp)
pushl 32(%ebp)
popfl
subl $0x3E7, %eax
pushfl
popl 32(%ebp)
pushl 32(%ebp)
popfl
jnle-8 %eip+13
movl $0x8048508, %eax
movl %eax, 0x24(%ebp)
ud2; ud2; nop
movl $0x8048539, %eax
movl %eax, 0x24(%ebp)
ud2; ud2; nop
From: Julian S. <js...@ac...> - 2002-11-20 22:14:02
[I was going to put two topics in here, but decided it's simpler to put
each in its own msg.]

You'll have to forgive me for not quite keeping up with you young 'uns.
I see the stuff on extended bbs but haven't grokked it properly yet.
Besides, I spent half the day in meetings with CPU architects, and if that
isn't enough to put one's head in a spin, I don't know what is.

Jeremy, I've built it with the 47-chained-bb patch and it works.

A question: what's the deal on the LRU mechanism now that chaining is
going? Do chained translations get marked when they get used, now? Not
afaics ... but this is going to present something of a problem when
LRUing, because chained ones will appear to not have been used, and so
will be much more likely to be dumped. Perhaps we can do something clever
based on the idea that if a T has been chained it must have been used at
least once since the last LRU pass. Not critical, but something which we
need to have a plausible story on.

J
From: Jeremy F. <je...@go...> - 2002-11-20 17:05:25
|
On Wed, 2002-11-20 at 02:08, Nicholas Nethercote wrote:
> I agree in principle. One question -- how would this affect the UCode?
> Would it look the same (eg. with all the %ESP PUTs and GETs) and then the
> code generator would combine multiple UInstrs into single x86 instrs? Or
> would the UCode look different.
I would definitely go with making it a special case of GET/PUT (ie, it
looks the same to the skins, and only register allocation/codegen cares
about the difference).
> As for saving/restoring %ESP on C code entry/exit, that would make each
> CCALL two instructions more expensive. For skins doing lots of CCALLs
> (eg. cachegrind) that might be a loss -- the CCALL extra expense could
> outweight the %ESP savings. Not sure. I suppose you could have a skin
> "lots_of_CCALLs" `need' that switches between the two approaches, but that
> would be a pain.
Cachegrind does a CCALL for every original instruction? In that case it
might get a bit slow. One way to fix this is to consider that ESP has
one of two homes: a set register and VGOFF_(m_esp)(%ebp). So long as we
guarantee that esp is in the right place at the end of a basic block, we
can do anything we like within it. So we can apply a lazy move strategy
to it, so if we save it in the baseBlock for a CCALL, we need only
retrieve it again if it gets modified or we're leaving the basic block.
> A related point: I've noticed that for skins inserting CCALLs, often a
> register is set and not touched for quite some time -- its live range is
> big. If the register is a caller-save one (%e[acd]x) and one or more
> CCALLs occur between the set and use points, it would be better to spill
> it.
>
> For example:
> ...
> For each CCALL, %eax has to be saved/restored. If there are 2 or more
> CCALLs between %eax set and use, it would be better to set %eax and then
> immediately spill it, and then unspill it immediately before use.
Even better: If you've gone to the effort of spilling/saving on the
first ccall, then don't restore it until it's actually needed. That is,
rather than pushing the live regs, just save them back into baseBlock,
and reload them when they're needed.
A harder problem is deciding when to do that in the absence of CCALLs,
since a long-life unused register is just causing pressure.
> I'm not sure. They have a "VPC" which I assume is a "virtual PC" in some
> of their diagrams. There's lots of detail in one of the papers about
> handling unusual control flow in Windows (callbacks, asynchronous
> procedure calls, exceptions, setjmp/longjmp) but I don't understand that
> stuff much at all. I'm not much use, sorry.
Last night I convinced myself that it is necessary to maintain the
virtual EIP while running, but now I can't think why. I think it can
always be computed on demand rather than needing to be maintained.
J
From: Jeremy F. <je...@go...> - 2002-11-20 16:46:31
On Wed, 2002-11-20 at 04:23, Nicholas Nethercote wrote:
> Does this mean that the fall-through JMP back to the dispatcher is
> replaced by the following basic block? ie. you're allowing multiple exits
> from a basic block?
>
> ie.
>
> [code]
> Jcc X
> JMP 0xY
>
> becomes
>
> [code]
> Jcc ...
> [code for block at 0xY]
>
> ?
>
> Interesting that this doesn't make a definite improvement.
That in itself should be a definite improvement. The problem is that
it's going to increase the rate of redundant translations. If you have
the extended basic block:
1: ...
jnz 3
2: ...
jeq 2
3: ...
jz 5
4: ...
5: ...
jmp 4
Then you're going to end up translating some instructions 5 times.
> Another way to extend basic blocks is to inline direct calls in the same
> way, eg:
>
> [code]
> JMP-c 0xX
>
> becomes
>
> [code]
> [code for block at 0xX]
Right. So you'd want to do some branch-weight measurements first to see
which way around it should go.
> Problem with that is that it screws up function-entry detection for skins
> -- currently to detect function entry they can call
> VG_(get_fnname_if_entry)() at the start of a basic block to determine if
> it's the first basic block in a function. With this optimisation they'd
> have to do it for every x86 instruction, urgh.
To do this, you'd need a SETEIP UInstr to keep the EIP up to date, even
if there isn't a jump. Even if you weren't keeping EIP up to date, you
could have a UInstr for representing the start of a basic block.
> Similarly direct JMPs seem pretty common, perhaps they too could be
> inlined:
>
> [code]
> JMP 0xY
>
> becomes
>
> [code]
> [code for block at 0xY]
>
> ----
>
> This sort of inlining seems nicer than chaining. Would it be hard?
This is what I've been referring to as trace caching. The problem is
that it's hard to know when to stop, as a tradeoff between runtime
efficiency and repeated compilation. For example, this is the schematic
of a switch in an infinite loop:
1: cmp ...
jeq 5
cmp ...
jle 4
cmp ...
jgt 3
cmp ...
jne 2
...
jmp 6
2: ...
jmp 6
3: ...
jmp 6
4: ...
jmp 6
5: ...
6: ...
jmp 1
It seems to me that you're going to have to work very hard to amortise
the cost of doing all that compilation.
Dynamo does this, but only after running the code for a while to see
that it's worth the cost of extra compilation. We don't have the
machinery to do "regenerate after instrumenting it for a while", and
that's where the complexity in this idea is. The actual following-jmps
is easy.
> Is it any different in principle to what Jeremy's already implemented?
Chaining is almost as efficient as this, but it has the advantage of not
incurring as much compile-time overhead and only incremental runtime
overhead.
J
From: Nicholas N. <nj...@ca...> - 2002-11-20 15:28:03
On Wed, 20 Nov 2002, Josef Weidendorfer wrote:

> > Another way to extend basic blocks is to inline direct calls in the same
> > way, eg:
> >
> > [code]
> > JMP-c 0xX
> >
> > becomes
> >
> > [code]
> > [code for block at 0xX]
> >
> > Problem with that is that it screws up function-entry detection for skins
> > -- currently to detect function entry they can call
> > VG_(get_fnname_if_entry)() at the start of a basic block to determine if
> > it's the first basic block in a function. With this optimisation they'd
> > have to do it for every x86 instruction, urgh.
>
> Is this really a problem?
> I thought VG_(get_fnname_if_entry)() is to be called at instrumentation time?
> And BB chaining should be done *after* skin instrumentation?

I'm talking about merging together more than one basic block into a larger
basic block. In this new larger basic block, the instructions from the
start of the called function will appear in the middle of the basic block.
Does that make the problem clearer?

N
From: Josef W. <Jos...@gm...> - 2002-11-20 15:20:22
On Wednesday 20 November 2002 13:23, Nicholas Nethercote wrote:

> ...
> Another way to extend basic blocks is to inline direct calls in the same
> way, eg:
>
> [code]
> JMP-c 0xX
>
> becomes
>
> [code]
> [code for block at 0xX]
>
> Problem with that is that it screws up function-entry detection for skins
> -- currently to detect function entry they can call
> VG_(get_fnname_if_entry)() at the start of a basic block to determine if
> it's the first basic block in a function. With this optimisation they'd
> have to do it for every x86 instruction, urgh.

Is this really a problem? I thought VG_(get_fnname_if_entry)() is to be
called at instrumentation time? And BB chaining should be done *after*
skin instrumentation?

If a skin is "self-contained" in the instrumentations it's doing, there
should be no problem at all with this fancy BB chaining, trace cache etc.

Example: In my calltree-patch, I add a helper function when I detect a
CALL UInstr. Thus, the skin instrumentation has to modify the UInstrs of a
BB *before* any chaining is done. But afterwards, the UInstrs of the BB
contain the call to my helper, together with the jump target as argument.
There's no problem with getting rid of the CALL UInstr itself for BB
chaining.

Josef
From: Nicholas N. <nj...@ca...> - 2002-11-20 12:23:57
On 19 Nov 2002, Jeremy Fitzhardinge wrote:

> Well, I found that there is an easy way of getting V to generate
> extended basic blocks: simply allow conditional jumps to not end the
> basic block (the code was all set up to do it, complete with
> almost-accurate comment). I haven't done an overall measurement of what
> the average increase in BB length is, but from looking at the output of
> --trace-codegen, they can get quite long. It certainly allows us to see
> what the effect of having long register lifetimes is vs.
> indiscriminately over-compiling things. The results are far from
> conclusive: it seems to vary between a few percent better to a few
> percent worse (and it also seems to depend on the skin, though I haven't
> really tested that yet).

Does this mean that the fall-through JMP back to the dispatcher is
replaced by the following basic block? ie. you're allowing multiple exits
from a basic block?

ie.

[code]
Jcc X
JMP 0xY

becomes

[code]
Jcc ...
[code for block at 0xY]

?

Interesting that this doesn't make a definite improvement.

----

Another way to extend basic blocks is to inline direct calls in the same
way, eg:

[code]
JMP-c 0xX

becomes

[code]
[code for block at 0xX]

Problem with that is that it screws up function-entry detection for skins
-- currently to detect function entry they can call
VG_(get_fnname_if_entry)() at the start of a basic block to determine if
it's the first basic block in a function. With this optimisation they'd
have to do it for every x86 instruction, urgh.

----

Similarly direct JMPs seem pretty common, perhaps they too could be
inlined:

[code]
JMP 0xY

becomes

[code]
[code for block at 0xY]

----

This sort of inlining seems nicer than chaining. Would it be hard? Is it
any different in principle to what Jeremy's already implemented?

N
From: Nicholas N. <nj...@ca...> - 2002-11-20 10:09:02
On Tue, 19 Nov 2002, Julian Seward wrote:

> > Perhaps a slightly simpler approach would be to dedicate a register to
> > ESP rather than a memory location, so that it is persistently in
> > register across basic blocks.
> > [...]
> > Will still need to work out appropriate times to save ESP into the
> > baseblock for when control enters C code.
>
> Yes, this seems a good compromise. We already played with the issue of
> 5 vs 6 allocatable registers and it makes very little difference at least
> to memcheck, so perhaps it would be ok to give one to %ESP.

I agree in principle. One question -- how would this affect the UCode?
Would it look the same (eg. with all the %ESP PUTs and GETs) and then the
code generator would combine multiple UInstrs into single x86 instrs? Or
would the UCode look different.

If the latter, how would this interact with MemCheck's instrumentation?
It feels to me like this would be a good thing to keep within the code
generator if possible, so that %ESP is not treated any differently from a
UCode point-of-view, which makes life easier for skins.

As for saving/restoring %ESP on C code entry/exit, that would make each
CCALL two instructions more expensive. For skins doing lots of CCALLs
(eg. cachegrind) that might be a loss -- the CCALL extra expense could
outweigh the %ESP savings. Not sure. I suppose you could have a skin
"lots_of_CCALLs" `need' that switches between the two approaches, but
that would be a pain.

> > 2. Using trace caching will lengthen the effective size of a basic
> > block, and thereby amortize the register load/save over more
> > instructions

A related point: I've noticed that for skins inserting CCALLs, often a
register is set and not touched for quite some time -- its live range is
big. If the register is a caller-save one (%e[acd]x) and one or more
CCALLs occur between the set and use points, it would be better to spill
it.

For example:

set %eax
... code not using %eax, with multiple CCALLs ...
use %eax

For each CCALL, %eax has to be saved/restored. If there are 2 or more
CCALLs between %eax set and use, it would be better to set %eax and then
immediately spill it, and then unspill it immediately before use.

I'm not even sure if this happens a lot, but if it is maybe something
could be done in register allocation to avoid it.

----

> Nick, one thing you seemed to not comment on is ...
>
> > addl $2, 36(%ebp) (the INCEIP)
>
> ie, we keep exact track of %eip; does DynamoRIO? What sort of
> exception model are they presenting?

I'm not sure. They have a "VPC" which I assume is a "virtual PC" in some
of their diagrams. There's lots of detail in one of the papers about
handling unusual control flow in Windows (callbacks, asynchronous
procedure calls, exceptions, setjmp/longjmp) but I don't understand that
stuff much at all. I'm not much use, sorry.

N
From: Jeremy F. <je...@go...> - 2002-11-20 09:57:11
On Wed, 2002-11-20 at 01:08, Julian Seward wrote:
> I'd still like to know why bzip2 runs at approx a 10x slowdown: if the
> translations are chained, there are no callouts to helpers with
> --skin=none, so the only thing happening is jumping between translations.
> In which case I'd expect a 10x code size increase, but it's more like
> 4:1. So what's going on, I wonder?
Well, any indirect jumps (which includes calls into dynamic libraries
and returns) are still going through the dispatch loop. I think I've
worked out how to extend the chaining mechanism to handle indirect jumps
(though I think returns are still likely to be tricky, unless you've got
functions with a limited number of call-sites).
> It might be helpful to do this benchmarking with a simple microbenchmark
> with only a couple of hot bbs in the inner loop and no I/O, to factor out
> those effects.
I'm measuring user time, so it should factor out time waiting for I/O
and time spent in the kernel.
> Easy to test the net effect; disable %EIP generation altogether
> (one-liner in vg_to_ucode.c). I bet it gives another 10% or so.
> I would try it now but I have to rush off and duke it out with ARM
> code all day :-)
Good guess - it gets 20% on bzip2, 10% on cc1 (that's still keeping EIP
up to date once per basic block). I think you're right about computing
the EIP as we need it rather than keeping it up to date. I guess we
could tack a table onto the generated code itself or something.
baseline: bzip2 < TAGS > /dev/null
time=0.49s
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=yes bzip2 < TAGS > /dev/null
time=6.09s ratio:12.4
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=yes bzip2 < TAGS > /dev/null
time=4.85s ratio:9.8
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=no bzip2 < TAGS > /dev/null
time=5.29s ratio:10.7
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=no bzip2 < TAGS > /dev/null
time=4.04s ratio:8.2
baseline: /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=4.47s
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=80.11s ratio:17.9
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=58.96s ratio:13.1
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=74.03s ratio:16.5
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=53.51s ratio:11.9
Bedtime.
J
From: Nicholas N. <nj...@ca...> - 2002-11-20 09:23:52
On Wed, 20 Nov 2002, Julian Seward wrote:

> Another obvious lemon is the INCEIP nonsense (of my own devising :)
> For every insn executed, except for the last in each bb, there is a
> corresponding INCEIP, which becomes something like
> addl $insn_size, 36(%ebp). That's probably expensive; it's 3 microops
> for modern CPUs (load-op-store). All because we need an up-to-date %EIP
> in some very rare circumstances: when taking a snapshot of the stack that
> might conceivably get passed to the user, and when delivering signals.
> These are relatively rare. I wonder if we could dispense with the eip
> and instead associate with each bb a small table of offsets which indicate
> how to calculate the simulated %EIP from the value it was set at at the
> start of the block and the current distance that the real %eip is inside
> the translation now. If you see what I mean.

I was fiddling with the redundant-INCEIP-removal code yesterday (which is
currently commented out). But this table-of-EIP-offsets sounds like a much
better idea. (And because of my interface changes, it should be possible
to add an extra field to UCodeBlock without breaking binary
compatibility :)

> Easy to test the net effect; disable %EIP generation altogether
> (one-liner in vg_to_ucode.c). I bet it gives another 10% or so.
> I would try it now but I have to rush off and duke it out with ARM
> code all day :-)

I did this yesterday. For gzip with --skin=none, the time difference was
smallish (16.95s --> 16.13s, about 5%) but the code size difference was
big (4.6x --> 3.6x, about 22%). For MemCheck and Cachegrind, INCEIP
instructions account for about 10% and 12% of code size respectively. I
think the space savings are representative for a wide range of programs.

So, to summarise: INCEIP removal via a table of offsets shouldn't be too
hard (I think?) and would definitely be worth it even if only for the
space savings.

N
From: Nicholas N. <nj...@ca...> - 2002-11-20 09:04:51
On Tue, 19 Nov 2002, Julian Seward wrote:

> Hi.  Am trying to be a bit more diligent re testsuite failures
> now that I've written down how to run it on a bit of paper :)
>
> Just did cvs up, rebuild from scratch.  I've got 3 fails, which is
> a whole lot better than a few days ago.  Are they expected?
>
> == Failed tests ===============================
> memcheck/tests/memalign_test stderr
> memcheck/tests/supp2 stderr
> memcheck/tests/tronical stderr

tronical is.  The other two don't fail for me.  It's surprising that only
two fail unexpectedly on your machine; since I'm the only one who has ever
used the regtests, I thought there might be more my-machine-specific stuff
in there.

Can you send me the memcheck/tests/*.stderr.out files for memalign_test
and supp2?  Thanks.

N
From: Julian S. <js...@ac...> - 2002-11-20 09:01:58
[... numbers re translation chaining and trace caching ...]

> Oh, yes, I quite agree.  The results above make me think that doing
> naive trace caching probably won't help, but eliminating dispatch-loop
> calls probably will.

Good work.  It's pretty clear that t-chaining is worth having.  Not sure
about the others.

I'd still like to know why bzip2 runs at approx a 10x slowdown.  If the
translations are chained, there are no callouts to helpers with
--skin=none, so the only thing happening is jumping between translations.
In which case I'd expect a 10x code size increase, but it's more like 4:1.
So what's going on, I wonder?

It might be helpful to do this benchmarking with a simple microbenchmark
with only a couple of hot bbs in the inner loop and no I/O, to factor out
those effects.

-----------

Another obvious lemon is the INCEIP nonsense (of my own devising :)

For every insn executed, except for the last in each bb, there is a
corresponding INCEIP, which becomes something like
addl $insn_size, 36(%ebp).  That's probably expensive; it's 3 microops for
modern CPUs (load-op-store).  All because we need an up-to-date %EIP in
some very rare circumstances: when taking a snapshot of the stack that
might conceivably get passed to the user, and when delivering signals.
These are relatively rare.

I wonder if we could dispense with the eip and instead associate with each
bb a small table of offsets which indicate how to calculate the simulated
%EIP from the value it was set at at the start of the block and the
current distance that the real %eip is inside the translation now.  If you
see what I mean.

Easy to test the net effect; disable %EIP generation altogether (one-liner
in vg_to_ucode.c).  I bet it gives another 10% or so.  I would try it now
but I have to rush off and duke it out with ARM code all day :-)

J
From: Julian S. <js...@ac...> - 2002-11-20 08:15:44
Hi.  After applying 00-lazy-fp.patch, 01-partial-mul.patch and
47-chained-bb.patch, I get this:

gcc -DHAVE_CONFIG_H -I. -I. -I../.. -I../../coregrind -I../../include
    -Winline -Wall -Wshadow -O -fomit-frame-pointer -g -Wno-unused
    -Wno-shadow -c cp-demangle.c -UHAVE_CONFIG_H
In file included from cp-demangle.c:43:
../../coregrind/vg_include.h:1051: `VG_MAX_JUMPS' undeclared here (not in a function)
../../coregrind/vg_include.h:1087: `VG_MAX_JUMPS' undeclared here (not in a function)
../../coregrind/vg_include.h:1471: `VG_MAX_JUMPS' undeclared here (not in a function)
../../coregrind/vg_include.h:1473: confused by earlier errors, bailing out

Ummm ... ?

J
From: Jeremy F. <je...@go...> - 2002-11-20 01:02:30
On Tue, 2002-11-19 at 16:26, Josef Weidendorfer wrote:
> Hi,
>
> may I ask what's the purpose of
>
> /* Kludge ... */
> si->offset
> = si->start==VG_ASSUMED_EXE_BASE ? 0 : si->start;
>
> in vg_symtab2.c ??
[...]
> BTW, PLT and GOT ranges were always calculated wrong for executables, as these
> are calculated from file offset + si->offset, and si->offset is 0 for
> executables...
>
> Solutions?
Use 1.1.90? Both of these should be fixed.
J
From: Josef W. <Jos...@gm...> - 2002-11-20 00:42:25
Hi,

may I ask what's the purpose of

   /* Kludge ... */
   si->offset
      = si->start==VG_ASSUMED_EXE_BASE ? 0 : si->start;

in vg_symtab2.c ??

I just installed glibc 2.3.1 with prelink support on my systems.
If libraries are prelinked, valgrind doesn't load any symbols at all :-(
That's because in prelinked libraries, for a symbol entry <i>,

   o_symtab[i].st_value

is not an offset, but a real address.  So we add

   real_address + si->offset

and get doubled addresses, which are thus refused.

Not using this offset at all, and calculating a symbol's address by

   sym_addr = (UInt)o_symtab[i].st_value;
   if (sym_addr < si->start) sym_addr += si->start;

seems to be fine.

But I'm not comfortable with the second line: for small segment start
addresses, this could go wrong...

BTW, PLT and GOT ranges were always calculated wrong for executables, as
these are calculated from file offset + si->offset, and si->offset is 0
for executables...

Solutions?

Josef