From: Jeremy F. <je...@go...> - 2002-11-20 17:05:25
|
On Wed, 2002-11-20 at 02:08, Nicholas Nethercote wrote:
> I agree in principle. One question -- how would this affect the UCode?
> Would it look the same (eg. with all the %ESP PUTs and GETs) and then the
> code generator would combine multiple UInstrs into single x86 instrs? Or
> would the UCode look different.
I would definitely go with making it a special case of GET/PUT (ie, it
looks the same to the skins, and only register allocation/codegen cares
about the difference).
> As for saving/restoring %ESP on C code entry/exit, that would make each
> CCALL two instructions more expensive. For skins doing lots of CCALLs
> (eg. cachegrind) that might be a loss -- the CCALL extra expense could
> outweigh the %ESP savings. Not sure. I suppose you could have a skin
> "lots_of_CCALLs" `need' that switches between the two approaches, but that
> would be a pain.
Cachegrind does a CCALL for every original instruction? In that case it
might get a bit slow. One way to fix this is to consider that ESP has
one of two homes: a set register and VGOFF_(m_esp)(%ebp). So long as we
guarantee that esp is in the right place at the end of a basic block, we
can do anything we like within it. So we can apply a lazy move strategy
to it, so if we save it in the baseBlock for a CCALL, we need only
retrieve it again if it gets modified or we're leaving the basic block.
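[Editor's note: the two-home scheme described above can be sketched as a small state machine. This is an illustration only, with hypothetical names, not Valgrind's actual UCode machinery.]

```c
#include <assert.h>
#include <stdbool.h>

/* %ESP has two homes: a dedicated real register and its baseBlock
 * slot (VGOFF_(m_esp)(%ebp)).  Track which copy is current and count
 * the moves a lazy strategy actually emits. */
typedef struct {
    bool in_reg;  /* true: the register copy is up to date */
    int  moves;   /* save/reload instructions emitted */
} EspState;

/* Before calling out to C, the memory slot must be current. */
static void before_ccall(EspState *s) {
    if (s->in_reg) { s->moves++; s->in_reg = false; }  /* save to baseBlock */
}

/* Only fetch %ESP back into the register when code actually uses it. */
static void before_esp_use(EspState *s) {
    if (!s->in_reg) { s->moves++; s->in_reg = true; }  /* reload */
}

/* At the end of a basic block, %ESP must be back in its slot. */
static void end_of_bb(EspState *s) { before_ccall(s); }
```

The payoff: three back-to-back CCALLs with no intervening %ESP use cost one save, not three save/restore pairs.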
> A related point: I've noticed that for skins inserting CCALLs, often a
> register is set and not touched for quite some time -- its live range is
> big. If the register is a caller-save one (%e[acd]x) and one or more
> CCALLs occur between the set and use points, it would be better to spill
> it.
>
> For example:
> ...
> For each CCALL, %eax has to be saved/restored. If there are 2 or more
> CCALLs between %eax set and use, it would be better to set %eax and then
> immediately spill it, and then unspill it immediately before use.
Even better: If you've gone to the effort of spilling/saving on the
first ccall, then don't restore it until it's actually needed. That is,
rather than pushing the live regs, just save them back into baseBlock,
and reload them when they're needed.
A harder problem is deciding when to do that in the absence of CCALLs,
since a long-life unused register is just causing pressure.
> I'm not sure. They have a "VPC" which I assume is a "virtual PC" in some
> of their diagrams. There's lots of detail in one of the papers about
> handling unusual control flow in Windows (callbacks, asynchronous
> procedure calls, exceptions, setjmp/longjmp) but I don't understand that
> stuff much at all. I'm not much use, sorry.
Last night I convinced myself that it is necessary to maintain the
virtual EIP while running, but now I can't think why. I think it can
always be computed on demand rather than needing to be maintained.
J
|
|
From: Jeremy F. <je...@go...> - 2002-11-20 16:46:31
|
On Wed, 2002-11-20 at 04:23, Nicholas Nethercote wrote:
> Does this mean that the fall-through JMP back to the dispatcher is
> replaced by the following basic block? ie. you're allowing multiple exits
> from a basic block?
>
> ie.
>
> [code]
> Jcc X
> JMP 0xY
>
> becomes
>
> [code]
> Jcc ...
> [code for block at 0xY]
>
> ?
>
> Interesting that this doesn't make a definite improvement.
That in itself should be a definite improvement. The problem is that
it's going to increase the rate of redundant translations. If you have
the extended basic block:
1: ...
jnz 3
2: ...
jeq 2
3: ...
jz 5
4: ...
5: ...
jmp 4
Then you're going to end up translating some instructions 5 times.
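[Editor's note: the over-translation predicted for the five-block example above can be counted mechanically. This is a sketch modelling only fall-through extension through conditional jumps; it is not Valgrind code.]

```c
#include <assert.h>

enum { NBLOCKS = 5 };

/* Translation entry points: block 1 itself plus every jump target
 * in the example (jnz 3, jeq 2, jz 5, jmp 4) -- i.e. all five. */
static const int entries[] = { 1, 2, 3, 4, 5 };

/* With conditional jumps no longer ending a basic block, a
 * translation started at block b falls through b, b+1, ..., 5;
 * only the unconditional jump after block 5 ends it. */
static void translate_from(int b, int copies[NBLOCKS + 1]) {
    for (int i = b; i <= NBLOCKS; i++)
        copies[i]++;
}
```

Running all five entry points through this model shows block 5's instructions landing in five separate translations, matching the "5 times" estimate.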
> Another way to extend basic blocks is to inline direct calls in the same
> way, eg:
>
> [code]
> JMP-c 0xX
>
> becomes
>
> [code]
> [code for block at 0xX]
Right. So you'd want to do some branch-weight measurements first to see
which way around it should go.
> Problem with that is that it screws up function-entry detection for skins
> -- currently to detect function entry they can call
> VG_(get_fnname_if_entry)() at the start of a basic block to determine if
> it's the first basic block in a function. With this optimisation they'd
> have to do it for every x86 instruction, urgh.
To do this, you'd need a SETEIP UInstr to keep the EIP up to date, even
if there isn't a jump. Even if you weren't keeping EIP up to date, you
could have a UInstr for representing the start of a basic block.
> Similarly direct JMPs seem pretty common, perhaps they too could be
> inlined:
>
> [code]
> JMP 0xY
>
> becomes
>
> [code]
> [code for block at 0xY]
>
> ----
>
> This sort of inlining seems nicer than chaining. Would it be hard?
This is what I've been referring to as trace caching. The problem is
that it's hard to know when to stop, as a tradeoff between runtime
efficiency and repeated compilation. For example, this is the schematic
of a switch in an infinite loop:
1: cmp ...
jeq 5
cmp ...
jle 4
cmp ...
jgt 3
cmp ...
jne 2
...
jmp 6
2: ...
jmp 6
3: ...
jmp 6
4: ...
jmp 6
5: ...
6: ...
jmp 1
It seems to me that you're going to have to work very hard to amortise
the cost of doing all that compilation.
Dynamo does this, but only after running the code for a while to see
that it's worth the cost of extra compilation. We don't have the
machinery to do "regenerate after instrumenting it for a while", and
that's where the complexity in this idea is. The actual following-jmps
is easy.
> Is it any different in principle to what Jeremy's already implemented?
Chaining is almost as efficient as this, but it has the advantage of not
incurring as much compile-time overhead and only incremental runtime
overhead.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-20 15:28:03
|
On Wed, 20 Nov 2002, Josef Weidendorfer wrote:
> > Another way to extend basic blocks is to inline direct calls in the same
> > way, eg:
> >
> > [code]
> > JMP-c 0xX
> >
> > becomes
> >
> > [code]
> > [code for block at 0xX]
> >
> > Problem with that is that it screws up function-entry detection for skins
> > -- currently to detect function entry they can call
> > VG_(get_fnname_if_entry)() at the start of a basic block to determine if
> > it's the first basic block in a function. With this optimisation they'd
> > have to do it for every x86 instruction, urgh.
>
> Is this really a problem?
> I thought VG_(get_fnname_if_entry)() is to be called at instrumentation time?
> And BB chaining should be done *after* skin instrumentation?

I'm talking about merging together more than one basic block into a larger
basic block. In this new larger basic block, the instructions from the start
of the called function will appear in the middle of the basic block. Does
that make the problem clearer?

N
|
From: Josef W. <Jos...@gm...> - 2002-11-20 15:20:22
|
On Wednesday 20 November 2002 13:23, Nicholas Nethercote wrote:
> ...
> Another way to extend basic blocks is to inline direct calls in the same
> way, eg:
>
> [code]
> JMP-c 0xX
>
> becomes
>
> [code]
> [code for block at 0xX]
>
> Problem with that is that it screws up function-entry detection for skins
> -- currently to detect function entry they can call
> VG_(get_fnname_if_entry)() at the start of a basic block to determine if
> it's the first basic block in a function. With this optimisation they'd
> have to do it for every x86 instruction, urgh.

Is this really a problem?
I thought VG_(get_fnname_if_entry)() is to be called at instrumentation time?
And BB chaining should be done *after* skin instrumentation?

If a skin is "self-contained" in the instrumentations it's doing, there
should be no problem at all with this fancy BB chaining, trace cache etc.

Example: In my calltree-patch, I add a helper function when I detect a CALL
UInstr. Thus, the skin instrumentation has to modify the UInstrs of a BB
*before* any chaining is done.

But afterwards, the UInstrs of the BB contain the call to my helper,
together with the jump target as argument. There's no problem with getting
rid of the CALL UInstr itself for BB chaining.

Josef
|
From: Nicholas N. <nj...@ca...> - 2002-11-20 12:23:57
|
On 19 Nov 2002, Jeremy Fitzhardinge wrote:
> Well, I found that there is an easy way of getting V to generate
> extended basic blocks: simply allow conditional jumps to not end the
> basic block (the code was all set up to do it, complete with
> almost-accurate comment). I haven't done an overall measurement of what
> the average increase in BB length is, but from looking at the output of
> --trace-codegen, they can get quite long. It certainly allows us to see
> what the effect of having long register lifetimes is vs.
> indiscriminately over-compiling things. The results are far from
> conclusive: it seems to vary between a few percent better to a few
> percent worse (and it also seems to depend on the skin, though I haven't
> really tested that yet).

Does this mean that the fall-through JMP back to the dispatcher is
replaced by the following basic block? ie. you're allowing multiple exits
from a basic block?

ie.

[code]
Jcc X
JMP 0xY

becomes

[code]
Jcc ...
[code for block at 0xY]

?

Interesting that this doesn't make a definite improvement.

----

Another way to extend basic blocks is to inline direct calls in the same
way, eg:

[code]
JMP-c 0xX

becomes

[code]
[code for block at 0xX]

Problem with that is that it screws up function-entry detection for skins
-- currently to detect function entry they can call
VG_(get_fnname_if_entry)() at the start of a basic block to determine if
it's the first basic block in a function. With this optimisation they'd
have to do it for every x86 instruction, urgh.

----

Similarly, direct JMPs seem pretty common; perhaps they too could be
inlined:

[code]
JMP 0xY

becomes

[code]
[code for block at 0xY]

----

This sort of inlining seems nicer than chaining. Would it be hard?
Is it any different in principle to what Jeremy's already implemented?

N
|
From: Nicholas N. <nj...@ca...> - 2002-11-20 10:09:02
|
On Tue, 19 Nov 2002, Julian Seward wrote:
> > Perhaps a slightly simpler approach would be to dedicate a register to
> > ESP rather than a memory location, so that it is persistently in
> > register across basic blocks.
> > [...]
> > Will still need to work out appropriate times to save ESP into the
> > baseblock for when control enters C code.
>
> Yes, this seems a good compromise. We already played with the issue of
> 5 vs 6 allocatable registers and it makes very little difference at least
> to memcheck, so perhaps it would be ok to give one to %ESP.

I agree in principle. One question -- how would this affect the UCode?
Would it look the same (eg. with all the %ESP PUTs and GETs) and then the
code generator would combine multiple UInstrs into single x86 instrs? Or
would the UCode look different?

If the latter, how would this interact with MemCheck's instrumentation?
It feels to me like this would be a good thing to keep within the code
generator if possible, so that %ESP is not treated any differently from a
UCode point-of-view, which makes life easier for skins.

As for saving/restoring %ESP on C code entry/exit, that would make each
CCALL two instructions more expensive. For skins doing lots of CCALLs
(eg. cachegrind) that might be a loss -- the extra CCALL expense could
outweigh the %ESP savings. Not sure. I suppose you could have a skin
"lots_of_CCALLs" `need' that switches between the two approaches, but that
would be a pain.

> > 2. Using trace caching will lengthen the effective size of a basic
> > block, and thereby amortize the register load/save over more
> > instructions

A related point: I've noticed that for skins inserting CCALLs, often a
register is set and not touched for quite some time -- its live range is
big. If the register is a caller-save one (%e[acd]x) and one or more
CCALLs occur between the set and use points, it would be better to spill
it.

For example:

set %eax
... code not using %eax, with multiple CCALLs ...
use %eax

For each CCALL, %eax has to be saved/restored. If there are 2 or more
CCALLs between %eax set and use, it would be better to set %eax and then
immediately spill it, and then unspill it immediately before use.

I'm not even sure if this happens a lot, but if it does maybe something
could be done in register allocation to avoid it.

----

> Nick, one thing you seemed to not comment on is ...
>
> > addl $2, 36(%ebp) (the INCEIP)
>
> ie, we keep exact track of %eip; does DynamoRIO? What sort of
> exception model are they presenting?

I'm not sure. They have a "VPC" which I assume is a "virtual PC" in some
of their diagrams. There's lots of detail in one of the papers about
handling unusual control flow in Windows (callbacks, asynchronous
procedure calls, exceptions, setjmp/longjmp) but I don't understand that
stuff much at all. I'm not much use, sorry.

N
|
From: Jeremy F. <je...@go...> - 2002-11-20 09:57:11
|
On Wed, 2002-11-20 at 01:08, Julian Seward wrote:
> I'd still like to know why bzip2 runs at approx a 10x slowdown, if the
> translations are chained, there is no callouts to helpers with
> --skin=none. So the only thing happening is jumping between translations.
> In which case I'd expect a 10 x code size increase, but it's more like
> 4:1. So what's going on, I wonder?
Well, any indirect jumps (which includes calls into dynamic libraries
and returns) are still going through the dispatch loop. I think I've
worked out how to extend the chaining mechanism to handle indirect jumps
(though I think returns are still likely to be tricky, unless you've got
functions with a limited number of call-sites).
> It might be helpful to do this benchmarking with a simple microbenchmark
> with only a couple of hot bbs in the inner loop and no I/O, to factor out
> those effects.
I'm measuring user time, so it should factor out time waiting for I/O
and time spent in the kernel.
> Easy to test the net effect; disable %EIP generation altogether
> (one-liner in vg_to_ucode.c). I bet it gives another 10% or so.
> I would try it now but I have to rush off and duke it out with ARM
> code all day :-)
Good guess - it gets 20% on bzip2, 10% on cc1 (that's still keeping EIP
up to date once per basic block). I think you're right about computing
the EIP as we need it rather than keeping it up to date. I guess we
could tack a table onto the generated code itself or something.
baseline: bzip2 < TAGS > /dev/null
time=0.49s
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=yes bzip2 < TAGS > /dev/null
time=6.09s ratio:12.4
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=yes bzip2 < TAGS > /dev/null
time=4.85s ratio:9.8
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=no bzip2 < TAGS > /dev/null
time=5.29s ratio:10.7
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=no bzip2 < TAGS > /dev/null
time=4.04s ratio:8.2
baseline: /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=4.47s
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=80.11s ratio:17.9
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=58.96s ratio:13.1
valgrind --skin=none --chain-bb=no --extended-bb=no --enable-inceip=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=74.03s ratio:16.5
valgrind --skin=none --chain-bb=yes --extended-bb=no --enable-inceip=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=53.51s ratio:11.9
Bedtime.
J
|
|
From: Nicholas N. <nj...@ca...> - 2002-11-20 09:23:52
|
On Wed, 20 Nov 2002, Julian Seward wrote:
> Another obvious lemon is the INCEIP nonsense (of my own devising :)
> For every insn executed, except for the last in each bb, there is a
> corresponding INCEIP, which becomes something like
> addl $insn_size, 36(%ebp). That's probably expensive; it's 3 microops
> for modern CPUs (load-op-store). All because we need an up-to-date %EIP
> in some very rare circumstances: when taking a snapshot of the stack that
> might conceivably get passed to the user, and when delivering signals.
> These are relatively rare. I wonder if we could dispense with the eip
> and instead associate with each bb a small table of offsets which indicate
> how to calculate the simulated %EIP from the value it was set at at the
> start of the block and the current distance that the real %eip is inside
> the translation now. If you see what I mean.

I was fiddling with the redundant-INCEIP-removal code yesterday (which is
currently commented out). But this table-of-EIP-offsets sounds like a much
better idea. (And because of my interface changes, it should be possible
to add an extra field to UCodeBlock without breaking binary compatibility :)

> Easy to test the net effect; disable %EIP generation altogether
> (one-liner in vg_to_ucode.c). I bet it gives another 10% or so.
> I would try it now but I have to rush off and duke it out with ARM
> code all day :-)

I did this yesterday. For gzip with --skin=none, the time difference was
smallish (16.95s --> 16.13s, about 5%) but the code size difference was
big (4.6x --> 3.6x, about 22%). For MemCheck and Cachegrind, INCEIP
instructions account for about 10% and 12% of code size respectively. I
think the space savings are representative for a wide range of programs.

So, to summarise: INCEIP removal via a table of offsets shouldn't be too
hard (I think?) and would definitely be worth it even if only for the
space savings.

N
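[Editor's note: the table-of-offsets idea discussed in this thread could look roughly like the sketch below. The struct and lookup are hypothetical illustrations, not the interface Valgrind actually adopted.]

```c
#include <assert.h>
#include <stddef.h>

/* Per-translation map from offsets in the generated code back to the
 * guest %EIP, replacing per-instruction INCEIPs.  One entry per
 * original x86 instruction, sorted by code offset. */
typedef struct {
    unsigned short code_off;  /* offset into the translation */
    unsigned short eip_delta; /* guest bytes past the BB's start EIP */
} EipEntry;

/* Recover the simulated %EIP for a point inside the translation:
 * take the last entry at or before the current code offset. */
static unsigned eip_at(unsigned bb_eip, const EipEntry *tab, size_t n,
                       unsigned code_off) {
    unsigned delta = 0;
    for (size_t i = 0; i < n && tab[i].code_off <= code_off; i++)
        delta = tab[i].eip_delta;
    return bb_eip + delta;
}
```

The lookup is only needed in the rare cases Julian lists (stack snapshots, signal delivery), so a linear or binary search is cheap compared with executing an INCEIP per instruction.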
|
From: Nicholas N. <nj...@ca...> - 2002-11-20 09:04:51
|
On Tue, 19 Nov 2002, Julian Seward wrote:
> Hi. Am trying to be a bit more diligent re testsuite failures
> now that I've written down how to run it on a bit of paper :)
>
> Just did cvs up, rebuild from scratch. I've got 3 fails, which is
> a whole lot better than a few days ago. Are they expected?
>
> == Failed tests ===============================
> memcheck/tests/memalign_test stderr
> memcheck/tests/supp2 stderr
> memcheck/tests/tronical stderr

tronical is. The other two don't fail for me. It's surprising that only
2 fail unexpectedly on your machine; since I'm the only one who has ever
used the regtests, I thought there might be more my-machine-specific
stuff in there.

Can you send me the memcheck/tests/*.stderr.out files for memalign_test
and supp2? Thanks.

N
|
From: Julian S. <js...@ac...> - 2002-11-20 09:01:58
|
[... numbers re translation chaining and trace cacheing ...]
> Oh, yes, I quite agree. The results above make me think that doing
> naive trace caching probably won't help, but eliminating dispatch-loop
> calls probably will.

Good work. It's pretty clear that t-chaining is worth having. Not sure
about the others.

I'd still like to know why bzip2 runs at approx a 10x slowdown. If the
translations are chained, there are no callouts to helpers with
--skin=none. So the only thing happening is jumping between translations.
In which case I'd expect a 10x code size increase, but it's more like
4:1. So what's going on, I wonder?

It might be helpful to do this benchmarking with a simple microbenchmark
with only a couple of hot bbs in the inner loop and no I/O, to factor out
those effects.

-----------

Another obvious lemon is the INCEIP nonsense (of my own devising :)
For every insn executed, except for the last in each bb, there is a
corresponding INCEIP, which becomes something like
addl $insn_size, 36(%ebp). That's probably expensive; it's 3 microops
for modern CPUs (load-op-store). All because we need an up-to-date %EIP
in some very rare circumstances: when taking a snapshot of the stack that
might conceivably get passed to the user, and when delivering signals.
These are relatively rare. I wonder if we could dispense with the eip
and instead associate with each bb a small table of offsets which indicate
how to calculate the simulated %EIP from the value it was set at at the
start of the block and the current distance that the real %eip is inside
the translation now. If you see what I mean.

Easy to test the net effect; disable %EIP generation altogether
(one-liner in vg_to_ucode.c). I bet it gives another 10% or so.
I would try it now but I have to rush off and duke it out with ARM
code all day :-)

J
|
From: Julian S. <js...@ac...> - 2002-11-20 08:15:44
|
Hi. After applying 00-lazy-fp.patch, 01-partial-mul.patch and
47-chained-bb.patch, I get this:

gcc -DHAVE_CONFIG_H -I. -I. -I../.. -I../../coregrind -I../../include
-Winline -Wall -Wshadow -O -fomit-frame-pointer -g -Wno-unused -Wno-shadow
-c cp-demangle.c -UHAVE_CONFIG_H
In file included from cp-demangle.c:43:
../../coregrind/vg_include.h:1051: `VG_MAX_JUMPS' undeclared here (not in a function)
../../coregrind/vg_include.h:1087: `VG_MAX_JUMPS' undeclared here (not in a function)
../../coregrind/vg_include.h:1471: `VG_MAX_JUMPS' undeclared here (not in a function)
../../coregrind/vg_include.h:1473: confused by earlier errors, bailing out

Ummm ... ?

J
|
From: Jeremy F. <je...@go...> - 2002-11-20 01:02:30
|
On Tue, 2002-11-19 at 16:26, Josef Weidendorfer wrote:
> Hi,
>
> may I ask what's the purpose of
>
> /* Kludge ... */
> si->offset
> = si->start==VG_ASSUMED_EXE_BASE ? 0 : si->start;
>
> in vg_symtab2.c ??
[...]
> BTW, PLT and GOT ranges were always calculated wrong for executables, as these
> are calculated from file offset + si->offset, and si->offset is 0 for
> executables...
>
> Solutions?
Use 1.1.90? Both of these should be fixed.
J
|
|
From: Josef W. <Jos...@gm...> - 2002-11-20 00:42:25
|
Hi,
may I ask what's the purpose of
/* Kludge ... */
si->offset
   = si->start==VG_ASSUMED_EXE_BASE ? 0 : si->start;

in vg_symtab2.c ??

I just installed glibc 2.3.1 with prelink-support on my systems.
If libraries are prelinked, valgrind doesn't load any symbols at all :-(
That's because in prelinked libraries, for a symbol entry <i>,

	o_symtab[i].st_value

is not an offset, but a real address. So we add

	real_address + si->offset

and get doubled addresses, and thus refused.

Not using this offset at all, and calculating a symbol's address by

	sym_addr = (UInt)o_symtab[i].st_value;
	if (sym_addr < si->start) sym_addr += si->start;

seems to be fine.

But I'm not comfortable with the second line: for small segment start
addresses, this could go wrong...

BTW, PLT and GOT ranges were always calculated wrong for executables, as
these are calculated from file offset + si->offset, and si->offset is 0
for executables...
Solutions?
Josef
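[Editor's note: Josef's proposed fix amounts to the following heuristic. This is an illustration of the snippet in his mail; `si_start` stands in for `si->start`, and the last test below demonstrates the small-segment-start caveat he raises.]

```c
#include <assert.h>

/* A symbol's st_value is an offset in ordinary shared objects but an
 * absolute address in prelinked ones.  The proposed heuristic: treat
 * values below the segment start as offsets needing relocation. */
static unsigned long sym_address(unsigned long st_value,
                                 unsigned long si_start) {
    unsigned long sym_addr = st_value;
    if (sym_addr < si_start)
        sym_addr += si_start;   /* unprelinked: relocate the offset */
    return sym_addr;            /* prelinked: already absolute */
}
```

The heuristic fails exactly where Josef is uncomfortable: with a small segment start, a genuine offset that happens to be at or above `si_start` is misread as an absolute address and left unrelocated.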
|
|
From: Julian S. <js...@ac...> - 2002-11-19 23:58:51
|
> > Yesterday afternoon I ran KDE-3.1-rc3 on addrcheck. This is the entire
> > process tree starting from startkde. It pretty much worked, but was a
> > bit slow and one process bombed valgrind. All output was logged to a
> > different machine running the listener.
>
> Wow that sounds great. How have things come since Monday? Are your
> scripts and tools for this all packaged up?

No, but we can do 'make dist' and write a half-page README if you want to
try it. I found that valgrind bombed once during the previous session,
and I wanted to track it down. Plus we are not in a total hurry to
release it. But as I say, I am very happy to make you a snapshot tarball
(version 1.9.1) if you want.

> It would be nice to see those Arts errors too. Arts has lots of memory
> problems on PPC architecture apparently.

Attached. I don't think they are particularly interesting. I'm not using
artsd for anything. After startup I get a box saying "Sound server CPU
overload" or something like that; when I clicked OK the attached gunk
came out. I tried to find all lines in the log for this process (5805)
but I am not sure why the startup stuff is missing; that would tell us
exactly what's running. Note, this is with RC3; I haven't scagged RC4
yet.

> It really seems that my testing suggestion is far too late for 3.1.0, but
> I'd like to start formalising it for 3.2.0.

You are doing a Good Thing here, and you have my encouragement.

J
|
From: Jeremy F. <je...@go...> - 2002-11-19 23:14:13
|
On Tue, 2002-11-19 at 11:10, Julian Seward wrote:
> > I got basic block chaining working last night. I got about 25%
> > improvement (which is nice, but I was hoping for more) in the particular
> > benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
> > whole, the performance was pretty dismal: the native run took about 4.6
> > seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
> > chained-bb nulgrind took 60 seconds. I haven't looked into it further:
> > I was hoping it would be a largely CPU-bound test, but maybe its
> > actually spending all its time in malloc or something.
>
> Strange it's so slow. I think you should try bzip2; that is almost
> completely compute bound and does something like 7 malloc calls per
> file processed. Here's what I have:
>
> time ./Inst/bin/valgrind --skin=none ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps
> > /dev/null
> (valgrind startup msgs deleted)
> (stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
>
> real 0m7.760s
> user 0m7.670s
> sys 0m0.030s
>
> time ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps > /dev/null
> (stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
>
> real 0m0.738s
> user 0m0.680s
> sys 0m0.050s
>
> So more like a 11 - 12 x slowdown than the 35 x you get.
Well, I found that there is an easy way of getting V to generate
extended basic blocks: simply allow conditional jumps to not end the
basic block (they code was all set up to do it, complete with
almost-accurate comment). I haven't done on overall measurement of what
the average increase in BB length is, but from looking a the output of
--trace-codegen, they can get quite long. It certainly allows us to see
what the effect of having long register lifetimes is vs.
indescriminately over-compiling things. The results are far from
conclusive: it seems to vary between a few percent better to a few
percent worse (and it also seems to depend on the skin, though I haven't
really tested that yet).
These are the results I'm seeing:
baseline: bzip2 < TAGS > /dev/null
time=0.49s
valgrind --skin=none --chain-bb=no --extended-bb=no bzip2 < TAGS > /dev/null
time=6.15s ratio:12.5
valgrind --skin=none --chain-bb=yes --extended-bb=no bzip2 < TAGS > /dev/null
time=4.95s ratio:10.1
valgrind --skin=none --chain-bb=no --extended-bb=yes bzip2 < TAGS > /dev/null
time=6.17s ratio:12.5
valgrind --skin=none --chain-bb=yes --extended-bb=yes bzip2 < TAGS > /dev/null
time=5.48s ratio:11.1
baseline: /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i \
-quiet -dumpbase x.i -O2 -version -o /dev/null
time=4.44s
valgrind --skin=none --chain-bb=no --extended-bb=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=79.88s ratio:17.9
valgrind --skin=none --chain-bb=yes --extended-bb=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=59.14s ratio:13.3
valgrind --skin=none --chain-bb=no --extended-bb=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=73.77s ratio:16.6
valgrind --skin=none --chain-bb=yes --extended-bb=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
-fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
time=58.16s ratio:13.0
> > My next experiment might be [...]
>
> Let me encourage you to make measurements (direct vs indirect jump counts)
> to gain insight into your current hackery, before embarking on more. I for
> one would like to be assured with numbers that the Right Thing is happening
> and that our assumptions about costs, event frequencies, etc, are justified.
Oh, yes, I quite agree. The results above make me think that doing
naive trace caching probably won't help, but eliminating dispatch-loop
calls probably will.
I've put the patch up in the usual place.
J
|
|
From: Julian S. <js...@ac...> - 2002-11-19 23:11:16
|
Hi. Am trying to be a bit more diligent re testsuite failures now that
I've written down how to run it on a bit of paper :)

Just did cvs up, rebuild from scratch. I've got 3 fails, which is a whole
lot better than a few days ago. Are they expected?

J

== Failed tests ===============================
memcheck/tests/memalign_test stderr
memcheck/tests/supp2 stderr
memcheck/tests/tronical stderr
From: George S. <st...@kd...> - 2002-11-19 21:31:57
On Monday November 18 2002 04:11, Julian Seward wrote:
> Just fwiw,
>
> Yesterday afternoon I ran KDE-3.1-rc3 on addrcheck. This is the entire
> process tree starting from startkde. It pretty much worked, but was a
> bit slow and one process bombed valgrind. All output was logged to a
> different machine running the listener.

Wow, that sounds great. How have things come along since Monday? Are your
scripts and tools for this all packaged up? Pardon my ignorance if this is
well known already; I've been in and out of "the loop" for the past few
weeks.

> So following some robustification, I think this might be workable
> (P4 1.7 GHz, 512 MB). Even from my small test it picked up some
> reads/writes of freed memory in artsd (something to do with KDE
> sound I think).

It would be nice to see those Arts errors too. Arts has lots of memory
problems on the PPC architecture, apparently. It really seems that my
testing suggestion is far too late for 3.1.0, but I'd like to start
formalising it for 3.2.0.

--
George Staikos
From: Julian S. <js...@ac...> - 2002-11-19 19:08:35
> > 1. Lack of translation chaining. Has been discussed a lot already, and I
> > think has reasonable probability of getting addressed at some point,
> > since we're now floating detailed ideas for doing it.
>
> I think this is major, and we should be able to tell soon: I hacked up
> most of an implementation on the train this afternoon. Doesn't quite
> work yet, but it looks convincing.

Cool! Keep on hacking. I'd be very interested to hear the outcome of this.

> > 2. The inability to cache simulated regs in real ones across bb
> > boundaries (they all get flushed with PUTs at the end). I think JF
> > mentioned this at some point. I don't know how much effect this has.
>
> I think this is relatively minor compared to chaining,

Possibly true. I think trace caching would break cachegrind, which makes
some assumptions about uniqueness of bb translations (or something?
Nick? But that can't be right, because it copes with multiple tail
translations of bbs, as we discussed once). Unfortunately there is no
easy way I can see to assess the performance loss from this inability to
hold values in regs for long.

> > 3. (nobody mentioned this, but I think it is very significant): the
> > inability of the code generator to consider groups of UInstrs together
> > when generating code. Specifically, this has the following bad effect.
> > [...]
> > This is surely something worth looking into. An interesting mechanism
> > to create would be one which measures the (dynamic) frequency of
> > adjacent uinstr pairs, to search for common sequences (pairs, to start
> > with) which would be worth putting special code-generation effort into.
> > Of course the results would depend on what instrumentation had been
> > added by the skin.
>
> I'm not sure it's all that important. It would reduce the actual
> instruction count and register pressure, but I'm not sure it would do
> all that much for modern CPUs (the Via C3 being the exception, and it
> does need all the help it can get).

Um, yes, I guess because it merely produces insns which will turn mostly
into single micro-ops in the P6/P7/K6/K7 cores anyway, yes? I'd forgotten
about that. What it does mean is that the translated code is more likely
to stall due to lack of insn decoder bandwidth, compared to "normal" code.

J
From: Julian S. <js...@ac...> - 2002-11-19 19:03:51
On Tuesday 19 November 2002 5:20 pm, Jeremy Fitzhardinge wrote:
> On Tue, 2002-11-19 at 08:12, Nicholas Nethercote wrote:
> > There are sure to be other problems, though. And I have no idea how it
> > would affect the MemCheck instrumentation; already the code produced
> > contains redundant "PUTL tX, %ESP" instructions because MemCheck always
> > needs %ESP to be up-to-date.
>
> Perhaps a slightly simpler approach would be to dedicate a register to
> ESP rather than a memory location, so that it is persistently in
> register across basic blocks. With any luck the register allocator
> could then remove redundant moves, and turn (assuming we reserve %edi
> for %ESP):
Yes, this seems a good compromise. We already played with the issue of
5 vs 6 allocatable registers and it makes very little difference at least
to memcheck, so perhaps it would be ok to give one to %ESP.
Nick, one thing you seemed to not comment on is ...
> addl $2, 36(%ebp) (the INCEIP)
ie, we keep exact track of %eip; does DynamoRIO? What sort of
exception model are they presenting?
> I got basic block chaining working last night. I got about 25%
> improvement (which is nice, but I was hoping for more) in the particular
> benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
> whole, the performance was pretty dismal: the native run took about 4.6
> seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
> chained-bb nulgrind took 60 seconds. I haven't looked into it further:
> I was hoping it would be a largely CPU-bound test, but maybe it's
> actually spending all its time in malloc or something.
Strange it's so slow. I think you should try bzip2; that is almost
completely compute bound and does something like 7 malloc calls per
file processed. Here's what I have:
time ./Inst/bin/valgrind --skin=none ~/bzip2-1.0.2/bzip2 -v \
    < ~/wbt00.ps > /dev/null
(valgrind startup msgs deleted)
(stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
real 0m7.760s
user 0m7.670s
sys 0m0.030s
time ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps > /dev/null
(stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
real 0m0.738s
user 0m0.680s
sys 0m0.050s
So more like an 11-12x slowdown than the 35x you get.
> My next experiment might be [...]
Let me encourage you to make measurements (direct vs indirect jump counts)
to gain insight into your current hackery, before embarking on more. I for
one would like to be assured with numbers that the Right Thing is happening
and that our assumptions about costs, event frequencies, etc, are justified.
J
From: Jeremy F. <je...@go...> - 2002-11-19 17:20:13
On Tue, 2002-11-19 at 08:12, Nicholas Nethercote wrote:
> There are sure to be other problems, though. And I have no idea how it
> would affect the MemCheck instrumentation; already the code produced
> contains redundant "PUTL tX, %ESP" instructions because MemCheck always
> needs %ESP to be up-to-date.
Perhaps a slightly simpler approach would be to dedicate a register to
ESP rather than a memory location, so that it is persistently in
register across basic blocks. With any luck the register allocator
could then remove redundant moves, and turn (assuming we reserve %edi
for %ESP):
movl %edi, %ecx
subl $0x4, %ecx
movl %ecx, %edi
xorl %edx, %edx
movl %edx, (%ecx)
addl $2, 36(%ebp)
into
subl $0x4, %edi
xorl %edx, %edx
movl %edx, (%edi)
addl $2, 36(%ebp)
which, while not being push 0x0, isn't too far off. Will still need to
work out appropriate times to save ESP into the baseblock for when
control enters C code.
It seems to me that these optimisations we're talking about all feed
into each other, so there's a synergistic effect:
1. The reason we've got so much memory traffic to the baseBlock is
that we don't have enough registers to keep the simulated
registers in real registers while having enough working
registers for generated code
2. The reason we need lots of working registers is because we break
all the x86 instructions up, and therefore need more temps
3. Even so, we can't possibly really fit all the simulated
registers into real registers, so we're always going to be
saving them off to memory, but we'd like to keep to a minimum
4. We can't do a reasonable global register allocation to try and
keep the working set in registers as much as possible, because
we have a largely local, basic-block, view of the world (except,
perhaps, for special registers, like %esp which may well be used
in every basic block and is therefore worth keeping around).
therefore:
1. Using more compact instructions will cut down on our use of
temps, leaving more registers available for caching architecture
registers during a basic block (as well as making our
instruction footprint smaller)
2. Using trace caching will lengthen the effective size of a basic
block, and thereby amortize the register load/save over more
instructions
In other news:
I got basic block chaining working last night. I got about 25%
improvement (which is nice, but I was hoping for more) in the particular
benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
whole, the performance was pretty dismal: the native run took about 4.6
seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
chained-bb nulgrind took 60 seconds. I haven't looked into it further:
I was hoping it would be a largely CPU-bound test, but maybe it's
actually spending all its time in malloc or something.
I also haven't measured how many jumps actually get chained (statically
and dynamically). I only bother with direct jumps and calls, so all
indirects and rets still go through the dispatch loop. Also, some of
the functionality in the dispatch loop needs to be compiled into each
basic block (EIP and VG_(dispatch_ctr) update), which expands the
generated code size and probably detracts from the performance wins.
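Such a measurement need not be elaborate: a per-kind counter bumped on each control transfer would already show what fraction of dynamic jumps the chaining scheme can handle (only direct jumps and calls, per the above). A toy sketch, with all names mine rather than Valgrind's:

```c
/* Hypothetical jump-kind tally, to be bumped from the dispatch path.
   Only direct jumps and calls are chainable in the scheme described
   above; indirects and rets still go through the dispatch loop. */
typedef enum { JK_DIRECT, JK_CALL, JK_INDIRECT, JK_RET, JK_N } JumpKind;

static unsigned long jump_tally[JK_N];

void note_jump(JumpKind k)
{
    jump_tally[k]++;
}

/* Fraction of dynamic control transfers that chaining could absorb. */
double chainable_fraction(void)
{
    unsigned long total = 0, chainable;
    for (int k = 0; k < JK_N; k++)
        total += jump_tally[k];
    chainable = jump_tally[JK_DIRECT] + jump_tally[JK_CALL];
    return total ? (double)chainable / (double)total : 0.0;
}
```

Printing the four counters at exit would give both the static and dynamic numbers Julian asks for below.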
My next experiment might be to try some rudimentary trace caching by
doing as Josef suggested and having vg_to_ucode simply follow
unconditional jumps and thereby create large basic blocks (hm, better be
careful about infinite loops...). It might do nothing other than
massively increase the dynamic compiler costs...
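The infinite-loop worry above can be handled with a visited check plus a hard length cap. Here is a toy model of the idea, assuming each block's unconditional-jump successor is already known; the types and names are mine, not vg_to_ucode's:

```c
#define MAX_BLOCKS 256  /* toy limit so the visited set can be a flat array */
#define MAX_TRACE   64  /* cap on trace length, a second safety net */

/* Each toy "block" either ends in an unconditional jump to another
   block (jmp_target >= 0) or ends the trace (jmp_target < 0). */
typedef struct { int jmp_target; } Block;

/* Glue blocks into one trace by following unconditional jumps,
   refusing to revisit a block so a jump cycle cannot loop forever. */
int build_trace(const Block *blocks, int n_blocks, int start, int *trace_out)
{
    char visited[MAX_BLOCKS] = {0};
    int len = 0, cur = start;

    while (cur >= 0 && cur < n_blocks && len < MAX_TRACE) {
        if (visited[cur])
            break;                    /* jump cycle: stop the trace here */
        visited[cur] = 1;
        trace_out[len++] = cur;
        cur = blocks[cur].jmp_target; /* follow the unconditional jump */
    }
    return len;                       /* number of blocks in the trace */
}
```

A self-loop or a two-block jump cycle simply terminates the trace at the repeated block, at the cost of one extra dispatch there.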
J
From: Nicholas N. <nj...@ca...> - 2002-11-19 16:12:39
On Tue, 19 Nov 2002, Nicholas Nethercote wrote:
> I think it is important, ie. the fact that Valgrind virtualises all the
> registers and then loses where they were in the UCode.
Thinking about this some more, I think that one particular source of
slowdown is the fact that %esp is virtualised as %ESP.
This is because several very common instructions (push, pop, call, ret)
update %esp as a side-effect. For example:
pushl $0x0
becomes:
GETL %ESP, t4
SUBL $0x4, t4
PUTL t4, %ESP
MOVL $0x0, t6
STL t6, (t4)
INCEIPo $2
becomes:
movl 0x10(%ebp), %ecx
subl $0x4, %ecx
movl %ecx, 0x10(%ebp)
xorl %edx, %edx
movl %edx, (%ecx)
addl $2, 36(%ebp)
A lot of time is spent moving values to and from %ESP. I was wondering
about the possibility of "de-virtualising" %esp -- ie. store %ESP in %esp.
Then maybe the UCode translation could be as simple as "PUSHL 0x0".
I'm sure to have overlooked several bazillion complications here. For
example, %esp is used inside generated code, but I think that might be ok:
- for transferring %eflags to %EFLAGS -- just pushes then pops off the
top of the stack, should be fine
- for CCALLs -- saves caller-save regs, pushes args, clears args,
restores regs, should be fine?
There are sure to be other problems, though. And I have no idea how it
would affect the MemCheck instrumentation; already the code produced
contains redundant "PUTL tX, %ESP" instructions because MemCheck always
needs %ESP to be up-to-date.
I don't know, maybe it's totally impossible, but maybe it's not.
N
From: Julian S. <js...@ac...> - 2002-11-19 13:51:02
> > Hmm, that differs from what the internal valgrind profiler itself tells
> > us ... about 30-40% of the time starting up mozilla, on memcheck, is
> > translation. That also corresponds (non-quantitatively) with the speed
> > variations in the lines printed by -v -v when running stuff (try it).
>
> Is there a set of benchmarks you've been using for this kind of stuff?
> It would be useful to have a set for evaluating all the proposals that
> have been flying around lately. It would be nice to have something less
> interactive than Mozilla, for example. I'm particularly short of nice
> examples of multithreaded programs which aren't all GUI.

Starting mozilla-1.0, which requires translating about 2.2M of original
code into ~45M of memcheck'd translations.

J
From: Nicholas N. <nj...@ca...> - 2002-11-19 10:57:36
On 18 Nov 2002, Jeremy Fitzhardinge wrote:
> > 3. (nobody mentioned this, but I think it is very significant): the inability
> > of the code generator to consider groups of UInstrs together when
> > generating code.
>
> I'm not sure it's all that important. It would reduce the actual
> instruction count and register pressure, but I'm not sure it would do
> all that much for modern CPUs (the Via C3 being the exception, and it
> does need all the help it can get).
I think it is important, ie. the fact that Valgrind virtualises all the
registers and then loses where they were in the UCode.
Look at DynamoRIO's translation of this small basic block:
add %eax, %ecx
cmp $4, %eax
jle $0x40106f
becomes
frag7: add %eax, %ecx
cmp $4, %eax
jle $0x40106f
stub1: mov %eax, eax-slot # spill %eax
mov &dstub1, %eax # store ptr to stub table
jmp context_switch
stub2: mov %eax, eax-slot # spill %eax
mov &dstub2, %eax # store ptr to stub table
jmp context_switch
Apart from control flow instructions, the resulting code (before
instrumentation and/or trace-level optimisation) is basically the same as
the original code.
Valgrind will turn it into UCode something like this:
GETL %ECX, t14
ADDL %EAX, t14 (-wOSZACP)
PUTL t14, %ECX
INCEIPo $2
GETL %EAX, t10
SUBL $0x1, t10 (-wOSZACP)
INCEIPo $3
Jleo $0x40106f (-rOSZACP)
JMPo $0x4013F9B4
There will be some optimisations, but then each UCode instr will become
one or more x86 instructions. The example isn't so convincing, but repeat
the exercise on a basic block with 10 or 20 instructions and you'll notice
the difference more.
I feel that this must be doing a lot of damage. I don't think it's a
coincidence that the Null skin increases code size by a factor of about 5
and also slows programs by a similar factor, and that Memcheck increases
code size by a factor of about 12--13 and slows programs down by a similar
factor.
> An interesting mechanism to create would be one which measures the
> (dynamic) frequency of adjacent uinstr pairs, to search for common
> sequences (pairs, to start with) which would be worth putting special
> code-generation effort into. Of course the results would depend on what
> instrumentation had been added by the skin.
I just did that... a patch for vg_from_ucode is attached, as are the
(sorted) results for a quick spin of konqueror with --skin=none...
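For anyone wanting to reproduce the measurement without the attached patch, the core of such a counter is small. A from-scratch sketch (not the patch itself; N_OPS and the names are mine, and the real UCode opcode space is much larger):

```c
#include <string.h>

#define N_OPS 16  /* toy opcode space; real UCode has many more opcodes */

/* Dynamic adjacent-pair frequency table: pair_count[a][b] counts how
   often opcode b immediately follows opcode a, so the most frequent
   pairs can be picked out as candidates for special-cased codegen. */
static unsigned pair_count[N_OPS][N_OPS];

void count_pairs(const int *opcodes, int n)
{
    memset(pair_count, 0, sizeof pair_count);
    for (int i = 0; i + 1 < n; i++)
        pair_count[opcodes[i]][opcodes[i + 1]]++;
}
```

Sorting the non-zero cells of the table then gives exactly the kind of ranked pair list described above.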
N
From: Jeremy F. <je...@go...> - 2002-11-19 06:01:14
On Mon, 2002-11-18 at 21:55, Jeremy Fitzhardinge wrote:
> > Agree. The only other major source is implementation of switch statements
> > with dense ranges (jump tables).
>
> Another optimisation I'm thinking of is
[This is what happens when you're composing two emails in parallel.]
J
From: Jeremy F. <je...@go...> - 2002-11-19 05:56:01
On Mon, 2002-11-18 at 17:04, Julian Seward wrote:
> > > That seems unavoidable. We'd probably just recycle the current
> > > arrangement, or perhaps a simplified version of it, in this case. Bear
> > > in mind that x86 RET instructions are going to pull an original address
> > > off the stack and jump to it, so there will still be a significant
> > > minority of such lookups needed.
> >
> > It's likely that rets will be the majority of jumps with unknown
> > destinations (though very OO C++ code's use of virtual method
> > invocations will get close). I think the current mechanism is fine.
>
> Agree. The only other major source is implementation of switch statements
> with dense ranges (jump tables).
Another optimisation I'm thinking of is
> Hmm, that differs from what the internal valgrind profiler itself tells
> us ... about 30-40% of the time starting up mozilla, on memcheck, is
> translation. That also corresponds (non-quantitatively) with the speed
> variations in the lines printed by -v -v when running stuff (try it).
Is there a set of benchmarks you've been using for this kind of stuff?
It would be useful to have a set for evaluating all the proposals that
have been flying around lately. It would be nice to have something less
interactive than Mozilla, for example. I'm particularly short of nice
examples of multithreaded programs which aren't all GUI.
> Disco: Running Commodity Operating Systems on Scalable Multiprocessors.
> Bugnion, Devine, Rosenblum (at least the last of these is a founder of VMware,
> Inc). 1997 ish. I'm sure you can scag it from the web somewhere; I'd start
> at citeseer. I only have a paper copy.
Hm, I think I've read that paper, but I don't think I paid much attention
to it. I've certainly read the Cellular Disco paper.
J