|
From: Tom H. <to...@co...> - 2005-03-29 09:14:33
|
I've got a PIE build of valgrind running on amd64 now but in order
to do it I had to add four extra arguments to VG_(run_innerloop) so
that the assembly in dispatch.S wouldn't need to reference any global
variables.
Basically the prototype for VG_(run_innerloop) now looks like this:
  extern UInt VG_(run_innerloop) ( void* guest_state,
                                   OffT instr_ptr_offset,
                                   UInt *dispatch_ctr_ptr,
                                   ULong *tt_fast[],
                                   UInt *tt_fastN[] );
The necessary changes in the assembly code add three loads to the fast
path in dispatch_boring as things stand. One of those might be avoidable
if the value of VG_(dispatch_ctr) was cached locally and the global was
only updated on exit from VG_(run_innerloop).
Does anybody have any suggestions for a better way to make the code
in dispatch.S position independent?
Note that vex also has to be built with EXTRA_CFLAGS="-fpie" in order
to get a PIE build of valgrind.
I also can't currently get valgrind to load above 0xf0000000 as it
just segfaults while loading the client program if I try.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Julian S. <js...@ac...> - 2005-03-29 09:34:45
|
> The necessary changes in the assembly code add three loads to the fast
> path in dispatch_boring as things stand.

Hmm, that's not good. What does the resulting version of run_innerloop
look like?

J |
|
From: Tom H. <to...@co...> - 2005-03-29 12:18:54
|
In message <yek...@ho...>
Tom Hughes <to...@co...> wrote:
> I also can't currently get valgrind to load above 0xf0000000 as it
> just segfaults while loading the client program if I try.
The fix to ROUNDDN that I just checked in solves this problem,
so with a PIE valgrind I now have a memory map that looks like
this:
client_base     0x0          (174161MB)
client_mapbase  0x2A85154000 (348845MB)
client_end      0x7FAFF00000 (1MB)
shadow_base     0x7FB0000000 (0MB)
shadow_end      0x7FB0000000
valgrind_base   0x7FB0000000 (255MB)
valgrind_last   0x7FBFFFFFFF
Hopefully that should be enough address space for our more demanding
users to be going on with ;-)
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Julian S. <js...@ac...> - 2005-03-29 15:39:51
|
> Does anybody have any suggestions for a better way to make the code
> in dispatch.S position independent?

Yes. One thing that might work is to create a struct holding
dispatch_ctr, tt_fast and tt_fastN, and pass in a pointer to the
struct. This reduces the fast-path cost to one load per iteration,
which isn't so bad. It sounds a bit ugly though. I'll have a look at
it later this evening.

J |
|
From: Jeremy F. <je...@go...> - 2005-03-29 16:57:24
|
Tom Hughes wrote:
>I've got a PIE build of valgrind running on amd64 now but in order
>to do it I had to add four extra arguments to VG_(run_innerloop) so
>that the assembly in dispatch.S wouldn't need to reference any global
>variables.
>
>Basically the prototype for VG_(run_innerloop) now looks like this:
>
> extern UInt VG_(run_innerloop) ( void* guest_state,
> OffT instr_ptr_offset,
> UInt *dispatch_ctr_ptr,
> ULong *tt_fast[],
> UInt *tt_fastN[] );
>
>The necessary changes in the assembly code add three loads to the fast
>path in dispatch_boring as things stand. One of those might be avoidable
>if the value of VG_(dispatch_ctr) was cached locally and the global was
>only updated on exit from VG_(run_innerloop).
>
>Does anybody have any suggestions for a better way to make the code
>in dispatch.S position independent?
>
>
You could just use globals and make position-independent references to
them. Look at the gcc -S output to see what the syntax is; I have a
patch somewhere which does this, but I can't find it right now.
J
|
|
From: Tom H. <to...@co...> - 2005-03-30 09:36:05
Attachments:
pie2.patch
|
In message <424...@go...>
Jeremy Fitzhardinge <je...@go...> wrote:
> Tom Hughes wrote:
>
>>Does anybody have any suggestions for a better way to make the code
>>in dispatch.S position independent?
>
> You could just use globals and make position-independent references to
> them. Look at the gcc -S output to see what the syntax is; I have a
> patch somewhere which does this, but I can't find it right now.
Like the attached patch you mean? That still adds loads to the
fast path because you have to load the address from the GOT before
you can use it.
This patch actually adds four loads. One can be eliminated by looking
up instr_ptr_offset once at the start and caching it on the stack. The
dispatch_ctr one could probably be eliminated by keeping a local copy
of the dispatch counter and updating the global at the end.
The tt_fast and tt_fastN ones are a problem, however, as you don't seem
to be able to do this:
movq VG_(tt_fast)@GOTPCREL(%rip,%rbx,8), %rcx
so you have to do this:
movq VG_(tt_fast)@GOTPCREL(%rip), %rcx
movq (%rcx,%rbx,8), %rcx
and the same for the tt_fastN array.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Tom H. <to...@co...> - 2005-03-30 09:51:28
Attachments:
pie2.patch
|
In message <yek...@ho...>
Tom Hughes <to...@co...> wrote:
> This patch actually adds four loads. One can be eliminated by looking
> up instr_ptr_offset once at the start and caching it on the stack. The
> dispatch_ctr one could probably be eliminated by keeping a local copy
> of the dispatch counter and updating the global at the end.
Here's an updated version that implements those two suggestions so
that we are only adding two loads to the fast path.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Jeremy F. <je...@go...> - 2005-03-30 09:59:53
|
Tom Hughes wrote:
>Like the attached patch you mean? That still adds loads to the
>fast path because you have to load the address from the GOT before
>you can use it.
>
>
Yep, that looks about right. I wasn't too worried about adding a couple
of extra instructions here because BB-chaining should take most of the
load off this code (in 2.4); in 3, one presumes Vex's BB-fusing is about
as effective (or we need to look at doing chaining as well).
J
|
|
From: Julian S. <js...@ac...> - 2005-03-30 22:51:12
|
Hmm. This is a pig of a problem. With r3484/r3485 I managed to get
rid of one insn per fast-iteration, but being PIE still costs us.

The "best" solution I can think of is to glom all the
dispatcher-visible state (dispatch_ctr, fast[], fastN[]) into a single
struct, and pass a pointer to that to run_innerloop. That pointer can
be reloaded from the stack each time round the loop, so the extra cost
is one insn/bb, which, given that one insn was just removed, is free,
at least in terms of insn counts.

So the main difficulty -- apart from the ugliness of inventing a
struct purely for this reason -- is how the assembly code can know the
(literal) offsets of the struct components, without having to mess
around with preprocessor programs to extract the offsets.

That's just too ugly. For the time being, let's go with the
"movq VG_(tt_fast)@GOTPCREL(%rip), %rcx" style fixes that Tom/Jeremy
cooked up. Does the x86 side need similar modification or is that OK
as-is?

> Yep, that looks about right. I wasn't too worried about adding a couple
> of extra instructions here because BB-chaining should take most of the
> load off this code (in 2.4); in 3, one presumes Vex's BB-fusing is about
> as effective (or we need to look at doing chaining as well).

Further ahead, it would be nice to reinstate bb-chaining. Vex's
chasing results in bb count reductions on the order of 20%-30%
compared to 2.4, which is a start, and gives indirect benefits in that
the guest regs remain cached in host regs across the boundary.
Nevertheless this leaves a lot of scope for chaining. For one thing,
chaining removes the indirect and presumably unpredictable call that
the dispatcher must make.

J |
|
From: Tom H. <to...@co...> - 2005-03-30 23:16:17
|
In message <200...@ac...>
Julian Seward <js...@ac...> wrote:
> That's just too ugly. For the time being, let's go with the
> "movq VG_(tt_fast)@GOTPCREL(%rip), %rcx" style fixes that
> Tom/Jeremy cooked up. Does the x86 side need similar modification
> or is that OK as-is ?
I'll fix that up and get it committed then. The x86 side should
really have the same change, but we can get away without it there
because the dynamic linker will do relocations at load time for
code that isn't really PIC but ought to be - the amd64 one won't
do that.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Julian S. <js...@ac...> - 2005-03-30 23:39:13
|
> I'll fix that up and get it committed then. The x86 side should > really have the same change, but we can get away without it there > because the dynamic linker will do relocations at load time for > code that isn't really PIC but ought to be - the amd64 one won't > do that. Cool. I guess there's also a net performance loss putting the same change in x86. Yeh. I guess it can't, because insns with offsets don't carry 64-bit offsets. Or something. I just had a spin with konq and moz startups on amd64. Looks promising, mostly a lot of complaining about missing syscalls which ultimately causes moz to give up and konq to not be able to do anything due to not being able to communicate with dcopserver. Small KDE apps (kmines, kpat, kcalc) work, more or less. Unfortunately SuSE seem to have neglected to ship that most useful of KDE applications, ktuberling ;-) J |