|
From: Duncan S. <bal...@fr...> - 2005-04-28 13:12:37
|
I've hit the valgrind trampoline problem: I need to take the address of a local subroutine, which causes valgrind to barf. The docs say to use VALGRIND_DISCARD_TRANSLATIONS, but don't give much in the way of details. An example of how to do this would be helpful. I've the additional problem that the code is not in C, it is in Ada, however I can hopefully work out how to do it in Ada if I understand how to fix things in a C example. (By the way, it's a pity VALGRIND_DISCARD_TRANSLATIONS is a macro, since these are harder to use from other languages than function calls). Thanks for your help, Duncan. |
|
From: Nicholas N. <nj...@cs...> - 2005-04-30 04:22:56
|
On Thu, 28 Apr 2005, Duncan Sands wrote: > I've hit the valgrind trampoline problem: I need to take > the address of a local subroutine, which causes valgrind > to barf. The docs say to use VALGRIND_DISCARD_TRANSLATIONS, > but don't give much in the way of details. An example of how > to do this would be helpful. Where in the docs does it say how to work around this problem? N |
|
From: Duncan S. <bal...@fr...> - 2005-05-02 07:00:08
|
On Fri, 2005-04-29 at 23:22 -0500, Nicholas Nethercote wrote: > On Thu, 28 Apr 2005, Duncan Sands wrote: > > > I've hit the valgrind trampoline problem: I need to take > > the address of a local subroutine, which causes valgrind > > to barf. The docs say to use VALGRIND_DISCARD_TRANSLATIONS, > > but don't give much in the way of details. An example of how > > to do this would be helpful. > > Where in the docs does it say how to work around this problem? Hi Nicholas, from http://valgrind.org/docs/manual/coregrind_core.html#limits "Valgrind can handle dynamically-generated code just fine. However, if you regenerate code over the top of old code (ie. at the same memory addresses) Valgrind will not realise the code has changed, and will run its old translations, which will be out-of-date. You need to use the VALGRIND_DISCARD_TRANSLATIONS client request in that case. For the same reason gcc's trampolines for nested functions are currently unsupported, see bug 69511. " I read this as implying that VALGRIND_DISCARD_TRANSLATIONS are needed for trampolines. Is that wrong? Thanks, Duncan. |
|
From: Nicholas N. <nj...@cs...> - 2005-05-02 13:39:41
|
On Mon, 2 May 2005, Duncan Sands wrote: >>> I've hit the valgrind trampoline problem: I need to take >>> the address of a local subroutine, which causes valgrind >>> to barf. The docs say to use VALGRIND_DISCARD_TRANSLATIONS, >>> but don't give much in the way of details. An example of how >>> to do this would be helpful. > > "Valgrind can handle dynamically-generated code just fine. However, if you > regenerate code over the top of old code (ie. at the same memory addresses) > Valgrind will not realise the code has changed, and will run its old translations, > which will be out-of-date. You need to use the VALGRIND_DISCARD_TRANSLATIONS > client request in that case. For the same reason gcc's trampolines for nested > functions are currently unsupported, see bug 69511. " > > I read this as implying that VALGRIND_DISCARD_TRANSLATIONS are needed for > trampolines. Is that wrong? It's a bit subtle; some trampolines will work. The problem will only occur if you put a trampoline on the stack, and then later another trampoline at exactly the same address. In that case, Valgrind will rerun the translation it made for the first trampoline, not realising that it is out of date. If no two trampolines are at the same address, it should work ok. As for using VALGRIND_DISCARD_TRANSLATIONS, I guess after you call a function that uses a trampoline, you should use this macro with an address range that would cover where the trampoline was, eg. start at the address of some local variable and go a few hundred bytes beyond that. I realise this is not a very elegant way to do it. One thing puzzles me: you said valgrind barfs. Can you give the output? This problem shouldn't cause Valgrind to barf, but rather silently run the wrong code. N |
|
From: Duncan S. <bal...@fr...> - 2005-05-02 14:05:01
|
Hi Nicholas, thanks for replying. > > I read this as implying that VALGRIND_DISCARD_TRANSLATIONS are needed for > > trampolines. Is that wrong? > > It's a bit subtle; some trampolines will work. The problem will only > occur if you put a trampoline on the stack, and then later another > trampoline at exactly the same address. In that case, Valgrind will rerun > the translation it made for the first trampoline, not realising that it is > out of date. If no two trampolines are at the same address, it should > work ok. If I have a routine that uses a trampoline, and I call it twice in succession, won't that put the trampolines at the same address and cause a problem? Or will it be the same trampoline, so no problem? > As for using VALGRIND_DISCARD_TRANSLATIONS, I guess after you call a > function that uses a trampoline, you should use this macro with an address > range that would cover where the trampoline was, eg. start at the address > of some local variable and go a few hundred bytes beyond that. I realise > this is not a very elegant way to do it. I guess my pointer to the local function is actually a pointer to the trampoline code, so I guess I can use that as the start address. Any idea how much code a trampoline can contain? Does it actually vary depending on the number of local variables used as function arguments? > One thing puzzles me: you said valgrind barfs. Can you give the output? > This problem shouldn't cause Valgrind to barf, but rather silently run the > wrong code. I said that wrong: my program explodes, valgrind produces a ton of error messages, and everything exits. valgrind is not crashing in any way as far as I can see. I guess the problem is that in the routine in question there are two code paths each of which leads to the address being taken of a (different) local procedure - presumably this is why there are sometimes different trampolines at the same stack address. It is to be expected that the program dies if the wrong one is called... Ciao, Duncan. |
|
From: Nicholas N. <nj...@cs...> - 2005-05-02 18:38:30
|
On Mon, 2 May 2005, Duncan Sands wrote: > If I have a routine that uses a trampoline, and I call it twice in > succession, won't that put the trampolines at the same address > and cause a problem? Or will it be the same trampoline, so no > problem? If it's the same trampoline, I think it should work. >> As for using VALGRIND_DISCARD_TRANSLATIONS, I guess after you call a >> function that uses a trampoline, you should use this macro with an address >> range that would cover where the trampoline was, eg. start at the address >> of some local variable and go a few hundred bytes beyond that. I realise >> this is not a very elegant way to do it. > > I guess my pointer to the local function is actually a pointer to the > trampoline code, so I guess I can use that as the start address. Any > idea how much code a trampoline can contain? Does it actually vary > depending on the number of local variables used as function arguments? No idea, but I'd hope it's small, eg. only a handful of instructions. The good thing is that you can be conservative -- all that will happen is that Valgrind will flush any translations it is holding for that range of addresses, which could possibly slow it down. Try 100 bytes, or 1000, see if that helps. > I said that wrong: my program explodes, valgrind produces a ton of error > messages, and everything exits. valgrind is not crashing in any way as > far as I can see. I guess the problem is that in the routine in > question there are two code paths each of which leads to the address > being taken of a (different) local procedure - presumably this is > why there are sometimes different trampolines at the same stack address. > It is to be expected that the program dies if the wrong one is called... Right. Unfortunately you've hit a dark corner of Valgrind that doesn't work very well. As the commentary on bug #69511 suggests, we don't have a good solution for this problem, since detecting it would be very expensive. N |
|
From: Duncan S. <bal...@fr...> - 2005-05-03 07:22:48
|
> > I guess my pointer to the local function is actually a pointer to the
> > trampoline code, so I guess I can use that as the start address. Any
> > idea how much code a trampoline can contain? Does it actually vary
> > depending on the number of local variables used as function arguments?
>
> No idea, but I'd hope it's small, eg. only a handful of instructions.
> The good thing is that you can be conservative -- all that will happen is
> that Valgrind will flush any translations it is holding for that range of
> addresses, which could possibly slow it down. Try 100 bytes, or 1000, see
> if that helps.
>From gcc/config/i386/i386.h:
/* On the 386, the trampoline contains two instructions:
mov #STATIC,ecx
jmp FUNCTION
The trampoline is generated entirely at runtime. The operand of JMP
is the address of FUNCTION relative to the instruction following the
JMP (which is 5 bytes long). */
/* Length in units of the trampoline for entering a nested function. */
#define TRAMPOLINE_SIZE (TARGET_64BIT ? 23 : 10)
> > I said that wrong: my program explodes, valgrind produces a ton of error
> > messages, and everything exits. valgrind is not crashing in any way as
> > far as I can see. I guess the problem is that in the routine in
> > question there are two code paths each of which leads to the address
> > being taken of a (different) local procedure - presumably this is
> > why there are sometimes different trampolines at the same stack address.
> > It is to be expected that the program dies if the wrong one is called...
>
> Right. Unfortunately you've hit a dark corner of Valgrind that doesn't
> work very well. As the commentary on bug #69511 suggests, we don't have a
> good solution for this problem, since detecting it would be very
> expensive.
By the way, local procedures are much more widely used in a language
like Ada, than in C. As a data point, I ran all the ACATS tests (see
the gcc testsuite/ada directory) under valgrind, and the majority of
failures were spurious, coming from the trampoline problem. On the
other hand, there are about 1800 tests, and only about 50 triggered
the trampoline problem (I don't recall the exact number), so it's not
that common either.
All the best,
Duncan.
|
|
From: Nicholas N. <nj...@cs...> - 2005-05-03 12:52:36
|
On Tue, 3 May 2005, Duncan Sands wrote: >> Right. Unfortunately you've hit a dark corner of Valgrind that doesn't >> work very well. > > By the way, local procedures are much more widely used in a language > like Ada, than in C. I guess Ada-interoperability is a dark corner, then :) You're only the second person I recall being stung by this. > As a data point, I ran all the ACATS tests (see the gcc testsuite/ada > directory) under valgrind, and the majority of failures were spurious, > coming from the trampoline problem. On the other hand, there are about > 1800 tests, and only about 50 triggered the trampoline problem (I don't > recall the exact number), so it's not that common either. That's good to know; thanks for the data point. N |
|
From: Duncan S. <bal...@fr...> - 2005-05-19 08:24:19
|
> >> Right. Unfortunately you've hit a dark corner of Valgrind that doesn't > >> work very well. > > > > By the way, local procedures are much more widely used in a language > > like Ada, than in C. > > I guess Ada-interoperability is a dark corner, then :) You're only the > second person I recall being stung by this. The pain is real though... Person number two humbly asks if anything can be done. > > As a data point, I ran all the ACATS tests (see the gcc testsuite/ada > > directory) under valgrind, and the majority of failures were spurious, > > coming from the trampoline problem. On the other hand, there are about > > 1800 tests, and only about 50 triggered the trampoline problem (I don't > > recall the exact number), so it's not that common either. > > That's good to know; thanks for the data point. It's going to get worse though - the latest language revision adds a container library that pretty much requires taking pointers to local subroutines. That's the reason this kind of thing is spreading through my code right now. This will also be the case for other Ada users. I know we are not so numerous, but we certainly exist :) Best wishes, Duncan. |
|
From: Julian S. <js...@ac...> - 2005-05-19 10:17:45
|
> > I guess Ada-interoperability is a dark corner, then :) You're only the > > second person I recall being stung by this. > > The pain is real though... Person number two humbly asks if anything > can be done. Ok, I'll think about what's needed to implement this. Let me see if I understand the landscape correctly. The problem is that ada (and nested C/C++ fns, with gcc) create small bits of code on the stack and then run them. V fails to notice when on of these small bits of code gets re-made, and so may wind up using out of date translations. So we can easily enough generate self-checking translations. A problem is, since self-checking translations are expensive to run, we want to make as few as possible. That means having a good heuristic for deciding when to do so. The currently postulated heuristic is to make a self- checking translation for code within some small offset of the stack pointer. Ideally the heuristic should say "yes" as infrequently as possible, but it should also never miss any such cases either. Is that correct? Any other things I need to take into account? J |
|
From: Duncan S. <bal...@fr...> - 2005-05-19 10:56:01
|
Hi Julian, the pain is already less!
> Let me see if I understand the landscape correctly.
>
> The problem is that ada (and nested C/C++ fns, with gcc) create small
> bits of code on the stack and then run them. V fails to notice when on
> of these small bits of code gets re-made, and so may wind up using out
> of date translations.
that is my understanding.
> So we can easily enough generate self-checking translations. A problem
> is, since self-checking translations are expensive to run, we want to
> make as few as possible. That means having a good heuristic for deciding
> when to do so. The currently postulated heuristic is to make a self-
> checking translation for code within some small offset of the stack
> pointer. Ideally the heuristic should say "yes" as infrequently as
> possible, but it should also never miss any such cases either.
>
> Is that correct? Any other things I need to take into account?
On i386 (32 bit), trampolines have the following properties:
(1) they are always aligned on 16 byte boundaries
(2) they always consist of the following two instructions:
mov #STATIC,ecx
jmp FUNCTION
I got this from gcc/doc/tm.texi and gcc/config/i386/i386.h.
In gcc/config/i386/i386.c you can find x86_initialize_trampoline
which generates 0xb9 (mov ecx,imm32) and 0xe9 (jmp rel32).
II guess if you look for jumps to 16 byte aligned guys holding
those instructions near the stack pointer, then the number of
false positives will be small. Of course this is gcc specific -
do other compilers do this kind of thing too?
All the best,
Duncan.
|
|
From: Josef W. <Jos...@gm...> - 2005-05-19 15:51:38
|
On Thursday 19 May 2005 12:17, Julian Seward wrote: > The problem is that ada (and nested C/C++ fns, with gcc) create small > bits of code on the stack and then run them. What about other programs with self-modifying code and on-the-fly code generation? Like Java/C#/... VMs, and Valgrind itself? Does Valgrind already notify an "outer" Valgrind when it invalidates part of its translation cache? I wonder if it would be needed to extend the check to writable segments. There also could be a client request for a segment with the meaning: "I will execute self-written code inside, but will keep sure myself to invalidate regions before overwriting code". This could be used by Valgrind then. Josef |
|
From: Jeremy F. <je...@go...> - 2005-05-26 00:45:07
|
Julian Seward wrote:
>So we can easily enough generate self-checking translations. A problem
>is, since self-checking translations are expensive to run, we want to
>make as few as possible. That means having a good heuristic for deciding
>when to do so. The currently postulated heuristic is to make a self-
>checking translation for code within some small offset of the stack
>pointer. Ideally the heuristic should say "yes" as infrequently as
>possible, but it should also never miss any such cases either.
>
>Is that correct? Any other things I need to take into account?
>
>
How about "generate self-checking code if the code is being fetched from
a writable page"? That will never happen in normal operation, since all
code is mapped read-only. It could be fooled by someone changing page
permissions with mprotect (write some code into a page, make it RO,
execute it, make it RW, change it), but you could fix that by clearing
the translation cache for a memory range when either switching it from
RW->RO or RO->RW (one or the other should do the trick).
The problem with the "near ESP" heuristic is that it won't cope with
generated elsewhere in the address space, such as running another JIT VM
under Valgrind.
J
|
|
From: Julian S. <js...@ac...> - 2005-05-30 08:22:42
|
> How about "generate self-checking code if the code is being fetched from
> a writable page"?
Yes, also plausible ...
> That will never happen in normal operation, since all
> code is mapped read-only. It could be fooled by someone changing page
> permissions with mprotect (write some code into a page, make it RO,
> execute it, make it RW, change it), but you could fix that by clearing
> the translation cache for a memory range when either switching it from
> RW->RO or RO->RW (one or the other should do the trick).
... sigh, yes I see. More cross-module dependencies and complexity, tho.
> The problem with the "near ESP" heuristic is that it won't cope with
> generated elsewhere in the address space, such as running another JIT VM
> under Valgrind.
Yeh. True.
The "near SP" (we can't say "near ESP" any more) thing at least gets
the Ada folks moving again. I was thinking:
--check-translations={no,auto,yes} [default: auto]
with auto being 'near SP', no being none and yes being all.
Hmm. I suppose transparently supporting JVMs as well as Ada folks
would be a good thing.
One other problem with checking TC at each mprotect is that any TC
check requires inspecting all translations. I wonder if TC should be
redone so that instead of being based on a hash table it is based on an
AVL tree, so that address range invalidation is O(log(N)) in the number
of translations instead of O(N).
J
|
|
From: Jeremy F. <je...@go...> - 2005-05-31 22:26:09
|
Julian Seward wrote:
>>That will never happen in normal operation, since all
>>code is mapped read-only. It could be fooled by someone changing page
>>permissions with mprotect (write some code into a page, make it RO,
>>execute it, make it RW, change it), but you could fix that by clearing
>>the translation cache for a memory range when either switching it from
>>RW->RO or RO->RW (one or the other should do the trick).
>>
>>
>
>... sigh, yes I see. More cross-module dependencies and complexity, tho.
>
>
Well, there should be a general "code may have changed" interface to the
translation cache, which should be a fairly narrow interface. In 2.4
(haven't looked at 3 in detail), the Segment list has a flag to keep
track of which regions of memory have had code generated from them, so
its fairly cheap to tell if something needs to be done.
>Yeh. True.
>
>The "near SP" (we can't say "near ESP" any more) thing at least gets
>the Ada folks moving again. I was thinking:
>
> --check-translations={no,auto,yes} [default: auto]
>
>with auto being 'near SP', no being none and yes being all.
>
>
I think "fetch from writable" is general enough, and should be pretty
foolproof. It would be interesting to see if there are any instances of
code being executed from RW memory which hasn't been generated (for
example someone running from RW mmaped memory). Given some of the
strange executables we've seen various compilers generate, it could
happen, so it probably needs a way to turn it off.
>One other problem with checking TC at each mprotect is that any TC
>check requires inspecting all translations. I wonder if TC should be
>redone so that instead of being based on a hash table it is based on an
>AVL tree, so that address range invalidation is O(log(N)) in the number
>of translations instead of O(N).
>
The existing skiplist implementation can do that easily.
J
|
|
From: Karl <kh...@tr...> - 2005-05-31 16:47:21
|
On 2005-05-25 17:44:58 -0700, Jeremy Fitzhardinge wrote:
> How about "generate self-checking code if the code is being fetched
> from a writable page"? That will never happen in normal operation,
> since all code is mapped read-only. It could be fooled by someone
> changing page permissions with mprotect (write some code into a
> page, make it RO, execute it, make it RW, change it), but you could
> fix that by clearing the translation cache for a memory range when
> either switching it from RW->RO or RO->RW (one or the other should
> do the trick).
You can't fix it by clearing translations only on rw->ro transitions.
If you do, then after an ro->rw transition, you would have
non-self-invalidating code based on writeable memory.
The other option works: then, after a rw->ro transition, you would
have self-invalidating code based on read-only memory, which is
inefficient but correct.
--
Karl Hasselström, kh...@tr...
www.treskal.com/kalle
|
|
From: Jeremy F. <je...@go...> - 2005-05-31 22:27:09
|
Karl Hasselstr=F6m wrote:
>You can't fix it by clearing translations only on rw->ro transitions.
>If you do, then after an ro->rw transition, you would have
>non-self-invalidating code based on writeable memory.
>
>The other option works: then, after a rw->ro transition, you would
>have self-invalidating code based on read-only memory, which is
>inefficient but correct.
>
Yep, you're right.
J
|
|
From: Tom H. <to...@co...> - 2005-05-19 10:32:43
|
In message <200...@ac...>
Julian Seward <js...@ac...> wrote:
> So we can easily enough generate self-checking translations. A problem
> is, since self-checking translations are expensive to run, we want to
> make as few as possible. That means having a good heuristic for deciding
> when to do so. The currently postulated heuristic is to make a self-
> checking translation for code within some small offset of the stack
> pointer. Ideally the heuristic should say "yes" as infrequently as
> possible, but it should also never miss any such cases either.
Wouldn't it be better to do it for any code on the stack? Any code on
the stack is inherently dangerous because it can be invalidated by the
stack pointer moving.
So just testing for code being in a segment with SF_STACK set might
do as a heuristic.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Julian S. <js...@ac...> - 2005-05-19 11:54:31
|
> On Thursday 19 May 2005 11:32, Tom Hughes wrote: > Wouldn't it be better to do it for any code on the stack? Any code on > the stack is inherently dangerous because it can be invalidated by the > stack pointer moving. > > So just testing for code being in a segment with SF_STACK set might > do as a heuristic. Yes. You're quite right. I'm just trying to figure out how to balance the conflicting demands of (1) self-check as many translations as possible so as to minimise the chances of not catching one of these trampolines vs (2) self-check as few translations as possible so as to minimise the performance hit. > On i386 (32 bit), trampolines have the following properties: > > (1) they are always aligned on 16 byte boundaries > (2) they always consist of the following two instructions: > > mov #STATIC,ecx > jmp FUNCTION That's true, but it's too fragile, as Tom points out. One over- enthusiastic gcc hacker changing this a bit and we're hosed. Self-checking all translations from a stackish segment seems about right. So, what I'm thinking is to calculate a 32-bit CRC of the code and store it in the translation; then rerun the crc for the self-check. Except a CRC is expensive in terms of insns and cache misses (it requires a table). Mark Adler (co-author of gzip) had some other magic checksum scheme that gzip uses, which doesn't require a table and is fast. Maybe use that instead. J |
|
From: Duncan S. <bal...@fr...> - 2005-05-19 12:51:32
|
> So, what I'm thinking is to calculate a 32-bit CRC of the code > and store it in the translation; then rerun the crc for the > self-check. Except a CRC is expensive in terms of insns and > cache misses (it requires a table). Mark Adler (co-author > of gzip) had some other magic checksum scheme that gzip uses, which > doesn't require a table and is fast. Maybe use that instead. How do you know how big the code is? D. |
|
From: Julian S. <js...@ac...> - 2005-05-19 13:16:58
|
> How do you know how big the code is? what code? [unclear what you're referring to] J |
|
From: Duncan S. <bal...@fr...> - 2005-05-19 13:29:10
|
On Thu, 2005-05-19 at 14:16 +0100, Julian Seward wrote: > > How do you know how big the code is? > > what code? [unclear what you're referring to] The code you're taking the CRC of. For example, in the case of a trampoline, you have a pointer to a 10 byte long instruction sequence on the stack. You want to calculate a CRC of these 10 bytes and store them somewhere. How do you know it is 10 bytes long? Or were you planning to do a CRC of some fixed length that seems big enough to cover most cases? But maybe I misunderstood... All the best, Duncan. |
|
From: Nicholas N. <nj...@cs...> - 2005-05-19 13:12:41
|
On Thu, 19 May 2005, Julian Seward wrote: > So, what I'm thinking is to calculate a 32-bit CRC of the code > and store it in the translation; then rerun the crc for the > self-check. Except a CRC is expensive in terms of insns and > cache misses (it requires a table). Mark Adler (co-author > of gzip) had some other magic checksum scheme that gzip uses, which > doesn't require a table and is fast. Maybe use that instead. The good thing is that the checksum doesn't need to be very good, since false collisions will only cause extra translations to be executed. Something incredibly simple like adding up each byte of code (modulo 256) might be good enough. N |
|
From: Olly B. <ol...@su...> - 2005-05-19 15:56:44
|
On 2005-05-19, Julian Seward <js...@ac...> wrote:
> So, what I'm thinking is to calculate a 32-bit CRC of the code
> and store it in the translation; then rerun the crc for the
> self-check. Except a CRC is expensive in terms of insns and
> cache misses (it requires a table). Mark Adler (co-author
> of gzip) had some other magic checksum scheme that gzip uses, which
> doesn't require a table and is fast. Maybe use that instead.
It's called "adler32".
But had you considered simply copying the raw instructions and just
doing a full compare?
For a standard short (10 byte for x86) trampoline sequence, this will
probably be about as quick as calculating a 64 bit checksum. For a
longer sequence it may be a little slower if the code hasn't change,
but it also has absolutely no risk of failing to spot a change. And if
the code has changed, it can terminate the comparison early, which a
checksum can't...
Cheers,
Olly
|
|
From: Julian S. <js...@ac...> - 2005-05-19 13:22:11
|
> The good thing is that the checksum doesn't need to be very good, since > false collisions will only cause extra translations to be executed. > Something incredibly simple like adding up each byte of code (modulo 256) > might be good enough. Are you sure? If the checksum concludes incorrectly that the new code is the same as the old code, then we are hosed :-) [i think] So in fact I'd really prefer a 64-bit checksum if possible. J |