|
From: Tom H. <to...@co...> - 2005-08-07 10:53:03
Attachments:
ehdr.patch
|
While investigating bug #110205 the question arose as to why we hide
any kernel provided vdso from the client program and whether we could
allow the client program to use it instead.

Julian wanted to know more about vdsos, so here is a quick summary of
where I'm at.

The code for the x86 vdso is in arch/i386/kernel in the kernel source
(the vsyscall*.S files). Current kernels contain two versions of the
vdso, one for systems supporting the sysenter instruction and one for
systems which do not support it.

Each vdso currently contains __kernel_vsyscall which is a routine to
make a system call, either using sysenter or using int $80 for older
machines. They also contain __kernel_sigreturn and __kernel_rt_sigreturn
which are the default return addresses for signal handlers when the
kernel builds a signal frame on the stack.

The vdso is a full ELF object and AT_SYSINFO_EHDR in the auxv will
point to it - in addition AT_SYSINFO will point at the system call
routine within that object as old glibcs only expect a simple system
call routine pointer.

On an Athlon system that does not support sysenter we can get valgrind
to use the vdso - the patch to do so is attached.

The first thing is just to stop patching in AT_IGNORE over the auxv
entries. The second problem is that the vdso seems to be mapped fairly
low down in what is normally the client space so we normally
accidentally unmap it when we make the client hole. I have worked
around that in the patch.

The next problem obviously is sysenter. Unfortunately supporting it
might be hard as the user program does not provide the return address
when it is in use - the kernel returns (using sysexit) to a fixed
address based on where it has mapped the vdso. That address is not
even the one after the sysenter instruction - there are some nops
and a jump in between.

So if VEX caught sysenter and handed control back to valgrind and
valgrind then ran the system call and tried to return control to the
address after the sysenter instruction we would actually wind up
looping back and doing the system call again...

Something else to beware of is that we wind up with stage2 and
the client program both using the vdso (when stage2 uses glibc
to do things) although as it has no state that is probably safe.

On amd64 I don't believe the vdso is used in the same way - although
there is a special page containing vsyscalls I don't believe that the
auxv is used to announce it as it is mapped at a fixed address and
glibc just assumes it can jump to addresses in that page for certain
system calls.

The x86 on amd64 system call stuff does use it I believe, as when I was
playing with a 32 bit build on an amd64 box on Friday I managed to hit
a sysenter instruction (Athlon 64 does support sysenter for 32 bit
programs). I haven't looked at that in detail though.

Likewise I haven't looked at ppc32 at all.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
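As an illustration of what "patching in AT_IGNORE over the auxv entries" amounts to, here is a minimal sketch in C. It is not Valgrind's code: the `demo_` names and the entry struct are hypothetical and exist only for this example, while the tag values are the ones Linux `<elf.h>` uses. The idea is simply that every vdso-related entry is retagged so the client's libc skips it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag values as defined in Linux <elf.h>, repeated under demo names so
   the sketch builds without system-specific headers. */
enum {
    DEMO_AT_NULL         = 0,   /* terminates the vector */
    DEMO_AT_IGNORE       = 1,   /* entry is to be skipped by the reader */
    DEMO_AT_SYSINFO      = 32,  /* points at the system call routine */
    DEMO_AT_SYSINFO_EHDR = 33   /* points at the vdso ELF header */
};

/* Hypothetical 32-bit auxv entry, laid out like Elf32_auxv_t. */
typedef struct {
    uint32_t a_type;
    uint32_t a_val;
} demo_auxv_t;

/* Retag the vdso-related entries as AT_IGNORE so the client never
   learns where the vdso is. Returns the number of entries hidden. */
static int demo_hide_vdso(demo_auxv_t *auxv)
{
    int hidden = 0;

    assert(auxv != NULL);
    for (; auxv->a_type != DEMO_AT_NULL; auxv++) {
        if (auxv->a_type == DEMO_AT_SYSINFO ||
            auxv->a_type == DEMO_AT_SYSINFO_EHDR) {
            auxv->a_type = DEMO_AT_IGNORE;
            hidden++;
        }
    }
    return hidden;
}
```

Letting the client use the vdso, as the attached patch does, corresponds to not making a call like `demo_hide_vdso` at all, which is why the mapping then has to survive the setup of the client address space.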
|
From: Julian S. <js...@ac...> - 2005-08-07 11:39:39
|
> The next problem obviously is sysenter. Unfortunately supporting it
> might be hard as the user program does not provide the return address
> when it is in use - the kernel returns (using sysexit) to a fixed
> address based on where it has mapped the vdso. That address is not
> even the one after the sysenter instruction - there are some nops
> and a jump in between.
>
> So if VEX caught sysenter and handed control back to valgrind and
> valgrind then ran the system call and tried to return control to the
> address after the sysenter instruction we would actually wind up
> looping back and doing the system call again...
I think (unless I misunderstand) that this is a non-problem.
Vex makes it easy to defer to the scheduler and simultaneously
set the next-IP value to anything you like (any constant value
known at translation-time). So for sysenter we'd merely need
to tell vex what the where-next address is, and there's already
a struct to tell Vex that kind of stuff (eg, for ppc32 'dcbz'
it needs to know the cache line size of the machine being simulated,
and only Valgrind knows that).
Shall I hack this up?
> Something else to beware of is that we wind up with stage2 and
> the client program both using the vdso (when stage2 uses glibc
> to do things) although as it has no state that is probably safe.
Yet another reason to be completely independent of glibc.
> On amd64 I don't believe the vdso is used in the same way - although
> there is a special page containing vsyscalls I don't believe that the
> auxv is used to announce it as it is mapped at a fixed address and
> glibc just assumes it can jump to addresses in that page for certain
> system calls.
That is my understanding too.
J
|
|
From: Tom H. <to...@co...> - 2005-08-07 11:56:15
|
In message <200...@ac...>
Julian Seward <js...@ac...> wrote:
> > The next problem obviously is sysenter. Unfortunately supporting it
> > might be hard as the user program does not provide the return address
> > when it is in use - the kernel returns (using sysexit) to a fixed
> > address based on where it has mapped the vdso. That address is not
> > even the one after the sysenter instruction - there are some nops
> > and a jump in between.
> >
> > So if VEX caught sysenter and handed control back to valgrind and
> > valgrind then ran the system call and tried to return control to the
> > address after the sysenter instruction we would actually wind up
> > looping back and doing the system call again...
>
> I think (unless I misunderstand) that this is a non-problem.
> Vex makes it easy to defer to the scheduler and simultaneously
> set the next-IP value to anything you like (any constant value
> known at translation-time). So for sysenter we'd merely need
> to tell vex what the where-next address is, and there's already
> a struct to tell Vex that kind of stuff (eg, for ppc32 'dcbz'
> it needs to know the cache line size of the machine being simulated,
> and only Valgrind knows that).
>
> Shall I hack this up?
Might be an idea. I believe that Solaris needs sysenter support
anyway if I remember correctly.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Julian S. <js...@ac...> - 2005-08-07 14:52:53
|
> > I think (unless I misunderstand) that this is a non-problem.
> > Vex makes it easy to defer to the scheduler and simultaneously
> > set the next-IP value to anything you like (any constant value
> > known at translation-time). So for sysenter we'd merely need
> > to tell vex what the where-next address is, and there's already
> > a struct to tell Vex that kind of stuff (eg, for ppc32 'dcbz'
> > it needs to know the cache line size of the machine being simulated,
> > and only Valgrind knows that).
> >
> > Shall I hack this up?
>
> Might be an idea. I believe that Solaris needs sysenter support
> anyway if I remember correctly.
Done (1320/4337). Note, I realised later the above scheme is
unnecessarily complicated. A sysenter now causes the calling thread
to return to the scheduler with code VEX_TRC_JMP_SYSENTER_X86.
It is the scheduler's problem to fill in the thread's guest_EIP
with a valid restart address before letting it run again.
J
|
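The division of labour described here (the translated code only reports that a sysenter happened, and the scheduler picks the resume address) can be sketched roughly as follows. This is a toy model, not Valgrind or VEX code: every `toy_` name, the enum values, and the fake system call are assumptions made up for the illustration, and the real value of VEX_TRC_JMP_SYSENTER_X86 is not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes handed back to the scheduler. */
typedef enum {
    TOY_TRC_NONE,      /* nothing special happened */
    TOY_TRC_SYSENTER   /* the guest executed sysenter */
} toy_trc_t;

/* Hypothetical slice of per-thread guest state. */
typedef struct {
    uint32_t guest_EAX;  /* syscall number in, result out */
    uint32_t guest_EIP;  /* where the thread resumes */
} toy_thread_t;

/* Stand-in for performing the system call on the client's behalf. */
typedef uint32_t (*toy_syscall_fn)(uint32_t sysno);

/* Fake syscall used only for demonstration; always "succeeds" with 1234. */
static uint32_t toy_fake_syscall(uint32_t sysno)
{
    (void)sysno;
    return 1234u;
}

/* Handle one return to the scheduler. For a sysenter, run the syscall
   and then install restart_addr as the resume address; the caller is
   responsible for knowing what that address should be.
   Returns 1 if the event was handled here, 0 otherwise. */
static int toy_scheduler_handle(toy_thread_t *t, toy_trc_t trc,
                                uint32_t restart_addr,
                                toy_syscall_fn do_syscall)
{
    assert(t != NULL && do_syscall != NULL);
    if (trc != TOY_TRC_SYSENTER)
        return 0;
    t->guest_EAX = do_syscall(t->guest_EAX);
    t->guest_EIP = restart_addr;  /* the scheduler's job, not the translator's */
    return 1;
}
```

The point of the design is that the translator needs no knowledge of where the kernel would have returned to; that knowledge is confined to whoever supplies `restart_addr`.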
|
From: Tom H. <to...@co...> - 2005-08-07 16:03:14
|
In message <200...@ac...>
Julian Seward <js...@ac...> wrote:
>
> > > I think (unless I misunderstand) that this is a non-problem.
> > > Vex makes it easy to defer to the scheduler and simultaneously
> > > set the next-IP value to anything you like (any constant value
> > > known at translation-time). So for sysenter we'd merely need
> > > to tell vex what the where-next address is, and there's already
> > > a struct to tell Vex that kind of stuff (eg, for ppc32 'dcbz'
> > > it needs to know the cache line size of the machine being simulated,
> > > and only Valgrind knows that).
> > >
> > > Shall I hack this up?
> >
> > Might be an idea. I believe that Solaris needs sysenter support
> > anyway if I remember correctly.
>
> Done (1320/4337). Note, I realised later the above scheme is
> unnecessarily complicated. A sysenter now causes the calling thread
> to return to the scheduler with code VEX_TRC_JMP_SYSENTER_X86.
> It is the scheduler's problem to fill in the thread's guest_EIP
> with a valid restart address before letting it run again.
Thanks for that. I'll have to find an Intel box to try it on.
I was being a muppet earlier when I claimed that the amd64 in 32 bit
mode was using sysenter - it was actually using syscall so we will
need that as well... The 32 bit emulation on amd64 uses either syscall
or sysenter in the vdso depending on the processor.
It never uses int $80 although the kernel does seem to recognise and
handle it because after adding syscall support to VEX on x86 I now have
a 32 bit process running under valgrind with the vdso patch.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Tom H. <to...@co...> - 2005-08-11 08:47:21
|
In message <200...@ac...>
Julian Seward <js...@ac...> wrote:
> Done (1320/4337). Note, I realised later the above scheme is
> unnecessarily complicated. A sysenter now causes the calling thread
> to return to the scheduler with code VEX_TRC_JMP_SYSENTER_X86.
> It is the scheduler's problem to fill in the thread's guest_EIP
> with a valid restart address before letting it run again.
The only problem with this is that by zeroing the EIP value valgrind
has no way of knowing the address of the sysenter instruction as far
as I can see, which means no way to fill in the new address which is
relative to the sysenter instruction.
I can't help thinking this is all going to be very fragile anyway as
there is nothing to stop a future linux kernel changing the offset
between the sysenter instruction and the return point.
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
|
|
From: Julian S. <js...@ac...> - 2005-08-11 09:57:11
|
> The only problem with this is that by zeroing the EIP value valgrind
> has no way of knowing the address of the sysenter instruction as far
> as I can see, which means no way to fill in the new address which is
> relative to the sysenter instruction.
So how about if the EIP was set to be the address after the
sysenter insn? Then you would know where it was.
> I can't help thinking this is all going to be very fragile anyway as
> there is nothing to stop a future linux kernel changing the offset
> between the sysenter instruction and the return point.
That may be so, but perhaps we should separate the issues of
implementing sysenter in a way which correctly implements the
x86 semantics vs whether we can really use it effectively.
Are you saying that it's not a good idea to proceed along this
route of trying to make vdso's work properly? That would be a shame.
J
|
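If the EIP is left pointing just past the sysenter, the arithmetic the scheduler would need is small. The sketch below is an assumption-laden illustration rather than anything from Valgrind: the `demo_` names are hypothetical, and the offset to the kernel's return point is deliberately a parameter because, as noted in the thread, nothing stops a kernel from changing it. The only fixed fact used is that sysenter encodes as the two bytes 0F 34.

```c
#include <assert.h>
#include <stdint.h>

/* sysenter is a two-byte instruction (0F 34). */
enum { DEMO_SYSENTER_LEN = 2 };

/* Recover the address of the sysenter instruction from a guest EIP
   that was left pointing at the byte immediately after it. */
static uint32_t demo_sysenter_addr(uint32_t eip_after)
{
    assert(eip_after >= DEMO_SYSENTER_LEN);
    return eip_after - DEMO_SYSENTER_LEN;
}

/* Compute where the thread should resume: the sysenter address plus a
   kernel-specific offset to the return point. The offset is supplied
   by the caller because it depends on the vdso in use. */
static uint32_t demo_resume_addr(uint32_t eip_after, uint32_t return_offset)
{
    return demo_sysenter_addr(eip_after) + return_offset;
}
```

For example, with a purely illustrative offset of 11, `demo_resume_addr(0xffffe40c, 11)` gives `0xffffe415`. Keeping the offset out of the function is one way to contain the fragility: only the caller has to change if a kernel lays out its vdso differently.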