user-mode-linux-devel

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Borislav P. <bp...@am...> - 2011-08-23 14:26:26

On Tue, Aug 23, 2011 at 02:15:31AM -0400, Al Viro wrote:
> Almost, but not quite.  What happens is:
> * process hits syscall insn
> * it's stopped and tracer (guest kernel) does GETREGS
> 	+ looks at the registers (mapped to the normal layout)
> 	+ decides to call sys_brk()
> 	+ notices pages to kick out
> 	+ queues munmap request for stub
> * tracer does SETREGS, pointing the child's eip to stub and sp to stub stack
> * tracer does CONT, letting the child run
> * child finishes with syscall insn, carefully preserving ebp.  It returns to
>   userland, in the beginning of the stub.
> * child does munmap() and hits int 3 in the end of stub.
> * the damn thing is stopped again.  The tracer had been waiting for it.
> * tracer finishes with sys_brk() and returns success.
> * it does SETREGS, setting eax to return value, eip to original return
> address of syscall insn... and ebp to what it had in regs.bp.  I.e. the
> damn arg6 value.
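
The sequence above can be modelled without ptrace at all.  What follows is
a minimal sketch in plain C, assuming the register roles described in this
thread: at a SYSCALL-based stop GETREGS shows arg2 in ecx and arg6 in ebp,
while the real user ebp holds arg2 and arg6 sits on the user stack.  The
names struct regs, cooked_view(), blind_setregs() and vdso_tail() are
hypothetical, not UML or kernel code.

```c
#include <stdint.h>

/* Hypothetical subset of the child's registers; not a kernel structure. */
struct regs {
    uint32_t eax, ecx, ebp;
};

/*
 * What GETREGS is assumed to report at a SYSCALL-based stop: the "normal"
 * layout, with arg2 in ecx and arg6 in ebp, although the real user ebp
 * holds arg2 and arg6 was pushed on the user stack by the vdso.
 */
struct regs cooked_view(const struct regs *user, uint32_t arg6_on_stack)
{
    struct regs cooked = *user;
    cooked.ecx = user->ebp;      /* arg2 lives in the real user ebp */
    cooked.ebp = arg6_on_stack;  /* arg6 is fetched from the user stack */
    return cooked;
}

/* Tracer writes back everything it read, plus the syscall return value. */
void blind_setregs(struct regs *user, const struct regs *cooked, uint32_t ret)
{
    *user = *cooked;
    user->eax = ret;
}

/* Tail of the SYSCALL-based __kernel_vsyscall: mov %ebp,%ecx; pop %ebp. */
void vdso_tail(struct regs *user, uint32_t saved_ebp_on_stack)
{
    user->ecx = user->ebp;
    user->ebp = saved_ebp_on_stack;
}
```

Running cooked_view(), blind_setregs() and vdso_tail() in that order leaves
ecx holding the arg6 value instead of arg2, which is the clobbering being
described; ebp itself comes back fine because it is popped from the stack.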

Ok, stupid question: can a convoluted ptracing case like this be created
in "normal" userspace, i.e. irrespective of UML and only by using gdb,
for example?

I.e., from what I understand from above, you need to stop the tracee at
syscall and "redirect" it to the stub after it finishes the syscall so
that in another syscall it gets a debug exception... sounds complicated.

> And we are fucked.  It doesn't happen in syscall handler.  It's int3().
> Having no idea that this request to set ebp should be interpreted in
> a really different way - "put the value I asked to put into ecx here,
> please, and ignore this one".
> 
> Sigh...  The really ugly part is that ebp can be changed by the stuff
> done in stub - it's not just munmap, it can do mmap as well.  We can,
> in principle, save ebp on its stack and restore it before trapping.
> Then uml kernel could, in theory, replace that SETREGS with a bunch of
> POKEUSER, leaving ebp alone.  Ho-hum...  In principle, that might even
> be not too horrible - we need eax/eip/esp, of course, but the rest
> could be dealt with by the same trick - have it pushed/popped in the
> stub and to hell with wasting syscalls on setting them...
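
The "bunch of POKEUSER" idea can be sketched as a list of per-register
writes.  This is only an illustration, assuming a 32-bit tracer and the
i386 user_regs_struct field order (EBX=0 ... SS=16 as in <sys/reg.h>);
struct regs32, struct poke and build_resume_pokes() are hypothetical
names, and no ptrace call is actually made here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Mirrors the i386 user_regs_struct field order, with fixed-width fields
 * so the offsets are the same on any build host.
 */
struct regs32 {
    int32_t ebx, ecx, edx, esi, edi, ebp, eax;
    int32_t xds, xes, xfs, xgs, orig_eax;
    int32_t eip, xcs, eflags, esp, xss;
};

/* One pending write: byte offset into the user area and the new value. */
struct poke {
    size_t  offset;
    int32_t value;
};

/*
 * Fill out[] with the only writes needed to resume the child after the
 * stub: eax (return value), eip and esp.  ebp is deliberately left alone.
 * Returns the number of entries written (always 3).
 */
size_t build_resume_pokes(const struct regs32 *want, struct poke out[3])
{
    out[0] = (struct poke){ offsetof(struct regs32, eax), want->eax };
    out[1] = (struct poke){ offsetof(struct regs32, eip), want->eip };
    out[2] = (struct poke){ offsetof(struct regs32, esp), want->esp };
    return 3;
}
```

Each entry would correspond to one ptrace(PTRACE_POKEUSER, pid, offset,
value) call in the tracer; the remaining registers would have to be
pushed and popped by the stub itself, as suggested above.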

which could mean that we could get away by not replacing SYSCALL32?

Hmm.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 16:30:48

On Tue, Aug 23, 2011 at 04:26:08PM +0200, Borislav Petkov wrote:
> On Tue, Aug 23, 2011 at 02:15:31AM -0400, Al Viro wrote:
> > Almost, but not quite.  What happens is:
> > * process hits syscall insn
> > * it's stopped and tracer (guest kernel) does GETREGS
> > 	+ looks at the registers (mapped to the normal layout)
> > 	+ decides to call sys_brk()
> > 	+ notices pages to kick out
> > 	+ queues munmap request for stub
> > * tracer does SETREGS, pointing the child's eip to stub and sp to stub stack
> > * tracer does CONT, letting the child run
> > * child finishes with syscall insn, carefully preserving ebp.  It returns to
> >   userland, in the beginning of the stub.
> > * child does munmap() and hits int 3 in the end of stub.
> > * the damn thing is stopped again.  The tracer had been waiting for it.
> > * tracer finishes with sys_brk() and returns success.
> > * it does SETREGS, setting eax to return value, eip to original return
> > address of syscall insn... and ebp to what it had in regs.bp.  I.e. the
> > damn arg6 value.
> 
> Ok, stupid question: can a convoluted ptracing case like this be created
> in "normal" userspace, i.e. irrespective of UML and only by using gdb,
> for example?

I don't know...

> I.e., from what I understand from above, you need to stop the tracee at
> syscall and "redirect" it to the stub after it finishes the syscall so
> that in another syscall it gets a debug exception... sounds complicated.

Basically, we need to do things that tracer can't do via ptrace() - i.e.
play with mappings in the child.  I.e. we need to do several syscalls
in child, then return it to traced state.  And all of that - before we
return to execution of instructions past the syscall.

BTW, booting 32bit uml with nosysemu on such boxen blows up instantly, since
there we have *all* SETREGS done on the way out of syscall (with sysemu we
use PTRACE_SYSEMU, which will stop on syscall entry, let you play with
registers and suppress both the sys_...() call itself and the stop on
the way out; without sysemu it'll use PTRACE_SYSCALL, replace syscall
number with something harmless (getpid(2)), let it execute, stop on
the way out and update the registers there).  Same issue, only here it
really happens from within the syscall handler itself.

Hell knows...  I have no idea what kind of weirdness ptrace users exhibit.
FWIW, I suspect that there's another mess around signals in uml - signal
frame is built by tracer and it *has* to contain ebp.  And have eip pointing
to insn immediately past the syscall.  What should sigreturn do?  It got
to restore ebp - can't rely on signal handler not having buggered the
register.  And we are again in for it - ebp set to arg6 as we return to
insn right after syscall one.

OTOH, making GETREGS/PEEKUSER return registers without arg2 -> ecx, arg6 -> ebp
would instantly break both uml and far less exotic things.  strace(1), for
one.  Anything that wants to examine the arguments of stopped syscall will
be broken.

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 16:49:23

On Tue, Aug 23, 2011 at 09:03:04AM -0700, Linus Torvalds wrote:

> Suggested fixes:
> 
>  - instead of blindly doing SETREGS, just write the result registers
> individually like you suggested

Not enough.  There is also a PITA with signal handlers.  There we can't
avoid modifying ebp on the way out of handler (i.e. by emulated sigreturn).
And it lands us straight after syscall insn, with ebp "restored" to the
wrong value.

> OR (and perhaps preferably):
> 
>  - teach UML that when you do 'GETREGS' after a system call trapped,
> we have switched things around to match the "official system call
> order", and UML should just undo our swizzling, and do a "regs.ebp =
> regs.ecx" to make it be what the actual original registers were (or
> whatever the actual correct swizzle is - I didn't think that through
> very much).

Um...  How would it know which syscall variant had that been, to start
with?  For int 0x80 it would need to use registers as-is.  For SYSENTER
it also could use them as-is - ebp will differ from what we put there
when entering the sucker, but not critically so; on the way out of
syscall we'll overwrite it anyway immediately (either by pop or mov).
For SYSCALL... we don't really care about ecx contents prior to entering
the kernel (and it'll be blown out anyway), and ebp one could be found in
regs.ecx.  So yes, we can do it that way, but... how to tell what variant
had been triggered?  Examining two bytes prior to user eip?  Sounds bloody
brittle...
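
For reference, the "examine two bytes" check could look roughly like the
sketch below.  The opcode encodings are the architectural ones (int $0x80
is cd 80, sysenter is 0f 34, syscall is 0f 05); enum entry_kind and
classify_entry() are hypothetical names, and how the caller locates the
right two bytes is deliberately left out, since that is the brittle part.

```c
#include <stdint.h>

enum entry_kind {
    ENTRY_UNKNOWN,
    ENTRY_INT80,
    ENTRY_SYSENTER,
    ENTRY_SYSCALL
};

/* Classify a two-byte instruction as one of the syscall entry variants. */
enum entry_kind classify_entry(const uint8_t insn[2])
{
    if (insn[0] == 0xcd && insn[1] == 0x80)
        return ENTRY_INT80;      /* int $0x80 */
    if (insn[0] == 0x0f && insn[1] == 0x34)
        return ENTRY_SYSENTER;   /* sysenter */
    if (insn[0] == 0x0f && insn[1] == 0x05)
        return ENTRY_SYSCALL;    /* syscall */
    return ENTRY_UNKNOWN;
}
```

Note that with the SYSENTER-based vdso the kernel does not return to the
address right after the sysenter opcode, so the bytes immediately before
the reported eip need not be 0f 34 in that case.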

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Linus T. <tor...@li...> - 2011-08-23 17:34:11

On Tue, Aug 23, 2011 at 9:48 AM, Al Viro <vi...@ze...> wrote:
>
> Um...  How would it know which syscall variant had that been, to start
> with?

Just read the instruction, for chissake.

UML *already* does that, to see if it's "int80" or "sysenter" ('is_syscall()').

Now, I do agree that if we had designed the ptrace interface with
these kinds of issues in mind, then we would have added a "state"
field to the thing that could have this kind of information as part of
the GETREGS interface. There is no question that that would have been
a good idea - but we have what we have.

I mean, technically, we could also have always just given "raw user
space register state" to ptrace, and then just said that "anybody who
traces system calls needs to know the exact calling conventions for
*that* kind of system call". But instead of that, we give the "cooked"
pt_regs values on read-out, to make it simpler for strace and friends.

And it's actually simpler for UML too. If we *didn't* give that cooked
register set information, then UML would *still* have to look at the
actual instruction in order to emulate the system call correctly
("it's sysenter, so now I need to take some of the system call
arguments from the stack"). So the fact that we do that register state
swizzling actually helps not just strace, but UML too.

It would be *nice* if we did the swizzling automatically at setregs()
time too, but we simply don't have enough information in the kernel to
do that. Again, exactly because pt_regs doesn't have a "state"
variable, when user-space does the SETREGS call, we simply don't know
whether we are in "normal" code or in some system call entry or exit
state. So the kernel does the swizzling at GETREGS time (by virtue of
always having the registers in a "canonical" state for system call
entry), but we fundamentally *cannot* do the unswizzle, because we
don't know what the SETREGS caller actually did.

So I think the current state is actually the best we could possibly
do, with the caveat that *if* we had known about the "different system
calls have different register layouts" originally and had thought of
it, we could have added a 'state' word that the kernel could set at
GETREGS time, and use at SETREGS time to decide whether swizzling is
needed or not.

But not only would that have required time travel (ptrace existed
before the multiple system calls did), even then it's not 100% clear
that the current simpler model (with the admittedly subtle case of
implicit state and its effect on register state) isn't actually the
better solution. *Somebody* has to do the register swizzling, and the
current "kernel canonicalizes registers at read time, you need to
swizzle them if you change state" may simply be the RightThing(tm).

                                      Linus
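
The "you need to swizzle them if you change state" rule can be sketched
as a small tracer-side helper.  This is only an illustration, assuming
the tracer already knows the entry variant and that the correct swizzle
for SYSCALL is the "regs.ebp = regs.ecx" suggested earlier in the thread
(which was explicitly not thought through); struct regs, enum entry_kind
and unswizzle_for_setregs() are hypothetical names.

```c
#include <stdint.h>

enum entry_kind { ENTRY_INT80, ENTRY_SYSENTER, ENTRY_SYSCALL };

/* Hypothetical subset of the registers read with GETREGS. */
struct regs {
    uint32_t eax, ecx, ebp;
};

/*
 * Turn registers read at a syscall stop back into the values the child
 * expects to find when it resumes after the entry instruction.
 */
void unswizzle_for_setregs(struct regs *r, enum entry_kind kind)
{
    if (kind == ENTRY_SYSCALL)
        r->ebp = r->ecx;  /* the real user ebp carried arg2 */
    /* int 0x80 and SYSENTER: written back as read, per the thread. */
}
```

With this applied before SETREGS, the "mov %ebp,%ecx" in the SYSCALL-based
vdso would restore arg2 into ecx rather than the arg6 value.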

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-23 19:18:40

On 08/23/2011 09:48 AM, Al Viro wrote:
> 
> Um...  How would it know which syscall variant had that been, to start
> with?  For int 0x80 it would need to use registers as-is.  For SYSENTER
> it also could use them as-is - ebp will differ from what we put there
> when entering the sucker, but not critically so; on the way out of
> syscall we'll overwrite it anyway immediately (either by pop or mov).
> For SYSCALL... we don't really care about ecx contents prior to entering
> the kernel (and it'll be blown out anyway), and ebp one could be found in
> regs.ecx.  So yes, we can do it that way, but... how to tell what variant
> had been triggered?  Examining two bytes prior to user eip?  Sounds bloody
> brittle...

We could drop that information in a metaregister.  It's not backward
compatible, but at least it will be obvious when that information is
available and not.

	-hpa

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-23 21:09:20

On 08/23/2011 10:33 AM, Linus Torvalds wrote:
> 
> It would be *nice* if we did the swizzling automatically at setregs()
> time too, but we simply don't have enough information in the kernel to
> do that. Again, exactly because pt_regs doesn't have a "state"
> variable, when user-space does the SETREGS call, we simply don't know
> whether we are in "normal" code or in some system call entry or exit
> state. So the kernel does the swizzling at GETREGS time (by virtue of
> always having the registers in a "canonical" state for system call
> entry), but we fundamentally *cannot* do the unswizzle, because we
> don't know what the SETREGS caller actually did.
> 

Again, can we steal one of the padding fields to use for that state
variable?  We have two 16-bit padding fields; one for cs and one for ss.

For UML, I agree, let's just not expose the vdso assuming that is
possible, but for other -- possibly future -- users.

	-hpa

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Linus T. <tor...@li...> - 2011-08-23 21:20:51

On Tue, Aug 23, 2011 at 2:08 PM, H. Peter Anvin <hp...@zy...> wrote:
>
> Again, can we steal one of the padding fields to use for that state
> variable?  We have two 16-bit padding fields; one for cs and one for ss.

We can steal them for passing the information to the user, but no, I
don't think we can use them to then take the information *from* the
user.

Somebody may well be setting up a 'pt_regs' structure on his own, and
simply not fill in the padding, resulting in random data in those
fields.

                     Linus
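
The asymmetry described here (export the state, never trust it on the way
back in) can be sketched as follows.  This is a purely hypothetical
encoding, assuming the state would live in the upper 16 bits of the
32-bit cs slot; no such field exists, and export_cs_slot() and
import_cs_slot() are made-up names.

```c
#include <stdint.h>

#define SEG_MASK 0xffffu

/*
 * GETREGS side: report the selector in the low 16 bits and a hypothetical
 * "state" value in what used to be padding.
 */
uint32_t export_cs_slot(uint16_t cs, uint16_t state)
{
    return ((uint32_t)state << 16) | cs;
}

/*
 * SETREGS side: use only the selector.  The upper bits may be random data
 * from a caller that built the structure by hand, so they are ignored.
 */
uint16_t import_cs_slot(uint32_t slot)
{
    return (uint16_t)(slot & SEG_MASK);
}
```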

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-23 23:04:44

On 08/23/2011 02:20 PM, Linus Torvalds wrote:
> On Tue, Aug 23, 2011 at 2:08 PM, H. Peter Anvin <hp...@zy...> wrote:
>>
>> Again, can we steal one of the padding fields to use for that state
>> variable?  We have two 16-bit padding fields; one for cs and one for ss.
> 
> We can steal them for passing the information to the user, but no, I
> don't think we can use them to then take the information *from* the
> user.
> 
> Somebody may well be setting up a 'pt_regs' structure on his own, and
> simply not fill in the padding, resulting in random data in those
> fields.
> 

That would be fine, I'd think... just gives the user space application
enough information to know how it would have to reshuffle the registers
if it needs to.

	-hpa

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-22 01:48:53

On 08/21/2011 06:41 PM, Linus Torvalds wrote:
> If people are using syscall directly, we're pretty much stuck. No
> amount of "that's hopelessly wrong" will ever matter. We don't break
> existing binaries.
> 
> That said, I'd *hope* that everybody uses the vdso32, simply because
> user programs are not supposed to know which CPU they are running on
> and if that CPU even *supports* the syscall instruction. In which case
> it may be possible that we can play games with the vdso thing. But
> that really would be conditional on "nobody ever reports a failure".

I think we found that out with the vsyscall emulation issue last cycle.
It works, so it will have been used, somewhere...

> But if that's possible, maybe we can increment the RIP by 2 for
> 'syscall', and slip an "'int 0x80" after the syscall instruction in
> the vdso there? Resulting in the same pseudo-solution I suggested for
> sysenter...

I think we have the above problem.

The problem here is that the syscall state is actually more complex than
we retain: the entire state is given by (entry point, register state);
with that amount of state we have all the information needed to *either*
extract the syscall arguments *or* the register contents.  Without
those, we can only represent one of the two possible metalevels (right
now we represent the higher-level metalevel, the argument vector), but
we need both for different usages.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
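
The point about (entry point, register state) can be illustrated by
deriving the argument vector from the raw registers.  This is a sketch
under assumptions taken from the two __kernel_vsyscall listings in this
thread: for SYSENTER arg6 is the word at (%ebp), for SYSCALL arg2 is in
ebp and arg6 is the word at (%esp).  struct raw_regs, enum entry_kind and
extract_args() are hypothetical names.

```c
#include <stdint.h>

enum entry_kind { ENTRY_INT80, ENTRY_SYSENTER, ENTRY_SYSCALL };

/* Raw user registers at the moment of kernel entry. */
struct raw_regs {
    uint32_t ebx, ecx, edx, esi, edi, ebp;
};

/*
 * Build the six-element argument vector.  stack_word is the 32-bit value
 * the vdso left on the user stack for arg6; it is ignored for int 0x80.
 */
void extract_args(enum entry_kind kind, const struct raw_regs *r,
                  uint32_t stack_word, uint32_t args[6])
{
    args[0] = r->ebx;
    args[1] = (kind == ENTRY_SYSCALL) ? r->ebp : r->ecx;
    args[2] = r->edx;
    args[3] = r->esi;
    args[4] = r->edi;
    args[5] = (kind == ENTRY_INT80) ? r->ebp : stack_word;
}
```

Going the other way (from the argument vector back to register contents)
needs the same entry kind, which is the piece of state that is not
retained.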

Re: [uml-devel] [RFC] weird crap with vdso on uml/i386

From: Richard W. <ri...@no...> - 2011-08-20 20:55:56

Am 20.08.2011 22:14, schrieb Al Viro:
> On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote:
>
>> Hmmm, very strange.
>> Sadly I cannot reproduce the issue. :(
>> Everything works fine within UML.
>> (Of course I've applied your vDSO/i386 patches)
>>
>> My test setup:
>> Host kernel: 2.6.37 and 3.0.1
>> Distro: openSUSE 11.4/x86_64
>>
>> UML kernel: 3.1-rc2
>> Distro: openSUSE 11.1/i386
>>
>> Does the problem also occur with another host kernel or a different
>> guest image?
>
> Could you check what you get in __kernel_vsyscall()?  On iAMD64 box
> where that sucker contains sysenter-based variant the bug is not
> present.  IOW, it's sensitive to syscall vs. sysenter vs. int 0x80
> differences.

OK, this explains why I cannot reproduce it.
My Intel Core2 box is sysenter-based.

(gdb) disass __kernel_vsyscall
0xffffe420 <__kernel_vsyscall+0>:       push   %ecx
0xffffe421 <__kernel_vsyscall+1>:       push   %edx
0xffffe422 <__kernel_vsyscall+2>:       push   %ebp
0xffffe423 <__kernel_vsyscall+3>:       mov    %esp,%ebp
0xffffe425 <__kernel_vsyscall+5>:       sysenter
0xffffe427 <__kernel_vsyscall+7>:       nop
0xffffe428 <__kernel_vsyscall+8>:       nop
0xffffe429 <__kernel_vsyscall+9>:       nop
0xffffe42a <__kernel_vsyscall+10>:      nop
0xffffe42b <__kernel_vsyscall+11>:      nop
0xffffe42c <__kernel_vsyscall+12>:      nop
0xffffe42d <__kernel_vsyscall+13>:      nop
0xffffe42e <__kernel_vsyscall+14>:      jmp    0xffffe423 <__kernel_vsyscall+3>
0xffffe430 <__kernel_vsyscall+16>:      pop    %ebp
0xffffe431 <__kernel_vsyscall+17>:      pop    %edx
0xffffe432 <__kernel_vsyscall+18>:      pop    %ecx
0xffffe433 <__kernel_vsyscall+19>:      ret

> I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2 ;-/)
> if you want to see if that gets reproduced on your box.  I'll drop it on
> anonftp if you are interested.  FWIW, the same kernel binary/same image
> result in
> 	* K7 box - no breakage, SYSENTER-based vdso
> 	* K8 box - breakage as described, SYSCALL-based vdso32
> 	* P4 box - no breakage, SYSENTER-based vdso32
> Hell knows...  In theory that would seem to point towards ia32_cstar_target(),
> so I'm going to RTFS carefully through that animal.

Now I'm testing with a Debian fs from: 
http://fs.devloop.org.uk/filesystems/Debian-Squeeze/

> The thing is, whatever happens happens when victim gets resumed inside
> vdso page.  I'll try to dump PTRACE_SETREGS and see the values host
> kernel asked to set and work from there, but the interesting part is
> bloody hard to singlestep through - the victim is back to user mode and
> it is already traced by the guest kernel, so it's not as if we could
> attach host gdb to it and walk through that crap.  And guest gdb is not
> going to be able to set breakpoints in there - vdso page is r/o...

[ CC'ing lu...@mi... ]
Andy, do you have an idea?
You can find Al's original report here:
http://marc.info/?l=linux-kernel&m=131380315624244&w=2

Thanks,
//richard

Re: [uml-devel] [RFC] weird crap with vdso on uml/i386

From: Andrew L. <lu...@mi...> - 2011-08-20 21:26:28

On Sat, Aug 20, 2011 at 4:55 PM, Richard Weinberger <ri...@no...> wrote:
> Am 20.08.2011 22:14, schrieb Al Viro:
>>
>> On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote:
>>
>>> Hmmm, very strange.
>>> Sadly I cannot reproduce the issue. :(
>>> Everything works fine within UML.
>>> (Of course I've applied your vDSO/i386 patches)
>>>
>>> My test setup:
>>> Host kernel: 2.6.37 and 3.0.1
>>> Distro: openSUSE 11.4/x86_64
>>>
>>> UML kernel: 3.1-rc2
>>> Distro: openSUSE 11.1/i386
>>>
>>> Does the problem also occur with another host kernel or a different
>>> guest image?
>>
>> Could you check what you get in __kernel_vsyscall()?  On iAMD64 box
>> where that sucker contains sysenter-based variant the bug is not
>> present.  IOW, it's sensitive to syscall vs. sysenter vs. int 0x80
>> differences.
>
> OK, this explains why I cannot reproduce it.
> My Intel Core2 box is sysenter-based.
>
> (gdb) disass __kernel_vsyscall
> 0xffffe420 <__kernel_vsyscall+0>:       push   %ecx
> 0xffffe421 <__kernel_vsyscall+1>:       push   %edx
> 0xffffe422 <__kernel_vsyscall+2>:       push   %ebp
> 0xffffe423 <__kernel_vsyscall+3>:       mov    %esp,%ebp
> 0xffffe425 <__kernel_vsyscall+5>:       sysenter
> 0xffffe427 <__kernel_vsyscall+7>:       nop
> 0xffffe428 <__kernel_vsyscall+8>:       nop
> 0xffffe429 <__kernel_vsyscall+9>:       nop
> 0xffffe42a <__kernel_vsyscall+10>:      nop
> 0xffffe42b <__kernel_vsyscall+11>:      nop
> 0xffffe42c <__kernel_vsyscall+12>:      nop
> 0xffffe42d <__kernel_vsyscall+13>:      nop
> 0xffffe42e <__kernel_vsyscall+14>:      jmp 0xffffe423<__kernel_vsyscall+3>
> 0xffffe430 <__kernel_vsyscall+16>:      pop    %ebp
> 0xffffe431 <__kernel_vsyscall+17>:      pop    %edx
> 0xffffe432 <__kernel_vsyscall+18>:      pop    %ecx
> 0xffffe433 <__kernel_vsyscall+19>:      ret
>
>> I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2
>> ;-/)
>> if you want to see if that gets reproduced on your box.  I'll drop it on
>> anonftp if you are interested.  FWIW, the same kernel binary/same image
>> result in
>>        * K7 box - no breakage, SYSENTER-based vdso
>>        * K8 box - breakage as described, SYSCALL-based vdso32
>>        * P4 box - no breakage, SYSENTER-based vdso32
>> Hell knows...  In theory that would seem to point towards
>> ia32_cstar_target(),
>> so I'm going to RTFS carefully through that animal.
>
> Now I'm testing with a Debian fs from:
> http://fs.devloop.org.uk/filesystems/Debian-Squeeze/
>
>> The thing is, whatever happens happens when victim gets resumed inside
>> vdso page.  I'll try to dump PTRACE_SETREGS and see the values host
>> kernel asked to set and work from there, but the interesting part is
>> bloody hard to singlestep through - the victim is back to user mode and
>> it is already traced by the guest kernel, so it's not as if we could
>> attach host gdb to it and walk through that crap.  And guest gdb is not
>> going to be able to set breakpoints in there - vdso page is r/o...
>
> [ CC'ing lu...@mi... ]
> Andy, do you have an idea?
> You can find Al's original report here:
> http://marc.info/?l=linux-kernel&m=131380315624244&w=2

I'm missing a bit of the background.  Is the user-on-UML app calling
into a vdso entry provided by UML or into a vdso entry provided by the
host?

Why does anything care whether ecx is saved?  Doesn't the default
calling convention allow the callee to clobber ecx?

But my guess is that the 64-bit host sysret code might be buggy (or
the value in gs:whatever is wrong). Can you get gdb to breakpoint at
the beginning of __kernel_vsyscall before the crash?

--Andy

Re: [uml-devel] [RFC] weird crap with vdso on uml/i386

From: Richard W. <ri...@no...> - 2011-08-20 21:38:52

Am 20.08.2011 23:26, schrieb Andrew Lutomirski:
> On Sat, Aug 20, 2011 at 4:55 PM, Richard Weinberger<ri...@no...>  wrote:
>> Am 20.08.2011 22:14, schrieb Al Viro:
>>>
>>> On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote:
>>>
>>>> Hmmm, very strange.
>>>> Sadly I cannot reproduce the issue. :(
>>>> Everything works fine within UML.
>>>> (Of course I've applied your vDSO/i386 patches)
>>>>
>>>> My test setup:
>>>> Host kernel: 2.6.37 and 3.0.1
>>>> Distro: openSUSE 11.4/x86_64
>>>>
>>>> UML kernel: 3.1-rc2
>>>> Distro: openSUSE 11.1/i386
>>>>
>>>> Does the problem also occur with another host kernel or a different
>>>> guest image?
>>>
>>> Could you check what you get in __kernel_vsyscall()?  On iAMD64 box
>>> where that sucker contains sysenter-based variant the bug is not
>>> present.  IOW, it's sensitive to syscall vs. sysenter vs. int 0x80
>>> differences.
>>
>> OK, this explains why I cannot reproduce it.
>> My Intel Core2 box is sysenter-based.
>>
>> (gdb) disass __kernel_vsyscall
>> 0xffffe420<__kernel_vsyscall+0>:       push   %ecx
>> 0xffffe421<__kernel_vsyscall+1>:       push   %edx
>> 0xffffe422<__kernel_vsyscall+2>:       push   %ebp
>> 0xffffe423<__kernel_vsyscall+3>:       mov    %esp,%ebp
>> 0xffffe425<__kernel_vsyscall+5>:       sysenter
>> 0xffffe427<__kernel_vsyscall+7>:       nop
>> 0xffffe428<__kernel_vsyscall+8>:       nop
>> 0xffffe429<__kernel_vsyscall+9>:       nop
>> 0xffffe42a<__kernel_vsyscall+10>:      nop
>> 0xffffe42b<__kernel_vsyscall+11>:      nop
>> 0xffffe42c<__kernel_vsyscall+12>:      nop
>> 0xffffe42d<__kernel_vsyscall+13>:      nop
>> 0xffffe42e<__kernel_vsyscall+14>:      jmp 0xffffe423<__kernel_vsyscall+3>
>> 0xffffe430<__kernel_vsyscall+16>:      pop    %ebp
>> 0xffffe431<__kernel_vsyscall+17>:      pop    %edx
>> 0xffffe432<__kernel_vsyscall+18>:      pop    %ecx
>> 0xffffe433<__kernel_vsyscall+19>:      ret
>>
>>> I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2
>>> ;-/)
>>> if you want to see if that gets reproduced on your box.  I'll drop it on
>>> anonftp if you are interested.  FWIW, the same kernel binary/same image
>>> result in
>>>         * K7 box - no breakage, SYSENTER-based vdso
>>>         * K8 box - breakage as described, SYSCALL-based vdso32
>>>         * P4 box - no breakage, SYSENTER-based vdso32
>>> Hell knows...  In theory that would seem to point towards
>>> ia32_cstar_target(),
>>> so I'm going to RTFS carefully through that animal.
>>
>> Now I'm testing with a Debian fs from:
>> http://fs.devloop.org.uk/filesystems/Debian-Squeeze/
>>
>>> The thing is, whatever happens happens when victim gets resumed inside
>>> vdso page.  I'll try to dump PTRACE_SETREGS and see the values host
>>> kernel asked to set and work from there, but the interesting part is
>>> bloody hard to singlestep through - the victim is back to user mode and
>>> it is already traced by the guest kernel, so it's not as if we could
>>> attach host gdb to it and walk through that crap.  And guest gdb is not
>>> going to be able to set breakpoints in there - vdso page is r/o...
>>
>> [ CC'ing lu...@mi... ]
>> Andy, do you have an idea?
>> You can find Al's original report here:
>> http://marc.info/?l=linux-kernel&m=131380315624244&w=2
>
> I'm missing a bit of the background.  Is the user-on-UML app calling
> into a vdso entry provided by UML or into a vdso entry provided by the
> host?

UML/i386 reuses the host's vDSO page.
IOW it does not have its own vDSO like UML/x86_64.

Thanks,
//richard

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Linus T. <tor...@li...> - 2011-08-22 01:41:50

On Sun, Aug 21, 2011 at 6:16 PM, Al Viro <vi...@ze...> wrote:
>
> Is that ability a part of userland ABI or are we declaring that hopelessly
> wrong and require to go through the function in vdso32?  Linus?

If people are using syscall directly, we're pretty much stuck. No
amount of "that's hopelessly wrong" will ever matter. We don't break
existing binaries.

That said, I'd *hope* that everybody uses the vdso32, simply because
user programs are not supposed to know which CPU they are running on
and if that CPU even *supports* the syscall instruction. In which case
it may be possible that we can play games with the vdso thing. But
that really would be conditional on "nobody ever reports a failure".

But if that's possible, maybe we can increment the RIP by 2 for
'syscall', and slip an "'int 0x80" after the syscall instruction in
the vdso there? Resulting in the same pseudo-solution I suggested for
sysenter...

                          Linus

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-22 04:08:21

On Sun, Aug 21, 2011 at 06:41:16PM -0700, Linus Torvalds wrote:
> On Sun, Aug 21, 2011 at 6:16 PM, Al Viro <vi...@ze...> wrote:
> >
> > Is that ability a part of userland ABI or are we declaring that hopelessly
> > wrong and require to go through the function in vdso32?  Linus?
> 
> If people are using syscall directly, we're pretty much stuck. No
> amount of "that's hopelessly wrong" will ever matter. We don't break
> existing binaries.

There's a funny part, though - such binary won't work on 32bit kernel.
AFAICS, we never set MSR_*STAR on 32bit kernels (and native 32bit vdso
doesn't provide a SYSCALL-based variant).

So if we really consider such SYSCALL outside of vdso32 kosher, shouldn't
we do something with entry_32.S as well?  I don't think it's worth doing,
TBH...

Again, I very much hope that binaries with such stray SYSCALL simply do
not exist.  In theory it's possible to write one, but...

IIRC, the reason we never had SYSCALL support in 32bit kernel was the utter
lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
and (again, IIRC) it had some differences in SYSCALL semantics compared to
K7 (which supports SYSENTER as well).  Bugger if I remember what those
differences might've been...  Some flag not cleared?

Re: [uml-devel] [RFC] weird crap with vdso on uml/i386

From: Andrew L. <lu...@mi...> - 2011-08-20 21:40:29

On Sat, Aug 20, 2011 at 5:26 PM, Andrew Lutomirski <lu...@mi...> wrote:
> On Sat, Aug 20, 2011 at 4:55 PM, Richard Weinberger <ri...@no...> wrote:

> I'm missing a bit of the background.  Is the user-on-UML app calling
> into a vdso entry provided by UML or into a vdso entry provided by the
> host?
>
> Why does anything care whether ecx is saved?  Doesn't the default
> calling convention allow the callee to clobber ecx?
>
> But my guess is that the 64-bit host sysret code might be buggy (or
> the value in gs:whatever is wrong). Can you get gdb to breakpoint at
> the beginning of __kernel_vsyscall before the crash?
>

This is suspicious:

ENTRY(ia32_cstar_target)
        CFI_STARTPROC32 simple
        CFI_SIGNAL_FRAME
        CFI_DEF_CFA     rsp,KERNEL_STACK_OFFSET
        CFI_REGISTER    rip,rcx
        /*CFI_REGISTER  rflags,r11*/
        SWAPGS_UNSAFE_STACK
        movl    %esp,%r8d
        CFI_REGISTER    rsp,r8
        movq    PER_CPU_VAR(kernel_stack),%rsp
        /*
         * No need to follow this irqs on/off section: the syscall
         * disabled irqs and here we enable it straight after entry:
         */
        ENABLE_INTERRUPTS(CLBR_NONE)
        SAVE_ARGS 8,0,0
        movl    %eax,%eax       /* zero extension */
        movq    %rax,ORIG_RAX-ARGOFFSET(%rsp)
        movq    %rcx,RIP-ARGOFFSET(%rsp)
        CFI_REL_OFFSET rip,RIP-ARGOFFSET
        movq    %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */

The entry code looks something like this.  The text of __kernel_vsyscall() is
	0xffffe420 <__kernel_vsyscall+0>:       push   %ebp
	0xffffe421 <__kernel_vsyscall+1>:       mov    %ecx,%ebp
	0xffffe423 <__kernel_vsyscall+3>:       syscall
	0xffffe425 <__kernel_vsyscall+5>:       mov    $0x2b,%ecx
	0xffffe42a <__kernel_vsyscall+10>:      mov    %ecx,%ss
	0xffffe42c <__kernel_vsyscall+12>:      mov    %ebp,%ecx
	0xffffe42e <__kernel_vsyscall+14>:      pop    %ebp
	0xffffe42f <__kernel_vsyscall+15>:      ret

so the line:

movq    %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */

will cause iret (if iret happens) to restore the original rbp in rcx
(why? -- it seems okay if syscall is hit in __kernel_vsyscall but not
if something else does the syscall).  I don't see what saves rbp to
the stack frame.

This is also suspicious:

        movq    %r11,EFLAGS-ARGOFFSET(%rsp)

that's inconsistent with my reading of the AMD manual.

How well is the compat syscall entry tested through both the fast and
slow paths?  UML is unusual in that it uses ptrace to trap all system
calls, right?  That means that syscalls will enter through the cstar
target but return through the iret path.

--Andy
