Thread: Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on u (Page 2)

user-mode-linux-devel

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Andrew L. <lu...@mi...> - 2011-08-23 16:12:09

On Tue, Aug 23, 2011 at 12:03 PM, Linus Torvalds
<tor...@li...> wrote:
> On Mon, Aug 22, 2011 at 11:15 PM, Al Viro <vi...@ze...> wrote:
>>
>> * it does SETREGS, setting eax to return value, eip to original return
>> address of syscall insn... and ebp to what it had in regs.bp.  I.e. the
>> damn arg6 value.
>
> Ok, I think that exhaustively explains that
>
>  (a) our system call restart has always worked correctly, and we're good.
>
>  (b) it's simply just UML that is buggy, and doesn't understand the
> subtleties about doing GETREGS at a system call.
>
> and I think that the correct and simple solution is to just teach UML
> to understand the proper logic of pt_regs during a system call (and a
> 'syscall' instruction in particular).
>
> The thing is, UML emulates 'syscall' the way the *CPU* does it, not
> the way *we* do it. That may make sense, but it's simply not correct.
>
> So I would vote very strongly against actually changing anything in
> arch/x86. This is very much an UML issue.
>
> Suggested fixes:
>
>  - instead of blindly doing SETREGS, just write the result registers
> individually like you suggested
>
> OR (and perhaps preferably):
>
>  - teach UML that when you do 'GETREGS' after a system call trapped,
> we have switched things around to match the "official system call
> order", and UML should just undo our swizzling, and do a "regs.ebp =
> regs.ecx" to make it be what the actual original registers were (or
> whatever the actual correct swizzle is - I didn't think that through
> very much).
>
> IOW, I think the core kernel does the right thing. Our argument
> register swizzling is odd, yes, but it's an implementation detail that
> something like uml should just have to take into account. No?
>
> Hmm?

Egads.  Does this mean that doing GETREGS and then doing SETREGS later
on with the *exact same values* is considered incorrect?  IMO, this
way lies madness.

In any case, this seems insanely overcomplicated.  I'd be less afraid
of something like my approach (which, I think, makes all of the
SYSCALL weirdness pretty much transparent to ptrace users) or of just
removing SYSCALL entirely from 32-bit code.

--Andy

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Borislav P. <bp...@am...> - 2011-08-23 16:23:11

On Tue, Aug 23, 2011 at 12:11:43PM -0400, Andrew Lutomirski wrote:
> In any case, this seems insanely overcomplicated.  I'd be less afraid
> of something like my approach (which, I think, makes all of the
> SYSCALL weirdness pretty much transparent to ptrace users) or of just
> removing SYSCALL entirely from 32-bit code.

I don't think that removing SYSCALL from 32-bit code just so that UML
trapped syscalls work is something we'd like since SYSCALL is much
cheaper than INT $0x80:

"As a result, SYSCALL and SYSRET can take fewer than one-fourth the
number of internal clock cycles to complete than the legacy CALL and RET
instructions."

http://support.amd.com/us/Processor_TechDocs/24593.pdf, p. 152.

I know, it is 32-bit syscall on 64-bit kernel which should be pretty
rare but still...

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 00:03:51

On Mon, Aug 22, 2011 at 04:27:51PM -0700, Linus Torvalds wrote:

> So I think the "let's fix the vdso case for sysenter" + "let's remove
> the 32-bit syscall vdso" is the right solution. If somebody has
> hardcoded syscall instructions, or generates them dynamically with
> some JIT, that's their problem. We'll continue to support it as well
> as we ever have (read: "almost nobody will ever notice").

Umm...  Maybe, but I really wonder if it would be better to do this:
	* check if %ecx is the right one for vdso32 entry.  If it isn't,
act as we act now (and possibly warn).  If it is, increment it by 4.
	* slap 0x90, 0x90, 0xcd, 0x80 right after that syscall insn -
i.e. nop/nop/int 0x80.  Followed by what we currently do there.

	Those who do syscall insn in 32bit mode outside of vdso will get
what they currently get.  __kernel_vsyscall() will keep working and do
that without restart problems.  And the price is comparison + branch not
taken + addition for that path...

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 00:07:34

On Tue, Aug 23, 2011 at 01:03:14AM +0100, Al Viro wrote:
> On Mon, Aug 22, 2011 at 04:27:51PM -0700, Linus Torvalds wrote:
> 
> > So I think the "let's fix the vdso case for sysenter" + "let's remove
> > the 32-bit syscall vdso" is the right solution. If somebody has
> > hardcoded syscall instructions, or generates them dynamically with
> > some JIT, that's their problem. We'll continue to support it as well
> > as we ever have (read: "almost nobody will ever notice").
> 
> Umm...  Maybe, but I really wonder if it would be better to do this:
> 	* check if %ecx is the right one for vdso32 entry.  If it isn't,
> act as we act now (and possibly warn).  If it is, increment it by 4.
> 	* slap 0x90, 0x90, 0xcd, 0x80 right after that syscall insn -
> i.e. nop/nop/int 0x80.  Followed by what we currently do there.
> 
> 	Those who do syscall insn in 32bit mode outside of vdso will get
> what they currently get.  __kernel_vsyscall() will keep working and do
> that without restart problems.  And the price is comparison + branch not
> taken + addition for that path...

s/whine/set your "don't restart that one" flag/, even...

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Linus T. <tor...@li...> - 2011-08-23 16:21:14

On Tue, Aug 23, 2011 at 9:11 AM, Andrew Lutomirski <lu...@mi...> wrote:
>
> Egads.  Does this mean that doing GETREGS and then doing SETREGS later
> on with the *exact same values* is considered incorrect?

No. If it does it with exactly the same values, it's perfectly ok. But
it isn't. It's changing some of those values.

>  IMO, this way lies madness.

No, the madness is in UML.

It's EMULATING A SYSTEM CALL. That original "getregs" value is not
some "user space state". It's the *system call* state that you got
after the system call trapped. Setting it back is an insane operation,
but it would happen to work - if you make no changes.

But UML *does* make changes. It takes that system call state, and then
EMULATES THE SYSTEM CALL INCORRECTLY.

If you see it that way (which is the correct way), then it's clearly
an UML problem, and it's not at all "madness" that your
getregs/setregs pairing doesn't work.

See? Buggy system call emulation. It's really that simple. Of course,
"simple" in this case is "really really subtle differences in how the
kernel treats syscall/sysenter/int80", so the *details* are certainly
not simple, but the concept is.

                    Linus

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-23 00:08:03

On 08/22/2011 05:03 PM, Al Viro wrote:
> On Mon, Aug 22, 2011 at 04:27:51PM -0700, Linus Torvalds wrote:
> 
>> So I think the "let's fix the vdso case for sysenter" + "let's remove
>> the 32-bit syscall vdso" is the right solution. If somebody has
>> hardcoded syscall instructions, or generates them dynamically with
>> some JIT, that's their problem. We'll continue to support it as well
>> as we ever have (read: "almost nobody will ever notice").
> 
> Umm...  Maybe, but I really wonder if it would be better to do this:
> 	* check if %ecx is the right one for vdso32 entry.  If it isn't,
> act as we act now (and possibly warn).  If it is, increment it by 4.
> 	* slap 0x90, 0x90, 0xcd, 0x80 right after that syscall insn -
> i.e. nop/nop/int 0x80.  Followed by what we currently do there.
> 
> 	Those who do syscall insn in 32bit mode outside of vdso will get
> what they currently get.  __kernel_vsyscall() will keep working and do
> that without restart problems.  And the price is comparison + branch not
> taken + addition for that path...

Checking for SYSCALL in the vdso is cheap... the only problem is if
someone thinks that they can copy it out of the vdso and run it, having
special-cased SYSENTER already, but not SYSCALL.  The recent experience
shows that that is apparently a very real possibility.

Without a SYSCALL in the vdso, anything that examines the vdso will end
up using int $0x80 and we don't have an issue.

Again, waiting for Borislav & co to tell us how much the SYSCALL
instruction actually buys us.

	-hpa

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Linus T. <tor...@li...> - 2011-08-23 16:30:28

On Tue, Aug 23, 2011 at 9:22 AM, Borislav Petkov <bp...@am...> wrote:
>
> I don't think that removing SYSCALL from 32-bit code just so that UML
> trapped syscalls work is something we'd like since SYSCALL is much
> cheaper than INT $0x80:

Oh yes.

System call performance is *important*. And x86 is *important*.

UML? In comparison, not that important.

So quite frankly, if this is purely an UML issue (and unless we're
missing something else, that's what it looks like, despite all the
confusion we've had so far), then if we have a choice between "remove
syscall instruction" and "remove UML", then ...

                           Linus

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 16:53:46

On Tue, Aug 23, 2011 at 09:29:29AM -0700, Linus Torvalds wrote:

> Oh yes.
> 
> System call performance is *important*. And x86 is *important*.
> 
> UML? In comparison, not that important.
> 
> So quite frankly, if this is purely an UML issue (and unless we're
> missing something else, that's what it looks like, despite all the
> confusion we've had so far), then if we have a choice between "remove
> syscall instruction" and "remove UML", then ...

Agreed.  Note, BTW, that UML has perfectly usable workaround for 99% of
that - don't tell the binaries it has *any* vdso in such cases.  And
"remove UML" turns into "remove support under UML for 32bit binaries
that go out of their way to do SYSCALL directly, which wouldn't work
on native 32bit", which is really a no-brainer.

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 17:33:43

On Tue, Aug 23, 2011 at 09:20:12AM -0700, Linus Torvalds wrote:

> It's EMULATING A SYSTEM CALL. That original "getregs" value is not
> some "user space state". It's the *system call* state that you got
> after the system call trapped. Setting it back is an insane operation,
> but it would happen to work - if you make no changes.
> 
> But UML *does* make changes. It takes that system call state, and then
> EMULATES THE SYSTEM CALL INCORRECTLY.
> 
> If you see it that way (which is the correct way), then it's clearly
> an UML problem, and it's not at all "madness" that your
> getregs/setregs pairing doesn't work.
> 
> See? Buggy system call emulation. It's really that simple. Of course,
> "simple" in this case is "really really subtle differences in how the
> kernel treats syscall/sysenter/int80", so the *details* are certainly
> not simple, but the concept is.

It's a bit more than that (ptrace changes to syscall arguments *are*
lost on syscall restart), but... as far as I'm concerned, the situation
is simple now:
	* SYSCALL is not terminally broken wrt restarts.  My apologies for
misreading what was going on.
	* SYSENTER with Linus' patch does work just fine wrt restarts + ptrace
	* SYSCALL is losing ptrace-made changes to arguments when it restarts.
Might or might not be a problem for somebody.
	* UML should not touch SYSCALL for 32bit.  Not without serious changes
in UML and I'm not convinced that it won't be worse than what we probably
ought to do there: check if __kernel_vsyscall() does SYSCALL (recognizable
by interaction with POKEUSER) and don't tell about vdso to guest processes.
Anything well-behaving won't step on SYSCALL and the things that do deserve
the subtle bugs they get.
	* asm glue is subtle, evil and doesn't have anywhere near enough
documentation ;-/

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 18:04:48

On Tue, Aug 23, 2011 at 06:33:17PM +0100, Al Viro wrote:
> 	* SYSCALL is not terminally broken wrt restarts.  My apologies for
> misreading what was going on.
> 	* SYSENTER with Linus' patch does work just fine wrt restarts + ptrace
> 	* SYSCALL is losing ptrace-made changes to arguments when it restarts.
> Might or might not be a problem for somebody.

BTW, that one (irrelevant to UML even if we do end up coping with SYSCALL
there) might be worth spelling it out:

tracer:
	ptrace(tracee, PTRACE_SYSCALL);
tracee:
	recvfrom(..., &addrlen);
tracer:
	ptrace(tracee, PTRACE_POKEUSER, EBP, &len2);
	ptrace(tracee, PTRACE_DETACH, 0, 0);
tracee:
	completes recvfrom(), using &len2 instead of the &addrlen

That works just fine, regardless of the way syscall is entered; yes, including
SYSCALL - there we take care to handle ptrace on the way in.  However, if it's
SYSCALL and (ex-)tracee takes a restart, the second time around we'll have the
original value of 6th argument used.  Changes made by POKEUSER are lost.

It's not a problem with int 0x80 or SYSENTER (now, with int 0x80 instead of
jmp).  It's probably not going to be a real issue for anyone, but I pity the
poor bastard stuck with debugging that if it *does* become someone's problem.

[uml-devel] [PATCH] x86, asm: Document some of the syscall asm glue

From: Borislav P. <bp...@am...> - 2011-08-24 12:59:43

On Tue, Aug 23, 2011 at 06:33:17PM +0100, Al Viro wrote:
> 	* asm glue is subtle, evil and doesn't have anywhere near enough
> documentation ;-/

I took the liberty to document some of your asm glue analysis in an
attempt to make the code a bit more understandable. How about the
following:

--
From: Borislav Petkov <bor...@am...>
Date: Wed, 24 Aug 2011 14:30:43 +0200
Subject: [PATCH] x86, asm: Document some of the syscall asm glue

Document some of the asm glue around compat SYSCALL32 and do a
whitespace cleanup while at it. See linked thread below for further
reference.

Link: http://lkml.kernel.org/r/20110820011845.GC2203@ZenIV.linux.org.uk
Signed-off-by: Borislav Petkov <bor...@am...>
---
 arch/x86/ia32/ia32entry.S  |  138 ++++++++++++++++++++++++++-----------------
 arch/x86/kernel/entry_64.S |   19 +++++-
 2 files changed, 98 insertions(+), 59 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a0e866d..8254432 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -1,16 +1,16 @@
 /*
- * Compatibility mode system call entry point for x86-64. 
- * 		
+ * Compatibility mode system call entry point for x86-64.
+ *
  * Copyright 2000-2002 Andi Kleen, SuSE Labs.
- */		 
+ */
 
 #include <asm/dwarf2.h>
 #include <asm/calling.h>
 #include <asm/asm-offsets.h>
 #include <asm/current.h>
 #include <asm/errno.h>
-#include <asm/ia32_unistd.h>	
-#include <asm/thread_info.h>	
+#include <asm/ia32_unistd.h>
+#include <asm/thread_info.h>
 #include <asm/segment.h>
 #include <asm/irqflags.h>
 #include <linux/linkage.h>
@@ -38,11 +38,11 @@
 	xchg	%ecx,%esi
 	movl	%ebx,%edi
 	movl	%edx,%edx	/* zero extension */
-	.endm 
+	.endm
 
-	/* clobbers %eax */	
+	/* clobbers %eax */
 	.macro  CLEAR_RREGS offset=0, _r9=rax
-	xorl 	%eax,%eax
+	xorl	%eax,%eax
 	movq	%rax,\offset+R11(%rsp)
 	movq	%rax,\offset+R10(%rsp)
 	movq	%\_r9,\offset+R9(%rsp)
@@ -69,7 +69,7 @@
 	movl \offset+64(%rsp),%edi
 	movl %eax,%eax			/* zero extension */
 	.endm
-	
+
 	.macro CFI_STARTPROC32 simple
 	CFI_STARTPROC	\simple
 	CFI_UNDEFINED	r8
@@ -106,14 +106,14 @@ ENDPROC(native_irq_enable_sysexit)
  * %esi Arg4
  * %edi Arg5
  * %ebp user stack
- * 0(%ebp) Arg6	
- * 	
+ * 0(%ebp) Arg6
+ *
  * Interrupts off.
- *	
+ *
  * This is purely a fast path. For anything complicated we use the int 0x80
  * path below.	Set up a complete hardware stack frame to share code
  * with the int 0x80 path.
- */ 	
+ */
 ENTRY(ia32_sysenter_target)
 	CFI_STARTPROC32	simple
 	CFI_SIGNAL_FRAME
@@ -127,7 +127,7 @@ ENTRY(ia32_sysenter_target)
 	 * disabled irqs, here we enable it straight after entry:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
- 	movl	%ebp,%ebp		/* zero extension */
+	movl	%ebp,%ebp		/* zero extension */
 	pushq_cfi $__USER32_DS
 	/*CFI_REL_OFFSET ss,0*/
 	pushq_cfi %rbp
@@ -144,12 +144,12 @@ ENTRY(ia32_sysenter_target)
 	pushq_cfi %rax
 	cld
 	SAVE_ARGS 0,1,0
- 	/* no need to do an access_ok check here because rbp has been
- 	   32bit zero extended */ 
+	/* no need to do an access_ok check here because rbp has been
+	   32bit zero extended */
 1:	movl	(%rbp),%ebp
- 	.section __ex_table,"a"
- 	.quad 1b,ia32_badarg
- 	.previous	
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous
 	GET_THREAD_INFO(%r10)
 	orl    $TS_COMPAT,TI_status(%r10)
 	testl  $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
@@ -170,7 +170,7 @@ sysenter_dispatch:
 sysexit_from_sys_call:
 	andl    $~TS_COMPAT,TI_status(%r10)
 	/* clear IF, that popfq doesn't enable interrupts early */
-	andl  $~0x200,EFLAGS-R11(%rsp) 
+	andl  $~0x200,EFLAGS-R11(%rsp)
 	movl	RIP-R11(%rsp),%edx		/* User %eip */
 	CFI_REGISTER rip,rdx
 	RESTORE_ARGS 0,24,0,0,0,0
@@ -260,20 +260,21 @@ ENDPROC(ia32_sysenter_target)
  * Arguments:
  * %eax	System call number.
  * %ebx Arg1
- * %ecx return EIP 
+ * %ecx return EIP
  * %edx Arg3
  * %esi Arg4
  * %edi Arg5
- * %ebp Arg2    [note: not saved in the stack frame, should not be touched]
- * %esp user stack 
+ * %ebp Arg2    [note: not saved in the stack frame, should not be touched
+ *		 because it is callee-saved in 64-bit calling convention]
+ * %esp user stack
  * 0(%esp) Arg6
- * 	
+ *
  * Interrupts off.
- *	
+ *
  * This is purely a fast path. For anything complicated we use the int 0x80
  * path below.	Set up a complete hardware stack frame to share code
- * with the int 0x80 path.	
- */ 	
+ * with the int 0x80 path.
+ */
 ENTRY(ia32_cstar_target)
 	CFI_STARTPROC32	simple
 	CFI_SIGNAL_FRAME
@@ -281,34 +282,57 @@ ENTRY(ia32_cstar_target)
 	CFI_REGISTER	rip,rcx
 	/*CFI_REGISTER	rflags,r11*/
 	SWAPGS_UNSAFE_STACK
+
+	/* stash away usermode stack ptr */
 	movl	%esp,%r8d
 	CFI_REGISTER	rsp,r8
 	movq	PER_CPU_VAR(kernel_stack),%rsp
+
 	/*
 	 * No need to follow this irqs on/off section: the syscall
 	 * disabled irqs and here we enable it straight after entry:
 	 */
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_ARGS 8,0,0
-	movl 	%eax,%eax	/* zero extension */
+	movl	%eax,%eax	/* zero extension */
 	movq	%rax,ORIG_RAX-ARGOFFSET(%rsp)
+
+	/* return-RIP is in %ecx when executing SYSCALL */
 	movq	%rcx,RIP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rip,RIP-ARGOFFSET
-	movq	%rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */
+
+	/*
+	 * Put Arg2 into %rcx pt_regs slot to match kernel syscall
+	 * calling conventions, i.e. what INT80 would expect;
+	 * this lies slightly to ptrace
+	 */
+	movq	%rbp,RCX-ARGOFFSET(%rsp)
 	movl	%ebp,%ecx
 	movq	$__USER32_CS,CS-ARGOFFSET(%rsp)
 	movq	$__USER32_DS,SS-ARGOFFSET(%rsp)
+
+	/* rFLAGS is in %r11 when executing SYSCALL */
 	movq	%r11,EFLAGS-ARGOFFSET(%rsp)
 	/*CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/
-	movq	%r8,RSP-ARGOFFSET(%rsp)	
+
+	/* save usermode stack ptr into pt_regs */
+	movq	%r8,RSP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rsp,RSP-ARGOFFSET
-	/* no need to do an access_ok check here because r8 has been
-	   32bit zero extended */ 
-	/* hardware stack frame is complete now */	
+
+	/*
+	 * Get Arg6 which is on the usermode stack; no need to do an
+	 * access_ok check here because %r8 has been 32bit zero extended.
+	 * hardware stack frame is complete now.
+	 */
 1:	movl	(%r8),%r9d
+
+	/*
+	 * handle pagefaulting when accessing usermode stack by returning
+	 * -EFAULT
+	 */
 	.section __ex_table,"a"
 	.quad 1b,ia32_badarg
-	.previous	
+	.previous
 	GET_THREAD_INFO(%r10)
 	orl   $TS_COMPAT,TI_status(%r10)
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
@@ -331,7 +355,7 @@ sysretl_from_sys_call:
 	RESTORE_ARGS 0,-ARG_SKIP,0,0,0
 	movl RIP-ARGOFFSET(%rsp),%ecx
 	CFI_REGISTER rip,rcx
-	movl EFLAGS-ARGOFFSET(%rsp),%r11d	
+	movl EFLAGS-ARGOFFSET(%rsp),%r11d
 	/*CFI_REGISTER rflags,r11*/
 	xorq	%r10,%r10
 	xorq	%r9,%r9
@@ -340,7 +364,7 @@ sysretl_from_sys_call:
 	movl RSP-ARGOFFSET(%rsp),%esp
 	CFI_RESTORE rsp
 	USERGS_SYSRET32
-	
+
 #ifdef CONFIG_AUDITSYSCALL
 cstar_auditsys:
 	CFI_RESTORE_STATE
@@ -358,6 +382,8 @@ cstar_tracesys:
 	testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags(%r10)
 	jz cstar_auditsys
 #endif
+
+	/* put Arg6 into %ebp where ptrace expects it */
 	xchgl %r9d,%ebp
 	SAVE_REST
 	CLEAR_RREGS 0, r9
@@ -366,21 +392,23 @@ cstar_tracesys:
 	call syscall_trace_enter
 	LOAD_ARGS32 ARGOFFSET, 1  /* reload args from stack in case ptrace changed it */
 	RESTORE_REST
+
+	/* sync back Arg6's possibly changed value where it is expected by C */
 	xchgl %ebp,%r9d
 	cmpq $(IA32_NR_syscalls-1),%rax
 	ja int_ret_from_sys_call /* cstar_tracesys has set RAX(%rsp) */
 	jmp cstar_do_call
 END(ia32_cstar_target)
-				
+
 ia32_badarg:
 	movq $-EFAULT,%rax
 	jmp ia32_sysret
 	CFI_ENDPROC
 
-/* 
- * Emulated IA32 system calls via int 0x80. 
+/*
+ * Emulated IA32 system calls via int 0x80.
  *
- * Arguments:	 
+ * Arguments:
  * %eax	System call number.
  * %ebx Arg1
  * %ecx Arg2
@@ -390,13 +418,13 @@ ia32_badarg:
  * %ebp Arg6    [note: not saved in the stack frame, should not be touched]
  *
  * Notes:
- * Uses the same stack frame as the x86-64 version.	
+ * Uses the same stack frame as the x86-64 version.
  * All registers except %eax must be saved (but ptrace may violate that)
  * Arguments are zero extended. For system calls that want sign extension and
  * take long arguments a wrapper is needed. Most calls can just be called
  * directly.
- * Assumes it is only called from user space and entered with interrupts off.	
- */ 				
+ * Assumes it is only called from user space and entered with interrupts off.
+ */
 
 ENTRY(ia32_syscall)
 	CFI_STARTPROC32	simple
@@ -433,9 +461,9 @@ ia32_sysret:
 	movq %rax,RAX-ARGOFFSET(%rsp)
 ia32_ret_from_sys_call:
 	CLEAR_RREGS -ARGOFFSET
-	jmp int_ret_from_sys_call 
+	jmp int_ret_from_sys_call
 
-ia32_tracesys:			 
+ia32_tracesys:
 	SAVE_REST
 	CLEAR_RREGS
 	movq $-ENOSYS,RAX(%rsp)	/* ptrace can change this for a bad syscall */
@@ -457,13 +485,13 @@ quiet_ni_syscall:
 	movq $-ENOSYS,%rax
 	ret
 	CFI_ENDPROC
-	
+
 	.macro PTREGSCALL label, func, arg
 	.globl \label
 \label:
 	leaq \func(%rip),%rax
 	leaq -ARGOFFSET+8(%rsp),\arg	/* 8 for return address */
-	jmp  ia32_ptregs_common	
+	jmp  ia32_ptregs_common
 	.endm
 
 	CFI_STARTPROC32
@@ -537,7 +565,7 @@ ia32_sys_call_table:
 	.quad quiet_ni_syscall	/* old stty syscall holder */
 	.quad quiet_ni_syscall	/* old gtty syscall holder */
 	.quad sys_access
-	.quad sys_nice	
+	.quad sys_nice
 	.quad quiet_ni_syscall	/* 35 */	/* old ftime syscall holder */
 	.quad sys_sync
 	.quad sys32_kill
@@ -616,7 +644,7 @@ ia32_sys_call_table:
 	.quad stub32_iopl		/* 110 */
 	.quad sys_vhangup
 	.quad quiet_ni_syscall	/* old "idle" system call */
-	.quad sys32_vm86_warning	/* vm86old */ 
+	.quad sys32_vm86_warning	/* vm86old */
 	.quad compat_sys_wait4
 	.quad sys_swapoff		/* 115 */
 	.quad compat_sys_sysinfo
@@ -669,7 +697,7 @@ ia32_sys_call_table:
 	.quad sys_mremap
 	.quad sys_setresuid16
 	.quad sys_getresuid16	/* 165 */
-	.quad sys32_vm86_warning	/* vm86 */ 
+	.quad sys32_vm86_warning	/* vm86 */
 	.quad quiet_ni_syscall	/* query_module */
 	.quad sys_poll
 	.quad compat_sys_nfsservctl
@@ -724,10 +752,10 @@ ia32_sys_call_table:
 	.quad sys_mincore
 	.quad sys_madvise
 	.quad compat_sys_getdents64	/* 220 getdents64 */
-	.quad compat_sys_fcntl64	
+	.quad compat_sys_fcntl64
 	.quad quiet_ni_syscall		/* tux */
-	.quad quiet_ni_syscall    	/* security */
-	.quad sys_gettid	
+	.quad quiet_ni_syscall		/* security */
+	.quad sys_gettid
 	.quad sys32_readahead	/* 225 */
 	.quad sys_setxattr
 	.quad sys_lsetxattr
@@ -742,7 +770,7 @@ ia32_sys_call_table:
 	.quad sys_lremovexattr
 	.quad sys_fremovexattr
 	.quad sys_tkill
-	.quad sys_sendfile64 
+	.quad sys_sendfile64
 	.quad compat_sys_futex		/* 240 */
 	.quad compat_sys_sched_setaffinity
 	.quad compat_sys_sched_getaffinity
@@ -754,7 +782,7 @@ ia32_sys_call_table:
 	.quad compat_sys_io_submit
 	.quad sys_io_cancel
 	.quad sys32_fadvise64		/* 250 */
-	.quad quiet_ni_syscall 	/* free_huge_pages */
+	.quad quiet_ni_syscall	/* free_huge_pages */
 	.quad sys_exit_group
 	.quad sys32_lookup_dcookie
 	.quad sys_epoll_create
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 6419bb0..9569f11 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -607,10 +607,16 @@ tracesys:
 GLOBAL(int_ret_from_sys_call)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
+
+	/*
+	 * check the Requestor Privilege Level of the CS selector
+	 * previously pushed on the stack. If 0, we're returning
+	 * to kernel space.
+	 */
 	testl $3,CS-ARGOFFSET(%rsp)
 	je retint_restore_args
-	movl $_TIF_ALLWORK_MASK,%edi
 	/* edi:	mask to check */
+	movl $_TIF_ALLWORK_MASK,%edi
 GLOBAL(int_with_check)
 	LOCKDEP_SYS_EXIT_IRQ
 	GET_THREAD_INFO(%rcx)
@@ -618,11 +624,16 @@ GLOBAL(int_with_check)
 	andl %edi,%edx
 	jnz   int_careful
 	andl    $~TS_COMPAT,TI_status(%rcx)
+
+	/* no work pending, return to userspace */
 	jmp   retint_swapgs
 
-	/* Either reschedule or signal or syscall exit tracking needed. */
-	/* First do a reschedule test. */
-	/* edx:	work, edi: workmask */
+	/*
+	 * Either reschedule or signal or syscall exit tracking
+	 * needed. First do a reschedule test.
+	 *
+	 * edx:	work, edi: workmask
+	 */
 int_careful:
 	bt $TIF_NEED_RESCHED,%edx
 	jnc  int_very_careful
-- 
1.7.4


-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-23 19:16:32

On 08/23/2011 09:22 AM, Borislav Petkov wrote:
> On Tue, Aug 23, 2011 at 12:11:43PM -0400, Andrew Lutomirski wrote:
>> In any case, this seems insanely overcomplicated.  I'd be less afraid
>> of something like my approach (which, I think, makes all of the
>> SYSCALL weirdness pretty much transparent to ptrace users) or of just
>> removing SYSCALL entirely from 32-bit code.
> 
> I don't think that removing SYSCALL from 32-bit code just so that UML
> trapped syscalls work is something we'd like since SYSCALL is much
> cheaper than INT $0x80:
> 
> "As a result, SYSCALL and SYSRET can take fewer than one-fourth the
> number of internal clock cycles to complete than the legacy CALL and RET
> instructions."
> 
> http://support.amd.com/us/Processor_TechDocs/24593.pdf, p. 152.
> 
> I know, it is 32-bit syscall on 64-bit kernel which should be pretty
> rare but still...
> 

Right, but if you had said the difference had disappeared on current AMD
silicon it would be much less of an issue.  That's why I wanted to find
that bit out from you.
	
	-hpa

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Borislav P. <bp...@am...> - 2011-08-23 20:56:34

On Tue, Aug 23, 2011 at 03:15:58PM -0400, H. Peter Anvin wrote:
> Right, but if you had said the difference had disappeared on current AMD
> silicon it would be much less of an issue.  That's why I wanted to find
> that bit out from you.

Yeah, I dug this out in the APM, btw.

But no, I don't think the difference has disappeared - to the contrary,
AFAICT, the intention is for SYSCALL to be the fastest way to do
syscalls on x86 due to diminished number of segment checks etc. INT80
is legacy, slower, etc. I believe Andy measured a similar situation on
Sandy Bridge with SYSCALL having latencies in the tens of nsecs range
and INT80 being much slower. Ingo also measured a similar situation
where the latency gap between the two on Intel is even bigger.

So, the only problem left now is what we're going
to do with cases similar to what Al conjured up:
http://marc.info/?l=linux-kernel&m=131412271112461&w=2

I don't know, maybe the most cowardly approach is to issue a warning and
shrug with the shoulders, or do some asm magic... Also, do we care at
all, how relevant is a case like that?

Hmmm.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Linus T. <tor...@li...> - 2011-08-23 19:25:31

On Tue, Aug 23, 2011 at 12:18 PM, H. Peter Anvin <hp...@zy...> wrote:
>
> We could drop that information in a metaregister.  It's not backward
> compatible, but at least it will be obvious when that information is
> available and not.

Well, seriously, UML already looks at the word at "ip-2" for other
reasons. So it isn't like there is any point in adding more support to
just give you that information in another form.

                            Linus

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-23 19:27:17

On 08/23/2011 12:24 PM, Linus Torvalds wrote:
> On Tue, Aug 23, 2011 at 12:18 PM, H. Peter Anvin <hp...@zy...> wrote:
>>
>> We could drop that information in a metaregister.  It's not backward
>> compatible, but at least it will be obvious when that information is
>> available and not.
> 
> Well, seriously, UML already looks at the word at "ip-2" for other
> reasons. So it isn't like there is any point in adding more support to
> just give you that information in another form.
> 

Not for UML, but it might be useful for other things.  Certainly lowers
the priority, though.

	-hpa

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 19:41:28

On Tue, Aug 23, 2011 at 12:24:22PM -0700, Linus Torvalds wrote:
> On Tue, Aug 23, 2011 at 12:18 PM, H. Peter Anvin <hp...@zy...> wrote:
> >
> > We could drop that information in a metaregister. ?It's not backward
> > compatible, but at least it will be obvious when that information is
> > available and not.
> 
> Well, seriously, UML already looks at the word at "ip-2" for other
> reasons. So it isn't like there is any point in adding more support to
> just give you that information in another form.

That is done only for task singlestepped in the guest:
        /*
         * This closes a way to execute a system call on the host.  If
         * you set a breakpoint on a system call instruction and singlestep
         * from it, the tracing thread used to PTRACE_SINGLESTEP the process
         * rather than PTRACE_SYSCALL it, allowing the system call to execute
         * on the host.  The tracing thread will check this flag and
         * PTRACE_SYSCALL if necessary.
         */
        if (current->ptrace & PT_DTRACE)
                current->thread.singlestep_syscall =
                        is_syscall(PT_REGS_IP(&current->thread.regs));
with PT_DTRACE set by uml user_enable_single_step()

And it's not cheap - doing that on each syscall will be unpleasant...
Frankly, I'd rather stopped telling the uml userland about vdso in such
setups.  And anything that plays with SYSCALL outside of vdso... <shrug>
we already have a "don't run it native on 32bit", adding "don't run
it on 32bit uml on amd64 host" is not too serious.  At least for now...

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 16:41:52

On Sun, Aug 21, 2011 at 03:43:52PM +0100, Al Viro wrote:

> We do not lie to ptrace and iret.  At all.  We do just what you have
> described.  And fuck up when restart returns us to the SYSCALL / SYSENTER
> instruction again, which expects the different calling conventions,
> so the values arranged in registers in the way int 0x80 would expect
> do us no good.

FWIW, what really happens (for 32bit task on amd64) is this:
	* both SYSCALL and SYSENTER variants of __kernel_vsyscall are
entered with the same calling conventions; eax contains syscall number,
ebx/ecx/edx/esi/edi/ebp contain arg1..6 resp.  Same as what int 0x80
would expect.
	* they arrange slightly different calling conventions for
actual SYSCALL/SYSENTER instructions.  SYSENTER one: ecx and edx saved
on user stack to undo the effect of SYSEXIT clobbering them, arg6 (from
ebp) pushed to stack as well (for kernel side of SYSENTER to pick it
from there) and userland esp copied to ebp (SYSENTER clobbers esp).
SYSCALL one: arg6 (from ebp) pushed to stack (again, for kernel to pick
it from there), arg2 (from ecx) copied to ebp (SYSCALL clobbers ecx).
Then we hit the kernel.
	* Both codepaths start with arranging the same thing on the kernel
stack frame; one 64bit int 0x80 would create.  For the good and simple
reason: they all have to be able to leave via IRET.  Stack layout is the
same, but we need to fill it accordingly to calling conventions we are
stuck with.  I.e. ->cx should be initialized with arg2 and ->bp with
arg6, wherever those currently are on given codepath.  _That_ is what
"lying to ptrace" is about - we store there registers according to how
they were when we entered __kernel_vsyscall(), not as they are at the
moment of actual SYSCALL insn.  Which is precisely the right thing to do,
since if we *are* ptraced, the tracer expects to find the syscall argument
in the same places, whichever variant of syscall tracee happens to be using.
	* In both variants it means picking arg6 from userland stack; if
that pagefaults, we act as if we returned -EFAULT in normal way.  Again,
the value is stored in the expected place - ->bp, same as it would on int 0x80
path.
	* If we are traced, we grow the things on stack to full pt_regs,
including the callee-saved registers.  And call syscall_trace_enter(&regs).
If tracer decides to change registers, it can do so.  After that call we
restore the registers from pt_regs on stack and rejoin the corresponding
common codepath.
	* In both cases we reshuffle registers to match amd64 C calling
conventions; the only subtle part is that SYSCALL path has arg6 in r9d (and
ebp same as we had on entry, i.e. the original arg2, unaffected by whatever
ptrace might have done to regs->cx, BTW) while SYSENTER path has it in ebp,
same as int 0x80 one.  After reshuffling arg6 ends up r9 in all cases and
in all cases ptrace changes to regs->bp (aka where ptrace expects to see
arg6) do affect what's in r9.
	* The actual sys_whatever() is called in all cases.  If there's
any work to do after it (signals, still being traced, need to be rescheduled,
etc.), we go for the good old IRET path (after having cleaned r8--r12 in
pt_regs - IRET path is shared with 64bit and we don't want random kernel values
leaking to userland).
	* If there's no non-trivial work to do, int 0x80 *still* cleans
r8--r12 in pt_regs and goes for IRET path.  End of story for it.
	* In the same case, SYSENTER path will restore the contents of si and
di from pt_regs (bx is unaffected by sys_whatever(), ax holds return value
and cx/dx are going to be clobbered anyway; bp is not restored to the
conditions it had when hitting SYSENTER, but it's redundant - it was equal
to userland sp and *that* we do restore, of course).  r8--r11 are cleared
in actual CPU registers and off we bugger, back to vdso32.  Where we pop
ebp/ecx/edx and return to caller.  Note that syscall restart couldn't have
happened on that path - it would qualify as "work to do after syscall"
(specifically, signal handling) as we'd be off to IRET path.
	* In the same case, SYSCALL path will restore the contents of
si, di and dx from pt_regs (bx is unaffected by sys_whatever(), ax contains
the return value and bp is actually the same as it was on entry, after all
dances).  r8-r11 are cleaned in registers, cx is clobbered by SYSRET and
we are off to __kernel_vsyscall(), again.  This time back in there we
restore cx to what it used to be on entry to __kernel_vsyscall() [*NOTE*:
unaffected by ptrace manipulations; we probably don't care about that] and
restore bp (from stack).  We also restore %ss along the way, but that's
a separate story.  Again, no syscall restarts on that path.
	* If there *was* a syscall restart to be done, we are guaranteed to
have left via IRET path.  In all cases the syscall arguments end up in
registers, in the same way int 0x80 expected them.  What happens afterwards
depends on how we entered, though.
		+ int 0x80: all registers are restored (with ptrace
manipulations, if any, having left their effect) as they'd been the last
time around.  In we go and that's it.
		+ SYSENTER: return address had been set *not* to the
insn right next after SYSENTER when we'd been setting the stack frame
up.  That's the dirty trick Linus had come up with - return ip is set
to insn in vfso32 (SYSENTER loses the original ip for good, unlike SYSCALL
that would store it in cx, so it has to be at fixed location anyway).
Normally we just pop ecx/edx/ebp from stack and we are done.  However,
two bytes prior to that insn (i.e. where syscall restart would land us)
we have jump to just a bit before SYSENTER.  Namely, to the point where
we had copied esp to ebp.  That, combined with what IRET path has done,
will get us the layout SYSENTER expects once we get to SYSENTER again.
Except that ptrace modifications to arg6 will be lost - *ebp is where
SYSENTER picks it from and it's not updated.  Modified value is in ebp
on return from kernel and it's overwritten (with esp) and lost.  That's
ptrace vs. restarts bug I've mentioned in SYSENTER case.
		+ SYSCALL: buggered.  On restart we end up repeating
the call, with arg2 replaced with whatever had been in ebp when we
entered __kernel_vsyscall().  Simply because nobody cared to move it
from ecx (where IRET path has put it) to ebp (where SYSCALL expects
to find it).  ebp gets what used to be in arg6 (again, IRET path doing).
Oh, and ptrace modifications, if any, are lost as well - both in arg2
and in arg6.

	I *think* the above is an accurate description of what happens,
but I could certainly be wrong - that's just from RTFS of unfamiliar
and seriously convoluted code, so I'd very much appreciate ACK/NAK on
the analysis above from the people actually familiar with that thing...

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Linus T. <tor...@li...> - 2011-08-22 01:09:33

On Sun, Aug 21, 2011 at 5:44 PM, Andrew Lutomirski <lu...@mi...> wrote:
>
> Which suggests an easy-ish fix: if sysenter is used or if syscall is
> entered from the EIP is is supposed to be entered from, then just
> change ip in the argument save to point to the int 0x80 instruction.

Indeed. Just add an "int 0x80" instruction to the vsyscall thing, and
you'd be done.

In fact, just replace the

  jmp .Lenter_kernel

with

  int 0x80

and you'd be pretty much all done, no?

(Ok, that's probably a huge over-simplification, but perhaps "close
enough" to true that it would be workable)

                     Linus

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-22 01:19:55

On Sun, Aug 21, 2011 at 06:09:00PM -0700, Linus Torvalds wrote:
> On Sun, Aug 21, 2011 at 5:44 PM, Andrew Lutomirski <lu...@mi...> wrote:
> >
> > Which suggests an easy-ish fix: if sysenter is used or if syscall is
> > entered from the EIP is is supposed to be entered from, then just
> > change ip in the argument save to point to the int 0x80 instruction.
> 
> Indeed. Just add an "int 0x80" instruction to the vsyscall thing, and
> you'd be done.
> 
> In fact, just replace the
> 
>   jmp .Lenter_kernel
> 
> with
> 
>   int 0x80
> 
> and you'd be pretty much all done, no?

In case of sysenter - almost, in case of syscall - nope...

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-22 01:19:58

On 08/21/2011 06:09 PM, Linus Torvalds wrote:
> 
> Indeed. Just add an "int 0x80" instruction to the vsyscall thing, and
> you'd be done.
> 
> In fact, just replace the
> 
>   jmp .Lenter_kernel
> 
> with
> 
>   int 0x80
> 
> and you'd be pretty much all done, no?
> 
> (Ok, that's probably a huge over-simplification, but perhaps "close
> enough" to true that it would be workable)
> 

Hm... I think a jump to something which adjusts %esp and invokes int
$0x80 might just work, but only for SYSENTER.

SYSCALL is different, especially since SYSCALL is legal to execute from
anywhere in userspace (and no, as we have learned already doing EIP
checking is *NOT* acceptable.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

Re: [uml-devel] [RFC] weird crap with vdso on uml/i386

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-20 20:14:15

On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote:

> Hmmm, very strange.
> Sadly I cannot reproduce the issue. :(
> Everything works fine within UML.
> (Of course I've applied your vDSO/i386 patches)
> 
> My test setup:
> Host kernel: 2.6.37 and 3.0.1
> Distro: openSUSE 11.4/x86_64
> 
> UML kernel: 3.1-rc2
> Distro: openSUSE 11.1/i386
> 
> Does the problem also occur with another host kernel or a different
> guest image?

Could you check what you get in __kernel_vsyscall()?  On iAMD64 box
where that sucker contains sysenter-based variant the bug is not
present.  IOW, it's sensitive to syscall vs. systenter vs. int 0x80
differences.

I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2 ;-/)
if you want to see if that gets reproduced on your box.  I'll drop it on
anonftp if you are interested.  FWIW, the same kernel binary/same image
result in
	* K7 box - no breakage, SYSENTER-based vdso
	* K8 box - breakage as described, SYSCALL-based vdso32
	* P4 box - no breakage, SYSENTER-based vdso32
Hell knows...  In theory that would seem to point towards ia32_cstar_target(),
so I'm going to RTFS carefully through that animal.

The thing is, whatever happens happens when victim gets resumed inside
vdso page.  I'll try to dump PTRACE_SETREGS and see the values host
kernel asked to set and work from there, but the interesting part is
bloody hard to singlestep through - the victim is back to user mode and
it is already traced by the guest kernel, so it's not as if we could
attach host gdb to it and walk through that crap.  And guest gdb is not
going to be able to set breakpoints in there - vdso page is r/o...

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Andrew L. <lu...@mi...> - 2011-08-22 00:44:39

On Sun, Aug 21, 2011 at 12:41 PM, Al Viro <vi...@ze...> wrote:
> On Sun, Aug 21, 2011 at 03:43:52PM +0100, Al Viro wrote:
>
>> We do not lie to ptrace and iret.  At all.  We do just what you have
>> described.  And fuck up when restart returns us to the SYSCALL / SYSENTER
>> instruction again, which expects the different calling conventions,
>> so the values arranged in registers in the way int 0x80 would expect
>> do us no good.
>
> FWIW, what really happens (for 32bit task on amd64) is this:

I think I believe your analysis...

>        * Both codepaths start with arranging the same thing on the kernel
> stack frame; one 64bit int 0x80 would create.  For the good and simple
> reason: they all have to be able to leave via IRET.  Stack layout is the
> same, but we need to fill it accordingly to calling conventions we are
> stuck with.  I.e. ->cx should be initialized with arg2 and ->bp with
> arg6, wherever those currently are on given codepath.  _That_ is what
> "lying to ptrace" is about - we store there registers according to how
> they were when we entered __kernel_vsyscall(), not as they are at the
> moment of actual SYSCALL insn.  Which is precisely the right thing to do,
> since if we *are* ptraced, the tracer expects to find the syscall argument
> in the same places, whichever variant of syscall tracee happens to be using.

This is, IMO, gross -- if the values in pt_regs matched what they were
when sysenter / syscall was issued, then we'd be fine -- we could
restart the syscall and everything would work.  Apparently ptrace
users have a problem with that, so we're stuck with the "lie" (i.e.
reporting values as of __kernel_vsyscall, not as of the actual kernel
entry).

>        * If there *was* a syscall restart to be done, we are guaranteed to
> have left via IRET path.  In all cases the syscall arguments end up in
> registers, in the same way int 0x80 expected them.  What happens afterwards
> depends on how we entered, though.
>                + int 0x80: all registers are restored (with ptrace
> manipulations, if any, having left their effect) as they'd been the last
> time around.  In we go and that's it.

Which suggests an easy-ish fix: if sysenter is used or if syscall is
entered from the EIP is is supposed to be entered from, then just
change ip in the argument save to point to the int 0x80 instruction.
This might also require tweaking the userspace stack.  That way,
restart would hit int 0x80 instead of syscall/sysenter and the
registers are exactly as expected.

Getting this right in the case where ptrace attaches during the
syscall might be tricky, though.

--Andy

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Andrew L. <lu...@mi...> - 2011-08-22 13:34:54

On Mon, Aug 22, 2011 at 5:53 AM, Ingo Molnar <mi...@ke...> wrote:
>
> * H. Peter Anvin <hp...@zy...> wrote:
>
>> Borislav,
>>
>> We're tracking down an issue with the way system call arguments are
>> handled on 32 bits.  We have a solution for SYSENTER but not
>> SYSCALL; fixing SYSCALL "properly" appears to be very difficult at
>> best.
>>
>> So the question is: how much overhead would it be to simply fall
>> back to int $0x80 or some other legacy-style domain crossing
>> instruction for 32-bit system calls on AMD64 processors?  We don't
>> ever use SYSCALL in legacy mode, so native i386 kernels are
>> unaffected.
>
> Last i measured INT80 and SYSCALL costs they were pretty close to
> each other on AMD CPUs - closer than on Intel.

>From memory, SYSCALL in 64-bit mode on Sandy Bridge is much faster
than int 0xcc, which is presumably about the same speed as int 0x80.
That's because SYSCALL is blazingly fast (<30 ns for a whole system
call) and int is slower.

--Andy

>
> Also, most installations are either pure 32-bit or dominantly 64-bit,
> the significantly mixed-mode case is dwindling.
>
> Unifying some more in this area would definitely simplify things ...
>
> Thanks,
>
>        Ingo
>

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Borislav P. <bp...@am...> - 2011-08-22 14:41:12

On Mon, Aug 22, 2011 at 09:34:27AM -0400, Andrew Lutomirski wrote:
> On Mon, Aug 22, 2011 at 5:53 AM, Ingo Molnar <mi...@ke...> wrote:
> >
> > * H. Peter Anvin <hp...@zy...> wrote:
> >
> >> Borislav,
> >>
> >> We're tracking down an issue with the way system call arguments are
> >> handled on 32 bits.  We have a solution for SYSENTER but not
> >> SYSCALL; fixing SYSCALL "properly" appears to be very difficult at
> >> best.
> >>
> >> So the question is: how much overhead would it be to simply fall
> >> back to int $0x80 or some other legacy-style domain crossing
> >> instruction for 32-bit system calls on AMD64 processors?  We don't
> >> ever use SYSCALL in legacy mode, so native i386 kernels are
> >> unaffected.
> >
> > Last i measured INT80 and SYSCALL costs they were pretty close to
> > each other on AMD CPUs - closer than on Intel.
> 
> From memory, SYSCALL in 64-bit mode on Sandy Bridge is much faster
> than int 0xcc, which is presumably about the same speed as int 0x80.
> That's because SYSCALL is blazingly fast (<30 ns for a whole system
> call) and int is slower.

Just to make sure I'm grokking this correctly - we want to use int $0x80
only for the SYSCALL variant in __kernel_vsyscall, right? Not for all
32-bit syscalls on a 64-bit kernel.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Linus T. <tor...@li...> - 2011-08-23 00:23:16

On Mon, Aug 22, 2011 at 5:07 PM, H. Peter Anvin <hp...@zy...> wrote:
>
> Checking for SYSCALL in the vdso is cheap... the only problem is if
> someone thinks that they can copy it out of the vdso and run it, having
> special-cased SYSENTER already, but not SYSCALL.  The recent experience
> shows that that is apparently a very real possibility.

I don't understand why people keep harping on this "copy the vdso" thing.

Guys, IF THEY COPY THE VDSO, THEN THEY WOULD GET OUR IMPROVEMENTS!

So stop being silly.

As to comparing the IP address, that's just moronic. Why the hell
would anybody ever do something that stupid? It's expensive and not
worth it. You'd be much better off comparing the contents of the
memory at 'rip', and see if that memory contains the 'int 0x80'
instruction we would add. That automatically also gets the 'somebody
copied the vdso text' case right.

You guys seem to positively _want_ to make this a bigger issue than it
is. As far as I can tell, nobody has ever even noticed this problem
before, and we already have a trivial fix ("don't do it then") for the
case Al actually did notice.

The *only* case left as far as I can tell is the case where people do
*not* copy the vdso, but simply generate (statically and/or
dynamically with a JIT) the syscall instruction directly. And we
cannot fix that, so stop even bothering to try. The absolute best we
can do for that case is to just continue to do what we do now. WE
SIMPLY HAVE NO CHOICE.

I really don't see this as being worth this much discussion.

                     Linus

<< < 1 2 3 4 > >> (Page 2 of 4)