Thread: [uml-devel] Subject: [PATCH 00/91] pending uml patches

[uml-devel] Subject: [PATCH 00/91] pending uml patches

From: Al V. <vi...@ft...> - 2011-08-18 18:59:07

My apologies for mailbomb from hell.  *All* this stuff is available in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/um-header.git/ #master,
but since uml folks had been stuck with mail and patch for a long time...

Anyway, most of the stuff in this pile is merging, cleaning and mutilating
subarchitecture-related code in arch/um.  By the end of it we have x86
bits largely merged between 32bit and 64bit variants and taken to arch/x86/um;
headers seriously cleaned up and mostly free of x86-isms now (not completely -
we still have page size dependencies in there).

The beginning of the series is pure build and driver fixes; those should go to
Linus before 3.1-final, IMO.

As far as I know, there are no regressions introduced by that pile; testing
and comments would be, of course, welcome.

Re: [uml-devel] Subject: [PATCH 00/91] pending uml patches

From: Richard W. <ri...@no...> - 2011-08-18 19:12:55

Al,

On 18.08.2011 20:58, Al Viro wrote:
>
> My apologies for mailbomb from hell.  *All* this stuff is available in
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/um-header.git/ #master,
> but since uml folks had been stuck with mail and patch for a long time...

Have you touched your patches since yesterday?
I've already pulled and uploaded them to my shiny new git repo at:
git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable

Due to the current mirroring problems with git.kernel.org I have not
made it public yet.
Sorry for that, I screwed up. :-(

> Anyway, most of the stuff in this pile is merging, cleaning and mutilating
> subarchitecture-related code in arch/um.  By the end of it we have x86
> bits largely merged between 32bit and 64bit variants and taken to arch/x86/um;
> headers seriously cleaned up and mostly free of x86-isms now (not completely -
> we still have page size dependencies in there).
>
> Beginning of the series are pure build and driver fixes; those should go to
> Linus before 3.1-final, IMO.

Okay.

> As far as I know, there's no regressions introduced by that pile; testing
> and comments would be, of course, welcome.
>

There was a small build regression, I've already fixed it!

Thanks,
//richard

Re: [uml-devel] Subject: [PATCH 00/91] pending uml patches

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-18 19:19:53

On Thu, Aug 18, 2011 at 09:12:47PM +0200, Richard Weinberger wrote:
> Have you touched your patches since yesterday?
> I've already pulled and uploaded them to my shiny new git repo at:
> git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable

Reordered and added missing S-o-b on a couple, split one commit.

Re: [uml-devel] [RFC] weird crap with vdso on uml/i386

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 06:34:56

On Sat, Aug 20, 2011 at 05:40:03PM -0400, Andrew Lutomirski wrote:

> will cause iret (if iret happens) to restore the original rbp in rcx
> (why? -- it seems okay if syscall is hit in __kernel_vsyscall but not
> if something else does the syscall).  I don't see what saves rbp to
> the stack frame.

A far more interesting question is how the hell that thing manages to
work in the face of syscall restarts.  As a matter of fact, how does it
(and the sysenter-based variant) play with ptrace() *and* restarts?

Suppose we have a traced process.  foo6() is called and the thing is
stopped before sys_foo6() is reached kernel-side.  The sixth argument
is on stack, ebp is set to user esp.  SYSENTER happens, we read the
6th argument from userland stack and put it along with the rest into
pt_regs.  tracer examines the arguments, modifies them (including the last
one) and lets the tracee run free - e.g. detaches from the tracee.  

What should happen if we happen to get a signal that would restart that
sucker?  Granted, it's not going to happen with mmap() - it doesn't, AFAICS,
do anything of that kind.  However, I wouldn't bet a dime on other 6-argument
syscalls not stepping on that.  sendto() and recvfrom(), in particular...

OK, we return to userland.  The sixth argument is placed into %ebp.  Linus'
"pig and proud of that" trick works and we end up slapping userland
%esp into %ebp and hitting SYSENTER again.  Only one problem, though -
the sixth argument on user stack is completely unaffected by what tracer
had done.  Unlike the rest of arguments, that *are* changed.

We could deal with that in case of SYSENTER if we e.g. replaced that
        jmp .Lenter_kernel
with
        jmp .Lrestart
and added
.Lrestart:
	movl %ebp, (%esp)
	jmp .Lenter_kernel
but in case of SYSCALL it seems to be even messier...  Comments?

... and there I thought that last year session of asm glue sniffing couldn't
be topped by anything more unpleasant ;-/
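
As a rough illustration of the SYSENTER scenario described above, here is a toy C model. It is not kernel code: the struct and function names are hypothetical, and it tracks only the one stack slot and the one saved register that matter for arg6. It assumes, as the message describes, that the kernel reloads arg6 from the user stack on every SYSENTER entry, and it models the proposed .Lrestart label as a store of %ebp back to (%esp) before re-entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code: every name here is hypothetical. */
struct toy_task {
    uint32_t stack_slot; /* (%esp) in userland: where the stub spilled arg6 */
    uint32_t regs_ebp;   /* saved-register slot holding arg6 inside the kernel */
};

/* SYSENTER entry: the kernel fetches arg6 from the user stack. */
static void toy_sysenter_entry(struct toy_task *t)
{
    t->regs_ebp = t->stack_slot;
}

/*
 * Restart: userland resumes with %ebp = saved arg6, then the stub re-enters.
 * With the proposed .Lrestart label, %ebp is stored back to (%esp) first.
 * Returns arg6 as the restarted syscall sees it.
 */
static uint32_t toy_restart(struct toy_task *t, bool lrestart_fix)
{
    uint32_t user_ebp = t->regs_ebp;
    if (lrestart_fix)
        t->stack_slot = user_ebp;   /* movl %ebp, (%esp) */
    toy_sysenter_entry(t);          /* %ebp = %esp; sysenter; reload arg6 */
    return t->regs_ebp;
}

/* Tracer pokes arg6 at the entry stop, then a restart happens. */
uint32_t toy_traced_restart(uint32_t arg6, uint32_t poked, bool lrestart_fix)
{
    struct toy_task t = { .stack_slot = arg6, .regs_ebp = 0 };
    toy_sysenter_entry(&t);
    t.regs_ebp = poked;             /* tracer rewrites the saved register */
    return toy_restart(&t, lrestart_fix);
}

void toy_sysenter_check(void)
{
    assert(toy_traced_restart(6, 66, false) == 6);  /* tracer's change lost */
    assert(toy_traced_restart(6, 66, true) == 66);  /* change survives */
}
```

Calling toy_sysenter_check() exercises both cases: without the extra store the restarted call sees the stale stack value, with it the tracer's value is seen.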

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Andrew L. <lu...@mi...> - 2011-08-22 02:02:07

On Sun, Aug 21, 2011 at 9:48 PM, H. Peter Anvin <hp...@zy...> wrote:
> On 08/21/2011 06:41 PM, Linus Torvalds wrote:
>> If people are using syscall directly, we're pretty much stuck. No
>> amount of "that's hopelessly wrong" will ever matter. We don't break
>> existing binaries.
>>
>> That said, I'd *hope* that everybody uses the vdso32, simply because
>> user programs are not supposed to know which CPU they are running on
>> and if that CPU even *supports* the syscall instruction. In which case
>> it may be possible that we can play games with the vdso thing. But
>> that really would be conditional on "nobody ever reports a failure".
>
> I think we found that out with the vsyscall emulation issue last cycle.
>  It works, so it will have been used, somewhere...
>
>> But if that's possible, maybe we can increment the RIP by 2 for
>> 'syscall', and slip an "'int 0x80" after the syscall instruction in
>> the vdso there? Resulting in the same pseudo-solution I suggested for
>> sysenter...
>
> I think we have the above problem.
>
> The problem here is that the syscall state is actually more complex than
> we retain: the entire state is given by (entry point, register state);
> with that amount of state we have all the information needed to *either*
> extract the syscall arguments *or* the register contents.  Without
> those, we can only represent one of the two possible metalevels (right
> now we represent the higher-level metalevel, the argument vector), but
> we need both for different usages.

My understanding of the problem is the following:

 1. The SYSCALL 32-bit calling convention puts arg2 in ebp and arg6 on
the stack.

 2. The int 0x80 convention is different: arg2 is in ecx.

 3. We're worried that pt_regs-using compat syscalls might want the
regs to appear to match the actual arguments (why?)

 4. ptrace expects the "registers" when SYSCALL happens to match the
int 0x80 convention.  (This is, IMO, sick.)

 5. Syscall restart with the SYSCALL instruction must switch to
userspace and back to the kernel for reasons I don't understand that
presumably involve signal delivery.

 6. Existing ABI requires that the kernel not clobber syscall
arguments (except, of course, when ptrace or syscall restart
explicitly change those arguments).

So we're sort of screwed.  arg2 must be in ecx to keep ptrace happy
but SYSCALL clobbers ecx, so arg2 cannot be preserved.

So here are three strawman ideas:

a) Change #4.  Maybe it's too late to do this, though.

b) When SYSCALL happens, change RIP to point two bytes past an int
0x80 instruction in the vdso.  Make the next instruction there be a
"ret" that returns to the instruction after the original syscall.
Patch the stack in the kernel.

c) Disable syscall restart when SYSCALL happens from somewhere outside the vdso.

--Andy

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-22 02:08:02

On Sun, Aug 21, 2011 at 10:01:40PM -0400, Andrew Lutomirski wrote:

>  3. We're worried that pt_regs-using compat syscalls might want the
> regs to appear to match the actual arguments (why?)

run strace and you'll see why.

>  4. ptrace expects the "registers" when SYSCALL happens to match the
> int 0x80 convention.  (This is, IMO, sick.)

That's what ptrace is *for*.  It's there to let debuggers play with
the program being debugged, including taking a look at the syscall
arguments and modifying them.  In a predictable way, please.

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Andrew L. <lu...@mi...> - 2011-08-22 02:26:26

On Sun, Aug 21, 2011 at 10:07 PM, Al Viro <vi...@ze...> wrote:
> On Sun, Aug 21, 2011 at 10:01:40PM -0400, Andrew Lutomirski wrote:
>
>>  3. We're worried that pt_regs-using compat syscalls might want the
>> regs to appear to match the actual arguments (why?)
>
> run strace and you'll see why.
>

I'm talking about the implementations of stub32_rt_sigreturn,
sys32_rt_sigreturn, stub32_sigreturn, stub32_sigaltstack,
stub32_execve, stub32_fork, stub32_clone, stub32_vfork, and stub32_iopl.
I don't know what this has to do with strace or user ABI at all.

>>  4. ptrace expects the "registers" when SYSCALL happens to match the
>> int 0x80 convention.  (This is, IMO, sick.)
>
> That's what ptrace is *for*.  It's there to let debuggers play with
> the program being debugged, including taking a look at the syscall
> arguments and modifying them.  In a predictable way, please.
>

It may be necessary, but I still think it's sick.  Especially in the
case of inlined SYSCALL, where the registers reported to ptrace do not
match any register values that ever actually existed in CPU registers.
 Too late to fix it, though.

Which still leaves the question of how to fix it.  Restarting via an
int 0x80-based helper might be the only option that leaves everything
fully functional.

--Andy

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-22 02:34:30

On 08/21/2011 07:26 PM, Andrew Lutomirski wrote:
> 
> Which still leaves the question of how to fix it.  Restarting via an
> int 0x80-based helper might be the only option that leaves everything
> fully functional.
> 

The issue is that we don't represent the entire state ... we represent
only one metalevel of it, currently the "cooked" one.  The problem is
that we need the "raw" one as well, and in order to have *both* we need
to know the entry mechanism.

We need that IN EITHER CASE.

This is reasonably straightforward... we can carry the entry mechanism
forward inside the kernel, and fix it up in the IRET path.

The *really* big issue is what we drop as the sigcontext since this is
an ABI carried out to userspace.  We could just say "it's currently
totally broken for SYSCALL" and just change it to drop the raw state,
but which has the potential for breaking unknown programs, *or* we could
add a bit of state (presumably by reclaiming one of the padding fields
around cs and ss) ... which *also* has the potential for breaking programs.

Right now, SYSCALL -> signal -> restart *is broken*, however, so there
is also the option of just doing nothing in this case, I guess.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
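
The "raw" versus "cooked" distinction can be sketched in C as follows. This is only a sketch under stated assumptions: the type and function names are hypothetical, only int 0x80 and 32-bit SYSCALL are modelled (using the register assignments given earlier in the thread), and writing arg6 back to the user stack on the SYSCALL path is an assumption of the sketch, not something the message proposes in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the register assignments come from the thread. */
enum toy_entry { TOY_INT80, TOY_SYSCALL32 };

struct toy_user   { uint32_t ecx, ebp, stack_slot; };  /* "raw" state */
struct toy_cooked { uint32_t arg2, arg6; };            /* "cooked" state */

/* Raw -> cooked: where arg2 and arg6 live depends on how we entered. */
static struct toy_cooked toy_cook(enum toy_entry how, const struct toy_user *u)
{
    struct toy_cooked c;
    if (how == TOY_INT80) {
        c.arg2 = u->ecx;
        c.arg6 = u->ebp;
    } else {
        c.arg2 = u->ebp;
        c.arg6 = u->stack_slot;
    }
    return c;
}

/* Cooked -> raw, driven by the remembered entry mechanism. */
static void toy_uncook(enum toy_entry how, const struct toy_cooked *c,
                       struct toy_user *u)
{
    if (how == TOY_INT80) {
        u->ecx = c->arg2;
        u->ebp = c->arg6;
    } else {
        u->ebp = c->arg2;
        u->stack_slot = c->arg6;    /* assumption: arg6 written back to stack */
        /* ecx is left alone: SYSCALL overwrites it on re-entry anyway */
    }
}

/* Returns 1 if a tracer's edit to arg2/arg6 survives a restart. */
int toy_roundtrip_ok(enum toy_entry how)
{
    struct toy_user u = { .ecx = 10, .ebp = 20, .stack_slot = 30 };
    struct toy_cooked c = toy_cook(how, &u);
    c.arg2 = 222;                   /* tracer edits the cooked view */
    c.arg6 = 666;
    toy_uncook(how, &c, &u);        /* fix-up on the way back to userland */
    struct toy_cooked again = toy_cook(how, &u);
    return again.arg2 == 222 && again.arg6 == 666;
}

void toy_entry_check(void)
{
    assert(toy_roundtrip_ok(TOY_INT80));
    assert(toy_roundtrip_ok(TOY_SYSCALL32));
}
```

The point of the sketch is only that the round trip is possible once the entry mechanism is kept next to the cooked arguments; what to expose in sigcontext is a separate question that it does not address.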

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-22 04:12:09

On 08/21/2011 09:07 PM, Al Viro wrote:
> On Sun, Aug 21, 2011 at 06:41:16PM -0700, Linus Torvalds wrote:
>> On Sun, Aug 21, 2011 at 6:16 PM, Al Viro <vi...@ze...> wrote:
>>>
>>> Is that ability a part of userland ABI or are we declaring that hopelessly
>>> wrong and require to go through the function in vdso32?  Linus?
>>
>> If people are using syscall directly, we're pretty much stuck. No
>> amount of "that's hopelessly wrong" will ever matter. We don't break
>> existing binaries.
> 
> There's a funny part, though - such binary won't work on 32bit kernel.
> AFAICS, we never set MSR_*STAR on 32bit kernels (and native 32bit vdso
> doesn't provide a SYSCALL-based variant).
> 
> So if we really consider such SYSCALL outside of vdso32 kosher, shouldn't
> we do something with entry_32.S as well?  I don't think it's worth doing,
> TBH...
> 
> Again, I very much hope that binaries with such stray SYSCALL simply do
> not exist.  In theory it's possible to write one, but...
> 
> IIRC, the reason we never had SYSCALL support in 32bit kernel was the utter
> lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
> and (again, IIRC) it had some differences in SYSCALL semantics compared to
> K7 (which supports SYSENTER as well).  Bugger if I remember what those
> differences might've been...  Some flag not cleared?

The most likely reason for a binary to execute a stray SYSCALL is
because they read it out of the vdso.  Totally daft, but we certainly
see a lot of stupid things as evidenced by the JIT thread earlier this
month.

In that sense, a "safe" thing would be to drop use of SYSCALL for 32-bit
processes... I just sent Borislav a query about the cost.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-22 04:26:23

On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote:
> > lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
> > and (again, IIRC) it had some differences in SYSCALL semantics compared to
> > K7 (which supports SYSENTER as well).  Bugger if I remember what those
> > differences might've been...  Some flag not cleared?
> 
> The most likely reason for a binary to execute a stray SYSCALL is
> because they read it out of the vdso.  Totally daft, but we certainly
> see a lot of stupid things as evidenced by the JIT thread earlier this
> month.

Um...  What, blindly, no matter what surrounds it in there?  What will
happen to the same eager JIT when it steps on SYSENTER?

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-22 05:03:54

On 08/21/2011 09:26 PM, Al Viro wrote:
> On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote:
>>> lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
>>> and (again, IIRC) it had some differences in SYSCALL semantics compared to
>>> K7 (which supports SYSENTER as well).  Bugger if I remember what those
>>> differences might've been...  Some flag not cleared?
>>
>> The most likely reason for a binary to execute a stray SYSCALL is
>> because they read it out of the vdso.  Totally daft, but we certainly
>> see a lot of stupid things as evidenced by the JIT thread earlier this
>> month.
> 
> Um...  What, blindly, no matter what surrounds it in there?  What will
> happen to the same eager JIT when it steps on SYSENTER?

The JIT will have had to manage SYSENTER already.  It's not a change,
whereas SYSCALL would be.  We could just try it, and see if anything
breaks, of course.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Andrew L. <luto@MIT.EDU> - 2011-08-23 05:10:55

On 08/22/2011 01:03 AM, H. Peter Anvin wrote:
> On 08/21/2011 09:26 PM, Al Viro wrote:
>> On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote:
>>>> lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
>>>> and (again, IIRC) it had some differences in SYSCALL semantics compared to
>>>> K7 (which supports SYSENTER as well).  Bugger if I remember what those
>>>> differences might've been...  Some flag not cleared?
>>>
>>> The most likely reason for a binary to execute a stray SYSCALL is
>>> because they read it out of the vdso.  Totally daft, but we certainly
>>> see a lot of stupid things as evidenced by the JIT thread earlier this
>>> month.
>>
>> Um...  What, blindly, no matter what surrounds it in there?  What will
>> happen to the same eager JIT when it steps on SYSENTER?
> 
> The JIT will have had to manage SYSENTER already.  It's not a change,
> whereas SYSCALL would be.  We could just try it, and see if anything
> breaks, of course.

Here's a possible solution that works for standalone SYSCALL and vdso
SYSCALL.  The idea is to preserve the exact same SYSCALL invocation
sequence.  Logically, the SYSCALL instruction does:

push %ebp
mov %ebp,%ecx
mov 4(%esp),%ebp
call __fake_int80

and __fake_int80 is:
int 0x80
mov 4(%esp),%ebp
ret $4


The entire system call sequence is then (effectively):

push %ebp
movl %ecx,%ebp

; "SYSCALL" starts here
push %ebp
mov %ebp,%ecx
mov 4(%esp),%ebp
call __fake_int80
; "SYSCALL" ends here

movl %ebp,%ecx
popl %ebp
ret

So we rearrange ebp and ecx and then immediately rearrange them back.
The landing point tweaks them again so that we preserve the old
semantics of SYSCALL.  But now the pt_regs values exactly match what
would have happened if we entered via the int 0x80 path, so there
shouldn't be any corner cases with ptrace or restart -- as far as either
one is concerned, we actually entered via int 0x80.  If we deliver a
signal, the signal handler returns to the int 0x80 instruction.
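
The register shuffle above can be checked with a toy C model of the sequence. This is a hedged sketch, not the proposed implementation: the struct, the helper names and the placeholder return address are hypothetical, stack slots are modelled as array entries of 4 bytes each, and "int 0x80" is reduced to recording what ecx and ebp hold at that point.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 32-bit machine: two registers and a small downward-growing stack. */
struct toy_cpu {
    uint32_t ecx, ebp;
    uint32_t stack[8];
    int sp;                       /* index of the top slot; push decrements */
};

static void toy_push(struct toy_cpu *c, uint32_t v) { c->stack[--c->sp] = v; }
static uint32_t toy_pop(struct toy_cpu *c) { return c->stack[c->sp++]; }

/* What the int 0x80 path would see: arg2 in ecx, arg6 in ebp. */
struct toy_int80_view { uint32_t arg2, arg6; };

static struct toy_int80_view toy_vsyscall(struct toy_cpu *c)
{
    struct toy_int80_view seen;
    const uint32_t ret_addr = 0x1234;   /* placeholder return address */

    /* vdso prologue */
    toy_push(c, c->ebp);                /* push %ebp       (spill arg6) */
    c->ebp = c->ecx;                    /* movl %ecx,%ebp  (arg2 -> ebp) */

    /* emulated SYSCALL */
    toy_push(c, c->ebp);                /* push %ebp */
    c->ecx = c->ebp;                    /* mov %ebp,%ecx */
    c->ebp = c->stack[c->sp + 1];       /* mov 4(%esp),%ebp */
    toy_push(c, ret_addr);              /* call __fake_int80 */

    /* __fake_int80 */
    seen.arg2 = c->ecx;                 /* int 0x80 */
    seen.arg6 = c->ebp;
    c->ebp = c->stack[c->sp + 1];       /* mov 4(%esp),%ebp */
    (void)toy_pop(c);                   /* ret $4: pop the return address... */
    (void)toy_pop(c);                   /* ...and drop 4 more bytes */

    /* vdso epilogue */
    c->ecx = c->ebp;                    /* movl %ebp,%ecx */
    c->ebp = toy_pop(c);                /* popl %ebp */
    return seen;
}

void toy_vsyscall_check(void)
{
    struct toy_cpu c = { .ecx = 2, .ebp = 6, .sp = 8 };
    struct toy_int80_view v = toy_vsyscall(&c);
    assert(v.arg2 == 2 && v.arg6 == 6);            /* int 0x80 convention */
    assert(c.ecx == 2 && c.ebp == 6 && c.sp == 8); /* caller state intact */
}
```

In this model the values seen at the int 0x80 point match the int 0x80 convention, and ecx, ebp and the stack pointer come back unchanged for the caller.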

Am I missing something?  Extremely buggy, incomplete code that
implements this is:


diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a0e866d..6cda8ce 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -291,24 +291,59 @@ ENTRY(ia32_cstar_target)
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_ARGS 8,0,0
 	movl 	%eax,%eax	/* zero extension */
-	movq	%rax,ORIG_RAX-ARGOFFSET(%rsp)
-	movq	%rcx,RIP-ARGOFFSET(%rsp)
-	CFI_REL_OFFSET rip,RIP-ARGOFFSET
-	movq	%rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */
-	movl	%ebp,%ecx
+
+	/*
+	 * This does (from the user's point of view):
+	 * push %ebp
+	 * mov %ebp, %ecx
+	 * mov 4(%esp), %ebp
+	 * call <function that does int 0x80; mov 4(%esp),%ebp; ret 4>
+	 *
+	 * User address access does not need access_ok check as r8
+	 * has been zero-extended, so even with the offsets it cannot
+	 * exceed 2**32 + 8.
+	 */
+
+	/* XXX: need to check that vdso actually exists. */
+	/* XXX: ia32_badarg may do bad things to the user state. */
+
+	/* move ebp into place on the user stack */
+	1:	movl	%ebp,-4(%r8)
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous
+
+	/* move eip into place on the user stack */
+	1:	movl	%ecx,-8(%r8)  /* user eip is in ecx */
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous
+
+	/* move ebp to ecx in CPU registers and argument save area */
+	mov %ebp,%ecx
+	movq %ecx,RCX-ARGOFFSET(%rsp)
+
+	/*
+	 * move arg6 to ebp in CPU registers and argument save area
+	 * minor optimization: the actual value of ebp is irrelevent,
+	 * so stick it straight into r9d -- see the definition of
+	 * IA32_ARG_FIXUP.
+	 */
+1:	movl	(%r8),%r9d
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous	
+
+	/* Do the fake call */
+	movl [insert address of int 0x80; ret helper + 2 here],RIP-ARGOFFSET(%rsp)
+	subl $8,%r8 /* we pushed twice */
+
 	movq	$__USER32_CS,CS-ARGOFFSET(%rsp)
 	movq	$__USER32_DS,SS-ARGOFFSET(%rsp)
 	movq	%r11,EFLAGS-ARGOFFSET(%rsp)
 	/*CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/
 	movq	%r8,RSP-ARGOFFSET(%rsp)	
 	CFI_REL_OFFSET rsp,RSP-ARGOFFSET
-	/* no need to do an access_ok check here because r8 has been
-	   32bit zero extended */ 
-	/* hardware stack frame is complete now */	
-1:	movl	(%r8),%r9d
-	.section __ex_table,"a"
-	.quad 1b,ia32_badarg
-	.previous	
 	GET_THREAD_INFO(%r10)
 	orl   $TS_COMPAT,TI_status(%r10)
 	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
index 5415b56..a3e48b0 100644
--- a/arch/x86/vdso/vdso32/syscall.S
+++ b/arch/x86/vdso/vdso32/syscall.S
@@ -19,8 +19,8 @@ __kernel_vsyscall:
 .Lpush_ebp:
 	movl	%ecx, %ebp
 	syscall
-	movl	$__USER32_DS, %ecx
-	movl	%ecx, %ss
+	/* The ret in the fake int80 entry lands here */
+	/* ss is already correct AFAICS */
 	movl	%ebp, %ecx
 	popl	%ebp
 .Lpop_ebp:
@@ -28,6 +28,11 @@ __kernel_vsyscall:
 .LEND_vsyscall:
 	.size __kernel_vsyscall,.-.LSTART_vsyscall
 
+__kernel_vsyscall_fake_int80:
+	int 0x80
+	mov 4(%esp),%ebp
+	ret $4
+
 	.section .eh_frame,"a",@progbits
 .LSTARTFRAME:
 	.long .LENDCIE-.LSTARTCIE


This could be further simplified by checking if any work flags are set and bailing immediately to the right place in the int 0x80 entry.

--Andy

[uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 08:42:52

On Sun, Aug 21, 2011 at 07:34:43AM +0100, Al Viro wrote:
> Suppose we have a traced process.  foo6() is called and the thing is
> stopped before sys_foo6() is reached kernel-side.  The sixth argument
> is on stack, ebp is set to user esp.  SYSENTER happens, we read the
> 6th argument from userland stack and put it along with the rest into
> pt_regs.  tracer examines the arguments, modifies them (including the last
> one) and lets the tracee run free - e.g. detaches from the tracee.  
> 
> What should happen if we happen to get a signal that would restart that
> sucker?  Granted, it's not going to happen with mmap() - it doesn't, AFAICS,
> do anything of that kind.  However, I wouldn't bet a dime on other 6-argument
> syscalls not stepping on that.  sendto() and recvfrom(), in particular...
> 
> OK, we return to userland.  The sixth argument is placed into %ebp.  Linus'
> "pig and proud of that" trick works and we end up slapping userland
> %esp into %ebp and hitting SYSENTER again.  Only one problem, though -
> the sixth argument on user stack is completely unaffected by what tracer
> had done.  Unlike the rest of arguments, that *are* changed.
> 
> We could deal with that in case of SYSENTER if we e.g. replaced that
>         jmp .Lenter_kernel
> with
>         jmp .Lrestart
> and added
> .Lrestart:
> 	movl %ebp, (%esp)
> 	jmp .Lenter_kernel
> but in case of SYSCALL it seems to be even messier...  Comments?

Oh, hell...  Compat SYSCALL one is really buggered on syscall restarts,
ptrace or no ptrace.  Look: calling conventions for SYSCALL are
	arg1..5: ebx, ebp, edx, edi, esi.  arg6: stack
and after syscall restart we end up with
	arg1..5: ebx, ecx, edx, edi, esi.  arg6: ebp
so restart will instantly clobber arg2, in effect replacing it with arg6.

And yes, adding ptrace to the mix makes things even uglier.  For one thing,
changes to ECX via ptrace are completely lost on the fast exit.  Not pretty,
and might make life painful for uml, but not for the majority of programs.
What's worse, combination of ptrace with restart will lose changes to arg6
(again, value on stack left as it was, changes to arg6 by tracer lost) *and*
it will lose changes to arg2 (along with arg2 itself - see above).

Linus' Dirty Trick(tm) is not trivial to apply - with SYSCALL we *do* retain
the address of next insn and that's where we end up going.  IOW, SYSCALL not
inside vdso32 currently works (for small values of "works", due to restart
issues).  Playing with return elsewhere might break some userland code...

Guys, that's *way* out of the area I'm comfortable with.
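
The arg2 clobbering on restart can be reproduced in a few lines of C. This is a toy model with hypothetical names, tracking only ecx, ebp and the stack slot for arg6. It assumes what the message states: the compat SYSCALL entry takes arg2 from ebp and arg6 from the user stack, saves them int 0x80 style (arg2 in the ecx slot, arg6 in the ebp slot), and a restart re-executes SYSCALL with user registers loaded from that saved set.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model with hypothetical names; not kernel code. */
struct toy_regs { uint32_t ecx, ebp; };     /* user or saved registers */
struct toy_args { uint32_t arg2, arg6; };   /* what the syscall receives */

/* Compat SYSCALL entry: decode the arguments, save them int 0x80 style. */
static struct toy_args toy_cstar_entry(const struct toy_regs *user,
                                       uint32_t stack_slot,
                                       struct toy_regs *saved)
{
    struct toy_args a = { .arg2 = user->ebp, .arg6 = stack_slot };
    saved->ecx = a.arg2;
    saved->ebp = a.arg6;
    return a;
}

/* First attempt, then a restart; returns what the restarted call sees. */
struct toy_args toy_cstar_restart(uint32_t arg2, uint32_t arg6)
{
    struct toy_regs user = { .ecx = 0, .ebp = arg2 };
    struct toy_regs saved;

    (void)toy_cstar_entry(&user, arg6, &saved);   /* first attempt */
    user = saved;                                 /* registers reloaded */
    return toy_cstar_entry(&user, arg6, &saved);  /* SYSCALL executed again */
}

void toy_cstar_check(void)
{
    struct toy_args r = toy_cstar_restart(2, 6);
    assert(r.arg2 == 6);    /* arg2 has been replaced by arg6 */
    assert(r.arg6 == 6);
}
```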

Re: [uml-devel] Subject: [PATCH 00/91] pending uml patches

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-19 04:31:28

On Thu, Aug 18, 2011 at 08:19:46PM +0100, Al Viro wrote:
> On Thu, Aug 18, 2011 at 09:12:47PM +0200, Richard Weinberger wrote:
> > Have you touched your patches since yesterday?
> > I've already pulled and uploaded them to my shiny new git repo at:
> > git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable
> 
> Reordered and added missing S-o-b on a couple, split one commit.

Umm...  One comment after looking at your tree: you probably want to rebase
for-3.2 on top of fixes (and presumably feed it to sfr for inclusion into
linux-next).

And for pity's sake, do *not* merge from Linus every day; that's one sure
way to get yourself flamed to a crisp.  Just trying to figure out
what's in your tree is a _hard_ exercise.  git cherry between Linus'
tree and e.g. #fixes in yours gives a long list of commits, most of them
_probably_ duplicates of the stuff in mainline.  What are bnx2 patches
doing in there, for example?

I've tried to figure out what's going on in there; AFAICS, your #fixes
is mainline plus
Al Viro (6):
      um: fix oopsable race in line_close()
      um: winch_interrupt() can happen inside of free_winch()
      um: fix free_winch() mess
      um: PTRACE_[GS]ETFPXREGS had been wired on the wrong subarch
      um: fix strrchr problems
      um: clean arch_ptrace() up a bit

Ingo van Lil (1):
      um: Save FPU registers between task switches

Jonathan Neuschäfer (3):
      UserModeLinux-HOWTO.txt: fix a typo
      um: drivers/xterm.c: fix a file descriptor leak
      UserModeLinux-HOWTO.txt: remove ^H characters

Thadeu Lima de Souza Cascardo (1):
      um: disable CMPXCHG_DOUBLE as it breaks UML build

	I've cherry-picked those on top of the same branchpoint; see
#cleaned-fixes in um-headers.git.  AFAICS, that's the same contents as
your #fixes, with clean history.  Diff against your #fixes consists of
-       .irq_set_type = pmic_irq_type, <<<<<<< HEAD
-       .irq_bus_lock           = pmic_irq_buslock,
+       .irq_set_type           = pmic_irq_type,
+       .irq_bus_lock           = pmic_bus_lock,
in drivers/platform/x86/intel_pmic_gpio.c, which is an obvious mismerge
(AFAICS, on May 29).

IME the sane policy is to keep for-linus, pulling into it when Linus
pulls from you.  At that point it's a fast-forward and all previous
history is not cluttering the things up anymore.  for-next I rebase and
reorder at will, TBH, but generally I start it at the current tip of
for-linus.

Beyond what you've got in #for-3.2 I have a couple of commits, but that
can wait until the history is sorted out.  As it is, I 100% guarantee
that pull request on your #fixes as it is will result in pyrotechnical
effects from hell (OK, from Linus, actually, but in this case there won't
be any real difference).

Re: [uml-devel] Subject: [PATCH 00/91] pending uml patches

From: Richard W. <ri...@no...> - 2011-08-19 08:52:03

On 19.08.2011 06:31, Al Viro wrote:
> On Thu, Aug 18, 2011 at 08:19:46PM +0100, Al Viro wrote:
>> On Thu, Aug 18, 2011 at 09:12:47PM +0200, Richard Weinberger wrote:
>>> Have you touched your patches since yesterday?
>>> I've already pulled and uploaded them to my shiny new git repo at:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable
>>
>> Reordered and added missing S-o-b on a couple, split one commit.
>
> Umm...  One comment after looking at your tree: you probably want to rebase
> for-3.2 on top of fixes (and presumably feed it to sfr for inclusion into
> linux-next).

Please slow down a bit. :-)
All these branches are just for testing purposes.
That's why I have not announced them nor sent a pull request to Linus.

Anyway, thanks for the hints!

Thanks,
//richard

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Andrew L. <lu...@mi...> - 2011-08-21 11:25:04

On Sun, Aug 21, 2011 at 4:42 AM, Al Viro <vi...@ze...> wrote:
> On Sun, Aug 21, 2011 at 07:34:43AM +0100, Al Viro wrote:
>> Suppose we have a traced process.  foo6() is called and the thing is
>> stopped before sys_foo6() is reached kernel-side.  The sixth argument
>> is on stack, ebp is set to user esp.  SYSENTER happens, we read the
>> 6th argument from userland stack and put it along with the rest into
>> pt_regs.  tracer examines the arguments, modifies them (including the last
>> one) and lets the tracee run free - e.g. detaches from the tracee.
>>
>> What should happen if we happen to get a signal that would restart that
>> sucker?  Granted, it's not going to happen with mmap() - it doesn't, AFAICS,
>> do anything of that kind.  However, I wouldn't bet a dime on other 6-argument
>> syscalls not stepping on that.  sendto() and recvfrom(), in particular...
>>
>> OK, we return to userland.  The sixth argument is placed into %ebp.  Linus'
>> "pig and proud of that" trick works and we end up slapping userland
>> %esp into %ebp and hitting SYSENTER again.  Only one problem, though -
>> the sixth argument on user stack is completely unaffected by what tracer
>> had done.  Unlike the rest of arguments, that *are* changed.
>>
>> We could deal with that in case of SYSENTER if we e.g. replaced that
>>         jmp .Lenter_kernel
>> with
>>         jmp .Lrestart
>> and added
>> .Lrestart:
>>       movl %ebp, (%esp)
>>       jmp .Lenter_kernel
>> but in case of SYSCALL it seems to be even messier...  Comments?
>
> Oh, hell...  Compat SYSCALL one is really buggered on syscall restarts,
> ptrace or no ptrace.  Look: calling conventions for SYSCALL are
>        arg1..5: ebx, ebp, edx, edi, esi.  arg6: stack
> and after syscall restart we end up with
>        arg1..5: ebx, ecx, edx, edi, esi.  arg6: ebp
> so restart will instantly clobber arg2, in effect replacing it with arg6.
>
> And yes, adding ptrace to the mix makes things even uglier.  For one thing,
> changes to ECX via ptrace are completely lost on the fast exit.  Not pretty,
> and might make life painful for uml, but not for the majority of programs.
> What's worse, combination of ptrace with restart will lose changes to arg6
> (again, value on stack left as it was, changes to arg6 by tracer lost) *and*
> it will lose changes to arg2 (along with arg2 itself - see above).
>
> Linus' Dirty Trick(tm) is not trivial to apply - with SYSCALL we *do* retain
> the address of next insn and that's where we end up going.  IOW, SYSCALL not
> inside vdso32 currently works (for small values of "works", due to restart
> issues).  Playing with return elsewhere might break some userland code...
>
> Guys, that's *way* out of the area I'm comfortable with.
>

I don't see the point of all this hackery at all.  sysenter/sysexit
indeed screws up some registers, but we can return on the iret path in
the case of restart.

So why do we lie to ptrace (and iret!) at all?  Why not just fill in
pt_regs with the registers as they were (at least the
non-clobbered-by-sysenter ones), set the actual C parameters correctly
to contain the six arguments (in rdi, rsi, etc.), do the syscall, and
return back to userspace without any funny business?  Is there some
ABI reason that, once we've started lying to tracers, we have to keep
doing so?

--Andy

[uml-devel] [RFC] weird crap with vdso on uml/i386

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-20 01:18:55

On Fri, Aug 19, 2011 at 10:51:51AM +0200, Richard Weinberger wrote:

> Please slow down a bit. :-)
> All these branches are just for testing purposes.
> That's why I have not announced them nor sent a pull request to Linus.
> 
> Anyway, thanks for the hints!

np...  FWIW, there's a really ugly bug present in mainline as well as
in mainline + these patches and I'd welcome any help in figuring out
what's going on.

1) USER_OBJS do not see CONFIG_..., so os-Linux/main.c doesn't see
CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA.  As a result, uml/i386 doesn't
notice that host vdso is there.  That one is easy to fix:
-obj-$(CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA) += elf_aux.o
+ifeq ($(CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA),y)
+obj-y += elf_aux.o
+CFLAGS_main.o += -DCONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA
+endif
in arch/um/os-Linux/Makefile takes care of that.  Unfortunately, it also
exposes a bug in fixrange_init():

2) fixrange_init() gets called with start (and end) not a multiple of
PMD_SIZE; moreover, end is very close to ~0UL - closer than PMD_SIZE.
Bad things start happening to the loops in there.  Again, easy to fix:

diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 8137ccc..39ee674 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -119,19 +119,22 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
 	int i, j;
 	unsigned long vaddr;
 
-	vaddr = start;
+	vaddr = start & PMD_MASK;
 	i = pgd_index(vaddr);
 	j = pmd_index(vaddr);
 	pgd = pgd_base + i;
+	start >>= PMD_SHIFT;
+	end = (end - 1) >> PMD_SHIFT;
 
-	for ( ; (i < PTRS_PER_PGD) && (vaddr < end); pgd++, i++) {
+	for ( ; (i < PTRS_PER_PGD) && start <= end; pgd++, i++) {
 		pud = pud_offset(pgd, vaddr);
 		if (pud_none(*pud))
 			one_md_table_init(pud);
 		pmd = pmd_offset(pud, vaddr);
-		for (; (j < PTRS_PER_PMD) && (vaddr < end); pmd++, j++) {
+		for (; (j < PTRS_PER_PMD) && start <= end; pmd++, j++) {
 			one_page_table_init(pmd);
 			vaddr += PMD_SIZE;
+			start++;
 		}
 		j = 0;
 	}

That populates the page tables in the right places and fixrange_user_init()
manages to call it, avoid death-by-oom from runaway allocations and then
install references to all pages it wants.  Alas, at that point the things
become really interesting.
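
A minimal sketch of why the old loop condition misbehaves, not taken from the patch itself.  It assumes 32-bit addresses and a PMD_SHIFT of 22, and it flattens the nested pgd/pmd loops into one counter; the function names and the iteration cap are made up for illustration.  With end closer to the top of the address space than PMD_SIZE, vaddr += PMD_SIZE wraps around and vaddr < end never becomes false, while comparing PMD indices terminates after one step.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 32-bit addresses, each PMD entry covers 4 MiB. */
#define PMD_SHIFT 22
#define PMD_SIZE  ((uint32_t)1 << PMD_SHIFT)

/* Old condition: compare the running address against end. */
unsigned old_iterations(uint32_t start, uint32_t end, unsigned cap)
{
	uint32_t vaddr = start;
	unsigned n = 0;

	/* cap only exists so that this model terminates */
	while (vaddr < end && n < cap) {
		vaddr += PMD_SIZE;	/* wraps past 0xffffffff near the top */
		n++;
	}
	return n;
}

/* Patched condition: compare PMD indices instead of addresses. */
unsigned new_iterations(uint32_t start, uint32_t end)
{
	uint32_t first, last;
	unsigned n = 0;

	assert(end != 0);	/* end - 1 must not wrap in this model */
	first = start >> PMD_SHIFT;
	last = (end - 1) >> PMD_SHIFT;

	while (first <= last) {
		first++;
		n++;
	}
	return n;
}
```

For a range well below the top of the address space both variants visit the same number of PMD slots; they only diverge when the addition can overflow.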

3) with the previous two issues dealt with, we get the following magical
mystery shite when running 32bit uml kernel + userland on 64bit host:
	* the system boots all the way to getty/login and sshd (i.e. gets
through the debian /etc/init.d (squeeze/i386))
	* one can log into it, both on terminals and over ssh.  shell and
a bunch of other stuff works.  Mostly.
	* /bin/bash -c "echo *" reliably segfaults.  Always.  So does tab
completion in bash, for that matter.
	* said segfault is reproducible both from shell and under gdb.
For /bin/bash -c "echo *" under gdb it's always the 10th call of brk(3).
What happens there apparently boils down to __kernel_vsyscall() getting
called (and yes, sys_brk() is called, succeeds and results in expected
value in %eax) and corrupting the living hell out of %ecx.  Namely, on
return from what presumably is __kernel_vsyscall() I'm seeing %ecx equal
to (original value of) %ebp.  All registers except %eax and %ecx (including
%esp and %ebp) remain unchanged.
	Again, that happens only on the same call of brk(3) - all previous
calls succeed as expected.  I don't believe that it's a race.  I also
very much doubt that we are calling the wrong location - it's hard to tell
with the call being call *%gs:0x10 (is there any way to find what that
is equal to in gdb, BTW?  Short of hot-patching movl *%gs:0x10,%eax in place
of that call and single-stepping it, that is...) but it *does* end up
making the system call that ought to have been made, so I suspect that it
does hit __kernel_vsyscall(), after all...

The text of __kernel_vsyscall() is
	0xffffe420 <__kernel_vsyscall+0>:       push   %ebp
	0xffffe421 <__kernel_vsyscall+1>:       mov    %ecx,%ebp
	0xffffe423 <__kernel_vsyscall+3>:       syscall 
	0xffffe425 <__kernel_vsyscall+5>:       mov    $0x2b,%ecx
	0xffffe42a <__kernel_vsyscall+10>:      mov    %ecx,%ss
	0xffffe42c <__kernel_vsyscall+12>:      mov    %ebp,%ecx
	0xffffe42e <__kernel_vsyscall+14>:      pop    %ebp
	0xffffe42f <__kernel_vsyscall+15>:      ret    
so %ecx on the way out becoming equal to original %ebp is bloody curious -
it would smell like entering that sucker 3 bytes too late and skipping
mov %ecx, %ebp, but... we would also skip push %ebp, so we'd get trashed
on the way out - wrong return address, wrong value in %ebp, changed %esp.
None of that happens.  And we are executing that code in userland - i.e.
to get corrupt it would have to get corrupt in *HOST* 32bit VDSO.  Which
would have much more visible effects, starting with the next attempt to
run the testcase blowing up immediately instead of waiting (as it actually
does) for the same 10th call of brk()...
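
The "entered 3 bytes too late" reasoning above can be checked with a toy model of the stub.  This is only a sketch: the struct, the helper names and the tiny array-based stack are invented for illustration, and the syscall itself is modelled as leaving %ebp and the stack untouched.  Entering at +3 does leave %ecx equal to the original %ebp, but it also pops the return address into %ebp and returns to the wrong place, which is not what is observed.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register file and stack; all names here are hypothetical. */
struct cpu {
	uint32_t ecx, ebp;
	uint32_t stack[8];	/* grows downward */
	int sp;			/* index of the current top-of-stack slot */
};

void stk_push(struct cpu *c, uint32_t v)
{
	assert(c->sp > 0);
	c->stack[--c->sp] = v;
}

uint32_t stk_pop(struct cpu *c)
{
	assert(c->sp < 8);
	return c->stack[c->sp++];
}

/*
 * Walk through the stub.  A nonzero skip_prologue models entering at +3,
 * i.e. directly at the syscall insn.  Returns the address ret jumps to.
 */
uint32_t run_stub(struct cpu *c, int skip_prologue)
{
	if (!skip_prologue) {
		stk_push(c, c->ebp);	/* push %ebp */
		c->ebp = c->ecx;	/* mov %ecx,%ebp */
	}
	/* syscall: assumed not to touch %ebp or the stack in this model */
	c->ecx = 0x2b;			/* mov $0x2b,%ecx ; mov %ecx,%ss */
	c->ecx = c->ebp;		/* mov %ebp,%ecx */
	c->ebp = stk_pop(c);		/* pop %ebp */
	return stk_pop(c);		/* ret */
}
```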

I'm at a loss, to be honest.  The sucker is nicely reproducible, but bisecting
doesn't help at all - it seems to be present all the way back at least to
2.6.33.  I hadn't tried to go back further and I hadn't tried to go for
older host kernels, but I wouldn't put too much faith into that...  The
reason it hadn't been noticed much earlier is that it works fine on i386
host - aforementioned shit happens only when the entire thing (identical
binary, identical fs image, identical options) is run on amd64.  However,
on i386 I have a different __kernel_vsyscall, which might easily be the
reason it doesn't happen there.  It's a K7 box with sysenter-based
variant ending up as __kernel_vsyscall().  Hell knows what's going on...
Behaviour is really weird and I'd appreciate any pointers re debugging
that crap.  Suggestions?

Re: [uml-devel] [RFC] weird crap with vdso on uml/i386

From: Richard W. <ri...@no...> - 2011-08-20 15:22:40

Am 20.08.2011 03:18, schrieb Al Viro:
> 3) with the previous two issues dealt with, we get the following magical
> mystery shite when running 32bit uml kernel + userland on 64bit host:
> 	* the system boots all the way to getty/login and sshd (i.e. gets
> through the debian /etc/init.d (squeeze/i386))
> 	* one can log into it, both on terminals and over ssh.  shell and
> a bunch of other stuff works.  Mostly.
> 	* /bin/bash -c "echo *" reliably segfaults.  Always.  So does tab
> completion in bash, for that matter.
> 	* said segfault is reproducible both from shell and under gdb.
> For /bin/bash -c "echo *" under gdb it's always the 10th call of brk(3).
> What happens there apparently boils down to __kernel_vsyscall() getting
> called (and yes, sys_brk() is called, succeeds and results in expected
> value in %eax) and corrupting the living hell out of %ecx.  Namely, on
> return from what presumably is __kernel_vsyscall() I'm seeing %ecx equal
> to (original value of) %ebp.  All registers except %eax and %ecx (including
> %esp and %ebp) remain unchanged.
> 	Again, that happens only on the same call of brk(3) - all previous
> calls succeed as expected.  I don't believe that it's a race.  I also
> very much doubt that we are calling the wrong location - it's hard to tell
> with the call being call *%gs:0x10 (is there any way to find what that
> is equal to in gdb, BTW?  Short of hot-patching movl *%gs:0x10,%eax in place
> of that call and single-stepping it, that is...) but it *does* end up
> making the system call that ought to have been made, so I suspect that it
> does hit __kernel_vsyscall(), after all...
>
> The text of __kernel_vsyscall() is
> 	0xffffe420<__kernel_vsyscall+0>:       push   %ebp
> 	0xffffe421<__kernel_vsyscall+1>:       mov    %ecx,%ebp
> 	0xffffe423<__kernel_vsyscall+3>:       syscall
> 	0xffffe425<__kernel_vsyscall+5>:       mov    $0x2b,%ecx
> 	0xffffe42a<__kernel_vsyscall+10>:      mov    %ecx,%ss
> 	0xffffe42c<__kernel_vsyscall+12>:      mov    %ebp,%ecx
> 	0xffffe42e<__kernel_vsyscall+14>:      pop    %ebp
> 	0xffffe42f<__kernel_vsyscall+15>:      ret
> so %ecx on the way out becoming equal to original %ebp is bloody curious -
> it would smell like entering that sucker 3 bytes too late and skipping
> mov %ecx, %ebp, but... we would also skip push %ebp, so we'd get trashed
> on the way out - wrong return address, wrong value in %ebp, changed %esp.
> None of that happens.  And we are executing that code in userland - i.e.
> to get corrupt it would have to get corrupt in *HOST* 32bit VDSO.  Which
> would have much more visible effects, starting with the next attempt to
> run the testcase blowing up immediately instead of waiting (as it actually
> does) for the same 10th call of brk()...
>
> I'm at a loss, to be honest.  The sucker is nicely reproducible, but bisecting
> doesn't help at all - it seems to be present all the way back at least to
> 2.6.33.  I hadn't tried to go back further and I hadn't tried to go for
> older host kernels, but I wouldn't put too much faith into that...  The
> reason it hadn't been noticed much earlier is that it works fine on i386
> host - aforementioned shit happens only when the entire thing (identical
> binary, identical fs image, identical options) is run on amd64.  However,
> on i386 I have a different __kernel_vsyscall, which might easily be the
> reason it doesn't happen there.  It's a K7 box with sysenter-based
> variant ending up as __kernel_vsyscall().  Hell knows what's going on...
> Behaviour is really weird and I'd appreciate any pointers re debugging
> that crap.  Suggestions?

Hmmm, very strange.
Sadly I cannot reproduce the issue. :(
Everything works fine within UML.
(Of course I've applied your vDSO/i386 patches)

My test setup:
Host kernel: 2.6.37 and 3.0.1
Distro: openSUSE 11.4/x86_64

UML kernel: 3.1-rc2
Distro: openSUSE 11.1/i386

Does the problem also occur with another host kernel or a different 
guest image?

Thanks,
//richard

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Andrew L. <lu...@mi...> - 2011-08-21 13:37:45

On Sun, Aug 21, 2011 at 7:24 AM, Andrew Lutomirski <lu...@mi...> wrote:
> On Sun, Aug 21, 2011 at 4:42 AM, Al Viro <vi...@ze...> wrote:
>> On Sun, Aug 21, 2011 at 07:34:43AM +0100, Al Viro wrote:
>>> Suppose we have a traced process.  foo6() is called and the thing is
>>> stopped before the sys_foo6() is reached kernel-side.  The sixth argument
>>> is on stack, ebp is set to user esp.  SYSENTER happens, we read the
>>> 6th argument from userland stack and put it along with the rest into
>>> pt_regs.  tracer examines the arguments, modifies them (including the last
>>> one) and lets the tracee run free - e.g. detaches from the tracee.
>>>
>>> What should happen if we happen to get a signal that would restart that
>>> sucker?  Granted, it's not going to happen with mmap() - it doesn't, AFAICS,
>>> do anything of that kind.  However, I wouldn't bet a dime on other 6-argument
>>> syscalls not stepping on that.  sendto() and recvfrom(), in particular...
>>>
>>> OK, we return to userland.  The sixth argument is placed into %ebp.  Linus'
>>> "pig and proud of that" trick works and we end up slapping userland
>>> %esp into %ebp and hitting SYSENTER again.  Only one problem, though -
>>> the sixth argument on user stack is completely unaffected by what tracer
>>> had done.  Unlike the rest of arguments, that *are* changed.
>>>
>>> We could deal with that in case of SYSENTER if we e.g. replaced that
>>>         jmp .Lenter_kernel
>>> with
>>>         jmp .Lrestart
>>> and added
>>> .Lrestart:
>>>       movl %ebp, (%esp)
>>>       jmp .Lenter_kernel
>>> but in case of SYSCALL it seems to be even messier...  Comments?
>>
>> Oh, hell...  Compat SYSCALL one is really buggered on syscall restarts,
>> ptrace or no ptrace.  Look: calling conventions for SYSCALL are
>>        arg1..5: ebx, ebp, edx, esi, edi.  arg6: stack
>> and after syscall restart we end up with
>>        arg1..5: ebx, ecx, edx, esi, edi.  arg6: ebp
>> so restart will instantly clobber arg2, in effect replacing it with arg6.
>>
>> And yes, adding ptrace to the mix makes things even uglier.  For one thing,
>> changes to ECX via ptrace are completely lost on the fast exit.  Not pretty,
>> and might make life painful for uml, but not for the majority of programs.
>> What's worse, combination of ptrace with restart will lose changes to arg6
>> (again, value on stack left as it was, changes to arg6 by tracer lost) *and*
>> it will lose changes to arg2 (along with arg2 itself - see above).
>>
>> Linus' Dirty Trick(tm) is not trivial to apply - with SYSCALL we *do* retain
>> the address of next insn and that's where we end up going.  IOW, SYSCALL not
>> inside vdso32 currently works (for small values of "works", due to restart
>> issues).  Playing with return elsewhere might break some userland code...
>>
>> Guys, that's *way* out of the area I'm comfortable with.
>>
>
> I don't see the point of all this hackery at all.  sysenter/sysexit
> indeed screws up some registers, but we can return on the iret path in
> the case of restart.
>
> So why do we lie to ptrace (and iret!) at all?  Why not just fill in
> pt_regs with the registers as they were (at least the
> non-clobbered-by-sysenter ones), set the actual C parameters correctly
> to contain the six arguments (in rdi, rsi, etc.), do the syscall, and
> return back to userspace without any funny business?  Is there some
> ABI reason that, once we've started lying to tracers, we have to keep
> doing so?

Gack.  Is this a holdover from the 32-bit code that shares the
argument save area with the parameters passed on the C stack?  If so,
we could just set up the argument save area honestly and pass the real
parameters in registers like 64-bit C code expects.

If the tracing and restart cases use iret to return to userspace, this
should all just work.  ptrace users shouldn't notice the overhead, and
syscall restart is presumably slow enough anyway that it wouldn't
matter.  The userspace entry code would be as simple as:

sysenter
ret

or

sysexit
ret

--Andy

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 14:51:51

On Sun, Aug 21, 2011 at 09:37:18AM -0400, Andrew Lutomirski wrote:

> Gack.  Is this a holdover from the 32-bit code that shares the
> argument save area with the parameters passed on the C stack?  If so,
> we could just set up the argument save area honestly and pass the real
> parameters in registers like 64-bit C code expects.
> 
> If the tracing and restart cases use iret to return to userspace, this
> should all just work.  ptrace users shouldn't notice the overhead, and
> syscall restart is presumably slow enough anyway that it wouldn't
> matter.  The userspace entry code would be as simple as:
> 
> sysenter
> ret
> 
> or
> 
> sysexit
> ret

You are making no sense at all...

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 14:44:14

On Sun, Aug 21, 2011 at 07:24:35AM -0400, Andrew Lutomirski wrote:

> I don't see the point of all this hackery at all.  sysenter/sysexit
> indeed screws up some registers, but we can return on the iret path in
> the case of restart.

We *do* return on iret path in case of restart, TYVM.

> So why do we lie to ptrace (and iret!) at all?  Why not just fill in
> pt_regs with the registers as they were (at least the
> non-clobbered-by-sysenter ones), set the actual C parameters correctly
> to contain the six arguments (in rdi, rsi, etc.), do the syscall, and
> return back to userspace without any funny business?  Is there some
> ABI reason that, once we've started lying to tracers, we have to keep
> doing so?

We do not lie to ptrace and iret.  At all.  We do just what you have
described.  And fuck up when restart returns us to the SYSCALL / SYSENTER
instruction again, which expects the different calling conventions,
so the values arranged in registers in the way int 0x80 would expect
do us no good.
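
A toy model of that mismatch, as a sketch only.  The struct layout and function names are hypothetical, and it assumes the conventions described earlier in the thread: compat SYSCALL takes arg2 from %ebp and arg6 from the user stack, while the saved frame is reloaded into registers the way int 0x80 expects (arg2 in %ecx, arg6 in %ebp).  Re-entering through SYSCALL after such a reload picks up arg6 as arg2.

```c
#include <assert.h>
#include <stdint.h>

/* User-visible state relevant to argument passing (toy model). */
struct user_regs {
	uint32_t ebx, ecx, edx, esi, edi, ebp;
	uint32_t stack_top;	/* word at (%esp), lives in user memory */
};

/* Saved frame holding the six arguments in int 0x80 order. */
struct frame {
	uint32_t arg[6];
};

/* SYSCALL entry: arg2 comes from %ebp, arg6 from the user stack. */
void syscall_entry(const struct user_regs *u, struct frame *f)
{
	assert(u && f);
	f->arg[0] = u->ebx;
	f->arg[1] = u->ebp;
	f->arg[2] = u->edx;
	f->arg[3] = u->esi;
	f->arg[4] = u->edi;
	f->arg[5] = u->stack_top;
}

/* Return to userland for a restart: registers reloaded int 0x80 style. */
void restart_return(const struct frame *f, struct user_regs *u)
{
	assert(u && f);
	u->ebx = f->arg[0];
	u->ecx = f->arg[1];
	u->edx = f->arg[2];
	u->esi = f->arg[3];
	u->edi = f->arg[4];
	u->ebp = f->arg[5];
	/* u->stack_top is not rewritten: tracer changes to arg6 never land there */
}
```

Calling syscall_entry(), restart_return() and syscall_entry() again on arguments 1..6 yields arg2 == 6 in this model; if a tracer edits arg2 and arg6 in the frame in between, the edited arg6 shows up as arg2 and the stack copy of arg6 stays stale.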

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-22 04:06:17

Borislav,

We're tracking down an issue with the way system call arguments are
handled on 32 bits.  We have a solution for SYSENTER but not SYSCALL;
fixing SYSCALL "properly" appears to be very difficult at best.

So the question is: how much overhead would it be to simply fall back to
int $0x80 or some other legacy-style domain crossing instruction for
32-bit system calls on AMD64 processors?  We don't ever use SYSCALL in
legacy mode, so native i386 kernels are unaffected.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Ingo M. <mi...@ke...> - 2011-08-22 10:33:24

* H. Peter Anvin <hp...@zy...> wrote:

> Borislav,
> 
> We're tracking down an issue with the way system call arguments are 
> handled on 32 bits.  We have a solution for SYSENTER but not 
> SYSCALL; fixing SYSCALL "properly" appears to be very difficult at 
> best.
> 
> So the question is: how much overhead would it be to simply fall 
> back to int $0x80 or some other legacy-style domain crossing 
> instruction for 32-bit system calls on AMD64 processors?  We don't 
> ever use SYSCALL in legacy mode, so native i386 kernels are 
> unaffected.

Last time I measured INT80 and SYSCALL costs, they were pretty close to 
each other on AMD CPUs - closer than on Intel.

Also, most installations are either pure 32-bit or dominantly 64-bit, 
the significantly mixed-mode case is dwindling.

Unifying some more in this area would definitely simplify things ...

Thanks,

	Ingo

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: Linus T. <tor...@li...> - 2011-08-22 23:28:42

On Mon, Aug 22, 2011 at 3:04 PM, H. Peter Anvin <hp...@zy...> wrote:
>
> However, we could just issue a SIGILL or SIGSEGV at this point; the same
> way we would if we got an #UD or #GP fault; SIGILL/#UD would be
> consistent with Intel CPUs here.

Considering that this is not a remotely new issue, and that it has
been around for years without anybody even noticing, I'd really prefer
to just fix things going forwards rather than add any code to actively
break any possible unlucky legacy users.

So I think the "let's fix the vdso case for sysenter" + "let's remove
the 32-bit syscall vdso" is the right solution. If somebody has
hardcoded syscall instructions, or generates them dynamically with
some JIT, that's their problem. We'll continue to support it as well
as we ever have (read: "almost nobody will ever notice").

One thing we *could* do is to just say "we never restart a x86-32
'syscall' instruction at all", and just make such a case return EINTR.
IOW, do something along the lines of the appended pseudo-patch.

Because returning -EINTR is always "almost correct".

Hmm?

                               Linus

---
  diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
  index 54ddaeb221c1..bc1a0f8b2707 100644
  --- a/arch/x86/kernel/signal.c
  +++ b/arch/x86/kernel/signal.c
  @@ -678,6 +678,16 @@ setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info,
   	return ret;
   }

  +static void restart_syscall(struct pt_regs *regs, int orig)
  +{
  +	if (regs->syscall_using_syscall_insn) {
  +		regs->ax = -EINTR;
  +		return;
  +	}
  +	regs->ax = orig;
  +	regs->ip -= 2;
  +}
  +
   static int
   handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
   		struct pt_regs *regs)
  @@ -701,8 +711,7 @@ handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
   			}
   		/* fallthrough */
   		case -ERESTARTNOINTR:
  -			regs->ax = regs->orig_ax;
  -			regs->ip -= 2;
  +			restart_syscall(regs, regs->orig_ax);
   			break;
   		}
   	}
  @@ -786,13 +795,11 @@ static void do_signal(struct pt_regs *regs)
   		case -ERESTARTNOHAND:
   		case -ERESTARTSYS:
   		case -ERESTARTNOINTR:
  -			regs->ax = regs->orig_ax;
  -			regs->ip -= 2;
  +			restart_syscall(regs, regs->orig_ax);
   			break;

   		case -ERESTART_RESTARTBLOCK:
  -			regs->ax = NR_restart_syscall;
  -			regs->ip -= 2;
  +			restart_syscall(regs, NR_restart_syscall);
   			break;
   		}
   	}
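
For context on why -EINTR is "almost correct": careful userland already has to cope with interrupted calls, typically by retrying.  The following is a sketch of that idiom rather than anything from the patch.  retry_on_eintr(), the call_fn type and flaky_call() are hypothetical names; flaky_call() merely stands in for a real wrapper such as read().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback type standing in for a syscall wrapper. */
typedef long (*call_fn)(void *ctx);

/* Repeat the call while it fails with EINTR; give up after max_tries calls. */
long retry_on_eintr(call_fn fn, void *ctx, int max_tries)
{
	long ret;

	assert(fn && max_tries > 0);
	do {
		ret = fn(ctx);
	} while (ret < 0 && errno == EINTR && --max_tries > 0);
	return ret;
}

/* Stand-in call: fails with EINTR *ctx times, then succeeds with 42. */
long flaky_call(void *ctx)
{
	int *left = ctx;

	if (*left > 0) {
		(*left)--;
		errno = EINTR;
		return -1;
	}
	return 42;
}
```

Code that does not retry sees a spurious failure instead of a transparent restart, which is the "almost" part.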

Re: [uml-devel] SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weird crap with vdso on uml/i386)

From: H. P. A. <hp...@zy...> - 2011-08-22 23:47:26

On 08/22/2011 04:27 PM, Linus Torvalds wrote:
> On Mon, Aug 22, 2011 at 3:04 PM, H. Peter Anvin <hp...@zy...> wrote:
>>
>> However, we could just issue a SIGILL or SIGSEGV at this point; the same
>> way we would if we got an #UD or #GP fault; SIGILL/#UD would be
>> consistent with Intel CPUs here.
> 
> Considering that this is not a remotely new issue, and that it has
> been around for years without anybody even noticing, I'd really prefer
> to just fix things going forwards rather than add any code to actively
> break any possible unlucky legacy users.
> 
> So I think the "let's fix the vdso case for sysenter" + "let's remove
> the 32-bit syscall vdso" is the right solution. If somebody has
> hardcoded syscall instructions, or generates them dynamically with
> some JIT, that's their problem. We'll continue to support it as well
> as we ever have (read: "almost nobody will ever notice").
> 
> One thing we *could* do is to just say "we never restart a x86-32
> 'syscall' instruction at all", and just make such a case return EINTR.
> IOW, do something along the lines of the appended pseudo-patch.
> 
> Because returning -EINTR is always "almost correct".
> 

I have to say it worries me from a potential security hole point of
view, especially since it clearly isn't very well trod ground to begin
with.  An almost-never-used path with access to the full system call
suite is scarier than hell in that sense.

Keep in mind support for SYSCALL32 is already (vendor-)conditional.

(The obvious solution of just putting the proper register frame back in
its place would be okay except for totally breaking anything
trace-on-exit as already hashed to death...)

	-hpa
