From: Andrew L. <lu...@mi...> - 2011-08-23 16:12:09
|
On Tue, Aug 23, 2011 at 12:03 PM, Linus Torvalds <tor...@li...> wrote: > On Mon, Aug 22, 2011 at 11:15 PM, Al Viro <vi...@ze...> wrote: >> >> * it does SETREGS, setting eax to return value, eip to original return >> address of syscall insn... and ebp to what it had in regs.bp. I.e. the >> damn arg6 value. > > Ok, I think that exhaustively explains that > > (a) our system call restart has always worked correctly, and we're good. > > (b) it's simply just UML that is buggy, and doesn't understand the > subtleties about doing GETREGS at a system call. > > and I think that the correct and simple solution is to just teach UML > to understand the proper logic of pt_regs during a system call (and a > 'syscall' instruction in particular). > > The thing is, UML emulates 'syscall' the way the *CPU* does it, not > the way *we* do it. That may make sense, but it's simply not correct. > > So I would vote very strongly against actually changing anything in > arch/x86. This is very much an UML issue. > > Suggested fixes: > > - instead of blindly doing SETREGS, just write the result registers > individually like you suggested > > OR (and perhaps preferably): > > - teach UML that when you do 'GETREGS' after a system call trapped, > we have switched things around to match the "official system call > order", and UML should just undo our swizzling, and do a "regs.ebp = > regs.ecx" to make it be what the actual original registers were (or > whatever the actual correct swizzle is - I didn't think that through > very much). > > IOW, I think the core kernel does the right thing. Our argument > register swizzling is odd, yes, but it's an implementation detail that > something like uml should just have to take into account. No? > > Hmm? Egads. Does this mean that doing GETREGS and then doing SETREGS later on with the *exact same values* is considered incorrect? IMO, this way lies madness. In any case, this seems insanely overcomplicated. I'd be less afraid of something like my approach (which, I think, makes all of the SYSCALL weirdness pretty much transparent to ptrace users) or of just removing SYSCALL entirely from 32-bit code. --Andy |
From: Borislav P. <bp...@am...> - 2011-08-23 16:23:11
|
On Tue, Aug 23, 2011 at 12:11:43PM -0400, Andrew Lutomirski wrote: > In any case, this seems insanely overcomplicated. I'd be less afraid > of something like my approach (which, I think, makes all of the > SYSCALL weirdness pretty much transparent to ptrace users) or of just > removing SYSCALL entirely from 32-bit code. I don't think that removing SYSCALL from 32-bit code just so that UML trapped syscalls work is something we'd like since SYSCALL is much cheaper than INT $0x80: "As a result, SYSCALL and SYSRET can take fewer than one-fourth the number of internal clock cycles to complete than the legacy CALL and RET instructions." http://support.amd.com/us/Processor_TechDocs/24593.pdf, p. 152. I know, it is 32-bit syscall on 64-bit kernel which should be pretty rare but still... Thanks. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach GM: Alberto Bozzo Reg: Dornach, Landkreis Muenchen HRB Nr. 43632 WEEE Registernr: 129 19551 |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 00:03:51
|
On Mon, Aug 22, 2011 at 04:27:51PM -0700, Linus Torvalds wrote: > So I think the "let's fix the vdso case for sysenter" + "let's remove > the 32-bit syscall vdso" is the right solution. If somebody has > hardcoded syscall instructions, or generates them dynamically with > some JIT, that's their problem. We'll continue to support it as well > as we ever have (read: "almost nobody will ever notice"). Umm... Maybe, but I really wonder if it would be better to do this: * check if %ecx is the right one for vdso32 entry. If it isn't, act as we act now (and possibly warn). If it is, increment it by 4. * slap 0x90, 0x90, 0xcd, 0x80 right after that syscall insn - i.e. nop/nop/int 0x80. Followed by what we currently do there. Those who do syscall insn in 32bit mode outside of vdso will get what they currently get. __kernel_vsyscall() will keep working and do that without restart problems. And the price is comparison + branch not taken + addition for that path... |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 00:07:34
|
On Tue, Aug 23, 2011 at 01:03:14AM +0100, Al Viro wrote: > On Mon, Aug 22, 2011 at 04:27:51PM -0700, Linus Torvalds wrote: > > > So I think the "let's fix the vdso case for sysenter" + "let's remove > > the 32-bit syscall vdso" is the right solution. If somebody has > > hardcoded syscall instructions, or generates them dynamically with > > some JIT, that's their problem. We'll continue to support it as well > > as we ever have (read: "almost nobody will ever notice"). > > Umm... Maybe, but I really wonder if it would be better to do this: > * check if %ecx is the right one for vdso32 entry. If it isn't, > act as we act now (and possibly warn). If it is, increment it by 4. > * slap 0x90, 0x90, 0xcd, 0x80 right after that syscall insn - > i.e. nop/nop/int 0x80. Followed by what we currently do there. > > Those who do syscall insn in 32bit mode outside of vdso will get > what they currently get. __kernel_vsyscall() will keep working and do > that without restart problems. And the price is comparison + branch not > taken + addition for that path... s/whine/set your "don't restart that one" flag/, even... |
From: Linus T. <tor...@li...> - 2011-08-23 16:21:14
|
On Tue, Aug 23, 2011 at 9:11 AM, Andrew Lutomirski <lu...@mi...> wrote: > > Egads. Does this mean that doing GETREGS and then doing SETREGS later > on with the *exact same values* is considered incorrect? No. If it does it with exactly the same values, it's perfectly ok. But it isn't. It's changing some of those values. > IMO, this way lies madness. No, the madness is in UML. It's EMULATING A SYSTEM CALL. That original "getregs" value is not some "user space state". It's the *system call* state that you got after the system call trapped. Setting it back is an insane operation, but it would happen to work - if you make no changes. But UML *does* make changes. It takes that system call state, and then EMULATES THE SYSTEM CALL INCORRECTLY. If you see it that way (which is the correct way), then it's clearly an UML problem, and it's not at all "madness" that your getregs/setregs pairing doesn't work. See? Buggy system call emulation. It's really that simple. Of course, "simple" in this case is "really really subtle differences in how the kernel treats syscall/sysenter/int80", so the *details* are certainly not simple, but the concept is. Linus |
From: H. P. A. <hp...@zy...> - 2011-08-23 00:08:03
|
On 08/22/2011 05:03 PM, Al Viro wrote: > On Mon, Aug 22, 2011 at 04:27:51PM -0700, Linus Torvalds wrote: > >> So I think the "let's fix the vdso case for sysenter" + "let's remove >> the 32-bit syscall vdso" is the right solution. If somebody has >> hardcoded syscall instructions, or generates them dynamically with >> some JIT, that's their problem. We'll continue to support it as well >> as we ever have (read: "almost nobody will ever notice"). > > Umm... Maybe, but I really wonder if it would be better to do this: > * check if %ecx is the right one for vdso32 entry. If it isn't, > act as we act now (and possibly warn). If it is, increment it by 4. > * slap 0x90, 0x90, 0xcd, 0x80 right after that syscall insn - > i.e. nop/nop/int 0x80. Followed by what we currently do there. > > Those who do syscall insn in 32bit mode outside of vdso will get > what they currently get. __kernel_vsyscall() will keep working and do > that without restart problems. And the price is comparison + branch not > taken + addition for that path... Checking for SYSCALL in the vdso is cheap... the only problem is if someone thinks that they can copy it out of the vdso and run it, having special-cased SYSENTER already, but not SYSCALL. The recent experience shows that that is apparently a very real possibility. Without a SYSCALL in the vdso, anything that examines the vdso will end up using int $0x80 and we don't have an issue. Again, waiting for Borislav & co to tell us how much the SYSCALL instruction actually buys us. -hpa |
From: Linus T. <tor...@li...> - 2011-08-23 16:30:28
|
On Tue, Aug 23, 2011 at 9:22 AM, Borislav Petkov <bp...@am...> wrote: > > I don't think that removing SYSCALL from 32-bit code just so that UML > trapped syscalls work is something we'd like since SYSCALL is much > cheaper than INT $0x80: Oh yes. System call performance is *important*. And x86 is *important*. UML? In comparison, not that important. So quite frankly, if this is purely an UML issue (and unless we're missing something else, that's what it looks like, despite all the confusion we've had so far), then if we have a choice between "remove syscall instruction" and "remove UML", then ... Linus |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 16:53:46
|
On Tue, Aug 23, 2011 at 09:29:29AM -0700, Linus Torvalds wrote: > Oh yes. > > System call performance is *important*. And x86 is *important*. > > UML? In comparison, not that important. > > So quite frankly, if this is purely an UML issue (and unless we're > missing something else, that's what it looks like, despite all the > confusion we've had so far), then if we have a choice between "remove > syscall instruction" and "remove UML", then ... Agreed. Note, BTW, that UML has perfectly usable workaround for 99% of that - don't tell the binaries it has *any* vdso in such cases. And "remove UML" turns into "remove support under UML for 32bit binaries that go out of their way to do SYSCALL directly, which wouldn't work on native 32bit", which is really a no-brainer. |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 17:33:43
|
On Tue, Aug 23, 2011 at 09:20:12AM -0700, Linus Torvalds wrote: > It's EMULATING A SYSTEM CALL. That original "getregs" value is not > some "user space state". It's the *system call* state that you got > after the system call trapped. Setting it back is an insane operation, > but it would happen to work - if you make no changes. > > But UML *does* make changes. It takes that system call state, and then > EMULATES THE SYSTEM CALL INCORRECTLY. > > If you see it that way (which is the correct way), then it's clearly > an UML problem, and it's not at all "madness" that your > getregs/setregs pairing doesn't work. > > See? Buggy system call emulation. It's really that simple. Of course, > "simple" in this case is "really really subtle differences in how the > kernel treats syscall/sysenter/int80", so the *details* are certainly > not simple, but the concept is. It's a bit more than that (ptrace changes to syscall arguments *are* lost on syscall restart), but... as far as I'm concerned, the situation is simple now: * SYSCALL is not terminally broken wrt restarts. My apologies for misreading what was going on. * SYSENTER with Linus' patch does work just fine wrt restarts + ptrace * SYSCALL is losing ptrace-made changes to arguments when it restarts. Might or might not be a problem for somebody. * UML should not touch SYSCALL for 32bit. Not without serious changes in UML and I'm not convinced that it won't be worse than what we probably ought to do there: check if __kernel_vsyscall() does SYSCALL (recognizable by interaction with POKEUSER) and don't tell about vdso to guest processes. Anything well-behaving won't step on SYSCALL and the things that do deserve the subtle bugs they get. * asm glue is subtle, evil and doesn't have anywhere near enough documentation ;-/ |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 18:04:48
|
On Tue, Aug 23, 2011 at 06:33:17PM +0100, Al Viro wrote: > * SYSCALL is not terminally broken wrt restarts. My apologies for > misreading what was going on. > * SYSENTER with Linus' patch does work just fine wrt restarts + ptrace > * SYSCALL is losing ptrace-made changes to arguments when it restarts. > Might or might not be a problem for somebody. BTW, that one (irrelevant to UML even if we do end up coping with SYSCALL there) might be worth spelling it out: tracer: ptrace(tracee, PTRACE_SYSCALL); tracee: recvfrom(..., &addrlen); tracer: ptrace(tracee, PTRACE_POKEUSER, EBP, &len2); ptrace(tracee, PTRACE_DETACH, 0, 0); tracee: completes recvfrom(), using &len2 instead of the &addrlen That works just fine, regardless of the way syscall is entered; yes, including SYSCALL - there we take care to handle ptrace on the way in. However, if it's SYSCALL and (ex-)tracee takes a restart, the second time around we'll have the original value of 6th argument used. Changes made by POKEUSER are lost. It's not a problem with int 0x80 or SYSENTER (now, with int 0x80 instead of jmp). It's probably not going to be a real issue for anyone, but I pity the poor bastard stuck with debugging that if it *does* become someone's problem. |
From: Borislav P. <bp...@am...> - 2011-08-24 12:59:43
|
On Tue, Aug 23, 2011 at 06:33:17PM +0100, Al Viro wrote: > * asm glue is subtle, evil and doesn't have anywhere near enough > documentation ;-/ I took the liberty to document some of your asm glue analysis in an attempt to make the code a bit more understandable. How about the following: -- From: Borislav Petkov <bor...@am...> Date: Wed, 24 Aug 2011 14:30:43 +0200 Subject: [PATCH] x86, asm: Document some of the syscall asm glue Document some of the asm glue around compat SYSCALL32 and do a whitespace cleanup while at it. See linked thread below for further reference. Link: http://lkml.kernel.org/r/20110820011845.GC2203@ZenIV.linux.org.uk Signed-off-by: Borislav Petkov <bor...@am...> --- arch/x86/ia32/ia32entry.S | 138 ++++++++++++++++++++++++++----------------- arch/x86/kernel/entry_64.S | 19 +++++- 2 files changed, 98 insertions(+), 59 deletions(-) diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index a0e866d..8254432 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -1,16 +1,16 @@ /* - * Compatibility mode system call entry point for x86-64. - * + * Compatibility mode system call entry point for x86-64. + * * Copyright 2000-2002 Andi Kleen, SuSE Labs. - */ + */ #include <asm/dwarf2.h> #include <asm/calling.h> #include <asm/asm-offsets.h> #include <asm/current.h> #include <asm/errno.h> -#include <asm/ia32_unistd.h> -#include <asm/thread_info.h> +#include <asm/ia32_unistd.h> +#include <asm/thread_info.h> #include <asm/segment.h> #include <asm/irqflags.h> #include <linux/linkage.h> @@ -38,11 +38,11 @@ xchg %ecx,%esi movl %ebx,%edi movl %edx,%edx /* zero extension */ - .endm + .endm - /* clobbers %eax */ + /* clobbers %eax */ .macro CLEAR_RREGS offset=0, _r9=rax - xorl %eax,%eax + xorl %eax,%eax movq %rax,\offset+R11(%rsp) movq %rax,\offset+R10(%rsp) movq %\_r9,\offset+R9(%rsp) @@ -69,7 +69,7 @@ movl \offset+64(%rsp),%edi movl %eax,%eax /* zero extension */ .endm - + .macro CFI_STARTPROC32 simple CFI_STARTPROC \simple CFI_UNDEFINED r8 @@ -106,14 +106,14 @@ ENDPROC(native_irq_enable_sysexit) * %esi Arg4 * %edi Arg5 * %ebp user stack - * 0(%ebp) Arg6 - * + * 0(%ebp) Arg6 + * * Interrupts off. - * + * * This is purely a fast path. For anything complicated we use the int 0x80 * path below. Set up a complete hardware stack frame to share code * with the int 0x80 path. - */ + */ ENTRY(ia32_sysenter_target) CFI_STARTPROC32 simple CFI_SIGNAL_FRAME @@ -127,7 +127,7 @@ ENTRY(ia32_sysenter_target) * disabled irqs, here we enable it straight after entry: */ ENABLE_INTERRUPTS(CLBR_NONE) - movl %ebp,%ebp /* zero extension */ + movl %ebp,%ebp /* zero extension */ pushq_cfi $__USER32_DS /*CFI_REL_OFFSET ss,0*/ pushq_cfi %rbp @@ -144,12 +144,12 @@ ENTRY(ia32_sysenter_target) pushq_cfi %rax cld SAVE_ARGS 0,1,0 - /* no need to do an access_ok check here because rbp has been - 32bit zero extended */ + /* no need to do an access_ok check here because rbp has been + 32bit zero extended */ 1: movl (%rbp),%ebp - .section __ex_table,"a" - .quad 1b,ia32_badarg - .previous + .section __ex_table,"a" + .quad 1b,ia32_badarg + .previous GET_THREAD_INFO(%r10) orl $TS_COMPAT,TI_status(%r10) testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10) @@ -170,7 +170,7 @@ sysenter_dispatch: sysexit_from_sys_call: andl $~TS_COMPAT,TI_status(%r10) /* clear IF, that popfq doesn't enable interrupts early */ - andl $~0x200,EFLAGS-R11(%rsp) + andl $~0x200,EFLAGS-R11(%rsp) movl RIP-R11(%rsp),%edx /* User %eip */ CFI_REGISTER rip,rdx RESTORE_ARGS 0,24,0,0,0,0 @@ -260,20 +260,21 @@ ENDPROC(ia32_sysenter_target) * Arguments: * %eax System call number. * %ebx Arg1 - * %ecx return EIP + * %ecx return EIP * %edx Arg3 * %esi Arg4 * %edi Arg5 - * %ebp Arg2 [note: not saved in the stack frame, should not be touched] - * %esp user stack + * %ebp Arg2 [note: not saved in the stack frame, should not be touched + * because it is callee-saved in 64-bit calling convention] + * %esp user stack * 0(%esp) Arg6 - * + * * Interrupts off. - * + * * This is purely a fast path. For anything complicated we use the int 0x80 * path below. Set up a complete hardware stack frame to share code - * with the int 0x80 path. - */ + * with the int 0x80 path. + */ ENTRY(ia32_cstar_target) CFI_STARTPROC32 simple CFI_SIGNAL_FRAME @@ -281,34 +282,57 @@ ENTRY(ia32_cstar_target) CFI_REGISTER rip,rcx /*CFI_REGISTER rflags,r11*/ SWAPGS_UNSAFE_STACK + + /* stash away usermode stack ptr */ movl %esp,%r8d CFI_REGISTER rsp,r8 movq PER_CPU_VAR(kernel_stack),%rsp + /* * No need to follow this irqs on/off section: the syscall * disabled irqs and here we enable it straight after entry: */ ENABLE_INTERRUPTS(CLBR_NONE) SAVE_ARGS 8,0,0 - movl %eax,%eax /* zero extension */ + movl %eax,%eax /* zero extension */ movq %rax,ORIG_RAX-ARGOFFSET(%rsp) + + /* return-RIP is in %ecx when executing SYSCALL */ movq %rcx,RIP-ARGOFFSET(%rsp) CFI_REL_OFFSET rip,RIP-ARGOFFSET - movq %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */ + + /* + * Put Arg2 into %rcx pt_regs slot to match kernel syscall + * calling conventions, i.e. what INT80 would expect; + * this lies slightly to ptrace + */ + movq %rbp,RCX-ARGOFFSET(%rsp) movl %ebp,%ecx movq $__USER32_CS,CS-ARGOFFSET(%rsp) movq $__USER32_DS,SS-ARGOFFSET(%rsp) + + /* rFLAGS is in %r11 when executing SYSCALL */ movq %r11,EFLAGS-ARGOFFSET(%rsp) /*CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/ - movq %r8,RSP-ARGOFFSET(%rsp) + + /* save usermode stack ptr into pt_regs */ + movq %r8,RSP-ARGOFFSET(%rsp) CFI_REL_OFFSET rsp,RSP-ARGOFFSET - /* no need to do an access_ok check here because r8 has been - 32bit zero extended */ - /* hardware stack frame is complete now */ + + /* + * Get Arg6 which is on the usermode stack; no need to do an + * access_ok check here because %r8 has been 32bit zero extended. + * hardware stack frame is complete now. + */ 1: movl (%r8),%r9d + + /* + * handle pagefaulting when accessing usermode stack by returning + * -EFAULT + */ .section __ex_table,"a" .quad 1b,ia32_badarg - .previous + .previous GET_THREAD_INFO(%r10) orl $TS_COMPAT,TI_status(%r10) testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10) @@ -331,7 +355,7 @@ sysretl_from_sys_call: RESTORE_ARGS 0,-ARG_SKIP,0,0,0 movl RIP-ARGOFFSET(%rsp),%ecx CFI_REGISTER rip,rcx - movl EFLAGS-ARGOFFSET(%rsp),%r11d + movl EFLAGS-ARGOFFSET(%rsp),%r11d /*CFI_REGISTER rflags,r11*/ xorq %r10,%r10 xorq %r9,%r9 @@ -340,7 +364,7 @@ sysretl_from_sys_call: movl RSP-ARGOFFSET(%rsp),%esp CFI_RESTORE rsp USERGS_SYSRET32 - + #ifdef CONFIG_AUDITSYSCALL cstar_auditsys: CFI_RESTORE_STATE @@ -358,6 +382,8 @@ cstar_tracesys: testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags(%r10) jz cstar_auditsys #endif + + /* put Arg6 into %ebp where ptrace expects it */ xchgl %r9d,%ebp SAVE_REST CLEAR_RREGS 0, r9 @@ -366,21 +392,23 @@ cstar_tracesys: call syscall_trace_enter LOAD_ARGS32 ARGOFFSET, 1 /* reload args from stack in case ptrace changed it */ RESTORE_REST + + /* sync back Arg6's possibly changed value where it is expected by C */ xchgl %ebp,%r9d cmpq $(IA32_NR_syscalls-1),%rax ja int_ret_from_sys_call /* cstar_tracesys has set RAX(%rsp) */ jmp cstar_do_call END(ia32_cstar_target) - + ia32_badarg: movq $-EFAULT,%rax jmp ia32_sysret CFI_ENDPROC -/* - * Emulated IA32 system calls via int 0x80. +/* + * Emulated IA32 system calls via int 0x80. * - * Arguments: + * Arguments: * %eax System call number. * %ebx Arg1 * %ecx Arg2 @@ -390,13 +418,13 @@ ia32_badarg: * %ebp Arg6 [note: not saved in the stack frame, should not be touched] * * Notes: - * Uses the same stack frame as the x86-64 version. + * Uses the same stack frame as the x86-64 version. * All registers except %eax must be saved (but ptrace may violate that) * Arguments are zero extended. For system calls that want sign extension and * take long arguments a wrapper is needed. Most calls can just be called * directly. - * Assumes it is only called from user space and entered with interrupts off. - */ + * Assumes it is only called from user space and entered with interrupts off. + */ ENTRY(ia32_syscall) CFI_STARTPROC32 simple @@ -433,9 +461,9 @@ ia32_sysret: movq %rax,RAX-ARGOFFSET(%rsp) ia32_ret_from_sys_call: CLEAR_RREGS -ARGOFFSET - jmp int_ret_from_sys_call + jmp int_ret_from_sys_call -ia32_tracesys: +ia32_tracesys: SAVE_REST CLEAR_RREGS movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */ @@ -457,13 +485,13 @@ quiet_ni_syscall: movq $-ENOSYS,%rax ret CFI_ENDPROC - + .macro PTREGSCALL label, func, arg .globl \label \label: leaq \func(%rip),%rax leaq -ARGOFFSET+8(%rsp),\arg /* 8 for return address */ - jmp ia32_ptregs_common + jmp ia32_ptregs_common .endm CFI_STARTPROC32 @@ -537,7 +565,7 @@ ia32_sys_call_table: .quad quiet_ni_syscall /* old stty syscall holder */ .quad quiet_ni_syscall /* old gtty syscall holder */ .quad sys_access - .quad sys_nice + .quad sys_nice .quad quiet_ni_syscall /* 35 */ /* old ftime syscall holder */ .quad sys_sync .quad sys32_kill @@ -616,7 +644,7 @@ ia32_sys_call_table: .quad stub32_iopl /* 110 */ .quad sys_vhangup .quad quiet_ni_syscall /* old "idle" system call */ - .quad sys32_vm86_warning /* vm86old */ + .quad sys32_vm86_warning /* vm86old */ .quad compat_sys_wait4 .quad sys_swapoff /* 115 */ .quad compat_sys_sysinfo @@ -669,7 +697,7 @@ ia32_sys_call_table: .quad sys_mremap .quad sys_setresuid16 .quad sys_getresuid16 /* 165 */ - .quad sys32_vm86_warning /* vm86 */ + .quad sys32_vm86_warning /* vm86 */ .quad quiet_ni_syscall /* query_module */ .quad sys_poll .quad compat_sys_nfsservctl @@ -724,10 +752,10 @@ ia32_sys_call_table: .quad sys_mincore .quad sys_madvise .quad compat_sys_getdents64 /* 220 getdents64 */ - .quad compat_sys_fcntl64 + .quad compat_sys_fcntl64 .quad quiet_ni_syscall /* tux */ - .quad quiet_ni_syscall /* security */ - .quad sys_gettid + .quad quiet_ni_syscall /* security */ + .quad sys_gettid .quad sys32_readahead /* 225 */ .quad sys_setxattr .quad sys_lsetxattr @@ -742,7 +770,7 @@ ia32_sys_call_table: .quad sys_lremovexattr .quad sys_fremovexattr .quad sys_tkill - .quad sys_sendfile64 + .quad sys_sendfile64 .quad compat_sys_futex /* 240 */ .quad compat_sys_sched_setaffinity .quad compat_sys_sched_getaffinity @@ -754,7 +782,7 @@ ia32_sys_call_table: .quad compat_sys_io_submit .quad sys_io_cancel .quad sys32_fadvise64 /* 250 */ - .quad quiet_ni_syscall /* free_huge_pages */ + .quad quiet_ni_syscall /* free_huge_pages */ .quad sys_exit_group .quad sys32_lookup_dcookie .quad sys_epoll_create diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 6419bb0..9569f11 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -607,10 +607,16 @@ tracesys: GLOBAL(int_ret_from_sys_call) DISABLE_INTERRUPTS(CLBR_NONE) TRACE_IRQS_OFF + + /* + * check the Requestor Privilege Level of the CS selector + * previously pushed on the stack. If 0, we're returning + * to kernel space. + */ testl $3,CS-ARGOFFSET(%rsp) je retint_restore_args - movl $_TIF_ALLWORK_MASK,%edi /* edi: mask to check */ + movl $_TIF_ALLWORK_MASK,%edi GLOBAL(int_with_check) LOCKDEP_SYS_EXIT_IRQ GET_THREAD_INFO(%rcx) @@ -618,11 +624,16 @@ GLOBAL(int_with_check) andl %edi,%edx jnz int_careful andl $~TS_COMPAT,TI_status(%rcx) + + /* no work pending, return to userspace */ jmp retint_swapgs - /* Either reschedule or signal or syscall exit tracking needed. */ - /* First do a reschedule test. */ - /* edx: work, edi: workmask */ + /* + * Either reschedule or signal or syscall exit tracking + * needed. First do a reschedule test. + * + * edx: work, edi: workmask + */ int_careful: bt $TIF_NEED_RESCHED,%edx jnc int_very_careful -- 1.7.4 -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach GM: Alberto Bozzo Reg: Dornach, Landkreis Muenchen HRB Nr. 43632 WEEE Registernr: 129 19551 |
From: H. P. A. <hp...@zy...> - 2011-08-23 19:16:32
|
On 08/23/2011 09:22 AM, Borislav Petkov wrote: > On Tue, Aug 23, 2011 at 12:11:43PM -0400, Andrew Lutomirski wrote: >> In any case, this seems insanely overcomplicated. I'd be less afraid >> of something like my approach (which, I think, makes all of the >> SYSCALL weirdness pretty much transparent to ptrace users) or of just >> removing SYSCALL entirely from 32-bit code. > > I don't think that removing SYSCALL from 32-bit code just so that UML > trapped syscalls work is something we'd like since SYSCALL is much > cheaper than INT $0x80: > > "As a result, SYSCALL and SYSRET can take fewer than one-fourth the > number of internal clock cycles to complete than the legacy CALL and RET > instructions." > > http://support.amd.com/us/Processor_TechDocs/24593.pdf, p. 152. > > I know, it is 32-bit syscall on 64-bit kernel which should be pretty > rare but still... > Right, but if you had said the difference had disappeared on current AMD silicon it would be much less of an issue. That's why I wanted to find that bit out from you. -hpa |
From: Borislav P. <bp...@am...> - 2011-08-23 20:56:34
|
On Tue, Aug 23, 2011 at 03:15:58PM -0400, H. Peter Anvin wrote: > Right, but if you had said the difference had disappeared on current AMD > silicon it would be much less of an issue. That's why I wanted to find > that bit out from you. Yeah, I dug this out in the APM, btw. But no, I don't think the difference has disappeared - to the contrary, AFAICT, the intention is for SYSCALL to be the fastest way to do syscalls on x86 due to diminished number of segment checks etc. INT80 is legacy, slower, etc. I believe Andy measured a similar situation on Sandy Bridge with SYSCALL having latencies in the tens of nsecs range and INT80 being much slower. Ingo also measured a similar situation where the latency gap between the two on Intel is even bigger. So, the only problem left now is what we're going to do with cases similar to what Al conjured up: http://marc.info/?l=linux-kernel&m=131412271112461&w=2 I don't know, maybe the most cowardly approach is to issue a warning and shrug with the shoulders, or do some asm magic... Also, do we care at all, how relevant is a case like that? Hmmm. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach GM: Alberto Bozzo Reg: Dornach, Landkreis Muenchen HRB Nr. 43632 WEEE Registernr: 129 19551 |
From: Linus T. <tor...@li...> - 2011-08-23 19:25:31
|
On Tue, Aug 23, 2011 at 12:18 PM, H. Peter Anvin <hp...@zy...> wrote: > > We could drop that information in a metaregister. It's not backward > compatible, but at least it will be obvious when that information is > available and not. Well, seriously, UML already looks at the word at "ip-2" for other reasons. So it isn't like there is any point in adding more support to just give you that information in another form. Linus |
From: H. P. A. <hp...@zy...> - 2011-08-23 19:27:17
|
On 08/23/2011 12:24 PM, Linus Torvalds wrote: > On Tue, Aug 23, 2011 at 12:18 PM, H. Peter Anvin <hp...@zy...> wrote: >> >> We could drop that information in a metaregister. It's not backward >> compatible, but at least it will be obvious when that information is >> available and not. > > Well, seriously, UML already looks at the word at "ip-2" for other > reasons. So it isn't like there is any point in adding more support to > just give you that information in another form. > Not for UML, but it might be useful for other things. Certainly lowers the priority, though. -hpa |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 19:41:28
|
On Tue, Aug 23, 2011 at 12:24:22PM -0700, Linus Torvalds wrote: > On Tue, Aug 23, 2011 at 12:18 PM, H. Peter Anvin <hp...@zy...> wrote: > > > > We could drop that information in a metaregister. ?It's not backward > > compatible, but at least it will be obvious when that information is > > available and not. > > Well, seriously, UML already looks at the word at "ip-2" for other > reasons. So it isn't like there is any point in adding more support to > just give you that information in another form. That is done only for task singlestepped in the guest: /* * This closes a way to execute a system call on the host. If * you set a breakpoint on a system call instruction and singlestep * from it, the tracing thread used to PTRACE_SINGLESTEP the process * rather than PTRACE_SYSCALL it, allowing the system call to execute * on the host. The tracing thread will check this flag and * PTRACE_SYSCALL if necessary. */ if (current->ptrace & PT_DTRACE) current->thread.singlestep_syscall = is_syscall(PT_REGS_IP(¤t->thread.regs)); with PT_DTRACE set by uml user_enable_single_step() And it's not cheap - doing that on each syscall will be unpleasant... Frankly, I'd rather stopped telling the uml userland about vdso in such setups. And anything that plays with SYSCALL outside of vdso... <shrug> we already have a "don't run it native on 32bit", adding "don't run it on 32bit uml on amd64 host" is not too serious. At least for now... |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 16:41:52
|
On Sun, Aug 21, 2011 at 03:43:52PM +0100, Al Viro wrote: > We do not lie to ptrace and iret. At all. We do just what you have > described. And fuck up when restart returns us to the SYSCALL / SYSENTER > instruction again, which expects the different calling conventions, > so the values arranged in registers in the way int 0x80 would expect > do us no good. FWIW, what really happens (for 32bit task on amd64) is this: * both SYSCALL and SYSENTER variants of __kernel_vsyscall are entered with the same calling conventions; eax contains syscall number, ebx/ecx/edx/esi/edi/ebp contain arg1..6 resp. Same as what int 0x80 would expect. * they arrange slightly different calling conventions for actual SYSCALL/SYSENTER instructions. SYSENTER one: ecx and edx saved on user stack to undo the effect of SYSEXIT clobbering them, arg6 (from ebp) pushed to stack as well (for kernel side of SYSENTER to pick it from there) and userland esp copied to ebp (SYSENTER clobbers esp). SYSCALL one: arg6 (from ebp) pushed to stack (again, for kernel to pick it from there), arg2 (from ecx) copied to ebp (SYSCALL clobbers ecx). Then we hit the kernel. * Both codepaths start with arranging the same thing on the kernel stack frame; one 64bit int 0x80 would create. For the good and simple reason: they all have to be able to leave via IRET. Stack layout is the same, but we need to fill it accordingly to calling conventions we are stuck with. I.e. ->cx should be initialized with arg2 and ->bp with arg6, wherever those currently are on given codepath. _That_ is what "lying to ptrace" is about - we store there registers according to how they were when we entered __kernel_vsyscall(), not as they are at the moment of actual SYSCALL insn. Which is precisely the right thing to do, since if we *are* ptraced, the tracer expects to find the syscall argument in the same places, whichever variant of syscall tracee happens to be using. * In both variants it means picking arg6 from userland stack; if that pagefaults, we act as if we returned -EFAULT in normal way. Again, the value is stored in the expected place - ->bp, same as it would on int 0x80 path. * If we are traced, we grow the things on stack to full pt_regs, including the callee-saved registers. And call syscall_trace_enter(®s). If tracer decides to change registers, it can do so. After that call we restore the registers from pt_regs on stack and rejoin the corresponding common codepath. * In both cases we reshuffle registers to match amd64 C calling conventions; the only subtle part is that SYSCALL path has arg6 in r9d (and ebp same as we had on entry, i.e. the original arg2, unaffected by whatever ptrace might have done to regs->cx, BTW) while SYSENTER path has it in ebp, same as int 0x80 one. After reshuffling arg6 ends up r9 in all cases and in all cases ptrace changes to regs->bp (aka where ptrace expects to see arg6) do affect what's in r9. * The actual sys_whatever() is called in all cases. If there's any work to do after it (signals, still being traced, need to be rescheduled, etc.), we go for the good old IRET path (after having cleaned r8--r12 in pt_regs - IRET path is shared with 64bit and we don't want random kernel values leaking to userland). * If there's no non-trivial work to do, int 0x80 *still* cleans r8--r12 in pt_regs and goes for IRET path. End of story for it. * In the same case, SYSENTER path will restore the contents of si and di from pt_regs (bx is unaffected by sys_whatever(), ax holds return value and cx/dx are going to be clobbered anyway; bp is not restored to the conditions it had when hitting SYSENTER, but it's redundant - it was equal to userland sp and *that* we do restore, of course). r8--r11 are cleared in actual CPU registers and off we bugger, back to vdso32. Where we pop ebp/ecx/edx and return to caller. Note that syscall restart couldn't have happened on that path - it would qualify as "work to do after syscall" (specifically, signal handling) as we'd be off to IRET path. * In the same case, SYSCALL path will restore the contents of si, di and dx from pt_regs (bx is unaffected by sys_whatever(), ax contains the return value and bp is actually the same as it was on entry, after all dances). r8-r11 are cleaned in registers, cx is clobbered by SYSRET and we are off to __kernel_vsyscall(), again. This time back in there we restore cx to what it used to be on entry to __kernel_vsyscall() [*NOTE*: unaffected by ptrace manipulations; we probably don't care about that] and restore bp (from stack). We also restore %ss along the way, but that's a separate story. Again, no syscall restarts on that path. * If there *was* a syscall restart to be done, we are guaranteed to have left via IRET path. In all cases the syscall arguments end up in registers, in the same way int 0x80 expected them. What happens afterwards depends on how we entered, though. + int 0x80: all registers are restored (with ptrace manipulations, if any, having left their effect) as they'd been the last time around. In we go and that's it. + SYSENTER: return address had been set *not* to the insn right next after SYSENTER when we'd been setting the stack frame up. That's the dirty trick Linus had come up with - return ip is set to insn in vfso32 (SYSENTER loses the original ip for good, unlike SYSCALL that would store it in cx, so it has to be at fixed location anyway). Normally we just pop ecx/edx/ebp from stack and we are done. However, two bytes prior to that insn (i.e. where syscall restart would land us) we have jump to just a bit before SYSENTER. Namely, to the point where we had copied esp to ebp. That, combined with what IRET path has done, will get us the layout SYSENTER expects once we get to SYSENTER again. Except that ptrace modifications to arg6 will be lost - *ebp is where SYSENTER picks it from and it's not updated. Modified value is in ebp on return from kernel and it's overwritten (with esp) and lost. That's ptrace vs. restarts bug I've mentioned in SYSENTER case. + SYSCALL: buggered. On restart we end up repeating the call, with arg2 replaced with whatever had been in ebp when we entered __kernel_vsyscall(). Simply because nobody cared to move it from ecx (where IRET path has put it) to ebp (where SYSCALL expects to find it). ebp gets what used to be in arg6 (again, IRET path doing). Oh, and ptrace modifications, if any, are lost as well - both in arg2 and in arg6. I *think* the above is an accurate description of what happens, but I could certainly be wrong - that's just from RTFS of unfamiliar and seriously convoluted code, so I'd very much appreciate ACK/NAK on the analysis above from the people actually familiar with that thing... |
From: Linus T. <tor...@li...> - 2011-08-22 01:09:33
|
On Sun, Aug 21, 2011 at 5:44 PM, Andrew Lutomirski <lu...@mi...> wrote: > > Which suggests an easy-ish fix: if sysenter is used or if syscall is > entered from the EIP is is supposed to be entered from, then just > change ip in the argument save to point to the int 0x80 instruction. Indeed. Just add an "int 0x80" instruction to the vsyscall thing, and you'd be done. In fact, just replace the jmp .Lenter_kernel with int 0x80 and you'd be pretty much all done, no? (Ok, that's probably a huge over-simplification, but perhaps "close enough" to true that it would be workable) Linus |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-22 01:19:55
|
On Sun, Aug 21, 2011 at 06:09:00PM -0700, Linus Torvalds wrote: > On Sun, Aug 21, 2011 at 5:44 PM, Andrew Lutomirski <lu...@mi...> wrote: > > > > Which suggests an easy-ish fix: if sysenter is used or if syscall is > > entered from the EIP is is supposed to be entered from, then just > > change ip in the argument save to point to the int 0x80 instruction. > > Indeed. Just add an "int 0x80" instruction to the vsyscall thing, and > you'd be done. > > In fact, just replace the > > jmp .Lenter_kernel > > with > > int 0x80 > > and you'd be pretty much all done, no? In case of sysenter - almost, in case of syscall - nope... |
From: H. P. A. <hp...@zy...> - 2011-08-22 01:19:58
|
On 08/21/2011 06:09 PM, Linus Torvalds wrote: > > Indeed. Just add an "int 0x80" instruction to the vsyscall thing, and > you'd be done. > > In fact, just replace the > > jmp .Lenter_kernel > > with > > int 0x80 > > and you'd be pretty much all done, no? > > (Ok, that's probably a huge over-simplification, but perhaps "close > enough" to true that it would be workable) > Hm... I think a jump to something which adjusts %esp and invokes int $0x80 might just work, but only for SYSENTER. SYSCALL is different, especially since SYSCALL is legal to execute from anywhere in userspace (and no, as we have learned already doing EIP checking is *NOT* acceptable.) -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-20 20:14:15
|
On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote: > Hmmm, very strange. > Sadly I cannot reproduce the issue. :( > Everything works fine within UML. > (Of course I've applied your vDSO/i386 patches) > > My test setup: > Host kernel: 2.6.37 and 3.0.1 > Distro: openSUSE 11.4/x86_64 > > UML kernel: 3.1-rc2 > Distro: openSUSE 11.1/i386 > > Does the problem also occur with another host kernel or a different > guest image? Could you check what you get in __kernel_vsyscall()? On iAMD64 box where that sucker contains sysenter-based variant the bug is not present. IOW, it's sensitive to syscall vs. systenter vs. int 0x80 differences. I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2 ;-/) if you want to see if that gets reproduced on your box. I'll drop it on anonftp if you are interested. FWIW, the same kernel binary/same image result in * K7 box - no breakage, SYSENTER-based vdso * K8 box - breakage as described, SYSCALL-based vdso32 * P4 box - no breakage, SYSENTER-based vdso32 Hell knows... In theory that would seem to point towards ia32_cstar_target(), so I'm going to RTFS carefully through that animal. The thing is, whatever happens happens when victim gets resumed inside vdso page. I'll try to dump PTRACE_SETREGS and see the values host kernel asked to set and work from there, but the interesting part is bloody hard to singlestep through - the victim is back to user mode and it is already traced by the guest kernel, so it's not as if we could attach host gdb to it and walk through that crap. And guest gdb is not going to be able to set breakpoints in there - vdso page is r/o... |
From: Andrew L. <lu...@mi...> - 2011-08-22 00:44:39
|
On Sun, Aug 21, 2011 at 12:41 PM, Al Viro <vi...@ze...> wrote: > On Sun, Aug 21, 2011 at 03:43:52PM +0100, Al Viro wrote: > >> We do not lie to ptrace and iret. At all. We do just what you have >> described. And fuck up when restart returns us to the SYSCALL / SYSENTER >> instruction again, which expects the different calling conventions, >> so the values arranged in registers in the way int 0x80 would expect >> do us no good. > > FWIW, what really happens (for 32bit task on amd64) is this: I think I believe your analysis... > * Both codepaths start with arranging the same thing on the kernel > stack frame; one 64bit int 0x80 would create. For the good and simple > reason: they all have to be able to leave via IRET. Stack layout is the > same, but we need to fill it accordingly to calling conventions we are > stuck with. I.e. ->cx should be initialized with arg2 and ->bp with > arg6, wherever those currently are on given codepath. _That_ is what > "lying to ptrace" is about - we store there registers according to how > they were when we entered __kernel_vsyscall(), not as they are at the > moment of actual SYSCALL insn. Which is precisely the right thing to do, > since if we *are* ptraced, the tracer expects to find the syscall argument > in the same places, whichever variant of syscall tracee happens to be using. This is, IMO, gross -- if the values in pt_regs matched what they were when sysenter / syscall was issued, then we'd be fine -- we could restart the syscall and everything would work. Apparently ptrace users have a problem with that, so we're stuck with the "lie" (i.e. reporting values as of __kernel_vsyscall, not as of the actual kernel entry). > * If there *was* a syscall restart to be done, we are guaranteed to > have left via IRET path. In all cases the syscall arguments end up in > registers, in the same way int 0x80 expected them. What happens afterwards > depends on how we entered, though. > + int 0x80: all registers are restored (with ptrace > manipulations, if any, having left their effect) as they'd been the last > time around. In we go and that's it. Which suggests an easy-ish fix: if sysenter is used or if syscall is entered from the EIP is is supposed to be entered from, then just change ip in the argument save to point to the int 0x80 instruction. This might also require tweaking the userspace stack. That way, restart would hit int 0x80 instead of syscall/sysenter and the registers are exactly as expected. Getting this right in the case where ptrace attaches during the syscall might be tricky, though. --Andy |
From: Andrew L. <lu...@mi...> - 2011-08-22 13:34:54
|
On Mon, Aug 22, 2011 at 5:53 AM, Ingo Molnar <mi...@ke...> wrote: > > * H. Peter Anvin <hp...@zy...> wrote: > >> Borislav, >> >> We're tracking down an issue with the way system call arguments are >> handled on 32 bits. We have a solution for SYSENTER but not >> SYSCALL; fixing SYSCALL "properly" appears to be very difficult at >> best. >> >> So the question is: how much overhead would it be to simply fall >> back to int $0x80 or some other legacy-style domain crossing >> instruction for 32-bit system calls on AMD64 processors? We don't >> ever use SYSCALL in legacy mode, so native i386 kernels are >> unaffected. > > Last i measured INT80 and SYSCALL costs they were pretty close to > each other on AMD CPUs - closer than on Intel. >From memory, SYSCALL in 64-bit mode on Sandy Bridge is much faster than int 0xcc, which is presumably about the same speed as int 0x80. That's because SYSCALL is blazingly fast (<30 ns for a whole system call) and int is slower. --Andy > > Also, most installations are either pure 32-bit or dominantly 64-bit, > the significantly mixed-mode case is dwindling. > > Unifying some more in this area would definitely simplify things ... > > Thanks, > > Ingo > |
From: Borislav P. <bp...@am...> - 2011-08-22 14:41:12
|
On Mon, Aug 22, 2011 at 09:34:27AM -0400, Andrew Lutomirski wrote: > On Mon, Aug 22, 2011 at 5:53 AM, Ingo Molnar <mi...@ke...> wrote: > > > > * H. Peter Anvin <hp...@zy...> wrote: > > > >> Borislav, > >> > >> We're tracking down an issue with the way system call arguments are > >> handled on 32 bits. We have a solution for SYSENTER but not > >> SYSCALL; fixing SYSCALL "properly" appears to be very difficult at > >> best. > >> > >> So the question is: how much overhead would it be to simply fall > >> back to int $0x80 or some other legacy-style domain crossing > >> instruction for 32-bit system calls on AMD64 processors? We don't > >> ever use SYSCALL in legacy mode, so native i386 kernels are > >> unaffected. > > > > Last i measured INT80 and SYSCALL costs they were pretty close to > > each other on AMD CPUs - closer than on Intel. > > From memory, SYSCALL in 64-bit mode on Sandy Bridge is much faster > than int 0xcc, which is presumably about the same speed as int 0x80. > That's because SYSCALL is blazingly fast (<30 ns for a whole system > call) and int is slower. Just to make sure I'm grokking this correctly - we want to use int $0x80 only for the SYSCALL variant in __kernel_vsyscall, right? Not for all 32-bit syscalls on a 64-bit kernel. Thanks. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach GM: Alberto Bozzo Reg: Dornach, Landkreis Muenchen HRB Nr. 43632 WEEE Registernr: 129 19551 |
From: Linus T. <tor...@li...> - 2011-08-23 00:23:16
|
On Mon, Aug 22, 2011 at 5:07 PM, H. Peter Anvin <hp...@zy...> wrote: > > Checking for SYSCALL in the vdso is cheap... the only problem is if > someone thinks that they can copy it out of the vdso and run it, having > special-cased SYSENTER already, but not SYSCALL. The recent experience > shows that that is apparently a very real possibility. I don't understand why people keep harping on this "copy the vdso" thing. Guys, IF THEY COPY THE VDSO, THEN THEY WOULD GET OUR IMPROVEMENTS! So stop being silly. As to comparing the IP address, that's just moronic. Why the hell would anybody ever do something that stupid? It's expensive and not worth it. You'd be much better off comparing the contents of the memory at 'rip', and see if that memory contains the 'int 0x80' instruction we would add. That automatically also gets the 'somebody copied the vdso text' case right. You guys seem to positively _want_ to make this a bigger issue than it is. As far as I can tell, nobody has ever even noticed this problem before, and we already have a trivial fix ("don't do it then") for the case Al actually did notice. The *only* case left as far as I can tell is the case where people do *not* copy the vdso, but simply generate (statically and/or dynamically with a JIT) the syscall instruction directly. And we cannot fix that, so stop even bothering to try. The absolute best we can do for that case is to just continue to do what we do now. WE SIMPLY HAVE NO CHOICE. I really don't see this as being worth this much discussion. Linus |