From: Borislav P. <bp...@am...> - 2011-08-23 14:26:26
|
On Tue, Aug 23, 2011 at 02:15:31AM -0400, Al Viro wrote:
> Almost, but not quite. What happens is:
> * process hits syscall insn
> * it's stopped and tracer (guest kernel) does GETREGS
>   + looks at the registers (mapped to the normal layout)
>   + decides to call sys_brk()
>   + notices pages to kick out
>   + queues munmap request for stub
> * tracer does SETREGS, pointing the child's eip to stub and sp to stub stack
> * tracer does CONT, letting the child run
> * child finishes with syscall insn, carefully preserving ebp. It returns to
>   userland, in the beginning of the stub.
> * child does munmap() and hits int 3 in the end of stub.
> * the damn thing is stopped again. The tracer had been waiting for it.
> * tracer finishes with sys_brk() and returns success.
> * it does SETREGS, setting eax to return value, eip to original return
>   address of syscall insn... and ebp to what it had in regs.bp. I.e. the
>   damn arg6 value.

Ok, stupid question: can a convoluted ptracing case like this be created
in "normal" userspace, i.e. irrespective of UML and only by using gdb,
for example?

I.e., from what I understand from above, you need to stop the tracee at
syscall and "redirect" it to the stub after it finishes the syscall so
that in another syscall it gets a debug exception... sounds complicated.

> And we are fucked. It doesn't happen in syscall handler. It's int3().
> Having no idea that this request to set ebp should be interpreted in
> a really different way - "put the value I asked to put into ecx here,
> please, and ignore this one".
>
> Sigh... The really ugly part is that ebp can be changed by the stuff
> done in stub - it's not just munmap, it can do mmap as well. We can,
> in principle, save ebp on its stack and restore it before trapping.
> Then uml kernel could, in theory, replace that SETREGS with a bunch of
> POKEUSER, leaving ebp alone. Ho-hum... In principle, that might even
> be not too horrible - we need eax/eip/esp, of course, but the rest
> could be dealt with by the same trick - have it pushed/popped in the
> stub and to hell with wasting syscalls on setting them...

which could mean that we could get away by not replacing SYSCALL32?

Hmm.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632
WEEE Registernr: 129 19551
|
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 16:30:48
|
On Tue, Aug 23, 2011 at 04:26:08PM +0200, Borislav Petkov wrote: > On Tue, Aug 23, 2011 at 02:15:31AM -0400, Al Viro wrote: > > Almost, but not quite. What happens is: > > * process hits syscall insn > > * it's stopped and tracer (guest kernel) does GETREGS > > + looks at the registers (mapped to the normal layout) > > + decides to call sys_brk() > > + notices pages to kick out > > + queues munmap request for stub > > * tracer does SETREGS, pointing the child's eip to stub and sp to stub stack > > * tracer does CONT, letting the child run > > * child finishes with syscall insn, carefully preserving ebp. It returns to > > userland, in the beginning of the stub. > > * child does munmap() and hits int 3 in the end of stub. > > * the damn thing is stopped again. The tracer had been waiting for it. > > * tracer finishes with sys_brk() and returns success. > > * it does SETREGS, setting eax to return value, eip to original return > > address of syscall insn... and ebp to what it had in regs.bp. I.e. the > > damn arg6 value. > > Ok, stupid question: can a convoluted ptracing case like this be created > in "normal" userspace, i.e. irrespective of UML and only by using gdb, > for example? I don't know... > I.e., from what I understand from above, you need to stop the tracee at > syscall and "redirect" it to the stub after it finishes the syscall so > that in another syscall it gets a debug exception... sounds complicated. Basically, we need to do things that tracer can't do via ptrace() - i.e. play with mappings in the child. I.e. we need to do several syscalls in child, then return it to traced state. And all of that - before we return to execution of instructions past the syscall. 
BTW, booting 32bit uml with nosysemu on such boxen blows up instantly,
since there we have *all* SETREGS done on the way out of syscall (with
sysemu we use PTRACE_SYSEMU, which will stop on syscall entry, let you
play with registers and suppress both the sys_...() call itself and the
stop on the way out; without sysemu it'll use PTRACE_SYSCALL, replace
syscall number with something harmless (getpid(2)), let it execute, stop
on the way out and update the registers there). Same issue, only here it
really happens from within the syscall handler itself.

Hell knows... I have no idea what kind of weirdness ptrace users exhibit.

FWIW, I suspect that there's another mess around signals in uml - signal
frame is built by tracer and it *has* to contain ebp. And have eip pointing
to insn immediately past the syscall. What should sigreturn do? It got to
restore ebp - can't rely on signal handler not having buggered the register.
And we are again in for it - ebp set to arg6 as we return to insn right
after syscall one.

OTOH, making GETREGS/PEEKUSER return registers without arg2 -> ecx,
arg6 -> ebp would instantly break both uml and far less exotic things.
strace(1), for one. Anything that wants to examine the arguments of
stopped syscall will be broken.
|
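The failure mode discussed above can be replayed without ptrace at all, as plain data flow. The following is only a sketch under simplifying assumptions: `struct regs32`, `cooked_view()`, `vdso_epilogue()` and `ecx_after_syscall()` are hypothetical names, only ecx and ebp are modelled, and the epilogue mirrors the `mov %ebp,%ecx ; pop %ebp` tail of the SYSCALL-based `__kernel_vsyscall` quoted elsewhere in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model: only the two registers at issue. */
struct regs32 {
    uint32_t ecx;
    uint32_t ebp;
};

/*
 * What the tracer sees via GETREGS at a SYSCALL-entry stop, as described
 * in the thread: arg2 (parked in ebp by the vdso) is reported as ecx and
 * arg6 is reported as ebp.
 */
void cooked_view(uint32_t arg2, uint32_t arg6, struct regs32 *out)
{
    assert(out != NULL);
    out->ecx = arg2;
    out->ebp = arg6;
}

/*
 * Tail of the SYSCALL-based __kernel_vsyscall:
 *     mov %ebp,%ecx ; pop %ebp
 * 'live' holds the real registers on return to userland, 'stacked_ebp'
 * is the value the vdso prologue pushed.
 */
void vdso_epilogue(struct regs32 *live, uint32_t stacked_ebp)
{
    assert(live != NULL);
    live->ecx = live->ebp;
    live->ebp = stacked_ebp;
}

/*
 * Returns the ecx the caller of __kernel_vsyscall ends up with.
 * With naive_setregs != 0 the tracer writes the cooked registers back
 * verbatim at a later, non-syscall stop (the int3 at the end of the
 * stub), so the real ebp becomes arg6.
 */
uint32_t ecx_after_syscall(uint32_t arg2, uint32_t arg6, int naive_setregs)
{
    struct regs32 cooked, live;

    cooked_view(arg2, arg6, &cooked);
    live.ecx = 0;       /* clobbered by SYSCALL; the value is irrelevant */
    live.ebp = arg2;    /* vdso prologue: mov %ecx,%ebp */
    if (naive_setregs)
        live = cooked;  /* SETREGS using the cooked layout */
    vdso_epilogue(&live, arg6); /* the prologue pushed the caller's ebp */
    return live.ecx;
}
```

In this model an untouched child gets its arg2 back in ecx, while the naive write-back hands it arg6 instead, which is the corruption described in the step list.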
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-23 16:49:23
|
On Tue, Aug 23, 2011 at 09:03:04AM -0700, Linus Torvalds wrote: > Suggested fixes: > > - instead of blindly doing SETREGS, just write the result registers > individually like you suggested Not enough. There is also a PITA with signal handlers. There we can't avoid modifying ebp on the way out of handler (i.e. by emulated sigreturn). And it lands us straight after syscall insn, with ebp "restored" to the wrong value. > OR (and perhaps preferably): > > - teach UML that when you do 'GETREGS' after a system call trapped, > we have switched things around to match the "official system call > order", and UML should just undo our swizzling, and do a "regs.ebp = > regs.ecx" to make it be what the actual original registers were (or > whatever the actual correct swizzle is - I didn't think that through > very much). Um... How would it know which syscall variant had that been, to start with? For int 0x80 it would need to use registers as-is. For SYSENTER it also could use them as-is - ebp will differ from what we put there when entering the sucker, but not critically so; on the way out of syscall we'll overwrite it anyway immediately (either by pop or mov). For SYSCALL... we don't really care about ecx contents prior to entering the kernel (and it'll be blown out anyway), and ebp one could be found in regs.ecx. So yes, we can do it that way, but... how to tell what variant had been triggered? Examining two bytes prior to user eip? Sounds bloody brittle... |
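For reference, telling the three entry instructions apart from their bytes is simple in itself; the brittle part raised above is knowing where to read them. A minimal sketch, assuming the caller has already fetched the two opcode bytes (for instance with PTRACE_PEEKTEXT); `enum entry_kind` and `classify_entry()` are hypothetical names, not UML's `is_syscall()`. The opcodes are the standard x86 encodings.

```c
#include <assert.h>
#include <stddef.h>

enum entry_kind {
    ENTRY_UNKNOWN,
    ENTRY_INT80,
    ENTRY_SYSENTER,
    ENTRY_SYSCALL
};

/* Classify a two-byte opcode that the caller read from the tracee. */
enum entry_kind classify_entry(const unsigned char *insn)
{
    assert(insn != NULL);
    if (insn[0] == 0xcd && insn[1] == 0x80)
        return ENTRY_INT80;     /* int $0x80 */
    if (insn[0] == 0x0f && insn[1] == 0x34)
        return ENTRY_SYSENTER;  /* sysenter */
    if (insn[0] == 0x0f && insn[1] == 0x05)
        return ENTRY_SYSCALL;   /* syscall */
    return ENTRY_UNKNOWN;
}
```

Note that this says nothing about which address to read. As far as I know, after SYSENTER the kernel resumes at a fixed spot in the vdso, so the two bytes before the reported eip need not be the entry instruction itself.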
From: Linus T. <tor...@li...> - 2011-08-23 17:34:11
|
On Tue, Aug 23, 2011 at 9:48 AM, Al Viro <vi...@ze...> wrote:
>
> Um... How would it know which syscall variant had that been, to start
> with?

Just read the instruction, for chissake.

UML *already* does that, to see if it's "int80" or "sysenter"
('is_syscall()').

Now, I do agree that if we had designed the ptrace interface with
these kinds of issues in mind, then we would have added a "state"
field to the thing that could have this kind of information as part of
the GETREGS interface. There is no question that that would have been
a good idea - but we have what we have.

I mean, technically, we could also have always just given "raw user
space register state" to ptrace, and then just said that "anybody who
traces system calls needs to know the exact calling conventions for
*that* kind of system call". But instead of that, we give the "cooked"
pt_regs values on read-out, to make it simpler for strace and friends.

And it's actually simpler for UML too. If we *didn't* give that cooked
register set information, then UML would *still* have to look at the
actual instruction in order to emulate the system call correctly
("it's sysenter, so now I need to take some of the system call
arguments from the stack"). So the fact that we do that register state
swizzling actually helps not just strace, but UML too.

It would be *nice* if we did the swizzling automatically at setregs()
time too, but we simply don't have enough information in the kernel to
do that. Again, exactly because pt_regs doesn't have a "state"
variable, when user-space does the SETREGS call, we simply don't know
whether we are in "normal" code or in some system call entry or exit
state. So the kernel does the swizzling at GETREGS time (by virtue of
always having the registers in a "canonical" state for system call
entry), but we fundamentally *cannot* do the unswizzle, because we
don't know what the SETREGS caller actually did.

So I think the current state is actually the best we could possibly
do, with the caveat that *if* we had known about the "different system
calls have different register layouts" originally and had thought of
it, we could have added a 'state' word that the kernel could set at
GETREGS time, and use at SETREGS time to decide whether swizzling is
needed or not.

But not only would that have required time travel (ptrace existed
before the multiple system calls did), even then it's not 100% clear
that the current simpler model (with the admittedly subtle case of
implicit state and its effect on register state) isn't actually the
better solution.

*Somebody* has to do the register swizzling, and the current "kernel
canonicalizes registers at read time, you need to swizzle them if you
change state" may simply be the RightThing(tm).

                       Linus
|
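The "you need to swizzle them if you change state" rule could look roughly like this on the tracer side. This is a sketch, not UML code: `enum entry_kind`, `struct regs32` and `unswizzle()` are hypothetical names, only ecx and ebp are modelled, and the per-variant behaviour follows Al's summary earlier in the thread (int 0x80 and SYSENTER can be used as-is, while for SYSCALL the real ebp is found in regs.ecx and the ecx contents do not matter).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum entry_kind { ENTRY_INT80, ENTRY_SYSENTER, ENTRY_SYSCALL };

/* Hypothetical two-register view; a real tracer would use the full set. */
struct regs32 {
    uint32_t ecx;
    uint32_t ebp;
};

/*
 * Turn the cooked registers read with GETREGS at syscall entry into the
 * values that should be written back with SETREGS once the child is no
 * longer at a syscall stop.
 */
void unswizzle(enum entry_kind kind, const struct regs32 *cooked,
               struct regs32 *raw)
{
    assert(cooked != NULL && raw != NULL);
    *raw = *cooked;                 /* int 0x80, SYSENTER: use as-is */
    if (kind == ENTRY_SYSCALL)
        raw->ebp = cooked->ecx;     /* the user's ebp was reported as ecx */
}
```

The entry kind has to come from somewhere outside the register dump, for example from reading the instruction, which is exactly the implicit state the message above is about.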
From: H. P. A. <hp...@zy...> - 2011-08-23 19:18:40
|
On 08/23/2011 09:48 AM, Al Viro wrote: > > Um... How would it know which syscall variant had that been, to start > with? For int 0x80 it would need to use registers as-is. For SYSENTER > it also could use them as-is - ebp will differ from what we put there > when entering the sucker, but not critically so; on the way out of > syscall we'll overwrite it anyway immediately (either by pop or mov). > For SYSCALL... we don't really care about ecx contents prior to entering > the kernel (and it'll be blown out anyway), and ebp one could be found in > regs.ecx. So yes, we can do it that way, but... how to tell what variant > had been triggered? Examining two bytes prior to user eip? Sounds bloody > brittle... We could drop that information in a metaregister. It's not backward compatible, but at least it will be obvious when that information is available and not. -hpa |
From: H. P. A. <hp...@zy...> - 2011-08-23 21:09:20
|
On 08/23/2011 10:33 AM, Linus Torvalds wrote: > > It would be *nice* if we did the swizzling automatically at setregs() > time too, but we simply don't have enough information in the kernel to > do that. Again, exactly because pt_regs doesn't have a "state" > variable, when user-space does the SETREGS call, we simply don't know > whether we are in "normal" code or in some system call entry or exit > state. So the kernel does the swizzling at GETREGS time (by virtue of > always having the registers in a "canonical" state for system call > entry), but we fundamentally *cannot* to do the unswizzle, because we > don't know what the SETREGS caller actually did. > Again, can we steal one of the padding fields to use for that state variable? We have two 16-bit padding fields; one for cs and one for ss. For UML, I agree, let's just not expose the vdso assuming that is possible, but for other -- possibly future -- users. -hpa |
From: Linus T. <tor...@li...> - 2011-08-23 21:20:51
|
On Tue, Aug 23, 2011 at 2:08 PM, H. Peter Anvin <hp...@zy...> wrote: > > Again, can we steal one of the padding fields to use for that state > variable? We have two 16-bit padding fields; one for cs and one for ss. We can steal them for passing the information to the user, but no, I don't think we can use them to then take the information *from* the user. Somebody may well be setting up a 'pt_regs' structure on his own, and simply not fill in the padding, resulting in random data in those fields. Linus |
From: H. P. A. <hp...@zy...> - 2011-08-23 23:04:44
|
On 08/23/2011 02:20 PM, Linus Torvalds wrote: > On Tue, Aug 23, 2011 at 2:08 PM, H. Peter Anvin <hp...@zy...> wrote: >> >> Again, can we steal one of the padding fields to use for that state >> variable? We have two 16-bit padding fields; one for cs and one for ss. > > We can steal them for passing the information to the user, but no, I > don't think we can use them to then take the information *from* the > user. > > Somebody may well be setting up a 'pt_regs' structure on his own, and > simply not fill in the padding, resulting in random data in those > fields. > That would be fine, I'd think... just gives the user space application enough information to know how it would have to reshuffle the registers if it needs to. -hpa |
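One way to picture the one-directional use of the padding agreed on above, purely as an illustration: nothing like this exists in the kernel, and the function names, the choice of the cs slot, and the layout (selector in the low 16 bits, entry kind in the upper 16 bits) are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * GETREGS direction: fill the 32-bit slot that carries cs, using the
 * otherwise unused upper half to report how the syscall was entered.
 */
void export_cs_slot(uint32_t *slot, uint16_t cs, uint16_t entry_kind)
{
    assert(slot != NULL);
    *slot = ((uint32_t)entry_kind << 16) | cs;
}

/* Tracer side: read the entry kind out of the exported slot. */
uint16_t entry_kind_from_slot(uint32_t slot)
{
    return (uint16_t)(slot >> 16);
}

/*
 * SETREGS direction: only the selector is taken. The padding is ignored
 * because a tracer may have built the structure by hand and left random
 * data there.
 */
uint16_t import_cs_slot(uint32_t slot)
{
    return (uint16_t)(slot & 0xffffu);
}
```

The asymmetry is the point: the hint is trustworthy when it comes from the kernel, and is discarded when it comes back from user space.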
From: H. P. A. <hp...@zy...> - 2011-08-22 01:48:53
|
On 08/21/2011 06:41 PM, Linus Torvalds wrote: > If people are using syscall directly, we're pretty much stuck. No > amount of "that's hopelessly wrong" will ever matter. We don't break > existing binaries. > > That said, I'd *hope* that everybody uses the vdso32, simply because > user programs are not supposed to know which CPU they are running on > and if that CPU even *supports* the syscall instruction. In which case > it may be possible that we can play games with the vdso thing. But > that really would be conditional on "nobody ever reports a failure". I think we found that out with the vsyscall emulation issue last cycle. It works, so it will have been used, somewhere... > But if that's possible, maybe we can increment the RIP by 2 for > 'syscall', and slip an "'int 0x80" after the syscall instruction in > the vdso there? Resulting in the same pseudo-solution I suggested for > sysenter... I think we have the above problem. The problem here is that the syscall state is actually more complex than we retain: the entire state is given by (entry point, register state); with that amount of state we have all the information needed to *either* extract the syscall arguments *or* the register contents. Without those, we can only represent one of the two possible metalevels (right now we represent the higher-level metalevel, the argument vector), but we need both for different usages. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. |
From: Richard W. <ri...@no...> - 2011-08-20 20:55:56
|
On 20.08.2011 22:14, Al Viro wrote:
> On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote:
>
>> Hmmm, very strange.
>> Sadly I cannot reproduce the issue. :(
>> Everything works fine within UML.
>> (Of course I've applied your vDSO/i386 patches)
>>
>> My test setup:
>> Host kernel: 2.6.37 and 3.0.1
>> Distro: openSUSE 11.4/x86_64
>>
>> UML kernel: 3.1-rc2
>> Distro: openSUSE 11.1/i386
>>
>> Does the problem also occur with another host kernel or a different
>> guest image?
>
> Could you check what you get in __kernel_vsyscall()? On iAMD64 box
> where that sucker contains sysenter-based variant the bug is not
> present. IOW, it's sensitive to syscall vs. sysenter vs. int 0x80
> differences.

OK, this explains why I cannot reproduce it.
My Intel Core2 box is sysenter-based.

(gdb) disass __kernel_vsyscall
0xffffe420 <__kernel_vsyscall+0>:  push   %ecx
0xffffe421 <__kernel_vsyscall+1>:  push   %edx
0xffffe422 <__kernel_vsyscall+2>:  push   %ebp
0xffffe423 <__kernel_vsyscall+3>:  mov    %esp,%ebp
0xffffe425 <__kernel_vsyscall+5>:  sysenter
0xffffe427 <__kernel_vsyscall+7>:  nop
0xffffe428 <__kernel_vsyscall+8>:  nop
0xffffe429 <__kernel_vsyscall+9>:  nop
0xffffe42a <__kernel_vsyscall+10>: nop
0xffffe42b <__kernel_vsyscall+11>: nop
0xffffe42c <__kernel_vsyscall+12>: nop
0xffffe42d <__kernel_vsyscall+13>: nop
0xffffe42e <__kernel_vsyscall+14>: jmp    0xffffe423 <__kernel_vsyscall+3>
0xffffe430 <__kernel_vsyscall+16>: pop    %ebp
0xffffe431 <__kernel_vsyscall+17>: pop    %edx
0xffffe432 <__kernel_vsyscall+18>: pop    %ecx
0xffffe433 <__kernel_vsyscall+19>: ret

> I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2 ;-/)
> if you want to see if that gets reproduced on your box. I'll drop it on
> anonftp if you are interested. FWIW, the same kernel binary/same image
> result in
> * K7 box - no breakage, SYSENTER-based vdso
> * K8 box - breakage as described, SYSCALL-based vdso32
> * P4 box - no breakage, SYSENTER-based vdso32
> Hell knows... In theory that would seem to point towards ia32_cstar_target(),
> so I'm going to RTFS carefully through that animal.

Now I'm testing with a Debian fs from:
http://fs.devloop.org.uk/filesystems/Debian-Squeeze/

> The thing is, whatever happens happens when victim gets resumed inside
> vdso page. I'll try to dump PTRACE_SETREGS and see the values host
> kernel asked to set and work from there, but the interesting part is
> bloody hard to singlestep through - the victim is back to user mode and
> it is already traced by the guest kernel, so it's not as if we could
> attach host gdb to it and walk through that crap. And guest gdb is not
> going to be able to set breakpoints in there - vdso page is r/o...

[ CC'ing lu...@mi... ]
Andy, do you have an idea?
You can find Al's original report here:
http://marc.info/?l=linux-kernel&m=131380315624244&w=2

Thanks,
//richard
|
From: Andrew L. <lu...@mi...> - 2011-08-20 21:26:28
|
On Sat, Aug 20, 2011 at 4:55 PM, Richard Weinberger <ri...@no...> wrote: > Am 20.08.2011 22:14, schrieb Al Viro: >> >> On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote: >> >>> Hmmm, very strange. >>> Sadly I cannot reproduce the issue. :( >>> Everything works fine within UML. >>> (Of course I've applied your vDSO/i386 patches) >>> >>> My test setup: >>> Host kernel: 2.6.37 and 3.0.1 >>> Distro: openSUSE 11.4/x86_64 >>> >>> UML kernel: 3.1-rc2 >>> Distro: openSUSE 11.1/i386 >>> >>> Does the problem also occur with another host kernel or a different >>> guest image? >> >> Could you check what you get in __kernel_vsyscall()? On iAMD64 box >> where that sucker contains sysenter-based variant the bug is not >> present. IOW, it's sensitive to syscall vs. systenter vs. int 0x80 >> differences. > > OK, this explains why I cannot reproduce it. > My Intel Core2 box is sysenter-based. > > (gdb) disass __kernel_vsyscall > 0xffffe420 <__kernel_vsyscall+0>: push %ecx > 0xffffe421 <__kernel_vsyscall+1>: push %edx > 0xffffe422 <__kernel_vsyscall+2>: push %ebp > 0xffffe423 <__kernel_vsyscall+3>: mov %esp,%ebp > 0xffffe425 <__kernel_vsyscall+5>: sysenter > 0xffffe427 <__kernel_vsyscall+7>: nop > 0xffffe428 <__kernel_vsyscall+8>: nop > 0xffffe429 <__kernel_vsyscall+9>: nop > 0xffffe42a <__kernel_vsyscall+10>: nop > 0xffffe42b <__kernel_vsyscall+11>: nop > 0xffffe42c <__kernel_vsyscall+12>: nop > 0xffffe42d <__kernel_vsyscall+13>: nop > 0xffffe42e <__kernel_vsyscall+14>: jmp 0xffffe423<__kernel_vsyscall+3> > 0xffffe430 <__kernel_vsyscall+16>: pop %ebp > 0xffffe431 <__kernel_vsyscall+17>: pop %edx > 0xffffe432 <__kernel_vsyscall+18>: pop %ecx > 0xffffe433 <__kernel_vsyscall+19>: ret > >> I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2 >> ;-/) >> if you want to see if that gets reproduced on your box. I'll drop it on >> anonftp if you are interested. 
FWIW, the same kernel binary/same image >> result in >> * K7 box - no breakage, SYSENTER-based vdso >> * K8 box - breakage as described, SYSCALL-based vdso32 >> * P4 box - no breakage, SYSENTER-based vdso32 >> Hell knows... In theory that would seem to point towards >> ia32_cstar_target(), >> so I'm going to RTFS carefully through that animal. > > Now I'm testing with a Debian fs from: > http://fs.devloop.org.uk/filesystems/Debian-Squeeze/ > >> The thing is, whatever happens happens when victim gets resumed inside >> vdso page. I'll try to dump PTRACE_SETREGS and see the values host >> kernel asked to set and work from there, but the interesting part is >> bloody hard to singlestep through - the victim is back to user mode and >> it is already traced by the guest kernel, so it's not as if we could >> attach host gdb to it and walk through that crap. And guest gdb is not >> going to be able to set breakpoints in there - vdso page is r/o... > > [ CC'ing lu...@mi... ] > Andy, do you have an idea? > You can find Al's original report here: > http://marc.info/?l=linux-kernel&m=131380315624244&w=2 I'm missing a bit of the background. Is the user-on-UML app calling into a vdso entry provided by UML or into a vdso entry provided by the host? Why does anything care whether ecx is saved? Doesn't the default calling convention allow the callee to clobber ecx? But my guess is that the 64-bit host sysret code might be buggy (or the value in gs:whatever is wrong). Can you get gdb to breakpoint at the beginning of __kernel_vsyscall before the crash? --Andy |
From: Richard W. <ri...@no...> - 2011-08-20 21:38:52
|
Am 20.08.2011 23:26, schrieb Andrew Lutomirski: > On Sat, Aug 20, 2011 at 4:55 PM, Richard Weinberger<ri...@no...> wrote: >> Am 20.08.2011 22:14, schrieb Al Viro: >>> >>> On Sat, Aug 20, 2011 at 05:22:23PM +0200, Richard Weinberger wrote: >>> >>>> Hmmm, very strange. >>>> Sadly I cannot reproduce the issue. :( >>>> Everything works fine within UML. >>>> (Of course I've applied your vDSO/i386 patches) >>>> >>>> My test setup: >>>> Host kernel: 2.6.37 and 3.0.1 >>>> Distro: openSUSE 11.4/x86_64 >>>> >>>> UML kernel: 3.1-rc2 >>>> Distro: openSUSE 11.1/i386 >>>> >>>> Does the problem also occur with another host kernel or a different >>>> guest image? >>> >>> Could you check what you get in __kernel_vsyscall()? On iAMD64 box >>> where that sucker contains sysenter-based variant the bug is not >>> present. IOW, it's sensitive to syscall vs. systenter vs. int 0x80 >>> differences. >> >> OK, this explains why I cannot reproduce it. >> My Intel Core2 box is sysenter-based. >> >> (gdb) disass __kernel_vsyscall >> 0xffffe420<__kernel_vsyscall+0>: push %ecx >> 0xffffe421<__kernel_vsyscall+1>: push %edx >> 0xffffe422<__kernel_vsyscall+2>: push %ebp >> 0xffffe423<__kernel_vsyscall+3>: mov %esp,%ebp >> 0xffffe425<__kernel_vsyscall+5>: sysenter >> 0xffffe427<__kernel_vsyscall+7>: nop >> 0xffffe428<__kernel_vsyscall+8>: nop >> 0xffffe429<__kernel_vsyscall+9>: nop >> 0xffffe42a<__kernel_vsyscall+10>: nop >> 0xffffe42b<__kernel_vsyscall+11>: nop >> 0xffffe42c<__kernel_vsyscall+12>: nop >> 0xffffe42d<__kernel_vsyscall+13>: nop >> 0xffffe42e<__kernel_vsyscall+14>: jmp 0xffffe423<__kernel_vsyscall+3> >> 0xffffe430<__kernel_vsyscall+16>: pop %ebp >> 0xffffe431<__kernel_vsyscall+17>: pop %edx >> 0xffffe432<__kernel_vsyscall+18>: pop %ecx >> 0xffffe433<__kernel_vsyscall+19>: ret >> >>> I can throw the trimmed-down fs image your way, BTW (66MB of bzipped ext2 >>> ;-/) >>> if you want to see if that gets reproduced on your box. I'll drop it on >>> anonftp if you are interested. 
FWIW, the same kernel binary/same image
>>> result in
>>> * K7 box - no breakage, SYSENTER-based vdso
>>> * K8 box - breakage as described, SYSCALL-based vdso32
>>> * P4 box - no breakage, SYSENTER-based vdso32
>>> Hell knows... In theory that would seem to point towards
>>> ia32_cstar_target(),
>>> so I'm going to RTFS carefully through that animal.
>>
>> Now I'm testing with a Debian fs from:
>> http://fs.devloop.org.uk/filesystems/Debian-Squeeze/
>>
>>> The thing is, whatever happens happens when victim gets resumed inside
>>> vdso page. I'll try to dump PTRACE_SETREGS and see the values host
>>> kernel asked to set and work from there, but the interesting part is
>>> bloody hard to singlestep through - the victim is back to user mode and
>>> it is already traced by the guest kernel, so it's not as if we could
>>> attach host gdb to it and walk through that crap. And guest gdb is not
>>> going to be able to set breakpoints in there - vdso page is r/o...
>>
>> [ CC'ing lu...@mi... ]
>> Andy, do you have an idea?
>> You can find Al's original report here:
>> http://marc.info/?l=linux-kernel&m=131380315624244&w=2
>
> I'm missing a bit of the background. Is the user-on-UML app calling
> into a vdso entry provided by UML or into a vdso entry provided by the
> host?

UML/i386 reuses the host's vDSO page.
IOW it does not have its own vDSO like UML/x86_64.

Thanks,
//richard
|
From: Linus T. <tor...@li...> - 2011-08-22 01:41:50
|
On Sun, Aug 21, 2011 at 6:16 PM, Al Viro <vi...@ze...> wrote:
>
> Is that ability a part of userland ABI or are we declaring that hopelessly
> wrong and require to go through the function in vdso32? Linus?

If people are using syscall directly, we're pretty much stuck. No
amount of "that's hopelessly wrong" will ever matter. We don't break
existing binaries.

That said, I'd *hope* that everybody uses the vdso32, simply because
user programs are not supposed to know which CPU they are running on
and if that CPU even *supports* the syscall instruction. In which case
it may be possible that we can play games with the vdso thing. But
that really would be conditional on "nobody ever reports a failure".

But if that's possible, maybe we can increment the RIP by 2 for
'syscall', and slip an "int 0x80" after the syscall instruction in
the vdso there? Resulting in the same pseudo-solution I suggested for
sysenter...

                       Linus
|
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-22 04:08:21
|
On Sun, Aug 21, 2011 at 06:41:16PM -0700, Linus Torvalds wrote:
> On Sun, Aug 21, 2011 at 6:16 PM, Al Viro <vi...@ze...> wrote:
> >
> > Is that ability a part of userland ABI or are we declaring that hopelessly
> > wrong and require to go through the function in vdso32? Linus?
>
> If people are using syscall directly, we're pretty much stuck. No
> amount of "that's hopelessly wrong" will ever matter. We don't break
> existing binaries.

There's a funny part, though - such binary won't work on 32bit kernel.
AFAICS, we never set MSR_*STAR on 32bit kernels (and native 32bit vdso
doesn't provide a SYSCALL-based variant).

So if we really consider such SYSCALL outside of vdso32 kosher, shouldn't
we do something with entry_32.S as well? I don't think it's worth doing,
TBH...

Again, I very much hope that binaries with such stray SYSCALL simply do
not exist. In theory it's possible to write one, but...

IIRC, the reason we never had SYSCALL support in 32bit kernel was the utter
lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
and (again, IIRC) it had some differences in SYSCALL semantics compared to
K7 (which supports SYSENTER as well). Bugger if I remember what those
differences might've been... Some flag not cleared?
|
From: Andrew L. <lu...@mi...> - 2011-08-20 21:40:29
|
On Sat, Aug 20, 2011 at 5:26 PM, Andrew Lutomirski <lu...@mi...> wrote:
> On Sat, Aug 20, 2011 at 4:55 PM, Richard Weinberger <ri...@no...> wrote:
> I'm missing a bit of the background. Is the user-on-UML app calling
> into a vdso entry provided by UML or into a vdso entry provided by the
> host?
>
> Why does anything care whether ecx is saved? Doesn't the default
> calling convention allow the callee to clobber ecx?
>
> But my guess is that the 64-bit host sysret code might be buggy (or
> the value in gs:whatever is wrong). Can you get gdb to breakpoint at
> the beginning of __kernel_vsyscall before the crash?
>

This is suspicious:

ENTRY(ia32_cstar_target)
        CFI_STARTPROC32 simple
        CFI_SIGNAL_FRAME
        CFI_DEF_CFA     rsp,KERNEL_STACK_OFFSET
        CFI_REGISTER    rip,rcx
        /*CFI_REGISTER  rflags,r11*/
        SWAPGS_UNSAFE_STACK
        movl    %esp,%r8d
        CFI_REGISTER    rsp,r8
        movq    PER_CPU_VAR(kernel_stack),%rsp
        /*
         * No need to follow this irqs on/off section: the syscall
         * disabled irqs and here we enable it straight after entry:
         */
        ENABLE_INTERRUPTS(CLBR_NONE)
        SAVE_ARGS 8,0,0
        movl    %eax,%eax       /* zero extension */
        movq    %rax,ORIG_RAX-ARGOFFSET(%rsp)
        movq    %rcx,RIP-ARGOFFSET(%rsp)
        CFI_REL_OFFSET rip,RIP-ARGOFFSET
        movq    %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */

The entry code looks something like:

The text of __kernel_vsyscall() is
   0xffffe420 <__kernel_vsyscall+0>:    push   %ebp
   0xffffe421 <__kernel_vsyscall+1>:    mov    %ecx,%ebp
   0xffffe423 <__kernel_vsyscall+3>:    syscall
   0xffffe425 <__kernel_vsyscall+5>:    mov    $0x2b,%ecx
   0xffffe42a <__kernel_vsyscall+10>:   mov    %ecx,%ss
   0xffffe42c <__kernel_vsyscall+12>:   mov    %ebp,%ecx
   0xffffe42e <__kernel_vsyscall+14>:   pop    %ebp
   0xffffe42f <__kernel_vsyscall+15>:   ret

so the line:

        movq    %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */

will cause iret (if iret happens) to restore the original rbp in rcx
(why? -- it seems okay if syscall is hit in __kernel_vsyscall but not
if something else does the syscall). I don't see what saves rbp to
the stack frame.

This is also suspicious:

        movq    %r11,EFLAGS-ARGOFFSET(%rsp)

that's inconsistent with my reading of the AMD manual.

How well is the compat syscall entry tested through both the fast and
slow paths? UML is unusual in that it uses ptrace to trap all system
calls, right? That means that syscalls will enter through the cstar
target but return through the iret path.

--Andy
|