From: Al V. <vi...@ft...> - 2011-08-18 18:59:07
|
My apologies for mailbomb from hell. *All* this stuff is available in git://git.kernel.org/pub/scm/linux/kernel/git/viro/um-header.git/ #master, but since uml folks had been stuck with mail and patch for a long time... Anyway, most of the stuff in this pile is merging, cleaning and mutilating subarchitecture-related code in arch/um. By the end of it we have x86 bits largely merged between 32bit and 64bit variants and taken to arch/x86/um; headers seriously cleaned up and mostly free of x86-isms now (not completely - we still have page size dependencies in there). Beginning of the series are pure build and driver fixes; those should go to Linus before 3.1-final, IMO. As far as I know, there's no regressions introduced by that pile; testing and comments would be, of course, welcome. |
From: Richard W. <ri...@no...> - 2011-08-18 19:12:55
|
Al,

On 18.08.2011 20:58, Al Viro wrote:
>
> My apologies for mailbomb from hell. *All* this stuff is available in
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/um-header.git/ #master,
> but since uml folks had been stuck with mail and patch for a long time...

Have you touched your patches since yesterday?
I've already pulled and uploaded them to my shiny new git repo at:
git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable

Due to the current mirroring problems with git.kernel.org I have not
made it public.
Sorry for that, I screwed it up. :-(

> Anyway, most of the stuff in this pile is merging, cleaning and mutilating
> subarchitecture-related code in arch/um. By the end of it we have x86
> bits largely merged between 32bit and 64bit variants and taken to arch/x86/um;
> headers seriously cleaned up and mostly free of x86-isms now (not completely -
> we still have page size dependencies in there).
>
> Beginning of the series are pure build and driver fixes; those should go to
> Linus before 3.1-final, IMO.

Okay.

> As far as I know, there's no regressions introduced by that pile; testing
> and comments would be, of course, welcome.
>

There was a small build regression, I've already fixed it!

Thanks,
//richard |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-18 19:19:53
|
On Thu, Aug 18, 2011 at 09:12:47PM +0200, Richard Weinberger wrote: > Have you touched your patches since yesterday? > I've already pulled and uploaded them to my shiny new git repo at: > git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable Reordered and added missing S-o-b on a couple, split one commit. |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 06:34:56
|
On Sat, Aug 20, 2011 at 05:40:03PM -0400, Andrew Lutomirski wrote:

> will cause iret (if iret happens) to restore the original rbp in rcx
> (why? -- it seems okay if syscall is hit in __kernel_vsyscall but not
> if something else does the syscall). I don't see what saves rbp to
> the stack frame.

Far more interesting question is how the hell does that thing manage to work
in face of syscall restarts? As the matter of fact, how does it (and
sysenter-based variant) play with ptrace() *and* restarts?

Suppose we have a traced process. foo6() is called and the thing is
stopped before the sys_foo6() is reached kernel-side. The sixth argument
is on stack, ebp is set to user esp. SYSENTER happens, we read the
6th argument from userland stack and put it along with the rest into
pt_regs. tracer examines the arguments, modifies them (including the last
one) and lets the tracee run free - e.g. detaches from the tracee.

What should happen if we happen to get a signal that would restart that
sucker? Granted, it's not going to happen with mmap() - it doesn't, AFAICS,
do anything of that kind. However, I wouldn't bet a dime on other 6-argument
syscalls not stepping on that. sendto() and recvfrom(), in particular...

OK, we return to userland. The sixth argument is placed into %ebp. Linus'
"pig and proud of that" trick works and we end up slapping userland
%esp into %ebp and hitting SYSENTER again. Only one problem, though -
the sixth argument on user stack is completely unaffected by what tracer
had done. Unlike the rest of arguments, that *are* changed.

We could deal with that in case of SYSENTER if we e.g. replaced that
	jmp .Lenter_kernel
with
	jmp .Lrestart
and added
.Lrestart:
	movl %ebp, (%esp)
	jmp .Lenter_kernel
but in case of SYSCALL it seems to be even messier... Comments?

... and there I thought that last year session of asm glue sniffing couldn't
be topped by anything more unpleasant ;-/ |
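[Editor's illustration] The SYSENTER scenario in the message above can be traced with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `struct user_state`, `struct regs_model` and every function name below are hypothetical, and only the sixth argument is modelled. It follows the steps as described (entry reads arg6 from the user stack, the tracer changes the saved copy, the return path puts that copy in %ebp, the stub re-enters) and compares the stub as it stands with the proposed `.Lrestart` variant.

```c
#include <stdint.h>

/* Hypothetical model of the userland side (only what matters for arg6). */
struct user_state {
    uint32_t ebp;         /* user %ebp */
    uint32_t stack_arg6;  /* word at (%esp): 6th argument pushed by the stub */
};

/* Hypothetical stand-in for the arg6 slot in pt_regs. */
struct regs_model {
    uint32_t arg6;
};

/* SYSENTER entry: the kernel fetches arg6 from the user stack. */
static void sysenter_entry(const struct user_state *u, struct regs_model *r)
{
    r->arg6 = u->stack_arg6;
}

/* The tracer rewrites the saved argument while the tracee is stopped. */
static void tracer_set_arg6(struct regs_model *r, uint32_t value)
{
    r->arg6 = value;
}

/* Return to userland before the restart: saved arg6 lands in %ebp. */
static void return_to_user(struct user_state *u, const struct regs_model *r)
{
    u->ebp = r->arg6;
}

/*
 * Returns the arg6 value the kernel sees on the restarted call.
 * with_fix == 0: stub jumps straight back to .Lenter_kernel.
 * with_fix != 0: stub first does "movl %ebp, (%esp)" as proposed.
 */
uint32_t arg6_seen_after_restart(uint32_t original, uint32_t traced, int with_fix)
{
    struct user_state u = { .ebp = 0, .stack_arg6 = original };
    struct regs_model r;

    sysenter_entry(&u, &r);        /* first entry */
    tracer_set_arg6(&r, traced);   /* tracer modifies the 6th argument */
    return_to_user(&u, &r);        /* signal, restart requested */

    if (with_fix)
        u.stack_arg6 = u.ebp;      /* .Lrestart: movl %ebp, (%esp) */

    /* .Lenter_kernel: %ebp is reloaded with %esp, SYSENTER again */
    sysenter_entry(&u, &r);
    return r.arg6;
}
```

With `original = 1` and `traced = 2`, the unmodified stub makes the restarted call see 1 again (the tracer's change is lost), while the `.Lrestart` variant makes it see 2.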
From: Andrew L. <lu...@mi...> - 2011-08-22 02:02:07
|
On Sun, Aug 21, 2011 at 9:48 PM, H. Peter Anvin <hp...@zy...> wrote:
> On 08/21/2011 06:41 PM, Linus Torvalds wrote:
>> If people are using syscall directly, we're pretty much stuck. No
>> amount of "that's hopelessly wrong" will ever matter. We don't break
>> existing binaries.
>>
>> That said, I'd *hope* that everybody uses the vdso32, simply because
>> user programs are not supposed to know which CPU they are running on
>> and if that CPU even *supports* the syscall instruction. In which case
>> it may be possible that we can play games with the vdso thing. But
>> that really would be conditional on "nobody ever reports a failure".
>
> I think we found that out with the vsyscall emulation issue last cycle.
> It works, so it will have been used, somewhere...
>
>> But if that's possible, maybe we can increment the RIP by 2 for
>> 'syscall', and slip an "'int 0x80" after the syscall instruction in
>> the vdso there? Resulting in the same pseudo-solution I suggested for
>> sysenter...
>
> I think we have the above problem.
>
> The problem here is that the syscall state is actually more complex than
> we retain: the entire state is given by (entry point, register state);
> with that amount of state we have all the information needed to *either*
> extract the syscall arguments *or* the register contents. Without
> those, we can only represent one of the two possible metalevels (right
> now we represent the higher-level metalevel, the argument vector), but
> we need both for different usages.

My understanding of the problem is the following:

1. The SYSCALL 32-bit calling convention puts arg2 in ebp and arg6 on the stack.

2. The int 0x80 convention is different: arg2 is in ecx.

3. We're worried that pt_regs-using compat syscalls might want the
regs to appear to match the actual arguments (why?)

4. ptrace expects the "registers" when SYSCALL happens to match the
int 0x80 convention. (This is, IMO, sick.)

5. Syscall restart with the SYSCALL instruction must switch to
userspace and back to the kernel for reasons I don't understand that
presumably involve signal delivery.

6. Existing ABI requires that the kernel not clobber syscall arguments
(except, of course, when ptrace or syscall restart explicitly change
those arguments).

So we're sort of screwed. arg2 must be in ecx to keep ptrace happy
but SYSCALL clobbers ecx, so arg2 cannot be preserved.

So here are three strawman ideas:

a) Change #4. Maybe it's too late to do this, though.

b) When SYSCALL happens, change RIP to point two bytes past an int
0x80 instruction in the vdso. Make the next instruction there be a
"ret" that returns to the instruction after the original syscall.
Patch the stack in the kernel.

c) Disable syscall restart when SYSCALL happens from somewhere outside the vdso.

--Andy |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-22 02:08:02
|
On Sun, Aug 21, 2011 at 10:01:40PM -0400, Andrew Lutomirski wrote: > 3. We're worried that pt_regs-using compat syscalls might want the > regs to appear to match the actual arguments (why?) run strace and you'll see why. > 4. ptrace expects the "registers" when SYSCALL happens to match the > int 0x80 convention. (This is, IMO, sick.) That's what ptrace is *for*. It's there to let debuggers play with the program being debugged, including taking a look at the syscall arguments and modifying them. In a predictable way, please. |
From: Andrew L. <lu...@mi...> - 2011-08-22 02:26:26
|
On Sun, Aug 21, 2011 at 10:07 PM, Al Viro <vi...@ze...> wrote: > On Sun, Aug 21, 2011 at 10:01:40PM -0400, Andrew Lutomirski wrote: > >> 3. We're worried that pt_regs-using compat syscalls might want the >> regs to appear to match the actual arguments (why?) > > run strace and you'll see why. > I'm talking about the implementations of stub32_rt_sigreturn, sys32_rt_sigreturn, stub32_sigreturn, stub32_sigaltstack, stub32_execve, stub32_fork, stub32_clone, stub32_vfork, and stub32_iopl. I don't know what this has to do with strace or user ABI at all. >> 4. ptrace expects the "registers" when SYSCALL happens to match the >> int 0x80 convention. (This is, IMO, sick.) > > That's what ptrace is *for*. It's there to let debuggers play with > the program being debugged, including taking a look at the syscall > arguments and modifying them. In a predictable way, please. > It may be necessary, but I still think it's sick. Especially in the case of inlined SYSCALL, where the registers reported to ptrace do not match any register values that ever actually existed in CPU registers. Too late to fix it, though. Which still leaves the question of how to fix it. Restarting via an int 0x80-based helper might be the only option that leaves everything fully functional. --Andy |
From: H. P. A. <hp...@zy...> - 2011-08-22 02:34:30
|
On 08/21/2011 07:26 PM, Andrew Lutomirski wrote: > > Which still leaves the question of how to fix it. Restarting via an > int 0x80-based helper might be the only option that leaves everything > fully functional. > The issue is that we don't represent the entire state ... we represent only one metalevel of it, currently the "cooked" one. The problem is that we need the "raw" one as well, and in order to have *both* we need to know the entry mechanism. We need that IN EITHER CASE. This is reasonably straightforward... we can carry the entry mechanism forward inside the kernel, and fix it up in the IRET path. The *really* big issue is what we drop as the sigcontext since this is an ABI carried out to userspace. We could just say "it's currently totally broken for SYSCALL" and just change it to drop the raw state, but which has the potential for breaking unknown programs, *or* we could add a bit of state (presumably by reclaiming one of the padding fields around cs and ss) ... which *also* has the potential for breaking programs. Right now, SYSCALL -> signal -> restart *is broken*, however, so there is also the option of just doing nothing in this case, I guess. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. |
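[Editor's illustration] The "cooked" versus "raw" distinction above can be pictured with a tiny C model. Everything here is hypothetical (the `entry_mech` tag, both structs and `raw_from_cooked()` are invented for illustration and are not kernel interfaces). It assumes only the register placement stated elsewhere in this thread: int 0x80 takes arg2 in ecx and arg6 in ebp, while 32-bit SYSCALL takes arg2 in ebp and arg6 on the user stack. The point of the sketch is that the argument vector alone cannot be turned back into register contents unless the entry mechanism is carried along with it.

```c
#include <stdint.h>

/* Hypothetical tag recording how the kernel was entered. */
enum entry_mech { ENTRY_INT80, ENTRY_SYSCALL };

/* "Cooked" view: the argument vector (only arg2 and arg6 modelled),
 * plus the entry mechanism that the proposal would carry forward. */
struct cooked_state {
    uint32_t arg2;
    uint32_t arg6;
    enum entry_mech entry;
};

/* "Raw" view: what userland must hold to re-issue the same call. */
struct raw_state {
    uint32_t ecx;
    uint32_t ebp;
    uint32_t stack_word;  /* word at (%esp) */
};

/* Rebuild the raw state from the cooked one; fields that the entry
 * mechanism does not use for arguments are left as they were. */
struct raw_state raw_from_cooked(struct cooked_state c, struct raw_state old)
{
    struct raw_state r = old;

    if (c.entry == ENTRY_INT80) {
        r.ecx = c.arg2;          /* int 0x80: arg2 in ecx */
        r.ebp = c.arg6;          /* int 0x80: arg6 in ebp */
    } else {
        r.ebp = c.arg2;          /* SYSCALL: arg2 in ebp */
        r.stack_word = c.arg6;   /* SYSCALL: arg6 on the user stack */
        /* ecx is overwritten by SYSCALL itself, so it is not an argument */
    }
    return r;
}
```

The same cooked pair `(arg2, arg6)` maps to two different raw layouts depending on the tag, which is why a fix-up on the return path would need that extra piece of state.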
From: H. P. A. <hp...@zy...> - 2011-08-22 04:12:09
|
On 08/21/2011 09:07 PM, Al Viro wrote:
> On Sun, Aug 21, 2011 at 06:41:16PM -0700, Linus Torvalds wrote:
>> On Sun, Aug 21, 2011 at 6:16 PM, Al Viro <vi...@ze...> wrote:
>>>
>>> Is that ability a part of userland ABI or are we declaring that hopelessly
>>> wrong and require to go through the function in vdso32?  Linus?
>>
>> If people are using syscall directly, we're pretty much stuck. No
>> amount of "that's hopelessly wrong" will ever matter. We don't break
>> existing binaries.
>
> There's a funny part, though - such binary won't work on 32bit kernel.
> AFAICS, we never set MSR_*STAR on 32bit kernels (and native 32bit vdso
> doesn't provide a SYSCALL-based variant).
>
> So if we really consider such SYSCALL outside of vdso32 kosher, shouldn't
> we do something with entry_32.S as well? I don't think it's worth doing,
> TBH...
>
> Again, I very much hope that binaries with such stray SYSCALL simply do
> not exist. In theory it's possible to write one, but...
>
> IIRC, the reason we never had SYSCALL support in 32bit kernel was the utter
> lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
> and (again, IIRC) it had some differences in SYSCALL semantics compared to
> K7 (which supports SYSENTER as well). Bugger if I remember what those
> differences might've been... Some flag not cleared?

The most likely reason for a binary to execute a stray SYSCALL is
because they read it out of the vdso. Totally daft, but we certainly
see a lot of stupid things as evidenced by the JIT thread earlier this
month.

In that sense, a "safe" thing would be to drop use of SYSCALL for 32-bit
processes... I just sent Borislav a query about the cost.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf. |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-22 04:26:23
|
On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote: > > lack of point - the *only* CPU where it would matter would be K6-2, IIRC, > > and (again, IIRC) it had some differences in SYSCALL semantics compared to > > K7 (which supports SYSENTER as well). Bugger if I remember what those > > differences might've been... Some flag not cleared? > > The most likely reason for a binary to execute a stray SYSCALL is > because they read it out of the vdso. Totally daft, but we certainly > see a lot of stupid things as evidenced by the JIT thread earlier this > month. Um... What, blindly, no matter what surrounds it in there? What will happen to the same eager JIT when it steps on SYSENTER? |
From: H. P. A. <hp...@zy...> - 2011-08-22 05:03:54
|
On 08/21/2011 09:26 PM, Al Viro wrote: > On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote: >>> lack of point - the *only* CPU where it would matter would be K6-2, IIRC, >>> and (again, IIRC) it had some differences in SYSCALL semantics compared to >>> K7 (which supports SYSENTER as well). Bugger if I remember what those >>> differences might've been... Some flag not cleared? >> >> The most likely reason for a binary to execute a stray SYSCALL is >> because they read it out of the vdso. Totally daft, but we certainly >> see a lot of stupid things as evidenced by the JIT thread earlier this >> month. > > Um... What, blindly, no matter what surrounds it in there? What will > happen to the same eager JIT when it steps on SYSENTER? The JIT will have had to manage SYSENTER already. It's not a change, whereas SYSCALL would be. We could just try it, and see if anything breaks, of course. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. |
From: Andrew L. <luto@MIT.EDU> - 2011-08-23 05:10:55
|
On 08/22/2011 01:03 AM, H. Peter Anvin wrote:
> On 08/21/2011 09:26 PM, Al Viro wrote:
>> On Sun, Aug 21, 2011 at 09:11:54PM -0700, H. Peter Anvin wrote:
>>>> lack of point - the *only* CPU where it would matter would be K6-2, IIRC,
>>>> and (again, IIRC) it had some differences in SYSCALL semantics compared to
>>>> K7 (which supports SYSENTER as well). Bugger if I remember what those
>>>> differences might've been... Some flag not cleared?
>>>
>>> The most likely reason for a binary to execute a stray SYSCALL is
>>> because they read it out of the vdso. Totally daft, but we certainly
>>> see a lot of stupid things as evidenced by the JIT thread earlier this
>>> month.
>>
>> Um... What, blindly, no matter what surrounds it in there? What will
>> happen to the same eager JIT when it steps on SYSENTER?
>
> The JIT will have had to manage SYSENTER already. It's not a change,
> whereas SYSCALL would be. We could just try it, and see if anything
> breaks, of course.

Here's a possible solution that works for standalone SYSCALL and vdso
SYSCALL. The idea is to preserve the exact same SYSCALL invocation
sequence. Logically, the SYSCALL instruction does:

  push %ebp
  mov %ebp,%ecx
  mov 4(%esp),%ebp
  call __fake_int80

and __fake_int80 is:

  int 0x80
  mov 4(%esp),%ebp
  ret $4

The entire system call sequence is then (effectively):

  push %ebp
  movl %ecx,%ebp

  ; "SYSCALL" starts here
  push %ebp
  mov %ebp,%ecx
  mov 4(%esp),%ebp
  call __fake_int80
  ; "SYSCALL" ends here

  movl %ebp,%ecx
  popl %ebp
  ret

So we rearrange ebp and ecx and then immediately rearrange them back.
The landing point tweaks them again so that we preserve the old
semantics of SYSCALL. But now the pt_regs values exactly match what
would have happened if we entered via the int 0x80 path, so there
shouldn't be any corner cases with ptrace or restart -- as far as either
one is concerned, we actually entered via int 0x80. If we deliver a
signal, the signal handler returns to the int 0x80 instruction.

Am I missing something?

Extremely buggy, incomplete code that implements this is:

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index a0e866d..6cda8ce 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -291,24 +291,59 @@ ENTRY(ia32_cstar_target)
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	SAVE_ARGS 8,0,0
 	movl	%eax,%eax	/* zero extension */
-	movq	%rax,ORIG_RAX-ARGOFFSET(%rsp)
-	movq	%rcx,RIP-ARGOFFSET(%rsp)
-	CFI_REL_OFFSET rip,RIP-ARGOFFSET
-	movq	%rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */
-	movl	%ebp,%ecx
+
+	/*
+	 * This does (from the user's point of view):
+	 *   push %ebp
+	 *   mov %ebp, %ecx
+	 *   mov 4(%esp), %ebp
+	 *   call <function that does int 0x80; mov 4(%esp),%ebp; ret 4>
+	 *
+	 * User address access does not need access_ok check as r8
+	 * has been zero-extended, so even with the offsets it cannot
+	 * exceed 2**32 + 8.
+	 */
+
+	/* XXX: need to check that vdso actually exists. */
+	/* XXX: ia32_badarg may do bad things to the user state. */
+
+	/* move ebp into place on the user stack */
+	1:	movl	%ebp,-4(%r8)
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous
+
+	/* move eip into place on the user stack */
+	1:	movl	%ecx,-8(%r8)	/* user eip is in ecx */
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous
+
+	/* move ebp to ecx in CPU registers and argument save area */
+	mov	%ebp,%ecx
+	movq	%ecx,RCX-ARGOFFSET(%rsp)
+
+	/*
+	 * move arg6 to ebp in CPU registers and argument save area
+	 * minor optimization: the actual value of ebp is irrelevent,
+	 * so stick it straight into r9d -- see the definition of
+	 * IA32_ARG_FIXUP.
+	 */
+1:	movl	(%r8),%r9d
+	.section __ex_table,"a"
+	.quad 1b,ia32_badarg
+	.previous
+
+	/* Do the fake call */
+	movl	[insert address of int 0x80; ret helper + 2 here],RIP-ARGOFFSET(%rsp)
+	subl	$8,%r8	/* we pushed twice */
+
 	movq	$__USER32_CS,CS-ARGOFFSET(%rsp)
 	movq	$__USER32_DS,SS-ARGOFFSET(%rsp)
 	movq	%r11,EFLAGS-ARGOFFSET(%rsp)
 	/*CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/
 	movq	%r8,RSP-ARGOFFSET(%rsp)
 	CFI_REL_OFFSET rsp,RSP-ARGOFFSET
-	/* no need to do an access_ok check here because r8 has been
-	   32bit zero extended */
-	/* hardware stack frame is complete now */
-1:	movl	(%r8),%r9d
-	.section __ex_table,"a"
-	.quad 1b,ia32_badarg
-	.previous
 	GET_THREAD_INFO(%r10)
 	orl	$TS_COMPAT,TI_status(%r10)
 	testl	$_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
index 5415b56..a3e48b0 100644
--- a/arch/x86/vdso/vdso32/syscall.S
+++ b/arch/x86/vdso/vdso32/syscall.S
@@ -19,8 +19,8 @@ __kernel_vsyscall:
 .Lpush_ebp:
 	movl	%ecx, %ebp
 	syscall
-	movl	$__USER32_DS, %ecx
-	movl	%ecx, %ss
+	/* The ret in the fake int80 entry lands here */
+	/* ss is already correct AFAICS */
 	movl	%ebp, %ecx
 	popl	%ebp
 .Lpop_ebp:
@@ -28,6 +28,11 @@ __kernel_vsyscall:
 .LEND_vsyscall:
 	.size __kernel_vsyscall,.-.LSTART_vsyscall
 
+__kernel_vsyscall_fake_int80:
+	int 0x80
+	mov 4(%esp),%ebp
+	ret $4
+
 	.section .eh_frame,"a",@progbits
 .LSTARTFRAME:
 	.long .LENDCIE-.LSTARTCIE

This could be further simplified by checking if any work flags are set
and bailing immediately to the right place in the int 0x80 entry.

--Andy |
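[Editor's illustration] The register shuffle proposed above can be checked by stepping through it in a toy C model. This is a sketch under assumptions, not an emulator: `struct cpu`, `run_vsyscall()` and the helper names are hypothetical, only ecx, ebp and a small stack are modelled, and the return address pushed by `call` is a placeholder value. The caller is assumed to enter `__kernel_vsyscall` with arg2 in ecx and arg6 in ebp.

```c
#include <stdint.h>

/* Hypothetical toy CPU: two registers and a small downward-growing stack. */
struct cpu {
    uint32_t ecx, ebp;
    uint32_t stack[8];
    int sp;                      /* index of the top-of-stack word */
};

static void push(struct cpu *c, uint32_t v) { c->stack[--c->sp] = v; }
static uint32_t pop(struct cpu *c) { return c->stack[c->sp++]; }
/* peek(c, n) models n*4(%esp) */
static uint32_t peek(const struct cpu *c, int words) { return c->stack[c->sp + words]; }

struct result {
    uint32_t int80_arg2;         /* ecx as seen at the int 0x80 */
    uint32_t int80_arg6;         /* ebp as seen at the int 0x80 */
    uint32_t ecx_out, ebp_out;   /* registers when the stub returns */
    int sp_out;                  /* stack index when the stub returns */
};

struct result run_vsyscall(uint32_t arg2, uint32_t arg6)
{
    struct cpu c = { .ecx = arg2, .ebp = arg6, .sp = 8 };
    struct result r;

    push(&c, c.ebp);             /* push %ebp */
    c.ebp = c.ecx;               /* movl %ecx,%ebp */

    /* "SYSCALL" starts here (as the kernel would emulate it) */
    push(&c, c.ebp);             /* push %ebp */
    c.ecx = c.ebp;               /* mov %ebp,%ecx */
    c.ebp = peek(&c, 1);         /* mov 4(%esp),%ebp */
    push(&c, 0xdeadbeefu);       /* call __fake_int80 (placeholder return address) */

    r.int80_arg2 = c.ecx;        /* int 0x80: what pt_regs would record */
    r.int80_arg6 = c.ebp;
    c.ebp = peek(&c, 1);         /* mov 4(%esp),%ebp */
    (void)pop(&c);               /* ret $4: pop the return address ... */
    (void)pop(&c);               /* ... and discard 4 more bytes */
    /* "SYSCALL" ends here */

    c.ecx = c.ebp;               /* movl %ebp,%ecx */
    c.ebp = pop(&c);             /* popl %ebp */

    r.ecx_out = c.ecx;
    r.ebp_out = c.ebp;
    r.sp_out = c.sp;
    return r;
}
```

In this model the int 0x80 sees arg2 in ecx and arg6 in ebp, and the stub returns with ecx, ebp and the stack pointer back at their entry values, which matches the claim that the two rearrangements cancel out.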
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 08:42:52
|
On Sun, Aug 21, 2011 at 07:34:43AM +0100, Al Viro wrote:

> Suppose we have a traced process. foo6() is called and the thing is
> stopped before the sys_foo6() is reached kernel-side. The sixth argument
> is on stack, ebp is set to user esp. SYSENTER happens, we read the
> 6th argument from userland stack and put it along with the rest into
> pt_regs. tracer examines the arguments, modifies them (including the last
> one) and lets the tracee run free - e.g. detaches from the tracee.
>
> What should happen if we happen to get a signal that would restart that
> sucker? Granted, it's not going to happen with mmap() - it doesn't, AFAICS,
> do anything of that kind. However, I wouldn't bet a dime on other 6-argument
> syscalls not stepping on that. sendto() and recvfrom(), in particular...
>
> OK, we return to userland. The sixth argument is placed into %ebp. Linus'
> "pig and proud of that" trick works and we end up slapping userland
> %esp into %ebp and hitting SYSENTER again. Only one problem, though -
> the sixth argument on user stack is completely unaffected by what tracer
> had done. Unlike the rest of arguments, that *are* changed.
>
> We could deal with that in case of SYSENTER if we e.g. replaced that
> 	jmp .Lenter_kernel
> with
> 	jmp .Lrestart
> and added
> .Lrestart:
> 	movl %ebp, (%esp)
> 	jmp .Lenter_kernel
> but in case of SYSCALL it seems to be even messier... Comments?

Oh, hell... Compat SYSCALL one is really buggered on syscall restarts,
ptrace or no ptrace. Look: calling conventions for SYSCALL are
	arg1..5: ebx, ebp, edx, edi, esi. arg6: stack
and after syscall restart we end up with
	arg1..5: ebx, ecx, edx, edi, esi. arg6: ebp
so restart will instantly clobber arg2, in effect replacing it with arg6.

And yes, adding ptrace to the mix makes things even uglier. For one thing,
changes to ECX via ptrace are completely lost on the fast exit. Not pretty,
and might make life painful for uml, but not for the majority of programs.
What's worse, combination of ptrace with restart will lose changes to arg6
(again, value on stack left as it was, changes to arg6 by tracer lost) *and*
it will lose changes to arg2 (along with arg2 itself - see above).

Linus' Dirty Trick(tm) is not trivial to apply - with SYSCALL we *do* retain
the address of next insn and that's where we end up going. IOW, SYSCALL not
inside vdso32 currently works (for small values of "works", due to restart
issues). Playing with return elsewhere might break some userland code...

Guys, that's *way* out of the area I'm comfortable with. |
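[Editor's illustration] The "restart clobbers arg2" observation above follows directly from the two register layouts, and a short C model makes the substitution visible. The names (`struct user_regs`, `struct saved_regs`, `arg2_after_restart()`) are hypothetical and only arg2 and arg6 are modelled. The sketch assumes what the message states: SYSCALL entry reads arg2 from ebp and arg6 from the user stack, the saved registers are laid out as for int 0x80 (arg2 in the ecx slot, arg6 in the ebp slot), and the restart re-executes SYSCALL after user registers have been reloaded from that saved copy.

```c
#include <stdint.h>

/* Hypothetical model of the userland register state. */
struct user_regs {
    uint32_t ecx, ebp;
    uint32_t stack_word;          /* word at (%esp) */
};

/* Hypothetical model of the saved registers, in int 0x80 layout. */
struct saved_regs {
    uint32_t cx;                  /* arg2 slot */
    uint32_t bp;                  /* arg6 slot */
};

/* Compat SYSCALL entry: arg2 comes from ebp, arg6 from the user stack. */
static struct saved_regs syscall_entry(const struct user_regs *u)
{
    struct saved_regs r = { .cx = u->ebp, .bp = u->stack_word };
    return r;
}

/* Return to userland for the restart: registers reloaded from the saved copy. */
static void restore_user(struct user_regs *u, const struct saved_regs *r)
{
    u->ecx = r->cx;
    u->ebp = r->bp;
}

/* Value the kernel takes as arg2 when the SYSCALL is re-executed. */
uint32_t arg2_after_restart(uint32_t arg2, uint32_t arg6)
{
    struct user_regs u = { .ecx = 0, .ebp = arg2, .stack_word = arg6 };
    struct saved_regs r = syscall_entry(&u);   /* first entry */

    restore_user(&u, &r);                      /* ebp now holds arg6 */
    r = syscall_entry(&u);                     /* restarted SYSCALL */
    return r.cx;
}
```

For example, `arg2_after_restart(2, 6)` yields 6 in this model: the restarted call sees arg6 where arg2 should be.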
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-19 04:31:28
|
On Thu, Aug 18, 2011 at 08:19:46PM +0100, Al Viro wrote:
> On Thu, Aug 18, 2011 at 09:12:47PM +0200, Richard Weinberger wrote:
> > Have you touched your patches since yesterday?
> > I've already pulled and uploaded them to my shiny new git repo at:
> > git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable
>
> Reordered and added missing S-o-b on a couple, split one commit.

Umm... One comment after looking at your tree: you probably want to rebase
for-3.2 on top of fixes (and presumably feed it to sfr for inclusion into
linux-next). And for pity's sake, do *not* merge from Linus every day; that's
one sure way to get yourself flamed into crisp.

Just trying to figure out what's in your tree is a _hard_ exercise. git cherry
between Linus' tree and e.g. #fixes in yours gives a long list of commits,
most of them _probably_ duplicates of the stuff in mainline. What are bnx2
patches doing in there, for example?

I've tried to figure out what's going on in there; AFAICS, your #fixes is
mainline plus

Al Viro (6):
      um: fix oopsable race in line_close()
      um: winch_interrupt() can happen inside of free_winch()
      um: fix free_winch() mess
      um: PTRACE_[GS]ETFPXREGS had been wired on the wrong subarch
      um: fix strrchr problems
      um: clean arch_ptrace() up a bit

Ingo van Lil (1):
      um: Save FPU registers between task switches

Jonathan Neuschäfer (3):
      UserModeLinux-HOWTO.txt: fix a typo
      um: drivers/xterm.c: fix a file descriptor leak
      UserModeLinux-HOWTO.txt: remove ^H characters

Thadeu Lima de Souza Cascardo (1):
      um: disable CMPXCHG_DOUBLE as it breaks UML build

I've cherry-picked those on top of the same branchpoint; see #cleaned-fixes
in um-headers.git. AFAICS, that's the same contents as your #fixes, with
clean history. Diff against your #fixes consists of

-	.irq_set_type = pmic_irq_type,
 <<<<<<< HEAD
-	.irq_bus_lock = pmic_irq_buslock,
+	.irq_set_type = pmic_irq_type,
+	.irq_bus_lock = pmic_bus_lock,

in drivers/platform/x86/intel_pmic_gpio.c, which is an obvious mismerge
(AFAICS, on May 29).

IME the sane policy is to keep for-linus, pulling into it when Linus pulls
from you. At that point it's a fast-forward and all previous history is not
cluttering the things up anymore. for-next I rebase and reorder at will, TBH,
but generally I start it at the current tip of for-linus.

Beyond what you've got in #for-3.2 I have a couple of commits, but that can
wait until the history is sorted out. As it is, I 100% guarantee that pull
request on your #fixes as it is will result in pyrotechnical effects from
hell (OK, from Linus, actually, but in this case there won't be any real
difference). |
From: Richard W. <ri...@no...> - 2011-08-19 08:52:03
|
On 19.08.2011 06:31, Al Viro wrote:
> On Thu, Aug 18, 2011 at 08:19:46PM +0100, Al Viro wrote:
>> On Thu, Aug 18, 2011 at 09:12:47PM +0200, Richard Weinberger wrote:
>>> Have you touched your patches since yesterday?
>>> I've already pulled and uploaded them to my shiny new git repo at:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/rw/linux-um.git unstable
>>
>> Reordered and added missing S-o-b on a couple, split one commit.
>
> Umm... One comment after looking at your tree: you probably want to rebase
> for-3.2 on top of fixes (and presumably feed it to sfr for inclusion into
> linux-next).

Please slow down a bit. :-)
All these branches are just for testing purposes.
That's why I have not announced them nor sent a pull request to Linus.

Anyway, thanks for the hints!

Thanks,
//richard |
From: Andrew L. <lu...@mi...> - 2011-08-21 11:25:04
|
On Sun, Aug 21, 2011 at 4:42 AM, Al Viro <vi...@ze...> wrote: > On Sun, Aug 21, 2011 at 07:34:43AM +0100, Al Viro wrote: >> Suppose we have a traced process. foo6() is called and the thing it >> stopped before the sys_foo6() is reached kernel-side. The sixth argument >> is on stack, ebp is set to user esp. SYSENTER happens, we read the >> 6th argument from userland stack and put it along with the rest into >> pt_regs. tracer examines the arguments, modifies them (including the last >> one) and lets the tracee run free - e.g. detaches from the tracee. >> >> What should happen if we happen to get a signal that would restart that >> sucker? Granted, it's not going to happen with mmap() - it doesn't, AFAICS, >> do anything of that kind. However, I wouldn't bet a dime on other 6-argument >> syscalls not stepping on that. sendto() and recvfrom(), in particular... >> >> OK, we return to userland. The sixth argument is placed into %ebp. Linus' >> "pig and proud of that" trick works and we end up slapping userland >> %esp into %ebp and hitting SYSENTER again. Only one problem, though - >> the sixth argument on user stack is completely unaffected by what tracer >> had done. Unlike the rest of arguments, that *are* changed. >> >> We could deal with that in case of SYSENTER if we e.g. replaced that >> jmp .Lenter_kernel >> with >> jmp .Lrestart >> and added >> .Lrestart: >> movl %ebp, (%esp) >> jmp .Lenter_kernel >> but in case of SYSCALL it seems to be even messier... Comments? > > Oh, hell... Compat SYSCALL one is really buggered on syscall restarts, > ptrace or no ptrace. Look: calling conventions for SYSCALL are > arg1..5: ebx, ebp, edx, edi, esi. arg6: stack > and after syscall restart we end up with > arg1..5: ebx, ecx, edx, edi, esi. arg6: ebp > so restart will instantly clobber arg2, in effect replacing it with arg6. > > And yes, adding ptrace to the mix makes things even uglier. For one thing, > changes to ECX via ptrace are completely lost on the fast exit. 
Not pretty, > and might make life painful for uml, but not for the majority of programs. > What's worse, combination of ptrace with restart will lose changes to arg6 > (again, value on stack left as it was, changes to arg6 by tracer lost) *and* > it will lose changes to arg2 (along with arg2 itself - see above). > > Linus' Dirty Trick(tm) is not trivial to apply - with SYSCALL we *do* retain > the address of next insn and that's where we end up going. IOW, SYSCALL not > inside vdso32 currently works (for small values of "works", due to restart > issues). Playing with return elsewhere might break some userland code... > > Guys, that's *way* out of the area I'm comfortable with. > I don't see the point of all this hackery at all. sysenter/sysexit indeed screws up some registers, but we can return on the iret path in the case of restart. So why do we lie to ptrace (and iret!) at all? Why not just fill in pt_regs with the registers as they were (at least the non-clobbered-by-sysenter ones), set the actual C parameters correctly to contain the six arguments (in rdi, rsi, etc.), do the syscall, and return back to userspace without any funny business? Is there some ABI reason that, once we've started lying to tracers, we have to keep doing so? --Andy |
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-20 01:18:55
|
On Fri, Aug 19, 2011 at 10:51:51AM +0200, Richard Weinberger wrote:

> Please slow down a bit. :-)
> All these branches are just for testing purposes.
> That's why I have not announced them nor sent a pull request to Linus.
>
> Anyway, thanks for the hints!

np... FWIW, there's a really ugly bug present in mainline as well as
in mainline + these patches and I'd welcome any help in figuring out
what's going on.

1) USER_OBJS do not see CONFIG_..., so os-Linux/main.c doesn't see
CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA. As the result, uml/i386 doesn't
notice that host vdso is there. That one is easy to fix:

-obj-$(CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA) += elf_aux.o
+ifeq ($(CONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA),y)
+obj-y += elf_aux.o
+CFLAGS_main.o += -DCONFIG_ARCH_REUSE_HOST_VSYSCALL_AREA
+endif

in arch/um/os-Linux/Makefile takes care of that. Unfortunately, it
also exposes a bug in fixrange_init():

2) fixrange_init() gets called with start (and end) not multiple of
PMD_SIZE; moreover, end is very close to the ~0UL - closer than by
PMD_SIZE. Bad things start happening to the loops in there. Again,
easy to fix:

diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 8137ccc..39ee674 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -119,19 +119,22 @@ static void __init fixrange_init(unsigned long start, unsigned long end,
 	int i, j;
 	unsigned long vaddr;
 
-	vaddr = start;
+	vaddr = start & PMD_MASK;
 	i = pgd_index(vaddr);
 	j = pmd_index(vaddr);
 	pgd = pgd_base + i;
+	start >>= PMD_SHIFT;
+	end = (end - 1) >> PMD_SHIFT;
 
-	for ( ; (i < PTRS_PER_PGD) && (vaddr < end); pgd++, i++) {
+	for ( ; (i < PTRS_PER_PGD) && start <= end; pgd++, i++) {
 		pud = pud_offset(pgd, vaddr);
 		if (pud_none(*pud))
 			one_md_table_init(pud);
 		pmd = pmd_offset(pud, vaddr);
-		for (; (j < PTRS_PER_PMD) && (vaddr < end); pmd++, j++) {
+		for (; (j < PTRS_PER_PMD) && start <= end; pmd++, j++) {
 			one_page_table_init(pmd);
 			vaddr += PMD_SIZE;
+			start++;
 		}
 		j = 0;
 	}

That populates the page tables in the right places and fixrange_user_init()
manages to call it, avoid death-by-oom from runaway allocations and then
install references to all pages it wants. Alas, at that point the things
become really interesting.

3) with the previous two issues dealt with, we get the following magical
mystery shite when running 32bit uml kernel + userland on 64bit host:
	* the system boots all the way to getty/login and sshd (i.e. gets
through the debian /etc/init.d (squeeze/i386))
	* one can log into it, both on terminals and over ssh. shell and
a bunch of other stuff works. Mostly.
	* /bin/bash -c "echo *" reliably segfaults. Always. So does
tab completion in bash, for that matter.
	* said segfault is reproducible both from shell and under gdb.
For /bin/bash -c "echo *" under gdb it's always the 10th call of brk(3).
What happens there apparently boils down to __kernel_vsyscall() getting
called (and yes, sys_brk() is called, succeeds and results in expected
value in %eax) and corrupting the living hell out of %ecx. Namely, on
return from what presumably is __kernel_vsyscall() I'm seeing %ecx equal
to (original value of) %ebp. All registers except %eax and %ecx (including
%esp and %ebp) remain unchanged.

Again, that happens only on the same call of brk(3) - all previous calls
succeed as expected. I don't believe that it's a race. I also very much
doubt that we are calling the wrong location - it's hard to tell with the
call being
	call *%gs:0x10
(is there any way to find what that is equal to in gdb, BTW? Short of
hot-patching
	movl *%gs:0x10,%eax
in place of that call and single-stepping it, that is...) but it *does*
end up making the system call that ought to have been made, so I suspect
that it does hit __kernel_vsyscall(), after all...

The text of __kernel_vsyscall() is

   0xffffe420 <__kernel_vsyscall+0>:	push   %ebp
   0xffffe421 <__kernel_vsyscall+1>:	mov    %ecx,%ebp
   0xffffe423 <__kernel_vsyscall+3>:	syscall
   0xffffe425 <__kernel_vsyscall+5>:	mov    $0x2b,%ecx
   0xffffe42a <__kernel_vsyscall+10>:	mov    %ecx,%ss
   0xffffe42c <__kernel_vsyscall+12>:	mov    %ebp,%ecx
   0xffffe42e <__kernel_vsyscall+14>:	pop    %ebp
   0xffffe42f <__kernel_vsyscall+15>:	ret

so %ecx on the way out becoming equal to original %ebp is bloody curious -
it would smell like entering that sucker 3 bytes too late and skipping
mov %ecx, %ebp, but... we would also skip push %ebp, so we'd get trashed
on the way out - wrong return address, wrong value in %ebp, changed %esp.
None of that happens. And we are executing that code in userland - i.e.
to get corrupt it would have to get corrupt in *HOST* 32bit VDSO. Which
would have much more visible effects, starting with the next attempt to
run the testcase blowing up immediately instead of waiting (as it actually
does) for the same 10th call of brk()...

I'm at a loss, to be honest. The sucker is nicely reproducible, but bisecting
doesn't help at all - it seems to be present all the way back at least to
2.6.33. I hadn't tried to go back further and I hadn't tried to go for
older host kernels, but I wouldn't put too much faith into that...

The reason it hadn't been noticed much earlier is that it works fine on i386
host - aforementioned shit happens only when the entire thing (identical
binary, identical fs image, identical options) is run on amd64. However,
on i386 I have a different __kernel_vsyscall, which might easily be the reason
it doesn't happen there. It's a K7 box with sysenter-based variant ending
up as __kernel_vsyscall(). Hell knows what's going on... Behaviour is really
weird and I'd appreciate any pointers re debugging that crap. Suggestions? |
From: Richard W. <ri...@no...> - 2011-08-20 15:22:40
|
On 20.08.2011 03:18, Al Viro wrote:
> 3) with the previous two issues dealt with, we get the following magical
> mystery shite when running 32bit uml kernel + userland on 64bit host:
> 	* the system boots all the way to getty/login and sshd (i.e. gets
> through the debian /etc/init.d (squeeze/i386))
> 	* one can log into it, both on terminals and over ssh. shell and
> a bunch of other stuff works. Mostly.
> 	* /bin/bash -c "echo *" reliably segfaults. Always. So does tab
> completion in bash, for that matter.
> 	* said segfault is reproducible both from shell and under gdb.
> For /bin/bash -c "echo *" under gdb it's always the 10th call of brk(3).
> What happens there apparently boils down to __kernel_vsyscall() getting
> called (and yes, sys_brk() is called, succeeds and results in expected
> value in %eax) and corrupting the living hell out of %ecx. Namely, on
> return from what presumably is __kernel_vsyscall() I'm seeing %ecx equal
> to (original value of) %ebp. All registers except %eax and %ecx (including
> %esp and %ebp) remain unchanged.
> 	Again, that happens only on the same call of brk(3) - all previous
> calls succeed as expected. I don't believe that it's a race. I also
> very much doubt that we are calling the wrong location - it's hard to tell
> with the call being call *%gs:0x10 (is there any way to find what that
> is equal to in gdb, BTW? Short of hot-patching movl *%gs:0x10,%eax in place
> of that call and single-stepping it, that is...) but it *does* end up
> making the system call that ought to have been made, so I suspect that it
> does hit __kernel_vsyscall(), after all...
>
> The text of __kernel_vsyscall() is
>    0xffffe420 <__kernel_vsyscall+0>:	push   %ebp
>    0xffffe421 <__kernel_vsyscall+1>:	mov    %ecx,%ebp
>    0xffffe423 <__kernel_vsyscall+3>:	syscall
>    0xffffe425 <__kernel_vsyscall+5>:	mov    $0x2b,%ecx
>    0xffffe42a <__kernel_vsyscall+10>:	mov    %ecx,%ss
>    0xffffe42c <__kernel_vsyscall+12>:	mov    %ebp,%ecx
>    0xffffe42e <__kernel_vsyscall+14>:	pop    %ebp
>    0xffffe42f <__kernel_vsyscall+15>:	ret
> so %ecx on the way out becoming equal to original %ebp is bloody curious -
> it would smell like entering that sucker 3 bytes too late and skipping
> mov %ecx, %ebp, but... we would also skip push %ebp, so we'd get trashed
> on the way out - wrong return address, wrong value in %ebp, changed %esp.
> None of that happens. And we are executing that code in userland - i.e.
> to get corrupt it would have to get corrupt in *HOST* 32bit VDSO. Which
> would have much more visible effects, starting with the next attempt to
> run the testcase blowing up immediately instead of waiting (as it actually
> does) for the same 10th call of brk()...
>
> I'm at a loss, to be honest. The sucker is nicely reproducible, but bisecting
> doesn't help at all - it seems to be present all the way back at least to
> 2.6.33. I hadn't tried to go back further and I hadn't tried to go for
> older host kernels, but I wouldn't put too much faith into that... The
> reason it hadn't been noticed much earlier is that it works fine on i386
> host - aforementioned shit happens only when the entire thing (identical
> binary, identical fs image, identical options) is run on amd64. However,
> on i386 I have a different __kernel_vsyscall, which might easily be the
> reason it doesn't happen there. It's a K7 box with sysenter-based
> variant ending up as __kernel_vsyscall(). Hell knows what's going on...
> Behaviour is really weird and I'd appreciate any pointers re debugging
> that crap. Suggestions?

Hmmm, very strange.
Sadly I cannot reproduce the issue. :(
Everything works fine within UML.
(Of course I've applied your vDSO/i386 patches.)

My test setup:
Host kernel: 2.6.37 and 3.0.1
Distro: openSUSE 11.4/x86_64

UML kernel: 3.1-rc2
Distro: openSUSE 11.1/i386

Does the problem also occur with another host kernel or a different
guest image?

Thanks,
//richard
|
From: Andrew L. <lu...@mi...> - 2011-08-21 13:37:45
|
On Sun, Aug 21, 2011 at 7:24 AM, Andrew Lutomirski <lu...@mi...> wrote:
> On Sun, Aug 21, 2011 at 4:42 AM, Al Viro <vi...@ze...> wrote:
>> On Sun, Aug 21, 2011 at 07:34:43AM +0100, Al Viro wrote:
>>> Suppose we have a traced process. foo6() is called and the thing is
>>> stopped before the sys_foo6() is reached kernel-side. The sixth argument
>>> is on stack, ebp is set to user esp. SYSENTER happens, we read the
>>> 6th argument from userland stack and put it along with the rest into
>>> pt_regs. tracer examines the arguments, modifies them (including the last
>>> one) and lets the tracee run free - e.g. detaches from the tracee.
>>>
>>> What should happen if we happen to get a signal that would restart that
>>> sucker? Granted, it's not going to happen with mmap() - it doesn't, AFAICS,
>>> do anything of that kind. However, I wouldn't bet a dime on other 6-argument
>>> syscalls not stepping on that. sendto() and recvfrom(), in particular...
>>>
>>> OK, we return to userland. The sixth argument is placed into %ebp. Linus'
>>> "pig and proud of that" trick works and we end up slapping userland
>>> %esp into %ebp and hitting SYSENTER again. Only one problem, though -
>>> the sixth argument on user stack is completely unaffected by what tracer
>>> had done. Unlike the rest of arguments, that *are* changed.
>>>
>>> We could deal with that in case of SYSENTER if we e.g. replaced that
>>>         jmp .Lenter_kernel
>>> with
>>>         jmp .Lrestart
>>> and added
>>> .Lrestart:
>>>         movl %ebp, (%esp)
>>>         jmp .Lenter_kernel
>>> but in case of SYSCALL it seems to be even messier... Comments?
>>
>> Oh, hell... Compat SYSCALL one is really buggered on syscall restarts,
>> ptrace or no ptrace. Look: calling conventions for SYSCALL are
>>         arg1..5: ebx, ebp, edx, edi, esi. arg6: stack
>> and after syscall restart we end up with
>>         arg1..5: ebx, ecx, edx, edi, esi. arg6: ebp
>> so restart will instantly clobber arg2, in effect replacing it with arg6.
>>
>> And yes, adding ptrace to the mix makes things even uglier. For one thing,
>> changes to ECX via ptrace are completely lost on the fast exit. Not pretty,
>> and might make life painful for uml, but not for the majority of programs.
>> What's worse, combination of ptrace with restart will lose changes to arg6
>> (again, value on stack left as it was, changes to arg6 by tracer lost) *and*
>> it will lose changes to arg2 (along with arg2 itself - see above).
>>
>> Linus' Dirty Trick(tm) is not trivial to apply - with SYSCALL we *do* retain
>> the address of next insn and that's where we end up going. IOW, SYSCALL not
>> inside vdso32 currently works (for small values of "works", due to restart
>> issues). Playing with return elsewhere might break some userland code...
>>
>> Guys, that's *way* out of the area I'm comfortable with.
>>
>
> I don't see the point of all this hackery at all. sysenter/sysexit
> indeed screws up some registers, but we can return on the iret path in
> the case of restart.
>
> So why do we lie to ptrace (and iret!) at all? Why not just fill in
> pt_regs with the registers as they were (at least the
> non-clobbered-by-sysenter ones), set the actual C parameters correctly
> to contain the six arguments (in rdi, rsi, etc.), do the syscall, and
> return back to userspace without any funny business? Is there some
> ABI reason that, once we've started lying to tracers, we have to keep
> doing so?

Gack. Is this a holdover from the 32-bit code that shares the
argument save area with the parameters passed on the C stack? If so,
we could just set up the argument save area honestly and pass the real
parameters in registers like 64-bit C code expects.

If the tracing and restart cases use iret to return to userspace, this
should all just work. ptrace users shouldn't notice the overhead, and
syscall restart is presumably slow enough anyway that it wouldn't
matter. The userspace entry code would be as simple as:

sysenter
ret

or

sysexit
ret

--Andy
|
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 14:51:51
|
On Sun, Aug 21, 2011 at 09:37:18AM -0400, Andrew Lutomirski wrote:
> Gack. Is this a holdover from the 32-bit code that shares the
> argument save area with the parameters passed on the C stack? If so,
> we could just set up the argument save area honestly and pass the real
> parameters in registers like 64-bit C code expects.
>
> If the tracing and restart cases use iret to return to userspace, this
> should all just work. ptrace users shouldn't notice the overhead, and
> syscall restart is presumably slow enough anyway that it wouldn't
> matter. The userspace entry code would be as simple as:
>
> sysenter
> ret
>
> or
>
> sysexit
> ret

You are making no sense at all...
|
From: Al V. <viro@ZenIV.linux.org.uk> - 2011-08-21 14:44:14
|
On Sun, Aug 21, 2011 at 07:24:35AM -0400, Andrew Lutomirski wrote:
> I don't see the point of all this hackery at all. sysenter/sysexit
> indeed screws up some registers, but we can return on the iret path in
> the case of restart.

We *do* return on iret path in case of restart, TYVM.

> So why do we lie to ptrace (and iret!) at all? Why not just fill in
> pt_regs with the registers as they were (at least the
> non-clobbered-by-sysenter ones), set the actual C parameters correctly
> to contain the six arguments (in rdi, rsi, etc.), do the syscall, and
> return back to userspace without any funny business? Is there some
> ABI reason that, once we've started lying to tracers, we have to keep
> doing so?

We do not lie to ptrace and iret. At all. We do just what you have
described. And fuck up when restart returns us to the SYSCALL / SYSENTER
instruction again, which expects the different calling conventions, so the
values arranged in registers in the way int 0x80 would expect do us no good.
|
From: H. P. A. <hp...@zy...> - 2011-08-22 04:06:17
|
Borislav,

We're tracking down an issue with the way system call arguments are
handled on 32 bits. We have a solution for SYSENTER but not SYSCALL;
fixing SYSCALL "properly" appears to be very difficult at best.

So the question is: how much overhead would it be to simply fall back to
int $0x80 or some other legacy-style domain crossing instruction for
32-bit system calls on AMD64 processors? We don't ever use SYSCALL in
legacy mode, so native i386 kernels are unaffected.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
|
From: Ingo M. <mi...@ke...> - 2011-08-22 10:33:24
|
* H. Peter Anvin <hp...@zy...> wrote:

> Borislav,
>
> We're tracking down an issue with the way system call arguments are
> handled on 32 bits. We have a solution for SYSENTER but not
> SYSCALL; fixing SYSCALL "properly" appears to be very difficult at
> best.
>
> So the question is: how much overhead would it be to simply fall
> back to int $0x80 or some other legacy-style domain crossing
> instruction for 32-bit system calls on AMD64 processors? We don't
> ever use SYSCALL in legacy mode, so native i386 kernels are
> unaffected.

Last I measured INT80 and SYSCALL costs they were pretty close to
each other on AMD CPUs - closer than on Intel.

Also, most installations are either pure 32-bit or dominantly 64-bit,
the significantly mixed-mode case is dwindling.

Unifying some more in this area would definitely simplify things ...

Thanks,

	Ingo
|
From: Linus T. <tor...@li...> - 2011-08-22 23:28:42
|
On Mon, Aug 22, 2011 at 3:04 PM, H. Peter Anvin <hp...@zy...> wrote:
>
> However, we could just issue a SIGILL or SIGSEGV at this point; the same
> way we would if we got an #UD or #GP fault; SIGILL/#UD would be
> consistent with Intel CPUs here.

Considering that this is not a remotely new issue, and that it has
been around for years without anybody even noticing, I'd really prefer
to just fix things going forwards rather than add any code to actively
break any possible unlucky legacy users.

So I think the "let's fix the vdso case for sysenter" + "let's remove
the 32-bit syscall vdso" is the right solution. If somebody has
hardcoded syscall instructions, or generates them dynamically with
some JIT, that's their problem. We'll continue to support it as well
as we ever have (read: "almost nobody will ever notice").

One thing we *could* do is to just say "we never restart a x86-32
'syscall' instruction at all", and just make such a case return EINTR.
IOW, do something along the lines of the appended pseudo-patch.

Because returning -EINTR is always "almost correct".

Hmm?

                Linus

---
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 54ddaeb221c1..bc1a0f8b2707 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -678,6 +678,16 @@ setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info,
 	return ret;
 }
 
+static void restart_syscall(struct pt_regs *regs, int orig)
+{
+	if (regs->syscall_using_syscall_insn) {
+		regs->ax = -EINTR;
+		return;
+	}
+	regs->ax = orig;
+	regs->ip -= 2;
+}
+
 static int
 handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
 		struct pt_regs *regs)
@@ -701,8 +711,7 @@ handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
 			}
 		/* fallthrough */
 		case -ERESTARTNOINTR:
-			regs->ax = regs->orig_ax;
-			regs->ip -= 2;
+			restart_syscall(regs, regs->orig_ax);
 			break;
 		}
 	}
@@ -786,13 +795,11 @@ static void do_signal(struct pt_regs *regs)
 		case -ERESTARTNOHAND:
 		case -ERESTARTSYS:
 		case -ERESTARTNOINTR:
-			regs->ax = regs->orig_ax;
-			regs->ip -= 2;
+			restart_syscall(regs, regs->orig_ax);
 			break;
 
 		case -ERESTART_RESTARTBLOCK:
-			regs->ax = NR_restart_syscall;
-			regs->ip -= 2;
+			restart_syscall(regs, NR_restart_syscall);
 			break;
 		}
 	}
|
From: H. P. A. <hp...@zy...> - 2011-08-22 23:47:26
|
On 08/22/2011 04:27 PM, Linus Torvalds wrote:
> On Mon, Aug 22, 2011 at 3:04 PM, H. Peter Anvin <hp...@zy...> wrote:
>>
>> However, we could just issue a SIGILL or SIGSEGV at this point; the same
>> way we would if we got an #UD or #GP fault; SIGILL/#UD would be
>> consistent with Intel CPUs here.
>
> Considering that this is not a remotely new issue, and that it has
> been around for years without anybody even noticing, I'd really prefer
> to just fix things going forwards rather than add any code to actively
> break any possible unlucky legacy users.
>
> So I think the "let's fix the vdso case for sysenter" + "let's remove
> the 32-bit syscall vdso" is the right solution. If somebody has
> hardcoded syscall instructions, or generates them dynamically with
> some JIT, that's their problem. We'll continue to support it as well
> as we ever have (read: "almost nobody will ever notice").
>
> One thing we *could* do is to just say "we never restart a x86-32
> 'syscall' instruction at all", and just make such a case return EINTR.
> IOW, do something along the lines of the appended pseudo-patch.
>
> Because returning -EINTR is always "almost correct".
>

I have to say it worries me from a potential security hole point of
view, especially since it clearly isn't very well trod ground to begin
with. An almost-never-used path with access to the full system call
suite is scarier than hell in that sense. Keep in mind support for
SYSCALL32 is already (vendor-)conditional.

(The obvious solution of just putting the proper register frame back in
its place would be okay except for totally breaking anything
trace-on-exit as already hashed to death...)

	-hpa
|