|
From: Julian S. <js...@ac...> - 2005-02-28 17:25:08
|
(in the cvs head) There seem to be paths through this which do not assign any value to SYSRES. I don't see how this can be right if this is a Special syscall (for which we provide all handling). If (a4 & VKI_MAP_FIXED) is taken and !VG_(valid_client_addr)(...) is not taken then we hit if (SYSRES != -VKI_ENOMEM) ... and read SYSRES without having set it first. What am I missing? Same problem for sys_rt_sigaction and sys_rt_sigprocmask. ----------------------------- Background: I'm tracking down problems in syscall handling on amd64. As part of that, I've added a per-thread Bool used to indicate when a syscall's PRE wrapper has set RES, so that we don't have to figure this out by inspecting RES after the pre-wrapper. Checking RES <= 0 after the pre-wrapper is the wrong thing to do and causes merry hell with __NR_read on amd64. Inspecting RES after the pre-wrapper for <= 0 also makes it impossible for a pre-wrapper to return any result > 0, which isn't good. Generally you can't reliably conclude anything at all about whether the pre-wrapper assigned RES by inspecting it afterwards. As a side effect of this, I added an assertion to check that RES has been set after every syscall marked Special (fair enough, right?) and immediately I have it barfing on mmap. J |
|
From: Jeremy F. <je...@go...> - 2005-02-28 18:17:37
|
Julian Seward wrote:
>(in the cvs head)
>
>There seem to be paths through this which do not assign any
>value to SYSRES. I don't see how this can be right if this
>is a Special syscall (for which we provide all handling).
>
>If (a4 & VKI_MAP_FIXED) is taken
> and !VG_(valid_client_addr)(...) is not taken
>
>then we hit if (SYSRES != -VKI_ENOMEM) ...
>
>and read SYSRES without having set it first.
>
>What am I missing?
>
>
This code relies on the fact that SYSRES is an alias for SYSNO, so it
will always be >0 unless it has been set to something else. Also,
SYS_PRE_MEM_READ/WRITE will SYSNO to EFAULT if a pointer argument is
pointing at bad memory.
>Background: I'm tracking down problems in syscall handling on amd64.
>As part of that, I've added a per-thread Bool used to indicate when
>a syscall's PRE wrapper has set RES, so that we don't have to
>figure this out by inspecting RES after the pre-wrapper. Checking
>RES <= 0 after the pre-wrapper is the wrong thing to do and causes merry
>hell with __NR_read on amd64.
>
Why?
> Inspecting RES after the pre-wrapper
>for <= 0 also makes it impossible for a pre-wrapper to return any
>result > 0, which isn't good.
>Generally you can't reliably conclude anything at all about whether
>the pre-wrapper assigned RES by inspecting it afterwards.
>
>
In general, my policy has been that if the PRE() wants to set some
positive result, then it should probably be Special. It's OK for
non-Special PRE() to fail syscalls, but not complete them.
>As a side effect of this, I added an assertion to check that RES has
>been set after every syscall marked Special (fair enough, right?)
>and immediately I have it barfing on mmap.
>
>
Well, the code is functionally correct on Linux because a result will
always be set to either ENOMEM or whatever PLATFORM_DO_MMAP sets it to.
J
|
|
From: Jeremy F. <je...@go...> - 2005-02-28 18:20:57
|
Julian Seward wrote:
>Background: I'm tracking down problems in syscall handling on amd64.
>As part of that, I've added a per-thread Bool used to indicate when
>a syscall's PRE wrapper has set RES, so that we don't have to
>figure this out by inspecting RES after the pre-wrapper.
>
Oh, I'd prefer it if it weren't thread state, but some direct return
value from PRE() - either via return, or a pointer argument (like flags).
J
|
|
From: Julian S. <js...@ac...> - 2005-02-28 20:25:24
|
> Oh, I'd prefer it if it weren't thread state, but some direct return > value from PRE() - either via return, or a pointer argument (like flags). Fair enough -- what's the reason, though? Is there a potential race condition or something? J |
|
From: Jeremy F. <je...@go...> - 2005-02-28 21:39:11
|
Julian Seward wrote:
>>Oh, I'd prefer it if it weren't thread state, but some direct return
>>value from PRE() - either via return, or a pointer argument (like flags).
>>
>>
>
>Fair enough -- what's the reason, though? Is there a potential
>race condition or something?
>
No, just to keep it relatively local. Stuff in ThreadState is
essentially a global variable. Nick did a good job of throwing a lot of
stuff out of ThreadState, and I've been doing that as much as possible too.
J
|
|
From: Julian S. <js...@ac...> - 2005-02-28 20:25:29
|
> This code relies on the fact that SYSRES is an alias for SYSNO, so it > will always be >0 unless it has been set to something else. Also, > SYS_PRE_MEM_READ/WRITE will SYSNO to EFAULT if a pointer argument is > pointing at bad memory. I guess so. It's not too clear to naive visitors, though. > >RES <= 0 after the pre-wrapper is the wrong thing to do and causes merry > >hell with __NR_read on amd64. > > Why? On amd64, both the syscall number and the syscall result live in RAX. And __NR_read == 0. So, we start with RAX = 0, call the pre-function, check RES (which is RAX) for <= 0, which it duly is, and conclude that the pre-function decided to error out the syscall, and so doesn't hand it off to the kernel. Result is that the client believes that the read got 0 bytes and is totally mystified. I spotted this a couple of weeks back and so changed RES <= 0 to RES < 0. That fixes __NR_read. But now setrlimit is screwed since that intercepts some but not all of the time -- depends on what limit you are changing. For the file-descriptor limits, it sets RES==0 but now the RES < 0 test still allows the call through to the kernel, which means that VG_(safe_fd) is hosed after that. Basically inspecting RES after the pre-function is not a reliable way to establish whether the pre-function assigned to it. The current code works only because (I think) on x86 no syscall number has the value 0, and the go-no-further test is RES <= 0. > In general, my policy has been that if the PRE() wants to set some > positive result, then it should probably be Special. It's OK for > non-Special PRE() to fail syscalls, but not complete them. But what about setrlimit: it intercepts and completes RLIMIT_NOFILE, RLIMIT_DATA and RLIMIT_STACK; but the rest are handed off to the kernel. So a static Special/non-Special division doesn't seem to make sense here, and setrlimit is not marked Special. > Well, the code is functionally correct on Linux because a result will > always be set to either ENOMEM or whatever PLATFORM_DO_MMAP sets it to. Functionally correct is good. Functionally _and_ manifestly correct to the casual reader is even better. J |
|
From: Jeremy F. <je...@go...> - 2005-02-28 21:37:13
|
Julian Seward wrote:
>>This code relies on the fact that SYSRES is an alias for SYSNO, so it
>>will always be >0 unless it has been set to something else. Also,
>>SYS_PRE_MEM_READ/WRITE will SYSNO to EFAULT if a pointer argument is
>>pointing at bad memory.
>>
>>
>
>I guess so. It's not too clear to naive visitors, though.
>
>
No, but it works reasonably well in the context of PRE() functions. The
whole thing is a bit neck-deep in macro-magic to be very straightforward.
>On amd64, both the syscall number and the syscall result live
>in RAX. And __NR_read == 0. So, we start with RAX = 0, call
>the pre-function, check RES (which is RAX) for <= 0, which it
>duly is, and conclude that the pre-function decided to error out
>the syscall, and so doesn't hand it off to the kernel. Result is
>that the client believes that the read got 0 bytes and is
>totally mystified.
>
>
Urk.
>Basically inspecting RES after the pre-function is not a reliable
>way to establish whether the pre-function assigned to it. The
>current code works only because (I think) on x86 no syscall number
>has the value 0, and the go-no-further test is RES <= 0.
>
>
This is essentially the same issue that the BSD/Mach/MacOS people have,
in that their syscall interface doesn't overload the return value with
the "error occurred" status. So we can proabably fix both together.
And, yes, ia32 linux uses syscall 0 as a special magic call for
kernel-internal use only.
>
>But what about setrlimit: it intercepts and completes RLIMIT_NOFILE,
>RLIMIT_DATA and RLIMIT_STACK; but the rest are handed off to the kernel.
>So a static Special/non-Special division doesn't seem to make
>sense here, and setrlimit is not marked Special.
>
>
It should probably be Special. It doesn't block, so there's no reason
not to.
>>Well, the code is functionally correct on Linux because a result will
>>always be set to either ENOMEM or whatever PLATFORM_DO_MMAP sets it to.
>>
>>
>
>Functionally correct is good. Functionally _and_ manifestly correct
>to the casual reader is even better.
>
That would be nice. Fixing the portability would be nice too.
J
|
|
From: Julian S. <js...@ac...> - 2005-02-28 21:45:10
|
> >But what about setrlimit: it intercepts and completes RLIMIT_NOFILE, > >RLIMIT_DATA and RLIMIT_STACK; but the rest are handed off to the kernel. > >So a static Special/non-Special division doesn't seem to make > >sense here, and setrlimit is not marked Special. > > It should probably be Special. It doesn't block, so there's no reason > not to. But the problem is that presence/absence of Special states unilaterally for the call that it does/does not need to be given to the kernel. And setrlimit is one place where that distinction is not black/white. I think a better approach is to forget about Special. Let the pre-routine run. If it turns out to have assigned a result value already, stop at that point; else give it to the kernel. The issue of whether the call may block if it is given to the kernel is orthogonal -- that's what MayBlock is for. J |
|
From: Jeremy F. <je...@go...> - 2005-02-28 21:48:33
|
Julian Seward wrote:
>I think a better approach is to forget about Special. Let the
>pre-routine run. If it turns out to have assigned a result
>value already, stop at that point; else give it to the kernel.
>
>
Sure, that would work.
J
|