From: Jeff D. <jd...@ad...> - 2006-04-13 18:18:47
|
This set of patches implements
    time virtualization by creating a time namespace
    an interface to it through unshare
    a ptrace extension to allow UML to take advantage of this
    UML support

The guts of the namespace is just an offset from the system time. Within
the container, gettimeofday adds this offset to the system time. settimeofday
changes the offset without touching the system time. As such, within a
namespace, settimeofday is unprivileged.

The interface to it is through unshare(CLONE_TIME). This creates the new
namespace, initialized with a zero offset from the system time.

The advantage of this for UML is that it can create a time namespace for itself
and subsequently let its process' gettimeofday run on the host, without
being intercepted and run inside UML. As such, it should basically run at
native speed.

In order to allow this, we need selective system call interception. The
third patch implements PTRACE_SYSCALL_MASK, which specifies, through a
bitmask, which system calls are intercepted and which aren't.

Finally, the UML support is straightforward. It calls unshare(CLONE_TIME)
to create the new namespace, sets gettimeofday to run without being
intercepted, and makes settimeofday call the host's settimeofday instead
of maintaining the time offset itself.

As expected, a gettimeofday loop runs basically at native speed. The two
quick tests I did had it running inside UML at 98.8 and 99.2 % of native.

BUG - as I was writing this, I realized that refcounting of the time_ns
structures is wrong - they need to be incremented at process creation and
decremented at process exit.

Jeff |
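The offset arithmetic described in this message can be pictured with a small userspace model. This is only a sketch, not code from the patches: the names time_ns_model, ns_gettimeofday and ns_settimeofday are hypothetical, and a plain microsecond count stands in for struct timeval.

```c
#include <stdint.h>

/* Hypothetical model of the namespace state described above: the only
 * per-namespace data is an offset from the system time. */
struct time_ns_model {
    int64_t offset_usec;    /* zero in a freshly created namespace */
};

/* gettimeofday inside the namespace: system time plus the offset. */
static int64_t ns_gettimeofday(const struct time_ns_model *ns,
                               int64_t system_usec)
{
    return system_usec + ns->offset_usec;
}

/* settimeofday inside the namespace: only the offset is updated, the
 * system time itself is never touched. */
static void ns_settimeofday(struct time_ns_model *ns,
                            int64_t system_usec, int64_t wanted_usec)
{
    ns->offset_usec = wanted_usec - system_usec;
}
```

Because ns_settimeofday only writes namespace-local state in this model, nothing outside the namespace can observe the change, which is the argument given above for leaving it unprivileged.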
From: john s. <joh...@us...> - 2006-04-14 00:31:49
|
On Thu, 2006-04-13 at 13:19 -0400, Jeff Dike wrote:
> This set of patches implements
>     time virtualization by creating a time namespace
>     an interface to it through unshare
>     a ptrace extension to allow UML to take advantage of this
>     UML support
>
> The guts of the namespace is just an offset from the system time. Within
> the container, gettimeofday adds this offset to the system time. settimeofday
> changes the offset without touching the system time. As such, within a
> namespace, settimeofday is unprivileged.
>
> The interface to it is through unshare(CLONE_TIME). This creates the new
> namespace, initialized with a zero offset from the system time.
>
> The advantage of this for UML is that it can create a time namespace for itself
> and subsequently let its process' gettimeofday run on the host, without
> being intercepted and run inside UML. As such, it should basically run at
> native speed.
>
> In order to allow this, we need selective system call interception. The
> third patch implements PTRACE_SYSCALL_MASK, which specifies, through a
> bitmask, which system calls are intercepted and which aren't.
>
> Finally, the UML support is straightforward. It calls unshare(CLONE_TIME)
> to create the new namespace, sets gettimeofday to run without being
> intercepted, and makes settimeofday call the host's settimeofday instead
> of maintaining the time offset itself.
>
> As expected, a gettimeofday loop runs basically at native speed. The two
> quick tests I did had it running inside UML at 98.8 and 99.2 % of native.
>
> BUG - as I was writing this, I realized that refcounting of the time_ns
> structures is wrong - they need to be incremented at process creation and
> decremented at process exit.

Looks interesting. I've never quite understood the need for different
time domains; it only allows you to run one domain with the incorrect
time, but I'm sure there is some use case that is desired.

I'm not psyched about possible namespace vs nanosecond confusion w/
terms like "time_ns", but that's pretty minor.

Also I hope you're not wanting to deal w/ NTP adjustments between
domains that have the incorrect time? That would be very ugly.

thanks
-john |
From: <ebi...@xm...> - 2006-04-19 08:27:04
|
Jeff Dike <jd...@ad...> writes:
> This set of patches implements
>     time virtualization by creating a time namespace
>     an interface to it through unshare
>     a ptrace extension to allow UML to take advantage of this
>     UML support
>
> The guts of the namespace is just an offset from the system time. Within
> the container, gettimeofday adds this offset to the system time. settimeofday
> changes the offset without touching the system time. As such, within a
> namespace, settimeofday is unprivileged.
>
> The interface to it is through unshare(CLONE_TIME). This creates the new
> namespace, initialized with a zero offset from the system time.
>
> The advantage of this for UML is that it can create a time namespace for itself
> and subsequently let its process' gettimeofday run on the host, without
> being intercepted and run inside UML. As such, it should basically run at
> native speed.
>
> In order to allow this, we need selective system call interception. The
> third patch implements PTRACE_SYSCALL_MASK, which specifies, through a
> bitmask, which system calls are intercepted and which aren't.

That patch should probably be separated from the rest.
But it looks like a fairly sane idea.

> Finally, the UML support is straightforward. It calls unshare(CLONE_TIME)
> to create the new namespace, sets gettimeofday to run without being
> intercepted, and makes settimeofday call the host's settimeofday instead
> of maintaining the time offset itself.

I think you missed a couple essential things to a time namespace.
Timers. The posix timers, in particular. The worst
of those is the monotonic timer.

In the case of migration the ugly case to properly handle is the
monotonic timer. That needs an offset yet it is absolutely forbidden
to provide that offset from the inside. So this is the one namespace
that I think is inappropriate to use sys_unshare to create.
We need a system call so that we can specify the minimum or the
starting monotonic time base.

I don't know how we want to describe time while a process is not
inside of a kernel.

> As expected, a gettimeofday loop runs basically at native speed. The two
> quick tests I did had it running inside UML at 98.8 and 99.2 % of
> native.

Interesting.

> BUG - as I was writing this, I realized that refcounting of the time_ns
> structures is wrong - they need to be incremented at process creation and
> decremented at process exit.

Actually the more I think about using PTRACE to help with some of
these issues the more I like it. Its only real alternative is a
security module, and that must be written as kernel code.

The reference counting is terrible, as you don't free syscall_mask
during ptrace_detach.

As a comparison what is the overhead if you don't use syscall_mask,
and just do a ptrace_cont on the system call you want to let through?

Eric |
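For readers following the syscall_mask discussion, one way to picture "a bitmask that says which system calls are intercepted" is a bitmap indexed by system call number. The sketch below is an assumption about the general shape only: the patch's real layout is not shown in this thread, and MODEL_NR_SYSCALLS, syscall_mask_model and the mask_* helpers are hypothetical names.

```c
#include <limits.h>
#include <string.h>

/* Assumed upper bound on system call numbers; not taken from the patch. */
#define MODEL_NR_SYSCALLS 512
#define MODEL_WORD_BITS   (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_MASK_WORDS  ((MODEL_NR_SYSCALLS + MODEL_WORD_BITS - 1) / MODEL_WORD_BITS)

/* Hypothetical mask: one bit per system call, bit set = intercept. */
struct syscall_mask_model {
    unsigned long bits[MODEL_MASK_WORDS];
};

/* Default tracing behaviour: every system call is intercepted. */
static void mask_intercept_all(struct syscall_mask_model *m)
{
    memset(m->bits, 0xff, sizeof(m->bits));
}

/* Clear one bit so that system call runs on the host untraced. */
static void mask_pass_through(struct syscall_mask_model *m, unsigned int nr)
{
    if (nr < MODEL_NR_SYSCALLS)
        m->bits[nr / MODEL_WORD_BITS] &= ~(1UL << (nr % MODEL_WORD_BITS));
}

/* Out-of-range numbers are treated as intercepted, the safe default. */
static int mask_is_intercepted(const struct syscall_mask_model *m,
                               unsigned int nr)
{
    if (nr >= MODEL_NR_SYSCALLS)
        return 1;
    return (int)((m->bits[nr / MODEL_WORD_BITS] >> (nr % MODEL_WORD_BITS)) & 1UL);
}
```

In this model a tracer would call mask_intercept_all() once and then mask_pass_through() for the gettimeofday number on its architecture. The ptrace_cont alternative asked about above would instead stop on every call and resume the ones it wants to let through.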
From: Jeff D. <jd...@ad...> - 2006-04-26 19:00:32
|
On Wed, Apr 19, 2006 at 02:25:00AM -0600, Eric W. Biederman wrote:
> That patch should probably be separated, from the rest.
> But it looks like a fairly sane idea.

Yeah, I'll keep these together for now, but the ptrace one is
conceptually different from the rest.

> I think you missed a couple essential things to a time namespace.
> Timers. The posix timers, in particular. The worst
> of those is the monotonic timer.

Oops, thanks for pointing that out.

> In the case of migration the ugly case to properly handle is the
> monotonic timer. That needs an offset yet it is absolutely forbidden
> to provide that offset from the inside. So this is the one namespace
> that I think is inappropriate to use sys_unshare to create.
> We need a system call so that we can specify the minimum or the
> starting monotonic time base.

For migration, it looks like the container will have to specify the
time base at creation so that everything in it will have a consistent
view of time if they get moved around.

So, maybe it belongs in clone as a "backwards" flag similar to
CLONE_NEWNS.

Jeff |
From: Blaisorblade <bla...@ya...> - 2006-04-28 11:33:59
|
On Wednesday 26 April 2006 20:01, Jeff Dike wrote:
> On Wed, Apr 19, 2006 at 02:25:00AM -0600, Eric W. Biederman wrote:
> > In the case of migration the ugly case to properly handle is the
> > monotonic timer. That needs an offset yet it is absolutely forbidden
> > to provide that offset from the inside. So this is the one namespace
> > that I think is inappropriate to use sys_unshare to create.
> > We need a system call so that we can specify the minimum or the
> > starting monotonic time base.
>
> For migration, it looks like the container will have to specify the
> time base at creation so that everything in it will have a consistent
> view of time if they get moved around.
>
> So, maybe it belongs in clone as a "backwards" flag similar to
> CLONE_NEWNS.

I must note that currently every (?) flag allowed for unshare is also
allowed for clone, so you need to do that anyway.

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade |
From: Jeff D. <jd...@ad...> - 2006-04-28 12:47:48
|
On Fri, Apr 28, 2006 at 01:33:40PM +0200, Blaisorblade wrote:
> > So, maybe it belongs in clone as a "backwards" flag similar to
> > CLONE_NEWNS.
>
> I must note that currently every (?) flag allowed for unshare is also allowed
> for clone, so you need to do that anyway.

Currently. We are running out of CLONE_ bits - in mainline, there are
three left, and two of them are likely to be used by CLONE_TIME and
CLONE_UTSNAME (or whatever that turns out to be called).

I'm eyeing the low eight bits (CSIGNAL) for future unshare flags, but
those would be unusable in clone().

And why should there be any overlap between clone flags and unshare
flags? Isn't
    clone(CLONE_TIME);
the same as
    clone();
    unshare(CLONE_TIME);
?

Jeff |
From: Jeff D. <jd...@ad...> - 2006-04-28 13:13:52
|
On Fri, Apr 28, 2006 at 07:48:23AM -0400, Jeff Dike wrote:
> Currently. We are running out of CLONE_ bits - in mainline, there are
> three left

Errr, make that seven, and I can still see those being used up.

Jeff |
From: Blaisorblade <bla...@ya...> - 2006-04-28 13:54:46
|
On Friday 28 April 2006 13:48, Jeff Dike wrote:
> On Fri, Apr 28, 2006 at 01:33:40PM +0200, Blaisorblade wrote:
> > > So, maybe it belongs in clone as a "backwards" flag similar to
> > > CLONE_NEWNS.
>
> > I must note that currently every (?) flag allowed for unshare is also
> > allowed for clone, so you need to do that anyway.
>
> Currently. We are running out of CLONE_ bits - in mainline, there are
> three left, and two of them are likely to be used by CLONE_TIME and
> CLONE_UTSNAME (or whatever that turns out to be called).
>
> And why should there be any overlap between clone flags and unshare
> flags? Isn't
>     clone(CLONE_TIME);
> the same as
>     clone();
>     unshare(CLONE_TIME);
> ?

Now that unshare() exists, you're right, the current situation is just due to
unshare() being an afterthought; the second form (clone() + unshare()) is
actually more similar to the classical fork() API conceptually (i.e. you don't
need a call with thousands of parameters to create a process, you can specify
everything later).

So we get back to Eric's objection (which I haven't understood but that's my
problem).

Additionally, if this flag ever goes into clone, it mustn't be named
CLONE_TIME, but CLONE_NEWTIME (or CLONE_NEWUTS). And given CLONE_NEWNS, it's
IMHO ok to have unshare(CLONE_NEWTIME) to mean "unshare time namespace", even
if it's incoherent with unshare(CLONE_FS) - the incoherency already exists
with CLONE_NEWNS.

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade |
From: Jeff D. <jd...@ad...> - 2006-04-28 16:15:12
|
On Fri, Apr 28, 2006 at 03:54:31PM +0200, Blaisorblade wrote:
> Additionally, if this flag ever goes into clone, it mustn't be named
> CLONE_TIME, but CLONE_NEWTIME (or CLONE_NEWUTS). And given CLONE_NEWNS, it's
> IMHO ok to have unshare(CLONE_NEWTIME) to mean "unshare time namespace", even
> if it's incoherent with unshare(CLONE_FS) - the incoherency already exists
> with CLONE_NEWNS.

I wonder if they should be CLONE_* at all. Given that we are likely
to run out of free CLONE_* bits, unshare will have to reuse bits that
don't have anything to do with sharing resources (CSIGNAL,
CLONE_VFORK, etc), and it doesn't seem that nice to have two different
CLONE_* flags with the same value, different meaning, only one of
which can actually be used in clone.

It seems better to use UNSHARE_*, with the current bits that are
common to unshare and clone being defined the same, i.e.
    #define UNSHARE_VM CLONE_VM

Jeff |
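A rough sketch of what the UNSHARE_* idea could look like follows. The CSIGNAL, CLONE_VM and CLONE_FS values are the ones Linux defines in <linux/sched.h>, repeated here only so the fragment stands alone. UNSHARE_TIME_MODEL and usable_in_clone() are hypothetical, made up to illustrate an unshare-only flag living in the CSIGNAL bits.

```c
/* Values as in Linux's <linux/sched.h>; guarded in case a system
 * header has already defined them. */
#ifndef CSIGNAL
#define CSIGNAL   0x000000ff    /* signal mask sent at exit, low eight bits */
#endif
#ifndef CLONE_VM
#define CLONE_VM  0x00000100
#endif
#ifndef CLONE_FS
#define CLONE_FS  0x00000200
#endif

/* Bits common to clone and unshare keep the same value. */
#define UNSHARE_VM  CLONE_VM
#define UNSHARE_FS  CLONE_FS

/* Hypothetical unshare-only flag placed inside the CSIGNAL range. */
#define UNSHARE_TIME_MODEL  0x00000001

/* clone() reads the low eight bits as a signal number, so any flag
 * that overlaps CSIGNAL cannot also be passed to clone(). */
static int usable_in_clone(unsigned long flag)
{
    return (flag & CSIGNAL) == 0;
}
```

With these definitions usable_in_clone(UNSHARE_VM) is true and usable_in_clone(UNSHARE_TIME_MODEL) is false, which is the asymmetry that a separate UNSHARE_* prefix would make visible in the name.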
From: Blaisorblade <bla...@ya...> - 2006-04-28 20:19:52
|
On Friday 28 April 2006 17:15, Jeff Dike wrote:
> On Fri, Apr 28, 2006 at 03:54:31PM +0200, Blaisorblade wrote:
> > Additionally, if this flag ever goes into clone, it mustn't be named
> > CLONE_TIME, but CLONE_NEWTIME (or CLONE_NEWUTS). And given CLONE_NEWNS,
> > it's IMHO ok to have unshare(CLONE_NEWTIME) to mean "unshare time
> > namespace", even if it's incoherent with unshare(CLONE_FS) - the
> > incoherency already exists with CLONE_NEWNS.
>
> I wonder if they should be CLONE_* at all.

I've wondered about this too. It makes some sense to reinforce the
relationship with clone, but when you read the call to unshare you get
nonsense, like the above incoherence.

> Given that we are likely
> to run out of free CLONE_* bits, unshare will have to reuse bits that
> don't have anything to do with sharing resources (CSIGNAL,
> CLONE_VFORK, etc), and it doesn't seem that nice to have two different
> CLONE_* flags with the same value, different meaning, only one of
> which can actually be used in clone.
>
> It seems better to use UNSHARE_*, with the current bits that are
> common to unshare and clone being defined the same, i.e.
>     #define UNSHARE_VM CLONE_VM

I indeed agree with this. With
    cg log -r v2.6.16-rc1:v2.6.16 kernel/fork.c
we can see the people involved in commits for sys_unshare (there's little
other work in there).

--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade |
From: <ebi...@xm...> - 2006-04-28 16:18:33
|
Blaisorblade <bla...@ya...> writes:
> So we get back to Eric's objection (which I haven't understood but that's my
> problem).

My objection is that to handle the monotonic timer we need an additional
struct timespec argument when we create the time namespace. There does
not appear to be space in clone or unshare to pass that value.

Eric |
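To make the objection concrete, here is a userspace sketch of a namespace whose monotonic base is supplied once, at creation, and cannot be changed from the inside afterwards. Nothing here is an existing or proposed kernel interface: mono_ns_model, mono_ns_create and mono_ns_now are hypothetical names, and the host's monotonic clock is passed in as a parameter so the arithmetic can be followed by hand.

```c
#include <time.h>

#define MODEL_NSEC_PER_SEC 1000000000L

/* Hypothetical namespace state: an offset fixed when the namespace is
 * created, with no setter, so the clock stays monotonic. */
struct mono_ns_model {
    struct timespec offset;
};

/* a - b, keeping tv_nsec in [0, 1e9). */
static struct timespec ts_sub(struct timespec a, struct timespec b)
{
    struct timespec r = { a.tv_sec - b.tv_sec, a.tv_nsec - b.tv_nsec };
    if (r.tv_nsec < 0) {
        r.tv_nsec += MODEL_NSEC_PER_SEC;
        r.tv_sec -= 1;
    }
    return r;
}

/* a + b, keeping tv_nsec in [0, 1e9). */
static struct timespec ts_add(struct timespec a, struct timespec b)
{
    struct timespec r = { a.tv_sec + b.tv_sec, a.tv_nsec + b.tv_nsec };
    if (r.tv_nsec >= MODEL_NSEC_PER_SEC) {
        r.tv_nsec -= MODEL_NSEC_PER_SEC;
        r.tv_sec += 1;
    }
    return r;
}

/* Creation takes the extra timespec: the monotonic value the namespace
 * should start from, e.g. the value saved before a migration. */
static struct mono_ns_model mono_ns_create(struct timespec host_now,
                                           struct timespec start_base)
{
    struct mono_ns_model ns = { ts_sub(start_base, host_now) };
    return ns;
}

/* Monotonic time seen inside: host monotonic time plus the offset. */
static struct timespec mono_ns_now(const struct mono_ns_model *ns,
                                   struct timespec host_now)
{
    return ts_add(host_now, ns->offset);
}
```

In this model a container that read 60 s on one host can be recreated on another host, whose own monotonic clock reads 5 s, by passing 60 s as start_base, and the time it sees then continues from 60 s rather than jumping backwards. The start_base parameter is the extra value that, per the message above, does not fit in the existing clone or unshare arguments.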