From: Mike K. <kr...@us...> - 2001-11-16 20:20:15
|
As you may know, a few of us are experimenting with multi-runqueue scheduler implementations. One area of concern is where to place realtime tasks. It has been my assumption, that POSIX RT semantics require a specific ordering of tasks such as SCHED_FIFO and SCHED_RR. To accommodate this ordering, I further believe that the simplest solution is to ensure that all realtime tasks reside on the same runqueue. In our MQ scheduler we have a separate runqueue for all realtime tasks. The problem is that maintaining a separate realtime runqueue is a pain and results in some fairly complex/ugly code. Since I'm not a realtime expert, I would like to ask if my assumption about strict ordering of RT tasks is accurate. Also, is anyone aware of other ways to approach this problem? Thanks, -- Mike |
From: Richard G. <rg...@ra...> - 2001-11-16 20:27:35
|
Mike Kravetz writes: > As you may know, a few of us are experimenting with multi-runqueue > scheduler implementations. One area of concern is where to place > realtime tasks. It has been my assumption, that POSIX RT semantics > require a specific ordering of tasks such as SCHED_FIFO and SCHED_RR. > To accommodate this ordering, I further believe that the simplest > solution is to ensure that all realtime tasks reside on the same > runqueue. In our MQ scheduler we have a separate runqueue for all > realtime tasks. The problem is that maintaining a separate realtime > runqueue is a pain and results in some fairly complex/ugly code. > > Since I'm not a realtime expert, I would like to ask if my assumption > about strict ordering of RT tasks is accurate. Also, is anyone aware > of other ways to approach this problem? Yes, strict ordering is required. Years ago I championed a separate runqueue for RT tasks. Linus even said he liked the approach. I got busy and never nursed it to inclusion. The patch is here: ftp://ftp.atnf.csiro.au/pub/people/rgooch/linux/kernel-patches/v2.1/rtqueue-patch Regards, Richard.... Permanent: rg...@at... Current: rg...@ra... |
From: Davide L. <da...@xm...> - 2001-11-16 22:35:31
|
On Fri, 16 Nov 2001, Mike Kravetz wrote: > As you may know, a few of us are experimenting with multi-runqueue > scheduler implementations. One area of concern is where to place > realtime tasks. It has been my assumption, that POSIX RT semantics > require a specific ordering of tasks such as SCHED_FIFO and SCHED_RR. > To accommodate this ordering, I further believe that the simplest > solution is to ensure that all realtime tasks reside on the same > runqueue. In our MQ scheduler we have a separate runqueue for all > realtime tasks. The problem is that maintaining a separate realtime > runqueue is a pain and results in some fairly complex/ugly code. > > Since I'm not a realtime expert, I would like to ask if my assumption > about strict ordering of RT tasks is accurate. Also, is anyone aware > of other ways to approach this problem? I do not use a separate queue coz, if it's single, it becomes a common lock for all CPUs. RT tasks are scheduled as usual and the only problem arises in reschedule_idle() when an RT task is pushed onto the run queue when 1) on its CPU it is _not_ running the idle 2) on its CPU is running another RT task with higher priority In that case a "good CPU" discovery loop is triggered, the task is moved on that CPU runqueue, need_resched is set, an IPI is sent and on return from the remote CPU IPI path the RT task is run. A good solution would be ( i'm not doing it now ), in setscheduler() to move the task in a way to have an even distribution of RT tasks among CPUs. - Davide |
From: Mike K. <kr...@us...> - 2001-11-16 23:47:11
|
On Fri, Nov 16, 2001 at 02:44:30PM -0800, Davide Libenzi wrote: > > I do not use a separate queue coz, if it's single, it becomes a common > lock for all CPUs. > RT tasks are scheduled as usual and the only problem arises in > reschedule_idle() when an RT task is pushed onto the run queue when > 1) on its CPU it is _not_ running the idle > 2) on its CPU is running another RT task with higher priority > > In that case a "good CPU" discovery loop is triggered, the task is moved > on that CPU runqueue, need_resched is set, an IPI is sent and on return > from the remote CPU IPI path the RT task is run. > A good solution would be ( i'm not doing it now ), in setscheduler() to > move the task in a way to have an even distribution of RT tasks among > CPUs. > > - Davide Davide, Suppose you have a 2 CPU system with 4 runnable tasks. 3 of these tasks are realtime with the same realtime priority and the other is an ordinary SCHED_OTHER task. The task distribution on the runqueues looks something like this. CPU 0 CPU 1 --------- --------- RT Task A RT Task B Other Task C RT Task D Task A and Task B are currently running on the 2 CPUs. Now, Task A voluntarily gives up CPU 0 and Task B is still running on CPU 1. At this point, Task D should be chosen to run on CPU 0. Correct? Isn't this a required RT semantic? I'm curious how you plan on accomplishing this. Regards, -- Mike |
From: Davide L. <da...@xm...> - 2001-11-17 00:19:23
|
On Fri, 16 Nov 2001, Mike Kravetz wrote: > Suppose you have a 2 CPU system with 4 runnable tasks. 3 of these > tasks are realtime with the same realtime priority and the other is > an ordinary SCHED_OTHER task. The task distribution on the runqueues > looks something like this. > > CPU 0 CPU 1 > --------- --------- > RT Task A RT Task B > Other Task C RT Task D > > Task A and Task B are currently running on the 2 CPUs. Now, Task A > voluntarily gives up CPU 0 and Task B is still running on CPU 1. > At this point, Task D should be chosen to run on CPU 0. Correct? > Isn't this a required RT semantic? I'm curious how you plan on > accomplishing this. Well I don't know how RT sematics apply to SMP systems. The easy solution ( == big common lock ) would be to have a single RT queue that is checked before the private one. Anyway, sometime it happens that the cure is worst than the disease and to solve a corner case you're going to punish common case performances ( Linux is not an RT OS even with that fix ). - Davide |
From: Mike K. <kr...@us...> - 2001-11-17 00:32:30
|
On Fri, Nov 16, 2001 at 04:28:33PM -0800, Davide Libenzi wrote: > On Fri, 16 Nov 2001, Mike Kravetz wrote: > > > Suppose you have a 2 CPU system with 4 runnable tasks. 3 of these > > tasks are realtime with the same realtime priority and the other is > > an ordinary SCHED_OTHER task. The task distribution on the runqueues > > looks something like this. > > > > CPU 0 CPU 1 > > --------- --------- > > RT Task A RT Task B > > Other Task C RT Task D > > > > Task A and Task B are currently running on the 2 CPUs. Now, Task A > > voluntarily gives up CPU 0 and Task B is still running on CPU 1. > > At this point, Task D should be chosen to run on CPU 0. Correct? > > Isn't this a required RT semantic? I'm curious how you plan on > > accomplishing this. > > Well I don't know how RT sematics apply to SMP systems. Me either (not exactly). > The easy solution ( == big common lock ) would be to have a single RT > queue that is checked before the private one. > Anyway, sometime it happens that the cure is worst than the disease and to > solve a corner case you're going to punish common case performances ( > Linux is not an RT OS even with that fix ). The reason I ask is that we went through the pains of a separate realtime RQ in our MQ scheduler. And yes, it does hurt the common case, not to mention the extra/complex code paths. I was hoping that someone in the know could enlighten us as to how RT semantics apply to SMP systems. If the semantics I suggest above are required, then it implies support must be added to any possible future scheduler implementations. -- Mike |
From: george a. <ge...@mv...> - 2001-11-17 08:08:41
|
Mike Kravetz wrote: > > On Fri, Nov 16, 2001 at 04:28:33PM -0800, Davide Libenzi wrote: > > On Fri, 16 Nov 2001, Mike Kravetz wrote: > > > > > Suppose you have a 2 CPU system with 4 runnable tasks. 3 of these > > > tasks are realtime with the same realtime priority and the other is > > > an ordinary SCHED_OTHER task. The task distribution on the runqueues > > > looks something like this. > > > > > > CPU 0 CPU 1 > > > --------- --------- > > > RT Task A RT Task B > > > Other Task C RT Task D > > > > > > Task A and Task B are currently running on the 2 CPUs. Now, Task A > > > voluntarily gives up CPU 0 and Task B is still running on CPU 1. > > > At this point, Task D should be chosen to run on CPU 0. Correct? > > > Isn't this a required RT semantic? I'm curious how you plan on > > > accomplishing this. > > > > Well I don't know how RT sematics apply to SMP systems. > > Me either (not exactly). > > > The easy solution ( == big common lock ) would be to have a single RT > > queue that is checked before the private one. > > Anyway, sometime it happens that the cure is worst than the disease and to > > solve a corner case you're going to punish common case performances ( > > Linux is not an RT OS even with that fix ). > > The reason I ask is that we went through the pains of a separate > realtime RQ in our MQ scheduler. And yes, it does hurt the common > case, not to mention the extra/complex code paths. I was hoping > that someone in the know could enlighten us as to how RT semantics > apply to SMP systems. If the semantics I suggest above are required, > then it implies support must be added to any possible future > scheduler implementations. > For what its worth here is my $.02 worth. A strict following of the standard seems to me to mean that in a N cpu system with M>N ready real time tasks, the N highest priority tasks should be allocated CPUs. Remember that, for real time tasks, it is MORE important to run the highest priority tasks than to worry about thru put or performance (beyond getting to the highest priority task(s) ASAP). I assume that this is what a user is saying when he declares a task "real time". That said, the pragmatic view is that the customer is always right. After all, he paid for the machine. We have customers who want decidedly non "S" SMP. There are some who want all real time tasks to run on one cpu and all other tasks to run on another. There are also some who want a group of tasks (real time and otherwise) to always run on a particular cpu while other tasks (again real time and otherwise) run on other cpus. What I plan to do is to have a non-affined real-time ready list and an affined list. I really think all the lists should be priority ordered with one per priority so when I say a non-affined ready list I mean one for each priority (see the rtsched on source forge, for example of a list per priority (URL in my signature)). I am not yet sure if the affined list should have the option of being shared with more than one cpu, but as you note, there are performance problems with multiple lists so I only plan to have the cpu check at most two lists, the non-affined, and one affined list. With this scheme it is possible to have a case where a ready real time task will not get an idle cpu, however, this should only happen if the task has affined it self to some other cpu(s). By using priority order lists or a list per priority the overhead and lock contention can, IMHO, be managed. A couple of addition points. There should be an API to declare cpu affinity, and cpu affinity should be passed thru fork and exec. You may need to be god (root) to use the API however. (I think cpu affinity is currently passed thru fork and exec, by the way.) -- George ge...@mv... High-res-timers: http://sourceforge.net/projects/high-res-timers/ Real time sched: http://sourceforge.net/projects/rtsched/ |
From: Matthew D. <col...@us...> - 2001-11-19 20:28:15
|
george anzinger wrote: > A couple of addition points. There should be an API to declare cpu > affinity, and cpu affinity should be passed thru fork and exec. You may > need to be god (root) to use the API however. (I think cpu affinity is > currently passed thru fork and exec, by the way.) There is... there is a launch_policy patch on http://sourceforge.net/projects/lse, that offers 2 ways of setting a processes CPU affinity. A process can muck around with its own affinity through prctl(), and root can muck around with the CPU affinity of any process through /proc/<pid>/cpus_allowed and /proc/<pid>/launch_policy. And the launch_policy *is* inherited from parent to child. The cpus_allowed part comes from Andrew Morton and controls the processes affinity. I've extended that to include the launch_policy part which sets the affinity to be passed on through fork/exec. Enjoy! -matt |
From: Victor Y. <yod...@fs...> - 2001-11-19 15:21:57
|
On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote: > The reason I ask is that we went through the pains of a separate > realtime RQ in our MQ scheduler. And yes, it does hurt the common > case, not to mention the extra/complex code paths. I was hoping > that someone in the know could enlighten us as to how RT semantics > apply to SMP systems. If the semantics I suggest above are required, > then it implies support must be added to any possible future > scheduler implementations. POSIX RT specs, at least last year, did not mention any SMP requirements for scheduling at all. What we do in RTLinux is require that RT threads be associated with a processor identifier on the theory that the user may have some idea what processors should run which RT threads, but the OS has no way of guessing. > > -- > Mike > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to maj...@vg... > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ |
From: Andi K. <ak...@su...> - 2001-11-19 16:30:29
|
On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote: > The reason I ask is that we went through the pains of a separate > realtime RQ in our MQ scheduler. And yes, it does hurt the common > case, not to mention the extra/complex code paths. I was hoping > that someone in the know could enlighten us as to how RT semantics > apply to SMP systems. If the semantics I suggest above are required, > then it implies support must be added to any possible future > scheduler implementations. It seems a lot of applications/APIs do not care about global RT semantics, but about RT semantics for groups of threads or processes (e.g. java or ada applications). Linux currently simulates this only for root and with a global runqueue. I don't think it makes too much sense to have an global rt queue on a multi processor system, but there should be some way to define "scheduling groups" where rt semantics are followed inside. Such a scheduling group could be a clone flag or default to CLONE_VM for example for compatibility. A scheduling group would also make it possible to support simple rt semantics for thread groups as non root. Then one could run a rt queue per scheduling group, and simulate global rt run queue or per cpu rt run queue as needed by appropiate setup. Comments? -Andi |
From: Davide L. <da...@xm...> - 2001-11-19 16:57:17
|
On Mon, 19 Nov 2001, Andi Kleen wrote: > On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote: > > The reason I ask is that we went through the pains of a separate > > realtime RQ in our MQ scheduler. And yes, it does hurt the common > > case, not to mention the extra/complex code paths. I was hoping > > that someone in the know could enlighten us as to how RT semantics > > apply to SMP systems. If the semantics I suggest above are required, > > then it implies support must be added to any possible future > > scheduler implementations. > > It seems a lot of applications/APIs do not care about global RT semantics, > but about RT semantics for groups of threads or processes (e.g. java > or ada applications). Linux currently simulates this only for root > and with a global runqueue. I don't think it makes too much sense to have > an global rt queue on a multi processor system, but there should be some > way to define "scheduling groups" where rt semantics are followed inside. > Such a scheduling group could be a clone flag or default to CLONE_VM for > example for compatibility. A scheduling group would also make it possible > to support simple rt semantics for thread groups as non root. Then one > could run a rt queue per scheduling group, and simulate global rt run queue > or per cpu rt run queue as needed by appropiate setup. What i'm currently trying to achieve is to have two kind of rt tasks, local and global ones. Local rt tasks are stored inside the local CPU run queue and compete only with the local tasks. Global rt tasks have a global runqueue that is fast-checked w/out held locks inside the schedule() : if (!list_empty(&runqueue_head(RT_QID))) goto rt_queue_select; rt_queue_select_back: So a local rt tasks can _not_ preempt a task on another CPU while a global one yes. - Davide |
From: Richard G. <rg...@ra...> - 2001-11-19 18:24:06
|
Andi Kleen writes: > On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote: > > The reason I ask is that we went through the pains of a separate > > realtime RQ in our MQ scheduler. And yes, it does hurt the common > > case, not to mention the extra/complex code paths. I was hoping > > that someone in the know could enlighten us as to how RT semantics > > apply to SMP systems. If the semantics I suggest above are required, > > then it implies support must be added to any possible future > > scheduler implementations. > > It seems a lot of applications/APIs do not care about global RT > semantics, but about RT semantics for groups of threads or processes > (e.g. java or ada applications). Linux currently simulates this only > for root and with a global runqueue. I don't think it makes too much > sense to have an global rt queue on a multi processor system, but > there should be some way to define "scheduling groups" where rt > semantics are followed inside. Such a scheduling group could be a > clone flag or default to CLONE_VM for example for compatibility. A > scheduling group would also make it possible to support simple rt > semantics for thread groups as non root. Then one could run a rt > queue per scheduling group, and simulate global rt run queue or per > cpu rt run queue as needed by appropiate setup. We have to continue providing global RT semantics. However, a non-privileged scheduling class which gives RT-like behaviour within a scheduling group would be *great*! I've wished for such a facility myself. Regards, Richard.... Permanent: rg...@at... Current: rg...@ra... |
From: Mike K. <kr...@us...> - 2001-11-19 19:09:13
|
On Mon, Nov 19, 2001 at 11:23:23AM -0700, Richard Gooch wrote: > > We have to continue providing global RT semantics. However, a > non-privileged scheduling class which gives RT-like behaviour within a > scheduling group would be *great*! I've wished for such a facility > myself. > If we go to the trouble of supporting scheduling groups, then it would seem that one could define a 'global RT' group for any required global semantics. The initial development costs (scheduling groups) may be high, but you would get global RT semantics for free. I also believe that we may be able to keep the overhead out of scheduling decisions for tasks not in the groups. My only concern would be with groups that span multiple CPUs (such as the global one). If the scheduler continues to use a single lock (which I don't advocate), this is not much of an issue. However, things can get quite complicated with multiple (per-CPU) locks and the possibly required per-group synchronization items. -- Mike |
From: george a. <ge...@mv...> - 2001-11-19 18:33:08
|
Andi Kleen wrote: > > On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote: > > The reason I ask is that we went through the pains of a separate > > realtime RQ in our MQ scheduler. And yes, it does hurt the common > > case, not to mention the extra/complex code paths. I was hoping > > that someone in the know could enlighten us as to how RT semantics > > apply to SMP systems. If the semantics I suggest above are required, > > then it implies support must be added to any possible future > > scheduler implementations. > > It seems a lot of applications/APIs do not care about global RT semantics, > but about RT semantics for groups of threads or processes (e.g. java > or ada applications). Linux currently simulates this only for root > and with a global runqueue. Why do you say only root? Since the schedule type and priority are inherited one only needs to be root to set the progenitors real time priority. Also, a programs priority/ schedule type can be set by another (root) program without the target program being root. I routinely set inetd to real time, for example, to let telnet sessions run at real time. > I don't think it makes too much sense to have > an global rt queue on a multi processor system, but there should be some > way to define "scheduling groups" where rt semantics are followed inside. Still, the customer is king. > Such a scheduling group could be a clone flag or default to CLONE_VM for > example for compatibility. A scheduling group would also make it possible > to support simple rt semantics for thread groups as non root. Then one > could run a rt queue per scheduling group, and simulate global rt run queue > or per cpu rt run queue as needed by appropiate setup. My first thought is that this is fairly high overhead to put in the schedule path. May be if I knew more.... -- George ge...@mv... High-res-timers: http://sourceforge.net/projects/high-res-timers/ Real time sched: http://sourceforge.net/projects/rtsched/ |
From: Andi K. <ak...@su...> - 2001-11-19 18:40:12
|
On Mon, Nov 19, 2001 at 10:32:23AM -0800, george anzinger wrote: > Andi Kleen wrote: > > > > On Fri, Nov 16, 2001 at 04:32:24PM -0800, Mike Kravetz wrote: > > > The reason I ask is that we went through the pains of a separate > > > realtime RQ in our MQ scheduler. And yes, it does hurt the common > > > case, not to mention the extra/complex code paths. I was hoping > > > that someone in the know could enlighten us as to how RT semantics > > > apply to SMP systems. If the semantics I suggest above are required, > > > then it implies support must be added to any possible future > > > scheduler implementations. > > > > It seems a lot of applications/APIs do not care about global RT semantics, > > but about RT semantics for groups of threads or processes (e.g. java > > or ada applications). Linux currently simulates this only for root > > and with a global runqueue. > > Why do you say only root? Since the schedule type and priority are You effectively need to be root to set up the priorities at least once. setuid programs are a big pain to use and a lot of people don't want them. Also it is useful to change priorities on the fly, e.g. to avoid priority inheritance problems. > > I don't think it makes too much sense to have > > an global rt queue on a multi processor system, but there should be some > > way to define "scheduling groups" where rt semantics are followed inside. > > Still, the customer is king. Hmm? > > > Such a scheduling group could be a clone flag or default to CLONE_VM for > > example for compatibility. A scheduling group would also make it possible > > to support simple rt semantics for thread groups as non root. Then one > > could run a rt queue per scheduling group, and simulate global rt run queue > > or per cpu rt run queue as needed by appropiate setup. > > My first thought is that this is fairly high overhead to put in the > schedule path. May be if I knew more.... The only overhead as far as I can see would be a few more pointer references for RT processes (task->runqueue-> ...). I guess with some care it can be even kept out of the non RT fast path. -Andi |