From: Hubertus F. <fr...@us...> - 2001-04-04 13:42:16
This is an important point that Mike is raising, and it also addresses a critique that Ingo issued yesterday, namely interactivity and fairness. The HP scheduler completely separates the per-CPU runqueues and does not take preemption goodness or the like into account. This can lead to unfair apportionment of CPU cycles, strong priority inversion, and a potential lack of interactivity.

Our MQ scheduler yields the same decision in most cases (other than those caused by some race condition on locks and counter members). It is not clear that yielding the same decision as the current scheduler is the ultimate goal to shoot for, but it allows comparison.

Another point to raise is that the current scheduler does an exhaustive search for the "best" task to run. It touches every process in the runqueue. This is OK if the runqueue length is limited to a very small multiple of the #cpus, but that is not what high-end server systems encounter. With the rising number of processors, lock contention can quickly become a bottleneck. If we assume that load (#running-tasks) increases roughly linearly with #cpus, the problem gets even worse, because not only has the number of clients increased, but so has the lock hold time. Clinging to the assumption that #running-tasks ~ #cpus of course saves you from that dilemma, but not everybody is signing on to this limitation.

MQ and priority-list schedulers help in two ways. MQ reduces the average lock hold time because on average it only inspects #running-tasks/#cpus tasks to make a local decision. It then goes on to inspect (#cpus-1) data structures representing the next-best tasks to run on those remote cpus, all without holding the lock, thus substantially reducing lock contention. Note that we still search the entire set of runnable tasks; we just do it in a distributed, collaborative manner. The only time we deviate from the current scheduler's decision is when two cpus have identified the same remote task as a target for remote stealing. In that case one will win and the other cpu will continue looking elsewhere, although there might have been another task on that cpu to steal. A rough sketch of this two-phase decision follows below.

Priority-list schedulers (PRS) only help by reducing lock hold time, which can give some relief with respect to lock contention, but not a whole lot. PRS can limit the lists it has to search based on the PROC_CHANGE_PENALTY. It also keeps 0-counter tasks in a list that is never inspected. One can even go further and put YIELD tasks in a separate list, given that sys_sched_yield already does some optimizations. The older version (12/00) posted on LSE is functionally equivalent to the current scheduler. I will put up another version this week that is based on a bitmask and is a bit more aggressive, in that it ignores the MM goodness boost of 1, which in my books is merely a tie breaker between two tasks of equal goodness.

Beyond that, we have done some work on cpu pooling, which is to identify a set of cpus that form a scheduling set. We still consider all cpus for preemption in reschedule_idle, but in schedule it is sufficient to only schedule within one's own set. That again can limit lock hold time with MQ, and we have seen some improvements. We also deploy load balancing.

To summarize, we have taken great care to preserve the current scheduler semantics and have laid out a path to relax some of those semantics for further improvements.
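[To illustrate the two-phase decision described above, here is a minimal C sketch. All types, fields, and helper names are invented for illustration; the actual 2.4.1 MQ patch is organized differently.]

/*
 * Minimal sketch of the MQ two-phase decision: a locked local scan,
 * then a lock-free peek at each remote queue's best candidate.
 */
#include <stddef.h>

#define NR_CPUS 8

struct task {
    int goodness;           /* cached scheduling weight */
    struct task *next;      /* runqueue linkage */
};

struct runqueue {
    struct task *head;      /* per-CPU list of runnable tasks */
    struct task *top;       /* best candidate, readable without the lock */
    /* a real implementation also carries a per-queue spinlock */
};

static struct runqueue rq[NR_CPUS];

/* Phase 1: best local task, found while holding only the local lock. */
static struct task *pick_local(int cpu)
{
    struct task *t, *best = NULL;

    for (t = rq[cpu].head; t; t = t->next)
        if (!best || t->goodness > best->goodness)
            best = t;
    return best;
}

/* Phase 2: peek at the remote "next best" entries without locking. */
static struct task *pick_next(int cpu)
{
    struct task *best = pick_local(cpu);
    int i;

    for (i = 0; i < NR_CPUS; i++) {
        struct task *cand;

        if (i == cpu)
            continue;
        cand = rq[i].top;   /* lock-free read; may be stale */
        if (cand && (!best || cand->goodness > best->goodness))
            best = cand;    /* an actual steal is re-verified later */
    }
    return best;
}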
I don't believe that the HP scheduler is sufficient, since it lacks load balancing, which occurs naturally in our MQ scheduler, and it does not meet the interactivity requirements that Ingo pointed out. Most of these things are discussed in great detail in the writeups under lse.sourceforge.net/scheduling. I also believe we show there that MQ performance for low thread counts matches the vanilla case.

I further don't understand the obsession with keeping the scheduler simple. If there are improvements, and I don't believe the MQ is all that complicated, and it's well documented, why not put it in?

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Mike Kravetz <mkr...@se...> on 04/03/2001 10:47:00 PM

To: Fabio Riccardi <fa...@ch...>
cc: Mike Kravetz <mkr...@se...>, Ingo Molnar <mi...@el...>, Hubertus Franke/Watson/IBM@IBMUS, Linux Kernel List <lin...@vg...>, Alan Cox <al...@lx...>
Subject: Re: a quest for a better scheduler

On Tue, Apr 03, 2001 at 05:18:03PM -0700, Fabio Riccardi wrote:
>
> I have measured the HP and not the "scalability" patch because the two
> do more or less the same thing and give me the same performance
> advantages, but the former is a lot simpler and I could port it with no
> effort on any recent kernel.

Actually, there is a significant difference between the HP patch and the one I developed. In the HP patch, if there is a schedulable task on the 'local' (current CPU) runqueue, it will ignore runnable tasks on other (remote) runqueues. In the multi-queue patch I developed, the scheduler always attempts to make the same global scheduling decisions as the current scheduler.

--
Mike Kravetz                                 mkr...@se...
IBM Linux Technology Center
From: Hubertus F. <fr...@us...> - 2001-04-04 15:30:12
Yes, Andrea. We actually already went a step further. We treat the scheduler as a single entity, rather than splitting it up. Based on the MQ scheduler, we do the balancing across all nodes at reschedule_idle time. We experimented to see whether only looking for idle tasks remotely is a good idea, but it bloats the code. Local scheduling decisions are limited to a set of cpus, which could coincide with the cpus of one node or, if desirable, with a smaller granularity. In addition, we implemented a global load-balancing scheme that ensures that load is equally distributed across all runqueues. This is made a loadable module, so you can plug in whatever you want; a rough sketch of the idea follows below. I grant that in NUMA it might actually be desirable to separate the schedulers completely (we can do that trivially in reschedule_idle), but you need load balancing at some point.

Here is the list of patches:

MultiQueue Scheduler: http://lse.sourceforge.net/scheduling/2.4.1.mq1-sched
Pooling Extension:    http://lse.sourceforge.net/scheduling/LB/2.4.1-MQpool
LoadBalancing:        http://lse.sourceforge.net/scheduling/LB/loadbalance.c

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Andrea Arcangeli <an...@su...> on 04/04/2001 11:08:47 AM

To: Ingo Molnar <mi...@el...>
cc: Hubertus Franke/Watson/IBM@IBMUS, Mike Kravetz <mkr...@se...>, Fabio Riccardi <fa...@ch...>, Linux Kernel List <lin...@vg...>, lse...@li...
Subject: Re: a quest for a better scheduler

On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:
>
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
>
> > Another point to raise is that the current scheduler does an exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. This is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
>
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs, efficiently. Aggressively NUMA systems need a
> different approach anyway in many other subsystems too; Kanoj is doing
> some scheduler work in that area.

I haven't seen anything from Kanoj, but I did something myself for the wildfire:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

This is mostly a userspace issue, not really intended as a kernel optimization (however, it's also partly a kernel optimization). Basically it splits the load of the NUMA machine into per-node load; there can be unbalanced load across the nodes, but fairness is guaranteed inside each node. It's not extremely well tested, but benchmarks were OK and it is at least certainly stable.

However, Ingo, consider that on a 32-way, if you don't have at least 32 tasks running all the time _always_, you're really paying big money for nothing ;). So the fact that the scheduler is optimized for 1-2 tasks running all the time is not nearly enough for those machines (and of course the scheduling rate also increases linearly with the number of cpus). Now it's perfectly fine that we don't ask the embedded and desktop guys to pay for that, but a kernel configuration option to select an algorithm that scales would be a good idea IMHO. The above patch just adds a CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be hurt by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles).

Andrea
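[As a rough illustration of what such a loadable balancer might do, here is an invented simplification in C; consult the loadbalance.c link above for the real algorithm, since nothing below is taken from it.]

/*
 * Invented sketch of periodic runqueue balancing: when the imbalance
 * between the busiest and the idlest queue exceeds a threshold, move
 * one task across. The real LSE loadbalance.c is more sophisticated.
 */
#define NR_CPUS 8
#define IMBALANCE_THRESHOLD 2

static int rq_len[NR_CPUS];          /* runnable tasks per CPU */

static void balance_once(void)
{
    int i, busiest = 0, idlest = 0;

    for (i = 1; i < NR_CPUS; i++) {
        if (rq_len[i] > rq_len[busiest])
            busiest = i;
        if (rq_len[i] < rq_len[idlest])
            idlest = i;
    }
    if (rq_len[busiest] - rq_len[idlest] >= IMBALANCE_THRESHOLD) {
        rq_len[busiest]--;           /* in reality: dequeue a task,  */
        rq_len[idlest]++;            /* migrate it, and requeue it   */
    }
}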
From: Hubertus F. <fr...@us...> - 2001-04-04 15:37:05
You imply that high end means thousands of processes, simply because we have shown that in our graphs as an asymptotic end. No, it could mean 5*#cpus, and that is not at all absurd. This could happen with a spike in demand. TUX is not the greatest example to use, because it does static webpage serving and is hence tied into the file cache. If you move up the food chain, where middleware is active and things are a bit more complicated than sending stuff out of the file cache, having a bunch of threads hanging around to deal with the spike in demand is common practice, although you think it's bad.

Now coming again to MQ (forget about the priority list from now on). When we scan either the local or the realtime queues we do use the goodness value, so we have the same flexibility as the current scheduler when it comes to goodness() flexibility and future improvements. For remote stealing we use na_goodness to compare for a better process to steal. Hence we would lose the "+1" information here; nevertheless, once a decision has been made, we still verify preemption with goodness. Either way, being off by "+1" for regular tasks once in a while is no big deal, because this problem already exists today. While walking the runqueue, another processor can either update the counter value of a task (OK, that's covered by can_schedule) or can run recalculate, which invalidates the decision one is about to make, from the point of view of always running the best task. But that's OK, because counter, nice, etc. are approximations anyway. Throw in PROC_CHANGE_PENALTY and you have a few knobs that are used to control interactivity and throughput (a rough sketch of this comparison follows at the end of this message).

Look at some of the results with our reflex benchmark. For low thread counts we basically show improved performance on the 2-, 4-, 5-, 6-, 7-, and 8-way systems; if #threads < #cpus, they all show improvements. In the numbers below, the leftmost column is the number of threads, with typically half of them runnable. You can clearly see that the priority list suffers from overhead, but MQ beats vanilla pretty much everywhere. Again, this is due to rapid scheduler invocation and the resulting lock contention.

2-way    2.4.1    2.4.1-mq1   2.4.1-prbit
 4       6.725    4.691        7.387
 8       6.326    4.766        6.421
12       6.838    5.233        6.431
16       7.13     5.415        7.29

4-way    2.4.1    2.4.1-mq1   2.4.1-prbit
 4       9.42     7.873       10.592
 8       8.143    3.799        8.691
12       7.877    3.537        8.101
16       7.688    2.953        7.144

6-way    2.4.1    2.4.1-mq1   2.4.1-prbit
 4       9.595    7.88        10.358
 8       9.703    7.278       10.523
12      10.016    4.652       10.985
16       9.882    3.629       10.525

8-way    2.4.1    2.4.1-mq1   2.4.1-prbit
 4       9.804    8.033       10.548
 8      10.436    5.783       11.475
12      10.925    6.787       11.646
16      11.426    5.048       11.877
20      11.438    3.895       11.633
24      11.457    3.304       11.347
28      11.495    3.073       11.09
32      11.53     2.944       10.898

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Ingo Molnar <mi...@el...> on 04/04/2001 09:23:34 AM
Please respond to <mi...@el...>

To: Hubertus Franke/Watson/IBM@IBMUS
cc: Mike Kravetz <mkr...@se...>, Fabio Riccardi <fa...@ch...>, Linux Kernel List <lin...@vg...>
Subject: Re: a quest for a better scheduler

On Wed, 4 Apr 2001, Hubertus Franke wrote:

> I understand the dilemma that the Linux scheduler is in, namely
> satisfy the low end at all cost. [...]

nope. The goal is to satisfy runnable processes in the range of NR_CPUS. You are playing word games by suggesting that the current behavior prefers 'low end'.
'Thousands of runnable processes' is not 'high end' at all, it's 'broken end'. Thousands of runnable processes are the sign of a broken application design, and 'fixing' the scheduler to perform better in that case is just fixing the symptom. [Changing the scheduler to perform better in such situations is possible too, but all solutions proposed so far had strings attached.]

Ingo
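[To make the na_goodness discussion in Hubertus's message above concrete, here is a small illustrative C model. The field names and weights are assumptions loosely patterned on 2.4 sched.c, not code taken from the MQ patch.]

/*
 * Illustrative model of the two goodness flavors discussed above.
 * Names and weights are assumptions, loosely following 2.4 sched.c.
 */
#define PROC_CHANGE_PENALTY 15       /* the actual value is per-arch */

struct task {
    int counter;                     /* remaining timeslice */
    int nice_bonus;                  /* static priority component */
    int mm_matches_cpu;              /* 1 if mm equals the CPU's active mm */
    int on_last_cpu;                 /* 1 if it last ran on the deciding CPU */
};

/* "Non-affine" goodness: comparable across CPUs, no +1 MM boost. */
static int na_goodness(const struct task *t)
{
    return t->counter + t->nice_bonus;
}

/* Full goodness as computed locally, including affinity bonuses. */
static int goodness(const struct task *t)
{
    int g = na_goodness(t);

    if (t->on_last_cpu)
        g += PROC_CHANGE_PENALTY;    /* cache-affinity bonus */
    if (t->mm_matches_cpu)
        g += 1;                      /* the "+1" tie breaker */
    return g;
}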
From: Hubertus F. <fr...@us...> - 2001-04-04 19:07:31
Let me give you a concrete example: running DB2 on an SMP system.

In DB2 there is a process/thread pool that is sized based on memory and the number of cpus. People tell me that the size of this pool is on the order of 100s for an 8-way system with a reasonably sized database. These <maxagents> determine the number of agents that can simultaneously execute an SQL statement. Requests fly in for transactions (e.g. driven by TPC-W-like applications). The agents are grabbed from the pool and concurrently fire the SQL transactions. Assuming that there is enough concurrency in the database, there is no reason to believe that the majority of those active agents are not effectively running. TPC-W loads have observed 100s of active transactions at a time. Of course, limiting the number of agents would reduce the number of concurrently running tasks, but it would also limit the responsiveness of the system. Implementing a database in the kernel a la TUX doesn't seem to be the right approach either (complexity, fault containment, ...).

Hope that is one example people accept. I can dig up some information on WebSphere applications. I'd love to hear from some other applications that fall into a similar category as the above and substantiate a bit the need for 100s of running processes, without claiming that the application is broken.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Mark Hahn <ha...@co...> on 04/04/2001 02:28:42 PM

To: Hubertus Franke/Watson/IBM@IBMUS
cc:
Subject: Re: a quest for a better scheduler

> ok if the runqueue length is limited to a very small multiple of the #cpus.
> But that is not what high end server systems encounter.

do you have some example of this in mind? so far, no one has actually produced an example of a "high end" server that has long runqueues.
From: Ingo M. <mi...@el...> - 2001-04-04 14:26:42
On Wed, 4 Apr 2001, Hubertus Franke wrote:

> It is not clear that yielding the same decision as the current
> scheduler is the ultimate goal to shoot for, but it allows
> comparison.

obviously the current scheduler is not cast in stone; it never was and never will be. But determining whether the current behavior can be reproduced in a different scheduler design is surely a good metric of how flexible that design is.

Ingo
From: Ingo M. <mi...@el...> - 2001-04-04 14:35:47
On Wed, 4 Apr 2001, Hubertus Franke wrote:

> Another point to raise is that the current scheduler does an exhaustive
> search for the "best" task to run. It touches every process in the
> runqueue. This is ok if the runqueue length is limited to a very small
> multiple of the #cpus. [...]

indeed. The current scheduler handles UP and SMP systems, up to 32 (perhaps 64) CPUs, efficiently. Aggressively NUMA systems need a different approach anyway in many other subsystems too; Kanoj is doing some scheduler work in that area.

But the original claim was that the scheduling of thousands of runnable processes (which is not equal to having thousands of sleeping processes) must perform well - which is a completely different issue.

Ingo
From: Andrea A. <an...@su...> - 2001-04-04 15:11:30
On Wed, Apr 04, 2001 at 03:34:22PM +0200, Ingo Molnar wrote:
>
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
>
> > Another point to raise is that the current scheduler does an exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. This is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
>
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs, efficiently. Aggressively NUMA systems need a
> different approach anyway in many other subsystems too; Kanoj is doing
> some scheduler work in that area.

I haven't seen anything from Kanoj, but I did something myself for the wildfire:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1

This is mostly a userspace issue, not really intended as a kernel optimization (however, it's also partly a kernel optimization). Basically it splits the load of the NUMA machine into per-node load; there can be unbalanced load across the nodes, but fairness is guaranteed inside each node. It's not extremely well tested, but benchmarks were OK and it is at least certainly stable.

However, Ingo, consider that on a 32-way, if you don't have at least 32 tasks running all the time _always_, you're really paying big money for nothing ;). So the fact that the scheduler is optimized for 1-2 tasks running all the time is not nearly enough for those machines (and of course the scheduling rate also increases linearly with the number of cpus). Now it's perfectly fine that we don't ask the embedded and desktop guys to pay for that, but a kernel configuration option to select an algorithm that scales would be a good idea IMHO. The above patch just adds a CONFIG_NUMA_SCHED. The scalable algorithm can fit into it and nobody will be hurt by that (CONFIG_NUMA_SCHED cannot even be selected by x86 compiles).

Andrea
From: Kanoj S. <ka...@go...> - 2001-04-04 16:50:59
> I haven't seen anything from Kanoj, but I did something myself for the
> wildfire:
>
> ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1
>
> This is mostly a userspace issue, not really intended as a kernel
> optimization (however, it's also partly a kernel optimization). Basically
> it splits the load of the NUMA machine into per-node load; there can be
> unbalanced load across the nodes, but fairness is guaranteed inside each
> node. It's not extremely well tested, but benchmarks were OK and it is at
> least certainly stable.

Just a quick comment. Andrea, unless your machine has some hardware that implies per-node runqueues will help (node-level caches etc.), I fail to understand how this is helping you ... here's a simple theory though. If your system is lightly loaded, your per-node queues are actually implementing some sort of affinity, making sure processes stick to cpus on nodes where they have allocated most of their memory. I am not sure what the situation will be under huge loads though.

As I have mentioned to some people before, per-cpu/per-node/per-cpuset/global runqueues probably all have their advantages and disadvantages, and their own sweet spots. Wouldn't it be really neat if a system administrator or performance expert could pick and choose what scheduler behavior he wants, based on how the system is going to be used?

Kanoj
From: Andrea A. <an...@su...> - 2001-04-04 17:17:27
On Wed, Apr 04, 2001 at 09:50:58AM -0700, Kanoj Sarcar wrote:
>
> > I haven't seen anything from Kanoj, but I did something myself for the
> > wildfire:
> >
> > ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/10_numa-sched-1
> >
> > This is mostly a userspace issue, not really intended as a kernel
> > optimization (however, it's also partly a kernel optimization). Basically
> > it splits the load of the NUMA machine into per-node load; there can be
> > unbalanced load across the nodes, but fairness is guaranteed inside each
> > node. It's not extremely well tested, but benchmarks were OK and it is at
> > least certainly stable.
>
> Just a quick comment. Andrea, unless your machine has some hardware
> that implies per-node runqueues will help (node-level caches etc.), I fail
> to understand how this is helping you ... here's a simple theory though.

It helps by keeping the task in the same node when it cannot be kept on the same cpu anymore.

Assume task A is sleeping and last ran on cpu 8 of node 2. It gets a wakeup, becomes runnable, and for some reason cpu 8 is busy while there are other idle cpus in the system. With the current scheduler it can be moved to any cpu in the system; with numa-sched applied we will first try to reschedule it on the idle cpus of node 2. The per-node runqueues are mainly necessary to implement this heuristic (a rough sketch follows below).

> cpus on nodes where they have allocated most of their memory. I am
> not sure what the situation will be under huge loads though.

After all cpus are busy, we try to reschedule only on the cpus of the local node. That's why it can generate some unbalance, yes, but it will tend to rebalance over time, because a less loaded node will end up with all of its tasks at zero counter first, and it will then start picking up tasks with has_cpu 0 from the runqueues of the other nodes. You may want to give it a try on your machines and see what difference it makes; I'd be curious to know, of course.

Andrea
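[A minimal C sketch of the wakeup-time placement Andrea describes, assuming an invented cpu_to_node() mapping and idle map; none of this is the wildfire patch's actual code.]

/*
 * Invented sketch of node-local wakeup placement: prefer the last CPU,
 * then any idle CPU on the last node, and otherwise stay on that node.
 */
#define NR_CPUS  32
#define NR_NODES 8

static int cpu_idle_map[NR_CPUS];          /* 1 if the CPU is idle */

static int cpu_to_node(int cpu)
{
    return cpu / (NR_CPUS / NR_NODES);     /* invented linear mapping */
}

static int pick_wakeup_cpu(int last_cpu)
{
    int node = cpu_to_node(last_cpu);
    int i;

    if (cpu_idle_map[last_cpu])
        return last_cpu;                   /* best case: same CPU */
    for (i = 0; i < NR_CPUS; i++)          /* next: idle CPU, same node */
        if (cpu_to_node(i) == node && cpu_idle_map[i])
            return i;
    return last_cpu;                       /* node busy: stay put; a real
                                              version would also consider
                                              preemption within the node */
}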
From: Kanoj S. <ka...@go...> - 2001-04-04 17:48:43
> It helps by keeping the task in the same node when it cannot be kept on
> the same cpu anymore.
>
> Assume task A is sleeping and last ran on cpu 8 of node 2. It gets a
> wakeup, becomes runnable, and for some reason cpu 8 is busy while there
> are other idle cpus in the system. With the current scheduler it can be
> moved to any cpu in the system; with numa-sched applied we will first
> try to reschedule it on the idle cpus of node 2. The per-node runqueues
> are mainly necessary to implement this heuristic.

Yes. But this is not the best solution, if I can add on to the example and make some assumptions. Imagine that most of the program's memory is on node 1, and it was scheduled on node 2 cpu 8 momentarily (maybe because kswapd ran on node 1, other higher-priority processes took over other cpus on node 1, etc.).

Then your patch will try to keep the process on node 2, which is not necessarily the best solution. Of course, as I mentioned before, if you have a node-local cache on node 2, that cache might have been warmed enough to make scheduling on node 2 a good option. I am not saying there is a wrong or right answer; there are so many possibilities that everything probably works and breaks under different circumstances.

Btw, while we are swapping patches, the patch at

http://oss.sgi.com/projects/numa/download/sched242.patch

tries to implement per-arch scheduling. The current scheduler behavior is smp_arch_goodness() and smp_pick_cpu(), but the patch allows the possibility for a specific platform to change that to something else (see the sketch after this message). Linus has seen this patch and agrees to it in principle, though he does not consider it 2.4 material. Of course, I am completely open to Ingo (or someone else) coming up with a different way of providing the same freedom to arch-specific code.

Kanoj
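[The hook names smp_arch_goodness() and smp_pick_cpu() come from Kanoj's message; the signatures and the weak-default mechanism below are guesses for illustration only, not the contents of sched242.patch.]

/*
 * Sketch of arch-overridable scheduler hooks. Only the two hook names
 * come from the message above; everything else is an assumption.
 */
struct task;                 /* opaque for this sketch */

/* Generic defaults; an architecture may provide strong overrides. */
int __attribute__((weak))
smp_arch_goodness(struct task *p, int cpu, int generic_goodness)
{
    return generic_goodness; /* arch code could re-weight by node, cache... */
}

int __attribute__((weak))
smp_pick_cpu(struct task *p, int default_cpu)
{
    return default_cpu;      /* arch code could prefer a NUMA node */
}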
From: Andrea A. <an...@su...> - 2001-04-04 18:03:13
On Wed, Apr 04, 2001 at 10:49:04AM -0700, Kanoj Sarcar wrote:

> Imagine that most of the program's memory is on node 1, and it was
> scheduled on node 2 cpu 8 momentarily (maybe because kswapd ran on
> node 1, other higher-priority processes took over other cpus on
> node 1, etc.).
>
> Then your patch will try to keep the process on node 2, which is not
> necessarily the best solution. Of course, as I mentioned before, if
> you have a node-local cache on node 2, that cache might have been warmed
> enough to make scheduling on node 2 a good option.

Exactly, it made it a good option, and more importantly, this heuristic can only improve performance compared to the mainline scheduler. In fact, I tried rescheduling the task back to its original node and it dropped performance. Anyway, I cannot claim to have done extensive research on that. I believe that if we keep more history of the node migrations, we may be able to tell the right time to push a task back to its original node, but that was not obvious, and I wanted a simple solution to boost performance under CPU-bound load to start with.

Andrea
From: Kanoj S. <ka...@go...> - 2001-04-04 16:39:04
> On Wed, 4 Apr 2001, Hubertus Franke wrote:
>
> > Another point to raise is that the current scheduler does an exhaustive
> > search for the "best" task to run. It touches every process in the
> > runqueue. This is ok if the runqueue length is limited to a very small
> > multiple of the #cpus. [...]
>
> indeed. The current scheduler handles UP and SMP systems, up to 32
> (perhaps 64) CPUs, efficiently. Aggressively NUMA systems need a
> different approach anyway in many other subsystems too; Kanoj is doing
> some scheduler work in that area.

Actually, not _much_ work has been done in this area. Along with a bunch of other people, I have some ideas about what needs to be done. For example, for NUMA, we need to try hard to schedule a thread on the node that has most of its memory (for no reason other than to decrease memory latency); one way to express that is sketched below. Independently, some NUMA machines build in multilevel caches and local snoops, which also means that specific processors on the same node as the last_processor are good candidates to run the process next.

To handle a single layer of shared caches, I have tried certain simple things, mostly as hacks, but am not pleased with the results yet. More testing needed.

Kanoj

> but the original claim was that the scheduling of thousands of runnable
> processes (which is not equal to having thousands of sleeping processes)
> must perform well - which is a completely different issue.
>
> Ingo
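[One possible way to express "schedule the thread on the node that has most of its memory", sketched in C with invented names and an arbitrary weight.]

/*
 * Invented sketch: bias a task toward the node holding most of its
 * resident pages. pages_on_node[] would be maintained by the VM.
 */
#define NR_NODES 8
#define NODE_MEM_BONUS 10                  /* illustrative weight */

struct task_mem {
    unsigned long pages_on_node[NR_NODES];
};

static int home_node(const struct task_mem *m)
{
    int n, best = 0;

    for (n = 1; n < NR_NODES; n++)
        if (m->pages_on_node[n] > m->pages_on_node[best])
            best = n;
    return best;
}

/* Add a bonus when the deciding CPU sits on the task's "home" node. */
static int numa_goodness(int base, const struct task_mem *m, int this_node)
{
    return base + (home_node(m) == this_node ? NODE_MEM_BONUS : 0);
}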
From: Andrea A. <an...@su...> - 2001-04-04 17:02:53
On Wed, Apr 04, 2001 at 09:39:23AM -0700, Kanoj Sarcar wrote:

> example, for NUMA, we need to try hard to schedule a thread on the
> node that has most of its memory (for no reason other than to decrease
> memory latency). Independently, some NUMA machines build in multilevel
> caches and local snoops, which also means that specific processors on
> the same node as the last_processor are good candidates to run
> the process next.

Yes. That will probably need to be optional and chosen by the architecture at compile time too. The most important factor to consider is probably the penalty of accessing remote memory. I think I can say that on all recent and future machines with a small difference between local and remote memory (and possibly, as you say, with a decent cache protocol able to snoop cacheline data from the other cpus even if they're not dirty), it's much better to always try to keep the task in its last node.

My patch actually assumes recent machines: it keeps the task in its last node if not on its last cpu, keeps doing memory allocation from there, and forgets about the original node where it started allocating memory. This provided the best performance during userspace CPU-bound load as far as I can tell, and it also distributes the load better.

Kanoj, could you also have a look at the NUMA-related common-code MM fixes I did in this patch? I'd like to get them integrated (just skip the arch/alpha/* and include/asm-alpha/* stuff while reading the patch; it's totally orthogonal).

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.3aa1/00_alpha-numa-1

If you prefer, I can extract them into a more fine-grained patch by dropping the alpha stuff by hand.

Andrea
From: Khalid A. <kh...@fc...> - 2001-04-04 15:45:27
Hubertus Franke wrote:
>
> This is an important point that Mike is raising and it also addresses a
> critique that Ingo issued yesterday, namely interactivity and fairness.
> The HP scheduler completely separates the per-CPU runqueues and does
> not take preemption goodness or the like into account. This can lead to
> unfair apportionment of CPU cycles, strong priority inversion and a
> potential lack of interactivity.
>
> Our MQ scheduler yields the same decision in most cases
> (other than those caused by some race condition on locks and counter
> members).

Let me stress that the HP scheduler is not meant to be a replacement for the current scheduler. The HP scheduler patch allows the current scheduler to be replaced by another scheduler, by loading a module, in special cases. HP is providing three different loadable scheduler modules - processor sets, a constant-time scheduler, and a multi-runqueue scheduler. Each one of these is geared towards a specific requirement. I would not suggest using any of these in the general case. The processor-sets scheduler is designed to make scheduling decisions on a per-cpu basis, not a global basis.

All we are trying to do is to make the current scheduler modular, so we CAN load an alternate scheduling-policy module in cases where the process mix, or the site policy, requires a different scheduling policy; a hypothetical sketch of such an interface appears below. An example of a specific site processor-allocation policy could be a site that runs a database server on an MP machine along with a few other processes, where the administrator wants to guarantee that the database server always gets x percent of processing time, or that one CPU is dedicated to just the database server. A policy like this is not meant to be fair and is, of course, not a policy we want to impose upon others.

The only HP changes I would put in the kernel sources for general release would be the changes to the scheduler that allow such alternate policies (not necessarily fair, or the fastest for benchmarks, a general process mix, or 1000s of processes) to be loaded. When a policy module is not loaded, the scheduler works exactly like the current scheduler, even after the HP patches. There are people who could benefit from being able to load alternate policy schedulers; Fabio Riccardi happens to be one of them :-)

Anyone who does not want to even allow an alternate scheduler module to be loaded can simply compile the alternate scheduler support out, and that is how I would expect most kernels to be compiled, especially the ones that ship with various distributions. I would like the decision to include support for an alternate scheduler to be made by sysadmins themselves for their individual cases.

--
Khalid

====================================================================
Khalid Aziz                             Linux Development Laboratory
(970)898-9214                                        Hewlett-Packard
kh...@fc...                                      Fort Collins, CO
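[To make the "loadable policy module" idea concrete, here is a hypothetical ops-table interface in C; the HP patch's real entry points are not reproduced here, so treat every name below as invented.]

/*
 * Hypothetical pluggable-scheduler interface: a module registers an
 * ops table, and the core falls back to the stock scheduler when no
 * module is loaded. The HP patch's actual API differs.
 */
struct task;

struct sched_policy_ops {
    const char *name;
    struct task *(*pick_next)(int cpu);        /* choose next task */
    void (*enqueue)(struct task *p, int cpu);  /* task became runnable */
    void (*dequeue)(struct task *p, int cpu);  /* task stopped running */
};

static struct sched_policy_ops *active_policy; /* NULL = stock scheduler */

int register_sched_policy(struct sched_policy_ops *ops)
{
    if (active_policy)
        return -1;              /* one alternate policy at a time */
    active_policy = ops;        /* module now drives scheduling */
    return 0;
}

void unregister_sched_policy(void)
{
    active_policy = NULL;       /* fall back to the built-in scheduler */
}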
From: Christoph H. <hc...@ns...> - 2001-04-04 16:11:04
On Wed, Apr 04, 2001 at 09:44:22AM -0600, Khalid Aziz wrote:

> Let me stress that the HP scheduler is not meant to be a replacement for
> the current scheduler. The HP scheduler patch allows the current
> scheduler to be replaced by another scheduler, by loading a module, in
> special cases.

HP also has a simple MQ patch that is _not_ integrated into the pluggable scheduler framework; I have used it myself.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.