From: Hubertus F. <fr...@us...> - 2001-01-23 19:38:05
|
Well said, John. That is exactly our position: we have MQ in mind as the "scalable solution", with multiple runqueues each protected by an individual lock. This should bring down lock contention. The priority scheduler is orthogonal to this: it is a queue-organization feature that might reduce lock hold time. Ideally we want to combine the two.

What is constantly pointed out on these lists is that the run-queue is not heavily populated, hence the lock hold time is rather small. However, as the number of cpus goes up, one can safely assume that the overall load on the system will scale accordingly. If the lock hold time scales with the number of runnable processes, then we have more or less a quadratic problem. Now we need to establish at what load the overhead of the priority or multiqueue solution pays off, i.e., where the break-even point is compared to the vanilla kernel.

We are almost done with the Chatroom measurements. We already posted some numbers for the sched_yield_test, which basically gives you the worst-case scenario in terms of lock contention. The entire data set should be ready by next week's LWE.

What would be interesting is a patch that records the length of the run-queue every time we run through it and keeps track of it. It is trivial to write and would let us build a histogram across various workloads. I suspect "0" will get the biggest hit. Would people be willing to run this for a while so I can collect the data? Maybe it would shed some light on this.

Hubertus Franke
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

"John Hawkes" <ha...@en...>@lists.sourceforge.net on 01/23/2001 02:12:25 PM
Sent by: lse...@li...
To: <lse...@li...>
Subject: Re: [Lse-tech] Re: more on scheduler benchmarks

From: "Andrew Morton" <an...@uo...>
...[snip]...
> Applying timepegs, plus schedule-timer.patch (attached) reveals that
> vanilla schedule() takes 32 microseconds with 100 tasks on the
> runqueue, and 4 usecs with an empty runqueue.
...[snip]...
> runqueue length    microseconds (500MHz PII)
> ...
>  64                25
> 128                44
>
> Seems surprisingly slow?

What greatly exacerbates the problem is if the global runqueue_lock is held during this span of schedule() time and if the desired context-switch rate is high. On a two-cpu [and not very fast] i386 I've seen context-switch rates of 10,000-20,000/second. This is obviously going to waste lots of cpu cycles in the spinlock waits. An 8p SMP is going to scale even worse. And the same or larger NUMA machine with a global runqueue_lock exhibits distinctly unfair spinlock contention (i.e., starvation of the cpus "farthest" from the runqueue_lock physical memory).

I believe the only effective solution for [large] NUMA systems is to reduce the contention on the global runqueue_lock by using multiple queues with individual spinlocks. Using prioritized runqueues (and the same global runqueue_lock) helps because it reduces the "hold" times, but if you add enough cpus, you'll eventually be back into the same high-contention, high-waste situation.

John Hawkes
ha...@en...
|
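A minimal sketch of the instrumentation Franke proposes above might look like the following, assuming a 2.4-style schedule() that scans the runqueue under runqueue_lock. The names RQ_HIST_BUCKETS, rq_len_hist and record_runqueue_length() are invented for this sketch (they are not from any posted patch), and the /proc export needed to report the histogram back is omitted.

/* Hypothetical runqueue-length histogram.  The caller passes the current
 * runqueue length (e.g. nr_running in 2.4) while runqueue_lock is held,
 * at the top of schedule(). */
#define RQ_HIST_BUCKETS 64

static unsigned long rq_len_hist[RQ_HIST_BUCKETS];

static inline void record_runqueue_length(unsigned long len)
{
        if (len >= RQ_HIST_BUCKETS)
                len = RQ_HIST_BUCKETS - 1;   /* clamp long queues into the last bucket */
        rq_len_hist[len]++;
}

Reading the histogram back over a day of normal use would directly answer whether the typical queue length is "0", "1", or something that grows with the number of cpus.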
From: Hubertus F. <fr...@us...> - 2001-01-23 21:20:18
|
>I believe that what we should aim for is an algorithm with per-cpu
>runqueues, with per-cpu spinlocks, and implemented in such a way that
>low loads aren't horribly inefficient. This gets us over the
>scalability hump with large cpu-count configurations. Then we can
>determine if this big-system algorithm should also implement multiple
>priority-level subqueues -- which for me is of secondary importance.
>That is, priority-level runqueues that continue to use a single global
>runqueue_lock won't scale, period, as we add cpus.

YES, YES, YES and YES. John, we are exactly on the same page.

We advocate multiple queues, each with its own lock (that is what our prototype MQ scheduler does right now). That has always been Mike's and my final goal, provided we can show that over a large set of apps and scenarios MQ provides better performance. We basically provided a priority-level scheduler on the website to show various paths and potential improvements that can be made other than MQ.

In general, breaking the run queue up into multiple run queues and how we organize each runqueue are orthogonal issues. Again, runqueue organization is a secondary issue compared to multiple queues and the elimination of lock contention (absolutely agreed).
(a) the priority queue (single list in priority order) NOT LOCKED AT THAT
(b) table based priority scheduler LOCKING AT THAT

We thought for a while about switching between queue organizations dynamically, but as you pointed out, that might merely be an exercise rather than something that should move into the kernel, due to the increased complexity. We actually have a load-balancing prototype ready that reduces the interlocking of the MQ scheduler and does some occasional load balancing. Not complex at all. It all boils down to how much you want to relax the ordering imposed by the current scheduler.

I also agree with your approach. Best is to first get a scheduler in that can be turned on through <make xconfig>. I would consider that a major step forward. Have you taken a look at our MQ scheduler patch?

The patch I mentioned is there to get some idea of what typical runqueue distributions look like. I really don't care what you run (database, web server, compile server). If the feedback from people running this during a day shows that the typical runqueue length is "0", "1", or N*num_cpus, that would allow us to make some decisions on whether we have to deal with queue organization (beyond multiple queues) or whether it is a non-issue.

Hubertus Franke
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
|
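For reference, a minimal sketch of the per-cpu-runqueue-with-per-cpu-lock arrangement both writers advocate might look like this. The structure and field names (mq_runqueue, nr_runnable) are invented for illustration and do not reproduce the actual IBM MQ patch.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/cache.h>
#include <linux/threads.h>

/* One runqueue per cpu, each with its own lock (illustrative only). */
struct mq_runqueue {
        spinlock_t       lock;          /* protects this queue only */
        struct list_head runnable;      /* runnable tasks bound to this cpu */
        int              nr_runnable;   /* queue length, used for balancing */
};

static struct mq_runqueue mq_runqueues[NR_CPUS] __cacheline_aligned;
/* a real patch would also pad each entry to its own cache line */

The point is simply that a cpu normally takes only its own lock, so lock contention is confined to cross-cpu wakeups and explicit task stealing rather than every pass through schedule().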
From: John H. <ha...@en...> - 2001-01-23 21:33:04
|
From: "Hubertus Franke" <fr...@us...> > Have you taken a look at our MQ scheduler patch ? Yes. I'm trying to get it to work on the mips64 architecture because that gives me a 32p test platform. Right now the boot hangs halfway up, probably because I haven't made all the required changes for the init of the idle process. John Hawkes ha...@en... |
From: cardente, j. <car...@em...> - 2001-01-24 14:15:06
|
Just some $.02 from a list-lurker...

The per-processor run queue approach sounds good, and I know similar techniques proved useful in the ccNUMA machines I worked on for DG (32-64p IA32 machines). That given, I would suggest that whatever patches you all come up with should include some forward planning for hierarchies of run queues. The idea is that if you have a large machine constructed out of SMP nodes, you would cluster per-processor run queues behind SMP-node run queues. That may make it easier to make scheduling migration decisions, since it will be less costly to "poach" a process from a neighboring CPU in the same SMP node than from a CPU in another node. Determining when to migrate processes between SMP nodes in our NUMA machines was one of the more difficult things to tune and get (mostly) right in DG/UX. Hierarchical run-queues allow tuning the algorithm at various "depths" of the run-queue hierarchy (i.e., CPUs need to be "more" idle to poach a process from another SMP node than from a neighboring CPU on the same node).

Note that the above applies to more than just NUMA machines; it applies to any architecture that builds a shared-memory machine out of multiple SMP busses. Even the current 8p IA32 machines (all based on Intel Profusion chipsets) are sensitive to this, since there is a limited amount of bandwidth available for cross-bus snooping. Locality awareness even on these machines would likely benefit performance (this is actually suggested by Intel's performance optimization papers for Profusion-based systems).

Of course, to avoid the proliferation of per-platform ports (vs. per-architecture ports), this would ideally all be table-driven based on BIOS structures (like the MP table), but that's a whole other issue... still worth planning for.

thanks...
John

----------------------------------------------------------------------------
John Cardente                        car...@em...
Principal Engineer                   508-898-7340
EMC Enterprise Engineering
4400 Computer Dr, Westboro, MA 01580
|
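The hierarchy Cardente suggests can be pictured with a short sketch: per-cpu queues grouped under per-node queues, with separate thresholds for poaching within a node and across nodes. All of the names and the threshold scheme below are invented for illustration; they are not from DG/UX or from any posted patch.

#include <linux/list.h>
#include <linux/spinlock.h>

struct cpu_rq {
        spinlock_t       lock;
        struct list_head runnable;
        int              nr_runnable;
        struct node_rq  *node;                  /* the SMP node this cpu sits on */
};

struct node_rq {
        struct cpu_rq  **cpus;                  /* per-cpu queues on this node */
        int              nr_cpus;
        int              steal_thresh_local;    /* imbalance needed to poach on-node */
        int              steal_thresh_remote;   /* larger imbalance needed off-node */
};

/* An idle cpu poaches from a victim only if the imbalance exceeds the
 * threshold for that "depth" of the hierarchy: cross-node poaching needs
 * a larger imbalance than poaching from a neighbour on the same node. */
static int should_steal(struct cpu_rq *thief, struct cpu_rq *victim)
{
        int thresh = (thief->node == victim->node)
                        ? thief->node->steal_thresh_local
                        : thief->node->steal_thresh_remote;

        return victim->nr_runnable - thief->nr_runnable >= thresh;
}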
From: Hubertus F. <fr...@us...> - 2001-01-24 15:45:45
|
Yes, we are planning on doing hierarchies of run-queues, or something semantically equivalent. In our first Call for Participation we didn't want to rock the boat too much.

In my first cut of an MQ scheduler I actually allowed multiple processors to share a single run queue, with multiple such runqueues in the system. This could be mapped either to a NUMA system or to a large SMP system. Ultimately, given our experience with AIX, Dynix and other systems, we decided that a single queue per cpu might be the right choice, and we concentrated on that first.

With respect to the NUMA/large-SMP stuff: first, I use the word NUMA very liberally. A large SMP, in my opinion, should/could be treated as a NUMA system. For instance, the 8-way 8500R uses two 4-way boards with coherency filters in between. Locality, as you pointed out, can make a difference here. It doesn't make sense to constantly scan all cpus for a better thread to run, etc. So imposing a NUMA view on a large SMP for locality reasons makes sense to me.

I'll post a simple patch to the MQ scheduler this afternoon that does the following:
(a) it splits the set of CPUs into smaller sets;
(b) when looking for "processes to steal" from another processor, the calling cpu only looks within its own set rather than at the entire set of cpus;
(c) for reschedule_idle we have a flag to check either all cpus or only cpus in the same set;
(d) load-balancing code, triggered by a timer, shovels processes from one queue to another based on some load function.

This is a very simplified first step toward what you are describing. I am not advocating it as a solution; I just wanted to show how the current MQ can be modified to get some of the partitioning.

Gerrit had a good point regarding process collocation. Next, when we are truly talking about NUMA architectures, we need to collocate memory allocation with processor scheduling as well.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
|
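Items (b) and (d) of the patch Franke describes might be sketched roughly as follows, building on the hypothetical mq_runqueue layout shown earlier. The cpu_set[] array, CPUS_PER_SET and the dequeue_one() helper are assumptions made for illustration; locking and the actual load function are omitted, and the real patch will differ.

#include <linux/sched.h>
#include <linux/smp.h>

#define CPUS_PER_SET 4                  /* illustrative set size */

static int cpu_set[NR_CPUS];            /* which set each cpu belongs to */

static struct task_struct *dequeue_one(struct mq_runqueue *rq);  /* assumed helper */

/* (b) steal only from cpus in the caller's own set, not from all cpus. */
static struct task_struct *steal_from_set(int this_cpu)
{
        int cpu;

        for (cpu = 0; cpu < smp_num_cpus; cpu++) {
                if (cpu == this_cpu || cpu_set[cpu] != cpu_set[this_cpu])
                        continue;
                if (mq_runqueues[cpu].nr_runnable > 1)
                        return dequeue_one(&mq_runqueues[cpu]);
        }
        return NULL;
}

For (d), a timer handler would run a similar scan periodically and shovel tasks from the longest queue to the shortest one whenever the chosen load function reports a large enough imbalance.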
From: John H. <ha...@en...> - 2001-01-25 01:10:05
|
From: "Hubertus Franke" <fr...@us...> > (b) when looking for "processes to steal" from another processor, the > calling cpu only looks into that set > rather then the entire set of cpus Scott Foehner (sfo...@en...) has begun to look at this issue, also. His approach was to make relatively simple changes to the scheduler GOODNESS algorithm to introduce the concept of an architecture-specific callback that could be used in that calculation, so as to allow a NUMA machine to effect a different policy about what other CPUs are good candidates to migrate from/to vs. a uniform-memory machine vs. a highly-NUMA machine like a cluster with *slow* links. (The idea here is that a system may have multiple CPUs in a local node that share a large cache, as well as share main memory, and if you can't run a process on the previously used CPUx, then you're better off trying to run that process on the nearby CPUy, rather than a distant CPUz.) The architecture-specific callback allows for useful architecture-specific tweaks to the more general architecture-independent multiple-runqueue paradigm. > Gerrit had a good point regarding process collocation. > Next, when we truely are talking about NUMA architectures, we need to > collocate memory allocation > with the processor scheduling as well. Definitely. John Hawkes ha...@en... |
From: Hubertus F. <fr...@us...> - 2001-01-25 16:09:46
|
Isn't that very similar to the pluggable scheduler that the HP folks did? They basically provided a loadable goodness function.

Another idea would be to make PROC_CHANGE_PENALTY a function of the calling cpu and the target cpu, e.g.

#define PROC_CHANGE_PENALTY(i,j)   (SAME_NODE(i,j) ? 15 : 30)

or to stuff the penalties into sched_data at initialization time ....

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
|
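The second alternative Franke mentions, computing the penalties once at initialization rather than in the hot path, might look like the sketch below. The proc_change_penalty[][] table and init function are illustrative assumptions, SAME_NODE() is the hypothetical macro from the message above, and the values 15 and 30 are only examples.

#include <linux/init.h>
#include <linux/threads.h>

static int proc_change_penalty[NR_CPUS][NR_CPUS];

static void __init init_change_penalties(void)
{
        int i, j;

        for (i = 0; i < NR_CPUS; i++)
                for (j = 0; j < NR_CPUS; j++)
                        proc_change_penalty[i][j] = SAME_NODE(i, j) ? 15 : 30;
}

The scheduler would then index proc_change_penalty[this_cpu][p->processor] instead of a single compile-time constant, keeping the per-decision cost at one table lookup.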
From: John H. <ha...@en...> - 2001-01-23 20:53:43
|
From: "Hubertus Franke" <fr...@us...> ... > What is constantly pointed out on these lists is that the run-queue is not > heavily populated hence the lock hold time > is rather small. ... > Now we need to establish, at what point the overhead we need to establish > for the priority or multiqueue solution where the actual > break-even point is as compared to vanilla kernel. ... > What would be interesting is a patch that registers the length of the > run-queue every time we run through it and keeps track of it. ... Are you saying that your hope is to devise a runqueue algorithm that dynamically shifts from the vanilla single runqueue in 2.4.0 to a multiple-queue, multiple-priority runqueue triggered by changes in system load? Or are you saying that you want to define some magic load level that signifies a threshold that, above which, the more complex multi-multi runqueue algorithm should be configured into the kernel? Personally, I don't believe it will be useful to collect runqueue-length statistics, since the runqueue loading depends entirely upon what the particular system is being used for. A general "timesharing, multiuser, software developer" system is going to exhibit different load characteristics than a database server, and those different from a web server, and those different from a "compute server" at a research lab, and those different from ... well, you name it. I believe that what we should aim for is an algorithm with per-cpu runqueues, with per-cpu spinlocks, and implemented in such a way that low loads aren't horribly inefficient. This gets us over the scalability hump with large cpu-count configurations. Then we can determine if this big-system algorithm should also implement multiple priority-level subqueues -- which for me is of secondary importance. That is, priority-level runqueues that continue to use a single global runqueue_lock won't scale, period, as we add cpus. Linus hates kernel complexity. I believe we have a chance of persuading Linus to accept multiple scheduler algorithms if (let's say) each lives in a separate kernel/sched*.c source file and is selected for inclusion in the kernel by kernel/Makefile using some CONFIG_SCHED_BIGSYS-like configuration option, and if we can craft a reasonable Help description of how a kernel builder should decide which algorithm to choose. A single kernel/sched.c file that contains multiple algorithms just won't likely get past Linus, nor will a scheduler algorithm that dynamically adjusts itself as the system load changes (because such an algorithm is both complex and will likely be less efficient for small systems than today's scheduler). John Hawkes ha...@en... |