From: Hubertus F. <fr...@wa...> - 2001-10-30 16:29:25

Davide, nice analysis. I want to point out that some (not all) of the
stuff is already done in our scalable MQ scheduler
(http://lse.sourceforge.net/scheduling).

What we have:
-------------
Multiple queues, each protected by their own lock to avoid the
contention.
Automatic load balancing across all queues (yes, that creates overhead).
CPU pooling as a configurable means to get from isolated queues to a
fully balanced (global scheduling decision) scheduler.
We also have some initial placement to the least loaded runqueue in the
least loaded pool.

We look at this as a configurable infrastructure....

What we don't have:
-------------------
The removal of PROC_CHANGE_PENALTY with a time-decay cache affinity
definition.

At ALS I will be reporting on our experience with what we have for an
8-way system and a 4x4-way NUMA system (OSDL) wrt early placement and
the choice of best pool size.

You can get an early start at:
http://lse.sourceforge.net/scheduling/als2001/pmqs.ps

Are you going to be at ALS? Maybe we can chat about what the pros and
cons of each approach are and whether we could/should merge things
together. I am very intrigued by the "CPU History Weight" that I see as
a major add-on to our stuff. What I am not so keen about is the fact
that you seem to only do load balancing at fork and idle time. In a
loaded system that can lead to load imbalances. We do a periodic
(configurable) call, which also has some drawbacks.

Another thing that needs to be thought about is the metric used to
determine <load> on a queue. For simplicity, runqueue length is one
indication; for fairness, maybe the sum of nice values would be ok. We
experimented with both and didn't see too much of a difference; however,
measuring fairness is difficult to do.

* Davide Libenzi <da...@xm...> [20011030 00;38]:"
> Proposal For A More Scalable Linux Scheduler
> by
> Davide Libenzi <da...@xm...>
> Sat 10/27/2001
>
> Episode [1]
>
> Captain's diary, tentative 2, day 1 ...
>
> The current Linux scheduler has been designed and optimized
> to be very fast and have a low I/D cache footprint.
> Inside the schedule() function the fast path is kept very short
> by moving less probable code out:
>
>     if (prev->state == TASK_RUNNING)
>         goto still_running;
>     still_running_back:
>     (fast path follows here)
>     return;
>
>     still_running:
>     (slow path lies here)
>     goto still_running_back;
>
> <============== Rest deleted ==============>
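The multi-queue layout described above (one lock-protected runqueue per
CPU, with initial placement on the least loaded runqueue in the least
loaded pool) can be sketched in a few lines of plain C. This is a
hypothetical illustration: the struct and function names are invented,
not the actual MQ patch code.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 8

struct task {
    struct task *next;
    int prio;
};

/* One runqueue per CPU, each with its own lock, so schedule() on one
 * CPU does not contend with the others. */
struct runqueue {
    int lock;            /* stands in for a per-queue spinlock_t */
    int nr_running;      /* queue length: one candidate <load> metric */
    int nice_sum;        /* sum of nice values: a fairness metric */
    struct task *head;
};

static struct runqueue runqueues[NR_CPUS];

/* Initial placement: pick the least loaded runqueue, as the MQ
 * scheduler does within the least loaded pool. */
static int least_loaded_cpu(void)
{
    int cpu, best = 0;

    for (cpu = 1; cpu < NR_CPUS; cpu++)
        if (runqueues[cpu].nr_running < runqueues[best].nr_running)
            best = cpu;
    return best;
}
```

A new task would be enqueued on runqueues[least_loaded_cpu()] under that
queue's lock; the periodic balancing pass mentioned above would compare
nr_running (or nice_sum) across queues.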
From: Davide L. <da...@xm...> - 2001-10-30 17:11:56

On Tue, 30 Oct 2001, Hubertus Franke wrote:

> Davide, nice analysis.
> I want to point out that some (not all) of the stuff is already done
> in our scalable MQ scheduler (http://lse.sourceforge.net/scheduling).
>
> What we have:
> -------------
> Multiple queues, each protected by their own lock to avoid the
> contention.
> Automatic load balancing across all queues (yes, that creates overhead).
> CPU pooling as a configurable means to get from isolated queues to a
> fully balanced (global scheduling decision) scheduler.
> We also have some initial placement to the least loaded runqueue in the
> least loaded pool.
>
> We look at this as a configurable infrastructure....
>
> What we don't have:
> -------------------
> The removal of PROC_CHANGE_PENALTY with a time-decay cache affinity
> definition.
>
> At ALS I will be reporting on our experience with what we have for an
> 8-way system and a 4x4-way NUMA system (OSDL) wrt early placement and
> the choice of best pool size.
>
> You can get an early start at:
> http://lse.sourceforge.net/scheduling/als2001/pmqs.ps

I see the proposed implementation as a decisive cut with the attempt to
have processes instantly moved across CPUs and stuff like na_goodness,
etc. Inside each CPU the scheduler is _exactly_ the same as the UP one.

> Are you going to be at ALS? Maybe we can chat about what the pros and
> cons of each approach are and whether we could/should merge things
> together. I am very intrigued by the "CPU History Weight" that I see
> as a major add-on to our stuff. What I am not so keen about is the
> fact that you seem to only do load balancing at fork and idle time.
> In a loaded system that can lead to load imbalances.
>
> We do a periodic (configurable) call, which also has some drawbacks.
> Another thing that needs to be thought about is the metric used to
> determine <load> on a queue. For simplicity, runqueue length is one
> indication; for fairness, maybe the sum of nice values would be ok.
> We experimented with both and didn't see too much of a difference;
> however, measuring fairness is difficult to do.

Hey, ... that's part of Episode 2, "Balancing the world", where the evil
Mr. MoveSoon fights with Hysteresis for universe domination :)

- Davide
From: Hubertus F. <fr...@wa...> - 2001-10-30 18:30:26

* Davide Libenzi <da...@xm...> [20011030 12;19]:"
> On Tue, 30 Oct 2001, Hubertus Franke wrote:
>
> > Davide, nice analysis.
> > I want to point out that some (not all) of the stuff is already done
> > in our scalable MQ scheduler (http://lse.sourceforge.net/scheduling).
> > [...]
>
> I see the proposed implementation as a decisive cut with the attempt to
> have processes instantly moved across CPUs and stuff like na_goodness,
> etc. Inside each CPU the scheduler is _exactly_ the same as the UP one.

Well, to that extent that is what MQ does too. We do a local decision
first and then compare across multiple queues. In the pooling approach
we limit that global check to some CPUs within the proximity. I think
your CPU history weight could fit into this model as well. We don't
care how the local decision was reached.

There is however another problem that you haven't addressed yet, which
is realtime. As far as I can tell, the realtime semantics require a
strict ordering with respect to each other and their priorities. The
general approach can be either to limit all RT processes to a single
CPU or, as we have done, declare a global RT runqueue.

> > We do a periodic (configurable) call, which also has some drawbacks.
> > Another thing that needs to be thought about is the metric used to
> > determine <load> on a queue. For simplicity, runqueue length is one
> > indication; for fairness, maybe the sum of nice values would be ok.
> > We experimented with both and didn't see too much of a difference;
> > however, measuring fairness is difficult to do.
>
> Hey, ... that's part of Episode 2, "Balancing the world", where the evil
> Mr. MoveSoon fights with Hysteresis for universe domination :)

Well, one has to be careful: if the system is loaded and processes are
long lived rather than coming and going, initial placement and idle-loop
load balancing don't get you very far with respect to decent load
balancing. In these kinds of scenarios one needs a feedback system. The
trick is to come up with an algorithm that is not too intrusive and not
overcorrecting. Take a look at the paper link, where we experimented
with some of these issues. We tolerated a tolerance band around the
runqueue length difference.

> - Davide

:-# Hubertus
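The "global RT runqueue" option mentioned here can be sketched as a
single shared structure consulted before any per-CPU queue, so the
highest-priority runnable RT task always wins globally. A minimal
hypothetical illustration (rt_prio_count and pick_global_rt are invented
names, not the MQ patch API):

```c
#include <assert.h>
#include <string.h>

#define MAX_RT_PRIO 100

/* Count of runnable RT tasks at each priority; in a real scheduler
 * this would be a lock-protected queue shared by all CPUs. */
static int rt_prio_count[MAX_RT_PRIO];

/* Strict global ordering: scan from the highest priority down. */
static int pick_global_rt(void)
{
    int prio;

    for (prio = MAX_RT_PRIO - 1; prio >= 0; prio--)
        if (rt_prio_count[prio] > 0)
            return prio;
    return -1;   /* no RT task runnable: fall back to the local queue */
}
```

Because every CPU consults the same queue, two globally runnable RT
tasks are always ordered by priority, which is exactly the semantic
concern raised in this message.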
From: Davide L. <da...@xm...> - 2001-10-30 18:42:57

On Tue, 30 Oct 2001, Hubertus Franke wrote:

> * Davide Libenzi <da...@xm...> [20011030 12;19]:"
> >
> > I see the proposed implementation as a decisive cut with the attempt
> > to have processes instantly moved across CPUs and stuff like
> > na_goodness, etc. Inside each CPU the scheduler is _exactly_ the
> > same as the UP one.
>
> Well, to that extent that is what MQ does too. We do a local decision
> first and then compare across multiple queues. In the pooling approach
> we limit that global check to some CPUs within the proximity.
> I think your CPU history weight could fit into this model as well.
> We don't care how the local decision was reached.

That's what I don't want to do, at least at every schedule(). The main
purpose of the proposed scheduler is to relax process movement policies.

> There is however another problem that you haven't addressed yet, which
> is realtime. As far as I can tell, the realtime semantics require a
> strict ordering with respect to each other and their priorities. The
> general approach can be either to limit all RT processes to a single
> CPU or, as we have done, declare a global RT runqueue.

Realtime processes, when woken up, fall into reschedule_idle(), which
will either find an idle CPU or force a reschedule due to a favorable
preemption_goodness(). One of the balancing schemes I'm using tries to
distribute RT tasks evenly across CPUs.

> > > We do a periodic (configurable) call, which also has some drawbacks.
> > > [...]
>
> Well, one has to be careful: if the system is loaded and processes are
> long lived rather than coming and going, initial placement and
> idle-loop load balancing don't get you very far with respect to decent
> load balancing. In these kinds of scenarios one needs a feedback
> system. The trick is to come up with an algorithm that is not too
> intrusive and not overcorrecting. Take a look at the paper link, where
> we experimented with some of these issues. We tolerated a tolerance
> band around the runqueue length difference.

I'm currently trying a hysteresis approach with a tunable amount of
hysteresis to observe the different performance/behavior.

- Davide
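The hysteresis approach mentioned here amounts to migrating a task only
when the load gap between two queues exceeds a tunable threshold, so
small transient imbalances are ignored instead of causing ping-ponging.
A minimal sketch; the variable name balance_hysteresis is an assumption
for illustration:

```c
#include <assert.h>

/* Tunable: how much longer the source queue must be before a
 * migration is considered worthwhile. */
static int balance_hysteresis = 2;

static int should_migrate(int src_qlen, int dst_qlen)
{
    return src_qlen - dst_qlen > balance_hysteresis;
}
```

Raising balance_hysteresis keeps tasks on their cache-warm CPU longer
(less overcorrection); lowering it balances more aggressively.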
From: Hubertus F. <fr...@wa...> - 2001-10-30 18:53:46

* Davide Libenzi <da...@xm...> [20011030 13;50]:"
> On Tue, 30 Oct 2001, Hubertus Franke wrote:
>
> > Well, to that extent that is what MQ does too. We do a local decision
> > first and then compare across multiple queues. In the pooling
> > approach we limit that global check to some CPUs within the
> > proximity. I think your CPU history weight could fit into this model
> > as well. We don't care how the local decision was reached.
>
> That's what I don't want to do, at least at every schedule(). The main
> purpose of the proposed scheduler is to relax process movement
> policies.

And I think that is a GOOD design point, no question. One of our next
steps is/was to relax the global decision making process that we pursued
to be able to compare apples with apples. Doing it only every Nth time
might be something useful.

> > There is however another problem that you haven't addressed yet,
> > which is realtime. As far as I can tell, the realtime semantics
> > require a strict ordering with respect to each other and their
> > priorities. The general approach can be either to limit all RT
> > processes to a single CPU or, as we have done, declare a global RT
> > runqueue.
>
> Realtime processes, when woken up, fall into reschedule_idle(), which
> will either find an idle CPU or force a reschedule due to a favorable
> preemption_goodness(). One of the balancing schemes I'm using tries to
> distribute RT tasks evenly across CPUs.

I think that would be a problem. My understanding is that if two RT
processes are globally runnable, then one must run the one with the
higher priority. Am I missing something here?

> I'm currently trying a hysteresis approach with a tunable amount of
> hysteresis to observe the different performance/behavior.

That seems appropriate ... looking forward to seeing that.

> - Davide

-- Hubertus
From: Mike K. <kr...@us...> - 2001-10-30 19:08:28

On Tue, Oct 30, 2001 at 11:52:57AM -0500, Hubertus Franke wrote:
> * Davide Libenzi <da...@xm...> [20011030 13;50]:"
> > On Tue, 30 Oct 2001, Hubertus Franke wrote:
> >
> > > There is however another problem that you haven't addressed yet,
> > > which is realtime. As far as I can tell, the realtime semantics
> > > require a strict ordering with respect to each other and their
> > > priorities. The general approach can be either to limit all RT
> > > processes to a single CPU or, as we have done, declare a global RT
> > > runqueue.
> >
> > Realtime processes, when woken up, fall into reschedule_idle(), which
> > will either find an idle CPU or force a reschedule due to a favorable
> > preemption_goodness(). One of the balancing schemes I'm using tries
> > to distribute RT tasks evenly across CPUs.
>
> I think that would be a problem. My understanding is that if two RT
> processes are globally runnable, then one must run the one with the
> higher priority. Am I missing something here?

It is not just the relative priorities of the realtime tasks, but also
the scheduling policy. SCHED_FIFO (and to some extent SCHED_RR) implies
an ordering within the runqueue for tasks of the same priority. This is
difficult to achieve with multiple runqueues. Most scheduler
implementations I am aware of do something like what you suggested
above.

-- Mike
From: Davide L. <da...@xm...> - 2001-10-30 19:12:19

On Tue, 30 Oct 2001, Hubertus Franke wrote:

> > Realtime processes, when woken up, fall into reschedule_idle(), which
> > will either find an idle CPU or force a reschedule due to a favorable
> > preemption_goodness(). One of the balancing schemes I'm using tries
> > to distribute RT tasks evenly across CPUs.
>
> I think that would be a problem. My understanding is that if two RT
> processes are globally runnable, then one must run the one with the
> higher priority. Am I missing something here?

The only difference is when an RT task is currently running and another
RT task kicks in with an unfavorable preemption_goodness(). In the
current scheduler reschedule_idle() loops through the CPUs to find one
for the incoming RT task, while the proposed scheduler actually doesn't.
What I'm coding is to plug into get_best_cpu() a way to evenly spread RT
tasks between CPUs. But even the current rapid move of an RT task to
another CPU is not that rapid, due to the IPI+schedule() latency. Maybe
it's faster to have the currently running RT task reschedule instead of
the remote CPU, maybe :)

The current IPI method creates very funny/undesirable behavior due to
the IPI+schedule() latency. Watching the schedcnt dump of a lat_ctx run
on my 2-way SMP system, I saw both tasks sitting on the same CPU, with
the other CPU bombarded by reschedule IPIs without being able to catch
one task, due to the latency.

- Davide
From: Mike K. <kr...@us...> - 2001-10-31 00:11:45

Davide,

We had an 8-way sitting idle, so I ran some benchmarks comparing your
scheduler to Vanilla and also to the MQ scheduler at LSE. When I get a
chance, I'd also like to run TPC-H with your scheduler. Here are the
results, for what they are worth.

-- Mike

--------------------------------------------------------------------
reflex - Similar to lat_ctx of LMbench but much more aggressive.
         Keeps more than a single task active. # active tasks
         is 1/2 the total number of tasks. Result is 'round trip'
         time. Less is better.
--------------------------------------------------------------------
# tasks         Vanilla Sched   MQ Sched        Xsched
--------------------------------------------------------------------
   2              6.521           7.429           8.865
   4             11.304           8.581           3.187
   8             13.501           6.907           2.425
  16             15.855           5.299           1.641
  32             17.742           3.267           2.049
  64             20.613           2.960           2.236
 128             26.234           2.983           2.527
--------------------------------------------------------------------
Chat - VolanoMark simulator. Result is a measure of throughput.
       Higher is better.
--------------------------------------------------------------------
Configuration Parms             Vanilla   MQ Sched   Xsched
--------------------------------------------------------------------
10 rooms, 100 messages            69465     400055   229301
10 rooms, 200 messages            86868     354468   187775
10 rooms, 300 messages           103715     363141   205799
10 rooms, 400 messages           133385     380603   195987
20 rooms, 100 messages            50936     396710   216406
20 rooms, 200 messages            74200     385996   197076
20 rooms, 300 messages            95509     402232   210225
20 rooms, 400 messages           101305     437776   215118
30 rooms, 100 messages            42019     376442   247781
30 rooms, 200 messages            42315     384598   222258
30 rooms, 300 messages            52948     413984   231298
30 rooms, 400 messages             6564      46316    24879
--------------------------------------------------------------------
lat_ctx - Context switching component of LMbench. Result is 'round
          trip' time. Less is better. Each cell lists the two values
          reported per run; 'ovr' is the measured overhead for each
          kernel (Vanilla/MQ/Xsched).
--------------------------------------------------------------------
size  #tasks   Vanilla          MQ Sched        Xsched          ovr (V/MQ/X)
--------------------------------------------------------------------
 0k      2       2.79   2.96     4.74   4.83    11.29  11.30    2.30/2.33/2.41
 0k      4       2.88   3.78     5.51   6.72     6.77   8.50    2.34/2.36/2.35
 0k      8       3.08   4.16     5.70   6.27     7.19   8.40    2.36/2.38/2.38
 0k     16       3.08   3.69     7.54   8.34     8.30   8.92    2.43/2.35/2.37
 0k     32       3.28   4.54     6.58   7.99     8.61   8.63    2.46/2.47/2.42
 0k     64       3.78   4.72     7.16   7.54     8.48   8.48    2.48/2.46/2.47
 0k    128       6.60   7.28     7.49   7.96     8.20   8.25    2.53/2.42/2.50
 0k    256       9.91  10.08     8.02   8.29     8.44   8.47    2.59/2.56/2.53
 4k      2       3.31   4.21     5.08   5.13     5.36  10.82    3.93/3.83/3.92
 4k      4       3.95   4.46     6.17   8.66    11.97  11.98    3.95/3.84/3.84
 4k      8       4.45   5.28     6.55   7.51     8.63   8.88    3.91/3.90/3.98
 4k     16       4.60   5.49     9.40   9.61     9.53   9.67    3.95/3.91/3.91
 4k     32       4.87   5.82     8.59   9.17     9.96   9.97    3.96/3.92/3.90
 4k     64      13.99  14.06     8.77   9.59     9.93   9.94    3.98/4.01/3.95
 4k    128      13.26  18.27     9.28  10.11     9.84   9.85    4.02/3.94/3.98
 4k    256      15.40  16.46    11.02  11.52    10.22  10.26    4.07/3.99/4.07
 8k      2       4.54   5.40     6.51   6.62    11.71  11.72    5.39/5.35/5.33
 8k      4       5.83   6.50     6.19   7.26     8.80  11.26    5.41/5.45/5.41
 8k      8       5.81   6.98     8.18   8.81    10.15  10.17    5.39/5.36/5.41
 8k     16       6.86   7.57     9.80  10.20    11.69  11.71    5.42/5.31/5.37
 8k     32       6.77   7.85    10.28  10.90    11.46  11.47    5.42/5.41/5.41
 8k     64      17.32  17.49    10.60  11.24    11.45  11.51    5.48/5.48/5.45
 8k    128      14.74  20.35    11.43  11.97    11.52  11.53    5.51/5.40/5.45
 8k    256      16.36  18.89    13.77  14.46    12.59  12.61    5.54/5.47/5.54
16k      2       8.47   9.62     8.63   8.86     7.11   9.29    8.30/8.37/8.32
16k      4       8.76   9.36     9.42  10.07    11.83  13.28    8.33/8.30/8.29
16k      8       9.22  10.14    13.78  14.21    13.30  13.49    8.33/8.35/8.30
16k     16       9.39  10.08    12.57  12.98    14.30  14.31    8.34/8.29/8.35
16k     32      26.91  28.87    13.26  13.64    14.26  14.35    8.39/8.36/8.35
16k     64      28.82  31.86    13.62  13.79    14.62  14.63    8.40/8.47/8.45
16k    128      21.35  24.18    14.65  15.51    14.59  14.62    8.41/8.38/8.42
16k    256      28.84  30.53    20.76  22.94    25.63  25.79    8.47/8.47/8.46
32k      2      13.80  14.74    15.96  16.10    21.87  21.89   14.27/14.19/14.14
32k      4      13.77  14.48    16.49  18.04    21.99  22.00   14.25/14.16/14.19
32k      8      13.91  14.90    18.63  19.01    20.99  21.00   14.26/14.19/14.24
32k     16      43.47  43.87    18.44  19.40    20.37  20.38   14.26/14.20/14.27
32k     32      40.37  52.34    19.14  19.77    19.48  19.50   14.29/14.24/14.30
32k     64      42.02  46.30    19.84  20.58    20.44  20.52   14.33/14.29/14.31
32k    128      50.81  55.60    28.88  30.81    31.83  31.84   14.30/14.30/14.30
32k    256      98.82  99.65    88.38  90.05    82.76  83.08   14.38/14.35/14.41
64k      2      24.00  24.59    26.08  26.21    32.00  32.01   26.01/26.03/26.00
64k      4      23.98  24.70    26.71  26.86    32.21  32.24   26.09/26.04/26.05
64k      8      78.16  78.41    28.97  30.39    31.07  31.10   26.00/26.05/26.09
64k     16     127.98 136.30    28.90  29.52    30.55  30.61   26.03/26.03/26.04
64k     32      97.96 104.32    31.65  33.72    31.19  31.31   26.11/26.08/26.31
64k     64      99.31 111.11    48.61  53.92    47.59  47.62   26.15/26.08/26.20
64k    128     175.26 179.98   149.98 152.29   166.18 166.44   26.16/26.09/26.12
64k    256     238.61 244.75   227.39 227.64   224.67 224.73   26.21/26.17/26.23
From: Davide L. <da...@xm...> - 2001-10-31 00:59:20

On Tue, 30 Oct 2001, Mike Kravetz wrote:

> --------------------------------------------------------------------
> reflex - Similar to lat_ctx of LMbench but much more aggressive.
>          Keeps more than a single task active. # active tasks
>          is 1/2 the total number of tasks. Result is 'round trip'
>          time. Less is better.
> --------------------------------------------------------------------
> # tasks         Vanilla Sched   MQ Sched        Xsched
> --------------------------------------------------------------------
>    2              6.521           7.429           8.865
>    4             11.304           8.581           3.187
>    8             13.501           6.907           2.425
>   16             15.855           5.299           1.641
>   32             17.742           3.267           2.049
>   64             20.613           2.960           2.236
>  128             26.234           2.983           2.527

Try using the LatSched kernel patch to get the real cycles spent inside
the scheduler. Also a schedcnt dump would help. I'm saying this because
I had to reject some tests ( lat_ctx is one ) since they give very
different process distributions and hence different results. The
numbers going down with increasing runqueue length look strange to me,
and it seems that things like a different process distribution come
into play. With the cycle counter running on the proposed scheduler I
saw the numbers go up with a small slope, but they went up.

> --------------------------------------------------------------------
> Chat - VolanoMark simulator. Result is a measure of throughput.
>        Higher is better.
> --------------------------------------------------------------------
> [...]

How does this test work exactly? Did you measure the run queue length
while it was running? I'm going to modify LatSched to output other info
like rqlen and tsk->counter at switch time. I'd like to have a schedcnt
dump while this test is running. Anyway, this test shows that what I'm
doing now ( working on the balancing schemes ) is not wasted time :)

> --------------------------------------------------------------------
> lat_ctx - Context switching component of LMbench. Result is 'round
>           trip' time. Less is better.
> --------------------------------------------------------------------
> [...]

lat_ctx is ok only on UP machines; try looking at the process
distribution on an SMP vanilla kernel.

- Davide
From: Mike K. <kr...@us...> - 2001-10-31 04:31:13

On Tue, Oct 30, 2001 at 05:06:16PM -0800, Davide Libenzi wrote:
> On Tue, 30 Oct 2001, Mike Kravetz wrote:
> > --------------------------------------------------------------------
> > Chat - VolanoMark simulator. Result is a measure of throughput.
> >        Higher is better.
> > --------------------------------------------------------------------
>
> How does this test work exactly? Did you measure the run queue length
> while it was running? I'm going to modify LatSched to output other
> info like rqlen and tsk->counter at switch time. I'd like to have a
> schedcnt dump while this test is running. Anyway, this test shows that
> what I'm doing now ( working on the balancing schemes ) is not wasted
> time :)

Take a look at our OLS presentation slides starting at:

http://lse.sourceforge.net/scheduling/ols2001/img50.htm

This is basically a message passing (via sockets) benchmark with high
thread counts/runqueue lengths.

I'll kick off the all-important 'kernel build benchmark' and have some
results tomorrow.

-- Mike
From: Davide L. <da...@xm...> - 2001-10-31 04:37:31

On Tue, 30 Oct 2001, Mike Kravetz wrote:

> On Tue, Oct 30, 2001 at 05:06:16PM -0800, Davide Libenzi wrote:
> > How does this test work exactly?
> > [...]
>
> Take a look at our OLS presentation slides starting at:
>
> http://lse.sourceforge.net/scheduling/ols2001/img50.htm
>
> This is basically a message passing (via sockets) benchmark with high
> thread counts/runqueue lengths.

Are threads pre-created at startup, or are they dynamic?

- Davide
From: Mike K. <kr...@us...> - 2001-10-31 04:51:16

On Tue, Oct 30, 2001 at 08:45:02PM -0800, Davide Libenzi wrote:
> On Tue, 30 Oct 2001, Mike Kravetz wrote:
>
> > This is basically a message passing (via sockets) benchmark with high
> > thread counts/runqueue lengths.
>
> Are threads pre-created at startup, or are they dynamic?

They are pre-created at startup.

-- Mike
From: Mike K. <kr...@us...> - 2001-10-31 17:08:01

On Tue, Oct 30, 2001 at 09:29:41PM -0800, Mike Kravetz wrote:
>
> I'll kick off the all-important 'kernel build benchmark' and have some
> results tomorrow.

--------------------------------------------------------------------
mkbench - Times how long it takes to compile the kernel. On this
          8 CPU system we use 'make -j 8' and increase the number
          of makes run in parallel. Result is average build time
          in seconds.
--------------------------------------------------------------------
# parallel makes        Vanilla Sched   MQ Sched        Xsched
--------------------------------------------------------------------
        1                    56              55             61
        2                   105             105            105
        3                   156             156            154
        4                   206             206            205
        5                   257             257            256
        6                   308             308            307

-- Mike
From: Davide L. <da...@xm...> - 2001-10-31 17:51:53

On Wed, 31 Oct 2001, Mike Kravetz wrote:

> --------------------------------------------------------------------
> # parallel makes        Vanilla Sched   MQ Sched        Xsched
> --------------------------------------------------------------------
>         1                    56              55             61
>         2                   105             105            105
>         3                   156             156            154
>         4                   206             206            205
>         5                   257             257            256
>         6                   308             308            307

Thanks Mike, I'm pretty happy with this because it shows that the
current process migration scheme does not suck that much, only a bit.
Kernel builds are not a stress test for the scheduler: when I run a
-j 2 build on my 2-way SMP I get a vmstat cs ~= 500.

I'm currently working on different schemes of process balancing among
CPUs, with a concept of tunable hysteresis, to try to give users the
ability to set up the scheduler behavior for their needs.

What is happening with VolanoMark on the proposed scheduler is that one
task starts creating the requested number of threads, and each newly
created thread is very likely to go to sleep right after its creation.
So suppose the main test task runs on CPU0 ( rqlen[0] == 1 );
get_best_cpu() is going to return 1 because it's probably the first
free CPU. But the new task goes to sleep/"waits for some event" before
the test task on CPU0 calls fork()/clone() again. This creates a heavy
process distribution over CPU1 which, when the real test begins, will
not see any idle reschedule and hence no opportunity to balance.

- Davide
From: Mike K. <kr...@us...> - 2001-10-31 23:14:32

Here's some data from TPC-H. As you can imagine, I can't post any
actual numbers, so I'll just report the % improvement.

*** Disclaimer *** this is not representative of an actual TPC-H run,
as the database is small (500MB) and is run with as much data in memory
as possible. Hence, there is little or no I/O. It is my understanding
that on 'real' TPC-H runs the I/O path becomes a limiting factor and no
scheduler impact can be seen.

--------------------------------------------------------------------
TPC-H - 8 CPU system. Small (500MB) database with most data cached
        and little if any I/O.
--------------------------------------------------------------------
        Vanilla         MQ Sched        Xsched
--------------------------------------------------------------------
        Baseline        +6.2%           +2.4%
--------------------------------------------------------------------

I'm going to try and merge your 'cache warmth' replacement for
PROC_CHANGE_PENALTY into the LSE MQ scheduler, as well as enable the
code to prevent task stealing during IPI delivery. This should still be
significantly different from your design because MQ will still attempt
to make global decisions. Results should be interesting.

-- Mike
From: Davide L. <da...@xm...> - 2001-11-01 00:01:19
|
On Wed, 31 Oct 2001, Mike Kravetz wrote:
> I'm going to try and merge your 'cache warmth' replacement for
> PROC_CHANGE_PENALTY into the LSE MQ scheduler, as well as enable
> the code to prevent task stealing during IPI delivery. This
> should still be significantly different than your design because
> MQ will still attempt to make global decisions. Results should
> be interesting.

I'm currently evaluating different weights for that. Right now I'm using:

    if (p->cpu_jtime > jiffies)
        weight += p->cpu_jtime - jiffies;

which might be too much. Possible solutions:

1)
    if (p->cpu_jtime > jiffies)
        weight += (p->cpu_jtime - jiffies) >> 1;

2)
    int wtable[];

    if (p->cpu_jtime > jiffies)
        weight += wtable[p->cpu_jtime - jiffies];

Speed will favor 1). Another optimization concerns jiffies, which is
volatile and so forces gcc to always reload it:

    static inline int goodness(struct task_struct * p,
                               struct mm_struct *this_mm,
                               unsigned long jiff)

might be better, with the jiffies read hoisted out of the goodness loop.

Mike, I suggest you use the LatSched patch to 1) see how the scheduler is
really performing and 2) understand whether certain tests give certain
results due to weird distributions.

- Davide
From: Hubertus F. <fr...@wa...> - 2001-11-02 14:21:25
|
* Davide Libenzi <da...@xm...> [20011031 18;53]:"
> I'm currently evaluating different weights for that.
> Right now I'm using :
>
>     if (p->cpu_jtime > jiffies)
>         weight += p->cpu_jtime - jiffies;
>
> that might be too much.
> [...]
> Mike I suggest you to use the LatSched patch to 1) know how really is
> performing the scheduler 2) understand if certain test gives certain
> results due wierd distributions.

One more. Throughout our MQ evaluation it was also true that the overall
performance, particularly for large thread counts, was very sensitive to
the goodness function; that's why na_goodness_local was introduced.

-- Hubertus
From: Mike K. <kr...@us...> - 2001-11-02 16:58:15
|
On Fri, Nov 02, 2001 at 07:20:36AM -0500, Hubertus Franke wrote:
>
> One more. Throughout our MQ evaluation, it was also true that
> the overall performance particularly for large thread counts was
> very sensitive to the goodness function, that why a na_goodness_local
> was introduced.

Correct, we did notice measurable differences in performance just from
the additional (unnecessary) checks in goodness. Unfortunately, the
current version of MQ has 3 different (but similar) variations of the
goodness function. This is UGLY, and I intend to clean it up (without
impacting performance, of course :).

-- Mike
From: Davide L. <da...@xm...> - 2001-11-03 22:02:25
|
On Fri, 2 Nov 2001, Hubertus Franke wrote:
> One more. Throughout our MQ evaluation, it was also true that
> the overall performance particularly for large thread counts was
> very sensitive to the goodness function, that why a na_goodness_local
> was introduced.

Yes it is, but the real question is: is it better to save a few clock
cycles in goodness(), or to achieve better process scheduling decisions?

- Davide
From: Hubertus F. <fr...@wa...> - 2001-11-02 14:19:09
|
* Mike Kravetz <kr...@us...> [20011031 18;12]:"
> Here's some data from TPC-H. As you can imagine, I can't post any
> actual numbers, so I'll just report the % improvement.
> [...]
> --------------------------------------------------------------------
> TPC-H - 8 CPU system.  Small (500MB) database with most data
>         cached and little if any I/O.
> --------------------------------------------------------------------
>    Vanilla        MQ Sched       Xsched
> --------------------------------------------------------------------
>    Baseline        +6.2%          +2.4%
>
> I'm going to try and merge your 'cache warmth' replacement for
> PROC_CHANGE_PENALTY into the LSE MQ scheduler, as well as enable
> the code to prevent task stealing during IPI delivery. This
> should still be significantly different than your design because
> MQ will still attempt to make global decisions. Results should
> be interesting.

Excellent idea. I think that's the new idea Davide brings to the table;
MQ itself has been around in our stuff and in the HP scheduler (Rhine).

BTW: the reason why, for reflex, MQ starts getting better with larger
numbers is that the likelihood of moving a task goes down.

-- Hubertus
From: Davide L. <da...@xm...> - 2001-11-03 22:00:19
|
On Fri, 2 Nov 2001, Hubertus Franke wrote:
> Excellent idea, I think, that's the new idea that Davide brings to
> the table. MQ has been around in our stuff and in the HP scheduler
> (Rhine).
> [...]
> BTW: The reason why for the reflex MQ starts getting better with
> larger numbers is because the likelihood of moving a task is going
> down.

As I told Mike in a previous email, the reason for the lower performance
on certain tests is the lack of balancing tests in reschedule_idle().
What happens in these tests is that the main test thread starts creating
a bunch of threads that are very likely to go to sleep very soon,
waiting for the main test thread to issue a start event.
This makes get_best_cpu() return the same CPU for all the threads and,
when the real test begins, there's no idle reschedule ( due to the load )
to enable a decent balancing.
A simple balancing hook in reschedule_idle() has made the scheduler
perform very well in this kind of test.

- Davide
From: Hubertus F. <fr...@wa...> - 2001-11-05 16:00:30
|
Davide, by introducing a balancer hook in reschedule_idle(), you are
actually coming back to the design we have in the PMQS (::= MQ +
pooling). What we experimented with are different scheduling and
rescheduling pools. In your case you really wanted to isolate the
queues.

-- Hubertus

* Davide Libenzi <da...@xm...> [20011103 17;08]:"
> As i told to Mike in a previous email the reason of lower performance
> on certain test is due the lack of balancing tests in
> reschedule_idle().
> [...]
> A simple balancing hook in reschedule_idle() has made the scheduler to
> perform very well in this kind of tests.
From: Davide L. <da...@xm...> - 2001-11-05 18:11:57
|
On Mon, 5 Nov 2001, Hubertus Franke wrote:
> Davide, by introducing a balancer hook in reschedule_idle(), you
> are actually coming back to the design we have in the PMQS
> (::= MQ + pooling). What we experimented around with are different
> scheduling and reschedule pools. In your case you really wanted to
> isolate the queues.

No: no global scheduling decisions are taken inside the main schedule(),
and the hook inside reschedule_idle() is not always triggered. I'm just
trying that approach even if, I hope you know, it's easier to convince
my cat to have a bath than to persuade Linus to change the scheduler :)

My approach leaves the main schedule() path _exactly_ the same as the
old one ( if you exclude the cpu-history ), so complaints like "hey, we
lose 1% of performance when the cpu load is 0.01% ..." should be stopped
before they start. The unchanged code will also make people who do not
understand it feel safer, because the old scheduler,
<irony>which took these guys 5 years to assimilate</irony>, is basically
not changed.

If it's going to show strong performance on big/loaded systems, speaking
with numbers ( which is the only thing that matters ), they should at
least give _good_ reasons/numbers to reject the proposed one. It's far
from sure anyway that it'll have "good" numbers, because it's not easy
to plug sophisticated forms of migration decisions in without impacting
the fast path.

It's not that I don't like the MQ scheduler, and I think the overhead
resulting from the global scheduling decision is not an issue, but I can
see it being attacked for the reason described above, and hence never
merged.

- Davide