From: Martin J. B. <mb...@ar...> - 2003-01-10 00:01:32
I tried a small experiment today - did a simple restriction of the O(1) scheduler to only balance inside a node. Coupled with the small initial load balancing patch floating around, this covers 95% of cases, is a trivial change (3 lines), performs just as well as Erich's patch on a kernel compile, and actually better on schedbench.

This is NOT meant to be a replacement for the code Erich wrote, it's meant to be a simple way to get integration and acceptance. Code that just forks and never execs will stay on one node - but we can take the code Erich wrote, and put it in a separate rebalancer that fires much less often to do a cross-node rebalance. All that would be under #ifdef CONFIG_NUMA; the only thing that would touch mainline is these three lines of change, and it's trivial to see they're completely equivalent to the current code on non-NUMA systems.

I also believe that this is the more correct approach in design: it should result in much less cross-node migration of tasks, and less scanning of remote runqueues.

Opinions / comments?

M.
Kernbench:
                      Elapsed    User       System    CPU
2.5.54-mjb3           19.41s     186.38s    39.624s   1191.4%
2.5.54-mjb3-mjbsched  19.508s    186.356s   39.888s   1164.6%

Schedbench 4:
                      AvgUser    Elapsed    TotalUser  TotalSys
2.5.54-mjb3           0.00       35.14      88.82      0.64
2.5.54-mjb3-mjbsched  0.00       31.84      88.91      0.49

Schedbench 8:
                      AvgUser    Elapsed    TotalUser  TotalSys
2.5.54-mjb3           0.00       47.55      269.36     1.48
2.5.54-mjb3-mjbsched  0.00       41.01      252.34     1.07

Schedbench 16:
                      AvgUser    Elapsed    TotalUser  TotalSys
2.5.54-mjb3           0.00       76.53      957.48     4.17
2.5.54-mjb3-mjbsched  0.00       69.01      792.71     2.74

Schedbench 32:
                      AvgUser    Elapsed    TotalUser  TotalSys
2.5.54-mjb3           0.00       145.20     1993.97    11.05
2.5.54-mjb3-mjbsched  0.00       117.47     1798.93    5.95

Schedbench 64:
                      AvgUser    Elapsed    TotalUser  TotalSys
2.5.54-mjb3           0.00       307.80     4643.55    20.36
2.5.54-mjb3-mjbsched  0.00       241.04     3589.55    12.74

-----------------------------------------

diff -purN -X /home/mbligh/.diff.exclude virgin/kernel/sched.c mjbsched/kernel/sched.c
--- virgin/kernel/sched.c	Mon Dec  9 18:46:15 2002
+++ mjbsched/kernel/sched.c	Thu Jan  9 14:09:17 2003
@@ -654,7 +654,7 @@ static inline unsigned int double_lock_b
 /*
  * find_busiest_queue - find the busiest runqueue.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +689,7 @@ static inline runqueue_t *find_busiest_q
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1 << i) & cpumask) )
 			continue;

 		rq_src = cpu_rq(i);
@@ -764,7 +764,8 @@ static void load_balance(runqueue_t *thi
 	struct list_head *head, *curr;
 	task_t *tmp;

-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
+			__node_to_cpu_mask(__cpu_to_node(this_cpu)) );
 	if (!busiest)
 		goto out;

---------------------------------------------------

A tiny change in the current ilb patch is also needed to stop it using a macro from the first patch:

diff -purN -X /home/mbligh/.diff.exclude ilbold/kernel/sched.c ilbnew/kernel/sched.c
--- ilbold/kernel/sched.c	Thu Jan  9 15:20:53 2003
+++ ilbnew/kernel/sched.c	Thu Jan  9 15:27:49 2003
@@ -2213,6 +2213,7 @@ static void sched_migrate_task(task_t *p
 static int sched_best_cpu(struct task_struct *p)
 {
 	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;

 	best_cpu = task_cpu(p);
 	if (cpu_rq(best_cpu)->nr_running <= 2)
@@ -2226,9 +2227,11 @@ static int sched_best_cpu(struct task_st
 			node = i;
 		}
 	}
+
 	minload = 10000000;
-	loop_over_node(i,node) {
-		if (!cpu_online(i))
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1 << i)))
 			continue;
 		if (cpu_rq(i)->nr_running < minload) {
 			best_cpu = i;
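A toy model may make the effect of the new cpumask argument easier to see. The 4-nodes-of-4-CPUs layout and helper names below are invented for illustration; the real kernel derives the mask via __node_to_cpu_mask(__cpu_to_node(this_cpu)), and on a non-NUMA system that mask covers every CPU, which is why the change is equivalent to the current code there:

```c
#include <assert.h>

/* Invented topology for illustration: 4 nodes x 4 CPUs each. */
#define CPUS_PER_NODE 4

static int cpu_to_node(int cpu)
{
	return cpu / CPUS_PER_NODE;
}

static unsigned long node_to_cpu_mask(int node)
{
	/* CPUS_PER_NODE consecutive bits, shifted to this node's range */
	return ((1UL << CPUS_PER_NODE) - 1) << (node * CPUS_PER_NODE);
}

/* Would find_busiest_queue() running on this_cpu scan src_cpu's
 * runqueue?  Only if src_cpu's bit is set in this node's cpumask. */
static int may_balance_from(int this_cpu, int src_cpu)
{
	unsigned long cpumask = node_to_cpu_mask(cpu_to_node(this_cpu));

	return !!(cpumask & (1UL << src_cpu));
}
```

CPU 0 may steal from CPU 3 (same node) but not from CPU 4 (next node), so ordinary load balancing never crosses a node boundary.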
From: Michael H. <hoh...@us...> - 2003-01-10 05:37:03
On Thu, 2003-01-09 at 15:54, Martin J. Bligh wrote:
> I tried a small experiment today - did a simple restriction of
> the O(1) scheduler to only balance inside a node. Coupled with
> the small initial load balancing patch floating around, this
> covers 95% of cases, is a trivial change (3 lines), performs
> just as well as Erich's patch on a kernel compile, and actually
> better on schedbench.
>
> This is NOT meant to be a replacement for the code Erich wrote,
> it's meant to be a simple way to get integration and acceptance.
> Code that just forks and never execs will stay on one node - but
> we can take the code Erich wrote, and put it in a separate rebalancer
> that fires much less often to do a cross-node rebalance.

I tried this on my 4 node NUMAQ (16 procs, 16GB memory) and got similar results. Also, added in the cputime_stats patch and am attaching the matrix results from the 32 process run. Basically, all runs show that the initial load balancer is able to place the tasks evenly across the nodes, and the better overall times show that not migrating to keep the nodes balanced over time results in better performance - at least on these boxes.

Obviously, there can be situations where load balancing across nodes is necessary, but these results point to less load balancing being better, at least on these machines. It will be interesting to repeat this on boxes with other interconnects.
$ reportbench hacksched54 sched54 stock54
Kernbench:
             Elapsed   User       System    CPU
hacksched54  29.406s   282.18s    81.716s   1236.8%
sched54      29.112s   283.888s   82.84s    1259.4%
stock54      31.348s   303.134s   87.824s   1247.2%

Schedbench 4:
             AvgUser  Elapsed  TotalUser  TotalSys
hacksched54  21.94    31.93    87.81      0.53
sched54      22.03    34.90    88.15      0.75
stock54      49.35    57.53    197.45     0.86

Schedbench 8:
             AvgUser  Elapsed  TotalUser  TotalSys
hacksched54  28.23    31.62    225.87     1.11
sched54      27.95    37.12    223.66     1.50
stock54      43.14    62.97    345.18     2.12

Schedbench 16:
             AvgUser  Elapsed  TotalUser  TotalSys
hacksched54  49.29    71.31    788.83     2.88
sched54      55.37    69.58    886.10     3.79
stock54      66.00    81.25    1056.25    7.12

Schedbench 32:
             AvgUser  Elapsed  TotalUser  TotalSys
hacksched54  56.41    117.98   1805.35    5.90
sched54      57.93    132.11   1854.01    10.74
stock54      77.81    173.26   2490.31    12.37

Schedbench 64:
             AvgUser  Elapsed  TotalUser  TotalSys
hacksched54  56.62    231.93   3624.42    13.32
sched54      72.91    308.87   4667.03    21.06
stock54      86.68    368.55   5548.57    25.73

hacksched54 = 2.5.54 + Martin's tiny NUMA patch
                     + 03-cputimes_stat-2.5.53.patch
                     + 02-numa-sched-ilb-2.5.53-21.patch
sched54     = 2.5.54 + 01-numa-sched-core-2.5.53-24.patch
                     + 02-ilb-2.5.53-21.patch02
                     + 03-cputimes_stat-2.5.53.patch
stock54     = 2.5.54 + 03-cputimes_stat-2.5.53.patch

Detailed results from running numa_test 32:

Executing 32 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 4.557
Job  node00  node01  node02  node03  | iSched MSched | UserTime(s)
  1   100.0    0.0     0.0     0.0  |   0      0    |  57.12
  2     0.0  100.0     0.0     0.0  |   1      1    |  55.89
  3   100.0    0.0     0.0     0.0  |   0      0    |  55.39
  4     0.0    0.0   100.0     0.0  |   2      2    |  56.67
  5     0.0    0.0     0.0   100.0  |   3      3    |  57.08
  6     0.0  100.0     0.0     0.0  |   1      1    |  56.31
  7   100.0    0.0     0.0     0.0  |   0      0    |  57.11
  8     0.0    0.0     0.0   100.0  |   3      3    |  56.66
  9     0.0  100.0     0.0     0.0  |   1      1    |  55.87
 10     0.0    0.0   100.0     0.0  |   2      2    |  55.83
 11     0.0    0.0     0.0   100.0  |   3      3    |  56.01
 12     0.0  100.0     0.0     0.0  |   1      1    |  56.56
 13     0.0    0.0   100.0     0.0  |   2      2    |  56.31
 14     0.0    0.0     0.0   100.0  |   3      3    |  56.40
 15   100.0    0.0     0.0     0.0  |   0      0    |  55.82
 16     0.0  100.0     0.0     0.0  |   1      1    |  57.47
 17     0.0    0.0   100.0     0.0  |   2      2    |  56.76
 18     0.0    0.0     0.0   100.0  |   3      3    |  57.10
 19     0.0  100.0     0.0     0.0  |   1      1    |  57.26
 20     0.0    0.0   100.0     0.0  |   2      2    |  56.48
 21     0.0    0.0     0.0   100.0  |   3      3    |  56.65
 22     0.0  100.0     0.0     0.0  |   1      1    |  55.81
 23     0.0    0.0   100.0     0.0  |   2      2    |  55.77
 24     0.0    0.0     0.0   100.0  |   3      3    |  56.91
 25     0.0  100.0     0.0     0.0  |   1      1    |  56.86
 26     0.0    0.0   100.0     0.0  |   2      2    |  56.62
 27     0.0    0.0     0.0   100.0  |   3      3    |  57.16
 28     0.0    0.0   100.0     0.0  |   2      2    |  56.36
 29   100.0    0.0     0.0     0.0  |   0      0    |  55.72
 30   100.0    0.0     0.0     0.0  |   0      0    |  56.00
 31   100.0    0.0     0.0     0.0  |   0      0    |  55.48
 32   100.0    0.0     0.0     0.0  |   0      0    |  55.59
AverageUserTime 56.41 seconds
ElapsedTime     117.98
TotalUserTime   1805.35
TotalSysTime    5.90

Runs with 4, 8, 16, and 64 processes all showed the same even distribution across all nodes and 100% time on node for each process.
--
Michael Hohnbaum
503-578-5486
hoh...@us...
T/L 775-5486
From: Erich F. <ef...@es...> - 2003-01-10 16:34:34
Hi Martin & Michael,

indeed, restricting a process to the node on which it was started helps, as the memory will always be local. The NUMA scheduler allows a process to move away from its node if the load conditions require it, but in the current form the process will not come back to its homenode. That's what the "node affine scheduler" tried to realise. The miniature NUMA scheduler relies on the quality of the initial load balancer, and that one seems to be good enough. As you mentioned, multithreaded jobs are disadvantaged as their threads have to stick on the originating node.

Having some sort of automatic node affinity of processes and equal node loads in mind (as design targets), we could:
 - take the minimal NUMA scheduler
 - if the normal (node-restricted) find_busiest_queue() fails and certain conditions are fulfilled (tried to balance inside own node for a while and didn't succeed, own CPU idle, etc... ???) rebalance over node boundaries (eg. using my load balancer)
This actually resembles the original design of the node affine scheduler; having the cross-node balancing separate is ok and might make the ideas clearer.

I made some measurements, too, and found basically what I expected. The numbers are from a machine with 4 nodes of 2 CPUs each. It's on ia64, so 2.5.52 based. As minsched cannot make mistakes (by moving tasks away from their homenode), it leads to the best results with numa_test. OTOH hackbench suffers a lot from the limitation to one node. The hackbench tasks are not latency/bandwidth limited; the faster they spread over the whole machine, the quicker the job finishes. That's why NUMA-sched is slightly worse than a stock kernel, but minsched loses >50%. Funnily, on my machine kernbench is slightly faster with the normal NUMA scheduler.

Regards,
Erich

Results on a 8 CPU machine with 4 nodes (2 CPUs per node).
kernbench:    elapsed         user            system
stock52       134.52(0.84)    951.64(0.97)    20.72(0.22)
sched52       133.19(1.49)    944.24(0.50)    21.36(0.24)
minsched52    135.47(0.47)    937.61(0.20)    21.30(0.14)

schedbench/hackbench: time(s)
              10          25          50          100
stock52       0.81(0.04)  2.06(0.07)  4.09(0.13)  7.89(0.25)
sched52       0.81(0.04)  2.03(0.07)  4.14(0.20)  8.61(0.35)
minsched52    1.28(0.05)  3.19(0.06)  6.59(0.13)  13.56(0.27)

numabench/numa_test 4
              AvgUserTime  ElapsedTime  TotUserTime  TotSysTime
stock52       ---          27.23(0.52)  89.30(4.18)  0.09(0.01)
sched52       22.32(1.00)  27.39(0.42)  89.29(4.02)  0.10(0.01)
minsched52    20.01(0.01)  23.40(0.13)  80.05(0.02)  0.08(0.01)

numabench/numa_test 8
              AvgUserTime  ElapsedTime  TotUserTime   TotSysTime
stock52       ---          27.50(2.58)  174.74(6.66)  0.18(0.01)
sched52       21.73(1.00)  33.70(1.82)  173.87(7.96)  0.18(0.01)
minsched52    20.31(0.00)  23.50(0.12)  162.47(0.04)  0.16(0.01)

numabench/numa_test 16
              AvgUserTime  ElapsedTime  TotUserTime    TotSysTime
stock52       ---          52.68(1.51)  390.03(15.10)  0.34(0.01)
sched52       21.51(0.80)  47.18(3.24)  344.29(12.78)  0.36(0.01)
minsched52    20.50(0.03)  43.82(0.08)  328.05(0.45)   0.34(0.01)

numabench/numa_test 32
              AvgUserTime  ElapsedTime   TotUserTime    TotSysTime
stock52       ---          102.60(3.89)  794.57(31.72)  0.65(0.01)
sched52       21.93(0.57)  92.46(1.10)   701.75(18.38)  0.67(0.02)
minsched52    20.64(0.10)  89.95(3.16)   660.72(3.13)   0.68(0.07)

On Friday 10 January 2003 06:36, Michael Hohnbaum wrote:
> On Thu, 2003-01-09 at 15:54, Martin J. Bligh wrote:
> > I tried a small experiment today - did a simple restriction of
> > the O(1) scheduler to only balance inside a node. Coupled with
> > the small initial load balancing patch floating around, this
> > covers 95% of cases, is a trivial change (3 lines), performs
> > just as well as Erich's patch on a kernel compile, and actually
> > better on schedbench.
> >
> > This is NOT meant to be a replacement for the code Erich wrote,
> > it's meant to be a simple way to get integration and acceptance.
> > Code that just forks and never execs will stay on one node - but
> > we can take the code Erich wrote, and put it in seperate rebalancer
> > that fires much less often to do a cross-node rebalance.
>
> I tried this on my 4 node NUMAQ (16 procs, 16GB memory) and got
> similar results. Also, added in the cputime_stats patch and am
> attaching the matrix results from the 32 process run. Basically,
> all runs show that the initial load balancer is able to place the
> tasks evenly across the nodes, and the better overall times show
> that not migrating to keep the nodes balanced over time results
> in better performance - at least on these boxes.
>
> Obviously, there can be situations where load balancing across
> nodes is necessary, but these results point to less load balancing
> being better, at least on these machines. It will be interesting
> to repeat this on boxes with other interconnects.
>
> $ reportbench hacksched54 sched54 stock54
> Kernbench:
>              Elapsed   User       System    CPU
> hacksched54  29.406s   282.18s    81.716s   1236.8%
> sched54      29.112s   283.888s   82.84s    1259.4%
> stock54      31.348s   303.134s   87.824s   1247.2%
>
> Schedbench 4:
>              AvgUser  Elapsed  TotalUser  TotalSys
> hacksched54  21.94    31.93    87.81      0.53
> sched54      22.03    34.90    88.15      0.75
> stock54      49.35    57.53    197.45     0.86
>
> Schedbench 8:
>              AvgUser  Elapsed  TotalUser  TotalSys
> hacksched54  28.23    31.62    225.87     1.11
> sched54      27.95    37.12    223.66     1.50
> stock54      43.14    62.97    345.18     2.12
>
> Schedbench 16:
>              AvgUser  Elapsed  TotalUser  TotalSys
> hacksched54  49.29    71.31    788.83     2.88
> sched54      55.37    69.58    886.10     3.79
> stock54      66.00    81.25    1056.25    7.12
>
> Schedbench 32:
>              AvgUser  Elapsed  TotalUser  TotalSys
> hacksched54  56.41    117.98   1805.35    5.90
> sched54      57.93    132.11   1854.01    10.74
> stock54      77.81    173.26   2490.31    12.37
>
> Schedbench 64:
>              AvgUser  Elapsed  TotalUser  TotalSys
> hacksched54  56.62    231.93   3624.42    13.32
> sched54      72.91    308.87   4667.03    21.06
> stock54      86.68    368.55   5548.57    25.73
>
> hacksched54 = 2.5.54 + Martin's tiny NUMA patch
>                      + 03-cputimes_stat-2.5.53.patch
>                      + 02-numa-sched-ilb-2.5.53-21.patch
> sched54     = 2.5.54 + 01-numa-sched-core-2.5.53-24.patch
>                      + 02-ilb-2.5.53-21.patch02
>                      + 03-cputimes_stat-2.5.53.patch
> stock54     = 2.5.54 + 03-cputimes_stat-2.5.53.patch
From: Martin J. B. <mb...@ar...> - 2003-01-10 16:57:51
> Having some sort of automatic node affinity of processes and equal
> node loads in mind (as design targets), we could:
> - take the minimal NUMA scheduler
> - if the normal (node-restricted) find_busiest_queue() fails and
>   certain conditions are fulfilled (tried to balance inside own node
>   for a while and didn't succeed, own CPU idle, etc... ???) rebalance
>   over node boundaries (eg. using my load balancer)
> This actually resembles the original design of the node affine
> scheduler, having the cross-node balancing separate is ok and might
> make the ideas clearer.

This seems like the right approach to me, apart from the trigger to do the cross-node rebalance. I don't believe that has anything to do with whether we're internally balanced within a node or not; it's whether the nodes are balanced relative to each other. I think we should just check that every N ticks, looking at node load averages, and do a cross-node rebalance if they're "significantly out". The definition of "N ticks" and "significantly out" would be a tunable number, defined by each platform; roughly speaking, the lower the NUMA ratio, the lower these numbers would be. That also allows us to wedge all sorts of smarts into the NUMA rebalance part of the scheduler, such as moving the tasks with the smallest RSS off node.

The NUMA rebalancer is obviously completely missing from the current implementation, and I expect we'd use mainly Erich's current code to implement that. However, it's surprising how well we do with no rebalancer at all, apart from the exec-time initial load balance code.

Another big advantage of this approach is that it *obviously* changes nothing at all for standard SMP systems (whereas your current patch does), so it should be much easier to get it accepted into mainline.

M.
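Martin's tick-driven trigger can be sketched as a minimal model (this is not the eventual kernel code; NODE_REBALANCE_TICK, NODE_IMBALANCE_PCT, and the array-based load bookkeeping are invented placeholders for the per-platform tunables he describes):

```c
#include <assert.h>

/* Invented per-platform tunables: the lower the NUMA ratio (remote vs.
 * local memory latency), the lower these would be set. */
#define NODE_REBALANCE_TICK 100   /* check node balance every N ticks */
#define NODE_IMBALANCE_PCT  125   /* remote node is "significantly out"
                                   * if its load exceeds this % of ours */

/* Once every NODE_REBALANCE_TICK ticks, pick the node to steal from;
 * return -1 while the nodes are balanced relative to each other. */
static int cross_node_rebalance_target(unsigned long tick,
				       const int *node_load,
				       int nnodes, int this_node)
{
	int i, busiest = -1, maxload = node_load[this_node];

	if (tick % NODE_REBALANCE_TICK)
		return -1;		/* not time to look yet */

	for (i = 0; i < nnodes; i++) {
		if (i == this_node)
			continue;
		if (node_load[i] > maxload &&
		    100 * node_load[i] > NODE_IMBALANCE_PCT * node_load[this_node]) {
			maxload = node_load[i];
			busiest = i;
		}
	}
	return busiest;
}
```

Note the key property Martin argues for: between ticks, and whenever the nodes are within the threshold of each other, the function does nothing at all, regardless of how imbalanced the runqueues inside a node are.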
From: Erich F. <ef...@es...> - 2003-01-12 23:34:38
On Friday 10 January 2003 17:57, Martin J. Bligh wrote:
> This seems like the right approach to me, apart from the trigger to
> do the cross-node rebalance. I don't believe that has anything to do
> with when we're internally balanced within a node or not, it's
> whether the nodes are balanced relative to each other. I think we should
> just check that every N ticks, looking at node load averages, and do
> a cross-node rebalance if they're "significantly out".

OK, I changed my mind about the trigger and made some experiments with the cross-node balancer called after every N calls of load_balance. If we make N tunable, we can even have some dynamics in case the nodes are unbalanced: make N small if the current node is less loaded than the average node load, make N large if the node is averagely loaded. I'll send the patches in a separate email.

> The NUMA rebalancer
> is obviously completely missing from the current implementation, and
> I expect we'd use mainly Erich's current code to implement that.
> However, it's surprising how well we do with no rebalancer at all,
> apart from the exec-time initial load balance code.

The fact that we're doing fine on kernbench and numa_test/schedbench is (I think) understandable. In both benchmarks a process cannot change its node during its lifetime, and therefore has minimal memory latency. In numa_test the "disturbing" hackbench just cannot move away any of the tasks from their originating node. Therefore the results are the best possible.

Regards,
Erich
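The load-dependent choice of N that Erich describes could look like the following toy function (the interval values and the function name are invented; the actual patch 05 uses its own node-specific intervals):

```c
#include <assert.h>

/* Invented interval values: call the cross-node balancer sooner when
 * this node is under-loaded relative to the machine-wide average. */
#define INTERNODE_LB_SHORT 2	/* node below average load */
#define INTERNODE_LB_LONG  10	/* node at or above average load */

/* Number of local load_balance() calls between cross-node passes. */
static int internode_lb_interval(int this_node_load, int avg_node_load)
{
	return (this_node_load < avg_node_load) ? INTERNODE_LB_SHORT
						: INTERNODE_LB_LONG;
}
```

An under-loaded node would then reach its cross-node balancing pass after only 2 local balancing calls instead of 10, pulling work toward itself quickly while well-loaded nodes rarely look across the boundary.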
From: Erich F. <ef...@es...> - 2003-01-12 23:55:10
Hi Martin & Michael,

as discussed on the LSE call I played around with a cross-node balancer approach put on top of the miniature NUMA scheduler. The patches are attached and it seems to be clear that we can regain the good performance for hackbench by adding a cross-node balancer.

The patches are:

01-minisched-2.5.55.patch : the miniature scheduler from Martin. Balances strictly within a node. Added an inlined function to make the integration of the cross-node balancer easier. The added code is optimised away by the compiler.

02-initial-lb-2.5.55.patch : Michael's initial load balancer at exec().

03-internode-lb-2.5.55.patch : internode load balancer core. Called after INTERNODE_LB calls of the intra-node load balancer. The most loaded node's cpu-mask is ORed to the own node's cpu-mask and find_busiest_in_mask() finds the most loaded CPU in this set.

04-smooth-node-load-2.5.55.patch : the node load measure is smoothed by adding half of the previous node load (and 1/4 of the one before, etc..., as discussed in the LSE call). This should improve the behavior a bit in case of short timed load peaks and avoid bouncing tasks between nodes.

05-var-intnode-lb2-2.5.55.patch : replaces the fixed INTERNODE_LB interval (between cross-node balancer calls) by a variable node-specific interval. Currently only two values used. Certainly needs some tweaking and tuning.

01, 02, 03 are enough to produce a working NUMA scheduler; when including all patches the behavior should be better. I made some measurements which I'd like to post in a separate email tomorrow.

Comments? Ideas?

Regards,
Erich
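The geometric smoothing in patch 04 reduces to a one-line recurrence. A minimal sketch (function and variable names invented here):

```c
#include <assert.h>

/* Patch 04's idea: each older measurement contributes half the weight
 * of the next newer one, i.e.
 *
 *   smoothed = load + load_prev/2 + load_prev2/4 + ...
 *
 * which the recurrence below computes incrementally. */
static int smooth_node_load(int prev_smoothed, int current_load)
{
	return current_load + prev_smoothed / 2;
}
```

A one-tick spike only raises the smoothed value by the spike plus its rapidly halving echoes, while a steady load L converges to 2*L; either way the relative ordering of steadily loaded nodes is preserved, so short peaks no longer bounce tasks between nodes.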
From: Christoph H. <hc...@in...> - 2003-01-13 08:03:23
On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> Hi Martin & Michael,
>
> as discussed on the LSE call I played around with a cross-node
> balancer approach put on top of the miniature NUMA scheduler. The
> patches are attached and it seems to be clear that we can regain the
> good performance for hackbench by adding a cross-node balancer.

The changes look fine to me. But I think there's some coding style issues that need cleaning up (see below). Also, is there a reason patches 2/3 and 4/5 aren't merged into one patch each?

 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_in_mask - find the busiest runqueue among the cpus in cpumask
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_in_mask(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)

find_busiest_queue has just one caller in 2.5.56, I'd suggest just changing the prototype and updating that single caller to pass in the cpumask opencoded.

@@ -160,7 +160,6 @@ extern void update_one_process(struct ta
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;

-
 #define MAX_SCHEDULE_TIMEOUT LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);

I don't think you need this spurious whitespace change :)

+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+	rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+	rq->nr_running--

static inline void nr_running_inc(runqueue_t *rq)
{
	atomic_inc(rq->node_ptr);
	rq->nr_running++;
}

etc.. would look a bit nicer.

diff -urNp linux-2.5.55-ms/kernel/sched.c linux-2.5.55-ms-ilb/kernel/sched.c
--- linux-2.5.55-ms/kernel/sched.c	2003-01-10 23:01:02.000000000 +0100
+++ linux-2.5.55-ms-ilb/kernel/sched.c	2003-01-11 01:12:43.000000000 +0100
@@ -153,6 +153,7 @@ struct runqueue {
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
+	atomic_t * node_ptr;

atomic_t *node_ptr; would match the style above.

+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};

Maybe wants some linewrapping after 80 chars?

+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+	}
+	return;
+}

The braces and the return are superfluous. Also kernel/sched.c (or mingo code in general) seems to prefer array + i instead of &array[i] (not that I have a general preference, but you should try to match the surrounding code).

+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);

This double set_cpus_allowed doesn't look nice to me. I don't have a better suggestion off-hand, though :(

+#define decl_numa_int(ctr) int ctr

This is ugly as hell. I'd prefer wasting one int in each runqueue, or even an ifdef in the struct declaration, over this obfuscation all the time.

@@ -816,6 +834,16 @@ out:
 static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
 {
 	unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
+#if CONFIG_NUMA
+	int node;
+# define INTERNODE_LB 10

This wants to be put with the other magic constants in the scheduler and needs an explanation there.

 #define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
 	rq->nr_running--
 #define decl_numa_int(ctr) int ctr
+#define decl_numa_nodeint(v) int v[MAX_NUMNODES]

Another one of those.. You should really just stick the CONFIG_NUMA ifdef into the actual struct declaration.

+/*
+ * Find the busiest node. All previous node loads contribute with a
+ * geometrically decaying weight to the load measure:

Linewrapping again..
From: Michael H. <hoh...@us...> - 2003-01-14 01:22:19
On Sun, 2003-01-12 at 15:55, Erich Focht wrote:
> Hi Martin & Michael,
>
> as discussed on the LSE call I played around with a cross-node
> balancer approach put on top of the miniature NUMA scheduler. The
> patches are attached and it seems to be clear that we can regain the
> good performance for hackbench by adding a cross-node balancer.

Erich,

I played with this today on my 4 node (16 CPU) NUMAQ. Spent most of the time working with the first three patches. What I found was that rebalancing was happening too much between nodes. I tried a few things to change this, but have not yet settled on the best approach. A key item to work with is the check in find_busiest_node to determine if the found node is busier enough to warrant stealing from it. Currently the check is that the node has 125% of the load of the current node. I think that, for my system at least, we need to add in a constant to this equation. I tried using 4 and that helped a little. Finally I added in the 04 patch, and that helped a lot. Still, there is too much process movement between nodes. Tomorrow, I will continue experiments, but base them on the first 4 patches.

Two suggestions for minor changes:

* Make the check in find_busiest_node into a macro that is defined in the arch specific topology header file. Then different NUMA architectures can tune this appropriately.

* In find_busiest_queue change:

	cpumask |= __node_to_cpu_mask(node);
  to:
	cpumask = __node_to_cpu_mask(node) | (1UL << (this_cpu));

  There is no reason to iterate over the runqueues on the current node, which is what the code currently does.
Some numbers for anyone interested:

All numbers are based on a 2.5.55 kernel with the cputime stats patch:
* stock55 = no additional patches
* mini+rebal-55 = patches 01, 02, and 03
* rebal+4+fix-55 = patches 01, 02, 03, plus the cpumask change described above and a +4 constant added to the check in find_busiest_node
* rebal+4+fix+04-55 = above with the 04 patch added

Kernbench:
                   Elapsed   User       System    CPU
rebal+4+fix+04-55  29.302s   285.136s   82.106s   1253%
rebal+4+fix-55     30.498s   286.586s   88.176s   1228.6%
mini+rebal-55      30.756s   287.646s   85.512s   1212.8%
stock55            31.018s   303.084s   86.626s   1256.2%

Schedbench 4:
                   AvgUser  Elapsed  TotalUser  TotalSys
rebal+4+fix+04-55  27.34    40.49    109.39     0.88
rebal+4+fix-55     24.73    38.50    98.94      0.84
mini+rebal-55      25.18    43.23    100.76     0.68
stock55            31.38    41.55    125.54     1.24

Schedbench 8:
                   AvgUser  Elapsed  TotalUser  TotalSys
rebal+4+fix+04-55  30.05    44.15    240.48     2.50
rebal+4+fix-55     34.33    46.40    274.73     2.31
mini+rebal-55      32.99    52.42    264.00     2.08
stock55            44.63    61.28    357.11     2.22

Schedbench 16:
                   AvgUser  Elapsed  TotalUser  TotalSys
rebal+4+fix+04-55  52.13    57.68    834.23     3.55
rebal+4+fix-55     52.72    65.16    843.70     4.55
mini+rebal-55      57.29    71.51    916.84     5.10
stock55            66.91    85.08    1070.72    6.05

Schedbench 32:
                   AvgUser  Elapsed  TotalUser  TotalSys
rebal+4+fix+04-55  56.38    124.09   1804.67    7.71
rebal+4+fix-55     55.13    115.18   1764.46    8.86
mini+rebal-55      57.83    125.80   1850.84    10.19
stock55            80.38    181.80   2572.70    13.22

Schedbench 64:
                   AvgUser  Elapsed  TotalUser  TotalSys
rebal+4+fix+04-55  57.42    238.32   3675.77    17.68
rebal+4+fix-55     60.06    252.96   3844.62    18.88
mini+rebal-55      58.15    245.30   3722.38    19.64
stock55            123.96   513.66   7934.07    26.39

And here are the results from running numa_test 32 on rebal+4+fix+04:

Executing 32 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 8.383
Job  node00  node01  node02  node03  | iSched MSched | UserTime(s)
  1   100.0    0.0     0.0     0.0  |   0      0    |  56.19
  2   100.0    0.0     0.0     0.0  |   0      0    |  53.80
  3     0.0    0.0   100.0     0.0  |   2      2    |  55.61
  4   100.0    0.0     0.0     0.0  |   0      0    |  54.13
  5     3.7    0.0     0.0    96.3  |   3      3    |  56.48
  6     0.0    0.0   100.0     0.0  |   2      2    |  55.11
  7     0.0    0.0   100.0     0.0  |   2      2    |  55.94
  8     0.0    0.0   100.0     0.0  |   2      2    |  55.69
  9    80.6   19.4     0.0     0.0  |   1      0   *|  56.53
 10     0.0    0.0     0.0   100.0  |   3      3    |  53.00
 11     0.0   99.2     0.0     0.8  |   1      1    |  56.72
 12     0.0    0.0     0.0   100.0  |   3      3    |  54.58
 13     0.0  100.0     0.0     0.0  |   1      1    |  59.38
 14     0.0   55.6     0.0    44.4  |   3      1   *|  63.06
 15     0.0  100.0     0.0     0.0  |   1      1    |  56.02
 16     0.0   19.2     0.0    80.8  |   1      3   *|  58.07
 17     0.0  100.0     0.0     0.0  |   1      1    |  53.78
 18     0.0    0.0   100.0     0.0  |   2      2    |  55.28
 19     0.0   78.6     0.0    21.4  |   3      1   *|  63.20
 20     0.0  100.0     0.0     0.0  |   1      1    |  53.27
 21     0.0    0.0   100.0     0.0  |   2      2    |  55.79
 22     0.0    0.0     0.0   100.0  |   3      3    |  57.23
 23    12.4   19.1     0.0    68.5  |   1      3   *|  61.05
 24     0.0    0.0   100.0     0.0  |   2      2    |  54.50
 25     0.0    0.0     0.0   100.0  |   3      3    |  56.82
 26     0.0    0.0   100.0     0.0  |   2      2    |  56.28
 27    15.3    0.0     0.0    84.7  |   3      3    |  57.12
 28   100.0    0.0     0.0     0.0  |   0      0    |  53.85
 29    32.7   67.2     0.0     0.0  |   0      1   *|  62.66
 30   100.0    0.0     0.0     0.0  |   0      0    |  53.86
 31   100.0    0.0     0.0     0.0  |   0      0    |  53.94
 32   100.0    0.0     0.0     0.0  |   0      0    |  55.36
AverageUserTime 56.38 seconds
ElapsedTime     124.09
TotalUserTime   1804.67
TotalSysTime    7.71

Ideally, there would be nothing but 100.0 in all non-zero entries. I'll try adding in the 05 patch, and if that does not help, will try a few other adjustments. Thanks for the quick effort on putting together the node rebalance code. I'll also get some hackbench numbers soon.
--
Michael Hohnbaum
503-578-5486
hoh...@us...
T/L 775-5486
From: Andrew T. <hab...@us...> - 2003-01-14 04:37:01
> Erich,
>
> I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
> of the time working with the first three patches. What I found was
> that rebalancing was happening too much between nodes. I tried a
> few things to change this, but have not yet settled on the best
> approach. A key item to work with is the check in find_busiest_node
> to determine if the found node is busier enough to warrant stealing
> from it. Currently the check is that the node has 125% of the load
> of the current node. I think that, for my system at least, we need
> to add in a constant to this equation. I tried using 4 and that
> helped a little.

Michael, in:

+static int find_busiest_node(int this_node)
+{
+	int i, node = this_node, load, this_load, maxload;
+
+	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	for (i = 0; i < numnodes; i++) {
+		if (i == this_node)
+			continue;
+		load = atomic_read(&node_nr_running[i]);
+		if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+			maxload = load;
+			node = i;
+		}
+	}
+	return node;
+}

did you change ((5*4*this_load)/4) to (5*4*(this_load+4)/4), or to (4+((5*4*this_load)/4))? We definitely need some constant to avoid low-load ping pong, right?

> Finally I added in the 04 patch, and that helped
> a lot. Still, there is too much process movement between nodes.

Perhaps increase INTERNODE_LB?

-Andrew Theurer
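The difference the constant makes at low loads can be checked directly. A toy comparison follows; busier_with_const() mirrors just one plausible reading of Michael's "+4" (adding it to this_load before the ratio test), which is exactly the ambiguity Andrew is asking about:

```c
#include <assert.h>

/* Ratio-only test from find_busiest_node: steal if the remote node's
 * load exceeds 125% of the local load. */
static int busier_ratio_only(int remote_load, int this_load)
{
	return 4 * remote_load > (5 * 4 * this_load) / 4;
}

/* Same test with a constant added to the local load first: tiny
 * absolute imbalances (one stray task on an otherwise idle machine)
 * no longer trigger cross-node stealing. */
static int busier_with_const(int remote_load, int this_load)
{
	return 4 * remote_load > (5 * 4 * (this_load + 4)) / 4;
}
```

With one task on a remote node and none locally, the ratio-only check fires (4 > 0) and invites ping-pong; the constant version stays quiet until the imbalance is substantial (e.g. 10 tasks remote vs. 2 local).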
From: Erich F. <ef...@es...> - 2003-01-14 10:55:56
Hi Michael,

thanks for the many experiments, they help give some more insight. It's quite clear that we'll have to tune the knobs a bit to make this fit well. First, a comment on the "target" of the tuning:

>  26     0.0    0.0   100.0     0.0  |   2      2    |  56.28
>  27    15.3    0.0     0.0    84.7  |   3      3    |  57.12
>  28   100.0    0.0     0.0     0.0  |   0      0    |  53.85
>  29    32.7   67.2     0.0     0.0  |   0      1   *|  62.66
>  30   100.0    0.0     0.0     0.0  |   0      0    |  53.86
>  31   100.0    0.0     0.0     0.0  |   0      0    |  53.94
>  32   100.0    0.0     0.0     0.0  |   0      0    |  55.36
> AverageUserTime 56.38 seconds
> ElapsedTime     124.09
> TotalUserTime   1804.67
> TotalSysTime    7.71
>
> Ideally, there would be nothing but 100.0 in all non-zero entries.
> I'll try adding in the 05 patch, and if that does not help, will
> try a few other adjustments.

The numa_test is written such that you MUST get numbers below 100% with this kind of scheduler! There is a "disturbing" process started 3 seconds after the parallel tasks are up and running, whose aim is to "shake" the system a bit; the scheduler ends up with some of the parallel tasks moved away from their homenode. This is a realistic situation for long running parallel background jobs. In the min-sched case this cannot happen, therefore you see the 100% all over! Increasing more and more the barrier for moving a task apart from its node won't necessarily help the scheduler, as we end up with the min-sched limitations. The aim will be to introduce node affinity for processes such that they can return to their homenode when the exceptional load situation is gone. We should not try to reach the min-sched numbers without having node-affine tasks.

> * Make the check in find_busiest_node into a macro that is defined
>   in the arch specific topology header file. Then different NUMA
>   architectures can tune this appropriately.

I still think that the load balancing condition should be clear and the same for every architecture. But tunable. I understand that the constant added helps, but I don't have a clear motivation for it.
We could keep the imbalance ratio as a condition for searching the maximum:

> +		if (load > maxload && (4*load > ((5*4*this_load)/4))) {

(with a tunable barrier height: say, replace 5/4 by 11/8 or similar, i.e. (8*load > ((11*8*this_load)/8))) and rule out any special unwanted cases after the loop. These would be cases where it is obvious that we don't have to steal anything, like: the load of the remote node is <= the number of cpus in the remote node. For making this tunable I suggest something like:

#define CROSS_NODE_BAL 125
(100*load > ((CROSS_NODE_BAL*100*this_load)/100))

and let platforms define their own value. Using 100 makes it easy to understand.

> * In find_busiest_queue change:
>
> 	cpumask |= __node_to_cpu_mask(node);
>
>   to:
>
> 	cpumask = __node_to_cpu_mask(node) | (1UL << (this_cpu));
>
>   There is no reason to iterate over the runqueues on the current
>   node, which is what the code currently does.

Here I thought I'd give the load balancer the chance to balance internally, just in case the own node's load jumped up in the meantime. But this should ideally be caught by find_busiest_node(), therefore I agree with your change.

> Some numbers for anyone interested: Kernbench:
>
> All numbers are based on a 2.5.55 kernel with the cputime stats patch:
> * stock55        = no additional patches
> * mini+rebal-44  = patches 01, 02, and 03
> * rebal+4+fix    = patches 01, 02, 03, and the cpumask change described
>   above, and a +4 constant added to the check in find_busiest_node
> * rebal+4+fix+04 = above with the 04 patch added
>
>                     Elapsed    User       System    CPU
> rebal+4+fix+04-55   29.302s    285.136s   82.106s   1253%
> rebal+4+fix-55      30.498s    286.586s   88.176s   1228.6%
> mini+rebal-55       30.756s    287.646s   85.512s   1212.8%
> stock55             31.018s    303.084s   86.626s   1256.2%

The first line is VERY good! The results below (for Schedbench) are not really meaningful without statistics... But the combination rebal+4+fix+04-55 is the clear winner for me.
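Erich's proposed tunable condition is easy to sanity-check in isolation. The following is a minimal model, assuming the suggested value of 125; the helper name is invented for illustration and appears in no kernel patch:

```c
#include <assert.h>

/* Proposed tunable: steal only if the remote node's load exceeds
 * CROSS_NODE_BAL percent of the local node's load (125 => 25% barrier). */
#define CROSS_NODE_BAL 125

/* integer-arithmetic form of: load > (CROSS_NODE_BAL/100) * this_load */
static int node_worth_stealing_from(int load, int this_load)
{
	return 100 * load > CROSS_NODE_BAL * this_load;
}
```

Note that with this_load == 0 any nonzero remote load passes a pure ratio test, which is exactly why the later discussion in this thread adds an absolute constant or rules out low-load cases after the loop.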
Still, I'd be curious to know the rebal+fix+04-55 numbers ;-)

> Schedbench 4:
>                     AvgUser   Elapsed   TotalUser   TotalSys
> rebal+4+fix+04-55   27.34     40.49     109.39      0.88
> rebal+4+fix-55      24.73     38.50     98.94       0.84
> mini+rebal-55       25.18     43.23     100.76      0.68
> stock55             31.38     41.55     125.54      1.24
>
> Schedbench 8:
>                     AvgUser   Elapsed   TotalUser   TotalSys
> rebal+4+fix+04-55   30.05     44.15     240.48      2.50
> rebal+4+fix-55      34.33     46.40     274.73      2.31
> mini+rebal-55       32.99     52.42     264.00      2.08
> stock55             44.63     61.28     357.11      2.22
>
> Schedbench 16:
>                     AvgUser   Elapsed   TotalUser   TotalSys
> rebal+4+fix+04-55   52.13     57.68     834.23      3.55
> rebal+4+fix-55      52.72     65.16     843.70      4.55
> mini+rebal-55       57.29     71.51     916.84      5.10
> stock55             66.91     85.08     1070.72     6.05
>
> Schedbench 32:
>                     AvgUser   Elapsed   TotalUser   TotalSys
> rebal+4+fix+04-55   56.38     124.09    1804.67     7.71
> rebal+4+fix-55      55.13     115.18    1764.46     8.86
> mini+rebal-55       57.83     125.80    1850.84     10.19
> stock55             80.38     181.80    2572.70     13.22
>
> Schedbench 64:
>                     AvgUser   Elapsed   TotalUser   TotalSys
> rebal+4+fix+04-55   57.42     238.32    3675.77     17.68
> rebal+4+fix-55      60.06     252.96    3844.62     18.88
> mini+rebal-55       58.15     245.30    3722.38     19.64
> stock55             123.96    513.66    7934.07     26.39
>
> And here are the results from running numa_test 32 on rebal+4+fix+04:
>
> Executing 32 times: ./randupdt 1000000
> Running 'hackbench 20' in the background: Time: 8.383
> Job  node00  node01  node02  node03  | iSched MSched | UserTime(s)
>   1  100.0    0.0     0.0     0.0    |  0  0  | 56.19
>   2  100.0    0.0     0.0     0.0    |  0  0  | 53.80
>   3    0.0    0.0   100.0     0.0    |  2  2  | 55.61
>   4  100.0    0.0     0.0     0.0    |  0  0  | 54.13
>   5    3.7    0.0     0.0    96.3    |  3  3  | 56.48
>   6    0.0    0.0   100.0     0.0    |  2  2  | 55.11
>   7    0.0    0.0   100.0     0.0    |  2  2  | 55.94
>   8    0.0    0.0   100.0     0.0    |  2  2  | 55.69
>   9   80.6   19.4     0.0     0.0    |  1  0 *| 56.53
>  10    0.0    0.0     0.0   100.0    |  3  3  | 53.00
>  11    0.0   99.2     0.0     0.8    |  1  1  | 56.72
>  12    0.0    0.0     0.0   100.0    |  3  3  | 54.58
>  13    0.0  100.0     0.0     0.0    |  1  1  | 59.38
>  14    0.0   55.6     0.0    44.4    |  3  1 *| 63.06
>  15    0.0  100.0     0.0     0.0    |  1  1  | 56.02
>  16    0.0   19.2     0.0    80.8    |  1  3 *| 58.07
>  17    0.0  100.0     0.0     0.0    |  1  1  | 53.78
>  18    0.0    0.0   100.0     0.0    |  2  2  | 55.28
>  19    0.0   78.6     0.0    21.4    |  3  1 *| 63.20
>  20    0.0  100.0     0.0     0.0    |  1  1  | 53.27
>  21    0.0    0.0   100.0     0.0    |  2  2  | 55.79
>  22    0.0    0.0     0.0   100.0    |  3  3  | 57.23
>  23   12.4   19.1     0.0    68.5    |  1  3 *| 61.05
>  24    0.0    0.0   100.0     0.0    |  2  2  | 54.50
>  25    0.0    0.0     0.0   100.0    |  3  3  | 56.82
>  26    0.0    0.0   100.0     0.0    |  2  2  | 56.28
>  27   15.3    0.0     0.0    84.7    |  3  3  | 57.12
>  28  100.0    0.0     0.0     0.0    |  0  0  | 53.85
>  29   32.7   67.2     0.0     0.0    |  0  1 *| 62.66
>  30  100.0    0.0     0.0     0.0    |  0  0  | 53.86
>  31  100.0    0.0     0.0     0.0    |  0  0  | 53.94
>  32  100.0    0.0     0.0     0.0    |  0  0  | 55.36
> AverageUserTime 56.38 seconds
> ElapsedTime     124.09
> TotalUserTime   1804.67
> TotalSysTime    7.71
>
> Ideally, there would be nothing but 100.0 in all non-zero entries.
> I'll try adding in the 05 patch, and if that does not help, will
> try a few other adjustments.
>
> Thanks for the quick effort on putting together the node rebalance
> code. I'll also get some hackbench numbers soon.

Best regards,
Erich |
From: Bill D. <dav...@tm...> - 2003-01-11 14:48:49
|
On Fri, 10 Jan 2003, Erich Focht wrote:

> Having some sort of automatic node affinity of processes and equal
> node loads in mind (as design targets), we could:
> - take the minimal NUMA scheduler
> - if the normal (node-restricted) find_busiest_queue() fails and
>   certain conditions are fulfilled (tried to balance inside own node
>   for a while and didn't succeed, own CPU idle, etc... ???) rebalance
>   over node boundaries (eg. using my load balancer)
> This actually resembles the original design of the node affine
> scheduler, having the cross-node balancing separate is ok and might
> make the ideas clearer.

Agreed, but honestly just this explanation would make it easier to understand! I'm not sure you have the "balance of nodes" trigger defined quite right, but I'm assuming that if this gets implemented as described, some long-term imbalance detector mechanism will be run occasionally.

--
bill davidsen <dav...@tm...>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979. |
From: Erich F. <ef...@es...> - 2003-01-12 23:24:31
|
On Saturday 11 January 2003 15:43, Bill Davidsen wrote:

> Agreed, but honestly just this explanation would make it easier to
> understand! I'm not sure you have the "balance of nodes" trigger defined
> quite right, but I'm assuming that if this gets implemented as described,
> some long-term imbalance detector mechanism will be run occasionally.

Yes, the current plan is to extend the miniature NUMA scheduler by an inter-node balancer which is called less frequently.

Regards,
Erich |
From: Erich F. <ef...@es...> - 2003-01-13 11:32:13
|
Hi Christoph, thanks for reviewing the code and the helpful comments!

On Monday 13 January 2003 09:02, Christoph Hellwig wrote:

> On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> > as discussed on the LSE call I played around with a cross-node
> > balancer approach put on top of the miniature NUMA scheduler. The
> > patches are attached and it seems to be clear that we can regain the
> > good performance for hackbench by adding a cross-node balancer.
>
> The changes look fine to me. But I think there's some coding style
> issues that need cleaning up (see below). Also is there a reason
> patches 2/3 and 4/5 aren't merged into one patch each?

The patches are separated by their functionality. Patch 2 comes from Michael Hohnbaum, so I kept that separate for that reason. Right now we can exchange the single components, but when we decide that they are doing the job, I'd also prefer to have just one patch.

>  /*
> - * find_busiest_queue - find the busiest runqueue.
> + * find_busiest_in_mask - find the busiest runqueue among the cpus in cpumask
>   */
> -static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
> +static inline runqueue_t *find_busiest_in_mask(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)
>
> 	find_busiest_queue has just one caller in 2.5.56, I'd suggest just
> 	changing the prototype and updating that single caller to pass in
> 	the cpumask opencoded.

Having find_busiest_queue() and find_busiest_in_mask() as separate functions makes it simpler to merge in the cross-node balancer (patch 3). Otherwise we'd have to add two #ifdef CONFIG_NUMA blocks into load_balance() (one for new variable declarations, the other one for selecting the target node mask). We might have several calls to find_busiest_in_mask() later, if we decide to add multi-level node hierarchy support...

> 	I don't think you need this spurious whitespace change :)

:-) slipped in somehow.

> +#ifdef CONFIG_NUMA
> +extern void sched_balance_exec(void);
> +extern void node_nr_running_init(void);
> +#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
> +	rq->nr_running++
> +#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
> +	rq->nr_running--
>
> 	static inline void nr_running_inc(runqueue_t *rq)
> 	{
> 		atomic_inc(rq->node_ptr);
> 		rq->nr_running++;
> 	}
>
> 	etc.. would look a bit nicer.

We can change this. Michael, ok with you?

> +#if CONFIG_NUMA
> +static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
>
> 	Maybe wants some linewrapping after 80 chars?

Yes.

> +	for (i = 0; i < NR_CPUS; i++) {
> +		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
> +	}
> +	return;
> +}
>
> 	The braces and the return are superfluous. Also kernel/sched.c (or
> 	mingo code in general) seems to prefer array + i instead of &array[i]
> 	(not that I have a general preference, but you should try to match
> 	the surrounding code)

Will change the braces and remove the return. I personally find &array[i] more readable.

> +static void sched_migrate_task(task_t *p, int dest_cpu)
> +{
> +	unsigned long old_mask;
> +
> +	old_mask = p->cpus_allowed;
> +	if (!(old_mask & (1UL << dest_cpu)))
> +		return;
> +	/* force the process onto the specified CPU */
> +	set_cpus_allowed(p, 1UL << dest_cpu);
> +
> +	/* restore the cpus allowed mask */
> +	set_cpus_allowed(p, old_mask);
>
> 	This double set_cpus_allowed doesn't look nice to me. I don't
> 	have a better suggestion off-hand, though :(

This is not that bad. It involves only one single wakeup of the migration thread, and that's more important. Doing it another way would mean replicating the set_cpus_allowed() code.

> +#define decl_numa_int(ctr) int ctr
>
> 	This is ugly as hell. I'd prefer wasting one int in each runqueue
> 	or even an ifdef in the struct declaration over this obfuscation
> 	all the time.

Agreed :-) Just trying to avoid #ifdefs in sched.c as much as possible. Somehow I had the feeling Linus doesn't like that. On the other hand: CONFIG_NUMA is a special case of CONFIG_SMP, and nobody has anything against CONFIG_SMP in sched.c...

> @@ -816,6 +834,16 @@ out:
>  static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
>  {
> 	unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
> +#if CONFIG_NUMA
> +	int node;
> +# define INTERNODE_LB 10
>
> 	This wants to be put with the other magic constants in the scheduler
> 	and needs an explanation there.

We actually get rid of this in patch #5 (variable internode_lb, depending on the load of the current node). But ok, I'll move the MIN_INTERNODE_LB and MAX_INTERNODE_LB variables to the magic constants. They'll be outside the #ifdef CONFIG_NUMA block...

Thanks, best regards,
Erich |
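Christoph's objection to the two-statement nr_running_inc() macro is not only cosmetic. A toy model (struct and helper names invented for illustration, standing in for the runqueue and atomic counter) shows why the inline-function form he suggests is safer when the call sits under an `if` without braces:

```c
#include <assert.h>

struct rq_model { int node_nr_running; int nr_running; };

/* mirrors the patch's macro shape: two statements, no do { } while (0) */
#define NR_RUNNING_INC_MACRO(rq) (rq)->node_nr_running++; \
	(rq)->nr_running++

static inline void nr_running_inc_fn(struct rq_model *rq)
{
	rq->node_nr_running++;
	rq->nr_running++;
}

/* under 'if' without braces, the macro's second statement always runs */
static int macro_demo(int cond)
{
	struct rq_model rq = { 0, 0 };

	if (cond)
		NR_RUNNING_INC_MACRO(&rq);
	return rq.nr_running;
}

static int inline_demo(int cond)
{
	struct rq_model rq = { 0, 0 };

	if (cond)
		nr_running_inc_fn(&rq);
	return rq.nr_running;
}
```

Wrapping the macro body in do { } while (0) would also remove the hazard, but the inline function additionally gives argument type checking, which is presumably why it reads "a bit nicer".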
From: Christoph H. <hc...@in...> - 2003-01-13 15:28:09
|
Anyone interested in this cleaned up minimal numa scheduler? This is basically Erich's patches 1-3 with my suggestions applied. This does not mean I don't like 4 & 5, but I'd rather get a small, non-intrusive patch into Linus' tree now and do the fine-tuning later.

--- 1.62/fs/exec.c	Fri Jan 10 08:21:00 2003
+++ edited/fs/exec.c	Mon Jan 13 15:33:32 2003
@@ -1031,6 +1031,8 @@
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 	retval = PTR_ERR(file);
--- 1.119/include/linux/sched.h	Sat Jan 11 07:44:15 2003
+++ edited/include/linux/sched.h	Mon Jan 13 15:58:11 2003
@@ -444,6 +444,14 @@
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+# define sched_balance_exec() do { } while (0)
+# define node_nr_running_init() do { } while (0)
+#endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
--- 1.91/init/main.c	Mon Jan 6 04:08:49 2003
+++ edited/init/main.c	Mon Jan 13 15:33:33 2003
@@ -495,6 +495,7 @@
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
--- 1.148/kernel/sched.c	Sat Jan 11 07:44:22 2003
+++ edited/kernel/sched.c	Mon Jan 13 16:17:34 2003
@@ -67,6 +67,7 @@
 #define INTERACTIVE_DELTA	2
 #define MAX_SLEEP_AVG		(2*HZ)
 #define STARVATION_LIMIT	(2*HZ)
+#define NODE_BALANCE_RATIO	10
 
 /*
  * If a task is 'interactive' then we reinsert it in the active
@@ -154,6 +155,11 @@
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
 
+#ifdef CONFIG_NUMA
+	atomic_t *node_nr_running;
+	int nr_balanced;
+#endif
+
 	task_t *migration_thread;
 	struct list_head migration_queue;
 
@@ -178,6 +184,38 @@
 #endif
 
 /*
+ * Keep track of running tasks.
+ */
+#if CONFIG_NUMA
+
+/* XXX(hch): this should go into a struct sched_node_data */
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+	{[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+static inline void nr_running_init(struct runqueue *rq)
+{
+	rq->node_nr_running = &node_nr_running[0];
+}
+
+static inline void nr_running_inc(struct runqueue *rq)
+{
+	atomic_inc(rq->node_nr_running);
+	rq->nr_running++;
+}
+
+static inline void nr_running_dec(struct runqueue *rq)
+{
+	atomic_dec(rq->node_nr_running);
+	rq->nr_running--;
+}
+
+#else
+# define nr_running_init(rq) do { } while (0)
+# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
+/*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts. Note the ordering: we can safely lookup the task_rq without
  * explicitly disabling preemption.
@@ -294,7 +332,7 @@
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -302,7 +340,7 @@
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -624,9 +662,108 @@
 	spin_unlock(&rq2->lock);
 }
 
-#if CONFIG_SMP
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ *
+ * Note: This isn't actually numa-specific, but just not used otherwise.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask = p->cpus_allowed;
+
+	if (old_mask & (1UL << dest_cpu)) {
+		unsigned long flags;
+		struct runqueue *rq;
+
+		/* force the process onto the specified CPU */
+		set_cpus_allowed(p, 1UL << dest_cpu);
+
+		/* restore the cpus allowed mask */
+		rq = task_rq_lock(p, &flags);
+		p->cpus_allowed = old_mask;
+		task_rq_unlock(rq, &flags);
+	}
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+	unsigned long cpumask;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+
+	minload = 10000000;
+	cpumask = __node_to_cpu_mask(node);
+	for (i = 0; i < NR_CPUS; ++i) {
+		if (!(cpumask & (1UL << i)))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+
+static int find_busiest_node(int this_node)
+{
+	int i, node = this_node, load, this_load, maxload;
+
+	this_load = maxload = atomic_read(&node_nr_running[this_node]);
+	for (i = 0; i < numnodes; i++) {
+		if (i == this_node)
+			continue;
+		load = atomic_read(&node_nr_running[i]);
+		if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+			maxload = load;
+			node = i;
+		}
+	}
+
+	return node;
+}
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++)
+		cpu_rq(i)->node_nr_running = node_nr_running + __cpu_to_node(i);
+}
+#endif /* CONFIG_NUMA */
+
+#if CONFIG_SMP
+/*
  * double_lock_balance - lock the busiest runqueue
  *
  * this_rq is locked already. Recalculate nr_running if we have to
@@ -652,9 +789,10 @@
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu,
+		int idle, int *imbalance, unsigned long cpumask)
 {
 	int nr_running, load, max_load, i;
 	runqueue_t *busiest, *rq_src;
@@ -689,7 +827,7 @@
 	busiest = NULL;
 	max_load = 1;
 	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
+		if (!cpu_online(i) || !((1UL << i) & cpumask))
 			continue;
 
 		rq_src = cpu_rq(i);
@@ -736,9 +874,9 @@
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -758,13 +896,27 @@
  */
 static void load_balance(runqueue_t *this_rq, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, idx, this_cpu, this_node;
+	unsigned long cpumask;
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	this_cpu = smp_processor_id();
+	this_node = __cpu_to_node(this_cpu);
+	cpumask = __node_to_cpu_mask(this_node);
+
+#if CONFIG_NUMA
+	/*
+	 * Avoid rebalancing between nodes too often.
+	 */
+	if (!(++this_rq->nr_balanced % NODE_BALANCE_RATIO))
+		cpumask |= __node_to_cpu_mask(find_busiest_node(this_node));
+#endif
+
+	busiest = find_busiest_queue(this_rq, this_cpu, idle,
+			&imbalance, cpumask);
 	if (!busiest)
 		goto out;
 
@@ -2231,6 +2383,7 @@
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+		nr_running_init(rq);
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;
 |
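The NODE_BALANCE_RATIO damping in load_balance() above is simple enough to model on its own: the per-runqueue counter makes only every tenth balancing attempt widen the cpumask beyond the local node. A sketch (the helper names are invented; only the counter expression mirrors the patch):

```c
#include <assert.h>

#define NODE_BALANCE_RATIO 10

/* mirrors: if (!(++this_rq->nr_balanced % NODE_BALANCE_RATIO)) ... */
static int balance_crosses_node(int *nr_balanced)
{
	return !(++(*nr_balanced) % NODE_BALANCE_RATIO);
}

/* how many of 'calls' invocations would widen the cpumask cross-node */
static int cross_node_attempts(int calls)
{
	int nr_balanced = 0, hits = 0, i;

	for (i = 0; i < calls; i++)
		hits += balance_crosses_node(&nr_balanced);
	return hits;
}
```

So nine out of ten load_balance() calls stay node-local, which is the "rebalancer that fires much less often" idea from earlier in the thread folded into the existing balancing path.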
From: Erich F. <ef...@es...> - 2003-01-13 15:45:44
|
Hi Christoph, I just finished some experiments which show that the fine-tuning can really be left for later. So this approach is ok for me. I hope we can get enough support for integrating this tiny NUMA scheduler.

I didn't do all possible measurements; the interesting ones are with patches 1-4 (nb-smooth) and 1-5 (nb-sm-var1, nb-sm-var2) applied. They show pretty consistent results (within error bars). The fine-tuning in patch #5 doesn't buy us much right now (on my platform), so we can leave it out. Here's the data:

Results on an 8 CPU ia64 machine with 4 nodes (2 CPUs per node).

kernbench:
              elapsed        user           system
stock52       134.52(0.84)   951.64(0.97)   20.72(0.22)
sched52       133.19(1.49)   944.24(0.50)   21.36(0.24)
minsched52    135.47(0.47)   937.61(0.20)   21.30(0.14)
nb-smooth     133.61(0.71)   944.71(0.35)   21.22(0.22)
nb-sm-var1    135.23(2.07)   943.78(0.54)   21.54(0.17)
nb-sm-var2    133.87(0.61)   944.18(0.62)   21.32(0.13)

schedbench/hackbench: time(s)
              10           25           50           100
stock52       0.81(0.04)   2.06(0.07)   4.09(0.13)   7.89(0.25)
sched52       0.81(0.04)   2.03(0.07)   4.14(0.20)   8.61(0.35)
minsched52    1.28(0.05)   3.19(0.06)   6.59(0.13)   13.56(0.27)
nb-smooth     0.77(0.03)   1.94(0.04)   3.80(0.06)   7.97(0.35)
nb-sm-var1    0.81(0.05)   2.01(0.09)   3.89(0.21)   8.20(0.34)
nb-sm-var2    0.82(0.04)   2.10(0.09)   4.19(0.14)   8.15(0.24)

numabench/numa_test 4:
              AvgUserTime  ElapsedTime  TotUserTime    TotSysTime
stock52       ---          27.23(0.52)  89.30(4.18)    0.09(0.01)
sched52       22.32(1.00)  27.39(0.42)  89.29(4.02)    0.10(0.01)
minsched52    20.01(0.01)  23.40(0.13)  80.05(0.02)    0.08(0.01)
nb-smooth     21.01(0.79)  24.70(2.75)  84.04(3.15)    0.09(0.01)
nb-sm-var1    21.39(0.83)  26.03(2.15)  85.56(3.31)    0.09(0.01)
nb-sm-var2    22.18(0.74)  27.36(0.42)  88.72(2.94)    0.09(0.01)

numabench/numa_test 8:
              AvgUserTime  ElapsedTime  TotUserTime    TotSysTime
stock52       ---          27.50(2.58)  174.74(6.66)   0.18(0.01)
sched52       21.73(1.00)  33.70(1.82)  173.87(7.96)   0.18(0.01)
minsched52    20.31(0.00)  23.50(0.12)  162.47(0.04)   0.16(0.01)
nb-smooth     20.46(0.44)  24.21(1.95)  163.68(3.56)   0.16(0.01)
nb-sm-var1    20.45(0.44)  23.95(1.73)  163.62(3.49)   0.17(0.01)
nb-sm-var2    20.71(0.82)  23.78(2.42)  165.74(6.58)   0.17(0.01)

numabench/numa_test 16:
              AvgUserTime  ElapsedTime  TotUserTime    TotSysTime
stock52       ---          52.68(1.51)  390.03(15.10)  0.34(0.01)
sched52       21.51(0.80)  47.18(3.24)  344.29(12.78)  0.36(0.01)
minsched52    20.50(0.03)  43.82(0.08)  328.05(0.45)   0.34(0.01)
nb-smooth     21.12(0.69)  47.42(4.02)  337.99(10.99)  0.34(0.01)
nb-sm-var1    21.18(0.77)  48.19(5.05)  338.94(12.38)  0.34(0.01)
nb-sm-var2    21.69(0.91)  49.05(4.36)  347.03(14.49)  0.34(0.01)

numabench/numa_test 32:
              AvgUserTime  ElapsedTime   TotUserTime    TotSysTime
stock52       ---          102.60(3.89)  794.57(31.72)  0.65(0.01)
sched52       21.93(0.57)  92.46(1.10)   701.75(18.38)  0.67(0.02)
minsched52    20.64(0.10)  89.95(3.16)   660.72(3.13)   0.68(0.07)
nb-smooth     20.95(0.19)  86.63(1.74)   670.56(6.02)   0.66(0.02)
nb-sm-var1    21.47(0.54)  90.95(3.28)   687.12(17.42)  0.67(0.02)
nb-sm-var2    21.45(0.67)  89.91(3.80)   686.47(21.37)  0.68(0.02)

The kernels used:
- stock52    : 2.5.52 + ia64 patch
- sched52    : stock52 + old NUMA scheduler
- minsched52 : stock52 + miniature NUMA scheduler (cannot load balance across nodes)
- nb-smooth  : minsched52 + node balancer + smooth node load patch
- nb-sm-var1 : nb-smooth + variable internode_lb, (MIN,MAX) = (4,40)
- nb-sm-var2 : nb-smooth + variable internode_lb, (MIN,MAX) = (1,16)

Best regards,
Erich

On Monday 13 January 2003 16:26, Christoph Hellwig wrote:
> Anyone interested in this cleaned up minimal numa scheduler? This
> is basically Erich's patches 1-3 with my suggestions applied.
>
> This does not mean I don't like 4 & 5, but I'd rather get a small,
> non-intrusive patch into Linus' tree now and do the fine-tuning later.
> [...]
|
From: Michael H. <hoh...@us...> - 2003-01-13 19:02:56
|
On Mon, 2003-01-13 at 03:32, Erich Focht wrote:

> Hi Christoph,
>
> thanks for reviewing the code and the helpful comments!
>
> On Monday 13 January 2003 09:02, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> > > as discussed on the LSE call I played around with a cross-node
> > > balancer approach put on top of the miniature NUMA scheduler. The
> > > patches are attached and it seems to be clear that we can regain the
> > > good performance for hackbench by adding a cross-node balancer.
> >
> > The changes look fine to me. But I think there's some coding style
> > issues that need cleaning up (see below). Also is there a reason
> > patches 2/3 and 4/5 aren't merged into one patch each?
>
> The patches are separated by their functionality. Patch 2 comes from
> Michael Hohnbaum, so I kept that separate for that reason. Right now
> we can exchange the single components but when we decide that they are
> doing the job, I'd also prefer to have just one patch.

I'm not opposed to combining the patches, but isn't it typically preferred to have patches as small as possible? The initial load balance patch (patch 2) applies and functions fairly independently of the other patches, so it has been easy to maintain as a separate patch.

> > +#ifdef CONFIG_NUMA
> > +extern void sched_balance_exec(void);
> > +extern void node_nr_running_init(void);
> > +#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
> > +	rq->nr_running++
> > +#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
> > +	rq->nr_running--
> >
> > 	static inline void nr_running_inc(runqueue_t *rq)
> > 	{
> > 		atomic_inc(rq->node_ptr);
> > 		rq->nr_running++;
> > 	}
> >
> > 	etc.. would look a bit nicer.
>
> We can change this. Michael, ok with you?

Fine with me.

> > +#if CONFIG_NUMA
> > +static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
> >
> > 	Maybe wants some linewrapping after 80 chars?
>
> Yes.
>
> > +	for (i = 0; i < NR_CPUS; i++) {
> > +		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
> > +	}
> > +	return;
> > +}
> >
> > 	The braces and the return are superfluous. Also kernel/sched.c (or
> > 	mingo code in general) seems to prefer array + i instead of &array[i]
> > 	(not that I have a general preference, but you should try to match
> > 	the surrounding code)
>
> Will change the braces and remove the return. I personally find
> &array[i] more readable.

I agree with Erich here - &array[i] is clearer.

> > +static void sched_migrate_task(task_t *p, int dest_cpu)
> > +{
> > +	unsigned long old_mask;
> > +
> > +	old_mask = p->cpus_allowed;
> > +	if (!(old_mask & (1UL << dest_cpu)))
> > +		return;
> > +	/* force the process onto the specified CPU */
> > +	set_cpus_allowed(p, 1UL << dest_cpu);
> > +
> > +	/* restore the cpus allowed mask */
> > +	set_cpus_allowed(p, old_mask);
> >
> > 	This double set_cpus_allowed doesn't look nice to me. I don't
> > 	have a better suggestion off-hand, though :(
>
> This is not that bad. It involves only one single wakeup of the
> migration thread, and that's more important. Doing it another way
> would mean replicating the set_cpus_allowed() code.

While this shows up in the patch credited to me, it was actually Erich's idea, and I think a quite good one. My initial impression was that it was ugly, but upon closer examination (and trying to implement this some other way) I realized that Erich's implementation, which I borrowed from, was quite good.

Good work Erich in getting the internode balancer written and working. I'm currently building a kernel with these patches and will post results later in the day. Thanks.
--
Michael Hohnbaum 503-578-5486
hoh...@us... T/L 775-5486 |
From: Martin J. B. <mb...@ar...> - 2003-01-14 04:56:33
|
>> I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
>> of the time working with the first three patches. What I found was
>> that rebalancing was happening too much between nodes. I tried a
>> few things to change this, but have not yet settled on the best
>> approach. A key item to work with is the check in find_busiest_node
>> to determine if the found node is busier enough to warrant stealing
>> from it. Currently the check is that the node has 125% of the load
>> of the current node. I think that, for my system at least, we need
>> to add in a constant to this equation. I tried using 4 and that
>> helped a little.
>
> Michael,
>
> in:
>
> +static int find_busiest_node(int this_node)
> +{
> +	int i, node = this_node, load, this_load, maxload;
> +
> +	this_load = maxload = atomic_read(&node_nr_running[this_node]);
> +	for (i = 0; i < numnodes; i++) {
> +		if (i == this_node)
> +			continue;
> +		load = atomic_read(&node_nr_running[i]);
> +		if (load > maxload && (4*load > ((5*4*this_load)/4))) {
> +			maxload = load;
> +			node = i;
> +		}
> +	}
> +	return node;
> +}
>
> You changed ((5*4*this_load)/4) to:
> (5*4*(this_load+4)/4)
> or
> (4+(5*4*(this_load)/4)) ?
>
> We def need some constant to avoid low load ping pong, right?

>> Finally I added in the 04 patch, and that helped a lot. Still,
>> there is too much process movement between nodes.
>
> perhaps increase INTERNODE_LB?

Before we tweak this too much, how about using the global load average for this? I can envisage a situation where we have two nodes with 8 tasks per node, one with 12 tasks, and one with four. You really don't want the ones with 8 tasks pulling stuff from the 12 ... only for the least loaded node to start pulling stuff later.

What about if we take the global load average and multiply by num_cpus_on_this_node / num_cpus_globally ... that'll give us roughly what we should have on this node. If we're significantly underloaded compared to that, we start pulling stuff from the busiest node? And you get the damping over time for free.

I think it'd be best if we stay fairly conservative for this; all we're trying to catch is the corner case where a bunch of stuff has forked but not execed, and we have a long-term imbalance. Aggressive rebalancing might win a few benchmarks, but you'll just cause inter-node task bounce on others.

M. |
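Martin's suggestion reduces to comparing a node's current load against its proportional share of the global load. A hedged sketch, with invented names and a bare `<` comparison (a real policy would add a threshold and the damping he mentions):

```c
#include <assert.h>

/* a node's fair share: global load scaled by its fraction of the CPUs */
static int node_target_load(int global_load, int node_cpus, int total_cpus)
{
	return global_load * node_cpus / total_cpus;
}

/* pull from the busiest node only when below our own fair share */
static int should_pull_cross_node(int node_load, int global_load,
				  int node_cpus, int total_cpus)
{
	return node_load < node_target_load(global_load, node_cpus, total_cpus);
}
```

In Martin's example (four 4-CPU nodes with 8, 8, 12, and 4 tasks) the per-node target is 32 * 4/16 = 8, so only the node with 4 tasks pulls from the 12-task node, and the two 8-task nodes leave it alone.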
From: Michael H. <mi...@hb...> - 2003-01-14 05:52:13
|
On Mon, 2003-01-13 at 20:45, Andrew Theurer wrote:

> > Erich,
> >
> > I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
> > of the time working with the first three patches. What I found was
> > that rebalancing was happening too much between nodes. I tried a
> > few things to change this, but have not yet settled on the best
> > approach. A key item to work with is the check in find_busiest_node
> > to determine if the found node is busier enough to warrant stealing
> > from it. Currently the check is that the node has 125% of the load
> > of the current node. I think that, for my system at least, we need
> > to add in a constant to this equation. I tried using 4 and that
> > helped a little.
>
> Michael,
>
> in:
>
> +static int find_busiest_node(int this_node)
> +{
> +	int i, node = this_node, load, this_load, maxload;
> +
> +	this_load = maxload = atomic_read(&node_nr_running[this_node]);
> +	for (i = 0; i < numnodes; i++) {
> +		if (i == this_node)
> +			continue;
> +		load = atomic_read(&node_nr_running[i]);
> +		if (load > maxload && (4*load > ((5*4*this_load)/4))) {
> +			maxload = load;
> +			node = i;
> +		}
> +	}
> +	return node;
> +}
>
> You changed ((5*4*this_load)/4) to:
> (5*4*(this_load+4)/4)
> or
> (4+(5*4*(this_load)/4)) ?

I suppose I should not have been so dang lazy and cut-n-pasted the line I changed. The change was (((5*4*this_load)/4) + 4), which should be the same as your second choice.

> We def need some constant to avoid low load ping pong, right?

Yep. Without the constant, one could have 6 processes on node A and 4 on node B, and node B would end up stealing. While making a perfect balance, the expense of the off-node traffic does not justify it. At least on the NUMAQ box. It might be justified for a different NUMA architecture, which is why I propose putting this check in a macro that can be defined in topology.h for each architecture.

> > Finally I added in the 04 patch, and that helped a lot. Still,
> > there is too much process movement between nodes.
>
> perhaps increase INTERNODE_LB?

That is on the list to try. Martin was mumbling something about using the system-wide load average to help make the inter-node balance decision. I'd like to give that a try before tweaking INTERNODE_LB.

> -Andrew Theurer

Michael Hohnbaum
hoh...@us... |
From: Andrew T. <hab...@us...> - 2003-01-14 14:55:24
|
> I suppose I should not have been so dang lazy and cut-n-pasted
> the line I changed. The change was (((5*4*this_load)/4) + 4),
> which should be the same as your second choice.
>
> > We def need some constant to avoid low load ping pong, right?
>
> Yep. Without the constant, one could have 6 processes on node
> A and 4 on node B, and node B would end up stealing. While making
> a perfect balance, the expense of the off-node traffic does not
> justify it. At least on the NUMAQ box. It might be justified
> for a different NUMA architecture, which is why I propose putting
> this check in a macro that can be defined in topology.h for each
> architecture.

Yes, I was also concerned about one task in one node and none in the others. Without some sort of constant we will ping-pong the task on every node endlessly, since there is no % threshold that can make any difference when the original load value is 0. Your +4 gets rid of the one-task case.

-Andrew |
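The effect of Michael's +4 constant can be checked directly against the two cases discussed here, the 6-vs-4 split and a lone task next to an empty node. The helper names are invented; the ratio expressions mirror find_busiest_node (maxload starts at this_load, so the first candidate is compared against the local load):

```c
#include <assert.h>

/* original check from find_busiest_node: remote load > 125% of local */
static int busier_orig(int load, int this_load)
{
	return load > this_load && 4 * load > (5 * 4 * this_load) / 4;
}

/* Michael's variant: same ratio plus an absolute offset of 4 */
static int busier_plus4(int load, int this_load)
{
	return load > this_load && 4 * load > ((5 * 4 * this_load) / 4) + 4;
}
```

With the constant, both a 6-vs-4 split (24 > 24 is false) and the single-task case (4 > 4 is false) stop qualifying as imbalances, which is exactly the low-load ping-pong being discussed.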
From: Erich F. <ef...@es...> - 2003-01-14 15:13:02
|
On Tuesday 14 January 2003 17:52, Andrew Theurer wrote: > > I suppose I should not have been so dang lazy and cut-n-pasted > > the line I changed. The change was (((5*4*this_load)/4) + 4) > > which should be the same as your second choice. > > > > > We def need some constant to avoid low load ping pong, right? > > > > Yep. Without the constant, one could have 6 processes on node > > A and 4 on node B, and node B would end up stealing. While that makes > > a perfect balance, the expense of the off-node traffic does not > > justify it. At least on the NUMAQ box. It might be justified > > for a different NUMA architecture, which is why I propose putting > > this check in a macro that can be defined in topology.h for each > > architecture. > > Yes, I was also concerned about one task in one node and none in the > others. Without some sort of constant we will ping pong the task on every > node endlessly, since there is no % threshold that could make any > difference when the original load value is 0.. Your +4 gets rid of the 1 > task case. That won't happen because: - find_busiest_queue() won't detect a sufficient imbalance (imbalance == 0), - load_balance() won't be able to steal the task, as it will be running (the rq->curr task) So the worst thing that happens is that load_balance() finds out that it cannot steal the only task running on the runqueue and returns. But we won't get there because find_busiest_queue() returns NULL. It happens even in the bad case: node#0 : 0 tasks node#1 : 4 tasks (each on its own CPU) where we would desire to distribute the tasks equally among the nodes. At least on our IA64 platform... Regards, Erich |
From: Erich F. <ef...@es...> - 2003-01-14 11:13:43
|
Hi Martin, > Before we tweak this too much, how about using the global load > average for this? I can envisage a situation where we have two > nodes with 8 tasks per node, one with 12 tasks, and one with four. > You really don't want the ones with 8 tasks pulling stuff from > the 12 ... only for the least loaded node to start pulling stuff > later. Hmmm, yet another idea from the old NUMA scheduler coming back, therefore it has my full support ;-). Though we can't do it as I did it there: in the old NUMA approach every cross-node steal was delayed, only 1-2ms if the own node was underloaded, a lot more if the own node's load was average or above average. We might need to finally steal something even if we're having "above average" load, because the cpu mask of the tasks on the overloaded node might only allow them to migrate to us... But this is also a special case which we should solve later. > What about if we take the global load average, and multiply by > num_cpus_on_this_node / num_cpus_globally ... that'll give us > roughly what we should have on this node. If we're significantly > underloaded compared to that, we start pulling stuff from > the busiest node? And you get the damping over time for free. Patch 05 is going in this direction but the constants I chose were very aggressive. I'll update the whole set of patches and try to send out something today. Best regards, Erich
From: Erich F. <ef...@es...> - 2003-01-14 15:54:47
|
Here's the new version of the NUMA scheduler built on top of the miniature scheduler of Martin. I incorporated Michael's ideas and Christoph's suggestions and rediffed for 2.5.58. The whole patch is really tiny: 9.5k. This time I attached the numa scheduler in form of two patches: numa-sched-2.5.58.patch (7k) : components 01, 02, 03 numa-sched-add-2.5.58.patch (3k) : components 04, 05 The single components are also attached in a small tgz archive: 01-minisched-2.5.58.patch : the miniature scheduler from Martin. Balances strictly within a node. Removed the find_busiest_in_mask() function. 02-initial-lb-2.5.58.patch : Michael's initial load balancer at exec(). Cosmetic corrections suggested by Christoph. 03-internode-lb-2.5.58.patch : internode load balancer core. Called after NODE_BALANCE_RATE calls of the intra-node load balancer. Tunable parameters: NODE_BALANCE_RATE (default: 10) NODE_THRESHOLD (default: 125) : consider only nodes with load above NODE_THRESHOLD/100 * own_node_load I added the constant factor of 4 suggested by Michael, but I'm not really happy with it. This should be nr_cpus_in_node, but we don't have that info in topology.h 04-smooth-node-load-2.5.58.patch : The node load measure is smoothed by adding half of the previous node load (and 1/4 of the one before, etc..., as discussed in the LSE call). This should improve a bit the behavior in case of short timed load peaks and avoid bouncing tasks between nodes. 05-var-intnode-lb-2.5.58.patch : Replaces the fixed NODE_BALANCE_RATE interval (between cross-node balancer calls) by a variable node-specific interval. Currently only two values used: NODE_BALANCE_MIN : 10 NODE_BALANCE_MAX : 40 If the node load is less than avg_node_load/2, we switch to NODE_BALANCE_MIN, otherwise we use the large interval. I also added a function to reduce the number of tasks stolen from remote nodes. Regards, Erich
From: Christoph H. <hc...@in...> - 2003-01-14 16:09:07
|
On Tue, Jan 14, 2003 at 04:55:06PM +0100, Erich Focht wrote: > Here's the new version of the NUMA scheduler built on top of the > miniature scheduler of Martin. I incorporated Michael's ideas and > Christoph's suggestions and rediffed for 2.5.58. This one looks a lot nicer. You might also want to take the different nr_running stuff from the patch I posted, I think it looks a lot nicer. The patch (not updated yet) is below again for reference. --- 1.62/fs/exec.c Fri Jan 10 08:21:00 2003 +++ edited/fs/exec.c Mon Jan 13 15:33:32 2003 @@ -1031,6 +1031,8 @@ int retval; int i; + sched_balance_exec(); + file = open_exec(filename); retval = PTR_ERR(file); --- 1.119/include/linux/sched.h Sat Jan 11 07:44:15 2003 +++ edited/include/linux/sched.h Mon Jan 13 15:58:11 2003 @@ -444,6 +444,14 @@ # define set_cpus_allowed(p, new_mask) do { } while (0) #endif +#ifdef CONFIG_NUMA +extern void sched_balance_exec(void); +extern void node_nr_running_init(void); +#else +# define sched_balance_exec() do { } while (0) +# define node_nr_running_init() do { } while (0) +#endif + extern void set_user_nice(task_t *p, long nice); extern int task_prio(task_t *p); extern int task_nice(task_t *p); --- 1.91/init/main.c Mon Jan 6 04:08:49 2003 +++ edited/init/main.c Mon Jan 13 15:33:33 2003 @@ -495,6 +495,7 @@ migration_init(); #endif + node_nr_running_init(); spawn_ksoftirqd(); } --- 1.148/kernel/sched.c Sat Jan 11 07:44:22 2003 +++ edited/kernel/sched.c Mon Jan 13 16:17:34 2003 @@ -67,6 +67,7 @@ #define INTERACTIVE_DELTA 2 #define MAX_SLEEP_AVG (2*HZ) #define STARVATION_LIMIT (2*HZ) +#define NODE_BALANCE_RATIO 10 /* * If a task is 'interactive' then we reinsert it in the active @@ -154,6 +155,11 @@ prio_array_t *active, *expired, arrays[2]; int prev_nr_running[NR_CPUS]; +#ifdef CONFIG_NUMA + atomic_t *node_nr_running; + int nr_balanced; +#endif + task_t *migration_thread; struct list_head migration_queue; @@ -178,6 +184,38 @@ #endif /* + * Keep track of running tasks. 
+ */ +#if CONFIG_NUMA + +/* XXX(hch): this should go into a struct sched_node_data */ +static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = + {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)}; + +static inline void nr_running_init(struct runqueue *rq) +{ + rq->node_nr_running = &node_nr_running[0]; +} + +static inline void nr_running_inc(struct runqueue *rq) +{ + atomic_inc(rq->node_nr_running); + rq->nr_running++; +} + +static inline void nr_running_dec(struct runqueue *rq) +{ + atomic_dec(rq->node_nr_running); + rq->nr_running--; +} + +#else +# define nr_running_init(rq) do { } while (0) +# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0) +# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0) +#endif /* CONFIG_NUMA */ + +/* * task_rq_lock - lock the runqueue a given task resides on and disable * interrupts. Note the ordering: we can safely lookup the task_rq without * explicitly disabling preemption. @@ -294,7 +332,7 @@ p->prio = effective_prio(p); } enqueue_task(p, array); - rq->nr_running++; + nr_running_inc(rq); } /* @@ -302,7 +340,7 @@ */ static inline void deactivate_task(struct task_struct *p, runqueue_t *rq) { - rq->nr_running--; + nr_running_dec(rq); if (p->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; dequeue_task(p, p->array); @@ -624,9 +662,108 @@ spin_unlock(&rq2->lock); } -#if CONFIG_SMP +#if CONFIG_NUMA +/* + * If dest_cpu is allowed for this process, migrate the task to it. + * This is accomplished by forcing the cpu_allowed mask to only + * allow dest_cpu, which will force the cpu onto dest_cpu. Then + * the cpu_allowed mask is restored. + * + * Note: This isn't actually numa-specific, but just not used otherwise. 
+ */ +static void sched_migrate_task(task_t *p, int dest_cpu) +{ + unsigned long old_mask = p->cpus_allowed; + + if (old_mask & (1UL << dest_cpu)) { + unsigned long flags; + struct runqueue *rq; + + /* force the process onto the specified CPU */ + set_cpus_allowed(p, 1UL << dest_cpu); + + /* restore the cpus allowed mask */ + rq = task_rq_lock(p, &flags); + p->cpus_allowed = old_mask; + task_rq_unlock(rq, &flags); + } +} /* + * Find the least loaded CPU. Slightly favor the current CPU by + * setting its runqueue length as the minimum to start. + */ +static int sched_best_cpu(struct task_struct *p) +{ + int i, minload, load, best_cpu, node = 0; + unsigned long cpumask; + + best_cpu = task_cpu(p); + if (cpu_rq(best_cpu)->nr_running <= 2) + return best_cpu; + + minload = 10000000; + for (i = 0; i < numnodes; i++) { + load = atomic_read(&node_nr_running[i]); + if (load < minload) { + minload = load; + node = i; + } + } + + minload = 10000000; + cpumask = __node_to_cpu_mask(node); + for (i = 0; i < NR_CPUS; ++i) { + if (!(cpumask & (1UL << i))) + continue; + if (cpu_rq(i)->nr_running < minload) { + best_cpu = i; + minload = cpu_rq(i)->nr_running; + } + } + return best_cpu; +} + +void sched_balance_exec(void) +{ + int new_cpu; + + if (numnodes > 1) { + new_cpu = sched_best_cpu(current); + if (new_cpu != smp_processor_id()) + sched_migrate_task(current, new_cpu); + } +} + +static int find_busiest_node(int this_node) +{ + int i, node = this_node, load, this_load, maxload; + + this_load = maxload = atomic_read(&node_nr_running[this_node]); + for (i = 0; i < numnodes; i++) { + if (i == this_node) + continue; + load = atomic_read(&node_nr_running[i]); + if (load > maxload && (4*load > ((5*4*this_load)/4))) { + maxload = load; + node = i; + } + } + + return node; +} + +__init void node_nr_running_init(void) +{ + int i; + + for (i = 0; i < NR_CPUS; i++) + cpu_rq(i)->node_nr_running = node_nr_running + __cpu_to_node(i); +} +#endif /* CONFIG_NUMA */ + +#if CONFIG_SMP +/* * 
double_lock_balance - lock the busiest runqueue * * this_rq is locked already. Recalculate nr_running if we have to @@ -652,9 +789,10 @@ } /* - * find_busiest_queue - find the busiest runqueue. + * find_busiest_queue - find the busiest runqueue among the cpus in cpumask */ -static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance) +static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, + int idle, int *imbalance, unsigned long cpumask) { int nr_running, load, max_load, i; runqueue_t *busiest, *rq_src; @@ -689,7 +827,7 @@ busiest = NULL; max_load = 1; for (i = 0; i < NR_CPUS; i++) { - if (!cpu_online(i)) + if (!cpu_online(i) || !((1UL << i) & cpumask)) continue; rq_src = cpu_rq(i); @@ -736,9 +874,9 @@ static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu) { dequeue_task(p, src_array); - src_rq->nr_running--; + nr_running_dec(src_rq); set_task_cpu(p, this_cpu); - this_rq->nr_running++; + nr_running_inc(this_rq); enqueue_task(p, this_rq->active); /* * Note that idle threads have a prio of MAX_PRIO, for this test @@ -758,13 +896,27 @@ */ static void load_balance(runqueue_t *this_rq, int idle) { - int imbalance, idx, this_cpu = smp_processor_id(); + int imbalance, idx, this_cpu, this_node; + unsigned long cpumask; runqueue_t *busiest; prio_array_t *array; struct list_head *head, *curr; task_t *tmp; - busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance); + this_cpu = smp_processor_id(); + this_node = __cpu_to_node(this_cpu); + cpumask = __node_to_cpu_mask(this_node); + +#if CONFIG_NUMA + /* + * Avoid rebalancing between nodes too often. 
+ */ + if (!(++this_rq->nr_balanced % NODE_BALANCE_RATIO)) + cpumask |= __node_to_cpu_mask(find_busiest_node(this_node)); +#endif + + busiest = find_busiest_queue(this_rq, this_cpu, idle, + &imbalance, cpumask); if (!busiest) goto out; @@ -2231,6 +2383,7 @@ spin_lock_init(&rq->lock); INIT_LIST_HEAD(&rq->migration_queue); atomic_set(&rq->nr_iowait, 0); + nr_running_init(rq); for (j = 0; j < 2; j++) { array = rq->arrays + j; |
From: Erich F. <ef...@es...> - 2003-01-14 16:23:10
|
In the previous email the patch 02-initial-lb-2.5.58.patch had a bug and this was present in the numa-sched-2.5.58.patch and numa-sched-add-2.5.58.patch, too. Please use the patches attached to this email! Sorry for the silly mistake... Christoph, I used your way of coding nr_running_inc/dec now. Regards, Erich On Tuesday 14 January 2003 16:55, Erich Focht wrote: > Here's the new version of the NUMA scheduler built on top of the > miniature scheduler of Martin. I incorporated Michael's ideas and > Christoph's suggestions and rediffed for 2.5.58. > > The whole patch is really tiny: 9.5k. This time I attached the numa > scheduler in form of two patches: > > numa-sched-2.5.58.patch (7k) : components 01, 02, 03 > numa-sched-add-2.5.58.patch (3k) : components 04, 05 > > The single components are also attached in a small tgz archive: > > 01-minisched-2.5.58.patch : the miniature scheduler from > Martin. Balances strictly within a node. Removed the > find_busiest_in_mask() function. > > 02-initial-lb-2.5.58.patch : Michael's initial load balancer at > exec(). Cosmetic corrections suggested by Christoph. > > 03-internode-lb-2.5.58.patch : internode load balancer core. Called > after NODE_BALANCE_RATE calls of the intra-node load balancer. Tunable > parameters: > NODE_BALANCE_RATE (default: 10) > NODE_THRESHOLD (default: 125) : consider only nodes with load > above NODE_THRESHOLD/100 * own_node_load > I added the constant factor of 4 suggested by Michael, but I'm not > really happy with it. This should be nr_cpus_in_node, but we don't > have that info in topology.h > > 04-smooth-node-load-2.5.58.patch : The node load measure is smoothed > by adding half of the previous node load (and 1/4 of the one before, > etc..., as discussed in the LSE call). This should improve a bit the > behavior in case of short timed load peaks and avoid bouncing tasks > between nodes.
> > 05-var-intnode-lb-2.5.58.patch : Replaces the fixed NODE_BALANCE_RATE > interval (between cross-node balancer calls) by a variable > node-specific interval. Currently only two values used: > NODE_BALANCE_MIN : 10 > NODE_BALANCE_MAX : 40 > If the node load is less than avg_node_load/2, we switch to > NODE_BALANCE_MIN, otherwise we use the large interval. > I also added a function to reduce the number of tasks stolen from > remote nodes. > > Regards, > Erich |
From: Christoph H. <hc...@in...> - 2003-01-14 16:52:27
|
+#ifdef CONFIG_NUMA +extern void sched_balance_exec(void); +extern void node_nr_running_init(void); +#else +#define sched_balance_exec() {} +#endif You accidentally (?) removed the stub for node_nr_running_init. Also sched.h used # define inside ifdefs. +#ifdef CONFIG_NUMA + atomic_t * node_ptr; The name is still a bit non-descriptive and the * placed wrong :) What about atomic_t *nr_running_at_node? +#if CONFIG_NUMA +static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = + {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)}; + I think my two comments here were pretty useful :) +static inline void nr_running_dec(runqueue_t *rq) +{ + atomic_dec(rq->node_ptr); + rq->nr_running--; +} + +__init void node_nr_running_init(void) +{ + int i; + + for (i = 0; i < NR_CPUS; i++) + cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)]; +} +#else +# define nr_running_init(rq) do { } while (0) +# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0) +# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0) +#endif /* CONFIG_NUMA */ + /* @@ -689,7 +811,7 @@ static inline runqueue_t *find_busiest_q busiest = NULL; max_load = 1; for (i = 0; i < NR_CPUS; i++) { - if (!cpu_online(i)) + if (!cpu_online(i) || !((1UL << i) & cpumask) ) spurious whitespace before the closing parenthesis. prio_array_t *array; struct list_head *head, *curr; task_t *tmp; + int this_node = __cpu_to_node(this_cpu); + unsigned long cpumask = __node_to_cpu_mask(this_node); If that's not too much style nitpicking: put this_node on one line with all the other local ints and initialize all three vars after the declarations (like in my patch *duck*) +#if CONFIG_NUMA + rq->node_ptr = &node_nr_running[0]; +#endif /* CONFIG_NUMA */ I had a nr_running_init() abstraction for this, but you only took it partially. It would be nice to merge the last bit to get rid of this ifdef. Else the patch looks really, really good and I'm looking forward to seeing it in mainline real soon! |