From: Rick L. <ric...@us...> - 2003-02-03 23:40:07
Ingo -- the hyperthreading changes are mixed in with some other changes (as Robert pointed out). I'm in the process of separating the threads, as the hyperthreading mods are useful to some while some of the other fixes/tuning are proving useful to others. Two questions:

- is this something you've already done? (no point in reinventing the wheel)
- if not, do you have time to field (limited) email questions about the effect certain segments may have on other segments? (We can take that to private email if you prefer.)

Rick
From: Rick L. <ric...@us...> - 2003-02-04 00:18:49
Thanks Robert -- I'm not currently operating out of the 2.5-mm tree, but that's just out of convenience, not philosophy. Breaking down the D7 and E2 patches is, in part, meant to address something you've started to address in your patches -- not all parts of D7 and E2 make sense for all people. When we see changes, positive or negative, it's not clear which portions of the patch to blame/praise. Breaking them down will make this easier. Your patches are nice and small already -- the only way to make them smaller would be to compress them! Keep up the good work :)

Rick
From: Andrew T. <hab...@us...> - 2003-02-04 01:28:03
In case anyone is interested, here is SchedD7, with and without HT bits, compared to SchedD7 minus HT bits plus numa-ht topology, compared to a run with E2:

The system is a dual P4 xeon, serverworks chipset, kernel is 2.5.59. I ran "kernbench" with -j2, -j4, and -j8. The kernbench result is an average of 10 runs. The kernels are:

2.5.59-schedD7:         D7 unchanged
2.5.59-schedD7noht:     D7 minus HT bits
2.5.59-schedD7-numaht:  D7 minus HT bits plus a numa-HT topology, first check in sched_best_cpu removed (could easily be an arch tunable in the future)
2.5.59-schedD7-numaht2: Above kernel plus node_rebalance_rate is 1 instead of 10
2.5.59-schedE2:         E2 unchanged

-j2 kernbench: used to show the advantages of low-load workloads with HT. Faster times usually mean the scheduler avoided tasks sharing the same physical cpu when possible.

2.5.59-schedD7          Elapsed: 85.129s  User: 156.367s  System: 15.166s  CPU: 201%
2.5.59-schedD7noht      Elapsed: 112.454s User: 211.029s  System: 15.829s  CPU: 201%
2.5.59-D7-numaht        Elapsed: 95.705s  User: 177.929s  System: 15.308s  CPU: 201.4%
2.5.59-D7-numaht2       Elapsed: 91.63s   User: 170.269s  System: 14.962s  CPU: 201.9%
2.5.59-schedE2          Elapsed: 84.538s  User: 155.325s  System: 15.122s  CPU: 201%

The result for the "noht" D7 kernel is what I would expect. Individual compile times varied quite a bit depending on the scheduler's luck at getting the proper balance. Times on regular D7 were very consistent, -numaht varied a little bit (but not as bad as -D7-noht), and numaht2 varied even less (but not as little as schedD7). Setting the node_rebalance_rate to 1 on numaht2 really helped.

-j4 kernbench: should be near 100% cpu util. Just making sure none of the HT code hurt 100% load situations.

2.5.59-schedD7          Elapsed: 79.516s  User: 253.64s   System: 20.012s  CPU: 343.5%
2.5.59-schedD7noht      Elapsed: 77.382s  User: 279.937s  System: 20.831s  CPU: 388.1%
2.5.59-D7-numaht        Elapsed: 78.309s  User: 266.917s  System: 20.572s  CPU: 366.5%
2.5.59-D7-numaht2       Elapsed: 77.897s  User: 275.234s  System: 20.915s  CPU: 379.8%
2.5.59-schedE2          Elapsed: 77.309s  User: 279.156s  System: 20.681s  CPU: 387.3%

Looks like there is some (small) penalty for HT-aware kernels at ~100% cpu util, with the exception of E2. Not sure why SchedD7 had lower overall cpu util.

-j8 kernbench: for the fun of it. Just a comparison.

2.5.59-schedD7          Elapsed: 78.382s  User: 283.752s  System: 21.113s  CPU: 388.5%
2.5.59-schedD7noht      Elapsed: 78.077s  User: 282.711s  System: 21.07s   CPU: 388.6%
2.5.59-D7-numaht        Elapsed: 78.23s   User: 282.184s  System: 21.683s  CPU: 387.9%
2.5.59-D7-numaht2       Elapsed: 78.336s  User: 282.983s  System: 21.506s  CPU: 388.2%
2.5.59-schedE2          Elapsed: 78.039s  User: 282.628s  System: 21.038s  CPU: 388.6%

Things seem to even out here.

-Andrew Theurer
From: Martin J. B. <mb...@ar...> - 2003-02-04 07:01:57
> -j4 kernbench: should be near 100% cpu util. Just making sure none of
> the HT code hurt 100% load situations.
>
> 2.5.59-schedD7          Elapsed: 79.516s  User: 253.64s   System: 20.012s  CPU: 343.5%
> 2.5.59-schedD7noht      Elapsed: 77.382s  User: 279.937s  System: 20.831s  CPU: 388.1%
> 2.5.59-D7-numaht        Elapsed: 78.309s  User: 266.917s  System: 20.572s  CPU: 366.5%
> 2.5.59-D7-numaht2       Elapsed: 77.897s  User: 275.234s  System: 20.915s  CPU: 379.8%
> 2.5.59-schedE2          Elapsed: 77.309s  User: 279.156s  System: 20.681s  CPU: 387.3%
>
> Looks like there is some (small) penalty for HT aware kernels on ~100%
> cpu util with the exception of E2. Not sure why SchedD7 had lower
> overall cpu util.

Looks like it's doing significantly less work ... if the user time is 10% lower (253 vs 279), it looks like E2 might be thrashing tasks between CPUs or something? 2.5.59-schedD7noht and 2.5.59-schedE2 seem to have rather similar looking user times.

M.
From: Ingo M. <mi...@el...> - 2003-02-04 10:18:25
On Mon, 3 Feb 2003, Andrew Theurer wrote:

> In case anyone is interested, here is SchedD7, with and without HT bits,
> compared to SchedD7 minus HT bits plus numa-ht topology, compared to a
> run with E2:
>
> The system is a dual P4 xeon, serverworks chipset, kernel is 2.5.59. I
> ran "kernbench" with -j2, -j4, and -j8. Kernbench result is an average
> of 10 runs. The kernels are:

thanks for the analysis - looks like the -E2 scheduler got the best numbers in every benchmark. The difference is especially visible in the -j2 test, where the difference between schedD7-noht and schedE2 is more than 30%. But -E2 even beats the best NUMA-HT scheduler (2.5.59-D7-numaht2) by 8%.

i'll take out the non-HT bits from the -E2 patch to have a clean HT scheduler patch we can compare.

Ingo
From: Andrew T. <hab...@us...> - 2003-02-04 14:01:15
On Tuesday 04 February 2003 04:23, Ingo Molnar wrote:
> On Mon, 3 Feb 2003, Andrew Theurer wrote:
> > In case anyone is interested, here is SchedD7, with and without HT bits,
> > compared to SchedD7 minus HT bits plus numa-ht topology, compared to a
> > run with E2:
> >
> > The system is a dual P4 xeon, serverworks chipset, kernel is 2.5.59. I
> > ran "kernbench" with -j2, -j4, and -j8. Kernbench result is an average
> > of 10 runs. The kernels are:
>
> thanks for the analysis - looks like the -E2 scheduler got the best
> numbers in every benchmark. The difference is especially visible in the
> -j2 test, where the difference between schedD7-noht and schedE2 is more
> than 30%. But -E2 even beats the best NUMA-HT scheduler
> (2.5.59-D7-numaht2) by 8%.

I think it's going to be hard, if not impossible, to beat E2 with a numa-sched for HT. The approaches are a little different, and E2 just may be the more efficient approach. The numa scheduler tries to find the best cpu right when the new process is exec'd, but does not do the "active" load balance that the D7 & E2 patches do. The sched_best_cpu() from numa_sched can take some time to complete, while active load balance can help when some tasks exit and an HT-type imbalance exists. Although the HT-numa way is a good improvement over stock 2.5.59, it just doesn't approach the performance of E2.

Also, FYI, last time Michael tested with D7, he had some weirdness with a "real" NUMA system. Hopefully we can reproduce that soon on E2 and track down any issues.

-Andrew Theurer
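[For illustration, a minimal C sketch of the exec-time placement being contrasted above -- the names rq_len() and best_cpu_at_exec() are hypothetical, not the actual numa_sched code. The point is that the decision happens exactly once, at exec() time; nothing in this path later repairs an imbalance that appears when tasks exit.]

#define NR_CPUS 8

/* assumed accessor: current length of a CPU's runqueue */
extern unsigned long rq_len(int cpu);

static int best_cpu_at_exec(void)
{
	int cpu, best = 0;
	unsigned long min_load = (unsigned long) -1;

	/* scan once for the least-loaded CPU and place the task there */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (rq_len(cpu) < min_load) {
			min_load = rq_len(cpu);
			best = cpu;
		}
	}
	return best;	/* one-shot decision: no later repair */
}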
From: Ingo M. <mi...@el...> - 2003-02-04 14:14:12
On Tue, 4 Feb 2003, Andrew Theurer wrote:

> I think it's going to be hard, if not impossible, to beat E2 with a
> numa-sched for HT. The approaches are a little different, and E2 just
> may be the more efficient approach. The numa scheduler tries to find
> the best cpu right when the new process is exec'd, but does not do an
> "active" load balance that the D7&E2 patches do. The sched_best_cpu()
> from numa_sched can take some time to complete, while active load
> balance can help when some tasks exit and an HT-type imbalance exists.
> Although the HT-numa way is a good improvement over stock 2.5.59, it just
> doesn't approach the performance of E2.

it's not just the matter of active load-balancing - it's also the recognition that in the case of this special SMT concept, where there's basically no per-logical-CPU cached context, immediate balancing within the HT-node has no negative performance impact.

all the delayed and statistical balancing done in the SMP and NUMA load balancer tries to deal with the conflict between immediate idle CPU utilization and cache thrashing. In the specific case of HT there's basically no such issue.

another fundamental difference is that an HT physical CPU with 2 tasks is almost as bad as a normal physical CPU with 2 tasks => ie. we have to balance one of the tasks over to any potentially idle physical CPU. In the non-HT case this is easy: one of the tasks is not running currently, so we send it over to the idle CPU. On an HT CPU it's much more difficult: both tasks are running, so 'balancing' has to deal with a currently active, running task. This is the real difference in 'active balancing', not any timing issue.

Ingo
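[A minimal C sketch of the imbalance test described above -- hypothetical types and names, not the -E2 code: a physical package whose siblings all run real tasks is HT-imbalanced the moment some other package sits entirely idle, and because both of its tasks are running, an ordinary pull from the idle side cannot help; a migration thread has to push one of the running tasks away.]

struct sibling {
	int runs_real_task;		/* 1 if curr != the idle thread */
};

struct package {
	int nr_siblings;
	struct sibling sib[4];
};

/* every sibling busy => a candidate for having a task pushed away */
static int package_fully_busy(const struct package *p)
{
	int i;

	for (i = 0; i < p->nr_siblings; i++)
		if (!p->sib[i].runs_real_task)
			return 0;
	return 1;
}

/* every sibling idle => this package could absorb a pushed task */
static int package_fully_idle(const struct package *p)
{
	int i;

	for (i = 0; i < p->nr_siblings; i++)
		if (p->sib[i].runs_real_task)
			return 0;
	return 1;
}

/* 1 if an idle package coexists with a fully busy one: only a
 * migration thread can fix this, since the surplus task is running */
static int ht_imbalance(const struct package *pkg, int n)
{
	int i, j;

	for (i = 0; i < n; i++) {
		if (!package_fully_idle(&pkg[i]))
			continue;
		for (j = 0; j < n; j++)
			if (j != i && package_fully_busy(&pkg[j]))
				return 1;
	}
	return 0;
}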
From: Andrew T. <hab...@us...> - 2003-02-04 15:45:28
On Tuesday 04 February 2003 08:19, Ingo Molnar wrote:
> On Tue, 4 Feb 2003, Andrew Theurer wrote:
> > I think it's going to be hard, if not impossible, to beat E2 with a
> > numa-sched for HT. The approaches are a little different, and E2 just
> > may be the more efficient approach. The numa scheduler tries to find
> > the best cpu right when the new process is exec'd, but does not do an
> > "active" load balance that the D7&E2 patches do. The sched_best_cpu()
> > from numa_sched can take some time to complete, while active load
> > balance can help when some tasks exit and an HT-type imbalance exists.
> > Although the HT-numa way is a good improvement over stock 2.5.59, it just
> > doesn't approach the performance of E2.
>
> it's not just the matter of active load-balancing - it's also the
> recognition that in the case of this special SMT concept, where there's
> basically no per-logical-CPU cached context, immediate balancing within
> the HT-node has no negative performance impact.

Yup, not much I can do about that.

> all the delayed and statistical balancing done in the SMP and NUMA load
> balancer tries to deal with the conflict between immediate idle CPU
> utilization and cache thrashing. In the specific case of HT there's
> basically no such issue.
>
> another fundamental difference is that an HT physical CPU with 2 tasks is
> almost as bad as a normal physical CPU with 2 tasks => ie. we have to
> balance one of the tasks over to any potentially idle physical CPU. In the
> non-HT case this is easy: one of the tasks is not running currently, so we
> send it over to the idle CPU. On an HT CPU it's much more difficult: both
> tasks are running, so 'balancing' has to deal with a currently active,
> running task. This is the real difference in 'active balancing', not any
> timing issue.

Yes, I agree, and I probably should have been more clear. That's why I took the first check out of sched_best_cpu(). If it was still there, I would almost always have one physical cpu with 1 task per logical, while possibly having another physical cpu idle. So, I can only correct this at "initial load balance" during exec. I still have that problem when some tasks exit and then have this undesirable imbalance. Unless I were to add an active balance to numa sched, I can't really get around this. Unfortunately the numa_sched cannot be all things to all people. The only advantage I see with the ht-numa approach is fewer changes to 2.5.59's sched.c.

I also ran E2 with acpi=off to compare to no HT cpus:

kernbench -j2
2.5.59-E2               E: 84.538s  U: 155.325s  S: 15.122s  CPU: 201%
2.5.59-E2 acpi=off      E: 85.17s   U: 153.088s  S: 14.566s  CPU: 196.1%

This is really good; you get the performance of a non-HT system when running only two tasks. I don't think we can ask for more here.

kernbench -j4
2.5.59-E2               E: 77.309s  U: 279.156s  S: 20.681s  CPU: 387.3%
2.5.59-E2 acpi=off      E: 86.219s  U: 155.109s  S: 14.692s  CPU: 196.4%

And here is the real benefit, an 11% reduction in elapsed time with HT.

-Andrew Theurer
From: Erich F. <ef...@es...> - 2003-02-04 14:43:54
On Tuesday 04 February 2003 15:02, Andrew Theurer wrote:
> Also, FYI, last time Michael tested with D7, he had some weirdness with a
> "real" NUMA system. Hopefully we can reproduce that soon on E2 and track
> down any issues.

Well... those are actually understood:
- numa_test forks 4 tasks on its node
- the eager steal mode of idle CPUs spreads the freshly forked tasks across the CPUs of the same node
- sched_balance_exec() returns early without having moved the exec'd tasks because they are running alone on their CPUs
- hackbench kicks in and starts many threads
- the node on which the four numa_test tasks are running doesn't rebalance while hackbench runs, because the numa_test tasks are always running (== rq->curr) and the busy rebalance interval on ia32 is ridiculously large (25s = 100 * 250ms or so).

The cures:
1: make the idle CPUs less eager when stealing (see my previous post)
2: change the NODE_REBALANCE_RATE to be reasonably small. 10 is already quite large IMHO.

Regards,
Erich
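[The interval arithmetic, as a small sketch -- assuming, as in the 2.5 NUMA scheduler, that a cross-node balance only runs every NODE_REBALANCE_RATE-th busy rebalance, and HZ=1000 on ia32:]

#define HZ			1000
#define BUSY_REBALANCE_TICK	(HZ / 4)	/* one busy rebalance per 250 ms */
#define NODE_REBALANCE_RATE	100		/* cross-node balance every Nth busy pass */

/* effective cross-node rebalance period, in milliseconds */
static unsigned long node_rebalance_period_ms(void)
{
	/* 100 * 250 ms = 25000 ms -- the 25 s complained about above;
	 * NODE_REBALANCE_RATE == 1 (the numaht2 kernel) gives 250 ms */
	return NODE_REBALANCE_RATE * (BUSY_REBALANCE_TICK * 1000UL / HZ);
}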
From: Andrew T. <hab...@us...> - 2003-02-04 15:58:14
On Tuesday 04 February 2003 08:44, Erich Focht wrote:
> On Tuesday 04 February 2003 15:02, Andrew Theurer wrote:
> > Also, FYI, last time Michael tested with D7, he had some weirdness with a
> > "real" NUMA system. Hopefully we can reproduce that soon on E2 and track
> > down any issues.
>
> Well... those are actually understood:
> - numa_test forks 4 tasks on its node
> - the eager steal mode of idle CPUs spreads the freshly forked tasks
> across the CPUs of the same node
> - sched_balance_exec() returns early without having moved the exec'd
> tasks because they are running alone on their CPUs
> - hackbench kicks in and starts many threads
> - the node on which the four numa_test tasks are running doesn't
> rebalance while hackbench runs, because the numa_test tasks are always
> running (== rq->curr) and the busy rebalance interval on ia32 is ridiculously
> large (25s = 100 * 250ms or so).
>
> The cures:
> 1: make the idle CPUs less eager when stealing (see my previous post)
> 2: change the NODE_REBALANCE_RATE to be reasonably small. 10 is
> already quite large IMHO.

OK, I was also thinking of having a variable per-cpu node rebalance rate, based on a ratio of (some_constant * runqueue length of current cpu / highest runqueue length in system). This rate would be recalculated each time the cpu finished a node balance, some time after find_busiest_queue returned (to get the highest runqueue length for the calculation). Having a variable rate would hopefully balance out a system with a very high imbalance in a reasonable amount of time, while a fairly well balanced system will have infrequent balance attempts. A sketch follows below.

The same could be done for rebalance ticks within a node. I'm not sure if I like the idea of just 250ms or 1ms; I'd like to experiment with a sliding scale based on the imbalance across cpus within the node.

-Andrew
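[A sketch of that variable rate -- hypothetical helper, not from any posted patch: the number of busy passes between node rebalances shrinks as this CPU's queue gets short relative to the busiest queue in the system, so a badly imbalanced system rebalances sooner while a well-balanced one keeps its attempts infrequent.]

#define RATE_MAX 10	/* assumed "some_constant"; matches the usual NODE_REBALANCE_RATE */

static unsigned int variable_rebalance_rate(unsigned int this_qlen,
					    unsigned int max_qlen)
{
	unsigned int rate;

	if (!max_qlen)			/* nothing running anywhere */
		return RATE_MAX;

	rate = RATE_MAX * this_qlen / max_qlen;
	return rate ? rate : 1;		/* rebalance at least every busy pass */
}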
From: Martin J. B. <mb...@ar...> - 2003-02-04 16:21:59
> Well... those are actually understood:
> - numa_test forks 4 tasks on its node
> - the eager steal mode of idle CPUs spreads the freshly forked tasks
> across the CPUs of the same node
> - sched_balance_exec() returns early without having moved the exec'd
> tasks because they are running alone on their CPUs
> - hackbench kicks in and starts many threads
> - the node on which the four numa_test tasks are running doesn't
> rebalance while hackbench runs, because the numa_test tasks are always
> running (== rq->curr) and the busy rebalance interval on ia32 is ridiculously
> large (25s = 100 * 250ms or so).
>
> The cures:
> 1: make the idle CPUs less eager when stealing (see my previous post)
> 2: change the NODE_REBALANCE_RATE to be reasonably small. 10 is
> already quite large IMHO.

1 sounds fine. 2 I'm more wary of. If you mean changing the busy rebalance interval from 100 to 10, that sounds good, though I'd prefer not to go lower than that, for our boxes at least. If someone turned it to 100, I think that's just because it's better to effectively disable it in some circumstances because the algorithm itself is too aggressive ... that's better fixed by changing the parameters of the algorithm a bit once we get to that.

M.
From: Ingo M. <mi...@el...> - 2003-02-04 16:58:10
this is the -E6 patch, with most of the HT-unrelated tunings and leftover changes (noticed by Robert) removed. I kept the activation reorganization because that is important in the HT case as well. based on Erich's observations i also added AGRESSIVE_IDLE_STEAL, which defaults to 1 currently - could anyone try it with 0 on a NUMA box, how much of a difference does it make? Ingo --- linux/arch/i386/kernel/cpu/proc.c.orig +++ linux/arch/i386/kernel/cpu/proc.c @@ -1,4 +1,5 @@ #include <linux/smp.h> +#include <linux/sched.h> #include <linux/timex.h> #include <linux/string.h> #include <asm/semaphore.h> @@ -101,6 +102,13 @@ fpu_exception ? "yes" : "no", c->cpuid_level, c->wp_works_ok ? "yes" : "no"); +#if CONFIG_SHARE_RUNQUEUE +{ + extern long __rq_idx[NR_CPUS]; + + seq_printf(m, "\nrunqueue\t: %d\n", __rq_idx[n]); +} +#endif for ( i = 0 ; i < 32*NCAPINTS ; i++ ) if ( test_bit(i, c->x86_capability) && --- linux/arch/i386/kernel/smpboot.c.orig +++ linux/arch/i386/kernel/smpboot.c @@ -38,6 +38,7 @@ #include <linux/kernel.h> #include <linux/mm.h> +#include <linux/sched.h> #include <linux/kernel_stat.h> #include <linux/smp_lock.h> #include <linux/irq.h> @@ -945,6 +946,16 @@ int cpu_sibling_map[NR_CPUS] __cacheline_aligned; +static int test_ht; + +static int __init ht_setup(char *str) +{ + test_ht = 1; + return 1; +} + +__setup("test_ht", ht_setup); + static void __init smp_boot_cpus(unsigned int max_cpus) { int apicid, cpu, bit; @@ -1087,16 +1098,31 @@ Dprintk("Boot done.\n"); /* - * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so + * Here we can be sure that there is an IO-APIC in the system. Let's + * go and set it up: + */ + smpboot_setup_io_apic(); + + setup_boot_APIC_clock(); + + /* + * Synchronize the TSC with the AP + */ + if (cpu_has_tsc && cpucount) + synchronize_tsc_bp(); + /* + * If Hyper-Threading is available, construct cpu_sibling_map[], so * that we can tell the sibling CPU efficiently. */ +printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus()); if (cpu_has_ht && smp_num_siblings > 1) { for (cpu = 0; cpu < NR_CPUS; cpu++) cpu_sibling_map[cpu] = NO_PROC_ID; for (cpu = 0; cpu < NR_CPUS; cpu++) { int i; - if (!test_bit(cpu, &cpu_callout_map)) continue; + if (!test_bit(cpu, &cpu_callout_map)) + continue; for (i = 0; i < NR_CPUS; i++) { if (i == cpu || !test_bit(i, &cpu_callout_map)) @@ -1112,17 +1138,41 @@ printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu); } } - } - - smpboot_setup_io_apic(); - - setup_boot_APIC_clock(); +#if CONFIG_SHARE_RUNQUEUE + /* + * At this point APs would have synchronised TSC and + * waiting for smp_commenced, with their APIC timer + * disabled. So BP can go ahead do some initialization + * for Hyper-Threading (if needed). + */ + for (cpu = 0; cpu < NR_CPUS; cpu++) { + int i; + if (!test_bit(cpu, &cpu_callout_map)) + continue; + for (i = 0; i < NR_CPUS; i++) { + if (i <= cpu) + continue; + if (!test_bit(i, &cpu_callout_map)) + continue; - /* - * Synchronize the TSC with the AP - */ - if (cpu_has_tsc && cpucount) - synchronize_tsc_bp(); + if (phys_proc_id[cpu] != phys_proc_id[i]) + continue; + /* + * merge runqueues, resulting in one + * runqueue per package: + */ + sched_map_runqueue(cpu, i); + break; + } + } +#endif + } +#if CONFIG_SHARE_RUNQUEUE + if (smp_num_siblings == 1 && test_ht) { + printk("Simulating a 2-sibling 1-phys-CPU HT setup!\n"); + sched_map_runqueue(0, 1); + } +#endif } /* These are wrappers to interface to the new boot process. 
Someone --- linux/arch/i386/Kconfig.orig +++ linux/arch/i386/Kconfig @@ -408,6 +408,24 @@ If you don't know what to do here, say N. +choice + + prompt "Hyperthreading Support" + depends on SMP + default NR_SIBLINGS_0 + +config NR_SIBLINGS_0 + bool "off" + +config NR_SIBLINGS_2 + bool "2 siblings" + +config NR_SIBLINGS_4 + bool "4 siblings" + +endchoice + + config PREEMPT bool "Preemptible Kernel" help --- linux/include/linux/sched.h.orig +++ linux/include/linux/sched.h @@ -147,6 +147,24 @@ extern void sched_init(void); extern void init_idle(task_t *idle, int cpu); +/* + * Is there a way to do this via Kconfig? + */ +#define CONFIG_NR_SIBLINGS 0 + +#if CONFIG_NR_SIBLINGS_2 +# define CONFIG_NR_SIBLINGS 2 +#elif CONFIG_NR_SIBLINGS_4 +# define CONFIG_NR_SIBLINGS 4 +#endif + +#if CONFIG_NR_SIBLINGS +# define CONFIG_SHARE_RUNQUEUE 1 +#else +# define CONFIG_SHARE_RUNQUEUE 0 +#endif +extern void sched_map_runqueue(int cpu1, int cpu2); + extern void show_state(void); extern void show_trace(unsigned long *stack); extern void show_stack(unsigned long *stack); @@ -293,7 +311,7 @@ prio_array_t *array; unsigned long sleep_avg; - unsigned long sleep_timestamp; + unsigned long last_run; unsigned long policy; unsigned long cpus_allowed; @@ -770,11 +788,6 @@ return p->thread_info->cpu; } -static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) -{ - p->thread_info->cpu = cpu; -} - #else static inline unsigned int task_cpu(struct task_struct *p) @@ -782,10 +795,6 @@ return 0; } -static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) -{ -} - #endif /* CONFIG_SMP */ #endif /* __KERNEL__ */ --- linux/include/asm-i386/apic.h.orig +++ linux/include/asm-i386/apic.h @@ -98,4 +98,6 @@ #endif /* CONFIG_X86_LOCAL_APIC */ +extern int phys_proc_id[NR_CPUS]; + #endif /* __ASM_APIC_H */ --- linux/kernel/fork.c.orig +++ linux/kernel/fork.c @@ -876,7 +876,7 @@ */ p->first_time_slice = 1; current->time_slice >>= 1; - p->sleep_timestamp = jiffies; + p->last_run = jiffies; if (!current->time_slice) { /* * This case is rare, it happens when the parent has only --- linux/kernel/sched.c.orig +++ linux/kernel/sched.c @@ -67,6 +67,7 @@ #define INTERACTIVE_DELTA 2 #define MAX_SLEEP_AVG (2*HZ) #define STARVATION_LIMIT (2*HZ) +#define AGRESSIVE_IDLE_STEAL 1 #define NODE_THRESHOLD 125 /* @@ -141,6 +142,48 @@ }; /* + * It's possible for two CPUs to share the same runqueue. + * This makes sense if they eg. share caches. + * + * We take the common 1:1 (SMP, UP) case and optimize it, + * the rest goes via remapping: rq_idx(cpu) gives the + * runqueue on which a particular cpu is on, cpu_idx(cpu) + * gives the rq-specific index of the cpu. + * + * (Note that the generic scheduler code does not impose any + * restrictions on the mappings - there can be 4 CPUs per + * runqueue or even assymetric mappings.) 
+ */ +#if CONFIG_SHARE_RUNQUEUE +# define MAX_NR_SIBLINGS CONFIG_NR_SIBLINGS + long __rq_idx[NR_CPUS] __cacheline_aligned; + static long __cpu_idx[NR_CPUS] __cacheline_aligned; +# define rq_idx(cpu) (__rq_idx[(cpu)]) +# define cpu_idx(cpu) (__cpu_idx[(cpu)]) +# define for_each_sibling(idx, rq) \ + for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++) +# define rq_nr_cpus(rq) ((rq)->nr_cpus) +# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance) +#else +# define MAX_NR_SIBLINGS 1 +# define rq_idx(cpu) (cpu) +# define cpu_idx(cpu) 0 +# define for_each_sibling(idx, rq) while (0) +# define cpu_active_balance(c) 0 +# define do_active_balance(rq, cpu) do { } while (0) +# define rq_nr_cpus(rq) 1 + static inline void active_load_balance(runqueue_t *rq, int this_cpu) { } +#endif + +typedef struct cpu_s { + task_t *curr, *idle; + task_t *migration_thread; + struct list_head migration_queue; + int active_balance; + int cpu; +} cpu_t; + +/* * This is the main, per-CPU runqueue data structure. * * Locking rule: those places that want to lock multiple runqueues @@ -151,7 +194,7 @@ spinlock_t lock; unsigned long nr_running, nr_switches, expired_timestamp, nr_uninterruptible; - task_t *curr, *idle; + struct mm_struct *prev_mm; prio_array_t *active, *expired, arrays[2]; int prev_nr_running[NR_CPUS]; #ifdef CONFIG_NUMA @@ -159,27 +202,39 @@ unsigned int nr_balanced; int prev_node_load[MAX_NUMNODES]; #endif - task_t *migration_thread; - struct list_head migration_queue; + int nr_cpus; + cpu_t cpu[MAX_NR_SIBLINGS]; atomic_t nr_iowait; } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; -#define cpu_rq(cpu) (runqueues + (cpu)) +#define cpu_rq(cpu) (runqueues + (rq_idx(cpu))) +#define cpu_int(c) ((cpu_rq(c))->cpu + cpu_idx(c)) +#define cpu_curr_ptr(cpu) (cpu_int(cpu)->curr) +#define cpu_idle_ptr(cpu) (cpu_int(cpu)->idle) + #define this_rq() cpu_rq(smp_processor_id()) #define task_rq(p) cpu_rq(task_cpu(p)) -#define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define rt_task(p) ((p)->prio < MAX_RT_PRIO) +#define migration_thread(cpu) (cpu_int(cpu)->migration_thread) +#define migration_queue(cpu) (&cpu_int(cpu)->migration_queue) + +#if NR_CPUS > 1 +# define task_allowed(p, cpu) ((p)->cpus_allowed & (1UL << (cpu))) +#else +# define task_allowed(p, cpu) 1 +#endif + /* * Default context-switch locking: */ #ifndef prepare_arch_switch # define prepare_arch_switch(rq, next) do { } while(0) # define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock) -# define task_running(rq, p) ((rq)->curr == (p)) +# define task_running(p) (cpu_curr_ptr(task_cpu(p)) == (p)) #endif #ifdef CONFIG_NUMA @@ -322,16 +377,21 @@ * Also update all the scheduling statistics stuff. (sleep average * calculation, priority modifiers, etc.) */ +static inline void __activate_task(task_t *p, runqueue_t *rq) +{ + enqueue_task(p, rq->active); + nr_running_inc(rq); +} + static inline void activate_task(task_t *p, runqueue_t *rq) { - unsigned long sleep_time = jiffies - p->sleep_timestamp; - prio_array_t *array = rq->active; + unsigned long sleep_time = jiffies - p->last_run; if (!rt_task(p) && sleep_time) { /* * This code gives a bonus to interactive tasks. We update * an 'average sleep time' value here, based on - * sleep_timestamp. The more time a task spends sleeping, + * ->last_run. The more time a task spends sleeping, * the higher the average gets - and the higher the priority * boost gets as well. 
*/ @@ -340,8 +400,7 @@ p->sleep_avg = MAX_SLEEP_AVG; p->prio = effective_prio(p); } - enqueue_task(p, array); - nr_running_inc(rq); + __activate_task(p, rq); } /* @@ -382,8 +441,18 @@ #endif } +static inline void resched_cpu(int cpu) +{ + resched_task(cpu_curr_ptr(cpu)); +} + #ifdef CONFIG_SMP +static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) +{ + p->thread_info->cpu = cpu; +} + /* * wait_task_inactive - wait for a thread to unschedule. * @@ -398,7 +467,7 @@ repeat: preempt_disable(); rq = task_rq(p); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { cpu_relax(); /* * enable/disable preemption just to make this @@ -409,7 +478,7 @@ goto repeat; } rq = task_rq_lock(p, &flags); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { task_rq_unlock(rq, &flags); preempt_enable(); goto repeat; @@ -417,6 +486,13 @@ task_rq_unlock(rq, &flags); preempt_enable(); } + +#else + +static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) +{ +} + #endif /* @@ -431,10 +507,39 @@ */ void kick_if_running(task_t * p) { - if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id())) + if ((task_running(p)) && (task_cpu(p) != smp_processor_id())) resched_task(p); } +static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p) +{ + cpu_t *curr_cpu; + task_t *curr; + int idx; + + if (idle_cpu(cpu)) + return resched_cpu(cpu); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + if (curr_cpu->idle == curr_cpu->curr) + return resched_cpu(curr_cpu->cpu); + } + + if (p->prio < cpu_curr_ptr(cpu)->prio) + return resched_task(cpu_curr_ptr(cpu)); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + curr = curr_cpu->curr; + if (p->prio < curr->prio) + return resched_task(curr); + } +} /*** * try_to_wake_up - wake up a thread * @p: the to-be-woken-up thread @@ -463,7 +568,7 @@ * Fast-migrate the task if it's not running or runnable * currently. Do not violate hard affinity. */ - if (unlikely(sync && !task_running(rq, p) && + if (unlikely(sync && !task_running(p) && (task_cpu(p) != smp_processor_id()) && (p->cpus_allowed & (1UL << smp_processor_id())))) { @@ -473,10 +578,12 @@ } if (old_state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible--; - activate_task(p, rq); - - if (p->prio < rq->curr->prio) - resched_task(rq->curr); + if (sync) + __activate_task(p, rq); + else { + activate_task(p, rq); + wake_up_cpu(rq, task_cpu(p), p); + } success = 1; } p->state = TASK_RUNNING; @@ -561,7 +668,7 @@ * context_switch - switch to the new MM and the new * thread's register state. */ -static inline task_t * context_switch(task_t *prev, task_t *next) +static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next) { struct mm_struct *mm = next->mm; struct mm_struct *oldmm = prev->active_mm; @@ -575,7 +682,7 @@ if (unlikely(!prev->mm)) { prev->active_mm = NULL; - mmdrop(oldmm); + rq->prev_mm = oldmm; } /* Here we just switch the register state and the stack. */ @@ -596,8 +703,9 @@ unsigned long i, sum = 0; for (i = 0; i < NR_CPUS; i++) - sum += cpu_rq(i)->nr_running; - + /* Shared runqueues are counted only once. */ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_running; return sum; } @@ -608,7 +716,9 @@ for (i = 0; i < NR_CPUS; i++) { if (!cpu_online(i)) continue; - sum += cpu_rq(i)->nr_uninterruptible; + /* Shared runqueues are counted only once. 
*/ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_uninterruptible; } return sum; } @@ -790,7 +900,23 @@ #endif /* CONFIG_NUMA */ -#if CONFIG_SMP +/* + * One of the idle_cpu_tick() and busy_cpu_tick() functions will + * get called every timer tick, on every CPU. Our balancing action + * frequency and balancing agressivity depends on whether the CPU is + * idle or not. + * + * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * systems with HZ=100, every 10 msecs.) + */ +#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) + +#if !CONFIG_SMP + +static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { } + +#else /* * double_lock_balance - lock the busiest runqueue @@ -906,12 +1032,7 @@ set_task_cpu(p, this_cpu); nr_running_inc(this_rq); enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); + wake_up_cpu(this_rq, this_cpu, p); } /* @@ -922,9 +1043,9 @@ * We call this with the current runqueue locked, * irqs disabled. */ -static void load_balance(runqueue_t *this_rq, int idle) +static void load_balance(runqueue_t *this_rq, int this_cpu, int idle) { - int imbalance, idx, this_cpu = smp_processor_id(); + int imbalance, idx; runqueue_t *busiest; prio_array_t *array; struct list_head *head, *curr; @@ -972,12 +1093,15 @@ * 1) running (obviously), or * 2) cannot be migrated to this CPU due to cpus_allowed, or * 3) are cache-hot on their current CPU. + * + * (except if we are in idle mode which is a more agressive + * form of rebalancing.) */ -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) +#define CAN_MIGRATE_TASK(p,rq,cpu) \ + (((idle && AGRESSIVE_IDLE_STEAL) || \ + (jiffies - (p)->last_run > cache_decay_ticks)) && \ + !task_running(p) && task_allowed(p, cpu)) curr = curr->prev; @@ -1000,31 +1124,136 @@ ; } +#if CONFIG_SHARE_RUNQUEUE +static void active_load_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + /* + * Any SMT-specific imbalance? + */ + for_each_sibling(idx, rq) + if (rq->cpu[idx].idle == rq->cpu[idx].curr) + goto next_cpu; + + /* + * At this point it's sure that we have a SMT + * imbalance: this (physical) CPU is idle but + * another CPU has two (or more) tasks running. + * + * We wake up one of the migration threads (it + * doesnt matter which one) and let it fix things up: + */ + if (!cpu_active_balance(i)) { + cpu_active_balance(i) = 1; + spin_unlock(&this_rq->lock); + wake_up_process(rq->cpu[0].migration_thread); + spin_lock(&this_rq->lock); + } +next_cpu: + } +} + +static void do_active_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + spin_unlock(&this_rq->lock); + + cpu_active_balance(this_cpu) = 0; + + /* + * Is the imbalance still present? + */ + for_each_sibling(idx, this_rq) + if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr) + goto out; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + + /* completely idle CPU? */ + if (rq->nr_running) + continue; + + /* + * At this point it's reasonably sure that we have an + * imbalance. 
Since we are the migration thread, try to + * balance a thread over to the target queue. + */ + spin_lock(&rq->lock); + load_balance(rq, i, 1); + spin_unlock(&rq->lock); + goto out; + } +out: + spin_lock(&this_rq->lock); +} + /* - * One of the idle_cpu_tick() and busy_cpu_tick() functions will - * get called every timer tick, on every CPU. Our balancing action - * frequency and balancing agressivity depends on whether the CPU is - * idle or not. + * This routine is called to map a CPU into another CPU's runqueue. * - * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on - * systems with HZ=100, every 10 msecs.) + * This must be called during bootup with the merged runqueue having + * no tasks. */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) -#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) +void sched_map_runqueue(int cpu1, int cpu2) +{ + runqueue_t *rq1 = cpu_rq(cpu1); + runqueue_t *rq2 = cpu_rq(cpu2); + int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx; + + printk("sched_merge_runqueues: CPU#%d <=> CPU#%d, on CPU#%d.\n", cpu1, cpu2, smp_processor_id()); + if (rq1 == rq2) + BUG(); + if (rq2->nr_running) + BUG(); + /* + * At this point, we dont have anything in the runqueue yet. So, + * there is no need to move processes between the runqueues. + * Only, the idle processes should be combined and accessed + * properly. + */ + cpu2_idx = rq1->nr_cpus++; + + if (rq_idx(cpu1) != cpu1) + BUG(); + rq_idx(cpu2) = cpu1; + cpu_idx(cpu2) = cpu2_idx; + rq1->cpu[cpu2_idx].cpu = cpu2; + rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle; + rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr; + INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue); + + /* just to be safe: */ + rq2->cpu[cpu2_idx_orig].idle = NULL; + rq2->cpu[cpu2_idx_orig].curr = NULL; +} +#endif +#endif + +DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } }; -static inline void idle_tick(runqueue_t *rq) +static inline void idle_tick(runqueue_t *rq, unsigned long j) { - if (jiffies % IDLE_REBALANCE_TICK) + if (j % IDLE_REBALANCE_TICK) return; spin_lock(&rq->lock); - load_balance(rq, 1); + load_balance(rq, smp_processor_id(), 1); spin_unlock(&rq->lock); } -#endif - -DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } }; - /* * We place interactive tasks back into the active array, if possible. * @@ -1035,9 +1264,9 @@ * increasing number of running tasks: */ #define EXPIRED_STARVING(rq) \ - ((rq)->expired_timestamp && \ + (STARVATION_LIMIT && ((rq)->expired_timestamp && \ (jiffies - (rq)->expired_timestamp >= \ - STARVATION_LIMIT * ((rq)->nr_running) + 1)) + STARVATION_LIMIT * ((rq)->nr_running) + 1))) /* * This function gets called by the timer code, with HZ frequency. 
@@ -1050,12 +1279,13 @@ { int cpu = smp_processor_id(); runqueue_t *rq = this_rq(); + unsigned long j = jiffies; task_t *p = current; if (rcu_pending(cpu)) rcu_check_callbacks(cpu, user_ticks); - if (p == rq->idle) { + if (p == cpu_idle_ptr(cpu)) { /* note: this timer irq context must be accounted for as well */ if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET) kstat_cpu(cpu).cpustat.system += sys_ticks; @@ -1063,9 +1293,7 @@ kstat_cpu(cpu).cpustat.iowait += sys_ticks; else kstat_cpu(cpu).cpustat.idle += sys_ticks; -#if CONFIG_SMP - idle_tick(rq); -#endif + idle_tick(rq, j); return; } if (TASK_NICE(p) > 0) @@ -1074,12 +1302,13 @@ kstat_cpu(cpu).cpustat.user += user_ticks; kstat_cpu(cpu).cpustat.system += sys_ticks; + spin_lock(&rq->lock); /* Task might have expired already, but not scheduled off yet */ if (p->array != rq->active) { set_tsk_need_resched(p); + spin_unlock(&rq->lock); return; } - spin_lock(&rq->lock); if (unlikely(rt_task(p))) { /* * RR tasks need a special form of timeslice management. @@ -1115,16 +1344,14 @@ if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; + rq->expired_timestamp = j; enqueue_task(p, rq->expired); } else enqueue_task(p, rq->active); } out: -#if CONFIG_SMP - if (!(jiffies % BUSY_REBALANCE_TICK)) - load_balance(rq, 0); -#endif + if (!(j % BUSY_REBALANCE_TICK)) + load_balance(rq, smp_processor_id(), 0); spin_unlock(&rq->lock); } @@ -1135,11 +1362,11 @@ */ asmlinkage void schedule(void) { + int idx, this_cpu, retry = 0; + struct list_head *queue; task_t *prev, *next; - runqueue_t *rq; prio_array_t *array; - struct list_head *queue; - int idx; + runqueue_t *rq; /* * Test if we are atomic. Since do_exit() needs to call into @@ -1152,15 +1379,15 @@ dump_stack(); } } - - check_highmem_ptes(); need_resched: + check_highmem_ptes(); + this_cpu = smp_processor_id(); preempt_disable(); prev = current; rq = this_rq(); release_kernel_lock(prev); - prev->sleep_timestamp = jiffies; + prev->last_run = jiffies; spin_lock_irq(&rq->lock); /* @@ -1183,12 +1410,14 @@ } pick_next_task: if (unlikely(!rq->nr_running)) { -#if CONFIG_SMP - load_balance(rq, 1); + load_balance(rq, this_cpu, 1); if (rq->nr_running) goto pick_next_task; -#endif - next = rq->idle; + active_load_balance(rq, this_cpu); + if (rq->nr_running) + goto pick_next_task; +pick_idle: + next = cpu_idle_ptr(this_cpu); rq->expired_timestamp = 0; goto switch_tasks; } @@ -1204,24 +1433,60 @@ rq->expired_timestamp = 0; } +new_array: idx = sched_find_first_bit(array->bitmap); queue = array->queue + idx; next = list_entry(queue->next, task_t, run_list); + if ((next != prev) && (rq_nr_cpus(rq) > 1)) { + struct list_head *tmp = queue->next; + + while ((task_running(next) && (next != prev)) || !task_allowed(next, this_cpu)) { + tmp = tmp->next; + if (tmp != queue) { + next = list_entry(tmp, task_t, run_list); + continue; + } + idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx); + if (idx == MAX_PRIO) { + if (retry || !rq->expired->nr_active) { + goto pick_idle; + } + /* + * To avoid infinite changing of arrays, + * when we have only tasks runnable by + * sibling. 
+ */ + retry = 1; + + array = rq->expired; + goto new_array; + } + queue = array->queue + idx; + tmp = queue->next; + next = list_entry(tmp, task_t, run_list); + } + } switch_tasks: prefetch(next); clear_tsk_need_resched(prev); - RCU_qsctr(prev->thread_info->cpu)++; + RCU_qsctr(task_cpu(prev))++; if (likely(prev != next)) { + struct mm_struct *prev_mm; rq->nr_switches++; - rq->curr = next; + cpu_curr_ptr(this_cpu) = next; + set_task_cpu(next, this_cpu); prepare_arch_switch(rq, next); - prev = context_switch(prev, next); + prev = context_switch(rq, prev, next); barrier(); rq = this_rq(); + prev_mm = rq->prev_mm; + rq->prev_mm = NULL; finish_arch_switch(rq, prev); + if (prev_mm) + mmdrop(prev_mm); } else spin_unlock_irq(&rq->lock); @@ -1481,9 +1746,8 @@ * If the task is running and lowered its priority, * or increased its priority then reschedule its CPU: */ - if ((NICE_TO_PRIO(nice) < p->static_prio) || - task_running(rq, p)) - resched_task(rq->curr); + if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p)) + resched_task(cpu_curr_ptr(task_cpu(p))); } out_unlock: task_rq_unlock(rq, &flags); @@ -1561,7 +1825,7 @@ */ int task_curr(task_t *p) { - return cpu_curr(task_cpu(p)) == p; + return cpu_curr_ptr(task_cpu(p)) == p; } /** @@ -1570,7 +1834,7 @@ */ int idle_cpu(int cpu) { - return cpu_curr(cpu) == cpu_rq(cpu)->idle; + return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu); } /** @@ -1660,7 +1924,7 @@ else p->prio = p->static_prio; if (array) - activate_task(p, task_rq(p)); + __activate_task(p, task_rq(p)); out_unlock: task_rq_unlock(rq, &flags); @@ -2135,7 +2399,7 @@ local_irq_save(flags); double_rq_lock(idle_rq, rq); - idle_rq->curr = idle_rq->idle = idle; + cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle; deactivate_task(idle, rq); idle->array = NULL; idle->prio = MAX_PRIO; @@ -2190,6 +2454,7 @@ unsigned long flags; migration_req_t req; runqueue_t *rq; + int cpu; #if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */ new_mask &= cpu_online_map; @@ -2211,31 +2476,31 @@ * If the task is not on a runqueue (and not running), then * it is sufficient to simply update the task's cpu field. */ - if (!p->array && !task_running(rq, p)) { + if (!p->array && !task_running(p)) { set_task_cpu(p, __ffs(p->cpus_allowed)); task_rq_unlock(rq, &flags); return; } init_completion(&req.done); req.task = p; - list_add(&req.list, &rq->migration_queue); + cpu = task_cpu(p); + list_add(&req.list, migration_queue(cpu)); task_rq_unlock(rq, &flags); - - wake_up_process(rq->migration_thread); + wake_up_process(migration_thread(cpu)); wait_for_completion(&req.done); } /* - * migration_thread - this is a highprio system thread that performs + * migration_task - this is a highprio system thread that performs * thread migration by 'pulling' threads into the target runqueue. 
*/ -static int migration_thread(void * data) +static int migration_task(void * data) { struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; int cpu = (long) data; runqueue_t *rq; - int ret; + int ret, idx; daemonize(); sigfillset(¤t->blocked); @@ -2250,7 +2515,8 @@ ret = setscheduler(0, SCHED_FIFO, ¶m); rq = this_rq(); - rq->migration_thread = current; + migration_thread(cpu) = current; + idx = cpu_idx(cpu); sprintf(current->comm, "migration/%d", smp_processor_id()); @@ -2263,7 +2529,9 @@ task_t *p; spin_lock_irqsave(&rq->lock, flags); - head = &rq->migration_queue; + if (cpu_active_balance(cpu)) + do_active_balance(rq, cpu); + head = migration_queue(cpu); current->state = TASK_INTERRUPTIBLE; if (list_empty(head)) { spin_unlock_irqrestore(&rq->lock, flags); @@ -2292,9 +2560,8 @@ set_task_cpu(p, cpu_dest); if (p->array) { deactivate_task(p, rq_src); - activate_task(p, rq_dest); - if (p->prio < rq_dest->curr->prio) - resched_task(rq_dest->curr); + __activate_task(p, rq_dest); + wake_up_cpu(rq_dest, cpu_dest, p); } } double_rq_unlock(rq_src, rq_dest); @@ -2312,12 +2579,13 @@ unsigned long action, void *hcpu) { + long cpu = (long) hcpu; + switch (action) { case CPU_ONLINE: - printk("Starting migration thread for cpu %li\n", - (long)hcpu); - kernel_thread(migration_thread, hcpu, CLONE_KERNEL); - while (!cpu_rq((long)hcpu)->migration_thread) + printk("Starting migration thread for cpu %li\n", cpu); + kernel_thread(migration_task, hcpu, CLONE_KERNEL); + while (!migration_thread(cpu)) yield(); break; } @@ -2392,11 +2660,20 @@ for (i = 0; i < NR_CPUS; i++) { prio_array_t *array; + /* + * Start with a 1:1 mapping between CPUs and runqueues: + */ +#if CONFIG_SHARE_RUNQUEUE + rq_idx(i) = i; + cpu_idx(i) = 0; +#endif rq = cpu_rq(i); rq->active = rq->arrays; rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); - INIT_LIST_HEAD(&rq->migration_queue); + INIT_LIST_HEAD(migration_queue(i)); + rq->nr_cpus = 1; + rq->cpu[cpu_idx(i)].cpu = i; atomic_set(&rq->nr_iowait, 0); nr_running_init(rq); @@ -2414,9 +2691,9 @@ * We have to do a little magic to get the first * thread right in SMP mode. */ - rq = this_rq(); - rq->curr = current; - rq->idle = current; + cpu_curr_ptr(smp_processor_id()) = current; + cpu_idle_ptr(smp_processor_id()) = current; + set_task_cpu(current, smp_processor_id()); wake_up_process(current); --- linux/init/main.c.orig +++ linux/init/main.c @@ -354,7 +354,14 @@ static void rest_init(void) { + /* + * We count on the initial thread going ok + * Like idlers init is an unlocked kernel thread, which will + * make syscalls (and thus be locked). + */ + init_idle(current, smp_processor_id()); kernel_thread(init, NULL, CLONE_KERNEL); + unlock_kernel(); cpu_idle(); } @@ -438,13 +445,6 @@ check_bugs(); printk("POSIX conformance testing by UNIFIX\n"); - /* - * We count on the initial thread going ok - * Like idlers init is an unlocked kernel thread, which will - * make syscalls (and thus be locked). - */ - init_idle(current, smp_processor_id()); - /* Do the rest non-__init'ed, we're now alive */ rest_init(); } |
From: Linus T. <tor...@tr...> - 2003-02-04 17:13:58
On Tue, 4 Feb 2003, Ingo Molnar wrote:
>
> this is the -E6 patch, with most of the HT-unrelated tunings and leftover
> changes (noticed by Robert) removed. I kept the activation reorganization
> because that is important in the HT case as well.

It has some code that seems to be either left-over debugging or just mis-indented (I assume it's left-over, since the "printk at the start of the line" thing tends to be what you use for temporary debugging):

+printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus());

Also, since some of your previous patches have been against various -mm kernels (or other things), I'd wish you explicitly mentioned which kernel this is against.

Linus
From: Ingo M. <mi...@el...> - 2003-02-07 08:33:08
On Tue, 4 Feb 2003, Linus Torvalds wrote: > It has some code that seems to be either left-over debugging [...] indeed - i have removed those from the attached patch. I also fixed a compiler warning reported by Jordan Breeding, and another compiler warning. I kept the /proc/cpuinfo runqueue index number, for debugging purposes, so that people can be sure about the actual runqueue mappings activated on their box. I also kept the test_ht boot option so that people can enable a shared-runqueue setup for testing purposes. > Also, since some of your previous patches have been against various -mm > kernels (or other things), I'd wish you explicitly mentioned which > kernel this is against. the attached patch is against current BK. I tested it on the following 6 kernel/hardware combinations: (HT SMP kernel, SMP kernel, UP kernel) X (non-HT SMP box, HT SMP box) all kernels compiled & booted fine. Ingo --- linux/include/linux/sched.h.orig +++ linux/include/linux/sched.h @@ -147,6 +147,24 @@ extern void sched_init(void); extern void init_idle(task_t *idle, int cpu); +/* + * Is there a way to do this via Kconfig? + */ +#if CONFIG_NR_SIBLINGS_2 +# define CONFIG_NR_SIBLINGS 2 +#elif CONFIG_NR_SIBLINGS_4 +# define CONFIG_NR_SIBLINGS 4 +#else +# define CONFIG_NR_SIBLINGS 0 +#endif + +#if CONFIG_NR_SIBLINGS +# define CONFIG_SHARE_RUNQUEUE 1 +#else +# define CONFIG_SHARE_RUNQUEUE 0 +#endif +extern void sched_map_runqueue(int cpu1, int cpu2); + extern void show_state(void); extern void show_trace(unsigned long *stack); extern void show_stack(unsigned long *stack); @@ -298,7 +316,7 @@ prio_array_t *array; unsigned long sleep_avg; - unsigned long sleep_timestamp; + unsigned long last_run; unsigned long policy; unsigned long cpus_allowed; @@ -778,11 +796,6 @@ return p->thread_info->cpu; } -static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) -{ - p->thread_info->cpu = cpu; -} - #else static inline unsigned int task_cpu(struct task_struct *p) @@ -790,10 +803,6 @@ return 0; } -static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) -{ -} - #endif /* CONFIG_SMP */ #endif /* __KERNEL__ */ --- linux/include/asm-i386/apic.h.orig +++ linux/include/asm-i386/apic.h @@ -98,4 +98,6 @@ #endif /* CONFIG_X86_LOCAL_APIC */ +extern int phys_proc_id[NR_CPUS]; + #endif /* __ASM_APIC_H */ --- linux/arch/i386/kernel/cpu/proc.c.orig +++ linux/arch/i386/kernel/cpu/proc.c @@ -1,4 +1,5 @@ #include <linux/smp.h> +#include <linux/sched.h> #include <linux/timex.h> #include <linux/string.h> #include <asm/semaphore.h> @@ -101,6 +102,13 @@ fpu_exception ? "yes" : "no", c->cpuid_level, c->wp_works_ok ? 
"yes" : "no"); +#if CONFIG_SHARE_RUNQUEUE +{ + extern long __rq_idx[NR_CPUS]; + + seq_printf(m, "\nrunqueue\t: %d\n", __rq_idx[n]); +} +#endif for ( i = 0 ; i < 32*NCAPINTS ; i++ ) if ( test_bit(i, c->x86_capability) && --- linux/arch/i386/kernel/smpboot.c.orig +++ linux/arch/i386/kernel/smpboot.c @@ -38,6 +38,7 @@ #include <linux/kernel.h> #include <linux/mm.h> +#include <linux/sched.h> #include <linux/kernel_stat.h> #include <linux/smp_lock.h> #include <linux/irq.h> @@ -945,6 +946,16 @@ int cpu_sibling_map[NR_CPUS] __cacheline_aligned; +static int test_ht; + +static int __init ht_setup(char *str) +{ + test_ht = 1; + return 1; +} + +__setup("test_ht", ht_setup); + static void __init smp_boot_cpus(unsigned int max_cpus) { int apicid, cpu, bit; @@ -1087,7 +1098,20 @@ Dprintk("Boot done.\n"); /* - * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so + * Here we can be sure that there is an IO-APIC in the system. Let's + * go and set it up: + */ + smpboot_setup_io_apic(); + + setup_boot_APIC_clock(); + + /* + * Synchronize the TSC with the AP + */ + if (cpu_has_tsc && cpucount) + synchronize_tsc_bp(); + /* + * If Hyper-Threading is available, construct cpu_sibling_map[], so * that we can tell the sibling CPU efficiently. */ if (cpu_has_ht && smp_num_siblings > 1) { @@ -1096,7 +1120,8 @@ for (cpu = 0; cpu < NR_CPUS; cpu++) { int i; - if (!test_bit(cpu, &cpu_callout_map)) continue; + if (!test_bit(cpu, &cpu_callout_map)) + continue; for (i = 0; i < NR_CPUS; i++) { if (i == cpu || !test_bit(i, &cpu_callout_map)) @@ -1112,17 +1137,41 @@ printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu); } } - } - - smpboot_setup_io_apic(); - - setup_boot_APIC_clock(); +#if CONFIG_SHARE_RUNQUEUE + /* + * At this point APs would have synchronised TSC and + * waiting for smp_commenced, with their APIC timer + * disabled. So BP can go ahead do some initialization + * for Hyper-Threading (if needed). + */ + for (cpu = 0; cpu < NR_CPUS; cpu++) { + int i; + if (!test_bit(cpu, &cpu_callout_map)) + continue; + for (i = 0; i < NR_CPUS; i++) { + if (i <= cpu) + continue; + if (!test_bit(i, &cpu_callout_map)) + continue; - /* - * Synchronize the TSC with the AP - */ - if (cpu_has_tsc && cpucount) - synchronize_tsc_bp(); + if (phys_proc_id[cpu] != phys_proc_id[i]) + continue; + /* + * merge runqueues, resulting in one + * runqueue per package: + */ + sched_map_runqueue(cpu, i); + break; + } + } +#endif + } +#if CONFIG_SHARE_RUNQUEUE + if (smp_num_siblings == 1 && test_ht) { + printk("Simulating shared runqueues - mapping CPU#1's runqueue to CPU#0's runqueue.\n"); + sched_map_runqueue(0, 1); + } +#endif } /* These are wrappers to interface to the new boot process. Someone --- linux/arch/i386/Kconfig.orig +++ linux/arch/i386/Kconfig @@ -408,6 +408,24 @@ If you don't know what to do here, say N. 
+choice + + prompt "Hyperthreading Support" + depends on SMP + default NR_SIBLINGS_0 + +config NR_SIBLINGS_0 + bool "off" + +config NR_SIBLINGS_2 + bool "2 siblings" + +config NR_SIBLINGS_4 + bool "4 siblings" + +endchoice + + config PREEMPT bool "Preemptible Kernel" help --- linux/kernel/fork.c.orig +++ linux/kernel/fork.c @@ -878,7 +878,7 @@ */ p->first_time_slice = 1; current->time_slice >>= 1; - p->sleep_timestamp = jiffies; + p->last_run = jiffies; if (!current->time_slice) { /* * This case is rare, it happens when the parent has only --- linux/kernel/sched.c.orig +++ linux/kernel/sched.c @@ -67,6 +67,7 @@ #define INTERACTIVE_DELTA 2 #define MAX_SLEEP_AVG (2*HZ) #define STARVATION_LIMIT (2*HZ) +#define AGRESSIVE_IDLE_STEAL 1 #define NODE_THRESHOLD 125 /* @@ -141,6 +142,48 @@ }; /* + * It's possible for two CPUs to share the same runqueue. + * This makes sense if they eg. share caches. + * + * We take the common 1:1 (SMP, UP) case and optimize it, + * the rest goes via remapping: rq_idx(cpu) gives the + * runqueue on which a particular cpu is on, cpu_idx(cpu) + * gives the rq-specific index of the cpu. + * + * (Note that the generic scheduler code does not impose any + * restrictions on the mappings - there can be 4 CPUs per + * runqueue or even assymetric mappings.) + */ +#if CONFIG_SHARE_RUNQUEUE +# define MAX_NR_SIBLINGS CONFIG_NR_SIBLINGS + long __rq_idx[NR_CPUS] __cacheline_aligned; + static long __cpu_idx[NR_CPUS] __cacheline_aligned; +# define rq_idx(cpu) (__rq_idx[(cpu)]) +# define cpu_idx(cpu) (__cpu_idx[(cpu)]) +# define for_each_sibling(idx, rq) \ + for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++) +# define rq_nr_cpus(rq) ((rq)->nr_cpus) +# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance) +#else +# define MAX_NR_SIBLINGS 1 +# define rq_idx(cpu) (cpu) +# define cpu_idx(cpu) 0 +# define for_each_sibling(idx, rq) while (0) +# define cpu_active_balance(c) 0 +# define do_active_balance(rq, cpu) do { } while (0) +# define rq_nr_cpus(rq) 1 + static inline void active_load_balance(runqueue_t *rq, int this_cpu) { } +#endif + +typedef struct cpu_s { + task_t *curr, *idle; + task_t *migration_thread; + struct list_head migration_queue; + int active_balance; + int cpu; +} cpu_t; + +/* * This is the main, per-CPU runqueue data structure. 
* * Locking rule: those places that want to lock multiple runqueues @@ -151,7 +194,7 @@ spinlock_t lock; unsigned long nr_running, nr_switches, expired_timestamp, nr_uninterruptible; - task_t *curr, *idle; + struct mm_struct *prev_mm; prio_array_t *active, *expired, arrays[2]; int prev_nr_running[NR_CPUS]; #ifdef CONFIG_NUMA @@ -159,27 +202,39 @@ unsigned int nr_balanced; int prev_node_load[MAX_NUMNODES]; #endif - task_t *migration_thread; - struct list_head migration_queue; + int nr_cpus; + cpu_t cpu[MAX_NR_SIBLINGS]; atomic_t nr_iowait; } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; -#define cpu_rq(cpu) (runqueues + (cpu)) +#define cpu_rq(cpu) (runqueues + (rq_idx(cpu))) +#define cpu_int(c) ((cpu_rq(c))->cpu + cpu_idx(c)) +#define cpu_curr_ptr(cpu) (cpu_int(cpu)->curr) +#define cpu_idle_ptr(cpu) (cpu_int(cpu)->idle) + #define this_rq() cpu_rq(smp_processor_id()) #define task_rq(p) cpu_rq(task_cpu(p)) -#define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define rt_task(p) ((p)->prio < MAX_RT_PRIO) +#define migration_thread(cpu) (cpu_int(cpu)->migration_thread) +#define migration_queue(cpu) (&cpu_int(cpu)->migration_queue) + +#if NR_CPUS > 1 +# define task_allowed(p, cpu) ((p)->cpus_allowed & (1UL << (cpu))) +#else +# define task_allowed(p, cpu) 1 +#endif + /* * Default context-switch locking: */ #ifndef prepare_arch_switch # define prepare_arch_switch(rq, next) do { } while(0) # define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock) -# define task_running(rq, p) ((rq)->curr == (p)) +# define task_running(p) (cpu_curr_ptr(task_cpu(p)) == (p)) #endif #ifdef CONFIG_NUMA @@ -322,16 +377,21 @@ * Also update all the scheduling statistics stuff. (sleep average * calculation, priority modifiers, etc.) */ +static inline void __activate_task(task_t *p, runqueue_t *rq) +{ + enqueue_task(p, rq->active); + nr_running_inc(rq); +} + static inline void activate_task(task_t *p, runqueue_t *rq) { - unsigned long sleep_time = jiffies - p->sleep_timestamp; - prio_array_t *array = rq->active; + unsigned long sleep_time = jiffies - p->last_run; if (!rt_task(p) && sleep_time) { /* * This code gives a bonus to interactive tasks. We update * an 'average sleep time' value here, based on - * sleep_timestamp. The more time a task spends sleeping, + * ->last_run. The more time a task spends sleeping, * the higher the average gets - and the higher the priority * boost gets as well. */ @@ -340,8 +400,7 @@ p->sleep_avg = MAX_SLEEP_AVG; p->prio = effective_prio(p); } - enqueue_task(p, array); - nr_running_inc(rq); + __activate_task(p, rq); } /* @@ -382,8 +441,18 @@ #endif } +static inline void resched_cpu(int cpu) +{ + resched_task(cpu_curr_ptr(cpu)); +} + #ifdef CONFIG_SMP +static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) +{ + p->thread_info->cpu = cpu; +} + /* * wait_task_inactive - wait for a thread to unschedule. 
* @@ -398,7 +467,7 @@ repeat: preempt_disable(); rq = task_rq(p); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { cpu_relax(); /* * enable/disable preemption just to make this @@ -409,7 +478,7 @@ goto repeat; } rq = task_rq_lock(p, &flags); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { task_rq_unlock(rq, &flags); preempt_enable(); goto repeat; @@ -417,6 +486,13 @@ task_rq_unlock(rq, &flags); preempt_enable(); } + +#else + +static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) +{ +} + #endif /* @@ -431,10 +507,39 @@ */ void kick_if_running(task_t * p) { - if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id())) + if ((task_running(p)) && (task_cpu(p) != smp_processor_id())) resched_task(p); } +static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p) +{ + cpu_t *curr_cpu; + task_t *curr; + int idx; + + if (idle_cpu(cpu)) + return resched_cpu(cpu); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + if (curr_cpu->idle == curr_cpu->curr) + return resched_cpu(curr_cpu->cpu); + } + + if (p->prio < cpu_curr_ptr(cpu)->prio) + return resched_task(cpu_curr_ptr(cpu)); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + curr = curr_cpu->curr; + if (p->prio < curr->prio) + return resched_task(curr); + } +} /*** * try_to_wake_up - wake up a thread * @p: the to-be-woken-up thread @@ -463,7 +568,7 @@ * Fast-migrate the task if it's not running or runnable * currently. Do not violate hard affinity. */ - if (unlikely(sync && !task_running(rq, p) && + if (unlikely(sync && !task_running(p) && (task_cpu(p) != smp_processor_id()) && (p->cpus_allowed & (1UL << smp_processor_id())))) { @@ -473,10 +578,12 @@ } if (old_state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible--; - activate_task(p, rq); - - if (p->prio < rq->curr->prio) - resched_task(rq->curr); + if (sync) + __activate_task(p, rq); + else { + activate_task(p, rq); + wake_up_cpu(rq, task_cpu(p), p); + } success = 1; } p->state = TASK_RUNNING; @@ -561,7 +668,7 @@ * context_switch - switch to the new MM and the new * thread's register state. */ -static inline task_t * context_switch(task_t *prev, task_t *next) +static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next) { struct mm_struct *mm = next->mm; struct mm_struct *oldmm = prev->active_mm; @@ -575,7 +682,7 @@ if (unlikely(!prev->mm)) { prev->active_mm = NULL; - mmdrop(oldmm); + rq->prev_mm = oldmm; } /* Here we just switch the register state and the stack. */ @@ -596,8 +703,9 @@ unsigned long i, sum = 0; for (i = 0; i < NR_CPUS; i++) - sum += cpu_rq(i)->nr_running; - + /* Shared runqueues are counted only once. */ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_running; return sum; } @@ -608,7 +716,9 @@ for (i = 0; i < NR_CPUS; i++) { if (!cpu_online(i)) continue; - sum += cpu_rq(i)->nr_uninterruptible; + /* Shared runqueues are counted only once. */ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_uninterruptible; } return sum; } @@ -790,7 +900,23 @@ #endif /* CONFIG_NUMA */ -#if CONFIG_SMP +/* + * One of the idle_cpu_tick() and busy_cpu_tick() functions will + * get called every timer tick, on every CPU. Our balancing action + * frequency and balancing agressivity depends on whether the CPU is + * idle or not. + * + * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * systems with HZ=100, every 10 msecs.) 
+ */ +#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) + +#if !CONFIG_SMP + +static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { } + +#else /* * double_lock_balance - lock the busiest runqueue @@ -906,12 +1032,7 @@ set_task_cpu(p, this_cpu); nr_running_inc(this_rq); enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); + wake_up_cpu(this_rq, this_cpu, p); } /* @@ -922,9 +1043,9 @@ * We call this with the current runqueue locked, * irqs disabled. */ -static void load_balance(runqueue_t *this_rq, int idle) +static void load_balance(runqueue_t *this_rq, int this_cpu, int idle) { - int imbalance, idx, this_cpu = smp_processor_id(); + int imbalance, idx; runqueue_t *busiest; prio_array_t *array; struct list_head *head, *curr; @@ -972,12 +1093,15 @@ * 1) running (obviously), or * 2) cannot be migrated to this CPU due to cpus_allowed, or * 3) are cache-hot on their current CPU. + * + * (except if we are in idle mode which is a more agressive + * form of rebalancing.) */ -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) +#define CAN_MIGRATE_TASK(p,rq,cpu) \ + (((idle && AGRESSIVE_IDLE_STEAL) || \ + (jiffies - (p)->last_run > cache_decay_ticks)) && \ + !task_running(p) && task_allowed(p, cpu)) curr = curr->prev; @@ -1000,31 +1124,129 @@ ; } +#if CONFIG_SHARE_RUNQUEUE +static void active_load_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + /* + * Any SMT-specific imbalance? + */ + for_each_sibling(idx, rq) + if (rq->cpu[idx].idle == rq->cpu[idx].curr) + continue; + + /* + * At this point it's sure that we have a SMT + * imbalance: this (physical) CPU is idle but + * another CPU has two (or more) tasks running. + * + * We wake up one of the migration threads (it + * doesnt matter which one) and let it fix things up: + */ + if (!cpu_active_balance(i)) { + cpu_active_balance(i) = 1; + spin_unlock(&this_rq->lock); + wake_up_process(rq->cpu[0].migration_thread); + spin_lock(&this_rq->lock); + } + } +} + +static void do_active_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + spin_unlock(&this_rq->lock); + + cpu_active_balance(this_cpu) = 0; + + /* + * Is the imbalance still present? + */ + for_each_sibling(idx, this_rq) + if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr) + goto out; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + + /* completely idle CPU? */ + if (rq->nr_running) + continue; + + /* + * At this point it's reasonably sure that we have an + * imbalance. Since we are the migration thread, try to + * balance a thread over to the target queue. + */ + spin_lock(&rq->lock); + load_balance(rq, i, 1); + spin_unlock(&rq->lock); + goto out; + } +out: + spin_lock(&this_rq->lock); +} + /* - * One of the idle_cpu_tick() and busy_cpu_tick() functions will - * get called every timer tick, on every CPU. Our balancing action - * frequency and balancing agressivity depends on whether the CPU is - * idle or not. + * This routine is called to map a CPU into another CPU's runqueue. * - * busy-rebalance every 250 msecs. 
idle-rebalance every 1 msec. (or on - * systems with HZ=100, every 10 msecs.) + * This must be called during bootup with the merged runqueue having + * no tasks. */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) -#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) +void sched_map_runqueue(int cpu1, int cpu2) +{ + runqueue_t *rq1 = cpu_rq(cpu1); + runqueue_t *rq2 = cpu_rq(cpu2); + int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx; + + BUG_ON(rq1 == rq2 || rq2->nr_running || rq_idx(cpu1) != cpu1); + /* + * At this point, we dont have anything in the runqueue yet. So, + * there is no need to move processes between the runqueues. + * Only, the idle processes should be combined and accessed + * properly. + */ + cpu2_idx = rq1->nr_cpus++; + + rq_idx(cpu2) = cpu1; + cpu_idx(cpu2) = cpu2_idx; + rq1->cpu[cpu2_idx].cpu = cpu2; + rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle; + rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr; + INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue); + + /* just to be safe: */ + rq2->cpu[cpu2_idx_orig].idle = NULL; + rq2->cpu[cpu2_idx_orig].curr = NULL; +} +#endif +#endif -static inline void idle_tick(runqueue_t *rq) +DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } }; + +static inline void idle_tick(runqueue_t *rq, unsigned long j) { - if (jiffies % IDLE_REBALANCE_TICK) + if (j % IDLE_REBALANCE_TICK) return; spin_lock(&rq->lock); - load_balance(rq, 1); + load_balance(rq, smp_processor_id(), 1); spin_unlock(&rq->lock); } -#endif - -DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } }; - /* * We place interactive tasks back into the active array, if possible. * @@ -1035,9 +1257,9 @@ * increasing number of running tasks: */ #define EXPIRED_STARVING(rq) \ - ((rq)->expired_timestamp && \ + (STARVATION_LIMIT && ((rq)->expired_timestamp && \ (jiffies - (rq)->expired_timestamp >= \ - STARVATION_LIMIT * ((rq)->nr_running) + 1)) + STARVATION_LIMIT * ((rq)->nr_running) + 1))) /* * This function gets called by the timer code, with HZ frequency. @@ -1050,12 +1272,13 @@ { int cpu = smp_processor_id(); runqueue_t *rq = this_rq(); + unsigned long j = jiffies; task_t *p = current; if (rcu_pending(cpu)) rcu_check_callbacks(cpu, user_ticks); - if (p == rq->idle) { + if (p == cpu_idle_ptr(cpu)) { /* note: this timer irq context must be accounted for as well */ if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET) kstat_cpu(cpu).cpustat.system += sys_ticks; @@ -1063,9 +1286,7 @@ kstat_cpu(cpu).cpustat.iowait += sys_ticks; else kstat_cpu(cpu).cpustat.idle += sys_ticks; -#if CONFIG_SMP - idle_tick(rq); -#endif + idle_tick(rq, j); return; } if (TASK_NICE(p) > 0) @@ -1074,12 +1295,13 @@ kstat_cpu(cpu).cpustat.user += user_ticks; kstat_cpu(cpu).cpustat.system += sys_ticks; + spin_lock(&rq->lock); /* Task might have expired already, but not scheduled off yet */ if (p->array != rq->active) { set_tsk_need_resched(p); + spin_unlock(&rq->lock); return; } - spin_lock(&rq->lock); if (unlikely(rt_task(p))) { /* * RR tasks need a special form of timeslice management. 
@@ -1115,16 +1337,14 @@ if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; + rq->expired_timestamp = j; enqueue_task(p, rq->expired); } else enqueue_task(p, rq->active); } out: -#if CONFIG_SMP - if (!(jiffies % BUSY_REBALANCE_TICK)) - load_balance(rq, 0); -#endif + if (!(j % BUSY_REBALANCE_TICK)) + load_balance(rq, smp_processor_id(), 0); spin_unlock(&rq->lock); } @@ -1135,11 +1355,11 @@ */ asmlinkage void schedule(void) { + int idx, this_cpu, retry = 0; + struct list_head *queue; task_t *prev, *next; - runqueue_t *rq; prio_array_t *array; - struct list_head *queue; - int idx; + runqueue_t *rq; /* * Test if we are atomic. Since do_exit() needs to call into @@ -1152,15 +1372,15 @@ dump_stack(); } } - - check_highmem_ptes(); need_resched: + check_highmem_ptes(); + this_cpu = smp_processor_id(); preempt_disable(); prev = current; rq = this_rq(); release_kernel_lock(prev); - prev->sleep_timestamp = jiffies; + prev->last_run = jiffies; spin_lock_irq(&rq->lock); /* @@ -1183,12 +1403,14 @@ } pick_next_task: if (unlikely(!rq->nr_running)) { -#if CONFIG_SMP - load_balance(rq, 1); + load_balance(rq, this_cpu, 1); if (rq->nr_running) goto pick_next_task; -#endif - next = rq->idle; + active_load_balance(rq, this_cpu); + if (rq->nr_running) + goto pick_next_task; +pick_idle: + next = cpu_idle_ptr(this_cpu); rq->expired_timestamp = 0; goto switch_tasks; } @@ -1204,24 +1426,60 @@ rq->expired_timestamp = 0; } +new_array: idx = sched_find_first_bit(array->bitmap); queue = array->queue + idx; next = list_entry(queue->next, task_t, run_list); + if ((next != prev) && (rq_nr_cpus(rq) > 1)) { + struct list_head *tmp = queue->next; + + while ((task_running(next) && (next != prev)) || !task_allowed(next, this_cpu)) { + tmp = tmp->next; + if (tmp != queue) { + next = list_entry(tmp, task_t, run_list); + continue; + } + idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx); + if (idx == MAX_PRIO) { + if (retry || !rq->expired->nr_active) { + goto pick_idle; + } + /* + * To avoid infinite changing of arrays, + * when we have only tasks runnable by + * sibling. 
+ */ + retry = 1; + + array = rq->expired; + goto new_array; + } + queue = array->queue + idx; + tmp = queue->next; + next = list_entry(tmp, task_t, run_list); + } + } switch_tasks: prefetch(next); clear_tsk_need_resched(prev); - RCU_qsctr(prev->thread_info->cpu)++; + RCU_qsctr(task_cpu(prev))++; if (likely(prev != next)) { + struct mm_struct *prev_mm; rq->nr_switches++; - rq->curr = next; + cpu_curr_ptr(this_cpu) = next; + set_task_cpu(next, this_cpu); prepare_arch_switch(rq, next); - prev = context_switch(prev, next); + prev = context_switch(rq, prev, next); barrier(); rq = this_rq(); + prev_mm = rq->prev_mm; + rq->prev_mm = NULL; finish_arch_switch(rq, prev); + if (prev_mm) + mmdrop(prev_mm); } else spin_unlock_irq(&rq->lock); @@ -1481,9 +1739,8 @@ * If the task is running and lowered its priority, * or increased its priority then reschedule its CPU: */ - if ((NICE_TO_PRIO(nice) < p->static_prio) || - task_running(rq, p)) - resched_task(rq->curr); + if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p)) + resched_task(cpu_curr_ptr(task_cpu(p))); } out_unlock: task_rq_unlock(rq, &flags); @@ -1561,7 +1818,7 @@ */ int task_curr(task_t *p) { - return cpu_curr(task_cpu(p)) == p; + return cpu_curr_ptr(task_cpu(p)) == p; } /** @@ -1570,7 +1827,7 @@ */ int idle_cpu(int cpu) { - return cpu_curr(cpu) == cpu_rq(cpu)->idle; + return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu); } /** @@ -1660,7 +1917,7 @@ else p->prio = p->static_prio; if (array) - activate_task(p, task_rq(p)); + __activate_task(p, task_rq(p)); out_unlock: task_rq_unlock(rq, &flags); @@ -2135,7 +2392,7 @@ local_irq_save(flags); double_rq_lock(idle_rq, rq); - idle_rq->curr = idle_rq->idle = idle; + cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle; deactivate_task(idle, rq); idle->array = NULL; idle->prio = MAX_PRIO; @@ -2190,6 +2447,7 @@ unsigned long flags; migration_req_t req; runqueue_t *rq; + int cpu; #if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */ new_mask &= cpu_online_map; @@ -2211,31 +2469,31 @@ * If the task is not on a runqueue (and not running), then * it is sufficient to simply update the task's cpu field. */ - if (!p->array && !task_running(rq, p)) { + if (!p->array && !task_running(p)) { set_task_cpu(p, __ffs(p->cpus_allowed)); task_rq_unlock(rq, &flags); return; } init_completion(&req.done); req.task = p; - list_add(&req.list, &rq->migration_queue); + cpu = task_cpu(p); + list_add(&req.list, migration_queue(cpu)); task_rq_unlock(rq, &flags); - - wake_up_process(rq->migration_thread); + wake_up_process(migration_thread(cpu)); wait_for_completion(&req.done); } /* - * migration_thread - this is a highprio system thread that performs + * migration_task - this is a highprio system thread that performs * thread migration by 'pulling' threads into the target runqueue. 
*/ -static int migration_thread(void * data) +static int migration_task(void * data) { struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; int cpu = (long) data; runqueue_t *rq; - int ret; + int ret, idx; daemonize(); sigfillset(¤t->blocked); @@ -2250,7 +2508,8 @@ ret = setscheduler(0, SCHED_FIFO, ¶m); rq = this_rq(); - rq->migration_thread = current; + migration_thread(cpu) = current; + idx = cpu_idx(cpu); sprintf(current->comm, "migration/%d", smp_processor_id()); @@ -2263,7 +2522,9 @@ task_t *p; spin_lock_irqsave(&rq->lock, flags); - head = &rq->migration_queue; + if (cpu_active_balance(cpu)) + do_active_balance(rq, cpu); + head = migration_queue(cpu); current->state = TASK_INTERRUPTIBLE; if (list_empty(head)) { spin_unlock_irqrestore(&rq->lock, flags); @@ -2292,9 +2553,8 @@ set_task_cpu(p, cpu_dest); if (p->array) { deactivate_task(p, rq_src); - activate_task(p, rq_dest); - if (p->prio < rq_dest->curr->prio) - resched_task(rq_dest->curr); + __activate_task(p, rq_dest); + wake_up_cpu(rq_dest, cpu_dest, p); } } double_rq_unlock(rq_src, rq_dest); @@ -2312,12 +2572,13 @@ unsigned long action, void *hcpu) { + long cpu = (long) hcpu; + switch (action) { case CPU_ONLINE: - printk("Starting migration thread for cpu %li\n", - (long)hcpu); - kernel_thread(migration_thread, hcpu, CLONE_KERNEL); - while (!cpu_rq((long)hcpu)->migration_thread) + printk("Starting migration thread for cpu %li\n", cpu); + kernel_thread(migration_task, hcpu, CLONE_KERNEL); + while (!migration_thread(cpu)) yield(); break; } @@ -2392,11 +2653,20 @@ for (i = 0; i < NR_CPUS; i++) { prio_array_t *array; + /* + * Start with a 1:1 mapping between CPUs and runqueues: + */ +#if CONFIG_SHARE_RUNQUEUE + rq_idx(i) = i; + cpu_idx(i) = 0; +#endif rq = cpu_rq(i); rq->active = rq->arrays; rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); - INIT_LIST_HEAD(&rq->migration_queue); + INIT_LIST_HEAD(migration_queue(i)); + rq->nr_cpus = 1; + rq->cpu[cpu_idx(i)].cpu = i; atomic_set(&rq->nr_iowait, 0); nr_running_init(rq); @@ -2414,9 +2684,9 @@ * We have to do a little magic to get the first * thread right in SMP mode. */ - rq = this_rq(); - rq->curr = current; - rq->idle = current; + cpu_curr_ptr(smp_processor_id()) = current; + cpu_idle_ptr(smp_processor_id()) = current; + set_task_cpu(current, smp_processor_id()); wake_up_process(current); --- linux/init/main.c.orig +++ linux/init/main.c @@ -354,7 +354,14 @@ static void rest_init(void) { + /* + * We count on the initial thread going ok + * Like idlers init is an unlocked kernel thread, which will + * make syscalls (and thus be locked). + */ + init_idle(current, smp_processor_id()); kernel_thread(init, NULL, CLONE_KERNEL); + unlock_kernel(); cpu_idle(); } @@ -438,13 +445,6 @@ check_bugs(); printk("POSIX conformance testing by UNIFIX\n"); - /* - * We count on the initial thread going ok - * Like idlers init is an unlocked kernel thread, which will - * make syscalls (and thus be locked). - */ - init_idle(current, smp_processor_id()); - /* Do the rest non-__init'ed, we're now alive */ rest_init(); } |
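For readers skimming the diff above, the core mechanism is one level of
indirection between a logical CPU number and its runqueue: rq_idx(cpu) says
which runqueue a CPU schedules out of, and cpu_idx(cpu) says which per-CPU
slot inside that runqueue it owns. The following stand-alone sketch
reproduces just that mapping in user space; the structure names, sizes and
the demo output are illustrative, not the kernel's:

#include <stdio.h>

#define NR_CPUS         4
#define MAX_NR_SIBLINGS 2

struct cpu_slot { int curr; };          /* stands in for cpu_t */
struct rq {
        int nr_cpus;
        struct cpu_slot cpu[MAX_NR_SIBLINGS];
};

static struct rq runqueues[NR_CPUS];
static long rq_idx_tab[NR_CPUS];        /* which rq a cpu maps to */
static long cpu_idx_tab[NR_CPUS];       /* the cpu's slot inside it */

#define cpu_rq(c)  (&runqueues[rq_idx_tab[(c)]])
#define cpu_int(c) (&cpu_rq(c)->cpu[cpu_idx_tab[(c)]])

/* mirrors sched_map_runqueue(): fold cpu2 into cpu1's runqueue */
static void map_runqueue(int cpu1, int cpu2)
{
        rq_idx_tab[cpu2] = cpu1;
        cpu_idx_tab[cpu2] = cpu_rq(cpu1)->nr_cpus++;
}

int main(void)
{
        int i;

        for (i = 0; i < NR_CPUS; i++) { /* 1:1 boot-time default */
                rq_idx_tab[i] = i;
                cpu_idx_tab[i] = 0;
                runqueues[i].nr_cpus = 1;
        }
        map_runqueue(0, 1);             /* pretend 0 and 1 are siblings */

        printf("cpu1 -> rq %ld, slot %ld, shares rq with cpu0: %d\n",
               rq_idx_tab[1], cpu_idx_tab[1], cpu_rq(0) == cpu_rq(1));
        return 0;
}

After the mapping, both siblings dispatch from runqueues[0] while keeping
private curr/idle slots, which is what sched_map_runqueue() arranges at
boot for CPUs in one physical package.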
From: Ingo M. <mi...@el...> - 2003-02-07 09:30:07
|
this is a quick followup patch - the migration threads should do an uninterruptible sleep. This, besides being correct, also works around another bug that is being worked on - setting current->blocked alone is not enough to inhibit SIGKILL being sent to a kernel thread. So this change fixes a lockup during boot. Ingo --- linux/include/linux/sched.h.orig +++ linux/include/linux/sched.h @@ -147,6 +147,24 @@ extern void sched_init(void); extern void init_idle(task_t *idle, int cpu); +/* + * Is there a way to do this via Kconfig? + */ +#if CONFIG_NR_SIBLINGS_2 +# define CONFIG_NR_SIBLINGS 2 +#elif CONFIG_NR_SIBLINGS_4 +# define CONFIG_NR_SIBLINGS 4 +#else +# define CONFIG_NR_SIBLINGS 0 +#endif + +#if CONFIG_NR_SIBLINGS +# define CONFIG_SHARE_RUNQUEUE 1 +#else +# define CONFIG_SHARE_RUNQUEUE 0 +#endif +extern void sched_map_runqueue(int cpu1, int cpu2); + extern void show_state(void); extern void show_trace(unsigned long *stack); extern void show_stack(unsigned long *stack); @@ -298,7 +316,7 @@ prio_array_t *array; unsigned long sleep_avg; - unsigned long sleep_timestamp; + unsigned long last_run; unsigned long policy; unsigned long cpus_allowed; @@ -778,11 +796,6 @@ return p->thread_info->cpu; } -static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) -{ - p->thread_info->cpu = cpu; -} - #else static inline unsigned int task_cpu(struct task_struct *p) @@ -790,10 +803,6 @@ return 0; } -static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) -{ -} - #endif /* CONFIG_SMP */ #endif /* __KERNEL__ */ --- linux/include/asm-i386/apic.h.orig +++ linux/include/asm-i386/apic.h @@ -98,4 +98,6 @@ #endif /* CONFIG_X86_LOCAL_APIC */ +extern int phys_proc_id[NR_CPUS]; + #endif /* __ASM_APIC_H */ --- linux/arch/i386/kernel/cpu/proc.c.orig +++ linux/arch/i386/kernel/cpu/proc.c @@ -1,4 +1,5 @@ #include <linux/smp.h> +#include <linux/sched.h> #include <linux/timex.h> #include <linux/string.h> #include <asm/semaphore.h> @@ -101,6 +102,13 @@ fpu_exception ? "yes" : "no", c->cpuid_level, c->wp_works_ok ? "yes" : "no"); +#if CONFIG_SHARE_RUNQUEUE +{ + extern long __rq_idx[NR_CPUS]; + + seq_printf(m, "\nrunqueue\t: %ld\n", __rq_idx[n]); +} +#endif for ( i = 0 ; i < 32*NCAPINTS ; i++ ) if ( test_bit(i, c->x86_capability) && --- linux/arch/i386/kernel/smpboot.c.orig +++ linux/arch/i386/kernel/smpboot.c @@ -38,6 +38,7 @@ #include <linux/kernel.h> #include <linux/mm.h> +#include <linux/sched.h> #include <linux/kernel_stat.h> #include <linux/smp_lock.h> #include <linux/irq.h> @@ -945,6 +946,16 @@ int cpu_sibling_map[NR_CPUS] __cacheline_aligned; +static int test_ht; + +static int __init ht_setup(char *str) +{ + test_ht = 1; + return 1; +} + +__setup("test_ht", ht_setup); + static void __init smp_boot_cpus(unsigned int max_cpus) { int apicid, cpu, bit; @@ -1087,7 +1098,20 @@ Dprintk("Boot done.\n"); /* - * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so + * Here we can be sure that there is an IO-APIC in the system. Let's + * go and set it up: + */ + smpboot_setup_io_apic(); + + setup_boot_APIC_clock(); + + /* + * Synchronize the TSC with the AP + */ + if (cpu_has_tsc && cpucount) + synchronize_tsc_bp(); + /* + * If Hyper-Threading is available, construct cpu_sibling_map[], so * that we can tell the sibling CPU efficiently. 
*/ if (cpu_has_ht && smp_num_siblings > 1) { @@ -1096,7 +1120,8 @@ for (cpu = 0; cpu < NR_CPUS; cpu++) { int i; - if (!test_bit(cpu, &cpu_callout_map)) continue; + if (!test_bit(cpu, &cpu_callout_map)) + continue; for (i = 0; i < NR_CPUS; i++) { if (i == cpu || !test_bit(i, &cpu_callout_map)) @@ -1112,17 +1137,41 @@ printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu); } } - } - - smpboot_setup_io_apic(); - - setup_boot_APIC_clock(); +#if CONFIG_SHARE_RUNQUEUE + /* + * At this point APs would have synchronised TSC and + * waiting for smp_commenced, with their APIC timer + * disabled. So BP can go ahead do some initialization + * for Hyper-Threading (if needed). + */ + for (cpu = 0; cpu < NR_CPUS; cpu++) { + int i; + if (!test_bit(cpu, &cpu_callout_map)) + continue; + for (i = 0; i < NR_CPUS; i++) { + if (i <= cpu) + continue; + if (!test_bit(i, &cpu_callout_map)) + continue; - /* - * Synchronize the TSC with the AP - */ - if (cpu_has_tsc && cpucount) - synchronize_tsc_bp(); + if (phys_proc_id[cpu] != phys_proc_id[i]) + continue; + /* + * merge runqueues, resulting in one + * runqueue per package: + */ + sched_map_runqueue(cpu, i); + break; + } + } +#endif + } +#if CONFIG_SHARE_RUNQUEUE + if (smp_num_siblings == 1 && test_ht) { + printk("Simulating shared runqueues - mapping CPU#1's runqueue to CPU#0's runqueue.\n"); + sched_map_runqueue(0, 1); + } +#endif } /* These are wrappers to interface to the new boot process. Someone --- linux/arch/i386/Kconfig.orig +++ linux/arch/i386/Kconfig @@ -408,6 +408,24 @@ If you don't know what to do here, say N. +choice + + prompt "Hyperthreading Support" + depends on SMP + default NR_SIBLINGS_0 + +config NR_SIBLINGS_0 + bool "off" + +config NR_SIBLINGS_2 + bool "2 siblings" + +config NR_SIBLINGS_4 + bool "4 siblings" + +endchoice + + config PREEMPT bool "Preemptible Kernel" help --- linux/kernel/fork.c.orig +++ linux/kernel/fork.c @@ -878,7 +878,7 @@ */ p->first_time_slice = 1; current->time_slice >>= 1; - p->sleep_timestamp = jiffies; + p->last_run = jiffies; if (!current->time_slice) { /* * This case is rare, it happens when the parent has only --- linux/kernel/sched.c.orig +++ linux/kernel/sched.c @@ -67,6 +67,7 @@ #define INTERACTIVE_DELTA 2 #define MAX_SLEEP_AVG (2*HZ) #define STARVATION_LIMIT (2*HZ) +#define AGRESSIVE_IDLE_STEAL 1 #define NODE_THRESHOLD 125 /* @@ -141,6 +142,48 @@ }; /* + * It's possible for two CPUs to share the same runqueue. + * This makes sense if they eg. share caches. + * + * We take the common 1:1 (SMP, UP) case and optimize it, + * the rest goes via remapping: rq_idx(cpu) gives the + * runqueue on which a particular cpu is on, cpu_idx(cpu) + * gives the rq-specific index of the cpu. + * + * (Note that the generic scheduler code does not impose any + * restrictions on the mappings - there can be 4 CPUs per + * runqueue or even assymetric mappings.) 
+ */ +#if CONFIG_SHARE_RUNQUEUE +# define MAX_NR_SIBLINGS CONFIG_NR_SIBLINGS + long __rq_idx[NR_CPUS] __cacheline_aligned; + static long __cpu_idx[NR_CPUS] __cacheline_aligned; +# define rq_idx(cpu) (__rq_idx[(cpu)]) +# define cpu_idx(cpu) (__cpu_idx[(cpu)]) +# define for_each_sibling(idx, rq) \ + for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++) +# define rq_nr_cpus(rq) ((rq)->nr_cpus) +# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance) +#else +# define MAX_NR_SIBLINGS 1 +# define rq_idx(cpu) (cpu) +# define cpu_idx(cpu) 0 +# define for_each_sibling(idx, rq) while (0) +# define cpu_active_balance(c) 0 +# define do_active_balance(rq, cpu) do { } while (0) +# define rq_nr_cpus(rq) 1 + static inline void active_load_balance(runqueue_t *rq, int this_cpu) { } +#endif + +typedef struct cpu_s { + task_t *curr, *idle; + task_t *migration_thread; + struct list_head migration_queue; + int active_balance; + int cpu; +} cpu_t; + +/* * This is the main, per-CPU runqueue data structure. * * Locking rule: those places that want to lock multiple runqueues @@ -151,7 +194,7 @@ spinlock_t lock; unsigned long nr_running, nr_switches, expired_timestamp, nr_uninterruptible; - task_t *curr, *idle; + struct mm_struct *prev_mm; prio_array_t *active, *expired, arrays[2]; int prev_nr_running[NR_CPUS]; #ifdef CONFIG_NUMA @@ -159,27 +202,39 @@ unsigned int nr_balanced; int prev_node_load[MAX_NUMNODES]; #endif - task_t *migration_thread; - struct list_head migration_queue; + int nr_cpus; + cpu_t cpu[MAX_NR_SIBLINGS]; atomic_t nr_iowait; } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; -#define cpu_rq(cpu) (runqueues + (cpu)) +#define cpu_rq(cpu) (runqueues + (rq_idx(cpu))) +#define cpu_int(c) ((cpu_rq(c))->cpu + cpu_idx(c)) +#define cpu_curr_ptr(cpu) (cpu_int(cpu)->curr) +#define cpu_idle_ptr(cpu) (cpu_int(cpu)->idle) + #define this_rq() cpu_rq(smp_processor_id()) #define task_rq(p) cpu_rq(task_cpu(p)) -#define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define rt_task(p) ((p)->prio < MAX_RT_PRIO) +#define migration_thread(cpu) (cpu_int(cpu)->migration_thread) +#define migration_queue(cpu) (&cpu_int(cpu)->migration_queue) + +#if NR_CPUS > 1 +# define task_allowed(p, cpu) ((p)->cpus_allowed & (1UL << (cpu))) +#else +# define task_allowed(p, cpu) 1 +#endif + /* * Default context-switch locking: */ #ifndef prepare_arch_switch # define prepare_arch_switch(rq, next) do { } while(0) # define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock) -# define task_running(rq, p) ((rq)->curr == (p)) +# define task_running(p) (cpu_curr_ptr(task_cpu(p)) == (p)) #endif #ifdef CONFIG_NUMA @@ -322,16 +377,21 @@ * Also update all the scheduling statistics stuff. (sleep average * calculation, priority modifiers, etc.) */ +static inline void __activate_task(task_t *p, runqueue_t *rq) +{ + enqueue_task(p, rq->active); + nr_running_inc(rq); +} + static inline void activate_task(task_t *p, runqueue_t *rq) { - unsigned long sleep_time = jiffies - p->sleep_timestamp; - prio_array_t *array = rq->active; + unsigned long sleep_time = jiffies - p->last_run; if (!rt_task(p) && sleep_time) { /* * This code gives a bonus to interactive tasks. We update * an 'average sleep time' value here, based on - * sleep_timestamp. The more time a task spends sleeping, + * ->last_run. The more time a task spends sleeping, * the higher the average gets - and the higher the priority * boost gets as well. 
*/ @@ -340,8 +400,7 @@ p->sleep_avg = MAX_SLEEP_AVG; p->prio = effective_prio(p); } - enqueue_task(p, array); - nr_running_inc(rq); + __activate_task(p, rq); } /* @@ -382,8 +441,18 @@ #endif } +static inline void resched_cpu(int cpu) +{ + resched_task(cpu_curr_ptr(cpu)); +} + #ifdef CONFIG_SMP +static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) +{ + p->thread_info->cpu = cpu; +} + /* * wait_task_inactive - wait for a thread to unschedule. * @@ -398,7 +467,7 @@ repeat: preempt_disable(); rq = task_rq(p); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { cpu_relax(); /* * enable/disable preemption just to make this @@ -409,7 +478,7 @@ goto repeat; } rq = task_rq_lock(p, &flags); - if (unlikely(task_running(rq, p))) { + if (unlikely(task_running(p))) { task_rq_unlock(rq, &flags); preempt_enable(); goto repeat; @@ -417,6 +486,13 @@ task_rq_unlock(rq, &flags); preempt_enable(); } + +#else + +static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) +{ +} + #endif /* @@ -431,10 +507,39 @@ */ void kick_if_running(task_t * p) { - if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id())) + if ((task_running(p)) && (task_cpu(p) != smp_processor_id())) resched_task(p); } +static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p) +{ + cpu_t *curr_cpu; + task_t *curr; + int idx; + + if (idle_cpu(cpu)) + return resched_cpu(cpu); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + if (curr_cpu->idle == curr_cpu->curr) + return resched_cpu(curr_cpu->cpu); + } + + if (p->prio < cpu_curr_ptr(cpu)->prio) + return resched_task(cpu_curr_ptr(cpu)); + + for_each_sibling(idx, rq) { + curr_cpu = rq->cpu + idx; + if (!task_allowed(p, curr_cpu->cpu)) + continue; + curr = curr_cpu->curr; + if (p->prio < curr->prio) + return resched_task(curr); + } +} /*** * try_to_wake_up - wake up a thread * @p: the to-be-woken-up thread @@ -463,7 +568,7 @@ * Fast-migrate the task if it's not running or runnable * currently. Do not violate hard affinity. */ - if (unlikely(sync && !task_running(rq, p) && + if (unlikely(sync && !task_running(p) && (task_cpu(p) != smp_processor_id()) && (p->cpus_allowed & (1UL << smp_processor_id())))) { @@ -473,10 +578,12 @@ } if (old_state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible--; - activate_task(p, rq); - - if (p->prio < rq->curr->prio) - resched_task(rq->curr); + if (sync) + __activate_task(p, rq); + else { + activate_task(p, rq); + wake_up_cpu(rq, task_cpu(p), p); + } success = 1; } p->state = TASK_RUNNING; @@ -561,7 +668,7 @@ * context_switch - switch to the new MM and the new * thread's register state. */ -static inline task_t * context_switch(task_t *prev, task_t *next) +static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next) { struct mm_struct *mm = next->mm; struct mm_struct *oldmm = prev->active_mm; @@ -575,7 +682,7 @@ if (unlikely(!prev->mm)) { prev->active_mm = NULL; - mmdrop(oldmm); + rq->prev_mm = oldmm; } /* Here we just switch the register state and the stack. */ @@ -596,8 +703,9 @@ unsigned long i, sum = 0; for (i = 0; i < NR_CPUS; i++) - sum += cpu_rq(i)->nr_running; - + /* Shared runqueues are counted only once. */ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_running; return sum; } @@ -608,7 +716,9 @@ for (i = 0; i < NR_CPUS; i++) { if (!cpu_online(i)) continue; - sum += cpu_rq(i)->nr_uninterruptible; + /* Shared runqueues are counted only once. 
*/ + if (!cpu_idx(i)) + sum += cpu_rq(i)->nr_uninterruptible; } return sum; } @@ -790,7 +900,23 @@ #endif /* CONFIG_NUMA */ -#if CONFIG_SMP +/* + * One of the idle_cpu_tick() and busy_cpu_tick() functions will + * get called every timer tick, on every CPU. Our balancing action + * frequency and balancing agressivity depends on whether the CPU is + * idle or not. + * + * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on + * systems with HZ=100, every 10 msecs.) + */ +#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) +#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) + +#if !CONFIG_SMP + +static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { } + +#else /* * double_lock_balance - lock the busiest runqueue @@ -906,12 +1032,7 @@ set_task_cpu(p, this_cpu); nr_running_inc(this_rq); enqueue_task(p, this_rq->active); - /* - * Note that idle threads have a prio of MAX_PRIO, for this test - * to be always true for them. - */ - if (p->prio < this_rq->curr->prio) - set_need_resched(); + wake_up_cpu(this_rq, this_cpu, p); } /* @@ -922,9 +1043,9 @@ * We call this with the current runqueue locked, * irqs disabled. */ -static void load_balance(runqueue_t *this_rq, int idle) +static void load_balance(runqueue_t *this_rq, int this_cpu, int idle) { - int imbalance, idx, this_cpu = smp_processor_id(); + int imbalance, idx; runqueue_t *busiest; prio_array_t *array; struct list_head *head, *curr; @@ -972,12 +1093,15 @@ * 1) running (obviously), or * 2) cannot be migrated to this CPU due to cpus_allowed, or * 3) are cache-hot on their current CPU. + * + * (except if we are in idle mode which is a more agressive + * form of rebalancing.) */ -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \ - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ - !task_running(rq, p) && \ - ((p)->cpus_allowed & (1UL << (this_cpu)))) +#define CAN_MIGRATE_TASK(p,rq,cpu) \ + (((idle && AGRESSIVE_IDLE_STEAL) || \ + (jiffies - (p)->last_run > cache_decay_ticks)) && \ + !task_running(p) && task_allowed(p, cpu)) curr = curr->prev; @@ -1000,31 +1124,129 @@ ; } +#if CONFIG_SHARE_RUNQUEUE +static void active_load_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + /* + * Any SMT-specific imbalance? + */ + for_each_sibling(idx, rq) + if (rq->cpu[idx].idle == rq->cpu[idx].curr) + continue; + + /* + * At this point it's sure that we have a SMT + * imbalance: this (physical) CPU is idle but + * another CPU has two (or more) tasks running. + * + * We wake up one of the migration threads (it + * doesnt matter which one) and let it fix things up: + */ + if (!cpu_active_balance(i)) { + cpu_active_balance(i) = 1; + spin_unlock(&this_rq->lock); + wake_up_process(rq->cpu[0].migration_thread); + spin_lock(&this_rq->lock); + } + } +} + +static void do_active_balance(runqueue_t *this_rq, int this_cpu) +{ + runqueue_t *rq; + int i, idx; + + spin_unlock(&this_rq->lock); + + cpu_active_balance(this_cpu) = 0; + + /* + * Is the imbalance still present? + */ + for_each_sibling(idx, this_rq) + if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr) + goto out; + + for (i = 0; i < NR_CPUS; i++) { + if (!cpu_online(i)) + continue; + rq = cpu_rq(i); + if (rq == this_rq) + continue; + + /* completely idle CPU? */ + if (rq->nr_running) + continue; + + /* + * At this point it's reasonably sure that we have an + * imbalance. 
Since we are the migration thread, try to + * balance a thread over to the target queue. + */ + spin_lock(&rq->lock); + load_balance(rq, i, 1); + spin_unlock(&rq->lock); + goto out; + } +out: + spin_lock(&this_rq->lock); +} + /* - * One of the idle_cpu_tick() and busy_cpu_tick() functions will - * get called every timer tick, on every CPU. Our balancing action - * frequency and balancing agressivity depends on whether the CPU is - * idle or not. + * This routine is called to map a CPU into another CPU's runqueue. * - * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on - * systems with HZ=100, every 10 msecs.) + * This must be called during bootup with the merged runqueue having + * no tasks. */ -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1) -#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) +void sched_map_runqueue(int cpu1, int cpu2) +{ + runqueue_t *rq1 = cpu_rq(cpu1); + runqueue_t *rq2 = cpu_rq(cpu2); + int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx; + + BUG_ON(rq1 == rq2 || rq2->nr_running || rq_idx(cpu1) != cpu1); + /* + * At this point, we dont have anything in the runqueue yet. So, + * there is no need to move processes between the runqueues. + * Only, the idle processes should be combined and accessed + * properly. + */ + cpu2_idx = rq1->nr_cpus++; + + rq_idx(cpu2) = cpu1; + cpu_idx(cpu2) = cpu2_idx; + rq1->cpu[cpu2_idx].cpu = cpu2; + rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle; + rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr; + INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue); + + /* just to be safe: */ + rq2->cpu[cpu2_idx_orig].idle = NULL; + rq2->cpu[cpu2_idx_orig].curr = NULL; +} +#endif +#endif -static inline void idle_tick(runqueue_t *rq) +DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } }; + +static inline void idle_tick(runqueue_t *rq, unsigned long j) { - if (jiffies % IDLE_REBALANCE_TICK) + if (j % IDLE_REBALANCE_TICK) return; spin_lock(&rq->lock); - load_balance(rq, 1); + load_balance(rq, smp_processor_id(), 1); spin_unlock(&rq->lock); } -#endif - -DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } }; - /* * We place interactive tasks back into the active array, if possible. * @@ -1035,9 +1257,9 @@ * increasing number of running tasks: */ #define EXPIRED_STARVING(rq) \ - ((rq)->expired_timestamp && \ + (STARVATION_LIMIT && ((rq)->expired_timestamp && \ (jiffies - (rq)->expired_timestamp >= \ - STARVATION_LIMIT * ((rq)->nr_running) + 1)) + STARVATION_LIMIT * ((rq)->nr_running) + 1))) /* * This function gets called by the timer code, with HZ frequency. 
@@ -1050,12 +1272,13 @@ { int cpu = smp_processor_id(); runqueue_t *rq = this_rq(); + unsigned long j = jiffies; task_t *p = current; if (rcu_pending(cpu)) rcu_check_callbacks(cpu, user_ticks); - if (p == rq->idle) { + if (p == cpu_idle_ptr(cpu)) { /* note: this timer irq context must be accounted for as well */ if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET) kstat_cpu(cpu).cpustat.system += sys_ticks; @@ -1063,9 +1286,7 @@ kstat_cpu(cpu).cpustat.iowait += sys_ticks; else kstat_cpu(cpu).cpustat.idle += sys_ticks; -#if CONFIG_SMP - idle_tick(rq); -#endif + idle_tick(rq, j); return; } if (TASK_NICE(p) > 0) @@ -1074,12 +1295,13 @@ kstat_cpu(cpu).cpustat.user += user_ticks; kstat_cpu(cpu).cpustat.system += sys_ticks; + spin_lock(&rq->lock); /* Task might have expired already, but not scheduled off yet */ if (p->array != rq->active) { set_tsk_need_resched(p); + spin_unlock(&rq->lock); return; } - spin_lock(&rq->lock); if (unlikely(rt_task(p))) { /* * RR tasks need a special form of timeslice management. @@ -1115,16 +1337,14 @@ if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) { if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; + rq->expired_timestamp = j; enqueue_task(p, rq->expired); } else enqueue_task(p, rq->active); } out: -#if CONFIG_SMP - if (!(jiffies % BUSY_REBALANCE_TICK)) - load_balance(rq, 0); -#endif + if (!(j % BUSY_REBALANCE_TICK)) + load_balance(rq, smp_processor_id(), 0); spin_unlock(&rq->lock); } @@ -1135,11 +1355,11 @@ */ asmlinkage void schedule(void) { + int idx, this_cpu, retry = 0; + struct list_head *queue; task_t *prev, *next; - runqueue_t *rq; prio_array_t *array; - struct list_head *queue; - int idx; + runqueue_t *rq; /* * Test if we are atomic. Since do_exit() needs to call into @@ -1152,15 +1372,15 @@ dump_stack(); } } - - check_highmem_ptes(); need_resched: + check_highmem_ptes(); + this_cpu = smp_processor_id(); preempt_disable(); prev = current; rq = this_rq(); release_kernel_lock(prev); - prev->sleep_timestamp = jiffies; + prev->last_run = jiffies; spin_lock_irq(&rq->lock); /* @@ -1183,12 +1403,14 @@ } pick_next_task: if (unlikely(!rq->nr_running)) { -#if CONFIG_SMP - load_balance(rq, 1); + load_balance(rq, this_cpu, 1); if (rq->nr_running) goto pick_next_task; -#endif - next = rq->idle; + active_load_balance(rq, this_cpu); + if (rq->nr_running) + goto pick_next_task; +pick_idle: + next = cpu_idle_ptr(this_cpu); rq->expired_timestamp = 0; goto switch_tasks; } @@ -1204,24 +1426,60 @@ rq->expired_timestamp = 0; } +new_array: idx = sched_find_first_bit(array->bitmap); queue = array->queue + idx; next = list_entry(queue->next, task_t, run_list); + if ((next != prev) && (rq_nr_cpus(rq) > 1)) { + struct list_head *tmp = queue->next; + + while ((task_running(next) && (next != prev)) || !task_allowed(next, this_cpu)) { + tmp = tmp->next; + if (tmp != queue) { + next = list_entry(tmp, task_t, run_list); + continue; + } + idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx); + if (idx == MAX_PRIO) { + if (retry || !rq->expired->nr_active) { + goto pick_idle; + } + /* + * To avoid infinite changing of arrays, + * when we have only tasks runnable by + * sibling. 
+ */ + retry = 1; + + array = rq->expired; + goto new_array; + } + queue = array->queue + idx; + tmp = queue->next; + next = list_entry(tmp, task_t, run_list); + } + } switch_tasks: prefetch(next); clear_tsk_need_resched(prev); - RCU_qsctr(prev->thread_info->cpu)++; + RCU_qsctr(task_cpu(prev))++; if (likely(prev != next)) { + struct mm_struct *prev_mm; rq->nr_switches++; - rq->curr = next; + cpu_curr_ptr(this_cpu) = next; + set_task_cpu(next, this_cpu); prepare_arch_switch(rq, next); - prev = context_switch(prev, next); + prev = context_switch(rq, prev, next); barrier(); rq = this_rq(); + prev_mm = rq->prev_mm; + rq->prev_mm = NULL; finish_arch_switch(rq, prev); + if (prev_mm) + mmdrop(prev_mm); } else spin_unlock_irq(&rq->lock); @@ -1481,9 +1739,8 @@ * If the task is running and lowered its priority, * or increased its priority then reschedule its CPU: */ - if ((NICE_TO_PRIO(nice) < p->static_prio) || - task_running(rq, p)) - resched_task(rq->curr); + if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p)) + resched_task(cpu_curr_ptr(task_cpu(p))); } out_unlock: task_rq_unlock(rq, &flags); @@ -1561,7 +1818,7 @@ */ int task_curr(task_t *p) { - return cpu_curr(task_cpu(p)) == p; + return cpu_curr_ptr(task_cpu(p)) == p; } /** @@ -1570,7 +1827,7 @@ */ int idle_cpu(int cpu) { - return cpu_curr(cpu) == cpu_rq(cpu)->idle; + return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu); } /** @@ -1660,7 +1917,7 @@ else p->prio = p->static_prio; if (array) - activate_task(p, task_rq(p)); + __activate_task(p, task_rq(p)); out_unlock: task_rq_unlock(rq, &flags); @@ -2135,7 +2392,7 @@ local_irq_save(flags); double_rq_lock(idle_rq, rq); - idle_rq->curr = idle_rq->idle = idle; + cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle; deactivate_task(idle, rq); idle->array = NULL; idle->prio = MAX_PRIO; @@ -2190,6 +2447,7 @@ unsigned long flags; migration_req_t req; runqueue_t *rq; + int cpu; #if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */ new_mask &= cpu_online_map; @@ -2211,31 +2469,31 @@ * If the task is not on a runqueue (and not running), then * it is sufficient to simply update the task's cpu field. */ - if (!p->array && !task_running(rq, p)) { + if (!p->array && !task_running(p)) { set_task_cpu(p, __ffs(p->cpus_allowed)); task_rq_unlock(rq, &flags); return; } init_completion(&req.done); req.task = p; - list_add(&req.list, &rq->migration_queue); + cpu = task_cpu(p); + list_add(&req.list, migration_queue(cpu)); task_rq_unlock(rq, &flags); - - wake_up_process(rq->migration_thread); + wake_up_process(migration_thread(cpu)); wait_for_completion(&req.done); } /* - * migration_thread - this is a highprio system thread that performs + * migration_task - this is a highprio system thread that performs * thread migration by 'pulling' threads into the target runqueue. 
*/ -static int migration_thread(void * data) +static int migration_task(void * data) { struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 }; int cpu = (long) data; runqueue_t *rq; - int ret; + int ret, idx; daemonize(); sigfillset(¤t->blocked); @@ -2250,7 +2508,8 @@ ret = setscheduler(0, SCHED_FIFO, ¶m); rq = this_rq(); - rq->migration_thread = current; + migration_thread(cpu) = current; + idx = cpu_idx(cpu); sprintf(current->comm, "migration/%d", smp_processor_id()); @@ -2263,8 +2522,10 @@ task_t *p; spin_lock_irqsave(&rq->lock, flags); - head = &rq->migration_queue; - current->state = TASK_INTERRUPTIBLE; + if (cpu_active_balance(cpu)) + do_active_balance(rq, cpu); + head = migration_queue(cpu); + current->state = TASK_UNINTERRUPTIBLE; if (list_empty(head)) { spin_unlock_irqrestore(&rq->lock, flags); schedule(); @@ -2292,9 +2553,8 @@ set_task_cpu(p, cpu_dest); if (p->array) { deactivate_task(p, rq_src); - activate_task(p, rq_dest); - if (p->prio < rq_dest->curr->prio) - resched_task(rq_dest->curr); + __activate_task(p, rq_dest); + wake_up_cpu(rq_dest, cpu_dest, p); } } double_rq_unlock(rq_src, rq_dest); @@ -2312,12 +2572,13 @@ unsigned long action, void *hcpu) { + long cpu = (long) hcpu; + switch (action) { case CPU_ONLINE: - printk("Starting migration thread for cpu %li\n", - (long)hcpu); - kernel_thread(migration_thread, hcpu, CLONE_KERNEL); - while (!cpu_rq((long)hcpu)->migration_thread) + printk("Starting migration thread for cpu %li\n", cpu); + kernel_thread(migration_task, hcpu, CLONE_KERNEL); + while (!migration_thread(cpu)) yield(); break; } @@ -2392,11 +2653,20 @@ for (i = 0; i < NR_CPUS; i++) { prio_array_t *array; + /* + * Start with a 1:1 mapping between CPUs and runqueues: + */ +#if CONFIG_SHARE_RUNQUEUE + rq_idx(i) = i; + cpu_idx(i) = 0; +#endif rq = cpu_rq(i); rq->active = rq->arrays; rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); - INIT_LIST_HEAD(&rq->migration_queue); + INIT_LIST_HEAD(migration_queue(i)); + rq->nr_cpus = 1; + rq->cpu[cpu_idx(i)].cpu = i; atomic_set(&rq->nr_iowait, 0); nr_running_init(rq); @@ -2414,9 +2684,9 @@ * We have to do a little magic to get the first * thread right in SMP mode. */ - rq = this_rq(); - rq->curr = current; - rq->idle = current; + cpu_curr_ptr(smp_processor_id()) = current; + cpu_idle_ptr(smp_processor_id()) = current; + set_task_cpu(current, smp_processor_id()); wake_up_process(current); --- linux/init/main.c.orig +++ linux/init/main.c @@ -354,7 +354,14 @@ static void rest_init(void) { + /* + * We count on the initial thread going ok + * Like idlers init is an unlocked kernel thread, which will + * make syscalls (and thus be locked). + */ + init_idle(current, smp_processor_id()); kernel_thread(init, NULL, CLONE_KERNEL); + unlock_kernel(); cpu_idle(); } @@ -438,13 +445,6 @@ check_bugs(); printk("POSIX conformance testing by UNIFIX\n"); - /* - * We count on the initial thread going ok - * Like idlers init is an unlocked kernel thread, which will - * make syscalls (and thus be locked). - */ - init_idle(current, smp_processor_id()); - /* Do the rest non-__init'ed, we're now alive */ rest_init(); } |
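The interesting detail in this followup is the ordering in the migration
thread's main loop: the thread marks itself asleep and re-checks the queue
under the runqueue lock before dropping the lock and calling schedule(), so
a request queued in between cannot be lost; the switch from
TASK_INTERRUPTIBLE to TASK_UNINTERRUPTIBLE additionally keeps signals such
as SIGKILL from aborting the sleep. Below is a user-space analogue of the
loop, with a mutex/condvar standing in for the runqueue lock plus
schedule(); names are invented, and a condvar cannot model the signal
aspect of the fix:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  kick = PTHREAD_COND_INITIALIZER;
static int queued;      /* stands in for the per-CPU migration queue */
static int done;        /* shutdown flag, demo only */

static void *migration_worker(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!done || queued) {
                if (!queued) {          /* list_empty(head)? */
                        /* sleeps and releases the lock atomically,
                         * like set-state + unlock + schedule() */
                        pthread_cond_wait(&kick, &lock);
                        continue;
                }
                queued--;               /* service one request */
                printf("migration_worker: moved one task\n");
        }
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, migration_worker, NULL);

        pthread_mutex_lock(&lock);      /* task_rq_lock() side */
        queued++;                       /* list_add(&req.list, ...) */
        done = 1;                       /* demo only: stop after this */
        pthread_cond_signal(&kick);     /* wake_up_process() */
        pthread_mutex_unlock(&lock);

        pthread_join(tid, NULL);
        return 0;
}

Because the queue is always re-examined with the lock held, it does not
matter whether the requester or the worker gets the lock first; the wakeup
is never lost, which is the same guarantee the kernel loop relies on.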
From: Michael H. <hoh...@us...> - 2003-02-07 21:51:37
|
On Fri, 2003-02-07 at 00:38, Ingo Molnar wrote:
>
> the attached patch is against current BK. I tested it on the following 6
> kernel/hardware combinations:
>
>  (HT SMP kernel, SMP kernel, UP kernel) X (non-HT SMP box, HT SMP box)
>
> all kernels compiled & booted fine.
>
> 	Ingo
>

Tested on a 4 node (16 x 700 MHz P3 processors, 16 GB memory) NUMA-Q.
Works fine. Ran kernbench and schedbench (aka numatest) on this. Also
tested the E6 variant of this patch. Both show similar results to the
stock kernel, except for light loads, where the stock kernel tends to
spread processes more evenly across nodes.

On Tue, 2003-02-04 at 10:48, Erich Focht suggested the following to
address this light load case:

> > +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> > + (((idle && AGRESSIVE_IDLE_STEAL) || \
> > + (jiffies - (p)->last_run > cache_decay_ticks)) && \
> > + !task_running(p) && task_allowed(p, cpu))
>
> this is still either all or nothing. I like the idle rebalancing
> being more aggressive than the non-idle one, but it could be
> smoother. What speaks against the following?
>
> > +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> > + ((jiffies - (p)->last_run > (cache_decay_ticks >> idle)) && \
> > + !task_running(p) && task_allowed(p, cpu))
>
> Where does it hurt? As Andrew mentioned, Michael has seen weird behavior
> with the D7..E6 (AGRESSIVE_IDLE_STEAL=1) alike schedulers.

My testing did not show any noticeable difference with Erich's change.
Results below include the following kernels:

ingoF3          = linux 2.5.59 + Ingo's F3 patch + cputime_stats patch
ingoE6-59       = linux 2.5.59 + Ingo's E6 patch + cputime_stats patch
ingoE6-erich-59 = ingoE6-59 with change suggested by Erich Focht
stock59         = linux 2.5.59 + cputime_stats patch

Kernbench:
                 Elapsed    User       System    CPU
ingoF3           27.97s     275.96s    71.916s   1243.6%
ingoE6-59        27.826s    276.514s   71.542s   1250.4%
ingoE6-erich-59  27.59s     276.028s   69.964s   1254.2%
stock59          27.58s     275.92s    71.134s   1258.4%

Schedbench 4:
                 AvgUser    Elapsed    TotalUser TotalSys
ingoF3           33.64      41.43      134.61    0.71
ingoE6-59        33.21      44.76      132.88    0.70
ingoE6-erich-59  33.49      43.80      134.02    0.81
stock59          22.62      37.16      90.50     0.82

Schedbench 8:
                 AvgUser    Elapsed    TotalUser TotalSys
ingoF3           35.86      53.73      286.97    1.93
ingoE6-59        32.65      44.14      261.25    1.52
ingoE6-erich-59  35.83      68.42      286.69    1.90
stock59          29.26      42.79      234.18    1.62

Schedbench 16:
                 AvgUser    Elapsed    TotalUser TotalSys
ingoF3           52.66      56.65      842.81    4.09
ingoE6-59        52.94      56.69      847.25    3.57
ingoE6-erich-59  52.90      58.27      846.57    3.93
stock59          52.91      57.08      846.75    3.58

Schedbench 32:
                 AvgUser    Elapsed    TotalUser TotalSys
ingoF3           56.42      115.94     1805.65   6.31
ingoE6-59        56.62      116.70     1812.17   6.76
ingoE6-erich-59  56.45      117.27     1806.68   6.17
stock59          56.66      117.14     1813.41   6.15

Schedbench 64:
                 AvgUser    Elapsed    TotalUser TotalSys
ingoF3           56.52      231.19     3617.94   14.84
ingoE6-59        56.93      237.63     3644.36   16.78
ingoE6-erich-59  56.98      236.96     3647.06   15.64
stock59          56.56      230.71     3620.70   15.94

--
Michael Hohnbaum                       503-578-5486
hoh...@us...                           T/L 775-5486
|
From: Erich F. <ef...@es...> - 2003-02-04 18:48:00
|
Hi Ingo,

> +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> + (((idle && AGRESSIVE_IDLE_STEAL) || \
> + (jiffies - (p)->last_run > cache_decay_ticks)) && \
> + !task_running(p) && task_allowed(p, cpu))

this is still either all or nothing. I like the idle rebalancing
being more aggressive than the non-idle one, but it could be
smoother. What speaks against the following?

> +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> + ((jiffies - (p)->last_run > (cache_decay_ticks >> idle)) && \
> + !task_running(p) && task_allowed(p, cpu))

Where does it hurt? As Andrew mentioned, Michael has seen weird behavior
with the D7..E6 (AGRESSIVE_IDLE_STEAL=1) alike schedulers. Here is the
node load measured once a second for a numabench (on 4 nodes x 4 cpus):

> node0 0     steady state:
> node1 2
> node2 0
> node3 0
>
> node0 0     numa_test 4 fired off
> node1 2
> node2 1
> node3 0
>
> node0 0
> node1 2
> node2 4     [EF] numa_test tasks stolen by the idle cpus in same node
> node3 0
>
> node0 0
> node1 2
> node2 4     [EF] no chance for sched_balance_exec() to spread the tasks over the nodes
> node3 0
>
> node0 0
> node1 2
> node2 4
> node3 0
>
> node0 1     spike - hackbench is running and forks a bunch
> node1 681
> node2 4
> node3 1
>
> node0 8
> node1 306
> node2 4     [EF] these CPUs don't steal because they are always busy
> node3 7
>
> node0 5
> node1 212
> node2 6     [EF] maybe they do, sometimes... but too seldom
> node3 9
>
> node0 7
> node1 200
> node2 4     [EF] still running 1 numa_test per cpu on this node
> node3 8
>
> node0 2
> node1 124
> node2 4     [EF] still running 1 numa_test per cpu on this node
> node3 2
>
> node0 3
> node1 128
> node2 4     [EF] still running 1 numa_test per cpu on this node
> node3 2
>
> node0 33
> node1 83
> node2 4     [EF] still running 1 numa_test per cpu on this node
> node3 33
>
> node0 6
> node1 37
> node2 4     [EF] still running 1 numa_test per cpu on this node
> node3 3
>
> node0 8
> node1 4
> node2 4     [EF] still running 1 numa_test per cpu on this node
> node3 5
>
> node0 2     things calm down but the balance is off
> node1 0
> node2 4     [EF] still running 1 numa_test per cpu on this node
> node3 0

So you can see that the too aggressive steal spread the numa_test tasks
across the cpus of their own node instead of leaving the balance in the
hands of sched_balance_node(). The additional cache_decay_ticks/2 delay
would have given the tasks a chance to be balanced away to other nodes.

Regards,
Erich
|
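Concretely, Erich's variant does not waive the cache-hot test when the
stealing CPU is idle, it halves it: with idle being 0 or 1,
cache_decay_ticks >> idle is the full window for busy rebalancing and half
the window for idle rebalancing. A toy table of the effect; the decay
value is invented, and the task_running()/task_allowed() checks are
omitted:

#include <stdio.h>

/* Erich's smoother rule: halve the cache-hot window when stealing
 * for an idle CPU, instead of ignoring the window entirely as
 * AGRESSIVE_IDLE_STEAL does. */
static int can_migrate(unsigned long now, unsigned long last_run,
                       unsigned long cache_decay_ticks, int idle)
{
        return now - last_run > (cache_decay_ticks >> idle);
}

int main(void)
{
        unsigned long decay = 8;  /* invented: ticks a task stays hot */
        unsigned long age;

        for (age = 0; age <= 10; age += 2)
                printf("ticks since last run %2lu: busy steal %d, "
                       "idle steal %d\n", age,
                       can_migrate(age, 0, decay, 0),
                       can_migrate(age, 0, decay, 1));
        return 0;
}

With decay = 8, a task that last ran 6 ticks ago may be stolen by an idle
CPU but not by a busy one; only after 8 ticks do both agree, which is the
smoother middle ground Erich is arguing for.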
From: Michael H. <hoh...@us...> - 2003-02-04 19:00:48
|
Erich,

Thanks for the thorough analysis (below). I've got some time this
afternoon and a 4 node box, so will rerun these tests with the change
you outline below. Any other suggestions for me to test?

                Michael

On Tue, 2003-02-04 at 10:48, Erich Focht wrote:
> Hi Ingo,
>
> > +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> > + (((idle && AGRESSIVE_IDLE_STEAL) || \
> > + (jiffies - (p)->last_run > cache_decay_ticks)) && \
> > + !task_running(p) && task_allowed(p, cpu))
>
> this is still either all or nothing. I like the idle rebalancing
> being more aggressive than the non-idle one, but it could be
> smoother. What speaks against the following?
>
> > +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> > + ((jiffies - (p)->last_run > (cache_decay_ticks >> idle)) && \
> > + !task_running(p) && task_allowed(p, cpu))
>
> [...]
>
> So you can see that the too aggressive steal spread the numa_test tasks
> across the cpus of their own node instead of leaving the balance in the
> hands of sched_balance_node(). The additional cache_decay_ticks/2 delay
> would have given the tasks a chance to be balanced away to other nodes.
>
> Regards,
> Erich
|
From: Andrew T. <hab...@us...> - 2003-02-04 19:05:31
|
On Tuesday 04 February 2003 12:48, Erich Focht wrote:
> Hi Ingo,
>
> > +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> > + (((idle && AGRESSIVE_IDLE_STEAL) || \
> > + (jiffies - (p)->last_run > cache_decay_ticks)) && \
> > + !task_running(p) && task_allowed(p, cpu))
>
> this is still either all or nothing. I like the idle rebalancing
> being more aggressive than the non-idle one, but it could be
> smoother. What speaks against the following?
>
> > +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> > + ((jiffies - (p)->last_run > (cache_decay_ticks >> idle)) && \
> > + !task_running(p) && task_allowed(p, cpu))
>
> Where does it hurt? As Andrew mentioned, Michael has seen weird behavior
> with the D7..E6 (AGRESSIVE_IDLE_STEAL=1) alike schedulers. Here is the
> node load measured once a second for a numabench (on 4 nodes x 4 cpus):
>
> > node0 0     steady state:
> > node1 2
> > node2 0
> > node3 0
> >
> > node0 0     numa_test 4 fired off
> > node1 2
> > node2 1
> > node3 0
> >
> > node0 0
> > node1 2
> > node2 4     [EF] numa_test tasks stolen by the idle cpus in same node
> > node3 0
> >
> > node0 0
> > node1 2
> > node2 4     [EF] no chance for sched_balance_exec() to spread the tasks over the nodes
> > node3 0

Do you think removing the first check in sched_balance_exec would have
avoided this condition? That's exactly why I did this for the HT case,
so the tasks do not get "stuck" when they could be better served spread
across nodes.

My gut feeling is that we rarely get beyond the first check of
sched_best_cpu, and when we do, we get a good initial placement, but at
a cost of more system time, because we do a search for the best node and
the best cpu. I was considering keeping just the middle portion, where we
find the best node, then not even bothering to find the best cpu within
the node - just put it on the first one and let the cpu balancer do its
job later. What we are really concerned with here is the best node (so we
hopefully stay on that node for good node-local memory performance), not
really the best cpu. I was hoping this would give us good, fast node
placement.

-Andrew Theurer
|
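A sketch of the exec-time placement Andrew is describing: search only for
the least-loaded node, drop the per-CPU search inside it, and leave the
refinement to the intra-node balancer. Everything here (the load snapshot,
node size, function names) is invented for illustration; it is not the
sched_balance_exec() code itself:

#include <stdio.h>

#define NODES         4
#define CPUS_PER_NODE 4

/* invented snapshot of per-node runnable task counts */
static int node_nr_running[NODES] = { 3, 2, 4, 1 };

static int best_node(void)
{
        int n, best = 0;

        for (n = 1; n < NODES; n++)
                if (node_nr_running[n] < node_nr_running[best])
                        best = n;
        return best;
}

/* "just put it on the first one": cheapest cpu choice in the node;
 * the regular cpu balancer is left to even things out afterwards */
static int place_exec_task(void)
{
        return best_node() * CPUS_PER_NODE;
}

int main(void)
{
        printf("exec() placement -> node %d, cpu %d\n",
               best_node(), place_exec_task());
        return 0;
}

The design trade is explicit: one O(nodes) scan at exec() time buys good
node-local memory placement, while the per-tick balancer absorbs the cost
of fixing any intra-node imbalance later.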
From: Martin J. B. <mb...@ar...> - 2003-02-04 21:18:25
|
> this is the -E6 patch, with most of the HT-unrelated tunings and leftover > changes (noticed by Robert) removed. I kept the activation reorganization > because that is important in the HT case as well. > > based on Erich's observations i also added AGRESSIVE_IDLE_STEAL, which > defaults to 1 currently - could anyone try it with 0 on a NUMA box, how > much of a difference does it make? Passive aggressive results below - 16x NUMA-Q: Summary: passive seems OK, though no better than current (for me). Aggressive seems a little worse than current, though not too drastically. -------------------------------------- Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus) Elapsed User System CPU 2.5.59-gcc3.2 45.86 563.63 119.58 1489.33 2.5.59-E6-agressive 46.09 564.41 120.19 1484.67 2.5.59-E6-passive 46.20 564.53 120.00 1481.33 Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus) Elapsed User System CPU 2.5.59-gcc3.2 47.15 567.41 143.72 1507.50 2.5.59-E6-agressive 47.58 567.42 146.40 1499.50 2.5.59-E6-passive 47.51 567.56 144.57 1498.33 DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered trademarks of the Standard Performance Evaluation Corporation. This benchmarking was performed for research purposes only, and the run results are non-compliant and not-comparable with any published results. Results are shown as percentages of the first set displayed SDET 1 (see disclaimer) Throughput Std. Dev 2.5.59-gcc3.2 100.0% 5.2% 2.5.59-E6-agressive 93.5% 5.4% 2.5.59-E6-passive 104.2% 1.6% SDET 2 (see disclaimer) Throughput Std. Dev 2.5.59-gcc3.2 100.0% 7.1% 2.5.59-E6-agressive 100.9% 4.0% 2.5.59-E6-passive 102.8% 9.9% SDET 4 (see disclaimer) Throughput Std. Dev 2.5.59-gcc3.2 100.0% 5.3% 2.5.59-E6-agressive 105.3% 8.1% 2.5.59-E6-passive 98.8% 7.2% SDET 8 (see disclaimer) Throughput Std. Dev 2.5.59-gcc3.2 100.0% 4.7% 2.5.59-E6-agressive 99.7% 2.9% 2.5.59-E6-passive 98.2% 4.3% SDET 16 (see disclaimer) Throughput Std. Dev 2.5.59-gcc3.2 100.0% 1.8% 2.5.59-E6-agressive 97.8% 0.3% 2.5.59-E6-passive 98.4% 1.7% SDET 32 (see disclaimer) Throughput Std. Dev 2.5.59-gcc3.2 100.0% 1.6% 2.5.59-E6-agressive 100.3% 1.5% 2.5.59-E6-passive 99.6% 1.6% SDET 64 (see disclaimer) Throughput Std. Dev 2.5.59-gcc3.2 100.0% 1.1% 2.5.59-E6-agressive 99.8% 0.6% 2.5.59-E6-passive 99.4% 1.1% NUMA schedbench 4: AvgUser Elapsed TotalUser TotalSys 2.5.59-gcc3.2 0.00 41.80 107.76 0.73 2.5.59-E6-agressive 0.00 43.65 116.28 0.78 2.5.59-E6-passive 0.00 41.56 93.15 0.65 NUMA schedbench 8: AvgUser Elapsed TotalUser TotalSys 2.5.59-gcc3.2 0.00 38.00 229.83 2.11 2.5.59-E6-agressive 0.00 60.41 298.63 1.91 2.5.59-E6-passive 0.00 41.76 250.80 1.80 NUMA schedbench 16: AvgUser Elapsed TotalUser TotalSys 2.5.59-gcc3.2 0.00 57.28 839.21 2.85 2.5.59-E6-agressive 0.00 57.36 849.87 2.90 2.5.59-E6-passive 0.00 57.45 850.05 3.55 NUMA schedbench 32: AvgUser Elapsed TotalUser TotalSys 2.5.59-gcc3.2 0.00 118.44 1788.09 6.25 2.5.59-E6-agressive 0.00 117.88 1796.30 6.13 2.5.59-E6-passive 0.00 116.91 1802.43 6.38 NUMA schedbench 64: AvgUser Elapsed TotalUser TotalSys 2.5.59-gcc3.2 0.00 234.55 3633.76 15.02 2.5.59-E6-agressive 0.00 233.69 3614.37 14.53 2.5.59-E6-passive 0.00 236.53 3635.14 15.29 |
From: Christoph H. <hc...@in...> - 2003-02-04 17:04:16
|
On Tue, Feb 04, 2003 at 06:03:35PM +0100, Ingo Molnar wrote: > +choice > + prompt "Hyperthreading Support" > + depends on SMP > + default NR_SIBLINGS_0 > + > +config NR_SIBLINGS_0 > + bool "off" > + > +config NR_SIBLINGS_2 > + bool "2 siblings" > + > +config NR_SIBLINGS_4 > + bool "4 siblings" > + > +endchoice Any reason this isn't something like int "Number of Hyperthreading Siblings" depends on SMP default 0 |
From: Ingo M. <mi...@el...> - 2003-02-04 17:24:34
|
On Tue, 4 Feb 2003, Christoph Hellwig wrote: > > +choice > > + prompt "Hyperthreading Support" > > + depends on SMP > > + default NR_SIBLINGS_0 > > + > > +config NR_SIBLINGS_0 > > + bool "off" > > + > > +config NR_SIBLINGS_2 > > + bool "2 siblings" > > + > > +config NR_SIBLINGS_4 > > + bool "4 siblings" > > + > > +endchoice > > Any reason this isn't something like > > int "Number of Hyperthreading Siblings" > depends on SMP > default 0 i wanted to restrict the actual range of choice - is that possible with an int field? Ingo |
From: Randy.Dunlap <rdd...@os...> - 2003-02-04 17:52:55
|
On Tue, 4 Feb 2003, Ingo Molnar wrote: | | On Tue, 4 Feb 2003, Christoph Hellwig wrote: | | > > +choice | > > + prompt "Hyperthreading Support" | > > + depends on SMP | > > + default NR_SIBLINGS_0 | > > + | > > +config NR_SIBLINGS_0 | > > + bool "off" | > > + | > > +config NR_SIBLINGS_2 | > > + bool "2 siblings" | > > + | > > +config NR_SIBLINGS_4 | > > + bool "4 siblings" | > > + | > > +endchoice | > | > Any reason this isn't something like | > | > int "Number of Hyperthreading Siblings" | > depends on SMP | > default 0 | | i wanted to restrict the actual range of choice - is that possible with an | int field? No. I asked Roman a few weeks ago for a way to do range checking and restricting, but you want even more than simple range checking (certain values). -- ~Randy |
From: Andrew M. <ak...@di...> - 2003-02-05 02:00:53
|
Christoph Hellwig <hc...@in...> wrote: > > On Tue, Feb 04, 2003 at 06:03:35PM +0100, Ingo Molnar wrote: > > +choice > > + prompt "Hyperthreading Support" > > + depends on SMP > > + default NR_SIBLINGS_0 > > + > > +config NR_SIBLINGS_0 > > + bool "off" > > + > > +config NR_SIBLINGS_2 > > + bool "2 siblings" > > + > > +config NR_SIBLINGS_4 > > + bool "4 siblings" > > + > > +endchoice > > Any reason this isn't something like > > int "Number of Hyperthreading Siblings" > depends on SMP > default 0 Worries me that this option could triple the number of SMP kernels which vendors must ship. Can it be done at runtime? |
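The patch above already gestures at a runtime answer: the test_ht
__setup() hook remaps runqueues at boot, after sibling detection, rather
than at compile time. A user-space sketch of how a hypothetical
ht_siblings= boot option could pick the sibling count from a single
kernel image; the option name and the accepted values are assumptions,
not something in the patch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a hypothetical "ht_siblings=" option from a kernel command
 * line and clamp it to the values the scheduler arrays could be
 * sized for (0, 2 or 4), so one vendor kernel covers every box. */
static int parse_ht_siblings(const char *cmdline)
{
        const char *opt = strstr(cmdline, "ht_siblings=");
        long n;

        if (!opt)
                return 0;                       /* default: off */
        n = strtol(opt + strlen("ht_siblings="), NULL, 10);
        return (n == 2 || n == 4) ? (int)n : 0; /* reject the rest */
}

int main(void)
{
        printf("%d\n", parse_ht_siblings("root=/dev/sda1 ht_siblings=2"));
        printf("%d\n", parse_ht_siblings("root=/dev/sda1 ht_siblings=3"));
        return 0;
}

The catch, presumably the cost being weighed here, is that the
MAX_NR_SIBLINGS array inside the runqueue is sized at compile time, so a
single-image kernel would have to be built for the maximum and carry that
footprint even when hyperthreading is off.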
From: Rick L. <ric...@us...> - 2003-02-04 01:38:19
|
> In case anyone is interested, here is SchedD7, with and without HT bits,
> compared to SchedD7 minus HT bits plus numa-ht topology, compared to a
> run with E2:

Andrew, since I'm in the process of breaking out the non-HT bits myself,
I'd be interested in your patch to either change D7 to D7 minus HT, or
2.5.59 to D7 minus HT.

Rick
|