From: Mike K. <kr...@us...> - 2002-02-22 18:57:06
|
Below is a preliminary patch to implement some form of NUMA scheduling on top of Ingo's K3 scheduler patch for 2.4.17. This is VERY early code and brings up some issues that need to be discussed/explored in more detail. This patch was created to form a basis for discussion, rather than as a solution. The patch was created for the i386 based NUMA system I have access to. It will not work on other architectures. However, the only architecture specific code is a call to initialize some of the NUMA specific scheduling data structures. Therefore, it should be trivial to port.

Here is what the patch does:

- Creates 'cpu sets', which are sets of cpus that tasks can be scheduled on. The cpus in these sets should all have something in common, such as local memory, a shared cache, etc.

- In general, once a task begins execution on a cpu set, it remains on that cpu set. The load balancing code that exists in the K3 scheduler is used to balance the load among cpus within a set.

- Load balancing also takes place across cpu set boundaries. Load balancing across set boundaries can happen:
  - at exec time. This is an obvious choice since we have just thrown away a task's address space and are about to create a new one. The exec'ing task is migrated to the least loaded cpu set.
  - when a cpu goes idle. At this time we first look for another task to run within the cpu set. If there is no task available within the set, then we attempt to get a task from the most loaded cpu set.
  - at specified intervals when a cpu is idle. In the K3 scheduler, the timer code notices a cpu is idle and looks for more work to do. First it checks other runqueues within the cpu set. If nothing is found, it attempts to get a task from the most loaded cpu set.
  - at specified intervals when a cpu is busy. This is also kicked off via the timer interrupt code. It simply attempts to move tasks from the most loaded cpu sets to the least loaded cpu sets.

For the most part, cpu set load balancing is invoked in the same places as cpu-to-cpu load balancing in the K3 scheduler. The only exception is exec time task placement. In the patch, the various forms of cpu set load balancing are behind #defines which can be turned on or off for experimentation.

Things not addressed in the patch:

- Topology discovery. To create cpu sets, a routine numa_sched_init() is called after we know how many cpus there are in the system. Inside numa_sched_init(), there are hard coded values used to indicate the number of cpus per set, and the 'distance' between sets. Obviously, creation of cpu sets would be based on topology information that is not available at this time. In addition, there should be support for (or at least an understanding of) non-symmetric and multi-level topologies.

- Tasks should have a notion of a 'home' cpu set. Short term execution of a task on a remote cpu set should only be done to take advantage of idle cpus, and tasks should only execute for short periods of time on remote cpu sets. Most task execution should occur on the home cpu set. Long term execution on a remote cpu set would involve a full task migration, which includes moving/copying the memory associated with the task.

- Partitioning of scheduling data. On a NUMA architecture, data structures should be backed by memory local to the cpus which will be accessing them most frequently. Therefore, things like runqueues and cpu_sets should not be arrays.

- Better determination for what tasks to execute remotely/migrate. The patch uses the existing code in load_balance().
Most likely this will be sufficient. -- Mike diff -Naur linux-2.4.17-sched.orig/arch/i386/kernel/smpboot.c linux-2.4.17-ns/arch/i386/kernel/smpboot.c --- linux-2.4.17-sched.orig/arch/i386/kernel/smpboot.c Thu Feb 7 17:43:14 2002 +++ linux-2.4.17-ns/arch/i386/kernel/smpboot.c Mon Feb 11 18:54:33 2002 @@ -1198,6 +1198,11 @@ } } } + + /* + * Hack to get cpu sets initialized on NUMA architectures + */ + numa_sched_init(); #ifndef CONFIG_VISWS /* diff -Naur linux-2.4.17-sched.orig/fs/exec.c linux-2.4.17-ns/fs/exec.c --- linux-2.4.17-sched.orig/fs/exec.c Fri Dec 21 17:41:55 2001 +++ linux-2.4.17-ns/fs/exec.c Mon Feb 11 19:39:59 2002 @@ -860,6 +860,10 @@ int retval; int i; +#if CONFIG_SMP + sched_exec_migrate(); +#endif + file = open_exec(filename); retval = PTR_ERR(file); diff -Naur linux-2.4.17-sched.orig/include/linux/sched.h linux-2.4.17-ns/include/linux/sched.h --- linux-2.4.17-sched.orig/include/linux/sched.h Thu Feb 7 20:25:29 2002 +++ linux-2.4.17-ns/include/linux/sched.h Tue Feb 12 22:45:59 2002 @@ -152,6 +152,7 @@ extern void sched_task_migrated(task_t *p); extern void smp_migrate_task(int cpu, task_t *task); extern unsigned long cache_decay_ticks; +extern void sched_exec_migrate(void); #define MAX_SCHEDULE_TIMEOUT LONG_MAX extern signed long FASTCALL(schedule_timeout(signed long timeout)); diff -Naur linux-2.4.17-sched.orig/kernel/sched.c linux-2.4.17-ns/kernel/sched.c --- linux-2.4.17-sched.orig/kernel/sched.c Thu Feb 7 17:43:13 2002 +++ linux-2.4.17-ns/kernel/sched.c Thu Feb 14 16:59:57 2002 @@ -20,6 +20,53 @@ #include <linux/interrupt.h> #include <asm/mmu_context.h> +/* + * Definitions that depend on/should be part of NUMA topology discovery + */ +/* + * Periodically poll for new (off cpu set) work as part of the idle loop +#define IDLE_LOOP_REBALANCE + */ +/* + * Look for off cpu set work before going idle + */ +#define GOING_IDLE_REBALANCE +/* + * Attempt to balance load between cpu sets (even when busy) +#define BUSY_REBALANCE + */ +/* + * Send tasks to least loaded CPU at exec time + */ +#define EXEC_BALANCE + +#define NR_CPU_SETS NR_CPUS + +union cpu_set { + struct cpu_set_data { + spinlock_t lock; + unsigned long local_cpus; + unsigned long load_avg; + } csd; + char __pad [SMP_CACHE_BYTES]; +}; +typedef union cpu_set cpu_set_t; +#define cs_lock csd.lock +#define cs_local_cpus csd.local_cpus +#define cs_load_avg csd.load_avg + +static int numa_num_cpu_sets = 0; +static int numa_cpus_per_local_set = NR_CPUS; +static int numa_cpu_set_distance = 1; +static cpu_set_t cpu_sets[NR_CPU_SETS] __cacheline_aligned; + +#define cpu_to_set_id(c) ((c) / numa_cpus_per_local_set) +#define next_cs_id(c) ((c) + 1 == numa_num_cpu_sets ? 0 : (c) + 1) +#define cpu_to_set(c) (cpu_sets + ((c) / numa_cpus_per_local_set)) +/* + * End of NUMA definitions + */ + #define BITMAP_SIZE ((((MAX_PRIO+7)/8)+sizeof(long)-1)/sizeof(long)) typedef struct runqueue runqueue_t; @@ -41,6 +88,7 @@ */ struct runqueue { spinlock_t lock; + cpu_set_t *cpu_set; unsigned long nr_running, nr_switches, expired_timestamp; task_t *curr, *idle; prio_array_t *active, *expired, arrays[2]; @@ -395,11 +443,13 @@ static void load_balance(runqueue_t *this_rq, int idle) { int imbalance, nr_running, load, max_load, - idx, i, this_cpu = smp_processor_id(); + idx, i, this_cpu = smp_processor_id(), + total_set_load = 0, cpus_in_set = 0; task_t *next = this_rq->idle, *tmp; runqueue_t *busiest, *rq_src; prio_array_t *array; list_t *head, *curr; + unsigned long target_cpu_set; /* * We search all runqueues to find the most busy one. 
@@ -430,7 +480,11 @@ busiest = NULL; max_load = 1; + target_cpu_set = this_rq->cpu_set->cs_local_cpus; for (i = 0; i < smp_num_cpus; i++) { + /* Skip CPUs not in the target set */ + if (!(target_cpu_set & (1UL << cpu_logical_map(i)))) + continue; rq_src = cpu_rq(cpu_logical_map(i)); if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) load = rq_src->nr_running; @@ -438,6 +492,9 @@ load = this_rq->prev_nr_running[i]; this_rq->prev_nr_running[i] = rq_src->nr_running; + total_set_load += load; + cpus_in_set++; + if ((load > max_load) && (rq_src != this_rq)) { busiest = rq_src; max_load = load; @@ -458,10 +515,20 @@ * Make sure nothing changed since we checked the * runqueue length. */ - if (busiest->nr_running <= this_rq->nr_running + 1) + if (busiest->nr_running <= nr_running + 1) goto out_unlock; /* + * Update cpu set load average + */ + total_set_load = total_set_load / cpus_in_set; + if (total_set_load != this_rq->cpu_set->cs_load_avg) { + spin_lock(&this_rq->cpu_set->cs_lock); + this_rq->cpu_set->cs_load_avg = total_set_load; + spin_unlock(&this_rq->cpu_set->cs_lock); + } + + /* * We first consider expired tasks. Those will likely not be * executed in the near future, and they are most likely to * be cache-cold, thus switching CPUs has the least effect @@ -534,6 +601,49 @@ spin_unlock(&busiest->lock); } +#if defined(IDLE_LOOP_REBALANCE) || defined(GOING_IDLE_REBALANCE) || \ + defined(BUSY_REBALANCE) +/* + * Load balance CPU sets + */ +static void balance_cpu_sets(runqueue_t *this_rq, int idle) +{ + int i, this_load, max_load; + cpu_set_t *this_set, *target_set = NULL; + + this_set = cpu_to_set(smp_processor_id()); + this_load = this_set->cs_load_avg; + + if (this_load > 1) + max_load = this_load; + else { + if (idle) + max_load = 0; /* Looking for anything! */ + else + max_load = 1; + } + + for(i = 0; i < numa_num_cpu_sets; i++) { + if (cpu_sets[i].cs_load_avg > max_load) { + max_load = cpu_sets[i].cs_load_avg; + target_set = &cpu_sets[i]; + } + } + + if (!target_set || (max_load <= this_load)) + return; + + /* + * We point current CPU at target cpu_set. This is safe + * because the caller ensures the current CPUs runqueue + * is locked. + */ + this_rq()->cpu_set = target_set; + load_balance(this_rq, idle); + this_rq()->cpu_set = this_set; +} +#endif + /* * One of the idle_cpu_tick() or the busy_cpu_tick() function will * gets called every timer tick, on every CPU. Our balancing action @@ -545,16 +655,104 @@ */ #define BUSY_REBALANCE_TICK (HZ/4 ?: 1) #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) +#define CS_IDLE_REBALANCE_TICK (IDLE_REBALANCE_TICK * numa_cpu_set_distance) +#define CS_BUSY_REBALANCE_TICK (BUSY_REBALANCE_TICK * numa_cpu_set_distance) static inline void idle_tick(void) { - if (jiffies % IDLE_REBALANCE_TICK) + unsigned long now = jiffies; + + if (now % IDLE_REBALANCE_TICK) return; spin_lock(&this_rq()->lock); load_balance(this_rq(), 1); + +#ifdef IDLE_LOOP_REBALANCE + /* + * If there are no waiting tasks on the local cpu set, then + * at least take a look at other sets. + */ + if (!this_rq()->nr_running && !(now % CS_IDLE_REBALANCE_TICK)) + balance_cpu_sets(this_rq(), 1); +#endif + spin_unlock(&this_rq()->lock); } +/* + * Code to be executed as part of the exec path on NUMA architectures. + * Since exec throws away the old process image, there is little + * advantage to keeping the task on the same cpu set. Therefore, we + * consider moving the task to the least laded cpu set. 
+ */ +void sched_exec_migrate(void) +{ +#ifdef EXEC_BALANCE + int this_cpu, this_cs_id, this_load_avg, min_load_avg, + i, target_cs_id, target_cpu; + unsigned long target_cs_mask, + cpus_allowed = current->cpus_allowed; + runqueue_t *rq; + + /* + * Only consider migration if other tasks are waiting for + * this CPU, and the load average for this cpu set is + * greater than 1. Also, make sure the task is allowed + * to run on another cpu set. + */ + this_cpu = smp_processor_id(); + if (cpu_rq(this_cpu)->nr_running < 2) + return; + + this_cs_id = cpu_to_set_id(this_cpu); + this_load_avg = cpu_sets[this_cs_id].cs_load_avg; + if (this_load_avg < 1) + return; + + if (!(cpus_allowed & ~(cpu_sets[this_cs_id].cs_local_cpus))) + return; + + /* + * Look for another cpu set with a lower load average. + */ + min_load_avg = this_load_avg; + target_cs_id = this_cs_id; + for (i = next_cs_id(this_cs_id); i != this_cs_id; i = next_cs_id(i)) { + if (!(cpus_allowed & cpu_sets[i].cs_local_cpus)) + continue; + if (cpu_sets[i].cs_load_avg < min_load_avg) { + min_load_avg = cpu_sets[i].cs_load_avg; + target_cs_id = i; + } + } + if (!(min_load_avg < this_load_avg)) + return; + + /* + * Find shortest runqueue on target cpu set. + */ + min_load_avg = INT_MAX; + target_cs_mask = cpu_sets[target_cs_id].cs_local_cpus; + target_cpu = this_cpu; + for (i=0; i < smp_num_cpus; i++) { + if (!(cpus_allowed & target_cs_mask & (1UL << i))) + continue; + rq = cpu_rq(cpu_logical_map(i)); + if (rq->nr_running < min_load_avg) { + min_load_avg = rq->nr_running; + target_cpu = cpu_logical_map(i); + } + } + + /* + * Migrate task + */ + current->state = TASK_UNINTERRUPTIBLE; + smp_migrate_task(target_cpu, current); + schedule(); +#endif +} + #endif /* @@ -618,6 +816,11 @@ #if CONFIG_SMP if (!(now % BUSY_REBALANCE_TICK)) load_balance(rq, 0); + +#ifdef BUSY_REBALANCE + if (!(now % CS_BUSY_REBALANCE_TICK)) + balance_cpu_sets(rq, 0); +#endif #endif spin_unlock(&rq->lock); } @@ -659,6 +862,10 @@ if (unlikely(!rq->nr_running)) { #if CONFIG_SMP load_balance(rq, 1); +#ifdef GOING_IDLE_REBALANCE + if (!rq->nr_running) + balance_cpu_sets(rq, 1); +#endif if (rq->nr_running) goto pick_next_task; #endif @@ -1332,10 +1539,20 @@ runqueue_t *rq; int i, j, k; + /* + * Default is to have a single cpu set containing all CPUs + */ + numa_num_cpu_sets = 0; + cpu_sets[0].cs_lock = SPIN_LOCK_UNLOCKED; + cpu_sets[0].cs_local_cpus = -1; + cpu_sets[0].cs_load_avg = 0; + for (i = 0; i < NR_CPUS; i++) { runqueue_t *rq = cpu_rq(i); prio_array_t *array; + rq->cpu_set = &cpu_sets[0]; + rq->active = rq->arrays + 0; rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); @@ -1372,3 +1589,42 @@ atomic_inc(&init_mm.mm_count); enter_lazy_tlb(&init_mm, current, smp_processor_id()); } + +void numa_sched_init(void) +{ + int i, cpu_set_num; + unsigned long numa_local_cpu_set_mask; + +#define NUMA_CPUS_PER_LOCAL_SET 4 +#define NUMA_CPU_SET_DISTANCE 2 + + numa_cpus_per_local_set = NUMA_CPUS_PER_LOCAL_SET; + numa_cpu_set_distance = NUMA_CPU_SET_DISTANCE; + + numa_local_cpu_set_mask = 0UL; + for (i=0; i<numa_cpus_per_local_set; i++) { + numa_local_cpu_set_mask = numa_local_cpu_set_mask << 1; + numa_local_cpu_set_mask |= 0x1UL; + } + + /* + * Make sure all runqueues point to the correct cpu_set + */ + cpu_set_num = -1; + numa_num_cpu_sets = 0; + for (i=0; i<smp_num_cpus; i++) { + if (cpu_to_set_id(i) != cpu_set_num) { + cpu_set_num = cpu_to_set_id(i); + + cpu_sets[cpu_set_num].cs_lock = SPIN_LOCK_UNLOCKED; + cpu_sets[cpu_set_num].cs_local_cpus = + numa_local_cpu_set_mask << + 
(cpu_set_num * numa_cpus_per_local_set); + cpu_sets[cpu_set_num].cs_load_avg = 0; + + numa_num_cpu_sets++; + } + + runqueues[i].cpu_set = &cpu_sets[cpu_set_num]; + } +} |
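To make the hard-coded topology concrete, here is a small user-space sketch (not part of the patch) of the cpu-to-set mapping that numa_sched_init() sets up. It assumes the patch's NUMA_CPUS_PER_LOCAL_SET value of 4 and a 16-CPU box:

#include <stdio.h>

#define NUMA_CPUS_PER_LOCAL_SET 4   /* hard-coded in numa_sched_init() above */

/* same divide as the patch's cpu_to_set_id() macro */
static int cpu_to_set_id(int cpu)
{
        return cpu / NUMA_CPUS_PER_LOCAL_SET;
}

/* same mask construction as numa_sched_init(): a contiguous block of
 * NUMA_CPUS_PER_LOCAL_SET bits, shifted up for each successive set */
static unsigned long set_local_cpus(int set_id)
{
        unsigned long mask = (1UL << NUMA_CPUS_PER_LOCAL_SET) - 1;

        return mask << (set_id * NUMA_CPUS_PER_LOCAL_SET);
}

int main(void)
{
        int cpu;

        for (cpu = 0; cpu < 16; cpu++)
                printf("cpu %2d -> set %d, cs_local_cpus 0x%04lx\n",
                       cpu, cpu_to_set_id(cpu),
                       set_local_cpus(cpu_to_set_id(cpu)));
        return 0;
}

With those values, cpu 5 lands in set 1 with cs_local_cpus mask 0x00f0, which is exactly the mask load_balance() uses above to skip runqueues outside the local set.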
From: Jesse B. <jb...@sg...> - 2002-02-22 19:14:08
|
On Fri, Feb 22, 2002 at 10:56:06AM -0800, Mike Kravetz wrote:
> Below is preliminary patch to implement some form of NUMA scheduling
> on top of Ingo's K3 scheduler patch for 2.4.17. This is VERY early
> code and brings up some issues that need to be discussed/explored in
> more detail. This patch was created to form a basis for discussion,
> rather than as a solution. The patch was created for the i386 based
> NUMA system I have access to. It will not work on other architectures.
> However, the only architecture specific code is a call to initialize
> some of the NUMA specific scheduling data structures. Therefore, it
> should be trivial to port.

Ah, you beat me to it; I was working on code very similar to this when I got your e-mail. I think this sort of thing will address the problem we've been seeing on machines with more than 16 cpus or so (IDLE_REBALANCE_TICK is too small, flooding CPUs with load_balance requests), as well as making numa scheduling a little more efficient. I'll see if I can make it work on our platform and let you know how it goes.

Jesse |
From: Peter R. <fr...@zk...> - 2002-02-22 19:29:57
|
Ditto. Maybe I'll even get a chance to check it out on an EV7. ;)

- Pete

Jesse Barnes wrote:
> On Fri, Feb 22, 2002 at 10:56:06AM -0800, Mike Kravetz wrote:
> > Below is preliminary patch to implement some form of NUMA scheduling
> > on top of Ingo's K3 scheduler patch for 2.4.17. This is VERY early
> > code and brings up some issues that need to be discussed/explored in
> > more detail. This patch was created to form a basis for discussion,
> > rather than as a solution. The patch was created for the i386 based
> > NUMA system I have access to. It will not work on other architectures.
> > However, the only architecture specific code is a call to initialize
> > some of the NUMA specific scheduling data structures. Therefore, it
> > should be trivial to port.
>
> Ah, you beat me to it; I was working on code very similar to this when
> I got your e-mail. I think this sort of thing will address the
> problem we've been seeing on machines with more than 16 cpus or so
> (IDLE_REBALANCE_TICK is too small, flooding CPUs with load_balance
> requests), as well as making numa scheduling a little more efficient.
> I'll see if I can make it work on our platform and let you know how it
> goes.
>
> Jesse |
From: Mike K. <kr...@us...> - 2002-02-22 21:17:12
|
On Fri, Feb 22, 2002 at 07:59:02PM +0000, Alan Cox wrote:
> > - Topology discovery. To create cpu sets, a routine numa_sched_init()
> > is called after we know how many cpus there are in the system.
> > Inside numa_sched_init(), there are hard coded values used to indicate
> > the number of cpus per set, and the 'distance' between sets.
> > Obviously, creation of cpu sets would be based on topology information
> > that is not available at this time. In addition, there should be
> > support for (or at least an understanding of) non-symmetric and
> > multi-level topologies.
>
> Would you expect that to be in user space eventually (I'm thinking about
> hot plug CPU's for one here) ?

Yes, I believe there should be interfaces for topology discovery both in kernel and user space. Some people on the lse-tech list have been working on this, although I have to admit that I haven't been closely following those discussions. Perhaps the people working on this could let you know what they are planning.

--
Mike |
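Sketching what an in-kernel interface might mean for this patch (purely hypothetical: neither numa_topology_num_nodes() nor numa_topology_node_of_cpu() exists in 2.4.17, they only stand in for whatever the topology-discovery work produces), numa_sched_init() could drop its hard-coded constants along these lines:

/* Hypothetical topology interface - stand-ins only, not a real 2.4 API. */
extern int numa_topology_num_nodes(void);
extern int numa_topology_node_of_cpu(int cpu);

void numa_sched_init(void)
{
        int i, node;

        /* one cpu set per node, assuming node ids run 0..num_nodes-1 */
        numa_num_cpu_sets = numa_topology_num_nodes();
        for (i = 0; i < numa_num_cpu_sets; i++) {
                cpu_sets[i].cs_lock = SPIN_LOCK_UNLOCKED;
                cpu_sets[i].cs_local_cpus = 0UL;
                cpu_sets[i].cs_load_avg = 0;
        }

        for (i = 0; i < smp_num_cpus; i++) {
                node = numa_topology_node_of_cpu(i);
                cpu_sets[node].cs_local_cpus |= 1UL << cpu_logical_map(i);
                runqueues[i].cpu_set = &cpu_sets[node];
        }
}

The cpu_to_set_id()/cpu_to_set() macros would then need a per-cpu lookup table rather than a divide by numa_cpus_per_local_set, since a real topology need not be symmetric.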
From: Mike K. <kr...@us...> - 2002-02-23 00:00:03
|
On Fri, Feb 22, 2002 at 10:56:06AM -0800, Mike Kravetz wrote: > Below is preliminary patch to implement some form of NUMA scheduling > on top of Ingo's K3 scheduler patch for 2.4.17. My apologies, the previously included patch was created on top of of Ingo's J9 scheduler patch. The patch below is built on K3. -- Mike diff -Naur linux-2.4.17-K3/arch/i386/kernel/smpboot.c linux-2.4.17-K3-ns/arch/i386/kernel/smpboot.c --- linux-2.4.17-K3/arch/i386/kernel/smpboot.c Thu Feb 14 18:42:53 2002 +++ linux-2.4.17-K3-ns/arch/i386/kernel/smpboot.c Thu Feb 14 17:38:51 2002 @@ -1198,6 +1198,11 @@ } } } + + /* + * Hack to get cpu sets initialized on NUMA architectures + */ + numa_sched_init(); #ifndef CONFIG_VISWS /* diff -Naur linux-2.4.17-K3/fs/exec.c linux-2.4.17-K3-ns/fs/exec.c --- linux-2.4.17-K3/fs/exec.c Fri Dec 21 17:41:55 2001 +++ linux-2.4.17-K3-ns/fs/exec.c Thu Feb 14 17:38:51 2002 @@ -860,6 +860,10 @@ int retval; int i; +#if CONFIG_SMP + sched_exec_migrate(); +#endif + file = open_exec(filename); retval = PTR_ERR(file); diff -Naur linux-2.4.17-K3/include/linux/sched.h linux-2.4.17-K3-ns/include/linux/sched.h --- linux-2.4.17-K3/include/linux/sched.h Thu Feb 14 18:46:54 2002 +++ linux-2.4.17-K3-ns/include/linux/sched.h Thu Feb 14 17:52:18 2002 @@ -152,6 +152,7 @@ extern void sched_task_migrated(task_t *p); extern void smp_migrate_task(int cpu, task_t *task); extern unsigned long cache_decay_ticks; +extern void sched_exec_migrate(void); #define MAX_SCHEDULE_TIMEOUT LONG_MAX extern signed long FASTCALL(schedule_timeout(signed long timeout)); diff -Naur linux-2.4.17-K3/kernel/sched.c linux-2.4.17-K3-ns/kernel/sched.c --- linux-2.4.17-K3/kernel/sched.c Thu Feb 14 18:42:52 2002 +++ linux-2.4.17-K3-ns/kernel/sched.c Thu Feb 14 17:40:47 2002 @@ -22,6 +22,53 @@ #include <linux/kernel_stat.h> /* + * Definitions that depend on/should be part of NUMA topology discovery + */ +/* + * Periodically poll for new (off cpu set) work as part of the idle loop +#define IDLE_LOOP_REBALANCE + */ +/* + * Look for off cpu set work before going idle + */ +#define GOING_IDLE_REBALANCE +/* + * Attempt to balance load between cpu sets (even when busy) +#define BUSY_REBALANCE + */ +/* + * Send tasks to least loaded CPU at exec time + */ +#define EXEC_BALANCE + +#define NR_CPU_SETS NR_CPUS + +union cpu_set { + struct cpu_set_data { + spinlock_t lock; + unsigned long local_cpus; + unsigned long load_avg; + } csd; + char __pad [SMP_CACHE_BYTES]; +}; +typedef union cpu_set cpu_set_t; +#define cs_lock csd.lock +#define cs_local_cpus csd.local_cpus +#define cs_load_avg csd.load_avg + +static int numa_num_cpu_sets = 0; +static int numa_cpus_per_local_set = NR_CPUS; +static int numa_cpu_set_distance = 1; +static cpu_set_t cpu_sets[NR_CPU_SETS] __cacheline_aligned; + +#define cpu_to_set_id(c) ((c) / numa_cpus_per_local_set) +#define next_cs_id(c) ((c) + 1 == numa_num_cpu_sets ? 0 : (c) + 1) +#define cpu_to_set(c) (cpu_sets + ((c) / numa_cpus_per_local_set)) +/* + * End of NUMA definitions + */ + +/* * Priority of a process goes from 0 to 139. The 0-99 * priority range is allocated to RT tasks, the 100-139 * range is for SCHED_OTHER tasks. 
Priority values are @@ -140,6 +187,7 @@ */ struct runqueue { spinlock_t lock; + cpu_set_t *cpu_set; unsigned long nr_running, nr_switches, expired_timestamp; task_t *curr, *idle; prio_array_t *active, *expired, arrays[2]; @@ -484,11 +532,13 @@ static void load_balance(runqueue_t *this_rq, int idle) { int imbalance, nr_running, load, max_load, - idx, i, this_cpu = smp_processor_id(); + idx, i, this_cpu = smp_processor_id(), + total_set_load = 0, cpus_in_set = 0; task_t *next = this_rq->idle, *tmp; runqueue_t *busiest, *rq_src; prio_array_t *array; list_t *head, *curr; + unsigned long target_cpu_set; /* * We search all runqueues to find the most busy one. @@ -519,7 +569,11 @@ busiest = NULL; max_load = 1; + target_cpu_set = this_rq->cpu_set->cs_local_cpus; for (i = 0; i < smp_num_cpus; i++) { + /* Skip CPUs not in the target set */ + if (!(target_cpu_set & (1UL << cpu_logical_map(i)))) + continue; rq_src = cpu_rq(cpu_logical_map(i)); if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) load = rq_src->nr_running; @@ -527,6 +581,9 @@ load = this_rq->prev_nr_running[i]; this_rq->prev_nr_running[i] = rq_src->nr_running; + total_set_load += load; + cpus_in_set++; + if ((load > max_load) && (rq_src != this_rq)) { busiest = rq_src; max_load = load; @@ -547,10 +604,20 @@ * Make sure nothing changed since we checked the * runqueue length. */ - if (busiest->nr_running <= this_rq->nr_running + 1) + if (busiest->nr_running <= nr_running + 1) goto out_unlock; /* + * Update cpu set load average + */ + total_set_load = total_set_load / cpus_in_set; + if (total_set_load != this_rq->cpu_set->cs_load_avg) { + spin_lock(&this_rq->cpu_set->cs_lock); + this_rq->cpu_set->cs_load_avg = total_set_load; + spin_unlock(&this_rq->cpu_set->cs_lock); + } + + /* * We first consider expired tasks. Those will likely not be * executed in the near future, and they are most likely to * be cache-cold, thus switching CPUs has the least effect @@ -623,6 +690,49 @@ spin_unlock(&busiest->lock); } +#if defined(IDLE_LOOP_REBALANCE) || defined(GOING_IDLE_REBALANCE) || \ + defined(BUSY_REBALANCE) +/* + * Load balance CPU sets + */ +static void balance_cpu_sets(runqueue_t *this_rq, int idle) +{ + int i, this_load, max_load; + cpu_set_t *this_set, *target_set = NULL; + + this_set = cpu_to_set(smp_processor_id()); + this_load = this_set->cs_load_avg; + + if (this_load > 1) + max_load = this_load; + else { + if (idle) + max_load = 0; /* Looking for anything! */ + else + max_load = 1; + } + + for(i = 0; i < numa_num_cpu_sets; i++) { + if (cpu_sets[i].cs_load_avg > max_load) { + max_load = cpu_sets[i].cs_load_avg; + target_set = &cpu_sets[i]; + } + } + + if (!target_set || (max_load <= this_load)) + return; + + /* + * We point current CPU at target cpu_set. This is safe + * because the caller ensures the current CPUs runqueue + * is locked. + */ + this_rq()->cpu_set = target_set; + load_balance(this_rq, idle); + this_rq()->cpu_set = this_set; +} +#endif + /* * One of the idle_cpu_tick() or the busy_cpu_tick() function will * gets called every timer tick, on every CPU. 
Our balancing action @@ -634,16 +744,104 @@ */ #define BUSY_REBALANCE_TICK (HZ/4 ?: 1) #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) +#define CS_IDLE_REBALANCE_TICK (IDLE_REBALANCE_TICK * numa_cpu_set_distance) +#define CS_BUSY_REBALANCE_TICK (BUSY_REBALANCE_TICK * numa_cpu_set_distance) static inline void idle_tick(void) { - if (jiffies % IDLE_REBALANCE_TICK) + unsigned long now = jiffies; + + if (now % IDLE_REBALANCE_TICK) return; spin_lock(&this_rq()->lock); load_balance(this_rq(), 1); + +#ifdef IDLE_LOOP_REBALANCE + /* + * If there are no waiting tasks on the local cpu set, then + * at least take a look at other sets. + */ + if (!this_rq()->nr_running && !(now % CS_IDLE_REBALANCE_TICK)) + balance_cpu_sets(this_rq(), 1); +#endif + spin_unlock(&this_rq()->lock); } +/* + * Code to be executed as part of the exec path on NUMA architectures. + * Since exec throws away the old process image, there is little + * advantage to keeping the task on the same cpu set. Therefore, we + * consider moving the task to the least laded cpu set. + */ +void sched_exec_migrate(void) +{ +#ifdef EXEC_BALANCE + int this_cpu, this_cs_id, this_load_avg, min_load_avg, + i, target_cs_id, target_cpu; + unsigned long target_cs_mask, + cpus_allowed = current->cpus_allowed; + runqueue_t *rq; + + /* + * Only consider migration if other tasks are waiting for + * this CPU, and the load average for this cpu set is + * greater than 1. Also, make sure the task is allowed + * to run on another cpu set. + */ + this_cpu = smp_processor_id(); + if (cpu_rq(this_cpu)->nr_running < 2) + return; + + this_cs_id = cpu_to_set_id(this_cpu); + this_load_avg = cpu_sets[this_cs_id].cs_load_avg; + if (this_load_avg < 1) + return; + + if (!(cpus_allowed & ~(cpu_sets[this_cs_id].cs_local_cpus))) + return; + + /* + * Look for another cpu set with a lower load average. + */ + min_load_avg = this_load_avg; + target_cs_id = this_cs_id; + for (i = next_cs_id(this_cs_id); i != this_cs_id; i = next_cs_id(i)) { + if (!(cpus_allowed & cpu_sets[i].cs_local_cpus)) + continue; + if (cpu_sets[i].cs_load_avg < min_load_avg) { + min_load_avg = cpu_sets[i].cs_load_avg; + target_cs_id = i; + } + } + if (!(min_load_avg < this_load_avg)) + return; + + /* + * Find shortest runqueue on target cpu set. 
+ */ + min_load_avg = INT_MAX; + target_cs_mask = cpu_sets[target_cs_id].cs_local_cpus; + target_cpu = this_cpu; + for (i=0; i < smp_num_cpus; i++) { + if (!(cpus_allowed & target_cs_mask & (1UL << i))) + continue; + rq = cpu_rq(cpu_logical_map(i)); + if (rq->nr_running < min_load_avg) { + min_load_avg = rq->nr_running; + target_cpu = cpu_logical_map(i); + } + } + + /* + * Migrate task + */ + current->state = TASK_UNINTERRUPTIBLE; + smp_migrate_task(target_cpu, current); + schedule(); +#endif +} + #endif /* @@ -732,6 +930,11 @@ #if CONFIG_SMP if (!(jiffies % BUSY_REBALANCE_TICK)) load_balance(rq, 0); + +#ifdef BUSY_REBALANCE + if (!(now % CS_BUSY_REBALANCE_TICK)) + balance_cpu_sets(rq, 0); +#endif #endif spin_unlock(&rq->lock); } @@ -772,6 +975,10 @@ if (unlikely(!rq->nr_running)) { #if CONFIG_SMP load_balance(rq, 1); +#ifdef GOING_IDLE_REBALANCE + if (!rq->nr_running) + balance_cpu_sets(rq, 1); +#endif if (rq->nr_running) goto pick_next_task; #endif @@ -1488,10 +1695,20 @@ runqueue_t *rq; int i, j, k; + /* + * Default is to have a single cpu set containing all CPUs + */ + numa_num_cpu_sets = 0; + cpu_sets[0].cs_lock = SPIN_LOCK_UNLOCKED; + cpu_sets[0].cs_local_cpus = -1; + cpu_sets[0].cs_load_avg = 0; + for (i = 0; i < NR_CPUS; i++) { runqueue_t *rq = cpu_rq(i); prio_array_t *array; + rq->cpu_set = &cpu_sets[0]; + rq->active = rq->arrays + 0; rq->expired = rq->arrays + 1; spin_lock_init(&rq->lock); @@ -1528,3 +1745,42 @@ atomic_inc(&init_mm.mm_count); enter_lazy_tlb(&init_mm, current, smp_processor_id()); } + +void numa_sched_init(void) +{ + int i, cpu_set_num; + unsigned long numa_local_cpu_set_mask; + +#define NUMA_CPUS_PER_LOCAL_SET 4 +#define NUMA_CPU_SET_DISTANCE 2 + + numa_cpus_per_local_set = NUMA_CPUS_PER_LOCAL_SET; + numa_cpu_set_distance = NUMA_CPU_SET_DISTANCE; + + numa_local_cpu_set_mask = 0UL; + for (i=0; i<numa_cpus_per_local_set; i++) { + numa_local_cpu_set_mask = numa_local_cpu_set_mask << 1; + numa_local_cpu_set_mask |= 0x1UL; + } + + /* + * Make sure all runqueues point to the correct cpu_set + */ + cpu_set_num = -1; + numa_num_cpu_sets = 0; + for (i=0; i<smp_num_cpus; i++) { + if (cpu_to_set_id(i) != cpu_set_num) { + cpu_set_num = cpu_to_set_id(i); + + cpu_sets[cpu_set_num].cs_lock = SPIN_LOCK_UNLOCKED; + cpu_sets[cpu_set_num].cs_local_cpus = + numa_local_cpu_set_mask << + (cpu_set_num * numa_cpus_per_local_set); + cpu_sets[cpu_set_num].cs_load_avg = 0; + + numa_num_cpu_sets++; + } + + runqueues[i].cpu_set = &cpu_sets[cpu_set_num]; + } +} |
From: Jesse B. <jb...@sg...> - 2002-02-27 22:12:59
|
I was able to get your last version working on a 32p ia64 we have here. Here's what I came up with:

 o Current 2.4 scheduler - scales poorly but works
 o Ingo's stock scheduler - scales well, but we needed to change the
   IDLE_REBALANCE_TICK value from 1ms to 10ms for it to work
 o Your NUMA scheduler - scales better than the 2.4 scheduler, but not as
   well as Ingo's, but it didn't need the IDLE_REBALANCE_TICK hack.

Here are some hackbench numbers that I collected:

 #     2.4      Ingo    NUMA
 --   ------    -----   ------
 30   23.872    7.904   16.286
 40   34.389    8.889   22.359
 50   46.615    9.432   30.097

I could very well be getting the cpuset stuff wrong in my code, which would partially explain the high numbers. It sure would be nice to have a topology api availble for 2.4...

Jesse

On Fri, Feb 22, 2002 at 03:59:50PM -0800, Mike Kravetz wrote:
> On Fri, Feb 22, 2002 at 10:56:06AM -0800, Mike Kravetz wrote:
> > Below is preliminary patch to implement some form of NUMA scheduling
> > on top of Ingo's K3 scheduler patch for 2.4.17.
>
> My apologies, the previously included patch was created on top of
> Ingo's J9 scheduler patch. The patch below is built on K3.
>
> --
> Mike |
From: Mike K. <kr...@us...> - 2002-02-28 02:52:57
|
On Wed, Feb 27, 2002 at 10:58:19AM -0800, Jesse Barnes wrote:
> I was able to get your last version working on a 32p ia64 we have
> here. Here's what I came up with:
>
> o Current 2.4 scheduler - scales poorly but works
> o Ingo's stock scheduler - scales well, but we needed to change the
>   IDLE_REBALANCE_TICK value from 1ms to 10ms for it to work
> o Your NUMA scheduler - scales better than the 2.4 scheduler, but not as
>   well as Ingo's, but it didn't need the IDLE_REBALANCE_TICK hack.
>
> Here are some hackbench numbers that I collected:
>
> #     2.4      Ingo    NUMA
> --   ------    -----   ------
> 30   23.872    7.904   16.286
> 40   34.389    8.889   22.359
> 50   46.615    9.432   30.097

Ouch! I assume you left IDLE_LOOP_REBALANCE undefined in the patch?

--
Mike |
From: Jesse B. <jb...@sg...> - 2002-02-28 03:04:06
|
On Wed, Feb 27, 2002 at 11:41:27AM -0800, Mike Kravetz wrote:
> > #     2.4      Ingo    NUMA
> > --   ------    -----   ------
> > 30   23.872    7.904   16.286
> > 40   34.389    8.889   22.359
> > 50   46.615    9.432   30.097
>
> Ouch! I assume you left IDLE_LOOP_REBALANCE undefined in the patch?

Well, it's #defined to nothing at the moment, should I set it to anything in particular? I'd be happy to try out any additional hacks you might have and post some more numbers. I'll see if I can get better ones in the meantime...

Thanks,
Jesse |
From: Jesse B. <jb...@sg...> - 2002-02-28 03:07:54
|
On Wed, Feb 27, 2002 at 07:04:04PM -0800, Jesse Barnes wrote:
> Well, it's #defined to nothing at the moment, should I set it to
> anything in particular?

Of course, I looked at the code *after* I sent that last message, so now I know that #defining it to some value wouldn't make a difference. I'll poke around some more to see if I can find why hackbench takes so much longer...

Jesse |
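For anyone else chasing the same question: in the patch as posted, the option block near the top of kernel/sched.c reads as below. The #define lines for IDLE_LOOP_REBALANCE and BUSY_REBALANCE sit inside the surrounding comment blocks, so those two paths compile out regardless of what value they are given; only GOING_IDLE_REBALANCE and EXEC_BALANCE are active unless the #defines are moved outside the comments.

/*
 * Periodically poll for new (off cpu set) work as part of the idle loop
#define IDLE_LOOP_REBALANCE
 */
/*
 * Look for off cpu set work before going idle
 */
#define GOING_IDLE_REBALANCE
/*
 * Attempt to balance load between cpu sets (even when busy)
#define BUSY_REBALANCE
 */
/*
 * Send tasks to least loaded CPU at exec time
 */
#define EXEC_BALANCE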
From: Erich F. <fo...@es...> - 2002-02-28 17:41:42
|
On Wed, 27 Feb 2002, Jesse Barnes wrote:
> I was able to get your last version working on a 32p ia64 we have
> here. Here's what I came up with:
> ...
> Here are some hackbench numbers that I collected:
>
> #     2.4      Ingo    NUMA
> --   ------    -----   ------
> 30   23.872    7.904   16.286
> 40   34.389    8.889   22.359
> 50   46.615    9.432   30.097

Maybe the cpusets are really wrong, otherwise I can't understand the numbers. Another reason might be (sorry if I repeat this again and again): Mike's scheduler uses the cpu_set_data->load_avg for the balancing decision. This number can be pretty old (250ms) when comparing loads of the cpu sets...

This is what I'm getting on a 16 CPU AzusA (Itanium) with Ingo's O(1) scheduler based on the version in 2.5.6-pre1. The kernel is 2.4.17 otherwise. The results are averages of three measurements. NUMA_EF is my version of the node-affinity extensions without memory affinity, i.e. the memory is allocated more or less randomly.

hackbench    Ingo    NUMA_EF
---------    -----   -------
   30        3.058    3.181
   40        4.599    4.354
   50        6.412    5.976

I find it strange that the results you are showing for the Ingo scheduler scale so poorly; maybe there is still some basic problem there?

Best regards,
Erich |
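To make the staleness point concrete, the excerpt below (condensed from the patch above) is the only place cs_load_avg is written. On a busy CPU load_balance() runs from the timer only every BUSY_REBALANCE_TICK (HZ/4, roughly 250ms), and the update is reached only after a busier runqueue has actually been found, while balance_cpu_sets() and sched_exec_migrate() read the value whenever they happen to run.

        /* in load_balance(), after scanning the runqueues of the local set */
        total_set_load = total_set_load / cpus_in_set;
        if (total_set_load != this_rq->cpu_set->cs_load_avg) {
                spin_lock(&this_rq->cpu_set->cs_lock);
                this_rq->cpu_set->cs_load_avg = total_set_load;
                spin_unlock(&this_rq->cpu_set->cs_lock);
        }

        /* readers elsewhere in the patch:
         *   balance_cpu_sets():   this_load = this_set->cs_load_avg;
         *   sched_exec_migrate(): this_load_avg = cpu_sets[this_cs_id].cs_load_avg;
         * so cross-set decisions can be based on set loads that are up to
         * about a quarter of a second old. */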
From: Mike K. <kr...@us...> - 2002-02-28 17:55:18
|
On Thu, Feb 28, 2002 at 06:41:30PM +0100, Erich Focht wrote:
> Maybe the cpusets are really wrong, otherwise I can't understand the
> numbers. Another reason might be (sorry if I repeat this again and
> again): Mike's scheduler uses the cpu_set_data->load_avg for the balancing
> decision. This number can be pretty old (250ms) when comparing loads of
> the cpu sets...
>
> This is what I'm getting on a 16 CPU AzusA (Itanium) with Ingo's
> O(1) scheduler based on the version in 2.5.6-pre1. The kernel is 2.4.17
> otherwise. The results are averages of three measurements. NUMA_EF is my
> version of the node-affinity extensions without memory affinity, i.e. the
> memory is allocated more or less randomly.
>
> hackbench    Ingo    NUMA_EF
> ---------    -----   -------
>    30        3.058    3.181
>    40        4.599    4.354
>    50        6.412    5.976
>
> I find it strange that the results you are showing for the Ingo scheduler
> scale so poorly; maybe there is still some basic problem there?

There very well could be a basic problem in the code I provided. It only went through minimal testing on a 12 CPU Sequent NUMA-Q and an 8 CPU IBM Netfinity (SMP). My benchmarking included 'chat' and 'mkbench' (kernel compiles). Results were comparable to Ingo's scheduler.

Jesse, can you give Erich's patch a try?

--
Mike |
From: Erich F. <fo...@es...> - 2002-02-28 18:32:10
|
On Thu, 28 Feb 2002, Mike Kravetz wrote:
> Jesse, can you give Erich's patch a try?

I'll send out what's needed to use it, as it's based on Ingo's latest version of the scheduler (the one included in 2.5.6-pre1).

Regards,
Erich |
From: Jesse B. <jb...@sg...> - 2002-02-28 18:44:51
|
Do you have a patch against 2.4.17 + ia64 + kdb? That would be the easiest for me to apply...

Thanks,
Jesse

On Thu, Feb 28, 2002 at 07:32:02PM +0100, Erich Focht wrote:
> On Thu, 28 Feb 2002, Mike Kravetz wrote:
> > Jesse, can you give Erich's patch a try?
>
> I'll send out what's needed to use it, as it's based on Ingo's latest
> version of the scheduler (the one included in 2.5.6-pre1).
>
> Regards,
> Erich |
From: Erich F. <fo...@es...> - 2002-02-28 18:51:05
|
On Thu, 28 Feb 2002, Jesse Barnes wrote: > Do you have a patch against 2.4.17 + ia64 + kdb? That would be the > easiest for me to apply... Here it comes. You'll need the K3+ IA64 patch I've sent to LSE before, too. To get it running correctly you'll have to at least addapt include/asm-ia64/smp.h and add an SGI-specific function SAPICID_TO_NODE(). BTW: I just noticed that my test has been done with IDLE_REBALANCE_TICK set to 50ms ... I've changed that for debugging and didn't change it back :-/ Best regards, Erich diff -urN 2.4.17-ia64-kdbv2.1-k3y_al2/arch/ia64/kernel/smpboot.c 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/arch/ia64/kernel/smpboot.c --- 2.4.17-ia64-kdbv2.1-k3y_al2/arch/ia64/kernel/smpboot.c Thu Feb 28 19:28:16 2002 +++ 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/arch/ia64/kernel/smpboot.c Thu Feb 28 19:31:58 2002 @@ -83,6 +83,8 @@ /* which logical CPU number maps to which CPU (physical APIC ID) */ volatile int ia64_cpu_to_sapicid[NR_CPUS]; +char node_number[NR_CPUS] __cacheline_aligned; + static volatile unsigned long cpu_callin_map; struct smp_boot_data smp_boot_data __initdata; @@ -134,6 +136,23 @@ __setup("nointroute", nointroute); +/* simple sort routine for sorting the hardware CPU IDs */ +void __init +bub_sort(int n, int *a) +{ + int end, j, t; + + for (end = n-1; end >= 0; end--) { + for (j = 0; j < end; j++) { + if (a[j] > a[j+1]) { + t = a[j+1]; + a[j+1] = a[j]; + a[j] = t; + } + } + } +} + void sync_master (void *arg) { @@ -496,6 +515,13 @@ if (max_cpus != -1) printk (KERN_INFO "Limiting CPUs to %d\n", max_cpus); +#ifdef CONFIG_IA64_DIG + /* + * To be on the safe side: sort SAPIC IDs of CPUs + */ + bub_sort(smp_boot_data.cpu_count, &smp_boot_data.cpu_phys_id[0]); +#endif + if (smp_boot_data.cpu_count > 1) { printk(KERN_INFO "SMP: starting up secondaries.\n"); @@ -541,7 +567,24 @@ } smp_done: ; +#ifdef CONFIG_IA64_DIG + bld_node_number(); + bld_pools(); +#endif +} +#ifdef CONFIG_IA64_DIG +/* build translation table for CPU_TO_NODE macro */ +void __init +bld_node_number(void) +{ + int cpu; + + for (cpu = 0; cpu < NR_CPUS; cpu++) + if (cpu_online_map & (1<<cpu)) + node_number[cpu] = SAPICID_TO_NODE(cpu_physical_id(cpu)); } +#endif + /* * Assume that CPU's have been discovered by some platform-dependant interface. For @@ -575,3 +618,4 @@ * So let's try 10 ticks. 
*/ unsigned long cache_decay_ticks=10; + diff -urN 2.4.17-ia64-kdbv2.1-k3y_al2/fs/exec.c 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/fs/exec.c --- 2.4.17-ia64-kdbv2.1-k3y_al2/fs/exec.c Fri Dec 21 18:41:55 2001 +++ 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/fs/exec.c Thu Feb 28 19:31:58 2002 @@ -860,6 +860,10 @@ int retval; int i; +#ifdef CONFIG_SMP + sched_exec_migrate(); +#endif + file = open_exec(filename); retval = PTR_ERR(file); diff -urN 2.4.17-ia64-kdbv2.1-k3y_al2/include/asm-ia64/smp.h 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/include/asm-ia64/smp.h --- 2.4.17-ia64-kdbv2.1-k3y_al2/include/asm-ia64/smp.h Thu Feb 28 19:28:16 2002 +++ 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/include/asm-ia64/smp.h Thu Feb 28 19:31:58 2002 @@ -13,6 +13,7 @@ #ifdef CONFIG_SMP +#include <linux/cache.h> #include <linux/init.h> #include <linux/threads.h> #include <linux/kernel.h> @@ -113,6 +114,23 @@ #define NO_PROC_ID 0xffffffff /* no processor magic marker */ +extern char node_number[NR_CPUS] __cacheline_aligned; +#ifdef CONFIG_IA64_DIG +/* sooner or later this should be a configurable parameter [EF] */ +#define NR_NODES 4 +#define CPU_TO_NODE(cpu) node_number[cpu] +/* + * This is the node ID on the NEC AzusA, + * on LION and BigSur it correctly initializes to node 0 + */ +#define SAPICID_TO_NODE(hwid) ((hwid >> 12) & 0xff) +#else +/* need to be set for the specific platform! */ +#define NR_NODES 1 +#define CPU_TO_NODE(cpu) 0 +#define SAPICID_TO_NODE(hwid) 0 +#endif + extern void __init init_smp_config (void); extern void smp_do_timer (struct pt_regs *regs); diff -urN 2.4.17-ia64-kdbv2.1-k3y_al2/include/linux/prctl.h 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/include/linux/prctl.h --- 2.4.17-ia64-kdbv2.1-k3y_al2/include/linux/prctl.h Mon Feb 4 12:41:39 2002 +++ 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/include/linux/prctl.h Thu Feb 28 19:31:58 2002 @@ -26,4 +26,8 @@ # define PR_FPEMU_NOPRINT 1 /* silently emulate fp operations accesses */ # define PR_FPEMU_SIGFPE 2 /* don't emulate fp operations, send SIGFPE instead */ +/* Get/set node for node-affine scheduling */ +#define PR_GET_NODE 16 +#define PR_SET_NODE 17 + #endif /* _LINUX_PRCTL_H */ diff -urN 2.4.17-ia64-kdbv2.1-k3y_al2/include/linux/sched.h 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/include/linux/sched.h --- 2.4.17-ia64-kdbv2.1-k3y_al2/include/linux/sched.h Thu Feb 28 19:28:16 2002 +++ 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/include/linux/sched.h Thu Feb 28 19:34:33 2002 @@ -149,6 +149,7 @@ extern void update_one_process(task_t *p, unsigned long user, unsigned long system, int cpu); extern void scheduler_tick(int user_tick, int system); +extern void sched_migrate_task(task_t *p, int cpu); extern void migration_init(void); extern unsigned long cache_decay_ticks; @@ -314,6 +315,9 @@ unsigned long cpus_allowed; unsigned int time_slice; + int node; + list_t node_list; + task_t *next_task, *prev_task; struct mm_struct *mm, *active_mm; diff -urN 2.4.17-ia64-kdbv2.1-k3y_al2/kernel/fork.c 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/kernel/fork.c --- 2.4.17-ia64-kdbv2.1-k3y_al2/kernel/fork.c Thu Feb 28 19:28:16 2002 +++ 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/kernel/fork.c Thu Feb 28 19:31:58 2002 @@ -640,10 +640,6 @@ { int i; - if (likely(p->cpus_allowed & (1UL<<smp_processor_id()))) - p->cpu = smp_processor_id(); - else - p->cpu = __ffs(p->cpus_allowed); /* ?? should we just memset this ?? 
*/ for(i = 0; i < smp_num_cpus; i++) p->per_cpu_utime[cpu_logical_map(i)] = diff -urN 2.4.17-ia64-kdbv2.1-k3y_al2/kernel/sched.c 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/kernel/sched.c --- 2.4.17-ia64-kdbv2.1-k3y_al2/kernel/sched.c Thu Feb 28 19:53:58 2002 +++ 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/kernel/sched.c Thu Feb 28 19:35:29 2002 @@ -144,6 +144,9 @@ int prev_nr_running[NR_CPUS]; task_t *migration_thread; list_t migration_queue; + int nr_homenode[NR_NODES]; + int cpus_load[NR_CPUS]; + int pools_load[NR_NODES]; } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; @@ -163,6 +166,27 @@ return (p == task_rq(p)->curr); } +/* + * Variables for describing and accessing processor pools. Using a + * compressed row format like notation. + * + * numpools: number of CPU pools (nodes), + * pool_cpus[]: CPUs in pools sorted by their pool ID, + * pool_ptr[node]: index of first element in pool_cpus[] belonging to node. + * pool_mask[]: cpu mask of a pool. + * + * Example: loop over all CPUs in a pool p: + * for (i = pool_ptr[p]; i < pool_ptr[p+1]; i++) { + * cpu = pool_cpus[i]; + * ... + * } + * <ef...@es...> + */ +static int numpools; +static int pool_ptr[NR_NODES+1]; +static int pool_cpus[NR_CPUS]; +static unsigned long pool_mask[NR_NODES]; + static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags) { struct runqueue *rq; @@ -247,10 +271,12 @@ } enqueue_task(p, array); rq->nr_running++; + rq->nr_homenode[p->node]++; } static inline void deactivate_task(struct task_struct *p, runqueue_t *rq) { + rq->nr_homenode[p->node]--; rq->nr_running--; dequeue_task(p, p->array); p->array = NULL; @@ -471,6 +497,112 @@ } /* + * Find a runqueue from which to steal a task. We try to do this as locally as + * possible because we don't want to let tasks get far from their home node. + * This is done in two steps: + * 1. First try to find a runqueue within the own CPU pool (AKA node) with + * imbalance larger than 25% (relative to the current runqueue). + * 2. If the local node is well balanced, locate the most loaded node and its + * most loaded CPU. Remote runqueues running tasks having their homenode on the + * current node are preferred (those tasks count twice in the load calculation). + * + * This concept can be extended easilly to more than two levels (multi-level + * scheduler?), e.g.: CPU -> multi-core package -> node -> supernode... + * <ef...@es...> + */ +static runqueue_t *scan_pools(runqueue_t *this_rq, int idle, int nr_running, + int *imbalance) +{ + runqueue_t *busiest, *rq_src; + int i, ii, load, max_load, pool, max_pool_load, max_pool_idx, + best_cpu, + this_cpu = smp_processor_id(), + this_pool = CPU_TO_NODE(this_cpu); + + busiest = NULL; + max_load = 1; + + this_rq->pools_load[this_pool] = 0; + for (ii = pool_ptr[this_pool]; ii < pool_ptr[this_pool+1]; ii++) { + i = pool_cpus[ii]; + rq_src = cpu_rq(cpu_logical_map(i)); + if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) + load = rq_src->nr_running; + else + load = this_rq->prev_nr_running[i]; + this_rq->prev_nr_running[i] = rq_src->nr_running; + this_rq->pools_load[this_pool] += load; + + if ((load > max_load) && (rq_src != this_rq)) { + busiest = rq_src; + max_load = load; + } + } + + if (likely(!busiest)) + goto scan_all; + + *imbalance = (max_load - nr_running)/2; + /* It needs an at least ~25% imbalance to trigger balancing. 
*/ + if (idle || (*imbalance >= (max_load + 3)/4)) + goto out; + else + busiest = NULL; + + scan_all: + max_pool_load = this_rq->pools_load[this_pool]; + max_pool_idx = this_pool; + for (pool = 0; pool < numpools; pool++) { + if (pool == this_pool) continue; // current pool already done + this_rq->pools_load[pool] = 0; + for (ii = pool_ptr[pool]; ii < pool_ptr[pool+1]; ii++) { + i = pool_cpus[ii]; + rq_src = cpu_rq(cpu_logical_map(i)); + if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) + load = rq_src->nr_running; + else + load = this_rq->prev_nr_running[i]; + this_rq->prev_nr_running[i] = rq_src->nr_running; + /* prefer RQs which have tasks from this node running */ + if (rq_src->nr_homenode[this_pool]) + load++; + this_rq->cpus_load[i] = load; + this_rq->pools_load[pool] += load; + } + + if (this_rq->pools_load[pool] > max_pool_load) { + max_pool_load = this_rq->pools_load[pool]; + max_pool_idx = pool; + } + } + + if (likely(max_pool_idx == this_pool)) + goto out; + + *imbalance = (max_pool_load - this_rq->pools_load[this_pool])/2; + /* It needs an at least ~25% imbalance to trigger balancing. */ + if (!idle && (*imbalance < (max_pool_load + 3)/4)) + goto out; + + /* find most loaded CPU within pool from which we'll try to steal a task */ + best_cpu = pool_cpus[pool_ptr[max_pool_idx]]; + max_load = this_rq->cpus_load[best_cpu]; + for (ii = pool_ptr[max_pool_idx]+1; ii < pool_ptr[max_pool_idx+1]; ii++) { + i = pool_cpus[ii]; + if (this_rq->cpus_load[i] > max_load) { + max_load = this_rq->cpus_load[i]; + best_cpu = i; + } + } + *imbalance = (max_load - nr_running)/2; + /* It needs an at least ~25% imbalance to trigger balancing. */ + if (idle || (*imbalance >= (max_load + 3)/4)) + busiest = cpu_rq(cpu_logical_map(best_cpu)); + out: + return busiest; +} + +/* * Current runqueue is empty, or rebalance tick: if there is an * inbalance (current runqueue is too short) then pull from * busiest runqueue(s). @@ -480,12 +612,12 @@ */ static void load_balance(runqueue_t *this_rq, int idle) { - int imbalance, nr_running, load, max_load, - idx, i, this_cpu = smp_processor_id(); + int imbalance, nr_running, idx, this_cpu = smp_processor_id(); task_t *next = this_rq->idle, *tmp; - runqueue_t *busiest, *rq_src; + runqueue_t *busiest; prio_array_t *array; list_t *head, *curr; + int this_pool=CPU_TO_NODE(this_cpu), take_own; /* * We search all runqueues to find the most busy one. @@ -514,29 +646,10 @@ else nr_running = this_rq->prev_nr_running[this_cpu]; - busiest = NULL; - max_load = 1; - for (i = 0; i < smp_num_cpus; i++) { - rq_src = cpu_rq(cpu_logical_map(i)); - if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) - load = rq_src->nr_running; - else - load = this_rq->prev_nr_running[i]; - this_rq->prev_nr_running[i] = rq_src->nr_running; - if ((load > max_load) && (rq_src != this_rq)) { - busiest = rq_src; - max_load = load; - } - } - - if (likely(!busiest)) - return; - - imbalance = (max_load - nr_running) / 2; + busiest = scan_pools(this_rq, idle, nr_running, &imbalance); - /* It needs an at least ~25% imbalance to trigger balancing. */ - if (!idle && (imbalance < (max_load + 3)/4)) + if (!busiest) return; nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); @@ -548,6 +661,14 @@ goto out_unlock; /* + * Try to steal tasks coming from this_pool, if any + */ + if (busiest->nr_homenode[this_pool]) + take_own = 1; + else + take_own = 0; + + /* * We first consider expired tasks. 
Those will likely not be * executed in the near future, and they are most likely to * be cache-cold, thus switching CPUs has the least effect @@ -589,7 +710,8 @@ #define CAN_MIGRATE_TASK(p,rq,this_cpu) \ ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ ((p) != (rq)->curr) && \ - (tmp->cpus_allowed & (1 << (this_cpu)))) + (tmp->cpus_allowed & (1 << (this_cpu))) && \ + ((take_own && (tmp->node == this_pool)) || !take_own)) if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { curr = curr->next; @@ -605,9 +727,11 @@ */ dequeue_task(next, array); busiest->nr_running--; + busiest->nr_homenode[next->node]--; next->cpu = this_cpu; this_rq->nr_running++; enqueue_task(next, this_rq->active); + this_rq->nr_homenode[next->node]++; if (next->prio < current->prio) current->need_resched = 1; if (!idle && --imbalance) { @@ -630,7 +754,8 @@ * systems with HZ=100, every 10 msecs.) */ #define BUSY_REBALANCE_TICK (HZ/4 ?: 1) -#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) +//#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1) +#define IDLE_REBALANCE_TICK (HZ/20 ?: 50) static inline void idle_tick(void) { @@ -1377,6 +1502,133 @@ spin_unlock(&rq2->lock); } +/* used as counter for round-robin node-scheduling */ +static atomic_t sched_node=ATOMIC_INIT(0); + +/* + * Find the least loaded CPU on the current node of the task. + */ +int sched_best_cpu(struct task_struct *p) +{ + int nn, best_node = p->node, best_cpu, cpu, load, min_load; + + best_cpu = p->cpu; + min_load = task_rq(p)->nr_running; + for (nn = pool_ptr[best_node]; nn < pool_ptr[best_node+1]; nn++) { + cpu = cpu_logical_map(pool_cpus[nn]); + if (!(p->cpus_allowed & (1UL << cpu))) + continue; + load = cpu_rq(cpu)->nr_running; + if (load < min_load) { + best_cpu = cpu; + min_load = load; + } + } + return best_cpu; +} + +/* + * Find the (relatively) least loaded node for the given process taking into + * account the cpus_allowed mask. Try to Round Robin search for the best node. + */ +int sched_best_node(struct task_struct *p) +{ + int i, nn, node=0, best_node=0, load, min_load; + int pool_load[NR_NODES] = { [0 ... NR_NODES-1] = 0}; + int cpus_per_node[NR_NODES] = { [0 ... 
NR_NODES-1] = 0}; + unsigned long mask = p->cpus_allowed & cpu_online_map; + + if (numpools == 1) + goto out; + do { + best_node = atomic_inc_return(&sched_node) % numpools; + } while (!(pool_mask[best_node] & mask)); + + for (i = 0; i < smp_num_cpus; i++) { + int cpu = cpu_logical_map(i); + if (!(mask & (1<<cpu))) + continue; + load = cpu_rq(cpu)->nr_running; + node = CPU_TO_NODE(cpu); + cpus_per_node[node]++; + pool_load[node] += 1000*load; + } + min_load = pool_load[best_node] / cpus_per_node[node]; + for (nn = 1; nn < numpools; nn++) { + node = (best_node + nn) % numpools; + if (cpus_per_node[node] > 0) { + load = pool_load[node] / cpus_per_node[node]; + if (load < min_load) { + min_load = load; + best_node = node; + } + } + } + out: + atomic_set(&sched_node, best_node); + return best_node; +} + +void sched_exec_migrate(void) +{ + int new_cpu, new_node; + + if (numpools > 1) { + new_node = sched_best_node(current); + if (new_node != current->node) { + this_rq()->nr_homenode[current->node]--; + this_rq()->nr_homenode[new_node]++; + current->node = new_node; + } + } + new_cpu = sched_best_cpu(current); + if (new_cpu != smp_processor_id()) + sched_migrate_task(current, new_cpu); +} + + +void pools_info(void) +{ + int i, j; + + printk("CPU pools : %d\n",numpools); + for (i=0;i<numpools;i++) { + printk("pool %d :",i); + for (j=pool_ptr[i];j<pool_ptr[i+1];j++) + printk("%d ",pool_cpus[j]); + printk("\n"); + } +} + +void bld_pools(void) +{ + int i, j, ptr; + int flag[NR_CPUS] = { [ 0 ... NR_CPUS-1] = 0 }; + unsigned long mask; + + numpools = 0; + ptr = 0; + for (i = 0; i < smp_num_cpus; i++) { + if (!(cpu_online_map & (1<<i))) continue; + if (!flag[i]) { + pool_ptr[numpools]=ptr; + mask = 0UL; + for (j = 0; j < smp_num_cpus; j++) { + if (! (cpu_online_map & (1<<j))) continue; + if (i == j || CPU_TO_NODE(i) == CPU_TO_NODE(j)) { + pool_cpus[ptr++]=j; + flag[j]=1; + mask |= (1<<j); + } + } + pool_mask[numpools]=mask; + numpools++; + } + } + pool_ptr[numpools]=ptr; + pools_info(); +} + void __init init_idle(task_t *idle, int cpu) { runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(idle->cpu); @@ -1392,6 +1644,7 @@ idle->prio = MAX_PRIO; idle->state = TASK_RUNNING; idle->cpu = cpu; + idle->node = SAPICID_TO_NODE(cpu_physical_id(cpu)); double_rq_unlock(idle_rq, rq); idle->need_resched = 1; __restore_flags(flags); @@ -1425,7 +1678,14 @@ // delimiter for bitsearch __set_bit(MAX_PRIO, array->bitmap); } - } + for (j = 0; j < NR_NODES; j++) + rq->nr_homenode[j]=0; + pool_cpus[i] = i; + } + pool_ptr[0] = 0; + pool_ptr[1] = NR_CPUS; + numpools = 1; + pool_mask[0] = -1L; /* * We have to do a little magic to get the first * process right in SMP mode. 
@@ -1509,6 +1769,15 @@ down(&req.sem); } +void sched_migrate_task(task_t *p, int dest_cpu) +{ + unsigned long old_mask; + + old_mask = p->cpus_allowed; + set_cpus_allowed(p, 1UL << dest_cpu); + set_cpus_allowed(p, old_mask); +} + static volatile unsigned long migration_mask; static int migration_thread(void * unused) @@ -1539,7 +1808,8 @@ for (;;) { if (test_and_clear_bit(smp_processor_id(), &migration_mask)) - current->cpus_allowed = 1 << smp_processor_id(); + printk("migration_task on cpu=%d mask=%lx\n", + cpu(),current->cpus_allowed); if (current->need_resched) schedule(); if (!migration_mask) @@ -1576,7 +1846,7 @@ repeat: cpu_src = p->cpu; rq_src = cpu_rq(cpu_src); - + local_irq_save(flags); double_rq_lock(rq_src, rq_dest); if (p->cpu != cpu_src) { diff -urN 2.4.17-ia64-kdbv2.1-k3y_al2/kernel/sys.c 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/kernel/sys.c --- 2.4.17-ia64-kdbv2.1-k3y_al2/kernel/sys.c Thu Feb 28 19:28:17 2002 +++ 2.4.17-ia64-kdbv2.1-k3y-nod5_al2/kernel/sys.c Thu Feb 28 19:31:58 2002 @@ -1205,6 +1205,8 @@ { int error = 0; int sig; + int pid; + struct task_struct *child; switch (option) { case PR_SET_PDEATHSIG: @@ -1272,6 +1274,35 @@ } current->keep_capabilities = arg2; break; + case PR_GET_NODE: + pid = (int) arg3; + read_lock(&tasklist_lock); + child = find_task_by_pid(pid); + if (child) { + error = put_user(child->node,(int *)arg2); + } else { + printk("prctl: could not find process %d\n",pid); + error = -EINVAL; + } + read_unlock(&tasklist_lock); + break; + case PR_SET_NODE: + pid = (int) arg3; + read_lock(&tasklist_lock); + child = find_task_by_pid(pid); + if (child) { + if (child->uid == current->uid || \ + current->uid == 0) { + printk("setting node of process %d to %d\n",pid,(int)arg2); + child->node = (int)arg2; + } + } else { + printk("prctl: could not find process %d\n",pid); + error = -EINVAL; + } + read_unlock(&tasklist_lock); + break; + default: error = -EINVAL; break; |
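To experiment with the home-node assignment from user space, a minimal test
program against the PR_GET_NODE/PR_SET_NODE prctl options added above might
look like the sketch below. It is a hypothetical helper, not part of the
patch; per the sys.c hunk, arg2 carries the node value (or a pointer to it)
and arg3 the pid.

/* node_ctl.c -- query or set the home node of a task via prctl().
 * usage: node_ctl <pid>         -- print the task's home node
 *        node_ctl <pid> <node>  -- set the task's home node
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>

#ifndef PR_GET_NODE             /* values from the prctl.h hunk above */
#define PR_GET_NODE 16
#define PR_SET_NODE 17
#endif

int main(int argc, char **argv)
{
	int pid, node;

	if (argc < 2) {
		fprintf(stderr, "usage: %s pid [node]\n", argv[0]);
		return 1;
	}
	pid = atoi(argv[1]);
	if (argc == 2) {
		/* kernel side does put_user(child->node, (int *)arg2) */
		if (prctl(PR_GET_NODE, (unsigned long)&node, pid) < 0) {
			perror("PR_GET_NODE");
			return 1;
		}
		printf("pid %d home node: %d\n", pid, node);
	} else {
		node = atoi(argv[2]);
		if (prctl(PR_SET_NODE, node, pid) < 0) {
			perror("PR_SET_NODE");
			return 1;
		}
	}
	return 0;
}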
From: Jesse B. <jb...@sg...> - 2002-03-01 01:34:27
|
On Thu, Feb 28, 2002 at 06:41:30PM +0100, Erich Focht wrote:
> This is what I'm getting on a 16 CPU AzusA (Itanium) with Ingo's
> O(1) scheduler based on the version in 2.5.6-pre1. The kernel is 2.4.17,
> otherwise. The results are averages of three measurements. NUMA_EF is my
> version of the node-affinity extensions without memory affinity, i.e. the
> memory is allocated more or less randomly.
>
> hackbench  Ingo   NUMA_EF
> ---------  -----  -------
> 30         3.058  3.181
> 40         4.599  4.354
> 50         6.412  5.976
>
> I find it strange that the results you are showing for the Ingo scheduler
> scale so poorly, maybe there is still some basic problem there?

Possibly--TLB flushes are outrageously expensive on our test platform.
Anyway, I ran with your latest patch and came up with this on a 16p
IA64 machine. The system was somewhat unstable however--upon starting
a hackbench process, the system would sometimes hang.

 #  NUMA_EF  2.4.17 stock
--  -------  ------------
30    4.336        24.026
40    5.800        35.468
50    6.716        47.090

I added the CPU_TO_NODE and SAPICID_TO_NODE macros (easy enough), but
when are you planning to get rid of NR_NODES? Is it possible to just
use numnodes to kmalloc the arrays at init time?

Thanks,
Jesse
|
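A rough sketch of the cleanup Jesse asks about -- sizing the node arrays
from numnodes at init time instead of a compile-time NR_NODES -- could look
like this. It is a hypothetical fragment of kernel/sched.c, assuming
numnodes is valid and the slab allocator is up by the time bld_pools()
runs; it is not part of either patch.

#include <linux/slab.h>

static int numpools;
static int *pool_ptr;			/* numnodes + 1 entries */
static int pool_cpus[NR_CPUS];
static unsigned long *pool_mask;	/* numnodes entries */

void __init alloc_pool_arrays(void)
{
	pool_ptr  = kmalloc((numnodes + 1) * sizeof(*pool_ptr), GFP_KERNEL);
	pool_mask = kmalloc(numnodes * sizeof(*pool_mask), GFP_KERNEL);
	if (!pool_ptr || !pool_mask)
		panic("NUMA sched: cannot allocate pool arrays");
	/* the per-runqueue nr_homenode[] and pools_load[] arrays would
	 * need the same treatment, e.g. one kmalloc'ed block per runqueue */
}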
From: Erich F. <fo...@es...> - 2002-03-01 18:04:12
|
On Thu, 28 Feb 2002, Jesse Barnes wrote:
> Possibly--TLB flushes are outrageously expensive on our test platform.
> Anyway, I ran with your latest patch and came up with this on a 16p
> IA64 machine. The system was somewhat unstable however--upon starting
> a hackbench process, the system would sometimes hang.
>
>  #  NUMA_EF  2.4.17 stock
> --  -------  ------------
> 30    4.336        24.026
> 40    5.800        35.468
> 50    6.716        47.090

OK, that looks reasonable. Did you change IDLE_REBALANCE_TICK back to
1ms? I didn't see any crashes after running 50x hackbench 50...

> I added the CPU_TO_NODE and SAPICID_TO_NODE macros (easy enough), but
> when are you planning to get rid of NR_NODES? Is it possible to just
> use numnodes to kmalloc the arrays at init time?

I'll change that, it's annoying, you're right. But we don't have numnodes
without NUMA & DISCONTIG, I guess? Anyhow, I can count the nodes if
SAPICID_TO_NODE() works.

I forgot to mention: I changed the initial balancing from do_fork() to
do_exec(), but there is a small mistake in there... The current process
should not be counted when deciding where to move it to.

Best regards,
Erich
|
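One way to address the small mistake Erich mentions -- the exec'ing task
counting itself when its new node is chosen -- would be to discount it in
the load scan of sched_best_node(). A rough sketch against the loop from
the patch; the p == current test is an assumption about how the do_exec()
path would call it, not code from the patch.

	for (i = 0; i < smp_num_cpus; i++) {
		int cpu = cpu_logical_map(i);
		if (!(mask & (1 << cpu)))
			continue;
		load = cpu_rq(cpu)->nr_running;
		/* don't let the task being placed inflate its own CPU */
		if (p == current && cpu == p->cpu && load > 0)
			load--;
		node = CPU_TO_NODE(cpu);
		cpus_per_node[node]++;
		pool_load[node] += 1000 * load;
	}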
From: Jesse B. <jb...@sg...> - 2002-03-01 18:10:49
|
On Fri, Mar 01, 2002 at 07:04:06PM +0100, Erich Focht wrote:
> OK, that looks reasonable. Did you change IDLE_REBALANCE_TICK back to
> 1ms? I didn't see any crashes after running 50x hackbench 50...

No, I left it at 50. I'll try again with 1ms though, and maybe 10ms.

> I'll change that, it's annoying, you're right. But we don't have numnodes
> without NUMA & DISCONTIG, I guess? Anyhow, I can count the nodes if
> SAPICID_TO_NODE() works.

numnodes should be defined for all architectures & configs if I'm reading
mm/numa.c and mm/Makefile right, so it looks safe to use.

> I forgot to mention: I changed the initial balancing from do_fork() to
> do_execve(), but there is a small mistake in there... The current process
> should not be counted when deciding where to move it to.

I'll take a look at that too.

Jesse
|
From: Erich F. <fo...@es...> - 2002-02-25 18:32:21
|
Hello Mike, On Fri, 22 Feb 2002, Mike Kravetz wrote: > Below is preliminary patch to implement some form of NUMA scheduling > on top of Ingo's K3 scheduler patch for 2.4.17. This is VERY early > code and brings up some issues that need to be discussed/explored in > more detail. This patch was created to form a basis for discussion, > rather than as a solution. The patch was created for the i386 based > NUMA system I have access to. It will not work on other architectures. > However, the only architecture specific code is a call to initialize > some of the NUMA specific scheduling data structures. Therefore, it > should be trivial to port. I am yet another one working on the NUMA extension of Ingo's O(1) scheduler, having IA64 in mind, like Jesse, I guess. It's good to see that so many people take care of this issue, lets me hope that we'll get a good solution at the end. My approach is in some details different from yours therefore I'll append it to this email. It is for IA64 (DIG64 architectures), works fine on NEC's AzusA Itanium server and is also meant as a discussion basis, not as a finished solution/project. Some basic but minor differences to your patch are: - The 'cpu sets' are called cpu pools. Arrays for looping over the CPUs within one pool are provided. - It is built on top of Ingo's solution for set_cpus_allowed(), let's call it K3+ (for 2.4.17 kernels). The more important differences to your patch: - It adresses the notion of "home node" of a task, this should be the node from which a task should get its memory and on which it should preferably run. I'm using this feature together with a memory affinity patch not included here. - Each task gets an additional "node" field in its task_struct, - each runqueue tracks the number of tasks from different nodes, - tasks coming from a remote node are migrated earlier. - The load_balancing() concept is different: - there are no special time intervals for balancing across pool boundaries, the need for this can occur very quickly and I have the feeling that 2*250ms is a long time for keeping the nodes unbalanced. This means: each time load_balance() is called it _can_ balance across pool boundaries (but doesn't have to). - When load_balance() is called it first tries to balance the local cpu pool. If this one is well balanced, it will try to find the most loaded pool. - Each runqueue has its own statistics arrays, thus gathering the balancing information happens lockless. The information is as "fresh" as possible. Again I think that with 2*250ms the info in your cpu sets could be too old for certain cases. - Initial balancing is done in do_fork() instead of do_execve(). Again: I collect the current balance information because I have the feeling this should be as "new" as possible. Using do_fork() means that we don't have to call smp_migrate_task() (which went away in the latest version from Ingo, but certainly we'd find some replacement), such that the task has chances to live long on its initial CPU. - When balancing initially I tried to take into account the case of cpu pools of different sizes as well as cpus_allowed masks covering only parts of a cpu pool. I didn't address things like topology discovery, either. The topology is hidden behind a macro called CPU_TO_NODE() and is currently considered to have two levels (cpus and nodes). It is easy to extend this concept to a multi-level scheduler (e.g. cpus -> multi-core packages -> nodes -> supernodes) but I didn't have to (yet?). 
In such a scheduler the tasks wouldn't get too far from their home node because we would first try to balance locally and over short distances. Also the load balancing decisions are only based on the CPU loads, sure there are some more criteria we might want to use (memory consumption...). Coming back to the discussion, which is partly embedded in the description of the patch: I like the way how your cpu sets are defined, this is much more elegant than my solution. But your only way to address a set is by its cpu mask, which means that when you want to address a set you have to loop over all smp_num_cpus. With large numbers of cpus this could become a problem. Anyhow, this is minor and can be changed easilly. As well as the problem of having cpu sets of different sizes. Of more concern is the reuse of old balancing data. For busy CPUs I think the interval of 0.5s is far too long. Maybe this would make more sense if the data would be some running average, otherwise I would expect mistakes in the balancing decisions. Especially for the initial balancing, where I would like to decide on which node the memory of the task should be. But I still believe that this long term decision should be taken with the best data available! Just think about quickly forking several tasks: they might all go to the same node and lead to a terrible imbalance. Not that bad if you take them to another node after a short while, but quite bad if all the memory goes to their original node and most of the tasks have to access it with poor latency. Would be interesting to hear oppinions on initial balancing. What are the pros and cons of balancing at do_fork() or do_execve()? And it would be interesting to learn about other approaches, too... Best regards, Erich diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/arch/ia64/kernel/smpboot.c 2.4.17-ia64-kdbv2.1-K3ia64-nod3/arch/ia64/kernel/smpboot.c --- 2.4.17-ia64-kdbv2.1-K3ia64/arch/ia64/kernel/smpboot.c Mon Feb 18 12:26:44 2002 +++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/arch/ia64/kernel/smpboot.c Fri Feb 22 14:11:45 2002 @@ -83,6 +83,8 @@ /* which logical CPU number maps to which CPU (physical APIC ID) */ volatile int ia64_cpu_to_sapicid[NR_CPUS]; +char node_number[NR_CPUS] __cacheline_aligned; + static volatile unsigned long cpu_callin_map; struct smp_boot_data smp_boot_data __initdata; @@ -134,6 +136,23 @@ __setup("nointroute", nointroute); +/* simple sort routine for sorting the hardware CPU IDs */ +void __init +bub_sort(int n, int *a) +{ + int end, j, t; + + for (end = n-1; end >= 0; end--) { + for (j = 0; j < end; j++) { + if (a[j] > a[j+1]) { + t = a[j+1]; + a[j+1] = a[j]; + a[j] = t; + } + } + } +} + void sync_master (void *arg) { @@ -496,6 +515,13 @@ if (max_cpus != -1) printk (KERN_INFO "Limiting CPUs to %d\n", max_cpus); +#ifdef CONFIG_IA64_DIG + /* + * To be on the safe side: sort SAPIC IDs of CPUs + */ + bub_sort(smp_boot_data.cpu_count, &smp_boot_data.cpu_phys_id[0]); +#endif + if (smp_boot_data.cpu_count > 1) { printk(KERN_INFO "SMP: starting up secondaries.\n"); @@ -541,7 +567,24 @@ } smp_done: ; +#ifdef CONFIG_IA64_DIG + bld_node_number(); + bld_pools(); +#endif +} +#ifdef CONFIG_IA64_DIG +/* build translation table for CPU_TO_NODE macro */ +void __init +bld_node_number(void) +{ + int cpu; + + for (cpu = 0; cpu < NR_CPUS; cpu++) + if (cpu_online_map & (1<<cpu)) + node_number[cpu] = SAPICID_TO_NODE(cpu_physical_id(cpu)); } +#endif + /* * Assume that CPU's have been discovered by some platform-dependant interface. For @@ -575,3 +618,4 @@ * So let's try 10 ticks. 
*/ unsigned long cache_decay_ticks=10; + diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/include/asm-ia64/smp.h 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/asm-ia64/smp.h --- 2.4.17-ia64-kdbv2.1-K3ia64/include/asm-ia64/smp.h Thu Feb 21 19:33:12 2002 +++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/asm-ia64/smp.h Mon Feb 25 18:02:45 2002 @@ -13,6 +13,7 @@ #ifdef CONFIG_SMP +#include <linux/cache.h> #include <linux/init.h> #include <linux/threads.h> #include <linux/kernel.h> @@ -113,6 +114,23 @@ #define NO_PROC_ID 0xffffffff /* no processor magic marker */ +extern char node_number[NR_CPUS] __cacheline_aligned; +#ifdef CONFIG_IA64_DIG +/* sooner or later this should be a configurable parameter [EF] */ +#define NR_NODES 4 +#define CPU_TO_NODE(cpu) node_number[cpu] +/* + * This is the node ID on the NEC AzusA, + * on LION and BigSur it correctly initializes to node 0 + */ +#define SAPICID_TO_NODE(hwid) ((hwid >> 12) & 0xff) +#else +/* need to be set for the specific platform! */ +#define NR_NODES 1 +#define CPU_TO_NODE(cpu) 0 +#define SAPICID_TO_NODE(hwid) 0 +#endif + extern void __init init_smp_config (void); extern void smp_do_timer (struct pt_regs *regs); diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/include/linux/prctl.h 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/linux/prctl.h --- 2.4.17-ia64-kdbv2.1-K3ia64/include/linux/prctl.h Mon Feb 4 12:41:39 2002 +++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/linux/prctl.h Mon Feb 25 17:43:54 2002 @@ -26,4 +26,8 @@ # define PR_FPEMU_NOPRINT 1 /* silently emulate fp operations accesses */ # define PR_FPEMU_SIGFPE 2 /* don't emulate fp operations, send SIGFPE instead */ +/* Get/set node for node-affine scheduling */ +#define PR_GET_NODE 16 +#define PR_SET_NODE 17 + #endif /* _LINUX_PRCTL_H */ diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/include/linux/sched.h 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/linux/sched.h --- 2.4.17-ia64-kdbv2.1-K3ia64/include/linux/sched.h Thu Feb 21 19:33:12 2002 +++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/linux/sched.h Mon Feb 25 18:02:45 2002 @@ -314,6 +314,9 @@ unsigned long cpus_allowed; unsigned int time_slice; + int node; + list_t node_list; + task_t *next_task, *prev_task; struct mm_struct *mm, *active_mm; diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/kernel/fork.c 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/fork.c --- 2.4.17-ia64-kdbv2.1-K3ia64/kernel/fork.c Tue Feb 19 15:09:35 2002 +++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/fork.c Mon Feb 25 18:26:18 2002 @@ -639,11 +639,15 @@ #ifdef CONFIG_SMP { int i; + void sched_best_cpu(struct task_struct *p); +#if NR_NODES > 1 + void sched_best_node(struct task_struct *p); + + if (!(clone_flags & CLONE_VM)) + sched_best_node(p); +#endif + sched_best_cpu(p); - if (likely(p->cpus_allowed & (1UL<<smp_processor_id()))) - p->cpu = smp_processor_id(); - else - p->cpu = __ffs(p->cpus_allowed); /* ?? should we just memset this ?? */ for(i = 0; i < smp_num_cpus; i++) p->per_cpu_utime[cpu_logical_map(i)] = diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/kernel/sched.c 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/sched.c --- 2.4.17-ia64-kdbv2.1-K3ia64/kernel/sched.c Thu Feb 21 19:40:11 2002 +++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/sched.c Mon Feb 25 18:12:29 2002 @@ -144,6 +144,9 @@ int prev_nr_running[NR_CPUS]; task_t *migration_thread; list_t migration_queue; + int nr_homenode[NR_NODES]; + int cpus_load[NR_CPUS]; + int pools_load[NR_NODES]; } ____cacheline_aligned; static struct runqueue runqueues[NR_CPUS] __cacheline_aligned; @@ -163,6 +166,27 @@ return (p == task_rq(p)->curr); } +/* + * Variables for describing and accessing processor pools. 
Using a + * compressed row format like notation. + * + * numpools: number of CPU pools (nodes), + * pool_cpus[]: CPUs in pools sorted by their pool ID, + * pool_ptr[node]: index of first element in pool_cpus[] belonging to node. + * pool_mask[]: cpu mask of a pool. + * + * Example: loop over all CPUs in a pool p: + * for (i = pool_ptr[p]; i < pool_ptr[p+1]; i++) { + * cpu = pool_cpus[i]; + * ... + * } + * <ef...@es...> + */ +static int numpools; +static int pool_ptr[NR_NODES+1]; +static int pool_cpus[NR_CPUS]; +static unsigned long pool_mask[NR_NODES]; + static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags) { struct runqueue *rq; @@ -247,10 +271,12 @@ } enqueue_task(p, array); rq->nr_running++; + rq->nr_homenode[p->node]++; } static inline void deactivate_task(struct task_struct *p, runqueue_t *rq) { + rq->nr_homenode[p->node]--; rq->nr_running--; dequeue_task(p, p->array); p->array = NULL; @@ -341,7 +367,7 @@ void wake_up_forked_process(task_t * p) { - runqueue_t *rq = this_rq(); + runqueue_t *rq = task_rq(p); spin_lock_irq(&rq->lock); p->state = TASK_RUNNING; @@ -356,7 +382,7 @@ p->prio = effective_prio(p); } INIT_LIST_HEAD(&p->migration_list); - p->cpu = smp_processor_id(); + //p->cpu = smp_processor_id(); activate_task(p, rq); spin_unlock_irq(&rq->lock); init_MUTEX(&p->migration_sem); @@ -471,6 +497,110 @@ } /* + * Find a runqueue from which to steal a task. We try to do this as locally as + * possible because we don't want to let tasks get far from their home node. + * This is done in two steps: + * 1. First try to find a runqueue within the own CPU pool (AKA node) with + * imbalance larger than 25% (relative to the current runqueue). + * 2. If the local node is well balanced, locate the most loaded node and its + * most loaded CPU. Remote runqueues running tasks having their homenode on the + * current node are preferred (those tasks count twice in the load calculation). + * + * This concept can be extended easilly to more than two levels (multi-level + * scheduler?), e.g.: CPU -> multi-core package -> node -> supernode... + * <ef...@es...> + */ +static runqueue_t *scan_pools(runqueue_t *this_rq, int idle, int *curr_running, + int *imbalance) +{ + runqueue_t *busiest, *rq_src; + int i, ii, load, max_load, pool, max_pool_load, max_pool_idx, + best_cpu, nr_running = *curr_running, + this_cpu = smp_processor_id(), + this_pool = CPU_TO_NODE(this_cpu); + + busiest = NULL; + max_load = 1; + + this_rq->pools_load[this_pool] = 0; + for (ii = pool_ptr[this_pool]; ii < pool_ptr[this_pool+1]; ii++) { + i = pool_cpus[ii]; + rq_src = cpu_rq(cpu_logical_map(i)); + if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) + load = rq_src->nr_running; + else + load = this_rq->prev_nr_running[i]; + this_rq->prev_nr_running[i] = rq_src->nr_running; + this_rq->pools_load[this_pool] += load; + + if ((load > max_load) && (rq_src != this_rq)) { + busiest = rq_src; + max_load = load; + } + } + + if (likely(!busiest)) + goto scan_all; + + *imbalance = (max_load - nr_running)/2; + + /* It needs an at least ~25% imbalance to trigger balancing. 
*/ + if (idle || (*imbalance >= (max_load + 3)/4)) + goto out; + else + busiest = NULL; + + scan_all: + max_pool_load = this_rq->pools_load[this_pool]; + max_pool_idx = this_pool; + for (pool = 0; pool < numpools; pool++) { + if (pool == this_pool) continue; // current pool already done + this_rq->pools_load[pool] = 0; + for (ii = pool_ptr[pool]; ii < pool_ptr[pool+1]; ii++) { + i = pool_cpus[ii]; + rq_src = cpu_rq(cpu_logical_map(i)); + if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) + load = rq_src->nr_running; + else + load = this_rq->prev_nr_running[i]; + this_rq->prev_nr_running[i] = rq_src->nr_running; + /* prefer RQs which have tasks from this node running */ + load += rq_src->nr_homenode[this_pool]; + this_rq->cpus_load[i] = load; + this_rq->pools_load[pool] += load; + } + if (this_rq->pools_load[pool] > max_pool_load) { + max_pool_load = this_rq->pools_load[pool]; + max_pool_idx = pool; + } + } + if (likely(max_pool_idx == this_pool)) + goto out; + + *imbalance = (max_pool_load - this_rq->pools_load[this_pool])/2; + /* It needs an at least ~25% imbalance to trigger balancing. */ + if (!idle && (*imbalance < (max_pool_load + 3)/4)) + goto out; + + /* find most loaded CPU within pool from which we'll try to steal a task */ + best_cpu = pool_cpus[pool_ptr[max_pool_idx]]; + max_load = this_rq->cpus_load[best_cpu]; + for (ii = pool_ptr[max_pool_idx]+1; ii < pool_ptr[max_pool_idx+1]; ii++) { + i = pool_cpus[ii]; + if (this_rq->cpus_load[i] > max_load) { + max_load = this_rq->cpus_load[i]; + best_cpu = i; + } + } + *imbalance = (max_load - nr_running)/2; + /* It needs an at least ~25% imbalance to trigger balancing. */ + if (idle || (*imbalance >= (max_load + 3)/4)) + busiest = cpu_rq(cpu_logical_map(best_cpu)); + out: + return busiest; +} + +/* * Current runqueue is empty, or rebalance tick: if there is an * inbalance (current runqueue is too short) then pull from * busiest runqueue(s). @@ -480,12 +610,12 @@ */ static void load_balance(runqueue_t *this_rq, int idle) { - int imbalance, nr_running, load, max_load, - idx, i, this_cpu = smp_processor_id(); + int imbalance, nr_running, idx, this_cpu = smp_processor_id(); task_t *next = this_rq->idle, *tmp; - runqueue_t *busiest, *rq_src; + runqueue_t *busiest; prio_array_t *array; list_t *head, *curr; + int this_pool=CPU_TO_NODE(this_cpu), take_own; /* * We search all runqueues to find the most busy one. @@ -514,29 +644,9 @@ else nr_running = this_rq->prev_nr_running[this_cpu]; - busiest = NULL; - max_load = 1; - for (i = 0; i < smp_num_cpus; i++) { - rq_src = cpu_rq(cpu_logical_map(i)); - if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i])) - load = rq_src->nr_running; - else - load = this_rq->prev_nr_running[i]; - this_rq->prev_nr_running[i] = rq_src->nr_running; - - if ((load > max_load) && (rq_src != this_rq)) { - busiest = rq_src; - max_load = load; - } - } - if (likely(!busiest)) - return; - - imbalance = (max_load - nr_running) / 2; - - /* It needs an at least ~25% imbalance to trigger balancing. */ - if (!idle && (imbalance < (max_load + 3)/4)) + busiest = scan_pools(this_rq, idle, &nr_running, &imbalance); + if (!busiest) return; nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running); @@ -548,6 +658,14 @@ goto out_unlock; /* + * Try to steal tasks coming from this_pool, if any + */ + if (busiest->nr_homenode[this_pool]) + take_own = 1; + else + take_own = 0; + + /* * We first consider expired tasks. 
Those will likely not be * executed in the near future, and they are most likely to * be cache-cold, thus switching CPUs has the least effect @@ -589,7 +707,8 @@ #define CAN_MIGRATE_TASK(p,rq,this_cpu) \ ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \ ((p) != (rq)->curr) && \ - (tmp->cpus_allowed & (1 << (this_cpu)))) + (tmp->cpus_allowed & (1 << (this_cpu))) && \ + ((take_own && (tmp->node == this_pool)) || !take_own)) if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) { curr = curr->next; @@ -605,9 +724,11 @@ */ dequeue_task(next, array); busiest->nr_running--; + busiest->nr_homenode[next->node]--; next->cpu = this_cpu; this_rq->nr_running++; enqueue_task(next, this_rq->active); + this_rq->nr_homenode[next->node]++; if (next->prio < current->prio) current->need_resched = 1; if (!idle && --imbalance) { @@ -1436,31 +1557,111 @@ spin_unlock(&rq2->lock); } -int sched_move_task(task_t *p, int src_cpu, int tgt_cpu) +/* used as counter for round-robin node-scheduling */ +static atomic_t sched_node=ATOMIC_INIT(0); + +/* + * Find the least loaded CPU on the current node of the task. + */ +void sched_best_cpu(struct task_struct *p) { - int res = 0; - unsigned long flags; - runqueue_t *src = cpu_rq(src_cpu); - runqueue_t *tgt = cpu_rq(tgt_cpu); + int nn, best_node = p->node, best_cpu, cpu, load, min_load; - local_irq_save(flags); - double_rq_lock(src, tgt); - if (task_rq(p) != src || p == src->curr) - goto out; - if (p->cpu != tgt_cpu) { - p->cpu = tgt_cpu; - if (p->array) { - deactivate_task(p, src); - activate_task(p, tgt); + best_cpu = p->cpu; + min_load = task_rq(p)->nr_running; + for (nn = pool_ptr[best_node]; nn < pool_ptr[best_node+1]; nn++) { + cpu = cpu_logical_map(pool_cpus[nn]); + if (p->cpus_allowed & (1UL << cpu)) + continue; + load = cpu_rq(cpu)->nr_running; + if (load < min_load) { + best_cpu = cpu; + min_load = load; } } - res = 1; - out: - double_rq_unlock(src, tgt); - local_irq_restore(flags); - return res; + p->cpu = best_cpu; +} + +/* + * Find the (relatively) least loaded node for the given process taking into + * account the cpus_allowed mask. Try to Round Robin search for the best node. + */ +void sched_best_node(struct task_struct *p) +{ + int i, nn, node=0, best_node, load, min_load; + int pool_load[NR_NODES] = { [0 ... NR_NODES-1] = 0}; + int cpus_per_node[NR_NODES] = { [0 ... NR_NODES-1] = 0}; + unsigned long mask = p->cpus_allowed & cpu_online_map; + + do { + best_node = atomic_inc_return(&sched_node) % numpools; + } while (!(pool_mask[best_node] & mask)); + + for (i = 0; i < smp_num_cpus; i++) { + int cpu = cpu_logical_map(i); + if (!(mask & (1<<cpu))) + continue; + load = cpu_rq(cpu)->nr_running; + node = CPU_TO_NODE(cpu); + cpus_per_node[node]++; + pool_load[node] += 1000*load; + } + min_load = pool_load[best_node] / cpus_per_node[node]; + for (nn = 1; nn < numpools; nn++) { + node = (best_node + nn) % numpools; + if (cpus_per_node[node] > 0) { + load = pool_load[node] / cpus_per_node[node]; + if (load < min_load) { + min_load = load; + best_node = node; + } + } + } + p->node = best_node; + atomic_set(&sched_node, best_node); } +void pools_info(void) +{ + int i, j; + + printk("CPU pools : %d\n",numpools); + for (i=0;i<numpools;i++) { + printk("pool %d :",i); + for (j=pool_ptr[i];j<pool_ptr[i+1];j++) + printk("%d ",pool_cpus[j]); + printk("\n"); + } +} + +void bld_pools(void) +{ + int i, j, ptr; + int flag[NR_CPUS] = { [ 0 ... 
NR_CPUS-1] = 0 }; + unsigned long mask; + + numpools = 0; + ptr = 0; + for (i = 0; i < smp_num_cpus; i++) { + if (!(cpu_online_map & (1<<i))) continue; + if (!flag[i]) { + pool_ptr[numpools]=ptr; + mask = 0UL; + for (j = 0; j < smp_num_cpus; j++) { + if (! (cpu_online_map & (1<<j))) continue; + if (i == j || CPU_TO_NODE(i) == CPU_TO_NODE(j)) { + pool_cpus[ptr++]=j; + flag[j]=1; + mask |= (1<<j); + } + } + pool_mask[numpools]=mask; + numpools++; + } + } + pool_ptr[numpools]=ptr; + pools_info(); +} void __init init_idle(task_t *idle, int cpu) { @@ -1477,6 +1678,7 @@ idle->prio = MAX_PRIO; idle->state = TASK_RUNNING; idle->cpu = cpu; + idle->node = SAPICID_TO_NODE(cpu_physical_id(cpu)); double_rq_unlock(idle_rq, rq); idle->need_resched = 1; __restore_flags(flags); @@ -1510,6 +1712,8 @@ // delimiter for bitsearch __set_bit(MAX_PRIO, array->bitmap); } + for (j = 0; j < NR_NODES; j++) + rq->nr_homenode[j]=0; } /* * We have to do a little magic to get the first diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/kernel/sys.c 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/sys.c --- 2.4.17-ia64-kdbv2.1-K3ia64/kernel/sys.c Fri Feb 8 12:02:06 2002 +++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/sys.c Mon Feb 25 17:56:55 2002 @@ -1205,6 +1205,8 @@ { int error = 0; int sig; + int pid; + struct task_struct *child; switch (option) { case PR_SET_PDEATHSIG: @@ -1272,6 +1274,35 @@ } current->keep_capabilities = arg2; break; + case PR_GET_NODE: + pid = (int) arg3; + read_lock(&tasklist_lock); + child = find_task_by_pid(pid); + if (child) { + error = put_user(child->node,(int *)arg2); + } else { + printk("prctl: could not find process %d\n",pid); + error = -EINVAL; + } + read_unlock(&tasklist_lock); + break; + case PR_SET_NODE: + pid = (int) arg3; + read_lock(&tasklist_lock); + child = find_task_by_pid(pid); + if (child) { + if (child->uid == current->uid || \ + current->uid == 0) { + printk("setting node of process %d to %d\n",pid,(int)arg2); + child->node = (int)arg2; + } + } else { + printk("prctl: could not find process %d\n",pid); + error = -EINVAL; + } + read_unlock(&tasklist_lock); + break; + default: error = -EINVAL; break; |
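For readers new to the pool description in the patch, here is how the
compressed-row layout works out on a hypothetical 4-CPU, 2-node box (CPUs 0
and 1 on node 0, CPUs 2 and 3 on node 1); the numbers are purely
illustrative.

	/*
	 * numpools    = 2
	 * pool_cpus[] = { 0, 1, 2, 3 }
	 * pool_ptr[]  = { 0, 2, 4 }	node 0 spans [0,2), node 1 spans [2,4)
	 * pool_mask[] = { 0x3, 0xc }
	 *
	 * Looping over the CPUs of node 1 then reads:
	 */
	for (ii = pool_ptr[1]; ii < pool_ptr[2]; ii++) {
		cpu = pool_cpus[ii];	/* visits 2, then 3 */
		...
	}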
From: Martin J. B. <Mar...@us...> - 2002-02-25 18:54:46
|
> - The load_balancing() concept is different:
>   - there are no special time intervals for balancing across pool
>     boundaries, the need for this can occur very quickly and I
>     have the feeling that 2*250ms is a long time for keeping the
>     nodes unbalanced. This means: each time load_balance() is called
>     it _can_ balance across pool boundaries (but doesn't have to).

Imagine for a moment that there's a short spike in workload on one node.
By aggressively balancing across nodes, won't you incur a high cost
in terms of migrating all the cache data to the remote node (destroying
the cache on both the remote and local node), when it would be cheaper
to wait for a few more ms, and run on the local node? This is a
non-trivial problem to solve, and I'm not saying either approach is
correct, just that there are some disadvantages of being too aggressive.
Perhaps it's architecture dependent (I'm used to NUMA-Q, which has
caches on the interconnect, and a cache-miss access speed ratio of
about 20:1 remote:local).

> Would be interesting to hear oppinions on initial balancing. What are the
> pros and cons of balancing at do_fork() or do_execve()? And it would be
> interesting to learn about other approaches, too...

Presumably exec-time balancing is cheaper, since there are fewer shared
pages to be bounced around between nodes, but less effective if the main
load on the machine is one large daemon app, which just forks a few copies
of itself ... I would have thought that'd get sorted out a little later
anyway by the background rebalancing though?

M.
|
From: Andy P. <an...@os...> - 2002-02-25 23:34:51
|
> > Would be interesting to hear oppinions on initial balancing. What are the
> > pros and cons of balancing at do_fork() or do_execve()? And it would be
> > interesting to learn about other approaches, too...

I worked on a system several years ago that supported single-system image
and shared no memory between nodes (NORMA == NO Remote Memory Access), but
did have a very high performance, low-latency interconnect (100's of
megabytes/sec, a few 10's of ns latency for com startup). The ratios
between CPU Clock Rate / Local Memory / Offboard Memory were (at a gross
level) similar to today's GHz CPU's with on-chip L1, off-die L2, local
dram, wires + state machines + dram on some other node.

There was initially much debate about load balancing at fork time or at
exec time (static), followed by when and how often to rebalance already
running processes (dynamic). We eventually chose to statically balance at
exec time, using a possibly stale metric, because we wouldn't have to
spend time to create address space remotely (parent on node A, child on
node B), only to have it torn down a few clocks later by a subsequent
exec. (Our workload consisted almost entirely of fork+exec rather than
fork+fork+fork... ). The analogy here is that committing modified dcache
lines owned by CPU A and reheating the cache with them on CPU B, only to
throw them away by an exec a few clocks later may be similar to the
"rfork() vs. rexec()" choice we faced. If you rebalance do_exec(), you
are starting with an empty working set and a cold cache.

To balance at exec time, we used a separate process that would
periodically query (via a spanning tree) nodes within the "interactive
partition" for their current load average, compute which was the least
loaded (a heuristic that used a combination of factors: cpu utilization,
# of processes, memory utilization, is there a shell over there, etc.),
and then update the nodes (again via a spanning tree) with the current
global opinion as to the least loaded node. In the context of this
discussion, computing the loading metric at regular intervals *when
otherwise idle* would appear to be similar to the approach we used.

The static inter-node load leveling worked pretty well in practice
(although some said it wasn't much better than pure round-robin across
nodes), and it was non-fatal if the initial node selection was a poor
choice. The main problem was minimizing the storm of inbound rexec()'s
when a sudden burst of activity (say, with a make -j) on multiple nodes
at once could cause N-1 nodes to throw *everything* at the least loaded
node. I don't think this would be a problem in this case because there is
a single entity making a single choice, not multiple entities making
multiple choices in isolation.

Dynamic load leveling (moving an entire process from one node to another)
was always problematic for highly interactive workloads and a rash of
complexity issues well off topic from this discussion, but worked well
for long-running CPU bound tasks.

Andy
|
From: Erich F. <fo...@es...> - 2002-02-26 10:33:26
|
On Mon, 25 Feb 2002, Martin J. Bligh wrote: > > - The load_balancing() concept is different: > > - there are no special time intervals for balancing across pool > > boundaries, the need for this can occur very quickly and I > > have the feeling that 2*250ms is a long time for keeping the > > nodes unbalanced. This means: each time load_balance() is called > > it _can_ balance across pool boundaries (but doesn't have to). > > Imagine for a moment that there's a short spike in workload on one node. > By agressively balancing across nodes, won't you incur a high cost > in terms of migrating all the cache data to the remote node (destroying > the cache on both the remote and local node), when it would be cheaper > to wait for a few more ms, and run on the local node? This is a > non-trivial problem to solve, and I'm not saying either approach is > correct, just that there are some disadvantages of being too agressive. > Perhaps it's architecture dependant (I'm used to NUMA-Q, which has > caches on the interconnect, and a cache-miss access speed ratio of > about 20:1 remote:local). Well, maybe my description was a little bit misleading. My approach is not balancing much more aggressively, the difference is actually minimal, e.g. for 1ms ticks: Mike's approach: - idle CPU : load_balance() every 1ms (only within local node) balance_cpu_sets() every 2ms (balance across nodes) - busy CPU : load_balance() every 250ms balance_cpu_sets() every 500ms - schedule() : load_balance() if idle (only within local node) Erich's approach: - idle CPU : load_balance() every 1ms (first try balancing the local node, if already balanced (no CPU exceeds the current load by >25%) try to find a remote node with larger load than on the current one (>25% again)). - busy CPU : load_balance() every 250ms (same comment as above) - schedule() : load_balance() if idle (same comment as above). So the functional difference is not really that big here, I am also trying to balance locally first. If that fails (no imbalance), I try globally. The factor of 2 in the times is not so relevant, I think, and also I don't consider my approach significantly more aggressive. More significant is the difference in the data used for the balance decision: Mike: calculate load of a particular cpu set in the corresponding load_balance() call. Advantage: cheap (if spinlocks don't hurt there) Disadvantage: for busy CPUs it can be really old (250ms) Erich: calculate load when needed, at the load_balance() call, but not more than needed (normally only local node data, global data if needed, all lockless). Advantage: fresh, lockless Disadvantage: sometimes slower (when balancing across nodes) As Mike has mainly the cache affinity in mind, it doesn't really matter where a task is scheduled as long as it stays there long enough and the nodes are well balanced. A wrong scheduling decision (based on old data) will be fixed sooner or later (after x*250ms or so). My problem is that I need to decide on the home node of a process and will allocate all of its memory on that node. Therefore the decision must be as good as possible, I really want to keep the task as near to the home node as possible. If a task gets away from its home node (because of imbalance between the nodes) the scheduler will try to return it. Of course this should give us less load on the crossbar between the nodes and best memory access latency. 
> Presumably exec-time balancing is cheaper, since there are fewer shared
> pages to be bounced around between nodes, but less effective if the main
> load on the machine is one large daemon app, which just forks a few copies
> of itself ... I would have thought that'd get sorted out a little later
> anyway by the background rebalancing though?

OK, thanks. I agree with the first part of your reply. The last sentence
is true for Mike's approach but a bit more complicated with the boundary
condition of a process having its memory allocated on a particular node...

Thank you very much for your comments!

Best regards,
Erich
|
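For concreteness, the ~25% trigger both schedulers inherit from the O(1)
load_balance() works out as follows (the numbers are made up for
illustration): with the current queue at nr_running = 3 and the busiest
remote queue at max_load = 5, imbalance = (5 - 3)/2 = 1 while the threshold
(max_load + 3)/4 = 2, so a busy CPU leaves things alone and only an idle
one would pull; at max_load = 7 the imbalance becomes (7 - 3)/2 = 2, which
meets the threshold (7 + 3)/4 = 2, and a busy CPU starts stealing as well.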
From: Martin J. B. <Mar...@us...> - 2002-02-26 15:59:29
|
> Well, maybe my description was a little bit misleading. My approach is not > balancing much more aggressively, the difference is actually minimal, > e.g. for 1ms ticks: > > Mike's approach: > - idle CPU : load_balance() every 1ms (only within local node) > balance_cpu_sets() every 2ms (balance across nodes) > - busy CPU : load_balance() every 250ms > balance_cpu_sets() every 500ms > - schedule() : load_balance() if idle (only within local node) > > Erich's approach: > - idle CPU : load_balance() every 1ms (first try balancing the local > node, if already balanced (no CPU exceeds the > current load by >25%) try to find a remote > node with larger load than on the current one > (>25% again)). > - busy CPU : load_balance() every 250ms (same comment as above) > - schedule() : load_balance() if idle (same comment as above). Actually, if I understand what you're doing correctly, what you really want is to only shift when the load average on the source cpu (the one you're shifting from) has been above "limit" for a specified amount of time, rather than making the interval when we inspect longer (in order to avoid bouncing stuff around unnecessarily). > So the functional difference is not really that big here, I am also trying > to balance locally first. If that fails (no imbalance), I try > globally. The factor of 2 in the times is not so relevant, I think, and > also I don't consider my approach significantly more aggressive. When you say "balance", are you trying to balance the number of tasks in each queue, or the % of time each CPU is busy? It would seem OK to me to have 100 IO bound tasks on one node, and 4 cpu bound tasks on another (substitute "CPU" for "node" above at will). In fact, shifting around IO bound tasks will win you far less than moving around CPU bound tasks in general. > More significant is the difference in the data used for the balance > decision: > > Mike: calculate load of a particular cpu set in the corresponding > load_balance() call. > Advantage: cheap (if spinlocks don't hurt there) > Disadvantage: for busy CPUs it can be really old (250ms) > > Erich: calculate load when needed, at the load_balance() call, but not > more than needed (normally only local node data, global data if needed, > all lockless). > Advantage: fresh, lockless > Disadvantage: sometimes slower (when balancing across nodes) How are you calculating load? Load average? Over what period of time? Please forgive my idleness in not going and reading the code ;-) > As Mike has mainly the cache affinity in mind, it doesn't really matter > where a task is scheduled as long as it stays there long enough and the > nodes are well balanced. A wrong scheduling decision (based on old > data) will be fixed sooner or later (after x*250ms or so). Not sure I understand that - the wrong decision will cost you two moves, blowing not only the cache for your current process, but also whoever else's cache you accidentally trampled upon (which admittedly might be nothing useful). >> Presumably exec-time balancing is cheaper, since there are fewer shared >> pages to be bounced around between nodes, but less effective if the main >> load on the machine is one large daemon app, which just forks a few >> copies of itself ... I would have though that'd get sorted out a little >> later anyway by the background rebalancing though? > > OK, thanks. I agree with the first part of your reply. 
The last sentence > is true for Mike's approach but a bit more complicated with the boundary > condition of a process having its memory allocated on a particular node... Ah, now I see why you're doing what you're doing ;-) Maybe we're getting into the realm of processes providing hints to the kernel at fork time ... I need to think on this one some more ... Thanks, Martin. |
From: david o. <da...@us...> - 2002-02-26 17:43:44
|
On Tue, 26 Feb 2002, Martin J. Bligh wrote:
> When you say "balance", are you trying to balance the number of
> tasks in each queue, or the % of time each CPU is busy? It would
> seem OK to me to have 100 IO bound tasks on one node, and 4 cpu
> bound tasks on another (substitute "CPU" for "node" above at will).
>
> In fact, shifting around IO bound tasks will win you far less than
> moving around CPU bound tasks in general.

i would imagine you would want to equally distribute the IO bound and cpu
bound tasks. i don't think an IO bound process like /bin/bash has that
large a cache foot print; whereas 4 cpu bound processes on a single cpu
will be more likely to keep squashing each others cached data and code.
if you equally distribute cpu load among the processors, one could expect
(on average) that we'll get the best cache sharing. Is the assumption that
the more cpu bound a task is the more likely it is to have a larger cache
foot print incorrect?
|
From: Martin J. B. <Mar...@us...> - 2002-02-26 18:11:41
|
>> When you say "balance", are you trying to balance the number of
>> tasks in each queue, or the % of time each CPU is busy? It would
>> seem OK to me to have 100 IO bound tasks on one node, and 4 cpu
>> bound tasks on another (substitute "CPU" for "node" above at will).
>>
>> In fact, shifting around IO bound tasks will win you far less than
>> moving around CPU bound tasks in general.
>
> i would imagine you would want to equally distribute the IO bound and cpu
> bound tasks. i don't think an IO bound process like /bin/bash has that
> large a cache foot print; whereas 4 cpu bound processes on a single cpu
> will be more likely to keep squashing each others cached data and code.
> if you equally distribute cpu load among the processors, one could expect
> (on average) that we'll get the best cache sharing. Is the assumption that
> the more cpu bound a task is the more likely it is to have a larger cache
> foot print incorrect?

I'm not sure there's necessarily a correlation there. The RSS would
probably be a better guide to cache footprint. The reason I said that
shifting IO bound tasks around wouldn't win you much is just that it
doesn't really rebalance the queues much (in terms of how busy you're
keeping the CPU, rather than number of processes).

M.
|
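Martin's RSS remark suggests one concrete refinement. A speculative sketch,
not from either patch: let each runnable task contribute its resident set
size to a per-runqueue counter, so the balancer could compare "cache
weight" instead of raw nr_running. rss_running and task_weight() are
invented names; the updates would sit next to the existing nr_running
updates in activate_task(), deactivate_task() and the migration path of
load_balance().

	static inline unsigned long task_weight(task_t *p)
	{
		/* kernel threads have no mm and weigh nothing */
		return p->mm ? p->mm->rss : 0;
	}

	/* in activate_task(): */
		rq->nr_running++;
		rq->rss_running += task_weight(p);

	/* in deactivate_task(): */
		rq->rss_running -= task_weight(p);
		rq->nr_running--;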
From: Erich F. <fo...@es...> - 2002-02-27 16:57:02
|
> > Erich's approach:
> > - idle CPU : load_balance() every 1ms (first try balancing the local
> >              node, if already balanced (no CPU exceeds the
> >              current load by >25%) try to find a remote
> >              node with larger load than on the current one
> >              (>25% again)).
> > - busy CPU : load_balance() every 250ms (same comment as above)
> > - schedule() : load_balance() if idle (same comment as above).
>
> Actually, if I understand what you're doing correctly, what
> you really want is to only shift when the load average on
> the source cpu (the one you're shifting from) has been
> above "limit" for a specified amount of time, rather than
> making the interval when we inspect longer (in order to
> avoid bouncing stuff around unnecessarily).

I just tried to keep Ingo's original time intervals for load balancing
and the 25% (which he imposes) were taken as a balance condition for the
nodes, too. The delayed balancing is not implemented, but that would be
very nice to have. Davide Libenzi was using that in his scheduler
(incrementing a counter for a cpu which was busier than the current one,
resetting it when it wasn't, select that cpu if the counter reached some
number larger than the "distance" between the cpus). He was balancing
only when the cpu went idle, I'm not sure what would be the corresponding
treatment for busy balancing.

> When you say "balance", are you trying to balance the number of
> tasks in each queue, or the % of time each CPU is busy? It would
> seem OK to me to have 100 IO bound tasks on one node, and 4 cpu
> bound tasks on another (substitute "CPU" for "node" above at will).

Just as primitive as rq->nr_running (or the minimum of that value and the
one at the previous balance tick, as Ingo implemented it).

> In fact, shifting around IO bound tasks will win you far less than
> moving around CPU bound tasks in general.

Agreed. Not quite sure how to distinguish the IO bound tasks currently,
maybe somehow by their priority? Anyhow, tasks with state not equal to
TASK_RUNNING are deactivated by schedule() and don't show up in
rq->nr_running, also they can't be migrated by the load balancer. So
there's no problem with shells, etc...

> How are you calculating load? Load average? Over what period of time?
> Please forgive my idleness in not going and reading the code ;-)

Again: it's just nr_running (minimum of last two balance ticks).

> > As Mike has mainly the cache affinity in mind, it doesn't really matter
> > where a task is scheduled as long as it stays there long enough and the
> > nodes are well balanced. A wrong scheduling decision (based on old
> > data) will be fixed sooner or later (after x*250ms or so).
>
> Not sure I understand that - the wrong decision will cost you two
> moves, blowing not only the cache for your current process, but
> also whoever else's cache you accidentally trampled upon (which
> admittedly might be nothing useful).

Yes, you are right. That's why I think that the relatively old data which
Mike uses (can be 250ms old!) could lead to such a degradation.

Thanks,
best regards,
Erich
|
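The counter-based damping Erich attributes to Davide Libenzi could look
roughly like the sketch below. All names (busier_ticks[], cpu_distance(),
remote_worth_stealing()) are invented for illustration; this is not code
from either patch or from Davide's scheduler.

	/* Only treat a remote CPU as a candidate once it has looked busier
	 * than us for several consecutive balance ticks, the required
	 * streak growing with the "distance" between the two CPUs
	 * (e.g. 1 within a node, 4 across nodes).
	 */
	static int busier_ticks[NR_CPUS][NR_CPUS];	/* [this][remote] */

	static int remote_worth_stealing(int this_cpu, int remote_cpu,
					 int this_load, int remote_load)
	{
		if (remote_load > this_load)
			busier_ticks[this_cpu][remote_cpu]++;
		else
			busier_ticks[this_cpu][remote_cpu] = 0;

		return busier_ticks[this_cpu][remote_cpu] >
		       cpu_distance(this_cpu, remote_cpu);
	}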