From: Matthew D. <col...@us...> - 2004-10-07 00:52:06
|
This code is in no way complete. But since I brought it up in the "cpusets - big numa cpu and memory placement" thread, I figure the code needs to be posted. The basic idea is as follows: 1) Rip out sched_groups and move them into the sched_domains. 2) Add some reference counting, and eventually locking, to sched_domains. 3) Rewrite & simplify the way sched_domains are built and linked into a cohesive tree. This should allow us to support hotplug more easily, simply removing the domain belonging to the going-away CPU, rather than throwing away the whole domain tree and rebuilding from scratch. This should also allow us to support multiple, independent (ie: no shared root) domain trees which will facilitate isolated CPU groups and exclusive domains. I also hope this will allow us to leverage the existing topology infrastructure to build domains that closely resemble the physical structure of the machine automagically, thus making supporting interesting NUMA machines and SMT machines easier. This patch is just a snapshot in the middle of development, so there are certainly some uglies & bugs that will get fixed. That said, any comments about the general design are strongly encouraged. Heck, any feedback at all is welcome! :) Patch against 2.6.9-rc3-mm2. [mcd@arrakis mcd]$ diffstat ~/linux/patches/sched_domains/mcd-sched_domains-changes.patch include/asm-i386/topology.h | 5 include/asm-ia64/topology.h | 5 include/asm-ppc64/topology.h | 5 include/asm-x86_64/topology.h | 5 include/linux/sched.h | 21 - include/linux/topology.h | 10 kernel/sched.c | 710 +++++++++++++++++++----------------------- 7 files changed, 360 insertions(+), 401 deletions(-) -Matt diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/asm-i386/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/asm-i386/topology.h --- linux-2.6.9-rc3-mm2/include/asm-i386/topology.h 2004-10-04 14:38:56.000000000 -0700 +++ linux-2.6.9-rc3-mm2+mcd_sd/include/asm-i386/topology.h 2004-10-05 16:34:35.000000000 -0700 @@ -74,9 +74,10 @@ static inline cpumask_t pcibus_to_cpumas /* sched_domains SD_NODE_INIT for NUMAQ machines */ #define SD_NODE_INIT (struct sched_domain) { \ - .span = CPU_MASK_NONE, \ .parent = NULL, \ - .groups = NULL, \ + .span = CPU_MASK_NONE, \ + .refcnt = ATOMIC_INIT(0), \ + .cpu_power = 0, \ .min_interval = 8, \ .max_interval = 32, \ .busy_factor = 32, \ diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/asm-ia64/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/asm-ia64/topology.h --- linux-2.6.9-rc3-mm2/include/asm-ia64/topology.h 2004-10-04 14:38:56.000000000 -0700 +++ linux-2.6.9-rc3-mm2+mcd_sd/include/asm-ia64/topology.h 2004-10-05 16:36:57.000000000 -0700 @@ -47,9 +47,10 @@ void build_cpu_to_node_map(void); /* sched_domains SD_NODE_INIT for IA64 NUMA machines */ #define SD_NODE_INIT (struct sched_domain) { \ - .span = CPU_MASK_NONE, \ .parent = NULL, \ - .groups = NULL, \ + .span = CPU_MASK_NONE, \ + .refcnt = ATOMIC_INIT(0), \ + .cpu_power = 0, \ .min_interval = 80, \ .max_interval = 320, \ .busy_factor = 320, \ diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/asm-ppc64/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/asm-ppc64/topology.h --- linux-2.6.9-rc3-mm2/include/asm-ppc64/topology.h 2004-10-04 14:38:56.000000000 -0700 +++ linux-2.6.9-rc3-mm2+mcd_sd/include/asm-ppc64/topology.h 2004-10-05 16:35:55.000000000 -0700 @@ -42,9 +42,10 @@ static inline int node_to_first_cpu(int /* sched_domains SD_NODE_INIT for PPC64 machines */ #define SD_NODE_INIT (struct sched_domain) { \ - .span = CPU_MASK_NONE, \ .parent = NULL, \ - .groups = NULL, \ + .span = CPU_MASK_NONE, \ + .refcnt = ATOMIC_INIT(0), \ + .cpu_power = 0, \ .min_interval = 8, \ .max_interval = 32, \ .busy_factor = 32, \ diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/asm-x86_64/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/asm-x86_64/topology.h --- linux-2.6.9-rc3-mm2/include/asm-x86_64/topology.h 2004-10-04 14:38:57.000000000 -0700 +++ linux-2.6.9-rc3-mm2+mcd_sd/include/asm-x86_64/topology.h 2004-10-05 16:35:48.000000000 -0700 @@ -37,9 +37,10 @@ static inline cpumask_t __pcibus_to_cpum #ifdef CONFIG_NUMA /* sched_domains SD_NODE_INIT for x86_64 machines */ #define SD_NODE_INIT (struct sched_domain) { \ - .span = CPU_MASK_NONE, \ .parent = NULL, \ - .groups = NULL, \ + .span = CPU_MASK_NONE, \ + .refcnt = ATOMIC_INIT(0), \ + .cpu_power = 0, \ .min_interval = 8, \ .max_interval = 32, \ .busy_factor = 32, \ diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/linux/sched.h linux-2.6.9-rc3-mm2+mcd_sd/include/linux/sched.h --- linux-2.6.9-rc3-mm2/include/linux/sched.h 2004-10-04 14:38:59.000000000 -0700 +++ linux-2.6.9-rc3-mm2+mcd_sd/include/linux/sched.h 2004-10-05 16:33:08.000000000 -0700 @@ -414,22 +414,17 @@ enum idle_type #define SD_WAKE_BALANCE 32 /* Perform balancing at task wakeup */ #define SD_SHARE_CPUPOWER 64 /* Domain members share cpu power */ -struct sched_group { - struct sched_group *next; /* Must be a circular list */ - cpumask_t cpumask; - - /* - * CPU power of this group, SCHED_LOAD_SCALE being max power for a - * single CPU. This is read only (except for setup, hotplug CPU). - */ - unsigned long cpu_power; -}; - struct sched_domain { /* These fields must be setup */ struct sched_domain *parent; /* top domain must be null terminated */ - struct sched_group *groups; /* the balancing groups of the domain */ + struct list_head children; /* link to this domain's children */ + struct list_head peers; /* peer balancing domains of this domain */ cpumask_t span; /* span of all CPUs in this domain */ + atomic_t refcnt; /* number of references to this domain */ + unsigned long cpu_power; /* CPU power of this domain, + * SCHED_LOAD_SCALE being max power for + * a single CPU. This is read only (except + * for setup, hotplug CPU). */ unsigned long min_interval; /* Minimum balance interval ms */ unsigned long max_interval; /* Maximum balance interval ms */ unsigned int busy_factor; /* less balancing by factor if busy */ @@ -465,8 +460,6 @@ struct sched_domain { #ifdef ARCH_HAS_SCHED_DOMAIN /* Useful helpers that arch setup code may use. Defined in kernel/sched.c */ extern cpumask_t cpu_isolated_map; -extern void init_sched_build_groups(struct sched_group groups[], - cpumask_t span, int (*group_fn)(int cpu)); extern void cpu_attach_domain(struct sched_domain *sd, int cpu); #endif /* ARCH_HAS_SCHED_DOMAIN */ #endif /* CONFIG_SMP */ diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/include/linux/topology.h linux-2.6.9-rc3-mm2+mcd_sd/include/linux/topology.h --- linux-2.6.9-rc3-mm2/include/linux/topology.h 2004-10-04 14:38:59.000000000 -0700 +++ linux-2.6.9-rc3-mm2+mcd_sd/include/linux/topology.h 2004-10-05 16:34:07.000000000 -0700 @@ -80,9 +80,10 @@ static inline int __next_node_with_cpus( /* Common values for SMT siblings */ #ifndef SD_SIBLING_INIT #define SD_SIBLING_INIT (struct sched_domain) { \ - .span = CPU_MASK_NONE, \ .parent = NULL, \ - .groups = NULL, \ + .span = CPU_MASK_NONE, \ + .refcnt = ATOMIC_INIT(0), \ + .cpu_power = 0, \ .min_interval = 1, \ .max_interval = 2, \ .busy_factor = 8, \ @@ -106,9 +107,10 @@ static inline int __next_node_with_cpus( /* Common values for CPUs */ #ifndef SD_CPU_INIT #define SD_CPU_INIT (struct sched_domain) { \ - .span = CPU_MASK_NONE, \ .parent = NULL, \ - .groups = NULL, \ + .span = CPU_MASK_NONE, \ + .refcnt = ATOMIC_INIT(0), \ + .cpu_power = 0, \ .min_interval = 1, \ .max_interval = 4, \ .busy_factor = 64, \ diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.9-rc3-mm2/kernel/sched.c linux-2.6.9-rc3-mm2+mcd_sd/kernel/sched.c --- linux-2.6.9-rc3-mm2/kernel/sched.c 2004-10-04 14:39:00.000000000 -0700 +++ linux-2.6.9-rc3-mm2+mcd_sd/kernel/sched.c 2004-10-05 17:19:17.000000000 -0700 @@ -2209,32 +2209,35 @@ out: } /* - * find_busiest_group finds and returns the busiest CPU group within the + * find_busiest_child finds and returns the busiest child domain of the given * domain. It calculates and returns the number of tasks which should be * moved to restore balance via the imbalance parameter. */ -static struct sched_group * -find_busiest_group(struct sched_domain *sd, int this_cpu, +static struct sched_domain * +find_busiest_child(struct sched_domain *sd, int cpu, unsigned long *imbalance, enum idle_type idle) { - struct sched_group *busiest = NULL, *this = NULL, *group = sd->groups; - unsigned long max_load, avg_load, total_load, this_load, total_pwr; + struct sched_domain *busiest_domain, *local_domain, *domain; + unsigned long max_load, avg_load, total_load, local_load, total_pwr; - max_load = this_load = total_load = total_pwr = 0; - - do { + max_load = local_load = total_load = total_pwr = 0; + busiest_domain = local_domain = NULL; + list_for_each_entry(domain, &sd->children, peers) { unsigned long load; - int local_group; int i, nr_cpus = 0; - local_group = cpu_isset(this_cpu, group->cpumask); + if (unlikely(cpus_empty(domain->span))) + continue; + + if (cpu_isset(cpu, domain->span)) + local_domain = domain; - /* Tally up the load of all CPUs in the group */ + /* Tally up the load of all CPUs in the domain */ avg_load = 0; - for_each_cpu_mask(i, group->cpumask) { + for_each_cpu_mask(i, domain->span) { /* Bias balancing toward cpus of our domain */ - if (local_group) + if (domain == local_domain) load = target_load(i); else load = source_load(i); @@ -2243,34 +2246,27 @@ find_busiest_group(struct sched_domain * avg_load += load; } - if (!nr_cpus) - goto nextgroup; - total_load += avg_load; - total_pwr += group->cpu_power; + total_pwr += domain->cpu_power; - /* Adjust by relative CPU power of the group */ - avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power; + /* Adjust by relative CPU power of the domain */ + avg_load = (avg_load * SCHED_LOAD_SCALE) / domain->cpu_power; - if (local_group) { - this_load = avg_load; - this = group; - goto nextgroup; + if (domain == local_domain) { + local_load = avg_load; } else if (avg_load > max_load) { max_load = avg_load; - busiest = group; + busiest_domain = domain; } -nextgroup: - group = group->next; - } while (group != sd->groups); + } - if (!busiest || this_load >= max_load) + if (!busiest_domain || local_load >= max_load) goto out_balanced; avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr; - if (this_load >= avg_load || - 100*max_load <= sd->imbalance_pct*this_load) + if (local_load >= avg_load || + 100*max_load <= sd->imbalance_pct*local_load) goto out_balanced; /* @@ -2284,19 +2280,19 @@ nextgroup: * by pulling tasks to us. Be careful of negative numbers as they'll * appear as very large values with unsigned longs. */ - *imbalance = min(max_load - avg_load, avg_load - this_load); + *imbalance = min(max_load - avg_load, avg_load - local_load); /* How much load to actually move to equalise the imbalance */ - *imbalance = (*imbalance * min(busiest->cpu_power, this->cpu_power)) + *imbalance = (*imbalance * min(busiest_domain->cpu_power, local_domain->cpu_power)) / SCHED_LOAD_SCALE; if (*imbalance < SCHED_LOAD_SCALE - 1) { unsigned long pwr_now = 0, pwr_move = 0; unsigned long tmp; - if (max_load - this_load >= SCHED_LOAD_SCALE*2) { + if (max_load - local_load >= SCHED_LOAD_SCALE*2) { *imbalance = 1; - return busiest; + return busiest_domain; } /* @@ -2305,21 +2301,21 @@ nextgroup: * moving them. */ - pwr_now += busiest->cpu_power*min(SCHED_LOAD_SCALE, max_load); - pwr_now += this->cpu_power*min(SCHED_LOAD_SCALE, this_load); + pwr_now += busiest_domain->cpu_power*min(SCHED_LOAD_SCALE, max_load); + pwr_now += local_domain->cpu_power*min(SCHED_LOAD_SCALE, local_load); pwr_now /= SCHED_LOAD_SCALE; /* Amount of load we'd subtract */ - tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/busiest->cpu_power; + tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/busiest_domain->cpu_power; if (max_load > tmp) - pwr_move += busiest->cpu_power*min(SCHED_LOAD_SCALE, + pwr_move += busiest_domain->cpu_power*min(SCHED_LOAD_SCALE, max_load - tmp); /* Amount of load we'd add */ - tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/this->cpu_power; + tmp = SCHED_LOAD_SCALE*SCHED_LOAD_SCALE/local_domain->cpu_power; if (max_load < tmp) tmp = max_load; - pwr_move += this->cpu_power*min(SCHED_LOAD_SCALE, this_load + tmp); + pwr_move += local_domain->cpu_power*min(SCHED_LOAD_SCALE, local_load + tmp); pwr_move /= SCHED_LOAD_SCALE; /* Move if we gain another 8th of a CPU worth of throughput */ @@ -2327,19 +2323,19 @@ nextgroup: goto out_balanced; *imbalance = 1; - return busiest; + return busiest_domain; } /* Get rid of the scaling factor, rounding down as we divide */ *imbalance = (*imbalance + 1) / SCHED_LOAD_SCALE; - return busiest; + return busiest_domain; out_balanced: - if (busiest && (idle == NEWLY_IDLE || + if (busiest_domain && (idle == NEWLY_IDLE || (idle == SCHED_IDLE && max_load > SCHED_LOAD_SCALE)) ) { *imbalance = 1; - return busiest; + return busiest_domain; } *imbalance = 0; @@ -2347,15 +2343,15 @@ out_balanced: } /* - * find_busiest_queue - find the busiest runqueue among the cpus in group. + * find_busiest_queue - find the busiest runqueue among the cpus in @domain */ -static runqueue_t *find_busiest_queue(const struct sched_group *group) +static runqueue_t *find_busiest_queue(const struct sched_domain *domain) { unsigned long load, max_load = 0; runqueue_t *busiest = NULL; int i; - for_each_cpu_mask(i, group->cpumask) { + for_each_cpu_mask(i, domain->span) { load = source_load(i); if (load > max_load) { @@ -2368,30 +2364,30 @@ static runqueue_t *find_busiest_queue(co } /* - * Check this_cpu to ensure it is balanced within domain. Attempt to move + * Check @cpu to ensure it is balanced within @sd. Attempt to move * tasks if there is an imbalance. * - * Called with this_rq unlocked. + * Called with @rq unlocked. */ -static int load_balance(int this_cpu, runqueue_t *this_rq, - struct sched_domain *sd, enum idle_type idle) +static int load_balance(int cpu, runqueue_t *rq, struct sched_domain *sd, + enum idle_type idle) { - struct sched_group *group; - runqueue_t *busiest; + struct sched_domain *domain; + runqueue_t *busiest_rq; unsigned long imbalance; int nr_moved; - spin_lock(&this_rq->lock); + spin_lock(&rq->lock); schedstat_inc(sd, lb_cnt[idle]); - group = find_busiest_group(sd, this_cpu, &imbalance, idle); - if (!group) { + domain = find_busiest_child(sd, cpu, &imbalance, idle); + if (!domain) { schedstat_inc(sd, lb_nobusyg[idle]); goto out_balanced; } - busiest = find_busiest_queue(group); - if (!busiest) { + busiest_rq = find_busiest_queue(domain); + if (!busiest_rq) { schedstat_inc(sd, lb_nobusyq[idle]); goto out_balanced; } @@ -2401,7 +2397,7 @@ static int load_balance(int this_cpu, ru * balancing is inherently racy and statistical, * it could happen in theory. */ - if (unlikely(busiest == this_rq)) { + if (unlikely(busiest_rq == rq)) { WARN_ON(1); goto out_balanced; } @@ -2409,19 +2405,18 @@ static int load_balance(int this_cpu, ru schedstat_add(sd, lb_imbalance[idle], imbalance); nr_moved = 0; - if (busiest->nr_running > 1) { + if (busiest_rq->nr_running > 1) { /* - * Attempt to move tasks. If find_busiest_group has found - * an imbalance but busiest->nr_running <= 1, the group is + * Attempt to move tasks. If find_busiest_child has found + * an imbalance but busiest_rq->nr_running <= 1, the domain is * still unbalanced. nr_moved simply stays zero, so it is * correctly treated as an imbalance. */ - double_lock_balance(this_rq, busiest); - nr_moved = move_tasks(this_rq, this_cpu, busiest, - imbalance, sd, idle); - spin_unlock(&busiest->lock); + double_lock_balance(rq, busiest_rq); + nr_moved = move_tasks(rq, cpu, busiest_rq, imbalance, sd, idle); + spin_unlock(&busiest_rq->lock); } - spin_unlock(&this_rq->lock); + spin_unlock(&rq->lock); if (!nr_moved) { schedstat_inc(sd, lb_failed[idle]); @@ -2430,15 +2425,15 @@ static int load_balance(int this_cpu, ru if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) { int wake = 0; - spin_lock(&busiest->lock); - if (!busiest->active_balance) { - busiest->active_balance = 1; - busiest->push_cpu = this_cpu; + spin_lock(&busiest_rq->lock); + if (!busiest_rq->active_balance) { + busiest_rq->active_balance = 1; + busiest_rq->push_cpu = cpu; wake = 1; } - spin_unlock(&busiest->lock); + spin_unlock(&busiest_rq->lock); if (wake) - wake_up_process(busiest->migration_thread); + wake_up_process(busiest_rq->migration_thread); /* * We've kicked active balancing, reset the failure @@ -2455,7 +2450,7 @@ static int load_balance(int this_cpu, ru return nr_moved; out_balanced: - spin_unlock(&this_rq->lock); + spin_unlock(&rq->lock); /* tune up the balancing interval */ if (sd->balance_interval < sd->max_interval) @@ -2465,59 +2460,57 @@ out_balanced: } /* - * Check this_cpu to ensure it is balanced within domain. Attempt to move + * Check @cpu to ensure it is balanced within @sd. Attempt to move * tasks if there is an imbalance. * - * Called from schedule when this_rq is about to become idle (NEWLY_IDLE). - * this_rq is locked. + * Called from schedule() when @rq is about to become idle (NEWLY_IDLE). + * @rq is locked. */ -static int load_balance_newidle(int this_cpu, runqueue_t *this_rq, - struct sched_domain *sd) +static int load_balance_newidle(int cpu, runqueue_t *rq, struct sched_domain *sd) { - struct sched_group *group; - runqueue_t *busiest = NULL; + struct sched_domain *domain; + runqueue_t *busiest_rq = NULL; unsigned long imbalance; int nr_moved = 0; schedstat_inc(sd, lb_cnt[NEWLY_IDLE]); - group = find_busiest_group(sd, this_cpu, &imbalance, NEWLY_IDLE); - if (!group) { + domain = find_busiest_child(sd, cpu, &imbalance, NEWLY_IDLE); + if (!domain) { schedstat_inc(sd, lb_nobusyg[NEWLY_IDLE]); goto out; } - busiest = find_busiest_queue(group); - if (!busiest || busiest == this_rq) { + busiest_rq = find_busiest_queue(domain); + if (!busiest_rq || busiest_rq == rq) { schedstat_inc(sd, lb_nobusyq[NEWLY_IDLE]); goto out; } /* Attempt to move tasks */ - double_lock_balance(this_rq, busiest); + double_lock_balance(rq, busiest_rq); schedstat_add(sd, lb_imbalance[NEWLY_IDLE], imbalance); - nr_moved = move_tasks(this_rq, this_cpu, busiest, - imbalance, sd, NEWLY_IDLE); + nr_moved = move_tasks(rq, cpu, busiest_rq, imbalance, sd, NEWLY_IDLE); if (!nr_moved) schedstat_inc(sd, lb_failed[NEWLY_IDLE]); - spin_unlock(&busiest->lock); + spin_unlock(&busiest_rq->lock); out: return nr_moved; } /* - * idle_balance is called by schedule() if this_cpu is about to become + * idle_balance is called by schedule() if @cpu is about to become * idle. Attempts to pull tasks from other CPUs. */ -static inline void idle_balance(int this_cpu, runqueue_t *this_rq) +static inline void idle_balance(int cpu, runqueue_t *rq) { struct sched_domain *sd; - for_each_domain(this_cpu, sd) { + for_each_domain(cpu, sd) { if (sd->flags & SD_BALANCE_NEWIDLE) { - if (load_balance_newidle(this_cpu, this_rq, sd)) { + if (load_balance_newidle(cpu, rq, sd)) { /* We've pulled tasks over so stop searching */ break; } @@ -2531,40 +2524,37 @@ static inline void idle_balance(int this * running on each physical CPU where possible, and not have a physical / * logical imbalance. * - * Called with busiest locked. + * Called with busiest_rq locked. */ -static void active_load_balance(runqueue_t *busiest, int busiest_cpu) +static void active_load_balance(runqueue_t *busiest_rq, int busiest_cpu) { - struct sched_domain *sd; - struct sched_group *group, *busy_group; + struct sched_domain *balance_domain, *domain, *busiest_domain; int i; - schedstat_inc(busiest, alb_cnt); - if (busiest->nr_running <= 1) + schedstat_inc(busiest_rq, alb_cnt); + if (busiest_rq->nr_running <= 1) return; - for_each_domain(busiest_cpu, sd) - if (cpu_isset(busiest->push_cpu, sd->span)) + for_each_domain(busiest_cpu, balance_domain) + if (cpu_isset(busiest_rq->push_cpu, balance_domain->span)) break; - if (!sd) + if (!balance_domain) return; - group = sd->groups; - while (!cpu_isset(busiest_cpu, group->cpumask)) - group = group->next; - busy_group = group; - - group = sd->groups; - do { + list_for_each_entry(busiest_domain, &balance_domain->children, peers) { + if (cpu_isset(busiest_cpu, busiest_domain->span)) + break; + } + list_for_each_entry(domain, &balance_domain->children, peers) { runqueue_t *rq; int push_cpu = 0; - if (group == busy_group) - goto next_group; + if (domain == busiest_domain) + continue; - for_each_cpu_mask(i, group->cpumask) { + for_each_cpu_mask(i, domain->span) { if (!idle_cpu(i)) - goto next_group; + continue; push_cpu = i; } @@ -2576,19 +2566,17 @@ static void active_load_balance(runqueue * it can trigger.. Reported by Bjorn Helgaas on a * 128-cpu setup. */ - if (unlikely(busiest == rq)) - goto next_group; - double_lock_balance(busiest, rq); - if (move_tasks(rq, push_cpu, busiest, 1, sd, SCHED_IDLE)) { - schedstat_inc(busiest, alb_lost); + if (unlikely(busiest_rq == rq)) + continue; + double_lock_balance(busiest_rq, rq); + if (move_tasks(rq, push_cpu, busiest_rq, 1, balance_domain, SCHED_IDLE)) { + schedstat_inc(busiest_rq, alb_lost); schedstat_inc(rq, alb_gained); } else { - schedstat_inc(busiest, alb_failed); + schedstat_inc(busiest_rq, alb_failed); } spin_unlock(&rq->lock); -next_group: - group = group->next; - } while (group != sd->groups); + } } /* @@ -5024,7 +5012,7 @@ static int migration_call(struct notifie } spin_unlock_irq(&rq->lock); break; -#endif +#endif /* CONFIG_HOTPLUG_CPU */ } return NOTIFY_OK; } @@ -5049,37 +5037,6 @@ int __init migration_init(void) #endif #ifdef CONFIG_SMP -/* - * Attach the domain 'sd' to 'cpu' as its base domain. Callers must - * hold the hotplug lock. - */ -void __devinit cpu_attach_domain(struct sched_domain *sd, int cpu) -{ - migration_req_t req; - unsigned long flags; - runqueue_t *rq = cpu_rq(cpu); - int local = 1; - - spin_lock_irqsave(&rq->lock, flags); - - if (cpu == smp_processor_id() || !cpu_online(cpu)) { - rq->sd = sd; - } else { - init_completion(&req.done); - req.type = REQ_SET_DOMAIN; - req.sd = sd; - list_add(&req.list, &rq->migration_queue); - local = 0; - } - - spin_unlock_irqrestore(&rq->lock, flags); - - if (!local) { - wake_up_process(rq->migration_thread); - wait_for_completion(&req.done); - } -} - /* cpus with isolated domains */ cpumask_t __devinitdata cpu_isolated_map = CPU_MASK_NONE; @@ -5094,245 +5051,257 @@ static int __init isolated_cpu_setup(cha cpu_set(ints[i], cpu_isolated_map); return 1; } +__setup("isolcpus=", isolated_cpu_setup); -__setup ("isolcpus=", isolated_cpu_setup); /* - * init_sched_build_groups takes an array of groups, the cpumask we wish - * to span, and a pointer to a function which identifies what group a CPU - * belongs to. The return value of group_fn must be a valid index into the - * groups[] array, and must be >= 0 and < NR_CPUS (due to the fact that we - * keep track of groups covered with a cpumask_t). - * - * init_sched_build_groups will build a circular linked list of the groups - * covered by the given span, and will set each group's ->cpumask correctly, - * and ->cpu_power to 0. + * Calculate and set the cpu_power field of sched_domain @sd */ -void __devinit init_sched_build_groups(struct sched_group groups[], - cpumask_t span, int (*group_fn)(int cpu)) +static void __devinit set_cpu_power(struct sched_domain *sd) { - struct sched_group *first = NULL, *last = NULL; - cpumask_t covered = CPU_MASK_NONE; - int i; - - for_each_cpu_mask(i, span) { - int group = group_fn(i); - struct sched_group *sg = &groups[group]; - int j; - - if (cpu_isset(i, covered)) - continue; - - sg->cpumask = CPU_MASK_NONE; - sg->cpu_power = 0; - - for_each_cpu_mask(j, span) { - if (group_fn(j) != group) - continue; - - cpu_set(j, covered); - cpu_set(j, sg->cpumask); - } - if (!first) - first = sg; - if (last) - last->next = sg; - last = sg; + if (list_empty(&sd->children)) { + /* Leaf domain. Power = 1 + ((#cpus-1) * per_cpu_gain) */ + sd->cpu_power = 1 + + ((cpus_weight(sd->span) - 1) * sd->per_cpu_gain); + } else { + /* Interior domain. Power = sum of childrens' power. */ + struct sched_domain *child_domain; + sd->cpu_power = 0; + list_for_each_entry(child_domain, &sd->children, peers) + sd->cpu_power += child_domain->cpu_power; } - last->next = first; } +/* What type of sched_domain are we creating? */ +enum sd_type { + SD_SYSTEM = 0, + SD_NODE = 1, + SD_SIBLING = 2, +}; -#ifdef ARCH_HAS_SCHED_DOMAIN -extern void __devinit arch_init_sched_domains(void); -extern void __devinit arch_destroy_sched_domains(void); +/* + * Allocates a sched_domain, initializes it as @type, parents it to @parent, + * and links it into the list of its peers (if any). + * + * All the ugly ifdef'ing in this function is because I didn't want to rename + * an modify the SD_*_INIT functions in this patch. It makes it bigger and more + * confusing. Arch-specific initializers are what is necessary to fix this mess, + * but I'm not including that in this patch. Another reason for this confusion + * is that the domain initializers are named for what the domains are _composed of_ + * rather than what type of domains they are. Ex. SD_NODE_INIT is the initializer + * for a domain composed of NUMA nodes, rather than a domain that _is_ a NUMA node. + */ +struct sched_domain * __devinit create_domain(struct sched_domain *parent, + enum sd_type type) +{ + struct sched_domain *domain; + + domain = (struct sched_domain *)kmalloc(sizeof(struct sched_domain), GFP_KERNEL); + if (!domain) { + printk("Couldn't kmalloc sched_domain!\n"); + return NULL; + } + switch (type) { + case SD_SYSTEM: +#ifdef CONFIG_NUMA +/* Initializing system domain for NUMA machine: contains nodes */ + *domain = SD_NODE_INIT; #else #ifdef CONFIG_SCHED_SMT -static DEFINE_PER_CPU(struct sched_domain, cpu_domains); -static struct sched_group sched_group_cpus[NR_CPUS]; -static int __devinit cpu_to_cpu_group(int cpu) -{ - return cpu; -} +/* Initializing system domain for SMP + SMT machine: contains SMT siblings */ + *domain = SD_SIBLING_INIT; +#else +/* Initializing system domain for Flat SMP machine: contains CPUs */ + *domain = SD_CPU_INIT; #endif - -static DEFINE_PER_CPU(struct sched_domain, phys_domains); -static struct sched_group sched_group_phys[NR_CPUS]; -static int __devinit cpu_to_phys_group(int cpu) -{ +#endif + break; +#ifdef CONFIG_NUMA + case SD_NODE: #ifdef CONFIG_SCHED_SMT - return first_cpu(cpu_sibling_map[cpu]); +/* Initializing node domain for NUMA + SMT machine: contains SMT siblings */ + *domain = SD_SIBLING_INIT; #else - return cpu; +/* Initializing node domain for Flat NUMA machine: contains CPUs */ + *domain = SD_CPU_INIT; +#endif + break; #endif +#ifdef CONFIG_SCHED_SMT + case SD_SIBLING: +/* Initializing sibling domain: contains CPUs */ + *domain = SD_CPU_INIT; + break; +#endif + default: + printk("Invalid sd_type!\n"); + kfree(domain); + return NULL; + } + INIT_LIST_HEAD(&domain->children); + INIT_LIST_HEAD(&domain->peers); + /* MCD - Need locking here? */ + if (parent) { + domain->parent = parent; + atomic_inc(&domain->parent->refcnt); + list_add_tail(&domain->peers, &domain->parent->children); + } + return domain; } -#ifdef CONFIG_NUMA - -static DEFINE_PER_CPU(struct sched_domain, node_domains); -static struct sched_group sched_group_nodes[MAX_NUMNODES]; -static int __devinit cpu_to_node_group(int cpu) +/* + * Unlinks and destroys sched_domain @sd + */ +void __devinit destroy_domain(struct sched_domain *sd) { - return cpu_to_node(cpu); + if (!sd || !atomic_dec_and_test(&sd->refcnt)) + return; + kfree(sd); } -#endif /* - * Set up scheduler domains and groups. Callers must hold the hotplug lock. + * Attach the domain @sd to @cpu as its base domain. + * Callers must hold the hotplug lock. */ -static void __devinit arch_init_sched_domains(void) +void __devinit cpu_attach_domain(struct sched_domain *sd, int cpu) { - int i; - cpumask_t cpu_default_map; - - /* - * Setup mask for cpus without special case scheduling requirements. - * For now this just excludes isolated cpus, but could be used to - * exclude other special cases in the future. - */ - cpus_complement(cpu_default_map, cpu_isolated_map); - cpus_and(cpu_default_map, cpu_default_map, cpu_online_map); - - /* - * Set up domains. Isolated domains just stay on the dummy domain. - */ - for_each_cpu_mask(i, cpu_default_map) { - int group; - struct sched_domain *sd = NULL, *p; - cpumask_t nodemask = node_to_cpumask(cpu_to_node(i)); + struct sched_domain *old_sd, *domain; + migration_req_t req; + unsigned long flags; + runqueue_t *rq = cpu_rq(cpu); + int local = 1; - cpus_and(nodemask, nodemask, cpu_default_map); + /* Add @cpu to domain's span & compute its cpu_power */ + for (domain = sd; domain; domain = domain->parent) { + /* MCD - Should add locking here. Lock each domain while modifying */ + if (cpu_isset(cpu, domain->span)) + continue; + cpu_set(cpu, domain->span); + set_cpu_power(domain); + /* MCD - Unlock here */ + } -#ifdef CONFIG_NUMA - sd = &per_cpu(node_domains, i); - group = cpu_to_node_group(i); - *sd = SD_NODE_INIT; - sd->span = cpu_default_map; - sd->groups = &sched_group_nodes[group]; -#endif + spin_lock_irqsave(&rq->lock, flags); - p = sd; - sd = &per_cpu(phys_domains, i); - group = cpu_to_phys_group(i); - *sd = SD_CPU_INIT; - sd->span = nodemask; - sd->parent = p; - sd->groups = &sched_group_phys[group]; + old_sd = rq->sd; -#ifdef CONFIG_SCHED_SMT - p = sd; - sd = &per_cpu(cpu_domains, i); - group = cpu_to_cpu_group(i); - *sd = SD_SIBLING_INIT; - sd->span = cpu_sibling_map[i]; - cpus_and(sd->span, sd->span, cpu_default_map); - sd->parent = p; - sd->groups = &sched_group_cpus[group]; -#endif + if (cpu == smp_processor_id() || !cpu_online(cpu)) { + rq->sd = sd; + } else { + init_completion(&req.done); + req.type = REQ_SET_DOMAIN; + req.sd = sd; + list_add(&req.list, &rq->migration_queue); + local = 0; } -#ifdef CONFIG_SCHED_SMT - /* Set up CPU (sibling) groups */ - for_each_online_cpu(i) { - cpumask_t this_sibling_map = cpu_sibling_map[i]; - cpus_and(this_sibling_map, this_sibling_map, cpu_default_map); - if (i != first_cpu(this_sibling_map)) - continue; + spin_unlock_irqrestore(&rq->lock, flags); - init_sched_build_groups(sched_group_cpus, this_sibling_map, - &cpu_to_cpu_group); + if (!local) { + wake_up_process(rq->migration_thread); + wait_for_completion(&req.done); } -#endif - /* Set up physical groups */ - for (i = 0; i < MAX_NUMNODES; i++) { - cpumask_t nodemask = node_to_cpumask(i); + destroy_domain(old_sd); +} - cpus_and(nodemask, nodemask, cpu_default_map); - if (cpus_empty(nodemask)) - continue; +#ifndef ARCH_HAS_SCHED_DOMAIN +/* + * Set up scheduler domains. Callers must hold the hotplug lock. + */ +static void __devinit arch_init_sched_domains(void) +{ + struct sched_domain *sys_domain; + struct sched_domain *node_domains[MAX_NUMNODES]; + cpumask_t cpu_usable_map; + int i; - init_sched_build_groups(sched_group_phys, nodemask, - &cpu_to_phys_group); - } + /* + * Setup mask for cpus without special case scheduling requirements. + * For now this just excludes isolated cpus, but could be used to + * exclude other special cases in the future. + */ + cpus_complement(cpu_usable_map, cpu_isolated_map); + cpus_and(cpu_usable_map, cpu_usable_map, cpu_online_map); + sys_domain = create_domain(NULL, SD_SYSTEM); + BUG_ON(!sys_domain); + +/* MCD - Replace these #ifdef's with runtime variables? ie: numa_enabled, smt_enabled... */ #ifdef CONFIG_NUMA - /* Set up node groups */ - init_sched_build_groups(sched_group_nodes, cpu_default_map, - &cpu_to_node_group); -#endif + for_each_node(i) { + cpumask_t nodemask; + nodemask = node_to_cpumask(i); + if (!cpus_empty(nodemask)) + node_domains[i] = create_domain(sys_domain, SD_NODE); + } +#else /* !CONFIG_NUMA */ + node_domains[0] = sys_domain; +#endif /* CONFIG_NUMA */ - /* Calculate CPU power for physical packages and nodes */ - for_each_cpu_mask(i, cpu_default_map) { - int power; - struct sched_domain *sd; + for_each_cpu_mask(i, cpu_usable_map) { #ifdef CONFIG_SCHED_SMT - sd = &per_cpu(cpu_domains, i); - power = SCHED_LOAD_SCALE; - sd->groups->cpu_power = power; -#endif - - sd = &per_cpu(phys_domains, i); - power = SCHED_LOAD_SCALE + SCHED_LOAD_SCALE * - (cpus_weight(sd->groups->cpumask)-1) / 10; - sd->groups->cpu_power = power; + struct sched_domain *sibling_domain; + cpumask_t sibling_map = cpu_sibling_map[i]; + int j; -#ifdef CONFIG_NUMA - if (i == first_cpu(sd->groups->cpumask)) { - /* Only add "power" once for each physical package. */ - sd = &per_cpu(node_domains, i); - sd->groups->cpu_power += power; - } -#endif - } + cpus_and(sibling_map, sibling_map, cpu_usable_map); + if (i != first_cpu(sibling_map)) + continue; - /* Attach the domains */ - for_each_online_cpu(i) { - struct sched_domain *sd; -#ifdef CONFIG_SCHED_SMT - sd = &per_cpu(cpu_domains, i); -#else - sd = &per_cpu(phys_domains, i); -#endif - cpu_attach_domain(sd, i); + sibling_domain = create_domain(node_domains[cpu_to_node(i)], SD_SIBLING); + for_each_cpu_mask(j, sibling_map) + cpu_attach_domain(sibling_domain, j); +#else /* !CONFIG_SCHED_SMT */ + cpu_attach_domain(node_domains[cpu_to_node(i)], i); +#endif /* CONFIG_SCHED_SMT */ } } #ifdef CONFIG_HOTPLUG_CPU static void __devinit arch_destroy_sched_domains(void) { - /* Do nothing: everything is statically allocated. */ + /* MCD - Write this function */ } #endif -#endif /* ARCH_HAS_SCHED_DOMAIN */ +#else /* ARCH_HAS_SCHED_DOMAIN */ +extern void __devinit arch_init_sched_domains(void); +#ifdef CONFIG_HOTPLUG_CPU +extern void __devinit arch_destroy_sched_domains(void); +#endif +#endif /* !ARCH_HAS_SCHED_DOMAIN */ -#undef SCHED_DOMAIN_DEBUG +#define SCHED_DOMAIN_DEBUG #ifdef SCHED_DOMAIN_DEBUG -static void sched_domain_debug(void) +#define SD_DEBUG KERN_ALERT +static void __devinit sched_domain_debug(void) { - int i; + int cpu; - for_each_online_cpu(i) { - runqueue_t *rq = cpu_rq(i); - struct sched_domain *sd; + for (cpu = 0; cpu < NR_CPUS; cpu++) { + runqueue_t *rq = cpu_rq(cpu); + struct sched_domain *sd = NULL; int level = 0; - - sd = rq->sd; - - printk(KERN_DEBUG "CPU%d:\n", i); + + if (rq) + sd = rq->sd; + + printk(SD_DEBUG "CPU%d:", cpu); + if (!sd) { + printk(" not online.\n"); + continue; + } + printk("\n"); do { - int j; - char str[NR_CPUS]; - struct sched_group *group = sd->groups; + struct sched_domain *child; cpumask_t groupmask; + char str[NR_CPUS]; + int i; - cpumask_scnprintf(str, NR_CPUS, sd->span); - cpus_clear(groupmask); - - printk(KERN_DEBUG); - for (j = 0; j < level + 1; j++) + printk(SD_DEBUG); + for (i = 0; i < level + 1; i++) printk(" "); printk("domain %d: ", level); @@ -5344,71 +5313,62 @@ static void sched_domain_debug(void) break; } - printk("span %s\n", str); + cpumask_scnprintf(str, NR_CPUS, sd->span); + printk("span %s, cpu_power %ld\n", str, sd->cpu_power); - if (!cpu_isset(i, sd->span)) - printk(KERN_DEBUG "ERROR domain->span does not contain CPU%d\n", i); - if (!cpu_isset(i, group->cpumask)) - printk(KERN_DEBUG "ERROR domain->groups does not contain CPU%d\n", i); - if (!group->cpu_power) - printk(KERN_DEBUG "ERROR domain->cpu_power not set\n"); + if (!cpu_isset(cpu, sd->span)) + printk(SD_DEBUG "ERROR domain->span does not contain CPU%d\n", cpu); - printk(KERN_DEBUG); - for (j = 0; j < level + 2; j++) + cpus_clear(groupmask); + printk(SD_DEBUG); + for (i = 0; i < level + 1; i++) printk(" "); - printk("groups:"); - do { - if (!group) { - printk(" ERROR: NULL"); - break; + if (list_empty(&sd->children)) { + printk("Leaf domain, no children\n"); + } else { + printk("Interior domain, childrens' spans:"); + list_for_each_entry(child, &sd->children, peers) { + if (cpus_empty(child->span)) + printk(" ERROR child has empty span:"); + + if (cpus_intersects(groupmask, child->span)) + printk(" ERROR repeated CPUs:"); + + cpumask_scnprintf(str, NR_CPUS, child->span); + printk(" %s", str); + cpus_or(groupmask, groupmask, child->span); } + printk("\n"); - if (!cpus_weight(group->cpumask)) - printk(" ERROR empty group:"); - - if (cpus_intersects(groupmask, group->cpumask)) - printk(" ERROR repeated CPUs:"); - - cpus_or(groupmask, groupmask, group->cpumask); - - cpumask_scnprintf(str, NR_CPUS, group->cpumask); - printk(" %s", str); - - group = group->next; - } while (group != sd->groups); - printk("\n"); - - if (!cpus_equal(sd->span, groupmask)) - printk(KERN_DEBUG "ERROR groups don't span domain->span\n"); + if (!cpus_equal(sd->span, groupmask)) + printk(SD_DEBUG "ERROR children don't span domain->span\n"); + } level++; sd = sd->parent; if (sd) { if (!cpus_subset(groupmask, sd->span)) - printk(KERN_DEBUG "ERROR parent span is not a superset of domain->span\n"); + printk(SD_DEBUG "ERROR parent span is not a superset of domain->span\n"); } - } while (sd); } } -#else +#else /* !SCHED_DOMAIN_DEBUG */ #define sched_domain_debug() {} -#endif +#endif /* SCHED_DOMAIN_DEBUG */ -#ifdef CONFIG_SMP /* * Initial dummy domain for early boot and for hotplug cpu. Being static, * it is initialized to zero, so all balancing flags are cleared which is * what we want. */ static struct sched_domain sched_domain_dummy; -#endif #ifdef CONFIG_HOTPLUG_CPU /* * Force a reinitialization of the sched domains hierarchy. The domains - * and groups cannot be updated in place without racing with the balancing + * cannot be updated in place without racing with the balancing * code, so we temporarily attach all running cpus to a "dummy" domain * which will prevent rebalancing while the sched domains are recalculated. */ @@ -5444,7 +5404,7 @@ static int update_sched_domains(struct n return NOTIFY_OK; } -#endif +#endif /* CONFIG_HOTPLUG_CPU */ void __init sched_init_smp(void) { @@ -5455,7 +5415,7 @@ void __init sched_init_smp(void) /* XXX: Theoretical race here - CPU may be hotplugged now */ hotcpu_notifier(update_sched_domains, 0); } -#else +#else /* !CONFIG_SMP */ void __init sched_init_smp(void) { } |
From: Nick P. <nic...@ya...> - 2004-10-07 02:14:07
|
Matthew Dobson wrote: > This code is in no way complete. But since I brought it up in the > "cpusets - big numa cpu and memory placement" thread, I figure the code > needs to be posted. > > The basic idea is as follows: > > 1) Rip out sched_groups and move them into the sched_domains. > 2) Add some reference counting, and eventually locking, to > sched_domains. > 3) Rewrite & simplify the way sched_domains are built and linked into a > cohesive tree. > OK. I'm not sure that I like the direction, but... (I haven't looked too closely at it). > This should allow us to support hotplug more easily, simply removing the > domain belonging to the going-away CPU, rather than throwing away the > whole domain tree and rebuilding from scratch. Although what we have in -mm now should support CPU hotplug just fine. The hotplug guys really seem not to care how disruptive a hotplug operation is. > This should also allow > us to support multiple, independent (ie: no shared root) domain trees > which will facilitate isolated CPU groups and exclusive domains. I also Hmm, what was my word for them... yeah, disjoint. We can do that now, see isolcpus= for a subset of the functionality you want (doing larger exclusive sets would probably just require we run the setup code once for each exclusive set we want to build). > hope this will allow us to leverage the existing topology infrastructure > to build domains that closely resemble the physical structure of the > machine automagically, thus making supporting interesting NUMA machines > and SMT machines easier. > > This patch is just a snapshot in the middle of development, so there are > certainly some uglies & bugs that will get fixed. That said, any > comments about the general design are strongly encouraged. Heck, any > feedback at all is welcome! :) > > Patch against 2.6.9-rc3-mm2. This is what I did in my first (that nobody ever saw) implementation of sched domains. Ie. no sched_groups, just use sched_domains as the balancing object... I'm not sure this works too well. For example, your bottom level domain is going to basically be a redundant, single CPU on most topologies, isn't it? Also, how will you do overlapping domains that SGI want to do (see arch/ia64/kernel/domain.c in -mm kernels)? node2 wants to balance between node0, node1, itself, node3, node4. node4 wants to balance between node2, node3, itself, node5, node6. etc. I think your lists will get tangled, no? |
From: Jesse B. <jb...@en...> - 2004-10-07 17:07:47
|
On Wednesday, October 6, 2004 7:13 pm, Nick Piggin wrote: > Hmm, what was my word for them... yeah, disjoint. We can do that now, > see isolcpus= for a subset of the functionality you want (doing larger > exclusive sets would probably just require we run the setup code once > for each exclusive set we want to build). Yeah, and unfortunately since I added the code for overlapping domains w/o adding a top level domain at the same time, we have disjoint domains by default on large systems. > Also, how will you do overlapping domains that SGI want to do (see > arch/ia64/kernel/domain.c in -mm kernels)? > > node2 wants to balance between node0, node1, itself, node3, node4. > node4 wants to balance between node2, node3, itself, node5, node6. > etc. > > I think your lists will get tangled, no? Yeah, but overlapping domains aren't a requirement. In fact, making the scheduling domains dynamically configurable is probably a *much* better route, since I doubt that some default overlap setup will be optimal for many workloads (that doesn't mean we shouldn't have good defaults though). Being able to configure the rebalance and tick rates of the various domains would also be a good thing (the defaults could be keyed off of the number of CPUs and/or nodes in the domain). Thanks, Jesse |
From: Matthew D. <col...@us...> - 2004-10-07 21:58:57
|
On Wed, 2004-10-06 at 19:13, Nick Piggin wrote: > Matthew Dobson wrote: > > This code is in no way complete. But since I brought it up in the > > "cpusets - big numa cpu and memory placement" thread, I figure the code > > needs to be posted. > > > > The basic idea is as follows: > > > > 1) Rip out sched_groups and move them into the sched_domains. > > 2) Add some reference counting, and eventually locking, to > > sched_domains. > > 3) Rewrite & simplify the way sched_domains are built and linked into a > > cohesive tree. > > > > OK. I'm not sure that I like the direction, but... (I haven't looked > too closely at it). The patch is made somewhat larger by a lot of variable renaming because of the removal of sched_groups. A lot of s/group/domain/. The vast majority of the changes are in a rewrite of arch_init_sched_domains & it's assorted helpers. > > This should allow us to support hotplug more easily, simply removing the > > domain belonging to the going-away CPU, rather than throwing away the > > whole domain tree and rebuilding from scratch. > > Although what we have in -mm now should support CPU hotplug just fine. > The hotplug guys really seem not to care how disruptive a hotplug > operation is. I wasn't trying to imply that CPU hotplug isn't supported right now. But it is currently a very disruptive operation, throwing away the entire sched_domains & sched_groups tree and then rebuilding it from scratch just to remove a single CPU! I also understand that this is supposed to be a rare event (CPU hotplug), but that doesn't mean it *has* to be a slow, disruptive event. :) > > This should also allow > > us to support multiple, independent (ie: no shared root) domain trees > > which will facilitate isolated CPU groups and exclusive domains. I also > > Hmm, what was my word for them... yeah, disjoint. We can do that now, > see isolcpus= for a subset of the functionality you want (doing larger > exclusive sets would probably just require we run the setup code once > for each exclusive set we want to build). The current code doesn't, to my knowledge support multiple isolated domains. You can set up a single 'isolated' group with boot time options, but you can't set up *multiple* isolated groups, nor is there the ability to do any partitioning/isolation at runtime. This was more of the motivation for my code than the hotplug simplification. That was more of a side-benefit. > > hope this will allow us to leverage the existing topology infrastructure > > to build domains that closely resemble the physical structure of the > > machine automagically, thus making supporting interesting NUMA machines > > and SMT machines easier. > > > > This patch is just a snapshot in the middle of development, so there are > > certainly some uglies & bugs that will get fixed. That said, any > > comments about the general design are strongly encouraged. Heck, any > > feedback at all is welcome! :) > > > > Patch against 2.6.9-rc3-mm2. > > This is what I did in my first (that nobody ever saw) implementation of > sched domains. Ie. no sched_groups, just use sched_domains as the balancing > object... I'm not sure this works too well. > > For example, your bottom level domain is going to basically be a redundant, > single CPU on most topologies, isn't it? > > Also, how will you do overlapping domains that SGI want to do (see > arch/ia64/kernel/domain.c in -mm kernels)? > > node2 wants to balance between node0, node1, itself, node3, node4. > node4 wants to balance between node2, node3, itself, node5, node6. > etc. > > I think your lists will get tangled, no? Yes. I have to put my thinking cap on snug, but I don't think my version would support this kind of setup. It sounds, from Jesse's follow up to your mail, that this is not a requirement, though. I'll take a closer look at the IA64 code and see if it would be supported or if I could make some small changes to support it. Thanks for the feedback!! -Matt |
From: Nick P. <nic...@ya...> - 2004-10-08 00:22:51
|
Matthew Dobson wrote: >On Wed, 2004-10-06 at 19:13, Nick Piggin wrote: > >>Matthew Dobson wrote: >> > >>>This should allow us to support hotplug more easily, simply removing the >>>domain belonging to the going-away CPU, rather than throwing away the >>>whole domain tree and rebuilding from scratch. >>> >>Although what we have in -mm now should support CPU hotplug just fine. >>The hotplug guys really seem not to care how disruptive a hotplug >>operation is. >> > >I wasn't trying to imply that CPU hotplug isn't supported right now. >But it is currently a very disruptive operation, throwing away the >entire sched_domains & sched_groups tree and then rebuilding it from >scratch just to remove a single CPU! I also understand that this is >supposed to be a rare event (CPU hotplug), but that doesn't mean it >*has* to be a slow, disruptive event. :) > > Well no... but it already is disruptive :) > >>> This should also allow >>>us to support multiple, independent (ie: no shared root) domain trees >>>which will facilitate isolated CPU groups and exclusive domains. I also >>> >>Hmm, what was my word for them... yeah, disjoint. We can do that now, >>see isolcpus= for a subset of the functionality you want (doing larger >>exclusive sets would probably just require we run the setup code once >>for each exclusive set we want to build). >> > >The current code doesn't, to my knowledge support multiple isolated >domains. You can set up a single 'isolated' group with boot time >options, but you can't set up *multiple* isolated groups, nor is there >the ability to do any partitioning/isolation at runtime. This was more >of the motivation for my code than the hotplug simplification. That was >more of a side-benefit. > > No, the isolcpus= option allows you to set up n *single CPU* isolated domains. You currently can't setup isolated groups with multiple CPUs in them, no. You can't do runtime partitioning either. I think both would be pretty trivial to do though with the current code though. > >>>hope this will allow us to leverage the existing topology infrastructure >>>to build domains that closely resemble the physical structure of the >>>machine automagically, thus making supporting interesting NUMA machines >>>and SMT machines easier. >>> >>>This patch is just a snapshot in the middle of development, so there are >>>certainly some uglies & bugs that will get fixed. That said, any >>>comments about the general design are strongly encouraged. Heck, any >>>feedback at all is welcome! :) >>> >>>Patch against 2.6.9-rc3-mm2. >>> >>This is what I did in my first (that nobody ever saw) implementation of >>sched domains. Ie. no sched_groups, just use sched_domains as the balancing >>object... I'm not sure this works too well. >> >>For example, your bottom level domain is going to basically be a redundant, >>single CPU on most topologies, isn't it? >> >>Also, how will you do overlapping domains that SGI want to do (see >>arch/ia64/kernel/domain.c in -mm kernels)? >> >>node2 wants to balance between node0, node1, itself, node3, node4. >>node4 wants to balance between node2, node3, itself, node5, node6. >>etc. >> >>I think your lists will get tangled, no? >> > >Yes. I have to put my thinking cap on snug, but I don't think my >version would support this kind of setup. It sounds, from Jesse's >follow up to your mail, that this is not a requirement, though. I'll >take a closer look at the IA64 code and see if it would be supported or >if I could make some small changes to support it. > > I they might find that it will be a requirement. If not now, then soon. Your periodic balancing happens from the timer interrupt as you know... that means pulling a cacheline off every CPU. But anyway.. >Thanks for the feedback!! > OK... I still don't know exactly how your system is an improvement over what we have, but I'll try to be open minded :) |
From: Matthew D. <col...@us...> - 2004-10-07 22:22:49
|
On Wed, 2004-10-06 at 19:13, Nick Piggin wrote: > This is what I did in my first (that nobody ever saw) implementation of > sched domains. Ie. no sched_groups, just use sched_domains as the balancing > object... I'm not sure this works too well. > > For example, your bottom level domain is going to basically be a redundant, > single CPU on most topologies, isn't it? I forgot to respond to this part in my last mail... :( My benchmarks haven't shown any real deviation in performance from stock -mm. Granted, my benchmarking has been very limited, pretty much just running kernbench on a few different machines with different configs (ie: SMT, SMP & NUMA on/off). A performance NOOP is exactly what I was hoping for, though. I don't really expect these changes to be either a performance win or loss, but a functionality improvement. The patch is pretty dense because of all the renaming, but there are no single CPU domains. The lowest level domain would be: 1) Node domains, for NUMA w/ SMP 2) Sibling domains, for SMT or NUMA w/ SMT 3) System domain, for flat SMP The pseudo code version of my arch_init_sched_domains() looks like: cpu_usable_map = cpu_online_map & ~cpu_isolated_map create system domain; if NUMA for_each_node() create node domain, parented to system domain; for_each_cpu(cpu_usable_map) if SCHED_SMT create sibling domain, parented to node domain; attach sibling cpu to it's domain; else attach cpu to either its node domain or system domain; So there shouldn't be any redundant CPU domains. -Matt |
From: Marc E. F. <mef@CS.Princeton.EDU> - 2004-10-07 04:13:10
|
> ... thus making supporting interesting NUMA machines > and SMT machines easier. Would you be so kind and elaborate on the SMT part. Marc |
From: Paul J. <pj...@sg...> - 2004-10-07 05:39:20
|
Marc wrote: > > ... thus making supporting interesting NUMA machines > > and SMT machines easier. > > Would you be so kind and elaborate on the SMT part. Sure - easy - I mistyped "SMP". Even on a system as simple as a dual processor, tightly coupled HPC applications, using one thread per processor, run much better if indeed they are placed one thread to a processor, allowing for genuine parallelism. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Matthew D. <col...@us...> - 2004-10-07 22:07:53
|
On Wed, 2004-10-06 at 21:12, Marc E. Fiuczynski wrote: > > ... thus making supporting interesting NUMA machines > > and SMT machines easier. > > Would you be so kind and elaborate on the SMT part. > > Marc Contrary to Paul's posting that no one saw, SMT is not a typo. ;) What I was trying to get at is that there are already several differing implementations of SMT (Synchronous Multi-Threading) with different names and different characteristics. Right now, they're all kind of handled the same. In the future, however, I see each architecture defining their own SD_SIBLING_INIT for sibling domains, allowing their own cache timings, balancing rates, etc. I feel that it would be easier to support potentially complicated and/or dynamic sibling 'CPU' relationships with my patch. We've already run into some issues with hotplugging the siblings of 'real' CPUs on/off, and how the current sched_domains handles that. Currently, as the code is static and based on config options rather than runtime variables, it tends to leave a single CPU in it's own domain, balancing amongst itself with no sibling (b/c it's been hotplugged off and CONFIG_SCHED_SMT is on). My code was written with dynamic runtime changes in mind to prevent these kinds of suboptimal situations. -Matt |
From: Paul J. <pj...@sg...> - 2004-10-07 09:35:03
|
Never mind my s/SMT/SMP/ post ... please pretend you never saw it ;) -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Takayoshi K. <t-...@bq...> - 2004-10-08 05:59:02
|
Hi, From: Jesse Barnes <jb...@en...> Subject: [Lse-tech] Re: [RFC PATCH] scheduler: Dynamic sched_domains Date: Thu, 7 Oct 2004 10:01:07 -0700 > On Wednesday, October 6, 2004 7:13 pm, Nick Piggin wrote: > > Hmm, what was my word for them... yeah, disjoint. We can do that now, > > see isolcpus= for a subset of the functionality you want (doing larger > > exclusive sets would probably just require we run the setup code once > > for each exclusive set we want to build). > > Yeah, and unfortunately since I added the code for overlapping domains w/o > adding a top level domain at the same time, we have disjoint domains by > default on large systems. Yup, if SD_NODES_PER_DOMAIN is set to 4, our 32-way TX-7 have two disjoint domains ;( (though the current default is 6 for ia64...) I think the default configuration of the scheduler domains should be as identical to its real hardware topology as possible, and should modify the default only when necessary (e.g. for Altix). Right now with the sched domain scheduler, we have to setup the domain hierarcy only at boot time statically, which makes it harder to find the optimal domain topology/parameter. The dynamic patch makes it easier to modify the default configuration. If the scheduler gains more dynamic configurability like what Jesse said, it adds more flexibility for runtime optimization and seems a way to go. I'm not sure runtime configurability of domain topology is necessary for all users, but it's extremely useful for developers. I'll look into the Matt's patch further. > > Also, how will you do overlapping domains that SGI want to do (see > > arch/ia64/kernel/domain.c in -mm kernels)? > > > > node2 wants to balance between node0, node1, itself, node3, node4. > > node4 wants to balance between node2, node3, itself, node5, node6. > > etc. > > > > I think your lists will get tangled, no? > > Yeah, but overlapping domains aren't a requirement. In fact, making the > scheduling domains dynamically configurable is probably a *much* better > route, since I doubt that some default overlap setup will be optimal for many > workloads (that doesn't mean we shouldn't have good defaults though). Being > able to configure the rebalance and tick rates of the various domains would > also be a good thing (the defaults could be keyed off of the number of CPUs > and/or nodes in the domain). --- Takayoshi Kochi |
From: Nick P. <nic...@ya...> - 2004-10-08 06:08:15
|
Takayoshi Kochi wrote: > Yup, if SD_NODES_PER_DOMAIN is set to 4, our 32-way TX-7 have > two disjoint domains ;( > (though the current default is 6 for ia64...) > > I think the default configuration of the scheduler domains should be > as identical to its real hardware topology as possible, and should > modify the default only when necessary (e.g. for Altix). > That is the idea. Unfortunately the ia64 modifications are ia64 wide. I don't think it should be too hard to make it sn2 only. > Right now with the sched domain scheduler, we have to setup the > domain hierarcy only at boot time statically, which makes it harder to > find the optimal domain topology/parameter. The dynamic patch > makes it easier to modify the default configuration. > No you don't have to. If you have a look at the work in -mm, basically the whole thing gets recreated on every hoplug operation. It would be trivial to modify some parameters then reinit the domains in the same way. N disjoint domains can be trivially handled by making N passes over the init code, each using a different set of CPUs as its "cpu_possible_map". This can easily be done dynamically by using the above method. > If the scheduler gains more dynamic configurability like what Jesse > said, it adds more flexibility for runtime optimization and seems > a way to go. I'm not sure runtime configurability of domain topology > is necessary for all users, but it's extremely useful for developers. > That would be nice. The patch queue is pretty well clogged up at the moment, so I'm not going to look at the scheduler again until all the patches from -mm get into 2.6.10-bk. |
From: Jesse B. <jb...@en...> - 2004-10-08 16:44:02
|
On Thursday, October 7, 2004 11:08 pm, Nick Piggin wrote: > Takayoshi Kochi wrote: > > Yup, if SD_NODES_PER_DOMAIN is set to 4, our 32-way TX-7 have > > two disjoint domains ;( > > (though the current default is 6 for ia64...) > > > > I think the default configuration of the scheduler domains should be > > as identical to its real hardware topology as possible, and should > > modify the default only when necessary (e.g. for Altix). > > That is the idea. Unfortunately the ia64 modifications are ia64 wide. > I don't think it should be too hard to make it sn2 only. The NEC and Altix machines both use a SLIT table to describe the machine layout, so it should be possible to build them correctly w/o special case code (I hope). The question is how big to make them, but if that's runtime changeable, then no big deal. Like I said, the main thing missing from my changes is a system wide domain, but I think John has some ideas about that. Jesse |
From: Erich F. <ef...@hp...> - 2004-10-08 10:16:44
|
Hi Matthew, On Thursday 07 October 2004 02:51, Matthew Dobson wrote: > 1) Rip out sched_groups and move them into the sched_domains. > 2) Add some reference counting, and eventually locking, to > sched_domains. > 3) Rewrite & simplify the way sched_domains are built and linked into a > cohesive tree. >=20 > This should allow us to support hotplug more easily, simply removing the > domain belonging to the going-away CPU, rather than throwing away the > whole domain tree and rebuilding from scratch. =A0This should also allow > us to support multiple, independent (ie: no shared root) domain trees > which will facilitate isolated CPU groups and exclusive domains. =A0I also > hope this will allow us to leverage the existing topology infrastructure > to build domains that closely resemble the physical structure of the > machine automagically, thus making supporting interesting NUMA machines > and SMT machines easier. more flexibility in building the sched_domains is badly needed, so your effort towards providing this is the right step. I'm not sure yet whether your big change is really (and already) a simplification, but what you described sounded for me like getting the chance to configure the sched_domains at runtime, dynamically, from user space. I didn't notice any user interface in your patch, or overlooked it. Could you please describe the API you had in mind for that? Regards, Erich |
From: Matthew D. <col...@us...> - 2004-10-08 18:47:03
|
On Fri, 2004-10-08 at 03:14, Erich Focht wrote: > Hi Matthew, Hi Erich! > On Thursday 07 October 2004 02:51, Matthew Dobson wrote: > > 1) Rip out sched_groups and move them into the sched_domains. > > 2) Add some reference counting, and eventually locking, to > > sched_domains. > > 3) Rewrite & simplify the way sched_domains are built and linked into a > > cohesive tree. > > > > This should allow us to support hotplug more easily, simply removing the > > domain belonging to the going-away CPU, rather than throwing away the > > whole domain tree and rebuilding from scratch. This should also allow > > us to support multiple, independent (ie: no shared root) domain trees > > which will facilitate isolated CPU groups and exclusive domains. I also > > hope this will allow us to leverage the existing topology infrastructure > > to build domains that closely resemble the physical structure of the > > machine automagically, thus making supporting interesting NUMA machines > > and SMT machines easier. > > more flexibility in building the sched_domains is badly needed, so > your effort towards providing this is the right step. I'm not sure > yet whether your big change is really (and already) a simplification, > but what you described sounded for me like getting the chance to > configure the sched_domains at runtime, dynamically, from user > space. I didn't notice any user interface in your patch, or overlooked > it. Could you please describe the API you had in mind for that? You are correct in your assumption that you didn't notice any user API in the patch because it wasn't there. The idea I had for the API would be along the lines of the current cpusets/CKRM interface. A hierarchical filesystem where you can do operations to create/modify/remove sched_domains. Along the lines of: cd /dev/sched_domains/sys_domain mkdir node0 mkdir node1 mkdir node2 mkdir node3 cd node0 echo 0-3 > cpus cd ../node1 echo 4-7 > cpus cd ../node2 echo 8-11 > cpus cd ../node3 echo 12-15 > cpus To create a simple 4 node each w/ 4 cpus setup. This is a trivial example, because this is the kind of thing that would be setup by default at boot time. I really like the interface that Paul came up with for cpusets, and I think that the interface we eventually settle on should be along those lines. Hopefully it can be shared with the interface CKRM uses to avoid too much interface bloat. I think that we can probably get the two mechanisms to share a common interface. -Matt |
From: Nick P. <nic...@ya...> - 2004-10-08 10:41:09
|
Erich Focht wrote: > more flexibility in building the sched_domains is badly needed, so > your effort towards providing this is the right step. I'm not sure > yet whether your big change is really (and already) a simplification, > but what you described sounded for me like getting the chance to > configure the sched_domains at runtime, dynamically, from user > space. I didn't notice any user interface in your patch, or overlooked > it. Could you please describe the API you had in mind for that? > OK, what we have in -mm is already close to what we need to do dynamic building. But let's explore the other topic. User interface. First of all, I think it may be easiest to allow the user to specify which cpus belong to which exclusive domains, and have them otherwise built in the shape of the underlying topology. So for example if your domains look like this (excuse the crappy ascii art): 0 1 2 3 4 5 6 7 --- --- --- --- <- domain 0 | | | | ------ ------ <- domain 1 | | ---------- <- domain 2 (global) And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7} for some crazy reason, the new domains would look like this: 0 1 2 4 5 3 6 7 --- - --- - --- <- 0 | | | | | ----- - - - <- 1 | | | | ------- ----- <- 2 (global, partitioned) Agreed? You don't need to get fancier than that, do you? Then how to input the partitions... you could have a sysfs entry that takes the complete partition info in the form: 0,1,2,3 4,5,6 7,8 ... Pretty dumb and simple. |
From: Matthew D. <col...@us...> - 2004-10-08 18:55:43
|
On Fri, 2004-10-08 at 03:40, Nick Piggin wrote: > And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7} > for some crazy reason, the new domains would look like this: > > 0 1 2 4 5 3 6 7 > --- - --- - --- <- 0 > | | | | | > ----- - - - <- 1 > | | | | > ------- ----- <- 2 (global, partitioned) > > Agreed? You don't need to get fancier than that, do you? > > Then how to input the partitions... you could have a sysfs entry that > takes the complete partition info in the form: > > 0,1,2,3 4,5,6 7,8 ... > > Pretty dumb and simple. How do we describe the levels other than the first? We'd either need to: 1) come up with a language to describe the full tree. For your example I quoted above: echo "0,1,2,4,5 3,6 7,8;0,1,2 4,5 3 6,7;0,1 2 4,5 3 6,7" > partitions 2) have multiple files: echo "0,1,2,4,5 3,6,7" > level2 echo "0,1,2 4,5 3 6,7" > level1 echo "0,1 2 4,5 3 6,7" > level0 3) Or do it hierarchically as Paul implemented in cpusets, and as I described in an earlier mail: mkdir level2 echo "0,1,2,4,5 3,6,7" > level2/partitions mkdir level1 echo "0,1,2 4,5 3 6,7" > level1/partitions mkdir level0 echo "0,1 2 4,5 3 6,7" > level0/partitions I personally like the hierarchical idea. Machine topologies tend to look tree-like, and every useful sched_domain layout I've ever seen has been tree-like. I think our interface should match that. -Matt |
From: Hubertus F. <fr...@wa...> - 2004-10-08 15:51:46
|
Nick Piggin wrote: > Erich Focht wrote: > >> more flexibility in building the sched_domains is badly needed, so >> your effort towards providing this is the right step. I'm not sure >> yet whether your big change is really (and already) a simplification, >> but what you described sounded for me like getting the chance to >> configure the sched_domains at runtime, dynamically, from user >> space. I didn't notice any user interface in your patch, or overlooked >> it. Could you please describe the API you had in mind for that? >> > > OK, what we have in -mm is already close to what we need to do > dynamic building. But let's explore the other topic. User interface. > > First of all, I think it may be easiest to allow the user to specify > which cpus belong to which exclusive domains, and have them otherwise > built in the shape of the underlying topology. So for example if your > domains look like this (excuse the crappy ascii art): > > 0 1 2 3 4 5 6 7 > --- --- --- --- <- domain 0 > | | | | > ------ ------ <- domain 1 > | | > ---------- <- domain 2 (global) > > And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7} > for some crazy reason, the new domains would look like this: > > 0 1 2 4 5 3 6 7 > --- - --- - --- <- 0 > | | | | | > ----- - - - <- 1 > | | | | > ------- ----- <- 2 (global, partitioned) > > Agreed? You don't need to get fancier than that, do you? > > Then how to input the partitions... you could have a sysfs entry that > takes the complete partition info in the form: > > 0,1,2,3 4,5,6 7,8 ... > > Pretty dumb and simple. > Agreed, what we are thinking is that the CKRM API can be used for that. Each domain is a class build of resources (cpus,mem). You use the config interface of CKRM to specify which cpu/mem belongs to the class. The underlying controller verifies it. For a first approximation, classes that have config constraints specified this way will not be allowed to set shares. In sched_domain terms it would mean that if the sched_domain is not balancable with its siblings then it forms an exclusive domain. Under the exclusive class one can continue with the hierarchy that will allow share settings. So from an API issue this certainly looks feasible, maybe even clean. |
From: Matthew D. <col...@us...> - 2004-10-08 22:50:10
|
On Fri, 2004-10-08 at 08:50, Hubertus Franke wrote: > Nick Piggin wrote: > > Agreed? You don't need to get fancier than that, do you? > > > > Then how to input the partitions... you could have a sysfs entry that > > takes the complete partition info in the form: > > > > 0,1,2,3 4,5,6 7,8 ... > > > > Pretty dumb and simple. > > Agreed, what we are thinking is that the CKRM API can be used for that. > Each domain is a class build of resources (cpus,mem). > You use the config interface of CKRM to specify which cpu/mem belongs > to the class. The underlying controller verifies it. > > For a first approximation, classes that have config constraints > specified this way will not be allowed to set shares. In sched_domain > terms it would mean that if the sched_domain is not balancable with its > siblings then it forms an exclusive domain. Under the exclusive > class one can continue with the hierarchy that will allow share settings. > > So from an API issue this certainly looks feasible, maybe even clean. For anyone who doesn't grok CRKM-speak, as I didn't until some recent conversations with Hubertus, I think this roughly translates into the following: "CKRM & sched_domains should be integrated and include management of both CPU & memory resources. We will use the CKRM rcfs filesystem API to set up both sched_domains and memory allocation constraints. sched_domains that are marked as exclusive, ie: their parent domains do not have the SD_LOAD_BALANCE flag set, will not have their computing power shared outside that domain via CKRM. This will create 'compute pools' composed of the exclusive domains in the system. CKRM will ensure that 'shares' are correctly handled within exclusive (aka isolated) domains, and that 'shares' don't include compute resources not available in the sched_domain in question." Is that a fairly accurate translation for those not initiated in the ways of CKRM, Hubertus? -Matt |
From: Peter W. <pwi...@bi...> - 2004-10-08 21:56:28
|
Matthew Dobson wrote: > On Fri, 2004-10-08 at 03:40, Nick Piggin wrote: > >>And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7} >>for some crazy reason, the new domains would look like this: >> >>0 1 2 4 5 3 6 7 >>--- - --- - --- <- 0 >> | | | | | >> ----- - - - <- 1 >> | | | | >> ------- ----- <- 2 (global, partitioned) >> >>Agreed? You don't need to get fancier than that, do you? >> >>Then how to input the partitions... you could have a sysfs entry that >>takes the complete partition info in the form: >> >>0,1,2,3 4,5,6 7,8 ... >> >>Pretty dumb and simple. > > > How do we describe the levels other than the first? We'd either need > to: > 1) come up with a language to describe the full tree. For your example > I quoted above: > echo "0,1,2,4,5 3,6 7,8;0,1,2 4,5 3 6,7;0,1 2 4,5 3 6,7" > partitions I think the idea was that the full hierarchy was (automatically) derived from the partition in a way that best matched the physical layout of the machine? > > 2) have multiple files: > echo "0,1,2,4,5 3,6,7" > level2 > echo "0,1,2 4,5 3 6,7" > level1 > echo "0,1 2 4,5 3 6,7" > level0 > > 3) Or do it hierarchically as Paul implemented in cpusets, and as I > described in an earlier mail: > mkdir level2 > echo "0,1,2,4,5 3,6,7" > level2/partitions > mkdir level1 > echo "0,1,2 4,5 3 6,7" > level1/partitions > mkdir level0 > echo "0,1 2 4,5 3 6,7" > level0/partitions > > I personally like the hierarchical idea. Machine topologies tend to > look tree-like, and every useful sched_domain layout I've ever seen has > been tree-like. I think our interface should match that. > > -Matt > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to maj...@vg... > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- Peter Williams pwi...@bi... "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce |
From: Matthew D. <col...@us...> - 2004-10-08 22:53:16
|
On Fri, 2004-10-08 at 14:56, Peter Williams wrote: > > How do we describe the levels other than the first? We'd either need > > to: > > 1) come up with a language to describe the full tree. For your example > > I quoted above: > > echo "0,1,2,4,5 3,6 7,8;0,1,2 4,5 3 6,7;0,1 2 4,5 3 6,7" > partitions > > I think the idea was that the full hierarchy was (automatically) derived > from the partition in a way that best matched the physical layout of the > machine? Absolutely. I mentioned that in a different response in this thread. The default behavior on boot up should be for the arch_init_sched_domains() to build a sched_domains hierarchy that mirrors the physical layout of the machine for maximal scheduling efficiency for general computing. If the users/admin of the machine want to configure the machine for a particular type(s) of workload, then they can do that through the soon-to-be-figured-out API to rebuild the sched_domains hierarchy in their own image and likeness... -Matt |
From: Erich F. <ef...@hp...> - 2004-10-08 22:53:28
|
On Friday 08 October 2004 12:40, Nick Piggin wrote: > Erich Focht wrote: > > Could you please describe the API you had in mind for that? > > > > First of all, I think it may be easiest to allow the user to specify > which cpus belong to which exclusive domains, and have them otherwise > built in the shape of the underlying topology. So for example if your > domains look like this (excuse the crappy ascii art): > > 0 1 2 3 4 5 6 7 > --- --- --- --- <- domain 0 > | | | | > ------ ------ <- domain 1 > | | > ---------- <- domain 2 (global) > > And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7} > for some crazy reason, the new domains would look like this: > > 0 1 2 4 5 3 6 7 > --- - --- - --- <- 0 > | | | | | > ----- - - - <- 1 > | | | | > ------- ----- <- 2 (global, partitioned) > > Agreed? You don't need to get fancier than that, do you? > > Then how to input the partitions... you could have a sysfs entry that > takes the complete partition info in the form: > > 0,1,2,3 4,5,6 7,8 ... > > Pretty dumb and simple. Hmmm, this is unusable as long as you don't tell me how to create new levels and link them together. Adding CPUs is the simplest part. I'm with Matt here, the filesystem approach is the most elegant. On the other hand something simple for the start wouldn't be bad. It would show immediately whether Matt's or your way of dealing with domains is better suited for relinking and reorganising the domains structure dynamically. Functionality could be something like: - list domains - create domain - add child domains - link in parent domain We're building this from bottom (cpus) up and need to take care of the unlinking of the global domain when inserting something. But otherwise this could be sufficient. Regards, Erich |
From: Matthew D. <col...@us...> - 2004-10-09 01:07:25
|
On Fri, 2004-10-08 at 15:51, Erich Focht wrote: > Hmmm, this is unusable as long as you don't tell me how to create new > levels and link them together. Adding CPUs is the simplest > part. I'm with Matt here, the filesystem approach is the most > elegant. On the other hand something simple for the start wouldn't be > bad. It would show immediately whether Matt's or your way of dealing > with domains is better suited for relinking and reorganising the > domains structure dynamically. Functionality could be something like: > - list domains > - create domain > - add child domains > - link in parent domain > We're building this from bottom (cpus) up and need to take care of the > unlinking of the global domain when inserting something. But otherwise > this could be sufficient. > > Regards, > Erich I personally like to think of it from the top down. The internal API I came up with looks like: create_domain(parent_domain, type); destroy_domain(domain); add_cpu_to_domain(cpu, domain); So you basically build your domain from the top down, from your 1 or more top-level domains, down to your lowest level domains. You then add cpus (1 or more per domain) to the leaf domains in the tree you built. Those cpus cascade up the tree, and the whole tree knows exactly which cpus are contained in each domain in it. I think these are the three main functions you need to construct pretty much any conceivable, useful sched_domains hierarchy. -Matt |
From: Erich F. <ef...@hp...> - 2004-10-08 23:15:53
|
Hi Matt, I reply to this post because it has more examples ;-) On Friday 08 October 2004 20:54, Matthew Dobson wrote: > On Fri, 2004-10-08 at 03:40, Nick Piggin wrote: > > And so you want to make a partition with CPUs {0,1,2,4,5}, and {3,6,7} > > for some crazy reason, the new domains would look like this: > > > > 0 1 2 4 5 3 6 7 > > --- - --- - --- <- 0 > > | | | | | > > ----- - - - <- 1 > > | | | | > > ------- ----- <- 2 (global, partitioned) > > > > Agreed? You don't need to get fancier than that, do you? > > > > Then how to input the partitions... you could have a sysfs entry that > > takes the complete partition info in the form: > > > > 0,1,2,3 4,5,6 7,8 ... > > > > Pretty dumb and simple. > > How do we describe the levels other than the first? We'd either need > to: > 1) come up with a language to describe the full tree. For your example > I quoted above: > echo "0,1,2,4,5 3,6 7,8;0,1,2 4,5 3 6,7;0,1 2 4,5 3 6,7" > partitions This is not nice. Especially because changes cannot be made gradually. You need to put in the whole tree each time you change something. Concurrent processes modifying the structre need additional mechanisms for synchronization. > 2) have multiple files: > echo "0,1,2,4,5 3,6,7" > level2 > echo "0,1,2 4,5 3 6,7" > level1 > echo "0,1 2 4,5 3 6,7" > level0 Add a way to create levels. This one is hard to "parse" and again there is no protection for the "own" domains and maybe trouble with synchronization. > 3) Or do it hierarchically as Paul implemented in cpusets, and as I > described in an earlier mail: > mkdir level2 > echo "0,1,2,4,5 3,6,7" > level2/partitions > mkdir level1 > echo "0,1,2 4,5 3 6,7" > level1/partitions > mkdir level0 > echo "0,1 2 4,5 3 6,7" > level0/partitions > > I personally like the hierarchical idea. Machine topologies tend to > look tree-like, and every useful sched_domain layout I've ever seen has > been tree-like. I think our interface should match that. I like the hierarchical idea, too. The natural way to build it would be by starting from the cpus and going up. This tree stands on its leafs... and I'm not sure how to express that in a filesystem. How about this: If you would go from top (global) domain down the default could be to have a directory sched_domains/ global/ cpu1 cpu2 ... The cpuX files would always exist. All you'd need to do would be to create subdirectories and move the cpuX files from one to the other. You should be able to remove directories if they are empty. # cd sched_domains/global # mkdir node1 node2 # mv cpu1 cpu2 node1 # mv cpu3 cpu4 node2 sched_domains/ global/ node1/ cpu1 cpu2 node2/ cpu3 cpu4 or disconnected domains: # cd sched_domains # mkdir interactive batch # mv global/cpu1 global/cpu2 interactive # mv global/cpu3 global/cpu4 batch # rmdir global sched_domains/ interactive/ cpu1 cpu2 batch/ cpu3 cpu4 Regards, Erich |
From: Nick P. <nic...@ya...> - 2004-10-08 23:54:08
|
Erich Focht wrote: >>I personally like the hierarchical idea. Machine topologies tend to >>look tree-like, and every useful sched_domain layout I've ever seen has >>been tree-like. I think our interface should match that. > > > I like the hierarchical idea, too. The natural way to build it would > be by starting from the cpus and going up. This tree stands on its > leafs... and I'm not sure how to express that in a filesystem. > Why would you ever want to play around with the internals of the thing though? Provided you have a way to create exclusive sets of CPUs, when would you care about doing more? |