From: Dinakar G. <di...@in...> - 2005-05-01 18:59:08
Attachments:
dyn-sd-v0.5-1.patch
dyn-sd-v0.5-2.patch
Ok, here is the minimal patchset that I had promised after the last discussion.

What it does have
o The current patch enhances the meaning of exclusive cpusets by attaching
  them (exclusive cpusets) to sched domains
o It does _not_ require any additional cpumask_t variable. It just parses
  the cpus_allowed of the parent/sibling/children cpusets for manipulating
  sched domains
o All existing operations on non-exclusive and exclusive cpusets are
  preserved as-is.
o The sched code has been modified to bring it up to 2.6.12-rc2-mm3

Usage
o On setting the cpu_exclusive flag of a cpuset X, it creates two sched
  domains
  a. One, all cpus from X's parent cpuset that don't belong to any
     exclusive sibling cpuset of X
  b. Two, all cpus in X's cpus_allowed
o On adding/deleting cpus to/from an exclusive cpuset X that has exclusive
  children, it creates two sched domains
  a. One, all cpus from X's parent cpuset that don't belong to any
     exclusive sibling cpuset of X
  b. Two, all cpus in X's cpus_allowed, after taking away any cpus that
     belong to exclusive child cpusets of X
o On unsetting the cpu_exclusive flag of cpuset X or rmdir X, it creates a
  single sched domain, containing all cpus from X's parent cpuset that
  don't belong to any exclusive sibling of X, plus the cpus of X
o It does _not_ modify the cpus_allowed variable of the parent as in the
  previous version.
  It relies on user space to move tasks to proper cpusets for complete
  isolation/exclusion
o See function update_cpu_domains for more details

What it does not have
o It is still short on documentation
o Does not have hotplug support as yet
o Supports only x86 as of now
o No thoughts on "memory domains" (though I am not sure who would use
  such a thing and what exactly the requirements are)

Here is the increase in binary sizes for different values of
CONFIG_NR_CPUS (32 and 255):

kernel/cpuset.o.orig-32    16728
kernel/cpuset.o-v0.5-32    17176 (~2.7% increase)
kernel/cpuset.o.orig-255   17916
kernel/cpuset.o-v0.5-255   19120 (~6.7% increase)

The code changes are also minimal:

 include/linux/init.h  |    2
 include/linux/sched.h |    1
 kernel/cpuset.c       |   73 +++++++++++++++++++++++++++------
 kernel/sched.c        |  109 +++++++++++++++++++++++++++++++++-----------------
 4 files changed, 134 insertions(+), 51 deletions(-)

-Dinakar
From: Nick P. <nic...@ya...> - 2005-05-02 09:10:54
Dinakar Guniguntala wrote:
> Ok, here is the minimal patchset that I had promised after the
> last discussion.

The sched-domains part of it (kernel/sched.c) is looking much better now.
I haven't had a really good look, but it is definitely on the right track
now. Well done.

As I said before, I am no expert on the cpusets side of things, but as far
as sched-domains partitioning goes, we really just want the absolute
minimum support in kernel/sched.c which can be managed by a higher layer.

I'll review it in detail when it gets into -mm, but I don't expect to find
any major problems.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
From: Dinakar G. <di...@in...> - 2005-05-02 17:05:09
On Mon, May 02, 2005 at 07:10:35PM +1000, Nick Piggin wrote:
> Dinakar Guniguntala wrote:
> > Ok, here is the minimal patchset that I had promised after the
> > last discussion.
>
> The sched-domains part of it (kernel/sched.c) is looking much
> better now. Haven't had a really good look, but it is definitely
> on the right track now. Well done.

Thank you for reviewing!

-Dinakar
From: Nick P. <nic...@ya...> - 2005-05-02 09:44:14
Dinakar Guniguntala wrote:
> +void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
> +{
> +	cpumask_t change_map;
> +
> +	cpus_or(change_map, span1, span2);
> +
> +	preempt_disable();

Oh, you can't do this here; attach_domains does a synchronize_kernel.
So take it out, it doesn't do anything anyway, does it?

I suggest you also use some sort of locking to prevent concurrent rebuilds
and rebuilds racing with cpu hotplug. You could probably have a static
semaphore around rebuild_sched_domains, and take lock_cpu_hotplug here
too. Or alternatively take the semaphore in the cpu hotplug notifier as
well... Maybe both - neither is performance-critical code.

--
SUSE Labs, Novell Inc.
From: Dinakar G. <di...@in...> - 2005-05-02 17:05:28
On Mon, May 02, 2005 at 07:44:05PM +1000, Nick Piggin wrote:
> Dinakar Guniguntala wrote:
>
> > +void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
> > +{
> > +	cpumask_t change_map;
> > +
> > +	cpus_or(change_map, span1, span2);
> > +
> > +	preempt_disable();
>
> Oh, you can't do this here, attach_domains does a synchronize_kernel.
> So take it out, it doesn't do anything anyway, does it?

I put that in to prevent hangs with CONFIG_PREEMPT turned on, but clearly
didn't test it with preempt turned on. Looks like all I need to do here is
a local_irq_disable.

> I suggest you also use some sort of locking to prevent concurrent rebuilds
> and rebuilds racing with cpu hotplug. You could probably have a static
> semaphore around rebuild_sched_domains, and take lock_cpu_hotplug here too.

I already do a lock_cpu_hotplug() in cpuset.c before calling
rebuild_sched_domains and am also holding cpuset_sem, so that should take
care of both hotplug and concurrent rebuilds.

-Dinakar
From: Nick P. <nic...@ya...> - 2005-05-02 23:23:30
Dinakar Guniguntala wrote:
> On Mon, May 02, 2005 at 07:44:05PM +1000, Nick Piggin wrote:
>
>>Dinakar Guniguntala wrote:
>>
>>>+void rebuild_sched_domains(cpumask_t span1, cpumask_t span2)
>>>+{
>>>+	cpumask_t change_map;
>>>+
>>>+	cpus_or(change_map, span1, span2);
>>>+
>>>+	preempt_disable();
>>
>>Oh, you can't do this here, attach_domains does a synchronize_kernel.
>>So take it out, it doesn't do anything anyway, does it?
>
> I put that in to prevent hangs with CONFIG_PREEMPT turned on, but
> clearly didn't test it with preempt turned on. Looks like all I need to
> do here is a local_irq_disable

What are you protecting against, though? synchronize_kernel can sleep, so
local_irq_disable is probably the wrong thing to do as well. AFAIKS, you
don't need anything here; so long as you have mutual exclusion from other
sched-domain building, this can take as long as it wants / be preempted as
many times as we like.

>>I suggest you also use some sort of locking to prevent concurrent rebuilds
>>and rebuilds racing with cpu hotplug. You could probably have a static
>>semaphore around rebuild_sched_domains, and take lock_cpu_hotplug here too.
>
> I already do a lock_cpu_hotplug() in cpuset.c before calling
> rebuild_sched_domains and also am holding cpuset_sem, so that should take
> care of both hotplug and concurrent rebuilds

OK. But if we want this to be a respectable interface (possibly for more
than just cpusets) then it should probably do some locking itself. It
isn't performance critical, so I think taking a semaphore wouldn't hurt.

--
SUSE Labs, Novell Inc.
From: Paul J. <pj...@sg...> - 2005-05-02 18:02:01
Thanks for the minimalist patch. I'll review it in more detail when I get
a chance. It looks promising, more promising than before.

My current concerns include:
o Having it work on ia64 would facilitate my testing.
o _Still_ no clear statement of the requirements - see below.
o Does this patch ensure that isolated sched domains form a partition
  (disjoint cover) of a system's CPUs? Should it?
o Does this change any documented semantics of cpusets? I don't see
  offhand that it does. Perhaps that's good. Perhaps I missed something.

I am still in the frustrating position of playing guessing games with what
your essential requirements are, the one or two features that you require.
The closest we got was a six-step list which was really more like pseudo
code for a prior implementation. I see nothing on first glance in your
latest patch that responds to my previous attempt to ask this. I don't see
how I can improve on the wording of that question, so I repeat myself ...

On April 25, pj wrote:
======================
A few days ago, you provided a six-step list, under the introduction:

> Ok, let me begin at the beginning and attempt to define what I am
> doing here

I suspect those six steps were not really your essential requirements, but
one possible procedure that accomplishes them. So far I am guessing that
your requirement(s) are one or both of the following two items:

 (1) avoid having the scheduler waste too much time trying to load
     balance tasks that only turn out to be not allowed on the cpus the
     scheduler is considering, and/or
 (2) provide improved administrative control of a system by being able to
     construct a cpuset that is guaranteed to have no overlap of allowed
     cpus with its parent or any other cpuset not descendent from it.
If (1) is one of your essential requirements, then I have described a
couple of implementations that mark some existing cpusets to form a
partition (in the mathematical sense of a disjoint covering of subsets) of
the system to define isolated scheduler domains. I did this without adding
any additional bitmasks to each cpuset.

If (2) is one of your essential requirements, then I believe this can be
done with the current cpuset kernel code, entirely with additional user
level code.

> I am working on a minimalistic design right now and will get back in
> a day or two

Good. Hopefully, you will also be able to get through my thick head what
your essential requirement(s) is or are.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Dinakar G. <di...@in...> - 2005-05-03 14:30:41
On Mon, May 02, 2005 at 11:01:35AM -0700, Paul Jackson wrote:
> My current concerns include:
> o Having it work on ia64 would facilitate my testing.

I am working on making the changes for ia64, should be ready pretty soon.

> o Does this patch ensure that isolated sched domains form
>   a partition (disjoint cover) of a system's CPUs? Should it?

With this patch the cpus of an exclusive cpuset form a sched domain.
Since only exclusive cpusets can form a sched domain, this ensures that
the cpus form a disjoint cover.

> o Does this change any documented semantics of cpusets? I don't
>   see offhand that it does. Perhaps that's good. Perhaps
>   I missed something.

No, all semantics continue to be the same as before.

I have trimmed the requirements to do only the absolute minimum in kernel
space:

1. Partitioning of the system (both cpus and memory) to ensure dedicated
   resources are available for application use. This has to be done
   through user space with the help of the existing cpuset infrastructure,
   and no additional changes are required in the kernel.

2. Remove unnecessary scheduler load balancing overhead in the partitions
   mentioned above:
   a. Ensure that the load balance code is aware of the partitioning of
      cpus, so that load balancing happens within these partitions and
      not across the entire system.
   b. Provide for complete removal of load balancing on a given partition
      of cpus.
   This is necessary for a variety of workloads, including real-time, HPC
   and any mix of these workloads.
   In the current patch, only 2(a) has been addressed. I intend to add
   support for 2(b) once the current patch is acceptable to everyone. I
   think this should not be a major change.

3. Support for CPU hotplug is another requirement, though not a direct
   one.

Hope this helps

-Dinakar
From: Paul J. <pj...@sg...> - 2005-05-03 15:22:12
Dinakar wrote:
> Since only exclusive cpusets can form a sched domain, this ensures
> that the cpus form a disjoint cover

I don't understand why this is so. Certainly, the exclusive cpusets do not
form a disjoint cover. Not disjoint, because the top cpuset is exclusive,
and overlaps with all other cpusets. Nor do the other exclusive cpusets,
excluding the top, necessarily form a cover - there might not even be any
other such cpusets.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Paul J. <pj...@sg...> - 2005-05-03 15:35:19
Dinakar wrote:
> No, all semantics continue to be the same as before

Good.

> I have trimmed the requirements to do only the absolute minimal in
> kernel space

Good.

> Partitioning of the system ... done through user space

Ok.

> Remove unnecessary scheduler load balancing ...

Ok. Thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Dinakar G. <di...@in...> - 2005-05-05 13:24:24
Paul,

Any reason why changes are allowed to the top_cpuset? Right now a user can
change all the masks/flags of the top_cpuset, and I was wondering if this
should be happening. Should the top_cpuset be "read only" and always
display all of the system's cpu and memory resources?

-Dinakar
From: Paul J. <pj...@sg...> - 2005-05-05 15:10:19
> Any reason why changes are allowed to the top_cpuset ?

Eh - why not allow it?

I tend not to code special cases until I have a clear and compelling need.
Does something go surprisingly and horribly wrong if you change the top
cpuset in a particular way?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Dinakar G. <di...@in...> - 2005-05-06 17:50:46
On Thu, May 05, 2005 at 08:09:53AM -0700, Paul Jackson wrote:
> > Any reason why changes are allowed to the top_cpuset ?
>
> Eh - why not allow it?
>
> I tend not to code special cases until I have a clear and compelling
> need. Does something go surprisingly and horribly wrong if you change
> the top cpuset in a particular way?

I was working on hotplug support, and making changes to the top cpuset can
complicate things a bit; let me see if I can work around it.

-Dinakar
From: Dinakar G. <di...@in...> - 2005-05-03 14:44:50
On Mon, May 02, 2005 at 07:44:05PM +1000, Nick Piggin wrote:
> What are you protecting against, though? synchronize_kernel can
> sleep, so local_irq_disable is probably the wrong thing to do as well.

Paul, any reason why the code marked "####" (in cpuset_rmdir) is under the
dentry lock?

	spin_lock(&cs->dentry->d_lock);
	parent = cs->parent;
####	set_bit(CS_REMOVED, &cs->flags);
####	if (is_cpu_exclusive(cs))
		update_cpu_domains(cs);
	list_del(&cs->sibling);	/* delete my sibling from parent->children */
	if (list_empty(&parent->children))
		check_for_release(parent);
	d = dget(cs->dentry);	<----
	cs->dentry = NULL;	<----
	spin_unlock(&d->d_lock);

As far as I can see, only the lines marked "<----" should be under the
dentry lock, considering the fact that it already holds cpuset_sem all the
while. I saw that calling update_cpu_domains with the dentry lock held
causes an oops with preempt turned on (scheduling while atomic).

-Dinakar
From: Paul J. <pj...@sg...> - 2005-05-03 15:31:38
Dinakar wrote:
> As far as I can see only the ones marked "<----" should be under the
> dentry lock, considering the fact that it already holds the cpuset_sem
> all the while.

It looks that way to me, too. I doubt we had any particular reason for
taking the dentry lock as early as we do in that code. It's ok by me if
you move the dentry lock lower, as you suggest.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Matthew D. <col...@us...> - 2005-05-03 22:03:34
Dinakar Guniguntala wrote:
> Ok, Here is the minimal patchset that I had promised after the
> last discussion.
>
> What it does have
> o The current patch enhances the meaning of exclusive cpusets by
>   attaching them (exclusive cpusets) to sched domains
> o It does _not_ require any additional cpumask_t variable. It
>   just parses the cpus_allowed of the parent/sibling/children
>   cpusets for manipulating sched domains
> o All existing operations on non-/exclusive cpusets are preserved as-is.
> o The sched code has been modified to bring it upto 2.6.12-rc2-mm3
>
> Usage
> o On setting the cpu_exclusive flag of a cpuset X, it creates two
>   sched domains
>   a. One, All cpus from X's parent cpuset that dont belong to any
>      exclusive sibling cpuset of X
>   b. Two, All cpus in X's cpus_allowed
> o On adding/deleting cpus to/from a exclusive cpuset X that has exclusive
>   children, it creates two sched domains
>   a. One, All cpus from X's parent cpuset that dont belong to any
>      exclusive sibling cpuset of X
>   b. Two, All cpus in X's cpus_allowed, after taking away any cpus that
>      belong to exclusive child cpusets of X
> o On unsetting the cpu_exclusive flag of cpuset X or rmdir X, it creates a
>   single sched domain, containing all cpus from X's parent cpuset that
>   dont belong to any exclusive sibling of X and the cpus of X
> o It does _not_ modify the cpus_allowed variable of the parent as in the
>   previous version. It relies on user space to move tasks to proper
>   cpusets for complete isolation/exclusion
> o See function update_cpu_domains for more details
>
> What it does not have
> o It is still short on documentation
> o Does not have hotplug support as yet
> o Supports only x86 as of now
> o No thoughts on "memory domains" (Though I am not sure, who
>   would use such a thing and what exactly are the requirements)

An interesting feature. I tried a while ago to get cpusets and
sched_domains to play nice (nicer?) and didn't have much luck.
It seems you're taking a better approach, with smaller patches. Good luck!

> diff -Naurp linux-2.6.12-rc2.orig/include/linux/init.h linux-2.6.12-rc2/include/linux/init.h
> --- linux-2.6.12-rc2.orig/include/linux/init.h	2005-04-04 22:07:52.000000000 +0530
> +++ linux-2.6.12-rc2/include/linux/init.h	2005-05-01 22:07:56.000000000 +0530
> @@ -217,7 +217,7 @@ void __init parse_early_param(void);
>  #define __initdata_or_module __initdata
>  #endif /*CONFIG_MODULES*/
>
> -#ifdef CONFIG_HOTPLUG
> +#if defined(CONFIG_HOTPLUG) || defined(CONFIG_CPUSETS)
>  #define __devinit
>  #define __devinitdata
>  #define __devexit

This looks just plain wrong. Why do you need this? It doesn't seem that
arch_init_sched_domains() and/or update_sched_domains() are called from
anywhere that is cpuset related, so why the #ifdef CONFIG_CPUSETS?

> diff -Naurp linux-2.6.12-rc2.orig/kernel/sched.c linux-2.6.12-rc2/kernel/sched.c
> --- linux-2.6.12-rc2.orig/kernel/sched.c	2005-04-28 18:24:11.000000000 +0530
> +++ linux-2.6.12-rc2/kernel/sched.c	2005-05-01 22:06:55.000000000 +0530
> @@ -4526,7 +4526,7 @@ int __init migration_init(void)
>  #endif
>
>  #ifdef CONFIG_SMP
> -#define SCHED_DOMAIN_DEBUG
> +#undef SCHED_DOMAIN_DEBUG
>  #ifdef SCHED_DOMAIN_DEBUG
>  static void sched_domain_debug(struct sched_domain *sd, int cpu)
>  {

Is this just to quiet boot for your testing? Is there a better reason
you're turning this off? It seems unrelated to the rest of your patch.
> diff -Naurp linux-2.6.12-rc2.orig/kernel/cpuset.c linux-2.6.12-rc2/kernel/cpuset.c
> --- linux-2.6.12-rc2.orig/kernel/cpuset.c	2005-04-28 18:24:11.000000000 +0530
> +++ linux-2.6.12-rc2/kernel/cpuset.c	2005-05-01 22:15:06.000000000 +0530
> @@ -602,12 +602,48 @@ static int validate_change(const struct
>  	return 0;
>  }
>
> +static void update_cpu_domains(struct cpuset *cur)
> +{
> +	struct cpuset *c, *par = cur->parent;
> +	cpumask_t span1, span2;
> +
> +	if (par == NULL || cpus_empty(cur->cpus_allowed))
> +		return;
> +
> +	/* Get all non-exclusive cpus from parent domain */
> +	span1 = par->cpus_allowed;
> +	list_for_each_entry(c, &par->children, sibling) {
> +		if (is_cpu_exclusive(c))
> +			cpus_andnot(span1, span1, c->cpus_allowed);
> +	}
> +	if (is_removed(cur) || !is_cpu_exclusive(cur)) {
> +		cpus_or(span1, span1, cur->cpus_allowed);
> +		if (cpus_equal(span1, cur->cpus_allowed))
> +			return;
> +		span2 = CPU_MASK_NONE;
> +	} else {
> +		if (cpus_empty(span1))
> +			return;
> +		span2 = cur->cpus_allowed;
> +		/* If current cpuset has exclusive children, exclude from domain */
> +		list_for_each_entry(c, &cur->children, sibling) {
> +			if (is_cpu_exclusive(c))
> +				cpus_andnot(span2, span2, c->cpus_allowed);
> +		}
> +	}
> +
> +	lock_cpu_hotplug();
> +	rebuild_sched_domains(span1, span2);
> +	unlock_cpu_hotplug();
> +}

Nitpicky, but span1 and span2 could do with better names.

Otherwise, the patch looks good to me.

-Matt
From: Nick P. <nic...@ya...> - 2005-05-04 00:08:51
Matthew Dobson wrote:
> Dinakar Guniguntala wrote:
>
>> +	lock_cpu_hotplug();
>> +	rebuild_sched_domains(span1, span2);
>> +	unlock_cpu_hotplug();
>> +}
>
> Nitpicky, but span1 and span2 could do with better names.

As could rebuild_sched_domains, while we're at it.

	partition_disjoint_sched_domains(partition1, partition2);
?

Dunno. That isn't really great, but maybe better? Pretty long, but it'll
only ever be called in one or two places.

--
SUSE Labs, Novell Inc.
From: Matthew D. <col...@us...> - 2005-05-04 00:28:34
Nick Piggin wrote:
> Matthew Dobson wrote:
>
>> Dinakar Guniguntala wrote:
>>
>>> +	lock_cpu_hotplug();
>>> +	rebuild_sched_domains(span1, span2);
>>> +	unlock_cpu_hotplug();
>>> +}
>>
>> Nitpicky, but span1 and span2 could do with better names.
>
> As could rebuild_sched_domains while we're at it.
>
> 	partition_disjoint_sched_domains(partition1, partition2);
> ?
>
> Dunno. That isn't really great, but maybe better? Pretty
> long, but it'll only ever be called in one or two places.

build_disjoint_sched_domains(partition1, partition2)? Or just
partition_sched_domains(partition1, partition2)? Partition and disjoint
seem mildly redundant to me, for varying definitions of partition... ;)

-Matt
From: Dinakar G. <di...@in...> - 2005-05-05 13:15:19
On Tue, May 03, 2005 at 05:28:20PM -0700, Matthew Dobson wrote:
>
> build_disjoint_sched_domains(partition1, partition2)? Or just
> partition_sched_domains(partition1, partition2)? Partition and disjoint
> seem mildly redundant to me, for varying definitions of partition... ;)

partition_sched_domains sounds fine to me, thanks.

-Dinakar
From: Dinakar G. <di...@in...> - 2005-05-05 13:14:06
On Tue, May 03, 2005 at 03:03:23PM -0700, Matthew Dobson wrote:
> An interesting feature. I tried a while ago to get cpusets and
> sched_domains to play nice (nicer?) and didn't have much luck. It seems
> you're taking a better approach, with smaller patches. Good luck!

Thanks! I would very much like to know your findings as far as memory/node
domains are concerned, or are you going to be working on it? I don't have
any thoughts on it right now.

> > -#ifdef CONFIG_HOTPLUG
> > +#if defined(CONFIG_HOTPLUG) || defined(CONFIG_CPUSETS)
> >  #define __devinit
> >  #define __devinitdata
> >  #define __devexit
>
> This looks just plain wrong. Why do you need this? It doesn't seem that
> arch_init_sched_domains() and/or update_sched_domains() are called from
> anywhere that is cpuset related, so why the #ifdef CONFIG_CPUSETS?

cpu_attach_domain is defined as __devinit; maybe I need to remove that
instead of the #ifdef.

> > #ifdef CONFIG_SMP
> > -#define SCHED_DOMAIN_DEBUG
> > +#undef SCHED_DOMAIN_DEBUG
> > #ifdef SCHED_DOMAIN_DEBUG
> > static void sched_domain_debug(struct sched_domain *sd, int cpu)
> > {
>
> Is this just to quiet boot for your testing? Is there a better reason
> you're turning this off? It seems unrelated to the rest of your patch.

This gets called from cpu_attach_domain, and so with my changes it gets
called every time partitioning is done, not only during boot.

Thanks for your review!

-Dinakar