From: Nick P. <nic...@ya...> - 2005-04-19 07:09:54
On Mon, 2005-04-18 at 23:59 -0700, Paul Jackson wrote:
> Nick wrote:
> > Basically you just have to know that it has the
> > capability to partition the system in an arbitrary disjoint set
> > of sets of cpus.
> >
> > If you can make use of that, then we're in business ;)
>
> You read fast ;)
>
> So you do _not_ want to consider nested sched domains, just disjoint
> ones.  Good.
>

You don't either? Good. :)

> > From what I gather, this partitioning does not exactly fit
> > the cpusets architecture. Because with cpusets you are specifying
> > on what cpus can a set of tasks run, not dividing the whole system.
>
> My evil scheme, and Dinakar's as well, is to provide a way for the user
> to designate _some_ of their cpusets as also defining the partition that
> controls which cpus are in each sched domain, and so dividing the
> system.
>
>     "partition" == "an arbitrary disjoint set of sets of cpus"
>

That would make sense. I'm not familiar with the workings of cpusets,
but that would require every task to be assigned to one of these sets
(or a subset within it), yes?

> This fits naturally with the way people use cpusets anyway.  They divide
> up the system along boundaries that are natural topologically and that
> provide a good fit for their jobs, and hope that the kernel will adapt
> to such localized placement.  They then throw a few more nested (smaller)
> cpusets at the problem, to deal with various special needs.  If we can
> provide them with a means to tell us which of their cpusets define the
> natural partitioning of their system, for the job mix and hardware
> topology they have, then all is well.
>

Sounds like a good fit then. I'll touch up the sched-domains side of the
equation when I get some time, hopefully this week or next.

--
SUSE Labs, Novell Inc.
From: Paul J. <pj...@sg...> - 2005-04-19 07:27:13
> > So you do _not_ want to consider nested sched domains, just disjoint
> > ones.  Good.
> >
>
> You don't either? Good. :)

From the point of view of cpusets, I'd rather not think about nested
sched domains, for now at least.  But my scheduler savvy colleagues on
the big SGI boxes may well have ambitions here.  I can't speak for them.

--
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Paul J. <pj...@sg...> - 2005-04-19 07:30:17
Nick wrote:
> That would make sense. I'm not familiar with the workings of cpusets,
> but that would require every task to be assigned to one of these
> sets (or a subset within it), yes?

That's the rub, as I noted a couple of messages ago, while you were
writing this message.  It doesn't require every task to be in one of
these or a subset.  Tasks could be in some multiple-domain superset,
unless that is so painful that we have to add mechanisms to cpusets to
prohibit it.

--
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Simon D. <Sim...@bu...> - 2005-04-19 08:13:38
On Mon, 18 Apr 2005, Paul Jackson wrote:

> Hmmm ... interesting patch.  My reaction to the changes in
> kernel/cpuset.c are complicated:
>
>  * I'm supposed to be on vacation the rest of this month,
>    so trying (entirely unsuccessfully so far) not to think
>    about this.
>  * This is perhaps the first non-trivial cpuset patch to come
>    in the last many months from someone other than Simon or
>    myself - welcome.  I'm glad to see this happening.

> This leads to a possible interface.  For each of cpus and
> memory, add four per-cpuset control files.  Let me take the
> cpu case first.
>
> Add the per-cpuset control files:
>  * domain_cpu_current   # readonly boolean
>  * domain_cpu_pending   # read/write boolean
>  * domain_cpu_rebuild   # write only trigger
>  * domain_cpu_error     # read only - last error msg

>  4) If the write failed, read the domain_cpu_error file
>     for an explanation.

> Otherwise the write will fail, and an error message explaining
> the problem made available in domain_cpu_error for subsequent
> reading.  Just setting errno would be insufficient in this
> case, as the possible reasons for error are too complex to be
> adequately described that way.

I guess we hit a limit of the filesystem-interface approach here.
Are the possible failure reasons really that complex?
Is such an error reporting scheme already in use in the kernel?

I find the two-files approach a bit disturbing -- we have no guarantee
that the error we read is the error we produced.  If this is only to get
a hint, OK.

On the other hand, there's also no guarantee that what we are triggering
by writing in domain_cpu_rebuild is what we have set up by writing in
domain_cpu_pending.  User applications will need a bit of self-discipline.

> The above scheme should significantly reduce the number of
> special cases in the update_sched_domains() routine (which I
> would rename to update_cpu_domains, alongside another one to be
> provided later, update_mem_domains.)  These new update routines
> will verify that all the preconditions are met, tear down all
> the cpu or mem domains within the scope of the specified cpuset,
> and rebuild them according to the partition defined by the
> pending_*_domain flags on the descendent cpusets.  It's the
> same complete rebuild of the partitioning of some subtree,
> each time, without all the special cases for incrementally
> adding and removing cpus or mems from this or that.  Complex
> nested if-else-if-else logic is a breeding ground for bugs --
> good riddance.

Oh yes.  There's already a good bunch of if-then-else logic in the
cpusets because of the different flags that can apply.  We don't need
more.

> There -- what do you think of this alternative?

Most of all, that you write mails faster than I am able to read them,
so I might have missed something.  But so far I like your proposal.

	Simon.
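For concreteness, the pending/rebuild/error sequence quoted above could
be driven from user space roughly as sketched below.  This is only an
illustration of the proposal as discussed here: the /dev/cpuset mount
point, the Foo/alpha and Foo/beta cpuset names, and the write_flag()
helper are assumptions made for the example, not part of any existing
interface.

/*
 * Sketch of the proposed pending/rebuild/error protocol, driven from
 * user space.  Paths and cpuset names below are invented for the example.
 */
#include <stdio.h>

static int write_flag(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = (fputs(val, f) != EOF);
	if (fclose(f) == EOF)
		ok = 0;
	return ok ? 0 : -1;
}

int main(void)
{
	char err[256] = "";
	FILE *f;

	/* 1) Mark the cpusets that should become isolated sched domains. */
	write_flag("/dev/cpuset/Foo/alpha/domain_cpu_pending", "1");
	write_flag("/dev/cpuset/Foo/beta/domain_cpu_pending", "1");

	/* 2) Trigger the rebuild of the partition below Foo. */
	if (write_flag("/dev/cpuset/Foo/domain_cpu_rebuild", "1") == 0)
		return 0;

	/* 3) On failure, the error file holds a hint (overlap, missing cover, ...). */
	f = fopen("/dev/cpuset/Foo/domain_cpu_error", "r");
	if (f) {
		fgets(err, sizeof(err), f);
		fclose(f);
	}
	fprintf(stderr, "domain rebuild failed: %s\n", err);
	return 1;
}

As Simon notes, nothing ties step 3's hint to step 2's failure if other
writers are active; the sketch simply assumes a single well-behaved
administrator process.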
From: Paul J. <pj...@sg...> - 2005-04-19 16:20:30
Simon wrote:
> I guess we hit a limit of the filesystem-interface approach here.
> Are the possible failure reasons really that complex ?

Given the amount of head scratching my proposal has provoked so far,
they might be that complex, yes.

Failure reasons include:
 * The cpuset Foo whose domain_cpu_rebuild file we wrote does not align
   with the current partition of CPUs on the system (align: every subset
   of the partition is either within or outside the CPUs of Foo).
 * The cpusets Foo and its descendants which are marked with a true
   domain_cpu_pending do not form a partition of the CPUs in Foo.  This
   could be either because two of these cpusets have overlapping CPUs,
   or because the union of all the CPUs in these cpusets doesn't cover.
 * The usual other reasons, such as lacking write permission.

> If this is only to get a hint, OK.

Yes - it would be a hint.  The official explanation would be the errno
setting on the failed write.  The hint, written to the domain_cpu_error
file, could actually state which two cpusets had overlapping CPUs, or
which CPUs in Foo were not covered by the union of the CPUs in the
marked descendant cpusets.

Yes - it's pushing the limits of the available mechanisms.  Though I
don't offhand see where the filesystem-interface approach is to blame
here.  Can you describe any other approach that would provide a
similarly useful error explanation in a less unusual fashion?

> Is such an error reporting scheme already in use in the kernel ?

I don't think so.

> On the other hand, there's also no guarantee that what we are triggering
> by writing in domain_cpu_rebuild is what we have set up by writing in
> domain_cpu_pending.  User applications will need a bit of self-discipline.

True.  To preserve the invariant that the CPUs in the selected cpusets
form a partition (disjoint cover) of the system's CPUs, we either need
to provide an atomic operation that allows passing in a selection of
cpusets, or we need to provide a sequence of operations that essentially
drive a little finite state machine, building up a description of the
new state while the old state remains in place, until the final trigger
is fired.

This suggests what the primary alternative to my proposed API would be:
an interface that lets one pass in a list of cpusets, requesting that
the partition below the specified cpuset subtree Foo be completely and
atomically rebuilt, to be that defined by the list of cpusets, with the
set of CPUs in each of these cpusets defining one subset of the
partition.

--
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
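The partition invariant behind the first two failure reasons -- the
marked descendant cpusets must be pairwise disjoint and together cover
Foo's CPUs -- is compact to state in code.  The sketch below only
illustrates that check: it uses plain unsigned long bitmaps in place of
the kernel's cpumask_t, and the forms_partition() name and example masks
are invented for the illustration.

/*
 * Illustrative check (not kernel code): do the marked child sets form a
 * partition of the parent's CPUs?  Each mask is one bit per CPU.
 */
#include <stdio.h>

/* Returns 1 if child[0..n-1] are pairwise disjoint and together cover parent. */
static int forms_partition(unsigned long parent, const unsigned long *child, int n)
{
	unsigned long covered = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (child[i] & ~parent)		/* child strays outside parent */
			return 0;
		if (child[i] & covered)		/* overlaps an earlier child */
			return 0;
		covered |= child[i];
	}
	return covered == parent;		/* union must cover every CPU */
}

int main(void)
{
	unsigned long foo   = 0xff;			/* CPUs 0-7 */
	unsigned long ok[]  = { 0x0f, 0xf0 };		/* disjoint cover */
	unsigned long bad[] = { 0x0f, 0x30 };		/* CPUs 6,7 uncovered */

	printf("%d %d\n", forms_partition(foo, ok, 2), forms_partition(foo, bad, 2));
	return 0;
}

The useful error hint Paul describes falls out of the same walk: the
first child that overlaps `covered`, or the final `parent & ~covered`
remainder, names the offending cpusets or CPUs.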
From: Dinakar G. <di...@in...> - 2005-04-21 17:14:51
Attachments:
dyn-sd-v0.2.patch
Based on Paul's feedback, I have simplified and cleaned up the code
quite a bit.

o I have taken care of most of the nits, except for the output format
  change for cpusets with isolated children.
o Also, most of my documentation has been part of my earlier mails and
  I have not yet added it to cpusets.txt.
o I still haven't looked at the memory side of things.
o Most of the changes are in the cpusets code and almost none in the
  sched code.  (I'll do that next week.)
o Hopefully my earlier mails regarding the design have clarified many
  of the questions that were raised.

So here goes version 0.2

-rw-r--r--  1 root root 16548 Apr 21 20:54 cpuset.o.orig
-rw-r--r--  1 root root 17548 Apr 21 22:09 cpuset.o.sd-v0.2

Around a 6% increase in the kernel text size of cpuset.o.

 include/linux/init.h  |    2
 include/linux/sched.h |    1
 kernel/cpuset.c       |  153 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched.c        |  111 ++++++++++++++++++++++++------------
 4 files changed, 216 insertions(+), 51 deletions(-)
From: Paul J. <pj...@sg...> - 2005-04-23 03:14:12
Dinakar's patch contains:

+	/* Make the change */
+	par->cpus_allowed = t.cpus_allowed;
+	par->isolated_map = t.isolated_map;

Doesn't the above make changes to the parent cpus_allowed without
calling validate_change()?

Couldn't we do nasty things like empty that cpus_allowed, leaving tasks
in that cpuset starved (or testing the last chance code that scans up
the cpuset hierarchy looking for a non-empty cpus_allowed)?

What prevents all the immediate children of the top cpuset from using
up all the cpus as isolated cpus, leaving the top cpuset cpus_allowed
empty, which fails even that last chance check, going to the really
really last chance code that allows tasks in that cpuset any online cpu?

These questions are in addition to my earlier question:

    Why don't you need to propagate this change upward to the parent's
    cpus_allowed and isolated_map?  If a parent's isolated_map grows
    (or shrinks), doesn't that affect every ancestor, all the way to
    the top cpuset?

I am unable to tell, just from code reading, whether this code has
adequately worked through the details involved in properly handling
nested changes.

I am unable to build or test this on ia64, because you have code such
as the rebuild_sched_domains() routine that is in the '#else' half of a
very large "#ifdef ARCH_HAS_SCHED_DOMAIN - #else - #endif" section of
kernel/sched.c, and the ia64 arch (and only that arch, so far as I know)
defines ARCH_HAS_SCHED_DOMAIN, so doesn't see this '#else' half.

+	/*
+	 * If current isolated cpuset has isolated children
+	 * disallow changes to cpu mask
+	 */
+        if (!cpus_empty(cs->isolated_map))
+                return -EBUSY;

 1) spacing - there's 8 spaces, not a tab, on two of the lines above.
 2) I can't tell yet - but I am curious as to whether the above
    restriction prohibiting cpu mask changes to a cpuset with isolated
    children might be a bit draconian.

--
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Paul J. <pj...@sg...> - 2005-04-22 18:53:15
A few code details (still working on a more substantive reply):

+	/* An isolated cpuset has to be exclusive */
+	if ((is_cpu_isolated(trial) && !is_cpu_exclusive(cur))
+	    || (!is_cpu_exclusive(trial) && is_cpu_isolated(cur)))
+		return -EINVAL;

Is the above code equivalent to what the comment states:

	if (is_cpu_isolated(trial) <= is_cpu_exclusive(trial))
		return -EINVAL;

+	t = old_parent = *par;
+	cpus_or(all_map, cs->cpus_allowed, cs->isolated_map);
+
+	/* If cpuset empty or top_cpuset, return */
+	if (cpus_empty(all_map) || par == NULL)
+		return;

If the (par == NULL) check succeeds, then perhaps the earlier (*par)
dereference will have oopsed first?

+	struct cpuset *par = cs->parent, t, old_parent;

Looks like 't' was chosen to be a one-char variable name, to keep some
lines below within 80 columns.  I'd do the same myself.  But this leaves
a non-symmetrical naming pattern for the new and old parent cpuset
values.  Perhaps the following would work better?

	struct cpuset *parptr;
	struct cpuset o, n;	/* old and new parent cpuset values */

+static void update_cpu_domains(struct cpuset *cs, cpumask_t old_map)

Could old_map be passed as a (const cpumask_t *)?  The stack space of
this code, just for cpumask_t's (see the old and new above), is getting
large for (really) big systems.

+	/* Make the change */
+	par->cpus_allowed = t.cpus_allowed;
+	par->isolated_map = t.isolated_map;

Why don't you need to propagate this change upward to the parent's
cpus_allowed and isolated_map?  If a parent's isolated_map grows (or
shrinks), doesn't that affect every ancestor, all the way to the top
cpuset?

--
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Paul J. <pj...@sg...> - 2005-04-22 21:38:41
> Is the above code equivalent to what the comment states:
>
>	if (is_cpu_isolated(trial) <= is_cpu_exclusive(trial))
>		return -EINVAL;

I think I got that backwards.  How about:

	/* An isolated cpuset has to be exclusive */
	if (!(is_cpu_isolated(trial) <= is_cpu_exclusive(trial)))
		return -EINVAL;

--
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
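Treating the is_cpu_*() tests as 0/1 values, the corrected condition
rejects exactly the one forbidden combination: isolated but not
exclusive.  A small stand-alone truth-table check makes that easy to
see; plain ints stand in here for the cpuset flag tests, so this is
only an illustration of the logic, not of the patch itself.

/*
 * Truth-table check: with 0/1 booleans, !(isolated <= exclusive) is true
 * exactly when isolated && !exclusive, i.e. "isolated implies exclusive"
 * is the invariant being enforced.
 */
#include <stdio.h>

int main(void)
{
	int isolated, exclusive;

	for (isolated = 0; isolated <= 1; isolated++)
		for (exclusive = 0; exclusive <= 1; exclusive++)
			printf("isolated=%d exclusive=%d  reject=%d (%d)\n",
			       isolated, exclusive,
			       !(isolated <= exclusive),
			       isolated && !exclusive);
	return 0;
}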
From: Dinakar G. <di...@in...> - 2005-04-25 11:37:30
On Sat, Apr 23, 2005 at 03:30:59PM -0700, Paul Jackson wrote:
> The top cpuset holds the kernel threads that are pinned to a particular
> cpu or node.  It's not right that their cpusets cpus_allowed is empty,
> which is what I guess the "0" in the cpus_allowed column above means.
> (Even if the "0" means CPU 0, that still conflicts with kernel threads
> on CPUs 1-7.)

Yes, I meant cpus_allowed is empty.

> We might get away with it on cpus, because we don't change the tasks
> cpus_allowed to match the cpusets cpus_allowed (we don't call
> set_cpus_allowed, from kernel/cpuset.c) _except_ when someone rebinds
> that task to its cpuset by writing its pid into the cpuset tasks file.
> So for as long as no one tries to rebind the per-cpu or per-node
> kernel threads, no one will notice that they in a cpuset with an
> empty cpus_allowed.

True.

> 4) There are some tasks that _do_ require to run on the same cpus as
> the tasks you would assign to isolated cpusets.  These kernel threads,
> such as for example the migration and ksoftirqd threads, must be setup
> well before user code is run that can configure job specific isolated
> cpusets, so these tasks need a cpuset to run in that can be created
> during the system boot, before init (pid == 1) starts up.  This cpuset
> is the top cpuset.

And those processes (kernel threads) will continue to run on their cpus.

> I don't understand why what's there now isn't sufficient.  I don't see
> that this patch provides any capability that you can't get just by
> properly placing tasks in cpusets that have the desired cpus and nodes.
> This patch leaves the per-cpu kernel threads with no cpuset that allows
> what they need, and it complicates the semantics of things, in ways that
> I still don't entirely understand.

You are forgetting the fact that the scheduler is still load balancing
across all CPUs and tries to pull tasks, only to find that the task's
cpus_allowed mask prevents it from being moved.

> You don't need to isolate a set of cpus; you need to isolate a set of
> processes.  So long as you can create non-overlapping cpusets, and
> assign processes to them, I don't see where it matters that you cannot
> prohibit the creation of overlapping cpusets, or in the case of the top
> cpuset, why it matters that you cannot _disallow_ allowed cpus
> or memory nodes in existing cpusets.

I am working on a minimalistic design right now and will get back in a
day or two.

	-Dinakar
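The overhead Dinakar is pointing at is the load balancer scanning a busy
runqueue and rejecting task after task because their cpus_allowed masks
exclude the pulling CPU.  The user-space sketch below only models that
pattern; the task list, masks, and can_pull() helper are made up for the
illustration and are not kernel code.

/*
 * Illustration of wasted balancing work: every candidate task is examined,
 * but a tight cpus_allowed mask (set by a cpuset) forbids the migration.
 */
#include <stdio.h>

struct task {
	const char *name;
	unsigned long cpus_allowed;	/* one bit per CPU */
};

static int can_pull(const struct task *t, int this_cpu)
{
	return (t->cpus_allowed >> this_cpu) & 1UL;
}

int main(void)
{
	/* A job pinned to CPUs 0-3 by its cpuset, plus one unpinned task. */
	struct task rq[] = {
		{ "worker0", 0x0f }, { "worker1", 0x0f },
		{ "worker2", 0x0f }, { "stray",   0xff },
	};
	int this_cpu = 5, i, scanned = 0, pulled = 0;

	for (i = 0; i < 4; i++) {
		scanned++;
		if (!can_pull(&rq[i], this_cpu))
			continue;	/* cpus_allowed forbids the move */
		pulled++;
	}
	printf("cpu %d scanned %d tasks, could pull %d\n",
	       this_cpu, scanned, pulled);
	return 0;
}

With partitioned sched domains the balancer on CPU 5 would never look at
that runqueue in the first place, which is the saving Dinakar is after.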
From: Paul J. <pj...@sg...> - 2005-04-25 14:40:08
Dinakar, replying to pj:
> > I don't understand why what's there now isn't sufficient.  I don't see
> > that this patch provides any capability that you can't get just by
> > properly placing tasks in cpusets that have the desired cpus and nodes.
> > This patch leaves the per-cpu kernel threads with no cpuset that allows
> > what they need, and it complicates the semantics of things, in ways that
> > I still don't entirely understand.
>
> You are forgetting the fact that the scheduler is still load balancing
> across all CPUs and tries to pull tasks only to find that the task's
> cpus_allowed mask prevents it from being moved

Well, I haven't forgotten, but I am having a difficult time figuring out
what your real (most likely just one or two) essential requirements are.

A few days ago, you provided a six step list, under the introduction:

> Ok, Let me begin at the beginning and attempt to define what I am
> doing here

I suspect those six steps were not really your essential requirements,
but one possible procedure that accomplishes them.

So far I am guessing that your requirement(s) are one or both of the
following two items:

 (1) avoid having the scheduler waste too much time trying to load
     balance tasks that only turn out to be not allowed on the cpus the
     scheduler is considering, and/or
 (2) provide improved administrative control of a system by being able
     to construct a cpuset that is guaranteed to have no overlap of
     allowed cpus with its parent or any other cpuset not descended
     from it.

If (1) is one of your essential requirements, then I have described a
couple of implementations that mark some existing cpusets to form a
partition (in the mathematical sense of a disjoint covering of subsets)
of the system to define isolated scheduler domains.  I did this without
adding any additional bitmasks to each cpuset.

If (2) is one of your essential requirements, then I believe this can be
done with the current cpuset kernel code, entirely with additional user
level code.

> I am working on a minimalistic design right now and will get back in
> a day or two

Good.  Hopefully, you will also be able to get through my thick head
what your essential requirement(s) is or are.

--
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401
From: Matthew D. <col...@us...> - 2005-04-26 00:52:56
Paul Jackson wrote:
> A few days ago, Nick wrote:
>
>> Well the scheduler simply can't handle it, so it is not so much a
>> matter of pushing - you simply can't use partitioned domains and
>> meaningfully have a cpuset above them.
>
> And I (pj) replied:
>
>> Translating that into cpuset-speak, I think what you mean is ...
>
> I then went on to ask some questions.  I haven't seen a reply.
> I probably wrote too many words, and you had more pressing matters
> to deal with.  Which is fine.
>
> Let's make this simpler.
>
> Ignore cpusets -- let's just talk about a tasks cpus_allowed value,
> and scheduler domains.  Think of cpusets as just a strange way of
> setting a tasks cpus_allowed value.
>
> Question:
>
>   What happens if we have say two isolated scheduler domains
>   on a system, covering say two halves of the system, and
>   some task has its cpus_allowed set to allow _all_ CPUs?
>
> What kind of pain does that cause?  I'm hoping you will say that
> the only pain it causes is that the task will only run on one
> half of the system, even if the other half is idle.  And that
> so long as I don't mind that, it's no problem to do this.

I'm not the sched_domains guru that Nick is, but as your question has
gone unanswered for a few days I'll chime in and see if I can't help you
provoke a more definitive response.

Your assumptions above are correct to the best of my knowledge.  The
only pain it causes is that the scheduler will not be able to "see"
outside of the span of the highest sched_domain attached to a particular
CPU.

          A                   B
        /   \               /   \
       X     Y             Z     W
      / \   / \           / \   / \
     0   1 2   3         4   5 6   7

In this setup, with your "Alpha" & "Beta" domains splitting the system
in half, a process in a cpuset spanning cpus 0..7 will get "stuck" in
whichever domain it happens to be in when the Alpha & Beta domains get
created.  Explicit sys_sched_setaffinity() calls will still move it
between domains, but it will not move between Alpha & Beta on its own.
Load balancing from CPU 0's perspective (in Alpha) sees only CPUs 0..3.

Right Nick? ;)

-Matt
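Matthew's point about explicit affinity calls can be shown with the
glibc sched_setaffinity() wrapper: shrinking a task's mask to CPUs 4-7
moves it into the "Beta" half, since the balancer will not cross the
domain boundary on its own.  The 8-CPU layout is assumed from the
diagram above; this is a sketch, not part of the patches under
discussion.

/*
 * Move the calling task into the "Beta" half (CPUs 4-7 in the example)
 * by shrinking its cpus_allowed mask via sched_setaffinity().
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t beta;
	int cpu;

	CPU_ZERO(&beta);
	for (cpu = 4; cpu <= 7; cpu++)	/* CPUs under domain B / "Beta" */
		CPU_SET(cpu, &beta);

	/* pid 0 means the calling task; afterwards the scheduler may only
	 * place it on CPUs 4-7, i.e. it has crossed into Beta. */
	if (sched_setaffinity(0, sizeof(beta), &beta) != 0) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("now restricted to CPUs 4-7\n");
	return 0;
}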
From: Paul J. <pj...@sg...> - 2005-04-26 01:01:26
Matthew wrote:
> and see if I can't help you provoke
> a more definitive response.

Thanks ;).

--
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <pj...@en...> 1.650.933.1373, 1.925.600.0401