From: Michael H. <hoh...@us...> - 2002-01-22 22:46:11
On Fri, 18 Jan 2002, Paul Jackson stated:

> The remaining open items in the CpuMemSets design:
>
>    http://sourceforge.net/docman/display_doc.php?docid=8463&group_id=8875
>
> are:
>
> 1) how do we group the users of CpuMemSets (the processes and vm areas),
> 2) how do we group the resources (cpus and memory blocks), and
> 3) in particular, how do we handle reconfiguring after removing a cpu?
>
> Proposed Solutions:
>
> 1) I think that any adequate solution to (1):
>
>    1a) must be hierarchical (groups and subgroups)

Why?

>    1b) must be inherited (across fork and vm area creation),

No, but it must be possible to request inheritance (e.g., through the use
of launch policies, or through specific system calls).

>    1c) requires the kernel to restrict change authorization
>        to group boundaries,

Huh?

>    1d) requires the kernel to support the automatic
>        inheritance across process and vm area creation, and

Not automatic inheritance, but optional inheritance.

>    1e) requires the kernel to support bulk operations to change
>        maps and sets for all members of a group atomically.

Or, alternatively, requires all members of a group to use the same maps.

> ..remaining post deleted..

I have previously promised (threatened) to post my views on process
grouping to this forum. I fully intended to post last week but was
sidetracked. Below is what I have put together. It addresses directly the
first of the three open items, indirectly the second, and, so far,
ignores the third.

Thoughts on grouping of processes for NUMA performance:

The goal is to ensure that two or more processes execute on the same NUMA
node, and that if one process migrates to another node, all grouped
processes migrate with it. This grouping is referred to as process
association in the following discussion. The reason: process association
has shown significant performance benefits on NUMA systems with dynamic,
database-oriented workloads.

Issues:

* CpuMemSets does not provide the concept of a node. It works on the
  association of a memory block with one or more CPUs. The only way to
  force a set of processes to always reside on the same node is to
  provide only one node in their cpumemmap or cpumemset. This is too
  restrictive.

* Currently there is no support for process migration across NUMA nodes.
  That infrastructure needs to be established.

* Assuming process migration, process association, and CpuMemSets all
  exist, how does the system handle associated processes with respect to
  their CpuMemSets? Should all associated processes use the same set?
  And if so, which process controls it? At the least, a mechanism needs
  to be in place to a) allow sharing of cpumemsets; b) force the sharing
  of cpumemsets; and c) establish and enforce ownership rights over
  cpumemsets.

  If, on the other hand, all associated processes do not use the same
  cpumemset, then at process migration time the set of associated
  processes must have their cpumemsets scanned, and the association may
  only be migrated to a node that is represented in all of the
  cpumemsets.

* How is the system to determine what constitutes a node for the purpose
  of process migration? This is being addressed in the topology design,
  but how best to incorporate this concept into CpuMemSets?

Ideas:

* First, we need to establish the concept of a NUMA node. While this has
  been intentionally kept out of CpuMemSets, I believe it necessary to
  provide process association. Getting topology info from the topology
  subsystem is fine, but is it practical to have to scan through the
  topology structures every time that a decision is needed regarding
  nodes and process association? Thinking through the steps, one must go
  through the cpumemset, then the cpumemmap, to obtain a system
  processor number. From there the topology structures must be scanned
  to find the processor and the node it resides on. This lookup needs to
  be done at least twice (once for the current processor, once for the
  target), and would be happening from the scheduling routines. This
  seems like a good spot to provide a shortcut to obtain this
  information. Can we come up with a way to keep node info in the
  cpumemmap? (A rough sketch of one possibility follows.)
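  To make that shortcut concrete, here is a rough sketch of the kind of
  annotation I have in mind. This is purely hypothetical -- the type and
  field names are invented for discussion, and none of this exists in
  the current CpuMemSets patch:

      /*
       * HYPOTHETICAL sketch only -- names invented for illustration.
       * Cache each application cpu's node in the cpumemmap, filled in
       * from the topology structures whenever the map is created or
       * changed, so scheduling-time lookups avoid a topology scan.
       */
      struct cpumemmap_sketch {
              int  nr_cpus;    /* cpus in this map */
              int *sys_cpu;    /* app cpu number -> system cpu number */
              int *cpu_node;   /* app cpu number -> node id (cached) */
      };

      /* Node lookup from scheduling code collapses to one array read: */
      static inline int cmm_node_of(struct cpumemmap_sketch *m,
                                    int app_cpu)
      {
              return m->cpu_node[app_cpu];
      }

  The maintenance cost lands on the rare map-update path, not on the
  scheduler.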
* Process migration can be implemented completely independently of
  CpuMemSets. CpuMemSets, though, would provide policy concerning what
  resources a process may be migrated to.

* Assuming we solve the previous two problems, then my preference would
  be to have associated processes share a cpumemmap/cpumemset with
  modification rights restricted (to whom?).

Summary:

Paul Jackson's "hare-brained" solution, while intriguing, is perhaps
larger than what is needed with respect to process grouping. I propose a
smaller, simpler solution that ties into CpuMemSets. Does this meet the
needs of potential users?
From: Michael H. <hoh...@us...> - 2002-01-25 17:18:21
Paul,

The missing element that has resulted in our difference regarding
inheritance is that I am referring to inheritance of the binding of
processes to each other, not the binding of processes to physical
resources. Let me restate what I am looking to accomplish, and hopefully
this will be clearer.

This discussion is going to use the term "node", which I hope does not
cause too much heartburn. Look at "node" in a more abstract sense - that
is, it is a collection of physical resources within some (perhaps
arbitrary) boundary. Thus, for the existing IBM NUMA-Q systems, a node
maps quite easily to what we refer to as a quad, which is a single unit
containing memory, 4 CPUs, and some I/O busses. However, a node could,
in some other implementation, consist of multiple CPU components and
memory components not implemented within the same physical unit. A node
can be a logical grouping established (by the topology subsystem?) at
boot time. The idea to bear in mind is that, for the following
discussion, a node is a discrete collection of physical resources.

In a dynamic system the load on the system is likely to vary over time.
The normal scheduling policy is to assign a process to a node and keep
it there through the life of the process. However, the load across nodes
may become unbalanced, and thus system throughput will benefit from
moving processes from a heavily loaded node to a lightly loaded node. So
far all is good. However, there will exist some processes that, due to
either data sharing or some other interprocess communication, benefit
from always being kept within the same node. It is ok to migrate these
processes to other nodes; they just need to go together.

However, associating a large number of processes in the same grouping
can make migration very difficult, to the point of being prohibitive.
Thus it is desirable to keep process association to a minimum, and only
apply it to processes that can demonstrably benefit from it. For this
reason, inheritance of process association by default is undesirable. A
parent process might know that association with a child is beneficial,
so the capability needs to exist to provide that; but as a default
policy, process association inheritance would create unnecessary
process-to-process bindings, making process migration difficult to
impossible. Note also that by default there is no process association:
when a process is spawned there is no required association of it with
its parent. It must be requested.

Now taking this need to CpuMemSets, what is missing is a way to
establish the boundaries around a set of physical resources to define
what I refer to as a node. We need some way to say that it is ok for
this process to execute on some set of CPUs, but that there is a
boundary such that if the process moves from this subset of CPUs to some
other subset of CPUs, all associated processes must move with it. This
boundary needs to be defined. Since CpuMemSets defines what cpus a
process may execute on, it seems to follow that it should also identify
where node boundaries exist. This point is arguable, but it is what I am
after. (A rough sketch of what such a boundary might look like follows
below.)

A secondary concern is the need for associated processes to have access
to the same set of physical resources. This is where my suggestion of
mandatory sharing of maps/sets comes from. I consider this a secondary
concern, and this problem can be solved in other ways.

The actual binding of processes to other processes (association) happens
outside of CpuMemSets. However, there is some support, as described
above, that I feel is needed.
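To illustrate the boundary idea, here is a purely hypothetical sketch --
every name is invented, and none of this is in the current CpuMemSets
patch:

    /*
     * HYPOTHETICAL sketch -- invented names, not the CpuMemSets API.
     * The cpus of a cpumemset are partitioned into disjoint "node"
     * subsets.  A process may run anywhere in the set, but if it
     * crosses from one subset to another, its whole association must
     * be migrated with it.
     */
    struct cms_node_layout_sketch {
            int  nr_nodes;
            int *node_of_cpu;   /* app cpu number -> node index,
                                   a disjoint partition of the set */
    };

    /* Moving from cpu a to cpu b drags the association along only
       if the move crosses a node boundary: */
    static inline int cms_crosses_boundary(
            struct cms_node_layout_sketch *l, int a, int b)
    {
            return l->node_of_cpu[a] != l->node_of_cpu[b];
    }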
Nowhere in here am I questioning the inheritance of cpumemmaps/sets. I
am only questioning the need for default inheritance of process
association.

Now, if I had a better understanding of what your needs are with respect
to process association, perhaps some common ground could be found and a
solution developed that solves both sets of needs. That is why I asked
the (terse) questions (see below), and based upon your response I still
do not understand what your needs/goals are with respect to process
association.

            Michael Hohnbaum
            hoh...@us...

On Thu, 24 Jan 2002, Paul Jackson said:

> Ah - I realize a further confusion in my reading
> of Michael's response to my hare-brained proposal last week.
>
> Michael, responding to Paul:
>
> > > Proposed Solutions:
> > >
> > > 1) I think that any adequate solution to (1):
> > >
> > >    1a) must be hierarchical (groups and subgroups)
> >
> > Why?
>
> > >    1b) must be inherited (across fork and vm area creation),
> >
> > No, but it must be possible to request inheritance (e.g., through
> > the use of launch policies, or through specific system calls)
>
> It must be the case that inheritance of processor and memory
> restrictions can be _automatic_, without any explicit use by
> the parent process of some launch policy or system call.
>
> That is, it must be that some grandparent process can be fired
> off to run on some particular node (or other specific set of
> cpus and memory blocks), with the result that all the child and
> grandchild processes spawned directly or indirectly from that
> grandparent process remain constrained to run in no more (but
> possibly less) than that original node (or specific set). All
> this with _no_ code in or awareness by any of these processes.
>
> Like most constraints on the execution of a process in Unix (user
> and group ids, ulimits, capabilities, nice value, controlling
> tty, ...), the constraint as to which cpus it can schedule on,
> and which nodes it can allocate from, and now the identity of
> the allocation group to which it belongs for the purpose of bulk
> migration, should be, and one would _expect_ to be, inherited.
> In particular, the process allocation group is specified by the
> cpumemmap, and hence inherited along with the rest of the map.
>
> How in heaven's name else might you want the default setting of
> this, if not by inheritance?
>
> _Every_ task, _every_ vm area, has some cpumemmap and cpumemset.
> There must be some default setting of these on each fork and
> vm area creation, absent specific instructions to the contrary
> from the creating process. To my mind, the only obvious choice
> of default value is by inheritance from the creator (in this
> case, with the refinement that it's the CHILD value that is
> inherited, as opposed to the parent's CURRENT value).
>
> I am rather puzzled that you would question this "Why?" or
> doubt it "No". That you did so suggests to me that you have
> something quite different in mind, and that we are failing more
> than we realize to communicate.
>
> Or [wild guess alert here] is it only that you were objecting
> to what you thought was an insistence from me that _only_
> inheritance be available to determine the allocation group?
> Of course, with explicit system calls (and adequate permission)
> one needs to be able to set something other than what is
> inherited. Otherwise, everyone inherits the common value
> from the init (pid == 1) process and the setting is useless.
> Clearly I wouldn't advocate that.
From: Paul J. <pj...@en...> - 2002-02-01 23:37:44
My apologies, Michael, for not responding sooner to your message of last
Friday. I have been mostly unavailable for work this week, and likely
next week as well, due to some commitments at home.

Your explanation of the grouping you have in mind seems quite clear from
my quick reading of your message. Let me play it back, in the context as
I see it, and see if we are more in sync.

You are describing another way in which we must group something that I
wasn't thinking of, namely grouping the tasks that should move together
in the face of process migration. Yes, I agree that a simple inheritance
of that grouping, if it led to having to move large pools of tasks,
would not work right, as it would force too many tasks to move as a
group.

So it seems like we have the following "groupings":

 1) grouping of resources (cpu, mem) -- including "nodes",

 2) grouping of tasks for the purpose of controlling which subset of
    nodes they are allowed to use, which is what I had in mind when I
    presumed that such grouping must be inherited, and

 3) grouping of tasks that should be migrated as a group, which you
    point out doesn't fit simple inheritance well.

And then we need to bind task groups to resource groups, as in "hey -
you guys run here". Note here that I am trying (pedantically) to
distinguish groupings (sets and subsets of similar things) from bindings
(mappings between two groupings).

I readily agree that CpuMemSets doesn't yet speak to any of these needs.
Well, for the next 7 paragraphs, for the sake of discussion, I agree.

Could it be that grouping (3) is a simple variant flavor of (2)? Imagine
that all we have are groups of resources, groups of tasks, and a boolean
attribute of task groups: does the group migrate together? Let the
default setting for that attribute be false -- don't require migration
together. For a close-knit group of tasks that _did_ warrant keeping
together, the attribute would be set true -- migrate together. Group
membership of tasks would by default be inherited, but because the
default is not to force groups to migrate together, we avoid the problem
of a request to migrate turning into an avalanche.

If this seems ok, we are down to groups of resources (such as nodes) and
groups of tasks, where membership in a task group is inherited across
fork and vm area creation.

For the applications that (it seems to me) you are focused on, migration
is more important. But for the applications that I am focused on,
"subleasing" is more important. These are long running, very large, big
data, big compute applications that may be tuned to have specific pages
of memory in certain memory nodes, and specific tasks on certain cpus,
for perhaps hours or days at a time. Perhaps like the difference between
tracking all the UPS trucks in real time, versus tracking one space
shuttle launch. For my needs, the grouping of resources by node is only
a poor fit. We need the ability to name most any subset of cpus and
memory blocks as a resource group.

Hmmm ... contrary to what I said above about cpumemsets not supporting
any sort of grouping, might it be that cpumemmaps provide an anonymous
grouping? A given cpumemmap certainly specifies some particular set of
cpus and memory blocks; that is the grouping of resources in item (1)
above. That's definitely grouping them. Furthermore, the set of tasks
and vm areas sharing a particular cpumemset is a grouping of the users
in item (2) above. And the association of a cpumemset with its cpumemmap
is definitely a binding of said users to said resources.
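Rendered as code, the groupings and binding just described might look
something like this -- a hypothetical sketch, every name invented for
discussion:

    /*
     * HYPOTHETICAL sketch -- invented names, for discussion only.
     */
    struct resource_group {        /* item (1): what a cpumemmap names */
            int nr_cpus, *cpus;    /* system cpu numbers */
            int nr_mems, *mems;    /* system memory block numbers */
    };

    struct task_group {            /* item (2): users of a cpumemset */
            struct resource_group *runs_on;  /* the binding:
                                                "you guys run here" */
            int migrate_together;  /* item (3): default false; set
                                      true only for close-knit
                                      associations, so inherited
                                      membership doesn't turn one
                                      migration into an avalanche */
    };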
So why aren't we done, give or take a minor additional attribute to
state whether a given group of users need migrate together or not?

Could it be that these groupings (of resources and users) need an
identity? My earliest designs for CpuMemSets gave these groups explicit
handles, like process ids, that were visible and stable across a system
for the duration of a boot. I later retracted that identity, leaving
cpumemmaps and cpumemsets as anonymous attributes of the tasks and vm
areas attached to them.

I'm beginning to think that it's time to reconsider this choice, and to
examine adding the various paraphernalia associated with a manageable
system object -- identity (a name or handle), permissions, resource
management and explicit binding -- to cpumemmaps (as groups of
resources), cpumemsets (as groups of users), and the linkage of sets to
maps (the binding).

This is where I think the "grouping" question must lead. We already have
the groups -- we just need to give them full form.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
From: Paul J. <pj...@en...> - 2002-01-23 05:31:08
On Tue, 22 Jan 2002, Michael Hohnbaum wrote:
>
> I have previously promised (threatened) to post my views on process
> grouping to this forum, fully intending to post last week but was
> sidetracked. Below is what I have put together.

Excellent. You've done a better job than I of describing the Issues and
Ideas, so I will work from your notes. Thanks.

> The goal is ... "process association" ... [which] has shown
> significant performance benefits on NUMA systems with dynamic,
> database-oriented workloads.

I see 3 things in this goal:

 - the need to associate processes (so they can move together),
 - the need to associate cpus (creating nodes, say), and
 - the performance benefit of keeping associated processes together on
   associated cpus.

That is, two additional features (cpu and process association) are
needed beyond what CpuMemSets provides, and then one particular
performance benefit results, which is, for _some_ systems and _some_
loads, the important goal of improved performance on dynamic dbms
oriented loads.

> Issues:
>
> * CpuMemSets does not provide the concept of a node. It works on the
>   association of a memory block with one or more CPUs.

I agree that a notion of associating cpus (and memory) is required.
That was my item (2) in my hare-brained posting:

> > 2) how do we group the resources (cpus and memory blocks) ...

I appreciate that the most common term for such a group of cpus and
memory is "node". I'd like a bit more flexible grouping. For example, I
include in my design space systems with many nodes, not all equi-distant
from each other. Imagine, say, a system with 2 cpus per chip, 4 chips
per node, 8 nodes per super-node, and 4 super-nodes total in the system,
for a total of 2*4*8*4 == 256 cpus per system. Mind you, I don't have
any such system in my lab right now, and I'm not saying when I might
have such a system. But I'd like the basic API and structure we provide
to extend sensibly to such a system.

Just as the systems won't be simply equi-distant nodes of cpus, so too
the application loads aren't just multiple dynamically scheduled single
node loads. We may have a mix of job sizes running together on a system,
with jobs suitable for most anything from 1 to 256 of all the cpus
available (or however large the system is). And some jobs may want to be
dynamically scheduled while others might want to sit on a single set of
cpus for hours, even days, crunching away.

> The only way to force a set of processes to always reside on
> the same node is to only provide one node in their cpumemmap
> or cpumemset. This is too restrictive.

I'm missing your point here. Yes, one forces a set of processes to
reside on the same node by restricting their cpumemmap to the cpus and
memory in that node. But I don't get what you mean by "This is too
restrictive". My wild guess would be that you have in mind something
like the following: while a given set of associated processes might have
long-standing (static) authority to run on any of several nodes, still
for any given shorter period of time (more dynamic) they should all be
scheduled on the same node, for improved performance in the presence of
big honking caches.

If this is along the lines you're thinking, then I would suggest that
CpuMemSets be used to capture the static constraints, and other
mechanisms, such as perhaps scheduler changes from Hubertus or Ingo, be
used to capture the more dynamic constraints. These more dynamic
scheduler changes would respect the static CpuMemSet constraints (only
run on these nodes), and further tend to keep associated processes all
on the same node, with some form of dynamic load balancing, perhaps just
at fork-time, or perhaps also migrating running processes.
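A sketch of that static/dynamic division of labor -- hypothetical
fields; of these, only cpus_allowed corresponds to something real (the
existing bit vector in the 2.4 task struct):

    /*
     * HYPOTHETICAL sketch of the static/dynamic split.  Only
     * cpus_allowed is real; the rest is invented for illustration.
     */
    struct placement_sketch {
            unsigned long cpus_allowed;  /* static envelope, derived
                                            from the task's cpumemset;
                                            changes only on rare
                                            user-land set/map updates */
            int preferred_node;          /* dynamic choice, made by
                                            the scheduler or load
                                            balancer (e.g. at fork
                                            time), always a node lying
                                            within the static envelope */
    };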
> * Currently there is no support for process migration across
>   NUMA nodes. That infrastructure needs to be established.

Yes - I quite agree. Though I will readily admit that it's likely that
IBM has more money riding on this feature than SGI (heck, they've got
more money in general <grin>), so I anticipate that IBM will be more
likely to take the lead here, while I and the other SGI folks working on
CpuMemSets do our best to encourage, support and "play nice" with this
process migration work.

> * ... Should all associated processes use the same set? And
>   if so [ ... ownership rights ...]
>
> If, on the other hand, all associated processes do not use the same
> cpumemset, then at process migration time the set of associated
> processes must have their cpumemsets scanned, and the association
> may only be migrated to a node that is represented in all of the
> cpumemsets.

I intuit that, in the back of your mind as you wrote this, you were
concerned that if we have the scheduler rescanning all the cpumemsets in
a process association on each reschedule, looking for the best cpu in
the intersection of the cpumemsets in that association, then we will
have grossly unacceptable impact on scheduler performance. And I'd quite
agree - doing such would be outrageously unacceptable.

But, as is already the case with the CpuMemSet patch we just released,
the performance critical portion of the scheduler shouldn't be involved
in CpuMemSets. Rather, whatever mechanism it would have used anyway
(such as Ingo's cpus_allowed bit vector) continues to control which cpus
it can run on. Only when cpumemsets are changed from user land, which is
a rare event, do we have to perform the more expensive scans of
cpumemset and cpumemmap data structures to recompute their effect on the
performance critical data settings used by the main line scheduler (and
optional process migrator).

Also, I would not expect that load migration should be considered on
each pass through the scheduler. That sounds like a serious waste of cpu
time. Rather, at some more leisurely background pace, or every now and
then, should migration be considered.

> Ideas:
>
> * First, we need to establish the concept of a NUMA node. ...
>   but is it practical to have to scan through the topology
>   structures every time that a decision is needed regarding nodes
>   and process association? [ ... Answer: no - too expensive ...]
>   Can we come up with a way to keep node info in the cpumemmap?

Yes - I quite agree with your implication that such would be way too
expensive. Instead of trying to keep node information in the cpumemmap,
how about keeping it in data used by (1) the scheduler, and (2) the
process migration code? We already do (1) with the cpus_allowed bit
vector, and no doubt other such mechanisms in the more recent scheduler
work by Hubertus and by Ingo.

Do not ask anyone writing an improved scheduler to go poking through the
cpumemmaps and cpumemsets, perhaps aided by having additional needed
information cached in these sets and maps. Rather, ask the CpuMemSet
code to update the data structures that have been fine-tuned for the
purposes of such scheduling and migration code with whatever relatively
static constraints are imposed on which cpus one can consider using.
Such updates are rare -- just when the user changes cpumemsets or maps.
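A rough sketch of that rare-path update -- invented names throughout,
except for cpus_allowed itself:

    /*
     * HYPOTHETICAL sketch of the rare "pre-digest" path.  Everything
     * except task_struct.cpus_allowed is an invented name.  Called
     * only when user land changes a cpumemset or cpumemmap -- never
     * from the scheduler fast path.
     */
    static void cms_predigest(struct task_struct *p)
    {
            unsigned long allowed = 0;
            int i;

            /* Walk the (slow) cpumemset/cpumemmap structures once,
               translating each application cpu to a system cpu ... */
            for (i = 0; i < cms_nr_cpus(p); i++)
                    allowed |= 1UL << cms_system_cpu(p, i);

            /* ... and deposit the result where the scheduler fast
               path already looks. */
            p->cpus_allowed = allowed;
    }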
> * Process migration can be implemented completely independently of
>   CpuMemSets. CpuMemSets, though, would provide policy concerning
>   what resources a process may be migrated to.

yup

> * Assuming we solve the previous two problems, then my preference
>   would be to have associated processes share a cpumemmap/cpumemset
>   with modification rights restricted (to whom?).

As I wrote in my hare-brained note, I think we can handle the
association of _cpus_ mostly from user land. User land system services
would construct named and rights-restricted collections of cpus and
memory (nodes, super-nodes, whatever). I do not understand how we can
enforce sharing of cpumemsets and maps to obtain such grouping. In
particular, we (well, SGI anyway) must support applications that
"sublease" their resources to subtasks. Hence the hierarchical structure
to collections of cpus and memory, and hence the need to inherit such
across fork, exec and memory region creation. (A sketch of subleasing
follows below.)

I would encourage you not to re-invent CpuMemSets as the "node + process
association" mechanism you seek, with support for the particular dynamic
scheduling and migration suitable for your particular loads and system.
Rather, keep CpuMemSets as the projection to the lowest common
denominator elements (cpus and memory blocks) of various groupings and
mechanisms, which would be separately implemented and layered on top of
CpuMemSets. There are a number of wildly different system architectures,
system loads and performance criteria, some of which are but a twinkle
in someone's eye, that we should support. This is best done by breaking
everything down to its basic elements and reconstructing them. Rather
like the way the body handles food: think of CpuMemSets as elementary
nutrients extracted by the digestive tract and reconstituted by the
liver.

Further, keep CpuMemSets separate from and invisible to whatever various
scheduler and process migration code there is. Don't ask performance
critical code to go wandering around long painful lists of cpus and
tasks. Rather, ask the generic CpuMemSet mechanism to "pre-digest" its
constraints on the decisions made in the fast code, to whatever form the
fast code demands.

> Summary:
>
> Paul Jackson's "hare-brained" solution, while intriguing, is perhaps
> larger than what is needed with respect to process grouping. I propose
> a smaller, simpler solution that ties into CpuMemSets. Does this meet
> the needs of potential users?

Ah - I missed something along the way here - "the simpler solution". I
saw, and responded to, various suggestions as to the shape that such a
simpler solution can and should take. But I failed to actually come away
with a substantive alternative solution. I think it is still the case,
as I wrote in my hare-brained note, that:

> > The remaining open items in the CpuMemSets design:
> >
> >    http://sourceforge.net/docman/display_doc.php?docid=8463&group_id=8875
> >
> > are:
> >
> > 1) how do we group the users of CpuMemSets (the processes and vm areas),
> > 2) how do we group the resources (cpus and memory blocks), and
> > 3) in particular, how do we handle reconfiguring after removing a cpu?
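Before turning to Michael's specific comments, here is what the
"sublease" pattern mentioned above might look like from user land. Every
cms_* call name is invented for illustration -- this is not the released
CpuMemSets API:

    /*
     * HYPOTHETICAL user-land sketch of subleasing -- the cms_* call
     * names are invented.  A tuned application holding, say, cpus 0-7
     * carves out cpu 3 and memory block 1 for one child, which (with
     * inheritance) constrains all of that child's descendants too.
     */
    void spawn_subleased_child(char *const argv[])
    {
            cms_handle_t whole = cms_current();       /* cpus 0-7, say */
            cms_handle_t part  = cms_subset(whole, "3", "1");

            if (fork() == 0) {
                    cms_attach_self(part);  /* subset, never superset,
                                               of the parent's lease */
                    execvp(argv[0], argv);
            }
    }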
Michael had a few specific comments on my previous message:

> > Proposed Solutions:
> >
> > 1) I think that any adequate solution to (1):
> >
> >    1a) must be hierarchical (groups and subgroups)
>
> Why?

As noted above, so that processes can "sublease". As explained at
greater length in my CpuMemSet Design Notes, a long running application
with a fixed, carefully tuned set of subtasks having differing resource
needs may want to subdivide the larger set of resources it is running on
into various smaller units: this child task runs on this cpu, and that
memory region comes from that node, for example.

> >    1b) must be inherited (across fork and vm area creation),
>
> No, but it must be possible to request inheritance (e.g., through
> the use of launch policies, or through specific system calls)

So you agree that at least fork and vm area creation must optionally
support inheritance, right? In which case, that policy must be coded.
What other policies for defining a new task or vm area's default
cpumemmap and set should we consider beyond inheritance?

> >    1c) requires the kernel to restrict change authorization
> >        to group boundaries,
>
> Huh?

This is your "appropriate permission for resource groups", phrased in
more opaque wording.

> >    1d) requires the kernel to support the automatic
> >        inheritance across process and vm area creation, and
>
> Not automatic inheritance, but optional inheritance.

So much the more work.

> >    1e) requires the kernel to support bulk operations to change
> >        maps and sets for all members of a group atomically.
>
> Or, alternatively, requires all members of a group to use the same
> maps.

No - subleasing is essential, in my view.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
From: Paul J. <pj...@en...> - 2002-01-25 05:03:28
Ah - I realize a further confusion in my reading of Michael's response
to my hare-brained proposal last week.

Michael, responding to Paul:

> > Proposed Solutions:
> >
> > 1) I think that any adequate solution to (1):
> >
> >    1a) must be hierarchical (groups and subgroups)
>
> Why?
>
> >    1b) must be inherited (across fork and vm area creation),
>
> No, but it must be possible to request inheritance (e.g., through
> the use of launch policies, or through specific system calls)

It must be the case that inheritance of processor and memory
restrictions can be _automatic_, without any explicit use by the parent
process of some launch policy or system call.

That is, it must be that some grandparent process can be fired off to
run on some particular node (or other specific set of cpus and memory
blocks), with the result that all the child and grandchild processes
spawned directly or indirectly from that grandparent process remain
constrained to run in no more (but possibly less) than that original
node (or specific set). All this with _no_ code in or awareness by any
of these processes.

Like most constraints on the execution of a process in Unix (user and
group ids, ulimits, capabilities, nice value, controlling tty, ...), the
constraint as to which cpus it can schedule on, and which nodes it can
allocate from, and now the identity of the allocation group to which it
belongs for the purpose of bulk migration, should be, and one would
_expect_ to be, inherited. In particular, the process allocation group
is specified by the cpumemmap, and hence inherited along with the rest
of the map.

How in heaven's name else might you want the default setting of this, if
not by inheritance?

_Every_ task, _every_ vm area, has some cpumemmap and cpumemset. There
must be some default setting of these on each fork and vm area creation,
absent specific instructions to the contrary from the creating process.
To my mind, the only obvious choice of default value is by inheritance
from the creator (in this case, with the refinement that it's the CHILD
value that is inherited, as opposed to the parent's CURRENT value).

I am rather puzzled that you would question this "Why?" or doubt it
"No". That you did so suggests to me that you have something quite
different in mind, and that we are failing more than we realize to
communicate.

Or [wild guess alert here] is it only that you were objecting to what
you thought was an insistence from me that _only_ inheritance be
available to determine the allocation group? Of course, with explicit
system calls (and adequate permission) one needs to be able to set
something other than what is inherited. Otherwise, everyone inherits the
common value from the init (pid == 1) process and the setting is
useless. Clearly I wouldn't advocate that.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373