From: Paul M. <Pau...@us...> - 2001-10-12 05:57:03
Hello, Paul! Took a first pass through your proposal -- good stuff! Some very interesting approaches! Some comments:

Under "Desired Properties / Vendor neutral base", I recommend adding Tru64, HP-UX, AIX, etc. May as well be all-inclusive. ;-)

Under "Implementation Layers", near the end of the second paragraph: "a small scale forking" seems a bit dramatic. That said, it would be good if the NUMA and SMP code paths used the same C code.

A question on items 1, 2, and 3 under "Implementation Layers". Item 1 seems to indicate that the current use of bitmasks within the kernel can continue unchanged, but hints at longer-term changes. Do you envision the kernel moving towards using CpuMemMaps and CpuMemSets, or do you expect these two concepts to be strictly confined to user space?

Under "Error cases" in the header file, you say that if you really want to force an application to use CPUs and memory from disjoint nodes (which you might in diagnostics or performance-measurement code), then the CpuMemSet memory list must contain CMS_DEFAULT_CPU. But CMS_DEFAULT_CPU is cast to cms_lcpu_t. Shouldn't it be cast instead to cms_lmem_t? Or is CMS_DEFAULT_CPU instead supposed to go on the CpuMemSet's list of CPUs? If the latter, does this allow the disjoint-node operation for CPUs and memory? (So that all the permitted CPUs are on one set of nodes, but all the permitted memory is on another set of nodes, and the two sets of nodes are disjoint.)

First-touch and stripe policies seem to be missing from the list. Some of the OSes have also had least-memory-utilization policies.

A cpumemset has no meaning except in the context of a cpumemmap, right? A cpumemmap maps the CPU numbers, while a cpumemset simply restricts them, right? (Could make it work either way, but things like getcpu() and getnode() need to be in sync with the choice.) For example, suppose that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What is the physical CPU corresponding to logical CPUs 0, 1, and 2?

HP's launch policies include policies for CPU allocation (round robin and the like). My guess is that this requires either additional bits in cms_policy or another field for CPU (as opposed to memory) policies.

Non-root processes are prohibited from creating cpumemmaps. Shouldn't they be allowed to create them, as long as they are subsets of the cpumemmap that they are currently running with?

If non-root processes are really prohibited from playing with cpumemmaps, what happens to their children? Is the child's cpumemmap generated from the cpumemset/cpumemmap pair that the parent specified with CMS_CHILD? Or does the child just get copies of the CMS_CHILD cpumemset/cpumemmap? (At this point, I am favoring letting non-root processes manipulate cpumemmaps, with the restriction that any that are installed must be pure restrictions of the current cpumemmap.)

There is no way to create either a cpumemmap or a cpumemset without querying for it. The intended usage is to query for this process's set/map, then manipulate the resulting structure? Is the user supposed to directly manipulate the fields of the cpumemmap and cpumemset?

A few typos, possibly due to WikiWeb issues: "cpu's" -> "CPUs", "mem's" -> "memories", "preasure points" -> "pressure points", "CpumemSet" -> "CpuMemSet", etc.

Enough questions for now! More later... Once again, some good stuff here!

Thanx, Paul
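(A sketch of the two readings behind the {100,102,103} question above; cms_pcpu_t and cms_lcpu_t are the proposal's types, but the array names and the commentary are illustrative, not part of the proposal:)

cms_pcpu_t map_cpus[] = { 100, 102, 103 };      /* cpumemmap cpus */
cms_lcpu_t set_cpus[] = { 0, 2 };               /* cpumemset cpus */

/* Reading 1 - the map defines the logical numbering: logical cpu 0
 * is physical cpu 100, logical 1 is 102, and logical 2 is 103, so
 * the set {0,2} permits physical cpus 100 and 103.
 * Reading 2 - the set names cpus in the same numbering the map
 * uses: {0,2} would then match none of {100,102,103}.
 * getcpu() and getnode() must agree with whichever reading is
 * chosen, which is the point of the question above. */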
From: Paul M. <Pau...@us...> - 2001-10-15 22:33:05
> Thanks - excellent comments, Paul. I look forward to continued feedback from yourself and others.

Will do my best! And thank you for reformatting them, sorry for their previous less-than-helpful state.

> ====
> > A question on items 1, 2, and 3 under "Implementation Layers". Item 1 seems to indicate that the current use of bitmasks within the kernel can continue unchanged, but hints at longer-term changes. Do you envision the kernel moving towards using CpuMemMaps and CpuMemSets, or do you expect these two concepts to be strictly confined to user space?
>
> I doubt that the guts of scheduling or allocation code will ever want to be written in terms of CpuMemMaps and Sets. The data structure used in CpuMem*, arrays of cpu or mem id's, is almost surely too inefficient for critical path code.
>
> I expect that when someone needs a kernel for > 64 cpus, they will have to find an alternative to, or extension of, Ingo's cpus_allowed bit vector.

I agree with this. My hope is that there will be a way to bury such differences between 64-CPU and >64-CPU data structure manipulation in a macro, an inlined function, or some such.

> And I expect continued coding activity for various other purposes in both the scheduling and allocation code, which will at times impact the preferred data representation of available cpu and memory resources for these critical code paths.
>
> The kernel code that implements the system calls to set CpuMemMaps and Sets will always have the responsibility for translating the stable, but cumbersome, API of these Maps and Sets into the efficient representation of the day required by the scheduling and allocation code.

Yep! There may come a time when people want a short-form interface, but I believe that what is good enough for select() vectors and for signal masks is good enough for NUMA. ;-)

> ====
> > Under "Error cases" in the header file, you say that if you really want to force an application to use CPUs and memory from disjoint nodes (which you might in diagnostics or performance-measurement code), then the CpuMemSet memory list must contain CMS_DEFAULT_CPU. But CMS_DEFAULT_CPU is cast to cms_lcpu_t. Shouldn't it be cast instead to cms_lmem_t? Or is CMS_DEFAULT_CPU instead supposed to go on the CpuMemSet's list of CPUs? If the latter, does this allow the disjoint-node operation for CPUs and memory? (So that all the permitted CPUs are on one set of nodes, but all the permitted memory is on another set of nodes, and the two sets of nodes are disjoint.)
>
> Aha - this Error Case is confused, sufficiently so that it apparently managed to further confuse your critique ;).

Glad it wasn't just me. ;-)

> And the Error Case preceding it is also confused. For the benefit of those who don't have this Design Note at hand, the two confused Error Cases are:
>
> * It is not an error if a CpuMemSet for an object (task, vm area
> * or kernel) doesn't provide memory lists for all the cpus in
> * that object's CpuMemMap. That is, it is ok for a CpuMemSet to
> * "be smaller than" (only use a subset of) its Map.
> *
> * However it is an error to set a CpuMemSet that shows cpus that
> * are not listed in any of the memory lists of that CpuMemSet,
> * unless the memory lists include a CMS_DEFAULT_CPU. Attempts to
> * set such a CpuMemSet fail with errno set to ESRCH. This case
> * must be an error to avoid trying to allocate memory without
> * knowing which memory list to search.
>
> Let me try again ...
> A CpuMemSet specifies two things. It specifies on which cpus (in the corresponding CpuMemMap) a task may be scheduled, and it specifies in what order to search for memory, per virtual memory area, depending on which cpu the request for memory was executed.

OK... How is the CPU taken into account? Do you traverse the list of memories until you find one that is closest to the current CPU? If there is no memory to be had there, do you then go through the list of memories in order starting at that point, and wrapping around if necessary? Or are you using something similar to classzones, which would be represented separately?

> The question arises - where do we look for memory if a request for memory is executed on a cpu that is not specified in the active CpuMemSet? Perhaps someone didn't provide memory lists for all possible cpus that might execute code sharing that area.
>
> Heck, perhaps they _couldn't_ specify such memory lists, because they are setting up a shared memory area that will be shared with other tasks running on cpus outside the Map of the process initializing the shared memory area.
>
> Seems that I can either (1) require that a memory list _always_ be specified for the CMS_DEFAULT_CPU, or (2) mandate that attempts to allocate memory when the CpuMemSet does not specify (even by CMS_DEFAULT_CPU) any memory list for the currently executing cpu must fail. Choice (2) would cause nasty, obscure and intermittent errors. So it must be choice (1).

Agreed that deterministic errors are much better than non-deterministic errors! But how do you indicate which memory is associated with CMS_DEFAULT_CPU? Must the CPU and the memory lists be the same size? My guess was "no", since there are separate fields for the length of each. Or does CMS_DEFAULT_CPU just start the search of memory from the first element of the array?

> Hence these two Error Cases collapse into the following single case:
>
> * Every CpuMemSet must specify a memory list for the
> * CMS_DEFAULT_CPU, to ensure that regardless of which CPU a
> * memory request is executed on, a memory list will be available
> * to search for memory. Attempts to set a CpuMemSet without a
> * memory list specified for the CMS_DEFAULT_CPU will fail, with
> * errno set to EINVAL.
>
> ====
> > First-touch and stripe policies seem to be missing from the list. Some of the OSes have also had least-memory-utilization policies.
>
> First touch is the natural order of things. If, as will usually be the case, the memory lists are ordered by distance from the faulting cpu, then this provides first touch. Existing upper level API's, such as cpusets, dplace, runon, OpenMP, MPI that support a First Touch policy, would presumably implement that policy by properly sorting the memory lists.
>
> Aha - perhaps I should change the CMS_DEFAULT policy comment, from:
>
> #define CMS_DEFAULT 0x00 /* None of the following optional policies */
>
> to some such as:
>
> #define CMS_DEFAULT 0x00 /* Memory list order (first-touch, typically) */
>
> Tell me more about what is the essence of stripe policies, as they apply here. My guess is that a combination of proper memory list sorting plus a round robin (CMS_ROUND_ROBIN) policy will provide the desired semantics. But more consideration of this point is needed.

Confusion on my part -- different names for the same thing in the different APIs...

> What is a "least-memory utilization policy"? I am not familiar with that term.
It means that, at page-fault time, you allocate memory from the node/memory with the least utilization.

> ====
> > A cpumemset has no meaning except in the context of a cpumemmap, right? A cpumemmap maps the CPU numbers, while a cpumemset simply restricts them, right? (Could make it work either way, but things like getcpu() and getnode() need to be in sync with the choice.) For example, suppose that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What is the physical CPU corresponding to logical CPUs 0, 1, and 2?
>
> Yes, Yes, tell me about getcpu/getnode, and {100,102,103}.
>
> My presumption, from reading this, is that getcpu(), executed in this map and set, would return either 100 or 103, rather oblivious to the CpuMemMap. I have no clue what getnode() would or should do that relates to CpuMemSets. Perhaps we need a "getcmscpu()"?

I believe that we need a way to get the physical CPU ID. So, at this point, do we have two levels of ID, or three? In case of diagnostics, you want to identify the physical CPU. So we quickly get into the issue that Martin Bligh raised earlier. ;-)

For related sets of processes running as root, you can end up with many more levels of ID. Process 100 maps to exclude the first CPU, then forks process 101, which maps to exclude its idea of the first CPU, and so on. This problem is inherent in the notion of virtualizing the CPUs. We could try to eliminate the middle level (cms_pcpu_t), but that could require using whatever ugly IDs the hardware wanted to provide. Maybe the name of the cms_pcpu_t level should be "complete" instead of "physical"? Or some other naming?

So the relevant numbering schemes are the following: (1) the numbering that the current process sees, as mapped by the CpuMemMap that controls it, and (2) the underlying physical identifiers that would make sense to someone servicing the hardware. Thoughts?

> ====
> > HP's launch policies include policies for CPU allocation (round robin and the like). My guess is that this requires either additional bits in cms_policy or another field for CPU (as opposed to memory) policies.
>
> Aha - might be so. I should investigate this further.

http://lse.sourceforge.net/numa/manpages/hpux-mm.txt, then search for "Launch".

> ====
> > Non-root processes are prohibited from creating cpumemmaps. Shouldn't they be allowed to create them, as long as they are subsets of the cpumemmap that they are currently running with?
>
> Non-root processes can restrict where they execute tasks or allocate memory by altering their CpuMemSet, within the confines of the CpuMemMap established for them. Why is that not enough?

My concern is that non-root processes cannot virtualize their children. If you want to divide the CPUs and memory available to you into two pieces, and run a child in each piece, you can do so, but you cannot have both children thinking that they have their own CPU 0.

> ====
> > If non-root processes are really prohibited from playing with cpumemmaps, what happens to their children? Is the child's cpumemmap generated from the cpumemset/cpumemmap pair that the parent specified with CMS_CHILD? Or does the child just get copies of the CMS_CHILD cpumemset/cpumemmap?
>
> I don't understand the difference between these two choices. They both sound the same to me, and both sound right.

It turns out that they are the same.
When I wrote this, I didn't know that cpumemset didn't also map numberings. So I thought that maybe the child would get a cpumemset generated from the "product" of the CMS_CHILD cpumemset and cpumemmap, and a full cpumemset.

> ====
> > (At this point, I am favoring letting non-root processes manipulate cpumemmaps, with the restriction that any that are installed must be pure restrictions of the current cpumemmap.)
>
> Well, having not yet appreciated the comments above on this, I will have to table this suggestion pending my further enlightenment.

OK... ;-) Please let me know whether my non-root virtualization example above seems reasonable to you.

> ====
> > There is no way to create either a cpumemmap or a cpumemset without querying for it. The intended usage is to query for this process's set/map, then manipulate the resulting structure?
>
> Yes, yes. Earlier designs allowed for creating and manipulating CpuMemSets, as a kernel supported object that was visible as a distinct identified object to applications, separate from their binding to any given task or vm area. But I could see no essential use for unbound CpuMemSets, so now they are an attribute of known tasks and vm areas.

Sounds reasonable.

> ====
> > Is the user supposed to directly manipulate the fields of the cpumemmap and cpumemset?
>
> Yes, via the cmsSet*() system calls. Well, as direct as anything in a protected operating system -- the user politely asks the kernel to set a map or set, and doesn't directly write kernel memory.
>
> My hunch here is that I missed the real point of this question ...

Should the interfaces be used as follows?

p = cmsQueryCMS(CMS_CURRENT, (void *)0);
if (p->nr_cpus > 1) {
        p->nr_cpus--;
}
cmsSetCMS(CMS_CURRENT, (void *)0, p);

> > Enough questions for now! More later... Once again, some good stuff here!
>
> I look forward to your further comments, and to integrating this work with substantial good work being done by folks on your end, and elsewhere.

One thing we need to hash out is what the unit of memory control is. In the simple-binding proposal, it is an arbitrary range of virtual memory (similar to what madvise() might do), while in the Process Scheduling and Memory Placement proposal, it appears to be an object. Conflicts are handled in a last-change-wins manner?

Thanx, Paul
From: Paul J. <pj...@en...> - 2001-10-16 23:35:28
Thanks again, Paul, for the continued excellent feedback.

On Mon, 15 Oct 2001, Paul McKenney wrote:
> > I expect that when someone needs a kernel for > 64 cpus, they will have to find an alternative to, or extension of, Ingo's cpus_allowed bit vector.
>
> I agree with this. My hope is that there will be a way to bury such differences between 64-CPU and >64-CPU data structure manipulation in a macro, an inlined function, or some such.

Something like that, yes. This is just a concern to the schedule() code - nicely isolated.

> > The kernel code that implements the system calls to set CpuMemMaps and Sets will always have the responsibility for translating the stable, but cumbersome, API of these Maps and Sets into the efficient representation of the day required by the scheduling and allocation code.
>
> Yep! There may come a time when people want a short-form interface, but I believe that what is good enough for select() vectors and for signal masks is good enough for NUMA. ;-)

My intention is that short-form interfaces are provided on top of CpuMemSets -- we (sgi) expect to do a few such ourselves, to emulate existing interfaces.

> > Let me try again ...
> >
> > A CpuMemSet specifies two things. It specifies on which cpus (in the corresponding CpuMemMap) a task may be scheduled, and it specifies in what order to search for memory, per virtual memory area, depending on which cpu the request for memory was executed.
>
> OK... How is the CPU taken into account? Do you traverse the list of memories until you find one that is closest to the current CPU? If there is no memory to be had there, do you then go through the list of memories in order starting at that point, and wrapping around if necessary? Or are you using something similar to classzones, which would be represented separately?

Hmmm ... I have not yet adequately presented some aspects of this data structure. I must add an example ... that might connect with additional readers.

Let's try this one:

Example:
========

One way to understand these data structures is to look at an example, given the following hardware configuration:

Let's say we have a four node system, with four CPUs per node, and one memory per node, named as follows:

Name the 16 CPUs: c0, c1, ..., c15      # 'c' for CPU
and number them:  0, 1, 2, ..., 15      # cms_pcpu_t

Name the 4 memories: mn0, mn1, mn2, mn3 # 'mn' for memory node
and number them:     0, 1, 2, 3         # cms_pmem_t

CpuMemMap:

Now let's say the administrator (root) chooses to set up a Map containing just the 2nd and 3rd node (CPUs and memory thereon). The cpumemmap for this would contain:

{
    8,      # nr_cpus (length of cpus array)
    p1,     # cpus (ptr to array of cms_pcpu_t)
    2,      # nr_mems (length of mems array)
    p2      # mems (ptr to array of cms_pmem_t)
}

where p1, p2 point to arrays of physical cpu + mem numbers:

p1 = [ 4,5,6,7,8,9,10,11 ]  # cpus (array of cms_pcpu_t)
p2 = [ 1,2 ]                # mems (array of cms_pmem_t)

This map shows, for example, that for this Map, logical cpu 0 corresponds to physical cpu 4 (c4).

CpuMemSet:

Further, let's say that an application running within this map chooses to restrict itself to just the odd-numbered CPUs, and to search memory in the common "first-touch" manner (local node first).
It would establish a CpuMemSet containing:

{
    CMS_DEFAULT,    # cms_policy
    4,              # nr_cpus (length of cpus array)
    q1,             # cpus (ptr to array of cms_lcpu_t)
    2,              # nr_mems (length of mems array)
    q2,             # mems (ptr to array of cms_memory_list)
}

where q1 points to an array of 4 logical cpu numbers and q2 to an array of 2 memory lists:

q1 = [ 1,3,5,7 ],   # cpus (array of cms_lcpu_t)
q2 = [              # See "Verbalization examples" below
    { 3, r1, 2, s1 }
    { 2, r2, 2, s2 }
]

where r1, r2 are arrays of logical cpus:

r1 = [1, 3, CMS_DEFAULT_CPU]
r2 = [5, 7]

and s1, s2 are arrays of memory nodes:

s1 = [0, 1]
s2 = [1, 0]

Verbalization examples:

To read item q1 out loud:

    Tasks in this CpuMemSet may be scheduled on any of the logical CPUs [ 1, 3, 5, 7 ], which correspond in the associated Map with physical CPUs c5, c7, c9 and c11.

To read item q2 out loud:

    If a fault occurs on either of the 2 explicit CPUs in r1, then search the 2 memory nodes in s1 in order, looking for available memory (mn1, then mn2).

    If a fault occurs on either of the 2 CPUs in r2, search the 2 memory nodes in s2 in order (mn2, then mn1).

    If a fault occurs on any other CPU, then since the CMS_DEFAULT_CPU value is listed in r1, search the 2 memory nodes in s1 in order (mn1, then mn2).

Interpretations of the above:

The meaning of "s1 = [0, 1]" is that if a page fault occurs on the logical CPUs in "r1 = [1, 3, CMS_DEFAULT_CPU]", then the allocator should search logical memory node 0 first (that's the memory on physical node 1 - mn1), then search logical memory node 1 second (the memory on physical node 2 - mn2).

The meaning of "s2 = [1, 0]" is that if a page fault occurs on the logical CPUs listed in "r2 = [5, 7]", then the same memory nodes are searched, but in the other order, mn2 then mn1.

In particular, if a vm area using the above CpuMemSet was also shared with an application running on some other Map, and that application faulted while running on some CPU not explicitly listed in the above CpuMemSet (item r1 or r2), then the allocator would search mn1 first, then mn2, for available memory. This is because CMS_DEFAULT_CPU is listed amongst the CPUs in r1, and the corresponding s1 is equivalent to the ordered array of physical memory nodes [mn1, mn2].

Observation:

The allocator need have _no_ notion of distance. It just searches, in the order specified, the memory list prescribed for that vm area, for a fault on the specified CPU (or the CMS_DEFAULT_CPU). To provide the usual first-touch, distance-ordered memory search, some system service or utility must sort the memory lists in distance order.

========

I should add the above example to my Design Notes.

My apologies for the tediousness of this example. I realize that the above data structure is a layer or two deeper than intuitions expect. However, when I methodically strip all (most?) higher-level policy from the various CPU and memory APIs we need to support, the above is what I am left with, as the necessary generic multiplexor between a variety of APIs and the specific needs of the static placement logic in the kernel allocation and scheduling code.

For example, observe that no notion of locality domain or node exists here - it has been disassembled into simple lists of CPUs and chunks of memory, called here 'memory nodes' only because there tends to be one chunk per node, and I couldn't find a better noun to name a maximal contiguous (?) chunk of memory that is equidistant from all processing elements.
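(The example can also be written out as C initializers. The struct layouts and the CMS_DEFAULT_CPU encoding below are assumptions inferred from this message, not the Design Note's actual header, but they make the nesting of the cms_memory_list structure explicit:)

/* The example above as C initializers. All struct layouts and the
 * CMS_DEFAULT_CPU encoding are assumptions for illustration. */
typedef int cms_pcpu_t;         /* physical cpu number */
typedef int cms_pmem_t;         /* physical memory node number */
typedef int cms_lcpu_t;         /* logical cpu number */
typedef int cms_lmem_t;         /* logical memory node number */

#define CMS_DEFAULT     0x00
#define CMS_DEFAULT_CPU ((cms_lcpu_t)-1)        /* assumed encoding */

typedef struct {
        int nr_cpus;            /* length of cpus array */
        cms_lcpu_t *cpus;       /* cpus this memory list applies to */
        int nr_mems;            /* length of mems array */
        cms_lmem_t *mems;       /* memory nodes, in search order */
} cms_memory_list_t;

typedef struct {
        int nr_cpus;
        cms_pcpu_t *cpus;       /* logical-to-physical cpu map */
        int nr_mems;
        cms_pmem_t *mems;       /* logical-to-physical memory map */
} cpumemmap;

typedef struct {
        int cms_policy;
        int nr_cpus;
        cms_lcpu_t *cpus;       /* logical cpus tasks may run on */
        int nr_mems;
        cms_memory_list_t *mems;
} cpumemset;

/* The Map: the 2nd and 3rd nodes of the four-node system. */
static cms_pcpu_t p1[] = { 4, 5, 6, 7, 8, 9, 10, 11 };
static cms_pmem_t p2[] = { 1, 2 };
static cpumemmap example_map = { 8, p1, 2, p2 };

/* The Set: odd logical cpus, first-touch memory ordering. */
static cms_lcpu_t q1[] = { 1, 3, 5, 7 };
static cms_lcpu_t r1[] = { 1, 3, CMS_DEFAULT_CPU };
static cms_lcpu_t r2[] = { 5, 7 };
static cms_lmem_t s1[] = { 0, 1 };
static cms_lmem_t s2[] = { 1, 0 };
static cms_memory_list_t q2[] = {
        { 3, r1, 2, s1 },
        { 2, r2, 2, s2 },
};
static cpumemset example_set = { CMS_DEFAULT, 4, q1, 2, q2 };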
> > Seems that I can either (1) require that a memory list _always_ be specified for the CMS_DEFAULT_CPU, or (2) mandate that attempts to allocate memory when the CpuMemSet does not specify (even by CMS_DEFAULT_CPU) any memory list for the currently executing cpu must fail. Choice (2) would cause nasty, obscure and intermittent errors. So it must be choice (1).
>
> Agreed that deterministic errors are much better than non-deterministic errors! But how do you indicate which memory is associated with CMS_DEFAULT_CPU?

The value CMS_DEFAULT_CPU is included on the cpu list (r1 or r2, in the above example) of one of the memory lists.

==> Memory lists have a list of memory nodes, sorted in search order, _and_ a list of CPUs to which that memory list applies.

> Must the CPU and the memory lists be the same size? My guess was "no", since there are separate fields for the length of each.

Correct - they need not be, and in the case of architectures with multiple CPUs per memory node, typically are not the same size.

> Or does CMS_DEFAULT_CPU just start the search of memory from the first element of the array?

er eh no. This question confuses me.

CMS_DEFAULT_CPU chooses which memory list to search if the faulting cpu is not explicitly listed.

The search order is by default (CMS_DEFAULT) always from the first element of the memory node array (s1 and s2, above), unless the CMS_ROUND_ROBIN policy is specified for that CpuMemSet.

> > What is a "least-memory utilization policy"? I am not familiar with that term.
>
> It means that, at page-fault time, you allocate memory from the node/memory with the least utilization.

This is clearly then a dynamic placement policy, not a static one. I would expect the implementation of a "least-memory utilization policy" to add code to the allocator and elsewhere to track memory utilization. And I would expect the CpuMemSet (or at least CpuMemMap) to control which memory nodes could be searched. But the order of search would depend on more dynamic code outside the domain of CpuMemSets.

> > > ====
> > > A cpumemset has no meaning except in the context of a cpumemmap, right? A cpumemmap maps the CPU numbers, while a cpumemset simply restricts them, right? (Could make it work either way, but things like getcpu() and getnode() need to be in sync with the choice.) For example, suppose that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What is the physical CPU corresponding to logical CPUs 0, 1, and 2?
> >
> > Yes, Yes, tell me about getcpu/getnode, and {100,102,103}.
> >
> > My presumption, from reading this, is that getcpu(), executed in this map and set, would return either 100 or 103, rather oblivious to the CpuMemMap. I have no clue what getnode() would or should do that relates to CpuMemSets. Perhaps we need a "getcmscpu()"?
>
> I believe that we need a way to get the physical CPU ID. So, at this point, do we have two levels of ID, or three? In case of diagnostics, you want to identify the physical CPU. So we quickly get into the issue that Martin Bligh raised earlier. ;-)

When you want the physical CPU ID, as with diagnostics, then getcpu() provides the physical CPU ID, which should be no more ambiguous than it was before CpuMemSets. I only see one level of ID here. I see nothing logical about this <grin>.
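(A minimal sketch of the memory-list selection just described, using the struct layout assumed in the sketch after the example above; the helper name is illustrative, not the Design Note's kernel code:)

#include <stddef.h>

/* Pick the memory list to search for a fault on faulting_cpu: an
 * explicit cpu match wins; otherwise fall back to the list carrying
 * CMS_DEFAULT_CPU, which the collapsed Error Case above guarantees
 * to exist. Illustrative only. */
static cms_memory_list_t *pick_memory_list(cpumemset *cms,
                                           cms_lcpu_t faulting_cpu)
{
        cms_memory_list_t *dflt = NULL;
        int i, j;

        for (i = 0; i < cms->nr_mems; i++) {
                cms_memory_list_t *ml = &cms->mems[i];

                for (j = 0; j < ml->nr_cpus; j++) {
                        if (ml->cpus[j] == faulting_cpu)
                                return ml;      /* explicit match wins */
                        if (ml->cpus[j] == CMS_DEFAULT_CPU)
                                dflt = ml;      /* remember the fallback */
                }
        }
        return dflt;
}

The allocator would then try ml->mems[0], ml->mems[1], ... in order under CMS_DEFAULT, or rotate its starting index under CMS_ROUND_ROBIN.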
> For related sets of processes running as root, you can end up with many more levels of ID. Process 100 maps to exclude the first CPU, then forks process 101, which maps to exclude its idea of the first CPU, and so on. This problem is inherent in the notion of virtualizing the CPUs. We could try to eliminate the middle level (cms_pcpu_t), but that could require using whatever ugly IDs the hardware wanted to provide. Maybe the name of the cms_pcpu_t level should be "complete" instead of "physical"? Or some other naming?

No - cms_pcpu_t is the ugly hardware ID (well, one of them - seems that these too come in a couple forms, such as compact or not). This is not a hall of infinitely regressing mirrors. There are only two levels, the ugly physical ID level, and the logical mapping to the integers 0..N-1 (logical ID) for any given subset of size N CPUs or memory nodes.

I suspect that CpuMemMaps are a degree less fancy than you are presuming, while CpuMemSets are a couple degrees more fancy, as in the above example ;).

> So the relevant numbering schemes are the following: (1) the numbering that the current process sees, as mapped by the CpuMemMap that controls it, and (2) the underlying physical identifiers that would make sense to someone servicing the hardware.

Yes - exactly. How we ended up at this same point, after the above confusions, baffles me a bit. oh well.

> > > HP's launch policies include policies for CPU allocation (round robin and the like). My guess is that this requires either additional bits in cms_policy or another field for CPU (as opposed to memory) policies.
> >
> > Aha - might be so. I should investigate this further.
>
> http://lse.sourceforge.net/numa/manpages/hpux-mm.txt, then search for "Launch".

Here I see several dynamic scheduling policies:

    ROUND_ROBIN
    PACKED
    FILL
    LEASTLOAD

As with the dynamic "least-memory utilization" policy above, I would expect CpuMemSets to control which CPUs were eligible for scheduling, but that the control of details of scheduling would be left to other mechanisms.

> > Non-root processes can restrict where they execute tasks or allocate memory by altering their CpuMemSet, within the confines of the CpuMemMap established for them. Why is that not enough?
>
> My concern is that non-root processes cannot virtualize their children. If you want to divide the CPUs and memory available to you into two pieces, and run a child in each piece, you can do so, but you cannot have both children thinking that they have their own CPU 0.

Yes, this limitation exists. Below you suggest letting non-root processes further restrict their map. That would be doable - a modest wart, in that it complicates what was a simple story: root manipulates maps, anyone manipulates sets. How serious is this limitation?

> > > (At this point, I am favoring letting non-root processes manipulate cpumemmaps, with the restriction that any that are installed must be pure restrictions of the current cpumemmap.)
>
> Please let me know whether my non-root virtualization example above seems reasonable to you.

I am still lacking an appreciation of the benefit of this sufficient to justify the extra half-twist of logic. I'm open to hearing more.

> > > Is the user supposed to directly manipulate the fields of the cpumemmap and cpumemset?
> >
> > Yes, via the cmsSet*() system calls.
> > Well, as direct as anything in a protected operating system -- the user politely asks the kernel to set a map or set, and doesn't directly write kernel memory.
> >
> > My hunch here is that I missed the real point of this question ...
>
> Should the interfaces be used as follows?
>
> p = cmsQueryCMS(CMS_CURRENT, (void *)0);
> if (p->nr_cpus > 1) {
>         p->nr_cpus--;
> }
> cmsSetCMS(CMS_CURRENT, (void *)0, p);

Ah ... that gets dicey. What I hear you asking is whether it is appropriate for an application (more likely, some library supporting a friendlier interface on top of CpuMemSets) to (1) get a response from a cmsQuery*(), (2) munge it in place, and (3) push it back down a cmsSet*() call.

As we have already seen in the Example above, the key data structure is a tad nasty to deal with in C, with its nested variable-length arrays. It is especially nasty across the system call boundary -- the kernel can't exactly malloc and assemble multiple small chunks of user memory during the response to a single system call.

This is in good part the motivation for the cmsFree*() calls, to isolate the caller of these routines from knowing just how the memory for them is allocated.

For a change such as you give in your example above, changing a value in place, that's no problem ... because it makes no assumptions as to how the memory was allocated. Changing a Policy flag would be easy, for the same reason. Shortening one of the arrays, and even overwriting the values in them, is fine.

Any lengthening of an array should be done with a deep copy into memory that is managed in ways known to the caller, so that the caller knows how to free that memory, when finished.

> One thing we need to hash out is what the unit of memory control is. In the simple-binding proposal, it is an arbitrary range of virtual memory (similar to what madvise() might do), while in the Process Scheduling and Memory Placement proposal, it appears to be an object. Conflicts are handled in a last-change-wins manner?

Yes - I see that section in the "Proposed NUMA API" now.

There are a couple things here we need to hash out. My reading of this section is that bindmemory() has a couple of capabilities that are not a natural fit with CpuMemSets:

1) Does bindmemory(vaddr, len, ...) apply to:
   a) already mapped and allocated memory, or
   b) already mapped, but not allocated, memory, or
   c) also extended mappings in already known vm areas, or
   d) any future mapping or remapping in the vaddr range?

Since these bindings are inherited across fork/exec, I guess it must be (d). This means that the binding request has to be kept as an inherited property of the process, and recalculated for applicability any time that any mapping is changed, whether to grow or shrink an existing vm area or to add or remove a vm area. CpuMemSets are, as you note, per object (per vm area), not for an arbitrary address range and whatever portions of whatever future vm areas might happen to overlap that range.

2) bindmemory() requires per-process specifications as to where to find memory. CpuMemSets support per-cpu specifications (based on which cpu is executing the page_alloc request).

My first inclination is to recommend against supporting capability (1) above, on the grounds that its semantics don't respect very well the existing objects that the kernel supports. That, or I failed to understand it, quite possible.
The second capability, per-process rather than (or in addition to) per-cpu memory search specifications, is at least more doable, but still seems not quite right to me.

I am of course open to hearing the motivation for these capabilities. I may be a purist at heart, but I am a pragmatist in my wallet.

Also, perhaps you could comment on the status of the Simple Binding proposal. Whether it is in use, or is required to support APIs in use. What level of flexibility exists on such significant semantics as the two capabilities just above.

I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
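(A sketch of the "pure restrictions of the current cpumemmap" rule debated in this exchange, using the struct layout assumed in the example sketch above; the check itself is an illustration, not part of either proposal:)

/* A non-root process could be allowed to install a new cpumemmap
 * only if every physical cpu and memory node in it already appears
 * in the current map - i.e., the new map only narrows the old one.
 * Illustrative only; names are not from the Design Note. */
static int is_pure_restriction(const cpumemmap *cur, const cpumemmap *req)
{
        int i, j, found;

        for (i = 0; i < req->nr_cpus; i++) {
                found = 0;
                for (j = 0; j < cur->nr_cpus; j++)
                        if (req->cpus[i] == cur->cpus[j])
                                found = 1;
                if (!found)
                        return 0;
        }
        for (i = 0; i < req->nr_mems; i++) {
                found = 0;
                for (j = 0; j < cur->nr_mems; j++)
                        if (req->mems[i] == cur->mems[j])
                                found = 1;
                if (!found)
                        return 0;
        }
        return 1;       /* req only narrows cur */
}

Under such a rule, a non-root map-setting call could be accepted whenever this check passes, which is what the non-root virtualization example asks for.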
From: Paul M. <Pau...@us...> - 2001-10-19 17:36:54
Hello, Paul! The example -really- helped! I was missing the point of the cms_memory_list data structure. This clears up most of my confusion, I believe, anyway...

I believe that we will really want to have a default value (e.g., NULL) for the pointer to the cms_memory_list, since for many applications a default of either "search my node first, then search the rest however you wish" or "search my node only" will suffice. I believe that it is a bit cruel and unusual to force someone to specify either of these in great detail. ;-)

Each "mirror" in my "hall" is in a separate process that used cpumemmap to restrict things. So a given process normally cares only about its own view and that of the underlying system. One common exception might be debuggers, but, on the other hand, the multithreaded debuggers I have used don't know a CPU from a hole in the ground. So, the two views seemed likely to be sufficient.

The reason I keep wanting non-root processes to be able to restrict their cpumemmaps is that the cpumemmap is the only thing in your proposal that is capable of virtualizing CPUs and memory. So, an application that wishes to partition itself needs either to track the CPU numbers and offsets itself, or it needs to provide a different cpumemmap to each of its children. Having done quite a bit of the former, the latter seems quite attractive. ;-)

Also, I need to manipulate cpumemmaps to implement pieces of the simple-binding API.

Manual manipulation of the cpumemset structures seems quite painful. Am I right in assuming that one cannot invoke cmsFreeCMS() on hand-built cpumemset structures? If I use my crude trick of decrementing the nr_cpus field, am I then prohibited from calling cmsFreeCMS(), or does cmsFreeCMS() hide some additional information away somewhere?

So here is what I believe I need to do to write the simple-binding API in terms of cpumemset and cpumemmap. I assume that cmsFreeCMS() is telepathic, since I cannot think of a way that it could do what I want it to do... I also take the liberty of leaking memory right and left. The purpose of this exercise is to see if I understand the cpumemsets. Once I find that I do, I will try one of the existing proprietary-Unix APIs. And only -then- will I worry about optimal, nice-looking code!!!

This exercise raised the following questions:

1. It seems to be possible to have a different cpumemmap for different objects in a given virtual address space. The CPU portions of these cpumemmaps are ignored, right?

2. The memory portion of the cpumemmap associated with a process is used as a default for things like data, bss, and heap space? Is it used as a default for objects created subsequently? Or is it ignored?

3. The memory portion of the cpumemmap associated with a thread is used as a default for that thread's stack? Is it used as a default for objects created subsequently by this thread? Or is it ignored? If the answers to both #2 and #3 are "it is ignored", then I am not sure how the various node-restriction APIs would be implemented.

4. I made up the makecmsmemset() API, since I did not want to second-guess the layout. I know that this is not really what you had in mind, so am looking for guidance here.

5. I used NULL for cpumemset mems, since the simple-binding API does not care about the order of search.

6. We need to align the memory-binding behavior... One could implement bindtonode() by stepping through each page in the process's virtual-address space, but this might be a bit inefficient on 64-bit architectures. ;-)
#include <errno.h>
#include <stdlib.h>

typedef struct {
        cpumemset *set;
        cpumemmap *map;
} numasubset_t;

numasubset_t *restrictcpus(unsigned long cpus, numasubset_t *res)
{
        unsigned i;
        unsigned j;
        unsigned nbits;
        cms_lcpu_t *setcpus;
        cms_pcpu_t *mapcpus;
        cpumemmap *r = (res != NULL) ? res->map : NULL;
        numasubset_t *newres;

        if (r == NULL) {
                r = cmsQueryMAP(CMS_CURRENT, 0, (void *)0);
                                /* Do I need getpid()? */
                                /* Do I care about vaddr??? */
                                /* should I substitute &restrictcpus? */
                if (r == NULL) {
                        goto die0;
                }
        }
        nbits = 0;
        for (i = 0; i < sizeof(cpus) * 8; i++) {
                if (i >= r->nr_cpus) {
                        break;
                }
                if (cpus & (1UL << i)) {
                        nbits++;
                }
        }
        mapcpus = malloc(nbits * sizeof(*mapcpus));
        if (mapcpus == NULL) {
                errno = ENOMEM;
                goto die1;
        }
        setcpus = malloc(nbits * sizeof(*setcpus));
        if (setcpus == NULL) {
                errno = ENOMEM;
                goto die2;
        }
        j = 0;
        for (i = 0; i < sizeof(cpus) * 8; i++) {
                if (i >= r->nr_cpus) {
                        break;
                }
                if (cpus & (1UL << i)) {
                        mapcpus[j] = r->cpus[i];
                        setcpus[j] = j;
                        j++;
                }
        }
        newres = malloc(sizeof(*newres));
        if (newres == NULL) {
                errno = ENOMEM;
                goto die3;
        }
        newres->map = makecmsmemmap(nbits, mapcpus, r->nr_mems, r->mems);
        if (newres->map == NULL) {
                goto die4;
        }
        newres->set = makecmsmemset(nbits, setcpus,
                                    (res != NULL) ? res->set->nr_mems : 0,
                                    (res != NULL) ? res->set->mems : NULL);
                                /* XXX: should this query the current
                                   set when res == NULL? */
        if (newres->set == NULL) {
                goto die5;
        }
        return (newres);

die5:   cmsFreeCMM(newres->map);
die4:   free(newres);
die3:   free(setcpus);
die2:   free(mapcpus);
die1:   if (res == NULL || r != res->map) {
                cmsFreeCMM(r);
        }
die0:   return (NULL);
}

numasubset_t *restrictnodes(unsigned long nodes, numasubset_t *res)
{
        unsigned i;
        unsigned j;
        unsigned nbits;
        cms_lmem_t *setnodes;
        cms_pmem_t *mapnodes;
        cpumemmap *r = (res != NULL) ? res->map : NULL;
        numasubset_t *newres;

        if (r == NULL) {
                r = cmsQueryMAP(CMS_CURRENT, 0, (void *)0);
                                /* Do I need getpid()? */
                                /* Do I care about vaddr??? */
                                /* should I substitute &restrictnodes? */
                if (r == NULL) {
                        goto die0;
                }
        }
        nbits = 0;
        for (i = 0; i < sizeof(nodes) * 8; i++) {
                if (i >= r->nr_mems) {
                        break;
                }
                if (nodes & (1UL << i)) {
                        nbits++;
                }
        }
        mapnodes = malloc(nbits * sizeof(*mapnodes));
        if (mapnodes == NULL) {
                errno = ENOMEM;
                goto die1;
        }
        setnodes = malloc(nbits * sizeof(*setnodes));
        if (setnodes == NULL) {
                errno = ENOMEM;
                goto die2;
        }
        j = 0;
        for (i = 0; i < sizeof(nodes) * 8; i++) {
                if (i >= r->nr_mems) {
                        break;
                }
                if (nodes & (1UL << i)) {
                        mapnodes[j] = r->mems[i];
                        setnodes[j] = j;
                        j++;
                }
        }
        newres = malloc(sizeof(*newres));
        if (newres == NULL) {
                errno = ENOMEM;
                goto die3;
        }
        newres->map = makecmsmemmap(r->nr_cpus, r->cpus, nbits, mapnodes);
        if (newres->map == NULL) {
                goto die4;
        }
        newres->set = makecmsmemset((res != NULL) ? res->set->nr_cpus : 0,
                                    (res != NULL) ? res->set->cpus : NULL,
                                    0, NULL);
                                /* XXX: should this query the current
                                   set when res == NULL? */
        if (newres->set == NULL) {
                goto die5;
        }
        return (newres);

die5:   cmsFreeCMM(newres->map);
die4:   free(newres);
die3:   free(setnodes);
die2:   free(mapnodes);
die1:   if (res == NULL || r != res->map) {
                cmsFreeCMM(r);
        }
die0:   return (NULL);
}

void freerestrict(numasubset_t *restrict)
{
        cmsFreeCMM(restrict->map);
        cmsFreeCMS(restrict->set);
        free(restrict);
}

I don't know how to implement getcpu(), getnode(), cputonode(), or nodetocpu() based on this API.
int bindtocpu(unsigned long cpus, numasubset_t *restrict)
{
        numasubset_t *r1;

        r1 = restrictcpus(cpus, restrict);
        if (r1 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        cmsSetCMS(CMS_CURRENT, (void *)0, r1->set);
        cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map);
        freerestrict(r1);
        return (0);
}

int bindtonode(unsigned long nodes, int behavior, numasubset_t *restrict)
{
        numasubset_t *r1;

        r1 = restrictnodes(nodes, restrict);
        if (r1 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        r1->set->cms_policy = behavior; /* assume we rationalize */
        cmsSetCMS(CMS_CURRENT, (void *)0, r1->set);
        cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map);
        /* Need to step through all VM objects... How? */
        freerestrict(r1);
        return (0);
}

int setlaunch(unsigned long cpus, unsigned long nodes, int behavior,
              numasubset_t *restrict)
{
        numasubset_t *r1;
        numasubset_t *r2;

        r1 = restrictnodes(nodes, restrict);
        if (r1 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        r2 = restrictcpus(cpus, r1);
        freerestrict(r1);
        if (r2 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        r2->set->cms_policy = behavior; /* assume we rationalize */
        cmsSetCMS(CMS_CHILD, (void *)0, r2->set);
        cmsSetCMM(CMS_CHILD, 0, (void *)0, r2->map);
        freerestrict(r2);
        return (0);
}

int bindmemory(unsigned long start, size_t len, unsigned long nodes,
               int behavior, numasubset_t *restrict)
{
        numasubset_t *r1;
        unsigned long cur;

        r1 = restrictnodes(nodes, restrict);
        if (r1 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        r1->set->cms_policy = behavior; /* assume we rationalize */
        cmsSetCMS(CMS_CURRENT, (void *)0, r1->set);
        for (cur = start; cur < start + len; cur += PAGESIZE) {
                cmsSetCMM(CMS_VMAREA, 0, (void *)cur, r1->map);
        }
        freerestrict(r1);
        return (0);
}
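(A hypothetical caller of the simple-binding sketches above: bindtocpu() is handed a NULL numasubset_t, so restrictcpus() queries the current cpumemmap itself. Illustrative only:)

#include <stdio.h>

int main(void)
{
        /* Confine the current process to its first two logical cpus,
         * leaving the memory half of the map and set alone. */
        if (bindtocpu(0x3UL, (numasubset_t *)0) != 0) {
                perror("bindtocpu");
                return 1;
        }
        printf("now restricted to logical cpus 0 and 1\n");
        return 0;
}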
I have not yet adequately presented some aspects of > this data structure. I must add an example ... that might > connect with additional readers. > > Lets try this one: > > Example: > ======== > One way to understand these data structures is to look at > an example. > > Given the following hardware configuration: > > Let's say we have a four node system, with four CPUs > per node, and one memory per node, named as follows: > > Name the 16 CPUs: c0, c1, ..., c15 # 'c' for CPU > and number them: 0, 1, 2, ..., 15 # cms_pcpu_t > > Name the 4 memories: mn0, mn1, mn2, mn3 # 'mn' for memory node > and number them: 0, 1, 2, 3 # cms_pmem_t > > CpuMemMap: > > Now lets say the administrator (root) chooses to setup a > Map containing just the 2nd and 3rd node (CPUs and memory > thereon). The cpumemmap for this would contain: > > { > 8, # nr_cpus (length of cpus array) > p1, # cpus (ptr to array of cms_pcpu_t) > 2, # nr_mems (length of mems array) > p2 # mems (ptr to array of cms_pmem_t) > } > > where p1, p2 point to arrays of physical cpu + mem numbers: > > p1 = [ 4,5,6,7,8,9,10,11 ] # cpus (array of cms_pcpu_t) > p2 = [ 1,2 ] # mems (array of cms_pmem_t) > > This map shows, for example, that for this Map, logical cpu 0 > corresponds to physical cpu 4 (c4). > > CpuMemSet: > > Further lets say that an application running within this map > chooses to restrict itself to just the odd-numbered CPUs, and > to search memory in the common "first-touch" manner (local > node first). It would establish a CpuMemSet containing: > > { > CMS_DEFAULT, # cms_policy > 4, # nr_cpus (length of cpus array) > q1, # cpus (ptr to array of cms_lcpu_t) > 2, # nr_mems (length of mems array) > q2, # mems (ptr to array of cms_memory_list) > } > > where q1 points to an array of 4 logical cpu numbers and q2 to an > array of 2 memory lists: > > > q1 = [ 1,3,5,7 ], # cpus (array of cms_lcpu_t) > q2 = [ # See "Verbalization example" below > { 3, r1, 2, s1 } > { 2, r2, 2, s2 } > ] > where r1, r2 are arrays of logical cpus: > r1 = [1, 3, CMS_DEFAULT_CPU] > r2 = [5, 7] > and s1, s2 are arrays of memory nodes: > s1 = [0, 1] > s2 = [1, 0] > > Verbalization examples: > > To read item q1 out loud: > > Tasks in this CpuMemSet may be scheduled on any of > the logical CPUs [ 1, 3, 5, 7 ], which correspond > in the associated Map with physical CPUs c5, c7, c9 > and c11. > > To read item q2 out loud: > > If a fault occurs on any of the 2 explicit CPUs in > r1, then search the 2 memory nodes in s1 in order, > looking for available memory (mn1, then mn2). > > If a fault occurs on any of the 2 CPUs in r2, search > the 2 memory nodes in s2 in order (mn2, then mn1). > > If a fault occurs on any other CPU, then since the > CMS_DEFAULT_CPU value is listed in r1, search the > 2 memory nodes in s1 in order (mn1, then mn2). > > Interpretations of the above: > > The meaning of "s1 = [0, 1]" is that if a page fault occurs on > the logical CPUs in "r1 = [1, 3, CMS_DEFAULT_CPU]", then the > allocator should search logical memory node 0 first (that's > the memory on physical node 1 - mn1), then search logical > memory node 1 second (the memory on physical node 2 - mn2). > > The meaning of "s2 = [1, 0]" is that if a page fault occurs > on the logical CPUs listed in "r2 = [5, 7]", then the same > memory nodes are searched, but in the other order, mn2 then mn1. 
> > In particular, if a vm area using the above CpuMemSet was > also shared with an application running on some other Map, > and that application faulted while running on some CPU not > explicitly listed in the above CpuMemSet (item r1 or r2), > then the allocator would search mn1 first, then mn2, for > available memory. This is because CMS_DEFAULT_CPU is listed > amongst the CPUs in r1, and the corresponding s1 is equivalent > to the ordered array of physical memory nodes [mn1, mn2]. > > Observation: > > The allocator need have _no_ notion of distance. It just > searches, in order specified, the memory list prescribed > for that vm area, for a fault on the specified CPU (or the > CMS_DEFAULT_CPU). To provide the usual first-touch, distance > ordered memory search, some system service or utility must > sort the memory lists in distance order. > > ======== > > I should add the above example to my Design Notes. > > My apologies for the tediousness of this example. I realize > that the above data structure is a layer or two deeper than > intuitions expect. However when I methodically strip all > (most?) higher level policy from the various CPU and memory > API's we need to support, the above is what I am left with, as > the necessary generic multiplexor between a variety of API's > and the specific needs of the static placement logic in the > kernel allocation and scheduliing code. > > For example, observe that no notion of locality domain or node > exists here - it has been disassembled into simple lists of > CPUs and chunks of memory, called here 'memory nodes' only > because there tends to be one chunk per node, and I couldn't > find a better noun to name a maximal contiguous (?) chunk of > memory that is equidistant from all processing elements. > > > > > Seems that I can either (1) require that a memory list _always_ > > > be specified for the CMS_DEFAULT_CPU, or (2) mandate that > > > attempts to allocate memory when the CpuMemSet does not specify > > > (even by CMS_DEFAULT_CPU) any memory list for the currently > > > executing cpu must fail. Choice (2) would cause nasty, obscure > > > and intermittent errors. So it must be choice (1). > > > > Agreed that deterministic errors are much better than non-deterministic > > errors! But how do you indicate which memory is associated with > > CMS_DEFAULT_CPU? > > The value CMS_DEFAULT_CPU is included on a the cpu list > (r1 or r2, in the above example) in one of the memory lists. > > ==> Memory lists have a list of memory nodes, sorted in search > order, _and_ a list of CPUs to which that memory list applies. > > > Must the CPU and the memory lists be the same size? > > My guess was "no", since there are separate fields for the length of > > each. > > Correct - they need not be, and in the case of architectures > with multiple CPUs per memory node, typically are not the same size. > > > Or does CMS_DEFAULT_CPU just start the search of memory from the > > first element of the array? > > er eh no. This question confuses me. > > CMS_DEFAULT_CPU chooses which memory list to search if the > faulting cpu is not explicitly listed. > > The search order is by default (CMS_DEFAULT) always from the > first element of the memory node array (s1 and s2, above), unless > the CMS_ROUND_ROBIN policy is specified for that CpuMemSet. > > > > > What is a "least-memory utilization policy"? I am not familiar > > > with that term. > > > > It means that, at page-fault time, you allocate memory from the > > node/memory with the least utilization. 
> > This is clearly then a dynamic placement policy, not a static > one. I would expect the implementation of a "least-memory > utilization policy" to add code to the allocator and elsewhere > to track memory utilization. And I would expect the CpuMemSet > (or at least CpuMemMap) to control which memory nodes could > be searched. But the order of search would depend on more > dynamic code outside the domain of CpuMemSets. > > > > > > ==== > > > > A cpumemset has no meaning except in the context of a cpumemmap, > > right? > > > > A cpumemmap maps the CPU numbers, while a cpumemset simply restricts > > > > them, right? (Could make it work either way, but things like getcpu > > () > > > > and getnode() need to be in sync with the choice.) For example, > > suppose > > > > that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose > > that > > > > the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. > > What > > > > is the physical CPU corresponding to logical CPUs 0, 1, and 2? > > > > > > Yes, Yes, tell me about getcpu/getnode, and {100,102,103}. > > > > > > My presumption, from reading this, is that getcpu(), executed in this > > > map and set, would return either 100 or 103, rather oblivious to the > > > CpuMemMap. I have no clue what getnode() would or should do that > > > relates to CpuMemSets. Perhaps we need a "getcmscpu()"? > > > > I believe that we need a way to get the physical CPU ID. So, at this > > point, do we have two levels of ID, or three? In case of diagnostics, > > you want to identify the physical CPU. So we quickly get into the > > issue that Martin Bligh raised earlier. ;-) > > When you want the physical CPU ID, as with diagnostics, then getcpu() > provides the physical CPU ID, which should be no more ambiguous than > it was before CpuMemSets. I only see one level of ID here. I see > nothing logical about this <grin>. > > > > For related sets of processes running as root, you can end up with many > > more levels of ID. Process 100 maps to exclude the first CPU, then forks > > process 101, which maps to exclude its idea of the first CPU, and so on. > > This problem is inherent in the notion of virtualizing the CPUs. We could > > try to eliminate the middle level (cms_pcpu_t), but that could require > > using whatever ugly IDs the hardware wanted to provide. Maybe the name > > of the cms_pcpu_t level should be "complete" instead of "physical"? Or > > some other naming? > > No - cms_pcpu_t is the ugly hardware ID (well one of them - > seems that these too come in a couple forms, such as compact > or not). This is not a hall of infinitely regressing mirrors. > There are only two levels, the ugly physical ID level, and the > logical mapping to the integers 0..N-1 (logical ID) for any > given subset of size N CPUs or memory nodes. > > I suspect that CpuMemMaps are a degree less fancy than you are > presuming, while CpuMemSets are a couple degrees more fancy, > as in the above example ;). > > > > So the relevant numbering schemes are the following: (1) the numbering > > that the current process sees, as mapped by the CpuMemMap that controls > > it, and (2) the underlying physical identifiers that would make sense > > to someone servicing the hardware. > > Yes - exactly. How we ended up at this same point, after the > above confusions, baffles me a bit. oh well. > > > > > > > HP's launch policies include policies for CPU allocation (round > > > > robin and the like). 
My guess is that this requires either > > > > additional bits in cms_policy or another field for CPU (as > > > > opposed to memory) policies. > > > > > > Aha - might be so. I should investigate this further. > > > > http://lse.sourceforge.net/numa/manpages/hpux-mm.txt, then search for > > "Launch". > > Here I see several dynamic scheduling policies: > ROUND_ROBIN > PACKED > FILL > LEASTLOAD > > As with the dynamic "least-memory utilization" policy above, > I would expect CpuMemSets to control which CPUs were eligible > for scheduling, but that the control of details of scheduling > would be left to other mechanisms. > > > > > Non-root processes can restrict where they execute tasks or allocate > > > memory by altering their CpuMemSet, within the confines of the CpuMemMap > > > established for them. Why is that not enough? > > > > My concern is that non-root processes cannot virtualize their children. > > If you want to divide the CPUs and memory available to you into two > > pieces, and run a child in each piece, you can do so, but you cannot > > have both children thinking that they have their own CPU 0. > > Yes, this limitation exists. Below you suggest letting non-root > processes further restrict their map. That would be doable - > a modest wart, in that it complicates what was a simple story: > root manipulates maps, anyone manipulates sets. How serious > is this limitation? > > > > > > (At this point, I am favoring letting non-root processes > > > > manipulate cpumemmaps, with the restriction that any that > > > > arevinstalled must be pure restrictions of the current > > > > cpumemmap.) > > Please let me know whether my non-root virtualization example > > above seems reasonable to you. > > I am still lacking an appreciation of the benefit of this > sufficient to justify the extra half-twist of logic. I'm > open to hearing more. > > > > > > Is the user supposed to directly manipulate the fields of the > > > > cpumemmap and cpumemset? > > > > > > Yes, via the cmsSet*() system calls. Well, as direct as > > > anything in a protected operating system -- the user politely > > > asks the kernel to set a map or set, and doesn't directly write > > > kernel memory. > > > > > > My hunch here is that I missed the real point of this question ... > > > > Should the interfaces be used as follows? > > > > p = cmsQueryCMS(CMS_CURRENT, (void *)0); > > if (p->nr_cpus > 1) { > > p->nr_cpus--; > > } > > cmsSetCMS(CMS_CURRENT, (void *)0, p); > > Ah ... that gets dicey. What I hear you asking is whether it > is appropriate for an application (more likely, some library > supporting a friendlier interface on top of CpuMemSets) to (1) > get a response from a cmsQuery*(), (2) munge it in place, and > (3) push it back down a cmsSet*() call. > > As we have already seen in the Example above, the key data > structure is a tad nasty to deal with in C, with its nested > variable length arrays. It is especially nasty across the system > call boundary -- the kernel can't exactly malloc and assemble > multiple small chunks of user memory during the response to a > single system call. > > This is in good part the motivation for the cmsFree*() calls, > to isolate the caller of these routines from knowing just how > the memory for them is allocated. > > For a change such as you give in your example above, changing > a value in place, that's no problem .. because it makes no > assumptions as to how the memory was allocated. Changing a > Policy flag would be easy, for the same reason. 
Shortening > one of the arrays, and even overwriting the values in them, > is fine. > > Any lengthening of an array should be done with a deep copy > into memory that is managed in ways known to the caller, so > that the caller knows how to free that memory, when finished. > > > > One thing we need to hash out is what the unit of memory control is. > > In the simple-binding proposal, it is an arbitrary range of virtual > > memory (similar to what madvise() might do), while in the Process > > Scheduling and Memory Placement proposal, it appears to be an object. > > Conflicts are handled in a last-change-wins manner? > > Yes - I see that section in the "Proposed NUMA API" now. > > There are a couple of things here we need to hash out. My reading > of this section is that bindmemory() has a couple of capabilities > that are not a natural fit with CpuMemSets: > > 1) Does bindmemory (vaddr, len, ...) apply to: > a) already mapped and allocated memory, or > b) already mapped, but not allocated, memory, or > c) also extended mappings in already known vm areas > d) any future mapping or remapping in the vaddr range? > > Since these bindings are inherited across fork/exec, I guess > it must be (d). This means that the binding request has > to be kept as an inherited property of the process, and > recalculated for applicability any time that any mapping > is changed, whether to grow or shrink an existing vm area > or to add or remove a vm area. CpuMemSets are, as you note, > per object (per vm area), not for an arbitrary address range > and whatever portions of whatever future vm areas might > happen to overlap that range. > > 2) bindmemory() requires per-process specifications as to > where to find memory. CpuMemSets supports per-cpu > specifications (based on which cpu is executing the > page_alloc request). > > My first inclination is to recommend against supporting > capability (1) above, on the grounds that its semantics don't > respect very well the existing objects that the kernel supports. > That, or I failed to understand it, quite possible. > > The second capability, per-process rather than (or in addition > to) per-cpu memory search specifications is at least more > doable, but still seems not quite right to me. > > I am of course open to hearing the motivation for these > capabilities. I may be a purist at heart, but I am a pragmatist > in my wallet. > > Also perhaps you could comment on the status of the Simple > Binding proposal. Whether it is in use, or is required to > support APIs in use. What level of flexibility exists on such > significant semantics as the two capabilities just above. > I won't rest till it's the best ... > Manager, Linux Scalability > Paul Jackson <pj...@sg...> 1.650.933.1373 |
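A worked sketch of the two-level numbering discussed in this exchange, using the {100,102,103} example. The struct layout here (an nr_cpus count plus a cpus[] array indexed by application cpu number) is an assumption for illustration only, not the published draft header:

    /* Hypothetical layout: maps renumber, sets merely restrict. */
    typedef struct {
        int nr_cpus;    /* how many application cpu numbers are mapped */
        int cpus[8];    /* cpus[app_cpu] == system ("ugly") cpu number */
    } cpumemmap_sketch;

    /* McKenney's example: system cpus 100, 102 and 103. */
    static cpumemmap_sketch cmm = { 3, { 100, 102, 103 } };

    /* Application cpu 0 is system cpu 100, 1 is 102, 2 is 103.
     * A cpumemset of {0,2} then permits scheduling only on system
     * cpus 100 and 103; the set does no further renumbering. */
    int app_to_system(const cpumemmap_sketch *m, int app_cpu)
    {
        if (app_cpu < 0 || app_cpu >= m->nr_cpus)
            return -1;      /* not mapped */
        return m->cpus[app_cpu];
    }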
From: Paul J. <pj...@en...> - 2001-10-29 01:58:31
|
Sorry for the delay in responding ... On Fri, 19 Oct 2001, Paul McKenney wrote: > The example -really- helped! Good - I will include it in the next draft of the Design Note that I am preparing for release soon. > I believe that we will really want to have a default value > (e.g., NULL) for the pointer to the cms_memory_list, since for > many applications a default of either "search my node first, > then search the rest however you wish" or "search my node only" > will suffice. I believe that it is a bit cruel and unusual to > force someone to specify either of these in great detail. ;-) Check out the new cmsDeepCopy*() methods below. They should insulate your code from munging with the memory lists when not needed. However, understand that this CpuMemSet API is intended to support the essential kernel infrastructure for cpu and memory placement. It matters less that it be user friendly. It is intended for use in supporting other, more friendly, but perhaps less generic, APIs, such as your simple binding. Granted, it shouldn't be unnecessarily user hostile, and from the looks of your programming examples below, it needs to be a little friendlier. The cmsDeepCopy*() methods, below, should help considerably. They convert the inscrutable memory layout of the CpuMemSets returned by the cmsQuery*() methods into the linked malloc elements you can work with predictably. > The reason I keep wanting non-root processes to be able to > restrict their cpumemmaps is that the cpumemmap is the only > thing in your proposal that is capable of virtualizing CPUs > and memory. So, an application that wishes to partition > itself needs either to track the CPU numbers and offsets > itself, or it needs to provide a different cpumemmap to each > of its children. Having done quite a bit of the former, the > latter seems quite attractive. ;-) It is my intention, not yet very well articulated, to have tasks and vm areas explicitly (as part of the published API) share Sets and Maps that they inherit from a common origin, and to allow operations on those shared Sets and Maps that affect all those sharing it. So far as I can tell, the naive implementation of this sharing is quite incompatible with allowing the non-root map restricting that you describe. I can't simply shrink (restrict) a shared map, for the benefit of one of those sharing it, without either breaking the share, or else affecting the innocent. You seem to be asking that an existing map be used to perform an additional mapping because it "is the only thing in [my] proposal that is capable of virtualizing CPUs and memory." This is not a compelling reason in my book, and it (cpumemmap) doesn't do that well, in this case, as explained above. At this point, given that I would minimize what the kernel is asked to do, I can only ask: How can we make this more pleasant for user code, with some additional or improved library code? > Also, I need to manipulate cpumemmaps to implement pieces of > the simple-binding API. Elaborate ... again, I am seeking out essential semantics that require kernel involvement, as opposed to what the library codes can do to ease the burden. > Manual manipulation of the cpumemset structures seems quite > painful. Am I right in assuming that one cannot invoke > cmsFreeCMS on hand-built cpumemset structures? If I use > my crude trick of decrementing the nr_cpus field, am I then > prohibited from calling cmsFreeCMS(), or does cmsFreeCMS() > hide some additional information away somewhere? 
This can and should be improved, once again at the library level. Yes, as you so eloquently note a few lines lower, the cmsFree*() calls are telepathic. They have special knowledge of the memory layout of the cpumemsets and maps returned from cmsQuery*() calls. Your coding examples lower down were quite helpful in pointing out to me the difficulties you are seeing. I propose to add two more methods to the cpumemset library, to construct deep copies of sets and maps, using malloc for each element. Each discrete element (structure or array) will be allocated using malloc(). I will also teach the cmsFree() routines to distinguish between those maps and sets returned from the cmsQuery*() calls, and those returned from the cmsDeepCopy*() calls, and be able to free either one. I suspect this means adding an entry to the cpumemmap and cpumemset data structures, indicating how they were allocated (by cmsQuery, cmsDeepCopy or otherUser). So the rules become: * You can construct your own maps and sets, and pass to cmsSet*(). * You can examine the maps and sets returned from cmsQuery*() calls. * You can free your own map and set constructions as you will. * You can only free maps and sets from cmsQuery*() calls using cmsFree*() * To munge a set or map in place, first cmsDeepCopy*() it, and use the copy. * You can free a deep copy using the cmsFree*() methods. * You can also free a deep copy using free() on each element. Thus the following are added: cpumemmap *cmsDeepCopyMap (const cpumemmap *cmm); cpumemset *cmsDeepCopySet (const cpumemset *cms); > So here is what I believe I need to do to write the > simple-binding API in terms of cpumemset and cpumemmap. > I assume that cmsFreeCMS() is telepathic, since I cannot > think of a way that it could do what I want it to do... > I also take the liberty of leaking memory right and left. > The purpose of this exercise is to see if I understand the > cpumemsets. Once I find that I do, I will try one of the > existing proprietary-Unix APIs. And only -then- will I worry > about optimal, nice-looking code!!! > > This exercise raised the following questions: > > 1. It seems to be possible to have a different cpumemset > for different objects in a given virtual address space. > The CPU portions of these cpumemsets are ignored, right? Yes - possible. Yes - ignored. > 2. The memory portion of the cpumemset associated with > a process is used as a default for things like data, > bss, and heap space? Is it used as a default for objects > created subsequently? Or is it ignored? As published about 2 weeks ago, it was ignored. In the LSE conference call perhaps 8 days ago, you persuaded me that this was wrong. Now the memory portion of a task's *current* cpumemset is used for objects created subsequently by that task. > 3. The memory portion of the cpumemset associated with a > thread is used as a default for that thread's stack? Is > it used as a default for objects created subsequently by > this thread? Or is it ignored? As above - no longer ignored, thanks to your guidance. > 4. I made up the makecmsmemset() API, since I did not want to > second-guess the layout. I know that this is not really what > you had in mind, so am looking for guidance here. Use cmsDeepCopy*() to get maps and sets with known layout. > 5. I used NULL for cpumemset mems, since the simple-binding API > does not care about the order of search. Use cmsDeepCopy*(), and then leave the maps 'as is', if you don't need to change them. > 6. We need to align the memory-binding behavior... 
One could > implement bindtonode() stepping through each page in the > process's virtual-address space, but this might be a bit > inefficient on 64-bit architectures. ;-) Here I lost you. Can't figure out what you're saying, and don't like what I'm guessing you're saying. This sounds like a part of the discussion we had in the last LSE conference call, as to whether it was essential to support memory placement policies on any (page aligned) address range, irrespective of vm area (object) boundaries. I just sent out an email inquiry to Rik van Riel and Andrea Arcangeli, asking for their input on this issue of page vs object memory placement policy granularity. |> > typedef struct { |> > cpumemset *set; |> > cpumemmap *map; |> > } numasubset_t; |> > |> > numasubset_t *restrictcpus(unsigned long cpus, numasubset_t *res) |> > { |> > [ snip 85 lines of code ... ] |> > } Hmmm ... I made an effort to grok this code ... but failed. I guess you are trying to compute a "numasubset_t" that expresses a restriction, for possible further use in a bindto*() call. For this, try: /* * Restrict the cpus in a CpuMemSet to the intersection * of those currently in it, and those specified by * 'rcpus'. 'rcpus' is a bit vector, where bit N is set * iff the application can run on cpu N, using CpuMemSet * 'application' cpu numbering. * * Return a new, malloc'd, further restricted, CpuMemSet, or * NULL on error. */ cpumemset *restrictcpus (unsigned long rcpus, cpumemset *cms1) { cpumemset *cms2; /* deep copy of cpumemset */ int i, j; /* scan cms2.cpus, copy only if in rcpus */ cms2 = cmsDeepCopySet (cms1); if (cms2 == NULL) return NULL; /* copy cpus[] in place; drop if corresponding rcpus bit not set */ for (i = 0, j = 0; i < cms2->nr_cpus; i++) { int n = cms2->cpus[i]; if (rcpus & (1 << n)) cms2->cpus[j++] = n; } cms2->nr_cpus = j; return cms2; } |> > numasubset_t *restrictnodes(unsigned long nodes, numasubset_t *res) |> > { |> > [ snip more lines of code ... ] |> > } Now this one confuses me even more. I guessed that a node (in your "Proposed NUMA API") is essentially a set of cpus, for the purpose of restrictnodes(). I don't see where the paper states this clearly, but it does go into length about numbering cpus in ways consistent with the node numbering. So I expected to see something involving nodetocpu() used to convert nodes to cpus. But it seems that this code is just the restrictcpus() code, with the word 'cpus' replaced with 'nodes'. My hunch is (really shooting in the dark here) that you want to add a method that will convert a set of nodes into a set of all the cpus on those nodes (not just the first cpu), and then have restrictnodes() simply call that conversion, and make use of restrictcpus(). ... I'm hearing the sound of someone sawing on a limb, right behind me ... But, continuing to climb further out ... perhaps by node restriction you want memory, not CPU, restriction. The "Proposed NUMA API" paper isn't clear to me on this point: On the second page, under "Simple Node Restrictions", it doesn't say whether restrictnode() applies to CPUs or memory, except to note that "no CPU restriction will be implied" in one case. But later on, under "Bind Tasks to Node(s)" it clearly states "helpful to bind a task's memory ...". Ok - let's imagine that you wanted a routine to restrict a CpuMemSet to a subset of memory nodes (as called in the CpuMemSet design notes). For that let's try: /* * Restrict the mems in a CpuMemSet to the intersection of * those currently in it, and those specified by 'rmems'. 
* 'rmems' is a bit vector, where bit N is set iff the * application can allocate memory on node N, using CpuMemSet * 'application' memory node numbering. * * Return a new, malloc'd, further restricted, CpuMemSet, * or NULL on error. */ /* First - a helper routine to restrict one memory list, in place. */ static void restrict_mem_list ( cms_memory_list *p_mems, unsigned long rmems) { int i, j; /* scans mems, copy only if in rmems */ for (i = 0, j = 0; i < p_mems->nr_mems; i++) { int n = p_mems->mems[i]; if (rmems & (1 << n)) p_mems->mems[j++] = n; } p_mems->nr_mems = j; } cpumemset *restrictnodes(unsigned long rmems, cpumemset *cms1) { cpumemset *cms2; /* points to resulting cpumemset */ int i; /* scan cms2 mems */ cms2 = cmsDeepCopySet (cms1); if (cms2 == NULL) return NULL; for (i = 0; i < cms2->nr_mems; i++) restrict_mem_list (cms2->mems[i], rmems); return cms2; } |> > void freerestrict(numasubset_t *restrict) |> > { |> > cmsFreeCMM(restrict->map); |> > cmsFreeCMS(restrict->set); |> > free(restrict); |> > } I don't think that freerestrict() is needed. Nor is numasubset_t needed. Don't worry about the map - just work with the set, except while discovering the topology, if the system (aka physical) cpu and memory node numbers embed critical topology information. > I don't know how to implement getcpu(), getnode(), cputonode(), or > nodetocpu() based on this API. This API doesn't know nodes (as collections of cpus, if that is what you mean). I can't help you there, except to suggest that whatever root-privileged process sets up the CpuMemMaps to be used by applications making use of this simple-binding API should assign the application cpu numbering in accordance with the rules set forth in the "Proposed NUMA API" paper, relating cpu numbering to node numbering. I need to add a getcmscpu() call to this API, to return the current application cpu number on which a task is executing. Then your getcpu() is just getcmscpu(). I also need to make cmsQueryCMM() not require root privilege, so that you can use it, along with other information in /proc, to discover topology, and understand how that topology maps to the application cpu numbers used in your CpuMemSets. |> > int bindtocpu(unsigned long cpus, numasubset_t *restrict) |> > { |> > numasubset_t *r1; |> > |> > *r1 = restrictcpu(cpus, restrict); |> > if (r1 == NULL) { |> > return (-1); /* adds ENOMEM */ |> > } |> > cmsSetCMS(CMS_CURRENT, (void *)0, r1->set); |> > cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map); |> > freerestrict(r1); |> > } For this, try: int bindtocpu (unsigned long rcpus, cpumemset *cms1) { cpumemset *cms2; int r; cms2 = restrictcpus (rcpus, cms1); if (cms2 == NULL) return -1; r = cmsSetCMS (CMS_CURRENT, (void *)0, cms2); cmsFreeCMS (cms2); return r; } |> > int bindtonode(unsigned long nodes, |> > int behavior, |> > numasubset_t *restrict) |> > { |> > numasubset_t *r1; |> > |> > *r1 = restrictnode(nodes, restrict); |> > if (r1 == NULL) { |> > return (-1); /* adds ENOMEM */ |> > } |> > r1->set->cms_policy = behavior; /* assume we rationalize */ |> > cmsSetCMS(CMS_CURRENT, (void *)0, r1->set); |> > cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map); |> > /* Need to step through all VM objects... How? 
*/ |> > freerestrict(r1); |> > } If by nodes, you mean memory nodes, not collections of cpus, try: int bindtonode (unsigned long rmems, int behavior, cpumemset *cms1) { cpumemset *cms2; int r; cms2 = restrictnodes (rmems, cms1); if (cms2 == NULL) return -1; r = cmsSetCMS (CMS_CURRENT, (void *)0, cms2); cmsFreeCMS (cms2); return r; } |> > int setlaunch(...) |> > { |> > ... |> > cmsSetCMS(CMS_CHILD, (void *)0, r2->set); |> > cmsSetCMM(CMS_CHILD, 0, (void *)0, r2->map); |> > ... |> > } Yes - setlaunch will set the CMS_CHILD cpumemset. If it has to also set the cpumemmap, it must be root; I hope this isn't needed. |> > int bindmemory(...) |> > { |> > ... |> > *r1 = restrictnode(nodes, restrict); |> > ... |> > cmsSetCMS(CMS_CURRENT, (void *)0, r1->set); |> > cur = start + len - PAGESIZE; |> > for (; cur >= start; cur -= PAGESIZE) { |> > cmsSetCMM(CMS_VMAREA, 0, (void *)cur, r1->map); |> > } |> > } I remain confused as to whether restrictnode() means to restrict scheduling tasks to the cpus on the specified nodes, and/or means to restrict allocating memory to the specified memory nodes. If your API has no notion of where its vm objects lie, and has to accept instructions that require it to walk, page by page, over large chunks of virtual address space, trying to bind whatever vm objects happen to be lurking underneath, then this is unfortunate. At least consider opening "/proc/self/maps", to learn what vm objects lurk beneath the virtual address surface. Thank-you, Paul, for your good questions and suggestions. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
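A minimal sketch of the /proc/self/maps suggestion above: enumerate the process's vm areas once, instead of walking an address range page by page. The maps line format (start-end, then permissions and so on) is standard; what to do per area is left to the caller as a stub, since that part of the API is still under discussion.

    #include <stdio.h>

    /* Call fn once per vm area listed in /proc/self/maps. */
    int for_each_vm_area(void (*fn)(unsigned long start, unsigned long end))
    {
        char line[256];
        unsigned long start, end;
        FILE *fp = fopen("/proc/self/maps", "r");

        if (fp == NULL)
            return -1;
        while (fgets(line, sizeof(line), fp) != NULL)
            if (sscanf(line, "%lx-%lx", &start, &end) == 2)
                fn(start, end);     /* one call per vm area, not per page */
        fclose(fp);
        return 0;
    }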
From: Paul M. <Pau...@us...> - 2001-10-30 04:54:45
|
Hello, Paul! Thank you for the good feedback on my first attempt to express simple binding in terms of cpumemsets and cpumemmaps. I am going to study your note for a bit, then make another attempt. Thanx, Paul |
From: Paul D. <pad...@us...> - 2001-11-02 23:55:52
|
Hi Paul, I have a few comments regarding the cpumemset document. Paul Dorwin pd...@us... --- You refer to memory as memory nodes. Other times you refer to the same thing as a node. And still other times you refer to a node in the more familiar context of a container. To me, the term node refers to something which can contain cpus, memory, and IO. I would be more comfortable with some other term which refers to a range of memory. --- In 'Using CpuMemSets' you say: On systems supporting hot-swap of CPUs (or even memory, if someone can figure that out) the system administrator would be able to change CPUs and remap by changing the application's CpuMemMap, without the application being aware of the change. How are you doing this? Will there be /proc/<pid>/cpumemset and /proc/<pid>/cpumemmap interfaces? A /proc interface would also be useful for managing an application which is already running. One could view existing memmaps by cat /proc/123/cpumemmap. A line for each memmap used by the application would be printed. Using your example, a line could be displayed as follows: addr size 8 4,5,6,7,8,9,10,11 2 1,2 Using your example again, could one modify an existing application from the command line by specifying a memmap as follows: echo "migrate addr size 8 4,5,6,7,8,9,10,11 2 1,2" > /proc/123/cpumemmap And finally, the process could be migrated to processors 4-11 via echo ff0 > /proc/123/cpus_allowed. You could also use /proc/cpumemset and /proc/cpumemmap to alter the system defaults. --- In 'Processors, Memory and Distance' your discussion of <cpu,mem> distances deals primarily with cache warmth issues. Should you also discuss the disadvantages of scheduling a process on a cpu further from where the physical pages are contained? For example, you run on node 0 and allocate pages from the memory on that node. If you sleep (maybe on IO) you no longer have any cache warmth. However, you would still incur a potentially more expensive penalty if you are scheduled on a cpu on another node because you now have to pull all data into cache over a longer latency/lower bandwidth pipe. Does any of this make sense? |
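For reference, the "ff0" in the cpus_allowed example above is just the bitmask with bits 4 through 11 set, one bit per processor:

    /* cpus_allowed mask for processors 4..11: eight set bits,
     * shifted up by four. */
    unsigned long cpus_4_to_11 = ((1UL << 8) - 1) << 4;     /* == 0xff0 */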
From: Paul J. <pj...@en...> - 2001-11-03 01:09:42
|
On Fri, 2 Nov 2001, Paul Dorwin wrote: > I have a few comments regarding the cpumemset document. Excellent - thank-you! > You refer to memory as memory nodes. Other times you refer to > the same thing as a node. And still other times you refer to > a node in the more familiar context of a container. To me, > the term node refers to something which can contain cpus, > memory, and IO. I would be more comfortable with some other > term which refers to a range of memory. I would be more comfortable with another name as well ;). Any suggestions? If I had to pick an alternative right now, it would be "memory chunk". Earlier I tried "memory bank", but that had too many prior connotations to me. I used "memory node" in this Version because it seemed that on the architectures we are currently concerned with (the big ia64 numa systems I knew of) there was a one-to-one relation between chunks of memory and system nodes. I thought I had been fairly pedantic in using "memory node" everywhere, except sometimes when using that term multiple times in a single sentence, and I hoped that secondary references could be abbreviated to just "nodes" without confusion. If you see a contrary instance, I'd be happy to fix it. Or if you have a better name, I'm interested. > > --- > > In 'Using CpuMemSets' you say: > > On systems supporting hot-swap of CPUs (or even memory, if someone can > figure that out) the system administrator would be able to change CPUs > and remap by changing the application's CpuMemMap, without the > application being aware of the change. > > How are you doing this? See the Bulk Remap call (CMS_BULK_ALL). It goes through and alters any CpuMemMap as requested, perhaps to remove a cpu or memory node (according to its system numbering) from service, by replacing that system number with another. The implementation then walks through the tasks and vm areas in the system, recomputing cpus_allowed and zone lists as need be, to reflect the changed CpuMemMaps. By the time that one system call returns, no further task will be scheduled on the mapped out cpu, and no further memory will be allocated on the mapped out memory node. > Will there be /proc/<pid>/cpumemset and > /proc/<pid>/cpumemmap interfaces? No plans for this, though it's possible. > A /proc interface would also be useful for managing an application > which is already running. Use the cms*() calls with "pid" arguments, such as: cmsQueryCMM, cmsSetCMM, cmsQueryCMSbyPid, cmsSetCMSbyPid, cmsBulkRemap to manage currently running applications. I find the use of /proc to manage a system, as opposed to (1) report on it, or (2) toggle obscure debug hooks, to be an ugly interface, and resist such. From the latest work I see from Rusty Russell <ru...@ru...>: [PATCH] 2.5 PROPOSAL: Replacement for current /proc of shit. I am not alone in this opinion. > One could view existing memmaps by cat /proc/123/cpumemmap. > A line for each memmap used by the application would be printed. > Using your example, a line could be displayed as follows: > > addr size 8 4,5,6,7,8,9,10,11 2 1,2 > > Using your example again, could one modify an existing application > from the command line by specifying a memmap as follows: > echo "migrate addr size 8 4,5,6,7,8,9,10,11 2 1,2" > /proc/123/cpumemmap ugh - try instead: cmsSetCMM (CMS_CURRENT, 123, 0, 0, &cmm); > And finally, the process could be migrated to processors 4-11 > via echo ff0 > /proc/123/cpus_allowed. 
There is no visible "cpus_allowed" in CpuMemSets - rather cpus_allowed is an implementation detail of the task scheduler for systems with fewer than 64 cpus. Instead, we need command line utilities, built on the CpuMemSets infrastructure, to support such migration and related tasks. > You could also use /proc/cpumemset and /proc/cpumemmap to alter the > system defaults. You _could_. I hope not. Also there is no particularly interesting "system default" map or set, beyond the initial one the kernel sets up during boot, and uses when starting the init process. From that point forward, all maps and sets are inherited or user specified. > --- > > In 'Processors, Memory and Distance' your discussion of <cpu,mem> > distances deals primarily with cache warmth issues. Should you also > discuss the disadvantages of scheduling a process on a cpu further > from where the physical pages are contained? My recollection is that I have two distances: <cpu, mem> - for modeling cpu to memory latency/bandwidth <cpu, cpu> - for modeling cache warmth Perhaps something in my presentation is confusing these two? > For example, you run on node 0 and allocate pages from the memory on > that node. If you sleep (maybe on IO) you no longer have any cache > warmth. However, you would still incur a potentially more expensive > penalty if you are scheduled on a cpu on another node because you now > have to pull all data into cache over a longer latency/lower bandwidth > pipe. This example sounds like it is getting at <cpu, mem> distances. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
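A sketch of the hot-swap scenario described above. The design note names the Bulk Remap call and the CMS_BULK_ALL selector, but the argument list used here (old and new system cpu numbers) is only a guess for illustration, not the draft API:

    /* Hypothetical signature -- assumed, not taken from the draft. */
    int retire_cpu(int old_sys_cpu, int new_sys_cpu)
    {
        /* Rewrite every CpuMemMap naming old_sys_cpu to name
         * new_sys_cpu instead; per the text, the kernel then
         * recomputes cpus_allowed and zone lists for the affected
         * tasks and vm areas before the call returns. */
        return cmsBulkRemap(CMS_BULK_ALL, old_sys_cpu, new_sys_cpu);
    }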
From: Hubertus F. <fr...@wa...> - 2001-11-05 16:32:17
|
Well my 2.5 cents: Under "Chunk" I usually imply a smaller piece of a large thing. Hence this doesn't really fit. "MemoryBlock" would fit better with me. -- Hubertus * Paul Jackson <pj...@en...> [20011102 20;09]: > On Fri, 2 Nov 2001, Paul Dorwin wrote: > > You refer to memory as memory nodes. Other times you refer to > > the same thing as a node. And still other times you refer to > > a node in the more familiar context of a container. To me, > > the term node refers to something which can contain cpus, > > memory, and IO. I would be more comfortable with some other > > term which refers to a range of memory. > > I would be more comfortable with another name as well ;). > Any suggestions? If I had to pick an alternative right > now, it would be "memory chunk". > > Earlier I tried "memory bank", but that had too many prior > connotations to me. ... |
From: Paul J. <pj...@en...> - 2001-11-05 23:39:43
|
On Mon, 5 Nov 2001, Hubertus Franke wrote: > Well my 2.5 cents: > > Under "Chunk" I usually imply a smaller piece of a large thing. > Hence this doesn't really fit. > "MemoryBlock" would fit better with me. MemoryBlock - I like it. I will probably change it to that, in the next version. Thanks, Hubertus. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Niels C. <nc...@us...> - 2001-12-24 06:11:37
|
> I have just posted on SourceForge LSE a new version of > the Design Note: > Process Scheduling and Memory Placement Hi Paul, I took the time to read your design notes. As you know, I have no NUMA background so maybe I'm not getting the whole picture. Please bear with me if I ask overly dumb questions in my ignorance but I ran into a few things I would like to bring up. > cpumemmap: > The lower layer of this proposal, named cpumemmaps, > provides two simple maps, mapping system CPU and memory > block numbers to application CPU and memory block numbers. > Each process, each virtual memory area, and the kernel for > its needs, has such a cpumemmap. Your above statement leaves me a bit confused. The way I understand the rest of the description, there are two types of cpumemmaps, one for the system (existing in only one instance) and one for "applications" with one instance per task. Do I understand that correctly? You say: "Each cpumemset has an associated cpumemmap" and that is the way I understand it but I think your design notes could benefit from a graphical view of how one maps to the other and on to the system map. Further graphs to illustrate bulk remapping would be even better. The only two principal thorns in my eye are around preferences when it comes to selecting memory block and cpu. You have added the memory block preference in the upper layer and you have opted to not supply cpu preference. For all I can see, the memory preference is likely to be defaulting to "the closer the memory the sooner you try it" -- and is likely to remain so in most cases. I do not see why cpumemmaps should not carry this information with the cpumemsets being able to override. Similarly, I see no reason why the cpu preference can not be kept just like memory block preference and handled the same way. I think the inclusion of cpu preference fits into a design that combines Paul Dorwin's topology design with your two-layer abstraction. Instead of having the sequence in which the preferred cpus and memory blocks appear determine the preference, a data-carrying link structure would be preferable although not the only possible solution. I say so because I think that in the near future, we need more than "preferred". In the case of memory, a number of states may influence where we grab memory. For example, if allocation from the closest memory would cause a page-out but allocation from the next-closest memory would not, where should we allocate? Should memory preference be influenced by the likelihood or history of the task's cpu bounces? In the case of processor, when should we decide not to schedule a task because no preferred processor is idle or has a lower priority task running? For how long? Do the rules change if this is a hypertasking processor? I believe the design should allow for such extensions and, probably, include a couple of properly named reserved fields for that purpose. Thanks for passing on my plea for help with Co-Pilot. Niels |
From: Paul J. <pj...@en...> - 2001-12-24 20:12:35
|
Thanks for reviewing the Notes, Niels. You and I should be celebrating Christmas ... oh well. You wrote: > > cpumemmap: > > The lower layer of this proposal, named cpumemmaps, > > provides two simple maps, mapping system CPU and memory > > block numbers to application CPU and memory block numbers. > > Each process, each virtual memory area, and the kernel for > > its needs, has such a cpumemmap. > > Your above statement leaves me a bit confused. The way I > understand the rest of the description, there are two types > of cpumemmaps, one for the system (existing in only one > instance) and one for "applications" with one instance per > task. Do I understand that correctly? No. Not two maps (system and applications). Many maps, one kernel, and one for each task, and one for each vm area. The maps are shared if inherited from a common map and unchanged. But except for one Bulk Remap operation, this sharing is not apparent to the user. The "system" and "application" dichotomy is in the numbering of CPUs and memory blocks. The system (Linux kernel) has its preferred way of numbering CPUs ... one such way is the compact node id. From the perspective of CpuMemSets, there is only one such numbering scheme, pre-ordained by the kernel. Each cpumemmap describes an application numbering. While there is only one system numbering scheme (known to CpuMemSets) there are many application numbering schemes -- exactly the many cpumemmaps. Each cpumemmap describes the mapping from some "application" CPU and memory block numbers to _the_ "system" numbers. |> your design notes could benefit from a graphical view ... Yes, some pictures would help. I chose the technology (basically hand edited HTML containing just old-fashioned text) for presenting these Notes to optimize the speed with which I could edit them, and the ease of presenting them in different contexts. However, now I should soon go back and refine the presentation to include pictures, in order to make it easier to understand. > Similarly, I see no reason why the cpu preference can not be > kept just like memory block preference and handled the same > way. > > I think that in the near future, we need > more than "preferred". In the case of memory, a number of > states may influence where we grab memory. This comment goes to the heart of a key design decision in this CpuMemSets design. The challenge is to present enough flexibility to be generally useful, while keeping it simple enough to be generally understandable. In particular, there will be some systems with some additional needs that don't fit explicitly in CpuMemSets, and these needs will have to be met by additional mechanisms. Hopefully, CpuMemSets can integrate smoothly with such additional mechanisms, with CpuMemSets saying "here's the ordered list of where to search", and leaving it to other code to pick and choose from that ordered list with more refined methods. For example, it is already the case, in the normal Linux kernel page allocator, that it takes into account additional heuristics on where to allocate a page. It does this in part by scanning the list of zones multiple times, first looking for easy memory, then getting increasingly desperate. In this environment, CpuMemSets will provide some framework -- an ordered set of memory blocks to search, but it will leave to other mechanisms the details of deciding which memory block is "best". |> You have added the memory block preference in the upper |> layer and you have opted to not supply cpu preference. 
In the case of memory, the current CpuMemSets design provides a couple of additional option flags, stating which order to search the memory in. In the case of CPUs, it doesn't have any such options currently, and treats the CPU list as simply an unordered set, even though that set is passed into the kernel as an ordered list. If the kernel had use for diverse CPU search orders in the scheduler, then I could easily see extending this CpuMemSet design to make that explicit, and add a few flags, specifying various search orders over the provided CPU list. |> For all I can see, the memory preference is likely to be |> defaulting to "the closer the memory the sooner you try it" |> -- and is likely to remain so in most cases. Once you determine which memory blocks to search, and where in that list to begin the search, and disregarding more dynamic issues (see comments above involving searching the same list multiple times, with increasing desperation), then yes, the search order will often be distance based. |> I do not see why cpumemmaps should not carry this information |> with the cpumemsets being able to override. The maps just renumber. One could easily have all maps present all CPUs and all memory blocks, all the time. I see no use in having two layers that both attempt to present mechanisms for specifying preference information. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
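A sketch of the division of labor described above, with invented names throughout: CpuMemSets hand the allocator an ordered list of memory blocks, and the allocator scans that list repeatedly, relaxing its standards on each pass, much as the existing zone allocator does.

    /* All identifiers here (try_alloc, the pass count) are invented
     * for illustration of the search-order idea only. */
    struct page *alloc_from_list(const int *mems, int nr_mems)
    {
        int pass, i;

        for (pass = 0; pass < 3; pass++) {      /* increasing desperation */
            for (i = 0; i < nr_mems; i++) {
                struct page *p = try_alloc(mems[i], pass);
                if (p != NULL)
                    return p;                   /* first block that can satisfy us */
            }
        }
        return NULL;                            /* nothing allowed could satisfy */
    }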
From: Niels C. <nc...@us...> - 2001-12-28 04:11:59
|
> No. Not two maps (system and applications). Many maps, > one kernel, and one for each task, and one for each vm > area. The maps are shared if inherited from a common map > and unchanged. But except for one Bulk Remap operation, > this sharing is not apparent to the user. Paul, I give up. Why do we need one per vm area when all operations on memory are done from a task? Why can't we use the task's map, which is used to generate the initial memory area map anyway? You do say: * When allocating another page to an area, the kernel will * choose the memory list for the CPU on which the current * task is being executed, if that CPU is in the cpumemset of * that memory area, else it will choose the memory list for * the default CPU (see CMS_DEFAULT_CPU) in that memory area's * cpumemset. The kernel then searches the chosen memory list * in order... Now you are adding functionality to the way the data is organized. Who says this is the only or even the best way to allocate memory? But notice that you use the term "memory list"... So maybe I should rephrase that: It seems to me that the task should have the map, mapping to a different map list structure for associated memory areas. I do not see that memory area list having a cpu mapping. I think you are trying to generalize too much here, which -- to me -- is apparent from your comment: * Not all portions of a cpumemset are useful in all cases. * For example the CPU portion of a vm area cpumemset is unused. * It is not clear as of this writing whether CPU portions of the * kernel's cpumemset are useful So while I agree that we need a data structure to hold the vm areas, I see it as different from the cpu structure. Now, while I think you are generalizing too much in one area, I think you are doing the opposite in the other. That other one is the rejection of cpu-to-cpu relationships. In a two- layer design, I would expect everything hard-wired, such as cpu-to-cpu affinity/distance/whatever to be reflected in the lower layer just as the cpu-to-memory-block relationship is. And, maybe, as the concept of a node is, since that seems to be(come) a very popular way of designing NUMA systems. And these will be my last words in this thread. I will be following it but will leave the discussion to Michael's team. It has been interesting, educational and entertaining though, at least for me, but I don't have the bandwidth ... Niels |
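A sketch of the allocation rule quoted above: use the faulting vm area's memory list for the executing cpu if that cpu has one in the area's cpumemset, otherwise fall back to the CMS_DEFAULT_CPU list, and then search the chosen list in order. The per-list cpu tag and field names are assumed for illustration; only nr_mems and mems[] appear in the draft code shown earlier in the thread.

    /* Pick which memory list to search for a fault taken on cur_cpu. */
    cms_memory_list *pick_memory_list(cpumemset *cms, int cur_cpu)
    {
        int i;
        cms_memory_list *dflt = NULL;

        for (i = 0; i < cms->nr_mems; i++) {
            if (cms->mems[i]->cpu == cur_cpu)
                return cms->mems[i];        /* list for the executing cpu */
            if (cms->mems[i]->cpu == CMS_DEFAULT_CPU)
                dflt = cms->mems[i];
        }
        return dflt;                        /* CMS_DEFAULT_CPU fallback */
    }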
From: Paul J. <pj...@en...> - 2002-01-12 09:11:14
|
Thanks for your comments, Niels. I appreciate that you don't have the bandwidth to continue this discussion. Heck - it took me over two weeks to find the bandwidth to reply, and this is my main task. My apologies for not responding sooner -- I did however enjoy a rare vacation, and hope to conquer my recently acquired "Civilization III (Sid Meier)" addiction soon. Anyhow, for the benefit of others who might have been intrigued by your comments, I will attempt a modest response. On Thu, 27 Dec 2001, Niels Christiansen wrote: > Paul, I give up. Why do we need one per vm area when all > operations on memory are done from a task? Why can't we > use the task's map, which is used to generate the initial > memory area map anyway? Except for kernel allocation within interrupts, yes, there is a task (as well as a vm area) that can naturally be associated with each page fault. Large NUMA apps that have their memory allocations fine tuned for a system will place different chunks of memory on different nodes, quite intentionally. This includes shared memory, which may be accessed from many processes. So just knowing the process and executing CPU isn't enough to place the memory. One must also know which address region of that process owns the fault. A single process will typically have access to several regions, which have various and different placement directives. The natural place, in my view, to attach such directives that vary by memory region is to the region, known in Linux as the vm area. > It seems to me that the task should have the map, mapping to > a different map list structure for associated memory areas. > I do not see that memory area list having a cpu mapping. Well, we clearly need to have the choice of where to look for memory depend on which cpu executed the fault, and which memory region (vm area) faulted. I say "clearly" based on my analysis of what major well-tuned NUMA apps require. My focus here is on the big-data, long running, big compute jobs that are SGI's primary business focus. Other types of loads will have other requirements. The trick is to find a single mechanism that will be useful to the variety of uses. ... I couldn't quite make sense of the rest of your post, so will just have to let it rest in peace. Perhaps it touched on a point that someone else will raise. Once again - thanks for your experienced and extended examination of this proposal. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Michael H. <hoh...@us...> - 2002-01-09 19:21:48
|
Paul, I appreciate that you have addressed most of my concerns in your latest revision of CpuMemSets. At this point in time I have little to add, although, as we have previously discussed, there is a need for adding grouping capabilities to CpuMemSet. Next week I'll try to post some thoughts on this. A question that has been brought up is how would one remove a resource (CPU or memory block) from the system via CpuMemSets? The only option I see now is to use the bulk remap capability and map a different resource to replace the resource being removed. So, for example, if on an eight processor system, processor 5 was being removed, one would have to choose a different processor, say 6, to map what had been mapped to 5. Thus a processor list, assuming all system processors were being mapped by a CpuMemMap one to one, would be bulk remapped to look like {0,1,2,3,4,6,6,7}. Is my understanding correct? If so, while technically possible, there are numerous problems I see with this, especially with respect to implications on cpumemsets based on the now changed cpumemmap. Michael Hohnbaum hoh...@us... |
From: Paul J. <pj...@en...> - 2002-01-12 10:18:11
|
On Wed, 9 Jan 2002, Michael Hohnbaum wrote: > Paul, > > I appreciate that you have addressed most of my concerns in > your latest revision of CpuMemSets. At this point in time I > have little to add, although, as we have previously discussed, > there is a need for adding grouping capabilities to CpuMemSet. > Next week I'll try to post some thoughts on this. > > A question that has been brought up is how would one remove a > resource (CPU or memory block) from the system via CpuMemSets? > The only option I see now is to use the bulk remap capability > and map a different resource to replace the resource being > removed. So, for example, if on an eight processor system, > processor 5 was being removed, one would have to choose a > different processor, say 6, to map what had been mapped to 5. > Thus a processor list, assuming all system processors were > being mapped by a CpuMemMap one to one, would be bulk remapped > to look like {0,1,2,3,4,6,6,7}. Is my understanding correct? > If so, while technically possible, there are numerous problems > I see with this, especially with respect to implications on > cpumemsets based on the now changed cpumemmap. > > Michael Hohnbaum > hoh...@us... Your crystal ball is working at full strength. I'd advise you to visit Las Vegas and bet heavily. In other words, in the final round of changes before packaging up our first patch (should be out next week, crossing my fingers), the only significant design detail that took a hit was the usefulness of the Bulk Remap feature. It seems to me, by my present understanding, that we must insist that the CpuMemMap memory lists be injective - meaning that you can't list a cpu twice in the memory lists. Or if we allow duplicates (such as cpu '6' in your example above) then as you state "there are numerous problems". Any ideas? I agree that the "canonical" scenario to be solved with bulk remap, or some other design element, is removing a cpu from a system. I also look forward to your thoughts on "grouping". That should be a "fun" challenge. I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
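A sketch of the injectivity requirement discussed above: reject a proposed CpuMemMap whose cpu array names the same system cpu twice, as {0,1,2,3,4,6,6,7} does. The flat array-of-ints representation is assumed for illustration.

    /* Return 1 if no system cpu number appears twice, else 0. */
    int map_is_injective(const int *cpus, int nr_cpus)
    {
        int i, j;

        for (i = 0; i < nr_cpus; i++)
            for (j = i + 1; j < nr_cpus; j++)
                if (cpus[i] == cpus[j])
                    return 0;   /* duplicate, e.g. the two 6s above */
        return 1;
    }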
From: Paul J. <pj...@en...> - 2001-10-12 18:57:26
|
Paul McKenney wrote: > Took a first pass through your proposal -- good stuff! Some very > interesting approaches! Thanks for the initial review. Before I read it in detail, let me repost it, with some paragraph breaks, so that others may read it more easily. What follows is Paul Mckenney's writing, reformated ... ======================================================= Some comments: Under "Desired Properties / Vendor neutral base", I recommend adding Tru64, HP-UX, AIX, etc. May as well be all-inclusive. ;-) Under "Implementation Layers", near the end of the second paragraph: "a small scale forking" seems a bit dramatic. That said, it would be good if the NUMA and SMP code paths used the same C code. A question on items 1, 2, and 3 under "Implementation Layers". Item 1 seems to indicate that the current use of bitmasks within the kernel can continue unchanged, but hints at longer-term changes. Do you envision the kernel moving towards using CpuMemMaps and CpuMemSets, or do you expect these two concepts to be strictly confined to user space? Under "Error cases" in the header file, you say that if you really want to force an application to use CPUs and memory from disjoint nodes (which you might in diagnostics or performance-measurement code), then the CpuMemSet memory list must contain CMS_DEFAULT_CPU. But CMS_DEFAULT_CPU is casted to cms_lcpu_t. Shouldn't it be cast instead to cms_lmem_t? Or is CMS_DEFAULT_CPU instead supposed to go on the CpuMemSet's list of CPUs? If the latter, does this allow the disjoint-node operation for CPUs and memory? (So that all the permitted CPUs are on one set of nodes, but all the permitted memory is on another set of nodes, and the two sets of nodes are disjoint.) First-touch and stripe policies seem to be missing from the list. Some of the OSes have also had least-memory-utilization policies. A cpumemset has no meaning except in the context of a cpumemmap, right? A cpumemmap maps the CPU numbers, while a cpumemset simply restricts them, right? (Could make it work either way, but things like getcpu() and getnode() need to be in sync with the choice.) For example, suppose that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What is the physical CPU corresponding to logical CPUs 0, 1, and 2? HP's launch policies include policies for CPU allocation (round robin and the like). My guess is that this requires either additional bits in cms_policy or another field for CPU (as opposed to memory) policies. Non-root processes are prohibited from creating cpumemmaps. Shouldn't they be allowed to create them, as long as they are subsets of the cpumemmap that they are currently running with? If non-root processes are really prohibited from playing with cpumemmaps, what happens to their children? Is the child's cpumemmap generated from the cpumemset/cpumemmap pair that the parent specified with CMS_CHILD? Or does the child just get copies of the CMS_CHILD cpumemset/cpumemmap? (At this point, I am favoring letting non-root processes manipulate cpumemmaps, with the restriction that any that are installed must be pure restrictions of the current cpumemmap.) There is no way to create either a cpumemmap or a cpumemset without querying for it. The intended usage is to query for this process's set/map, then manipulate the resulting structure? Is the user supposed to directly manipulate the fields of the cpumemmap and cpumemset? 
A few typos, possibly due to WikiWeb issues: "cpu's" -> "CPUs", "mem's" -> "memories", "preasure points" -> "pressure points", "CpumemSet" -> "CpuMemSet", etc. Enough questions for now! More later... Once again, some good stuff here! Thanx, Paul ======================================================= I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Paul J. <pj...@en...> - 2001-10-12 20:58:47
|
Thanks - excellent comments, Paul. I look forward to continued feedback from yourself and others. Paul McKenney wrote: ==== > Took a first pass through your proposal ... > > adding Tru64, HP-UX, AIX, etc. May as well be all-inclusive. ;-) good idea - added. ==== > Under "Implementation Layers", near the end of the second paragraph: "a > small scale forking" seems a bit dramatic. yup - softened to: increased the risk for minor Linux kernel code duplication ==== > A question on items 1, 2, and 3 under "Implementation Layers". Item 1 > seems to indicate that the current use of bitmasks within the kernel can > continue unchanged, but hints at longer-term changes. Do you envision > the kernel moving towards using CpuMemMaps and CpuMemSets, or do you > expect these two concepts to be strictly confined to user space? I doubt that the guts of scheduling or allocation code will ever want to be written in terms of CpuMemMaps and Sets. The data structure used in CpuMem*, arrays of cpu or mem IDs, is almost surely too inefficient for critical path code. I expect that when someone needs a kernel for > 64 cpus, they will have to find an alternative to, or extension of, Ingo's cpus_allowed bit vector. And I expect continued coding activity for various other purposes in both the scheduling and allocation code, which will at times impact the preferred data representation of available cpu and memory resources for these critical code paths. The kernel code that implements the system calls to set CpuMemMaps and Sets will always have the responsibility for translating the stable, but cumbersome, API of these Maps and Sets into the efficient representation of the day required by the scheduling and allocation code. ==== > Under "Error cases" in the header file, you say that if you really want > to force an application to use CPUs and memory from disjoint nodes > (which you might in diagnostics or performance-measurement code), then > the CpuMemSet memory list must contain CMS_DEFAULT_CPU. But > CMS_DEFAULT_CPU is casted to cms_lcpu_t. Shouldn't it be cast instead > to cms_lmem_t? Or is CMS_DEFAULT_CPU instead supposed to go on the > CpuMemSet's list of CPUs? If the latter, does this allow the > disjoint-node operation for CPUs and memory? (So that all the permitted > CPUs are on one set of nodes, but all the permitted memory is on another > set of nodes, and the two sets of nodes are disjoint.) Aha - this Error Case is confused, sufficiently so that it apparently managed to further confuse your critique ;). And the Error Case preceding it is also confused. For the benefit of those who don't have this Design Note at hand, the two confused Error Cases are: * It is not an error if a CpuMemSet for an object (task, vm area * or kernel) doesn't provide memory lists for all the cpus in * that object's CpuMemMap. That is, it is ok for a CpuMemSet to * "be smaller than" (only use a subset of) its Map. * * However it is an error to set a CpuMemSet that shows cpus that * are not listed in any of the memory lists of that CpuMemSet, * unless the memory lists include a CMS_DEFAULT_CPU. Attempts to * set such a CpuMemSet fail with errno set to ESRCH. This case * must be an error to avoid trying to allocate memory without * knowing which memory list to search. Let me try again ... A CpuMemSet specifies two things. 
It specifies on which cpus (in the corresponding CpuMemMap) a task may be scheduled, and it specifies in what order to search for memory, per virtual memory area, depending on which cpu the request for memory was executed on. The question arises - where do we look for memory if a request for memory is executed on a cpu that is not specified in the active CpuMemSet? Perhaps someone didn't provide memory lists for all possible cpus that might execute code sharing that area. Heck, perhaps they _couldn't_ specify such memory lists, because they are setting up a shared memory area that will be shared with other tasks running on cpus outside the Map of the process initializing the shared memory area. Seems that I can either (1) require that a memory list _always_ be specified for the CMS_DEFAULT_CPU, or (2) mandate that attempts to allocate memory when the CpuMemSet does not specify (even by CMS_DEFAULT_CPU) any memory list for the currently executing cpu must fail. Choice (2) would cause nasty, obscure and intermittent errors. So it must be choice (1). Hence these two Error Cases collapse into the following single case: * Every CpuMemSet must specify a memory list for the * CMS_DEFAULT_CPU, to ensure that regardless of which CPU a * memory request is executed on, a memory list will be available * to search for memory. Attempts to set a CpuMemSet without a * memory list specified for the CMS_DEFAULT_CPU will fail, with * errno set to EINVAL. ==== > First-touch and stripe policies seem to be missing from the list. Some > of the OSes have also had least-memory-utilization policies. First touch is the natural order of things. If, as will usually be the case, the memory lists are ordered by distance from the faulting cpu, then this provides first touch. Existing upper level APIs, such as cpusets, dplace, runon, OpenMP, MPI that support a First Touch policy would presumably implement that policy by properly sorting the memory lists. Aha - perhaps I should change the CMS_DEFAULT policy comment, from: #define CMS_DEFAULT 0x00 /* None of the following optional policies */ to something such as: #define CMS_DEFAULT 0x00 /* Memory list order (first-touch, typically) */ Tell me more about what is the essence of stripe policies, as they apply here. My guess is that a combination of proper memory list sorting plus a round robin (CMS_ROUND_ROBIN) policy will provide the desired semantics. But more consideration of this point is needed. What is a "least-memory utilization policy"? I am not familiar with that term. ==== > A cpumemset has no meaning except in the context of a cpumemmap, right? > A cpumemmap maps the CPU numbers, while a cpumemset simply restricts > them, right? (Could make it work either way, but things like getcpu() > and getnode() need to be in sync with the choice.) For example, suppose > that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that > the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What > is the physical CPU corresponding to logical CPUs 0, 1, and 2? Yes, Yes, tell me about getcpu/getnode, and {100,102,103}. My presumption, from reading this, is that getcpu(), executed in this map and set, would return either 100 or 103, rather oblivious to the CpuMemMap. I have no clue what getnode() would or should do that relates to CpuMemSets. Perhaps we need a "getcmscpu()"? ==== > HP's launch policies include policies for CPU allocation (round robin > and the like). 
My guess is that this requires either additional bits in > cms_policy or another field for CPU (as opposed to memory) policies. Aha - might be so. I should investigate this further. ==== > Non-root processes are prohibited from creating cpumemmaps. Shouldn't > they be allowed to create them, as long as they are subsets of the > cpumemmap that they are currently running with? Non-root processes can restrict where they execute tasks or allocate memory by altering their CpuMemSet, within the confines of the CpuMemMap established for them. Why is that not enough? ==== > If non-root processes are really prohibited from playing with > cpumemmaps, what happens to their children? Is the child's cpumemmap > generated from the cpumemset/cpumemmap pair that the parent specified > with CMS_CHILD? Or does the child just get copies of the CMS_CHILD > cpumemset/cpumemmap? I don't understand the difference between these two choices. They both sound the same to me, and both sound right. ==== > (At this point, I am favoring letting non-root > processes manipulate cpumemmaps, with the restriction that any that are > installed must be pure restrictions of the current cpumemmap.) Well, having not yet appreciated the comments above on this, I will have to table this suggestion pending my further enlightenment. ==== > There is no way to create either a cpumemmap or a cpumemset without > querying for it. The intended usage is to query for this process's > set/map, then manipulate the resulting structure? Yes, yes. Earlier designs allowed for creating and manipulating CpuMemSets, as a kernel supported object that was visible as a distinct identified object to applications, separate from their binding to any given task or vm area. But I could see no essential use for unbound CpuMemSets, so now they are an attribute of known tasks and vm areas. ==== > Is the user supposed to directly manipulate the fields of the cpumemmap > and cpumemset? Yes, via the cmsSet*() system calls. Well, as direct as anything in a protected operating system -- the user politely asks the kernel to set a map or set, and doesn't directly write kernel memory. My hunch here is that I missed the real point of this question ... ==== > A few typos, possibly due to WikiWeb issues: "cpu's" -> "CPUs", "mem's" > -> "memories", "preasure points" -> "pressure points", "CpumemSet" -> > "CpuMemSet", etc. I can hardly blame wiki for typos and misspellings ;). Thanks for pointing these out. > Enough questions for now! More later... Once again, some good stuff here! I look forward to your further comments, and to integrating this work with substantial good work being done by folks on your end, and elsewhere. > > Thanx, Paul I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
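A sketch of the single collapsed error case above: a set-time validation that every CpuMemSet carries a memory list for CMS_DEFAULT_CPU. The per-list cpu tag used here is assumed from the discussion, not quoted from the draft header.

    /* Return 1 if a CMS_DEFAULT_CPU memory list is present; per the
     * text, a cmsSetCMS() would otherwise fail with errno EINVAL. */
    int cms_has_default_list(const cpumemset *cms)
    {
        int i;

        for (i = 0; i < cms->nr_mems; i++)
            if (cms->mems[i]->cpu == CMS_DEFAULT_CPU)
                return 1;
        return 0;
    }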