From: Niels C. <nc...@us...> - 2002-02-12 04:01:12
Michael Hohnbaum wrote:
> A new version of the simple binding API is now available at:
> http://lse.sourceforge.net/numa/numa_api.html
> This is an update to the original proposal by Paul McKenney.
> All comments are welcomed.

That was easy, Michael. Looks good to me. I have a few questions and suggestions, though:

1. Since there is a behavior argument to the bindtomemblk(), setlaunch() and bindmemory() calls, I would like to know why one is not needed for the bindtocpu() call? Would it not be useful to be able to utilize idle processors rather than have several processes compete for a subset of processors? I'm thinking of allowing for a NOT_STRICT option, possibly along with others, that says whether and under which circumstances a process may disregard the (preferred) binding and where it may go.

2. What about a function to, say, bind a process to a set of processors on any one node and memory on any one (other?) node? Something along the lines of:

       int bindcpumem(int cpunode, u_int cpuset,
                      int memnode, u_int memset, u_int behavior);

   where cpuset is a bitmask for the processors on a node regardless of the global cpu numbering and memset is the same for memory. This would permit a "job" to be moved as a unit of work from one node to another. By giving -1 arguments to cpunode and memnode you could say that you don't really care which node, as long as the process group executes on processors of the same node.

3. I notice the absence of "distance". As far as I can tell, the coupling to nodes is mostly another way of saying distance. With all the discussion about distance in Paul Dorwin's topology design, maybe the binding API could be enhanced by using distance. For example, my suggested bindcpumem() function could then be something like:

       int binddist(int node, int max_cpu_dist, int max_mem_dist,
                    u_int behavior);

   where a node argument of -1 means any node.

4. The API functional specification is one thing. What about the changes to data structures in the kernel, scheduler, system calls, etc.? Is that something you have prototyped?

5. Would we not need a migrate() function, or is that beyond the scope of the API?

- nc -
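To make the bindcpumem() suggestion in point 2 concrete, here is a minimal sketch of a call to it. The prototype follows the proposal above; the flag names and the stub body are assumptions added only so the example compiles and runs -- nothing here is part of any published API:

    #include <stdio.h>

    /* Hypothetical behavior flags; NOT_STRICT is the option floated in
     * point 1, STRICT is the obvious default.  Values are made up. */
    #define BIND_STRICT     0x1
    #define BIND_NOT_STRICT 0x2   /* binding is a preference, not a must */

    /* Stub standing in for the proposed call; a real version would be
     * a system call.  The signature follows point 2 above. */
    static int bindcpumem(int cpunode, unsigned int cpuset,
                          int memnode, unsigned int memset,
                          unsigned int behavior)
    {
        printf("cpus 0x%x on node %d, mem 0x%x on node %d, flags 0x%x\n",
               cpuset, cpunode, memset, memnode, behavior);
        return 0;
    }

    int main(void)
    {
        /* Any one node (-1) for both cpus and memory: run the job on
         * the first two processors of whichever node the kernel picks,
         * with memory on that same node, and allow deviation under load. */
        return bindcpumem(-1, 0x3, -1, 0x3, BIND_NOT_STRICT);
    }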
From: Paul J. <pj...@en...> - 2002-02-12 04:13:53
Thanks for the simple binding update, Michael. Overall it looks good. This fills an essential need. Thanks. And thanks for the several mentions of CpuMemSets. My specific comments:

1) Two details of the API still concern me -- the placeholder numamap argument and the 64-bit limit on the number of cpus (without at least an API change such as you say Russ Wright is considering). Neither placeholders nor such potential API changes stand the test of time, in my experience. How about instead (being controversial ... brainstorming):

   a] build the simple binding only on top of CpuMemSets,
   b] with the simple binding _only_ using application numbering, not system numbering,
   c] drop the numamap placeholder argument, and
   d] accept forever afterward that simple binding only manages up to 64 (or 32) cpus, memory blocks, and nodes.

   The last item, [d], doesn't keep you from running on a larger system -- it just keeps a given application from using the simple binding to manage a larger set of cpus.

2) You observe that there is no node number mapping, so all node numbers are physical. Do you think that there should be a node number mapping?

3) Typo: under restrictmemblk(... memblk, ... numamap) you wrote:

      If the memblk bitmask adds CPUs and the user is not root ...

   where I suspect you intended:

      If the memblk bitmask adds blocks and the user is not root ...

4) You state that getcpu() does the same thing as cmsGetCpu(). If CpuMemSets is _not_ present, then I have a slightly subtle concern here. Your simple binding makes certain promises to its user that all CPUs within one node will have numbers in a range that does not overlap with the CPU numbers of any other node. However, I am not aware of any promise by the kernel, across all architectures, to honor this numbering of physical CPU numbers. It seems to me that you require some sort of abstract binding mechanism to ensure a favorable CPU numbering.

5) The shell script binding using 'runon' looks interesting, and I can imagine that most of the simple binding API can and should be exposed as runon options. I can also imagine Python (and Perl) modules that implement the simple API, and (a heretical thought) coding the runon command in Python using this module.

6) You state that "Launch policies are not inherited" (with emphasis on the 'not'). At first reading, this confuses me. Are you saying, at least in part, that if:

   1. Process P1 (operating under policy x1) sets a launch policy x2
   2. Process P1 forks P2
   3. Process P2 modifies its own policy (rebinds) to policy x3
   4. Process P2 forks P3

   then P3 starts with a policy of x3 (its parent P2's policy, since P2 didn't setlaunch any other launch policy), _not_ with a policy of x2 (which would be the inherited launch policy, if such were inherited)? If so -- good. A minor note: usually when I find myself writing something with "not" italicized, I end up later rewriting that sentence in a more direct manner.

7) You state:

      ... the binding in effect for the process or thread that
      executes the first page fault on a given page of memory
      determines the binding for that page ... see the rationale

   I don't see any "rationale". And I'd think it would be more stable to have the creator of the memory region determine its memory policy, not the first faulter.

Once again - many thanks.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
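To pin down the fork scenario in point 6, a toy model of the non-inheritance semantics being asked about can help. Everything here -- the struct, the policy-as-integer encoding, and the fork_child() helper -- is an illustrative assumption, not the draft API:

    #include <stdio.h>

    /* Each process has a current policy and an optional launch policy.
     * On fork, the child's current policy is the parent's launch policy
     * if one was set, otherwise the parent's current policy; the child's
     * own launch policy starts out unset (i.e., NOT inherited). */
    struct proc { int current; int launch; };   /* launch 0 == unset */

    static struct proc fork_child(struct proc parent)
    {
        struct proc child;
        child.current = parent.launch ? parent.launch : parent.current;
        child.launch  = 0;
        return child;
    }

    int main(void)
    {
        struct proc p1 = { 1, 0 };        /* P1 runs under policy x1  */
        p1.launch = 2;                    /* P1 sets launch policy x2 */
        struct proc p2 = fork_child(p1);  /* P2 starts under x2       */
        p2.current = 3;                   /* P2 rebinds itself to x3  */
        struct proc p3 = fork_child(p2);  /* P2 set no launch policy  */
        printf("P3 starts under x%d\n", p3.current);  /* x3, not x2   */
        return 0;
    }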
From: Niels C. <nc...@us...> - 2002-02-12 21:29:49
> > The advantage of prohibiting interleaving of CPUs from different
> > nodes is that it allows NUMA-aware algorithms to operate efficiently
> > in relative ignorance of the NUMA topology. This in turn allows the
> > algorithms to be much simpler, as they do not need to reconfigure
> > themselves based on topological details.
>
> If code simplicity is the object of this numbering, and the leaving of
> holes in the numbering, is it worth specifying that the number of cpus
> (present or holes left) per node should be the same (possibly always
> a power of 2)?

Actually, I was wondering here as well, but one thing that's been consistently pointed out is that we really want persistent naming and numbering, for many reasons, among which is the ability to save setups, scripts, etc. over boots. It doesn't matter too much to me how we achieve persistence as long as we do. This is one way.

> > Re: 64 proc limitation ...
>
> Could this not be avoided simply by passing a pointer to a bitmap
> of size defined by a #define? That would avoid changing the API later.
> Or at least just typedef something opaque to a long for now.

Sounds good to me, but maybe this solution has been suggested before and rejected for some good reason we are not aware of?

> > Niels:
> > 1. Since there is a behavior argument to the bindtomemblk(),
> >    setlaunch() and bindmemory() calls, I would like to know
> >    why one is not needed for the bindtocpu() call? Would it
> >    not be useful to be able to utilize idle processors
> >    rather than have several processes compete for a subset
> >    of processors?
>
> Whilst I think the general point is interesting, I think we need to
> be slightly careful when defining idle processors. I was discussing
> something similar for a NUMA scheduler yesterday; a processor
> that's idle in that particular instance may be the busiest cpu on
> the system - we need something like a load average over the last
> second (or whatever interval).

I certainly agree. Please don't assume that by "idle" I mean idle in this fraction of a second. I did say "I'm thinking of allowing for a NOT_STRICT option, possibly along with others, that says whether and under which circumstances a process may disregard the (preferred) binding and where it may go." That would include a parameter saying for how long, and how much, a processor must have been idle, along with a limit on the expense we will tolerate for moving the task.

---------------

> Addition: it would be useful to have a method to map from a device
> to a node. I have something like this for PCI devices in NUMA-Q;
> it would be useful to have (in kernel) access for PCI devices, and
> the resource reservation stuff (e.g. IO ports), to their node IDs.
> Not quite sure what the best way to implement this is - for PCI,
> they can find their own bus, and I just provide a "bus to node"
> map - I think this should be generically available across arches
> (for instance, if I want my process located close to a particular
> ethernet card).

Oh boy, you are so right; I could kick myself for not pointing this out in Paul Dorwin's topology design and Pat Mochel's driverfs. It simply slipped my mind. I have wanted to point it out for a long time but something else always gets in the way. I think this API design should have the ability to determine the affinity of devices as well, as in "don't move to a different node if that will increase the distance to a device you are using". Pat's driverfs design would certainly benefit from carrying distance information as well. Old hat, but the place to carry it is in the link structures.

-nc-
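The device-to-node idea quoted above is easy to picture as a small lookup table. The sketch below is only illustrative -- the table contents, array size, and function name are assumptions; on a real machine the map would be filled in from firmware at boot:

    #include <stdio.h>

    #define MAX_PCI_BUS 8

    /* Per-machine "bus to node" map, as described for NUMA-Q above. */
    static const int bus_to_node[MAX_PCI_BUS] = {
        0, 0,          /* buses 0-1 hang off node 0 */
        1, 1,          /* buses 2-3 hang off node 1 */
        2, 2, 3, 3,
    };

    /* Given a device's PCI bus, return the node it is attached to, so
     * a process can be placed close to, say, a particular ethernet card. */
    static int pcibus_to_node(int bus)
    {
        if (bus < 0 || bus >= MAX_PCI_BUS)
            return -1;   /* unknown bus */
        return bus_to_node[bus];
    }

    int main(void)
    {
        printf("NIC on bus 2 -> node %d\n", pcibus_to_node(2));
        return 0;
    }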
From: Michael H. <hoh...@us...> - 2002-02-12 22:32:13
On Mon, 11 Feb 2002, Paul Jackson wrote:

> 1) Two details of the API still concern me -- the placeholder
>    numamap argument and the 64-bit limit on the number of cpus
>    (without at least an API change such as you say Russ Wright
>    is considering). Neither placeholders nor such potential API
>    changes stand the test of time, in my experience.

Martin Bligh had a couple of suggestions for how to deal with the cpu_t. I'll probably take his suggestion of using a cpu_t and defining that as a long. Then when a real cpu_t is defined it should be a trivial transition.

> How about instead (being controversial ... brainstorming):
>   a] build the simple binding only on top of CpuMemSets,
>   b] with the simple binding _only_ using application numbering,
>      not system numbering,
>   c] drop the numamap placeholder argument, and
>   d] accept forever afterward that simple binding only manages
>      up to 64 (or 32) cpus, memory blocks, and nodes.
>
> The last item, [d], doesn't keep you from running on a
> larger system -- it just keeps a given application from using
> the simple binding to manage a larger set of cpus.

Well, that is always one approach. For now, I prefer to keep this separate from CpuMemSets.

> 2) You observe that there is no node number mapping, so all
>    node numbers are physical. Do you think that there should
>    be a node number mapping?

We thought about node mapping for a few minutes and then realized that the point of the API was to provide a means for applications to place their resources on specific nodes. Mapping the nodes makes this problematic, and we could see no benefit.

> 3) Typo: under restrictmemblk(... memblk, ... numamap) you wrote:
>
>       If the memblk bitmask adds CPUs and the user is not root ...
>
>    where I suspect you intended:
>
>       If the memblk bitmask adds blocks and the user is not root ...

Yes, typo on my part.

> 4) You state that getcpu() does the same thing as cmsGetCpu().
>    If CpuMemSets is _not_ present, then I have a slightly subtle
>    concern here. Your simple binding makes certain promises to
>    its user that all CPUs within one node will have numbers in
>    a range that does not overlap with the CPU numbers of any
>    other node. However, I am not aware of any promise by the
>    kernel, across all architectures, to honor this numbering of
>    physical CPU numbers. It seems to me that you require some
>    sort of abstract binding mechanism to ensure a favorable
>    CPU numbering.

Actually, this numbering scheme is a requirement being placed on the kernel to support NUMA. This was put out last year with no loud objections, and is explained in more detail in the rationale.

> 5) The shell script binding using 'runon' looks interesting,
>    and I can imagine that most of the simple binding API can
>    and should be exposed as runon options. I can also imagine
>    Python (and Perl) modules that implement the simple API, and
>    (a heretical thought) coding the runon command in Python
>    using this module.
>
> 6) You state that "Launch policies are not inherited" (with
>    emphasis on the 'not'). At first reading, this confuses me.
>    Are you saying, at least in part, that if:
>      1. Process P1 (operating under policy x1) sets a launch policy x2
>      2. Process P1 forks P2
>      3. Process P2 modifies its own policy (rebinds) to policy x3
>      4. Process P2 forks P3
>    then P3 starts with a policy of x3 (its parent P2's policy, since
>    P2 didn't setlaunch any other launch policy), _not_ with a policy
>    of x2 (which would be the inherited launch policy, if such were
>    inherited)? If so -- good. A minor note: usually when I find
>    myself writing something with "not" italicized, I end up later
>    rewriting that sentence in a more direct manner.

Your understanding is correct, and also spells out quite clearly why launch policies are not inherited.

> 7) You state:
>
>       ... the binding in effect for the process or thread that
>       executes the first page fault on a given page of memory
>       determines the binding for that page ... see the rationale
>
>    I don't see any "rationale". And I'd think it would be
>    more stable to have the creator of the memory region determine
>    its memory policy, not the first faulter.

The "rationale" is a link. Click on it and it will take you to a separate document that has, amongst other things, a section titled "Conflicting Memory Bindings". It also has quite a bit of discussion and examples of the processor/node numbering scheme.

Michael Hohnbaum
hoh...@us...
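The interim cpu_t mentioned above (a long today, a real type later) might look like the sketch below; the typedef name comes from the discussion, while the stub call exists only so the example runs:

    #include <stdio.h>

    /* Opaque-for-now cpu_t: code written against the typedef keeps
     * compiling when a real cpu_t eventually replaces the long. */
    typedef unsigned long cpu_t;

    static int bindtocpu(cpu_t cpumask)   /* stub standing in for the API */
    {
        printf("would bind to cpus 0x%lx\n", (unsigned long)cpumask);
        return 0;
    }

    int main(void)
    {
        cpu_t mask = (1UL << 0) | (1UL << 2);   /* cpus 0 and 2 */
        return bindtocpu(mask);
    }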
From: Paul J. <pj...@en...> - 2002-02-13 08:35:23
Michael Hohnbaum wrote:
|> Martin Bligh had a couple of suggestions for how to deal with
|> the cpu_t. I'll probably take his suggestion of using a cpu_t
|> and defining that as a long. Then when a real cpu_t is defined
|> it should be a trivial transition.

There's more to compatibility than source compatibility at the point of the call. In particular, it is essential that the nearby application code that manipulates the cpu_t be independent of the size of cpu_t. And it is also good if the binary API across the kernel boundary is stable, though in Linux this is not considered to be essential. This means to me that nothing in the published interface makes any mention of the size of cpu_t.

Bite the bullet up front, with routines such as the FD_SET et al. routines in the select interface, or the sigaddset et al. routines in the POSIX sigsetops interface. Or, if you need sorted sets, not just unordered sets, then something like the (admittedly more cumbersome) cpumemsets interface for passing ordered sets of cpus and mems. Or have this simple binding work over abstractly bound cpu/mem numbers and forever be constrained to 32 or 64 (bits in a word) cpus and mems. But don't ask applications to write to one API, then change it in any way more than trivially cosmetic.

|> Actually, this numbering scheme is a requirement being placed
|> on the kernel to support NUMA. This was put out last year
|> with no loud objections, and is explained in more detail in
|> the rationale.

Ok - reasonable enough.

|> > 6) You state that "Launch policies are not inherited" (with
|> >    emphasis on the 'not'). ...
|>
|> Your understanding is correct, and also spells out quite clearly
|> why launch policies are not inherited.

Ok - then this is a bug (er, eh, missing feature) in cpumemsets, because the *child* cpumemset (essentially the holder of the launch policy) is inherited. That is, both the *current* and *child* cpumemsets are initialized on fork from the parent's *child* settings, and subsequent changes by the forked process do _not_ affect its already established *child* settings.

So cpumemsets must support, at least as an option, the ability to support these non-inherited launch policy semantics, such that the *child* policy is not used unless something like a setlaunch() request activates it, and until then, the *current* policy applies to everything, both the current process and any launched children.

Question in this case: which policy, x4 or x5, would apply to process P3 in the following scenario:

> 1. Process P1 (operating under policy x1) sets a launch policy x2
> 2. Process P1 forks P2
> 3. Process P2 modifies its own policy (rebinds) to policy x3
> 4. Process P2 sets its launch policy to policy x4
> 5. Process P2 again modifies its own policy, this time to x5
> 6. Process P2 forks P3.

===

The placeholder numamap argument still bothers me -- I don't understand it, don't know how to use it, and in general suspect that placeholder arguments are the cannon fodder of future reality.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
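For reference, FD_SET-style routines for cpus might look like the sketch below: the set size is hidden behind the type, so application code never depends on the number of bits in a word. All names and sizes here are assumptions modeled on the select() interface, not an actual proposal:

    #include <stdio.h>
    #include <string.h>

    #define CPUSET_SIZE 1024                   /* max cpus per set */
    #define NBITS (8 * sizeof(unsigned long))  /* bits per word    */

    typedef struct {
        unsigned long bits[CPUSET_SIZE / NBITS];
    } cpuset_t;

    static void cpuset_zero(cpuset_t *s) { memset(s, 0, sizeof *s); }
    static void cpuset_set(int c, cpuset_t *s)
    {
        s->bits[c / NBITS] |= 1UL << (c % NBITS);
    }
    static int cpuset_isset(int c, const cpuset_t *s)
    {
        return (s->bits[c / NBITS] >> (c % NBITS)) & 1;
    }

    int main(void)
    {
        cpuset_t set;
        cpuset_zero(&set);
        cpuset_set(100, &set);   /* same source on a 32- or 128-cpu box */
        printf("cpu 100 set: %d\n", cpuset_isset(100, &set));
        return 0;
    }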
From: Paul J. <pj...@en...> - 2002-02-15 02:07:55
Michael wrote:
|> > 7) You state:
|> >
|> >       ... the binding in effect for the process or thread that
|> >       executes the first page fault on a given page of memory
|> >       determines the binding for that page ... see the rationale
|> >
|> >    I don't see any "rationale". And I'd think it would be
|> >    more stable to have the creator of the memory region determine
|> >    its memory policy, not the first faulter.
|>
|> The "rationale" is a link. Click on it and it will take you to
|> a separate document that has, amongst other things, a section
|> titled "Conflicting Memory Bindings". It also has quite a bit
|> of discussion and examples of the processor/node numbering
|> scheme.

Ok - now that I have read this rationale, my question still stands. You list two possible approaches therein to specifying the memory policy on a shared region:

1. a global policy, where the last binding wins
2. a local policy, where the first page fault wins.

Then you recommend choice 2. I see a third possible choice:

3. a per-vm-area policy that can differ per faulting cpu

This results in each vm area linking to a (usually shared) structure (list of struct of list) that specifies, for any fault on any given cpu, for that given vm area, which nodes to search for memory, in what order. This is what cpumemsets has. It determines the policy deterministically up front, rather than based on some random accident such as the location of the first fault.

I'd recommend choice 3 (big surprise).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
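Choice 3 can be pictured as follows. The structures and field names below are loose assumptions based on the cpumemsets description, not actual kernel code; the point is only that the node search order is fixed per (vm area, faulting cpu) pair up front:

    #include <stdio.h>

    #define MAX_CPUS  4
    #define MAX_NODES 4

    struct mem_policy {
        /* Per faulting cpu: nodes in preferred order, -1 terminated. */
        int search_order[MAX_CPUS][MAX_NODES];
    };

    struct vm_area {
        unsigned long start, end;
        struct mem_policy *policy;   /* usually shared between areas */
    };

    /* On a fault, the search order depends only on the area's policy and
     * the faulting cpu -- not on which process happened to fault first. */
    static const int *fault_search_order(const struct vm_area *vma, int cpu)
    {
        return vma->policy->search_order[cpu];
    }

    int main(void)
    {
        static struct mem_policy pol = { .search_order = {
            [0] = { 0, 1, -1 },   /* faults on cpu 0: node 0, then 1 */
            [1] = { 1, 0, -1 },   /* faults on cpu 1: node 1, then 0 */
        }};
        struct vm_area vma = { 0x1000, 0x2000, &pol };
        printf("cpu 1 first tries node %d\n",
               fault_search_order(&vma, 1)[0]);
        return 0;
    }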
From: Michael H. <hoh...@us...> - 2002-02-12 22:46:04
On Mon, 11 Feb 2002, Niels Christiansen wrote:

> Michael Hohnbaum wrote:
> > A new version of the simple binding API is now available at:
> > http://lse.sourceforge.net/numa/numa_api.html
> > This is an update to the original proposal by Paul McKenney.
> > All comments are welcomed.
>
> That was easy, Michael. Looks good to me. I have a few
> questions and suggestions, though:
>
> 1. Since there is a behavior argument to the bindtomemblk(),
>    setlaunch() and bindmemory() calls, I would like to know
>    why one is not needed for the bindtocpu() call? Would it
>    not be useful to be able to utilize idle processors
>    rather than have several processes compete for a subset
>    of processors? I'm thinking of allowing for a NOT_STRICT
>    option, possibly along with others, that says whether and
>    under which circumstances a process may disregard the
>    (preferred) binding and where it may go.

I tossed this idea around some, and from a simple binding perspective, rejected it. However, there is merit to it, possibly as an extension, so adding a behavior argument to the API seems reasonable. For now the only behavior supported will be STRICT, but this leaves room for additional behaviors to be specified.

> 2. What about a function to, say, bind a process to a set of
>    processors on any one node and memory on any one (other?)
>    node? Something along the lines of:
>
>        int bindcpumem(int cpunode, u_int cpuset,
>                       int memnode, u_int memset, u_int behavior);
>
>    where cpuset is a bitmask for the processors on a node
>    regardless of the global cpu numbering and memset is the
>    same for memory. This would permit a "job" to be moved as
>    a unit of work from one node to another. By giving -1
>    arguments to cpunode and memnode you could say that you
>    don't really care which node, as long as the process group
>    executes on processors of the same node.

First, it is not clear to me exactly what you are trying to accomplish with this function. However, I believe that some of this can be accomplished with a combination of the already defined functions. I'll also point out that the simple binding API is establishing a set of basic functions that can be combined to create various combinations such as this. CpuMemSets provides a different level of control over the resources and might be a closer fit for providing this capability.

> 3. I notice the absence of "distance". As far as I can tell,
>    the coupling to nodes is mostly another way of saying
>    distance. With all the discussion about distance in Paul
>    Dorwin's topology design, maybe the binding API could be
>    enhanced by using distance. For example, my suggested
>    bindcpumem() function could then be something like:
>
>        int binddist(int node, int max_cpu_dist, int max_mem_dist,
>                     u_int behavior);
>
>    where a node argument of -1 means any node.

Again, this can be accomplished by making use of the simple binding functions. Further, distance is going to have different meanings based upon the architecture of the machine. Until some common understanding is reached as to what distance is, it does not make sense to define higher-level APIs based upon it.

> 4. The API functional specification is one thing. What about
>    the changes to data structures in the kernel, scheduler,
>    system calls, etc.? Is that something you have prototyped?

We are working on this. Some of the underlying pieces are available, others are under development. This API is not implementing much mechanism, but rather is exposing the mechanisms such that they can be manipulated by user-level code. The API will be implemented as underlying support mechanisms are made available.

> 5. Would we not need a migrate() function, or is that beyond
>    the scope of the API?

I see process migration as being a ways away. However, it can be supported in this API by either an implicit action (e.g., bindtomemblk() could cause migration of pages already allocated to the process to the specified memory blocks), or explicit (e.g., add MPOL_MIGRATE to the behavior flags), or a combination of the two (implicit but with an MPOL_NOMIGRATE flag).

Michael Hohnbaum
hoh...@us...
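The behavior argument being agreed to above might end up looking something like this sketch. STRICT, MPOL_MIGRATE, and MPOL_NOMIGRATE are names used in this thread, while the values, the combined flag word, and the stub body are assumptions:

    #include <errno.h>
    #include <stdio.h>

    #define MPOL_STRICT     0x01   /* only behavior supported for now    */
    #define MPOL_NOT_STRICT 0x02   /* reserved: binding as preference    */
    #define MPOL_MIGRATE    0x04   /* reserved: also move current pages  */
    #define MPOL_NOMIGRATE  0x08   /* reserved: never move current pages */

    static int bindtocpu(unsigned long cpumask, unsigned int behavior)
    {
        if (behavior != MPOL_STRICT)
            return -EINVAL;        /* room left for future behaviors */
        printf("strict bind to cpus 0x%lx\n", cpumask);
        return 0;
    }

    int main(void)
    {
        return bindtocpu(0x3UL, MPOL_STRICT);
    }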
From: Niels C. <nc...@us...> - 2002-02-12 23:27:00
Michael Hohnbaum wrote:

> > 2. What about a function to, say, bind a process to a set of
> >    processors on any one node and memory on any one (other?)
> >    node? ...
>
> First, it is not clear to me exactly what you are trying to
> accomplish with this function. However, I believe that some of
> this can be accomplished with a combination of the already
> defined functions. I'll also point out that the simple binding
> API is establishing a set of basic functions that can be
> combined to create various combinations such as this.
> CpuMemSets provides a different level of control over the
> resources and might be a closer fit for providing this
> capability.

What I'm trying to accomplish is, as I said, the ability to "move a job as a unit of work from one node to another." I suggested it because I am a bit put off by the insistence on assigning work to specific processors. This is an attempt to define ways of saying "hey system, here's a set of workloads which I would appreciate if you run so none are split over nodes; feel free to put them where there's room but don't waste time accessing remote memory".

I believe that is the way a system administrator would want to run his system, rather than assigning work to specific processors on specific nodes. You could argue that the same could be done by using cpumemsets, but that would add a level of complexity, which is bound to annoy your otherwise friendly system administrator. And it still would require the mapping of real processors to logical ones and be less flexible than bindcpumem().

> > 3. I notice the absence of "distance". As far as I can tell,
> >    the coupling to nodes is mostly another way of saying
> >    distance. ...
>
> Again, this can be accomplished by making use of the simple
> binding functions. Further, distance is going to have different
> meanings based upon the architecture of the machine. Until
> some common understanding is reached as to what distance is,
> it does not make sense to define higher-level APIs based upon it.

I don't see how, Michael. Not with the kind of flexibility the use of distance would allow. Again, think of who is going to use the system management tools that use your API. It may be a lot of fun to work with actual processor numbers (whether physical or logical) and actual node numbers, but it is not a way to run an efficient shop.

I don't see why it matters much that "distance" has different meanings (I assume you mean different penalties) on different platforms. And if there is no common understanding of what distance is, call it penalty. The fact remains that somebody or something has to consider the penalty of distance before using your API. How else can the system administrator decide how and where to place work? So why not incorporate the concept into the API? Besides, we are trying to define distance in the Topology API, so we're working on it.

> I see process migration as being a ways away. However, it can
> be supported in this API by either an implicit action (e.g.,
> bindtomemblk() could cause migration of pages already allocated
> to the process to the specified memory blocks), or explicit (e.g.,
> add MPOL_MIGRATE to the behavior flags), or a combination of
> the two (implicit but with an MPOL_NOMIGRATE flag).

Migrate would go very well with binddist() or bindcpumem(). In fact, I suggested the two with migration in mind. As long as you keep an opening for a future migrate() function, I'm happy. I'm not happy with adding implied functionality to other API calls or being forced to use a series of the other API calls when I need to migrate.

-nc-
From: Paul J. <pj...@en...> - 2002-02-14 22:22:27
On Tue, 12 Feb 2002, Niels Christiansen wrote:

> What I'm trying to accomplish is, as I said, the ability to
> "move a job as a unit of work from one node to another." I
> suggested it because I am a bit put off by the insistence on
> assigning work to specific processors.

We need and will have several APIs, some layered on others.

> > [Michael wrote] ... Until
> > some common understanding is reached as to what distance is,
> > it does not make sense to define higher-level APIs based upon it.

Note, Niels, what Michael said here -- he is assuming we will define higher-level APIs, when the time is ripe.

> ... it may be a lot of fun to work with actual processor numbers
> (whether physical or logical) and actual node numbers but it
> is not a way to run an efficient shop.

Then don't use the "actual processor number" API ... wrong tool for the job. Let's not try turning a hand saw into a chain saw, or a claw hammer into a jack hammer.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
From: Michael H. <hoh...@us...> - 2002-02-13 00:26:23
On Tue, 12 Feb 2002, Niels Christiansen wrote:

> > First, it is not clear to me exactly what you are trying to
> > accomplish with this function. ...
>
> What I'm trying to accomplish is, as I said, the ability to
> "move a job as a unit of work from one node to another." I
> suggested it because I am a bit put off by the insistence on
> assigning work to specific processors. This is an attempt to
> define ways of saying "hey system, here's a set of workloads
> which I would appreciate if you run so none are split over
> nodes; feel free to put them where there's room but don't
> waste time accessing remote memory".

I see two things being requested here:

1. A mechanism to keep a process using memory local to the node it is executing on; and
2. A mechanism to associate a group of processes to keep them on the same node.

The first is memory allocation policy. This is the default policy for allocating memory for a process on NUMA systems. If nothing else is done to change this, memory is allocated from the local node. The capability that is needed to go along with this is the logic to always dispatch the process on the same node. Once again, this should be the default on a NUMA system. An API is not needed to provide this capability - it is what has been referred to as "default actions".

The second is something that Paul Jackson and I have been discussing: process association. That is currently outside the scope of the simple binding API, but is on my list of things to get to.

> I believe that is the way a system administrator would want
> to run his system, rather than assigning work to specific
> processors on specific nodes. You could argue that the same
> could be done by using cpumemsets, but that would add a level
> of complexity, which is bound to annoy your otherwise friendly
> system administrator. And it still would require the mapping
> of real processors to logical ones and be less flexible than
> bindcpumem().
>
> > > 3. I notice the absence of "distance". ... maybe the binding
> > >    API could be enhanced by using distance. For example, my
> > >    suggested bindcpumem() function could then be something like:
> > >
> > >        int binddist(int node, int max_cpu_dist, int max_mem_dist,
> > >                     u_int behavior);
> > >
> > >    where a node argument of -1 means any node.
> >
> > Again, this can be accomplished by making use of the simple
> > binding functions. ...
>
> I don't see how, Michael. Not with the kind of flexibility the
> use of distance would allow. Again, think of who is going to
> use the system management tools that use your API. It may be
> a lot of fun to work with actual processor numbers (whether
> physical or logical) and actual node numbers, but it is not a
> way to run an efficient shop.

Assuming the topology API can provide distance, at user level distance can be used to determine the processors/memblks to bind to, and then the simple binding API functions invoked.

> I don't see why it matters much that "distance" has different
> meanings (I assume you mean different penalties) on different
> platforms. And if there is no common understanding of what
> distance is, call it penalty. The fact remains that somebody
> or something has to consider the penalty of distance before
> using your API. How else can the system administrator decide
> how and where to place work? So why not incorporate the
> concept into the API?

Currently, I've seen people define distance as the number of hops, the latency, the bandwidth, an ACPI number scaled such that 10 is on the same node, and combinations of these. So using distance as a metric, without some agreement as to what this metric is, is problematic. Using the ACPI concept that "10" is on the same node would clash, for instance, with distance being hops: a binddist(10) call would have vastly different results.

I have some thoughts on how to handle distance, but, again, it is not part of the simple binding API. The simple binding API is, to quote the document, "intended to be a 'simple' API which provides rudimentary binding of processes to processors and/or memory blocks." It was initially written as the result of several meetings of people interested in NUMA work from across the Linux community. This revision is an attempt to bring it more into line with some of the thoughts and concepts that have been discussed since then.

> Besides, we are trying to define distance in the Topology API,
> so we're working on it.
>
> Migrate would go very well with binddist() or bindcpumem().
> In fact, I suggested the two with migration in mind. As long
> as you keep an opening for a future migrate() function, I'm
> happy. I'm not happy with adding implied functionality to
> other API calls or being forced to use a series of the other
> API calls when I need to migrate.

The "implied functionality" can (and should) be made explicit. Currently, it is undefined what happens to existing memory for a process when the process binds to a memory block. Ideally, any existing memory allocated to the process should migrate to the new memory binding. However, until process migration is supported, the best we can do is put any future memory allocations on the bound memory block. My preference, for now, is stating clearly that this is undefined, but that if an application does not want memory that is already allocated to migrate, it should use an MPOL_NOMIGRATE flag.

However, why do you see a need for a process to decide it should migrate? And if it does need to migrate, what is the issue with using a series of function calls to accomplish this? If an application wants to read a record from a file, it must issue a series of calls to accomplish that. As long as the pieces that are necessary are available and can be assembled, why isn't this adequate?

Michael Hohnbaum
hoh...@us...
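A concrete illustration of the metric clash Michael describes above: the same four-node machine, described once with ACPI SLIT-style numbers (local access scaled to 10) and once with hop counts (local access 0). Both tables are made-up examples:

    #include <stdio.h>

    #define NODES 4

    /* ACPI SLIT style: relative latency, local access scaled to 10. */
    static const int slit[NODES][NODES] = {
        { 10, 20, 20, 30 },
        { 20, 10, 30, 20 },
        { 20, 30, 10, 20 },
        { 30, 20, 20, 10 },
    };

    /* Hop-count style: interconnect crossings, local access is 0. */
    static const int hops[NODES][NODES] = {
        { 0, 1, 1, 2 },
        { 1, 0, 2, 1 },
        { 1, 2, 0, 1 },
        { 2, 1, 1, 0 },
    };

    int main(void)
    {
        /* A binddist(node, max_dist = 10, ...) call would mean "local
         * memory only" under SLIT numbers, but "anything within 10
         * hops" -- the entire machine -- under hop counts. */
        printf("SLIT: node 0 -> node 3 distance %d\n", slit[0][3]);
        printf("hops: node 0 -> node 3 distance %d\n", hops[0][3]);
        return 0;
    }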
From: Paul J. <pj...@en...> - 2002-02-12 04:32:41
Niels (who beat me to the response by 12 minutes ;) wrote:
|> 4. The API functional specification is one thing. What about
|>    the changes to data structures in the kernel, scheduler,
|>    system calls, etc.? Is that something you have prototyped?

I'm hoping that these changes have been prototyped, under the rubric of "CpuMemSets" <grin>.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373