From: Luck, T. <ton...@in...> - 2001-12-13 16:37:35
Michael Hohnbaum wrote:
> 3. In the section "Processors, Memory and Distance", fourth paragraph,
> there is mention that on IA64 systems a distance metric will be scaled
> such that the closest <processor, memory> pair will be at a distance
> of 10.  Where is this metric being maintained, and what is using it?

There is an extension to ACPI to provide node-node distance information
that uses this factor of ten scaling.  The spec for this can be found at:

http://devresource.hp.com/devresource/Docs/TechPapers/IA64/slit.pdf

The discontig patch for the IA-64 kernel (discontig.sourceforge.net)
includes code to parse this table, but does not yet make use of it.

-Tony Luck
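[For readers following along: the table Tony mentions (the SLIT, System
Locality Information Table) boils down to an N x N byte matrix of
relative distances, with the local distance normalized to 10.  A minimal
sketch of how such a matrix is indexed once parsed - the helper and the
two-node matrix below are made up for illustration, not code from the
discontig patch:]

#include <stdio.h>

#define LOCAL_DISTANCE 10   /* closest <processor, memory> pair */

/* Relative distance from node 'from' to node 'to', given the N x N
 * byte matrix of a SLIT-style table (local distance normalized to 10). */
static unsigned char node_distance(const unsigned char *matrix,
                                   unsigned long n,
                                   unsigned long from, unsigned long to)
{
    return matrix[from * n + to];
}

int main(void)
{
    /* Made-up two-node topology: remote access costs twice local. */
    const unsigned char slit[] = { 10, 20,
                                   20, 10 };

    printf("node 0 -> node 0: %u (local)\n",  node_distance(slit, 2, 0, 0));
    printf("node 0 -> node 1: %u (remote)\n", node_distance(slit, 2, 0, 1));
    return 0;
}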
From: Michael H. <hb...@us...> - 2001-12-18 23:10:35
Paul,

Thanks for your thorough response.  I've commented on it below.  This
is turning into another long response - readers beware.

Michael

On Monday, December 17, 2001, Paul Jackson wrote:
> On Wed, 12 Dec 2001, Michael Hohnbaum wrote:
>
> > 1. In the section "Using CpuMemSets" the idea is presented that initially
> > the kernel CpuMemSet will contain all elements (processors and memory
> > blocks) from within the system and no attempt is made to sort this.  It
> > is anticipated that a user-level application will be invoked from init
> > that will order the kernel's CpuMemSet, and potentially readjust those
> > of other processes that have already been created by this time.
> >
> > While this approach would establish CpuMemSets for future allocations,
> > I am concerned that kernel initialization might make non-optimal choices
> > for locating important data structures.  I would suggest providing
> > an architecture specific call early in the boot process to allow the
> > ordering of the kernel's CpuMemSet.
>
> I agree with your concern that the initial kernel CpuMemSet
> needs to be sensible, as it affects critical kernel data
> structure setup.  I have come to the same conclusion myself
> over the last couple of weeks, thanks in part to suggestions
> from several commentators.
>
> We are coding this initialization to be more useful to our
> particular SGI hardware, and we will add an action item to
> our work queue to provide the hook for architecture specific
> init routines.

This sounds good to me.

> > 2. Also in the section "Using CpuMemSets" is the suggestion that legacy
> > APIs be supported on top of CpuMemSets and that applications would
> > use these.  That is ok for migration purposes, but it seems to me
> > that applications are going to want to have a single API to use on
> > Linux.  Is there any effort to establish a default API for applications
> > to use?
>
> While the original motivation for CpuMemSets was to port
> legacy code from our proprietary Unix system, certainly
> much of the feedback, and if this work succeeds, most of
> the ultimate use, will likely be more direct use of
> CpuMemSets, rather than via some legacy API that provides
> similar services using a wrapper around CpuMemSets.
>
> I take the following action items from this:
>
> 1) Reword the Design Notes slightly, to account for the
> shifting motivations, as described above.
>
> 2) Provide such documentation as man page and tutorial
> suitable for more widespread usage.
>
> As for whether there is an effort to establish this or any other
> API as the default for applications to use for processor and
> memory placement, I hope to continue spending a little of my
> time encouraging the use of CpuMemSets for such uses.  Obviously
> the next step is to get the patch out.  Soon, soon, ...
> I welcome the support and encouragement of others.

It was my understanding that you considered CpuMemSets to be a bit too
low-level an API for applications to write to, and envisioned a more
robust API layered on top of it.  That point is made clear in item 4
of the "Implementation Layers" section.  If you are now proposing
CpuMemSets as the user-level API, then I need to reevaluate it from
that perspective.

> > 3. In the section "Processors, Memory and Distance", fourth paragraph,
> > there is mention that on IA64 systems a distance metric will be scaled
> > such that the closest <processor, memory> pair will be at a distance
> > of 10.  Where is this metric being maintained, and what is using it?
>
> As Tony Luck already replied, this is (quoting) an extension
> to ACPI to provide node-node distance information that uses
> this factor of ten scaling.  The spec for this can be found at:
>
> http://devresource.hp.com/devresource/Docs/TechPapers/IA64/slit.pdf

This is fine.  However, I'm still left wondering from the discussion
of distance in the CpuMemSet design where you see this distance metric
being maintained within the kernel, how it will be used, and how it
will interact with CpuMemSets.

> > 4. In the section "Restricting CPU and memory for exclusive use" there
> > is reference to setting the policy flags "CMS_DROPMASTER|CMS_SHARE".
> > However, one of these flags is defined as a map policy and the other
> > is defined as a set policy, and they both have the same value.  It
> > is not clear to me why one would need to set CMS_SHARE in the case
> > being described, and if so how that would be specified.
>
> You are quite right about this confusion.  I noticed it too
> a couple of weeks ago, and have already dropped the "|CMS_SHARE"
> term in my master copy of the Design Notes.

Ok.  Thanks for clearing up that confusion.

> > 5. On a related note, if the CMS_DROPMASTER is specified in a CpuMemMap
> > policy, is this only going to be set when invoked by user-level
> > code in the CpuMemMap, and cleared by the kernel before the CpuMemMap
> > goes into effect?
>
> I agree that I've been unclear (even to myself) on some of
> this Master apparatus.
>
> Let me try again to describe this:
>
> If you have some portion of your system (some CPUs
> and memory) that you wish to dedicate to a particular
> application, then first set up the granddaddy process for
> this application to be Master of the intended CPUs and
> memory, using cmsSetCMM(CMS_MASTER, ...)
>
> This will cause the kernel to mark a single bit of state
> for each CPU and memory (using system numbering) in the
> Map passed into this call, making it a Master.  So long
> as a CPU or memory remains a Master, no other Map may be
> set up including it.
>
> Then if other apps were previously using the intended CPUs
> or memory, chase them off, using appropriate kill, cmsSetCMM
> or cmsBulkRemap calls.
>
> The processes descending from the granddaddy process, that
> share the same Map, will then be the exclusive users of the
> specified CPUs and memory blocks.
>
> To cause some particular set of CPUs and/or memory blocks
> that are currently marked as Master to no longer be so
> marked:
>
> Construct a Map containing these CPUs and memory (this
> Map can contain other, non-Master items harmlessly).
> Apply that map to something, anything, with a cmsSetCMM
> call passing a Map with the flag CMS_DROPMASTER set.
> The Master status for any system CPU or memory listed
> in that map will be dropped by the kernel.
>
> Observe that, in this use of cmsSetCMM(), one is
> not really constructing a map that one necessarily wants
> to run on, but rather just (ab)using the passed in map to
> list the CPU and memory block system numbers that should
> have their Master status cleared (dropped) in the kernel.
>
> The process or vm area to which you applied this cmsSetCMM
> with flag CMS_DROPMASTER _will_ be running on this new map,
> but that might just be a not very useful side effect of the
> cmsSetCMM call.
>
> Thus, to throw away _all_ the Masters in the system,
> just construct a CMS_DROPMASTER Map with all the CPUs
> and memory blocks listed, and apply it to something.
>
> Of course, you have to be root (or have an appropriate
> capability) to do these operations.
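[To make the sequence just described concrete, here is a sketch.
cmsSetCMM, CMS_MASTER, CMS_DROPMASTER and cmsBulkRemap are the names
used in this thread, but the map structure, the flag values, the call
signature and the stub body below are all invented so the example
compiles; treat it as a shape, not as the real interface:]

#include <stdio.h>

#define CMS_MASTER      0x01  /* mark listed CPUs/memory as Master    */
#define CMS_DROPMASTER  0x02  /* clear Master status on listed items  */
#define MAXOBJ          64

/* Stand-in for the Map: lists of system CPU and memory block numbers. */
struct cms_map {
    int cpus[MAXOBJ], ncpus;
    int mems[MAXOBJ], nmems;
};

/* Stub standing in for the real call; it only reports what it was
 * asked to do. */
static int cmsSetCMM(int flags, const struct cms_map *m)
{
    printf("cmsSetCMM(flags=%#x): %d cpus, %d memory blocks\n",
           flags, m->ncpus, m->nmems);
    return 0;
}

int main(void)
{
    /* Step 1: the granddaddy process claims CPUs 2-3 and memory block
     * 1 as Master; no other Map may now be set up including them. */
    struct cms_map dedicate = { .cpus = { 2, 3 }, .ncpus = 2,
                                .mems = { 1 },    .nmems = 1 };
    cmsSetCMM(CMS_MASTER, &dedicate);

    /* Step 2 (not shown): chase other users off with kill, cmsSetCMM
     * or cmsBulkRemap; descendants sharing this Map then have the
     * listed CPUs and memory blocks to themselves. */

    /* Later: drop _all_ Masters by listing every CPU and memory block
     * and applying the map with CMS_DROPMASTER.  Note the (ab)use
     * described above: the caller ends up running on this map as a
     * not-very-useful side effect. */
    struct cms_map drop = { .ncpus = 0, .nmems = 0 };
    for (int c = 0; c < 4; c++)
        drop.cpus[drop.ncpus++] = c;
    for (int b = 0; b < 2; b++)
        drop.mems[drop.nmems++] = b;
    cmsSetCMM(CMS_DROPMASTER, &drop);
    return 0;
}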
Might it be better to have a cmsSetCMM() call that specifies
CMS_DROPMASTER do nothing but drop the master status from any
resource named, but not apply this artificial set to a process
or vmarea?

> > Also, the comment before the definition of CMS_MASTER and CMS_DROPMASTER
> > implies that these values may be OR'd together.  Is that correct?
> > If so, why would one do that?
>
> No, it would not be a good idea to OR these two options together,
> because the intent wouldn't be clear.  I should note that this
> combination is not allowed, and probably add an error case to
> the kernel to return with errno == EINVAL if both are set.

That is what I thought was most likely the case.

> > 6. The virtualization provided by CpuMemMap forces both processor and
> > memory block numbering to begin at 0 and continue upwards with no
> > holes in the range.  While CpuMemSets allows mapping of application
> > numbers that are non-contiguous, the CpuMemMap scheme forces the
> > potential numbers used to be within a set range.  Thus, for example,
> > on an 8 processor system, there is no way for an application to have
> > a virtual processor number of 20.
> >
> > Providing an "unused" value would allow populating a CpuMemMap
> > processor or memory array such that other potential number combinations
> > are possible.  While I don't have any good examples of why one would
> > want to do that, it seems like it might provide additional flexibility
> > at low cost.
>
> Part of this can be done now, because while the "unused" value isn't
> available, duplications are.  Your 8 processor system could have a Map
> that mapped the first 30 virtual (application) CPUs all to real
> (system) CPU 2, say.  Then application processor 20 would work just
> fine, as would any other application processor in the range 0 to 29.
>
> Adding an "unused" system CPU number (0xffff, no doubt) would
> complete this.  I've taken to calling this CPU the "village
> idiot".  I guess the corresponding Memory Block would be an
> Alzheimer's victim ... but this is getting a tad macabre.
>
> I expect to add these two values to the design and implementation.

This is what I was suggesting.  However, isn't 0xffff the value used
for CMS_DEFAULT_CPU?  Might want to try 0xfffe for unused.  While the
context used should differentiate, a different value might avoid
confusion.

> ===
>
> On Mon, 17 Dec 2001, Michael Hohnbaum wrote:
>
> > ...The CpuMemSet Design does not give a clear understanding as to
> > what the bulk remap actually does.
>
> The bulk remap operation changes the system CPU and memory
> block numbers that are in the affected maps.  You pass in a
> list of substitutions to be made.
>
> For example, you could ask that each appearance of system CPU
> 7 in the affected maps be changed to CPU 5.  If you did that
> example to all the Maps in the system (using CMS_BULK_ALL)
> then immediately nothing would be scheduled on CPU 7 anymore.
> Bit 7 in cpus_allowed would be cleared in all tasks, and any
> cpus_allowed that had bit 7 on would get bit 5 set instead.
>
> The CPU substitutions to be made are passed in as a pair of equal
> length lists, one with the old system CPU numbers, the other list
> with the corresponding new numbers.  The memory block substitutions
> are passed in with another such pair of equal length lists.
>
> Is that clearer?  Or is something still confusing?
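[A similar sketch of the bulk remap interface as described: substitutions
passed as pairs of equal-length old/new lists.  Again, cmsBulkRemap and
CMS_BULK_ALL are named in the thread; the signature, flag value and stub
body are assumptions made for the sake of a compilable example:]

#include <stdio.h>

#define CMS_BULK_ALL 0x01   /* apply substitutions to every Map */

/* Stub standing in for the real call.  The real operation would
 * rewrite system numbers in the affected maps; for the substitution
 * in main(), bit 7 would be cleared in every task's cpus_allowed and
 * bit 5 set in any cpus_allowed that had bit 7 on. */
static int cmsBulkRemap(int flags,
                        const int *oldcpus, const int *newcpus, int ncpu,
                        const int *oldmems, const int *newmems, int nmem)
{
    for (int i = 0; i < ncpu; i++)
        printf("CPU %d -> %d in all affected maps\n", oldcpus[i], newcpus[i]);
    for (int i = 0; i < nmem; i++)
        printf("mem %d -> %d in all affected maps\n", oldmems[i], newmems[i]);
    (void)flags;
    return 0;
}

int main(void)
{
    /* Paul's example: every appearance of system CPU 7 becomes CPU 5,
     * in every Map in the system; nothing schedules on CPU 7 after. */
    int oldcpu[] = { 7 }, newcpu[] = { 5 };
    return cmsBulkRemap(CMS_BULK_ALL, oldcpu, newcpu, 1, NULL, NULL, 0);
}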
> > What is needed for process migration is a mechanism that will cause
> > currently allocated resources used by a process to be relocated onto
> > the newly assigned resource set.  For processors, this should be
> > fairly straightforward as the scheduler should (conceptually, at
> > least) just need to dispatch the process on a processor in the new
> > CpuMemSet.  However, does the bulk remap call provide for moving
> > the currently allocated memory for a process to a different memory
> > block?
>
> No, no support for migrating existing memory is included in
> the current CpuMemSets design.  The focus of CpuMemSets has
> been on static directives that affect dynamic scheduling and
> allocation decisions of existing mechanisms, not on providing
> new dynamic system mechanisms.
>
> So CpuMemSets is not sufficient to provide process migration,
> as you (reasonably enough) describe it.  Hopefully it might
> be a useful participant in a process migration solution.
>
> How might we get there?

I see process migration as separate from CpuMemSet, but a necessary
tool to support the capabilities enabled by CpuMemSet in a dynamic
environment.  Process migration is needed to provide a NUMA API with
similar capabilities to other OS's.  If CpuMemSets is envisioned to
provide the NUMA API for Linux, then it would make sense to
incorporate process migration under the same general API.  However,
if CpuMemSet is only seen to be one component of a larger NUMA API,
then process migration can be done separately.

> > A follow-on issue to this is how to associate a group of processes
> > such that if one moves to a different node all processes in the group
> > move.  The process could move to a different node in its current
> > CpuMemSet, thus through no explicit request from the user; or could
> > move due to a change in its CpuMemSet/map.  In the latter case, I
> > believe the bulk remap capability could be used.  The former case
> > appears to be unsupported.  I can envision a means of supporting this,
> > but it would require some changes in CpuMemSets related to sharing
> > of CpuMemSets, ownership, and modification rights.  Alternatively,
> > a linkage through the task structure could work, although resolving
> > conflicts due to non-intersecting CpuMemSets could be tricky.  Do
> > you have any thoughts on this?
>
> hmmm ... mumble ... mumble ...  I didn't grok:
>
> The process could move to a different node in its current
> CpuMemSet, thus through no explicit request from the user

How about an example.  If a process is assigned to a CpuMemSet that
contains resources on two separate NUMA nodes, the scheduler is free
to dispatch the process on processors running on either node.
However, for optimal performance, an attempt is made to allocate
resources for the process on one node only.  If that node becomes
overloaded, the scheduler is free to dispatch the process on a
processor on the other node (identified in the CpuMemSet).  Thus,
through no explicit request from the user, the process moves to a
different node.

> Besides process migration, the other capability that folks
> think of when they consider CpuMemSets, that isn't supported,
> is such grouping.
>
> This is a bigger can of worms than I've succeeded in getting
> my head around (ugh - another macabre metaphor ...).

I agree that grouping has the potential to be a quagmire.  Probably
wise to get an initial version of CpuMemSets working before attacking
grouping.  However, it is likely to be useful, so should not
disappear.
> In the case of process migration, I'm not likely to be an active
> player, except in so far as it involves CpuMemSets.  It's not
> something that is high on SGI's list, nor mine personally.
>
> In the case of grouping, I'd like to do more.  But first I'd
> better get our patch out for what we have so far.
>
> I won't rest till it's the best ...
> Manager, Linux Scalability
> Paul Jackson <pj...@sg...> 1.650.933.1373

Michael Hohnbaum
hoh...@us...
From: Paul J. <pj...@en...> - 2001-12-19 19:36:54
Executive summary - Michael said it best:

> I agree that grouping has the potential to be a quagmire.
> Probably wise to get an initial version of CpuMemSets working
> before attacking grouping.  However, it is likely to be
> useful, so should not disappear.

==

This and other stuff, in more detail ...

On Tue, 18 Dec 2001, Michael Hohnbaum (responding to pj):

> > We are coding this initialization to be more useful to our
> > particular SGI hardware, and we will add an action item to
> > our work queue to provide the hook for architecture specific
> > init routines.
>
> This sounds good to me.

good.

> It was my understanding that you considered CpuMemSets to be a bit too
> low-level an API for applications to write to, and envisioned a more
> robust API layered on top of it.  That point is made clear in item 4
> of the "Implementation Layers" section.  If you are now proposing
> CpuMemSets as the user-level API, then I need to reevaluate it from
> that perspective.

I'm fumbling here a bit to find the right "spin".

I think it's this -- the CpuMemSet API was driven by the needs of other
API's that will be layered on top of it.  I was not striving for an API
that some application programmer, expert in a discipline unrelated to
any of this, would find intuitive and easy.  Rather I was striving for
an API, and kernel mechanism, that porters of legacy API's would find
powerful, general, and easy to bend into whatever semantic model they
had to support.

If others want to use this API directly, that's fine.  And if I have
spare energy, and their requests happen to dovetail with what I was
doing anyway, then I'll do what I can to satisfy them.  And this is
Open Source -- if others pick up the ball and run further with it in
some direction, and obtain sufficient general approval and interest
for that work, then more power to them.

The primary requirement is that we need one agreed upon and accepted
kernel mechanism for static CPU and memory placement.  That means that
this one mechanism had better be sufficiently general purpose to meet a
(more-or-less) full range of needs, preferably with a minimum of policy
sensitive code in the kernel - just general purpose mechanism.

If there is enough hue and cry for a more "user" (app programmer, or
sys adm or whomever) friendly interface, then that might be worth doing
in its own right, hopefully leveraging the CpuMemSets mechanisms.  For
some audiences, at least on my SGI side, such "user" friendly API's
already exist, and we will be supporting them (in some cases as Open
Source) on top of CpuMemSets.

> This is fine.  However, I'm still left wondering from the discussion
> of distance in the CpuMemSet design where you see this distance metric
> being maintained within the kernel, how it will be used, and how it
> will interact with CpuMemSets.

Yes - work remains to be done here.  The implementation and initial
simple uses of CpuMemSets don't depend on this, but more powerful uses
will.  I have no intentions of "overhanging the market" here ... if
others get to it first, I hope to contribute to their success and to
seeing to it that whatever we do meets the needs I am aware of.

> Might it be better to have a cmsSetCMM() call that specifies
> CMS_DROPMASTER do nothing but drop the master status from any
> resource named, but not apply this artificial set to a process
> or vmarea?

Yeah - as you saw later in my next email, that was where I went next.
Then, as you recommended (but I hadn't read it yet), I decided to table
this topic for this week, to work on getting out a patch with what we
had otherwise.  I will return to this in January.

> This is what I was suggesting.  However, isn't 0xffff the value
> used for CMS_DEFAULT_CPU?  Might want to try 0xfffe for unused.
> While the context used should differentiate, a different value
> might avoid confusion.

Good suggestion - thanks.

> I see process migration as separate from CpuMemSet, but a necessary
> tool to support the capabilities enabled by CpuMemSet in a dynamic
> environment.  Process migration is needed to provide a NUMA API
> with similar capabilities to other OS's.  If CpuMemSets is envisioned
> to provide the NUMA API for Linux, then it would make sense to
> incorporate process migration under the same general API.  However,
> if CpuMemSet is only seen to be one component of a larger NUMA API,
> then process migration can be done separately.

I'll do what I can, given other priorities, to help us arrive at a
pleasing and useful result.  I don't have enough insight into the
process migration needs to predict yet the appropriate degree of API
separation.

> > > A follow-on issue to this is how to associate a group of processes
> > > such that if one moves to a different node all processes in the group
> > > move. ...  I can envision a means of supporting this,
> > > but it would require some changes in CpuMemSets related to sharing
> > > of CpuMemSets, ownership, and modification rights. ...

Ah - the point being that the kernel may have other actions that it
wants to take at a "group" level, beyond simply managing the static
allocation of CPUs and memory.  It might want to be aware of such
grouping in its more dynamic actions, such as migration, as well.

And as we're both suspecting by now, this presumes the kernel has some
way to identify and authenticate these groups.

Key question - are these groups of users (tasks and vm areas), or are
they groups of resources (CPUs and memory blocks)?  I tend toward the
latter.

One possibility - a special file can be created that represents some
set of CPUs and memory, and you can include those resources in your
cpumemmap if you can obtain an open file descriptor on that special
file.  The grouping of users becomes implicit - by what resources they
have access to.

Then I'd expect higher level API's to focus more on providing the
grouping of users, leveraging the lower level facilities that support
the allocation and sharing of resources.

I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
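[A sketch of what that "special file" possibility might look like from
user space.  To be clear, nothing below exists: the device path, the
attach call and its semantics are all invented here to illustrate how
ordinary file permissions could gate membership in a resource group:]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Invented call: merge the CPUs and memory blocks represented by an
 * open special file into the calling process's cpumemmap.  Stubbed so
 * the sketch compiles. */
static int cmsAttachResourceGroup(int fd)
{
    printf("would merge resources behind fd %d into our map\n", fd);
    return 0;
}

int main(void)
{
    /* Access control falls out of ordinary file permissions: if we
     * can open the group's special file, we may use its resources. */
    int fd = open("/dev/cpumemgrp/batchpool", O_RDONLY);  /* invented path */
    if (fd < 0) {
        perror("open");   /* not in the group: permission denied */
        return 1;
    }
    cmsAttachResourceGroup(fd);
    close(fd);
    return 0;
}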