On Wed, 12 Dec 2001, Michael Hohnbaum wrote:
> In getting back to mapping DYNIX/ptx APIs onto CpuMemSet I've generated
> a bit more feedback on the CpuMemSets design. Maybe you will find some
> of this useful.
Yes - very useful. Thank you.
> 1. In the section "Using CpuMemSets" the idea is presented that initially
> the kernel CpuMemSet will contain all elements (processors and memory
> blocks) from within the system and no attempt is made to sort this. It
> is anticipated that a user-level application will be invoked from init
> that will order the kernel's CpuMemSet, and potentially readjust those
> of other processes that have already been created by this time.
> While this approach would establish CpuMemSets for future allocations,
> I am concerned that kernel initialization might make non-optimal choices
> for locating important data structures. I would suggest providing
> an architecture specific call early in the boot process to allow the
> ordering of the kernel's CpuMemSet.
I agree with your concern that the initial kernel CpuMemSet
needs to be sensible, as it affects critical kernel data
structure setup. I have come to the same conclusion myself
over the last couple of weeks, thanks in part to suggestions
from several commentators.
We are coding this initialization to be more useful to our
particular SGI hardware, and we will add an action item to
our work queue to provide the hook for architecture specific
ordering of the initial kernel CpuMemSet early in boot.
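For concreteness, the sort of hook I have in mind might look
like this (the name, signature and struct are placeholders -
none of this is coded yet):

    struct cpumemset;   /* kernel's initial CpuMemSet; details elided */

    /* Weak default: keep the generic, unsorted ordering of the
     * initial kernel CpuMemSet.  An architecture (SGI SN, say)
     * would override this, early in boot, to sort CPUs and
     * memory blocks by distance before the important kernel
     * data structures get placed. */
    void __attribute__((weak))
    arch_order_kernel_cpumemset(struct cpumemset *cms)
    {
            /* generic kernels accept the default ordering */
    }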
> 2. Also in the section "Using CpuMemSets" is the suggestion that legacy
> APIs be supported on top of CpuMemSets and that applications would
> use these. That is ok for migration purposes, but it seems to me
> that applications are going to want to have a single API to use on
> Linux. Is there any effort to establish a default API for applications
> to use?
While the original motivation for CpuMemSets was to port
legacy code from our proprietary Unix system, much of the
feedback suggests that, if this work succeeds, most of the
ultimate use will be direct use of CpuMemSets, rather than
via some legacy API that wraps CpuMemSets to provide
similar services.
I take the following action items from this:
1) Reword the Design Notes slightly, to account for the
shifting motivations, as described above.
2) Provide such documentation as man page and tutorial
suitable for more widespread usage.
As for whether there is an effort to establish this or any other
API as the default for applications to use for processor and
memory placement, I hope to continue spending a little of my
time encouraging the use of CpuMemSets for such purposes. Obviously
the next step is to get the patch out. Soon, soon, ...
I welcome the support and encouragement of others.
> 3. In the section "Processors, Memory and Distance", fourth paragraph,
> there is mention that on IA64 systems a distance metric will be scaled
> such that the closest <processor, memory> pair will be at a distance
> of 10. Where is this metric being maintained, and what is using it?
As Tony Luck already replied, this is (quoting) an extension
to ACPI to provide node-node distance information that uses
this factor of ten scaling. The spec for this can be found at:
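To illustrate the scaling (my own example, not taken from the
spec): the local <processor, memory> distance is normalized to
10, so a remote node whose memory takes 1.7 times the local
latency to reach shows up in the table as 17:

    distance   node 0   node 1
    node 0         10       17
    node 1         17       10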
> 4. In the section "Restricting CPU and memory for exclusive use" there
> is reference to setting the policy flags "CMS_DROPMASTER|CMS_SHARE".
> However, one of these flags is defined as a map policy and the other
> is defined as a set policy, and they both have the same value. It
> is not clear to me why one would need to set CMS_SHARE in the case
> being described, and if so how that would be specified.
You are quite right about this confusion. I noticed it too
a couple of weeks ago, and have already dropped the "|CMS_SHARE"
term in my master copy of the Design Notes.
> 5. On a related note, if the CMS_DROPMASTER is specified in a CpuMemMap
> policy, is this only going to be set when invoked by user-level
> code in the CpuMemMap, and cleared by the kernel before the CpuMemMap
> goes into effect?
I agree that I've been unclear (even to myself) on some of
this Master apparatus.
Let me try again to describe this:
If you have some portion of your system (some CPUs
and memory) that you wish to dedicate to a particular
application, then first setup the granddaddy process for
this application to be Master of the intended CPUs and
memory, using cmsSetCMM (CMS_MASTER, ...)
This will cause the kernel to mark a single bit of state
for each CPU and memory (using system numbering) in the
Map passed into this call, making it a Master. So long
as a CPU or memory remains a Master, no other Map may be
set up that includes it.
Then if other apps were previously using the intended CPUs
or memory, chase them off, using appropriate kill, cmsSetCMM
or cmsBulkRemap calls.
The processes descending from the granddaddy process, that
share the same Map, will then be the exclusive users of the
specified CPUs and memory blocks.
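In code, the granddaddy setup might look roughly like the
following. The patch isn't out yet, so treat the struct layout
and the cmsSetCMM prototype as assumptions for illustration,
not as the final interface:

    #include <stdio.h>        /* perror */
    #include <sys/types.h>    /* pid_t */

    /* Assumed shape of a Map - the real structure may differ. */
    struct cpumemmap {
            int  ncpus;       /* length of cpus[] */
            int *cpus;        /* cpus[appcpu] == system CPU number */
            int  nmems;       /* length of mems[] */
            int *mems;        /* mems[appmem] == system memory block */
    };

    /* Assumed prototype: policy flags, target pid (0 == self), Map. */
    int cmsSetCMM(int flags, pid_t pid, struct cpumemmap *map);

    int syscpus[4] = { 4, 5, 6, 7 };  /* app CPUs 0-3 -> system 4-7 */
    int sysmems[2] = { 2, 3 };        /* app mems 0-1 -> system 2-3 */
    struct cpumemmap map = { 4, syscpus, 2, sysmems };

    /* Make the calling (granddaddy) process Master of these
     * CPUs and memory blocks. */
    if (cmsSetCMM(CMS_MASTER, 0, &map) < 0)
            perror("cmsSetCMM(CMS_MASTER)");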
To cause some particular set of CPUs and/or memory blocks
that are currently marked as Master to no longer be so:
Construct a Map containing these CPUs and memory (this
Map can contain other, non-Master items harmlessly).
Apply that map to something, anything, with a cmsSetCMM
call passing a Map with the flag CMS_DROPMASTER set.
The Master status for any system CPU or memory listed
in that map will be dropped by the kernel.
Observe that, in this use of cmsSetCMM(), one is
not really constructing a map that one necessarily wants
to run on, but rather just (ab)using the passed in map to
list the CPU and memory block system numbers that should
have their Master status cleared (dropped) in the kernel.
The process or vm area to which you applied this cmsSetCMM
with flag CMS_DROPMASTER _will_ be running on this new map,
but that might just be a not very useful side effect of the
call.
Thus, to throw away _all_ the Masters in the system,
just construct a CMS_DROPMASTER Map with all the CPUs
and memory blocks listed, and apply it to something.
Of course, you have to be root (or have an appropriate
capability) to do these operations.
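A matching sketch (same assumed shapes and prototype as above)
for clearing Master status:

    /* Name the system CPUs and memory blocks to un-Master;
     * extra non-Master entries in the lists are harmless. */
    int dropcpus[4] = { 4, 5, 6, 7 };
    int dropmems[2] = { 2, 3 };
    struct cpumemmap drop = { 4, dropcpus, 2, dropmems };

    /* Apply it to anything - here, to ourselves.  We are
     * (ab)using the Map just to list items, and will also end
     * up running on it as the side effect described above. */
    if (cmsSetCMM(CMS_DROPMASTER, 0, &drop) < 0)
            perror("cmsSetCMM(CMS_DROPMASTER)");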
> Also, the comment before the definition of CMS_MASTER and CMS_DROPMASTER
> implies that these values may be OR'd together. Is that correct?
> If so, why would one do that?
No, it would not be a good idea to OR these two options together,
because the intent wouldn't be clear. I should note in the
Design Notes that this combination is not allowed, and
probably add an error case to the kernel to return with
errno == EINVAL if both are set.
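That check, inside the kernel's handler for this call, would be
short - just a sketch of the proposed error case:

    /* Proposed sanity check - not yet in the patch. */
    if ((flags & CMS_MASTER) && (flags & CMS_DROPMASTER))
            return -EINVAL;   /* ambiguous intent: refuse both */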
> 6. The virtualization provided by CpuMemMap forces both processor and
> memory block numbering to begin at 0 and continue upwards with no
> holes in the range. While CpuMemSets allows mapping of application
> numbers that are non-contiguous, the CpuMemMap scheme forces the
> potential numbers used to be within a set range. Thus, for example,
> on an 8 processor system, there is no way for an application to have
> a virtual processor number of 20.
> Providing an "unused" value would allow populating a CpuMemMap
> processor or memory array such that other potential number combinations
> are possible. While I don't have any good examples of why one would
> want to do that, it seems like it might provide additional flexibility
> at low cost.
Part of this can be done now, because while the "unused" value isn't
available, duplications are. Your 8 processor system could have a Map
that mapped the first 30 virtual (application) CPUs all to real
(system) CPU 2, say. Then application processor 20 would work just
fine, as would any other application processor in the range 0 to 29.
Adding an "unused" system CPU number (0xffff, no doubt) would
complete this. I've taken to calling this CPU the "village
idiot". I guess the corresponding Memory Block would be an
Alzheimer's victim ... but this is getting a tad macabre.
I expect to add these two values to the design and implementation.
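With the same assumed Map shape as earlier in this note, the 8
processor example plus the proposed unused value would look
like this (CMS_CPU_UNUSED is a name I just made up; only the
0xffff value is in the plan):

    #define CMS_CPU_UNUSED 0xffff    /* the "village idiot" */

    int syscpus[30];
    int i;

    for (i = 0; i < 30; i++)
            syscpus[i] = 2;          /* app CPUs 0-29 -> system CPU 2 */
    syscpus[25] = CMS_CPU_UNUSED;    /* app CPU 25 deliberately unmapped,
                                        once "unused" is implemented */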
On Mon, 17 Dec 2001, Michael Hohnbaum wrote:
> ...The CpuMemSet Design does not give a clear understanding as to
> what the bulk remap actually does.
The bulk remap operation changes the system CPU and memory
block numbers that are in the affected maps. You pass in a
list of substitutions to be made.
For example, you could ask that each appearance of system CPU
7 in the affected maps be changed to CPU 5. If you did that
example to all the Maps in the system (using CMS_BULK_ALL)
then immediately nothing would be scheduled on CPU 7 anymore.
Bit 7 in cpus_allowed would be cleared in all tasks, and any
cpus_allowed that had bit 7 on would get bit 5 set instead.
The CPU substitutions to be made are passed in as a pair of equal
length lists, one with the old system CPU numbers, the other
with the corresponding new numbers. The memory block substitutions
are passed in with another such pair of equal length lists.
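In the assumed C interface, the CPU 7 to CPU 5 example would be
(the prototype is my guess at what the call will look like):

    /* Assumed prototype - pairs of equal length substitution lists. */
    int cmsBulkRemap(int scope,
                     int ncpus, int *oldcpus, int *newcpus,
                     int nmems, int *oldmems, int *newmems);

    int oldcpu = 7, newcpu = 5;

    /* Rewrite every Map in the system: each appearance of
     * system CPU 7 becomes system CPU 5; no memory block
     * substitutions. */
    if (cmsBulkRemap(CMS_BULK_ALL, 1, &oldcpu, &newcpu,
                     0, NULL, NULL) < 0)
            perror("cmsBulkRemap");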
Is that clearer? Or is something still confusing?
> What is needed for process migration is a mechanism that will cause
> currently allocated resources used by a process to be relocated onto
> the newly assigned resource set. For processors, this should be
> fairly straight forward as the scheduler should (conceptually, at
> least) just need to dispatch the process on a processor in the new
> CpuMemSet. However, does the bulk remap call provide for moving
> the currently allocated memory for a process to a different memory
> block?
No, no support for migrating existing memory is included in
the current CpuMemSets design. The focus of CpuMemSets has
been on static directives that affect dynamic scheduling and
allocation decisions of existing mechanisms, not on providing
new dynamic system mechanisms.
So CpuMemSets is not sufficient to provide process migration,
as you (reasonably enough) describe it. Hopefully it might
be a useful participant in a process migration solution.
How might we get there?
> A follow-on issue to this is how to associate a group of processes
> such that if one moves to a different node all processes in the group
> move. The process could move to a different node in its current
> CpuMemSet, thus through no explicit request from the user; or could
> move due to a change in its CpuMemSet/map. In the latter case, I
> believe the bulk remap capability could be used. The former case
> appears to be unsupported. I can envision a means of supporting this,
> but it would require some changes in CpuMemSets related to sharing
> of CpuMemSets, ownership, and modification rights. Alternatively,
> a linkage through the task structure could work, although resolving
> conflicts due to non-intersecting CpuMemSets could be tricky. Do
> you have any thoughts on this?
hmmm ... mumble ... mumble ... I didn't grok:
The process could move to a different node in its current
CpuMemSet, thus through no explicit request from the user
Besides process migration, the other capability that folks
think of when they consider CpuMemSets, but which isn't
supported, is such grouping.
This is a bigger can of worms than I've succeeded in getting
my head around (ugh - another macabre metaphor ...).
In the case of process migration, I'm not likely to be an active
player, except in so far as it involves CpuMemSets. It's not
something that is high on SGI's list, nor mine personally.
In the case of grouping, I'd like to do more. But first I'd
better get our patch out for what we have so far.
I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj@...> 1.650.933.1373