From: Luck, T. <ton...@in...> - 2001-12-13 16:37:35
Michael Hohnbaum wrote:
> 3. In the section "Processors, Memory and Distance", fourth paragraph,
> there is mention that on IA64 systems a distance metric will be scaled
> such that the closest <processor, memory> pair will be at a distance
> of 10.  Where is this metric being maintained, and what is using it?

There is an extension to ACPI to provide node-node distance information
that uses this factor of ten scaling.  The spec for this can be found at:

http://devresource.hp.com/devresource/Docs/TechPapers/IA64/slit.pdf

The discontig patch for the IA-64 kernel (discontig.sourceforge.net)
includes code to parse this table, but does not yet make use of it.

-Tony Luck
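[For readers following along: the table Tony mentions (the SLIT, System
Locality Information Table) boils down to an N x N byte matrix of
relative distances, with the local distance normalized to 10.  A minimal
sketch of how such a matrix is indexed once parsed - the helper and the
two-node matrix below are made up for illustration, not code from the
discontig patch:]

#include <stdio.h>

#define LOCAL_DISTANCE 10   /* closest <processor, memory> pair */

/* Relative distance from node 'from' to node 'to', given the N x N
 * byte matrix of a SLIT-style table (local distance normalized to 10). */
static unsigned char node_distance(const unsigned char *matrix,
                                   unsigned long n,
                                   unsigned long from, unsigned long to)
{
    return matrix[from * n + to];
}

int main(void)
{
    /* Made-up two-node topology: remote access costs twice local. */
    const unsigned char slit[] = { 10, 20,
                                   20, 10 };

    printf("node 0 -> node 0: %u (local)\n",  node_distance(slit, 2, 0, 0));
    printf("node 0 -> node 1: %u (remote)\n", node_distance(slit, 2, 0, 1));
    return 0;
}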
From: Michael H. <hb...@us...> - 2001-12-18 23:10:35
Paul,

Thanks for your thorough response.  I've commented on it below.  This
is turning into another long response - readers beware.

Michael

On Monday, December 17, 2001, Paul Jackson wrote:
> On Wed, 12 Dec 2001, Michael Hohnbaum wrote:
>
> > 1. In the section "Using CpuMemSets" the idea is presented that initially
> > the kernel CpuMemSet will contain all elements (processors and memory
> > blocks) from within the system and no attempt is made to sort this.  It
> > is anticipated that a user-level application will be invoked from init
> > that will order the kernel's CpuMemSet, and potentially readjust those
> > of other processes that have already been created by this time.
> >
> > While this approach would establish CpuMemSets for future allocations,
> > I am concerned that kernel initialization might make non-optimal choices
> > for locating important data structures.  I would suggest providing
> > an architecture specific call early in the boot process to allow the
> > ordering of the kernel's CpuMemSet.
>
> I agree with your concern that the initial kernel CpuMemSet
> needs to be sensible, as it affects critical kernel data
> structure setup.  I have come to the same conclusion myself
> over the last couple of weeks, thanks in part to suggestions
> from several commentators.
>
> We are coding this initialization to be more useful to our
> particular SGI hardware, and we will add an action item to
> our work queue to provide the hook for architecture specific
> init routines.

This sounds good to me.

> > 2. Also in the section "Using CpuMemSets" is the suggestion that legacy
> > APIs be supported on top of CpuMemSets and that applications would
> > use these.  That is ok for migration purposes, but it seems to me
> > that applications are going to want to have a single API to use on
> > Linux.  Is there any effort to establish a default API for applications
> > to use?
>
> While the original motivation for CpuMemSets was to port
> legacy code from our proprietary Unix system, certainly
> much of the feedback, and if this work succeeds, most of
> the ultimate use, will likely be more direct use of
> CpuMemSets, rather than via some legacy API that provides
> similar services using a wrapper around CpuMemSets.
>
> I take the following action items from this:
>
> 1) Reword the Design Notes slightly, to account for the
> shifting motivations, as described above.
>
> 2) Provide such documentation as man page and tutorial
> suitable for more widespread usage.
>
> As for whether there is an effort to establish this or any other
> API as the default for applications to use for processor and
> memory placement, I hope to continue spending a little of my
> time encouraging the use of CpuMemSets for such uses.  Obviously
> the next step is to get the patch out.  Soon, soon, ...
> I welcome the support and encouragement of others.

It was my understanding that you considered CpuMemSets to be a bit too
low-level an API for applications to write to, and envisioned a more
robust API layered on top of it.  That point is made clear in item 4
of the "Implementation Layers" section.  If you are now proposing
CpuMemSets as the user-level API, then I need to reevaluate it from
that perspective.

> > 3. In the section "Processors, Memory and Distance", fourth paragraph,
> > there is mention that on IA64 systems a distance metric will be scaled
> > such that the closest <processor, memory> pair will be at a distance
> > of 10.  Where is this metric being maintained, and what is using it?
>
> As Tony Luck already replied, this is (quoting) an extension
> to ACPI to provide node-node distance information that uses
> this factor of ten scaling.  The spec for this can be found at:
>
> http://devresource.hp.com/devresource/Docs/TechPapers/IA64/slit.pdf

This is fine.  However, I'm still left wondering from the discussion
of distance in the CpuMemSet design where you see this distance metric
being maintained within the kernel, how it will be used, and how it
will interact with CpuMemSets.

> > 4. In the section "Restricting CPU and memory for exclusive use" there
> > is reference to setting the policy flags "CMS_DROPMASTER|CMS_SHARE".
> > However, one of these flags is defined as a map policy and the other
> > is defined as a set policy, and they both have the same value.  It
> > is not clear to me why one would need to set CMS_SHARE in the case
> > being described, and if so how that would be specified.
>
> You are quite right about this confusion.  I noticed it too
> a couple of weeks ago, and have already dropped the "|CMS_SHARE"
> term in my master copy of the Design Notes.

Ok.  Thanks for clearing up that confusion.

> > 5. On a related note, if the CMS_DROPMASTER is specified in a CpuMemMap
> > policy, is this only going to be set when invoked by user-level
> > code in the CpuMemMap, and cleared by the kernel before the CpuMemMap
> > goes into effect?
>
> I agree that I've been unclear (even to myself) on some of
> this Master apparatus.
>
> Let me try again to describe this:
>
> If you have some portion of your system (some CPUs
> and memory) that you wish to dedicate to a particular
> application, then first set up the granddaddy process for
> this application to be Master of the intended CPUs and
> memory, using cmsSetCMM(CMS_MASTER, ...)
>
> This will cause the kernel to mark a single bit of state
> for each CPU and memory (using system numbering) in the
> Map passed into this call, making it a Master.  So long
> as a CPU or memory remains a Master, no other Map may be
> set up including it.
>
> Then if other apps were previously using the intended CPUs
> or memory, chase them off, using appropriate kill, cmsSetCMM
> or cmsBulkRemap calls.
>
> The processes descending from the granddaddy process, that
> share the same Map, will then be the exclusive users of the
> specified CPUs and memory blocks.
>
> To cause some particular set of CPUs and/or memory blocks
> that are currently marked as Master to no longer be so
> marked:
>
> Construct a Map containing these CPUs and memory (this
> Map can contain other, non-Master items harmlessly).
> Apply that map to something, anything, with a cmsSetCMM
> call passing a Map with the flag CMS_DROPMASTER set.
> The Master status for any system CPU or memory listed
> in that map will be dropped by the kernel.
>
> Observe that, in this use of cmsSetCMM(), one is
> not really constructing a map that one necessarily wants
> to run on, but rather just (ab)using the passed in map to
> list the CPU and memory block system numbers that should
> have their Master status cleared (dropped) in the kernel.
>
> The process or vm area to which you applied this cmsSetCMM
> with flag CMS_DROPMASTER _will_ be running on this new map,
> but that might just be a not very useful side effect of the
> cmsSetCMM call.
>
> Thus, to throw away _all_ the Masters in the system,
> just construct a CMS_DROPMASTER Map with all the CPUs
> and memory blocks listed, and apply it to something.
>
> Of course, you have to be root (or have an appropriate
> capability) to do these operations.
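[To make the sequence just described concrete, here is a sketch.
cmsSetCMM, CMS_MASTER, CMS_DROPMASTER and cmsBulkRemap are the names
used in this thread, but the map structure, the flag values, the call
signature and the stub body below are all invented so the example
compiles; treat it as a shape, not as the real interface:]

#include <stdio.h>

#define CMS_MASTER      0x01  /* mark listed CPUs/memory as Master    */
#define CMS_DROPMASTER  0x02  /* clear Master status on listed items  */
#define MAXOBJ          64

/* Stand-in for the Map: lists of system CPU and memory block numbers. */
struct cms_map {
    int cpus[MAXOBJ], ncpus;
    int mems[MAXOBJ], nmems;
};

/* Stub standing in for the real call; it only reports what it was
 * asked to do. */
static int cmsSetCMM(int flags, const struct cms_map *m)
{
    printf("cmsSetCMM(flags=%#x): %d cpus, %d memory blocks\n",
           flags, m->ncpus, m->nmems);
    return 0;
}

int main(void)
{
    /* Step 1: the granddaddy process claims CPUs 2-3 and memory block
     * 1 as Master; no other Map may now be set up including them. */
    struct cms_map dedicate = { .cpus = { 2, 3 }, .ncpus = 2,
                                .mems = { 1 },    .nmems = 1 };
    cmsSetCMM(CMS_MASTER, &dedicate);

    /* Step 2 (not shown): chase other users off with kill, cmsSetCMM
     * or cmsBulkRemap; descendants sharing this Map then have the
     * listed CPUs and memory blocks to themselves. */

    /* Later: drop _all_ Masters by listing every CPU and memory block
     * and applying the map with CMS_DROPMASTER.  Note the (ab)use
     * described above: the caller ends up running on this map as a
     * not-very-useful side effect. */
    struct cms_map drop = { .ncpus = 0, .nmems = 0 };
    for (int c = 0; c < 4; c++)
        drop.cpus[drop.ncpus++] = c;
    for (int b = 0; b < 2; b++)
        drop.mems[drop.nmems++] = b;
    cmsSetCMM(CMS_DROPMASTER, &drop);
    return 0;
}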
Might it be better to have a cmsSetCMM() call that specifies
CMS_DROPMASTER do nothing but drop the master status from any
resource named, but not apply this artificial set to a process
or vmarea?

> > Also, the comment before the definition of CMS_MASTER and CMS_DROPMASTER
> > implies that these values may be OR'd together.  Is that correct?
> > If so, why would one do that?
>
> No, it would not be a good idea to OR these two options together,
> because the intent wouldn't be clear.  I should note that this
> combination is not allowed, and probably add an error case to
> the kernel to return with errno == EINVAL if both are set.

That is what I thought was most likely the case.

> > 6. The virtualization provided by CpuMemMap forces both processor and
> > memory block numbering to begin at 0 and continue upwards with no
> > holes in the range.  While CpuMemSets allows mapping of application
> > numbers that are non-contiguous, the CpuMemMap scheme forces the
> > potential numbers used to be within a set range.  Thus, for example,
> > on an 8 processor system, there is no way for an application to have
> > a virtual processor number of 20.
> >
> > Providing an "unused" value would allow populating a CpuMemMap
> > processor or memory array such that other potential number combinations
> > are possible.  While I don't have any good examples of why one would
> > want to do that, it seems like it might provide additional flexibility
> > at low cost.
>
> Part of this can be done now, because while the "unused" value isn't
> available, duplications are.  Your 8 processor system could have a Map
> that mapped the first 30 virtual (application) CPUs all to real
> (system) CPU 2, say.  Then application processor 20 would work just
> fine, as would any other application processor in the range 0 to 29.
>
> Adding an "unused" system CPU number (0xffff, no doubt) would
> complete this.  I've taken to calling this CPU the "village
> idiot".  I guess the corresponding Memory Block would be an
> Alzheimer's victim ... but this is getting a tad macabre.
>
> I expect to add these two values to the design and implementation.

This is what I was suggesting.  However, isn't 0xffff the value used
for CMS_DEFAULT_CPU?  Might want to try 0xfffe for unused.  While the
context used should differentiate, a different value might avoid
confusion.

> ===
>
> On Mon, 17 Dec 2001, Michael Hohnbaum wrote:
>
> > ...The CpuMemSet Design does not give a clear understanding as to
> > what the bulk remap actually does.
>
> The bulk remap operation changes the system CPU and memory
> block numbers that are in the affected maps.  You pass in a
> list of substitutions to be made.
>
> For example, you could ask that each appearance of system CPU
> 7 in the affected maps be changed to CPU 5.  If you did that
> example to all the Maps in the system (using CMS_BULK_ALL)
> then immediately nothing would be scheduled on CPU 7 anymore.
> Bit 7 in cpus_allowed would be cleared in all tasks, and any
> cpus_allowed that had bit 7 on would get bit 5 set instead.
>
> The CPU substitutions to be made are passed in as a pair of equal
> length lists, one with the old system CPU numbers, the other list
> with the corresponding new numbers.  The memory block substitutions
> are passed in with another such pair of equal length lists.
>
> Is that clearer?  Or is something still confusing?
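[A similar sketch of the bulk remap interface as described: substitutions
passed as pairs of equal-length old/new lists.  Again, cmsBulkRemap and
CMS_BULK_ALL are named in the thread; the signature, flag value and stub
body are assumptions made for the sake of a compilable example:]

#include <stdio.h>

#define CMS_BULK_ALL 0x01   /* apply substitutions to every Map */

/* Stub standing in for the real call.  The real operation would
 * rewrite system numbers in the affected maps; for the substitution
 * in main(), bit 7 would be cleared in every task's cpus_allowed and
 * bit 5 set in any cpus_allowed that had bit 7 on. */
static int cmsBulkRemap(int flags,
                        const int *oldcpus, const int *newcpus, int ncpu,
                        const int *oldmems, const int *newmems, int nmem)
{
    for (int i = 0; i < ncpu; i++)
        printf("CPU %d -> %d in all affected maps\n", oldcpus[i], newcpus[i]);
    for (int i = 0; i < nmem; i++)
        printf("mem %d -> %d in all affected maps\n", oldmems[i], newmems[i]);
    (void)flags;
    return 0;
}

int main(void)
{
    /* Paul's example: every appearance of system CPU 7 becomes CPU 5,
     * in every Map in the system; nothing schedules on CPU 7 after. */
    int oldcpu[] = { 7 }, newcpu[] = { 5 };
    return cmsBulkRemap(CMS_BULK_ALL, oldcpu, newcpu, 1, NULL, NULL, 0);
}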
> > What is needed for process migration is a mechanism that will cause
> > currently allocated resources used by a process to be relocated onto
> > the newly assigned resource set.  For processors, this should be
> > fairly straightforward as the scheduler should (conceptually, at
> > least) just need to dispatch the process on a processor in the new
> > CpuMemSet.  However, does the bulk remap call provide for moving
> > the currently allocated memory for a process to a different memory
> > block?
>
> No, no support for migrating existing memory is included in
> the current CpuMemSets design.  The focus of CpuMemSets has
> been on static directives that affect dynamic scheduling and
> allocation decisions of existing mechanisms, not on providing
> new dynamic system mechanisms.
>
> So CpuMemSets is not sufficient to provide process migration,
> as you (reasonably enough) describe it.  Hopefully it might
> be a useful participant in a process migration solution.
>
> How might we get there?

I see process migration as separate from CpuMemSet, but a necessary
tool to support the capabilities enabled by CpuMemSet in a dynamic
environment.  Process migration is needed to provide a NUMA API with
similar capabilities to other OS's.  If CpuMemSets is envisioned to
provide the NUMA API for Linux, then it would make sense to
incorporate process migration under the same general API.  However,
if CpuMemSet is only seen to be one component of a larger NUMA API,
then process migration can be done separately.

> > A follow-on issue to this is how to associate a group of processes
> > such that if one moves to a different node all processes in the group
> > move.  The process could move to a different node in its current
> > CpuMemSet, thus through no explicit request from the user; or could
> > move due to a change in its CpuMemSet/map.  In the latter case, I
> > believe the bulk remap capability could be used.  The former case
> > appears to be unsupported.  I can envision a means of supporting this,
> > but it would require some changes in CpuMemSets related to sharing
> > of CpuMemSets, ownership, and modification rights.  Alternatively,
> > a linkage through the task structure could work, although resolving
> > conflicts due to non-intersecting CpuMemSets could be tricky.  Do
> > you have any thoughts on this?
>
> hmmm ... mumble ... mumble ...  I didn't grok:
>
> The process could move to a different node in its current
> CpuMemSet, thus through no explicit request from the user

How about an example.  If a process is assigned to a CpuMemSet that
contains resources on two separate NUMA nodes, the scheduler is free
to dispatch the process on processors running on either node.
However, for optimal performance, an attempt is made to allocate
resources for the process on one node only.  If that node becomes
overloaded, the scheduler is free to dispatch the process on a
processor on the other node (identified in the CpuMemSet).  Thus,
through no explicit request from the user, the process moves to a
different node.

> Besides process migration, the other capability that folks
> think of when they consider CpuMemSets, that isn't supported,
> is such grouping.
>
> This is a bigger can of worms than I've succeeded in getting
> my head around (ugh - another macabre metaphor ...).

I agree that grouping has the potential to be a quagmire.  Probably
wise to get an initial version of CpuMemSets working before attacking
grouping.  However, it is likely to be useful, so should not
disappear.
> In the case of process migration, I'm not likely to be an active
> player, except in so far as it involves CpuMemSets.  It's not
> something that is high on SGI's list, nor mine personally.
>
> In the case of grouping, I'd like to do more.  But first I'd
> better get our patch out for what we have so far.
>
> I won't rest till it's the best ...
> Manager, Linux Scalability
> Paul Jackson <pj...@sg...> 1.650.933.1373

Michael Hohnbaum
hoh...@us...
From: Paul J. <pj...@en...> - 2001-12-19 19:36:54
Executive summary - Michael said it best:

> I agree that grouping has the potential to be a quagmire.
> Probably wise to get an initial version of CpuMemSets working
> before attacking grouping.  However, it is likely to be
> useful, so should not disappear.

==

This and other stuff, in more detail ...

On Tue, 18 Dec 2001, Michael Hohnbaum (responding to pj):

> > We are coding this initialization to be more useful to our
> > particular SGI hardware, and we will add an action item to
> > our work queue to provide the hook for architecture specific
> > init routines.
>
> This sounds good to me.

good.

> It was my understanding that you considered CpuMemSets to be a bit too
> low-level an API for applications to write to, and envisioned a more
> robust API layered on top of it.  That point is made clear in item 4
> of the "Implementation Layers" section.  If you are now proposing
> CpuMemSets as the user-level API, then I need to reevaluate it from
> that perspective.

I'm fumbling here a bit to find the right "spin".

I think it's this -- the CpuMemSet API was driven by the needs of other
API's that will be layered on top of it.  I was not striving for an API
that some application programmer, expert in a discipline unrelated to
any of this, would find intuitive and easy.  Rather I was striving for
an API, and kernel mechanism, that porters of legacy API's would find
powerful, general, and easy to bend into whatever semantic model they
had to support.

If others want to use this API directly, that's fine.  And if I have
spare energy, and their requests happen to dovetail with what I was
doing anyway, then I'll do what I can to satisfy them.  And this is
Open Source -- if others pick up the ball and run further with it in
some direction, and obtain sufficient general approval and interest
for that work, then more power to them.

The primary requirement is that we need one agreed upon and accepted
kernel mechanism for static CPU and memory placement.  That means that
this one mechanism had better be sufficiently general purpose to meet a
(more-or-less) full range of needs, preferably with a minimum of policy
sensitive code in the kernel - just general purpose mechanism.

If there is enough hue and cry for a more "user" (app programmer, or
sys adm or whomever) friendly interface, then that might be worth doing
in its own right, hopefully leveraging the CpuMemSets mechanisms.  For
some audiences, at least on my SGI side, such "user" friendly API's
already exist, and we will be supporting them (in some cases as Open
Source) on top of CpuMemSets.

> This is fine.  However, I'm still left wondering from the discussion
> of distance in the CpuMemSet design where you see this distance metric
> being maintained within the kernel, how it will be used, and how it
> will interact with CpuMemSets.

Yes - work remains to be done here.  The implementation and initial
simple uses of CpuMemSets don't depend on this, but more powerful uses
will.  I have no intentions of "overhanging the market" here ... if
others get to it first, I hope to contribute to their success and to
seeing to it that whatever we do meets the needs I am aware of.

> Might it be better to have a cmsSetCMM() call that specifies
> CMS_DROPMASTER do nothing but drop the master status from any
> resource named, but not apply this artificial set to a process
> or vmarea?

Yeah - as you saw later in my next email, that was where I went next.
Then, as you recommended (but I hadn't read it yet), I decided to table
this topic for this week, to work on getting out a patch with what we
had otherwise.  I will return to this in January.

> This is what I was suggesting.  However, isn't 0xffff the value
> used for CMS_DEFAULT_CPU?  Might want to try 0xfffe for unused.
> While the context used should differentiate, a different value
> might avoid confusion.

Good suggestion - thanks.

> I see process migration as separate from CpuMemSet, but a necessary
> tool to support the capabilities enabled by CpuMemSet in a dynamic
> environment.  Process migration is needed to provide a NUMA API
> with similar capabilities to other OS's.  If CpuMemSets is envisioned
> to provide the NUMA API for Linux, then it would make sense to
> incorporate process migration under the same general API.  However,
> if CpuMemSet is only seen to be one component of a larger NUMA API,
> then process migration can be done separately.

I'll do what I can, given other priorities, to help us arrive at a
pleasing and useful result.  I don't have enough insight into the
process migration needs to predict yet the appropriate degree of API
separation.

> > > A follow-on issue to this is how to associate a group of processes
> > > such that if one moves to a different node all processes in the group
> > > move. ...  I can envision a means of supporting this,
> > > but it would require some changes in CpuMemSets related to sharing
> > > of CpuMemSets, ownership, and modification rights. ...

Ah - the point being that the kernel may have other actions that it
wants to take at a "group" level, beyond simply managing the static
allocation of CPUs and memory.  It might want to be aware of such
grouping in its more dynamic actions, such as migration, as well.

And as we're both suspecting by now, this presumes the kernel has some
way to identify and authenticate these groups.

Key question - are these groups of users (tasks and vm areas), or are
they groups of resources (CPUs and memory blocks)?  I tend toward the
latter.

One possibility - a special file can be created that represents some
set of CPUs and memory, and you can include those resources in your
cpumemmap if you can obtain an open file descriptor on that special
file.  The grouping of users becomes implicit - by what resources they
have access to.

Then I'd expect higher level API's to focus more on providing the
grouping of users, leveraging the lower level facilities that support
the allocation and sharing of resources.

I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
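[A sketch of what that "special file" possibility might look like from
user space.  To be clear, nothing below exists: the device path, the
attach call and its semantics are all invented here to illustrate how
ordinary file permissions could gate membership in a resource group:]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Invented call: merge the CPUs and memory blocks represented by an
 * open special file into the calling process's cpumemmap.  Stubbed so
 * the sketch compiles. */
static int cmsAttachResourceGroup(int fd)
{
    printf("would merge resources behind fd %d into our map\n", fd);
    return 0;
}

int main(void)
{
    /* Access control falls out of ordinary file permissions: if we
     * can open the group's special file, we may use its resources. */
    int fd = open("/dev/cpumemgrp/batchpool", O_RDONLY);  /* invented path */
    if (fd < 0) {
        perror("open");   /* not in the group: permission denied */
        return 1;
    }
    cmsAttachResourceGroup(fd);
    close(fd);
    return 0;
}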