From: Michael H. <hoh...@us...> - 2002-02-13 00:26:23
On Tue, 12 Feb 2002, Niels Christiansen wrote:
>
> Michael Hohnbaum wrote:
> >
> > > 2. What with a function to, say, bind a process to a set of
> > > processors on any one node and memory on any one (other?)
> > > node?  Something along the lines of:
> > >
> > >     int bindcpumem(int cpunode, u_int cpuset,
> > >                    int memnode, u_int memset, u_int behavior);
> > >
> > > where a cpuset is a bitmask for the processors on a node
> > > regardless of the global cpu numbering, and memset is the
> > > same for memory.  This would permit a "job" to be moved as
> > > a unit of work from one node to another.  By giving -1
> > > arguments to cpunode and memnode you could say that you
> > > don't really care which node, as long as the process group
> > > executes on processors of the same node.
> >
> > First, it is not clear to me exactly what you are trying to
> > accomplish with this function.  However, I believe that some of
> > this can be accomplished with a combination of the already defined
> > functions.  I'll also point out that the simple binding API is
> > establishing a set of basic functions that can be combined to
> > create various combinations such as this.  CpuMemSets provides a
> > different level of control over the resources and might be a
> > closer fit for providing this capability.
>
> What I'm trying to accomplish is, as I said, the ability to
> "move a job as a unit of work from one node to another."  I
> suggested it because I am a bit put off by the insistence on
> assigning work to specific processors.  This is an attempt to
> define ways of saying "hey system, here's a set of workloads
> which I would appreciate if you run so none are split over
> nodes; feel free to put them where there's room, but don't
> waste time accessing remote memory".

I see two things being requested here:

1. A mechanism to keep a process using memory local to the node it
   is executing on; and
2. A mechanism to associate a group of processes to keep them on
   the same node.
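The key idea in the proposed bindcpumem() is that the cpuset argument is node-relative, hiding the global cpu numbering from the caller.  A minimal sketch of that mapping, where the function name, the node count, and the assumption of a uniform number of CPUs per node are all hypothetical and purely for illustration:

```c
#include <assert.h>

#define NODES 4
#define CPUS_PER_NODE 4   /* assumed uniform across nodes for this sketch */

/* Map a node-relative cpuset (bit i = i-th CPU on that node) to a
 * global CPU bitmask.  The caller never needs to know how CPUs are
 * numbered globally -- the point of the bindcpumem() proposal. */
unsigned int node_cpuset_to_global(int node, unsigned int cpuset)
{
    /* Mask off bits beyond this node's CPUs, then shift into the
     * global numbering range assumed for this node. */
    return (cpuset & ((1u << CPUS_PER_NODE) - 1)) << (node * CPUS_PER_NODE);
}
```

On a real machine the per-node CPU layout would come from the topology API rather than a fixed constant.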
The first is memory allocation policy.  This is the default policy
for allocating memory for a process on NUMA systems: if nothing else
is done to change it, memory is allocated from the local node.  The
capability needed to go along with this is the logic to always
dispatch the process on the same node.  Once again, this should be
the default on a NUMA system.  An API is not needed to provide this
capability - it is what has been referred to as "default actions".

The second is something that Paul Jackson and I have been
discussing: process association.  That is currently outside the
scope of the simple binding API, but is on my list of things to
get to.

> I believe that is the way a system administrator would want
> to run his system, rather than assigning work to specific
> processors on specific nodes.  You could argue that the same
> could be done by using cpumemsets, but that would add a level
> of complexity, which is bound to annoy your otherwise friendly
> system administrator.  And it still would require the mapping
> of real processors to logical ones and be less flexible than
> bindcpumem().
>
> > > 3. I notice the absence of "distance".  As far as I can tell,
> > > the coupling to nodes is mostly another way of saying
> > > distance.  With all the discussion about distance in Paul
> > > Dorwin's topology design, maybe the binding API could be
> > > enhanced by using distance.  For example, my suggested
> > > bindcpumem() function could then be something like:
> > >
> > >     int binddist(int node, int max_cpu_dist, int max_mem_dist,
> > >                  u_int behavior);
> > >
> > > where a node argument of -1 means any node.
> >
> > Again, this can be accomplished by making use of the simple
> > binding functions.  Further, distance is going to have different
> > meanings based upon the architecture of the machine.  Until
> > some common understanding is reached as to what distance is,
> > it does not make sense to define higher-level APIs based upon it.
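The claim that binddist() can be layered on the simple binding functions can be sketched at user level.  Everything here is hypothetical: bindtocpus() stands in for a simple binding call (the real API's names may differ), the distance table is invented, and the machine shape is a toy two-node box:

```c
#include <assert.h>

#define NODES 2
#define CPUS_PER_NODE 2

static unsigned int bound_cpus;   /* records what the stub was asked to bind */

/* Stub standing in for a simple binding API call that binds the
 * caller to a global CPU mask.  Here it only records the request. */
void bindtocpus(unsigned int cpumask) { bound_cpus |= cpumask; }

/* Hypothetical hop-count distance table: node 1 is one hop from node 0. */
static const int dist[NODES][NODES] = { {0, 1}, {1, 0} };

/* User-level binddist(): select every node within max_cpu_dist of
 * 'node' and bind to all of its CPUs via the simpler primitive. */
void binddist_user(int node, int max_cpu_dist)
{
    for (int n = 0; n < NODES; n++)
        if (dist[node][n] <= max_cpu_dist)
            bindtocpus(((1u << CPUS_PER_NODE) - 1) << (n * CPUS_PER_NODE));
}
```

The same pattern would apply to memory blocks with max_mem_dist; the distance data would come from the topology API.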
> I don't see how, Michael.  Not with the kind of flexibility the
> use of distance would allow.  Again, think of who is going to
> use the system management tools that use your API.  It may be
> a lot of fun to work with actual processor numbers (whether
> physical or logical) and actual node numbers, but it is not a
> way to run an efficient shop.

Assuming the topology API can provide distance, user-level code can
use distance to determine the processors/memblks that are to be
bound to, and then invoke the simple binding API functions.

> I don't see why it matters much that "distance" has different
> meanings (I assume you mean different penalties) on different
> platforms.  And if there is no common understanding of what
> distance is, call it penalty.  The fact remains that somebody
> or something has to consider the penalty of distance before
> using your API.  How else can the system administrator decide
> how and where to place work?  So why not incorporate the
> concept into the API?

Currently, I've seen people define distance as the number of hops,
the latency, the bandwidth, an ACPI number scaled such that 10 is
on the same node, and combinations of these.  Using distance as a
metric, without some agreement as to what that metric is, is
problematic.  The ACPI convention that "10" means the same node
would clash, for instance, with distance measured in hops: a
binddist(10) call would have vastly different results under the
two definitions.

I have some thoughts on how to handle distance, but, again, it is
not part of the simple binding API.  The simple binding API is, to
quote the document, "intended to be a 'simple' API which provides
rudimentary binding of processes to processors and/or memory
blocks."  It was initially written as the result of several
meetings of people interested in NUMA work from across the Linux
community.  This revision is an attempt to bring it more into line
with some of the thoughts and concepts that have been discussed
since then.
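The clash between distance metrics is easy to demonstrate.  Below, both tables describe the same hypothetical 4-node machine; slit[] uses the ACPI convention (10 = same node) while hops[] counts router hops.  The numbers are invented for illustration:

```c
#include <assert.h>

#define NODES 4

/* ACPI SLIT-style distances: 10 means "same node". */
static const int slit[NODES][NODES] = {
    {10, 20, 20, 30},
    {20, 10, 30, 20},
    {20, 30, 10, 20},
    {30, 20, 20, 10},
};

/* Hop counts for the same machine: 0 means "same node". */
static const int hops[NODES][NODES] = {
    {0, 1, 1, 2},
    {1, 0, 2, 1},
    {1, 2, 0, 1},
    {2, 1, 1, 0},
};

/* Bitmask of nodes within max_dist of 'from' under a given metric --
 * what a binddist()-style call would have to compute. */
unsigned int nodes_within(const int dist[NODES][NODES], int from, int max_dist)
{
    unsigned int mask = 0;
    for (int n = 0; n < NODES; n++)
        if (dist[from][n] <= max_dist)
            mask |= 1u << n;
    return mask;
}
```

The same numeric threshold of 10 selects just the local node under one metric and the whole machine under the other, which is exactly the ambiguity a binddist(10) call would face.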
> Besides, we are trying to define distance in the Topology API,
> so we're working on it.
>
> > I see process migration as being a ways away.  However, it can
> > be supported in this API by either an implicit action (e.g.,
> > bindtomemblk() could cause migration of pages already allocated
> > to the process to the specified memory blocks), or an explicit
> > one (e.g., add MPOL_MIGRATE to the behavior flags), or a
> > combination of the two (implicit but with an MPOL_NOMIGRATE
> > flag).
>
> Migrate would go very well with binddist() or bindcpumem().
> In fact, I suggested the two with migration in mind.  As long
> as you keep an opening for a future migrate() function, I'm
> happy.  I'm not happy with adding implied functionality to
> other API calls or being forced to use a series of the other
> API calls when I need to migrate.

The "implied functionality" can (and should) be made explicit.
Currently, it is undefined what happens to existing memory for a
process when the process binds to a memory block.  Ideally, any
existing memory allocated to the process should migrate to the new
memory binding.  However, until process migration is supported, the
best we can do is place any future memory allocations on the bound
memory block.  My preference, for now, is to state clearly that
this is undefined, but that an application which does not want
already-allocated memory to migrate should use an MPOL_NOMIGRATE
flag.

However, why do you see a need for a process to decide it should
migrate?  And if it does need to migrate, what is the issue with
using a series of function calls to accomplish this?  If an
application wants to read a record from a file, it must issue a
series of calls to accomplish that.  As long as the necessary
pieces are available and can be assembled, why isn't this adequate?

> -nc-

Michael Hohnbaum
hoh...@us...
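The MPOL_NOMIGRATE semantics under discussion can be sketched as a toy model.  The function name bindtomemblk() comes from the thread, but this body, the flag value, and the page bookkeeping are all hypothetical; a real implementation would live in the kernel:

```c
#include <assert.h>

#define MPOL_NOMIGRATE 0x1u   /* hypothetical behavior flag */
#define NPAGES 4

static int page_node[NPAGES];  /* node currently holding each existing page */
static int alloc_node;         /* node used for future allocations */

/* Toy bindtomemblk(): future allocations always go to 'node';
 * pages the process already owns migrate too, unless the caller
 * passes MPOL_NOMIGRATE -- making the implied behavior explicit. */
void bindtomemblk(int node, unsigned int behavior)
{
    alloc_node = node;
    if (!(behavior & MPOL_NOMIGRATE))
        for (int i = 0; i < NPAGES; i++)
            page_node[i] = node;   /* stand-in for real page migration */
}
```

With this shape, the default remains well-defined once migration exists, and an application that wants the cheaper no-migration behavior asks for it explicitly.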