From: Paul M. <Pau...@us...> - 2001-10-12 05:57:03
Hello, Paul! Took a first pass through your proposal -- good stuff! Some very interesting approaches! Some comments:

Under "Desired Properties / Vendor neutral base", I recommend adding Tru64, HP-UX, AIX, etc. May as well be all-inclusive. ;-)

Under "Implementation Layers", near the end of the second paragraph: "a small scale forking" seems a bit dramatic. That said, it would be good if the NUMA and SMP code paths used the same C code.

A question on items 1, 2, and 3 under "Implementation Layers". Item 1 seems to indicate that the current use of bitmasks within the kernel can continue unchanged, but hints at longer-term changes. Do you envision the kernel moving towards using CpuMemMaps and CpuMemSets, or do you expect these two concepts to be strictly confined to user space?

Under "Error cases" in the header file, you say that if you really want to force an application to use CPUs and memory from disjoint nodes (which you might in diagnostics or performance-measurement code), then the CpuMemSet memory list must contain CMS_DEFAULT_CPU. But CMS_DEFAULT_CPU is cast to cms_lcpu_t. Shouldn't it be cast instead to cms_lmem_t? Or is CMS_DEFAULT_CPU instead supposed to go on the CpuMemSet's list of CPUs? If the latter, does this allow the disjoint-node operation for CPUs and memory? (So that all the permitted CPUs are on one set of nodes, but all the permitted memory is on another set of nodes, and the two sets of nodes are disjoint.)

First-touch and stripe policies seem to be missing from the list. Some of the OSes have also had least-memory-utilization policies.

A cpumemset has no meaning except in the context of a cpumemmap, right? A cpumemmap maps the CPU numbers, while a cpumemset simply restricts them, right? (Could make it work either way, but things like getcpu() and getnode() need to be in sync with the choice.) For example, suppose that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What is the physical CPU corresponding to logical CPUs 0, 1, and 2?

HP's launch policies include policies for CPU allocation (round robin and the like). My guess is that this requires either additional bits in cms_policy or another field for CPU (as opposed to memory) policies.

Non-root processes are prohibited from creating cpumemmaps. Shouldn't they be allowed to create them, as long as they are subsets of the cpumemmap that they are currently running with?

If non-root processes are really prohibited from playing with cpumemmaps, what happens to their children? Is the child's cpumemmap generated from the cpumemset/cpumemmap pair that the parent specified with CMS_CHILD? Or does the child just get copies of the CMS_CHILD cpumemset/cpumemmap? (At this point, I am favoring letting non-root processes manipulate cpumemmaps, with the restriction that any that are installed must be pure restrictions of the current cpumemmap.)

There is no way to create either a cpumemmap or a cpumemset without querying for it. The intended usage is to query for this process's set/map, then manipulate the resulting structure? Is the user supposed to directly manipulate the fields of the cpumemmap and cpumemset?

A few typos, possibly due to WikiWeb issues: "cpu's" -> "CPUs", "mem's" -> "memories", "preasure points" -> "pressure points", "CpumemSet" -> "CpuMemSet", etc.

Enough questions for now! More later... Once again, some good stuff here!

Thanx, Paul
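(A sketch of the two readings behind the {100,102,103} question above; cms_pcpu_t and cms_lcpu_t are the proposal's types, but the array names and the commentary are illustrative, not part of the proposal:)

cms_pcpu_t map_cpus[] = { 100, 102, 103 };      /* cpumemmap cpus */
cms_lcpu_t set_cpus[] = { 0, 2 };               /* cpumemset cpus */

/* Reading 1 - the map defines the logical numbering: logical cpu 0
 * is physical cpu 100, logical 1 is 102, and logical 2 is 103, so
 * the set {0,2} permits physical cpus 100 and 103.
 * Reading 2 - the set names cpus in the same numbering the map
 * uses: {0,2} would then match none of {100,102,103}.
 * getcpu() and getnode() must agree with whichever reading is
 * chosen, which is the point of the question above. */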
From: Paul M. <Pau...@us...> - 2001-10-15 22:33:05
> Thanks - excellent comments, Paul. I look forward to continued feedback from yourself and others.

Will do my best! And thank you for reformatting them, sorry for their previous less-than-helpful state.

> ====
> > A question on items 1, 2, and 3 under "Implementation Layers". Item 1 seems to indicate that the current use of bitmasks within the kernel can continue unchanged, but hints at longer-term changes. Do you envision the kernel moving towards using CpuMemMaps and CpuMemSets, or do you expect these two concepts to be strictly confined to user space?
>
> I doubt that the guts of scheduling or allocation code will ever want to be written in terms of CpuMemMaps and Sets. The data structure used in CpuMem*, arrays of cpu or mem id's, is almost surely too inefficient for critical path code.
>
> I expect that when someone needs a kernel for > 64 cpus, they will have to find an alternative to, or extension of, Ingo's cpus_allowed bit vector.

I agree with this. My hope is that there will be a way to bury such differences between 64-CPU and >64-CPU data structure manipulation in a macro, an inlined function, or some such.

> And I expect continued coding activity for various other purposes in both the scheduling and allocation code, which will at times impact the preferred data representation of available cpu and memory resources for these critical code paths.
>
> The kernel code that implements the system calls to set CpuMemMaps and Sets will always have the responsibility for translating the stable, but cumbersome, API of these Maps and Sets into the efficient representation of the day required by the scheduling and allocation code.

Yep! There may come a time when people want a short-form interface, but I believe that what is good enough for select() vectors and for signal masks is good enough for NUMA. ;-)

> ====
> > Under "Error cases" in the header file, you say that if you really want to force an application to use CPUs and memory from disjoint nodes (which you might in diagnostics or performance-measurement code), then the CpuMemSet memory list must contain CMS_DEFAULT_CPU. But CMS_DEFAULT_CPU is cast to cms_lcpu_t. Shouldn't it be cast instead to cms_lmem_t? Or is CMS_DEFAULT_CPU instead supposed to go on the CpuMemSet's list of CPUs? If the latter, does this allow the disjoint-node operation for CPUs and memory? (So that all the permitted CPUs are on one set of nodes, but all the permitted memory is on another set of nodes, and the two sets of nodes are disjoint.)
>
> Aha - this Error Case is confused, sufficiently so that it apparently managed to further confuse your critique ;).

Glad it wasn't just me. ;-)

> And the Error Case preceding it is also confused. For the benefit of those who don't have this Design Note at hand, the two confused Error Cases are:
>
> * It is not an error if a CpuMemSet for an object (task, vm area
> * or kernel) doesn't provide memory lists for all the cpus in
> * that object's CpuMemMap. That is, it is ok for a CpuMemSet to
> * "be smaller than" (only use a subset of) its Map.
> *
> * However it is an error to set a CpuMemSet that shows cpus that
> * are not listed in any of the memory lists of that CpuMemSet,
> * unless the memory lists include a CMS_DEFAULT_CPU. Attempts to
> * set such a CpuMemSet fail with errno set to ESRCH. This case
> * must be an error to avoid trying to allocate memory without
> * knowing which memory list to search.
>
> Let me try again ...
> A CpuMemSet specifies two things. It specifies on which cpus (in the corresponding CpuMemMap) a task may be scheduled, and it specifies in what order to search for memory, per virtual memory area, depending on which cpu the request for memory was executed.

OK... How is the CPU taken into account? Do you traverse the list of memories until you find one that is closest to the current CPU? If there is no memory to be had there, do you then go through the list of memories in order starting at that point, and wrapping around if necessary? Or are you using something similar to classzones, which would be represented separately?

> The question arises - where do we look for memory if a request for memory is executed on a cpu that is not specified in the active CpuMemSet? Perhaps someone didn't provide memory lists for all possible cpus that might execute code sharing that area.
>
> Heck, perhaps they _couldn't_ specify such memory lists, because they are setting up a shared memory area that will be shared with other tasks running on cpus outside the Map of the process initializing the shared memory area.
>
> Seems that I can either (1) require that a memory list _always_ be specified for the CMS_DEFAULT_CPU, or (2) mandate that attempts to allocate memory when the CpuMemSet does not specify (even by CMS_DEFAULT_CPU) any memory list for the currently executing cpu must fail. Choice (2) would cause nasty, obscure and intermittent errors. So it must be choice (1).

Agreed that deterministic errors are much better than non-deterministic errors! But how do you indicate which memory is associated with CMS_DEFAULT_CPU? Must the CPU and the memory lists be the same size? My guess was "no", since there are separate fields for the length of each. Or does CMS_DEFAULT_CPU just start the search of memory from the first element of the array?

> Hence these two Error Cases collapse into the following single case:
>
> * Every CpuMemSet must specify a memory list for the
> * CMS_DEFAULT_CPU, to ensure that regardless of which CPU a
> * memory request is executed on, a memory list will be available
> * to search for memory. Attempts to set a CpuMemSet without a
> * memory list specified for the CMS_DEFAULT_CPU will fail, with
> * errno set to EINVAL.
>
> ====
> > First-touch and stripe policies seem to be missing from the list. Some of the OSes have also had least-memory-utilization policies.
>
> First touch is the natural order of things. If, as will usually be the case, the memory lists are ordered by distance from the faulting cpu, then this provides first touch. Existing upper level API's, such as cpusets, dplace, runon, OpenMP, MPI that support a First Touch policy, would presumably implement that policy by properly sorting the memory lists.
>
> Aha - perhaps I should change the CMS_DEFAULT policy comment, from:
>
> #define CMS_DEFAULT 0x00 /* None of the following optional policies */
>
> to some such as:
>
> #define CMS_DEFAULT 0x00 /* Memory list order (first-touch, typically) */
>
> Tell me more about what is the essence of stripe policies, as they apply here. My guess is that a combination of proper memory list sorting plus a round robin (CMS_ROUND_ROBIN) policy will provide the desired semantics. But more consideration of this point is needed.

Confusion on my part -- different names for the same thing in the different APIs...

> What is a "least-memory utilization policy"? I am not familiar with that term.
It means that, at page-fault time, you allocate memory from the node/memory with the least utilization.

> ====
> > A cpumemset has no meaning except in the context of a cpumemmap, right? A cpumemmap maps the CPU numbers, while a cpumemset simply restricts them, right? (Could make it work either way, but things like getcpu() and getnode() need to be in sync with the choice.) For example, suppose that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What is the physical CPU corresponding to logical CPUs 0, 1, and 2?
>
> Yes, Yes, tell me about getcpu/getnode, and {100,102,103}.
>
> My presumption, from reading this, is that getcpu(), executed in this map and set, would return either 100 or 103, rather oblivious to the CpuMemMap. I have no clue what getnode() would or should do that relates to CpuMemSets. Perhaps we need a "getcmscpu()"?

I believe that we need a way to get the physical CPU ID. So, at this point, do we have two levels of ID, or three? In case of diagnostics, you want to identify the physical CPU. So we quickly get into the issue that Martin Bligh raised earlier. ;-)

For related sets of processes running as root, you can end up with many more levels of ID. Process 100 maps to exclude the first CPU, then forks process 101, which maps to exclude its idea of the first CPU, and so on. This problem is inherent in the notion of virtualizing the CPUs. We could try to eliminate the middle level (cms_pcpu_t), but that could require using whatever ugly IDs the hardware wanted to provide. Maybe the name of the cms_pcpu_t level should be "complete" instead of "physical"? Or some other naming?

So the relevant numbering schemes are the following: (1) the numbering that the current process sees, as mapped by the CpuMemMap that controls it, and (2) the underlying physical identifiers that would make sense to someone servicing the hardware. Thoughts?

> ====
> > HP's launch policies include policies for CPU allocation (round robin and the like). My guess is that this requires either additional bits in cms_policy or another field for CPU (as opposed to memory) policies.
>
> Aha - might be so. I should investigate this further.

http://lse.sourceforge.net/numa/manpages/hpux-mm.txt, then search for "Launch".

> ====
> > Non-root processes are prohibited from creating cpumemmaps. Shouldn't they be allowed to create them, as long as they are subsets of the cpumemmap that they are currently running with?
>
> Non-root processes can restrict where they execute tasks or allocate memory by altering their CpuMemSet, within the confines of the CpuMemMap established for them. Why is that not enough?

My concern is that non-root processes cannot virtualize their children. If you want to divide the CPUs and memory available to you into two pieces, and run a child in each piece, you can do so, but you cannot have both children thinking that they have their own CPU 0.

> ====
> > If non-root processes are really prohibited from playing with cpumemmaps, what happens to their children? Is the child's cpumemmap generated from the cpumemset/cpumemmap pair that the parent specified with CMS_CHILD? Or does the child just get copies of the CMS_CHILD cpumemset/cpumemmap?
>
> I don't understand the difference between these two choices. They both sound the same to me, and both sound right.

It turns out that they are the same.
When I wrote this, I didn't know that cpumemset didn't also map numberings. So I thought that maybe the child would get a cpumemset generated from the "product" of the CMS_CHILD cpumemset and cpumemmap, and a full cpumemset.

> ====
> > (At this point, I am favoring letting non-root processes manipulate cpumemmaps, with the restriction that any that are installed must be pure restrictions of the current cpumemmap.)
>
> Well, having not yet appreciated the comments above on this, I will have to table this suggestion pending my further enlightenment.

OK... ;-) Please let me know whether my non-root virtualization example above seems reasonable to you.

> ====
> > There is no way to create either a cpumemmap or a cpumemset without querying for it. The intended usage is to query for this process's set/map, then manipulate the resulting structure?
>
> Yes, yes. Earlier designs allowed for creating and manipulating CpuMemSets, as a kernel supported object that was visible as a distinct identified object to applications, separate from their binding to any given task or vm area. But I could see no essential use for unbound CpuMemSets, so now they are an attribute of known tasks and vm areas.

Sounds reasonable.

> ====
> > Is the user supposed to directly manipulate the fields of the cpumemmap and cpumemset?
>
> Yes, via the cmsSet*() system calls. Well, as direct as anything in a protected operating system -- the user politely asks the kernel to set a map or set, and doesn't directly write kernel memory.
>
> My hunch here is that I missed the real point of this question ...

Should the interfaces be used as follows?

p = cmsQueryCMS(CMS_CURRENT, (void *)0);
if (p->nr_cpus > 1) {
        p->nr_cpus--;
}
cmsSetCMS(CMS_CURRENT, (void *)0, p);

> > Enough questions for now! More later... Once again, some good stuff here!
>
> I look forward to your further comments, and to integrating this work with substantial good work being done by folks on your end, and elsewhere.

One thing we need to hash out is what the unit of memory control is. In the simple-binding proposal, it is an arbitrary range of virtual memory (similar to what madvise() might do), while in the Process Scheduling and Memory Placement proposal, it appears to be an object. Conflicts are handled in a last-change-wins manner?

Thanx, Paul
From: Paul J. <pj...@en...> - 2001-10-16 23:35:28
Thanks again, Paul, for the continued excellent feedback.

On Mon, 15 Oct 2001, Paul McKenney wrote:
> > I expect that when someone needs a kernel for > 64 cpus, they will have to find an alternative to, or extension of, Ingo's cpus_allowed bit vector.
>
> I agree with this. My hope is that there will be a way to bury such differences between 64-CPU and >64-CPU data structure manipulation in a macro, an inlined function, or some such.

Something like that, yes. This is just a concern to the schedule() code - nicely isolated.

> > The kernel code that implements the system calls to set CpuMemMaps and Sets will always have the responsibility for translating the stable, but cumbersome, API of these Maps and Sets into the efficient representation of the day required by the scheduling and allocation code.
>
> Yep! There may come a time when people want a short-form interface, but I believe that what is good enough for select() vectors and for signal masks is good enough for NUMA. ;-)

My intention is that short-form interfaces are provided on top of CpuMemSets -- we (sgi) expect to do a few such ourselves, to emulate existing interfaces.

> > Let me try again ...
> >
> > A CpuMemSet specifies two things. It specifies on which cpus (in the corresponding CpuMemMap) a task may be scheduled, and it specifies in what order to search for memory, per virtual memory area, depending on which cpu the request for memory was executed.
>
> OK... How is the CPU taken into account? Do you traverse the list of memories until you find one that is closest to the current CPU? If there is no memory to be had there, do you then go through the list of memories in order starting at that point, and wrapping around if necessary? Or are you using something similar to classzones, which would be represented separately?

Hmmm ... I have not yet adequately presented some aspects of this data structure. I must add an example ... that might connect with additional readers.

Let's try this one:

Example:
========

One way to understand these data structures is to look at an example, given the following hardware configuration:

Let's say we have a four node system, with four CPUs per node, and one memory per node, named as follows:

Name the 16 CPUs: c0, c1, ..., c15      # 'c' for CPU
and number them:  0, 1, 2, ..., 15      # cms_pcpu_t

Name the 4 memories: mn0, mn1, mn2, mn3 # 'mn' for memory node
and number them:     0, 1, 2, 3         # cms_pmem_t

CpuMemMap:

Now let's say the administrator (root) chooses to set up a Map containing just the 2nd and 3rd node (CPUs and memory thereon). The cpumemmap for this would contain:

{
    8,      # nr_cpus (length of cpus array)
    p1,     # cpus (ptr to array of cms_pcpu_t)
    2,      # nr_mems (length of mems array)
    p2      # mems (ptr to array of cms_pmem_t)
}

where p1, p2 point to arrays of physical cpu + mem numbers:

p1 = [ 4,5,6,7,8,9,10,11 ]  # cpus (array of cms_pcpu_t)
p2 = [ 1,2 ]                # mems (array of cms_pmem_t)

This map shows, for example, that for this Map, logical cpu 0 corresponds to physical cpu 4 (c4).

CpuMemSet:

Further, let's say that an application running within this map chooses to restrict itself to just the odd-numbered CPUs, and to search memory in the common "first-touch" manner (local node first).
It would establish a CpuMemSet containing:

{
    CMS_DEFAULT,    # cms_policy
    4,              # nr_cpus (length of cpus array)
    q1,             # cpus (ptr to array of cms_lcpu_t)
    2,              # nr_mems (length of mems array)
    q2,             # mems (ptr to array of cms_memory_list)
}

where q1 points to an array of 4 logical cpu numbers and q2 to an array of 2 memory lists:

q1 = [ 1,3,5,7 ],   # cpus (array of cms_lcpu_t)
q2 = [              # See "Verbalization examples" below
    { 3, r1, 2, s1 }
    { 2, r2, 2, s2 }
]

where r1, r2 are arrays of logical cpus:

r1 = [1, 3, CMS_DEFAULT_CPU]
r2 = [5, 7]

and s1, s2 are arrays of memory nodes:

s1 = [0, 1]
s2 = [1, 0]

Verbalization examples:

To read item q1 out loud:

    Tasks in this CpuMemSet may be scheduled on any of the logical CPUs [ 1, 3, 5, 7 ], which correspond in the associated Map with physical CPUs c5, c7, c9 and c11.

To read item q2 out loud:

    If a fault occurs on either of the 2 explicit CPUs in r1, then search the 2 memory nodes in s1 in order, looking for available memory (mn1, then mn2).

    If a fault occurs on either of the 2 CPUs in r2, search the 2 memory nodes in s2 in order (mn2, then mn1).

    If a fault occurs on any other CPU, then since the CMS_DEFAULT_CPU value is listed in r1, search the 2 memory nodes in s1 in order (mn1, then mn2).

Interpretations of the above:

The meaning of "s1 = [0, 1]" is that if a page fault occurs on the logical CPUs in "r1 = [1, 3, CMS_DEFAULT_CPU]", then the allocator should search logical memory node 0 first (that's the memory on physical node 1 - mn1), then search logical memory node 1 second (the memory on physical node 2 - mn2).

The meaning of "s2 = [1, 0]" is that if a page fault occurs on the logical CPUs listed in "r2 = [5, 7]", then the same memory nodes are searched, but in the other order, mn2 then mn1.

In particular, if a vm area using the above CpuMemSet was also shared with an application running on some other Map, and that application faulted while running on some CPU not explicitly listed in the above CpuMemSet (item r1 or r2), then the allocator would search mn1 first, then mn2, for available memory. This is because CMS_DEFAULT_CPU is listed amongst the CPUs in r1, and the corresponding s1 is equivalent to the ordered array of physical memory nodes [mn1, mn2].

Observation:

The allocator need have _no_ notion of distance. It just searches, in the order specified, the memory list prescribed for that vm area, for a fault on the specified CPU (or the CMS_DEFAULT_CPU). To provide the usual first-touch, distance-ordered memory search, some system service or utility must sort the memory lists in distance order.

========

I should add the above example to my Design Notes.

My apologies for the tediousness of this example. I realize that the above data structure is a layer or two deeper than intuitions expect. However, when I methodically strip all (most?) higher-level policy from the various CPU and memory APIs we need to support, the above is what I am left with, as the necessary generic multiplexor between a variety of APIs and the specific needs of the static placement logic in the kernel allocation and scheduling code.

For example, observe that no notion of locality domain or node exists here - it has been disassembled into simple lists of CPUs and chunks of memory, called here 'memory nodes' only because there tends to be one chunk per node, and I couldn't find a better noun to name a maximal contiguous (?) chunk of memory that is equidistant from all processing elements.
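(The example can also be written out as C initializers. The struct layouts and the CMS_DEFAULT_CPU encoding below are assumptions inferred from this message, not the Design Note's actual header, but they make the nesting of the cms_memory_list structure explicit:)

/* The example above as C initializers. All struct layouts and the
 * CMS_DEFAULT_CPU encoding are assumptions for illustration. */
typedef int cms_pcpu_t;         /* physical cpu number */
typedef int cms_pmem_t;         /* physical memory node number */
typedef int cms_lcpu_t;         /* logical cpu number */
typedef int cms_lmem_t;         /* logical memory node number */

#define CMS_DEFAULT     0x00
#define CMS_DEFAULT_CPU ((cms_lcpu_t)-1)        /* assumed encoding */

typedef struct {
        int nr_cpus;            /* length of cpus array */
        cms_lcpu_t *cpus;       /* cpus this memory list applies to */
        int nr_mems;            /* length of mems array */
        cms_lmem_t *mems;       /* memory nodes, in search order */
} cms_memory_list_t;

typedef struct {
        int nr_cpus;
        cms_pcpu_t *cpus;       /* logical-to-physical cpu map */
        int nr_mems;
        cms_pmem_t *mems;       /* logical-to-physical memory map */
} cpumemmap;

typedef struct {
        int cms_policy;
        int nr_cpus;
        cms_lcpu_t *cpus;       /* logical cpus tasks may run on */
        int nr_mems;
        cms_memory_list_t *mems;
} cpumemset;

/* The Map: the 2nd and 3rd nodes of the four-node system. */
static cms_pcpu_t p1[] = { 4, 5, 6, 7, 8, 9, 10, 11 };
static cms_pmem_t p2[] = { 1, 2 };
static cpumemmap example_map = { 8, p1, 2, p2 };

/* The Set: odd logical cpus, first-touch memory ordering. */
static cms_lcpu_t q1[] = { 1, 3, 5, 7 };
static cms_lcpu_t r1[] = { 1, 3, CMS_DEFAULT_CPU };
static cms_lcpu_t r2[] = { 5, 7 };
static cms_lmem_t s1[] = { 0, 1 };
static cms_lmem_t s2[] = { 1, 0 };
static cms_memory_list_t q2[] = {
        { 3, r1, 2, s1 },
        { 2, r2, 2, s2 },
};
static cpumemset example_set = { CMS_DEFAULT, 4, q1, 2, q2 };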
> > Seems that I can either (1) require that a memory list _always_ be specified for the CMS_DEFAULT_CPU, or (2) mandate that attempts to allocate memory when the CpuMemSet does not specify (even by CMS_DEFAULT_CPU) any memory list for the currently executing cpu must fail. Choice (2) would cause nasty, obscure and intermittent errors. So it must be choice (1).
>
> Agreed that deterministic errors are much better than non-deterministic errors! But how do you indicate which memory is associated with CMS_DEFAULT_CPU?

The value CMS_DEFAULT_CPU is included on the cpu list (r1 or r2, in the above example) of one of the memory lists.

==> Memory lists have a list of memory nodes, sorted in search order, _and_ a list of CPUs to which that memory list applies.

> Must the CPU and the memory lists be the same size? My guess was "no", since there are separate fields for the length of each.

Correct - they need not be, and in the case of architectures with multiple CPUs per memory node, typically are not the same size.

> Or does CMS_DEFAULT_CPU just start the search of memory from the first element of the array?

er eh no. This question confuses me.

CMS_DEFAULT_CPU chooses which memory list to search if the faulting cpu is not explicitly listed.

The search order is by default (CMS_DEFAULT) always from the first element of the memory node array (s1 and s2, above), unless the CMS_ROUND_ROBIN policy is specified for that CpuMemSet.

> > What is a "least-memory utilization policy"? I am not familiar with that term.
>
> It means that, at page-fault time, you allocate memory from the node/memory with the least utilization.

This is clearly then a dynamic placement policy, not a static one. I would expect the implementation of a "least-memory utilization policy" to add code to the allocator and elsewhere to track memory utilization. And I would expect the CpuMemSet (or at least CpuMemMap) to control which memory nodes could be searched. But the order of search would depend on more dynamic code outside the domain of CpuMemSets.

> > > ====
> > > A cpumemset has no meaning except in the context of a cpumemmap, right? A cpumemmap maps the CPU numbers, while a cpumemset simply restricts them, right? (Could make it work either way, but things like getcpu() and getnode() need to be in sync with the choice.) For example, suppose that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What is the physical CPU corresponding to logical CPUs 0, 1, and 2?
> >
> > Yes, Yes, tell me about getcpu/getnode, and {100,102,103}.
> >
> > My presumption, from reading this, is that getcpu(), executed in this map and set, would return either 100 or 103, rather oblivious to the CpuMemMap. I have no clue what getnode() would or should do that relates to CpuMemSets. Perhaps we need a "getcmscpu()"?
>
> I believe that we need a way to get the physical CPU ID. So, at this point, do we have two levels of ID, or three? In case of diagnostics, you want to identify the physical CPU. So we quickly get into the issue that Martin Bligh raised earlier. ;-)

When you want the physical CPU ID, as with diagnostics, then getcpu() provides the physical CPU ID, which should be no more ambiguous than it was before CpuMemSets. I only see one level of ID here. I see nothing logical about this <grin>.
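(A minimal sketch of the memory-list selection just described, using the struct layout assumed in the sketch after the example above; the helper name is illustrative, not the Design Note's kernel code:)

#include <stddef.h>

/* Pick the memory list to search for a fault on faulting_cpu: an
 * explicit cpu match wins; otherwise fall back to the list carrying
 * CMS_DEFAULT_CPU, which the collapsed Error Case above guarantees
 * to exist. Illustrative only. */
static cms_memory_list_t *pick_memory_list(cpumemset *cms,
                                           cms_lcpu_t faulting_cpu)
{
        cms_memory_list_t *dflt = NULL;
        int i, j;

        for (i = 0; i < cms->nr_mems; i++) {
                cms_memory_list_t *ml = &cms->mems[i];

                for (j = 0; j < ml->nr_cpus; j++) {
                        if (ml->cpus[j] == faulting_cpu)
                                return ml;      /* explicit match wins */
                        if (ml->cpus[j] == CMS_DEFAULT_CPU)
                                dflt = ml;      /* remember the fallback */
                }
        }
        return dflt;
}

The allocator would then try ml->mems[0], ml->mems[1], ... in order under CMS_DEFAULT, or rotate its starting index under CMS_ROUND_ROBIN.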
> For related sets of processes running as root, you can end up with many more levels of ID. Process 100 maps to exclude the first CPU, then forks process 101, which maps to exclude its idea of the first CPU, and so on. This problem is inherent in the notion of virtualizing the CPUs. We could try to eliminate the middle level (cms_pcpu_t), but that could require using whatever ugly IDs the hardware wanted to provide. Maybe the name of the cms_pcpu_t level should be "complete" instead of "physical"? Or some other naming?

No - cms_pcpu_t is the ugly hardware ID (well, one of them - seems that these too come in a couple forms, such as compact or not). This is not a hall of infinitely regressing mirrors. There are only two levels, the ugly physical ID level, and the logical mapping to the integers 0..N-1 (logical ID) for any given subset of size N CPUs or memory nodes.

I suspect that CpuMemMaps are a degree less fancy than you are presuming, while CpuMemSets are a couple degrees more fancy, as in the above example ;).

> So the relevant numbering schemes are the following: (1) the numbering that the current process sees, as mapped by the CpuMemMap that controls it, and (2) the underlying physical identifiers that would make sense to someone servicing the hardware.

Yes - exactly. How we ended up at this same point, after the above confusions, baffles me a bit. oh well.

> > > HP's launch policies include policies for CPU allocation (round robin and the like). My guess is that this requires either additional bits in cms_policy or another field for CPU (as opposed to memory) policies.
> >
> > Aha - might be so. I should investigate this further.
>
> http://lse.sourceforge.net/numa/manpages/hpux-mm.txt, then search for "Launch".

Here I see several dynamic scheduling policies:

    ROUND_ROBIN
    PACKED
    FILL
    LEASTLOAD

As with the dynamic "least-memory utilization" policy above, I would expect CpuMemSets to control which CPUs were eligible for scheduling, but that the control of details of scheduling would be left to other mechanisms.

> > Non-root processes can restrict where they execute tasks or allocate memory by altering their CpuMemSet, within the confines of the CpuMemMap established for them. Why is that not enough?
>
> My concern is that non-root processes cannot virtualize their children. If you want to divide the CPUs and memory available to you into two pieces, and run a child in each piece, you can do so, but you cannot have both children thinking that they have their own CPU 0.

Yes, this limitation exists. Below you suggest letting non-root processes further restrict their map. That would be doable - a modest wart, in that it complicates what was a simple story: root manipulates maps, anyone manipulates sets. How serious is this limitation?

> > > (At this point, I am favoring letting non-root processes manipulate cpumemmaps, with the restriction that any that are installed must be pure restrictions of the current cpumemmap.)
>
> Please let me know whether my non-root virtualization example above seems reasonable to you.

I am still lacking an appreciation of the benefit of this sufficient to justify the extra half-twist of logic. I'm open to hearing more.

> > > Is the user supposed to directly manipulate the fields of the cpumemmap and cpumemset?
> >
> > Yes, via the cmsSet*() system calls.
> > Well, as direct as anything in a protected operating system -- the user politely asks the kernel to set a map or set, and doesn't directly write kernel memory.
> >
> > My hunch here is that I missed the real point of this question ...
>
> Should the interfaces be used as follows?
>
> p = cmsQueryCMS(CMS_CURRENT, (void *)0);
> if (p->nr_cpus > 1) {
>         p->nr_cpus--;
> }
> cmsSetCMS(CMS_CURRENT, (void *)0, p);

Ah ... that gets dicey. What I hear you asking is whether it is appropriate for an application (more likely, some library supporting a friendlier interface on top of CpuMemSets) to (1) get a response from a cmsQuery*(), (2) munge it in place, and (3) push it back down a cmsSet*() call.

As we have already seen in the Example above, the key data structure is a tad nasty to deal with in C, with its nested variable-length arrays. It is especially nasty across the system call boundary -- the kernel can't exactly malloc and assemble multiple small chunks of user memory during the response to a single system call.

This is in good part the motivation for the cmsFree*() calls, to isolate the caller of these routines from knowing just how the memory for them is allocated.

For a change such as you give in your example above, changing a value in place, that's no problem ... because it makes no assumptions as to how the memory was allocated. Changing a Policy flag would be easy, for the same reason. Shortening one of the arrays, and even overwriting the values in them, is fine.

Any lengthening of an array should be done with a deep copy into memory that is managed in ways known to the caller, so that the caller knows how to free that memory, when finished.

> One thing we need to hash out is what the unit of memory control is. In the simple-binding proposal, it is an arbitrary range of virtual memory (similar to what madvise() might do), while in the Process Scheduling and Memory Placement proposal, it appears to be an object. Conflicts are handled in a last-change-wins manner?

Yes - I see that section in the "Proposed NUMA API" now.

There are a couple things here we need to hash out. My reading of this section is that bindmemory() has a couple of capabilities that are not a natural fit with CpuMemSets:

1) Does bindmemory(vaddr, len, ...) apply to:
   a) already mapped and allocated memory, or
   b) already mapped, but not allocated, memory, or
   c) also extended mappings in already known vm areas, or
   d) any future mapping or remapping in the vaddr range?

Since these bindings are inherited across fork/exec, I guess it must be (d). This means that the binding request has to be kept as an inherited property of the process, and recalculated for applicability any time that any mapping is changed, whether to grow or shrink an existing vm area or to add or remove a vm area. CpuMemSets are, as you note, per object (per vm area), not for an arbitrary address range and whatever portions of whatever future vm areas might happen to overlap that range.

2) bindmemory() requires per-process specifications as to where to find memory. CpuMemSets support per-cpu specifications (based on which cpu is executing the page_alloc request).

My first inclination is to recommend against supporting capability (1) above, on the grounds that its semantics don't respect very well the existing objects that the kernel supports. That, or I failed to understand it, quite possible.
The second capability, per-process rather than (or in addition to) per-cpu memory search specifications, is at least more doable, but still seems not quite right to me.

I am of course open to hearing the motivation for these capabilities. I may be a purist at heart, but I am a pragmatist in my wallet.

Also, perhaps you could comment on the status of the Simple Binding proposal. Whether it is in use, or is required to support APIs in use. What level of flexibility exists on such significant semantics as the two capabilities just above.

I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
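(A sketch of the "pure restrictions of the current cpumemmap" rule debated in this exchange, using the struct layout assumed in the example sketch above; the check itself is an illustration, not part of either proposal:)

/* A non-root process could be allowed to install a new cpumemmap
 * only if every physical cpu and memory node in it already appears
 * in the current map - i.e., the new map only narrows the old one.
 * Illustrative only; names are not from the Design Note. */
static int is_pure_restriction(const cpumemmap *cur, const cpumemmap *req)
{
        int i, j, found;

        for (i = 0; i < req->nr_cpus; i++) {
                found = 0;
                for (j = 0; j < cur->nr_cpus; j++)
                        if (req->cpus[i] == cur->cpus[j])
                                found = 1;
                if (!found)
                        return 0;
        }
        for (i = 0; i < req->nr_mems; i++) {
                found = 0;
                for (j = 0; j < cur->nr_mems; j++)
                        if (req->mems[i] == cur->mems[j])
                                found = 1;
                if (!found)
                        return 0;
        }
        return 1;       /* req only narrows cur */
}

Under such a rule, a non-root map-setting call could be accepted whenever this check passes, which is what the non-root virtualization example asks for.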
From: Paul M. <Pau...@us...> - 2001-10-19 17:36:54
Hello, Paul! The example -really- helped! I was missing the point of the cms_memory_list data structure. This clears up most of my confusion, I believe, anyway...

I believe that we will really want to have a default value (e.g., NULL) for the pointer to the cms_memory_list, since for many applications a default of either "search my node first, then search the rest however you wish" or "search my node only" will suffice. I believe that it is a bit cruel and unusual to force someone to specify either of these in great detail. ;-)

Each "mirror" in my "hall" is in a separate process that used cpumemmap to restrict things. So a given process normally cares only about its own view and that of the underlying system. One common exception might be debuggers, but, on the other hand, the multithreaded debuggers I have used don't know a CPU from a hole in the ground. So, the two views seemed likely to be sufficient.

The reason I keep wanting non-root processes to be able to restrict their cpumemmaps is that the cpumemmap is the only thing in your proposal that is capable of virtualizing CPUs and memory. So, an application that wishes to partition itself needs either to track the CPU numbers and offsets itself, or it needs to provide a different cpumemmap to each of its children. Having done quite a bit of the former, the latter seems quite attractive. ;-)

Also, I need to manipulate cpumemmaps to implement pieces of the simple-binding API.

Manual manipulation of the cpumemset structures seems quite painful. Am I right in assuming that one cannot invoke cmsFreeCMS() on hand-built cpumemset structures? If I use my crude trick of decrementing the nr_cpus field, am I then prohibited from calling cmsFreeCMS(), or does cmsFreeCMS() hide some additional information away somewhere?

So here is what I believe I need to do to write the simple-binding API in terms of cpumemset and cpumemmap. I assume that cmsFreeCMS() is telepathic, since I cannot think of a way that it could do what I want it to do... I also take the liberty of leaking memory right and left. The purpose of this exercise is to see if I understand the cpumemsets. Once I find that I do, I will try one of the existing proprietary-Unix APIs. And only -then- will I worry about optimal, nice-looking code!!!

This exercise raised the following questions:

1. It seems to be possible to have a different cpumemmap for different objects in a given virtual address space. The CPU portions of these cpumemmaps are ignored, right?

2. The memory portion of the cpumemmap associated with a process is used as a default for things like data, bss, and heap space? Is it used as a default for objects created subsequently? Or is it ignored?

3. The memory portion of the cpumemmap associated with a thread is used as a default for that thread's stack? Is it used as a default for objects created subsequently by this thread? Or is it ignored? If the answers to both #2 and #3 are "it is ignored", then I am not sure how the various node-restriction APIs would be implemented.

4. I made up the makecmsmemset() API, since I did not want to second-guess the layout. I know that this is not really what you had in mind, so am looking for guidance here.

5. I used NULL for cpumemset mems, since the simple-binding API does not care about the order of search.

6. We need to align the memory-binding behavior... One could implement bindtonode() by stepping through each page in the process's virtual-address space, but this might be a bit inefficient on 64-bit architectures. ;-)
#include <errno.h>
#include <stdlib.h>

typedef struct {
        cpumemset *set;
        cpumemmap *map;
} numasubset_t;

numasubset_t *restrictcpus(unsigned long cpus, numasubset_t *res)
{
        unsigned i;
        unsigned j;
        unsigned nbits;
        cms_lcpu_t *setcpus;
        cms_pcpu_t *mapcpus;
        cpumemmap *r = (res != NULL) ? res->map : NULL;
        numasubset_t *newres;

        if (r == NULL) {
                r = cmsQueryMAP(CMS_CURRENT, 0, (void *)0);
                                /* Do I need getpid()? */
                                /* Do I care about vaddr??? */
                                /* should I substitute &restrictcpus? */
                if (r == NULL) {
                        goto die0;
                }
        }
        nbits = 0;
        for (i = 0; i < sizeof(cpus) * 8; i++) {
                if (i >= r->nr_cpus) {
                        break;
                }
                if (cpus & (1UL << i)) {
                        nbits++;
                }
        }
        mapcpus = malloc(nbits * sizeof(*mapcpus));
        if (mapcpus == NULL) {
                errno = ENOMEM;
                goto die1;
        }
        setcpus = malloc(nbits * sizeof(*setcpus));
        if (setcpus == NULL) {
                errno = ENOMEM;
                goto die2;
        }
        j = 0;
        for (i = 0; i < sizeof(cpus) * 8; i++) {
                if (i >= r->nr_cpus) {
                        break;
                }
                if (cpus & (1UL << i)) {
                        mapcpus[j] = r->cpus[i];
                        setcpus[j] = j;
                        j++;
                }
        }
        newres = malloc(sizeof(*newres));
        if (newres == NULL) {
                errno = ENOMEM;
                goto die3;
        }
        newres->map = makecmsmemmap(nbits, mapcpus, r->nr_mems, r->mems);
        if (newres->map == NULL) {
                goto die4;
        }
        newres->set = makecmsmemset(nbits, setcpus,
                                    (res != NULL) ? res->set->nr_mems : 0,
                                    (res != NULL) ? res->set->mems : NULL);
                                /* XXX: should this query the current
                                   set when res == NULL? */
        if (newres->set == NULL) {
                goto die5;
        }
        return (newres);

die5:   cmsFreeCMM(newres->map);
die4:   free(newres);
die3:   free(setcpus);
die2:   free(mapcpus);
die1:   if (res == NULL || r != res->map) {
                cmsFreeCMM(r);
        }
die0:   return (NULL);
}

numasubset_t *restrictnodes(unsigned long nodes, numasubset_t *res)
{
        unsigned i;
        unsigned j;
        unsigned nbits;
        cms_lmem_t *setnodes;
        cms_pmem_t *mapnodes;
        cpumemmap *r = (res != NULL) ? res->map : NULL;
        numasubset_t *newres;

        if (r == NULL) {
                r = cmsQueryMAP(CMS_CURRENT, 0, (void *)0);
                                /* Do I need getpid()? */
                                /* Do I care about vaddr??? */
                                /* should I substitute &restrictnodes? */
                if (r == NULL) {
                        goto die0;
                }
        }
        nbits = 0;
        for (i = 0; i < sizeof(nodes) * 8; i++) {
                if (i >= r->nr_mems) {
                        break;
                }
                if (nodes & (1UL << i)) {
                        nbits++;
                }
        }
        mapnodes = malloc(nbits * sizeof(*mapnodes));
        if (mapnodes == NULL) {
                errno = ENOMEM;
                goto die1;
        }
        setnodes = malloc(nbits * sizeof(*setnodes));
        if (setnodes == NULL) {
                errno = ENOMEM;
                goto die2;
        }
        j = 0;
        for (i = 0; i < sizeof(nodes) * 8; i++) {
                if (i >= r->nr_mems) {
                        break;
                }
                if (nodes & (1UL << i)) {
                        mapnodes[j] = r->mems[i];
                        setnodes[j] = j;
                        j++;
                }
        }
        newres = malloc(sizeof(*newres));
        if (newres == NULL) {
                errno = ENOMEM;
                goto die3;
        }
        newres->map = makecmsmemmap(r->nr_cpus, r->cpus, nbits, mapnodes);
        if (newres->map == NULL) {
                goto die4;
        }
        newres->set = makecmsmemset((res != NULL) ? res->set->nr_cpus : 0,
                                    (res != NULL) ? res->set->cpus : NULL,
                                    0, NULL);
                                /* XXX: should this query the current
                                   set when res == NULL? */
        if (newres->set == NULL) {
                goto die5;
        }
        return (newres);

die5:   cmsFreeCMM(newres->map);
die4:   free(newres);
die3:   free(setnodes);
die2:   free(mapnodes);
die1:   if (res == NULL || r != res->map) {
                cmsFreeCMM(r);
        }
die0:   return (NULL);
}

void freerestrict(numasubset_t *restrict)
{
        cmsFreeCMM(restrict->map);
        cmsFreeCMS(restrict->set);
        free(restrict);
}

I don't know how to implement getcpu(), getnode(), cputonode(), or nodetocpu() based on this API.
int bindtocpu(unsigned long cpus, numasubset_t *restrict)
{
        numasubset_t *r1;

        r1 = restrictcpus(cpus, restrict);
        if (r1 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        cmsSetCMS(CMS_CURRENT, (void *)0, r1->set);
        cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map);
        freerestrict(r1);
        return (0);
}

int bindtonode(unsigned long nodes, int behavior, numasubset_t *restrict)
{
        numasubset_t *r1;

        r1 = restrictnodes(nodes, restrict);
        if (r1 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        r1->set->cms_policy = behavior; /* assume we rationalize */
        cmsSetCMS(CMS_CURRENT, (void *)0, r1->set);
        cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map);
        /* Need to step through all VM objects... How? */
        freerestrict(r1);
        return (0);
}

int setlaunch(unsigned long cpus, unsigned long nodes, int behavior,
              numasubset_t *restrict)
{
        numasubset_t *r1;
        numasubset_t *r2;

        r1 = restrictnodes(nodes, restrict);
        if (r1 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        r2 = restrictcpus(cpus, r1);
        freerestrict(r1);
        if (r2 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        r2->set->cms_policy = behavior; /* assume we rationalize */
        cmsSetCMS(CMS_CHILD, (void *)0, r2->set);
        cmsSetCMM(CMS_CHILD, 0, (void *)0, r2->map);
        freerestrict(r2);
        return (0);
}

int bindmemory(unsigned long start, size_t len, unsigned long nodes,
               int behavior, numasubset_t *restrict)
{
        numasubset_t *r1;
        unsigned long cur;

        r1 = restrictnodes(nodes, restrict);
        if (r1 == NULL) {
                return (-1);            /* adds ENOMEM */
        }
        r1->set->cms_policy = behavior; /* assume we rationalize */
        cmsSetCMS(CMS_CURRENT, (void *)0, r1->set);
        for (cur = start; cur < start + len; cur += PAGESIZE) {
                cmsSetCMM(CMS_VMAREA, 0, (void *)cur, r1->map);
        }
        freerestrict(r1);
        return (0);
}
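(A hypothetical caller of the simple-binding sketches above: bindtocpu() is handed a NULL numasubset_t, so restrictcpus() queries the current cpumemmap itself. Illustrative only:)

#include <stdio.h>

int main(void)
{
        /* Confine the current process to its first two logical cpus,
         * leaving the memory half of the map and set alone. */
        if (bindtocpu(0x3UL, (numasubset_t *)0) != 0) {
                perror("bindtocpu");
                return 1;
        }
        printf("now restricted to logical cpus 0 and 1\n");
        return 0;
}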
I have not yet adequately presented some aspects of > this data structure. I must add an example ... that might > connect with additional readers. > > Lets try this one: > > Example: > ======== > One way to understand these data structures is to look at > an example. > > Given the following hardware configuration: > > Let's say we have a four node system, with four CPUs > per node, and one memory per node, named as follows: > > Name the 16 CPUs: c0, c1, ..., c15 # 'c' for CPU > and number them: 0, 1, 2, ..., 15 # cms_pcpu_t > > Name the 4 memories: mn0, mn1, mn2, mn3 # 'mn' for memory node > and number them: 0, 1, 2, 3 # cms_pmem_t > > CpuMemMap: > > Now lets say the administrator (root) chooses to setup a > Map containing just the 2nd and 3rd node (CPUs and memory > thereon). The cpumemmap for this would contain: > > { > 8, # nr_cpus (length of cpus array) > p1, # cpus (ptr to array of cms_pcpu_t) > 2, # nr_mems (length of mems array) > p2 # mems (ptr to array of cms_pmem_t) > } > > where p1, p2 point to arrays of physical cpu + mem numbers: > > p1 = [ 4,5,6,7,8,9,10,11 ] # cpus (array of cms_pcpu_t) > p2 = [ 1,2 ] # mems (array of cms_pmem_t) > > This map shows, for example, that for this Map, logical cpu 0 > corresponds to physical cpu 4 (c4). > > CpuMemSet: > > Further lets say that an application running within this map > chooses to restrict itself to just the odd-numbered CPUs, and > to search memory in the common "first-touch" manner (local > node first). It would establish a CpuMemSet containing: > > { > CMS_DEFAULT, # cms_policy > 4, # nr_cpus (length of cpus array) > q1, # cpus (ptr to array of cms_lcpu_t) > 2, # nr_mems (length of mems array) > q2, # mems (ptr to array of cms_memory_list) > } > > where q1 points to an array of 4 logical cpu numbers and q2 to an > array of 2 memory lists: > > > q1 = [ 1,3,5,7 ], # cpus (array of cms_lcpu_t) > q2 = [ # See "Verbalization example" below > { 3, r1, 2, s1 } > { 2, r2, 2, s2 } > ] > where r1, r2 are arrays of logical cpus: > r1 = [1, 3, CMS_DEFAULT_CPU] > r2 = [5, 7] > and s1, s2 are arrays of memory nodes: > s1 = [0, 1] > s2 = [1, 0] > > Verbalization examples: > > To read item q1 out loud: > > Tasks in this CpuMemSet may be scheduled on any of > the logical CPUs [ 1, 3, 5, 7 ], which correspond > in the associated Map with physical CPUs c5, c7, c9 > and c11. > > To read item q2 out loud: > > If a fault occurs on any of the 2 explicit CPUs in > r1, then search the 2 memory nodes in s1 in order, > looking for available memory (mn1, then mn2). > > If a fault occurs on any of the 2 CPUs in r2, search > the 2 memory nodes in s2 in order (mn2, then mn1). > > If a fault occurs on any other CPU, then since the > CMS_DEFAULT_CPU value is listed in r1, search the > 2 memory nodes in s1 in order (mn1, then mn2). > > Interpretations of the above: > > The meaning of "s1 = [0, 1]" is that if a page fault occurs on > the logical CPUs in "r1 = [1, 3, CMS_DEFAULT_CPU]", then the > allocator should search logical memory node 0 first (that's > the memory on physical node 1 - mn1), then search logical > memory node 1 second (the memory on physical node 2 - mn2). > > The meaning of "s2 = [1, 0]" is that if a page fault occurs > on the logical CPUs listed in "r2 = [5, 7]", then the same > memory nodes are searched, but in the other order, mn2 then mn1. 
> > In particular, if a vm area using the above CpuMemSet was > also shared with an application running on some other Map, > and that application faulted while running on some CPU not > explicitly listed in the above CpuMemSet (item r1 or r2), > then the allocator would search mn1 first, then mn2, for > available memory. This is because CMS_DEFAULT_CPU is listed > amongst the CPUs in r1, and the corresponding s1 is equivalent > to the ordered array of physical memory nodes [mn1, mn2]. > > Observation: > > The allocator need have _no_ notion of distance. It just > searches, in order specified, the memory list prescribed > for that vm area, for a fault on the specified CPU (or the > CMS_DEFAULT_CPU). To provide the usual first-touch, distance > ordered memory search, some system service or utility must > sort the memory lists in distance order. > > ======== > > I should add the above example to my Design Notes. > > My apologies for the tediousness of this example. I realize > that the above data structure is a layer or two deeper than > intuitions expect. However when I methodically strip all > (most?) higher level policy from the various CPU and memory > API's we need to support, the above is what I am left with, as > the necessary generic multiplexor between a variety of API's > and the specific needs of the static placement logic in the > kernel allocation and scheduliing code. > > For example, observe that no notion of locality domain or node > exists here - it has been disassembled into simple lists of > CPUs and chunks of memory, called here 'memory nodes' only > because there tends to be one chunk per node, and I couldn't > find a better noun to name a maximal contiguous (?) chunk of > memory that is equidistant from all processing elements. > > > > > Seems that I can either (1) require that a memory list _always_ > > > be specified for the CMS_DEFAULT_CPU, or (2) mandate that > > > attempts to allocate memory when the CpuMemSet does not specify > > > (even by CMS_DEFAULT_CPU) any memory list for the currently > > > executing cpu must fail. Choice (2) would cause nasty, obscure > > > and intermittent errors. So it must be choice (1). > > > > Agreed that deterministic errors are much better than non-deterministic > > errors! But how do you indicate which memory is associated with > > CMS_DEFAULT_CPU? > > The value CMS_DEFAULT_CPU is included on a the cpu list > (r1 or r2, in the above example) in one of the memory lists. > > ==> Memory lists have a list of memory nodes, sorted in search > order, _and_ a list of CPUs to which that memory list applies. > > > Must the CPU and the memory lists be the same size? > > My guess was "no", since there are separate fields for the length of > > each. > > Correct - they need not be, and in the case of architectures > with multiple CPUs per memory node, typically are not the same size. > > > Or does CMS_DEFAULT_CPU just start the search of memory from the > > first element of the array? > > er eh no. This question confuses me. > > CMS_DEFAULT_CPU chooses which memory list to search if the > faulting cpu is not explicitly listed. > > The search order is by default (CMS_DEFAULT) always from the > first element of the memory node array (s1 and s2, above), unless > the CMS_ROUND_ROBIN policy is specified for that CpuMemSet. > > > > > What is a "least-memory utilization policy"? I am not familiar > > > with that term. > > > > It means that, at page-fault time, you allocate memory from the > > node/memory with the least utilization. 
> > This is clearly then a dynamic placement policy, not a static > one. I would expect the implementation of a "least-memory > utilization policy" to add code to the allocator and elsewhere > to track memory utilization. And I would expect the CpuMemSet > (or at least CpuMemMap) to control which memory nodes could > be searched. But the order of search would depend on more > dynamic code outside the domain of CpuMemSets. > > > > > > ==== > > > > A cpumemset has no meaning except in the context of a cpumemmap, > > right? > > > > A cpumemmap maps the CPU numbers, while a cpumemset simply restricts > > > > them, right? (Could make it work either way, but things like getcpu > > () > > > > and getnode() need to be in sync with the choice.) For example, > > suppose > > > > that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose > > that > > > > the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. > > What > > > > is the physical CPU corresponding to logical CPUs 0, 1, and 2? > > > > > > Yes, Yes, tell me about getcpu/getnode, and {100,102,103}. > > > > > > My presumption, from reading this, is that getcpu(), executed in this > > > map and set, would return either 100 or 103, rather oblivious to the > > > CpuMemMap. I have no clue what getnode() would or should do that > > > relates to CpuMemSets. Perhaps we need a "getcmscpu()"? > > > > I believe that we need a way to get the physical CPU ID. So, at this > > point, do we have two levels of ID, or three? In case of diagnostics, > > you want to identify the physical CPU. So we quickly get into the > > issue that Martin Bligh raised earlier. ;-) > > When you want the physical CPU ID, as with diagnostics, then getcpu() > provides the physical CPU ID, which should be no more ambiguous than > it was before CpuMemSets. I only see one level of ID here. I see > nothing logical about this <grin>. > > > > For related sets of processes running as root, you can end up with many > > more levels of ID. Process 100 maps to exclude the first CPU, then forks > > process 101, which maps to exclude its idea of the first CPU, and so on. > > This problem is inherent in the notion of virtualizing the CPUs. We could > > try to eliminate the middle level (cms_pcpu_t), but that could require > > using whatever ugly IDs the hardware wanted to provide. Maybe the name > > of the cms_pcpu_t level should be "complete" instead of "physical"? Or > > some other naming? > > No - cms_pcpu_t is the ugly hardware ID (well one of them - > seems that these too come in a couple forms, such as compact > or not). This is not a hall of infinitely regressing mirrors. > There are only two levels, the ugly physical ID level, and the > logical mapping to the integers 0..N-1 (logical ID) for any > given subset of size N CPUs or memory nodes. > > I suspect that CpuMemMaps are a degree less fancy than you are > presuming, while CpuMemSets are a couple degrees more fancy, > as in the above example ;). > > > > So the relevant numbering schemes are the following: (1) the numbering > > that the current process sees, as mapped by the CpuMemMap that controls > > it, and (2) the underlying physical identifiers that would make sense > > to someone servicing the hardware. > > Yes - exactly. How we ended up at this same point, after the > above confusions, baffles me a bit. oh well. > > > > > > > HP's launch policies include policies for CPU allocation (round > > > > robin and the like). 
My guess is that this requires either > > > > additional bits in cms_policy or another field for CPU (as > > > > opposed to memory) policies. > > > > > > Aha - might be so. I should investigate this further. > > > > http://lse.sourceforge.net/numa/manpages/hpux-mm.txt, then search for > > "Launch". > > Here I see several dynamic scheduling policies: > ROUND_ROBIN > PACKED > FILL > LEASTLOAD > > As with the dynamic "least-memory utilization" policy above, > I would expect CpuMemSets to control which CPUs were eligible > for scheduling, but that the control of details of scheduling > would be left to other mechanisms. > > > > > Non-root processes can restrict where they execute tasks or allocate > > > memory by altering their CpuMemSet, within the confines of the CpuMemMap > > > established for them. Why is that not enough? > > > > My concern is that non-root processes cannot virtualize their children. > > If you want to divide the CPUs and memory available to you into two > > pieces, and run a child in each piece, you can do so, but you cannot > > have both children thinking that they have their own CPU 0. > > Yes, this limitation exists. Below you suggest letting non-root > processes further restrict their map. That would be doable - > a modest wart, in that it complicates what was a simple story: > root manipulates maps, anyone manipulates sets. How serious > is this limitation? > > > > > > (At this point, I am favoring letting non-root processes > > > > manipulate cpumemmaps, with the restriction that any that > > > > arevinstalled must be pure restrictions of the current > > > > cpumemmap.) > > Please let me know whether my non-root virtualization example > > above seems reasonable to you. > > I am still lacking an appreciation of the benefit of this > sufficient to justify the extra half-twist of logic. I'm > open to hearing more. > > > > > > Is the user supposed to directly manipulate the fields of the > > > > cpumemmap and cpumemset? > > > > > > Yes, via the cmsSet*() system calls. Well, as direct as > > > anything in a protected operating system -- the user politely > > > asks the kernel to set a map or set, and doesn't directly write > > > kernel memory. > > > > > > My hunch here is that I missed the real point of this question ... > > > > Should the interfaces be used as follows? > > > > p = cmsQueryCMS(CMS_CURRENT, (void *)0); > > if (p->nr_cpus > 1) { > > p->nr_cpus--; > > } > > cmsSetCMS(CMS_CURRENT, (void *)0, p); > > Ah ... that gets dicey. What I hear you asking is whether it > is appropriate for an application (more likely, some library > supporting a friendlier interface on top of CpuMemSets) to (1) > get a response from a cmsQuery*(), (2) munge it in place, and > (3) push it back down a cmsSet*() call. > > As we have already seen in the Example above, the key data > structure is a tad nasty to deal with in C, with its nested > variable length arrays. It is especially nasty across the system > call boundary -- the kernel can't exactly malloc and assemble > multiple small chunks of user memory during the response to a > single system call. > > This is in good part the motivation for the cmsFree*() calls, > to isolate the caller of these routines from knowing just how > the memory for them is allocated. > > For a change such as you give in your example above, changing > a value in place, that's no problem .. because it makes no > assumptions as to how the memory was allocated. Changing a > Policy flag would be easy, for the same reason. 
Shortening > one of the arrays, and even overwriting the values in them, > is fine. > > Any lengthening of an array should be done with a deep copy > into memory that is managed in ways known to the caller, so > that the caller knows how to free that memory, when finished. > > > > One thing we need to hash out is what the unit of memory control is. > > In the simple-binding proposal, it is an arbitrary range of virtual > > memory (similar to what madvise() might do), while in the Process > > Scheduling and Memory Placement proposal, it appears to be an object. > > Conflicts are handled in a last-change-wins manner? > > Yes - I see that section in the "Proposed NUMA API" now. > > There are a couple of things here we need to hash out. My reading > of this section is that bindmemory() has a couple of capabilities > that are not a natural fit with CpuMemSets: > > 1) Does bindmemory (vaddr, len, ...) apply to: > a) already mapped and allocated memory, or > b) already mapped, but not allocated, memory, or > c) also extended mappings in already known vm areas > d) any future mapping or remapping in the vaddr range? > > Since these bindings are inherited across fork/exec, I guess > it must be (d). This means that the binding request has > to be kept as an inherited property of the process, and > recalculated for applicability any time that any mapping > is changed, whether to grow or shrink an existing vm area > or to add or remove a vm area. CpuMemSets are, as you note, > per object (per vm area), not for an arbitrary address range > and whatever portions of whatever future vm areas might > happen to overlap that range. > > 2) bindmemory() requires per-process specifications as to > where to find memory. CpuMemSets supports per-cpu > specifications (based on which cpu is executing the > page_alloc request). > > My first inclination is to recommend against supporting > capability (1) above, on the grounds that its semantics don't > respect very well the existing objects that the kernel supports. > That, or I failed to understand it, quite possible. > > The second capability, per-process rather than (or in addition > to) per-cpu memory search specifications is at least more > doable, but still seems not quite right to me. > > I am of course open to hearing the motivation for these > capabilities. I may be a purist at heart, but I am a pragmatist > in my wallet. > > Also perhaps you could comment on the status of the Simple > Binding proposal. Whether it is in use, or is required to > support APIs in use. What level of flexibility exists on such > significant semantics as the two capabilities just above. > I won't rest till it's the best ... > Manager, Linux Scalability > Paul Jackson <pj...@sg...> 1.650.933.1373 |
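A worked sketch of the two-level numbering discussed in this exchange, using the {100,102,103} example. The struct layout here (an nr_cpus count plus a cpus[] array indexed by application cpu number) is an assumption for illustration only, not the published draft header:

    /* Hypothetical layout: maps renumber, sets merely restrict. */
    typedef struct {
        int nr_cpus;    /* how many application cpu numbers are mapped */
        int cpus[8];    /* cpus[app_cpu] == system ("ugly") cpu number */
    } cpumemmap_sketch;

    /* McKenney's example: system cpus 100, 102 and 103. */
    static cpumemmap_sketch cmm = { 3, { 100, 102, 103 } };

    /* Application cpu 0 is system cpu 100, 1 is 102, 2 is 103.
     * A cpumemset of {0,2} then permits scheduling only on system
     * cpus 100 and 103; the set does no further renumbering. */
    int app_to_system(const cpumemmap_sketch *m, int app_cpu)
    {
        if (app_cpu < 0 || app_cpu >= m->nr_cpus)
            return -1;      /* not mapped */
        return m->cpus[app_cpu];
    }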
From: Paul J. <pj...@en...> - 2001-10-29 01:58:31
|
Sorry for the delay in responding ... On Fri, 19 Oct 2001, Paul McKenney wrote: > The example -really- helped! Good - I will include it in the next draft of the Design Note that I am preparing for release soon. > I believe that we will really want to have a default value > (e.g., NULL) for the pointer to the cms_memory_list, since for > many applications a default of either "search my node first, > then search the rest however you wish" or "search my node only" > will suffice. I believe that it is a bit cruel and unusual to > force someone to specify either of these in great detail. ;-) Check out the new cmsDeepCopy*() methods below. They should insulate your code from munging with the memory lists when not needed. However, understand that this CpuMemSet API is intended to support the essential kernel infrastructure for cpu and memory placement. It matters less that it be user friendly. It is intended for use in supporting other, more friendly, but perhaps less generic, APIs, such as your simple binding. Granted, it shouldn't be unnecessarily user hostile, and from the looks of your programming examples below, it needs to be a little friendlier. The cmsDeepCopy*() methods, below, should help considerably. They convert the inscrutable memory layout of the CpuMemSets returned by the cmsQuery*() methods into the linked malloc elements you can work with predictably. > The reason I keep wanting non-root processes to be able to > restrict their cpumemmaps is that the cpumemmap is the only > thing in your proposal that is capable of virtualizing CPUs > and memory. So, an application that wishes to partition > itself needs either to track the CPU numbers and offsets > itself, or it needs to provide a different cpumemmap to each > of its children. Having done quite a bit of the former, the > latter seems quite attractive. ;-) It is my intention, not yet very well articulated, to have tasks and vm areas explicitly (as part of the published API) share Sets and Maps that they inherit from a common origin, and to allow operations on those shared Sets and Maps that affect all those sharing it. So far as I can tell, the naive implementation of this sharing is quite incompatible with allowing the non-root map restricting that you describe. I can't simply shrink (restrict) a shared map, for the benefit of one of those sharing it, without either breaking the share, or else affecting the innocent. You seem to be asking that an existing map be used to perform an additional mapping because it "is the only thing in [my] proposal that is capable of virtualizing CPUs and memory." This is not a compelling reason in my book, and it (cpumemmap) doesn't do that well, in this case, as explained above. At this point, given that I would minimize what the kernel is asked to do, I can only ask: How can we make this more pleasant for user code, with some additional or improved library code? > Also, I need to manipulate cpumemmaps to implement pieces of > the simple-binding API. Elaborate ... again, I am seeking out essential semantics that require kernel involvement, as opposed to what the library codes can do to ease the burden. > Manual manipulation of the cpumemset structures seems quite > painful. Am I right in assuming that one cannot invoke > cmsFreeCMS on hand-built cpumemset structures? If I use > my crude trick of decrementing the nr_cpus field, am I then > prohibited from calling cmsFreeCMS(), or does cmsFreeCMS() > hide some additional information away somewhere? 
This can and should be improved, once again at the library level. Yes, as you so eloquently note a few lines lower, the cmsFree*() calls are telepathic. They have special knowledge of the memory layout of the cpumemsets and maps returned from cmsQuery*() calls. Your coding examples lower down were quite helpful in pointing out to me the difficulties you are seeing. I propose to add two more methods to the cpumemset library, to construct deep copies of sets and maps, using malloc for each element. Each discrete element (structure or array) will be allocated using malloc(). I will also teach the cmsFree() routines to distinguish between those maps and sets returned from the cmsQuery*() calls, and those returned from the cmsDeepCopy*() calls, and be able to free either one. I suspect this means adding an entry to the cpumemmap and cpumemset data structures, indicating how they were allocated (by cmsQuery, cmsDeepCopy or otherUser). So the rules become: * You can construct your own maps and sets, and pass to cmsSet*(). * You can examine the maps and sets returned from cmsQuery*() calls. * You can free your own map and set constructions as you will. * You can only free maps and sets from cmsQuery*() calls using cmsFree*() * To munge a set or map in place, first cmsDeepCopy*() it, and use the copy. * You can free a deep copy using the cmsFree*() methods. * You can also free a deep copy using free() on each element. Thus the following are added: cpumemmap *cmsDeepCopyMap (const cpumemmap *cmm); cpumemset *cmsDeepCopySet (const cpumemset *cms); > So here is what I believe I need to do to write the > simple-binding API in terms of cpumemset and cpumemmap. > I assume that cmsFreeCMS() is telepathic, since I cannot > think of a way that it could do what I want it to do... > I also take the liberty of leaking memory right and left. > The purpose of this exercise is to see if I understand the > cpumemsets. Once I find that I do, I will try one of the > existing proprietary-Unix APIs. And only -then- will I worry > about optimal, nice-looking code!!! > > This exercise raised the following questions: > > 1. It seems to be possible to have a different cpumemset > for different objects in a given virtual address space. > The CPU portions of these cpumemsets are ignored, right? Yes - possible. Yes - ignored. > 2. The memory portion of the cpumemset associated with > a process is used as a default for things like data, > bss, and heap space? Is it used as a default for objects > created subsequently? Or is it ignored? As published about 2 weeks ago, it was ignored. In the LSE conference call perhaps 8 days ago, you persuaded me that this was wrong. Now the memory portion of a task's *current* cpumemset is used for objects created subsequently by that task. > 3. The memory portion of the cpumemset associated with a > thread is used as a default for that thread's stack? Is > it used as a default for objects created subsequently by > this thread? Or is it ignored? As above - no longer ignored, thanks to your guidance. > 4. I made up the makecmsmemset() API, since I did not want to > second-guess the layout. I know that this is not really what > you had in mind, so am looking for guidance here. Use cmsDeepCopy*() to get maps and sets with known layout. > 5. I used NULL for cpumemset mems, since the simple-binding API > does not care about the order of search. Use cmsDeepCopy*(), and then leave the maps 'as is', if you don't need to change them. > 6. We need to align the memory-binding behavior... 
One could > implement bindtonode() stepping through each page in the > process's virtual-address space, but this might be a bit > inefficient on 64-bit architectures. ;-) Here I lost you. Can't figure out what you're saying, and don't like what I'm guessing you're saying. This sounds like a part of the discussion we had in the last LSE conference call, as to whether it was essential to support memory placement policies on any (page aligned) address range, irrespective of vm area (object) boundaries. I just sent out an email inquiry to Rik van Riel and Andrea Arcangeli, asking for their input on this issue of page vs object memory placement policy granularity. |> > typedef struct { |> > cpumemset *set; |> > cpumemmap *map; |> > } numasubset_t; |> > |> > numasubset_t *restrictcpus(unsigned long cpus, numasubset_t *res) |> > { |> > [ snip 85 lines of code ... ] |> > } Hmmm ... I made an effort to grok this code ... but failed. I guess you are trying to compute a "numasubset_t" that expresses a restriction, for possible further use in a bindto*() call. For this, try: /* * Restrict the cpus in a CpuMemSet to the intersection * of those currently in it, and those specified by * 'rcpus'. 'rcpus' is a bit vector, where bit N is set * iff the application can run on cpu N, using CpuMemSet * 'application' cpu numbering. * * Return a new, malloc'd, further restricted, CpuMemSet, or * NULL on error. */ cpumemset *restrictcpus (unsigned long rcpus, cpumemset *cms1) { cpumemset *cms2; /* deep copy of cpumemset */ int i, j; /* scan cms2.cpus, copy only if in rcpus */ cms2 = cmsDeepCopySet (cms1); if (cms2 == NULL) return NULL; /* copy cpus[] in place; drop if corresponding rcpus bit not set */ for (i = 0, j = 0; i < cms2->nr_cpus; i++) { int n = cms2->cpus[i]; if (rcpus & (1 << n)) cms2->cpus[j++] = n; } cms2->nr_cpus = j; return cms2; } |> > numasubset_t *restrictnodes(unsigned long nodes, numasubset_t *res) |> > { |> > [ snip more lines of code ... ] |> > } Now this one confuses me even more. I guessed that a node (in your "Proposed NUMA API") is essentially a set of cpus, for the purpose of restrictnodes(). I don't see where the paper states this clearly, but it does go into length about numbering cpus in ways consistent with the node numbering. So I expected to see something involving nodetocpu() used to convert nodes to cpus. But it seems that this code is just the restrictcpus() code, with the word 'cpus' replaced with 'nodes'. My hunch is (really shooting in the dark here) that you want to add a method that will convert a set of nodes into a set of all the cpus on those nodes (not just the first cpu), and then have restrictnodes() simply call that conversion, and make use of restrictcpus(). ... I'm hearing the sound of someone sawing on a limb, right behind me ... But, continuing to climb further out ... perhaps by node restriction you want memory, not CPU, restriction. The "Proposed NUMA API" paper isn't clear to me on this point: On the second page, under "Simple Node Restrictions", it doesn't say whether restrictnode() applies to CPUs or memory, except to note that "no CPU restriction will be implied" in one case. But later on, under "Bind Tasks to Node(s)" it clearly states "helpful to bind a task's memory ...". Ok - let's imagine that you wanted a routine to restrict a CpuMemSet to a subset of memory nodes (as called in the CpuMemSet design notes). For that let's try: /* * Restrict the mems in a CpuMemSet to the intersection of * those currently in it, and those specified by 'rmems'. 
* 'rmems' is a bit vector, where bit N is set iff the * application can allocate memory on node N, using CpuMemSet * 'application' memory node numbering. * * Return a new, malloc'd, further restricted, CpuMemSet, * or NULL on error. */ /* First - a helper routine to restrict one memory list, in place. */ static void restrict_mem_list ( cms_memory_list *p_mems, unsigned long rmems) { int i, j; /* scans mems, copy only if in rmems */ for (i = 0, j = 0; i < p_mems->nr_mems; i++) { int n = p_mems->mems[i]; if (rmems & (1 << n)) p_mems->mems[j++] = n; } p_mems->nr_mems = j; } cpumemset *restrictnodes(unsigned long rmems, cpumemset *cms1) { cpumemset *cms2; /* points to resulting cpumemset */ int i; /* scan cms2 mems */ cms2 = cmsDeepCopySet (cms1); if (cms2 == NULL) return NULL; for (i = 0; i < cms2->nr_mems; i++) restrict_mem_list (cms2->mems[i], rmems); return cms2; } |> > void freerestrict(numasubset_t *restrict) |> > { |> > cmsFreeCMM(restrict->map); |> > cmsFreeCMS(restrict->set); |> > free(restrict); |> > } I don't think that freerestrict() is needed. Nor is numasubset_t needed. Don't worry about the map - just work with the set, except while discovering the topology, if the system (aka physical) cpu and memory node numbers embed critical topology information. > I don't know how to implement getcpu(), getnode(), cputonode(), or > nodetocpu() based on this API. This API doesn't know nodes (as collections of cpus, if that is what you mean). I can't help you there, except to suggest that whatever root-privileged process sets up the CpuMemMaps to be used by applications making use of this simple-binding API should assign the application cpu numbering in accordance with the rules set forth in the "Proposed NUMA API" paper, relating cpu numbering to node numbering. I need to add a getcmscpu() call to this API, to return the current application cpu number on which a task is executing. Then your getcpu() is just getcmscpu(). I also need to make cmsQueryCMM() not require root privilege, so that you can use it, along with other information in /proc, to discover topology, and understand how that topology maps to the application cpu numbers used in your CpuMemSets. |> > int bindtocpu(unsigned long cpus, numasubset_t *restrict) |> > { |> > numasubset_t *r1; |> > |> > *r1 = restrictcpu(cpus, restrict); |> > if (r1 == NULL) { |> > return (-1); /* adds ENOMEM */ |> > } |> > cmsSetCMS(CMS_CURRENT, (void *)0, r1->set); |> > cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map); |> > freerestrict(r1); |> > } For this, try: int bindtocpu (unsigned long rcpus, cpumemset *cms1) { cpumemset *cms2; int r; cms2 = restrictcpus (rcpus, cms1); if (cms2 == NULL) return -1; r = cmsSetCMS (CMS_CURRENT, (void *)0, cms2); cmsFreeCMS (cms2); return r; } |> > int bindtonode(unsigned long nodes, |> > int behavior, |> > numasubset_t *restrict) |> > { |> > numasubset_t *r1; |> > |> > *r1 = restrictnode(nodes, restrict); |> > if (r1 == NULL) { |> > return (-1); /* adds ENOMEM */ |> > } |> > r1->set->cms_policy = behavior; /* assume we rationalize */ |> > cmsSetCMS(CMS_CURRENT, (void *)0, r1->set); |> > cmsSetCMM(CMS_CURRENT, 0, (void *)0, r1->map); |> > /* Need to step through all VM objects... How? 
*/ |> > freerestrict(r1); |> > } If by nodes, you mean memory nodes, not collections of cpus, try: int bindtonode (unsigned long rmems, int behavior, cpumemset *cms1) { cpumemset *cms2; int r; cms2 = restrictnodes (rmems, cms1); if (cms2 == NULL) return -1; r = cmsSetCMS (CMS_CURRENT, (void *)0, cms2); cmsFreeCMS (cms2); return r; } |> > int setlaunch(...) |> > { |> > ... |> > cmsSetCMS(CMS_CHILD, (void *)0, r2->set); |> > cmsSetCMM(CMS_CHILD, 0, (void *)0, r2->map); |> > ... |> > } Yes - setlaunch will set the CMS_CHILD cpumemset. If it has to also set the cpumemmap, it must be root; I hope this isn't needed. |> > int bindmemory(...) |> > { |> > ... |> > *r1 = restrictnode(nodes, restrict); |> > ... |> > cmsSetCMS(CMS_CURRENT, (void *)0, r1->set); |> > cur = start + len - PAGESIZE; |> > for (; cur >= start; cur -= PAGESIZE) { |> > cmsSetCMM(CMS_VMAREA, 0, (void *)cur, r1->map); |> > } |> > } I remain confused as to whether restrictnode() means to restrict scheduling tasks to the cpus on the specified nodes, and/or means to restrict allocating memory to the specified memory nodes. If your API has no notion of where its vm objects lie, and has to accept instructions that require it to walk, page by page, over large chunks of virtual address space, trying to bind whatever vm objects happen to be lurking underneath, then this is unfortunate. At least consider opening "/proc/self/maps", to learn what vm objects lurk beneath the virtual address surface. Thank-you, Paul, for your good questions and suggestions. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
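A minimal sketch of the /proc/self/maps suggestion above: enumerate the process's vm areas once, instead of walking an address range page by page. The maps line format (start-end, then permissions and so on) is standard; what to do per area is left to the caller as a stub, since that part of the API is still under discussion.

    #include <stdio.h>

    /* Call fn once per vm area listed in /proc/self/maps. */
    int for_each_vm_area(void (*fn)(unsigned long start, unsigned long end))
    {
        char line[256];
        unsigned long start, end;
        FILE *fp = fopen("/proc/self/maps", "r");

        if (fp == NULL)
            return -1;
        while (fgets(line, sizeof(line), fp) != NULL)
            if (sscanf(line, "%lx-%lx", &start, &end) == 2)
                fn(start, end);     /* one call per vm area, not per page */
        fclose(fp);
        return 0;
    }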
From: Paul M. <Pau...@us...> - 2001-10-30 04:54:45
|
Hello, Paul! Thank you for the good feedback on my first attempt to express simple binding in terms of cpumemsets and cpumemmaps. I am going to study your note for a bit, then make another attempt. Thanx, Paul |
From: Paul D. <pad...@us...> - 2001-11-02 23:55:52
|
Hi Paul, I have a few comments regarding the cpumemset document. Paul Dorwin pd...@us... --- You refer to memory as memory nodes. Other times you refer to the same thing as a node. And still other times you refer to a node in the more familiar context of a container. To me, the term node refers to something which can contain cpus, memory, and IO. I would be more comfortable with some other term which refers to a range of memory. --- In 'Using CpuMemSets' you say: On systems supporting hot-swap of CPUs (or even memory, if someone can figure that out) the system administrator would be able to change CPUs and remap by changing the application's CpuMemMap, without the application being aware of the change. How are you doing this? Will there be /proc/<pid>/cpumemset and /proc/<pid>/cpumemmap interfaces? A /proc interface would also be useful for managing an application which is already running. One could view existing memmaps by cat /proc/123/cpumemmap. A line for each memmap used by the application would be printed. Using your example, a line could be displayed as follows: addr size 8 4,5,6,7,8,9,10,11 2 1,2 Using your example again, could one modify an existing application from the command line by specifying a memmap as follows: echo "migrate addr size 8 4,5,6,7,8,9,10,11 2 1,2" > /proc/123/cpumemmap And finally, the process could be migrated to processors 4-11 via echo ff0 > /proc/123/cpus_allowed. You could also use /proc/cpumemset and /proc/cpumemmap to alter the system defaults. --- In 'Processors, Memory and Distance' your discussion of <cpu,mem> distances deals primarily with cache warmth issues. Should you also discuss the disadvantages of scheduling a process on a cpu further from where the physical pages are contained? For example, you run on node 0 and allocate pages from the memory on that node. If you sleep (maybe on IO) you no longer have any cache warmth. However, you would still incur a potentially more expensive penalty if you are scheduled on a cpu on another node because you now have to pull all data into cache over a longer latency/lower bandwidth pipe. Does any of this make sense? |
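For reference, the "ff0" in the cpus_allowed example above is just the bitmask with bits 4 through 11 set, one bit per processor:

    /* cpus_allowed mask for processors 4..11: eight set bits,
     * shifted up by four. */
    unsigned long cpus_4_to_11 = ((1UL << 8) - 1) << 4;     /* == 0xff0 */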
From: Paul J. <pj...@en...> - 2001-11-03 01:09:42
|
On Fri, 2 Nov 2001, Paul Dorwin wrote: > I have a few comments regarding the cpumemset document. Excellent - thank-you! > You refer to memory as memory nodes. Other times you refer to > the same thing as a node. And still other times you refer to > a node in the more familiar context of a container. To me, > the term node refers to something which can contain cpus, > memory, and IO. I would be more comfortable with some other > term which refers to a range of memory. I would be more comfortable with another name as well ;). Any suggestions? If I had to pick an alternative right now, it would be "memory chunk". Earlier I tried "memory bank", but that had too many prior connotations to me. I used "memory node" in this Version because it seemed that on the architectures we are currently concerned with (the big ia64 numa systems I knew of) there was a one-to-one relation between chunks of memory and system nodes. I thought I had been fairly pedantic in using "memory node" everywhere, except sometimes when using that term multiple times in a single sentence, and I hoped that secondary references could be abbreviated to just "nodes" without confusion. If you see a contrary instance, I'd be happy to fix it. Or if you have a better name, I'm interested. > > --- > > In 'Using CpuMemSets' you say: > > On systems supporting hot-swap of CPUs (or even memory, if someone can > figure that out) the system administrator would be able to change CPUs > and remap by changing the application's CpuMemMap, without the > application being aware of the change. > > How are you doing this? See the Bulk Remap call (CMS_BULK_ALL). It goes through and alters any CpuMemMap as requested, perhaps to remove a cpu or memory node (according to its system numbering) from service, by replacing that system number with another. The implementation then walks through the tasks and vm areas in the system, recomputing cpus_allowed and zone lists as need be, to reflect the changed CpuMemMaps. By the time that one system call returns, no further task will be scheduled on the mapped out cpu, and no further memory will be allocated on the mapped out memory node. > Will there be /proc/<pid>/cpumemset and > /proc/<pid>/cpumemmap interfaces? No plans for this, though it's possible. > A /proc interface would also be useful for managing an application > which is already running. Use the cms*() calls with "pid" arguments, such as: cmsQueryCMM, cmsSetCMM, cmsQueryCMSbyPid, cmsSetCMSbyPid, cmsBulkRemap to manage currently running applications. I find the use of /proc to manage a system, as opposed to (1) report on it, or (2) toggle obscure debug hooks, to be an ugly interface, and resist such. From the latest work I see from Rusty Russell <ru...@ru...>: [PATCH] 2.5 PROPOSAL: Replacement for current /proc of shit. I am not alone in this opinion. > One could view existing memmaps by cat /proc/123/cpumemmap. > A line for each memmap used by the application would be printed. > Using your example, a line could be displayed as follows: > > addr size 8 4,5,6,7,8,9,10,11 2 1,2 > > Using your example again, could one modify an existing application > from the command line by specifying a memmap as follows: > echo "migrate addr size 8 4,5,6,7,8,9,10,11 2 1,2" > /proc/123/cpumemmap ugh - try instead: cmsSetCMM (CMS_CURRENT, 123, 0, 0, &cmm); > And finally, the process could be migrated to processors 4-11 > via echo ff0 > /proc/123/cpus_allowed. 
There is no visible "cpus_allowed" in CpuMemSets - rather cpus_allowed is an implementation detail of the task scheduler for systems with fewer than 64 cpus. Instead, we need command line utilities, built on the CpuMemSets infrastructure, to support such migration and related tasks. > You could also use /proc/cpumemset and /proc/cpumemmap to alter the > system defaults. You _could_. I hope not. Also there is no particularly interesting "system default" map or set, beyond the initial one the kernel sets up during boot, and uses when starting the init process. From that point forward, all maps and sets are inherited or user specified. > --- > > In 'Processors, Memory and Distance' your discussion of <cpu,mem> > distances deals primarily with cache warmth issues. Should you also > discuss the disadvantages of scheduling a process on a cpu further > from where the physical pages are contained? My recollection is that I have two distances: <cpu, mem> - for modeling cpu to memory latency/bandwidth <cpu, cpu> - for modeling cache warmth Perhaps something in my presentation is confusing these two? > For example, you run on node 0 and allocate pages from the memory on > that node. If you sleep (maybe on IO) you no longer have any cache > warmth. However, you would still incur a potentially more expensive > penalty if you are scheduled on a cpu on another node because you now > have to pull all data into cache over a longer latency/lower bandwidth > pipe. This example sounds like it is getting at <cpu, mem> distances. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
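A sketch of the hot-swap scenario described above. The design note names the Bulk Remap call and the CMS_BULK_ALL selector, but the argument list used here (old and new system cpu numbers) is only a guess for illustration, not the draft API:

    /* Hypothetical signature -- assumed, not taken from the draft. */
    int retire_cpu(int old_sys_cpu, int new_sys_cpu)
    {
        /* Rewrite every CpuMemMap naming old_sys_cpu to name
         * new_sys_cpu instead; per the text, the kernel then
         * recomputes cpus_allowed and zone lists for the affected
         * tasks and vm areas before the call returns. */
        return cmsBulkRemap(CMS_BULK_ALL, old_sys_cpu, new_sys_cpu);
    }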
From: Hubertus F. <fr...@wa...> - 2001-11-05 16:32:17
|
Well my 2.5 cents: Under "Chunk" I usually imply a smaller piece of a large thing. Hence this doesn't really fit. "MemoryBlock" would fit better with me. -- Hubertus * Paul Jackson <pj...@en...> [20011102 20;09]: > On Fri, 2 Nov 2001, Paul Dorwin wrote: > > You refer to memory as memory nodes. Other times you refer to > > the same thing as a node. And still other times you refer to > > a node in the more familiar context of a container. To me, > > the term node refers to something which can contain cpus, > > memory, and IO. I would be more comfortable with some other > > term which refers to a range of memory. > > I would be more comfortable with another name as well ;). > Any suggestions? If I had to pick an alternative right > now, it would be "memory chunk". > > Earlier I tried "memory bank", but that had too many prior > connotations to me. ... |
From: Paul J. <pj...@en...> - 2001-11-05 23:39:43
|
On Mon, 5 Nov 2001, Hubertus Franke wrote: > Well my 2.5 cents: > > Under "Chunk" I usually imply a smaller piece of a large thing. > Hence this doesn't really fit. > "MemoryBlock" would fit better with me. MemoryBlock - I like it. I will probably change it to that, in the next version. Thanks, Hubertus. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Niels C. <nc...@us...> - 2001-12-24 06:11:37
|
> I have just posted on SourceForge LSE a new version of > the Design Note: > Process Scheduling and Memory Placement Hi Paul, I took the time to read your design notes. As you know, I have no NUMA background so maybe I'm not getting the whole picture. Please bear with me if I ask overly dumb questions in my ignorance but I ran into a few things I would like to bring up. > cpumemmap: > The lower layer of this proposal, named cpumemmaps, > provides two simple maps, mapping system CPU and memory > block numbers to application CPU and memory block numbers. > Each process, each virtual memory area, and the kernel for > its needs, has such a cpumemmap. Your above statement leaves me a bit confused. The way I understand the rest of the description, there are two types of cpumemmaps, one for the system (existing in only one instance) and one for "applications" with one instance per task. Do I understand that correctly? You say: "Each cpumemset has an associated cpumemmap" and that is the way I understand it but I think your design notes could benefit from a graphical view of how one maps to the other and on to the system map. Further graphs to illustrate bulk remapping would be even better. The only two principal thorns in my eye are around preferences when it comes to selecting memory block and cpu. You have added the memory block preference in the upper layer and you have opted to not supply cpu preference. For all I can see, the memory preference is likely to be defaulting to "the closer the memory the sooner you try it" -- and is likely to remain so in most cases. I do not see why cpumemmaps should not carry this information with the cpumemsets being able to override. Similarly, I see no reason why the cpu preference can not be kept just like memory block preference and handled the same way. I think the inclusion of cpu preference fits into a design that combines Paul Dorwin's topology design with your two-layer abstraction. Instead of having the sequence in which the preferred cpus and memory blocks appear determine the preference, a data-carrying link structure would be preferable although not the only possible solution. I say so because I think that in the near future, we need more than "preferred". In the case of memory, a number of states may influence where we grab memory. For example, if allocation from the closest memory would cause a page-out but allocation from the next-closest memory would not, where should we allocate? Should memory preference be influenced by the likelihood or history of the task's cpu bounces? In the case of processor, when should we decide not to schedule a task because no preferred processor is idle or has a lower priority task running? For how long? Do the rules change if this is a hypertasking processor? I believe the design should allow for such extensions and, probably, include a couple of properly named reserved fields for that purpose. Thanks for passing on my plea for help with Co-Pilot. Niels |
From: Paul J. <pj...@en...> - 2001-12-24 20:12:35
|
Thanks for reviewing the Notes, Niels. You and I should be celebrating Christmas ... oh well. You wrote: > > cpumemmap: > > The lower layer of this proposal, named cpumemmaps, > > provides two simple maps, mapping system CPU and memory > > block numbers to application CPU and memory block numbers. > > Each process, each virtual memory area, and the kernel for > > its needs, has such a cpumemmap. > > Your above statement leaves me a bit confused. The way I > understand the rest of the description, there are two types > of cpumemmaps, one for the system (existing in only one > instance) and one for "applications" with one instance per > task. Do I understand that correctly? No. Not two maps (system and applications). Many maps, one kernel, and one for each task, and one for each vm area. The maps are shared if inherited from a common map and unchanged. But except for one Bulk Remap operation, this sharing is not apparent to the user. The "system" and "application" dichotomy is in the numbering of CPUs and memory blocks. The system (Linux kernel) has its preferred way of numbering CPUs ... one such way is the compact node id. From the perspective of CpuMemSets, there is only one such numbering scheme, pre-ordained by the kernel. Each cpumemmap describes an application numbering. While there is only one system numbering scheme (known to CpuMemSets) there are many application numbering schemes -- exactly the many cpumemmaps. Each cpumemmap describes the mapping from some "application" CPU and memory block numbers to _the_ "system" numbers. |> your design notes could benefit from a graphical view ... Yes, some pictures would help. I chose the technology (basically hand edited HTML containing just old-fashioned text) for presenting these Notes to optimize the speed with which I could edit them, and the ease of presenting them in different contexts. However, now I should soon go back and refine the presentation to include pictures, in order to make it easier to understand. > Similarly, I see no reason why the cpu preference can not be > kept just like memory block preference and handled the same > way. > > I think that in the near future, we need > more than "preferred". In the case of memory, a number of > states may influence where we grab memory. This comment goes to the heart of a key design decision in this CpuMemSets design. The challenge is to present enough flexibility to be generally useful, while keeping it simple enough to be generally understandable. In particular, there will be some systems with some additional needs that don't fit explicitly in CpuMemSets, and these needs will have to be met by additional mechanisms. Hopefully, CpuMemSets can integrate smoothly with such additional mechanisms, with CpuMemSets saying "here's the ordered list of where to search", and leaving it to other code to pick and choose from that ordered list with more refined methods. For example, it is already the case, in the normal Linux kernel page allocator, that it takes into account additional heuristics on where to allocate a page. It does this in part by scanning the list of zones multiple times, first looking for easy memory, then getting increasingly desperate. In this environment, CpuMemSets will provide some framework -- an ordered set of memory blocks to search, but it will leave to other mechanisms the details of deciding which memory block is "best". |> You have added the memory block preference in the upper |> layer and you have opted to not supply cpu preference. 
In the case of memory, the current CpuMemSets design provides a couple of additional option flags, stating which order to search the memory in. In the case of CPUs, it doesn't have any such options currently, and treats the CPU list as simply an unordered set, even though that set is passed into the kernel as an ordered list. If the kernel had use for diverse CPU search orders in the scheduler, then I could easily see extending this CpuMemSet design to make that explicit, and add a few flags, specifying various search orders over the provided CPU list. |> For all I can see, the memory preference is likely to be |> defaulting to "the closer the memory the sooner you try it" |> -- and is likely to remain so in most cases. Once you determine which memory blocks to search, and where in that list to begin the search, and disregarding more dynamic issues (see comments above involving searching the same list multiple times, with increasing desperation), then yes, the search order will often be distance based. |> I do not see why cpumemmaps should not carry this information |> with the cpumemsets being able to override. The maps just renumber. One could easily have all maps present all CPUs and all memory blocks, all the time. I see no use in having two layers that both attempt to present mechanisms for specifying preference information. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
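A sketch of the division of labor described above, with invented names throughout: CpuMemSets hand the allocator an ordered list of memory blocks, and the allocator scans that list repeatedly, relaxing its standards on each pass, much as the existing zone allocator does.

    /* All identifiers here (try_alloc, the pass count) are invented
     * for illustration of the search-order idea only. */
    struct page *alloc_from_list(const int *mems, int nr_mems)
    {
        int pass, i;

        for (pass = 0; pass < 3; pass++) {      /* increasing desperation */
            for (i = 0; i < nr_mems; i++) {
                struct page *p = try_alloc(mems[i], pass);
                if (p != NULL)
                    return p;                   /* first block that can satisfy us */
            }
        }
        return NULL;                            /* nothing allowed could satisfy */
    }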
From: Niels C. <nc...@us...> - 2001-12-28 04:11:59
|
> No. Not two maps (system and applications). Many maps, > one kernel, and one for each task, and one for each vm > area. The maps are shared if inherited from a common map > and unchanged. But except for one Bulk Remap operation, > this sharing is not apparent to the user. Paul, I give up. Why do we need one per vm area when all operations on memory are done from a task? Why can't we use the task's map, which is used to generate the initial memory area map anyway? You do say: * When allocating another page to an area, the kernel will * choose the memory list for the CPU on which the current * task is being executed, if that CPU is in the cpumemset of * that memory area, else it will choose the memory list for * the default CPU (see CMS_DEFAULT_CPU) in that memory area's * cpumemset. The kernel then searches the chosen memory list * in order... Now you are adding functionality to the way the data is organized. Who says this is the only or even the best way to allocate memory? But notice that you use the term "memory list"... So maybe I should rephrase that: It seems to me that the task should have the map, mapping to a different map list structure for associated memory areas. I do not see that memory area list having a cpu mapping. I think you are trying to generalize too much here, which -- to me -- is apparent from your comment: * Not all portions of a cpumemset are useful in all cases. * For example the CPU portion of a vm area cpumemset is unused. * It is not clear as of this writing whether CPU portions of the * kernel's cpumemset are useful So while I agree that we need a data structure to hold the vm areas, I see it as different from the cpu structure. Now, while I think you are generalizing too much in one area, I think you are doing the opposite in the other. That other one is the rejection of cpu-to-cpu relationships. In a two- layer design, I would expect everything hard-wired, such as cpu-to-cpu affinity/distance/whatever to be reflected in the lower layer just as the cpu-to-memory-block relationship is. And, maybe, as the concept of a node is, since that seems to be(come) a very popular way of designing NUMA systems. And these will be my last words in this thread. I will be following it but will leave the discussion to Michael's team. It has been interesting, educational and entertaining though, at least for me, but I don't have the bandwidth ... Niels |
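A sketch of the allocation rule quoted above: use the faulting vm area's memory list for the executing cpu if that cpu has one in the area's cpumemset, otherwise fall back to the CMS_DEFAULT_CPU list, and then search the chosen list in order. The per-list cpu tag and field names are assumed for illustration; only nr_mems and mems[] appear in the draft code shown earlier in the thread.

    /* Pick which memory list to search for a fault taken on cur_cpu. */
    cms_memory_list *pick_memory_list(cpumemset *cms, int cur_cpu)
    {
        int i;
        cms_memory_list *dflt = NULL;

        for (i = 0; i < cms->nr_mems; i++) {
            if (cms->mems[i]->cpu == cur_cpu)
                return cms->mems[i];        /* list for the executing cpu */
            if (cms->mems[i]->cpu == CMS_DEFAULT_CPU)
                dflt = cms->mems[i];
        }
        return dflt;                        /* CMS_DEFAULT_CPU fallback */
    }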
From: Paul J. <pj...@en...> - 2002-01-12 09:11:14
|
Thanks for your comments, Niels. I appreciate that you don't have the bandwidth to continue this discussion. Heck - it took me over two weeks to find the bandwidth to reply, and this is my main task. My apologies for not responding sooner -- I did however enjoy a rare vacation, and hope to conquer my recently acquired "Civilization III (Sid Meier)" addiction soon. Anyhow, for the benefit of others who might have been intrigued by your comments, I will attempt a modest response. On Thu, 27 Dec 2001, Niels Christiansen wrote: > Paul, I give up. Why do we need one per vm area when all > operations on memory are done from a task? Why can't we > use the task's map, which is used to generate the initial > memory area map anyway? Except for kernel allocation within interrupts, yes, there is a task (as well as a vm area) that can naturally be associated with each page fault. Large NUMA apps that have their memory allocations fine tuned for a system will place different chunks of memory on different nodes, quite intentionally. This includes shared memory, which may be accessed from many processes. So just knowing the process and executing CPU isn't enough to place the memory. One must also know which address region of that process owns the fault. A single process will typically have access to several regions, which have various and different placement directives. The natural place, in my view, to attach such directives that vary by memory region is to the region, known in Linux as the vm area. > It seems to me that the task should have the map, mapping to > a different map list structure for associated memory areas. > I do not see that memory area list having a cpu mapping. Well, we clearly need to have the choice of where to look for memory depend on which cpu executed the fault, and which memory region (vm area) faulted. I say "clearly" based on my analysis of what major well-tuned NUMA apps require. My focus here is on the big-data, long running, big compute jobs that are SGI's primary business focus. Other types of loads will have other requirements. The trick is to find a single mechanism that will be useful to the variety of uses. ... I couldn't quite make sense of the rest of your post, so will just have to let it rest in peace. Perhaps it touched on a point that someone else will raise. Once again - thanks for your experienced and extended examination of this proposal. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Michael H. <hoh...@us...> - 2002-01-09 19:21:48
|
Paul, I appreciate that you have addressed most of my concerns in your latest revision of CpuMemSets. At this point in time I have little to add, although, as we have previously discussed, there is a need for adding grouping capabilities to CpuMemSet. Next week I'll try to post some thoughts on this. A question that has been brought up is how would one remove a resource (CPU or memory block) from the system via CpuMemSets? The only option I see now is to use the bulk remap capability and map a different resource to replace the resource being removed. So, for example, if on an eight processor system, processor 5 was being removed, one would have to choose a different processor, say 6, to map what had been mapped to 5. Thus a processor list, assuming all system processors were being mapped by a CpuMemMap one to one, would be bulk remapped to look like {0,1,2,3,4,6,6,7}. Is my understanding correct? If so, while technically possible, there are numerous problems I see with this, especially with respect to implications on cpumemsets based on the now changed cpumemmap. Michael Hohnbaum hoh...@us... |
From: Paul J. <pj...@en...> - 2002-01-12 10:18:11
|
On Wed, 9 Jan 2002, Michael Hohnbaum wrote: > Paul, > > I appreciate that you have addressed most of my concerns in > your latest revision of CpuMemSets. At this point in time I > have little to add, although, as we have previously discussed, > there is a need for adding grouping capabilities to CpuMemSet. > Next week I'll try to post some thoughts on this. > > A question that has been brought up is how would one remove a > resource (CPU or memory block) from the system via CpuMemSets? > The only option I see now is to use the bulk remap capability > and map a different resource to replace the resource being > removed. So, for example, if on an eight processor system, > processor 5 was being removed, one would have to choose a > different processor, say 6, to map what had been mapped to 5. > Thus a processor list, assuming all system processors were > being mapped by a CpuMemMap one to one, would be bulk remapped > to look like {0,1,2,3,4,6,6,7}. Is my understanding correct? > If so, while technically possible, there are numerous problems > I see with this, especially with respect to implications on > cpumemsets based on the now changed cpumemmap. > > Michael Hohnbaum > hoh...@us... Your crystal ball is working at full strength. I'd advise you to visit Las Vegas and bet heavily. In other words, in the final round of changes before packaging up our first patch (should be out next week, crossing my fingers), the only significant design detail that took a hit was the usefulness of the Bulk Remap feature. It seems to me, by my present understanding, that we must insist that the CpuMemMap memory lists be injective - meaning that you can't list a cpu twice in the memory lists. Or if we allow duplicates (such as cpu '6' in your example above) then as you state "there are numerous problems". Any ideas? I agree that the "canonical" scenario to be solved with bulk remap, or some other design element, is removing a cpu from a system. I also look forward to your thoughts on "grouping". That should be a "fun" challenge. I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
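A sketch of the injectivity requirement discussed above: reject a proposed CpuMemMap whose cpu array names the same system cpu twice, as {0,1,2,3,4,6,6,7} does. The flat array-of-ints representation is assumed for illustration.

    /* Return 1 if no system cpu number appears twice, else 0. */
    int map_is_injective(const int *cpus, int nr_cpus)
    {
        int i, j;

        for (i = 0; i < nr_cpus; i++)
            for (j = i + 1; j < nr_cpus; j++)
                if (cpus[i] == cpus[j])
                    return 0;   /* duplicate, e.g. the two 6s above */
        return 1;
    }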
From: Paul J. <pj...@en...> - 2001-10-12 18:57:26
|
Paul McKenney wrote: > Took a first pass through your proposal -- good stuff! Some very > interesting approaches! Thanks for the initial review. Before I read it in detail, let me repost it, with some paragraph breaks, so that others may read it more easily. What follows is Paul Mckenney's writing, reformated ... ======================================================= Some comments: Under "Desired Properties / Vendor neutral base", I recommend adding Tru64, HP-UX, AIX, etc. May as well be all-inclusive. ;-) Under "Implementation Layers", near the end of the second paragraph: "a small scale forking" seems a bit dramatic. That said, it would be good if the NUMA and SMP code paths used the same C code. A question on items 1, 2, and 3 under "Implementation Layers". Item 1 seems to indicate that the current use of bitmasks within the kernel can continue unchanged, but hints at longer-term changes. Do you envision the kernel moving towards using CpuMemMaps and CpuMemSets, or do you expect these two concepts to be strictly confined to user space? Under "Error cases" in the header file, you say that if you really want to force an application to use CPUs and memory from disjoint nodes (which you might in diagnostics or performance-measurement code), then the CpuMemSet memory list must contain CMS_DEFAULT_CPU. But CMS_DEFAULT_CPU is casted to cms_lcpu_t. Shouldn't it be cast instead to cms_lmem_t? Or is CMS_DEFAULT_CPU instead supposed to go on the CpuMemSet's list of CPUs? If the latter, does this allow the disjoint-node operation for CPUs and memory? (So that all the permitted CPUs are on one set of nodes, but all the permitted memory is on another set of nodes, and the two sets of nodes are disjoint.) First-touch and stripe policies seem to be missing from the list. Some of the OSes have also had least-memory-utilization policies. A cpumemset has no meaning except in the context of a cpumemmap, right? A cpumemmap maps the CPU numbers, while a cpumemset simply restricts them, right? (Could make it work either way, but things like getcpu() and getnode() need to be in sync with the choice.) For example, suppose that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What is the physical CPU corresponding to logical CPUs 0, 1, and 2? HP's launch policies include policies for CPU allocation (round robin and the like). My guess is that this requires either additional bits in cms_policy or another field for CPU (as opposed to memory) policies. Non-root processes are prohibited from creating cpumemmaps. Shouldn't they be allowed to create them, as long as they are subsets of the cpumemmap that they are currently running with? If non-root processes are really prohibited from playing with cpumemmaps, what happens to their children? Is the child's cpumemmap generated from the cpumemset/cpumemmap pair that the parent specified with CMS_CHILD? Or does the child just get copies of the CMS_CHILD cpumemset/cpumemmap? (At this point, I am favoring letting non-root processes manipulate cpumemmaps, with the restriction that any that are installed must be pure restrictions of the current cpumemmap.) There is no way to create either a cpumemmap or a cpumemset without querying for it. The intended usage is to query for this process's set/map, then manipulate the resulting structure? Is the user supposed to directly manipulate the fields of the cpumemmap and cpumemset? 
A few typos, possibly due to WikiWeb issues: "cpu's" -> "CPUs", "mem's" -> "memories", "preasure points" -> "pressure points", "CpumemSet" -> "CpuMemSet", etc. Enough questions for now! More later... Once again, some good stuff here! Thanx, Paul ======================================================= I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Paul J. <pj...@en...> - 2001-10-12 20:58:47
|
Thanks - excellent comments, Paul. I look forward to continued feedback from yourself and others. Paul McKenney wrote: ==== > Took a first pass through your proposal ... > > adding Tru64, HP-UX, AIX, etc. May as well be all-inclusive. ;-) good idea - added. ==== > Under "Implementation Layers", near the end of the second paragraph: "a > small scale forking" seems a bit dramatic. yup - softened to: increased the risk for minor Linux kernel code duplication ==== > A question on items 1, 2, and 3 under "Implementation Layers". Item 1 > seems to indicate that the current use of bitmasks within the kernel can > continue unchanged, but hints at longer-term changes. Do you envision > the kernel moving towards using CpuMemMaps and CpuMemSets, or do you > expect these two concepts to be strictly confined to user space? I doubt that the guts of scheduling or allocation code will ever want to be written in terms of CpuMemMaps and Sets. The data structure used in CpuMem*, arrays of cpu or mem IDs, is almost surely too inefficient for critical path code. I expect that when someone needs a kernel for > 64 cpus, they will have to find an alternative to, or extension of, Ingo's cpus_allowed bit vector. And I expect continued coding activity for various other purposes in both the scheduling and allocation code, which will at times impact the preferred data representation of available cpu and memory resources for these critical code paths. The kernel code that implements the system calls to set CpuMemMaps and Sets will always have the responsibility for translating the stable, but cumbersome, API of these Maps and Sets into the efficient representation of the day required by the scheduling and allocation code. ==== > Under "Error cases" in the header file, you say that if you really want > to force an application to use CPUs and memory from disjoint nodes > (which you might in diagnostics or performance-measurement code), then > the CpuMemSet memory list must contain CMS_DEFAULT_CPU. But > CMS_DEFAULT_CPU is casted to cms_lcpu_t. Shouldn't it be cast instead > to cms_lmem_t? Or is CMS_DEFAULT_CPU instead supposed to go on the > CpuMemSet's list of CPUs? If the latter, does this allow the > disjoint-node operation for CPUs and memory? (So that all the permitted > CPUs are on one set of nodes, but all the permitted memory is on another > set of nodes, and the two sets of nodes are disjoint.) Aha - this Error Case is confused, sufficiently so that it apparently managed to further confuse your critique ;). And the Error Case preceding it is also confused. For the benefit of those who don't have this Design Note at hand, the two confused Error Cases are: * It is not an error if a CpuMemSet for an object (task, vm area * or kernel) doesn't provide memory lists for all the cpus in * that object's CpuMemMap. That is, it is ok for a CpuMemSet to * "be smaller than" (only use a subset of) its Map. * * However it is an error to set a CpuMemSet that shows cpus that * are not listed in any of the memory lists of that CpuMemSet, * unless the memory lists include a CMS_DEFAULT_CPU. Attempts to * set such a CpuMemSet fail with errno set to ESRCH. This case * must be an error to avoid trying to allocate memory without * knowing which memory list to search. Let me try again ... A CpuMemSet specifies two things. 
It specifies on which cpus (in the corresponding CpuMemMap) a task may be scheduled, and it specifies in what order to search for memory, per virtual memory area, depending on which cpu the request for memory was executed on. The question arises - where do we look for memory if a request for memory is executed on a cpu that is not specified in the active CpuMemSet? Perhaps someone didn't provide memory lists for all possible cpus that might execute code sharing that area. Heck, perhaps they _couldn't_ specify such memory lists, because they are setting up a shared memory area that will be shared with other tasks running on cpus outside the Map of the process initializing the shared memory area. Seems that I can either (1) require that a memory list _always_ be specified for the CMS_DEFAULT_CPU, or (2) mandate that attempts to allocate memory when the CpuMemSet does not specify (even by CMS_DEFAULT_CPU) any memory list for the currently executing cpu must fail. Choice (2) would cause nasty, obscure and intermittent errors. So it must be choice (1). Hence these two Error Cases collapse into the following single case: * Every CpuMemSet must specify a memory list for the * CMS_DEFAULT_CPU, to ensure that regardless of which CPU a * memory request is executed on, a memory list will be available * to search for memory. Attempts to set a CpuMemSet without a * memory list specified for the CMS_DEFAULT_CPU will fail, with * errno set to EINVAL. ==== > First-touch and stripe policies seem to be missing from the list. Some > of the OSes have also had least-memory-utilization policies. First touch is the natural order of things. If, as will usually be the case, the memory lists are ordered by distance from the faulting cpu, then this provides first touch. Existing upper level APIs, such as cpusets, dplace, runon, OpenMP, MPI that support a First Touch policy would presumably implement that policy by properly sorting the memory lists. Aha - perhaps I should change the CMS_DEFAULT policy comment, from: #define CMS_DEFAULT 0x00 /* None of the following optional policies */ to something such as: #define CMS_DEFAULT 0x00 /* Memory list order (first-touch, typically) */ Tell me more about what is the essence of stripe policies, as they apply here. My guess is that a combination of proper memory list sorting plus a round robin (CMS_ROUND_ROBIN) policy will provide the desired semantics. But more consideration of this point is needed. What is a "least-memory utilization policy"? I am not familiar with that term. ==== > A cpumemset has no meaning except in the context of a cpumemmap, right? > A cpumemmap maps the CPU numbers, while a cpumemset simply restricts > them, right? (Could make it work either way, but things like getcpu() > and getnode() need to be in sync with the choice.) For example, suppose > that the CPUs are numbered 100, 101, 102, 103, and 104. Suppose that > the cpumemmap is {100,102,103}, and that the cpumemset is {0,2}. What > is the physical CPU corresponding to logical CPUs 0, 1, and 2? Yes, Yes, tell me about getcpu/getnode, and {100,102,103}. My presumption, from reading this, is that getcpu(), executed in this map and set, would return either 100 or 103, rather oblivious to the CpuMemMap. I have no clue what getnode() would or should do that relates to CpuMemSets. Perhaps we need a "getcmscpu()"? ==== > HP's launch policies include policies for CPU allocation (round robin > and the like). 
My guess is that this requires either additional bits in > cms_policy or another field for CPU (as opposed to memory) policies. Aha - might be so. I should investigate this further. ==== > Non-root processes are prohibited from creating cpumemmaps. Shouldn't > they be allowed to create them, as long as they are subsets of the > cpumemmap that they are currently running with? Non-root processes can restrict where they execute tasks or allocate memory by altering their CpuMemSet, within the confines of the CpuMemMap established for them. Why is that not enough? ==== > If non-root processes are really prohibited from playing with > cpumemmaps, what happens to their children? Is the child's cpumemmap > generated from the cpumemset/cpumemmap pair that the parent specified > with CMS_CHILD? Or does the child just get copies of the CMS_CHILD > cpumemset/cpumemmap? I don't understand the difference between these two choices. They both sound the same to me, and both sound right. ==== > (At this point, I am favoring letting non-root > processes manipulate cpumemmaps, with the restriction that any that are > installed must be pure restrictions of the current cpumemmap.) Well, having not yet appreciated the comments above on this, I will have to table this suggestion pending my further enlightenment. ==== > There is no way to create either a cpumemmap or a cpumemset without > querying for it. The intended usage is to query for this process's > set/map, then manipulate the resulting structure? Yes, yes. Earlier designs allowed for creating and manipulating CpuMemSets, as a kernel supported object that was visible as a distinct identified object to applications, separate from their binding to any given task or vm area. But I could see no essential use for unbound CpuMemSets, so now they are an attribute of known tasks and vm areas. ==== > Is the user supposed to directly manipulate the fields of the cpumemmap > and cpumemset? Yes, via the cmsSet*() system calls. Well, as direct as anything in a protected operating system -- the user politely asks the kernel to set a map or set, and doesn't directly write kernel memory. My hunch here is that I missed the real point of this question ... ==== > A few typos, possibly due to WikiWeb issues: "cpu's" -> "CPUs", "mem's" > -> "memories", "preasure points" -> "pressure points", "CpumemSet" -> > "CpuMemSet", etc. I can hardly blame wiki for typos and misspellings ;). Thanks for pointing these out. > Enough questions for now! More later... Once again, some good stuff here! I look forward to your further comments, and to integrating this work with substantial good work being done by folks on your end, and elsewhere. > > Thanx, Paul I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
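A sketch of the single collapsed error case above: a set-time validation that every CpuMemSet carries a memory list for CMS_DEFAULT_CPU. The per-list cpu tag used here is assumed from the discussion, not quoted from the draft header.

    /* Return 1 if a CMS_DEFAULT_CPU memory list is present; per the
     * text, a cmsSetCMS() would otherwise fail with errno EINVAL. */
    int cms_has_default_list(const cpumemset *cms)
    {
        int i;

        for (i = 0; i < cms->nr_mems; i++)
            if (cms->mems[i]->cpu == CMS_DEFAULT_CPU)
                return 1;
        return 0;
    }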