Thread: [Lse-tech] DYNIX/ptx NUMA APIs mapping to CpuMemSets

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

In an effort to enhance my understanding of the CpuMemSet proposal I
performed a mapping of the DYNIX/ptx NUMA APIs to the CpuMemSet facility.
This effort also attempted to validate the claim that "this CpuMemSet
facility is intended to provide the power sufficient to support emulations
of these existing API's."

In doing this I came across several shortcomings, that are listed below.
At the end of this summary is the mapping with a discussion included of
issues encountered.

1.  CpuMemSets makes an attempt at providing support for memory allocation
    policy, but comes short of providing the allocation policies supported
    by DYNIX/ptx.  By providing only partial support for memory allocation
    policies it opens up the possibility that memory allocation policies
    might be implemented at various places, instead of through one common
    mechanism.

    Policies missing:

       * soft versus hard - DYNIX/ptx has the notion of treating placement
         requests as either hints (soft) or demands (hard).  CpuMemSet 
         provides only the hard option.

       * first touch, followed by round robin.  The default algorithm for
         memory allocations for DYNIX/ptx is to allocate on the same quad
         the process is running on, and if none available, to round robin
         through the remaining quads.  The CpuMemSet choices are either
         round robin or always in the order of the memory lists.

       * LARGEMEM/SMALLMEM hints - DYNIX/ptx provides the option for a 
         process to indicate if it was a large memory user or a small 
         memory user, so that the system could place it on an appropriately
         loaded quad.

       * MOSTFREE - i.e., allocate on the node with the most available
         memory.

2.  No mechanism is provided to migrate a process to a different node after 
    a process is assigned to it via CpuMemSet.

3.  There is no means to bind processes such that if one moves to a 
    different node, they all move to the node.  (Or conversely, to
    prevent the migration of one to a different node when it needs to
    remain on its current node with other specific processes.)

4.  DYNIX/ptx has the ability to attach a process to a shared memory
    segment.  This binds the process to running on quads that the shared
    memory segment resides on.  In CpuMemSets there is a note that the
    CPU portion of a CpuMemSet for a memory area is ignored.  To support
    the DYNIX/ptx attachment of a process to a shared memory segment,
    the CPU portion of the memory are CpuMemSet needs to be used.

5.  It appears that mmap uses the CpuMemSet of the current process.
    Thus, to target a different set of resources (numa nodes) than
    the current process is using it is necessary to change the current
    CpuMemSet of the process to that desired for the mmap, execute
    the mmap operation, and then restore the CpuMemSet for the current
    process to what it was originally.  This opens up the possiblity
    that any pages allocated during this activity for the process could
    end up being allocated on the wrong node (i.e., on a node that was
    intended for the mmap).  Perhaps a third CpuMemSet needs to be
    established for a process to be used for mmaps and shmget.

6.  Not really a CpuMemSet deficiency, but DYNIX/ptx had a privilege
    vector implementation that allowed the granting of some typical
    root only capabilities to specific users.  Some NUMA APIs required
    specific privileges for non-root users.  This could not be mapped
    into Linux.

7.  More of a NUMA Topology issue - DYNIX/ptx provided an efficient
    interface for a process to obtain the engine number and quad number
    that it is executing on.  This is not yet supported.

Those are the main deficiencies found.  The following is the outcome
of the mapping work, the first part outlines the general strategy,
the remainder is snippets from man pages that define the APIs with 
discussion interjected about how the mapping could happen and what 
problems arise.  Any suggestions, corrections, discussion, etc. are
welcome.

Michael Hohnbaum
hoh...@us...

Mapping DYNIX/ptx NUMA APIs onto CpuMemSets
-------------------------------------------

The DYNIX/ptx NUMA APIs are based around the notion of a quad, with a 
quad consisting of 4 CPUs, memory, and PCI slots.  Quads are numbered
starting with 0 and continuing up with no gaps in the numbering name
space.  Similarly, CPUs are numbered starting with 0 and continuing up
with no gaps in the numbering name space. CPUs 0-3 are on quad 0,
CPUs 4-7 are on quad 1, etc.  A missing CPU on a quad (e.g., processor
failure, or non-fully populated quad) does not result in a gap in the
processor number name space.  Thus, some mechanism must be provided
to associate processors with quads.  This issue is being worked as
part of the NUMA Topology effort.

To map the ptx NUMA APIs to CpuMemSets, one must map the quad resources
into the CpuMemSets structures.  The general mapping described below
assumes fully populated quads (i.e., four functioning processors per
quad).  This mapping will differ on systems with less than four
processors per quad.

NOTE: This assumes that a CpuMemSet structure may be established that
      applies to all processes on the system as the default. 

First, create a CpuMemMap that contains all of the CPUs and memory of
the system, treating the memory in each quad as a separate memory.
(N equals the number of quads in the system.)

The CpuMemMap contains:

	{
		4N,			# number of CPUs
		p1,			# pointer to array of cms_pcpu_t
		N,			# number of memories
		p2			# pointer to array of cms_pmem_t
	}

p1 = [0,1, .. ,4N-1]
p1 = [0, .. ,N-1]

This maps logical CPU addresses one to one to physical CPU addresses (e.g.,
logical processor 0 is mapped to physical processor 0, etc.).  It maps
memories one to one to the quad number that contains them.

Then create a CpuMemSet that establishes a linkage between the CPUs and
the memory on the same quad.  

	{
	 CMS_DEFAULT,		# cms_policy
	 4N,			# number of CPUs
	 q1,			# pointer to array of cms_lcpu_t
	 N			# number of memories
	 q2			# pointer to array of cms_memory_list
	}

q1 = [0,1,2, .. ,<number of CPUs>]
q2 = [
	{4, r1, N, s1}
	...
	{4, rN, N, sN}
     ]

r1 = [0,1,2,3]
rN = [N-4,N-3,N-2,N-1]

s1 = [0,1,..N-1]
s2 = [1,2,..,N-1,0]
sN = [N-1,0,..,N-2]

This establishes sets of 4 processors such that memory allocations from
these processors try to be resolved from the memory on the same quad, and
if not, will attempt to get memory from the next quad up, modulo the
number of quads.

NOTE: This does not match the policy in DYNIX/ptx but is a reasonable 
      approximation.

Soft versus Hard Allocation Policies
------------------------------------

The DYNIX/ptx process placement APIs provide the option of being used as
"hints" (soft policy) or firm directives (hard policy).  By default, soft
policy is assumed.  Hard policy is indicated by the use of the INSIST flag
with the associated system calls.

CpuMemSet does not provide this soft versus hard policy distinction.  Rather
a set of resources are identified by a CpuMemSet and a process is associated
with the CpuMemSet.  It is equivalent to a hard policy in the DYNIX/ptx
implementation.  

Therefore, in a mapping of the DYNIX/ptx NUMA APIs to CpuMemSet, either
the soft policy of DYNIX/ptx must be abandoned, or the logic must be
placed at the NUMA API level to make the scheduling and placement decisions
of the DYNIX/ptx soft policy into CpuMemSet hard policy.  To accomplish 
the latter requires the exporting of all necessary kernel usage metrics
including memory and processor usage on each node, location of shared
memory, locality of I/O devices used by the process, etc.  This type of
information and management of system resources based upon it, really belongs
within the kernel.  It is not practical nor likely to be performant, to 
manage this external to the kernel.  Because of these considerations, the
DYNIX/ptx NUMA API to CpuMemSet mapping will only provide for the hard policy.

API Mappings
------------

What follows are the DYNIX/ptx APIs with a discussion of how they map
onto CpuMemSets as defined above.  These are snippets from the man pages
with enough content left to hopefully, explain the intent of the API.  For
the complete man page go to http://oss.sgi.com/projects/numa/download/dynix.

          int tmp_ctl(command, arg)
          int command, arg;

          TMP_NENG
               returns the number of processors configured in the
               processor pool.  The processors may be online or
               offline.  However, deconfigured processors will not be
               included.  The arg argument is ignored.

===>  Note that a deconfigured processor is one that is logically removed
===>  from the system before the OS is booted, and thus the OS is not
===>  aware of its presence.  An offline processor is one that is present
===>  when the OS boots and the OS chooses to remove from service.

===>  cmsQueryCMM (CMS_KERNEL, 0, NULL)
===>  nr_cpus contains the number of CPUs in the system
===>
===>  NOTE: CMS_KERNEL is listed as root only.  Can non-root use it for query?

          TMP_RGN_NENG
               N/A

          TMP_OFFLINE
               N/A

          TMP_ONLINE
               N/A

          TMP_QUERY
               N/A

          TMP_TYPE
               N/A

          TMP_RATE
               N/A

          TMP_NQUAD
               returns the number of quads configured in the quad
               pool.  The quads may be online or offline.  However,
               deconfigured quads will not be included.  The arg
               argument is ignored.

===>  cmsQueryCMM (CMS_KERNEL, 0, NULL)
===>  nr_mems in resulting structure maps to the number of quads
===>
===>  NOTE: CMS_KERNEL is listed as root only.  Can non-root use it for query?

          TMP_RGN_NQUAD
               N/A

          TMP_ENGTOQUAD
               returns the quad on which the processor specified by
               arg. is located.

===>  derivable based on the 4 cpus per quad and numbering assumptions
===>  mentioned above for fully populated quads.  Needs support from NUMA
===>  Topology to handle non-fully populated quads.

          TMP_QUADNEXTENG,
               given a processor number (specified in arg ), returns
               the next processor on that quad, or the first processor
               on that quad if there are no more processors on that
               quad.

===>  derivable based on the 4 cpus per quad and numbering assumptions
===>  mentioned above for fully populated quads.  Needs support from NUMA
===>  Topology to handle non-fully populated quads.

          TMP_RGN_QUADNEXTENG,
               N/A

          TMP_QUADTOENG
               returns the first processor located on the quad
               specified by arg.

===>  derivable based on the 4 cpus per quad and numbering assumptions
===>  mentioned above for fully populated quads.  Needs support from NUMA
===>  Topology to handle non-fully populated quads.

          TMP_RGN_QUADTOENG
               N/A

-----------------------------------------------------------------------

       int attach_proc(rsrc_type, rsrcp, flags, pid)

       rsrctype_t rsrc_type;
       rsrcdescr_t *rsrcp;
       int flags;
       pid_t pid;

       typedef union rsrcdescr {
              quadset_t rd_quadset;
              int   rd_fd;
              char  *rd_pathname;
              int   rd_shmid;
              pid_t rd_pid;
       } rsrcdescr_t;

       attach_proc attaches a process to the  resource  specified
       by rsrc_type and rsrcp .  Attachment implies that the pro-
       cess will be located on the quad or set of quads where the
       specified resource is located.  If the process is not cur-
       rently located at the quad where the specified resource is
       located,  the  process  will migrate to that quad.  If the
       resource is located on several  quads,  the  process  will
       migrate  to  the  best of those quads, taking CPU load and
       memory availability into account.  It's possible that none
       of those quads is suitable to have the process migrated to
       it, due to extremely high CPU loads or insufficient avail-
       able  memory.   In  that  case, attach_proc returns -1 and
       sets errno to EAGAIN to advise the  caller  to  try  again
       later.   This  possibility can be avoided by including the
       QUAD_INSIST bit in flags, which makes choosing one of  the
       quads on which the resource is located mandatory.

===> A CpuMemSet can be constructed that includes the quads which
===> match the criteria specified, and the process assigned to 
===> that CpuMemSet.

===> Migration of the process to the CpuMemSet is not provided for.

===> CpuMemSet does not have the semantics to provide the QUAD_INSIST
===> option.  Actually, it always provides QUAD_INSIST, unless the target
===> has a MASTER, in which case CpuMemSet refuses the attachment.

       Note  that  the process' run queue must include processors
       on at least one of the quads where the specified  resource
       is located.  If no processors on any of these quads sched-
       ule from the process' run queue,  attach_proc  returns  -1
       and  sets  errno to EINVAL.     In that case, to allow the
       process to be attached to the desired  quads,  either  the
       process  must  be reassigned to a run queue having proces-
       sors on at least one of the desired quads or processors on
       at  least  one  of  the desired quads must be added to the
       process' run queue (see the 'assign' and 'change'  options
       of rqadmin(1M)).

===> Not currently an issue.  Will need to evaluate against the MQ scheduler.

       If  the  specified  resource  is  located on more than one
       quad, the process may later migrate among these quads  for
       load-balancing  purposes.   If  the QUAD_INSIST bit is not
       set, the process may also migrate  to  a  quad  where  the
       resource  is  not  located, though this is done only under
       extreme CPU or memory load conditions.

       It is advisable that QUAD_INSIST be used with considerable
       caution.  Situations are rare in which a process will per-
       form better at a quad where a specific resource is located
       than  at  some other quad even under extreme CPU or memory
       overload conditions.  For example, being  located  on  the
       same quad as an important resource is of little value to a
       process that has been swapped out due to an extreme memory
       overload, or that is being starved for CPU bandwidth under
       an extreme CPU overload.

       If rsrc_type is R_QUAD, the process is attached to the set
       of quads specified by rd_quadset .

       If rsrc_type is R_PID, the process is attached to the pro-
       cess specified  by  rd_pid  .   Attachment  to  a  process
       implies  that  the attached process will always be located
       on the same quad as the process it is attached  to,  until
       one of them either detaches, attaches to another resource,
       or exits.  If any process that is attached to  other  pro-
       cesses  migrates  to  another quad for load-balancing pur-
       poses, the processes attached to it will accompany it.

===> No way to enforce attachment to the same quad as another process
===> unless the other process is attached to one specific quad via
===> a CpuMemSet.  This would inhibit the processes's ability to 
===> migrate between quads.

       If rsrc_type is R_SHM, the process is attached to the  set
       of  quads  that  may contain memory pages that are part of
       the shared  memory  segment  specified  by  rd_shmid  (see
       shmgetq(2SEQ)).

===> A shared memory segment has a CpuMemSet, so the requesting process
===> would be attached to that CpuMemSet.  There is a note in the
===> proposal that the CPU portion of a CpuMemSet attached to a memory
===> area is ignored.  For the mapping requested through attach_proc
===> the CPU portion would need to be used.

       If  rsrc_type  is R_FILDES, the process is attached to the
       set of quads on which  the  resource  specified  by  rd_fd
       resides.   Typically,  this implies that a process running
       on one of the quads in that set of quads  will  have  more
       efficient access to that resource.  For different resource
       types, the operating system uses  different  criteria  for
       determining  which  quads to include in this quad set.  If
       rd_fd is a stream, the process is  attached  to  the  quad
       where  the  memory containing the stream head resides.  If
       rd_fd is a file, the process is attached  to  the  set  of
       quads  that  are able to DMA directly to the disk on which
       the file is located.  If rd_fd  specifies  a  device,  the
       process  is  attached  to the set of quads that have effi-
       cient access to that device.  attach_proc works  similarly
       if rd_fd specifies a socket, FIFO or remote file.

       If rsrc_type is R_PATH, the process is attached to the set
       of quads that are nearest to  the  resource  specified  by
       rd_pathname.    The  criteria  for  determining  "nearest"
       is the same as for rsrc_type R_FILDES.

===> Decision can be made as to which quad(s) best fit the criteria
===> and create a CpuMemSet containing it(them).  Can all of the
===> information needed to make an appropriate decision be obtained?

       The caller can further specify an appropriate quad by set-
       ting  the QUAD_SMALLMEM or QUAD_LARGEMEM bits in the flags
       argument.  QUAD_SMALLMEM indicates that the  process  will
       have  very  low memory requirements, so can be placed on a
       quad having little available memory if  that  quad  has  a
       particularly  light  CPU  load.  Conversely, QUAD_LARGEMEM
       indicates that the process should be placed  on  the  quad
       with  the  most available memory even though that quad may
       have a high CPU load.  QUAD_SMALLMEM and QUAD_LARGEMEM are
       also  taken  into account during any future process migra-
       tions.

===> The memory lists associated with the CpuMemSet can be ordered
===> such that the above selection criteria are factored in.

===> This could be accomplished by querying the memory usage of
===> the quads, and based upon memory available on the quads at
===> that point in time order, the memory list in the CpuMemSet
===> to favor the node that meets the LARGEMEM/SMALLMEM hint.

       A process that attaches itself to  a  resource  loses  all
       previous attachment.  The real or effective user ID of the
       calling process must match the real or  saved  set-user-ID
       of  the  process to be migrated, unless the effective user
       ID of the calling  process  is  superuser  or  unless  the
       caller holds the PRIV_SCHED (scheduling) privilege.

===> No concept of PRIV_SCHED in Linux, so restriction is to root
===> or RUID/EUID match.

-----------------------------------------------------------------------

       int detach_proc(pid_t pid)

       detach_proc disassociates the process  identified  by  pid
       from the quad set it was previously attached to, if it was
       attached to a quad set. The real or effective uid  of  the
       caller  must  match  the  real or saved set-user-id of the
       process to be detached, unless the caller is the superuser
       or holds the PRIV_SCHED (scheduling) privilege.

===> Revert the process back to using the system-wide CpuMemSet

-----------------------------------------------------------------------

          getengno, GETENGNO, getquadno, GETQUADNO, engdata_init,
          all_processors_support_mmx, all_processors_support_simdx -
          get/init user-visible engine data

          int getengno(void), GETENGNO()

          int getquadno(void), GETQUADNO()

          int all_processors_support_mmx(void)
               N/A

          int all_processors_support_simdx(void)
               N/A

          void engdata_init(void)

===> The efficient i/f for obtaining the information described here 
===> is not provided as part of CpuMemSet.  No plans, that I know of,
===> exist for this capability in the NUMA Topology project, either.

     DESCRIPTION
          These interfaces are provided for the benefit of application
          processes that want to take advantage of the NUMA (Non
          Uniform Memory Access) architecture, or specific processor
          instruction set extensions (MMX and/or the Streaming SIMD
          extensions).  These are very fast and efficient interfaces
          (overhead is only a very few clock ticks) to obtain certain
          information, such as the engine number and the quad number
          that a process is currently executing on, or the presence of
          MMX and/or Streaming SIMD extension instruction support.

          engdata_init establishes access to the getengno, GETENGNO(),
          getquadno and GETQUADNO interfaces, making the data
          accessible in the address space of the calling process for
          subsequent use.  engdata_init should be called once and must
          precede the first call to any of these interfaces.

          getengno, when called from within a user process, returns
          the engine number on which the process is currently running.
          GETENGNO() is a cpp macro that eliminates the overhead of a
          function call and gives the same result.

          getquadno returns the quad number on which the process is
          currently running. GETQUADNO() is the equivalent cpp macro.

          The cpp macros GETENGNO and GETQUADNO  are defined in the
          /usr/include/engdata.h header file.

          The all_processors_support_mmx() and
          all_processors_support_simdx() primitives are defined in the
          /usr/include/engdata.h header file as in-line assembly-code
          operations, and return a non-zero value only if all
          processors in the system respectively support the MMX and
          Streaming SIMD extension instructions.

          The engdata_init primitive does not need to be invoked to
          use the all_processors_support_mmx() or
          all_processors_support_simdx() primitives.

-----------------------------------------------------------------------

          void *mmap (void *addr, size_t len, int prot, int flags, int
          fd, off_t pos)
               N/A

          void *mmap64 (void *addr, size_t len, int prot, int flags,
          int fd, off64_t pos)
               N/A

          void *mmapq (void *addr, size_t len, int prot, int flags,
          int fd, off_t pos, quadset_t *qsp)

          void *mmap64q (void *addr, size_t len, int prot, int flags,
          int fd, off64_t pos, quadset_t *qsp)

          The paging policy flags:  MAP_FIRSTREF, MAP_MOSTFREE and
          MAP_DISTRIBUTE control which quads pages for the mapped
          region come from.  The paging policy flags are in affect for
          all versions of the mmap() system call.  The mmap() and
          mmap64() versions use a default quadset involving all quads
          in the system and mmapq() and mmap64q() can be used to
          specify a quadset with fewer than the full compliment of
          system quads.

===> It appears that mmap uses the CpuMemSet of the current process.
===> Thus, to target a different set of resources (numa nodes) than
===> the current process is using it is necessary to change the current
===> CpuMemSet of the process to that desired for the mmap, execute
===> the mmap operation, and then restore the CpuMemSet for the current
===> process to what it was originally.  This opens up the possiblity
===> that any pages allocated during this activity for the process could
===> end up being allocated on the wrong node (i.e., on a node that was
===> intended for the mmap).  Perhaps a third CpuMemSet needs to be
===> established for a process to be used for mmaps and shmget.

===> Excluding the selection of paging policy, mmap using the system
===> CpuMemSet is equivalent to the DYNIX/ptx version.  CpuMemSets
===> provides paging policies of first touch, which is equivalent to
===> MAP_FIRSTREF, and round robin, which is equivalent to MAP_DISTRIBUTE.
===> There is not an equivalent for MAP_MOSTFREE, and it has been suggested
===> that some other mechanism be provided to implement this sort of 
===> paging policy.  A solution for providing support for multiple
===> policies to support the DYNIX/ptx APIs is needed.

===> If mmap is issued with a paging policy that differs from the one
===> associated with the system CpuMemSet, then an identical CpuMemSet
===> can be constructed, but with a different paging policy.  In the
===> current definition of CpuMemSet, this only allows switching between
===> first touch and round robin.

          mmapq() behaves the same as mmap(), with the addition of the
          qds parameter used to restrict the set of quads the paging
          policy flags MAP_FIRSTREF, MAP_MOSTFREE and MAP_DISTRIBUTE
          affect.  By default mmap() and mmap64() assumes a quadset
          containing all the quads in the system.  The mmapq() and
          mmap64q() variants are used to specify a quadset with fewer
          quads to restrict the paging policy to just those quads in
          qds.  See quademptyset(2SEQ) for information on construction
          of quadsets.

===> For mmapq, a quadset specifier is provided.  Assuming there is an
===> API equivalent to the DYNIX/ptx quadset APIs that creates equivalent
===> CpuMemSets, the mapping performed by mmapq should be fairly straight-
===> forward.

          mmap64q() is the union of mmap64() and mmapq() in that it
          allows the specification of 64-bit offsets with pos and a
          restricted quadset with the qds parameter. 

-----------------------------------------------------------------------

          int qexecl (qsetp, flags, file, arg0, arg1, ..., argn,
               (char *)0)
          quadset_t *qsetp;
          int flags;
          char *file, *arg0, *arg1, ..., *argn;

          int qexecle (qsetp, flags, file, arg0, arg1, ..., argn,
               (char *)0, envp)
          quadset_t *qsetp;
          int flags;
          char *file, *arg0, *arg1, ..., *argn, *envp[];

          int qexeclp (qsetp, flags, file, arg0, arg1, ..., argn,
               (char *)0)
          quadset_t *qsetp;
          int flags;
          char *file;
          char *arg0, *arg1, ..., *argn;

          int qexecv (qsetp, flags, file, argv)
          quadset_t *qsetp;
          int flags;
          char *file, *argv[];
          int qexecve (qsetp, flags, file, argv, envp)
          quadset_t *qsetp;
          int flags;
          char *file, *argv[], *envp[];

          int qexecvp (qsetp, flags, file, argv)
          quadset_t *qsetp;
          int flags;
          char *file, *argv[];

          The qexec variants operate the same as their corresponding
          variations of exec (see exec(2)), with the addition of
          specified quad placement.  The caller can specify a suitable
          quad or set of quads where the process should be located
          either through the qsetp argument or by setting the
          QUAD_ATTACH_TO_PARENT flag in the flags argument.

===> In the case where the caller specifies a quad (or set of quads), this
===> can be mapped to appropriate CpuMemSet and set as the current for the
===> process.

===> There is no mechanism provided by CpuMemSet to attach a process to 
===> another process to ensure that they always use the same quad.  The
===> closest is if the processes are mapped to exactly one quad, thus ensuring
===> that one of them will not migrate to a different quad than the other.

          If the qsetp argument is used to specify suitable quads,
          qsetp points to a quadset_t that identifies the quad or
          quads that are acceptable for the process.  A suitable quad
          or set of quads can be identified using quad_loc(2SEQ), and
          can be manipulated using the operators described in
          quademptyset(2SEQ).  If more than one quad is specified by
          qsetp , the process will be loaded on the best of the quads,
          taking CPU load, memory availability and other factors into
          account.  Unless the QUAD_INSIST bit in flags is set, the
          set of quads specified by qsetp is considered a hint, which
          may be overridden in extreme cases if all the quads in the
          specified quad set have very high CPU loads or too little
          available memory.  If the QUAD_INSIST bit is included in the
          flags argument, the quad specification is treated as
          "mandatory," and the process is loaded on one of the
          specified quads despite a large CPU load or memory shortage.
          Once the process has been loaded, using the qsetp argument
          with qexec has the same effect as using the R_QUAD option of
          attach_proc(2SEQ) to attach the process to a set of quads.
          If the set contains more than one quad, the process may
          migrate among the quads in the set for load-balancing
          purposes.  If the QUAD_INSIST bit in flags is not set, the
          process may also migrate to a quad outside the specified set
          of quads under the above extreme conditions.

===> See the soft versus hard policy discussion in the intro concerning
===> the (non-)support of QUAD_INSIST.

          It is advisable that QUAD_INSIST be used with considerable
          caution.  Situations are rare in which a process will
          perform better at a quad where a specific resource is
          located than at some other quad even under extreme CPU or
          memory overload conditions.  For example, being located on
          the same quad as an important resource is of little value to
          a process that has been swapped out due to an extreme memory
          overload, or that is being starved for CPU bandwidth under
          an extreme CPU overload.

          If the qsetp argument is used to specify suitable quads, the
          caller can further specify an appropriate quad by setting
          the QUAD_SMALLMEM or QUAD_LARGEMEM bits in the flags
          argument.  QUAD_SMALLMEM indicates that the process will
          have very low memory requirements, so can be placed on a
          quad having little available memory if that quad has a
          particularly light CPU load.  Conversely, if QUAD_LARGEMEM
          is set, the process is placed on the quad with the most
          available memory even though that quad may have a high CPU
          load.  QUAD_SMALLMEM and QUAD_LARGEMEM are also taken into
          account during any future process migrations.
          Alternatively, when using qexec, the user can specify a
          suitable quad for the process by setting the
          QUAD_ATTACH_TO_PARENT bit in the flags argument.  If the
          QUAD_ATTACH_TO_PARENT flag is set, the process is loaded on
          the quad where its parent process is currently located.
          Once the process has been loaded, using the
          QUAD_ATTACH_TO_PARENT bit with qexec has the same effect as
          using the R_PID option of attach_proc(2SEQ) to attach the
          process to its parent.  If either the child or parent
          process migrates to another quad for load-balancing
          purposes, the other process will accompany it.  Both
          processes will always be located on the same quad, until one
          either detaches or exits.  If the QUAD_ATTACH_TO_PARENT flag
          is set, the qsetp argument is ignored, as are the
          QUAD_SMALLMEM and QUAD_LARGEMEM flags.

===> There is no policy option provided by CpuMemSet to suggest placing 
===> a process on a quad with either large or small memory availability.

-----------------------------------------------------------------------

          pid_t shfork (uint_t flags)

          pid_t qfork (quadset_t *attach_qsetp, uint_t flags)

          pid_t shqfork (quadset_t *attach_qsetp, uint_t flags)

===> Except for using the "child" CpuMemSet instead of the current process
===> CpuMemSet, qfork and variants can be mapped to CpuMemSets in the same
===> manner (and with the same limitations) as qexec.

          qfork causes creation of a new process.  The new process
          (child process) is an exact copy of the calling process
          (parent process).  This means the child process inherits the
          following attributes from the parent process:

               environment

               close-on-exec flag (see exec(2))

               signal handling settings (i.e., SIG_DFL , SIG_IGN ,
               SIG_HOLD , function address)

               set-user-ID mode bit

               set-group-ID mode bit

               profiling on/off status
               nice value (see nice(2))

               all attached shared memory segments (see shmop(2))

               all attached mapped regions (see mmap(2SEQ))

               process group ID

               session ID

               foreground process ID (see exit(2))

               current working directory

               root directory

               file mode creation mask (see umask(2))

               file size limit (see ulimit(2))

          The child process differs from the parent process in the
          following ways:

               The child process has a unique process ID.

               The child process has a different parent process ID
               (i.e., the process ID of the parent process).

               When using fork and qfork, the child process has its
               own copy of the parent's file descriptors.  Each of the
               child's file descriptors shares a common file pointer
               with the corresponding file descriptor of the parent.
               However, if shfork or shqfork is used, the child
               process may share the parent's file descriptors (more
               below on these variations of fork).

               All semadj values are cleared (see semop(2)).

               Process locks, text locks and data locks are not
               inherited by the child (see plock(2)).

               The child process's utime, stime, cutime, and cstime
               are set to 0.  The time left until an alarm clock
               signal is reset to 0.

          qfork is used when the caller wishes to specify the quad or
          set of quads where the child process should be located.  The
          caller can specify suitable quads either through the
          attach_qset argument or by setting the QUAD_ATTACH_TO_PARENT
          flag in the flags argument.

          If the attach_qset argument is used to specify suitable
          quads, attach_qset points to a quadset_t that identifies the
          quad or quads that are acceptable for the process.  A
          suitable quad or set of quads can be identified using
          quad_loc(2SEQ), and can be manipulated using the operators
          described in quademptyset(2SEQ).  If more than one quad is
          specified by attach_qset , the process will be loaded on the
          best of the quads, taking CPU load, memory availability and
          other factors into account.  Unless the QUAD_INSIST bit in
          flags is set, the set of quads specified by attach_qset is
          considered a hint, which may be overridden in extreme cases
          if all the quads in the specified quad set have very high
          CPU loads or too little available memory.  If the
          QUAD_INSIST bit is included in the flags argument, the quad
          specification is treated as "mandatory," and the child
          process is loaded on one of the specified quads despite a
          large CPU load or memory shortage.  Once the process has
          been loaded, using the attach_qset argument with qfork has
          the same effect as using the R_QUAD option of
          attach_proc(2SEQ) to attach the child process to a set of
          quads.  If the set contains more than one quad, the process
          may migrate among the quads in the set for load-balancing
          purposes.  If the QUAD_INSIST bit in flags is not set, the
          process may also migrate to a quad outside the specified set
          of quads under the above extreme conditions.

          It is advisable that QUAD_INSIST be used with considerable
          caution.  Situations are rare in which a process will
          perform better at a quad where a specific resource is
          located than at some other quad even under extreme CPU or
          memory overload conditions.  For example, being located on
          the same quad as an important resource is of little value to
          a process that has been swapped out due to an extreme memory
          overload, or that is being starved for CPU bandwidth under
          an extreme CPU overload.

          If the attach_qset argument is used to specify suitable
          quads, the caller can further specify an appropriate quad by
          setting the QUAD_SMALLMEM or QUAD_LARGEMEM bits in the flags
          argument.  QUAD_SMALLMEM indicates that the child will have
          very low memory requirements, so can be placed on a quad
          having little available memory if that quad has a
          particularly light CPU load.  Conversely, if QUAD_LARGEMEM
          is set, the process is placed on the quad with the most
          available memory even though that quad may have a high CPU
          load.  QUAD_SMALLMEM and QUAD_LARGEMEM are also taken into
          account during any future process migrations.

          Alternatively, when using qfork, the user can specify a
          suitable quad for the child process by setting the
          QUAD_ATTACH_TO_PARENT bit in the flags argument.  If the
          QUAD_ATTACH_TO_PARENT flag is set, the child process is
          loaded on the quad where the parent process is currently
          located.  Once the child process has been loaded, using the
          QUAD_ATTACH_TO_PARENT bit with qfork has the same effect as
          using the R_PID option of attach_proc(2SEQ) to attach the
          child to its parent.  If either the child or parent process
          migrates to another quad for load-balancing purposes, the
          other process will accompany it.  Both processes will always
          be located on the same quad, until one either detaches or
          exits.  If the QUAD_ATTACH_TO_PARENT flag is set, the
          attach_qset argument is ignored, as are the QUAD_SMALLMEM
          and QUAD_LARGEMEM flags.

          shfork operates the same as fork, but allows the child
          process to share the parent processes file descriptors
          instead of just copying them.  To enable the file descriptor
          sharing, the FM_SHARE_OFILE bit must be set in flags.  If
          this flag is not set, shfork will act exactly the same as 
          fork.

          shqfork allows for the same operation as qfork along with
          the shared file descriptors of shfork.  As with shfork, the
          FM_SHARE_OFILE flag must be set to enable the sharing.

     FLAG VALUES
          The following bit patterns are valid values for the flags
          argument.

          for use with shfork and shqfork:
          FM_SHARE_OFILE 0x0002         share open file table

          for use with qfork and shqfork:
          QUAD_INSIST    0x00000010     mandatory quad placement
          QUAD_SMALLMEM  0x00000020     requires little memory
          QUAD_LARGEMEM  0x00000040     requires lots of memory 

-----------------------------------------------------------------------

          int quademptyset (set) 
          quadset_t *set;

          int quadfillset (set)
          quadset_t *set;

          int rgn_quadfillset (set, rgnname)
          quadset_t *set; 
          const char* rgnname;

          int quadaddset (set, quadno)
          quadset_t *set;
          int quadno;

          int quaddelset (set, quadno)
          quadset_t *set;
          int quadno;

          int quadismember (set, quadno)
          quadset_t *set;
          int quadno;

          int quadisemptyset (set)
          quadset_t *set;

          int quadandset (set1, set2)
          quadset_t *set1, *set2;

          int quadorset (set1, set2)
          quadset_t *set1, *set2;

          int quaddiffset (set1, set2)
          quadset_t *set1, *set2;

===> These APIs manipulate quadset_t objects which are opaque data types
===> that describe collections of quads.  In mapping the DYNIX/ptx NUMA
===> APIs to CpuMemSets, the quadset_t object is still used by the DYNIX/ptx
===> API.  Thus the quadset_t manipulating APIs should be used and can be
===> implemented in the same manner as on DYNIX/ptx.  At the point in time
===> that another DYNIX/ptx NUMA API uses a quadset, the quadset can be
===> mapped into an equivalent CpuMemSet.

          The quadsetops primitives manipulate sets of quads.  They
          operate on data objects of type quadset_t.

          The quademptyset function initializes the quad set pointed
          to by set, such that no quads are included in the set.

          The quadfillset function initializes the quad set pointed to
          by set, such that all quads that are currently configured in
          the caller's region are included in the set.

          The rgn_quadfillset function initializes the quad set
          pointed to by set, such that all quads that are currently
          configured in the region rgnname are included in the set.

          If rgnname is NULL then rgn_quadfillset initializes the
          quadset pointed to by set, such that all quads that are
          currently configured in the caller's region are included in
          the set.

          Applications should call either quademptyset or quadfillset
          at least once for each object of type quadset_t prior to any
          other operation on the object.  If such an object is not
          initialized in this way, but is supplied as an argument to
          any of the quadaddset, quaddelset, quadismember, quadandset,
          quadorset, quaddiffset, qfork, qexec, etc.  functions, the
          results are undefined.

          The quadaddset and quaddelset functions respectively add and
          delete the individual quads specified by the value of quadno
          from the quad set pointed to by the argument set.

          The quadandset and quadorset functions respectively perform
          logical and and or operations on the quad sets pointed to by
          the arguments set1 and set2, storing the result in the quad
          set pointed to by set1.

          The quaddiffset function finds the logical difference (the
          members that are contained in the first set but not in the
          second set) between the quad sets pointed to by the
          arguments set1 and set2.  The result is stored in the quad
          set pointed to by set1.

          The quadismember function tests whether the quad specified
          by the value of quadno is a member of the set pointed to by
          the argument set, and the quadisemptyset function tests
          whether the quad set pointed to by the argument set is
          empty.

-----------------------------------------------------------------------

          int shmget (key, size, shmflg)
          key_t key;
          int size, shmflg;
               N/A

          int shmgetq (key, size, shmflg, qds)
          key_t key;
          int size, shmflg;
          quadset_t *qds;

===> Can create a CpuMemSet which maps the requested quad affiliation and
===> attach to the vm area describing the shared memory.  As with exec, fork,
===> etc., this lacks some of the options/capabilities of DYNIX/ptx.

          int shmgetv (key, size, shmflg, shmvcnt, shmv)
          key_t key;
          size_t size,
          int shmflg, shmvcnt;
          shmvec_t *shmv;
               N/A

          shmgetq(2SEQ) behaves the same as shmget(2) with the
          addition of a qds parameter used to specify a quadset to
          restrict the set of quads the paging policy flags affect.
          By default shmget(2) assumes a quadset containing all the
          quads in the system.  See quademptyset(2SEQ) for information
          on constructing quadsets.

-----------------------------------------------------------------------

          int tmp_affinity(processor);
          int  processor;

===> Create a CpuMemSet with only the requested processor in it, include
===> the memory on the same processor node, and assign the process to it.

===> If AFF_NONE is specified, then assign the process to the system's
===> global CpuMemSet.

===> Must change both the current process CpuMemSet and the process's
===> child CpuMemSet.

===> No Linux support for PRIV_REGION or PRIV_SCHED.

          tmp_affinity allows a process to be bound to a specified
          logical processor.  Processor numbers start at zero and are
          numbered contiguously.  Deconfigured processors are not
          included.  The previous affinity is returned.

          The process must have PRIV_REGION vectored privilege to be
          bound to a processor which does not belong to the region to
          which the process belongs. On a successful bind, if the
          process belongs to the system region, then it stays in the
          system region, otherwise it moves to the user defined region
          to which the processor belongs.

          If the processor argument is AFF_NONE, the process is
          realeased from any previous affinity (that is, allowed to
          migrate within its region).  If the processor argument is
          AFF_QUERY, the current affinity, or AFF_NONE is returned,
          without changing the current affinity.

          You must have the  PRIV_SCHED vectored privilege to change
          affinity.

          Affinity is inherited across the fork(2) and exec(2) system
          calls.

Thread: [Lse-tech] DYNIX/ptx NUMA APIs mapping to CpuMemSets

lse-tech