From: Michael H. <hoh...@us...> - 2001-11-16 22:07:09
In an effort to enhance my understanding of the CpuMemSet proposal I performed a mapping of the DYNIX/ptx NUMA APIs to the CpuMemSet facility. This effort also attempted to validate the claim that "this CpuMemSet facility is intended to provide the power sufficient to support emulations of these existing API's." In doing this I came across several shortcomings, which are listed below. At the end of this summary is the mapping, with a discussion of issues encountered.

1. CpuMemSets makes an attempt at providing support for memory allocation
   policy, but falls short of providing the allocation policies supported
   by DYNIX/ptx. By providing only partial support for memory allocation
   policies it opens up the possibility that memory allocation policies
   might be implemented in various places, instead of through one common
   mechanism.

   Policies missing:

   * soft versus hard - DYNIX/ptx has the notion of treating placement
     requests as either hints (soft) or demands (hard). CpuMemSet
     provides only the hard option.

   * first touch, followed by round robin - the default algorithm for
     memory allocations on DYNIX/ptx is to allocate on the same quad
     the process is running on, and if none is available, to round robin
     through the remaining quads. The CpuMemSet choices are either
     round robin or always in the order of the memory lists.

   * LARGEMEM/SMALLMEM hints - DYNIX/ptx provides the option for a
     process to indicate whether it is a large memory user or a small
     memory user, so that the system can place it on an appropriately
     loaded quad.

   * MOSTFREE - i.e., allocate on the node with the most available
     memory.

2. No mechanism is provided to migrate a process to a different node after
   the process is assigned to one via CpuMemSet.

3. There is no means to bind processes such that if one moves to a
   different node, they all move to that node. (Or conversely, to
   prevent the migration of one to a different node when it needs to
   remain on its current node with other specific processes.)

4. DYNIX/ptx has the ability to attach a process to a shared memory
   segment. This binds the process to running on the quads that the
   shared memory segment resides on. In CpuMemSets there is a note
   that the CPU portion of a CpuMemSet for a memory area is ignored.
   To support the DYNIX/ptx attachment of a process to a shared memory
   segment, the CPU portion of the memory area's CpuMemSet needs to
   be used.

5. It appears that mmap uses the CpuMemSet of the current process.
   Thus, to target a different set of resources (NUMA nodes) than
   the current process is using, it is necessary to change the current
   CpuMemSet of the process to that desired for the mmap, execute
   the mmap operation, and then restore the CpuMemSet of the current
   process to what it was originally. This opens up the possibility
   that any pages allocated for the process during this activity could
   end up being allocated on the wrong node (i.e., on a node that was
   intended for the mmap). Perhaps a third CpuMemSet needs to be
   established for a process, to be used for mmaps and shmget.

6. Not really a CpuMemSet deficiency, but DYNIX/ptx had a privilege
   vector implementation that allowed the granting of some typically
   root-only capabilities to specific users. Some NUMA APIs required
   specific privileges for non-root users. This could not be mapped
   into Linux.

7. More of a NUMA Topology issue - DYNIX/ptx provided an efficient
   interface for a process to obtain the engine number and quad number
   that it is executing on. This is not yet supported.

Those are the main deficiencies found.
The following is the outcome of the mapping work. The first part outlines the general strategy; the remainder is snippets from man pages that define the APIs, with discussion interjected about how the mapping could happen and what problems arise. Any suggestions, corrections, discussion, etc. are welcome.

Michael Hohnbaum
hoh...@us...

Mapping DYNIX/ptx NUMA APIs onto CpuMemSets
-------------------------------------------

The DYNIX/ptx NUMA APIs are based around the notion of a quad, with a quad consisting of 4 CPUs, memory, and PCI slots. Quads are numbered starting with 0 and continuing up with no gaps in the numbering name space. Similarly, CPUs are numbered starting with 0 and continuing up with no gaps in the numbering name space. CPUs 0-3 are on quad 0, CPUs 4-7 are on quad 1, etc. A missing CPU on a quad (e.g., processor failure, or a non-fully populated quad) does not result in a gap in the processor number name space. Thus, some mechanism must be provided to associate processors with quads. This issue is being worked as part of the NUMA Topology effort.

To map the ptx NUMA APIs to CpuMemSets, one must map the quad resources into the CpuMemSets structures. The general mapping described below assumes fully populated quads (i.e., four functioning processors per quad). This mapping will differ on systems with fewer than four processors per quad.

NOTE: This assumes that a CpuMemSet structure may be established that applies to all processes on the system as the default.

First, create a CpuMemMap that contains all of the CPUs and memory of the system, treating the memory in each quad as a separate memory. (N equals the number of quads in the system.) The CpuMemMap contains:

    {
        4N,   # number of CPUs
        p1,   # pointer to array of cms_pcpu_t
        N,    # number of memories
        p2    # pointer to array of cms_pmem_t
    }

    p1 = [0,1, .. ,4N-1]
    p2 = [0, .. ,N-1]

This maps logical CPU addresses one to one to physical CPU addresses (e.g., logical processor 0 is mapped to physical processor 0, etc.). It maps memories one to one to the quad number that contains them.

Then create a CpuMemSet that establishes a linkage between the CPUs and the memory on the same quad.

    {
        CMS_DEFAULT,   # cms_policy
        4N,            # number of CPUs
        q1,            # pointer to array of cms_lcpu_t
        N,             # number of memories
        q2             # pointer to array of cms_memory_list
    }

    q1 = [0,1,2, .. ,<number of CPUs>]
    q2 = [ {4, r1, N, s1} ... {4, rN, N, sN} ]
    r1 = [0,1,2,3]
    rN = [4N-4,4N-3,4N-2,4N-1]
    s1 = [0,1,..,N-1]
    s2 = [1,2,..,N-1,0]
    sN = [N-1,0,..,N-2]

This establishes sets of 4 processors such that memory allocations from these processors try to be resolved from the memory on the same quad and, if that is not possible, will attempt to get memory from the next quad up, modulo the number of quads.

NOTE: This does not match the policy in DYNIX/ptx but is a reasonable approximation.

Soft versus Hard Allocation Policies
------------------------------------

The DYNIX/ptx process placement APIs provide the option of being used as "hints" (soft policy) or firm directives (hard policy). By default, soft policy is assumed. Hard policy is indicated by the use of the INSIST flag with the associated system calls.

CpuMemSet does not provide this soft versus hard policy distinction. Rather, a set of resources is identified by a CpuMemSet and a process is associated with the CpuMemSet. This is equivalent to a hard policy in the DYNIX/ptx implementation.
Therefore, in a mapping of the DYNIX/ptx NUMA APIs to CpuMemSet, either the soft policy of DYNIX/ptx must be abandoned, or the logic must be placed at the NUMA API level to turn the scheduling and placement decisions of the DYNIX/ptx soft policy into CpuMemSet hard policy. To accomplish the latter requires the exporting of all necessary kernel usage metrics, including memory and processor usage on each node, location of shared memory, locality of I/O devices used by the process, etc. This type of information, and management of system resources based upon it, really belongs within the kernel. It is neither practical nor likely to be performant to manage this external to the kernel. Because of these considerations, the DYNIX/ptx NUMA API to CpuMemSet mapping will only provide for the hard policy.

API Mappings
------------

What follows are the DYNIX/ptx APIs with a discussion of how they map onto CpuMemSets as defined above. These are snippets from the man pages with enough content left to, hopefully, explain the intent of the API. For the complete man page go to http://oss.sgi.com/projects/numa/download/dynix.

int tmp_ctl(command, arg)
int command, arg;

TMP_NENG returns the number of processors configured in the processor pool. The processors may be online or offline. However, deconfigured processors will not be included. The arg argument is ignored.

===> Note that a deconfigured processor is one that is logically removed
===> from the system before the OS is booted, and thus the OS is not
===> aware of its presence. An offline processor is one that is present
===> when the OS boots and the OS chooses to remove from service.
===> cmsQueryCMM (CMS_KERNEL, 0, NULL)
===> nr_cpus contains the number of CPUs in the system
===>
===> NOTE: CMS_KERNEL is listed as root only. Can non-root use it for query?

TMP_RGN_NENG N/A
TMP_OFFLINE N/A
TMP_ONLINE N/A
TMP_QUERY N/A
TMP_TYPE N/A
TMP_RATE N/A

TMP_NQUAD returns the number of quads configured in the quad pool. The quads may be online or offline. However, deconfigured quads will not be included. The arg argument is ignored.

===> cmsQueryCMM (CMS_KERNEL, 0, NULL)
===> nr_mems in the resulting structure maps to the number of quads
===>
===> NOTE: CMS_KERNEL is listed as root only. Can non-root use it for query?

TMP_RGN_NQUAD N/A

TMP_ENGTOQUAD returns the quad on which the processor specified by arg is located.

===> derivable based on the 4 cpus per quad and numbering assumptions
===> mentioned above for fully populated quads. Needs support from NUMA
===> Topology to handle non-fully populated quads.

TMP_QUADNEXTENG, given a processor number (specified in arg), returns the next processor on that quad, or the first processor on that quad if there are no more processors on that quad.

===> derivable based on the 4 cpus per quad and numbering assumptions
===> mentioned above for fully populated quads. Needs support from NUMA
===> Topology to handle non-fully populated quads.

TMP_RGN_QUADNEXTENG N/A

TMP_QUADTOENG returns the first processor located on the quad specified by arg.

===> derivable based on the 4 cpus per quad and numbering assumptions
===> mentioned above for fully populated quads. Needs support from NUMA
===> Topology to handle non-fully populated quads. (A sketch of these
===> derivations for fully populated quads follows.)
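With the fully-populated-quad numbering assumptions above (4 CPUs per quad, numbered contiguously from 0), the three derivable commands reduce to simple arithmetic. The following is a minimal sketch of the emulation-library logic; it is illustrative only, and is not part of the CpuMemSet proposal itself:

#define CPUS_PER_QUAD 4   /* fully populated quads assumed */

/* TMP_ENGTOQUAD: quad on which processor 'eng' is located */
static int eng_to_quad(int eng)
{
        return eng / CPUS_PER_QUAD;
}

/* TMP_QUADNEXTENG: next processor on the same quad as 'eng',
 * wrapping back to the first processor on that quad */
static int quad_next_eng(int eng)
{
        int quad = eng / CPUS_PER_QUAD;

        return quad * CPUS_PER_QUAD + (eng + 1) % CPUS_PER_QUAD;
}

/* TMP_QUADTOENG: first processor located on quad 'quad' */
static int quad_to_eng(int quad)
{
        return quad * CPUS_PER_QUAD;
}

Non-fully populated quads break this pure arithmetic; there the NUMA Topology work would need to supply an explicit CPU-to-quad table instead.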
TMP_RGN_QUADTOENG N/A

-----------------------------------------------------------------------

int attach_proc(rsrc_type, rsrcp, flags, pid)
rsrctype_t rsrc_type; rsrcdescr_t *rsrcp; int flags; pid_t pid;

typedef union rsrcdescr {
        quadset_t rd_quadset;
        int       rd_fd;
        char     *rd_pathname;
        int       rd_shmid;
        pid_t     rd_pid;
} rsrcdescr_t;

attach_proc attaches a process to the resource specified by rsrc_type and rsrcp. Attachment implies that the process will be located on the quad or set of quads where the specified resource is located. If the process is not currently located at the quad where the specified resource is located, the process will migrate to that quad. If the resource is located on several quads, the process will migrate to the best of those quads, taking CPU load and memory availability into account. It's possible that none of those quads is suitable to have the process migrated to it, due to extremely high CPU loads or insufficient available memory. In that case, attach_proc returns -1 and sets errno to EAGAIN to advise the caller to try again later. This possibility can be avoided by including the QUAD_INSIST bit in flags, which makes choosing one of the quads on which the resource is located mandatory.

===> A CpuMemSet can be constructed that includes the quads which
===> match the criteria specified, and the process assigned to
===> that CpuMemSet.
===> Migration of the process to the CpuMemSet is not provided for.
===> CpuMemSet does not have the semantics to provide the QUAD_INSIST
===> option. Actually, it always provides QUAD_INSIST, unless the target
===> has a MASTER, in which case CpuMemSet refuses the attachment.

Note that the process' run queue must include processors on at least one of the quads where the specified resource is located. If no processors on any of these quads schedule from the process' run queue, attach_proc returns -1 and sets errno to EINVAL. In that case, to allow the process to be attached to the desired quads, either the process must be reassigned to a run queue having processors on at least one of the desired quads, or processors on at least one of the desired quads must be added to the process' run queue (see the 'assign' and 'change' options of rqadmin(1M)).

===> Not currently an issue. Will need to evaluate against the MQ scheduler.

If the specified resource is located on more than one quad, the process may later migrate among these quads for load-balancing purposes. If the QUAD_INSIST bit is not set, the process may also migrate to a quad where the resource is not located, though this is done only under extreme CPU or memory load conditions.

It is advisable that QUAD_INSIST be used with considerable caution. Situations are rare in which a process will perform better at a quad where a specific resource is located than at some other quad, even under extreme CPU or memory overload conditions. For example, being located on the same quad as an important resource is of little value to a process that has been swapped out due to an extreme memory overload, or that is being starved for CPU bandwidth under an extreme CPU overload.

If rsrc_type is R_QUAD, the process is attached to the set of quads specified by rd_quadset.

If rsrc_type is R_PID, the process is attached to the process specified by rd_pid. Attachment to a process implies that the attached process will always be located on the same quad as the process it is attached to, until one of them either detaches, attaches to another resource, or exits.
If any process that is attached to other processes migrates to another quad for load-balancing purposes, the processes attached to it will accompany it.

===> No way to enforce attachment to the same quad as another process
===> unless the other process is attached to one specific quad via
===> a CpuMemSet. This would inhibit the process's ability to
===> migrate between quads.

If rsrc_type is R_SHM, the process is attached to the set of quads that may contain memory pages that are part of the shared memory segment specified by rd_shmid (see shmgetq(2SEQ)).

===> A shared memory segment has a CpuMemSet, so the requesting process
===> would be attached to that CpuMemSet. There is a note in the
===> proposal that the CPU portion of a CpuMemSet attached to a memory
===> area is ignored. For the mapping requested through attach_proc
===> the CPU portion would need to be used.

If rsrc_type is R_FILDES, the process is attached to the set of quads on which the resource specified by rd_fd resides. Typically, this implies that a process running on one of the quads in that set of quads will have more efficient access to that resource. For different resource types, the operating system uses different criteria for determining which quads to include in this quad set. If rd_fd is a stream, the process is attached to the quad where the memory containing the stream head resides. If rd_fd is a file, the process is attached to the set of quads that are able to DMA directly to the disk on which the file is located. If rd_fd specifies a device, the process is attached to the set of quads that have efficient access to that device. attach_proc works similarly if rd_fd specifies a socket, FIFO or remote file.

If rsrc_type is R_PATH, the process is attached to the set of quads that are nearest to the resource specified by rd_pathname. The criteria for determining "nearest" are the same as for rsrc_type R_FILDES.

===> A decision can be made as to which quad(s) best fit the criteria,
===> and a CpuMemSet created containing it (them). Can all of the
===> information needed to make an appropriate decision be obtained?

The caller can further specify an appropriate quad by setting the QUAD_SMALLMEM or QUAD_LARGEMEM bits in the flags argument. QUAD_SMALLMEM indicates that the process will have very low memory requirements, so it can be placed on a quad having little available memory if that quad has a particularly light CPU load. Conversely, QUAD_LARGEMEM indicates that the process should be placed on the quad with the most available memory even though that quad may have a high CPU load. QUAD_SMALLMEM and QUAD_LARGEMEM are also taken into account during any future process migrations.

===> The memory lists associated with the CpuMemSet can be ordered
===> such that the above selection criteria are factored in.
===> This could be accomplished by querying the memory usage of
===> the quads and, based upon the memory available on the quads at
===> that point in time, ordering the memory list in the CpuMemSet
===> to favor the node that meets the LARGEMEM/SMALLMEM hint.
===> (A sketch of this ordering appears at the end of this entry.)

A process that attaches itself to a resource loses all previous attachment.

The real or effective user ID of the calling process must match the real or saved set-user-ID of the process to be migrated, unless the effective user ID of the calling process is superuser or unless the caller holds the PRIV_SCHED (scheduling) privilege.

===> No concept of PRIV_SCHED in Linux, so the restriction is to root
===> or a RUID/EUID match.
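As a rough illustration of the memory-list ordering suggested above, the emulation library could sort a CpuMemSet memory list by per-quad free memory at attach time. This is only a sketch: quad_free_pages() is a hypothetical helper (presumably built on whatever per-node statistics the NUMA Topology work ends up exporting), and the cms_memory_t element type is an assumption, not a definition from the proposal:

#include <stdlib.h>

typedef int cms_memory_t;      /* assumed: a memory (quad) number */

/* Hypothetical helper: free pages currently available on a quad. */
extern long quad_free_pages(int quad);

static int cmp_most_free(const void *a, const void *b)
{
        long fa = quad_free_pages(*(const cms_memory_t *)a);
        long fb = quad_free_pages(*(const cms_memory_t *)b);

        return (fa < fb) - (fa > fb);   /* descending: most free first */
}

/* Order a memory list to favor a QUAD_LARGEMEM request; a QUAD_SMALLMEM
 * request would sort with the opposite sense of the comparator. */
static void order_for_largemem(cms_memory_t *mems, int nmems)
{
        qsort(mems, nmems, sizeof(*mems), cmp_most_free);
}

Note that, as the annotation above says, this is a point-in-time ordering; it approximates, but does not dynamically track, the MOSTFREE behavior.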
-----------------------------------------------------------------------

int detach_proc(pid_t pid)

detach_proc disassociates the process identified by pid from the quad set it was previously attached to, if it was attached to a quad set. The real or effective uid of the caller must match the real or saved set-user-id of the process to be detached, unless the caller is the superuser or holds the PRIV_SCHED (scheduling) privilege.

===> Revert the process back to using the system-wide CpuMemSet

-----------------------------------------------------------------------

getengno, GETENGNO, getquadno, GETQUADNO, engdata_init, all_processors_support_mmx, all_processors_support_simdx - get/init user-visible engine data

int getengno(void), GETENGNO()
int getquadno(void), GETQUADNO()
int all_processors_support_mmx(void) N/A
int all_processors_support_simdx(void) N/A
void engdata_init(void)

===> The efficient i/f for obtaining the information described here
===> is not provided as part of CpuMemSet. No plans, that I know of,
===> exist for this capability in the NUMA Topology project, either.

DESCRIPTION

These interfaces are provided for the benefit of application processes that want to take advantage of the NUMA (Non Uniform Memory Access) architecture, or of specific processor instruction set extensions (MMX and/or the Streaming SIMD extensions). These are very fast and efficient interfaces (overhead is only a very few clock ticks) for obtaining certain information, such as the engine number and the quad number that a process is currently executing on, or the presence of MMX and/or Streaming SIMD extension instruction support.

engdata_init establishes access to the getengno, GETENGNO(), getquadno and GETQUADNO interfaces, making the data accessible in the address space of the calling process for subsequent use. engdata_init should be called once and must precede the first call to any of these interfaces.

getengno, when called from within a user process, returns the engine number on which the process is currently running. GETENGNO() is a cpp macro that eliminates the overhead of a function call and gives the same result. getquadno returns the quad number on which the process is currently running. GETQUADNO() is the equivalent cpp macro. The cpp macros GETENGNO and GETQUADNO are defined in the /usr/include/engdata.h header file.

The all_processors_support_mmx() and all_processors_support_simdx() primitives are defined in the /usr/include/engdata.h header file as in-line assembly-code operations, and return a non-zero value only if all processors in the system respectively support the MMX and Streaming SIMD extension instructions. The engdata_init primitive does not need to be invoked to use the all_processors_support_mmx() or all_processors_support_simdx() primitives.

-----------------------------------------------------------------------

void *mmap (void *addr, size_t len, int prot, int flags, int fd, off_t pos) N/A
void *mmap64 (void *addr, size_t len, int prot, int flags, int fd, off64_t pos) N/A
void *mmapq (void *addr, size_t len, int prot, int flags, int fd, off_t pos, quadset_t *qsp)
void *mmap64q (void *addr, size_t len, int prot, int flags, int fd, off64_t pos, quadset_t *qsp)

The paging policy flags MAP_FIRSTREF, MAP_MOSTFREE and MAP_DISTRIBUTE control which quads the pages for the mapped region come from. The paging policy flags are in effect for all versions of the mmap() system call.
The mmap() and mmap64() versions use a default quadset involving all quads in the system; mmapq() and mmap64q() can be used to specify a quadset with fewer than the full complement of system quads.

===> It appears that mmap uses the CpuMemSet of the current process.
===> Thus, to target a different set of resources (NUMA nodes) than
===> the current process is using, it is necessary to change the current
===> CpuMemSet of the process to that desired for the mmap, execute
===> the mmap operation, and then restore the CpuMemSet for the current
===> process to what it was originally. This opens up the possibility
===> that any pages allocated during this activity for the process could
===> end up being allocated on the wrong node (i.e., on a node that was
===> intended for the mmap). Perhaps a third CpuMemSet needs to be
===> established for a process, to be used for mmaps and shmget.
===> Excluding the selection of paging policy, mmap using the system
===> CpuMemSet is equivalent to the DYNIX/ptx version. CpuMemSets
===> provides paging policies of first touch, which is equivalent to
===> MAP_FIRSTREF, and round robin, which is equivalent to MAP_DISTRIBUTE.
===> There is not an equivalent for MAP_MOSTFREE, and it has been suggested
===> that some other mechanism be provided to implement this sort of
===> paging policy. A solution for providing support for multiple
===> policies to support the DYNIX/ptx APIs is needed.
===> If mmap is issued with a paging policy that differs from the one
===> associated with the system CpuMemSet, then an identical CpuMemSet
===> can be constructed, but with a different paging policy. In the
===> current definition of CpuMemSet, this only allows switching between
===> first touch and round robin.

mmapq() behaves the same as mmap(), with the addition of the qds parameter used to restrict the set of quads the paging policy flags MAP_FIRSTREF, MAP_MOSTFREE and MAP_DISTRIBUTE affect. By default mmap() and mmap64() assume a quadset containing all the quads in the system. The mmapq() and mmap64q() variants are used to specify a quadset with fewer quads, to restrict the paging policy to just those quads in qds. See quademptyset(2SEQ) for information on the construction of quadsets.

===> For mmapq, a quadset specifier is provided. Assuming there is an
===> API equivalent to the DYNIX/ptx quadset APIs that creates equivalent
===> CpuMemSets, the mapping performed by mmapq should be fairly
===> straightforward.

mmap64q() is the union of mmap64() and mmapq() in that it allows the specification of 64-bit offsets with pos and a restricted quadset with the qds parameter.

-----------------------------------------------------------------------

int qexecl (qsetp, flags, file, arg0, arg1, ..., argn, (char *)0)
quadset_t *qsetp; int flags; char *file, *arg0, *arg1, ..., *argn;

int qexecle (qsetp, flags, file, arg0, arg1, ..., argn, (char *)0, envp)
quadset_t *qsetp; int flags; char *file, *arg0, *arg1, ..., *argn, *envp[];

int qexeclp (qsetp, flags, file, arg0, arg1, ..., argn, (char *)0)
quadset_t *qsetp; int flags; char *file; char *arg0, *arg1, ..., *argn;

int qexecv (qsetp, flags, file, argv)
quadset_t *qsetp; int flags; char *file, *argv[];

int qexecve (qsetp, flags, file, argv, envp)
quadset_t *qsetp; int flags; char *file, *argv[], *envp[];

int qexecvp (qsetp, flags, file, argv)
quadset_t *qsetp; int flags; char *file, *argv[];

The qexec variants operate the same as their corresponding variations of exec (see exec(2)), with the addition of specified quad placement.
The caller can specify a suitable quad or set of quads where the process should be located, either through the qsetp argument or by setting the QUAD_ATTACH_TO_PARENT flag in the flags argument.

===> In the case where the caller specifies a quad (or set of quads), this
===> can be mapped to an appropriate CpuMemSet and set as the current one
===> for the process.
===> There is no mechanism provided by CpuMemSet to attach a process to
===> another process to ensure that they always use the same quad. The
===> closest is if the processes are mapped to exactly one quad, thus
===> ensuring that one of them will not migrate to a different quad than
===> the other.

If the qsetp argument is used to specify suitable quads, qsetp points to a quadset_t that identifies the quad or quads that are acceptable for the process. A suitable quad or set of quads can be identified using quad_loc(2SEQ), and can be manipulated using the operators described in quademptyset(2SEQ). If more than one quad is specified by qsetp, the process will be loaded on the best of the quads, taking CPU load, memory availability and other factors into account.

Unless the QUAD_INSIST bit in flags is set, the set of quads specified by qsetp is considered a hint, which may be overridden in extreme cases if all the quads in the specified quad set have very high CPU loads or too little available memory. If the QUAD_INSIST bit is included in the flags argument, the quad specification is treated as "mandatory," and the process is loaded on one of the specified quads despite a large CPU load or memory shortage.

Once the process has been loaded, using the qsetp argument with qexec has the same effect as using the R_QUAD option of attach_proc(2SEQ) to attach the process to a set of quads. If the set contains more than one quad, the process may migrate among the quads in the set for load-balancing purposes. If the QUAD_INSIST bit in flags is not set, the process may also migrate to a quad outside the specified set of quads under the above extreme conditions.

===> See the soft versus hard policy discussion in the intro concerning
===> the (non-)support of QUAD_INSIST.

It is advisable that QUAD_INSIST be used with considerable caution. Situations are rare in which a process will perform better at a quad where a specific resource is located than at some other quad, even under extreme CPU or memory overload conditions. For example, being located on the same quad as an important resource is of little value to a process that has been swapped out due to an extreme memory overload, or that is being starved for CPU bandwidth under an extreme CPU overload.

If the qsetp argument is used to specify suitable quads, the caller can further specify an appropriate quad by setting the QUAD_SMALLMEM or QUAD_LARGEMEM bits in the flags argument. QUAD_SMALLMEM indicates that the process will have very low memory requirements, so it can be placed on a quad having little available memory if that quad has a particularly light CPU load. Conversely, if QUAD_LARGEMEM is set, the process is placed on the quad with the most available memory even though that quad may have a high CPU load. QUAD_SMALLMEM and QUAD_LARGEMEM are also taken into account during any future process migrations.

Alternatively, when using qexec, the user can specify a suitable quad for the process by setting the QUAD_ATTACH_TO_PARENT bit in the flags argument. If the QUAD_ATTACH_TO_PARENT flag is set, the process is loaded on the quad where its parent process is currently located.
Once the process has been loaded, using the QUAD_ATTACH_TO_PARENT bit with qexec has the same effect as using the R_PID option of attach_proc(2SEQ) to attach the process to its parent. If either the child or parent process migrates to another quad for load-balancing purposes, the other process will accompany it. Both processes will always be located on the same quad, until one either detaches or exits. If the QUAD_ATTACH_TO_PARENT flag is set, the qsetp argument is ignored, as are the QUAD_SMALLMEM and QUAD_LARGEMEM flags.

===> There is no policy option provided by CpuMemSet to suggest placing
===> a process on a quad with either large or small memory availability.

-----------------------------------------------------------------------

pid_t shfork (uint_t flags)
pid_t qfork (quadset_t *attach_qsetp, uint_t flags)
pid_t shqfork (quadset_t *attach_qsetp, uint_t flags)

===> Except for using the "child" CpuMemSet instead of the current process
===> CpuMemSet, qfork and variants can be mapped to CpuMemSets in the same
===> manner (and with the same limitations) as qexec.

qfork causes creation of a new process. The new process (child process) is an exact copy of the calling process (parent process). This means the child process inherits the following attributes from the parent process:

    environment
    close-on-exec flag (see exec(2))
    signal handling settings (i.e., SIG_DFL, SIG_IGN, SIG_HOLD, function address)
    set-user-ID mode bit
    set-group-ID mode bit
    profiling on/off status
    nice value (see nice(2))
    all attached shared memory segments (see shmop(2))
    all attached mapped regions (see mmap(2SEQ))
    process group ID
    session ID
    foreground process ID (see exit(2))
    current working directory
    root directory
    file mode creation mask (see umask(2))
    file size limit (see ulimit(2))

The child process differs from the parent process in the following ways:

    The child process has a unique process ID.
    The child process has a different parent process ID (i.e., the process ID of the parent process).
    When using fork and qfork, the child process has its own copy of the parent's file descriptors. Each of the child's file descriptors shares a common file pointer with the corresponding file descriptor of the parent. However, if shfork or shqfork is used, the child process may share the parent's file descriptors (more below on these variations of fork).
    All semadj values are cleared (see semop(2)).
    Process locks, text locks and data locks are not inherited by the child (see plock(2)).
    The child process's utime, stime, cutime, and cstime are set to 0.
    The time left until an alarm clock signal is reset to 0.

qfork is used when the caller wishes to specify the quad or set of quads where the child process should be located. The caller can specify suitable quads either through the attach_qset argument or by setting the QUAD_ATTACH_TO_PARENT flag in the flags argument.

If the attach_qset argument is used to specify suitable quads, attach_qset points to a quadset_t that identifies the quad or quads that are acceptable for the process. A suitable quad or set of quads can be identified using quad_loc(2SEQ), and can be manipulated using the operators described in quademptyset(2SEQ). If more than one quad is specified by attach_qset, the process will be loaded on the best of the quads, taking CPU load, memory availability and other factors into account.
Unless the QUAD_INSIST bit in flags is set, the set of quads specified by attach_qset is considered a hint, which may be overridden in extreme cases if all the quads in the specified quad set have very high CPU loads or too little available memory. If the QUAD_INSIST bit is included in the flags argument, the quad specification is treated as "mandatory," and the child process is loaded on one of the specified quads despite a large CPU load or memory shortage.

Once the process has been loaded, using the attach_qset argument with qfork has the same effect as using the R_QUAD option of attach_proc(2SEQ) to attach the child process to a set of quads. If the set contains more than one quad, the process may migrate among the quads in the set for load-balancing purposes. If the QUAD_INSIST bit in flags is not set, the process may also migrate to a quad outside the specified set of quads under the above extreme conditions.

It is advisable that QUAD_INSIST be used with considerable caution. Situations are rare in which a process will perform better at a quad where a specific resource is located than at some other quad, even under extreme CPU or memory overload conditions. For example, being located on the same quad as an important resource is of little value to a process that has been swapped out due to an extreme memory overload, or that is being starved for CPU bandwidth under an extreme CPU overload.

If the attach_qset argument is used to specify suitable quads, the caller can further specify an appropriate quad by setting the QUAD_SMALLMEM or QUAD_LARGEMEM bits in the flags argument. QUAD_SMALLMEM indicates that the child will have very low memory requirements, so it can be placed on a quad having little available memory if that quad has a particularly light CPU load. Conversely, if QUAD_LARGEMEM is set, the process is placed on the quad with the most available memory even though that quad may have a high CPU load. QUAD_SMALLMEM and QUAD_LARGEMEM are also taken into account during any future process migrations.

Alternatively, when using qfork, the user can specify a suitable quad for the child process by setting the QUAD_ATTACH_TO_PARENT bit in the flags argument. If the QUAD_ATTACH_TO_PARENT flag is set, the child process is loaded on the quad where the parent process is currently located. Once the child process has been loaded, using the QUAD_ATTACH_TO_PARENT bit with qfork has the same effect as using the R_PID option of attach_proc(2SEQ) to attach the child to its parent. If either the child or parent process migrates to another quad for load-balancing purposes, the other process will accompany it. Both processes will always be located on the same quad, until one either detaches or exits. If the QUAD_ATTACH_TO_PARENT flag is set, the attach_qset argument is ignored, as are the QUAD_SMALLMEM and QUAD_LARGEMEM flags.

shfork operates the same as fork, but allows the child process to share the parent process's file descriptors instead of just copying them. To enable the file descriptor sharing, the FM_SHARE_OFILE bit must be set in flags. If this flag is not set, shfork will act exactly the same as fork. shqfork allows for the same operation as qfork, along with the shared file descriptors of shfork. As with shfork, the FM_SHARE_OFILE flag must be set to enable the sharing.

FLAG VALUES

The following bit patterns are valid values for the flags argument.
for use with shfork and shqfork:

    FM_SHARE_OFILE  0x0002      share open file table

for use with qfork and shqfork:

    QUAD_INSIST     0x00000010  mandatory quad placement
    QUAD_SMALLMEM   0x00000020  requires little memory
    QUAD_LARGEMEM   0x00000040  requires lots of memory

-----------------------------------------------------------------------

int quademptyset (set)
quadset_t *set;

int quadfillset (set)
quadset_t *set;

int rgn_quadfillset (set, rgnname)
quadset_t *set; const char *rgnname;

int quadaddset (set, quadno)
quadset_t *set; int quadno;

int quaddelset (set, quadno)
quadset_t *set; int quadno;

int quadismember (set, quadno)
quadset_t *set; int quadno;

int quadisemptyset (set)
quadset_t *set;

int quadandset (set1, set2)
quadset_t *set1, *set2;

int quadorset (set1, set2)
quadset_t *set1, *set2;

int quaddiffset (set1, set2)
quadset_t *set1, *set2;

===> These APIs manipulate quadset_t objects, which are opaque data types
===> that describe collections of quads. In mapping the DYNIX/ptx NUMA
===> APIs to CpuMemSets, the quadset_t object is still used by the DYNIX/ptx
===> API. Thus the quadset_t manipulating APIs should be used and can be
===> implemented in the same manner as on DYNIX/ptx. At the point in time
===> that another DYNIX/ptx NUMA API uses a quadset, the quadset can be
===> mapped into an equivalent CpuMemSet. (A sketch of one possible
===> implementation follows the man page snippet below.)

The quadsetops primitives manipulate sets of quads. They operate on data objects of type quadset_t.

The quademptyset function initializes the quad set pointed to by set, such that no quads are included in the set. The quadfillset function initializes the quad set pointed to by set, such that all quads that are currently configured in the caller's region are included in the set. The rgn_quadfillset function initializes the quad set pointed to by set, such that all quads that are currently configured in the region rgnname are included in the set. If rgnname is NULL then rgn_quadfillset initializes the quadset pointed to by set, such that all quads that are currently configured in the caller's region are included in the set.

Applications should call either quademptyset or quadfillset at least once for each object of type quadset_t prior to any other operation on the object. If such an object is not initialized in this way, but is supplied as an argument to any of the quadaddset, quaddelset, quadismember, quadandset, quadorset, quaddiffset, qfork, qexec, etc. functions, the results are undefined.

The quadaddset and quaddelset functions respectively add and delete the individual quads specified by the value of quadno from the quad set pointed to by the argument set.

The quadandset and quadorset functions respectively perform logical and and or operations on the quad sets pointed to by the arguments set1 and set2, storing the result in the quad set pointed to by set1. The quaddiffset function finds the logical difference (the members that are contained in the first set but not in the second set) between the quad sets pointed to by the arguments set1 and set2. The result is stored in the quad set pointed to by set1.

The quadismember function tests whether the quad specified by the value of quadno is a member of the set pointed to by the argument set, and the quadisemptyset function tests whether the quad set pointed to by the argument set is empty.
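Since these primitives are self-contained set operations, a Linux emulation can implement quadset_t as a fixed-size bitmask, presumably much as DYNIX/ptx does. The following is a minimal sketch of a few of the operations; the 64-quad limit and the internal layout of quadset_t are assumptions of this sketch, not part of either API:

#include <string.h>

#define MAX_QUADS 64                    /* assumed upper bound */
#define BITS_PER_WORD (8 * sizeof(unsigned long))

typedef struct quadset {
        unsigned long bits[MAX_QUADS / (8 * sizeof(unsigned long))];
} quadset_t;

int quademptyset(quadset_t *set)
{
        memset(set->bits, 0, sizeof(set->bits));
        return 0;
}

int quadaddset(quadset_t *set, int quadno)
{
        if (quadno < 0 || quadno >= MAX_QUADS)
                return -1;
        set->bits[quadno / BITS_PER_WORD] |= 1UL << (quadno % BITS_PER_WORD);
        return 0;
}

int quadismember(quadset_t *set, int quadno)
{
        if (quadno < 0 || quadno >= MAX_QUADS)
                return 0;
        return (set->bits[quadno / BITS_PER_WORD] >>
                (quadno % BITS_PER_WORD)) & 1;
}

int quadorset(quadset_t *set1, quadset_t *set2)
{
        size_t i;

        for (i = 0; i < sizeof(set1->bits) / sizeof(set1->bits[0]); i++)
                set1->bits[i] |= set2->bits[i];
        return 0;
}

When another emulated API receives a quadset, walking the set bits and appending the corresponding memory (and CPU) numbers yields the lists for an equivalent CpuMemSet.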
-----------------------------------------------------------------------

int shmget (key, size, shmflg)
key_t key; int size, shmflg;
N/A

int shmgetq (key, size, shmflg, qds)
key_t key; int size, shmflg; quadset_t *qds;

===> Can create a CpuMemSet which maps the requested quad affiliation and
===> attach it to the vm area describing the shared memory. As with exec,
===> fork, etc., this lacks some of the options/capabilities of DYNIX/ptx.

int shmgetv (key, size, shmflg, shmvcnt, shmv)
key_t key; size_t size; int shmflg, shmvcnt; shmvec_t *shmv;
N/A

shmgetq(2SEQ) behaves the same as shmget(2), with the addition of a qds parameter used to specify a quadset that restricts the set of quads the paging policy flags affect. By default shmget(2) assumes a quadset containing all the quads in the system. See quademptyset(2SEQ) for information on constructing quadsets.

-----------------------------------------------------------------------

int tmp_affinity(processor);
int processor;

===> Create a CpuMemSet with only the requested processor in it, include
===> the memory on the same node as the processor, and assign the process
===> to it. If AFF_NONE is specified, then assign the process to the
===> system's global CpuMemSet.
===> Must change both the current process CpuMemSet and the process's
===> child CpuMemSet.
===> No Linux support for PRIV_REGION or PRIV_SCHED.
===> (A sketch of the CpuMemSet construction follows this man page snippet.)

tmp_affinity allows a process to be bound to a specified logical processor. Processor numbers start at zero and are numbered contiguously. Deconfigured processors are not included. The previous affinity is returned.

The process must have the PRIV_REGION vectored privilege to be bound to a processor which does not belong to the region to which the process belongs. On a successful bind, if the process belongs to the system region, then it stays in the system region; otherwise it moves to the user-defined region to which the processor belongs.

If the processor argument is AFF_NONE, the process is released from any previous affinity (that is, allowed to migrate within its region). If the processor argument is AFF_QUERY, the current affinity, or AFF_NONE, is returned without changing the current affinity. You must have the PRIV_SCHED vectored privilege to change affinity.

Affinity is inherited across the fork(2) and exec(2) system calls.
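To make the tmp_affinity mapping concrete, here is a minimal sketch that builds the one-processor CpuMemSet described above, using the {policy, CPU count, CPU list, memory count, memory lists} layout assumed earlier in this note. The cms_* type definitions and the cms_set_current() call are assumptions for illustration only; the real CpuMemSet system call interface may well differ:

/* Hypothetical types following the structures used in the mapping
 * intro; exact definitions are assumptions of this sketch. */
typedef int cms_lcpu_t;
typedef int cms_memory_t;

typedef struct cms_memory_list {
        int           ncpus;
        cms_lcpu_t   *cpus;
        int           nmems;
        cms_memory_t *mems;
} cms_memory_list_t;

typedef struct cms_set {
        int                cms_policy;
        int                ncpus;
        cms_lcpu_t        *cpus;
        int                nmems;
        cms_memory_list_t *memlists;
} cms_set_t;

#define CMS_DEFAULT 0           /* policy value, per the proposal */
#define CPUS_PER_QUAD 4

/* Hypothetical: install 'set' as the current CpuMemSet of 'pid';
 * assumed to copy the structures in. */
extern int cms_set_current(int pid, cms_set_t *set);

/* Emulate tmp_affinity(processor): one CPU, with its quad's memory. */
static int bind_to_processor(int pid, int processor)
{
        cms_lcpu_t cpu;
        cms_memory_t mem;
        cms_memory_list_t mlist;
        cms_set_t set;

        cpu = processor;
        mem = processor / CPUS_PER_QUAD;   /* memory on the same quad */

        mlist.ncpus = 1;  mlist.cpus = &cpu;
        mlist.nmems = 1;  mlist.mems = &mem;

        set.cms_policy = CMS_DEFAULT;
        set.ncpus = 1;    set.cpus = &cpu;
        set.nmems = 1;    set.memlists = &mlist;

        return cms_set_current(pid, &set);   /* hypothetical call */
}

Per the notes above, the same set would also have to be installed as the process's child CpuMemSet so that the affinity is inherited across fork and exec.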
From: Paul J. <pj...@en...> - 2001-11-17 06:04:41
Awesome - thank-you, Michael, for an excellent job. This is just the sort of analysis needed to make sure that our processor and memory placement needs are met. The DYNIX/ptx NUMA APIs are a rich source of actual, proven capability for these needs, and it is essential that we understand how they can be supported on Linux, and hopefully, on CpuMemSets.

Those who are allergic to long email messages should probably bail now - sorry <grin>.

===

On Fri, 16 Nov 2001, Michael Hohnbaum wrote:

|> In an effort to enhance my understanding of the CpuMemSet proposal I
|> performed a mapping of the DYNIX/ptx NUMA APIs to the CpuMemSet facility.
|> This effort also attempted to validate the claim that "this CpuMemSet
|> facility is intended to provide the power sufficient to support emulations
|> of these existing API's."

I much appreciate your taking the time to perform this analysis, and hopefully we can find good ways to meet the needs you have identified. Your careful and detailed presentation is most helpful to my understanding of the needs of these DYNIX/ptx API's. Thank-you.

|> In doing this I came across several shortcomings, which are listed below.
|> At the end of this summary is the mapping, with a discussion of issues
|> encountered.
|>
|> 1. CpuMemSets makes an attempt at providing support for memory allocation
|>    policy, but falls short of providing the allocation policies supported
|>    by DYNIX/ptx. By providing only partial support for memory allocation
|>    policies it opens up the possibility that memory allocation policies
|>    might be implemented in various places, instead of through one common
|>    mechanism.

This was intentional - to a greater extent than any of the existing SGI API's for Numa placement or the DYNIX/ptx API's, I have attempted to strip policies out of the kernel, and provide only general purpose mechanisms, suitable for implementing any of the needed policies. Granted, this places a burden on me of explaining just how in the heck, for any given policy that has shown itself to be useful, that policy can still be effected in a sensible fashion. The time to shoulder this burden has now arrived.

|> Policies missing:
|>
|> * soft versus hard - DYNIX/ptx has the notion of treating placement
|>   requests as either hints (soft) or demands (hard). CpuMemSet
|>   provides only the hard option.

At first I was confused by just what a "soft" policy meant, but thanks to your fine snippets of documentation below, I think I understand it now. It seems that when attaching processes to resources, a "hard" request will fail if it can't place the process on a node with all the requested resources, whereas a "soft" request will fall back to other nodes, if need be.

If I understand this correctly, then CpuMemSets supports both, easily. No kernel support is required or relevant. Rather, when setting up a CpuMemSet, the library code that is emulating the DYNIX/ptx API's on top of CpuMemSets can decide to succeed or fail, if the requested resources aren't available where the requester wants them, depending on whether the requester used the "hard" or "soft" option.

This is not (so far as I can tell) something that requires kernel awareness each time a cpu is scheduled or a page allocated. Rather it seems to only affect the initial binding of resources to processes, and can easily and naturally be resolved in the library code.

|> * first touch, followed by round robin.
|>   The default algorithm for memory allocations on DYNIX/ptx is to
|>   allocate on the same quad the process is running on, and if none is
|>   available, to round robin through the remaining quads. The CpuMemSet
|>   choices are either round robin or always in the order of the memory
|>   lists.

I am unclear just what ordering "first touch, followed by round robin" might be. I suspect that it is one of these two:

1) Try allocating on the node that is executing the allocation request, and if that fails, try allocating on the next closest nodes, in distance order.

2) Try allocating on the node that is executing the allocation request, and if that fails, try allocating in a distributed fashion, on the next node past the last one that satisfied an allocation request, according to some list.

If you mean (1), then that's too easy - just sort the memory lists in distance order from the faulting cpu. So probably you mean (2). If so, you're right that the current CpuMemSet design doesn't have this combination of options. But it would be trivial to add, if you want me to. Just another memory allocation policy option, and a few more lines of code that combine the current DEFAULT and ROUND_ROBIN policies. Let's say:

    #define CMS_FIRST_ROBIN 0x03 /* First touch, then round-robin */

    * If a CpuMemSet has a policy of CMS_FIRST_ROBIN, the
    * kernel first searches the first Memory Block on the memory
    * list, then, if that doesn't provide the required memory,
    * the kernel searches the memory list beginning one past
    * where the last search on that same Memory List of that
    * same CpuMemSet concluded.

(Surely someone has a better name than CMS_FIRST_ROBIN ;). Let me know if this is what you need, and I will add it.

|> * LARGEMEM/SMALLMEM hints - DYNIX/ptx provides the option for a
|>   process to indicate whether it is a large memory user or a small
|>   memory user, so that the system can place it on an appropriately
|>   loaded quad.

This also seems to be properly implemented in the library code, where the topology can be examined to find quads with the requested amount of memory, and a CpuMemSet constructed corresponding to that request.

|> * MOSTFREE - i.e., allocate on the node with the most available
|>   memory.

This sounds like a more dynamic policy that, on each allocation, determines which node currently has the most available memory, and allocates the memory there. If that is the case, then this is pretty much outside the relatively static purview of CpuMemSets, and will require some additional code in the page allocation routines. That additional code should respect the CpuMemSet/Map for the current allocation, and only allocate memory where allowed. And the CpuMemSet interface might (or might not - not sure) be the best place to pass in the MOSTFREE policy flag that would trigger such behaviour in the allocator.

|> 2. No mechanism is provided to migrate a process to a different node after
|>    the process is assigned to one via CpuMemSet.

See the Bulk Remap feature, which was added in the October 8 version of the CpuMemSet Design Notes. Is this feature perhaps what you are looking for here?

|> 3. There is no means to bind processes such that if one moves to a
|>    different node, they all move to that node. (Or conversely, to
|>    prevent the migration of one to a different node when it needs to
|>    remain on its current node with other specific processes.)

Correct - almost.
I have added, in the November 14 version, explicit support in the API for inheriting CpuMemMaps, and a Bulk Remap feature (CMS_BULK_SHARE) that can remap all tasks and vm areas sharing a Map. However that facility is of limited use, because any change to the Map breaks the inheritance, and because now (as of this same version) any non-root process can change its Map (just not acquire more resources).

I am open to more sophisticated ways of grouping CpuMemMaps, so that it is convenient to mass migrate everyone in a group. But so far this has been a bit of a quagmire (which doesn't surprise me -- this is typical in the history of Unix). Such a mass migration can be approximated, with the current CpuMemSet facilities, by scanning the system for all users of the resource to be migrated and, group-by-share-group, moving their Maps over. However this is not so clean, not so efficient, and theoretically isn't guaranteed to finish in finite time.

Tell me - what should we add to CpuMemSets to accomplish this grouping?

|> 4. DYNIX/ptx has the ability to attach a process to a shared memory
|>    segment. This binds the process to running on the quads that the
|>    shared memory segment resides on. In CpuMemSets there is a note
|>    that the CPU portion of a CpuMemSet for a memory area is ignored.
|>    To support the DYNIX/ptx attachment of a process to a shared memory
|>    segment, the CPU portion of the memory area's CpuMemSet needs to
|>    be used.

Aha - the CPU portion of a CpuMemSet for a vm area is ignored because the kernel has no purpose for it. However the library code supporting the DYNIX/ptx API may well have a purpose - for just this use. It would be entirely reasonable for the library code to set the CPU portions of such CpuMemSets to the cpus that are on the same nodes as the memory described. Then when a process attaches to a shared memory segment, the library code just picks up the CpuMemSet/Map from that segment, cpus and all, and applies it to the requesting process. No kernel awareness required, beyond simply preserving the user specified settings. (A rough sketch of this bookkeeping appears below, after item 5.)

If the above makes sense, then I would be quite willing to add to the CpuMemSet documentation a mention that though the kernel doesn't itself use the cpu portion of vm area CpuMemSets, applications might use them, and the kernel will preserve their setting.

|> 5. It appears that mmap uses the CpuMemSet of the current process.
|>    Thus, to target a different set of resources (NUMA nodes) than
|>    the current process is using, it is necessary to change the current
|>    CpuMemSet of the process to that desired for the mmap, execute
|>    the mmap operation, and then restore the CpuMemSet of the current
|>    process to what it was originally. This opens up the possibility
|>    that any pages allocated for the process during this activity could
|>    end up being allocated on the wrong node (i.e., on a node that was
|>    intended for the mmap). Perhaps a third CpuMemSet needs to be
|>    established for a process, to be used for mmaps and shmget.

No - I don't think it opens up such a possibility. This is because page allocation requests by the current process are not controlled by the CpuMemSet of that process, but rather by the CpuMemSet of the vm area allocating the page. So long as the process knows that it is not creating any other vm area (another mmap call or a Sys V shared memory attachment, say) at the "same time", then there is no race, no risk.
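Returning to item 4 for a moment, here is a rough sketch of the library-side bookkeeping described there: given the memory list of a segment's CpuMemSet, fill in the CPU portion with the cpus on the same quads. The helper name and the 4-cpus-per-quad numbering are assumptions carried over from Michael's mapping notes, not part of the kernel interface:

#define CPUS_PER_QUAD 4                 /* from the mapping assumptions */

typedef int cms_lcpu_t;
typedef int cms_memory_t;

/* Fill 'cpus' with every cpu on the quads named in 'mems'.
 * Returns the number of cpus written; 'cpus' must have room for
 * nmems * CPUS_PER_QUAD entries. */
static int cpus_covering_mems(const cms_memory_t *mems, int nmems,
                              cms_lcpu_t *cpus)
{
        int i, j, n = 0;

        for (i = 0; i < nmems; i++)
                for (j = 0; j < CPUS_PER_QUAD; j++)
                        cpus[n++] = mems[i] * CPUS_PER_QUAD + j;
        return n;
}

When a process attaches to the segment, the library reads back the segment's CpuMemSet, cpus included, and applies it to the process - which is the attach_proc R_SHM behavior Michael described.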
|> 6. Not really a CpuMemSet deficiency, but DYNIX/ptx had a privilege
|>    vector implementation that allowed the granting of some typically
|>    root-only capabilities to specific users. Some NUMA APIs required
|>    specific privileges for non-root users. This could not be mapped
|>    into Linux.

Yes - so far as I know, this doesn't map into current Linux.

|> 7. More of a NUMA Topology issue - DYNIX/ptx provided an efficient
|>    interface for a process to obtain the engine number and quad number
|>    that it is executing on. This is not yet supported.

Yes - a topology issue. I heartily endorse the work of Paul Dorwin here.

I think this covers the main points you raised, and I hope that we can find sensible ways to meet the needs that you have described.

===

A few quite minor points in the rest of your mapping description:

|> q1 = [0,1,2, .. ,<number of CPUs>]

Should be:

    q1 = [0,1,2, .. ,<number of CPUs>-1]

|> ===> NOTE: CMS_KERNEL is listed as root only. Can non-root
|>      use it for query?

Yes - non-root queries are supported, as of the November 14, 2001 Revision.

|> ===> Revert the process back to using the system-wide CpuMemSet

A couple of times you refer to a system-wide default CpuMemSet. There is no such entity. The kernel has its own CpuMemSet, which is inherited by init and, subject to change, by all that init creates. But any given process can know only:

    the kernel's CpuMemSet
    the CpuMemSet of any given process
    the CpuMemSet of any given vm area

There is no "system-wide" default. I couldn't quite tell if this will be a problem for supporting the DYNIX/ptx API or not. Hopefully not.

===

Once again, my thanks for a most useful analysis.

I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
From: Paul J. <pj...@en...> - 2001-11-18 03:24:06
pj wrote:

|> #define CMS_FIRST_ROBIN 0x03 /* First touch, then round-robin */
|> ...
|> (Surely someone has a better name than CMS_FIRST_ROBIN ;).

hmmm ... how about:

    CMS_EARLY_BIRD

I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373