From: Paul J. <pj...@sg...> - 2004-08-05 19:57:42
A bitmap print and parse format that provides lists of ranges of
numbers, to be first used by cpusets (next patch).

Cpusets provide a way to manage subsets of CPUs and Memory Nodes for
scheduling and memory placement, via a new virtual file system,
usually mounted at /dev/cpuset.  Manipulation of cpusets can be done
directly via this file system, from the shell.  However, manipulating
512-bit cpumasks or 256-bit nodemasks (which will get bigger) via hex
mask strings is painful for humans.

The intention is to provide a format for the cpu and memory mask
files in /dev/cpuset that will stand the test of time.  This format
is supported by a couple of new lib/bitmap.c routines, for printing
and parsing these strings.  Wrappers for cpumask and nodemask are
provided.  See the embedded comments, below in the patch, for more
details of the format.  The input format supports adding or removing
specified cpus or nodes, as well as entirely rewriting the mask.

 include/linux/bitmap.h   |    8 ++
 include/linux/cpumask.h  |   22 ++++++-
 include/linux/nodemask.h |   22 ++++++-
 lib/bitmap.c             |  142 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 189 insertions(+), 5 deletions(-)

Signed-off-by: Paul Jackson <pj...@sg...>

Index: 2.6.8-rc2-mm2/include/linux/bitmap.h
===================================================================
--- 2.6.8-rc2-mm2.orig/include/linux/bitmap.h	2004-08-04 19:29:15.000000000 -0700
+++ 2.6.8-rc2-mm2/include/linux/bitmap.h	2004-08-04 19:41:10.000000000 -0700
@@ -41,7 +41,9 @@
  * bitmap_shift_right(dst, src, n, nbits)	*dst = *src >> n
  * bitmap_shift_left(dst, src, n, nbits)	*dst = *src << n
  * bitmap_scnprintf(buf, len, src, nbits)	Print bitmap src to buf
- * bitmap_parse(ubuf, ulen, dst, nbits)	Parse bitmap dst from buf
+ * bitmap_parse(ubuf, ulen, dst, nbits)	Parse bitmap dst from user buf
+ * bitmap_scnlistprintf(buf, len, src, nbits) Print bitmap src as list to buf
+ * bitmap_parselist(buf, dst, nbits)	Parse bitmap dst from list
  */
 
 /*
@@ -98,6 +100,10 @@ extern int bitmap_scnprintf(char *buf, u
 			const unsigned long *src, int nbits);
 extern int bitmap_parse(const char __user *ubuf, unsigned int ulen,
 			unsigned long *dst, int nbits);
+extern int bitmap_scnlistprintf(char *buf, unsigned int len,
+			const unsigned long *src, int nbits);
+extern int bitmap_parselist(const char *buf, unsigned long *maskp,
+			int nmaskbits);
 extern int bitmap_find_free_region(unsigned long *bitmap, int bits, int order);
 extern void bitmap_release_region(unsigned long *bitmap, int pos, int order);
 extern int bitmap_allocate_region(unsigned long *bitmap, int pos, int order);
Index: 2.6.8-rc2-mm2/include/linux/cpumask.h
===================================================================
--- 2.6.8-rc2-mm2.orig/include/linux/cpumask.h	2004-08-04 19:29:34.000000000 -0700
+++ 2.6.8-rc2-mm2/include/linux/cpumask.h	2004-08-04 20:35:10.000000000 -0700
@@ -10,6 +10,8 @@
  *
  * For details of cpumask_scnprintf() and cpumask_parse(),
  * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
+ * For details of cpulist_scnprintf() and cpulist_parse(), see
+ * bitmap_scnlistprintf() and bitmap_parselist(), also in bitmap.c.
  *
  * The available cpumask operations are:
  *
@@ -46,6 +48,8 @@
  *
  * int cpumask_scnprintf(buf, len, mask) Format cpumask for printing
  * int cpumask_parse(ubuf, ulen, mask)	Parse ascii string as cpumask
+ * int cpulist_scnprintf(buf, len, mask) Format cpumask as list for printing
+ * int cpulist_parse(buf, map)		Parse ascii string as cpulist
  *
  * for_each_cpu_mask(cpu, mask)		for-loop cpu over mask
  *
@@ -268,14 +272,28 @@ static inline int __cpumask_scnprintf(ch
 	return bitmap_scnprintf(buf, len, srcp->bits, nbits);
 }
 
-#define cpumask_parse(ubuf, ulen, src) \
-	__cpumask_parse((ubuf), (ulen), &(src), NR_CPUS)
+#define cpumask_parse(ubuf, ulen, dst) \
+	__cpumask_parse((ubuf), (ulen), &(dst), NR_CPUS)
 static inline int __cpumask_parse(const char __user *buf, int len,
 					cpumask_t *dstp, int nbits)
 {
 	return bitmap_parse(buf, len, dstp->bits, nbits);
 }
 
+#define cpulist_scnprintf(buf, len, src) \
+	__cpulist_scnprintf((buf), (len), &(src), NR_CPUS)
+static inline int __cpulist_scnprintf(char *buf, int len,
+					const cpumask_t *srcp, int nbits)
+{
+	return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
+}
+
+#define cpulist_parse(buf, dst) __cpulist_parse((buf), &(dst), NR_CPUS)
+static inline int __cpulist_parse(const char *buf, cpumask_t *dstp, int nbits)
+{
+	return bitmap_parselist(buf, dstp->bits, nbits);
+}
+
 #if NR_CPUS > 1
 #define for_each_cpu_mask(cpu, mask) \
 	for ((cpu) = first_cpu(mask); \
Index: 2.6.8-rc2-mm2/include/linux/nodemask.h
===================================================================
--- 2.6.8-rc2-mm2.orig/include/linux/nodemask.h	2004-08-04 19:29:29.000000000 -0700
+++ 2.6.8-rc2-mm2/include/linux/nodemask.h	2004-08-04 20:28:50.000000000 -0700
@@ -10,6 +10,8 @@
  *
  * For details of nodemask_scnprintf() and nodemask_parse(),
  * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
+ * For details of nodelist_scnprintf() and nodelist_parse(), see
+ * bitmap_scnlistprintf() and bitmap_parselist(), also in bitmap.c.
  *
  * The available nodemask operations are:
  *
@@ -46,6 +48,8 @@
  *
  * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
  * int nodemask_parse(ubuf, ulen, mask)	Parse ascii string as nodemask
+ * int nodelist_scnprintf(buf, len, mask) Format nodemask as list for printing
+ * int nodelist_parse(buf, map)		Parse ascii string as nodelist
  *
  * for_each_node_mask(node, mask)	for-loop node over mask
  *
@@ -271,14 +275,28 @@ static inline int __nodemask_scnprintf(c
 	return bitmap_scnprintf(buf, len, srcp->bits, nbits);
 }
 
-#define nodemask_parse(ubuf, ulen, src) \
-	__nodemask_parse((ubuf), (ulen), &(src), MAX_NUMNODES)
+#define nodemask_parse(ubuf, ulen, dst) \
+	__nodemask_parse((ubuf), (ulen), &(dst), MAX_NUMNODES)
 static inline int __nodemask_parse(const char __user *buf, int len,
 					nodemask_t *dstp, int nbits)
 {
 	return bitmap_parse(buf, len, dstp->bits, nbits);
 }
 
+#define nodelist_scnprintf(buf, len, src) \
+	__nodelist_scnprintf((buf), (len), &(src), MAX_NUMNODES)
+static inline int __nodelist_scnprintf(char *buf, int len,
+					const nodemask_t *srcp, int nbits)
+{
+	return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
+}
+
+#define nodelist_parse(buf, dst) __nodelist_parse((buf), &(dst), MAX_NUMNODES)
+static inline int __nodelist_parse(const char *buf, nodemask_t *dstp, int nbits)
+{
+	return bitmap_parselist(buf, dstp->bits, nbits);
+}
+
 #if MAX_NUMNODES > 1
 #define for_each_node_mask(node, mask) \
 	for ((node) = first_node(mask); \
Index: 2.6.8-rc2-mm2/lib/bitmap.c
===================================================================
--- 2.6.8-rc2-mm2.orig/lib/bitmap.c	2004-08-04 19:29:15.000000000 -0700
+++ 2.6.8-rc2-mm2/lib/bitmap.c	2004-08-04 21:44:41.000000000 -0700
@@ -291,6 +291,7 @@ EXPORT_SYMBOL(__bitmap_weight);
 #define nbits_to_hold_value(val)	fls(val)
 #define roundup_power2(val,modulus)	(((val) + (modulus) - 1) & ~((modulus) - 1))
 #define unhex(c)		(isdigit(c) ? (c - '0') : (toupper(c) - 'A' + 10))
+#define BASEDEC 10		/* fancier cpuset lists input in decimal */
 
 /**
  * bitmap_scnprintf - convert bitmap to an ASCII hex string.
@@ -409,6 +410,147 @@ int bitmap_parse(const char __user *ubuf
 }
 EXPORT_SYMBOL(bitmap_parse);
 
+/*
+ * bscnl_emit(buf, buflen, rbot, rtop, len)
+ *
+ *	Helper routine for bitmap_scnlistprintf().  Write decimal number
+ *	or range to buf, suppressing output past buf+buflen, with optional
+ *	comma-prefix.  Return len of what would be written to buf, if it
+ *	all fit.
+ */
+
+int bscnl_emit(char *buf, int buflen, int rbot, int rtop, int len)
+{
+	if (len)
+		len += scnprintf(buf + len, buflen - len, ",");
+	if (rbot == rtop)
+		len += scnprintf(buf + len, buflen - len, "%d", rbot);
+	else
+		len += scnprintf(buf + len, buflen - len, "%d-%d", rbot, rtop);
+	return len;
+}
+
+/**
+ * bitmap_scnlistprintf - convert bitmap to list format ASCII string
+ * @buf: byte buffer into which string is placed
+ * @buflen: reserved size of @buf, in bytes
+ * @maskp: pointer to bitmap to convert
+ * @nmaskbits: size of bitmap, in bits
+ *
+ * Output format is a comma-separated list of decimal numbers and
+ * ranges.  Consecutively set bits are shown as two hyphen-separated
+ * decimal numbers, the smallest and largest bit numbers set in
+ * the range.  Output format is a compatible subset of the format
+ * accepted as input by bitmap_parselist().
+ *
+ * The return value is the number of characters which would be
+ * generated for the given input, excluding the trailing '\0', as
+ * per ISO C99.
+ */
+
+int bitmap_scnlistprintf(char *buf, unsigned int buflen,
+	const unsigned long *maskp, int nmaskbits)
+{
+	int len = 0;
+	/* current bit is 'cur', most recently seen range is [rbot, rtop] */
+	int cur, rbot, rtop;
+
+	rbot = cur = find_first_bit(maskp, nmaskbits);
+	while (cur < nmaskbits) {
+		rtop = cur;
+		cur = find_next_bit(maskp, nmaskbits, cur+1);
+		if (cur >= nmaskbits || cur > rtop + 1) {
+			len = bscnl_emit(buf, buflen, rbot, rtop, len);
+			rbot = cur;
+		}
+	}
+	return len;
+}
+EXPORT_SYMBOL(bitmap_scnlistprintf);
+
+/**
+ * bitmap_parselist - parses a more flexible format for inputting bit masks
+ * @buf: read nul-terminated user string from this buffer
+ * @maskp: write resulting mask here
+ * @nmaskbits: number of bits in mask to be written
+ *
+ * The input format supports a space separated list of one or more comma
+ * separated sequences of ascii decimal bit numbers and ranges.  Each
+ * sequence may be preceded by one of the prefix characters '=',
+ * '-', '+', or '!', which have the following meanings:
+ *	'=': rewrite the mask to have only the bits specified in this sequence
+ *	'-': turn off the bits specified in this sequence
+ *	'+': turn on the bits specified in this sequence
+ *	'!': same as '-'.
+ *
+ * If no such initial character is specified, then the default prefix '='
+ * is presumed.  The list is evaluated and applied in left to right order.
+ *
+ * Examples of input format:
+ *	0-4,9				# rewrites to 0,1,2,3,4,9
+ *	-9				# removes 9
+ *	+6-8				# adds 6,7,8
+ *	1-6 -0,2-4 +11-14,16-19 -14-16	# same as 1,5,6,11-13,17-19
+ *	1-6 -0,2-4 +11-14,16-19 =14-16	# same as just 14,15,16
+ *
+ * Possible errno's returned for invalid input strings are:
+ *	-EINVAL: second number in range smaller than first
+ *	-ERANGE: bit number specified too large for mask
+ *	-EINVAL: invalid prefix char (not '=', '-', '+', or '!')
+ */
+
+int bitmap_parselist(const char *buf, unsigned long *maskp, int nmaskbits)
+{
+	char *p, *q;
+	int masklen = BITS_TO_LONGS(nmaskbits);
+
+	while ((p = strsep((char **)(&buf), " ")) != NULL) {	/* blows const XXX */
+		char op = isdigit(*p) ? '=' : *p++;
+		unsigned long m[masklen];
+		int maskbytes = sizeof(m);
+		int i;
+
+		if (op == '\0')		/* skip empty tokens (repeated spaces) */
+			continue;
+		memset(m, 0, maskbytes);
+
+		while ((q = strsep(&p, ",")) != NULL) {
+			unsigned a = simple_strtoul(q, 0, BASEDEC);
+			unsigned b = a;
+			char *cp = strchr(q, '-');
+			if (cp)
+				b = simple_strtoul(cp + 1, 0, BASEDEC);
+			if (!(a <= b))
+				return -EINVAL;
+			if (b >= nmaskbits)
+				return -ERANGE;
+			while (a <= b) {
+				set_bit(a, m);
+				a++;
+			}
+		}
+
+		switch (op) {
+		case '=':
+			memcpy(maskp, m, maskbytes);
+			break;
+		case '!':
+		case '-':
+			for (i = 0; i < masklen; i++)
+				maskp[i] &= ~m[i];
+			break;
+		case '+':
+			for (i = 0; i < masklen; i++)
+				maskp[i] |= m[i];
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL(bitmap_parselist);
+
 /**
  * bitmap_find_free_region - find a contiguous aligned mem region
  * @bitmap: an array of unsigned longs corresponding to the bitmap

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj...@sg...> 1.650.933.1373
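For readers who want to experiment with this list format outside the kernel, the emit side can be modeled in a few lines of ordinary user-space C. This is only a sketch: the function name mask_to_list is invented here, it works on a single unsigned long word rather than the kernel's multi-word bitmaps, and it uses snprintf where the kernel code uses scnprintf.

```c
#include <stdio.h>

/*
 * User-space sketch of the list output format produced by
 * bitmap_scnlistprintf(): a comma-separated list of decimal bit
 * numbers, with runs of consecutively set bits collapsed into
 * "smallest-largest" ranges.  Assumes nbits does not exceed the
 * width of unsigned long and that buf is large enough.
 */
int mask_to_list(char *buf, int buflen, unsigned long mask, int nbits)
{
	int len = 0, bit, rbot = -1, rtop = -1;

	buf[0] = '\0';
	for (bit = 0; bit <= nbits; bit++) {
		int set = bit < nbits && (mask & (1UL << bit));

		if (set) {
			if (rbot < 0)
				rbot = bit;	/* open a new range */
			rtop = bit;		/* extend current range */
		} else if (rbot >= 0) {
			/* range closed: emit "n" or "lo-hi", comma-prefixed */
			if (len)
				len += snprintf(buf + len, buflen - len, ",");
			if (rbot == rtop)
				len += snprintf(buf + len, buflen - len,
						"%d", rbot);
			else
				len += snprintf(buf + len, buflen - len,
						"%d-%d", rbot, rtop);
			rbot = -1;
		}
	}
	return len;
}
```

With mask 0x21f (bits 0-4 and 9 set) this produces the string "0-4,9", matching the first example in the patch's kernel-doc comment.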
From: Paul J. <pj...@sg...> - 2004-08-05 19:57:49
Andrew,

I would like to propose the following patch for inclusion in your
2.6.9-*mm series, when that opens.  It provides an important facility
for high performance computing on large systems.  Simon Derr of Bull
(France) and I are the primary authors.

I offer it to lkml now, in order to invite continued feedback.
Thank you to the several people who have provided valuable feedback
so far, including Christoph and Andi (I make no claim that they
endorse this patch).

This is the third time I have posted cpusets on lkml.  The first two
times, a month or two ago, were more preliminary.  I believe that the
code is now in good enough shape to be considered for inclusion in
your kernels.

The one prerequisite patch for this cpuset patch was just posted
before this one.  That was a patch to provide a new bitmap list
format, of which cpusets is the first user.

Changes since July 2 (previous lkml posting):
 - The bitmap, cpumask and nodemask work on which the earlier patches
   depended is now included in your patches.
 - Locking around the cpuset struct simplified and rewritten.
 - Just one cpuset patch now (plus bitmap list format), not 8 of them.
 - Memory restriction in page_alloc and vmscan added (thanks, Andi).
 - The term 'strict' for cpusets that others can't use changed to
   'exclusive' (to avoid collision with the use of the same word in
   Andi's numa work for the reverse meaning).
 - Superfluous 'top_cpuset' layer removed from the visible mounted
   cpuset file system.
 - The /proc/<pid>/cpuset hook for displaying a task's current cpuset
   path uses seq_file now.
 - Notify_on_release calls /sbin/cpuset_release_agent, not
   /sbin/hotplug.  [Hence no CONFIG_HOTPLUG dependency.]
 - kernel/cpuset.c cpuset_sprintf_list() code moved to lib/bitmap.c,
   and rewritten to be simpler.
 - kernel/cpuset.c cpuset_path() code simplified.

This patch has been built on top of 2.6.8-rc2-mm2, for several
arches, with and without CONFIG_CPUSETS.  No doubt you will be glad
to know that it has many fewer arch dependencies (none, that I know
of) than the dreaded cpumask patch.  It has been built, booted and
tested in various forms over the last several months by a few
developers at SGI and Bull.

===

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.

Cpusets constrain the CPU and Memory placement of tasks to only the
processor and memory resources within a task's current cpuset.  They
form a nested hierarchy visible in a virtual file system.  These are
the essential hooks, beyond what is already present, required to
manage dynamic job placement on large systems.

Cpusets require small kernel hooks in init, exit, fork, mempolicy,
sched_setaffinity, page_alloc and vmscan.  And they require a "struct
cpuset" pointer and a "mems_allowed" nodemask_t (to go along with the
"cpus_allowed" cpumask_t that's already there) in each task struct.
These hooks:
 1) establish and propagate cpusets,
 2) enforce CPU placement in sched_setaffinity,
 3) enforce Memory placement in mbind and sys_set_mempolicy,
 4) restrict page allocation and scanning to mems_allowed, and
 5) restrict migration and set_cpus_allowed to cpus_allowed.
The other required hook, restricting task scheduling to CPUs in a
task's cpus_allowed mask, is already present.

Cpusets extend the usefulness of the existing placement support that
was added to Linux 2.6 kernels: sched_setaffinity() for CPU
placement, and mbind and set_mempolicy for memory placement.  On
smaller or dedicated use systems, the existing calls are often
sufficient.

On larger NUMA systems, running more than one performance-critical
job, it is necessary to be able to manage jobs in their entirety.
This includes providing a job with exclusive CPU and memory that no
other job can use, and being able to list all tasks currently in a
cpuset.  A given job running within a cpuset would likely use the
existing placement calls to manage its CPU and memory placement in
more detail.

Cpusets are named, nested sets of CPUs and Memory Nodes.  Each cpuset
is represented by a directory in the cpuset virtual file system,
normally mounted at /dev/cpuset.  Each cpuset directory provides the
following files, which can be read and written:

 cpus: List of CPUs allowed to tasks in that cpuset.
 mems: List of Memory Nodes allowed to tasks in that cpuset.
 tasks: List of pids of tasks in that cpuset.
 cpu_exclusive: Flag (0 or 1) - if set, cpuset has exclusive use of
	its CPUs (no sibling or cousin cpuset may overlap CPUs).
 mem_exclusive: Flag (0 or 1) - if set, cpuset has exclusive use of
	its Memory Nodes (no sibling or cousin may overlap).
 notify_on_release: Flag (0 or 1) - if set, then
	/sbin/cpuset_release_agent will be invoked, with the name
	(/dev/cpuset relative path) of that cpuset in argv[1], when
	the last user of it (task or child cpuset) goes away.  This
	supports automatic cleanup of abandoned cpusets.

In addition one new filetype is added to the /proc file system:

 /proc/<pid>/cpuset: For each task (pid), list its cpuset path,
	relative to the root of the cpuset file system.  This file
	is read-only.

New cpusets are created using 'mkdir' (at the shell or in C).  Old
ones are removed using 'rmdir'.  The above files are accessed using
read(2) and write(2) system calls, or shell commands such as 'cat'
and 'echo'.

The CPUs and Memory Nodes in a given cpuset are always a subset of
its parent.  The root cpuset has all possible CPUs and Memory Nodes
in the system.  A cpuset may be exclusive (cpu or memory) only if
its parent is similarly exclusive.

See further Documentation/cpusets.txt, at the top of the following
patch.
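The subset and exclusivity rules just described can be modeled in a few lines of user-space C. This is purely illustrative: the struct and function names below are invented, and the real checks live in kernel/cpuset.c and operate on cpumask_t/nodemask_t rather than a single word.

```c
/*
 * Toy model of the cpuset validity rules described above, using one
 * unsigned long as a CPU mask (hypothetical names, not the kernel's
 * data structures).
 */
struct toy_cpuset {
	unsigned long cpus;	/* one bit per allowed CPU */
	int cpu_exclusive;	/* 0 or 1, like the cpu_exclusive file */
};

/* A child is valid iff its CPUs are a subset of its parent's, and it
 * may be marked exclusive only if its parent is exclusive too. */
int toy_valid_child(const struct toy_cpuset *parent,
		    const struct toy_cpuset *child)
{
	if (child->cpus & ~parent->cpus)	/* not a subset */
		return 0;
	if (child->cpu_exclusive && !parent->cpu_exclusive)
		return 0;
	return 1;
}

/* Sibling cpusets may not overlap if either one is exclusive. */
int toy_siblings_ok(const struct toy_cpuset *a, const struct toy_cpuset *b)
{
	if ((a->cpu_exclusive || b->cpu_exclusive) && (a->cpus & b->cpus))
		return 0;
	return 1;
}
```

For example, under an exclusive root owning CPUs 0-7, a child with CPUs 2-3 passes both checks, while a child claiming CPU 8 is rejected as not a subset.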
Documentation/cpusets.txt | 381 +++++++++++ fs/proc/base.c | 19 include/linux/cpuset.h | 61 + include/linux/sched.h | 6 init/Kconfig | 10 init/main.c | 3 kernel/Makefile | 1 kernel/cpuset.c | 1477 ++++++++++++++++++++++++++++++++++++++++++++++ kernel/exit.c | 2 kernel/fork.c | 3 kernel/sched.c | 9 mm/mempolicy.c | 9 mm/page_alloc.c | 14 mm/vmscan.c | 19 14 files changed, 2009 insertions(+), 5 deletions(-) Signed-off-by: Paul Jackson <pj...@sg...> Index: 2.6.8-rc2-mm2/Documentation/cpusets.txt =================================================================== --- 2.6.8-rc2-mm2.orig/Documentation/cpusets.txt 2003-03-14 05:07:09.000000000 -0800 +++ 2.6.8-rc2-mm2/Documentation/cpusets.txt 2004-08-05 01:44:59.000000000 -0700 @@ -0,0 +1,387 @@ + CPUSETS + ------- + +Copyright (C) 2004 BULL SA. +Written by Sim...@bu... + +Portions Copyright (c) 2004 Silicon Graphics, Inc. +Modified by Paul Jackson <pj...@sg...> + +CONTENTS: +========= + +1. Cpusets + 1.1 What are cpusets ? + 1.2 Why are cpusets needed ? + 1.3 How are cpusets implemented ? + 1.4 How do I use cpusets ? +2. Usage Examples and Syntax + 2.1 Basic Usage + 2.2 Adding/removing cpus + 2.3 Setting flags + 2.4 Attaching processes +3. Questions +4. Contact + +1. Cpusets +========== + +1.1 What are cpusets ? +---------------------- + +Cpusets provide a mechanism for assigning a set of CPUs and Memory +Nodes to a set of tasks. + +Cpusets constrain the CPU and Memory placement of tasks to only +the resources within a tasks current cpuset. They form a nested +hierarchy visible in a virtual file system. These are the essential +hooks, beyond what is already present, required to manage dynamic +job placement on large systems. + +Each task has a pointer to a cpuset. Multiple tasks may reference +the same cpuset. 
Requests by a task, using the sched_setaffinity(2) +system call to include CPUs in its CPU affinity mask, and using the +mbind(2) and set_mempolicy(2) system calls to include Memory Nodes +in its memory policy, are both filtered through that tasks cpuset, +filtering out any CPUs or Memory Nodes not in that cpuset. The +scheduler will not schedule a task on a CPU that is not allowed in +its cpus_allowed vector, and the kernel page allocator will not +allocate a page on a node that is not allowed in the requesting tasks +mems_allowed vector. + +If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct +ancestor or descendent, may share any of the same CPUs or Memory Nodes. + +User level code may create and destroy cpusets by name in the cpuset +virtual file system, manage the attributes and permissions of these +cpusets and which CPUs and Memory Nodes are assigned to each cpuset, +specify and query to which cpuset a task is assigned, and list the +task pids assigned to a cpuset. + + +1.2 Why are cpusets needed ? +---------------------------- + +The management of large computer systems, with many processors (CPUs), +complex memory cache hierarchies and multiple Memory Nodes having +non-uniform access times (NUMA) presents additional challenges for +the efficient scheduling and memory placement of processes. + +Frequently more modest sized systems can be operated with adequate +efficiency just by letting the operating system automatically share +the available CPU and Memory resources amongst the requesting tasks. + +But larger systems, which benefit more from careful processor and +memory placement to reduce memory access times and contention, +and which typically represent a larger investment for the customer, +can benefit from explicitly placing jobs on properly sized subsets of +the system.
+ +This can be especially valuable on: + + * Web Servers running multiple instances of the same web application, + * Servers running different applications (for instance, a web server + and a database), or + * NUMA systems running large HPC applications with demanding + performance characteristics. + +These subsets, or "soft partitions" must be able to be dynamically +adjusted, as the job mix changes, without impacting other concurrently +executing jobs. + +The kernel cpuset patch provides the minimum essential kernel +mechanisms required to efficiently implement such subsets. It +leverages existing CPU and Memory Placement facilities in the Linux +kernel to avoid any additional impact on the critical scheduler or +memory allocator code. + + +1.3 How are cpusets implemented ? +--------------------------------- + +Cpusets provide a Linux kernel (2.6.7 and above) mechanism to constrain +which CPUs and Memory Nodes are used by a process or set of processes. + +The Linux kernel already has a pair of mechanisms to specify on which +CPUs a task may be scheduled (sched_setaffinity) and on which Memory +Nodes it may obtain memory (mbind, set_mempolicy). + +Cpusets extends these two mechanisms as follows: + + - Cpusets are sets of allowed CPUs and Memory Nodes, known to the + kernel. + - Each task in the system is attached to a cpuset, via a pointer + in the task structure to a reference counted cpuset structure. + - Calls to sched_setaffinity are filtered to just those CPUs + allowed in that tasks cpuset. + - Calls to mbind and set_mempolicy are filtered to just + those Memory Nodes allowed in that tasks cpuset. + - The "top_cpuset" contains all the systems CPUs and Memory + Nodes. + - For any cpuset, one can define child cpusets containing a subset + of the parents CPU and Memory Node resources. + - The hierarchy of cpusets can be mounted at /dev/cpuset, for + browsing and manipulation from user space. 
+ - A cpuset may be marked exclusive, which ensures that no other + cpuset (except direct ancestors and descendents) may contain + any overlapping CPUs or Memory Nodes. + - You can list all the tasks (by pid) attached to any cpuset. + +The implementation of cpusets requires a few, simple hooks +into the rest of the kernel, none in performance critical paths: + + - in init/main.c, to initialize the top_cpuset at system boot. + - in fork and exit, to attach and detach a task from its cpuset. + - in sched_setaffinity, to mask the requested CPUs by what's + allowed in that tasks cpuset. + - in sched.c migrate_all_tasks(), to keep migrating tasks within + the CPUs allowed by their cpuset, if possible. + - in the mbind and set_mempolicy system calls, to mask the requested + Memory Nodes by what's allowed in that tasks cpuset. + - in page_alloc, to restrict memory to allowed nodes. + - in vmscan.c, to restrict page recovery to the current cpuset. + +In addition a new file system, of type "cpuset" may be mounted, +typically at /dev/cpuset, to enable browsing and modifying the cpusets +presently known to the kernel. No new system calls are added for +cpusets - all support for querying and modifying cpusets is via +this cpuset file system. + +Each task under /proc has an added file named 'cpuset', displaying +the cpuset name, as the path relative to the root of the cpuset file +system. + +Each cpuset is represented by a directory in the cpuset file system +containing the following files describing that cpuset: + + - cpus: list of CPUs in that cpuset + - mems: list of Memory Nodes in that cpuset + - cpu_exclusive flag: is cpu placement exclusive? + - mem_exclusive flag: is memory placement exclusive? + - tasks: list of tasks (by pid) attached to that cpuset + +New cpusets are created using the mkdir system call or shell +command.
The properties of a cpuset, such as its flags, allowed +CPUs and Memory Nodes, and attached tasks, are modified by writing +to the appropriate file in that cpusets directory, as listed above. + +The named hierarchical structure of nested cpusets allows partitioning +a large system into nested, dynamically changeable, "soft-partitions". + +The attachment of each task, automatically inherited at fork by any +children of that task, to a cpuset allows organizing the work load +on a system into related sets of tasks such that each set is constrained +to using the CPUs and Memory Nodes of a particular cpuset. A task +may be re-attached to any other cpuset, if allowed by the permissions +on the necessary cpuset file system directories. + +Such management of a system "in the large" integrates smoothly with +the detailed placement done on individual tasks and memory regions +using the sched_setaffinity, mbind and set_mempolicy system calls. + +The following rules apply to each cpuset: + + - Its CPUs and Memory Nodes must be a subset of its parents. + - It can only be marked exclusive if its parent is. + - If its cpu or memory is exclusive, they may not overlap any sibling. + +These rules, and the natural hierarchy of cpusets, enable efficient +enforcement of the exclusive guarantee, without having to scan all +cpusets every time any of them change to ensure nothing overlaps an +exclusive cpuset. Also, the use of a Linux virtual file system (vfs) +to represent the cpuset hierarchy provides for a familiar permission +and name space for cpusets, with a minimum of additional kernel code. + + +1.4 How do I use cpusets ? +-------------------------- + +Be warned that cpusets work differently than you might expect. + +In order to avoid _any_ impact on existing critical scheduler and +memory allocator code in the kernel, and to leverage the existing +CPU and Memory placement facilities, putting a task in a particular +cpuset does _not_ immediately affect its placement.
+ +It would have been possible (and initially cpusets were coded this +way) to immediately change a tasks cpus_allowed affinity mask based +on what cpuset it was placed in. The sched_setaffinity call can be +applied to any requested task. + +But the way numa placement support (added to 2.6 kernels in April +2004 by Andi Kleen) works, it is not possible for one task to change +another tasks Memory placement. The mbind and set_mempolicy system +calls only affect the current task. There really wasn't a choice +in this matter -- the mm's, vma's and zonelists that encode a tasks +Memory placement are complicated, and cannot be safely changed from +outside the current tasks context. + +So, cpuset placement only affects the future sched_setaffinity, +mbind, and set_mempolicy system calls, by filtering out any CPUs +and Memory Nodes that are not allowed in the specified tasks cpuset. +Well, almost all. See also the migrate_all_tasks() hook, listed above. + +To start a new job that is to be contained within a cpuset, this means +the steps are: + + 1) mkdir /dev/cpuset + 2) mount -t cpuset none /dev/cpuset + 3) Create the new cpuset by doing mkdir's and write's (or echo's) in + the /dev/cpuset virtual file system. + 4) Start a task that will be the "founding father" of the new job. + 5) Attach that task to the new cpuset by writing its pid to the + /dev/cpuset tasks file for that cpuset. + 6) Have that task issue sched_setaffinity, mbind and set_mempolicy + system calls, specifying CPUs and Memory Nodes within its cpuset. + Anything it specifies outside will be ignored without complaint, + so if you request all CPUs and Memory Nodes in the system, you will + successfully get all that are available in your current cpuset. + 7) fork, exec or clone the job tasks from this founding father task. 
+ +For example, the following sequence of commands will set up a cpuset +named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, +and then start a subshell 'sh' in that cpuset: + + mount -t cpuset none /dev/cpuset + cd /dev/cpuset/top_cpuset + mkdir Charlie + cd Charlie + /bin/echo 2-3 > cpus + /bin/echo 1 > mems + /bin/echo $$ > tasks + # 0xC is bitmask for CPUs 2-3 + taskset 0xC numactl -m 1 sh + # The subshell 'sh' is now running in cpuset Charlie + # The next line should display 'top_cpuset/Charlie' + cat /proc/self/cpuset + +In the case that we want to force an existing job into a particular +cpuset, or that we want to move the cpuset that a job is using, +we will need some additional library code, not yet available as of +this writing (July 2004), that will receive a particular signal, +and reissue the necessary sched_setaffinity, mbind and set_mempolicy +system calls from within the tasks current context. + +In the case that a change of cpuset includes wanting to move already +allocated memory pages, consider further the work of IWAMOTO +Toshihiro <iw...@va...> for page remapping and memory +hotremoval, which can be found at: + + http://people.valinux.co.jp/~iwamoto/mh.html + +The integration of cpusets with such memory migration is not yet +available. + +In the future, a C library interface to cpusets will likely be +available. For now, the only way to query or modify cpusets is +via the cpuset file system, using the various cd, mkdir, echo, cat, +rmdir commands from the shell, or their equivalent from C. + +The sched_setaffinity calls can also be done at the shell prompt using +SGI's runon or Robert Love's taskset. The mbind and set_mempolicy +calls can be done at the shell prompt using the numactl command +(part of Andi's numa package). + +2. Usage Examples and Syntax +============================ + +2.1 Basic Usage +--------------- + +Creating, modifying, using the cpusets can be done through the cpuset +virtual filesystem.
+ +To mount it, type: +# mount -t cpuset none /dev/cpuset + +Then under /dev/cpuset you can find a tree that corresponds to the +tree of the cpusets in the system. For instance, /dev/cpuset/top_cpuset +is the cpuset that holds the whole system. + +If you want to create a new cpuset under top_cpuset: +# cd /dev/cpuset/top_cpuset +# mkdir my_cpuset + +Now you want to do something with this cpuset. +# cd my_cpuset + +In this directory you can find several files: +# ls +cpus cpu_exclusive mems mem_exclusive tasks + +Reading them will give you information about the state of this cpuset: +the CPUs and Memory Nodes it can use, the processes that are using +it, its properties. By writing to these files you can manipulate +the cpuset. + +Set some flags: +# /bin/echo 1 > cpu_exclusive + +Add some cpus: +# /bin/echo 0-7 > cpus + +Now attach your shell to this cpuset: +# /bin/echo $$ > tasks + +You can also create cpusets inside your cpuset by using mkdir in this +directory. +# mkdir my_sub_cs + +To remove a cpuset, just use rmdir: +# rmdir my_sub_cs +This will fail if the cpuset is in use (has cpusets inside, or has +processes attached). 
+ +2.2 Adding/removing cpus +------------------------ + +This is the syntax to use when writing in the cpus or mems files +in cpuset directories: + +# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 +# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 +# /bin/echo +1 > cpus -> add cpu 1 to the cpus list +# /bin/echo -1-4 > cpus -> remove cpus 1,2,3,4 from the cpus list +# /bin/echo -1,2,3,4 > cpus -> remove cpus 1,2,3,4 from the cpus list + +All these can be mixed together: +# /bin/echo 1-7 -6 +9,10 -> set cpus list to 1,2,3,4,5,7,9,10 + +2.3 Setting flags +----------------- + +The syntax is very simple: + +# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' +# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' + +2.4 Attaching processes +----------------------- + +# /bin/echo PID > tasks + +Note that it is PID, not PIDs. You can only attach ONE task at a time. +If you have several tasks to attach, you have to do it one after another: + +# /bin/echo PID1 > tasks +# /bin/echo PID2 > tasks + ... +# /bin/echo PIDn > tasks + + +3. Questions +============ + +Q: what's up with this '/bin/echo' ? +A: bash's builtin 'echo' command does not check calls to write() against + errors. If you use it in the cpuset file system, you won't be + able to tell whether a command succeeded or failed. + +Q: When I attach processes, only the first of the line gets really attached ! +A: We can only return one error code per call to write(). So you should also + put only ONE pid. + +4. 
Contact +========== + +Web: http://www.bullopensource.org/cpuset Index: 2.6.8-rc2-mm2/fs/proc/base.c =================================================================== --- 2.6.8-rc2-mm2.orig/fs/proc/base.c 2004-08-04 21:44:05.000000000 -0700 +++ 2.6.8-rc2-mm2/fs/proc/base.c 2004-08-04 21:44:49.000000000 -0700 @@ -32,6 +32,7 @@ #include <linux/mount.h> #include <linux/security.h> #include <linux/ptrace.h> +#include <linux/cpuset.h> /* * For hysterical raisins we keep the same inumbers as in the old procfs. @@ -60,6 +61,9 @@ enum pid_directory_inos { PROC_TGID_MAPS, PROC_TGID_MOUNTS, PROC_TGID_WCHAN, +#ifdef CONFIG_CPUSETS + PROC_TGID_CPUSET, +#endif #ifdef CONFIG_SECURITY PROC_TGID_ATTR, PROC_TGID_ATTR_CURRENT, @@ -83,6 +87,9 @@ enum pid_directory_inos { PROC_TID_MAPS, PROC_TID_MOUNTS, PROC_TID_WCHAN, +#ifdef CONFIG_CPUSETS + PROC_TID_CPUSET, +#endif #ifdef CONFIG_SECURITY PROC_TID_ATTR, PROC_TID_ATTR_CURRENT, @@ -123,6 +130,9 @@ static struct pid_entry tgid_base_stuff[ #ifdef CONFIG_KALLSYMS E(PROC_TGID_WCHAN, "wchan", S_IFREG|S_IRUGO), #endif +#ifdef CONFIG_CPUSETS + E(PROC_TGID_CPUSET, "cpuset", S_IFREG|S_IRUGO), +#endif {0,0,NULL,0} }; static struct pid_entry tid_base_stuff[] = { @@ -145,6 +155,9 @@ static struct pid_entry tid_base_stuff[] #ifdef CONFIG_KALLSYMS E(PROC_TID_WCHAN, "wchan", S_IFREG|S_IRUGO), #endif +#ifdef CONFIG_CPUSETS + E(PROC_TID_CPUSET, "cpuset", S_IFREG|S_IRUGO), +#endif {0,0,NULL,0} }; @@ -1376,6 +1389,12 @@ static struct dentry *proc_pident_lookup ei->op.proc_read = proc_pid_wchan; break; #endif +#ifdef CONFIG_CPUSETS + case PROC_TID_CPUSET: + case PROC_TGID_CPUSET: + inode->i_fop = &proc_cpuset_operations; + break; +#endif default: printk("procfs: impossible type (%d)",p->type); iput(inode); Index: 2.6.8-rc2-mm2/include/linux/cpuset.h =================================================================== --- 2.6.8-rc2-mm2.orig/include/linux/cpuset.h 2003-03-14 05:07:09.000000000 -0800 +++ 2.6.8-rc2-mm2/include/linux/cpuset.h 2004-08-04 
21:44:49.000000000 -0700 @@ -0,0 +1,61 @@ +#ifndef _LINUX_CPUSET_H +#define _LINUX_CPUSET_H +/* + * cpuset interface + * + * Copyright (C) 2003 BULL SA + * Copyright (C) 2004 Silicon Graphics, Inc. + * + */ + +#include <linux/sched.h> +#include <linux/cpumask.h> +#include <linux/nodemask.h> + +#ifdef CONFIG_CPUSETS + +extern int cpuset_init(void); +extern void cpuset_fork(struct task_struct *p); +extern void cpuset_exit(struct task_struct *p); +extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p); +extern const nodemask_t cpuset_mems_allowed(const struct task_struct *p); +void cpuset_init_current_mems_allowed(void); +void cpuset_update_current_mems_allowed(void); +void cpuset_restrict_to_mems_allowed(unsigned long *nodes); +int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl); +int cpuset_zone_allowed(struct zone *z); +extern struct file_operations proc_cpuset_operations; + +#else /* !CONFIG_CPUSETS */ + +static inline int cpuset_init(void) { return 0; } +static inline void cpuset_fork(struct task_struct *p) {} +static inline void cpuset_exit(struct task_struct *p) {} + +static inline const cpumask_t cpuset_cpus_allowed(struct task_struct *p) +{ + return cpu_possible_map; +} + +static inline const nodemask_t cpuset_mems_allowed(struct task_struct *p) +{ + return node_possible_map; +} + +static inline void cpuset_init_current_mems_allowed(void) {} +static inline void cpuset_update_current_mems_allowed(void) {} +static inline void cpuset_restrict_to_mems_allowed(unsigned long *nodes) {} + +static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl) +{ + return 1; +} + +static inline int cpuset_zone_allowed(struct zone *z) +{ + return 1; +} + +#endif /* !CONFIG_CPUSETS */ + +#endif /* _LINUX_CPUSET_H */ Index: 2.6.8-rc2-mm2/include/linux/sched.h =================================================================== --- 2.6.8-rc2-mm2.orig/include/linux/sched.h 2004-08-04 21:44:05.000000000 -0700 +++ 
2.6.8-rc2-mm2/include/linux/sched.h	2004-08-04 21:44:49.000000000 -0700
@@ -13,6 +13,7 @@
 #include <linux/rbtree.h>
 #include <linux/thread_info.h>
 #include <linux/cpumask.h>
+#include <linux/nodemask.h>
 
 #include <asm/system.h>
 #include <asm/semaphore.h>
@@ -370,6 +371,7 @@ struct k_itimer {
 
 struct io_context;			/* See blkdev.h */
 void exit_io_context(void);
+struct cpuset;
 
 #define NGROUPS_SMALL		32
 #define NGROUPS_PER_BLOCK	((int)(PAGE_SIZE / sizeof(gid_t)))
@@ -551,6 +553,10 @@ struct task_struct {
 	struct rw_semaphore pagg_sem;
 #endif
 
+#ifdef CONFIG_CPUSETS
+	struct cpuset *cpuset;
+	nodemask_t mems_allowed;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: 2.6.8-rc2-mm2/init/Kconfig
===================================================================
--- 2.6.8-rc2-mm2.orig/init/Kconfig	2004-08-04 21:44:05.000000000 -0700
+++ 2.6.8-rc2-mm2/init/Kconfig	2004-08-04 21:44:49.000000000 -0700
@@ -278,6 +278,16 @@ config EPOLL
 	  Disabling this option will cause the kernel to be built without
 	  support for epoll family of system calls.
 
+config CPUSETS
+	bool "Cpuset support"
+	help
+	  This option will let you create and manage CPUSETs, which
+	  allow dynamically partitioning a system into sets of CPUs and
+	  Memory Nodes and assigning tasks to run only within those sets.
+	  This is primarily useful on large SMP or NUMA systems.
+
+	  Say N if unsure.
+ source "drivers/block/Kconfig.iosched" config CC_OPTIMIZE_FOR_SIZE Index: 2.6.8-rc2-mm2/init/main.c =================================================================== --- 2.6.8-rc2-mm2.orig/init/main.c 2004-08-04 21:44:05.000000000 -0700 +++ 2.6.8-rc2-mm2/init/main.c 2004-08-04 21:44:49.000000000 -0700 @@ -41,6 +41,7 @@ #include <linux/writeback.h> #include <linux/cpu.h> #include <linux/efi.h> +#include <linux/cpuset.h> #include <linux/unistd.h> #include <linux/rmap.h> #include <linux/mempolicy.h> @@ -568,6 +569,8 @@ asmlinkage void __init start_kernel(void #ifdef CONFIG_PROC_FS proc_root_init(); #endif + cpuset_init(); + check_bugs(); /* Do the rest non-__init'ed, we're now alive */ Index: 2.6.8-rc2-mm2/kernel/Makefile =================================================================== --- 2.6.8-rc2-mm2.orig/kernel/Makefile 2004-08-04 21:44:05.000000000 -0700 +++ 2.6.8-rc2-mm2/kernel/Makefile 2004-08-04 21:44:49.000000000 -0700 @@ -25,6 +25,7 @@ obj-$(CONFIG_IKCONFIG_PROC) += configs.o obj-$(CONFIG_STOP_MACHINE) += stop_machine.o obj-$(CONFIG_AUDIT) += audit.o obj-$(CONFIG_AUDITSYSCALL) += auditsc.o +obj-$(CONFIG_CPUSETS) += cpuset.o ifneq ($(CONFIG_IA64),y) # According to Alan Modra <al...@li...>, the -fno-omit-frame-pointer is Index: 2.6.8-rc2-mm2/kernel/cpuset.c =================================================================== --- 2.6.8-rc2-mm2.orig/kernel/cpuset.c 2003-03-14 05:07:09.000000000 -0800 +++ 2.6.8-rc2-mm2/kernel/cpuset.c 2004-08-04 21:44:49.000000000 -0700 @@ -0,0 +1,1477 @@ +/* + * kernel/cpuset.c + * + * Processor and Memory placement constraints for sets of tasks. + * + * Copyright (C) 2003 BULL SA. + * Copyright (C) 2004 Silicon Graphics, Inc. + * + * Portions derived from Patrick Mochel's sysfs code. + * sysfs is Copyright (c) 2001-3 Patrick Mochel + * Portions Copyright (c) 2004 Silicon Graphics, Inc. + * + * 2003-10-10 Written by Simon Derr <sim...@bu...> + * 2003-10-22 Updates by Stephen Hemminger. 
+ * 2004 May-July Rework by Paul Jackson <pj...@sg...> + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <linux/config.h> +#include <linux/cpu.h> +#include <linux/cpumask.h> +#include <linux/cpuset.h> +#include <linux/err.h> +#include <linux/errno.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/init.h> +#include <linux/interrupt.h> +#include <linux/kernel.h> +#include <linux/kmod.h> +#include <linux/list.h> +#include <linux/mm.h> +#include <linux/module.h> +#include <linux/mount.h> +#include <linux/namei.h> +#include <linux/pagemap.h> +#include <linux/proc_fs.h> +#include <linux/sched.h> +#include <linux/seq_file.h> +#include <linux/slab.h> +#include <linux/smp_lock.h> +#include <linux/spinlock.h> +#include <linux/stat.h> +#include <linux/string.h> +#include <linux/time.h> +#include <linux/backing-dev.h> + +#include <asm/uaccess.h> +#include <asm/atomic.h> +#include <asm/semaphore.h> + +#define CPUSET_SUPER_MAGIC 0x27e0eb + +struct cpuset { + unsigned long flags; /* "unsigned long" so bitops work */ + cpumask_t cpus_allowed; /* CPUs allowed to tasks in cpuset */ + nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */ + + atomic_t count; /* count tasks using this cpuset */ + + /* + * We link our 'sibling' struct into our parents 'children'. + * Our children link their 'sibling' into our 'children'. 
+ */ + struct list_head sibling; /* my parents children */ + struct list_head children; /* my children */ + + struct cpuset *parent; /* my parent */ + struct dentry *dentry; /* cpuset fs entry */ +}; + +/* bits in struct cpuset flags field */ +typedef enum { + CS_CPU_EXCLUSIVE, + CS_MEM_EXCLUSIVE, + CS_REMOVED, + CS_NOTIFY_ON_RELEASE +} cpuset_flagbits_t; + +/* convenient tests for these bits */ +static inline int is_cpu_exclusive(const struct cpuset *cs) +{ + return !!test_bit(CS_CPU_EXCLUSIVE, &cs->flags); +} + +static inline int is_mem_exclusive(const struct cpuset *cs) +{ + return !!test_bit(CS_MEM_EXCLUSIVE, &cs->flags); +} + +static inline int is_removed(const struct cpuset *cs) +{ + return !!test_bit(CS_REMOVED, &cs->flags); +} + +static inline int notify_on_release(const struct cpuset *cs) +{ + return !!test_bit(CS_NOTIFY_ON_RELEASE, &cs->flags); +} + +static struct cpuset top_cpuset = { + .flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)), + .cpus_allowed = CPU_MASK_ALL, + .mems_allowed = NODE_MASK_ALL, + .count = ATOMIC_INIT(0), + .sibling = LIST_HEAD_INIT(top_cpuset.sibling), + .children = LIST_HEAD_INIT(top_cpuset.children), + .parent = NULL, + .dentry = NULL, +}; + +static struct vfsmount *cpuset_mount; +static struct super_block *cpuset_sb = NULL; + +/* + * cpuset_sem should be held by anyone who is depending on the children + * or sibling lists of any cpuset, or performing non-atomic operations + * on the flags or *_allowed values of a cpuset, such as raising the + * CS_REMOVED flag bit iff it is not already raised, or reading and + * conditionally modifying the *_allowed values. One kernel global + * cpuset semaphore should be sufficient - these things don't change + * that much. + * + * The code that modifies cpusets holds cpuset_sem across the entire + * operation, from cpuset_common_file_write() down, single threading + * all cpuset modifications (except for counter manipulations from + * fork and exit) across the system. 
This presumes that cpuset + * modifications are rare - better kept simple and safe, even if slow. + * + * The code that reads cpusets, such as in cpuset_common_file_read() + * and below, only holds cpuset_sem across small pieces of code, such + * as when reading out possibly multi-word cpumasks and nodemasks, as + * the risks are less, and the desire for performance a little greater. + * The proc_cpuset_show() routine needs to hold cpuset_sem to insure + * that no cs->dentry is NULL, as it walks up the cpuset tree to root. + * + * The hooks from fork and exit, cpuset_fork() and cpuset_exit(), don't + * (usually) grab cpuset_sem. These are the two most performance + * critical pieces of code here. The exception occurs on exit(), + * if the last task using a cpuset exits, and the cpuset was marked + * notify_on_release. In that case, the cpuset_sem is taken, the + * path to the released cpuset calculated, and a usermode call made + * to /sbin/cpuset_release_agent with the name of the cpuset (path + * relative to the root of cpuset file system) as the argument. + * + * A cpuset can only be deleted if both its 'count' of using tasks is + * zero, and its list of 'children' cpusets is empty. Since all tasks + * in the system use _some_ cpuset, and since there is always at least + * one task in the system (init, pid == 1), therefore, top_cpuset + * always has either children cpusets and/or using tasks. So no need + * for any special hack to ensure that top_cpuset cannot be deleted. + */ + +static DECLARE_MUTEX(cpuset_sem); + +/* + * A couple of forward declarations required, due to cyclic reference loop: + * cpuset_mkdir -> cpuset_create -> cpuset_populate_dir -> cpuset_add_file + * -> cpuset_create_file -> cpuset_dir_inode_operations -> cpuset_mkdir. 
+ */ + +static int cpuset_mkdir(struct inode *dir, struct dentry *dentry, int mode); +static int cpuset_rmdir(struct inode *unused_dir, struct dentry *dentry); + +static struct backing_dev_info cpuset_backing_dev_info = { + .ra_pages = 0, /* No readahead */ + .memory_backed = 1, /* Does not contribute to dirty memory */ +}; + +static struct inode *cpuset_new_inode(mode_t mode) +{ + struct inode *inode = new_inode(cpuset_sb); + + if (inode) { + inode->i_mode = mode; + inode->i_uid = current->fsuid; + inode->i_gid = current->fsgid; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_mapping->backing_dev_info = &cpuset_backing_dev_info; + } + return inode; +} + +static void cpuset_diput(struct dentry *dentry, struct inode *inode) +{ + /* is dentry a directory ? if so, kfree() associated cpuset */ + if (S_ISDIR(inode->i_mode)) { + struct cpuset *cs = (struct cpuset *)dentry->d_fsdata; + BUG_ON(!(is_removed(cs))); + kfree(cs); + } + iput(inode); +} + +static struct dentry_operations cpuset_dops = { + .d_iput = cpuset_diput, +}; + +static struct dentry *cpuset_get_dentry(struct dentry *parent, const char *name) +{ + struct qstr qstr; + struct dentry *d; + + qstr.name = name; + qstr.len = strlen(name); + qstr.hash = full_name_hash(name, qstr.len); + d = lookup_hash(&qstr, parent); + if (d) + d->d_op = &cpuset_dops; + return d; +} + +static void remove_dir(struct dentry *d) +{ + struct dentry *parent = dget(d->d_parent); + + d_delete(d); + simple_rmdir(parent->d_inode, d); + dput(parent); +} + +/* + * NOTE : the dentry must have been dget()'ed + */ +static void cpuset_d_remove_dir(struct dentry *dentry) +{ + struct list_head *node; + + spin_lock(&dcache_lock); + node = dentry->d_subdirs.next; + while (node != &dentry->d_subdirs) { + struct dentry *d = list_entry(node, struct dentry, d_child); + list_del_init(node); + if (d->d_inode) { + d = dget_locked(d); + spin_unlock(&dcache_lock); + 
d_delete(d); + simple_unlink(dentry->d_inode, d); + dput(d); + spin_lock(&dcache_lock); + } + node = dentry->d_subdirs.next; + } + list_del_init(&dentry->d_child); + spin_unlock(&dcache_lock); + remove_dir(dentry); +} + +static struct super_operations cpuset_ops = { + .statfs = simple_statfs, + .drop_inode = generic_delete_inode, +}; + +static int cpuset_fill_super(struct super_block *sb, void *unused_data, + int unused_silent) +{ + struct inode *inode; + struct dentry *root; + + sb->s_blocksize = PAGE_CACHE_SIZE; + sb->s_blocksize_bits = PAGE_CACHE_SHIFT; + sb->s_magic = CPUSET_SUPER_MAGIC; + sb->s_op = &cpuset_ops; + cpuset_sb = sb; + + inode = cpuset_new_inode(S_IFDIR | S_IRUGO | S_IXUGO | S_IWUSR); + if (inode) { + inode->i_op = &simple_dir_inode_operations; + inode->i_fop = &simple_dir_operations; + /* directories start off with i_nlink == 2 (for "." entry) */ + inode->i_nlink++; + } else { + return -ENOMEM; + } + + root = d_alloc_root(inode); + if (!root) { + iput(inode); + return -ENOMEM; + } + sb->s_root = root; + return 0; +} + +static struct super_block *cpuset_get_sb(struct file_system_type *fs_type, + int flags, const char *unused_dev_name, + void *data) +{ + return get_sb_single(fs_type, flags, data, cpuset_fill_super); +} + +static struct file_system_type cpuset_fs_type = { + .name = "cpuset", + .get_sb = cpuset_get_sb, + .kill_sb = kill_litter_super, +}; + +/* struct cftype: + * + * The files in the cpuset filesystem mostly have a very simple read/write + * handling, some common function will take care of it. Nevertheless some cases + * (read tasks) are special and therefore I define this structure for every + * kind of file. 
+ * + * + * When reading/writing to a file: + * - the cpuset to use in file->f_dentry->d_parent->d_fsdata + * - the 'cftype' of the file is file->f_dentry->d_fsdata + */ + +struct cftype { + char *name; + int private; + int (*open) (struct inode *inode, struct file *file); + ssize_t (*read) (struct file *file, char __user *buf, size_t nbytes, + loff_t *ppos); + int (*write) (struct file *file, const char *buf, size_t nbytes, + loff_t *ppos); + int (*release) (struct inode *inode, struct file *file); +}; + +static inline struct cpuset *__d_cs(struct dentry *dentry) +{ + return (struct cpuset *)dentry->d_fsdata; +} + +static inline struct cftype *__d_cft(struct dentry *dentry) +{ + return (struct cftype *)dentry->d_fsdata; +} + +/* + * Call with cpuset_sem held. Writes path of cpuset into buf. + * Returns 0 on success, -errno on error. + */ + +static int cpuset_path(const struct cpuset *cs, char *buf, int buflen) +{ + char *start; + + start = buf + buflen; + + *--start = '\0'; + for (;;) { + int len = cs->dentry->d_name.len; + if ((start -= len) < buf) + return -ENAMETOOLONG; + memcpy(start, cs->dentry->d_name.name, len); + cs = cs->parent; + if (!cs) + break; + if (!cs->parent) + continue; + if (--start < buf) + return -ENAMETOOLONG; + *start = '/'; + } + memmove(buf, start, buf + buflen - start); + return 0; +} + +/* + * Notify userspace when a cpuset is released, by running + * /sbin/cpuset_release_agent with the name of the cpuset (path + * relative to the root of cpuset file system) as the argument. + * + * Most likely, this user command will try to rmdir this cpuset. + * + * This races with the possibility that some other task will be + * attached to this cpuset before it is removed, or that some other + * user task will 'mkdir' a child cpuset of this cpuset. That's ok. + * The presumed 'rmdir' will fail quietly if this cpuset is no longer + * unused, and this cpuset will be reprieved from its death sentence, + * to continue to serve a useful existence. 
Next time it's released, + * we will get notified again, if it still has 'notify_on_release' set. + * + * Note final arg to call_usermodehelper() is 0 - that means + * don't wait. Since we are holding the global cpuset_sem here, + * and we are asking another thread (started from keventd) to rmdir a + * cpuset, we can't wait - or we'd deadlock with the removing thread + * on cpuset_sem. + */ + +static int cpuset_release_agent(char *cpuset_str) +{ + char *argv[3], *envp[3]; + int i; + + i = 0; + argv[i++] = "/sbin/cpuset_release_agent"; + argv[i++] = cpuset_str; + argv[i] = NULL; + + i = 0; + /* minimal command environment */ + envp[i++] = "HOME=/"; + envp[i++] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin"; + envp[i] = NULL; + + return call_usermodehelper(argv[0], argv, envp, 0); +} + +/* + * Either cs->count of using tasks transitioned to zero, or the + * cs->children list of child cpusets just became empty. If this + * cs is notify_on_release() and now both the user count is zero and + * the list of children is empty, send notice to user land. + */ + +static void check_for_release(struct cpuset *cs) +{ + if (notify_on_release(cs) && atomic_read(&cs->count) == 0 && + list_empty(&cs->children)) { + char *buf; + + buf = kmalloc(PAGE_SIZE, GFP_KERNEL); + if (!buf) + return; + if (cpuset_path(cs, buf, PAGE_SIZE) < 0) + goto out; + cpuset_release_agent(buf); + out: + kfree(buf); + } +} + +/* + * is_cpuset_subset(p, q) - Is cpuset p a subset of cpuset q? + * + * One cpuset is a subset of another if all its allowed CPUs and + * Memory Nodes are a subset of the other, and its exclusive flags + * are only set if the other's are set. 
+ */ + +static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q) +{ + return cpus_subset(p->cpus_allowed, q->cpus_allowed) && + nodes_subset(p->mems_allowed, q->mems_allowed) && + is_cpu_exclusive(p) <= is_cpu_exclusive(q) && + is_mem_exclusive(p) <= is_mem_exclusive(q); +} + +/* + * validate_change() - Used to validate that any proposed cpuset change + * follows the structural rules for cpusets. + * + * If we replaced the flag and mask values of the current cpuset + * (cur) with those values in the trial cpuset (trial), would + * our various subset and exclusive rules still be valid? Presumes + * cpuset_sem held. + * + * 'cur' is the address of an actual, in-use cpuset. Operations + * such as list traversal that depend on the actual address of the + * cpuset in the list must use cur below, not trial. + * + * 'trial' is the address of bulk structure copy of cur, with + * perhaps one or more of the fields cpus_allowed, mems_allowed, + * or flags changed to new, trial values. + * + * Return 0 if valid, -errno if not. + */ + +static int validate_change(const struct cpuset *cur, const struct cpuset *trial) +{ + struct cpuset *c, *par = cur->parent; + + /* + * Don't mess with Big Daddy - top_cpuset must remain maximal. + * And besides, the rest of this routine blows chunks if par == 0. 
+ */ + if (cur == &top_cpuset) + return -EPERM; + + /* Any in-use cpuset must have at least ONE cpu and mem */ + if (atomic_read(&trial->count) > 1) { + if (cpus_empty(trial->cpus_allowed)) + return -ENOSPC; + if (nodes_empty(trial->mems_allowed)) + return -ENOSPC; + } + + /* We must be a subset of our parent cpuset */ + if (!is_cpuset_subset(trial, par)) + return -EACCES; + + /* Each of our child cpusets must be a subset of us */ + list_for_each_entry(c, &cur->children, sibling) { + if (!is_cpuset_subset(c, trial)) + return -EBUSY; + } + + /* If either I or some sibling (!= me) is exclusive, we can't overlap */ + list_for_each_entry(c, &par->children, sibling) { + if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) && + c != cur && + cpus_intersects(trial->cpus_allowed, c->cpus_allowed) + ) { + return -EINVAL; + } + if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) && + c != cur && + nodes_intersects(trial->mems_allowed, c->mems_allowed) + ) { + return -EINVAL; + } + } + + return 0; +} + +static int update_cpumask(struct cpuset *cs, char *buf) +{ + struct cpuset trialcs; + int retval; + + trialcs = *cs; + retval = cpulist_parse(buf, trialcs.cpus_allowed); + if (retval < 0) + return retval; + retval = validate_change(cs, &trialcs); + if (retval == 0) + cs->cpus_allowed = trialcs.cpus_allowed; + return retval; +} + +static int update_nodemask(struct cpuset *cs, char *buf) +{ + struct cpuset trialcs; + int retval; + + trialcs = *cs; + retval = nodelist_parse(buf, trialcs.mems_allowed); + if (retval < 0) + return retval; + retval = validate_change(cs, &trialcs); + if (retval == 0) + cs->mems_allowed = trialcs.mems_allowed; + return retval; +} + +/* + * update_flag - read a 0 or a 1 in a file and update associated flag + * bit: the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE, + * CS_NOTIFY_ON_RELEASE) + * cs: the cpuset to update + * buf: the buffer where we read the 0 or 1 + */ + +static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, char *buf) +{ 
+ int turning_on; + struct cpuset trialcs; + int err; + + turning_on = (simple_strtoul(buf, NULL, 10) != 0); + + trialcs = *cs; + if (turning_on) + set_bit(bit, &trialcs.flags); + else + clear_bit(bit, &trialcs.flags); + + err = validate_change(cs, &trialcs); + if (err == 0) { + if (turning_on) + set_bit(bit, &cs->flags); + else + clear_bit(bit, &cs->flags); + } + return err; +} + +static int attach_task(struct cpuset *cs, char *buf) +{ + pid_t pid; + struct task_struct *tsk; + struct cpuset *oldcs; + + if (sscanf(buf, "%d", &pid) != 1) + return -EIO; + if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)) + return -ENOSPC; + + if (pid) { + read_lock(&tasklist_lock); + + tsk = find_task_by_pid(pid); + if (!tsk) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + + get_task_struct(tsk); + read_unlock(&tasklist_lock); + + if ((current->euid) && (current->euid != tsk->uid) + && (current->euid != tsk->suid)) { + put_task_struct(tsk); + return -EACCES; + } + } else { + tsk = current; + get_task_struct(tsk); + } + + task_lock(tsk); + oldcs = tsk->cpuset; + if (!oldcs) { + task_unlock(tsk); + put_task_struct(tsk); + return -ESRCH; + } + atomic_inc(&cs->count); + tsk->cpuset = cs; + task_unlock(tsk); + + put_task_struct(tsk); + if (atomic_dec_and_test(&oldcs->count)) + check_for_release(oldcs); + return 0; +} + +/* The various types of files and directories in a cpuset file system */ + +typedef enum { + FILE_ROOT, + FILE_DIR, + FILE_CPULIST, + FILE_MEMLIST, + FILE_CPU_EXCLUSIVE, + FILE_MEM_EXCLUSIVE, + FILE_NOTIFY_ON_RELEASE, + FILE_TASKLIST, +} cpuset_filetype_t; + +static ssize_t cpuset_common_file_write(struct file *file, const char *userbuf, + size_t nbytes, loff_t *unused_ppos) +{ + struct cpuset *cs = __d_cs(file->f_dentry->d_parent); + struct cftype *cft = __d_cft(file->f_dentry); + cpuset_filetype_t type = cft->private; + char *buffer; + int retval = 0; + + /* Crude upper limit on largest legitimate cpulist user might write. 
*/ + if (nbytes > 100 + 6 * NR_CPUS) + return -E2BIG; + + /* +1 for nul-terminator */ + if ((buffer = kmalloc(nbytes + 1, GFP_KERNEL)) == 0) + return -ENOMEM; + + if (copy_from_user(buffer, userbuf, nbytes)) { + retval = -EFAULT; + goto out1; + } + buffer[nbytes] = 0; /* nul-terminate */ + + down(&cpuset_sem); + + if (is_removed(cs)) { + retval = -ENODEV; + goto out2; + } + + switch (type) { + case FILE_CPULIST: + retval = update_cpumask(cs, buffer); + break; + case FILE_MEMLIST: + retval = update_nodemask(cs, buffer); + break; + case FILE_CPU_EXCLUSIVE: + retval = update_flag(CS_CPU_EXCLUSIVE, cs, buffer); + break; + case FILE_MEM_EXCLUSIVE: + retval = update_flag(CS_MEM_EXCLUSIVE, cs, buffer); + break; + case FILE_NOTIFY_ON_RELEASE: + retval = update_flag(CS_NOTIFY_ON_RELEASE, cs, buffer); + break; + case FILE_TASKLIST: + retval = attach_task(cs, buffer); + break; + default: + retval = -EINVAL; + goto out2; + } + + if (retval == 0) + retval = nbytes; +out2: + up(&cpuset_sem); +out1: + kfree(buffer); + return retval; +} + +static ssize_t cpuset_file_write(struct file *file, const char *buf, + size_t nbytes, loff_t *ppos) +{ + ssize_t retval = 0; + struct cftype *cft = __d_cft(file->f_dentry); + if (!cft) + return -ENODEV; + + /* special function ? */ + if (cft->write) + retval = cft->write(file, buf, nbytes, ppos); + else + retval = cpuset_common_file_write(file, buf, nbytes, ppos); + + return retval; +} + +/* + * These ascii lists should be read in a single call, by using a user + * buffer large enough to hold the entire map. If read in smaller + * chunks, there is no guarantee of atomicity. Since the display format + * used, list of ranges of sequential numbers, is variable length, + * and since these maps can change value dynamically, one could read + * gibberish by doing partial reads while a list was changing. 
+ * A single large read to a buffer that crosses a page boundary is + * ok, because the result being copied to user land is not recomputed + * across a page fault. + */ + +static int cpuset_sprintf_cpulist(char *page, struct cpuset *cs) +{ + cpumask_t mask; + + down(&cpuset_sem); + mask = cs->cpus_allowed; + up(&cpuset_sem); + + return cpulist_scnprintf(page, PAGE_SIZE, mask); +} + +static int cpuset_sprintf_memlist(char *page, struct cpuset *cs) +{ + nodemask_t mask; + + down(&cpuset_sem); + mask = cs->mems_allowed; + up(&cpuset_sem); + + return nodelist_scnprintf(page, PAGE_SIZE, mask); +} + +static ssize_t cpuset_common_file_read(struct file *file, char __user *buf, + size_t nbytes, loff_t *ppos) +{ + struct cftype *cft = __d_cft(file->f_dentry); + struct cpuset *cs = __d_cs(file->f_dentry->d_parent); + cpuset_filetype_t type = cft->private; + char *page; + ssize_t retval = 0; + char *s; + char *start; + size_t n; + + if (!(page = (char *)__get_free_page(GFP_KERNEL))) + return -ENOMEM; + + s = page; + + switch (type) { + case FILE_CPULIST: + s += cpuset_sprintf_cpulist(s, cs); + break; + case FILE_MEMLIST: + s += cpuset_sprintf_memlist(s, cs); + break; + case FILE_CPU_EXCLUSIVE: + *s++ = is_cpu_exclusive(cs) ? '1' : '0'; + break; + case FILE_MEM_EXCLUSIVE: + *s++ = is_mem_exclusive(cs) ? '1' : '0'; + break; + case FILE_NOTIFY_ON_RELEASE: + *s++ = notify_on_release(cs) ? '1' : '0'; + break; + default: + retval = -EINVAL; + goto out; + } + *s++ = '\n'; + *s = '\0'; + + start = page + *ppos; + n = s - start; + retval = n - copy_to_user(buf, start, min(n, nbytes)); + *ppos += retval; +out: + free_page((unsigned long)page); + return retval; +} + +static ssize_t cpuset_file_read(struct file *file, char *buf, size_t nbytes, + loff_t *ppos) +{ + ssize_t retval = 0; + struct cftype *cft = __d_cft(file->f_dentry); + if (!cft) + return -ENODEV; + + /* special function ? 
*/ + if (cft->read) + retval = cft->read(file, buf, nbytes, ppos); + else + retval = cpuset_common_file_read(file, buf, nbytes, ppos); + + return retval; +} + +static int cpuset_file_open(struct inode *inode, struct file *file) +{ + int err; + struct cftype *cft; + + err = generic_file_open(inode, file); + if (err) + return err; + + cft = __d_cft(file->f_dentry); + if (!cft) + return -ENODEV; + if (cft->open) + err = cft->open(inode, file); + else + err = 0; + + return err; +} + +static int cpuset_file_release(struct inode *inode, struct file *file) +{ + struct cftype *cft = __d_cft(file->f_dentry); + if (cft->release) + return cft->release(inode, file); + return 0; +} + +static struct file_operations cpuset_file_operations = { + .read = cpuset_file_read, + .write = cpuset_file_write, + .llseek = generic_file_llseek, + .open = cpuset_file_open, + .release = cpuset_file_release, +}; + +static struct inode_operations cpuset_dir_inode_operations = { + .lookup = simple_lookup, + .mkdir = cpuset_mkdir, + .rmdir = cpuset_rmdir, +}; + +static int cpuset_create_file(struct dentry *dentry, int mode) +{ + struct inode *inode; + + if (!dentry) + return -ENOENT; + if (dentry->d_inode) + return -EEXIST; + + inode = cpuset_new_inode(mode); + if (!inode) + return -ENOMEM; + + if (S_ISDIR(mode)) { + inode->i_op = &cpuset_dir_inode_operations; + inode->i_fop = &simple_dir_operations; + + /* start off with i_nlink == 2 (for "." entry) */ + inode->i_nlink++; + } else if (S_ISREG(mode)) { + inode->i_size = 0; + inode->i_fop = &cpuset_file_operations; + } + + d_instantiate(dentry, inode); + dget(dentry); /* Extra count - pin the dentry in core */ + return 0; +} + +/* + * cpuset_create_dir - create a directory for an object. + * cs: the cpuset we create the directory for. + * It must have a valid ->parent field + * And we are going to fill its ->dentry field. + * name: The name to give to the cpuset directory. Will be copied. + * mode: mode to set on new directory. 
+ */ + +static int cpuset_create_dir(struct cpuset *cs, const char *name, int mode) +{ + struct dentry *dentry = NULL; + struct dentry *parent; + int error = 0; + + parent = cs->parent->dentry; + dentry = cpuset_get_dentry(parent, name); + if (IS_ERR(dentry)) + return PTR_ERR(dentry); + error = cpuset_create_file(dentry, S_IFDIR | mode); + if (!error) { + dentry->d_fsdata = cs; + parent->d_inode->i_nlink++; + cs->dentry = dentry; + } + dput(dentry); + + return error; +} + +/* MUST be called with dir->d_inode->i_sem held */ + +static int cpuset_add_file(struct dentry *dir, const struct cftype *cft) +{ + struct dentry *dentry; + int error; + + dentry = cpuset_get_dentry(dir, cft->name); + if (!IS_ERR(dentry)) { + error = cpuset_create_file(dentry, 0644 | S_IFREG); + if (!error) + dentry->d_fsdata = (void *)cft; + dput(dentry); + } else + error = PTR_ERR(dentry); + return error; +} + +/* + * Stuff for reading the 'tasks' file. + * + * Reading this file can return large amounts of data if a cpuset has + * *lots* of attached tasks. So it may need several calls to read(), + * but we cannot guarantee that the information we produce is correct + * unless we produce it entirely atomically. + * + * Upon first file read(), a struct ctr_struct is allocated, that + * will have a pointer to an array (also allocated here). The struct + * ctr_struct * is stored in file->private_data. Its resources will + * be freed by release() when the file is closed. The array is used + * to sprintf the PIDs and then used by read(). 
*/ + +/* cpusets_tasks_read array */ + +struct ctr_struct { + int *array; + int count; +}; + +static struct ctr_struct *cpuset_tasks_mkctr(struct file *file) +{ + struct cpuset *cs = __d_cs(file->f_dentry->d_parent); + struct ctr_struct *ctr; + pid_t *array; + int n, max; + pid_t i, j, last; + struct task_struct *g, *p; + + ctr = kmalloc(sizeof(*ctr), GFP_KERNEL); + if (!ctr) + return NULL; + + /* + * If cpuset gets more users after we read count, we won't have + * enough space - tough. This race is indistinguishable to the + * caller from the case that the additional cpuset users didn't + * show up until sometime later on. Grabbing cpuset_sem would + * not help, because cpuset_fork() doesn't grab cpuset_sem. + */ + + max = atomic_read(&cs->count); + array = kmalloc(max * sizeof(pid_t), GFP_KERNEL); + if (!array) { + kfree(ctr); + return NULL; + } + + n = 0; + read_lock(&tasklist_lock); + do_each_thread(g, p) { + if (p->cpuset == cs) { + array[n++] = p->pid; + if (unlikely(n == max)) + goto array_full; + } + } + while_each_thread(g, p); +array_full: + read_unlock(&tasklist_lock); + + /* stupid bubble sort */ + for (i = 0; i < n - 1; i++) { + for (j = 0; j < n - 1 - i; j++) + if (array[j + 1] < array[j]) { + pid_t tmp = array[j]; + array[j] = array[j + 1]; + array[j + 1] = tmp; + } + } + + /* + * Collapse sorted array by grouping consecutive pids. + * Encode a range of pids by storing the second (last) pid negated. + * Read from array[i]; write to array[j]; j <= i always. + */ + last = array[0]; /* any value != array[0] - 1 */ + j = -1; + for (i = 0; i < n; i++) { + pid_t curr = array[i]; + /* consecutive pids ? 
*/ + if (curr - last == 1) { + /* move destination index if it has not been done */ + if (array[j] > 0) + j++; + array[j] = -curr; + } else + array[++j] = curr; + last = curr; + } + + ctr->array = array; + ctr->count = j + 1; + file->private_data = (void *)ctr; + return ctr; +} + +/* printf one pid from an array + * different formatting depending on whether it is positive or negative, + * or whether it is or not the first pid or the last + */ +static int array_pid_sprintf(char *buf, pid_t *array, int idx, int last) +{ + pid_t v = array[idx]; + int l = 0; + + if (v < 0) { /* second pid of a range of pids */ + v = -v; + buf[l++] = '-'; + } else { /* first pid of a range, or not a range */ + if (idx) /* comma only if it's not the first */ + buf[l++] = ','; + } + l += sprintf(buf + l, "%d", v); + /* newline after last record */ + if (idx == last) + l += sprintf(buf + l, "\n"); + return l; +} + +static ssize_t cpuset_tasks_read(struct file *file, char __user *buf, + size_t nbytes, loff_t *ppos) +{ + struct ctr_struct *ctr = (struct ctr_struct *)file->private_data; + int *array, nr_pids, i; + size_t len, lastlen = 0; + char *page; + + /* allocate buffer and fill it on first call to read() */ + if (!ctr) { + ctr = cpuset_tasks_mkctr(file); + if (!ctr) + return -ENOMEM; + } + + array = ctr->array; + nr_pids = ctr->count; + + if (!(page = (char *)__get_free_page(GFP_KERNEL))) + return -ENOMEM; + + i = *ppos; /* index of pid being printed */ + len = 0; /* length of data sprintf'ed in the page */ + + while ((len < PAGE_SIZE - 10) && (i < nr_pids) && (len < nbytes)) { + lastlen = array_pid_sprintf(page + len, array, i++, nr_pids - 1); + len += lastlen; + } + + /* if we wrote too much, remove last record */ + if (len > nbytes) { + len -= lastlen; + i--; + } + + *ppos = i; + + if (copy_to_user(buf, page, len)) + len = -EFAULT; + free_page((unsigned long)page); + return len; +} + +static int cpuset_tasks_release(struct inode *unused_inode, struct file *file) +{ + struct 
ctr_struct *ctr; + + /* we have nothing to do if no read-access is needed */ + if (!(file->f_mode & FMODE_READ)) + return 0; + + ctr = (struct ctr_struct *)file->private_data; + kfree(ctr->array); + kfree(ctr); + return 0; +} + +/* + * for the common functions, 'private' gives the type of file + */ + +static struct cftype cft_tasks = { + .name = "tasks", + .read = cpuset_tasks_read, + .release = cpuset_tasks_release, + .private = FILE_TASKLIST, +}; + +static struct cftype cft_cpus = { + .name = "cpus", + .private = FILE_CPULIST, +}; + +static struct cftype cft_mems = { + .name = "mems", + .private = FILE_MEMLIST, +}; + +static struct cftype cft_cpu_exclusive = { + .name = "cpu_exclusive", + .private = FILE_CPU_EXCLUSIVE, +}; + +static struct cftype cft_mem_exclusive = { + .name = "mem_exclusive", + .private = FILE_MEM_EXCLUSIVE, +}; + +static struct cftype cft_notify_on_release = { + .name = "notify_on_release", + .private = FILE_NOTIFY_ON_RELEASE, +}; + +/* MUST be called with ->d_inode->i_sem held */ +static int cpuset_populate_dir(struct dentry *cs_dentry) +{ + int err; + + if ((err = cpuset_add_file(cs_dentry, &cft_cpus)) < 0) + return err; + if ((err = cpuset_add_file(cs_dentry, &cft_mems)) < 0) + return err; + if ((err = cpuset_add_file(cs_dentry, &cft_cpu_exclusive)) < 0) + return err; + if ((err = cpuset_add_file(cs_dentry, &cft_mem_exclusive)) < 0) + return err; + if ((err = cpuset_add_file(cs_dentry, &cft_notify_on_release)) < 0) + return err; + if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0) + return err; + return 0; +} + +/* + * cpuset_create - create a cpuset + * parent: cpuset that will be parent of the new cpuset. + * name: name of the new cpuset. Will be strcpy'ed. 
+ * mode: mode to set on new inode + * + * Must be called with the semaphore on the parent inode held + */ + +static long cpuset_create(struct cpuset *parent, const char *name, int mode) +{ + struct cpuset *cs; + int err; + + cs = kmalloc(sizeof(*cs), GFP_KERNEL); + if (!cs) + return -ENOMEM; + + down(&cpuset_sem); + cs->flags = 0; + if (notify_on_release(parent)) + set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags); + cs->... [truncated message content] |
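The pid-array range encoding used by cpuset_tasks_mkctr() above — a run of consecutive pids collapses to the first pid followed by the negated last pid — can be sketched as a standalone userspace function. The function name and the leading n == 0 guard are ours, not the kernel's:

```c
/* Userspace sketch (assumed name, not the kernel's) of the in-place range
 * collapse from cpuset_tasks_mkctr(): a sorted pid array is rewritten so
 * that a run of consecutive pids is stored as the first pid followed by
 * the negated last pid, e.g. {100,101,102,105} -> {100,-102,105}.
 */
#include <assert.h>

/* Collapse a sorted array in place; returns the new element count. */
int collapse_pid_ranges(int *array, int n)
{
	int i, j = -1;
	int last;

	if (n == 0)			/* guard added for the sketch */
		return 0;

	last = array[0];		/* any value != array[0] - 1 */
	for (i = 0; i < n; i++) {
		int curr = array[i];
		if (curr - last == 1) {		/* extends the current run */
			if (array[j] > 0)	/* first extension: open a range */
				j++;
			array[j] = -curr;	/* negated value marks range end */
		} else {
			array[++j] = curr;	/* start of a new (possible) run */
		}
		last = curr;
	}
	return j + 1;
}
```

So {100, 101, 102, 105} collapses to the three elements {100, -102, 105}, which array_pid_sprintf() then renders as "100-102,105".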
From: Martin J. B. <mb...@ar...> - 2004-08-05 20:56:21
|
> Cpusets extend the usefulness of, the existing placement support that > was added to Linux 2.6 kernels: sched_setaffinity() for CPU placement, > and mbind and set_mempolicy for memory placement. On smaller or > dedicated use systems, the existing calls are often sufficient. > > On larger NUMA systems, running more than one, performance critical, > job, it is necessary to be able to manage jobs in their entirety. > This includes providing a job with exclusive CPU and memory that no > other job can use, and being able to list all tasks currently in a > cpuset. I'm not sure I understand the rationale behind this ... perhaps you could explain it further. We already have mechanisms to bind a process to particular CPUs or node's memory. To provide exclusivity seems valuable ... ie to stop the default allocations using node X's memory, or CPU Y, and potentially even to migrate existing users off that. But that'd seem to be a whole lot simpler than this patch ... what else are we gaining from CPU sets? The patch is massive, so hard to see exactly what you're doing ... is the point to add back virtualized memory and CPU numbering sets specific to each process or group of them, a la cpumemsets thing you were posting a year or two ago? M. |
From: Martin J. B. <mb...@ar...> - 2004-08-05 20:57:12
|
Can't we just do this up in userspace, with some manipulation tools if you really have that many CPUs? I'm not convinced it makes sense to make the kernel interface that complicated ... m. --On Thursday, August 05, 2004 03:08:47 -0700 Paul Jackson <pj...@sg...> wrote: > A bitmap print and parse format that provides lists of ranges of > numbers, to be first used for by cpusets (next patch). > > Cpusets provide a way to manage subsets of CPUs and Memory Nodes > for scheduling and memory placement, via a new virtual file system, > usually mounted at /dev/cpuset. Manipulation of cpusets can be done > directly via this file system, from the shell. > > However, manipulating 512 bit cpumasks or 256 bit nodemasks (which > will get bigger) via hex mask strings is painful for humans. > > The intention is to provide a format for the cpu and memory mask files > in /dev/cpusets that will stand the test of time. This format is > supported by a couple of new lib/bitmap.c routines, for printing and > parsing these strings. Wrappers for cpumask and nodemask are provided. > > See the embedded comments, below in the patch, for more details of > the format. The input format supports adding or removing specified > cpus or nodes, as well as entirely rewriting the mask. 
> > include/linux/bitmap.h | 8 ++ > include/linux/cpumask.h | 22 ++++++- > include/linux/nodemask.h | 22 ++++++- > lib/bitmap.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++ > 4 files changed, 189 insertions(+), 5 deletions(-) > > Signed-off-by: Paul Jackson <pj...@sg...> > > Index: 2.6.8-rc2-mm2/include/linux/bitmap.h > =================================================================== > --- 2.6.8-rc2-mm2.orig/include/linux/bitmap.h 2004-08-04 19:29:15.000000000 -0700 > +++ 2.6.8-rc2-mm2/include/linux/bitmap.h 2004-08-04 19:41:10.000000000 -0700 > @@ -41,7 +41,9 @@ > * bitmap_shift_right(dst, src, n, nbits) *dst = *src >> n > * bitmap_shift_left(dst, src, n, nbits) *dst = *src << n > * bitmap_scnprintf(buf, len, src, nbits) Print bitmap src to buf > - * bitmap_parse(ubuf, ulen, dst, nbits) Parse bitmap dst from buf > + * bitmap_parse(ubuf, ulen, dst, nbits) Parse bitmap dst from user buf > + * bitmap_scnlistprintf(buf, len, src, nbits) Print bitmap src as list to buf > + * bitmap_parselist(buf, dst, nbits) Parse bitmap dst from list > */ > > /* > @@ -98,6 +100,10 @@ extern int bitmap_scnprintf(char *buf, u > const unsigned long *src, int nbits); > extern int bitmap_parse(const char __user *ubuf, unsigned int ulen, > unsigned long *dst, int nbits); > +extern int bitmap_scnlistprintf(char *buf, unsigned int len, > + const unsigned long *src, int nbits); > +extern int bitmap_parselist(const char *buf, unsigned long *maskp, > + int nmaskbits); > extern int bitmap_find_free_region(unsigned long *bitmap, int bits, int order); > extern void bitmap_release_region(unsigned long *bitmap, int pos, int order); > extern int bitmap_allocate_region(unsigned long *bitmap, int pos, int order); > Index: 2.6.8-rc2-mm2/include/linux/cpumask.h > =================================================================== > --- 2.6.8-rc2-mm2.orig/include/linux/cpumask.h 2004-08-04 19:29:34.000000000 -0700 > +++ 2.6.8-rc2-mm2/include/linux/cpumask.h 2004-08-04 20:35:10.000000000 -0700 > 
@@ -10,6 +10,8 @@ > * > * For details of cpumask_scnprintf() and cpumask_parse(), > * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c. > + * For details of cpulist_scnprintf() and cpulist_parse(), see > + * bitmap_scnlistprintf() and bitmap_parselist(), also in bitmap.c. > * > * The available cpumask operations are: > * > @@ -46,6 +48,8 @@ > * > * int cpumask_scnprintf(buf, len, mask) Format cpumask for printing > * int cpumask_parse(ubuf, ulen, mask) Parse ascii string as cpumask > + * int cpulist_scnprintf(buf, len, mask) Format cpumask as list for printing > + * int cpulist_parse(buf, map) Parse ascii string as cpulist > * > * for_each_cpu_mask(cpu, mask) for-loop cpu over mask > * > @@ -268,14 +272,28 @@ static inline int __cpumask_scnprintf(ch > return bitmap_scnprintf(buf, len, srcp->bits, nbits); > } > > -#define cpumask_parse(ubuf, ulen, src) \ > - __cpumask_parse((ubuf), (ulen), &(src), NR_CPUS) > +#define cpumask_parse(ubuf, ulen, dst) \ > + __cpumask_parse((ubuf), (ulen), &(dst), NR_CPUS) > static inline int __cpumask_parse(const char __user *buf, int len, > cpumask_t *dstp, int nbits) > { > return bitmap_parse(buf, len, dstp->bits, nbits); > } > > +#define cpulist_scnprintf(buf, len, src) \ > + __cpulist_scnprintf((buf), (len), &(src), NR_CPUS) > +static inline int __cpulist_scnprintf(char *buf, int len, > + const cpumask_t *srcp, int nbits) > +{ > + return bitmap_scnlistprintf(buf, len, srcp->bits, nbits); > +} > + > +#define cpulist_parse(buf, dst) __cpulist_parse((buf), &(dst), NR_CPUS) > +static inline int __cpulist_parse(const char *buf, cpumask_t *dstp, int nbits) > +{ > + return bitmap_parselist(buf, dstp->bits, nbits); > +} > + > #if NR_CPUS > 1 > #define for_each_cpu_mask(cpu, mask) \ > for ((cpu) = first_cpu(mask); \ > Index: 2.6.8-rc2-mm2/include/linux/nodemask.h > =================================================================== > --- 2.6.8-rc2-mm2.orig/include/linux/nodemask.h 2004-08-04 19:29:29.000000000 -0700 > +++ 
2.6.8-rc2-mm2/include/linux/nodemask.h 2004-08-04 20:28:50.000000000 -0700 > @@ -10,6 +10,8 @@ > * > * For details of nodemask_scnprintf() and nodemask_parse(), > * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c. > + * For details of nodelist_scnprintf() and nodelist_parse(), see > + * bitmap_scnlistprintf() and bitmap_parselist(), also in bitmap.c. > * > * The available nodemask operations are: > * > @@ -46,6 +48,8 @@ > * > * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing > * int nodemask_parse(ubuf, ulen, mask) Parse ascii string as nodemask > + * int nodelist_scnprintf(buf, len, mask) Format nodemask as list for printing > + * int nodelist_parse(buf, map) Parse ascii string as nodelist > * > * for_each_node_mask(node, mask) for-loop node over mask > * > @@ -271,14 +275,28 @@ static inline int __nodemask_scnprintf(c > return bitmap_scnprintf(buf, len, srcp->bits, nbits); > } > > -#define nodemask_parse(ubuf, ulen, src) \ > - __nodemask_parse((ubuf), (ulen), &(src), MAX_NUMNODES) > +#define nodemask_parse(ubuf, ulen, dst) \ > + __nodemask_parse((ubuf), (ulen), &(dst), MAX_NUMNODES) > static inline int __nodemask_parse(const char __user *buf, int len, > nodemask_t *dstp, int nbits) > { > return bitmap_parse(buf, len, dstp->bits, nbits); > } > > +#define nodelist_scnprintf(buf, len, src) \ > + __nodelist_scnprintf((buf), (len), &(src), MAX_NUMNODES) > +static inline int __nodelist_scnprintf(char *buf, int len, > + const nodemask_t *srcp, int nbits) > +{ > + return bitmap_scnlistprintf(buf, len, srcp->bits, nbits); > +} > + > +#define nodelist_parse(buf, dst) __nodelist_parse((buf), &(dst), MAX_NUMNODES) > +static inline int __nodelist_parse(const char *buf, nodemask_t *dstp, int nbits) > +{ > + return bitmap_parselist(buf, dstp->bits, nbits); > +} > + > #if MAX_NUMNODES > 1 > #define for_each_node_mask(node, mask) \ > for ((node) = first_node(mask); \ > Index: 2.6.8-rc2-mm2/lib/bitmap.c > 
=================================================================== > --- 2.6.8-rc2-mm2.orig/lib/bitmap.c 2004-08-04 19:29:15.000000000 -0700 > +++ 2.6.8-rc2-mm2/lib/bitmap.c 2004-08-04 21:44:41.000000000 -0700 > @@ -291,6 +291,7 @@ EXPORT_SYMBOL(__bitmap_weight); > #define nbits_to_hold_value(val) fls(val) > #define roundup_power2(val,modulus) (((val) + (modulus) - 1) & ~((modulus) - 1)) > #define unhex(c) (isdigit(c) ? (c - '0') : (toupper(c) - 'A' + 10)) > +#define BASEDEC 10 /* fancier cpuset lists input in decimal */ > > /** > * bitmap_scnprintf - convert bitmap to an ASCII hex string. > @@ -409,6 +410,147 @@ int bitmap_parse(const char __user *ubuf > } > EXPORT_SYMBOL(bitmap_parse); > > +/* > + * bscnl_emit(buf, buflen, rbot, rtop, bp) > + * > + * Helper routine for bitmap_scnlistprintf(). Write decimal number > + * or range to buf, suppressing output past buf+buflen, with optional > + * comma-prefix. Return len of what would be written to buf, if it > + * all fit. > + */ > + > +int bscnl_emit(char *buf, int buflen, int rbot, int rtop, int len) > +{ > + if (len) > + len += scnprintf(buf + len, buflen - len, ","); > + if (rbot == rtop) > + len += scnprintf(buf + len, buflen - len, "%d", rbot); > + else > + len += scnprintf(buf + len, buflen - len, "%d-%d", rbot, rtop); > + return len; > +} > + > +/** > + * bitmap_scnlistprintf - convert bitmap to an ASCII hex string, list format > + * @buf: byte buffer into which string is placed > + * @buflen: reserved size of @buf, in bytes > + * @maskp: pointer to bitmap to convert > + * @nmaskbits: size of bitmap, in bits > + * > + * Output format is a comma-separated list of decimal numbers and > + * ranges. Consecutively set bits are shown as two hyphen-separated > + * decimal numbers, the smallest and largest bit numbers set in > + * the range. Output format is a compatible subset of the format > + * accepted as input by bitmap_parselist(). 
> + * > + * The return value is the number of characters which would be > + * generated for the given input, excluding the trailing '\0', as > + * per ISO C99. > + */ > + > +int bitmap_scnlistprintf(char *buf, unsigned int buflen, > + const unsigned long *maskp, int nmaskbits) > +{ > + int len = 0; > + /* current bit is 'cur', most recently seen range is [rbot, rtop] */ > + int cur, rbot, rtop; > + > + rbot = cur = find_first_bit(maskp, nmaskbits); > + while (cur < nmaskbits) { > + rtop = cur; > + cur = find_next_bit(maskp, nmaskbits, cur+1); > + if (cur >= nmaskbits || cur > rtop + 1) { > + len = bscnl_emit(buf, buflen, rbot, rtop, len); > + rbot = cur; > + } > + } > + return len; > +} > +EXPORT_SYMBOL(bitmap_scnlistprintf); > + > +/** > + * bitmap_parselist - parses a more flexible format for inputting bit masks > + * @buf: read nul-terminated user string from this buffer > + * @mask: write resulting mask here > + * @nmaskbits: number of bits in mask to be written > + * > + * The input format supports a space separated list of one or more comma > + * separated sequences of ascii decimal bit numbers and ranges. Each > + * sequence may be preceded by one of the prefix characters '=', > + * '-', '+', or '!', which have the following meanings: > + * '=': rewrite the mask to have only the bits specified in this sequence > + * '-': turn off the bits specified in this sequence > + * '+': turn on the bits specified in this sequence > + * '!': same as '-'. > + * > + * If no such initial character is specified, then the default prefix '=' > + * is presumed. The list is evaluated and applied in left to right order. 
> + * > + * Eamples of input format: > + * 0-4,9 # rewrites to 0,1,2,3,4,9 > + * -9 # removes 9 > + * +6-8 # adds 6,7,8 > + * 1-6 -0,2-4 +11-14,16-19 -14-16 # same as 1,5,6,11-13,17-19 > + * 1-6 -0,2-4 +11-14,16-19 =14-16 # same as just 14,15,16 > + * > + * Possible errno's returned for invalid input strings are: > + * -EINVAL: second number in range smaller than first > + * -ERANGE: bit number specified too large for mask > + * -EINVAL: invalid prefix char (not '=', '-', '+', or '!') > + */ > + > +int bitmap_parselist(const char *buf, unsigned long *maskp, int nmaskbits) > +{ > + char *p, *q; > + int masklen = BITS_TO_LONGS(nmaskbits); > + > + while ((p = strsep((char **)(&buf), " ")) != NULL) { /* blows const XXX */ > + char op = isdigit(*p) ? '=' : *p++; > + unsigned long m[masklen]; > + int maskbytes = sizeof(m); > + int i; > + > + if (op == ' ') > + continue; > + memset(m, 0, maskbytes); > + > + while ((q = strsep(&p, ",")) != NULL) { > + unsigned a = simple_strtoul(q, 0, BASEDEC); > + unsigned b = a; > + char *cp = strchr(q, '-'); > + if (cp) > + b = simple_strtoul(cp + 1, 0, BASEDEC); > + if (!(a <= b)) > + return -EINVAL; > + if (b >= nmaskbits) > + return -ERANGE; > + while (a <= b) { > + set_bit(a, m); > + a++; > + } > + } > + > + switch (op) { > + case '=': > + memcpy(maskp, m, maskbytes); > + break; > + case '!': > + case '-': > + for (i = 0; i < masklen; i++) > + maskp[i] &= ~m[i]; > + break; > + case '+': > + for (i = 0; i < masklen; i++) > + maskp[i] |= m[i]; > + break; > + default: > + return -EINVAL; > + } > + } > + return 0; > +} > +EXPORT_SYMBOL(bitmap_parselist); > + > /** > * bitmap_find_free_region - find a contiguous aligned mem region > * @bitmap: an array of unsigned longs corresponding to the bitmap > > -- > I won't rest till it's the best ... > Programmer, Linux Scalability > Paul Jackson <pj...@sg...> 1.650.933.1373 > > > ------------------------------------------------------- > This SF.Net email is sponsored by OSTG. 
Have you noticed the changes on > Linux.com, ITManagersJournal and NewsForge in the past few weeks? Now, > one more big change to announce. We are now OSTG- Open Source Technology > Group. Come see the changes on the new OSTG site. www.ostg.com > _______________________________________________ > Lse-tech mailing list > Lse...@li... > https://lists.sourceforge.net/lists/listinfo/lse-tech > > |
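The prefixed grammar of this first version of bitmap_parselist() can be sketched in plain userspace C. The 32-bit mask restriction and the function name are our assumptions for brevity, and -1 stands in for the kernel's distinct -EINVAL/-ERANGE codes:

```c
/* Userspace sketch (assumed name) of the v1 prefixed list grammar:
 * space-separated sequences, each optionally prefixed with '=', '+',
 * '-' or '!', applied left to right to a 32-bit mask.
 */
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

int parse_prefixed_list(const char *s, unsigned *maskp)
{
	unsigned mask = *maskp;

	while (*s) {
		while (*s == ' ')
			s++;
		if (!*s)
			break;
		/* a bare digit implies the default '=' (rewrite) prefix */
		char op = isdigit((unsigned char)*s) ? '=' : *s++;
		if (op != '=' && op != '+' && op != '-' && op != '!')
			return -1;		/* -EINVAL in the kernel */

		unsigned m = 0;			/* mask built from this sequence */
		for (;;) {
			char *end;
			unsigned long a = strtoul(s, &end, 10);
			unsigned long b = a;
			if (*end == '-')
				b = strtoul(end + 1, &end, 10);
			if (a > b || b >= 32)
				return -1;	/* -EINVAL / -ERANGE */
			while (a <= b)
				m |= 1u << a++;
			s = end;
			if (*s != ',')
				break;
			s++;
		}

		switch (op) {
		case '=': mask = m; break;	/* rewrite */
		case '+': mask |= m; break;	/* turn on */
		case '-':
		case '!': mask &= ~m; break;	/* turn off */
		}
	}
	*maskp = mask;
	return 0;
}
```

Running the patch comment's own example "1-6 -0,2-4 +11-14,16-19 -14-16" against an empty mask yields bits 1,5,6,11-13,17-19, matching the documented result.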
From: Paul J. <pj...@sg...> - 2004-08-10 01:36:48
|
I was looking at this bitmap list format patch over the weekend, and came to the conclusion that the basic list format, as in the example 0,3,5,8-15, was a valuable improvement over a fixed-length hex mask, but that on the other hand the support for the prefix characters '=', '-', '+', and '!' was fluff that few would learn to use, and fewer would find essential. So I redid the bitmap list format patch, removing the prefix character support, and making another pass at compacting the input 'write' side code in bitmap_parselist(). The kernel text costs in bytes for these two patches, on an i386 build, are now: bitmap lists: 592 cpusets: 7718 ---- total: 8310 Here's the new bitmap list patch. It applies to 2.6.8-rc2-mm2. It replaces the earlier bitmap list patch, which began this thread on August 5, 2004. ======== A bitmap print and parse format that provides lists of ranges of numbers, to be first used by cpusets (next patch). Cpusets provide a way to manage subsets of CPUs and Memory Nodes for scheduling and memory placement, via a new virtual file system, usually mounted at /dev/cpuset. Manipulation of cpusets can be done directly via this file system, from the shell. However, manipulating 512 bit cpumasks or 256 bit nodemasks (which will get bigger) via hex mask strings is painful for humans. The intention is to provide a format for the cpu and memory mask files in /dev/cpusets that will stand the test of time. This format is supported by a couple of new lib/bitmap.c routines, for printing and parsing these strings. Wrappers for cpumask and nodemask are provided. 
include/linux/bitmap.h | 8 +++ include/linux/cpumask.h | 22 +++++++++- include/linux/nodemask.h | 22 +++++++++- lib/bitmap.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 150 insertions(+), 5 deletions(-) Signed-off-by: Paul Jackson <pj...@sg...> Index: 2.6.8-rc2-mm2/include/linux/bitmap.h =================================================================== --- 2.6.8-rc2-mm2.orig/include/linux/bitmap.h 2004-08-08 23:17:35.000000000 -0700 +++ 2.6.8-rc2-mm2/include/linux/bitmap.h 2004-08-08 23:24:57.000000000 -0700 @@ -41,7 +41,9 @@ * bitmap_shift_right(dst, src, n, nbits) *dst = *src >> n * bitmap_shift_left(dst, src, n, nbits) *dst = *src << n * bitmap_scnprintf(buf, len, src, nbits) Print bitmap src to buf - * bitmap_parse(ubuf, ulen, dst, nbits) Parse bitmap dst from buf + * bitmap_parse(ubuf, ulen, dst, nbits) Parse bitmap dst from user buf + * bitmap_scnlistprintf(buf, len, src, nbits) Print bitmap src as list to buf + * bitmap_parselist(buf, dst, nbits) Parse bitmap dst from list */ /* @@ -98,6 +100,10 @@ extern int bitmap_scnprintf(char *buf, u const unsigned long *src, int nbits); extern int bitmap_parse(const char __user *ubuf, unsigned int ulen, unsigned long *dst, int nbits); +extern int bitmap_scnlistprintf(char *buf, unsigned int len, + const unsigned long *src, int nbits); +extern int bitmap_parselist(const char *buf, unsigned long *maskp, + int nmaskbits); extern int bitmap_find_free_region(unsigned long *bitmap, int bits, int order); extern void bitmap_release_region(unsigned long *bitmap, int pos, int order); extern int bitmap_allocate_region(unsigned long *bitmap, int pos, int order); Index: 2.6.8-rc2-mm2/include/linux/cpumask.h =================================================================== --- 2.6.8-rc2-mm2.orig/include/linux/cpumask.h 2004-08-08 23:17:35.000000000 -0700 +++ 2.6.8-rc2-mm2/include/linux/cpumask.h 2004-08-08 23:24:57.000000000 -0700 @@ -10,6 +10,8 @@ * * For details of cpumask_scnprintf() and 
cpumask_parse(), * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c. + * For details of cpulist_scnprintf() and cpulist_parse(), see + * bitmap_scnlistprintf() and bitmap_parselist(), also in bitmap.c. * * The available cpumask operations are: * @@ -46,6 +48,8 @@ * * int cpumask_scnprintf(buf, len, mask) Format cpumask for printing * int cpumask_parse(ubuf, ulen, mask) Parse ascii string as cpumask + * int cpulist_scnprintf(buf, len, mask) Format cpumask as list for printing + * int cpulist_parse(buf, map) Parse ascii string as cpulist * * for_each_cpu_mask(cpu, mask) for-loop cpu over mask * @@ -268,14 +272,28 @@ static inline int __cpumask_scnprintf(ch return bitmap_scnprintf(buf, len, srcp->bits, nbits); } -#define cpumask_parse(ubuf, ulen, src) \ - __cpumask_parse((ubuf), (ulen), &(src), NR_CPUS) +#define cpumask_parse(ubuf, ulen, dst) \ + __cpumask_parse((ubuf), (ulen), &(dst), NR_CPUS) static inline int __cpumask_parse(const char __user *buf, int len, cpumask_t *dstp, int nbits) { return bitmap_parse(buf, len, dstp->bits, nbits); } +#define cpulist_scnprintf(buf, len, src) \ + __cpulist_scnprintf((buf), (len), &(src), NR_CPUS) +static inline int __cpulist_scnprintf(char *buf, int len, + const cpumask_t *srcp, int nbits) +{ + return bitmap_scnlistprintf(buf, len, srcp->bits, nbits); +} + +#define cpulist_parse(buf, dst) __cpulist_parse((buf), &(dst), NR_CPUS) +static inline int __cpulist_parse(const char *buf, cpumask_t *dstp, int nbits) +{ + return bitmap_parselist(buf, dstp->bits, nbits); +} + #if NR_CPUS > 1 #define for_each_cpu_mask(cpu, mask) \ for ((cpu) = first_cpu(mask); \ Index: 2.6.8-rc2-mm2/include/linux/nodemask.h =================================================================== --- 2.6.8-rc2-mm2.orig/include/linux/nodemask.h 2004-08-08 23:17:35.000000000 -0700 +++ 2.6.8-rc2-mm2/include/linux/nodemask.h 2004-08-08 23:24:57.000000000 -0700 @@ -10,6 +10,8 @@ * * For details of nodemask_scnprintf() and nodemask_parse(), * see 
bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c. + * For details of nodelist_scnprintf() and nodelist_parse(), see + * bitmap_scnlistprintf() and bitmap_parselist(), also in bitmap.c. * * The available nodemask operations are: * @@ -46,6 +48,8 @@ * * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing * int nodemask_parse(ubuf, ulen, mask) Parse ascii string as nodemask + * int nodelist_scnprintf(buf, len, mask) Format nodemask as list for printing + * int nodelist_parse(buf, map) Parse ascii string as nodelist * * for_each_node_mask(node, mask) for-loop node over mask * @@ -271,14 +275,28 @@ static inline int __nodemask_scnprintf(c return bitmap_scnprintf(buf, len, srcp->bits, nbits); } -#define nodemask_parse(ubuf, ulen, src) \ - __nodemask_parse((ubuf), (ulen), &(src), MAX_NUMNODES) +#define nodemask_parse(ubuf, ulen, dst) \ + __nodemask_parse((ubuf), (ulen), &(dst), MAX_NUMNODES) static inline int __nodemask_parse(const char __user *buf, int len, nodemask_t *dstp, int nbits) { return bitmap_parse(buf, len, dstp->bits, nbits); } +#define nodelist_scnprintf(buf, len, src) \ + __nodelist_scnprintf((buf), (len), &(src), MAX_NUMNODES) +static inline int __nodelist_scnprintf(char *buf, int len, + const nodemask_t *srcp, int nbits) +{ + return bitmap_scnlistprintf(buf, len, srcp->bits, nbits); +} + +#define nodelist_parse(buf, dst) __nodelist_parse((buf), &(dst), MAX_NUMNODES) +static inline int __nodelist_parse(const char *buf, nodemask_t *dstp, int nbits) +{ + return bitmap_parselist(buf, dstp->bits, nbits); +} + #if MAX_NUMNODES > 1 #define for_each_node_mask(node, mask) \ for ((node) = first_node(mask); \ Index: 2.6.8-rc2-mm2/lib/bitmap.c =================================================================== --- 2.6.8-rc2-mm2.orig/lib/bitmap.c 2004-08-08 23:17:35.000000000 -0700 +++ 2.6.8-rc2-mm2/lib/bitmap.c 2004-08-09 00:11:57.000000000 -0700 @@ -291,6 +291,7 @@ EXPORT_SYMBOL(__bitmap_weight); #define nbits_to_hold_value(val) fls(val) #define 
roundup_power2(val,modulus) (((val) + (modulus) - 1) & ~((modulus) - 1)) #define unhex(c) (isdigit(c) ? (c - '0') : (toupper(c) - 'A' + 10)) +#define BASEDEC 10 /* fancier cpuset lists input in decimal */ /** * bitmap_scnprintf - convert bitmap to an ASCII hex string. @@ -409,6 +410,108 @@ int bitmap_parse(const char __user *ubuf } EXPORT_SYMBOL(bitmap_parse); +/* + * bscnl_emit(buf, buflen, rbot, rtop, bp) + * + * Helper routine for bitmap_scnlistprintf(). Write decimal number + * or range to buf, suppressing output past buf+buflen, with optional + * comma-prefix. Return len of what would be written to buf, if it + * all fit. + */ +static inline int bscnl_emit(char *buf, int buflen, int rbot, int rtop, int len) +{ + if (len > 0) + len += scnprintf(buf + len, buflen - len, ","); + if (rbot == rtop) + len += scnprintf(buf + len, buflen - len, "%d", rbot); + else + len += scnprintf(buf + len, buflen - len, "%d-%d", rbot, rtop); + return len; +} + +/** + * bitmap_scnlistprintf - convert bitmap to list format ASCII string + * @buf: byte buffer into which string is placed + * @buflen: reserved size of @buf, in bytes + * @maskp: pointer to bitmap to convert + * @nmaskbits: size of bitmap, in bits + * + * Output format is a comma-separated list of decimal numbers and + * ranges. Consecutively set bits are shown as two hyphen-separated + * decimal numbers, the smallest and largest bit numbers set in + * the range. Output format is compatible with the format + * accepted as input by bitmap_parselist(). + * + * The return value is the number of characters which would be + * generated for the given input, excluding the trailing '\0', as + * per ISO C99. 
+ */ +int bitmap_scnlistprintf(char *buf, unsigned int buflen, + const unsigned long *maskp, int nmaskbits) +{ + int len = 0; + /* current bit is 'cur', most recently seen range is [rbot, rtop] */ + int cur, rbot, rtop; + + rbot = cur = find_first_bit(maskp, nmaskbits); + while (cur < nmaskbits) { + rtop = cur; + cur = find_next_bit(maskp, nmaskbits, cur+1); + if (cur >= nmaskbits || cur > rtop + 1) { + len = bscnl_emit(buf, buflen, rbot, rtop, len); + rbot = cur; + } + } + return len; +} +EXPORT_SYMBOL(bitmap_scnlistprintf); + +/** + * bitmap_parselist - convert list format ASCII string to bitmap + * @buf: read nul-terminated user string from this buffer + * @mask: write resulting mask here + * @nmaskbits: number of bits in mask to be written + * + * Input format is a comma-separated list of decimal numbers and + * ranges. Consecutively set bits are shown as two hyphen-separated + * decimal numbers, the smallest and largest bit numbers set in + * the range. + * + * Returns 0 on success, -errno on invalid input strings: + * -EINVAL: second number in range smaller than first + * -EINVAL: invalid character in string + * -ERANGE: bit number specified too large for mask + */ +int bitmap_parselist(const char *bp, unsigned long *maskp, int nmaskbits) +{ + unsigned a, b; + + bitmap_zero(maskp, nmaskbits); + do { + if (!isdigit(*bp)) + return -EINVAL; + b = a = simple_strtoul(bp, (char **)&bp, BASEDEC); + if (*bp == '-') { + bp++; + if (!isdigit(*bp)) + return -EINVAL; + b = simple_strtoul(bp, (char **)&bp, BASEDEC); + } + if (!(a <= b)) + return -EINVAL; + if (b >= nmaskbits) + return -ERANGE; + while (a <= b) { + set_bit(a, maskp); + a++; + } + if (*bp == ',') + bp++; + } while (*bp != '\0' && *bp != '\n'); + return 0; +} +EXPORT_SYMBOL(bitmap_parselist); + /** * bitmap_find_free_region - find a contiguous aligned mem region * @bitmap: an array of unsigned longs corresponding to the bitmap -- I won't rest till it's the best ... 
Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
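The simplified v2 pair — bitmap_parselist() without prefixes, plus bitmap_scnlistprintf() — boils down to the following userspace sketch. The 32-bit mask restriction and the function names are our assumptions; -1 stands in for both -EINVAL and -ERANGE:

```c
/* Userspace sketch (assumed names) of the final, prefix-free list parser
 * and its matching printer, restricted to a 32-bit mask for brevity.
 */
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse "0-4,9"-style lists into a bit mask; 0 on success, -1 on error. */
int list_parse(const char *bp, unsigned *maskp)
{
	unsigned mask = 0;

	do {
		char *end;
		if (!isdigit((unsigned char)*bp))
			return -1;			/* -EINVAL */
		unsigned long a = strtoul(bp, &end, 10), b = a;
		if (*end == '-') {
			if (!isdigit((unsigned char)end[1]))
				return -1;		/* -EINVAL */
			b = strtoul(end + 1, &end, 10);
		}
		if (a > b)
			return -1;			/* -EINVAL */
		if (b >= 32)
			return -1;			/* -ERANGE */
		while (a <= b)
			mask |= 1u << a++;
		bp = end;
		if (*bp == ',')
			bp++;
	} while (*bp != '\0' && *bp != '\n');
	*maskp = mask;
	return 0;
}

/* Print a mask in list format; assumes buf is large enough. */
int list_print(char *buf, int buflen, unsigned mask)
{
	int len = 0, cur = 0;

	while (cur < 32) {
		if (!(mask & (1u << cur))) {
			cur++;
			continue;
		}
		int rbot = cur;			/* extend run [rbot, cur] */
		while (cur + 1 < 32 && (mask & (1u << (cur + 1))))
			cur++;
		if (len)
			len += snprintf(buf + len, buflen - len, ",");
		if (rbot == cur)
			len += snprintf(buf + len, buflen - len, "%d", rbot);
		else
			len += snprintf(buf + len, buflen - len, "%d-%d", rbot, cur);
		cur++;
	}
	return len;
}
```

The two functions round-trip: parsing "0-4,9" sets bits 0-4 and 9, and printing that mask produces "0-4,9" again, which is the compatibility the kernel doc comments promise.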
From: Paul J. <pj...@sg...> - 2004-08-05 21:49:02
|
Martin asks: > Can't we just do this up in userspace, ... Aha - I was expecting this question. Howdy. We could, I suppose (do this fancy bitmap formatting in userland). It's certainly been a pleasure for Simon Derr, Sylvain Jeaugey, and me, over the last six months, to be able to easily manipulate these big masks using classic Unix commands like cat and echo. The ability to atomically update a mask is unique to this interface. The existing bitmap_parse/bitmap_scnprintf interface only allows for a complete rewrite, not atomically adding or removing a node. However, I am not aware of a reason why we need the atomic update. Simon ... could you comment on this, and perhaps better motivate this new bitmap list format? -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Paul J. <pj...@sg...> - 2004-08-06 02:07:21
|
Martin wrote:
> I'm not sure I understand the rationale behind this ...

Thank-you for your question, Martin. Unlike the first patch in this set
(the fancier bitmap format), this cpuset patch is important to us, as
you likely suspected. I hope I can do it justice.

> is the point to add back virtualized memory and
> CPU numbering sets specific to each process or group of them,
> a la cpumemsets thing you were posting a year or two ago?

To answer the easy question first, no. No virtual numbering anymore. We
might do some virtualizing in user library code, but so far as this
patch and the kernel are aware, cpu number 17 is cpu number 17, all the
time, using the same cpu and node numberings as used in the other
kernel API's (setaffinity, mbind and set_mempolicy) and in the kernel
cpumasks and nodemasks.

The bulk of this patch comes from providing named, nested placement
(cpu and memory) regions -- cpusets, with a file system style namespace
and permission model.

The need for supporting such a model comes in managing large systems,
when they are divided and subdivided into subsets of cpu and memory
resources, dedicated to departments, groups, jobs, threads. Especially
on NUMA systems, maintaining processor-memory locality is important.

This locality must be maintained in a hierarchical fashion. The VP of
Information Systems is not going to be personally placing the 8
parallel threads of the weather simulator run by someone in one of his
departments. He can agree that that department gets exclusive use of
half of the machine over the weekends, because that's what they
budgeted for. Then it gets pushed down.

Imagine, say, a large system that is shared by several departments,
with shifting resources between them. At any point in time, each
department has certain dedicated resources (cpu and memory) allocated
to them. Within a department, they may be running multiple large
applications - a database server, a web server, a big simulation, other
large HPC apps. Some of these may be very performance critical, and
require their own dedicated resources. In many cases, the customer will
be running some form of batch manager software to help administer the
work load.

The result is a hierarchy of these regions, which require, I claim, a
kernel supported hierarchical name space, with permissions, to which
tasks are attached, and which own subsets of the system's cpu and
memory.

On most _any_ numa system running a mixed and shifting load, this
ability to manage the system's use, to control placement and minimize
interaction, is essential to stable, repeatable performance. On smaller
or dedicated use systems, the existing calls are entirely sufficient.
On larger, nested use systems, the critical numa resources of processor
and memory need to be managed in a nested fashion.

The existing cpu and memory placement facilities, added in 2.6,
sched_setaffinity (for cpus) and mbind/set_mempolicy (for memory) are
just the right thing for an individual task to manage in detail its
placement across the resources available to it (the online cpus and
nodes if CONFIG_CPUSET is disabled, or within the cpuset if cpusets are
enabled).

But they do not provide the named hierarchy with kernel enforced
permissions and resource support required to sanely manage the largest
multi-use systems.

Three additional ways to approach this patch:

 1) The first file in the patch, Documentation/cpusets.txt, describes
    this facility, and its purpose.

 2) Look at the hooks in the rest of the kernel. I have spent much time
    minimizing these hooks, so that they are few in number, placed as
    best I could in low maintenance code in the kernel, and vanish if
    CONFIG_CPUSETS is disabled. But in addition to evaluating the risk
    and impact of the hooks, you can get a further sense of how cpusets
    works from these hooks. These hooks are listed in
    Documentation/cpusets.txt.

 3) Look at the include/linux/cpusets.h header file. It shows the tiny
    interface with the rest of the kernel, which pretty much evaporates
    if CONFIG_CPUSET is disabled.

By way of analogy, when I had an 8 inch floppy disk drive, I didn't
need much of a file system. Initially, I didn't even need
subdirectories, just a list of files. But as storage grew, and became a
shared resource on corporate systems, a hierarchical file system, with
names, permissions and sufficient hooks for managing the storage
resource, became essential.

Now, as big iron is growing from tens, to hundreds, soon to thousands,
of processors and memory nodes, their users require a manageable
resource hierarchy of the essential compute resources.

I understand that it is the proper job of a kernel to present the
essential resources of a system to user code, in a sensibly named and
controlled fashion, without imposing policy. For nested resources, a
file system is a good fit, both for the names, and the associated
permission model. It took more code (ifdef'd CONFIG_CPUSET, almost
entirely in the kernel/cpuset.c file), doing it this way. But it is the
natural model to use, when it fits, as in this case.

Certainly, for the class of customer that SGI has on its big Irix
systems, we have already seen that this sort of facility is essential
for certain customer sites. I hesitate to say "Irix" here, because the
Irix kernel code is another world, not directly useful in Linux.

Fortunately, Simon and Sylvain of Bull (France) determined, sometime
last year, that they had the same large system cpu/memory management
needs, and Simon wrote this initial kernel code, entirely untainted
with Irix experience so far as I know. Their focus is apparently more
commercial systems, whereas SGI's focus is more HPC apps. But the
facilities needed are the same.

My primary contribution has been in removing code, and doing what I
could to learn how to best adapt it to Linux, in a way that meets our
needs, with the most minimal of impact on others (~zero runtime if not
configured, very low maintenance load on the kernel source). As Simon
and Sylvain can attest, I have thrown away a lot of the code and
features they wanted, in order to reduce the kernel footprint. The cpu
and node renumbering you remembered was one of the things I threw out.
And I have rewritten much more code, as I have learned the coding style
that is most comfortable within the Linux kernel. The long term health
and maintainability of the Linux kernel is important to myself and my
employer.

If there is further explanation I can provide, or if there is design or
code change you see that is important to including cpusets in the
kernel, I welcome your input. Or nits and details, whatever. For me,
SGI, and Bull, this one is a biggie. I anticipate for others as well,
as more companies venture into big iron Linux.

--
I won't rest till it's the best ... Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Martin J. B. <mb...@ar...> - 2004-08-06 03:24:52
|
>> is the point to add back virtualized memory and
>> CPU numbering sets specific to each process or group of them,
>> a la cpumemsets thing you were posting a year or two ago?
>
> To answer the easy question first, no. No virtual numbering anymore. We
> might do some virtualizing in user library code, but so far as this
> patch and the kernel are aware, cpu number 17 is cpu number 17, all the
> time, using the same cpu and node numberings as used in the other kernel
> API's (setaffinity, mbind and set_mempolicy) and in the kernel cpumasks
> and nodemasks.

OK, good ;-) I don't think the kernel should have to deal with that
stuff. Sorry, it's just a little difficult to dive into a large patch
without a higher level idea what it's trying to do (which after your
last email, I think I have a much better grasp on).

...

> The existing cpu and memory placement facilities, added in 2.6,
> sched_setaffinity (for cpus) and mbind/set_mempolicy (for memory) are
> just the right thing for an individual task to manage in detail its
> placement across the resources available to it (the online cpus and
> nodes if CONFIG_CPUSET is disabled, or within the cpuset if cpusets
> are enabled).

I agree that the current mechanisms are not wholly sufficient - the
most obvious failing being that whilst you can bind a process to a
resource, there's very little support for making a resource exclusively
available to a process or set thereof.

> But they do not provide the named hierarchy with kernel enforced
> permissions and resource support required to sanely manage the
> largest multi-use systems.

Right ... but I'm kind of shocked by the size of the patch to fix what
seems like a fairly simple problem. The other thing that seems to glare
at me is the overlap between what you have here and PAGG/CKRM. Does
either cpusets depend on PAGG/CKRM or vice versa? They seem to have
similar goals, and it'd be strange to have two independent mechanisms.

> Three additional ways to approach this patch:
>
> 1) The first file in the patch, Documentation/cpusets.txt,
>    describes this facility, and its purpose.
>
> 2) Look at the hooks in the rest of the kernel. I have spent much
>    time minimizing these hooks, so that they are few in number,
>    placed as best I could in low maintenance code in the kernel,
>    and vanish if CONFIG_CPUSETS is disabled. But in addition
>    to evaluating the risk and impact of the hooks, you can get
>    a further sense of how cpusets works from these hooks.
>    These hooks are listed in Documentation/cpusets.txt.
>
> 3) Look at the include/linux/cpusets.h header file. It shows
>    the tiny interface with the rest of the kernel, which
>    pretty much evaporates if CONFIG_CPUSET is disabled.

Thanks ... that'll help me. I'll try to look through it in some more
detail.

M.
|
From: Paul J. <pj...@sg...> - 2004-08-06 12:24:09
|
Martin wrote:
> Sorry, it's just a little difficult to dive into a large
> patch without a higher level idea ...

No need to apologize. I welcome your review. I was well aware that the
next step for this patch was the "why would I want this ..."
explanation. Let me know if there is more I can explain.

> I don't think the kernel should have to deal with that
> [cpu and node number virtualization] stuff.

I agree, now.

> The other thing that seems to glare at me is the overlap
> between what you have here and PAGG/CKRM. Does
> either cpusets depend on PAGG/CKRM or vice versa?

None of these three, cpusets, PAGG or CKRM, depends on the others, with
the possible exception that perhaps CKRM could make use of PAGG
(whether it does or not, or whether it should, I don't know - ask
them).

Cpusets control _where_ a process can run and allocate. The central
construct of cpusets is essentially a "soft partition" -- a set of CPUs
and Memory Nodes. These can be arranged in a hierarchy, with names,
permissions, a couple of control bits. Tasks can be moved between
cpusets, as allowed by the permission model. You can attach a task to a
different cpuset if (1) you can access that cpuset (search permission
to some directory beneath /dev/cpuset), (2) write that cpuset's "tasks"
file, and (3) have essentially kill rights on the task being placed.

Just as your basic file system provides a hierarchical model for
organizing your data files (places to put data), similarly cpusets
provides a hierarchical model for organizing the nodes on your big numa
box (places to run tasks).

CKRM tracks _how_ much of various interesting resources tasks are
using, both measuring and limiting such usage. It provides a way to
manage some of the shared system resources, such as CPU time, memory
pages, I/O and incoming network bandwidth based on user defined groups
of tasks called classes (quoting from http://ckrm.sourceforge.net/ ;).

Unlike cpusets and most of the rest of the kernel, CKRM doesn't just
manage individual tasks, one task at a time, but manages based on a
dynamically determined resource class it assigns to various kernel
objects in addition to tasks.

Cpusets provides a rich model of just the CPU and Memory resources, but
only manages tasks, using the traditional simple task pointer to a
shared reference counted object. CKRM provides a rich structure for
classifying a variety of kernel objects, not just tasks, and managing
their use, but it doesn't have a particularly fancy model of any one of
these resources (so far as I know anyway ...).

PAGG is just a mechanism that is useful for job accounting and resource
management. It's just the hooks - an inescapable job container and
hooks for loadable modules to be called on key events in the life of
tasks in that container, such as fork and exit. PAGG provides a useful
mechanism for certain kinds of resource management and system
accounting modules, but is itself not a resource manager.

Hopefully, I haven't misrepresented CKRM and PAGG too much.

--
I won't rest till it's the best ... Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Erich F. <ef...@hp...> - 2004-08-06 15:32:21
|
> > The existing cpu and memory placement facilities, added in 2.6,
> > sched_setaffinity (for cpus) and mbind/set_mempolicy (for memory) are
> > just the right thing for an individual task to manage in detail its
> > placement across the resources available to it (the online cpus and
> > nodes if CONFIG_CPUSET is disabled, or within the cpuset if cpusets
> > are enabled).
>
> I agree that the current mechanisms are not wholly sufficient - the
> most obvious failing being that whilst you can bind a process to a
> resource, there's very little support for making a resource exclusively
> available to a process or set thereof.

For the record, we (NEC) are also a potential user of this patch on the
TX-7 NUMA machines. For our 2.4 kernels we are currently using
something with similar functionality but only two hierarchy levels. I
would very much welcome the inclusion of cpusets.

The patch got much leaner compared to the early days, a big part of it
consists of documentation (good!) and the user interface (also very
nice, although it duplicates some code). The rest is just needed.
Besides: it's encapsulated enough and doesn't hurt others. (BTW: I
could imagine using this on quad-opterons, too...)

> Right ... but I'm kind of shocked by the size of the patch to fix what
> seems like a fairly simple problem. The other thing that seems to glare
> at me is the overlap between what you have here and PAGG/CKRM. Does
> either cpusets depend on PAGG/CKRM or vice versa? They seem to have
> similar goals, and it'd be strange to have two independent mechanisms.

There's no relation to PAGG but I think cpusets and CKRM should be made
to come together. One of CKRM's user interfaces is a filesystem with
the file-tree representing the class hierarchy. It's the same for
cpusets.

I'd vote for cpusets going in soon. CKRM could be extended by a cpusets
controller which should be pretty trivial when using the infrastructure
of this patch. It simply needs to create classes (cpusets) and attach
processes to them. The enforcement of resources happens automatically.
When CKRM is mature enough to enter the kernel, one could drop
/dev/cpusets in favor of the CKRM way of doing it.

Regards,
Erich
|
From: Martin J. B. <mb...@ar...> - 2004-08-06 15:35:54
|
> There's no relation to PAGG but I think cpusets and CKRM should be
> made to come together. One of CKRM's user interfaces is a filesystem
> with the file-tree representing the class hierarchy. It's the same for
> cpusets.

OK, that makes sense ...

> I'd vote for cpusets going in soon. CKRM could be extended by
> a cpusets controller which should be pretty trivial when using the
> infrastructure of this patch. It simply needs to create classes
> (cpusets) and attach processes to them. The enforcement of resources
> happens automatically. When CKRM is mature to enter the kernel, one
> could drop /dev/cpusets in favor of the CKRM way of doing it.

But I think that's dangerous. It's very hard to get rid of existing
user interfaces ... I'd much rather we sorted out what we're doing
BEFORE putting either in the kernel.

M.
|
From: Hubertus F. <fr...@wa...> - 2004-08-06 15:50:25
|
Martin J. Bligh wrote:
>> There's no relation to PAGG but I think cpusets and CKRM should be
>> made to come together. One of CKRM's user interfaces is a filesystem
>> with the file-tree representing the class hierarchy. It's the same for
>> cpusets.
>
> OK, that makes sense ...
>
>> I'd vote for cpusets going in soon. CKRM could be extended by
>> a cpusets controller which should be pretty trivial when using the
>> infrastructure of this patch. It simply needs to create classes
>> (cpusets) and attach processes to them. The enforcement of resources
>> happens automatically. When CKRM is mature to enter the kernel, one
>> could drop /dev/cpusets in favor of the CKRM way of doing it.
>
> But I think that's dangerous. It's very hard to get rid of existing user
> interfaces ... I'd much rather we sorted out what we're doing BEFORE
> putting either in the kernel.
>
> M.

We, CKRM, can put this on our stack, once we have settled how we are
going to address the structural requirements that came out of the
kernel summit. As indicated above, this would mean to create a resource
controller and assign masks to them, which is not what we have done so
far, as our current controllers are more share focused. This should be
a good exercise.

While we are on the topic, do you envision these sets to be somewhat
hierarchical or simply a flat hierarchy?

--
Hubertus Franke
|
From: Erich F. <ef...@hp...> - 2004-08-06 15:57:54
|
On Friday 06 August 2004 17:35, Martin J. Bligh wrote:
> > I'd vote for cpusets going in soon. CKRM could be extended by
> > a cpusets controller which should be pretty trivial when using the
> > infrastructure of this patch. It simply needs to create classes
> > (cpusets) and attach processes to them. The enforcement of resources
> > happens automatically. When CKRM is mature to enter the kernel, one
> > could drop /dev/cpusets in favor of the CKRM way of doing it.
>
> But I think that's dangerous. It's very hard to get rid of existing user
> interfaces ... I'd much rather we sorted out what we're doing BEFORE
> putting either in the kernel.

So the user interfaces should be adapted before? I think this is simple
and then the elimination of /dev/cpusets in favor of /rcfs is just
deletion of code plus a symbolic link. The classes and cpusets are both
directories.

The files in cpusets are:
 - cpus: list of CPUs in that cpuset
 - mems: list of Memory Nodes in that cpuset
 - cpu_exclusive flag: is cpu placement exclusive?
 - mem_exclusive flag: is memory placement exclusive?
 - tasks: list of tasks (by pid) attached to that cpuset

The files in a CKRM class directory:
 - stats : statistics (not needed for cpusets)
 - shares : could contain cpus, mems, cpu_exclusive, mem_exclusive
 - members : same as reading /dev/cpusets/.../tasks
 - target : same as writing /dev/cpusets/.../tasks

Changing the "shares" would mean something like
	echo "cpus +6-10" > .../shares

Just an idea...

Regards,
Erich
|
From: Paul J. <pj...@sg...> - 2004-08-07 06:12:52
|
Erich Focht wrote:
> we (NEC) are also a potential user of this patch

Good - welcome.

> I think cpusets and CKRM should be
> made to come together. One of CKRM's user interfaces is a filesystem
> with the file-tree representing the class hierarchy. It's the same for
> cpusets.

Hmmm ... this suggestion worries me, for a couple of reasons.

Just because cpusets and CKRM both have a hierarchy represented in a
file system doesn't mean it is, or can be, the same file system. Not
all trees are the same.

Perhaps someone more expert in CKRM can help here. The cpuset hierarchy
has some strict semantics:
 1) Any cpuset's CPUs and Memory must be a subset of its parent's.
 2) A cpuset may be exclusive for CPU or Memory only if its parent is.
 3) A CPU or Memory exclusive cpuset may not overlap its siblings.

See the routine kernel/cpuset.c:validate_change() for the exact coding
of these rules.

If we followed your suggestion, Erich, would these rules still hold?
I can't imagine that the CKRM folks have any existing hierarchies with
these particular rules. They would need to if we went this way.

On the flip side, what additional rules, if any, would CKRM impose on
this hierarchy?

The other reason that this suggestion worries me is a bit more
philosophical. I'm sure that for all the other, well known, resources
that CKRM manages, no one is proposing replacing whatever existing
names and mechanisms exist for those resources, such as bandwidth,
compute cycles, memory, ... Rather I presume that CKRM provides an
additional resource management layer on top of the existing resources,
which retain their classic names and apparatus.

What you seem to be suggesting here, especially with this nice picture
from your next post:

	The files in cpusets are:
	 - cpus: list of CPUs in that cpuset
	 - mems: list of Memory Nodes in that cpuset
	 - cpu_exclusive flag: is cpu placement exclusive?
	 - mem_exclusive flag: is memory placement exclusive?
	 - tasks: list of tasks (by pid) attached to that cpuset

	The files in a CKRM class directory:
	 - stats : statistics (not needed for cpusets)
	 - shares : could contain cpus, mems, cpu_exclusive, mem_exclusive
	 - members : same as reading /dev/cpusets/.../tasks
	 - target : same as writing /dev/cpusets/.../tasks

	Changing the "shares" would mean something like
		echo "cpus +6-10" > .../shares

would remove the cpuset specific interface forever, leaving it only
visible via a more generic "shares, members, target" interface suitable
for abstract resource management.

I am afraid that this would make it harder for new users of cpusets to
figure them out. Just cpusets by themselves add a new and strange layer
of abstraction, that will require a little bit of head scratching (as
Martin Bligh can testify to, from recent experience ;) for those
administering and managing the big iron where cpusets will be useful.

To add yet another layer of abstractions on top of that, from the CKRM
world, might send quite a few users into mental overload, doing the
usual stupid things we all do when we have given up on understanding
and are just thrashing about, trying to get something to work.

I think we are onto something useful here, the hierarchical organizing
of compute resources of CPU and Memory, which will become increasingly
relevant in the coming years, with bigger machines and more complex
compute and memory architectures.

I'd hate to see cpusets hidden behind resource management terms from
day one.

And, looking at it from the CKRM side (not sure I can, I'll try ...)
would it not seem a bit odd to a CKRM user that just one of the
resource types managed, these cpusets, had no apparent existence
outside of the CKRM hierarchy, unlike all the other resources, which
existed a priori, and, I presume, continue their independent existence?

Obviously, I could use a little CKRM expertise here.

But my inclination is to continue to view these two projects as
separate, with the potential that CKRM will someday add cpusets to the
resource types that it can manage.

Thank-you.

--
I won't rest till it's the best ... Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Shailabh N. <na...@wa...> - 2004-08-08 20:02:22
|
Paul Jackson wrote:
> Erich Focht wrote:
>> we (NEC) are also a potential user of this patch
>
> Good - welcome.
>
>> I think cpusets and CKRM should be
>> made to come together. One of CKRM's user interfaces is a filesystem
>> with the file-tree representing the class hierarchy. It's the same for
>> cpusets.
>
> Hmmm ... this suggestion worries me, for a couple of reasons.
>
> Just because cpusets and CKRM both have a hierarchy represented in a
> file system doesn't mean it is, or can be, the same file system. Not
> all trees are the same.
>
> Perhaps someone more expert in CKRM can help here. The cpuset hierarchy
> has some strict semantics:
>  1) Any cpuset's CPUs and Memory must be a subset of its parent's.
>  2) A cpuset may be exclusive for CPU or Memory only if its parent is.
>  3) A CPU or Memory exclusive cpuset may not overlap its siblings.
>
> See the routine kernel/cpuset.c:validate_change() for the exact
> coding of these rules.
>
> If we followed your suggestion, Erich, would these rules still hold?
> I can't imagine that the CKRM folks have any existing hierarchies with
> these particular rules. They would need to if we went this way.

As CKRM stands today, we wouldn't be able to impose these constraints
for exactly the reasons you point out. The other controllers would not
forbid the move of a task violating the above rules to a CKRM class but
this controller (CKRM's version of cpusets) would.

Currently, on a task move, CKRM's core calls per-controller callbacks
so the controller can make modifications to the controller-specific
per-class objects. But controllers can't prevent such a move.

However, one of the CKRM changes suggested in the Kernel Summit was to
split up the controllers and not have them bundled within a "core"
class as we call it. In this model, each task would directly belong to
some controller-specific class.

If CKRM were to adopt this change, one *potential* (but not necessary)
consequence is to have multiple hierarchies, one per-controller,
exposed to the user e.g. instead of /rcfs/taskclass/<sameclasstree>, we
would have /rcfs/cpu/<oneclasstree> and /rcfs/mem/<anotherclasstree>
etc. In such a scenario, it would be more logical for the controller to
constrain memberships (i.e. task moves, class share setting while it is
part of a hierarchy etc.) and it would be easy for cpusets to get its
semantics.

> On the flip side, what additional rules, if any, would CKRM impose
> on this hierarchy?

Currently, we impose rules on the shares that one can set (child cannot
have more than its parent, sibling shares should add up etc.) and we'd
discussed, but not implemented yet, some limit on how deep the common
hierarchy would go.

> The other reason that this suggestion worries me is a bit more
> philosophical. I'm sure that for all the other, well known,
> resources that CKRM manages, no one is proposing replacing whatever
> existing names and mechanisms exist for those resources, such as
> bandwidth, compute cycles, memory, ... Rather I presume that CKRM
> provides an additional resource management layer on top of the
> existing resources, which retain their classic names and apparatus.
>
> What you seem to be suggesting here, especially with this nice
> picture from your next post:
>
>	The files in cpusets are:
>	 - cpus: list of CPUs in that cpuset
>	 - mems: list of Memory Nodes in that cpuset
>	 - cpu_exclusive flag: is cpu placement exclusive?
>	 - mem_exclusive flag: is memory placement exclusive?
>	 - tasks: list of tasks (by pid) attached to that cpuset
>
>	The files in a CKRM class directory:
>	 - stats : statistics (not needed for cpusets)
>	 - shares : could contain cpus, mems, cpu_exclusive, mem_exclusive
>	 - members : same as reading /dev/cpusets/.../tasks
>	 - target : same as writing /dev/cpusets/.../tasks
>
>	Changing the "shares" would mean something like
>		echo "cpus +6-10" > .../shares
>
> would remove the cpuset specific interface forever, leaving it only
> visible via a more generic "shares, members, target" interface suitable
> for abstract resource management.
>
> I am afraid that this would make it harder for new users of cpusets to
> figure them out. Just cpusets by themselves add a new and strange layer
> of abstraction, that will require a little bit of head scratching (as
> Martin Bligh can testify to, from recent experience ;) for those
> administering and managing the big iron where cpusets will be useful.
>
> To add yet another layer of abstractions on top of that, from the CKRM
> world, might send quite a few users into mental overload, doing the
> usual stupid things we all do when we have given up on understanding and
> are just thrashing about, trying to get something to work.
>
> I think we are onto something useful here, the hierarchical organizing
> of compute resources of CPU and Memory, which will become increasingly
> relevant in the coming years, with bigger machines and more complex
> compute and memory architectures.
>
> I'd hate to see cpusets hidden behind resource management terms from day
> one.

Yup, that's a valid concern. In this current round of CKRM redesign,
we're considering whether controllers should be allowed to export their
own interface (in a sense) by accepting different kinds of share
settings. That is already true today in case of the "stats" and
"config" virtual files which don't have any CKRM-imposed semantics.
Only "shares" has a CKRM-defined set of values, not all of which are
useful or will be implemented by a controller. We're debating whether
to make that one controller-dependent too. If that happens, it'll make
it somewhat better for cpusets. But I'm not sure if we'd want to go so
far as to allow controllers to define what virtual files they
export...... we do that today for the classification engine because it
is an entirely different beast but the controllers are similar.....

> And, looking at it from the CKRM side (not sure I can, I'll try ...)
> would it not seem a bit odd to a CKRM user that just one of the resource
> types managed, these cpusets, had no apparent existence outside of the
> CKRM hierarchy, unlike all the other resources, which existed a priori,
> and, I presume, continue their independent existence?

From just the viewpoint of cpusets (not adding mem), it seems to be
quite similar to what CKRM's other controllers are doing - grouping a
per-task control (in your case, sched_setaffinity) using hierarchical
sets.

> Obviously, I could use a little CKRM expertise here.
>
> But my inclination is to continue to view these two projects as separate,
> with the potential that CKRM will someday add cpusets to the resource types
> that it can manage.

Umm... I'm quite sure you mean, you'll contribute code to do that,
right? :-)

It looks like the interface issue is the main one from both projects'
pov. Hopefully things will become clearer in the next week or so when
ckrm-tech thrashes out the Kernel Summit suggestion (it has other
ramifications besides interface).

-- Shailabh

> Thank-you.
|
From: Andrew M. <ak...@os...> - 2004-10-01 23:38:03
|
Paul,

I'm having second thoughts regarding a cpusets merge. Having gone back
and re-read the cpusets-vs-CKRM thread from mid-August, I am quite
unconvinced that we should proceed with two orthogonal resource
management/partitioning schemes.

And CKRM is much more general than the cpu/memsets code, and hence it
should be possible to realize your end-users' requirements using an
appropriately modified CKRM, and a suitable controller. I'd view the
difficulty of implementing this as a test of the wisdom of CKRM's
design, actually.

The clearest statement of the end-user cpu and memory partitioning
requirement is this, from Paul:

> Cpusets - Static Isolation:
>
> The essential purpose of cpusets is to support isolating large,
> long-running, multinode compute bound HPC (high performance
> computing) applications or relatively independent service jobs,
> on dedicated sets of processor and memory nodes.
>
> The (unobtainable) ideal of cpusets is to provide perfect
> isolation, for such jobs as:
>
>  1) Massive compute jobs that might run hours or days, on dozens
>     or hundreds of processors, consuming gigabytes or terabytes
>     of main memory. These jobs are often highly parallel, and
>     carefully sized and placed to obtain maximum performance
>     on NUMA hardware, where memory placement and bandwidth is
>     critical.
>
>  2) Independent services for which dedicated compute resources
>     have been purchased or allocated, in units of one or more
>     CPUs and Memory Nodes, such as a web server and a DBMS
>     sharing a large system, but staying out of each other's way.
>
> The essential new construct of cpusets is the set of dedicated
> compute resources - some processors and memory. These sets have
> names, permissions, an exclusion property, and can be subdivided
> into subsets.
>
> The cpuset file system models a hierarchy of 'virtual computers',
> which hierarchy will be deeper on larger systems.
>
> The average lifespan of a cpuset used for (1) above is probably
> between hours and days, based on the job lifespan, though a couple
> of system cpusets will remain in place as long as the system is
> running. The cpusets in (2) above might have a longer lifespan;
> you'd have to ask Simon Derr of Bull about that.

Now, even that is not a very good end-user requirement because it does
prejudge the way in which the requirement's solution should be
implemented. Users don't require that their NUMA machines "model a
hierarchy of 'virtual computers'". Users require that their NUMA
machines implement some particular behaviour for their work mix.

What is that behaviour? For example, I am unable to determine from the
above whether the users would be 90% satisfied with some close-enough
ruleset which was implemented with even the existing CKRM cpu and
memory governors.

So anyway, I want to reopen this discussion, and throw a huge spanner
in your works, sorry.

I would ask the CKRM team to tell us whether there has been any
progress in this area, whether they feel that they have a good
understanding of the end user requirement, and to sketch out a design
with which CKRM could satisfy that requirement.

Thanks.
|
From: Paul J. <pj...@sg...> - 2004-10-02 06:09:50
|
[Adding Erich Focht <ef...@hp...>] Are cpusets a special case of CKRM? Andrew raises (again) the question - can CKRM meet the needs which cpusets is trying to meet, enabling CKRM to subsume cpusets? Step 1 - Why cpusets? Step 2 - Can CKRM do that? Basically - cpusets implements dynamic soft partitioning to provide jobs with sets of isolated CPUs and Memory Nodes. The following begins Step 1, describing who has or is expected to use cpusets, and what I understand of their requirements for cpusets. Cpuset Users ============ The users of cpusets want to run jobs in relative isolation, by dividing the system into dynamically adjustable (w/o rebooting) subsets of compute resources (dedicated CPUs and Memory Nodes), and run one or sometimes several jobs within a given subset. Many such users, if they push this model far enough, tend toward using a batch manager, aka workload manager, such as OpenPBS or LSF. So the actual people who scream (gently) at me the most if I miss something in cpusets for SGI are (or have been, on 2.4 kernels and/or Irix): 1) The PBS and LSF folks porting their workload managers on top of cpusets, and 2) the SGI support engineers supporting customers of our biggest configurations running high value HPC applications. 3) Cpusets are also used by various graphics, storage and soft-realtime projects to obtain dedicated or precisely placed compute resources. The other declared potential users of cpusets, Bull and NEC at least, seem from what I can tell to have a somewhat different focus, toward providing a mix of compute services with minimum interference, from what I'd guess are more departmental size systems. Bull (Simon) and NEC (Erich) should also look closely at CKRM, and then try to describe their requirements, so we can understand whether CKRM, cpusets or both or neither can meet their needs. 
If I've forgotten any other likely users of cpusets who are lurking out there, I hope they will speak up and describe how they expect to use cpusets, what they require, and whether they find that CKRM would also meet their needs, or why not. I will try to work with the folks in PBS and LSF a bit, to see if I can get a simple statement of their essential needs that would be useful to the CKRM folks. I'll begin taking a stab at it, below. CKRM folks - what would be the best presentation of CKRM that I could point the PBS/LSF folks at? It's usually easier for users to determine if something will meet their needs if they can see and understand it. Trying to do requirements analysis to drive design choices with no feedback loop is crazy. They'll know it when they see it, not a day sooner ;) If some essential capability is missing, they might not articulate that capability at all, until someone tries to push a "solution" on them that is missing that capability. Cpuset Requirements =================== The three primary requirements that the SGI support engineers on our biggest configurations keep telling me are most important are: 1) isolation, 2) isolation, and 3) isolation. A big HPC job running on a dedicated set of CPUs and Memory Nodes should not lose any CPU cycles or Memory pages to outsiders. Both the batch managers and the HPC shops need to be able to guarantee exclusive use of some set of CPUS and Memory to a job. The batch managers need to be able to efficiently list the process id's of all tasks currently attached to a set. By default, set membership should be inherited across fork and exec, but batch managers need to be able to move tasks between sets without regard to the process creation hierarchy. 
A job running in a cpuset should be able to use various configuration, resource management (CKRM for example), cpu and memory (numa) affinity tools, performance analysis and thread management facilities within a set, including pthreads and MPI, independently from what is happening on the rest of the system. One should be able to run a stock 3rd party app (Oracle is the canonical example) on a system side-by-side with a special customer app, each in their own set, neither interfering with the other, and the Oracle folks happy that their app is running in a supported environment. And of course, a cpuset needs to be able to be set up and torn down without impacting the rest of the system, and then its CPU and Memory resources put back in the free pool, to be reallocated in different configurations for other cpusets. The batch or workload manager folks want to be able to hibernate and migrate jobs, so that they can move long running jobs around to get higher priority jobs through, and so that they can sensibly overcommit without thrashing. And they want to be able to add CPU and Memory resources to, and remove them from, an existing cpuset, which might appear to jobs currently executing within that cpuset as resources going on and offline. The HPC apps folks need to control some kernel memory allocations, swapping, classic Unix daemons and kernel threads along cpuset lines as well. When the kernel page cache is many times larger than the memory on a single node, leaving placement up to willy-nilly kernel decisions can totally blow out a node's memory, which is deadly to the performance of the job using that node. Similarly, one job can interfere with another if it abuses the swapper. Kernel threads that don't require specific placement, as well as the classic Unix daemons, both need to be kept off the CPUs and Memory Nodes used for the main applications, typically by confining them to their own small cpuset. 
The graphics, realtime and storage folks in particular need to place their cpusets on very specific CPUs and Memory Nodes near some piece of hardware of interest to them. The pool of CPUs and Memory Nodes is not homogeneous to these folks. If not all CPUs are the same speed, or not all Memory Nodes the same size, then CPUs and Memory Nodes are not homogeneous to the HPC folks either. And in any case, big numa machines have complex bus topologies, which the system admins or batch managers have to take into account when deciding which CPUs and Memory Nodes to put together into a cpuset. There must not be any presumption that composition of cpusets is done on a per-node basis, with all the CPUs and Memory on a node the unit of allocation. While this is often the case, sometimes other combinations of CPUs and Memory Nodes are needed, not along node boundaries. For the larger configurations, I am beginning to see requests for hierarchical "soft partitions", typically reflecting the complex corporate or government organization that purchased the big system and needs to share it amongst different, semi-uncooperative groups and subgroups. I anticipate that SGI will see more of this over the next few years, but I will (reluctantly) admit that a hierarchy of some fixed depth of two or three could meet the current needs as I hear them. Even flat model (no hierarchy) uses require some way to name and control access to cpusets, with distinct permissions for examining, attaching to, and changing them, that can be used and managed on a system wide basis. At least Bull has a requirement to automatically remove a cpuset when the last user of it exits - which the current implementation in Andrew's tree provides by calling out to a user level program on the last release. User level code can handle the actual removal. 
Bull also has a requirement for the kernel to provide cpuset-relative numbering of CPUs and Memory Nodes to some applications, so that they can be run oblivious to the fact that they don't own the entire machine. This requirement is not satisfied by the current implementation in Andrew's tree - Simon has a separate patch for that. Cpusets need to be able to interoperate with hotplug, which can be a bit of a challenge, given the tendency of cpuset code to stash its own view of the current system CPU/Memory configuration. The essential implementation hooks required by cpusets follow from their essential purpose. Cpusets control on which CPUs a task may be scheduled, and on which Memory Nodes it may allocate memory. Therefore hooks are required in the scheduler and allocator, which constrain scheduling and allocation to only use the allowed CPUs and Memory Nodes. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
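Simon's cpuset-relative numbering patch isn't included in this thread, but the translation it implies is easy to sketch. The helper names below are invented for illustration and are not taken from the actual patch:

```python
# Hypothetical illustration of cpuset-relative CPU numbering (not
# Simon's actual patch): an application inside a cpuset sees its CPUs
# as 0..N-1, while the kernel translates back to physical ids.

def to_relative(physical_cpus):
    """Map the physical CPU ids in a cpuset to relative ids 0..N-1."""
    return {phys: rel for rel, phys in enumerate(sorted(physical_cpus))}

def to_physical(physical_cpus, rel_id):
    """Translate a cpuset-relative id back to a physical CPU id."""
    return sorted(physical_cpus)[rel_id]

cpus = {4, 5, 12, 13}          # CPUs assigned to this cpuset
rel = to_relative(cpus)        # {4: 0, 5: 1, 12: 2, 13: 3}
assert rel[12] == 2
assert to_physical(cpus, 3) == 13
```

Since the mapping is derived from the sorted physical ids on each call, it stays consistent if CPUs are later added to or removed from the set.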
From: Dipankar S. <dip...@in...> - 2004-10-02 14:53:28
|
On Fri, Oct 01, 2004 at 11:06:44PM -0700, Paul Jackson wrote: > Cpuset Requirements > =================== > > The three primary requirements that the SGI support engineers > on our biggest configurations keep telling me are most important > are: > 1) isolation, > 2) isolation, and > 3) isolation. > A big HPC job running on a dedicated set of CPUs and Memory Nodes > should not lose any CPU cycles or Memory pages to outsiders. > .... > > A job running in a cpuset should be able to use various configuration, > resource management (CKRM for example), cpu and memory (numa) affinity > tools, performance analysis and thread management facilities within a > set, including pthreads and MPI, independently from what is happening > on the rest of the system. > > One should be able to run a stock 3rd party app (Oracle is > the canonical example) on a system side-by-side with a special > customer app, each in their own set, neither interfering with > the other, and the Oracle folks happy that their app is running > in a supported environment. One of the things we are working on is to provide exactly something like this. Not just that, within the isolated partitions, we want to be able to provide a completely different environment. For example, we need to be able to run one or more realtime processes of an application in one partition while the other partition runs the database portion of the application. For this to succeed, they need to be completely isolated. It would be nice if someone could explain a potential CKRM implementation for this kind of complete isolation. Thanks Dipankar |
From: Hubertus F. <fr...@wa...> - 2004-10-02 16:18:25
|
OK, let me respond to this (again...) from the perspective of cpus. This should to some extent also cover Andrew's request as well as Paul's earlier message. I see cpumem sets to be orthogonal to CKRM cpu share allocations. AGAIN. I see cpumem sets to be orthogonal to CKRM cpu share allocations. In its essence, "cpumem sets" is a hierarchical mechanism of successively tighter constraints on the affinity mask of tasks. The O(1) scheduler today does not know about cpumem sets. It operates on the level of affinity masks to adhere to the constraints specified based on cpu masks. The CKRM cpu scheduler also adheres to affinity mask constraints and frankly does not care how they are set. So I do not see what the problem will be at the scheduler level. If you want system isolation you deploy cpumem sets. If you want overall share enforcement you choose ckrm classes. In addition you can use both, with the understanding that cpumem sets cannot and will not be violated even if that means that shares are not maintained. Since you want orthogonality, cpumem sets could be implemented as a different "classtype". They would not belong to the taskclass and thus are independent from what we consider the task class. The tricky stuff comes in from the fact that CKRM assumes a system wide definition of a class and a system wide "calculation" of shares. Dipankar Sarma wrote: > On Fri, Oct 01, 2004 at 11:06:44PM -0700, Paul Jackson wrote: > >>Cpuset Requirements >>=================== >> >>The three primary requirements that the SGI support engineers >>on our biggest configurations keep telling me are most important >>are: >> 1) isolation, >> 2) isolation, and >> 3) isolation. >>A big HPC job running on a dedicated set of CPUs and Memory Nodes >>should not lose any CPU cycles or Memory pages to outsiders. >> > > .... 
> > >>A job running in a cpuset should be able to use various configuration, >>resource management (CKRM for example), cpu and memory (numa) affinity >>tools, performance analysis and thread management facilities within a >>set, including pthreads and MPI, independently from what is happening >>on the rest of the system. >> >>One should be able to run a stock 3rd party app (Oracle is >>the canonical example) on a system side-by-side with a special >>customer app, each in their own set, neither interfering with >>the other, and the Oracle folks happy that their app is running >>in a supported environment. > > > One of the things we are working on is to provide exactly something > like this. Not just that, within the isolated partitions, we want > to be able to provide completely different environment. For example, > we need to be able to run or more realtime processes of an application > in one partition while the other partition runs the database portion > of the application. For this to succeed, they need to be completely > isolated. > > It would be nice if someone explains a potential CKRM implementation for > this kind of complete isolation. Alternatively to what is described above, if you want to do cpumemsets purely through the current implementation, I'd approach it as follows: - Start with the current cpumemset implementation. - Write the CKRM controller that simply replaces the API of the cpumemset. - Now you have the object hierarchy through /rcfs/taskclass - Change the memsets through the generic attributes (discussed in earlier emails to extend the static fixed shares notation) - DO NOT USE CPU shares (always specify DONTCARE). I am not saying that this is the most elegant solution, but neither is trying to achieve proportional shares through cpumemsets. > > Thanks > Dipankar > Hope this helps. 
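For concreteness, here is a toy model of the "different classtype" idea sketched above — every class and method name is invented for illustration, and this is neither CKRM nor cpuset kernel code:

```python
# Purely illustrative sketch of cpumem sets as their own CKRM-style
# "classtype": each class carries cpu/mem masks instead of a share,
# and child masks are successively tighter subsets of the parent's.

class CpumemsetClass:
    def __init__(self, name, cpus, mems, parent=None):
        if parent is not None:
            # successively tighter constraint down the hierarchy
            assert set(cpus) <= parent.cpus and set(mems) <= parent.mems
        self.name = name
        self.cpus = set(cpus)
        self.mems = set(mems)
        self.parent = parent

    def set_config(self, cpus=None, mems=None):
        # the operation an rcfs-style attribute write might invoke
        if cpus is not None:
            assert self.parent is None or set(cpus) <= self.parent.cpus
            self.cpus = set(cpus)
        if mems is not None:
            assert self.parent is None or set(mems) <= self.parent.mems
            self.mems = set(mems)

root = CpumemsetClass("/", {0, 1, 2, 3}, {0, 1})
hpc = CpumemsetClass("/hpc", {2, 3}, {1}, parent=root)
hpc.set_config(cpus={2})        # tighten within the parent's mask
assert hpc.cpus == {2}
```

The point of the sketch is only that masks, unlike shares, need no system-wide "calculation" — each class is validated locally against its parent.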
> > _______________________________________________ > Lse-tech mailing list > Lse...@li... > https://lists.sourceforge.net/lists/listinfo/lse-tech > |
From: Paul J. <pj...@sg...> - 2004-10-02 18:06:48
|
> I see cpumem sets to be orthogonal to CKRM cpu share allocations. I agree. Thank-you for stating that, Hubertus. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Peter W. <pwi...@bi...> - 2004-10-02 23:22:56
|
Hubertus Franke wrote: > > OK, let me respond to this (again...) from the perspective of cpus. > This should to some extend also cover Andrew's request as well as > Paul's earlier message. > > I see cpumem sets to be orthogonal to CKRM cpu share allocations. > AGAIN. > I see cpumem sets to be orthogonal to CKRM cpu share allocations. > > In its essense, "cpumem sets" is a hierarchical mechanism of sucessively > tighter constraints on the affinity mask of tasks. > > The O(1) scheduler today does not know about cpumem sets. It operates > on the level of affinity masks to adhere to the constraints specified > based on cpu masks. This is where I see the need for "CPU sets". I.e. as a replacement/modification to the CPU affinity mechanism basically adding an extra level of abstraction to make it easier to use for implementing the type of isolation that people seem to want. I say this because, strictly speaking and as you imply, the current affinity mechanism is sufficient to provide that isolation BUT it would be a huge pain to implement. The way I see it you just replace the task's affinity mask with a pointer to its "CPU set" which contains the affinity mask shared by tasks belonging to that set (and this is used by try_to_wake_up() and the load balancing mechanism to do their stuff instead of the per task affinity mask). Then when you want to do something like take a CPU away from one group of tasks and give it to another group of tasks it's just a matter of changing the affinity masks in the sets instead of visiting every one of the tasks individually and changing their masks. There should be no need to explicitly move tasks off the "lost" CPU after such a change as it should/could be done next time that they go through try_to_wake_up() and/or finish a time slice. Moving a task from one CPU set to another would be a similar process to the current change of affinity mask. 
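The shared-set idea described above — a pointer to one set object instead of a per-task mask copy — can be modeled in a few lines. This is a toy user-space sketch, not kernel code; the CpuSet and Task names are invented here:

```python
# Toy model of the suggestion: tasks point at a shared CpuSet object
# rather than carrying their own affinity mask, so changing the set's
# mask retargets every member task at once.

class CpuSet:
    def __init__(self, cpus):
        self.cpus = set(cpus)      # the shared "affinity mask"

class Task:
    def __init__(self, cpuset):
        self.cpuset = cpuset       # pointer to the set, not a copy

    def allowed_cpus(self):
        # what try_to_wake_up()/load balancing would consult
        return self.cpuset.cpus

batch = CpuSet({4, 5, 6, 7})
tasks = [Task(batch) for _ in range(100)]

# Take CPU 7 away from the whole group: one update, no per-task visit.
batch.cpus.discard(7)
assert all(t.allowed_cpus() == {4, 5, 6} for t in tasks)
```

The design choice it illustrates is exactly the one argued for: retargeting a hundred tasks costs one set update, versus a hundred individual mask changes under the current mechanism.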
There would, of course, need to be some restriction on the movement of CPUs from one set to another so that you don't end up with an empty set with live tasks, etc. A possible problem is that there may be users whose use of the current affinity mechanism would be broken by such a change. A compile time choice between the current mechanism and a set based mechanism would be a possible solution. Of course, this proposed modification wouldn't make any sense with less than 3 CPUs. PS Once CPU sets were implemented like this, configurable CPU schedulers (such as (blatant plug :-)) ZAPHOD) could have "per CPU set" configurations, CKRM could do its (CPU management stuff) stuff within a CPU set, etc. > > The CKRM cpu scheduler also adheres to affinity mask constraints and > frankly does not care how they are set. > > So I do not see what at the scheduler level the problem will be. > If you want system isolation you deploy cpumem sets. If you want overall > share enforcement you choose ckrm classes. > In addition you can use both with the understanding that cpumem sets can > and will not be violated even if that means that shares are not maintained. > > Since you want orthogonality, cpumem sets could be implemented as a > different "classtype". They would not belong to the taskclass and thus > are independent from what we consider the task class. > > > > The tricky stuff comes in from the fact that CKRM assumes a system wide > definition of a class and a system wide "calculation" of shares. Doesn't sound insurmountable or particularly tricky :-). Peter -- Peter Williams pwi...@bi... "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce |
From: Hubertus F. <fr...@wa...> - 2004-10-02 23:47:01
|
We are in sync on this... Hopefully, everybody else as well. > > This is where I see the need for "CPU sets". I.e. as a > replacement/modification to the CPU affinity mechanism basically adding > an extra level of abstraction to make it easier to use for implementing > the type of isolation that people seem to want. I say this because, > strictly speaking and as you imply, the current affinity mechanism is > sufficient to provide that isolation BUT it would be a huge pain to > implement. Exactly, you do the movement from cpuset to cpuset through higher level operations, replacing the per-task cpu affinity with a shared object. This is what CKRM does at the core level through its class objects. RCFS provides the high level operations. The controller implements them wrt the constraints and the details. > > The way I see it you just replace the task's affinity mask with a > pointer to its "CPU set" which contains the affinity mask shared by > tasks belonging to that set (and this is used by try_to_wake_up() and > the load balancing mechanism to do their stuff instead of the per task > affinity mask). Then when you want to do something like take a CPU away > from one group of tasks and give it to another group of tasks it's just > a matter of changing the affinity masks in the sets instead of visiting > every one of the tasks individually and changing their masks. Exactly .. > There > should be no need to explicitly move tasks off the "lost" CPU after such > a change as it should/could be done next time that they go through > try_to_wake_up() and/or finish a time slice. Moving a task from one CPU > set to another would be a similar process to the current change of > affinity mask. > > There would, of course, need to be some restriction on the movement of > CPUs from one set to another so that you don't end up with an empty set > with live tasks, etc. 
> > A possible problem is that there may be users whose use of the current > affinity mechanism would be broken by such a change. A compile time > choice between the current mechanism and a set based mechanism would be > a possible solution. Of course, this proposed modification wouldn't > make any sense with less than 3 CPUs. Why? It is even useful for 2 cpus. Currently cpumem sets do not enforce that there are no intersections between siblings of a hierarchy. > > PS Once CPU sets were implemented like this, configurable CPU schedulers > (such as (blatant plug :-)) ZAPHOD) could have "per CPU set" > configurations, CKRM could do its (CPU management stuff) stuff within a > CPU set, etc. That's one of the sticking points. That would require that TASKCLASSES and cpumemsets go along the same hierarchy, with CPUmemsets being the top part of the hierarchy. In other words the task classes can not span different cpusets. There are other possibilities that would restrict the load balancing along cpuset boundaries. If taskclasses are allowed to span disjoint cpumemsets, what is then the definition of setting shares? Today we simply do the system wide share proportionment adhering to the affinity constraints, which is still valid in this discussion. > >> >> The tricky stuff comes in from the fact that CKRM assumes a system >> wide definition of a class and a system wide "calculation" of shares. > Tricky in that it needs to be decided what the class hierarchy definitions are, whether to do CKRM cpu scheduling within each cpuset, and what the exact definition of a share then is. > > Doesn't sound insurmountable or particularly tricky :-). I agree it's not insurmountable, but a matter of deciding what the desired behavior is ... Regards. > > Peter |
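One possible answer to "what is then the definition of setting shares" when task classes are confined to cpusets is to normalize each class's share against only the classes in the same cpuset. The sketch below is an assumed semantics for the sake of the discussion, not anything CKRM implements:

```python
# One assumed definition of shares under per-cpuset scoping: a class's
# effective cpu fraction is its share divided by the total shares of
# the classes in the *same* cpuset, instead of a system-wide total.

def effective_fractions(classes):
    """classes: list of (name, cpuset, share) tuples.
    Returns {name: fraction of that class's own cpuset}."""
    totals = {}
    for _, cpuset, share in classes:
        totals[cpuset] = totals.get(cpuset, 0) + share
    return {name: share / totals[cpuset]
            for name, cpuset, share in classes}

classes = [("db", "setA", 30), ("web", "setA", 10), ("hpc", "setB", 50)]
f = effective_fractions(classes)
assert f["db"] == 0.75      # 30 / (30 + 10), computed within setA only
assert f["hpc"] == 1.0      # alone in setB, regardless of its number
```

Under this reading a cpuset boundary can never be violated to maintain a share — exactly the priority ordering Hubertus states — because shares are never compared across cpusets at all.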
From: Peter W. <pwi...@bi...> - 2004-10-03 00:01:42
|
Hubertus Franke wrote: >> be a possible solution. Of course, this proposed modification >> wouldn't make any sense with less than 3 CPUs. > > > Why? It is even useful for 2 cpus. > Currently cpumem sets do not enforce that there are no intersections > between siblings of a hierarchy. There are only 3 non-empty sets, and only one of them can have a CPU removed from the set without becoming empty. So the pain wouldn't be worth the gain. Peter -- Peter Williams pwi...@bi... "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce |
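Peter's counting argument can be checked by brute-force enumeration — a throwaway illustration, nothing more:

```python
# With 2 CPUs there are only three non-empty subsets, and only the
# full set {0, 1} can lose a CPU without becoming empty -- the point
# being that set-based management buys little below 3 CPUs.

from itertools import combinations

cpus = {0, 1}
nonempty = [set(c) for r in (1, 2) for c in combinations(cpus, r)]
assert len(nonempty) == 3          # {0}, {1}, {0, 1}

# Sets that survive removing some CPU (i.e. stay non-empty):
shrinkable = [s for s in nonempty if any(s - {c} for c in s)]
assert shrinkable == [{0, 1}]
```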
From: Paul J. <pj...@sg...> - 2004-10-03 03:46:41
|
Hubertus writes: > > That's one of the sticking points. > That would require that TASKCLASSES and cpumemsets must go along the > same hierarchy. With CPUmemsets being the top part of the hierarchy. > In other words the task classes can not span different cpusets. Can task classes span an entire cpuset subtree? I can well imagine that an entire subtree of the cpuset tree should be managed by the same CKRM policies and shares. In particular, if we emulate the setaffinity/mbind/mempolicy calls by forking a child cpuset to represent the new restrictions on the task affected by those calls, then we'd for sure want to leave that task in the same CKRM policy realm as it was before. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |