From: Simon D. <Sim...@bu...> - 2003-09-24 15:59:00
Attachments:
cpuset-2.6.0-test4-100903-0.cleaned.patch
|
Hi,

We have developed a new feature for the Linux kernel that controls CPU placement, which is useful on large SMP machines, especially NUMA ones. We call it CPUSETS, and we would very much like to hear from anyone who is interested in such a feature. It was partly inspired by the pset and cpumemset patches that exist for Linux 2.4.

CPUSETs are lightweight objects in the Linux kernel that enable users to partition their multiprocessor machine by creating execution areas. A virtualization layer has been added so that it becomes possible to split a machine in terms of CPUs.

Furthermore, HPC applications often need to bind their processes to a specific CPU, and can achieve this by calling sched_setaffinity() in recent Linux kernels. But running several HPC applications on a large system can result in several processes being bound to the same processor. This problem is addressed by the CPUSET mechanism.

CPUSETS allow you to:
---------------------
1/ create sets of CPUs on the system, and bind applications to them

2/ translate the masks of CPUs given to sched_setaffinity() so they stay inside the set of CPUs. With this mechanism, processors are virtualized, for the use of sched_setaffinity() and /proc information. Thus, any existing application using this syscall to bind processes to processors will work with virtual CPUs without any change.

3/ provide a way to create sets of CPUs *inside* a set of CPUs: hence a system administrator can partition a system among users, and users can partition their partition among their applications.

4/ change on the fly the execution area of a whole set of processes (to give more resources to a critical application, for example).

...

5/ in the future, probably associate a memory allocation policy (such as local node, or round robin) with a set of CPUs.

These features have been implemented as a kernel patch for Linux 2.6 and a suite of userland tools.

You can find the associated manpages and a slightly more detailed explanation here:

  http://www.bullopensource.org/cpuset/

Any feedback, comment or opinion is welcome: Simon.Derr@Bull.net, Syl...@bu...

Thanks,

Simon and Sylvain.
|
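A minimal sketch of the mask translation described in point 2/ of the announcement, written as plain Python rather than kernel code. The function name, the list-based representation of a cpuset, and the choice to silently drop bits that fall outside the set are all assumptions made for illustration; the patch itself may behave differently.

```python
def translate_mask(virtual_mask, cpuset_cpus):
    """Translate a mask over virtual CPUs into a mask over physical CPUs.

    cpuset_cpus lists the physical CPU numbers in the set; virtual CPU i
    stands for cpuset_cpus[i]. Bits beyond the size of the set are dropped
    (an assumption of this sketch, not necessarily what the patch does).
    """
    physical_mask = 0
    for virtual_cpu, physical_cpu in enumerate(cpuset_cpus):
        if virtual_mask & (1 << virtual_cpu):
            physical_mask |= 1 << physical_cpu
    return physical_mask


# A cpuset made of physical CPUs 4-7; the application asks for "CPUs 0 and 1".
mask = translate_mask(0b0011, [4, 5, 6, 7])
print(bin(mask))
```

Under this model an application that binds itself to "CPU 0" ends up on the first CPU of whichever set it runs in, so two applications in different sets no longer collide on physical CPU 0.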
From: Stephen H. <she...@os...> - 2003-09-24 16:32:01
|
Looks good, but you aren't likely to get much acceptance or testing if it only works on ia64. You need to make a version for i386 as well.

Also, don't send your patch as a base64-encoded attachment; it makes working with text tools harder.
|
From: David M. <da...@na...> - 2003-09-24 17:05:40
|
>>>>> On Wed, 24 Sep 2003 09:30:44 -0700, Stephen Hemminger <she...@os...> said:

  Stephen> Looks good, but you aren't likely to get much acceptance or
  Stephen> testing if it only works on ia64. You need to make a
  Stephen> version for i386 as well.

Is this true for >8-way machines?

  --david
|
From: Paul J. <pj...@sg...> - 2003-09-24 21:43:35
|
Interesting ... I'm still digesting it.

However, one of the documents, at:

  http://www.bullopensource.org/cpuset/cpuset.html

was painful to read in a web browser, because it was just one big <pre>...</pre> block of text, with rather long lines (over 400 characters in one line) requiring much horizontal scrolling.

So I have reformatted it, using more common HTML markup. You are welcome to steal my reformatting - it's visible at:

  http://www.speakeasy.org/~pj99/cpuset_formatted.html

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: David M. <da...@na...> - 2003-09-24 22:06:12
|
Simon,

Could you please make it VERY clear that the system call numbers the patch is using are NOT OFFICIAL numbers and are therefore likely to change and/or collide unless and until they get accepted into the official tree?

BTW: What do cpusets provide that couldn't be done with user-level tools on top of the existing sched_setaffinity() system call?

Thanks,

  --david
|
From: Hanna L. <ha...@us...> - 2003-09-24 22:27:08
|
--On Wednesday, September 24, 2003 03:06:08 PM -0700 David Mosberger <da...@na...> wrote:

> Simon,
>
> Could you please make it VERY clear that the system call numbers the
> patch is using are NOT OFFICIAL numbers and are therefore likely to
> change and/or collide unless and until they get accepted into the
> official tree?
>
> BTW: What do cpusets provide that couldn't be done with user-level
> tools on top of the existing sched_setaffinity() system call?
>
> Thanks,
>
> --david

Blatant plug for the group call:

We have time at the next conference call (Oct 1, 2pm PDT, GMT -0700) to discuss this if you all can make it and are interested. The only other topic on the agenda is "real time application latency when the system is stressed" from Mark Gross at Intel.

Simon/Sylvain, let me know if one or both of you would be interested in calling in to give a brief overview of this proposal and answer questions about it. No need for slides or anything, it is a very informal group. If you are outside the US I can arrange a toll-free number.

Hanna
|
From: Erich F. <ef...@hp...> - 2003-09-24 22:29:16
|
On Thursday 25 September 2003 00:06, David Mosberger wrote:
> BTW: What do cpusets provide that couldn't be done with user-level
> tools on top of the existing sched_setaffinity() system call?

You've got a point here. As the cpumasks are inherited, you'd just have to make sure that a user cannot escape his cpuset, e.g. by not allowing the mask to get bigger than the mask of the parent process (which can be owned by root, thus unchangeable). The management of the cpusets can then be taken over by a user space daemon.

Erich
|
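The rule Erich describes (a mask may shrink relative to the parent's mask but never grow) can be sketched as a simple bitmask check. This is only an illustration in Python with integer masks; the function name is hypothetical, and rejecting an empty mask is an extra assumption of the sketch.

```python
def affinity_change_allowed(requested_mask, parent_mask):
    """Return True if requested_mask is non-empty and adds no CPU that is
    absent from parent_mask."""
    if requested_mask == 0:
        return False
    # Any bit set in the request but clear in the parent is an escape.
    return (requested_mask & ~parent_mask) == 0


parent = 0b1110                                     # parent may run on CPUs 1, 2, 3
print(affinity_change_allowed(0b0110, parent))      # subset of the parent
print(affinity_change_allowed(0b0001, parent))      # CPU 0 is outside the parent
```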
From: Paul J. <pj...@sg...> - 2003-09-25 05:40:50
|
> BTW: What do cpusets provide that couldn't be done with user-level
> tools on top of the existing sched_setaffinity() system call?

I don't see how you can do the migrate_cpuset_processes() from a user level daemon. Just because two tasks happen to be allowed on the same CPUs doesn't mean they are in the same cpuset. The kernel must track, across forks, which tasks share a given cpuset.

There are also some resource management capabilities, such as tracking and controlling how much memory a cpuset takes, and swapping (with possible oom kill) against a cpuset, that one can consider extending this to, but only if it's in the kernel. But I'm not ready to push this point ... yet.

And the permission model has to remain a rather primitive "root can do anything, anyone else can just subset their parent" if it lacks kernel hooks to track uid/suid ownership of each cpuset.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
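Paul's first point, that membership in a cpuset is more than having the same mask, can be shown with a toy model. The class and method names below are hypothetical and this is ordinary Python, not the patch's data structures: it only illustrates why membership has to be recorded per task and inherited on fork before a whole set can be migrated.

```python
class CpusetTable:
    """Toy model of per-task cpuset membership (all names are hypothetical)."""

    def __init__(self):
        self.cpus_of = {}    # cpuset id -> set of allowed CPUs
        self.cpuset_of = {}  # pid -> cpuset id

    def create(self, cpuset_id, cpus):
        self.cpus_of[cpuset_id] = set(cpus)

    def attach(self, pid, cpuset_id):
        self.cpuset_of[pid] = cpuset_id

    def fork(self, parent_pid, child_pid):
        # The child inherits membership, not merely a copy of the mask.
        self.cpuset_of[child_pid] = self.cpuset_of[parent_pid]

    def allowed_cpus(self, pid):
        return self.cpus_of[self.cpuset_of[pid]]

    def migrate(self, cpuset_id, new_cpus):
        """Move a whole cpuset; return the pids whose placement changed."""
        self.cpus_of[cpuset_id] = set(new_cpus)
        return sorted(p for p, c in self.cpuset_of.items() if c == cpuset_id)


table = CpusetTable()
table.create("a", [0, 1])
table.create("b", [0, 1])   # same CPUs as "a", but a different cpuset
table.attach(100, "a")
table.attach(200, "b")
table.fork(100, 101)
moved = table.migrate("a", [2, 3])
print(moved, table.allowed_cpus(200))
```

Task 200 had exactly the same allowed CPUs as tasks 100 and 101 before the migration, yet only the members of cpuset "a" move. A daemon that sees nothing but masks could not tell the two groups apart.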
From: Paul J. <pj...@sg...> - 2003-09-25 06:11:35
|
> This sounds like it has progressively more commonality with CKRM; the
> notion is of a workclass, not of a purely cpu-oriented notion.

I _knew_ I shouldn't have thrown in that paragraph that began "There are also some resource management capabilities, ...".

There are two aspects to CKRM - a common classification of service levels, and hooks in each scheduler of resources to respect those levels.

These cpusets, either as proposed, or possibly fancier forms that also manage memory, do not replace, cannot be replaced by, and do not compete with CKRM. Rather they cooperate with CKRM, and represent one more place, alongside network drivers, schedulers and memory allocators, that eventually will want to respect CKRM service levels.

The point of _this_ subthread was to consider whether this could more or less entirely be done in user space. The two aspects of even Simon's current proposal that I don't see how to do in user space are the migration and the permission model.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Paul J. <pj...@sg...> - 2003-09-25 06:39:57
|
> Well, the thing is, CKRM essentially has the cross-resource bits and
> makes up some group that can be joined and departed from and inherited
> and so on with all the right knobs ...

The hierarchies don't correspond, or do so only accidentally.

That is, cpusets, as proposed, have a hierarchy such that one cpuset is the child of another if one cpuset describes a subset of another's CPUs.

At first blush, I don't see a hierarchy of CKRM Classes, rather just a flat space, say Gold, Silver and Bronze.

"When all you have is a hammer, the whole world looks like a nail"

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
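The cpuset hierarchy as characterized above (a child describes a subset of its parent's CPUs) could be modeled roughly as follows. The class name, the create_child() method and the use of ValueError are illustrative assumptions in Python, not the interface of the patch.

```python
class Cpuset:
    """Toy cpuset node: a child may only hold CPUs its parent already holds."""

    def __init__(self, cpus, parent=None):
        cpus = frozenset(cpus)
        if parent is not None and not cpus <= parent.cpus:
            raise ValueError("child cpuset must be a subset of its parent")
        self.cpus = cpus
        self.parent = parent

    def create_child(self, cpus):
        return Cpuset(cpus, parent=self)


machine = Cpuset(range(8))                   # administrator's view: CPUs 0-7
user = machine.create_child([4, 5, 6, 7])    # one user's partition
app = user.create_child([4, 5])              # the user's own sub-partition
print(sorted(app.cpus), app.parent is user)
```

A flat set of classes such as Gold, Silver and Bronze has no such containment relation, which is the mismatch the message points at.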
From: Paul J. <pj...@sg...> - 2003-09-25 06:53:02
|
> It's meant to flatten the hierarchy ...
> ...
> The hierarchy is meant to be there, just ...

_What_ is meant to flatten the hierarchy ?? To what does "It" refer ??

So is the hierarchy of CKRM there or not -- you've confused me.

And in any case, are we in agreement that any such CKRM hierarchy is not isomorphic to the cpuset hierarchy?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Hubertus F. <fr...@wa...> - 2003-09-25 13:25:14
|
Paul Jackson wrote:

>>It's meant to flatten the hierarchy ...
>> ...
>>The hierarchy is meant to be there, just ...
>
>_What_ is meant to flatten the hierarchy ?? To what does "It" refer ??
>
>So is the hierarchy of CKRM there or not -- you've confused me.
>
>And in any case, are we in agreement that any such CKRM hierarchy
>is not isomorphic to the cpuset hierarchy?

No hierarchy in CKRM (as of now). IMHO, even if we introduce hierarchies in CKRM, I don't see that they would be isomorphic to cpuset hierarchies.

-- Hubertus Franke ( CKRM team )
|
From: Hubertus F. <fr...@wa...> - 2003-09-25 13:23:03
|
Paul Jackson wrote:

>>Well, the thing is, CKRM essentially has the cross-resource bits and
>>makes up some group that can be joined and departed from and inherited
>>and so on with all the right knobs ...
>
>The hierarchies don't correspond, or do so only accidentally.
>
>That is, cpusets, as proposed, have a hierarchy such that one
>cpuset is the child of another if one cpuset describes a subset
>of another's CPUs.
>
>At first blush, I don't see a hierarchy of CKRM Classes, rather
>just a flat space, say Gold, Silver and Bronze.

Paul, yes, CKRM classes at this point are flat. We looked initially at hierarchies and determined that for the first release they might add a lot of complexity with questionable benefits for the community at large. So we left hierarchies out. Based on the general community feedback we might have to revisit this issue.

Again, I see cpusets and CKRM as addressing two orthogonal issues with respect to CPUs:

  cpusets (partitioning in space), with hierarchies
  CKRM (time partitioning): how much time does a class get ...

-- Hubertus Franke ( CKRM team )
|
From: Rik v. R. <ri...@re...> - 2003-09-25 16:08:09
|
On Thu, 25 Sep 2003, Hubertus Franke wrote:

> Paul, yes CKRM classes at this point are flat, we looked initially at
> hierarchies and determined that for the first release might add a lot of
> complexity with questionable benefits for the community at large. So we
> left hierarchies out. Based on the general community feedback we might
> have to revisit this issue.

I'm not sure we will ever need hierarchies in the kernel.

For all intents and purposes, we might be able to emulate them in userspace by having a daemon interpret the class resource usage statistics and adjust priorities dynamically ;)

--
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan
|
From: Sylvain J. <syl...@bu...> - 2003-09-26 07:19:02
|
On Thu, 25 Sep 2003, Hubertus Franke wrote:

> Again, I see cpusets and CKRM as addressing two orthogonal issues wrt to
> cpu's
>
> cpusets (partitioning in space) with hierarchies
> CKRM (time partitioning) how much of time does a class get...

We do agree. We took a look at CKRM and that is the conclusion we reached. At first sight, it could look as if the goals are the same, and on some points they are, but the two approaches are different. It looks like it would be better to combine them rather than try to merge them.

Sylvain
|
From: Hubertus F. <fr...@wa...> - 2003-09-26 12:59:06
|
Sylvain Jeaugey wrote:

>On Thu, 25 Sep 2003, Hubertus Franke wrote:
>
>>Again, I see cpusets and CKRM as addressing two orthogonal issues wrt to
>>cpu's
>>
>>cpusets (partitioning in space) with hierarchies
>>CKRM (time partitioning) how much of time does a class get...
>
>We do agree. We took a look at CKRM and that is the conclusion we
>achieved. At first sight, it could look like the goal are the same -and in
>some points it is- but the two approaches are different. It looks
>like it would be better to combine them rather to try to merge them.
>
>Sylvain

Correct. These are both worthwhile efforts and they do different things. A combination of both at some point (not now) should be investigated.

On the CPU front, which is what cpusets provide at this point, it's simply orthogonal. cpusets give you a means to lock processes down to CPUs through an abstraction/virtual layer that does not determine the exact CPU but guarantees that some CPU will be chosen to represent that number. This is analogous to MPI applications, which provide communicators that are effectively "cpusets" in the broader sense. On top of that they provide topology information as such.

Actually I don't see why CKRM can't enforce class shares on top of cpusets. They simply don't need to know about each other's presence. CKRM, through its load-balancing algorithm, enforces shares for SMPs while at the same time observing cpu_affinity constraints, which is effectively what cpusets boil down to ...

So "combine" is the correct wording here ...

-- Hubertus Franke (CKRM team)
|
From: Hubertus F. <fr...@wa...> - 2003-09-25 13:14:49
|
Paul Jackson wrote:

>>This sounds like it has progressively more commonality with CKRM; the
>>notion is of a workclass, not of a purely cpu-oriented notion.
>
>I _knew_ I shouldn't have thrown in that paragraph that began "There are
>also some resource management capabilities, ...".
>
>There are two aspects to CKRM - a common classification of service levels,
>and hooks in each scheduler of resources to respect those levels.

That is correct (assuming slight modification of the schedulers qualifies as a hook).

>These cpusets, either as proposed, or possible fancier forms that also
>manage memory, do not replace, cannot be replaced by, and do not compete
>with CKRM. Rather they cooperate with CKRM, and represent one more
>place, along side network drivers, schedulers and memory allocators,
>that eventually will want to respect CKRM service levels.

Yes, to my understanding of cpusets (and I haven't looked into it in great detail), it's a virtualization layer above physical binding. One really doesn't care to which CPU a process is bound as long as it is bound to one. One might care that tasks are constrained to a particular number of tasks and not beyond, thus leading to the partitioning capabilities.

So I agree here with Paul that it addresses more a physical separation of processes, or say partitioning of the machine, while CKRM is targeted towards resource utilization within a class. Just like cpu_affinity, CKRM could tolerate cpusets.

>The point of _this_ subthread was to consider whether this could more or
>less entirely be done in user space. The two aspects even of Simon's
>current proposal that I don't see can be done in user space are the
>migration, and the permission model.

-- Hubertus Franke ( CKRM team )
|
From: Simon D. <Sim...@bu...> - 2003-09-25 14:35:47
|
On Wed, 24 Sep 2003, David Mosberger wrote:

> Simon,
>
> Could you please make it VERY clear that the system call numbers the
> patch is using are NOT OFFICIAL numbers and are therefore likely to
> change and/or collide unless and until they get accepted into the
> official tree?

OK, I've updated http://www.bullopensource.org/cpuset/ to highlight this, and also added the missing userland cpuset.[ch] files, plus some more documentation.

As suggested by Paul, maybe we'll eventually make the syscalls disappear and replace them with some /proc interaction.

Simon.
|
From: Paul J. <pj...@sg...> - 2003-09-25 05:41:59
|
Where's the user level cpuset.h file?

I wasn't able to find it so far -- just a few links in the man pages to it, but only on some local file system to which I lack access.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Paul J. <pj...@sg...> - 2003-09-25 05:45:57
|
On the pchange man page, an example starts with:

  # pcreate -np 4 --strict
  new area created with id 2

How does the 'pcreate' invoker know that the new area had an id of "2" ?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: David M. <da...@na...> - 2003-09-25 06:57:12
|
>>>>> On Wed, 24 Sep 2003 23:02:34 -0700, William Lee Irwin III <wl...@ho...> said:

  Bill> On Wed, 24 Sep 2003 09:30:44 -0700, Stephen Hemminger
  Bill> <she...@os...> said:
  Stephen> Looks good, but you aren't likely to get much acceptance or
  Stephen> testing if it only works on ia64. You need to make a
  Stephen> version for i386 as well.

  Bill> On Wed, Sep 24, 2003 at 10:02:35AM -0700, David Mosberger wrote:
  >> Is this true for >8-way machines?

  Bill> x86's architectural limitations are 64x for serial APIC -based machines
  Bill> (e.g. NUMA-Q) and 255x for xAPIC -based machines (no known extant > 32x
  Bill> machines, apparently some kind of non-architectural regression), where
  Bill> the non-power-of-two number of cpus is due to the broadcast ID reserved
  Bill> from an 8-bit interrupt controller ID space. A likely explanation for
  Bill> the current xAPIC limitations is the recommended (publicly documented)
  Bill> physical APIC ID enumeration scheme breaking down for > 32x.

  Bill> Custom interrupt controllers may exceed these limits, but I don't know
  Bill> of any that have actually been made use of to do so. Though it sucks
  Bill> and very, very badly, x86 is not limited to anything like 8x.

I wasn't suggesting that x86 is limited to 8-way, I was wondering how many > 8-way x86 Linux machines are actually out there. I wasn't even being facetious---just curious.

  --david
|
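The 255x figure quoted above follows from the arithmetic Bill describes: an 8-bit ID space with one value reserved for broadcast. A tiny worked check, where the function name and the reserved=1 default are only illustrative (the 64x serial-APIC figure is not derived here):

```python
def usable_ids(id_bits, reserved=1):
    """Number of IDs left in an id_bits-wide space after reserving some."""
    return 2 ** id_bits - reserved


# 8-bit ID space, one ID reserved for broadcast.
print(usable_ids(8))  # prints 255
```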
From: David M. <da...@na...> - 2003-09-25 07:07:06
|
>>>>> On Wed, 24 Sep 2003 23:57:10 -0700, David Mosberger <da...@li...> said:

  Bill> x86's architectural limitations are 64x for serial APIC -based
  Bill> machines (e.g. NUMA-Q) and 255x for xAPIC -based machines (no
  Bill> known extant > 32x machines, apparently some kind of
  Bill> non-architectural regression), where the non-power-of-two
  Bill> number of cpus is due to the broadcast ID reserved from an
  Bill> 8-bit interrupt controller ID space. A likely explanation for
  Bill> the current xAPIC limitations is the recommended (publicly
  Bill> documented) physical APIC ID enumeration scheme breaking down
  Bill> for > 32x.

  Bill> Custom interrupt controllers may exceed these limits, but I
  Bill> don't know of any that have actually been made use of to do
  Bill> so. Though it sucks and very, very badly, x86 is not limited
  Bill> to anything like 8x.

  David> I wasn't suggesting that x86 is limited to 8-way, I was
  David> wondering how many > 8-way x86 Linux machines are actually
  David> out there. I wasn't even being facetious---just curious.

Incidentally, the first "big" SMP machine I had access to was some sort of Sequent (S81/10?), with ~12 80386 CPUs (yes, that was a long time ago... ;-).

  --david
|
From: Dave H. <hav...@us...> - 2003-09-25 09:07:29
|
On Wed, 2003-09-24 at 23:57, David Mosberger wrote:
> I wasn't suggesting that x86 is limited to 8-way, I was wondering how
> many > 8-way x86 Linux machines are actually out there. I wasn't even
> being facetious---just curious.

Well, besides the NUMA-Q, which went up to 60x and is dead now, there are at least the IBM Summit chipset machines. They're sold as 32-ways today on the x445 (that's physical, without hyperthreading). I've personally booted Linux on a 16-way, but I know others have booted on the 32-way configuration. Patches for this were posted in the last week by James Cleverdon.

There's also the bigsmp code in the kernel for other P4-based systems that are >8x. I haven't seen any of them yet, but I wouldn't imagine that people would put support in the kernel for hardware that wasn't at least *close* to production.

--
Dave Hansen
hav...@us...
|
From: Paul J. <pj...@sg...> - 2003-09-25 07:13:18
|
> The hierarchy used in CKRM is there. ...
> It's a subgraph of the process inheritance hierarchy.

Any recommendations on how I might learn more of such?

> my little request to have similar mechanisms consolidated ...

Such requests are valuable ... especially if there is more similarity than just "hierarchical resource grouping something or other ...".

> bow out at this point.

Ok - take care.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Simon D. <Sim...@bu...> - 2003-09-25 13:27:27
|
On Wed, 24 Sep 2003, David Mosberger wrote:

> BTW: What do cpusets provide that couldn't be done with user-level
> tools on top of the existing sched_setaffinity() system call?

This is a question we had a long in-house debate about.

The main reason for including cpusets *inside* the kernel is that we have to deal with applications that may call sched_setaffinity() to bind their processes to CPUs. Therefore we have to intercept these calls. We could try some LD_PRELOAD userland trick or modify the libc, but that would not work for statically linked programs.

As pointed out by Paul, another reason is the possibility of changing the size/location of cpusets on the fly, and the need to apply the change to the attached processes.

Thanks a lot for your comments,

Simon.
|