From: Peter W. <pwi...@bi...> - 2004-10-06 23:23:31
|
Matthew Dobson wrote: > On Tue, 2004-10-05 at 19:08, Paul Jackson wrote: > > I don't know that these partitions would necessarily need their own > scheduler, allocator and resource manager, or if we would just make the > current scheduler, allocator and resource manager aware of these > boundaries. In either case, that is an implementation detail not to be > agonized over now. It's not so much whether they NEED their own scheduler, etc. as whether it should be possible for them to have their own scheduler, etc. With a configurable scheduler (such as ZAPHOD) this could just be a matter of having separate configuration variables for each cpuset (e.g. if a cpuset has been created to contain a bunch of servers there's no need to try and provide good interactive response for its tasks (as none of them will be interactive), so the interactive response mechanism can be turned off in that cpuset, leading to better server response and throughput). Peter -- Peter Williams pwi...@bi... "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce |
From: Rick L. <ric...@us...> - 2004-10-07 00:18:16
|
It's not so much whether they NEED their own scheduler, etc. as whether it should be possible for them to have their own scheduler, etc. With a configurable scheduler (such as ZAPHOD) this could just be a matter of having separate configuration variables for each cpuset (e.g. if a cpuset has been created to contain as bunch of servers there's no need to try and provide good interactive response for its tasks (as none of them will be interactive) so the interactive response mechanism can be turned off in that cpuset leading to better server response and throughput). Providing configurable schedulers is a feature/bug/argument completely separate from cpusets. Let's stay focused on that for now. Two concrete examples for cpusets stick in my mind: * the department that has been given 16 cpus of a 128 cpu machine, is free to do what they want with them, and doesn't much care specifically how they're laid out. Think general timeshare. * the department that has been given 16 cpus of a 128 cpu machine to run a finely tuned application which expects and needs everybody to stay off those cpus. Think compute-intensive. Correct me if I'm wrong, but CKRM can handle the first, but cannot currently handle the second. And the mechanism(s) for creating either situation are suboptimal at best and non-existent at worst. Rick |
From: Paul J. <pj...@sg...> - 2004-10-07 18:31:01
|
Rick wrote: > > Two concrete examples for cpusets stick in my mind: > > * the department that has been given 16 cpus of a 128 cpu machine, > is free to do what they want with them, and doesn't much care > specifically how they're laid out. Think general timeshare. > > * the department that has been given 16 cpus of a 128 cpu machine > to run a finely tuned application which expects and needs everybody > to stay off those cpus. Think compute-intensive. > > Correct me if I'm wrong, but CKRM can handle the first, but cannot > currently handle the second. Even the first scenario is not well handled by CKRM, in my view, for most workloads. On a 128 cpu system, if you want 16 cpus of compute power, you are much better off having that power on 16 specific cpus, rather than getting 12.5% of each of the 128 cpus, unless your workload has a very low cache footprint. I think of it like this. Long ago, I learned to consider performance for many of the applications I wrote in terms of how many disk accesses I needed, for the disk was a thousand times slower than the processor and dominated performance across a broad scale. The gap between the speed of interior cpu cycles and external ram access across a bus or three is approaching the processor to disk gap of old. A complex hierarchy of caches has grown up, within and surrounding each processor, in an effort to ameliorate this gap. The dreaded disk seek of old is now the cache line miss of today. Look at the advertisements for compute power for hire in the magazines. I can rent a decent small computer, with web access and offsite backup, in an air conditioned room with UPS and 24/7 administration for under $100/month. These advertisements never sell me 12.5% of the cycles on each of the 128 cpus in a large server. They show pictures of some nice little rack machine -- that can be all mine, for just $79/month. Sign up now with our online web server and be using your system in minutes. [ hmmm ... wonder how many spam filters I hit on that last paragraph ... ] -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Paul J. <pj...@sg...> - 2004-10-07 08:55:46
|
> I don't see what non-exclusive cpusets buys us. One can nest them, overlap them, and duplicate them ;) For example, we could do the following: * Carve off CPUs 128-255 of a 256 CPU system in which to run various HPC jobs, requiring varying numbers of CPUs. This is named /dev/cpuset/hpcarena, and it is the really really exclusive and isolated sort of cpuset which can and does have its own scheduler domain, for a scheduler configuration that is tuned for running a mix of HPC jobs. In this hpcarena also run the per-cpu kernel threads that are pinned on CPUs 128-255 (for _all_ tasks running on an exclusive cpuset must be in that cpuset or below). * The testing group gets half of this cpuset each weekend, in order to run a battery of tests: /dev/cpuset/hpcarena/testing. In this testing cpuset runs the following batch manager. * They run a home brew batch manager, which takes an input stream of test cases, carves off a small cpuset of the requested size, and runs that test case in that cpuset. This results in cpusets with names like: /dev/cpuset/hpcarena/testing/test123. Our test123 is running in this cpuset. * Test123 here happens to be a test of the integrity of cpusets, so it sets up a couple of cpusets to run two independent jobs, each a 2 CPU MPI job. This results in the cpusets: /dev/cpuset/hpcarena/testing/test123/a and /dev/cpuset/hpcarena/testing/test123/b. Our little MPI jobs 'a' and 'b' are running in these two cpusets. We now have several nested cpusets, each overlapping its ancestors, with tasks in each cpuset. But only the top hpcarena cpuset has the exclusive ownership, with no form of overlap, of everything in its subtree that something like a distinct scheduler domain wants. Hopefully the above is not what you meant by "little more than a convenient way to group tasks." > 2) rewrite the scheduler/allocator to deal with these bindings up front, > and take them into consideration early in the scheduling/allocating > process. The allocator is less stressed here by varied mems_allowed settings than is the scheduler. For in 99+% of the cases, the allocator is dealing with a zonelist that has the local (currently executing) node first on the zonelist, and is dealing with a mems_allowed that allows allocation on the local node. So the allocator almost always succeeds the first time it goes to see if the candidate page it has in hand comes from a node allowed in current->mems_allowed. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
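As a rough illustration of the hierarchy Paul describes above, the setup might be driven through the /dev/cpuset pseudo-filesystem along the following lines. This is a sketch only: the control file names (cpus, mems, cpu_exclusive, tasks) follow the cpuset patch's documented interface, while the mount point, the memory node numbers and the $TESTPID variable are invented for the example.

    # Mount the cpuset pseudo-filesystem (assumed mount point).
    mkdir -p /dev/cpuset
    mount -t cpuset none /dev/cpuset

    # The exclusive HPC arena: CPUs 128-255 plus (illustratively) their local memory nodes.
    mkdir /dev/cpuset/hpcarena
    echo 128-255 > /dev/cpuset/hpcarena/cpus
    echo 64-127  > /dev/cpuset/hpcarena/mems
    echo 1       > /dev/cpuset/hpcarena/cpu_exclusive

    # The testing group's weekend half of the arena.
    mkdir /dev/cpuset/hpcarena/testing
    echo 192-255 > /dev/cpuset/hpcarena/testing/cpus
    echo 96-127  > /dev/cpuset/hpcarena/testing/mems

    # The batch manager carves off a small cpuset for test123 ...
    mkdir /dev/cpuset/hpcarena/testing/test123
    echo 192-193 > /dev/cpuset/hpcarena/testing/test123/cpus
    echo 96      > /dev/cpuset/hpcarena/testing/test123/mems

    # ... and attaches the test job to it by writing its pid.
    echo $TESTPID > /dev/cpuset/hpcarena/testing/test123/tasks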
From: Rick L. <ric...@us...> - 2004-10-07 10:55:47
|
> I don't see what non-exclusive cpusets buys us. One can nest them, overlap them, and duplicate them ;) For example, we could do the following: Once you have the exclusive set in your example, wouldn't the existing functionality of CKRM provide you all the functionality the other non-exclusive sets require? Seems to me, we need a way to *restrict use* of certain resources (exclusive) and a way to *share use* of certain resources (non-exclusive.) CKRM does the latter right now, I believe, but not the former. (Does CKRM support sharing hierarchies as in the dept/group/individual example you used?) What about this model: * All exclusive sets exist at the "top level" (non-overlapping, non-hierarchical) and each is represented by a separate sched_domain hierarchy suitable for the hardware used to create the cpuset. I can't imagine anything more than an academic use for nested exclusive sets. * All non-exclusive sets are rooted at the "top level" but may subdivide their range as needed in a tree fashion (multiple levels if desired). Right now I believe this functionality could be provided by CKRM. Observations: * There is no current mechanism to create exclusive sets; cpus_allowed alone won't cut it. A combination of Matt's patch plus Paul's code could probably resolve this. * There is no clear policy on how to amiably create an exclusive set. The main problem is what to do with the tasks already there. I'd suggest they get forcibly moved. If their current cpus_allowed mask does not allow them to move, then if they are a user process they are killed. If they are a system process and cannot be moved, they stay and gain squatter's rights in the newly created exclusive set. * Interrupts are not under consideration right now. They land where they land, and this may affect exclusive sets. If this is a problem, for now, you simply lay out your hardware and exclusive sets more intelligently. * Memory allocation has a tendency and preference, but no hard policy with regards to where it comes from. A task which starts on one part of the system but moves to another may have all its memory allocated relatively far away. In unusual cases, it may acquire remote memory because that's all that's left. A memory allocation policy similar to cpus_allowed might be needed. (Martin?) * If we provide a means for creating exclusive sets, I haven't heard a good reason why CKRM can't manage this. Rick |
From: Paul J. <pj...@sg...> - 2004-10-07 14:33:15
|
Rick wrote: > > Once you have the exclusive set in your example, wouldn't the existing > functionality of CKRM provide you all the functionality the other > non-exclusive sets require? > > Seems to me, we need a way to *restrict use* of certain resources > (exclusive) and a way to *share use* of certain resources (non-exclusive.) > CKRM does the latter right now, I believe, but not the former. I'm losing you right at the top here, Rick. Sorry. I'm no CKRM wizard, so tell me if I'm wrong. But doesn't CKRM provide a way to control what percentage of the compute cycles are available from a pool of cycles? And don't cpusets provide a way to control which physical CPUs a task can or cannot use? For workloads of relatively independent tasks, it might not matter (though even that is dubious - see my cache comments, below). For parallel threaded apps with rapid synchronization between the threads, as one gets with say OpenMP or MPI, there's a world of difference. Giving both threads in a 2-way application of this kind 50% of the cycles on each of 2 processors can be an order of magnitude slower than giving each thread 100% of one processor. Similarly, the variability of runtimes for such threads pinned on distinct processors can be an order of magnitude less than for floating threads. For shared resource environments where one is purchasing time on your own computer, there's also a world of difference. In many cases one has paid (whether in real money to another company, or in inter-departmental funny money - doesn't matter a whole lot here) money for certain processor power, and darn well expects those processors to sit idle if you don't use them. And the vendor (whether your ISP or your MIS department) of these resources can't hide the difference. Your work runs faster and with dramatically more consistent runtimes if the entire processor/memory units are yours, all yours, whether you use them or not. The cache effects matter here as well. Unlike the 6800 I first learned to program, not all cycles are created equally on the fancy processors of today. There is layer upon layer of caching, trying to compensate for the enormous speed difference between the internal cpu clock and the speed of external ram, and again between the speed of the ram and that of the disk. A useful compute cycle on a hot cpu can be hundreds or thousands of times faster than one on a cold cpu (hot - you've been running there; cold - you haven't been). There is a fundamental difference between controlling which physical processors on an SMP or NUMA system one may use, and adding delays to the tasks of select users to ensure they don't use too much. In the experience of SGI, and I hear tell of other companies, workload management by fair share techniques (add delays to tasks exceeding their allotment) has been found to be dramatically less useful to customers, year after year, in comparison to having a means to control on which CPUs tasks may be scheduled, and on which Memory Nodes they may obtain pages of ram. > * There is no clear policy on how to amiably create an exclusive set. > The main problem is what to do with the tasks already there. There is a policy that works well, and that those of us in this business have been using for years. When the system boots, you put everything that doesn't need to be pinned elsewhere in a bootcpuset, and leave the rest of the system dark. 
You then, whether by manual administrative techniques or a batch scheduler, hand out dedicated sets of CPU and Memory to jobs, which get exclusive use of those compute resources (or controlled sharing with only what you intentionally let share). > * If we provide a means for creating exclusive sets, I haven't heard > a good reason why CKRM can't manage this. Unless someone has rewritten CKRM behind my back to do the pinning of cpusets, it doesn't do that. That's why CKRM can't do this. Consider the following analogy. Many of us reading this have two cars in the driveway - our car and our wife's car (fewer will have a husband's car, granted). When we go out to get in our car, we don't interchangeably take whichever one is closest to the street, we take our car. This is an example of dedicated use. If we don't drive someplace that day, then our car just sits there, unused. This car use pattern is like cpusets. CKRM is more like the taxi service in New York. All those yellow cars are interchangeable in my mind. I take the first one I can get. I make sure to leave no personal possession behind when I leave it, because I have no prospect of ever seeing that yellow car again. Somewhere there has been a serious disconnect here. Either I seriously failed to understand what you wrote, or one of us is seriously confused as to the differences between cpusets and CKRM. Where do you think that the disconnect lies? The difference between cpusets and CKRM is not about restricting versus sharing. Rather cpusets is about controlled allocation of big, named chunks of a computer - certain numbered CPUs and Memory Nodes allocated by number. CKRM is about enforcing the rate of usage of anonymous, fungible resources such as cpu cycles and memory pages. Unfortunately for CKRM, on modern system architectures of two or more CPUs, cycles are not interchangeable and fungible, due to the caching. On NUMA systems, which is the norm for all vendors above 10 or 20 CPUs (due to our inability to make a backplane fast enough to handle more), memory pages are not interchangeable and fungible either. If you made other good points, you will have to repeat them. I got lost in the disconnect. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
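To make Paul's two-way example concrete: the placement he argues for amounts to giving each thread a whole CPU rather than half of two. A minimal sketch, assuming the schedutils taskset utility is available and the thread pids are known (both pid variables here are hypothetical); the same affinity masks could equally be set directly with sched_setaffinity(2).

    # Pin each thread of a hypothetical 2-way OpenMP/MPI job to its own CPU.
    taskset -p 0x1 $THREAD_A_PID    # thread A may run only on CPU 0
    taskset -p 0x2 $THREAD_B_PID    # thread B may run only on CPU 1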
From: Rick L. <ric...@us...> - 2004-10-07 19:08:51
|
> Once you have the exclusive set in your example, wouldn't the existing > functionality of CKRM provide you all the functionality the other > non-exclusive sets require? > > Seems to me, we need a way to *restrict use* of certain resources > (exclusive) and a way to *share use* of certain resources (non-exclusive.) > CKRM does the latter right now, I believe, but not the former. I'm losing you right at the top here, Rick. Sorry. I'm no CKRM wizard, so tell me if I'm wrong. But doesn't CKRM provide a way to control what percentage of the compute cycles are available from a pool of cycles? And don't cpusets provide a way to control which physical CPUs a task can or cannot use? Right. And what I'm hearing is that if you're a job running in a set of shared resources (i.e., non-exclusive) then by definition you are *not* a job who cares about which processor you run on. I can't think of a situation where I'd care about the physical locality, and the proximity of memory and other nodes, but NOT care that other tasks might steal my cycles. For parallel threaded apps with rapid synchronization between the threads, as one gets with say OpenMP or MPI, there's a world of difference. Giving both threads in a 2-way application of this kind 50% of the cycles on each of 2 processors can be an order of magnitude slower than giving each thread 100% of one processor. Similarly, the variability of runtimes for such threads pinned on distinct processors can be an order of magnitude less than for floating threads. Ah, so you want processor affinity for the tasks, then, not cpusets. For shared resource environments where one is purchasing time on your own computer, there's also world of difference. In many cases one has paid (whether in real money to another company, or in inter-departmental funny money - doesn't matter a whole lot here) money for certain processor power, and darn well expects those processors to sit idle if you don't use them. One does? No, in my world, there's constant auditing going on and if you can get away with having a machine idle, power to ya, but chances are somebody's going to come and take away at least the cycles and maybe the whole machine for somebody yammering louder than you about their budget cuts. You get first cut, but if you're not using it, you don't get to sit fat and happy. And the vendor (whether your ISP or your MIS department) of these resources can't hide the difference. Your work runs faster and with dramatically more consistent runtimes if the entire processor/memory units are yours, all yours, whether you use them or not. When I'm not using them, my work doesn't run faster. It just doesn't run. There is a fundamental difference between controlling which physical processors on an SMP or NUMA system one may use, and adding delays to the tasks of select users to ensure they don't use too much. In the experience of SGI, and I hear tell of other companies, workload management by fair share techniques (add delays to tasks exceeding their allotment) has been found to be dramatically less useful to customers, Less useful than ... what? As a substitute for exclusive access to one or more cpus, which currently is not possible? I can believe that. But you're saying these companies didn't size their tasks properly to the cpus they had allocated and yet didn't require exclusivity? How would non-exclusive sets address this human failing? You have 30 cpus' worth of tasks to run on 24 cpus. Somebody will take a hit, right, whether CKRM or cpusets are managing those 24 cpus? 
> * There is no clear policy on how to amiably create an exclusive set. > The main problem is what to do with the tasks already there. There is a policy, that works well, and those of us in this business have been using for years. When the system boots, you put everything that doesn't need to be pinned elsewhere in a bootcpuset, and leave the rest of the system dark. You then, whether by manual administrative techniques or a batch scheduler, hand out dedicated sets of CPU and Memory to jobs, which get exclusive use of those compute resources (or controlled sharing with only what you intentionally let share). This presumes you know, at boot time, how you want things divided. All of your examples so far have seemed to indicate that policy changes may well be made *after* boot time. So I'll rephrase: any time you create an exclusive set after boot time, you may find tasks already running there. I suggested one policy for dealing with them. The difference between cpusets and CKRM is not about restricting versus sharing. Rather cpusets is about controlled allocation of big, named chunks of a computer - certain numbered CPUs and Memory Nodes allocated by number. CKRM is about enforcing the rate of usage of anonymous, fungible resources such as cpu cycles and memory pages. Unfortunately for CKRM, on modern system architectures of two or more CPUs, cycles are not interchangeable and fungible, due to the caching. On NUMA systems, which is the norm for all vendors above 10 or 20 CPUs (due to our inability to make a backplane fast enough to handle more) memory pages are not interchangeable and fungible either. CKRM is not going to merrily move tasks around just because it can, either, and it will still adhere to common scheduling principles regarding cache warmth and processor affinity. You use the example of a two car family, and preferring one over the other. I'd turn that around and say it's really two exclusive sets of one car each, rather than a shared set of two cars. In that example, do you ask your wife before you take "her" car, or do you just take it because it's a shared resource? I know how it works in *my* family :) You've given a convincing argument for the exclusive side of things. But my point is that on the non-exclusive side the features you claim to need seem in conflict: if the cpu/memory linkage is important to job predictability, how can you then claim it's ok to share it with anybody, even a "friendly" task? If it's ok to share, then you've just thrown predictability out the window. The cpu/memory linkage is interesting, but it won't drive the job performance anymore. I'm trying to nail down requirements. I think we've nailed down the exclusive one. It's real, and it's currently unmet. The code you've written looks to provide a good base upon which to meet that requirement. On the non-exclusive side, I keep hearing conflicting information about how layout is important for performance but it's ok to share with arbitrary jobs -- like sharing won't affect performance? Rick |
From: Paul J. <pj...@sg...> - 2004-10-10 02:30:44
|
Rick wrote: > One does? No, in my world, there's constant auditing going on and if > you can get away with having a machine idle, power to ya, but chances > are somebody's going to come and take away at least the cycles and maybe I don't doubt that such worlds as yours exist, nor that you live in one. In some of the worlds my customers live in, they have been hit so many times with the pains of performance degradation and variation due to unwanted interaction between applications that they get nervous if a supposedly unused CPU or Memory looks to be in use. Just the common use by Linux of unused memory to keep old pages in cache upsets them. And, perhaps more to the point, while indeed some other department may soon show up to make use of those lost cycles, the computer had jolly well better leave those cycles lost _until_ the customer decides to use them. Unlike the computer in my dentist's office, which should "just do it", maximizing throughput as best it can, not asking any questions, the computers in some of my customers' high end shops are managed more tightly (sometimes very tightly) and they expect to control load placement. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Hubertus F. <fr...@wa...> - 2004-10-07 20:18:38
|
Just catching up again on the thread (sorry for long absence due to ....). This msg seems as good as any other to respond to. Paul Jackson wrote: > Rick wrote: > >>Once you have the exclusive set in your example, wouldn't the existing >>functionality of CKRM provide you all the functionality the other >>non-exclusive sets require? >> >>Seems to me, we need a way to *restrict use* of certain resources >>(exclusive) and a way to *share use* of certain resources (non-exclusive.) >>CKRM does the latter right now, I believe, but not the former. > The way this is heading is quite promising. - sched_domains seems the right answer wrt partitioning the machine. Given some boot option or dynamic means one can shift cpus from one domain to the next domain. - If I understood correctly, there would be only one level of such hard partitioning, that is, an exclusive cpu-set or sched_domain. - In each such domain/set we allow shared *use*. If that is the model then CKRM cpu scheduling can certainly be deployed into it and be the agent of that model. First, one needs to understand that sched_domains are nothing else but a set of cpus that are considered during load balancing times. By constricting the top_domain to the respective exclusive set one essentially has accomplished the desired feature of partitioning the machines into "isolated" sections (here from cpu perspective). So it is quite possible that an entire domain is empty based, while another exclusive domain would be totally overloaded. That said, if we associate a domain with a toplevel taskclass one would end up with potentially the following hierarchy for a system:

    /rcfs/taskclass
        domain-1
            cls-1
            cls-2
        domain-2
            cls-x
            cls-y
        domain-3
        :

We can then associate a "config" with these domains setting the appropriate cpu_set/group they are comprised of:

    # echo "res=cpu,cpuset=4..7" > /rcfs/taskclass/domain-1
    # echo "res=mem,memset=node1" > /rcfs/taskclass/domain-2

Doing so would accomplish the following: (a) the cpu-controller would create/modify the domain, while observing the exclusiveness constraints among its children; (b) it recognizes that its share specifications are relative to the size of the domain; (c) it ensures that load-balancing (we do that differently right now) is accomplished among the cpus of the class; and (d) accounting is relative to the share. NOTE (as I tried to specify earlier): No class can span multiple domains. > There is a fundamental difference between controlling which physical > processors on an SMP or NUMA system one may use, and adding delays > to the tasks of select users to ensure they don't use too much. > > In the experience of SGI, and I hear tell of other companies, workload > management by fair share techniques (add delays to tasks exceeding their > allotment) has been found to be dramatically less useful to customers, > year after year, in comparison to having a means to control on which > CPUs tasks may be scheduled, and on which Memory Nodes they may obtain > pages of ram. Paul, you simply assume that the two goals are exclusive of each other. That is simply not true. As in the hierarchy above, which does exactly represent what you describe below, shares will be relative to the domain level if the cpuset config is set. So specifying 100% for domain-1 means 100% of its CPUs. > > >> * There is no clear policy on how to amiably create an exclusive set. >> The main problem is what to do with the tasks already there. > > > There is a policy, that works well, and those of us in this business > have been using for years. 
When the system boots, you put everything > that doesn't need to be pinned elsewhere in a bootcpuset, and leave > the rest of the system dark. You then, whether by manual administrative > techniques or a batch scheduler, hand out dedicated sets of CPU and > Memory to jobs, which get exclusive use of those compute resources > (or controlled sharing with only what you intentionally let share). That approach above will work. > > >> * If we provide a means for creating exclusive sets, I haven't heard >> a good reason why CKRM can't manage this. Exactly .. and on top of it CKRM /rcfs can be used to provide the API to do so (see above). > > > Unless someone has rewritten CKRM behind my back to do the pinning > of cpusets, it doesn't do that. That's why CKRM can't do this. > > Consider the following analogy. Many of us reading this have two > cars in the driveway - our car and our wife's car (fewer will have > a husband's car, granted). When we go out to get in our car, we > don't interchangebly take whichever one is closest to the street, > we take our car. This is an example of dedicated use. If we don't > drive someplace that day, then our car just sits there, unused. > This car use pattern is like cpusets. OK .. since you like analogies. When I picked up my puppy from the kennel all it did was eat and shit and not much more. Well I didn't sit there thinking "Ohh my god, all my puppy is going to do is go eat and shit all its life", I started teaching it new tricks. That "trick" is laid out above, binding domains with CKRM cpu scheduling to give exactly what one wants and using the CKRM interface to create the domains. And yes, if within a domain you pin your task to a particular cpu, the CKRM cpu scheduler will adhere to that. That is no different than in today's scheduler. If you have one large domain of 128 cpus the CPU scheduler today will try to load balance these as well only to fail to see that certain tasks can't be moved... As Matt or Rick pointed out some messages back, using cpu_mask as the basic mechanism to control affinity system wide (that's what cpusets does) is the wrong approach. So going with the dynamic domains as the basic mechanism to cut up your global system into smaller parts and then using pinning within is certainly more scalable (particularly at load balancing time). 
Once you open your mind to the fact that CKRM can be deployed within a subset of disconnected resources (cpu domains) and manages shares independently within that domain, I truly don't see what the problem is. You seem to insist that because CKRM is currently system wide it stays that way. This is a prototype implementation. As outlined above, this is quite feasible to do. > > Unfortunately for CKRM, on modern system architectures of two or more > CPUs, cycles are not interchangeable and fungible, due to the caching. > On NUMA systems, which is the norm for all vendors above 10 or 20 CPUs > (due to our inability to make a backplane fast enough to handle more) > memory pages are not interchangeable and fungible either. Just to let you know, I was involved in many NUMA projects, I wrote IBM SP1's MPI implementation and Gang Scheduler some time back, so I am intimately familiar with all these issues. Please don't assume ignorance about these things at our end. > > If you made other good points, you will have to repeat them. I got > lost in the disconnect. > The only thing that was unclear to me so far was whether one needs true hierarchies in cpusets; based on the discussion threads and your own example, that does not seem to be necessary. -- Hubertus |
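Pulling Hubertus's proposal together, the resulting configuration might look something like the sketch below. Note that the directory layout and the res= syntax are the hypothetical interface he describes above, not an existing CKRM API, and the shares file syntax here is likewise invented purely for illustration.

    # Hypothetical: bind each top-level taskclass/domain to its own CPUs
    # (syntax as proposed by Hubertus above, not an existing API).
    echo "res=cpu,cpuset=0..3" > /rcfs/taskclass/domain-1
    echo "res=cpu,cpuset=4..7" > /rcfs/taskclass/domain-2

    # Classes under domain-1; their shares are relative to domain-1's
    # four CPUs, not to the whole machine (share syntax invented here).
    mkdir /rcfs/taskclass/domain-1/cls-1
    mkdir /rcfs/taskclass/domain-1/cls-2
    echo "res=cpu,share=75" > /rcfs/taskclass/domain-1/cls-1/shares
    echo "res=cpu,share=25" > /rcfs/taskclass/domain-1/cls-2/shares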
From: Matthew D. <col...@us...> - 2004-10-09 00:23:42
|
On Thu, 2004-10-07 at 13:12, Hubertus Franke wrote: > The way this is heading is quite promising. > - sched_domains seems the right answer wrt to partitioning the machine. > Given some boot option or dynamic means one can shift cpus from > on domain to the next domain. > - If I understood correctly, there would be only one level of such > hard partitioning, speak exclusive cpu-set or sched_domain. > - In each such domain/set we allow shared *use*. I don't think that there needs to be a requirement that we have only one level of hard partitioning. The rest of your points are valid though, Hubertus. It'd be really nice if we could all get together with a wall of whiteboards, some markers, and a few pots of coffee. I think we'd all get this pretty much hashed out in an hour or two. This isn't directed at you, Hubertus, but at the many communication breakdowns in this thread. > First, one needs to understand that sched_domains are nothing else > but a set of cpus that are considered during load balancing times. > By constricting the top_domain to the respective exclusive set one > essentially has accomplished the desired feature of partitioning > the machines into "isolated" sections (here from cpu perspective). > So it is quite possible that an entire domain is empty based, while > another exclusive domain would be totally overloaded. I think that is very well stated, Hubertus. By having a more or less passive data structure that is only checked at load balance time, we can ensure (in a very light-weight way) that no task ever moves *out* of its area nor moves *into* someone else's area. -Matt |
From: Matthew D. <col...@us...> - 2004-10-09 00:07:46
|
On Thu, 2004-10-07 at 07:28, Paul Jackson wrote: > Rick wrote: > > * There is no clear policy on how to amiably create an exclusive set. > > The main problem is what to do with the tasks already there. > > There is a policy, that works well, and those of us in this business > have been using for years. When the system boots, you put everything > that doesn't need to be pinned elsewhere in a bootcpuset, and leave > the rest of the system dark. You then, whether by manual administrative > techniques or a batch scheduler, hand out dedicated sets of CPU and > Memory to jobs, which get exclusive use of those compute resources > (or controlled sharing with only what you intentionally let share). No one is trying to take that away. There is nothing that says you can't boot with a small, 1-2 CPU 'boot' domain where you stick all those tasks you typically put in a 'boot' cpuset. <offtopic> In fact, people have talked before about reducing boot times by booting only a single CPU, then bringing the rest online later. This work could potentially facilitate that. Boot up just a single 'boot' CPU. All 'boot' tasks would necessarily be stuck there. Create a new 'work' domain and add (hotplug on) CPUs into that domain to your heart's content. </offtopic> -Matt |
From: Paul J. <pj...@sg...> - 2004-10-10 02:18:19
|
Rick replying to Paul: > > But doesn't CKRM provide a way to control what percentage of the > > compute cycles are available from a pool of cycles? > > > > And don't cpusets provide a way to control which physical CPUs a > > task can or cannot use? > > Right. I am learning (see other messages of the last couple days on this thread) that CKRM is supposed to be a general purpose workload manager framework, and that fair share scheduling (managing percentage of compute cycles) just happens to be the first instance of such a manager. > And what I'm hearing is that if you're a job running in a set of shared > resources (i.e., non-exclusive) then by definition you are *not* a job > who cares about which processor you run on. I can't think of a situation > where I'd care about the physical locality, and the proximity of memory > and other nodes, but NOT care that other tasks might steal my cycles. There are at least these situations: 1) proximity to special hardware (graphics, networking, storage, ...) 2) non-dedicated tightly coupled multi-threaded apps (OpenMP, MPI) 3) batch managers switching resources between jobs On (2), if say you want to run eight copies of an application, on a system that only has eight CPUs, where each copy of the app is an eight-way tightly coupled app, they will go much faster if each app is placed across all 8 CPUs, one thread per CPU, than if they are placed willy-nilly. Or a bit more realistically, if you have a random input queue of such tightly coupled apps, each with a predetermined number of threads between one and eight, you will get more work done by pinning the threads of any given app on different CPUs. The users submitting the jobs may well not care which CPUs are used for their job, but an intermediate batch manager probably will care, as it may be solving the knapsack problem of how to fit a stream of varying sized jobs onto a given size of hardware. On (3), a batch manager might say have two small cpusets, and also one larger cpuset that is the two small ones combined. It might run one job in each of the two small cpusets for a while, then suspend these two jobs, in order to run a third job in the larger cpuset. The two small cpusets don't go away while the third job runs -- you don't want to lose or have to tear down and rebuild the detailed inter-cpuset placement of the two small jobs while they are suspended. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Matthew D. <col...@us...> - 2004-10-11 22:11:30
|
On Sat, 2004-10-09 at 19:15, Paul Jackson wrote: > Rick replying to Paul: > > And what I'm hearing is that if you're a job running in a set of shared > > resources (i.e., non-exclusive) then by definition you are *not* a job > > who cares about which processor you run on. I can't think of a situation > > where I'd care about the physical locality, and the proximity of memory > > and other nodes, but NOT care that other tasks might steal my cycles. > > There are at least these situations: > 1) proximity to special hardware (graphics, networking, storage, ...) > 2) non-dedicated tightly coupled multi-threaded apps (OpenMP, MPI) > 3) batch managers switching resources between jobs > > On (2), if say you want to run eight copies of an application, on a > system that only has eight CPUs, where each copy of the app is an > eight-way tightly coupled app, they will go much faster if each app is > placed across all 8 CPUs, one thread per CPU, than if they are placed > willy-nilly. Or a bit more realistically, if you have a random input > queue of such tightly coupled apps, each with a predetermined number of > threads between one and eight, you will get more work done by pinning > the threads of any given app on different CPUs. The users submitting > the jobs may well not care which CPUs are used for their job, but an > intermediate batch manager probably will care, as it may be solving the > knapsack problem of how to fit a stream of varying sized jobs onto a > given size of hardware. > > On (3), a batch manager might say have two small cpusets, and also one > larger cpuset that is the two small ones combined. It might run one job > in each of the two small cpusets for a while, then suspend these two > jobs, in order to run a third job in the larger cpuset. The two small > cpusets don't go away while the third job runs -- you don't want to lose > or have to tear down and rebuild the detailed inter-cpuset placement of > the two small jobs while they are suspended. I think these situations, particularly the first two, are the times you *want* to use the cpus_allowed mechanism. Pinning a specific thread to a specific processor (cases (1) & (2)) is *exactly* why the cpus_allowed mechanism was put into the kernel. And (3) can pretty easily be achieved by using a combination of sched_domains and cpus_allowed. In your example of one 4 CPU cpuset and two 2 CPU sub cpusets (cpu-subsets? :), one could easily create a 4 CPU domain for the larger job and two 2 CPU domains for the smaller jobs. Those two 2 CPU subdomains can be created & destroyed at will, or they could be simply tagged as "exclusive" when you don't want tasks moving back and forth between them, and tagged as "non-exclusive" when you want tasks to be freely balanced across all 4 CPUs in the larger parent domain. One of the cool things about using sched_domains as your partitioning element is that in reality, tasks run on *CPUs*, not *domains*. So if you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can suspend threads a1, a2, b1 & b2 and remove the domains they were running in to allow job A (big job with threads A1, A2, A3, & A4) to run on the larger 4 CPU domain. When you then suspend A1-A4 again to allow the smaller jobs to proceed, you can pretty trivially create the 2 CPU domains underneath the 4 CPU domain and resume the jobs. 
Those jobs (a & b) have been suspended on the CPUs they were originally running on, and thus will resume on the same CPUs without any extra effort. They will simply run on those CPUs, and at load balance time, the domains attached to those CPUs will be consulted to determine where the tasks can be relocated to if there is a heavy load. The domains will tell the scheduler that the tasks cannot be relocated outside the 2 CPUs in each respective domain. Viola! (sorta ;) -Matt |
From: Paul J. <pj...@sg...> - 2004-10-11 23:02:24
|
Matthew wrote: > One of the cool thing about using sched_domains as your partitioning > element is that in reality, tasks run on *CPUs*, not *domains*. Unfortunately, my manager has reminded me of an essential deliverable that I have for another project, due in two weeks. I'm going to need every one of those days. So I will have to take a two week sabbatical from this design work. It might make sense to reconvene this work on a new thread, with a last message on this monster thread inviting all interested parties to come on over. I suspect a few folks will be happy to see this thread wind down. I'd guess lse-tech (my preference) or ckrm-tech would be a suitable forum for this new thread. Carry on. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Matthew D. <col...@us...> - 2004-10-12 21:23:37
|
On Mon, 2004-10-11 at 15:58, Paul Jackson wrote: > Matthew wrote: > > One of the cool thing about using sched_domains as your partitioning > > element is that in reality, tasks run on *CPUs*, not *domains*. > > Unfortunately, my manager has reminded me of an essential deliverable > that I have for another project, due in two weeks. I'm going to need > every one of those days. So I will have to take a two week sabbatical > from this design work. > > It might make sense to reconvene this work on a new thread, with a last > message on this monster thread inviting all interested parties to come > on over. I suspect a few folks will be happy to see this thread wind > down. > > I'd guess lse-tech (my preference) or ckrm-tech would be a suitable > forum for this new thread. > > Carry on. Sounds good, Paul. I think the discussion on this thread was kind of winding down anyway. In two weeks I'll have some more work done on my code, particularly trying to get the cpusets/CKRM filesystem interface to play with my sched_domains code. We'll be able to digest all the information, requirements, requests, etc. on this thread and start a fresh discussion on (or at least closer to) the same page. -Matt |
From: Simon D. <Sim...@bu...> - 2004-10-12 08:53:39
|
> One of the cool thing about using sched_domains as your partitioning > element is that in reality, tasks run on *CPUs*, not *domains*. So if > you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and > threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can > suspend threads a1, a2, b1 & b2 and remove the domains they were running > in to allow job A (big job with threads A1, A2, A3, & A4) to run on the > larger 4 CPU domain. When you then suspend A1-A4 again to allow the > smaller jobs to proceed, you can pretty trivially create the 2 CPU > domains underneath the 4 CPU domain and resume the jobs. Those jobs (a > & b) have been suspended on the CPUs they were originally running on, > and thus will resume on the same CPUs without any extra effort. They > will simply run on those CPUs, and at load balance time, the domains > attached to those CPUs will be consulted to determine where the tasks > can be relocated to if there is a heavy load. The domains will tell the > scheduler that the tasks cannot be relocated outside the 2 CPUs in each > respective domain. Viola! (sorta ;) Voilà ;-) I agree that this looks really smooth from a scheduler point of view. From a user point of view, there remains the issue of suspending the tasks: - find which tasks to suspend: how do you know that job 'a' consists exactly of 'a1' and 'a2'? - suspend them (btw, how do you achieve this? kill -STOP?) I've been away from my mail and still trying to catch up, never mind if the above does not make sense to you. Simon. |
From: Matthew D. <col...@us...> - 2004-10-12 21:26:52
|
On Tue, 2004-10-12 at 01:50, Simon Derr wrote: > > One of the cool thing about using sched_domains as your partitioning > > element is that in reality, tasks run on *CPUs*, not *domains*. So if > > you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and > > threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can > > suspend threads a1, a2, b1 & b2 and remove the domains they were running > > in to allow job A (big job with threads A1, A2, A3, & A4) to run on the > > larger 4 CPU domain. When you then suspend A1-A4 again to allow the > > smaller jobs to proceed, you can pretty trivially create the 2 CPU > > domains underneath the 4 CPU domain and resume the jobs. Those jobs (a > > & b) have been suspended on the CPUs they were originally running on, > > and thus will resume on the same CPUs without any extra effort. They > > will simply run on those CPUs, and at load balance time, the domains > > attached to those CPUs will be consulted to determine where the tasks > > can be relocated to if there is a heavy load. The domains will tell the > > scheduler that the tasks cannot be relocated outside the 2 CPUs in each > > respective domain. Viola! (sorta ;) > Voilà ;-) hehe... My French spelling obviously isn't quite up to par. ;) > I agree that this looks really smooth from a scheduler point of view. > > From a user point of view, remains the issue of suspending the tasks: > -find which tasks to suspend : how do you know that job 'a' consists > exactly of 'a1' and 'a2' > -suspend them (btw, how do you achieve this ? kill -STOP ?) > > I've been away from my mail and still trying to catch up, nevermind if the > above does not makes sense to you. > > Simon. Paul didn't go into specifics about how to suspend the job, so neither did I. Sending SIGSTOP & SIGCONT should work, as you mention... Those are implementation details which really aren't *that* important to the discussion. We're still trying to figure out the overall framework and API to work with, so which method of suspending a thread we'll eventually use can be tackled down the road. :) -Matt |
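For what it's worth, the suspend step Simon asks about is straightforward if the jobs are attached to cpusets as in Paul's earlier hpcarena example, since the cpuset's tasks file then answers the "which pids belong to job 'a'" question; with bare sched_domains the batch manager would have to keep its own list of pids. A hedged sketch using the hypothetical paths from that example:

    # Suspend every task attached to small job 'a' ...
    for pid in $(cat /dev/cpuset/hpcarena/testing/test123/a/tasks); do
        kill -STOP $pid
    done
    # ... run the big job in the parent cpuset, then resume job 'a'.
    for pid in $(cat /dev/cpuset/hpcarena/testing/test123/a/tasks); do
        kill -CONT $pid
    done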
From: Martin J. B. <mb...@ar...> - 2004-10-07 14:43:22
|
> * Interrupts are not under consideration right now. They land where > they land, and this may affect exclusive sets. If this is a > problem, for now, you simply lay out your hardware and exclusive > sets more intelligently. They're easy to fix, just poke the values in /proc appropriately (same as cpus_allowed, exactly). > * Memory allocation has a tendency and preference, but no hard policy > with regards to where it comes from. A task which starts on one > part of the system but moves to another may have all its memory > allocated relatively far away. In unusual cases, it may acquire > remote memory because that's all that's left. A memory allocation > policy similar to cpus_allowed might be needed. (Martin?) The membind API already does this. M. |
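For reference, the userspace pokes Martin refers to look roughly like this; the IRQ number and node number are hypothetical, and numactl stands in here for the underlying mbind()/set_mempolicy() calls of the membind API.

    # Steer IRQ 19 to CPUs 0-1 by writing a hex cpumask into /proc.
    echo 3 > /proc/irq/19/smp_affinity

    # Start a job with its memory allocations restricted to node 1.
    numactl --membind=1 ./my_hpc_job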
From: Simon D. <Sim...@bu...> - 2004-10-07 12:53:18
|
On Thu, 7 Oct 2004, Paul Jackson wrote: > > I don't see what non-exclusive cpusets buys us. > > One can nest them, overlap them, and duplicate them ;) I would also add, if the decision comes to make 'real exclusive' cpusets, my previous example as a use for non-exclusive cpusets: we are running jobs that need to be 'mostly' isolated on some part of the system, and run in a specific location. We use cpusets for that. But we can't afford to dedicate a part of the system for administrative tasks (daemons, init..). These tasks should not be put inside one of the 'exclusive' cpusets, even temporarily: they do not belong there. They should just be allowed to steal a few cpu cycles from time to time: non-exclusive cpusets are the way to go. |
From: Martin J. B. <mb...@ar...> - 2004-10-07 14:52:02
|
> On Thu, 7 Oct 2004, Paul Jackson wrote: > >> > I don't see what non-exclusive cpusets buys us. >> >> One can nest them, overlap them, and duplicate them ;) > > I would also add, if the decision comes to make 'real exclusive' cpusets, > my previous example, as a use for non-exclusive cpusets: > > we are running jobs that need to be 'mostly' isolated on some part of the > system, and run in a specific location. We use cpusets for that. But we > can't afford to dedicate a part of the system for administrative tasks > (daemons, init..). These tasks should not be put inside one of the > 'exclusive' cpusets, even temporary : they do not belong there. They > should just be allowed to steal a few cpu cycles from time to time : non > exclusive cpusets are the way to go. That makes no sense to me whatsoever, I'm afraid. Why if they were allowed "to steal a few cycles" are they so fervently banned from being in there? You can keep them out of your userspace management part if you want. So we have the purely exclusive stuff, which needs kernel support in the form of sched_domains alterations. The rest of cpusets is just poking and prodding at cpus_allowed, the membind API, and the irq binding stuff. All of which you could do from userspace, without any further kernel support, right? Or am I missing something? M. |
From: Paul J. <pj...@sg...> - 2004-10-07 17:56:42
|
Martin wrote: > > So we have the purely exclusive stuff, which needs kernel support in the form > of sched_domains alterations. The rest of cpusets is just poking and prodding > at cpus_allowed, the membind API, and the irq binding stuff. All of which > you could do from userspace, without any further kernel support, right? > Or am I missing something? Well ... we're gaining. A couple of days ago you were suggesting that cpusets could be replaced with some exclusive domains plus CKRM. Now it's some exclusive domains plus poking the affinity masks. Yes - you're still missing something. But I must keep in mind that I had concluded, perhaps three years ago, just what you conclude now: that cpusets is just poking some affinity masks, and that I could do most of it from user land. The result ended up missing some important capabilities. User level code could not manage collections of hardware nodes (sets of CPUs and Memory Nodes) in a co-ordinated and controlled manner. The users of cpusets need to have system wide names for them, with permissions for viewing, modifying and attaching to them, and with the ability to list both what hardware (CPUs and Memory) is in a cpuset, and what tasks are attached to a cpuset. As is usual in such operating systems, the kernel manages such system wide synchronized controlled access views. As I quote below, I've been saying this repeatedly. Could you tell me, Martin, whether the disconnect is: 1) that you didn't yet realize that cpusets provided this model (names, permissions, ...) or 2) you don't think such a model is useful, or 3) you think that such a model can be provided sensibly from user space? If I knew this, I could focus my response better. The rest of this message is just quotes from this last week - many can stop reading here. === Date: Fri, 1 Oct 2004 23:06:44 -0700 From: Paul Jackson <pj...@sg...> Even the flat model (no hierarchy) uses require some way to name and control access to cpusets, with distinct permissions for examining, attaching to, and changing them, that can be used and managed on a system wide basis. === Date: Sat, 2 Oct 2004 12:14:30 -0700 From: Paul Jackson <pj...@sg...> And our customers _do_ want to manage these logically isolated chunks as named "virtual computers" with system managed permissions and integrity (such as the system-wide attribute of "Exclusive" ownership of a CPU or Memory by one cpuset, and a robust ability to list all tasks currently in a cpuset). === Date: Sat, 2 Oct 2004 19:26:03 -0700 From: Paul Jackson <pj...@sg...> Consider the following use case scenario, which emphasizes this isolation aspect (and ignores other requirements, such as the need for system admins to manage cpusets by name [some handle valid across process contexts], with a system wide imposed permission model and exclusive use guarantees, and with a well defined system supported notion of which tasks are "in" which cpuset at any point in time). === Date: Sun, 3 Oct 2004 18:41:24 -0700 From: Paul Jackson <pj...@sg...> SGI makes heavy and critical use of the cpuset facilities on both Irix and Linux that have been developed since pset. These facilities handle both cpu and memory placement, and provide the essential kernel support (names and permissions and operations to query, modify and attach) for a system wide administrative interface for managing the resulting sets of CPUs and Memory Nodes. === Date: Tue, 5 Oct 2004 02:17:36 -0700 From: Paul Jackson <pj...@sg...> To: "Martin J. 
Bligh" <mb...@ar...> The /dev/cpuset pseudo file system api was chosen because it was convenient for small scale work, learning and experimentation, because it was a natural for the hierarchical name space with permissions that I required, and because it was convenient to leverage existing vfs structure in the kernel. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Martin J. B. <mb...@ar...> - 2004-10-07 18:14:35
|
>> So we have the purely exclusive stuff, which needs kernel support in the form >> of sched_domains alterations. The rest of cpusets is just poking and prodding >> at cpus_allowed, the membind API, and the irq binding stuff. All of which >> you could do from userspace, without any further kernel support, right? >> Or am I missing something? > > Well ... we're gaining. A couple of days ago you were suggesting > that cpusets could be replaced with some exclusive domains plus > CKRM. > > Now it's some exclusive domains plus poking the affinity masks. > > Yes - you're still missing something. > > But I must keep in mind that I had concluded, perhaps three years ago, > just what you conclude now: that cpusets is just poking some affinity > masks, and that I could do most of it from user land. The result ended > up missing some important capabilities. User level code could not > manage collections of hardware nodes (sets of CPUs and Memory Nodes) in > a co-ordinated and controlled manner. > > The users of cpusets need to have system wide names for them, with > permissions for viewing, modifying and attaching to them, and with the > ability to list both what hardware (CPUs and Memory) in a cpuset, and > what tasks are attached to a cpuset. As is usual in such operating > systems, the kernel manages such system wide synchronized controlled > access views. > > As I quote below, I've been saying this repeatedly. Could you > tell me, Martin, whether the disconnect is: > 1) that you didn't yet realize that cpusets provided this model (names, > permissions, ...) or > 2) you don't think such a model is useful, or > 3) you think that such a model can be provided sensibly from user space? > > If I knew this, I could focus my response better. > > The rest of this message is just quotes from this last week - many > can stop reading here. My main problem is that I don't think we want lots of overlapping complex interfaces in the kernel. Plus I think some of the stuff proposed is fairly klunky as an interface (physical binding where it's mostly not needed, and yes I sort of see your point about keeping jobs on separate CPUs, though I still think it's tenuous), and makes heavy use of stuff that doesn't work well (e.g. cpus_allowed). So I'm searching for various ways to address that. The purely exclusive parts of cpusets can be implemented in a much nicer manner inside the kernel, by messing with sched_domains, instead of just using cpus_allowed as a mechanism ... so that seems like much less of a problem. The non-exclusive bits seem to overlap heavily with both CKRM and what could be done in userspace. I still think the physical stuff is rather obscure, and binding stuff to specific CPUs is an ugly way to say "I want these two threads to not run on the same CPU". But if we can find some other way (eg userspace) to allow you to do that should you utterly insist on doing so, that'd be a convenient way out. As for the names and permissions issue, both would be *doable* from userspace, though maybe not as easily as in-kernel. Names would probably be less hassle than permissions, but neither would be impossible, it seems. It all just seems like a lot of complexity for a fairly obscure set of requirements for a very limited group of users, to be honest. Some bits (eg partitioning system resources hard in exclusive sets) would seem likely to be used by a much broader audience, and thus are rather more attractive. 
But they could probably be done with a much simpler interface than the whole cpusets (BTW, did that still sit on top of PAGG as well, or is that long gone?) M. |
From: Rick L. <ric...@us...> - 2004-10-07 20:41:21
|
The users of cpusets need to have system wide names for them, with permissions for viewing, modifying and attaching to them, and with the ability to list both what hardware (CPUs and Memory) in a cpuset, and what tasks are attached to a cpuset. As is usual in such operating systems, the kernel manages such system wide synchronized controlled access views. Well, you are *asserting* the kernel will manage this. But doesn't CKRM offer this capability? The only thing it *can't* do is assure exclusivity, today .. correct? Rick |
From: Paul J. <pj...@sg...> - 2004-10-10 02:37:12
|
> The only thing it *can't* do is assure > exclusivity, today .. correct? No. Could you look back to my other posts of this last week and let us know if I've answered your query in more detail already? Thanks. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Paul J. <pj...@sg...> - 2004-10-10 05:14:32
|
> That makes no sense to me whatsoever, I'm afraid. Why if they were allowed > "to steal a few cycles" are they so fervently banned from being in there? One substantial advantage of cpusets (as in the kernel patch in *-mm's tree), over variations that "just poke the affinity masks from user space" is the task->cpuset pointer. This tracks to what cpuset a task is attached. The fork and exit code duplicates and nukes this pointer, managing the cpuset reference counter. It matters to batch schedulers and the like which cpuset a task is in, and which tasks are in a cpuset, when it comes time to do things like suspend or migrate the tasks currently in a cpuset. Just because it's ok to share a little compute time in a cpuset doesn't mean you don't care to know who is in it. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |