From: Erich F. <ef...@hp...> - 2004-10-08 22:39:44
|
On Friday 08 October 2004 16:24, Martin J. Bligh wrote:
> > On Thursday 07 October 2004 20:13, Martin J. Bligh wrote:
> >> It all just seems like a lot of complexity for a fairly obscure set of
> >> requirements for a very limited group of users, to be honest. Some bits
> >> (eg partitioning system resources hard in exclusive sets) would seem likely
> >> to be used by a much broader audience, and thus are rather more attractive.
> >
> > May I translate the first sentence to: the requirements and usage
> > models described by Paul (SGI), Simon (Bull) and myself (NEC) are
> > "fairly obscure" and the group of users addressed (those mainly
> > running high performance computing (AKA HPC) applications) is "very
> > limited"? If this is what you want to say, then it's you whose view is
> > very limited. Maybe I'm wrong about what you really wanted to say, but I
> > remember similar arguments from your side when discussing benchmark
> > results in the context of the node affine scheduler.
>
> No, I was talking about the non-exclusive part of cpusets that wouldn't
> fit inside another mechanism. The basic partitioning I have no problem
> with, and that seemed to cover most of the requirements, AFAICS.

I was hoping that I did misunderstand you ;-)

> As I've said before, the exclusive stuff makes sense, and is useful to
> a wider audience, I think. Having non-exclusive stuff whilst still
> requiring physical partitioning is what I think is obscure, won't work
> well (cpus_allowed is problematic) and could be done in userspace anyway.

Do you mean non-exclusive or simply overlapping? If you think of the
implementation through sched_domains, you really don't need a 1:1
mapping between them and cpusets. IMO one could map the sched domains
structure from the top-level cpuset down only as far as the
non-overlapping sets go. Below that you just don't use sched domains
any more and leave it to the affinity masks.

The logical setup would anyhow have a first (uppermost) level
soft-partitioning the machine; overlaps don't make sense to me here.
Then sched domains already buy you something. If soft partition 1
allows overlap in the lower levels (because we want to overcommit the
machine here and fear the OpenMP jobs which pin themselves blindly in
their cpuset), just don't continue mapping sched domains deeper. In
soft partition 2 you may not allow overlapping subpartitions, so go
ahead and map them to sched domains. It doesn't really add complexity
this way, just some IF statement.

Regards,
Erich
|
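[Editorial sketch] The "map sched domains only as deep as the sets stay
non-overlapping" idea above can be illustrated in a few lines of plain
Python. All names and structures here are invented for illustration;
this is not kernel code, just the shape of the recursion and the single
IF statement Erich refers to.

```python
# Hypothetical sketch: one sched domain per cpuset, but stop recursing
# at the first level whose sibling sets overlap; below that point,
# scheduling is left to plain cpus_allowed affinity masks.

def disjoint(children):
    seen = set()
    for _, cpus, _ in children:
        if seen & cpus:
            return False
        seen |= cpus
    return True

def build_domains(cpuset, domains):
    """cpuset = (name, cpus, children); record a domain per cpuset,
    recursing only while the children partition the parent cleanly."""
    name, cpus, children = cpuset
    domains.append(name)
    if children and disjoint(children):
        for child in children:
            build_domains(child, domains)
    # else: overlapping children -> no deeper domains (the IF statement)

top = ("top", set(range(8)), [
    # soft partition 1 allows overlapping jobs (overcommit) ...
    ("part1", {0, 1, 2, 3}, [("jobA", {0, 1, 2}, []),
                             ("jobB", {2, 3}, [])]),
    # ... soft partition 2 does not, so it maps all the way down.
    ("part2", {4, 5, 6, 7}, [("jobC", {4, 5}, []),
                             ("jobD", {6, 7}, [])]),
])

domains = []
build_domains(top, domains)
print(domains)  # part1's overlapping jobs get no domains of their own
```

Under this sketch, part1's overlapping jobs fall back to affinity
masks while part2's disjoint jobs each get their own domain.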
From: <ebi...@xm...> - 2004-10-14 10:40:07
|
"Martin J. Bligh" <mb...@ar...> writes: > My main problem is that I don't think we want lots of overlapping complex > interfaces in the kernel. Plus I think some of the stuff proposed is fairly > klunky as an interface (physical binding where it's mostly not needed, and > yes I sort of see your point about keeping jobs on separate CPUs, though I > still think it's tenuous), and makes heavy use of stuff that doesn't work > well (e.g. cpus_allowed). So I'm searching for various ways to address that. Sorry I spotted this thread late. People seem to be looking at how things are done on clusters and then apply them to numa machines. Which I agree looks totally backwards. The actual application requirement (ignoring the sucky batch schedulers) is for a group of processes (a magic process group?) to all be simultaneously runnable. On a cluster that is accomplished by having an extremely stupid scheduler place one process per machine. On a NUMA machine you can do better because you can suspend and migrate processes. The other difference on these large machines is these compute jobs that are cpu hogs will often have priority over all of the other processes in the system. A batch scheduler should be able to prevent a machine from being overloaded by simply not putting too many processes on the machine at a time. Or if a higher priority job comes in suspending all of the processes that of some lower priority job to make run for the new job. Being able to swap page tables is likely a desirable feature in that scenario so all of the swapped out jobs resources can be removed from memory. > It all just seems like a lot of complexity for a fairly obscure set of > requirements for a very limited group of users, to be honest. I think that is correct to some extent. I think the requirements are much more reasonable when people stop hanging on to the cludges they have been using because they cannot migrate jobs, or suspend sufficiently jobs to get out of the way of other jobs. 
Martin does enhancing the scheduler to deal with a group of processes that all run in lock-step, usually simultaneously computing or communicating sound sane? Where preempting one is effectively preempting all of them. I have been quite confused by this thread in that I have not seen any mechanism that looks beyond an individual processes at a time, which seems so completely wrong. Eric |
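[Editorial sketch] The suspend-to-make-room policy Eric describes is
easy to sketch in user space. The Job and Machine objects below are
invented for illustration; a real batch scheduler would send SIGSTOP
(and later SIGCONT) to every pid of the victim job rather than flip a
flag.

```python
# Toy sketch: when a higher-priority job arrives and the machine is
# full, suspend whole lower-priority jobs until the new one fits.

class Job:
    def __init__(self, name, prio, ncpus):
        self.name, self.prio, self.ncpus = name, prio, ncpus
        self.state = "new"

    def suspend(self):
        # real version: for pid in self.pids: os.kill(pid, signal.SIGSTOP)
        self.state = "suspended"

class Machine:
    def __init__(self, ncpus):
        self.ncpus, self.jobs = ncpus, []

    def free_cpus(self):
        return self.ncpus - sum(j.ncpus for j in self.jobs
                                if j.state == "running")

    def submit(self, job):
        while self.free_cpus() < job.ncpus:
            victims = [j for j in self.jobs
                       if j.state == "running" and j.prio < job.prio]
            if not victims:              # nothing we may preempt: queue it
                job.state = "queued"
                self.jobs.append(job)
                return
            min(victims, key=lambda j: j.prio).suspend()
        job.state = "running"
        self.jobs.append(job)

m = Machine(4)
m.submit(Job("lowprio", prio=1, ncpus=4))
m.submit(Job("highprio", prio=9, ncpus=4))
print({j.name: j.state for j in m.jobs})
```

The point of the sketch is only that the unit of preemption is the
whole job, never an individual process.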
From: Erich F. <ef...@hp...> - 2004-10-14 11:25:15
|
On Thursday 14 October 2004 12:35, Eric W. Biederman wrote:
> Sorry I spotted this thread late.

The thread was actually d(r)ying out...

> People seem to be looking at how things
> are done on clusters and then apply them to numa machines. Which I agree
> looks totally backwards.
>
> The actual application requirement (ignoring the sucky batch schedulers)
> is for a group of processes (a magic process group?) to all be
> simultaneously runnable. On a cluster that is accomplished by having
> an extremely stupid scheduler place one process per machine. On a
> NUMA machine you can do better because you can suspend and migrate
> processes.

Eric: beyond wanting all processes scheduled at the same time, we also
want separation and real isolation (CPU- and memory-wise) of processes
belonging to different users. The first emails in the thread describe
the requirements well. They are too complex to be handled simply by
cpus_allowed and mems_allowed masks; basically, a hierarchy is needed
in the cpusets allocation.

> > It all just seems like a lot of complexity for a fairly obscure set of
> > requirements for a very limited group of users, to be honest.
>
> I think that is correct to some extent. I think the requirements are
> much more reasonable when people stop hanging on to the cludges they
> have been using because they cannot migrate jobs, or suspend
> sufficiently jobs to get out of the way of other jobs.

Cpusets and the like have a long history, originating on ccNUMA
machines. This is not simply replicating cluster behavior. Batch
schedulers may be an inelegant solution, but they are reality and have
been used since computers were invented (more or less).

> Martin does enhancing the scheduler to deal with a group of processes
> that all run in lock-step, usually simultaneously computing or
> communicating sound sane? Where preempting one is effectively preempting
> all of them.
>
> I have been quite confused by this thread in that I have not seen
> any mechanism that looks beyond an individual processes at a time,
> which seems so completely wrong.

You seem to be suggesting a gang scheduler!!! YES!!! I would love
that! But I remember that about two years ago there were some emails
from major kernel maintainers (I don't remember exactly whom) saying
that a gang scheduler will never go into Linux. So... here is
something which somewhat simulates that behavior. Anyhow, cpusets make
sense (for isolation of resources) anyway, no matter whether we have
gang scheduling or not.

> Eric

Regards,
Erich
|
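[Editorial sketch] The hierarchy Erich asks for, which flat
cpus_allowed/mems_allowed masks cannot express, amounts to two
invariants: a child cpuset's cpus must be a subset of its parent's,
and an exclusive cpuset may not overlap its siblings. A small Python
illustration (class and field names invented; the kernel's cpuset code
enforces analogous rules):

```python
# Sketch of the hierarchical invariants of a cpuset tree: containment
# in the parent, and no sibling overlap for exclusive sets.

class Cpuset:
    def __init__(self, name, cpus, exclusive=False):
        self.name = name
        self.cpus = set(cpus)
        self.exclusive = exclusive
        self.children = []

    def add_child(self, child):
        if not child.cpus <= self.cpus:
            raise ValueError(f"{child.name}: cpus escape parent {self.name}")
        if child.exclusive:
            for sib in self.children:
                if sib.cpus & child.cpus:
                    raise ValueError(
                        f"{child.name}: overlaps sibling {sib.name}")
        self.children.append(child)
        return child

root = Cpuset("root", range(8))
users = root.add_child(Cpuset("users", {0, 1, 2, 3}))
batch = root.add_child(Cpuset("batch", {4, 5, 6, 7}, exclusive=True))
users.add_child(Cpuset("jobA", {0, 1}))
users.add_child(Cpuset("jobB", {1, 2}))      # overlap OK: not exclusive
try:
    root.add_child(Cpuset("rogue", {3, 4}, exclusive=True))
except ValueError as e:
    print("rejected:", e)
```

A per-task mask cannot reject "rogue" here, because nothing in a flat
mask records which cpus are already promised exclusively to whom.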
From: Paul J. <pj...@sg...> - 2004-10-14 11:25:48
|
Eric wrote:
> I have been quite confused by this thread in that I have not seen
> any mechanism that looks beyond an individual processes at a time,
> which seems so completely wrong.

In the simplest form, we obtain the equivalent of gang scheduling for
the several threads of a tightly coupled job by arranging to have only
one runnable thread per cpu, each such thread pinned on one cpu, and
all threads in a given job simultaneously runnable. For compute bound
jobs, this is often sufficient.

Time sharing (to a coarse granularity of minutes or hours) and overlap
of various sized jobs is handled using suspension and migration, in
order to obtain the above invariants: one runnable thread per cpu at
any given time, and all threads in a tightly coupled job pinned to
distinct cpus and runnable simultaneously.

For jobs that are not compute bound, where other delays such as i/o
would allow for running more than one such job at a time (both
intermittently runnable on a finer scale of seconds), one needs
something like gang scheduling in order to keep all the threads in a
tightly coupled job running together, while still obtaining maximum
utilization of cpu/memory hardware from jobs with cpu duty cycles of
less than 50%.

The essential purpose of cpusets is to take the placement of
individual threads by the sched_setaffinity and mbind/set_mempolicy
calls, and extend that to manage placing groups of tasks on
administratively designated and controlled groups of cpus/nodes. If
you see nothing beyond individual processes, then I think you are
missing that.

However, it is correct that we haven't (so far as I recall) considered
the gang scheduling that you describe. My crystal ball says we might
get to that next year. Gang scheduling isn't needed for the compute
bound jobs, because just running a single job at a time on a given
subset of a system's cpus and memory obtains the same result.

--
I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373
|
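[Editorial sketch] The per-thread pinning step behind Paul's "one
runnable thread per cpu" setup is the sched_setaffinity call he
mentions. A minimal, Linux-only illustration via Python's os module
(the pin_workers helper is invented; a batch scheduler would call it
once per worker, giving each a distinct cpu):

```python
# Pin each task of a job to exactly one cpu. Here we only pin the
# current process to cpu 0 (which always exists) and read the mask
# back; real job placement would iterate over worker pids.

import os

def pin_workers(pids_to_cpus):
    """Pin each pid to a single cpu (pid 0 means this process)."""
    for pid, cpu in pids_to_cpus.items():
        os.sched_setaffinity(pid, {cpu})

pin_workers({0: 0})
print(os.sched_getaffinity(0))   # mask is now just cpu 0
```

With one such pinned, runnable thread per cpu and at most one job
active per cpu set, the ordinary scheduler delivers the gang-scheduled
behavior with no kernel changes.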
From: Paul J. <pj...@sg...> - 2004-10-16 01:45:01
|
Kevin McMahon <n6...@sg...> pointed out to me a link to an
interesting article on gang scheduling:

    http://www.linuxjournal.com/article.php?sid=7690
    Issue 127: Improving Application Performance on HPC Systems with Process Synchronization
    Posted on Monday, November 01, 2004 by Paul Terry, Amar Shan and Pentti Huttunen

It's amazingly current - won't even be posted for another couple of
weeks ;).

--
I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Hubertus F. <fr...@wa...> - 2004-10-14 22:40:47
|
Paul, there are also other means for gang scheduling than having to
architect a tightly synchronized global clock into the communication
device. Particularly in a batch-oriented environment of compute
intensive applications, one does not really need/want to switch
frequently. Often the communication devices are memory mapped straight
into the application without OS involvement, with limited available
channels. However, as shown in previous work, gang scheduling and
other forms of scheduling tricks (e.g. backfilling) can provide
significantly higher utilization. So, if a high context switching rate
(read: interactivity) is not required, then a network of user space
scheduling daemons can be used. We have a slew of publications on
this. An example read-up can be obtained here:

Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving
Parallel Job Scheduling by Combining Gang Scheduling and Backfilling
Techniques. In Proceedings of the International Parallel and
Distributed Processing Symposium (IPDPS), pages 113-142, May 2000.
http://www.cse.psu.edu/~anand/csl/papers/ipdps00.pdf

Or, for a final sum-up of that research, as a journal article:

Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated
Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and
Migration. IEEE Transactions on Parallel and Distributed Systems,
14(3):236-247, March 2003.

This was implemented for the IBM SP2 cluster and the ASCI machine at
Livermore National Lab in the late 90's.

If you are interested in short scheduling cycles, we also discovered
that, depending on the synchronicity of the applications, gang
scheduling is not necessarily the best:

Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. A
Simulation-based Study of Scheduling Mechanisms for a Dynamic Cluster
Environment. In Proceedings of the ACM International Conference on
Supercomputing (ICS), pages 100-109, May 2000.
http://www.cse.psu.edu/~anand/csl/papers/ics00a.pdf

If I remember correctly, this tight slot-based gang scheduling was
already implemented on IRIX in 95/96 (I read a paper on that).

The moral of the story here is that it's unlikely that Linux will
support gang scheduling in its core anytime soon, or will allow
network adapters to drive scheduling strategies. So likely these are
out. A less frequent gang scheduling can be implemented with user
level daemons, so an adequate solution is available for most
instances.

-- Hubertus

Paul Jackson wrote:
> Kevin McMahon <n6...@sg...> pointed out to me a link to an interesting
> article on gang scheduling:
>
> http://www.linuxjournal.com/article.php?sid=7690
> Issue 127: Improving Application Performance on HPC Systems with Process Synchronization
> Posted on Monday, November 01, 2004 by Paul Terry Amar Shan Pentti Huttunen
>
> It's amazingly current - won't even be posted for another couple of weeks ;).
|
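[Editorial sketch] The backfilling trick referenced above is simple to
illustrate. In EASY-style backfilling only the head-of-queue job holds
a reservation; a later job may start early if it both fits the
currently idle cpus and finishes before that reservation. A toy Python
version (Job fields and the helper are invented, greatly simplified
from the cited papers):

```python
# Toy EASY-style backfill decision: pick later jobs that can run now
# without ever delaying the head job's reserved start time.

from collections import namedtuple

Job = namedtuple("Job", "name ncpus runtime")

def backfill(waiting, free_cpus, reserved_start, now=0):
    """Return the waiting jobs allowed to start at `now`, given that
    the head job is reserved all cpus at time `reserved_start`."""
    picked = []
    for job in waiting:
        fits_now = job.ncpus <= free_cpus
        done_in_time = now + job.runtime <= reserved_start
        if fits_now and done_in_time:
            picked.append(job)
            free_cpus -= job.ncpus
    return picked

# 4 cpus idle until the big head job's reservation at t=10.
waiting = [Job("small-fast", 2, 5),    # fits and finishes in time
           Job("small-slow", 2, 20),   # fits but would overrun t=10
           Job("wide", 8, 1)]          # finishes in time but too wide
print([j.name for j in backfill(waiting, free_cpus=4, reserved_start=10)])
```

This is the utilization win Hubertus mentions: idle cpus ahead of a
reservation get filled without touching the FIFO guarantee for the
head job.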
From: Paul J. <pj...@sg...> - 2004-10-15 01:29:27
|
Hubertus wrote:
> Paul, there are also other means for gang scheduling then having
> to architect a tightly synchronized global clock into the communication
> device.

We agree. My reply to the post of Eric W. Biederman at the start of
this sub-thread began:

> In the simplest form, we obtain the equivalent of gang scheduling for
> the several threads of a tightly coupled job by arranging to have only
> one runnable thread per cpu, each such thread pinned on one cpu, and all
> threads in a given job simultaneously runnable.
>
> For compute bound jobs, this is often sufficient.

Your reply adds substantial detail and excellent references. Thank-you.

--
I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Andrew M. <ak...@os...> - 2005-02-08 00:15:36
|
Matthew Dobson <col...@us...> wrote:
>
> Sorry to reply to a long quiet thread,

Is appreciated, thanks.

> but I've been trading emails with Paul
> Jackson on this subject recently, and I've been unable to convince either him
> or myself that merging CPUSETs and CKRM is as easy as I once believed. I'm
> still convinced the CPU side is doable, but I haven't managed as much success
> with the memory binding side of CPUSETs. In light of this, I'd like to remove
> my previous objections to CPUSETs moving forward. If others still have things
> they want discussed before CPUSETs moves into mainline, that's fine, but it
> seems to me that CPUSETs offer legitimate functionality and that the code has
> certainly "done its time" in -mm to convince me it's stable and usable.

OK, I'll add cpusets to the 2.6.12 queue.

going once, going twice...
|
From: Paul J. <pj...@sg...> - 2005-02-08 00:34:43
|
Andrew wrote:
> OK, I'll add cpusets to the 2.6.12 queue.

I'd like that ;).

Thank-you, Matthew, for the work you put into making sense of this.

--
I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373, 1.925.600.0401
|