From: Erich F. <ef...@hp...> - 2004-10-08 22:39:44
|
On Friday 08 October 2004 16:24, Martin J. Bligh wrote:
> > On Thursday 07 October 2004 20:13, Martin J. Bligh wrote:
> >> It all just seems like a lot of complexity for a fairly obscure set of
> >> requirements for a very limited group of users, to be honest. Some bits
> >> (eg partitioning system resources hard in exclusive sets) would seem likely
> >> to be used by a much broader audience, and thus are rather more attractive.
> >
> > May I translate the first sentence to: the requirements and usage
> > models described by Paul (SGI), Simon (Bull) and myself (NEC) are
> > "fairly obscure" and the group of users addressed (those mainly
> > running high performance computing (AKA HPC) applications) is "very
> > limited"? If this is what you want to say, then it's you whose view is
> > very limited. Maybe I'm wrong about what you really wanted to say, but I
> > remember similar arguments from your side when discussing benchmark
> > results in the context of the node affine scheduler.
>
> No, I was talking about the non-exclusive part of cpusets that wouldn't
> fit inside another mechanism. The basic partitioning I have no problem
> with, and that seemed to cover most of the requirements, AFAICS.

I was hoping that I did misunderstand you ;-)

> As I've said before, the exclusive stuff makes sense, and is useful to
> a wider audience, I think. Having non-exclusive stuff whilst still
> requiring physical partitioning is what I think is obscure, won't work
> well (cpus_allowed is problematic) and could be done in userspace anyway.

Do you mean non-exclusive or simply overlapping? If you think of the
implementation through sched_domains, you really don't need a 1:1
mapping between them and cpusets. IMO one could map the sched domains
structure from the top-level cpuset down only as far as the
non-overlapping sets go. Below that you just don't use sched domains
any more and leave it to the affinity masks.

The logical setup would anyhow have a first (uppermost) level
soft-partitioning the machine; overlaps don't make sense to me here.
Then sched domains already buy you something. If soft partition 1
allows overlap in the lower levels (because we want to overcommit the
machine here and fear the OpenMP jobs which pin themselves blindly in
their cpuset), just don't continue mapping sched domains deeper. In
soft partition 2 you may not allow overlapping subpartitions, so go
ahead and map them to sched domains. It doesn't really add complexity
this way, just some IF statement.

Regards,
Erich
|
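[Editorial sketch] The "map sched domains only as deep as the sets stay
non-overlapping" idea above can be illustrated in a few lines of plain
Python. All names and structures here are invented for illustration;
this is not kernel code, just the shape of the recursion and the single
IF statement Erich refers to.

```python
# Hypothetical sketch: one sched domain per cpuset, but stop recursing
# at the first level whose sibling sets overlap; below that point,
# scheduling is left to plain cpus_allowed affinity masks.

def disjoint(children):
    seen = set()
    for _, cpus, _ in children:
        if seen & cpus:
            return False
        seen |= cpus
    return True

def build_domains(cpuset, domains):
    """cpuset = (name, cpus, children); record a domain per cpuset,
    recursing only while the children partition the parent cleanly."""
    name, cpus, children = cpuset
    domains.append(name)
    if children and disjoint(children):
        for child in children:
            build_domains(child, domains)
    # else: overlapping children -> no deeper domains (the IF statement)

top = ("top", set(range(8)), [
    # soft partition 1 allows overlapping jobs (overcommit) ...
    ("part1", {0, 1, 2, 3}, [("jobA", {0, 1, 2}, []),
                             ("jobB", {2, 3}, [])]),
    # ... soft partition 2 does not, so it maps all the way down.
    ("part2", {4, 5, 6, 7}, [("jobC", {4, 5}, []),
                             ("jobD", {6, 7}, [])]),
])

domains = []
build_domains(top, domains)
print(domains)  # part1's overlapping jobs get no domains of their own
```

Under this sketch, part1's overlapping jobs fall back to affinity
masks while part2's disjoint jobs each get their own domain.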
From: <ebi...@xm...> - 2004-10-14 10:40:07
|
"Martin J. Bligh" <mb...@ar...> writes: > My main problem is that I don't think we want lots of overlapping complex > interfaces in the kernel. Plus I think some of the stuff proposed is fairly > klunky as an interface (physical binding where it's mostly not needed, and > yes I sort of see your point about keeping jobs on separate CPUs, though I > still think it's tenuous), and makes heavy use of stuff that doesn't work > well (e.g. cpus_allowed). So I'm searching for various ways to address that. Sorry I spotted this thread late. People seem to be looking at how things are done on clusters and then apply them to numa machines. Which I agree looks totally backwards. The actual application requirement (ignoring the sucky batch schedulers) is for a group of processes (a magic process group?) to all be simultaneously runnable. On a cluster that is accomplished by having an extremely stupid scheduler place one process per machine. On a NUMA machine you can do better because you can suspend and migrate processes. The other difference on these large machines is these compute jobs that are cpu hogs will often have priority over all of the other processes in the system. A batch scheduler should be able to prevent a machine from being overloaded by simply not putting too many processes on the machine at a time. Or if a higher priority job comes in suspending all of the processes that of some lower priority job to make run for the new job. Being able to swap page tables is likely a desirable feature in that scenario so all of the swapped out jobs resources can be removed from memory. > It all just seems like a lot of complexity for a fairly obscure set of > requirements for a very limited group of users, to be honest. I think that is correct to some extent. I think the requirements are much more reasonable when people stop hanging on to the cludges they have been using because they cannot migrate jobs, or suspend sufficiently jobs to get out of the way of other jobs. 
Martin does enhancing the scheduler to deal with a group of processes that all run in lock-step, usually simultaneously computing or communicating sound sane? Where preempting one is effectively preempting all of them. I have been quite confused by this thread in that I have not seen any mechanism that looks beyond an individual processes at a time, which seems so completely wrong. Eric |
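[Editorial sketch] The suspend-to-make-room policy Eric describes is
easy to sketch in user space. The Job and Machine objects below are
invented for illustration; a real batch scheduler would send SIGSTOP
(and later SIGCONT) to every pid of the victim job rather than flip a
flag.

```python
# Toy sketch: when a higher-priority job arrives and the machine is
# full, suspend whole lower-priority jobs until the new one fits.

class Job:
    def __init__(self, name, prio, ncpus):
        self.name, self.prio, self.ncpus = name, prio, ncpus
        self.state = "new"

    def suspend(self):
        # real version: for pid in self.pids: os.kill(pid, signal.SIGSTOP)
        self.state = "suspended"

class Machine:
    def __init__(self, ncpus):
        self.ncpus, self.jobs = ncpus, []

    def free_cpus(self):
        return self.ncpus - sum(j.ncpus for j in self.jobs
                                if j.state == "running")

    def submit(self, job):
        while self.free_cpus() < job.ncpus:
            victims = [j for j in self.jobs
                       if j.state == "running" and j.prio < job.prio]
            if not victims:              # nothing we may preempt: queue it
                job.state = "queued"
                self.jobs.append(job)
                return
            min(victims, key=lambda j: j.prio).suspend()
        job.state = "running"
        self.jobs.append(job)

m = Machine(4)
m.submit(Job("lowprio", prio=1, ncpus=4))
m.submit(Job("highprio", prio=9, ncpus=4))
print({j.name: j.state for j in m.jobs})
```

The point of the sketch is only that the unit of preemption is the
whole job, never an individual process.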
From: Erich F. <ef...@hp...> - 2004-10-14 11:25:15
|
On Thursday 14 October 2004 12:35, Eric W. Biederman wrote:
> Sorry I spotted this thread late.

The thread was actually d(r)ying out...

> People seem to be looking at how things
> are done on clusters and then apply them to numa machines. Which I agree
> looks totally backwards.
>
> The actual application requirement (ignoring the sucky batch schedulers)
> is for a group of processes (a magic process group?) to all be
> simultaneously runnable. On a cluster that is accomplished by having
> an extremely stupid scheduler place one process per machine. On a
> NUMA machine you can do better because you can suspend and migrate
> processes.

Eric: beyond wanting all processes scheduled at the same time, we also
want separation and real isolation (CPU- and memory-wise) of processes
belonging to different users. The first emails in the thread describe
the requirements well. They are too complex to be handled simply by
cpus_allowed and mems_allowed masks; basically, a hierarchy is needed
in the cpusets allocation.

> > It all just seems like a lot of complexity for a fairly obscure set of
> > requirements for a very limited group of users, to be honest.
>
> I think that is correct to some extent. I think the requirements are
> much more reasonable when people stop hanging on to the cludges they
> have been using because they cannot migrate jobs, or suspend
> sufficiently jobs to get out of the way of other jobs.

Cpusets and the like have a long history, originating on ccNUMA
machines. This is not simply replicating cluster behavior. Batch
schedulers may be an inelegant solution, but they are reality and have
been used since computers were invented (more or less).

> Martin does enhancing the scheduler to deal with a group of processes
> that all run in lock-step, usually simultaneously computing or
> communicating sound sane? Where preempting one is effectively preempting
> all of them.
>
> I have been quite confused by this thread in that I have not seen
> any mechanism that looks beyond an individual processes at a time,
> which seems so completely wrong.

You seem to be suggesting a gang scheduler!!! YES!!! I would love
that! But I remember that about two years ago there were some emails
from major kernel maintainers (I don't remember exactly whom) saying
that a gang scheduler will never go into Linux. So... here is
something which somewhat simulates that behavior. Anyhow, cpusets make
sense (for isolation of resources) anyway, no matter whether we have
gang scheduling or not.

> Eric

Regards,
Erich
|
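[Editorial sketch] The hierarchy Erich asks for, which flat
cpus_allowed/mems_allowed masks cannot express, amounts to two
invariants: a child cpuset's cpus must be a subset of its parent's,
and an exclusive cpuset may not overlap its siblings. A small Python
illustration (class and field names invented; the kernel's cpuset code
enforces analogous rules):

```python
# Sketch of the hierarchical invariants of a cpuset tree: containment
# in the parent, and no sibling overlap for exclusive sets.

class Cpuset:
    def __init__(self, name, cpus, exclusive=False):
        self.name = name
        self.cpus = set(cpus)
        self.exclusive = exclusive
        self.children = []

    def add_child(self, child):
        if not child.cpus <= self.cpus:
            raise ValueError(f"{child.name}: cpus escape parent {self.name}")
        if child.exclusive:
            for sib in self.children:
                if sib.cpus & child.cpus:
                    raise ValueError(
                        f"{child.name}: overlaps sibling {sib.name}")
        self.children.append(child)
        return child

root = Cpuset("root", range(8))
users = root.add_child(Cpuset("users", {0, 1, 2, 3}))
batch = root.add_child(Cpuset("batch", {4, 5, 6, 7}, exclusive=True))
users.add_child(Cpuset("jobA", {0, 1}))
users.add_child(Cpuset("jobB", {1, 2}))      # overlap OK: not exclusive
try:
    root.add_child(Cpuset("rogue", {3, 4}, exclusive=True))
except ValueError as e:
    print("rejected:", e)
```

A per-task mask cannot reject "rogue" here, because nothing in a flat
mask records which cpus are already promised exclusively to whom.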
From: Paul J. <pj...@sg...> - 2004-10-14 11:25:48
|
Eric wrote:
> I have been quite confused by this thread in that I have not seen
> any mechanism that looks beyond an individual processes at a time,
> which seems so completely wrong.

In the simplest form, we obtain the equivalent of gang scheduling for
the several threads of a tightly coupled job by arranging to have only
one runnable thread per cpu, each such thread pinned on one cpu, and
all threads in a given job simultaneously runnable. For compute bound
jobs, this is often sufficient.

Time sharing (to a coarse granularity of minutes or hours) and overlap
of various sized jobs is handled using suspension and migration, in
order to obtain the above invariants: one runnable thread per cpu at
any given time, and all threads in a tightly coupled job pinned to
distinct cpus and runnable simultaneously.

For jobs that are not compute bound, where other delays such as i/o
would allow for running more than one such job at a time (both
intermittently runnable on a finer scale of seconds), one needs
something like gang scheduling in order to keep all the threads in a
tightly coupled job running together, while still obtaining maximum
utilization of cpu/memory hardware from jobs with cpu duty cycles of
less than 50%.

The essential purpose of cpusets is to take the placement of
individual threads by the sched_setaffinity and mbind/set_mempolicy
calls, and extend that to manage placing groups of tasks on
administratively designated and controlled groups of cpus/nodes. If
you see nothing beyond individual processes, then I think you are
missing that.

However, it is correct that we haven't (so far as I recall) considered
the gang scheduling that you describe. My crystal ball says we might
get to that next year. Gang scheduling isn't needed for the compute
bound jobs, because just running a single job at a time on a given
subset of a system's cpus and memory obtains the same result.

--
I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373
|
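[Editorial sketch] The per-thread pinning step behind Paul's "one
runnable thread per cpu" setup is the sched_setaffinity call he
mentions. A minimal, Linux-only illustration via Python's os module
(the pin_workers helper is invented; a batch scheduler would call it
once per worker, giving each a distinct cpu):

```python
# Pin each task of a job to exactly one cpu. Here we only pin the
# current process to cpu 0 (which always exists) and read the mask
# back; real job placement would iterate over worker pids.

import os

def pin_workers(pids_to_cpus):
    """Pin each pid to a single cpu (pid 0 means this process)."""
    for pid, cpu in pids_to_cpus.items():
        os.sched_setaffinity(pid, {cpu})

pin_workers({0: 0})
print(os.sched_getaffinity(0))   # mask is now just cpu 0
```

With one such pinned, runnable thread per cpu and at most one job
active per cpu set, the ordinary scheduler delivers the gang-scheduled
behavior with no kernel changes.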
From: Paul J. <pj...@sg...> - 2004-10-16 01:45:01
|
Kevin McMahon <n6...@sg...> pointed out to me a link to an
interesting article on gang scheduling:

    http://www.linuxjournal.com/article.php?sid=7690
    Issue 127: Improving Application Performance on HPC Systems with Process Synchronization
    Posted on Monday, November 01, 2004 by Paul Terry, Amar Shan and Pentti Huttunen

It's amazingly current - won't even be posted for another couple of
weeks ;).

--
I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Hubertus F. <fr...@wa...> - 2004-10-14 22:40:47
|
Paul, there are also other means for gang scheduling than having to
architect a tightly synchronized global clock into the communication
device. Particularly in a batch-oriented environment of compute
intensive applications, one does not really need/want to switch
frequently. Often the communication devices are memory mapped straight
into the application without OS involvement, with limited available
channels. However, as shown in previous work, gang scheduling and
other forms of scheduling tricks (e.g. backfilling) can provide
significantly higher utilization. So, if a high context switching rate
(read: interactivity) is not required, then a network of user space
scheduling daemons can be used. We have a slew of publications on
this. An example read-up can be obtained here:

Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. Improving
Parallel Job Scheduling by Combining Gang Scheduling and Backfilling
Techniques. In Proceedings of the International Parallel and
Distributed Processing Symposium (IPDPS), pages 113-142, May 2000.
http://www.cse.psu.edu/~anand/csl/papers/ipdps00.pdf

Or, for a final sum-up of that research, as a journal article:

Y. Zhang, H. Franke, J. Moreira, A. Sivasubramaniam. An Integrated
Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling and
Migration. IEEE Transactions on Parallel and Distributed Systems,
14(3):236-247, March 2003.

This was implemented for the IBM SP2 cluster and the ASCI machine at
Livermore National Lab in the late 90's.

If you are interested in short scheduling cycles, we also discovered
that, depending on the synchronicity of the applications, gang
scheduling is not necessarily the best:

Y. Zhang, A. Sivasubramaniam, J. Moreira, H. Franke. A
Simulation-based Study of Scheduling Mechanisms for a Dynamic Cluster
Environment. In Proceedings of the ACM International Conference on
Supercomputing (ICS), pages 100-109, May 2000.
http://www.cse.psu.edu/~anand/csl/papers/ics00a.pdf

If I remember correctly, this tight slot-based gang scheduling was
already implemented on IRIX in 95/96 (I read a paper on that).

The moral of the story here is that it's unlikely that Linux will
support gang scheduling in its core anytime soon, or will allow
network adapters to drive scheduling strategies. So likely these are
out. A less frequent gang scheduling can be implemented with user
level daemons, so an adequate solution is available for most
instances.

-- Hubertus

Paul Jackson wrote:
> Kevin McMahon <n6...@sg...> pointed out to me a link to an interesting
> article on gang scheduling:
>
> http://www.linuxjournal.com/article.php?sid=7690
> Issue 127: Improving Application Performance on HPC Systems with Process Synchronization
> Posted on Monday, November 01, 2004 by Paul Terry Amar Shan Pentti Huttunen
>
> It's amazingly current - won't even be posted for another couple of weeks ;).
|
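[Editorial sketch] The backfilling trick referenced above is simple to
illustrate. In EASY-style backfilling only the head-of-queue job holds
a reservation; a later job may start early if it both fits the
currently idle cpus and finishes before that reservation. A toy Python
version (Job fields and the helper are invented, greatly simplified
from the cited papers):

```python
# Toy EASY-style backfill decision: pick later jobs that can run now
# without ever delaying the head job's reserved start time.

from collections import namedtuple

Job = namedtuple("Job", "name ncpus runtime")

def backfill(waiting, free_cpus, reserved_start, now=0):
    """Return the waiting jobs allowed to start at `now`, given that
    the head job is reserved all cpus at time `reserved_start`."""
    picked = []
    for job in waiting:
        fits_now = job.ncpus <= free_cpus
        done_in_time = now + job.runtime <= reserved_start
        if fits_now and done_in_time:
            picked.append(job)
            free_cpus -= job.ncpus
    return picked

# 4 cpus idle until the big head job's reservation at t=10.
waiting = [Job("small-fast", 2, 5),    # fits and finishes in time
           Job("small-slow", 2, 20),   # fits but would overrun t=10
           Job("wide", 8, 1)]          # finishes in time but too wide
print([j.name for j in backfill(waiting, free_cpus=4, reserved_start=10)])
```

This is the utilization win Hubertus mentions: idle cpus ahead of a
reservation get filled without touching the FIFO guarantee for the
head job.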
From: Paul J. <pj...@sg...> - 2004-10-15 01:29:27
|
Hubertus wrote:
> Paul, there are also other means for gang scheduling then having
> to architect a tightly synchronized global clock into the communication
> device.

We agree. My reply to the post of Eric W. Biederman at the start of
this sub-thread began:

> In the simplest form, we obtain the equivalent of gang scheduling for
> the several threads of a tightly coupled job by arranging to have only
> one runnable thread per cpu, each such thread pinned on one cpu, and all
> threads in a given job simultaneously runnable.
>
> For compute bound jobs, this is often sufficient.

Your reply adds substantial detail and excellent references. Thank-you.

--
I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373
|
From: Andrew M. <ak...@os...> - 2005-02-08 00:15:36
|
Matthew Dobson <col...@us...> wrote:
>
> Sorry to reply to a long quiet thread,

Is appreciated, thanks.

> but I've been trading emails with Paul
> Jackson on this subject recently, and I've been unable to convince either him
> or myself that merging CPUSETs and CKRM is as easy as I once believed. I'm
> still convinced the CPU side is doable, but I haven't managed as much success
> with the memory binding side of CPUSETs. In light of this, I'd like to remove
> my previous objections to CPUSETs moving forward. If others still have things
> they want discussed before CPUSETs moves into mainline, that's fine, but it
> seems to me that CPUSETs offer legitimate functionality and that the code has
> certainly "done its time" in -mm to convince me it's stable and usable.

OK, I'll add cpusets to the 2.6.12 queue.

going once, going twice...
|
From: Paul J. <pj...@sg...> - 2005-02-08 00:34:43
|
Andrew wrote:
> OK, I'll add cpusets to the 2.6.12 queue.

I'd like that ;).

Thank-you, Matthew, for the work you put into making sense of this.

--
I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373, 1.925.600.0401
|