From: Paul J. <pj...@sg...> - 2004-10-06 02:49:46
Matthew wrote:
> I feel that the actual implementation, however, is taking
> a wrong approach, because it attempts to use the cpus_allowed mask to
> override the scheduler in the general case.  cpus_allowed, in my
> estimation, is meant to be used as the exception, not the rule.

I agree that big chunks of a large system that are marching to the beat of two distinctly different drummers would be better served by schedulers organized along the domains you describe than by brute-force abuse of the cpus_allowed mask.

I look forward to your RFC, Matthew.  Not being a scheduler guru, I will mostly have to rely on the textual commentary to understand what it means.

Finer-grain placement of CPUs (sched_setaffinity) and memory (mbind, set_mempolicy) already exists, and is required by the parallel threaded applications that OpenMP and MPI are commonly used to develop.

Using non-exclusive cpusets at this finer grain, so that workload managers such as PBS and LSF can manage placement on a system (domain) wide basis, should not place any significant further load on the schedulers or resource managers.

The top-level cpusets must provide additional isolation properties so that separate scheduler and resource manager domains can work in relative isolation.  I've tried hard to speculate what these additional isolation properties might be.  I look forward to hearing from the CKRM and scheduler folks on this.

I agree that simple, unconstrained (ab)use of the cpus_allowed and mems_allowed masks, at that scale, places an undue burden on the schedulers, allocators and resource managers.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373