Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3

--Erich Focht <ef...@hp...> wrote (on Tuesday, March 30, 2004 00:30:25 +0200):

> On Thursday 25 March 2004 23:28, Martin J. Bligh wrote:
>> Can we hold off on changing the fork/exec time balancing until we've
>> come to a plan as to what should actually be done with it? Unless we're
>> giving it some hint from userspace, it's frigging hard to be sure if
>> it's going to exec or not - and the vast majority of things do.
> 
> After more than a year (or two?) of discussions there's no better idea
> yet than giving a userspace hint. Default should be to balance at
> exec(), and maybe use a syscall for saying: balance all children a
> particular process is going to fork/clone at creation time. Everybody
> reached the insight that we can't foresee what's optimal, so there is
> only one solution: control the behavior. Give the user a tool to
> improve the performance. Just a small inheritable variable in the task
> structure is enough. Whether you give the hint at or before run-time
> or even at compile-time is not really the point...

Agreed ... absolutely.
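
To make the "small inheritable variable" idea concrete, here is a minimal user-space sketch in C. It is only a model, not kernel code: the names balance_hint, sim_fork, sim_exec and node_load are hypothetical and do not come from any patch in this thread, and node load is reduced to a simple task count per node.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4

/* Hypothetical hint values; not taken from any kernel tree. */
enum balance_hint {
    BALANCE_AT_EXEC = 0,  /* default: children start on the parent's node */
    BALANCE_AT_FORK = 1   /* hint: place each new child on the least loaded node */
};

/* Toy stand-in for the task structure. */
struct task {
    int node;                /* node the task currently runs on */
    enum balance_hint hint;  /* inherited by every child */
};

/* Number of tasks per node; a crude stand-in for real load. */
static int node_load[NR_NODES];

/* Return the lowest-numbered node with the smallest load. */
static int least_loaded_node(void)
{
    int best = 0;
    for (int n = 1; n < NR_NODES; n++)
        if (node_load[n] < node_load[best])
            best = n;
    return best;
}

/* Model of fork()/clone(): the child copies the parent's hint. */
static struct task sim_fork(const struct task *parent)
{
    assert(parent != NULL);
    struct task child = *parent;          /* the hint is inherited by copy */
    if (child.hint == BALANCE_AT_FORK)
        child.node = least_loaded_node(); /* balance at creation time */
    node_load[child.node]++;
    return child;
}

/* Model of exec(): always rebalance, which is the default behaviour. */
static void sim_exec(struct task *t)
{
    assert(t != NULL);
    node_load[t->node]--;
    t->node = least_loaded_node();
    node_load[t->node]++;
}
```

In this model a task that never sets the hint behaves as before (its children stay on its node until they exec), while a task that sets BALANCE_AT_FORK once gets all of its descendants spread at creation time, because the field is copied on every fork.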

> I don't think it's worth waiting and hoping that somebody shows up with
> a magic algorithm which balances every kind of job optimally.

Especially as I don't believe that exists ;-) It's not deterministic.

>> Clone is a much more interesting case, though at the time, I consciously
>> decided NOT to do that, as we really mostly want threads on the same
>> node.
> 
> That is not true in the case of HPC applications. And if someone uses
> OpenMP, he is doing just that kind of stuff. I consider STREAM a good
> benchmark because it shows exactly the problem of HPC applications:
> they need a lot of memory bandwidth, they don't run in cache and the
> tasks live really long. Spreading those tasks across the nodes gives
> me more bandwidth per task and I accumulate the positive effect
> because the tasks run for hours or days. It's a simple and clear case
> where the scheduler should be improved.
>
> Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7
> are not relevant for HPC. In a compute center it actually doesn't
> matter much whether some shell command returns 10% faster, it just
> shouldn't disturb my super simulation code for which I bought an
> expensive NUMA box.

OK, but the scheduler can't know the difference automatically, I don't
think ... and whether we should tune the scheduler for "user work" or
HPC is going to be a hotly contested point ;-) We need to try to find
something that works for both. And suppose you have a 4 node system,
with 4 HPC apps running? Surely you want each app to have one node to
itself? That's more the case I'm worried about than "user work" vs HPC,
to be honest.
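
A toy illustration of that 4 node / 4 app case, as a sketch in C. The two placement policies and all names here (SPREAD_THREADS, ONE_NODE_PER_APP, place, apps_sharing_node) are made up for illustration, and the thread count per app is an assumption. It only counts how many apps end up sharing each node; it says nothing about actual bandwidth or run time.

```c
#include <assert.h>

#define NR_NODES        4
#define NR_APPS         4
#define THREADS_PER_APP 4   /* assumed: one thread per node's worth of work */

/* Two hypothetical placement policies for the threads of an app. */
enum placement {
    SPREAD_THREADS,   /* thread t of every app goes to node t % NR_NODES */
    ONE_NODE_PER_APP  /* all threads of app a stay on node a % NR_NODES */
};

/* Node chosen for a given thread of a given app under a policy. */
static int place(enum placement policy, int app, int thread)
{
    assert(app >= 0 && thread >= 0);
    if (policy == SPREAD_THREADS)
        return thread % NR_NODES;
    return app % NR_NODES;
}

/* Count how many distinct apps have at least one thread on `node`. */
static int apps_sharing_node(enum placement policy, int node)
{
    int count = 0;
    for (int app = 0; app < NR_APPS; app++) {
        for (int t = 0; t < THREADS_PER_APP; t++) {
            if (place(policy, app, t) == node) {
                count++;
                break;  /* count each app at most once per node */
            }
        }
    }
    return count;
}
```

With SPREAD_THREADS every node hosts threads from all four apps, so each app's memory traffic crosses nodes and competes with the other three. With ONE_NODE_PER_APP each node hosts exactly one app. Which of the two is better depends on the workload, which is the point: the scheduler cannot tell from the clone() alone.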

M.