From: Martin J. B. <mb...@ar...> - 2004-03-30 15:01:25
--Erich Focht <ef...@hp...> wrote (on Tuesday, March 30, 2004 00:30:25 +0200):

> On Thursday 25 March 2004 23:28, Martin J. Bligh wrote:
>> Can we hold off on changing the fork/exec time balancing until we've
>> come to a plan as to what should actually be done with it? Unless we're
>> giving it some hint from userspace, it's frigging hard to be sure if
>> it's going to exec or not - and the vast majority of things do.
>
> After more than a year (or two?) of discussions there's no better idea
> yet than giving a userspace hint. Default should be to balance at
> exec(), and maybe use a syscall for saying: balance all children a
> particular process is going to fork/clone at creation time. Everybody
> reached the insight that we can't foresee what's optimal, so there is
> only one solution: control the behavior. Give the user a tool to
> improve the performance. Just a small inheritable variable in the task
> structure is enough. Whether you give the hint at or before run-time
> or even at compile-time is not really the point...

Agreed ... absolutely.

> I don't think it's worth to wait and hope that somebody shows up with
> a magic algorithm which balances every kind of job optimally.

Especially as I don't believe that exists ;-) It's not deterministic.

>> Clone is a much more interesting case, though at the time, I consciously
>> decided NOT to do that, as we really mostly want threads on the same
>> node.
>
> That is not true in the case of HPC applications. And if someone uses
> OpenMP he is just doing that kind of stuff. I consider STREAM a good
> benchmark because it shows exactly the problem of HPC applications:
> they need a lot of memory bandwidth, they don't run in cache and the
> tasks live really long. Spreading those tasks across the nodes gives
> me more bandwidth per task and I accumulate the positive effect
> because the tasks run for hours or days. It's a simple and clear case
> where the scheduler should be improved.
>
> Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7
> are not relevant for HPC. In a compute center it actually doesn't
> matter much whether some shell command returns 10% faster, it just
> shouldn't disturb my super simulation code for which I bought an
> expensive NUMA box.

OK, but the scheduler can't know the difference automatically, I don't
think ... and whether we should tune the scheduler for "user work" or HPC
is going to be a hotly contested point ;-) We need to try to find something
that works for both.

And suppose you have a 4 node system, with 4 HPC apps running? Surely you
want each app to have one node to itself?

That's more the case I'm worried about than "user work" vs HPC, to be honest.

M.
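
For illustration only, the "small inheritable variable in the task structure" idea might be modelled roughly as below. This is a userspace toy, not kernel code: `struct task`, `enum balance_hint`, `task_set_hint()` and the other helpers are all hypothetical names, and "pick the least loaded node" is just a placeholder for whatever the real balancer would do.

```c
#include <assert.h>

#define NR_NODES 4

/* Hypothetical hint values; not an existing kernel interface. */
enum balance_hint {
    BALANCE_AT_EXEC,  /* default: rebalance only when the task calls exec() */
    BALANCE_AT_FORK   /* opt-in: spread children as soon as they are created */
};

/* Toy stand-in for the task structure, carrying the inheritable hint. */
struct task {
    int node;                /* node the task currently runs on */
    enum balance_hint hint;  /* copied from parent to child at fork */
};

static int node_load[NR_NODES];  /* number of tasks per node */

/* Lowest-numbered node with the fewest tasks (placeholder policy). */
static int least_loaded_node(void)
{
    int best = 0;
    for (int n = 1; n < NR_NODES; n++)
        if (node_load[n] < node_load[best])
            best = n;
    return best;
}

/* Take the task off its node, then put it on the least loaded one. */
static void rebalance(struct task *t)
{
    assert(t->node >= 0 && t->node < NR_NODES);
    node_load[t->node]--;
    t->node = least_loaded_node();
    node_load[t->node]++;
}

/* Start a task on a given node with the default policy. */
static void task_start(struct task *t, int node)
{
    assert(node >= 0 && node < NR_NODES);
    t->node = node;
    t->hint = BALANCE_AT_EXEC;
    node_load[node]++;
}

/* What the hypothetical syscall would do: set the hint for future children. */
static void task_set_hint(struct task *t, enum balance_hint hint)
{
    t->hint = hint;
}

/* Child inherits node and hint; it is spread immediately only if asked to. */
static void task_fork(const struct task *parent, struct task *child)
{
    *child = *parent;
    node_load[child->node]++;
    if (child->hint == BALANCE_AT_FORK)
        rebalance(child);
}

/* Default behaviour: the task is only moved once it actually execs. */
static void task_exec(struct task *t)
{
    if (t->hint == BALANCE_AT_EXEC)
        rebalance(t);
}
```

In this model, a parent on node 0 of an otherwise idle 4-node box that sets `BALANCE_AT_FORK` gets its three children placed on nodes 1, 2 and 3 at creation time, while a child of a default-policy parent stays on the parent's node until it calls `task_exec()`. It says nothing about the harder question raised above of keeping each of several HPC apps on a node of its own.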