From: Ingo M. <mi...@el...> - 2004-03-30 07:42:18
* Andi Kleen <ak...@su...> wrote:

> This works much better, but wildly varying (my tests go from 2.8x CPU
> to ~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
> results would be better though.

i'm resurrecting the balance-on-clone patch i sent a couple of days
ago. I found at least one bug in it that might explain why it didn't
work back then. (also, the scheduler back then was too aggressive at
migrating tasks back.) Stay tuned.

	Ingo
From: Nick P. <nic...@ya...> - 2004-03-30 07:03:55
Andi Kleen wrote:
> On Tue, 30 Mar 2004 09:51:46 +1000
> Nick Piggin <nic...@ya...> wrote:
>
>> So both -mm5 and Ingo's sched.patch are much worse than
>> what 2.4 and 2.6 get?
>
> Yes (2.6 vanilla and 2.4-aa at that, i haven't tested 2.4-vanilla)
>
> Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x
> CPU), but still much worse than the max of 3.7x-4x CPU bandwidth.

So it is very likely to be a case of the threads running too long on
one CPU before being balanced off, and faulting in most of their
working memory from one node, right?

I think it is impossible for the scheduler to correctly identify this
and implement the behaviour that OpenMP wants without causing
regressions on more general workloads (assuming this is the problem).

We are not going to go back to the wild balancing that numasched does
(I have some benchmarks where sched-domains reduces cross-node task
movement by several orders of magnitude).

So the other option is to do balance on clone across NUMA nodes, and
make it very sensitive to imbalance. Or, probably better, make it easy
to balance off to an idle CPU, but much more difficult to balance off
to a busy CPU.

I suspect this would still be a regression for other tests though,
where thread creation is more frequent, threads share their working
set more often, or the number of threads > the number of CPUs.
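A toy illustration of the asymmetric policy Nick proposes here: always
take an idle CPU, but demand a large imbalance before taking a busy
one. The function name and the 175% threshold are invented for the
sketch; this is not code from any posted patch.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical threshold: the source CPU must be 75% more loaded
 * than the target before a new thread is moved onto a busy CPU. */
#define CLONE_IMBALANCE_PCT 175

/*
 * Toy model of balance-on-clone across NUMA nodes: migrating a
 * freshly cloned thread to an idle CPU is always allowed; migrating
 * it onto a busy CPU requires a drastic imbalance.
 */
static bool balance_clone_to(int src_load, int dst_load)
{
    if (dst_load == 0)              /* idle target: cheap win */
        return true;
    /* busy target: only migrate on a large relative imbalance */
    return src_load * 100 >= dst_load * CLONE_IMBALANCE_PCT;
}

int main(void)
{
    printf("%d %d %d\n",
           balance_clone_to(3, 0),  /* idle target  -> migrate (1) */
           balance_clone_to(3, 2),  /* mild imbalance -> stay (0)  */
           balance_clone_to(4, 2)); /* 2x imbalance -> migrate (1) */
    return 0;
}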
From: Martin J. B. <mb...@ar...> - 2004-03-30 07:14:15
> We are not going to go back to the wild balancing that numasched does
> (I have some benchmarks where sched-domains reduces cross-node task
> movement by several orders of magnitude).

Agreed, I think that'd be a fatal mistake ...

> So the other option is to do balance on clone across NUMA nodes, and
> make it very sensitive to imbalance. Or, probably better, make it
> easy to balance off to an idle CPU, but much more difficult to
> balance off to a busy CPU.

I think that's correct, but we need to be careful. We really, really
do want to try to keep threads on the same node *if* we have enough
processes around to keep the machine busy. Because we don't balance on
fork, we do a reasonable job of that today, but we should probably be
more reluctant to rebalance than we are.

It's when we have fewer processes than nodes that we want to spread
things around. That's a difficult balance to strike (and exactly why I
wimped out on it originally ;-)).

M.
From: Nick P. <nic...@ya...> - 2004-03-30 07:31:31
Martin J. Bligh wrote:
>> We are not going to go back to the wild balancing that numasched
>> does (I have some benchmarks where sched-domains reduces cross-node
>> task movement by several orders of magnitude).
>
> Agreed, I think that'd be a fatal mistake ...
>
>> So the other option is to do balance on clone across NUMA nodes,
>> and make it very sensitive to imbalance. Or, probably better, make
>> it easy to balance off to an idle CPU, but much more difficult to
>> balance off to a busy CPU.
>
> I think that's correct, but we need to be careful. We really, really
> do want to try to keep threads on the same node *if* we have enough
> processes around to keep the machine busy. Because we don't balance
> on fork, we do a reasonable job of that today, but we should probably
> be more reluctant to rebalance than we are.
>
> It's when we have fewer processes than nodes that we want to spread
> things around. That's a difficult balance to strike (and exactly why
> I wimped out on it originally ;-)).

Well, NUMA balance on exec is obviously the right thing to do.

Maybe balance on clone would be beneficial if we only balance onto
CPUs which are idle or very badly imbalanced. Basically, if you are
very sure that it is going to be balanced off anyway, it is probably
better to do it at clone.
From: Martin J. B. <mb...@ar...> - 2004-03-30 07:38:13
> Well, NUMA balance on exec is obviously the right thing to do.
>
> Maybe balance on clone would be beneficial if we only balance onto
> CPUs which are idle or very badly imbalanced. Basically, if you are
> very sure that it is going to be balanced off anyway, it is probably
> better to do it at clone.

Yup ... sounds utterly sensible. But I think we need to make the
current balancing favour grouping threads together on the same
CPU/node more first, if possible ;-)

M.
From: Ingo M. <mi...@el...> - 2004-03-30 08:05:08
* Nick Piggin <nic...@ya...> wrote:

> Maybe balance on clone would be beneficial if we only balance onto
> CPUs which are idle or very badly imbalanced. Basically, if you are
> very sure that it is going to be balanced off anyway, it is probably
> better to do it at clone.

balancing threads/processes is not a problem, as long as it happens
within the rules of normal balancing. ie. 'new context created' (on
exec, fork or clone) is just an event that impacts the load scenario,
and which might trigger rebalancing.

_if_ the sharing between various contexts is very high and it's
actually faster to run them all single-threaded, then the application
writer can bind them to one CPU, via the affinity syscalls. But the
scheduler cannot know this in advance.

so the cleanest assumption, from the POV of the scheduler, is that
there's no sharing between contexts. Things become really simple once
this assumption is made.

and frankly, it's much easier to argue with application developers
whose application scales badly and thus the scheduler over-distributes
it, than with application developers whose application scales badly
due to the scheduler.

	Ingo
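For reference, the affinity syscalls Ingo means are
sched_setaffinity()/sched_getaffinity(). A minimal sketch of the
workaround he describes, using the three-argument glibc wrapper (the
exact userspace prototype was still settling at the time of this
thread):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/*
 * Bind the calling thread to a single CPU, for applications whose
 * threads share so much that spreading them across nodes is a loss.
 */
int bind_self_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}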
From: Nick P. <nic...@ya...> - 2004-03-30 08:19:47
Ingo Molnar wrote:
> * Nick Piggin <nic...@ya...> wrote:
>
>> Maybe balance on clone would be beneficial if we only balance onto
>> CPUs which are idle or very badly imbalanced. Basically, if you are
>> very sure that it is going to be balanced off anyway, it is probably
>> better to do it at clone.
>
> balancing threads/processes is not a problem, as long as it happens
> within the rules of normal balancing.
>
> ie. 'new context created' (on exec, fork or clone) is just an event
> that impacts the load scenario, and which might trigger rebalancing.
>
> _if_ the sharing between various contexts is very high and it's
> actually faster to run them all single-threaded, then the application
> writer can bind them to one CPU, via the affinity syscalls. But the
> scheduler cannot know this in advance.
>
> so the cleanest assumption, from the POV of the scheduler, is that
> there's no sharing between contexts. Things become really simple once
> this assumption is made.
>
> and frankly, it's much easier to argue with application developers
> whose application scales badly and thus the scheduler
> over-distributes it, than with application developers whose
> application scales badly due to the scheduler.

You're probably mostly right, but I really don't know if I'd start
with the assumption that threads don't share anything. I think they're
very likely to share memory and cache.

Also, these additional system-wide balance points don't come for free
if you attach them to common operations (as opposed to the slow
periodic balancing). find_best_cpu needs to pull in NR_CPUS remote
(and probably hot and dirty) cachelines, which can get expensive, for
an operation that you are very likely to be better off *without* if
your threads do share any memory.
From: Ingo M. <mi...@el...> - 2004-03-30 08:44:38
* Nick Piggin <nic...@ya...> wrote:

> You're probably mostly right, but I really don't know if I'd start
> with the assumption that threads don't share anything. I think
> they're very likely to share memory and cache.

it all depends on the workload i guess, but generally if the
application scales well then the threads only share data in a
read-mostly manner - hence we can balance at creation time.

if the application does not scale well then balancing too early cannot
make the app perform much worse.

things like JVMs tend to want good balancing - they really are
userspace simulations of separate contexts, with little sharing and
good overall scalability of the architecture.

> Also, these additional system-wide balance points don't come for
> free if you attach them to common operations (as opposed to the slow
> periodic balancing).

yes, definitely. the implementation in sched2.patch does not take this
into account yet. There are a number of things we can do about the
500-CPU case. Eg. only do the balance search towards the next N
nodes/cpus (tunable via a domain parameter).

	Ingo
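A toy sketch of the bounded search Ingo suggests; the load-array
representation and the span parameter are invented for illustration
and are not taken from sched2.patch.

#include <stdio.h>

/*
 * Instead of scanning all NR_CPUS runqueues, look only at the next
 * `span` CPUs after this one (wrapping around), and return the least
 * loaded of those.
 */
static int find_target_cpu(int this_cpu, int nr_cpus, int span,
                           const int load[])
{
    int best = this_cpu;

    for (int i = 1; i <= span; i++) {
        int cpu = (this_cpu + i) % nr_cpus;
        if (load[cpu] < load[best])
            best = cpu;
    }
    return best;
}

int main(void)
{
    int load[8] = { 3, 2, 0, 5, 1, 4, 2, 6 };

    /* search only 3 neighbours of CPU 0 instead of all 8 CPUs */
    printf("target: %d\n", find_target_cpu(0, 8, 3, load));
    return 0;
}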
From: Nick P. <nic...@ya...> - 2004-03-30 08:54:16
Ingo Molnar wrote:
> * Nick Piggin <nic...@ya...> wrote:
>
>> You're probably mostly right, but I really don't know if I'd start
>> with the assumption that threads don't share anything. I think
>> they're very likely to share memory and cache.
>
> it all depends on the workload i guess, but generally if the
> application scales well then the threads only share data in a
> read-mostly manner - hence we can balance at creation time.
>
> if the application does not scale well then balancing too early
> cannot make the app perform much worse.
>
> things like JVMs tend to want good balancing - they really are
> userspace simulations of separate contexts, with little sharing and
> good overall scalability of the architecture.

Well, it will be interesting to see how it goes. Unfortunately I don't
have a single realistic benchmark. In fact the only threaded one I
have is volanomark.

>> Also, these additional system-wide balance points don't come for
>> free if you attach them to common operations (as opposed to the
>> slow periodic balancing).
>
> yes, definitely.
>
> the implementation in sched2.patch does not take this into account
> yet. There are a number of things we can do about the 500-CPU case.
> Eg. only do the balance search towards the next N nodes/cpus (tunable
> via a domain parameter).

Yeah, I think we shouldn't worry too much about the 500-CPU case,
because they will obviously end up using their own domains. But it is
possible this would hurt smaller CPU counts too. Again, it means
testing.

I think we should probably aim to have a usable and decent default
domain for 32, maybe 64 CPUs, and not worry about larger numbers too
much if it would hurt lower-end performance.
From: Martin J. B. <mb...@ar...> - 2004-03-30 15:27:14
> Well, it will be interesting to see how it goes. Unfortunately I
> don't have a single realistic benchmark.

That's OK, neither does anyone else ;-) OK, for HPC workloads they do,
but not for other stuff. The closest I can come conceptually is to run
multiple instances of a Java benchmark in parallel. The existing ones
all tend to be either 1 process with many threads, or many processes
each with one thread. There are no m x n benchmarks around that I've
found, and that seems to be a lot more like what the customers I've
seen are interested in (throwing a DB, webserver, Java, etc. all on
one machine).

Making balance_on_fork a userspace-hintable thing wouldn't hurt us at
all though, and would provide a great escape route for the HPC people.
Some simple pokeable in /proc would probably be sufficient.

balance_on_clone is harder, as whether you want to do it or not
depends more on the state of the rest of the system, which is very
hard for userspace to know ...

M.
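No such pokeable existed at the time; purely as an illustration of the
interface Martin is floating, the /proc path and semantics below are
invented:

#include <stdio.h>

/* Hypothetical knob: the path and its meaning are made up here to
 * illustrate the "simple pokeable in /proc" idea, nothing more. */
#define BALANCE_ON_FORK_KNOB "/proc/sys/kernel/sched_balance_on_fork"

/* Opt the system in to fork-time balancing, for HPC-style loads. */
int enable_balance_on_fork(void)
{
    FILE *f = fopen(BALANCE_ON_FORK_KNOB, "w");

    if (!f)
        return -1;
    fputs("1\n", f);
    fclose(f);
    return 0;
}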
From: Ingo M. <mi...@el...> - 2004-03-30 08:18:16
* Andi Kleen <ak...@su...> wrote:

> > ok, could you try min_interval, max_interval and busy_factor all
> > with a value of 4, in sched.h's SD_NODE_INIT template? (again, only
> > for testing purposes.)
>
> I kept the old patch and made these changes. The results are much
> more consistent now, 3+x CPU. I still get variations of ~2GB/s, but I
> had this with older kernels too.

great. now, could you try the following patch, against vanilla -mm5:

  redhat.com/~mingo/scheduler-patches/sched2.patch

this includes 'context balancing' and doesn't touch the NUMA async
balancing tunables. Do you get better performance than with stock
-mm5?

	Ingo
From: Andi K. <ak...@su...> - 2004-03-30 09:36:28
On Tue, 30 Mar 2004 10:18:40 +0200
Ingo Molnar <mi...@el...> wrote:

> * Andi Kleen <ak...@su...> wrote:
>
> > > ok, could you try min_interval, max_interval and busy_factor all
> > > with a value of 4, in sched.h's SD_NODE_INIT template? (again,
> > > only for testing purposes.)
> >
> > I kept the old patch and made these changes. The results are much
> > more consistent now, 3+x CPU. I still get variations of ~2GB/s,
> > but I had this with older kernels too.
>
> great.
>
> now, could you try the following patch, against vanilla -mm5:
>
>   redhat.com/~mingo/scheduler-patches/sched2.patch
>
> this includes 'context balancing' and doesn't touch the NUMA async
> balancing tunables. Do you get better performance than with stock
> -mm5?

I get better performance (roughly 2.1x CPU), but only about half the
optimum.

-Andi
From: Martin J. B. <mb...@ar...> - 2004-03-25 19:25:23
>> It doesn't do load balance in wake_up_forked_process() and is
>> relatively non-aggressive in balancing later. This leads to the
>> multithreaded OpenMP STREAM running its children first on the same
>> node as the original process and allocating memory there. [...]
>
> i believe the fix we want is to pre-balance the context at fork()
> time. I've implemented this (which is basically just a reuse of
> sched_balance_exec() in fork.c, and the related namespace cleanups),
> could you give it a go:
>
>   http://redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm2-A5
>
> another solution would be to add SD_BALANCE_FORK.
>
> also, the best place to do fork() balancing is not at
> wake_up_forked_process() time, but prior to doing the MM copy. This
> patch does it there. At wakeup time we've already copied all the
> pagetables and created tons of dirty cachelines.

How are you going to decide whether to rebalance at fork time or exec
time? Exec-time balancing is a *lot* more efficient, it just doesn't
work for things that don't exec ... cloned threads would certainly be
one case.

M.
From: Ingo M. <mi...@el...> - 2004-03-25 21:59:09
* Andi Kleen <ak...@su...> wrote:

> It doesn't do load balance in wake_up_forked_process() and is
> relatively non-aggressive in balancing later. This leads to the
> multithreaded OpenMP STREAM running its children first on the same
> node as the original process and allocating memory there. Then later
> they run on a different node when the balancing finally happens, but
> generate cross traffic to the old node, instead of using the memory
> bandwidth of their local nodes.
>
> The difference is very visible: even the 4-thread STREAM only sees
> the bandwidth of a single node. With a more aggressive scheduler you
> get 4 times as much.
>
> Admittedly it's a bit of a stupid benchmark, but it seems to be
> representative of a lot of HPC codes.

There's no way the scheduler can figure out the scheduling and memory
use patterns of the new tasks in advance. but userspace could give
hints - e.g. a syscall that triggers a rebalancing:
sys_sched_load_balance(). This way userspace notifies the scheduler
that it is on 'zero ground' and that the scheduler can move it to the
least loaded cpu/node.

a variant of this is already possible: userspace can use setaffinity
to load-balance manually - but sched_load_balance() would be
automatic.

	Ingo
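The manual variant can be sketched in userspace C; note that
sys_sched_load_balance() is only a proposal in this mail and was never
part of any posted patch:

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/*
 * Manual load-balancing via the affinity syscall: at thread start,
 * before any memory has been touched ("zero ground"), spread the
 * threads round-robin over the online CPUs.  The hypothetical
 * sys_sched_load_balance() would instead let the kernel pick the
 * least-loaded CPU/node automatically.
 */
static int spread_self(int thread_idx)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t mask;

    if (ncpus < 1)
        ncpus = 1;
    CPU_ZERO(&mask);
    CPU_SET(thread_idx % (int)ncpus, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}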
From: Nick P. <nic...@ya...> - 2004-03-26 03:56:12
Andi Kleen wrote:
> On Thu, Mar 25, 2004 at 07:31:37AM -0800, Nakajima, Jun wrote:
>
>> Andi,
>>
>> Can you be more specific with "it doesn't load balance threads
>> aggressively enough"? Or what behavior of the base NUMA scheduler is
>> missing in the sched-domain scheduler, especially for NUMA?
>
> It doesn't do load balance in wake_up_forked_process() and is
> relatively non-aggressive in balancing later. This leads to the
> multithreaded OpenMP STREAM running its children first on the same
> node as the original process and allocating memory there. Then later
> they run on a different node when the balancing finally happens, but
> generate cross traffic to the old node, instead of using the memory
> bandwidth of their local nodes.
>
> The difference is very visible: even the 4-thread STREAM only sees
> the bandwidth of a single node. With a more aggressive scheduler you
> get 4 times as much.
>
> Admittedly it's a bit of a stupid benchmark, but it seems to be
> representative of a lot of HPC codes.

Hi Andi,

Sorry I keep telling you I'll work on this, but I never get around to
it. Mostly lack of hardware makes it difficult. I've fixed a few bugs
and some other workloads, so I keep hoping that they will fix your
problem :P

Your STREAM performance is really bad, and I hope you don't think I'm
going to ignore it even if it is a bit stupid. Give me a bit more
time.

Of course, there is nothing fundamentally wrong with sched-domains
that is causing your problem. It can easily do anything the old NUMA
scheduler can do. It must be a bug or some bad tuning somewhere.

Nick
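For context, the failure mode Andi keeps describing is first-touch
page placement: each page is allocated on the node of the thread that
first writes it. A minimal OpenMP-C sketch (compile with e.g.
gcc -fopenmp; the array size is arbitrary):

#include <stdlib.h>

#define N (1 << 24)          /* arbitrary working-set size */

int main(void)
{
    double *a = malloc(N * sizeof(*a));
    int i;

    if (!a)
        return 1;

    /*
     * First touch: each page of `a` lands on the node of the thread
     * that writes it here.  If all threads still run on the parent's
     * node at this point, every page ends up on one node...
     */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = 0.0;

    /*
     * ...and this STREAM-style loop then pulls most of its data over
     * the interconnect once the threads are finally balanced off to
     * other nodes.
     */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * a[i];

    free(a);
    return 0;
}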
From: Ingo M. <mi...@el...> - 2004-03-25 21:47:44
* Martin J. Bligh <mb...@ar...> wrote:

> Exec-time balancing is a *lot* more efficient, it just doesn't work
> for things that don't exec ... cloned threads would certainly be one
> case.

yeah - exec balancing is a clear thing. fork/clone-time balancing is a
lot less clear.

	Ingo
From: Martin J. B. <mb...@ar...> - 2004-03-25 22:29:13
>> Exec-time balancing is a *lot* more efficient, it just doesn't work
>> for things that don't exec ... cloned threads would certainly be
>> one case.
>
> yeah - exec balancing is a clear thing. fork/clone-time balancing is
> a lot less clear.

OK, well it *looks* to me from a quick look at your patch like
sched_balance_context will rebalance at both fork *and* exec time.
That seems like a bad plan, but maybe I'm misreading it.

Can we hold off on changing the fork/exec-time balancing until we've
come to a plan as to what should actually be done with it? Unless
userspace gives us some hint, it's frigging hard to be sure whether a
process is going to exec or not - and the vast majority of things do.
There was a really good reason why the code is currently set up that
way; it's not some random accident ;-)

Clone is a much more interesting case, though at the time I
consciously decided NOT to do that, as we really mostly want threads
on the same node. The exception is the case where we have one app with
lots of threads and nothing much else running on the system ... I tend
to think of that as an artificial benchmark situation, but maybe
that's not fair.

We probably need to just do a more conservative version of the
cross-node rebalance at fork time.

M.
From: Andrew T. <hab...@us...> - 2004-03-25 22:30:57
On Thursday 25 March 2004 15:59, Ingo Molnar wrote:
> * Andi Kleen <ak...@su...> wrote:
> > It doesn't do load balance in wake_up_forked_process() and is
> > relatively non-aggressive in balancing later. This leads to the
> > multithreaded OpenMP STREAM running its children first on the same
> > node as the original process and allocating memory there. Then
> > later they run on a different node when the balancing finally
> > happens, but generate cross traffic to the old node, instead of
> > using the memory bandwidth of their local nodes.
> >
> > The difference is very visible: even the 4-thread STREAM only sees
> > the bandwidth of a single node. With a more aggressive scheduler
> > you get 4 times as much.
> >
> > Admittedly it's a bit of a stupid benchmark, but it seems to be
> > representative of a lot of HPC codes.
>
> There's no way the scheduler can figure out the scheduling and memory
> use patterns of the new tasks in advance.
>
> but userspace could give hints - e.g. a syscall that triggers a
> rebalancing: sys_sched_load_balance(). This way userspace notifies
> the scheduler that it is on 'zero ground' and that the scheduler can
> move it to the least loaded cpu/node.
>
> a variant of this is already possible: userspace can use setaffinity
> to load-balance manually - but sched_load_balance() would be
> automatic.

For Opteron, simply placing all CPUs in the same sched domain may
solve all of this, since we would have the balancing frequency of the
default scheduler. Is there any reason this cannot be done for
Opteron?

Also, I think Erich Focht had another patch which would allow much
more frequent node balancing if nr_cpus_node was 1.
From: Martin J. B. <mb...@ar...> - 2004-03-25 22:38:21
> For Opteron, simply placing all CPUs in the same sched domain may
> solve all of this, since we would have the balancing frequency of the
> default scheduler. Is there any reason this cannot be done for
> Opteron?

That seems like a good plan to me - they really don't want that
cross-node balancing. It might be cleaner to implement it by just
tweaking the cross-balance parameters for that system to have the same
effect, but it probably doesn't matter much (I'm thinking of some
future case when they decide to do multi-chip on die or SMT, so just
keying off 1 CPU per node doesn't really fix it).

M.
From: Andi K. <ak...@su...> - 2004-03-26 05:29:08
On Thu, 25 Mar 2004 16:30:16 -0600
Andrew Theurer <hab...@us...> wrote:

> For Opteron, simply placing all CPUs in the same sched domain may
> solve all of this, since we would have the balancing frequency of the
> default scheduler. Is there any reason this cannot be done for
> Opteron?

Yes, that makes sense. I will try that.

-Andi
From: Andi K. <ak...@su...> - 2004-03-29 12:25:47
On Mon, 29 Mar 2004 13:46:35 +0200
Ingo Molnar <mi...@el...> wrote:

> * Andi Kleen <ak...@su...> wrote:
>
> > Sorry, ignore this report - I just found out I booted the wrong
> > kernel by mistake. Currently retesting, also with the proposed
> > change to only use a single scheduling domain.
>
> here are the items that are in the works:
>
>   redhat.com/~mingo/scheduler-patches/sched.patch

I'm trying to, but -mm5 doesn't work at all on the 4-way machine. It
goes through the full boot-up sequence, but then never opens a login
on the console, and sshd doesn't work either.

Andrew, maybe that's related to your tty fixes?

-Andi
From: Andi K. <ak...@su...> - 2004-03-29 12:33:10
On Mon, 29 Mar 2004 09:03:01 +0200
Andi Kleen <ak...@su...> wrote:

> I'm trying to, but -mm5 doesn't work at all on the 4-way machine. It
> goes through the full boot-up sequence, but then never opens a login
> on the console, and sshd doesn't work either.
>
> Andrew, maybe that's related to your tty fixes?

Reverting the two makes login work again.

-Andi
From: Andi K. <ak...@su...> - 2004-03-30 07:15:56
On Tue, 30 Mar 2004 17:03:42 +1000
Nick Piggin <nic...@ya...> wrote:

> So it is very likely to be a case of the threads running too long on
> one CPU before being balanced off, and faulting in most of their
> working memory from one node, right?

Yes.

> I think it is impossible for the scheduler to correctly identify
> this and implement the behaviour that OpenMP wants without causing
> regressions on more general workloads (assuming this is the problem).

Regression on what workload? The 2.4 kernel, which did the early
balancing, didn't seem to have problems.

I have the NUMA API for an application to select memory placement
manually, but it's unrealistic to expect all applications to use it,
so the scheduler has to do at least a reasonable default.

In general, on Opteron you want to go as quickly as possible to your
target node. Keeping things on the local node and hoping that threads
won't need to be balanced off is probably a loss. It is quite possible
that other systems have different requirements, but I doubt there is a
"one size fits all" requirement, and doing a custom domain setup or
similar would be fine for me. (or at least, if sched domains cannot be
tuned for Opteron then it would have failed its promise of being a
configurable scheduler)

> I suspect this would still be a regression for other tests though,
> where thread creation is more frequent, threads share their working
> set more often, or the number of threads > the number of CPUs.

I can try such tests if they're not too time-consuming to set up. What
did you have in mind?

-Andi
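The NUMA API Andi refers to is the numactl/libnuma work; a minimal
placement sketch, assuming the libnuma calls that shipped with numactl
(link with -lnuma):

#include <numa.h>
#include <stdio.h>

/*
 * Manually place a worker: run it on `node` and give it memory
 * allocated from that node's local memory, instead of trusting the
 * load balancer.  Free the buffer later with numa_free().
 */
static double *setup_worker(int node, size_t nbytes)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return NULL;
    }
    if (numa_run_on_node(node) < 0)   /* bind execution to the node */
        return NULL;
    return numa_alloc_onnode(nbytes, node);
}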
From: Nick P. <nic...@ya...> - 2004-03-30 07:24:24
Andi Kleen wrote:
> On Tue, 30 Mar 2004 17:03:42 +1000
> Nick Piggin <nic...@ya...> wrote:
>
>> So it is very likely to be a case of the threads running too long on
>> one CPU before being balanced off, and faulting in most of their
>> working memory from one node, right?
>
> Yes.
>
>> I think it is impossible for the scheduler to correctly identify
>> this and implement the behaviour that OpenMP wants without causing
>> regressions on more general workloads (assuming this is the
>> problem).
>
> Regression on what workload? The 2.4 kernel, which did the early
> balancing, didn't seem to have problems.

No, but hopefully sched-domains balancing will do better than the old
numasched.

> I have the NUMA API for an application to select memory placement
> manually, but it's unrealistic to expect all applications to use it,
> so the scheduler has to do at least a reasonable default.
>
> In general, on Opteron you want to go as quickly as possible to your
> target node. Keeping things on the local node and hoping that threads
> won't need to be balanced off is probably a loss. It is quite
> possible that other systems have different requirements, but I doubt
> there is a "one size fits all" requirement, and doing a custom domain
> setup or similar would be fine for me.

It is the same situation with all NUMA; obviously Opteron's 1 CPU per
node means it is sensitive to node imbalances.

> (or at least, if sched domains cannot be tuned for Opteron then it
> would have failed its promise of being a configurable scheduler)

Well, it seems like Ingo is on to something. Phew! :)

>> I suspect this would still be a regression for other tests though,
>> where thread creation is more frequent, threads share their working
>> set more often, or the number of threads > the number of CPUs.
>
> I can try such tests if they're not too time-consuming to set up.
> What did you have in mind?

Not really sure. I guess probably most things that use a lot of
threads: maybe Java, or a web server using per-connection threads (if
there is such a thing). On the other hand though, maybe it will be a
good idea if it is done carefully...
From: Arjan v. de V. <ar...@re...> - 2004-03-30 07:39:23
> Regression on what workload? The 2.4 kernel, which did the early
> balancing, didn't seem to have problems.

well, the hard balance is between a program that just splits off one
thread and has those two threads working closely together (in which
case you want the two threads together on the same quad in a quad-like
setup), and a program that splits off a thread and has the two threads
working basically entirely independently. Benchmarks are typically of
the latter kind... but what about real-world applications? The ones I
can think of that use threads are of the former kind.