From: Nakajima, J. <jun...@in...> - 2004-03-25 15:16:00
We have found some performance regressions (e.g. SPECjbb) with the scheduler on a large IA-64 NUMA machine, and we are debugging it. On SMP machines, we haven't seen performance regressions.

Jun

> -----Original Message-----
> From: Andi Kleen [mailto:ak...@su...]
> Sent: Wednesday, March 24, 2004 8:56 PM
> To: Ingo Molnar
> Cc: pi...@cy...; lin...@vg...; ak...@os...; ke...@ko...;
>     ru...@ru...; Nakajima, Jun; ric...@us...; an...@sa...;
>     lse...@li...; mb...@ar...
> Subject: Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
>
> On Thu, 25 Mar 2004 09:28:09 +0100
> Ingo Molnar <mi...@el...> wrote:
>
> > i've reviewed the sched-domains balancing patches for upstream inclusion
> > and they look mostly fine.
>
> The main problem it has is that it performs quite badly on Opteron NUMA,
> e.g. in the OpenMP STREAM test (much worse than the normal scheduler).
>
> -Andi
From: Nakajima, J. <jun...@in...> - 2004-03-25 15:32:32
Andi,

Can you be more specific with "it doesn't load balance threads aggressively enough"? Or, what behavior of the base NUMA scheduler is missing in the sched-domain scheduler, especially for NUMA?

Jun

> -----Original Message-----
> From: Andi Kleen [mailto:ak...@su...]
> Sent: Thursday, March 25, 2004 3:47 AM
> To: Rick Lindsley
> Cc: Andi Kleen; Ingo Molnar; pi...@cy...; linux-ke...@vg...;
>     ak...@os...; ke...@ko...; ru...@ru...; Nakajima, Jun;
>     an...@sa...; lse-te...@li...; mb...@ar...
> Subject: Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
>
> On Thu, Mar 25, 2004 at 03:40:22AM -0800, Rick Lindsley wrote:
> > > The main problem it has is that it performs quite badly on Opteron NUMA,
> > > e.g. in the OpenMP STREAM test (much worse than the normal scheduler).
> >
> > Andi, I've got some schedstat code which may help us to understand why.
> > I'll need to port it to Ingo's changes, but if I drop you a patch in a
> > day or two can you try your test on sched-domain/non-sched-domain,
> > collecting the stats?
>
> The OpenMP failure is already pretty well understood - it doesn't load
> balance threads aggressively enough over CPUs after startup.
>
> -Andi
From: Andi K. <ak...@su...> - 2004-03-25 15:40:24
On Thu, Mar 25, 2004 at 07:31:37AM -0800, Nakajima, Jun wrote:
> Andi,
>
> Can you be more specific with "it doesn't load balance threads
> aggressively enough"? Or, what behavior of the base NUMA scheduler is
> missing in the sched-domain scheduler, especially for NUMA?

It doesn't do load balancing in wake_up_forked_process() and is relatively non-aggressive in balancing later. This leads to the multithreaded OpenMP STREAM running its children first on the same node as the original process and allocating memory there. When the balancing finally happens they run on a different node, but generate cross traffic to the old node instead of using the memory bandwidth of their local nodes.

The difference is very visible: even the 4-thread STREAM only sees the bandwidth of a single node. With a more aggressive scheduler you get 4 times as much. Admittedly it's a bit of a stupid benchmark, but it seems to be representative of a lot of HPC codes.

-Andi
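To make the failure mode Andi describes concrete, here is a minimal STREAM-style sketch in C (illustrative only -- not the actual STREAM benchmark; compile with e.g. gcc -fopenmp). Under the first-touch NUMA policy, each page is allocated on the node of the CPU that first writes it; if all threads are still on the parent's node during the init loop and are spread across nodes only afterwards, the bandwidth-bound loop pulls everything over the interconnect:

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)	/* 128 MB per array */

int main(void)
{
	double *a = malloc(N * sizeof(double));
	double *b = malloc(N * sizeof(double));
	double sum = 0.0;
	long i;

	if (!a || !b)
		return 1;

	/* First touch: these writes decide which node each page lives on.
	 * If the threads have not yet been balanced across nodes, every
	 * page lands on the parent's node. */
	#pragma omp parallel for
	for (i = 0; i < N; i++) {
		a[i] = 1.0;
		b[i] = 2.0;
	}

	/* Bandwidth-bound phase: threads migrated to other nodes after
	 * the init loop now fetch all their data from the old node. */
	#pragma omp parallel for reduction(+:sum)
	for (i = 0; i < N; i++)
		sum += a[i] * b[i];

	printf("%f\n", sum);
	free(a);
	free(b);
	return 0;
}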
From: Ingo M. <mi...@el...> - 2004-03-25 19:09:15
* Andi Kleen <ak...@su...> wrote:

> It doesn't do load balancing in wake_up_forked_process() and is
> relatively non-aggressive in balancing later. This leads to the
> multithreaded OpenMP STREAM running its children first on the same node
> as the original process and allocating memory there. [...]

i believe the fix we want is to pre-balance the context at fork() time. I've implemented this (it is basically just a reuse of sched_balance_exec() in fork.c, plus the related namespace cleanups); could you give it a go:

  http://redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm2-A5

another solution would be to add SD_BALANCE_FORK.

also, the best place to do fork() balancing is not at wake_up_forked_process() time, but prior to doing the MM copy. This patch does it there. At wakeup time we've already copied all the pagetables and created tons of dirty cachelines.

	Ingo
From: Andi K. <ak...@su...> - 2004-03-25 19:20:56
On Thu, 25 Mar 2004 20:09:45 +0100 Ingo Molnar <mi...@el...> wrote:

> also, the best place to do fork() balancing is not at
> wake_up_forked_process() time, but prior to doing the MM copy. This
> patch does it there. At wakeup time we've already copied all the
> pagetables and created tons of dirty cachelines.

That won't help for threaded programs that use clone(). OpenMP is such a case.

-Andi
From: Ingo M. <mi...@el...> - 2004-03-25 19:38:47
* Andi Kleen <ak...@su...> wrote:

> That won't help for threaded programs that use clone(). OpenMP is such
> a case.

yeah, agreed. Also, exec-balance, if applied to fork(), would migrate the parent, which is not what we want. We could perhaps migrate the parent to the target CPU, copy the context, then migrate the parent back to the original CPU ... but this sounds too complex.

	Ingo
From: Ingo M. <mi...@el...> - 2004-03-25 20:46:34
* Andi Kleen <ak...@su...> wrote:

> That won't help for threaded programs that use clone(). OpenMP is such
> a case.

this patch:

  redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm3-A4

does balancing at wake_up_forked_process()-time.

but it's a hard issue. Especially after fork() we do have a fair amount of cache context, and migrating at this point can be bad for performance.

	Ingo
From: Andi K. <ak...@su...> - 2004-03-29 08:46:04
On Thu, Mar 25, 2004 at 09:30:32PM +0100, Ingo Molnar wrote:
>
> * Andi Kleen <ak...@su...> wrote:
>
> > That won't help for threaded programs that use clone(). OpenMP is such
> > a case.
>
> this patch:
>
>   redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm3-A4
>
> does balancing at wake_up_forked_process()-time.
>
> but it's a hard issue. Especially after fork() we do have a fair amount
> of cache context, and migrating at this point can be bad for
> performance.

I ported it by hand to the -mm4 scheduler now and tested it. While it works marginally better than the standard -mm scheduler (you get 1.5x the bandwidth of one CPU instead of 1x), it's still much worse than the optimum of nearly 4x achieved by 2.4 or the standard scheduler.

-Andi
From: Rick L. <ric...@us...> - 2004-03-29 10:22:02
I've got a web page up now on my home machine which shows data from schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under load from kernbench, SPECjbb, and SPECdet.

  http://eaglet.rain.com/rick/linux/sched-domain/index.html

Two things stand out. One is that sched-domains tends to call load_balance() less frequently when it is idle and more frequently when it is busy (as compared to the "standard" scheduler). Another is that even though it moves fewer tasks on average, the sched-domains code shows about half of pull_task()'s work coming from active_load_balance() ... and that seems wrong. Could these be contributing to what you're seeing?

Rick
From: Andi K. <ak...@su...> - 2004-03-29 10:30:36
On Mon, 29 Mar 2004 02:20:58 -0800 Rick Lindsley <ric...@us...> wrote:

> I've got a web page up now on my home machine which shows data from
> schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> load from kernbench, SPECjbb, and SPECdet.
>
>   http://eaglet.rain.com/rick/linux/sched-domain/index.html
>
> Two things stand out. One is that sched-domains tends to call
> load_balance() less frequently when it is idle and more frequently when
> it is busy (as compared to the "standard" scheduler). Another is that
> even though it moves fewer tasks on average, the sched-domains code
> shows about half of pull_task()'s work coming from active_load_balance()
> ... and that seems wrong. Could these be contributing to what you're
> seeing?

Sounds quite possible, yes.

-Andi
From: Nick P. <nic...@ya...> - 2004-03-29 11:28:48
Rick Lindsley wrote:

> I've got a web page up now on my home machine which shows data from
> schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> load from kernbench, SPECjbb, and SPECdet.
>
>   http://eaglet.rain.com/rick/linux/sched-domain/index.html

I can't see it.

> Two things stand out. One is that sched-domains tends to call
> load_balance() less frequently when it is idle and more frequently when
> it is busy (as compared to the "standard" scheduler).

John Hawkes noticed problems here too. mm5 has a patch to improve this for NUMA node balancing. There is no change on non-NUMA though, if that is what you were testing - we might need to tune this a bit if it is hurting.

> Another is that even though it moves fewer tasks on average, the
> sched-domains code shows about half of pull_task()'s work coming from
> active_load_balance() ...

Yeah, this is wrong and shouldn't be happening. It would have been due to a bug in the imbalance calculation, which is now fixed.
From: Nick P. <nic...@ya...> - 2004-03-29 11:20:23
Andi Kleen wrote:

> On Thu, Mar 25, 2004 at 09:30:32PM +0100, Ingo Molnar wrote:
>
> > but it's a hard issue. Especially after fork() we do have a fair amount
> > of cache context, and migrating at this point can be bad for
> > performance.
>
> I ported it by hand to the -mm4 scheduler now and tested it. While it
> works marginally better than the standard -mm scheduler (you get 1.5x
> the bandwidth of one CPU instead of 1x), it's still much worse than the
> optimum of nearly 4x achieved by 2.4 or the standard scheduler.

OK, there must be some pretty simple reason why this is happening.

I guess being OpenMP it is probably a bit complicated for you to try your own scheduling in userspace using CPU affinities? Otherwise, could you trace what gets scheduled where, for both good and bad kernels? It should help us work out what is going on.

I wonder if using one CPU from each quad of the NUMAQ would give at all comparable behaviour...

If it isn't a big problem, could you test with -mm5 with the generic sched domain? STREAM doesn't take long, does it? I don't expect much difference, but the code is in flux while Ingo and I try to sort things out.
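For what it's worth, the userspace-affinity experiment Nick suggests might look something like this (a sketch only, assuming the simplest possible policy of pinning OpenMP thread i to CPU i; the real test would presumably pick one CPU per node):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
	#pragma omp parallel
	{
		cpu_set_t mask;
		int cpu = omp_get_thread_num();

		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);
		/* pid 0 == the calling thread: pin this worker to `cpu`,
		 * taking placement out of the kernel balancer's hands */
		if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
			perror("sched_setaffinity");
	}

	/* ... run the STREAM loops here: with first-touch, each thread
	 * now allocates its pages on the node it will keep running on ... */
	return 0;
}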
From: Andi K. <ak...@su...> - 2004-03-29 11:24:32
On Mon, 29 Mar 2004 21:20:12 +1000 Nick Piggin <nic...@ya...> wrote:

> > I ported it by hand to the -mm4 scheduler now and tested it. While it
> > works marginally better than the standard -mm scheduler (you get 1.5x
> > the bandwidth of one CPU instead of 1x), it's still much worse than the
> > optimum of nearly 4x achieved by 2.4 or the standard scheduler.

Sorry, ignore this report - I just found out I booted the wrong kernel by mistake. Currently retesting, also with the proposed change to only use a single scheduling domain.

-Andi
From: Ingo M. <mi...@el...> - 2004-03-29 12:00:00
* Andi Kleen <ak...@su...> wrote:

> Sorry, ignore this report - I just found out I booted the wrong kernel
> by mistake. Currently retesting, also with the proposed change to only
> use a single scheduling domain.

here are the items that are in the works:

  redhat.com/~mingo/scheduler-patches/sched.patch

it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active balancing a bit.

	Ingo
From: Andi K. <ak...@su...> - 2004-03-29 20:31:02
On Mon, 29 Mar 2004 13:46:35 +0200 Ingo Molnar <mi...@el...> wrote:

> here are the items that are in the works:
>
>   redhat.com/~mingo/scheduler-patches/sched.patch
>
> it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
> balancing a bit.

I applied only this patch and it did slightly better than the normal -mm*: 1.5-2x CPU bandwidth, but still well short of the 3.7x-4x that mainline and 2.4 reach.

-Andi
From: Nick P. <nic...@ya...> - 2004-03-29 23:52:19
Andi Kleen wrote:

> I applied only this patch and it did slightly better than the normal
> -mm*: 1.5-2x CPU bandwidth, but still well short of the 3.7x-4x that
> mainline and 2.4 reach.

So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and 2.6 get?
From: Andi K. <ak...@su...> - 2004-03-30 06:34:58
On Tue, 30 Mar 2004 09:51:46 +1000 Nick Piggin <nic...@ya...> wrote:

> So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
> 2.6 get?

Yes (2.6 vanilla and 2.4-aa, that is; I haven't tested 2.4 vanilla).

Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x CPU), but still much worse than the max of 3.7x-4x CPU bandwidth.

-Andi
From: Ingo M. <mi...@el...> - 2004-03-30 06:39:59
* Andi Kleen <ak...@su...> wrote:

> > So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
> > 2.6 get?
>
> Yes (2.6 vanilla and 2.4-aa, that is; I haven't tested 2.4 vanilla).
>
> Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x CPU),
> but still much worse than the max of 3.7x-4x CPU bandwidth.

Andi, could you please try the patch below - this will test whether this has to do with the rate of balancing between NUMA nodes. The patch itself is not correct (it way overbalances on NUMA), but it tests the theory.

	Ingo

--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -627,7 +627,7 @@ struct sched_domain {
 	.parent			= NULL,				\
 	.groups			= NULL,				\
 	.min_interval		= 8,				\
-	.max_interval		= 256*fls(num_online_cpus()),	\
+	.max_interval		= 8,				\
 	.busy_factor		= 8,				\
 	.imbalance_pct		= 125,				\
 	.cache_hot_time		= (10*1000000),			\
From: Andi K. <ak...@su...> - 2004-03-30 07:07:22
On Tue, 30 Mar 2004 08:40:15 +0200 Ingo Molnar <mi...@el...> wrote:

> Andi, could you please try the patch below - this will test whether this
> has to do with the rate of balancing between NUMA nodes. The patch
> itself is not correct (it way overbalances on NUMA), but it tests the
> theory.

This works much better, but the results vary wildly (my tests go from 2.8x to ~3.8x CPU bandwidth for 4 CPUs; the 2- and 3-CPU cases are OK). More consistent results would be better, though.

-Andi
From: Nick P. <nic...@ya...> - 2004-03-30 07:14:45
Andi Kleen wrote:

> > Andi, could you please try the patch below - this will test whether this
> > has to do with the rate of balancing between NUMA nodes. The patch
> > itself is not correct (it way overbalances on NUMA), but it tests the
> > theory.
>
> This works much better, but the results vary wildly (my tests go from
> 2.8x to ~3.8x CPU bandwidth for 4 CPUs; the 2- and 3-CPU cases are OK).
> More consistent results would be better, though.

Oh good, thanks Ingo. Andi, you probably want to lower your minimum balance time too then, and maybe try with an even lower maximum. Maybe reduce cache_hot_time a bit too.
From: Ingo M. <mi...@el...> - 2004-03-30 07:45:04
* Nick Piggin <nic...@ya...> wrote:

> Oh good, thanks Ingo. Andi, you probably want to lower your minimum
> balance time too then, and maybe try with an even lower maximum. Maybe
> reduce cache_hot_time a bit too.

i dont think we want to balance with that high a frequency on NUMA Opteron. These tuning values were for testing only.

i'm dusting off the balance-on-clone patch right now; that should be the correct solution. It is based on a find_idlest_cpu() function which searches for the least loaded CPU and checks whether we can do passive load-balancing to it. Ie. it's yet another balancing point in the scheduler, _not_ a balancing-logic change.

	Ingo
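A minimal sketch of the find_idlest_cpu() idea Ingo describes (illustrative only -- the real function walks struct sched_domain and per-runqueue data rather than a bare load array; imbalance_pct here plays the role of the sched-domain field of the same name):

#include <stdio.h>

static int find_idlest_cpu(const unsigned long *cpu_load, int nr_cpus,
			   int this_cpu, unsigned int imbalance_pct)
{
	unsigned long min_load = cpu_load[this_cpu];
	int cpu, idlest = this_cpu;

	/* search for the least loaded CPU */
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (cpu_load[cpu] < min_load) {
			min_load = cpu_load[cpu];
			idlest = cpu;
		}
	}

	/* passive balancing: only pick the remote CPU if the imbalance
	 * is big enough to be worth losing cache/node locality over
	 * (e.g. 125 = target must be at least 25% less loaded) */
	if (min_load * imbalance_pct < cpu_load[this_cpu] * 100)
		return idlest;
	return this_cpu;
}

int main(void)
{
	unsigned long load[4] = { 2048, 0, 1024, 1024 };

	/* a newly cloned thread on CPU 0 would be placed on idle CPU 1 */
	printf("target: %d\n", find_idlest_cpu(load, 4, 0, 125));
	return 0;
}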
From: Nick P. <nic...@ya...> - 2004-03-30 07:58:16
Ingo Molnar wrote:

> i dont think we want to balance with that high a frequency on NUMA
> Opteron. These tuning values were for testing only.

I guess not. Andi says he wants it more like UMA balancing, though...

> i'm dusting off the balance-on-clone patch right now; that should be the
> correct solution. It is based on a find_idlest_cpu() function which
> searches for the least loaded CPU and checks whether we can do passive
> load-balancing to it. Ie. it's yet another balancing point in the
> scheduler, _not_ a balancing-logic change.

Yep, as I said to Martin, I also agree this is probably good if it is done carefully. I think we'll need to get a horde of thread-benchmarking people together before turning it on by default, of course. It seems Andi can now get equivalent results without it, so it isn't a pressing issue.
From: Ingo M. <mi...@el...> - 2004-03-30 07:14:57
* Andi Kleen <ak...@su...> wrote:

> > Andi, could you please try the patch below - this will test whether this
> > has to do with the rate of balancing between NUMA nodes. The patch
> > itself is not correct (it way overbalances on NUMA), but it tests the
> > theory.
>
> This works much better, but the results vary wildly (my tests go from
> 2.8x to ~3.8x CPU bandwidth for 4 CPUs; the 2- and 3-CPU cases are OK).
> More consistent results would be better, though.

ok, could you try min_interval, max_interval and busy_factor all with a value of 4, in sched.h's SD_NODE_INIT template? (again, only for testing purposes.)

	Ingo
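Assuming Ingo's earlier test patch (max_interval = 8) is still applied, the requested change would amount to something like this hand-written hunk (line offsets illustrative; values for testing only):

--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -627,7 +627,7 @@ struct sched_domain {
 	.parent			= NULL,			\
 	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 8,			\
-	.busy_factor		= 8,			\
+	.min_interval		= 4,			\
+	.max_interval		= 4,			\
+	.busy_factor		= 4,			\
 	.imbalance_pct		= 125,			\
 	.cache_hot_time		= (10*1000000),		\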
From: Nick P. <nic...@ya...> - 2004-03-30 07:19:03
Ingo Molnar wrote:

> ok, could you try min_interval, max_interval and busy_factor all with a
> value of 4, in sched.h's SD_NODE_INIT template? (again, only for testing
> purposes.)

(sorry, forget what I said then, I'll leave it to Ingo)
From: Andi K. <ak...@su...> - 2004-03-30 07:50:04
On Tue, 30 Mar 2004 09:15:19 +0200 Ingo Molnar <mi...@el...> wrote:

> ok, could you try min_interval, max_interval and busy_factor all with a
> value of 4, in sched.h's SD_NODE_INIT template? (again, only for testing
> purposes.)

I kept the old patch and made these changes. The results are much more consistent now: 3x+ CPU bandwidth. I still get variations of ~2 GB/s, but I had those with older kernels too.

-Andi