From: Erich F. <ef...@hp...> - 2003-07-28 19:14:13
Attachments:
1cpufix-lb-2.6.0t1.patch
|
Hi,

after talking to several people at OLS about the current NUMA scheduler the conclusion was:
(1) it sucks (for particular workloads),
(2) on x86_64 (embarrassingly simple NUMA) it's useless, goto (1).

Fact is that the current separation of local and global balancing, where global balancing is done only in the timer interrupt at a fixed rate, is way too inflexible. A CPU going idle inside a well balanced node will stay idle for a while even if there's a lot of work to do. Especially in the corner case of one CPU per node this condemns that CPU to idleness for at least 5 ms. So x86_64 platforms (but not only those!) suffer and wish to switch off the NUMA scheduler while keeping NUMA memory management on.

The attached patch is a simple solution which
- solves the 1 CPU / node problem,
- lets other systems behave (almost) as before,
- opens the way to other optimisations like multi-level node hierarchies (by tuning the retry rate),
- simplifies the NUMA scheduler and deletes more lines of code than it adds.

The timer interrupt based global rebalancing might appear to be a simple and good idea, but it takes away a lot of the scheduler's flexibility. In the patch the global rebalancing is done after a certain number of failed attempts to balance locally. The number of attempts is proportional to the number of CPUs in the current node. For only 1 CPU in the current node the scheduler doesn't even try to balance locally; it wouldn't make sense anyway. Of course one could instead set IDLE_NODE_REBALANCE_TICK = IDLE_REBALANCE_TICK, but this is more ugly (IMHO) and only helps when all nodes have 1 CPU / node.

Please consider this for inclusion.

Thanks,
Erich |
From: Martin J. B. <mb...@ar...> - 2003-07-28 19:56:27
|
> after talking to several people at OLS about the current NUMA
> scheduler the conclusion was:
> (1) it sucks (for particular workloads),
> (2) on x86_64 (embarrassingly simple NUMA) it's useless, goto (1).

I really feel there's no point in a NUMA scheduler for the Hammer style architectures. A config option to turn it off would seem like a simpler way to go, unless people can see some advantage of the full NUMA code? The interesting thing is probably whether we want balance on exec or not ... but that probably applies to UMA SMP as well ...

> Fact is that the current separation of local and global balancing,
> where global balancing is done only in the timer interrupt at a fixed
> rate is way too inflexible. A CPU going idle inside a well balanced
> node will stay idle for a while even if there's a lot of work to
> do. Especially in the corner case of one CPU per node this is
> condemning that CPU to idleness for at least 5 ms.

Surely it'll hit the idle local balancer and rebalance within the node? Or are you thinking of a case with 3 tasks on a 4 cpu/node system?

> So x86_64 platforms
> (but not only those!) suffer and wish to switch off the NUMA
> scheduler while keeping NUMA memory management on.

Right - I have a patch to make it a config option (CONFIG_NUMA_SCHED) ... I'll feed that upstream this week.

> The attached patch is a simple solution which
> - solves the 1 CPU / node problem,
> - lets other systems behave (almost) as before,
> - opens the way to other optimisations like multi-level node
>   hierarchies (by tuning the retry rate)
> - simplifies the NUMA scheduler and deletes more lines of code than it
>   adds.

Looks simple, I'll test it out.

> The timer interrupt based global rebalancing might appear to be a
> simple and good idea but it takes the scheduler a lot of
> flexibility. In the patch the global rebalancing is done after a
> certain number of failed attempts to locally balance. The number of
> attempts is proportional to the number of CPUs in the current node.

Seems like a good plan.

M. |
From: Erich F. <ef...@hp...> - 2003-07-28 20:14:53
|
On Monday 28 July 2003 21:55, Martin J. Bligh wrote:
> > after talking to several people at OLS about the current NUMA
> > scheduler the conclusion was:
> > (1) it sucks (for particular workloads),
> > (2) on x86_64 (embarrassingly simple NUMA) it's useless, goto (1).
>
> I really feel there's no point in a NUMA scheduler for the Hammer
> style architectures. A config option to turn it off would seem like
> a simpler way to go, unless people can see some advantage of the
> full NUMA code?

But the Hammer is a NUMA architecture and a working NUMA scheduler should be flexible enough to deal with it. And: the corner case of 1 CPU per node is possible also on any other NUMA platform, when in some of the nodes (or even just one) only one CPU is configured in. Solving that problem automatically gives the Hammer what it needs.

> > Fact is that the current separation of local and global balancing,
> > where global balancing is done only in the timer interrupt at a fixed
> > rate is way too inflexible. A CPU going idle inside a well balanced
> > node will stay idle for a while even if there's a lot of work to
> > do. Especially in the corner case of one CPU per node this is
> > condemning that CPU to idleness for at least 5 ms.
>
> Surely it'll hit the idle local balancer and rebalance within the node?
> Or are you thinking of a case with 3 tasks on a 4 cpu/node system?

No, no, the local load balancer just doesn't make sense now if you have one cpu per node. It is called although it will never find another cpu in its own cell to steal from. There just is nothing else... (or did I misunderstand your comment?)

> > So x86_64 platforms
> > (but not only those!) suffer and wish to switch off the NUMA
> > scheduler while keeping NUMA memory management on.
>
> Right - I have a patch to make it a config option (CONFIG_NUMA_SCHED)
> ... I'll feed that upstream this week.

That's one way, but the proposed patch just solves the problem (in a more general way, also for other NUMA cases). If you deconfigure NUMA for a NUMA platform, we'll have problems switching it back on when adding smarter things like node affine or homenode extensions.

> > The attached patch is a simple solution which
> > - solves the 1 CPU / node problem,
> > - lets other systems behave (almost) as before,
> > - opens the way to other optimisations like multi-level node
> >   hierarchies (by tuning the retry rate)
> > - simplifies the NUMA scheduler and deletes more lines of code than it
> >   adds.
>
> Looks simple, I'll test it out.

Great! Thanks!
Erich |
From: Andi K. <ak...@mu...> - 2003-07-29 14:42:33
|
On Mon, Jul 28, 2003 at 10:18:19PM +0200, Erich Focht wrote:
> > > So x86_64 platforms
> > > (but not only those!) suffer and wish to switch off the NUMA
> > > scheduler while keeping NUMA memory management on.
> >
> > Right - I have a patch to make it a config option (CONFIG_NUMA_SCHED)
> > ... I'll feed that upstream this week.
>
> That's one way, but the proposed patch just solves the problem (in a
> more general way, also for other NUMA cases). If you deconfigure NUMA
> for a NUMA platform, we'll have problems switching it back on when
> adding smarter things like node affine or homenode extensions.

That's one important point IMHO. Currently Opteron does not really need the NUMA scheduler, but it will in the future with such extensions. This means it would be better if the current scheduler supported it already so that it can be easily extended.

Thanks for the patch.

-Andi |
From: Martin J. B. <mb...@ar...> - 2003-07-28 20:38:24
|
> But the Hammer is a NUMA architecture and a working NUMA scheduler
> should be flexible enough to deal with it. And: the corner case of 1
> CPU per node is possible also on any other NUMA platform, when in some
> of the nodes (or even just one) only one CPU is configured in. Solving
> that problem automatically gives the Hammer what it needs.

OK, well what we have now is a "multi-cpu-per-node-numa-scheduler" if you really want to say all that ;-)

The question is "does Hammer benefit from the additional complexity"? I'm guessing not ... if so, then yes, it's worth fixing. If not, it would seem better to just leave it at SMP for the scheduler stuff. Simpler, more shared code with common systems.

>> > Fact is that the current separation of local and global balancing,
>> > where global balancing is done only in the timer interrupt at a fixed
>> > rate is way too inflexible. A CPU going idle inside a well balanced
>> > node will stay idle for a while even if there's a lot of work to
>> > do. Especially in the corner case of one CPU per node this is
>> > condemning that CPU to idleness for at least 5 ms.
>>
>> Surely it'll hit the idle local balancer and rebalance within the node?
>> Or are you thinking of a case with 3 tasks on a 4 cpu/node system?
>
> No, no, the local load balancer just doesn't make sense now if you
> have one cpu per node. It is called although it will never find
> another cpu in the own cell to steal from. There just is nothing
> else... (or did I misunderstand your comment?)

Right, I realise that the 1 cpu per node case is broken. However, doesn't this also affect the multi-cpu per node case in the manner I suggested above? So even if we turn off the NUMA scheduler for Hammer, this still needs fixing, right?

> That's one way, but the proposed patch just solves the problem (in a
> more general way, also for other NUMA cases). If you deconfigure NUMA
> for a NUMA platform, we'll have problems switching it back on when
> adding smarter things like node affine or homenode extensions.

Yeah, maybe. I'd rather have the code in hand that needs it, but still ... if Andi's happy that fixes it, then we're OK.

M. |
From: Andrew T. <hab...@us...> - 2003-07-29 02:31:56
|
On Monday 28 July 2003 15:37, Martin J. Bligh wrote:
> > But the Hammer is a NUMA architecture and a working NUMA scheduler
> > should be flexible enough to deal with it. And: the corner case of 1
> > CPU per node is possible also on any other NUMA platform, when in some
> > of the nodes (or even just one) only one CPU is configured in. Solving
> > that problem automatically gives the Hammer what it needs.

I am going to ask a silly question: do we have any data showing this really is a problem on AMD? I would think, even if we have an idle cpu, sometimes a little delay on task migration (on NUMA) may not be a bad thing. If it is too long, can we just make the rebalance ticks arch specific?

I'd much rather have info related to the properties of the NUMA arch than something that makes decisions based on nr_cpus_node(). For example, we may want to inter-node balance as much or more often on ppc64 than even AMD, but it has 8 cpus per node. With this patch it would have the lowest inter-node balance frequency, even though it probably has one of the lowest latencies between nodes and highest throughput interconnects.

> OK, well what we have now is a "multi-cpu-per-node-numa-scheduler" if
> you really want to say all that ;-)
>
> The question is "does Hammer benefit from the additional complexity"?
> I'm guessing not ... if so, then yes, it's worth fixing. If not, it
> would seem better to just leave it at SMP for the scheduler stuff.
> Simpler, more shared code with common systems.
>
> >> > Fact is that the current separation of local and global balancing,
> >> > where global balancing is done only in the timer interrupt at a fixed
> >> > rate is way too inflexible. A CPU going idle inside a well balanced
> >> > node will stay idle for a while even if there's a lot of work to
> >> > do. Especially in the corner case of one CPU per node this is
> >> > condemning that CPU to idleness for at least 5 ms.
> >>
> >> Surely it'll hit the idle local balancer and rebalance within the node?
> >> Or are you thinking of a case with 3 tasks on a 4 cpu/node system?
> >
> > No, no, the local load balancer just doesn't make sense now if you
> > have one cpu per node. It is called although it will never find
> > another cpu in the own cell to steal from. There just is nothing
> > else... (or did I misunderstand your comment?)
>
> Right, I realise that the 1 cpu per node case is broken. However, doesn't
> this also affect the multi-cpu per node case in the manner I suggested
> above? So even if we turn off NUMA scheduler for Hammer, this still
> needs fixing, right?

Maybe so, but if we start making idle rebalance more aggressive, I think we would need to make CAN_MIGRATE more restrictive, taking memory placement of the tasks into account. On AMD with interleaved memory allocation, tasks would move very easily, since their memory is spread out anyway. With a "home node" or node-local policy, we may not move a task (or maybe not on the first attempt), even if there is an idle cpu in another node.

>> That's one way, but the proposed patch just solves the problem (in a
>> more general way, also for other NUMA cases). If you deconfigure NUMA
>> for a NUMA platform, we'll have problems switching it back on when
>> adding smarter things like node affine or homenode extensions.
>
> Yeah, maybe. I'd rather have the code in hand that needs it, but still ...
> if Andi's happy that fixes it, then we're OK.

Personally, I'd like to see all systems use NUMA sched, non NUMA systems being a single node (no policy difference from non-numa sched), allowing us to remove all NUMA ifdefs. I think the code would be much more readable.

-Andrew Theurer |
From: Bill D. <dav...@tm...> - 2003-08-13 20:58:38
|
On Mon, 28 Jul 2003, Andrew Theurer wrote:
> Personally, I'd like to see all systems use NUMA sched, non NUMA systems being
> a single node (no policy difference from non-numa sched), allowing us to
> remove all NUMA ifdefs. I think the code would be much more readable.

That sounds like a great idea, but I'm not sure it could be realized short of a major rewrite. Look how hard Ingo and Con are working just to get a single node doing a good job with interactive and throughput tradeoffs.

Once they get a good handle on identifying process behaviour, and I believe they will, that information could be used in improving NUMA performance, by sending not just "a job" but "the right job" if it exists. I'm sure there are still a few graduate theses possible on the topic!

--
bill davidsen <dav...@tm...>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979. |
From: Andrew T. <hab...@us...> - 2003-08-22 15:47:59
|
On Wednesday 13 August 2003 15:49, Bill Davidsen wrote:
> On Mon, 28 Jul 2003, Andrew Theurer wrote:
> > Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> > being a single node (no policy difference from non-numa sched), allowing
> > us to remove all NUMA ifdefs. I think the code would be much more
> > readable.
>
> That sounds like a great idea, but I'm not sure it could be realized short
> of a major rewrite. Look how hard Ingo and Con are working just to get a
> single node doing a good job with interactive and throughput tradeoffs.

Actually it's not too bad. Attached is a patch to do it. It also does multi-level node support and makes all the load balance routines runqueue-centric instead of cpu-centric, so adding something like shared runqueues (for HT) should be really easy. Hmm, other things: inter-node balance intervals are now arch specific (AMD is "1"). The default busy/idle balance timers of 200/1 are not arch specific, but I'm thinking they should be. And for non-numa, the scheduling policy is the same as it was with vanilla O(1).

> Once they get a good handle on identifying process behaviour, and I
> believe they will, that information could be used in improving NUMA
> performance, by sending not just "a job" but "the right job" if it exists.
> I'm sure there are still a few graduate theses possible on the topic!

I do agree, but I think _what_ we pick is definitely 2.7 work. All I'd like to see for 2.6 is a very nice, unified framework to build on. As for the 1 cpu/node case, I think this second patch (on top of the first) will take care of it; however I have not had a chance to test on AMD yet.

-Andrew Theurer |
From: Nick P. <pi...@cy...> - 2003-08-22 22:58:33
|
Andrew Theurer wrote:
> On Wednesday 13 August 2003 15:49, Bill Davidsen wrote:
> > On Mon, 28 Jul 2003, Andrew Theurer wrote:
> > > Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> > > being a single node (no policy difference from non-numa sched), allowing
> > > us to remove all NUMA ifdefs. I think the code would be much more
> > > readable.
> >
> > That sounds like a great idea, but I'm not sure it could be realized short
> > of a major rewrite. Look how hard Ingo and Con are working just to get a
> > single node doing a good job with interactive and throughput tradeoffs.
>
> Actually it's not too bad. Attached is a patch to do it. It also does
> multi-level node support and makes all the load balance routines
> runqueue-centric instead of cpu-centric, so adding something like shared
> runqueues (for HT) should be really easy. Hmm, other things: inter-node
> balance intervals are now arch specific (AMD is "1"). The default busy/idle
> balance timers of 200/1 are not arch specific, but I'm thinking they should
> be. And for non-numa, the scheduling policy is the same as it was with
> vanilla O(1).

I'm not saying you're wrong, but do you have some numbers where this helps? ie. two architectures that need very different balance numbers. And what is the reason for making AMD's balance interval 1?

Also, things like nr_running_inc are supposed to be very fast. I am a bit worried to see a loop and CPU shared atomics in there. node_2_node is an odd sounding conversion too ;)

BTW. you should be CC'ing Ingo if you have any intention of scheduler stuff getting into 2.6. |
From: Andrew T. <hab...@us...> - 2003-08-23 00:12:52
|
On Friday 22 August 2003 17:56, Nick Piggin wrote:
> Andrew Theurer wrote:
> > On Wednesday 13 August 2003 15:49, Bill Davidsen wrote:
> > > On Mon, 28 Jul 2003, Andrew Theurer wrote:
> > > > Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> > > > being a single node (no policy difference from non-numa sched), allowing
> > > > us to remove all NUMA ifdefs. I think the code would be much more
> > > > readable.
> > >
> > > That sounds like a great idea, but I'm not sure it could be realized
> > > short of a major rewrite. Look how hard Ingo and Con are working just to
> > > get a single node doing a good job with interactive and throughput
> > > tradeoffs.
> >
> > Actually it's not too bad. Attached is a patch to do it. It also does
> > multi-level node support and makes all the load balance routines
> > runqueue-centric instead of cpu-centric, so adding something like shared
> > runqueues (for HT) should be really easy. Hmm, other things: inter-node
> > balance intervals are now arch specific (AMD is "1"). The default
> > busy/idle balance timers of 200/1 are not arch specific, but I'm thinking
> > they should be. And for non-numa, the scheduling policy is the same as
> > it was with vanilla O(1).
>
> I'm not saying you're wrong, but do you have some numbers where this
> helps? ie. two architectures that need very different balance numbers.
> And what is the reason for making AMD's balance interval 1?

AMD is 1 because there's no need to balance within a node, so I want the inter-node balance frequency to be as often as it was with just O(1). This interval would not work well with other NUMA boxes, so that's the main reason to have arch specific intervals. And, as a general guideline, boxes with different local-remote latency ratios will probably benefit from different inter-node balance intervals. I don't know what these ratios are, but I'd like the kernel to have the ability to change for one arch and not affect another.

> Also, things like nr_running_inc are supposed to be very fast. I am
> a bit worried to see a loop and CPU shared atomics in there.

That has concerned me, too. So far I haven't been able to see a measurable difference either way (within noise level), but it's possible. The other alternative is to sum up node load at sched_best_cpu and find_busiest_node.

> node_2_node is an odd sounding conversion too ;)

I just went off the topology already there, so I left it.

> BTW. you should be CC'ing Ingo if you have any intention of scheduler
> stuff getting into 2.6.

OK, thanks! |
From: Nick P. <pi...@cy...> - 2003-08-23 00:30:32
|
Andrew Theurer wrote:
> On Friday 22 August 2003 17:56, Nick Piggin wrote:
> > Andrew Theurer wrote:
> > > On Wednesday 13 August 2003 15:49, Bill Davidsen wrote:
> > > > On Mon, 28 Jul 2003, Andrew Theurer wrote:
> > > > > Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> > > > > being a single node (no policy difference from non-numa sched), allowing
> > > > > us to remove all NUMA ifdefs. I think the code would be much more
> > > > > readable.
> > > >
> > > > That sounds like a great idea, but I'm not sure it could be realized
> > > > short of a major rewrite. Look how hard Ingo and Con are working just to
> > > > get a single node doing a good job with interactive and throughput
> > > > tradeoffs.
> > >
> > > Actually it's not too bad. Attached is a patch to do it. It also does
> > > multi-level node support and makes all the load balance routines
> > > runqueue-centric instead of cpu-centric, so adding something like shared
> > > runqueues (for HT) should be really easy. Hmm, other things: inter-node
> > > balance intervals are now arch specific (AMD is "1"). The default
> > > busy/idle balance timers of 200/1 are not arch specific, but I'm thinking
> > > they should be. And for non-numa, the scheduling policy is the same as
> > > it was with vanilla O(1).
> >
> > I'm not saying you're wrong, but do you have some numbers where this
> > helps? ie. two architectures that need very different balance numbers.
> > And what is the reason for making AMD's balance interval 1?
>
> AMD is 1 because there's no need to balance within a node, so I want the
> inter-node balance frequency to be as often as it was with just O(1). This
> interval would not work well with other NUMA boxes, so that's the main reason
> to have arch specific intervals.

OK, I misread the patch. IIRC AMD has 1 CPU per node? If so, why doesn't this simply prevent balancing within a node?

> And, as a general guideline, boxes with
> different local-remote latency ratios will probably benefit from different
> inter-node balance intervals. I don't know what these ratios are, but I'd
> like the kernel to have the ability to change for one arch and not affect
> another.

I fully appreciate there are huge differences... I am curious if you can see much improvements in practice.

> > Also, things like nr_running_inc are supposed to be very fast. I am
> > a bit worried to see a loop and CPU shared atomics in there.
>
> That has concerned me, too. So far I haven't been able to see a measurable
> difference either way (within noise level), but it's possible. The other
> alternative is to sum up node load at sched_best_cpu and find_busiest_node.

Hmm... get someone to try the scheduler benchmarks on a 32 way box ;)

> > node_2_node is an odd sounding conversion too ;)
>
> I just went off the topology already there, so I left it.
>
> > BTW. you should be CC'ing Ingo if you have any intention of scheduler
> > stuff getting into 2.6.
>
> OK, thanks!

Good luck with it. Definitely some good ideas. |
From: William L. I. I. <wl...@ho...> - 2003-08-23 00:48:08
|
On Sat, Aug 23, 2003 at 10:29:21AM +1000, Nick Piggin wrote:
> Hmm... get someone to try the scheduler benchmarks on a 32 way box ;)

What scheduler benchmark?

-- wli |
From: Nick P. <pi...@cy...> - 2003-08-23 08:49:44
|
William Lee Irwin III wrote:
> On Sat, Aug 23, 2003 at 10:29:21AM +1000, Nick Piggin wrote:
> > Hmm... get someone to try the scheduler benchmarks on a 32 way box ;)
>
> What scheduler benchmark?

I don't know! I don't care about high end scheduler performance. Volanomark? Kernbench? SDET? Whatever you care about. |
From: Andrew T. <hab...@us...> - 2003-08-23 15:46:46
|
> > AMD is 1 because there's no need to balance within a node, so I want the
> > inter-node balance frequency to be as often as it was with just O(1).
> > This interval would not work well with other NUMA boxes, so that's the
> > main reason to have arch specific intervals.
>
> OK, I misread the patch. IIRC AMD has 1 CPU per node? If so, why doesn't
> this simply prevent balancing within a node?

Yes, one cpu/node. Oh, it does prevent it, but with the current intervals we end up not really balancing as often (since we need an inter-node balance), and when we call load_balance in schedule when idle, we don't balance at all, since it's only a node local balance.

> > And, as a general guideline, boxes with
> > different local-remote latency ratios will probably benefit from different
> > inter-node balance intervals. I don't know what these ratios are, but I'd
> > like the kernel to have the ability to change for one arch and not affect
> > another.
>
> I fully appreciate there are huge differences... I am curious if
> you can see much improvements in practice.

I think AMD would be the first good test. Maybe Andi has some results on numasched vs O(1), which would be a good indication. |
From: Martin J. B. <mb...@ar...> - 2003-08-23 01:32:13
|
>> node_2_node is an odd sounding conversion too ;)
>
> I just went off the topology already there, so I left it.

I thought we changed that to parent_node or something a while back? Yes, node_2_node is rather nondescriptive ... Matt?

M. |
From: Erich F. <ef...@hp...> - 2003-07-29 10:05:03
|
On Monday 28 July 2003 22:37, Martin J. Bligh wrote:
> > But the Hammer is a NUMA architecture and a working NUMA scheduler
> > should be flexible enough to deal with it. And: the corner case of 1
> > CPU per node is possible also on any other NUMA platform, when in some
> > of the nodes (or even just one) only one CPU is configured in. Solving
> > that problem automatically gives the Hammer what it needs.
>
> OK, well what we have now is a "multi-cpu-per-node-numa-scheduler" if
> you really want to say all that ;-)

And with the fix posted it will just be called "numa-scheduler". It's a simplification, as you noticed.

> The question is "does Hammer benefit from the additional complexity"?
> I'm guessing not ... if so, then yes, it's worth fixing. If not, it
> would seem better to just leave it at SMP for the scheduler stuff.
> Simpler, more shared code with common systems.

Hammer will just go for the other nodes, as it should, when its own node is free. Then you can benefit from NUMA improvements in the load calculation and later from smarter distribution across the nodes. Reverting to SMP is a quick fix, but for improvements we'll have to get back to NUMA. So why not fix it now and stay with NUMA?

> > No, no, the local load balancer just doesn't make sense now if you
> > have one cpu per node. It is called although it will never find
> > another cpu in the own cell to steal from. There just is nothing
> > else... (or did I misunderstand your comment?)
>
> Right, I realise that the 1 cpu per node case is broken. However, doesn't
> this also affect the multi-cpu per node case in the manner I suggested
> above? So even if we turn off the NUMA scheduler for Hammer, this still
> needs fixing, right?

Yes.

> > That's one way, but the proposed patch just solves the problem (in a
> > more general way, also for other NUMA cases). If you deconfigure NUMA
> > for a NUMA platform, we'll have problems switching it back on when
> > adding smarter things like node affine or homenode extensions.
>
> Yeah, maybe. I'd rather have the code in hand that needs it, but still ...
> if Andi's happy that fixes it, then we're OK.

http://marc.theaimsgroup.com/?l=linux-kernel&m=105854610325226&w=2

Andi mentioned in his talk to follow that path, too. We'll have to experiment a while with the ideas we currently have to get the initial load balancing / dynamic homenode issue solved; there's no "golden way" right now. I'll post the stuff I mentioned in my talk soon, but want to gain some experience with it first.

Regards,
Erich |
From: Erich F. <ef...@hp...> - 2003-07-29 10:05:04
|
On Tuesday 29 July 2003 04:24, Andrew Theurer wrote:
> On Monday 28 July 2003 15:37, Martin J. Bligh wrote:
> > > But the Hammer is a NUMA architecture and a working NUMA scheduler
> > > should be flexible enough to deal with it. And: the corner case of 1
> > > CPU per node is possible also on any other NUMA platform, when in some
> > > of the nodes (or even just one) only one CPU is configured in. Solving
> > > that problem automatically gives the Hammer what it needs.
>
> I am going to ask a silly question, do we have any data showing this really
> is a problem on AMD? I would think, even if we have an idle cpu, sometimes
> a little delay on task migration (on NUMA) may not be a bad thing. If it
> is too long, can we just make the rebalance ticks arch specific?

The fact that global rebalances are done only in the timer interrupt is simply bad! It complicates rebalance_tick() and wastes the opportunity to get feedback from the failed local balance attempts.

If you want data supporting my assumptions: Ted Ts'o's talk at OLS shows the necessity to rebalance ASAP (even in try_to_wake_up). There are plenty of arguments towards this, starting with the steal delay parameter scans from the early days of multi-queue schedulers (Davide Libenzi), over my experiments with NUMA schedulers, to the observation of Andi Kleen that on Opteron you'd better run from the wrong CPU than wait (if the task returns to the right cpu, all's fine anyway).

> I'd much rather have info related to the properties of the NUMA arch than
> something that makes decisions based on nr_cpus_node(). For example, we
> may want to inter-node balance as much or more often on ppc64 than even
> AMD, but it has 8 cpus per node. On this patch it would have the lowest
> inter-node balance frequency, even though it probably has one of the lowest
> latencies between nodes and highest throughput interconnects.

We can still discuss the formula. Currently there's a bug in the scheduler and the corner case of 1 cpu/node is just broken. The function local_balance_retries(attempts, cpus_in_this_node) must return 0 for cpus_in_this_node=1 and should return larger values for larger cpus_in_this_node. To have an upper limit is fine, but maybe not necessary. Regarding the ppc64 interconnect, I'm glad that you said "probably" and "one of the ..." ;-)

> > Right, I realise that the 1 cpu per node case is broken. However, doesn't
> > this also affect the multi-cpu per node case in the manner I suggested
> > above? So even if we turn off NUMA scheduler for Hammer, this still
> > needs fixing, right?
>
> Maybe so, but if we start making idle rebalance more aggressive, I think we
> would need to make CAN_MIGRATE more restrictive, taking memory placement of
> the tasks into account. On AMD with interleaved memory allocation, tasks
> would move very easily, since their memory is spread out anyway. With a
> "home node" or node-local policy, we may not move a task (or maybe not on
> the first attempt), even if there is an idle cpu in another node.

Aehm, that's another story and I'm sure we will fix that too. There are a few ideas around. But you shouldn't expect to solve all problems at once; after all, optimal NUMA scheduling can still be considered a research area.

> Personally, I'd like to see all systems use NUMA sched, non NUMA systems
> being a single node (no policy difference from non-numa sched), allowing us
> to remove all NUMA ifdefs. I think the code would be much more readable.

:-) Then you expect that everybody who reads the scheduler code knows what NUMA stands for and what it means? Pretty optimistic, but yes, I'd like that, too.

Erich |
From: Martin J. B. <mb...@ar...> - 2003-07-29 14:28:02
|
>> I am going to ask a silly question, do we have any data showing this really
>> is a problem on AMD? I would think, even if we have an idle cpu, sometimes
>> a little delay on task migration (on NUMA) may not be a bad thing. If it
>> is too long, can we just make the rebalance ticks arch specific?
>
> The fact that global rebalances are done only in the timer interrupt
> is simply bad! It complicates rebalance_tick() and wastes the
> opportunity to get feedback from the failed local balance attempts.

The whole point of the NUMA scheduler is to rebalance globally less often. Now I'd agree that in the idle case we want to be more aggressive, as your patch does, but we need to be careful not to end up in a cacheline thrashing war. Aiming for perfect balance is not always a good idea - the expense of both the calculation and the migrations has to be taken into account. For some arches it's less important than others ... one thing we're missing is a per arch "node-balance-aggressiveness" factor. Having said that, the NUMA scheduler is clearly broken on Hammer at the moment, from Andi's observations.

> If you want data supporting my assumptions: Ted Ts'o's talk at OLS
> shows the necessity to rebalance ASAP (even in try_to_wake_up). There
> are plenty of arguments towards this, starting with the steal delay
> parameter scans from the early days of multi-queue schedulers (Davide
> Libenzi), over my experiments with NUMA schedulers and the observation
> of Andi Kleen that on Opteron you better run from the wrong CPU than
> wait (if the task returns to the right cpu, all's fine anyway).

That's a drastic oversimplification. It may be better in some circumstances, on some benchmarks. For now, let's just get your patch tested on Hammer, and see if it works better to have the NUMA scheduler on than off after your patch ...

M. |
From: Andrew T. <hab...@us...> - 2003-07-29 13:34:14
Attachments:
patch-wakey2-2567
|
On Tuesday 29 July 2003 05:08, Erich Focht wrote:
> On Tuesday 29 July 2003 04:24, Andrew Theurer wrote:
> > On Monday 28 July 2003 15:37, Martin J. Bligh wrote:
> > > > But the Hammer is a NUMA architecture and a working NUMA scheduler
> > > > should be flexible enough to deal with it. And: the corner case of 1
> > > > CPU per node is possible also on any other NUMA platform, when in
> > > > some of the nodes (or even just one) only one CPU is configured in.
> > > > Solving that problem automatically gives the Hammer what it needs.
> >
> > I am going to ask a silly question, do we have any data showing this
> > really is a problem on AMD? I would think, even if we have an idle cpu,
> > sometimes a little delay on task migration (on NUMA) may not be a bad
> > thing. If it is too long, can we just make the rebalance ticks arch
> > specific?
>
> The fact that global rebalances are done only in the timer interrupt
> is simply bad!

Even with this patch it still seems that most balances are still timer based, because we still call load_balance in rebalance_tick. Granted, we may inter-node balance more often - well, maybe less often, since node_busy_rebalance_tick was busy_rebalance_tick*2. I do see the advantage of doing this at idle, but idle only; that's why I'd be more inclined to only a much more aggressive idle rebalance.

> It complicates rebalance_tick() and wastes the
> opportunity to get feedback from the failed local balance attempts.

What does "failed" really mean? To me, when *busiest=NULL, that means we passed: the node itself is probably balanced, and there's nothing to do. It gives no indication at all of the global load [im]balance. Shouldn't the thing we are looking for be the imbalance among node_nr_running[]? Would it make sense to go forward with a global balance based on that only?
> If you want data supporting my assumptions: Ted Ts'o's talk at OLS
> shows the necessity to rebalance ASAP (even in try_to_wake_up).

If this is the patch I am thinking of, it was the (attached) one I sent them, which did a light "push" rebalance at try_to_wake_up. Calling load_balance at try_to_wake_up seems very heavy-weight. This patch only looks for an idle cpu (within the same node) to wake up on before task activation, and only if task_rq(p)->nr_running is too long. So, yes, I do believe this can be important, but I think it's only called for when we have an idle cpu.

> There
> are plenty of arguments towards this, starting with the steal delay
> parameter scans from the early days of multi-queue schedulers (Davide
> Libenzi), over my experiments with NUMA schedulers and the observation
> of Andi Kleen that on Opteron you better run from the wrong CPU than
> wait (if the task returns to the right cpu, all's fine anyway).
>
> > I'd much rather have info related to the properties of the NUMA arch than
> > something that makes decisions based on nr_cpus_node(). For example, we
> > may want to inter-node balance as much or more often on ppc64 than even
> > AMD, but it has 8 cpus per node. On this patch it would have the lowest
> > inter-node balance frequency, even though it probably has one of the
> > lowest latencies between nodes and highest throughput interconnects.
>
> We can still discuss the formula. Currently there's a bug in the
> scheduler and the corner case of 1 cpu/node is just broken. The
> function local_balance_retries(attempts, cpus_in_this_node) must
> return 0 for cpus_in_this_node=1 and should return larger values for
> larger cpus_in_this_node. To have an upper limit is fine, but maybe
> not necessary.
>
> Regarding the ppc64 interconnect, I'm glad that you said "probably"
> and "one of the ...".
> No need to point you to better ones ;-)

OK, we won't get into a pissing match :) I just wanted to base the scheduler decisions on data related to the hardware NUMA properties, not the cpu count.

> > > Right, I realise that the 1 cpu per node case is broken. However,
> > > doesn't this also affect the multi-cpu per node case in the manner I
> > > suggested above? So even if we turn off NUMA scheduler for Hammer, this
> > > still needs fixing, right?
> >
> > Maybe so, but if we start making idle rebalance more aggressive, I think
> > we would need to make CAN_MIGRATE more restrictive, taking memory
> > placement of the tasks into account. On AMD with interleaved memory
> > allocation, tasks would move very easily, since their memory is spread
> > out anyway. On "home node" or node-local policy, we may not move a task
> > (or maybe not on the first attempt), even if there is an idle cpu in
> > another node.
>
> Ahem, that's another story and I'm sure we will fix that too. There
> are a few ideas around. But you shouldn't expect to solve all problems
> at once, after all optimal NUMA scheduling can still be considered a
> research area.
>
> > Personally, I'd like to see all systems use NUMA sched, non-NUMA systems
> > being a single node (no policy difference from non-NUMA sched), allowing
> > us to remove all NUMA ifdefs. I think the code would be much more
> > readable.
>
> :-) Then you expect that everybody who reads the scheduler code knows
> what NUMA stands for and what it means? Pretty optimistic, but yes,
> I'd like that, too.

Yes, at some point we have to. We cannot have two different schedulers. Non-NUMA should have the exact same scheduling policy as a NUMA system with one node. I don't know if that's acceptable for 2.6, but I really want to go for that in 2.7.

-Andrew Theurer
|
From: Erich F. <ef...@hp...> - 2003-07-30 15:39:05
|
On Tuesday 29 July 2003 15:33, Andrew Theurer wrote:
> > The fact that global rebalances are done only in the timer interrupt
> > is simply bad!
>
> Even with this patch it still seems that most balances are still timer
> based, because we still call load_balance in rebalance_tick. Granted, we
> may inter-node balance more often, well, maybe less often since
> node_busy_rebalance_tick was busy_rebalance_tick*2. I do see the advantage
> of doing this at idle, but idle only, that's why I'd be more inclined to
> only a much more aggressive idle rebalance.

Without this patch the probability of balancing globally when going idle is 0. Now it is 1/global_balance_rate(cpus_in_this_node). This is tunable, and we can make this probability depend on the node load imbalance. I'll try that in the next version, it sounds like a good idea. Also, changing this probability by some factor could give us a way to handle the differences between the platforms.

Balancing globally only when idle isn't a good idea as long as we don't have multiple steals per balance attempt. Even then, tasks don't all live the same time, so you easily end up with one node overloaded and the others just busy and unable to steal from the busiest node.

> > It complicates rebalance_tick() and wastes the
> > opportunity to get feedback from the failed local balance attempts.
>
> What does "failed" really mean? To me, when *busiest=NULL, that means we
> passed, the node itself is probably balanced, and there's nothing to do.
> It gives no indication at all of the global load [im]balance. Shouldn't
> the thing we are looking for be the imbalance among node_nr_running[]?
> Would it make sense to go forward with a global balance based on that only?

I'll try to include the node imbalance in the global balance rate calculation, let's see how it works. Just wanted to fix one issue with my patch, and now it looks like it provides some simple ways to solve other issues as well... Thanks for the idea!

Regards,
Erich
|
From: Andrew T. <hab...@us...> - 2003-07-30 16:15:46
|
On Wednesday 30 July 2003 10:23, Erich Focht wrote:
> On Tuesday 29 July 2003 15:33, Andrew Theurer wrote:
> > > The fact that global rebalances are done only in the timer interrupt
> > > is simply bad!
> >
> > Even with this patch it still seems that most balances are still timer
> > based, because we still call load_balance in rebalance_tick. Granted, we
> > may inter-node balance more often, well, maybe less often since
> > node_busy_rebalance_tick was busy_rebalance_tick*2. I do see the
> > advantage of doing this at idle, but idle only, that's why I'd be more
> > inclined to only a much more aggressive idle rebalance.
>
> Without this patch the probability of balancing globally when going idle
> is 0. Now it is 1/global_balance_rate(cpus_in_this_node). This is
> tunable and we can make this probability depend on the node load
> imbalance. I'll try that in the next version, it sounds like a good
> idea. Also changing this probability by some factor could give us a
> way to handle the differences between the platforms.
>
> Balancing globally only when idle isn't a good idea as long as we
> don't have multiple steals per balance attempt. Even then, tasks don't
> all live the same time, so you easily end up with one node overloaded
> and the others just busy and unable to steal from the busiest node.

I think I would agree with these statements. Yes, your patch does fix the "going idle" AMD issue, and yes, we don't want to balance globally on idle by default. However, I do still think there may be a better way to solve this.

I see this as a "range" problem. When balancing, the NUMA scheduler's tactic is to limit the range of cpus in which to find the busiest queue. First we start within a node, then once in a while we go to the next "level" and look at cpus in all nodes in that "level". Someday we may even go up another "level" and include even more cpus in our range. We limit how often we go up a level so as to help memory locality of the running tasks.
This is obviously well known to you, since you wrote most of this :) What I would like to propose is the following: for load_balance, each architecture tells the scheduler what the range is, where the range starts, and where it ends. For example: Summit starts at level1, the cpus in a node, and ends at level2, the cpus in all level1 nodes. AMD starts at level2, all the cpus in all level1 nodes, and also ends at level2. 8-way AMD would go to level3 :) NEC (I assume) would start at level1 and end at level3, the "super" nodes. This should solve the AMD problem very easily.

As for idle balances, we may be able to go a step further: follow the range rules, but do a more aggressive/frequent search. On busy rebalance, I'd prefer to still uphold the node_rebalance_ticks, but on idle, I would think, since we are idle, some more time spent on a thorough search would be OK: we first start at the lower range and try to find something to steal (always favor stealing from a smaller range first), then as needed increase the range, looking for a cpu to steal from. Do not use a node_idle_rebalance interval, just keep increasing the range until we are successful.

I'll see if I can get something coded up here. I don't think these concepts would be too difficult to add.

-Andrew Theurer
|
From: Martin J. B. <mb...@ar...> - 2003-07-31 15:51:03
|
you're using node_to_cpu_mask for ia64 ... others were using node_to_cpumask (1 less "_"), so this doesn't build ... M. |
From: Erich F. <ef...@hp...> - 2003-07-31 21:42:21
Attachments:
1cpufix-lb2-2.6.0t1.patch
|
On Thursday 31 July 2003 17:05, Martin J. Bligh wrote:
> you're using node_to_cpu_mask for ia64 ... others were using
> node_to_cpumask (1 less "_"), so this doesn't build ...

Ooops, you're right, of course. Sorry about this mistake :-(

Erich
|
From: Martin J. B. <mb...@ar...> - 2003-08-01 00:23:18
|
> On Thursday 31 July 2003 17:05, Martin J. Bligh wrote:
>> you're using node_to_cpu_mask for ia64 ... others were using
>> node_to_cpumask (1 less "_"), so this doesn't build ...
>
> Ooops, you're right, of course. Sorry about this mistake :-(

np - was easy to fix up ;-) I did run some benchmarks on it ... low end SDET seemed highly variable, but otherwise looked OK. If I have only 4 tasks running on a 16x (4x4), what's the rate limit on the idle cpus trying to steal now?

M.
|
From: Erich F. <ef...@hp...> - 2003-08-01 16:26:42
|
On Friday 01 August 2003 02:26, Martin J. Bligh wrote:
> np - was easy to fix up ;-) I did run some benchmarks on it ...
> low end SDET seemed highly variable, but otherwise looked OK.
> If I have only 4 tasks running on a 16x (4x4), what's the rate
> limit on the idle cpus trying to steal now?

The rate is 3ms; that's a bit too fast, maybe. And the busy rate is 200ms (instead of 400). I made some experiments with 2 CPUs per node and 1ms seems to be definitely too fast. I'll tune the formula a bit...

Regards,
Erich
|