From: Dave H. <hav...@us...> - 2003-04-17 03:41:14
Attachments:
config-numasched-2.5.67-0.patch
It might be interesting to see how the NUMA scheduler affects performance, separately from all of the memory allocation. We have benchmarks which are almost all in userspace, and are degraded by turning CONFIG_NUMA on. This should make narrowing down the problem easier.

Tested on 16-way NUMA-Q.

-- Dave Hansen hav...@us...
From: Dave H. <hav...@us...> - 2003-04-17 03:51:52
Attachments:
config-numasched-2.5.67-1.patch
Dave Hansen wrote:
> It might be interesting to see how the NUMA scheduler affects
> performance, separately from all of the memory allocation. We have
> benchmarks which are almost all in userspace, and are degraded by
> turning CONFIG_NUMA on. This should make narrowing down the problem easier.
>
> Tested on 16-way NUMA-Q.

Here's a version that defaults the new option to yes.

-- Dave Hansen hav...@us...
From: Erich F. <ef...@hp...> - 2003-04-17 07:34:39
Hi Dave,

if you have benchmarks affected by the NUMA scheduler you should also try to tune the parameters. The current settings are:

#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 100)

Especially the BUSY_NODE_REBALANCE_TICK is insane: it leaves your (busy) machine unbalanced for 20s!

Additionally: are your benchmarks multithreaded (processes created by fork() or clone()), or are the children created by exec()? In the latter case you should be OK; in the former case you get an unbalanced machine because of the lack of initial load balancing in that case. (Just in case you play with AIM7: that one just forks :-( )

Regards,
Erich

On Thursday 17 April 2003 05:51, Dave Hansen wrote:
> Dave Hansen wrote:
> > It might be interesting to see how the NUMA scheduler affects
> > performance, separately from all of the memory allocation. We have
> > benchmarks which are almost all in userspace, and are degraded by
> > turning CONFIG_NUMA on. This should make narrowing down the problem
> > easier.
> >
> > Tested on 16-way NUMA-Q.
>
> Here's a version that defaults the new option to yes.
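(An aside on the arithmetic behind the 20s figure: with the 2.5-era i386 default of HZ=1000, the multipliers work out as below. The base tick values in this standalone sketch are assumptions from memory, not quotes from 2.5.67's sched.c.)

#include <stdio.h>

#define HZ 1000                               /* assumed 2.5-era i386 default */
#define IDLE_REBALANCE_TICK (HZ / 1000)       /* assumed base value: ~1ms */
#define BUSY_REBALANCE_TICK (HZ / 5)          /* assumed base value: ~200ms */
#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 100)

int main(void)
{
	/* a busy node waits BUSY_NODE_REBALANCE_TICK ticks between
	 * cross-node balance attempts: 20000 ticks = 20s at HZ=1000 */
	printf("idle node rebalance: every %d ticks (%.3f s)\n",
	       IDLE_NODE_REBALANCE_TICK, IDLE_NODE_REBALANCE_TICK / (double)HZ);
	printf("busy node rebalance: every %d ticks (%.1f s)\n",
	       BUSY_NODE_REBALANCE_TICK, BUSY_NODE_REBALANCE_TICK / (double)HZ);
	return 0;
}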
From: Martin J. B. <mb...@ar...> - 2003-04-17 15:12:20
> if you have benchmarks affected by the NUMA scheduler you should
> also try to tune the parameters. The current settings are:
>
> #define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
> #define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 100)
>
> Especially the BUSY_NODE_REBALANCE_TICK is insane: it leaves your
> (busy) machine unbalanced for 20s!

It's basically disabled because it's too aggressive. We need to make it less aggressive first, then we'll be able to up the busy rebalance rate. It also needs to be able to migrate multiple processes at once.

M.
From: Dave H. <hav...@us...> - 2003-04-17 16:35:57
Martin J. Bligh wrote:
>> if you have benchmarks affected by the NUMA scheduler you should
>> also try to tune the parameters. The current settings are:
>>
>> #define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
>> #define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 100)
>>
>> Especially the BUSY_NODE_REBALANCE_TICK is insane: it leaves your
>> (busy) machine unbalanced for 20s!
>
> It's basically disabled because it's too aggressive. We need to make it less
> aggressive first, then we'll be able to up the busy rebalance rate. It also
> needs to be able to migrate multiple processes at once.

First of all, the workload in question here is SPECjbb2000(tm)*. It is almost all Java and spends 99% of its time in userspace, so the kernel isn't really stressed directly.

What about putting in rebalance ticks in more situations than just exec? We can keep track of the number of clone()s or fork()s since the last rebalance on the current node, then force a rebalance if that count gets too big.

Rick, have you been able to examine the SPECjbb2000(tm)* problem and figure out whether it is really a balancing problem?

* SPECjbb is a trademark of the Standard Performance Evaluation Corp. (SPEC).

-- Dave Hansen hav...@us...
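(For concreteness, a toy standalone model of the counter Dave proposes above. Every name and the threshold are invented for illustration; nothing like this exists in 2.5.67.)

#include <stdio.h>

#define MAX_NODES 4
#define FORK_REBALANCE_THRESHOLD 8   /* invented tunable */

static int forks_since_rebalance[MAX_NODES];

/* called from a hypothetical fork()/clone() hook */
static void note_fork(int node)
{
	if (++forks_since_rebalance[node] >= FORK_REBALANCE_THRESHOLD) {
		printf("node %d: %d forks since last rebalance -> force rebalance\n",
		       node, forks_since_rebalance[node]);
		forks_since_rebalance[node] = 0;  /* a rebalance resets the counter */
	}
}

int main(void)
{
	int i;

	/* a fork-happy workload (e.g. AIM7) hammering node 0 */
	for (i = 0; i < 20; i++)
		note_fork(0);
	return 0;
}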
From: Martin J. B. <mb...@ar...> - 2003-04-17 17:26:21
>>> if you have benchmarks affected by the NUMA scheduler you should
>>> also try to tune the parameters. The current settings are:
>>>
>>> #define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
>>> #define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 100)
>>>
>>> Especially the BUSY_NODE_REBALANCE_TICK is insane: it leaves your
>>> (busy) machine unbalanced for 20s!
>>
>> It's basically disabled because it's too aggressive. We need to make it
>> less aggressive first, then we'll be able to up the busy rebalance rate.
>> It also needs to be able to migrate multiple processes at once.
>
> First of all, the workload in question here is SPECjbb2000(tm)*. It is
> almost all Java and spends 99% of its time in userspace, so the kernel isn't
> really stressed directly.
>
> What about putting in rebalance ticks in more situations than just
> exec? We can keep track of the number of clone()s or fork()s since the
> last rebalance on the current node, then force a rebalance if that count gets
> too big.

Exec is not the same as a rebalance tick ... it just targets the exec'ed process, not a general rebalance. Keeping track of that stuff sounds sane, but by the time you've accounted for fork/clone, exec, whether each of those should rebalance or not, and then exit ... plus accounting for whether your tasks are cpu-intensive or not, I think we're better off just doing the background rebalance tick on a more frequent basis, driven from the per-node load average. We just need to set up some more weighting to stop tasks being shuttled about so heavily.

M.
From: Andrew T. <hab...@us...> - 2003-04-18 15:51:34
On Thursday 17 April 2003 11:35, Dave Hansen wrote:
> Martin J. Bligh wrote:
> >> if you have benchmarks affected by the NUMA scheduler you should
> >> also try to tune the parameters. The current settings are:
> >>
> >> #define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
> >> #define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 100)
> >>
> >> Especially the BUSY_NODE_REBALANCE_TICK is insane: it leaves your
> >> (busy) machine unbalanced for 20s!
> >
> > It's basically disabled because it's too aggressive. We need to make it
> > less aggressive first, then we'll be able to up the busy rebalance rate.
> > It also needs to be able to migrate multiple processes at once.
>
> First of all, the workload in question here is SPECjbb2000(tm)*. It is
> almost all Java and spends 99% of its time in userspace, so the kernel isn't
> really stressed directly.
>
> What about putting in rebalance ticks in more situations than just
> exec? We can keep track of the number of clone()s or fork()s since the
> last rebalance on the current node, then force a rebalance if that count gets
> too big.
>
> Rick, have you been able to examine the SPECjbb2000(tm)* problem and
> figure out whether it is really a balancing problem?

It has to do with the rebalance interval. Actually it has to do with the fact that a timer-based load_balance scheduler can't keep up with a workload that has an incredibly high rate of task activation/deactivation (which throws off load balance, leaving some cpus with >2 tasks while some cpus have 0 tasks), which JBB currently has (for other reasons altogether). JBB really shouldn't behave that way, but IMO the scheduler should do its best should that situation arise. A change to wake_up helps this, since it's tied much more closely to the rate of task activation. That's why the wake_up modification works, and makes the NUMA-config'd kernel run just as fast as the non-NUMA-config'd kernel.

From an earlier mail (not on LSE):

<< Here are the results comparing Mark's runs (32 warehouse result), with and without numa, to my runs with my wakeup patch:

with numa off (result: 1.00x) <- relative throughput
with numa off and wakeup patch (result: 1.23x)
with numa on (result: 0.86x)
with numa on and wakeup patch (result: 1.23x)

Results pretty much speak for themselves! There is now no degrade at all on numa vs non-numa. Here is the rundown of the patch:

in try_to_wake_up:
    if task_cpu(p) is not idle and sync is not set
        if there is an idle cpu, set_task_cpu to it
    activate the task >>

Notice how the non-NUMA and NUMA results with the wake_up patch had the exact same throughput...

On higher-end systems, this has improved a 32 warehouse run by over 100%...

-Andrew Theurer

* SPECjbb is a trademark of the Standard Performance Evaluation Corp. (SPEC).
From: Martin J. B. <mb...@ar...> - 2003-04-18 16:06:52
>> >> #define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
>> >> #define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 100)
>> >>
>> >> Especially the BUSY_NODE_REBALANCE_TICK is insane: it leaves your
>> >> (busy) machine unbalanced for 20s!
>> >
>> > It's basically disabled because it's too aggressive. We need to make it
>> > less aggressive first, then we'll be able to up the busy rebalance rate.
>> > It also needs to be able to migrate multiple processes at once.
>>
>> First of all, the workload in question here is SPECjbb2000(tm)*. It is
>> almost all Java and spends 99% of its time in userspace, so the kernel isn't
>> really stressed directly.
>>
>> What about putting in rebalance ticks in more situations than just
>> exec? We can keep track of the number of clone()s or fork()s since the
>> last rebalance on the current node, then force a rebalance if that count gets
>> too big.
>>
>> Rick, have you been able to examine the SPECjbb2000(tm)* problem and
>> figure out whether it is really a balancing problem?
>
> It has to do with the rebalance interval. Actually it has to do with the
> fact that a timer-based load_balance scheduler can't keep up with a
> workload that has an incredibly high rate of task activation/deactivation
> (which throws off load balance, leaving some cpus with >2 tasks while
> some cpus have 0 tasks), which JBB currently has (for other reasons
> altogether). JBB really shouldn't behave that way, but IMO the
> scheduler should do its best should that situation arise. A change to
> wake_up helps this, since it's tied much more closely to the rate of
> task activation. That's why the wake_up modification works, and makes
> the NUMA-config'd kernel run just as fast as the non-NUMA-config'd kernel.

Can you try setting:

# define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 5)

and see if that helps? I still think the wakeup patch is a good idea, I just want to see if the problem is caused by disabling the busy rebalancer ... The ia32 setting of 100 was done for NUMA-Q, since I don't have an x440 box.

Thanks,

M.
From: John H. <ha...@sg...> - 2003-04-18 17:42:26
> in try_to_wake_up:
>
>     if task_cpu(p) is not idle and sync is not set
>         if there is an idle cpu, set_task_cpu to it
>     activate the task
>
> On higher-end systems, this has improved a 32 warehouse run by over 100%...

I've had less successful results with such a change when I tried it a couple of months ago, but I'll try it again with my almost-2.5.6x+numa scheduler and a wider variety of workloads.

The tradeoff is being able to get that thread executing ASAP, vs. perhaps migrating it away from its low-latency local memory. Some threads benefit from being migrated thusly, and others don't. I suppose that threads that do a lot of sleeping and waking tend to be less notably sensitive to memory access latencies.

John Hawkes
From: Andrew T. <hab...@us...> - 2003-04-18 17:58:56
On Friday 18 April 2003 11:06, John Hawkes wrote:
> > in try_to_wake_up:
> >
> >     if task_cpu(p) is not idle and sync is not set
> >         if there is an idle cpu, set_task_cpu to it
> >     activate the task
> >
> > On higher-end systems, this has improved a 32 warehouse run by over 100%...
>
> I've had less successful results with such a change when I tried it a
> couple of months ago, but I'll try it again with my almost-2.5.6x+numa
> scheduler and a wider variety of workloads.
>
> The tradeoff is being able to get that thread executing ASAP, vs. perhaps
> migrating it away from its low-latency local memory. Some threads benefit
> from being migrated thusly, and others don't. I suppose that threads that
> do a lot of sleeping and waking tend to be less notably sensitive to memory
> access latencies.

I probably should have mentioned: on a NUMA system, it looks for an idle cpu only within the node of the current task_cpu. Also, if nr_running is fairly high anyway, we don't even bother looking for an idle cpu; at some point the probability of finding one is very low. BTW, this is very similar to Andrea's tree and the SLES kernel. This patch also backed out a recent addition Ingo made to this function, which for some reason seems to conflict with my changes.

-Andrew

diff -Naur 2.5.67-BK-9-4-2003/Makefile 2.5.67-BK-9-4-2003-wakey2/Makefile
--- 2.5.67-BK-9-4-2003/Makefile	2003-04-15 13:42:53.000000000 -0700
+++ 2.5.67-BK-9-4-2003-wakey2/Makefile	2003-04-15 13:57:14.000000000 -0700
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 5
 SUBLEVEL = 67
-EXTRAVERSION = -BK-9-4-2003
+EXTRAVERSION = -BK-9-4-2003-wakey2
 
 # *DOCUMENTATION*
 # To see a list of typical targets execute "make help"
diff -Naur 2.5.67-BK-9-4-2003/kernel/sched.c 2.5.67-BK-9-4-2003-wakey2/kernel/sched.c
--- 2.5.67-BK-9-4-2003/kernel/sched.c	2003-04-13 15:04:34.000000000 -0700
+++ 2.5.67-BK-9-4-2003-wakey2/kernel/sched.c	2003-04-15 14:00:01.000000000 -0700
@@ -486,11 +486,32 @@
  */
 static int try_to_wake_up(task_t * p, unsigned int state, int sync)
 {
-	int success = 0, requeue_waker = 0;
-	unsigned long flags;
+	int success = 0, target_cpu, i;
+	unsigned long flags, cpumask;
 	long old_state;
 	runqueue_t *rq;
 
+	target_cpu = smp_processor_id();
+
+	/* change task_cpu to an idle cpu if its
+	 * default rq is really busy and sync
+	 * wakeup is not requested */
+	if (!sync && ((nr_running()*4) <= (num_online_cpus()*5)) &&
+		(task_rq(p)->nr_running > 0)) {
+		cpumask = node_to_cpumask(cpu_to_node(task_cpu(p)));
+		for (i=0; i<NR_CPUS; i++) {
+			if (!(cpumask & (1UL << i)))
+				continue;
+			if (!(cpu_online(i)))
+				continue;
+			if (idle_cpu(i)) {
+				sync = 1;
+				target_cpu = i;
+				break;
+			}
+		}
+	}
+
 repeat_lock_task:
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
@@ -501,43 +522,25 @@
 			 * currently. Do not violate hard affinity.
 			 */
 			if (unlikely(sync && !task_running(rq, p) &&
-				(task_cpu(p) != smp_processor_id()) &&
-				(p->cpus_allowed & (1UL << smp_processor_id())))) {
+				(task_cpu(p) != target_cpu) &&
+				(p->cpus_allowed & (1UL << target_cpu)))) {
 
-				set_task_cpu(p, smp_processor_id());
+				set_task_cpu(p, target_cpu);
 				task_rq_unlock(rq, &flags);
 				goto repeat_lock_task;
 			}
 			if (old_state == TASK_UNINTERRUPTIBLE)
 				rq->nr_uninterruptible--;
-			if (sync)
-				__activate_task(p, rq);
-			else {
-				requeue_waker = activate_task(p, rq);
-				if (p->prio < rq->curr->prio)
+			activate_task(p, rq);
+
+			if (p->prio < rq->curr->prio)
 					resched_task(rq->curr);
-			}
 			success = 1;
 		}
 		p->state = TASK_RUNNING;
 	}
 	task_rq_unlock(rq, &flags);
 
-	/*
-	 * We have to do this outside the other spinlock, the two
-	 * runqueues might be different:
-	 */
-	if (requeue_waker) {
-		prio_array_t *array;
-
-		rq = task_rq_lock(current, &flags);
-		array = current->array;
-		dequeue_task(current, array);
-		current->prio = effective_prio(current);
-		enqueue_task(current, array);
-		task_rq_unlock(rq, &flags);
-	}
-
 	return success;
 }
From: John H. <ha...@sg...> - 2003-04-18 20:56:49
> + if (!sync && ((nr_running()*4) <= (num_online_cpus()*5)) &&

nr_running() is an O(N) algorithm, and try_to_wake_up() is almost as sensitive a code path as schedule() is.

John Hawkes
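(For readers without the source handy, a standalone sketch of why that is: 2.5's nr_running() sums every CPU's runqueue count, touching one likely-remote cacheline per CPU. Reconstructed from memory of the O(1)-scheduler code, so treat the details as approximate.)

#include <stdio.h>

#define NR_CPUS 16

struct runqueue { unsigned long nr_running; };
static struct runqueue runqueues[NR_CPUS];
#define cpu_rq(cpu) (&runqueues[(cpu)])

/* O(NR_CPUS): one read of a (likely remote) cacheline per CPU,
 * which is why calling this on every wakeup hurts */
static unsigned long nr_running(void)
{
	unsigned long i, sum = 0;

	for (i = 0; i < NR_CPUS; i++)
		sum += cpu_rq(i)->nr_running;
	return sum;
}

int main(void)
{
	cpu_rq(0)->nr_running = 3;
	cpu_rq(5)->nr_running = 1;
	printf("nr_running() = %lu\n", nr_running());
	return 0;
}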
From: Erich F. <ef...@hp...> - 2003-04-18 23:00:17
On Thursday 17 April 2003 18:35, Dave Hansen wrote:
> Martin J. Bligh wrote:
> >> if you have benchmarks affected by the NUMA scheduler you should
> >> also try to tune the parameters. The current settings are:
> >>
> >> #define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
> >> #define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 100)
> >>
> >> Especially the BUSY_NODE_REBALANCE_TICK is insane: it leaves your
> >> (busy) machine unbalanced for 20s!
> >
> > It's basically disabled because it's too aggressive. We need to make it
> > less aggressive first, then we'll be able to up the busy rebalance rate.
> > It also needs to be able to migrate multiple processes at once.
>
> First of all, the workload in question here is SPECjbb2000(tm)*. It is
> almost all Java and spends 99% of its time in userspace, so the kernel isn't
> really stressed directly.

Well, if it's Java then it's exactly the expected worst-case scenario for multithreading: threads are forked, and the new threads aren't spread around the CPUs because no initial load balancing happens at fork(). So the current CPU and node get overloaded. Cross-node balancing happens too seldom (20s for stealing one task!!!), so the system runs crappy. A BUSY_NODE_REBALANCE_TICK of 2 would help.

The alternative, which I proposed a long time ago and implemented in the node affine scheduler, was to allow certain processes to initially rebalance already at fork() instead of exec(). That way we get much more balanced nodes, and we can handle hackbench with an arbitrary number of threads in optimal time. This is somewhat equivalent to Andrew's approach, because it implies a much higher wake-up rate for such processes: after all, they go to less loaded CPUs.

Maybe it's bad that a lot of development is happening on NUMA-Q, as that machine is not representative any more because of its bad latency ratio. A lot of NUMA optimization is done for these "worst case" machines. Modern machines (like e.g. the x440) have good latency ratios, and multithreaded tasks do benefit from being spread around (with balanced nodes in mind). The results we see with the node affine scheduler and this approach are superior to the non-NUMA kernel because of the advantage of automatically getting balanced nodes, which the simple SMP scheduler can't achieve.

Regards,
Erich
From: Gerrit H. <gh...@us...> - 2003-04-18 23:28:42
On Sat, 19 Apr 2003 01:02:20 +0200, Erich Focht wrote:
> Maybe it's bad that a lot of development is happening on NUMA-Q, as that
> machine is not representative any more because of its bad latency
> ratio. A lot of NUMA optimization is done for these "worst case"
> machines. Modern machines (like e.g. the x440) have good latency ratios,
> and multithreaded tasks do benefit from being spread around (with
> balanced nodes in mind). The results we see with the node affine
> scheduler and this approach are superior to the non-NUMA kernel
> because of the advantage of automatically getting balanced nodes,
> which the simple SMP scheduler can't achieve.
>
> Regards,
> Erich

NUMA-Q may be less representative of today's latencies, but the current NUMA-Q hardware reflects the end of a set of CPU clock rate increases (which roughly double every 6-18 months) on an interconnect that was designed for processors 3-4 years earlier. Most NUMA and SMP architectures deploy bus speed increases on a much longer cycle than they deploy CPU clock increases. The x440 and many of the current IA64 NUMA boxes are at the beginning of a new bus/new CPU cycle. Most will, over the next 2-3 years, begin showing higher remote latencies as compared to their local latencies. I wouldn't discount NUMA-Q in a realistic sense, although we may be 2-3 years early with the effects that it is noticing. Also, NUMA-Q provides a lot of other benefits, such as the magnifying-glass capability for exaggerating the effects of today's hardware.

But, all things in context: the scheduler decisions for a high remote/local latency ratio are not the same ones you want to make for a machine with a significantly lower remote/local ratio. It seems like some decisions should be able to take that factor into account and just "do the right thing" based on the levels of hierarchy and their relative latencies. Or we could punt and include an appropriate factor in the arch-specific code.

gerrit
From: William L. I. I. <wl...@ho...> - 2003-04-18 23:59:08
On Fri, Apr 18, 2003 at 04:28:25PM -0700, Gerrit Huizenga wrote:
> NUMA-Q may be less representative of today's latencies, but the
> current NUMA-Q hardware reflects the end of a set of CPU clock
> rate increases (which roughly double every 6-18 months) on an
> interconnect that was designed for processors 3-4 years earlier.

IMHO remote access latency or a related statistic (e.g. ratios) should be an operational parameter and somehow automatically inferred.

-- wli
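(One plausible shape for such automatic inference is a boot-time pointer-chase probe: time dependent loads through a node-local buffer and a remote one, and keep the ratio. The standalone sketch below shows only the timing core; placing buffers on specific nodes is left out, since 2003-era userspace had no portable NUMA allocation API, and a real probe would also randomise the chain to defeat prefetching.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHAIN 65536   /* pointer-chase chain length */

/* time one dependent-load chase through buf (nanoseconds per hop) */
static double chase_ns(void **buf, long hops)
{
	struct timespec t0, t1;
	void **p = buf;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < hops; i++)
		p = (void **)*p;        /* each load depends on the previous */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	if (!p)                         /* keep p live; never taken */
		return 0;
	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
	        (t1.tv_nsec - t0.tv_nsec)) / hops;
}

int main(void)
{
	void **buf = malloc(CHAIN * sizeof(void *));
	long i;

	/* trivial circular chain; a real probe would place one buffer
	 * per node and compare local vs remote ns/hop */
	for (i = 0; i < CHAIN - 1; i++)
		buf[i] = &buf[i + 1];
	buf[CHAIN - 1] = &buf[0];

	printf("~%.1f ns/hop on this buffer\n", chase_ns(buf, 1L << 22));
	free(buf);
	return 0;
}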
From: Martin J. B. <mb...@ar...> - 2003-04-19 02:26:37
>> NUMA-Q may be less representative of today's latencies, but the
>> current NUMA-Q hardware reflects the end of a set of CPU clock
>> rate increases (which roughly double every 6-18 months) on an
>> interconnect that was designed for processors 3-4 years earlier.
>
> IMHO remote access latency or a related statistic (e.g. ratios) should
> be an operational parameter and somehow automatically inferred.

Right - the ratios were always meant to be a subarch tunable. That's not done yet, but could (and should) be.

M.
From: Andrew T. <hab...@us...> - 2003-04-19 01:31:40
On Friday 18 April 2003 18:57, William Lee Irwin III wrote:
> On Fri, Apr 18, 2003 at 04:28:25PM -0700, Gerrit Huizenga wrote:
> > NUMA-Q may be less representative of today's latencies, but the
> > current NUMA-Q hardware reflects the end of a set of CPU clock
> > rate increases (which roughly double every 6-18 months) on an
> > interconnect that was designed for processors 3-4 years earlier.
>
> IMHO remote access latency or a related statistic (e.g. ratios) should
> be an operational parameter and somehow automatically inferred.

Yes, I agree, and eventually I'd like to see the NUMA-related decisions made by the kernel have varying degrees of complexity/codepath length, based on this ratio. Let me back up: the problem I see with using just NUMA-Q (not really NUMA-Q, but using just one platform) is that we add enough complexity and code path length in the kernel to benefit it, but that may not apply well to a low-latency system like the x440, or even lower, like Hammer.

What we end up doing is increasing time spent in the kernel on these complex decisions designed for high-latency systems, because we know that on a given platform the time spent has a good return: the time we "wasted" in the kernel is made up by the time we save in the app. However, on a system with a lower latency, the decision does not need to be as precise, and therefore simpler logic/a shorter codepath/lower system time would be better, because there isn't as much application time to make up for our extra system time.

Let me give an example: sched_best_cpu(). At first we did not have the shortcut in it ("if nr_running is < 2, pick this cpu"); originally we cycled through all nodes, then cpus, to find the best. Adding the check really helped, but it could hurt a system with even worse latency. Who knows, a higher nr_running threshold could be better for the x440. The nr_running < 2 check really should vary based on the latency ratio. This lets the lower-latency systems fall out of the algorithm more often, giving a shorter codepath and lower kernel time.

So, what I am trying to show is: maybe some of these decisions need to have several stages, and based on the latency ratio, some systems run through x stages, while a higher-latency system goes through x+y stages, and so on:

ratio    how much we do to make a decision
1        w
2        w+x
3        w+x+y
4        w+x+y+z

The same concept could be used for "aggressiveness" in some algorithms, like the node balance ratio. For example, I didn't get good performance on NUMA-HT until I set this to 1. The non-NUMA case is essentially 1, too.

Anyway, something to think about. This is obviously 2.7 stuff :)

-Andrew
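(A toy rendering of that staging idea. Every function, stub, and threshold here is invented for illustration, not taken from any patch.)

#include <stdio.h>

static int latency_ratio = 3;   /* remote/local latency, per-arch value */

/* stage stubs: stand-ins for progressively costlier placement logic */
static int cheap_guess(void)             { return 0; }
static int refine_within_node(int cpu)   { return cpu; }
static int scan_neighbour_nodes(int cpu) { return cpu; }
static int full_global_scan(int cpu)     { return cpu; }

static int pick_cpu(void)
{
	int cpu = cheap_guess();                  /* stage w: always */

	if (latency_ratio >= 2)
		cpu = refine_within_node(cpu);    /* stage x */
	if (latency_ratio >= 3)
		cpu = scan_neighbour_nodes(cpu);  /* stage y */
	if (latency_ratio >= 4)
		cpu = full_global_scan(cpu);      /* stage z */
	return cpu;
}

int main(void)
{
	printf("picked cpu %d at ratio %d\n", pick_cpu(), latency_ratio);
	return 0;
}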
From: Martin J. B. <mb...@ar...> - 2003-04-19 02:38:49
> Yes, I agree, and eventually I'd like to see the NUMA-related decisions
> made by the kernel have varying degrees of complexity/codepath length,
> based on this ratio. Let me back up: the problem I see with using just
> NUMA-Q (not really NUMA-Q, but using just one platform) is that we add
> enough complexity and code path length in the kernel to benefit it, but
> that may not apply well to a low-latency system like the x440, or even
> lower, like Hammer.

I don't think that's the right way around at all - the problem is not one of complexity in terms of code path length. It's one of how many remote cachelines we touch - high-ratio architectures actually favour simpler decision making, based on rougher (smaller) data. The problem isn't one of cycles burnt in the scheduler functions, it's one of the wrong decisions being made. That's probably not that hard to fix, it just needs some tuning.

We need better data: per-cpu and per-node load averages, not crude averages of just a couple of points of runqueue length. We should take more parameters into account - e.g. task RSS, and how much of that is on which node. Try to keep threaded groups together on the same node. The node balancer should be able to move multiple tasks if it really needs to.

> The same concept could be used for "aggressiveness" in some algorithms,
> like the node balance ratio. For example, I didn't get good performance
> on NUMA-HT until I set this to 1. The non-NUMA case is essentially 1,
> too.

That may work well for some benchmarks, but not for others. And I have a strong feeling that this isn't the real underlying problem, we're just tinkering with it. This depends on all sorts of things ... like how heavily the cpus are loaded, and the RSS of the tasks involved.

Hmmm. That gives me an idea ... I'll send out a patch later ;-)

M.
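(A standalone sketch of the kind of "better data" meant here: a decayed per-node load average rather than a couple of instantaneous runqueue samples. The decay constant and fixed-point shift are arbitrary choices, not from any posted patch.)

#include <stdio.h>

#define NODES 4
#define CPUS_PER_NODE 4
#define FIXED_SHIFT 10   /* 10-bit fixed point */

static unsigned long cpu_nr_running[NODES][CPUS_PER_NODE];
static unsigned long node_load_avg[NODES];   /* fixed-point */

/* decay toward the instantaneous per-node runqueue sum: 7/8 old + 1/8 new */
static void update_node_load(void)
{
	int n, c;

	for (n = 0; n < NODES; n++) {
		unsigned long now = 0;

		for (c = 0; c < CPUS_PER_NODE; c++)
			now += cpu_nr_running[n][c];
		node_load_avg[n] =
			(node_load_avg[n] * 7 + (now << FIXED_SHIFT)) / 8;
	}
}

int main(void)
{
	int i;

	cpu_nr_running[0][0] = 3;   /* a burst of runnable tasks on node 0 */
	for (i = 0; i < 16; i++)
		update_node_load();
	printf("node 0 load average ~ %.2f\n",
	       node_load_avg[0] / (double)(1 << FIXED_SHIFT));
	return 0;
}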
From: Martin J. B. <mb...@ar...> - 2003-04-19 02:25:06
> Well, if it's Java then it's exactly the expected worst-case scenario
> for multithreading: threads are forked, and the new threads aren't
> spread around the CPUs because no initial load balancing happens
> at fork(). So the current CPU and node get overloaded. Cross-node
> balancing happens too seldom (20s for stealing one task!!!), so
> the system runs crappy.
>
> A BUSY_NODE_REBALANCE_TICK of 2 would help. The alternative, which I
> proposed a long time ago and implemented in the node affine scheduler,
> was to allow certain processes to initially rebalance already at
> fork() instead of exec(). That way we get much more balanced
> nodes, and we can handle hackbench with an arbitrary number of threads in
> optimal time. This is somewhat equivalent to Andrew's approach,
> because it implies a much higher wake-up rate for such
> processes: after all, they go to less loaded CPUs.

I'm really, really not keen on the "magic tagging" of certain processes that's been discussed before - that just seems too complicated and too messy. We could do the same pretty cleanly for clone(), but I'm not at all sure we really *want* to be spreading multithreaded things across nodes. That only applies if we have < 1 process per node, with many threads from those few processes. Maybe we could do it conditionally or something, but it's not simple.

> Maybe it's bad that a lot of development is happening on NUMA-Q, as that
> machine is not representative any more because of its bad latency
> ratio. A lot of NUMA optimization is done for these "worst case"
> machines. Modern machines (like e.g. the x440) have good latency ratios,
> and multithreaded tasks do benefit from being spread around (with
> balanced nodes in mind). The results we see with the node affine
> scheduler and this approach are superior to the non-NUMA kernel
> because of the advantage of automatically getting balanced nodes,
> which the simple SMP scheduler can't achieve.

The x440 is still about 5:1. On an unloaded interconnect. And that's just latency - when we start to max out bandwidth, another whole category of problems kicks in. Yes, NUMA-Q has a higher ratio than most other things, but mainly what it does is exaggerate the problems - making them obvious.

As I've said before, I agree that the 20s thing is utterly ridiculous - basically it's disabled. But turning it up just seemed to cause cross-node task bounce ... I think we need to fix that first, then we can easily turn that balancer back "on". Unfortunately, I don't think it's simple to fix.

M.
From: Martin J. B. <mb...@ar...> - 2003-04-19 15:20:24
BTW, thinking more about the CPU load stuff, I'm confused as to why the node_balance stuff helps JBB so much ... the load average is less than 1 per cpu, IIRC. Can you try doing the wakeup patch, but only throwing tasks around *within* a node instead of across nodes, and see if that makes a difference? If not, I guess we need to stare at the schedstat data again.

I think the case when the busy node rebalance is broken (as far as I'm concerned) is when the load is low. If I have 4 tasks on one node (1 per cpu) and none on the other, that's just fine by me. That's what I said I was going to fix last night. However, I realised that sucks for other people ... we need a better metric here.

Erich, I presume you want more perfect balancing across nodes for mem bandwidth concerns? Ie on a 4 node, 2 cpu per node system (which is what I thought yours was), you don't want 2/2/0/0 tasks for each node, you really, really want 1/1/1/1? Or is 2/2/0/0 just as good?

M.
From: Erich F. <ef...@hp...> - 2003-04-21 22:26:11
On Saturday 19 April 2003 17:20, Martin J. Bligh wrote:
> I think the case when the busy node rebalance is broken (as far as I'm
> concerned) is when the load is low. If I have 4 tasks on one node
> (1 per cpu) and none on the other, that's just fine by me. That's what
> I said I was going to fix last night. However, I realised that sucks
> for other people ... we need a better metric here.

Indeed... CPUs in one node are typically on a common FSB. This means that the bandwidth to memory inside a node is limited and the CPUs share it. It makes a lot of difference to have one bandwidth eater on one node which gets the full stream performance, or have two of them, each getting only half of the performance.

> Erich, I presume you want more perfect balancing across nodes for mem
> bandwidth concerns? Ie on a 4 node, 2 cpu per node system (which is what
> I thought yours was), you don't want 2/2/0/0 tasks for each node, you
> really, really want 1/1/1/1? Or is 2/2/0/0 just as good?

The TX7 normally has 4 CPUs per node. My particular configuration has 2. I absolutely prefer 1/1/1/1 :-)

Regards,
Erich
From: Andrew T. <hab...@us...> - 2003-04-21 22:38:38
On Monday 21 April 2003 17:01, Erich Focht wrote:
> On Saturday 19 April 2003 17:20, Martin J. Bligh wrote:
> > I think the case when the busy node rebalance is broken (as far as I'm
> > concerned) is when the load is low. If I have 4 tasks on one node
> > (1 per cpu) and none on the other, that's just fine by me. That's what
> > I said I was going to fix last night. However, I realised that sucks
> > for other people ... we need a better metric here.
>
> Indeed... CPUs in one node are typically on a common FSB. This means
> that the bandwidth to memory inside a node is limited and the CPUs
> share it. It makes a lot of difference to have one bandwidth eater on
> one node which gets the full stream performance, or have two of them,
> each getting only half of the performance.
>
> > Erich, I presume you want more perfect balancing across nodes for mem
> > bandwidth concerns? Ie on a 4 node, 2 cpu per node system (which is what
> > I thought yours was), you don't want 2/2/0/0 tasks for each node, you
> > really, really want 1/1/1/1? Or is 2/2/0/0 just as good?
>
> The TX7 normally has 4 CPUs per node. My particular configuration has
> 2. I absolutely prefer 1/1/1/1 :-)

So when balancing from 4/0/0/0 to 1/1/1/1, are we going to pull tasks in a run state?

IMO, how we figure out which tasks benefit from 4/0/0/0 vs 1/1/1/1 is what 2.7-NUMA is about :) I can't possibly see something working and fully tested over many workloads within a couple of months. Right now on exec we get 1/1/1/1, and on fork we get 4/0/0/0, and that's about all we can do at this point.

However, I would like to start discussing how we figure this out - shared mem, pipes, sockets, page faults <- a whole lot of things to look at to help the kernel figure out what's best.

-Andrew
From: Erich F. <ef...@hp...> - 2003-04-19 11:04:32
On Saturday 19 April 2003 04:24, Martin J. Bligh wrote:
> I'm really, really not keen on the "magic tagging" of certain processes
> that's been discussed before - that just seems too complicated and too
> messy. We could do the same pretty cleanly for clone(), but I'm not at all
> sure we really *want* to be spreading multithreaded things across nodes.
> That only applies if we have < 1 process per node, with many threads
> from those few processes. Maybe we could do it conditionally or something,
> but it's not simple.

Why not just use the prctl() mechanism? That was implemented for just this kind of change to process behavior. The default would be to just use the exec() initial balancing. For special applications (which one knows, or has to find out by experimenting) one could add a shell command before starting the process, or so. It is simple and offers a solution until we find the nice automagic way of doing it. There are many such approaches which are well accepted in Linux (e.g. hugetlb).

My experience is with a machine with a latency ratio of 1.6:1 (NEC TX7). It is quite simple: if your threads live long enough, the impact of having them spread across the nodes is that the CPU bus bandwidth is more balanced across the nodes and the threads can be scheduled more effectively because they don't queue up on a crowded node. The "first touch" policy when page faults occur arranges the distribution of the (common) memory across the nodes in a quite optimal way. So when we design some approach to load balancing of multithreaded processes, we should not exclude the case that the tasks should be spread around just because this is bad on e.g. NUMA-Q.

> x440 is still about 5:1. On an unloaded interconnect. And that's just
> latency - when we start to max out bandwidth, another whole category of
> problems kicks in. Yes, NUMA-Q has a higher ratio than most other things,
> but mainly what it does is exaggerate the problems - making them obvious.

Yes, but the exaggerated results draw conclusions which might just be bad for better machines.

> As I've said before, I agree that the 20s thing is utterly ridiculous -
> basically it's disabled. But turning it up just seemed to cause cross-node
> task bounce ... I think we need to fix that first, then we can easily turn
> that balancer back "on". Unfortunately, I don't think it's simple to fix.

This motivates me to advertise once again the ideas of the node affine NUMA scheduler: each task has a homenode assigned to it (which can be recalculated depending on its RSS) and the scheduler prefers stealing tasks with homenode==own_node. That will not generate bouncing, it just attracts tasks back to their homenode. In such a case you just don't have to be worried about a task going away from its homenode; it will eventually get back there. And if it doesn't get back and allocates a lot of memory on the new node, it will change its homenode to the new node and stay there.

I know this is too much for 2.6 ... :-(

Regards,
Erich
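(For concreteness, a userspace wrapper in the spirit of the prctl() suggestion above. PR_SET_FORK_BALANCE and its value are entirely hypothetical - no such option exists in any kernel - so on a stock kernel the prctl() simply fails with EINVAL.)

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>

#define PR_SET_FORK_BALANCE 9999   /* hypothetical prctl option */

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}
	/* ask the (hypothetical) kernel to balance this process tree
	 * across nodes at fork() time, not just at exec() */
	if (prctl(PR_SET_FORK_BALANCE, 1, 0, 0, 0) < 0)
		perror("prctl");   /* expected on a stock kernel */

	execvp(argv[1], &argv[1]);   /* run the fork-happy workload */
	perror("execvp");
	return 1;
}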
From: Martin J. B. <mb...@ar...> - 2003-04-19 14:47:40
>> I'm really, really not keen on the "magic tagging" of certain processes
>> that's been discussed before - that just seems too complicated and too
>> messy. We could do the same pretty cleanly for clone(), but I'm not at
>> all sure we really *want* to be spreading multithreaded things across
>> nodes. That only applies if we have < 1 process per node, with many
>> threads from those few processes. Maybe we could do it conditionally or
>> something, but it's not simple.
>
> Why not just use the prctl() mechanism? That was implemented for just
> this kind of change to process behavior. The default would be to just
> use the exec() initial balancing. For special applications (which one
> knows, or has to find out by experimenting) one could add a shell
> command before starting the process, or so. It is simple and offers a
> solution until we find the nice automagic way of doing it. There are
> many such approaches which are well accepted in Linux (e.g. hugetlb).

That sounds much more reasonable - leaving it up to userspace until we work out a nice way to do it.

> My experience is with a machine with a latency ratio of 1.6:1 (NEC
> TX7). It is quite simple: if your threads live long enough, the impact
> of having them spread across the nodes is that the CPU bus bandwidth
> is more balanced across the nodes and the threads can be scheduled
> more effectively because they don't queue up on a crowded node. The
> "first touch" policy when page faults occur arranges the distribution
> of the (common) memory across the nodes in a quite optimal way. So
> when we design some approach to load balancing of multithreaded
> processes, we should not exclude the case that the tasks should be
> spread around just because this is bad on e.g. NUMA-Q.

I realise that some systems are different, and optimising for NUMA-Q really isn't that important ... I just don't believe that it's that unrepresentative - just exaggerated. If there are cases where things really are different, we should make that a per-arch thing, and make both types of machine run well. I'm certainly against making NUMA-Q run well at the expense of other machines - it's just what I have to hand, and is *usually* a good magnifying glass to look at problems. Making optimisations for your platform is certainly a good thing, as long as they're reasonably simple - we just need to make them conditional if they slow down other machines.

> Yes, but the exaggerated results draw conclusions which might just be
> bad for better machines.

Might be in a couple of cases - feel free to point them out (as in the node rebalance rate). However, as a case in point, I fiddled with the node rebalance ratio last night ... since the other changes we've been making in the scheduler, I can turn this right down to 2 now with no performance degradation ;-) I'll send Linus the patch later, unless anyone objects. But there were other (broken) things going on that are now seemingly fixed (in retrospect, it probably would have been better to turn it down anyway and focus on the "too much bounce" problem, but still ...)

>> As I've said before, I agree that the 20s thing is utterly ridiculous -
>> basically it's disabled. But turning it up just seemed to cause
>> cross-node task bounce ... I think we need to fix that first, then we
>> can easily turn that balancer back "on". Unfortunately, I don't think
>> it's simple to fix.
>
> This motivates me to advertise once again the ideas of the node affine
> NUMA scheduler: each task has a homenode assigned to it (which can be
> recalculated depending on its RSS) and the scheduler prefers stealing
> tasks with homenode==own_node. That will not generate bouncing, it
> just attracts tasks back to their homenode. In such a case you just
> don't have to be worried about a task going away from its homenode; it
> will eventually get back there. And if it doesn't get back and
> allocates a lot of memory on the new node, it will change its homenode
> to the new node and stay there.

Andi (cc'ed) is interested in that too, and has some code against current, I think ;-) I don't think it'll work well for all boxes, but I understand it might work well for some - probably needs to be a config option or something. I still think that doing this dynamically by per-node RSS is a better idea - though now we have the per-cpu / per-zone hot & cold pages stuff in, it might work much better than the last time we tried (less cross-node lock bounce for the allocations, etc).

> I know this is too much for 2.6 ... :-(

Depends ... if we can keep it simple, and do it in small stages, most of it only touches NUMA code anyway, so we should be OK ;-)

M.
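(A toy version of the homenode stealing preference Erich describes - the task fields and helper are invented, and the real node affine scheduler patch is considerably more involved: when node N steals, it prefers a task whose homenode is N, so migration converges instead of bouncing.)

#include <stdio.h>

struct task { int pid; int homenode; };

/* prefer a task that is "coming home"; otherwise fall back to anything */
static struct task *pick_steal(struct task *rq, int n, int this_node)
{
	struct task *fallback = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (rq[i].homenode == this_node)
			return &rq[i];   /* attracts tasks back, no bounce */
		if (!fallback)
			fallback = &rq[i];
	}
	return fallback;
}

int main(void)
{
	struct task remote_rq[] = { {101, 0}, {102, 2}, {103, 1} };
	struct task *p = pick_steal(remote_rq, 3, 2);

	printf("node 2 steals pid %d (homenode %d)\n", p->pid, p->homenode);
	return 0;
}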
From: Erich F. <ef...@hp...> - 2003-04-21 22:35:34
On Saturday 19 April 2003 16:47, Martin J. Bligh wrote:
> > Why not just use the prctl() mechanism? That was implemented for just
> > this kind of change to process behavior. The default would be to just
> > use the exec() initial balancing. For special applications (which one
> > knows, or has to find out by experimenting) one could add a shell
> > command before starting the process, or so. It is simple and offers a
> > solution until we find the nice automagic way of doing it. There are
> > many such approaches which are well accepted in Linux (e.g. hugetlb).
>
> That sounds much more reasonable - leaving it up to userspace until we work
> out a nice way to do it.

OK, I'll try to dig out some code; let's see if the multithreaded BMs improve with such an approach.

> Might be in a couple of cases - feel free to point them out (as in the node
> rebalance rate). However, as a case in point, I fiddled with the node
> rebalance ratio last night ... since the other changes we've been making in
> the scheduler, I can turn this right down to 2 now with no performance
> degradation ;-) I'll send Linus the patch later, unless anyone objects. But
> there were other (broken) things going on that are now seemingly fixed (in
> retrospect, it probably would have been better to turn it down anyway and
> focus on the "too much bounce" problem, but still ...)

This sounds good! You use the same machine as before? ;-)

> > This motivates me to advertise once again the ideas of the node affine
> > NUMA scheduler: each task has a homenode assigned to it (which can be
> > recalculated depending on its RSS) and the scheduler prefers stealing
> > tasks with homenode==own_node. That will not generate bouncing, it
> > just attracts tasks back to their homenode. In such a case you just
> > don't have to be worried about a task going away from its homenode; it
> > will eventually get back there. And if it doesn't get back and
> > allocates a lot of memory on the new node, it will change its homenode
> > to the new node and stay there.
>
> Andi (cc'ed) is interested in that too, and has some code against current,
> I think ;-) I don't think it'll work well for all boxes, but I understand
> it might work well for some - probably needs to be a config option or
> something.

I'm interested to see/read anything about that. Is there a pointer to some LKML post or patch somewhere?

Regards,
Erich
From: Martin J. B. <mb...@ar...> - 2003-04-22 00:25:42
>> Might be in a couple of cases - feel free to point them out (as in the node
>> rebalance rate). However, as a case in point, I fiddled with the node
>> rebalance ratio last night ... since the other changes we've been making in
>> the scheduler, I can turn this right down to 2 now with no performance
>> degradation ;-) I'll send Linus the patch later, unless anyone objects. But
>> there were other (broken) things going on that are now seemingly fixed (in
>> retrospect, it probably would have been better to turn it down anyway and
>> focus on the "too much bounce" problem, but still ...)
>
> This sounds good! You use the same machine as before? ;-)

Yup, same machine. There were a couple of bits broken in the idle calculations, and the idle rate vs busy rates were de-intertwined. Not sure if one of those fixed it.

M.