From: Xavier B. <xav...@bu...> - 2004-03-23 18:26:36
Attachments:
Bull-sched-040329
Hello Erich.

We are hitting a problem with the NUMA scheduler:

Running a single process on a 16-way NUMA machine (4 nodes of 4 CPUs), we
sometimes see the process migrating across nodes when the machine is not
loaded. It seems that on a machine where most CPUs are idle, the
balance_node() routine finds a load imbalance (for example: 2 active
processes on one node and 0 on another), and a process is migrated across
nodes, with the major problem of broken memory affinity.

Running the numatest with only one process, in the (good) case we have:

    initial CPU = 7
    cpu          18493  12
    cpu0             0   0
    cpu1             0   0
    cpu2             0   0
    cpu3             0   0
    cpu4             0   0
    cpu5             0   0
    cpu6             0   0
    cpu7         18493  12
    cpu8             0   0
    cpu9             0   0
    cpu10            0   0
    cpu11            0   0
    cpu12            0   0
    cpu13            0   0
    cpu14            0   0
    cpu15            0   0
    current_cpu      7

    real    0m18.073s
    user    0m18.060s
    sys     0m0.013s

but in the bad one (cross-node migration):

    initial CPU = 8
    cpu          30271  13
    cpu0             0   0
    cpu1         26902   1
    cpu2             0   0
    cpu3             0   0
    cpu4             0   0
    cpu5             0   0
    cpu6             0   0
    cpu7             0   0
    cpu8             4   1
    cpu9           191  10
    cpu10         3174   1
    cpu11            0   0
    cpu12            0   0
    cpu13            0   0
    cpu14            0   0
    cpu15            0   0
    current_cpu      1

    real    0m29.577s
    user    0m29.562s
    sys     0m0.015s

The following patch, which makes find_busiest_node() ignore nodes where the
number of active processes is <= the number of CPUs in the node, seems to
fix the problem. Note that in this case processes still migrate between
CPUs of the same node. Should that be ameliorated?

Thanks in advance.

Xavier

--
Sincères salutations.
_____________________________________________________________________
 Xavier BRU                 BULL ISD/R&D/INTEL    office: FREC B1-422
 tel : +33 (0)4 76 29 77 45               http://www-frec.bull.fr
 fax : +33 (0)4 76 29 77 70               mailto:Xav...@bu...
 addr: BULL, 1 rue de Provence, BP 208, 38432 Echirolles Cedex, FRANCE
_____________________________________________________________________
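The patch itself is in the attachment rather than inlined, but the idea, skipping any node whose runnable-task count does not exceed its CPU count when looking for the busiest node, can be sketched in plain C. The harness below is a hypothetical stand-alone mock, not the 2.6 kernel code; the array names merely echo the kernel's node_nr_running bookkeeping.

```c
#define NR_NODES 4

/*
 * Stand-alone sketch (not kernel source) of the proposed fix: when
 * picking a node to steal from, ignore nodes that can run all their
 * tasks locally, since stealing from them only breaks memory affinity.
 * Returns the index of the busiest oversubscribed remote node, or -1
 * if no cross-node migration is worthwhile.
 */
static int find_busiest_node(const int nr_running[NR_NODES],
                             const int nr_cpus[NR_NODES],
                             int this_node)
{
        int i, busiest = -1, max_load = 0;

        for (i = 0; i < NR_NODES; i++) {
                if (i == this_node)
                        continue;
                /* The proposed check: node not oversubscribed, skip it. */
                if (nr_running[i] <= nr_cpus[i])
                        continue;
                if (nr_running[i] > max_load) {
                        max_load = nr_running[i];
                        busiest = i;
                }
        }
        return busiest;
}
```

With the single-process case from the mail (2 runnable tasks on a 4-CPU node, everything else idle) this returns -1, so balance_node() would leave the task on the node holding its memory.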
From: Martin J. B. <mb...@ar...> - 2004-03-23 22:10:44
> Running a single process on a 16-way NUMA machine (4 nodes of 4 CPUs), we
> sometimes see the process migrating across nodes when the machine is not
> loaded. It seems that on a machine where most CPUs are idle, the
> balance_node() routine finds a load imbalance (for example: 2 active
> processes on one node and 0 on another), and a process is migrated across
> nodes, with the major problem of broken memory affinity.
...
> The following patch, which makes find_busiest_node() ignore nodes where
> the number of active processes is <= the number of CPUs in the node,
> seems to fix the problem. Note that in this case processes still migrate
> between CPUs of the same node.

That sounds like the right thing to do, though I haven't checked the
specifics of the patch. The confusing thing is that I thought we fixed
that exact bug ages ago ... maybe I lost the patch.

Can you recheck with Nick's sched_domains code on either the -mjb or -mm
tree? That's the way we're intending to go anyway, so if it's fixed there,
I'm not too bothered.

M.
From: Erich F. <ef...@hp...> - 2004-03-24 12:04:50
Hello Xavier,

On Tuesday 23 March 2004 19:24, Xavier Bru wrote:
> Running a single process on a 16-way NUMA machine (4 nodes of 4 CPUs), we
> sometimes see the process migrating across nodes when the machine is not
> loaded. It seems that on a machine where most CPUs are idle, the
> balance_node() routine finds a load imbalance (for example: 2 active
> processes on one node and 0 on another), and a process is migrated across
> nodes, with the major problem of broken memory affinity.

As Rick mentioned in his reply, in some cases you might actually want an
idle node to steal single tasks. The time-averaged loads should actually
help to avoid the situation you mention, but the loads are computed by
taking into account all runnable processes. It might be better to consider
only processes which ran longer or have more memory, as Rick suggested.
OTOH this gives a minimum unbalanced run-time, which you again might want
to avoid.

My main idea behind a NUMA scheduler was to provide a mechanism for a
process to return to its node. This isn't there in 2.6, and the current
NUMA scheduler is pretty "crippled", IMHO. As the sched_domains approach
provides more flexibility regarding HT, NUMA, SMT and what not, and Andrew
Morton seems willing to accept it, I'm not too motivated to improve the
current NUMA scheduler.

Looking at your patch: by just skipping all nodes with
nr_running <= nr_cpus, you don't update this_rq()->prev_node_load[i], so
some decisions might be wrong later. So maybe you want to move the
"continue" lower in the loop. Otherwise it's fine, IF you want this
behaviour...

> to fix the problem. Note that in this case processes still migrate
> between cpus of the same node.

??? That shouldn't happen, either. Is it due to short-running processes
on the same CPU? The time-averaging should catch that, too...

Andrew suggests in a reply to change try_to_wake_up()... Might work, but
you still have kernel threads pinned to each CPU which you cannot move;
they just have to wake up there.

It would be nice to find out what exactly happens and which task pushes
the numatest away. It would again be better to not count particular tasks
in the load.

Regards,
Erich
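Erich's caveat about the placement of the "continue" can be made concrete. The 2.6-era find_busiest_node() keeps a decayed per-node load history in this_rq()->prev_node_load[] (roughly: half the previous value plus the current runnable count); skipping a node before that history is written leaves stale data for later balancing decisions. The following is a hypothetical stand-alone mock, not the kernel source, showing the "update history first, then skip" ordering:

```c
#define NR_NODES 4

/* Mock of the per-runqueue history kept in this_rq()->prev_node_load[]. */
static int prev_node_load[NR_NODES];

static int find_busiest_node(const int nr_running[NR_NODES],
                             const int nr_cpus[NR_NODES],
                             int this_node)
{
        int i, load, busiest = -1, max_load = 0;

        for (i = 0; i < NR_NODES; i++) {
                /*
                 * Decayed load average, written for EVERY node, so the
                 * history stays fresh even for nodes skipped below.
                 */
                load = (prev_node_load[i] >> 1) + nr_running[i];
                prev_node_load[i] = load;

                if (i == this_node)
                        continue;
                /* Xavier's skip, moved below the history update. */
                if (nr_running[i] <= nr_cpus[i])
                        continue;
                if (load > max_load) {
                        max_load = load;
                        busiest = i;
                }
        }
        return busiest;
}
```

The decay constants are illustrative; the point is only the ordering of the history update relative to the early "continue".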
From: Xavier B. <xav...@bu...> - 2004-03-24 16:37:42
[2]kdb> bt
Stack traceback for pid 0
0xe0000001fff10000     0     1     0     1   2  R 0xe0000001fff106b0 *swapper
0xa000000100086b20 pull_task
        args (0xe000002000044620, 0xe000002000044678, 0xe0000020fe440000, 0xe000000004d24620, 0x2)
        kernel 0xa000000100086b20 0xa000000100086e80
0xa000000100087540 load_balance+0x680
        args (0xe000000004d24620, 0xe0000020fe440000, 0xe000002000044e68, 0xe0000020fe440682, 0xe000002000044e68)
        kernel 0xa000000100086ec0 0xa000000100087980
0xa000000100087ab0 balance_node+0x130
        args (0xe000000004d24620, 0x1, 0xe0000001fff17c60, 0xa000000100087b90, 0x38a)
        kernel 0xa000000100087980 0xa000000100087ae0
0xa000000100087b90 rebalance_tick+0xb0
        args (0xe000000004d24620, 0x1, 0x1003d2687, 0x2, 0xa0000001000aa480)
        kernel 0xa000000100087ae0 0xa000000100087da0
0xa0000001000aa480 update_process_times+0x60
        args (0x0, 0x1, 0xa000000100057dd0, 0x205, 0x8000)
        kernel 0xa0000001000aa420 0xa0000001000aa4a0
0xa000000100057dd0 smp_do_timer+0x90
        args (0xe0000001fff17c70, 0xa00000010003c9e0, 0x48c, 0x48c)
        kernel 0xa000000100057d40 0xa000000100057e00
0xa00000010003c9e0 timer_interrupt+0x280
        args (0x4b81b4d1944, 0xa0000001008e4580, 0xe0000001fff17c70, 0xffffffffffff0038, 0xa0000001008e4584)
[2]more>
        kernel 0xa00000010003c760 0xa00000010003cc20
0xa0000001000151c0 handle_IRQ_event+0xa0
        args (0xef, 0xe0000001fff17c70, 0xa00000010090f8f8, 0x20000001, 0x0)
        kernel 0xa000000100015120 0xa000000100015240
0xa000000100015aa0 do_IRQ+0x120
        args (0xef, 0xe0000001fff17c70, 0xa0000001008e3790, 0xa0000001008e3780, 0xa0000001008e3788)
        kernel 0xa000000100015980 0xa000000100015d00
0xa0000001000178a0 ia64_handle_irq+0xa0
        args (0x0, 0xe0000001fff17c70, 0x0, 0xfd, 0xa0000001000119a0)
        kernel 0xa000000100017800 0xa0000001000179a0
0xa0000001000119a0 ia64_leave_kernel
        args (0x0, 0xe0000001fff17c70)
        kernel 0xa0000001000119a0 0xa000000100011c00
0xa000000100019360 cpu_idle+0x100
        args (0xa00000010087ac28, 0x0, 0xa000000100adddd0, 0xa000000100addc70, 0xa000000100adc098)
        kernel 0xa000000100019260 0xa000000100019440
0xa00000010089d830 start_secondary+0x50
        args (0xa000000100008590, 0x60, 0x400000)
        kernel 0xa00000010089d7e0 0xa00000010089d860
0xa000000100008590 _start+0x270
        args (0xa000000100008590, 0x60, 0x400000, 0xa00000010087ac28, 0x0)
        kernel 0xa000000100008320 0xa000000100008320

[2]kdb> ps R
Task Addr            Pid   Parent Tgid  [*] cpu State Thread             Command
0xe0000000048c8000     0     0     0     1   0  R 0xe0000000048c86b0  swapper
Error: no saved data for this cpu
0xe0000001fff38000     0     1     0     1   1  R 0xe0000001fff386b0  swapper
Error: no saved data for this cpu
0xe0000001fff10000     0     1     0     1   2  R 0xe0000001fff106b0 *swapper
0xe0000001814d8000     0     1     0     1   3  R 0xe0000001814d86b0  swapper
Error: no saved data for this cpu
0xe0000001814b0000     0     1     0     1   4  R 0xe0000001814b06b0  swapper
Error: no saved data for this cpu
0xe000000181498000     0     1     0     1   5  R 0xe0000001814986b0  swapper
Error: no saved data for this cpu
0xe0000001ffef0000     0     1     0     1   6  R 0xe0000001ffef06b0  swapper
Error: no saved data for this cpu
0xe0000001ffeb0000     0     1     0     1   8  R 0xe0000001ffeb06b0  swapper
Error: no saved data for this cpu
0xe0000001ffe98000     0     1     0     1   9  R 0xe0000001ffe986b0  swapper
Error: no saved data for this cpu
0xe0000001ffe70000     0     1     0     1  10  R 0xe0000001ffe706b0  swapper
Error: no saved data for this cpu
0xe0000001ffe58000     0     1     0     1  11  R 0xe0000001ffe586b0  swapper
Error: no saved data for this cpu
[2]more>
0xe0000001ffe30000     0     1     0     1  12  R 0xe0000001ffe306b0  swapper
Error: no saved data for this cpu
0xe0000001ffe18000     0     1     0     1  13  R 0xe0000001ffe186b0  swapper
Error: no saved data for this cpu
0xe0000001ffe00000     0     1     0     1  14  R 0xe0000001ffe006b0  swapper
Error: no saved data for this cpu
0xe0000001815d8000     0     1     0     1  15  R 0xe0000001815d86b0  swapper
Error: no saved data for this cpu
0xe0000020fe440000  4233  4114  4233     0   7  R 0xe0000020fe4406b0  mytest

[2]kdb> md node_nr_running
0xa000000100b05a80 00000000 00000002 00000000 00000000   ................
0xa000000100b05a90 00000000 00000000 00000000 00000000   ................
0xa000000100b05aa0 00000000 00000000 00000000 00000000   ................
0xa000000100b05ab0 00000000 00000000 00000000 00000000   ................
0xa000000100b05ac0 00000000 00000000 00000000 00000000   ................
0xa000000100b05ad0 00000000 00000000 00000000 00000000   ................
0xa000000100b05ae0 00000000 00000000 00000000 00000000   ................
0xa000000100b05af0 00000000 00000000 00000000 00000000   ................
[2]kdb>
From: Paul J. <pj...@sg...> - 2004-03-24 19:04:44
Xavier,

Where are you getting the printouts that look like:

    initial CPU = 2
    cpu     18491  16
    cpu0    17125   2
    cpu1      441   0
    cpu2      700  14
    cpu3      225   0
    ...
    current_cpu 0

We have something in our SGI 2.4 kernels (/proc/<pid>/cpu) that displays
this sort of per-cpu usage, but I don't see anything in the 2.6 kernels
that seems to do this.

--
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj...@sg...> 1.650.933.1373
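The format quoted above (one line per CPU: a name followed by two counters) is simple enough to post-process. Here is a small C helper as a sketch; note the field meanings (user and system ticks) are inferred from context in this thread, not from the patch itself, so check the actual patch before relying on them.

```c
#include <stdio.h>

/*
 * Parse one line of the per-process, per-CPU times output quoted in
 * this thread, e.g. "cpu0 17125 2".  The two counters are ASSUMED to
 * be user and system ticks; verify against the real patch.
 * Returns 1 on success, 0 on a malformed line.
 */
static int parse_cpu_line(const char *line, char name[16],
                          long *user, long *sys)
{
        return sscanf(line, "%15s %ld %ld", name, user, sys) == 3;
}
```

A script looping such a parser over the output makes it easy to diff the good and bad numatest runs shown earlier in the thread.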
From: Xavier B. <xav...@bu...> - 2004-03-25 08:19:45
Paul Jackson a écrit :
> Where are you getting the printouts that look like:
>
>     initial CPU = 2
>     cpu     18491  16
>     cpu0    17125   2
>     cpu1      441   0
>     cpu2      700  14
>     cpu3      225   0
>     ...
>     current_cpu 0
>
> We have something in our SGI 2.4 kernels (/proc/<pid>/cpu) that
> displays this sort of per-cpu usage, but I don't see anything
> in the 2.6 kernels that seems to do this.

Hello Paul,

This is just the per-cpu times patch that Erich Focht provided some
(long :-) ) time ago with his NUMA affinity benchmark.

Xavier

--
Sincères salutations.
_____________________________________________________________________
 Xavier BRU                 BULL ISD/R&D/INTEL    office: FREC B1-422
 tel : +33 (0)4 76 29 77 45               http://www-frec.bull.fr
 fax : +33 (0)4 76 29 77 70               mailto:Xav...@bu...
 addr: BULL, 1 rue de Provence, BP 208, 38432 Echirolles Cedex, FRANCE
_____________________________________________________________________
From: Erich F. <ef...@hp...> - 2004-03-26 08:31:32
Attachments:
cputimes_stat-2.6.0t1.patch
Hi Paul,

On Wednesday 24 March 2004 20:03, Paul Jackson wrote:
> Where are you getting the printouts that look like:
>
>     initial CPU = 2
>     cpu     18491  16
>     cpu0    17125   2
>     cpu1      441   0
>     cpu2      700  14
>     cpu3      225   0
>     ...
>     current_cpu 0
>
> We have something in our SGI 2.4 kernels (/proc/<pid>/cpu) that
> displays this sort of per-cpu usage, but I don't see anything
> in the 2.6 kernels that seems to do this.

It's probably the attached patch. Sorry, I'm travelling and couldn't
rediff against a current version...
From: Paul J. <pj...@sg...> - 2004-03-26 08:37:57
> It's probably the attached patch

Excellent - thank you. Do you expect this patch to end up in the
mainstream kernel sometime?

--
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj...@sg...> 1.650.933.1373
From: <sh...@ft...> - 2004-03-28 10:06:58
Hi,

Very nice patch. Andrew, would you consider adding this one?

--Shai

-----Original Message-----
From: lse...@li... [mailto:lse...@li...] On Behalf Of Erich Focht
Sent: Friday, March 26, 2004 00:31
To: Paul Jackson; Xavier Bru
Cc: ric...@us...; mb...@ar...; lse...@li...; Erik Jacobson
Subject: Re: [Lse-tech] Re: NUMA scheduler issue

Hi Paul,

On Wednesday 24 March 2004 20:03, Paul Jackson wrote:
> Where are you getting the printouts that look like:
>
>     initial CPU = 2
>     cpu     18491  16
>     cpu0    17125   2
>     cpu1      441   0
>     cpu2      700  14
>     cpu3      225   0
>     ...
>     current_cpu 0
>
> We have something in our SGI 2.4 kernels (/proc/<pid>/cpu) that
> displays this sort of per-cpu usage, but I don't see anything
> in the 2.6 kernels that seems to do this.

It's probably the attached patch. Sorry, I'm travelling and couldn't
rediff against a current version...