From: Martin J. B. <mb...@ar...> - 2002-10-26 19:17:28
>> From my point of view, the reason for focussing on this was that
>> your scheduler degraded the performance on my machine, rather than
>> boosting it. Half of that was the more complex stuff you added on
>> top ... it's a lot easier to start with something simple that works
>> and build on it, than fix something that's complex and doesn't work
>> well.
>
> You're talking about one of the first 2.5 versions of the patch. It
> changed a lot since then, thanks to your feedback, too.

OK, I went to your latest patches (just 1 and 2). And they worked!
You've fixed the performance degradation problems for kernel compile
(now a 14% improvement in systime), that core set works without
further futzing about or crashing, with or without TSC, on either
version of gcc ... congrats! It also produces the fastest system time
for kernel compile I've ever seen ... this core set seems to be good
(I'm still less than convinced about the further patches, but we can
work on those one at a time now you've got it all broken out and
modular).

Michael posted slightly different looking results for virgin 44
yesterday - the main difference between virgin 44 and 44-mm4 for this
stuff is probably the per-cpu hot & cold pages (Ingo, this is like
your original per-cpu pages).

All results are for a 16-way NUMA-Q (P3 700MHz 2Mb cache) 16Gb RAM.

Kernbench:
                        Elapsed        User      System         CPU
2.5.44-mm4              19.676s    192.794s     42.678s     1197.4%
2.5.44-mm4-hbaum        19.422s    189.828s     40.204s     1196.2%
2.5.44-mm4-focht12      19.316s    189.514s     36.704s     1146.8%

Schedbench 4:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                32.45       49.47      129.86        0.82
2.5.44-mm4-hbaum          31.31       43.85      125.29        0.84
2.5.44-mm4-focht12        38.50       45.34      154.05        1.07

Schedbench 8:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                39.90       61.48      319.26        2.79
2.5.44-mm4-hbaum          32.63       46.56      261.10        1.99
2.5.44-mm4-focht12        35.56       46.57      284.53        1.97

Schedbench 16:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                62.99       93.59     1008.01        5.11
2.5.44-mm4-hbaum          49.78       76.71      796.68        4.43
2.5.44-mm4-focht12        51.94       61.43      831.26        4.68

Schedbench 32:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                88.13      194.53     2820.54       11.52
2.5.44-mm4-hbaum          54.67      147.30     1749.77        7.91
2.5.44-mm4-focht12        55.43      119.49     1773.97        8.41

Schedbench 64:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4               159.92      653.79    10235.93       25.16
2.5.44-mm4-hbaum          65.20      300.58     4173.26       16.82
2.5.44-mm4-focht12        56.49      235.78     3615.71       18.05

There's a small degradation at the low end of schedbench (Erich's
numa_test) in there ... would be nice to fix, but I'm less worried
about that (where the machine is lightly loaded) than the other
numbers. Kernbench is just gcc-2.95-4 compiling the 2.4.17 kernel
doing a "make -j24 bzImage".

diffprofile 2.5.44-mm4 2.5.44-mm4-hbaum
(for kernbench, + got worse by adding the patch, - got better)

       184 vm_enough_memory
       154 d_lookup
        83 do_schedule
        75 page_add_rmap
        73 strnlen_user
        58 find_get_page
        52 flush_signal_handlers
...
       -61 pte_alloc_one
       -63 do_wp_page
       -85 .text.lock.file_table
       -96 __set_page_dirty_buffers
      -112 clear_page_tables
      -118 get_empty_filp
      -134 free_hot_cold_page
      -144 page_remove_rmap
      -150 __copy_to_user
      -213 zap_pte_range
      -217 buffered_rmqueue
      -875 __copy_from_user
     -1015 do_anonymous_page

diffprofile 2.5.44-mm4 2.5.44-mm4-focht12
(for kernbench, + got worse by adding the patch, - got better)

<nothing significantly degraded>
....
       -57 path_lookup
       -69 do_page_fault
       -73 vm_enough_memory
       -77 filemap_nopage
       -78 do_no_page
       -83 __set_page_dirty_buffers
       -83 __fput
       -84 do_schedule
       -97 find_get_page
      -106 file_move
      -115 free_hot_cold_page
      -115 clear_page_tables
      -130 d_lookup
      -147 atomic_dec_and_lock
      -157 page_add_rmap
      -197 buffered_rmqueue
      -236 zap_pte_range
      -264 get_empty_filp
      -271 __copy_to_user
      -464 page_remove_rmap
      -573 .text.lock.file_table
      -618 __copy_from_user
      -823 do_anonymous_page
From: Martin J. B. <mb...@ar...> - 2002-10-27 18:19:38
> OK, I went to your latest patches (just 1 and 2). And they worked!
> You've fixed the performance degradation problems for kernel compile
> (now a 14% improvement in systime), that core set works without
> further futzing about or crashing, with or without TSC, on either
> version of gcc ... congrats!

So I have a slight correction to make to the above ;-) Your patches
do work just fine, no crashes any more. HOWEVER ... turns out I only
had the first patch installed, not both. Silly mistake, but turns out
to be very interesting.

So your second patch is the balance on exec stuff ... I've looked at
it, and think it's going to be very expensive to do in practice, at
least the simplistic "recalc everything on every exec" approach. It
does benefit the low end schedbench results, but not the high end
ones, and you can see the cost of your second patch in the system
times of the kernbench.

In summary, I think I like the first patch alone better than the
combination, but will have a play at making a cross between the two.
As I have very little context about the scheduler, would appreciate
any help anyone would like to volunteer ;-)

Corrected results are:

Kernbench:
                        Elapsed        User      System         CPU
2.5.44-mm4              19.676s    192.794s     42.678s     1197.4%
2.5.44-mm4-hbaum        19.422s    189.828s     40.204s     1196.2%
2.5.44-mm4-focht-1       19.46s    189.838s     37.938s       1171%
2.5.44-mm4-focht-12      20.32s        190s       44.4s     1153.6%

Schedbench 4:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                32.45       49.47      129.86        0.82
2.5.44-mm4-hbaum          31.31       43.85      125.29        0.84
2.5.44-mm4-focht-1        38.61       45.15      154.48        1.06
2.5.44-mm4-focht-12       23.23       38.87       92.99        0.85

Schedbench 8:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                39.90       61.48      319.26        2.79
2.5.44-mm4-hbaum          32.63       46.56      261.10        1.99
2.5.44-mm4-focht-1        37.76       61.09      302.17        2.55
2.5.44-mm4-focht-12       28.40       34.43      227.25        2.09

Schedbench 16:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                62.99       93.59     1008.01        5.11
2.5.44-mm4-hbaum          49.78       76.71      796.68        4.43
2.5.44-mm4-focht-1        51.69       60.23      827.20        4.95
2.5.44-mm4-focht-12       51.24       60.86      820.08        4.23

Schedbench 32:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                88.13      194.53     2820.54       11.52
2.5.44-mm4-hbaum          54.67      147.30     1749.77        7.91
2.5.44-mm4-focht-1        56.71      123.62     1815.12        7.92
2.5.44-mm4-focht-12       55.69      118.85     1782.25        7.28

Schedbench 64:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4               159.92      653.79    10235.93       25.16
2.5.44-mm4-hbaum          65.20      300.58     4173.26       16.82
2.5.44-mm4-focht-1        55.60      232.36     3558.98       17.61
2.5.44-mm4-focht-12       56.03      234.45     3586.46       15.76
From: Erich F. <ef...@es...> - 2002-10-27 23:33:06
On Sunday 27 October 2002 19:16, Martin J. Bligh wrote:
> > OK, I went to your latest patches (just 1 and 2). And they worked!
> > You've fixed the performance degradation problems for kernel compile
> > (now a 14% improvement in systime), that core set works without
> > further futzing about or crashing, with or without TSC, on either
> > version of gcc ... congrats!
>
> So I have a slight correction to make to the above ;-) Your patches
> do work just fine, no crashes any more. HOWEVER ... turns out I only
> had the first patch installed, not both. Silly mistake, but turns out
> to be very interesting.
>
> So your second patch is the balance on exec stuff ... I've looked at
> it, and think it's going to be very expensive to do in practice, at
> least the simplistic "recalc everything on every exec" approach. It
> does benefit the low end schedbench results, but not the high end
> ones, and you can see the cost of your second patch in the system
> times of the kernbench.

This is interesting, indeed. As you might have seen from the tests I
posted on LKML I could not see that effect on our IA64 NUMA machine.
Which raises the question: is it expensive to recalculate the load
when doing an exec (which I should also see) or is the strategy of
equally distributing the jobs across the nodes bad for certain
load+architecture combinations? As I'm not seeing the effect, maybe
you could do the following experiment: in sched_best_node() keep only
the "while" loop at the beginning. This leads to a cheap selection of
the next node, just a simple round robin.

Regarding the schedbench results: are they averages over multiple
runs? The numa_test needs to be repeated a few times to get
statistically meaningful results.

Thanks,
Erich
From: Martin J. B. <mb...@ar...> - 2002-10-27 23:55:43
> This is interesting, indeed. As you might have seen from the tests I
> posted on LKML I could not see that effect on our IA64 NUMA machine.
> Which raises the question: is it expensive to recalculate the load
> when doing an exec (which I should also see) or is the strategy of
> equally distributing the jobs across the nodes bad for certain
> load+architecture combinations?

I suspect the former. Bouncing a whole pile of cachelines every time
would be much more expensive for me than it would for you, and
kernbench will be heavy on exec.

> As I'm not seeing the effect, maybe you could do the following
> experiment: in sched_best_node() keep only the "while" loop at the
> beginning. This leads to a cheap selection of the next node, just a
> simple round robin.

Maybe I could just send you the profiles instead ;-) If I have more
time, I'll try your suggestion. I'm trying Michael's balance_exec on
top of your patch 1 at the moment, but I'm somewhat confused by his
code for sched_best_cpu.

+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, best_cpu, cur_cpu, node;
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	node = __cpu_to_node(__get_cpu_var(last_exec_cpu));
+	if (++node >= numnodes)
+		node = 0;
+
+	cur_cpu = __node_to_first_cpu(node);
+	minload = cpu_rq(best_cpu)->nr_running;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(cur_cpu))
+			continue;
+
+		if (minload > cpu_rq(cur_cpu)->nr_running) {
+			minload = cpu_rq(cur_cpu)->nr_running;
+			best_cpu = cur_cpu;
+		}
+		if (++cur_cpu >= NR_CPUS)
+			cur_cpu = 0;
+	}
+	__get_cpu_var(last_exec_cpu) = best_cpu;
+	return best_cpu;
+}

Michael, the way I read the NR_CPUS loop, you walk every cpu in the
system, and take the best from all of them. In which case what's the
point of the last_exec_cpu stuff? On the other hand, I changed your
NR_CPUS to 4 (ie just walk the cpus in that node), and it got worse.
So perhaps I'm just misreading your code ... and it does seem
significantly cheaper to execute than Erich's.

Erich, on the other hand, your code does this:

+void sched_balance_exec(void)
+{
+	int new_cpu, new_node=0;
+
+	while (pooldata_is_locked())
+		cpu_relax();
+	if (numpools > 1) {
+		new_node = sched_best_node(current);
+	}
+	new_cpu = sched_best_cpu(current, new_node);
+	if (new_cpu != smp_processor_id())
+		sched_migrate_task(current, new_cpu);
+}

which seems to me to walk every runqueue in the system (in
sched_best_node), then walk one node's worth all over again in
sched_best_cpu .... doesn't it? Again, I may be misreading this ...
haven't looked at the scheduler much. But I can't help feeling some
sort of lazy evaluation is in order ....

And what's this doing?

+	do {
+		/* atomic_inc_return is not implemented on all archs [EF] */
+		atomic_inc(&sched_node);
+		best_node = atomic_read(&sched_node) % numpools;
+	} while (!(pool_mask[best_node] & mask));

I really don't think putting a global atomic in there is going to be
cheap ....

> Regarding the schedbench results: are they averages over multiple
> runs? The numa_test needs to be repeated a few times to get
> statistically meaningful results.

No. But I don't have 2 hours to run each set of tests either. I did a
couple of runs, and didn't see huge variances. Seems stable enough.

M.
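PS: to make the lazy evaluation idea concrete, here's a throwaway
userspace sketch (made-up names and loads, not a patch against either
scheduler): a single pass over the CPUs accumulates the per-node load
totals *and* remembers the emptiest CPU within each node, so picking
the best node doesn't force a second walk over its runqueues.

#include <stdio.h>
#include <limits.h>

#define NR_CPUS       16
#define NR_NODES       4
#define CPUS_PER_NODE (NR_CPUS / NR_NODES)

/* Illustrative instantaneous runqueue lengths, one per cpu. */
static int nr_running[NR_CPUS] = { 3, 1, 2, 2,  0, 4, 1, 1,
                                   2, 2, 2, 2,  1, 1, 3, 1 };

static int sched_best_cpu(void)
{
	int node_load[NR_NODES] = { 0 };
	int node_best_cpu[NR_NODES] = { 0 };
	int node_min[NR_NODES];
	int cpu, node, best_node = 0;

	for (node = 0; node < NR_NODES; node++)
		node_min[node] = INT_MAX;

	/* One pass: per-node totals and per-node minima together. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		node = cpu / CPUS_PER_NODE;
		node_load[node] += nr_running[cpu];
		if (nr_running[cpu] < node_min[node]) {
			node_min[node] = nr_running[cpu];
			node_best_cpu[node] = cpu;
		}
	}

	/* Least loaded node; its emptiest cpu is already known. */
	for (node = 1; node < NR_NODES; node++)
		if (node_load[node] < node_load[best_node])
			best_node = node;

	return node_best_cpu[best_node];
}

int main(void)
{
	printf("best cpu: %d\n", sched_best_cpu());
	return 0;
}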
From: Michael H. <hoh...@us...> - 2002-10-28 00:57:08
> I'm trying Michael's balance_exec on top of your patch 1 at the
> moment, but I'm somewhat confused by his code for sched_best_cpu.
>
> [sched_best_cpu code snipped]
>
> Michael, the way I read the NR_CPUS loop, you walk every cpu in the
> system, and take the best from all of them. In which case what's the
> point of the last_exec_cpu stuff? On the other hand, I changed your
> NR_CPUS to 4 (ie just walk the cpus in that node), and it got worse.
> So perhaps I'm just misreading your code ... and it does seem
> significantly cheaper to execute than Erich's.

You are reading it correctly. The only thing that the last_exec_cpu
does is to help spread the load across nodes. Without that what was
happening is that node 0 would get completely loaded, then node 1,
etc. With it, in cases where one or more runqueues have the same
length, the one chosen tends to get spread out a bit. Not the
greatest solution, but it helps.

--
Michael Hohnbaum          503-578-5486
hoh...@us...              T/L 775-5486
From: Martin J. B. <mb...@ar...> - 2002-10-28 04:25:52
>> Michael, the way I read the NR_CPUS loop, you walk every cpu
>> in the system, and take the best from all of them. In which case
>> what's the point of the last_exec_cpu stuff? On the other hand,
>> I changed your NR_CPUS to 4 (ie just walk the cpus in that node),
>> and it got worse. So perhaps I'm just misreading your code ...
>> and it does seem significantly cheaper to execute than Erich's.
>
> You are reading it correctly. The only thing that the last_exec_cpu
> does is to help spread the load across nodes. Without that what was
> happening is that node 0 would get completely loaded, then node 1,
> etc. With it, in cases where one or more runqueues have the same
> length, the one chosen tends to get spread out a bit. Not the
> greatest solution, but it helps.

OK. I made a simple boring optimisation to your patch. Shaved almost
a second off system time for kernbench, and seems idiotproof to me,
shouldn't change anything apart from touching fewer runqueues: if we
find a runqueue with nr_running == 0, stop searching ... we ain't
going to find anything better ;-)

Kernbench:
                                Elapsed        User      System         CPU
2.5.44-mm4                      19.676s    192.794s     42.678s     1197.4%
2.5.44-mm4-hbaum-1              19.746s    189.232s     38.354s     1152.2%
2.5.44-mm4-hbaum-12             19.322s    190.176s     40.354s     1192.6%
2.5.44-mm4-hbaum-12-firstzero   19.292s     189.66s     39.428s     1187.4%

Patch is probably space-eaten, so just whack it in by hand.

--- 2.5.44-mm4-hbaum-12/kernel/sched.c	2002-10-27 19:54:25.000000000 -0800
+++ 2.5.44-mm4-hbaum-12-first_low/kernel/sched.c	2002-10-27 16:42:10.000000000 -0800
@@ -2206,6 +2206,8 @@
 		if (minload > cpu_rq(cur_cpu)->nr_running) {
 			minload = cpu_rq(cur_cpu)->nr_running;
 			best_cpu = cur_cpu;
+			if (minload == 0)
+				break;
 		}
 		if (++cur_cpu >= NR_CPUS)
 			cur_cpu = 0;
From: Martin J. B. <mb...@ar...> - 2002-10-28 00:34:34
OK, so I'm trying to read your patch 1, fairly unsuccessfully (seems
to be a lot more complex than Michael's).

Can you explain pool_lock? It does actually seem to work, but it's
rather confusing ....

build_pools() has a comment above it saying:

+/*
+ * Call pooldata_lock() before calling this function and
+ * pooldata_unlock() after!
+ */

But then you promptly call pooldata_lock inside build_pools anyway
... looks like it's just a naff comment, but doesn't help much.

Leaving aside the acknowledged mind-boggling ugliness of
pooldata_lock(), what exactly is this lock protecting, and when? The
only thing that actually calls pooldata_lock is build_pools, right?
And the only other thing that looks at it is sched_balance_exec via
pooldata_is_locked ... can that happen before build_pools? (Seems
like you're in deep trouble if it does anyway, as it'll just block.)
If you really still need to do this, RCU is now in the kernel ;-) If
not, can we just chuck all that stuff?

M.
From: Martin J. B. <mb...@ar...> - 2002-10-28 00:49:06
OK, so I tried Michael's without the balance_exec code as well, then
Erich's main patch with Michael's balance_exec (which seems to be
cheaper to calculate). Turns out I was actually running an older
version of Michael's patch .... with his latest stuff it actually
seems to perform better pretty much across the board (comparing
2.5.44-mm4-focht-12 and 2.5.44-mm4-hbaum-12). And it's also a lot
simpler.

Erich, what does all the pool stuff actually buy us over what Michael
is doing? Seems to be rather more complex, but maybe it's useful for
something we're just not measuring here?

2.5.44-mm4            Virgin
2.5.44-mm4-focht-1    Focht main
2.5.44-mm4-hbaum-1    Hbaum main
2.5.44-mm4-focht-12   Focht main + Focht balance_exec
2.5.44-mm4-hbaum-12   Hbaum main + Hbaum balance_exec
2.5.44-mm4-f1-h2      Focht main + Hbaum balance_exec

Kernbench:
                        Elapsed        User      System         CPU
2.5.44-mm4              19.676s    192.794s     42.678s     1197.4%
2.5.44-mm4-focht-1       19.46s    189.838s     37.938s       1171%
2.5.44-mm4-hbaum-1      19.746s    189.232s     38.354s     1152.2%
2.5.44-mm4-focht-12      20.32s        190s       44.4s     1153.6%
2.5.44-mm4-hbaum-12     19.322s    190.176s     40.354s     1192.6%
2.5.44-mm4-f1-h2        19.398s    190.118s      40.06s       1186%

Schedbench 4:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                32.45       49.47      129.86        0.82
2.5.44-mm4-focht-1        38.61       45.15      154.48        1.06
2.5.44-mm4-hbaum-1        37.81       46.44      151.26        0.78
2.5.44-mm4-focht-12       23.23       38.87       92.99        0.85
2.5.44-mm4-hbaum-12       22.26       34.70       89.09        0.70
2.5.44-mm4-f1-h2          21.39       35.97       85.57        0.81

Schedbench 8:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                39.90       61.48      319.26        2.79
2.5.44-mm4-focht-1        37.76       61.09      302.17        2.55
2.5.44-mm4-hbaum-1        43.18       56.74      345.54        1.71
2.5.44-mm4-focht-12       28.40       34.43      227.25        2.09
2.5.44-mm4-hbaum-12       30.71       45.87      245.75        1.43
2.5.44-mm4-f1-h2          36.11       45.18      288.98        2.10

Schedbench 16:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                62.99       93.59     1008.01        5.11
2.5.44-mm4-focht-1        51.69       60.23      827.20        4.95
2.5.44-mm4-hbaum-1        52.57       61.54      841.38        3.93
2.5.44-mm4-focht-12       51.24       60.86      820.08        4.23
2.5.44-mm4-hbaum-12       52.33       62.23      837.46        3.84
2.5.44-mm4-f1-h2          51.76       60.15      828.33        5.67

Schedbench 32:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4                88.13      194.53     2820.54       11.52
2.5.44-mm4-focht-1        56.71      123.62     1815.12        7.92
2.5.44-mm4-hbaum-1        54.57      153.56     1746.45        9.20
2.5.44-mm4-focht-12       55.69      118.85     1782.25        7.28
2.5.44-mm4-hbaum-12       54.36      135.30     1739.95        8.09
2.5.44-mm4-f1-h2          55.97      119.28     1791.39        7.20

Schedbench 64:
                        Elapsed   TotalUser    TotalSys     AvgUser
2.5.44-mm4               159.92      653.79    10235.93       25.16
2.5.44-mm4-focht-1        55.60      232.36     3558.98       17.61
2.5.44-mm4-hbaum-1        71.48      361.77     4575.45       18.53
2.5.44-mm4-focht-12       56.03      234.45     3586.46       15.76
2.5.44-mm4-hbaum-12       56.91      240.89     3642.99       15.67
2.5.44-mm4-f1-h2          56.48      246.93     3615.32       16.97
From: Erich F. <ef...@es...> - 2002-10-28 17:11:59
Attachments:
numabench
On Monday 28 October 2002 01:46, Martin J. Bligh wrote:
> Erich, what does all the pool stuff actually buy us over what
> Michael is doing? Seems to be rather more complex, but maybe
> it's useful for something we're just not measuring here?

The more complicated stuff is for achieving equal load between the
nodes. It delays steals more when the stealing node is averagely
loaded, less when it is unloaded. This is the place where we can make
it cope with more complex machines with multiple levels of memory
hierarchy (like our 32 CPU TX7). Equal load among the nodes is
important if you have memory bandwidth eaters, as the bandwidth in a
node is limited.

When introducing node affinity (which shows good results for me!) you
also need a more careful ranking of the tasks which are candidates to
be stolen. The routine task_to_steal does this and is another source
of complexity. It is another point where the multilevel stuff comes
in. In the core part of the patch the rank of the steal candidates is
computed by only taking into account the time which a task has slept.

I attach the script for getting some statistics on the numa_test. I
consider this test more sensitive to NUMA effects, as it is a
bandwidth eater also needing good latency. (BTW, Martin: in the
numa_test script I've sent you the PROBLEMSIZE must be set to
1000000!)

Regards,
Erich
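PS: in case the ranking idea is unclear, here it is as a toy
userspace model (illustrative names and numbers only; the real
task_to_steal also folds in the multilevel stuff, and I'm assuming
here that longer sleepers rank higher because they are cache-cold and
cheapest to move):

#include <stdio.h>

struct task { const char *name; unsigned long sleep_ticks; };

/* Made-up runqueue contents on the CPU we want to steal from. */
static struct task runqueue[] = {
	{ "gcc",   2 },	/* ran recently: cache-hot, poor candidate */
	{ "make", 40 },	/* slept longest: cache-cold, best candidate */
	{ "sh",   15 },
};

static struct task *task_to_steal(struct task *rq, int n)
{
	struct task *best = NULL;
	int i;

	/* Rank candidates purely by time slept. */
	for (i = 0; i < n; i++)
		if (!best || rq[i].sleep_ticks > best->sleep_ticks)
			best = &rq[i];
	return best;
}

int main(void)
{
	printf("steal candidate: %s\n",
	       task_to_steal(runqueue, 3)->name);
	return 0;
}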
From: Martin J. B. <mb...@ar...> - 2002-10-28 18:38:55
>> Erich, what does all the pool stuff actually buy us over what
>> Michael is doing? Seems to be rather more complex, but maybe
>> it's useful for something we're just not measuring here?
>
> The more complicated stuff is for achieving equal load between the
> nodes. It delays steals more when the stealing node is averagely
> loaded, less when it is unloaded. This is the place where we can make
> it cope with more complex machines with multiple levels of memory
> hierarchy (like our 32 CPU TX7). Equal load among the nodes is
> important if you have memory bandwidth eaters, as the bandwidth in a
> node is limited.
>
> When introducing node affinity (which shows good results for me!) you
> also need a more careful ranking of the tasks which are candidates to
> be stolen. The routine task_to_steal does this and is another source
> of complexity. It is another point where the multilevel stuff comes
> in. In the core part of the patch the rank of the steal candidates is
> computed by only taking into account the time which a task has slept.

OK, it all sounds sane, just rather complicated ;-) I'm going to
trawl through your stuff with Michael, and see if we can simplify it
a bit somehow whilst not changing the functionality. Your first patch
seems to work just fine, it's just the complexity that bugs me a bit.
The combination of your first patch with Michael's balance_exec stuff
actually seems to work pretty well ... I'll poke at the new patch you
sent me + Michael's exec balance + the little perf tweak I made to
it, and see what happens ;-)

> I attach the script for getting some statistics on the numa_test. I
> consider this test more sensitive to NUMA effects, as it is a
> bandwidth eater also needing good latency. (BTW, Martin: in the
> numa_test script I've sent you the PROBLEMSIZE must be set to
> 1000000!)

It is ;-) I'm running 44-mm4, not virgin remember, so things like
hot & cold page lists may make it faster?

M.
From: Martin J. B. <mb...@ar...> - 2002-10-28 07:19:35
> This is interesting, indeed. As you might have seen from the tests I
> posted on LKML I could not see that effect on our IA64 NUMA machine.
> Which raises the question: is it expensive to recalculate the load
> when doing an exec (which I should also see) or is the strategy of
> equally distributing the jobs across the nodes bad for certain
> load+architecture combinations? As I'm not seeing the effect, maybe
> you could do the following experiment: in sched_best_node() keep only
> the "while" loop at the beginning. This leads to a cheap selection of
> the next node, just a simple round robin.

I did this ... presume that's what you meant:

static int sched_best_node(struct task_struct *p)
{
	int i, n, best_node=0, min_load, pool_load, min_pool=numa_node_id();
	int cpu, pool, load;
	unsigned long mask = p->cpus_allowed & cpu_online_map;

	do {
		/* atomic_inc_return is not implemented on all archs [EF] */
		atomic_inc(&sched_node);
		best_node = atomic_read(&sched_node) % numpools;
	} while (!(pool_mask[best_node] & mask));

	return best_node;
}

Odd. Seems to make it even worse.

Kernbench:
                             Elapsed        User      System         CPU
2.5.44-mm4-focht-12           20.32s        190s       44.4s     1153.6%
2.5.44-mm4-focht-12-lobo     21.362s     193.71s     48.672s       1134%

The diffprofiles below look like this just makes it make bad
decisions. Very odd ... compare with what happened when I put
Michael's balance_exec on instead. I'm tired, maybe I did something
silly.

diffprofile 2.5.44-mm4-focht-1 2.5.44-mm4-focht-12

       606 page_remove_rmap
       566 do_schedule
       488 page_add_rmap
       475 .text.lock.file_table
       370 __copy_to_user
       306 strnlen_user
       272 d_lookup
       235 find_get_page
       233 get_empty_filp
       193 atomic_dec_and_lock
       161 copy_process
       159 sched_best_node
       135 flush_signal_handlers
       131 complete
       116 filemap_nopage
       109 __fput
       105 path_lookup
       103 follow_mount
        95 zap_pte_range
        92 file_move
        91 do_no_page
        87 release_task
        80 do_page_fault
        62 lru_cache_add
        62 link_path_walk
        62 do_generic_mapping_read
        57 find_trylock_page
        55 release_pages
        50 dup_task_struct
...
       -73 do_anonymous_page
      -478 __copy_from_user

diffprofile 2.5.44-mm4-focht-12 2.5.44-mm4-focht-12-lobo

       567 do_schedule
       482 do_anonymous_page
       383 page_remove_rmap
       336 __copy_from_user
       333 page_add_rmap
       241 zap_pte_range
       213 init_private_file
       189 strnlen_user
       186 buffered_rmqueue
       172 find_get_page
       124 complete
       111 filemap_nopage
        97 free_hot_cold_page
        89 flush_signal_handlers
        86 clear_page_tables
        79 do_page_fault
        79 copy_process
        75 d_lookup
        74 path_lookup
        71 sched_best_cpu
        68 do_no_page
        58 release_pages
        58 __set_page_dirty_buffers
        52 wait_for_completion
        51 release_task
        51 handle_mm_fault
...
       -53 lru_cache_add
       -73 dentry_open
      -100 sched_best_node
      -108 file_ra_state_init
      -402 .text.lock.file_table
From: Erich F. <ef...@es...> - 2002-10-28 16:34:52
On Monday 28 October 2002 01:31, Martin J. Bligh wrote:
> OK, so I'm trying to read your patch 1, fairly unsuccessfully (seems
> to be a lot more complex than Michael's).
>
> Can you explain pool_lock? It does actually seem to work, but
> it's rather confusing ....

The pool data is needed to be able to loop over the CPUs of one node,
only. I'm convinced we'll need to do that sometime, no matter how
simple the core of the NUMA scheduler is. The pool_lock is protecting
that data while it is built. This can happen more often in the
future, if somebody starts hotplugging CPUs.

> build_pools() has a comment above it saying:
>
> +/*
> + * Call pooldata_lock() before calling this function and
> + * pooldata_unlock() after!
> + */
>
> But then you promptly call pooldata_lock inside build_pools
> anyway ... looks like it's just a naff comment, but doesn't
> help much.

Sorry, the comment came from a former version...

> just block). If you really still need to do this, RCU is now
> in the kernel ;-) If not, can we just chuck all that stuff?

I'm preparing a core patch which doesn't need the pool_lock. I'll
send it out today.

Regards,
Erich
From: Martin J. B. <mb...@ar...> - 2002-10-28 17:02:23
> The pool data is needed to be able to loop over the CPUs of one node,
> only. I'm convinced we'll need to do that sometime, no matter how
> simple the core of the NUMA scheduler is.

Hmmm ... is using node_to_cpumask from the topology stuff, then
looping over that bitmask insufficient?

> The pool_lock is protecting that data while it is built. This can
> happen more often in the future, if somebody starts hotplugging CPUs.

Heh .... when someone actually does that, we'll have a lot more
problems than just this to solve. Would be nice to keep this stuff
simple for now, if possible.

> Sorry, the comment came from a former version...

No problem, I suspected that was all it was.

>> just block). If you really still need to do this, RCU is now
>> in the kernel ;-) If not, can we just chuck all that stuff?
>
> I'm preparing a core patch which doesn't need the pool_lock. I'll
> send it out today.

Cool! Thanks,

M.
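PS: concretely, the kind of loop I have in mind (a toy userspace
model; the mask layout and numbers are made up, and node_to_cpumask()
here just stands in for whatever the topology headers provide):

#include <stdio.h>

#define NR_CPUS 16

/* Pretend topology: 4 nodes x 4 cpus, bit i set => cpu i in node. */
static unsigned long node_to_cpumask(int node)
{
	return 0xFUL << (node * 4);
}

int main(void)
{
	int node = 2, cpu;
	unsigned long mask = node_to_cpumask(node);

	/* Loop over just the CPUs of one node via the bitmask. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1UL << cpu))
			printf("node %d has cpu %d\n", node, cpu);
	return 0;
}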
From: Erich F. <ef...@es...> - 2002-10-28 17:26:48
Attachments:
01-numa_sched_core-2.5.39-12b.patch
On Monday 28 October 2002 17:57, Martin J. Bligh wrote:
>> I'm preparing a core patch which doesn't need the pool_lock. I'll
>> send it out today.
>
> Cool! Thanks,

OK, here it comes. The core doesn't use the loop_over_nodes() macro
any more. There's one big loop over the CPUs for computing node loads
and the most loaded CPUs in find_busiest_queue. The call to
build_cpus() isn't critical any more. Functionality is the same as in
the previous patch (i.e. steal delays, ranking of task_to_steal,
etc...).

I kept the loop_over_node() macro for compatibility reasons with the
additional patches. You might need to replace in the additional
patches:
  numpools        -> numpools()
  pool_nr_cpus[]  -> pool_ncpus()

I'm puzzled about the initial load balancing impact and have to think
about the results I've seen from you so far... In the environments I
am used to, the frequency of exec syscalls is rather low, therefore I
didn't care too much about the sched_balance_exec performance and
preferred to try harder to achieve good distribution across the
nodes.

Regards,
Erich
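PS: the "one big loop" idea in a nutshell, as a userspace toy (the
loads and node layout are made up; this is not the real
find_busiest_queue()): one walk over the CPUs yields both the
per-node load sums and the busiest CPU.

#include <stdio.h>

#define NR_CPUS       16
#define NR_NODES       4
#define CPUS_PER_NODE (NR_CPUS / NR_NODES)

/* Illustrative instantaneous runqueue lengths. */
static int nr_running[NR_CPUS] = { 1, 0, 2, 1,  5, 1, 1, 0,
                                   2, 2, 1, 1,  0, 1, 1, 1 };

int main(void)
{
	int node_load[NR_NODES] = { 0 };
	int busiest_cpu = 0, cpu, node;

	/* One pass: node loads and busiest CPU together. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		node_load[cpu / CPUS_PER_NODE] += nr_running[cpu];
		if (nr_running[cpu] > nr_running[busiest_cpu])
			busiest_cpu = cpu;
	}

	for (node = 0; node < NR_NODES; node++)
		printf("node %d load: %d\n", node, node_load[node]);
	printf("busiest cpu: %d (%d running)\n",
	       busiest_cpu, nr_running[busiest_cpu]);
	return 0;
}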
From: Martin J. B. <mb...@ar...> - 2002-10-28 17:40:44
> I'm puzzled about the initial load balancing impact and have to think
> about the results I've seen from you so far... In the environments I
> am used to, the frequency of exec syscalls is rather low, therefore I
> didn't care too much about the sched_balance_exec performance and
> preferred to try harder to achieve good distribution across the
> nodes.

OK, but take a look at Michael's second patch. It still looks at
nr_running on every queue in the system (with some slightly strange
code to make a rotating choice of node in the case of equality), so
should still be able to make the best decision .... *but* it seems
to be much cheaper to execute. Not sure why at this point, given the
last results I sent you last night ;-)

M.
From: Erich F. <ef...@es...> - 2002-10-29 00:07:25
On Monday 28 October 2002 18:35, Martin J. Bligh wrote:
> OK, but take a look at Michael's second patch. It still looks at
> nr_running on every queue in the system (with some slightly strange
> code to make a rotating choice of node in the case of equality), so
> should still be able to make the best decision .... *but* it seems
> to be much cheaper to execute. Not sure why at this point, given the
> last results I sent you last night ;-)

Yes, I like it! I needed some time to understand that the per_cpu
variables can spread the execed tasks across the nodes as well as the
atomic sched_node.

Sure, I'd like to select the least loaded node instead of the least
loaded CPU. It can well be that you have just created 10 threads on a
node (by fork, therefore still on their original CPU), and have an
idle CPU in the same node (which didn't yet steal the newly created
tasks). Suppose your instant load looks like this:

node 0:  cpu0:  1   cpu1: 1   cpu2: 1   cpu3: 1
node 1:  cpu4: 10   cpu5: 0   cpu6: 1   cpu7: 1

If you exec on cpu0 before cpu5 managed to steal something from cpu4,
you'll aim for cpu5. This would just increase the node imbalance and
force more of the threads on cpu4 to move to node 0, which is maybe
bad for them. Just an example... If you start considering non-trivial
cpus_allowed masks, you might get more of these cases.

We could take this as a design target for the initial load balancer
and keep the fastest version we currently have for the benchmarks we
currently use (Michael's).

Regards,
Erich
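PS: the example above in code form, if it helps (userspace toy with
the same made-up loads): picking the least loaded CPU aims for cpu5
on the already overloaded node 1, while picking the least loaded
node aims for node 0.

#include <stdio.h>

#define NR_CPUS       8
#define NR_NODES      2
#define CPUS_PER_NODE (NR_CPUS / NR_NODES)

/* node 0: 1 1 1 1   node 1: 10 0 1 1 (the instant load above) */
static int nr_running[NR_CPUS] = { 1, 1, 1, 1, 10, 0, 1, 1 };

int main(void)
{
	int node_load[NR_NODES] = { 0 };
	int cpu, node, min_cpu = 0, min_node = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		node_load[cpu / CPUS_PER_NODE] += nr_running[cpu];
		if (nr_running[cpu] < nr_running[min_cpu])
			min_cpu = cpu;
	}
	for (node = 1; node < NR_NODES; node++)
		if (node_load[node] < node_load[min_node])
			min_node = node;

	printf("least loaded cpu:  cpu%d (node %d, node load %d)\n",
	       min_cpu, min_cpu / CPUS_PER_NODE,
	       node_load[min_cpu / CPUS_PER_NODE]);
	printf("least loaded node: node %d (load %d)\n",
	       min_node, node_load[min_node]);
	return 0;
}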
From: Erich F. <ef...@es...> - 2002-10-28 17:38:56
On Monday 28 October 2002 01:46, Martin J. Bligh wrote:
> 2.5.44-mm4            Virgin
> 2.5.44-mm4-focht-1    Focht main
> 2.5.44-mm4-hbaum-1    Hbaum main
> 2.5.44-mm4-focht-12   Focht main + Focht balance_exec
> 2.5.44-mm4-hbaum-12   Hbaum main + Hbaum balance_exec
> 2.5.44-mm4-f1-h2      Focht main + Hbaum balance_exec
>
> Schedbench 4:
>                         Elapsed   TotalUser    TotalSys     AvgUser
> 2.5.44-mm4                32.45       49.47      129.86        0.82
> 2.5.44-mm4-focht-1        38.61       45.15      154.48        1.06
> 2.5.44-mm4-hbaum-1        37.81       46.44      151.26        0.78
> 2.5.44-mm4-focht-12       23.23       38.87       92.99        0.85
> 2.5.44-mm4-hbaum-12       22.26       34.70       89.09        0.70
> 2.5.44-mm4-f1-h2          21.39       35.97       85.57        0.81

One more remark: you seem to have made the numa_test shorter. That
reduces it to being simply a check for the initial load balancing,
as the hackbench running in the background (and aimed at disturbing
the initial load balancing) might start too late. You will most
probably not see the impact of node affinity with such short running
tests. But we weren't talking about node affinity, yet...

Erich
From: Martin J. B. <mb...@ar...> - 2002-10-28 17:41:50
>> Schedbench 4:
>>                         Elapsed   TotalUser    TotalSys     AvgUser
>> 2.5.44-mm4                32.45       49.47      129.86        0.82
>> 2.5.44-mm4-focht-1        38.61       45.15      154.48        1.06
>> 2.5.44-mm4-hbaum-1        37.81       46.44      151.26        0.78
>> 2.5.44-mm4-focht-12       23.23       38.87       92.99        0.85
>> 2.5.44-mm4-hbaum-12       22.26       34.70       89.09        0.70
>> 2.5.44-mm4-f1-h2          21.39       35.97       85.57        0.81
>
> One more remark: you seem to have made the numa_test shorter. That
> reduces it to being simply a check for the initial load balancing,
> as the hackbench running in the background (and aimed at disturbing
> the initial load balancing) might start too late. You will most
> probably not see the impact of node affinity with such short running
> tests. But we weren't talking about node affinity, yet...

I didn't modify what you sent me at all ... perhaps my machine is
just faster than yours?

/me ducks & runs ;-)

M.
From: Erich F. <ef...@es...> - 2002-10-28 23:49:21
On Monday 28 October 2002 18:36, Martin J. Bligh wrote:
>>> Schedbench 4:
>>> [results table snipped]
>>
>> One more remark: you seem to have made the numa_test shorter. That
>> reduces it to being simply a check for the initial load balancing,
>> as the hackbench running in the background (and aimed at disturbing
>> the initial load balancing) might start too late. You will most
>> probably not see the impact of node affinity with such short running
>> tests. But we weren't talking about node affinity, yet...
>
> I didn't modify what you sent me at all ... perhaps my machine is
> just faster than yours?
>
> /me ducks & runs ;-)

:-)))

I tried with IA32, too ;-) With PROBLEMSIZE=1000000 I get on a 2.8GHz
XEON something around 16s. On a 1.6GHz Athlon it's 22s. Both times
running ./numa_test 2 on a dual CPU box. The usertime is pretty
independent of the OS (but the scheduling influences it a lot).

But: you have a node level cache! Maybe the whole memory is inside
that one and then things can go really fast. Hmmm, I guess I'll need
some cache detection in the future to enforce that the BM really runs
in memory... Increasing PROBLEMSIZE might help, but we can do that
later, when testing affinity (I'm not giving up on this idea... ;-)

Regards,
Erich
From: Martin J. B. <mb...@ar...> - 2002-10-29 00:06:35
>> I didn't modify what you sent me at all ... perhaps my machine is
>> just faster than yours?
>>
>> /me ducks & runs ;-)
>
> :-)))
>
> I tried with IA32, too ;-) With PROBLEMSIZE=1000000 I get on a 2.8GHz
> XEON something around 16s. On a 1.6GHz Athlon it's 22s. Both times
> running ./numa_test 2 on a dual CPU box. The usertime is pretty
> independent of the OS (but the scheduling influences it a lot).

I have 700MHz P3 Xeons, but I have 2Mb L2 cache on them, which is
much better than the newer chips. That might make a big difference.

> But: you have a node level cache! Maybe the whole memory is inside
> that one and then things can go really fast. Hmmm, I guess I'll need
> some cache detection in the future to enforce that the BM really runs
> in memory... Increasing PROBLEMSIZE might help, but we can do that
> later, when testing affinity (I'm not giving up on this idea... ;-)

Yup, 32Mb cache. Not sure if it's faster than local memory or not.

M.
From: Erich F. <ef...@es...> - 2002-10-29 22:40:09
On Monday 28 October 2002 18:36, Martin J. Bligh wrote:
>>> Schedbench 4:
>>>                         Elapsed   TotalUser    TotalSys     AvgUser
>>> 2.5.44-mm4                32.45       49.47      129.86        0.82
>>> 2.5.44-mm4-focht-1        38.61       45.15      154.48        1.06
>>> 2.5.44-mm4-hbaum-1        37.81       46.44      151.26        0.78
>>> 2.5.44-mm4-focht-12       23.23       38.87       92.99        0.85
>>> 2.5.44-mm4-hbaum-12       22.26       34.70       89.09        0.70
>>> 2.5.44-mm4-f1-h2          21.39       35.97       85.57        0.81
>>
>> One more remark: you seem to have made the numa_test shorter. That
>> reduces it to being simply a check for the initial load balancing,
>> as the hackbench running in the background (and aimed at disturbing
>> the initial load balancing) might start too late. You will most
>> probably not see the impact of node affinity with such short running
>> tests. But we weren't talking about node affinity, yet...
>
> I didn't modify what you sent me at all ... perhaps my machine is
> just faster than yours?
>
> /me ducks & runs ;-)

Aaargh, now I understand!!! You just have wrong labels in your table,
they are permuted! This makes more sense:

>>>                         AvgUser     Elapsed   TotalUser    TotalSys
>>> 2.5.44-mm4                32.45       49.47      129.86        0.82
>>> 2.5.44-mm4-focht-1        38.61       45.15      154.48        1.06
>>> 2.5.44-mm4-hbaum-1        37.81       46.44      151.26        0.78
>>> 2.5.44-mm4-focht-12       23.23       38.87       92.99        0.85
>>> 2.5.44-mm4-hbaum-12       22.26       34.70       89.09        0.70
>>> 2.5.44-mm4-f1-h2          21.39       35.97       85.57        0.81

Regards,
Erich