From: Erich F. <ef...@es...> - 2002-07-22 14:00:26
|
There's a new version of the node affine NUMA scheduler extension based on the O(1) scheduler at http://home.arcor.de/efocht/sched/Nod18_2.4.18-ia64-O1ef7.patch

The patch is for 2.4.18 kernels and it has been tested on IA64 systems. It requires the O(1) scheduler patch with the corrected complex macros which I posted to the LSE and linux-ia64 mailing lists last week. For IA64 you should use: http://home.arcor.de/efocht/sched/O1_ia64-ef7-2.4.18.patch.bz2 which should be applied to 2.4.18 + the ia64-020622 patch. For IA32 (NUMA-Q) try instead: http://home.arcor.de/efocht/sched/O1_i386-ef7-2.4.18.patch.bz2

What is it good for?
- Extends the scheduler to NUMA.
- Each task gets a homenode assigned at start (initial load balancing).
- A memory affinity patch (like discontigmem, or similar) should take care that the memory of the task is allocated mainly from its homenode.
- The scheduler attracts tasks to their homenodes while trying to keep the nodes equally balanced.
- Target: keep processes and their memory on the same node to reduce memory access latencies without having to fiddle with the cpus_allowed masks (hard affinities).
- Within one node, it behaves like the normal O(1) scheduler.

For an overview of the features have a look at: http://home.arcor.de/efocht/sched

There are several changes compared to the previous version; the most important ones are:
- Extension to multilevel NUMA hierarchies by implementing delays when stealing tasks from remote nodes.
- Better selection of the task to be stolen from the busiest runqueue, taking into account cache coolness and the node and supernode of task and runqueue.

Comments and feedback are very welcome.

Regards,
Erich |
From: Erich F. <ef...@es...> - 2002-09-21 10:00:51
Attachments:
01-numa_sched_core-2.5.patch
|
Hi, here is an update of the NUMA scheduler for the 2.5.37 kernel. It contains some bugfixes and the coupling to discontigmem memory allocation (memory is allocated from the process's homenode). The node affine NUMA scheduler is targeted at multi-node platforms and built on top of the O(1) scheduler. Its main objective is to keep the memory access latency for each task as low as possible by scheduling it on or near the node on which its memory is allocated. This should achieve the benefits of hard affinity automatically.

The patch comes in two parts. The first part is the core NUMA scheduler; it is functional without the second part and provides the following features:
- Node aware scheduler (implemented CPU pools).
- Scheduler behaves like the O(1) scheduler within a node.
- Equal load among nodes is targeted; stealing tasks from remote nodes is delayed more if the current node is averagely loaded, less if it's unloaded.
- Multi-level node hierarchies are supported by stealing delays adjusted by the relative "node-distance" (memory access latency ratio).

The second part of the patch extends the pooling NUMA scheduler to have node affine tasks:
- Each process has a homenode assigned to it at creation time (initial load balancing). Memory will be allocated from this node.
- Each process is preferentially scheduled on its homenode and attracted back to it if scheduled away for some reason. For multi-level node hierarchies the task is attracted to its supernode, too.

The patch was tested on IA64 platforms but should work on NUMAQ i386, too. Similar code for 2.4.18 (cf. http://home.arcor.de/efocht/sched) has been running in production environments for months.

Comments, tests, ports to other platforms/architectures are very welcome!

Regards,
Erich |
From: Erich F. <ef...@es...> - 2002-09-21 10:03:10
Attachments:
02-numa_sched_affine-2.5.patch
|
Here comes the second part of the node affine NUMA scheduler.

> The patch comes in two parts. The first part is the core NUMA scheduler,
> it is functional without the second part and provides following features:
> - Node aware scheduler (implemented CPU pools).
> - Scheduler behaves like O(1) scheduler within a node.
> - Equal load among nodes is targeted, stealing tasks from remote nodes
>   is delayed more if the current node is averagely loaded, less if it's
>   unloaded.
> - Multi-level node hierarchies are supported by stealing delays adjusted
>   by the relative "node-distance" (memory access latency ratio).
>
> The second part of the patch extends the pooling NUMA scheduler to
> have node affine tasks:
> - Each process has a homenode assigned to it at creation time
>   (initial load balancing). Memory will be allocated from this node.
> - Each process is preferentially scheduled on its homenode and
>   attracted back to it if scheduled away for some reason. For
>   multi-level node hierarchies the task is attracted to its
>   supernode, too.

Regards,
Erich |
From: Martin J. B. <mb...@ar...> - 2002-09-21 15:57:41
|
> The second part of the patch extends the pooling NUMA scheduler to
> have node affine tasks:
> - Each process has a homenode assigned to it at creation time
>   (initial load balancing). Memory will be allocated from this node.

Hmmm ... I was wondering how you achieved that without modifying alloc_pages ... until I saw this bit.

 #ifdef CONFIG_NUMA
+#ifdef CONFIG_NUMA_SCHED
+#define numa_node_id() (current->node)
+#else
 #define numa_node_id() _cpu_to_node(smp_processor_id())
+#endif
 #endif /* CONFIG_NUMA */

I'm not convinced it's a good idea to modify this generic function, which was meant to tell you what node you're running on. I can't see it being used anywhere else right now, but wouldn't it be better to just modify alloc_pages instead to use current->node, and leave this macro as intended? Or make a process_node_id or something?

Anyway, I'm giving your code a quick spin ... will give you some results later ;-)

M. |
From: Martin J. B. <mb...@ar...> - 2002-09-21 16:34:37
|
> Anyway, I'm giving your code a quick spin ... will give you some > results later ;-) Hmmm .... well I ran the One True Benchmark (tm). The patch *increased* my kernel compile time from about 20s to about 28s. Not sure I like that idea ;-) Anything you'd like tweaked, or more info? Both user and system time were up ... I'll grab a profile of kernel stuff. M. |
From: Martin J. B. <mb...@ar...> - 2002-09-21 16:48:22
|
>> Anyway, I'm giving your code a quick spin ... will give you some
>> results later ;-)
>
> Hmmm .... well I ran the One True Benchmark (tm). The patch
> *increased* my kernel compile time from about 20s to about 28s.
> Not sure I like that idea ;-) Anything you'd like tweaked, or
> more info? Both user and system time were up ... I'll grab a
> profile of kernel stuff.

From the below, I'd suggest you're getting pages off the wrong nodes: do_anonymous_page is page zeroing, and rmqueue the buddy allocator. Are you sure the current->node thing is getting set correctly? I'll try backing out your alloc_pages tweaking, and see what happens.

An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)

133639 total 0.1390
74447 default_idle
6887 do_anonymous_page
4456 page_remove_rmap
4082 handle_mm_fault
3733 .text.lock.namei
2512 page_add_rmap
2187 __generic_copy_from_user
1989 rmqueue
1964 .text.lock.dec_and_lock
1761 vm_enough_memory
1631 file_read_actor
1599 zap_pte_range
1507 __free_pages_ok
1323 find_get_page
1117 do_no_page
1097 get_empty_filp
1023 link_path_walk

2.5.37-mm1

256745 total 0.2584
82934 default_idle
38978 do_anonymous_page
36533 rmqueue
35099 __free_pages_ok
5551 page_remove_rmap
4694 handle_mm_fault
3166 page_add_rmap
2904 do_no_page
2674 .text.lock.namei
2566 __alloc_pages
2526 zap_pte_range
2306 __generic_copy_from_user
2218 file_read_actor
1803 vm_enough_memory
1789 do_wp_page
1557 .text.lock.dec_and_lock
1414 find_get_page
1251 do_softirq
1123 release_pages
1086 link_path_walk
1072 get_empty_filp
1038 schedule |
From: Martin J. B. <mb...@ar...> - 2002-09-21 17:13:48
|
> From the below, I'd suggest you're getting pages off the wrong
> nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> allocator. Are you sure the current->node thing is getting set
> correctly? I'll try backing out your alloc_pages tweaking, and
> see what happens.

OK, well removing that part of the patch gets us back from 28s to about 21s (compared to 20s virgin); total system time compared to virgin is up from 59s to 62s, user from 191 to 195. So it's still a net loss, but not by nearly as much. Are you determining the target node on fork or exec? I forget ...

Profile is more comparable. Nothing sticks out any more, but maybe it just needs some tuning for balance intervals or something.

153385 total 0.1544
91219 default_idle
7475 do_anonymous_page
4564 page_remove_rmap
4167 handle_mm_fault
3467 .text.lock.namei
2520 page_add_rmap
2112 rmqueue
1905 .text.lock.dec_and_lock
1849 zap_pte_range
1668 vm_enough_memory
1612 __free_pages_ok
1504 file_read_actor
1484 find_get_page
1381 __generic_copy_from_user
1207 do_no_page
1066 schedule
1050 get_empty_filp
1034 link_path_walk |
From: Erich F. <ef...@es...> - 2002-09-21 17:34:02
|
Hi Martin, thanks for the comments and the testing!

On Saturday 21 September 2002 19:11, Martin J. Bligh wrote:
> From the below, I'd suggest you're getting pages off the wrong
> nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> allocator. Are you sure the current->node thing is getting set
> correctly? I'll try backing out your alloc_pages tweaking, and
> see what happens.

The current->node is most probably wrong for most of the kernel threads, except for migration_thread and ksoftirqd. But it should be fine for user processes. Might also be that the __node_distance matrix which you might use by default is not optimal for NUMAQ. It is fine for our remote/local latency ratio of 1.6. Yours is maybe an order of magnitude larger? Try replacing 15 -> 50; I guess you don't go beyond 4 nodes now...

> OK, well removing that part of the patch gets us back from 28s to
> about 21s (compared to 20s virgin), total user time compared to
> virgin is up from 59s to 62s, user from 191 to 195. So it's still
> a net loss, but not by nearly as much. Are you determining target
> node on fork or exec ? I forget ...

The default is exec(). You can use http://home.arcor.de/efocht/sched/nodpol.c to set the node_policy to do initial load balancing in fork(). Just do "nodpol -P 2" in the shell before starting the job/task. This is VERY recommended if you are creating many tasks/threads. The default behavior is fine for MPI jobs or users starting serial processes.

> Profile is more comparible. Nothing sticks out any more, but maybe
> it just needs some tuning for balance intervals or something.

Hmmm... There are two changes which might lead to lower performance:
1. load_balance() is not inlined any more.
2. pull_task() steals only one task per load_balance() call. It was maximally imbalance/2 (if I remember correctly).

And of course, there is some real additional overhead due to the initial load balancing, which one feels for short-lived tasks...

So please try "nodpol -P 2" (and reset to the default with "nodpol -P 0"). Did you try the first patch alone? I mean the pooling-only scheduler?

Thanks,
Erich

> 153385 total 0.1544
> 91219 default_idle
> 7475 do_anonymous_page
> 4564 page_remove_rmap
> 4167 handle_mm_fault
> 3467 .text.lock.namei
> 2520 page_add_rmap
> 2112 rmqueue
> 1905 .text.lock.dec_and_lock
> 1849 zap_pte_range
> 1668 vm_enough_memory
> 1612 __free_pages_ok
> 1504 file_read_actor
> 1484 find_get_page
> 1381 __generic_copy_from_user
> 1207 do_no_page
> 1066 schedule
> 1050 get_empty_filp
> 1034 link_path_walk

--
Dr. Erich Focht <ef...@es...>
NEC European Supercomputer Systems, European HPC Technology Center
Hessbruehlstr. 21B, 70565 Stuttgart, Germany
phone: +49-711-78055-15 fax: +49-711-78055-25 |
From: William L. I. I. <wl...@ho...> - 2002-09-21 17:44:56
|
On Sat, Sep 21, 2002 at 07:32:59PM +0200, Erich Focht wrote:
> Might also be that the __node_distance matrix which you might use
> by default is not optimal for NUMAQ. It is fine for our remote/local
> latency ratio of 1.6. Yours is maybe an order of magnitude larger?
> Try replacing: 15 -> 50, guess you don't go beyond 4 nodes now...

I'm running with 8 over the weekend, and by and large we go to 16, though we rarely put all our eggs in one basket. I'll take it for a spin.

Cheers,
Bill |
From: Erich F. <ef...@es...> - 2002-09-22 10:36:18
Attachments:
proc_sched_hist_2.5.37.patch
|
On Saturday 21 September 2002 18:46, Martin J. Bligh wrote:
> > Hmmm .... well I ran the One True Benchmark (tm). The patch
> > *increased* my kernel compile time from about 20s to about 28s.
> > Not sure I like that idea ;-) Anything you'd like tweaked, or
> > more info? Both user and system time were up ... I'll grab a
> > profile of kernel stuff.
>
> From the below, I'd suggest you're getting pages off the wrong
> nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> allocator. Are you sure the current->node thing is getting set
> correctly? I'll try backing out your alloc_pages tweaking, and
> see what happens.

Could you please check in dmesg whether the CPU pools are initialised correctly? Maybe something goes wrong for your platform. The node_distance is most probably non-optimal for NUMAQ; that might need some tuning. The default is set for a maximum of 8 nodes, nodes 1-4 and 5-8 being in separate supernodes, with the latency ratios 1:1.5:2.

You could use the attached patch for getting an idea about the load distribution. It's a quick&dirty hack which creates files called:
/proc/sched/load/rqNN : load of RQs, including info on tasks not running on their homenode
/proc/sched/history/ilbNN : history of last 25 initial load balancing decisions for runqueue NN
/proc/sched/history/lbNN : last 25 load balancing decisions on rq NN

It should be possible to find the reason for the poor performance by looking at the nr_homenode entries in /proc/sched/load/rqNN.

Thanks, best regards,
Erich |
From: William L. I. I. <wl...@ho...> - 2002-09-21 23:25:07
|
On Sat, Sep 21, 2002 at 09:46:05AM -0700, Martin J. Bligh wrote:
> An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)

Some 8-quad numbers for 2.5.37 (virgin) follow. I'll get dcache_rcu and NUMA sched stuff in on the act for round 2. This is more fs transaction-based (and VM process-spawning overhead) and not pure I/O throughput, so there won't be things like "but everybody's running at peak I/O bandwidth" obscuring the issues.

real 0m30.854s

c01053ec 16963617 89.009 poll_idle
c0114a48 452526 2.37443 load_balance
c013962c 253666 1.331 get_page_state
c01466de 177354 0.930586 .text.lock.file_table
c0114ec0 150583 0.790117 scheduler_tick
c01547b3 116144 0.609414 .text.lock.namei
c01422ac 94042 0.493444 page_remove_rmap
c0138c24 83293 0.437043 rmqueue
c0141e48 51963 0.272653 page_add_rmap
c012d5ec 37086 0.194592 do_anonymous_page
c01391d0 34454 0.180782 __alloc_pages
c0130e08 32111 0.168488 find_get_page
c0139534 29222 0.153329 nr_free_pages
c012df4c 26734 0.140275 handle_mm_fault
c0146070 25881 0.135799 get_empty_filp
c01a0d2c 25250 0.132488 __generic_copy_from_user
c0111728 22018 0.11553 smp_apic_timer_interrupt
c0112b90 20777 0.109018 pfn_to_nid
c0136238 18410 0.0965983 kmem_cache_free
c0113270 17494 0.091792 do_page_fault
c0138900 17282 0.0906796 __free_pages_ok
c01a0f90 17160 0.0900395 atomic_dec_and_lock
c01a100b 16937 0.0888694 .text.lock.dec_and_lock |
From: William L. I. I. <wl...@ho...> - 2002-09-22 08:16:54
|
On Sat, Sep 21, 2002 at 09:46:05AM -0700, Martin J. Bligh wrote:
>> An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)

On Sat, Sep 21, 2002 at 04:18:10PM -0700, William Lee Irwin III wrote:
> Some 8-quad numbers for 2.5.37 (virgin) follow.

Okay, 2.5.37 virgin with overcommit_memory set to 1 this time. (Compiles with -j256 seem to do better than -j32 or -j48; here is -j256.)

... will follow up with 2.5.38-mm1 with and without NUMA sched, at least if the arrival rate of releases doesn't exceed the benchtime.

c01053ec 1605553 95.6785 poll_idle
c0114a48 7611 0.453556 load_balance
c0114ec0 5303 0.316018 scheduler_tick
c01422ac 5017 0.298974 page_remove_rmap
c01466de 4026 0.239918 .text.lock.file_table
c012d5ec 3290 0.196058 do_anonymous_page
c0141e48 3211 0.191351 page_add_rmap
c012df4c 2920 0.174009 handle_mm_fault
c010d6e8 2844 0.16948 timer_interrupt
c01547b3 2080 0.123952 .text.lock.namei
c0146070 1934 0.115251 get_empty_filp
c01a0d2c 1591 0.0948112 __generic_copy_from_user
c0111728 1477 0.0880177 smp_apic_timer_interrupt
c014633c 1437 0.085634 __fput
c01a0ce0 1346 0.0802111 __generic_copy_to_user
c0138c24 1249 0.0744307 rmqueue
c012e450 1169 0.0696633 vm_enough_memory
c012b694 1056 0.0629294 zap_pte_range
c0130e08 997 0.0594135 find_get_page
c014427c 950 0.0566126 dentry_open
c01391d0 949 0.056553 __alloc_pages
c0151424 892 0.0531563 link_path_walk
c01a0f90 834 0.0496999 atomic_dec_and_lock |
From: Erich F. <ef...@es...> - 2002-09-22 08:31:44
|
Bill,

would you please check the boot messages for the NUMA scheduler before doing the run. Martin sent me an example where he has:

CPU pools : 1
pool 0 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node level 0 : 10
pool_delay matrix:
129

which is clearly wrong. In that case we need to fix the cpu-pools setup first.

Regards,
Erich

On Sunday 22 September 2002 10:09, William Lee Irwin III wrote:
> On Sat, Sep 21, 2002 at 09:46:05AM -0700, Martin J. Bligh wrote:
> >> An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)
>
> On Sat, Sep 21, 2002 at 04:18:10PM -0700, William Lee Irwin III wrote:
> > Some 8-quad numbers for 2.5.37 (virgin) follow.
>
> Okay, 2.5.37 virgin with overcommit_memory set to 1 this time.
> (compiles with -j256 seem to do better than -j32 or -j48, here is -j256):
>
> ... will follow up with 2.5.38-mm1 with and without NUMA sched, at
> least if the arrival rate of releases doesn't exceed the benchtime. |
From: Martin J. B. <mb...@ar...> - 2002-09-22 17:13:50
|
> would you please check the boot messages for the NUMA scheduler before
> doing the run. Martin sent me an example where he has:
>
> CPU pools : 1
> pool 0 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node level 0 : 10
> pool_delay matrix:
> 129
>
> which is clearly wrong. In that case we need to fix the cpu-pools setup
> first.

OK, well I hacked this for now, somewhere in sched.c:

- lnode_number[i] = pnode_to_lnode[SAPICID_TO_PNODE(cpu_physical_id(i))];
+ lnode_number[i] = i/4;

Which makes the pools work properly. I think you should just use the cpu_to_node macro here, but the hack will allow us to do some testing.

Results, averaged over 5 kernel compiles:

Before: Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
After: Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%

So you actually take a little less horsepower to do the work, but don't utilize the CPUs quite as well, so elapsed time is higher. I seem to recall getting better results from Mike's quick hack though ... that was a long time back. What were the balancing issues you mentioned?

M. |
From: Niels C. <nc...@ej...> - 2002-09-22 18:24:48
|
Actually, the horsepower used is virtually the same:

20.82 x 1206.4 = 25,117
21.918 x 1137.4 = 25.929

-nc-

> Results, averaged over 5 kernel compiles:
>
> Before: Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
> After: Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
>
> So you actually take a little less horsepower to do the work, but
> don't utilize the CPUs quite as well, so elapsed time is higher.
> I seem to recall getting better results from Mike's quick hack
> though ... that was a long time back. What were the balancing
> issues you mentioned?
>
> M. |
From: Niels C. <nc...@ej...> - 2002-09-22 18:43:10
|
Oops, typo. Should have been:

Actually, the horsepower used is virtually the same:

20.82 x 1206.4 = 25,117
21.918 x 1137.4 = 24,929

-nc-

> Results, averaged over 5 kernel compiles:
>
> Before: Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
> After: Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
>
> So you actually take a little less horsepower to do the work, but
> don't utilize the CPUs quite as well, so elapsed time is higher.
> I seem to recall getting better results from Mike's quick hack
> though ... that was a long time back. What were the balancing
> issues you mentioned?
>
> M. |
From: Martin J. B. <mb...@ar...> - 2002-09-22 19:22:00
|
I tried putting back the current->node logic now that we have the correct node IDs, but it made things worse (not as bad as before, but ...). Looks like we're still allocing off the wrong node. This run is the last one in the list.

Virgin:
Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
7059 do_anonymous_page
4459 page_remove_rmap
3863 handle_mm_fault
3695 .text.lock.namei
2912 page_add_rmap
2458 rmqueue
2119 vm_enough_memory

Both numasched patches, just compile fixes:
Elapsed: 28.744s User: 204.62s System: 173.708s CPU: 1315.8%
38978 do_anonymous_page
36533 rmqueue
35099 __free_pages_ok
5551 page_remove_rmap
4694 handle_mm_fault
3166 page_add_rmap

Both numasched patches, alloc from local node:
Elapsed: 21.094s User: 195.808s System: 62.41s CPU: 1224.4%
7475 do_anonymous_page
4564 page_remove_rmap
4167 handle_mm_fault
3467 .text.lock.namei
2520 page_add_rmap
2112 rmqueue
1905 .text.lock.dec_and_lock
1849 zap_pte_range
1668 vm_enough_memory

Both numasched patches, hack node IDs, alloc from local node:
Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
5793 do_anonymous_page
4475 page_remove_rmap
4281 handle_mm_fault
3820 .text.lock.namei
2625 page_add_rmap
2028 .text.lock.dec_and_lock
1748 vm_enough_memory
1713 file_read_actor
1672 rmqueue

Both numasched patches, hack node IDs, alloc from current->node:
Elapsed: 24.414s User: 194.86s System: 98.606s CPU: 1201.6%
30317 do_anonymous_page
6962 rmqueue
5190 page_remove_rmap
4773 handle_mm_fault
3522 .text.lock.namei
3161 page_add_rmap |
From: Erich F. <ef...@es...> - 2002-09-22 22:00:23
Attachments:
nodpol.c
|
On Sunday 22 September 2002 21:20, Martin J. Bligh wrote:
> I tried putting back the current->node logic now that we have
> the correct node IDs, but it made things worse (not as bad as
> before, but ... looks like we're still allocing off the wrong
> node.

Thanks a lot for the testing! It looks like something is still wrong. NUMAQ suffers a lot more from hops across the nodes than the Azusa, therefore I expect it is more sensitive to initial load balancing errors.

The easiest thing to try is "nodpol -P 2" in the shell before running the test. This changes the initial load balancing policy from exec() to fork() ("nodpol -P 0" gets you back to the default).

A bit more difficult is tuning the scheduler parameters, which can be done pretty simply by changing the __node_distance matrix. A first attempt could be: 10 on the diagonal, 100 off-diagonal. This leads to larger delays when stealing from a remote node.

Anyhow, it would be good to understand what is going on, and maybe simpler tests than a kernel compile can reveal something. Or looking at the /proc/sched/load/rqNN files (you need the patch I posted a few mails ago). I'll modify alloc_pages not to take into account the kernel threads in the meantime.

Regards,
Erich

> This run is the last one in the list.
>
> Virgin:
> Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
> 7059 do_anonymous_page
> 4459 page_remove_rmap
> 3863 handle_mm_fault
> 3695 .text.lock.namei
> 2912 page_add_rmap
> 2458 rmqueue
> 2119 vm_enough_memory
>
> Both numasched patches, just compile fixes:
> Elapsed: 28.744s User: 204.62s System: 173.708s CPU: 1315.8%
> 38978 do_anonymous_page
> 36533 rmqueue
> 35099 __free_pages_ok
> 5551 page_remove_rmap
> 4694 handle_mm_fault
> 3166 page_add_rmap
>
> Both numasched patches, alloc from local node:
> Elapsed: 21.094s User: 195.808s System: 62.41s CPU: 1224.4%
> 7475 do_anonymous_page
> 4564 page_remove_rmap
> 4167 handle_mm_fault
> 3467 .text.lock.namei
> 2520 page_add_rmap
> 2112 rmqueue
> 1905 .text.lock.dec_and_lock
> 1849 zap_pte_range
> 1668 vm_enough_memory
>
> Both numasched patches, hack node IDs, alloc from local node:
> Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
> 5793 do_anonymous_page
> 4475 page_remove_rmap
> 4281 handle_mm_fault
> 3820 .text.lock.namei
> 2625 page_add_rmap
> 2028 .text.lock.dec_and_lock
> 1748 vm_enough_memory
> 1713 file_read_actor
> 1672 rmqueue
>
> Both numasched patches, hack node IDs, alloc from current->node:
> Elapsed: 24.414s User: 194.86s System: 98.606s CPU: 1201.6%
> 30317 do_anonymous_page
> 6962 rmqueue
> 5190 page_remove_rmap
> 4773 handle_mm_fault
> 3522 .text.lock.namei
> 3161 page_add_rmap |
From: William L. I. I. <wl...@ho...> - 2002-09-22 22:44:25
|
On Sun, Sep 22, 2002 at 11:59:16PM +0200, Erich Focht wrote: > A bit more difficult is tuning the scheduler parameters which can > be done pretty simply by changing the __node_distance matrix. A first > attempt could be: 10 on the diagonal, 100 off-diagonal. This leads to > larger delays when stealing from a remote node. This is not entirely reflective of our architecture. Node-to-node latencies vary as well. Some notion of whether communication must cross a lash at the very least should be present. Cheers, Bill |
From: Martin J. B. <mb...@ar...> - 2002-09-22 22:52:45
|
> On Sun, Sep 22, 2002 at 11:59:16PM +0200, Erich Focht wrote:
>> A bit more difficult is tuning the scheduler parameters which can
>> be done pretty simply by changing the __node_distance matrix. A first
>> attempt could be: 10 on the diagonal, 100 off-diagonal. This leads to
>> larger delays when stealing from a remote node.
>
> This is not entirely reflective of our architecture. Node-to-node
> latencies vary as well. Some notion of whether communication must cross
> a lash at the very least should be present.

Ummm ... I think it's just flat on or off node; presumably Erich has "on the diagonal" meaning they're on the same node, and "off-diagonal" meaning they're not. In which case, what he suggested seems fine ... it's really about a 20:1 ratio so I might use 10 and 200, but other than that, it seems correct to me.

M. |
From: Erich F. <ef...@es...> - 2002-09-23 18:21:07
Attachments:
rand_updt
|
Here is a simple benchmark which is NUMA sensitive and simulates a simple but normal situation in an environment running number crunching jobs. It starts N independent tasks which access a large array in a random manner. This is both bandwidth and latency sensitive. The output shows on which node(s) the tasks have spent their lives. Additionally it shows (on a NUMA scheduler kernel) the homenode (iSched). Could you please run it on the virgin kernel and on the "Both numasched patches, hack node IDs, alloc from current->node" one? Maybe we see what's wrong... Thanks, Erich |
From: Erich F. <ef...@es...> - 2002-09-22 10:46:30
|
On Saturday 21 September 2002 17:55, Martin J. Bligh wrote:
> > - Each process has a homenode assigned to it at creation time
> >   (initial load balancing). Memory will be allocated from this node.
...
> #ifdef CONFIG_NUMA
> +#ifdef CONFIG_NUMA_SCHED
> +#define numa_node_id() (current->node)
> +#else
> #define numa_node_id() _cpu_to_node(smp_processor_id())
> +#endif
> #endif /* CONFIG_NUMA */
>
> I'm not convinced it's a good idea to modify this generic function,
> which was meant to tell you what node you're running on. I can't
> see it being used anywhere else right now, but wouldn't it be better
> to just modify alloc_pages instead to use current->node, and leave
> this macro as intended? Or make a process_node_id or something?

OK, I see your point and I agree that numa_node_id() should be similar to smp_processor_id(). I'll change alloc_pages instead.

Do you think it makes sense to get memory from the homenode only for user processes? Many kernel threads currently have the wrong homenode, and for some of them it's unclear which homenode they should have...

There is an alternative idea (we discussed this at OLS with Andrea, maybe you remember): allocate memory from the current node and keep statistics on where it is allocated. Determine the homenode from this (from time to time) and schedule accordingly. This eliminates the initial load balancing and leaves it all to the scheduler, but has the drawback that memory can be somewhat scattered across the nodes. Any comments?

Regards,
Erich |
From: Martin J. B. <mb...@ar...> - 2002-09-22 14:58:55
|
> OK, I see your point and I agree that numa_node_id() should be similar to
> smp_processor_id(). I'll change the alloc_pages instead.
>
> Do you think it makes sense to get memory from the homenode only for
> user processes? Many kernel threads have currently the wrong homenode,
> for some of them it's unclear which homenode they should have...

Well yes ... if you can keep things pretty much on their home nodes. That means some sort of algorithm for updating it, which may be fairly complex (and doesn't currently seem to work, but maybe that's just because I only have 1 pool).

> There is an alternative idea (we discussed this at OLS with Andrea, maybe
> you remember): allocate memory from the current node and keep statistics
> on where it is allocated. Determine the homenode from this (from time to
> time) and schedule accordingly. This eliminates the initial load balancing
> and leaves it all to the scheduler, but has the drawback that memory can
> be somewhat scattered across the nodes. Any comments?

Well, that's a lot simpler. Things should end up running on their home node, and thus will allocate pages from their home node, so it should be self-re-enforcing. The algorithm for the home node is then implicitly worked out from the scheduler itself and its actions, so it's one less set of stuff to write. Would suggest we do this at first, to keep things as simple as possible so you have something mergeable.

M. |
From: Erich F. <ef...@es...> - 2002-09-23 18:39:12
Attachments:
node_affine_dyn-2.5.37.patch
|
On Sunday 22 September 2002 16:57, Martin J. Bligh wrote:
> > There is an alternative idea (we discussed this at OLS with Andrea, maybe
> > you remember): allocate memory from the current node and keep statistics
> > on where it is allocated. Determine the homenode from this (from time to
> > time) and schedule accordingly. This eliminates the initial load
> > balancing and leaves it all to the scheduler, but has the drawback that
> > memory can be somewhat scattered across the nodes. Any comments?
>
> Well, that's a lot simpler. Things should end up running on their home
> node, and thus will allocate pages from their home node, so it should
> be self-re-enforcing. The algorithm for the home node is then implicitly
> worked out from the scheduler itself, and its actions, so it's one less
> set of stuff to write. Would suggest we do this at first, to keep things
> as simple as possible so you have something mergeable.

OK, sounds encouraging. So here is my first attempt (attached). You'll have to apply it on top of the two NUMA scheduler patches and hack SAPICID_TO_PNODE again.

The patch adds a node_mem[NR_NODES] array to each task. When allocating memory (in rmqueue) and freeing it (in free_pages_ok) the number of pages is added to/subtracted from that array, and the homenode is set to the node having the largest entry. Is there a better place to put this (other than rmqueue/free_pages_ok)?

I have two problems with this approach:
1. Freeing memory is quite expensive, as it currently involves finding the maximum of the array node_mem[].
2. I have no idea how tasks sharing the mm structure will behave. I'd like them to run on different nodes (that's why node_mem is not in mm), but they could (legally) free pages which they did not allocate and end up with wrong values in node_mem[].

Comments?

Regards,
Erich |
From: Martin J. B. <mb...@ar...> - 2002-09-23 18:49:13
|
> OK, sounds encouraging. So here is my first attempt (attached). You'll
> have to apply it on top of the two NUMA scheduler patches and hack
> SAPICID_TO_PNODE again.
>
> The patch adds a node_mem[NR_NODES] array to each task. When allocating
> memory (in rmqueue) and freeing it (in free_pages_ok) the number of
> pages is added/subtracted from that array and the homenode is set to
> the node having the largest entry. Is there a better place where to put
> this in (other than rmqueue/free_pages_ok)?
>
> I have two problems with this approach:
> 1: Freeing memory is quite expensive, as it currently involves finding the
>    maximum of the array node_mem[].

Bleh ... why? This needs to be calculated much more lazily than this, or you're going to kick the hell out of any cache affinity. Can you recalc this in the rebalance code or something instead?

> 2: I have no idea how tasks sharing the mm structure will behave. I'd
>    like them to run on different nodes (that's why node_mem is not in mm),
>    but they could (legally) free pages which they did not allocate and
>    have wrong values in node_mem[].

Yes, that really ought to be per-process, not per task. Which means locking or atomics ... and overhead. Ick. For the first cut of the NUMA sched, maybe you could just leave page allocation alone and do that separately? Or is that what the second patch was meant to be?

M. |