From: Erich F. <ef...@es...> - 2002-07-22 14:00:26
|
There's a new version of the node affine NUMA scheduler extension based on the O(1) scheduler at http://home.arcor.de/efocht/sched/Nod18_2.4.18-ia64-O1ef7.patch

The patch is for 2.4.18 kernels and it has been tested on IA64 systems. It requires the O(1) scheduler patch with the corrected complex macros which I posted to the LSE and linux-ia64 mailing lists last week. For IA64 you should use: http://home.arcor.de/efocht/sched/O1_ia64-ef7-2.4.18.patch.bz2 which should be applied to 2.4.18 + the ia64-020622 patch. For IA32 (NUMA-Q) try instead: http://home.arcor.de/efocht/sched/O1_i386-ef7-2.4.18.patch.bz2

What is it good for?
- Extends the scheduler to NUMA.
- Each task gets a homenode assigned at start (initial load balancing).
- A memory affinity patch (like discontigmem, or similar) should take care that the memory of the task is allocated mainly from its homenode.
- The scheduler attracts tasks to their homenodes while trying to keep the nodes equally balanced.
- Target: keep processes and their memory on the same node to reduce memory access latencies without having to fiddle with the cpus_allowed masks (hard affinities).
- Within one node, it behaves like the normal O(1) scheduler.

For an overview of the features have a look at: http://home.arcor.de/efocht/sched

There are several changes compared to the previous version; the most important ones are:
- Extension to multilevel NUMA hierarchies by implementing delays when stealing tasks from remote nodes.
- Better selection of the task to be stolen from the busiest runqueue, taking into account cache coolness and the node and supernode of task and runqueue.

Comments and feedback are very welcome.

Regards,
Erich |
From: Erich F. <ef...@es...> - 2002-09-21 10:00:51
Attachments:
01-numa_sched_core-2.5.patch
|
Hi, here is an update of the NUMA scheduler for the 2.5.37 kernel. It contains some bugfixes and the coupling to discontigmem memory allocation (memory is allocated from the process's homenode). The node affine NUMA scheduler is targeted at multi-node platforms and built on top of the O(1) scheduler. Its main objective is to keep the memory access latency for each task as low as possible by scheduling it on or near the node on which its memory is allocated. This should achieve the benefits of hard affinity automatically.

The patch comes in two parts. The first part is the core NUMA scheduler; it is functional without the second part and provides the following features:
- Node aware scheduler (implemented CPU pools).
- Scheduler behaves like the O(1) scheduler within a node.
- Equal load among nodes is targeted; stealing tasks from remote nodes is delayed more if the current node is averagely loaded, less if it's unloaded.
- Multi-level node hierarchies are supported by stealing delays adjusted by the relative "node-distance" (memory access latency ratio).

The second part of the patch extends the pooling NUMA scheduler to have node affine tasks:
- Each process has a homenode assigned to it at creation time (initial load balancing). Memory will be allocated from this node.
- Each process is preferentially scheduled on its homenode and attracted back to it if scheduled away for some reason. For multi-level node hierarchies the task is attracted to its supernode, too.

The patch was tested on IA64 platforms but should work on NUMAQ i386, too. Similar code for 2.4.18 (cf. http://home.arcor.de/efocht/sched) has been running in production environments for months.

Comments, tests, ports to other platforms/architectures are very welcome!

Regards,
Erich |
From: Erich F. <ef...@es...> - 2002-09-21 10:03:10
Attachments:
02-numa_sched_affine-2.5.patch
|
Here comes the second part of the node affine NUMA scheduler.

> The patch comes in two parts. The first part is the core NUMA scheduler,
> it is functional without the second part and provides following features:
> - Node aware scheduler (implemented CPU pools).
> - Scheduler behaves like O(1) scheduler within a node.
> - Equal load among nodes is targeted, stealing tasks from remote nodes
>   is delayed more if the current node is averagely loaded, less if it's
>   unloaded.
> - Multi-level node hierarchies are supported by stealing delays adjusted
>   by the relative "node-distance" (memory access latency ratio).
>
> The second part of the patch extends the pooling NUMA scheduler to
> have node affine tasks:
> - Each process has a homenode assigned to it at creation time
>   (initial load balancing). Memory will be allocated from this node.
> - Each process is preferentially scheduled on its homenode and
>   attracted back to it if scheduled away for some reason. For
>   multi-level node hierarchies the task is attracted to its
>   supernode, too.

Regards,
Erich |
From: Martin J. B. <mb...@ar...> - 2002-09-21 15:57:41
|
> The second part of the patch extends the pooling NUMA scheduler to
> have node affine tasks:
> - Each process has a homenode assigned to it at creation time
>   (initial load balancing). Memory will be allocated from this node.

Hmmm ... I was wondering how you achieved that without modifying alloc_pages ... until I saw this bit.

 #ifdef CONFIG_NUMA
+#ifdef CONFIG_NUMA_SCHED
+#define numa_node_id() (current->node)
+#else
 #define numa_node_id() _cpu_to_node(smp_processor_id())
+#endif
 #endif /* CONFIG_NUMA */

I'm not convinced it's a good idea to modify this generic function, which was meant to tell you what node you're running on. I can't see it being used anywhere else right now, but wouldn't it be better to just modify alloc_pages instead to use current->node, and leave this macro as intended? Or make a process_node_id or something?

Anyway, I'm giving your code a quick spin ... will give you some results later ;-)

M. |
From: Martin J. B. <mb...@ar...> - 2002-09-21 16:34:37
|
> Anyway, I'm giving your code a quick spin ... will give you some > results later ;-) Hmmm .... well I ran the One True Benchmark (tm). The patch *increased* my kernel compile time from about 20s to about 28s. Not sure I like that idea ;-) Anything you'd like tweaked, or more info? Both user and system time were up ... I'll grab a profile of kernel stuff. M. |
From: Martin J. B. <mb...@ar...> - 2002-09-21 16:48:22
|
>> Anyway, I'm giving your code a quick spin ... will give you some
>> results later ;-)
>
> Hmmm .... well I ran the One True Benchmark (tm). The patch
> *increased* my kernel compile time from about 20s to about 28s.
> Not sure I like that idea ;-) Anything you'd like tweaked, or
> more info? Both user and system time were up ... I'll grab a
> profile of kernel stuff.

From the below, I'd suggest you're getting pages off the wrong nodes: do_anonymous_page is page zeroing, and rmqueue the buddy allocator. Are you sure the current->node thing is getting set correctly? I'll try backing out your alloc_pages tweaking, and see what happens.

An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)

133639 total 0.1390
74447 default_idle
6887 do_anonymous_page
4456 page_remove_rmap
4082 handle_mm_fault
3733 .text.lock.namei
2512 page_add_rmap
2187 __generic_copy_from_user
1989 rmqueue
1964 .text.lock.dec_and_lock
1761 vm_enough_memory
1631 file_read_actor
1599 zap_pte_range
1507 __free_pages_ok
1323 find_get_page
1117 do_no_page
1097 get_empty_filp
1023 link_path_walk

2.5.37-mm1

256745 total 0.2584
82934 default_idle
38978 do_anonymous_page
36533 rmqueue
35099 __free_pages_ok
5551 page_remove_rmap
4694 handle_mm_fault
3166 page_add_rmap
2904 do_no_page
2674 .text.lock.namei
2566 __alloc_pages
2526 zap_pte_range
2306 __generic_copy_from_user
2218 file_read_actor
1803 vm_enough_memory
1789 do_wp_page
1557 .text.lock.dec_and_lock
1414 find_get_page
1251 do_softirq
1123 release_pages
1086 link_path_walk
1072 get_empty_filp
1038 schedule |
From: Martin J. B. <mb...@ar...> - 2002-09-21 17:13:48
|
> From the below, I'd suggest you're getting pages off the wrong
> nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> allocator. Are you sure the current->node thing is getting set
> correctly? I'll try backing out your alloc_pages tweaking, and
> see what happens.

OK, well removing that part of the patch gets us back from 28s to about 21s (compared to 20s virgin); total system time compared to virgin is up from 59s to 62s, user from 191 to 195. So it's still a net loss, but not by nearly as much. Are you determining the target node on fork or exec? I forget ...

Profile is more comparable. Nothing sticks out any more, but maybe it just needs some tuning for balance intervals or something.

153385 total 0.1544
91219 default_idle
7475 do_anonymous_page
4564 page_remove_rmap
4167 handle_mm_fault
3467 .text.lock.namei
2520 page_add_rmap
2112 rmqueue
1905 .text.lock.dec_and_lock
1849 zap_pte_range
1668 vm_enough_memory
1612 __free_pages_ok
1504 file_read_actor
1484 find_get_page
1381 __generic_copy_from_user
1207 do_no_page
1066 schedule
1050 get_empty_filp
1034 link_path_walk |
From: Erich F. <ef...@es...> - 2002-09-21 17:34:02
|
Hi Martin, thanks for the comments and the testing!

On Saturday 21 September 2002 19:11, Martin J. Bligh wrote:
> From the below, I'd suggest you're getting pages off the wrong
> nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> allocator. Are you sure the current->node thing is getting set
> correctly? I'll try backing out your alloc_pages tweaking, and
> see what happens.

The current->node is most probably wrong for most of the kernel threads, except for migration_thread and ksoftirqd. But it should be fine for user processes. Might also be that the __node_distance matrix which you might use by default is not optimal for NUMAQ. It is fine for our remote/local latency ratio of 1.6. Yours is maybe an order of magnitude larger? Try replacing 15 -> 50; I guess you don't go beyond 4 nodes now...

> OK, well removing that part of the patch gets us back from 28s to
> about 21s (compared to 20s virgin), total user time compared to
> virgin is up from 59s to 62s, user from 191 to 195. So it's still
> a net loss, but not by nearly as much. Are you determining target
> node on fork or exec ? I forget ...

The default is exec(). You can use http://home.arcor.de/efocht/sched/nodpol.c to set the node_policy to do initial load balancing in fork(). Just do "nodpol -P 2" in the shell before starting the job/task. This is VERY recommended if you are creating many tasks/threads. The default behavior is fine for MPI jobs or users starting serial processes.

> Profile is more comparible. Nothing sticks out any more, but maybe
> it just needs some tuning for balance intervals or something.

Hmmm... There are two changes which might lead to lower performance:
1. load_balance() is not inlined any more.
2. pull_task() steals only one task per load_balance() call. It was maximally imbalance/2 (if I remember correctly).

And of course, there is some real additional overhead due to the initial load balancing, which one feels for short-lived tasks...

So please try "nodpol -P 2" (and reset to the default with "nodpol -P 0"). Did you try the first patch alone? I mean the pooling-only scheduler?

Thanks,
Erich

> 153385 total 0.1544
> 91219 default_idle
> 7475 do_anonymous_page
> 4564 page_remove_rmap
> 4167 handle_mm_fault
> 3467 .text.lock.namei
> 2520 page_add_rmap
> 2112 rmqueue
> 1905 .text.lock.dec_and_lock
> 1849 zap_pte_range
> 1668 vm_enough_memory
> 1612 __free_pages_ok
> 1504 file_read_actor
> 1484 find_get_page
> 1381 __generic_copy_from_user
> 1207 do_no_page
> 1066 schedule
> 1050 get_empty_filp
> 1034 link_path_walk

--
Dr. Erich Focht <ef...@es...>
NEC European Supercomputer Systems, European HPC Technology Center
Hessbruehlstr. 21B, 70565 Stuttgart, Germany
phone: +49-711-78055-15 fax: +49-711-78055-25 |
From: William L. I. I. <wl...@ho...> - 2002-09-21 17:44:56
|
On Sat, Sep 21, 2002 at 07:32:59PM +0200, Erich Focht wrote:
> Might also be that the __node_distance matrix which you might use
> by default is not optimal for NUMAQ. It is fine for our remote/local
> latency ratio of 1.6. Yours is maybe an order of magnitude larger?
> Try replacing: 15 -> 50, guess you don't go beyond 4 nodes now...

I'm running with 8 over the weekend, and by and large we go to 16, though we rarely put all our eggs in one basket. I'll take it for a spin.

Cheers,
Bill |
From: Erich F. <ef...@es...> - 2002-09-22 10:36:18
Attachments:
proc_sched_hist_2.5.37.patch
|
On Saturday 21 September 2002 18:46, Martin J. Bligh wrote:
> > Hmmm .... well I ran the One True Benchmark (tm). The patch
> > *increased* my kernel compile time from about 20s to about 28s.
> > Not sure I like that idea ;-) Anything you'd like tweaked, or
> > more info? Both user and system time were up ... I'll grab a
> > profile of kernel stuff.
>
> From the below, I'd suggest you're getting pages off the wrong
> nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> allocator. Are you sure the current->node thing is getting set
> correctly? I'll try backing out your alloc_pages tweaking, and
> see what happens.

Could you please check in dmesg whether the CPU pools are initialised correctly? Maybe something goes wrong for your platform. The node_distance is most probably non-optimal for NUMAQ; that might need some tuning. The default is set for a maximum of 8 nodes, nodes 1-4 and 5-8 being in separate supernodes, with the latency ratios 1:1.5:2.

You could use the attached patch for getting an idea about the load distribution. It's a quick&dirty hack which creates files called:
/proc/sched/load/rqNN : load of RQs, including info on tasks not running on their homenode
/proc/sched/history/ilbNN : history of last 25 initial load balancing decisions for runqueue NN
/proc/sched/history/lbNN : last 25 load balancing decisions on rq NN

It should be possible to find the reason for the poor performance by looking at the nr_homenode entries in /proc/sched/load/rqNN.

Thanks, best regards,
Erich |
From: William L. I. I. <wl...@ho...> - 2002-09-21 23:25:07
|
On Sat, Sep 21, 2002 at 09:46:05AM -0700, Martin J. Bligh wrote:
> An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)

Some 8-quad numbers for 2.5.37 (virgin) follow. I'll get dcache_rcu and NUMA sched stuff in on the act for round 2. This is more fs transaction-based (and VM process-spawning overhead) and not pure I/O throughput, so there won't be things like "but everybody's running at peak I/O bandwidth" obscuring the issues.

real 0m30.854s

c01053ec 16963617 89.009 poll_idle
c0114a48 452526 2.37443 load_balance
c013962c 253666 1.331 get_page_state
c01466de 177354 0.930586 .text.lock.file_table
c0114ec0 150583 0.790117 scheduler_tick
c01547b3 116144 0.609414 .text.lock.namei
c01422ac 94042 0.493444 page_remove_rmap
c0138c24 83293 0.437043 rmqueue
c0141e48 51963 0.272653 page_add_rmap
c012d5ec 37086 0.194592 do_anonymous_page
c01391d0 34454 0.180782 __alloc_pages
c0130e08 32111 0.168488 find_get_page
c0139534 29222 0.153329 nr_free_pages
c012df4c 26734 0.140275 handle_mm_fault
c0146070 25881 0.135799 get_empty_filp
c01a0d2c 25250 0.132488 __generic_copy_from_user
c0111728 22018 0.11553 smp_apic_timer_interrupt
c0112b90 20777 0.109018 pfn_to_nid
c0136238 18410 0.0965983 kmem_cache_free
c0113270 17494 0.091792 do_page_fault
c0138900 17282 0.0906796 __free_pages_ok
c01a0f90 17160 0.0900395 atomic_dec_and_lock
c01a100b 16937 0.0888694 .text.lock.dec_and_lock |
From: William L. I. I. <wl...@ho...> - 2002-09-22 08:16:54
|
On Sat, Sep 21, 2002 at 09:46:05AM -0700, Martin J. Bligh wrote:
>> An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)

On Sat, Sep 21, 2002 at 04:18:10PM -0700, William Lee Irwin III wrote:
> Some 8-quad numbers for 2.5.37 (virgin) follow.

Okay, 2.5.37 virgin with overcommit_memory set to 1 this time. (Compiles with -j256 seem to do better than -j32 or -j48; here is -j256.)

... will follow up with 2.5.38-mm1 with and without NUMA sched, at least if the arrival rate of releases doesn't exceed the benchtime.

c01053ec 1605553 95.6785 poll_idle
c0114a48 7611 0.453556 load_balance
c0114ec0 5303 0.316018 scheduler_tick
c01422ac 5017 0.298974 page_remove_rmap
c01466de 4026 0.239918 .text.lock.file_table
c012d5ec 3290 0.196058 do_anonymous_page
c0141e48 3211 0.191351 page_add_rmap
c012df4c 2920 0.174009 handle_mm_fault
c010d6e8 2844 0.16948 timer_interrupt
c01547b3 2080 0.123952 .text.lock.namei
c0146070 1934 0.115251 get_empty_filp
c01a0d2c 1591 0.0948112 __generic_copy_from_user
c0111728 1477 0.0880177 smp_apic_timer_interrupt
c014633c 1437 0.085634 __fput
c01a0ce0 1346 0.0802111 __generic_copy_to_user
c0138c24 1249 0.0744307 rmqueue
c012e450 1169 0.0696633 vm_enough_memory
c012b694 1056 0.0629294 zap_pte_range
c0130e08 997 0.0594135 find_get_page
c014427c 950 0.0566126 dentry_open
c01391d0 949 0.056553 __alloc_pages
c0151424 892 0.0531563 link_path_walk
c01a0f90 834 0.0496999 atomic_dec_and_lock |
From: Erich F. <ef...@es...> - 2002-09-22 08:31:44
|
Bill,

would you please check the boot messages for the NUMA scheduler before doing the run. Martin sent me an example where he has:

CPU pools : 1
pool 0 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node level 0 : 10
pool_delay matrix:
129

which is clearly wrong. In that case we need to fix the cpu-pools setup first.

Regards,
Erich

On Sunday 22 September 2002 10:09, William Lee Irwin III wrote:
> On Sat, Sep 21, 2002 at 09:46:05AM -0700, Martin J. Bligh wrote:
> >> An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)
>
> On Sat, Sep 21, 2002 at 04:18:10PM -0700, William Lee Irwin III wrote:
> > Some 8-quad numbers for 2.5.37 (virgin) follow.
>
> Okay, 2.5.37 virgin with overcommit_memory set to 1 this time.
> (compiles with -j256 seem to do better than -j32 or -j48, here is -j256):
>
> ... will follow up with 2.5.38-mm1 with and without NUMA sched, at
> least if the arrival rate of releases doesn't exceed the benchtime. |
From: Martin J. B. <mb...@ar...> - 2002-09-22 17:13:50
|
> would you please check the boot messages for the NUMA scheduler before
> doing the run. Martin sent me an example where he has:
>
> CPU pools : 1
> pool 0 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node level 0 : 10
> pool_delay matrix:
> 129
>
> which is clearly wrong. In that case we need to fix the cpu-pools setup
> first.

OK, well I hacked this for now, somewhere in sched.c:

- lnode_number[i] = pnode_to_lnode[SAPICID_TO_PNODE(cpu_physical_id(i))];
+ lnode_number[i] = i/4;

Which makes the pools work properly. I think you should just use the cpu_to_node macro here, but the hack will allow us to do some testing.

Results, averaged over 5 kernel compiles:

Before: Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
After: Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%

So you actually take a little less horsepower to do the work, but don't utilize the CPUs quite as well, so elapsed time is higher. I seem to recall getting better results from Mike's quick hack though ... that was a long time back. What were the balancing issues you mentioned?

M. |
From: Niels C. <nc...@ej...> - 2002-09-22 18:24:48
|
Actually, the horsepower used is virtually the same:

20.82 x 1206.4 = 25,117
21.918 x 1137.4 = 25.929

-nc-

> Results, averaged over 5 kernel compiles:
>
> Before: Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
> After: Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
>
> So you actually take a little less horsepower to do the work, but
> don't utilize the CPUs quite as well, so elapsed time is higher.
> I seem to recall getting better results from Mike's quick hack
> though ... that was a long time back. What were the balancing
> issues you mentioned?
>
> M. |
From: Niels C. <nc...@ej...> - 2002-09-22 18:43:10
|
Oops, typo. Should have been:

Actually, the horsepower used is virtually the same:

20.82 x 1206.4 = 25,117
21.918 x 1137.4 = 24,929

-nc-

> Results, averaged over 5 kernel compiles:
>
> Before: Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
> After: Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
>
> So you actually take a little less horsepower to do the work, but
> don't utilize the CPUs quite as well, so elapsed time is higher.
> I seem to recall getting better results from Mike's quick hack
> though ... that was a long time back. What were the balancing
> issues you mentioned?
>
> M. |
From: Martin J. B. <mb...@ar...> - 2002-09-22 19:22:00
|
I tried putting back the current->node logic now that we have the correct node IDs, but it made things worse (not as bad as before, but ...). Looks like we're still allocing off the wrong node. This run is the last one in the list.

Virgin:
Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
7059 do_anonymous_page
4459 page_remove_rmap
3863 handle_mm_fault
3695 .text.lock.namei
2912 page_add_rmap
2458 rmqueue
2119 vm_enough_memory

Both numasched patches, just compile fixes:
Elapsed: 28.744s User: 204.62s System: 173.708s CPU: 1315.8%
38978 do_anonymous_page
36533 rmqueue
35099 __free_pages_ok
5551 page_remove_rmap
4694 handle_mm_fault
3166 page_add_rmap

Both numasched patches, alloc from local node:
Elapsed: 21.094s User: 195.808s System: 62.41s CPU: 1224.4%
7475 do_anonymous_page
4564 page_remove_rmap
4167 handle_mm_fault
3467 .text.lock.namei
2520 page_add_rmap
2112 rmqueue
1905 .text.lock.dec_and_lock
1849 zap_pte_range
1668 vm_enough_memory

Both numasched patches, hack node IDs, alloc from local node:
Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
5793 do_anonymous_page
4475 page_remove_rmap
4281 handle_mm_fault
3820 .text.lock.namei
2625 page_add_rmap
2028 .text.lock.dec_and_lock
1748 vm_enough_memory
1713 file_read_actor
1672 rmqueue

Both numasched patches, hack node IDs, alloc from current->node:
Elapsed: 24.414s User: 194.86s System: 98.606s CPU: 1201.6%
30317 do_anonymous_page
6962 rmqueue
5190 page_remove_rmap
4773 handle_mm_fault
3522 .text.lock.namei
3161 page_add_rmap |
From: Erich F. <ef...@es...> - 2002-09-22 22:00:23
Attachments:
nodpol.c
|
On Sunday 22 September 2002 21:20, Martin J. Bligh wrote:
> I tried putting back the current->node logic now that we have
> the correct node IDs, but it made things worse (not as bad as
> before, but ... looks like we're still allocing off the wrong
> node.

Thanks a lot for the testing! It looks like something is still wrong. NUMAQ suffers a lot more from hops across the nodes than the Azusa, therefore I expect it is more sensitive to initial load balancing errors.

The easiest thing to try is "nodpol -P 2" in the shell before running the test. This changes the initial load balancing policy from exec() to fork() ("nodpol -P 0" gets you back to the default).

A bit more difficult is tuning the scheduler parameters, which can be done pretty simply by changing the __node_distance matrix. A first attempt could be: 10 on the diagonal, 100 off-diagonal. This leads to larger delays when stealing from a remote node.

Anyhow, it would be good to understand what is going on, and maybe simpler tests than a kernel compile can reveal something. Or looking at the /proc/sched/load/rqNN files (you need the patch I posted a few mails ago). I'll modify alloc_pages not to take into account the kernel threads in the meantime.

Regards,
Erich

> This run is the last one in the list.
>
> Virgin:
> Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
> 7059 do_anonymous_page
> 4459 page_remove_rmap
> 3863 handle_mm_fault
> 3695 .text.lock.namei
> 2912 page_add_rmap
> 2458 rmqueue
> 2119 vm_enough_memory
>
> Both numasched patches, just compile fixes:
> Elapsed: 28.744s User: 204.62s System: 173.708s CPU: 1315.8%
> 38978 do_anonymous_page
> 36533 rmqueue
> 35099 __free_pages_ok
> 5551 page_remove_rmap
> 4694 handle_mm_fault
> 3166 page_add_rmap
>
> Both numasched patches, alloc from local node:
> Elapsed: 21.094s User: 195.808s System: 62.41s CPU: 1224.4%
> 7475 do_anonymous_page
> 4564 page_remove_rmap
> 4167 handle_mm_fault
> 3467 .text.lock.namei
> 2520 page_add_rmap
> 2112 rmqueue
> 1905 .text.lock.dec_and_lock
> 1849 zap_pte_range
> 1668 vm_enough_memory
>
> Both numasched patches, hack node IDs, alloc from local node:
> Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
> 5793 do_anonymous_page
> 4475 page_remove_rmap
> 4281 handle_mm_fault
> 3820 .text.lock.namei
> 2625 page_add_rmap
> 2028 .text.lock.dec_and_lock
> 1748 vm_enough_memory
> 1713 file_read_actor
> 1672 rmqueue
>
> Both numasched patches, hack node IDs, alloc from current->node:
> Elapsed: 24.414s User: 194.86s System: 98.606s CPU: 1201.6%
> 30317 do_anonymous_page
> 6962 rmqueue
> 5190 page_remove_rmap
> 4773 handle_mm_fault
> 3522 .text.lock.namei
> 3161 page_add_rmap |
From: William L. I. I. <wl...@ho...> - 2002-09-22 22:44:25
|
On Sun, Sep 22, 2002 at 11:59:16PM +0200, Erich Focht wrote: > A bit more difficult is tuning the scheduler parameters which can > be done pretty simply by changing the __node_distance matrix. A first > attempt could be: 10 on the diagonal, 100 off-diagonal. This leads to > larger delays when stealing from a remote node. This is not entirely reflective of our architecture. Node-to-node latencies vary as well. Some notion of whether communication must cross a lash at the very least should be present. Cheers, Bill |
From: Martin J. B. <mb...@ar...> - 2002-09-22 22:52:45
|
> On Sun, Sep 22, 2002 at 11:59:16PM +0200, Erich Focht wrote:
>> A bit more difficult is tuning the scheduler parameters which can
>> be done pretty simply by changing the __node_distance matrix. A first
>> attempt could be: 10 on the diagonal, 100 off-diagonal. This leads to
>> larger delays when stealing from a remote node.
>
> This is not entirely reflective of our architecture. Node-to-node
> latencies vary as well. Some notion of whether communication must cross
> a lash at the very least should be present.

Ummm ... I think it's just flat on or off node; presumably Erich has "on the diagonal" meaning they're on the same node, and "off-diagonal" meaning they're not. In which case, what he suggested seems fine ... it's really about a 20:1 ratio so I might use 10 and 200, but other than that, it seems correct to me.

M. |
From: Erich F. <ef...@es...> - 2002-09-23 18:21:07
Attachments:
rand_updt
|
Here is a simple benchmark which is NUMA sensitive and simulates a simple but normal situation in an environment running number crunching jobs. It starts N independent tasks which access a large array in a random manner. This is both bandwidth and latency sensitive. The output shows on which node(s) the tasks have spent their lives. Additionally it shows (on a NUMA scheduler kernel) the homenode (iSched). Could you please run it on the virgin kernel and on the "Both numasched patches, hack node IDs, alloc from current->node" one? Maybe we see what's wrong... Thanks, Erich |
From: Erich F. <ef...@es...> - 2002-09-22 10:46:30
|
On Saturday 21 September 2002 17:55, Martin J. Bligh wrote:
> > - Each process has a homenode assigned to it at creation time
> >   (initial load balancing). Memory will be allocated from this node.
...
> #ifdef CONFIG_NUMA
> +#ifdef CONFIG_NUMA_SCHED
> +#define numa_node_id() (current->node)
> +#else
> #define numa_node_id() _cpu_to_node(smp_processor_id())
> +#endif
> #endif /* CONFIG_NUMA */
>
> I'm not convinced it's a good idea to modify this generic function,
> which was meant to tell you what node you're running on. I can't
> see it being used anywhere else right now, but wouldn't it be better
> to just modify alloc_pages instead to use current->node, and leave
> this macro as intended? Or make a process_node_id or something?

OK, I see your point and I agree that numa_node_id() should be similar to smp_processor_id(). I'll change alloc_pages instead.

Do you think it makes sense to get memory from the homenode only for user processes? Many kernel threads currently have the wrong homenode, and for some of them it's unclear which homenode they should have...

There is an alternative idea (we discussed this at OLS with Andrea, maybe you remember): allocate memory from the current node and keep statistics on where it is allocated. Determine the homenode from this (from time to time) and schedule accordingly. This eliminates the initial load balancing and leaves it all to the scheduler, but has the drawback that memory can be somewhat scattered across the nodes. Any comments?

Regards,
Erich |
From: Martin J. B. <mb...@ar...> - 2002-09-22 14:58:55
|
> OK, I see your point and I agree that numa_node_id() should be similar to
> smp_processor_id(). I'll change the alloc_pages instead.
>
> Do you think it makes sense to get memory from the homenode only for
> user processes? Many kernel threads have currently the wrong homenode,
> for some of them it's unclear which homenode they should have...

Well yes ... if you can keep things pretty much on their home nodes. That means some sort of algorithm for updating it, which may be fairly complex (and doesn't currently seem to work, but maybe that's just because I only have 1 pool).

> There is an alternative idea (we discussed this at OLS with Andrea, maybe
> you remember): allocate memory from the current node and keep statistics
> on where it is allocated. Determine the homenode from this (from time to
> time) and schedule accordingly. This eliminates the initial load balancing
> and leaves it all to the scheduler, but has the drawback that memory can
> be somewhat scattered across the nodes. Any comments?

Well, that's a lot simpler. Things should end up running on their home node, and thus will allocate pages from their home node, so it should be self-re-enforcing. The algorithm for the home node is then implicitly worked out from the scheduler itself and its actions, so it's one less set of stuff to write. Would suggest we do this at first, to keep things as simple as possible so you have something mergeable.

M. |
From: Erich F. <ef...@es...> - 2002-09-23 18:39:12
Attachments:
node_affine_dyn-2.5.37.patch
|
On Sunday 22 September 2002 16:57, Martin J. Bligh wrote:
> > There is an alternative idea (we discussed this at OLS with Andrea, maybe
> > you remember): allocate memory from the current node and keep statistics
> > on where it is allocated. Determine the homenode from this (from time to
> > time) and schedule accordingly. This eliminates the initial load
> > balancing and leaves it all to the scheduler, but has the drawback that
> > memory can be somewhat scattered across the nodes. Any comments?
>
> Well, that's a lot simpler. Things should end up running on their home
> node, and thus will allocate pages from their home node, so it should
> be self-re-enforcing. The algorithm for the home node is then implicitly
> worked out from the scheduler itself, and its actions, so it's one less
> set of stuff to write. Would suggest we do this at first, to keep things
> as simple as possible so you have something mergeable.

OK, sounds encouraging. So here is my first attempt (attached). You'll have to apply it on top of the two NUMA scheduler patches and hack SAPICID_TO_PNODE again.

The patch adds a node_mem[NR_NODES] array to each task. When allocating memory (in rmqueue) and freeing it (in free_pages_ok) the number of pages is added to/subtracted from that array, and the homenode is set to the node having the largest entry. Is there a better place to put this (other than rmqueue/free_pages_ok)?

I have two problems with this approach:
1. Freeing memory is quite expensive, as it currently involves finding the maximum of the array node_mem[].
2. I have no idea how tasks sharing the mm structure will behave. I'd like them to run on different nodes (that's why node_mem is not in mm), but they could (legally) free pages which they did not allocate and end up with wrong values in node_mem[].

Comments?

Regards,
Erich |
From: Martin J. B. <mb...@ar...> - 2002-09-23 18:49:13
|
> OK, sounds encouraging. So here is my first attempt (attached). You'll
> have to apply it on top of the two NUMA scheduler patches and hack
> SAPICID_TO_PNODE again.
>
> The patch adds a node_mem[NR_NODES] array to each task. When allocating
> memory (in rmqueue) and freeing it (in free_pages_ok) the number of
> pages is added/subtracted from that array and the homenode is set to
> the node having the largest entry. Is there a better place where to put
> this in (other than rmqueue/free_pages_ok)?
>
> I have two problems with this approach:
> 1: Freeing memory is quite expensive, as it currently involves finding the
>    maximum of the array node_mem[].

Bleh ... why? This needs to be calculated much more lazily than this, or you're going to kick the hell out of any cache affinity. Can you recalc this in the rebalance code or something instead?

> 2: I have no idea how tasks sharing the mm structure will behave. I'd
>    like them to run on different nodes (that's why node_mem is not in mm),
>    but they could (legally) free pages which they did not allocate and
>    have wrong values in node_mem[].

Yes, that really ought to be per-process, not per task. Which means locking or atomics ... and overhead. Ick. For the first cut of the NUMA sched, maybe you could just leave page allocation alone and do that separately? Or is that what the second patch was meant to be?

M. |