Hello all
Thanks for your answers.

Erich Focht wrote:
> Hello Xavier,
>
> On Tuesday 23 March 2004 19:24, Xavier Bru wrote:
>> Running a single process on a 16-way NUMA machine (4 nodes of 4 CPUs), we
>> sometimes see the process migrating across nodes when the machine is not
>> loaded. It seems that on a machine where most CPUs are idle, the
>> balance_node() routine finds a load imbalance (for example: 2 active
>> processes on one node and 0 on another), and a process is migrated
>> across nodes, with the major problem of broken memory affinity.

> As Rick mentioned in his reply, in some cases you might actually want
> an idle node to steal single tasks. The time averaged loads should
> actually help to avoid the situation you mention. But the loads are
> computed by taking into account all runnable processes. It might be
> better to just consider processes which ran longer or have more
> memory, as Rick suggested. OTOH this gives a minimum unbalanced
> run-time which you again might want to avoid.

The problem is that the 25% imbalance threshold is less significant when nodes are idle. As long as there are fewer active tasks than processors on a node, it is better to keep memory local than to keep the nodes balanced.
I caught the case with kdb: we are indeed in the situation where node_nr_running=2 on one node (the "mytest" numa test, plus probably some daemon or the top command waking up) and 0 on the others (see attached traces).

> My main idea behind a NUMA scheduler was to provide a mechanism for
> the process to return to its node. This isn't there in 2.6 and the
> current NUMA scheduler is pretty "crippled", IMHO. As the
> sched_domains provide more flexibility regarding HT, NUMA, SMT and
> what not, and Andrew Morton seems to be willing to accept it, I'm not
> too motivated to improve the current NUMA scheduler.

I remember you provided the home node support that should help when there is a temporary load imbalance. Thanks for the pointer to sched_domains.

> Looking at your patch: by just skipping all nodes with
> nr_running<=nr_cpus you don't update this_rq()->prev_node_load[i] so
> some decisions might be wrong later. So maybe you want to move the
> "continue" lower in the loop. Otherwise it's fine IF you want this
> behaviour...

Thanks! I will move it.
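
To make sure I move it to the right spot, here is roughly the loop shape I have in mind, written as a small stand-alone toy rather than a patch against the real scheduler code. The halved previous load and the 125% threshold are only my way of modelling the time averaging and the 25% unbalance here; the constants and helper names are made up for the example.

/*
 * Stand-alone toy model of the node-selection loop, only to show where
 * the skip goes relative to the prev_node_load update.
 */
#include <stdio.h>

#define NR_NODES        4
#define CPUS_PER_NODE   4
#define NODE_THRESHOLD  125     /* the "25% unbalance" */

static int prev_node_load[NR_NODES];    /* decayed load history per node */
static int node_nr_running[NR_NODES];   /* runnable tasks per node */

/* Return the node to steal from, or -1 if none qualifies. */
static int pick_busiest_node(int this_node)
{
        int i, node = -1, load, this_load, maxload;

        this_load = maxload = (prev_node_load[this_node] >> 1)
                + node_nr_running[this_node];
        prev_node_load[this_node] = this_load;

        for (i = 0; i < NR_NODES; i++) {
                if (i == this_node)
                        continue;
                load = (prev_node_load[i] >> 1) + node_nr_running[i];
                /* update the load history first, even for skipped nodes... */
                prev_node_load[i] = load;
                /* ...then skip nodes with fewer runnable tasks than CPUs:
                 * memory locality is worth more than balance there */
                if (node_nr_running[i] <= CPUS_PER_NODE)
                        continue;
                if (load > maxload && 100 * load > NODE_THRESHOLD * this_load) {
                        maxload = load;
                        node = i;
                }
        }
        return node;
}

int main(void)
{
        node_nr_running[0] = 2;         /* the case caught with kdb */
        printf("idle node 1 would steal from node %d\n",
               pick_busiest_node(1));   /* prints -1: no cross-node steal */
        return 0;
}

Compiled and run, it prints -1 for the 2-versus-0 case I caught with kdb, i.e. the idle node does not steal across nodes.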
  
>> to fix the problem. Note that in this case processes still migrate
>> between CPUs of the same node.

> ??? That shouldn't happen, either. Is it due to short running
> processes on the same CPU? The time-averaging should catch that,
> too... Andrew suggests in a reply to change try_to_wakeup()... Might
> work but you still have kernel threads pinned to each CPU which you
> cannot move, they just have to wake up there. Would be nice to find
> out what exactly happens and which task pushes the numatest
> away. It would again be better to not count particular tasks in the
> load.

Hereafter is a trace showing that the numa test gets scheduled on other CPUs of the same node:

initial CPU = 2
cpu  18491 16
cpu0 17125 2
cpu1 441 0
cpu2 700 14
cpu3 225 0
cpu4 0 0
cpu5 0 0
cpu6 0 0
cpu7 0 0
cpu8 0 0
cpu9 0 0
cpu10 0 0
cpu11 0 0
cpu12 0 0
cpu13 0 0
cpu14 0 0
cpu15 0 0
current_cpu 0

real    0m18.074s
user    0m18.058s
sys     0m0.016s

> Regards,
> Erich

Thanks again for your answers.
Xavier

-- 

 Best regards.
_____________________________________________________________________
 
Xavier BRU                 BULL ISD/R&D/INTEL office:     FREC B1-422
tel : +33 (0)4 76 29 77 45                    http://www-frec.bull.fr
fax : +33 (0)4 76 29 77 70                 mailto:Xavier.Bru@bull.net
addr: BULL, 1 rue de Provence, BP 208, 38432 Echirolles Cedex, FRANCE
_____________________________________________________________________