From: Andrew T. <hab...@us...> - 2003-04-19 01:31:40
On Friday 18 April 2003 18:57, William Lee Irwin III wrote:
> On Fri, Apr 18, 2003 at 04:28:25PM -0700, Gerrit Huizenga wrote:
> > NUMAQ may be less representative today of the latencies, but the
> > current NUMAQ hardware reflects the end of a set of CPU clock rate
> > increases (which roughly double every 6-18 months) and an
> > interconnect that was designed for processors 3-4 years earlier.
>
> IMHO remote access latency or a related statistic (e.g. ratios) should
> be an operational parameter and somehow automatically inferred.

Yes, I agree, and eventually I'd like to see the NUMA-related decisions
made by the kernel have varying degrees of complexity and codepath
length, based on this ratio.  Let me back up: the problem I see with
using just NUMAQ (not really NUMAQ, but using just one platform) is
that we add enough complexity and code path length in the kernel to
benefit that platform, but the same additions may not pay off on a
low-latency system like the x440, or an even lower-latency one like
Hammer.

What we end up doing is increasing the time spent in the kernel on
these complex decisions designed for high-latency systems.  On such a
platform we know that time has a good return: the time we "wasted" in
the kernel is made up by the time we save in the application.  On a
lower-latency system, however, the decision does not need to be as
precise, so simpler logic, a shorter codepath, and lower system time
would be better, because there isn't as much application time to be
saved to make up for the extra system time.

Let me give an example: sched_best_cpu().  At first it did not have the
shortcut "if nr_running is < 2, pick this cpu"; originally we cycled
through all the nodes, then the cpus, to find the best one.  Adding the
check really helped, but it could hurt a system with even worse
latency, and who knows, a higher nr_running threshold could be better
for the x440.  The nr_running < 2 check really should vary based on the
latency ratio.  That lets the lower-latency systems fall out of the
algorithm more often, giving a shorter codepath and lower kernel time.

So, what I am trying to show is that maybe some of these decisions need
to have several stages, and, based on the latency ratio, some systems
run through x stages while a higher-latency system goes through x+y
stages, and so on:

ratio   how much we do to make a decision
  1     w
  2     w+x
  3     w+x+y
  4     w+x+y+z

The same concept could be used for "aggressiveness" in some algorithms,
like the node balance ratio.  For example, I didn't get good
performance on NUMA-HT until I set it to 1, and the non-NUMA case is
essentially 1, too.

Anyway, something to think about.  This is obviously 2.7 stuff :)

-Andrew
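
To make the sched_best_cpu() idea a bit more concrete, here is a rough
sketch of what a ratio-scaled threshold might look like.  This is only
an illustration: latency_ratio, best_cpu_threshold(), and
full_node_cpu_scan() are made-up names rather than symbols in the
current tree, the function signature is simplified, and the base
constant is arbitrary.

/* remote-to-local latency ratio; ~1 on SMP/Hammer, higher on NUMA-Q.
 * Assumed to be detected at boot and to be at least 1. */
extern unsigned int latency_ratio;

/* the existing full search over nodes, then cpus (assumed helper) */
extern int full_node_cpu_scan(int this_cpu);

static inline unsigned long best_cpu_threshold(void)
{
	/*
	 * Low-latency boxes (ratio near 1) get a high threshold, so the
	 * shortcut fires often and they fall out of the search early.
	 * High-latency boxes get a low threshold, because the precision
	 * of the full scan is worth the extra kernel time there.
	 */
	return 8 / latency_ratio;	/* 8 is an arbitrary base value */
}

static int sched_best_cpu(int this_cpu, unsigned long this_nr_running)
{
	/* short, flat codepath for the common low-latency case */
	if (this_nr_running < best_cpu_threshold())
		return this_cpu;

	/* only the high-latency platforms usually get this far */
	return full_node_cpu_scan(this_cpu);
}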
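
The staged table could turn into something equally simple, where each
stage is progressively more expensive and only the high-ratio platforms
run all of them.  Again just a sketch; the stage_*() helpers are
invented for the example.

extern void stage_w(void);	/* cheapest check, done on every platform */
extern void stage_x(void);
extern void stage_y(void);
extern void stage_z(void);	/* most expensive, NUMA-Q class only */

static void numa_balance_decision(unsigned int ratio)
{
	stage_w();
	if (ratio >= 2)
		stage_x();
	if (ratio >= 3)
		stage_y();
	if (ratio >= 4)
		stage_z();
}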