From: Andrew T. <hab...@us...> - 2003-04-19 01:31:40
On Friday 18 April 2003 18:57, William Lee Irwin III wrote:
> On Fri, Apr 18, 2003 at 04:28:25PM -0700, Gerrit Huizenga wrote:
> > NUMAQ may be less representative today of the latencies, but the
> > current NUMAQ hardware reflects the end of a set of CPU clock rate
> > increases (which roughly double every 6-18 months) and an
> > interconnect that was designed for processors 3-4 years earlier.
>
> IMHO remote access latency or a related statistic (e.g. ratios) should
> be an operational parameter and somehow automatically inferred.

Yes, I agree, and eventually I'd like to see the NUMA-related decisions
made by the kernel have varying degrees of complexity and codepath
length, based on this ratio.  Let me back up: the problem I see with
using just NUMAQ (not really NUMAQ, but using just one platform) is
that we add enough complexity and code path length in the kernel to
benefit that platform, but the same additions may not pay off on a
low-latency system like the x440, or an even lower-latency one like
Hammer.

What we end up doing is increasing the time spent in the kernel on
these complex decisions designed for high-latency systems.  On such a
platform we know that time has a good return: the time we "wasted" in
the kernel is made up by the time we save in the application.  On a
lower-latency system, however, the decision does not need to be as
precise, so simpler logic, a shorter codepath, and lower system time
would be better, because there isn't as much application time to be
saved to make up for the extra system time.

Let me give an example: sched_best_cpu().  At first it did not have the
shortcut "if nr_running is < 2, pick this cpu"; originally we cycled
through all the nodes, then the cpus, to find the best one.  Adding the
check really helped, but it could hurt a system with even worse
latency, and who knows, a higher nr_running threshold could be better
for the x440.  The nr_running < 2 check really should vary based on the
latency ratio.  That lets the lower-latency systems fall out of the
algorithm more often, giving a shorter codepath and lower kernel time.

So, what I am trying to show is that maybe some of these decisions need
to have several stages, and, based on the latency ratio, some systems
run through x stages while a higher-latency system goes through x+y
stages, and so on:

ratio   how much we do to make a decision
  1     w
  2     w+x
  3     w+x+y
  4     w+x+y+z

The same concept could be used for "aggressiveness" in some algorithms,
like the node balance ratio.  For example, I didn't get good
performance on NUMA-HT until I set it to 1, and the non-NUMA case is
essentially 1, too.

Anyway, something to think about.  This is obviously 2.7 stuff :)

-Andrew
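
To make the sched_best_cpu() idea a bit more concrete, here is a rough
sketch of what a ratio-scaled threshold might look like.  This is only
an illustration: latency_ratio, best_cpu_threshold(), and
full_node_cpu_scan() are made-up names rather than symbols in the
current tree, the function signature is simplified, and the base
constant is arbitrary.

/* remote-to-local latency ratio; ~1 on SMP/Hammer, higher on NUMA-Q.
 * Assumed to be detected at boot and to be at least 1. */
extern unsigned int latency_ratio;

/* the existing full search over nodes, then cpus (assumed helper) */
extern int full_node_cpu_scan(int this_cpu);

static inline unsigned long best_cpu_threshold(void)
{
	/*
	 * Low-latency boxes (ratio near 1) get a high threshold, so the
	 * shortcut fires often and they fall out of the search early.
	 * High-latency boxes get a low threshold, because the precision
	 * of the full scan is worth the extra kernel time there.
	 */
	return 8 / latency_ratio;	/* 8 is an arbitrary base value */
}

static int sched_best_cpu(int this_cpu, unsigned long this_nr_running)
{
	/* short, flat codepath for the common low-latency case */
	if (this_nr_running < best_cpu_threshold())
		return this_cpu;

	/* only the high-latency platforms usually get this far */
	return full_node_cpu_scan(this_cpu);
}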
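
The staged table could turn into something equally simple, where each
stage is progressively more expensive and only the high-ratio platforms
run all of them.  Again just a sketch; the stage_*() helpers are
invented for the example.

extern void stage_w(void);	/* cheapest check, done on every platform */
extern void stage_x(void);
extern void stage_y(void);
extern void stage_z(void);	/* most expensive, NUMA-Q class only */

static void numa_balance_decision(unsigned int ratio)
{
	stage_w();
	if (ratio >= 2)
		stage_x();
	if (ratio >= 3)
		stage_y();
	if (ratio >= 4)
		stage_z();
}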