From: Hubertus F. <fr...@us...> - 2001-02-09 22:03:55
Yikes... I am surprised you don't see better improvements. Mike updated the MQ patch sometime 2-3 weeks ago due to the problem that Jun found. You might want to check whether your patch and the latest one are the same.

As for the pools, I'll tell you what: since the call for participation calls for experimentation, I shall follow that call. Let me go ahead and propose the following scheduler semantics and make the few modifications to the MQ+pool+LB code that I already have. I can probably code this up this Sunday, and we can have some scalability numbers+curves on Tuesday or so for a 1-8-way system over some reasonable workload.

Currently there are two paths in the system to consider, with the following very rough algorithm:

    reschedule_idle()
        check_all_remote_cpus for preemptability (affinity impaired)
        (build stacklist)

    schedule()
        check for RT
        find best in local Q
        find better in remote Q (affinity impaired)

Let's replace/integrate that with the concept of the pool (a rough C sketch of these paths follows at the end of this note):

    reschedule_idle()
        check_all_remote_cpus P such that
            if P in same pool
                check for idle and for preemptability (local-pool-affinity-boost)
            else
                check only for idle (with preference to local idles)

    schedule()
        check for RT
        find better in local Q (affinity impaired)
        if (next-would-be-idle)
            find better in pool remote Qs

This makes sure that no CPU sits idle while there is runnable work, but we only preempt in our local pool. We let load balancing take care of the inequalities between the queues.

One problem I see right now is that in such an algorithm the "recalculate mechanism" might be broken. Every time a local pool drops to the (c==0) condition, it would initiate a global recalculate, thus affecting other pools and queues as well. One possibility here is to filter the recalculate loop based on the pool (also sketched below).

Load balancing right now is trivial, namely we try to average all runqueue lengths (see the short sketch below). Better algorithms could be deployed, or this beast could be made a loadable module.

Andrea Arcangeli had a slightly different design for the above, where he put a separate scheduler with separate data into every NODE_DATA for NUMA purposes. The difference between the above and his approach is that in reschedule_idle() he probably doesn't check for CPUs on remote nodes being idle.

On a different note: I integrated a priority scheduler (multiple lists based on na_goodness) with the MQ to see whether moving to a different queue organization would make a difference at very high thread counts. The resulting scheduler switches dynamically between priority lists and a single list per runqueue in the MQ (sketched below). I instrumented the scheduler to count the various occurrences. When measuring a kernel build, the results are not very encouraging, i.e. the savings I get from not having to traverse the entire runqueue but only a few priority levels seem to be offset by skipping through the various priority levels.
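To make the pool semantics above a bit more concrete, here is a minimal, self-contained C sketch of the reschedule_idle() decision. Everything in it (struct cpu_info, pool_of(), pick_cpu(), the constants) is made up for illustration; this is not the actual MQ patch code:

    #define NCPUS          8
    #define CPUS_PER_POOL  4

    struct cpu_info {
        int curr_is_idle;     /* is the idle thread running on this cpu? */
        int curr_goodness;    /* goodness of the task running there */
    };

    static struct cpu_info cpu[NCPUS];

    static int pool_of(int c) { return c / CPUS_PER_POOL; }

    /* Pick a target cpu for a woken task with goodness 'g' whose last
       cpu was 'last'.  Idle cpus are taken anywhere (local-pool idles
       preferred); preempting a running task is considered only inside
       the local pool. */
    static int pick_cpu(int last, int g)
    {
        int c, idle = -1, preempt = -1;

        for (c = 0; c < NCPUS; c++) {
            if (cpu[c].curr_is_idle) {
                if (idle < 0 || pool_of(c) == pool_of(last))
                    idle = c;          /* prefer a local-pool idle */
            } else if (pool_of(c) == pool_of(last) &&
                       g > cpu[c].curr_goodness) {
                preempt = c;           /* local-pool preemption only */
            }
        }
        return idle >= 0 ? idle : preempt;   /* -1 means: no target */
    }

The point is simply that idle CPUs are grabbed anywhere in the system, while preemption never crosses a pool boundary.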
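For the recalculate filtering, the idea would look roughly like the sketch below. struct task, the pool field, and nice_ticks are invented stand-ins (nice_ticks playing the role of what NICE_TO_TICKS(p->nice) contributes in the stock recalculate):

    struct task {
        int counter;      /* remaining timeslice ticks */
        int nice_ticks;   /* per-task slice contribution */
        int pool;         /* pool of the runqueue the task is on */
        struct task *next;
    };

    /* Only replenish counters of tasks belonging to 'pool'; other
       pools keep their counters untouched instead of being swept by
       a global recalculate. */
    static void recalculate_pool(struct task *tasks, int pool)
    {
        struct task *t;

        for (t = tasks; t; t = t->next) {
            if (t->pool != pool)
                continue;
            t->counter = (t->counter >> 1) + t->nice_ticks;
        }
    }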
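And the trivial averaging balancer could be as dumb as this; the names and the +1 hysteresis are illustrative only:

    #define NQUEUES 8

    static int rq_len[NQUEUES];

    static int shortest_queue(void)
    {
        int q, best = 0;
        for (q = 1; q < NQUEUES; q++)
            if (rq_len[q] < rq_len[best])
                best = q;
        return best;
    }

    /* Push tasks from any queue that sits more than one task above
       the average length over to the currently shortest queue. */
    static void balance(void)
    {
        int q, total = 0, avg;

        for (q = 0; q < NQUEUES; q++)
            total += rq_len[q];
        avg = total / NQUEUES;

        for (q = 0; q < NQUEUES; q++) {
            while (rq_len[q] > avg + 1) {
                rq_len[shortest_queue()]++;   /* migrate one task ... */
                rq_len[q]--;                  /* ... from q to there  */
            }
        }
    }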
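Finally, here is roughly the queue organization of the priority/single-list hybrid. All names and the switch threshold are illustrative, not the code I actually measured:

    #define NLEVELS   32
    #define SWITCH_AT 64   /* queue length at which we go multi-list */

    struct mytask {
        int na_goodness;
        struct mytask *next;
    };

    struct runqueue {
        int len;
        struct mytask *flat;            /* used while len <= SWITCH_AT */
        struct mytask *level[NLEVELS];  /* used while len >  SWITCH_AT */
    };

    static struct mytask *pick_next(struct runqueue *rq)
    {
        if (rq->len <= SWITCH_AT) {     /* flat mode: scan everything */
            struct mytask *t, *best = NULL;
            for (t = rq->flat; t; t = t->next)
                if (!best || t->na_goodness > best->na_goodness)
                    best = t;
            return best;
        }
        for (int l = NLEVELS - 1; l >= 0; l--)  /* multi-list mode */
            if (rq->level[l])
                return rq->level[l];    /* highest non-empty level */
        return NULL;                    /* queue empty */
    }

The skipping over empty levels in the second loop is exactly the cost that seems to eat the savings from not scanning the flat list.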
Have a good weekend everybody.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003   (fax) 914-945-4425   TL: 862-2003

"John Hawkes" <ha...@en...>@lists.sourceforge.net on 02/09/2001 03:45:04 PM
Sent by: lse...@li...
To: <lse...@li...>
cc:
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler

From: "Hubertus Franke" <fr...@us...>
> Have you managed to get it running on a 32-way SGI machine?

Yes, and I also made some mips64 kernel fixes to get the IBM "chat" benchmark to execute.

I'm not seeing very repeatable results, however, for either the regular scheduler or the MQ scheduler (using the MQ patch from a couple of weeks ago). The MQ scheduler definitely eliminates the contention on the runqueue_lock, but I'm not seeing dramatic improvements in "chat" benchmark results. I may be encountering other lock contention issues, especially with this kind of tcp-intensive workload.

I'm planning to run some other benchmarks on this system, but the trick is to choose a workload that exposes the improvements to the scheduler (i.e., flooding the system with long runqueues and lots of context switches) without having that workload be so single-minded that it stumbles into yet another single-threaded algorithm.

Getting the global runqueue_lock out of the way (using the MQ scheduler) exposed another hot global lock: xtime_lock. We do a write_lock(&xtime_lock) on every CPU at every HZ timer interrupt, and these interrupts tend to be concurrent (or at least they try to be). For a leisurely HZ==100 this isn't spectacularly awful, but a higher HZ would get progressively worse.

John Hawkes
ha...@en...
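To illustrate the xtime_lock pattern John describes (every CPU write-locking one global lock at every tick), here is a minimal stand-in sketch; the lock primitives below are stubs for illustration, not the kernel's implementation:

    typedef struct { volatile int lock; } rwlock_t;

    static void write_lock(rwlock_t *l)   { (void)l; /* spin for exclusive access */ }
    static void write_unlock(rwlock_t *l) { (void)l; /* release */ }

    static rwlock_t xtime_lock;        /* one lock shared by all CPUs */

    void timer_interrupt(void)         /* runs on every CPU, HZ times a second */
    {
        write_lock(&xtime_lock);       /* all CPUs collide here, roughly in phase */
        /* update xtime / wall-clock bookkeeping */
        write_unlock(&xtime_lock);
    }

With N CPUs the write side is taken N*HZ times per second, which is why raising HZ makes the contention progressively worse.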