From: Hubertus F. <fr...@us...> - 2001-02-09 22:03:55
Yikes... I am surprised you don't see better improvements. Mike updated the MQ patch sometime 2-3 weeks ago due to the problem that Jun found. You might want to check whether your patch and the latest one are the same.

As for the pools, I'll tell you what: since the call for participation calls for experimentation, I shall follow that call. Let me go ahead and propose the following scheduler semantics and make the few modifications to the MQ+pool+LB code that I already have. I can probably code this up this Sunday, and we can have some scalability numbers+curves on Tuesday or so for a 1-8-way system over some reasonable workload.

Currently there are two paths in the system to consider, with the following very rough algorithm:

    reschedule_idle()
        check_all_remote_cpus for preemptability (affinity impaired)
        (build stacklist)

    schedule()
        check for RT
        find best in local Q
        find better in remote Q (affinity impaired)

Let's replace/integrate that with the concept of the pool (a rough C sketch of these paths follows at the end of this note):

    reschedule_idle()
        check_all_remote_cpus P such that
            if P in same pool
                check for idle and for preemptability (local-pool-affinity-boost)
            else
                check only for idle (with preference to local idles)

    schedule()
        check for RT
        find better in local Q (affinity impaired)
        if (next-would-be-idle)
            find better in pool remote Qs

This makes sure that no CPU sits idle while there is runnable work, but we only preempt in our local pool. We let load balancing take care of the inequalities between the queues.

One problem I see right now is that in such an algorithm the "recalculate mechanism" might be broken. Every time a local pool drops to the (c==0) condition, it would initiate a global recalculate, thus affecting other pools and queues as well. One possibility here is to filter the recalculate loop based on the pool (also sketched below).

Load balancing right now is trivial, namely we try to average all runqueue lengths (see the short sketch below). Better algorithms could be deployed, or this beast could be made a loadable module.

Andrea Arcangeli had a slightly different design for the above, where he put a separate scheduler with separate data into every NODE_DATA for NUMA purposes. The difference between the above and his approach is that in reschedule_idle() he probably doesn't check for CPUs on remote nodes being idle.

On a different note: I integrated a priority scheduler (multiple lists based on na_goodness) with the MQ to see whether moving to a different queue organization would make a difference at very high thread counts. The resulting scheduler switches dynamically between priority lists and a single list per runqueue in the MQ (sketched below). I instrumented the scheduler to count the various occurrences. When measuring a kernel build, the results are not very encouraging, i.e. the savings I get from not having to traverse the entire runqueue but only a few priority levels seem to be offset by skipping through the various priority levels.
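To make the pool semantics above a bit more concrete, here is a minimal, self-contained C sketch of the reschedule_idle() decision. Everything in it (struct cpu_info, pool_of(), pick_cpu(), the constants) is made up for illustration; this is not the actual MQ patch code:

    #define NCPUS          8
    #define CPUS_PER_POOL  4

    struct cpu_info {
        int curr_is_idle;     /* is the idle thread running on this cpu? */
        int curr_goodness;    /* goodness of the task running there */
    };

    static struct cpu_info cpu[NCPUS];

    static int pool_of(int c) { return c / CPUS_PER_POOL; }

    /* Pick a target cpu for a woken task with goodness 'g' whose last
       cpu was 'last'.  Idle cpus are taken anywhere (local-pool idles
       preferred); preempting a running task is considered only inside
       the local pool. */
    static int pick_cpu(int last, int g)
    {
        int c, idle = -1, preempt = -1;

        for (c = 0; c < NCPUS; c++) {
            if (cpu[c].curr_is_idle) {
                if (idle < 0 || pool_of(c) == pool_of(last))
                    idle = c;          /* prefer a local-pool idle */
            } else if (pool_of(c) == pool_of(last) &&
                       g > cpu[c].curr_goodness) {
                preempt = c;           /* local-pool preemption only */
            }
        }
        return idle >= 0 ? idle : preempt;   /* -1 means: no target */
    }

The point is simply that idle CPUs are grabbed anywhere in the system, while preemption never crosses a pool boundary.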
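For the recalculate filtering, the idea would look roughly like the sketch below. struct task, the pool field, and nice_ticks are invented stand-ins (nice_ticks playing the role of what NICE_TO_TICKS(p->nice) contributes in the stock recalculate):

    struct task {
        int counter;      /* remaining timeslice ticks */
        int nice_ticks;   /* per-task slice contribution */
        int pool;         /* pool of the runqueue the task is on */
        struct task *next;
    };

    /* Only replenish counters of tasks belonging to 'pool'; other
       pools keep their counters untouched instead of being swept by
       a global recalculate. */
    static void recalculate_pool(struct task *tasks, int pool)
    {
        struct task *t;

        for (t = tasks; t; t = t->next) {
            if (t->pool != pool)
                continue;
            t->counter = (t->counter >> 1) + t->nice_ticks;
        }
    }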
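And the trivial averaging balancer could be as dumb as this; the names and the +1 hysteresis are illustrative only:

    #define NQUEUES 8

    static int rq_len[NQUEUES];

    static int shortest_queue(void)
    {
        int q, best = 0;
        for (q = 1; q < NQUEUES; q++)
            if (rq_len[q] < rq_len[best])
                best = q;
        return best;
    }

    /* Push tasks from any queue that sits more than one task above
       the average length over to the currently shortest queue. */
    static void balance(void)
    {
        int q, total = 0, avg;

        for (q = 0; q < NQUEUES; q++)
            total += rq_len[q];
        avg = total / NQUEUES;

        for (q = 0; q < NQUEUES; q++) {
            while (rq_len[q] > avg + 1) {
                rq_len[shortest_queue()]++;   /* migrate one task ... */
                rq_len[q]--;                  /* ... from q to there  */
            }
        }
    }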
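Finally, here is roughly the queue organization of the priority/single-list hybrid. All names and the switch threshold are illustrative, not the code I actually measured:

    #define NLEVELS   32
    #define SWITCH_AT 64   /* queue length at which we go multi-list */

    struct mytask {
        int na_goodness;
        struct mytask *next;
    };

    struct runqueue {
        int len;
        struct mytask *flat;            /* used while len <= SWITCH_AT */
        struct mytask *level[NLEVELS];  /* used while len >  SWITCH_AT */
    };

    static struct mytask *pick_next(struct runqueue *rq)
    {
        if (rq->len <= SWITCH_AT) {     /* flat mode: scan everything */
            struct mytask *t, *best = NULL;
            for (t = rq->flat; t; t = t->next)
                if (!best || t->na_goodness > best->na_goodness)
                    best = t;
            return best;
        }
        for (int l = NLEVELS - 1; l >= 0; l--)  /* multi-list mode */
            if (rq->level[l])
                return rq->level[l];    /* highest non-empty level */
        return NULL;                    /* queue empty */
    }

The skipping over empty levels in the second loop is exactly the cost that seems to eat the savings from not scanning the flat list.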
Have a good weekend everybody.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003   (fax) 914-945-4425   TL: 862-2003

"John Hawkes" <ha...@en...>@lists.sourceforge.net on 02/09/2001 03:45:04 PM
Sent by: lse...@li...
To: <lse...@li...>
cc:
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler

From: "Hubertus Franke" <fr...@us...>
> Have you managed to get it running on a 32-way SGI machine?

Yes, and I also made some mips64 kernel fixes to get the IBM "chat" benchmark to execute.

I'm not seeing very repeatable results, however, for either the regular scheduler or the MQ scheduler (using the MQ patch from a couple of weeks ago). The MQ scheduler definitely eliminates the contention on the runqueue_lock, but I'm not seeing dramatic improvements in "chat" benchmark results. I may be encountering other lock contention issues, especially with this kind of tcp-intensive workload.

I'm planning to run some other benchmarks on this system, but the trick is to choose a workload that exposes the improvements to the scheduler (i.e., flooding the system with long runqueues and lots of context switches) without having that workload be so single-minded that it stumbles into yet another single-threaded algorithm.

Getting the global runqueue_lock out of the way (using the MQ scheduler) exposed another hot global lock: xtime_lock. We do a write_lock(&xtime_lock) on every CPU at every HZ timer interrupt, and these interrupts tend to be concurrent (or at least they try to be). For a leisurely HZ==100 this isn't spectacularly awful, but a higher HZ would get progressively worse.

John Hawkes
ha...@en...
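To illustrate the xtime_lock pattern John describes (every CPU write-locking one global lock at every tick), here is a minimal stand-in sketch; the lock primitives below are stubs for illustration, not the kernel's implementation:

    typedef struct { volatile int lock; } rwlock_t;

    static void write_lock(rwlock_t *l)   { (void)l; /* spin for exclusive access */ }
    static void write_unlock(rwlock_t *l) { (void)l; /* release */ }

    static rwlock_t xtime_lock;        /* one lock shared by all CPUs */

    void timer_interrupt(void)         /* runs on every CPU, HZ times a second */
    {
        write_lock(&xtime_lock);       /* all CPUs collide here, roughly in phase */
        /* update xtime / wall-clock bookkeeping */
        write_unlock(&xtime_lock);
    }

With N CPUs the write side is taken N*HZ times per second, which is why raising HZ makes the contention progressively worse.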