From: Hubertus F. <fr...@us...> - 2001-01-23 19:38:05
|
Well said, John. That is exactly our position: we have MQ in mind as the "scalable solution", with multiple runqueues each protected by an individual lock. This should bring down lock contention. The priority scheduler is orthogonal to this: it is a queue-organization feature that might reduce lock hold time. Ideally we want to combine the two.

What is constantly pointed out on these lists is that the run-queue is not heavily populated, hence the lock hold time is rather small. However, as the number of cpus goes up, one can safely assume that the overall load on the system will scale accordingly. If the lock hold time scales with the number of runnable processes, then we have more or less a quadratic problem. Now we need to establish at what load the overhead of the priority or multiqueue solution pays off, i.e., where the break-even point is compared to the vanilla kernel.

We are almost done with the Chatroom measurements. We already posted some numbers for the sched_yield_test, which basically gives you the worst-case scenario in terms of lock contention. The entire data set should be ready by next week's LWE.

What would be interesting is a patch that records the length of the run-queue every time we run through it and keeps track of it. It is trivial to write and would let us build a histogram across various workloads. I suspect "0" will get the biggest hit. Would people be willing to run this for a while so I can collect the data? Maybe it would shed some light on this.

Hubertus Franke
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

"John Hawkes" <ha...@en...>@lists.sourceforge.net on 01/23/2001 02:12:25 PM
Sent by: lse...@li...
To: <lse...@li...>
Subject: Re: [Lse-tech] Re: more on scheduler benchmarks

From: "Andrew Morton" <an...@uo...>
...[snip]...
> Applying timepegs, plus schedule-timer.patch (attached) reveals that
> vanilla schedule() takes 32 microseconds with 100 tasks on the
> runqueue, and 4 usecs with an empty runqueue.
...[snip]...
> runqueue length    microseconds (500MHz PII)
> ...
>  64                25
> 128                44
>
> Seems surprisingly slow?

What greatly exacerbates the problem is if the global runqueue_lock is held during this span of schedule() time and if the desired context-switch rate is high. On a two-cpu [and not very fast] i386 I've seen context-switch rates of 10,000-20,000/second. This is obviously going to waste lots of cpu cycles in the spinlock waits. An 8p SMP is going to scale even worse. And the same or larger NUMA machine with a global runqueue_lock exhibits distinctly unfair spinlock contention (i.e., starvation of the cpus "farthest" from the runqueue_lock physical memory).

I believe the only effective solution for [large] NUMA systems is to reduce the contention on the global runqueue_lock by using multiple queues with individual spinlocks. Using prioritized runqueues (and the same global runqueue_lock) helps because it reduces the "hold" times, but if you add enough cpus, you'll eventually be back into the same high-contention, high-waste situation.

John Hawkes
ha...@en...
|
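A minimal sketch of the instrumentation Franke proposes above might look like the following, assuming a 2.4-style schedule() that scans the runqueue under runqueue_lock. The names RQ_HIST_BUCKETS, rq_len_hist and record_runqueue_length() are invented for this sketch (they are not from any posted patch), and the /proc export needed to report the histogram back is omitted.

/* Hypothetical runqueue-length histogram.  The caller passes the current
 * runqueue length (e.g. nr_running in 2.4) while runqueue_lock is held,
 * at the top of schedule(). */
#define RQ_HIST_BUCKETS 64

static unsigned long rq_len_hist[RQ_HIST_BUCKETS];

static inline void record_runqueue_length(unsigned long len)
{
        if (len >= RQ_HIST_BUCKETS)
                len = RQ_HIST_BUCKETS - 1;   /* clamp long queues into the last bucket */
        rq_len_hist[len]++;
}

Reading the histogram back over a day of normal use would directly answer whether the typical queue length is "0", "1", or something that grows with the number of cpus.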
From: Hubertus F. <fr...@us...> - 2001-01-23 21:20:18
|
>I believe that what we should aim for is an algorithm with per-cpu
>runqueues, with per-cpu spinlocks, and implemented in such a way that
>low loads aren't horribly inefficient. This gets us over the
>scalability hump with large cpu-count configurations. Then we can
>determine if this big-system algorithm should also implement multiple
>priority-level subqueues -- which for me is of secondary importance.
>That is, priority-level runqueues that continue to use a single global
>runqueue_lock won't scale, period, as we add cpus.

YES, YES, YES and YES. John, we are exactly on the same page.

We advocate multiple queues, each with its own lock (that is what our prototype MQ scheduler does right now). That has always been Mike's and my final goal, provided we can show that over a large set of apps and scenarios MQ provides better performance. We basically provided a priority-level scheduler on the website to show various paths and potential improvements that can be made other than MQ.

In general, breaking the run queue up into multiple run queues and how we organize each runqueue are orthogonal issues. Again, runqueue organization is a secondary issue compared to multiple queues and the elimination of lock contention (absolutely agreed).
(a) the priority queue (single list in priority order) NOT LOCKED AT THAT
(b) table based priority scheduler LOCKING AT THAT

We thought for a while about switching between queue organizations dynamically, but as you pointed out, that might merely be an exercise rather than something that should move into the kernel, due to the increased complexity. We actually have a load-balancing prototype ready that reduces the interlocking of the MQ scheduler and does some occasional load balancing. Not complex at all. It all boils down to how much you want to relax the ordering imposed by the current scheduler.

I also agree with your approach. Best is to first get a scheduler in that can be turned on through <make xconfig>. I would consider that a major step forward. Have you taken a look at our MQ scheduler patch?

The patch I mentioned is there to get some idea of what typical runqueue distributions look like. I really don't care what you run (database, web server, compile server). If the feedback from people running this during a day shows that the typical runqueue length is "0", "1", or N*num_cpus, that would allow us to make some decisions on whether we have to deal with queue organization (beyond multiple queues) or whether it is a non-issue.

Hubertus Franke
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
|
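For reference, a minimal sketch of the per-cpu-runqueue-with-per-cpu-lock arrangement both writers advocate might look like this. The structure and field names (mq_runqueue, nr_runnable) are invented for illustration and do not reproduce the actual IBM MQ patch.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/cache.h>
#include <linux/threads.h>

/* One runqueue per cpu, each with its own lock (illustrative only). */
struct mq_runqueue {
        spinlock_t       lock;          /* protects this queue only */
        struct list_head runnable;      /* runnable tasks bound to this cpu */
        int              nr_runnable;   /* queue length, used for balancing */
};

static struct mq_runqueue mq_runqueues[NR_CPUS] __cacheline_aligned;
/* a real patch would also pad each entry to its own cache line */

The point is simply that a cpu normally takes only its own lock, so lock contention is confined to cross-cpu wakeups and explicit task stealing rather than every pass through schedule().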
From: John H. <ha...@en...> - 2001-01-23 21:33:04
|
From: "Hubertus Franke" <fr...@us...> > Have you taken a look at our MQ scheduler patch ? Yes. I'm trying to get it to work on the mips64 architecture because that gives me a 32p test platform. Right now the boot hangs halfway up, probably because I haven't made all the required changes for the init of the idle process. John Hawkes ha...@en... |
From: cardente, j. <car...@em...> - 2001-01-24 14:15:06
|
Just some $.02 from a list-lurker...

The per-processor run queue approach sounds good, and I know similar techniques proved useful in the ccNUMA machines I worked on for DG (32-64p IA32 machines). That given, I would suggest that whatever patches you all come up with should include some forward planning for hierarchies of run queues. The idea is that if you have a large machine constructed out of SMP nodes, you would cluster per-processor run queues behind SMP-node run queues. That may make it easier to make scheduling migration decisions, since it will be less costly to "poach" a process from a neighboring CPU in the same SMP node than from a CPU in another node. Determining when to migrate processes between SMP nodes in our NUMA machines was one of the more difficult things to tune and get (mostly) right in DG/UX. Hierarchical run-queues allow tuning the algorithm at various "depths" of the run-queue hierarchy (i.e., CPUs need to be "more" idle to poach a process from another SMP node than from a neighboring CPU on the same node).

Note that the above applies to more than just NUMA machines; it applies to any architecture that builds a shared-memory machine out of multiple SMP busses. Even the current 8p IA32 machines (all based on Intel Profusion chipsets) are sensitive to this, since there is a limited amount of bandwidth available for cross-bus snooping. Locality awareness even on these machines would likely benefit performance (this is actually suggested by Intel's performance optimization papers for Profusion-based systems).

Of course, to avoid the proliferation of per-platform ports (vs. per-architecture ports), this would ideally all be table-driven based on BIOS structures (like the MP table), but that's a whole other issue... still worth planning for.

thanks...
John

----------------------------------------------------------------------------
John Cardente                        car...@em...
Principal Engineer                   508-898-7340
EMC Enterprise Engineering
4400 Computer Dr, Westboro, MA 01580
|
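The hierarchy Cardente suggests can be pictured with a short sketch: per-cpu queues grouped under per-node queues, with separate thresholds for poaching within a node and across nodes. All of the names and the threshold scheme below are invented for illustration; they are not from DG/UX or from any posted patch.

#include <linux/list.h>
#include <linux/spinlock.h>

struct cpu_rq {
        spinlock_t       lock;
        struct list_head runnable;
        int              nr_runnable;
        struct node_rq  *node;                  /* the SMP node this cpu sits on */
};

struct node_rq {
        struct cpu_rq  **cpus;                  /* per-cpu queues on this node */
        int              nr_cpus;
        int              steal_thresh_local;    /* imbalance needed to poach on-node */
        int              steal_thresh_remote;   /* larger imbalance needed off-node */
};

/* An idle cpu poaches from a victim only if the imbalance exceeds the
 * threshold for that "depth" of the hierarchy: cross-node poaching needs
 * a larger imbalance than poaching from a neighbour on the same node. */
static int should_steal(struct cpu_rq *thief, struct cpu_rq *victim)
{
        int thresh = (thief->node == victim->node)
                        ? thief->node->steal_thresh_local
                        : thief->node->steal_thresh_remote;

        return victim->nr_runnable - thief->nr_runnable >= thresh;
}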
From: Hubertus F. <fr...@us...> - 2001-01-24 15:45:45
|
Yes, we are planning on doing hierarchies of run-queues, or something semantically equivalent. In our first Call for Participation we didn't want to rock the boat too much.

In my first cut of an MQ scheduler I actually allowed multiple processors to share a single run queue, with multiple such runqueues in the system. This could be mapped either to a NUMA system or to a large SMP system. Ultimately, given our experience with AIX, Dynix and other systems, we decided that a single queue per cpu might be the right choice, and we concentrated on that first.

With respect to the NUMA/large-SMP stuff: first, I use the word NUMA very liberally. A large SMP, in my opinion, should/could be treated as a NUMA system. For instance, the 8-way 8500R uses two 4-way boards with coherency filters in between. Locality, as you pointed out, can make a difference here. It doesn't make sense to constantly scan all cpus for a better thread to run, etc. So imposing a NUMA view on a large SMP for locality reasons makes sense to me.

I'll post a simple patch to the MQ scheduler this afternoon that does the following:
(a) it splits the set of CPUs into smaller sets;
(b) when looking for "processes to steal" from another processor, the calling cpu only looks within its own set rather than at the entire set of cpus;
(c) for reschedule_idle we have a flag to check either all cpus or only cpus in the same set;
(d) load-balancing code, triggered by a timer, shovels processes from one queue to another based on some load function.

This is a very simplified first step toward what you are describing. I am not advocating it as a solution; I just wanted to show how the current MQ can be modified to get some of the partitioning.

Gerrit had a good point regarding process collocation. Next, when we are truly talking about NUMA architectures, we need to collocate memory allocation with processor scheduling as well.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
|
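Items (b) and (d) of the patch Franke describes might be sketched roughly as follows, building on the hypothetical mq_runqueue layout shown earlier. The cpu_set[] array, CPUS_PER_SET and the dequeue_one() helper are assumptions made for illustration; locking and the actual load function are omitted, and the real patch will differ.

#include <linux/sched.h>
#include <linux/smp.h>

#define CPUS_PER_SET 4                  /* illustrative set size */

static int cpu_set[NR_CPUS];            /* which set each cpu belongs to */

static struct task_struct *dequeue_one(struct mq_runqueue *rq);  /* assumed helper */

/* (b) steal only from cpus in the caller's own set, not from all cpus. */
static struct task_struct *steal_from_set(int this_cpu)
{
        int cpu;

        for (cpu = 0; cpu < smp_num_cpus; cpu++) {
                if (cpu == this_cpu || cpu_set[cpu] != cpu_set[this_cpu])
                        continue;
                if (mq_runqueues[cpu].nr_runnable > 1)
                        return dequeue_one(&mq_runqueues[cpu]);
        }
        return NULL;
}

For (d), a timer handler would run a similar scan periodically and shovel tasks from the longest queue to the shortest one whenever the chosen load function reports a large enough imbalance.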
From: John H. <ha...@en...> - 2001-01-25 01:10:05
|
From: "Hubertus Franke" <fr...@us...> > (b) when looking for "processes to steal" from another processor, the > calling cpu only looks into that set > rather then the entire set of cpus Scott Foehner (sfo...@en...) has begun to look at this issue, also. His approach was to make relatively simple changes to the scheduler GOODNESS algorithm to introduce the concept of an architecture-specific callback that could be used in that calculation, so as to allow a NUMA machine to effect a different policy about what other CPUs are good candidates to migrate from/to vs. a uniform-memory machine vs. a highly-NUMA machine like a cluster with *slow* links. (The idea here is that a system may have multiple CPUs in a local node that share a large cache, as well as share main memory, and if you can't run a process on the previously used CPUx, then you're better off trying to run that process on the nearby CPUy, rather than a distant CPUz.) The architecture-specific callback allows for useful architecture-specific tweaks to the more general architecture-independent multiple-runqueue paradigm. > Gerrit had a good point regarding process collocation. > Next, when we truely are talking about NUMA architectures, we need to > collocate memory allocation > with the processor scheduling as well. Definitely. John Hawkes ha...@en... |
From: Hubertus F. <fr...@us...> - 2001-01-25 16:09:46
|
Isn't that very similar to the pluggable scheduler that the HP folks did? They basically provided a loadable goodness function.

Another idea would be to make PROC_CHANGE_PENALTY a function of the calling cpu and the target cpu, e.g.

#define PROC_CHANGE_PENALTY(i,j)   (SAME_NODE(i,j) ? 15 : 30)

or to stuff the penalties into sched_data at initialization time ....

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
|
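The second alternative Franke mentions, computing the penalties once at initialization rather than in the hot path, might look like the sketch below. The proc_change_penalty[][] table and init function are illustrative assumptions, SAME_NODE() is the hypothetical macro from the message above, and the values 15 and 30 are only examples.

#include <linux/init.h>
#include <linux/threads.h>

static int proc_change_penalty[NR_CPUS][NR_CPUS];

static void __init init_change_penalties(void)
{
        int i, j;

        for (i = 0; i < NR_CPUS; i++)
                for (j = 0; j < NR_CPUS; j++)
                        proc_change_penalty[i][j] = SAME_NODE(i, j) ? 15 : 30;
}

The scheduler would then index proc_change_penalty[this_cpu][p->processor] instead of a single compile-time constant, keeping the per-decision cost at one table lookup.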
From: John H. <ha...@en...> - 2001-01-23 20:53:43
|
From: "Hubertus Franke" <fr...@us...> ... > What is constantly pointed out on these lists is that the run-queue is not > heavily populated hence the lock hold time > is rather small. ... > Now we need to establish, at what point the overhead we need to establish > for the priority or multiqueue solution where the actual > break-even point is as compared to vanilla kernel. ... > What would be interesting is a patch that registers the length of the > run-queue every time we run through it and keeps track of it. ... Are you saying that your hope is to devise a runqueue algorithm that dynamically shifts from the vanilla single runqueue in 2.4.0 to a multiple-queue, multiple-priority runqueue triggered by changes in system load? Or are you saying that you want to define some magic load level that signifies a threshold that, above which, the more complex multi-multi runqueue algorithm should be configured into the kernel? Personally, I don't believe it will be useful to collect runqueue-length statistics, since the runqueue loading depends entirely upon what the particular system is being used for. A general "timesharing, multiuser, software developer" system is going to exhibit different load characteristics than a database server, and those different from a web server, and those different from a "compute server" at a research lab, and those different from ... well, you name it. I believe that what we should aim for is an algorithm with per-cpu runqueues, with per-cpu spinlocks, and implemented in such a way that low loads aren't horribly inefficient. This gets us over the scalability hump with large cpu-count configurations. Then we can determine if this big-system algorithm should also implement multiple priority-level subqueues -- which for me is of secondary importance. That is, priority-level runqueues that continue to use a single global runqueue_lock won't scale, period, as we add cpus. Linus hates kernel complexity. I believe we have a chance of persuading Linus to accept multiple scheduler algorithms if (let's say) each lives in a separate kernel/sched*.c source file and is selected for inclusion in the kernel by kernel/Makefile using some CONFIG_SCHED_BIGSYS-like configuration option, and if we can craft a reasonable Help description of how a kernel builder should decide which algorithm to choose. A single kernel/sched.c file that contains multiple algorithms just won't likely get past Linus, nor will a scheduler algorithm that dynamically adjusts itself as the system load changes (because such an algorithm is both complex and will likely be less efficient for small systems than today's scheduler). John Hawkes ha...@en... |