From: Shailabh N. <na...@us...> - 2001-02-12 21:27:15
As a step towards getting more fine-tuned benchmarks to compare schedulers,
here's a new benchmark that I wrote.
Briefly, it attempts to address some of the problems seen with the
chat_room benchmark and gives some more parameters to play
with.
I'll post some numbers comparing 2.4.1-pre8 and the current MQ-scheduler
using an earlier version of this benchmark soon.
Comments please....
Shailabh Nagar
Enterprise Linux Group, IBM T.J.Watson Research Center,
914-945-2851
reflex 1.0.0 : benchmark for evaluating the Linux scheduler
===========================================================
by Shailabh Nagar (na...@us...)
Based on sched_test_yield by Bill Hartner, bha...@us...
Algorithm :
-----------
The benchmark's goal is to provide enough tunable knobs to test kernel
scheduler performance while introducing minimal overhead and few dependencies.
The program begins with the parent cloning several threads. The threads group
themselves into "active sets". Only one thread from each active set is on the
runqueue at any given time (except when a handoff of the token is performed -
see below). By choosing the number of threads spawned and number of active
sets, the average number of runnable tasks can be controlled.
Each thread executes a "round" consisting of
Either Path I :
1. compute
2. yield
Or Path II :
1. compute
2. send token to successor
3. block (by waiting for token from predecessor)
The probability of following Path I or II
is decided by the input parameter w, which is the weight to be assigned to
reschedule_idle(). Path II causes more calls to reschedule_idle() as it
causes more sleeps/wakeups (on messages in a pipe), while Path I exercises the
schedule() part of the scheduler.
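The round structure above can be sketched as follows. This is an illustrative reconstruction, not the actual reflex source: the helper names `compute()`, `pass_token()`, and `wait_token()` are assumptions standing in for the calibrated busy loop and the pipe-based token handoff described elsewhere in this note.

```c
#include <stdlib.h>
#include <sched.h>

/* Placeholder stubs -- the real reflex code implements these. */
static void compute(void)    { /* calibrated busy loop, ~r microseconds */ }
static void pass_token(void) { /* write one byte into successor's pipe  */ }
static void wait_token(void) { /* block reading predecessor's pipe      */ }

/* Returns 1 (take Path II, the token handoff) with probability w/100,
 * else 0 (take Path I, the yield). */
int take_handoff_path(int w)
{
    return (rand() % 100) < w;
}

/* One round: compute, then either yield (Path I) or hand off the token
 * and block waiting for it to come back around (Path II). */
void do_round(int w)
{
    compute();
    if (take_handoff_path(w)) {
        pass_token();         /* wakes the successor: reschedule_idle() */
        wait_token();         /* sleeps this thread until woken         */
    } else {
        sched_yield();        /* exercises schedule()                   */
    }
}
```

With w=0 every round yields; with w=100 every round sleeps and wakes a peer, which is how the parameter shifts load between schedule() and reschedule_idle().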
Tokens are one-byte messages handed off to other members of the active set
through pipes. Pipes are used instead of semaphores/sockets etc. because lock
contention on them is lower (it's restricted to contention amongst the
readers/writers of the pipe alone - not a system-wide IPC lock as is the case
with semaphores etc.)
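A pipe-based token ring for one active set could look like the sketch below. The layout (one pipe per member, successor chosen modulo the set size) is an assumption for illustration; the real benchmark may wire its sets differently.

```c
#include <unistd.h>

#define SET_SIZE 4              /* members in this active set (illustrative) */

static int ring[SET_SIZE][2];   /* ring[i] is the pipe member i reads from */

/* Create one pipe per set member. */
void ring_init(void)
{
    for (int i = 0; i < SET_SIZE; i++)
        (void)pipe(ring[i]);
}

/* Member `me` hands the one-byte token to its successor... */
void pass_token(int me)
{
    char tok = 't';
    (void)write(ring[(me + 1) % SET_SIZE][1], &tok, 1);
}

/* ...and blocks until its predecessor passes the token on.
 * Only readers/writers of this one pipe contend on its lock. */
char wait_token(int me)
{
    char tok = 0;
    (void)read(ring[me][0], &tok, 1);
    return tok;
}
```

Because only one token circulates per set, at most one member of the set is runnable at a time (modulo the brief non-atomic handoff window noted below).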
The compute phase of a round consists of a small calibrated function being
executed repeatedly for a user-specified amount of time (in the microsecond
range).
The compute part is present to model realistic workloads and to control the
number of scheduler invocations (caused by the yield/block parts of the round).
Slight randomness is introduced in each thread's compute time to avoid the
formation of scheduling patterns.
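A calibrated compute phase of this kind might be implemented as below. The calibration method (timing a batch of unit-work calls with gettimeofday) and the +/-10% jitter range are assumptions for the sketch, not taken from the reflex source.

```c
#include <sys/time.h>
#include <stdlib.h>

static volatile unsigned long sink;   /* volatile defeats optimization */

/* The small unit of work that gets repeated. */
static void unit_work(void)
{
    for (int i = 0; i < 1000; i++)
        sink += i;
}

/* Calibrate once at startup: how many unit_work() calls fit in
 * one microsecond on this machine. */
double calibrate(void)
{
    struct timeval t0, t1;
    const int reps = 10000;

    gettimeofday(&t0, NULL);
    for (int i = 0; i < reps; i++)
        unit_work();
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6
                + (t1.tv_usec - t0.tv_usec);
    return reps / usec;               /* calls per microsecond */
}

/* Burn roughly r microseconds, with a small random jitter (0.90x
 * to 1.10x) so threads don't fall into a fixed scheduling pattern. */
void compute(double calls_per_usec, int r)
{
    double jitter = 0.9 + (rand() % 21) / 100.0;
    long n = (long)(calls_per_usec * r * jitter);
    for (long i = 0; i < n; i++)
        unit_work();
}
```

Calibrating once and then spinning a fixed iteration count keeps timer syscalls out of the measured rounds, so the compute phase adds work without adding scheduler or clock overhead.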
The overall performance metric is the total number of rounds done by all
children, expressed as microseconds/round: the higher the number, the worse
the scheduler's performance. Naturally, this metric should be used carefully
if it is the sole arbiter of scheduler performance, as it may not adequately
reflect "desired" scheduler behaviour.
Runtime parameters :
--------------------
1) w : legal values 0-100
Runtime parameter w determines the weight of "reschedule_idle" as follows :
probability of sched_yield = (100-w)/100
The higher w is, the fewer times sched_yield is called and hence the lower
the number of invocations of schedule(). The relative number of calls
to reschedule_idle and schedule has to be determined experimentally for a given
number of threads by tweaking the w parameter.
2) r : legal values 0..infinity
Runtime parameter r determines the number of microseconds to be spent in the
compute phase.
3) c : number of children/threads to be launched
Currently this has to be even (see below)
4) a : number of active children == number of active sets
The number of threads, out of c, that are on the runqueue at any given time.
If an active set has only one member in it, that thread will never block. For
a very brief interval, there could be two threads on the runqueue from the
same active set (since handoff of a token isn't atomic), but the path length
from token handoff to blocking is short enough that this shouldn't be a
problem.
5) t : Time (in seconds) for which each run of the test should be performed
Several runs are done until the primary metric, microseconds/round, reaches
a 95% confidence level.
6) o : output format
2 is most suitable for processing output using scripts.
7) q : quiet mode (no extraneous printfs)
Improvements/todo's :
---------------------
The effect of parameters w, r, c, and a on the metric us/round has been
tested on a 4-way Pentium-II SMP using simple profiling of the kernel
scheduler. Lower values of w cause schedule() calls to increase, and
values around 75-90 cause the numbers of schedule() and reschedule()
calls to be roughly the same. Higher r values increase us/round.
Values of t in the range 20-50 seconds are adequate to reach 95%
confidence in a reasonably small number of runs - this might need to
increase on 8-ways.
Todos :
- verifying that the dependencies of token passing within a set do not affect
the scheduler (other than the desired effects of causing wakeups, regulating
the number of threads on the runqueue, etc.)
===============================================================================
Sample run :
reflex -c 80 -a 40 -t 40 -w 90 -r 1 -o 2 -q
results in output that looks like
80, 40, 50.65
where the first two numbers identify the parameter set and the last one is
the metric, us/round.
Automated runs (over several c values) can be done by running :
runreflex <directory_for_results>
after modifying the "for tc in ......." line appropriately (to run over the
set of thread counts that the test should cover). The runreflex script needs
some work.
===============================================================================
Comments on assumptions/performance are welcome, as is data from running this
benchmark on different SMP systems.
- Shailabh Nagar
na...@us...
(914) 945-2851
(See attached file: reflex.tar)