From: Hubertus F. <fr...@us...> - 2001-03-16 20:54:51
|
Mike, one possibility would be to run the newest incarnation of the
priority-list based scheduler. I am currently testing the latest version,
verifying that it reaches the same decisions as the vanilla scheduler.
Maybe you can take Ol'Henry for a spindrive, given that St. Patrick's Day
is tomorrow, and see whether your assumptions are correct. I'll send a
preliminary patch to you on Monday. Which kernel are you testing with
right now? 2.4.1 or 2.4.2?

Hubertus Franke

Mike Kravetz <mkr...@se...>@lists.sourceforge.net on 03/16/2001 03:21:35 PM
Sent by: lse...@li...
To: Lse-tech <lse...@li...>
cc:
Subject: [Lse-tech] multi-queue performance degradation analysis

When I first started running the 'Henry' workload, I was concerned because
overall performance dropped after applying the multi-queue scheduler
patch. I was willing to overlook this drop in performance because of DB2's
extensive use of sched_yield(), and the fact that sched_yield() behaves
differently with multiple run queues. Then John Hawkes discovered that
applying the multi-queue scheduler patch reduced AIM7 throughput by about
5%. Just recently, I received an advance copy of DB2 that eliminates the
extensive use of sched_yield(). Running Henry on this version of DB2 also
shows about a 5% drop in performance after applying the multi-queue patch.
What's up?

I then started using the kernel profiler to compare runs before and after
applying the multi-queue patch. As expected, the multi-queue patch reduced
the time spent in the scheduler. Key results are listed below.

Vanilla 2.4.1
-------------
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
88.53    477.50    477.50   405051  1178.86  1178.86  default_idle
 3.31    495.33     17.83                             USER
 1.20    515.79      6.49   714154     9.09    12.54  schedule
 0.53    525.79      2.85   335679     8.49     8.49  process_timeout
 0.09    530.92      0.48   618287     0.78     0.78  __wake_up

2.4.1 Multi-queue
-----------------
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
80.75    472.59    472.59   215426  2193.74  2193.74  default_idle
 6.06    508.05     35.46                             USER
 0.22    563.29      1.26   235348     5.35     6.30  schedule
 0.08    570.83      0.47  1227170     0.38     0.38  __wake_up
 0.03    577.40      0.19    38876     4.89     4.89  process_timeout

Notice the big difference in the amount of time spent in user mode. My
only thought at this point is that one of two things could be causing the
performance degradation:

1) A bug in the multi-queue implementation.
2) Behavior differences between the multi-queue scheduler and the current
   scheduler.

Being the typical programmer, I discounted the possibility of bugs in my
code and began exploring the differences in scheduling behavior. As you
may recall, the multi-queue scheduler is designed to emulate the behavior
of the existing scheduler as much as possible. It starts to deviate from
that behavior when lock contention is experienced. Instead of acquiring a
global lock and making decisions with global knowledge, it acquires local
locks, makes a local decision, and then tries to acquire additional locks
to make a more global decision. When it experiences contention on these
additional locks, it simply gives up and reverts to the best local
decision it can make.
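In rough userspace C, that try-lock fallback looks something like the
sketch below. This is only an illustration of the pattern, not code from
the patch: struct runqueue, best_task(), and the goodness field are
invented names for the example.

/*
 * Illustrative sketch of "lock local, try-lock remote, fall back on
 * contention" -- NOT the actual multi-queue patch code.  Queues are
 * assumed to have been set up with pthread_spin_init().
 */
#include <pthread.h>
#include <stddef.h>

struct task {
        int goodness;
        struct task *next;
};

struct runqueue {
        pthread_spinlock_t lock;
        struct task *head;
};

static struct task *best_task(struct runqueue *rq)
{
        struct task *t, *best = NULL;

        for (t = rq->head; t; t = t->next)
                if (!best || t->goodness > best->goodness)
                        best = t;
        return best;
}

/* Pick the next task for 'local', peeking at one remote queue. */
struct task *pick_next(struct runqueue *local, struct runqueue *remote)
{
        struct task *best;

        pthread_spin_lock(&local->lock);   /* always wait for the local lock */
        best = best_task(local);           /* the best local decision */

        /* Best-effort global decision: never spin on a remote lock. */
        if (pthread_spin_trylock(&remote->lock) == 0) {
                struct task *r = best_task(remote);

                if (r && (!best || r->goodness > best->goodness))
                        best = r;          /* the real patch would migrate it */
                pthread_spin_unlock(&remote->lock);
        }
        /* If the try-lock fails, we simply run the best local task, even
         * if a better candidate exists on the remote queue. */

        pthread_spin_unlock(&local->lock);
        return best;
}

How often that give-up path actually fires is exactly what the lock
metering below measures.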
To explore this theory further, I ran selected portions of Henry with lock
metering enabled. Here are the results for the runqueue lock(s). Sorry for
the 'wide' text.

Vanilla 2.4.1
-------------
SPINLOCKS           HOLD             WAIT
  UTIL     CON    MEAN (  MAX )   MEAN (  MAX )    TOTAL   NOWAIT    SPIN  REJECT  NAME
 4.66%  68.30%   2.5us(  45us)   7.7us( 203us)  1399137   443539  955598       0  runqueue_lock
 0.29%  16.26%   2.6us(  45us)   4.4us( 156us)    86979    72835   14144       0    __wake_up+0x5c
 0.00%   0.90%   0.4us( 3.4us)   7.3us(  27us)     3571     3539      32       0    __wake_up_sync+0x5c
 0.00%   4.95%   0.2us( 1.4us)   2.9us(  31us)      667      634      33       0    deliver_signal+0x58
 1.59%  80.99%   3.0us(  27us)   5.8us(  54us)   395522    75186  320336       0    process_timeout+0x14
 0.00%  26.53%   3.2us( 8.7us)   6.7us(  26us)       49       36      13       0    schedule_tail+0x6c
 2.71%  68.96%   2.3us(  29us)   8.8us( 203us)   891527   276692  614835       0    schedule+0xd0
 0.03%  48.89%   2.0us(  31us)   5.4us(  99us)    11601     5929    5672       0    schedule+0x49c
 0.00%  11.55%   0.8us( 9.9us)    15us(  61us)      658      582      76       0    schedule+0x55c
 0.04%   5.34%   3.5us(  29us)   7.6us(  83us)     8563     8106     457       0    wake_up_process+0x14

2.4.1 Multi-queue
-----------------
SPINLOCKS           HOLD             WAIT
  UTIL     CON    MEAN (  MAX )   MEAN (  MAX )    TOTAL   NOWAIT    SPIN  REJECT  NAME
 0.26%   7.70%   2.1us(  41us)   2.8us(  46us)    86657    79982    6675       0  __wake_up+0x84
 0.00%   0.03%   0.6us( 4.4us)   1.1us( 1.1us)     3065     3064       1       0  __wake_up_sync+0xa8
 0.00%   0.00%   0.3us( 1.2us)     0us              480      480       0       0  deliver_signal+0xa8
 0.13%  13.35%   2.8us(  46us)   3.3us(  18us)    34798    30151    4647       0  process_timeout+0x48
 0.00%   6.67%   6.2us(  24us)   5.6us( 5.8us)       30       28       2       0  schedule_tail+0xa4
 0.79%   4.32%   2.5us(  46us)   3.1us(  20us)   234875   224737   10138       0  schedule+0x130 (routine entry)
 0.05%  52.72%   1.1us(  15us)     0us            29053    13736       0   15317  schedule+0x530 (try_lock)
 0.04%  43.10%   2.0us(  34us)   4.3us(  38us)    12605     7172    5433       0  schedule+0xc9c (__sched_tail)
 0.00%  43.51%   1.3us( 6.1us)   4.6us(  16us)     2066     1167     899       0  schedule+0xcf0 (__sched_tail)
 0.00%   0.00%   1.7us( 6.8us)     0us              564      564       0       0  schedule+0xe3c (recalculate)
 0.04%   0.41%   3.9us(  42us)   3.8us(  18us)     8129     8096      33       0  wake_up_process+0x48
 0.06%  11.23%   1.5us(  20us)     0us            28591    25380       0    3211  reschedule_idle+0x5b8

Note that the runqueue lock is highly contended in this run (68.30% for
the unmodified kernel). It is also interesting to note the failure rate of
spin_trylock() from within schedule. When schedule calls spin_trylock, it
believes there is a task on another CPU-specific runqueue with a
(sufficiently) higher goodness value than any task on the local runqueue.
We fail to acquire this remote runqueue lock 52% of the time (15317
rejects out of 29053 attempts). Measured against the 234875 total calls to
schedule, this suggests that approximately 6% of the time schedule will
not run the task with the highest goodness value system-wide. In addition,
note that reschedule_idle fails to preempt what it thinks is a
low-priority task running on a remote CPU 11% of the time (3211 rejects
out of 28591 attempts).
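For reference, the 'goodness value' being compared here comes from
goodness() in 2.4 kernel/sched.c. Quoted from memory and simplified (so
check the real source for exact details), it looks roughly like this:

/* Roughly goodness() from 2.4 kernel/sched.c, from memory and
 * simplified; higher is better, 0 means the quantum has expired. */
static inline int goodness(struct task_struct *p, int this_cpu,
                           struct mm_struct *this_mm)
{
        int weight = -1;

        if (p->policy & SCHED_YIELD)
                goto out;                       /* task asked to yield */

        if (p->policy == SCHED_OTHER) {
                weight = p->counter;            /* remaining timeslice */
                if (!weight)
                        goto out;
#ifdef CONFIG_SMP
                if (p->processor == this_cpu)
                        weight += PROC_CHANGE_PENALTY; /* cache-affinity bonus */
#endif
                if (p->mm == this_mm || !p->mm)
                        weight += 1;            /* no mm switch needed */
                weight += 20 - p->nice;
                goto out;
        }

        /* Real-time tasks always outrank SCHED_OTHER. */
        weight = 1000 + p->rt_priority;
out:
        return weight;
}

Note the PROC_CHANGE_PENALTY affinity bonus: it already biases the
comparison toward tasks that last ran on the deciding CPU, which means a
purely local choice is often close to the global one anyway.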
I'm not sure whether this deviation in behavior is what is causing the
performance degradation. If it is, though, it might suggest that for these
workloads, making the best 'global' scheduling decision, even at the
expense of more spinning on locks, gives the highest level of performance.

I suspect few people have taken the time to read this far. However, if you
have any comments or suggestions, I would be happy to hear them.

--
Mike Kravetz
mkr...@se...
IBM Linux Technology Center
|

From: Mike K. <mkr...@se...> - 2001-03-16 22:06:16
|
On Fri, Mar 16, 2001 at 03:58:47PM -0500, Hubertus Franke wrote:
> Maybe you can take Ol'Henry for a spindrive, given that
> St. Patrick's Day is tomorrow, and see whether your
> assumptions are correct.
> I'll send a preliminary patch to you on Monday.
> Which kernel are you testing with right now?
> 2.4.1 or 2.4.2?

The version doesn't matter. I applied the current priority-list patch from
the web page to 2.4.1 and got a system deadlock when running Ol'Henry. I
got the same type of deadlock on both 2.4.1 and 2.4.2 when running
standard kernels with the profiling patches applied. I did not see this
deadlock when running with the previous version of DB2 (the one that still
did the sched_yields), so it looks like this new version of DB2 is
stressing the system in new and interesting ways. I'll look into the
deadlocks.

-
Mike Kravetz
mkr...@se...
IBM Linux Technology Center
|