From: Hubertus F. <fr...@us...> - 2001-03-16 20:54:51
|
Mike, one possibility would be to run the newest incarnation of the
priority-list based scheduler. I am currently testing the latest version,
verifying that it reaches the same decisions as the vanilla scheduler.
Maybe you can take Ol'Henry for a spindrive, given that St. Patrick's Day
is tomorrow, and see whether your assumptions are correct. I'll send a
preliminary patch to you on Monday. Which kernel are you testing with
right now? 2.4.1 or 2.4.2?

Hubertus Franke

Mike Kravetz <mkr...@se...>@lists.sourceforge.net on 03/16/2001 03:21:35 PM
Sent by: lse...@li...
To: Lse-tech <lse...@li...>
cc:
Subject: [Lse-tech] multi-queue performance degradation analysis

When I first started running the 'Henry' workload, I was concerned because
overall performance dropped after applying the multi-queue scheduler
patch. I was willing to overlook this drop in performance because of DB2's
extensive use of sched_yield(), and the fact that sched_yield() behaves
differently with multiple run queues. Then John Hawkes discovered that
applying the multi-queue scheduler patch reduced AIM7 throughput by about
5%. Just recently, I received an advance copy of DB2 that eliminates the
extensive use of sched_yield(). Running Henry on this version of DB2 also
shows about a 5% drop in performance after applying the multi-queue patch.
What's up?

I then started using the kernel profiler to compare runs before and after
applying the multi-queue patch. As expected, the multi-queue patch reduced
the time spent in the scheduler. Key results are listed below.

Vanilla 2.4.1
-------------
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
88.53    477.50    477.50   405051  1178.86  1178.86  default_idle
 3.31    495.33     17.83                             USER
 1.20    515.79      6.49   714154     9.09    12.54  schedule
 0.53    525.79      2.85   335679     8.49     8.49  process_timeout
 0.09    530.92      0.48   618287     0.78     0.78  __wake_up

2.4.1 Multi-queue
-----------------
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
80.75    472.59    472.59   215426  2193.74  2193.74  default_idle
 6.06    508.05     35.46                             USER
 0.22    563.29      1.26   235348     5.35     6.30  schedule
 0.08    570.83      0.47  1227170     0.38     0.38  __wake_up
 0.03    577.40      0.19    38876     4.89     4.89  process_timeout

Notice the big difference in the amount of time spent in user mode. My
only thought at this point is that one of two things could be causing the
performance degradation:

1) A bug in the multi-queue implementation.
2) Behavior differences between the multi-queue scheduler and the current
   scheduler.

Being the typical programmer, I discounted the possibility of bugs in my
code and began exploring the differences in scheduling behavior. As you
may recall, the multi-queue scheduler is designed to emulate the behavior
of the existing scheduler as much as possible. It starts to deviate from
that behavior when lock contention is experienced. Instead of acquiring a
global lock and making decisions with global knowledge, it acquires local
locks, makes a local decision, and then tries to acquire additional locks
to make a more global decision. When it experiences contention on these
additional locks, it simply gives up and reverts to the best local
decision it can make.
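In rough userspace C, that try-lock fallback looks something like the
sketch below. This is only an illustration of the pattern, not code from
the patch: struct runqueue, best_task(), and the goodness field are
invented names for the example.

/*
 * Illustrative sketch of "lock local, try-lock remote, fall back on
 * contention" -- NOT the actual multi-queue patch code.  Queues are
 * assumed to have been set up with pthread_spin_init().
 */
#include <pthread.h>
#include <stddef.h>

struct task {
        int goodness;
        struct task *next;
};

struct runqueue {
        pthread_spinlock_t lock;
        struct task *head;
};

static struct task *best_task(struct runqueue *rq)
{
        struct task *t, *best = NULL;

        for (t = rq->head; t; t = t->next)
                if (!best || t->goodness > best->goodness)
                        best = t;
        return best;
}

/* Pick the next task for 'local', peeking at one remote queue. */
struct task *pick_next(struct runqueue *local, struct runqueue *remote)
{
        struct task *best;

        pthread_spin_lock(&local->lock);   /* always wait for the local lock */
        best = best_task(local);           /* the best local decision */

        /* Best-effort global decision: never spin on a remote lock. */
        if (pthread_spin_trylock(&remote->lock) == 0) {
                struct task *r = best_task(remote);

                if (r && (!best || r->goodness > best->goodness))
                        best = r;          /* the real patch would migrate it */
                pthread_spin_unlock(&remote->lock);
        }
        /* If the try-lock fails, we simply run the best local task, even
         * if a better candidate exists on the remote queue. */

        pthread_spin_unlock(&local->lock);
        return best;
}

How often that give-up path actually fires is exactly what the lock
metering below measures.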
To explore this theory further, I ran selected portions of Henry with lock
metering enabled. Here are the results for the runqueue lock(s). Sorry for
the 'wide' text.

Vanilla 2.4.1
-------------
SPINLOCKS           HOLD             WAIT
  UTIL     CON    MEAN (  MAX )   MEAN (  MAX )    TOTAL   NOWAIT    SPIN  REJECT  NAME
 4.66%  68.30%   2.5us(  45us)   7.7us( 203us)  1399137   443539  955598       0  runqueue_lock
 0.29%  16.26%   2.6us(  45us)   4.4us( 156us)    86979    72835   14144       0    __wake_up+0x5c
 0.00%   0.90%   0.4us( 3.4us)   7.3us(  27us)     3571     3539      32       0    __wake_up_sync+0x5c
 0.00%   4.95%   0.2us( 1.4us)   2.9us(  31us)      667      634      33       0    deliver_signal+0x58
 1.59%  80.99%   3.0us(  27us)   5.8us(  54us)   395522    75186  320336       0    process_timeout+0x14
 0.00%  26.53%   3.2us( 8.7us)   6.7us(  26us)       49       36      13       0    schedule_tail+0x6c
 2.71%  68.96%   2.3us(  29us)   8.8us( 203us)   891527   276692  614835       0    schedule+0xd0
 0.03%  48.89%   2.0us(  31us)   5.4us(  99us)    11601     5929    5672       0    schedule+0x49c
 0.00%  11.55%   0.8us( 9.9us)    15us(  61us)      658      582      76       0    schedule+0x55c
 0.04%   5.34%   3.5us(  29us)   7.6us(  83us)     8563     8106     457       0    wake_up_process+0x14

2.4.1 Multi-queue
-----------------
SPINLOCKS           HOLD             WAIT
  UTIL     CON    MEAN (  MAX )   MEAN (  MAX )    TOTAL   NOWAIT    SPIN  REJECT  NAME
 0.26%   7.70%   2.1us(  41us)   2.8us(  46us)    86657    79982    6675       0  __wake_up+0x84
 0.00%   0.03%   0.6us( 4.4us)   1.1us( 1.1us)     3065     3064       1       0  __wake_up_sync+0xa8
 0.00%   0.00%   0.3us( 1.2us)     0us              480      480       0       0  deliver_signal+0xa8
 0.13%  13.35%   2.8us(  46us)   3.3us(  18us)    34798    30151    4647       0  process_timeout+0x48
 0.00%   6.67%   6.2us(  24us)   5.6us( 5.8us)       30       28       2       0  schedule_tail+0xa4
 0.79%   4.32%   2.5us(  46us)   3.1us(  20us)   234875   224737   10138       0  schedule+0x130 (routine entry)
 0.05%  52.72%   1.1us(  15us)     0us            29053    13736       0   15317  schedule+0x530 (try_lock)
 0.04%  43.10%   2.0us(  34us)   4.3us(  38us)    12605     7172    5433       0  schedule+0xc9c (__sched_tail)
 0.00%  43.51%   1.3us( 6.1us)   4.6us(  16us)     2066     1167     899       0  schedule+0xcf0 (__sched_tail)
 0.00%   0.00%   1.7us( 6.8us)     0us              564      564       0       0  schedule+0xe3c (recalculate)
 0.04%   0.41%   3.9us(  42us)   3.8us(  18us)     8129     8096      33       0  wake_up_process+0x48
 0.06%  11.23%   1.5us(  20us)     0us            28591    25380       0    3211  reschedule_idle+0x5b8

Note that the runqueue lock is highly contended in this run (68.30% for
the unmodified kernel). It is also interesting to note the failure rate of
spin_trylock() from within schedule. When schedule calls spin_trylock, it
believes there is a task on another CPU-specific runqueue with a
(sufficiently) higher goodness value than any task on the local runqueue.
We fail to acquire this remote runqueue lock 52% of the time (15317
rejects out of 29053 attempts). Measured against the 234875 total calls to
schedule, this suggests that approximately 6% of the time schedule will
not run the task with the highest goodness value system-wide. In addition,
note that reschedule_idle fails to preempt what it thinks is a
low-priority task running on a remote CPU 11% of the time (3211 rejects
out of 28591 attempts).
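For reference, the 'goodness value' being compared here comes from
goodness() in 2.4 kernel/sched.c. Quoted from memory and simplified (so
check the real source for exact details), it looks roughly like this:

/* Roughly goodness() from 2.4 kernel/sched.c, from memory and
 * simplified; higher is better, 0 means the quantum has expired. */
static inline int goodness(struct task_struct *p, int this_cpu,
                           struct mm_struct *this_mm)
{
        int weight = -1;

        if (p->policy & SCHED_YIELD)
                goto out;                       /* task asked to yield */

        if (p->policy == SCHED_OTHER) {
                weight = p->counter;            /* remaining timeslice */
                if (!weight)
                        goto out;
#ifdef CONFIG_SMP
                if (p->processor == this_cpu)
                        weight += PROC_CHANGE_PENALTY; /* cache-affinity bonus */
#endif
                if (p->mm == this_mm || !p->mm)
                        weight += 1;            /* no mm switch needed */
                weight += 20 - p->nice;
                goto out;
        }

        /* Real-time tasks always outrank SCHED_OTHER. */
        weight = 1000 + p->rt_priority;
out:
        return weight;
}

Note the PROC_CHANGE_PENALTY affinity bonus: it already biases the
comparison toward tasks that last ran on the deciding CPU, which means a
purely local choice is often close to the global one anyway.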
I'm not sure whether this deviation in behavior is what is causing the
performance degradation. If it is, though, it might suggest that for these
workloads, making the best 'global' scheduling decision, even at the
expense of more spinning on locks, gives the highest level of performance.

I suspect few people have taken the time to read this far. However, if you
have any comments or suggestions, I would be happy to hear them.

--
Mike Kravetz
mkr...@se...
IBM Linux Technology Center
|

From: Mike K. <mkr...@se...> - 2001-03-16 22:06:16
|
On Fri, Mar 16, 2001 at 03:58:47PM -0500, Hubertus Franke wrote:
> Maybe you can take Ol'Henry for a spindrive, given that
> St. Patrick's Day is tomorrow, and see whether your
> assumptions are correct.
> I'll send a preliminary patch to you on Monday.
> Which kernel are you testing with right now?
> 2.4.1 or 2.4.2?

The version doesn't matter. I applied the current priority-list patch from
the web page to 2.4.1 and got a system deadlock when running Ol'Henry. I
got the same type of deadlock on both 2.4.1 and 2.4.2 when running
standard kernels with the profiling patches applied. I did not see this
deadlock when running with the previous version of DB2 (the one that still
did the sched_yields), so it looks like this new version of DB2 is
stressing the system in new and interesting ways. I'll look into the
deadlocks.

-
Mike Kravetz
mkr...@se...
IBM Linux Technology Center
|