From: "Mala Anand" <manand@...>
> I am looking at a scheduler performance degradation problem. I see a
> 20% performance degradation using the O(1) scheduler (RH AS) vs. the
> old scheduler (Red Hat).
What's your workload?
I'm not using RH AS. Rather, I'm using a variation of Erich Focht's
port of 2.5 O(1) to 2.4.19. It's not a recent 2.5 scheduler, which
would appear to have more tuning knobs, though my 2.4.19-based O(1) does
have some 2.5.49 (more or less) fixes. And I'm using the same
load_balance() algorithm that's in 2.5 (although 2.5 has it restructured
into smaller pieces than mine).
> I see that the number of processes in the running state goes down
> with the O(1) scheduler (RH AS) from 15 to 8. I have not pinned down
> the processes
I'm not sure why that would be. Perhaps you could explain your workload
in more detail?
> Are you planning on fixing the load_balance problem? Please let me know.
There are various load_balance() inefficiencies, as I see it. One that
is relatively straightforward to fix is the one where the "busiest" CPU
is busy with pinned unable-to-migrate processes. I think the solution
to that is to have the scan of the runqueues build a list of "busy"
CPUs, rather than simply identify the single "busiest" CPU. Then
the search for a migratable process just continues down that
busiest-to-least-busy CPU list until it finds a suitable process to
snatch (quitting the search when it hits a CPU that doesn't meet the
threshold of being significantly busier than the CPU doing the search).
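Something like the following standalone sketch, just to show the shape
of it (all the names here, struct cpu, find_victim(), the "+ 1"
threshold, are made up for illustration, not taken from the kernel):

#include <stdio.h>
#include <stdlib.h>

#define NCPUS 8

struct cpu {
        int id;
        int load;               /* runqueue length */
        int has_unpinned;       /* has a process allowed to migrate */
};

static int by_load_desc(const void *a, const void *b)
{
        return ((const struct cpu *)b)->load -
               ((const struct cpu *)a)->load;
}

static int find_victim(struct cpu cpus[], int this_load)
{
        /* build the busiest-to-least-busy list */
        qsort(cpus, NCPUS, sizeof(cpus[0]), by_load_desc);
        for (int i = 0; i < NCPUS; i++) {
                /* quit when candidates stop being significantly busier */
                if (cpus[i].load <= this_load + 1)
                        return -1;
                if (cpus[i].has_unpinned)
                        return cpus[i].id;      /* suitable process found */
        }
        return -1;
}

int main(void)
{
        /* CPU 3 is busiest but all its processes are pinned;
         * CPU 5 is next-busiest and has a migratable process */
        struct cpu cpus[NCPUS] = {
                { 0, 1, 1 }, { 1, 1, 1 }, { 2, 1, 1 }, { 3, 5, 0 },
                { 4, 1, 1 }, { 5, 4, 1 }, { 6, 1, 1 }, { 7, 1, 1 },
        };
        printf("steal from CPU %d\n", find_victim(cpus, 1));
        return 0;
}

This picks CPU 5: CPU 3 is busier, but everything on it is pinned, so
the scan keeps walking down the list instead of giving up.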
A more difficult problem to solve requires fiddling with a heuristic
algorithm. Suppose you have 9 compute-bound processes running on an
8-CPU system. With the official 2.4 scheduler, all 9 processes move
around among the 8 available CPUs and get relatively equal access to CPU
cycles. Of course, all this migration on a NUMA system leads to
inefficiencies, since the processes are rarely "close" to their local
memory and thereby suffer longer memory latencies. With the O(1)
scheduler and the current load_balance() algorithm, 7 of the processes
remain rather stable on 7 CPUs and get almost 100% of their CPU cycles,
and 2 processes remain rather stable on the 8th CPU, each getting
about 50% of the CPU cycles. This is a basic "unfairness" in terms of
getting access to CPU cycles, even though each of the processes will
potentially execute with efficient access to local memory (assuming they
started running with local memory).
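For concreteness, here's the arithmetic that causes this stickiness, as
I recall the 2.5-era find_busiest_queue() doing it (reconstructed from
memory, so the exact constants may differ):

#include <stdio.h>

int main(void)
{
        int this_load = 1;      /* each of the 7 single-process CPUs */
        int max_load = 2;       /* the doubled-up 8th CPU */
        int imbalance = (max_load - this_load) / 2;     /* integer 0 */

        /* roughly a 25% imbalance is needed before balancing kicks in */
        if (imbalance < (max_load + 3) / 4)
                printf("no migration: imbalance %d < threshold %d\n",
                       imbalance, (max_load + 3) / 4);
        return 0;
}

The integer imbalance computes to 0, below the threshold, so none of
the 7 one-process CPUs ever pulls the 9th process off the doubled-up
CPU.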
So how do we make this more "fair"? I believe the wrong thing to do is
to relax the migration threshold and allow more migrations. All that
does is equalize the apparent CPU cycles of the processes. Instead, I
believe a better solution is to employ something like Erich Focht's
"NUMA-aware" scheduler changes and to relax the migration threshold
within a common-phys-memory node. So, using the above example, suppose
the 8p system consists of two 4p nodes (where a "node" is defined as
having all 4 CPUs see one minimum-latency memory). Then we'd see 4 of
those compute-bound processes sitting on 4 CPUs in one node, each
getting about 100% of their CPU cycles, and the remaining 5 processes
would migrate around on the other 4p node, each getting about 80% of a
CPU's cycles. This still isn't completely "fair" in that we don't have
all 9 processes getting near-identical CPU cycles, but the unfairness in
the current O(1) load_balance() is 2x (some processes getting 100% of a
CPU, 2 processes each getting 50%), and the unfairness in this
NUMA-aware load_balance() would be only 1.25x.
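To make that two-level idea concrete, a standalone sketch (the
should_migrate() helper and its constants are hypothetical, not Erich's
actual patch):

#include <stdio.h>

static int should_migrate(int src_load, int dst_load, int same_node)
{
        if (same_node)
                /* relaxed: any imbalance moves work, so the 5 processes
                 * rotate over the 4 CPUs of their node */
                return src_load > dst_load;
        /* strict: cross node boundaries only for a big imbalance,
         * keeping processes near their local memory */
        return src_load > dst_load + 2;
}

int main(void)
{
        printf("intra-node, 2 vs 1: %d\n", should_migrate(2, 1, 1)); /* 1 */
        printf("inter-node, 2 vs 1: %d\n", should_migrate(2, 1, 0)); /* 0 */

        /* 5 processes sharing a 4-CPU node get 4/5 = 80% of a CPU each,
         * so the unfairness is 100/80 */
        printf("unfairness: %.2fx\n", 100.0 / 80.0);    /* 1.25x */
        return 0;
}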