From: Hubertus F. <fr...@wa...> - 2001-10-30 16:29:25

Davide, nice analysis. I want to point out that some (not all) of the
stuff is already done in our scalable MQ scheduler
(http://lse.sourceforge.net/scheduling).

What we have:
-------------
Multiple queues, each protected by their own lock to avoid the
contention.
Automatic load balancing across all queues (yes, that creates overhead).
CPU pooling as a configurable means to get from isolated queues to a
fully balanced (global scheduling decision) scheduler.
We also have some initial placement to the least loaded runqueue in the
least loaded pool.

We look at this as a configurable infrastructure....

What we don't have:
-------------------
The removal of PROC_CHANGE_PENALTY with a time-decay cache affinity
definition.

At ALS I will be reporting on our experience with what we have for an
8-way system and a 4x4-way NUMA system (OSDL) wrt early placement and
the choice of best pool size.

You can get an early start at:
http://lse.sourceforge.net/scheduling/als2001/pmqs.ps

Are you going to be at ALS? Maybe we can chat about what the pros and
cons of each approach are and whether we could/should merge things
together. I am very intrigued by the "CPU History Weight" that I see as
a major add-on to our stuff. What I am not so keen about is the fact
that you seem to only do load balancing at fork and idle time. In a
loaded system that can lead to load imbalances. We do a periodic
(configurable) call, which also has some drawbacks.

Another thing that needs to be thought about is the metric used to
determine <load> on a queue. For simplicity, runqueue length is one
indication; for fairness, maybe the sum of nice values would be ok. We
experimented with both and didn't see too much of a difference; however,
measuring fairness is difficult to do.

* Davide Libenzi <da...@xm...> [20011030 00;38]:"
> Proposal For A More Scalable Linux Scheduler
> by
> Davide Libenzi <da...@xm...>
> Sat 10/27/2001
>
> Episode [1]
>
> Captain's diary, tentative 2, day 1 ...
>
> The current Linux scheduler has been designed and optimized
> to be very fast and have a low I/D cache footprint.
> Inside the schedule() function the fast path is kept very short
> by moving less probable code out:
>
>     if (prev->state == TASK_RUNNING)
>         goto still_running;
>     still_running_back:
>     (fast path follows here)
>     return;
>
>     still_running:
>     (slow path lies here)
>     goto still_running_back;
>
> <============== Rest deleted ==============>
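The multi-queue layout described above (one lock-protected runqueue per
CPU, with initial placement on the least loaded runqueue in the least
loaded pool) can be sketched in a few lines of plain C. This is a
hypothetical illustration: the struct and function names are invented,
not the actual MQ patch code.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 8

struct task {
    struct task *next;
    int prio;
};

/* One runqueue per CPU, each with its own lock, so schedule() on one
 * CPU does not contend with the others. */
struct runqueue {
    int lock;            /* stands in for a per-queue spinlock_t */
    int nr_running;      /* queue length: one candidate <load> metric */
    int nice_sum;        /* sum of nice values: a fairness metric */
    struct task *head;
};

static struct runqueue runqueues[NR_CPUS];

/* Initial placement: pick the least loaded runqueue, as the MQ
 * scheduler does within the least loaded pool. */
static int least_loaded_cpu(void)
{
    int cpu, best = 0;

    for (cpu = 1; cpu < NR_CPUS; cpu++)
        if (runqueues[cpu].nr_running < runqueues[best].nr_running)
            best = cpu;
    return best;
}
```

A new task would be enqueued on runqueues[least_loaded_cpu()] under that
queue's lock; the periodic balancing pass mentioned above would compare
nr_running (or nice_sum) across queues.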
From: Davide L. <da...@xm...> - 2001-10-30 17:11:56

On Tue, 30 Oct 2001, Hubertus Franke wrote:

> Davide, nice analysis.
> I want to point out that some (not all) of the stuff is already done
> in our scalable MQ scheduler (http://lse.sourceforge.net/scheduling).
>
> What we have:
> -------------
> Multiple queues, each protected by their own lock to avoid the
> contention.
> Automatic load balancing across all queues (yes, that creates overhead).
> CPU pooling as a configurable means to get from isolated queues to a
> fully balanced (global scheduling decision) scheduler.
> We also have some initial placement to the least loaded runqueue in the
> least loaded pool.
>
> We look at this as a configurable infrastructure....
>
> What we don't have:
> -------------------
> The removal of PROC_CHANGE_PENALTY with a time-decay cache affinity
> definition.
>
> At ALS I will be reporting on our experience with what we have for an
> 8-way system and a 4x4-way NUMA system (OSDL) wrt early placement and
> the choice of best pool size.
>
> You can get an early start at:
> http://lse.sourceforge.net/scheduling/als2001/pmqs.ps

I see the proposed implementation as a decisive cut with the attempt to
have processes instantly moved across CPUs and stuff like na_goodness,
etc. Inside each CPU the scheduler is _exactly_ the same as the UP one.

> Are you going to be at ALS? Maybe we can chat about what the pros and
> cons of each approach are and whether we could/should merge things
> together. I am very intrigued by the "CPU History Weight" that I see
> as a major add-on to our stuff. What I am not so keen about is the
> fact that you seem to only do load balancing at fork and idle time.
> In a loaded system that can lead to load imbalances.
>
> We do a periodic (configurable) call, which also has some drawbacks.
> Another thing that needs to be thought about is the metric used to
> determine <load> on a queue. For simplicity, runqueue length is one
> indication; for fairness, maybe the sum of nice values would be ok.
> We experimented with both and didn't see too much of a difference;
> however, measuring fairness is difficult to do.

Hey, ... that's part of Episode 2, "Balancing the world", where the evil
Mr. MoveSoon fights with Hysteresis for universe domination :)

- Davide
From: Hubertus F. <fr...@wa...> - 2001-10-30 18:30:26

* Davide Libenzi <da...@xm...> [20011030 12;19]:"
> On Tue, 30 Oct 2001, Hubertus Franke wrote:
>
> > Davide, nice analysis.
> > I want to point out that some (not all) of the stuff is already done
> > in our scalable MQ scheduler (http://lse.sourceforge.net/scheduling).
> > [...]
>
> I see the proposed implementation as a decisive cut with the attempt to
> have processes instantly moved across CPUs and stuff like na_goodness,
> etc. Inside each CPU the scheduler is _exactly_ the same as the UP one.

Well, to that extent that is what MQ does too. We do a local decision
first and then compare across multiple queues. In the pooling approach
we limit that global check to some CPUs within the proximity. I think
your CPU history weight could fit into this model as well. We don't
care how the local decision was reached.

There is however another problem that you haven't addressed yet, which
is realtime. As far as I can tell, the realtime semantics require a
strict ordering with respect to each other and their priorities. The
general approach can be either to limit all RT processes to a single
CPU or, as we have done, declare a global RT runqueue.

> > We do a periodic (configurable) call, which also has some drawbacks.
> > Another thing that needs to be thought about is the metric used to
> > determine <load> on a queue. For simplicity, runqueue length is one
> > indication; for fairness, maybe the sum of nice values would be ok.
> > We experimented with both and didn't see too much of a difference;
> > however, measuring fairness is difficult to do.
>
> Hey, ... that's part of Episode 2, "Balancing the world", where the evil
> Mr. MoveSoon fights with Hysteresis for universe domination :)

Well, one has to be careful: if the system is loaded and processes are
long lived rather than coming and going, initial placement and idle-loop
load balancing don't get you very far with respect to decent load
balancing. In these kinds of scenarios one needs a feedback system. The
trick is to come up with an algorithm that is not too intrusive and not
overcorrecting. Take a look at the paper link, where we experimented
with some of these issues. We tolerated a tolerance band around the
runqueue length difference.

> - Davide

:-# Hubertus
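The "global RT runqueue" option mentioned here can be sketched as a
single shared structure consulted before any per-CPU queue, so the
highest-priority runnable RT task always wins globally. A minimal
hypothetical illustration (rt_prio_count and pick_global_rt are invented
names, not the MQ patch API):

```c
#include <assert.h>
#include <string.h>

#define MAX_RT_PRIO 100

/* Count of runnable RT tasks at each priority; in a real scheduler
 * this would be a lock-protected queue shared by all CPUs. */
static int rt_prio_count[MAX_RT_PRIO];

/* Strict global ordering: scan from the highest priority down. */
static int pick_global_rt(void)
{
    int prio;

    for (prio = MAX_RT_PRIO - 1; prio >= 0; prio--)
        if (rt_prio_count[prio] > 0)
            return prio;
    return -1;   /* no RT task runnable: fall back to the local queue */
}
```

Because every CPU consults the same queue, two globally runnable RT
tasks are always ordered by priority, which is exactly the semantic
concern raised in this message.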
From: Davide L. <da...@xm...> - 2001-10-30 18:42:57

On Tue, 30 Oct 2001, Hubertus Franke wrote:

> * Davide Libenzi <da...@xm...> [20011030 12;19]:"
> >
> > I see the proposed implementation as a decisive cut with the attempt
> > to have processes instantly moved across CPUs and stuff like
> > na_goodness, etc. Inside each CPU the scheduler is _exactly_ the
> > same as the UP one.
>
> Well, to that extent that is what MQ does too. We do a local decision
> first and then compare across multiple queues. In the pooling approach
> we limit that global check to some CPUs within the proximity.
> I think your CPU history weight could fit into this model as well.
> We don't care how the local decision was reached.

That's what I don't want to do, at least at every schedule(). The main
purpose of the proposed scheduler is to relax process movement policies.

> There is however another problem that you haven't addressed yet, which
> is realtime. As far as I can tell, the realtime semantics require a
> strict ordering with respect to each other and their priorities. The
> general approach can be either to limit all RT processes to a single
> CPU or, as we have done, declare a global RT runqueue.

Realtime processes, when woken up, fall into reschedule_idle(), which
will either find an idle CPU or force a reschedule due to a favorable
preemption_goodness(). One of the balancing schemes I'm using tries to
distribute RT tasks evenly across CPUs.

> > > We do a periodic (configurable) call, which also has some drawbacks.
> > > [...]
>
> Well, one has to be careful: if the system is loaded and processes are
> long lived rather than coming and going, initial placement and
> idle-loop load balancing don't get you very far with respect to decent
> load balancing. In these kinds of scenarios one needs a feedback
> system. The trick is to come up with an algorithm that is not too
> intrusive and not overcorrecting. Take a look at the paper link, where
> we experimented with some of these issues. We tolerated a tolerance
> band around the runqueue length difference.

I'm currently trying a hysteresis approach with a tunable amount of
hysteresis to observe the different performance/behavior.

- Davide
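The hysteresis approach mentioned here amounts to migrating a task only
when the load gap between two queues exceeds a tunable threshold, so
small transient imbalances are ignored instead of causing ping-ponging.
A minimal sketch; the variable name balance_hysteresis is an assumption
for illustration:

```c
#include <assert.h>

/* Tunable: how much longer the source queue must be before a
 * migration is considered worthwhile. */
static int balance_hysteresis = 2;

static int should_migrate(int src_qlen, int dst_qlen)
{
    return src_qlen - dst_qlen > balance_hysteresis;
}
```

Raising balance_hysteresis keeps tasks on their cache-warm CPU longer
(less overcorrection); lowering it balances more aggressively.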
From: Hubertus F. <fr...@wa...> - 2001-10-30 18:53:46

* Davide Libenzi <da...@xm...> [20011030 13;50]:"
> On Tue, 30 Oct 2001, Hubertus Franke wrote:
>
> > Well, to that extent that is what MQ does too. We do a local decision
> > first and then compare across multiple queues. In the pooling
> > approach we limit that global check to some CPUs within the
> > proximity. I think your CPU history weight could fit into this model
> > as well. We don't care how the local decision was reached.
>
> That's what I don't want to do, at least at every schedule(). The main
> purpose of the proposed scheduler is to relax process movement
> policies.

And I think that is a GOOD design point, no question. One of our next
steps is/was to relax the global decision making process that we pursued
to be able to compare apples with apples. Doing it only every Nth time
might be something useful.

> > There is however another problem that you haven't addressed yet,
> > which is realtime. As far as I can tell, the realtime semantics
> > require a strict ordering with respect to each other and their
> > priorities. The general approach can be either to limit all RT
> > processes to a single CPU or, as we have done, declare a global RT
> > runqueue.
>
> Realtime processes, when woken up, fall into reschedule_idle(), which
> will either find an idle CPU or force a reschedule due to a favorable
> preemption_goodness(). One of the balancing schemes I'm using tries to
> distribute RT tasks evenly across CPUs.

I think that would be a problem. My understanding is that if two RT
processes are globally runnable, then one must run the one with the
higher priority. Am I missing something here?

> I'm currently trying a hysteresis approach with a tunable amount of
> hysteresis to observe the different performance/behavior.

That seems appropriate ... looking forward to seeing that.

> - Davide

-- Hubertus
From: Mike K. <kr...@us...> - 2001-10-30 19:08:28

On Tue, Oct 30, 2001 at 11:52:57AM -0500, Hubertus Franke wrote:
> * Davide Libenzi <da...@xm...> [20011030 13;50]:"
> > On Tue, 30 Oct 2001, Hubertus Franke wrote:
> >
> > > There is however another problem that you haven't addressed yet,
> > > which is realtime. As far as I can tell, the realtime semantics
> > > require a strict ordering with respect to each other and their
> > > priorities. The general approach can be either to limit all RT
> > > processes to a single CPU or, as we have done, declare a global RT
> > > runqueue.
> >
> > Realtime processes, when woken up, fall into reschedule_idle(), which
> > will either find an idle CPU or force a reschedule due to a favorable
> > preemption_goodness(). One of the balancing schemes I'm using tries
> > to distribute RT tasks evenly across CPUs.
>
> I think that would be a problem. My understanding is that if two RT
> processes are globally runnable, then one must run the one with the
> higher priority. Am I missing something here?

It is not just the relative priorities of the realtime tasks, but also
the scheduling policy. SCHED_FIFO (and to some extent SCHED_RR) implies
an ordering within the runqueue for tasks of the same priority. This is
difficult to achieve with multiple runqueues. Most scheduler
implementations I am aware of do something like what you suggested
above.

-- Mike
From: Davide L. <da...@xm...> - 2001-10-30 19:12:19

On Tue, 30 Oct 2001, Hubertus Franke wrote:

> > Realtime processes, when woken up, fall into reschedule_idle(), which
> > will either find an idle CPU or force a reschedule due to a favorable
> > preemption_goodness(). One of the balancing schemes I'm using tries
> > to distribute RT tasks evenly across CPUs.
>
> I think that would be a problem. My understanding is that if two RT
> processes are globally runnable, then one must run the one with the
> higher priority. Am I missing something here?

The only difference is when an RT task is currently running and another
RT task kicks in with an unfavorable preemption_goodness(). In the
current scheduler reschedule_idle() loops through the CPUs to find one
for the incoming RT task, while the proposed scheduler actually doesn't.
What I'm coding is to plug into get_best_cpu() a way to evenly spread RT
tasks between CPUs. But even the current rapid move of an RT task to
another CPU is not that rapid, due to the IPI+schedule() latency. Maybe
it's faster to have the currently running RT task reschedule instead of
the remote CPU, maybe :)

The current IPI method creates very funny/undesirable behavior due to
the IPI+schedule() latency. Watching the schedcnt dump of a lat_ctx run
on my 2-way SMP system, I saw both tasks sitting on the same CPU, with
the other CPU bombarded by reschedule IPIs without being able to catch
one task, due to the latency.

- Davide
From: Mike K. <kr...@us...> - 2001-10-31 00:11:45

Davide,

We had an 8-way sitting idle, so I ran some benchmarks comparing your
scheduler to Vanilla and also to the MQ scheduler at LSE. When I get a
chance, I'd also like to run TPC-H with your scheduler. Here are the
results, for what they are worth.

-- Mike

--------------------------------------------------------------------
reflex - Similar to lat_ctx of LMbench but much more aggressive.
         Keeps more than a single task active. # active tasks
         is 1/2 the total number of tasks. Result is 'round trip'
         time. Less is better.
--------------------------------------------------------------------
# tasks         Vanilla Sched   MQ Sched        Xsched
--------------------------------------------------------------------
   2              6.521           7.429           8.865
   4             11.304           8.581           3.187
   8             13.501           6.907           2.425
  16             15.855           5.299           1.641
  32             17.742           3.267           2.049
  64             20.613           2.960           2.236
 128             26.234           2.983           2.527
--------------------------------------------------------------------
Chat - VolanoMark simulator. Result is a measure of throughput.
       Higher is better.
--------------------------------------------------------------------
Configuration Parms             Vanilla   MQ Sched   Xsched
--------------------------------------------------------------------
10 rooms, 100 messages            69465     400055   229301
10 rooms, 200 messages            86868     354468   187775
10 rooms, 300 messages           103715     363141   205799
10 rooms, 400 messages           133385     380603   195987
20 rooms, 100 messages            50936     396710   216406
20 rooms, 200 messages            74200     385996   197076
20 rooms, 300 messages            95509     402232   210225
20 rooms, 400 messages           101305     437776   215118
30 rooms, 100 messages            42019     376442   247781
30 rooms, 200 messages            42315     384598   222258
30 rooms, 300 messages            52948     413984   231298
30 rooms, 400 messages             6564      46316    24879
--------------------------------------------------------------------
lat_ctx - Context switching component of LMbench. Result is 'round
          trip' time. Less is better. Each cell lists the two values
          reported per run; 'ovr' is the measured overhead for each
          kernel (Vanilla/MQ/Xsched).
--------------------------------------------------------------------
size  #tasks   Vanilla          MQ Sched        Xsched          ovr (V/MQ/X)
--------------------------------------------------------------------
 0k      2       2.79   2.96     4.74   4.83    11.29  11.30    2.30/2.33/2.41
 0k      4       2.88   3.78     5.51   6.72     6.77   8.50    2.34/2.36/2.35
 0k      8       3.08   4.16     5.70   6.27     7.19   8.40    2.36/2.38/2.38
 0k     16       3.08   3.69     7.54   8.34     8.30   8.92    2.43/2.35/2.37
 0k     32       3.28   4.54     6.58   7.99     8.61   8.63    2.46/2.47/2.42
 0k     64       3.78   4.72     7.16   7.54     8.48   8.48    2.48/2.46/2.47
 0k    128       6.60   7.28     7.49   7.96     8.20   8.25    2.53/2.42/2.50
 0k    256       9.91  10.08     8.02   8.29     8.44   8.47    2.59/2.56/2.53
 4k      2       3.31   4.21     5.08   5.13     5.36  10.82    3.93/3.83/3.92
 4k      4       3.95   4.46     6.17   8.66    11.97  11.98    3.95/3.84/3.84
 4k      8       4.45   5.28     6.55   7.51     8.63   8.88    3.91/3.90/3.98
 4k     16       4.60   5.49     9.40   9.61     9.53   9.67    3.95/3.91/3.91
 4k     32       4.87   5.82     8.59   9.17     9.96   9.97    3.96/3.92/3.90
 4k     64      13.99  14.06     8.77   9.59     9.93   9.94    3.98/4.01/3.95
 4k    128      13.26  18.27     9.28  10.11     9.84   9.85    4.02/3.94/3.98
 4k    256      15.40  16.46    11.02  11.52    10.22  10.26    4.07/3.99/4.07
 8k      2       4.54   5.40     6.51   6.62    11.71  11.72    5.39/5.35/5.33
 8k      4       5.83   6.50     6.19   7.26     8.80  11.26    5.41/5.45/5.41
 8k      8       5.81   6.98     8.18   8.81    10.15  10.17    5.39/5.36/5.41
 8k     16       6.86   7.57     9.80  10.20    11.69  11.71    5.42/5.31/5.37
 8k     32       6.77   7.85    10.28  10.90    11.46  11.47    5.42/5.41/5.41
 8k     64      17.32  17.49    10.60  11.24    11.45  11.51    5.48/5.48/5.45
 8k    128      14.74  20.35    11.43  11.97    11.52  11.53    5.51/5.40/5.45
 8k    256      16.36  18.89    13.77  14.46    12.59  12.61    5.54/5.47/5.54
16k      2       8.47   9.62     8.63   8.86     7.11   9.29    8.30/8.37/8.32
16k      4       8.76   9.36     9.42  10.07    11.83  13.28    8.33/8.30/8.29
16k      8       9.22  10.14    13.78  14.21    13.30  13.49    8.33/8.35/8.30
16k     16       9.39  10.08    12.57  12.98    14.30  14.31    8.34/8.29/8.35
16k     32      26.91  28.87    13.26  13.64    14.26  14.35    8.39/8.36/8.35
16k     64      28.82  31.86    13.62  13.79    14.62  14.63    8.40/8.47/8.45
16k    128      21.35  24.18    14.65  15.51    14.59  14.62    8.41/8.38/8.42
16k    256      28.84  30.53    20.76  22.94    25.63  25.79    8.47/8.47/8.46
32k      2      13.80  14.74    15.96  16.10    21.87  21.89   14.27/14.19/14.14
32k      4      13.77  14.48    16.49  18.04    21.99  22.00   14.25/14.16/14.19
32k      8      13.91  14.90    18.63  19.01    20.99  21.00   14.26/14.19/14.24
32k     16      43.47  43.87    18.44  19.40    20.37  20.38   14.26/14.20/14.27
32k     32      40.37  52.34    19.14  19.77    19.48  19.50   14.29/14.24/14.30
32k     64      42.02  46.30    19.84  20.58    20.44  20.52   14.33/14.29/14.31
32k    128      50.81  55.60    28.88  30.81    31.83  31.84   14.30/14.30/14.30
32k    256      98.82  99.65    88.38  90.05    82.76  83.08   14.38/14.35/14.41
64k      2      24.00  24.59    26.08  26.21    32.00  32.01   26.01/26.03/26.00
64k      4      23.98  24.70    26.71  26.86    32.21  32.24   26.09/26.04/26.05
64k      8      78.16  78.41    28.97  30.39    31.07  31.10   26.00/26.05/26.09
64k     16     127.98 136.30    28.90  29.52    30.55  30.61   26.03/26.03/26.04
64k     32      97.96 104.32    31.65  33.72    31.19  31.31   26.11/26.08/26.31
64k     64      99.31 111.11    48.61  53.92    47.59  47.62   26.15/26.08/26.20
64k    128     175.26 179.98   149.98 152.29   166.18 166.44   26.16/26.09/26.12
64k    256     238.61 244.75   227.39 227.64   224.67 224.73   26.21/26.17/26.23
From: Davide L. <da...@xm...> - 2001-10-31 00:59:20

On Tue, 30 Oct 2001, Mike Kravetz wrote:

> --------------------------------------------------------------------
> reflex - Similar to lat_ctx of LMbench but much more aggressive.
>          Keeps more than a single task active. # active tasks
>          is 1/2 the total number of tasks. Result is 'round trip'
>          time. Less is better.
> --------------------------------------------------------------------
> # tasks         Vanilla Sched   MQ Sched        Xsched
> --------------------------------------------------------------------
>    2              6.521           7.429           8.865
>    4             11.304           8.581           3.187
>    8             13.501           6.907           2.425
>   16             15.855           5.299           1.641
>   32             17.742           3.267           2.049
>   64             20.613           2.960           2.236
>  128             26.234           2.983           2.527

Try using the LatSched kernel patch to get the real cycles spent inside
the scheduler. Also a schedcnt dump would help. I'm saying this because
I had to reject some tests ( lat_ctx is one ) since they give very
different process distributions and hence different results. The
numbers going down with increasing runqueue length look strange to me,
and it seems that things like a different process distribution come
into play. With the cycle counter running on the proposed scheduler I
saw the numbers go up with a small slope, but they went up.

> --------------------------------------------------------------------
> Chat - VolanoMark simulator. Result is a measure of throughput.
>        Higher is better.
> --------------------------------------------------------------------
> [...]

How does this test work exactly? Did you measure the run queue length
while it was running? I'm going to modify LatSched to output other info
like rqlen and tsk->counter at switch time. I'd like to have a schedcnt
dump while this test is running. Anyway, this test shows that what I'm
doing now ( working on the balancing schemes ) is not wasted time :)

> --------------------------------------------------------------------
> lat_ctx - Context switching component of LMbench. Result is 'round
>           trip' time. Less is better.
> --------------------------------------------------------------------
> [...]

lat_ctx is ok only on UP machines; try looking at the process
distribution on an SMP vanilla kernel.

- Davide
From: Mike K. <kr...@us...> - 2001-10-31 04:31:13

On Tue, Oct 30, 2001 at 05:06:16PM -0800, Davide Libenzi wrote:
> On Tue, 30 Oct 2001, Mike Kravetz wrote:
> > --------------------------------------------------------------------
> > Chat - VolanoMark simulator. Result is a measure of throughput.
> >        Higher is better.
> > --------------------------------------------------------------------
>
> How does this test work exactly? Did you measure the run queue length
> while it was running? I'm going to modify LatSched to output other
> info like rqlen and tsk->counter at switch time. I'd like to have a
> schedcnt dump while this test is running. Anyway, this test shows that
> what I'm doing now ( working on the balancing schemes ) is not wasted
> time :)

Take a look at our OLS presentation slides starting at:

http://lse.sourceforge.net/scheduling/ols2001/img50.htm

This is basically a message passing (via sockets) benchmark with high
thread counts/runqueue lengths.

I'll kick off the all-important 'kernel build benchmark' and have some
results tomorrow.

-- Mike
From: Davide L. <da...@xm...> - 2001-10-31 04:37:31

On Tue, 30 Oct 2001, Mike Kravetz wrote:

> On Tue, Oct 30, 2001 at 05:06:16PM -0800, Davide Libenzi wrote:
> > How does this test work exactly?
> > [...]
>
> Take a look at our OLS presentation slides starting at:
>
> http://lse.sourceforge.net/scheduling/ols2001/img50.htm
>
> This is basically a message passing (via sockets) benchmark with high
> thread counts/runqueue lengths.

Are threads pre-created at startup, or are they dynamic?

- Davide
From: Mike K. <kr...@us...> - 2001-10-31 04:51:16

On Tue, Oct 30, 2001 at 08:45:02PM -0800, Davide Libenzi wrote:
> On Tue, 30 Oct 2001, Mike Kravetz wrote:
>
> > This is basically a message passing (via sockets) benchmark with high
> > thread counts/runqueue lengths.
>
> Are threads pre-created at startup, or are they dynamic?

They are pre-created at startup.

-- Mike
From: Mike K. <kr...@us...> - 2001-10-31 17:08:01

On Tue, Oct 30, 2001 at 09:29:41PM -0800, Mike Kravetz wrote:
>
> I'll kick off the all-important 'kernel build benchmark' and have some
> results tomorrow.

--------------------------------------------------------------------
mkbench - Times how long it takes to compile the kernel. On this
          8 CPU system we use 'make -j 8' and increase the number
          of makes run in parallel. Result is average build time
          in seconds.
--------------------------------------------------------------------
# parallel makes        Vanilla Sched   MQ Sched        Xsched
--------------------------------------------------------------------
        1                    56              55             61
        2                   105             105            105
        3                   156             156            154
        4                   206             206            205
        5                   257             257            256
        6                   308             308            307

-- Mike
From: Davide L. <da...@xm...> - 2001-10-31 17:51:53

On Wed, 31 Oct 2001, Mike Kravetz wrote:

> --------------------------------------------------------------------
> # parallel makes        Vanilla Sched   MQ Sched        Xsched
> --------------------------------------------------------------------
>         1                    56              55             61
>         2                   105             105            105
>         3                   156             156            154
>         4                   206             206            205
>         5                   257             257            256
>         6                   308             308            307

Thanks Mike, I'm pretty happy with this because it shows that the
current process migration scheme does not suck that much, only a bit.
Kernel builds are not a stress test for the scheduler: when I run a
-j 2 build on my 2-way SMP I get a vmstat cs ~= 500.

I'm currently working on different schemes of process balancing among
CPUs, with a concept of tunable hysteresis, to try to give users the
ability to set up the scheduler behavior for their needs.

What is happening with VolanoMark on the proposed scheduler is that one
task starts creating the requested number of threads, and each newly
created thread is very likely to go to sleep right after its creation.
So suppose the main test task runs on CPU0 ( rqlen[0] == 1 );
get_best_cpu() is going to return 1 because it's probably the first
free CPU. But the new task goes to sleep/"waits for some event" before
the test task on CPU0 calls fork()/clone() again. This creates a heavy
process distribution over CPU1 which, when the real test begins, will
not see any idle reschedule and hence no opportunity to balance.

- Davide
From: Mike K. <kr...@us...> - 2001-10-31 23:14:32

Here's some data from TPC-H. As you can imagine, I can't post any
actual numbers, so I'll just report the % improvement.

*** Disclaimer *** this is not representative of an actual TPC-H run,
as the database is small (500MB) and is run with as much data in memory
as possible. Hence, there is little or no I/O. It is my understanding
that on 'real' TPC-H runs the I/O path becomes a limiting factor and no
scheduler impact can be seen.

--------------------------------------------------------------------
TPC-H - 8 CPU system. Small (500MB) database with most data cached
        and little if any I/O.
--------------------------------------------------------------------
        Vanilla         MQ Sched        Xsched
--------------------------------------------------------------------
        Baseline        +6.2%           +2.4%
--------------------------------------------------------------------

I'm going to try and merge your 'cache warmth' replacement for
PROC_CHANGE_PENALTY into the LSE MQ scheduler, as well as enable the
code to prevent task stealing during IPI delivery. This should still be
significantly different from your design because MQ will still attempt
to make global decisions. Results should be interesting.

-- Mike
From: Davide L. <da...@xm...> - 2001-11-01 00:01:19
|
On Wed, 31 Oct 2001, Mike Kravetz wrote:
> I'm going to try and merge your 'cache warmth' replacement for
> PROC_CHANGE_PENALTY into the LSE MQ scheduler, as well as enable
> the code to prevent task stealing during IPI delivery. This
> should still be significantly different than your design because
> MQ will still attempt to make global decisions. Results should
> be interesting.

I'm currently evaluating different weights for that. Right now I'm using:

    if (p->cpu_jtime > jiffies)
        weight += p->cpu_jtime - jiffies;

which might be too much. Possible solutions:

1)
    if (p->cpu_jtime > jiffies)
        weight += (p->cpu_jtime - jiffies) >> 1;

2)
    int wtable[];

    if (p->cpu_jtime > jiffies)
        weight += wtable[p->cpu_jtime - jiffies];

Speed will favor 1). Another optimization concerns jiffies, which is
volatile and so forces gcc to always reload it:

    static inline int goodness(struct task_struct * p,
                               struct mm_struct *this_mm,
                               unsigned long jiff)

might be better, with the jiffies read hoisted out of the goodness loop.

Mike, I suggest you use the LatSched patch to 1) see how the scheduler is
really performing and 2) understand whether certain tests give certain
results due to weird distributions.

- Davide
From: Hubertus F. <fr...@wa...> - 2001-11-02 14:21:25
|
* Davide Libenzi <da...@xm...> [20011031 18;53]:"
> I'm currently evaluating different weights for that.
> Right now I'm using :
>
>     if (p->cpu_jtime > jiffies)
>         weight += p->cpu_jtime - jiffies;
>
> that might be too much.
> [...]
> Mike I suggest you to use the LatSched patch to 1) know how really is
> performing the scheduler 2) understand if certain test gives certain
> results due wierd distributions.

One more. Throughout our MQ evaluation it was also true that the overall
performance, particularly for large thread counts, was very sensitive to
the goodness function; that's why na_goodness_local was introduced.

-- Hubertus
From: Mike K. <kr...@us...> - 2001-11-02 16:58:15
|
On Fri, Nov 02, 2001 at 07:20:36AM -0500, Hubertus Franke wrote:
>
> One more. Throughout our MQ evaluation, it was also true that
> the overall performance particularly for large thread counts was
> very sensitive to the goodness function, that why a na_goodness_local
> was introduced.

Correct, we did notice measurable differences in performance just from
the additional (unnecessary) checks in goodness. Unfortunately, the
current version of MQ has 3 different (but similar) variations of the
goodness function. This is UGLY, and I intend to clean it up (without
impacting performance, of course :).

-- Mike
From: Davide L. <da...@xm...> - 2001-11-03 22:02:25
|
On Fri, 2 Nov 2001, Hubertus Franke wrote:
> One more. Throughout our MQ evaluation, it was also true that
> the overall performance particularly for large thread counts was
> very sensitive to the goodness function, that why a na_goodness_local
> was introduced.

Yes it is, but the real question is: is it better to save a few clock
cycles in goodness(), or to achieve better process scheduling decisions?

- Davide
From: Hubertus F. <fr...@wa...> - 2001-11-02 14:19:09
|
* Mike Kravetz <kr...@us...> [20011031 18;12]:"
> Here's some data from TPC-H. As you can imagine, I can't post any
> actual numbers, so I'll just report the % improvement.
> [...]
> --------------------------------------------------------------------
> TPC-H - 8 CPU system.  Small (500MB) database with most data
>         cached and little if any I/O.
> --------------------------------------------------------------------
>    Vanilla        MQ Sched       Xsched
> --------------------------------------------------------------------
>    Baseline        +6.2%          +2.4%
>
> I'm going to try and merge your 'cache warmth' replacement for
> PROC_CHANGE_PENALTY into the LSE MQ scheduler, as well as enable
> the code to prevent task stealing during IPI delivery. This
> should still be significantly different than your design because
> MQ will still attempt to make global decisions. Results should
> be interesting.

Excellent idea. I think that's the new idea Davide brings to the table;
MQ itself has been around in our stuff and in the HP scheduler (Rhine).

BTW: the reason why, for reflex, MQ starts getting better with larger
numbers is that the likelihood of moving a task goes down.

-- Hubertus
From: Davide L. <da...@xm...> - 2001-11-03 22:00:19
|
On Fri, 2 Nov 2001, Hubertus Franke wrote:
> Excellent idea, I think, that's the new idea that Davide brings to
> the table. MQ has been around in our stuff and in the HP scheduler
> (Rhine).
> [...]
> BTW: The reason why for the reflex MQ starts getting better with
> larger numbers is because the likelihood of moving a task is going
> down.

As I told Mike in a previous email, the reason for the lower performance
on certain tests is the lack of balancing tests in reschedule_idle().
What happens in these tests is that the main test thread starts creating
a bunch of threads that are very likely to go to sleep very soon,
waiting for the main test thread to issue a start event.
This makes get_best_cpu() return the same CPU for all the threads and,
when the real test begins, there's no idle reschedule ( due to the load )
to enable a decent balancing.
A simple balancing hook in reschedule_idle() has made the scheduler
perform very well in this kind of test.

- Davide
From: Hubertus F. <fr...@wa...> - 2001-11-05 16:00:30
|
Davide, by introducing a balancer hook in reschedule_idle(), you are
actually coming back to the design we have in the PMQS (::= MQ +
pooling). What we experimented with are different scheduling and
rescheduling pools. In your case you really wanted to isolate the
queues.

-- Hubertus

* Davide Libenzi <da...@xm...> [20011103 17;08]:"
> As i told to Mike in a previous email the reason of lower performance
> on certain test is due the lack of balancing tests in
> reschedule_idle().
> [...]
> A simple balancing hook in reschedule_idle() has made the scheduler to
> perform very well in this kind of tests.
From: Davide L. <da...@xm...> - 2001-11-05 18:11:57
|
On Mon, 5 Nov 2001, Hubertus Franke wrote:
> Davide, by introducing a balancer hook in reschedule_idle(), you
> are actually coming back to the design we have in the PMQS
> (::= MQ + pooling). What we experimented around with are different
> scheduling and reschedule pools. In your case you really wanted to
> isolate the queues.

No: no global scheduling decisions are taken inside the main schedule(),
and the hook inside reschedule_idle() is not always triggered. I'm just
trying that approach even if, I hope you know, it's easier to convince
my cat to have a bath than to persuade Linus to change the scheduler :)

My approach leaves the main schedule() path _exactly_ the same as the
old one ( if you exclude the cpu-history ), so complaints like "hey, we
lose 1% of performance when the cpu load is 0.01% ..." should be stopped
before they start. The unchanged code will also make people who do not
understand it feel safer, because the old scheduler,
<irony>which took these guys 5 years to assimilate</irony>, is basically
not changed.

If it's going to show strong performance on big/loaded systems, speaking
with numbers ( which is the only thing that matters ), they should at
least give _good_ reasons/numbers to reject the proposed one. It's far
from sure anyway that it'll have "good" numbers, because it's not easy
to plug sophisticated forms of migration decisions in without impacting
the fast path.

It's not that I don't like the MQ scheduler, and I think the overhead
resulting from the global scheduling decision is not an issue, but I can
see it being attacked for the reason described above, and hence never
merged.

- Davide