This is the old problem of deciding affinity based on where the damned
thing ran last.
I once read a scheduler paper from SGI where they actually implemented the
affinity based on some time decay. Another Linux based system, built somewhere
in Germany, actually counted the cache misses since the last context switch and
used a simple Markov model to determine the cache footprint. This is not very
difficult to implement when approximated, but I don't know whether such an
approach would be portable across all architectures.
I like your idea of simply taking the jiffies, but then things might get
screwed up because now we need to compute the second best with affinity and
time decay in mind. Effectively, a local goodness value is not sufficient for
computing this.
Let me think about it over the weekend for a decent solution and whether I can
integrate such a mechanism with our current approach. Mike, you might want to
scratch your head as well.
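To make the time-decay idea concrete, here is a minimal userspace sketch in C. This is purely illustrative: PROC_CHANGE_PENALTY echoes the constant discussed in this thread, but CACHE_DECAY_TICKS, the linear decay, and the function name are assumptions, not anyone's actual implementation.

```c
#include <assert.h>

/* Hypothetical sketch: decay the cache-affinity boost linearly with the
 * time (in jiffies) since the task last ran on this CPU.  The linear
 * shape and the CACHE_DECAY_TICKS constant are invented for illustration. */
#define PROC_CHANGE_PENALTY 15
#define CACHE_DECAY_TICKS   10

static int affinity_boost(unsigned long now, unsigned long last_run)
{
    unsigned long age = now - last_run;   /* unsigned math tolerates wrap */

    if (age >= CACHE_DECAY_TICKS)
        return 0;                         /* cache assumed cold: no boost */
    /* boost shrinks linearly as the cache footprint decays */
    return PROC_CHANGE_PENALTY * (int)(CACHE_DECAY_TICKS - age)
           / CACHE_DECAY_TICKS;
}
```

A candidate's goodness could then be compared as goodness + affinity_boost(), which is exactly where the "second best" bookkeeping gets tricky: the ranking of remote tasks changes over time even when their raw goodness does not.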
Next, with respect to scheduler semantics: Mike and I talked about this right
at the beginning, and we came to the conclusion that we need to prove
improvements under equal semantics, to allow oranges to be compared to oranges.
Is it time to move beyond that point?
Have we proven that we can do better across most applications?
I don't know who is going to make the judgement call that we can/should relax
these strict functional equivalence requirements. I am all in favor. I hope
this is being raised in the upcoming 2.5 brainstorming session.
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
Jun Nakajima <jun@... on 02/09/2001 04:43:44 PM
Sent by: nakajima@...
To: Hubertus Franke/Watson/IBM@...
cc: Mike Kravetz <mkravetz@...>, lse-tech@...
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler
I agree. I think the scope of the values returned by goodness() is not
clear even with the original scheduler - it's neither purely local nor
global, because of PROC_CHANGE_PENALTY.
So I don't think we need to stick to the current scheduler semantics for
non-realtime tasks, especially when supporting larger SMP machines.
Besides the cpu_allowed mask, one thing I would like to add is a check, when
stealing a task from a remote CPU, of whether its cache is still warm there.
We don't want to migrate a task with a warm cache even if it has a high
na_goodness value. One easy and effective way is to maintain the last time
(in jiffies) at which the task was scheduled. If the elapsed time is larger
than some tunable/platform-specific value, then we might want to migrate the
task; otherwise, find another one. I bet this would cause significant
scheduling behavior changes.
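Jun's check could be sketched roughly as follows; the function names and the cache_decay_ticks variable are placeholders for the tunable, platform-specific value he mentions, not actual kernel code.

```c
#include <assert.h>

/* Placeholder for a tunable/platform-specific threshold, in jiffies. */
static unsigned long cache_decay_ticks = 5;

/* A task is considered cache-warm if it ran on its CPU recently. */
static int cache_still_warm(unsigned long now, unsigned long last_scheduled)
{
    return (now - last_scheduled) < cache_decay_ticks;
}

/* Only steal a task whose cache footprint has presumably decayed. */
static int can_steal(unsigned long now, unsigned long last_scheduled)
{
    return !cache_still_warm(now, last_scheduled);
}
```

On a miss (the candidate is still warm), the stealing CPU would move on and "find another one", which is where the second-best bookkeeping discussed above comes in.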
Hubertus Franke wrote:
> Mike, it goes to the heart of the question whether we MUST or NEED to adhere
> to the current scheduler semantics.
> For the low-end SMPs (4-ways or so) it seems OK to me to simply check, when
> trying to steal the thread, whether it can be scheduled or not. If not, you
> move on to the next queue. You are afraid that the cpu_allowed mask will be
> used frequently. Waiting for the lock doesn't happen in the current code;
> you use spin_trylock.
> The problem is, as you point out, that a task with sufficiently high
> goodness to warrant a preemption can be stuck behind this
> "cpu_allowed"-limited task and therefore is never considered.
> Again, at some point, one has to do some hand-waving and state "tough luck".
> As far as I am concerned, the task of the scheduler is to create high
> throughput through the system and provide some degree of fairness. By
> defining an affinity boost, we are already sacrificing fairness to some
> degree.
> I have taken a slightly different angle. I am willing to relax the
> scheduling assumption in order to increase scalability. In particular, I am
> willing to run tasks of lower goodness, as long as I adhere to the RT
> semantics.
> I think that is OK, because the PROC_CHANGE_PENALTY is a first heuristic
> that might not necessarily be right.
> I think when you are approaching larger numbers of CPUs (8+) you need to
> look at partitioning your system from a scheduling point of view. One
> should pretty much deal with the system as if it were a NUMA system.
> I have taken the MQ scheduler and sub-divided it into CPU pools. I have
> already posted regarding this under our latest status report for the
> scheduling:
> Running the chatroom with 30/300 gives the following results. (I will post
> these on Monday on our lse.sourceforge.net/scheduling site for general
> consumption.)
> (See attached file: pre8-chat.pdf)
> In this case, splitting up the scheduler into multiple pools with
> occasional trivial load-balancing shows, at the very high end for the
> chat-room, a 10% improvement over our current MQ. In this case I am
> deciding in reschedule_idle whether to preempt or not, but in schedule I
> only look within my pool.
> There are some potential improvements possible, for instance only looking
> for remote idle cpus on all cpus, but with no preemption outside the
> current pool. I think these mechanisms are worthwhile to look into.
> I think that the usage of cpu_allowed can be tied into this. cpu_allowed
> is only a simple mechanism and not a policy. I think we need to look at
> the policies that will be built for cpu_affinities. I think many of them
> will originate in the NUMA area.
> And doing a pool-based approach seems to be OK.
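The pool partitioning Hubertus describes amounts to a static CPU-to-pool map that schedule() consults when deciding which runqueues to scan. A minimal sketch, assuming fixed equal-sized pools (the pool width and function names are invented for illustration):

```c
#include <assert.h>

#define NR_CPUS   8
#define POOL_SIZE 4   /* assumed pool width; a real system would tune this */

/* Statically partition CPUs into contiguous pools, NUMA-style. */
static int pool_of(int cpu)
{
    return cpu / POOL_SIZE;
}

/* schedule() would only scan runqueues for which same_pool() holds;
 * only load balancing (or idle stealing) would cross pool boundaries. */
static int same_pool(int a, int b)
{
    return pool_of(a) == pool_of(b);
}
```

With this layout, preemption decisions in reschedule_idle can still consider all CPUs, while the schedule() scan stays within the caller's pool, which matches the split described above.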
> Any comments ?
> Hubertus Franke
> email: frankeh@...
> Mike Kravetz <mkravetz@... on 02/09/2001
> 02:26:08 PM
> Sent by: lse-tech-admin@...
> To: lse-tech@...
> Subject: [Lse-tech] cpus_allowed in multi-queue scheduler
> I'm looking at the cpus_allowed field of the task structure
> and trying to determine what is the best way to handle this
> in the multi-queue scheduler.
> As a reminder, the multi-queue scheduler I am working on
> has one runqueue per CPU. In addition, there is a separate
> runqueue lock per CPU which synchronizes access to the
> CPU specific runqueue. In schedule(), the runqueue lock
> associated with the CPU we are currently running on is
> obtained. Then the CPU specific runqueue is scanned looking
> for the task with the highest 'goodness' value. After
> scanning the CPU specific runqueue, we 'take a quick look'
> at the other runqueues to determine if there is a task
> with higher goodness that should be scheduled. Of course,
> this takes CPU affinity into account just like the current
> scheduler. Each CPU specific runqueue data structure has a
> field which contains the maximum 'non-affinity goodness'
> value of all schedulable tasks on that runqueue. Therefore,
> when we 'take a quick look' we are really only looking at
> the task with the maximum 'non-affinity goodness' on a remote
> CPU's runqueue. See the description of the multi-queue
> scheduler at 'http://lse.sourceforge.net/scheduling/mq1.html'
> if you want more details. Now, if we find a task with
> sufficiently high goodness on another CPU specific runqueue,
> we attempt to lock (via spin_trylock) the other runqueue and
> move the task to our runqueue.
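The "quick look" plus trylock steal that Mike describes can be modeled in a few lines. This is a toy userspace version: the lock is a plain flag standing in for spin_trylock, and the struct fields are named after the description above rather than the actual patch.

```c
#include <assert.h>

#define NR_CPUS 4

/* Each per-CPU runqueue advertises the best 'non-affinity goodness'
 * among its schedulable tasks, so remote CPUs can compare candidates
 * without taking the lock. */
struct runqueue {
    int lock;               /* 0 = free, 1 = held; stand-in for a spinlock */
    int max_na_goodness;    /* best stealable goodness on this queue */
};

static int trylock(struct runqueue *rq)
{
    if (rq->lock)
        return 0;
    rq->lock = 1;
    return 1;
}

/* Return the CPU we would steal from, or -1 if no remote task beats
 * 'local_best' or the winning queue's lock is contended. */
static int pick_steal_target(struct runqueue rq[], int this_cpu,
                             int local_best)
{
    int cpu, best_cpu = -1, best = local_best;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == this_cpu)
            continue;
        if (rq[cpu].max_na_goodness > best) {
            best = rq[cpu].max_na_goodness;
            best_cpu = cpu;
        }
    }
    /* only now touch the remote queue's lock, and give up if it is busy
     * rather than spinning on it */
    if (best_cpu >= 0 && !trylock(&rq[best_cpu]))
        return -1;
    return best_cpu;
}
```

The point of the trylock is that a stealing CPU never blocks on a remote runqueue lock; on contention it simply schedules from its own queue.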
> It is during the process of 'stealing' tasks from other
> runqueues where we must be concerned with the cpus_allowed
> field of the task we are stealing. We don't want to steal
> a task if it is not allowed to run on our CPU. However, we
> don't want to wait until we have obtained a lock on a remote
> CPU's runqueue to check the cpus_allowed field and determine
> if we should steal the task.
> My thought is that the runqueue data structure could also
> contain the cpus_allowed field of the task with the maximum
> 'non-affinity goodness' value. Hence, we could 'quickly
> check' this value without obtaining the remote CPU's runqueue lock.
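The proposal of caching the best task's cpus_allowed mask alongside its goodness could look like the following sketch; the field names (na_cpus_allowed, worth_stealing) are illustrative, not taken from the actual patch.

```c
#include <assert.h>

/* Per-CPU runqueue summary: alongside the best non-affinity goodness,
 * cache the cpus_allowed mask of that same task, so a remote CPU can
 * reject an ineligible candidate without taking the queue's lock. */
struct runqueue {
    int           max_na_goodness;
    unsigned long na_cpus_allowed;   /* allowed-CPU mask of the best task */
};

static int worth_stealing(const struct runqueue *remote, int this_cpu,
                          int local_best)
{
    if (remote->max_na_goodness <= local_best)
        return 0;
    /* lock-free eligibility check against the cached mask */
    return (remote->na_cpus_allowed & (1UL << this_cpu)) != 0;
}
```

This captures the concern raised next: the check only sees the single advertised task, so a restricted-mask task at the top can hide perfectly stealable tasks behind it.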
> However, I have some concerns that this design could cause
> some significant scheduling behavior changes if the cpus_allowed
> field/feature is used extensively. Right now, I don't see
> much/any use of this field but I expect that may change in
> the future. Consider the case where the task with the maximum
> 'non-affinity goodness' value has cpus_allowed set such that
> it is limited to a small subset of the system's CPUs. Now,
> other CPUs outside this subset will not be able to steal this
> task. In addition, they will not be able to steal any other
> tasks on this runqueue which may have a sufficiently high
> goodness value. This is because we only keep track of the
> task with the highest goodness value, and in this case it can
> only run on a subset of CPUs. It is obvious that a task which
> is limited to a single CPU should never be identified as a
> candidate to be stolen by another CPU, and this is easy to code.
> However, what about a task limited to 2 CPUs on an 8 CPU system,
> OR 2 CPUs on a 16 or 32 CPU system?
> Any comments?
> Mike Kravetz mkravetz@...
> IBM Linux Technology Center
Jun U Nakajima
Core OS Development
SCO/Murray Hill, NJ
Email: jun@..., Phone: 908-790-2352 Fax: 908-790-2426