From: Jun N. <ju...@sc...> - 2001-02-09 21:48:23
I agree. I think the scope of the values returned by goodness() is not clear even with the original scheduler - it is neither purely local nor purely global, because of PROC_CHANGE_PENALTY, which really should be platform-dependent. So I don't think we need to stick to the current scheduler semantics for non-realtime tasks, especially when supporting larger SMP machines, including NUMA.

Besides the cpus_allowed mask, one thing I would like to add is a check of whether a task's cache is still warm on its current CPU before stealing it from a remote runqueue. We don't want to migrate a task with a warm cache even if it has a high na_goodness value. One easy and effective way is to record the last time (in jiffies) the task was scheduled. If the elapsed time is larger than some tunable, platform-specific value, we can migrate the task; otherwise we look for another candidate (a rough sketch follows below). I bet this would cause significant scheduling behavior changes.
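Roughly, I picture the steal-time check looking something like this. This is only an untested sketch: p->last_ran and cache_decay_ticks are names I made up for illustration, not fields in the current MQ patch, and the threshold would presumably be a tunable or per-architecture constant.

    /*
     * Sketch only: is task p worth stealing onto this_cpu?
     * "last_ran" and "cache_decay_ticks" are hypothetical names.
     */
    static inline int steal_candidate(struct task_struct *p, int this_cpu)
    {
            /* Never steal a task that is not allowed to run here. */
            if (!(p->cpus_allowed & (1UL << this_cpu)))
                    return 0;

            /*
             * If the task ran on its current CPU very recently, assume
             * its cache there is still warm and look for another task.
             */
            if (jiffies - p->last_ran < cache_decay_ticks)
                    return 0;

            return 1;
    }

The threshold probably wants to be per-architecture, or even per-node on NUMA, since cache sizes and migration costs vary.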
Hubertus Franke wrote:

> Mike, it goes to the heart of the question whether we MUST or NEED to stick to the current scheduler semantics.
>
> For the low-end SMPs (4-way or so) it seems OK to me to simply check, before trying to steal the thread, whether it can be scheduled on our CPU or not. If not, you move on to the next queue. You are afraid that the cpus_allowed mask will be used frequently. Waiting for the lock doesn't happen in the current code; you use spin_trylock. The problem, as you point out, is that a task with sufficiently high goodness to warrant a preemption can be stuck behind this "cpus_allowed"-limited task and is therefore never considered.
>
> Again, at some point one has to do some hand-waving and state "tough luck". As far as I am concerned, the job of the scheduler is to provide high throughput through the system and some degree of fairness. By defining an affinity boost, we are already sacrificing fairness to some degree.
>
> I have taken a slightly different angle. I am willing to relax the current scheduling assumptions in order to increase scalability. In particular, I am willing to run tasks of lower goodness value as long as I adhere to the RT semantics. I think that is OK, because PROC_CHANGE_PENALTY is a first heuristic that might not necessarily be right.
>
> I think when you are approaching larger numbers of CPUs (8+) you need to look at partitioning your system from a scheduling point of view. One should pretty much deal with the system as if it were a NUMA system.
>
> I have taken the MQ scheduler and subdivided it into CPU pools. I have already posted about this in our latest scheduling status report:
> http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing
>
> Running the chatroom benchmark with 30/300 gives the following results. (I will post these on Monday on our lse.sourceforge.net/scheduling site for general consumption.)
> (See attached file: pre8-chat.pdf)
>
> In this case, splitting the scheduler into multiple pools with occasional trivial load-balancing shows, at the very high end of the chatroom run, a 10% improvement over our current MQ. In this case I am checking in reschedule_idle() whether to preempt or not, but in schedule() I only look within my pool. There are some potential improvements possible, for instance checking all CPUs for idle CPUs but allowing no preemption outside the current pool. I think these mechanisms are worth looking into.
>
> I think that the usage of cpus_allowed can be tied into this. cpus_allowed is only a simple mechanism, not a policy. I think we need to look at the policies that will be built on top of CPU affinities, and I think most of them will originate in the NUMA area. A pool-based approach seems to be OK for that.
>
> Any comments?
>
> Hubertus Franke
> email: fr...@us...
>
> Mike Kravetz <mkr...@se...>@lists.sourceforge.net on 02/09/2001 02:26:08 PM
>
> Sent by: lse...@li...
>
> To: lse...@li...
> cc:
> Subject: [Lse-tech] cpus_allowed in multi-queue scheduler
>
> I'm looking at the cpus_allowed field of the task structure and trying to determine the best way to handle it in the multi-queue scheduler.
>
> As a reminder, the multi-queue scheduler I am working on has one runqueue per CPU. In addition, there is a separate runqueue lock per CPU which synchronizes access to the CPU-specific runqueue. In schedule(), we obtain the runqueue lock associated with the CPU we are currently running on. The CPU-specific runqueue is then scanned looking for the task with the highest 'goodness' value. After scanning the CPU-specific runqueue, we 'take a quick look' at the other runqueues to determine if there is a task with higher goodness that should be scheduled. Of course, this takes CPU affinity into account just like the current scheduler. Each CPU-specific runqueue data structure has a field which contains the maximum 'non-affinity goodness' value of all schedulable tasks on that runqueue. Therefore, when we 'take a quick look' we are really only looking at the task with the maximum 'non-affinity goodness' on a remote CPU's runqueue. See the description of the multi-queue scheduler at 'http://lse.sourceforge.net/scheduling/mq1.html' if you want more details. Now, if we find a task with sufficiently high goodness on another CPU-specific runqueue, we attempt to lock (via spin_trylock) the other runqueue and move the task to our runqueue.
>
> It is during the process of 'stealing' tasks from other runqueues that we must be concerned with the cpus_allowed field of the task we are stealing. We don't want to steal a task if it is not allowed to run on our CPU. However, we don't want to wait until we have obtained the lock on a remote CPU's runqueue to check the cpus_allowed field and determine whether we should steal the task.
>
> My thought is that the runqueue data structure could also contain the cpus_allowed field of the task with the maximum 'non-affinity goodness' value. Hence, we could 'quickly check' this value without obtaining the remote CPU's runqueue lock.
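Interjecting here for concreteness: this is roughly how I picture the 'quick look' described above. It is an untested sketch only; the names max_na_goodness and max_na_cpus_allowed are my guesses, not necessarily what the MQ patch actually uses.

    /*
     * Sketch only; field names are guesses, not the actual MQ patch.
     * One such structure exists per CPU.
     */
    struct runqueue {
            spinlock_t       lock;                  /* per-CPU runqueue lock      */
            struct list_head tasks;                 /* runnable tasks             */
            int              max_na_goodness;       /* best non-affinity goodness */
            unsigned long    max_na_cpus_allowed;   /* cpus_allowed of that task  */
    };

    /*
     * Part of the 'quick look': called while scanning remote runqueues,
     * without holding the remote lock.  Only if this returns 1 do we
     * spin_trylock() the remote runqueue and try to move the task over.
     */
    static inline int worth_stealing(struct runqueue *remote, int this_cpu,
                                     int local_best_goodness)
    {
            if (remote->max_na_goodness <= local_best_goodness)
                    return 0;
            if (!(remote->max_na_cpus_allowed & (1UL << this_cpu)))
                    return 0;
            return 1;
    }

If that is roughly right, the cache-warmth check I mentioned at the top could slot in at the same point, for example by also caching a last-ran timestamp for the max-goodness task in the runqueue structure.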
> However, I have some concerns that this design could cause significant scheduling behavior changes if the cpus_allowed field/feature is used extensively. Right now I don't see much, if any, use of this field, but I expect that may change in the future. Consider the case where the task with the maximum 'non-affinity goodness' value has cpus_allowed set such that it is limited to a small subset of the system's CPUs. Now, CPUs outside this subset will not be able to steal this task. In addition, they will not be able to steal any other tasks on this runqueue which may have a sufficiently high goodness value, because we only keep track of the single task with the highest goodness value, and in this case it can only run on a subset of CPUs. It is obvious that a task which is limited to a single CPU should never be identified as a candidate to be stolen by another CPU, and this is easy to code. However, what about a task limited to 2 CPUs on an 8-CPU system, or 2 CPUs on a 16- or 32-CPU system?
>
> Any comments?
> --
> Mike Kravetz                                 mkr...@se...
> IBM Linux Technology Center
>
> _______________________________________________
> Lse-tech mailing list
> Lse...@li...
> http://lists.sourceforge.net/lists/listinfo/lse-tech
>
> ------------------------------------------------------------------------
>           Name: pre8-chat.pdf
>  pre8-chat.pdf Type: Portable Document Format (application/pdf)
>       Encoding: base64

--
Jun U Nakajima
Core OS Development, SCO/Murray Hill, NJ
Email: ju...@sc..., Phone: 908-790-2352, Fax: 908-790-2426