From: Hubertus F. <fr...@us...> - 2001-02-09 22:29:20
This is the old problem of deciding affinity based on where the damned thing ran last. I once read a scheduler paper from SGI where they actually implemented affinity based on some time decay. Another Linux-based system, built somewhere in Germany, actually counted the cache misses since the last context switch and used a simple Markov chain to determine the cache footprint. This is not very difficult to implement when approximated, but I don't know whether such an approach would be portable across all platforms.

I like your idea of simply taking the jiffies, but then the max_na_goodness(cpu) definition might get screwed up, because now we need to compute the second best with the affinity and time decay in mind. Effectively, the local goodness value is not sufficient for computing this variable. Let me think about it over the weekend for a decent solution and whether I can integrate such a mechanism with our current approach. Mike, you might want to scratch your <head> as well.

Next, with respect to scheduler semantics: Mike and I talked about this right at the beginning, and we came to the conclusion that we need to prove improvements with equal semantics, to allow oranges to be compared to oranges. Is it time to move beyond that point? Have we proven that we can do better across most applications? I don't know who is going to make the judgement call that we can/should move beyond these strict functional-equivalence requirements. I am all in favor. I hope this is raised in the upcoming 2.5 brainstorming session.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003  (fax) 914-945-4425  TL: 862-2003

Jun Nakajima <ju...@sc...>@sco.com on 02/09/2001 04:43:44 PM
Sent by: nak...@sc...
To: Hubertus Franke/Watson/IBM@IBMUS
cc: Mike Kravetz <mkr...@se...>, lse...@li...
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler

I agree.
I think the scope of the values returned by goodness() is not clear even with the original scheduler - it's neither local nor global, because of PROC_CHANGE_PENALTY, which should be platform-dependent. So I don't think we need to stick to the current scheduler semantics for non-realtime tasks, especially when supporting larger SMP machines, including NUMA.

Besides the cpus_allowed mask, one thing I would like to add when stealing a task from a remote CPU is a check of whether its cache is still warm there. We don't want to migrate a task with a warm cache even if it has a high na_goodness value. One easy and effective way is to maintain the last time (in jiffies) at which the task was scheduled. If the elapsed time is larger than some tunable/platform-specific value, then we might want to migrate the task; otherwise, find another one. I bet this would cause significant scheduling behavior changes.

Hubertus Franke wrote:
>
> Mike, it goes to the heart of the question whether we MUST or NEED to stick
> to the current scheduler semantics.
>
> For the low-end SMPs (4-way or so) it seems OK to me to simply check,
> before trying to steal the thread, whether it can be scheduled or not. If
> not, you move on to the next queue. You are afraid that the cpus_allowed
> mask will be used frequently. Waiting for the lock doesn't happen in the
> current code; you use try_spinlock.
> The problem, as you point out, is that a task with sufficiently high
> goodness to warrant a preemption is stuck behind this
> "cpus_allowed"-limited task and therefore is never considered.
>
> Again, at some point one has to do some hand-waving and state "tough
> luck". As far as I am concerned, the task of the scheduler is to create
> high throughput through the system and provide some degree of fairness.
> By defining an affinity boost, we are already sacrificing fairness to
> some degree.
>
> I have taken a slightly different angle.
> I am willing to relax the current scheduling assumption in order to
> increase scalability. In particular, I am willing to run tasks of lower
> goodness value as long as I adhere to the RT semantics. I think that is
> OK, because the PROC_CHANGE_PENALTY is a first heuristic that might not
> necessarily be right.
>
> I think when you are approaching larger numbers of CPUs (8+) you need to
> look at partitioning your system from a scheduling point of view. One
> should pretty much deal with the system as if it were a NUMA system.
>
> I have taken the MQ scheduler and sub-divided it into cpu pools. I have
> already posted regarding this under our latest status report for
> scheduling:
> http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing
>
> Running the chatroom with 30/300 gives the following results. (I will
> post these on Monday on our lse.sourceforge.net/scheduling site for
> general consumption.)
> (See attached file: pre8-chat.pdf)
>
> In this case, splitting up the scheduler into multiple pools with
> occasional trivial load-balancing shows, at the very high end for the
> chat-room, a 10% improvement over our current MQ. In this case I am
> checking in reschedule_idle whether to preempt or not, but in schedule I
> only look within my pool.
> There are some potential improvements possible, for instance only
> checking for remote idle cpus on all cpus, but no preemption outside the
> current pool. I think these mechanisms are worthwhile to look into.
>
> I think that the usage of cpus_allowed can be tied into this.
> cpus_allowed is only a simple mechanism, not a policy. I think we need to
> look at the policies that will be built for cpu affinities. I think most
> of them will originate in the NUMA area. And doing a pool-based approach
> seems to be OK.
>
> Any comments?
>
> Hubertus Franke
> email: fr...@us...
>
> Mike Kravetz <mkr...@se...>@lists.sourceforge.net on 02/09/2001
> 02:26:08 PM
>
> Sent by: lse...@li...
>
> To: lse...@li...
> cc:
> Subject: [Lse-tech] cpus_allowed in multi-queue scheduler
>
> I'm looking at the cpus_allowed field of the task structure and trying to
> determine the best way to handle it in the multi-queue scheduler.
>
> As a reminder, the multi-queue scheduler I am working on has one runqueue
> per CPU. In addition, there is a separate runqueue lock per CPU which
> synchronizes access to the CPU-specific runqueue. In schedule(), the
> runqueue lock associated with the CPU we are currently running on is
> obtained. Then the CPU-specific runqueue is scanned looking for the task
> with the highest 'goodness' value. After scanning the CPU-specific
> runqueue, we 'take a quick look' at the other runqueues to determine if
> there is a task with higher goodness that should be scheduled. Of course,
> this takes CPU affinity into account just like the current scheduler.
> Each CPU-specific runqueue data structure has a field which contains the
> maximum 'non-affinity goodness' value of all schedulable tasks on that
> runqueue. Therefore, when we 'take a quick look' we are really only
> looking at the task with the maximum 'non-affinity goodness' on a remote
> CPU's runqueue. See the description of the multi-queue scheduler at
> 'http://lse.sourceforge.net/scheduling/mq1.html' if you want more
> details. Now, if we find a task with sufficiently high goodness on
> another CPU-specific runqueue, we attempt to lock (via spin_trylock) the
> other runqueue and move the task to our runqueue.
>
> It is during the process of 'stealing' tasks from other runqueues that we
> must be concerned with the cpus_allowed field of the task we are
> stealing. We don't want to steal a task if it is not allowed to run on
> our CPU. However, we don't want to wait until we have obtained a lock on
> a remote CPU's runqueue to check the cpus_allowed field and determine if
> we should steal the task.
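[The 'take a quick look'/steal path Mike describes above might look roughly like the following userspace sketch. All names here are illustrative assumptions for discussion, not code from the actual MQ patch; the spinlock is replaced by a simple flag.]

```c
#define NR_CPUS 8

/* Illustrative per-CPU runqueue summary for the MQ design described
 * above: each queue advertises the best 'non-affinity goodness' among
 * its schedulable tasks, so remote CPUs can peek without scanning
 * the whole queue. */
struct runqueue {
    int locked;            /* stand-in for the per-queue spinlock      */
    int max_na_goodness;   /* best non-affinity goodness on this queue */
};

struct runqueue runqueues[NR_CPUS];

/* Stand-in for spin_trylock(): returns nonzero on success. */
static int try_lock(struct runqueue *rq)
{
    if (rq->locked)
        return 0;
    rq->locked = 1;
    return 1;
}

/* After scanning the local queue, take a quick look at the others and
 * return the index of a remote queue worth stealing from, or -1. */
static int find_steal_candidate(int this_cpu, int local_best)
{
    int cpu, best_cpu = -1, best = local_best;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu == this_cpu)
            continue;
        /* Unlocked peek at the advertised maximum. */
        if (runqueues[cpu].max_na_goodness > best) {
            best = runqueues[cpu].max_na_goodness;
            best_cpu = cpu;
        }
    }
    /* Only now try for the remote lock; give up rather than spin. */
    if (best_cpu >= 0 && !try_lock(&runqueues[best_cpu]))
        return -1;
    return best_cpu;
}
```

[As the thread discusses, the unlocked peek is what makes checking the candidate's cpus_allowed awkward: that field lives in the task structure, which is only safe to examine under the remote lock.]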
>
> My thought is that the runqueue data structure could also contain the
> cpus_allowed field of the task with the maximum 'non-affinity goodness'
> value. Hence, we could 'quickly check' this value without obtaining the
> remote CPU's runqueue lock.
>
> However, I have some concerns that this design could cause significant
> scheduling behavior changes if the cpus_allowed field/feature is used
> extensively. Right now I don't see much/any use of this field, but I
> expect that may change in the future. Consider the case where the task
> with the maximum 'non-affinity goodness' value has cpus_allowed set such
> that it is limited to a small subset of the system's CPUs. Now, other
> CPUs outside this subset will not be able to steal this task. In
> addition, they will not be able to steal any other tasks on this runqueue
> which may have a sufficiently high goodness value. This is because we
> only keep track of the task with the highest goodness value, and in this
> case it can only run on a subset of CPUs. It is obvious that a task which
> is limited to a single CPU should never be identified as a candidate to
> be stolen by another CPU, and this is easy to code. However, what about a
> task limited to 2 CPUs on an 8-CPU system, or 2 CPUs on a 16- or 32-CPU
> system?
>
> Any comments?
> --
> Mike Kravetz                                 mkr...@se...
> IBM Linux Technology Center
>
> _______________________________________________
> Lse-tech mailing list
> Lse...@li...
> http://lists.sourceforge.net/lists/listinfo/lse-tech
>
> ------------------------------------------------------------------------
>     Name: pre8-chat.pdf
>     pre8-chat.pdf    Type: Portable Document Format (application/pdf)
>                  Encoding: base64

--
Jun U Nakajima
Core OS Development
SCO/Murray Hill, NJ
Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426
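[The two ideas in this thread - caching the cpus_allowed of the top candidate alongside max_na_goodness (Mike), and skipping a task whose cache is presumed warm based on the jiffy at which it last ran (Jun) - could be combined roughly as in the sketch below. All field names and the decay threshold are assumptions made up for illustration, not taken from any posted patch.]

```c
typedef unsigned long jiffies_t;

/* Per-runqueue summary, readable without the remote queue's lock:
 * it mirrors a few fields of the queue's best 'non-affinity goodness'
 * task so a remote CPU can decide quickly whether stealing is worth
 * a spin_trylock() at all. */
struct rq_summary {
    int           max_na_goodness;      /* best na_goodness on queue  */
    unsigned long max_na_cpus_allowed;  /* cpus_allowed of that task  */
    jiffies_t     max_na_last_schedule; /* when that task last ran    */
};

/* Tunable, platform-specific: how long a cache footprint is presumed
 * to stay warm. The value here is an arbitrary assumption. */
#define CACHE_DECAY_JIFFIES 10UL

/* Lock-free quick check: steal only if the remote best task beats our
 * local best, is allowed on this CPU, and its cache has gone cold. */
static int worth_stealing(const struct rq_summary *remote,
                          int local_best, int this_cpu, jiffies_t now)
{
    if (remote->max_na_goodness <= local_best)
        return 0;
    if (!(remote->max_na_cpus_allowed & (1UL << this_cpu)))
        return 0;                       /* pinned elsewhere       */
    if (now - remote->max_na_last_schedule <= CACHE_DECAY_JIFFIES)
        return 0;                       /* cache presumed warm    */
    return 1;
}
```

[Note that the sketch inherits the limitation Mike raises: because a whole queue is summarized by its single best task, a pinned or cache-warm top candidate hides every other stealable task on that queue from remote CPUs.]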