From: Hubertus F. <fr...@us...> - 2001-07-14 03:24:11
|
Mike, could we utilize the existing mechanism such as has_cpu? Here is my approach/suggestion. We introduce in sched_data[cpu] a resched_task slot:

    struct task_struct *resched_task;

When in reschedule_idle() a target cpu is to be decided for task <p>, we check the resched_task slot. If the slot is pointing to some task, then this task should be considered running and we should not compute the preemption goodness against the currently running task, as we know it already got IPI'ed. See also the schedule() function described later on.

    reschedule_idle(struct task_struct *p)
    {
            :
            struct task_struct *rst = sched_data[target_cpu].resched_task;
            struct task_struct *rmt = sched_data[target_cpu].current;

            if (rst != NULL) {
                    if (preemption_goodness(p, rst, ...) > 1) {
                            p->processor = target_cpu;
                            p->has_cpu = 1;
                            rst->has_cpu = 0;  /* reset as we overwrite */
                            sched_data[target_cpu].resched_task = p;
                            /* so we make old <rst> available for scheduling
                             * and temp-bind <p> to target_cpu.
                             * We don't have to send an IPI, as this handles
                             * the race condition and we are holding the
                             * scheduling lock. */
                    } else {
                            continue;
                            /* we know that the current priority won't be
                             * larger than the one of <rst>, otherwise this
                             * would have been picked up in schedule() */
                    }
            } else {
                    /* standard stuff that we always do */
            }
    }

In schedule() we need to check whether a reschedule reservation is held. First we go through the standard <still_running> check to compute the initial <c> goodness value.
Then under still_running_back:

    still_running_back:
        <loop through the runqueue>
        /* note that <rst> will be ignored due to the <has_cpu=1> flag */

        /* check whether a reservation <rst> exists */
        rst = sched_data[this_cpu].resched_task;
        if (rst != NULL) {
                c = goodness(rst, ...);
                if (c > best_prio) {
                        best_prio = c;
                        next = rst;
                        sched_data[this_cpu].resched_task = NULL;
                } else {
                        /* need to return <rst> back to a schedulable state */
                        rst->has_cpu = 0;
                }
        }

This approach would eliminate the need to check each task's saved_cpus_allowed during the runqueue scan and would also make sure that only one task reserves running on a particular cpu. Reservations are preempted through the existing mechanism, namely goodness comparison, and such "preempted" tasks are returned to general schedulability.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Mike Kravetz <mkr...@se...>@vger.kernel.org on 07/13/2001 06:43:05 PM
Sent by: lin...@vg...
To: David Lang <dav...@di...>
cc: Larry McVoy <lm...@bi...>, Davide Libenzi <da...@xm...>, lse...@li..., Andi Kleen <ak...@su...>, lin...@vg...
Subject: Re: CPU affinity & IPI latency

On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
> A real-world example of this issue.
>
> I was gzipping a large (~800MB) file on a dual athlon box. the gzip
> process was bouncing back and forth between the two CPUs. I actually
> was able to gzip faster by starting up setiathome to keep one CPU busy
> and force the scheduler to keep the gzip on a single CPU (I ran things
> several times to verify it was actually faster)
>
> David Lang

That does sound like the same behavior I was seeing with lat_ctx. Like I said in my previous note, the scheduler does try to take CPU affinity into account. reschedule_idle() does a pretty good job of determining what CPU a task should run on.
In the case of lat_ctx (and I believe your use of gzip), the 'fast path' in reschedule_idle() is taken because the CPU associated with the awakened task is idle. However, before schedule() is run on the 'target' CPU, schedule() is run on another CPU and the task is scheduled there.

The root cause of this situation is the delay between the time reschedule_idle() determines what is the best CPU a task should run on, and the time schedule() is actually run on that CPU.

I have toyed with the idea of 'temporarily binding' a task to a CPU for the duration of the delay between reschedule_idle() and schedule(). It would go something like this,

- Define a new field in the task structure 'saved_cpus_allowed'. With a little collapsing of existing fields, there is room to put this on the same cache line as 'cpus_allowed'.
- In reschedule_idle(), if we determine that the best CPU for a task is the CPU it is associated with (p->processor), then temporarily bind the task to that CPU. The task is temporarily bound to the CPU by overwriting the 'cpus_allowed' field such that the task can only be scheduled on the target CPU. Of course, the original value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
- In schedule(), the loop which examines all tasks on the runqueue will restore the value of 'cpus_allowed'.

This would preserve the 'best CPU' decision made by reschedule_idle(). On the down side, 'temporarily bound' tasks could not be scheduled until schedule() is run on their associated CPUs. This could potentially waste CPU cycles and delay context switches. In addition, it is quite possible that while a task is 'temporarily bound' the state of the system could change in such a way that the best CPU is no longer best.

There appears to be a classic tradeoff between CPU affinity and context switch time.

Comments?

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to maj...@vg... More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
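[Editorial note: the two halves of the reservation scheme above (the slot write in reschedule_idle() and its consumption in schedule()) can be modeled as a small userspace sketch. All names here (try_reserve, pick_reserved, the goodness field) are hypothetical stand-ins for illustration, not the 2.4 kernel API.]

```c
#include <stddef.h>

/* Userspace model of a per-CPU reschedule reservation slot. */
struct task {
    int goodness;   /* stand-in for the kernel's goodness() value */
    int has_cpu;    /* 1 while reserved or running */
    int processor;
};

struct sched_slot {
    struct task *resched_task;  /* at most one reserved task per CPU */
};

/* reschedule_idle() side: try to reserve <p> on <cpu>.
 * Returns 1 if <p> now holds the slot; mirrors the
 * preemption_goodness(p, rst, ...) > 1 test in the proposal. */
static int try_reserve(struct sched_slot *slot, struct task *p, int cpu)
{
    struct task *rst = slot->resched_task;

    if (rst == NULL || p->goodness > rst->goodness + 1) {
        if (rst)
            rst->has_cpu = 0;  /* old holder becomes schedulable again */
        p->has_cpu = 1;
        p->processor = cpu;
        slot->resched_task = p;
        return 1;
    }
    return 0;  /* existing reservation wins */
}

/* schedule() side: consume the reservation, if any. */
static struct task *pick_reserved(struct sched_slot *slot)
{
    struct task *rst = slot->resched_task;
    slot->resched_task = NULL;
    return rst;
}
```

A weaker waker cannot steal the slot, while a sufficiently better one displaces the holder and returns it to general schedulability, which is the invariant the proposal relies on.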
From: Hubertus F. <fr...@us...> - 2001-07-16 10:12:53
|
David, you are preaching to the choir. One cannot have it both ways, at least not without "#ifdef"s. As Mike stated, we made the decision to adhere to current scheduling semantics purely for the purpose of comparison and increased adoption chances. As shown with the LoadBalancing addition to MQ, there are simple ways to relax and completely eliminate the feedback between the queues, if one so desires.

As for the solution you proposed for the "switching problem", namely the wakeup list: I don't think you want a list here. A list would basically mean that you would need to walk it and come up with a single decision again. I think what I proposed, namely a per-CPU reschedule reservation that can be overwritten taking PROC_CHANGE_PENALTY or some form of it into account, seems a better solution. Open to discussion...

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)
email: fr...@us...

Davide Libenzi <da...@xm...>@vger.kernel.org on 07/15/2001 04:02:21 PM
Sent by: lin...@vg...
To: Mike Kravetz <mkr...@se...>
cc: lin...@vg..., Andi Kleen <ak...@su...>, lse...@li..., Larry McVoy <lm...@bi...>, David Lang <dav...@di...>
Subject: Re: CPU affinity & IPI latency

On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
>> A real-world example of this issue.
>>
>> I was gzipping a large (~800MB) file on a dual athlon box. the gzip
>> process was bouncing back and forth between the two CPUs. I actually
>> was able to gzip faster by starting up setiathome to keep one CPU busy
>> and force the scheduler to keep the gzip on a single CPU (I ran things
>> several times to verify it was actually faster)
>>
>> David Lang
>
> That does sound like the same behavior I was seeing with lat_ctx. Like
> I said in my previous note, the scheduler does try to take CPU affinity
> into account. reschedule_idle() does a pretty good job of determining
> what CPU a task should run on.
> In the case of lat_ctx (and I believe your use of gzip), the 'fast path'
> in reschedule_idle() is taken because the CPU associated with the
> awakened task is idle. However, before schedule() is run on the 'target'
> CPU, schedule() is run on another CPU and the task is scheduled there.
>
> The root cause of this situation is the delay between the time
> reschedule_idle() determines what is the best CPU a task should run
> on, and the time schedule() is actually run on that CPU.
>
> I have toyed with the idea of 'temporarily binding' a task to a CPU
> for the duration of the delay between reschedule_idle() and schedule().
> It would go something like this,
>
> - Define a new field in the task structure 'saved_cpus_allowed'.
>   With a little collapsing of existing fields, there is room to put
>   this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
>   is the CPU it is associated with (p->processor), then temporarily
>   bind the task to that CPU. The task is temporarily bound to the
>   CPU by overwriting the 'cpus_allowed' field such that the task can
>   only be scheduled on the target CPU. Of course, the original
>   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
>   will restore the value of 'cpus_allowed'.
>
> This would preserve the 'best CPU' decision made by reschedule_idle().
> On the down side, 'temporarily bound' tasks could not be scheduled
> until schedule() is run on their associated CPUs. This could potentially
> waste CPU cycles, and delay context switches. In addition, it is
> quite possible that while a task is 'temporarily bound' the state of
> the system could change in such a way that the best CPU is no longer
> best.
>
> There appears to be a classic tradeoff between CPU affinity and
> context switch time.

The problem of the current scheduler is that it acts like an infinite feedback system.
When we're going to decide if we've to move a task, we analyze the status at the current time without taking into account the system status at previous times. But the feedback we send ( the IPI to move the task ) takes a finite time to hit the target CPU and, meanwhile, the system status changes. So we're going to apply a feedback calculated at time T0 to a time Tn, and this will result in a system auto-oscillation that we perceive as tasks bouncing between CPUs. This is a kind of electronics example, but it applies to all feedback systems. The solution to this problem, given that we can't have a zero feedback delivery time, is:

1) lower the feedback amount, that means, try to minimize task movements

2) a low pass filter, that means, when we're going to decide the fate ( move ) of a task, we've to weight the current system status with the one it had at previous times

- Davide |
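[Editorial note: point (2) above, a low-pass filter over the system status, can be sketched as a fixed-point exponential moving average. This is purely illustrative; the 2.4 scheduler had no such filter, and the struct and names below are hypothetical.]

```c
/* Hypothetical smoothed per-CPU load estimate: instead of acting on the
 * instantaneous sample, blend it with history so a momentary spike does
 * not immediately trigger a task move. */
struct cpu_load {
    int filtered;  /* smoothed load, same units as the raw sample */
};

/* One filter step: filtered += (sample - filtered) / 4.
 * Integer arithmetic only; the divisor 4 sets how slowly the
 * estimate tracks new samples (larger = heavier smoothing). */
static int lp_update(struct cpu_load *cl, int sample)
{
    cl->filtered += (sample - cl->filtered) / 4;
    return cl->filtered;
}
```

Starting from 0, three consecutive samples of 100 move the estimate to 25, then 43, then 57: the filter converges toward the sustained load but damps a one-off burst, which is exactly the auto-oscillation fix Davide describes.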
From: Davide L. <da...@xm...> - 2001-07-16 16:13:29
|
On 16-Jul-2001 Hubertus Franke wrote:
> David, you are preaching to the choir.
>
> One cannot have it both ways, at least not without "#ifdef"s.
> As Mike stated, we made the decision to adhere to current scheduling
> semantics purely for the purpose of comparison and increased adoption
> chances. As shown with the LoadBalancing addition to MQ, there are
> simple ways to relax and completely eliminate the feedback between
> the queues, if one so desires.
>
> As for the solution you proposed for the "switching problem", namely
> the wakeup list: I don't think you want a list here. A list would
> basically mean that you would need to walk it and come up with a
> single decision again. I think what I proposed, namely a per-CPU
> reschedule reservation that can be overwritten taking
> PROC_CHANGE_PENALTY or some form of it into account, seems a better
> solution.
> Open to discussion...

No, when you're going to decide ( in reschedule_idle() ) which idle to spin out, you can inspect the wake list and, based on the content of the list, take a better decision about which idle to wake. I think that a list, instead of a single task pointer, is a more open solution that could lead to a more sophisticated choice of the CPU on which to place the task.

- Davide |
From: Hubertus F. <fr...@us...> - 2001-07-16 18:27:17
|
Well, actually the sole purpose of <has_cpu> is to lock a task out of a scheduling decision. Remember, during transient states there are two tasks that have the <has_cpu> flag set for a particular cpu. So I think using <has_cpu> is kosher and preferred in this situation, IMHO.

As you saw from Davide Libenzi's reply, he thinks that a list is more appropriate for more generic decision support. But I don't like the fact that you lock down several tasks at that point. In my solution, you return the task back to general schedulability.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)
email: fr...@us...

Mike Kravetz <mkr...@se...> on 07/16/2001 12:14:46 PM
To: Hubertus Franke/Watson/IBM@IBMUS
cc: Mike Kravetz <mkr...@se...>, lse...@li..., lin...@vg...
Subject: Re: CPU affinity & IPI latency

On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>
> Mike, could we utilize the existing mechanism such as has_cpu?
>

I like it. Especially the way you eliminated the situation where we would have multiple tasks waiting for schedule. Hope this is not a frequent situation!!! The only thing I don't like is the use of has_cpu to prevent the task from being scheduled. Right now, I can't think of any problems with it. However, in the past I have been bit by using fields for purposes other than what they were designed for.

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center |
From: Hubertus F. <fr...@us...> - 2001-07-16 21:45:19
|
Clean, but in this solution you can lock out tasks for several cycles, if indeed several of them were added in the time before we can service the IPI on the target cpu. You can end up with strange priority inversion problems, because these tasks aren't seen by the general scheduler, let alone are on the runqueue. That's why I opted in my design for a single slot.

One could opt for the following implementation to ensure only a single waiting task, namely put the already enqueued one back if its goodness is worse:

    static inline void wpush(aligned_data *ad, struct task_struct *tsk)
    {
            if (ad->wlist &&
                (preemption_goodness(tsk, ad->wlist, ad->curr->mm) > 0)) {
                    add_to_runqueue(ad->wlist);  /* put the old waiter back */
                    if (task_on_runqueue(tsk))
                            list_del(&tsk->run_list);
                    /* obsolete now: tsk->wlist_next = ad->wlist; */
                    ad->wlist = tsk;
            }
    }

I also don't like the fact that with reschedule_idle() you just put the task into the runqueue, then you take it out again, just to put it back into it after the IPI, and that, as it seems, on every reschedule_idle() path. In my design one simply wouldn't send the IPI to a target CPU that has a pending IPI waiting, and preemption wouldn't happen based on the goodness values.

I grant that my design seems a bit more intrusive on the code, but you were arguing just yesterday not to get stuck up with closeness to the current code and semantics.

Comments?

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability)
email: fr...@us...

Davide Libenzi <da...@xm...>@ewok.dev.mcafeelabs.com on 07/16/2001 05:25:57 PM
Sent by: da...@ew...
To: Mike Kravetz <mkr...@se...>
cc: lin...@vg..., lse...@li..., Hubertus Franke/Watson/IBM@IBMUS
Subject: Re: CPU affinity & IPI latency

On 16-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>>
>> Mike, could we utilize the existing mechanism such as has_cpu?
>>
>
> I like it.
> Especially the way you eliminated the situation where
> we would have multiple tasks waiting for schedule. Hope this is
> not a frequent situation!!! The only thing I don't like is the
> use of has_cpu to prevent the task from being scheduled. Right
> now, I can't think of any problems with it. However, in the past
> I have been bit by using fields for purposes other than what they
> were designed for.

How about this ( draft ):

    struct task_struct {
            ...
            struct task_struct *wlist_next;
            ...
    };

    static union {
            struct schedule_data {
                    struct task_struct *curr;
                    struct task_struct *wlist;
                    cycles_t last_schedule;
            } schedule_data;
            char __pad[SMP_CACHE_BYTES];
    } aligned_data[NR_CPUS] __cacheline_aligned = {{{&init_task, 0}}};

    static inline struct task_struct *wpick(aligned_data *ad)
    {
            struct task_struct *tsk = ad->wlist;
            if (tsk) {
                    ad->wlist = tsk->wlist_next;
                    add_to_runqueue(tsk);
            }
            return tsk;
    }

    static inline void wpush(aligned_data *ad, struct task_struct *tsk)
    {
            if (task_on_runqueue(tsk))
                    list_del(&tsk->run_list);
            tsk->wlist_next = ad->wlist;
            ad->wlist = tsk;
    }

    asmlinkage void schedule(void)
    {
            ...
            if ((next = wpick(sched_data)))
                    goto ...;
            ...
    }

In reschedule_idle(), just before sending the IPI, we do a wpush(). We modify aligned_data->wlist and tsk->wlist_next under the runqueue_lock, so we don't need another one. A slight change is needed to reschedule_idle() to handle the new field. Pros of this solution are: 1) we are not going to give other fields a different meaning, and 2) when the idle calls schedule() it'll pick the task w/o a rescan.

- Davide |
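[Editorial note: the wlist draft above can be exercised as a userspace model. The on_runqueue flag below stands in for the kernel's run_list handling, and the type/function names are illustrative, not the real 2.4 interfaces.]

```c
#include <stddef.h>

/* Userspace model of the per-CPU wait list ("wlist"): tasks parked for
 * an idle CPU between the IPI and that CPU's schedule() call. */
struct task_struct {
    struct task_struct *wlist_next;
    int on_runqueue;
};

struct aligned_data {
    struct task_struct *wlist;  /* head of the parked-task list */
};

/* reschedule_idle() side: hide <tsk> from other CPUs' runqueue scans
 * and push it onto the target CPU's wlist (LIFO). */
static void wpush(struct aligned_data *ad, struct task_struct *tsk)
{
    tsk->on_runqueue = 0;        /* list_del(&tsk->run_list) stand-in */
    tsk->wlist_next = ad->wlist;
    ad->wlist = tsk;
}

/* schedule() side on the target CPU: pick a parked task without
 * rescanning the runqueue, restoring its runqueue membership. */
static struct task_struct *wpick(struct aligned_data *ad)
{
    struct task_struct *tsk = ad->wlist;
    if (tsk) {
        ad->wlist = tsk->wlist_next;
        tsk->on_runqueue = 1;    /* add_to_runqueue(tsk) stand-in */
    }
    return tsk;
}
```

Note the LIFO order: the most recently pushed task is picked first, and while parked a task is invisible to every other CPU's scheduler, which is precisely the lock-out property Hubertus objects to above.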
From: Davide L. <da...@xm...> - 2001-07-16 22:53:14
|
On 16-Jul-2001 Hubertus Franke wrote:
> Clean, but in this solution you can lock out tasks for
> several cycles, if indeed several of them were added
> in the time before we can service the IPI on the target cpu.
> You can end up with strange priority inversion problems,
> because these tasks aren't seen by the general scheduler,
> let alone are on the runqueue.
> That's why I opted in my design for a single slot.

The idea of the list was a kind of extension; to solve this particular problem we just need a single task pointer ( with no extra member in task_struct ).

> One could opt for the following implementation to ensure only
> a single waiting task, namely put the already enqueued one back
> if its goodness is worse:
>
> static inline void wpush(aligned_data *ad, struct task_struct *tsk)
> {
>         if (ad->wlist &&
>             (preemption_goodness(tsk, ad->wlist, ad->curr->mm) > 0)) {
>                 add_to_runqueue(ad->wlist);
>                 if (task_on_runqueue(tsk))
>                         list_del(&tsk->run_list);
>                 /* obsolete now: tsk->wlist_next = ad->wlist; */
>                 ad->wlist = tsk;
>         }
> }

wpush() must be called only if

*) the target is idle <and>
*) the list ( slot ) is empty.

I won't add the goodness checking in there. Yes, it could happen that another task with greater goodness needs a CPU to run and, w/o goodness checking, this one could not be rescheduled on the idle. But think about what would happen if the IPI had a zero response time: the idle would already be running the old ( listed or slotted ) task and the CPU would not be idle anymore.

> I also don't like the fact that with reschedule_idle() you
> just put the task into the runqueue, then you take it out again,
> just to put it back into it after the IPI, and that, as it seems,
> on every reschedule_idle() path.

You can "suspend" the task even by adding an extra field inside the task struct and adding checks of this field around inside the code. The list removal does not need you to spread extra checking into the code.
It's a simple method to hide a task without modifying a bunch of code.

> In my design one simply wouldn't send the IPI to a target CPU that
> has a pending IPI waiting and preemption wouldn't happen based on
> the goodness values.

You do exactly that with this solution, you reserve the task for the target idle. What do you mean with "preemption wouldn't happen based on the goodness values"?

> I grant that my design seems a bit more intrusive on the code,
> but you were arguing just yesterday not to get stuck up with
> closeness to the current code and semantics.

Please, don't make me say this stuff. My point was not to make the code more intrusive but, on the contrary, to make it lighter by relaxing, in some aspects, the SMP scheduler behaviour compatibility. I already expressed my opinion about that in previous emails.

- Davide |
From: Mike K. <mkr...@se...> - 2001-07-16 16:17:34
|
On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>
> Mike, could we utilize the existing mechanism such as has_cpu?
>

I like it. Especially the way you eliminated the situation where we would have multiple tasks waiting for schedule. Hope this is not a frequent situation!!! The only thing I don't like is the use of has_cpu to prevent the task from being scheduled. Right now, I can't think of any problems with it. However, in the past I have been bit by using fields for purposes other than what they were designed for.

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center |
From: Davide L. <da...@xm...> - 2001-07-16 21:22:43
|
On 16-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>>
>> Mike, could we utilize the existing mechanism such as has_cpu?
>>
>
> I like it. Especially the way you eliminated the situation where
> we would have multiple tasks waiting for schedule. Hope this is
> not a frequent situation!!! The only thing I don't like is the
> use of has_cpu to prevent the task from being scheduled. Right
> now, I can't think of any problems with it. However, in the past
> I have been bit by using fields for purposes other than what they
> were designed for.

How about this ( draft ):

    struct task_struct {
            ...
            struct task_struct *wlist_next;
            ...
    };

    static union {
            struct schedule_data {
                    struct task_struct *curr;
                    struct task_struct *wlist;
                    cycles_t last_schedule;
            } schedule_data;
            char __pad[SMP_CACHE_BYTES];
    } aligned_data[NR_CPUS] __cacheline_aligned = {{{&init_task, 0}}};

    static inline struct task_struct *wpick(aligned_data *ad)
    {
            struct task_struct *tsk = ad->wlist;
            if (tsk) {
                    ad->wlist = tsk->wlist_next;
                    add_to_runqueue(tsk);
            }
            return tsk;
    }

    static inline void wpush(aligned_data *ad, struct task_struct *tsk)
    {
            if (task_on_runqueue(tsk))
                    list_del(&tsk->run_list);
            tsk->wlist_next = ad->wlist;
            ad->wlist = tsk;
    }

    asmlinkage void schedule(void)
    {
            ...
            if ((next = wpick(sched_data)))
                    goto ...;
            ...
    }

In reschedule_idle(), just before sending the IPI, we do a wpush(). We modify aligned_data->wlist and tsk->wlist_next under the runqueue_lock, so we don't need another one. A slight change is needed to reschedule_idle() to handle the new field. Pros of this solution are: 1) we are not going to give other fields a different meaning, and 2) when the idle calls schedule() it'll pick the task w/o a rescan.

- Davide |