From: Hubertus F. <fr...@us...> - 2001-01-25 16:00:38
|
I think Mike traced it already back to the fact that the tcp_recvmsg's block, i.e. there is no data there. Hence the process gets back into a wait queue and wakes up with reschedule_idle being called. That's the source of the problem. We are doing TOO GOOD ..... I will try to slow down the Chatroom benchmark a bit today. Rather than trying to receive in a tight loop, it would make sense to do something with the message -- "grep, hash, ..." something -- and then let's see whether that solves the problem a bit.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Bill Hartner/Austin/IBM@IB...@li... on 01/25/2001 08:48:29 AM
Sent by: lse...@li...

To: <lse...@li...>
cc:
Subject: [Lse-tech] reschedule: - MQ scheduler

Looking at the (2) profiles that were sent for 2.4.0 vs. 2.4.0+MQ using the chatroom benchmark:

There are more reschedule: entries in the MQ profile:

    28,000 vs. 239,000

See below for a possible cause.

-----------------------------------------------
                                   <spontaneous>
[41]   1.8   0.02  11.41           reschedule [41]
             9.80   1.61  239329/939687    schedule [16]
-----------------------------------------------
...
             4.63   0.48   28058/412448    reschedule [52]
             5.22   0.54   31657/412448    cpu_idle [27]
            48.06   4.96  291491/412448    schedule_timeout [15]
[11]  12.5  68.00   7.02  412448           schedule [11]
             5.65   0.00  382554/398735    _schedule_tail [49]
             0.03   0.96   26663/868157    do_softirq [17]
...

vs.

-----------------------------------------------
                                   <spontaneous>
[52]   0.9   0.00   5.10           reschedule [52]
             4.63   0.48   28058/412448    schedule [11]
-----------------------------------------------
...
             1.23   0.20   29953/939687    cpu_idle [27]
             9.80   1.61  239329/939687    reschedule [41]
            24.78   4.07  605300/939687    schedule_timeout [24]
[16]   7.0  38.47   6.32  939687           schedule [16]
             3.48   0.02  901925/918106    _schedule_tail [67]
             0.06   2.12   69717/1561636   do_softirq [14]
             0.60   0.00  901925/918106    _switch_to [103]
...

It appears as though the following could be happening:

(1) Increased number of priority preemptions. This will decrease performance and, if it occurs on the send side, increase tcp_data_wait on the receive side. Look for running processes that have counter=0, or for miscalculations of goodness when called from reschedule_idle, causing increased priority preemptions. Also, make sure you are not leaving need_resched on when leaving schedule.

(2) I don't think it is time slice preemptions. If you look at the following:

-----------------------------------------------
             0.00   0.00   61851/61851     UNKNOWN_KERNEL [1255]
[515]  0.0   0.00   0.00   61851           smp_apic_timer_interrupt [515]
             0.00   0.00   61851/61851     update_process_times [517]
-----------------------------------------------

vs.

-----------------------------------------------
             0.00   0.00   66452/66452     UNKNOWN_KERNEL [1269]
[526]  0.0   0.00   0.00   66452           smp_apic_timer_interrupt [526]
             0.00   0.00   66452/66452     update_process_times [528]
-----------------------------------------------

It looks as though 2.4.0 ran for 61851 * 10 ms / 8 cpus = 77 seconds, and 2.4.0+MQ for 66452 * 10 ms / 8 cpus = 83 seconds.

What were the results? Was MQ about 8 % slower?

The sstat patch (scheduler statistics) would help here. The profile provides some of the same info. I have used it to look at workloads from the scheduler point of view, and it has been very helpful in the past when looking at this type of problem. I plan to move it up to 2.4.0 soon.

Later,

Bill Hartner
IBM Linux Technology Center - kernel performance
bha...@us...
_______________________________________________ Lse-tech mailing list Lse...@li... http://lists.sourceforge.net/lists/listinfo/lse-tech |
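The elapsed-time estimate in Bill's note follows directly from the per-CPU timer tick. Assuming HZ=100 (one smp_apic_timer_interrupt per 10 ms tick per CPU) and that the profile covers the whole run:

    elapsed  ~=  (total timer interrupts * 10 ms) / number of CPUs
    2.4.0:      61851 * 0.010 s / 8  ~=  77.3 s
    2.4.0+MQ:   66452 * 0.010 s / 8  ~=  83.1 s

The ratio 83.1 / 77.3 ~= 1.075 is where the "about 8 % slower" question comes from.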
From: Hubertus F. <fr...@us...> - 2001-01-25 16:09:49
|
Yes, This is correct. Since there is a runqueue for each CPU and the decision on which runqueue to place is made in the MQ case in reschedule_idle, we don't do it here. If this is a UP then its the same code as the current scheduler.. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Jun Nakajima <ju...@sc...>@lists.sourceforge.net on 01/25/2001 10:15:27 AM Sent by: lse...@li... To: Bill Hartner/Austin/IBM@IBMUS cc: lse...@li... Subject: Re: [Lse-tech] reschedule: - MQ scheduler This may not be related directly to the problem below, but I don't understand the code: inline void wake_up_process(struct task_struct * p) { unsigned long flags; #ifdef CONFIG_SMP int rq; rq = lock_task_cpu_rq_irqsave(p, &flags); if (task_to_runqueue(p) == REALTIME_RQ) { /* * Indicates prev is a realtime task. We must * also lock the realtime runqueue. */ spin_lock(&runqueue_lock(REALTIME_RQ)); } #else spin_lock_irqsave(&runqueue_lock, flags); #endif /* * We want the common case fall through straight, thus the goto. */ p->state = TASK_RUNNING; if (task_on_runqueue(p)) goto out; #ifndef CONFIG_SMP ^^^^^^^ add_to_runqueue(p); #endif reschedule_idle(p); out: ... Am I missing something? Thanks, Bill Hartner wrote: > > Looking at the (2) profiles that were sent for > 2.4.0 vs. 2.4.0+MQ using the chatroom benchmark : > > There are more reschedule: in the MQ profile. > ** > > 28,000 vs. 239,000 > > See below for possible cause. > > ----------------------------------------------- > <spontaneous> > [41] 1.8 0.02 11.41 reschedule [41] > 9.80 1.61 239329/939687 schedule [16] > ----------------------------------------------- > ... > 4.63 0.48 28058/412448 reschedule [52] > 5.22 0.54 31657/412448 cpu_idle [27] > 48.06 4.96 291491/412448 schedule_timeout [15] > [11] 12.5 68.00 7.02 412448 schedule [11] > 5.65 0.00 382554/398735 _schedule_tail [49] > 0.03 0.96 26663/868157 do_softirq [17] > ... > > vs. > > ----------------------------------------------- > <spontaneous> > [52] 0.9 0.00 5.10 reschedule [52] > 4.63 0.48 28058/412448 schedule [11] > ----------------------------------------------- > ... > 1.23 0.20 29953/939687 cpu_idle [27] > 9.80 1.61 239329/939687 reschedule [41] > 24.78 4.07 605300/939687 schedule_timeout [24] > [16] 7.0 38.47 6.32 939687 schedule [16] > 3.48 0.02 901925/918106 _schedule_tail [67] > 0.06 2.12 69717/1561636 do_softirq [14] > 0.60 0.00 901925/918106 _switch_to [103] > ... > > It appears as though the following could be happening : > > (1) Increased number of priority preemptions. This will decrease > performance and if occurring on the send side, then increased > tcp_data_wait on the receive side. Look for running processes > that have counter=0 or miscalculations of goodness when being > called by reschedule_idle causing increased priority preemptions. > Also, make sure you are not leaving need_resched on when leaving > schedule. > > (2) I don't think it is time slice preemptions. > > If you look at the following : > > ----------------------------------------------- > 0.00 0.00 61851/61851 UNKNOWN_KERNEL [1255] > [515] 0.0 0.00 0.00 61851 smp_apic_timer_interrupt [515] > 0.00 0.00 61851/61851 update_process_times [517] > ----------------------------------------------- > vs. 
> > ----------------------------------------------- > 0.00 0.00 66452/66452 UNKNOWN_KERNEL [1269] > [526] 0.0 0.00 0.00 66452 smp_apic_timer_interrupt [526] > 0.00 0.00 66452/66452 update_process_times [528] > ----------------------------------------------- > > It looks as though 2.4.0 has 61851 * 10 ms. / 8 cpus = 77 seconds. > 2.4.0+MQ has 66452 * 10 ms. / 8 cpus = 83 seconds. > > What were the results ? > Was MQ about 8 % slower ? > > The sstat patch (scheduler statistics) would help here. > The profile provides some of the same info. > I have used it to look at workloads from the scheduler point of view. > It has been very helpful in the past when looking at these type of problems. > I plan to move it up to 2.4.0 soon. > > Later, > > Bill Hartner > IBM Linux Technology Center - kernel performance > bha...@us... > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 _______________________________________________ Lse-tech mailing list Lse...@li... http://lists.sourceforge.net/lists/listinfo/lse-tech |
From: Jun N. <ju...@sc...> - 2001-01-25 19:05:46
|
I understand the dilemma, i.e. wake_up_process() is not sure on which CPU run queue it should place the process, until reschedule_idle() finds an idle CPU or one with the lowest na_goodness. One thing I'm concerned is that __wake_up_common() (which may call wake_up_process()) can pick up a process affine to the current CPU in the interrupt context and we don't want to migrate that process to another CPU, even if that CPU's current na_goodness happens to be the lowest. Hubertus Franke wrote: > > Yes, > This is correct. > > Since there is a runqueue for each CPU and the decision on which runqueue > to place is made in the MQ > case in reschedule_idle, we don't do it here. > If this is a UP then its the same code as the current scheduler.. > > Hubertus Franke > Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) > , OS-PIC (Chair) > email: fr...@us... > (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 > > Jun Nakajima <ju...@sc...>@lists.sourceforge.net on 01/25/2001 10:15:27 AM > > Sent by: lse...@li... > > To: Bill Hartner/Austin/IBM@IBMUS > cc: lse...@li... > Subject: Re: [Lse-tech] reschedule: - MQ scheduler > > This may not be related directly to the problem below, but I don't > understand the code: > > inline void wake_up_process(struct task_struct * p) > { > unsigned long flags; > #ifdef CONFIG_SMP > int rq; > > rq = lock_task_cpu_rq_irqsave(p, &flags); > if (task_to_runqueue(p) == REALTIME_RQ) { > /* > * Indicates prev is a realtime task. We must > * also lock the realtime runqueue. > */ > spin_lock(&runqueue_lock(REALTIME_RQ)); > } > #else > spin_lock_irqsave(&runqueue_lock, flags); > #endif > /* > * We want the common case fall through straight, thus the goto. > */ > p->state = TASK_RUNNING; > if (task_on_runqueue(p)) > goto out; > #ifndef CONFIG_SMP > ^^^^^^^ > add_to_runqueue(p); > #endif > reschedule_idle(p); > out: > > ... > > Am I missing something? > > Thanks, > > Bill Hartner wrote: > > > > Looking at the (2) profiles that were sent for > > 2.4.0 vs. 2.4.0+MQ using the chatroom benchmark : > > > > There are more reschedule: in the MQ profile. > > ** > > > > 28,000 vs. 239,000 > > > > See below for possible cause. > > > > ----------------------------------------------- > > <spontaneous> > > [41] 1.8 0.02 11.41 reschedule [41] > > 9.80 1.61 239329/939687 schedule [16] > > ----------------------------------------------- > > ... > > 4.63 0.48 28058/412448 reschedule [52] > > 5.22 0.54 31657/412448 cpu_idle [27] > > 48.06 4.96 291491/412448 schedule_timeout [15] > > [11] 12.5 68.00 7.02 412448 schedule [11] > > 5.65 0.00 382554/398735 _schedule_tail [49] > > 0.03 0.96 26663/868157 do_softirq [17] > > ... > > > > vs. > > > > ----------------------------------------------- > > <spontaneous> > > [52] 0.9 0.00 5.10 reschedule [52] > > 4.63 0.48 28058/412448 schedule [11] > > ----------------------------------------------- > > ... > > 1.23 0.20 29953/939687 cpu_idle [27] > > 9.80 1.61 239329/939687 reschedule [41] > > 24.78 4.07 605300/939687 schedule_timeout [24] > > [16] 7.0 38.47 6.32 939687 schedule [16] > > 3.48 0.02 901925/918106 _schedule_tail [67] > > 0.06 2.12 69717/1561636 do_softirq [14] > > 0.60 0.00 901925/918106 _switch_to [103] > > ... > > > > It appears as though the following could be happening : > > > > (1) Increased number of priority preemptions. This will decrease > > performance and if occurring on the send side, then increased > > tcp_data_wait on the receive side. 
Look for running processes > > that have counter=0 or miscalculations of goodness when being > > called by reschedule_idle causing increased priority preemptions. > > Also, make sure you are not leaving need_resched on when leaving > > schedule. > > > > (2) I don't think it is time slice preemptions. > > > > If you look at the following : > > > > ----------------------------------------------- > > 0.00 0.00 61851/61851 UNKNOWN_KERNEL [1255] > > [515] 0.0 0.00 0.00 61851 smp_apic_timer_interrupt > [515] > > 0.00 0.00 61851/61851 update_process_times > [517] > > ----------------------------------------------- > > vs. > > > > ----------------------------------------------- > > 0.00 0.00 66452/66452 UNKNOWN_KERNEL [1269] > > [526] 0.0 0.00 0.00 66452 smp_apic_timer_interrupt > [526] > > 0.00 0.00 66452/66452 update_process_times > [528] > > ----------------------------------------------- > > > > It looks as though 2.4.0 has 61851 * 10 ms. / 8 cpus = 77 seconds. > > 2.4.0+MQ has 66452 * 10 ms. / 8 cpus = 83 seconds. > > > > What were the results ? > > Was MQ about 8 % slower ? > > > > The sstat patch (scheduler statistics) would help here. > > The profile provides some of the same info. > > I have used it to look at workloads from the scheduler point of view. > > It has been very helpful in the past when looking at these type of > problems. > > I plan to move it up to 2.4.0 soon. > > > > Later, > > > > Bill Hartner > > IBM Linux Technology Center - kernel performance > > bha...@us... > > > > _______________________________________________ > > Lse-tech mailing list > > Lse...@li... > > http://lists.sourceforge.net/lists/listinfo/lse-tech > > -- > Jun U Nakajima > Core OS Development > SCO/Murray Hill, NJ > Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
From: Mike K. <mkr...@se...> - 2001-01-25 19:19:43
|
On Thu, Jan 25, 2001 at 02:00:37PM -0500, Jun Nakajima wrote:
> I understand the dilemma, i.e. wake_up_process() is not sure on which
> CPU run queue it should place the process, until reschedule_idle() finds
> an idle CPU or one with the lowest na_goodness.
>
> One thing I'm concerned about is that __wake_up_common() (which may call
> wake_up_process()) can pick up a process affine to the current CPU in
> the interrupt context and we don't want to migrate that process to
> another CPU, even if that CPU's current na_goodness happens to be the
> lowest.

The multiqueue scheduler does not differ from the original scheduler in this case. I'm not exactly sure what you mean by 'process affine' in this context. However, the current scheduler also picks the CPU running the lowest priority task (lowest na_goodness).

Waiting until reschedule_idle time to add a task to the appropriate CPU-specific runqueue is only an optimization. Any CPU can choose tasks from any of the runqueues. It just makes sense to add a task to the queue it will most likely be run from.

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center
|
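A rough sketch of the wakeup flow Mike describes -- this is not the actual MQ patch code, and add_to_runqueue_cpu() / mq_place_sketch() are hypothetical names used only for illustration:

/* Rough sketch, not the actual patch: wake_up_process() marks the task
 * runnable and defers queue placement to reschedule_idle(), which picks
 * an idle CPU (or the one running the lowest-na_goodness task) and then
 * enqueues the task on that CPU's runqueue. */
static void mq_wakeup_sketch(struct task_struct *p)
{
        p->state = TASK_RUNNING;
        if (!task_on_runqueue(p))
                reschedule_idle(p);             /* chooses target_cpu, then: */
}

static void mq_place_sketch(struct task_struct *p, int target_cpu)
{
        spin_lock(&runqueue_lock(target_cpu));
        add_to_runqueue_cpu(p, target_cpu);     /* hypothetical helper */
        spin_unlock(&runqueue_lock(target_cpu));
        /* any CPU may still pick p up from this queue later */
}

The queue choice is only a hint about where the task will probably be dispatched; correctness does not depend on it.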
From: Jun N. <ju...@sc...> - 2001-01-25 22:49:37
|
Mike Kravetz wrote: > > On Thu, Jan 25, 2001 at 02:00:37PM -0500, Jun Nakajima wrote: > > I understand the dilemma, i.e. wake_up_process() is not sure on which > > CPU run queue it should place the process, until reschedule_idle() finds > > an idle CPU or one with the lowest na_goodness. > > > > One thing I'm concerned is that __wake_up_common() (which may call > > wake_up_process()) can pick up a process affine to the current CPU in > > the interrupt context and we don't want to migrate that process to > > another CPU, even if that CPU's current na_goodness happens to be the > > lowest. > > The multiqueue scheduler does not differ from the original schheduler > in this case. I'm not exactly sure what you mean by 'process affine' > in this context. However, the current scheduler also picks the CPU > running the lowest priority task (lowest na_goodness). I think your code slightly differs from the original (I could be wrong). static void reschedule_idle(struct task_struct * p) { ... for (i = 0; i < smp_num_cpus; i++) { cpu = cpu_logical_map(i); ... stack_list[cpu] = curr_na_goodness(cpu); /* * Add in PROC_CHANGE_PENALTY for remote CPUs */ if (cpu != tsk_cpu) { stack_list[cpu] += PROC_CHANGE_PENALTY; } } /* * Look for the lowest value */ if (stack_list[cpu] < tmp_min_na_goodness) { target_cpu = cpu; tmp_min_na_goodness = stack_list[cpu]; } ... if cpu becomes (cpu == tsk_cpu), stack_list[cpu] does not get PROC_CHANGE_PENALTY, and if (stack_list[cpu] < tmp_min_na_goodness), then tmp_min_na_goodness is changed to the smaller one. And in the following iterations, the test "if (stack_list[cpu] < tmp_min_na_goodness)" becomes harder to be true. Whereas the orignal code checks the max value returned by preemption_goodness(tsk, p, cpu) (tsktsk = cpu_curr(cpu)). The code I'm talking about is the one below. __wake_up_common() is the common body of the wake_up family. Basically it prefers a process on the current CPU to ones on the other CPUs, depending on the mode/flag. This is reasonable because the initiator CPU of the interrupt potentialy has warm cache for handling it. Sompe platform delivers interrupts to the initiator CPU. static inline void __wake_up_common (wait_queue_head_t *q, unsigned int mode, unsigned int wq_mode, const int sync) { struct list_head *tmp, *head; struct task_struct *p, *best_exclusive; unsigned long flags; int best_cpu, irq; ... while (tmp != head) { unsigned int state; wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list); tmp = tmp->next; #if WAITQUEUE_DEBUG CHECK_MAGIC(curr->__magic); #endif p = curr->task; state = p->state; if (state & mode) { #if WAITQUEUE_DEBUG curr->__waker = (long)__builtin_return_address(0); #endif /* * If waking up from an interrupt context then * prefer processes which are affine to this * CPU. */ if (irq && (curr->flags & wq_mode & WQ_FLAG_EXCLUSIVE)) { if (!best_exclusive) best_exclusive = p; if (p->processor == best_cpu) { best_exclusive = p; break; } } else { if (sync) wake_up_process_synchronous(p); else wake_up_process(p); if (curr->flags & wq_mode & WQ_FLAG_EXCLUSIVE) break; } } } if (best_exclusive) { if (sync) wake_up_process_synchronous(best_exclusive); else wake_up_process(best_exclusive); } ... > > Waiting until reschedule_idle time to add a task to the appropriate > CPU specific runqueue is only an optimization. Any CPU can choose > tasks from any of the runquueueus. It just makes sense to add it to > the queue it will most likely be run from. > > -- > Mike Kravetz mkr...@se... 
> IBM Linux Technology Center -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
From: Mike K. <mkr...@se...> - 2001-01-26 01:15:14
|
On Thu, Jan 25, 2001 at 05:43:39PM -0500, Jun Nakajima wrote: > > I think your code slightly differs from the original (I could be wrong). > <code deleted> > > if cpu becomes (cpu == tsk_cpu), stack_list[cpu] does not get > PROC_CHANGE_PENALTY, and if (stack_list[cpu] < tmp_min_na_goodness), > then tmp_min_na_goodness is changed to the smaller one. And in the > following iterations, the test "if (stack_list[cpu] < > tmp_min_na_goodness)" becomes harder to be true. > > Whereas the orignal code checks the max value returned by > preemption_goodness(tsk, p, cpu) (tsktsk = cpu_curr(cpu)). preemption_goodness is pretty simple, so I'll include it here. static inline int preemption_goodness(struct task_struct * prev, struct task_struct * p, int cpu) { return goodness(p, cpu, prev->active_mm) - goodness(prev, cpu, prev->active_mm); } As you can see, the lower the goodness value of prev (in this case tsk) the higher the value returned by preemption_goodness(). So in essence the original code was looking for the CPU which is executing the task with the lowest goodness value. Also note that in the original code cpu is tsk->processor. Therefore, PROC_CHANGE_PENALTY is always added to the goodness value for tsk. Our code does pretty much the same thing (I believe). We are looking for the task with the lowest goodness value relative to the cpu 'p' previously ran on. Therefore, for all remote CPUs we add PROC_CHANGE_PENALTY to account for the loss of cache affinity. I believe this matches what preemption_goodness does. When cpu == tsk_cpu we don't add PROC_CHANGE_PENALTY because we want to do a direct comparison of na_goodness values. This would be similar to calling preemption_goodness when both tasks have the same 'processor' value. In this case PROC_CHANGE_PENALTY is added to both. Hope this makes sense (and I hope the code works as described/expected). > > The code I'm talking about is the one below. __wake_up_common() is the > common body of the wake_up family. Basically it prefers a process on the > current CPU to ones on the other CPUs, depending on the mode/flag. This > is reasonable because the initiator CPU of the interrupt potentialy has > warm cache for handling it. Sompe platform delivers interrupts to the > initiator CPU. That may be the case, but I believe reschedule_idle is the code that actually determines what CPU the task should run on. This behavior is not changed in the multiqueue scheduler. -- Mike Kravetz mkr...@se... IBM Linux Technology Center |
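For reference, preemption_goodness() above is just the difference of two goodness() values. The 2.4 goodness() calculation it builds on looks roughly like the following (paraphrased from kernel/sched.c; exact details and constants may differ slightly):

static inline int goodness(struct task_struct * p, int this_cpu,
                           struct mm_struct *this_mm)
{
        int weight;

        if (p->policy & SCHED_YIELD)
                return -1;                      /* task asked to yield */
        if (p->policy == SCHED_OTHER) {
                weight = p->counter;            /* remaining timeslice */
                if (!weight)
                        return 0;               /* slice exhausted */
#ifdef CONFIG_SMP
                if (p->processor == this_cpu)
                        weight += PROC_CHANGE_PENALTY;  /* cache affinity bonus */
#endif
                if (p->mm == this_mm || !p->mm)
                        weight += 1;            /* same address space / lazy TLB */
                weight += 20 - p->nice;
                return weight;
        }
        return 1000 + p->rt_priority;           /* realtime always wins */
}

So a task whose counter has run down to 0, or one that loses the PROC_CHANGE_PENALTY affinity bonus, is exactly the kind of low-goodness victim Bill suggested looking for.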
From: Jun N. <ju...@sc...> - 2001-01-26 15:15:25
|
I completely agree on your explanation below. And I think I figured out the difference. The original code triggers preemption when preemption_goodness(tsk, p, cpu) returns more than *1*. oldest_idle = (cycles_t) -1; target_tsk = NULL; max_prio = 1; ... if (oldest_idle == -1ULL) { int prio = preemption_goodness(tsk, p, cpu); if (prio > max_prio) { max_prio = prio; target_tsk = tsk; } ... But your code does that when it finds a task with a lower na_goodness value, compared to na_goodness(p). Because of this, it tends to cause reschedule more. So a quick fix to this would be like: saved_na_goodness = na_goodness(p); - tmp_min_na_goodness = saved_na_goodness; + tmp_min_na_goodness = saved_na_goodness - 1; Mike Kravetz wrote: > > On Thu, Jan 25, 2001 at 05:43:39PM -0500, Jun Nakajima wrote: > > > > I think your code slightly differs from the original (I could be wrong). > > > <code deleted> > > > > if cpu becomes (cpu == tsk_cpu), stack_list[cpu] does not get > > PROC_CHANGE_PENALTY, and if (stack_list[cpu] < tmp_min_na_goodness), > > then tmp_min_na_goodness is changed to the smaller one. And in the > > following iterations, the test "if (stack_list[cpu] < > > tmp_min_na_goodness)" becomes harder to be true. > > > > Whereas the orignal code checks the max value returned by > > preemption_goodness(tsk, p, cpu) (tsktsk = cpu_curr(cpu)). > > preemption_goodness is pretty simple, so I'll include it here. > > static inline int preemption_goodness(struct task_struct * prev, > struct task_struct * p, int cpu) > { > return goodness(p, cpu, prev->active_mm) - > goodness(prev, cpu, prev->active_mm); > } > > As you can see, the lower the goodness value of prev (in this case tsk) > the higher the value returned by preemption_goodness(). So in essence > the original code was looking for the CPU which is executing the task > with the lowest goodness value. Also note that in the original code > cpu is tsk->processor. Therefore, PROC_CHANGE_PENALTY is always added > to the goodness value for tsk. Our code does pretty much the same thing > (I believe). We are looking for the task with the lowest goodness value > relative to the cpu 'p' previously ran on. Therefore, for all remote > CPUs we add PROC_CHANGE_PENALTY to account for the loss of cache affinity. > I believe this matches what preemption_goodness does. When cpu == tsk_cpu > we don't add PROC_CHANGE_PENALTY because we want to do a direct comparison > of na_goodness values. This would be similar to calling preemption_goodness > when both tasks have the same 'processor' value. In this case > PROC_CHANGE_PENALTY is added to both. > > Hope this makes sense (and I hope the code works as described/expected). > > > > > The code I'm talking about is the one below. __wake_up_common() is the > > common body of the wake_up family. Basically it prefers a process on the > > current CPU to ones on the other CPUs, depending on the mode/flag. This > > is reasonable because the initiator CPU of the interrupt potentialy has > > warm cache for handling it. Sompe platform delivers interrupts to the > > initiator CPU. > > That may be the case, but I believe reschedule_idle is the code that > actually determines what CPU the task should run on. This behavior is > not changed in the multiqueue scheduler. > > -- > Mike Kravetz mkr...@se... > IBM Linux Technology Center > > _______________________________________________ > Lse-tech mailing list > Lse...@li... 
> http://lists.sourceforge.net/lists/listinfo/lse-tech -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
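Putting Jun's adjustment in context, the CPU-selection loop under discussion looks roughly like this (condensed from the excerpts posted in this thread; the real MQ patch code is not reproduced exactly):

/* Sketch only: mirrors the excerpts in this thread, not the full patch. */
saved_na_goodness = na_goodness(p);
tmp_min_na_goodness = saved_na_goodness - 1;    /* Jun's fix: require a margin */
target_cpu = -1;
for (i = 0; i < smp_num_cpus; i++) {
        cpu = cpu_logical_map(i);
        stack_list[cpu] = curr_na_goodness(cpu);
        if (cpu != tsk_cpu)
                stack_list[cpu] += PROC_CHANGE_PENALTY; /* remote-CPU penalty */
        if (stack_list[cpu] < tmp_min_na_goodness) {
                target_cpu = cpu;
                tmp_min_na_goodness = stack_list[cpu];
        }
}
/* if target_cpu >= 0, preempt the task currently running there */

With the original initialization (tmp_min_na_goodness = saved_na_goodness) a difference of one goodness point is enough to trigger a preemption, whereas the stock scheduler's max_prio = 1 requires a margin of at least two; the one-line change restores that margin.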
From: Mike K. <mkr...@se...> - 2001-01-26 19:43:59
|
On Fri, Jan 26, 2001 at 10:09:38AM -0500, Jun Nakajima wrote: > I completely agree on your explanation below. And I think I figured out > the difference. > > The original code triggers preemption when preemption_goodness(tsk, p, > cpu) returns more than *1*. > > oldest_idle = (cycles_t) -1; > target_tsk = NULL; > max_prio = 1; > ... > > if (oldest_idle == -1ULL) { > int prio = preemption_goodness(tsk, p, cpu); > > if (prio > max_prio) { > max_prio = prio; > target_tsk = tsk; > } > ... > > But your code does that when it finds a task with a lower na_goodness > value, compared to na_goodness(p). Because of this, it tends to cause > reschedule more. So a quick fix to this would be like: > > saved_na_goodness = na_goodness(p); > - tmp_min_na_goodness = saved_na_goodness; > + tmp_min_na_goodness = saved_na_goodness - 1; Good find Jun! Note that in the code we ensure (preemption_goodness(tsk, p, target_cpu) > 1) if the 'target cpu' is not the same as the CPU that task p last ran on. However, there is a 'special case' in the code if target cpu is the same as the CPU that p last ran on. In this special case we do not make the preemption_goodness check. In this special case the behavior is different from the original scheduler as you point out above. I'll make this change and see what happens. Thanks, -- Mike Kravetz mkr...@se... IBM Linux Technology Center |
From: Mike K. <mkr...@se...> - 2001-01-27 00:43:51
|
I applied the fix for the bug Jun found and ran the chat benchmark with a profiled kernel. The number of reschedules dropped, but only by a very small amount. There was no significant change.

I then tried an experiment. I modified the multi-queue scheduler code so that only a single lock is used, as opposed to the per-runqueue locks. All algorithms remained unchanged; I only changed the locking model. My intention was to increase lock contention and see how this impacted the benchmark. I ran the benchmark with a profiled version of this kernel. Here are the results when compared to multi-queue and 2.4.0 untouched.

                  Calls to:
                  schedule   reschedule_idle   tcp_data_wait
----------------------------------------------------------------------
untouched           412448            438062          281893
multiq              939687            920098          595161
multiq (gl)         578351            559267          407864

Even when the multiqueue scheduler used a global lock, the time spent in the scheduling code was much shorter. This is because there is no need to scan the entire list of runnable tasks (non-local runqueues have pointers to the best task on their queues). In the untouched 2.4.0 kernel, during this benchmark we average 181.89 microseconds per call. In the multiqueue scheduler, even with a global lock, we are only averaging 46.69 microseconds per call. For a more accurate comparison, I added the extra runqueue scans to the multiqueue scheduler with the global locking model. These scans were not used for anything (the multiqueue algorithms did not change); I only wanted to add extra overhead to the schedule routine. After a run with this kernel I measured an average of 185.17 microseconds per call, which is in the ballpark of the untouched kernel. Profile results for this kernel are included below.

                  Calls to:
                  schedule   reschedule_idle   tcp_data_wait
----------------------------------------------------------------------
untouched           412448            438062          281893
multiq              939687            920098          595161
multiq (gl)         578351            559267          407864
multiq (gl, s)      393834            416385          265007

The interesting result of these experiments is that the number of reschedules decreases as runqueue lock contention and hold time increases.

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center
|
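The per-call figures Mike quotes can be reproduced from the gprof data earlier in the thread as (self + children time) / calls. For the untouched kernel's schedule [11] entry, for example:

    (68.00 s + 7.02 s) / 412448 calls  ~=  181.9 us per call

The multiqueue-with-global-lock figure of roughly 46.7 us per call follows from the same arithmetic applied to that kernel's profile (not posted in this thread).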
From: Jun N. <ju...@sc...> - 2001-01-29 20:13:15
|
Good experiments, Mike. I think the original scheduler would exhibit such scheduling characteristics. Since reschedule_idle() is called with the runqueue_lock held, the number of reschedules decreases as runqueue lock contention and hold time increases. In addition, smp_send_reschedule() tends to be performed less, as runqueue lock contention and hold time increases. If the best_cpu below, for example, is already spin-waiting at the runqueue lock with need_resched = 1, smp_send_reschedule() is not performed against the cpu. static void reschedule_idle(struct task_struct * p) { #ifdef CONFIG_SMP send_now_idle: /* * If need_resched == -1 then we can skip sending * the IPI altogether, tsk->need_resched is * actively watched by the idle thread. */ need_resched = tsk->need_resched; tsk->need_resched = 1; if ((best_cpu != this_cpu) && !need_resched) smp_send_reschedule(best_cpu); return; } } ... tsk = target_tsk; if (tsk) { if (oldest_idle != -1ULL) { best_cpu = tsk->processor; goto send_now_idle; } tsk->need_resched = 1; if (tsk->processor != this_cpu) smp_send_reschedule(tsk->processor); } return; ... The current implementation of the MQ-scheduler inherently causes more reschedule because more than one CPU can perform reschedule_idle() at the same time, reading curr_na_goodness(cpu). One thing I noticed with the current code is: => If spin_trylock(&runqueue_lock(target_cpu) fails, then it tries to find the 'next lowest' cur_na_goodness value, to cause preepmption. I think this code (see below) is a bit tricky. The array stack_list[cpu] in the stack may not be valid any more because schedule() may have happened on the other CPUs, and curr_na_goodness(cpu) may have been updated by this time. /* * Update value so we don't check this CPU again. */ stack_list[target_cpu] = saved_na_goodness; /* * Find the 'next lowest' cur_na_goodness value. */ target_cpu = -1; tmp_min_na_goodness = saved_na_goodness; for (i = 0; i < smp_num_cpus; i++) { cpu = cpu_logical_map(i); if (stack_list[cpu] < tmp_min_na_goodness) { target_cpu = cpu; tmp_min_na_goodness = stack_list[cpu]; Although it checks if (preemption_goodness(tsk, p, target_cpu) > 1) after spin_trylock(&runqueue_lock(target_cpu)), we don't know if the target CPU is running a task with the mostly lowest na_goodness at that time. However, I don't think we want to know or use the very accurate curr_na_goodness(cpu) because they change very frequently and thus the scheduling decision can be misled. It's might be a time for us to explore a more sutitable scheduling algorithm for the MQ scheduling... Mike Kravetz wrote: > > I applied the fix for the bug Jun found and ran the chat benchmark > with a profiled kernel. The number of reschedules dropped, but only > by a very small amount. There was no significant change. > > I then tried an experiment. I modified the multi-queue scheduler > code such that only a single lock is used, as opposed to the per- > runqueue locks. All, algorithms remained unchanged. I only changed > the locking model. It was my intention to increase lock contention > and see how this impacted the benchmark. I ran the benchmark with > a profiled version of this kernel. Here are the results when > compared to multi-queue and 2.4.0 untouched. 
> > Calls to: > schedule reschedule_idle tcp_data_wait > ---------------------------------------------------------------------- > untouched 412448 438062 281893 > multiq 939687 920098 595161 > multiq (gl) 578351 559267 407864 > > Even when the multiqueue scheduler used a global lock, the time > spent in the scheduling code was much shorter. This is because > there is no need to scan the entire list of runnable tasks (non- > local runqueues have pointers to the best task on their queues). > In the untouched 2.4.0 kernel, during this benchmark we average > 181.89 milliseconds per call. In the multiqueue scheduler even > with a global lock we are only averaging 46.69 milliseconds per > call. For a more accurate comparison, I added the extra runqueue > scans to the multiqueue scheduler with the global locking model. > These scans were not used for anything (multiqueue algorithms did > not change), I only wanted to add extra overhead to the schedule > routine. After a run with this kernel I measured an average of > 185.17 milliseconds per call. This was in the ballpark of the > untouched kernel. Profile results for this kernel are included > below. > > Calls to: > schedule reschedule_idle tcp_data_wait > ---------------------------------------------------------------------- > untouched 412448 438062 281893 > multiq 939687 920098 595161 > multiq (gl) 578351 559267 407864 > multiq (gl, s) 393834 416385 265007 > > The interesting result of these experiments is that the number of > reschedules decreases as runqueue lock contention and hold time > increases. > > -- > Mike Kravetz mkr...@se... > IBM Linux Technology Center > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
From: Mike K. <mkr...@se...> - 2001-01-29 22:26:54
|
In a previous note, I discounted the improvements made by the bug fix to reschedule_idle. This was mainly a result of studying differences in kernel profile output. While it is true that the fix did not significantly reduce the number of reschedules, it does appear to significantly improve the chat benchmark numbers. At least this version of the multi-queue scheduler seems to be performing like previous versions.

rooms/messages      2.4.0     2.4.0-mq
----------------------------------------------
10/100              86103        94073
10/200             101182       112933
10/300             107558       113011
10/400             110380       123717
20/100              72928        97917
20/200              84545       111967
20/300              90030       118210
20/400              98081       123935
30/100              60870        98184
30/200              76411       113034
30/300              70368       122253
30/400               8909        13140

I have updated the patch on the web site. Note that the change Jun suggested in an earlier mail was not complete; we still need to check preemption_goodness() before sending the IPI. Thanks again to Jun for finding this.

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center
|
From: Hubertus F. <fr...@us...> - 2001-01-25 19:41:02
|
We are currently trying to boost the priority of the server threads and see whether that makes any difference. We could also try another combination, namely raising the priority of all sender threads in the system. I think this makes sense to figure out where the problem lies. Shailabh is working on that as we speak.

Mike, as discussed yesterday, we will post all the relevant numbers that we have right now. We collected, for 1,2,3,4,5,6,7,8-way configurations on 2.4.1-pre8 (vanilla, MQ, prio):

sched_yield_test (1--16,32,256,..,2048) threads
Chatroom (10-30) rooms x (100-300) messages

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Mike Kravetz <mkr...@se...> on 01/25/2001 02:10:35 PM

To: Hubertus Franke/Watson/IBM@IBMUS
cc: Bill Hartner/Austin/IBM@IBMUS, lse...@li...
Subject: Re: [Lse-tech] reschedule: - MQ scheduler

On Thu, Jan 25, 2001 at 11:00:15AM -0500, Hubertus Franke wrote:
>
> I think Mike traced it already back to the fact that the tcp_recvmsg's
> block, i.e. there is no data there.
> Hence the process gets back into a wait queue and wakes up with
> reschedule_idle being called.
> That's the source of the problem. We are doing TOO GOOD .....

Well, that is a theory I am working on. The reason for the slowdown is 'too many process preemptions' as stated by Bill. Now the difficult part is to determine why. Here is another interesting piece of the profile data that has not made it out to this list yet.

The following is from the untouched 2.4.0 kernel:

            12.04   94.37  8000060/8000060   inet_recvmsg [7]
[8]   17.8  12.04   94.37  8000060           tcp_recvmsg [8]
             0.96   51.84   281893/281893    tcp_data_wait [16]
             1.67   17.29  7774955/8018782   memcpy_toiovec [30]
             2.46    7.52  8281953/8281953   cleanup_rbuf [38]
             0.23    6.51   277660/277660    tcp_prequeue_process [46]
             0.03    3.98   330314/2453362   _wake_up [19]
             0.31    0.81   257191/1585985   _kfree_skb [45]
             0.01    0.71     3687/20785     _lock_sock [62]
             0.00    0.03     1490/8500      _release_sock [123]

and here it is for the kernel with multiqueue:

            12.91   83.95  8000060/8000060   inet_recvmsg [7]
[8]   15.1  12.91   83.95  8000060           tcp_recvmsg [8]
             2.02   29.45   595161/595161    tcp_data_wait [23]
             1.62   17.49  7470947/8022664   memcpy_toiovec [29]
             2.79   14.28  8595221/8595221   cleanup_rbuf [35]
             0.69   12.67   590860/590860    tcp_prequeue_process [40]
             0.02    1.42   221714/2308800   _wake_up [37]
             0.38    0.85   292349/2660942   _kfree_skb [42]
             0.01    0.24     4678/24471     _lock_sock [87]
             0.00    0.03     1548/9337      _release_sock [129]

Note that tcp_recvmsg gets called the same number of times in both cases. However, note the big differences in the number of times tcp_recvmsg() calls the routines tcp_data_wait() and tcp_prequeue_process(). The tcp_data_wait() routine results in a sleep/wakeup cycle via a call to schedule_timeout(). A result of these extra sleep/wakeup calls is the increased number of priority preemptions via reschedule_idle(). Now, tcp_data_wait() is called to wait for data when a recv request cannot be immediately satisfied. So with the multi-queue scheduler, recv'ing data blocks (sleep/wakeup) some 300,000+ more times than with the current scheduler. I believe this accounts for the additional reschedules we see in the multi-queue kernel.

Now here is some data to show runqueue lock contention for both kernels during the runs. Sorry for the wide lines produced by the tool.

2.4.0 vanilla
-------------
SPINLOCKS           HOLD                 WAIT
  UTIL     CON     MEAN (   MAX )     MEAN (   MAX )    TOTAL   NOWAIT    SPIN  REJECT  Acquired at
--------------------------------------------------------------------------------------------------------------------
 0.17%   32.35%   5.5us(   20us)     10us(  380us)      50766    34344   16422       0  __schedule_tail+0x68
 0.00%    0.00%   1.3us(  2.2us)      0us                  56       56       0       0  deliver_signal+0x58
 9.10%   16.07%    57us(  337us)    6.0us(  558us)     263570   221213   42357       0  schedule+0xc8 beginning
 0.01%    2.56%   3.4us(  101us)    0.1us(   22us)       4602     4484     118       0  schedule+0x44c after recalc
 0.73%   32.36%   5.8us(   26us)     19us(  570us)     206799   139869   66930       0  wake_up_process+0x14

2.4.0 multi-queue
-----------------
SPINLOCKS           HOLD                 WAIT
  UTIL     CON     MEAN (   MAX )     MEAN (   MAX )    TOTAL   NOWAIT    SPIN  REJECT  Acquired at
--------------------------------------------------------------------------------------------------------------------
 0.29%    2.85%   3.6us(   33us)    0.1us(   22us)     135574   131714    3860       0  __schedule_tail+0xa4
 0.00%    0.00%   1.5us(  2.5us)      0us                  48       48       0       0  deliver_signal+0xa8
 0.02%    2.73%   2.4us(   16us)      0us               15869    15436       0     433  reschedule_idle+0x2e0 try
 3.61%    1.77%    11us(  126us)    0.1us(   19us)     519630   510447    9183       0  schedule+0x130 beginning
 0.02%   28.79%   2.5us(   12us)      0us               10382     7393       0    2989  schedule+0x648 try
 0.01%    0.03%   3.1us(   29us)    0.0us(  5.6us)       3599     3598       1       0  schedule+0xbdc after recalc
 0.94%    8.47%   4.3us(   37us)    1.0us(   79us)     356086   325925   30161       0  wake_up_process+0x48

Note the increased lock wait times with the untouched 2.4.0 kernel. Here is where I started to form my 'wacky' theory. When a task recv'ing data notices no data is present on the socket, it goes into a sleep/wakeup cycle. This sleep/wakeup cycle requires the runqueue lock to be acquired a few times. Now, the longer one waits to get through this sleep/wakeup cycle, the greater the chance that additional data will be queued up on the socket. Hence, if you are delayed, you may be able to process multiple recv's before blocking again. Thus, the increased lock contention in the untouched kernel results in slight delays which result in a 'buffering' of data on the socket. Contrast this with the multiqueue scheduler, which can get through the sleep/wakeup cycle faster. In that case it is less likely that additional data will be queued to the socket, and less likely that you will be able to process multiple recv's before blocking again.

Like I said, this sounds wacky: increased lock contention causing an increase in performance/throughput. I'm looking into this and other possible reasons for the slowdown.

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center
15450 SW Koll Parkway
Beaverton, OR 97006-6063
(503)578-3494
|
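For readers who have not looked at the receive path, the sleep/wakeup cycle Mike describes is the wait helper that tcp_recvmsg() falls into when the receive queue is empty. Schematically it does something like the following -- an illustrative sketch only, not the exact 2.4 net/ipv4/tcp.c code:

/* Illustrative sketch of a receive-side data wait.  Details differ from
 * the real tcp_data_wait(), but the cost structure is the same: one
 * add_wait_queue / schedule_timeout / wake_up round trip per block. */
static long data_wait_sketch(struct sock *sk, long timeo)
{
        DECLARE_WAITQUEUE(wait, current);

        add_wait_queue(sk->sleep, &wait);
        set_current_state(TASK_INTERRUPTIBLE);

        release_sock(sk);
        if (skb_queue_empty(&sk->receive_queue))
                timeo = schedule_timeout(timeo);   /* sleep until the sender wakes us */
        lock_sock(sk);

        set_current_state(TASK_RUNNING);
        remove_wait_queue(sk->sleep, &wait);
        return timeo;
}

Each pass through such a helper is one full sleep/wakeup round trip plus the corresponding wake_up on the send side, which is why the extra ~300,000 tcp_data_wait calls in the MQ profile translate directly into extra schedule and reschedule_idle traffic.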
From: Bill H. <bha...@us...> - 2001-01-25 19:43:55
|
> I think Mike traced it already back to the fact that the
> tcp_recvmsg's block, i.e. there is no data there.
> Hence the process gets back into a wait queue and
> wakes up with reschedule_idle being called.
> That's the source of the problem. We are doing
> TOO GOOD .....

Why so many reschedules? I think something might be broken. Consider the following scenario (remember that this is loopback). Process A sends a 100 byte message and calls wakeup for Process B (the receiver), and Process A gets preempted because "something is broken" in reschedule_idle or the data struct(s) it uses. As Process A exits the kernel it gets caught by need_resched and calls schedule(). In the normal case (2.4.0) the sender does not get caught by need_resched because "something is not broken". Therefore, Process A gets many sends in, keeping the recv buffer full, and you do not get much blocking in tcp_recvmsg.

Look for miscalculation of goodness for the process waking up and for the currently running process(es) in reschedule_idle. Check preemption_goodness(). You should see a relatively small amount of priority preemption, as in the 2.4.0 case.

Bill
|
From: Hubertus F. <fr...@us...> - 2001-01-25 20:01:04
|
I don't know about that, effectively there is no process memory affinity in the kernel, everything maps to the same address range. Ofcourse some affinity will be established if objects are touched that only belong to a single process. Also, if you touch so many cache lines in the interrupt context that it truely matters or that lots of false sharing takes place, then we have another serious problem at our hands. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Jun Nakajima <ju...@sc...>@lists.sourceforge.net on 01/25/2001 02:00:37 PM Sent by: lse...@li... To: Hubertus Franke/Watson/IBM@IBMUS cc: lse...@li... Subject: Re: [Lse-tech] reschedule: - MQ scheduler I understand the dilemma, i.e. wake_up_process() is not sure on which CPU run queue it should place the process, until reschedule_idle() finds an idle CPU or one with the lowest na_goodness. One thing I'm concerned is that __wake_up_common() (which may call wake_up_process()) can pick up a process affine to the current CPU in the interrupt context and we don't want to migrate that process to another CPU, even if that CPU's current na_goodness happens to be the lowest. Hubertus Franke wrote: > > Yes, > This is correct. > > Since there is a runqueue for each CPU and the decision on which runqueue > to place is made in the MQ > case in reschedule_idle, we don't do it here. > If this is a UP then its the same code as the current scheduler.. > > Hubertus Franke > Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) > , OS-PIC (Chair) > email: fr...@us... > (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 > > Jun Nakajima <ju...@sc...>@lists.sourceforge.net on 01/25/2001 10:15:27 AM > > Sent by: lse...@li... > > To: Bill Hartner/Austin/IBM@IBMUS > cc: lse...@li... > Subject: Re: [Lse-tech] reschedule: - MQ scheduler > > This may not be related directly to the problem below, but I don't > understand the code: > > inline void wake_up_process(struct task_struct * p) > { > unsigned long flags; > #ifdef CONFIG_SMP > int rq; > > rq = lock_task_cpu_rq_irqsave(p, &flags); > if (task_to_runqueue(p) == REALTIME_RQ) { > /* > * Indicates prev is a realtime task. We must > * also lock the realtime runqueue. > */ > spin_lock(&runqueue_lock(REALTIME_RQ)); > } > #else > spin_lock_irqsave(&runqueue_lock, flags); > #endif > /* > * We want the common case fall through straight, thus the goto. > */ > p->state = TASK_RUNNING; > if (task_on_runqueue(p)) > goto out; > #ifndef CONFIG_SMP > ^^^^^^^ > add_to_runqueue(p); > #endif > reschedule_idle(p); > out: > > ... > > Am I missing something? > > Thanks, > > Bill Hartner wrote: > > > > Looking at the (2) profiles that were sent for > > 2.4.0 vs. 2.4.0+MQ using the chatroom benchmark : > > > > There are more reschedule: in the MQ profile. > > ** > > > > 28,000 vs. 239,000 > > > > See below for possible cause. > > > > ----------------------------------------------- > > <spontaneous> > > [41] 1.8 0.02 11.41 reschedule [41] > > 9.80 1.61 239329/939687 schedule [16] > > ----------------------------------------------- > > ... > > 4.63 0.48 28058/412448 reschedule [52] > > 5.22 0.54 31657/412448 cpu_idle [27] > > 48.06 4.96 291491/412448 schedule_timeout [15] > > [11] 12.5 68.00 7.02 412448 schedule [11] > > 5.65 0.00 382554/398735 _schedule_tail [49] > > 0.03 0.96 26663/868157 do_softirq [17] > > ... > > > > vs. 
> > > > ----------------------------------------------- > > <spontaneous> > > [52] 0.9 0.00 5.10 reschedule [52] > > 4.63 0.48 28058/412448 schedule [11] > > ----------------------------------------------- > > ... > > 1.23 0.20 29953/939687 cpu_idle [27] > > 9.80 1.61 239329/939687 reschedule [41] > > 24.78 4.07 605300/939687 schedule_timeout [24] > > [16] 7.0 38.47 6.32 939687 schedule [16] > > 3.48 0.02 901925/918106 _schedule_tail [67] > > 0.06 2.12 69717/1561636 do_softirq [14] > > 0.60 0.00 901925/918106 _switch_to [103] > > ... > > > > It appears as though the following could be happening : > > > > (1) Increased number of priority preemptions. This will decrease > > performance and if occurring on the send side, then increased > > tcp_data_wait on the receive side. Look for running processes > > that have counter=0 or miscalculations of goodness when being > > called by reschedule_idle causing increased priority preemptions. > > Also, make sure you are not leaving need_resched on when leaving > > schedule. > > > > (2) I don't think it is time slice preemptions. > > > > If you look at the following : > > > > ----------------------------------------------- > > 0.00 0.00 61851/61851 UNKNOWN_KERNEL [1255] > > [515] 0.0 0.00 0.00 61851 smp_apic_timer_interrupt > [515] > > 0.00 0.00 61851/61851 update_process_times > [517] > > ----------------------------------------------- > > vs. > > > > ----------------------------------------------- > > 0.00 0.00 66452/66452 UNKNOWN_KERNEL [1269] > > [526] 0.0 0.00 0.00 66452 smp_apic_timer_interrupt > [526] > > 0.00 0.00 66452/66452 update_process_times > [528] > > ----------------------------------------------- > > > > It looks as though 2.4.0 has 61851 * 10 ms. / 8 cpus = 77 seconds. > > 2.4.0+MQ has 66452 * 10 ms. / 8 cpus = 83 seconds. > > > > What were the results ? > > Was MQ about 8 % slower ? > > > > The sstat patch (scheduler statistics) would help here. > > The profile provides some of the same info. > > I have used it to look at workloads from the scheduler point of view. > > It has been very helpful in the past when looking at these type of > problems. > > I plan to move it up to 2.4.0 soon. > > > > Later, > > > > Bill Hartner > > IBM Linux Technology Center - kernel performance > > bha...@us... > > > > _______________________________________________ > > Lse-tech mailing list > > Lse...@li... > > http://lists.sourceforge.net/lists/listinfo/lse-tech > > -- > Jun U Nakajima > Core OS Development > SCO/Murray Hill, NJ > Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 _______________________________________________ Lse-tech mailing list Lse...@li... http://lists.sourceforge.net/lists/listinfo/lse-tech |
From: Shailabh N. <na...@us...> - 2001-01-26 15:06:17
|
Some numbers for the theory that the increased performance of the MQ scheduler is causing more receiver threads to sleep waiting for data (not yet posted by sender threads).

We ran a modified chatroom, in which every sender thread (on server and client side) is given a priority boost of 2 before it starts. That increases the likelihood of messages being sent. The preliminary results from a 4-way system are as follows (comments below). All numbers are the throughputs reported by the chatroom. Vanilla refers to 2.4.1-pre8, MQ to the 2.4.0-mq1rt patch applied to 2.4.1-pre8. Each value is averaged over 5 runs (statistically better runs are being conducted now).

10 rooms/100 msgs       MQ      Vanilla     Diff
No boost            133167       152765    19598
Boost=2             173311       179843     6532

10 rooms/200 msgs       MQ      Vanilla     Diff
No boost            142533       154945    12412
Boost=2             177076       176494     -582

20 rooms/100 msgs       MQ      Vanilla     Diff
No boost            130950       139145     8195
Boost=2             160392       163974     3582

20 rooms/200 msgs       MQ      Vanilla     Diff
No boost            150514       138975   -11539
Boost=2             175223       166575    -8648

Applying the boost to sender threads clearly reduces the difference between MQ and vanilla in all cases. So applying a simple priority boost to both (for fairness of comparison) is one possible way of eliminating the tying of senders/receivers that we were seeing. Along the same lines, we could explore:
- giving a boost only to server send threads
- increasing the priority boost values
besides running the same tests as above with more runs for statistical reasons.

Shailabh Nagar
(914) 945 2851
na...@us...

Hubertus Franke/Watson/IBM@IB...@li... on 01/25/2001 02:36:51 PM
Sent by: lse...@li...

To: Mike Kravetz <mkr...@se...>
cc: Bill Hartner/Austin/IBM@IBMUS, lse...@li...
Subject: Re: [Lse-tech] reschedule: - MQ scheduler

We are currently trying to boost the priority of the server threads and see whether that makes any difference. We could also try another combination, namely raising the priority of all sender threads in the system. I think this makes sense to figure out where the problem lies. Shailabh is working on that as we speak.

Mike, as discussed yesterday, we will post all the relevant numbers that we have right now. We collected, for 1,2,3,4,5,6,7,8-way configurations on 2.4.1-pre8 (vanilla, MQ, prio): sched_yield_test (1--16,32,256,..,2048) threads, Chatroom (10-30) rooms x (100-300) messages.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
|
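The exact mechanism used for the boost is not shown in this thread; one plausible way a sender thread could raise its own priority by a couple of nice levels at startup is a setpriority() call (the helper below is hypothetical and needs sufficient privilege to lower the nice value):

#include <errno.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper: raise the calling thread's priority by 'boost'
 * nice levels.  Under 2.4 LinuxThreads each thread has its own PID, so
 * PRIO_PROCESS with who == 0 affects only the calling thread. */
static void boost_self(int boost)
{
        int cur;

        errno = 0;
        cur = getpriority(PRIO_PROCESS, 0);
        if (cur == -1 && errno != 0) {
                perror("getpriority");
                return;
        }
        if (setpriority(PRIO_PROCESS, 0, cur - boost) != 0)
                perror("setpriority");
}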
From: cardente, j. <car...@em...> - 2001-01-26 21:43:55
|
Somewhat related question.. has anyone ever tested a goodness threshold higher than >1 when determing if a preemption should occur, paritcularly in an SMP system. What makes this more interesting is that since the schedule_idle() routine invokes reschedule_idle() for a preempted process there's a chance that a single preemption could lead to a daisy chain of preemptions in an SMP. An improperly low preemption goodness threshold could lead to worse system performance. The MQ patch with processor sets might be able to address this by restricting the length of such a daisy chain of events to the size of a single CPU set. any thoughts.... john -----Original Message----- From: Jun Nakajima [mailto:ju...@sc...] Sent: Friday, January 26, 2001 10:10 AM To: Mike Kravetz Cc: lse...@li... Subject: Re: [Lse-tech] reschedule: - MQ scheduler I completely agree on your explanation below. And I think I figured out the difference. The original code triggers preemption when preemption_goodness(tsk, p, cpu) returns more than *1*. oldest_idle = (cycles_t) -1; target_tsk = NULL; max_prio = 1; ... if (oldest_idle == -1ULL) { int prio = preemption_goodness(tsk, p, cpu); if (prio > max_prio) { max_prio = prio; target_tsk = tsk; } ... But your code does that when it finds a task with a lower na_goodness value, compared to na_goodness(p). Because of this, it tends to cause reschedule more. So a quick fix to this would be like: saved_na_goodness = na_goodness(p); - tmp_min_na_goodness = saved_na_goodness; + tmp_min_na_goodness = saved_na_goodness - 1; Mike Kravetz wrote: > > On Thu, Jan 25, 2001 at 05:43:39PM -0500, Jun Nakajima wrote: > > > > I think your code slightly differs from the original (I could be wrong). > > > <code deleted> > > > > if cpu becomes (cpu == tsk_cpu), stack_list[cpu] does not get > > PROC_CHANGE_PENALTY, and if (stack_list[cpu] < tmp_min_na_goodness), > > then tmp_min_na_goodness is changed to the smaller one. And in the > > following iterations, the test "if (stack_list[cpu] < > > tmp_min_na_goodness)" becomes harder to be true. > > > > Whereas the orignal code checks the max value returned by > > preemption_goodness(tsk, p, cpu) (tsktsk = cpu_curr(cpu)). > > preemption_goodness is pretty simple, so I'll include it here. > > static inline int preemption_goodness(struct task_struct * prev, > struct task_struct * p, int cpu) > { > return goodness(p, cpu, prev->active_mm) - > goodness(prev, cpu, prev->active_mm); > } > > As you can see, the lower the goodness value of prev (in this case tsk) > the higher the value returned by preemption_goodness(). So in essence > the original code was looking for the CPU which is executing the task > with the lowest goodness value. Also note that in the original code > cpu is tsk->processor. Therefore, PROC_CHANGE_PENALTY is always added > to the goodness value for tsk. Our code does pretty much the same thing > (I believe). We are looking for the task with the lowest goodness value > relative to the cpu 'p' previously ran on. Therefore, for all remote > CPUs we add PROC_CHANGE_PENALTY to account for the loss of cache affinity. > I believe this matches what preemption_goodness does. When cpu == tsk_cpu > we don't add PROC_CHANGE_PENALTY because we want to do a direct comparison > of na_goodness values. This would be similar to calling preemption_goodness > when both tasks have the same 'processor' value. In this case > PROC_CHANGE_PENALTY is added to both. > > Hope this makes sense (and I hope the code works as described/expected). 
> > > > > The code I'm talking about is the one below. __wake_up_common() is the > > common body of the wake_up family. Basically it prefers a process on the > > current CPU to ones on the other CPUs, depending on the mode/flag. This > > is reasonable because the initiator CPU of the interrupt potentialy has > > warm cache for handling it. Sompe platform delivers interrupts to the > > initiator CPU. > > That may be the case, but I believe reschedule_idle is the code that > actually determines what CPU the task should run on. This behavior is > not changed in the multiqueue scheduler. > > -- > Mike Kravetz mkr...@se... > IBM Linux Technology Center > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 _______________________________________________ Lse-tech mailing list Lse...@li... http://lists.sourceforge.net/lists/listinfo/lse-tech |
From: Hubertus F. <fr...@us...> - 2001-01-29 21:55:12
|
Jun, you always have this problem. Information is stale by the time you looked and digested it. For instance the counter value is updated without holding the scheduling lock. The fork might just have sliced the counter value in half .... I think in all these approaches one has to assume a certain stability of the counter value around a certain time interval and deal with the occasional misbehavior. Do you have a suggestion, how we could do it differently then in the current MQ scheduler. One thought we had is really disassociate the queues alltogether and do everything with load balancing at a different level. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Jun Nakajima <ju...@sc...>@lists.sourceforge.net on 01/29/2001 03:07:25 PM Sent by: lse...@li... To: Mike Kravetz <mkr...@se...> cc: lse...@li... Subject: Re: [Lse-tech] reschedule: - MQ scheduler Good experiments, Mike. I think the original scheduler would exhibit such scheduling characteristics. Since reschedule_idle() is called with the runqueue_lock held, the number of reschedules decreases as runqueue lock contention and hold time increases. In addition, smp_send_reschedule() tends to be performed less, as runqueue lock contention and hold time increases. If the best_cpu below, for example, is already spin-waiting at the runqueue lock with need_resched = 1, smp_send_reschedule() is not performed against the cpu. static void reschedule_idle(struct task_struct * p) { #ifdef CONFIG_SMP send_now_idle: /* * If need_resched == -1 then we can skip sending * the IPI altogether, tsk->need_resched is * actively watched by the idle thread. */ need_resched = tsk->need_resched; tsk->need_resched = 1; if ((best_cpu != this_cpu) && !need_resched) smp_send_reschedule(best_cpu); return; } } ... tsk = target_tsk; if (tsk) { if (oldest_idle != -1ULL) { best_cpu = tsk->processor; goto send_now_idle; } tsk->need_resched = 1; if (tsk->processor != this_cpu) smp_send_reschedule(tsk->processor); } return; ... The current implementation of the MQ-scheduler inherently causes more reschedule because more than one CPU can perform reschedule_idle() at the same time, reading curr_na_goodness(cpu). One thing I noticed with the current code is: => If spin_trylock(&runqueue_lock(target_cpu) fails, then it tries to find the 'next lowest' cur_na_goodness value, to cause preepmption. I think this code (see below) is a bit tricky. The array stack_list[cpu] in the stack may not be valid any more because schedule() may have happened on the other CPUs, and curr_na_goodness(cpu) may have been updated by this time. /* * Update value so we don't check this CPU again. */ stack_list[target_cpu] = saved_na_goodness; /* * Find the 'next lowest' cur_na_goodness value. */ target_cpu = -1; tmp_min_na_goodness = saved_na_goodness; for (i = 0; i < smp_num_cpus; i++) { cpu = cpu_logical_map(i); if (stack_list[cpu] < tmp_min_na_goodness) { target_cpu = cpu; tmp_min_na_goodness = stack_list[cpu]; Although it checks if (preemption_goodness(tsk, p, target_cpu) > 1) after spin_trylock(&runqueue_lock(target_cpu)), we don't know if the target CPU is running a task with the mostly lowest na_goodness at that time. However, I don't think we want to know or use the very accurate curr_na_goodness(cpu) because they change very frequently and thus the scheduling decision can be misled. 
It might be time for us to explore a more suitable scheduling algorithm for MQ scheduling... Mike Kravetz wrote: > > I applied the fix for the bug Jun found and ran the chat benchmark > with a profiled kernel. The number of reschedules dropped, but only > by a very small amount. There was no significant change. > > I then tried an experiment. I modified the multi-queue scheduler > code such that only a single lock is used, as opposed to the per- > runqueue locks. All, algorithms remained unchanged. I only changed > the locking model. It was my intention to increase lock contention > and see how this impacted the benchmark. I ran the benchmark with > a profiled version of this kernel. Here are the results when > compared to multi-queue and 2.4.0 untouched. > > Calls to: > schedule reschedule_idle tcp_data_wait > ---------------------------------------------------------------------- > untouched 412448 438062 281893 > multiq 939687 920098 595161 > multiq (gl) 578351 559267 407864 > > Even when the multiqueue scheduler used a global lock, the time > spent in the scheduling code was much shorter. This is because > there is no need to scan the entire list of runnable tasks (non- > local runqueues have pointers to the best task on their queues). > In the untouched 2.4.0 kernel, during this benchmark we average > 181.89 milliseconds per call. In the multiqueue scheduler even > with a global lock we are only averaging 46.69 milliseconds per > call. For a more accurate comparison, I added the extra runqueue > scans to the multiqueue scheduler with the global locking model. > These scans were not used for anything (multiqueue algorithms did > not change), I only wanted to add extra overhead to the schedule > routine. After a run with this kernel I measured an average of > 185.17 milliseconds per call. This was in the ballpark of the > untouched kernel. Profile results for this kernel are included > below. > > Calls to: > schedule reschedule_idle tcp_data_wait > ---------------------------------------------------------------------- > untouched 412448 438062 281893 > multiq 939687 920098 595161 > multiq (gl) 578351 559267 407864 > multiq (gl, s) 393834 416385 265007 > > The interesting result of these experiments is that the number of > reschedules decreases as runqueue lock contention and hold time > increases. > > -- > Mike Kravetz mkr...@se... > IBM Linux Technology Center > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 _______________________________________________ Lse-tech mailing list Lse...@li... http://lists.sourceforge.net/lists/listinfo/lse-tech |
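Hubertus' point about tolerating stale counter values amounts to an optimistic-read / verify-under-lock pattern. A minimal sketch of that pattern is shown below; curr_na_goodness(), runqueue_lock() and preemption_goodness() are the MQ helpers named in the discussion above, while cpu_curr() and the surrounding function are invented for illustration and are not the actual patch code:

/*
 * Sketch only: pick a preemption target from unlocked, possibly stale
 * per-CPU goodness values, then re-verify under that CPU's runqueue lock.
 * A stale read costs at most a missed or spurious preemption attempt.
 */
static int try_preempt_cpu(struct task_struct *p, int this_cpu)
{
	int i, cpu, target = -1;
	int min_g = 1000000;	/* effectively "infinity" for goodness values */

	/* Optimistic pass: no locks held, the values may change under us. */
	for (i = 0; i < smp_num_cpus; i++) {
		cpu = cpu_logical_map(i);
		if (curr_na_goodness(cpu) < min_g) {
			min_g = curr_na_goodness(cpu);
			target = cpu;
		}
	}
	if (target < 0)
		return 0;

	/* Verification pass: only now take the per-CPU lock, and give up
	 * quietly if the snapshot turned out to be stale. */
	if (!spin_trylock(&runqueue_lock(target)))
		return 0;
	if (preemption_goodness(cpu_curr(target), p, target) > 1) {
		cpu_curr(target)->need_resched = 1;
		if (target != this_cpu)
			smp_send_reschedule(target);
		spin_unlock(&runqueue_lock(target));
		return 1;
	}
	spin_unlock(&runqueue_lock(target));
	return 0;
}

The occasional wasted IPI or missed preemption is the price of keeping the read side lock-free, which is exactly the trade-off being debated in this thread.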
From: Jun N. <ju...@sc...> - 2001-01-30 15:23:59
|
Hubertus, At this point, I think what Mike's doing is right. We need to clearly understand the impacts/implications of the MQ-scheduler while simulating the original scheduling as closely as possible. One of my suggestions is to maintain the set of idle CPUs more accurately, i.e. use a global lock (per-node for NUMA configurations) for them, if necessary. Making a wrong decision based on stale counters, for example, is benign. If the set of idle CPUs at reschedule_idle() time, however, is not maintained accurately, it could have performance or other unpredictable impacts as the number of reschedules increases. > > Jun, you always have this problem. Information is stale by the time you > looked and digested it. For instance > the counter value is updated without holding the scheduling lock. The fork > might just have sliced the counter value in half .... > I think in all these approaches one has to assume a certain stability of > the counter value around a certain time interval and deal with the > occasional misbehavior. Do you have a suggestion, how we could do it > differently then in the current MQ scheduler. > One thought we had is really disassociate the queues alltogether and do > everything with load balancing at > a different level. > > Hubertus Franke > Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) > , OS-PIC (Chair) > email: fr...@us... > (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 > > Jun Nakajima <ju...@sc...>@lists.sourceforge.net on 01/29/2001 03:07:25 PM > > Sent by: lse...@li... > > To: Mike Kravetz <mkr...@se...> > cc: lse...@li... > Subject: Re: [Lse-tech] reschedule: - MQ scheduler > > Good experiments, Mike. > > I think the original scheduler would exhibit such scheduling > characteristics. Since reschedule_idle() is called with the > runqueue_lock held, the number of reschedules decreases as runqueue lock > contention and hold time increases. In addition, smp_send_reschedule() > tends to be performed less, as runqueue lock contention and hold time > increases. If the best_cpu below, for example, is already spin-waiting > at the runqueue lock with need_resched = 1, smp_send_reschedule() is not > performed against the cpu. > > static void reschedule_idle(struct task_struct * p) > { > #ifdef CONFIG_SMP > > send_now_idle: > /* > * If need_resched == -1 then we can skip sending > * the IPI altogether, tsk->need_resched is > * actively watched by the idle thread. > */ > need_resched = tsk->need_resched; > tsk->need_resched = 1; > if ((best_cpu != this_cpu) && !need_resched) > smp_send_reschedule(best_cpu); > return; > } > } > ... > > tsk = target_tsk; > if (tsk) { > if (oldest_idle != -1ULL) { > best_cpu = tsk->processor; > goto send_now_idle; > } > tsk->need_resched = 1; > if (tsk->processor != this_cpu) > smp_send_reschedule(tsk->processor); > } > return; > > ... > > The current implementation of the MQ-scheduler inherently causes more > reschedule because more than one CPU can perform reschedule_idle() at > the same time, reading curr_na_goodness(cpu). One thing I noticed with > the current code is: > => If spin_trylock(&runqueue_lock(target_cpu) fails, then it tries to > find > the 'next lowest' cur_na_goodness value, to cause preepmption. > > I think this code (see below) is a bit tricky. The array stack_list[cpu] > in the stack may not be valid any more because schedule() may have > happened on the other CPUs, and curr_na_goodness(cpu) may have been > updated by this time. > > /* > * Update value so we don't check this CPU again. 
> */ > stack_list[target_cpu] = saved_na_goodness; > > /* > * Find the 'next lowest' cur_na_goodness value. > */ > target_cpu = -1; > tmp_min_na_goodness = saved_na_goodness; > for (i = 0; i < smp_num_cpus; i++) { > cpu = cpu_logical_map(i); > if (stack_list[cpu] < tmp_min_na_goodness) { > target_cpu = cpu; > tmp_min_na_goodness = stack_list[cpu]; > > Although it checks if (preemption_goodness(tsk, p, target_cpu) > 1) > after spin_trylock(&runqueue_lock(target_cpu)), we don't know if the > target CPU is running a task with the mostly lowest na_goodness at that > time. > > However, I don't think we want to know or use the very accurate > curr_na_goodness(cpu) because they change very frequently and thus the > scheduling decision can be misled. It's might be a time for us to > explore a more sutitable scheduling algorithm for the MQ scheduling... > > Mike Kravetz wrote: > > > > I applied the fix for the bug Jun found and ran the chat benchmark > > with a profiled kernel. The number of reschedules dropped, but only > > by a very small amount. There was no significant change. > > > > I then tried an experiment. I modified the multi-queue scheduler > > code such that only a single lock is used, as opposed to the per- > > runqueue locks. All, algorithms remained unchanged. I only changed > > the locking model. It was my intention to increase lock contention > > and see how this impacted the benchmark. I ran the benchmark with > > a profiled version of this kernel. Here are the results when > > compared to multi-queue and 2.4.0 untouched. > > > > Calls to: > > schedule reschedule_idle tcp_data_wait > > ---------------------------------------------------------------------- > > untouched 412448 438062 281893 > > multiq 939687 920098 595161 > > multiq (gl) 578351 559267 407864 > > > > Even when the multiqueue scheduler used a global lock, the time > > spent in the scheduling code was much shorter. This is because > > there is no need to scan the entire list of runnable tasks (non- > > local runqueues have pointers to the best task on their queues). > > In the untouched 2.4.0 kernel, during this benchmark we average > > 181.89 milliseconds per call. In the multiqueue scheduler even > > with a global lock we are only averaging 46.69 milliseconds per > > call. For a more accurate comparison, I added the extra runqueue > > scans to the multiqueue scheduler with the global locking model. > > These scans were not used for anything (multiqueue algorithms did > > not change), I only wanted to add extra overhead to the schedule > > routine. After a run with this kernel I measured an average of > > 185.17 milliseconds per call. This was in the ballpark of the > > untouched kernel. Profile results for this kernel are included > > below. > > > > Calls to: > > schedule reschedule_idle tcp_data_wait > > ---------------------------------------------------------------------- > > untouched 412448 438062 281893 > > multiq 939687 920098 595161 > > multiq (gl) 578351 559267 407864 > > multiq (gl, s) 393834 416385 265007 > > > > The interesting result of these experiments is that the number of > > reschedules decreases as runqueue lock contention and hold time > > increases. > > > > -- > > Mike Kravetz mkr...@se... > > IBM Linux Technology Center > > > > _______________________________________________ > > Lse-tech mailing list > > Lse...@li... 
> > http://lists.sourceforge.net/lists/listinfo/lse-tech > > -- > Jun U Nakajima > Core OS Development > SCO/Murray Hill, NJ > Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
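One way to read Jun's suggestion is a small, exactly-maintained idle set whose membership only ever changes under its own per-node lock, so that two concurrent wake-ups can never both claim the same idle CPU. The sketch below is purely hypothetical: idle_set, node_of(), claim_idle_cpu() and MAX_NODES do not exist in the patch and are invented here for illustration.

#include <linux/spinlock.h>
#include <linux/smp.h>
#include <asm/bitops.h>

#define MAX_NODES 8			/* assumption made for the sketch */

/* Hypothetical exact idle-CPU set, one per node. */
struct idle_set {
	spinlock_t	lock;
	unsigned long	mask;		/* bit n set => CPU n is in its idle loop */
};

static struct idle_set idle_sets[MAX_NODES];	/* spin_lock_init() at boot */

/* Hypothetical cpu-to-node mapping; 4 CPUs per node assumed here. */
static int node_of(int cpu)
{
	return cpu / 4;
}

/* Called by a CPU immediately before it enters its idle loop. */
static void mark_cpu_idle(int cpu)
{
	struct idle_set *s = &idle_sets[node_of(cpu)];

	spin_lock(&s->lock);
	s->mask |= 1UL << cpu;
	spin_unlock(&s->lock);
}

/* Called from the wake-up path: atomically claim an idle CPU so that
 * concurrent reschedule_idle() calls cannot target the same processor. */
static int claim_idle_cpu(int node)
{
	struct idle_set *s = &idle_sets[node];
	int cpu = -1;

	spin_lock(&s->lock);
	if (s->mask) {
		cpu = ffz(~s->mask);		/* lowest set bit */
		s->mask &= ~(1UL << cpu);	/* claimed: no longer idle */
	}
	spin_unlock(&s->lock);
	return cpu;				/* -1 means no idle CPU on this node */
}

The lock here is only touched when a CPU actually enters or leaves idle, or when a wake-up finds the mask non-empty, so it should see far less traffic than the single 2.4.0 runqueue_lock, which is presumably why Jun considers it acceptable.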
From: Hubertus F. <fr...@us...> - 2001-01-30 15:52:50
|
You are absolutely right. If we keep NUMA out of the picture for the moment: any time you leave a CPU idle for some time while there are other tasks waiting to execute, one will have a hell of a time making up for such misjudgements. Mike and I talked about keeping track of idle tasks through a global mask, or one per node. We set the bit when we have decided to go into the idle task on our CPU and clear it when we come out of it. One could then very easily check whether there is an idle task in the system by checking the mask. I don't know whether using a global lock is a good idea here; we don't want to replace one global lock with another one, that's the whole point of the MQ scheduler. What could be done is some form of verification once we enter the idle loop. The first thing the idle task checks is whether there are other tasks still waiting to execute, e.g. nr_running() vs. the number of idle tasks. I could see that for that path one might want to use a global lock. The good news: with the "1" correction you suggested to Mike, coupled with another check that Mike put in to verify the preemption_goodness in reschedule_idle(), the results look "phenomenal" :-) again. For now, check the results without these improvements under http://lse.sourceforge.net/scheduling/results012501/status.html I don't have write access to the website yet due to non-inclusion in the group, so I can't put the latest numbers out yet. I will send a note once I can do it. Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) , OS-PIC (Chair) email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 Jun Nakajima <ju...@sc...>@sco.com on 01/30/2001 10:18:07 AM Sent by: nak...@sc... To: Hubertus Franke/Watson/IBM@IBMUS cc: lse...@li... Subject: Re: [Lse-tech] reschedule: - MQ scheduler Hubertus, At this point, I think what Mike's doing is right. We need to clearly understand the impacts/implications of the MQ-scheduler while simulating the original scheduling as closely as possible. One of my suggestions is to maintain the set of idle CPUs more accurately, i.e. use a global lock (per-node for NUMA configurations) for them, if necessary. Making a wrong decision based on stale counters, for example, is benign. If the set of idle CPUs at reschedule_idle() time, however, is not maintained accurately, it could have performance or other unpredictable impacts as the number of reschedules increases. > > Jun, you always have this problem. Information is stale by the time you > looked and digested it. For instance > the counter value is updated without holding the scheduling lock. The fork > might just have sliced the counter value in half .... > I think in all these approaches one has to assume a certain stability of > the counter value around a certain time interval and deal with the > occasional misbehavior. Do you have a suggestion, how we could do it > differently then in the current MQ scheduler. > One thought we had is really disassociate the queues alltogether and do > everything with load balancing at > a different level. > > Hubertus Franke > Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability) > , OS-PIC (Chair) > email: fr...@us... > (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003 > > Jun Nakajima <ju...@sc...>@lists.sourceforge.net on 01/29/2001 03:07:25 PM > > Sent by: lse...@li... > > To: Mike Kravetz <mkr...@se...> > cc: lse...@li... > Subject: Re: [Lse-tech] reschedule: - MQ scheduler > > Good experiments, Mike. 
> > I think the original scheduler would exhibit such scheduling > characteristics. Since reschedule_idle() is called with the > runqueue_lock held, the number of reschedules decreases as runqueue lock > contention and hold time increases. In addition, smp_send_reschedule() > tends to be performed less, as runqueue lock contention and hold time > increases. If the best_cpu below, for example, is already spin-waiting > at the runqueue lock with need_resched = 1, smp_send_reschedule() is not > performed against the cpu. > > static void reschedule_idle(struct task_struct * p) > { > #ifdef CONFIG_SMP > > send_now_idle: > /* > * If need_resched == -1 then we can skip sending > * the IPI altogether, tsk->need_resched is > * actively watched by the idle thread. > */ > need_resched = tsk->need_resched; > tsk->need_resched = 1; > if ((best_cpu != this_cpu) && !need_resched) > smp_send_reschedule(best_cpu); > return; > } > } > ... > > tsk = target_tsk; > if (tsk) { > if (oldest_idle != -1ULL) { > best_cpu = tsk->processor; > goto send_now_idle; > } > tsk->need_resched = 1; > if (tsk->processor != this_cpu) > smp_send_reschedule(tsk->processor); > } > return; > > ... > > The current implementation of the MQ-scheduler inherently causes more > reschedule because more than one CPU can perform reschedule_idle() at > the same time, reading curr_na_goodness(cpu). One thing I noticed with > the current code is: > => If spin_trylock(&runqueue_lock(target_cpu) fails, then it tries to > find > the 'next lowest' cur_na_goodness value, to cause preepmption. > > I think this code (see below) is a bit tricky. The array stack_list[cpu] > in the stack may not be valid any more because schedule() may have > happened on the other CPUs, and curr_na_goodness(cpu) may have been > updated by this time. > > /* > * Update value so we don't check this CPU again. > */ > stack_list[target_cpu] = saved_na_goodness; > > /* > * Find the 'next lowest' cur_na_goodness value. > */ > target_cpu = -1; > tmp_min_na_goodness = saved_na_goodness; > for (i = 0; i < smp_num_cpus; i++) { > cpu = cpu_logical_map(i); > if (stack_list[cpu] < tmp_min_na_goodness) { > target_cpu = cpu; > tmp_min_na_goodness = stack_list[cpu]; > > Although it checks if (preemption_goodness(tsk, p, target_cpu) > 1) > after spin_trylock(&runqueue_lock(target_cpu)), we don't know if the > target CPU is running a task with the mostly lowest na_goodness at that > time. > > However, I don't think we want to know or use the very accurate > curr_na_goodness(cpu) because they change very frequently and thus the > scheduling decision can be misled. It's might be a time for us to > explore a more sutitable scheduling algorithm for the MQ scheduling... > > Mike Kravetz wrote: > > > > I applied the fix for the bug Jun found and ran the chat benchmark > > with a profiled kernel. The number of reschedules dropped, but only > > by a very small amount. There was no significant change. > > > > I then tried an experiment. I modified the multi-queue scheduler > > code such that only a single lock is used, as opposed to the per- > > runqueue locks. All, algorithms remained unchanged. I only changed > > the locking model. It was my intention to increase lock contention > > and see how this impacted the benchmark. I ran the benchmark with > > a profiled version of this kernel. Here are the results when > > compared to multi-queue and 2.4.0 untouched. 
> > > > Calls to: > > schedule reschedule_idle tcp_data_wait > > ---------------------------------------------------------------------- > > untouched 412448 438062 281893 > > multiq 939687 920098 595161 > > multiq (gl) 578351 559267 407864 > > > > Even when the multiqueue scheduler used a global lock, the time > > spent in the scheduling code was much shorter. This is because > > there is no need to scan the entire list of runnable tasks (non- > > local runqueues have pointers to the best task on their queues). > > In the untouched 2.4.0 kernel, during this benchmark we average > > 181.89 milliseconds per call. In the multiqueue scheduler even > > with a global lock we are only averaging 46.69 milliseconds per > > call. For a more accurate comparison, I added the extra runqueue > > scans to the multiqueue scheduler with the global locking model. > > These scans were not used for anything (multiqueue algorithms did > > not change), I only wanted to add extra overhead to the schedule > > routine. After a run with this kernel I measured an average of > > 185.17 milliseconds per call. This was in the ballpark of the > > untouched kernel. Profile results for this kernel are included > > below. > > > > Calls to: > > schedule reschedule_idle tcp_data_wait > > ---------------------------------------------------------------------- > > untouched 412448 438062 281893 > > multiq 939687 920098 595161 > > multiq (gl) 578351 559267 407864 > > multiq (gl, s) 393834 416385 265007 > > > > The interesting result of these experiments is that the number of > > reschedules decreases as runqueue lock contention and hold time > > increases. > > > > -- > > Mike Kravetz mkr...@se... > > IBM Linux Technology Center > > > > _______________________________________________ > > Lse-tech mailing list > > Lse...@li... > > http://lists.sourceforge.net/lists/listinfo/lse-tech > > -- > Jun U Nakajima > Core OS Development > SCO/Murray Hill, NJ > Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > http://lists.sourceforge.net/lists/listinfo/lse-tech -- Jun U Nakajima Core OS Development SCO/Murray Hill, NJ Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
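Hubertus' lock-free variant might look roughly like the following: the mask is maintained with atomic bit operations only, and the heavier consistency check (nr_running() against the number of idle CPUs) is done inside the idle loop itself, where a global lock would be tolerable. Everything here (idle_cpu_mask, nr_idle_cpus, the exact check) is an assumption made for illustration, not code from the MQ patch.

#include <linux/sched.h>
#include <linux/smp.h>
#include <asm/bitops.h>
#include <asm/atomic.h>

extern int nr_running(void);		/* as referenced in the mail above */

static unsigned long idle_cpu_mask;		/* bit n => CPU n is idle */
static atomic_t nr_idle_cpus = ATOMIC_INIT(0);

/* Cheap, unlocked hint used by the wake-up path. */
static inline int any_cpu_idle(void)
{
	return idle_cpu_mask != 0;
}

/* Idle loop wrapper: advertise idleness with atomic bit ops only;
 * no global lock on this fast path. */
static void idle_loop(int cpu)
{
	for (;;) {
		set_bit(cpu, &idle_cpu_mask);
		atomic_inc(&nr_idle_cpus);

		while (!current->need_resched) {
			/*
			 * Sanity check suggested above: if runnable tasks
			 * outnumber the busy CPUs while we sit here, some
			 * earlier placement decision went stale, so stop
			 * idling and go pick one up.
			 */
			if (nr_running() > smp_num_cpus - atomic_read(&nr_idle_cpus))
				break;
			barrier();
		}

		atomic_dec(&nr_idle_cpus);
		clear_bit(cpu, &idle_cpu_mask);
		schedule();
	}
}

Because the bit is set before the CPU starts watching need_resched and cleared before it calls schedule(), a wake-up that samples the mask can only be briefly wrong, which is the kind of transient error Hubertus argues has to be tolerated anyway.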
From: Mike K. <mkr...@se...> - 2001-01-25 19:12:01
|
On Thu, Jan 25, 2001 at 11:00:15AM -0500, Hubertus Franke wrote:
> > I think Mike traced it already back to the fact that the tpc_recvmsg's
> block, i.e. there is not data there.
> Hence the process get's back into a wait queue and wake's up with
> reschedule_idle being called.
> That's the source of the problem. We are doing TOO GOOD .....

Well, that is a theory I am working on. The reason for the slowdown is 'too many process preemptions' as stated by Bill. Now the difficult part is to determine why. Here is another interesting piece of the profile data that has not made it out to this list. The following is from the untouched 2.4.0 kernel:

                12.04   94.37  8000060/8000060     inet_recvmsg [7]
[8]     17.8    12.04   94.37  8000060             tcp_recvmsg [8]
                 0.96   51.84   281893/281893      tcp_data_wait [16]
                 1.67   17.29  7774955/8018782     memcpy_toiovec [30]
                 2.46    7.52  8281953/8281953     cleanup_rbuf [38]
                 0.23    6.51   277660/277660      tcp_prequeue_process [46]
                 0.03    3.98   330314/2453362     _wake_up [19]
                 0.31    0.81   257191/1585985     _kfree_skb [45]
                 0.01    0.71     3687/20785       _lock_sock [62]
                 0.00    0.03     1490/8500        _release_sock [123]

and here it is for the kernel with multiqueue:

                12.91   83.95  8000060/8000060     inet_recvmsg [7]
[8]     15.1    12.91   83.95  8000060             tcp_recvmsg [8]
                 2.02   29.45   595161/595161      tcp_data_wait [23]
                 1.62   17.49  7470947/8022664     memcpy_toiovec [29]
                 2.79   14.28  8595221/8595221     cleanup_rbuf [35]
                 0.69   12.67   590860/590860      tcp_prequeue_process [40]
                 0.02    1.42   221714/2308800     _wake_up [37]
                 0.38    0.85   292349/2660942     _kfree_skb [42]
                 0.01    0.24     4678/24471       _lock_sock [87]
                 0.00    0.03     1548/9337        _release_sock [129]

Note that tcp_recvmsg gets called the same number of times in both cases. However, note the big differences in the number of times tcp_recvmsg() calls the routines tcp_data_wait() and tcp_prequeue_process(). The tcp_data_wait() routine results in a sleep/wakeup cycle via a call to schedule_timeout(). A result of these extra sleep/wakeup calls is the increased number of priority preemptions via reschedule_idle(). Now, tcp_data_wait() is called to wait for data when a recv request cannot be immediately satisfied. So with the multi-queue scheduler, recv'ing data blocks (sleep/wakeup) some 300,000+ more times than with the current scheduler. I believe this accounts for the additional reschedules we see in the multi-queue kernel. Now here is some data to show runqueue lock contention for both kernels during the runs. Sorry for the wide lines produced by the tool. 
2.4.0 vanilla
-------------
SPINLOCKS            HOLD                 WAIT
  UTIL     CON     MEAN (  MAX )     MEAN (  MAX )     TOTAL  NOWAIT   SPIN  REJECT  Acquired at
--------------------------------------------------------------------------------------------------------------------
  0.17%  32.35%    5.5us (  20us)     10us ( 380us)    50766   34344  16422       0  __schedule_tail+0x68
  0.00%   0.00%    1.3us ( 2.2us)     0us                 56      56      0       0  deliver_signal+0x58
  9.10%  16.07%     57us ( 337us)    6.0us ( 558us)   263570  221213  42357       0  schedule+0xc8 beginning
  0.01%   2.56%    3.4us ( 101us)    0.1us (  22us)     4602    4484    118       0  schedule+0x44c after recalc
  0.73%  32.36%    5.8us (  26us)     19us ( 570us)   206799  139869  66930       0  wake_up_process+0x14

2.4.0 multi-queue
-----------------
SPINLOCKS            HOLD                 WAIT
  UTIL     CON     MEAN (  MAX )     MEAN (  MAX )     TOTAL  NOWAIT   SPIN  REJECT  Acquired at
--------------------------------------------------------------------------------------------------------------------
  0.29%   2.85%    3.6us (  33us)    0.1us (  22us)   135574  131714   3860       0  __schedule_tail+0xa4
  0.00%   0.00%    1.5us ( 2.5us)     0us                 48      48      0       0  deliver_signal+0xa8
  0.02%   2.73%    2.4us (  16us)     0us              15869   15436      0     433  reschedule_idle+0x2e0 try
  3.61%   1.77%     11us ( 126us)    0.1us (  19us)   519630  510447   9183       0  schedule+0x130 beginning
  0.02%  28.79%    2.5us (  12us)     0us              10382    7393      0    2989  schedule+0x648 try
  0.01%   0.03%    3.1us (  29us)    0.0us ( 5.6us)     3599    3598      1       0  schedule+0xbdc after recalc
  0.94%   8.47%    4.3us (  37us)    1.0us (  79us)   356086  325925  30161       0  wake_up_process+0x48

Note the increased lock wait times with the untouched 2.4.0 kernel. Here is where I started to form my 'wacky' theory. When a task recv'ing data notices no data is present on the socket, it goes into a sleep/wakeup cycle. This sleep/wakeup cycle requires the runqueue lock to be acquired a few times. Now, the longer one waits to get through this sleep/wakeup cycle, the greater the chance that additional data will be queued up on the socket. Hence, if you are delayed you may be able to process multiple recv's before blocking again. Thus, the increased lock contention in the untouched kernel results in slight delays, which result in a 'buffering' of data on the socket. Contrast this to the multiqueue scheduler, which can get through the sleep/wakeup cycle faster. In this case it is less likely that additional data will be queued to the socket and less likely that you will be able to process multiple recv's before blocking again. Like I said, this sounds wacky: increased lock contention causing an increase in performance/throughput. I'm looking into this and other possible reasons for the slowdown. -- Mike Kravetz mkr...@se... IBM Linux Technology Center 15450 SW Koll Parkway Beaverton, OR 97006-6063 (503)578-3494 |
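Mike's 'wacky' theory can be restated with a toy model: the longer each blocking round trip through tcp_data_wait() takes, the more messages accumulate on the socket per wait, so the total number of waits (and hence reschedules) drops. The sketch below is a user-space illustration of that argument only; the structure, arrival rate and wakeup costs are invented numbers, not measurements of the real networking code.

/*
 * Toy model of the tcp_recvmsg()/tcp_data_wait() interaction described
 * above.  sock_model and all parameters are invented for illustration.
 */
#include <stdio.h>

struct sock_model {
	long	queued;		/* messages waiting on the "socket"          */
	long	waits;		/* how often the receiver had to block       */
	double	arrival_rate;	/* messages arriving per microsecond         */
	double	wakeup_cost;	/* microseconds per sleep/wakeup round trip  */
};

static void wait_for_data(struct sock_model *sk)
{
	/* While we sleep and get rescheduled, more data trickles in.
	 * A slower wakeup (lock contention) buffers more messages. */
	sk->queued += (long)(sk->arrival_rate * sk->wakeup_cost) + 1;
	sk->waits++;
}

static long recv_messages(struct sock_model *sk, long wanted)
{
	long copied = 0;

	while (copied < wanted) {
		if (sk->queued == 0) {
			wait_for_data(sk);	/* tcp_data_wait() stand-in */
			continue;
		}
		sk->queued--;			/* one message copied to the user */
		copied++;
	}
	return copied;
}

int main(void)
{
	struct sock_model fast = { 0, 0, 0.05,  5.0 };	/* quick wakeups      */
	struct sock_model slow = { 0, 0, 0.05, 40.0 };	/* contended wakeups  */

	recv_messages(&fast, 8000060);
	recv_messages(&slow, 8000060);
	printf("fast wakeups: %ld blocking waits\n", fast.waits);
	printf("slow wakeups: %ld blocking waits\n", slow.waits);
	return 0;
}

With these made-up parameters the slow-wakeup case blocks roughly a third as often, which is the same direction as the tcp_data_wait counts in the profiles above (281893 vs. 595161), though the real mechanism is of course far more complicated.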