From: Mike K. <mkr...@se...> - 2001-07-12 23:40:56
This discussion was started on 'lse...@li...'. I'm widening the distribution in the hope of getting more input.

It started when Andi Kleen noticed that a single 'CPU Hog' task was being bounced back and forth between the 2 CPUs on his 2-way system. I had seen similar behavior when running the context switching test of LMbench. When running lat_ctx with only two threads on an 8 CPU system, one would 'expect' the two threads to be confined to two of the 8 CPUs in the system. However, what I have observed is that the threads are effectively 'round robined' among all the CPUs and they all end up bearing an equivalent amount of the CPU load. To more easily observe this, increase the number of 'TRIPS' in the benchmark to a really large number.

After a little investigation, I believe this 'situation' is caused by the latency of the reschedule IPI used by the scheduler. Recall that in lat_ctx all threads are in a tight loop consisting of:

  pipe_read()
  pipe_write()

Both threads 'start' on the same CPU and are sitting in pipe_read waiting for data. A token is written to the pipe and one thread is awakened. The awakened thread then immediately writes the token back to the pipe, which ultimately results in a call to reschedule_idle() that will 'initiate' the scheduling of the other thread. In reschedule_idle() we can not take the 'fast path' because WE are currently executing on the other thread's preferred CPU. Therefore, reschedule_idle() chooses the oldest idle CPU and sends the IPI. However, before the IPI is received (and schedule() run) on the remote CPU, the currently running thread calls pipe_read, which blocks and calls schedule(). Since the other task has yet to be scheduled on the other CPU, it is scheduled to run on the current CPU. Both tasks continue to execute on the one CPU until such time that an IPI-induced schedule() on the other CPU hits a window where it finds one of the tasks to schedule. We continue in this way, migrating the tasks to the oldest idle CPU and eventually cycling our way through all the CPUs.

Does this explanation sound reasonable?

If so, it would then follow that booting with 'idle=poll' would help alleviate this situation. However, that is not the case. With idle=poll the CPU load is not as evenly distributed among the CPUs, but is still distributed among all of them.

Does the behavior of the 'benchmark' mean anything? Should one expect tasks to stay on their preferred CPUs if possible?

Thoughts/comments?

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center
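For concreteness, here is a minimal user-space analogue of the token-passing loop described above: two processes each block in read() and immediately write the token back, so every trip forces a wakeup and a reschedule. The TRIPS count and the one-byte token are illustrative, not LMbench's exact parameters.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

#define TRIPS 100000

int main(void)
{
    int p1[2], p2[2];
    char token = 't';

    if (pipe(p1) < 0 || pipe(p2) < 0) {
        perror("pipe");
        return 1;
    }

    if (fork() == 0) {
        /* Child: block in read() on p1, then immediately bounce
         * the token back on p2 -- the pattern that triggers
         * reschedule_idle() on every trip. */
        for (int i = 0; i < TRIPS; i++) {
            read(p1[0], &token, 1);   /* blocks -> schedule() */
            write(p2[1], &token, 1);  /* wakes parent -> reschedule_idle() */
        }
        _exit(0);
    }

    for (int i = 0; i < TRIPS; i++) {
        write(p1[1], &token, 1);      /* wake the child */
        read(p2[0], &token, 1);       /* block until the child replies */
    }
    wait(NULL);
    return 0;
}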
From: Davide L. <da...@xm...> - 2001-07-13 00:19:47
On 12-Jul-2001 Mike Kravetz wrote:
> It started when Andi Kleen noticed that a single 'CPU Hog' task was being bounced back and forth between the 2 CPUs on his 2-way system. [...] After a little investigation, I believe this 'situation' is caused by the latency of the reschedule IPI used by the scheduler.
> [...]
> Does this explanation sound reasonable?

I would say yes.

> Does the behavior of the 'benchmark' mean anything? Should one expect tasks to stay on their preferred CPUs if possible?

Maybe we could have a per-CPU wake list, where the rescheduled task is queued and then woken up by the IPI target.

- Davide
From: Larry M. <lm...@bi...> - 2001-07-13 00:36:51
Be careful tuning for LMbench (says the author :-) Especially this benchmark. It's certainly possible to get dramatically better SMP numbers by pinning all the lat_ctx processes to a single CPU, because the benchmark is single threaded. In other words, if we have 5 processes, call them A, B, C, D, and E, then the benchmark is passing a token from A to B to C to D to E and around again.

If the amount of data/instructions needed by all 5 processes fits in the cache and you pin all the processes to the same CPU, you'll get much better performance than simply letting them float. But making the system do that naively is a bad idea.

This is a really hard area to get right, but you can take a page from all the failed process migration efforts. In general, moving stuff is a bad idea; it's much better to leave it where it is. Everything scales better if there is a process queue per CPU and the default is that you leave the processes on the queue on which they last ran. However, if the load average for a queue starts going up and there is another queue with a substantially lower load average, then and ONLY then should you move the process. I think if you experiment with that you'll see that lat_ctx does well and so do a lot of other things.

An optimization on that requires hardware support. If you knew the number of cache misses associated with each time slice, you could factor that in and start moving processes that have a "too high" cache miss rate, the idea being that we want to keep all processes on the same CPU if we can, but if that is causing an excessive cache miss rate, it's time to move.

Another optimization is to always schedule an exec-ed process (as opposed to a forked process) on a different CPU than its parent. In general, when you exec you have a clear boundary and it's good to spread those out.

All of this is based on my somewhat dated performance efforts that led to LMbench. I don't know of any fundamental changes that invalidate these opinions, but I could be wrong. This is an area in which I've done a pile of work and I'd be interested in keeping a finger in any efforts to fix up the scheduler.
--
---
Larry McVoy            lm at bitmover.com           http://www.bitmover.com/lm
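The "then and ONLY then" policy can be pinned down with a short sketch. The runqueue type, the IMBALANCE threshold, and the migrate_one_task() helper below are all hypothetical, chosen only to show the shape of the rule, not any real kernel interface.

struct runqueue {
    int nr_running;     /* instantaneous queue length */
    int load;           /* smoothed load average */
};

#define NR_CPUS    8
#define IMBALANCE  2    /* minimum load gap before we migrate */

extern struct runqueue runqueues[NR_CPUS];
extern void migrate_one_task(int from, int to);   /* hypothetical helper */

/* Called periodically from a timer, not on every schedule(). */
static void load_balance(void)
{
    int cpu, busiest = 0, idlest = 0;

    for (cpu = 1; cpu < NR_CPUS; cpu++) {
        if (runqueues[cpu].load > runqueues[busiest].load)
            busiest = cpu;
        if (runqueues[cpu].load < runqueues[idlest].load)
            idlest = cpu;
    }

    /* Default: leave tasks on the queue where they last ran.
     * Move one only when the imbalance is substantial. */
    if (runqueues[busiest].load - runqueues[idlest].load >= IMBALANCE)
        migrate_one_task(busiest, idlest);
}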
From: Davide L. <da...@xm...> - 2001-07-13 16:38:20
On 13-Jul-2001 Larry McVoy wrote:
> Be careful tuning for LMbench (says the author :-)
> [...]
> But making the system do that naively is a bad idea.

Agree.

> This is a really hard area to get right, but you can take a page from all the failed process migration efforts. In general, moving stuff is a bad idea; it's much better to leave it where it is. Everything scales better if there is a process queue per CPU and the default is that you leave the processes on the queue on which they last ran. However, if the load average for a queue starts going up and there is another queue with a substantially lower load average, then and ONLY then should you move the process.

I personally think that a standard scheduler per CPU is the way to go for SMP. I saw the one the IBM guys did, and I think the wrong catch there is always trying to grab the best task to run over all CPUs. I think this concept could be relaxed by introducing fewer chains between the per-CPU schedulers. A cheap load balancer should run, "time to time"(tm), to move tasks when a certain level of unbalancing has been reached. This will give each scheduler more independence and will make it more scalable, IMVHO.

> This is an area in which I've done a pile of work and I'd be interested in keeping a finger in any efforts to fix up the scheduler.

We've, somehow, understood it :)

- Davide
From: Mike K. <mkr...@se...> - 2001-07-13 17:32:09
On Fri, Jul 13, 2001 at 09:41:44AM -0700, Davide Libenzi wrote:
> I personally think that a standard scheduler per CPU is the way to go for SMP. I saw the one the IBM guys did, and I think the wrong catch there is always trying to grab the best task to run over all CPUs.

That was me/us. Most of the reason for making 'global scheduling' decisions was an attempt to maintain the same behavior as the existing scheduler. We are trying to see how well we can make this scheduler scale, while maintaining this global behavior. The thought is that if there was ever any hope of someone adopting this scheduler, they would be more likely to do so if it attempted to maintain existing behavior.

> I think this concept could be relaxed by introducing fewer chains between the per-CPU schedulers. A cheap load balancer should run, "time to time"(tm), to move tasks when a certain level of unbalancing has been reached. This will give each scheduler more independence and will make it more scalable, IMVHO.

Take a look at the 'CPU pooling & Load balancing' extensions to our scheduler (lse.sourceforge.net/scheduling). It allows you to divide the system into CPU pools and keep scheduling decisions limited to these pools. Periodically, load balancing will be performed among the pools. This has shown some promise on NUMA architectures (as one would expect). You could define pool sizes to be 1 CPU and get the scheduling behavior you desire on SMP. But none of this has to do with the CPU affinity issues in the current scheduler.

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center
From: Davide L. <da...@xm...> - 2001-07-13 19:14:23
On 13-Jul-2001 Mike Kravetz wrote:
> That was me/us. Most of the reason for making 'global scheduling' decisions was an attempt to maintain the same behavior as the existing scheduler. We are trying to see how well we can make this scheduler scale, while maintaining this global behavior. The thought is that if there was ever any hope of someone adopting this scheduler, they would be more likely to do so if it attempted to maintain existing behavior.

The problem, IMHO, is that we're trying to extend what is a correct behaviour on the UP scheduler (pick up the best task to run) to SMP machines. Global scheduling decisions should be triggered in response to load imbalance, and not at each schedule() run; otherwise we're going to introduce a common lock that will limit the overall scalability.

My idea about the future of the scheduler is to have config options users can choose depending upon the machine's use. Trying to keep a unique scheduler for both UP and MP is like giving the same answer to different problems, and the scaling factor (of the scheduler itself) on SMP will never improve. The code inside kernel/sched.c should be reorganized (it even contains non-scheduler code) so that the various CONFIG_SCHED* options will not introduce any mess into the code (possibly by having the code in separate files, kernel/sched*.c).

- Davide
From: Mike K. <mkr...@se...> - 2001-07-13 17:06:49
On Thu, Jul 12, 2001 at 05:36:41PM -0700, Larry McVoy wrote:
> Be careful tuning for LMbench (says the author :-)
> [...]
> If the amount of data/instructions needed by all 5 processes fits in the cache and you pin all the processes to the same CPU, you'll get much better performance than simply letting them float. But making the system do that naively is a bad idea.

I agree, and can't imagine the system ever attempting to take this into account and leave these 5 tasks on the same CPU.

At the other extreme is my observation that 2 tasks on an 8 CPU system are 'round robined' among all 8 CPUs. I think having the 2 tasks stay on 2 of the 8 CPUs would be an improvement with respect to CPU affinity. Actually, the scheduler does 'try' to do this. It is clear that the behavior of lat_ctx bypasses almost all of the scheduler's attempts at CPU affinity. The real question is, "How often in running 'real workloads' are the scheduler's attempts at CPU affinity bypassed?".

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center
From: David L. <dav...@di...> - 2001-07-13 21:06:57
A real-world example of this issue.

I was gzipping a large (~800MB) file on a dual Athlon box. The gzip process was bouncing back and forth between the two CPUs. I actually was able to gzip faster by starting up setiathome to keep one CPU busy and force the scheduler to keep the gzip on a single CPU. (I ran things several times to verify it was actually faster.)

David Lang

On Fri, 13 Jul 2001, Mike Kravetz wrote:
> At the other extreme is my observation that 2 tasks on an 8 CPU system are 'round robined' among all 8 CPUs. I think having the 2 tasks stay on 2 of the 8 CPUs would be an improvement with respect to CPU affinity. Actually, the scheduler does 'try' to do this.
>
> It is clear that the behavior of lat_ctx bypasses almost all of the scheduler's attempts at CPU affinity. The real question is, "How often in running 'real workloads' are the scheduler's attempts at CPU affinity bypassed?".
From: Mike K. <mkr...@se...> - 2001-07-13 22:43:41
On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
> A real-world example of this issue.
>
> I was gzipping a large (~800MB) file on a dual Athlon box. The gzip process was bouncing back and forth between the two CPUs. I actually was able to gzip faster by starting up setiathome to keep one CPU busy and force the scheduler to keep the gzip on a single CPU. (I ran things several times to verify it was actually faster.)

That does sound like the same behavior I was seeing with lat_ctx. Like I said in my previous note, the scheduler does try to take CPU affinity into account. reschedule_idle() does a pretty good job of determining what CPU a task should run on. In the case of lat_ctx (and I believe your use of gzip), the 'fast path' in reschedule_idle() is taken because the CPU associated with the awakened task is idle. However, before schedule() is run on the 'target' CPU, schedule() is run on another CPU and the task is scheduled there.

The root cause of this situation is the delay between the time reschedule_idle() determines the best CPU a task should run on, and the time schedule() is actually run on that CPU.

I have toyed with the idea of 'temporarily binding' a task to a CPU for the duration of the delay between reschedule_idle() and schedule(). It would go something like this:

- Define a new field in the task structure, 'saved_cpus_allowed'. With a little collapsing of existing fields, there is room to put this on the same cache line as 'cpus_allowed'.
- In reschedule_idle(), if we determine that the best CPU for a task is the CPU it is associated with (p->processor), then temporarily bind the task to that CPU. The task is temporarily bound to the CPU by overwriting the 'cpus_allowed' field such that the task can only be scheduled on the target CPU. Of course, the original value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
- In schedule(), the loop which examines all tasks on the runqueue will restore the value of 'cpus_allowed'.

This would preserve the 'best CPU' decision made by reschedule_idle(). On the down side, 'temporarily bound' tasks could not be scheduled until schedule() is run on their associated CPUs. This could potentially waste CPU cycles and delay context switches. In addition, it is quite possible that while a task is 'temporarily bound', the state of the system could change in such a way that the best CPU is no longer best.

There appears to be a classic tradeoff between CPU affinity and context switch time. Comments?

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center
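A rough sketch of the proposal against a 2.4-style task_struct. Everything here is simplified for illustration: the real reschedule_idle() and schedule() are far more involved, and the runqueue locking is omitted.

struct task_struct {
    unsigned long cpus_allowed;       /* bitmask of permitted CPUs */
    unsigned long saved_cpus_allowed; /* 0 when no temporary binding */
    int processor;                    /* CPU the task last ran on */
};

/* In reschedule_idle(): the best CPU is the task's own, so pin it
 * there until that CPU actually runs schedule(). */
static void bind_temporarily(struct task_struct *p)
{
    p->saved_cpus_allowed = p->cpus_allowed;
    p->cpus_allowed = 1UL << p->processor;
}

/* In schedule(), while walking the runqueue: undo the binding so
 * the task becomes migratable again. */
static void unbind(struct task_struct *p)
{
    if (p->saved_cpus_allowed) {
        p->cpus_allowed = p->saved_cpus_allowed;
        p->saved_cpus_allowed = 0;
    }
}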
From: Davide L. <da...@xm...> - 2001-07-15 19:59:02
On 13-Jul-2001 Mike Kravetz wrote:
> I have toyed with the idea of 'temporarily binding' a task to a CPU for the duration of the delay between reschedule_idle() and schedule().
> [...]
> There appears to be a classic tradeoff between CPU affinity and context switch time.

The problem of the current scheduler is that it acts like an infinite feedback system. When we're going to decide if we have to move a task, we analyze the status at the current time without taking into account the system status at previous time values. But the feedback we send (the IPI to move the task) takes a finite time to hit the target CPU and, meanwhile, the system status changes. So we're going to apply a feedback calculated at time T0 to a time Tn, and this will result in system auto-oscillation that we perceive as tasks bouncing between CPUs. This is kind of an electronics example, but it applies to all feedback systems.

The solution to this problem, given that we can't have a zero feedback delivery time, is:

1) lower the feedback amount, that means, try to minimize task movements

2) a low-pass filter, that means, when we're going to decide the fate (move) of a task, we have to weight the system status with the one it had at previous time values

- Davide
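One concrete reading of point 2 is an exponentially weighted moving average of each CPU's queue length, sampled on every timer tick; migration decisions then look at the smoothed value instead of the instantaneous one, so a spike shorter than the IPI delivery time cannot trigger a move. The struct and the 3/4-1/4 weighting below are illustrative choices, not anything taken from the 2.4 scheduler.

struct cpu_load {
    int nr_running;   /* instantaneous sample */
    int load;         /* low-pass filtered value */
};

/* Called once per tick: load <- 3/4 * old + 1/4 * new, in
 * integer arithmetic. Decisions that read 'load' see history,
 * not just the current instant. */
static void update_load(struct cpu_load *c)
{
    c->load = (3 * c->load + c->nr_running) / 4;
}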
From: Andi K. <ak...@su...> - 2001-07-15 20:10:43
On Sun, Jul 15, 2001 at 01:02:21PM -0700, Davide Libenzi wrote:
> The problem of the current scheduler is that it acts like an infinite feedback system. [...] So we're going to apply a feedback calculated at time T0 to a time Tn, and this will result in system auto-oscillation that we perceive as tasks bouncing between CPUs.
>
> The solution to this problem, given that we can't have a zero feedback delivery time, is:
>
> 1) lower the feedback amount, that means, try to minimize task movements
>
> 2) a low-pass filter, that means, when we're going to decide the fate (move) of a task, we have to weight the system status with the one it had at previous time values

Nice analysis. I think Mike's proposal of the 'saved_cpus_allowed' field to temporarily bind the task to the target would just act as such a low-pass filter.

-Andi
From: Andi K. <ak...@su...> - 2001-07-15 20:15:45
On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
> - Define a new field in the task structure, 'saved_cpus_allowed'. With a little collapsing of existing fields, there is room to put this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle(), if we determine that the best CPU for a task is the CPU it is associated with (p->processor), then temporarily bind the task to that CPU. The task is temporarily bound to the CPU by overwriting the 'cpus_allowed' field such that the task can only be scheduled on the target CPU. Of course, the original value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue will restore the value of 'cpus_allowed'.

This sounds racy, at least with your simple description. Even other CPUs could walk their run queue while the IPI is being processed and reset the cpus_allowed flag too early. I guess it would need a check in the schedule loop that restores cpus_allowed only if the current CPU is the only one set in that task's cpus_allowed. This unfortunately would hang the process if the CPU for some reason cannot process schedules timely, so it may be necessary to also add a timestamp to the task_struct that allows the restoration of cpus_allowed even from non-target CPUs after some time.

-Andi
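A sketch of the timestamp escape hatch Andi suggests, extending the fields from Mike's proposal. The jiffies granularity and the 2-tick limit are arbitrary illustrative choices.

#define BIND_TIMEOUT 2  /* ticks a binding may be held */

extern unsigned long jiffies;

struct task_struct {
    unsigned long cpus_allowed;
    unsigned long saved_cpus_allowed;
    unsigned long bind_timestamp;   /* when the binding was set */
    int processor;
};

static void bind_temporarily(struct task_struct *p)
{
    p->saved_cpus_allowed = p->cpus_allowed;
    p->cpus_allowed = 1UL << p->processor;
    p->bind_timestamp = jiffies;
}

/* Runqueue walk, on any CPU. The target CPU always unbinds; other
 * CPUs may only do so after the timeout, so a stalled target cannot
 * strand the task forever. */
static void maybe_unbind(struct task_struct *p, int this_cpu)
{
    if (!p->saved_cpus_allowed)
        return;                     /* not temporarily bound */

    if (this_cpu == p->processor ||
        jiffies - p->bind_timestamp > BIND_TIMEOUT) {
        p->cpus_allowed = p->saved_cpus_allowed;
        p->saved_cpus_allowed = 0;
    }
}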
From: Davide L. <da...@xm...> - 2001-07-15 20:27:47
On 15-Jul-2001 Andi Kleen wrote:
> This sounds racy, at least with your simple description. Even other CPUs could walk their run queue while the IPI is being processed and reset the cpus_allowed flag too early. [...] so it may be necessary to also add a timestamp to the task_struct that allows the restoration of cpus_allowed even from non-target CPUs after some time.

I previously expressed (maybe in this thread, I don't remember) my idea of not touching (hacking) the current scheduler to make it fit SMP, which, IMHO, has different problems and needs different answers. Anyway, as I said a couple of messages ago, the solution to this problem could be to move the woken task onto a private (per-CPU) wakeup list that will be run when the idle CPU comes up.

- Davide
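A sketch of what such a per-CPU wakeup list might look like: the waker parks the task on a private list owned by the target CPU and sends the IPI; only the target ever drains its own list, so the task cannot be grabbed by a third CPU during the IPI-latency window. All names and helpers here are invented, and the per-list locking a real implementation would need is left out.

#define NR_CPUS 8

struct task_struct {
    struct task_struct *wake_next;
    /* ... */
};

static struct task_struct *wake_list[NR_CPUS];

extern void send_reschedule_ipi(int cpu);                           /* hypothetical */
extern void add_to_local_runqueue(int cpu, struct task_struct *p);  /* hypothetical */

/* Waker side, in place of today's reschedule_idle() IPI path. */
static void queue_wakeup(struct task_struct *p, int target_cpu)
{
    p->wake_next = wake_list[target_cpu];
    wake_list[target_cpu] = p;
    send_reschedule_ipi(target_cpu);
}

/* Target side, on entry to schedule() or when idle wakes up. */
static void drain_wakeups(int this_cpu)
{
    struct task_struct *p = wake_list[this_cpu];

    wake_list[this_cpu] = NULL;
    while (p) {
        struct task_struct *next = p->wake_next;
        add_to_local_runqueue(this_cpu, p);
        p = next;
    }
}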
From: Mike K. <mkr...@se...> - 2001-07-16 15:48:57
On Sun, Jul 15, 2001 at 10:15:42PM +0200, Andi Kleen wrote:
> This sounds racy, at least with your simple description. Even other CPUs could walk their run queue while the IPI is being processed and reset the cpus_allowed flag too early. I guess it would need a check in the schedule loop that restores cpus_allowed only if the current CPU is the only one set in that task's cpus_allowed.

You are correct. It is trivial to allow only the 'target' CPU to reset the cpus_allowed field. This was my first thought. However, as you state below, we would get into trouble if there was an extremely long delay in that CPU running schedule(). The timestamp sounds reasonable, but I was trying to keep it as simple as possible.

> This unfortunately would hang the process if the CPU for some reason cannot process schedules timely, so it may be necessary to also add a timestamp to the task_struct that allows the restoration of cpus_allowed even from non-target CPUs after some time.

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center
From: Troy B. <ho...@dr...> - 2001-07-15 07:44:22
On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> It started when Andi Kleen noticed that a single 'CPU Hog' task was being bounced back and forth between the 2 CPUs on his 2-way system. I had seen similar behavior when running the context switching test of LMbench. When running lat_ctx with only two threads on an 8 CPU system, one would 'expect' the two threads to be confined to two of the 8 CPUs in the system. However, what I have observed is that the threads are effectively 'round robined' among all the CPUs and they all end up bearing an equivalent amount of the CPU load.

Did this 'CPU Hog' task do any I/O that might have caused an interrupt?

I'm wondering if the interrupt distribution has anything to do with this. Are you using any CPU affinity for interrupts? If not, this might explain why the processes wind up doing a 'round robin'.

I'm trying to reproduce this with gzip on a dual Mac G4, and I'm wondering if this will be skewed any because all interrupts are directed at cpu0, so anything that generates input or output on the network or serial is going to tend to wind up on cpu0.

I'd also like to figure out what the IPI latency actually is. Does anyone have any suggestions on how to measure this? I would hope it's lower on the dual 7410 machine I have, since it has 4-stage pipelines as opposed to the 20-odd stages that Pentium III class machines do.

--
Troy Benjegerdes | master of mispeeling | 'da hozer' | ho...@dr...
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Shulz
From: Andi K. <ak...@su...> - 2001-07-15 09:05:47
On Sun, Jul 15, 2001 at 02:42:55AM -0500, Troy Benjegerdes wrote:
> Did this 'CPU Hog' task do any I/O that might have caused an interrupt?

Yes; it did some I/O, but it spent most of its time doing CPU work (it was CPU bound).

> I'm wondering if the interrupt distribution has anything to do with this. Are you using any CPU affinity for interrupts? If not, this might explain why the processes wind up doing a 'round robin'.

It was a normal Intel SMP box with no special settings, and the interrupts should have been evenly distributed. (I did not check it at the time, but normally they are about evenly distributed over the two CPUs.)

> I'd also like to figure out what the IPI latency actually is. Does anyone have any suggestions on how to measure this? I would hope it's lower on the dual 7410 machine I have, since it has 4-stage pipelines as opposed to the 20-odd stages that Pentium III class machines do.

I'm not sure about the Mac, but on x86 Linux boxes the SMP bootup synchronizes the time stamp counters of the CPUs. If they are synchronized on your box too, you could store the TSC value in the IPI sender and compute the average latency in the receiver.

-Andi
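A sketch of that measurement, assuming the boot-time TSC synchronization holds. The variable and hook names are invented; the idea is only that the sender stamps a shared variable right before raising the IPI and the handler reads the TSC on arrival.

#include <stdint.h>

/* x86-specific: read the time stamp counter. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static volatile uint64_t ipi_sent_tsc;
static uint64_t total_latency, samples;

/* Sender side, immediately before raising the reschedule IPI. */
static void stamp_ipi(void)
{
    ipi_sent_tsc = rdtsc();
}

/* First thing in the reschedule-IPI handler on the target CPU. */
static void measure_ipi(void)
{
    total_latency += rdtsc() - ipi_sent_tsc;
    samples++;
    /* average latency in cycles = total_latency / samples */
}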
From: Troy B. <ho...@dr...> - 2001-07-15 17:02:01
On Sun, Jul 15, 2001 at 11:05:43AM +0200, Andi Kleen wrote:
> > Did this 'CPU Hog' task do any I/O that might have caused an interrupt?
>
> Yes; it did some I/O, but it spent most of its time doing CPU work (it was CPU bound).

I shouldn't have sent that first email out so quickly. About 5 minutes after I sent it, I saw gzip switching CPUs a lot in top.

The next very interesting thing I found was that if I run something like 'while true; do true; done' to load the other CPU while gzip is running, gzip runs faster. The time change isn't very large, and appears to be just barely detectable above the noise, but it definitely appears that gzip is bouncing back and forth between CPUs if both are idle.

I'm tempted to try the somewhat brute-force approach of increasing PROC_CHANGE_PENALTY, which is currently set to 20 (on ppc), to something like 100. Would this be an adequate 'short-term' solution until there is some sort of multi-queue scheduler that everyone (and Linus) likes? What are the drawbacks of increasing PROC_CHANGE_PENALTY?

> I'm not sure about the Mac, but on x86 Linux boxes the SMP bootup synchronizes the time stamp counters of the CPUs. If they are synchronized on your box too, you could store the TSC value in the IPI sender and compute the average latency in the receiver.

The PPC has a 'timebase register' which is effectively the same thing as the TSC, except it counts processor bus ticks/4, not CPU cycles.

--
Troy Benjegerdes | master of mispeeling | 'da hozer' | ho...@dr...
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it because life wouldn't have any meaning for them if they didn't. That's why I draw cartoons. It's my life." -- Charles Shulz
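For context, a simplified rendering of how the 2.4-era goodness() uses PROC_CHANGE_PENALTY; real-time tasks, the mm bonus, and locking are trimmed, so treat it as a sketch rather than the exact kernel code. The penalty is a bonus given to a candidate that last ran on the deciding CPU, so it only matters when several runnable tasks are compared against each other; with a single runnable task there is nothing for the bonus to outweigh, which is Mike's point below.

#define PROC_CHANGE_PENALTY 20

struct task_struct {
    long counter;    /* remaining timeslice */
    long nice;       /* static priority, -20..19 */
    int processor;   /* CPU the task last ran on */
};

static int goodness(struct task_struct *p, int this_cpu)
{
    int weight = p->counter;

    if (!weight)
        return 0;                       /* quantum exhausted */
    if (p->processor == this_cpu)
        weight += PROC_CHANGE_PENALTY;  /* cache-affinity bonus */
    weight += 20 - p->nice;
    return weight;
}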
From: Mike K. <mkr...@se...> - 2001-07-16 01:00:22
On Sun, Jul 15, 2001 at 12:00:38PM -0500, Troy Benjegerdes wrote:
> The next very interesting thing I found was that if I run something like 'while true; do true; done' to load the other CPU while gzip is running, gzip runs faster. The time change isn't very large, and appears to be just barely detectable above the noise, but it definitely appears that gzip is bouncing back and forth between CPUs if both are idle.
>
> I'm tempted to try the somewhat brute-force approach of increasing PROC_CHANGE_PENALTY, which is currently set to 20 (on ppc), to something like 100. Would this be an adequate 'short-term' solution until there is some sort of multi-queue scheduler that everyone (and Linus) likes? What are the drawbacks of increasing PROC_CHANGE_PENALTY?

I'm pretty sure changing the value of PROC_CHANGE_PENALTY will not help in this case. PROC_CHANGE_PENALTY becomes increasingly important as the number of running tasks increases. In the case of simply running one task (like gzip) on your 2 CPU system, I don't think it will make any difference.

--
Mike Kravetz                          mkr...@se...
IBM Linux Technology Center