From: Hubertus F. <fr...@us...> - 2001-02-09 20:28:18
|
John, regarding your message and my previous message: I am also looking into making the PROC_CHANGE_PENALTY a function of the pool. Within a cpu-pool, the PROC_CHANGE_PENALTY is set to "A", and for CPUs outside the pool it is set to "Y*A + B". CPU sets and different PROC_CHANGE_PENALTY values are extremely easy to code in our current MQ scheduler.

Have you managed to get it running on a 32-way SGI machine? If so, you might want to try the prototype of the cpu-pools we posted on the LSE site.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

"John Hawkes" <ha...@en...>@lists.sourceforge.net on 02/09/2001 03:10:49 PM
Sent by: lse...@li...
To: "Mike Kravetz" <mkr...@se...>, <lse...@li...>
cc:
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler

> Each CPU specific runqueue data structure has a
> field which contains the maximum 'non-affinity goodness'
> value of all schedulable tasks on that runqueue. Therefore,
> when we 'take a quick look' we are really only looking at
> the task with the maximum 'non-affinity goodness' on a remote
> CPU's runqueue.

Another wrinkle: specific hardware implementations may have gradations of "non-affinity goodness", rather than a binary presumption that a process that previously executed on cpuA will prefer to execute again on cpuA, but that if not cpuA, then any other CPU is equally less good.

Suppose we have a NUMA machine consisting of two nodes, where each node contains main memory, *two* CPUs (that we'll name cpuA and cpuB for node1, and cpuC and cpuD for node2), and perhaps even a shared L3 cache for that node's main memory. Suppose processX last executed on cpuA. If it re-executes on cpuA, then we have some potential for having L1 and L2 cacheblocks waiting for it, and we expect optimum performance. And if it re-executes on cpuB, then we have some potential for having L3 cacheblocks waiting, and our performance is almost as good as on cpuA. Re-executing on another node -- on cpuC or cpuD -- would be definitely inferior to cpuB.

Moreover, a NUMA machine that has a hierarchy of memory access penalties, depending upon how "far away" you are from the previous-execution node's memory, will have an even more complex "goodness" calculation. Thus, what we need is to abstract the "goodness" calculation to allow for architecture-specific differences.

John Hawkes
ha...@en...

_______________________________________________
Lse-tech mailing list
Lse...@li...
http://lists.sourceforge.net/lists/listinfo/lse-tech |
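[The architecture-abstracted "goodness" John asks for could start as a graded bonus table. The sketch below renders his two-node example (cpuA=0, cpuB=1 on node1; cpuC=2, cpuD=3 on node2); all names and constants are illustrative assumptions, not values from any posted patch.]

```c
#include <assert.h>

/* Graded replacement for the flat, binary PROC_CHANGE_PENALTY boost,
 * for a hypothetical two-node NUMA box with two CPUs per node. */

#define SAME_CPU_BONUS    15   /* L1/L2 may still be warm        */
#define SAME_NODE_BONUS   10   /* shared L3 may still be warm    */
#define REMOTE_NODE_BONUS  0   /* cold caches plus NUMA distance */

static int cpu_to_node(int cpu) { return cpu / 2; }

/* Bonus added to a task's goodness when evaluated on this_cpu,
 * given the CPU it last ran on. */
int affinity_bonus(int this_cpu, int last_cpu)
{
    if (this_cpu == last_cpu)
        return SAME_CPU_BONUS;
    if (cpu_to_node(this_cpu) == cpu_to_node(last_cpu))
        return SAME_NODE_BONUS;
    return REMOTE_NODE_BONUS;
}
```

An architecture would supply its own `cpu_to_node()` and bonus table; a machine with deeper memory hierarchies would return more gradations rather than three.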
From: Hubertus F. <fr...@us...> - 2001-02-09 20:28:26
Attachments:
pre8-chat.pdf
|
Mike, it goes to the heart of the question whether we MUST or NEED to stick to the current scheduler semantics.

For the low-end SMPs (4-ways or so) it seems OK to me to simply check, before trying to steal the thread, whether it can be scheduled or not. If not, you move on to the next queue. You are afraid that the cpus_allowed mask will be used frequently. Waiting for the lock doesn't happen in the current code; you use spin_trylock. The problem is, as you point out, that a task with sufficiently high goodness to warrant a preemption can be stuck behind this "cpus_allowed"-limited task and therefore is never considered.

Again, at some point one has to do some hand-waving and state "tough luck". As far as I am concerned, the task of the scheduler is to create high throughput through the system and provide some degree of fairness. By defining an affinity boost, we are already sacrificing fairness to some degree.

I have taken a slightly different angle. I am willing to relax the current scheduling assumptions in order to increase scalability. In particular, I am willing to run tasks of lower goodness value as long as I adhere to the RT semantics. I think that is OK, because the PROC_CHANGE_PENALTY is a first heuristic that might not necessarily be right.

I think when you are approaching larger numbers of CPUs (8+) you need to look at partitioning your system from a scheduling point of view. One should pretty much deal with the system as if it were a NUMA system. I have taken the MQ scheduler and sub-divided it into cpu pools. I have posted about this already under our latest status report for the scheduling:
http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing

Running the chatroom with 30/300 gives the following results. (I will post these on Monday on our lse.sourceforge.net/scheduling site for general consumption.)

(See attached file: pre8-chat.pdf)

In this case, splitting up the scheduler into multiple pools with occasional trivial load-balancing shows, at the very high end for the chat-room, a 10% improvement over our current MQ. In this case I am checking in reschedule_idle() whether to preempt or not, but in schedule() I only look within my pool. There are some potential improvements possible, for instance checking for remote idle CPUs on all CPUs, but with no preemption outside the current pool. I think these mechanisms are worthwhile to look into.

I think that the usage of cpus_allowed can be tied into this. cpus_allowed is only a simple mechanism, not a policy. I think we need to look at the policies that will be built on top of cpu affinities. I think most of them will originate in the NUMA area. And doing a pool-based approach seems to be OK.

Any comments ?

Hubertus Franke
email: fr...@us...

Mike Kravetz <mkr...@se...>@lists.sourceforge.net on 02/09/2001 02:26:08 PM
Sent by: lse...@li...
To: lse...@li...
cc:
Subject: [Lse-tech] cpus_allowed in multi-queue scheduler

I'm looking at the cpus_allowed field of the task structure and trying to determine the best way to handle it in the multi-queue scheduler.

As a reminder, the multi-queue scheduler I am working on has one runqueue per CPU. In addition, there is a separate runqueue lock per CPU which synchronizes access to the CPU-specific runqueue. In schedule(), the runqueue lock associated with the CPU we are currently running on is obtained. Then the CPU-specific runqueue is scanned looking for the task with the highest 'goodness' value. After scanning the CPU-specific runqueue, we 'take a quick look' at the other runqueues to determine if there is a task with higher goodness that should be scheduled. Of course, this takes CPU affinity into account just like the current scheduler.
Each CPU-specific runqueue data structure has a field which contains the maximum 'non-affinity goodness' value of all schedulable tasks on that runqueue. Therefore, when we 'take a quick look' we are really only looking at the task with the maximum 'non-affinity goodness' on a remote CPU's runqueue. See the description of the multi-queue scheduler at 'http://lse.sourceforge.net/scheduling/mq1.html' if you want more details. Now, if we find a task with sufficiently high goodness on another CPU-specific runqueue, we attempt to lock (via spin_trylock) the other runqueue and move the task to our runqueue.

It is during the process of 'stealing' tasks from other runqueues that we must be concerned with the cpus_allowed field of the task we are stealing. We don't want to steal a task if it is not allowed to run on our CPU. However, we don't want to wait until we have obtained a lock on a remote CPU's runqueue to check the cpus_allowed field and determine if we should steal the task.

My thought is that the runqueue data structure could also contain the cpus_allowed field of the task with the maximum 'non-affinity goodness' value. Hence, we could 'quickly check' this value without obtaining the remote CPU's runqueue lock.

However, I have some concerns that this design could cause some significant scheduling behavior changes if the cpus_allowed field/feature is used extensively. Right now, I don't see much/any use of this field, but I expect that may change in the future. Consider the case where the task with the maximum 'non-affinity goodness' value has cpus_allowed set such that it is limited to a small subset of the system's CPUs. Now, other CPUs outside this subset will not be able to steal this task. In addition, they will not be able to steal any other tasks on this runqueue which may have a sufficiently high goodness value. This is because we only keep track of the task with the highest goodness value, and in this case it can only run on a subset of CPUs.
It is obvious that a task which is limited to a single CPU should never be identified as a candidate to be stolen by another CPU, and this is easy to code. However, what about a task limited to 2 CPUs on an 8-CPU system, or 2 CPUs on a 16- or 32-CPU system?

Any comments?
--
Mike Kravetz mkr...@se...
IBM Linux Technology Center |
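[The "quick check" Mike proposes above might look like the following sketch: the per-CPU runqueue caches, next to `max_na_goodness`, the cpus_allowed mask of the task that produced it, so a remote CPU can reject a steal without taking the runqueue lock. Struct layout and names are hypothetical, not the actual MQ patch.]

```c
#include <assert.h>

/* Hypothetical per-CPU runqueue header, holding only the fields that
 * other CPUs peek at lock-free during their 'quick look'. */
struct runqueue {
    int           max_na_goodness;     /* best non-affinity goodness here */
    unsigned long max_na_cpus_allowed; /* cpus_allowed of that same task  */
};

/* Is it worth attempting spin_trylock() on rq to steal its best task? */
int worth_stealing(const struct runqueue *rq, int this_cpu, int local_best)
{
    if (rq->max_na_goodness <= local_best)
        return 0;                               /* nothing better there */
    /* Reject without locking if that task cannot run on this CPU. */
    return (rq->max_na_cpus_allowed & (1UL << this_cpu)) != 0;
}
```

This also makes Mike's concern concrete: a runqueue whose top task has a narrow mask answers "no" to every CPU outside that mask, hiding any lower-ranked stealable tasks behind it.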
From: Jun N. <ju...@sc...> - 2001-02-09 21:48:23
|
I agree. I think the scope of the values returned by goodness() is not clear even with the original scheduler - it's neither local nor global, because of PROC_CHANGE_PENALTY, which should be platform-dependent. So I don't think we need to stick to the current scheduler semantics for non-realtime tasks, especially when supporting larger SMP machines, including NUMA.

Besides the cpus_allowed mask, one thing I would like to add is a check, when stealing a task from a remote CPU, of whether its cache is still warm there. We don't want to migrate a task with a warm cache even if it has a high na_goodness value. One easy and effective way is to maintain the last time (in jiffies) at which the task was scheduled. If the elapsed time is larger than some tunable/platform-specific value, then we might want to migrate the task; otherwise find another one. I bet this would cause significant scheduling behavior changes.

Hubertus Franke wrote:
> 
> Mike, it goes to the heart of the question whether we MUST or NEED to stick
> to the current scheduler semantics.
~snip

--
Jun U Nakajima
Core OS Development
SCO/Murray Hill, NJ
Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
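[Jun's warm-cache filter above could be sketched as follows; `CACHE_DECAY_JIFFIES` and the names are hypothetical stand-ins for the "tunable/platform-specific value" he mentions.]

```c
#include <assert.h>

/* A task is a migration candidate only when enough jiffies have passed
 * since it last ran on its previous CPU, i.e. its cache is presumed cold. */

#define CACHE_DECAY_JIFFIES 8UL   /* tunable, platform-specific */

/* Unsigned subtraction so the comparison survives jiffies wraparound. */
int cache_cold(unsigned long now, unsigned long last_run)
{
    return (now - last_run) > CACHE_DECAY_JIFFIES;
}

/* Steal only if the remote task beats our local best AND is cache-cold. */
int can_migrate(unsigned long now, unsigned long last_run,
                int na_goodness, int local_best)
{
    return na_goodness > local_best && cache_cold(now, last_run);
}
```

As Jun notes, skipping a high-na_goodness but cache-warm task and taking a lesser one is exactly the kind of deliberate departure from current scheduler semantics under discussion.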
From: John H. <ha...@en...> - 2001-02-09 22:12:28
|
It was my impression that the rationale for having the MQ scheduler maintain existing behavior was primarily to allow for a closer apples-to-apples comparison with the existing scheduler, and secondarily because of a concern that some people would equate *different* behavior with *inferior* behavior.

Is there anyone in the Linux Community who believes that the current scheduler is perfect?

John Hawkes
ha...@en... |
From: Jun N. <ju...@sc...> - 2001-02-09 23:16:08
|
John Hawkes wrote:
> 
> It was my impression that the rationale for having the MQ scheduler
> maintain existing behavior was primarily to allow for a closer
> apples-to-apples comparison with the existing scheduler, and secondarily
> because of a concern that some people would equate *different* behavior
> with *inferior* behavior. Is there anyone in the Linux Community who
> believes that the current scheduler is perfect?

I believe that the current scheduler is simple and very good, especially for UP or smaller SMP machines. I like it because it chooses tasks using values (counter) that change very frequently (compared to some Unixes), providing better response.

I think what people were/are doing with the MQ scheduler is right from the software engineering point of view. We need to understand it clearly before exploring further. As a first step, or verification, we needed to simulate the existing behavior as much as possible. Such "some people" might be correct: a truck cannot run like a car.

At this point I'm not sure if we should replace the original scheduler code (#ifdef CONFIG_SMP) with the MQ scheduler. We might need to use something like #ifdef CONFIG_BIG_SMP. Of course, this would have maintenance problems... But I think we should think about various ideas for the MQ scheduler before we worry about such maintenance issues, or about how to persuade the Linux Community.

--
Jun U Nakajima
Core OS Development
SCO/Murray Hill, NJ
Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
From: Mike K. <mkr...@se...> - 2001-02-10 00:12:21
|
I agree with most everything people have written so far.

As a relative newcomer to Linux kernel development, I was under the impression that there was some reluctance to change the existing scheduler implementation. Because of this, I thought it important to see how far we could take scheduler implementations which try to maintain existing scheduler behavior. This was my primary motivation for the multi-queue scheduler implementation. As already pointed out, this allows for an apples-to-apples comparison.

In addition, if an implementation maintains existing behavior, I suspect it is more likely to be accepted into the main code base. If we are able to get such a change accepted, this may make it easier to get changes which depart from existing behavior accepted in the future. Of course, this is all my opinion as a relative newcomer.

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center |
From: Rik v. R. <ri...@co...> - 2001-02-11 19:52:28
|
On Fri, 9 Feb 2001, John Hawkes wrote:
> Is there anyone in the Linux Community who believes that
> the current scheduler is perfect?

Unlikely; even on dual-CPU machines with large caches (ie. Xeon systems) it does something very bad...

Suppose you have 2 CPUs and 3 CPU-eating tasks (eg. 2 threads of a long-running job and one compiler ... fairly common stuff). Now one of the CPU-bound tasks is halfway through its timeslice and gets interrupted by eg. /usr/bin/vi. After vi gives up the CPU, the scheduler will select *another* task to run on that CPU, instead of switching back to the previous CPU-intensive task...

Also, recalculating the priority of *every* task in the system at recalculation time has the possibility of blowing away too large a portion of the cache if you're running with huge numbers of processes (and some nice ping-pong effects if you have multiple CPUs recalculating at once, as can happen when you have a bunch of nice +19 processes).

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/ |
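[The recalculation Rik criticizes is the global counter refresh in the 2.4 schedule(): once every runnable task has exhausted its quantum, the kernel walks the entire task list, sleepers included. The simulation below uses an illustrative `nice_to_ticks()` (the real NICE_TO_TICKS macro scales with HZ); the point is the full-list walk, which touches at least one cacheline per task in the system.]

```c
#include <assert.h>

struct task { int counter; int nice; };

/* Illustrative stand-in for NICE_TO_TICKS; not the exact kernel macro. */
static int nice_to_ticks(int nice) { return 20 - nice; }

/* Every task -- runnable or asleep -- gets half its leftover counter
 * plus a fresh quantum, so interactive sleepers build up credit. */
void recalculate(struct task *tasks, int n)
{
    for (int i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1)
                         + nice_to_ticks(tasks[i].nice);
}
```

The `counter >> 1` term is also why a long sleeper's counter converges toward twice its base quantum rather than growing without bound.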
From: Jun N. <ju...@sc...> - 2001-02-12 16:02:22
|
Rik van Riel wrote:
> 
> Also, recalculating the priority of *every* task in the system
> at recalculation time has the possibility of blowing away a too
> large portion of the cache if you're running with huge amounts
> of processes (and some nice ping-pong effects if you have multiple
> CPUs recalculating at once, as can happen when you have a bunch of
> nice +19 processes).

In addition, the non-realtime tasks whose time quantum has expired are still on the run queue, and goodness() against them always returns 0. The scheduler keeps on checking such expired tasks as well, wasting CPU cycles and cache lines, until the next recalculation time. If the number of tasks on the run queue is N and no rescheduling is requested except when the time quantum expires, the scheduler basically checks such tasks N times per task. Those extra N-1 checks can be avoided.

One solution to this is to move such expired tasks to another (run) queue when their time quantum expires, and concatenate that queue and the run queue (which may have realtime tasks) at recalculation time. The same thing can be done with the MQ scheduler, although the per-CPU run queue does not have realtime tasks.

--
Jun U Nakajima
Core OS Development
SCO/Murray Hill, NJ
Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
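[Jun's expired-queue suggestion above could be sketched like this, with list handling reduced to a minimal singly linked list; all names are hypothetical.]

```c
#include <assert.h>
#include <stddef.h>

struct task {
    int counter;            /* remaining quantum */
    struct task *next;
};

struct queues {
    struct task *run;       /* tasks goodness() still scans          */
    struct task *expired;   /* counter == 0, parked until recalc time */
};

/* Called when t's quantum expires (caller has unlinked it from the run
 * queue): park it where the goodness() scan never sees it. */
void park_expired(struct queues *q, struct task *t)
{
    t->next = q->expired;
    q->expired = t;
}

/* At recalculation time: give every parked task a fresh quantum and
 * splice the expired list back onto the run queue. */
void recalc_and_splice(struct queues *q, int quantum)
{
    while (q->expired) {
        struct task *t = q->expired;
        q->expired = t->next;
        t->counter = quantum;
        t->next = q->run;
        q->run = t;
    }
}
```

This removes the N-1 wasted goodness() checks Jun counts, at the cost of one extra list operation per quantum expiry.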
From: Mike K. <mkr...@se...> - 2001-02-12 20:31:01
|
On Sun, Feb 11, 2001 at 04:52:16PM -0300, Rik van Riel wrote:
> Also, recalculating the priority of *every* task in the system
> at recalculation time has the possibility of blowing away a too
> large portion of the cache if you're running with huge amounts
> of processes (and some nice ping-pong effects if you have multiple
> CPUs recalculating at once, as can happen when you have a bunch of
> nice +19 processes).

I seem to remember that someone from SGI had a patch which partially addressed this situation. Instead of running the entire task list at recalculate time, it only 'recalculated' counter values for tasks on the runqueue. It would keep track of what adjustments should have been made to non-runnable tasks, and only make these adjustments when the tasks were added to the runqueue. Seemed like a good idea to me. Anyone know why this did not make its way into the scheduler?

As for multiple CPUs doing recalculates at the same time, I also had big concerns about this. As a result, I added some code to track the number of times this situation occurred. My results showed that it did not happen frequently. However, this was only one specific workload, and I don't really remember what it was. It should be fairly easy to address this with another lock/locking level.

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center |
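[The lazy recalculation Mike recalls could be sketched as below (all names hypothetical): the recalculation pass touches only runnable tasks and bumps a global epoch counter, and a sleeper replays the recalculations it missed when it is put back on a runqueue.]

```c
#include <assert.h>

struct task {
    int counter;
    int nice;
    unsigned int recalc_seen;   /* epoch this task has caught up to */
};

/* Illustrative stand-in for NICE_TO_TICKS; not the exact kernel macro. */
static int nice_to_ticks(int nice) { return 20 - nice; }

/* Called from add_to_runqueue(): apply the recalculations this task
 * missed while it was asleep, instead of touching it at recalc time. */
void catch_up(struct task *t, unsigned int recalc_epoch)
{
    while (t->recalc_seen != recalc_epoch) {
        t->counter = (t->counter >> 1) + nice_to_ticks(t->nice);
        t->recalc_seen++;
    }
}
```

Since `counter` converges geometrically, the loop could be capped at a few iterations for tasks that slept through many epochs, without changing the result materially.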
From: george a. <ge...@mv...> - 2001-02-13 02:39:41
|
Jun Nakajima wrote:
> 
> I agree. I think the scope of the values returned by goodness() is not
> clear even with the original scheduler - it's not local or global
> either, because of PROC_CHANGE_PENALTY, which should be
> platform-dependent.
> 
> So I don't think we need to stick to the current scheduler semantics for
> non-realtime tasks, especially when supporting larger SMP machines,
> including NUMA.
> 
> Besides the cpu_allowed mask, one thing I would like to add is to check
> if its cache is still warm or not there, when stealing a task from a
> remote CPU. We don't want to migrate such a task with warm cache even if
> it has a high na_goodness value. One of easy and effective ways is to
> maintain the last time (jiffies) when the task was scheduled. If the
> time is larger than some tunable/platform-specific value, then we might
> want to migrate the task, otherwise find another one. I bet this would
> cause significant scheduling behavior changes.

Such a test might be done on run-list insertion also, i.e. why favor the last cpu if all the task's data has long since been flushed from that cpu's cache? Then you might keep a count of the aggregate number of "ticks" in each cpu's run list, and add a task with a stale cache history to the list with the least aggregate count.

George

> 
> Hubertus Franke wrote:
~snip |
From: John H. <ha...@en...> - 2001-02-13 02:48:52
|
From: "george anzinger" <ge...@mv...>
...
> Such a test might be done on run list insertion also. I.e. why favor
> the last cpu if all the tasks data has long since been flushed from that
> cpu's cache?

Because if it's a NUMA system, the process has a vested interest in re-executing on the previous cpu if that cpu/node holds the hot portions of the process' main memory usage.

John Hawkes
ha...@en... |
From: Andi K. <ak...@su...> - 2001-02-10 11:14:45
|
On Fri, Feb 09, 2001 at 03:22:50PM -0500, Hubertus Franke wrote:
> 
> Mike, it goes to the heart of the question whether we MUST or NEED to stick
> to the current scheduler semantics.

When it complicates the code for no good reason, it's probably better not to stick to them slavishly. Linux scheduling behaviour has changed in the past too, especially on SMP, e.g. from 2.2 to 2.4; only UP has been relatively stable. So it's a moving target.

> I have taken the MQ scheduler and sub-divide into the cpu pools. I have
> posted regarding this already under
> our latest status report for the scheduling:
> http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing
> 
> Running the chatroom with 30/300 gives the following results. (I will post
> these on Monday on our lse.sourceforge.net/scheduling) site
> for general consumption.

I think one problem is that it has not been generally accepted that chatroom (so many threads on the runqueue) is a good benchmark. I think e.g. wakeup optimizations (where Linux is relatively poor ATM, and latency is important) plus running a reasonable number of running threads on each CPU that get woken up by someone else would be better. Unfortunately I cannot offer a good benchmark for this.

> I think that the usage of cpu_allowed can be tight into this. cpu_allowed
> is only a simple mechanism and not a policy. I think we need

Just call it a "hack for Tux", to work around the reordering problem in the network stack. I don't think anybody else is using it yet. Real dynamic irq affinity redirection would be much better anyway.

-Andi |
From: Hubertus F. <fr...@us...> - 2001-02-09 22:03:55
|
Yikes... I am surprised you don't see better improvements. Mike updated the MQ patch some 2-3 weeks ago due to the problem that Jun found. You might want to check whether your patch and the latest one are the same.

As for the pools, I'll tell you what: since in the call-for-participation we call for experimentation, I shall follow this call. Let me go ahead and propose the following scheduler semantics and make the few modifications to the MQ+pool+LB code that I already have. I can probably code this on Sunday, and we can have some scalability numbers+curves on Tuesday or so for a 1-8-way system over some reasonable workloads.

Currently there are two paths in the system to consider, with the following very rough algorithm:

reschedule_idle()
    check_all_remote_cpus for preemptability   (affinity impaired)
    (build stacklist)

schedule()
    check for RT
    find best in local Q
    find better in remote Q                    (affinity impaired)

Let's replace/integrate that with the concept of the pool:

reschedule_idle()
    for each remote cpu P:
        if P is in the same pool:
            check for idle and for preemptability (local-pool affinity boost)
        else:
            check only for idle (with preference to local idles)

schedule()
    check for RT
    find better in local Q                     (affinity impaired)
    if (next would be idle)
        find better in pool-remote Qs

This makes sure that there are no IDLE threads around, but we only preempt in our local pool. We let load balancing take care of the inequalities between the queues.

One problem I see right now is that in such an algorithm the "recalculate mechanism" might be broken. Every time a local pool drops to the (c==0) condition, it would initiate a global recalculate, thus affecting other pools and queues as well. One possibility here is to filter the recalculate loop based on the pool.

Load balancing right now is trivial, namely trying to average all runqueue lengths. Better algorithms could be deployed, or this beast could be made a loadable module.
Andrea Arcangeli had a slightly different design for the above, where he put a separate scheduler with separate data into every NODE_DATA for NUMA purposes. The difference between the above and his approach is that in reschedule_idle() he probably doesn't check for CPUs on remote nodes being idle.

On a different note: I integrated a priority scheduler (multiple lists based on na_goodness) with the MQ, to see whether moving to a different queue organization at very high thread counts would make a difference. The resulting scheduler switches dynamically between a priority queue and a single list per runqueue in the MQ. I instrumented the scheduler to count the various occurrences. When measuring a kernel build, the results are not very encouraging, i.e. the savings I get from not having to traverse the entire runqueue but only a few priority levels seem to be offset by skipping through the various priority levels.

Have a good weekend everybody.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us...
(w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

"John Hawkes" <ha...@en...>@lists.sourceforge.net on 02/09/2001 03:45:04 PM
Sent by: lse...@li...
To: <lse...@li...>
cc:
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler

From: "Hubertus Franke" <fr...@us...>
> Have you managed to get it running on a 32-way SGI machine.

Yes, and I also made some mips64 kernel fixes to get the IBM "chat" benchmark to execute. I'm not seeing very repeatable results, however, for either the regular scheduler or the MQ scheduler (using the MQ patch from a couple of weeks ago). The MQ scheduler definitely eliminates the contention on the runqueue_lock, but I'm not seeing dramatic improvements in "chat" benchmark results. I may be encountering other lock contention issues, especially with this kind of tcp-intensive workload.
I'm planning to run some other benchmarks on this system, but the trick is to choose a workload that exposes the improvements to the scheduler (i.e., flooding the system with long runqueues and lots of context switches) without having that workload be so single-minded that it stumbles into yet another single-threaded algorithm.

Getting the global runqueue_lock out of the way (using the MQ scheduler) exposed another hot global lock: xtime_lock. We do a write_lock(&xtime_lock) on every CPU at every HZ timer interrupt, and these interrupts tend to be concurrent (or at least they try to be concurrent). For a leisurely HZ==100 this isn't spectacularly awful, but a higher HZ would get progressively worse.

John Hawkes
ha...@en... |
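[The pool rule Hubertus proposes above (preempt only inside the local pool, but take idle CPUs anywhere) could be condensed into a predicate like the following; the pool size, names, and enum are illustrative assumptions, not code from the posted patch.]

```c
#include <assert.h>

/* Hypothetical fixed pool layout: CPUs 0-3 are pool 0, 4-7 pool 1, etc. */
static int cpu_pool(int cpu) { return cpu / 4; }

enum action { SKIP, USE_IDLE, PREEMPT };

/* Decide what reschedule_idle() may do with target_cpu for a wakeup on
 * this_cpu.  'preemptable' means the task running there has lower
 * goodness than the woken task. */
enum action consider_cpu(int this_cpu, int target_cpu,
                         int target_idle, int preemptable)
{
    int same_pool = cpu_pool(this_cpu) == cpu_pool(target_cpu);
    if (target_idle)
        return USE_IDLE;            /* idle CPUs are taken pool-wide */
    if (same_pool && preemptable)
        return PREEMPT;             /* preemption stays in the local pool */
    return SKIP;
}
```

A real implementation would still prefer local-pool idles over remote ones when several CPUs qualify, as the proposal states.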
From: John H. <ha...@en...> - 2001-02-10 02:54:26
|
I've rerun "chat" on my 32-cpu mips64 system, using a kernel based upon 2.4.1, and using the multiqueue patch dated January 29. I'm still seeing lots of variability, and I'm not seeing a great difference between the stock 2.4.1 and the 2.4.1+MQ -- about 8% better, using a mean of five tests. I haven't looked closely to understand what's going on. However, I have a question: the results shown on the SourceForge site show throughput values (on the Y-axis) on the order of 200,000 to 350,000 (messages per second?), e.g., for the default 10 rooms, 100 messages. What is this hardware that you're using? My 4-CPU i386 system gets a tenth of that throughput rate, and my 32-CPU mips64 system is only about double my i386. Our mips64 system cannot be described as being tuned in any way, shape, or form, so I'm not particular alarmed at its numbers, but my 4-CPU i386 system ought to be in the ballpark of what you guys are reporting, and it's an order of magnitude less. John Hawkes ha...@en... |
From: Hubertus F. <fr...@us...> - 2001-02-09 22:29:20
|
Wrong, it was comparing oranges with oranges :-)

Why don't we start hashing out a design? I guess we all believe that the MQ is an excellent start. I believe our pooling makes sense, but we need to think this through. So let's start out with the design that I threw out, which takes MQ slightly away from the strict functional equivalence. Let's poo-poo it and hopefully we can come up with something better. For instance, how about Jun's idea of looking at last_run as an additional basis for deciding affinity?

Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair) email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

"John Hawkes" <ha...@en...> on 02/09/2001 05:12:20 PM
To: "Jun Nakajima" <ju...@sc...>, Hubertus Franke/Watson/IBM@IBMUS
cc: "Mike Kravetz" <mkr...@se...>, <lse...@li...>
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler

It was my impression that the rationale for having the MQ scheduler maintain existing behavior was primarily to allow for a closer apples-to-apples comparison with the existing scheduler, and secondarily because of a concern that some people would equate *different* behavior with *inferior* behavior. Is there anyone in the Linux community who believes that the current scheduler is perfect?

John Hawkes ha...@en... |
From: Hubertus F. <fr...@us...> - 2001-02-09 22:29:20
|
This is the old problem of deciding affinity based on where the damned thing ran last. I once read a scheduler paper from SGI where they actually implemented affinity based on some time decay. Another Linux-based system built somewhere in Germany actually counted the cache misses since the last context switch and used a simple Markov chain to determine the cache footprint. This is not very difficult to implement when approximated, but I don't know whether such an approach would be portable across all platforms.

I like your idea of simply taking the jiffies, but then the max_na_goodness(cpu) definition might get screwed up, because now we need to compute the second best with the affinity and time decay in mind. Effectively, the local goodness value is not sufficient for computing this variable. Let me think about it over the weekend for a decent solution and whether I can integrate such a mechanism with our current approach. Mike, you might want to scratch your head as well.

Next, with respect to scheduler semantics. Mike and I talked about it right at the beginning, and we came to the conclusion that we need to prove improvements with equal semantics to allow oranges to be compared to oranges. Is it time to move beyond that point? Have we proven that we can do better across most applications? I don't know who is going to make the judgement call that we can/should move beyond these strict functional equivalence requirements. I am all in favor. I hope this is raised in the upcoming 2.5 brainstorming session.

Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair) email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Jun Nakajima <ju...@sc...>@sco.com on 02/09/2001 04:43:44 PM Sent by: nak...@sc...
To: Hubertus Franke/Watson/IBM@IBMUS
cc: Mike Kravetz <mkr...@se...>, lse...@li...
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler

I agree.
I think the scope of the values returned by goodness() is not clear even with the original scheduler - it's not local or global either, because of PROC_CHANGE_PENALTY, which should be platform-dependent. So I don't think we need to stick to the current scheduler semantics for non-realtime tasks, especially when supporting larger SMP machines, including NUMA.

Besides the cpus_allowed mask, one thing I would like to add is a check for whether a task's cache is still warm on its CPU when stealing a task from a remote CPU. We don't want to migrate a task with a warm cache even if it has a high na_goodness value. One easy and effective way is to maintain the last time (jiffies) when the task was scheduled. If the elapsed time is larger than some tunable/platform-specific value, then we might want to migrate the task; otherwise find another one. I bet this would cause significant scheduling behavior changes.

Hubertus Franke wrote:
>
> Mike, it goes to the heart of the question whether we MUST or NEED to stick
> to the current scheduler semantics.
>
> For the low-end SMPs (4-ways or so) it seems OK to me to simply check, before
> trying to steal the thread, whether it can be scheduled or not. If not, you
> move on to the next queue. You are afraid that the cpus_allowed mask will
> be used frequently. Waiting for the lock doesn't happen in the current
> code; you use try_spinlock. The problem is, as you point out, that there is
> a task with sufficiently high goodness to warrant a preemption stuck behind
> this "cpus_allowed"-limited task, and it is therefore never considered.
>
> Again, at some point, one has to do some hand-waving and state "tough luck".
> As far as I am concerned, the task of the scheduler is to create high
> throughput through the system and provide some degree of fairness. By
> defining an affinity boost, we are already sacrificing fairness to some
> degree.
>
> I have taken a slightly different angle. I am willing to relax the current
> scheduling assumptions in order to increase scalability. Particularly, I am
> willing to run tasks of lower goodness value as long as I adhere to the RT
> semantics. I think that is OK, because the PROC_CHANGE_PENALTY is a first
> heuristic that might not necessarily be right.
>
> I think when you are approaching larger numbers of CPUs (8+) you need to
> look at partitioning your system from a scheduling point of view. One
> should pretty much deal with the system as if it is a NUMA system.
>
> I have taken the MQ scheduler and subdivided it into cpu pools. I have
> posted regarding this already under our latest status report for scheduling:
> http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing
>
> Running the chatroom with 30/300 gives the following results. (I will post
> these on Monday on our lse.sourceforge.net/scheduling site for general
> consumption.)
> (See attached file: pre8-chat.pdf)
>
> In this case, splitting up the scheduler into multiple pools with
> occasional trivial load-balancing shows, at the very high end of the
> chat-room, a 10% improvement over our current MQ. In this case I am
> checking in reschedule_idle whether to preempt or not, but in schedule I
> only look within my pool. There are some potential improvements possible,
> for instance only checking for remote idle cpus across all cpus, but no
> preemption outside the current pool. I think these mechanisms are worth
> looking into.
>
> I think the usage of cpus_allowed can be tied into this. cpus_allowed is
> only a simple mechanism and not a policy. I think we need to look at the
> policies that will be built on cpu affinities. I think most of them will
> originate in the NUMA area. And doing a pool-based approach seems to be OK.
>
> Any comments?
>
> Hubertus Franke
> email: fr...@us...
>
> Mike Kravetz <mkr...@se...>@lists.sourceforge.net on 02/09/2001
> 02:26:08 PM
>
> Sent by: lse...@li...
>
> To: lse...@li...
> cc:
> Subject: [Lse-tech] cpus_allowed in multi-queue scheduler
>
> I'm looking at the cpus_allowed field of the task structure and trying to
> determine the best way to handle this in the multi-queue scheduler.
>
> As a reminder, the multi-queue scheduler I am working on has one runqueue
> per CPU. In addition, there is a separate runqueue lock per CPU which
> synchronizes access to the CPU-specific runqueue. In schedule(), the
> runqueue lock associated with the CPU we are currently running on is
> obtained. Then the CPU-specific runqueue is scanned looking for the task
> with the highest 'goodness' value. After scanning the CPU-specific
> runqueue, we 'take a quick look' at the other runqueues to determine if
> there is a task with higher goodness that should be scheduled. Of course,
> this takes CPU affinity into account just like the current scheduler.
> Each CPU-specific runqueue data structure has a field which contains the
> maximum 'non-affinity goodness' value of all schedulable tasks on that
> runqueue. Therefore, when we 'take a quick look' we are really only
> looking at the task with the maximum 'non-affinity goodness' on a remote
> CPU's runqueue. See the description of the multi-queue scheduler at
> 'http://lse.sourceforge.net/scheduling/mq1.html' if you want more details.
> Now, if we find a task with sufficiently high goodness on another CPU's
> runqueue, we attempt to lock (via spin_trylock) the other runqueue and
> move the task to our runqueue.
>
> It is during the process of 'stealing' tasks from other runqueues where we
> must be concerned with the cpus_allowed field of the task we are stealing.
> We don't want to steal a task if it is not allowed to run on our CPU.
> However, we don't want to wait until we have obtained a lock on a remote
> CPU's runqueue to check the cpus_allowed field and determine if we should
> steal the task.
>
> My thought is that the runqueue data structure could also contain the
> cpus_allowed field of the task with the maximum 'non-affinity goodness'
> value. Hence, we could 'quickly check' this value without obtaining the
> remote CPU's runqueue lock.
>
> However, I have some concerns that this design could cause some
> significant scheduling behavior changes if the cpus_allowed field/feature
> is used extensively. Right now, I don't see much/any use of this field,
> but I expect that may change in the future. Consider the case where the
> task with the maximum 'non-affinity goodness' value has cpus_allowed set
> such that it is limited to a small subset of the system's CPUs. Now, other
> CPUs outside this subset will not be able to steal this task. In addition,
> they will not be able to steal any other tasks on this runqueue which may
> have a sufficiently high goodness value. This is because we only keep
> track of the task with the highest goodness value, and in this case it can
> only run on a subset of CPUs. It is obvious that a task which is limited
> to a single CPU should never be identified as a candidate to be stolen by
> another CPU, and this is easy to code. However, what about a task limited
> to 2 CPUs on an 8-CPU system, or 2 CPUs on a 16- or 32-CPU system?
>
> Any comments?
> --
> Mike Kravetz mkr...@se...
> IBM Linux Technology Center
>
> (Attachment: pre8-chat.pdf, application/pdf)

--
Jun U Nakajima
Core OS Development
SCO/Murray Hill, NJ
Email: ju...@sc..., Phone: 908-790-2352 Fax: 908-790-2426 |
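The per-runqueue hint Mike proposes, combined with Jun's last-run timestamp check, might look roughly like the sketch below. This is an illustrative user-space sketch only: the struct layout, field names, and the CACHE_DECAY_JIFFIES constant are all invented here and do not come from the actual MQ patch.

```c
#include <assert.h>

typedef unsigned long cpumask_t;      /* one bit per CPU, as in 2.4 */

/* Hypothetical per-CPU runqueue hint, readable without the lock. */
struct mq_runqueue {
    int           max_na_goodness;    /* best non-affinity goodness here  */
    cpumask_t     max_task_allowed;   /* cpus_allowed of that same task   */
    unsigned long max_task_last_run;  /* jiffies when it last ran         */
};

#define CACHE_DECAY_JIFFIES 10        /* tunable, platform-specific */

/* Lock-free quick look: should cpu 'me' bother trying to steal from rq?
 * A stale read here is harmless -- the real decision is re-checked after
 * spin_trylock() on the remote runqueue succeeds. */
static int worth_stealing(const struct mq_runqueue *rq, int me,
                          int my_best_goodness, unsigned long now)
{
    if (rq->max_na_goodness <= my_best_goodness)
        return 0;                     /* nothing better over there        */
    if (!(rq->max_task_allowed & (1UL << me)))
        return 0;                     /* task may not run on this cpu     */
    if (now - rq->max_task_last_run < CACHE_DECAY_JIFFIES)
        return 0;                     /* cache presumably still warm      */
    return 1;
}
```

Note that this sketch inherits exactly the limitation Mike raises: only the single best task's mask is cached, so a runner-up task with a permissive mask stays invisible to remote CPUs.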
From: Gerrit H. <ge...@us...> - 2001-02-10 02:37:32
|
Yes, there has been a lot of written resistance and verbal resistance. I've seen comments from Linus (probably more than 6 months ago) saying that the current scheduler was sufficiently simple/complex, and additional complexity wasn't very popular with him. Also, I talked with Stephen Tweedie about the scheduler somewhere in California about a year ago (San Diego Usenix, perhaps?) and he also wasn't convinced that it could be any better for single or dual proc machines.

Of course, the tendency seems to be "show me" with new code - "show me the code" and "show me that it is better", where better means SIMPLE and FAST. And Linus has many times expressed a strong preference for SIMPLE over FAST. But simpler and faster would do well, or changes which are minimal with respect to additional complexity could, I believe, be sold. And, apples to apples is always easier. Once you switch from apples to pears, you can work towards comparing pears. Or something like that. ;-)

gerrit

> I agree with most everything people have written so far. As a relative
> newcomer to Linux kernel development, I was under the impression that
> there was some reluctance to change the existing scheduler
> implementation. Because of this, I thought it important to see how far we
> could take scheduler implementations which try to maintain existing
> scheduler behavior. This was my primary motivation for the multi-queue
> scheduler implementation. As already pointed out, this allows for an
> apples-to-apples comparison. In addition, if an implementation maintains
> existing behavior I suspect it is more likely to be accepted into the
> main code base. If we are able to get such a change accepted, this may
> make it easier to get changes which depart from existing behavior
> accepted in the future.
>
> Of course, this is all my opinion as a relative newcomer.
> --
> Mike Kravetz mkr...@se...
> IBM Linux Technology Center |
From: Hubertus F. <fr...@us...> - 2001-02-11 18:29:04
|
Yes, Gerrit was pointing out the opportunity for intelligent IRQ routing as well.

With respect to the scheduler, FAST and SIMPLE -- I guess it's all relative. As far as we are concerned, the MQ is still pretty simple and well documented. Now one big concern, of course, is lock contention. It seems like if my assumptions are that (a) there aren't a lot of threads on the run queue and (b) schedule/reschedule_idle aren't called all that frequently, then the current scheduler is more than sufficient, maybe even better. The problem is that a bit more complexity might reflect itself in the overall lock hold time. The sched_yield test is not a good test for this general case, as it doesn't measure average lock hold time, though one might think that. If there is a lack of lock contention, then I should go for the solution with the least lock hold time. We discussed this on this board, and the opinion seems to be that for scalability "it's the lock contention, my friend".

I talked to some HP folks during LinuxWorld. They are running a webserver, and they too identified a large number of running threads to be a problem for their applications.

Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair) email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Andi Kleen <ak...@su...>@lists.sourceforge.net on 02/10/2001 06:14:59 AM Sent by: lse...@li...
To: Hubertus Franke/Watson/IBM@IBMUS
cc: Mike Kravetz <mkr...@se...>, lse...@li...
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler

On Fri, Feb 09, 2001 at 03:22:50PM -0500, Hubertus Franke wrote:
>
> Mike, it goes to the heart of the question whether we MUST or NEED to stick
> to the current scheduler semantics.

When it complicates the code for no good reason, it's probably better not to stick to them slavishly. Linux scheduling behaviour has changed in the past too, especially on SMP, e.g. from 2.2 to 2.4; just UP has been relatively stable, so it's a moving target.

> I have taken the MQ scheduler and subdivided it into cpu pools. I have
> posted regarding this already under our latest status report for scheduling:
> http://lse.sourceforge.net/scheduling/results012501/status.html#Load%20Balancing
>
> Running the chatroom with 30/300 gives the following results. (I will post
> these on Monday on our lse.sourceforge.net/scheduling site for general
> consumption.)

I think one problem is that it has not been generally accepted that chatroom (so many threads on the runqueue) is a good benchmark. I think e.g. wakeup optimizations (where Linux is relatively poor ATM, and latency is important) + running a reasonable number of running threads on each CPU that get woken up by someone else would be better. Unfortunately I cannot offer a good benchmark for this.

> I think the usage of cpus_allowed can be tied into this. cpus_allowed
> is only a simple mechanism and not a policy. I think we need

Just call it a "hack for Tux", to work around the reordering problem in the network stack. I don't think anybody else is using it yet. Real dynamic irq affinity redirection would be much better anyway.

-Andi |
From: Hubertus F. <fr...@us...> - 2001-02-12 21:20:45
|
Mike, doing the runqueue part is of course trivial. The question I have is: how did they keep track of the adjustments that needed to be made to non-runnable tasks?

(a) did they estimate/track the number of recalculations since the jiffies kept at del_from_runqueue time?
(b) or did they use (jiffies - (p)->last_jiffies) as a point to give some boost?

Any recollection? If this patch doesn't show up again, we could implement either version reasonably quickly. If we introduce the "runqueue recalculate only", we might have some permanent priority inversion problems on our hands.

Hubertus Franke Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair) email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Mike Kravetz <mkr...@se...>@lists.sourceforge.net on 02/12/2001 12:58:22 PM Sent by: lse...@li...
To: Rik van Riel <ri...@co...>
cc: John Hawkes <ha...@en...>, Jun Nakajima <ju...@sc...>, Hubertus Franke/Watson/IBM@IBMUS, lse...@li...
Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler

On Sun, Feb 11, 2001 at 04:52:16PM -0300, Rik van Riel wrote:

> Also, recalculating the priority of *every* task in the system
> at recalculation time has the possibility of blowing away a too
> large portion of the cache if you're running with huge amounts
> of processes (and some nice ping-pong effects if you have multiple
> CPUs recalculating at once, as can happen when you have a bunch of
> nice +19 processes).

I seem to remember that someone from SGI had a patch which partially addressed this situation. Instead of running the entire task list at recalculate time, it only 'recalculated' counter values for tasks on the runqueue. It would keep track of what adjustments should have been made to non-runnable tasks, and only make these adjustments when the tasks were added to the runqueue.

Seemed like a good idea to me. Anyone know why this did not make its way into the scheduler?

As far as multiple CPUs doing recalculates at the same time, I also had big concerns about this. As a result, I added some code to track the number of times this situation occurred. My results showed that this did not happen frequently. However, this was only one specific workload and I don't really remember what it was. It should be fairly easy to address this with another lock/locking level.

--
Mike Kravetz mkr...@se...
IBM Linux Technology Center |
From: george a. <ge...@mv...> - 2001-02-13 01:51:13
|
Hubertus Franke wrote:
>
> Mike, doing the runqueue part is of course trivial.
> Question I have is how did they keep track of adjustments that needed to
> be made to non-runnable tasks?
>
> (a) did they estimate/track the number of recalculates since the jiffies
> kept at del_from_runqueue time?
> (b) or did they use (jiffies - (p)->last_jiffies) as a point to give some
> boost?
>
> Any recollection? If this patch doesn't show up again, we could implement
> either version reasonably quickly.
> If we introduce the "runqueue recalculate only", we might have some
> permanent priority inversion problems on our hands.

The scheduler I did (posted at: www.mvista.com) kept a global count of recalcs and a local count in each task struct. It updated the task's counter when the task was put on the run list, using:

    if ((diff = runq.recalc - p->counter_recalc) != 0) {
            p->counter_recalc = runq.recalc;
            c = NICE_TO_TICKS(p->nice) << 1;
            p->counter = diff > 8 ? c - 1 :             /* max priority */
                                    c + ((p->counter - c) >> diff);
    }
    runq.ticks += p->counter;

Note that runq.ticks is a running count of the total of the counters of all the tasks on the runqueue. This allows you to tell if the count is zero without walking the list; then you can do the recalc and the search for the next task in one pass through the list.

> Hubertus Franke
> Enterprise Linux Group (Mgr), Linux Technology Center (Member
> Scalability), OS-PIC (Chair)
> email: fr...@us...
> (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
>
> Mike Kravetz <mkr...@se...>@lists.sourceforge.net on 02/12/2001
> 12:58:22 PM
>
> Sent by: lse...@li...
>
> To: Rik van Riel <ri...@co...>
> cc: John Hawkes <ha...@en...>, Jun Nakajima <ju...@sc...>,
> Hubertus Franke/Watson/IBM@IBMUS, lse...@li...
> Subject: Re: [Lse-tech] cpus_allowed in multi-queue scheduler
>
> On Sun, Feb 11, 2001 at 04:52:16PM -0300, Rik van Riel wrote:
>
> > Also, recalculating the priority of *every* task in the system
> > at recalculation time has the possibility of blowing away a too
> > large portion of the cache if you're running with huge amounts
> > of processes (and some nice ping-pong effects if you have multiple
> > CPUs recalculating at once, as can happen when you have a bunch of
> > nice +19 processes).
>
> I seem to remember that someone from SGI had a patch which partially
> addressed this situation. Instead of running the entire task list
> at recalculate time, it only 'recalculated' counter values for tasks
> on the runqueue. It would keep track of what adjustments should
> have been made to non-runnable tasks, and only make these adjustments
> when the tasks were added to the runqueue.
>
> Seemed like a good idea to me. Anyone know why this did not make
> its way into the scheduler?
>
> As far as multiple CPUs doing recalculates at the same time, I also
> had big concerns about this. As a result, I added some code to track
> the number of times this situation occurred. My results showed that
> this did not happen frequently. However, this was only one specific
> workload and I don't really remember what it was. It should be
> fairly easy to address this with another lock/locking level.
>
> --
> Mike Kravetz mkr...@se...
> IBM Linux Technology Center |
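The lazy-recalculation snippet george posted can be restated as a self-contained user-space sketch. NICE_TO_TICKS is replaced here by a fixed QUANTUM, and the struct names are invented for illustration; this shows the idea of catching a returning task's counter up with missed recalcs, not the actual mvista patch.

```c
#include <assert.h>

#define QUANTUM 6                        /* stand-in for NICE_TO_TICKS() */

struct task { int counter; unsigned long counter_recalc; };
struct runq { unsigned long recalc; int ticks; };

/* Catch the task's counter up with the recalculations it missed while it
 * was off the runqueue, then account it into the runqueue tick total. */
static void add_to_runqueue(struct runq *rq, struct task *p)
{
    unsigned long diff = rq->recalc - p->counter_recalc;

    if (diff != 0) {
        int c = QUANTUM << 1;
        p->counter_recalc = rq->recalc;
        p->counter = diff > 8 ? c - 1                      /* max priority */
                              : c + ((p->counter - c) >> diff);
    }
    rq->ticks += p->counter;             /* running total for the runqueue */
}
```

A task that slept through more than 8 recalc rounds jumps straight to the maximum (2*QUANTUM - 1); one that missed none keeps its counter unchanged, so the hot path stays cheap.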
From: John H. <ha...@en...> - 2001-02-09 20:44:57
|
From: "Hubertus Franke" <fr...@us...> > Have you managed to get it running on a 32-way SGI machine. Yes, and I also made some mips64 kernel fixes to get the IBM "chat" benchmark to execute. I'm not seeing very repeatable results, however, for either the regular scheduler or the MQ scheduler (using the MQ patch from a couple of weeks ago). The MQ scheduler definitely eliminates the contention on the runqueue_lock, but I'm not seeing dramatic improvments on "chat" benchmark results. I may be encountering other lock contention issues, especially with this kind of tcp-intensive workload. I'm planning to run some other benchmarks on this system, but the trick is to choose a workload that exposes the improvements to the scheduler (i.e., flooding the system with long runqueues and lots of context switches) without having that workload be so singleminded that it stumbles into yet another single-threaded algorithm. Getting the global runqueue_lock out of the way (using the MQ scheduler) exposed another hot global lock: xtime_lock. We do a write_lock(&xtime_lock) on every CPU at every HZ timer interrupt, and these interrupts tend to be concurrent (or at least they try to be concurrent). For a leisurely HZ==100 this isn't spectacularly awful, but a higher HZ would get progressively worse. John Hawkes ha...@en... |
From: Andrew M. <an...@uo...> - 2001-02-12 04:51:26
|
John Hawkes wrote:
>
> Getting the global runqueue_lock out of the way (using the MQ scheduler)
> exposed another hot global lock: xtime_lock. We do a
> write_lock(&xtime_lock) on every CPU at every HZ timer interrupt, and
> these interrupts tend to be concurrent (or at least they try to be
> concurrent). For a leisurely HZ==100 this isn't spectacularly awful,
> but a higher HZ would get progressively worse.

I'm pretty amazed that xtime_lock showed up at such low frequencies.

You could try out Ingo's per-CPU timers patch. In fact, you _should_ include this work as part of any highly scalable kernel. It's working code.

http://people.redhat.com/mingo/scalable-timers/

I'm also surprised that you're observing coincidence of timer interrupts across CPUs. If so, then setup_APIC_timer() isn't doing its job.... |
From: Tim W. <ti...@sp...> - 2001-02-12 16:43:06
|
On Mon, Feb 12, 2001 at 04:51:39AM +0000, Andrew Morton wrote:
> John Hawkes wrote:
> >
> > Getting the global runqueue_lock out of the way (using the MQ scheduler)
> > exposed another hot global lock: xtime_lock. We do a
> > write_lock(&xtime_lock) on every CPU at every HZ timer interrupt, and
> > these interrupts tend to be concurrent (or at least they try to be
> > concurrent). For a leisurely HZ==100 this isn't spectacularly awful,
> > but a higher HZ would get progressively worse.
>
> I'm pretty amazed that xtime_lock showed up at such low
> frequencies.

I'm not :-) It was an issue under DYNIX/ptx for the equivalent lock (gate) as well. This is a 32-CPU machine (at least that's what I believe John said). We removed most uses of the lock by a rather cute trick. Here's the code for "getthetime()":

    getthetime(timeval_t *tv)
    {
            register ulong_t sampled_time_edition;

            do {
                    /*
                     * NOTE: We are using RC_MEMSYNC() to prevent aggressive
                     * out-of-order superscalar prefetching from getting
                     * systimedata.std_edition values inconsistent with the
                     * current time of day stored in: systimedata.std_time.
                     * See section 7.1.2.2 of the PentiumPro Operating System
                     * Writer's Guide for more information.
                     */
                    while ((sampled_time_edition = systimedata.std_edition) & 0x1) {
                            /* loop while time is being changed */
                            RC_MEMSYNC();   /* do not speculate mem reads */
                    }
                    RC_MEMSYNC();           /* prevent mem read speculation */
                    *tv = systimedata.std_time;
                    RC_MEMSYNC();           /* prevent mem read speculation */
            } while (sampled_time_edition != systimedata.std_edition);
    }

Basically, we keep a time-data serial value. It's atomically incremented before changing the time, and after. If it's odd, the data is unstable; if it's even, it's OK. The inner part is still racy, but that's handled at the end of the while loop, where we check that, having read the time data, the "edition" is still what it was before we read it. It looks like something like this might work for Linux.

> You could try out Ingo's per-CPU timers patch. In fact, you
> _should_ include this work as part of any highly-scalable
> kernel. It's working code.
>
> http://people.redhat.com/mingo/scalable-timers/

Yes, this is rather interesting. I definitely want to play with this. Before looking at the code, is there any kind of write-up as to what Ingo did here?

> I'm also surprised that you're observing coincidence of
> timer interrupts across CPUs. If so, then setup_APIC_timer()
> isn't doing its job....

SGI machine - MIPS chips - no APIC :-)

Tim
--
Tim Wright - ti...@sp... or ti...@ar... or tw...@us...
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI |
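The edition-counter protocol in Tim's getthetime() can be restated as a small self-contained sketch (this pattern is essentially what Linux later standardized as a seqlock). The struct and function names below are invented for illustration, and the memory barriers a real SMP implementation needs are reduced to comments, since a single-threaded sketch only demonstrates the protocol itself.

```c
#include <assert.h>

struct timedata {
    unsigned long edition;   /* odd while the writer is mid-update */
    long tv_sec, tv_usec;
};

static void write_time(struct timedata *td, long sec, long usec)
{
    td->edition++;           /* now odd: readers will spin/retry   */
    /* wmb() would go here */
    td->tv_sec = sec;
    td->tv_usec = usec;
    /* wmb() would go here */
    td->edition++;           /* even again: snapshot is stable     */
}

static void read_time(const struct timedata *td, long *sec, long *usec)
{
    unsigned long e;
    do {
        while ((e = td->edition) & 1)
            ;                /* writer in progress: wait           */
        /* rmb() would go here */
        *sec  = td->tv_sec;
        *usec = td->tv_usec;
        /* rmb() would go here */
    } while (e != td->edition);  /* retry if a write slipped in    */
}
```

Readers never write shared state, so the hot gettimeofday() path stays contention-free; only the (single) writer pays for the two increments.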
From: Andi K. <ak...@su...> - 2001-02-12 17:34:38
|
On Mon, Feb 12, 2001 at 08:43:30AM -0800, Tim Wright wrote:
> I'm not :-)
> It was an issue under DYNIX/ptx for the equivalent lock (gate) as well.
> This is a 32-CPU machine (at least that's what I believe John said).
> We removed most uses of the lock by a rather cute trick. Here's the code
> for "getthetime()":

[...]

I guess getthetime() would be the equivalent to jiffies in Linux -- but jiffies is read in a lot of places, some of them critical. I guess replacing it with a function call would optimize the wrong thing -- penalize the frequent reader for a lockless writer.

At least for the global jiffies counter it probably makes more sense to just make sure that the increment thread only runs on one CPU at a time, e.g. by locking it to one.

Per-CPU timers can be done without global state and locks, but they don't need to write jiffies. The jiffies update can be done atomically. For real-time gettimeofday() Linux uses rdtsc anyway, and just needs a single correction variable that can also be updated atomically.

-Andi |
From: Tim W. <ti...@sp...> - 2001-02-12 21:53:40
|
On Mon, Feb 12, 2001 at 06:27:46PM +0100, Andi Kleen wrote:
> On Mon, Feb 12, 2001 at 08:43:30AM -0800, Tim Wright wrote:
> > I'm not :-)
> > It was an issue under DYNIX/ptx for the equivalent lock (gate) as well.
> > This is a 32-CPU machine (at least that's what I believe John said).
> > We removed most uses of the lock by a rather cute trick. Here's the code
> > for "getthetime()":
>
> [...]
>
> I guess getthetime() would be the equivalent to jiffies in Linux -- but
> jiffies is read in a lot of places, some of them critical. I guess
> replacing it with a function call would optimize the wrong thing --
> penalize the frequent reader for a lockless writer.

No, it's the equivalent of Linux's do_gettimeofday(), which calls read_lock_irqsave(&xtime_lock, flags), just like ptx used to do. This really hurts on a large CPU-count machine. The ptx equivalent of jiffies would be todtick. Of course there's a relationship :-)

> At least for the global jiffies counter it makes probably more sense to
> just make sure that the increment thread only runs on one CPU at a time,
> e.g. by locking it to one.

It certainly should only happen on one CPU. I'd disagree that it should be locked to one, because that CPU may be busy. If we find that using the APIC TPR is worthwhile, it would probably be better to have the HZ timer IRQ drive a "hard" clock() routine (which increments jiffies and happens only on one CPU) and have the "hard" clock routine fire off "soft" clock routines on the other CPUs via IPI if necessary. What does this look like at the moment?

> Per CPU timers can be done without global state and locks, but they
> don't need to write jiffies. Jiffies update can be done atomically.

Yes.

> For real time gettimeofday() Linux uses rdtsc anyways, and just needs
> a single correction variable that can also be updated atomically.

But it still grabs the rw-lock. If it gets contended enough, this will hurt. Of course, if you have processes calling gettimeofday() a lot (ISTR Oracle does this), you might be better off providing a fast timeofday implementation by mapping in a "magic" page with the TOD values available (read-only).

Tim
--
Tim Wright - ti...@sp... or ti...@ar... or tw...@us...
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI |