From: Hubertus F. <fr...@us...> - 2001-03-06 15:22:53
Good catch. Another way of doing it could be (just thinking in my own terms here):

(a) The PROC_CHANGE_PENALTY increase is only valid for non-RT and non-SCHED_YIELD processes, hence the code should be conditionalized:

    if (c > 0)
        c += PROC_CHANGE_PENALTY;

(b) Later you then check for recalculate:

    if (!c) goto recalculate;

By definition, you are only going to have either SCHED_YIELD or non-RT processes in your local run queue, and SCHED_YIELD is going to be -1. So I think your approach saves an extra condition check, so that should be the preferred way of doing it.

Also, on reflection, the stuff that John Hawkes posted yesterday: I thought you had already done that last week (other than the 128-bit cache alignment)?

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Shailabh Nagar/Watson/IBM@IB...@li... on 03/06/2001 10:02:03 AM
Sent by: lse...@li...
To: lse...@li...
Subject: [Lse-tech] Recalculate trigger in MQ1

> Mike,
>
> The code for triggering a recalculate in schedule() in MQ1 reads:
>
>     if (!c) goto recalculate;
>
> But a few lines earlier, c += PROC_CHANGE_PENALTY is done to account for
> the use of local_goodness() instead of goodness() while assigning a weight
> to the variable c. If there are no remote runqueue candidates (all have
> expired counters or goodness values less than PROC_CHANGE_PENALTY), the c
> value assigned from the local runqueue will be used to trigger a
> recalculate. At that time, shouldn't we be compensating for the addition
> of PROC_CHANGE_PENALTY by doing something like:
>
>     if (c == PROC_CHANGE_PENALTY) goto recalculate;
>
> Shailabh Nagar
> Enterprise Linux Group, IBM TJ Watson Research Center, 914-945-2851

_______________________________________________
Lse-tech mailing list
Lse...@li...
http://lists.sourceforge.net/lists/listinfo/lse-tech
From: Hubertus F. <fr...@us...> - 2001-03-06 15:37:43
Shailabh, actually, on second thought, your proposed solution is not correct. You would give a SCHED_YIELD process a (PROC_CHANGE_PENALTY - 1) goodness value, which could result in using a SCHED_YIELD process before bringing over a process from a remote queue with goodness less than PROC_CHANGE_PENALTY - 1. That would be incorrect. So we should use what I have proposed, unless somebody has another idea.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
From: Mike K. <mkr...@se...> - 2001-03-06 16:50:00
Shailabh/Hubertus,

Thanks for the bug find and proposed solution. I'll update the 2.4.1 MQ patch and put a new version out later today.

--
Mike Kravetz
mkr...@se...
IBM Linux Technology Center
From: Hubertus F. <fr...@us...> - 2001-03-06 17:38:29
Mike, have a look at the conversations that Shailabh and I had. Please confirm which one you are putting in. Just want to make sure we are all in sync.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
From: Mike K. <mkr...@se...> - 2001-03-06 18:08:24
On Tue, Mar 06, 2001 at 12:40:25PM -0500, Hubertus Franke wrote:
>
> Mike have a look at the conversations that Shailab and I had.
> Please confirm which one you are putting in. Just wanna make sure we
> are all in sync.
>
> Hubertus Franke

Hubertus,

I'll add your proposed fix. As you have already noted, simply checking for 'c == PROC_CHANGE_PENALTY' does impact the yield'ed tasks.

--
Mike Kravetz
mkr...@se...
IBM Linux Technology Center
From: Mike K. <mkr...@se...> - 2001-03-06 22:35:00
I have updated the multi-queue scheduler patch at:

    http://lse.sourceforge.net/scheduling/

--
Mike Kravetz
mkr...@se...
IBM Linux Technology Center
From: John H. <ha...@en...> - 2001-03-07 17:14:21
Another possible MQ change: struct rq_data isn't optimally packed for a 64-bit kernel, where longs and pointers are 8 bytes and ints are 4 bytes. This won't affect L2/L3/L4 accesses, because each runqueue_data element gets padded out to SMP_CACHE_BYTES (which for mips64 is now 128 bytes), but a loose packing for rq_data might be suboptimal for L1 accesses.

For example, in an i386 kernel the struct is a packed 20 bytes. In a mips64 kernel the struct is 48 bytes. One way to reorganize this struct for a tighter packing:

--- linux0223-mq/include/linux/sched.h       Wed Mar  7 09:02:23 2001
+++ linux0223-mq-prof/include/linux/sched.h  Wed Mar  7 08:51:47 2001
@@ -169,14 +169,14 @@
 typedef union runqueue_data {
     struct rq_data {
         spinlock_t runqueue_lock;        /* lock for this runqueue */
-        int nt_running;                  /* # of tasks on runqueue */
         struct list_head runqueue;       /* list of tasks on runqueue */
-        struct task_struct * max_na_ptr; /* pointer to task which  */
-                                         /* has max_na_goodness    */
         int max_na_goodness;             /* maximum non-affinity   */
                                          /* goodness value of      */
                                          /* 'schedulable' task     */
                                          /* on this runqueue       */
+        struct task_struct * max_na_ptr; /* pointer to task which  */
+                                         /* has max_na_goodness    */
+        int nt_running;                  /* # of tasks on runqueue */
     } rq_data;
     char __pad [SMP_CACHE_BYTES];
 } runqueue_data_t;

John Hawkes
ha...@en...
From: Christoph H. <hc...@ns...> - 2001-03-07 17:29:45
On Wed, Mar 07, 2001 at 09:15:22AM -0800, John Hawkes wrote:
> Another possible MQ change: struct rq_data isn't optimally packed for a
> 64-bit kernel where longs and pointers are 8 bytes and ints are 4 bytes.
> This won't affect L2/L3/L4 accesses because each runqueue_data element
> gets padded out to SMP_CACHE_BYTES (which for mips64 is now 128 bytes),
> but a loose packing for rq_data might be suboptimal for L1 accesses.
>
> For example, in an i386 kernel the struct is a packed 20 bytes. In a
> mips64 kernel the struct is 48 bytes. One way to reorganize this struct
> for a tighter packing:

In the long term we should try to get large cpu-private areas for all this per-cpu stuff instead of adding insane amounts of padding...

Christoph
--
Of course it doesn't work. We've performed a software upgrade.
From: Hubertus F. <fr...@us...> - 2001-03-07 13:27:52
Rik, this is still on the agenda. We will sync at the end of the week and come up with a "game plan" on what's next. Using the MQ as a base, there are so many paths we can wander down. Some we have in place (e.g. load balancing). The MontaVista patch is "trivial" to do, so it might be a good next step to try. We first wanted to make sure that the MQ is optimized; otherwise, all these spin-offs become meaningless and have to be redone again and again.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003

Rik van Riel <ri...@co...>@lists.sourceforge.net on 03/06/2001 09:16:54 PM
Sent by: lse...@li...
To: Hubertus Franke/Watson/IBM@IBMUS
cc: Shailabh Nagar/Watson/IBM@IBMUS, lse...@li...
Subject: Re: [Lse-tech] Recalculate trigger in MQ1

> On Tue, 6 Mar 2001, Hubertus Franke wrote:
> > Good catch.
> > Another way of doing it could be (just thinking in my own terms here).
>
> I wonder how much this would affect the recalculation measurements you
> mentioned when I was visiting you. Maybe the montavista patch may be
> worth it now after all?
>
> regards,
>
> Rik
> --
> Virtual memory is like a game you can't win;
> However, without VM there's truly nothing to lose...
>
> http://www.surriel.com/
> http://www.conectiva.com/ http://distro.conectiva.com.br/
From: Hubertus F. <fr...@us...> - 2001-03-07 17:30:45
Thanks, John. Remember the data I posted about three weeks ago on splitting the RQ data up into two different cachelines: (lock + queue + nt_running) in one, and the rest into the remainder. For the chatroom, that dropped the performance, albeit insignificantly. This would go hand in hand with your suggestion to pack tightly.

Mike, since you mentioned that you did a similar test regarding splitting cachelines, let's put John's suggestion in, unless somebody out there has some objections.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
From: John H. <ha...@en...> - 2001-03-07 17:32:42
My earlier patch was reversed. Here is the correct one:

diff --exclude-from=/build4/hawkes/Build/ignore.dirs -Naur linux0223-mq-prof/include/linux/sched.h linux0223-mq/include/linux/sched.h
--- linux0223-mq-prof/include/linux/sched.h  Wed Mar  7 08:51:47 2001
+++ linux0223-mq/include/linux/sched.h       Wed Mar  7 09:02:23 2001
@@ -169,14 +169,14 @@
 typedef union runqueue_data {
     struct rq_data {
         spinlock_t runqueue_lock;        /* lock for this runqueue */
+        int nt_running;                  /* # of tasks on runqueue */
         struct list_head runqueue;       /* list of tasks on runqueue */
+        struct task_struct * max_na_ptr; /* pointer to task which  */
+                                         /* has max_na_goodness    */
         int max_na_goodness;             /* maximum non-affinity   */
                                          /* goodness value of      */
                                          /* 'schedulable' task     */
                                          /* on this runqueue       */
-        struct task_struct * max_na_ptr; /* pointer to task which  */
-                                         /* has max_na_goodness    */
-        int nt_running;                  /* # of tasks on runqueue */
     } rq_data;
     char __pad [SMP_CACHE_BYTES];
 } runqueue_data_t;
From: Hubertus F. <fr...@us...> - 2001-03-07 17:44:29
Not necessarily. This is a question of false sharing. We need to put data together that is rarely shared with other CPUs, and not intermingle it with data that is shared.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
From: Christoph H. <hc...@ns...> - 2001-03-07 17:48:50
On Wed, Mar 07, 2001 at 12:46:51PM -0500, Hubertus Franke wrote:
>
> Not necessarily.
>
> This is a question of false sharing.
> We need to put data together that is rarely shared with other CPUs and
> don't intermangle it with data that is shared.

Correct. But IMHO it's better to achieve this with per-cpu arrays that contain all kinds of this per-cpu stuff, instead of having small arrays with excessive padding.

Christoph
--
Of course it doesn't work. We've performed a software upgrade.
From: John H. <ha...@en...> - 2001-03-07 18:03:13
From: "Christoph Hellwig" <hc...@ns...>
> Correct. But IMHO it's better to achieve this with per-cpu arrays that
> contain all kinds of this per-cpu stuff instead of having small arrays
> with excessive padding.

Yes. And no.

Yes, we need to avoid excessive padding, especially when padding is done using longer L2, L3, or L4 cacheline lengths. And we need to be careful about what gets packed with what.

Sometimes the tradeoffs aren't obvious. For example, "struct rq_data" is a per-cpu structure, but it isn't a *private* per-cpu structure. Each cpu will, on occasion, look at every other cpu's rq_data (e.g., nt_running). When I analyzed the mips64 32p NUMA behavior, I discovered that it didn't make much difference whether rq_data was padded or not. In fact, with some workloads I saw a performance *improvement* when rq_data was *not* padded, especially when it was not padded to 128 L2 cacheline bytes. Why? I concluded that when a cpu wanted to look at every other cpu's rq_data.nt_running, it suffered fewer L2 cache misses when it pulled in several rq_data elements per miss. That seemed to outweigh the downside of increased cacheblock ping-pongs when the array elements were sharing an L2 cacheblock. (I still think the right way to go is to pad rq_data, just out of general principles.)

John Hawkes
ha...@en...
From: Mike K. <mkr...@se...> - 2001-03-07 18:40:42
On Wed, Mar 07, 2001 at 10:04:16AM -0800, John Hawkes wrote:
>
> Sometimes the tradeoffs aren't obvious. For example, "struct rq_data"
> is a per-cpu structure, but it isn't a *private* per-cpu structure.
> Each cpu will, on occasion, look at every other cpu's rq_data (e.g.,
> nt_running). When I analyzed the mips64 32p NUMA behavior, I discovered
> that it didn't make much difference if rq_data was padded or not. In
> fact, with some workloads I saw a performance *improvement* when rq_data
> was *not* padded, especially when it was not padded to 128 L2 cacheline
> bytes. Why? I concluded that when a cpu wanted to look at every other
> cpu's rq_data.nt_running, it suffered fewer L2 cache misses when it
> pulled in several rq_data elements per miss. That seemed to outweigh
> the downside of increased cacheblock ping-pongs when the array elements
> were sharing an L2 cacheblock. (I still think the right way to go is to
> pad rq_data, just out of general principles.)
>
> John Hawkes
> ha...@en...

Interesting observation. I put each struct rq_data on a separate cache line because it was initially believed this data would be cpu-specific and rarely accessed by other cpus. However, one of the key design points of this scheduler implementation is that it must try to maintain the semantics of the existing scheduler (for now). In order for the multi-queue scheduler to make (what I call) global scheduling decisions, you will find that schedule() will (almost) always access every cpu-specific structure. Likewise, reschedule_idle() will (almost always) access each cpu-specific aligned_data structure. It therefore may be possible to get better performance (out of this scheduler implementation) by not putting commonly accessed data on separate cache lines. I'll take a look at data access patterns and experiment with data placement.

Of course, if you ever move away from the semantics of the current scheduler, one would want to 'partition' data according to the level where local scheduling decisions are made. This could be at the CPU level or node level (for NUMA).

--
Mike Kravetz
mkr...@se...
IBM Linux Technology Center
From: Hubertus F. <fr...@us...> - 2001-03-07 17:52:39
Oh, absolutely. We need to identify the classes of sharing (e.g. never, sometimes, often) and make arrays out of those, indexed by CPU id, rather than making every little variable or structure available through its own array.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
From: Andi K. <ak...@su...> - 2001-03-08 11:43:46
On Wed, Mar 07, 2001 at 12:56:06PM -0500, Hubertus Franke wrote:
>
> Oh, absolutely.
>
> We need to identify the classes of sharing. (e.g. never, sometimes, often)
> and make arrays out of those indexed by CPU id, rather than making every
> little variable or structure available through their own array.

There is an old patch from SGI floating around to add a per-processor data area to Linux. It puts all the per-CPU data into a big structure and stores a pointer to it in current. Among other things, it saved several KB of text space (because finding it this way is shorter than the indexing), not to speak of the saved cachelines because of dropped padding. My hope is that the PDA patch will be merged in 2.5. The current x86-64 port already supports a PDA natively.

-Andi
From: Andrew M. <an...@uo...> - 2001-03-08 12:06:14
Andi Kleen wrote:
>
> There is an old patch from SGI floating around to add a per processor data
> area to Linux. It puts all the per CPU data into a big structure and stores
> a pointer to it into current. Among other things it saved several KB of
> text space (because finding it this way is shorter than the indexing), not
> to speak of the saved cachelines because of dropped padding.
> My hope is that the PDA patch will be merged in 2.5. The current x86-64
> port already supports a PDA natively.

That patch was a bit of an ifdef eyesore. Would it be sane to implement PDA via a new linker section and per-CPU pages, all mapped at the same virtual address?
From: Andi K. <ak...@su...> - 2001-03-08 12:12:00
On Thu, Mar 08, 2001 at 11:08:04PM +1100, Andrew Morton wrote:
>
> That patch was a bit of an ifdef eyesore. Would it be sane
> to implement PDA via a new linker section and per-CPU
> pages, all mapped at the same virtual address?

On NUMA it makes sense, because text replication needs CPU-local kernel page tables anyway, so you can easily do it there. On SMP I'm not sure the complexity of having per-CPU kernel page tables is worth it just to remove a few #ifdefs (there are other places with such global structures, like struct sock or struct inode, and it isn't that big a problem).

On x86-64 I'm using a segment register.

-Andi
From: Andrew M. <an...@uo...> - 2001-03-08 12:19:40
Andi Kleen wrote:
>
> On NUMA it makes sense because text replication needs CPU local kernel page
> tables anyways, so you can easily do it there. On SMP I'm not sure if the
> complexity of having per CPU kernel page tables is worth it just to remove
> a few #ifdefs (there are other places with such global structures, like
> struct sock or struct inode, and it isn't that big a problem)

I suppose so. But remember that there's a speed and footprint advantage to the per-CPU pages: the data can be referred to by absolute address, rather than via a pointer indirection, or the smp_processor_id-plus-array-indexing thingy.

> On x86-64 I'm using a segment register.

What are you using it for? Per-CPU data structures, or something else?
From: Andi K. <ak...@su...> - 2001-03-08 13:11:17
On Thu, Mar 08, 2001 at 11:21:31PM +1100, Andrew Morton wrote:
> > On x86-64 I'm using a segment register.
>
> What are you using it for? Per-CPU data structures, or something else?

Per-CPU data structures. It is required for the SYSCALL entry point, but could be extended to more. Of course it is a global structure currently, so it would be more #ifdef country.

-Andi
From: Hubertus F. <fr...@us...> - 2001-03-08 14:30:22
I don't think it is that simple either. Every time we move a process from one cpu to another, we would have to update the page table as well in order to make this happen.

Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member Scalability), OS-PIC (Chair)
email: fr...@us... (w) 914-945-2003 (fax) 914-945-4425 TL: 862-2003
From: Andi K. <ak...@su...> - 2001-03-08 14:46:17
On Thu, Mar 08, 2001 at 09:31:49AM -0500, Hubertus Franke wrote:
>
> I don't think that is that simple either.
>
> Everytime we move a process from one cpu to another,
> we would have to update the pagetable as well in order
> to make this happen.

Yes, on x86 it can get rather ugly, because the first level would become CPU-dependent and would need to be split in the case of threads. On architectures with a software TLB fault handler it is possible to do it more easily, though. It just looks like it is hard to get good NUMA performance without text duplication, and I don't see a way to get text duplication without per-CPU mappings.

-Andi
From: Hubertus F. <fr...@us...> - 2001-03-08 14:43:46
Ahhh, the segment register trick. We did that in the K42 OS as well. That of course would be a much cheaper way of doing it. Are these kinds of special-purpose registers available across all platforms?

Hubertus Franke
From: Andi K. <ak...@su...> - 2001-03-08 14:53:18
On Thu, Mar 08, 2001 at 09:47:11AM -0500, Hubertus Franke wrote:
>
> Ahhh, the segment register trick. We did that in the
> K42-OS as well. That ofcourse would be a much cheaper
> way of doing it. Are these kind of special purpose
> registers available across all platforms ?

The most generic way to access a PDA is current->pdapointer, updated in schedule(). This can be hidden in a macro and improved per architecture. Accessing it this way isn't that costly, actually, especially because it can be done in pure C and therefore CSEd by the compiler.

If you mean using segment registers (a different CS) for doing text replication, then I'm not sure it is such a good idea. The CPUs tend to drop into all kinds of slow paths when the segment base is not zero. [On x86-64 it wouldn't work, btw, because segment bases are ignored there.]

Or did you have something else in mind?

-Andi