From: Manfred S. <ma...@co...> - 2001-12-10 19:12:54
Luck, Tony wrote:
>I'd better chime in on this thread to let you know where my
>efforts have led so far. I did post a note here a couple of
>weeks ago (titled: "Thinking about a node aware kmalloc")
>which you might want to dig out of the archives.
>
>I think that I have most of the issues understood, and I've
>started tinkering with the code (where I expect I'll run into
>a few areas where it will become clear that I don't have all
>the bases covered :-)
>
>My plan is to make kmem_cache_t look like:
>
>	struct kmem_cache_s {
>		char name[CACHE_NAMELEN];
>		struct list_head next;
>		struct kmem_node *nodes[PLAT_MAX_COMPACT_NODES];
>	};
>
>and to move all the other fields into the new "kmem_node"
>structure that will be allocated on node-local memory.
>
I agree that everything should be pointers and node-local memory, except with a single node, where the structure should be integrated directly. But I think the pointers to the per-cpu structures should be part of kmem_cache_s: every kmalloc and kfree must look up the per-cpu structure, and embedding the pointers directly would avoid one level of indirection. I thought about:

<<<
struct kmem_cache_s {
	char name[CACHE_NAMELEN];
	struct list_head next;
#if PLAT_MAX_COMPACT_NODES > 1
	struct kmem_nodedata *nodes[PLAT_MAX_COMPACT_NODES];
#else
	struct kmem_nodedata nd;
#endif
#ifdef CONFIG_SMP
	struct kmem_cpudata *cpus[NR_CPUS];
#endif
};
<<<

The array for foreign free calls is just an idea: if there are lots of foreign free calls, then a per-cpu array with a single global spinlock for all nodes, instead of a per-node spinlock, might be faster. But we need statistics to decide that.

--
	Manfred
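The indirection argument can be sketched in userspace C. The structure layouts, field names, and cpu-to-node mapping below are illustrative stand-ins, not the real slab internals; the point is only that the per-node placement costs two dependent pointer loads on the kmalloc/kfree fast path where the embedded per-cache array costs one:

```c
#include <assert.h>

#define MAX_NODES     4
#define NR_CPUS       8
#define CPUS_PER_NODE (NR_CPUS / MAX_NODES)

/* Illustrative stand-ins for the structures under discussion. */
struct kmem_cpudata  { int avail; };
struct kmem_nodedata {
	int free_objects;
	struct kmem_cpudata *cpus[CPUS_PER_NODE]; /* per-node placement */
};
struct kmem_cache_s {
	struct kmem_nodedata *nodes[MAX_NODES];
	struct kmem_cpudata  *cpus[NR_CPUS];      /* embedded per-cache array */
};

static struct kmem_cpudata  cpudata[NR_CPUS];
static struct kmem_nodedata nodedata[MAX_NODES];

/* Fast path with per-node placement: two dependent pointer loads. */
static struct kmem_cpudata *lookup_via_node(struct kmem_cache_s *c, int cpu)
{
	return c->nodes[cpu / CPUS_PER_NODE]->cpus[cpu % CPUS_PER_NODE];
}

/* Fast path with the embedded array: one pointer load. */
static struct kmem_cpudata *lookup_direct(struct kmem_cache_s *c, int cpu)
{
	return c->cpus[cpu];
}

/* Wire both layouts to the same per-cpu data and check they agree. */
static int layouts_agree(void)
{
	struct kmem_cache_s c;
	for (int n = 0; n < MAX_NODES; n++) {
		c.nodes[n] = &nodedata[n];
		for (int i = 0; i < CPUS_PER_NODE; i++)
			nodedata[n].cpus[i] = &cpudata[n * CPUS_PER_NODE + i];
	}
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		c.cpus[cpu] = &cpudata[cpu];
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (lookup_direct(&c, cpu) != lookup_via_node(&c, cpu))
			return 0;
	return 1;
}
```

Both lookups resolve to the same per-cpu structure; the embedded array simply removes the intermediate node dereference from the hot path.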
From: Luck, T. <ton...@in...> - 2001-12-10 18:57:20
Oh, one other gaping hole in my current plans for this code is the issue of what to do when the desired node is out of memory. For some objects that are going to exist for a very long time, we might want to wait until some node-local memory becomes available (this presupposes that we have a mechanism to encourage that to occur, rather than just hoping that it will :-) On the other hand, most objects seem to be quite short lived, so it would seem sensible to follow some alternate-node search strategy (e.g. using the cpumemset lists to define which alternate node to use). However, this would produce a horrific knee in the performance curve when a node runs low on memory (moving from a (mostly) lockless allocation referencing only local node memory to one that requires acquisition of remote lock(s)), and it would probably also increase the number of foreign returns.

-Tony Luck

-----Original Message-----
From: Luck, Tony [mailto:ton...@in...]
Sent: Monday, December 10, 2001 10:00 AM
To: 'Manfred Spraul'; Jack Steiner
Cc: lse...@li...
Subject: RE: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

I'd better chime in on this thread to let you know where my efforts have led so far. I did post a note here a couple of weeks ago (titled: "Thinking about a node aware kmalloc") which you might want to dig out of the archives.

I think that I have most of the issues understood, and I've started tinkering with the code (where I expect I'll run into a few areas where it will become clear that I don't have all the bases covered :-)

My plan is to make kmem_cache_t look like:

	struct kmem_cache_s {
		char name[CACHE_NAMELEN];
		struct list_head next;
		struct kmem_node *nodes[PLAT_MAX_COMPACT_NODES];
	};

and to move all the other fields into the new "kmem_node" structure that will be allocated on node-local memory. The kmem_node structure will also have a pointer back to the kmem_cache_t, and has a reduced number of cpucache_t pointers (as it only has to cover the cpus on a node). I'm also going to have a "foreign returns" list for each node (see below).

My major concern with this is the frequency with which memory blocks are freed by a cpu that does not belong to the same node as the memory being freed (a "foreign return"). My implementation will require kfree() to grab a lock on the remote node to drop the item onto a list from which that node can reclaim it.

I read Manfred's note posted yesterday where he suggests having a per-cpu list for foreign returns, which would avoid the need for a lock when the item is freed ... but it seems to me that this only defers the problem. When the per-cpu foreign list fills up, that cpu will have to scan the whole list of foreign objects and give each of them back to the correct node. If there are a large number of nodes, this process may not benefit from much batching (e.g. with 32 nodes and a 120-element list we only have an average of 4 elements to give back to each node).

In my previous e-mail I described a crude benchmark that I performed to try to get some data. When we have a prototype implementation we will need to gather more data on larger systems to figure out the best approach.

-Tony Luck

-----Original Message-----
From: Manfred Spraul [mailto:ma...@co...]
Sent: Monday, December 10, 2001 9:00 AM
To: Jack Steiner
Cc: lin...@vg...; lse...@li...
Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

Jack Steiner wrote:
>
>BTW, I think Tony Luck (at Intel) is currently changing the slab allocator
>to be numa-aware. Are you coordinating your work with his???
>

Thanks, I wasn't aware that he's working on it. I haven't started coding; I'm still collecting what's needed:

* Force certain alignments, e.g. ARM needs 1024-byte-aligned objects for the page tables.
* NUMA support.
* Add a "priority" to kmem_cache_shrink, to avoid every dcache/icache shrink causing an IPI to all cpus.
* If possible: replace the division in kmem_cache_free_one with a multiplication by the reciprocal. (I have a patch, but it's too ugly for inclusion.) Important for uniprocessor versions.
* Add reservation support - e.g. there must be a minimum number of bio structures available, otherwise the kernel could oom-deadlock. They must be available, not hidden in the per-cpu caches of the other cpus.

--
	Manfred

_______________________________________________
Lse-tech mailing list
Lse...@li...
https://lists.sourceforge.net/lists/listinfo/lse-tech
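The reciprocal trick Manfred mentions for kmem_cache_free_one can be illustrated with a small userspace sketch. This is the generic fixed-point method (precompute ceil(2^32/d), then divide with a multiply and a shift), not his actual patch. The result matches true division exactly as long as x*d stays well below 2^32, which comfortably covers the slab case of an offset within a page divided by an object size:

```c
#include <assert.h>
#include <stdint.h>

/* Precompute ceil(2^32 / d) once, e.g. at cache-creation time. */
static uint32_t reciprocal_value(uint32_t d)
{
	return (uint32_t)((((uint64_t)1 << 32) + d - 1) / d);
}

/* x / d computed as one multiply and one shift, no divide instruction. */
static uint32_t reciprocal_divide(uint32_t x, uint32_t r)
{
	return (uint32_t)(((uint64_t)x * r) >> 32);
}

/* Exhaustively check the slab-relevant range: every offset within one
 * 4096-byte page, divided by a few plausible object sizes. */
static int reciprocal_exact_for_slab_range(void)
{
	static const uint32_t sizes[] = { 8, 32, 96, 100, 1024, 4000 };
	for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		uint32_t d = sizes[i], r = reciprocal_value(d);
		for (uint32_t x = 0; x < 4096; x++)
			if (reciprocal_divide(x, r) != x / d)
				return 0;
	}
	return 1;
}
```

This matters most on cpus where integer division is slow; the later mainline kernel adopted essentially this scheme as reciprocal_divide() in the slab code.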
From: Luck, T. <ton...@in...> - 2001-12-10 19:46:10
Manfred Spraul wrote:
> But I think the pointers to the per-cpu structures should be part of
> kmem_cache_s: every kmalloc and kfree must look up the per-cpu
> structure, and embedding them directly would avoid one level of indirection.

Yes, you are right (and that removes my need to invent a "logical cpu number on node" that I was going to need ... thanks!)

> The array for foreign free calls is just an idea: if there are lots of
> foreign free calls, then a per-cpu array with a single global spinlock
> for all nodes, instead of a per-node spinlock, might be faster.
> But we need statistics to decide that.

It should be relatively easy to switch algorithms once we have the basic infrastructure up and running ... then we can get some real data and see which way to go.

-Tony Luck
From: Paul M. <Pau...@us...> - 2001-12-11 20:06:15
One way to keep the "horrific knee" down to a dull roar is to make sure that when you do go get memory from a remote node, you get a bunch of it. The extra memory can be cached on the local CPU (or node), greatly reducing the overhead of subsequent requests. This same caching can help reduce the overhead of freeing remote memory (e.g., for memory allocated by a CPU on one node and freed by a CPU on some other node).

All of this must be carefully balanced, as there is a tradeoff between efficient handling of out-of-memory situations and the cost/complexity of the normal case.

						Thanx, Paul

> Oh, one other gaping hole in my current plans for this
> code is the issue of what to do when the desired node is
> out of memory. For some objects that are going to exist
> for a very long time, we might want to wait until some
> node-local memory becomes available (this presupposes that
> we have a mechanism to encourage that to occur, rather than
> just hoping that it will :-) On the other hand, most objects
> seem to be quite short lived, so it would seem sensible to
> follow some alternate node search strategy (e.g. using the
> cpumemset lists to define which alternate node to use).
> However, this would provide a horrific knee in the performance
> curve when a node runs low on memory (moving from a (mostly)
> lockless allocation referencing only local node memory to one
> that requires acquisition of remote lock(s)) [Plus would
> probably increase the number of foreign returns].
>
> -Tony Luck
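Paul's batching suggestion can be made concrete with a toy model. Everything here is hypothetical scaffolding (the BATCH value, the counter standing in for a real spinlock): a local per-cpu cache refills a whole batch of objects per remote-lock round, so the lock cost is amortized over BATCH allocations instead of being paid on every one:

```c
#include <assert.h>

#define BATCH 16  /* objects transferred per remote-lock acquisition */

struct remote_node {
	int lock_acquisitions;  /* stand-in for spinlock contention cost */
	int free_objects;
};

struct cpu_cache {
	int avail;              /* objects cached locally; storage elided */
};

/* Refill grabs the remote lock once and takes up to BATCH objects. */
static void refill_from_remote(struct cpu_cache *cc, struct remote_node *rn)
{
	rn->lock_acquisitions++;  /* spin_lock(&rn->lock) */
	int take = rn->free_objects < BATCH ? rn->free_objects : BATCH;
	rn->free_objects -= take;
	cc->avail += take;        /* spin_unlock(&rn->lock) */
}

static void *alloc_one(struct cpu_cache *cc, struct remote_node *rn)
{
	if (cc->avail == 0)
		refill_from_remote(cc, rn);
	if (cc->avail == 0)
		return 0;             /* remote node is empty too */
	cc->avail--;
	return (void *)1;         /* placeholder object */
}

/* Count remote lock rounds needed to satisfy nallocs allocations. */
static int remote_locks_for(int nallocs)
{
	struct remote_node rn = { 0, 1024 };
	struct cpu_cache cc = { 0 };
	for (int i = 0; i < nallocs; i++)
		alloc_one(&cc, &rn);
	return rn.lock_acquisitions;
}
```

With BATCH = 16, sixty-four allocations touch the remote lock only four times; the tradeoff Paul notes is that the batched objects sit idle in the local cache, which costs memory and complicates the out-of-memory paths.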
From: Paul M. <Pau...@us...> - 2001-12-11 20:07:19
> I read Manfred's note posted yesterday where he suggests having
> a per-cpu list for foreign returns, which would avoid a need for
> a lock when the item is freed ... but it seems to me that this
> only defers the problem. When the per-cpu foreign list is filled,
> that cpu will have to scan the whole list of foreign objects and
> give each of them back to the correct node. If there are a large
> number of nodes, this process may not benefit from much batching
> (e.g. with 32 nodes and a 120 element list we only have an average
> of 4 elements to be given to each node).

One way to fix this is to have a separate foreign list for each node on each CPU. This means n*(m-1) lists on a system with n CPUs and m memory nodes. Each list must be allowed to grow to the point where the overhead of grabbing the foreign lock is balanced against the time required to do the allocations/frees locally. In past efforts with other OSes, each individual list needed to hold on the order of a few tens of foreign elements.

This can be quite a bit of memory on large systems, but then again, large systems tend to have quite a bit of memory. For whatever it is worth, one's mileage can vary in different OSes on different hardware architectures...

						Thanx, Paul
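The per-(cpu, node) list idea can likewise be sketched as a toy model (all names and the FLUSH_AT threshold below are illustrative). Because each list holds objects belonging to exactly one home node, a flush hands its entire batch to that single node under one lock acquisition, avoiding the 32-way split of a single mixed foreign list that Tony worries about:

```c
#include <assert.h>
#include <string.h>

#define NR_CPUS   4
#define MAX_NODES 4
#define FLUSH_AT  32  /* "a few tens of foreign elements" per list */

struct node_slab {
	int lock_acquisitions;  /* stand-in for the per-node spinlock */
	int free_objects;
};

/* One list per (cpu, home-node) pair: n*(m-1) lists in use, since
 * objects belonging to the cpu's own node never land here. */
struct foreign_lists {
	int count[NR_CPUS][MAX_NODES];  /* object storage elided */
};

/* Return one full batch to its home node under a single lock round. */
static void flush_list(struct foreign_lists *fl, struct node_slab *nodes,
		       int cpu, int node)
{
	nodes[node].lock_acquisitions++;  /* spin_lock(&node->lock) */
	nodes[node].free_objects += fl->count[cpu][node];
	fl->count[cpu][node] = 0;         /* spin_unlock(&node->lock) */
}

/* A foreign free: buffer locally, lock-free; take the remote lock
 * only when this (cpu, node) list reaches FLUSH_AT. */
static void free_foreign(struct foreign_lists *fl, struct node_slab *nodes,
			 int cpu, int home_node)
{
	if (++fl->count[cpu][home_node] == FLUSH_AT)
		flush_list(fl, nodes, cpu, home_node);
}

/* Count lock rounds for nfrees foreign frees from cpu 0 to node 2. */
static int locks_for(int nfrees)
{
	struct node_slab nodes[MAX_NODES];
	struct foreign_lists fl;
	memset(nodes, 0, sizeof(nodes));
	memset(&fl, 0, sizeof(fl));
	for (int i = 0; i < nfrees; i++)
		free_foreign(&fl, nodes, 0, 2);
	return nodes[2].lock_acquisitions;
}
```

The memory cost Paul mentions falls out directly: with tens of elements per list and n*(m-1) lists, a 32-node, 128-cpu system carries on the order of a hundred thousand buffered object slots.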