From: Dipankar S. <dip...@in...> - 2002-05-23 13:05:47
Attachments:
percpu_data.txt
|
If static percpu area is around, can dynamic percpu data allocator be far
behind ;-)

As a part of the scalable kernel primitives work for higher-end SMP and
NUMA architectures, we have been seeing an increasing need for per-CPU
data in various key areas. Rusty's percpu area work has added a way in
2.5 kernels to maintain static per-CPU data. Inspired by that work, I
have implemented a dynamic per-CPU data allocator. Currently it is useful
to us for -

1. Per-CPU data in dynamically allocated structures.
2. Per-CPU statistics and reference counters.
3. Per-CPU data in drivers/modules.
4. Scalable locking primitives like local-spin-only locks
   (or even big reader locks).

Included in this mail is a document that describes the allocator. I would
really appreciate it if people would comment on it. I am particularly
interested in the "eek" value of the interfaces, especially the bit about
keeping the type information in a dummy variable in a union.

The actual patch will follow soon, unless someone convinces me quickly
that there is a saner way to do this.

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
|
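For illustration, one way the "type information in a dummy variable in a
union" trick Dipankar mentions could work, so that accessor macros can
cast the opaque per-CPU handle back to the declared type. This is only a
sketch of the idea; the real interfaces are in the attached
percpu_data.txt and may differ, and __percpu_data() is a hypothetical
helper that maps (handle, cpu) to that CPU's copy.

	/* Declare a per-CPU variable of the given type. The "dummy"
	 * member is never dereferenced; it only records the element
	 * type so __typeof__ can recover it in the accessor below. */
	#define percpu_data_declare(type, var) \
		union { \
			void *handle;	/* opaque cookie from the allocator */ \
			type *dummy;	/* carries type information only */ \
		} var

	/* Return CPU "cpu"'s copy, cast to the declared type. */
	#define percpu_data(var, cpu) \
		((__typeof__((var).dummy))__percpu_data((var).handle, (cpu)))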
From: BALBIR S. <bal...@wi...> - 2002-05-24 04:35:22
|
Hello, Dipankar,

I would prefer to use the existing slab allocator for this. I am not
sure if I understand your requirements for the per-cpu allocator
correctly, please correct me if I do not.

What I would like to see:

1. Have per-cpu slabs instead of per-cpu cpucache_t. One should be able
   to tell for which caches we want per-cpu slabs. This way we can make
   even kmalloc per-cpu, since most kernel code would use and dispose of
   memory before migrating across CPUs. I think this would be useful,
   but again I have no data to back it up.

2. I hate the use of NR_CPUS. If I compile an SMP kernel on a two-CPU
   machine, I still end up with support for 32 CPUs. What I would like
   to see is that new kernel code treats equivalence classes of CPUs as
   belonging to the same CPU. For example, given

   void *blkaddrs[NR_CPUS];

   instead of just looking at blkaddrs[smp_processor_id()], if the slot
   for smp_processor_id() is full we should look through

   for (i = 0; i < NR_CPUS; i++) {
       look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus() (or whatever)]
       if successful break
   }

   On a two-CPU system, slots 1, 3, 5 ... belong to the same equivalence
   class, so we might as well utilize them. Even with a per-cpu pool,
   threads could use the slots in the per-cpu equivalence classes in
   parallel (I have a very rough idea about this).

Does any of this make sense?

Balbir
|
From: Dipankar S. <dip...@in...> - 2002-05-24 06:10:08
|
On Fri, May 24, 2002 at 10:07:59AM +0530, BALBIR SINGH wrote:
> Hello, Dipankar,
>
> I would prefer to use the existing slab allocator for this. I am not
> sure if I understand your requirements for the per-cpu allocator
> correctly, please correct me if I do not.
>
> What I would like to see:
>
> 1. Have per-cpu slabs instead of per-cpu cpucache_t. One should be able
>    to tell for which caches we want per-cpu slabs. This way we can make
>    even kmalloc per-cpu, since most kernel code would use and dispose of
>    memory before migrating across CPUs. I think this would be useful,
>    but again I have no data to back it up.

Allocating cpu-local memory is a different issue altogether. Eventually,
for NUMA support, we will have to do such allocations in a way that
supports choosing memory closest to a group of CPUs.

The per-cpu data allocator allocates one copy for *each* CPU. It uses the
slab allocator underneath. Eventually, when/if we have per-cpu/numa-node
slab allocation, the per-cpu data allocator can allocate every CPU's copy
from memory closest to it.

I suppose you worked on DYNIX/ptx? Think of this as dynamic plocal.

> 2. I hate the use of NR_CPUS. If I compile an SMP kernel on a two-CPU
>    machine, I still end up with support for 32 CPUs.

If you don't like it, just define NR_CPUS to 2 and recompile.

>    What I would like to see is that new kernel code treats equivalence
>    classes of CPUs as belonging to the same CPU. For example, given
>
>    void *blkaddrs[NR_CPUS];
>
>    instead of just looking at blkaddrs[smp_processor_id()], if the slot
>    for smp_processor_id() is full we should look through
>
>    for (i = 0; i < NR_CPUS; i++) {
>        look into blkaddrs[smp_processor_id() + i % smp_number_of_cpus() (or whatever)]
>        if successful break
>    }

How will it work? You could be accessing memory beyond blkaddrs[].

I use NR_CPUS for allocations because if I don't, supporting CPU hotplug
will be a nightmare. Resizing so many data structures is not an option.
I believe Rusty and/or the cpu hotplug work is adding a for_each_cpu()
macro to walk the CPUs that takes care of everything including sparse
CPU numbers. Until then, I would use for (i = 0; i < smp_num_cpus; i++).

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
|
From: BALBIR S. <bal...@wi...> - 2002-05-24 08:37:38
|
|Allocating cpu-local memory is a different issue altogether. Eventually,
|for NUMA support, we will have to do such allocations in a way that
|supports choosing memory closest to a group of CPUs.
|
|The per-cpu data allocator allocates one copy for *each* CPU. It uses the
|slab allocator underneath. Eventually, when/if we have per-cpu/numa-node
|slab allocation, the per-cpu data allocator can allocate every CPU's copy
|from memory closest to it.
|
|I suppose you worked on DYNIX/ptx? Think of this as dynamic plocal.

Sure, I understand what you are talking about now. That makes a lot of
sense; I will go through your document once more and read it. I was
thinking of the two combined (allocating CPU-local memory for certain
data structures also includes allocating one copy per CPU). Is there a
reason to delay the implementation of CPU-local memory, or are we waiting
for the NUMA guys to do it? Is it not useful on an SMP system to allocate
CPU-local memory?

|If you don't like it, just define NR_CPUS to 2 and recompile.

That does make sense, but I would like to keep my headers in sync, so
that all patches apply cleanly.

|How will it work? You could be accessing memory beyond blkaddrs[].

Sorry, that could happen. It should be
(smp_processor_id() + i) % smp_number_of_cpus().

|I use NR_CPUS for allocations because if I don't, supporting CPU hotplug
|will be a nightmare. Resizing so many data structures is not an option.
|I believe Rusty and/or the cpu hotplug work is adding a for_each_cpu()
|macro to walk the CPUs that takes care of everything including sparse
|CPU numbers. Until then, I would use for (i = 0; i < smp_num_cpus; i++).

I think that does make a whole lot of sense; I was not thinking about
hot-plugging CPUs (dumb me).

|Thanks
|--
|Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
|Linux Technology Center, IBM Software Lab, Bangalore, India.
|
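Spelled out, the corrected wrap-around search might look like the sketch
below. This is illustrative only: blkaddrs and the notion of a "usable"
slot come from the example in the thread, not from any kernel interface,
and smp_num_cpus stands in for Balbir's smp_number_of_cpus().

	/* Start at this CPU's slot and wrap around modulo the number
	 * of online CPUs, so the index can never run past blkaddrs[]. */
	void *find_block(void *blkaddrs[])
	{
		int i, cpu;

		for (i = 0; i < smp_num_cpus; i++) {
			cpu = (smp_processor_id() + i) % smp_num_cpus;
			if (blkaddrs[cpu])	/* slot usable? */
				return blkaddrs[cpu];
		}
		return NULL;
	}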
From: Dipankar S. <dip...@in...> - 2002-05-24 09:10:11
|
On Fri, May 24, 2002 at 02:08:50PM +0530, BALBIR SINGH wrote:
>
> Sure, I understand what you are talking about now. That makes a lot of
> sense; I will go through your document once more and read it. I was
> thinking of the two combined (allocating CPU-local memory for certain
> data structures also includes allocating one copy per CPU). Is there a
> reason to delay the implementation of CPU-local memory, or are we
> waiting for the NUMA guys to do it? Is it not useful on an SMP system
> to allocate CPU-local memory?

In an SMP system, the entire memory is equidistant from the CPUs. So any
memory that is exclusively accessed by one CPU only is CPU-local. On a
NUMA machine, however, that isn't true, so you need special schemes.

The thing about the one-copy-per-cpu allocator that I describe is that it
interleaves per-cpu data to save space. That is, if you allocate per-cpu
ints i1 and i2, they will be laid out in memory like this -

     CPU #0      CPU #1

    ---------   ---------   Start of cache line
       i1          i1
       i2          i2
        .           .
        .           .
        .           .
    ---------   ---------   End of cache line

The per-cpu copies of i1 and i2 for CPU #0 and CPU #1 are allocated from
different cache lines of memory, but CPU #0's copies of i1 and i2 are in
the same cache line. This interleaving saves space by avoiding the need
to pad small data structures to cache line sizes. This is essentially how
the static per-cpu data area in the 2.5 kernel is laid out in memory.
Since the copies of the same variable for CPU #0 and CPU #1 are on
different cache lines, code that accesses only "this" CPU's copy will not
cause cache line bouncing. On an SMP machine, I can allocate the cache
lines for the different CPUs, where the interleaved data structures are
laid out, using the slab allocator. On a NUMA machine, however, I would
want to make sure that the cache line allocated for this purpose for
CPU #N is as close as possible to CPU #N.

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
|
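In code, the interleaved layout amounts to identifying a per-CPU object
by an offset that is valid in every CPU's block. A sketch of the
arithmetic only; the contiguous block layout and all names here are
illustrative assumptions, not taken from the actual patch.

	/* One cache-line-sized block per CPU; assume blocks for
	 * successive CPUs are laid out contiguously. A per-cpu object
	 * is just an offset applied within each CPU's block. */
	#define PCPU_BLKSIZE	SMP_CACHE_BYTES

	static inline void *pcpu_addr(void *cpu0_block, size_t offset, int cpu)
	{
		return (char *)cpu0_block + cpu * PCPU_BLKSIZE + offset;
	}

In the diagram above, i1 and i2 would be offsets 0 and 4: CPU #0 reads
them from its own block (one cache line) and CPU #1 from the next, so
the two CPUs never share a line.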
From: BALBIR S. <bal...@wi...> - 2002-05-24 11:56:45
|
Thanks. When I said CPU-local memory on SMP, I meant the CPU cache;
sorry for the confusion. I think your dynamic allocator makes sense.

Balbir
|
From: Martin J. B. <Mar...@us...> - 2002-05-24 14:39:06
|
> Is there a reason to delay the implementation of CPU-local memory, or
> are we waiting for the NUMA guys to do it? Is it not useful on an SMP
> system to allocate CPU-local memory?

That should be pretty easy to do now, if I understand what you're asking
for ... when you allocate the area for cpu N, just do
alloc_pages_node(cpu_to_nid(smp_processor_id())) or something similar.

M.
|
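Spelled out for the per-cpu case, Martin's suggestion would look roughly
like this. It is a sketch against the 2.5-era NUMA API; cpu_to_nid()
naming and the exact alloc_pages_node() arguments varied by architecture,
so treat the calls as assumptions.

	/* Give each CPU's copy a page from that CPU's own node. */
	void *ptrs[NR_CPUS];
	int cpu;

	for (cpu = 0; cpu < smp_num_cpus; cpu++) {
		struct page *page;

		page = alloc_pages_node(cpu_to_nid(cpu), GFP_KERNEL, 0);
		ptrs[cpu] = page ? page_address(page) : NULL;
	}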
From: Rusty R. <ru...@ru...> - 2002-05-28 04:54:15
|
In message <200...@in...> you write:
> If static percpu area is around, can dynamic percpu data allocator
> be far behind ;-)

> The interfaces for the per-cpu data allocator are similar to Rusty's
> static per-CPU data interfaces. One clear goal was to make sure that
> they leave no overhead in UP kernels, so for UP kernels they reduce
> to ordinary variables. The basic interfaces are these -
>
> 1. percpu_data_declare(type,var)
> 2. percpu_data_alloc(var)
> 3. percpu_data(var,cpu)
> 4. this_percpu_data(var)
> 5. percpu_data_free(var)

I think you might be better off following the existing allocation
primitives:

	void *kmalloc_percpu(size_t size, int flags);
	void kfree_percpu(void *);

	per_cpu_ptr(ptr, cpu);
	this_cpu_ptr(ptr);

/* Simple implementation: you'll want something better */
void *kmalloc_percpu(size_t size, int flags)
{
	int i;
	void **ptrs = kmalloc(sizeof(*ptrs) * NR_CPUS, flags);

	if (!ptrs)
		return NULL;
	for (i = 0; i < NR_CPUS; i++) {
		ptrs[i] = kmalloc(size, flags);
		if (!ptrs[i])
			goto unwind_oom;
	}

	/* Catch derefs w/o wrappers */
	return (void *)(~(unsigned long)ptrs);

unwind_oom:
	while (--i >= 0)
		kfree(ptrs[i]);
	kfree(ptrs);
	return NULL;
}

#define per_cpu_ptr(ptr, cpu) \
({ \
	void **__p = (void **)(~(unsigned long)(ptr)); \
	(__typeof__(ptr))__p[(cpu)]; \
})

> 3. Per-CPU data in modules - the static per-cpu scheme of Rusty's
>    doesn't work in modules, at least I haven't seen a way to do this.

Good point. This will be fixed later in 2.5 with the module loader.

BTW, do the NUMA guys have a "kmalloc_nearby()" interface yet?

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
|
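A quick usage sketch for this interface, applied to the statistics
counters from the original mail. The mystats structure and its fields
are made up for illustration, and this_cpu_ptr() is assumed to be
per_cpu_ptr(ptr, smp_processor_id()).

	struct mystats {
		unsigned long rx_packets;
		unsigned long tx_packets;
	};

	struct mystats *stats;
	unsigned long total = 0;
	int cpu;

	/* One copy per CPU; the returned value is an opaque handle and
	 * must only be dereferenced through the wrappers. */
	stats = kmalloc_percpu(sizeof(struct mystats), GFP_KERNEL);

	/* Writer fast path: bump this CPU's copy, no locking and no
	 * cache line bouncing (assuming we can't be preempted here). */
	this_cpu_ptr(stats)->rx_packets++;

	/* Reader slow path: sum every CPU's copy. */
	for (cpu = 0; cpu < smp_num_cpus; cpu++)
		total += per_cpu_ptr(stats, cpu)->rx_packets;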
From: Dipankar S. <dip...@in...> - 2002-05-28 09:09:16
|
On Tue, May 28, 2002 at 02:57:11PM +1000, Rusty Russell wrote:
> In message <200...@in...> you write:
> > 1. percpu_data_declare(type,var)
> > 2. percpu_data_alloc(var)
> > 3. percpu_data(var,cpu)
> > 4. this_percpu_data(var)
> > 5. percpu_data_free(var)
>
> I think you might be better off following the existing allocation
> primitives:
>
>	void *kmalloc_percpu(size_t size, int flags);
>	void kfree_percpu(void *);
>
>	per_cpu_ptr(ptr, cpu);
>	this_cpu_ptr(ptr);

This is definitely more natural than my concoction and I like it.

> /* Simple implementation: you'll want something better */
> [code snipped]

Is there a way to do aligned allocation? The only possibility that I see
is to use the slab allocator and use SLAB_HWCACHE_ALIGN. Currently, I use
the slab cache for allocating blocks of a size that is a multiple of
SMP_CACHE_BYTES, with SLAB_HWCACHE_ALIGN, to make sure that the allocated
block exactly fits into one or more cache lines. The rest of the logic
takes care of allocating interleaved percpu objects from these blocks.
Is there a simpler way to take care of the block cache line alignment?

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
|
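The slab-based block allocation Dipankar describes would look roughly
like this. A sketch only: the cache name is invented, and the
six-argument kmem_cache_create() is the 2.4/2.5-era signature.

	static kmem_cache_t *pcpu_blk_cache;
	void *blk;

	/* Block size is a multiple of SMP_CACHE_BYTES and the cache is
	 * created with SLAB_HWCACHE_ALIGN, so every block starts on a
	 * cache line boundary and covers whole cache lines exactly. */
	pcpu_blk_cache = kmem_cache_create("pcpu_blk", SMP_CACHE_BYTES, 0,
					   SLAB_HWCACHE_ALIGN, NULL, NULL);

	/* One such block per CPU holds that CPU's interleaved objects. */
	blk = kmem_cache_alloc(pcpu_blk_cache, GFP_KERNEL);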
From: Dipankar S. <dip...@in...> - 2002-05-28 09:43:59
|
On Tue, May 28, 2002 at 02:42:38PM +0530, Dipankar Sarma wrote:
> On Tue, May 28, 2002 at 02:57:11PM +1000, Rusty Russell wrote:
> >
> > /* Simple implementation: you'll want something better */
> > [code snipped]
>
> Is there a way to do aligned allocation? The only possibility that I
> see is to use the slab allocator and use SLAB_HWCACHE_ALIGN. Currently,
> I use the slab cache for allocating blocks of a size that is a multiple
> of SMP_CACHE_BYTES, with SLAB_HWCACHE_ALIGN, to make sure that the
> allocated block exactly fits into one or more cache lines. The rest of
> the logic takes care of allocating interleaved percpu objects from
> these blocks. Is there a simpler way to take care of the block cache
> line alignment?

Oh well, the kmalloc caches are always cache line aligned, so that
answers my earlier question. So this version of kmalloc_percpu() would
work just fine, with no cache line sharing between CPUs.

The only issue is that the typical use of such an allocator is likely to
be for counters, and unless we use interleaving we will be wasting lots
of space. Also, putting multiple counters allocated together by a single
driver on the same cache line would help performance.

Another question I have is why the slab allocator aligns to the L1 cache
line, not the L2 cache line. Any ideas?

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
|
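To put rough numbers on the space argument (assuming 128-byte cache
lines, as on some of the machines in question; the counts are purely
illustrative):

	/* Per-CPU space for 32 four-byte counters, SMP_CACHE_BYTES == 128:
	 *
	 *   padded:       32 counters * 128 bytes = 4096 bytes per CPU
	 *   interleaved:  32 counters *   4 bytes =  128 bytes per CPU
	 *
	 * Padding each counter to its own line wastes 124 bytes per
	 * counter per CPU; interleaving packs them all into one line.
	 */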
From: Rusty R. <ru...@ru...> - 2002-05-29 03:45:47
|
In message <200...@in...> you write:
> Another question I have is why the slab allocator aligns to the L1
> cache line, not the L2 cache line. Any ideas?

Good question. It should be SMP_CACHE_ALIGN, which is usually
equivalent.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
|
From: Dipankar S. <dip...@in...> - 2002-05-29 05:52:36
|
On Wed, May 29, 2002 at 01:45:24PM +1000, Rusty Russell wrote:
> In message <200...@in...> you write:
> > Another question I have is why the slab allocator aligns to the L1
> > cache line, not the L2 cache line. Any ideas?
>
> Good question. It should be SMP_CACHE_ALIGN, which is usually
> equivalent.

Yes, I thought so. And SMP_CACHE_ALIGN is normally defined as the
alignment for the cache that has the higher miss penalty on that arch,
right? I asked this question on lkml a long time ago, but no one replied.
I guess I will just submit a patch to fix this.

Thanks
--
Dipankar Sarma <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
|
From: Martin J. B. <Mar...@us...> - 2002-05-28 13:53:26
|
> BTW, do the NUMA guys have a "kmalloc_nearby()" interface yet?

Tony Luck (Intel, IIRC) was working on a general fix for the slab
allocator as a whole, but I don't *think* he's finished. This is second
on my list of things to fix at the moment, so if Tony hasn't done it
already, I hope to have something in the next month.

M.
|