From: Andi K. <ak...@su...> - 2001-12-05 14:03:16
|
Rik van Riel <ri...@co...> writes: > > (it'd be so cool if we could just start using a statistic variable > through some macro and it'd be automatically declared and visible > in /proc ;)) It is rather easy using ELF segments. You just do macros that put the names and a pointer to the variable and the type into an ELF segment starting with a known symbol. Then at kernel/module init you walk this segment and do the necessary registration. [has also been done for sysctls] Disadvantage is that one needs to change modutils for it to make it also Also it needs new binutils; old binutils tend to mislink code with the many inter section relocations generated by this technique. -Andi (who would also like such an easy statistics interface) |
From: Niels C. <nc...@us...> - 2001-12-05 15:07:49
|
Hello, Kiran, > Statistics counters are used in many places in the Linux kernel, including > storage, network I/O subsystems etc. These counters are not atomic since > accuracy is not so important. Nevertheless, frequent updation of these > counters result in cacheline bouncing among various cpus in a multi processor > environment. This patch introduces a new set of interfaces, which should > improve performance of such counters in MP environment. This implementation > switches to code that is devoid of overheads for SMP if these interfaces > are used with a UP kernel. > > Comments are welcome :) > >Regards, >Kiran I'm wondering about the scope of this. My Ethernet adapter with, maybe, 20 counter fields would have 20 counters allocated for each of my 16 processors. The only way to get the total would be to use statctr_read() to merge them. Same for the who knows how many IP counters etc., etc. How many and which counters were converted for the test you refer to? I do like the idea of a uniform access mechanism, though. It is well in line with my thoughts about an architected interface for topology and instrumentation so I'll definitely get back to you as I try to collect requirements. Niels |
From: Ravikiran G T. <ki...@in...> - 2001-12-06 12:30:03
|
Hi Niels, On Wed, Dec 05, 2001 at 10:02:33AM -0500, Niels Christiansen wrote: > > I'm wondering about the scope of this. My Ethernet adapter with, maybe, 20 > counter fields would have 20 counters allocated for each of my 16 > processors. > The only way to get the total would be to use statctr_read() to merge them. > Same for the who knows how many IP counters etc., etc. Are you concerned with increase in memory used per counter Here? I suppose that must not be that much of an issue for a 16 processor box.... > > How many and which counters were converted for the test you refer to? > Well, I wrote a simple kernel module which just increments a shared global counter a million times per processor in parallel, and compared it with the statctr which would be incremented a million times per processor in parallel.. > I do like the idea of a uniform access mechanism, though. It is well in > line > with my thoughts about an architected interface for topology and > instrumentation > so I'll definitely get back to you as I try to collect requirements. > > Niels Hope we can come out with a really cool and acceptable interface.. Kiran -- Ravikiran G Thirumalai <ki...@in...> Linux Technology Center, IBM Software Labs, Bangalore. |
From: Ravikiran G T. <ki...@in...> - 2001-12-06 14:05:51
|
On Thu, Dec 06, 2001 at 01:07:37PM +0000, Arjan van de Ven wrote: > > > > > > > > How many and which counters were converted for the test you refer to? > > > > > > > Well, I wrote a simple kernel module which just increments a shared global > > counter a million times per processor in parallel, and compared it with > > the statctr which would be incremented a million times per processor in > > parallel.. > > Would you care to point out a statistic in the kernel that is > incremented > more than 10.000 times/second ? (I'm giving you a a factor of 100 of > playroom > here) [One that isn't per-cpu yet of course] Well, as I mentioned in my earlier post, we have performed "micro benchmarking", which does not reflect the actual run time kernel conditions. I guess u gotta take these results with a pinch of salt. But, you cannot deny that there r gonna be a lot of cacheline invalidations, if you use a global counter. Using per-cpu versions is definitely going to improve kernel performance. Kiran -- Ravikiran G Thirumalai <ki...@in...> Linux Technology Center, IBM Software Labs, Bangalore. |
From: Arjan v. de V. <ar...@re...> - 2001-12-06 14:10:21
|
On Thu, Dec 06, 2001 at 07:39:40PM +0530, Ravikiran G Thirumalai wrote: > Well, as I mentioned in my earlier post, we have performed > "micro benchmarking", which does not reflect the actual run time > kernel conditions. I guess u gotta take these results with a > pinch of salt. > > But, you cannot deny that there r gonna be a lot of cacheline > invalidations, if you use a global counter. Using per-cpu versions is > definitely going to improve kernel performance. there's not that many counters in fact. And if you care about a gige counter, just bind the card to a specific CPU and you have ad-hoc per-cpu counters... The extra cost of getting to them (extra indirection) makes each access more expensive..... in the end it might be a loss. There's several things where per cpu data is useful; low frequency statistics is not one of them in my opinion. Greetings, Arjan van de Ven |
From: Dipankar S. <dip...@in...> - 2001-12-06 19:30:52
|
On Thu, Dec 06, 2001 at 09:10:15AM -0500, Arjan van de Ven wrote: > On Thu, Dec 06, 2001 at 07:39:40PM +0530, Ravikiran G Thirumalai wrote: > > But, you cannot deny that there r gonna be a lot of cacheline > > invalidations, if you use a global counter. Using per-cpu versions is > > definitely going to improve kernel performance. > > there's not that many counters in fact. And if you care about a gige > counter, just bind the card to a specific CPU and you have ad-hoc per-cpu > counters... > > The extra cost of getting to them (extra indirection) makes each access > more expensive..... in the end it might be a loss. If it is a low frequency statistics then the expensive access wouldn't really matter much, right ? On the other hand, this will likely help specially with larger number of CPUs. > > There's several things where per cpu data is useful; low frequency > statistics is not one of them in my opinion. It is quite possible that you are right. What we need to do is a measurement effort to understand the impact. Thanks Dipankar -- Dipankar Sarma <dip...@in...> http://lse.sourceforge.net Linux Technology Center, IBM Software Lab, Bangalore, India. |
From: Keith O. <ka...@sg...> - 2001-12-06 13:00:16
|
On Thu, 6 Dec 2001 18:03:53 +0530, Ravikiran G Thirumalai <ki...@in...> wrote: >Hope we can come out with a really cool and acceptable interface.. How about a user space interface that runs at machine speed and extracts counters without any syscall overhead? This proposal got very little attention at the time so we put it aside until more people were interested. http://marc.theaimsgroup.com/?l=linux-kernel&m=98578952028153&w=2 Ralf Baechle has pointed out one problem, virtually indexed caches :(. That prevents a single user space mmap over the scattered kernel pages, kernel and user space addresses have to be in sync in the cache. So user space sees the scattered pages and has to run the structure itself. No big deal, just a library function that converts an instance name and cpu number into an address. |
From: Niels C. <nc...@us...> - 2001-12-06 16:10:56
|
Hi Kiran, > Are you concerned with increase in memory used per counter Here? I suppose > that must not be that much of an issue for a 16 processor box.... Nope, I'm concerned that if this mechanism is to be used for all counters, the improvement in cache coherence might be less significant to the point where the additional overhead isn't worth it. Arjab van de Ven voiced similar concerns but he also said: > There's several things where per cpu data is useful; low frequency > statistics is not one of them in my opinion. ...which may be true for 4-ways and even 8-ways but when you get to 32-ways and greater, you start seeing cache problems. That was the case on AIX and per-cpu counters was one of the changes that helped get the spectacular scalability on Regatta. Anyway, since we just had a long thread going on NUMA topology, maybe it would be proper to investigate if there is a better way, such as using the topology to decide where to put counters? I think so, seeing as it is that most Intel based 8-ways and above will have at least some NUMA in them. > Well, I wrote a simple kernel module which just increments a shared global > counter a million times per processor in parallel, and compared it with > the statctr which would be incremented a million times per processor in > parallel.. I suspected that. Would it be possible to do the test on the real counters? Niels Christiansen IBM LTC, Kernel Performance |
From: Dipankar S. <dip...@in...> - 2001-12-07 08:50:15
|
Hi Niels, On Thu, Dec 06, 2001 at 10:10:47AM -0600, Niels Christiansen wrote: > > Hi Kiran, > > > Are you concerned with increase in memory used per counter Here? I > suppose > > that must not be that much of an issue for a 16 processor box.... > > Nope, I'm concerned that if this mechanism is to be used for all counters, > the improvement in cache coherence might be less significant to the point > where the additional overhead isn't worth it. In a low-cpu-count SMP box, yes, this will be a concern. Kiran and I do plan to study this and understand the impact. > > Arjab van de Ven voiced similar concerns but he also said: > > > There's several things where per cpu data is useful; low frequency > > statistics is not one of them in my opinion. > > ...which may be true for 4-ways and even 8-ways but when you get to > 32-ways and greater, you start seeing cache problems. That was the > case on AIX and per-cpu counters was one of the changes that helped > get the spectacular scalability on Regatta. Yes. It also helped us in DYNIX/ptx on Sequent boxes. What we need to do is to verify if theory based on prior experience is also applicable to linux. > > Anyway, since we just had a long thread going on NUMA topology, maybe > it would be proper to investigate if there is a better way, such as > using the topology to decide where to put counters? I think so, seeing > as it is that most Intel based 8-ways and above will have at least some > NUMA in them. It should be easy to place the counters in appropriately close memory if linux gets good NUMA APIs built on top of the topology services. If we extend kmem_cache_alloc() to allocate memory in a particular NUMA node, we could simply do this for placing the counters - static int pcpu_ctr_mem_grow(struct pcpu_ctr_ctl *ctl, int flags) { void *addr; struct pcpu_ctr_blk *blkp; unsigned int save_flags; int i; if (!(blkp = pcpu_ctr_blkctl_alloc(ctl, flags))) return 0; /* Get per cpu cache lines for the block */ for_each_cpu(cpu) { blkp->lineaddr[cpu] = kmem_cache_alloc_node(ctl->cachep, flags, CPU_TO_NODE(cpu)); if(!(blkp->lineaddr[cpu])) goto exit1; memset(blkp->lineaddr[cpu], 0, PCPU_CTR_LINE_SIZE); } This would put the block of counters corresponding to a CPU in memory local to the NUMA node. If there are more sophisticated APIs available for suitable memory selection, those too can be made use of here. Is this the kind of thing you are looking at ? Thanks Dipankar -- Dipankar Sarma <dip...@in...> http://lse.sourceforge.net Linux Technology Center, IBM Software Lab, Bangalore, India. |
From: Paul J. <pj...@en...> - 2001-12-08 22:23:43
|
On Fri, 7 Dec 2001, Dipankar Sarma wrote: > .... If we extend kmem_cache_alloc() to allocate memory > in a particular NUMA node, we could simply do this for placing the > counters - > > static int pcpu_ctr_mem_grow(struct pcpu_ctr_ctl *ctl, int flags) > { > ... > > /* Get per cpu cache lines for the block */ > for_each_cpu(cpu) { > blkp->lineaddr[cpu] = kmem_cache_alloc_node(ctl->cachep, > flags, CPU_TO_NODE(cpu)); > ... > } > > This would put the block of counters corresponding to a CPU in > memory local to the NUMA node. Rather than baking into each call of kmem_cache_alloc_node() the CPU_TO_NODE() transformation, how about having a kmem_cache_alloc_cpu() call that allocates closest to a specified cpu. I would prefer to avoid spreading the assumption that for each cpu there is an identifiable node that has a single memory that is best for all cpus on that node. Instead, assume that for each cpu, there is an identifiable best memory ... and let the internal implementation of kmem_cache_alloc_cpu() find that best memory for the specified cpu. Given this change, the kmem_cache_alloc_cpu() implementation could use the CpuMemSets NUMA infrastructure that my group is working on to find the best memory. With CpuMemSets, the kernel will have, for each cpu, a list of memories, sorted by distance from that cpu. Just take the first memory block off the selected cpus memory list for the above purpose. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Jack S. <st...@sg...> - 2001-12-09 03:46:25
|
> > On Fri, 7 Dec 2001, Dipankar Sarma wrote: > > .... If we extend kmem_cache_alloc() to allocate memory > > in a particular NUMA node, we could simply do this for placing the > > counters - > > > > static int pcpu_ctr_mem_grow(struct pcpu_ctr_ctl *ctl, int flags) > > { > > ... > > > > /* Get per cpu cache lines for the block */ > > for_each_cpu(cpu) { > > blkp->lineaddr[cpu] = kmem_cache_alloc_node(ctl->cachep, > > flags, CPU_TO_NODE(cpu)); > > ... > > } > > > > This would put the block of counters corresponding to a CPU in > > memory local to the NUMA node. > > Rather than baking into each call of kmem_cache_alloc_node() > the CPU_TO_NODE() transformation, how about having a > kmem_cache_alloc_cpu() call that allocates closest to a > specified cpu. I think it depends on whether the slab allocator manages memory by cpu or node. Since the number of cpus per node is rather small (<=8) for most NUMA systems, I would expect the slab allocator to manage by node. Managing by cpu would likely add extra fragmentation and no real performance benefit. > > I would prefer to avoid spreading the assumption that for each > cpu there is an identifiable node that has a single memory > that is best for all cpus on that node. But this is already true for the page allocator (alloc_pages_node()). The page pools are managed by node, not cpu. All memory on a node is managed by a single pg_data_t structure. This structure contains/points-to the tables for the memory on the node (page structs, free lists, etc). If a cpu needs to allocate local memory, it determines it's node_id. This node_id is in the cpu_data structure for the cpu so this is an easy calculation (one memory reference). The nodeid is used find the pgdata_t struct for the node (index into an array of node-local pointers, so again, no offnode references). Assuming the slab allocator manages by node, kmem_cache_alloc_node() & kmem_cache_alloc_cpu() would be identical (exzcept for spelling :-). Each would pick up the nodeid from the cpu_data struct, then allocate from the slab cache for that node. > Instead, assume that > for each cpu, there is an identifiable best memory ... and let > the internal implementation of kmem_cache_alloc_cpu() find that > best memory for the specified cpu. > > Given this change, the kmem_cache_alloc_cpu() implementation > could use the CpuMemSets NUMA infrastructure that my group is > working on to find the best memory. With CpuMemSets, the > kernel will have, for each cpu, a list of memories, sorted > by distance from that cpu. Just take the first memory block > off the selected cpus memory list for the above purpose. > > > I won't rest till it's the best ... > Manager, Linux Scalability > Paul Jackson <pj...@sg...> 1.650.933.1373 > > > _______________________________________________ > Lse-tech mailing list > Lse...@li... > https://lists.sourceforge.net/lists/listinfo/lse-tech > -- Thanks Jack Steiner (651-683-5302) (vnet 233-5302) st...@sg... |
From: Paul J. <pj...@en...> - 2001-12-09 04:43:05
|
I think Jack got his attribution wrong. Which is good for me, since I wrote what Jack just gently demolished <grin>. On Sat, 8 Dec 2001, Jack Steiner wrote: > I think it depends on whether the slab allocator manages > memory by cpu or node. Since the number of cpus per node > is rather small (<=8) for most NUMA systems, I would expect > the slab allocator to manage by node. Managing by cpu would > likely add extra fragmentation and no real performance benefit. I wasn't intending to suggest that the slab allocator manage by cpu, rather than by node. Pretty clearly, that would be silly. Rather I was doing two things: 1) Suggesting that if some code asking for memory wanted it on a node near to some cpu, that it not convert that cpu to a node _before_ the call, but rather, pass in the cpu, and let the called routine, kmem_cache_alloc_node() or renamed to kmem_cache_alloc_cpu() map the cpu to the node, inside the call. 2) Suggesting (against common usage in the kernel, as Jack describes, so probably I'm tilting at wind mills) that we distinguish between nodes and and chunks of memory that I've started calling memory blocks. I think (1) is sensible enough - the entire discussion leading up to the code example involved getting memory near to some cpu or other - so let the call state just that, and let the details of translating to whatever units the slab allocator works with be handled inside the call. Don't make each caller remember to perform a CPU_TO_NODE conversion - it's just a little silly duplication of code (a kernel a few bytes larger), a slightly less direct interface, and one more detail to impose on each person coding such a call. As to (2) I'd have to get Jack in a room with a white board, and at this point, I'm placing no bets on what we would conclude (well, actually I'd bet on Jack if forced ... his batting average is pretty good ;). I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Jack S. <st...@sg...> - 2001-12-09 17:34:51
|
> > I think Jack got his attribution wrong. Which is good for me, > since I wrote what Jack just gently demolished <grin>. And I probably should not have been reading mail while I debugged a weird system hang. :-) I missed the earlier part of the thread - I though you were refering to local allocation. I dont think I have a strong opinion yet about kmem_cache_alloc_node() vs kmem_cache_alloc_cpu(). I would not be surprised to find that both interfaces make sense. If code want to allocate close to a cpu, then kmem_cache_alloc_cpu() is the best choice. However, I would also expect that some code already knows the node. Then kmem_cache_alloc_node() is best. Conversion of cpu->node is easy. Conversion of node->cpu is slightly more difficult (currently) and has the ambiguity that there may be multiple cpus on the node - which one should you select? And does it matter? As precident, the page allocation routines are all node-based. (ie., alloc_pages_node(), etc...) -- Thanks Jack Steiner (651-683-5302) (vnet 233-5302) st...@sg... |
From: Paul J. <pj...@en...> - 2001-12-11 23:27:17
|
On Sun, 9 Dec 2001, Jack Steiner wrote: > If code want to allocate close to a cpu, then kmem_cache_alloc_cpu() > is the best choice. However, I would also expect that some code > already knows the node. Then kmem_cache_alloc_node() is best. yup. > As precident, the page allocation routines are all node-based. > (ie., alloc_pages_node(), etc...) My inclinations would be to prefer more cpu-based allocators. But until I happen to catch you in a room with a white board, my inclinations are unlikely to go anywhere ... perhaps someday. I won't rest till it's the best ... Manager, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Ravikiran G T. <ki...@in...> - 2001-12-07 11:37:09
|
Hi Niels, Arjan On Thu, Dec 06, 2001 at 10:10:47AM -0600, Niels Christiansen wrote: > > > Well, I wrote a simple kernel module which just increments a shared > global > > counter a million times per processor in parallel, and compared it with > > the statctr which would be incremented a million times per processor in > > parallel.. > > I suspected that. Would it be possible to do the test on the real > counters? Yep, I am gonna run a benchmark after changing some stat counters in the Kernel. That should let us know if there are performance gains or otherwise.. Kiran -- Ravikiran G Thirumalai <ki...@in...> Linux Technology Center, IBM Software Labs, Bangalore. |
From: Anton B. <an...@sa...> - 2001-12-08 14:51:29
|
Hi, > > There's several things where per cpu data is useful; low frequency > > statistics is not one of them in my opinion. > > ...which may be true for 4-ways and even 8-ways but when you get to > 32-ways and greater, you start seeing cache problems. That was the > case on AIX and per-cpu counters was one of the changes that helped > get the spectacular scalability on Regatta. I agree there are large areas of improvement to be done wrt cacheline ping ponging (see my patch in 2.4.17-pre6 for one example), but we should do our own benchmarking and not look at what AIX has been doing. Anton (ppc64 Linux Hacker) |
From: Niels C. <nc...@us...> - 2001-12-07 09:52:49
|
Hello Dikanpar, | > Anyway, since we just had a long thread going on NUMA topology, maybe | > it would be proper to investigate if there is a better way, such as | > using the topology to decide where to put counters? I think so, seeing | > as it is that most Intel based 8-ways and above will have at least some | > NUMA in them. | | It should be easy to place the counters in appropriately close | memory if linux gets good NUMA APIs built on top of the topology | services. If we extend kmem_cache_alloc() to allocate memory | in a particular NUMA node, we could simply do this for placing the | counters - | ... | This would put the block of counters corresponding to a CPU in | memory local to the NUMA node. If there are more sophisticated | APIs available for suitable memory selection, those too can be made | use of here. | | Is this the kind of thing you are looking at ? I'm no NUMA person so I can't verify your code snippet but if it does what you say, yes, that is exactly what I meant: We may have to deal with both cache coherence and placement of counters in local memory. Niels |
From: Dipankar S. <dip...@in...> - 2001-12-07 10:05:37
|
On Fri, Dec 07, 2001 at 04:52:40AM -0500, Niels Christiansen wrote: > > Hello Dikanpar, > | It should be easy to place the counters in appropriately close > | memory if linux gets good NUMA APIs built on top of the topology > | services. If we extend kmem_cache_alloc() to allocate memory > | in a particular NUMA node, we could simply do this for placing the > | counters - > | ... > | This would put the block of counters corresponding to a CPU in > | memory local to the NUMA node. If there are more sophisticated > | APIs available for suitable memory selection, those too can be made > | use of here. > | > | Is this the kind of thing you are looking at ? > > I'm no NUMA person so I can't verify your code snippet but if it does > what you say, yes, that is exactly what I meant: We may have to deal > with both cache coherence and placement of counters in local memory. Yes, we will likely need to place the conters in memory closest to the corresponding CPUs. I haven't yet started looking at the current NUMA proposals, but I hope that there will be support for NUMA-aware allocations. The flexible allocator scheme in our statctr implementation allows each counter block corresponding to a CPU to be allocated separately and we can make the locational judgement at that time as indicated in my hypothetical changes to the statctr code snippet. Thanks Dipankar -- Dipankar Sarma <dip...@in...> http://lse.sourceforge.net Linux Technology Center, IBM Software Lab, Bangalore, India. |
From: Rusty R. <ru...@ru...> - 2001-12-08 07:37:21
|
In message <200...@in...> you write: > On Thu, Dec 06, 2001 at 02:18:26PM +1100, Rusty Russell wrote: > > On Wed, 05 Dec 2001 12:08:57 -0800 > > Andrew Morton <ak...@zi...> wrote: > > > > > http://www.zipworld.com.au/~akpm/linux/2.4/2.4.7/ > > > > Oops, guess I should have read this thread first (still catching up on mail ). > > > > Please see my per-cpu patch (just posted under [PATCH] 2.5.1-pre5: per-cpu > > areas), and my previous /proc patch. Combining the two into convenient for m > > is left as an exercise for the reader... > > Hi Rusty, > > Your per-cpu area patch looks like a good solution with a very simple > implementation. BTW, some OSes map the per-cpu data areas > to the same virtual address for each CPU avoiding the per-cpu data > array lookup. I am not sure if this really saves much, we are ourselves > trying to understand the overhead of such array lookup with > statctrs. I'd be interested in the results: it'd certainly be neater. Another option would be to use the per-cpu region pointer where architectures currently hold smp_processor_id(), and derive the current CPU from the per-CPU area instead of vice versa. > IIUC, we can declare statically allocated per-cpu data using > this allocator (kstat, apic_timer_irqs etc.). For things that > are a part of dynamically allocated structure, we would still > need to use a dynamic per-cpu allocator, right ? Yep... Someone Else's Problem 8) > Another interesting question is how we can load different > per-cpu sections to different areas in memory. I would suspect > that for NUMA, we would want to locate the per-cpu sections closest > to the corresponding CPUs. It could possibly be done with linker tricks in vmlinux.lds, and yes, definitely worth doing. > I couldn't find the /proc patch. Any pointers ? Hmm... I'm working on a rewrite, but the interface should stay the same: http://lists.insecure.org/linux-kernel/2001/Nov/0087.html Cheers! Rusty. -- Anyone who quotes me is an idiot. -- Rusty Russell. |
From: Niels C. <nc...@us...> - 2001-12-08 17:43:32
|
Anton Blanchard wrote: | > > There's several things where per cpu data is useful; low frequency | > > statistics is not one of them in my opinion. | > | > ...which may be true for 4-ways and even 8-ways but when you get to | > 32-ways and greater, you start seeing cache problems. That was the | > case on AIX and per-cpu counters was one of the changes that helped | > get the spectacular scalability on Regatta. | | I agree there are large areas of improvement to be done wrt cacheline | ping ponging (see my patch in 2.4.17-pre6 for one example), but we | should do our own benchmarking and not look at what AIX has been doing. Oh, please! You voiced an opinion. I presented facts. Nobody suggested we should not measure on Linux. As a matter of fact, I suggested that Kiran does tests on the real counters and he said he would. Niels |
From: Anton B. <an...@sa...> - 2001-12-09 12:46:56
|
> | > ...which may be true for 4-ways and even 8-ways but when you get to > | > 32-ways and greater, you start seeing cache problems. That was the > | > case on AIX and per-cpu counters was one of the changes that helped > | > get the spectacular scalability on Regatta. > | > | I agree there are large areas of improvement to be done wrt cacheline > | ping ponging (see my patch in 2.4.17-pre6 for one example), but we > | should do our own benchmarking and not look at what AIX has been doing. > > Oh, please! You voiced an opinion. I presented facts. Nobody suggested > we should not measure on Linux. As a matter of fact, I suggested that > Kiran does tests on the real counters and he said he would. Exactly, show me where the current problem is and I will benchmark it on a 16 way linux/ppc64 machine. Your comments are opinions too unless you have some figures to back them up :) Anton |
From: Manfred S. <ma...@co...> - 2001-12-09 10:58:03
|
> > >Assuming the slab allocator manages by node, kmem_cache_alloc_node() & >kmem_cache_alloc_cpu() would be identical (exzcept for spelling :-). >Each would pick up the nodeid from the cpu_data struct, then allocate >from the slab cache for that node. > kmem_cache_alloc is simple - the complex operation is kmem_cache_free. The current implementation - assumes that virt_to_page() and reading one cacheline from the page structure is fast. Is that true for your setups? - uses an array to batch several free calls together: If the array overflows, then up to 120 objects are freed in one call, to reduce cacheline trashing. If virt_to_page is fast, then a NUMA allocator would be a simple extention of the current implementation: * one slab chain for each node, one spinlock for each node. * 2 per-cpu arrays for each cpu: one for "correct node" kmem_cache_free calls , one for "foreign node" kmem_cache_free calls. * kmem_cache_alloc allocates from the "correct node" per-cpu array, fallback to the per-node slab chain, then fallback to __get_free_pages. * kmem_cache_free checks to which node the freed object belongs and adds it to the appropriate per-cpu array. The array overflow function then sorts the objects into the correct slab chains. If virt_to_page is slow we need a different design. Currently it's called in every kmem_cache_free/kfree call. -- Manfred |
From: Jack S. <st...@sg...> - 2001-12-10 16:33:05
|
> > > > > > >Assuming the slab allocator manages by node, kmem_cache_alloc_node() & > >kmem_cache_alloc_cpu() would be identical (exzcept for spelling :-). > >Each would pick up the nodeid from the cpu_data struct, then allocate > >from the slab cache for that node. > > > > kmem_cache_alloc is simple - the complex operation is kmem_cache_free. > > The current implementation > - assumes that virt_to_page() and reading one cacheline from the page > structure is fast. Is that true for your setups? > - uses an array to batch several free calls together: If the array > overflows, then up to 120 objects are freed in one call, to reduce > cacheline trashing. > > If virt_to_page is fast, then a NUMA allocator would be a simple > extention of the current implementation: I can give you 1 data point. This is for the SGI SN1 platform. This is a NUMA platform & is running with the DISCONTIGMEM patch that is on sourceforge. The virt_to_page() function currently generates the following code: 23 instructions 18 "real" instructions 5 noop (I would like to believe the compiler can eventually use these instructions slots for something else) The code has 2 load instructions that are always reference node-local memory & have a high probability of hitting in the caches 1 load to the node that contains the target page I think I see a couple opportunities for reducing the amount of code. However, I consider the code to be "fast enough" for most purposes. > > * one slab chain for each node, one spinlock for each node. > * 2 per-cpu arrays for each cpu: one for "correct node" kmem_cache_free > calls , one for "foreign node" kmem_cache_free calls. > * kmem_cache_alloc allocates from the "correct node" per-cpu array, > fallback to the per-node slab chain, then fallback to __get_free_pages. > * kmem_cache_free checks to which node the freed object belongs and adds > it to the appropriate per-cpu array. The array overflow function then > sorts the objects into the correct slab chains. > > If virt_to_page is slow we need a different design. Currently it's > called in every kmem_cache_free/kfree call. BTW, I think Tony Luck (at Intel) is currently changing the slab allocator to be numa-aware. Are coordinating your work with his??? -- Thanks Jack Steiner (651-683-5302) (vnet 233-5302) st...@sg... |
From: Manfred S. <ma...@co...> - 2001-12-10 17:00:27
|
Jack Steiner wrote: > >BTW, I think Tony Luck (at Intel) is currently changing the slab allocator >to be numa-aware. Are coordinating your work with his??? > Thanks, I wasn't aware that he's working on it. I haven't started coding, I'm still collecting what's needed. * force certain alignments. e.g. ARM needs 1024 byte aligned objects for the page tables. * NUMA support. * Add a "priority" to kmem_cache_shrink, to avoid that every dcache/icache shrink causes an IPI to all cpus. * If possible: replace the division in kmem_cache_free_one with the multiplication by the reciprocal. (I have a patch, but it's too ugly for inclusion). Important for uniprocessor versions. * add reservation support - e.g. there must be a minimum amount of bio structures available, otherwise the kernel could oom-deadlock. They must be available, not hidden in the per-cpu caches of the other cpus. -- Manfred |
From: Luck, T. <ton...@in...> - 2001-12-10 18:00:24
|
I'd better chime in on this thread to let you know where my efforts have led so far. I did post a note here a couple of weeks ago (Titled: "Thinking about a node aware kmalloc") which you might want to dig out of the archives. I think that I have most of the issues understood, and I've started tinkering with the code (where I expect I'll run into a few areas where it will becomre clear that I don't have all the bases covered :-) My plan is to make kmem_cache_t look like: struct kmem_cache_s { char name[CACHE_NAMELEN]; struct list_head next; struct kmem_node *nodes[PLAT_MAX_COMPACT_NODES]; } and to move all the other fields into the new "kmem_node" structure that will be allocated on node-local memory. The kmem_node structure will also have a pointer back to the kmem_cache_t, and has a reduced number of cpucache_t pointers (as it only has to cover the cpus on a node). I'm also going to have a "foreign returns" list for each node (see below). My major concern with this is the frequency with which memory blocks are freed by a cpu that does not belong to the same node as the memory being freed (a "foreign return"). My implementation will require kmfree() to grab a lock for the remote node to drop the item onto a list for that node to be able to reclaim it. I read Manfred's note posted yesterday where he suggests having a per-cpu list for foreign returns, which would avoid a need for a lock when the item is freed ... but it seems to me that this only defers the problem. When the per-cpu foreign list is filled, that cpu will have to scan the whole list of foreign objects and give each of them back to the correct node. If there are a large number of nodes, this process may not benefit from much batching (e.g with 32 nodes and a 120 element list we only have an average of 4 elements to be given to each node). In my previous e-mail I described a crude benchmark that I had performed to try to get some data. When we have a prototype implementation we will need to gather more data on larger systems to figure out the best approach. -Tony Luck -----Original Message----- From: Manfred Spraul [mailto:ma...@co...] Sent: Monday, December 10, 2001 9:00 AM To: Jack Steiner Cc: lin...@vg...; lse...@li... Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters Jack Steiner wrote: > >BTW, I think Tony Luck (at Intel) is currently changing the slab allocator >to be numa-aware. Are coordinating your work with his??? > Thanks, I wasn't aware that he's working on it. I haven't started coding, I'm still collecting what's needed. * force certain alignments. e.g. ARM needs 1024 byte aligned objects for the page tables. * NUMA support. * Add a "priority" to kmem_cache_shrink, to avoid that every dcache/icache shrink causes an IPI to all cpus. * If possible: replace the division in kmem_cache_free_one with the multiplication by the reciprocal. (I have a patch, but it's too ugly for inclusion). Important for uniprocessor versions. * add reservation support - e.g. there must be a minimum amount of bio structures available, otherwise the kernel could oom-deadlock. They must be available, not hidden in the per-cpu caches of the other cpus. -- Manfred _______________________________________________ Lse-tech mailing list Lse...@li... https://lists.sourceforge.net/lists/listinfo/lse-tech |