Thread: [Lse-tech] Re: [RFC] [PATCH] Scalable Statistics Counters

lse-tech

[Lse-tech] Re: [RFC] [PATCH] Scalable Statistics Counters

From: Andi K. <ak...@su...> - 2001-12-05 14:03:16

Rik van Riel <ri...@co...> writes:
> 
> (it'd be so cool if we could just start using a statistic variable
> through some macro and it'd be automatically declared and visible
> in /proc ;))

It is rather easy using ELF segments. You just do macros that put the
names and a pointer to the variable and the type into an ELF segment
starting with a known symbol. Then at kernel/module init you walk this
segment and do the necessary registration. 

[has also been done for sysctls]

Disadvantage is that one needs to change modutils for it to make it also

Also it needs new binutils; old binutils tend to mislink code with
the many inter section relocations generated by this technique.

-Andi (who would also like such an easy statistics interface)

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Niels C. <nc...@us...> - 2001-12-05 15:07:49

Hello, Kiran,

> Statistics counters are used in many places in the Linux kernel,
including
> storage, network I/O subsystems etc.  These counters are not atomic since

> accuracy is not so important. Nevertheless, frequent updation of these
> counters result in cacheline bouncing among various cpus in a multi
processor
> environment. This patch introduces a new set of interfaces, which should
> improve performance of such counters in MP environment.  This
implementation
> switches to code that is devoid of overheads for SMP if these interfaces
> are used with a UP kernel.
>
> Comments are welcome :)
>
>Regards,
>Kiran

I'm wondering about the scope of this.  My Ethernet adapter with, maybe, 20
counter fields would have 20 counters allocated for each of my 16
processors.
The only way to get the total would be to use statctr_read() to merge them.
Same for the who knows how many IP counters etc., etc.

How many and which counters were converted for the test you refer to?

I do like the idea of a uniform access mechanism, though.  It is well in
line
with my thoughts about an architected interface for topology and
instrumentation
so I'll definitely get back to you as I try to collect requirements.

Niels

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Ravikiran G T. <ki...@in...> - 2001-12-06 12:30:03

Hi Niels,

On Wed, Dec 05, 2001 at 10:02:33AM -0500, Niels Christiansen wrote:
>
> I'm wondering about the scope of this.  My Ethernet adapter with, maybe, 20
> counter fields would have 20 counters allocated for each of my 16
> processors.
> The only way to get the total would be to use statctr_read() to merge them.
> Same for the who knows how many IP counters etc., etc.

Are you concerned with increase in memory used per counter Here? I suppose
that must not be that much of an issue for a 16 processor box....

>
> How many and which counters were converted for the test you refer to?
>

Well, I wrote a simple kernel module which just increments a shared global
counter a million times per processor in parallel, and compared it with
the statctr which would be incremented a million times per processor in
parallel..

> I do like the idea of a uniform access mechanism, though.  It is well in
> line
> with my thoughts about an architected interface for topology and
> instrumentation
> so I'll definitely get back to you as I try to collect requirements.
>
> Niels

Hope we can come out with a really cool and acceptable interface..

Kiran
-- 
Ravikiran G Thirumalai <ki...@in...>
Linux Technology Center, IBM Software Labs,
Bangalore.

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Ravikiran G T. <ki...@in...> - 2001-12-06 14:05:51

On Thu, Dec 06, 2001 at 01:07:37PM +0000, Arjan van de Ven wrote:
> > 
> > >
> > > How many and which counters were converted for the test you refer to?
> > >
> > 
> > Well, I wrote a simple kernel module which just increments a shared global
> > counter a million times per processor in parallel, and compared it with
> > the statctr which would be incremented a million times per processor in
> > parallel..
> 
> Would you care to point out a statistic in the kernel that is
> incremented 
> more than 10.000 times/second ? (I'm giving you a a factor of 100 of
> playroom 
> here) [One that isn't per-cpu yet of course]

Well, as I mentioned in my earlier post, we have performed 
"micro benchmarking", which does not reflect the actual run time
kernel conditions. I guess u gotta take these results with a 
pinch of salt.  

But, you cannot deny that there r gonna be a lot of cacheline 
invalidations, if you use a global counter.  Using per-cpu versions is
definitely going to improve kernel performance.

Kiran

-- 
Ravikiran G Thirumalai <ki...@in...>
Linux Technology Center, IBM Software Labs,
Bangalore.

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Arjan v. de V. <ar...@re...> - 2001-12-06 14:10:21

On Thu, Dec 06, 2001 at 07:39:40PM +0530, Ravikiran G Thirumalai wrote:
> Well, as I mentioned in my earlier post, we have performed 
> "micro benchmarking", which does not reflect the actual run time
> kernel conditions. I guess u gotta take these results with a 
> pinch of salt.  
> 
> But, you cannot deny that there r gonna be a lot of cacheline 
> invalidations, if you use a global counter.  Using per-cpu versions is
> definitely going to improve kernel performance.

there's not that many counters in fact. And if you care about a gige
counter, just bind the card to a specific CPU and you have ad-hoc per-cpu
counters...

The extra cost of getting to them (extra indirection) makes each access 
more expensive..... in the end it might be a loss.

There's several things where per cpu data is useful; low frequency
statistics is not one of them in my opinion. 

Greetings,
   Arjan van de Ven

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Dipankar S. <dip...@in...> - 2001-12-06 19:30:52

On Thu, Dec 06, 2001 at 09:10:15AM -0500, Arjan van de Ven wrote:
> On Thu, Dec 06, 2001 at 07:39:40PM +0530, Ravikiran G Thirumalai wrote:
> > But, you cannot deny that there r gonna be a lot of cacheline 
> > invalidations, if you use a global counter.  Using per-cpu versions is
> > definitely going to improve kernel performance.
> 
> there's not that many counters in fact. And if you care about a gige
> counter, just bind the card to a specific CPU and you have ad-hoc per-cpu
> counters...
> 
> The extra cost of getting to them (extra indirection) makes each access 
> more expensive..... in the end it might be a loss.

If it is a low frequency statistics then the expensive access wouldn't 
really matter much, right ? On the other hand, this will likely help 
specially with larger number of CPUs.

> 
> There's several things where per cpu data is useful; low frequency
> statistics is not one of them in my opinion. 

It is quite possible that you are right. What we need to do is
a measurement effort to understand the impact.

Thanks
Dipankar
-- 
Dipankar Sarma  <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

[Lse-tech] Re: [RFC] [PATCH] Scalable Statistics Counters

From: Keith O. <ka...@sg...> - 2001-12-06 13:00:16

On Thu, 6 Dec 2001 18:03:53 +0530, 
Ravikiran G Thirumalai <ki...@in...> wrote:
>Hope we can come out with a really cool and acceptable interface..

How about a user space interface that runs at machine speed and
extracts counters without any syscall overhead?  This proposal got very
little attention at the time so we put it aside until more people
were interested.

http://marc.theaimsgroup.com/?l=linux-kernel&m=98578952028153&w=2

Ralf Baechle has pointed out one problem, virtually indexed caches :(.
That prevents a single user space mmap over the scattered kernel pages,
kernel and user space addresses have to be in sync in the cache.  So
user space sees the scattered pages and has to run the structure
itself.  No big deal, just a library function that converts an instance
name and cpu number into an address.

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Niels C. <nc...@us...> - 2001-12-06 16:10:56

Hi Kiran,

> Are you concerned with increase in memory used per counter Here? I
suppose
> that must not be that much of an issue for a 16 processor box....

Nope, I'm concerned that if this mechanism is to be used for all counters,
the improvement in cache coherence might be less significant to the point
where the additional overhead isn't worth it.

Arjab van de Ven voiced similar concerns but he also said:

> There's several things where per cpu data is useful; low frequency
> statistics is not one of them in my opinion.

...which may be true for 4-ways and even 8-ways but when you get to
32-ways and greater, you start seeing cache problems.  That was the
case on AIX and per-cpu counters was one of the changes that helped
get the spectacular scalability on Regatta.

Anyway, since we just had a long thread going on NUMA topology, maybe
it would be proper to investigate if there is a better way, such as
using the topology to decide where to put counters?  I think so, seeing
as it is that most Intel based 8-ways and above will have at least some
NUMA in them.

> Well, I wrote a simple kernel module which just increments a shared
global
> counter a million times per processor in parallel, and compared it with
> the statctr which would be incremented a million times per processor in
> parallel..

I suspected that.  Would it be possible to do the test on the real
counters?

Niels Christiansen
IBM LTC, Kernel Performance

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Dipankar S. <dip...@in...> - 2001-12-07 08:50:15

Hi Niels,

On Thu, Dec 06, 2001 at 10:10:47AM -0600, Niels Christiansen wrote:
> 
> Hi Kiran,
> 
> > Are you concerned with increase in memory used per counter Here? I
> suppose
> > that must not be that much of an issue for a 16 processor box....
> 
> Nope, I'm concerned that if this mechanism is to be used for all counters,
> the improvement in cache coherence might be less significant to the point
> where the additional overhead isn't worth it.

In a low-cpu-count SMP box, yes, this will be a concern. Kiran and I
do plan to study this and understand the impact.

> 
> Arjab van de Ven voiced similar concerns but he also said:
> 
> > There's several things where per cpu data is useful; low frequency
> > statistics is not one of them in my opinion.
> 
> ...which may be true for 4-ways and even 8-ways but when you get to
> 32-ways and greater, you start seeing cache problems.  That was the
> case on AIX and per-cpu counters was one of the changes that helped
> get the spectacular scalability on Regatta.

Yes. It also helped us in DYNIX/ptx on Sequent boxes. What we
need to do is to verify if theory based on prior experience is
also applicable to linux.

> 
> Anyway, since we just had a long thread going on NUMA topology, maybe
> it would be proper to investigate if there is a better way, such as
> using the topology to decide where to put counters?  I think so, seeing
> as it is that most Intel based 8-ways and above will have at least some
> NUMA in them.

It should be easy to place the counters in appropriately close
memory if linux gets good NUMA APIs built on top of the topology
services. If we extend kmem_cache_alloc() to allocate memory
in a particular NUMA node, we could simply do this for placing the
counters -

static int pcpu_ctr_mem_grow(struct pcpu_ctr_ctl *ctl, int flags)
{
        void *addr;
        struct pcpu_ctr_blk *blkp;
        unsigned int save_flags;
        int i;

        if (!(blkp = pcpu_ctr_blkctl_alloc(ctl, flags)))
                return 0;

        /* Get per cpu cache lines for the block */
        for_each_cpu(cpu) {
               blkp->lineaddr[cpu] = kmem_cache_alloc_node(ctl->cachep, 
						flags, CPU_TO_NODE(cpu));
               if(!(blkp->lineaddr[cpu]))
                       goto exit1;
               memset(blkp->lineaddr[cpu], 0, PCPU_CTR_LINE_SIZE);
        }

This would put the block of counters corresponding to a CPU in
memory local to the NUMA node. If there are more sophisticated
APIs available for suitable memory selection, those too can be made
use of here.

Is this the kind of thing you are looking at ?

Thanks
Dipankar
-- 
Dipankar Sarma  <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Paul J. <pj...@en...> - 2001-12-08 22:23:43

On Fri, 7 Dec 2001, Dipankar Sarma wrote:
> .... If we extend kmem_cache_alloc() to allocate memory
> in a particular NUMA node, we could simply do this for placing the
> counters -
> 
> static int pcpu_ctr_mem_grow(struct pcpu_ctr_ctl *ctl, int flags)
> {
>         ...
> 
>         /* Get per cpu cache lines for the block */
>         for_each_cpu(cpu) {
>                blkp->lineaddr[cpu] = kmem_cache_alloc_node(ctl->cachep, 
> 						flags, CPU_TO_NODE(cpu));
>                ...
>         }
> 
> This would put the block of counters corresponding to a CPU in
> memory local to the NUMA node.

Rather than baking into each call of kmem_cache_alloc_node()
the CPU_TO_NODE() transformation, how about having a
kmem_cache_alloc_cpu() call that allocates closest to a
specified cpu.

I would prefer to avoid spreading the assumption that for each
cpu there is an identifiable node that has a single memory
that is best for all cpus on that node.  Instead, assume that
for each cpu, there is an identifiable best memory ... and let
the internal implementation of kmem_cache_alloc_cpu() find that
best memory for the specified cpu.

Given this change, the kmem_cache_alloc_cpu() implementation
could use the CpuMemSets NUMA infrastructure that my group is
working on to find the best memory.  With CpuMemSets, the
kernel will have, for each cpu, a list of memories, sorted
by distance from that cpu.  Just take the first memory block
off the selected cpus memory list for the above purpose.

                          I won't rest till it's the best ...
			  Manager, Linux Scalability
                          Paul Jackson <pj...@sg...> 1.650.933.1373

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Jack S. <st...@sg...> - 2001-12-09 03:46:25

> 
> On Fri, 7 Dec 2001, Dipankar Sarma wrote:
> > .... If we extend kmem_cache_alloc() to allocate memory
> > in a particular NUMA node, we could simply do this for placing the
> > counters -
> > 
> > static int pcpu_ctr_mem_grow(struct pcpu_ctr_ctl *ctl, int flags)
> > {
> >         ...
> > 
> >         /* Get per cpu cache lines for the block */
> >         for_each_cpu(cpu) {
> >                blkp->lineaddr[cpu] = kmem_cache_alloc_node(ctl->cachep, 
> > 						flags, CPU_TO_NODE(cpu));
> >                ...
> >         }
> > 
> > This would put the block of counters corresponding to a CPU in
> > memory local to the NUMA node.
> 
> Rather than baking into each call of kmem_cache_alloc_node()
> the CPU_TO_NODE() transformation, how about having a
> kmem_cache_alloc_cpu() call that allocates closest to a
> specified cpu.

I think it depends on whether the slab allocator manages memory by cpu or
node. Since the number of cpus per node is rather small (<=8) for
most NUMA systems, I would expect the slab allocator to manage by node.
Managing by cpu would likely add extra fragmentation and no real performance
benefit.

> 
> I would prefer to avoid spreading the assumption that for each
> cpu there is an identifiable node that has a single memory
> that is best for all cpus on that node.  

But this is already true for the page allocator (alloc_pages_node()).

The page pools are managed by node, not cpu. All memory on a node is
managed by a single pg_data_t structure. This structure contains/points-to
the tables for the memory on the node (page structs, free lists, etc). 

If a cpu needs to allocate local memory, it determines it's node_id.
This node_id is in the cpu_data structure for the cpu so this is an easy
calculation (one memory reference). The nodeid is used find the pgdata_t struct
for the node (index into an array of node-local pointers, so again, no offnode
references).

Assuming the slab allocator manages by node, kmem_cache_alloc_node() & 
kmem_cache_alloc_cpu() would be identical (exzcept for spelling :-). 
Each would pick up the nodeid from the cpu_data struct, then allocate 
from the slab cache for that node.

> Instead, assume that
> for each cpu, there is an identifiable best memory ... and let
> the internal implementation of kmem_cache_alloc_cpu() find that
> best memory for the specified cpu.
> 
> Given this change, the kmem_cache_alloc_cpu() implementation
> could use the CpuMemSets NUMA infrastructure that my group is
> working on to find the best memory.  With CpuMemSets, the
> kernel will have, for each cpu, a list of memories, sorted
> by distance from that cpu.  Just take the first memory block
> off the selected cpus memory list for the above purpose.
> 
> 
>                           I won't rest till it's the best ...
> 			  Manager, Linux Scalability
>                           Paul Jackson <pj...@sg...> 1.650.933.1373
> 
> 
> _______________________________________________
> Lse-tech mailing list
> Lse...@li...
> https://lists.sourceforge.net/lists/listinfo/lse-tech
> 

-- 
Thanks

Jack Steiner    (651-683-5302)   (vnet 233-5302)      st...@sg...

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Paul J. <pj...@en...> - 2001-12-09 04:43:05

I think Jack got his attribution wrong.  Which is good for me,
since I wrote what Jack just gently demolished <grin>.

On Sat, 8 Dec 2001, Jack Steiner wrote:

> I think it depends on whether the slab allocator manages
> memory by cpu or node. Since the number of cpus per node
> is rather small (<=8) for most NUMA systems, I would expect
> the slab allocator to manage by node.  Managing by cpu would
> likely add extra fragmentation and no real performance benefit.

I wasn't intending to suggest that the slab allocator manage by
cpu, rather than by node.  Pretty clearly, that would be silly.
Rather I was doing two things:

  1) Suggesting that if some code asking for memory wanted
     it on a node near to some cpu, that it not convert that
     cpu to a node _before_ the call, but rather, pass in the
     cpu, and let the called routine, kmem_cache_alloc_node()
     or renamed to kmem_cache_alloc_cpu() map the cpu to the
     node, inside the call.

  2) Suggesting (against common usage in the kernel, as Jack
     describes, so probably I'm tilting at wind mills) that
     we distinguish between nodes and and chunks of memory
     that I've started calling memory blocks.

I think (1) is sensible enough - the entire discussion leading
up to the code example involved getting memory near to some
cpu or other - so let the call state just that, and let the
details of translating to whatever units the slab allocator
works with be handled inside the call.  Don't make each
caller remember to perform a CPU_TO_NODE conversion - it's
just a little silly duplication of code (a kernel a few
bytes larger), a slightly less direct interface, and one
more detail to impose on each person coding such a call.

As to (2) I'd have to get Jack in a room with a white board, and
at this point, I'm placing no bets on what we would conclude
(well, actually I'd bet on Jack if forced ... his batting
average is pretty good ;).

                          I won't rest till it's the best ...
			  Manager, Linux Scalability
                          Paul Jackson <pj...@sg...> 1.650.933.1373

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Jack S. <st...@sg...> - 2001-12-09 17:34:51

> 
> I think Jack got his attribution wrong.  Which is good for me,
> since I wrote what Jack just gently demolished <grin>.
  
And I probably should not have been reading mail while I
debugged a weird system hang. :-)  I missed the earlier
part of the thread - I though you were refering to local
allocation.

I dont think I have a strong opinion yet about kmem_cache_alloc_node()
vs kmem_cache_alloc_cpu(). I would not be surprised to find that 
both interfaces make sense.  

If code want to allocate close to a cpu, then kmem_cache_alloc_cpu()
is the best choice. However, I would also expect that some code
already knows the node. Then kmem_cache_alloc_node() is best.

Conversion of cpu->node is easy. Conversion of node->cpu
is slightly more difficult (currently) and has the ambiguity
that there may be multiple cpus on the node - which one should
you select? And does it matter?

As precident, the page allocation routines are all node-based.
(ie., alloc_pages_node(), etc...)


-- 
Thanks

Jack Steiner    (651-683-5302)   (vnet 233-5302)      st...@sg...

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Paul J. <pj...@en...> - 2001-12-11 23:27:17

On Sun, 9 Dec 2001, Jack Steiner wrote:

> If code want to allocate close to a cpu, then kmem_cache_alloc_cpu()
> is the best choice. However, I would also expect that some code
> already knows the node. Then kmem_cache_alloc_node() is best.

yup.

> As precident, the page allocation routines are all node-based.
> (ie., alloc_pages_node(), etc...)

My inclinations would be to prefer more cpu-based allocators.
But until I happen to catch you in a room with a white board,
my inclinations are unlikely to go anywhere ... perhaps someday.

                          I won't rest till it's the best ...
			  Manager, Linux Scalability
                          Paul Jackson <pj...@sg...> 1.650.933.1373

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Ravikiran G T. <ki...@in...> - 2001-12-07 11:37:09

Hi Niels, Arjan

On Thu, Dec 06, 2001 at 10:10:47AM -0600, Niels Christiansen wrote:
> 
> > Well, I wrote a simple kernel module which just increments a shared
> global
> > counter a million times per processor in parallel, and compared it with
> > the statctr which would be incremented a million times per processor in
> > parallel..
> 
> I suspected that.  Would it be possible to do the test on the real
> counters?

Yep, I am gonna run a benchmark after changing some stat counters in the
Kernel.  That should let us know if there are performance gains or otherwise..

Kiran
-- 
Ravikiran G Thirumalai <ki...@in...>
Linux Technology Center, IBM Software Labs,
Bangalore.

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Anton B. <an...@sa...> - 2001-12-08 14:51:29

 
Hi,

> > There's several things where per cpu data is useful; low frequency
> > statistics is not one of them in my opinion.
> 
> ...which may be true for 4-ways and even 8-ways but when you get to
> 32-ways and greater, you start seeing cache problems.  That was the
> case on AIX and per-cpu counters was one of the changes that helped
> get the spectacular scalability on Regatta.

I agree there are large areas of improvement to be done wrt cacheline
ping ponging (see my patch in 2.4.17-pre6 for one example), but we
should do our own benchmarking and not look at what AIX has been doing.

Anton
(ppc64 Linux Hacker)

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Niels C. <nc...@us...> - 2001-12-07 09:52:49

Hello Dikanpar,

| > Anyway, since we just had a long thread going on NUMA topology, maybe
| > it would be proper to investigate if there is a better way, such as
| > using the topology to decide where to put counters?  I think so, seeing
| > as it is that most Intel based 8-ways and above will have at least some
| > NUMA in them.
|
| It should be easy to place the counters in appropriately close
| memory if linux gets good NUMA APIs built on top of the topology
| services. If we extend kmem_cache_alloc() to allocate memory
| in a particular NUMA node, we could simply do this for placing the
| counters -
| ...
| This would put the block of counters corresponding to a CPU in
| memory local to the NUMA node. If there are more sophisticated
| APIs available for suitable memory selection, those too can be made
| use of here.
|
| Is this the kind of thing you are looking at ?

I'm no NUMA person so I can't verify your code snippet but if it does
what you say, yes, that is exactly what I meant:  We may have to deal
with both cache coherence and placement of counters in local memory.

Niels

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Dipankar S. <dip...@in...> - 2001-12-07 10:05:37

On Fri, Dec 07, 2001 at 04:52:40AM -0500, Niels Christiansen wrote:
> 
> Hello Dikanpar,
> | It should be easy to place the counters in appropriately close
> | memory if linux gets good NUMA APIs built on top of the topology
> | services. If we extend kmem_cache_alloc() to allocate memory
> | in a particular NUMA node, we could simply do this for placing the
> | counters -
> | ...
> | This would put the block of counters corresponding to a CPU in
> | memory local to the NUMA node. If there are more sophisticated
> | APIs available for suitable memory selection, those too can be made
> | use of here.
> |
> | Is this the kind of thing you are looking at ?
> 
> I'm no NUMA person so I can't verify your code snippet but if it does
> what you say, yes, that is exactly what I meant:  We may have to deal
> with both cache coherence and placement of counters in local memory.

Yes, we will likely need to place the conters in memory closest to
the corresponding CPUs.

I haven't yet started looking at the current NUMA proposals, but
I hope that there will be support for NUMA-aware allocations. The
flexible allocator scheme in our statctr implementation allows
each counter block corresponding to a CPU to be allocated separately
and we can make the locational judgement at that time as indicated
in my hypothetical changes to the statctr code snippet.

Thanks
Dipankar
-- 
Dipankar Sarma  <dip...@in...> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

[Lse-tech] Re: [RFC] [PATCH] Scalable Statistics Counters

From: Rusty R. <ru...@ru...> - 2001-12-08 07:37:21

In message <200...@in...> you write:
> On Thu, Dec 06, 2001 at 02:18:26PM +1100, Rusty Russell wrote:
> > On Wed, 05 Dec 2001 12:08:57 -0800
> > Andrew Morton <ak...@zi...> wrote:
> > 
> > > http://www.zipworld.com.au/~akpm/linux/2.4/2.4.7/
> > 
> > Oops, guess I should have read this thread first (still catching up on mail
).
> > 
> > Please see my per-cpu patch (just posted under [PATCH] 2.5.1-pre5: per-cpu
> > areas), and my previous /proc patch.  Combining the two into convenient for
m
> > is left as an exercise for the reader...
> 
> Hi Rusty,
> 
> Your per-cpu area patch looks like a good solution with a very simple
> implementation. BTW, some OSes map the per-cpu data areas
> to the same virtual address for each CPU avoiding the per-cpu data
> array lookup. I am not sure if this really saves much, we are ourselves
> trying to understand the overhead of such array lookup with 
> statctrs. 

I'd be interested in the results: it'd certainly be neater.  Another
option would be to use the per-cpu region pointer where architectures
currently hold smp_processor_id(), and derive the current CPU from the
per-CPU area instead of vice versa.

> IIUC, we can declare statically allocated per-cpu data using 
> this allocator (kstat, apic_timer_irqs etc.). For things that
> are a part of dynamically allocated structure, we would still
> need to use a dynamic per-cpu allocator, right ?

Yep... Someone Else's Problem 8)

> Another interesting question is how we can load different
> per-cpu sections to different areas in memory. I would suspect
> that for NUMA, we would want to locate the per-cpu sections closest
> to the corresponding CPUs.

It could possibly be done with linker tricks in vmlinux.lds, and yes,
definitely worth doing.

> I couldn't find the /proc patch. Any pointers ?

Hmm... I'm working on a rewrite, but the interface should stay the
same:
	http://lists.insecure.org/linux-kernel/2001/Nov/0087.html

Cheers!
Rusty.
--
  Anyone who quotes me is an idiot. -- Rusty Russell.

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Niels C. <nc...@us...> - 2001-12-08 17:43:32

Anton Blanchard wrote:

| > > There's several things where per cpu data is useful; low frequency
| > > statistics is not one of them in my opinion.
| >
| > ...which may be true for 4-ways and even 8-ways but when you get to
| > 32-ways and greater, you start seeing cache problems.  That was the
| > case on AIX and per-cpu counters was one of the changes that helped
| > get the spectacular scalability on Regatta.
|
| I agree there are large areas of improvement to be done wrt cacheline
| ping ponging (see my patch in 2.4.17-pre6 for one example), but we
| should do our own benchmarking and not look at what AIX has been doing.

Oh, please!  You voiced an opinion.  I presented facts.  Nobody suggested
we should not measure on Linux.  As a matter of fact, I suggested that
Kiran does tests on the real counters and he said he would.

Niels

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Anton B. <an...@sa...> - 2001-12-09 12:46:56

> | > ...which may be true for 4-ways and even 8-ways but when you get to
> | > 32-ways and greater, you start seeing cache problems.  That was the
> | > case on AIX and per-cpu counters was one of the changes that helped
> | > get the spectacular scalability on Regatta.
> |
> | I agree there are large areas of improvement to be done wrt cacheline
> | ping ponging (see my patch in 2.4.17-pre6 for one example), but we
> | should do our own benchmarking and not look at what AIX has been doing.
> 
> Oh, please!  You voiced an opinion.  I presented facts.  Nobody suggested
> we should not measure on Linux.  As a matter of fact, I suggested that
> Kiran does tests on the real counters and he said he would.

Exactly, show me where the current problem is and I will benchmark it on
a 16 way linux/ppc64 machine. Your comments are opinions too unless
you have some figures to back them up :)

Anton

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Manfred S. <ma...@co...> - 2001-12-09 10:58:03

>
>
>Assuming the slab allocator manages by node, kmem_cache_alloc_node() & 
>kmem_cache_alloc_cpu() would be identical (exzcept for spelling :-). 
>Each would pick up the nodeid from the cpu_data struct, then allocate 
>from the slab cache for that node.
>

kmem_cache_alloc is simple - the complex operation is kmem_cache_free.

The current implementation
- assumes that virt_to_page() and reading one cacheline from the page 
structure is fast. Is that true for your setups?
- uses an array to batch several free calls together: If the array 
overflows, then up to 120 objects are freed in one call, to reduce 
cacheline trashing.

If virt_to_page is fast, then a NUMA allocator would be a simple 
extention of the current implementation:

* one slab chain for each node, one spinlock for each node.
* 2 per-cpu arrays for each cpu: one for "correct node" kmem_cache_free 
calls , one for "foreign node" kmem_cache_free calls.
* kmem_cache_alloc allocates from the "correct node" per-cpu array, 
fallback to the per-node slab chain, then fallback to __get_free_pages.
* kmem_cache_free checks to which node the freed object belongs and adds 
it to the appropriate per-cpu array. The array overflow function then 
sorts the objects into the correct slab chains.

If virt_to_page is slow we need a different design. Currently it's 
called in every kmem_cache_free/kfree call.

--
    Manfred

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Jack S. <st...@sg...> - 2001-12-10 16:33:05

> 
> >
> >
> >Assuming the slab allocator manages by node, kmem_cache_alloc_node() & 
> >kmem_cache_alloc_cpu() would be identical (exzcept for spelling :-). 
> >Each would pick up the nodeid from the cpu_data struct, then allocate 
> >from the slab cache for that node.
> >
> 
> kmem_cache_alloc is simple - the complex operation is kmem_cache_free.
> 
> The current implementation
> - assumes that virt_to_page() and reading one cacheline from the page 
> structure is fast. Is that true for your setups?
> - uses an array to batch several free calls together: If the array 
> overflows, then up to 120 objects are freed in one call, to reduce 
> cacheline trashing.
> 
> If virt_to_page is fast, then a NUMA allocator would be a simple 
> extention of the current implementation:

I can give you 1 data point. This is for the SGI SN1 platform. This is a NUMA
platform & is running with the DISCONTIGMEM patch that is on sourceforge.

The virt_to_page() function currently generates the following code:

	23 instructions
		18 "real" instructions
		 5 noop (I would like to believe the compiler can eventually
		   use these instructions slots for something else)

The code has
	2 load instructions that are always reference node-local memory & have 
	  a high probability of hitting in the caches
	1 load to the node that contains the target page


I think I see a couple opportunities for reducing the amount of code. However, 
I consider the code to be "fast enough" for most purposes.

> 
> * one slab chain for each node, one spinlock for each node.
> * 2 per-cpu arrays for each cpu: one for "correct node" kmem_cache_free 
> calls , one for "foreign node" kmem_cache_free calls.
> * kmem_cache_alloc allocates from the "correct node" per-cpu array, 
> fallback to the per-node slab chain, then fallback to __get_free_pages.
> * kmem_cache_free checks to which node the freed object belongs and adds 
> it to the appropriate per-cpu array. The array overflow function then 
> sorts the objects into the correct slab chains.
> 
> If virt_to_page is slow we need a different design. Currently it's 
> called in every kmem_cache_free/kfree call.

BTW, I think Tony Luck (at Intel) is currently changing the slab allocator 
to be numa-aware. Are coordinating your work with his???



-- 
Thanks

Jack Steiner    (651-683-5302)   (vnet 233-5302)      st...@sg...

Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Manfred S. <ma...@co...> - 2001-12-10 17:00:27

Jack Steiner wrote:

>
>BTW, I think Tony Luck (at Intel) is currently changing the slab allocator 
>to be numa-aware. Are coordinating your work with his???
>

Thanks, I wasn't aware that he's working on it.
I haven't started coding, I'm still collecting what's needed.

* force certain alignments. e.g. ARM needs 1024 byte aligned objects for 
the page tables.
* NUMA support.
* Add a "priority" to kmem_cache_shrink, to avoid that every 
dcache/icache shrink causes an IPI to all cpus.
* If possible: replace the division in kmem_cache_free_one with the 
multiplication by the reciprocal. (I have a patch, but it's too ugly for 
inclusion). Important for uniprocessor versions.
* add reservation support - e.g. there must be a minimum amount of bio 
structures available, otherwise the kernel could oom-deadlock. They must 
be available, not hidden in the per-cpu caches of the other cpus.

--
    Manfred

RE: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

From: Luck, T. <ton...@in...> - 2001-12-10 18:00:24

I'd better chime in on this thread to let you know where my
efforts have led so far.  I did post a note here a couple of
weeks ago (Titled: "Thinking about a node aware kmalloc")
which you might want to dig out of the archives.

I think that I have most of the issues understood, and I've
started tinkering with the code (where I expect I'll run into
a few areas where it will becomre clear that I don't have all
the bases covered :-)

My plan is to make kmem_cache_t look like:

	struct kmem_cache_s {
		char name[CACHE_NAMELEN];
		struct list_head next;
		struct kmem_node *nodes[PLAT_MAX_COMPACT_NODES];
	}

and to move all the other fields into the new "kmem_node"
structure that will be allocated on node-local memory.

The kmem_node structure will also have a pointer back to the
kmem_cache_t, and has a reduced number of cpucache_t pointers
(as it only has to cover the cpus on a node). I'm also going
to have a "foreign returns" list for each node (see below).

My major concern with this is the frequency with which memory
blocks are freed by a cpu that does not belong to the same node
as the memory being freed (a "foreign return").  My implementation
will require kmfree() to grab a lock for the remote node to drop
the item onto a list for that node to be able to reclaim it.

I read Manfred's note posted yesterday where he suggests having
a per-cpu list for foreign returns, which would avoid a need for
a lock when the item is freed ... but it seems to me that this
only defers the problem.  When the per-cpu foreign list is filled,
that cpu will have to scan the whole list of foreign objects and
give each of them back to the correct node.  If there are a large
number of nodes, this process may not benefit from much batching
(e.g with 32 nodes and a 120 element list we only have an average
of 4 elements to be given to each node).

In my previous e-mail I described a crude benchmark that I had
performed to try to get some data.  When we have a prototype
implementation we will need to gather more data on larger systems
to figure out the best approach.

-Tony Luck

-----Original Message-----
From: Manfred Spraul [mailto:ma...@co...]
Sent: Monday, December 10, 2001 9:00 AM
To: Jack Steiner
Cc: lin...@vg...; lse...@li...
Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

Jack Steiner wrote:

>
>BTW, I think Tony Luck (at Intel) is currently changing the slab allocator 
>to be numa-aware. Are coordinating your work with his???
>

Thanks, I wasn't aware that he's working on it.
I haven't started coding, I'm still collecting what's needed.

* force certain alignments. e.g. ARM needs 1024 byte aligned objects for 
the page tables.
* NUMA support.
* Add a "priority" to kmem_cache_shrink, to avoid that every 
dcache/icache shrink causes an IPI to all cpus.
* If possible: replace the division in kmem_cache_free_one with the 
multiplication by the reciprocal. (I have a patch, but it's too ugly for 
inclusion). Important for uniprocessor versions.
* add reservation support - e.g. there must be a minimum amount of bio 
structures available, otherwise the kernel could oom-deadlock. They must 
be available, not hidden in the per-cpu caches of the other cpus.

--
    Manfred

_______________________________________________
Lse-tech mailing list
Lse...@li...
https://lists.sourceforge.net/lists/listinfo/lse-tech

1 2 > >> (Page 1 of 2)