From: Manfred S. <ma...@co...> - 2001-12-10 19:12:54
Luck, Tony wrote:
>I'd better chime in on this thread to let you know where my
>efforts have led so far. I did post a note here a couple of
>weeks ago (titled: "Thinking about a node aware kmalloc")
>which you might want to dig out of the archives.
>
>I think that I have most of the issues understood, and I've
>started tinkering with the code (where I expect I'll run into
>a few areas where it will become clear that I don't have all
>the bases covered :-)
>
>My plan is to make kmem_cache_t look like:
>
>	struct kmem_cache_s {
>		char name[CACHE_NAMELEN];
>		struct list_head next;
>		struct kmem_node *nodes[PLAT_MAX_COMPACT_NODES];
>	};
>
>and to move all the other fields into the new "kmem_node"
>structure that will be allocated on node-local memory.
>
I agree that everything should be pointers and node-local memory, except with a single node, where the structure should be integrated directly. But I think the pointers to the per-cpu structures should be part of kmem_cache_s: every kmalloc and kfree must look up the per-cpu structure, and embedding the pointers directly would avoid one level of indirection. I thought about:

<<<
struct kmem_cache_s {
	char name[CACHE_NAMELEN];
	struct list_head next;
#if PLAT_MAX_COMPACT_NODES > 1
	struct kmem_nodedata *nodes[PLAT_MAX_COMPACT_NODES];
#else
	struct kmem_nodedata nd;
#endif
#ifdef CONFIG_SMP
	struct kmem_cpudata *cpus[NR_CPUS];
#endif
};
<<<

The array for foreign free calls is just an idea: if there are lots of foreign free calls, then a per-cpu array with a single global spinlock for all nodes, instead of a per-node spinlock, might be faster. But we need statistics to decide that.

--
	Manfred
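The indirection argument can be sketched in userspace C. The structure layouts, field names, and cpu-to-node mapping below are illustrative stand-ins, not the real slab internals; the point is only that the per-node placement costs two dependent pointer loads on the kmalloc/kfree fast path where the embedded per-cache array costs one:

```c
#include <assert.h>

#define MAX_NODES     4
#define NR_CPUS       8
#define CPUS_PER_NODE (NR_CPUS / MAX_NODES)

/* Illustrative stand-ins for the structures under discussion. */
struct kmem_cpudata  { int avail; };
struct kmem_nodedata {
	int free_objects;
	struct kmem_cpudata *cpus[CPUS_PER_NODE]; /* per-node placement */
};
struct kmem_cache_s {
	struct kmem_nodedata *nodes[MAX_NODES];
	struct kmem_cpudata  *cpus[NR_CPUS];      /* embedded per-cache array */
};

static struct kmem_cpudata  cpudata[NR_CPUS];
static struct kmem_nodedata nodedata[MAX_NODES];

/* Fast path with per-node placement: two dependent pointer loads. */
static struct kmem_cpudata *lookup_via_node(struct kmem_cache_s *c, int cpu)
{
	return c->nodes[cpu / CPUS_PER_NODE]->cpus[cpu % CPUS_PER_NODE];
}

/* Fast path with the embedded array: one pointer load. */
static struct kmem_cpudata *lookup_direct(struct kmem_cache_s *c, int cpu)
{
	return c->cpus[cpu];
}

/* Wire both layouts to the same per-cpu data and check they agree. */
static int layouts_agree(void)
{
	struct kmem_cache_s c;
	for (int n = 0; n < MAX_NODES; n++) {
		c.nodes[n] = &nodedata[n];
		for (int i = 0; i < CPUS_PER_NODE; i++)
			nodedata[n].cpus[i] = &cpudata[n * CPUS_PER_NODE + i];
	}
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		c.cpus[cpu] = &cpudata[cpu];
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (lookup_direct(&c, cpu) != lookup_via_node(&c, cpu))
			return 0;
	return 1;
}
```

Both lookups resolve to the same per-cpu structure; the embedded array simply removes the intermediate node dereference from the hot path.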
From: Luck, T. <ton...@in...> - 2001-12-10 18:57:20
Oh, one other gaping hole in my current plans for this code is the issue of what to do when the desired node is out of memory. For some objects that are going to exist for a very long time, we might want to wait until some node-local memory becomes available (this presupposes that we have a mechanism to encourage that to occur, rather than just hoping that it will :-) On the other hand, most objects seem to be quite short lived, so it would seem sensible to follow some alternate-node search strategy (e.g. using the cpumemset lists to define which alternate node to use). However, this would produce a horrific knee in the performance curve when a node runs low on memory (moving from a (mostly) lockless allocation referencing only local node memory to one that requires acquisition of remote lock(s)), and it would probably also increase the number of foreign returns.

-Tony Luck

-----Original Message-----
From: Luck, Tony [mailto:ton...@in...]
Sent: Monday, December 10, 2001 10:00 AM
To: 'Manfred Spraul'; Jack Steiner
Cc: lse...@li...
Subject: RE: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

I'd better chime in on this thread to let you know where my efforts have led so far. I did post a note here a couple of weeks ago (titled: "Thinking about a node aware kmalloc") which you might want to dig out of the archives.

I think that I have most of the issues understood, and I've started tinkering with the code (where I expect I'll run into a few areas where it will become clear that I don't have all the bases covered :-)

My plan is to make kmem_cache_t look like:

	struct kmem_cache_s {
		char name[CACHE_NAMELEN];
		struct list_head next;
		struct kmem_node *nodes[PLAT_MAX_COMPACT_NODES];
	};

and to move all the other fields into the new "kmem_node" structure that will be allocated on node-local memory. The kmem_node structure will also have a pointer back to the kmem_cache_t, and has a reduced number of cpucache_t pointers (as it only has to cover the cpus on a node). I'm also going to have a "foreign returns" list for each node (see below).

My major concern with this is the frequency with which memory blocks are freed by a cpu that does not belong to the same node as the memory being freed (a "foreign return"). My implementation will require kfree() to grab a lock on the remote node to drop the item onto a list from which that node can reclaim it.

I read Manfred's note posted yesterday where he suggests having a per-cpu list for foreign returns, which would avoid the need for a lock when the item is freed ... but it seems to me that this only defers the problem. When the per-cpu foreign list fills up, that cpu will have to scan the whole list of foreign objects and give each of them back to the correct node. If there are a large number of nodes, this process may not benefit from much batching (e.g. with 32 nodes and a 120-element list we only have an average of 4 elements to give back to each node).

In my previous e-mail I described a crude benchmark that I performed to try to get some data. When we have a prototype implementation we will need to gather more data on larger systems to figure out the best approach.

-Tony Luck

-----Original Message-----
From: Manfred Spraul [mailto:ma...@co...]
Sent: Monday, December 10, 2001 9:00 AM
To: Jack Steiner
Cc: lin...@vg...; lse...@li...
Subject: Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters

Jack Steiner wrote:
>
>BTW, I think Tony Luck (at Intel) is currently changing the slab allocator
>to be numa-aware. Are you coordinating your work with his???
>

Thanks, I wasn't aware that he's working on it. I haven't started coding; I'm still collecting what's needed:

* Force certain alignments, e.g. ARM needs 1024-byte-aligned objects for the page tables.
* NUMA support.
* Add a "priority" to kmem_cache_shrink, to avoid every dcache/icache shrink causing an IPI to all cpus.
* If possible: replace the division in kmem_cache_free_one with a multiplication by the reciprocal. (I have a patch, but it's too ugly for inclusion.) Important for uniprocessor versions.
* Add reservation support - e.g. there must be a minimum number of bio structures available, otherwise the kernel could oom-deadlock. They must be available, not hidden in the per-cpu caches of the other cpus.

--
	Manfred

_______________________________________________
Lse-tech mailing list
Lse...@li...
https://lists.sourceforge.net/lists/listinfo/lse-tech
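The reciprocal trick Manfred mentions for kmem_cache_free_one can be illustrated with a small userspace sketch. This is the generic fixed-point method (precompute ceil(2^32/d), then divide with a multiply and a shift), not his actual patch. The result matches true division exactly as long as x*d stays well below 2^32, which comfortably covers the slab case of an offset within a page divided by an object size:

```c
#include <assert.h>
#include <stdint.h>

/* Precompute ceil(2^32 / d) once, e.g. at cache-creation time. */
static uint32_t reciprocal_value(uint32_t d)
{
	return (uint32_t)((((uint64_t)1 << 32) + d - 1) / d);
}

/* x / d computed as one multiply and one shift, no divide instruction. */
static uint32_t reciprocal_divide(uint32_t x, uint32_t r)
{
	return (uint32_t)(((uint64_t)x * r) >> 32);
}

/* Exhaustively check the slab-relevant range: every offset within one
 * 4096-byte page, divided by a few plausible object sizes. */
static int reciprocal_exact_for_slab_range(void)
{
	static const uint32_t sizes[] = { 8, 32, 96, 100, 1024, 4000 };
	for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		uint32_t d = sizes[i], r = reciprocal_value(d);
		for (uint32_t x = 0; x < 4096; x++)
			if (reciprocal_divide(x, r) != x / d)
				return 0;
	}
	return 1;
}
```

This matters most on cpus where integer division is slow; the later mainline kernel adopted essentially this scheme as reciprocal_divide() in the slab code.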
From: Luck, T. <ton...@in...> - 2001-12-10 19:46:10
Manfred Spraul wrote:
> But I think the pointers to the per-cpu structures should be part of
> kmem_cache_s: every kmalloc and kfree must look up the per-cpu
> structure, and embedding them directly would avoid one level of indirection.

Yes, you are right (and that removes my need to invent a "logical cpu number on node" that I was going to need ... thanks!)

> The array for foreign free calls is just an idea: if there are lots of
> foreign free calls, then a per-cpu array with a single global spinlock
> for all nodes, instead of a per-node spinlock, might be faster.
> But we need statistics to decide that.

It should be relatively easy to switch algorithms once we have the basic infrastructure up and running ... then we can get some real data and see which way to go.

-Tony Luck
From: Paul M. <Pau...@us...> - 2001-12-11 20:06:15
One way to keep the "horrific knee" down to a dull roar is to make sure that when you do go get memory from a remote node, you get a bunch of it. The extra memory can be cached on the local CPU (or node), greatly reducing the overhead of subsequent requests. This same caching can help reduce the overhead of freeing remote memory (e.g., for memory allocated by a CPU on one node and freed by a CPU on some other node).

All of this must be carefully balanced, as there is a tradeoff between efficient handling of out-of-memory situations and the cost/complexity of the normal case.

						Thanx, Paul

> Oh, one other gaping hole in my current plans for this
> code is the issue of what to do when the desired node is
> out of memory. For some objects that are going to exist
> for a very long time, we might want to wait until some
> node-local memory becomes available (this presupposes that
> we have a mechanism to encourage that to occur, rather than
> just hoping that it will :-) On the other hand, most objects
> seem to be quite short lived, so it would seem sensible to
> follow some alternate node search strategy (e.g. using the
> cpumemset lists to define which alternate node to use).
> However, this would provide a horrific knee in the performance
> curve when a node runs low on memory (moving from a (mostly)
> lockless allocation referencing only local node memory to one
> that requires acquisition of remote lock(s)) [Plus would
> probably increase the number of foreign returns].
>
> -Tony Luck
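Paul's batching suggestion can be made concrete with a toy model. Everything here is hypothetical scaffolding (the BATCH value, the counter standing in for a real spinlock): a local per-cpu cache refills a whole batch of objects per remote-lock round, so the lock cost is amortized over BATCH allocations instead of being paid on every one:

```c
#include <assert.h>

#define BATCH 16  /* objects transferred per remote-lock acquisition */

struct remote_node {
	int lock_acquisitions;  /* stand-in for spinlock contention cost */
	int free_objects;
};

struct cpu_cache {
	int avail;              /* objects cached locally; storage elided */
};

/* Refill grabs the remote lock once and takes up to BATCH objects. */
static void refill_from_remote(struct cpu_cache *cc, struct remote_node *rn)
{
	rn->lock_acquisitions++;  /* spin_lock(&rn->lock) */
	int take = rn->free_objects < BATCH ? rn->free_objects : BATCH;
	rn->free_objects -= take;
	cc->avail += take;        /* spin_unlock(&rn->lock) */
}

static void *alloc_one(struct cpu_cache *cc, struct remote_node *rn)
{
	if (cc->avail == 0)
		refill_from_remote(cc, rn);
	if (cc->avail == 0)
		return 0;             /* remote node is empty too */
	cc->avail--;
	return (void *)1;         /* placeholder object */
}

/* Count remote lock rounds needed to satisfy nallocs allocations. */
static int remote_locks_for(int nallocs)
{
	struct remote_node rn = { 0, 1024 };
	struct cpu_cache cc = { 0 };
	for (int i = 0; i < nallocs; i++)
		alloc_one(&cc, &rn);
	return rn.lock_acquisitions;
}
```

With BATCH = 16, sixty-four allocations touch the remote lock only four times; the tradeoff Paul notes is that the batched objects sit idle in the local cache, which costs memory and complicates the out-of-memory paths.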
From: Paul M. <Pau...@us...> - 2001-12-11 20:07:19
> I read Manfred's note posted yesterday where he suggests having
> a per-cpu list for foreign returns, which would avoid a need for
> a lock when the item is freed ... but it seems to me that this
> only defers the problem. When the per-cpu foreign list is filled,
> that cpu will have to scan the whole list of foreign objects and
> give each of them back to the correct node. If there are a large
> number of nodes, this process may not benefit from much batching
> (e.g. with 32 nodes and a 120 element list we only have an average
> of 4 elements to be given to each node).

One way to fix this is to have a separate foreign list for each node on each CPU. This means n*(m-1) lists on a system with n CPUs and m memory nodes. Each list must be allowed to grow to the point where the overhead of grabbing the foreign lock is balanced against the time required to do the allocations/frees locally. In past efforts with other OSes, each individual list needed to hold on the order of a few tens of foreign elements.

This can be quite a bit of memory on large systems, but then again, large systems tend to have quite a bit of memory. For whatever it is worth, one's mileage can vary in different OSes on different hardware architectures...

						Thanx, Paul
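The per-(cpu, node) list idea can likewise be sketched as a toy model (all names and the FLUSH_AT threshold below are illustrative). Because each list holds objects belonging to exactly one home node, a flush hands its entire batch to that single node under one lock acquisition, avoiding the 32-way split of a single mixed foreign list that Tony worries about:

```c
#include <assert.h>
#include <string.h>

#define NR_CPUS   4
#define MAX_NODES 4
#define FLUSH_AT  32  /* "a few tens of foreign elements" per list */

struct node_slab {
	int lock_acquisitions;  /* stand-in for the per-node spinlock */
	int free_objects;
};

/* One list per (cpu, home-node) pair: n*(m-1) lists in use, since
 * objects belonging to the cpu's own node never land here. */
struct foreign_lists {
	int count[NR_CPUS][MAX_NODES];  /* object storage elided */
};

/* Return one full batch to its home node under a single lock round. */
static void flush_list(struct foreign_lists *fl, struct node_slab *nodes,
		       int cpu, int node)
{
	nodes[node].lock_acquisitions++;  /* spin_lock(&node->lock) */
	nodes[node].free_objects += fl->count[cpu][node];
	fl->count[cpu][node] = 0;         /* spin_unlock(&node->lock) */
}

/* A foreign free: buffer locally, lock-free; take the remote lock
 * only when this (cpu, node) list reaches FLUSH_AT. */
static void free_foreign(struct foreign_lists *fl, struct node_slab *nodes,
			 int cpu, int home_node)
{
	if (++fl->count[cpu][home_node] == FLUSH_AT)
		flush_list(fl, nodes, cpu, home_node);
}

/* Count lock rounds for nfrees foreign frees from cpu 0 to node 2. */
static int locks_for(int nfrees)
{
	struct node_slab nodes[MAX_NODES];
	struct foreign_lists fl;
	memset(nodes, 0, sizeof(nodes));
	memset(&fl, 0, sizeof(fl));
	for (int i = 0; i < nfrees; i++)
		free_foreign(&fl, nodes, 0, 2);
	return nodes[2].lock_acquisitions;
}
```

The memory cost Paul mentions falls out directly: with tens of elements per list and n*(m-1) lists, a 32-node, 128-cpu system carries on the order of a hundred thousand buffered object slots.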