From: Christoph L. <cla...@en...> - 2005-08-18 17:24:49
On Thu, 18 Aug 2005, Samuel Thibault wrote:

> Indeed, but I guess there are a lot of such little optimizations here
> and there that could be relatively easily fixed, for a not-so-little
> benefit.

Get on it :-) I hope the kmalloc_node stuff etc. that was recently added
is enough for most structures. Note that there is a new rev of the slab
allocator in Andrew's tree that will make kmalloc_node as fast as
kmalloc.
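[Editor's note: for readers unfamiliar with the API Christoph mentions, here is a minimal sketch of a kmalloc_node() call site. It is a kernel fragment, so it only compiles in kernel context; the struct and helper names are hypothetical, not from this thread.]

```c
/* Illustrative kernel fragment: allocate a structure on the node that
 * backs a given CPU, instead of from the current CPU's slab cache.
 * 'struct my_data' and 'alloc_on_home_node' are made-up names. */
#include <linux/slab.h>
#include <linux/topology.h>

struct my_data {
        int owner_cpu;
        /* ... */
};

static struct my_data *alloc_on_home_node(int cpu)
{
        int node = cpu_to_node(cpu);    /* NUMA node backing this CPU */

        /* Like kmalloc(), but the memory is taken from 'node'. */
        return kmalloc_node(sizeof(struct my_data), GFP_KERNEL, node);
}
```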
From: Robin H. <ho...@sg...> - 2005-08-18 18:29:01
On Thu, Aug 18, 2005 at 04:08:29PM +0200, Samuel Thibault wrote:

> A solution would be to add to copy_process(), dup_task_struct(),
> alloc_task_struct() and kmem_cache_alloc() the node number on which
> allocation should be performed. This might also be useful if performing
> node load balancing at fork(): one could then allocate task_t directly
> on the new node. It might also be useful when allocating data for
> another node.

Can this be abstracted some? Let me start with some background.

On our previous kernel release, SGI shipped a kernel addition we called
dplace. It has a userland piece and a library which passes configuration
information into a kernel driver. Inside the kernel, we used Process
Aggregates (pagg, as found on oss.sgi.com) to track children of a
starting process and migrate them to a desired CPU.

The problem we have with this method is that the callout to pagg happens
far too late after the fork to help with some of the more important user
structures, like page tables. We find that most processes have their pgd
and many parts of the pmd allocated remotely. Although this is not a
significant source of NUMA traffic, it does cause variability in process
run times, which becomes exaggerated on larger MPI jobs that rendezvous
at a barrier.

It would be nice to be able to decide, early in fork, on a destination
NUMA node and CPU list for the task. If this were done, then changing
the allocation of structures like the task_t and page tables could be
handled on a case-by-case basis as we see benefit. Additionally, it
would be nice if the placement decision logic provided a callback so we
could add tailored placement.

I realize this is a very vague sketch of what I think needs to be done,
but I am in a rush right now and wanted to at least start the
discussion.

Thanks,
Robin
From: Eric D. <da...@co...> - 2005-08-19 00:31:50
Samuel Thibault wrote:

> Hi,
>
> Currently, the task_t structure of the idle task is always allocated
> on CPU0, hence on node 0: while booting, for each CPU, CPU 0 calls
> fork_idle(), hence copy_process(), hence dup_task_struct(), hence
> alloc_task_struct(), hence kmem_cache_alloc(), which picks up memory
> from the allocation cache of the current CPU, i.e. on node 0.
>
> This is a bad idea: every write needs to be written back to node 0 at
> some point, so node 0 can get a bit busy, especially when the other
> nodes are idle.
>
> A solution would be to add to copy_process(), dup_task_struct(),
> alloc_task_struct() and kmem_cache_alloc() the node number on which
> allocation should be performed. This might also be useful if performing
> node load balancing at fork(): one could then allocate task_t directly
> on the new node. It might also be useful when allocating data for
> another node.
>
> Regards,
> Samuel

An idle task should block itself, and hence not touch its task_t
structure very much. I believe IRQ stacks are also allocated on node 0;
that seems more serious.

Eric
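[Editor's note: a hedged sketch of the plumbing Samuel proposes, i.e. threading a NUMA node id from the fork path down to the slab allocator. This is a kernel fragment, not standalone-compilable; the `_node` function names are illustrative, and the real kernel's dup_task_struct() does more than shown here.]

```c
/* Illustrative kernel fragment: pass an explicit node down to the slab
 * allocator so the task_struct lands on the target CPU's node instead
 * of the booting CPU's (node 0). 'task_struct_cachep' stands in for
 * the task_struct slab cache in kernel/fork.c. */
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/topology.h>

static struct task_struct *alloc_task_struct_node(int node)
{
        /* kmem_cache_alloc_node() is the node-aware counterpart of
         * kmem_cache_alloc(). */
        return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node);
}

static struct task_struct *dup_task_struct_node(struct task_struct *orig,
                                                int cpu)
{
        struct task_struct *tsk;

        /* Allocate on the node backing the CPU the task will run on. */
        tsk = alloc_task_struct_node(cpu_to_node(cpu));
        if (!tsk)
                return NULL;
        *tsk = *orig;   /* shallow copy, as in dup_task_struct() */
        return tsk;
}
```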
From: Samuel T. <sam...@en...> - 2005-08-18 19:50:04
Eric Dumazet, on Thu 18 Aug 2005 17:18:55 +0200, wrote:

> I believe IRQ stacks are also allocated on node 0; that seems more
> serious.

For the i386 architecture at least, yes: they are statically defined in
arch/i386/kernel/irq.c, while they could be per_cpu.

Regards,
Samuel
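[Editor's note: a sketch of the change Samuel suggests, moving the statically defined i386 IRQ stacks into per-CPU data. The `union irq_ctx` layout mirrors the one in arch/i386/kernel/irq.c of that era; this is a kernel fragment and an assumption about the shape of the fix, not a tested patch.]

```c
/* Illustrative kernel fragment: replace static, node-0-resident IRQ
 * stack arrays with per-CPU definitions, so a NUMA-aware per_cpu
 * allocator could place each CPU's stack on its own node. */
#include <linux/percpu.h>
#include <linux/sched.h>

union irq_ctx {
        struct thread_info tinfo;
        u32                stack[THREAD_SIZE / sizeof(u32)];
};

/* Instead of something like:
 *     static union irq_ctx hardirq_ctx[NR_CPUS];
 * define one instance per CPU: */
static DEFINE_PER_CPU(union irq_ctx, hardirq_ctx);
static DEFINE_PER_CPU(union irq_ctx, softirq_ctx);
```

Note that this only helps if the per_cpu areas themselves are allocated node-locally, which is exactly the point raised in the follow-up messages.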
From: Samuel T. <sam...@en...> - 2005-08-18 20:03:34
Samuel Thibault, on Thu 18 Aug 2005 21:49:41 +0200, wrote:

> Eric Dumazet, on Thu 18 Aug 2005 17:18:55 +0200, wrote:
> > I believe IRQ stacks are also allocated on node 0; that seems more
> > serious.
>
> For the i386 architecture at least, yes: they are statically defined in
> arch/i386/kernel/irq.c, while they could be per_cpu.

Hmm, but the per_cpu areas for i386 are not NUMA-aware... I'm wondering:
isn't the current x86_64 NUMA-aware implementation of per_cpu generic
enough for any architecture?

Regards,
Samuel
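[Editor's note: for context, a sketch of what a NUMA-aware per-CPU area setup looks like, in the spirit of the x86_64 code Samuel refers to: each CPU's copy of the per-CPU section is allocated from that CPU's own node at boot. Kernel fragment, names and details illustrative of the 2005-era boot allocator rather than copied from any tree.]

```c
/* Illustrative kernel fragment: NUMA-aware setup_per_cpu_areas().
 * For each CPU, carve its per-CPU block out of its own node's memory
 * and record the offset used by per_cpu() accessors. */
#include <linux/bootmem.h>
#include <linux/mmzone.h>
#include <linux/string.h>
#include <linux/topology.h>

static void __init setup_per_cpu_areas_numa(void)
{
        unsigned long size = __per_cpu_end - __per_cpu_start;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                /* Allocate the block from the CPU's home node, not
                 * from the boot node. */
                void *ptr = alloc_bootmem_node(
                                NODE_DATA(cpu_to_node(cpu)), size);

                /* Seed it with the initial per-CPU data image. */
                memcpy(ptr, __per_cpu_start, size);
                __per_cpu_offset[cpu] =
                        (unsigned long)ptr - (unsigned long)__per_cpu_start;
        }
}
```

As Andi notes below in the thread, this scheme only works if the CPU-to-node mapping is already known when it runs, which is where SRAT-based configurations ran into trouble.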
From: Martin J. B. <mb...@mb...> - 2005-08-18 21:28:48
--On Thursday, August 18, 2005 22:02:55 +0200 Samuel Thibault
<sam...@en...> wrote:

> Samuel Thibault, on Thu 18 Aug 2005 21:49:41 +0200, wrote:
>> Eric Dumazet, on Thu 18 Aug 2005 17:18:55 +0200, wrote:
>>> I believe IRQ stacks are also allocated on node 0; that seems more
>>> serious.
>>
>> For the i386 architecture at least, yes: they are statically defined in
>> arch/i386/kernel/irq.c, while they could be per_cpu.
>
> Hmm, but the per_cpu areas for i386 are not NUMA-aware... I'm wondering:
> isn't the current x86_64 NUMA-aware implementation of per_cpu generic
> enough for any architecture?

All ZONE_NORMAL on ia32 is on node 0, so I don't think it'll help.

M.
From: Andi K. <ak...@su...> - 2005-08-18 21:32:50
On Thu, Aug 18, 2005 at 10:02:55PM +0200, Samuel Thibault wrote:

> Samuel Thibault, on Thu 18 Aug 2005 21:49:41 +0200, wrote:
> > Eric Dumazet, on Thu 18 Aug 2005 17:18:55 +0200, wrote:
> > > I believe IRQ stacks are also allocated on node 0; that seems more
> > > serious.
> >
> > For the i386 architecture at least, yes: they are statically defined in
> > arch/i386/kernel/irq.c, while they could be per_cpu.
>
> Hmm, but the per_cpu areas for i386 are not NUMA-aware... I'm wondering:
> isn't the current x86_64 NUMA-aware implementation of per_cpu generic
> enough for any architecture?

Actually, it's now broken for many x86-64 configurations that use SRAT,
because we assign the nodes to CPUs only after this code runs. I was
considering removing it.

-Andi