From: Christoph L. <cla...@en...> - 2005-08-18 17:24:49
On Thu, 18 Aug 2005, Samuel Thibault wrote:

> Indeed, but I guess there are a lot of such little optimizations here
> and there that could be relatively easily fixed, for a not-so-little
> benefit.

Get on it :-) I hope the kmalloc_node stuff etc. that was recently added
is enough for most structures. Note that there is a new rev of the slab
allocator in Andrew's tree that will make kmalloc_node as fast as
kmalloc.
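[Editor's note: for readers unfamiliar with the API Christoph mentions, here is a minimal sketch of a kmalloc_node() call site. It is a kernel fragment, so it only compiles in kernel context; the struct and helper names are hypothetical, not from this thread.]

```c
/* Illustrative kernel fragment: allocate a structure on the node that
 * backs a given CPU, instead of from the current CPU's slab cache.
 * 'struct my_data' and 'alloc_on_home_node' are made-up names. */
#include <linux/slab.h>
#include <linux/topology.h>

struct my_data {
        int owner_cpu;
        /* ... */
};

static struct my_data *alloc_on_home_node(int cpu)
{
        int node = cpu_to_node(cpu);    /* NUMA node backing this CPU */

        /* Like kmalloc(), but the memory is taken from 'node'. */
        return kmalloc_node(sizeof(struct my_data), GFP_KERNEL, node);
}
```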
From: Robin H. <ho...@sg...> - 2005-08-18 18:29:01
On Thu, Aug 18, 2005 at 04:08:29PM +0200, Samuel Thibault wrote:

> A solution would be to add to copy_process(), dup_task_struct(),
> alloc_task_struct() and kmem_cache_alloc() the node number on which
> allocation should be performed. This might also be useful if performing
> node load balancing at fork(): one could then allocate task_t directly
> on the new node. It might also be useful when allocating data for
> another node.

Can this be abstracted some? Let me start with some background.

On our previous kernel release, SGI shipped a kernel addition we called
dplace. It has a userland piece and a library which passes configuration
information into a kernel driver. Inside the kernel, we used Process
Aggregates (pagg, as found on oss.sgi.com) to track children of a
starting process and migrate them to a desired CPU.

The problem we have with this method is that the callout to pagg happens
far too late after the fork to help with some of the more important user
structures, like page tables. We find that most processes have their pgd
and many parts of the pmd allocated remotely. Although this is not a
significant source of NUMA traffic, it does cause variability in process
run times, which becomes exaggerated on larger MPI jobs that rendezvous
at a barrier.

It would be nice to be able to decide, early in fork, on a destination
NUMA node and CPU list for the task. If this were done, then changing
the allocation of structures like the task_t and page tables could be
handled on a case-by-case basis as we see benefit. Additionally, it
would be nice if the placement decision logic provided a callback so we
could add tailored placement.

I realize this is a very vague sketch of what I think needs to be done,
but I am in a rush right now and wanted to at least start the
discussion.

Thanks,
Robin
From: Eric D. <da...@co...> - 2005-08-19 00:31:50
Samuel Thibault wrote:

> Hi,
>
> Currently, the task_t structure of the idle task is always allocated
> on CPU0, hence on node 0: while booting, for each CPU, CPU 0 calls
> fork_idle(), hence copy_process(), hence dup_task_struct(), hence
> alloc_task_struct(), hence kmem_cache_alloc(), which picks up memory
> from the allocation cache of the current CPU, i.e. on node 0.
>
> This is a bad idea: every write needs to be written back to node 0 at
> some point, so node 0 can get a bit busy, especially when the other
> nodes are idle.
>
> A solution would be to add to copy_process(), dup_task_struct(),
> alloc_task_struct() and kmem_cache_alloc() the node number on which
> allocation should be performed. This might also be useful if performing
> node load balancing at fork(): one could then allocate task_t directly
> on the new node. It might also be useful when allocating data for
> another node.
>
> Regards,
> Samuel

An idle task should block itself, and hence not touch its task_t
structure very much. I believe IRQ stacks are also allocated on node 0;
that seems more serious.

Eric
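[Editor's note: a hedged sketch of the plumbing Samuel proposes, i.e. threading a NUMA node id from the fork path down to the slab allocator. This is a kernel fragment, not standalone-compilable; the `_node` function names are illustrative, and the real kernel's dup_task_struct() does more than shown here.]

```c
/* Illustrative kernel fragment: pass an explicit node down to the slab
 * allocator so the task_struct lands on the target CPU's node instead
 * of the booting CPU's (node 0). 'task_struct_cachep' stands in for
 * the task_struct slab cache in kernel/fork.c. */
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/topology.h>

static struct task_struct *alloc_task_struct_node(int node)
{
        /* kmem_cache_alloc_node() is the node-aware counterpart of
         * kmem_cache_alloc(). */
        return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node);
}

static struct task_struct *dup_task_struct_node(struct task_struct *orig,
                                                int cpu)
{
        struct task_struct *tsk;

        /* Allocate on the node backing the CPU the task will run on. */
        tsk = alloc_task_struct_node(cpu_to_node(cpu));
        if (!tsk)
                return NULL;
        *tsk = *orig;   /* shallow copy, as in dup_task_struct() */
        return tsk;
}
```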
From: Samuel T. <sam...@en...> - 2005-08-18 19:50:04
Eric Dumazet, on Thu 18 Aug 2005 17:18:55 +0200, wrote:

> I believe IRQ stacks are also allocated on node 0; that seems more
> serious.

For the i386 architecture at least, yes: they are statically defined in
arch/i386/kernel/irq.c, while they could be per_cpu.

Regards,
Samuel
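[Editor's note: a sketch of the change Samuel suggests, moving the statically defined i386 IRQ stacks into per-CPU data. The `union irq_ctx` layout mirrors the one in arch/i386/kernel/irq.c of that era; this is a kernel fragment and an assumption about the shape of the fix, not a tested patch.]

```c
/* Illustrative kernel fragment: replace static, node-0-resident IRQ
 * stack arrays with per-CPU definitions, so a NUMA-aware per_cpu
 * allocator could place each CPU's stack on its own node. */
#include <linux/percpu.h>
#include <linux/sched.h>

union irq_ctx {
        struct thread_info tinfo;
        u32                stack[THREAD_SIZE / sizeof(u32)];
};

/* Instead of something like:
 *     static union irq_ctx hardirq_ctx[NR_CPUS];
 * define one instance per CPU: */
static DEFINE_PER_CPU(union irq_ctx, hardirq_ctx);
static DEFINE_PER_CPU(union irq_ctx, softirq_ctx);
```

Note that this only helps if the per_cpu areas themselves are allocated node-locally, which is exactly the point raised in the follow-up messages.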
From: Samuel T. <sam...@en...> - 2005-08-18 20:03:34
Samuel Thibault, on Thu 18 Aug 2005 21:49:41 +0200, wrote:

> Eric Dumazet, on Thu 18 Aug 2005 17:18:55 +0200, wrote:
> > I believe IRQ stacks are also allocated on node 0; that seems more
> > serious.
>
> For the i386 architecture at least, yes: they are statically defined in
> arch/i386/kernel/irq.c, while they could be per_cpu.

Hmm, but the per_cpu areas for i386 are not NUMA-aware... I'm wondering:
isn't the current x86_64 NUMA-aware implementation of per_cpu generic
enough for any architecture?

Regards,
Samuel
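[Editor's note: for context, a sketch of what a NUMA-aware per-CPU area setup looks like, in the spirit of the x86_64 code Samuel refers to: each CPU's copy of the per-CPU section is allocated from that CPU's own node at boot. Kernel fragment, names and details illustrative of the 2005-era boot allocator rather than copied from any tree.]

```c
/* Illustrative kernel fragment: NUMA-aware setup_per_cpu_areas().
 * For each CPU, carve its per-CPU block out of its own node's memory
 * and record the offset used by per_cpu() accessors. */
#include <linux/bootmem.h>
#include <linux/mmzone.h>
#include <linux/string.h>
#include <linux/topology.h>

static void __init setup_per_cpu_areas_numa(void)
{
        unsigned long size = __per_cpu_end - __per_cpu_start;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                /* Allocate the block from the CPU's home node, not
                 * from the boot node. */
                void *ptr = alloc_bootmem_node(
                                NODE_DATA(cpu_to_node(cpu)), size);

                /* Seed it with the initial per-CPU data image. */
                memcpy(ptr, __per_cpu_start, size);
                __per_cpu_offset[cpu] =
                        (unsigned long)ptr - (unsigned long)__per_cpu_start;
        }
}
```

As Andi notes below in the thread, this scheme only works if the CPU-to-node mapping is already known when it runs, which is where SRAT-based configurations ran into trouble.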
From: Martin J. B. <mb...@mb...> - 2005-08-18 21:28:48
--On Thursday, August 18, 2005 22:02:55 +0200 Samuel Thibault
<sam...@en...> wrote:

> Samuel Thibault, on Thu 18 Aug 2005 21:49:41 +0200, wrote:
>> Eric Dumazet, on Thu 18 Aug 2005 17:18:55 +0200, wrote:
>>> I believe IRQ stacks are also allocated on node 0; that seems more
>>> serious.
>>
>> For the i386 architecture at least, yes: they are statically defined in
>> arch/i386/kernel/irq.c, while they could be per_cpu.
>
> Hmm, but the per_cpu areas for i386 are not NUMA-aware... I'm wondering:
> isn't the current x86_64 NUMA-aware implementation of per_cpu generic
> enough for any architecture?

All ZONE_NORMAL on ia32 is on node 0, so I don't think it'll help.

M.
From: Andi K. <ak...@su...> - 2005-08-18 21:32:50
On Thu, Aug 18, 2005 at 10:02:55PM +0200, Samuel Thibault wrote:

> Samuel Thibault, on Thu 18 Aug 2005 21:49:41 +0200, wrote:
> > Eric Dumazet, on Thu 18 Aug 2005 17:18:55 +0200, wrote:
> > > I believe IRQ stacks are also allocated on node 0; that seems more
> > > serious.
> >
> > For the i386 architecture at least, yes: they are statically defined in
> > arch/i386/kernel/irq.c, while they could be per_cpu.
>
> Hmm, but the per_cpu areas for i386 are not NUMA-aware... I'm wondering:
> isn't the current x86_64 NUMA-aware implementation of per_cpu generic
> enough for any architecture?

Actually, it's now broken for many x86-64 configurations that use SRAT,
because we assign the nodes to CPUs only after this code runs. I was
considering removing it.

-Andi