From: Erich F. <ef...@es...> - 2002-09-24 21:09:36
|
On Monday 23 September 2002 20:47, Martin J. Bligh wrote:
> > I have two problems with this approach:
> > 1: Freeing memory is quite expensive, as it currently involves finding
> > the maximum of the array node_mem[].
>
> Bleh ... why? This needs to be calculated much more lazily than this,
> or you're going to kick the hell out of any cache affinity. Can you
> recalc this in the rebalance code or something instead?

You're right, that would be too slow. I'm thinking of marking the tasks
that need recalculation and updating their homenode when their runqueue
is scanned for a task to be stolen.

> > 2: I have no idea how tasks sharing the mm structure will behave. I'd
> > like them to run on different nodes (that's why node_mem is not in mm),
> > but they could (legally) free pages which they did not allocate and
> > have wrong values in node_mem[].
>
> Yes, that really ought to be per-process, not per task. Which means
> locking or atomics ... and overhead. Ick.

Hmm, I think it is sometimes ok to have it per task. For example OpenMP
parallel jobs working on huge arrays. The "first touch" of these arrays
leads to pagefaults generated by the different tasks and thus different
node_mem[] arrays for each task. As long as they just allocate memory,
all is well. If they only release it at the end of the job, too. This
probably goes wrong if we have a long-running task that spawns short-lived
clones. They inherit the node_mem from the parent, but pages added by
them to the common mm are not reflected in the parent's node_mem after
their death.

> For the first cut of the NUMA sched, maybe you could just leave page
> allocation alone, and do that separately? Or is that what the second
> patch was meant to be?

The first patch needs a correction: in load_balance(), add

	if (!busiest) goto out;

after the call to find_busiest_queue(). This works alone. On top of this
pooling NUMA scheduler we can put the node affinity approach that fits
best, with or without memory allocation.

I'll update the patches and their setup code (thanks for the comments!)
and resend them.

Regards,
Erich
|
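Erich's lazy-recalculation idea can be sketched in plain userspace C. The `node_mem[]` name follows the thread; everything else (the struct, the stale flag, the helper names) is illustrative, not from the actual patch. The point is that freeing pages only sets a flag, and the arg-max over `node_mem[]` is recomputed at most once, at the next point the scheduler looks at the task:

```c
#include <assert.h>

#define MAX_NUMNODES 8

/* Userspace model: each task keeps a per-node page count, and its
 * "homenode" (the arg-max of node_mem[]) is cached and only
 * recomputed lazily when marked stale. */
struct task_model {
	unsigned long node_mem[MAX_NUMNODES]; /* pages owned per node */
	int homenode;                         /* cached arg-max */
	int homenode_stale;                   /* set on alloc/free */
};

/* Fast path (e.g. freeing pages): just mark the cache stale. */
static void task_free_pages(struct task_model *t, int node, unsigned long n)
{
	t->node_mem[node] -= n;
	t->homenode_stale = 1;
}

/* Slow path, called only when the runqueue is scanned for a task to
 * steal: recompute the maximum of node_mem[] if and only if needed. */
static int task_homenode(struct task_model *t)
{
	if (t->homenode_stale) {
		int node, best = 0;
		for (node = 1; node < MAX_NUMNODES; node++)
			if (t->node_mem[node] > t->node_mem[best])
				best = node;
		t->homenode = best;
		t->homenode_stale = 0;
	}
	return t->homenode;
}
```

This keeps the O(MAX_NUMNODES) scan out of the page-free path, which was Martin's objection to recomputing eagerly.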
From: Martin J. B. <mb...@ar...> - 2002-09-24 21:19:37
|
>> > 2: I have no idea how tasks sharing the mm structure will behave. I'd
>> > like them to run on different nodes (that's why node_mem is not in mm),
>> > but they could (legally) free pages which they did not allocate and
>> > have wrong values in node_mem[].
>>
>> Yes, that really ought to be per-process, not per task. Which means
>> locking or atomics ... and overhead. Ick.
>
> Hmm, I think it is sometimes ok to have it per task. For example OpenMP
> parallel jobs working on huge arrays. The "first touch" of these arrays
> leads to pagefaults generated by the different tasks and thus different
> node_mem[] arrays for each task. As long as they just allocate memory,
> all is well. If they only release it at the end of the job, too. This
> probably goes wrong if we have a long-running task that spawns
> short-lived clones. They inherit the node_mem from the parent, but
> pages added by them to the common mm are not reflected in the parent's
> node_mem after their death.

But you're left with a choice of whether to base decisions on the
per-task or the per-process information:

1. Per-process requires cross-node collation for a data read. Bad.
2. Per-task leads to obviously bad decisions when there is a significant
   amount of shared data between the tasks of a process (which was often
   the point of making them threads in the first place).

Yes, I can imagine a situation for which it would work, as you describe
above ... but that's not the point - this is a general policy, and I
don't think it works in general ... as you say yourself, "it is
*sometimes* ok" ;-)

> The first patch needs a correction: in load_balance(), add
>
> 	if (!busiest) goto out;
>
> after the call to find_busiest_queue(). This works alone. On top of
> this pooling NUMA scheduler we can put the node affinity approach that
> fits best, with or without memory allocation. I'll update the patches
> and their setup code (thanks for the comments!) and resend them.

Excellent! Will try out the correction above and get back to you. Might
be a day or so - I'm embroiled in some other code at the moment.

M.
|
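The two accounting schemes being weighed above can be modelled in userspace C. All structure and function names here are illustrative; this is a sketch of the trade-off, not of either patch. The per-process view needs atomics (Martin's locking/overhead objection), while the per-task view silently loses the pages a short-lived clone faulted into the shared mm (Erich's failure case):

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_NUMNODES 4

/* Per-process scheme: one shared, atomically updated array. Every
 * fault pays an atomic RMW, and reading the arg-max collates counters
 * that tasks on other nodes are writing. */
struct mm_model {
	atomic_ulong node_mem[MAX_NUMNODES];
};

/* Per-task scheme: a private array, no atomics - but it dies with the
 * task, so a clone's contribution to the shared mm disappears. */
struct clone_model {
	unsigned long node_mem[MAX_NUMNODES];
};

/* A task first-touching pages updates both views. */
static void fault_in(struct mm_model *mm, struct clone_model *self,
		     int node, unsigned long pages)
{
	atomic_fetch_add(&mm->node_mem[node], pages); /* shared view  */
	self->node_mem[node] += pages;                /* private view */
}
```

If a clone faults pages on node 1 and then exits, the shared `mm_model` still knows about them, but the parent's private `clone_model` never will, which is exactly the divergence described in the quoted paragraph.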
From: Martin J. B. <mb...@ar...> - 2002-09-22 15:54:19
|
A few comments ... mainly just i386 arch things (which aren't really your
fault, it was just a lack of the mechanisms being in mainline), and a
request to push a couple of things down into the arch trees to keep them
from rearing their ugly heads in generic code ;-)

M.

> +static int __initdata nr_lnodes = 0;

Use numnodes.

> 	cpu = ++cpucount;
> +#ifdef CONFIG_NUMA_SCHED
> +	cell = SAPICID_TO_PNODE(apicid);
> +	if (pnode_to_lnode[cell] < 0) {
> +		pnode_to_lnode[cell] = nr_lnodes++;
> +	}
> +#endif

pnodes and lnodes are all 1-1, so they're just called nodes for i386, and
there's no such thing as a SAPICID on this platform.

> 	/*
> 	 * We can't use kernel_thread since we must avoid to
> 	 * reschedule the child.
> @@ -996,6 +1004,10 @@ static void __init smp_boot_cpus(unsigne
> 	set_bit(0, &cpu_callout_map);
> 	boot_cpu_logical_apicid = logical_smp_processor_id();
> 	map_cpu_to_boot_apicid(0, boot_cpu_apicid);
> +#ifdef CONFIG_NUMA_SCHED
> +	pnode_to_lnode[SAPICID_TO_PNODE(boot_cpu_apicid)] = nr_lnodes++;
> +	printk("boot_cpu_apicid = %d, nr_lnodes = %d, lnode = %d\n",
> +		boot_cpu_apicid, nr_lnodes, pnode_to_lnode[0]);
> +#endif

Ditto. All these mappings exist already; no need to reinvent them. The
-mm tree has a generic cpu_to_node macro, which keys off the
logical_apic_id:

http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.37/2.5.37-mm1/broken-out/topology-api.patch

> +#ifdef CONFIG_NUMA_SCHED
> +#define NR_NODES 8

MAX_NUMNODES exists for every arch (except maybe ia64 ;-))

> +#define cpu_physical_id(cpuid) (cpu_to_physical_apicid(cpuid))

Not needed; this should be buried within a wrapper, not exposed to
generic code.

> +#define SAPICID_TO_PNODE(hwid) (hwid >> 4)

Grrr. No such thing.

> diff -urNp a/include/linux/sched.h b/include/linux/sched.h
> +extern int pnode_to_lnode[NR_NODES];
> +extern char lnode_number[NR_CPUS] __cacheline_aligned;

Can't you push all this down into the arch ...

> +#define CPU_TO_NODE(cpu) lnode_number[cpu]

... by letting the arches define cpu_to_node() themselves? (Most people
don't have lnodes and pnodes, etc.)

> +EXPORT_SYMBOL(runqueues_address);
> +EXPORT_SYMBOL(numpools);
> +EXPORT_SYMBOL(pool_ptr);
> +EXPORT_SYMBOL(pool_cpus);
> +EXPORT_SYMBOL(pool_nr_cpus);
> +EXPORT_SYMBOL(pool_mask);
> +EXPORT_SYMBOL(sched_migrate_task);

Aren't these internal scheduler things?

> diff -urNp a/kernel/sched.c b/kernel/sched.c
> +int pnode_to_lnode[NR_NODES] = { [0 ... NR_NODES-1] = -1 };
> +char lnode_number[NR_CPUS] __cacheline_aligned;

Ditto.

> +	int this_pool = CPU_TO_NODE(this_cpu);
> +	int this_pool=CPU_TO_NODE(this_cpu), weight, maxweight=0;

How come you can use the CPU_TO_NODE abstraction here ...

> +	/* build translation table for CPU_TO_NODE macro */
> +	for (i = 0; i < NR_CPUS; i++)
> +		if (cpu_online(i))
> +			lnode_number[i] = pnode_to_lnode[SAPICID_TO_PNODE(cpu_physical_id(i))];

... but not here?
|
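The table the quoted patch builds at boot can be modelled in userspace C. The idea is to translate each online CPU's hardware id to a node number once, so the hot `cpu_to_node()` lookup is a single array read. The `hwid >> 2` decoding rule below is made up for the example (the quoted patch used `hwid >> 4` on SAPIC-style ids); the helper names are illustrative stand-ins, not kernel APIs:

```c
#include <assert.h>

#define NR_CPUS 8

/* Userspace model of the lnode_number[] translation table: filled in
 * once at "boot", read on every scheduling decision afterwards. */
static int node_of_cpu[NR_CPUS];

/* Stand-ins for the firmware/arch queries the real code would make. */
static int hwid_of_cpu(int cpu)   { return cpu; }
static int hwid_to_node(int hwid) { return hwid >> 2; } /* 4 CPUs/node */

/* The one place the hardware-id decoding happens (the loop Martin is
 * quoting); everywhere else uses the cheap macro below. */
static void build_cpu_to_node_table(int online_cpus)
{
	int cpu;
	for (cpu = 0; cpu < online_cpus; cpu++)
		node_of_cpu[cpu] = hwid_to_node(hwid_of_cpu(cpu));
}

#define cpu_to_node(cpu) (node_of_cpu[cpu])
```

This also shows why Martin's last question answers itself: the table-building loop is the one spot that cannot use the macro, because it is what makes the macro work.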
From: Martin J. B. <mb...@ar...> - 2002-09-22 19:26:27
|
>> +	int this_pool = CPU_TO_NODE(this_cpu);
>> +	int this_pool=CPU_TO_NODE(this_cpu), weight, maxweight=0;
>
> How come you can use the CPU_TO_NODE abstraction here ...
>
>> +	/* build translation table for CPU_TO_NODE macro */
>> +	for (i = 0; i < NR_CPUS; i++)
>> +		if (cpu_online(i))
>> +			lnode_number[i] = pnode_to_lnode[SAPICID_TO_PNODE(cpu_physical_id(i))];
>
> ... but not here?

Doh! Because you're building the list to use for CPU_TO_NODE,
obviously ;-) Sorry. It should still get buried back down into the arch
code, though.

M.
|
From: Matthew D. <col...@us...> - 2002-09-25 00:02:43
|
Martin J. Bligh wrote:
> A few comments ... mainly just i386 arch things (which aren't
> really your fault, was just a lack of the mechanisms being in
> mainline), and a request to push a couple of things down into
> the arch trees from rearing their ugly head into generic code ;-)

<SNIP>

>> +extern int pnode_to_lnode[NR_NODES];
>> +extern char lnode_number[NR_CPUS] __cacheline_aligned;
>
> Can't you push all this down into the arch ...
>
>> +#define CPU_TO_NODE(cpu) lnode_number[cpu]
>
> ... by letting them define cpu_to_node() themselves?
> (most people don't have lnodes and pnodes, etc).

Yep... Sorry to get into the thread late, but Erich, you should really
look at the latest in-kernel topology API stuff in the -mm tree. I'd be
happy to discuss it with you over email. Plus, I'd get another user of
the topology API (to help push it into mainline) and ia64 support, and
you'd get a generic topology mechanism that works (err, will work?) for
all architectures.

Email me for more info, or just check out the patch that Martin pointed
to! :)

Cheers!

-Matt
|
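The "push it down into the arch" shape being suggested throughout the thread can be sketched as a header-level C pattern: generic code sees only `cpu_to_node()`, an architecture may override it, and a trivial default covers flat (non-NUMA) machines. The macro names here are illustrative of the pattern, not a copy of the actual topology-api patch:

```c
#include <assert.h>

/* What <asm/topology.h> on a NUMA arch might do: define its own
 * mapping and announce it. On i386/NUMA-Q this would read a boot-built
 * table; here we fake it with a simple rule. */
#ifdef CONFIG_NUMA
#define cpu_to_node(cpu) ((cpu) >> 2) /* illustrative: 4 CPUs per node */
#define ARCH_HAS_CPU_TO_NODE
#endif

/* What a generic fallback header provides when the arch stayed silent:
 * everything lives on node 0, so generic code never needs #ifdefs. */
#ifndef ARCH_HAS_CPU_TO_NODE
#define cpu_to_node(cpu) (0)
#endif
```

Generic scheduler code can then call `cpu_to_node(cpu)` unconditionally, and names like lnode, pnode, and SAPICID stay hidden inside the arch that invented them.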