From: Paul J. <pj...@sg...> - 2004-08-03 07:59:53
Earlier, I (pj) wrote:
> It has poor cache performance on big iron.  For a modest job on a big
> system, the allocator has to walk down an average of 128 out of 256 zone
> pointers in the list, dereferencing each one into the zone struct, then
> into the struct pglist_data, before it finds one that matches an allowed
> node id.  That's a nasty memory footprint for a hot code path.

This paragraph is B.S.  Most tasks are running on CPUs that are on nodes
whose memory they are allowed to use.  That node is at the front of the
local zonelist, and they get their memory on the first node they look at.

Damn ... hate it when that happens ;).

Still, either MPOL_BIND needs a more numa-friendly set of zonelists,
having a differently sorted list for each node in the set, or its
usefulness for binding to more than one or a few very close nodes, if
you care about memory performance, falls off quickly as the number of
nodes increases.

As you well know, any such numa-friendly set of sorted zonelists will
require space on the order of N**2, for N the node count, given the
NULL-terminated linear-list form in which they must be handed to
__alloc_pages.

I suspect that the English phrase you are searching for now to tell me
is "if it hurts, don't use it ;)."  That is, you are clearly advising me
not to use MPOL_BIND if I need a fancy zonelist sort.

The place I ran into the most complexity doing this in the 2.4 kernel
was in the per-memory-region binding.  You're dealing with this in the
2.6 kernels, and when you get to stuff like shared memory and huge
pages, it's not easy.  At least the vma splitting code is better in 2.6
than it was in 2.4.

Whatever I do for cpusets must _not_ duplicate your virtual address
range specific work (mbind).  Too much detail to be done twice.

Andi wrote:
> My first reaction is that if you really want to do that, just pass
> the policy node bitmap to alloc_pages and try_to_free_pages
> and use the normal per-node zone list with the bitmap as filter.
Pass in, or add to task_struct?  I can imagine adding a:

	nodemask_t mems_allowed;

to task_struct, and ending up with a CONFIG_CPUSET-enabled macro called
in a few places in __alloc_pages() and try_to_free_pages() that amounts
to:

	if (!in_interrupt())
		if (!node_isset(z->zone_pgdat->node_id, current->mems_allowed))
			continue;

In any event, cpusets provides the larger "container" on bigger numa
systems, and mbind/mempolicy provides the more detailed, and vma
specific, placement within the container (or within the entire system if
cpusets is not configured).

I'll try coding this up and see how it looks.  I welcome your further
comments.

Thank-you.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj...@sg...> 1.650.933.1373