From: <j-n...@ce...> - 2004-02-24 09:29:36
Attachments:
ia64-numa-zoneordering.diff
|
Hello, The attached patch makes use of arch-dependent info for building zonelist. The patch uses ACPI SLIT for ia64. Other arch may have their own method to determine the order. This kind of ordering is very important for the NUMA system in which the distance between nodes is not uniform. The patch doing this was posted by Jesse Barnes in linux-ia64: http://marc.theaimsgroup.com/?t=106383477500001&r=1&w=2 however, I couldn't find it in current tree... The sorting can be extended to, for example, more fine grained round-robin like Erich suggested. But let's start from the simple one. Any comments? Best regards. -- NOMURA, Jun'ichi <j-n...@ce...> |
From: <jb...@sg...> - 2004-02-24 17:22:39
|
On Tue, Feb 24, 2004 at 06:20:28PM +0900, j-n...@ce... wrote: > The attached patch makes use of arch-dependent info for building zonelist. > The patch uses ACPI SLIT for ia64. > Other arch may have their own method to determine the order. > > This kind of ordering is very important for the NUMA system in which > the distance between nodes is not uniform. > > The patch doing this was posted by Jesse Barnes in linux-ia64: > http://marc.theaimsgroup.com/?t=106383477500001&r=1&w=2 > however, I couldn't find it in current tree... Yeah, I haven't pushed it yet (I didn't think it was ready yet and I haven't done a good version for 2.6 yet). > The sorting can be extended to, for example, more fine grained round-robin > like Erich suggested. But let's start from the simple one. > > Any comments? Yeah, it looks ok. What I was hoping to do in the patch that ultimately gets in: 1) make it arch independent this means having arch code populate a SLIT-like table for use by the generic zonelist building code 2) handle the cases that Erich talked about a bit better 3) some systems have pgdats w/o any CPUs associated with them, they need to be dealt with differently than regular nodes, maybe as extensions to an existing node The final routine might look something like (many thanks to pj for hitting me with a cluebat about this): /** * find_next_best_node - find the next node that should appear in a given * node's fallback list * @node: node whose fallback list we're appending * * We use a number of factors to determine which is the next node that should * appear on a given node's fallback list. The node should not have appeared * already in @node's fallback list, and it should be the next closest node * according to the distance array (which contains arbitrary distance values * from each node to each node in the system), and should also prefer nodes * with no CPUs, since presumably they'll have very little allocation pressure * on them otherwise. */ int find_next_best_node(int node) { int i, val, min_val, best_node; for (i = 0; i < numnodes; i++) { /* Don't want a node to appear more than once */ if (node_present(node, i)) continue; /* Use the distance array to find the distance */ val = node_distance(node, i); /* Give preference to headless and unused nodes */ val += nid_enabled_cpu_count[i] * 255; val += node_load[i]; if (val < min_val) { min_val = val; best_node = i; } } return best_node; } Jesse |
From: <j-n...@ce...> - 2004-02-25 05:10:34
|
Hi, > > The sorting can be extended to, for example, more fine grained round-robin > > like Erich suggested. But let's start from the simple one. > > > > Any comments? > > Yeah, it looks ok. Thank you. > What I was hoping to do in the patch that ultimately > gets in: > > 1) make it arch independent > this means having arch code populate a SLIT-like table for use by > the generic zonelist building code I would like to hear the comments from people on other arch. If the same ordering rule can be applicable for others, it's nice. > 2) handle the cases that Erich talked about a bit better > 3) some systems have pgdats w/o any CPUs associated with them, they > need to be dealt with differently than regular nodes, maybe as > extensions to an existing node > > The final routine might look something like (many thanks to pj for > hitting me with a cluebat about this): It looks cleaner than mine. I'll try it. Best regards. -- NOMURA, Jun'ichi <j-n...@ce...> |
From: <jb...@sg...> - 2004-02-25 17:03:33
|
On Wed, Feb 25, 2004 at 02:01:16PM +0900, j-n...@ce... wrote: > > 1) make it arch independent > > this means having arch code populate a SLIT-like table for use by > > the generic zonelist building code > > I would like to hear the comments from people on other arch. > If the same ordering rule can be applicable for others, it's nice. Martin, does a scheme like this sound ok with you? Arch specific code would populate a node distance table, which would be used to build each pgdat->zonelist in a smarter way than we do currently. > > 2) handle the cases that Erich talked about a bit better > > 3) some systems have pgdats w/o any CPUs associated with them, they > > need to be dealt with differently than regular nodes, maybe as > > extensions to an existing node > > > > The final routine might look something like (many thanks to pj for > > hitting me with a cluebat about this): > > It looks cleaner than mine. I'll try it. Cool, thanks. Jesse |
From: Martin J. B. <mb...@ar...> - 2004-02-26 22:31:17
|
--On Wednesday, February 25, 2004 08:54:09 -0800 Jesse Barnes <jb...@sg...> wrote: > On Wed, Feb 25, 2004 at 02:01:16PM +0900, j-n...@ce... wrote: >> > 1) make it arch independent >> > this means having arch code populate a SLIT-like table for use by >> > the generic zonelist building code >> >> I would like to hear the comments from people on other arch. >> If the same ordering rule can be applicable for others, it's nice. > > Martin, does a scheme like this sound ok with you? Arch specific code > would populate a node distance table, which would be used to build each > pgdat->zonelist in a smarter way than we do currently. Yeah, looks sensible to me. We probably ought to do this: +#ifndef node_distance +#define node_distance(from,to) (1) +#endif in the generic fallback topology headers, not in the mm/ .c files. Matt? Also, I seem to recall those build_zonelists functions are used for both NUMA and UMA ... now they're getting complex enough that it's probably worth making a specific non-NUMA version, if only for the sanity of 99% of the poor souls trying to work out how a UMA machine lays it out ;-) It looks like it won't change ordering for existing boxes with single layer flat NUMA topologies (round-robin), but we probably ought to check that carefully ;-) M. |
From: Matthew D. <col...@us...> - 2004-02-26 23:20:01
Attachments:
node_distance.patch
|
On Thu, 2004-02-26 at 14:21, Martin J. Bligh wrote: > --On Wednesday, February 25, 2004 08:54:09 -0800 Jesse Barnes <jb...@sg...> wrote: > > > On Wed, Feb 25, 2004 at 02:01:16PM +0900, j-n...@ce... wrote: > >> > 1) make it arch independent > >> > this means having arch code populate a SLIT-like table for use by > >> > the generic zonelist building code > >> > >> I would like to hear the comments from people on other arch. > >> If the same ordering rule can be applicable for others, it's nice. > > > > Martin, does a scheme like this sound ok with you? Arch specific code > > would populate a node distance table, which would be used to build each > > pgdat->zonelist in a smarter way than we do currently. > > Yeah, looks sensible to me. We probably ought to do this: > > +#ifndef node_distance > +#define node_distance(from,to) (1) > +#endif > > in the generic fallback topology headers, not in the mm/ .c files. Matt? > > Also, I seem to recall those build_zonelists functions are used for both > NUMA and UMA ... now they're getting complex enough that it's probably > worth making a specific non-NUMA version, if only for the sanity of > 99% of the poor souls trying to work out how a UMA machine lays it out ;-) > > It looks like it won't change ordering for existing boxes with single > layer flat NUMA topologies (round-robin), but we probably ought to check > that carefully ;-) > > M. Yep... Here's a quickie for i386 and the generic header. All other arches would look pretty similar to the asm/i386/topology.h change. -Matt |
From: Dave H. <hav...@us...> - 2004-02-26 23:50:52
|
On Thu, 2004-02-26 at 14:21, Martin J. Bligh wrote: > Also, I seem to recall those build_zonelists functions are used for both > NUMA and UMA ... now they're getting complex enough that it's probably > worth making a specific non-NUMA version, if only for the sanity of > 99% of the poor souls trying to work out how a UMA machine lays it out ;-) > > It looks like it won't change ordering for existing boxes with single > layer flat NUMA topologies (round-robin), but we probably ought to check > that carefully ;-) More ifdefs just make the code messier. How about adding a nice comment explaining that UMA will fall right through the loop, and giving the poor souls a temporary reprieve? -- dave |
From: Martin J. B. <mb...@ar...> - 2004-02-27 00:04:54
|
> On Thu, 2004-02-26 at 14:21, Martin J. Bligh wrote: >> Also, I seem to recall those build_zonelists functions are used for both >> NUMA and UMA ... now they're getting complex enough that it's probably >> worth making a specific non-NUMA version, if only for the sanity of >> 99% of the poor souls trying to work out how a UMA machine lays it out ;-) >> >> It looks like it won't change ordering for existing boxes with single >> layer flat NUMA topologies (round-robin), but we probably ought to check >> that carefully ;-) > > More ifdefs just make the code messier. How about adding a nice comment > explaining that UMA will fall right through the loop, and giving the > poor souls a temporary reprieve? In this case, I disagree - it's better to just have the ifdef. M. |
From: Chris W. <cw...@f0...> - 2004-02-27 00:57:14
|
On Thu, Feb 26, 2004 at 02:21:02PM -0800, Martin J. Bligh wrote: > +#ifndef node_distance > +#define node_distance(from,to) (1) > +#endif 1 or 0 here? --cw |
From: Martin J. B. <mb...@ar...> - 2004-02-27 01:07:02
|
--On Thursday, February 26, 2004 16:47:10 -0800 Chris Wedgwood <cw...@f0...> wrote: > On Thu, Feb 26, 2004 at 02:21:02PM -0800, Martin J. Bligh wrote: > >> +#ifndef node_distance >> +#define node_distance(from,to) (1) >> +#endif > > 1 or 0 here? Dunno ... was from j-n...@ce..., but I'd guess 1 makes more sense semantically - we're saying the distance between nodes on a flat NUMA architecture is equal (I think ;-)) M. |
From: <j-n...@ce...> - 2004-02-27 05:48:39
|
Hi, > >> +#ifndef node_distance > >> +#define node_distance(from,to) (1) > >> +#endif > > > > 1 or 0 here? > > Dunno ... was from j-n...@ce..., but I'd guess 1 makes more > sense semantically - we're saying the distance between nodes on a flat > NUMA architecture is equal (I think ;-)) Umm. Even on a flat NUMA architecture, I should have distinguish between local and remote. It's NUMA. #define node_distance(from,to) (from != to) Best regards. -- NOMURA, Jun'ichi <j-n...@ce...> |
From: Matthew D. <col...@us...> - 2004-02-24 19:59:09
|
On Tue, 2004-02-24 at 01:20, j-n...@ce... wrote: > - for (node = local_node + 1; node < numnodes; node++) > - j = build_zonelists_node(NODE_DATA(node), zonelist, j, k); > - for (node = 0; node < local_node; node++) > - j = build_zonelists_node(NODE_DATA(node), zonelist, j, k); > + for (node = 1; node < numnodes; node++) > + j = build_zonelists_node(SORTED_NODE_DATA(local_node, node), zonelist, j, k); Shouldn't that for loop be: for (node = 0; node < numnodes; node++) j = .... or we'll be leaving a node out of the zonelists, right? -Matt |
From: <j-n...@ce...> - 2004-02-25 05:12:05
|
Hi, > > + for (node = 1; node < numnodes; node++) > > + j = build_zonelists_node(SORTED_NODE_DATA(local_node, node), zonelist, j, k); > > Shouldn't that for loop be: > for (node = 0; node < numnodes; node++) > j = .... > > or we'll be leaving a node out of the zonelists, right? No. The zonelist for the local_node is built before the code above. The whole code looks like: > local_node = pgdat->node_id; > ... > j = build_zonelists_node(pgdat, zonelist, j, k); > ... > for (node = 1; node < numnodes; node++) > j = build_zonelists_node(SORTED_NODE_DATA(local_node,node), zonelist, j, k); Actually, we might be able to clean it up like: > for (idx = 0; idx < numnodes; idx++) > j = build_zonelists_node(SORTED_NODE_DATA(local_node,idx), zonelist, j, k); Best regards. -- NOMURA, Jun'ichi <j-n...@ce...> |
From: <j-n...@ce...> - 2004-02-25 11:09:14
Attachments:
numa-zonelist.diff
|
I cleaned up the patch based on the comments from Jesse and Matthew. > 1) make it arch independent > this means having arch code populate a SLIT-like table for use by > the generic zonelist building code I moved the whole function to mm/page_alloc.c. > 3) some systems have pgdats w/o any CPUs associated with them, they > need to be dealt with differently than regular nodes, maybe as > extensions to an existing node Headless node is prefered over the nodes with same distance. > 2) handle the cases that Erich talked about a bit better Any idea for doing it in generic way? Best regards. -- NOMURA, Jun'ichi <j-n...@ce...> |
From: Christoph H. <hc...@in...> - 2004-02-25 11:38:24
|
On Wed, Feb 25, 2004 at 07:59:33PM +0900, j-n...@ce... wrote: > +#ifndef node_distance > +#define node_distance(from,to) (1) > +#endif please move this to toplogy.h - it'll be usefull in a few other places. > +#define PENALTY_FOR_NODE_WITH_CPUS (1) should probably also go to topology.h to allow architectures to change it. > + for(i = 0; i < numnodes; i++) { minor nitpick: please leave a space between the for and the opening brace > + while((node = find_next_best_node(local_node, used_mask)) >= 0) same here. |
From: <jb...@sg...> - 2004-02-25 17:01:55
|
On Wed, Feb 25, 2004 at 07:59:33PM +0900, j-n...@ce... wrote: > I cleaned up the patch based on the comments from Jesse and Matthew. > > > 1) make it arch independent > > this means having arch code populate a SLIT-like table for use by > > the generic zonelist building code > > I moved the whole function to mm/page_alloc.c. Looks even better, that was fast! :) > > 3) some systems have pgdats w/o any CPUs associated with them, they > > need to be dealt with differently than regular nodes, maybe as > > extensions to an existing node > > Headless node is prefered over the nodes with same distance. I'd be curious to hear about others with similar configurations. On sn2, we may have multiple headless nodes for each node with CPUs. In such a configuration, it seems best to have each node with CPUs 'own' a set of headless nodes, and allocate from them even if they're further away than other nodes with CPUs. I don't think we have to worry about that too much now though, since the algorithm below could be tweaked to do just that easier than the simple sort code I did awhile back. > > 2) handle the cases that Erich talked about a bit better > > Any idea for doing it in generic way? We could adjust 'val' below based on an array that weights each node as it's added to a zonelist. I think that would be up to the caller of find_next_best_node() to adjust, but would be used in the routine below. Doing it that way would allow the balancing that Erich was talking about as well as the headless node stuff we want for sn2. Thanks, Jesse |
From: <j-n...@ce...> - 2004-02-27 05:42:47
|
Hi, > > > 2) handle the cases that Erich talked about a bit better > > > > Any idea for doing it in generic way? > > We could adjust 'val' below based on an array that weights each node as > it's added to a zonelist. I think that would be up to the caller of > find_next_best_node() to adjust, but would be used in the routine below. > Doing it that way would allow the balancing that Erich was talking about > as well as the headless node stuff we want for sn2. Needs consideration for this. I think the find_next_best_node-loop must be outer to for-each-zonelist loop to do this kind of weighing. Best regards. -- NOMURA, Jun'ichi <j-n...@ce...> |
From: <j-n...@ce...> - 2004-02-27 12:31:13
Attachments:
numa-zonelist2.diff
|
I updated the patch. This version of build_zonelists() is NUMA-specific one. It will handle the situation Erich pointed out better. For example, when the node_distance(i,j) looks like below: 10 15 15 15 20 20 20 20 15 10 15 15 20 20 20 20 15 15 10 15 20 20 20 20 15 15 15 10 20 20 20 20 20 20 20 20 10 15 15 15 20 20 20 20 15 10 15 15 20 20 20 20 15 15 10 15 20 20 20 20 15 15 15 10 The previous patch generates zonelist in the following order: Node#00 0 1 2 3 4 5 6 7 Node#01 1 2 3 0 4 5 6 7 Node#02 2 3 0 1 4 5 6 7 Node#03 3 0 1 2 4 5 6 7 Node#04 4 5 6 7 0 1 2 3 Node#05 5 6 7 4 0 1 2 3 Node#06 6 7 4 5 0 1 2 3 Node#07 7 4 5 6 0 1 2 3 With this patch, the order looks like: Node#00 0 1 2 3 4 5 6 7 Node#01 1 2 3 0 5 6 7 4 Node#02 2 3 0 1 6 7 4 5 Node#03 3 0 1 2 7 4 5 6 Node#04 4 5 6 7 0 1 2 3 Node#05 5 6 7 4 1 2 3 0 Node#06 6 7 4 5 2 3 0 1 Node#07 7 4 5 6 3 0 1 2 Best regards. -- NOMURA, Jun'ichi <j-n...@ce...> |
From: <jb...@sg...> - 2004-02-27 17:44:55
|
On Fri, Feb 27, 2004 at 09:19:58PM +0900, j-n...@ce... wrote: > I updated the patch. > This version of build_zonelists() is NUMA-specific one. > It will handle the situation Erich pointed out better. This patch looks great to me. Care to send it on to Andrew for inclusion into his tree? Thanks, Jesse |
From: <j-n...@ce...> - 2004-03-01 14:21:05
|
> > I updated the patch. > > This version of build_zonelists() is NUMA-specific one. > > It will handle the situation Erich pointed out better. > > This patch looks great to me. Care to send it on to Andrew for > inclusion into his tree? Thank you. I'll send it to Andrew. Best regards. -- NOMURA, Jun'ichi <j-n...@ce...> |