I'd like to have commented on this sooner, but I've been busy with some
other work. This looks really interesting, and I'd love to help any way
I can.
Now, on to more specific comments.
Andi Kleen wrote:
> Hallo,
>
<SNIP>
>
> It only deals with nodes, not CPUs. One reason for this is that it is
> AMD64 centric where CPU equals node, but even on other architectures
> with multiple CPUs per node more finegrained settings than nodes do
> not seem to be commonly used. Inside a node conventional SMP tunings
> can be used, no need for an NUMA library.
>
> The only possible exception is the CPU binding (numa_run_on_node*), but
> node granuality seems to be enough for that too. If it should be a
> problem the application can call sched_setaffinity directly.
I like this. Working with CPUs can be a pain because they tend to be
used in groups anyway (i.e., nodes, as you pointed out), they tend not
to have physical memory directly associated with them (which makes
membinding tricky), they force bitmasks to be much larger (a 32-bit
bitmask of nodes covers a wider range of systems than a 32-bit bitmask
of CPUs), etc.
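To put rough numbers on the bitmask point, here is a quick sketch. The
helper names are made up for illustration, and it assumes every node has
the same number of CPUs, which real machines don't guarantee:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: how many CPUs a node bitmask of the given width
 * can describe, assuming a fixed number of CPUs per node. */
static unsigned int max_cpus_covered(unsigned int mask_bits,
                                     unsigned int cpus_per_node)
{
    assert(cpus_per_node > 0);
    return mask_bits * cpus_per_node;
}

/* Width in bits of the unsigned long used for node sets in the draft. */
static unsigned int node_mask_bits(void)
{
    return (unsigned int)(sizeof(unsigned long) * CHAR_BIT);
}
```

With 4 CPUs per node, a 32-bit node mask describes 128 CPUs, while a
32-bit CPU mask stops at 32.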
> Possible distance between nodes is ignored. On current AMD64 it doesn't
> exist and it seems like a very big complication for little gain even
> on other architectures. If it should be needed it can be read from
> sysfs in 2.5.
Ermm... I have to disagree here. Right now, I don't think leaving out
distance is too big a deal: we can just assume everything is either
local or remote. In the future, though, there *will* be a need for it.
As machines keep getting larger, it will take multiple hops to get from
one CPU to another (some machines are like this now), and we'll need a
distance metric. I guess only time will tell... ;)
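Here's a rough sketch of what I mean. Both functions and the distance
units are invented for illustration, and the second one assumes a
made-up topology where nodes sit in a simple chain (0-1-2-3...), which
is not how any particular machine is wired:

```c
#include <assert.h>

#define LOCAL_DISTANCE  1   /* made-up units, not from any kernel */
#define REMOTE_DISTANCE 2

/* The "everything is either local or remote" model. */
static int node_distance_flat(int from, int to)
{
    assert(from >= 0 && to >= 0);
    return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
}

/* Hop-count model on a hypothetical chain of nodes: each extra hop
 * between the two nodes adds one unit of distance. */
static int node_distance_chain(int from, int to)
{
    int hops;

    assert(from >= 0 && to >= 0);
    hops = from > to ? from - to : to - from;
    return LOCAL_DISTANCE + hops;
}
```

The flat model reports the same distance for a neighbouring node and for
one three hops away, so a scheduler or allocator using it can't prefer
the closer one.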
> The set of nodes is defined as unsigned long. I did this because I
> don't see Linux breaking this limit any time soon (note this is talking
> about nodes, not CPUs again; e.g. on a 64bit 4cpus per node machine the
> CPU limit is 256 CPUs). On AMD64 it allows upto 64 CPUs. I expect this
> to be controversal, but the alternatives (defining bitset types etc.)
> seemed too ugly.
I don't like this either, but I see that several others have already
mentioned the badness of this, so I'll leave it alone... :)
> Homenode is a specific concept from my NUMA scheduler that may not exist
> in others (e.g. it doesn't in 2.5). I decided to show it in the library
> for now, but it's only a hint and could be ignored. The main reason I
> did this is that the automatic balancing has some nasty corner cases
> where it may make sense for the application or numactl to overwrite it.
> The concept of a homenode is different from just memory binding because
> it implies order.
I like the homenode concept as well. It's good to migrate processes
back to where most of their memory is, and in the future, we could even
migrate a process's pages back to its homenode.
> Ignoring the homenode hint scheduler changes the changes in the kernel
> to implement this are all rather simple. Basically it only consists of
> a couple or prctls and some simple changes to the page allocation
> function.
>
> The patch for the homenode NUMA scheduler is still in development and
> not released yet.
I'm interested in assisting if possible. If you'd like to send me some
code when you're ready, I'd be interested in working on the i386 side...
> NAME
> numa - NUMA policy library
>
> SYNOPSIS
> #include <numa.h>
>
> cc ... -lnuma
>
> int numa_available(void)
>
> int numa_max_node(void)
> int numa_homenode(void)
>
> void numa_set_interleave_mask(unsigned long mask)
> unsigned long numa_get_interleave_mask(void)
> void numa_set_homenode(int node)
> void numa_set_localalloc(int flag)
> void numa_set_membind(unsigned long mask)
> void numa_get_membind(unsigned long mask)
The above should probably be:
unsigned long numa_get_membind(void)
right?
>
> void *numa_alloc_stripped_subset(size_t size, unsigned
> long mask)
> void *numa_alloc_stripped(size_t size)
> void *numa_alloc_onnode(size_t size, int node)
> void *numa_alloc_local(size_t size)
> void *numa_alloc(size_t size)
> void numa_free(void *mem, size_t size)
>
> int numa_run_on_node_mask(unsigned long mask)
> int numa_run_on_node(int node)
>
> void numa_interleave_memory(void *mem, size_t size,
> unsigned long mask)
> void numa_tonode_memory(void *mem, size_t size, int node)
> void numa_setlocal_memory(void *mem, size_t size)
> void numa_police_memory(void *mem, size_t size)
>
> DESCRIPTION
>
<SNIP>
> numa_homenode returns the homenode of the current thread.
> It is the node the kernel preferably allocates memory on,
> unless some other policy overwrites this.
>
> numa_set_interleave_mask Set an memory interleave mask for
> the current thread. All new memory allocations are page
> interleaved over all nodes in the interleave mask. The
> page interleaving only occurs on the actual page fault
> that puts a new page into the current address space, not
> during mmap. This is a low level function, it may be more
> convenient to use the higher level functions like
> numa_alloc_stripped or numa_alloc_stripped_subset.
You mean that mmap'd pages aren't interleaved until they are actually
faulted in, not that mmap'd regions aren't affected by the
interleave_mask, correct?
> numa_get_interleave_mask returns the current interleave
> mask.
>
> numa_set_homenode sets the homenode for the current thread
> to node. Homenode is the node memory is preferably allo-
> cated from.
>
> numa_set_localalloc sets a local memory allocation policy
> for the current thread. When flag is not null memory is
> preferably allocated from the current node. Otherwise it
> is allocated from the homenode. These are normally identi-
> cal, but can differ in some special situations.
This basically says homenode == current_node if flag is nonzero? Just
making sure I'm reading it right. It might be easier (cleaner?) to just
have a special value for numa_set_homenode that does this, i.e.,
numa_set_homenode(-1) ensures that you always do local allocation.
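Something like the following is what I have in mind. This is only a
sketch of the idea: the sentinel name, the struct, and the helper are
all hypothetical, not anything from your patch:

```c
#include <assert.h>
#include <stddef.h>

#define HOMENODE_LOCAL (-1)   /* hypothetical "always allocate locally" value */

/* Hypothetical per-thread policy state; only the homenode matters here. */
struct thread_policy {
    int homenode;
};

/* Node an allocation would prefer: the homenode, unless the sentinel
 * asks for whatever node the thread is currently running on. */
static int preferred_node(const struct thread_policy *p, int current_node)
{
    assert(p != NULL);
    return p->homenode == HOMENODE_LOCAL ? current_node : p->homenode;
}
```

That would fold numa_set_localalloc into numa_set_homenode and leave one
less call to document.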
> numa_set_membind sets a memory allocation mask. Only allo-
> cate memory from the nodes set in mask. A mask of 0 or
> -1UL turns membinding off.
>
> numa_get_membind returns the current node mask from which
> memory can be allocated. 0 or -1UL means all nodes.
I'm a bit unclear on how these are different than the similar get/set
interleave_mask calls? Do these calls set a policy that just allows the
page to be faulted in from any node in the mask, whereas the
interleave_mask calls force the faults to follow a round-robin type policy?
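To spell out my reading (which may well be wrong, hence the question),
here is a toy model of the two policies. The function names and the
membind fallback rule are my own guesses, not taken from your
implementation:

```c
#include <assert.h>
#include <limits.h>

#define MASK_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Interleave as I understand it: each new page fault goes to the next
 * node set in mask after the previously used one, wrapping around.
 * Pass prev = -1 for the first fault. */
static int next_interleave_node(unsigned long mask, int prev)
{
    int i;

    assert(mask != 0 && prev >= -1 && prev < MASK_BITS);
    for (i = 1; i <= MASK_BITS; i++) {
        int node = (prev + i) % MASK_BITS;
        if (mask & (1UL << node))
            return node;
    }
    return -1; /* not reached when mask != 0 */
}

/* Membind as I understand it: any node in mask is acceptable, so stay
 * on the current node if it is allowed. Falling back to the lowest
 * allowed node otherwise is purely an assumption for this sketch. */
static int membind_node(unsigned long mask, int current_node)
{
    int node;

    assert(mask != 0 && current_node >= 0 && current_node < MASK_BITS);
    if (mask & (1UL << current_node))
        return current_node;
    for (node = 0; node < MASK_BITS; node++)
        if (mask & (1UL << node))
            return node;
    return -1; /* not reached when mask != 0 */
}
```

With mask 0x5 (nodes 0 and 2), the interleave model alternates 0, 2, 0,
2, while the membind model keeps a thread running on node 2 allocating
from node 2.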
> numa_alloc_stripped allocates size bytes of memory page
> stripped on all nodes. This function is relatively slow
> and should only be used for large areas consisting of mul-
> tiple pages. The interleaving works on page level and will
> only show an effect when the area is large. It must be
> freed with numa_free
>
> numa_alloc_stripped_subset is like numa_alloc_stripped
> except that it also accepts a mask of the nodes to inter-
> leave on.
I saw that you renamed these to match the interleaved calls. I think
these could be condensed into a single call
numa_alloc_interleaved(mask_t mask), and if mask is -1 (all bits set) or
some such, it uses all nodes?
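A sketch of how the "all bits set means all nodes" convention could be
resolved inside such a call. The helper names are hypothetical, and
max_node stands for whatever numa_max_node() would return:

```c
#include <assert.h>
#include <limits.h>

/* Mask with one bit set for every node from 0 to max_node. */
static unsigned long all_nodes_mask(int max_node)
{
    int bits = (int)(sizeof(unsigned long) * CHAR_BIT);

    assert(max_node >= 0 && max_node < bits);
    if (max_node == bits - 1)
        return ~0UL;              /* avoid shifting by the full width */
    return (1UL << (max_node + 1)) - 1;
}

/* Hypothetical helper: -1UL selects all existing nodes; any other mask
 * is trimmed to the nodes that actually exist. */
static unsigned long resolve_interleave_mask(unsigned long mask, int max_node)
{
    unsigned long all = all_nodes_mask(max_node);

    return mask == ~0UL ? all : (mask & all);
}
```

On a 4-node box, a mask of -1UL resolves to 0xF, so the "all nodes"
variant becomes a special case of the subset variant.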
> numa_alloc_onnode allocates memory on a specific node.
> This function is relatively slow and allocations are
> rounded to pagesize. The memory must be freed with
> numa_free
>
> numa_alloc_local allocates memory on the local node. This
> function is relatively slow and allocations are rounded to
> pagesize. The memory must be freed with numa_free.
These also could be condensed... numa_alloc_onnode(-1) == numa_alloc_local?
> numa_alloc allocates memory with the current NUMA policy.
> This function is relatively slow and allocations are
> rounded to pagesize. The memory must be freed with
> numa_free.
>
> numa_free frees memory allocates by the numa_alloc_* func-
> tions above.
>
> numa_run_on_node runs the current thread on a specific
> node. The thread will not migrate to other nodes until
> this is reset with numa_run_on_node_mask with an -1UL
> argument.
Shouldn't this policy be reset by *any* future calls to
numa_run_on_node/numa_run_on_node_mask? Why force a binding to all
nodes only to immediately rebind to a different node/node set?
> numa_run_on_node_mask runs the current thread only on a
> specific node mask.
Are the restrictions the same on resetting the binding for this call as
for numa_run_on_node?
Also, you may want to add a single function call that binds the process
and its memory to a set of nodes, i.e., numa_bind(mask), which would be
like numa_run_on_node_mask(mask) + numa_set_interleave_mask(mask).
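Roughly like this. The stub_* functions below are stand-ins I wrote so
the sketch compiles on its own; they only record the masks and are not
your library's code, and sketch_numa_bind is just the proposed wrapper:

```c
#include <assert.h>

/* Recorded policy, standing in for real per-thread kernel state. */
struct policy_state {
    unsigned long run_mask;
    unsigned long interleave_mask;
};

static struct policy_state state;

/* Stand-in for the draft numa_run_on_node_mask(): 0 on success. */
static int stub_run_on_node_mask(unsigned long mask)
{
    if (mask == 0)
        return -1;                /* assumption: empty mask is an error */
    state.run_mask = mask;
    return 0;
}

/* Stand-in for the draft numa_set_interleave_mask(). */
static void stub_set_interleave_mask(unsigned long mask)
{
    state.interleave_mask = mask;
}

/* Proposed convenience call: bind both execution and memory to mask. */
static int sketch_numa_bind(unsigned long mask)
{
    if (stub_run_on_node_mask(mask) != 0)
        return -1;                /* leave the memory policy untouched */
    stub_set_interleave_mask(mask);
    return 0;
}
```

The one design point worth noting is the ordering: if the CPU binding
fails, the memory policy is left alone, so the caller never ends up
half-bound.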
> numa_interleave_memory is a lower level function to inter-
> leave not yet faulted in, but allocated memory. Not yet
> faulted in means the memory is allocated using mmap(2) or
> shmat(2), but has not been accessed by the current process
> yet. The memory is page interleaved to all nodes specified
> in mask. Normally numa_alloc_stripped should be used for
> private memory instead, but this function is useful to
> handle shared memory areas. To be useful the memory area
> should be significantly larger than a page.
Maybe rename this call to something like numa_interleave_range() or
numa_interleave_mem_range(), to help distinguish between actual
allocation calls and this one?
> numa_tonode_memory locates memory on a specific node. The
> constraints described for numa_interleave_memory apply
> here too.
Again, I don't think there's a need for two calls. Simply calling
numa_interleave_memory() with mask == 1 << node should be equivalent, no?
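One detail if the mask is built that way: since the node set is an
unsigned long, the shift should be done on 1UL rather than plain 1,
because shifting a 32-bit int left by 31 or more is undefined in C. A
small helper (the name is hypothetical) would hide that:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: single-node mask in the unsigned long node-set
 * representation used by the draft API. */
static unsigned long node_to_mask(int node)
{
    int bits = (int)(sizeof(unsigned long) * CHAR_BIT);

    assert(node >= 0 && node < bits);
    return 1UL << node;           /* 1UL, not 1: shift in unsigned long */
}
```

numa_tonode_memory(mem, size, node) would then just be
numa_interleave_memory(mem, size, node_to_mask(node)).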
> numa_setlocal_memory locates memory on the current node.
> The constraints described for numa_interleave_memory apply
> here too.
Ditto.
> numa_police_memory locates memory with the current NUMA
> policy. The constraints described for numa_interleave_mem-
> ory apply here too.
<SNIP>
> NAME
> numactl - Control NUMA policy for processes
>
> SYNOPSIS
> numactl [ --interleave=nodes ] [ --homenode=homenode ] [
> --cpubind=cpu ] [ --membind=nodes ] [ --localalloc ] com-
> mand {arguments ...}
> numactl [ -i nodes ] [ -h homenode ] [ -m nodes ] [ -b
> cpus ] command {arguments ...}
> numactl --show
>
> DESCRIPTION
> numactl runs processes with a specific NUMA scheduling or
> memory placement policy. The policy is set for command
> and inherited by all of its children.
>
> Policy settings are:
>
> --interleave=nodes, -i nodes
> Set an memory interleave policy. Memory will be
> allocated using round robin on nodes.
>
> --homenode=node, -h node
> Set the homenode to node. homenode is the node the
> process first tries to allocate memory from. Nor-
> mally it is assigned dynamically at exec(2) or
> fork(2) / clone(2)
> time (the later only when the kernel.homenode_bal-
> ance_threads is set). In addition the scheduler
> gives strong preference to the homenode to schedule
> the process near its memory. If the memory alloca-
> tion does not succeed the allocation is tried on
> other nodes.
>
> --membind=nodes, -m nodes
> Only allocate memory from nodes.
I'm not 100% clear on the difference between this and --interleave?
Same question as above, in the libnuma section.
> --cpubind=cpus, -b cpus
> Only execute process on cpus. The syntax for cpus
> is the same as for node specifiers.
Seems a bit strange to allow cpubinding here, especially since
everything else in the API is strictly node-level...
> --localalloc, -l
> Do always local allocation on the current node.
> This overwrites the homenode and interleave set-
> tings. It is also default when the
> vm.node_local_alloc sysctl is set.
>
> --show, -s
> Show current NUMA policy settings.
Show current settings for what? The shell that you are executing the
command in? That doesn't seem particularly useful, since you can't
change them... Maybe a PID argument for --show? Or are there
global/system-wide settings as well?
> Valid node specifiers
>   all                All nodes
>   number             Node containing CPU number.
>   number1{,number2}  Set of nodes containing the CPUs number1 and number2
>   number1-number2    Nodes containing CPUs from number1 to number2
>   ! nodes            Invert selection of the following specification.
>
> EXAMPLES
> numactl --interleave=all bigdatabase arguments Run big
> database with its memory interleaved on all CPUs.
Should be 'interleaved on all Nodes'? The interleave specifically says
it takes node arguments. I'm guessing some (all?) of this CPU/Node
confusion is due to x86_64's CPU == Node. We should be careful about
this, because for just about every other arch out there, CPU == node
doesn't hold.
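To illustrate why the distinction matters, here is a sketch of turning
CPU numbers into nodes, the way the "node containing CPU number"
specifiers would have to. The helpers are hypothetical and assume CPUs
are numbered contiguously within each node, with the same count per
node:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical mapping from a CPU number to the node containing it. */
static int cpu_to_node(int cpu, int cpus_per_node)
{
    assert(cpu >= 0 && cpus_per_node > 0);
    return cpu / cpus_per_node;
}

/* Node mask covering every node that contains a CPU in [first, last]. */
static unsigned long cpu_range_to_node_mask(int first, int last,
                                            int cpus_per_node)
{
    int bits = (int)(sizeof(unsigned long) * CHAR_BIT);
    unsigned long mask = 0;
    int cpu;

    assert(first >= 0 && first <= last);
    assert(cpu_to_node(last, cpus_per_node) < bits);
    for (cpu = first; cpu <= last; cpu++)
        mask |= 1UL << cpu_to_node(cpu, cpus_per_node);
    return mask;
}
```

With one CPU per node (the x86_64 case) CPUs 0-7 map to eight nodes, but
with four CPUs per node the same range is only nodes 0 and 1, so "all
CPUs" and "all nodes" are not interchangeable in the docs.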
> numactl --homenode=0 --membind=0,1 process Run process
> preferably on node 0 with memory allocated on node 0 and
> 1.
>
>
> SYSCTLS
<SNIP>
Overall I really like a lot of your ideas, and I'm looking forward to
seeing where this goes!
Cheers!
-Matt