An implementation of a NUMA policy API for Linux 2.6 has been released. It consists
of an implementation of the Linux kernel NUMA policy API discussed at the last kernel
summit, a higher level library named libnuma for applications, a user space policy tool
numactl, and some test programs. The libnuma interface is still very similar to
the older specification I posted some time ago (there were only a few minor
changes in it). numactl is also largely unchanged.
This version has been tested on x86-64. It should be portable to other architectures,
although you may need to get system call numbers allocated for them first and add them
to the user library and the kernel code.
This is still quite a rough release, but I think it's good enough now for some
wider testing and review.
It can be downloaded from:
User space tools and libraries and manpages
Kernel patch for 2.6.1, with support for x86-64
The new kernel API supports several memory policies for NUMA systems:
MPOL_BIND allocate only on a specific set of nodes
MPOL_PREFERRED allocate preferably on a specific node, but fall back to others if that fails
MPOL_DEFAULT (standard policy) allocate preferably on the current node and fall back to others
MPOL_INTERLEAVE interleave allocations over a specific set of nodes
It allows setting policies for a process or for a memory area.
It adds three new system calls:
mbind to set a policy for a specific memory area
set_mempolicy to set the policy of the current process
get_mempolicy to get the memory policy of a memory area or process.
This kernel API should normally not be used directly by programs; instead they
should use the higher level libnuma. libnuma has many functions to allocate
memory with various policies, discover the NUMA topology, and also some wrapper
functions for other system calls (e.g. for controlling scheduler affinity).
See http://www.firstfloor.org/~andi/numa.html for details.
numactl is a command line utility that allows running programs and their
children with a specific policy. You can use it like
numactl --interleave=0-2 memhog 100m
to set an interleaving policy over nodes 0 to 2 for memhog. All memory
allocated in there will be interleaved over these nodes.
There is also a program, numastat, to print the new NUMA statistics from sysfs.
There are some test programs, especially a program called numademo that attempts
to benchmark most possible policy combinations on your machine.
Any feedback welcome, especially from bigger machines.
Some design issues in the kernel implementation:
All policy is always applied at fault time. This means that when you set a process
policy, you have to fault pages in for it to take any effect. The higher level API
takes care of that.
Process policy is not persistent over swapping. This is not easily fixable. If you
need that persistence, use mbind().
Currently the interleaving state is per VMA. This implies that, e.g., when you
set an interleave state for a shared memory VMA, each process accessing
it does its own interleaving, which may end up with the object not being
very evenly interleaved. It would be better to share the interleaving
state for VMAs pointing to the same object between processes.
There should be a way to set a global policy for a file (especially in
hugetlbfs) or a shared memory object (related to the previous item).
It would also be useful for all files, to control the page cache.
Only the highest zone in the zone hierarchy of each node is policied. This
implies that on 32-bit systems there is no policy for the lowmem zone if
there is highmem, only for highmem. If the system doesn't have highmem,
the lowmem zone will be policied. The DMA zone cannot be policied.
On 64-bit systems it doesn't make any difference (except for DMA).
Needs more testing (especially all the corner cases in mbind and large systems).
The sysfs CPU parser may not be completely up to date with the ever changing
cpumap format. It works on a 4-node Opteron, but that is easy because the CPU
mask there fits into a single word.
The user space tools and libraries still have quite a few rough edges and need more polish.
The man pages need proofreading and cleanup, especially get_mempolicy.2,
which is quite bad currently.