From: Andi K. <ak...@su...> - 2004-01-12 16:09:20
|
An implementation of a NUMA policy API for Linux 2.6 has been released. It consists of an implementation of the Linux kernel NUMA policy API discussed at the last kernel summit, a higher-level library named libnuma for applications, a user-space policy tool numactl, and some test programs. The libnuma interface is still very similar to the older specification I posted some time ago (there were only a few minor changes in it). numactl is also largely unchanged.

This version has been tested on x86-64. It should be portable to other architectures, although you may need to get system call numbers allocated for them first and add them to the user library and the kernel code.

This is still a quite rough release, but I think it's good enough now for some wider testing and review. It can be downloaded from:

ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.5.tar.gz
	User-space tools, libraries, and man pages

ftp://ftp.suse.com/pub/people/ak/numa/numa-2.6.1-4.gz
	Kernel patch for 2.6.1, with support for x86-64

The new kernel API supports several memory policies for NUMA systems:

MPOL_BIND	allocate only on a specific set of nodes.
MPOL_PREFERRED	allocate preferably on a specific node, but fall back
		to others if that fails.
MPOL_DEFAULT	(standard policy) allocate preferably on the current
		node and fall back to others.
MPOL_INTERLEAVE	interleave allocations over a specific set of nodes.

Policies can be set for a whole process or for individual memory areas. The patch adds three new system calls:

mbind		set a policy for a specific memory area.
		See http://www.firstfloor.org/~andi/mbind.html
set_mempolicy	set the process policy for the current process.
		See http://www.firstfloor.org/~andi/set_mempolicy.html
get_mempolicy	get the memory policy for a memory area or process.

This kernel API should normally not be used directly by programs; they should use the higher-level libnuma instead.
libnuma has a lot of functions to allocate memory with various policies and discover the NUMA topology, plus some wrapper functions for other system calls (e.g. for controlling scheduler affinity). See http://www.firstfloor.org/~andi/numa.html for details.

numactl is a command line utility that runs programs and their children with a specific policy. You can use it like

	numactl --interleave=0-2 memhog 100m

to set an interleaving policy over nodes 0 to 2 for memhog. All memory allocated in there will be interleaved over these nodes.

There is also a program numastat to print the new NUMA statistics from sysfs, and there are some test programs, especially one called numademo that attempts to benchmark most possible policy combinations on your machine. Any feedback is welcome, especially from bigger machines.

Some design issues in the kernel implementation:

- Policy is always applied at fault time. This means that when you set a process policy, you have to fault pages in before it takes any effect. The higher-level API takes care of that.

- Process policy is not persistent over swapping. This is not easily fixable. If you need that persistency, use mbind().

- Currently the interleaving state is per VMA. This implies that when you set an interleave state for a shared memory VMA, for example, each process accessing it does its own interleaving, which may leave the object not very evenly interleaved. It would be better to share the interleaving state between processes for VMAs pointing to the same object.

- There should be a way to set a global policy for a file (especially in hugetlbfs) or a shared memory object (related to the previous item). It would be useful for all other files too, to control the page cache.

- Only the highest zone in the zone hierarchy of each node is policied. This implies that on 32-bit systems there is no policy for the lowmem zone if there is highmem, only for highmem. If the system doesn't have highmem, the lowmem zone will be policied. The DMA zone cannot be policied.
  On 64-bit systems this doesn't make any difference (except for the DMA zone).

Known problems:

- Needs more testing (especially all the corner cases in mbind and the large pages support).

- The sysfs cpu parser may not be completely up to date with the ever-changing cpumap format. It works on a 4-node Opteron, but that is an easy case because the cpu mask there fits into a single word.

- The user-space tools and libraries still have quite some rough edges and need more polishing.

- The man pages need proofreading and cleaning up, especially get_mempolicy.2, which is quite bad currently.

-Andi |
From: <jb...@sg...> - 2004-01-14 04:21:39
|
On Mon, Jan 12, 2004 at 05:09:16PM +0100, Andi Kleen wrote:
> There is also an program numastat to print the new numa statistics from sysfs.
>
> There are some test programs, especially a program called numademo that attempts
> to benchmark most possible policy combinations on your machine.

Those bits sound interesting...

> Any feedback welcome, especially from bigger machines.

I'll give this a try on an Altix tomorrow.

> Should have a way to set global policy for a file (especially in
> hugetlbfs) or a shared memory object (related to the previous item).
> It would be useful for all files too to control the page cache.

You mean some sort of hint embedded in the ELF image itself? If we need more than one, we should probably try to make it somewhat extensible. I remember mbligh talking about adding similar hints for scheduler node balancing on exec vs. fork, among other things.

> The sysfs cpu parser may not be completely uptodate with the ever changing
> cpumap format. It works on an 4 node Opteron, but that is easy because the cpu
> mask there fits into a single word.

We'll see that right away. I can try to take a look if I get time.

> The user space tools and libraries still have quite some rough edges and need
> more polishing.
>
> The man pages need proofreading and cleaning up, especially get_mempolicy.2
> which is quite bad currently.

Ok.

Thanks,
Jesse |
From: Andi K. <ak...@su...> - 2004-01-14 10:18:13
|
On Tue, 13 Jan 2004 20:20:46 -0800 jb...@sg... (Jesse Barnes) wrote:

> On Mon, Jan 12, 2004 at 05:09:16PM +0100, Andi Kleen wrote:
> > There is also an program numastat to print the new numa statistics from sysfs.
> >
> > There are some test programs, especially a program called numademo that attempts
> > to benchmark most possible policy combinations on your machine.
>
> Those bits sound interesting...

You'll need some minor changes for the timing functions on IA64 - they should be obvious. Or maybe use gettimeofday if your TSC is drifting too badly.

Also I must warn that the numademo numbers don't seem to be very stable and fluctuate for unexplained reasons. I experimented with merging STREAM and the numademo benchmarks because STREAM seems to give more consistent numbers, although that work is not completely finished.

> > Should have a way to set global policy for a file (especially in
> > hugetlbfs) or a shared memory object (related to the previous item).
> > It would be useful for all files too to control the page cache.
>
> You mean some sort of hint embedded in the ELF image itself? If we need
> more than one, we should probably try to make it somewhat extensible. I
> remember mbligh talking about adding similar hints for scheduler node
> balancing on exec vs. fork, among other things.

I was thinking of something for all files, e.g. using a new EA. Or a tool for hugetlbfs. Doing it for executables alone doesn't make that much sense.

-Andi |
From: <jb...@sg...> - 2004-01-14 16:15:58
|
On Wed, Jan 14, 2004 at 11:18:05AM +0100, Andi Kleen wrote:
> > > Should have a way to set global policy for a file (especially in
> > > hugetlbfs) or a shared memory object (related to the previous item).
> > > It would be useful for all files too to control the page cache.
> >
> > You mean some sort of hint embedded in the ELF image itself? If we need
> > more than one, we should probably try to make it somewhat extensible. I
> > remember mbligh talking about adding similar hints for scheduler node
> > balancing on exec vs. fork, among other things.
>
> I was thinking more for all files, e.g. using an new EA. Or a tool
> for hugetlbfs. Doing it for executables alone doesn't make that much sense.

An EA (or several) sounds better, I agree.

Jesse |
From: <jb...@sg...> - 2004-01-15 00:17:56
|
On Mon, Jan 12, 2004 at 05:09:16PM +0100, Andi Kleen wrote:
> ftp://ftp.suse.com/pub/people/ak/numa/numa-2.6.1-4.gz
> Kernel patch for 2.6.1, with support for x86-64

I had to patch arch/ia64/Kconfig (add CONFIG_NUMA_POLICY) and arch/ia64/ia32/binfmt_elf32.c (I think line 201 should refer to mpnt instead of vma). Other than that, it compiles fine. I'm debugging another problem right now, so I haven't tested the interface yet...

Jesse |
From: <jb...@sg...> - 2004-01-15 00:45:16
|
On Wed, Jan 14, 2004 at 04:17:02PM -0800, Jesse Barnes wrote:
> On Mon, Jan 12, 2004 at 05:09:16PM +0100, Andi Kleen wrote:
> > ftp://ftp.suse.com/pub/people/ak/numa/numa-2.6.1-4.gz
> > Kernel patch for 2.6.1, with support for x86-64
>
> I had to patch arch/ia64/Kconfig (add CONFIG_NUMA_POLICY) and
> arch/ia64/ia32/binfmt_elf32.c (I think line 201 should refer to mpnt
> instead of vma). Other than that, it compiles fine. I'm debugging
> another problem right now, so I havent' tested the interface yet...

Oh, and of course I had to add ia64 syscall numbers for the new functions...

Jesse |
From: <jb...@sg...> - 2004-01-15 01:25:40
Attachments:
numactl-ia64.patch
numa-api-2.6.1-mm3-ia64.patch
|
On Mon, Jan 12, 2004 at 05:09:16PM +0100, Andi Kleen wrote:
> ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.5.tar.gz
> User space tools and libraries and manpages
>
> ftp://ftp.suse.com/pub/people/ak/numa/numa-2.6.1-4.gz
> Kernel patch for 2.6.1, with support for x86-64

Here are a couple of patches I needed for ia64. We don't set node_online_map, so I saw the BUG at policy.c:378, but other than that, the tests *seemed* to behave ok. Still looking.

Jesse |
From: Andi K. <ak...@su...> - 2004-01-16 08:21:16
|
On Wed, 14 Jan 2004 17:25:34 -0800 jb...@sg... (Jesse Barnes) wrote:

> On Mon, Jan 12, 2004 at 05:09:16PM +0100, Andi Kleen wrote:
> > ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.5.tar.gz
> > User space tools and libraries and manpages
> >
> > ftp://ftp.suse.com/pub/people/ak/numa/numa-2.6.1-4.gz
> > Kernel patch for 2.6.1, with support for x86-64
>
> Here are a couple of patches I needed for ia64. We don't set
> node_online_map, so I saw the BUG at policy.c:378, but other than that,
> the tests *seemed* to behave ok. Still looking.
>
> Jesse

I cannot merge this part because it'll break x86-64. It needs some uname check or somesuch:

-libdir := ${prefix}/lib64
+libdir := ${prefix}/lib

I think I will just drop the NUMA_POLICY symbol completely again. I only added it for testing at one point.

I don't intend to merge the early printk changes.

The problem with the system calls is that they need an official number allocation from the architecture maintainer first. I was able to do that on my own for x86-64, but not for IA64.

Thanks for the other fixes. How does the testing look?

-andi |
From: <jb...@sg...> - 2004-01-16 16:52:42
|
On Fri, Jan 16, 2004 at 09:21:06AM +0100, Andi Kleen wrote:
> I cannot merge that one because it'll break x86-64. Needs some uname check or somesuch.
>
> -libdir := ${prefix}/lib64
> +libdir := ${prefix}/lib

How about:

-libdir := ${prefix}/lib64
+libdir := ${prefix}/lib
+[ `uname -m` = "x86-64" ] && libdir := ${prefix}/lib

since x86-64 is probably the only dual-ABI platform that we have to worry about (or are there sparc64 NUMA machines too?).

> I think I will just drop the NUMA_POLICY symbol completely again. I just added it
> for testing at one point.
>
> I don't intend to merge the early printk changes.

Yep, sorry. I saw that those had snuck in there after I sent out the patch. You can ignore them.

> The problem with the system calls is that it needs an official allocation from
> the architecture maintainer first. I was able to do that on my own for x86-64, but not for
> IA64.

Of course.

> Thanks for the other fixes.

Sure. Thanks a lot for putting everything together.

> How does testing look like?

I haven't had time to do much more yet, and I also have to implement another rdtsc() function for our platform clock source (which *is* synchronized across nodes), otherwise the results will be funky.

Jesse |
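As an aside, the hunk proposed above mixes shell and make syntax, so make would not accept it as written; a working GNU make version of the same idea might look like the following (a sketch, and note that on Linux `uname -m` reports "x86_64" with an underscore, not "x86-64"):

```make
# Sketch of an arch-aware libdir default for the numactl Makefile.
# Assumes GNU make; other /lib64 ports (ppc64, s390x) would need their
# own branches here.
ARCH := $(shell uname -m)

ifeq ($(ARCH),x86_64)
libdir := $(prefix)/lib64
else
libdir := $(prefix)/lib
endif
```

Keeping the per-arch special cases in one conditional makes it easy for future porters to add their platform.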
From: Andi K. <ak...@su...> - 2004-01-16 17:20:56
|
On Fri, 16 Jan 2004 08:52:04 -0800 jb...@sg... (Jesse Barnes) wrote:

> On Fri, Jan 16, 2004 at 09:21:06AM +0100, Andi Kleen wrote:
> > I cannot merge that one because it'll break x86-64. Needs some uname check or somesuch.
> >
> > -libdir := ${prefix}/lib64
> > +libdir := ${prefix}/lib
>
> How about:
>
> -libdir := ${prefix}/lib64
> +libdir := ${prefix}/lib
> +[ `uname -m` = "x86-64" ] && libdir := ${prefix}/lib
>
> since x86-64 is probably the only dual-ABI platform that we have to
> worry about (or are there sparc64 NUMA machines too?).

Yes, the biggest Sun Enterprise box is apparently NUMA. I don't know if Linux runs on it, though. Also, don't ppc64 and s390x use /lib64 too? But I guess it's reasonable to leave worrying about that to future porters, and your change is fine.

-Andi |
From: <jb...@sg...> - 2004-01-16 20:41:29
|
On Fri, Jan 16, 2004 at 06:20:51PM +0100, Andi Kleen wrote:
> > How about:
> >
> > -libdir := ${prefix}/lib64
> > +libdir := ${prefix}/lib
> > +[ `uname -m` = "x86-64" ] && libdir := ${prefix}/lib
> >
> > since x86-64 is probably the only dual-ABI platform that we have to
> > worry about (or are there sparc64 NUMA machines too?).
>
> Yes, the biggest Sun Enterprise box is NUMA apparently. I don't know
> if Linux runs on it though. Also don't ppc64 and s390x use /lib64 too?

True, I forgot about ppc and s390. Ports with only a 64-bit ABI may be the exception then, in which case the test should be reversed.

> But I guess it's reasonable to leave worrying about that to future porters and your
> change is fine.

Ok, fine by me! Thanks.

Jesse |