From: Paul M. <Pau...@us...> - 2001-10-05 23:11:36

Hello, Kimio!

> I have a question about your document.
> # Sorry, I missed the today's conference. :(

Sorry for the inconvenient time in your timezone!  3AM is a bit much
to expect...

> In your document, there is a description;
>
> -------------
> Systems that can dynamically remove CPUs or nodes from
> a running system may have "holes" in the numbering scheme.
> However, if new CPUs are introduced, they will appear
> in the same range as other CPUs on the same node,
> and if new nodes are introduced, their CPUs will be consecutively
> numbered.  CPUs from different nodes are never interleaved.
>
> This means that if a node has the capacity to have additional CPUs
> added to it, space must be left in the numbering scheme to
> accommodate those additional CPUs.
> -------------
>
> Is this restriction necessary?
> I feel there is no API relying on this assumption.

This restriction allows some NUMA-aware algorithms to rely on ordering
rather than having to know anything about the NUMA structure.  This in
turn allows some of these algorithms to be much simpler.  One example
of this is the NUMA-aware locking algorithm that Jack Vogel,
Swaminathan Sivasubramanian, and John Stultz have been working on.  I
presented an early version of their work during the NUMA session at
OLS, and an early patch may be found at sf.net/projects/lse (Numalock
Patch 1).

This algorithm becomes considerably more complex if CPUs from
different nodes may be interleaved.  So it would be really, really
good if this restriction is in place.

> If so, I have to look for a way to get max. number of CPUs per node. :(

Can this be an architecture-specific parameter?  Perhaps an API?  The
architecture will have hard limits defined by the numbers of physical
connectors, slots, or sockets available, right?

					Thanx, Paul

> Regards,
> Kimi
>
> On Fri, 5 Oct 2001 10:29:26 -0700 (PDT)
> "Paul E. McKenney" <pmc...@us...> wrote:
>
> > Hello!
> >
> > There is a new version of the NUMA description document at:
> >
> >	http://lse.sourceforge.net/numa/numastatusdesc.html
> >
> > There is a new document describing simple-binding APIs at:
> >
> >	http://lse.sourceforge.net/numa/numa_api.html
> >
> > And a rationale document at:
> >
> >	http://lse.sourceforge.net/numa/numa_api_rationale.html
> >
> > Please send questions and comments!  There will be a conference call
> > discussing this document in about 30 minutes (Friday Oct 5
> > 1-888-790-7156 Passcode 85875 at 11:00AM Pacific Daylight Time), as
> > advertised in Pat Gaughen's earlier email.
> >
> >					Thanx, Paul
> >
> > _______________________________________________
> > Lse-tech mailing list
> > Lse...@li...
> > https://lists.sourceforge.net/lists/listinfo/lse-tech
>
> --
> Kimio Suganuma <k-s...@mv...>
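
P.S.  To make the "much simpler" claim a bit more concrete: with space
reserved for the maximum number of CPUs per node, node membership
falls out of simple arithmetic.  A sketch (the constant and the names
here are invented for illustration, not from the numalock patch):

	/*
	 * Sketch only: CPUS_PER_NODE is an assumed per-architecture
	 * limit, not something the current kernel defines.
	 */
	#define CPUS_PER_NODE		4

	#define cpu_to_node(cpu)	((cpu) / CPUS_PER_NODE)
	#define node_first_cpu(node)	((node) * CPUS_PER_NODE)

No table lookups and no topology discovery, which is exactly what lets
the ordering-based algorithms stay simple.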
From: Paul M. <Pau...@us...> - 2001-10-05 23:54:44

> > Systems that can dynamically remove CPUs or nodes from
> > a running system may have "holes" in the numbering scheme.
> > However, if new CPUs are introduced, they will appear
> > in the same range as other CPUs on the same node,
> > and if new nodes are introduced, their CPUs will be consecutively
> > numbered.  CPUs from different nodes are never interleaved.
> >
> > This means that if a node has the capacity to have additional CPUs
> > added to it, space must be left in the numbering scheme to
> > accommodate those additional CPUs.
>
> The hotplug cpu patches remove the idea of logical cpu numbers
> completely, replacing it with a for_each_cpu macro.  I assume the
> hardware groups cpu ids with NUMA domains?

Interesting!  Usually, the hardware numbers CPUs within a NUMA
domain/node, and often has large non-integer IDs for the nodes
themselves.  So the OS has considerable freedom in assigning numbers.

The current numalock algorithm relies on the ability to quickly map
CPUs into bits in a long.  It also relies on a given node's CPU
numbers not being interleaved with numbers from some other CPU.  There
are NUMA-aware locking algorithms that do not require this sort of
mapping, but these algorithms are -much- more complex (more than 1,000
lines of code), and slower, too.

The NUMA-aware algorithm -can- handle "gaps" in the CPU numbering, so
if the CPUs can still be mapped to small integers, and if "gaps" are
left to allow for any additional CPUs that could be plugged into a
node, we should be OK, right?  I may be missing something, but doesn't
the hotplug patch work with the existing code in the kernel that
maintains arrays with per-CPU elements and that maintains bitmasks
with per-CPU bits?  If so, the same sort of thing should work here,
right?

					Thanx, Paul
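
P.S.  For concreteness, the "bits in a long" trick looks roughly like
the following sketch (invented names, not code from the patch).  As
long as each node's CPUs occupy one contiguous range of bit positions
-- holes are fine -- finding the next waiter on a given node is just a
mask and a find-first-set:

	#include <strings.h>	/* ffsl() (GNU); a kernel would use __ffs() */

	#define MAX_CPUS_PER_NODE	4	/* assumed hardware limit */

	static unsigned long node_mask(int node)
	{
		/* All bit positions this node's CPUs could occupy. */
		return ((1UL << MAX_CPUS_PER_NODE) - 1)
				<< (node * MAX_CPUS_PER_NODE);
	}

	static int next_local_waiter(unsigned long waiters, int node)
	{
		/* ffsl() is 1-based and returns 0 if no bit is set. */
		return ffsl(waiters & node_mask(node)) - 1;
	}

If CPU numbers from two nodes were interleaved, node_mask() could no
longer be a single contiguous run of bits, and this gets much messier.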
From: Anton B. <an...@sa...> - 2001-10-07 10:37:08

> Interesting!  Usually, the hardware numbers CPUs within a NUMA
> domain/node, and often has large non-integer IDs for the nodes
> themselves.  So the OS has considerable freedom in assigning numbers.
>
> The current numalock algorithm relies on the ability to quickly map
> CPUs into bits in a long.  It also relies on a given node's CPU
> numbers not being interleaved with numbers from some other CPU.  There
> are NUMA-aware locking algorithms that do not require this sort of
> mapping, but these algorithms are -much- more complex (more than 1,000
> lines of code), and slower, too.

So long as we know the maximum number of cpus in a NUMA domain it
should be OK.  The architecture code could map NUMA domain identifiers
into some sort of logical order and we just allocate the maximum
number of cpus per domain.  That means we never get interleaving of
cpus, and it makes our life easier trying to map linux cpus back to
the hardware when the cpu plays up :)

> The NUMA-aware algorithm -can- handle "gaps" in the CPU numbering, so
> if the CPUs can still be mapped to small integers, and if "gaps" are
> left to allow for any additional CPUs that could be plugged into a
> node, we should be OK, right?

Yep, sounds good.

> I may be missing something, but doesn't
> the hotplug patch work with the existing code in the kernel that
> maintains arrays with per-CPU elements and that maintains bitmasks
> with per-CPU bits?  If so, the same sort of thing should work here,
> right?

Well the old for (i = 0; i < smp_num_cpus; i++) has been replaced with
a for_each_cpu() macro.  I don't think Rusty removed the reliance on
unsigned long cpu bitmasks, but it wouldn't be much more work on top
of the hotplug cpu patch; I had planned to test it a while ago.

Anton
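
P.S.  For anyone who has not seen the hotplug patch: the shape of the
macro is roughly the following sketch -- not Rusty's actual code, and
it assumes cpu_online_map is a single unsigned long:

	extern unsigned long cpu_online_map;	/* one bit per online cpu */

	#define BITS_PER_LONG	(8 * sizeof(unsigned long))

	#define cpu_online(cpu)	((cpu_online_map >> (cpu)) & 1UL)

	/* Visit every online cpu, silently skipping any holes. */
	#define for_each_cpu(cpu)					\
		for ((cpu) = 0; (cpu) < BITS_PER_LONG; (cpu)++)		\
			if (cpu_online(cpu))

so callers no longer assume that cpu numbers 0..smp_num_cpus-1 are
densely packed.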
From: Paul J. <pj...@en...> - 2001-10-09 23:26:56

On Fri, 5 Oct 2001, Paul McKenney wrote:
> The current numalock algorithm relies on the ability to quickly map
> CPUs into bits in a long.  It also relies on a given node's CPU
> numbers not being interleaved with numbers from some other CPU.  There
> are NUMA-aware locking algorithms that do not require this sort of
> mapping, but these algorithms are -much- more complex (more than 1,000
> lines of code), and slower, too.

    "It also relies on a given node's CPU numbers not
    being interleaved with numbers from some other CPU."

Well, I looked at the NUMA lock patch (thanks to Pat for pointing me
at it), and it doesn't even mention the word "node".  So I sit here
speculating how it is that this patch depends on having the cpu
numbering scheme reflect node structure.

My current working hypothesis is that this code wants to search in
some semblance of "node order" for the next cpu to wake up, and
depends on the bit positions in the lock word both (1) being numbered
by cpu number, and (2) being in the desired "node order"
(non-overlapping nodes and such).  Perhaps it wants "node order" for
performance, to keep the lock word within a node-local cache as long
as it can.

Is my hypothesis (a) anywhere close to accurate, and (b) written down
anywhere accessible to the public?

If I'm still somewhat in touch with reality here, then I can imagine
isolating the cpu numbering used by this locking code from the cpu
numbering scheme used by its callers, say by having a compact,
carefully ordered way of numbering cpus known only to this locking
code, and a map between the loose external numbering scheme and the
NUMA locking's tight internal scheme, stored in per-cpu data for fast,
usually cached, access.

Such a map would only have to be updated during system boot and when
some cpu was hot swapped in or out.  The vagaries of node numbering in
complex tiered systems are something to be isolated, not integrated,
if it is practical to do so, in my view.

Such isolation would also serve those who need more than 64 cpus.
Isolation in principle is near and dear to my heart.  Isolation in
support of > 64 cpus is near and dear to my wallet <grin>.

But I'd better quit climbing out on this limb, and see if it is going
to fall out from under me ...

I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
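
P.S.  Something like this sketch is what I have in mind (all names
invented just to pin the idea down; it assumes cpu_online() and
cpu_to_node() mappings exist somewhere):

	#define NR_CPUS		64
	#define MAX_NODES	16

	/* external (loose) cpu number -> dense, node-ordered lock id */
	static int lock_cpu_id[NR_CPUS];

	/* Rebuilt only at boot and when a cpu is hot swapped in/out. */
	static void rebuild_lock_cpu_map(void)
	{
		int node, cpu, next_id = 0;

		for (node = 0; node < MAX_NODES; node++)
			for (cpu = 0; cpu < NR_CPUS; cpu++)
				if (cpu_online(cpu) &&
				    cpu_to_node(cpu) == node)
					lock_cpu_id[cpu] = next_id++;
	}

The locking code would index its lock word by lock_cpu_id[cpu], and no
other part of the kernel would ever see the tight internal numbering.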
From: Paul M. <Pau...@us...> - 2001-10-06 00:08:01

> > The hotplug cpu patches remove the idea of logical cpu numbers
> > completely, replacing it with a for_each_cpu macro.  I assume the
> > hardware groups cpu ids with NUMA domains?
>
> The problem is how to determine CPU number at booting.
> Currently, there is no way to get the max. number of CPUs per node,
> at least on our system.  So, we have to determine the number by
> config or something.  (It's no big deal, though.)
>
> In addition, I'd like to make sure of the meaning of "consecutively."
> When I want to remove node 1, which consists of cpus 4,5,6,7,
> do I have to offline the cpus in number order like 7-6-5-4 to avoid
> making holes in the cpu numbers?

You could offline them in any order.  Holes in the CPU numbering are
OK.  My concern is rather with interleaving numbers of CPUs from
different nodes.  So if CPU 4 is from node 1 and CPU 7 is from node 1,
then CPUs 5 and 6 must be either from node 1 or nonexistent.  They
should -not- be from (say) node 3.

> And, when cpu 5 reports some HW error, is it possible to
> offline only cpu 5?

This should be OK, as it would just leave a hole in the sequence of
CPUs.  Does this work well with hotplug and dynamic insertion?

Thoughts?

					Thanx, Paul
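
P.S.  In case it helps, the rule can be stated as a check.  This is
hypothetical code, just to nail down the invariant -- it assumes
cpu_online() and cpu_to_node() exist, and that NR_CPUS and the node
count both fit in an unsigned long:

	/* Return 1 if no two nodes' CPU numbers interleave. */
	static int numbering_is_node_contiguous(void)
	{
		unsigned long closed = 0;  /* nodes whose run has ended */
		int cpu, prev = -1;

		for (cpu = 0; cpu < NR_CPUS; cpu++) {
			int node;

			if (!cpu_online(cpu))
				continue;		/* holes are OK */
			node = cpu_to_node(cpu);
			if (node == prev)
				continue;		/* same run continues */
			if (closed & (1UL << node))
				return 0;		/* node reappeared! */
			if (prev >= 0)
				closed |= 1UL << prev;	/* prev's run is over */
			prev = node;
		}
		return 1;
	}

Offlining 7-6-5-4, or offlining cpu 5 alone, both pass; a CPU from
node 3 showing up between node 1's CPUs 4 and 7 fails.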
From: Paul M. <Pau...@us...> - 2001-10-06 00:14:12

Hello, Kimi,

> > > I have a question about your document.
> > > # Sorry, I missed the today's conference. :(
> >
> > Sorry for the inconvenient time in your timezone!  3AM is a bit much
> > to expect...
>
> I'm living in CA currently, so I just missed it through my own fault. :P

I have missed it more than once myself for similar reasons, so I
cannot give you too much trouble about it.  ;-)

> > > Is this restriction necessary?
> > > I feel there is no API relying on this assumption.
> >
> > This restriction allows some NUMA-aware algorithms to rely on ordering
> > rather than having to know anything about the NUMA structure.  This in
> > turn allows some of these algorithms to be much simpler.  One example
> > of this is the NUMA-aware locking algorithm that Jack Vogel,
> > Swaminathan Sivasubramanian, and John Stultz have been working on.  I
> > presented an early version of their work during the NUMA session at
> > OLS, and an early patch may be found at sf.net/projects/lse (Numalock
> > Patch 1).
> >
> > This algorithm becomes considerably more complex if CPUs from
> > different nodes may be interleaved.  So it would be really, really
> > good if this restriction is in place.
>
> Does the numa lock allow a different number of CPUs per node?
> For example, when node 0 has 3 cpus and node 1 has 4 cpus,
> the CPU numbers will look like this in the current implementation:
>	node0: cpu0,1,2
>	node1: cpu3,4,5,6
> Will it be OK?

This would work fine as long as a fourth CPU cannot be inserted into
node 0.  The only restriction is that CPUs from different nodes not be
interleaved.

> > > If so, I have to look for a way to get max. number of CPUs per node. :(
> >
> > Can this be an architecture-specific parameter?  Perhaps an API?  The
> > architecture will have hard limits defined by the numbers of physical
> > connectors, slots, or sockets available, right?
>
> I hope the parameter can be obtained from ACPI or something..
> I don't want to see CONFIG_CPU_NUM_PER_NODE. :P

Agreed!  ;-)

> Anyway, I'll think about it.

Can't ask for more than that!

					Thanx, Paul
From: Paul M. <Pau...@us...> - 2001-10-08 00:40:34

> > Interesting!  Usually, the hardware numbers CPUs within a NUMA
> > domain/node, and often has large non-integer IDs for the nodes
> > themselves.  So the OS has considerable freedom in assigning numbers.
> >
> > The current numalock algorithm relies on the ability to quickly map
> > CPUs into bits in a long.  It also relies on a given node's CPU
> > numbers not being interleaved with numbers from some other CPU.  There
> > are NUMA-aware locking algorithms that do not require this sort of
> > mapping, but these algorithms are -much- more complex (more than 1,000
> > lines of code), and slower, too.
>
> So long as we know the maximum number of cpus in a NUMA domain it
> should be OK.  The architecture code could map NUMA domain identifiers
> into some sort of logical order and we just allocate the maximum
> number of cpus per domain.  That means we never get interleaving of
> cpus, and it makes our life easier trying to map linux cpus back to
> the hardware when the cpu plays up :)

Sounds great!  Sure would be nice if the hardware/firmware helps with
the maximum number of CPUs per domain.  ;-)

> > The NUMA-aware algorithm -can- handle "gaps" in the CPU numbering, so
> > if the CPUs can still be mapped to small integers, and if "gaps" are
> > left to allow for any additional CPUs that could be plugged into a
> > node, we should be OK, right?
>
> Yep, sounds good.

Great!

> > I may be missing something, but doesn't
> > the hotplug patch work with the existing code in the kernel that
> > maintains arrays with per-CPU elements and that maintains bitmasks
> > with per-CPU bits?  If so, the same sort of thing should work here,
> > right?
>
> Well the old for (i = 0; i < smp_num_cpus; i++) has been replaced with
> a for_each_cpu() macro.  I don't think Rusty removed the reliance on
> unsigned long cpu bitmasks, but it wouldn't be much more work on top
> of the hotplug cpu patch; I had planned to test it a while ago.

Interesting!  What sorts of things were the unsigned long CPU bitmasks
to be replaced with?

					Thanx, Paul
From: Paul M. <Pau...@us...> - 2001-10-10 01:14:58

> On Fri, 5 Oct 2001, Paul McKenney wrote:
> > The current numalock algorithm relies on the ability to quickly map
> > CPUs into bits in a long.  It also relies on a given node's CPU
> > numbers not being interleaved with numbers from some other CPU.  There
> > are NUMA-aware locking algorithms that do not require this sort of
> > mapping, but these algorithms are -much- more complex (more than 1,000
> > lines of code), and slower, too.
>
>     "It also relies on a given node's CPU numbers not
>     being interleaved with numbers from some other CPU."

Good catch!  This should instead be:

    "It also relies on a given node's CPU numbers not
    being interleaved with numbers from some other node."

> Well, I looked at the NUMA lock patch (thanks to Pat for pointing me
> at it), and it doesn't even mention the word "node".  So I sit here
> speculating how it is that this patch depends on having the cpu
> numbering scheme reflect node structure.
>
> My current working hypothesis is that this code wants to search in
> some semblance of "node order" for the next cpu to wake up, and
> depends on the bit positions in the lock word both (1) being numbered
> by cpu number, and (2) being in the desired "node order"
> (non-overlapping nodes and such).  Perhaps it wants "node order" for
> performance, to keep the lock word within a node-local cache as long
> as it can.
>
> Is my hypothesis (a) anywhere close to accurate, and (b) written down
> anywhere accessible to the public?

(a) Yep!  It uses ffs() to find the next CPU spinning on the lock.

(b) Other than in the source code of the patch, no.  We are working on
more accessible documentation, but at the current time, the patch is
the best we have.

> If I'm still somewhat in touch with reality here, then I can imagine
> isolating the cpu numbering used by this locking code from the cpu
> numbering scheme used by its callers, say by having a compact,
> carefully ordered way of numbering cpus known only to this locking
> code, and a map between the loose external numbering scheme and the
> NUMA locking's tight internal scheme, stored in per-cpu data for fast,
> usually cached, access.
>
> Such a map would only have to be updated during system boot and when
> some cpu was hot swapped in or out.  The vagaries of node numbering in
> complex tiered systems are something to be isolated, not integrated,
> if it is practical to do so, in my view.

Hmmm...  A nice flattening of the NUMA topology onto a linear space is
possible in the architectures I have come across.  Can you tell me
more about NUMA topologies where this would not be convenient?

> Such isolation would also serve those who need more than 64 cpus.
> Isolation in principle is near and dear to my heart.  Isolation in
> support of > 64 cpus is near and dear to my wallet <grin>.

;-)  However, >64 CPUs requires a different algorithm (which is OK,
since locking is architecture specific, though we hope to be able to
share much of the code with the proposed algorithm).  This algorithm
can only handle 63 CPUs (31 CPUs on systems that have only 32-bit
atomic operations).  For more CPUs, you need a different algorithm.

There is no shortage of candidate algorithms, BTW.  There are
something like eight different NUMA-aware locking algorithms that I
know of.  Probably the most appropriate is one that uses a
hierarchical bitmask, with the top level having one bit per node and
the lowest level having one bit per CPU on the corresponding node.
One can have levels in between, though it might not be worth it.  This
is limited to a few thousand CPUs, depending on the number of CPUs
permitted in a node.

There is one that has extremely high limits (50 or so CPUs per node,
and as many nodes as you have memory for) based on a queued lock where
CPUs "take cuts" behind other CPUs on their node, but limited to
prevent starvation.  This algorithm is quite complex (even by
proprietary-Unix standards), and is significantly slower than the
others.

Either of these last two algorithms could do CPU/node mapping without
much penalty, since they would be quite architecture dependent anyway.

Thoughts?

> But I'd better quit climbing out on this limb, and see if it is going
> to fall out from under me ...

I know the feeling!  ;-)

					Thanx, Paul
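
P.S.  A back-of-the-envelope sketch of the hierarchical-bitmask
variant (invented names, and none of the atomic operations a real lock
would need; the two levels are assumed to be kept consistent):

	#include <strings.h>	/* ffsl() (GNU); a kernel would use __ffs() */

	#define MAX_NODES		64
	#define MAX_CPUS_PER_NODE	64

	struct hier_waiters {
		/* bit n set: node n has at least one spinning CPU */
		unsigned long node_bits;
		/* bit c set: cpu c of that node is spinning */
		unsigned long cpu_bits[MAX_NODES];
	};

	/* Prefer a waiter on home_node, to keep the lock node-local. */
	static int next_waiter(struct hier_waiters *w, int home_node)
	{
		int node, cpu;

		if (w->cpu_bits[home_node])
			node = home_node;
		else if (w->node_bits)
			node = ffsl(w->node_bits) - 1;	/* ffsl() is 1-based */
		else
			return -1;			/* no waiters at all */

		cpu = ffsl(w->cpu_bits[node]) - 1;
		return node * MAX_CPUS_PER_NODE + cpu;
	}

Two levels of unsigned long already give 64*64 = 4096 CPUs, which is
where the "few thousand CPUs" limit above comes from.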
From: Paul J. <pj...@en...> - 2001-10-10 02:41:54

On Tue, 9 Oct 2001, Paul McKenney wrote:
> Good catch!  This should instead be:
>
>     "It also relies on a given node's CPU numbers not
>     being interleaved with numbers from some other node."

The good catch goes to you - I hadn't noticed the s/CPU/node/ typo.

> > If I'm still somewhat in touch with reality here, then I can imagine
> > isolating the cpu numbering used by this locking code from the cpu
> > numbering scheme used by its callers, say by having a compact,
> > carefully ordered way of numbering cpus known only to this locking
> > code, and a map between the loose external numbering scheme and the
> > NUMA locking's tight internal scheme, stored in per-cpu data for fast,
> > usually cached, access.
> >
> > Such a map would only have to be updated during system boot and when
> > some cpu was hot swapped in or out.  The vagaries of node numbering in
> > complex tiered systems are something to be isolated, not integrated,
> > if it is practical to do so, in my view.
>
> Hmmm...  A nice flattening of the NUMA topology onto a linear space is
> possible in the architectures I have come across.  Can you tell me
> more about NUMA topologies where this would not be convenient?
>
> > Such isolation would also serve those who need more than 64 cpus.
> > Isolation in principle is near and dear to my heart.  Isolation in
> > support of > 64 cpus is near and dear to my wallet <grin>.
>
> ;-)

hmmm ... I'm confused ... I thought I was just going from one linear
space to another, less sparse, linear space.  Scrap the above
brainstorming ... let me come at this from another direction.

I was concerned that the "node order" constraints being placed on cpu
numbering, complete with holes for missing cpus, weren't generally
needed.  I read the NUMA locking patch because that was the one
example I saw quoted where such numbering was claimed to be important.
And I came away from that reading with the impression: fine if that
code needs such careful numbering (to avoid ping-ponging the cache
line holding a lock word, apparently), but there might be a way to
keep from polluting the rest of the kernel, outside this lock code,
with that constraint on cpu numbering.

Then I started babbling and thinking this would also avoid polluting
the kernel with the 63 or 64 cpu limit as well.  Though it's really
the API of the locking routines, not their internals, that determines
whether the 64-bit assumption leaks.  So never mind that last >64 cpu
riff.

My original concern came upon reading the "Rationale for Proposed NUMA
API", which spent half the paper, and many fine pictures, explaining
how cpus had to be numbered in node order.  I saw little rationale for
that (play on words from the paper's title intended ;).  And it felt
to me like an instance of what should be a localized constraint
leaking into the rest of the known universe, for no good reason.

Though, granted, as a matter of practicality, we number our cpus one
node at a time, just like most folks.  So I guess it doesn't really
matter ... except to us purists who delight in finding the minimum
preconditions essential to a piece of code.

I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
From: <k-s...@mv...> - 2001-10-05 23:56:40

Hi, Paul

> > I have a question about your document.
> > # Sorry, I missed the today's conference. :(
>
> Sorry for the inconvenient time in your timezone!  3AM is a bit much
> to expect...

I'm living in CA currently, so I just missed it through my own fault. :P

> > Is this restriction necessary?
> > I feel there is no API relying on this assumption.
>
> This restriction allows some NUMA-aware algorithms to rely on ordering
> rather than having to know anything about the NUMA structure.  This in
> turn allows some of these algorithms to be much simpler.  One example
> of this is the NUMA-aware locking algorithm that Jack Vogel,
> Swaminathan Sivasubramanian, and John Stultz have been working on.  I
> presented an early version of their work during the NUMA session at
> OLS, and an early patch may be found at sf.net/projects/lse (Numalock
> Patch 1).
>
> This algorithm becomes considerably more complex if CPUs from
> different nodes may be interleaved.  So it would be really, really
> good if this restriction is in place.

Does the numa lock allow a different number of CPUs per node?
For example, when node 0 has 3 cpus and node 1 has 4 cpus,
the CPU numbers will look like this in the current implementation:
	node0: cpu0,1,2
	node1: cpu3,4,5,6
Will it be OK?

> > If so, I have to look for a way to get max. number of CPUs per node. :(
>
> Can this be an architecture-specific parameter?  Perhaps an API?  The
> architecture will have hard limits defined by the numbers of physical
> connectors, slots, or sockets available, right?

I hope the parameter can be obtained from ACPI or something..
I don't want to see CONFIG_CPU_NUM_PER_NODE. :P

Anyway, I'll think about it.

Regards,
Kimi
From: Paul J. <pj...@en...> - 2001-10-09 20:50:11

On Fri, 5 Oct 2001, Paul McKenney wrote:
> This restriction allows some NUMA-aware algorithms to rely on ordering
> rather than having to know anything about the NUMA structure.  This in
> turn allows some of these algorithms to be much simpler.  One example
> of this is the NUMA-aware locking algorithm that Jack Vogel,
> Swaminathan Sivasubramanian, and John Stultz have been working on.  I
> presented an early version of their work during the NUMA session at
> OLS, and an early patch may be found at sf.net/projects/lse (Numalock
> Patch 1).

My apologies for responding so late to this thread.  I was hoping to
understand more of what is needed to handle nodes, in particular to
determine how much of this could be resolved outside the kernel.

My intuition so far is that for most purposes, our NUMA systems are
simply a collection of processors and memory nodes (chunks of memory
with differing proximity to the processors).

For low level software that must navigate the actual buses, backplanes
and routers of our systems, certainly nodes matter.  And certainly the
abstract topology connecting the processors and memory banks matters,
as well as distance metrics such as <processor, memory> distances, for
memory latency and bandwidth, and <processor, processor> distances,
for cache affinity.

I am looking for situations in which the guts of the kernel (such as
scheduling and allocation) need to know more than can be derived from
this abstract topology and these distance metrics.

I went looking for more information about this NUMA-aware locking,
but at least on the SF page:

    http://sourceforge.net/project/showfiles.php?group_id=8875

it shows "No Files" under Numalock Patch 1.  Can someone point me to
something more on this?

I won't rest till it's the best ...
Manager, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373