From: Paul J. <pj...@en...> - 2001-07-12 04:29:12
On Wed, 11 Jul 2001, Paul McKenney wrote:

> One question on this approach -- If you have a system based on
> Intel quads, you have 4 CPUs and about 6 PCI slots per quad.  If
> you put both I/O devices and CPUs into the same bit vector, a 64-bit
> vector limits you to 6 quads, or 24 CPUs.  A 32-bit vector limits
> you to 3 quads, or 12 CPUs.

No - special purpose processors, such as might be found on a PCI
board, would almost never be listed in the 32 or 64 CPUs selected by
a cpumemset.  They would only appear in the distance vectors, and be
allocated physical numbers from the same space as the main CPUs.

Let's take an example.  Say we have 3 quads, each with 4 CPUs, 6 I/O
processors (the PCI slots are filled up with cards supporting DMA)
and 1 bank of memory.  Say the system assigns physical processor
numbers as follows:

    0..3     Quad 0 cpus
    4..9     Quad 0 dma engines
    10..13   Quad 1 cpus
    14..19   Quad 1 dma engines
    20..23   Quad 2 cpus
    24..29   Quad 2 dma engines

Then the initial default cpumemset, containing all 12 CPUs, would be
initialized:

    cpumemsets[0].physcpu = { 0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23 };

which can be covered with just 12 bits of cur and max cpus_allowed.

In short, just because an I/O or special purpose processor has a
physical number in the same space as CPUs doesn't mean that it will
consume bits in any given cpumemset.  This is what I took to be the
essence of the strong need to "virtualize" the physical numbering of
CPUs and mems -- a need that others had observed, I'm told, before I
joined this effort.

Aha - I see a typo in my HTML doc of last week, vaguely related to
this.  In the section "Selected Details of Implementation", item (1),
the comments in the pseudo-C code:

    struct cpumemset {
        int physcpu[WORDSIZE];    # map phys cpu number to virtual
        int physmem[WORDSIZE];    # map phys mem number to virtual
    };

are backwards.  These comments should read:

    struct cpumemset {
        int physcpu[WORDSIZE];    # map virtual cpu number to physical
        int physmem[WORDSIZE];    # map virtual mem number to physical
    };
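For concreteness, here is a minimal C sketch of the corrected mapping
in action.  This is illustration only, not the proposed API: the
ncpus field and the helper name are my invention.

    /*
     * Sketch only - resolve a virtual cpu number (an index into
     * cpus_allowed) to the physical cpu number, via the cpumemset
     * the process is attached to.
     */
    #define WORDSIZE 64

    struct cpumemset {
        int physcpu[WORDSIZE];  /* map virtual cpu number to physical */
        int physmem[WORDSIZE];  /* map virtual mem number to physical */
        int ncpus;              /* valid entries in physcpu (my addition) */
    };

    /* Return the physical cpu for virtual cpu 'vcpu', or -1 if out of range. */
    static int cms_virt_to_phys_cpu(struct cpumemset *cms, int vcpu)
    {
        if (vcpu < 0 || vcpu >= cms->ncpus)
            return -1;
        return cms->physcpu[vcpu];
    }

With the three-quad example above, cms_virt_to_phys_cpu(cms, 4)
returns 10 - the first cpu of Quad 1 - so a 12 bit cpus_allowed mask
suffices no matter what physical numbers the hardware drew.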
> I believe that the scalability limit implied
> by #2 (12 CPUs on a 32-bit system) is not sufficient.

You're quite right that that would be insufficient.  Due to the
above virtualization, there is no practical limit on the number of
CPUs in a system.  You could literally have a million CPUs.  It's
just that any one non-privileged process cannot schedule itself on
more than the 32 or 64 CPUs assigned to it, and cannot change those
32 or 64 CPUs without privileged assistance.

In particular, note that SGI ships 512 CPU systems, and has installed
experimental 1024 CPU systems.  I trust that this CPU count will
continue to grow.  Granted, these are MIPS/Irix systems.  But our
larger Intel-Linux systems are built from the same technology, except
for the CPU brick and the software.  So I'm certainly not going to
knowingly constrain the work we do to something like just 32 or 64
CPUs, or some corresponding number of quads.

> > Note that the limit is only on the set of cpus and mems that a
> > single, non-privileged process can directly manipulate.  A
> > privileged (root) process can create, destroy and attach to its
> > choice of cpumemset, enabling it to have the full run of the
> > system.
> >
> > The justifications (excuses) for this odd limit are:
> >
> >  1) The largest size set that can be passed via a classically
> >     clean system call has size WORDSIZE (32 or 64), such as
> >     happens when using the unsigned long ulimit parameters as a
> >     bit mask.  ...
> >
> > But I will admit to some uncertainty about this restriction.
> >
> > Reason (1) above is the key one for me -- if someone found a
> > suitable syscall amongst the 222 or so now assigned in Linux,
> > ...
>
> /proc?  This has been suggested by several people, both on lse-tech
> and privately.

I find /proc suitable for the kernel to display status and
configuration information.  I realize that it is occasionally used
by user code to set kernel configuration parameters ... a hack in my
view (which isn't to say it's beneath my dignity to code ;).  But it
seems wildly inappropriate to have non-privileged apps writing long
commands in ASCII to /proc in order to control details of their
execution.  The ultimate Swiss Army knife.  Heck - we could turn the
entire kernel syscall API into a single sequence of writes,
resembling the output of strace say, to a /proc location ;).

> > The <cpu, cpu> distance is a measure of how costly it would
> > be due to caching effects to change the current cpu on which
> > a task is executing (and has considerable cache presence)
> > to the other cpu.
>
> Yes!!!  Also the cost of placing two tasks that tightly share data
> on widely separated CPUs.

Excellent observation - thanks.

> > Now then - what did you have in mind above?
>
> I believe you pretty much covered it.  CPU-CPU for cache-affinity
> considerations, CPU-memory for placement decisions, I/O-memory for
> multipath I/O decisions, etc.

Excellent.

> > > o "Software Components to be Developed", item #1: There was a
> > >   patch from Kanoj on this.  Paul Dorwin has started porting
> > >   this to NUMA-Q.  ...
> >
> > Yes - Sam Ortiz, who works with me, is looking at this patch now,
> > as is likely evident to you from another thread on this lse list.
> >
> > ...
>
> I agree that it would be good to have a common hardware
> representation, at least for the usual components such as memories,
> CPUs, caches, and so on.  Seems like it would also be reasonable to
> have a naming scheme for vendor-specific components, so that future
> expansions to the common set of hardware would be guaranteed not to
> collide.
>
> Thoughts?

When discussing this with Sam Ortiz today, I learned that he has
taken Kanoj's patch and is modifying it to produce <cpu, cpu> and
<cpu, mem> distances, rather than <node, node>.  I think he is onto
something good.  I will let him speak further to that, when he is
ready.

As for a common vendor expansion naming scheme - sounds good.  I
don't have any immediate inspiration here.

> ... the ability to keep applications in separate partitions from
> interfering with each other.  For example if an application in one
> partition is a memory hog, you don't want to penalize a
> well-behaved application in another partition that happens to draw
> from a common pool of memory.  Specifying a per-pool memory limit
> as part of the cpumemset seems like one way to achieve this.

Sorta sounds to me like get/setrlimit RLIMIT_RSS.  To which you
might respond: but the administrator can't impose such limits on
processes "from the outside".  To which I might respond: seems to me
that's a problem with several of these get/setrlimit options -- they
can only be imposed on oneself (and future children).
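To make the comparison concrete, here is a minimal sketch of the
rlimit call I have in mind.  The get/setrlimit API is the standard
one; the 64 MB figure is arbitrary, and how strictly a given kernel
actually enforces RLIMIT_RSS varies.

    #include <sys/resource.h>
    #include <stdio.h>

    /*
     * A process voluntarily capping its own resident set size.
     * Note there is no pid argument - the limit applies only to the
     * caller (and future children), which is exactly the "can't
     * impose it from the outside" complaint above.
     */
    int main(void)
    {
        struct rlimit rl;

        rl.rlim_cur = 64UL * 1024 * 1024;  /* soft limit, bytes */
        rl.rlim_max = 64UL * 1024 * 1024;  /* hard limit, bytes */

        if (setrlimit(RLIMIT_RSS, &rl) != 0) {
            perror("setrlimit(RLIMIT_RSS)");
            return 1;
        }
        return 0;
    }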
> Thank -you- for the interesting and useful proposal!

You're welcome.  Hopefully we (the engineers working for me) can
back it up with code, in a suitable patch, in due time, and then we
(on this LSE project) can encourage more of the community to adopt
this work and include it in mainstream Linux.

I will be on a six week SGI sabbatical, starting in one week, from
July 21 through Sept 3.  I expect that some combination of Nick
Pollitt and John Hawkes will be working with LSE during that time.
They will probably both be phoning into this Friday's LSE call.

                    I won't rest till it's the best ...

                    Manager, Linux Scalability
                    Paul Jackson <pj...@sg...> 1.650.933.1373