From: Paul J. <pj...@en...> - 2001-07-12 04:29:12
On Wed, 11 Jul 2001, Paul McKenney wrote:

> One question on this approach -- If you have a system based on
> Intel quads, you have 4 CPUs and about 6 PCI slots per quad.  If
> you put both I/O devices and CPUs into the same bit vector, a 64-bit
> vector limits you to 6 quads, or 24 CPUs.  A 32-bit vector limits
> you to 3 quads, or 12 CPUs.

No - special purpose processors, such as might be found on a PCI
board, would almost never be listed in the 32 or 64 CPUs selected by
a cpumemset.  They would only appear in the distance vectors, and be
allocated physical numbers from the same space as the main CPUs.

Let's take an example.  Say we have 3 quads, each with 4 CPUs, 6 I/O
processors (the PCI slots are filled up with cards supporting DMA)
and 1 bank of memory.  Say the system assigns physical processor
numbers as follows:

    0..3     Quad 0 cpus
    4..9     Quad 0 dma engines
    10..13   Quad 1 cpus
    14..19   Quad 1 dma engines
    20..23   Quad 2 cpus
    24..29   Quad 2 dma engines

Then the initial default cpumemset, containing all 12 CPUs, would be
initialized:

    cpumemsets[0].physcpu = { 0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23 };

which can be covered with just 12 bits of cur and max cpus_allowed.

In short, just because an I/O or special purpose processor has a
physical number in the same space as CPUs doesn't mean that it will
consume bits in any given cpumemset.  This is what I took to be the
essence of the strong need to "virtualize" the physical numbering of
CPUs and mems -- a need that others had observed, I'm told, before I
joined this effort.

Aha - I see a typo in my HTML doc of last week, vaguely related to
this.  In the section "Selected Details of Implementation", item (1),
the comments in the pseudo-C code:

    struct cpumemset {
        int physcpu[WORDSIZE];    # map phys cpu number to virtual
        int physmem[WORDSIZE];    # map phys mem number to virtual
    };

are backwards.  These comments should read:

    struct cpumemset {
        int physcpu[WORDSIZE];    # map virtual cpu number to physical
        int physmem[WORDSIZE];    # map virtual mem number to physical
    };
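For concreteness, here is a minimal C sketch of the corrected mapping
in action.  This is illustration only, not the proposed API: the
ncpus field and the helper name are my invention.

    /*
     * Sketch only - resolve a virtual cpu number (an index into
     * cpus_allowed) to the physical cpu number, via the cpumemset
     * the process is attached to.
     */
    #define WORDSIZE 64

    struct cpumemset {
        int physcpu[WORDSIZE];  /* map virtual cpu number to physical */
        int physmem[WORDSIZE];  /* map virtual mem number to physical */
        int ncpus;              /* valid entries in physcpu (my addition) */
    };

    /* Return the physical cpu for virtual cpu 'vcpu', or -1 if out of range. */
    static int cms_virt_to_phys_cpu(struct cpumemset *cms, int vcpu)
    {
        if (vcpu < 0 || vcpu >= cms->ncpus)
            return -1;
        return cms->physcpu[vcpu];
    }

With the three-quad example above, cms_virt_to_phys_cpu(cms, 4)
returns 10 - the first cpu of Quad 1 - so a 12 bit cpus_allowed mask
suffices no matter what physical numbers the hardware drew.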
> I believe that the scalability limit implied
> by #2 (12 CPUs on a 32-bit system) is not sufficient.

You're quite right that that would be insufficient.  Due to the
above virtualization, there is no practical limit on the number of
CPUs in a system.  You could literally have a million CPUs.  It's
just that any one non-privileged process cannot schedule itself on
more than the 32 or 64 CPUs assigned to it, and cannot change those
32 or 64 CPUs without privileged assistance.

In particular, note that SGI ships 512 CPU systems, and has installed
experimental 1024 CPU systems.  I trust that this CPU count will
continue to grow.  Granted, these are MIPS/Irix systems.  But our
larger Intel-Linux systems are built from the same technology, except
for the CPU brick and the software.  So I'm certainly not going to
knowingly constrain the work we do to something like just 32 or 64
CPUs, or some corresponding number of quads.

> > Note that the limit is only on the set of cpus and mems that a
> > single, non-privileged process can directly manipulate.  A
> > privileged (root) process can create, destroy and attach to its
> > choice of cpumemset, enabling it to have the full run of the
> > system.
> >
> > The justifications (excuses) for this odd limit are:
> >
> >  1) The largest size set that can be passed via a classically
> >     clean system call has size WORDSIZE (32 or 64), such as
> >     happens when using the unsigned long ulimit parameters as a
> >     bit mask.  ...
> >
> > But I will admit to some uncertainty about this restriction.
> >
> > Reason (1) above is the key one for me -- if someone found a
> > suitable syscall amongst the 222 or so now assigned in Linux,
> > ...
>
> /proc?  This has been suggested by several people, both on lse-tech
> and privately.

I find /proc suitable for the kernel to display status and
configuration information.  I realize that it is occasionally used
by user code to set kernel configuration parameters ... a hack in my
view (which isn't to say it's beneath my dignity to code ;).  But it
seems wildly inappropriate to have non-privileged apps writing long
commands in ASCII to /proc in order to control details of their
execution.  The ultimate Swiss Army knife.  Heck - we could turn the
entire kernel syscall API into a single sequence of writes,
resembling the output of strace say, to a /proc location ;).

> > The <cpu, cpu> distance is a measure of how costly it would
> > be due to caching effects to change the current cpu on which
> > a task is executing (and has considerable cache presence)
> > to the other cpu.
>
> Yes!!!  Also the cost of placing two tasks that tightly share data
> on widely separated CPUs.

Excellent observation - thanks.

> > Now then - what did you have in mind above?
>
> I believe you pretty much covered it.  CPU-CPU for cache-affinity
> considerations, CPU-memory for placement decisions, I/O-memory for
> multipath I/O decisions, etc.

Excellent.

> > > o "Software Components to be Developed", item #1: There was a
> > >   patch from Kanoj on this.  Paul Dorwin has started porting
> > >   this to NUMA-Q.  ...
> >
> > Yes - Sam Ortiz, who works with me, is looking at this patch now,
> > as is likely evident to you from another thread on this lse list.
> >
> > ...
>
> I agree that it would be good to have a common hardware
> representation, at least for the usual components such as memories,
> CPUs, caches, and so on.  Seems like it would also be reasonable to
> have a naming scheme for vendor-specific components, so that future
> expansions to the common set of hardware would be guaranteed not to
> collide.
>
> Thoughts?

When discussing this with Sam Ortiz today, I learned that he has
taken Kanoj's patch and is modifying it to produce <cpu, cpu> and
<cpu, mem> distances, rather than <node, node>.  I think he is onto
something good.  I will let him speak further to that, when he is
ready.

As for a common vendor expansion naming scheme - sounds good.  I
don't have any immediate inspiration here.

> ... the ability to keep applications in separate partitions from
> interfering with each other.  For example if an application in one
> partition is a memory hog, you don't want to penalize a
> well-behaved application in another partition that happens to draw
> from a common pool of memory.  Specifying a per-pool memory limit
> as part of the cpumemset seems like one way to achieve this.

Sorta sounds to me like get/setrlimit RLIMIT_RSS.  To which you
might respond: but the administrator can't impose such limits on
processes "from the outside".  To which I might respond: seems to me
that's a problem with several of these get/setrlimit options -- they
can only be imposed on oneself (and future children).
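To make the comparison concrete, here is a minimal sketch of the
rlimit call I have in mind.  The get/setrlimit API is the standard
one; the 64 MB figure is arbitrary, and how strictly a given kernel
actually enforces RLIMIT_RSS varies.

    #include <sys/resource.h>
    #include <stdio.h>

    /*
     * A process voluntarily capping its own resident set size.
     * Note there is no pid argument - the limit applies only to the
     * caller (and future children), which is exactly the "can't
     * impose it from the outside" complaint above.
     */
    int main(void)
    {
        struct rlimit rl;

        rl.rlim_cur = 64UL * 1024 * 1024;  /* soft limit, bytes */
        rl.rlim_max = 64UL * 1024 * 1024;  /* hard limit, bytes */

        if (setrlimit(RLIMIT_RSS, &rl) != 0) {
            perror("setrlimit(RLIMIT_RSS)");
            return 1;
        }
        return 0;
    }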
> Thank -you- for the interesting and useful proposal!

You're welcome.  Hopefully we (the engineers working for me) can
back it up with code, in a suitable patch, in due time, and then we
(on this LSE project) can encourage more of the community to adopt
this work and include it in mainstream Linux.

I will be on a six week SGI sabbatical, starting in one week, from
July 21 through Sept 3.  I expect that some combination of Nick
Pollitt and John Hawkes will be working with LSE during that time.
They will probably both be phoning into this Friday's LSE call.

                    I won't rest till it's the best ...

                    Manager, Linux Scalability
                    Paul Jackson <pj...@sg...> 1.650.933.1373