[Lse-tech] Re: [RFC PATCH] scheduler: Dynamic sched_domains

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Matthew Dobson wrote:
> This code is in no way complete.  But since I brought it up in the
> "cpusets - big numa cpu and memory placement" thread, I figure the code
> needs to be posted.
> 
> The basic idea is as follows:
> 
> 1) Rip out sched_groups and move them into the sched_domains.
> 2) Add some reference counting, and eventually locking, to
> sched_domains.
> 3) Rewrite & simplify the way sched_domains are built and linked into a
> cohesive tree.
> 

OK. I'm not sure that I like the direction, but... (I haven't looked
too closely at it).

> This should allow us to support hotplug more easily, simply removing the
> domain belonging to the going-away CPU, rather than throwing away the
> whole domain tree and rebuilding from scratch.

Although what we have in -mm now should support CPU hotplug just fine.
The hotplug guys really seem not to care how disruptive a hotplug
operation is.

>  This should also allow
> us to support multiple, independent (ie: no shared root) domain trees
> which will facilitate isolated CPU groups and exclusive domains.  I also

Hmm, what was my word for them... yeah, disjoint. We can do that now,
see isolcpus= for a subset of the functionality you want (doing larger
exclusive sets would probably just require we run the setup code once
for each exclusive set we want to build).

> hope this will allow us to leverage the existing topology infrastructure
> to build domains that closely resemble the physical structure of the
> machine automagically, thus making supporting interesting NUMA machines
> and SMT machines easier.
> 
> This patch is just a snapshot in the middle of development, so there are
> certainly some uglies & bugs that will get fixed.  That said, any
> comments about the general design are strongly encouraged.  Heck, any
> feedback at all is welcome! :) 
> 
> Patch against 2.6.9-rc3-mm2.

This is what I did in my first (that nobody ever saw) implementation of
sched domains. Ie. no sched_groups, just use sched_domains as the balancing
object... I'm not sure this works too well.

For example, your bottom level domain is going to basically be a redundant,
single CPU on most topologies, isn't it?

Also, how will you do overlapping domains that SGI want to do (see
arch/ia64/kernel/domain.c in -mm kernels)?

node2 wants to balance between node0, node1, itself, node3, node4.
node4 wants to balance between node2, node3, itself, node5, node6.
etc.

I think your lists will get tangled, no?