From: Peter W. <pwi...@bi...> - 2004-10-06 23:23:31
|
Matthew Dobson wrote: > On Tue, 2004-10-05 at 19:08, Paul Jackson wrote: > > I don't know that these partitions would necessarily need their own > scheduler, allocator and resource manager, or if we would just make the > current scheduler, allocator and resource manager aware of these > boundaries. In either case, that is an implementation detail not to be > agonized over now. It's not so much whether they NEED their own scheduler, etc. as whether it should be possible for them to have their own scheduler, etc. With a configurable scheduler (such as ZAPHOD) this could just be a matter of having separate configuration variables for each cpuset (e.g. if a cpuset has been created to contain a bunch of servers there's no need to try and provide good interactive response for its tasks (as none of them will be interactive), so the interactive response mechanism can be turned off in that cpuset, leading to better server response and throughput). Peter -- Peter Williams pwi...@bi... "Learning, n. The kind of ignorance distinguishing the studious." -- Ambrose Bierce |
From: Rick L. <ric...@us...> - 2004-10-07 00:18:16
|
It's not so much whether they NEED their own scheduler, etc. as whether it should be possible for them to have their own scheduler, etc. With a configurable scheduler (such as ZAPHOD) this could just be a matter of having separate configuration variables for each cpuset (e.g. if a cpuset has been created to contain as bunch of servers there's no need to try and provide good interactive response for its tasks (as none of them will be interactive) so the interactive response mechanism can be turned off in that cpuset leading to better server response and throughput). Providing configurable schedulers is a feature/bug/argument completely separate from cpusets. Let's stay focused on that for now. Two concrete examples for cpusets stick in my mind: * the department that has been given 16 cpus of a 128 cpu machine, is free to do what they want with them, and doesn't much care specifically how they're laid out. Think general timeshare. * the department that has been given 16 cpus of a 128 cpu machine to run a finely tuned application which expects and needs everybody to stay off those cpus. Think compute-intensive. Correct me if I'm wrong, but CKRM can handle the first, but cannot currently handle the second. And the mechanism(s) for creating either situation are suboptimal at best and non-existent at worst. Rick |
From: Paul J. <pj...@sg...> - 2004-10-07 18:31:01
|
Rick wrote: > > Two concrete examples for cpusets stick in my mind: > > * the department that has been given 16 cpus of a 128 cpu machine, > is free to do what they want with them, and doesn't much care > specifically how they're laid out. Think general timeshare. > > * the department that has been given 16 cpus of a 128 cpu machine > to run a finely tuned application which expects and needs everybody > to stay off those cpus. Think compute-intensive. > > Correct me if I'm wrong, but CKRM can handle the first, but cannot > currently handle the second. Even the first scenario is not well handled by CKRM, in my view, for most workloads. On a 128 cpu system, if you want 16 cpus of compute power, you are much better off having that power on 16 specific cpus, rather than getting 12.5% of each of the 128 cpus, unless your workload has a very low cache footprint. I think of it like this. Long ago, I learned to consider performance for many of the applications I wrote in terms of how many disk accesses I needed, for the disk was a thousand times slower than the processor and dominated performance across a broad scale. The gap between the speed of interior cpu cycles and external ram access across a bus or three is approaching the processor to disk gap of old. A complex hierarchy of caches has grown up, within and surrounding each processor, in an effort to ameliorate this gap. The dreaded disk seek of old is now the cache line miss of today. Look at the advertisements for compute power for hire in the magazines. I can rent a decent small computer, with web access and offsite backup, in an air conditioned room with UPS and 24/7 administration for under $100/month. These advertisements never sell me 12.5% of the cycles on each of the 128 cpus in a large server. They show pictures of some nice little rack machine -- that can be all mine, for just $79/month. Sign up now with our online web server and be using your system in minutes. [ hmmm ... wonder how many spam filters I hit on that last paragraph ... ] -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Paul J. <pj...@sg...> - 2004-10-07 08:55:46
|
> I don't see what non-exclusive cpusets buys us. One can nest them, overlap them, and duplicate them ;) For example, we could do the following: * Carve off CPUs 128-255 of a 256 CPU system in which to run various HPC jobs, requiring varying numbers of CPUs. This is named /dev/cpuset/hpcarena, and it is the really really exclusive and isolated sort of cpuset which can and does have its own scheduler domain, for a scheduler configuration that is tuned for running a mix of HPC jobs. In this hpcarena also run the per-cpu kernel threads that are pinned on CPUs 128-255 (for _all_ tasks running on an exclusive cpuset must be in that cpuset or below). * The testing group gets half of this cpuset each weekend, in order to run a battery of tests: /dev/cpuset/hpcarena/testing. In this testing cpuset runs the following batch manager. * They run a home brew batch manager, which takes an input stream of test cases, carves off a small cpuset of the requested size, and runs that test case in that cpuset. This results in cpusets with names like: /dev/cpuset/hpcarena/testing/test123. Our test123 is running in this cpuset. * Test123 here happens to be a test of the integrity of cpusets, so it sets up a couple of cpusets to run two independent jobs, each a 2 CPU MPI job. This results in the cpusets: /dev/cpuset/hpcarena/testing/test123/a and /dev/cpuset/hpcarena/testing/test123/b. Our little MPI jobs 'a' and 'b' are running in these two cpusets. We now have several nested cpusets, each overlapping its ancestors, with tasks in each cpuset. But only the top hpcarena cpuset has the exclusive ownership, with no form of overlap, of everything in its subtree that something like a distinct scheduler domain wants. Hopefully the above is not what you meant by "little more than a convenient way to group tasks." > 2) rewrite the scheduler/allocator to deal with these bindings up front, > and take them into consideration early in the scheduling/allocating > process. The allocator is less stressed here by varied mems_allowed settings than is the scheduler. For in 99+% of the cases, the allocator is dealing with a zonelist that has the local (currently executing) node first on the zonelist, and is dealing with a mems_allowed that allows allocation on the local node. So the allocator almost always succeeds the first time it goes to see if the candidate page it has in hand comes from a node allowed in current->mems_allowed. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
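As a rough illustration of the hierarchy Paul describes above, the setup might be driven through the /dev/cpuset pseudo-filesystem along the following lines. This is a sketch only: the control file names (cpus, mems, cpu_exclusive, tasks) follow the cpuset patch's documented interface, while the mount point, the memory node numbers and the $TESTPID variable are invented for the example.

    # Mount the cpuset pseudo-filesystem (assumed mount point).
    mkdir -p /dev/cpuset
    mount -t cpuset none /dev/cpuset

    # The exclusive HPC arena: CPUs 128-255 plus (illustratively) their local memory nodes.
    mkdir /dev/cpuset/hpcarena
    echo 128-255 > /dev/cpuset/hpcarena/cpus
    echo 64-127  > /dev/cpuset/hpcarena/mems
    echo 1       > /dev/cpuset/hpcarena/cpu_exclusive

    # The testing group's weekend half of the arena.
    mkdir /dev/cpuset/hpcarena/testing
    echo 192-255 > /dev/cpuset/hpcarena/testing/cpus
    echo 96-127  > /dev/cpuset/hpcarena/testing/mems

    # The batch manager carves off a small cpuset for test123 ...
    mkdir /dev/cpuset/hpcarena/testing/test123
    echo 192-193 > /dev/cpuset/hpcarena/testing/test123/cpus
    echo 96      > /dev/cpuset/hpcarena/testing/test123/mems

    # ... and attaches the test job to it by writing its pid.
    echo $TESTPID > /dev/cpuset/hpcarena/testing/test123/tasks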
From: Rick L. <ric...@us...> - 2004-10-07 10:55:47
|
> I don't see what non-exclusive cpusets buys us. One can nest them, overlap them, and duplicate them ;) For example, we could do the following: Once you have the exclusive set in your example, wouldn't the existing functionality of CKRM provide you all the functionality the other non-exclusive sets require? Seems to me, we need a way to *restrict use* of certain resources (exclusive) and a way to *share use* of certain resources (non-exclusive.) CKRM does the latter right now, I believe, but not the former. (Does CKRM support sharing hierarchies as in the dept/group/individual example you used?) What about this model: * All exclusive sets exist at the "top level" (non-overlapping, non-hierarchical) and each is represented by a separate sched_domain hierarchy suitable for the hardware used to create the cpuset. I can't imagine anything more than an academic use for nested exclusive sets. * All non-exclusive sets are rooted at the "top level" but may subdivide their range as needed in a tree fashion (multiple levels if desired). Right now I believe this functionality could be provided by CKRM. Observations: * There is no current mechanism to create exclusive sets; cpus_allowed alone won't cut it. A combination of Matt's patch plus Paul's code could probably resolve this. * There is no clear policy on how to amiably create an exclusive set. The main problem is what to do with the tasks already there. I'd suggest they get forcibly moved. If their current cpus_allowed mask does not allow them to move, then if they are a user process they are killed. If they are a system process and cannot be moved, they stay and gain squatter's rights in the newly created exclusive set. * Interrupts are not under consideration right now. They land where they land, and this may affect exclusive sets. If this is a problem, for now, you simply lay out your hardware and exclusive sets more intelligently. * Memory allocation has a tendency and preference, but no hard policy with regards to where it comes from. A task which starts on one part of the system but moves to another may have all its memory allocated relatively far away. In unusual cases, it may acquire remote memory because that's all that's left. A memory allocation policy similar to cpus_allowed might be needed. (Martin?) * If we provide a means for creating exclusive sets, I haven't heard a good reason why CKRM can't manage this. Rick |
From: Paul J. <pj...@sg...> - 2004-10-07 14:33:15
|
Rick wrote: > > Once you have the exclusive set in your example, wouldn't the existing > functionality of CKRM provide you all the functionality the other > non-exclusive sets require? > > Seems to me, we need a way to *restrict use* of certain resources > (exclusive) and a way to *share use* of certain resources (non-exclusive.) > CKRM does the latter right now, I believe, but not the former. I'm losing you right at the top here, Rick. Sorry. I'm no CKRM wizard, so tell me if I'm wrong. But doesn't CKRM provide a way to control what percentage of the compute cycles are available from a pool of cycles? And don't cpusets provide a way to control which physical CPUs a task can or cannot use? For workloads of relatively independent tasks, it might not matter (though even that is dubious - see my cache comments, below). For parallel threaded apps with rapid synchronization between the threads, as one gets with say OpenMP or MPI, there's a world of difference. Giving both threads in a 2-way application of this kind 50% of the cycles on each of 2 processors can be an order of magnitude slower than giving each thread 100% of one processor. Similarly, the variability of runtimes for such threads pinned on distinct processors can be an order of magnitude less than for floating threads. For shared resource environments where one is purchasing time on your own computer, there's also a world of difference. In many cases one has paid (whether in real money to another company, or in inter-departmental funny money - doesn't matter a whole lot here) money for certain processor power, and darn well expects those processors to sit idle if you don't use them. And the vendor (whether your ISP or your MIS department) of these resources can't hide the difference. Your work runs faster and with dramatically more consistent runtimes if the entire processor/memory units are yours, all yours, whether you use them or not. The cache effects matter here as well. Unlike the 6800 I first learned to program, not all cycles are created equally on the fancy processors of today. There is layer upon layer of caching, trying to compensate for the enormous speed difference between the internal cpu clock and the speed of external ram, and again between the speed of the ram and that of the disk. A useful compute cycle on a hot cpu can be hundreds or thousands of times faster than one on a cold cpu (hot - you've been running there; cold - you haven't been). There is a fundamental difference between controlling which physical processors on an SMP or NUMA system one may use, and adding delays to the tasks of select users to ensure they don't use too much. In the experience of SGI, and I hear tell of other companies, workload management by fair share techniques (add delays to tasks exceeding their allotment) has been found to be dramatically less useful to customers, year after year, in comparison to having a means to control on which CPUs tasks may be scheduled, and on which Memory Nodes they may obtain pages of ram. > * There is no clear policy on how to amiably create an exclusive set. > The main problem is what to do with the tasks already there. There is a policy that works well, and that those of us in this business have been using for years. When the system boots, you put everything that doesn't need to be pinned elsewhere in a bootcpuset, and leave the rest of the system dark. 
You then, whether by manual administrative techniques or a batch scheduler, hand out dedicated sets of CPU and Memory to jobs, which get exclusive use of those compute resources (or controlled sharing with only what you intentionally let share). > * If we provide a means for creating exclusive sets, I haven't heard > a good reason why CKRM can't manage this. Unless someone has rewritten CKRM behind my back to do the pinning of cpusets, it doesn't do that. That's why CKRM can't do this. Consider the following analogy. Many of us reading this have two cars in the driveway - our car and our wife's car (fewer will have a husband's car, granted). When we go out to get in our car, we don't interchangeably take whichever one is closest to the street, we take our car. This is an example of dedicated use. If we don't drive someplace that day, then our car just sits there, unused. This car use pattern is like cpusets. CKRM is more like the taxi service in New York. All those yellow cars are interchangeable in my mind. I take the first one I can get. I make sure to leave no personal possession behind when I leave it, because I have no prospect of ever seeing that yellow car again. Somewhere there has been a serious disconnect here. Either I seriously failed to understand what you wrote, or one of us is seriously confused as to the differences between cpusets and CKRM. Where do you think that the disconnect lies? The difference between cpusets and CKRM is not about restricting versus sharing. Rather cpusets is about controlled allocation of big, named chunks of a computer - certain numbered CPUs and Memory Nodes allocated by number. CKRM is about enforcing the rate of usage of anonymous, fungible resources such as cpu cycles and memory pages. Unfortunately for CKRM, on modern system architectures of two or more CPUs, cycles are not interchangeable and fungible, due to the caching. On NUMA systems, which is the norm for all vendors above 10 or 20 CPUs (due to our inability to make a backplane fast enough to handle more), memory pages are not interchangeable and fungible either. If you made other good points, you will have to repeat them. I got lost in the disconnect. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
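To make Paul's two-way example concrete: the placement he argues for amounts to giving each thread a whole CPU rather than half of two. A minimal sketch, assuming the schedutils taskset utility is available and the thread pids are known (both pid variables here are hypothetical); the same affinity masks could equally be set directly with sched_setaffinity(2).

    # Pin each thread of a hypothetical 2-way OpenMP/MPI job to its own CPU.
    taskset -p 0x1 $THREAD_A_PID    # thread A may run only on CPU 0
    taskset -p 0x2 $THREAD_B_PID    # thread B may run only on CPU 1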
From: Rick L. <ric...@us...> - 2004-10-07 19:08:51
|
> Once you have the exclusive set in your example, wouldn't the existing > functionality of CKRM provide you all the functionality the other > non-exclusive sets require? > > Seems to me, we need a way to *restrict use* of certain resources > (exclusive) and a way to *share use* of certain resources (non-exclusive.) > CKRM does the latter right now, I believe, but not the former. I'm losing you right at the top here, Rick. Sorry. I'm no CKRM wizard, so tell me if I'm wrong. But doesn't CKRM provide a way to control what percentage of the compute cycles are available from a pool of cycles? And don't cpusets provide a way to control which physical CPUs a task can or cannot use? Right. And what I'm hearing is that if you're a job running in a set of shared resources (i.e., non-exclusive) then by definition you are *not* a job who cares about which processor you run on. I can't think of a situation where I'd care about the physical locality, and the proximity of memory and other nodes, but NOT care that other tasks might steal my cycles. For parallel threaded apps with rapid synchronization between the threads, as one gets with say OpenMP or MPI, there's a world of difference. Giving both threads in a 2-way application of this kind 50% of the cycles on each of 2 processors can be an order of magnitude slower than giving each thread 100% of one processor. Similarly, the variability of runtimes for such threads pinned on distinct processors can be an order of magnitude less than for floating threads. Ah, so you want processor affinity for the tasks, then, not cpusets. For shared resource environments where one is purchasing time on your own computer, there's also world of difference. In many cases one has paid (whether in real money to another company, or in inter-departmental funny money - doesn't matter a whole lot here) money for certain processor power, and darn well expects those processors to sit idle if you don't use them. One does? No, in my world, there's constant auditing going on and if you can get away with having a machine idle, power to ya, but chances are somebody's going to come and take away at least the cycles and maybe the whole machine for somebody yammering louder than you about their budget cuts. You get first cut, but if you're not using it, you don't get to sit fat and happy. And the vendor (whether your ISP or your MIS department) of these resources can't hide the difference. Your work runs faster and with dramatically more consistent runtimes if the entire processor/memory units are yours, all yours, whether you use them or not. When I'm not using them, my work doesn't run faster. It just doesn't run. There is a fundamental difference between controlling which physical processors on an SMP or NUMA system one may use, and adding delays to the tasks of select users to ensure they don't use too much. In the experience of SGI, and I hear tell of other companies, workload management by fair share techniques (add delays to tasks exceeding their allotment) has been found to be dramatically less useful to customers, Less useful than ... what? As a substitute for exclusive access to one or more cpus, which currently is not possible? I can believe that. But you're saying these companies didn't size their tasks properly to the cpus they had allocated and yet didn't require exclusivity? How would non-exclusive sets address this human failing? You have 30 cpus' worth of tasks to run on 24 cpus. Somebody will take a hit, right, whether CKRM or cpusets are managing those 24 cpus? 
> * There is no clear policy on how to amiably create an exclusive set. > The main problem is what to do with the tasks already there. There is a policy, that works well, and those of us in this business have been using for years. When the system boots, you put everything that doesn't need to be pinned elsewhere in a bootcpuset, and leave the rest of the system dark. You then, whether by manual administrative techniques or a batch scheduler, hand out dedicated sets of CPU and Memory to jobs, which get exclusive use of those compute resources (or controlled sharing with only what you intentionally let share). This presumes you know, at boot time, how you want things divided. All of your examples so far have seemed to indicate that policy changes may well be made *after* boot time. So I'll rephrase: any time you create an exclusive set after boot time, you may find tasks already running there. I suggested one policy for dealing with them. The difference between cpusets and CKRM is not about restricting versus sharing. Rather cpusets is about controlled allocation of big, named chunks of a computer - certain numbered CPUs and Memory Nodes allocated by number. CKRM is about enforcing the rate of usage of anonymous, fungible resources such as cpu cycles and memory pages. Unfortunately for CKRM, on modern system architectures of two or more CPUs, cycles are not interchangeable and fungible, due to the caching. On NUMA systems, which is the norm for all vendors above 10 or 20 CPUs (due to our inability to make a backplane fast enough to handle more) memory pages are not interchangeable and fungible either. CKRM is not going to merrily move tasks around just because it can, either, and it will still adhere to common scheduling principles regarding cache warmth and processor affinity. You use the example of a two car family, and preferring one over the other. I'd turn that around and say it's really two exclusive sets of one car each, rather than a shared set of two cars. In that example, do you ask your wife before you take "her" car, or do you just take it because it's a shared resource? I know how it works in *my* family :) You've given a convincing argument for the exclusive side of things. But my point is that on the non-exclusive side the features you claim to need seem in conflict: if the cpu/memory linkage is important to job predictability, how can you then claim it's ok to share it with anybody, even a "friendly" task? If it's ok to share, then you've just thrown predictability out the window. The cpu/memory linkage is interesting, but it won't drive the job performance anymore. I'm trying to nail down requirements. I think we've nailed down the exclusive one. It's real, and it's currently unmet. The code you've written looks to provide a good base upon which to meet that requirement. On the non-exclusive side, I keep hearing conflicting information about how layout is important for performance but it's ok to share with arbitrary jobs -- like sharing won't affect performance? Rick |
From: Paul J. <pj...@sg...> - 2004-10-10 02:30:44
|
Rick wrote: > One does? No, in my world, there's constant auditing going on and if > you can get away with having a machine idle, power to ya, but chances > are somebody's going to come and take away at least the cycles and maybe I don't doubt that such worlds as yours exist, nor that you live in one. In some of the worlds my customers live in, they have been hit so many times with the pains of performance degradation and variation due to unwanted interaction between applications that they get nervous if a supposedly unused CPU or Memory looks to be in use. Just the common use by Linux of unused memory to keep old pages in cache upsets them. And, perhaps more to the point, while indeed some other department may soon show up to make use of those lost cycles, the computer had jolly well better leave those cycles lost _until_ the customer decides to use them. Unlike the computer in my dentist's office, which should "just do it", maximizing throughput as best it can, not asking any questions, the computers in some of my customers' high end shops are managed more tightly (sometimes very tightly) and they expect to control load placement. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Hubertus F. <fr...@wa...> - 2004-10-07 20:18:38
|
Just catching up again on the thread (sorry for long absence due to ....). This msg seems as good as any other to respond to. Paul Jackson wrote: > Rick wrote: > >>Once you have the exclusive set in your example, wouldn't the existing >>functionality of CKRM provide you all the functionality the other >>non-exclusive sets require? >> >>Seems to me, we need a way to *restrict use* of certain resources >>(exclusive) and a way to *share use* of certain resources (non-exclusive.) >>CKRM does the latter right now, I believe, but not the former. > The way this is heading is quite promising. - sched_domains seems the right answer wrt partitioning the machine. Given some boot option or dynamic means one can shift cpus from one domain to the next domain. - If I understood correctly, there would be only one level of such hard partitioning, that is, an exclusive cpu-set or sched_domain. - In each such domain/set we allow shared *use*. If that is the model then CKRM cpu scheduling can certainly be deployed into it and be the agent of that model. First, one needs to understand that sched_domains are nothing else but a set of cpus that are considered during load balancing times. By constricting the top_domain to the respective exclusive set one essentially has accomplished the desired feature of partitioning the machines into "isolated" sections (here from cpu perspective). So it is quite possible that an entire domain is empty based, while another exclusive domain would be totally overloaded. That said, if we associate a domain with a toplevel taskclass one would end up with potentially the following hierarchy for a system:

    /rcfs/taskclass
        domain-1
            cls-1
            cls-2
        domain-2
            cls-x
            cls-y
        domain-3
        :

We can then associate a "config" with these domains setting the appropriate cpu_set/group they are comprised of:

    # echo "res=cpu,cpuset=4..7" > /rcfs/taskclass/domain-1
    # echo "res=mem,memset=node1" > /rcfs/taskclass/domain-2

Doing so would accomplish the following: (a) the cpu-controller would create/modify the domain, while observing the exclusiveness constraints among its children; (b) it recognizes that its share specifications are relative to the size of the domain; (c) it ensures that load-balancing (we do that differently right now) is accomplished among the cpus of the class; and (d) accounting is relative to the share. NOTE (as I tried to specify earlier): No class can span multiple domains. > There is a fundamental difference between controlling which physical > processors on an SMP or NUMA system one may use, and adding delays > to the tasks of select users to ensure they don't use too much. > > In the experience of SGI, and I hear tell of other companies, workload > management by fair share techniques (add delays to tasks exceeding their > allotment) has been found to be dramatically less useful to customers, > year after year, in comparison to having a means to control on which > CPUs tasks may be scheduled, and on which Memory Nodes they may obtain > pages of ram. Paul, you simply assume that the two goals are exclusive of each other. That is simply not true. As in the hierarchy above, which does exactly represent what you describe below, shares will be relative to the domain level if the cpuset config is set. So specifying 100% for domain-1 means 100% of its CPUs. > > >> * There is no clear policy on how to amiably create an exclusive set. >> The main problem is what to do with the tasks already there. > > > There is a policy, that works well, and those of us in this business > have been using for years. 
When the system boots, you put everything > that doesn't need to be pinned elsewhere in a bootcpuset, and leave > the rest of the system dark. You then, whether by manual administrative > techniques or a batch scheduler, hand out dedicated sets of CPU and > Memory to jobs, which get exclusive use of those compute resources > (or controlled sharing with only what you intentionally let share). That approach above will work. > > >> * If we provide a means for creating exclusive sets, I haven't heard >> a good reason why CKRM can't manage this. Exactly .. and on top of it CKRM /rcfs can be used to provide the API to do so (see above). > > > Unless someone has rewritten CKRM behind my back to do the pinning > of cpusets, it doesn't do that. That's why CKRM can't do this. > > Consider the following analogy. Many of us reading this have two > cars in the driveway - our car and our wife's car (fewer will have > a husband's car, granted). When we go out to get in our car, we > don't interchangebly take whichever one is closest to the street, > we take our car. This is an example of dedicated use. If we don't > drive someplace that day, then our car just sits there, unused. > This car use pattern is like cpusets. OK .. since you like analogies. When I picked up my puppy from the kennel all it did was eat and shit and not much more. Well I didn't sit there thinking "Ohh my god, all my puppy is going to do is go eat and shit all its life", I started teaching it new tricks. That "trick" is laid out above, binding domains with CKRM cpu scheduling to give exactly what one wants and using the CKRM interface to create the domains. And yes, if within a domain you pin your task to a particular cpu, the CKRM cpu scheduler will adhere to that. That is no different than in today's scheduler. If you have one large domain of 128 cpus the CPU scheduler today will try to load balance these as well only to fail to see that certain tasks can't be moved... As Matt or Rick pointed out some messages back, using cpu_mask as the basic mechanism to control affinity system wide (that's what cpusets does) is the wrong approach. So going with the dynamic domains as the basic mechanism to cut up your global system into smaller parts and then using pinning within is certainly more scalable (particularly at load balancing time). 
Once you open your mind to the fact that CKRM can be deployed within a subset of disconnected resources (cpu domains) and manages shares independently within that domain, I truly don't see what the problem is. You seem to insist that because CKRM is currently system wide it stays that way. This is a prototype implementation. As outlined above, this is quite feasible to do. > > Unfortunately for CKRM, on modern system architectures of two or more > CPUs, cycles are not interchangeable and fungible, due to the caching. > On NUMA systems, which is the norm for all vendors above 10 or 20 CPUs > (due to our inability to make a backplane fast enough to handle more) > memory pages are not interchangeable and fungible either. Just to let you know, I was involved in many NUMA projects, I wrote IBM SP1's MPI implementation and Gang Scheduler some time back, so I am intimately familiar with all these issues. Please don't assume ignorance about these things at our end. > > If you made other good points, you will have to repeat them. I got > lost in the disconnect. > The only thing that was unclear to me so far was whether one needs true hierarchies in cpusets; based on the discussion threads and your own example, that does not seem to be necessary. -- Hubertus |
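Pulling Hubertus's proposal together, the resulting configuration might look something like the sketch below. Note that the directory layout and the res= syntax are the hypothetical interface he describes above, not an existing CKRM API, and the shares file syntax here is likewise invented purely for illustration.

    # Hypothetical: bind each top-level taskclass/domain to its own CPUs
    # (syntax as proposed by Hubertus above, not an existing API).
    echo "res=cpu,cpuset=0..3" > /rcfs/taskclass/domain-1
    echo "res=cpu,cpuset=4..7" > /rcfs/taskclass/domain-2

    # Classes under domain-1; their shares are relative to domain-1's
    # four CPUs, not to the whole machine (share syntax invented here).
    mkdir /rcfs/taskclass/domain-1/cls-1
    mkdir /rcfs/taskclass/domain-1/cls-2
    echo "res=cpu,share=75" > /rcfs/taskclass/domain-1/cls-1/shares
    echo "res=cpu,share=25" > /rcfs/taskclass/domain-1/cls-2/shares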
From: Matthew D. <col...@us...> - 2004-10-09 00:23:42
|
On Thu, 2004-10-07 at 13:12, Hubertus Franke wrote: > The way this is heading is quite promising. > - sched_domains seems the right answer wrt to partitioning the machine. > Given some boot option or dynamic means one can shift cpus from > on domain to the next domain. > - If I understood correctly, there would be only one level of such > hard partitioning, speak exclusive cpu-set or sched_domain. > - In each such domain/set we allow shared *use*. I don't think that there needs to be a requirement that we have only one level of hard partitioning. The rest of your points are valid though, Hubertus. It'd be really nice if we could all get together with a wall of whiteboards, some markers, and a few pots of coffee. I think we'd all get this pretty much hashed out in an hour or two. This isn't directed at you, Hubertus, but at the many communication breakdowns in this thread. > First, one needs to understand that sched_domains are nothing else > but a set of cpus that are considered during load balancing times. > By constricting the top_domain to the respective exclusive set one > essentially has accomplished the desired feature of partitioning > the machines into "isolated" sections (here from cpu perspective). > So it is quite possible that an entire domain is empty based, while > another exclusive domain would be totally overloaded. I think that is very well stated, Hubertus. By having a more or less passive data structure that is only checked at load balance time, we can ensure (in a very light-weight way) that no task ever moves *out* of its area nor moves *into* someone else's area. -Matt |
From: Matthew D. <col...@us...> - 2004-10-09 00:07:46
|
On Thu, 2004-10-07 at 07:28, Paul Jackson wrote: > Rick wrote: > > * There is no clear policy on how to amiably create an exclusive set. > > The main problem is what to do with the tasks already there. > > There is a policy, that works well, and those of us in this business > have been using for years. When the system boots, you put everything > that doesn't need to be pinned elsewhere in a bootcpuset, and leave > the rest of the system dark. You then, whether by manual administrative > techniques or a batch scheduler, hand out dedicated sets of CPU and > Memory to jobs, which get exclusive use of those compute resources > (or controlled sharing with only what you intentionally let share). No one is trying to take that away. There is nothing that says you can't boot with a small, 1-2 CPU 'boot' domain where you stick all those tasks you typically put in a 'boot' cpuset. <offtopic> In fact, people have talked before about reducing boot times by booting only a single CPU, then bringing the rest online later. This work could potentially facilitate that. Boot up just a single 'boot' CPU. All 'boot' tasks would necessarily be stuck there. Create a new 'work' domain and add (hotplug on) CPUs into that domain to your heart's content. </offtopic> -Matt |
From: Paul J. <pj...@sg...> - 2004-10-10 02:18:19
|
Rick replying to Paul: > > But doesn't CKRM provide a way to control what percentage of the > > compute cycles are available from a pool of cycles? > > > > And don't cpusets provide a way to control which physical CPUs a > > task can or cannot use? > > Right. I am learning (see other messages of the last couple days on this thread) that CKRM is supposed to be a general purpose workload manager framework, and that fair share scheduling (managing percentage of compute cycles) just happens to be the first instance of such a manager. > And what I'm hearing is that if you're a job running in a set of shared > resources (i.e., non-exclusive) then by definition you are *not* a job > who cares about which processor you run on. I can't think of a situation > where I'd care about the physical locality, and the proximity of memory > and other nodes, but NOT care that other tasks might steal my cycles. There are at least these situations: 1) proximity to special hardware (graphics, networking, storage, ...) 2) non-dedicated tightly coupled multi-threaded apps (OpenMP, MPI) 3) batch managers switching resources between jobs On (2), if say you want to run eight copies of an application, on a system that only has eight CPUs, where each copy of the app is an eight-way tightly coupled app, they will go much faster if each app is placed across all 8 CPUs, one thread per CPU, than if they are placed willy-nilly. Or a bit more realistically, if you have a random input queue of such tightly coupled apps, each with a predetermined number of threads between one and eight, you will get more work done by pinning the threads of any given app on different CPUs. The users submitting the jobs may well not care which CPUs are used for their job, but an intermediate batch manager probably will care, as it may be solving the knapsack problem of how to fit a stream of varying sized jobs onto a given size of hardware. On (3), a batch manager might say have two small cpusets, and also one larger cpuset that is the two small ones combined. It might run one job in each of the two small cpusets for a while, then suspend these two jobs, in order to run a third job in the larger cpuset. The two small cpusets don't go away while the third job runs -- you don't want to lose or have to tear down and rebuild the detailed inter-cpuset placement of the two small jobs while they are suspended. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Matthew D. <col...@us...> - 2004-10-11 22:11:30
|
On Sat, 2004-10-09 at 19:15, Paul Jackson wrote: > Rick replying to Paul: > > And what I'm hearing is that if you're a job running in a set of shared > > resources (i.e., non-exclusive) then by definition you are *not* a job > > who cares about which processor you run on. I can't think of a situation > > where I'd care about the physical locality, and the proximity of memory > > and other nodes, but NOT care that other tasks might steal my cycles. > > There are at least these situations: > 1) proximity to special hardware (graphics, networking, storage, ...) > 2) non-dedicated tightly coupled multi-threaded apps (OpenMP, MPI) > 3) batch managers switching resources between jobs > > On (2), if say you want to run eight copies of an application, on a > system that only has eight CPUs, where each copy of the app is an > eight-way tightly coupled app, they will go much faster if each app is > placed across all 8 CPUs, one thread per CPU, than if they are placed > willy-nilly. Or a bit more realistically, if you have a random input > queue of such tightly coupled apps, each with a predetermined number of > threads between one and eight, you will get more work done by pinning > the threads of any given app on different CPUs. The users submitting > the jobs may well not care which CPUs are used for their job, but an > intermediate batch manager probably will care, as it may be solving the > knapsack problem of how to fit a stream of varying sized jobs onto a > given size of hardware. > > On (3), a batch manager might say have two small cpusets, and also one > larger cpuset that is the two small ones combined. It might run one job > in each of the two small cpusets for a while, then suspend these two > jobs, in order to run a third job in the larger cpuset. The two small > cpusets don't go away while the third job runs -- you don't want to lose > or have to tear down and rebuild the detailed inter-cpuset placement of > the two small jobs while they are suspended. I think these situations, particularly the first two, are the times you *want* to use the cpus_allowed mechanism. Pinning a specific thread to a specific processor (cases (1) & (2)) is *exactly* why the cpus_allowed mechanism was put into the kernel. And (3) can pretty easily be achieved by using a combination of sched_domains and cpus_allowed. In your example of one 4 CPU cpuset and two 2 CPU sub cpusets (cpu-subsets? :), one could easily create a 4 CPU domain for the larger job and two 2 CPU domains for the smaller jobs. Those two 2 CPU subdomains can be created & destroyed at will, or they could be simply tagged as "exclusive" when you don't want tasks moving back and forth between them, and tagged as "non-exclusive" when you want tasks to be freely balanced across all 4 CPUs in the larger parent domain. One of the cool things about using sched_domains as your partitioning element is that in reality, tasks run on *CPUs*, not *domains*. So if you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can suspend threads a1, a2, b1 & b2 and remove the domains they were running in to allow job A (big job with threads A1, A2, A3, & A4) to run on the larger 4 CPU domain. When you then suspend A1-A4 again to allow the smaller jobs to proceed, you can pretty trivially create the 2 CPU domains underneath the 4 CPU domain and resume the jobs. 
Those jobs (a & b) have been suspended on the CPUs they were originally running on, and thus will resume on the same CPUs without any extra effort. They will simply run on those CPUs, and at load balance time, the domains attached to those CPUs will be consulted to determine where the tasks can be relocated to if there is a heavy load. The domains will tell the scheduler that the tasks cannot be relocated outside the 2 CPUs in each respective domain. Viola! (sorta ;) -Matt |
From: Paul J. <pj...@sg...> - 2004-10-11 23:02:24
|
Matthew wrote: > One of the cool thing about using sched_domains as your partitioning > element is that in reality, tasks run on *CPUs*, not *domains*. Unfortunately, my manager has reminded me of an essential deliverable that I have for another project, due in two weeks. I'm going to need every one of those days. So I will have to take a two week sabbatical from this design work. It might make sense to reconvene this work on a new thread, with a last message on this monster thread inviting all interested parties to come on over. I suspect a few folks will be happy to see this thread wind down. I'd guess lse-tech (my preference) or ckrm-tech would be a suitable forum for this new thread. Carry on. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Matthew D. <col...@us...> - 2004-10-12 21:23:37
|
On Mon, 2004-10-11 at 15:58, Paul Jackson wrote: > Matthew wrote: > > One of the cool thing about using sched_domains as your partitioning > > element is that in reality, tasks run on *CPUs*, not *domains*. > > Unfortunately, my manager has reminded me of an essential deliverable > that I have for another project, due in two weeks. I'm going to need > every one of those days. So I will have to take a two week sabbatical > from this design work. > > It might make sense to reconvene this work on a new thread, with a last > message on this monster thread inviting all interested parties to come > on over. I suspect a few folks will be happy to see this thread wind > down. > > I'd guess lse-tech (my preference) or ckrm-tech would be a suitable > forum for this new thread. > > Carry on. Sounds good, Paul. I think the discussion on this thread was kind of winding down anyway. In two weeks I'll have some more work done on my code, particularly trying to get the cpusets/CKRM filesystem interface to play with my sched_domains code. We'll be able to digest all the information, requirements, requests, etc. on this thread and start a fresh discussion on (or at least closer to) the same page. -Matt |
From: Simon D. <Sim...@bu...> - 2004-10-12 08:53:39
|
> One of the cool thing about using sched_domains as your partitioning > element is that in reality, tasks run on *CPUs*, not *domains*. So if > you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and > threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can > suspend threads a1, a2, b1 & b2 and remove the domains they were running > in to allow job A (big job with threads A1, A2, A3, & A4) to run on the > larger 4 CPU domain. When you then suspend A1-A4 again to allow the > smaller jobs to proceed, you can pretty trivially create the 2 CPU > domains underneath the 4 CPU domain and resume the jobs. Those jobs (a > & b) have been suspended on the CPUs they were originally running on, > and thus will resume on the same CPUs without any extra effort. They > will simply run on those CPUs, and at load balance time, the domains > attached to those CPUs will be consulted to determine where the tasks > can be relocated to if there is a heavy load. The domains will tell the > scheduler that the tasks cannot be relocated outside the 2 CPUs in each > respective domain. Viola! (sorta ;) Voilà ;-) I agree that this looks really smooth from a scheduler point of view. From a user point of view, there remains the issue of suspending the tasks: - find which tasks to suspend: how do you know that job 'a' consists exactly of 'a1' and 'a2'? - suspend them (btw, how do you achieve this? kill -STOP?) I've been away from my mail and still trying to catch up, never mind if the above does not make sense to you. Simon. |
From: Matthew D. <col...@us...> - 2004-10-12 21:26:52
|
On Tue, 2004-10-12 at 01:50, Simon Derr wrote: > > One of the cool thing about using sched_domains as your partitioning > > element is that in reality, tasks run on *CPUs*, not *domains*. So if > > you have threads 'a1' & 'a2' running on CPUs 0 & 1 (small job 'a') and > > threads 'b1' & 'b2' running on CPUs 2 & 3 (small job 'b'), you can > > suspend threads a1, a2, b1 & b2 and remove the domains they were running > > in to allow job A (big job with threads A1, A2, A3, & A4) to run on the > > larger 4 CPU domain. When you then suspend A1-A4 again to allow the > > smaller jobs to proceed, you can pretty trivially create the 2 CPU > > domains underneath the 4 CPU domain and resume the jobs. Those jobs (a > > & b) have been suspended on the CPUs they were originally running on, > > and thus will resume on the same CPUs without any extra effort. They > > will simply run on those CPUs, and at load balance time, the domains > > attached to those CPUs will be consulted to determine where the tasks > > can be relocated to if there is a heavy load. The domains will tell the > > scheduler that the tasks cannot be relocated outside the 2 CPUs in each > > respective domain. Viola! (sorta ;) > Voilà ;-) hehe... My French spelling obviously isn't quite up to par. ;) > I agree that this looks really smooth from a scheduler point of view. > > From a user point of view, remains the issue of suspending the tasks: > -find which tasks to suspend : how do you know that job 'a' consists > exactly of 'a1' and 'a2' > -suspend them (btw, how do you achieve this ? kill -STOP ?) > > I've been away from my mail and still trying to catch up, nevermind if the > above does not makes sense to you. > > Simon. Paul didn't go into specifics about how to suspend the job, so neither did I. Sending SIGSTOP & SIGCONT should work, as you mention... Those are implementation details which really aren't *that* important to the discussion. We're still trying to figure out the overall framework and API to work with, so which method of suspending a thread we'll eventually use can be tackled down the road. :) -Matt |
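For what it's worth, the suspend step Simon asks about is straightforward if the jobs are attached to cpusets as in Paul's earlier hpcarena example, since the cpuset's tasks file then answers the "which pids belong to job 'a'" question; with bare sched_domains the batch manager would have to keep its own list of pids. A hedged sketch using the hypothetical paths from that example:

    # Suspend every task attached to small job 'a' ...
    for pid in $(cat /dev/cpuset/hpcarena/testing/test123/a/tasks); do
        kill -STOP $pid
    done
    # ... run the big job in the parent cpuset, then resume job 'a'.
    for pid in $(cat /dev/cpuset/hpcarena/testing/test123/a/tasks); do
        kill -CONT $pid
    done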
From: Martin J. B. <mb...@ar...> - 2004-10-07 14:43:22
|
> * Interrupts are not under consideration right now. They land where > they land, and this may affect exclusive sets. If this is a > problem, for now, you simply lay out your hardware and exclusive > sets more intelligently. They're easy to fix, just poke the values in /proc appropriately (same as cpus_allowed, exactly). > * Memory allocation has a tendency and preference, but no hard policy > with regards to where it comes from. A task which starts on one > part of the system but moves to another may have all its memory > allocated relatively far away. In unusual cases, it may acquire > remote memory because that's all that's left. A memory allocation > policy similar to cpus_allowed might be needed. (Martin?) The membind API already does this. M. |
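For reference, the userspace pokes Martin refers to look roughly like this; the IRQ number and node number are hypothetical, and numactl stands in here for the underlying mbind()/set_mempolicy() calls of the membind API.

    # Steer IRQ 19 to CPUs 0-1 by writing a hex cpumask into /proc.
    echo 3 > /proc/irq/19/smp_affinity

    # Start a job with its memory allocations restricted to node 1.
    numactl --membind=1 ./my_hpc_job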
From: Simon D. <Sim...@bu...> - 2004-10-07 12:53:18
|
On Thu, 7 Oct 2004, Paul Jackson wrote: > > I don't see what non-exclusive cpusets buys us. > > One can nest them, overlap them, and duplicate them ;) I would also add, if the decision comes to make 'real exclusive' cpusets, my previous example as a use for non-exclusive cpusets: we are running jobs that need to be 'mostly' isolated on some part of the system, and run in a specific location. We use cpusets for that. But we can't afford to dedicate a part of the system for administrative tasks (daemons, init..). These tasks should not be put inside one of the 'exclusive' cpusets, even temporarily: they do not belong there. They should just be allowed to steal a few cpu cycles from time to time: non-exclusive cpusets are the way to go. |
From: Martin J. B. <mb...@ar...> - 2004-10-07 14:52:02
|
> On Thu, 7 Oct 2004, Paul Jackson wrote: > >> > I don't see what non-exclusive cpusets buys us. >> >> One can nest them, overlap them, and duplicate them ;) > > I would also add, if the decision comes to make 'real exclusive' cpusets, > my previous example, as a use for non-exclusive cpusets: > > we are running jobs that need to be 'mostly' isolated on some part of the > system, and run in a specific location. We use cpusets for that. But we > can't afford to dedicate a part of the system for administrative tasks > (daemons, init..). These tasks should not be put inside one of the > 'exclusive' cpusets, even temporary : they do not belong there. They > should just be allowed to steal a few cpu cycles from time to time : non > exclusive cpusets are the way to go. That makes no sense to me whatsoever, I'm afraid. Why if they were allowed "to steal a few cycles" are they so fervently banned from being in there? You can keep them out of your userspace management part if you want. So we have the purely exclusive stuff, which needs kernel support in the form of sched_domains alterations. The rest of cpusets is just poking and prodding at cpus_allowed, the membind API, and the irq binding stuff. All of which you could do from userspace, without any further kernel support, right? Or am I missing something? M. |
From: Paul J. <pj...@sg...> - 2004-10-07 17:56:42
|
Martin wrote: > > So we have the purely exclusive stuff, which needs kernel support in the form > of sched_domains alterations. The rest of cpusets is just poking and prodding > at cpus_allowed, the membind API, and the irq binding stuff. All of which > you could do from userspace, without any further kernel support, right? > Or am I missing something? Well ... we're gaining. A couple of days ago you were suggesting that cpusets could be replaced with some exclusive domains plus CKRM. Now it's some exclusive domains plus poking the affinity masks. Yes - you're still missing something. But I must keep in mind that I had concluded, perhaps three years ago, just what you conclude now: that cpusets is just poking some affinity masks, and that I could do most of it from user land. The result ended up missing some important capabilities. User level code could not manage collections of hardware nodes (sets of CPUs and Memory Nodes) in a co-ordinated and controlled manner. The users of cpusets need to have system wide names for them, with permissions for viewing, modifying and attaching to them, and with the ability to list both what hardware (CPUs and Memory) is in a cpuset, and what tasks are attached to a cpuset. As is usual in such operating systems, the kernel manages such system wide synchronized controlled access views. As I quote below, I've been saying this repeatedly. Could you tell me, Martin, whether the disconnect is: 1) that you didn't yet realize that cpusets provided this model (names, permissions, ...) or 2) you don't think such a model is useful, or 3) you think that such a model can be provided sensibly from user space? If I knew this, I could focus my response better. The rest of this message is just quotes from this last week - many can stop reading here. === Date: Fri, 1 Oct 2004 23:06:44 -0700 From: Paul Jackson <pj...@sg...> Even the flat model (no hierarchy) uses require some way to name and control access to cpusets, with distinct permissions for examining, attaching to, and changing them, that can be used and managed on a system wide basis. === Date: Sat, 2 Oct 2004 12:14:30 -0700 From: Paul Jackson <pj...@sg...> And our customers _do_ want to manage these logically isolated chunks as named "virtual computers" with system managed permissions and integrity (such as the system-wide attribute of "Exclusive" ownership of a CPU or Memory by one cpuset, and a robust ability to list all tasks currently in a cpuset). === Date: Sat, 2 Oct 2004 19:26:03 -0700 From: Paul Jackson <pj...@sg...> Consider the following use case scenario, which emphasizes this isolation aspect (and ignores other requirements, such as the need for system admins to manage cpusets by name [some handle valid across process contexts], with a system wide imposed permission model and exclusive use guarantees, and with a well defined system supported notion of which tasks are "in" which cpuset at any point in time). === Date: Sun, 3 Oct 2004 18:41:24 -0700 From: Paul Jackson <pj...@sg...> SGI makes heavy and critical use of the cpuset facilities on both Irix and Linux that have been developed since pset. These facilities handle both cpu and memory placement, and provide the essential kernel support (names and permissions and operations to query, modify and attach) for a system wide administrative interface for managing the resulting sets of CPUs and Memory Nodes. === Date: Tue, 5 Oct 2004 02:17:36 -0700 From: Paul Jackson <pj...@sg...> To: "Martin J. 
Bligh" <mb...@ar...> The /dev/cpuset pseudo file system api was chosen because it was convenient for small scale work, learning and experimentation, because it was a natural for the hierarchical name space with permissions that I required, and because it was convenient to leverage existing vfs structure in the kernel. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Martin J. B. <mb...@ar...> - 2004-10-07 18:14:35
|
>> So we have the purely exclusive stuff, which needs kernel support in the form >> of sched_domains alterations. The rest of cpusets is just poking and prodding >> at cpus_allowed, the membind API, and the irq binding stuff. All of which >> you could do from userspace, without any further kernel support, right? >> Or am I missing something? > > Well ... we're gaining. A couple of days ago you were suggesting > that cpusets could be replaced with some exclusive domains plus > CKRM. > > Now it's some exclusive domains plus poking the affinity masks. > > Yes - you're still missing something. > > But I must keep in mind that I had concluded, perhaps three years ago, > just what you conclude now: that cpusets is just poking some affinity > masks, and that I could do most of it from user land. The result ended > up missing some important capabilities. User level code could not > manage collections of hardware nodes (sets of CPUs and Memory Nodes) in > a co-ordinated and controlled manner. > > The users of cpusets need to have system wide names for them, with > permissions for viewing, modifying and attaching to them, and with the > ability to list both what hardware (CPUs and Memory) in a cpuset, and > what tasks are attached to a cpuset. As is usual in such operating > systems, the kernel manages such system wide synchronized controlled > access views. > > As I quote below, I've been saying this repeatedly. Could you > tell me, Martin, whether the disconnect is: > 1) that you didn't yet realize that cpusets provided this model (names, > permissions, ...) or > 2) you don't think such a model is useful, or > 3) you think that such a model can be provided sensibly from user space? > > If I knew this, I could focus my response better. > > The rest of this message is just quotes from this last week - many > can stop reading here. My main problem is that I don't think we want lots of overlapping complex interfaces in the kernel. Plus I think some of the stuff proposed is fairly klunky as an interface (physical binding where it's mostly not needed, and yes I sort of see your point about keeping jobs on separate CPUs, though I still think it's tenuous), and makes heavy use of stuff that doesn't work well (e.g. cpus_allowed). So I'm searching for various ways to address that. The purely exclusive parts of cpusets can be implemented in a much nicer manner inside the kernel, by messing with sched_domains, instead of just using cpus_allowed as a mechanism ... so that seems like much less of a problem. The non-exclusive bits seem to overlap heavily with both CKRM and what could be done in userspace. I still think the physical stuff is rather obscure, and binding stuff to specific CPUs is an ugly way to say "I want these two threads to not run on the same CPU". But if we can find some other way (eg userspace) to allow you to do that should you utterly insist on doing so, that'd be a convenient way out. As for the names and permissions issue, both would be *doable* from userspace, though maybe not as easily as in-kernel. Names would probably be less hassle than permissions, but neither would be impossible, it seems. It all just seems like a lot of complexity for a fairly obscure set of requirements for a very limited group of users, to be honest. Some bits (eg partitioning system resources hard in exclusive sets) would seem likely to be used by a much broader audience, and thus are rather more attractive. 
But they could probably be done with a much simpler interface than the whole cpusets (BTW, did that still sit on top of PAGG as well, or is that long gone?) M. |
From: Rick L. <ric...@us...> - 2004-10-07 20:41:21
|
The users of cpusets need to have system wide names for them, with permissions for viewing, modifying and attaching to them, and with the ability to list both what hardware (CPUs and Memory) in a cpuset, and what tasks are attached to a cpuset. As is usual in such operating systems, the kernel manages such system wide synchronized controlled access views. Well, you are *asserting* the kernel will manage this. But doesn't CKRM offer this capability? The only thing it *can't* do is assure exclusivity, today .. correct? Rick |
From: Paul J. <pj...@sg...> - 2004-10-10 02:37:12
|
> The only thing it *can't* do is assure > exclusivity, today .. correct? No. Could you look back to my other posts of this last week and let us know if I've answered your query in more detail already? Thanks. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |
From: Paul J. <pj...@sg...> - 2004-10-10 05:14:32
|
> That makes no sense to me whatsoever, I'm afraid. Why if they were allowed > "to steal a few cycles" are they so fervently banned from being in there? One substantial advantage of cpusets (as in the kernel patch in *-mm's tree), over variations that "just poke the affinity masks from user space" is the task->cpuset pointer. This tracks to what cpuset a task is attached. The fork and exit code duplicates and nukes this pointer, managing the cpuset reference counter. It matters to batch schedulers and the like which cpuset a task is in, and which tasks are in a cpuset, when it comes time to do things like suspend or migrate the tasks currently in a cpuset. Just because it's ok to share a little compute time in a cpuset doesn't mean you don't care to know who is in it. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj...@sg...> 1.650.933.1373 |