From: Ingo M. <mi...@el...> - 2004-03-30 07:42:18
* Andi Kleen <ak...@su...> wrote:

> This works much better, but wildly varying (my tests go from 2.8x CPU
> to ~3.8x CPU for 4 CPUs. 2,3 CPU cases are ok). A bit more consistent
> results would be better though.

i'm resurrecting the balance-on-clone patch i sent a couple of days
ago. I found at least one bug in it that might explain why it didn't
work back then. (also, the scheduler back then was too aggressive at
migrating tasks back.) Stay tuned.

	Ingo
From: Nick P. <nic...@ya...> - 2004-03-30 07:03:55
Andi Kleen wrote:
> On Tue, 30 Mar 2004 09:51:46 +1000
> Nick Piggin <nic...@ya...> wrote:
>
>> So both -mm5 and Ingo's sched.patch are much worse than
>> what 2.4 and 2.6 get?
>
> Yes (2.6 vanilla and 2.4-aa at that, i haven't tested 2.4-vanilla)
>
> Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x
> CPU), but still much worse than the max of 3.7x-4x CPU bandwidth.

So it is very likely to be a case of the threads running too long on
one CPU before being balanced off, and faulting in most of their
working memory from one node, right?

I think it is impossible for the scheduler to correctly identify this
and implement the behaviour that OpenMP wants without causing
regressions on more general workloads (assuming this is the problem).

We are not going to go back to the wild balancing that numasched does
(I have some benchmarks where sched-domains reduces cross-node task
movement by several orders of magnitude).

So the other option is to do balance on clone across NUMA nodes, and
make it very sensitive to imbalance. Or, probably better, make it easy
to balance off to an idle CPU, but much more difficult to balance off
to a busy CPU.

I suspect this would still be a regression for other tests though,
where thread creation is more frequent, threads share their working
set more often, or the number of threads > the number of CPUs.
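A toy illustration of the asymmetric policy Nick proposes here: always
take an idle CPU, but demand a large imbalance before taking a busy
one. The function name and the 175% threshold are invented for the
sketch; this is not code from any posted patch.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical threshold: the source CPU must be 75% more loaded
 * than the target before a new thread is moved onto a busy CPU. */
#define CLONE_IMBALANCE_PCT 175

/*
 * Toy model of balance-on-clone across NUMA nodes: migrating a
 * freshly cloned thread to an idle CPU is always allowed; migrating
 * it onto a busy CPU requires a drastic imbalance.
 */
static bool balance_clone_to(int src_load, int dst_load)
{
    if (dst_load == 0)              /* idle target: cheap win */
        return true;
    /* busy target: only migrate on a large relative imbalance */
    return src_load * 100 >= dst_load * CLONE_IMBALANCE_PCT;
}

int main(void)
{
    printf("%d %d %d\n",
           balance_clone_to(3, 0),  /* idle target  -> migrate (1) */
           balance_clone_to(3, 2),  /* mild imbalance -> stay (0)  */
           balance_clone_to(4, 2)); /* 2x imbalance -> migrate (1) */
    return 0;
}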
From: Martin J. B. <mb...@ar...> - 2004-03-30 07:14:15
> We are not going to go back to the wild balancing that numasched does
> (I have some benchmarks where sched-domains reduces cross-node task
> movement by several orders of magnitude).

Agreed, I think that'd be a fatal mistake ...

> So the other option is to do balance on clone across NUMA nodes, and
> make it very sensitive to imbalance. Or, probably better, make it
> easy to balance off to an idle CPU, but much more difficult to
> balance off to a busy CPU.

I think that's correct, but we need to be careful. We really, really
do want to try to keep threads on the same node *if* we have enough
processes around to keep the machine busy. Because we don't balance on
fork, we do a reasonable job of that today, but we should probably be
more reluctant to rebalance than we are.

It's when we have fewer processes than nodes that we want to spread
things around. That's a difficult balance to strike (and exactly why I
wimped out on it originally ;-)).

M.
From: Nick P. <nic...@ya...> - 2004-03-30 07:31:31
Martin J. Bligh wrote:
>> We are not going to go back to the wild balancing that numasched
>> does (I have some benchmarks where sched-domains reduces cross-node
>> task movement by several orders of magnitude).
>
> Agreed, I think that'd be a fatal mistake ...
>
>> So the other option is to do balance on clone across NUMA nodes,
>> and make it very sensitive to imbalance. Or, probably better, make
>> it easy to balance off to an idle CPU, but much more difficult to
>> balance off to a busy CPU.
>
> I think that's correct, but we need to be careful. We really, really
> do want to try to keep threads on the same node *if* we have enough
> processes around to keep the machine busy. Because we don't balance
> on fork, we do a reasonable job of that today, but we should probably
> be more reluctant to rebalance than we are.
>
> It's when we have fewer processes than nodes that we want to spread
> things around. That's a difficult balance to strike (and exactly why
> I wimped out on it originally ;-)).

Well, NUMA balance on exec is obviously the right thing to do.

Maybe balance on clone would be beneficial if we only balance onto
CPUs which are idle or very badly imbalanced. Basically, if you are
very sure that it is going to be balanced off anyway, it is probably
better to do it at clone.
From: Martin J. B. <mb...@ar...> - 2004-03-30 07:38:13
> Well, NUMA balance on exec is obviously the right thing to do.
>
> Maybe balance on clone would be beneficial if we only balance onto
> CPUs which are idle or very badly imbalanced. Basically, if you are
> very sure that it is going to be balanced off anyway, it is probably
> better to do it at clone.

Yup ... sounds utterly sensible. But I think we need to make the
current balancing favour grouping threads together on the same
CPU/node more first, if possible ;-)

M.
From: Ingo M. <mi...@el...> - 2004-03-30 08:05:08
* Nick Piggin <nic...@ya...> wrote:

> Maybe balance on clone would be beneficial if we only balance onto
> CPUs which are idle or very badly imbalanced. Basically, if you are
> very sure that it is going to be balanced off anyway, it is probably
> better to do it at clone.

balancing threads/processes is not a problem, as long as it happens
within the rules of normal balancing. ie. 'new context created' (on
exec, fork or clone) is just an event that impacts the load scenario,
and which might trigger rebalancing.

_if_ the sharing between various contexts is very high and it's
actually faster to run them all single-threaded, then the application
writer can bind them to one CPU, via the affinity syscalls. But the
scheduler cannot know this in advance.

so the cleanest assumption, from the POV of the scheduler, is that
there's no sharing between contexts. Things become really simple once
this assumption is made.

and frankly, it's much easier to argue with application developers
whose application scales badly and thus the scheduler over-distributes
it, than with application developers whose application scales badly
due to the scheduler.

	Ingo
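For reference, the affinity syscalls Ingo means are
sched_setaffinity()/sched_getaffinity(). A minimal sketch of the
workaround he describes, using the three-argument glibc wrapper (the
exact userspace prototype was still settling at the time of this
thread):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/*
 * Bind the calling thread to a single CPU, for applications whose
 * threads share so much that spreading them across nodes is a loss.
 */
int bind_self_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}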
From: Nick P. <nic...@ya...> - 2004-03-30 08:19:47
Ingo Molnar wrote:
> * Nick Piggin <nic...@ya...> wrote:
>
>> Maybe balance on clone would be beneficial if we only balance onto
>> CPUs which are idle or very badly imbalanced. Basically, if you are
>> very sure that it is going to be balanced off anyway, it is probably
>> better to do it at clone.
>
> balancing threads/processes is not a problem, as long as it happens
> within the rules of normal balancing.
>
> ie. 'new context created' (on exec, fork or clone) is just an event
> that impacts the load scenario, and which might trigger rebalancing.
>
> _if_ the sharing between various contexts is very high and it's
> actually faster to run them all single-threaded, then the application
> writer can bind them to one CPU, via the affinity syscalls. But the
> scheduler cannot know this in advance.
>
> so the cleanest assumption, from the POV of the scheduler, is that
> there's no sharing between contexts. Things become really simple once
> this assumption is made.
>
> and frankly, it's much easier to argue with application developers
> whose application scales badly and thus the scheduler
> over-distributes it, than with application developers whose
> application scales badly due to the scheduler.

You're probably mostly right, but I really don't know if I'd start
with the assumption that threads don't share anything. I think they're
very likely to share memory and cache.

Also, these additional system-wide balance points don't come for free
if you attach them to common operations (as opposed to the slow
periodic balancing). find_best_cpu needs to pull in NR_CPUS remote
(and probably hot and dirty) cachelines, which can get expensive, for
an operation that you are very likely to be better off *without* if
your threads do share any memory.
From: Ingo M. <mi...@el...> - 2004-03-30 08:44:38
* Nick Piggin <nic...@ya...> wrote:

> You're probably mostly right, but I really don't know if I'd start
> with the assumption that threads don't share anything. I think
> they're very likely to share memory and cache.

it all depends on the workload i guess, but generally if the
application scales well then the threads only share data in a
read-mostly manner - hence we can balance at creation time.

if the application does not scale well then balancing too early cannot
make the app perform much worse.

things like JVMs tend to want good balancing - they really are
userspace simulations of separate contexts, with little sharing and
good overall scalability of the architecture.

> Also, these additional system-wide balance points don't come for
> free if you attach them to common operations (as opposed to the slow
> periodic balancing).

yes, definitely. the implementation in sched2.patch does not take this
into account yet. There are a number of things we can do about the
500-CPU case. Eg. only do the balance search towards the next N
nodes/cpus (tunable via a domain parameter).

	Ingo
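A toy sketch of the bounded search Ingo suggests; the load-array
representation and the span parameter are invented for illustration
and are not taken from sched2.patch.

#include <stdio.h>

/*
 * Instead of scanning all NR_CPUS runqueues, look only at the next
 * `span` CPUs after this one (wrapping around), and return the least
 * loaded of those.
 */
static int find_target_cpu(int this_cpu, int nr_cpus, int span,
                           const int load[])
{
    int best = this_cpu;

    for (int i = 1; i <= span; i++) {
        int cpu = (this_cpu + i) % nr_cpus;
        if (load[cpu] < load[best])
            best = cpu;
    }
    return best;
}

int main(void)
{
    int load[8] = { 3, 2, 0, 5, 1, 4, 2, 6 };

    /* search only 3 neighbours of CPU 0 instead of all 8 CPUs */
    printf("target: %d\n", find_target_cpu(0, 8, 3, load));
    return 0;
}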
From: Nick P. <nic...@ya...> - 2004-03-30 08:54:16
Ingo Molnar wrote:
> * Nick Piggin <nic...@ya...> wrote:
>
>> You're probably mostly right, but I really don't know if I'd start
>> with the assumption that threads don't share anything. I think
>> they're very likely to share memory and cache.
>
> it all depends on the workload i guess, but generally if the
> application scales well then the threads only share data in a
> read-mostly manner - hence we can balance at creation time.
>
> if the application does not scale well then balancing too early
> cannot make the app perform much worse.
>
> things like JVMs tend to want good balancing - they really are
> userspace simulations of separate contexts, with little sharing and
> good overall scalability of the architecture.

Well, it will be interesting to see how it goes. Unfortunately I don't
have a single realistic benchmark. In fact the only threaded one I
have is volanomark.

>> Also, these additional system-wide balance points don't come for
>> free if you attach them to common operations (as opposed to the
>> slow periodic balancing).
>
> yes, definitely.
>
> the implementation in sched2.patch does not take this into account
> yet. There are a number of things we can do about the 500-CPU case.
> Eg. only do the balance search towards the next N nodes/cpus (tunable
> via a domain parameter).

Yeah, I think we shouldn't worry too much about the 500-CPU case,
because they will obviously end up using their own domains. But it is
possible this would hurt smaller CPU counts too. Again, it means
testing.

I think we should probably aim to have a usable and decent default
domain for 32, maybe 64 CPUs, and not worry about larger numbers too
much if it would hurt lower-end performance.
From: Martin J. B. <mb...@ar...> - 2004-03-30 15:27:14
> Well, it will be interesting to see how it goes. Unfortunately I
> don't have a single realistic benchmark.

That's OK, neither does anyone else ;-) OK, for HPC workloads they do,
but not for other stuff. The closest I can come conceptually is to run
multiple instances of a Java benchmark in parallel. The existing ones
all tend to be either 1 process with many threads, or many processes
each with one thread. There are no m x n benchmarks around that I've
found, and that seems to be a lot more like what the customers I've
seen are interested in (throwing a DB, webserver, Java, etc. all on
one machine).

Making balance_on_fork a userspace-hintable thing wouldn't hurt us at
all though, and would provide a great escape route for the HPC people.
Some simple pokeable in /proc would probably be sufficient.

balance_on_clone is harder, as whether you want to do it or not
depends more on the state of the rest of the system, which is very
hard for userspace to know ...

M.
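No such pokeable existed at the time; purely as an illustration of the
interface Martin is floating, the /proc path and semantics below are
invented:

#include <stdio.h>

/* Hypothetical knob: the path and its meaning are made up here to
 * illustrate the "simple pokeable in /proc" idea, nothing more. */
#define BALANCE_ON_FORK_KNOB "/proc/sys/kernel/sched_balance_on_fork"

/* Opt the system in to fork-time balancing, for HPC-style loads. */
int enable_balance_on_fork(void)
{
    FILE *f = fopen(BALANCE_ON_FORK_KNOB, "w");

    if (!f)
        return -1;
    fputs("1\n", f);
    fclose(f);
    return 0;
}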
From: Ingo M. <mi...@el...> - 2004-03-30 08:18:16
* Andi Kleen <ak...@su...> wrote:

> > ok, could you try min_interval, max_interval and busy_factor all
> > with a value of 4, in sched.h's SD_NODE_INIT template? (again, only
> > for testing purposes.)
>
> I kept the old patch and made these changes. The results are much
> more consistent now, 3+x CPU. I still get variations of ~2GB/s, but I
> had this with older kernels too.

great. now, could you try the following patch, against vanilla -mm5:

  redhat.com/~mingo/scheduler-patches/sched2.patch

this includes 'context balancing' and doesn't touch the NUMA async
balancing tunables. Do you get better performance than with stock
-mm5?

	Ingo
From: Andi K. <ak...@su...> - 2004-03-30 09:36:28
On Tue, 30 Mar 2004 10:18:40 +0200
Ingo Molnar <mi...@el...> wrote:

> * Andi Kleen <ak...@su...> wrote:
>
> > > ok, could you try min_interval, max_interval and busy_factor all
> > > with a value of 4, in sched.h's SD_NODE_INIT template? (again,
> > > only for testing purposes.)
> >
> > I kept the old patch and made these changes. The results are much
> > more consistent now, 3+x CPU. I still get variations of ~2GB/s,
> > but I had this with older kernels too.
>
> great.
>
> now, could you try the following patch, against vanilla -mm5:
>
>   redhat.com/~mingo/scheduler-patches/sched2.patch
>
> this includes 'context balancing' and doesn't touch the NUMA async
> balancing tunables. Do you get better performance than with stock
> -mm5?

I get better performance (roughly 2.1x CPU), but only about half the
optimum.

-Andi
From: Martin J. B. <mb...@ar...> - 2004-03-25 19:25:23
>> It doesn't do load balance in wake_up_forked_process() and is
>> relatively non-aggressive in balancing later. This leads to the
>> multithreaded OpenMP STREAM running its children first on the same
>> node as the original process and allocating memory there. [...]
>
> i believe the fix we want is to pre-balance the context at fork()
> time. I've implemented this (which is basically just a reuse of
> sched_balance_exec() in fork.c, and the related namespace cleanups),
> could you give it a go:
>
>   http://redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm2-A5
>
> another solution would be to add SD_BALANCE_FORK.
>
> also, the best place to do fork() balancing is not at
> wake_up_forked_process() time, but prior to doing the MM copy. This
> patch does it there. At wakeup time we've already copied all the
> pagetables and created tons of dirty cachelines.

How are you going to decide whether to rebalance at fork time or exec
time? Exec-time balancing is a *lot* more efficient, it just doesn't
work for things that don't exec ... cloned threads would certainly be
one case.

M.
From: Ingo M. <mi...@el...> - 2004-03-25 21:59:09
* Andi Kleen <ak...@su...> wrote:

> It doesn't do load balance in wake_up_forked_process() and is
> relatively non-aggressive in balancing later. This leads to the
> multithreaded OpenMP STREAM running its children first on the same
> node as the original process and allocating memory there. Then later
> they run on a different node when the balancing finally happens, but
> generate cross traffic to the old node, instead of using the memory
> bandwidth of their local nodes.
>
> The difference is very visible: even the 4-thread STREAM only sees
> the bandwidth of a single node. With a more aggressive scheduler you
> get 4 times as much.
>
> Admittedly it's a bit of a stupid benchmark, but it seems to be
> representative of a lot of HPC codes.

There's no way the scheduler can figure out the scheduling and memory
use patterns of the new tasks in advance. but userspace could give
hints - e.g. a syscall that triggers a rebalancing:
sys_sched_load_balance(). This way userspace notifies the scheduler
that it is on 'zero ground' and that the scheduler can move it to the
least loaded cpu/node.

a variant of this is already possible: userspace can use setaffinity
to load-balance manually - but sched_load_balance() would be
automatic.

	Ingo
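The manual variant can be sketched in userspace C; note that
sys_sched_load_balance() is only a proposal in this mail and was never
part of any posted patch:

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/*
 * Manual load-balancing via the affinity syscall: at thread start,
 * before any memory has been touched ("zero ground"), spread the
 * threads round-robin over the online CPUs.  The hypothetical
 * sys_sched_load_balance() would instead let the kernel pick the
 * least-loaded CPU/node automatically.
 */
static int spread_self(int thread_idx)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t mask;

    if (ncpus < 1)
        ncpus = 1;
    CPU_ZERO(&mask);
    CPU_SET(thread_idx % (int)ncpus, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}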
From: Nick P. <nic...@ya...> - 2004-03-26 03:56:12
Andi Kleen wrote:
> On Thu, Mar 25, 2004 at 07:31:37AM -0800, Nakajima, Jun wrote:
>
>> Andi,
>>
>> Can you be more specific with "it doesn't load balance threads
>> aggressively enough"? Or what behavior of the base NUMA scheduler is
>> missing in the sched-domain scheduler, especially for NUMA?
>
> It doesn't do load balance in wake_up_forked_process() and is
> relatively non-aggressive in balancing later. This leads to the
> multithreaded OpenMP STREAM running its children first on the same
> node as the original process and allocating memory there. Then later
> they run on a different node when the balancing finally happens, but
> generate cross traffic to the old node, instead of using the memory
> bandwidth of their local nodes.
>
> The difference is very visible: even the 4-thread STREAM only sees
> the bandwidth of a single node. With a more aggressive scheduler you
> get 4 times as much.
>
> Admittedly it's a bit of a stupid benchmark, but it seems to be
> representative of a lot of HPC codes.

Hi Andi,

Sorry I keep telling you I'll work on this, but I never get around to
it. Mostly lack of hardware makes it difficult. I've fixed a few bugs
and some other workloads, so I keep hoping that they will fix your
problem :P

Your STREAM performance is really bad, and I hope you don't think I'm
going to ignore it even if it is a bit stupid. Give me a bit more
time.

Of course, there is nothing fundamentally wrong with sched-domains
that is causing your problem. It can easily do anything the old NUMA
scheduler can do. It must be a bug or some bad tuning somewhere.

Nick
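For context, the failure mode Andi keeps describing is first-touch
page placement: each page is allocated on the node of the thread that
first writes it. A minimal OpenMP-C sketch (compile with e.g.
gcc -fopenmp; the array size is arbitrary):

#include <stdlib.h>

#define N (1 << 24)          /* arbitrary working-set size */

int main(void)
{
    double *a = malloc(N * sizeof(*a));
    int i;

    if (!a)
        return 1;

    /*
     * First touch: each page of `a` lands on the node of the thread
     * that writes it here.  If all threads still run on the parent's
     * node at this point, every page ends up on one node...
     */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = 0.0;

    /*
     * ...and this STREAM-style loop then pulls most of its data over
     * the interconnect once the threads are finally balanced off to
     * other nodes.
     */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * a[i];

    free(a);
    return 0;
}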
From: Ingo M. <mi...@el...> - 2004-03-25 21:47:44
* Martin J. Bligh <mb...@ar...> wrote:

> Exec-time balancing is a *lot* more efficient, it just doesn't work
> for things that don't exec ... cloned threads would certainly be one
> case.

yeah - exec balancing is a clear thing. fork/clone-time balancing is a
lot less clear.

	Ingo
From: Martin J. B. <mb...@ar...> - 2004-03-25 22:29:13
>> Exec-time balancing is a *lot* more efficient, it just doesn't work
>> for things that don't exec ... cloned threads would certainly be
>> one case.
>
> yeah - exec balancing is a clear thing. fork/clone-time balancing is
> a lot less clear.

OK, well it *looks* to me from a quick look at your patch like
sched_balance_context will rebalance at both fork *and* exec time.
That seems like a bad plan, but maybe I'm misreading it.

Can we hold off on changing the fork/exec-time balancing until we've
come to a plan as to what should actually be done with it? Unless
userspace gives us some hint, it's frigging hard to be sure whether a
process is going to exec or not - and the vast majority of things do.
There was a really good reason why the code is currently set up that
way; it's not some random accident ;-)

Clone is a much more interesting case, though at the time I
consciously decided NOT to do that, as we really mostly want threads
on the same node. The exception is the case where we have one app with
lots of threads and nothing much else running on the system ... I tend
to think of that as an artificial benchmark situation, but maybe
that's not fair.

We probably need to just do a more conservative version of the
cross-node rebalance at fork time.

M.
From: Andrew T. <hab...@us...> - 2004-03-25 22:30:57
On Thursday 25 March 2004 15:59, Ingo Molnar wrote:
> * Andi Kleen <ak...@su...> wrote:
> > It doesn't do load balance in wake_up_forked_process() and is
> > relatively non-aggressive in balancing later. This leads to the
> > multithreaded OpenMP STREAM running its children first on the same
> > node as the original process and allocating memory there. Then
> > later they run on a different node when the balancing finally
> > happens, but generate cross traffic to the old node, instead of
> > using the memory bandwidth of their local nodes.
> >
> > The difference is very visible: even the 4-thread STREAM only sees
> > the bandwidth of a single node. With a more aggressive scheduler
> > you get 4 times as much.
> >
> > Admittedly it's a bit of a stupid benchmark, but it seems to be
> > representative of a lot of HPC codes.
>
> There's no way the scheduler can figure out the scheduling and memory
> use patterns of the new tasks in advance.
>
> but userspace could give hints - e.g. a syscall that triggers a
> rebalancing: sys_sched_load_balance(). This way userspace notifies
> the scheduler that it is on 'zero ground' and that the scheduler can
> move it to the least loaded cpu/node.
>
> a variant of this is already possible: userspace can use setaffinity
> to load-balance manually - but sched_load_balance() would be
> automatic.

For Opteron, simply placing all CPUs in the same sched domain may
solve all of this, since we would have the balancing frequency of the
default scheduler. Is there any reason this cannot be done for
Opteron?

Also, I think Erich Focht had another patch which would allow much
more frequent node balancing if nr_cpus_node was 1.
From: Martin J. B. <mb...@ar...> - 2004-03-25 22:38:21
> For Opteron, simply placing all CPUs in the same sched domain may
> solve all of this, since we would have the balancing frequency of the
> default scheduler. Is there any reason this cannot be done for
> Opteron?

That seems like a good plan to me - they really don't want that
cross-node balancing. It might be cleaner to implement it by just
tweaking the cross-balance parameters for that system to have the same
effect, but it probably doesn't matter much (I'm thinking of some
future case when they decide to do multi-chip on die or SMT, so just
keying off 1 CPU per node doesn't really fix it).

M.
From: Andi K. <ak...@su...> - 2004-03-26 05:29:08
On Thu, 25 Mar 2004 16:30:16 -0600
Andrew Theurer <hab...@us...> wrote:

> For Opteron, simply placing all CPUs in the same sched domain may
> solve all of this, since we would have the balancing frequency of the
> default scheduler. Is there any reason this cannot be done for
> Opteron?

Yes, that makes sense. I will try that.

-Andi
From: Andi K. <ak...@su...> - 2004-03-29 12:25:47
On Mon, 29 Mar 2004 13:46:35 +0200
Ingo Molnar <mi...@el...> wrote:

> * Andi Kleen <ak...@su...> wrote:
>
> > Sorry, ignore this report - I just found out I booted the wrong
> > kernel by mistake. Currently retesting, also with the proposed
> > change to only use a single scheduling domain.
>
> here are the items that are in the works:
>
>   redhat.com/~mingo/scheduler-patches/sched.patch

I'm trying to, but -mm5 doesn't work at all on the 4-way machine. It
goes through the full boot-up sequence, but then never opens a login
on the console, and sshd doesn't work either.

Andrew, maybe that's related to your tty fixes?

-Andi
From: Andi K. <ak...@su...> - 2004-03-29 12:33:10
On Mon, 29 Mar 2004 09:03:01 +0200
Andi Kleen <ak...@su...> wrote:

> I'm trying to, but -mm5 doesn't work at all on the 4-way machine. It
> goes through the full boot-up sequence, but then never opens a login
> on the console, and sshd doesn't work either.
>
> Andrew, maybe that's related to your tty fixes?

Reverting the two makes login work again.

-Andi
From: Andi K. <ak...@su...> - 2004-03-30 07:15:56
On Tue, 30 Mar 2004 17:03:42 +1000
Nick Piggin <nic...@ya...> wrote:

> So it is very likely to be a case of the threads running too long on
> one CPU before being balanced off, and faulting in most of their
> working memory from one node, right?

Yes.

> I think it is impossible for the scheduler to correctly identify
> this and implement the behaviour that OpenMP wants without causing
> regressions on more general workloads (assuming this is the problem).

Regression on what workload? The 2.4 kernel, which did the early
balancing, didn't seem to have problems.

I have the NUMA API for an application to select memory placement
manually, but it's unrealistic to expect all applications to use it,
so the scheduler has to do at least a reasonable default.

In general, on Opteron you want to go as quickly as possible to your
target node. Keeping things on the local node and hoping that threads
won't need to be balanced off is probably a loss. It is quite possible
that other systems have different requirements, but I doubt there is a
"one size fits all" requirement, and doing a custom domain setup or
similar would be fine for me. (or at least, if sched domains cannot be
tuned for Opteron then it would have failed its promise of being a
configurable scheduler)

> I suspect this would still be a regression for other tests though,
> where thread creation is more frequent, threads share their working
> set more often, or the number of threads > the number of CPUs.

I can try such tests if they're not too time-consuming to set up. What
did you have in mind?

-Andi
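The NUMA API Andi refers to is the numactl/libnuma work; a minimal
placement sketch, assuming the libnuma calls that shipped with numactl
(link with -lnuma):

#include <numa.h>
#include <stdio.h>

/*
 * Manually place a worker: run it on `node` and give it memory
 * allocated from that node's local memory, instead of trusting the
 * load balancer.  Free the buffer later with numa_free().
 */
static double *setup_worker(int node, size_t nbytes)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return NULL;
    }
    if (numa_run_on_node(node) < 0)   /* bind execution to the node */
        return NULL;
    return numa_alloc_onnode(nbytes, node);
}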
From: Nick P. <nic...@ya...> - 2004-03-30 07:24:24
Andi Kleen wrote:
> On Tue, 30 Mar 2004 17:03:42 +1000
> Nick Piggin <nic...@ya...> wrote:
>
>> So it is very likely to be a case of the threads running too long on
>> one CPU before being balanced off, and faulting in most of their
>> working memory from one node, right?
>
> Yes.
>
>> I think it is impossible for the scheduler to correctly identify
>> this and implement the behaviour that OpenMP wants without causing
>> regressions on more general workloads (assuming this is the
>> problem).
>
> Regression on what workload? The 2.4 kernel, which did the early
> balancing, didn't seem to have problems.

No, but hopefully sched-domains balancing will do better than the old
numasched.

> I have the NUMA API for an application to select memory placement
> manually, but it's unrealistic to expect all applications to use it,
> so the scheduler has to do at least a reasonable default.
>
> In general, on Opteron you want to go as quickly as possible to your
> target node. Keeping things on the local node and hoping that threads
> won't need to be balanced off is probably a loss. It is quite
> possible that other systems have different requirements, but I doubt
> there is a "one size fits all" requirement, and doing a custom domain
> setup or similar would be fine for me.

It is the same situation with all NUMA; obviously Opteron's 1 CPU per
node means it is sensitive to node imbalances.

> (or at least, if sched domains cannot be tuned for Opteron then it
> would have failed its promise of being a configurable scheduler)

Well, it seems like Ingo is on to something. Phew! :)

>> I suspect this would still be a regression for other tests though,
>> where thread creation is more frequent, threads share their working
>> set more often, or the number of threads > the number of CPUs.
>
> I can try such tests if they're not too time-consuming to set up.
> What did you have in mind?

Not really sure. I guess probably most things that use a lot of
threads: maybe Java, or a web server using per-connection threads (if
there is such a thing). On the other hand though, maybe it will be a
good idea if it is done carefully...
From: Arjan v. de V. <ar...@re...> - 2004-03-30 07:39:23
> Regression on what workload? The 2.4 kernel, which did the early
> balancing, didn't seem to have problems.

well, the hard balance is between a program that just splits off one
thread and has those two threads working closely together (in which
case you want the two threads together on the same quad in a quad-like
setup), and a program that splits off a thread and has the two threads
working basically entirely independently. Benchmarks are typically of
the latter kind... but what about real-world applications? The ones I
can think of that use threads are of the former kind.