Thread: Re: [Algorithms] General purpose task parallel threading approach (Page 4)
|
From: Jarkko L. <al...@gm...> - 2009-04-09 22:51:51
|
> It's actually not vital for throughput, and I couldn't see anything in
> the presentation you linked saying it is.

That's what "Minimizing the Critical Path" is about. Ignoring the issue
doesn't mean it doesn't exist.

Cheers, Jarkko

_____
From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Friday, April 10, 2009 1:13 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Thu, Apr 9, 2009 at 10:27 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Yes, appealing to the "big boys" is a pretty weak argument since they
> have provided sub-optimal solutions before as well.

In this case almost all of them have, at one point or another, tried the
alternative and decided it was worth the effort to implement the much
trickier solution. In fact the presentation you linked was given by the
very same organization (XNA) who now recommend work stealing - so clearly
they've given the global work queue an honest shot, don't you think?

> Anyway, I was hoping to get some strong arguments either way, but I
> guess it's not very clear yet what solution is the best for
> multi-threading your engine. I read through the Gamefest slides and it
> was a bit surprising they didn't deal with prioritization at all, which
> is vital for throughput.

It's actually not vital for throughput, and I couldn't see anything in
the presentation you linked saying it is. Tasks are naturally ordered
according to their dependencies due to the way they are spawned (you
spawn tasks, then you sync - only then do you spawn any dependent tasks).
With the Cilk-based model any task that is enqueued is by definition
ready to run. There's no need for priorities to optimize throughput.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
|
From: Sebastian S. <seb...@gm...> - 2009-04-09 23:04:47
|
On Thu, Apr 9, 2009 at 11:52 PM, Jon Watte <jw...@gm...> wrote:

> Sebastian Sylvan wrote:
> > > Anyway, I was hoping to get some strong arguments either way, but
> > > I guess it's not very clear yet what solution is the best for
> > > multi-threading your engine. I read through the Gamefest slides
> > > and it was a bit surprising they didn't deal with prioritization
> > > at all, which is vital for throughput.
> >
> > It's actually not vital for throughput, and I couldn't see anything in
> > the presentation you linked saying it is.
> > Tasks are naturally ordered according to their dependencies due to the
> > way they are spawned (you spawn tasks, then you sync - only then do
> > you spawn any dependent tasks). With the Cilk-based model any task
> > that is enqueued is by definition ready to run. There's no need for
> > priorities to optimize throughput.
>
> Actually, it is. Consider a case where you have some independent tasks,
> and some serially dependent tasks. Suppose you have 4 independent tasks
> xyzw, 4 linearly dependent tasks abcd, and two cores to schedule across.
> Here are two possible work orderings:
>
> xy
> zw
> a
> b
> c
> d
>
> ax
> by
> cz
> dw
>
> Clearly, the second case is much preferred. However, without
> prioritization, you won't get that outcome.

Actually you could, since you can just spawn the tasks in the order you
want to run them. Like I believe I said very early on, the only reason
you would want to reorder tasks is because you may want to start a task
that spawns many other tasks early - but you can still do that without
having a priority system, since the order in which tasks are spawned is
deterministic.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
|
From: Sebastian S. <seb...@gm...> - 2009-04-09 23:38:56
|
On Fri, Apr 10, 2009 at 12:04 AM, Sebastian Sylvan <seb...@gm...> wrote:

> On Thu, Apr 9, 2009 at 11:52 PM, Jon Watte <jw...@gm...> wrote:
>
>> Sebastian Sylvan wrote:
>> > > Anyway, I was hoping to get some strong arguments either way, but
>> > > I guess it's not very clear yet what solution is the best for
>> > > multi-threading your engine. I read through the Gamefest slides
>> > > and it was a bit surprising they didn't deal with prioritization
>> > > at all, which is vital for throughput.
>> >
>> > It's actually not vital for throughput, and I couldn't see anything
>> > in the presentation you linked saying it is.
>> > Tasks are naturally ordered according to their dependencies due to
>> > the way they are spawned (you spawn tasks, then you sync - only then
>> > do you spawn any dependent tasks). With the Cilk-based model any task
>> > that is enqueued is by definition ready to run. There's no need for
>> > priorities to optimize throughput.
>>
>> Actually, it is. Consider a case where you have some independent tasks,
>> and some serially dependent tasks. Suppose you have 4 independent tasks
>> xyzw, 4 linearly dependent tasks abcd, and two cores to schedule
>> across. Here are two possible work orderings:
>>
>> xy
>> zw
>> a
>> b
>> c
>> d
>>
>> ax
>> by
>> cz
>> dw
>>
>> Clearly, the second case is much preferred. However, without
>> prioritization, you won't get that outcome.
>
> Actually you could, since you can just spawn the tasks in the order you
> want to run them. Like I believe I said very early on, the only reason
> you would want to reorder tasks is because you may want to start a task
> that spawns many other threads early - but you can still do that without
> having a priority system since the order in which tasks are spawned is
> deterministic.

In fact, in your particular case this isn't only possible without
prioritization, but it's what happens if you pay no attention whatsoever
and just write things naively, expressing parallelism where it is.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
|
From: Jon W. <jw...@gm...> - 2009-04-10 00:46:41
|
Sebastian Sylvan wrote:
> Actually you could, since you can just spawn the tasks in the order
> you want to run them.

So the totally independent tasks xyzw, and the totally dependent tasks
abcd, need to be spawned in interspersed order? Across all executing
threads? I shudder to think of the horribly convoluted code that would
require -- it's pretty much impossible.

Note that I view tasks as work items -- not as something that gets a
regular tick each frame, but actually something which has an allocation,
lifetime, and death (although various underlying data is likely pooled).
Tasks come from things like detecting collision, or exploding grenades,
or passing triggers, or what have you -- they are not static.

> you can still do that without having a priority system since the order
> in which tasks are spawned is deterministic.

Only in a world where the user is deterministic.

Sincerely,

jw
|
|
From: Nicholas "Indy" R. <ar...@gm...> - 2009-04-10 01:03:43
|
On Thu, Apr 9, 2009 at 5:46 PM, Jon Watte <jw...@gm...> wrote:
> Only in a world where the user is deterministic.

In my experience the order of events within a single thread is set and
doesn't change, even though depending on the user certain tasks won't
need to be spawned. It's trivial to set up the order of tasks *within*
a single frame to have a static order; whether that is optimal or not
is another matter altogether.

Nicholas "Indy" Ray
|
|
From: Sebastian S. <seb...@gm...> - 2009-04-09 23:25:31
|
On Thu, Apr 9, 2009 at 11:51 PM, Jarkko Lempiainen <al...@gm...> wrote:

>> It's actually not vital for throughput, and I couldn't see anything in
>> the presentation you linked saying it is.
>
> That's what "Minimizing the Critical Path" is about. Ignoring the issue
> doesn't mean it doesn't exist.

That section only seems to talk about optimizing scheduling based on
static (in the sense that they're known ahead of time - not necessarily
at compile time) dependencies, not about the necessity to support
priorities.

And of course, general code doesn't have a static call graph, so it's
even impossible to do this analysis unless your code happens to be simple
enough (see: the halting problem).

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
|
From: Sebastian S. <seb...@gm...> - 2009-04-10 00:04:33
|
On Fri, Apr 10, 2009 at 12:25 AM, Sebastian Sylvan <seb...@gm...> wrote:

> On Thu, Apr 9, 2009 at 11:51 PM, Jarkko Lempiainen <al...@gm...> wrote:
>
>>> It's actually not vital for throughput, and I couldn't see anything in
>>> the presentation you linked saying it is.
>>
>> That's what "Minimizing the Critical Path" is about. Ignoring the issue
>> doesn't mean it doesn't exist.
>
> That section only seems to talk about optimizing scheduling based on
> static (in the sense that they're known ahead of time - not necessarily
> at compile time) dependencies, not about the necessity to support
> priorities.
>
> And of course, general code doesn't have a static call graph, so it's
> even impossible to do this analysis unless your code happens to be
> simple enough (see: the halting problem).

Also, it's worth noting that work stealing is *provably* efficient. In
other words, the way the scheduling works gives asymptotically optimal
throughput. So claiming that either ahead-of-time prioritization, or a
runtime priority system, is required for efficient throughput is
mathematically incorrect. And given that the absence of them gives a far
more flexible programming model, and real-world speedups in practice, I'd
say the jury's in on this one...

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
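The "provably efficient" claim being invoked here is the classic work-stealing bound of Blumofe and Leiserson. A sketch of the result in their notation (this is background, not something stated in the thread itself):

```latex
% T_1      : total work (running time on one processor)
% T_\infty : span, i.e. the critical-path length
% T_P      : expected running time on P workers under randomized
%            work stealing
T_P \;\le\; \frac{T_1}{P} + O(T_\infty)
% This is within a constant factor of the trivial lower bound
% \max(T_1/P,\; T_\infty), which is the sense in which the schedule
% is "asymptotically optimal" without any priority system.
```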
|
From: Nicholas "Indy" R. <ar...@gm...> - 2009-04-09 23:49:05
|
On Thu, Apr 9, 2009 at 4:25 PM, Sebastian Sylvan <seb...@gm...> wrote:
> And of course, general code doesn't have a static call graph, so it's
> even impossible to do this analysis unless your code happens to be
> simple enough (see: the halting problem).

I've found the barrier for "simple enough" to actually be rather high,
especially when writing special-case analysis (Dehydra has been used in
Mozilla to build simple call graphs:
https://developer.mozilla.org/en/Dehydra). But often it turns out that
building a call graph for optimization during a profile-like stage is
very effective: it's easy to get real call data, and you don't have to
worry about the halting problem.

Also, a description of the halting problem with some humor value:
http://www.lel.ed.ac.uk/~gpullum/loopsnoop.html

Nicholas "Indy" Ray
|
|
From: Sebastian S. <seb...@gm...> - 2009-04-09 23:59:49
|
On Fri, Apr 10, 2009 at 12:49 AM, Nicholas "Indy" Ray <ar...@gm...> wrote:

> I've found the barrier for "simple enough" to actually be rather high,
> especially when writing special-case analysis (Dehydra has been used in
> Mozilla to build simple call graphs:
> https://developer.mozilla.org/en/Dehydra). But often it turns out that
> building a call graph for optimization during a profile-like stage is
> very effective: it's easy to get real call data, and you don't have to
> worry about the halting problem.

Yes, profile-guided optimization is definitely promising. It would seem
to me that the most gainful use would be not to reorder task execution,
but to determine which tasks would be more efficient to execute inline.
In other words, the programmer would call "spawn" *everywhere* with no
regard to the overhead, and the profiler would figure out which of those
to ignore. I seem to remember some research out of MSR doing this to
Haskell programs and getting good speedups (but of course Haskell
programs don't need manual markup of potential parallelism since they're
purely functional - the profiler would insert the parallelism "sparks"
for you in the correct places).

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
|
From: Jon W. <jw...@gm...> - 2009-04-10 00:48:33
|
Sebastian Sylvan wrote:
> > Here are two possible work orderings:
> >
> > xy
> > zw
> > a
> > b
> > c
> > d
> >
> > ax
> > by
> > cz
> > dw
>
> In fact, in your particular case this isn't only possible without
> prioritization, but it's what happens if you pay no attention what so
> ever and just write things naively, expressing parallelism where it is.

There is no guarantee of that. The scheduler, expressing "parallelism
where it is", could easily choose the xyzw-first order, rather than
trying to run a in parallel with x. There is nothing in the knowledge of
the scheduler you propose to guarantee the second (desirable) outcome,
except perhaps wishful thinking on the part of the implementer.

Sincerely,

jw
|
|
From: Sebastian S. <seb...@gm...> - 2009-04-10 01:00:15
|
On Fri, Apr 10, 2009 at 1:48 AM, Jon Watte <jw...@gm...> wrote:

> There is no guarantee of that. The scheduler, expressing "parallelism
> where it is", could easily choose the xyzw-first order, rather than
> trying to run a in parallel with x. There is nothing in the knowledge of
> the scheduler you propose to guarantee the second (desirable) outcome,
> except perhaps wishful thinking on the part of the implementer.

In this case, with the scheduling policy I propose (standard
work-stealing), I don't see how it could possibly do anything different.
You put two tasks (the xyzw one and the abcd one) on whichever queue you
start on; the other thread will sit idle just waiting for something to
pop up somewhere, and grab the first one to go on the queue. Then the
second one goes on the queue, and then there's the final sync which
causes the originating thread to start executing. Regardless of the
order you spawn the tasks, the other thread will take the other one.

There is no opportunity for the xyzw tasks to "skip" past the abcd ones,
since you only steal when you've run out (so even if the other thread
takes the xyzw ones, it will put the child tasks on *its* queue, and the
original thread will go ahead and do abcd before trying to steal). So
yeah, looks pretty guaranteed.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
|
From: Jon W. <jw...@gm...> - 2009-04-10 02:30:21
|
Sebastian Sylvan wrote:
> You put two tasks (the xyzw one and abcd one) on whichever queue you
> start on, the other queue will sit idle just waiting for something to
> pop up somewhere and grab the first one to go on the queue. Then the
> second one goes on the queue and then there's the final sync which
> causes the originating thread to start executing. Regardless of the
> order you spawn the task the other thread will take the other one.
> There is no opportunity for the xyzw tasks to "skip" past the abcd
> ones since you only steal when you've run out (so even if the other
> thread takes the xyzw ones it will put the child-tasks on *its* queue,
> and the original thread will go ahead and do abcd before trying to
> steal).

There are eight tasks. x, y, z and w are part of the "independent" work
group. a, b, c and d are part of the "sequentially dependent" group. That
was part of the original problem statement.

As you should clearly see in the problem statement, if the tasks xyzw
are enqueued before abcd (all on the same thread, then), you will not get
optimal performance, because the stealer will steal y after the original
thread starts working on x.

Sincerely,

jw
|
|
From: Sebastian S. <seb...@gm...> - 2009-04-10 01:26:37
|
On Fri, Apr 10, 2009 at 1:48 AM, Jon Watte <jw...@gm...> wrote:

> There is no guarantee of that. The scheduler, expressing "parallelism
> where it is", could easily choose the xyzw-first order, rather than
> trying to run a in parallel with x. There is nothing in the knowledge of
> the scheduler you propose to guarantee the second (desirable) outcome,
> except perhaps wishful thinking on the part of the implementer.

I think the key notion you may be misunderstanding about work stealing
is that the stealing happens from the opposite end of the queue versus
where the owning thread executes from. If stealing happened from the
"execution end" then yes, you could get this weird interleaving, but
that's not the case.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
|
From: Jon W. <jw...@gm...> - 2009-04-10 02:32:38
|
Nicholas "Indy" Ray wrote:
> In my experience the order of events within a single thread is set and
> doesn't change, even though depending on the user certain tasks won't
> need to be spawned. It's trivial to set up the order of tasks *within*
> a single frame to have a static order; whether that is optimal or not
> is another matter altogether.

Only if you legislate that reactions to events can only happen on the
next frame. Else, you may have an event of type A that triggers an event
of type B, and another event of type B that triggers an event of type A,
and you can't get a static order that solves both.

For example: grenades exploding may put things on fire, but things
burning may also explode grenades. Only in a system where you tie-break
by introducing latency would you be able to use a static ordering.
Meanwhile, in a system where work items are actual individual items
(rather than static "pulsed" state machines or whatever), either case
will be resolved within the same frame.

Sincerely,

jw
|
|
From: Nicholas "Indy" R. <ar...@gm...> - 2009-04-10 02:49:55
|
On Thu, Apr 9, 2009 at 7:32 PM, Jon Watte <jw...@gm...> wrote:
> Only if you legislate that reactions to events can only happen on the
> next frame. Else, you may have an event of type A that triggers an event
> of type B, and another event of type B that triggers an event of type A,
> and you can't get a static order that solves both.
> For example: grenades exploding may put things on fire, but things
> burning may also explode grenades. Only in a system where you tie-break
> by introducing latency would you be able to use a static ordering.
> Meanwhile, in a system where work items are actual individual items
> (rather than static "pulsed" state machines or whatever), either case
> will be resolved within the same frame.

I understand it's true that situations arise where it is easy to lose
said order of events. But in my experience most such events are put off
until the next frame anyway, even in single-threaded engines.

Indy
|
|
From: Jon W. <jw...@gm...> - 2009-04-10 02:35:42
|
Sebastian Sylvan wrote:
> I think the key notion you may be misunderstanding about work
> stealing is that the stealing happens from the opposite end of the
> queue versus where the owning thread executes from. If stealing
> happened from the "execution end" then yes, you could get this weird
> interleaving, but that's not the case.

Steal order doesn't matter, because in that case you may easily have two
threads adding tasks.

Adding order:

1 2
x y
z w
a b
c d

Now, stealing, no matter from which end you steal, will not lead to
optimal scheduling. You do need to sort by dependencies, and to avoid
having to do optimization on the full spanning graph, simply choosing
the task with the deepest*widest dependency chain first will usually
work out well enough.

Sincerely,

jw
|
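Jon's "deepest dependency chain first" heuristic can be sketched as a memoized depth computation over the task graph, used as a priority key. The graph representation (`dependents[i]` listing the tasks that wait on task i) and the function names are hypothetical; this shows the heuristic, not Jon's actual system:

```cpp
#include <algorithm>
#include <vector>

// Depth of the longest chain of tasks that transitively depend on `task`,
// memoized so the whole DAG is scored in O(nodes + edges).
int chain_depth(int task, const std::vector<std::vector<int>>& dependents,
                std::vector<int>& memo) {
    if (memo[task] >= 0) return memo[task];
    int deepest = 0;
    for (int d : dependents[task])
        deepest = std::max(deepest, chain_depth(d, dependents, memo));
    return memo[task] = 1 + deepest;
}

// Order ready tasks so those heading the longest dependency chains run
// first - the critical path starts draining immediately instead of
// waiting behind independent work.
std::vector<int> priority_order(const std::vector<std::vector<int>>& dependents) {
    int n = static_cast<int>(dependents.size());
    std::vector<int> memo(n, -1), order(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return chain_depth(a, dependents, memo) > chain_depth(b, dependents, memo);
    });
    return order;
}
```

On the thread's running example (tasks 0..3 as the chain a->b->c->d, tasks 4..7 as the independent x, y, z, w), task 0 scores depth 4 and sorts first, so the chain starts alongside the independent work, which is exactly the ax/by/cz/dw schedule Jon wants.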
|
From: Jon W. <jw...@gm...> - 2009-04-10 23:01:51
|
Sebastian Sylvan wrote:
> On Fri, Apr 10, 2009 at 7:11 PM, Jon Watte <jw...@gm...> wrote:
>
>> Of course not. A task doesn't get run until its dependencies are
>> resolved.
>
> Which is the whole problem, since it may need to run to even know what
> its dependencies are.

It's clear that we have totally different ideas about data and software
architecture. Clearly you can know what the dependencies are, or you
wouldn't be able to have those dependencies in the first place.

> Same thing if you call a virtual function - who knows what code will
> run as a result of that, and what data dependencies that will cause?

To answer your rhetorical question: our data path system knows. If you
have wildly different dependencies based on a switch(), then that is
better expressed as a behavior component (that you can swap out) rather
than a code switch(). (The "pure OO" movement claims that every switch
can be turned into a virtual dispatch, but I wouldn't go that far.)

> Yours doesn't seem to be. It may work great for your particular
> specialized conditions (heck, we use something similar for our
> rendering tasks!)

Our data dependency system is very similar to that of dependency
properties in WPF, or properties in the Spring web framework. It's very
general, and used pervasively through the entire system.

> IMO we really will need a general system in the future to be able to
> kick off enough tasks to keep 64 threads busy for an entire frame.

I believe you need to know, and be able to express, what your data is,
where it is coming from, and how it is processed, to be able to reason
about it properly. Accidental dependencies that aren't injected
explicitly are pretty much where all non-trivial coupling and performance
bugs come from, so engineering those out is a big win, even before you
start looking at threading a component system.

> You specify dependencies, yes, but they may be dynamic

Which is why the scheduler must be dynamic!

> For example, in some AI code you may want to do a parallel loop
> through all enemies within a certain radius to score them based on the
> danger they pose to you - you can't really know before you start
> executing what enemies you're going to get back from that range query,
> and therefore which objects you're going to have a dependency on - so
> you need to start executing the task before you know which
> dependencies it has.

Subsystems can have dependencies, too. It's perfectly fine to specify
that AI reasoning depends on collision detection, and collision detection
depends on integration output (position) from the last frame. That way,
you can re-use the collision output data for AI, rather than having to
re-run queries, too.

Sincerely,

jw
|
|
From: Sebastian S. <seb...@gm...> - 2009-04-10 23:35:34
|
On Sat, Apr 11, 2009 at 12:01 AM, Jon Watte <jw...@gm...> wrote:
> Sebastian Sylvan wrote:
> >
> >
> > On Fri, Apr 10, 2009 at 7:11 PM, Jon Watte <jw...@gm...
> > <mailto:jw...@gm...>> wrote:
> >
> >
> > Of course not. A task doesn't get run until its dependencies are
> > resolved.
> >
> >
> > Which is the whole problem, since it may need to run to even know what
> > its dependencies are.
>
> It's clear that we have totally different ideas about data and software
> architecture.
> Clearly you can know what the dependencies are, or you wouldn't be able
> to have those dependencies in the first place.
The issue is about knowing *ahead of time*.
Again, this goes back to the halting problem. It is mathematically
impossible to know the flow of a general function ahead of time - you have
to execute it to find out. If you think your system can do this then you've
either broken the universe, or you're wrong :-)
> If you have wildly different dependencies based on a switch(), then that
> is better expressed as a behavior component (that you can swap out)
> rather than a code switch(). (The "pure OO" movement claims that every
> switch can be turned into a virtual dispatch, but I wouldn't go that far)
But it's still impossible to know which behaviour to swap in until you've
started executing the tasks!
Look, all I'm saying is that your system isn't general - it only works for a
restricted form of computation where all dependencies can be known before
starting to execute the tasks. Is this incorrect?
I'm sure you understand that code in general doesn't work like that, even
this wouldn't be possible:
x = spawn(foo);
y = spawn(bar);
sync;
if ( x && y )
{
    z = spawn(baz);
}
You have to start executing tasks for the result of foo and bar to be known,
which means you can't know that baz is a dependency until after you've
already started executing the tasks, which means that the task running the
above code would need to block until all three sub-tasks have finished,
before it can run, even though one of them may only rarely be needed in
reality... But it gets even worse:
x = spawn(foo);
y = spawn(baz); // returns a task!
sync;
if ( x )
{
    z = spawn(y);
}
Now you can't even pessimistically schedule it, since you have no idea what
y might be until after you've started executing!
All I'm saying is: real code has dynamic control flow that's impossible to
predict ahead of time. Using your system you have to run that kind of stuff
serially, and I personally think that to scale to the kinds of thread counts
we will need to in the future, we can't afford to say "nah, we'll never need
to parallelize that" about anything.
Yeah, we use a very restricted task system right now for rendering, and in
this case the data flow is pretty simple, and the number of threads we use
for rendering is low, so it works out well. But once those big juicy low
hanging fruits are parallelized we'll still not be done, and then we would
be in trouble if the task system we have isn't general enough to handle the
remaining "nasty" code.
>
> > You specify dependencies, yes, but they may be dynamic
>
> Which is why the scheduler must be dynamic!
>
Precisely!
> Subsystems can have dependencies, too. It's perfectly fine to specify
> that AI reasoning depends on collision detection, and collision
> detection depends on integration output (position) from the last frame.
> That way, you can re-use the collision output data for AI, rather than
> to have to re-run queries, too.
Precisely. As you say the AI reasoning must depend on collision detection,
even though it doesn't *actually* depend on it for most frames (and you
can't know in advance if it will). Meaning that it can't run even though
it's ready, it must wait for the worst-case set of dependencies to finish
first.
--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
|
From: Jon W. <jw...@gm...> - 2009-04-11 01:50:53
|
Sebastian Sylvan wrote:
> Again, this goes back to the halting problem. It is mathematically
> impossible to know the flow of a general function ahead of time - you
> have to execute it to find out. If you think your system can do this
> then you've either broken the universe, or you're wrong :-)
Luckily, I don't ship general functions; I ship specific functions.
>
>
> If you have wildly different dependencies based on a switch(),
> then that
> is better expressed as a behavior component (that you can swap out)
> rather than a code switch(). (The "pure OO" movement claims that every
> switch can be turned into a virtual dispatch, but I wouldn't go
> that far)
>
>
> But it's still impossible to know which behaviour to swap in until
> you've started executing the tasks!
If you swap in a new behavior, that will take effect the next frame.
> Look, all I'm saying is that your system isn't general - it only works
> for a restricted form of computation where all dependencies can be
> known before starting to execute the tasks. Is this incorrect?
>
Luckily, all games, as well as all business software, fall into that
"restricted" form.
> I'm sure you understand that code in general doesn't work like that,
> even this wouldn't be possible:
>
> x = spawn(foo);
> y = spawn(bar);
> sync;
> if ( x && y )
> {
> z = spawn(baz);
> }
>
So I'm saying you, as the component writer, don't need to worry about
spawning at all. Tasks are created by the system, to resolve the
semantics as described by the system. Having the programmer create
deadlocks and races and global dependencies willy-nilly sounds to me a
lot like writing your code in assembly. Very last-century, and not where
the industry is going. I believe the computer can actually make more
sense, and schedule your dependencies better, than you can.
> > You specify dependencies, yes, but they may be dynamic
>
> Which is why the scheduler must be dynamic!
>
>
> Precisely!
>
At least we agree on something :-)
> Subsystems can have dependencies, too. It's perfectly fine to specify
> that AI reasoning depends on collision detection, and collision
> detection depends on integration output (position) from the last
> frame.
> That way, you can re-use the collision output data for AI, rather than
> to have to re-run queries, too.
>
>
> Precisely. As you say the AI reasoning must depend on collision
> detection, even though it doesn't *actually* depend on it for most
> frames (and you can't know in advance if it will). Meaning that it
> can't run even though it's ready, it must wait for the worst-case set
> of dependencies to finish first.
If you're not doing at least some AI reasoning each frame, then you're
not using your cores right :-)
Seriously, though, I'm OK with saying that a task that depends on the
output of collision testing is actually forced to run after collision
testing. Tasks that depend only on, say, the game being in "paused"
mode, or depend on the player input controller being pushed forward, or
whatever, can run independent of that.
I think the discussion has come to an end, though. You want to enable
old-school low-level code to run as well as it can under the explicit
control of the programmer, and you're willing to pay the cost of
allocating and switching stacks at runtime to do it. I want to move
threading decisions away from the programmer, and let the software do
that, based on intentions and dependencies expressed by the programmer.
Two different problems lead to two different solutions. I don't think
I'll convince you that your problem is "too old-school to solve," and I
don't think you'll convince me that my problem is "too high-level to be
useful."
Sincerely,
jw
|
|
From: Sebastian S. <seb...@gm...> - 2009-04-11 02:39:42
|
On Sat, Apr 11, 2009 at 2:44 AM, Jon Watte <jw...@gm...> wrote: > So I'm saying you, as the component writer, don't need to worry about > spawning at all. Tasks are created by the system, to resolve the > semantics as described by the system. I do need to worry about specifying what's parallel and what isn't (unless I'm in a language where this can be automatically deduced - which C/C++ isn't), and if since I can't do that in a dynamic way your system isn't flexible enough to run generic code. > Having the programmer create > deadlocks and races and global dependencies willy-nilly sounds to me a > lot like writing your code in assembly. Very last-century, and not where > the industry is going. I believe the computer can actually make more > sense, and schedule your dependencies better, than you can. I don't believe the computer can violate the laws of mathematics to do it, which means that in order for your ahead-of-time scheduler to work it *must* severely restrict the kinds of control flow I can have. > > Which is why the scheduler must be dynamic! > > > > > > Precisely! > > > > At least we agree on something :-) Yes, unfortunately I can't seem to explain adequately to you that a system that requires all dependencies to be known up front is not actually dynamic in any meaningful sense of the word. > If you're not doing at least some AI reasoning each frame, then you're > not using your cores right :-) I'm doing AI reasoning every frame, but every frame each task will only go through exactly one trace through my AI processing routine (not both branches of an if-else for example), which means that there might be many *potential* dependencies, that are not *actual* dependencies. 
I may cast a ray to an off-screen object to see if my laser hits, but if I miss the bounding sphere I don't actually need to spawn a task for computing the animation of the object - and even if that animation is needed for something else and will be computed anyway, at least I won't be blocked for it to finish, improving throughput. > . I don't think > I'll convince you that your problem is "too old-school to solve," and I > don't think you'll convince me that my problem is "too high-level to be > useful." My criticism of your approach is hardly that it's too high level. Quite the opposite! I'm saying that it is, as a matter of mathematical fact, not flexible enough to run code with dynamic control flow, because by definition you can not know dependencies ahead of time for such code. Dynamic control flow *requires* an on-line scheduler - yours isn't. Restricting the use of dynamic control flow is the opposite of "high level" as it restricts the use of fairly ubiquitous things like virtual functions (when you call a virtual function you can't know what dependencies that call will have ahead of time as virtual functions are dynamically dispatched - you just have to make the call and see where you end up). That might be fine for some parts of your program, however, saying "I'm fine with restricting expressiveness/abstraction for code that needs to run in parallel" might be okay today, but in the future that will be synonymous with "I'm fine with restricting expressiveness/abstraction everywhere", which IMO would be a big mistake. -- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862 |
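Sebastian's ray-cast example can be made concrete. The sketch below is a hypothetical illustration (invented names and data layout, not code from either poster's engine) of his point: a cheap runtime test decides whether the expensive dependency even exists, so no ahead-of-time scheduler could have predicted it.

```python
import math

def hit_bounding_sphere(ray_origin, center, radius):
    # Conservative test: is the ray origin within `radius` of the center?
    # (Deliberately simplified -- a real test would intersect the ray.)
    return math.dist(ray_origin, center) <= radius

def laser_task(ray_origin, target, spawn):
    # Only if the cheap conservative test passes do we spawn the expensive
    # animation task. Which tasks exist -- and therefore the dependency
    # graph -- is decided by data seen at runtime.
    if hit_bounding_sphere(ray_origin, target["center"], target["radius"]):
        spawn("animate:" + target["name"])

def run_frame(targets):
    spawned = []
    for t in targets:
        laser_task((0.0, 0.0, 0.0), t, spawned.append)
    return spawned
```

In the cilk-style model this is unremarkable: the task simply spawns (or doesn't) as it runs, and anything enqueued is by definition ready.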
|
From: Jon W. <jw...@gm...> - 2009-04-12 02:15:38
|
Sebastian Sylvan wrote: > > So I'm saying you, as the component writer, don't need to worry about > spawning at all. Tasks are created by the system, to resolve the > semantics as described by the system. > > > I do need to worry about specifying what's parallel and what isn't > (unless I'm in a language where this can be automatically deduced - > which C/C++ isn't), and since I can't do that in a dynamic way your > system isn't flexible enough to run generic code. Luckily, I'm not running generic code; I'm running code that interfaces with a specific operating system, a specific compiler, and a specific runtime framework. > > Yes, unfortunately I can't seem to explain adequately to you that a > system that requires all dependencies to be known up front is not > actually dynamic in any meaningful sense of the word. And I can't explain to you how that's not actually true, and how the limitations that do exist actually are much less evil than the problems of trying to solve the "generic" problem. Although I gave it a pretty good shot, I think. > My criticism of your approach is hardly that it's too high level. > Quite the opposite! I'm saying that it is, as a matter of mathematical > fact, not flexible enough to run code with dynamic control flow, > because by definition you can not know Yet it is doing it, daily. Note that most flow control comes from expression evaluation and polymorphism, rather than old-school if-then-else and switch statements which you seem to be worried about. It's simply a different style of building software, using a different kind of paradigm than you are using. Personally, I think the old-school paradigm is unfit for heavy multi-core execution, and using explicit, data-centric components is the way of the future. You're of course free to disagree, because the future isn't actually here yet. > dependencies ahead of time for such code. Dynamic control flow > *requires* an on-line scheduler - yours isn't. 
My specific proposal was that each time a core is free, it picks the highest priority task available in the queue. Priorities are ideally updated each time a new task is added to the queue (but that can be loosened up for efficiency if it turns out to be a bottleneck). I think that's an on-line scheduler. If you recall, my argument was simply that a naive work-stealing scheduler is not capable of making decisions based on global knowledge (such as dependency depth), and thus will often (statistically) make poor choices. > Restricting the use of dynamic control flow is the opposite of "high > level" as it restricts the use of fairly ubiquitous things like > virtual functions (when you call a virtual function you can't know > what dependencies that call will have ahead of time as virtual > functions are dynamically dispatched - you just have to make the call > and see where you end up). > Go write a real application using WPF dependency properties, and then tell me that that's a "low level" "restricted" programming model. (I pick WPF because that's the closest available, well-documented API that has many of the same affordances as our framework) I think you simply are mired in the old, low-level ways, and don't understand the system I'm using. (The arguments you make are very similar to the arguments that assembly programmers made when FORTRAN and C showed up, IMO). Heavy multi-core threading is a new kind of execution environment, and it's unlikely that the old style of doing things will be the best match for that environment. Just like the introduction of deep caches made instruction counting a thing of the past. Sincerely, jw |
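The policy Jon describes — an idle core pops the highest-priority ready task, with priority reflecting global knowledge such as dependency depth — can be sketched with an ordinary priority queue. This is a minimal single-threaded illustration of the policy only, not the framework he describes:

```python
import heapq

class PriorityScheduler:
    """Idle cores pop the ready task with the highest priority.
    Here priority is dependency depth: the length of the longest
    chain of tasks hanging off this one, so long serial chains
    get started as early as possible."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker: preserve insertion order among equals

    def add(self, name, depth):
        # heapq is a min-heap, so negate depth to pop the deepest chain first
        heapq.heappush(self._heap, (-depth, self._seq, name))
        self._seq += 1

    def next_task(self):
        # Called whenever a core goes idle.
        return heapq.heappop(self._heap)[2] if self._heap else None
```

With four independent depth-1 tasks and the head of a four-task serial chain (depth 4) in the queue, the chain head is dispatched first — the interleaving Jon argued for earlier in the thread with his xyzw/abcd example.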
|
From: Sebastian S. <seb...@gm...> - 2009-04-12 15:01:04
|
On Sun, Apr 12, 2009 at 3:15 AM, Jon Watte <jw...@gm...> wrote: > Luckily, I'm not running generic code; I'm running code that interfaces > with a specific operating system, a specific compiler, and a specific > runtime framework. Exactly! And if that specific runtime framework enforces specific restrictions, then that means your specific code has to be written under those specific restrictions, which means your runtime framework is not general. > > Yes, unfortunately I can't seem to explain adequately to you that a > > system that requires all dependencies to be known up front is not > > actually dynamic in any meaningful sense of the word. > > And I can't explain to you how that's not actually true It's a mathematical truth that any system that requires dependencies to be known up front can not also be dynamic. You have said that your system requires this, I quote: "A task doesn't get run until its dependencies are resolved. [...] In our system, we have explicit data dependency metadata [...], so this knowledge is something we have up front." and therefore your system can not be dynamic. > > My criticism of your approach is hardly that it's too high level. > > Quite the opposite! I'm saying that it is, as a matter of mathematical > > fact, not flexible enough to run code with dynamic control flow, > > because by definition you can not know > > Yet it is doing it, daily. Note that most flow control comes from > expression evaluation and polymorphism, rather than old-school > if-then-else and switch statements which you seem to be worried about. If you read more carefully I've given you two examples of how tasks using polymorphism can not be scheduled ahead of time either. Any kind of dynamic control flow needs an on-line scheduler. > > dependencies ahead of time for such code. Dynamic control flow > > *requires* an on-line scheduler - yours isn't. 
> > My specific proposal was that each time a core is free, it picks the > highest priority task available in the queue. Priorities are ideally > updated each time a new task is added to the queue (but that can be > loosened up for efficiency if it turns out to be a bottleneck). I think > that's an on-line scheduler. Well depending on how you propose to know the priorities, this may not be what you said before. My whole argument has been (and I've stated this very clearly in every message) that an ahead-of-time scheduler can not schedule dynamic code. If you weren't talking about an ahead of time scheduler you could've said this on any number of occasions. You said (see your quote above) that you *were* doing ahead-of-time scheduling before; if you've now changed your mind then we clearly aren't in disagreement any more (on this particular issue anyway). > Go write a real application using WPF dependency properties, and then > tell me that that's a "low level" "restricted" programming model. If WPF requires all dependencies to be known up front then it is restricted. That's probably fine because it has a very specific usage area. Like I've said before, restricted doesn't mean bad, it just means restricted. I re-wrote our task scheduler for our rendering system not four months ago, and guess what? It's an ahead-of-time scheduler. Saying that ahead-of-time scheduling can only handle restricted scenarios is not the same as calling anyone who uses it stupid (because then I'd be calling myself stupid!), it's just an honest discussion about pros and cons without getting defensive just because I happen to use this very system myself. All I'm doing is pointing out the mathematical impossibility of both knowing all dependencies up front, *and* allowing dynamic control flow (including polymorphism). -- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862 |
|
From: Jon W. <jw...@gm...> - 2009-04-13 03:17:48
|
Sebastian Sylvan wrote: > Exactly! And if that specific runtime framework enforces specific > restrictions, then that means your specific code has to be written > under those specific restrictions, which means your runtime framework > is not general. > I used to be in the OS business for many years. However, Microsoft ate us on the high end, and Linux ate us on the low end -- there's not room for much else these days, IMO. > > It's a mathematical truth that any system that requires dependencies > to be known up front can not also be dynamic. > You have said that your system requires this, I quote: > "A task doesn't get run until its dependencies are resolved. > [...] > In our system, we have explicit data dependency metadata > [...], so this knowledge is something we > have up front." > Up front at the start of a frame, not up front at the start of the program. A dependency is simply expressed as a DOM-style path, much like XPath, JavaScript and WPF. All components and functions of the path are "live," and any change is distributed to dependents. You can re-assign these paths if you want; such a re-assignment will take effect the next frame. > > If you read more carefully I've given you two examples of how tasks > using polymorphism can not be scheduled ahead of time either. Any kind > of dynamic control flow needs an on-line scheduler. Sure it can! You just have to say that polymorphism isn't allowed to cause new tasks to get spawned synchronously. I think that's a very simple rule to live by, and it makes smearing the workload across cores much easier. Not to mention it causes fewer bugs. > > Well depending on how you propose to know the priorities, this may not > be what you said before. Well, it seems that you're talking to several people at once, so the "you" you've been using has started to look a little diffuse to me. 
> My whole argument has been (and I've stated this very clearly in every > message) that an ahead-of-time scheduler can not schedule dynamic code. > If you weren't talking about an ahead of time scheduler you could've > said this on any number of The scheduler I propose is not ahead of time, but it does know (and update) the dependencies of tasks, including tasks that get added to the queue during a frame. If you want to buckle down to ahead-of-time, there are actually two kinds of ahead-of-time: one that is truly static (such as those proposed by people who set up "tasks" at program start, and then run them statically), and one that is static per frame, but is dynamic across frames. I think the latter may also be useful for games, but that's not what I'm proposing. > > If WPF requires all dependencies to be known up front then it is > restricted. You know what, my car is restricted. It can't go faster than 85 miles per hour. It can only run on paved roads. Heck, I can't even drive to another state from where I live without the infrastructure of a gas station. Still, it turns out that those restrictions allow me to buy a car that is quite affordable, comfortable, and safe, and actually solves the problem I need it to solve (get from home to work and back, every day, without breaking). I'll let you figure out how to map this analogy to writing schedulers for games :-) > I re-wrote our task scheduler for our rendering system not four months > ago, and guess what? It's an ahead-of-time scheduler. > All I'm doing is pointing out the mathematical impossibility of both > knowing all dependencies up front, *and* allowing dynamic control flow > (including polymorphism). And I'm saying that the system I'm describing knows all the dependencies when a task is inserted, but because tasks can be inserted during processing, that means it's not "ahead of time" in the sense of start-of-frame or start-of-program. 
That being said, almost all new tasks are created because new objects are created in the system, and object creation (or, rather, injection) generally happens at a defined time. Seems dynamic enough for anything we do (and we do a whole-Earth virtual world with user generated content, which is pretty much the most dynamic case you can think of while still calling it a game). Again: Spawning threads/tasks willy-nilly, especially based on hidden switch/if statements in code, and then (this is the worst part) synchronously waiting for the result, is a Bad Idea if you want to spread across many, many cores. If you consider caches local to cores, then cores are actually similar to NUMA machine architectures, with all that implies. If you want to be totally general (in the sense of a current general-purpose UNIX-like operating system, say), then you will not scale well across large numbers of cores. A different programming paradigm is needed. What I'm describing is one that I think will work out great. Sincerely, jw |
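Jon's clarification — each task's dependencies are fixed at the moment it is inserted, but insertion can happen at any point during the frame — suggests a scheduler along the following lines. This is a speculative reconstruction with invented names, not his framework:

```python
class FrameScheduler:
    """Tasks declare their data dependencies when inserted; insertion
    may happen mid-frame, so the ready set evolves as the frame runs."""

    def __init__(self):
        self.resolved = set()   # names of completed tasks
        self.waiting = []       # (task, unmet-dependency-set) pairs
        self.ready = []         # runnable now, in insertion order

    def insert(self, name, deps=()):
        # Dependencies are known at this point -- but not before it.
        if set(deps) <= self.resolved:
            self.ready.append(name)
        else:
            self.waiting.append((name, set(deps)))

    def complete(self, name):
        # Completing a task may release tasks inserted earlier.
        self.resolved.add(name)
        still_waiting = []
        for task, deps in self.waiting:
            if deps <= self.resolved:
                self.ready.append(task)
            else:
                still_waiting.append((task, deps))
        self.waiting = still_waiting
```

Under this reading the system is on-line across the frame even though no individual task may discover new dependencies after it has been inserted — which is precisely the restriction Sebastian objects to.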
|
From: Sebastian S. <seb...@gm...> - 2009-04-13 10:16:32
|
On Mon, Apr 13, 2009 at 4:17 AM, Jon Watte <jw...@gm...> wrote: > > > Up front at the start of a frame, not up front at the start of the program. Of course. Never thought you meant anything else. > > > > If you read more carefully I've given you two examples of how tasks > > using polymorphism can not be scheduled ahead of time either. Any kind > > of dynamic control flow needs an on-line scheduler. > > Sure it can! You just have to say that polymorphism isn't allowed to > cause new tasks to get spawned synchronously. Which is what I've been trying to say all this time. If the scheduler can't handle dynamic code, then it's not really a general system. I've given you several examples of very mundane and common uses of dynamic control flow that wouldn't be possible for this reason, so I believe I've made my point pretty clear on why this is unlikely to be acceptable for me at least (especially since the benefits of requiring these restrictions are highly dubious, IMO). -- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862 |
|
From: Jarkko L. <al...@gm...> - 2009-04-11 09:20:25
|
I was talking about a generic task system before. Dependencies impact the task priorities, because you want the longest chain of work (critical path) to start as early as possible in the frame. Another thing which impacts the chain of work is the length of each task in the chain, which is why I profile each task in the task manager to know which chain is the longest. Now, you are saying that you don't know about dependencies, etc. beforehand, but in games we are in the fortunate position that pretty much the same call graph is run from one frame to the next, so it's possible to exploit frame-to-frame coherency to prioritize tasks and have a pretty good idea what's going to happen ahead of time. It's not perfect of course (i.e. when a task is run for the first time there is no priority for it), but it's close and definitely better than not prioritizing at all. So the task prioritization is very dynamic and can adapt to the situation where different tasks are pushed to the queue and where their lengths change from frame to frame. In my implementation I don't define dependencies beforehand like you seem to assume you would have to do, but I define them at the call site, so there isn't external maintenance to do, and the usage is pretty much like calling a function (1-2 lines of code), so it's quite convenient to use. One motivation for the job queue I implemented was that it should be simple to use so that you could more easily take advantage of opportunities for parallelism. Cheers, Jarkko _____ From: Sebastian Sylvan [mailto:seb...@gm...] Sent: Friday, April 10, 2009 7:54 PM To: Game Development Algorithms Subject: Re: [Algorithms] General purpose task parallel threading approach On Fri, Apr 10, 2009 at 5:08 PM, Jarkko Lempiainen <al...@gm...> wrote: You can't statically schedule the work because the work changes dynamically from one frame to the next. 
In one frame it may be best to start running a->b->c->d as soon as possible, while sometimes you should start running z early because it takes 10 * y, etc. It's the scheduler's job to figure out the best order for optimal throughput and it would be silly to try to craft the order directly into the code. Not really, we do things like that all the time today - e.g. adding a speculative prefetch because the profiler says that most of the time the code will go down a certain branch and read a certain bit of data. It may not be strictly possible to predict which order will be better in every single circumstance, but you can look at the profiler output and optimize for the most common case. I see what you're saying now, though. You weren't actually talking about a general task system before? So when you said "priorities" you really meant ordering of a dependency graph that's known ahead of time (i.e. before you start executing the first task), not a dynamic priority system for a general task system. The use of the word "priority" threw me (as it's generally referred to in the context of a user giving a certain task a priority and having it execute first). However, using a static dependency graph (not static as in "compile-time", static as in "known before execution starts") is a bit like solving a problem with a slow heap allocator by saying that you're not allowed to do dynamic allocations. It doesn't actually solve the problem. I think we really do want a task system that would "work like normal" except allow you to run certain things in parallel, not something which has all sorts of restrictions on the kind of code you're allowed to write. Also, which scenario do you think is more likely: A) The intricate one given before (which needs extremely specific circumstances to be a problem). B) Merely a couple of if statements or virtual function calls, giving you dynamic dependencies. As an example of why B is a problem for a static system, consider that e.g. 
a task may have to be scheduled after some model has finished animating, even though it only needs to look at the animation data in very rare circumstances (e.g. only a specific subclass looks at it, or maybe it does hit detection and only needs animation data if the conservative bounding sphere test succeeds, etc.). In fact, this not only impacts throughput (a task that is actually ready to run is blocked, while threads may be idle), but it may cause unnecessary computations (what if nothing ever needs that animation data - with a static system you still need to compute it up front in case something does). There are all sorts of cases where dependencies can only be known after looking at dynamic conditions, meaning you have to start executing the task before you know what other tasks you have dependencies on. Not supporting that doesn't seem like it will scale in general, IMO. Honestly, dynamic call graphs are pretty common (if statements, virtual function calls, loops, hell scripting etc. too) so I'm probably going to go with saying that B is probably more common. So given that static task systems have these issues, and that they are highly restrictive in the kinds of computations you can parallelize, and that they don't seem to actually perform better (the paper I linked earlier showed work stealing matching hand-scheduled tasks), I think I know what I'll go for. -- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862 |
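Jarkko's frame-to-frame coherency idea — profile each task's duration, then use last frame's timings to estimate each task's critical-path length and prioritize by it — can be sketched as follows. This is a speculative reconstruction of the idea, not his implementation; the function names and graph representation are invented:

```python
def critical_path(task, durations, children, memo=None):
    """Longest total duration of any chain starting at `task`, using
    per-task timings profiled on the previous frame."""
    if memo is None:
        memo = {}
    if task not in memo:
        below = max((critical_path(c, durations, children, memo)
                     for c in children.get(task, ())), default=0.0)
        memo[task] = durations.get(task, 0.0) + below
    return memo[task]

def prioritize(roots, durations, children):
    # Longer critical path => higher priority => pushed to the queue first,
    # so the longest chain of work starts as early as possible in the frame.
    return sorted(roots,
                  key=lambda t: critical_path(t, durations, children),
                  reverse=True)
```

A task run for the first time has no profiled duration and falls back to zero here, matching Jarkko's caveat that first-run tasks have no priority yet; frame-to-frame coherency makes the estimate good from the second frame on.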