Thread: Re: [Algorithms] General purpose task parallel threading approach (Page 3)
From: Sebastian S. <seb...@gm...> - 2009-04-05 01:47:22
On Sun, Apr 5, 2009 at 1:52 AM, Jarkko Lempiainen <al...@gm...> wrote:
> You could also not have a wait for dependent tasks within a task, but rather have the job queue dependency system take care of it by splitting the function in two tasks. Or you could implement the wait by pulling tasks from the job queue until the dependent tasks have been completed. There are no context switches needed in either case though.

The first option seems limited to me - you really want to be able to just stick a parallel for in a task and have it work, without having to do a whole bunch of magic outside the task itself to split it up. The Cilk model is pretty much what we're going for.

The second is an option I guess, but it could lead to bubbles if you happen to pick up a long-running task. Imagine two massive parallel for loops: the first one finishes and syncs, and then, using your suggestion, goes off and steals work from another thread which takes a long time to complete. Only after that's finished can it continue with the next parallel for loop (which spawns enough tasks to keep the system saturated), but by that time the entire system may have been starved for the duration of that long-running task, just because the original worker handled its blocking by doing something else rather than yielding the core.

It's probably more attractive to just make sure no workers are blocked needlessly, in case the very first thing they do is to spawn more tasks (again, I see this as quite common - lots of stuff where you fork tons of tiny tasks, wait for them to finish, then spawn tons of tiny tasks again for the next bit of work).

> From: Sebastian Sylvan
> Sent: Sunday, April 05, 2009 3:31 AM
> Subject: Re: [Algorithms] General purpose task parallel threading approach
>
> On Sun, Apr 5, 2009 at 1:10 AM, Jarkko Lempiainen <al...@gm...> wrote:
>> I think you really need to target both task and data level parallelism to make your game engine truly scalable for many-core architectures. I don't see why you would need to consider the cost of context switching in relation to the cost of a task execution though, since if you've got a job queue, each worker thread just pulls tasks from the queue without switching context.
>
> The problem is when you have a task spawn N other tasks and then wait for their results: the core running that worker needs to do something until the sub-tasks have completed. In practice, I do think that this kind of synchronisation (specifically the fork-and-join model) will be extremely common in the future, which means that the context switch needs to be fast (Windows 7's user mode scheduling may be useful here - in fact I believe that's what Microsoft's native concurrency runtime uses, while the .NET one is backed by thread pools, I believe).
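For illustration, a minimal fork-join sketch of the "stick a parallel for in a task" idea, in the spirit of the Cilk model: the loop splits its range recursively, spawns one half, runs the other half inline, and syncs before returning. std::async stands in for a hypothetical scheduler's spawn() here just to keep the snippet self-contained; a real job system would push the spawned half onto a work-stealing deque instead. The grain size and names are made up.

    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <future>

    // Recursively split [begin, end): spawn the right half as a task, run the
    // left half on this worker, then join. With a work-stealing scheduler the
    // spawned half only moves to another core if someone is idle enough to steal it.
    void parallel_for(std::size_t begin, std::size_t end, std::size_t grain,
                      const std::function<void(std::size_t)>& body)
    {
        if (end - begin <= grain) {            // small enough: just run it inline
            for (std::size_t i = begin; i < end; ++i)
                body(i);
            return;
        }
        std::size_t mid = begin + (end - begin) / 2;
        auto right = std::async(std::launch::async,   // stand-in for scheduler.spawn()
                                parallel_for, mid, end, grain, std::cref(body));
        parallel_for(begin, mid, grain, body);        // left half runs on this thread
        right.wait();                                 // the "sync" before returning
    }

    int main()
    {
        std::atomic<long> sum{0};
        parallel_for(0, 4096, 256, [&](std::size_t i) { sum += static_cast<long>(i); });
        return sum == 4096L * 4095 / 2 ? 0 : 1;
    }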
From: Jarkko L. <al...@gm...> - 2009-04-05 20:40:38
There isn't "whole bunch of magic" outside the task you would need to do with the first option. The task just adds another function to the job queue which does the work after the wait. For second option it is job queues task to schedule the tasks for optimal throughput, so for wait it could for example favor tasks blocking the task from proceeding. Cheers, Jarkko _____ From: Sebastian Sylvan [mailto:seb...@gm...] Sent: Sunday, April 05, 2009 4:47 AM To: Game Development Algorithms Subject: Re: [Algorithms] General purpose task parallel threading approach On Sun, Apr 5, 2009 at 1:52 AM, Jarkko Lempiainen <al...@gm...> wrote: You could also not have a wait for dependent tasks within a task, but rather have job queue dependency system to take care of it by splitting the function in two tasks. Or you could implement the wait by pulling tasks from the job queue until the dependent tasks have been completed. There is no context switches needed in either case though. The first option seems limited to me - you really want to be able to just stick a parallel for in a task and have it work without having to do a whole bunch of magic outside the task itself to split it up. The Cilk model is pretty much what we're going for. The second is an option I guess, but it could lead to bubbles if you happen to pick up a long running task. Imagine two massive parallel for loops, the first one finishes and syncs, and then using your suggestion goes off and steals work from another thread which takes a long time to complete, and only after that's finished can it continue with the next parallel for loop (which spawns enough tasks to keep the system saturated), but by that time the entire system may have been starved for the duration of that long running task just because the original worker handled its blocking by doing something else rather than yielding the core. It's probably more attractive to just make sure no workers are blocked needlessly in case the very first thing they do is to spawn more tasks (again, I see this as quite common - lots of stuff where you fork tons of tiny tasks, wait for them to finish, then spawn tons of tiny tasks again for the next bit of work). _____ From: Sebastian Sylvan [mailto:seb...@gm...] Sent: Sunday, April 05, 2009 3:31 AM To: Game Development Algorithms Subject: Re: [Algorithms] General purpose task parallel threading approach On Sun, Apr 5, 2009 at 1:10 AM, Jarkko Lempiainen <al...@gm...> wrote: I think you really need to target for both task and data level parallelism to make your game engine truly scalable for many-core architectures. I don't see why you would need to consider the cost of context switching in relation to const of a task execution though since if you got a job queue, each worker thread just pulls tasks from the queue without switching context. The problem is when you have a task spawn N other tasks and then wait for their results, the core running that worker needs to do something until the two sub-tasks have completed. In practice, I do think that this kind of synchronisation (specifically the fork-and-join model) will be extremely common in the future, which means that context switch needs to be fast (Windows 7's user mode scheduling may be useful here - in fact I believe that's what Microsoft's native concurrency runtime uses, while the .Net one is backed by thread pools, I believe). 
-- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862 ---------------------------------------------------------------------------- -- _______________________________________________ GDAlgorithms-list mailing list GDA...@li... https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list Archives: http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list -- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862 |
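A minimal sketch of what that first option might look like, assuming a hypothetical job_queue with a submit() call and a shared dependency counter (none of these names are from a specific library): the function is split in two, and the last child task to finish enqueues the continuation instead of anyone blocking.

    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <memory>
    #include <vector>

    // Hypothetical job queue interface: submit() hands a closure to the workers.
    struct job_queue {
        std::vector<std::function<void()>> jobs;          // stand-in storage
        void submit(std::function<void()> fn) { jobs.push_back(std::move(fn)); }
    };

    // Part A kicks off the children and registers the continuation (part B).
    // Instead of blocking, the last child to finish enqueues the continuation.
    void update_ai_part_a(job_queue& q)
    {
        auto pending = std::make_shared<std::atomic<int>>(3);   // 3 child jobs
        auto part_b  = [&q]() { /* ...the work that used to run after the wait... */ };

        for (int i = 0; i < 3; ++i) {
            q.submit([pending, part_b, &q, i]() {
                (void)i;                          // ...child work for slice i...
                if (pending->fetch_sub(1) == 1)   // last child in: schedule part B
                    q.submit(part_b);
            });
        }
    }

    int main()
    {
        job_queue q;
        update_ai_part_a(q);
        // Single-threaded drain, just to exercise the sketch.
        for (std::size_t i = 0; i < q.jobs.size(); ++i) q.jobs[i]();
    }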
From: Sebastian S. <seb...@gm...> - 2009-04-05 22:43:59
On Sun, Apr 5, 2009 at 9:40 PM, Jarkko Lempiainen <al...@gm...> wrote:
> There isn't a "whole bunch of magic" outside the task that you would need to do with the first option. The task just adds another function to the job queue which does the work after the wait. For the second option it is the job queue's task to schedule the tasks for optimal throughput, so for a wait it could, for example, favor tasks that are blocking the waiting task from proceeding.

That's a whole bunch of magic to me. If you want pervasive parallelism (and Mr Amdahl says we do), you're going to need sync points all over the place. Currently we're not doing that because we don't *need* to do it yet, but if you have 16-32 threads to keep busy, you bet you're going to need to spawn off tasks all over the place (and usually sync immediately afterwards). Having to manually split all of these things up into lots of tiny tasks is a big issue IMO - it's tedious, error prone, and adds more overhead (the cost of starting a task still isn't free - so doing it, say, 5-10x as often is a big deal).

> With job stealing from other threads you don't get global prioritization, so one thread may be executing lower priority tasks and not working towards optimal throughput.

Why do you actually need global prioritization? After all, you need to finish everything by the end of the frame anyway, so why would you care if task A gets executed before task B, so long as you never get any bubbles where a core is idle? All you care about is maximal throughput, maximal locality (especially true for in-order CPUs) and minimal contention/overhead, and for that a giant shared queue is sub-optimal for two reasons:

* It's inherently non-scalable. If you have lots of threads reading from the same queue you'll get contention. Even with a relatively modest number of threads this starts becoming an issue. Work stealing has negligible contention, because you have to be unlucky enough to get someone trying to steal from the bottom of your stack at the same time as you run out of tasks to do (so you're fighting for the very last item in the stack). That's extremely unlikely to happen, and it's even more unlikely for two threads to try to steal from the same stack - whereas a shared work queue is constantly being hit by multiple threads.

* You lose locality. With a work-stealing approach you get good locality because you run on the same thread the task was spawned on, and tasks spawned close together in time get run close together in time, meaning good cache coherency. If tasks are spread out over multiple cores, and over time, you'll get less coherency.
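To make the stack/steal terminology concrete, here is a deliberately simplified per-worker deque, protected by a mutex rather than the lock-free scheme a real scheduler (Cilk, TBB) uses; the point is only the asymmetric access pattern: the owner pushes and pops at one end (LIFO, good locality), thieves take from the other end, so they only collide when the deque is nearly empty.

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <vector>

    using task = std::function<void()>;

    // One of these per worker thread. The owner uses the back, thieves use the
    // front, so they only contend when the deque is almost drained.
    class work_deque {
    public:
        void push(task t) {                      // owner: spawn a child task
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push_back(std::move(t));
        }
        std::optional<task> pop() {              // owner: LIFO, most recently spawned first
            std::lock_guard<std::mutex> lock(m_);
            if (tasks_.empty()) return std::nullopt;
            task t = std::move(tasks_.back());
            tasks_.pop_back();
            return t;
        }
        std::optional<task> steal() {            // thief: FIFO, oldest (usually largest) task
            std::lock_guard<std::mutex> lock(m_);
            if (tasks_.empty()) return std::nullopt;
            task t = std::move(tasks_.front());
            tasks_.pop_front();
            return t;
        }
    private:
        std::mutex m_;
        std::deque<task> tasks_;
    };

    // Worker loop sketch: run your own work first, steal only when you run dry.
    void worker_loop(std::vector<work_deque>& all, std::size_t self)
    {
        for (;;) {
            if (auto t = all[self].pop()) { (*t)(); continue; }
            bool stole = false;
            for (std::size_t v = 0; v < all.size() && !stole; ++v) {
                if (v == self) continue;
                if (auto t = all[v].steal()) { (*t)(); stole = true; }
            }
            if (!stole) break;   // nothing anywhere; a real scheduler would back off/sleep
        }
    }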
From: Jarkko L. <al...@gm...> - 2009-04-05 20:45:07
Why is it a problem to execute the tasks in a recursive way? Note that this recursive way of executing tasks is used only when blocking within tasks, not in the job queue dependency system.

Cheers, Jarkko

On Sunday, April 05, 2009, asy...@gm... wrote:
> Without the ability to save/restore tasks your dependency system will have a big bunch of limitations, because you can only execute tasks during a wait in a recursive way ONLY.
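For reference, a minimal sketch of that recursive wait, with a mutex-protected deque standing in for the shared job queue (all names are made up): the blocked worker keeps popping and running other jobs until the one it is waiting on is done, so every borrowed job nests on top of the waiting one on this worker's stack.

    #include <atomic>
    #include <deque>
    #include <functional>
    #include <mutex>

    struct job {
        std::function<void()> fn;
        std::atomic<bool> done{false};
    };

    // Stand-in for the shared job queue (a real one would be lock-free or finer-grained).
    static std::deque<job*> g_queue;
    static std::mutex g_queue_mutex;

    static job* try_pop_shared_queue()
    {
        std::lock_guard<std::mutex> lock(g_queue_mutex);
        if (g_queue.empty()) return nullptr;
        job* j = g_queue.front();
        g_queue.pop_front();
        return j;
    }

    // Wait for 'j' without yielding the core: keep pulling and running other jobs.
    // Every job run here nests on top of the waiting one on this worker's stack,
    // which is the "recursive way" being discussed - it can't be suspended and resumed.
    static void wait_for(job& j)
    {
        while (!j.done.load(std::memory_order_acquire)) {
            if (job* other = try_pop_shared_queue()) {
                other->fn();
                other->done.store(true, std::memory_order_release);
            }
            // else: a real scheduler would back off or sleep here instead of spinning.
        }
    }

    int main()
    {
        job a;
        a.fn = [] { /* the dependent work */ };
        {
            std::lock_guard<std::mutex> lock(g_queue_mutex);
            g_queue.push_back(&a);
        }
        wait_for(a);   // this "wait" ends up popping and running 'a' itself
    }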
From: <asy...@gm...> - 2009-04-06 05:58:34
You are probably referring to some specific implementation of the task-stealing approach? Task stealing can be implemented with and without a global queue. Also, the trend of increasing core counts results in improved inter-core communication mechanisms as well (since otherwise it becomes a bottleneck), so I'm not sure that the global queue approach (together with stealing or not) is as bad as you say.

Alexander
From: Sebastian S. <seb...@gm...> - 2009-04-06 07:27:53
I mean the standard one-queue-per-worker, with work stealing to seed a worker that's run out of stuff to do. I'm not sure what work stealing would even mean with a global queue? Look up the Cilk papers for some more data on the benefits of this. Improved inter-core communication is all well and good, but not as good as avoiding inter-core communication. Having even just six to eight threads all trying to CAS the same pointer *is* going to lead to some contention overhead if your tasks are fine-grained enough (and they will have to be eventually).
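For illustration, this is the kind of single hot pointer being described - a Treiber-style stack where every worker's push and pop has to CAS the same head. It's only a sketch (it ignores the ABA problem and node reclamation); the point is that each failed compare_exchange is a retry plus a cache-line ping-pong between cores, and that cost grows with thread count and with finer-grained tasks.

    #include <atomic>

    struct node {
        void (*run)(void*);
        void* data;
        node* next;
    };

    // Every worker, on every push and pop, contends on this one atomic head pointer.
    std::atomic<node*> g_head{nullptr};

    void push(node* n)
    {
        node* old = g_head.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!g_head.compare_exchange_weak(old, n,
                     std::memory_order_release, std::memory_order_relaxed));
    }

    node* pop()
    {
        node* old = g_head.load(std::memory_order_acquire);
        while (old &&
               !g_head.compare_exchange_weak(old, old->next,
                   std::memory_order_acquire, std::memory_order_acquire)) {
            // Each failed CAS is a retry caused by another core touching the same
            // cache line - this is the contention that grows with thread count.
        }
        return old;
    }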
From: <asy...@gm...> - 2009-04-06 08:13:55
I didn't get how you would handle priorities in a fair way? Wouldn't you need an expensive search over all the local per-processor queues for that? Or how would you even out tasks with high/low priorities among the per-CPU queues? I think there is no best approach for all cases; the final one depends on the set of features you'd like to have in your scheduler, which should of course be the best of the alternatives with regard to lowering contention.

Alexander
From: Jarkko L. <al...@gm...> - 2009-04-06 07:37:55
I was under the impression that you should minimize those sync points and dependencies for optimal throughput. If you have the ideal situation where your entire app is split into small tasks, then I guess you wouldn't need to prioritize, but I believe the reality is that you've got some large tasks whose execution needs to start as early in the frame as possible to avoid stalling other threads at the end of the frame. So you need to prioritize those tasks, and the tasks those large tasks depend on, high for execution.

I agree that the ability to wait within a task is more convenient, particularly if you have to do it deep inside a nested loop. But I don't see spawning new tasks where you can as tedious as you describe, and those tasks don't add as much overhead to the queue as you imply. I.e. when you've got a sync point, you are waiting for on average maybe tens or hundreds of tasks to complete, so the overhead of a single task is negligible in proportion. I doubt that contention will be an issue with a shared queue any time soon (definitely not with the 16-32 threads you are talking about), but agreed that locality is a good point in favor of job stealing.

Cheers, Jarkko
From: Sebastian S. <seb...@gm...> - 2009-04-06 11:55:01
On Mon, Apr 6, 2009 at 8:37 AM, Jarkko Lempiainen <al...@gm...> wrote:
> I was under the impression that you should minimize those sync points and dependencies for optimal throughput. If you have the ideal situation where your entire app is split into small tasks, then I guess you wouldn't need to prioritize, but I believe the reality is that you've got some large tasks whose execution needs to start as early in the frame as possible to avoid stalling other threads at the end of the frame. So you need to prioritize those tasks, and the tasks those large tasks depend on, high for execution. I agree that the ability to wait within a task is more convenient, particularly if you have to do it deep inside a nested loop. But I don't see spawning new tasks where you can as tedious as you describe, and those tasks don't add as much overhead to the queue as you imply. I.e. when you've got a sync point, you are waiting for on average maybe tens or hundreds of tasks to complete, so the overhead of a single task is negligible in proportion. I doubt that contention will be an issue with a shared queue any time soon (definitely not with the 16-32 threads you are talking about), but agreed that locality is a good point in favor of job stealing.

Realistically most code isn't embarrassingly parallel and has lots of complicated, and dynamic, flow, which would be either inconvenient or impossible to deal with if you have an "up front" dependency graph. Yes, by all means try to minimize dependencies, but don't count on succeeding to any major extent. IMO real code has loads of non-parallel dependencies, with tiny pockets of parallelism embedded within them. To get pervasive parallelism I predict you'll have to write lots and lots of code where you kick off 2-3 parallel tasks and then immediately sync, over and over again. You'll have a couple of huge parallel for loops too, but most code isn't like that.

You still have some room for choosing when to start a task. For example, you want to put tasks that are likely to spawn lots of other tasks close to the next sync point so that they get executed first (assuming the worker is using LIFO ordering). That doesn't need a priority system though.

The way I see it, tasks are for "burst parallelism". I.e. you have a certain amount of work that needs to be finished and that's all you care about. Prioritizing between those tasks isn't really a parallelism issue, it's more concurrency/coordination and probably belongs on a different level of abstraction (i.e. threads), IMO. You could, however, easily have one worker thread per priority level, with work stealing only stealing within its own priority. Meaning it wouldn't be too hard to prioritize things in the rough (i.e. the very root function that kicks off tasks could stick them on a higher priority worker).

The relative inefficiency of a global queue doesn't just depend on the number of threads, but also on the granularity of tasks. Even with a few hundred to a thousand tasks a global queue starts losing significantly on current hardware (a mere 6 threads) according to the benchmarks from the concurrency talk at the last Gamefest. So it doesn't look too good for 16-32 threads, IMO.

> Actually, after a bit of extra thought on locality, I think I'll take back that "locality is a good point in favor of job stealing" part. When you've got a shared cache between threads, locality is actually in favor of the shared job queue, because all the worker threads are executing the same type of task, thus it doesn't put as much pressure on the L2 cache as it would if each thread were executing completely unrelated tasks.

Well I guess that COULD be true in principle, however ordering by priority is not the same as ordering by locality, so you're really just making a guess that tasks of similar priorities will tend to access the same memory, whereas a work-stealing approach makes the guess that tasks scheduled in close proximity will work on the same data. With hundreds or thousands of tasks I must say I find the latter to be much more likely (and benchmarks seem to confirm it). Also, moving tasks around randomly means (with a significant number of workers) that you're almost guaranteeing a bunch of L1 cache misses at the start of every single task. L1 caches may not be a big deal, but with lots of tiny tasks these things add up...

Also, it seems that just about all the major task libraries coming out have come to the conclusion that work stealing is much faster, and I've yet to see a single benchmark showing that this current recommended practice is incorrect.
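A rough sketch of the "one worker per priority level" arrangement (the scheduler type and all names here are hypothetical): each hardware thread hosts one OS thread per priority level, OS thread priorities do the preemption, and stealing stays within a level.

    #include <cstddef>
    #include <thread>
    #include <vector>

    enum class task_priority { high = 0, normal = 1, low = 2 };

    // Hypothetical: one independent work-stealing scheduler per priority level,
    // each owning one deque per hardware thread; stealing never crosses levels.
    struct stealing_scheduler {
        void run_worker(unsigned hw_thread) { (void)hw_thread; /* pop-or-steal loop */ }
    };

    void start_workers(std::vector<stealing_scheduler>& per_priority)
    {
        unsigned hw_threads = std::thread::hardware_concurrency();
        for (unsigned hw = 0; hw < hw_threads; ++hw) {
            for (std::size_t p = 0; p < per_priority.size(); ++p) {
                std::thread worker(&stealing_scheduler::run_worker, &per_priority[p], hw);
                // Platform-specific (omitted): raise the OS priority of the level-0
                // workers (SetThreadPriority / pthread_setschedparam) so a runnable
                // high-priority worker always preempts the lower levels on that core.
                worker.detach();   // kept simple here; a real engine would join at shutdown
            }
        }
    }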
From: Jarkko L. <al...@gm...> - 2009-04-07 10:02:27
If you've got only 2-3 parallel tasks on average, then it doesn't make much sense to multi-thread the code anyway, particularly if you use work stealing. It sounds like you assume that you have some work starvation issues and that there is always a thread idling and ready to steal work, but to me that doesn't sound like a healthy assumption to build your work scheduling around, unless you're trying to patch an engine not built for multi-threading. If that's the situation you are facing, it sounds to me like you have some more serious multi-threading and scalability issues in your game anyway.

Regarding task prioritization, you could of course have a thread per priority, but it doesn't really help getting high priority tasks completed early in the frame; you would need all the threads to contribute to running those tasks.

I didn't mean either that tasks with the same/similar priority are local in relation to each other per se, but that threads gain locality by running the same function provided by the job queue for different data sets when utilizing data level parallelism. E.g. in the job queue you have an animate_character(character*) function where multiple threads are simultaneously scheduled to run the function with different character*, so you get less pressure on the L2/L3 cache (even the L1 cache on Xbox) due to the shared data between characters.

You are right that fine-grained tasks increase shared job queue contention, but then again I don't think you should put very fine-grained tasks in the job queue anyway, but run those tasks sequentially instead, which has superior performance and simplicity to any task system (if you don't count the cache effect, that is). I don't really see any benefit in aiming for very fine task granularity unless you're trying to fill in those aforementioned task starvation bubbles. Unfortunately I haven't had a chance to test my shared job queue past 2 cores so I can't really tell how well it scales in practice, but since it's utilizing basic lock-free algorithms, I presume it scales reasonably well.

Cheers, Jarkko
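For concreteness, roughly what that submission might look like against a hypothetical shared job queue (submit/wait_all and the counter are assumptions, not a real API): every worker ends up running the same animate_character code over different character pointers, which is where the shared-cache benefit described above comes from.

    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct character;                       // engine type, defined elsewhere
    void animate_character(character* c);   // the per-item job from the discussion

    // Hypothetical shared job queue interface.
    struct job_queue {
        template <class Fn>
        void submit(Fn fn, std::atomic<int>* counter);  // enqueue; decrement counter on completion
        void wait_all(std::atomic<int>* counter);       // run/help until the counter hits zero
    };

    void animate_all(job_queue& q, std::vector<character*>& chars)
    {
        std::atomic<int> remaining(static_cast<int>(chars.size()));
        for (character* c : chars)
            q.submit([c] { animate_character(c); }, &remaining);
        q.wait_all(&remaining);   // all workers run the same code over different characters
    }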
From: Jon W. <jw...@gm...> - 2009-04-07 17:14:31
Jarkko Lempiainen wrote:
> I don't really see any benefit in aiming for very fine task granularity unless you're trying to fill in those aforementioned task starvation bubbles. Unfortunately I haven't had a chance to test my shared job queue past 2 cores so I can't really tell how well it scales in practice, but since it's utilizing basic lock-free algorithms, I presume it scales reasonably well.

If you had 64 hardware threads, would your view change? We'll be getting there in just a few years. Code you write today may very well be expected to scale to that level, or it will be considered obsolete and have to be replaced.

There is the job queue implementation question (which is what we're mostly talking about here), which contains all kinds of micro-optimization opportunities (as shown by the asynclib approach) that may or may not apply in a particular game situation. Then there's the question of how to granularize your game so that it can actually smear the workload across 64 hardware threads (or some other, equally "large" number). That's, in the end, a much more interesting question. Whether your job system uses 100, 1,000 or 10,000 cycles to switch jobs can easily be measured, but doesn't really affect the scalability of your system. Whether you can actually break the work of a single frame into 500 individual tasks with as few serial dependencies as possible is much more important in the big scheme of things.

Sincerely, jw
From: Jarkko L. <al...@gm...> - 2009-04-07 18:49:27
No, it wouldn't change my view, because that's what I'm already thinking. What would change my view, though, is if there were hundreds of threads which ran at significantly lower frequency (e.g. a few hundred MHz instead of 2-3 GHz), in which case you would need to split your work into smaller pieces. Like I said to Sebastian, the only thing you care about is that your job's execution time is a relatively small fraction of the entire frame time, so in the end it's the frame time and thread frequency that determine your ideal task size. The ideal task size isn't the smallest possible unit of work, though, because of the overhead of distributing the work. Well, that's what I think anyway (:

Cheers, Jarkko
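As a back-of-the-envelope version of that sizing argument (the numbers below are made up, purely to illustrate the arithmetic): pick a target job duration as a fraction of the frame, then derive how many items to batch per job from an estimated per-item cost.

    #include <algorithm>
    #include <cstddef>

    // Illustrative only: derive a batch size so each job is a small, fixed fraction
    // of the frame. E.g. a 16.7 ms frame with a 5% target gives ~0.8 ms jobs; if one
    // item costs ~4 us, that's roughly 200 items per job.
    std::size_t items_per_job(double frame_ms, double target_fraction,
                              double item_cost_us, std::size_t total_items)
    {
        double target_job_us = frame_ms * 1000.0 * target_fraction;
        std::size_t n = static_cast<std::size_t>(target_job_us / item_cost_us);
        return std::max<std::size_t>(1, std::min(n, total_items));
    }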
From: Jon W. <jw...@gm...> - 2009-04-06 16:47:52
asy...@gm... wrote:
> Where do you get a deadlock from?

If a specific operation is not cancellable, and waits for completion no matter what, then that means that cancelling the fiber will have to wait for that operation to complete. If the operation in turn needs another task to complete, which may be later in the work queue, or require locks/resources currently held by the fiber doing the cancellation, then you have a deadlock.

In general, non-cancellable operations are bad (similar to non-signal-breakable syscalls in UNIX). The right thing to do is to make any operation cancellable, and we've found that the best way to make sure nobody does the wrong thing is to simply throw an exception out of the yield call when an operation is cancelled. Proper use of smart pointers and RAII will then take care of the unwind semantics.

That being said, blocking is still rare in game loops; we use all of this only for the transactional application server back-end.

Sincerely, jw
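A small sketch of the pattern described here, with hypothetical names (task_cancelled, yield_point, the cancel flag): every yield/blocking point checks the cancellation flag and throws, and RAII members release their resources as the exception unwinds the task.

    #include <atomic>
    #include <memory>
    #include <stdexcept>
    #include <vector>

    struct task_cancelled : std::runtime_error {
        task_cancelled() : std::runtime_error("task cancelled") {}
    };

    struct task_context {
        std::atomic<bool> cancel_requested{false};

        // Called at every yield/blocking point inside a task. If the task has been
        // cancelled, unwind via an exception so destructors release locks/resources.
        void yield_point() {
            if (cancel_requested.load(std::memory_order_acquire))
                throw task_cancelled();
            // ...otherwise switch to another fiber / run another task here...
        }
    };

    void stream_in_asset(task_context& ctx)
    {
        auto buffer = std::make_unique<std::vector<char>>(64 * 1024);  // RAII-owned resource
        for (int block = 0; block < 1024; ++block) {
            ctx.yield_point();          // may throw task_cancelled
            // ...read the next block into *buffer...
        }
    }   // if cancelled, the exception propagates and 'buffer' is freed on the way out

    int main()
    {
        task_context ctx;
        ctx.cancel_requested = true;    // simulate a cancel before the task runs
        try { stream_in_asset(ctx); }
        catch (const task_cancelled&) { /* task unwound cleanly */ }
    }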
From: <asy...@gm...> - 2009-04-06 17:21:32
> If the operation in turn needs another task to complete, which may be later in the work queue, or require locks/resources currently held by the fiber doing the cancellation

That's no problem in my implementation. Both tasks will be suspended until the non-cancellable wait operation finishes, and the worker thread will be working on something else. Also, async::Task::Cancel only marks a task for cancelling; you need to explicitly wait on a task to finish to reach the point where a cancelled task no longer exists. If in turn the wait operation never expires, then it is a problem in the task logic, and cancelling is not supposed to fix that. I already have a dozen applications written with asynclib, and strict user control over what to cancel and what not turned out to be the best thing. You can still apply your own usage, and make each operation cancellable and throw an exception - it is up to you.

Alexander
From: Sebastian S. <seb...@gm...> - 2009-04-07 11:40:53
On Tue, Apr 7, 2009 at 11:02 AM, Jarkko Lempiainen <al...@gm...> wrote:
> If you've got only 2-3 parallel tasks on average, then it doesn't make much sense to multi-thread the code anyway, particularly if you use work stealing. It sounds like you assume that you have some work starvation issues and that there is always a thread idling and ready to steal work, but to me that doesn't sound like a healthy assumption to build your work scheduling around, unless you're trying to patch an engine not built for multi-threading. If that's the situation you are facing, it sounds to me like you have some more serious multi-threading and scalability issues in your game anyway.

That's not what I meant. The total number of tasks should ideally be in the hundreds or thousands at a peak, and at least several dozen on average; I'm saying that the branching factor for a lot of that is going to be around 2-3. But of course each of those tasks may kick off another 2-3 tasks (and then synchronize before returning its result). So you get a tree with lots of nodes (synchronisation points), but the number of nodes at each level (i.e. stuff that can run in parallel) can still be hundreds. Combine this with the rare (but very welcome!) big giant parallel loops (i.e. "for each of the thousands of game objects, tick in parallel"), and there should be plenty of tasks to keep your scheduler sweating.

> Regarding task prioritization, you could of course have a thread per priority, but it doesn't really help getting high priority tasks completed early in the frame; you would need all the threads to contribute to running those tasks.

Yes, and that would still work. If you have three priority levels, you'd have three worker threads per hardware thread. Mid-priority worker threads would not get to run until the high-priority worker thread is blocked, which will only happen when every single high-priority task is completed. And again, it doesn't seem like you need priorities to get what you want, since scheduling with work stealing is reasonably predictable: if you really do need something to run first (because it's likely to produce enough child tasks to keep all threads busy) then you just spawn it closer to the next sync point.

> I didn't mean either that tasks with the same/similar priority are local in relation to each other per se, but that threads gain locality by running the same function provided by the job queue for different data sets when utilizing data level parallelism. E.g. in the job queue you have an animate_character(character*) function where multiple threads are simultaneously scheduled to run the function with different character*, so you get less pressure on the L2/L3 cache (even the L1 cache on Xbox) due to the shared data between characters.

But if you *interrupt* (in the sense that you push some of the tasks related to a particular area to the bottom of the queue) what the threads were already doing just to get them to run animate_character instead, then they will have to resume the previous tasks later and get lots of cache misses. So it's probably better to let them finish the work they're doing (which is hot in the cache) first, and *then* start spreading animate_character out over all cores. Which is precisely what you get with work stealing. If running different tasks is hurting your cache, you're of course free to insert a sync call before starting animation to make sure all pending tasks are done before you start the animation stuff. My gut instinct on this says "priorities mean shuffling, and shuffling means loss of locality"...

> You are right that fine-grained tasks increase shared job queue contention, but then again I don't think you should put very fine-grained tasks in the job queue anyway, but run those tasks sequentially instead, which has superior performance and simplicity to any task system (if you don't count the cache effect, that is). I don't really see any benefit in aiming for very fine task granularity unless you're trying to fill in those aforementioned task starvation bubbles. Unfortunately I haven't had a chance to test my shared job queue past 2 cores so I can't really tell how well it scales in practice, but since it's utilizing basic lock-free algorithms, I presume it scales reasonably well.

Honestly, I think we won't have a choice but to have very fine-grained tasks if we're going to hope to keep 32+ threads busy. Cilk claims an overhead of about 3x the cost of a function call (IIRC) on PowerPC architectures, so it seems to me that pretty much anything you're not already marking for inlining could probably run as a task (especially if you use work stealing, where it will just run immediately on the very same thread, meaning essentially unchanged performance, in case there aren't any free workers to parallelize it). In fact, I don't think tasks are fine-grained enough! We're probably going to need to do data parallelism on the CPU too for even more fine-grained parallelism.

Sebastian
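As a small illustration of the "insert a sync call before starting animation" point (the scheduler interface and the 32-item cutoff are made up): drain whatever is in flight first so the batch gets the caches to itself, and skip the task machinery entirely for batches too small to pay for it.

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct character;
    void animate_character(character* c);

    // Hypothetical scheduler interface, in the spirit of the discussion above.
    struct scheduler {
        void sync();  // help/wait until all previously spawned tasks are complete
        void parallel_for(std::size_t begin, std::size_t end,
                          const std::function<void(std::size_t)>& body);
    };

    void update_animation(scheduler& sched, std::vector<character*>& chars)
    {
        sched.sync();                       // finish the in-flight batch first (it's cache-hot)

        if (chars.size() < 32) {            // tiny batch: task overhead would outweigh the work
            for (character* c : chars)
                animate_character(c);
            return;
        }

        sched.parallel_for(0, chars.size(),
                           [&](std::size_t i) { animate_character(chars[i]); });
        sched.sync();                       // join before anything downstream reads the poses
    }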
From: Jarkko L. <al...@gm...> - 2009-04-07 18:35:26
Yes, your prioritization would still function properly if you assigned a thread per priority, of course, but it wouldn't result in optimal throughput, which is the whole point after all. You need all your worker threads to work on the top-priority tasks in order to have optimal throughput, which requires that all the threads work on a single prioritized queue of jobs. Job stealing doesn't achieve this because each worker thread has its own queue. I also never said that tasks should be interrupted prematurely; you should let them finish and start the next high-priority task afterwards.

Having fine-granularity tasks doesn't help regardless of how many threads you have. The only thing you really care about is that your jobs take, let's say, <~10% of your frame time (1-3ms), which isn't very high granularity. Some might take longer, but those are also prioritized higher to avoid stalling other threads at the end of the frame. So if you can animate all your on-screen characters in 1-3ms, you shouldn't bother multi-threading the code EXCEPT if you want to gain shared cache coherence. However, with work stealing you wouldn't even gain this, because each thread works on its own set of jobs.

Cheers,
Jarkko
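For comparison, the single shared, prioritized job queue described above could look roughly like the sketch below. It uses a mutex and std::priority_queue purely for brevity; the queue described in the thread is lock-free, which this is not.

    // Bare-bones single shared prioritized job queue (mutex-based sketch;
    // the one discussed in the thread is lock-free).
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct job {
        int priority;                  // higher runs first
        std::function<void()> run;
    };

    struct by_priority {
        bool operator()(const job& a, const job& b) const { return a.priority < b.priority; }
    };

    class shared_job_queue {
        std::priority_queue<job, std::vector<job>, by_priority> jobs_;
        std::mutex mutex_;
        std::condition_variable ready_;
    public:
        void push(job j) {
            { std::lock_guard<std::mutex> lock(mutex_); jobs_.push(std::move(j)); }
            ready_.notify_one();
        }
        job pop() {
            // Every worker pulls from this one queue, so all of them
            // converge on the highest-priority work first.
            std::unique_lock<std::mutex> lock(mutex_);
            ready_.wait(lock, [&] { return !jobs_.empty(); });
            job j = jobs_.top();
            jobs_.pop();
            return j;
        }
    };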
|
From: Sebastian S. <seb...@gm...> - 2009-04-07 19:55:21
|
On Tue, Apr 7, 2009 at 7:35 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Yes, your prioritization would still function properly if you assigned
> a thread per priority of course, but it wouldn't result in optimal
> throughput, which is the whole point after all. You need all your worker threads to
> work on the top-priority tasks in order to have optimal throughput, which
> requires that all the threads work on the single prioritized queue of jobs.
> Job stealing doesn't achieve this because each worker thread has its own
> queue. I also never said that tasks should be interrupted prematurely, but
> you should let them finish and start the next high-priority task after.

I didn't say an individual task would be interrupted, but if you have 100 tasks working on something and you kick off 20 tasks to do your animation at a high priority when only 5 of those 100 tasks have completed, you'll evict a bunch of data from the cache and then you still have 95 tasks to go; then you do another 5 and something else pops up to the top, evicting the data again, and so on. It's better for the entire batch of 100 tasks to run to completion. With work stealing, large batches of tasks that would benefit from sharing data across cores would naturally start to fill up the worker threads as the previous tasks finish - and if you really want to make sure that's the case, just call sync() before.

> Having fine-granularity tasks doesn't help regardless of how many threads you
> have. The only thing you really care about is that your jobs take, let's say,
> <~10% of your frame time (1-3ms), which isn't very high granularity. Some might take
> longer, but they are also prioritized higher to avoid stalling other threads
> at the end of the frame. So if you can animate all your on-screen characters
> in 1-3ms you shouldn't bother multi-threading the code EXCEPT if you want to
> gain shared cache coherence. However, with work stealing you wouldn't even
> gain this because each thread works on its own set of jobs.

I'm not sure I follow you? If I can take something that takes 3ms to finish and get it done in 0.5ms because I can use six threads instead of one, then that's clearly a win. Even if I get an unreasonable amount of overhead and it takes 1ms (50% overhead!), it's still a 3x boost as far as the overall frame time is concerned.

What you care about is that the set of tasks you need to complete in a frame is done as quickly as possible, which means it should waste as few cycles as possible, which means you should avoid contention, bubbles due to coarse granularity, and cache thrashing. Fine task granularity is clearly extremely important in order to actually be able to keep loads of threads busy. Of course you need to make sure the overhead of spawning the task isn't higher than the cost of just doing the work directly, but as long as the overhead is even slightly lower than the cost of running the task, it's a win to get some other thread that would otherwise be idle to do it "for free" while you only pay the fixed overhead cost.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
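The grain-size cutoff implied by that last point ("only spawn a task if the work outweighs the fixed overhead") can be sketched roughly as follows. The spawn() here is a placeholder that runs the chunk inline, purely so the sketch stands alone; a real scheduler would enqueue it for stealing and require a sync() afterwards.

    // Grain-size cutoff: split a range into tasks, but once a chunk is small
    // enough that task overhead would dominate, run it inline instead.
    #include <cstddef>
    #include <functional>

    void spawn(std::function<void()> f) { f(); }  // placeholder, runs inline

    template <class Body>
    void parallel_for(std::size_t begin, std::size_t end, std::size_t grain, Body body)
    {
        if (end - begin <= grain || end - begin < 2) {  // not worth a task: run it here
            for (std::size_t i = begin; i < end; ++i) body(i);
            return;
        }
        std::size_t mid = begin + (end - begin) / 2;
        spawn([=] { parallel_for(begin, mid, grain, body); });  // left half as a task
        parallel_for(mid, end, grain, body);                    // right half on this thread
    }

Tuning the grain so that each chunk comfortably exceeds the per-task overhead is the practical knob here.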
|
From: Jarkko L. <al...@gm...> - 2009-04-07 21:30:18
|
Okay, yes, you would evict the data from the cache if a higher-priority task kicks in, assuming the data doesn't fit into the cache simultaneously before you return to work on the original task. Now, if you think of an 8-core system with job stealing, you must fit 8 times the amount of data into the cache, because you have 8 threads working on a total of 8 unrelated tasks simultaneously. This is much worse from a data-eviction point of view than switching to a higher-priority task and returning to work on the original task afterwards. Now think about how this scales to 16, 32, 64 cores.

> If I can take something that takes 3ms to finish and get it done in 0.5ms
> because I can use six threads instead of one then that's clearly a win.

No, it's not a win. See, all you care about is that ALL the work for a frame is completed as soon as possible, not minimizing the time to execute a single task. If you had 50% overhead like you say, then you would have 50% less time to do any useful work. Now, you say it doesn't matter because you have threads sitting idle anyway, but don't you think that instead you have some serious work-starvation issues in your engine, which you are trying to patch with this approach on the grounds that the idle threads aren't a lost resource?

Cheers,
Jarkko
|
From: Sebastian S. <seb...@gm...> - 2009-04-07 22:12:56
|
On Tue, Apr 7, 2009 at 10:30 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Okay, yes you would evict the data from the cache if a higher-priority task
> kicks in, assuming the data doesn't fit into the cache simultaneously before
> returning to work on the original task. Now, if you think of an 8-core system
> with job stealing, you must fit 8 times the amount of data into the cache
> because you have 8 threads working on a total of 8 unrelated tasks
> simultaneously. This is much worse from a data-eviction point of view than
> switching to a higher-priority task and returning to work on the original
> task after. Now think how this scales to 16, 32, 64 cores.

You're assuming that work-stealing scheduling will always work on unrelated tasks, when this isn't the case. If you kick off a large batch of work it will spread out over all workers, it'll just let whatever's already there finish first - exploiting locality.

> > If I can take something that takes 3ms to finish and get it done in
> > 0.5ms because I can use six threads instead of one then that's clearly a win.
>
> No, it's not a win. See, all you care about is that ALL the work for a frame is
> completed as soon as possible, not minimizing the time to execute a
> single task. If you had 50% overhead like you say, then you would have 50%
> less time to do any useful work. Now, you say it doesn't matter because you
> have threads sitting idle anyway, but don't you think that you have some
> serious work-starvation issues in your engine instead, which you try to patch
> with this approach with the conclusion that it's not a lost resource?

If I have 50% overhead, that means that the other 50% is useful work that would've had to run sequentially otherwise! I'm doing 50% MORE useful work on my spare threads.

The purpose of splitting stuff up into parallel tasks is to use idle cores to help speed up work. If you assume that you can keep those cores busy with useful work using Some Other Technique, then you're basically assuming that there is no problem and that we therefore need no solution. That would be great if it were the case, but it seems unrealistic.

The whole discussion is about "how do we use N threads to speed up a game engine"; saying that we shouldn't spawn lots of fine-grained tasks because the overhead in doing so would cost cycles that we could use for "useful work" isn't really helping me, unless you can tell me what other technique you plan to use to actually do this "useful work" that doesn't have the overhead of just ramping up task parallelism so it scales to the number of threads.

Of course you wouldn't trade 100% useful work for 80% useful work and 20% overhead, that would be stupid. Nobody would write a task system with worker threads for a single-threaded machine. What I'm saying is that in the not-so-far-off future, when you start off with, say, 90% idle cycles and only 10% useful work, that 80/20 split is looking a lot more appealing.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
From: Sebastian S. <seb...@gm...> - 2009-04-07 22:17:50
|
Okay, so that maths is wrong; the point is that the overhead doesn't come alone - the overhead is there because I'm getting more useful work done on the other threads, meaning it's an overall win.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
From: Jarkko L. <al...@gm...> - 2009-04-08 19:51:19
|
Pushing work to multiple queues doesn't guarantee that it is also executed simultaneously, unless all your threads are idling, which isn't a very happy scenario. To me it sounds like you are in the process of converting a single-threaded engine to a multi-threaded one, where you start from 100%/number_of_threads efficiency, in which case what you say makes sense. I.e. you want the multi-threaded bits to be very local, where you dispatch work to multiple threads and sync right after, and any work you can parallelize is a win over the old design. However, if you have the advantage of designing an engine from scratch for multi-threading, I don't see this as a good approach.

Cheers,
Jarkko
|
From: Sebastian S. <seb...@gm...> - 2009-04-09 17:52:03
|
On Thu, Apr 9, 2009 at 3:28 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Work stealing executes tasks in the order of spawning on a /single/
> thread, but that's not the point. I hate to repeat myself, but in order to
> exploit shared cache coherency, you have to run the same work on /all/ the
> threads simultaneously, which you can't guarantee with work stealing without
> a waste of CPU time. Do you disagree that multiple threads accessing the same
> data exploit shared cache coherency and need supporting evidence of this?
> Intel can provide you with plenty of documentation of this basic cache
> behavior.

I'm well aware of how a cache works, thank you. The point I'm trying to make is that a global work queue does not solve this issue either, and in fact appears to make it worse (which is why nobody is using them for task-parallel systems). Multiple independent threads, or running tasks spawning sub-tasks, can all be interleaved on the global work queue, and not only run in completely random order, but not even stay on the same hardware thread they were spawned on. This isn't good for locality. Mixing priorities into that will only cause tasks to be shuffled around even more.

With work stealing you at least get good locality within each worker thread (L1 cache matters too), with the option of good locality across all threads, because the way tasks are scheduled is reasonably deterministic. And you can augment the algorithm to take caching into account (and get provable bounds for cache behaviour).

Also, think about where the trend for coherency is going - do you think memory will be more or less coherent in the future? It seems pretty clear to me that coherency will be reduced, meaning that the importance of keeping related tasks on the same thread will only increase.

> Could you be a bit more specific about which Gamefest talk you are referring to,
> because I would be very interested to check it out. If you got a link to the
> slides, that would be great. I checked some Cilk slides and videos, and it
> seems that their main motivation is indeed to provide a framework to improve
> parallelism of legacy code, as I suspected. Makes sense to build a business
> model around that, because the majority of ISVs want to gain from multi-core
> architectures without the risk & investment of redesigning/refactoring their
> entire software for multi-core. There was some obvious marketing BS in the
> slides, for example about linear scalability though ;) Anyway, I see Cilk as
> a way to patch your single-threaded app which starts at 100%/num_of_cores
> efficiency level, rather than a good approach if you design your game engine
> from scratch for multi-core.

This one:
http://www.microsoft.com/downloads/details.aspx?FamilyId=F9D74066-0DFD-4A38-9BAD-BFF5E15C9629&displaylang=en

You imply that just because something is easier to retrofit into an existing system it's not useful for something written from scratch, which I don't think is true. Writing it from scratch may make certain things easier, but parallelism is still a very hard problem, and I don't think re-writing things will significantly reduce the issues I've already highlighted - the problem is still there, and still needs a solution.

> Regarding task granularity, what you say about 63 threads waiting for 3ms
> for a task to finish is true if you have a horribly sequential way of
> processing data,

It was a simplified example to make a point. As I said before, the dependencies aren't all at the top level, it's more like a tree. So yes, there's a certain amount of overlap where maybe a few threads can keep doing other tasks that are parallel and independent of animation, but the key point I was trying to make is still there: *real* code has tons of data dependencies all over the place, and this means you need to "over-subscribe" your task system for load-balancing purposes.

> e.g. some legacy single-threaded engine flow which you are patching for
> multi-core in small bits without major refactoring of its sub-systems. In my
> experience, even in some not-so-well-designed multi-threaded engines you have
> multiple sub-systems running simultaneously (rendering, physics, game logic,
> audio, etc.) which could spawn tasks without global stalls of the threads.

I don't think it's a matter of refactoring/design, it's a matter of the data flow. There is no way you can do collision detection against an animated mesh until you've animated it. That's just a fact of the problem that will never change. As I mentioned before, you can pipeline some sequential tasks, but then you add latency, so this doesn't really scale beyond a few tasks, because you end up with hundreds of milliseconds of latency.

So as a one-time optimization you can scale to, say, 3 more threads if you have 3 tasks that are big enough to basically keep a whole thread occupied for most of the frame - what are you going to do for the other 61 threads? The data dependencies are still there, so the problem isn't gone. You've made the problem slightly smaller, from 64 to 61 threads, but you still have to split the rest of your engine up to run on these 61 threads, and you're still going to need way more than 10*61 tasks to do it, for reasons I've already explained.

I hate to appeal to authority, but I really don't see how you can just ignore that all the major companies building task-parallel libraries disagree with you regarding the merits of a shared global queue.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
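For reference, the per-worker deque that gives work stealing this locality behaviour looks roughly like the sketch below. Real implementations (e.g. the Chase-Lev deque) are lock-free; this one uses a mutex purely for clarity.

    // Per-worker deque of a work-stealing scheduler. The owning thread pushes
    // and pops at the back (LIFO), so it keeps working on what it touched most
    // recently and stays hot in its own cache; idle threads steal from the
    // front (FIFO), taking the oldest - typically largest - pending work.
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>

    class work_deque {
        std::deque<std::function<void()>> tasks_;
        std::mutex mutex_;
    public:
        void push(std::function<void()> t) {            // owner: newest at the back
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(t));
        }
        std::optional<std::function<void()>> pop() {    // owner: pop newest (LIFO)
            std::lock_guard<std::mutex> lock(mutex_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.back());
            tasks_.pop_back();
            return t;
        }
        std::optional<std::function<void()>> steal() {  // thief: take oldest (FIFO)
            std::lock_guard<std::mutex> lock(mutex_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.front());
            tasks_.pop_front();
            return t;
        }
    };

The asymmetry (owner LIFO, thief FIFO) is what keeps a worker on its own recently spawned, cache-hot subtree while still letting idle threads pick up the large, older pieces.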
|
From: Jarkko L. <al...@gm...> - 2009-04-09 21:27:16
|
Yes, appealing to the "big boys" is a pretty weak argument, since they have provided sub-optimal solutions before as well. Anyway, I was hoping to get some strong arguments either way, but I guess it's not very clear yet which solution is the best for multi-threading your engine. I read through the Gamefest slides, and it was a bit surprising they didn't deal with prioritization at all, which is vital for throughput. If you don't take my word for it, play around with different sizes of tasks with various dependencies and see how prioritization impacts the throughput, or check out the GDC 2008 presentation "Getting More From Multicore" by Ian Lewis (http://www.microsoft.com/downloads/details.aspx?FamilyID=A36FE736-5FE7-4E08-84CF-ACCF801538EB&displaylang=en) for some pointers. But I guess in the end we just have to agree to disagree and see how it pans out.

Cheers,
Jarkko
|
From: Sebastian S. <seb...@gm...> - 2009-04-09 22:13:13
|
On Thu, Apr 9, 2009 at 10:27 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Yes, appealing to the "big boys" is a pretty weak argument since they have
> provided sub-optimal solutions before as well.

In this case almost all of them have, at one point or another, tried the alternative and decided it was worth the effort to implement the much trickier solution. In fact, the presentation you linked was given by the very same organization (XNA) who now recommend work stealing - so clearly they've given the global work queue an honest shot, don't you think?

> Anyway, I was hoping to get some strong arguments either way, but I guess
> it's not very clear yet which solution is the best for multi-threading your
> engine. I read through the Gamefest slides and it was a bit surprising they
> didn't deal with prioritization at all, which is vital for throughput.

It's actually not vital for throughput, and I couldn't see anything in the presentation you linked saying it is. Tasks are naturally ordered according to their dependencies due to the way they are spawned (you spawn tasks, then you sync - only then do you spawn any dependent tasks). With the Cilk-based model, any task that is enqueued is by definition ready to run. There's no need for priorities to optimize throughput.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
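As a concrete illustration of that ordering, here is a rough sketch of the animate-then-collide example from earlier in the thread. The task_group stand-in runs everything inline at sync() just to keep the sketch self-contained, and animate_character/collide_character are hypothetical names; the point is structural - dependent tasks are only spawned after the sync() for the work they depend on, so anything sitting in a queue is ready to run.

    // Spawn/sync ordering sketch: collision tasks are only created once the
    // animation they depend on has been joined.
    #include <functional>
    #include <vector>

    struct task_group {   // trivial stand-in: runs tasks inline at sync()
        std::vector<std::function<void()>> tasks;
        void spawn(std::function<void()> f) { tasks.push_back(std::move(f)); }
        void sync() { for (auto& f : tasks) f(); tasks.clear(); }
    };

    struct Character { /* pose, skinned mesh, ... */ };
    void animate_character(Character&) { /* skinning etc. */ }
    void collide_character(Character&) { /* needs the posed mesh */ }

    void tick_characters(task_group& tg, Character* chars, int count) {
        for (int i = 0; i < count; ++i)
            tg.spawn([=] { animate_character(chars[i]); });  // all independent, all ready now
        tg.sync();                                           // join: every mesh is posed
        for (int i = 0; i < count; ++i)
            tg.spawn([=] { collide_character(chars[i]); });  // dependents spawned only after sync
        tg.sync();
    }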
|
From: Jon W. <jw...@gm...> - 2009-04-09 22:53:00
|
Sebastian Sylvan wrote:
>> Anyway, I was hoping to get some strong arguments either way, but
>> I guess it's not very clear yet which solution is the best for
>> multi-threading your engine. I read through the Gamefest slides
>> and it was a bit surprising they didn't deal with prioritization
>> at all, which is vital for throughput.
>
> It's actually not vital for throughput, and I couldn't see anything in
> the presentation you linked saying it is.
> Tasks are naturally ordered according to their dependencies due to the
> way they are spawned (you spawn tasks, then you sync - only then do
> you spawn any dependent tasks). With the Cilk-based model any task
> that is enqueued is by definition ready to run. There's no need for
> priorities to optimize throughput.

Actually, it is. Consider a case where you have some independent tasks, and some serially dependent tasks. Suppose you have 4 independent tasks xyzw, 4 linearly dependent tasks abcd, and two cores to schedule across. Here are two possible work orderings (each row shows what the two cores run in one time slot):

    xy          ax
    zw          by
    a           cz
    b           dw
    c
    d

Clearly, the second case is much preferred. However, without prioritization, you won't get that outcome. While you could sort the entire graph and do some dependency/bin-packing to pack it as compactly as possible, usually some heuristics suffice, such as always prioritizing tasks that have dependents - and the longer the dependency chain, the higher the priority.

Sincerely,
jw
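A rough sketch of that heuristic: derive each task's priority from the length of the dependency chain hanging off it, so the head of the long serial chain (a -> b -> c -> d) gets started as early as possible alongside the independent tasks. The task/graph representation below is invented for the example.

    // Priority = length of the longest chain of dependents below each task.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Task {
        const char* name;
        std::vector<int> dependents;   // indices of tasks that must run after this one
    };

    // longest chain of dependents below task i (memoized)
    int chain_length(const std::vector<Task>& tasks, std::vector<int>& memo, int i) {
        if (memo[i] >= 0) return memo[i];
        int longest = 0;
        for (int d : tasks[i].dependents)
            longest = std::max(longest, 1 + chain_length(tasks, memo, d));
        return memo[i] = longest;
    }

    int main() {
        // x, y, z, w independent; a -> b -> c -> d serially dependent
        std::vector<Task> tasks = {
            {"x", {}}, {"y", {}}, {"z", {}}, {"w", {}},
            {"a", {5}}, {"b", {6}}, {"c", {7}}, {"d", {}},
        };
        std::vector<int> memo(tasks.size(), -1);
        std::vector<int> order(tasks.size());
        for (int i = 0; i < (int)tasks.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](int lhs, int rhs) {
            return chain_length(tasks, memo, lhs) > chain_length(tasks, memo, rhs);
        });
        // prints a, b, c first; d and the independent tasks tie at priority 0
        for (int i : order) std::printf("%s ", tasks[i].name);
        std::printf("\n");
    }

In a real scheduler these chain lengths would feed the priority field of whatever queue is in use, rather than a one-off sort.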