Re: [Algorithms] General purpose task parallel threading approach
From: Jarkko L. <al...@gm...> - 2009-04-06 08:11:17
Actually, after a bit more thought on locality, I think I'll take back that "locality is a good point in favor of job stealing" part. When you have a shared cache between threads, locality actually favors the shared job queue, because all the worker threads are executing the same type of task, which puts less pressure on the L2 cache than each thread executing completely unrelated tasks.

Cheers, Jarkko

_____

From: Jarkko Lempiainen [mailto:al...@gm...]
Sent: Monday, April 06, 2009 10:38 AM
To: 'Game Development Algorithms'
Subject: RE: [Algorithms] General purpose task parallel threading approach

I was under the impression that you should minimize those sync points and dependencies for optimal throughput. In the ideal situation where your entire app is split into small tasks, I guess you wouldn't need to prioritize, but I believe the reality is that you have some large tasks whose execution needs to start as early in the frame as possible to avoid stalling other threads at the end of the frame. So you need to give those large tasks, and the tasks they depend on, high priority for execution.

I agree that the ability to wait within a task is more convenient, particularly if you have to do it deep inside a nested loop. But I don't see spawning new tasks where you can as tedious as you describe, and those tasks don't add as much overhead to the queue as you imply. I.e., when you hit a sync point, you are waiting for on average maybe tens or hundreds of tasks to complete, so the overhead of a single task is negligible in proportion. I doubt that contention will be an issue with a shared queue any time soon (definitely not with the 16-32 threads you are talking about), but agreed that locality is a good point in favor of job stealing.

Cheers, Jarkko

_____

From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Monday, April 06, 2009 1:44 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Sun, Apr 5, 2009 at 9:40 PM, Jarkko Lempiainen <al...@gm...> wrote:

There isn't a "whole bunch of magic" outside the task that you would need with the first option. The task just adds another function to the job queue, which does the work after the wait. With the second option, it is the job queue's job to schedule tasks for optimal throughput, so for a wait it could, for example, favor the tasks blocking the waiting task from proceeding.

That's a whole bunch of magic to me. If you want pervasive parallelism (and Mr Amdahl says we do), you're going to need sync points all over the place. Currently we're not doing that because we don't *need* to yet, but if you have 16-32 threads to keep busy, you bet you're going to need to spawn off tasks all over the place (and usually sync immediately afterwards). Having to manually split all of these things up into lots of tiny tasks is a big issue IMO - it's tedious, error prone, and adds more overhead (the cost of starting a task still isn't free - so doing it, say, 5-10x as often is a big deal).

With job stealing from other threads you don't get global prioritization, so one thread may be executing lower-priority tasks and not working towards optimal throughput.

Why do you actually need global prioritization? After all, you need to finish everything by the end of the frame anyway, so why would you care if task A gets executed before task B, so long as you never get any bubbles where a core is idle? All you care about is maximal throughput, maximal locality (especially true for in-order CPUs) and minimal contention/overhead, and for that a giant shared queue is sub-optimal for two reasons:

* It's inherently non-scalable. If you have lots of threads reading from the same queue you'll get contention. Even with a relatively modest number of threads this starts becoming an issue. Work stealing has negligible contention, because a thief has to be unlucky enough to try to steal from the bottom of your stack at the same time as you're running out of tasks to do (so you're fighting for the very last item in the stack). That's extremely unlikely to happen, and it's even more unlikely for two threads to try to steal from the same stack - whereas a shared work queue is constantly being hit by multiple threads.

* You lose locality. With a work-stealing approach you get good locality, because a task usually runs on the same thread that spawned it, and tasks spawned close together in time get run close together in time, meaning good cache coherency. If tasks are spread out over multiple cores, and over time, you'll get less coherency.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
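[Editor's note: the per-thread stack with stealing from the far end that Sebastian describes can be sketched as below. This is a simplified, single-threaded illustration of the mechanics only - a real scheduler (e.g. a Chase-Lev deque) synchronizes the steal with atomics or a lock - and all names here are hypothetical, not from any library.]

```python
# Sketch of a work-stealing deque: the owning thread pushes and pops
# tasks at one end (LIFO, so recently spawned tasks stay cache-hot),
# while idle threads steal the oldest task from the other end. The two
# only meet when a single task remains, which is why contention is rare.
from collections import deque

class WorkStealingDeque:
    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        # Owner only: spawn a new task on top of the stack.
        self._tasks.append(task)

    def pop(self):
        # Owner only: take the most recently spawned task (LIFO).
        return self._tasks.pop() if self._tasks else None

    def steal(self):
        # Thieves: take the oldest task from the opposite end (FIFO).
        return self._tasks.popleft() if self._tasks else None

dq = WorkStealingDeque()
for name in ("update_ai", "animate", "cull"):
    dq.push(name)

print(dq.pop())    # owner gets the newest task: cull
print(dq.steal())  # a thief gets the oldest task: update_ai
```

Because the owner and the thieves operate on opposite ends, they contend only when one task is left - the "very last item in the stack" case above.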