Re: [Algorithms] General purpose task parallel threading approach
From: Sebastian S. <seb...@gm...> - 2009-04-05 22:43:59
On Sun, Apr 5, 2009 at 9:40 PM, Jarkko Lempiainen <al...@gm...> wrote:
> There isn't a "whole bunch of magic" outside the task you would need to do
> with the first option. The task just adds another function to the job queue
> which does the work after the wait. For the second option it is the job
> queue's task to schedule the tasks for optimal throughput, so for a wait it
> could for example favor tasks blocking the task from proceeding.

That's a whole bunch of magic to me. If you want pervasive parallelism (and
Mr Amdahl says we do), you're going to need sync points all over the place.
Currently we're not doing that because we don't *need* to do it yet, but if
you have 16-32 threads to keep busy, you bet you're going to need to spawn
off tasks all over the place (and usually sync immediately afterwards).
Having to manually split all of these things up into lots of tiny tasks is a
big issue IMO - it's tedious, error-prone, and adds more overhead (the cost
of starting a task still isn't free - so doing it, say, 5-10x as often is a
big deal).

> With job stealing from other threads you don't get global prioritization,
> so one thread may be executing lower-priority tasks and not work towards
> optimal throughput.

Why do you actually need global prioritization? After all, you need to
finish everything by the end of the frame anyway, so why would you care
whether task A gets executed before task B, so long as you never get any
bubbles where a core is idle? All you care about is maximal throughput,
maximal locality (especially true for in-order CPUs) and minimal
contention/overhead, and for that a giant shared queue is sub-optimal for
two reasons:

* It's inherently non-scalable. If you have lots of threads reading from the
same queue you'll get contention. Even with a relatively modest number of
threads this starts becoming an issue. Work stealing has negligible
contention, because a thief has to be unlucky enough to try to steal from
the bottom of your stack at the same time as you're running out of tasks to
do (so you're fighting over the very last item in the stack). That's
extremely unlikely to happen, and it's even more unlikely for two threads to
try to steal from the same stack - whereas a shared work queue is constantly
being hit by multiple threads. (A rough sketch of such a per-thread deque
follows below.)

* You lose locality. With a work-stealing approach you get good locality
because a task usually runs on the same thread that spawned it, and tasks
spawned close together in time get run close together in time, meaning good
cache locality. If tasks are instead spread out over multiple cores, and
over time, you'll get less locality.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
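To make the stealing protocol above concrete, here is a minimal C++ sketch,
illustrative only - it is not code from this thread or from any particular
scheduler, and the names (WorkStealingDeque, worker) are made up for the
example. Each worker owns a deque of tasks, pushes and pops at the bottom of
its own deque, and idle workers steal from the top, so the owner and a thief
only ever contend over the very last item. A coarse mutex keeps the sketch
short; a production deque (e.g. Chase-Lev) would be lock-free.

#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

using Task = std::function<void()>;

// One deque per worker thread. The owner treats it as a stack (back);
// thieves take the oldest work from the other end (front).
class WorkStealingDeque {
public:
    // Owner: push a newly spawned task at the bottom.
    void push(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }

    // Owner: pop the most recently pushed task (LIFO keeps its data hot
    // in this core's cache).
    std::optional<Task> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }

    // Thief: steal the oldest task from the top, so owner and thief only
    // collide when a single task is left in the deque.
    std::optional<Task> steal() {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }

private:
    std::mutex m_;            // coarse lock for brevity; real deques are lock-free
    std::deque<Task> tasks_;
};

// Worker loop sketch: drain your own deque first, then try to steal from
// the others. Runs forever here; a real scheduler would add a shutdown flag.
void worker(std::size_t self, std::vector<WorkStealingDeque>& deques) {
    for (;;) {
        if (auto t = deques[self].pop()) { (*t)(); continue; }
        bool stole = false;
        for (std::size_t victim = 0; victim < deques.size(); ++victim) {
            if (victim == self) continue;
            if (auto t = deques[victim].steal()) { (*t)(); stole = true; break; }
        }
        if (!stole) std::this_thread::yield();  // no work anywhere right now
    }
}

The LIFO pop on the owner side is what gives the locality argued for above:
the most recently spawned task, whose data is still warm in the spawning
core's cache, is the first one to run there, while thieves only ever take
the coldest work.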