Re: [Algorithms] General purpose task parallel threading approach
From: Sebastian S. <seb...@gm...> - 2009-04-06 11:55:01
On Mon, Apr 6, 2009 at 8:37 AM, Jarkko Lempiainen <al...@gm...> wrote:

> I was under the impression that you should minimize those sync points
> and dependencies for optimal throughput. If you have the ideal
> situation where your entire app is split into small tasks, then I
> guess you wouldn't need to prioritize, but I believe the reality is
> that you've got some large tasks whose execution needs to start as
> early in the frame as possible to avoid stalling other threads at the
> end of the frame. So you need to give those large tasks, and the
> tasks they depend on, high priority for execution. I agree that the
> ability to wait within a task is more convenient, particularly if you
> have to do it deep inside a nested loop. But I don't see spawning new
> tasks where you can as tedious as you describe, and those tasks don't
> add as much overhead to the queue as you imply. I.e. when you've got
> a sync point, you are waiting for on average maybe tens or hundreds
> of tasks to complete, so the overhead of a single task is negligible
> in proportion. I doubt that contention will be an issue with a shared
> queue any time soon (definitely not with the 16-32 threads you are
> talking about), but agreed that locality is a good point in favor of
> job stealing.

Realistically, most code isn't embarrassingly parallel; it has lots of complicated, dynamic control flow, which would be either inconvenient or impossible to deal with if you have an "up front" dependency graph. Yes, by all means try to minimize dependencies, but don't count on succeeding to any major extent. IMO real code has loads of non-parallel dependencies, with tiny pockets of parallelism embedded within them. To get pervasive parallelism I predict you'll have to write lots and lots of code where you kick off 2-3 parallel tasks and then immediately sync, over and over again (see the first sketch below). You'll have a couple of huge parallel-for loops too, but most code isn't like that.

You still have some room for choosing when to start a task. For example, you want to push tasks that are likely to spawn lots of other tasks close to the next sync point, so that they get executed first (assuming the worker pops its own queue in LIFO order; see the deque sketch below). That doesn't need a priority system, though.

The way I see it, tasks are for "burst parallelism": you have a certain amount of work that needs to be finished, and that's all you care about. Prioritizing between those tasks isn't really a parallelism issue; it's more of a concurrency/coordination issue, and it probably belongs at a different level of abstraction (i.e. threads), IMO. You could, however, easily have one worker thread per priority level, with work stealing only happening within the same priority. That way it wouldn't be too hard to prioritize things in the rough (i.e. the very root function that kicks off tasks could stick a job on a higher-priority worker); the last sketch below shows the idea.

The relative inefficiency of a global queue doesn't just depend on the number of threads, but also on the granularity of the tasks. Even with a few hundred to a thousand tasks, a global queue starts losing significantly on current hardware (a mere 6 threads), according to the benchmarks from the concurrency talk at the last Gamefest. So it doesn't look too good for 16-32 threads, IMO.
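Roughly what I mean by "kick off a few tasks and sync immediately", as a toy sketch. Scene and the update_* functions are made up for the example, and std::async is just standing in for a real scheduler's spawn:

#include <functional>
#include <future>

// Hypothetical frame data and subsystem updates, invented for the example.
struct Scene { /* ... */ };
void update_animation(Scene&) { /* ... */ }
void update_particles(Scene&) { /* ... */ }
void update_physics(Scene&)   { /* ... */ }

void update_frame(Scene& scene)
{
    // Fork: kick off a couple of parallel tasks...
    auto anim  = std::async(std::launch::async, update_animation, std::ref(scene));
    auto parts = std::async(std::launch::async, update_particles, std::ref(scene));

    // ...do one chunk on this thread...
    update_physics(scene);

    // ...and sync immediately. Most "real" code ends up as lots of
    // small bursts of this shape, not one giant parallel loop.
    anim.wait();
    parts.wait();
}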
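And here's a minimal sketch of the kind of per-worker deque I have in mind, assuming std::function tasks and a plain mutex for clarity. A real scheduler would use a lock-free work-stealing deque (e.g. Chase-Lev), but the access pattern is the same: the owner works at one end, thieves steal from the other:

#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>

using Task = std::function<void()>;

class WorkerDeque {
public:
    // The owning worker pushes newly spawned tasks on the back...
    void push(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }

    // ...and pops from the back too (LIFO), so the most recently
    // spawned task runs first, while its data is still hot in the
    // owner's cache. Pushing a "fan-out" task last therefore gets it
    // executed first -- no priority machinery needed.
    std::optional<Task> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }

    // Thieves steal from the front (FIFO): the oldest task, which is
    // typically the biggest remaining chunk of work and the least
    // likely to be in the owner's cache anyway.
    std::optional<Task> steal() {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }

private:
    std::mutex m_;
    std::deque<Task> tasks_;
};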
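And the priority-tiers idea, again just a rough sketch reusing the Task and WorkerDeque types from the deque above. PriorityPool and its layout are names I'm making up for illustration, not any existing library's API:

#include <cstddef>
#include <vector>

enum class Priority { High = 0, Normal = 1, Low = 2 };

class PriorityPool {
public:
    explicit PriorityPool(std::size_t workersPerTier) {
        for (int i = 0; i < 3; ++i)
            tiers_.emplace_back(workersPerTier);  // one deque per worker, per tier
    }

    // The root code that kicks work off picks the tier; everything a
    // task spawns stays in its own tier.
    void spawn(Priority p, std::size_t worker, Task t) {
        tiers_[static_cast<std::size_t>(p)][worker].push(std::move(t));
    }

    // Worker loop body: pop the local deque first (LIFO), then try to
    // steal -- but only from deques in the same priority tier.
    std::optional<Task> next(Priority p, std::size_t self) {
        auto& tier = tiers_[static_cast<std::size_t>(p)];
        if (auto t = tier[self].pop()) return t;
        for (std::size_t i = 0; i < tier.size(); ++i) {
            if (i == self) continue;
            if (auto t = tier[i].steal()) return t;
        }
        return std::nullopt;  // nothing to do at this priority
    }

private:
    std::vector<std::vector<WorkerDeque>> tiers_;
};

A worker pinned to, say, Priority::High just loops on next(Priority::High, self), runs whatever it gets, and yields when that comes back empty.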
> Actually, after a bit of extra thought on locality, I think I'll take
> back that "locality is a good point in favor of job stealing" part.
> When you've got a shared cache between threads, locality is actually
> in favor of a shared job queue, because all the worker threads are
> executing the same type of task, so it doesn't put as much pressure
> on the L2 cache as it would if each thread executed completely
> unrelated tasks.

Well, I guess that COULD be true in principle. However, ordering by priority is not the same as ordering by locality, so you're really just guessing that tasks of similar priority will tend to access the same memory, whereas a work-stealing approach guesses that tasks scheduled in close proximity will work on the same data. With hundreds or thousands of tasks I must say I find the latter much more likely (and benchmarks seem to confirm it). Also, moving tasks between workers at random means (with a significant number of workers) that you're almost guaranteed a bunch of L1 cache misses at the start of every single task. L1 misses may not be a big deal individually, but with lots of tiny tasks these things add up...

Also, it seems that just about all the major task libraries coming out have come to the conclusion that work stealing is much faster, and I've yet to see a single benchmark showing that this current recommended practice is incorrect.

-- 
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862