Re: [Algorithms] General purpose task parallel threading approach
From: Jarkko L. <al...@gm...> - 2009-04-07 10:02:27
If you only have 2-3 parallel tasks on average, then it doesn't make much sense to multi-thread the code anyway, particularly if you use work stealing. It sounds like you assume you have work starvation issues and that there is always a thread idling and ready to steal work, but to me that doesn't sound like a healthy assumption to build your work scheduling around, unless you are trying to patch an engine that wasn't built for multi-threading. If that's the situation you are facing, it sounds like you have more serious multi-threading & scalability issues in your game anyway.

Regarding task prioritization, you could of course have a thread per priority, but that doesn't really help get high-priority tasks completed early in the frame; you would need all the threads to contribute to running those tasks. I also didn't mean that tasks with the same or similar priority are local in relation to each other per se, but that threads gain locality by running the same function, provided by the job queue, over different data sets when exploiting data-level parallelism. E.g. the job queue holds an animate_character(character*) function and multiple threads are simultaneously scheduled to run it with different character* arguments, so you get less pressure on the L2/L3 cache (even the L1 cache on Xbox) due to the data shared between characters.

You are right that fine-grained tasks increase shared job queue contention, but then again I don't think you should put very fine-grained tasks into the job queue anyway; run those tasks sequentially instead, which beats any task system in both performance and simplicity (if you don't count the cache effect, that is). I don't really see any benefit in aiming for very fine task granularity unless you are trying to fill those aforementioned task starvation bubbles.
Unfortunately I haven't had a chance to test my shared job queue past 2 cores, so I can't really tell how well it scales in practice, but since it uses basic lock-free algorithms, I presume it scales reasonably well.

Cheers, Jarkko

_____
From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Monday, April 06, 2009 2:55 PM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Mon, Apr 6, 2009 at 8:37 AM, Jarkko Lempiainen <al...@gm...> wrote:

I was under the impression that you should minimize those sync points and dependencies for optimal throughput. If you were in the ideal situation where your entire app is split into small tasks, then I guess you wouldn't need to prioritize, but I believe the reality is that you have some large tasks whose execution needs to start as early in the frame as possible to avoid stalling other threads at the end of the frame. So you need to give those large tasks, and the tasks they depend on, high priority. I agree that the ability to wait within a task is more convenient, particularly if you have to do it deep inside a nested loop. But I don't see spawning new tasks where you can as tedious as you describe, and those tasks don't add as much overhead to the queue as you imply. I.e. at a sync point you are waiting for, on average, maybe tens or hundreds of tasks to complete, so the overhead of a single task is negligible in proportion. I doubt that contention will be an issue with a shared queue any time soon (definitely not with the 16-32 threads you are talking about), but agreed that locality is a good point in favor of job stealing.

Realistically most code isn't embarrassingly parallel and has lots of complicated, and dynamic, flow, which would be either inconvenient or impossible to deal with if you have an "up front" dependency graph. Yes, by all means try to minimize dependencies, but don't count on succeeding to any major extent.
IMO real code has loads of non-parallel dependencies, with tiny pockets of parallelism embedded within them. To get pervasive parallelism I predict you'll have to write lots and lots of code where you kick off 2-3 parallel tasks and then immediately sync, over and over again. You'll have a couple of huge parallel for loops too, but most code isn't like that.

You still have some room for choosing when to start a task. For example, you want to put tasks that are likely to spawn lots of other tasks close to the next sync point so that they get executed first (assuming the worker is using LIFO ordering). That doesn't need a priority system though. The way I see it, tasks are for "burst parallelism": you have a certain amount of work that needs to be finished and that's all you care about. Prioritizing between those tasks isn't really a parallelism issue, it's more concurrency/coordination and probably belongs at a different level of abstraction (i.e. threads), IMO. You could, however, easily have one worker thread per priority level, with work stealing only stealing within its own priority. Meaning it wouldn't be too hard to prioritize things in the rough (i.e. the very root function that kicks off tasks could stick them on a higher-priority worker).

The relative inefficiency of a global queue doesn't just depend on the number of threads, but also on the granularity of tasks. Even with a few hundred to a thousand tasks, a global queue starts losing significantly on current hardware (a mere 6 threads) according to the benchmarks from the concurrency talk at the last Gamefest. So it doesn't look too good for 16-32 threads, IMO. Actually, after a bit of extra thought on locality, I think I'll take back that "locality is a good point in favor of job stealing" part.
When you have a shared cache between threads, locality is actually in favor of the shared job queue, because all the worker threads are executing the same type of task, so it doesn't put as much pressure on the L2 cache as it would if each thread were executing completely unrelated tasks.

Well, I guess that COULD be true in principle; however, ordering by priority is not the same as ordering by locality, so you're really just guessing that tasks of similar priority will tend to access the same memory, whereas a work-stealing approach guesses that tasks scheduled in close proximity will work on the same data. With hundreds or thousands of tasks I must say I find the latter much more likely (and benchmarks seem to confirm it). Also, moving tasks around randomly means (with a significant number of workers) that you're almost guaranteeing a bunch of L1 cache misses at the start of every single task. L1 caches may not be a big deal, but with lots of tiny tasks these things add up... Also, it seems that just about all the major task libraries coming out have reached the conclusion that work stealing is much faster, and I've yet to see a single benchmark showing that this current recommended practice is incorrect.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862