Re: [Algorithms] General purpose task parallel threading approach
From: Jarkko L. <al...@gm...> - 2009-04-07 18:35:26
Yes, your prioritization would still function properly if you assigned a thread per priority, of course, but it wouldn't result in optimal throughput, and what's the point otherwise? You need all your worker threads working on the top-priority tasks in order to get optimal throughput, which requires that all the threads pull from a single prioritized queue of jobs. Job stealing doesn't achieve this because each worker thread has its own queue. I also never said that tasks should be interrupted prematurely; you should let them finish and then start the next high-priority task.

Having fine-granularity tasks doesn't help regardless of how many threads you have. The only thing you really care about is that your jobs take, say, <~10% of your frame time (1-3 ms), which isn't very high granularity. Some might take longer, but those are also prioritized higher to avoid stalling other threads at the end of the frame. So if you can animate all your on-screen characters in 1-3 ms, you shouldn't bother multi-threading the code EXCEPT if you want to gain shared cache coherence. However, with work stealing you wouldn't even gain this, because each thread works on its own set of jobs.

Cheers, Jarkko

_____

From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Tuesday, April 07, 2009 2:41 PM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Tue, Apr 7, 2009 at 11:02 AM, Jarkko Lempiainen <al...@gm...> wrote:

> If you've got only 2-3 parallel tasks on average, then it doesn't make
> much sense to multi-thread the code anyway, particularly if you use work
> stealing. It sounds like you assume that you have some work starvation
> issues and that there is always a thread idling and ready to steal work,
> but to me that doesn't sound like a healthy assumption to build your work
> scheduling around, unless you are trying to patch an engine not built for
> multi-threading. If that's the situation you are facing, it sounds to me
> like you have some more serious multi-threading & scalability issues in
> your game anyway.

That's not what I meant. The total number of tasks should ideally be in the hundreds or thousands at a peak, and at least several dozen on average; I'm saying that the branching factor for a lot of that is going to be around 2-3. But of course each of those tasks may kick off another 2-3 tasks (and then synchronize before returning its result). So you get a tree with lots of nodes (synchronisation points), but the number of nodes in each level (i.e. stuff that can run in parallel) can still be hundreds. Combine this with the rare (but very welcome!) big giant parallel loops (i.e. "for each of the thousands of game objects, tick in parallel"), and there should be plenty of tasks to keep your scheduler sweating.

> Regarding task prioritization, you could of course have a thread per
> priority, but it doesn't really help getting high-priority tasks
> completed early in the frame; rather, you would need all the threads to
> contribute to running those tasks.

Yes, and that would still work. If you have three priority levels, you'd have three worker threads per hardware thread. Mid-priority worker threads would not get to run until the high-priority worker thread is blocked, which will only happen when every single high-priority task is completed.
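[For concreteness, a minimal sketch of the single shared prioritized job queue being debated in this thread: every worker drains one queue, so whichever core is free next always pulls the highest-priority job. This is purely illustrative; it uses a plain mutex rather than the lock-free queue Jarkko describes, all names are made up, and shutdown handling is omitted.]

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

struct job {
    int priority;                        // higher value = run earlier
    std::function<void()> run;
    bool operator<(const job& o) const { return priority < o.priority; }
};

class shared_job_queue {
public:
    void push(job j) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(j)); }
        cv_.notify_one();
    }
    job pop() {                          // blocks until a job is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        job j = q_.top();                // always the highest-priority job
        q_.pop();
        return j;
    }
private:
    std::priority_queue<job> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

// Worker loop: every thread pulls from the one shared queue, so the
// highest-priority pending job runs next no matter which core frees up.
void worker_main(shared_job_queue& jobs) {
    for (;;)                             // shutdown handling omitted
        jobs.pop().run();
}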
And again, it doesn't seem like you need priorities to get what you want, since scheduling with work-stealing is reasonably predictable: if you really do need something to run first (because it's likely to produce enough child tasks to keep all threads busy), then you just run it closer to the next sync point.

> I also didn't mean that tasks with the same/similar priority are local in
> relation to each other per se, but that threads gain locality by running
> the same function provided by the job queue on different data sets when
> utilizing data-level parallelism. E.g. in the job queue you have an
> animate_character(character*) function where multiple threads are
> simultaneously scheduled to run the function with different character*'s,
> so you get less pressure on the L2/L3 cache (even the L1 cache on Xbox)
> due to the shared data between characters.

But if you *interrupt* (in the sense that you push some of the tasks related to a particular area to the bottom of the queue) what the threads were already doing just to get them to run animate_character instead, then they will have to resume the previous tasks later and get lots of cache misses. So it's probably better to let them finish the work they're doing (which is hot in the cache) first, and *then* start spreading animate_character out over all cores. Which is precisely what you get with work stealing. If running different tasks is hurting your cache, you're of course free to insert a sync call before starting animation to make sure all pending tasks are done before you start the animation stuff. My gut instinct on this says "priorities mean shuffling, and shuffling means loss of locality"...

> You are right that high-granularity tasks increase shared job queue
> contention, but then again I don't think you should put very fine-grained
> tasks in the job queue anyway; run those tasks sequentially instead,
> which is superior in performance and simplicity to any task system (if
> you don't count the cache effect, that is). I don't really see any
> benefit of aiming for very high task granularity, unless you are trying
> to fill in those aforementioned task starvation bubbles. Unfortunately I
> haven't had a chance to test my shared job queue past 2 cores, so I can't
> really tell how well it scales in practice, but since it's using basic
> lock-free algorithms, I presume it scales reasonably well.

Honestly, I think we won't have a choice but to use very fine-grained tasks if we hope to keep 32+ threads busy... Cilk claims an overhead of 3x the cost of a function call (IIRC) on PowerPC architectures, so it seems to me that pretty much anything you're not already marking for inlining could probably run as a task (especially with work stealing, where a task simply runs immediately on the very same thread, meaning essentially unchanged performance, if there aren't any free workers to parallelize it). In fact, I don't think tasks are fine-grained enough! We're probably going to need to do data parallelism on the CPU too, for even more fine-grained parallelism.

Sebastian
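[To make the "finish what's hot in cache, sync, then fan animation out over all cores" pattern concrete, a hedged sketch. The job_queue interface with its push()/sync() calls and the character type are assumptions for illustration, not either poster's actual API.]

#include <functional>
#include <vector>

struct character;                        // opaque game type (assumed)
void animate_character(character*);      // the per-character function from
                                         // the discussion above

// Assumed job system interface; names are hypothetical:
struct job_queue {
    void push(std::function<void()> f);  // enqueue a task
    void sync();                         // block until all pending tasks done
};

// Drain unrelated work first so it finishes while hot in cache, then spread
// the same function over all workers with different data, so every core runs
// animate_character and shares whatever read-only data the characters touch.
void animate_all(job_queue& jobs, const std::vector<character*>& chars) {
    jobs.sync();                         // the sync call mentioned above
    for (character* c : chars)
        jobs.push([c] { animate_character(c); });
    jobs.sync();                         // barrier before code that reads poses
}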
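[And a similarly hedged sketch of the work-stealing shape under discussion: each worker pushes and pops its own deque LIFO, so newly spawned tasks run immediately on the same thread while still warm in cache, and idle workers steal the oldest task from the other end of a victim's deque. Plain locks stand in for the lock-free deques real Cilk-style schedulers use; termination and backoff are omitted.]

#include <deque>
#include <functional>
#include <mutex>
#include <vector>

struct worker {
    std::deque<std::function<void()>> tasks;
    std::mutex m;

    void spawn(std::function<void()> t) {
        std::lock_guard<std::mutex> lock(m);
        tasks.push_back(std::move(t));   // newest task at the back
    }
    bool pop_local(std::function<void()>& t) {
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.back());     // LIFO: best cache locality
        tasks.pop_back();
        return true;
    }
    bool steal(std::function<void()>& t) {  // called by other workers
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.front());    // FIFO: oldest task, likely the
        tasks.pop_front();               // biggest remaining subtree
        return true;
    }
};

// A worker's main loop: run local work first, steal only when starved.
void worker_loop(worker& me, std::vector<worker*>& all) {
    std::function<void()> t;
    for (;;) {                           // termination/backoff omitted
        if (me.pop_local(t)) { t(); continue; }
        for (worker* w : all)
            if (w != &me && w->steal(t)) { t(); break; }
    }
}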