Re: [Algorithms] General purpose task parallel threading approach
From: Sebastian S. <seb...@gm...> - 2009-04-07 11:40:53
On Tue, Apr 7, 2009 at 11:02 AM, Jarkko Lempiainen <al...@gm...> wrote:

> If you've got only 2-3 parallel tasks on average, then it doesn't make much
> sense to multi-thread the code anyway, particularly if you use work
> stealing. Sounds like you assume that you have some work starvation issues
> and that there is always a thread idling and ready to steal work, but to me
> it doesn't sound like a healthy assumption to build your work scheduling
> around, unless you're trying to patch an engine not built for
> multi-threading. If that's the situation you are facing, to me it sounds
> like you have some more serious multi-threading & scalability issues in
> your game anyway.

That's not what I meant. The total number of tasks should ideally be in the hundreds or thousands at peak, and at least several dozen on average; I'm saying that the branching factor for a lot of that work is going to be around 2-3. But of course each of those tasks may kick off another 2-3 tasks (and then synchronise before returning its result). So you get a tree with lots of nodes (synchronisation points), but the number of nodes in each level (i.e. the stuff that can run in parallel) can still be in the hundreds. Combine this with the rare (but very welcome!) big parallel loops (i.e. "for each of the thousands of game objects, tick in parallel"), and there should be plenty of tasks to keep your scheduler sweating.

> Regarding task prioritization, you could of course have a thread per
> priority, but it doesn't really help getting high priority tasks completed
> early in the frame; you would need all the threads to contribute to
> running those tasks.

Yes, and that would still work. If you have three priority levels, you'd have three worker threads per hardware thread. Mid-priority worker threads would not get to run until the high-priority worker thread is blocked, which will only happen once every single high-priority task is completed. And again, it doesn't seem like you need priorities to get what you want, since scheduling with work stealing is reasonably predictable: if you really do need something to run first (because it's likely to produce enough child tasks to keep all threads busy), then you just run it closer to the next sync point.

> I didn't mean either that tasks with same/similar priority are local in
> relation to each other per se, but that threads gain locality by running
> the same function provided by the job queue for different data sets when
> utilizing data-level parallelism. E.g. in the job queue you have an
> animate_character(character*) function where multiple threads are
> simultaneously scheduled to run the function with different character*, so
> you get less pressure on the L2/L3 cache (even the L1 cache on Xbox) due to
> the shared data between characters.

But if you *interrupt* (in the sense that you push some of the tasks related to a particular area to the bottom of the queue) what the threads were already doing just to get them to run animate_character instead, then they will have to resume the previous tasks later and take lots of cache misses. So it's probably better to let them finish the work they're doing (which is hot in the cache) first, and *then* start spreading animate_character out over all cores. Which is precisely what you get with work stealing. If running different tasks is hurting your cache, you're of course free to insert a sync call before starting animation, to make sure all pending tasks are done before the animation work begins.
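For concreteness, here is a minimal sketch of the fork-and-sync pattern described above. It uses Intel TBB's tbb::task_group as one concrete work-stealing scheduler (TBB isn't mentioned in this thread, it's just an example); Character and animate_limb are hypothetical stand-ins for real game work.

    // Sketch only: tbb::task_group is one example of a work-stealing scheduler;
    // Character and animate_limb are hypothetical, not from the thread.
    #include <tbb/task_group.h>

    struct Character { float pose[3] = {0, 0, 0}; };

    // Hypothetical fine-grained work item.
    void animate_limb(Character* c, int limb) { c->pose[limb] += 1.0f; }

    void animate_character(Character* c)
    {
        tbb::task_group g;
        // Each task forks a small number (2-3) of children...
        g.run([c] { animate_limb(c, 0); });
        g.run([c] { animate_limb(c, 1); });
        // ...and can keep one piece of work for the current thread.
        animate_limb(c, 2);
        // Synchronise before returning. If no worker is idle, the spawned tasks
        // simply run here on the same thread, so the cost stays close to a
        // plain function call; if workers are idle, they steal.
        g.wait();
    }

Each sync point like g.wait() is one node in the tree; with a branching factor of 2-3 per task, a few levels of this are enough to expose hundreds of runnable tasks per level.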
My gut instinct on this says "priorities mean shuffling, and shuffling means loss of locality"...

> You are right that high-granularity tasks increase shared job queue
> contention, but then again I don't think you should put very fine-grained
> tasks in the job queue anyway; run those tasks sequentially instead, which
> has superior performance and simplicity to any task system (if you don't
> count the cache effect, that is). I don't really see any benefit of aiming
> for very high task granularity unless you're trying to fill in those
> aforementioned task starvation bubbles. Unfortunately I haven't had a
> chance to test my shared job queue past 2 cores, so I can't really tell how
> well it scales in practice, but since it's utilizing basic lock-free
> algorithms, I presume it scales reasonably well.

Honestly, I think we won't have a choice but to have very fine-grained tasks if we're going to hope to keep 32+ threads busy... Cilk claims an overhead of 3x the cost of a function call (IIRC) on PowerPC architectures, so it seems to me that pretty much anything you're not already marking for inlining could probably be run as a task (especially with work stealing, where a task will just run immediately on the very same thread, meaning essentially unchanged performance, in case there aren't any free workers to parallelize it). In fact, I don't think tasks are fine-grained enough! We're probably going to need to do data parallelism on the CPU too, for even more fine-grained parallelism.

Sebastian
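As a sketch of the data-parallel case (the "tick every game object" loop mentioned earlier), here is what it might look like with tbb::parallel_for; again TBB is just one possible scheduler, and GameObject and tick() are hypothetical names.

    // Sketch only: tbb::parallel_for as one way to express a data-parallel
    // per-object tick; GameObject and tick() are hypothetical.
    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <cstddef>
    #include <vector>

    struct GameObject { float t = 0.0f; void tick(float dt) { t += dt; } };

    void tick_all(std::vector<GameObject>& objects, float dt)
    {
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, objects.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                // Each stolen sub-range is contiguous, which helps cache locality.
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    objects[i].tick(dt);
            });
    }

The range is split recursively, so idle workers steal halves of it, while a fully loaded machine pays little more than the plain sequential loop.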