Re: [Algorithms] General purpose task parallel threading approach
From: Jarkko L. <al...@gm...> - 2009-04-07 21:30:18
Okay, yes, you would evict the data from the cache if a higher-priority task kicks in, assuming the data doesn't fit into the cache simultaneously before you return to work on the original task. Now, if you think of an 8-core system with job stealing, you must fit 8 times the amount of data into the cache, because you have 8 threads working on a total of 8 unrelated tasks simultaneously. That is much worse from a data-eviction point of view than switching to a higher-priority task and returning to the original task afterwards. Now think about how this scales to 16, 32, or 64 cores.

> If I can take something that takes 3ms to finish and get it done in 0.5ms
> because I can use six threads instead of one then that's clearly a win.

No, it's not a win. See, all you care about is that ALL the work for a frame is completed as soon as possible, not minimizing the time to execute a single task. If you had 50% overhead like you say, then you would have 50% less time to do any useful work. Now, you say it doesn't matter because you have threads sitting idle anyway, but doesn't that suggest you have some serious work-starvation issues in your engine, which you are trying to patch over with this approach and then concluding it's not a lost resource?

Cheers, Jarkko

_____

From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Tuesday, April 07, 2009 10:55 PM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Tue, Apr 7, 2009 at 7:35 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Yes, your prioritization would still function properly if you assigned a
> thread per priority, of course, but it wouldn't result in optimal
> throughput, so what would be the point? You need all your worker threads
> working on the top-priority tasks in order to get optimal throughput,
> which requires that all the threads work from a single prioritized queue
> of jobs. Job stealing doesn't achieve this, because each worker thread
> has its own queue.
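[Editor's sketch] To make the single-prioritized-queue model above concrete, here is a minimal C++ sketch. The `Task` and `TaskQueue` names are invented for illustration and are not from any engine discussed in this thread; a production scheduler would need finer-grained or lock-free synchronization rather than one mutex.

```cpp
// Minimal sketch of a single shared prioritized task queue: every worker
// pops the highest-priority ready task under one lock, so all cores
// converge on the most urgent work first. Illustrative only.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

struct Task {
    int priority;                    // higher value = more urgent
    std::function<void()> run;
};

struct ByPriority {
    bool operator()(const Task& a, const Task& b) const {
        return a.priority < b.priority;  // makes std::priority_queue a max-heap
    }
};

class TaskQueue {
public:
    void push(Task t) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(t));
        }
        cv_.notify_one();
    }

    // Blocks until a task is available or shutdown() is called.
    // Returns false only when shutting down with an empty queue.
    bool pop(Task& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !queue_.empty() || done_; });
        if (queue_.empty()) return false;
        out = queue_.top();          // always the highest-priority task
        queue_.pop();
        return true;
    }

    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_all();
    }

private:
    std::priority_queue<Task, std::vector<Task>, ByPriority> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
};
```

Worker threads would loop on pop() and execute task.run(); because they all drain one heap, a burst of high-priority tasks is picked up by every idle core rather than sitting in one thread's private deque.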
> I also never said that tasks should be interrupted prematurely; you
> should let them finish and start the next high-priority task after.

I didn't say an individual task would be interrupted. But if you have 100 tasks working on something and you kick off 20 tasks to do your animation at a high priority when only 5 of those 100 tasks have completed, you'll evict a bunch of data from the cache, and you still have 95 tasks to go; then you do another 5 and something else pops up to the top, evicting the data again, and so on. It's better for the entire batch of 100 tasks to run to completion. With work stealing, large batches of tasks that would benefit from sharing data across cores naturally fill up the worker threads as the previous tasks finish — and if you really want to make sure that's the case, just call sync() before.

> Having fine-granularity tasks doesn't help, regardless of how many
> threads you have. The only thing you really care about is that your jobs
> take, let's say, <~10% of your frame time (1-3ms), which isn't very high
> granularity. Some might take longer, but those are also prioritized
> higher to avoid stalling other threads at the end of the frame. So if you
> can animate all your on-screen characters in 1-3ms, you shouldn't bother
> multi-threading that code EXCEPT if you want to gain shared cache
> coherence. However, with work stealing you wouldn't even gain this,
> because each thread works on its own set of jobs.

I'm not sure I follow you. If I can take something that takes 3ms to finish and get it done in 0.5ms because I can use six threads instead of one, then that's clearly a win. Even if I get an unreasonable amount of overhead and it takes 1ms (50% overhead!), it's still a 3x improvement in overall frame time.
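[Editor's sketch] The 3ms-to-0.5ms example above amounts to a plain fork-join split of independent per-character work across worker threads. A minimal sketch follows; `animate_character` and the `pose` buffer are hypothetical stand-ins for real animation work:

```cpp
// Fork-join sketch: split a batch of independent per-character updates
// across N threads. With six threads the wall-clock time is roughly
// (total work)/6 plus spawn/join overhead.
#include <thread>
#include <vector>

void animate_character(std::vector<int>& pose, int i) {
    pose[i] = i * 2;  // stand-in for evaluating one character's skeleton
}

void animate_all(std::vector<int>& pose, int num_threads) {
    const int n = static_cast<int>(pose.size());
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&pose, t, n, num_threads] {
            // Strided slices touch disjoint indices, so no locking is needed.
            for (int i = t; i < n; i += num_threads)
                animate_character(pose, i);
        });
    }
    for (auto& w : workers) w.join();  // the whole batch completes here
}
```

Whether this is a net win is exactly the dispute in this thread: it cuts the latency of this one batch, but the extra threads may evict whatever the other cores had cached.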
What you care about is that the set of tasks you need to complete in a frame is done as quickly as possible, which means wasting as few cycles as possible, which in turn means avoiding contention, bubbles due to coarse granularity, and cache thrashing. Fine task granularity is clearly extremely important in order to actually be able to feed loads of threads. Of course you need to make sure the overhead of starting the task isn't higher than the cost of just doing the work directly, but as long as the overhead is even slightly lower than the cost of running the task, it's a win to get some other thread that would otherwise be idle to do it "for free", while you only pay the fixed overhead cost.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862