Re: [Algorithms] General purpose task parallel threading approach
From: Sebastian S. <seb...@gm...> - 2009-04-09 17:52:03
On Thu, Apr 9, 2009 at 3:28 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Work stealing executes tasks in the order of spawning on a /single/
> thread, but that's not the point. I hate to repeat myself, but in order
> to exploit shared cache coherency, you have to run the same work on
> /all/ the threads simultaneously, which you can't guarantee with work
> stealing without wasting CPU time. Do you disagree that multiple
> threads accessing the same data do exploit shared cache coherency, and
> do you need supporting evidence of this? Intel can provide you with
> plenty of documentation of this basic cache behavior.

I'm well aware of how a cache works, thank you. The point I'm trying to make is that a global work queue does not solve this issue either, and in fact appears to make it worse (which is why nobody is using one for task parallel systems). Sub-tasks spawned from multiple independent threads or running tasks all get interleaved on the global work queue, so they not only run in a completely unpredictable order, they don't even stay on the hardware thread they were spawned on. That isn't good for locality. Mixing priorities into that will only shuffle tasks around even more.

With work stealing you at least get good locality within each worker thread (the L1 cache matters too), with the option of good locality across all threads, because the way tasks are scheduled is reasonably deterministic. And you can augment the algorithm to take caching into account (and get provable bounds for cache behaviour). I'll sketch the worker loop further down.

Also, think about where the trend for coherency is headed - do you think memory will be more or less coherent in the future? It seems pretty clear to me that coherency will be reduced, meaning that the importance of keeping related tasks on the same thread will only increase.

> Could you be a bit more specific about which Gamefest talk you are
> referring to, because I would be very interested to check it out. If
> you have a link to the slides, that would be great. I checked some Cilk
> slides and videos, and it seems that their main motivation is indeed to
> provide a framework to improve parallelism of legacy code, as I
> suspected. It makes sense to build a business model around that,
> because the majority of ISVs want to gain from multi-core architectures
> without the risk & investment of redesigning/refactoring their entire
> software for multi-core. There was some obvious marketing BS in the
> slides, for example about linear scalability, though ;) Anyway, I see
> Cilk as a way to patch your single-threaded app, which starts at the
> 100%/num_of_cores efficiency level, rather than as a good approach if
> you design your game engine from scratch for multi-core.

This one:
http://www.microsoft.com/downloads/details.aspx?FamilyId=F9D74066-0DFD-4A38-9BAD-BFF5E15C9629&displaylang=en

You imply that just because something is easier to retrofit into an existing system it's not useful for something written from scratch, which I don't think is true. Writing from scratch may make certain things easier, but parallelism is still a very hard problem, and I don't think re-writing things will significantly reduce the issues I've already highlighted - the problem is still there, and it still needs a solution.

> Regarding task granularity, what you say about 63 threads waiting for
> 3 ms for a task to finish is true if you have a horribly sequential way
> of processing data,

It was a simplified example to make a point. As I said before, the dependencies aren't all at the top level, it's more like a tree (second sketch below).
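First, the work-stealing sketch I promised. This is a toy, mutex-guarded version just to show the shape of the thing - a real implementation would use a lock-free deque (a la Chase-Lev, as in Cilk) and pick steal victims at random, and the Task/WorkerQueue types here are made up for the example. The point is the asymmetry: a worker pops its *own* newest task (LIFO, still hot in its cache), while a thief takes a victim's *oldest* task (FIFO, coldest, and typically the root of a large subtree, which amortises the cost of the steal):

#include <cstddef>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Task = std::function<void()>;

struct WorkerQueue {
    std::deque<Task> tasks;
    std::mutex m;

    void push(Task t) {
        std::lock_guard<std::mutex> lock(m);
        tasks.push_back(std::move(t));
    }
    bool pop(Task& out) {                 // own thread: newest task first (LIFO)
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        out = std::move(tasks.back());
        tasks.pop_back();
        return true;
    }
    bool steal(Task& out) {               // thief: oldest task first (FIFO)
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        out = std::move(tasks.front());
        tasks.pop_front();
        return true;
    }
};

void worker(std::size_t me, std::vector<WorkerQueue>& queues) {
    Task t;
    for (;;) {
        if (queues[me].pop(t)) { t(); continue; }   // local work: good L1/L2 reuse
        bool stole = false;
        for (std::size_t v = 0; v < queues.size(); ++v) {
            if (v != me && queues[v].steal(t)) { t(); stole = true; break; }
        }
        if (!stole) return;   // demo only: a real worker would spin or park here
    }
}

int main() {
    std::vector<WorkerQueue> queues(4);
    for (int i = 0; i < 8; ++i)                     // seed one queue; thieves spread the load
        queues[0].push([i] { std::printf("task %d\n", i); });

    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < queues.size(); ++i)
        pool.emplace_back(worker, i, std::ref(queues));
    for (std::thread& th : pool) th.join();
}

That LIFO/FIFO split is exactly where the per-thread locality comes from: the task you run next is usually the one whose data you just touched.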
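And here's the tree shape I mean, as a hypothetical per-object update. I'm using std::async purely as a stand-in for a real task system (a real scheduler obviously wouldn't spawn a thread per task), and animate/collide are placeholder names:

#include <cstdio>
#include <future>
#include <vector>

struct Mesh { int id; float pose; };

// Placeholder bodies - in a real engine these are the expensive bits.
void animate(Mesh& m) { m.pose = m.id * 0.5f; }   // pose evaluation, skinning...
void collide(const Mesh& m) { (void)m; }          // narrow phase vs animated verts

void update_object(Mesh& m) {
    animate(m);   // must happen first - a data dependency, not a design flaw
    collide(m);   // only legal once animate(m) is done; could spawn sub-tasks itself
}

int main() {
    std::vector<Mesh> meshes(64);
    for (int i = 0; i < 64; ++i) meshes[i].id = i;

    // Top level of the tree: one task per object. The animate->collide edge
    // lives *inside* each task, so the scheduler sees a forest of small
    // dependent chains instead of one global animate-then-collide barrier.
    std::vector<std::future<void>> tasks;
    for (Mesh& m : meshes)
        tasks.push_back(std::async(std::launch::async, update_object, std::ref(m)));

    for (std::future<void>& t : tasks) t.wait();
    std::puts("frame done");
}

Each update_object can of course fan out into sub-tasks of its own, which is where the tree gets its depth.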
So yes, there's a certain amount of overlap where maybe a few threads can keep doing other tasks that are parallel and independent of animation, but the key point I was trying to make still stands: *real* code has tons of data dependencies all over the place, and this means you need to "over-subscribe" your task system for load-balancing purposes.

> e.g. some legacy single-threaded engine flow which you are patching for
> multi-core in small bits without major refactoring of its sub-systems.
> In my experience, even in some not-so-well designed multi-threaded
> engines you have multiple sub-systems running simultaneously
> (rendering, physics, game logic, audio, etc.) which can spawn tasks
> without global stalls of the threads.

I don't think it's a matter of refactoring/design, it's a matter of the data flow. There is no way you can do collision detection against an animated mesh until you've animated it. That's just a fact of the problem, and it will never change. As I mentioned before, you can pipeline some sequential tasks, but then you add latency, so this doesn't really scale beyond a few tasks - you end up with hundreds of milliseconds of latency (at 30 fps each pipeline stage adds ~33 ms, so even a handful of stages puts you past 100 ms).

So as a one-time optimization you can scale to, say, 3 more threads, if you have 3 tasks big enough to basically keep a whole thread occupied for most of the frame - but what are you going to do with the other 61 threads? The data dependencies are still there, so the problem isn't gone. You've made it slightly smaller, from 64 threads to 61, but you still have to split the rest of your engine up to run on those 61 threads, and you're still going to need way more than 10*61 tasks to do it, for the reasons I've already explained.

I hate to appeal to authority, but I really don't see how you can just ignore that all the major companies building task parallel libraries disagree with you regarding the merits of a shared global queue.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862