Re: [Algorithms] General purpose task parallel threading approach
From: Sebastian S. <seb...@gm...> - 2009-04-09 17:52:03
On Thu, Apr 9, 2009 at 3:28 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Work stealing executes tasks in the order of spawning on a /single/
> thread, but that's not the point. I hate to repeat myself, but in order
> to exploit shared cache coherency, you have to run the same work on
> /all/ the threads simultaneously, which you can't guarantee with work
> stealing without wasting CPU time. Do you disagree that multiple
> threads accessing the same data do exploit shared cache coherency, and
> do you need supporting evidence of this? Intel can provide you with
> plenty of documentation of this basic cache behavior.

I'm well aware of how a cache works, thank you. The point I'm trying to make is that a global work queue does not solve this issue either, and in fact appears to make it worse (which is why nobody is using one for task parallel systems). Sub-tasks spawned from multiple independent threads or running tasks all get interleaved on the global work queue, so they not only run in a completely unpredictable order, they don't even stay on the hardware thread they were spawned on. That isn't good for locality. Mixing priorities into that will only shuffle tasks around even more.

With work stealing you at least get good locality within each worker thread (the L1 cache matters too), with the option of good locality across all threads, because the way tasks are scheduled is reasonably deterministic. And you can augment the algorithm to take caching into account (and get provable bounds for cache behaviour). I'll sketch the worker loop further down.

Also, think about where the trend for coherency is headed - do you think memory will be more or less coherent in the future? It seems pretty clear to me that coherency will be reduced, meaning that the importance of keeping related tasks on the same thread will only increase.

> Could you be a bit more specific about which Gamefest talk you are
> referring to, because I would be very interested to check it out. If
> you have a link to the slides, that would be great. I checked some Cilk
> slides and videos, and it seems that their main motivation is indeed to
> provide a framework to improve parallelism of legacy code, as I
> suspected. It makes sense to build a business model around that,
> because the majority of ISVs want to gain from multi-core architectures
> without the risk & investment of redesigning/refactoring their entire
> software for multi-core. There was some obvious marketing BS in the
> slides, for example about linear scalability, though ;) Anyway, I see
> Cilk as a way to patch your single-threaded app, which starts at the
> 100%/num_of_cores efficiency level, rather than as a good approach if
> you design your game engine from scratch for multi-core.

This one:
http://www.microsoft.com/downloads/details.aspx?FamilyId=F9D74066-0DFD-4A38-9BAD-BFF5E15C9629&displaylang=en

You imply that just because something is easier to retrofit into an existing system it's not useful for something written from scratch, which I don't think is true. Writing from scratch may make certain things easier, but parallelism is still a very hard problem, and I don't think re-writing things will significantly reduce the issues I've already highlighted - the problem is still there, and it still needs a solution.

> Regarding task granularity, what you say about 63 threads waiting for
> 3 ms for a task to finish is true if you have a horribly sequential way
> of processing data,

It was a simplified example to make a point. As I said before, the dependencies aren't all at the top level, it's more like a tree (second sketch below).
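First, the work-stealing sketch I promised. This is a toy, mutex-guarded version just to show the shape of the thing - a real implementation would use a lock-free deque (a la Chase-Lev, as in Cilk) and pick steal victims at random, and the Task/WorkerQueue types here are made up for the example. The point is the asymmetry: a worker pops its *own* newest task (LIFO, still hot in its cache), while a thief takes a victim's *oldest* task (FIFO, coldest, and typically the root of a large subtree, which amortises the cost of the steal):

#include <cstddef>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Task = std::function<void()>;

struct WorkerQueue {
    std::deque<Task> tasks;
    std::mutex m;

    void push(Task t) {
        std::lock_guard<std::mutex> lock(m);
        tasks.push_back(std::move(t));
    }
    bool pop(Task& out) {                 // own thread: newest task first (LIFO)
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        out = std::move(tasks.back());
        tasks.pop_back();
        return true;
    }
    bool steal(Task& out) {               // thief: oldest task first (FIFO)
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        out = std::move(tasks.front());
        tasks.pop_front();
        return true;
    }
};

void worker(std::size_t me, std::vector<WorkerQueue>& queues) {
    Task t;
    for (;;) {
        if (queues[me].pop(t)) { t(); continue; }   // local work: good L1/L2 reuse
        bool stole = false;
        for (std::size_t v = 0; v < queues.size(); ++v) {
            if (v != me && queues[v].steal(t)) { t(); stole = true; break; }
        }
        if (!stole) return;   // demo only: a real worker would spin or park here
    }
}

int main() {
    std::vector<WorkerQueue> queues(4);
    for (int i = 0; i < 8; ++i)                     // seed one queue; thieves spread the load
        queues[0].push([i] { std::printf("task %d\n", i); });

    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < queues.size(); ++i)
        pool.emplace_back(worker, i, std::ref(queues));
    for (std::thread& th : pool) th.join();
}

That LIFO/FIFO split is exactly where the per-thread locality comes from: the task you run next is usually the one whose data you just touched.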
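And here's the tree shape I mean, as a hypothetical per-object update. I'm using std::async purely as a stand-in for a real task system (a real scheduler obviously wouldn't spawn a thread per task), and animate/collide are placeholder names:

#include <cstdio>
#include <future>
#include <vector>

struct Mesh { int id; float pose; };

// Placeholder bodies - in a real engine these are the expensive bits.
void animate(Mesh& m) { m.pose = m.id * 0.5f; }   // pose evaluation, skinning...
void collide(const Mesh& m) { (void)m; }          // narrow phase vs animated verts

void update_object(Mesh& m) {
    animate(m);   // must happen first - a data dependency, not a design flaw
    collide(m);   // only legal once animate(m) is done; could spawn sub-tasks itself
}

int main() {
    std::vector<Mesh> meshes(64);
    for (int i = 0; i < 64; ++i) meshes[i].id = i;

    // Top level of the tree: one task per object. The animate->collide edge
    // lives *inside* each task, so the scheduler sees a forest of small
    // dependent chains instead of one global animate-then-collide barrier.
    std::vector<std::future<void>> tasks;
    for (Mesh& m : meshes)
        tasks.push_back(std::async(std::launch::async, update_object, std::ref(m)));

    for (std::future<void>& t : tasks) t.wait();
    std::puts("frame done");
}

Each update_object can of course fan out into sub-tasks of its own, which is where the tree gets its depth.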
So yes, there's a certain amount of overlap where maybe a few threads can keep doing other tasks that are parallel and independent of animation, but the key point I was trying to make still stands: *real* code has tons of data dependencies all over the place, and this means you need to "over-subscribe" your task system for load-balancing purposes.

> e.g. some legacy single-threaded engine flow which you are patching for
> multi-core in small bits without major refactoring of its sub-systems.
> In my experience, even in some not-so-well designed multi-threaded
> engines you have multiple sub-systems running simultaneously
> (rendering, physics, game logic, audio, etc.) which can spawn tasks
> without global stalls of the threads.

I don't think it's a matter of refactoring/design, it's a matter of the data flow. There is no way you can do collision detection against an animated mesh until you've animated it. That's just a fact of the problem, and it will never change. As I mentioned before, you can pipeline some sequential tasks, but then you add latency, so this doesn't really scale beyond a few tasks - you end up with hundreds of milliseconds of latency (at 30 fps each pipeline stage adds ~33 ms, so even a handful of stages puts you past 100 ms).

So as a one-time optimization you can scale to, say, 3 more threads, if you have 3 tasks big enough to basically keep a whole thread occupied for most of the frame - but what are you going to do with the other 61 threads? The data dependencies are still there, so the problem isn't gone. You've made it slightly smaller, from 64 threads to 61, but you still have to split the rest of your engine up to run on those 61 threads, and you're still going to need way more than 10*61 tasks to do it, for the reasons I've already explained.

I hate to appeal to authority, but I really don't see how you can just ignore that all the major companies building task parallel libraries disagree with you regarding the merits of a shared global queue.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862