Re: [Algorithms] General purpose task parallel threading approach
From: Jarkko L. <al...@gm...> - 2009-04-06 08:11:17
Actually, after a bit more thought on locality, I think I'll take back that "locality is a good point in favor of job stealing" part. When you have a shared cache between threads, locality actually favors the shared job queue, because all the worker threads are executing the same type of task, which puts less pressure on the L2 cache than each thread executing completely unrelated tasks.

Cheers, Jarkko

_____

From: Jarkko Lempiainen [mailto:al...@gm...]
Sent: Monday, April 06, 2009 10:38 AM
To: 'Game Development Algorithms'
Subject: RE: [Algorithms] General purpose task parallel threading approach

I was under the impression that you should minimize those sync points and dependencies for optimal throughput. In the ideal situation where your entire app is split into small tasks, I guess you wouldn't need to prioritize, but I believe the reality is that you have some large tasks whose execution needs to start as early in the frame as possible to avoid stalling other threads at the end of the frame. So you need to give those large tasks, and the tasks they depend on, high priority for execution.

I agree that the ability to wait within a task is more convenient, particularly if you have to do it deep inside a nested loop. But I don't see spawning new tasks where you can as tedious as you describe, and those tasks don't add as much overhead to the queue as you imply. I.e., when you hit a sync point, you are waiting for on average maybe tens or hundreds of tasks to complete, so the overhead of a single task is negligible in proportion. I doubt that contention will be an issue with a shared queue any time soon (definitely not with the 16-32 threads you are talking about), but agreed that locality is a good point in favor of job stealing.

Cheers, Jarkko

_____

From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Monday, April 06, 2009 1:44 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Sun, Apr 5, 2009 at 9:40 PM, Jarkko Lempiainen <al...@gm...> wrote:

There isn't a "whole bunch of magic" outside the task that you would need with the first option. The task just adds another function to the job queue, which does the work after the wait. With the second option, it is the job queue's job to schedule tasks for optimal throughput, so for a wait it could, for example, favor the tasks blocking the waiting task from proceeding.

That's a whole bunch of magic to me. If you want pervasive parallelism (and Mr Amdahl says we do), you're going to need sync points all over the place. Currently we're not doing that because we don't *need* to yet, but if you have 16-32 threads to keep busy, you bet you're going to need to spawn off tasks all over the place (and usually sync immediately afterwards). Having to manually split all of these things up into lots of tiny tasks is a big issue IMO - it's tedious, error prone, and adds more overhead (the cost of starting a task still isn't free - so doing it, say, 5-10x as often is a big deal).

With job stealing from other threads you don't get global prioritization, so one thread may be executing lower-priority tasks and not working towards optimal throughput.

Why do you actually need global prioritization? After all, you need to finish everything by the end of the frame anyway, so why would you care if task A gets executed before task B, so long as you never get any bubbles where a core is idle? All you care about is maximal throughput, maximal locality (especially true for in-order CPUs) and minimal contention/overhead, and for that a giant shared queue is sub-optimal for two reasons:

* It's inherently non-scalable. If you have lots of threads reading from the same queue you'll get contention. Even with a relatively modest number of threads this starts becoming an issue. Work stealing has negligible contention, because a thief has to be unlucky enough to try to steal from the bottom of your stack at the same time as you're running out of tasks to do (so you're fighting for the very last item in the stack). That's extremely unlikely to happen, and it's even more unlikely for two threads to try to steal from the same stack - whereas a shared work queue is constantly being hit by multiple threads.

* You lose locality. With a work-stealing approach you get good locality, because a task usually runs on the same thread that spawned it, and tasks spawned close together in time get run close together in time, meaning good cache coherency. If tasks are spread out over multiple cores, and over time, you'll get less coherency.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
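[Editor's note: the per-thread stack with stealing from the far end that Sebastian describes can be sketched as below. This is a simplified, single-threaded illustration of the mechanics only - a real scheduler (e.g. a Chase-Lev deque) synchronizes the steal with atomics or a lock - and all names here are hypothetical, not from any library.]

```python
# Sketch of a work-stealing deque: the owning thread pushes and pops
# tasks at one end (LIFO, so recently spawned tasks stay cache-hot),
# while idle threads steal the oldest task from the other end. The two
# only meet when a single task remains, which is why contention is rare.
from collections import deque

class WorkStealingDeque:
    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        # Owner only: spawn a new task on top of the stack.
        self._tasks.append(task)

    def pop(self):
        # Owner only: take the most recently spawned task (LIFO).
        return self._tasks.pop() if self._tasks else None

    def steal(self):
        # Thieves: take the oldest task from the opposite end (FIFO).
        return self._tasks.popleft() if self._tasks else None

dq = WorkStealingDeque()
for name in ("update_ai", "animate", "cull"):
    dq.push(name)

print(dq.pop())    # owner gets the newest task: cull
print(dq.steal())  # a thief gets the oldest task: update_ai
```

Because the owner and the thieves operate on opposite ends, they contend only when one task is left - the "very last item in the stack" case above.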