Re: [Algorithms] General purpose task parallel threading approach
From: Sebastian S. <seb...@gm...> - 2009-04-06 11:55:01
On Mon, Apr 6, 2009 at 8:37 AM, Jarkko Lempiainen <al...@gm...> wrote:

> I was under the impression that you should minimize those sync points
> and dependencies for optimal throughput. If you have the ideal
> situation where your entire app is split into small tasks, then I
> guess you wouldn't need to prioritize, but I believe the reality is
> that you've got some large tasks whose execution needs to start as
> early in the frame as possible to avoid stalling other threads at the
> end of the frame. So you need to give those large tasks, and the
> tasks they depend on, high priority for execution. I agree that the
> ability to wait within a task is more convenient, particularly if you
> have to do it deep inside a nested loop. But I don't see spawning new
> tasks where you can as tedious as you describe, and those tasks don't
> add as much overhead to the queue as you imply. I.e. when you've got
> a sync point, you are waiting for on average maybe tens or hundreds
> of tasks to complete, so the overhead of a single task is negligible
> in proportion. I doubt that contention will be an issue with a shared
> queue any time soon (definitely not with the 16-32 threads you are
> talking about), but agreed that locality is a good point in favor of
> job stealing.

Realistically, most code isn't embarrassingly parallel; it has lots of complicated, dynamic control flow, which would be either inconvenient or impossible to deal with if you have an "up front" dependency graph. Yes, by all means try to minimize dependencies, but don't count on succeeding to any major extent. IMO real code has loads of non-parallel dependencies, with tiny pockets of parallelism embedded within them. To get pervasive parallelism I predict you'll have to write lots and lots of code where you kick off 2-3 parallel tasks and then immediately sync, over and over again (see the first sketch below). You'll have a couple of huge parallel-for loops too, but most code isn't like that.

You still have some room for choosing when to start a task. For example, you want to push tasks that are likely to spawn lots of other tasks close to the next sync point, so that they get executed first (assuming the worker pops its own queue in LIFO order; see the deque sketch below). That doesn't need a priority system, though.

The way I see it, tasks are for "burst parallelism": you have a certain amount of work that needs to be finished, and that's all you care about. Prioritizing between those tasks isn't really a parallelism issue; it's more of a concurrency/coordination issue, and it probably belongs at a different level of abstraction (i.e. threads), IMO. You could, however, easily have one worker thread per priority level, with work stealing only happening within the same priority. That way it wouldn't be too hard to prioritize things in the rough (i.e. the very root function that kicks off tasks could stick a job on a higher-priority worker); the last sketch below shows the idea.

The relative inefficiency of a global queue doesn't just depend on the number of threads, but also on the granularity of the tasks. Even with a few hundred to a thousand tasks, a global queue starts losing significantly on current hardware (a mere 6 threads), according to the benchmarks from the concurrency talk at the last Gamefest. So it doesn't look too good for 16-32 threads, IMO.
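Roughly what I mean by "kick off a few tasks and sync immediately", as a toy sketch. Scene and the update_* functions are made up for the example, and std::async is just standing in for a real scheduler's spawn:

#include <functional>
#include <future>

// Hypothetical frame data and subsystem updates, invented for the example.
struct Scene { /* ... */ };
void update_animation(Scene&) { /* ... */ }
void update_particles(Scene&) { /* ... */ }
void update_physics(Scene&)   { /* ... */ }

void update_frame(Scene& scene)
{
    // Fork: kick off a couple of parallel tasks...
    auto anim  = std::async(std::launch::async, update_animation, std::ref(scene));
    auto parts = std::async(std::launch::async, update_particles, std::ref(scene));

    // ...do one chunk on this thread...
    update_physics(scene);

    // ...and sync immediately. Most "real" code ends up as lots of
    // small bursts of this shape, not one giant parallel loop.
    anim.wait();
    parts.wait();
}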
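And here's a minimal sketch of the kind of per-worker deque I have in mind, assuming std::function tasks and a plain mutex for clarity. A real scheduler would use a lock-free work-stealing deque (e.g. Chase-Lev), but the access pattern is the same: the owner works at one end, thieves steal from the other:

#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>

using Task = std::function<void()>;

class WorkerDeque {
public:
    // The owning worker pushes newly spawned tasks on the back...
    void push(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }

    // ...and pops from the back too (LIFO), so the most recently
    // spawned task runs first, while its data is still hot in the
    // owner's cache. Pushing a "fan-out" task last therefore gets it
    // executed first -- no priority machinery needed.
    std::optional<Task> pop() {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }

    // Thieves steal from the front (FIFO): the oldest task, which is
    // typically the biggest remaining chunk of work and the least
    // likely to be in the owner's cache anyway.
    std::optional<Task> steal() {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }

private:
    std::mutex m_;
    std::deque<Task> tasks_;
};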
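And the priority-tiers idea, again just a rough sketch reusing the Task and WorkerDeque types from the deque above. PriorityPool and its layout are names I'm making up for illustration, not any existing library's API:

#include <cstddef>
#include <vector>

enum class Priority { High = 0, Normal = 1, Low = 2 };

class PriorityPool {
public:
    explicit PriorityPool(std::size_t workersPerTier) {
        for (int i = 0; i < 3; ++i)
            tiers_.emplace_back(workersPerTier);  // one deque per worker, per tier
    }

    // The root code that kicks work off picks the tier; everything a
    // task spawns stays in its own tier.
    void spawn(Priority p, std::size_t worker, Task t) {
        tiers_[static_cast<std::size_t>(p)][worker].push(std::move(t));
    }

    // Worker loop body: pop the local deque first (LIFO), then try to
    // steal -- but only from deques in the same priority tier.
    std::optional<Task> next(Priority p, std::size_t self) {
        auto& tier = tiers_[static_cast<std::size_t>(p)];
        if (auto t = tier[self].pop()) return t;
        for (std::size_t i = 0; i < tier.size(); ++i) {
            if (i == self) continue;
            if (auto t = tier[i].steal()) return t;
        }
        return std::nullopt;  // nothing to do at this priority
    }

private:
    std::vector<std::vector<WorkerDeque>> tiers_;
};

A worker pinned to, say, Priority::High just loops on next(Priority::High, self), runs whatever it gets, and yields when that comes back empty.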
> Actually, after a bit of extra thought on locality, I think I'll take
> back that "locality is a good point in favor of job stealing" part.
> When you've got a shared cache between threads, locality is actually
> in favor of a shared job queue, because all the worker threads are
> executing the same type of task, so it doesn't put as much pressure
> on the L2 cache as it would if each thread executed completely
> unrelated tasks.

Well, I guess that COULD be true in principle. However, ordering by priority is not the same as ordering by locality, so you're really just guessing that tasks of similar priority will tend to access the same memory, whereas a work-stealing approach guesses that tasks scheduled in close proximity will work on the same data. With hundreds or thousands of tasks I must say I find the latter much more likely (and benchmarks seem to confirm it). Also, moving tasks between workers at random means (with a significant number of workers) that you're almost guaranteed a bunch of L1 cache misses at the start of every single task. L1 misses may not be a big deal individually, but with lots of tiny tasks these things add up...

Also, it seems that just about all the major task libraries coming out have come to the conclusion that work stealing is much faster, and I've yet to see a single benchmark showing that this current recommended practice is incorrect.

-- 
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862