Re: [Algorithms] General purpose task parallel threading approach
From: Jarkko L. <al...@gm...> - 2009-04-07 10:02:27
If you only have 2-3 parallel tasks on average, then it doesn't make much sense to multi-thread the code anyway, particularly if you use work stealing. It sounds like you assume you have work starvation issues and that there is always a thread idling and ready to steal work, but to me that doesn't sound like a healthy assumption to build your work scheduling around, unless you are trying to patch an engine that wasn't built for multi-threading. If that's the situation you are facing, it sounds like you have more serious multi-threading & scalability issues in your game anyway.

Regarding task prioritization, you could of course have a thread per priority, but that doesn't really help get high-priority tasks completed early in the frame; you would need all the threads to contribute to running those tasks. I also didn't mean that tasks with the same or similar priority are local in relation to each other per se, but that threads gain locality by running the same function, provided by the job queue, over different data sets when exploiting data-level parallelism. E.g. the job queue holds an animate_character(character*) function and multiple threads are simultaneously scheduled to run it with different character* arguments, so you get less pressure on the L2/L3 cache (even the L1 cache on Xbox) due to the data shared between characters.

You are right that fine-grained tasks increase shared job queue contention, but then again I don't think you should put very fine-grained tasks into the job queue anyway; run those tasks sequentially instead, which beats any task system in both performance and simplicity (if you don't count the cache effect, that is). I don't really see any benefit in aiming for very fine task granularity unless you are trying to fill those aforementioned task starvation bubbles.
Unfortunately I haven't had a chance to test my shared job queue past 2 cores, so I can't really tell how well it scales in practice, but since it uses basic lock-free algorithms, I presume it scales reasonably well.

Cheers, Jarkko

_____
From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Monday, April 06, 2009 2:55 PM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Mon, Apr 6, 2009 at 8:37 AM, Jarkko Lempiainen <al...@gm...> wrote:

I was under the impression that you should minimize those sync points and dependencies for optimal throughput. If you were in the ideal situation where your entire app is split into small tasks, then I guess you wouldn't need to prioritize, but I believe the reality is that you have some large tasks whose execution needs to start as early in the frame as possible to avoid stalling other threads at the end of the frame. So you need to give those large tasks, and the tasks they depend on, high priority. I agree that the ability to wait within a task is more convenient, particularly if you have to do it deep inside a nested loop. But I don't see spawning new tasks where you can as tedious as you describe, and those tasks don't add as much overhead to the queue as you imply. I.e. at a sync point you are waiting for, on average, maybe tens or hundreds of tasks to complete, so the overhead of a single task is negligible in proportion. I doubt that contention will be an issue with a shared queue any time soon (definitely not with the 16-32 threads you are talking about), but agreed that locality is a good point in favor of job stealing.

Realistically most code isn't embarrassingly parallel and has lots of complicated, and dynamic, flow, which would be either inconvenient or impossible to deal with if you have an "up front" dependency graph. Yes, by all means try to minimize dependencies, but don't count on succeeding to any major extent.
IMO real code has loads of non-parallel dependencies, with tiny pockets of parallelism embedded within them. To get pervasive parallelism I predict you'll have to write lots and lots of code where you kick off 2-3 parallel tasks and then immediately sync, over and over again. You'll have a couple of huge parallel for loops too, but most code isn't like that.

You still have some room for choosing when to start a task. For example, you want to put tasks that are likely to spawn lots of other tasks close to the next sync point so that they get executed first (assuming the worker is using LIFO ordering). That doesn't need a priority system though. The way I see it, tasks are for "burst parallelism": you have a certain amount of work that needs to be finished and that's all you care about. Prioritizing between those tasks isn't really a parallelism issue, it's more concurrency/coordination and probably belongs at a different level of abstraction (i.e. threads), IMO. You could, however, easily have one worker thread per priority level, with work stealing only stealing within its own priority. Meaning it wouldn't be too hard to prioritize things in the rough (i.e. the very root function that kicks off tasks could stick them on a higher-priority worker).

The relative inefficiency of a global queue doesn't just depend on the number of threads, but also on the granularity of tasks. Even with a few hundred to a thousand tasks, a global queue starts losing significantly on current hardware (a mere 6 threads) according to the benchmarks from the concurrency talk at the last Gamefest. So it doesn't look too good for 16-32 threads, IMO. Actually, after a bit of extra thought on locality, I think I'll take back that "locality is a good point in favor of job stealing" part.
When you have a shared cache between threads, locality is actually in favor of the shared job queue, because all the worker threads are executing the same type of task, so it doesn't put as much pressure on the L2 cache as it would if each thread were executing completely unrelated tasks.

Well, I guess that COULD be true in principle; however, ordering by priority is not the same as ordering by locality, so you're really just guessing that tasks of similar priority will tend to access the same memory, whereas a work-stealing approach guesses that tasks scheduled in close proximity will work on the same data. With hundreds or thousands of tasks I must say I find the latter much more likely (and benchmarks seem to confirm it). Also, moving tasks around randomly means (with a significant number of workers) that you're almost guaranteeing a bunch of L1 cache misses at the start of every single task. L1 caches may not be a big deal, but with lots of tiny tasks these things add up... Also, it seems that just about all the major task libraries coming out have reached the conclusion that work stealing is much faster, and I've yet to see a single benchmark showing that this current recommended practice is incorrect.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862