Re: [Algorithms] General purpose task parallel threading approach
From: Jarkko L. <al...@gm...> - 2009-04-07 18:35:26
Yes, your prioritization would still function properly if you assigned a thread per priority, of course, but it wouldn't result in optimal throughput, and what's the point otherwise? You need all your worker threads working on the top-priority tasks in order to get optimal throughput, which requires that all the threads pull from a single prioritized queue of jobs. Job stealing doesn't achieve this because each worker thread has its own queue. I also never said that tasks should be interrupted prematurely; you should let them finish and then start the next high-priority task.

Having fine-granularity tasks doesn't help regardless of how many threads you have. The only thing you really care about is that your jobs take, say, <~10% of your frame time (1-3 ms), which isn't very high granularity. Some might take longer, but those are also prioritized higher to avoid stalling other threads at the end of the frame. So if you can animate all your on-screen characters in 1-3 ms, you shouldn't bother multi-threading the code EXCEPT if you want to gain shared cache coherence. However, with work stealing you wouldn't even gain this, because each thread works on its own set of jobs.

Cheers, Jarkko

_____

From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Tuesday, April 07, 2009 2:41 PM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Tue, Apr 7, 2009 at 11:02 AM, Jarkko Lempiainen <al...@gm...> wrote:

> If you've got only 2-3 parallel tasks on average, then it doesn't make
> much sense to multi-thread the code anyway, particularly if you use work
> stealing. It sounds like you assume that you have some work starvation
> issues and that there is always a thread idling and ready to steal work,
> but to me that doesn't sound like a healthy assumption to build your work
> scheduling around, unless you are trying to patch an engine not built for
> multi-threading. If that's the situation you are facing, it sounds to me
> like you have some more serious multi-threading & scalability issues in
> your game anyway.

That's not what I meant. The total number of tasks should ideally be in the hundreds or thousands at a peak, and at least several dozen on average; I'm saying that the branching factor for a lot of that is going to be around 2-3. But of course each of those tasks may kick off another 2-3 tasks (and then synchronize before returning its result). So you get a tree with lots of nodes (synchronisation points), but the number of nodes in each level (i.e. stuff that can run in parallel) can still be hundreds. Combine this with the rare (but very welcome!) big giant parallel loops (i.e. "for each of the thousands of game objects, tick in parallel"), and there should be plenty of tasks to keep your scheduler sweating.

> Regarding task prioritization, you could of course have a thread per
> priority, but it doesn't really help getting high-priority tasks
> completed early in the frame; rather, you would need all the threads to
> contribute to running those tasks.

Yes, and that would still work. If you have three priority levels, you'd have three worker threads per hardware thread. Mid-priority worker threads would not get to run until the high-priority worker thread is blocked, which will only happen when every single high-priority task is completed.
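[For concreteness, a minimal sketch of the single shared prioritized job queue being debated in this thread: every worker drains one queue, so whichever core is free next always pulls the highest-priority job. This is purely illustrative; it uses a plain mutex rather than the lock-free queue Jarkko describes, all names are made up, and shutdown handling is omitted.]

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

struct job {
    int priority;                        // higher value = run earlier
    std::function<void()> run;
    bool operator<(const job& o) const { return priority < o.priority; }
};

class shared_job_queue {
public:
    void push(job j) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(j)); }
        cv_.notify_one();
    }
    job pop() {                          // blocks until a job is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        job j = q_.top();                // always the highest-priority job
        q_.pop();
        return j;
    }
private:
    std::priority_queue<job> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

// Worker loop: every thread pulls from the one shared queue, so the
// highest-priority pending job runs next no matter which core frees up.
void worker_main(shared_job_queue& jobs) {
    for (;;)                             // shutdown handling omitted
        jobs.pop().run();
}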
And again, it doesn't seem like you need priorities to get what you want, since scheduling with work-stealing is reasonably predictable: if you really do need something to run first (because it's likely to produce enough child tasks to keep all threads busy), then you just run it closer to the next sync point.

> I also didn't mean that tasks with the same/similar priority are local in
> relation to each other per se, but that threads gain locality by running
> the same function provided by the job queue on different data sets when
> utilizing data-level parallelism. E.g. in the job queue you have an
> animate_character(character*) function where multiple threads are
> simultaneously scheduled to run the function with different character*'s,
> so you get less pressure on the L2/L3 cache (even the L1 cache on Xbox)
> due to the shared data between characters.

But if you *interrupt* (in the sense that you push some of the tasks related to a particular area to the bottom of the queue) what the threads were already doing just to get them to run animate_character instead, then they will have to resume the previous tasks later and get lots of cache misses. So it's probably better to let them finish the work they're doing (which is hot in the cache) first, and *then* start spreading animate_character out over all cores. Which is precisely what you get with work stealing. If running different tasks is hurting your cache, you're of course free to insert a sync call before starting animation to make sure all pending tasks are done before you start the animation stuff. My gut instinct on this says "priorities mean shuffling, and shuffling means loss of locality"...

> You are right that high-granularity tasks increase shared job queue
> contention, but then again I don't think you should put very fine-grained
> tasks in the job queue anyway; run those tasks sequentially instead,
> which is superior in performance and simplicity to any task system (if
> you don't count the cache effect, that is). I don't really see any
> benefit of aiming for very high task granularity, unless you are trying
> to fill in those aforementioned task starvation bubbles. Unfortunately I
> haven't had a chance to test my shared job queue past 2 cores, so I can't
> really tell how well it scales in practice, but since it's using basic
> lock-free algorithms, I presume it scales reasonably well.

Honestly, I think we won't have a choice but to use very fine-grained tasks if we hope to keep 32+ threads busy... Cilk claims an overhead of 3x the cost of a function call (IIRC) on PowerPC architectures, so it seems to me that pretty much anything you're not already marking for inlining could probably run as a task (especially with work stealing, where a task simply runs immediately on the very same thread, meaning essentially unchanged performance, if there aren't any free workers to parallelize it). In fact, I don't think tasks are fine-grained enough! We're probably going to need to do data parallelism on the CPU too, for even more fine-grained parallelism.

Sebastian
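[To make the "finish what's hot in cache, sync, then fan animation out over all cores" pattern concrete, a hedged sketch. The job_queue interface with its push()/sync() calls and the character type are assumptions for illustration, not either poster's actual API.]

#include <functional>
#include <vector>

struct character;                        // opaque game type (assumed)
void animate_character(character*);      // the per-character function from
                                         // the discussion above

// Assumed job system interface; names are hypothetical:
struct job_queue {
    void push(std::function<void()> f);  // enqueue a task
    void sync();                         // block until all pending tasks done
};

// Drain unrelated work first so it finishes while hot in cache, then spread
// the same function over all workers with different data, so every core runs
// animate_character and shares whatever read-only data the characters touch.
void animate_all(job_queue& jobs, const std::vector<character*>& chars) {
    jobs.sync();                         // the sync call mentioned above
    for (character* c : chars)
        jobs.push([c] { animate_character(c); });
    jobs.sync();                         // barrier before code that reads poses
}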
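[And a similarly hedged sketch of the work-stealing shape under discussion: each worker pushes and pops its own deque LIFO, so newly spawned tasks run immediately on the same thread while still warm in cache, and idle workers steal the oldest task from the other end of a victim's deque. Plain locks stand in for the lock-free deques real Cilk-style schedulers use; termination and backoff are omitted.]

#include <deque>
#include <functional>
#include <mutex>
#include <vector>

struct worker {
    std::deque<std::function<void()>> tasks;
    std::mutex m;

    void spawn(std::function<void()> t) {
        std::lock_guard<std::mutex> lock(m);
        tasks.push_back(std::move(t));   // newest task at the back
    }
    bool pop_local(std::function<void()>& t) {
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.back());     // LIFO: best cache locality
        tasks.pop_back();
        return true;
    }
    bool steal(std::function<void()>& t) {  // called by other workers
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.front());    // FIFO: oldest task, likely the
        tasks.pop_front();               // biggest remaining subtree
        return true;
    }
};

// A worker's main loop: run local work first, steal only when starved.
void worker_loop(worker& me, std::vector<worker*>& all) {
    std::function<void()> t;
    for (;;) {                           // termination/backoff omitted
        if (me.pop_local(t)) { t(); continue; }
        for (worker* w : all)
            if (w != &me && w->steal(t)) { t(); break; }
    }
}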