Re: [Algorithms] General purpose task parallel threading approach
From: Jarkko L. <al...@gm...> - 2009-04-12 08:39:24
Well, I see good task prioritization as the dominating factor in task scheduling: if you save even 1ms shared across /all/ the threads due to better scheduling, you have already gained back the overhead of executing ~5000 tasks in a frame. But you seem to disagree that task prioritization gains you anything, am I right? And then there are of course the issues of cache coherency and programmer productivity, so individual task overhead seems to pale in comparison when we are talking about such small differences (in cycles), and it isn't really worth optimizing further. But yes, a real-life benchmark would probably shed some light on this. Then again, you could always continue arguing that the benchmark favors one approach or the other, so I doubt a single benchmark proves anything either way. I guess in the end it boils down to whichever approach is simplest for the majority to use and understand while still scaling reasonably well; that is what gets adopted as the best practice, even though it doesn't necessarily give the absolute best throughput.

Cheers, Jarkko

_____
From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Sunday, April 12, 2009 5:07 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Sun, Apr 12, 2009 at 12:15 AM, Jarkko Lempiainen <al...@gm...> wrote:

Yes, only tasks that have been previously executed are prioritized (otherwise a task gets some default priority). The job queue also processes tasks in groups (e.g. a group per function), so all tasks in a task group share the same priority and are scheduled sequentially for execution (e.g. multiple animate_character(character*) function tasks with different character* pointers form a task group). I profiled the overhead of adding and executing a task from the queue at around 500 cycles total on my test PC, so it's not bad, but I haven't used the queue in a real game engine yet, so I can't tell how well (or badly) it works in real-life scenarios.
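The grouped, prioritized queue Jarkko describes could be sketched roughly like this. This is a minimal single-threaded sketch under stated assumptions, not his actual implementation; the names (TaskQueue, add_task, pop_and_run) and the integer group/priority scheme are illustrative. The point it shows is that tasks in the highest-priority group are drained back-to-back, so e.g. all animate_character tasks run consecutively (the cache-coherency benefit he mentions):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Minimal sketch of a grouped, prioritized task queue: every task is added
// under a group id, all tasks in a group share the group's priority, and
// pop_and_run() always services the highest-priority non-empty group, so a
// group's tasks execute consecutively. Hypothetical API, not the poster's.
class TaskQueue {
public:
    void add_task(int group, int priority, std::function<void()> fn) {
        Group& g = groups_[group];
        g.priority = priority;                 // whole group shares one priority
        g.tasks.push_back(std::move(fn));
    }

    // Runs one task from the highest-priority non-empty group.
    // Returns false when the queue is empty.
    bool pop_and_run() {
        Group* best = nullptr;
        for (auto& entry : groups_) {
            Group& g = entry.second;
            if (!g.tasks.empty() && (!best || g.priority > best->priority))
                best = &g;
        }
        if (!best) return false;
        std::function<void()> fn = std::move(best->tasks.back());
        best->tasks.pop_back();
        fn();
        return true;
    }

private:
    struct Group {
        int priority = 0;
        std::vector<std::function<void()>> tasks;
    };
    std::map<int, Group> groups_;              // group id -> its tasks
};
```

A real scheduler would of course need thread-safe (ideally lock-free) add/pop paths, which is where the ~500-cycle overhead figure would come from.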
Anyway, it should be quite easy to change the scheduling strategy of the queue once it's integrated into an engine, to see how it compares to work stealing, for example.

It would definitely be interesting to see some benchmark comparisons... The problem I have with it a priori is that plain old work stealing works so surprisingly well (performance is very close to the theoretical maximum for a given number of processors and a given length of critical path) that there doesn't seem to be much room for improvement as far as throughput is concerned. You'd need to test both on the same code, I guess, because the cost of these things varies. But if we guesstimate a bit, then depending on the exact architecture and calling convention (and the number of parameters that need to be pushed onto the stack, etc.), Cilk's "4x the cost of a function call" equates to about 50 cycles give or take, so you're adding an order of magnitude of extra overhead, which seems quite steep to me.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
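For context, the work-stealing discipline Sebastian refers to (as popularized by Cilk) can be sketched as follows. Each worker owns a deque: it pushes and pops its own tasks at the back (LIFO, which keeps recently spawned, cache-warm work local), while an idle worker steals from the front of a victim's deque (FIFO, which tends to grab the oldest and typically largest pieces of work). This is a single-threaded sketch of the policy only; a real scheduler uses a lock-free deque (e.g. the Chase-Lev algorithm), and the Worker type and its method names here are hypothetical:

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Sketch of per-worker task deques as used in work stealing: the owner
// operates on the back of its own deque, thieves take from the front.
// Policy illustration only -- no synchronization, hypothetical names.
struct Worker {
    std::deque<std::function<void()>> dq;

    // Owner enqueues newly spawned work at the back.
    void push(std::function<void()> t) { dq.push_back(std::move(t)); }

    // Owner dequeues its own work LIFO from the back (cache-warm).
    bool pop_local(std::function<void()>& out) {
        if (dq.empty()) return false;
        out = std::move(dq.back());
        dq.pop_back();
        return true;
    }

    // An idle worker steals FIFO from the front (oldest work first).
    bool steal(std::function<void()>& out) {
        if (dq.empty()) return false;
        out = std::move(dq.front());
        dq.pop_front();
        return true;
    }
};
```

Because the owner touches only one end of the deque and thieves the other, contention between them is rare in practice, which is why the per-task overhead can stay within a small multiple of a plain function call.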