Thread: Re: [Algorithms] General purpose task parallel threading approach (Page 3)
From: Sebastian S. <seb...@gm...> - 2009-04-05 01:47:22
On Sun, Apr 5, 2009 at 1:52 AM, Jarkko Lempiainen <al...@gm...> wrote:
> You could also not have a wait for dependent tasks within a task, but rather have the job queue dependency system take care of it by splitting the function in two tasks. Or you could implement the wait by pulling tasks from the job queue until the dependent tasks have been completed. There are no context switches needed in either case though.

The first option seems limited to me - you really want to be able to just stick a parallel for in a task and have it work, without having to do a whole bunch of magic outside the task itself to split it up. The Cilk model is pretty much what we're going for.

The second is an option I guess, but it could lead to bubbles if you happen to pick up a long-running task. Imagine two massive parallel for loops: the first one finishes and syncs, and then, using your suggestion, goes off and steals work from another thread which takes a long time to complete. Only after that's finished can it continue with the next parallel for loop (which spawns enough tasks to keep the system saturated), but by that time the entire system may have been starved for the duration of that long-running task, just because the original worker handled its blocking by doing something else rather than yielding the core.

It's probably more attractive to just make sure no workers are blocked needlessly, in case the very first thing they do is to spawn more tasks (again, I see this as quite common - lots of stuff where you fork tons of tiny tasks, wait for them to finish, then spawn tons of tiny tasks again for the next bit of work).

> From: Sebastian Sylvan
> Sent: Sunday, April 05, 2009 3:31 AM
> Subject: Re: [Algorithms] General purpose task parallel threading approach
>
> On Sun, Apr 5, 2009 at 1:10 AM, Jarkko Lempiainen <al...@gm...> wrote:
>> I think you really need to target both task and data level parallelism to make your game engine truly scalable for many-core architectures. I don't see why you would need to consider the cost of context switching in relation to the cost of a task execution though, since if you've got a job queue, each worker thread just pulls tasks from the queue without switching context.
>
> The problem is when you have a task spawn N other tasks and then wait for their results: the core running that worker needs to do something until the sub-tasks have completed. In practice, I do think that this kind of synchronisation (specifically the fork-and-join model) will be extremely common in the future, which means that the context switch needs to be fast (Windows 7's user mode scheduling may be useful here - in fact I believe that's what Microsoft's native concurrency runtime uses, while the .NET one is backed by thread pools, I believe).
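For illustration, a minimal fork-join sketch of the "stick a parallel for in a task" idea, in the spirit of the Cilk model: the loop splits its range recursively, spawns one half, runs the other half inline, and syncs before returning. std::async stands in for a hypothetical scheduler's spawn() here just to keep the snippet self-contained; a real job system would push the spawned half onto a work-stealing deque instead. The grain size and names are made up.

    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <future>

    // Recursively split [begin, end): spawn the right half as a task, run the
    // left half on this worker, then join. With a work-stealing scheduler the
    // spawned half only moves to another core if someone is idle enough to steal it.
    void parallel_for(std::size_t begin, std::size_t end, std::size_t grain,
                      const std::function<void(std::size_t)>& body)
    {
        if (end - begin <= grain) {            // small enough: just run it inline
            for (std::size_t i = begin; i < end; ++i)
                body(i);
            return;
        }
        std::size_t mid = begin + (end - begin) / 2;
        auto right = std::async(std::launch::async,   // stand-in for scheduler.spawn()
                                parallel_for, mid, end, grain, std::cref(body));
        parallel_for(begin, mid, grain, body);        // left half runs on this thread
        right.wait();                                 // the "sync" before returning
    }

    int main()
    {
        std::atomic<long> sum{0};
        parallel_for(0, 4096, 256, [&](std::size_t i) { sum += static_cast<long>(i); });
        return sum == 4096L * 4095 / 2 ? 0 : 1;
    }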
From: Jarkko L. <al...@gm...> - 2009-04-05 20:40:38
There isn't "whole bunch of magic" outside the task you would need to do with the first option. The task just adds another function to the job queue which does the work after the wait. For second option it is job queues task to schedule the tasks for optimal throughput, so for wait it could for example favor tasks blocking the task from proceeding. Cheers, Jarkko _____ From: Sebastian Sylvan [mailto:seb...@gm...] Sent: Sunday, April 05, 2009 4:47 AM To: Game Development Algorithms Subject: Re: [Algorithms] General purpose task parallel threading approach On Sun, Apr 5, 2009 at 1:52 AM, Jarkko Lempiainen <al...@gm...> wrote: You could also not have a wait for dependent tasks within a task, but rather have job queue dependency system to take care of it by splitting the function in two tasks. Or you could implement the wait by pulling tasks from the job queue until the dependent tasks have been completed. There is no context switches needed in either case though. The first option seems limited to me - you really want to be able to just stick a parallel for in a task and have it work without having to do a whole bunch of magic outside the task itself to split it up. The Cilk model is pretty much what we're going for. The second is an option I guess, but it could lead to bubbles if you happen to pick up a long running task. Imagine two massive parallel for loops, the first one finishes and syncs, and then using your suggestion goes off and steals work from another thread which takes a long time to complete, and only after that's finished can it continue with the next parallel for loop (which spawns enough tasks to keep the system saturated), but by that time the entire system may have been starved for the duration of that long running task just because the original worker handled its blocking by doing something else rather than yielding the core. It's probably more attractive to just make sure no workers are blocked needlessly in case the very first thing they do is to spawn more tasks (again, I see this as quite common - lots of stuff where you fork tons of tiny tasks, wait for them to finish, then spawn tons of tiny tasks again for the next bit of work). _____ From: Sebastian Sylvan [mailto:seb...@gm...] Sent: Sunday, April 05, 2009 3:31 AM To: Game Development Algorithms Subject: Re: [Algorithms] General purpose task parallel threading approach On Sun, Apr 5, 2009 at 1:10 AM, Jarkko Lempiainen <al...@gm...> wrote: I think you really need to target for both task and data level parallelism to make your game engine truly scalable for many-core architectures. I don't see why you would need to consider the cost of context switching in relation to const of a task execution though since if you got a job queue, each worker thread just pulls tasks from the queue without switching context. The problem is when you have a task spawn N other tasks and then wait for their results, the core running that worker needs to do something until the two sub-tasks have completed. In practice, I do think that this kind of synchronisation (specifically the fork-and-join model) will be extremely common in the future, which means that context switch needs to be fast (Windows 7's user mode scheduling may be useful here - in fact I believe that's what Microsoft's native concurrency runtime uses, while the .Net one is backed by thread pools, I believe). 
-- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862 ---------------------------------------------------------------------------- -- _______________________________________________ GDAlgorithms-list mailing list GDA...@li... https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list Archives: http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list -- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862 |
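A minimal sketch of what that first option might look like, assuming a hypothetical job_queue with a submit() call and a shared dependency counter (none of these names are from a specific library): the function is split in two, and the last child task to finish enqueues the continuation instead of anyone blocking.

    #include <atomic>
    #include <cstddef>
    #include <functional>
    #include <memory>
    #include <vector>

    // Hypothetical job queue interface: submit() hands a closure to the workers.
    struct job_queue {
        std::vector<std::function<void()>> jobs;          // stand-in storage
        void submit(std::function<void()> fn) { jobs.push_back(std::move(fn)); }
    };

    // Part A kicks off the children and registers the continuation (part B).
    // Instead of blocking, the last child to finish enqueues the continuation.
    void update_ai_part_a(job_queue& q)
    {
        auto pending = std::make_shared<std::atomic<int>>(3);   // 3 child jobs
        auto part_b  = [&q]() { /* ...the work that used to run after the wait... */ };

        for (int i = 0; i < 3; ++i) {
            q.submit([pending, part_b, &q, i]() {
                (void)i;                          // ...child work for slice i...
                if (pending->fetch_sub(1) == 1)   // last child in: schedule part B
                    q.submit(part_b);
            });
        }
    }

    int main()
    {
        job_queue q;
        update_ai_part_a(q);
        // Single-threaded drain, just to exercise the sketch.
        for (std::size_t i = 0; i < q.jobs.size(); ++i) q.jobs[i]();
    }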
From: Sebastian S. <seb...@gm...> - 2009-04-05 22:43:59
On Sun, Apr 5, 2009 at 9:40 PM, Jarkko Lempiainen <al...@gm...> wrote:
> There isn't a "whole bunch of magic" outside the task that you would need to do with the first option. The task just adds another function to the job queue which does the work after the wait. For the second option it is the job queue's task to schedule the tasks for optimal throughput, so for a wait it could, for example, favor tasks that are blocking the waiting task from proceeding.

That's a whole bunch of magic to me. If you want pervasive parallelism (and Mr Amdahl says we do), you're going to need sync points all over the place. Currently we're not doing that because we don't *need* to do it yet, but if you have 16-32 threads to keep busy, you bet you're going to need to spawn off tasks all over the place (and usually sync immediately afterwards). Having to manually split all of these things up into lots of tiny tasks is a big issue IMO - it's tedious, error prone, and adds more overhead (the cost of starting a task still isn't free - so doing it, say, 5-10x as often is a big deal).

> With job stealing from other threads you don't get global prioritization, so one thread may be executing lower priority tasks and not working towards optimal throughput.

Why do you actually need global prioritization? After all, you need to finish everything by the end of the frame anyway, so why would you care if task A gets executed before task B, so long as you never get any bubbles where a core is idle? All you care about is maximal throughput, maximal locality (especially true for in-order CPUs) and minimal contention/overhead, and for that a giant shared queue is sub-optimal for two reasons:

* It's inherently non-scalable. If you have lots of threads reading from the same queue you'll get contention. Even with a relatively modest number of threads this starts becoming an issue. Work stealing has negligible contention, because you have to be unlucky enough to get someone trying to steal from the bottom of your stack at the same time as you run out of tasks to do (so you're fighting for the very last item in the stack). That's extremely unlikely to happen, and it's even more unlikely for two threads to try to steal from the same stack - whereas a shared work queue is constantly being hit by multiple threads.

* You lose locality. With a work-stealing approach you get good locality because you run on the same thread the task was spawned on, and tasks spawned close together in time get run close together in time, meaning good cache coherency. If tasks are spread out over multiple cores, and over time, you'll get less coherency.
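To make the stack/steal terminology concrete, here is a deliberately simplified per-worker deque, protected by a mutex rather than the lock-free scheme a real scheduler (Cilk, TBB) uses; the point is only the asymmetric access pattern: the owner pushes and pops at one end (LIFO, good locality), thieves take from the other end, so they only collide when the deque is nearly empty.

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <vector>

    using task = std::function<void()>;

    // One of these per worker thread. The owner uses the back, thieves use the
    // front, so they only contend when the deque is almost drained.
    class work_deque {
    public:
        void push(task t) {                      // owner: spawn a child task
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push_back(std::move(t));
        }
        std::optional<task> pop() {              // owner: LIFO, most recently spawned first
            std::lock_guard<std::mutex> lock(m_);
            if (tasks_.empty()) return std::nullopt;
            task t = std::move(tasks_.back());
            tasks_.pop_back();
            return t;
        }
        std::optional<task> steal() {            // thief: FIFO, oldest (usually largest) task
            std::lock_guard<std::mutex> lock(m_);
            if (tasks_.empty()) return std::nullopt;
            task t = std::move(tasks_.front());
            tasks_.pop_front();
            return t;
        }
    private:
        std::mutex m_;
        std::deque<task> tasks_;
    };

    // Worker loop sketch: run your own work first, steal only when you run dry.
    void worker_loop(std::vector<work_deque>& all, std::size_t self)
    {
        for (;;) {
            if (auto t = all[self].pop()) { (*t)(); continue; }
            bool stole = false;
            for (std::size_t v = 0; v < all.size() && !stole; ++v) {
                if (v == self) continue;
                if (auto t = all[v].steal()) { (*t)(); stole = true; }
            }
            if (!stole) break;   // nothing anywhere; a real scheduler would back off/sleep
        }
    }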
From: Jarkko L. <al...@gm...> - 2009-04-05 20:45:07
Why is it a problem to execute the tasks in a recursive way? Note that this recursive way of executing tasks is used only when blocking within tasks, not in the job queue dependency system.

Cheers, Jarkko

On Sunday, April 05, 2009, asy...@gm... wrote:
> Without the ability to save/restore tasks your dependency system will have a big bunch of limitations, because you can only execute tasks during a wait in a recursive way ONLY.
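For reference, a minimal sketch of that recursive wait, with a mutex-protected deque standing in for the shared job queue (all names are made up): the blocked worker keeps popping and running other jobs until the one it is waiting on is done, so every borrowed job nests on top of the waiting one on this worker's stack.

    #include <atomic>
    #include <deque>
    #include <functional>
    #include <mutex>

    struct job {
        std::function<void()> fn;
        std::atomic<bool> done{false};
    };

    // Stand-in for the shared job queue (a real one would be lock-free or finer-grained).
    static std::deque<job*> g_queue;
    static std::mutex g_queue_mutex;

    static job* try_pop_shared_queue()
    {
        std::lock_guard<std::mutex> lock(g_queue_mutex);
        if (g_queue.empty()) return nullptr;
        job* j = g_queue.front();
        g_queue.pop_front();
        return j;
    }

    // Wait for 'j' without yielding the core: keep pulling and running other jobs.
    // Every job run here nests on top of the waiting one on this worker's stack,
    // which is the "recursive way" being discussed - it can't be suspended and resumed.
    static void wait_for(job& j)
    {
        while (!j.done.load(std::memory_order_acquire)) {
            if (job* other = try_pop_shared_queue()) {
                other->fn();
                other->done.store(true, std::memory_order_release);
            }
            // else: a real scheduler would back off or sleep here instead of spinning.
        }
    }

    int main()
    {
        job a;
        a.fn = [] { /* the dependent work */ };
        {
            std::lock_guard<std::mutex> lock(g_queue_mutex);
            g_queue.push_back(&a);
        }
        wait_for(a);   // this "wait" ends up popping and running 'a' itself
    }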
From: <asy...@gm...> - 2009-04-06 05:58:34
You are probably referring to some specific implementation of the task-stealing approach? Task stealing can be implemented with and without a global queue. Also, the trend of increasing core counts results in improved inter-core communication mechanisms as well (since otherwise it becomes a bottleneck), so I'm not sure that the global queue approach (together with stealing or not) is as bad as you say.

Alexander
From: Sebastian S. <seb...@gm...> - 2009-04-06 07:27:53
I mean the standard one-queue-per-worker, with work stealing to seed a worker that's run out of stuff to do. I'm not sure what work stealing would even mean with a global queue? Look up the Cilk papers for some more data on the benefits of this. Improved inter-core communication is all well and good, but not as good as avoiding inter-core communication. Having even just six to eight threads all trying to CAS the same pointer *is* going to lead to some contention overhead if your tasks are fine-grained enough (and they will have to be eventually).
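For illustration, this is the kind of single hot pointer being described - a Treiber-style stack where every worker's push and pop has to CAS the same head. It's only a sketch (it ignores the ABA problem and node reclamation); the point is that each failed compare_exchange is a retry plus a cache-line ping-pong between cores, and that cost grows with thread count and with finer-grained tasks.

    #include <atomic>

    struct node {
        void (*run)(void*);
        void* data;
        node* next;
    };

    // Every worker, on every push and pop, contends on this one atomic head pointer.
    std::atomic<node*> g_head{nullptr};

    void push(node* n)
    {
        node* old = g_head.load(std::memory_order_relaxed);
        do {
            n->next = old;
        } while (!g_head.compare_exchange_weak(old, n,
                     std::memory_order_release, std::memory_order_relaxed));
    }

    node* pop()
    {
        node* old = g_head.load(std::memory_order_acquire);
        while (old &&
               !g_head.compare_exchange_weak(old, old->next,
                   std::memory_order_acquire, std::memory_order_acquire)) {
            // Each failed CAS is a retry caused by another core touching the same
            // cache line - this is the contention that grows with thread count.
        }
        return old;
    }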
From: <asy...@gm...> - 2009-04-06 08:13:55
I didn't get how you would handle priorities in a fair way? Wouldn't you need an expensive search over all the local per-processor queues for that? Or how would you even out tasks with high/low priorities among the per-CPU queues? I think there is no best approach for all cases; the final one depends on the set of features you'd like to have in your scheduler, which should of course be the best of the alternatives with regard to lowering contention.

Alexander
From: Jarkko L. <al...@gm...> - 2009-04-06 07:37:55
I was under the impression that you should minimize those sync points and dependencies for optimal throughput. If you have the ideal situation where your entire app is split into small tasks, then I guess you wouldn't need to prioritize, but I believe the reality is that you've got some large tasks whose execution needs to start as early in the frame as possible to avoid stalling other threads at the end of the frame. So you need to prioritize those tasks, and the tasks those large tasks depend on, high for execution.

I agree that the ability to wait within a task is more convenient, particularly if you have to do it deep inside a nested loop. But I don't see spawning new tasks where you can as tedious as you describe, and those tasks don't add as much overhead to the queue as you imply. I.e. when you've got a sync point, you are waiting for on average maybe tens or hundreds of tasks to complete, so the overhead of a single task is negligible in proportion. I doubt that contention will be an issue with a shared queue any time soon (definitely not with the 16-32 threads you are talking about), but agreed that locality is a good point in favor of job stealing.

Cheers, Jarkko
From: Sebastian S. <seb...@gm...> - 2009-04-06 11:55:01
On Mon, Apr 6, 2009 at 8:37 AM, Jarkko Lempiainen <al...@gm...> wrote:
> I was under the impression that you should minimize those sync points and dependencies for optimal throughput. If you have the ideal situation where your entire app is split into small tasks, then I guess you wouldn't need to prioritize, but I believe the reality is that you've got some large tasks whose execution needs to start as early in the frame as possible to avoid stalling other threads at the end of the frame. So you need to prioritize those tasks, and the tasks those large tasks depend on, high for execution. I agree that the ability to wait within a task is more convenient, particularly if you have to do it deep inside a nested loop. But I don't see spawning new tasks where you can as tedious as you describe, and those tasks don't add as much overhead to the queue as you imply. I.e. when you've got a sync point, you are waiting for on average maybe tens or hundreds of tasks to complete, so the overhead of a single task is negligible in proportion. I doubt that contention will be an issue with a shared queue any time soon (definitely not with the 16-32 threads you are talking about), but agreed that locality is a good point in favor of job stealing.

Realistically most code isn't embarrassingly parallel and has lots of complicated, and dynamic, flow, which would be either inconvenient or impossible to deal with if you have an "up front" dependency graph. Yes, by all means try to minimize dependencies, but don't count on succeeding to any major extent. IMO real code has loads of non-parallel dependencies, with tiny pockets of parallelism embedded within them. To get pervasive parallelism I predict you'll have to write lots and lots of code where you kick off 2-3 parallel tasks and then immediately sync, over and over again. You'll have a couple of huge parallel for loops too, but most code isn't like that.

You still have some room for choosing when to start a task. For example, you want to put tasks that are likely to spawn lots of other tasks close to the next sync point so that they get executed first (assuming the worker is using LIFO ordering). That doesn't need a priority system though.

The way I see it, tasks are for "burst parallelism". I.e. you have a certain amount of work that needs to be finished and that's all you care about. Prioritizing between those tasks isn't really a parallelism issue, it's more concurrency/coordination and probably belongs on a different level of abstraction (i.e. threads), IMO. You could, however, easily have one worker thread per priority level, with work stealing only stealing within its own priority. Meaning it wouldn't be too hard to prioritize things in the rough (i.e. the very root function that kicks off tasks could stick them on a higher priority worker).

The relative inefficiency of a global queue doesn't just depend on the number of threads, but also on the granularity of tasks. Even with a few hundred to a thousand tasks a global queue starts losing significantly on current hardware (a mere 6 threads) according to the benchmarks from the concurrency talk at the last Gamefest. So it doesn't look too good for 16-32 threads, IMO.

> Actually, after a bit of extra thought on locality, I think I'll take back that "locality is a good point in favor of job stealing" part. When you've got a shared cache between threads, locality is actually in favor of the shared job queue, because all the worker threads are executing the same type of task, thus it doesn't put as much pressure on the L2 cache as it would if each thread were executing completely unrelated tasks.

Well I guess that COULD be true in principle, however ordering by priority is not the same as ordering by locality, so you're really just making a guess that tasks of similar priorities will tend to access the same memory, whereas a work-stealing approach makes the guess that tasks scheduled in close proximity will work on the same data. With hundreds or thousands of tasks I must say I find the latter to be much more likely (and benchmarks seem to confirm it). Also, moving tasks around randomly means (with a significant number of workers) that you're almost guaranteeing a bunch of L1 cache misses at the start of every single task. L1 caches may not be a big deal, but with lots of tiny tasks these things add up...

Also, it seems that just about all the major task libraries coming out have come to the conclusion that work stealing is much faster, and I've yet to see a single benchmark showing that this current recommended practice is incorrect.
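A rough sketch of the "one worker per priority level" arrangement (the scheduler type and all names here are hypothetical): each hardware thread hosts one OS thread per priority level, OS thread priorities do the preemption, and stealing stays within a level.

    #include <cstddef>
    #include <thread>
    #include <vector>

    enum class task_priority { high = 0, normal = 1, low = 2 };

    // Hypothetical: one independent work-stealing scheduler per priority level,
    // each owning one deque per hardware thread; stealing never crosses levels.
    struct stealing_scheduler {
        void run_worker(unsigned hw_thread) { (void)hw_thread; /* pop-or-steal loop */ }
    };

    void start_workers(std::vector<stealing_scheduler>& per_priority)
    {
        unsigned hw_threads = std::thread::hardware_concurrency();
        for (unsigned hw = 0; hw < hw_threads; ++hw) {
            for (std::size_t p = 0; p < per_priority.size(); ++p) {
                std::thread worker(&stealing_scheduler::run_worker, &per_priority[p], hw);
                // Platform-specific (omitted): raise the OS priority of the level-0
                // workers (SetThreadPriority / pthread_setschedparam) so a runnable
                // high-priority worker always preempts the lower levels on that core.
                worker.detach();   // kept simple here; a real engine would join at shutdown
            }
        }
    }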
From: Jarkko L. <al...@gm...> - 2009-04-07 10:02:27
If you've got only 2-3 parallel tasks on average, then it doesn't make much sense to multi-thread the code anyway, particularly if you use work stealing. It sounds like you assume that you have some work starvation issues and that there is always a thread idling and ready to steal work, but to me that doesn't sound like a healthy assumption to build your work scheduling around, unless you're trying to patch an engine not built for multi-threading. If that's the situation you are facing, it sounds to me like you have some more serious multi-threading and scalability issues in your game anyway.

Regarding task prioritization, you could of course have a thread per priority, but it doesn't really help getting high priority tasks completed early in the frame; you would need all the threads to contribute to running those tasks.

I didn't mean either that tasks with the same/similar priority are local in relation to each other per se, but that threads gain locality by running the same function provided by the job queue for different data sets when utilizing data level parallelism. E.g. in the job queue you have an animate_character(character*) function where multiple threads are simultaneously scheduled to run the function with different character*, so you get less pressure on the L2/L3 cache (even the L1 cache on Xbox) due to the shared data between characters.

You are right that fine-grained tasks increase shared job queue contention, but then again I don't think you should put very fine-grained tasks in the job queue anyway, but run those tasks sequentially instead, which has superior performance and simplicity to any task system (if you don't count the cache effect, that is). I don't really see any benefit in aiming for very fine task granularity unless you're trying to fill in those aforementioned task starvation bubbles. Unfortunately I haven't had a chance to test my shared job queue past 2 cores so I can't really tell how well it scales in practice, but since it's utilizing basic lock-free algorithms, I presume it scales reasonably well.

Cheers, Jarkko
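For concreteness, roughly what that submission might look like against a hypothetical shared job queue (submit/wait_all and the counter are assumptions, not a real API): every worker ends up running the same animate_character code over different character pointers, which is where the shared-cache benefit described above comes from.

    #include <atomic>
    #include <cstddef>
    #include <vector>

    struct character;                       // engine type, defined elsewhere
    void animate_character(character* c);   // the per-item job from the discussion

    // Hypothetical shared job queue interface.
    struct job_queue {
        template <class Fn>
        void submit(Fn fn, std::atomic<int>* counter);  // enqueue; decrement counter on completion
        void wait_all(std::atomic<int>* counter);       // run/help until the counter hits zero
    };

    void animate_all(job_queue& q, std::vector<character*>& chars)
    {
        std::atomic<int> remaining(static_cast<int>(chars.size()));
        for (character* c : chars)
            q.submit([c] { animate_character(c); }, &remaining);
        q.wait_all(&remaining);   // all workers run the same code over different characters
    }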
From: Jon W. <jw...@gm...> - 2009-04-07 17:14:31
Jarkko Lempiainen wrote:
> I don't really see any benefit in aiming for very fine task granularity unless you're trying to fill in those aforementioned task starvation bubbles. Unfortunately I haven't had a chance to test my shared job queue past 2 cores so I can't really tell how well it scales in practice, but since it's utilizing basic lock-free algorithms, I presume it scales reasonably well.

If you had 64 hardware threads, would your view change? We'll be getting there in just a few years. Code you write today may very well be expected to scale to that level, or it will be considered obsolete and have to be replaced.

There is the job queue implementation question (which is what we're mostly talking about here), which contains all kinds of micro-optimization opportunities (as shown by the asynclib approach) that may or may not apply in a particular game situation. Then there's the question of how to granularize your game so that it can actually smear the workload across 64 hardware threads (or some other, equally "large" number). That's, in the end, a much more interesting question. Whether your job system uses 100, 1,000 or 10,000 cycles to switch jobs can easily be measured, but doesn't really affect the scalability of your system. Whether you can actually break the work of a single frame into 500 individual tasks with as few serial dependencies as possible is much more important in the big scheme of things.

Sincerely, jw
From: Jarkko L. <al...@gm...> - 2009-04-07 18:49:27
No, it wouldn't change my view, because that's what I'm already thinking. What would change my view, though, is if there were hundreds of threads which ran at significantly lower frequency (e.g. a few hundred MHz instead of 2-3 GHz), in which case you would need to split your work into smaller pieces. Like I said to Sebastian, the only thing you care about is that your job's execution time is a relatively small fraction of the entire frame time, so in the end it's the frame time and thread frequency that determine your ideal task size. The ideal task size isn't the smallest possible unit of work, though, because of the overhead of distributing the work. Well, that's what I think anyway (:

Cheers, Jarkko
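As a back-of-the-envelope version of that sizing argument (the numbers below are made up, purely to illustrate the arithmetic): pick a target job duration as a fraction of the frame, then derive how many items to batch per job from an estimated per-item cost.

    #include <algorithm>
    #include <cstddef>

    // Illustrative only: derive a batch size so each job is a small, fixed fraction
    // of the frame. E.g. a 16.7 ms frame with a 5% target gives ~0.8 ms jobs; if one
    // item costs ~4 us, that's roughly 200 items per job.
    std::size_t items_per_job(double frame_ms, double target_fraction,
                              double item_cost_us, std::size_t total_items)
    {
        double target_job_us = frame_ms * 1000.0 * target_fraction;
        std::size_t n = static_cast<std::size_t>(target_job_us / item_cost_us);
        return std::max<std::size_t>(1, std::min(n, total_items));
    }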
From: Jon W. <jw...@gm...> - 2009-04-06 16:47:52
asy...@gm... wrote:
> Where do you get a deadlock from?

If a specific operation is not cancellable, and waits for completion no matter what, then that means that cancelling the fiber will have to wait for that operation to complete. If the operation in turn needs another task to complete, which may be later in the work queue, or require locks/resources currently held by the fiber doing the cancellation, then you have a deadlock.

In general, non-cancellable operations are bad (similar to non-signal-breakable syscalls in UNIX). The right thing to do is to make any operation cancellable, and we've found that the best way to make sure nobody does the wrong thing is to simply throw an exception out of the yield call when an operation is cancelled. Proper use of smart pointers and RAII will then take care of the unwind semantics.

That being said, blocking is still rare in game loops; we use all of this only for the transactional application server back-end.

Sincerely, jw
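A small sketch of the pattern described here, with hypothetical names (task_cancelled, yield_point, the cancel flag): every yield/blocking point checks the cancellation flag and throws, and RAII members release their resources as the exception unwinds the task.

    #include <atomic>
    #include <memory>
    #include <stdexcept>
    #include <vector>

    struct task_cancelled : std::runtime_error {
        task_cancelled() : std::runtime_error("task cancelled") {}
    };

    struct task_context {
        std::atomic<bool> cancel_requested{false};

        // Called at every yield/blocking point inside a task. If the task has been
        // cancelled, unwind via an exception so destructors release locks/resources.
        void yield_point() {
            if (cancel_requested.load(std::memory_order_acquire))
                throw task_cancelled();
            // ...otherwise switch to another fiber / run another task here...
        }
    };

    void stream_in_asset(task_context& ctx)
    {
        auto buffer = std::make_unique<std::vector<char>>(64 * 1024);  // RAII-owned resource
        for (int block = 0; block < 1024; ++block) {
            ctx.yield_point();          // may throw task_cancelled
            // ...read the next block into *buffer...
        }
    }   // if cancelled, the exception propagates and 'buffer' is freed on the way out

    int main()
    {
        task_context ctx;
        ctx.cancel_requested = true;    // simulate a cancel before the task runs
        try { stream_in_asset(ctx); }
        catch (const task_cancelled&) { /* task unwound cleanly */ }
    }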
From: <asy...@gm...> - 2009-04-06 17:21:32
> If the operation in turn needs another task to complete, which may be later in the work queue, or require locks/resources currently held by the fiber doing the cancellation

That's no problem in my implementation. Both tasks will be suspended until the non-cancellable wait operation finishes, and the worker thread will be working on something else. Also, async::Task::Cancel only marks a task for cancelling; you need to explicitly wait on a task to finish to reach the point where a cancelled task no longer exists. If in turn the wait operation never expires, then it is a problem in the task logic, and cancelling is not supposed to fix that. I already have a dozen applications written with asynclib, and strict user control over what to cancel and what not turned out to be the best thing. You can still apply your own usage, and make each operation cancellable and throw an exception - it is up to you.

Alexander
From: Sebastian S. <seb...@gm...> - 2009-04-07 11:40:53
On Tue, Apr 7, 2009 at 11:02 AM, Jarkko Lempiainen <al...@gm...> wrote:
> If you've got only 2-3 parallel tasks on average, then it doesn't make much sense to multi-thread the code anyway, particularly if you use work stealing. It sounds like you assume that you have some work starvation issues and that there is always a thread idling and ready to steal work, but to me that doesn't sound like a healthy assumption to build your work scheduling around, unless you're trying to patch an engine not built for multi-threading. If that's the situation you are facing, it sounds to me like you have some more serious multi-threading and scalability issues in your game anyway.

That's not what I meant. The total number of tasks should ideally be in the hundreds or thousands at a peak, and at least several dozen on average; I'm saying that the branching factor for a lot of that is going to be around 2-3. But of course each of those tasks may kick off another 2-3 tasks (and then synchronize before returning its result). So you get a tree with lots of nodes (synchronisation points), but the number of nodes at each level (i.e. stuff that can run in parallel) can still be hundreds. Combine this with the rare (but very welcome!) big giant parallel loops (i.e. "for each of the thousands of game objects, tick in parallel"), and there should be plenty of tasks to keep your scheduler sweating.

> Regarding task prioritization, you could of course have a thread per priority, but it doesn't really help getting high priority tasks completed early in the frame; you would need all the threads to contribute to running those tasks.

Yes, and that would still work. If you have three priority levels, you'd have three worker threads per hardware thread. Mid-priority worker threads would not get to run until the high-priority worker thread is blocked, which will only happen when every single high-priority task is completed. And again, it doesn't seem like you need priorities to get what you want, since scheduling with work stealing is reasonably predictable: if you really do need something to run first (because it's likely to produce enough child tasks to keep all threads busy) then you just spawn it closer to the next sync point.

> I didn't mean either that tasks with the same/similar priority are local in relation to each other per se, but that threads gain locality by running the same function provided by the job queue for different data sets when utilizing data level parallelism. E.g. in the job queue you have an animate_character(character*) function where multiple threads are simultaneously scheduled to run the function with different character*, so you get less pressure on the L2/L3 cache (even the L1 cache on Xbox) due to the shared data between characters.

But if you *interrupt* (in the sense that you push some of the tasks related to a particular area to the bottom of the queue) what the threads were already doing just to get them to run animate_character instead, then they will have to resume the previous tasks later and get lots of cache misses. So it's probably better to let them finish the work they're doing (which is hot in the cache) first, and *then* start spreading animate_character out over all cores. Which is precisely what you get with work stealing. If running different tasks is hurting your cache, you're of course free to insert a sync call before starting animation to make sure all pending tasks are done before you start the animation stuff. My gut instinct on this says "priorities mean shuffling, and shuffling means loss of locality"...

> You are right that fine-grained tasks increase shared job queue contention, but then again I don't think you should put very fine-grained tasks in the job queue anyway, but run those tasks sequentially instead, which has superior performance and simplicity to any task system (if you don't count the cache effect, that is). I don't really see any benefit in aiming for very fine task granularity unless you're trying to fill in those aforementioned task starvation bubbles. Unfortunately I haven't had a chance to test my shared job queue past 2 cores so I can't really tell how well it scales in practice, but since it's utilizing basic lock-free algorithms, I presume it scales reasonably well.

Honestly, I think we won't have a choice but to have very fine-grained tasks if we're going to hope to keep 32+ threads busy. Cilk claims an overhead of about 3x the cost of a function call (IIRC) on PowerPC architectures, so it seems to me that pretty much anything you're not already marking for inlining could probably run as a task (especially if you use work stealing, where it will just run immediately on the very same thread, meaning essentially unchanged performance, in case there aren't any free workers to parallelize it). In fact, I don't think tasks are fine-grained enough! We're probably going to need to do data parallelism on the CPU too for even more fine-grained parallelism.

Sebastian
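As a small illustration of the "insert a sync call before starting animation" point (the scheduler interface and the 32-item cutoff are made up): drain whatever is in flight first so the batch gets the caches to itself, and skip the task machinery entirely for batches too small to pay for it.

    #include <cstddef>
    #include <functional>
    #include <vector>

    struct character;
    void animate_character(character* c);

    // Hypothetical scheduler interface, in the spirit of the discussion above.
    struct scheduler {
        void sync();  // help/wait until all previously spawned tasks are complete
        void parallel_for(std::size_t begin, std::size_t end,
                          const std::function<void(std::size_t)>& body);
    };

    void update_animation(scheduler& sched, std::vector<character*>& chars)
    {
        sched.sync();                       // finish the in-flight batch first (it's cache-hot)

        if (chars.size() < 32) {            // tiny batch: task overhead would outweigh the work
            for (character* c : chars)
                animate_character(c);
            return;
        }

        sched.parallel_for(0, chars.size(),
                           [&](std::size_t i) { animate_character(chars[i]); });
        sched.sync();                       // join before anything downstream reads the poses
    }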
From: Jarkko L. <al...@gm...> - 2009-04-07 18:35:26
Yes, your prioritization would still function properly if you assigned a thread per priority, of course, but it wouldn't result in optimal throughput, which is the whole point after all. You need all your worker threads to work on the top-priority tasks in order to have optimal throughput, which requires that all the threads work on a single prioritized queue of jobs. Job stealing doesn't achieve this because each worker thread has its own queue. I also never said that tasks should be interrupted prematurely; you should let them finish and start the next high-priority task afterwards.

Having fine-granularity tasks doesn't help regardless of how many threads you have. The only thing you really care about is that your jobs take, let's say, <~10% of your frame time (1-3ms), which isn't very high granularity. Some might take longer, but those are also prioritized higher to avoid stalling other threads at the end of the frame. So if you can animate all your on-screen characters in 1-3ms, you shouldn't bother multi-threading the code EXCEPT if you want to gain shared cache coherence. However, with work stealing you wouldn't even gain this, because each thread works on its own set of jobs.

Cheers,
Jarkko
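For comparison, the single shared, prioritized job queue described above could look roughly like the sketch below. It uses a mutex and std::priority_queue purely for brevity; the queue described in the thread is lock-free, which this is not.

    // Bare-bones single shared prioritized job queue (mutex-based sketch;
    // the one discussed in the thread is lock-free).
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct job {
        int priority;                  // higher runs first
        std::function<void()> run;
    };

    struct by_priority {
        bool operator()(const job& a, const job& b) const { return a.priority < b.priority; }
    };

    class shared_job_queue {
        std::priority_queue<job, std::vector<job>, by_priority> jobs_;
        std::mutex mutex_;
        std::condition_variable ready_;
    public:
        void push(job j) {
            { std::lock_guard<std::mutex> lock(mutex_); jobs_.push(std::move(j)); }
            ready_.notify_one();
        }
        job pop() {
            // Every worker pulls from this one queue, so all of them
            // converge on the highest-priority work first.
            std::unique_lock<std::mutex> lock(mutex_);
            ready_.wait(lock, [&] { return !jobs_.empty(); });
            job j = jobs_.top();
            jobs_.pop();
            return j;
        }
    };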
|
From: Sebastian S. <seb...@gm...> - 2009-04-07 19:55:21
|
On Tue, Apr 7, 2009 at 7:35 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Yes, your prioritization would still function properly if you assigned
> a thread per priority of course, but it wouldn't result in optimal
> throughput, which is the whole point after all. You need all your worker threads to
> work on the top-priority tasks in order to have optimal throughput, which
> requires that all the threads work on the single prioritized queue of jobs.
> Job stealing doesn't achieve this because each worker thread has its own
> queue. I also never said that tasks should be interrupted prematurely, but
> you should let them finish and start the next high-priority task after.

I didn't say an individual task would be interrupted, but if you have 100 tasks working on something and you kick off 20 tasks to do your animation at a high priority when only 5 of those 100 tasks have completed, you'll evict a bunch of data from the cache and then you still have 95 tasks to go; then you do another 5 and something else pops up to the top, evicting the data again, and so on. It's better for the entire batch of 100 tasks to run to completion. With work stealing, large batches of tasks that would benefit from sharing data across cores would naturally start to fill up the worker threads as the previous tasks finish - and if you really want to make sure that's the case, just call sync() before.

> Having fine-granularity tasks doesn't help regardless of how many threads you
> have. The only thing you really care about is that your jobs take, let's say,
> <~10% of your frame time (1-3ms), which isn't very high granularity. Some might take
> longer, but they are also prioritized higher to avoid stalling other threads
> at the end of the frame. So if you can animate all your on-screen characters
> in 1-3ms you shouldn't bother multi-threading the code EXCEPT if you want to
> gain shared cache coherence. However, with work stealing you wouldn't even
> gain this because each thread works on its own set of jobs.

I'm not sure I follow you? If I can take something that takes 3ms to finish and get it done in 0.5ms because I can use six threads instead of one, then that's clearly a win. Even if I get an unreasonable amount of overhead and it takes 1ms (50% overhead!), it's still a 3x boost as far as the overall frame time is concerned.

What you care about is that the set of tasks you need to complete in a frame is done as quickly as possible, which means it should waste as few cycles as possible, which means you should avoid contention, bubbles due to coarse granularity, and cache thrashing. Fine task granularity is clearly extremely important in order to actually be able to keep loads of threads busy. Of course you need to make sure the overhead of spawning the task isn't higher than the cost of just doing the work directly, but as long as the overhead is even slightly lower than the cost of running the task, it's a win to get some other thread that would otherwise be idle to do it "for free" while you only pay the fixed overhead cost.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
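The grain-size cutoff implied by that last point ("only spawn a task if the work outweighs the fixed overhead") can be sketched roughly as follows. The spawn() here is a placeholder that runs the chunk inline, purely so the sketch stands alone; a real scheduler would enqueue it for stealing and require a sync() afterwards.

    // Grain-size cutoff: split a range into tasks, but once a chunk is small
    // enough that task overhead would dominate, run it inline instead.
    #include <cstddef>
    #include <functional>

    void spawn(std::function<void()> f) { f(); }  // placeholder, runs inline

    template <class Body>
    void parallel_for(std::size_t begin, std::size_t end, std::size_t grain, Body body)
    {
        if (end - begin <= grain || end - begin < 2) {  // not worth a task: run it here
            for (std::size_t i = begin; i < end; ++i) body(i);
            return;
        }
        std::size_t mid = begin + (end - begin) / 2;
        spawn([=] { parallel_for(begin, mid, grain, body); });  // left half as a task
        parallel_for(mid, end, grain, body);                    // right half on this thread
    }

Tuning the grain so that each chunk comfortably exceeds the per-task overhead is the practical knob here.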
|
From: Jarkko L. <al...@gm...> - 2009-04-07 21:30:18
|
Okay, yes, you would evict the data from the cache if a higher-priority task kicks in, assuming the data doesn't fit into the cache simultaneously before you return to work on the original task. Now, if you think of an 8-core system with job stealing, you must fit 8 times the amount of data into the cache, because you have 8 threads working on a total of 8 unrelated tasks simultaneously. This is much worse from a data-eviction point of view than switching to a higher-priority task and returning to work on the original task afterwards. Now think about how this scales to 16, 32, 64 cores.

> If I can take something that takes 3ms to finish and get it done in 0.5ms
> because I can use six threads instead of one then that's clearly a win.

No, it's not a win. See, all you care about is that ALL the work for a frame is completed as soon as possible, not minimizing the time to execute a single task. If you had 50% overhead like you say, then you would have 50% less time to do any useful work. Now, you say it doesn't matter because you have threads sitting idle anyway, but don't you think that instead you have some serious work-starvation issues in your engine, which you are trying to patch with this approach on the grounds that the idle threads aren't a lost resource?

Cheers,
Jarkko
|
From: Sebastian S. <seb...@gm...> - 2009-04-07 22:12:56
|
On Tue, Apr 7, 2009 at 10:30 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Okay, yes you would evict the data from the cache if a higher-priority task
> kicks in, assuming the data doesn't fit into the cache simultaneously before
> returning to work on the original task. Now, if you think of an 8-core system
> with job stealing, you must fit 8 times the amount of data into the cache
> because you have 8 threads working on a total of 8 unrelated tasks
> simultaneously. This is much worse from a data-eviction point of view than
> switching to a higher-priority task and returning to work on the original
> task after. Now think how this scales to 16, 32, 64 cores.

You're assuming that work-stealing scheduling will always work on unrelated tasks, when this isn't the case. If you kick off a large batch of work it will spread out over all workers, it'll just let whatever's already there finish first - exploiting locality.

> > If I can take something that takes 3ms to finish and get it done in
> > 0.5ms because I can use six threads instead of one then that's clearly a win.
>
> No, it's not a win. See, all you care about is that ALL the work for a frame is
> completed as soon as possible, not minimizing the time to execute a
> single task. If you had 50% overhead like you say, then you would have 50%
> less time to do any useful work. Now, you say it doesn't matter because you
> have threads sitting idle anyway, but don't you think that you have some
> serious work-starvation issues in your engine instead, which you try to patch
> with this approach with the conclusion that it's not a lost resource?

If I have 50% overhead, that means that the other 50% is useful work that would've had to run sequentially otherwise! I'm doing 50% MORE useful work on my spare threads.

The purpose of splitting stuff up into parallel tasks is to use idle cores to help speed up work. If you assume that you can keep those cores busy with useful work using Some Other Technique, then you're basically assuming that there is no problem and that we therefore need no solution. That would be great if it were the case, but it seems unrealistic.

The whole discussion is about "how do we use N threads to speed up a game engine"; saying that we shouldn't spawn lots of fine-grained tasks because the overhead in doing so would cost cycles that we could use for "useful work" isn't really helping me, unless you can tell me what other technique you plan to use to actually do this "useful work" that doesn't have the overhead of just ramping up task parallelism so it scales to the number of threads.

Of course you wouldn't trade 100% useful work for 80% useful work and 20% overhead, that would be stupid. Nobody would write a task system with worker threads for a single-threaded machine. What I'm saying is that in the not-so-far-off future, when you start off with, say, 90% idle cycles and only 10% useful work, that 80/20 split is looking a lot more appealing.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
From: Sebastian S. <seb...@gm...> - 2009-04-07 22:17:50
|
Okay, so that maths is wrong; the point is that the overhead doesn't come alone - the overhead is there because I'm getting more useful work done on the other threads, meaning it's an overall win.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
|
From: Jarkko L. <al...@gm...> - 2009-04-08 19:51:19
|
Pushing work to multiple queues doesn't guarantee that it is also executed simultaneously, unless all your threads are idling, which isn't a very happy scenario. To me it sounds like you are in the process of converting a single-threaded engine to a multi-threaded one, where you start from 100%/number_of_threads efficiency, in which case what you say makes sense. I.e. you want the multi-threaded bits to be very local, where you dispatch work to multiple threads and sync right after, and any work you can parallelize is a win over the old design. However, if you have the advantage of designing an engine from scratch for multi-threading, I don't see this as a good approach.

Cheers,
Jarkko
|
From: Sebastian S. <seb...@gm...> - 2009-04-09 17:52:03
|
On Thu, Apr 9, 2009 at 3:28 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Work stealing executes tasks in the order of spawning on a /single/
> thread, but that's not the point. I hate to repeat myself, but in order to
> exploit shared cache coherency, you have to run the same work on /all/ the
> threads simultaneously, which you can't guarantee with work stealing without
> a waste of CPU time. Do you disagree that multiple threads accessing the same
> data exploit shared cache coherency and need supporting evidence of this?
> Intel can provide you with plenty of documentation of this basic cache
> behavior.

I'm well aware of how a cache works, thank you. The point I'm trying to make is that a global work queue does not solve this issue either, and in fact appears to make it worse (which is why nobody is using them for task-parallel systems). Multiple independent threads, or running tasks spawning sub-tasks, can all be interleaved on the global work queue, and not only run in completely random order, but not even stay on the same hardware thread they were spawned on. This isn't good for locality. Mixing priorities into that will only cause tasks to be shuffled around even more.

With work stealing you at least get good locality within each worker thread (L1 cache matters too), with the option of good locality across all threads, because the way tasks are scheduled is reasonably deterministic. And you can augment the algorithm to take caching into account (and get provable bounds for cache behaviour).

Also, think about where the trend for coherency is going - do you think memory will be more or less coherent in the future? It seems pretty clear to me that coherency will be reduced, meaning that the importance of keeping related tasks on the same thread will only increase.

> Could you be a bit more specific about which Gamefest talk you are referring to,
> because I would be very interested to check it out. If you got a link to the
> slides, that would be great. I checked some Cilk slides and videos, and it
> seems that their main motivation is indeed to provide a framework to improve
> parallelism of legacy code, as I suspected. Makes sense to build a business
> model around that, because the majority of ISVs want to gain from multi-core
> architectures without the risk & investment of redesigning/refactoring their
> entire software for multi-core. There was some obvious marketing BS in the
> slides, for example about linear scalability though ;) Anyway, I see Cilk as
> a way to patch your single-threaded app which starts at 100%/num_of_cores
> efficiency level, rather than a good approach if you design your game engine
> from scratch for multi-core.

This one:
http://www.microsoft.com/downloads/details.aspx?FamilyId=F9D74066-0DFD-4A38-9BAD-BFF5E15C9629&displaylang=en

You imply that just because something is easier to retrofit into an existing system it's not useful for something written from scratch, which I don't think is true. Writing it from scratch may make certain things easier, but parallelism is still a very hard problem, and I don't think re-writing things will significantly reduce the issues I've already highlighted - the problem is still there, and still needs a solution.

> Regarding task granularity, what you say about 63 threads waiting for 3ms
> for a task to finish is true if you have a horribly sequential way of
> processing data,

It was a simplified example to make a point. As I said before, the dependencies aren't all at the top level, it's more like a tree. So yes, there's a certain amount of overlap where maybe a few threads can keep doing other tasks that are parallel and independent of animation, but the key point I was trying to make is still there: *real* code has tons of data dependencies all over the place, and this means you need to "over-subscribe" your task system for load-balancing purposes.

> e.g. some legacy single-threaded engine flow which you are patching for
> multi-core in small bits without major refactoring of its sub-systems. In my
> experience, even in some not-so-well-designed multi-threaded engines you have
> multiple sub-systems running simultaneously (rendering, physics, game logic,
> audio, etc.) which could spawn tasks without global stalls of the threads.

I don't think it's a matter of refactoring/design, it's a matter of the data flow. There is no way you can do collision detection against an animated mesh until you've animated it. That's just a fact of the problem that will never change. As I mentioned before, you can pipeline some sequential tasks, but then you add latency, so this doesn't really scale beyond a few tasks, because you end up with hundreds of milliseconds of latency.

So as a one-time optimization you can scale to, say, 3 more threads if you have 3 tasks that are big enough to basically keep a whole thread occupied for most of the frame - what are you going to do for the other 61 threads? The data dependencies are still there, so the problem isn't gone. You've made the problem slightly smaller, from 64 to 61 threads, but you still have to split the rest of your engine up to run on these 61 threads, and you're still going to need way more than 10*61 tasks to do it, for reasons I've already explained.

I hate to appeal to authority, but I really don't see how you can just ignore that all the major companies building task-parallel libraries disagree with you regarding the merits of a shared global queue.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
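For reference, the per-worker deque that gives work stealing this locality behaviour looks roughly like the sketch below. Real implementations (e.g. the Chase-Lev deque) are lock-free; this one uses a mutex purely for clarity.

    // Per-worker deque of a work-stealing scheduler. The owning thread pushes
    // and pops at the back (LIFO), so it keeps working on what it touched most
    // recently and stays hot in its own cache; idle threads steal from the
    // front (FIFO), taking the oldest - typically largest - pending work.
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>

    class work_deque {
        std::deque<std::function<void()>> tasks_;
        std::mutex mutex_;
    public:
        void push(std::function<void()> t) {            // owner: newest at the back
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(t));
        }
        std::optional<std::function<void()>> pop() {    // owner: pop newest (LIFO)
            std::lock_guard<std::mutex> lock(mutex_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.back());
            tasks_.pop_back();
            return t;
        }
        std::optional<std::function<void()>> steal() {  // thief: take oldest (FIFO)
            std::lock_guard<std::mutex> lock(mutex_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.front());
            tasks_.pop_front();
            return t;
        }
    };

The asymmetry (owner LIFO, thief FIFO) is what keeps a worker on its own recently spawned, cache-hot subtree while still letting idle threads pick up the large, older pieces.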
|
From: Jarkko L. <al...@gm...> - 2009-04-09 21:27:16
|
Yes, appealing to the "big boys" is a pretty weak argument, since they have provided sub-optimal solutions before as well. Anyway, I was hoping to get some strong arguments either way, but I guess it's not very clear yet which solution is the best for multi-threading your engine. I read through the Gamefest slides, and it was a bit surprising they didn't deal with prioritization at all, which is vital for throughput. If you don't take my word for it, play around with different sizes of tasks with various dependencies and see how prioritization impacts the throughput, or check out the GDC 2008 presentation "Getting More From Multicore" by Ian Lewis (http://www.microsoft.com/downloads/details.aspx?FamilyID=A36FE736-5FE7-4E08-84CF-ACCF801538EB&displaylang=en) for some pointers. But I guess in the end we just have to agree to disagree and see how it pans out.

Cheers,
Jarkko
|
From: Sebastian S. <seb...@gm...> - 2009-04-09 22:13:13
|
On Thu, Apr 9, 2009 at 10:27 PM, Jarkko Lempiainen <al...@gm...> wrote:

> Yes, appealing to the "big boys" is a pretty weak argument since they have
> provided sub-optimal solutions before as well.

In this case almost all of them have, at one point or another, tried the alternative and decided it was worth the effort to implement the much trickier solution. In fact, the presentation you linked was given by the very same organization (XNA) who now recommend work stealing - so clearly they've given the global work queue an honest shot, don't you think?

> Anyway, I was hoping to get some strong arguments either way, but I guess
> it's not very clear yet which solution is the best for multi-threading your
> engine. I read through the Gamefest slides and it was a bit surprising they
> didn't deal with prioritization at all, which is vital for throughput.

It's actually not vital for throughput, and I couldn't see anything in the presentation you linked saying it is. Tasks are naturally ordered according to their dependencies due to the way they are spawned (you spawn tasks, then you sync - only then do you spawn any dependent tasks). With the Cilk-based model, any task that is enqueued is by definition ready to run. There's no need for priorities to optimize throughput.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862
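As a concrete illustration of that ordering, here is a rough sketch of the animate-then-collide example from earlier in the thread. The task_group stand-in runs everything inline at sync() just to keep the sketch self-contained, and animate_character/collide_character are hypothetical names; the point is structural - dependent tasks are only spawned after the sync() for the work they depend on, so anything sitting in a queue is ready to run.

    // Spawn/sync ordering sketch: collision tasks are only created once the
    // animation they depend on has been joined.
    #include <functional>
    #include <vector>

    struct task_group {   // trivial stand-in: runs tasks inline at sync()
        std::vector<std::function<void()>> tasks;
        void spawn(std::function<void()> f) { tasks.push_back(std::move(f)); }
        void sync() { for (auto& f : tasks) f(); tasks.clear(); }
    };

    struct Character { /* pose, skinned mesh, ... */ };
    void animate_character(Character&) { /* skinning etc. */ }
    void collide_character(Character&) { /* needs the posed mesh */ }

    void tick_characters(task_group& tg, Character* chars, int count) {
        for (int i = 0; i < count; ++i)
            tg.spawn([=] { animate_character(chars[i]); });  // all independent, all ready now
        tg.sync();                                           // join: every mesh is posed
        for (int i = 0; i < count; ++i)
            tg.spawn([=] { collide_character(chars[i]); });  // dependents spawned only after sync
        tg.sync();
    }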
|
From: Jon W. <jw...@gm...> - 2009-04-09 22:53:00
|
Sebastian Sylvan wrote:
>> Anyway, I was hoping to get some strong arguments either way, but
>> I guess it's not very clear yet which solution is the best for
>> multi-threading your engine. I read through the Gamefest slides
>> and it was a bit surprising they didn't deal with prioritization
>> at all, which is vital for throughput.
>
> It's actually not vital for throughput, and I couldn't see anything in
> the presentation you linked saying it is.
> Tasks are naturally ordered according to their dependencies due to the
> way they are spawned (you spawn tasks, then you sync - only then do
> you spawn any dependent tasks). With the Cilk-based model any task
> that is enqueued is by definition ready to run. There's no need for
> priorities to optimize throughput.

Actually, it is. Consider a case where you have some independent tasks, and some serially dependent tasks. Suppose you have 4 independent tasks xyzw, 4 linearly dependent tasks abcd, and two cores to schedule across. Here are two possible work orderings (each row shows what the two cores run in one time slot):

    xy          ax
    zw          by
    a           cz
    b           dw
    c
    d

Clearly, the second case is much preferred. However, without prioritization, you won't get that outcome. While you could sort the entire graph and do some dependency/bin-packing to pack it as compactly as possible, usually some heuristics suffice, such as always prioritizing tasks that have dependents - and the longer the dependency chain, the higher the priority.

Sincerely,
jw
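A rough sketch of that heuristic: derive each task's priority from the length of the dependency chain hanging off it, so the head of the long serial chain (a -> b -> c -> d) gets started as early as possible alongside the independent tasks. The task/graph representation below is invented for the example.

    // Priority = length of the longest chain of dependents below each task.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Task {
        const char* name;
        std::vector<int> dependents;   // indices of tasks that must run after this one
    };

    // longest chain of dependents below task i (memoized)
    int chain_length(const std::vector<Task>& tasks, std::vector<int>& memo, int i) {
        if (memo[i] >= 0) return memo[i];
        int longest = 0;
        for (int d : tasks[i].dependents)
            longest = std::max(longest, 1 + chain_length(tasks, memo, d));
        return memo[i] = longest;
    }

    int main() {
        // x, y, z, w independent; a -> b -> c -> d serially dependent
        std::vector<Task> tasks = {
            {"x", {}}, {"y", {}}, {"z", {}}, {"w", {}},
            {"a", {5}}, {"b", {6}}, {"c", {7}}, {"d", {}},
        };
        std::vector<int> memo(tasks.size(), -1);
        std::vector<int> order(tasks.size());
        for (int i = 0; i < (int)tasks.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](int lhs, int rhs) {
            return chain_length(tasks, memo, lhs) > chain_length(tasks, memo, rhs);
        });
        // prints a, b, c first; d and the independent tasks tie at priority 0
        for (int i : order) std::printf("%s ", tasks[i].name);
        std::printf("\n");
    }

In a real scheduler these chain lengths would feed the priority field of whatever queue is in use, rather than a one-off sort.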