Re: [Algorithms] General purpose task parallel threading approach
From: <asy...@gm...> - 2009-04-06 08:13:55
I didn't get how you would handle priorities in a fair way. Wouldn't you need an expensive search across all of the processors' local queues for that? And how would you balance tasks with high and low priorities among the per-CPU queues? I think there is no single best approach for all cases; the final choice depends on the set of features you'd like your scheduler to have, which should of course be the best of the alternatives with regard to lowering contention.

Alexander

2009/4/6 Sebastian Sylvan <seb...@gm...>

> I mean the standard one queue per worker, with work-stealing to seed a
> worker that has run out of stuff to do. I'm not sure what work stealing
> would even mean with a global queue? Look up the Cilk papers for some more
> data on the benefits of this. Improved inter-core communication is all
> well and good, but not as good as avoiding inter-core communication.
> Having even just six to eight threads all trying to CAS the same pointer
> *is* going to lead to some contention overhead if your tasks are
> fine-grained enough (and they will have to be eventually).
>
> On Mon, Apr 6, 2009 at 6:58 AM, <asy...@gm...> wrote:
>
>> You are probably thinking of some specific implementation of the
>> task-stealing approach? Task stealing can be implemented with or without
>> a global queue. Also, the trend of increasing core counts drives improved
>> inter-core communication mechanisms as well (since otherwise they become
>> a bottleneck), so I'm not sure the global-queue approach (with or without
>> stealing) is as bad as you say.
>>
>> Alexander
>>
>> 2009/4/6 Sebastian Sylvan <seb...@gm...>
>>
>>> On Sun, Apr 5, 2009 at 9:40 PM, Jarkko Lempiainen <al...@gm...> wrote:
>>>
>>>> There isn't a “whole bunch of magic” outside the task that you would
>>>> need with the first option. The task just adds another function to the
>>>> job queue, which does the work after the wait.
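The first option described above (the task enqueues a continuation that does the work after the wait, instead of blocking) might look roughly like the following toy single-threaded sketch. `JobQueue` and `spawn_with_continuation` are hypothetical names for illustration, not from any real scheduler:

```python
from collections import deque

class JobQueue:
    """Toy FIFO job queue; a real scheduler would run this across workers."""
    def __init__(self):
        self.tasks = deque()

    def push(self, fn):
        self.tasks.append(fn)

    def run(self):
        while self.tasks:
            self.tasks.popleft()()

def spawn_with_continuation(queue, child_work, continuation):
    """Rather than blocking on child_work, the 'rest' of the parent becomes
    its own task, enqueued once the child has produced its result."""
    def child_task():
        result = child_work()
        # This push replaces the blocking wait in the parent.
        queue.push(lambda: continuation(result))
    queue.push(child_task)

results = []
q = JobQueue()
spawn_with_continuation(q, lambda: 21, lambda r: results.append(r * 2))
q.run()
print(results)  # [42]
```

The cost this thread is debating is visible here: every wait point forces the surrounding code to be split manually into "before" and "after" tasks.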
>>>> With the second option it is the job queue's responsibility to
>>>> schedule tasks for optimal throughput, so for a wait it could, for
>>>> example, favor the tasks blocking the waiting task from proceeding.
>>>
>>> That's a whole bunch of magic to me. If you want pervasive parallelism
>>> (and Mr. Amdahl says we do), you're going to need sync points all over
>>> the place. Currently we're not doing that because we don't *need* to do
>>> it yet, but if you have 16-32 threads to keep busy, you bet you're going
>>> to need to spawn off tasks all over the place (and usually sync
>>> immediately afterwards). Having to manually split all of these things up
>>> into lots of tiny tasks is a big issue IMO - it's tedious, error-prone,
>>> and adds more overhead (the cost of starting a task still isn't free, so
>>> doing it, say, 5-10x as often is a big deal).
>>>
>>>> With job stealing from other threads you don't get global
>>>> prioritization, so one worker may be executing lower-priority tasks and
>>>> not working towards optimal throughput.
>>>
>>> Why do you actually need global prioritization? After all, you need to
>>> finish everything by the end of the frame anyway, so why would you care
>>> whether task A gets executed before task B, as long as you never get any
>>> bubbles where a core sits idle?
>>>
>>> All you care about is maximal throughput, maximal locality (especially
>>> on in-order CPUs) and minimal contention/overhead, and for that a giant
>>> shared queue is sub-optimal for two reasons:
>>>
>>> * It's inherently non-scalable. If you have lots of threads reading
>>> from the same queue you'll get contention; even with a relatively modest
>>> number of threads this starts becoming an issue. Work stealing has
>>> negligible contention, because a thief has to be unlucky enough to try
>>> to steal from the bottom of your stack at the same time as you run out
>>> of tasks to do (so you're fighting for the very last item in the stack).
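The access pattern behind that claim (the owner works one end of its deque while thieves take from the other, so they only meet at the last item) can be illustrated with a minimal lock-based sketch. A production version, such as the Chase-Lev deque used in Cilk-style schedulers, would use atomics rather than one lock; this only shows the discipline:

```python
import threading
from collections import deque

class WorkStealingDeque:
    """Toy per-worker deque: the owner pushes and pops at the back
    (LIFO, cache-warm), thieves steal from the front (oldest tasks)."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()

    def push(self, task):
        # Owner only: a newly spawned task goes on the back.
        with self._lock:
            self._items.append(task)

    def pop(self):
        # Owner only: take the most recently spawned task.
        with self._lock:
            return self._items.pop() if self._items else None

    def steal(self):
        # Other workers: take the oldest task from the opposite end,
        # so owner and thief only contend when one item remains.
        with self._lock:
            return self._items.popleft() if self._items else None

dq = WorkStealingDeque()
for t in ("a", "b", "c"):
    dq.push(t)
print(dq.pop())    # c: the owner resumes the freshest task
print(dq.steal())  # a: a thief takes the stalest task
```

Because pop and steal touch opposite ends, a lock-free version needs a CAS only on the steal path and the rare last-item race; the owner's common case is uncontended, which is exactly the contrast with a single shared queue being argued here.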
>>> That's extremely unlikely to happen, and it's even more unlikely for
>>> two threads to try to steal from the same stack - whereas a work queue
>>> is constantly being hit by multiple threads.
>>>
>>> * You lose locality. With a work-stealing approach you get good
>>> locality, because tasks run on the same thread that spawned them, and
>>> tasks spawned close together in time get run close together in time,
>>> meaning good cache coherency. If tasks are spread out over multiple
>>> cores, and over time, you get less coherency.
>>>
>>> --
>>> Sebastian Sylvan
>>> +44(0)7857-300802
>>> UIN: 44640862
>>>
>>> _______________________________________________
>>> GDAlgorithms-list mailing list
>>> GDA...@li...
>>> https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
>>> Archives:
>>> http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list
>>
>> --
>> Regards,
>> Alexander Karnakov
>
> --
> Sebastian Sylvan
> +44(0)7857-300802
> UIN: 44640862

--
Regards,
Alexander Karnakov