Thread: [Algorithms] General purpose task parallel threading approach
From: <asy...@gm...> - 2009-04-03 18:13:15
|
Hello,

A while ago I started wondering about how to apply a task-parallel approach to a complete application, not just to a simple "let's parallelize this loop" scenario. What I wanted to achieve is an approach where you can have all sorts of different tasks running in parallel, which is what you usually have in real-life applications. I think all games now have some sort of background processing: some tasks do CPU calculations, some wait on disk I/O, some wait on other tasks to complete, all while heavy computation is going on in the game logic, for example, and all of them need CPU time. The available solutions either operate at too small a scale with an algorithm-centric approach (like OpenMP or TBB), or, if more generic, suffer from the blocking conditions you have all over the place (including synchronization primitives, all sorts of I/O, and the need to wait on other tasks to finish).

I've tried to develop my own alternative and would like to have your opinion on my approach. The main goals I tried to achieve are speed, low memory overhead, and the ability to handle blocking conditions somehow (which is why I started to reinvent the wheel).

To fight blocking conditions I've built a task suspension mechanism, which allows a worker thread to suspend a task when it meets something it should wait for (whether that is a mutex or an ongoing disk read). After a task is suspended, the worker thread picks up another task from the task pool and keeps executing. When the blocking condition ends, the task is resumed and put back into the task pool. This approach maximizes task throughput, and worker threads go to sleep only when there is nothing else to do (or all tasks are suspended).

To achieve speed I avoid using any kernel structures (except in one place in the task scheduler which I haven't optimized yet, so expect performance of about a million task switches per second or so). That also helped me reduce the size of the synchronization primitives.
Other useful features include:
- a unified way to wait on primitives (you can wait on all sorts of primitives, including tasks, together)
- task canceling
- timeouts on wait operations

I'd like to hear your feedback on all this. You can download the code at http://groups.google.com/group/asynclib. I've also composed a few samples demonstrating some aspects of programming with the library, but in general it is no more complex than using the Windows threading API. I'd also ask you to post library-specific questions in the group forum. Thanks. |
From: Gregory J. <gj...@da...> - 2009-04-03 19:12:08
|
Isn't this where coroutines come into play? Greg |
From: <asy...@gm...> - 2009-04-03 19:27:32
|
What do you mean?

Alexander.

2009/4/3 Gregory Junker <gj...@da...>
> Isn't this where coroutines come into play?
>
> Greg
>
> ------------------------------------------------------------------------------
> _______________________________________________
> GDAlgorithms-list mailing list
> GDA...@li...
> https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
> Archives:
> http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list
|
From: Tim A. <tim...@gm...> - 2009-04-03 21:16:48
|
Alexander-

Coroutines are an implementation of cooperative multitasking. There are implementations in various dynamic languages, such as Lua and Stackless Python, and there are less portable implementations in C++ that involve inline assembly, setjmp/longjmp, fibers, etc., depending on the platform. See http://en.wikipedia.org/wiki/Coroutine for more info.

It sounds, however, like you also want to write lower-level, performance-critical systems such as rendering, physics, serialization, etc. across many threads/cores. Your library is similar in silhouette to the popular and fairly general job system, where you break parallelizable work down into N jobs that are handed to idle threads. These jobs aim either to minimize contention on shared resources or to operate in a wait-free manner on said resources. If wait-free operation cannot be achieved within the domain, you usually try a lock-free approach that enables different threads to assist obstructing threads in completing their tasks while you wait. As a last resort, you fall back on traditional synchronization primitives if you can find no more efficient approach.

I'm not quite sure what your library provides to the domain of game development, but I'm sure it was motivated by a specific problem you encountered. Perhaps you should describe your use cases more specifically, given that the stated domain of your library seems overly broad?

--Tim A.

On Fri, Apr 3, 2009 at 12:27 PM, <asy...@gm...> wrote:
> What do you mean ?
|
From: Jon W. <jw...@gm...> - 2009-04-03 22:17:34
|
asy...@gm... wrote:
> To fight blocking conditions I've made a task suspension mechanics,
> which allows a worker thread to suspend a task when it meets something
> it should wait for (whether it is a mutex or an on going disk read).
> After task is suspended, worker thread picks up another task from the
> task pool, and keeps on executing it. When blocking condition ends,
> task is resumed, and put back into a task pool.

You are basically combining fibers (co-operative threads) with pre-emptive threads. There really isn't that much cost difference between a pre-emptive thread and a fiber, so you might as well just spawn another pre-emptive thread and suspend the current thread, rather than switching fibers on the current thread. If you're using a dynamically growing thread pool, your worst-case thread count (and thus memory usage) will be no worse than your worst-case fiber count.

> - a unified way to wait on primitives (you can wait on all sort of
> primitives including tasks together)
> - task canceling
> - timeout on wait operations

We do that in our fibers, too. However, it turns out that on most OSes you can do that for OS primitives as well. On Windows, pretty much anything can go into WaitForMultipleObjects() (although there is a 64-object limitation). On UNIX, as long as something is a file descriptor, you can wait on it using select().

So, if instead of "suspending" the current "task" and picking another task within the worker thread -- which I presume to be exactly like fiber switching -- you instead just suspend the current thread and create a new thread in the thread pool, what would the difference be? Have you measured it?

Sincerely,

jw |
From: <asy...@gm...> - 2009-04-04 06:29:14
|
Hi Jon,

>you instead just suspend the current thread and
>create a new thread in the thread pool, what would the difference be?

The difference here is speed.

I have an implementation which does the same thing but with threads (for debugging purposes -- you can find it in the package as well), and it is significantly slower. While my implementation still touches a semaphore on each task switch, it can push about a million task switches per second now. I actually didn't measure performance with threads, but will try doing that later. My target cost for a task switch is about one to two hundred cycles, which is about 10-15 million task switches per second on a 3 GHz CPU.

You should also compare thread spawning cost (unless you preallocate all of the threads) and memory overhead (for Windows threads the minimum stack size is 4 KB), and if you preallocate, you'd have to do it for the maximum stack size. Creation cost in my system is just a malloc (which you can also override with a faster allocator), and the minimum task size is 32 bytes (on x86).

The cost of synchronization primitives on Windows is high, IMHO (a critical section is 48 bytes in size, plus an attached event in the long run). My approach allows the minimal size of a crit-section-like primitive to be just 2 bits (which can be injected into a pointer), 4 bytes for a simple Lock (non-recursive mutex), and 12 bytes for a Mutex (recursive mutex -- a complete analog of a critical section), with no use of Windows events or any other kernel objects.

Besides the huge optimization potential in speed and size, there is no difference from what you do, Jon.

>Have you measured it?

I'll come back with numbers later.

Regards,
Alexander.

2009/4/4 Jon Watte <jw...@gm...>
|
From: Oscar F. <os...@tr...> - 2009-04-04 07:58:16
|
Am I being thick? It looks to me like you are trying to write single-threaded code with multiple threads. Shouldn't you be redesigning your system to eliminate this wait for data? I.e., if you need some data, it should already be computed -- say, the frame before. Otherwise you are never going to get the best out of a multicore architecture...

Just my, probably wrong, 2p.

2009/4/4 <asy...@gm...>
> Hi Jon,
|
From: <asy...@gm...> - 2009-04-04 08:44:21
|
>Shouldn't you be redesigning your system to eliminate this wait for data? Ie if you need some data it should be already
>computed, ie the frame before.

That's the only option you have, using the standard approach, to get a benefit from a multicore architecture. What I am trying to say is that you really don't have to. Writing single-threaded code is a lot easier and takes less time, the code is more compact, and all dependencies in my case are handled automatically (where in your case you have to code them explicitly). To achieve multicore scaling, all you need to do is provide enough work for the worker threads to execute while other tasks wait for their dependencies to complete.

Do you have a concrete example which bothers you? Maybe I could provide a better explanation of how it would work on a concrete case.

Regards,
Alexander.

2009/4/4 Oscar Forth <os...@tr...>
> Am I being thick? It looks to me like you are trying to write single
> threaded code wih multiple threads.
|
From: Sebastian S. <seb...@gm...> - 2009-04-05 00:50:44
|
On Sat, Apr 4, 2009 at 7:29 AM, <asy...@gm...> wrote:
> My target goal for a task switch cost is about 1-2 hundred of cycles, which
> is about 10-15 millions of task switches per second on a 3ghz cpu.

Hi, I didn't look at the source code, but am I to understand that this "switch" you're talking about happens every single time you start a new task? I.e. when a worker runs tasks A, B, and C, none of which does any blocking, you will pay that cost 3 times? If not, then what's the additional cost of starting a task, given that each task is now a fibre?

Assuming I did understand you correctly, I'll just say that I think adding (even slight) overhead to the "run" method of every single task in order to make blocking cheaper is a big mistake. The whole motivating idea behind task parallelism is that you saturate the system with way more runnable tasks than there are hardware capabilities, so that performance can scale with hardware and no CPU capability is left idle. If the number of runnable tasks is much larger than the actual number of worker threads, most tasks will just run serially on the same worker that spawned them, which means that most synchronisation points won't block, which means that the number of thread switches will be few compared to the number of task executions, which means that optimizing for fast blocking is not as important as making sure that running a task is dirt cheap.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862 |
From: Jarkko L. <al...@gm...> - 2009-04-05 01:11:59
|
Sub-tasks spawned by a task are not necessarily run serially by the same worker thread, though. Worker threads should pull tasks from the job queue based on task priority, so those spawned tasks are likely to be executed by many worker threads instead.

Cheers, Jarkko

_____
From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Sunday, April 05, 2009 3:51 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach
|
From: Sebastian S. <seb...@gm...> - 2009-04-05 01:46:08
|
On Sun, Apr 5, 2009 at 2:11 AM, Jarkko Lempiainen <al...@gm...> wrote:
> Sub-tasks spawn by a task are not necessarily run serially by the same
> worker thread though. Worker threads should pull tasks from job queue based
> on task priority, so those spawned tasks are likely to be executed by many
> worker threads instead.

That's probably not a good idea. Work-stealing is pretty well established as the current best solution (with provable performance properties) for task-based parallelism, as it combines locality with low contention and still gets optimal load balancing. Going to a shared queue for every single task (and probably incurring a cache miss if the task is unrelated to what you were just doing) is a lot more costly than just pushing tasks onto your own mostly-private stack (and only interfering with other threads when you and they run out of stuff to do simultaneously).

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862 |
From: Jarkko L. <al...@gm...> - 2009-04-05 20:41:43
|
With job stealing from other threads you don't get global prioritization, so one worker may end up executing lower-priority tasks and not working towards optimal throughput.

Cheers, Jarkko

_____
From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Sunday, April 05, 2009 4:32 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach
|
From: <asy...@gm...> - 2009-04-05 17:31:52
|
The cost of starting a task is the cost of allocating memory for it (you can provide your own allocator, by the way); you can also provide already-allocated memory. There is no additional cost. The cost of a "switch" is not actually the cost of a context switch (which is just a few cycles, really), but the total cost required to set up a wait operation on a synchronization primitive and later resume the task (in my benchmark it was blocking on an event).

Alexander.

2009/4/5 Sebastian Sylvan <seb...@gm...>
> Hi, I didn't look at the source code, but am I to understand that this
> "switch" you're talking about happens every single time you start a new
> task?

--
Regards,
Alexander Karnakov |
From: <asy...@gm...> - 2009-04-04 09:18:40
|
>Have you measured it?

I composed a quick sample - 2 threads which ping-pong each other via 2 events (1 core utilized in total); each thread did 10 million iterations of passing execution to the other thread (20 million execution passes in total), and I ran the same test with tasks.

Win32 threads: 85 seconds ~ 236 thousand task switches per second
asynclib: 23 seconds ~ 870 million task switches per second

Then I also tried hacking in the scheduler optimization I'm about to start working on (which removes the polling of the above-mentioned semaphore - the slowest part of the current scheduler); the result is 4 seconds ~ 5 million task switches per second when you avoid using any Windows API.

Regards,
Alexander.

Sample I used:

HANDLE hEvent1;
HANDLE hEvent2;
async::Event Event1;
async::Event Event2;

static const int nIterations = 10000000;

DWORD __stdcall Thread1_Proc(void*)
{
    for (int i = 0; i < nIterations; ++i)
    {
        WaitForSingleObject(hEvent1, INFINITE);
        ResetEvent(hEvent1);
        SetEvent(hEvent2);
    }
    return 0;
}

DWORD __stdcall Thread2_Proc(void*)
{
    for (int i = 0; i < nIterations; ++i)
    {
        WaitForSingleObject(hEvent2, INFINITE);
        ResetEvent(hEvent2);
        SetEvent(hEvent1);
    }
    return 0;
}

void ASYNC_API Task1_Proc(void*)
{
    for (int i = 0; i < nIterations; ++i)
    {
        async::Wait(Event1);
        Event1.Reset();
        Event2.Signale();
    }
}

void ASYNC_API Task2_Proc(void*)
{
    for (int i = 0; i < nIterations; ++i)
    {
        async::Wait(Event2);
        Event2.Reset();
        Event1.Signale();
    }
}

void Profile_Threads()
{
    hEvent1 = CreateEvent(NULL, TRUE, FALSE, NULL);
    hEvent2 = CreateEvent(NULL, TRUE, FALSE, NULL);

    HANDLE hThread1 = CreateThread(NULL, 4096, &Thread1_Proc, NULL, 0, NULL);
    HANDLE hThread2 = CreateThread(NULL, 4096, &Thread2_Proc, NULL, 0, NULL);

    int nTicks = GetTickCount();
    SetEvent(hEvent1);
    WaitForSingleObject(hThread1, INFINITE);
    WaitForSingleObject(hThread2, INFINITE);
    nTicks = GetTickCount() - nTicks;

    printf("thread ticks %.3f\n", (float) ((double)nTicks / 1000.0));
    CloseHandle(hThread1);
    CloseHandle(hThread2);

    async::Initialize(0, 0);
    async::Task* pTask1;
    async::Task* pTask2;
    async::Execute(&Task1_Proc, 0, &pTask1, NULL, 0, 512, NULL);
    async::Execute(&Task2_Proc, 0, &pTask2, NULL, 0, 512, NULL);

    nTicks = GetTickCount();
    Event1.Signale();
    async::WaitAll(*pTask1, *pTask2);
    nTicks = GetTickCount() - nTicks;

    printf("task ticks %.3f\n", (float) ((double)nTicks / 1000.0));

    pTask1->Release();
    pTask2->Release();
    async::Shutdown();
}

2009/4/4 Jon Watte <jw...@gm...>

> asy...@gm... wrote:
> > To fight blocking conditions I've made a task suspension mechanics,
> > which allows a worker thread to suspend a task when it meets something
> > it should wait for (whether it is a mutex or an on going disk read).
> > After task is suspended, worker thread picks up another task from the
> > task pool, and keeps on executing it. When blocking condition ends,
> > task is resumed, and put back into a task pool.
>
> You are basically combining fibers (co-operative threads) with
> pre-emptive threads. There really isn't that much cost difference
> between a pre-emptive thread and a fiber, so you might as well just
> spawn another pre-emptive thread, and suspend the current thread, rather
> than switching fiber on the current thread. If you're using a
> dynamically growing thread pool, your worst-case thread count (and thus
> memory usage) will be no worse than your worst-case fiber count.
>
> > - a unified way to wait on primitives (you can wait on all sort of
> > primitives including tasks together)
> > - task canceling
> > - timeout on wait operations
>
> We do that in our fibers, too. However, it turns out that on most OS-es,
> you can do that for OS primitives too. On Windows, pretty much anything
> can go into WaitForMultipleObjects() (although there is a 64-object
> limitation). On UNIX, as long as something is a file descriptor, you can
> wait on it using select().
>
> So, if instead of "suspending" the current "task" and picking another
> task within the worker thread -- which I presume to be exactly like
> fiber switching -- you instead just suspend the current thread and
> create a new thread in the thread pool, what would the difference be?
> Have you measured it?
>
> Sincerely,
>
> jw
>
> ------------------------------------------------------------------------------
> _______________________________________________
> GDAlgorithms-list mailing list
> GDA...@li...
> https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
> Archives:
> http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list
|
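For readers without a Win32 toolchain at hand, the kernel-primitive half of the sample above can be approximated with standard C++11 primitives. This is only a sketch of the same ping-pong pattern, not part of asynclib: a condition variable and a "turn" flag stand in for the two events, and the function name is invented.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

// Portable stand-in for the Win32 event ping-pong above: two threads
// hand execution back and forth through a "turn" flag guarded by a
// mutex and condition variable. This models the kernel-primitive
// baseline, roughly what the hEvent1/hEvent2 version measures.
long PingPong(int iterations)
{
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;        // which thread may run: 0 or 1
    long passes = 0;     // total execution hand-offs

    auto body = [&](int self) {
        for (int i = 0; i < iterations; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return turn == self; });
            turn = 1 - self;          // pass execution to the peer
            ++passes;
            cv.notify_one();          // wake the (only) waiter
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::thread t0(body, 0), t1(body, 1);
    t0.join();
    t1.join();
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    if (secs > 0)
        std::printf("%ld passes, %.0f per second\n", passes, passes / secs);
    return passes;                    // == 2 * iterations
}
```

Calling PingPong with a large iteration count gives a rough switches-per-second figure comparable in spirit to the Win32 numbers quoted above, though absolute values will differ by platform.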
From: <asy...@gm...> - 2009-04-04 09:22:36
|
Replace '870 million' with '870 thousand' in my previous post. Sorry for the mistake.

Alexander.

2009/4/4 <asy...@gm...>

> >Have you measured it?
> I composed a quick sample - 2 threads which ping-pong each other via 2
> events (1 core utilized in total) [...]
> Win32 threads: 85 seconds ~ 236 thousand task switches per second
> asynclib: 23 seconds ~ 870 million task switches per second
> [rest of quoted message snipped]
|
From: Jon W. <jw...@gm...> - 2009-04-04 16:16:17
|
asy...@gm... wrote:
> >Have you measured it?
> I composed a quick sample - 2 threads which ping pong each other via 2
> events (1 core utilized in total), each thread did 10 million
> iterations of passing execution to another thread (20 million
> execution passes in total), and did the same test with tasks.

I don't think that's a representative measurement, though, because it is the absolute best case: both contexts are in L1, on the same core. For any kind of real workload with blocking waits, that will not be the case. I believe the Windows kernel version will not slow down much in that case, but the asynclib version will. Of course, if your workload is actually highly cached and interdependent, then this measurement may be more relevant.

Sincerely,

jw
|
From: <asy...@gm...> - 2009-04-04 16:49:49
|
I believe that from the CPU's point of view, the kernel scheduler and a user-space scheduler are in the same situation here with regard to CPU cache usage. Yes, you need to keep your data close to the cache, avoid task/thread migration, and avoid cache conflicts. I don't see why a user-space task scheduler can't be as good as a kernel-side one, since the kernel and the user-space scheduler have the same amount of information available in order to make the best decision - or am I wrong here?

And as I mentioned - in user space you need to save/restore much smaller state, so you have an initial speed benefit there.

Alexander

2009/4/4 Jon Watte <jw...@gm...>

> I don't think that's a representative measurement, though, because it is
> the absolute best case: both contexts are in L1, on the same core. For
> any kind of real workload with blocking waits, that will not be the
> case. [...]
--
Regards,
Alexander Karnakov
|
From: Jon W. <jw...@gm...> - 2009-04-04 18:18:18
|
asy...@gm... wrote:
> I believe that from cpu point of view kernel scheduler and user-space
> scheduler are in the same situation here in regard to cpu cache usage. [...]
> And as I mentioned - in user space you need to save/restore much
> smaller state, so you have initial speed benefit here.

What I'm saying is that, in the specific benchmark you posted, you heavily skewed the measurement in favor of user-space cooperative threads (fibers, coroutines, whatever you want to call them). The reason is that the cache footprint of that particular, single-threaded, small-task, user-space switch is tiny.

When using the kernel to switch threads, more code is executed, and more data is touched. For example, threads provide TLS, which fibers do not. Threads provide a whole host of other specific state which fibers don't. In addition, entering the kernel has some overhead for the system call and parameter validation. Clearly, the overhead when switching real threads in this case is higher.

However, when you start writing a real application, where you have a large number of different tasks, where the tasks are not as small or as tightly coupled, and you start seeing tasks get moved between cores, the difference between a fiber scheduler and the kernel thread scheduler becomes smaller. That's why I said I think that, with higher load, the thread scheduler will not degrade at the same rate as the fiber scheduler. Of course the fiber scheduler (if implemented competently, at least) will be faster than the thread scheduler, because it does less (no kernel entry, no TLS or other thread context, no pre-emption, etc.). The question is: once you've added back all the things you end up needing for a real application to your fiber system (and TLS context switch may be part of that), how much faster is it, and what are you giving up?

For example, we've had a user-space fiber system for our server stuff for many years, not unlike what you describe -- we have fiber-specific locks, waiting for I/O schedules another fiber, etc. We've found that the fiber switching brings with it its own problems, especially in the cancellation case. If you want to stop the operation that is going on (say, the user canceled, or the connection was broken while you were waiting for I/O), the fiber "wait" operation has to return an error, and it has to wind its way out to the base of the request while backing out state. In later incarnations, we use exception handling for this; the "wait" operation for fibers (which can wait on I/O, synchronization, etc.) will throw an exception when the fiber/operation is canceled.

I guess I'm just a little jaded, having seen a large number of user-space thread packages over the last 25 years (including the Mac threads, Java 1 "green" threads, Win32 fibers, UNIX fibers on longjmp, etc.), and I think the benefit and cost trade-offs are not as clear as you seem to want them to be. Yes, there are cases where it is the right solution, but there also are cases where it isn't. If you want to use a particular measurement (such as context switches per second) as your criteria, then I think it's important to measure on actual, real-world workloads to make an informed decision.

Sincerely,

jw
|
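The cancellation-as-exception design Jon describes can be sketched as follows. All names here are hypothetical (this is not the actual fiber system he refers to); the point is that RAII destructors back out state automatically as the exception unwinds to the base of the request, instead of every caller checking an error code.

```cpp
#include <atomic>
#include <stdexcept>
#include <string>

// Hypothetical sketch of cancellation-as-exception: Wait() throws when
// the task has been cancelled, so stack unwinding backs out state via
// RAII rather than manual error propagation.
struct TaskCancelled : std::runtime_error {
    TaskCancelled() : std::runtime_error("task cancelled") {}
};

struct Task {
    std::atomic<bool> cancelled{false};
};

// Stand-in for a blocking wait inside the fiber scheduler.
void Wait(Task& t /*, event, io handle, ... */)
{
    if (t.cancelled.load())
        throw TaskCancelled();   // unwinds out to the request's base
    // ... otherwise a real scheduler would suspend the task here
    // and run another one until the wait completes ...
}

std::string HandleRequest(Task& t)
{
    try {
        Wait(t);                 // e.g. waiting on I/O
        return "completed";
    } catch (const TaskCancelled&) {
        return "backed out";     // destructors already released state
    }
}
```

The catch block typically lives only at the base of the request; intermediate frames need no cancellation logic at all, which is the appeal of this design.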
From: <asy...@gm...> - 2009-04-04 20:42:50
|
Thanks for your input, Jon.

The sample I composed is really a cache-friendly one. This is the best situation for cooperative tasks, that's true, but it should also be the best situation for Win32 threads as well (since the same data is touched all the time). You are right that this benchmark is far from a real-life scenario, but there should be a way to estimate the speed. How about comparing the worst-case scenario for the cooperative task approach with the best case for Win32 threads?

The next benchmark I tried should be the worst situation for cooperative tasks - every task not in the cache. I created 2 million tasks (each of 512 bytes) and linked them all with unique events (i.e. each task waits for its event, which the previous task signals). The total size of memory traversed is 1 gigabyte, which obviously doesn't fit into the CPU cache. 10 traversals give 20 million task switches in total, and that takes 10 seconds to complete, which is still 8 times faster than the best case for Win32 threads. If you can think of a way to slow it down more, let me know.

I also would like to ask you: what am I currently missing that you would like to have in a task state? You've mentioned TLS for tasks; that's something I'll add a bit later, and it is not going to cost any extra cycles. In general I follow an approach of zero cost for a feature if you don't use it (which is achieved with assembly). For example, if you don't use exceptions, I can avoid saving/restoring the exception pointer, and so on.

I didn't get what problems you've met with task cancelling; I just return a special code from the blocking function (as you did as well); whether to throw an exception or not in this case is up to the user. Could you also point me to any other potential problems you've experienced with such approaches?

> I think it's important to measure on actual, real-world
> workloads to make an informed decision.

Well, I can only provide benchmarks from my side, and I try to make them as fair as I can.

Alexander.

2009/4/4 Jon Watte <jw...@gm...>

> What I'm saying is that, in the specific benchmark you posted, you
> heavily skewed the measurement in favor of user-space cooperative
> threads (fibers, coroutines, whatever you want to call them). [...]
--
Regards,
Alexander Karnakov
|
From: Jon W. <jw...@gm...> - 2009-04-04 23:14:12
|
asy...@gm... wrote:
> I didn't get what problems you've met with task cancelling; I just
> return a special code from the blocking function (as you did as well);
> whether to throw an exception or not in this case is up to the user.
> Could you also point me to any other potential problems you've
> experienced with such approaches?

It mostly has to do with the ease of creating bugs. The two bugs we've seen again and again are:

1) Some programmer failing to handle the "cancel" error case correctly, and perhaps treating it like a temporary I/O error rather than unwinding ASAP.

2) Some programmer deleting the underlying object that uses the fiber, while the fiber is still waiting in the scheduling queue. This is extra nasty to debug when it happens in an asynchronous manner.

Sincerely,

jw
|
From: <asy...@gm...> - 2009-04-05 17:23:41
|
>1) Some programmer failing to handle the "cancel" error case correctly,

For that purpose I made a wait operation non-cancellable by default. You have to provide a flag in order to have the ability to cancel the current wait operation (and have a 'cancelled' flag returned).

Alexander.

2009/4/5 Jon Watte <jw...@gm...>

> It mostly has to do with the ease of creating bugs. The two bugs we've
> seen again and again are:
> 1) Some programmer failing to handle the "cancel" error case correctly,
> and perhaps treating it like a temporary I/O error rather than unwinding
> ASAP.
> 2) Some programmer deleting the underlying object that uses the fiber,
> while the fiber is still waiting in the scheduling queue. This is extra
> nasty to debug when it happens in an asynchronous manner.
--
Regards,
Alexander Karnakov
|
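A minimal sketch of that opt-in design (invented names and signatures; asynclib's real API may well differ): cancellation is observable only when the caller explicitly asked for it, so code that never opted in can never see a surprise 'cancelled' result.

```cpp
// Sketch of an opt-in cancellable wait: by default a wait cannot be
// interrupted; passing WAIT_CANCELLABLE lets a hypothetical Cancel()
// complete the wait with a distinct status the caller must handle.
enum WaitFlags  { WAIT_DEFAULT = 0, WAIT_CANCELLABLE = 1 };
enum WaitStatus { WAIT_SIGNALED, WAIT_CANCELLED, WAIT_PENDING };

struct ToyEvent {
    bool signaled  = false;
    bool cancelled = false;   // set by the hypothetical Cancel()
};

WaitStatus Wait(ToyEvent& e, WaitFlags flags = WAIT_DEFAULT)
{
    if (e.signaled)
        return WAIT_SIGNALED;
    if (e.cancelled && (flags & WAIT_CANCELLABLE))
        return WAIT_CANCELLED;    // only visible when opted in
    return WAIT_PENDING;          // a real scheduler would suspend here
}
```

Note that in this sketch a non-cancellable wait on a cancelled task simply stays pending forever, which is presumably where the deadlock objection in the following reply comes from.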
From: Jon W. <jw...@gm...> - 2009-04-05 20:12:50
|
asy...@gm... wrote:
> For that purpose I made a wait operation non-cancellable by default.
> You have to provide a flag in order to have the ability to cancel
> the current wait operation (and have a 'cancelled' flag returned).

So instead of cancellation crash bugs, you get cancellation deadlocks. Not a big improvement IMO :-)

Sincerely,

jw
|
From: <asy...@gm...> - 2009-04-05 20:24:19
|
Where do you get a deadlock from?

Alexander

2009/4/5 Jon Watte <jw...@gm...>

> So instead of cancellation crash bugs, you get cancellation deadlocks.
> Not a big improvement IMO :-)
--
Regards,
Alexander Karnakov
|
From: <asy...@gm...> - 2009-04-04 05:55:17
|
>Coroutines are an implementation of cooperative multi-tasking

Yes, it is basically cooperative multitasking on top of a fixed number of worker threads.

>As a last resort, you fall back on traditional synchronization primitives if you can find no more efficient approach.

As far as I know this is the most common approach as well, because it is less error-prone. On the other hand, in standard threads, or any job system I've heard about, this approach is not efficient because it blocks the execution thread. What I propose here is a system where you can use traditional synchronization primitives (as well as the traditional way of doing I/O) without blocking the worker thread. This is the main significant difference from standard threads or a job system: a standard thread blocks (and leaves the CPU idle) when two of them enter a synchronization primitive, while my tasks do not.

>I'm not quite sure what your library provides to the domain of game development

That's what I would like your feedback on. I've come up with a very generic way of utilizing CPUs to the max regardless of the presence of synchronization primitives or synchronous I/O requests. And I wonder what is missing here for a game developer.

Regards,
Alexander.

2009/4/3 Tim Ambrogi <tim...@gm...>

> Alexander-
> Coroutines are an implementation of cooperative multi-tasking. There are
> implementations in various dynamic languages, such as Lua and Stackless
> Python. Also, there are less-portable implementations in C++ that involve
> inline assembly, setjmp/longjmp, fibers, etc... depending on the platform.
> See http://en.wikipedia.org/wiki/Coroutine for more info.
>
> It sounds, however, like you also want to write lower-level,
> performance-critical systems such as rendering, physics, serialization,
> etc... across many threads/cores. Your library is similar in silhouette to
> the popular and fairly general job system, where you break down
> parallelizable tasks into N jobs that are tasked to idle threads. These jobs
> aim to either minimize contention on shared resources, or operate in a
> wait-free manner on said resources. If wait-free operation cannot be
> achieved within the domain, you usually try a lock-free approach that
> enables different threads to assist obstructing threads in completing their
> tasks while you wait. As a last resort, you fall back on traditional
> synchronization primitives if you can find no more efficient approach.
>
> I'm not quite sure what your library provides to the domain of game
> development, but I'm sure it was motivated by a specific problem you
> encountered. Perhaps you should describe your use cases more specifically,
> given that the stated domain of your library seems overly broad?
>
> --Tim A.
>
> On Fri, Apr 3, 2009 at 12:27 PM, <asy...@gm...> wrote:
>
>> What do you mean?
>>
>> Alexander.
>>
>> 2009/4/3 Gregory Junker <gj...@da...>
>>
>>> Isn't this where coroutines come into play?
>>>
>>> Greg
|
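The core mechanism being proposed above - a wait suspends the task rather than blocking the worker thread - can be modelled with a toy single-threaded scheduler. This sketch is not asynclib: it uses continuations instead of real stack switching, and all names are invented; it only illustrates why a worker never goes idle while any task is runnable.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Toy model of task suspension: when a task would block on an
// unsignaled event, it parks a continuation on the event instead of
// blocking the worker, and the worker loop picks the next runnable task.
struct Event {
    bool signaled = false;
    std::vector<std::function<void()>> waiters;       // suspended tasks
    void Signal(std::deque<std::function<void()>>& ready) {
        signaled = true;
        for (auto& w : waiters) ready.push_back(w);   // resume waiters
        waiters.clear();
    }
};

struct Scheduler {
    std::deque<std::function<void()>> ready;
    std::vector<std::string> log;                     // observed order

    void Spawn(std::function<void()> t) { ready.push_back(std::move(t)); }

    // Worker loop: keeps executing while any task is runnable.
    void Run() {
        while (!ready.empty()) {
            auto t = ready.front();
            ready.pop_front();
            t();
        }
    }
};

std::vector<std::string> Demo()
{
    Scheduler s;
    Event e;
    // Task A would block on e, so it parks a continuation and yields.
    s.Spawn([&] {
        if (!e.signaled) {
            e.waiters.push_back([&] { s.log.push_back("A resumed"); });
            s.log.push_back("A suspended");
        }
    });
    // Task B signals the event, which re-queues A's continuation.
    s.Spawn([&] {
        s.log.push_back("B signals");
        e.Signal(s.ready);
    });
    s.Run();
    return s.log;   // A suspended, B signals, A resumed
}
```

In the real library the "continuation" would be the task's saved stack and registers, and the ready queue would be shared by several worker threads; the scheduling shape stays the same.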
From: Tim A. <tim...@gm...> - 2009-04-04 09:19:56
|
On Fri, Apr 3, 2009 at 10:55 PM, <asy...@gm...> wrote:
>
> As far as I know this is the most common approach as well, because it is
> less error prone. On the other side, in standard threads, or any job system
> I've heard about, this approach is not efficient because it blocks the
> execution thread.

Lock-free algorithms are, in some ways, less error-prone. In particular, correctly-implemented lock-free algorithms cannot deadlock, and they prevent priority inversion.

> >I'm not quite sure what your library provides to the domain of game
> >development
> That's what I would like your feedback on.

In that case, I again suggest you provide some specific use cases. What problems did you run into that led you to write the library? How does the library you wrote solve the specific problems you encountered?

--Tim A.
|
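Tim's deadlock-freedom point can be illustrated with the textbook Treiber stack: every operation is a compare-and-swap retry loop, so no thread ever holds a lock that another thread could block on. This is a standard sketch, not code from any library discussed in this thread, and it deliberately ignores the ABA problem that a production version must handle.

```cpp
#include <atomic>

// Treiber stack: lock-free push/pop via compare-and-swap. Since no
// locks are held, suspension of one thread can never block another,
// so there is no deadlock and no priority inversion. Note: Pop() as
// written ignores the ABA problem.
struct Node {
    int   value;
    Node* next;
};

std::atomic<Node*> g_top{nullptr};

void Push(Node* n)
{
    n->next = g_top.load();
    // Retry until our snapshot of the top is still current; on failure
    // compare_exchange_weak reloads the current top into n->next.
    while (!g_top.compare_exchange_weak(n->next, n)) {}
}

Node* Pop()
{
    Node* n = g_top.load();
    while (n && !g_top.compare_exchange_weak(n, n->next)) {}
    return n;   // nullptr if the stack was empty
}
```

The "assist obstructing threads" behavior Tim mentions shows up here as the retry loop: a thread that loses a CAS race simply observes the other thread's completed operation and tries again.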