Thread: [Algorithms] General purpose task parallel threading approach
From: <asy...@gm...> - 2009-04-03 18:13:15
|
Hello,

A while ago I started wondering about how to apply a task-parallel approach to a complete application, not just to a simple "let's parallelize this loop" scenario. What I wanted to achieve is an approach where you can have all sorts of different tasks running in parallel, which is what you usually have in real-life applications. I think all games now have some sort of background processing: some tasks do CPU calculations, some wait on disk I/O, some wait on other tasks to complete, all while heavy computation is going on in the game logic, for example, and all of them need CPU time. The available solutions either operate at too small a scale with an algorithm-centric approach (like OpenMP or TBB), or, if more generic, suffer from the blocking conditions you have all over the place (including synchronization primitives, all sorts of I/O, and the need to wait on other tasks to finish).

I've tried to develop my own alternative and would like to have your opinion on my approach. The main goals I tried to achieve are speed, low memory overhead, and the ability to handle blocking conditions somehow (which is why I started to reinvent the wheel).

To fight blocking conditions I've built a task suspension mechanism, which allows a worker thread to suspend a task when it meets something it should wait for (whether that is a mutex or an ongoing disk read). After a task is suspended, the worker thread picks up another task from the task pool and keeps executing. When the blocking condition ends, the task is resumed and put back into the task pool. This approach maximizes task throughput, and worker threads go to sleep only when there is nothing else to do (or all tasks are suspended).

To achieve speed I avoid using any kernel structures (except in one place in the task scheduler which I haven't optimized yet, so expect performance of about a million task switches per second or so). That also helped me reduce the size of the synchronization primitives.
Other useful features include:
- a unified way to wait on primitives (you can wait on all sorts of primitives, including tasks, together)
- task canceling
- timeouts on wait operations

I'd like to hear your feedback on all this. You can download the code at http://groups.google.com/group/asynclib. I've also composed a few samples demonstrating some aspects of programming with the library, but in general it is no more complex than using the Windows threading API. I'd also ask you to post library-specific questions in the group forum. Thanks. |
From: Gregory J. <gj...@da...> - 2009-04-03 19:12:08
|
Isn't this where coroutines come into play? Greg |
From: <asy...@gm...> - 2009-04-03 19:27:32
|
What do you mean?

Alexander.

2009/4/3 Gregory Junker <gj...@da...>
> Isn't this where coroutines come into play?
>
> Greg
>
> ------------------------------------------------------------------------------
> _______________________________________________
> GDAlgorithms-list mailing list
> GDA...@li...
> https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
> Archives:
> http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list
|
From: Tim A. <tim...@gm...> - 2009-04-03 21:16:48
|
Alexander-

Coroutines are an implementation of cooperative multitasking. There are implementations in various dynamic languages, such as Lua and Stackless Python, and there are less portable implementations in C++ that involve inline assembly, setjmp/longjmp, fibers, etc., depending on the platform. See http://en.wikipedia.org/wiki/Coroutine for more info.

It sounds, however, like you also want to write lower-level, performance-critical systems such as rendering, physics, serialization, etc. across many threads/cores. Your library is similar in silhouette to the popular and fairly general job system, where you break parallelizable work down into N jobs that are handed to idle threads. These jobs aim either to minimize contention on shared resources or to operate in a wait-free manner on said resources. If wait-free operation cannot be achieved within the domain, you usually try a lock-free approach that enables different threads to assist obstructing threads in completing their tasks while you wait. As a last resort, you fall back on traditional synchronization primitives if you can find no more efficient approach.

I'm not quite sure what your library provides to the domain of game development, but I'm sure it was motivated by a specific problem you encountered. Perhaps you should describe your use cases more specifically, given that the stated domain of your library seems overly broad?

--Tim A.

On Fri, Apr 3, 2009 at 12:27 PM, <asy...@gm...> wrote:
> What do you mean ?
|
From: Jon W. <jw...@gm...> - 2009-04-03 22:17:34
|
asy...@gm... wrote:
> To fight blocking conditions I've made a task suspension mechanics,
> which allows a worker thread to suspend a task when it meets something
> it should wait for (whether it is a mutex or an on going disk read).
> After task is suspended, worker thread picks up another task from the
> task pool, and keeps on executing it. When blocking condition ends,
> task is resumed, and put back into a task pool.

You are basically combining fibers (co-operative threads) with pre-emptive threads. There really isn't that much cost difference between a pre-emptive thread and a fiber, so you might as well just spawn another pre-emptive thread and suspend the current thread, rather than switching fibers on the current thread. If you're using a dynamically growing thread pool, your worst-case thread count (and thus memory usage) will be no worse than your worst-case fiber count.

> - a unified way to wait on primitives (you can wait on all sort of
> primitives including tasks together)
> - task canceling
> - timeout on wait operations

We do that in our fibers, too. However, it turns out that on most OSes you can do that for OS primitives as well. On Windows, pretty much anything can go into WaitForMultipleObjects() (although there is a 64-object limitation). On UNIX, as long as something is a file descriptor, you can wait on it using select().

So, if instead of "suspending" the current "task" and picking another task within the worker thread -- which I presume to be exactly like fiber switching -- you instead just suspend the current thread and create a new thread in the thread pool, what would the difference be? Have you measured it?

Sincerely,

jw |
From: <asy...@gm...> - 2009-04-04 06:29:14
|
Hi Jon,

>you instead just suspend the current thread and
>create a new thread in the thread pool, what would the difference be?

The difference here is speed.

I have an implementation which does the same thing but with threads (for debugging purposes -- you can find it in the package as well), and it is significantly slower. While my implementation still touches a semaphore on each task switch, it can push about a million task switches per second now. I actually didn't measure performance with threads, but will try doing that later. My target cost for a task switch is about one to two hundred cycles, which is about 10-15 million task switches per second on a 3 GHz CPU.

You should also compare thread spawning cost (unless you preallocate all of the threads) and memory overhead (for Windows threads the minimum stack size is 4 KB), and if you preallocate, you'd have to do it for the maximum stack size. Creation cost in my system is just a malloc (which you can also override with a faster allocator), and the minimum task size is 32 bytes (on x86).

The cost of synchronization primitives on Windows is high, IMHO (a critical section is 48 bytes in size, plus an attached event in the long run). My approach allows the minimal size of a crit-section-like primitive to be just 2 bits (which can be injected into a pointer), 4 bytes for a simple Lock (non-recursive mutex), and 12 bytes for a Mutex (recursive mutex -- a complete analog of a critical section), with no use of Windows events or any other kernel objects.

Besides the huge optimization potential in speed and size, there is no difference from what you do, Jon.

>Have you measured it?

I'll come back with numbers later.

Regards,
Alexander.

2009/4/4 Jon Watte <jw...@gm...>
|
From: Oscar F. <os...@tr...> - 2009-04-04 07:58:16
|
Am I being thick? It looks to me like you are trying to write single-threaded code with multiple threads. Shouldn't you be redesigning your system to eliminate this wait for data? I.e., if you need some data, it should already be computed -- say, the frame before. Otherwise you are never going to get the best out of a multicore architecture...

Just my, probably wrong, 2p.

2009/4/4 <asy...@gm...>
> Hi Jon,
|
From: <asy...@gm...> - 2009-04-04 08:44:21
|
>Shouldn't you be redesigning your system to eliminate this wait for data? Ie if you need some data it should be already
>computed, ie the frame before.

That's the only option you have, using the standard approach, to get a benefit from a multicore architecture. What I am trying to say is that you really don't have to. Writing single-threaded code is a lot easier and takes less time, the code is more compact, and all dependencies in my case are handled automatically (where in your case you have to code them explicitly). To achieve multicore scaling, all you need to do is provide enough work for the worker threads to execute while other tasks wait for their dependencies to complete.

Do you have a concrete example which bothers you? Maybe I could provide a better explanation of how it would work on a concrete case.

Regards,
Alexander.

2009/4/4 Oscar Forth <os...@tr...>
> Am I being thick? It looks to me like you are trying to write single
> threaded code wih multiple threads.
|
From: Sebastian S. <seb...@gm...> - 2009-04-05 00:50:44
|
On Sat, Apr 4, 2009 at 7:29 AM, <asy...@gm...> wrote:
> My target goal for a task switch cost is about 1-2 hundred of cycles, which
> is about 10-15 millions of task switches per second on a 3ghz cpu.

Hi, I didn't look at the source code, but am I to understand that this "switch" you're talking about happens every single time you start a new task? I.e. when a worker runs tasks A, B, and C, none of which does any blocking, you will pay that cost 3 times? If not, then what's the additional cost of starting a task, given that each task is now a fibre?

Assuming I did understand you correctly, I'll just say that I think adding (even slight) overhead to the "run" method of every single task in order to make blocking cheaper is a big mistake. The whole motivating idea behind task parallelism is that you saturate the system with way more runnable tasks than there are hardware capabilities, so that performance can scale with hardware and no CPU capability is left idle. If the number of runnable tasks is much larger than the actual number of worker threads, most tasks will just run serially on the same worker that spawned them, which means that most synchronisation points won't block, which means that the number of thread switches will be few compared to the number of task executions, which means that optimizing for fast blocking is not as important as making sure that running a task is dirt cheap.

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862 |
From: Jarkko L. <al...@gm...> - 2009-04-05 01:11:59
|
Sub-tasks spawned by a task are not necessarily run serially by the same worker thread, though. Worker threads should pull tasks from the job queue based on task priority, so those spawned tasks are likely to be executed by many worker threads instead.

Cheers, Jarkko

_____
From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Sunday, April 05, 2009 3:51 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach
|
From: Sebastian S. <seb...@gm...> - 2009-04-05 01:46:08
|
On Sun, Apr 5, 2009 at 2:11 AM, Jarkko Lempiainen <al...@gm...> wrote:
> Sub-tasks spawn by a task are not necessarily run serially by the same
> worker thread though. Worker threads should pull tasks from job queue based
> on task priority, so those spawned tasks are likely to be executed by many
> worker threads instead.

That's probably not a good idea. Work-stealing is pretty well established as the current best solution (with provable performance properties) for task-based parallelism, as it combines locality with low contention and still gets optimal load balancing. Going to a shared queue for every single task (and probably incurring a cache miss if the task is unrelated to what you were just doing) is a lot more costly than just pushing tasks onto your own mostly-private stack (and only interfering with other threads when you and they run out of stuff to do simultaneously).

--
Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862 |
From: Jarkko L. <al...@gm...> - 2009-04-05 20:41:43
|
With job stealing from other threads you don't get global prioritization, so one worker may end up executing lower-priority tasks and not working towards optimal throughput.

Cheers, Jarkko

_____
From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Sunday, April 05, 2009 4:32 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach
|
From: <asy...@gm...> - 2009-04-05 17:31:52
|
The cost of starting a task is the cost of allocating memory for it (you can provide your own allocator, by the way); you can also provide already-allocated memory. There is no additional cost. The cost of a "switch" is not actually the cost of a context switch (which is just a few cycles, really), but the total cost required to set up a wait operation on a synchronization primitive and later resume the task (in my benchmark it was blocking on an event).

Alexander.

2009/4/5 Sebastian Sylvan <seb...@gm...>
> Hi, I didn't look at the source code, but am I to understand that this
> "switch" you're talking about happens every single time you start a new
> task?

--
Regards,
Alexander Karnakov |
From: <asy...@gm...> - 2009-04-04 09:18:40
|
>Have you measured it?

I composed a quick sample - 2 threads which ping-pong each other via 2 events (1 core utilized in total); each thread did 10 million iterations of passing execution to the other thread (20 million execution passes in total), and I ran the same test with tasks.

Win32 threads: 85 seconds ~ 236 thousand task switches per second
asynclib: 23 seconds ~ 870 million task switches per second

Then I also tried hacking in the scheduler optimization I'm about to start working on (which removes the polling of the above-mentioned semaphore - the slowest part of the current scheduler); the result is 4 seconds ~ 5 million task switches per second when you avoid using any Windows API.

Regards,
Alexander.

Sample I used:

HANDLE hEvent1;
HANDLE hEvent2;
async::Event Event1;
async::Event Event2;

static const int nIterations = 10000000;

DWORD __stdcall Thread1_Proc(void*)
{
    for (int i = 0; i < nIterations; ++i)
    {
        WaitForSingleObject(hEvent1, INFINITE);
        ResetEvent(hEvent1);
        SetEvent(hEvent2);
    }
    return 0;
}

DWORD __stdcall Thread2_Proc(void*)
{
    for (int i = 0; i < nIterations; ++i)
    {
        WaitForSingleObject(hEvent2, INFINITE);
        ResetEvent(hEvent2);
        SetEvent(hEvent1);
    }
    return 0;
}

void ASYNC_API Task1_Proc(void*)
{
    for (int i = 0; i < nIterations; ++i)
    {
        async::Wait(Event1);
        Event1.Reset();
        Event2.Signale();
    }
}

void ASYNC_API Task2_Proc(void*)
{
    for (int i = 0; i < nIterations; ++i)
    {
        async::Wait(Event2);
        Event2.Reset();
        Event1.Signale();
    }
}

void Profile_Threads()
{
    hEvent1 = CreateEvent(NULL, TRUE, FALSE, NULL);
    hEvent2 = CreateEvent(NULL, TRUE, FALSE, NULL);

    HANDLE hThread1 = CreateThread(NULL, 4096, &Thread1_Proc, NULL, 0, NULL);
    HANDLE hThread2 = CreateThread(NULL, 4096, &Thread2_Proc, NULL, 0, NULL);

    int nTicks = GetTickCount();
    SetEvent(hEvent1);
    WaitForSingleObject(hThread1, INFINITE);
    WaitForSingleObject(hThread2, INFINITE);
    nTicks = GetTickCount() - nTicks;

    printf("thread ticks %.3f\n", (float) ((double)nTicks / 1000.0));
    CloseHandle(hThread1);
    CloseHandle(hThread2);

    async::Initialize(0, 0);
    async::Task* pTask1;
    async::Task* pTask2;
    async::Execute(&Task1_Proc, 0, &pTask1, NULL, 0, 512, NULL);
    async::Execute(&Task2_Proc, 0, &pTask2, NULL, 0, 512, NULL);

    nTicks = GetTickCount();
    Event1.Signale();
    async::WaitAll(*pTask1, *pTask2);
    nTicks = GetTickCount() - nTicks;

    printf("task ticks %.3f\n", (float) ((double)nTicks / 1000.0));

    pTask1->Release();
    pTask2->Release();
    async::Shutdown();
}

2009/4/4 Jon Watte <jw...@gm...>

> asy...@gm... wrote:
> > To fight blocking conditions I've made a task suspension mechanics,
> > which allows a worker thread to suspend a task when it meets something
> > it should wait for (whether it is a mutex or an on going disk read).
> > After task is suspended, worker thread picks up another task from the
> > task pool, and keeps on executing it. When blocking condition ends,
> > task is resumed, and put back into a task pool.
>
> You are basically combining fibers (co-operative threads) with
> pre-emptive threads. There really isn't that much cost difference
> between a pre-emptive thread and a fiber, so you might as well just
> spawn another pre-emptive thread, and suspend the current thread, rather
> than switching fiber on the current thread. If you're using a
> dynamically growing thread pool, your worst-case thread count (and thus
> memory usage) will be no worse than your worst-case fiber count.
>
> > - a unified way to wait on primitives (you can wait on all sort of
> > primitives including tasks together)
> > - task canceling
> > - timeout on wait operations
>
> We do that in our fibers, too. However, it turns out that on most OS-es,
> you can do that for OS primitives too. On Windows, pretty much anything
> can go into WaitForMultipleObjects() (although there is a 64-object
> limitation). On UNIX, as long as something is a file descriptor, you can
> wait on it using select().
>
> So, if instead of "suspending" the current "task" and picking another
> task within the worker thread -- which I presume to be exactly like
> fiber switching -- you instead just suspend the current thread and
> create a new thread in the thread pool, what would the difference be?
> Have you measured it?
>
> Sincerely,
>
> jw
>
> ------------------------------------------------------------------------------
> _______________________________________________
> GDAlgorithms-list mailing list
> GDA...@li...
> https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
> Archives:
> http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list
|
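For readers without a Win32 toolchain at hand, the kernel-primitive half of the sample above can be approximated with standard C++11 primitives. This is only a sketch of the same ping-pong pattern, not part of asynclib: a condition variable and a "turn" flag stand in for the two events, and the function name is invented.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

// Portable stand-in for the Win32 event ping-pong above: two threads
// hand execution back and forth through a "turn" flag guarded by a
// mutex and condition variable. This models the kernel-primitive
// baseline, roughly what the hEvent1/hEvent2 version measures.
long PingPong(int iterations)
{
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;        // which thread may run: 0 or 1
    long passes = 0;     // total execution hand-offs

    auto body = [&](int self) {
        for (int i = 0; i < iterations; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return turn == self; });
            turn = 1 - self;          // pass execution to the peer
            ++passes;
            cv.notify_one();          // wake the (only) waiter
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::thread t0(body, 0), t1(body, 1);
    t0.join();
    t1.join();
    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    if (secs > 0)
        std::printf("%ld passes, %.0f per second\n", passes, passes / secs);
    return passes;                    // == 2 * iterations
}
```

Calling PingPong with a large iteration count gives a rough switches-per-second figure comparable in spirit to the Win32 numbers quoted above, though absolute values will differ by platform.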
From: <asy...@gm...> - 2009-04-04 09:22:36
|
Replace '870 million' with '870 thousand' in my previous post. Sorry for the mistake.

Alexander.

2009/4/4 <asy...@gm...>

> >Have you measured it?
> I composed a quick sample - 2 threads which ping-pong each other via 2
> events (1 core utilized in total) [...]
> Win32 threads: 85 seconds ~ 236 thousand task switches per second
> asynclib: 23 seconds ~ 870 million task switches per second
> [rest of quoted message snipped]
|
From: Jon W. <jw...@gm...> - 2009-04-04 16:16:17
|
asy...@gm... wrote:
> >Have you measured it?
> I composed a quick sample - 2 threads which ping pong each other via 2
> events (1 core utilized in total), each thread did 10 million
> iterations of passing execution to another thread (20 million
> execution passes in total), and did the same test with tasks.

I don't think that's a representative measurement, though, because it is the absolute best case: both contexts are in L1, on the same core. For any kind of real workload with blocking waits, that will not be the case. I believe the Windows kernel version will not slow down much in that case, but the asynclib version will. Of course, if your workload is actually highly cached and interdependent, then this measurement may be more relevant.

Sincerely,

jw
|
From: <asy...@gm...> - 2009-04-04 16:49:49
|
I believe that from the CPU's point of view, the kernel scheduler and a user-space scheduler are in the same situation here with regard to CPU cache usage. Yes, you need to keep your data close to the cache, avoid task/thread migration, and avoid cache conflicts. I don't see why a user-space task scheduler can't be as good as a kernel-side one, since the kernel and the user-space scheduler have the same amount of information available in order to make the best decision - or am I wrong here?

And as I mentioned - in user space you need to save/restore much smaller state, so you have an initial speed benefit there.

Alexander

2009/4/4 Jon Watte <jw...@gm...>

> I don't think that's a representative measurement, though, because it is
> the absolute best case: both contexts are in L1, on the same core. For
> any kind of real workload with blocking waits, that will not be the
> case. [...]
--
Regards,
Alexander Karnakov
|
From: Jon W. <jw...@gm...> - 2009-04-04 18:18:18
|
asy...@gm... wrote:
> I believe that from cpu point of view kernel scheduler and user-space
> scheduler are in the same situation here in regard to cpu cache usage. [...]
> And as I mentioned - in user space you need to save/restore much
> smaller state, so you have initial speed benefit here.

What I'm saying is that, in the specific benchmark you posted, you heavily skewed the measurement in favor of user-space cooperative threads (fibers, coroutines, whatever you want to call them). The reason is that the cache footprint of that particular, single-threaded, small-task, user-space switch is tiny.

When using the kernel to switch threads, more code is executed, and more data is touched. For example, threads provide TLS, which fibers do not. Threads provide a whole host of other specific state which fibers don't. In addition, entering the kernel has some overhead for the system call and parameter validation. Clearly, the overhead when switching real threads in this case is higher.

However, when you start writing a real application, where you have a large number of different tasks, where the tasks are not as small or as tightly coupled, and you start seeing tasks get moved between cores, the difference between a fiber scheduler and the kernel thread scheduler becomes smaller. That's why I said I think that, with higher load, the thread scheduler will not degrade at the same rate as the fiber scheduler. Of course the fiber scheduler (if implemented competently, at least) will be faster than the thread scheduler, because it does less (no kernel entry, no TLS or other thread context, no pre-emption, etc.). The question is: once you've added back all the things you end up needing for a real application to your fiber system (and TLS context switch may be part of that), how much faster is it, and what are you giving up?

For example, we've had a user-space fiber system for our server stuff for many years, not unlike what you describe -- we have fiber-specific locks, waiting for I/O schedules another fiber, etc. We've found that the fiber switching brings with it its own problems, especially in the cancellation case. If you want to stop the operation that is going on (say, the user canceled, or the connection was broken while you were waiting for I/O), the fiber "wait" operation has to return an error, and it has to wind its way out to the base of the request while backing out state. In later incarnations, we use exception handling for this; the "wait" operation for fibers (which can wait on I/O, synchronization, etc.) will throw an exception when the fiber/operation is canceled.

I guess I'm just a little jaded, having seen a large number of user-space thread packages over the last 25 years (including the Mac threads, Java 1 "green" threads, Win32 fibers, UNIX fibers on longjmp, etc.), and I think the benefit and cost trade-offs are not as clear as you seem to want them to be. Yes, there are cases where it is the right solution, but there also are cases where it isn't. If you want to use a particular measurement (such as context switches per second) as your criteria, then I think it's important to measure on actual, real-world workloads to make an informed decision.

Sincerely,

jw
|
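The cancellation-as-exception design Jon describes can be sketched as follows. All names here are hypothetical (this is not the actual fiber system he refers to); the point is that RAII destructors back out state automatically as the exception unwinds to the base of the request, instead of every caller checking an error code.

```cpp
#include <atomic>
#include <stdexcept>
#include <string>

// Hypothetical sketch of cancellation-as-exception: Wait() throws when
// the task has been cancelled, so stack unwinding backs out state via
// RAII rather than manual error propagation.
struct TaskCancelled : std::runtime_error {
    TaskCancelled() : std::runtime_error("task cancelled") {}
};

struct Task {
    std::atomic<bool> cancelled{false};
};

// Stand-in for a blocking wait inside the fiber scheduler.
void Wait(Task& t /*, event, io handle, ... */)
{
    if (t.cancelled.load())
        throw TaskCancelled();   // unwinds out to the request's base
    // ... otherwise a real scheduler would suspend the task here
    // and run another one until the wait completes ...
}

std::string HandleRequest(Task& t)
{
    try {
        Wait(t);                 // e.g. waiting on I/O
        return "completed";
    } catch (const TaskCancelled&) {
        return "backed out";     // destructors already released state
    }
}
```

The catch block typically lives only at the base of the request; intermediate frames need no cancellation logic at all, which is the appeal of this design.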
From: <asy...@gm...> - 2009-04-04 20:42:50
|
Thanks for your input, Jon.

The sample I composed is really a cache-friendly one. This is the best situation for cooperative tasks, that's true, but it should also be the best situation for Win32 threads as well (since the same data is touched all the time). You are right that this benchmark is far from a real-life scenario, but there should be a way to estimate the speed. How about comparing the worst-case scenario for the cooperative task approach with the best case for Win32 threads?

The next benchmark I tried should be the worst situation for cooperative tasks - every task not in the cache. I created 2 million tasks (each of 512 bytes) and linked them all with unique events (i.e. each task waits for its event, which the previous task signals). The total size of memory traversed is 1 gigabyte, which obviously doesn't fit into the CPU cache. 10 traversals give 20 million task switches in total, and that takes 10 seconds to complete, which is still 8 times faster than the best case for Win32 threads. If you can think of a way to slow it down more, let me know.

I also would like to ask you: what am I currently missing that you would like to have in a task state? You've mentioned TLS for tasks; that's something I'll add a bit later, and it is not going to cost any extra cycles. In general I follow an approach of zero cost for a feature if you don't use it (which is achieved with assembly). For example, if you don't use exceptions, I can avoid saving/restoring the exception pointer, and so on.

I didn't get what problems you've met with task cancelling; I just return a special code from the blocking function (as you did as well); whether to throw an exception or not in this case is up to the user. Could you also point me to any other potential problems you've experienced with such approaches?

> I think it's important to measure on actual, real-world
> workloads to make an informed decision.

Well, I can only provide benchmarks from my side, and I try to make them as fair as I can.

Alexander.

2009/4/4 Jon Watte <jw...@gm...>

> What I'm saying is that, in the specific benchmark you posted, you
> heavily skewed the measurement in favor of user-space cooperative
> threads (fibers, coroutines, whatever you want to call them). [...]
--
Regards,
Alexander Karnakov
|
From: Jon W. <jw...@gm...> - 2009-04-04 23:14:12
|
asy...@gm... wrote:
> I didn't get what problems you've met with task cancelling; I just
> return a special code from the blocking function (as you did as well);
> whether to throw an exception or not in this case is up to the user.
> Could you also point me to any other potential problems you've
> experienced with such approaches?

It mostly has to do with the ease of creating bugs. The two bugs we've seen again and again are:

1) Some programmer failing to handle the "cancel" error case correctly, and perhaps treating it like a temporary I/O error rather than unwinding ASAP.

2) Some programmer deleting the underlying object that uses the fiber, while the fiber is still waiting in the scheduling queue. This is extra nasty to debug when it happens in an asynchronous manner.

Sincerely,

jw
|
From: <asy...@gm...> - 2009-04-05 17:23:41
|
>1) Some programmer failing to handle the "cancel" error case correctly,

For that purpose I made a wait operation non-cancellable by default. You have to provide a flag in order to have the ability to cancel the current wait operation (and have a 'cancelled' flag returned).

Alexander.

2009/4/5 Jon Watte <jw...@gm...>

> It mostly has to do with the ease of creating bugs. The two bugs we've
> seen again and again are:
> 1) Some programmer failing to handle the "cancel" error case correctly,
> and perhaps treating it like a temporary I/O error rather than unwinding
> ASAP.
> 2) Some programmer deleting the underlying object that uses the fiber,
> while the fiber is still waiting in the scheduling queue. This is extra
> nasty to debug when it happens in an asynchronous manner.
--
Regards,
Alexander Karnakov
|
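A minimal sketch of that opt-in design (invented names and signatures; asynclib's real API may well differ): cancellation is observable only when the caller explicitly asked for it, so code that never opted in can never see a surprise 'cancelled' result.

```cpp
// Sketch of an opt-in cancellable wait: by default a wait cannot be
// interrupted; passing WAIT_CANCELLABLE lets a hypothetical Cancel()
// complete the wait with a distinct status the caller must handle.
enum WaitFlags  { WAIT_DEFAULT = 0, WAIT_CANCELLABLE = 1 };
enum WaitStatus { WAIT_SIGNALED, WAIT_CANCELLED, WAIT_PENDING };

struct ToyEvent {
    bool signaled  = false;
    bool cancelled = false;   // set by the hypothetical Cancel()
};

WaitStatus Wait(ToyEvent& e, WaitFlags flags = WAIT_DEFAULT)
{
    if (e.signaled)
        return WAIT_SIGNALED;
    if (e.cancelled && (flags & WAIT_CANCELLABLE))
        return WAIT_CANCELLED;    // only visible when opted in
    return WAIT_PENDING;          // a real scheduler would suspend here
}
```

Note that in this sketch a non-cancellable wait on a cancelled task simply stays pending forever, which is presumably where the deadlock objection in the following reply comes from.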
From: Jon W. <jw...@gm...> - 2009-04-05 20:12:50
|
asy...@gm... wrote:
> For that purpose I made a wait operation non-cancellable by default.
> You have to provide a flag in order to have the ability to cancel
> the current wait operation (and have a 'cancelled' flag returned).

So instead of cancellation crash bugs, you get cancellation deadlocks. Not a big improvement IMO :-)

Sincerely,

jw
|
From: <asy...@gm...> - 2009-04-05 20:24:19
|
Where do you get a deadlock from?

Alexander

2009/4/5 Jon Watte <jw...@gm...>

> So instead of cancellation crash bugs, you get cancellation deadlocks.
> Not a big improvement IMO :-)
--
Regards,
Alexander Karnakov
|
From: <asy...@gm...> - 2009-04-04 05:55:17
|
>Coroutines are an implementation of cooperative multi-tasking

Yes, it is basically cooperative multitasking on top of a fixed number of worker threads.

>As a last resort, you fall back on traditional synchronization primitives if you can find no more efficient approach.

As far as I know this is the most common approach as well, because it is less error-prone. On the other hand, in standard threads, or any job system I've heard about, this approach is not efficient because it blocks the execution thread. What I propose here is a system where you can use traditional synchronization primitives (as well as the traditional way of doing I/O) without blocking the worker thread. This is the main significant difference from standard threads or a job system: a standard thread blocks (and leaves the CPU idle) when two of them enter a synchronization primitive, while my tasks do not.

>I'm not quite sure what your library provides to the domain of game development

That's what I would like your feedback on. I've come up with a very generic way of utilizing CPUs to the max regardless of the presence of synchronization primitives or synchronous I/O requests. And I wonder what is missing here for a game developer.

Regards,
Alexander.

2009/4/3 Tim Ambrogi <tim...@gm...>

> Alexander-
> Coroutines are an implementation of cooperative multi-tasking. There are
> implementations in various dynamic languages, such as Lua and Stackless
> Python. Also, there are less-portable implementations in C++ that involve
> inline assembly, setjmp/longjmp, fibers, etc... depending on the platform.
> See http://en.wikipedia.org/wiki/Coroutine for more info.
>
> It sounds, however, like you also want to write lower-level,
> performance-critical systems such as rendering, physics, serialization,
> etc... across many threads/cores. Your library is similar in silhouette to
> the popular and fairly general job system, where you break down
> parallelizable tasks into N jobs that are tasked to idle threads. These jobs
> aim to either minimize contention on shared resources, or operate in a
> wait-free manner on said resources. If wait-free operation cannot be
> achieved within the domain, you usually try a lock-free approach that
> enables different threads to assist obstructing threads in completing their
> tasks while you wait. As a last resort, you fall back on traditional
> synchronization primitives if you can find no more efficient approach.
>
> I'm not quite sure what your library provides to the domain of game
> development, but I'm sure it was motivated by a specific problem you
> encountered. Perhaps you should describe your use cases more specifically,
> given that the stated domain of your library seems overly broad?
>
> --Tim A.
>
> On Fri, Apr 3, 2009 at 12:27 PM, <asy...@gm...> wrote:
>
>> What do you mean?
>>
>> Alexander.
>>
>> 2009/4/3 Gregory Junker <gj...@da...>
>>
>>> Isn't this where coroutines come into play?
>>>
>>> Greg
|
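The core mechanism being proposed above - a wait suspends the task rather than blocking the worker thread - can be modelled with a toy single-threaded scheduler. This sketch is not asynclib: it uses continuations instead of real stack switching, and all names are invented; it only illustrates why a worker never goes idle while any task is runnable.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <vector>

// Toy model of task suspension: when a task would block on an
// unsignaled event, it parks a continuation on the event instead of
// blocking the worker, and the worker loop picks the next runnable task.
struct Event {
    bool signaled = false;
    std::vector<std::function<void()>> waiters;       // suspended tasks
    void Signal(std::deque<std::function<void()>>& ready) {
        signaled = true;
        for (auto& w : waiters) ready.push_back(w);   // resume waiters
        waiters.clear();
    }
};

struct Scheduler {
    std::deque<std::function<void()>> ready;
    std::vector<std::string> log;                     // observed order

    void Spawn(std::function<void()> t) { ready.push_back(std::move(t)); }

    // Worker loop: keeps executing while any task is runnable.
    void Run() {
        while (!ready.empty()) {
            auto t = ready.front();
            ready.pop_front();
            t();
        }
    }
};

std::vector<std::string> Demo()
{
    Scheduler s;
    Event e;
    // Task A would block on e, so it parks a continuation and yields.
    s.Spawn([&] {
        if (!e.signaled) {
            e.waiters.push_back([&] { s.log.push_back("A resumed"); });
            s.log.push_back("A suspended");
        }
    });
    // Task B signals the event, which re-queues A's continuation.
    s.Spawn([&] {
        s.log.push_back("B signals");
        e.Signal(s.ready);
    });
    s.Run();
    return s.log;   // A suspended, B signals, A resumed
}
```

In the real library the "continuation" would be the task's saved stack and registers, and the ready queue would be shared by several worker threads; the scheduling shape stays the same.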
From: Tim A. <tim...@gm...> - 2009-04-04 09:19:56
|
On Fri, Apr 3, 2009 at 10:55 PM, <asy...@gm...> wrote:
>
> As far as I know this is the most common approach as well, because it is
> less error prone. On the other side, in standard threads, or any job system
> I've heard about, this approach is not efficient because it blocks the
> execution thread.

Lock-free algorithms are, in some ways, less error-prone. In particular, correctly-implemented lock-free algorithms cannot deadlock, and they prevent priority inversion.

> >I'm not quite sure what your library provides to the domain of game
> >development
> That's what I would like your feedback on.

In that case, I again suggest you provide some specific use cases. What problems did you run into that led you to write the library? How does the library you wrote solve the specific problems you encountered?

--Tim A.
|
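Tim's deadlock-freedom point can be illustrated with the textbook Treiber stack: every operation is a compare-and-swap retry loop, so no thread ever holds a lock that another thread could block on. This is a standard sketch, not code from any library discussed in this thread, and it deliberately ignores the ABA problem that a production version must handle.

```cpp
#include <atomic>

// Treiber stack: lock-free push/pop via compare-and-swap. Since no
// locks are held, suspension of one thread can never block another,
// so there is no deadlock and no priority inversion. Note: Pop() as
// written ignores the ABA problem.
struct Node {
    int   value;
    Node* next;
};

std::atomic<Node*> g_top{nullptr};

void Push(Node* n)
{
    n->next = g_top.load();
    // Retry until our snapshot of the top is still current; on failure
    // compare_exchange_weak reloads the current top into n->next.
    while (!g_top.compare_exchange_weak(n->next, n)) {}
}

Node* Pop()
{
    Node* n = g_top.load();
    while (n && !g_top.compare_exchange_weak(n, n->next)) {}
    return n;   // nullptr if the stack was empty
}
```

The "assist obstructing threads" behavior Tim mentions shows up here as the retry loop: a thread that loses a CAS race simply observes the other thread's completed operation and tries again.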