Thread: Re: [Algorithms] General purpose task parallel threading approach (Page 2)
From: Nicholas "Indy" R. <ar...@gm...> - 2009-04-04 09:32:42
|
I personally just don't see what this offers over standard threads; it seems that your "worker threads" provide no additional functionality beyond standard OS-level threads (with lower priority). And if you are seeing significant speedups over OS threading at this level of development in your code, I figure it's much more likely that you are simply missing something important and not immediately apparent than that the OS libraries are very poorly written.

> I composed a quick sample - 2 threads which ping pong each other via 2 events (1 core utilized in total)

You will find that benchmarks done in a single-core environment quickly lose any resemblance to results in a multi-core environment. And if single core is all you are going for, a comparison to threads is probably inappropriate and cooperative threads are what you are looking for; Win32 already provides an implementation: "fibers".

Indy

On Sat, Apr 4, 2009 at 1:44 AM, <asy...@gm...> wrote:
>
> >> Shouldn't you be redesigning your system to eliminate this wait for data?
> >> Ie if you need some data it should be already computed, ie the frame before.
> That's the only option you have, using the standard approach, to achieve a benefit from a multicore architecture.
>
> What I am trying to say is that you really don't have to. Writing single-threaded code is a lot easier, takes less time, the code is more compact, and all dependencies in my case are handled automatically (whereas in your case you have to code them explicitly). To achieve multicore scaling, all you need to do is provide enough work for the worker threads to execute while other tasks wait for their dependencies to complete.
>
> Do you have a concrete example which bothers you (maybe I could provide a better explanation of how it'll work on a concrete case)?
>
> Regards,
> Alexander.
>
> 2009/4/4 Oscar Forth <os...@tr...>
>>
>> Am I being thick? It looks to me like you are trying to write single-threaded code with multiple threads. Shouldn't you be redesigning your system to eliminate this wait for data? Ie if you need some data it should be already computed, ie the frame before.
>>
>> Otherwise you are never going to get the best out of a multicore architecture ...
>>
>> Just my, probably wrong, 2p
>> 2009/4/4 <asy...@gm...>
>>>
>>> Hi Jon,
>>>
>>> > you instead just suspend the current thread and
>>> > create a new thread in the thread pool, what would the difference be?
>>> The difference here is speed.
>>>
>>> I have an implementation which does the same but with threads (for debugging purposes - you can find it in the package as well), and it is significantly slower. While my implementation still touches a semaphore on each task switch, it can push about a million task switches per second now. I actually didn't measure performance with threads, but will try doing that later.
>>> My target for task switch cost is about one to two hundred cycles, which is about 10-15 million task switches per second on a 3 GHz CPU.
>>> You should also compare thread spawning cost (unless you preallocate all of them) and memory overhead (for Windows threads the minimum stack size is 4 KB), but if you preallocate, you'd do it for the maximum stack size. Creation cost in my system is just a malloc (which you can also override with a faster allocator), and the minimum allowed task is 32 bytes (on x86).
>>>
>>> The cost of synchronization primitives on Windows is high, imho (a critical section is 48 bytes in size, plus an attached event in the long run). My approach allows the minimal size for a critical-section-like primitive to be 2 bits only (it can be injected into a pointer), a simple Lock (non-recursive mutex) is 4 bytes, a Mutex (recursive mutex - a complete analog of a critical section) is 12 bytes, with no use of Windows events or any other kernel objects.
>>>
>>> Besides the huge optimization potential in speed and size, there is no difference from what you do, Jon.
>>>
>>> > Have you measured it?
>>> I'll come back with numbers later.
>>>
>>> Regards,
>>> Alexander.
>>>
>>> 2009/4/4 Jon Watte <jw...@gm...>
>>>>
>>>> asy...@gm... wrote:
>>>> > To fight blocking conditions I've made a task suspension mechanism,
>>>> > which allows a worker thread to suspend a task when it meets something
>>>> > it should wait for (whether it is a mutex or an ongoing disk read).
>>>> > After a task is suspended, the worker thread picks up another task from the
>>>> > task pool and keeps on executing it. When the blocking condition ends, the
>>>> > task is resumed and put back into the task pool.
>>>>
>>>> You are basically combining fibers (co-operative threads) with pre-emptive threads. There really isn't that much cost difference between a pre-emptive thread and a fiber, so you might as well just spawn another pre-emptive thread, and suspend the current thread, rather than switching fibers on the current thread. If you're using a dynamically growing thread pool, your worst-case thread count (and thus memory usage) will be no worse than your worst-case fiber count.
>>>>
>>>> > - a unified way to wait on primitives (you can wait on all sorts of
>>>> >   primitives, including tasks, together)
>>>> > - task canceling
>>>> > - timeout on wait operations
>>>>
>>>> We do that in our fibers, too. However, it turns out that on most OSes you can do that for OS primitives too. On Windows, pretty much anything can go into WaitForMultipleObjects() (although there is a 64-object limitation). On UNIX, as long as something is a file descriptor, you can wait on it using select().
>>>>
>>>> So, if instead of "suspending" the current "task" and picking another task within the worker thread -- which I presume to be exactly like fiber switching -- you instead just suspend the current thread and create a new thread in the thread pool, what would the difference be? Have you measured it?
>>>>
>>>> Sincerely,
>>>>
>>>> jw
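For concreteness, the quoted "2 threads which ping pong each other via 2 events" measurement boils down to something like the sketch below. This is not the actual sample from the package - the iteration count, timing method and names are placeholders - but it shows the shape of the round-trip cost being benchmarked.

    #include <windows.h>
    #include <stdio.h>

    static HANDLE g_ping, g_pong;
    static const int kIters = 1000000;

    static DWORD WINAPI Ponger(void*)
    {
        for (int i = 0; i < kIters; ++i) {
            WaitForSingleObject(g_ping, INFINITE);  // wait for the other thread
            SetEvent(g_pong);                       // hand control straight back
        }
        return 0;
    }

    int main()
    {
        g_ping = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset events
        g_pong = CreateEvent(NULL, FALSE, FALSE, NULL);
        HANDLE t = CreateThread(NULL, 0, Ponger, NULL, 0, NULL);

        DWORD start = GetTickCount();
        for (int i = 0; i < kIters; ++i) {
            SetEvent(g_ping);                        // only one thread runs at a time,
            WaitForSingleObject(g_pong, INFINITE);   // so one core is utilized in total
        }
        DWORD elapsed = GetTickCount() - start;

        printf("%d round trips in %lu ms\n", kIters, (unsigned long)elapsed);
        WaitForSingleObject(t, INFINITE);
        return 0;
    }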
|
From: <asy...@gm...> - 2009-04-04 10:11:51
|
> You will find that benchmarks done in a single core environment

I just wanted to measure the raw overhead of thread communication in the kind of system Jon Watte described.

> Win32 already provides an implementation "fibers".

Fibers are only a small fraction of the functionality: a fiber doesn't automatically switch to another available fiber if it enters a blocking critical section, for example - you need a framework built on top of fibers for that to happen.

Regards,
Alexander.
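To illustrate the "framework on top of fibers" point, here is a sketch (not Alexander's library - all names are hypothetical, and it assumes a single worker thread that has already called ConvertThreadToFiber) of a lock that parks the current fiber and hands the thread back to a scheduler fiber instead of blocking:

    #include <windows.h>
    #include <deque>

    struct Scheduler {
        LPVOID scheduler_fiber;      // the fiber that picks the next runnable task
        std::deque<LPVOID> ready;    // fibers that have become runnable again
    };

    struct FiberLock {
        LPVOID owner;                // fiber currently holding the lock (NULL if free)
        std::deque<LPVOID> waiters;  // fibers parked on this lock

        void acquire(Scheduler& s) {
            LPVOID self = GetCurrentFiber();
            while (owner != NULL) {
                // Don't block the OS thread: park this fiber and let the
                // scheduler run some other ready fiber on the same thread.
                waiters.push_back(self);
                SwitchToFiber(s.scheduler_fiber);
            }
            owner = self;
        }

        void release(Scheduler& s) {
            owner = NULL;
            if (!waiters.empty()) {              // wake one waiter
                s.ready.push_back(waiters.front());
                waiters.pop_front();
            }
        }
    };

A real scheduler would of course need thread-safe queues and a policy for resuming parked fibers; raw Win32 fibers give you none of that, which is the point being made above.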
|
From: <asy...@gm...> - 2009-04-04 09:55:46
|
> Lock-free algorithms are, in some ways, less error-prone.

I'm not an expert in lock-free, for sure, but the most advanced lock-free structure I've seen was a doubly-linked list implementation. In many real-life scenarios you often need to keep much more complex structures consistent (i.e. a set of tables, like hashmaps), for which I think it would be close to impossible to write an efficient lock-free implementation. I also noticed that most programmers prefer using mutexes anyway, because they are easier to write, it is clearer to other programmers what happens, and they are guaranteed to work. This made me think that using mutexes is fine as long as you provide a cheap way for the CPU to execute something else while the mutex is being held.

> How does the library you wrote solve the specific problems you encountered?

Originally I was trying to solve the problem of how to write a server serving multiple clients simultaneously, utilizing all cores, without writing a state machine for each client (I found out that it is damn complex to extend logic when it is expressed via a state machine). The solution I have now fits perfectly, imho - I only need to write a task per client which does simple send/receive (as if you'd written a single-threaded application), and I get ideal multicore scaling.

Regards,
Alexander.

2009/4/4 Tim Ambrogi <tim...@gm...>
> On Fri, Apr 3, 2009 at 10:55 PM, <asy...@gm...> wrote:
>> As far as I know this is the most common approach as well, because it is less error prone. On the other side, in standard threads, or any job system I've heard about, this approach is not efficient because it blocks the execution thread.
>
> Lock-free algorithms are, in some ways, less error-prone. In particular, correctly-implemented lock-free algorithms cannot deadlock, and prevent priority inversion.
>
>> > I'm not quite sure what your library provides to the domain of game development
>> That's what I would like your feedback on.
>
> In that case, I again suggest you provide some specific use cases. What problems did you run into that led you to write the library? How does the library you wrote solve the specific problems you encountered?
>
> --Tim A.
>
> 2009/4/3 Tim Ambrogi <tim...@gm...>
>> Alexander-
>> Coroutines are an implementation of cooperative multi-tasking. There are implementations in various dynamic languages, such as Lua and Stackless Python. Also, there are less-portable implementations in C++ that involve inline assembly, setjmp/longjmp, fibers, etc., depending on the platform. See http://en.wikipedia.org/wiki/Coroutine for more info.
>>
>> It sounds, however, like you also want to write lower-level, performance-critical systems such as rendering, physics, serialization, etc. across many threads/cores. Your library is similar in silhouette to the popular and fairly general job system, where you break parallelizable work down into N jobs that are handed to idle threads. These jobs aim to either minimize contention on shared resources, or operate in a wait-free manner on said resources. If wait-free operation cannot be achieved within the domain, you usually try a lock-free approach that enables different threads to assist obstructing threads in completing their tasks while you wait. As a last resort, you fall back on traditional synchronization primitives if you can find no more efficient approach.
>>
>> I'm not quite sure what your library provides to the domain of game development, but I'm sure it was motivated by a specific problem you encountered. Perhaps you should describe your use cases more specifically, given that the stated domain of your library seems overly broad?
>>
>> --Tim A.
>>
>> On Fri, Apr 3, 2009 at 12:27 PM, <asy...@gm...> wrote:
>>> What do you mean?
>>> 2009/4/3 Gregory Junker <gj...@da...>
>>>> Isn't this where coroutines come into play?
>>>> Greg
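As a flavor of the lock-free style being discussed, here is the classic CAS-based stack push (a "Treiber stack"). Push is the easy half; pop needs ABA protection (version counters, or the Win32 SList API), which is exactly where the complexity Alexander mentions starts to bite. Sketch only, not anyone's production code:

    #include <windows.h>

    struct Node {
        Node* next;
        int   payload;
    };

    void LockFreePush(Node* volatile* head, Node* node)
    {
        for (;;) {
            Node* old_head = *head;
            node->next = old_head;
            // Publish the new node only if nobody changed the head meanwhile;
            // otherwise loop and retry against the new head.
            if (InterlockedCompareExchangePointer(
                    (PVOID volatile*)head, node, old_head) == old_head)
                return;
        }
    }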
|
From: Nicholas "Indy" R. <ar...@gm...> - 2009-04-04 10:22:15
|
On Sat, Apr 4, 2009 at 2:55 AM, <asy...@gm...> wrote:
> Originally I was trying to solve the problem of how to write a server serving multiple clients simultaneously, utilizing all cores, without writing a state machine for each client (I found out that it is damn complex to extend logic when it is expressed via a state machine).
> The solution I have now fits perfectly, imho - I only need to write a task per client which does simple send/receive (as if you'd written a single-threaded application), and I get ideal multicore scaling.

It's problematic to take an approach that works well for servers and extend it to video games. Most servers have very low data contention, don't worry too much about small variances in latency, and spend a lot of time waiting around on IO and the like. On the other hand, video games not designed for multithreading generally have very high data contention, which requires many locks, and at that point the cache misses generally cost just as much as the locks themselves. Latency requirements are very strict, and additionally most games spend very little to zero time in IO; when they do, it is almost always during loading, or designed to be async to begin with. Given that, there is not a lot of time to fit in extra cycles for worker threads on the main CPU.

Indy
|
From: <asy...@gm...> - 2009-04-04 10:55:52
|
How do you utilize multicore in your game, then? I'd like to cover as many areas of the programming world as I can, so I wonder what kind of functionality I am missing that you would like to have.

The automatic task switching on blocking conditions has no cost if you don't use it. In that case you have something close to OpenMP-style multithreading.

Alexander.
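For comparison, the "OpenMP-style multithreading" being alluded to is the purely data-parallel fan-out below: independent iterations, no blocking inside the loop body, one implicit join at the end, so a suspend/resume mechanism never gets exercised. (Function and parameter names are placeholders.)

    #include <omp.h>

    void UpdateParticles(float* positions, const float* velocities,
                         int count, float dt)
    {
        #pragma omp parallel for
        for (int i = 0; i < count; ++i) {
            // Each iteration is independent: no locks, no waits, so the
            // worker threads just chew through the range and join at the end.
            positions[i] += velocities[i] * dt;
        }
    }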
|
From: Oscar F. <os...@tr...> - 2009-04-04 13:25:58
|
> Do you have a concrete example which bothers you (maybe I could provide a better explanation of how it'll work on a concrete case)?

Well, I don't have a concrete example. But the fact that you are stalling at all is the bit that concerns me. I appreciate that using a double-buffered approach leads to latency, but it makes much more sense to me, especially as far as scaling to many cores goes.

Thanks to using a display-list-style system for DirectX, I'm also doing rendering from multiple threads. Essentially each "task" decides what it wants to do ... sends its forces to the physics engine and then renders the previous frame's physics results. This, of course, allows it to respond to things like walking into a wall straight away and make alternate "decisions".

This way I'm avoiding stalls except at the end of a frame, when I synchronise everything. I guess I could probably completely de-couple rendering from logic, but I don't see the point of rendering at 300 fps when you can only update the logic at 5 fps.

Anyway ... that's just my approach. I'm not saying you are wrong (my knowledge is limited enough without telling other people they are wrong ;)), but it seems to >me< that you are attacking the problem of multiple cores in totally the wrong way.
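A minimal sketch of the double-buffered hand-off described above: game code reads last frame's physics results while the physics job writes this frame's, and the only synchronisation point is the swap at end of frame. Types and names are placeholders, not the actual engine code.

    static const int MAX_BODIES = 1024;              // placeholder size

    struct BodyState { float pos[3]; float vel[3]; };

    struct PhysicsBuffers {
        BodyState states[2][MAX_BODIES];
        int       read_index;                        // completed results, safe to render from

        const BodyState* read()  const { return states[read_index]; }
        BodyState*       write()       { return states[read_index ^ 1]; }

        // Called once per frame, after all tasks have been joined,
        // so readers and writers never touch the same buffer mid-frame.
        void swap() { read_index ^= 1; }
    };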
|
From: <asy...@gm...> - 2009-04-04 15:00:34
|
> But the fact you are stalling at all, is the bit that concerns me.

What stalls are you talking about? On the contrary, I'm trying to eliminate all the cases where the CPU has to wait on something.

> but it seems to >me< that you are attacking the problem of multiple cores in totally the wrong way.

Explain. If I make things work at the same speed as the standard approach (when there is no or very little contention), or faster (in case of contention or blocking IO), what is wrong here?

Could you explain a bit more about your approach? And what do you do when a physics iteration finishes faster than a rendering iteration, or the other way around?

Alexander.
|
From: Jon W. <jw...@gm...> - 2009-04-04 16:12:52
|
asy...@gm... wrote:
> My target for task switch cost is about one to two hundred cycles, which is about 10-15 million task switches per second on a 3 GHz CPU.
> You should also compare thread spawning cost (unless you preallocate all of them) and memory overhead (for Windows threads the minimum stack size is 4 KB), but if you preallocate, you'd do it for the maximum stack size. Creation cost in my system is just a malloc (which you can also override with a faster allocator), and the minimum allowed task is 32 bytes (on x86).

If you're using exception handling, or stack buffer overflow checks, or even certain standard C library functions, a 32-byte stack will get you into trouble. Also, you want your thread/task stacks to be page aligned, so that a stack overflow will actually be caught by a guard page. And, yes, you'd either pre-allocate your maximum number of blocking threads, or you would allocate more threads on demand when you started blocking and ran out of runnable threads. Note: in a system with guard-page protection for each thread stack, you need 8 kB per thread minimum, plus the register state (which is about 512 bytes if you care about SSE). This is true whether you're using fibers or threads. Trying to find out whether an innocuous library call you're making is going to overwrite your malloc-ed stack or not is too unpredictable to rely on, IMO -- I prefer the safety of a guard page.

I don't think a full thread context switch (stack pointer, registers, pc) can be done in a hundred cycles on x86; at least not if they aren't already all in L1 on the local core. The amount of data you need to shuffle and the memory latencies involved seem to contra-indicate such a low switch time. Getting as close as you can is not a bad idea, though :-)

Sincerely,

jw
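The guard-page arrangement being advocated looks roughly like this on Win32 (a sketch: error handling, release, and stack-growth policy omitted). The lowest page of the allocation is marked PAGE_GUARD so an overflow faults instead of silently trampling adjacent memory:

    #include <windows.h>

    void* AllocTaskStack(SIZE_T stack_bytes)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        SIZE_T page  = si.dwPageSize;
        SIZE_T total = stack_bytes + page;          // usable stack + one guard page

        char* base = (char*)VirtualAlloc(NULL, total, MEM_RESERVE | MEM_COMMIT,
                                         PAGE_READWRITE);
        if (!base)
            return NULL;

        // The stack grows downward, so the guard page sits at the low end.
        DWORD old_protect;
        VirtualProtect(base, page, PAGE_READWRITE | PAGE_GUARD, &old_protect);

        return base + total;                        // top-of-stack pointer for the task
    }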
|
From: <asy...@gm...> - 2009-04-04 16:35:39
|
Of course there is a tradeoff if you decide to use tasks with stacks as small as 32 bytes (though the sample I provided can do a printf from such small tasks). The user code I've worked with required 1-4 kilobytes on average. I also provide an optional automatic guard-page safety mechanism to catch stack overflows, and a way to handle library calls (that's why printf works from a 32-byte task).

You are right about the full thread context state, though - but you don't have to capture the full state to make cooperation work in user space, just a small fraction of it. Also keep in mind that according to the Windows x86 ABI, the SSE registers (and 75% of the other registers) are not callee-preserved; that's why a user-space task state requires about 20 bytes and SSE will still work (as well as MMX and FPU code).

Alexander.
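For the curious, the sort of user-space switch being described can be sketched as below for MSVC/x86 (this is not Alexander's implementation, just an illustration of why ~20 bytes suffices). Only the callee-saved registers and the stack pointer change hands, because the Win32 x86 ABI lets everything else be clobbered across a call; a freshly created task just needs its stack pre-seeded with four register slots and a return address pointing at its entry function.

    struct TaskContext { void* sp; };   // everything else lives on the task's own stack

    __declspec(naked) void __cdecl SwapContext(TaskContext* /*from*/, TaskContext* /*to*/)
    {
        __asm {
            mov  eax, [esp + 4]     // from
            mov  edx, [esp + 8]     // to
            push ebp                // save the callee-saved registers on the
            push ebx                //   current task's own stack (4 regs +
            push esi                //   return address = 20 bytes)
            push edi
            mov  [eax], esp         // remember where this task stopped
            mov  esp, [edx]         // jump onto the other task's stack
            pop  edi                // restore its callee-saved registers
            pop  esi
            pop  ebx
            pop  ebp
            ret                     // resume at its saved return address
        }
    }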
|
From: Oscar F. <os...@tr...> - 2009-04-04 16:58:47
|
> > but it seems to >me< that you are attacking the problem of multiple cores in totally the wrong way.
> Explain. If I make things work at the same speed as a standard approach (when there is no or very little contention) or faster (in case of contention or blocking io), what is wrong here?

Again, I'm not saying I'm correct ... but you stated that you are trying to handle the "ability to handle blocking conditions somehow". I'm suggesting that you don't need this to be a problem. Maybe we are talking about the same thing? Maybe I'm just misunderstanding you? BUT, it still seems to me that if blocking is a problem then you are attacking the problem wrongly.

> Could you explain a bit more about your approach? And what do you do when a physics iteration finishes faster than a rendering iteration, or the other way around?

If my physics system finishes early, I return its thread to the thread pool and give it to the object iterations. I only ever have twice the number of threads that there are cores in the machine (experimentation showed me that this is a good tradeoff between leaving a core stalled on things like memory access and general thread throughput. I would love to experiment on massively multi-core systems, and it looks likely I'll get to play with a 1024-core system soon to do some testing for a non-games-related problem).

I don't know what more I can tell you about my approach beyond my last post. Basically I am aiming to get tasks as parallel to each other as possible. A given task, when it finishes, has a "callback" that tells it when it next has to bother doing any AI decisions. Otherwise it only handles animation and rendering of itself. Basically I'm looking at where a stall can occur and trying to eliminate that stall. Currently my only stalling occurs when I add to global lists (message list and render list).

The system seems perfectly compatible with a networking system whereby you don't know exactly when the next update will occur. In fact things like the rendering and physics could easily be offloaded to whole other computers if such a need ever came about.

I'm pretty crap at explaining things. If you have any specific questions to ask about my approach then I'll happily attempt to answer them. I'm better at that :) That said, I will re-iterate, I'm not saying your solution is wrong. It just goes contrary to my experience and knowledge (which is inevitably flawed). My system will, however, port nicely to >any< pre-emptively multi-tasking operating system, which was my aim.
|
From: <asy...@gm...> - 2009-04-04 18:09:12
|
Blocking is only part of the problem (I used it for demonstration purposes, basically); on top of that you have all sorts of IO and task dependencies.

You can compare my approach to hyperthreading in the hardware world, which is becoming more and more popular these days (Sun's UltraSPARC, Intel's Larrabee, and GPUs are also using that scheme as far as I'm aware). Hyperthreading switches virtual threads when the currently executing thread needs to wait on something (like a memory access to finish). Is it needed? Not if you lay out your data in a way where you don't have memory access latencies, but you usually do have them, because in many cases eliminating the latencies is very difficult, time consuming, or completely impossible.

I'm not saying that one should not plan his code for an asynchronous execution scenario, but the delays from all sorts of blocking conditions can be hidden, like it happens with hyperthreading.

Alexander.
|
From: Ola O. <ola...@gm...> - 2009-04-05 08:14:29
|
> I can't speak on UltraSPARC, but neither Larrabee nor GPUs use hyperthreading as the main form of threading (or afaik at all). They are both data parallel architectures primarily. Additionally, hyperthreading is built to fill bubbles in the pipeline, and does so rather well because it doesn't have to worry about any sort of context switch, as the threading is implemented in hardware and multiple states can actually be active at the same time.

Well, NVIDIA's current GPUs are fundamentally built on using 'hyperthreading' to hide latency, as opposed to keeping large caches. Each core is capable of keeping 1024 threads going at a time, which are processed SIMD fashion 32 at a time. So I don't see what you mean by saying they do not use hyperthreading?

Larrabee is different, and less is known about it, but from their presentations they have 4 'hyperthreads' per core. However, also from their presentations, they intend to use cooperative threads/fibers to run their shaders, switching when hitting a texture fetch or suchlike to achieve the same latency hiding as other GPUs.

cheers
.ola
|
From: Nicholas "Indy" R. <ar...@gm...> - 2009-04-04 19:45:25
|
On Sat, Apr 4, 2009 at 11:09 AM, <asy...@gm...> wrote:
> Blocking is only part of the problem (I used it for demonstration purposes, basically); on top of that you have all sorts of IO and task dependencies.

These are all variants on blocking; it's best to eliminate these dependencies.

> You can compare my approach to hyperthreading in the hardware world, which is becoming more and more popular these days (Sun's UltraSPARC, Intel's Larrabee, and GPUs are also using that scheme as far as I'm aware).

I can't speak on UltraSPARC, but neither Larrabee nor GPUs use hyperthreading as the main form of threading (or afaik at all). They are both data parallel architectures primarily. Additionally, hyperthreading is built to fill bubbles in the pipeline, and does so rather well because it doesn't have to worry about any sort of context switch, as the threading is implemented in hardware and multiple states can actually be active at the same time.

> Hyperthreading switches virtual threads when the currently executing thread needs to wait on something (like a memory access to finish). Is it needed? Not if you lay out your data in a way where you don't have memory access latencies, but you usually do have them, because in many cases eliminating the latencies is very difficult, time consuming, or completely impossible.

Keep in mind that there is zero context switch overhead in hyperthreading. Additionally, eliminating cache stalls is far more difficult than eliminating blocking IO and the like.

> I'm not saying that one should not plan his code for an asynchronous execution scenario, but the delays from all sorts of blocking conditions can be hidden, like it happens with hyperthreading.

Additionally, hyperthreading still requires threads to be designed in a way that is low on data contention. And if you already have worker tasks that can do that, it's generally best to create OS threads and just let them run on separate cores or hyperthreaded - but still ensure that the main thread does no waiting. In a well designed system, thread switches are a non-issue because they just shouldn't happen very often (or at all, in the case of consoles), so I don't think it's worthwhile worrying too much about the amount of time it takes to do a thread switch.

Indy
|
From: Jon W. <jw...@gm...> - 2009-04-04 23:10:39
|
Nicholas "Indy" Ray wrote:
> I can't speak on UltraSPARC, but neither Larrabee nor GPUs use hyperthreading as the main form of threading (or afaik at all). They are both data parallel architectures primarily.

The latest UltraSPARC has, I think, eight cores, each with eight hyperthreads. When one core blocks on a cache miss, it will automatically schedule the next hyperthread. Larrabee has almost exactly the same set-up: many hyper-threads that it switches between to hide data access latencies. NVIDIA calls the same thing warps, I think (one warp is 8 threads -- get it? :-)

However, in software you don't get the same fine-grained benefit; not by a long shot. When you wait for something, you either wait on a different task using a user-mode synchronization primitive, or you wait on something that has to come from the kernel (I/O, interrupt, etc). Those are the only two options. Neither of them allows you to switch tasks quickly enough, or with little enough overhead, to be compared to hyper-threading IMO.

> Additionally, hyperthreading still requires threads to be designed in a way that is low on data contention. And if you already have worker tasks that can do that, it's generally best to create OS threads and just let them run on separate cores or hyperthreaded - but still ensure that the main thread does no waiting. In a well designed system, thread switches are a non-issue because they just shouldn't happen very often (or at all, in the case of consoles), so I don't think it's worthwhile worrying too much about the amount of time it takes to do a thread switch.

I think the idea of a thread per subsystem (particles, skinning, collision, simulation, audio, scene graph issue, etc) will only scale so far. Once you have 64-way CPUs (like the latest SPARCs), and even more with Larrabee (if you include the hyper-threads), making your game (or any software) go fast will mean a task-oriented workload, rather than a subsystem-oriented workload. As long as your tasks are significantly heavier than context switching, you're doing fine. That's what the fibers-within-threads approach tries to optimize. There probably exists a real-world workload where the efficiency of fibers leads to measurable throughput gains even though the overhead of fibers or threading is not dramatically large compared to the workload, but I think that slice is pretty thin. It all comes down to Amdahl's Law in the end.

Sincerely,

jw
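As a back-of-the-envelope illustration of that last point (the numbers here are illustrative, not from the post above): Amdahl's Law gives speedup(n) = 1 / ((1 - p) + p/n) for a parallel fraction p on n hardware threads. With p = 0.95 and n = 32 that is about 1 / (0.05 + 0.0297) ≈ 12.5x, and the cap as n grows without bound is only 1/0.05 = 20x - so the serial fraction, which includes any scheduling and switching overhead that doesn't parallelize, dominates the result long before raw core count does.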
|
From: Jarkko L. <al...@gm...> - 2009-04-05 00:10:42
|
I think you really need to target both task and data level parallelism to make your game engine truly scalable for many-core architectures. I don't see why you would need to consider the cost of context switching in relation to the cost of task execution, though, since if you've got a job queue, each worker thread just pulls tasks from the queue without switching context.

Cheers, Jarkko
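The shape of the shared-job-queue worker being described is roughly the following (a minimal sketch with hypothetical names; shutdown, dependencies and lock-free queues are left out). Each worker pops the next ready task and runs it to completion, so there is never a context switch inside a task:

    #include <windows.h>
    #include <deque>

    typedef void (*JobFn)(void* arg);
    struct Job { JobFn fn; void* arg; };

    struct JobQueue {
        CRITICAL_SECTION lock;
        HANDLE           available;   // semaphore counting queued jobs
        std::deque<Job>  jobs;

        JobQueue() {
            InitializeCriticalSection(&lock);
            available = CreateSemaphore(NULL, 0, MAXLONG, NULL);
        }
        void push(const Job& j) {
            EnterCriticalSection(&lock);
            jobs.push_back(j);
            LeaveCriticalSection(&lock);
            ReleaseSemaphore(available, 1, NULL);
        }
        Job pop() {
            WaitForSingleObject(available, INFINITE);
            EnterCriticalSection(&lock);
            Job j = jobs.front();
            jobs.pop_front();
            LeaveCriticalSection(&lock);
            return j;
        }
    };

    static DWORD WINAPI WorkerMain(void* param)
    {
        JobQueue* q = (JobQueue*)param;
        for (;;) {                    // run tasks back to back; never suspend one
            Job j = q->pop();
            j.fn(j.arg);
        }
    }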
|
From: Sebastian S. <seb...@gm...> - 2009-04-05 00:30:55
|
On Sun, Apr 5, 2009 at 1:10 AM, Jarkko Lempiainen <al...@gm...> wrote:
> I think you really need to target both task and data level parallelism to make your game engine truly scalable for many-core architectures. I don't see why you would need to consider the cost of context switching in relation to the cost of task execution, though, since if you've got a job queue, each worker thread just pulls tasks from the queue without switching context.

The problem is when you have a task spawn N other tasks and then wait for their results: the core running that worker needs to do something until the sub-tasks have completed. In practice, I do think that this kind of synchronisation (specifically the fork-and-join model) will be extremely common in the future, which means that context switching needs to be fast (Windows 7's user mode scheduling may be useful here - in fact I believe that's what Microsoft's native concurrency runtime uses, while the .NET one is backed by thread pools, I believe).

-- Sebastian Sylvan
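Reduced to its bookkeeping, the fork-and-join shape in question looks like the sketch below: the parent spawns N children and then has to occupy the core until an atomic counter says they are all done - either by running other queued tasks (the job-queue option) or by suspending itself and letting the scheduler run another task/fiber (the context-switching option). Names are hypothetical and the spawn call is elided.

    #include <windows.h>

    struct JoinCounter { volatile LONG pending; };

    // Called by each child task as its last act.
    void SignalChildDone(JoinCounter* jc)
    {
        InterlockedDecrement(&jc->pending);
    }

    void ForkAndJoin(JoinCounter* jc, int n /*, work description... */)
    {
        jc->pending = n;
        for (int i = 0; i < n; ++i) {
            // spawn_child_task(jc, i);   // enqueue child i (hypothetical)
        }
        while (jc->pending > 0) {
            // Option A: pull and run another queued task right here.
            // Option B: suspend this task and let the scheduler pick another.
            // Busy-waiting (shown) is only the degenerate fallback:
            SwitchToThread();
        }
    }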
|
From: Jon W. <jw...@gm...> - 2009-04-05 05:32:54
|
Jarkko Lempiainen wrote:
> I think you really need to target both task and data level parallelism to make your game engine truly scalable for many-core architectures. I don't see why you would need to consider the cost of context switching in relation to the cost of task execution, though, since if you've got a job queue, each worker thread just pulls tasks from the queue without switching context.

If you are fully FIFO in your data flow, then that is true. However, in real code you end up with data flow dependencies. For example, you typically want to do simulation something like:

- extract state from the previous iteration of the physics frame
- run collision/intersection tests
- push state over to the renderer
- read user input
- extract entity behavior based on collision and input state
- kick off the new physics frame
- present the frame (vsync)

There are, unfortunately, a number of intra-frame dependencies here, such as the entity behavior needing the output of the collision tests. (Can I jump? That depends on whether I have my feet on the ground.) You may even have intra-frame dependencies within the entity behavior job, where one entity can cause more behavior for other entities (explosions, triggers, etc).

You can set this up to be totally FIFO, and thus be able to start all the jobs in parallel, but you will be introducing latency; sometimes as much as two additional frames (which for certain game types can be noticed as sluggish controls, etc).

Now, most of these dependencies are of the form "this task can't start until that task has ended", which can be implemented in a multi-threaded pool without blocking -- if a single thread runs low on work, it simply busy-waits until some pre-requisite finishes. Or you have a queue of background work, such as pathfinding queries, etc -- but sometimes that queue will be empty anyway. Or you can use a blocking primitive of some sort (fibers or threads) to deal with the case where the queue is temporarily stalled waiting on the output of something else that's executing out of the queue. Depending on how much of this happens each frame, the overhead may or may not matter.

This brings up another interesting point, which you also remarked on: the workload for a game is generally quite different from the workload for a server. Typically, you pre-load everything you want, and just run in-core. The cases where that breaks down are various large-world games (streaming worlds, MMOs, etc) where there really are "unplanned" loads of data, where being able to have a task block until it's done would sometimes be convenient. However, typically this happens seldom enough (compared to the other stuff) that a state machine approach, or a simple threading approach, really isn't that punitive.

Sincerely,

jw
|
From: <asy...@gm...> - 2009-04-05 17:18:21
|
> When you wait for something, you either wait on a different task using a user-mode synchronization primitive, or you wait on something that has to come from the kernel (I/O, interrupt, etc). Those are the only two options.

That's exactly what I try to do efficiently - give the CPU some other work to do while these kinds of waits are in progress.

I'd also add a third one - waiting on synchronization primitives. It doesn't have to be a mutex; it can be any primitive which controls resource access with regard to multiple tasks. For example, with a semaphore you can limit the number of tasks which can simultaneously enter a protected region to 10; if an 11th task tries to enter, it will wait until one of those 10 tasks leaves the protected region. Just keep in mind: a synchronization primitive is a way to control task behaviour with regard to other tasks and resources. A task itself is a synchronization primitive, and you can wait from one task on another task to finish.

Alexander
|
From: <asy...@gm...> - 2009-04-05 17:33:43
|
Without the ability to save and restore tasks, your dependency system will have a whole bunch of limitations, because during a wait you can only execute other tasks recursively.

2009/4/5 Jarkko Lempiainen <al...@gm...>
> You could also not have a wait for dependent tasks within a task, but rather have the job queue dependency system take care of it by splitting the function into two tasks. Or you could implement the wait by pulling tasks from the job queue until the dependent tasks have been completed. There are no context switches needed in either case, though.
|
From: <asy...@gm...> - 2009-04-06 05:37:55
|
How is your dependency system supposed to work then? Do you also mean that inside a task you can't wait for other tasks to finish (because that becomes 'blocking within tasks')? That's a really huge limitation imho.

Alexander.

2009/4/5 Jarkko Lempiainen <al...@gm...>
> Why is it a problem to execute the tasks in a recursive way? Note that this recursive way of executing tasks is used only when blocking within tasks, not in the job queue dependency system.
>
> Cheers, Jarkko
>
> ------------------------------
> *From:* asy...@gm... [mailto:asy...@gm...]
> *Sent:* Sunday, April 05, 2009 8:34 PM
> *To:* Game Development Algorithms
> *Subject:* Re: [Algorithms] General purpose task parallel threading approach
>
> Without the ability to save/restore tasks, your dependency system will have quite a few limitations, because during a wait you can only execute other tasks recursively.
>
> 2009/4/5 Jarkko Lempiainen <al...@gm...>
> You could also not have a wait for dependent tasks within a task, but rather have the job queue dependency system take care of it by splitting the function into two tasks. Or you could implement the wait by pulling tasks from the job queue until the dependent tasks have been completed. There are no context switches needed in either case, though.
-- Regards, Alexander Karnakov |
|
From: Jarkko L. <al...@gm...> - 2009-04-06 07:40:10
|
A job queue holds dependencies between tasks, and when a task is completed it releases the depending task for execution. This is a different system from defining dependencies by waiting within a task.

Cheers, Jarkko

_____

From: asy...@gm... [mailto:asy...@gm...]
Sent: Monday, April 06, 2009 8:38 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

How is your dependency system supposed to work then? Do you also mean that inside a task you can't wait for other tasks to finish (because that becomes 'blocking within tasks')? That's a really huge limitation imho.

Alexander.

2009/4/5 Jarkko Lempiainen <al...@gm...>

Why is it a problem to execute the tasks in a recursive way? Note that this recursive way of executing tasks is used only when blocking within tasks, not in the job queue dependency system. |
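A minimal sketch of that kind of dependency release (Task, JobQueue and Push are assumed names, not Jarkko's actual code): each task carries a count of unfinished dependencies, and whichever dependency finishes last pushes the dependent task to the queue.

    #include <windows.h>
    #include <vector>

    struct Task;
    struct JobQueue;                      // assumed to exist elsewhere
    void Push(JobQueue& q, Task* t);      // assumed: makes 't' runnable by workers

    struct Task
    {
        void (*fn)(void*);
        void* arg;
        volatile LONG pendingDeps;        // unfinished tasks this one depends on
        std::vector<Task*> dependents;    // tasks that depend on this one
    };

    // Called by a worker when it finishes executing 't': any dependent whose
    // last dependency just completed is released for execution.
    void OnTaskCompleted(Task* t, JobQueue& q)
    {
        for (size_t i = 0; i < t->dependents.size(); ++i)
        {
            Task* d = t->dependents[i];
            if (InterlockedDecrement(&d->pendingDeps) == 0)
                Push(q, d);
        }
    }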
|
From: Jarkko L. <al...@gm...> - 2009-04-06 08:11:17
|
Actually, after a bit of extra thought on locality, I think I'll take back that "locality is a good point in favor of job stealing" part. When you have a shared cache between threads, locality is actually in favor of a shared job queue, because all the worker threads are executing the same type of task, which doesn't put as much pressure on the L2 cache as it would if each thread were executing completely unrelated tasks.

Cheers, Jarkko

_____

From: Jarkko Lempiainen [mailto:al...@gm...]
Sent: Monday, April 06, 2009 10:38 AM
To: 'Game Development Algorithms'
Subject: RE: [Algorithms] General purpose task parallel threading approach

I was under the impression that you should minimize those sync points and dependencies for optimal throughput. If you have the ideal situation where your entire app is split into small tasks, then I guess you wouldn't need to prioritize, but I believe the reality is that you have some large tasks whose execution needs to start as early in the frame as possible to avoid stalling other threads at the end of the frame. So you need to give high execution priority to those tasks, and to the tasks those large tasks depend on.

I agree that the ability to wait within a task is more convenient, particularly if you have to do it deep inside a nested loop. But I don't see spawning new tasks where you can as being as tedious as you describe, and those tasks don't add as much overhead to the queue as you imply. I.e. when you hit a sync point, you are waiting on average for maybe tens or hundreds of tasks to complete, so the overhead of a single task is negligible in proportion. I doubt that contention will be an issue with a shared queue any time soon (definitely not with the 16-32 threads you are talking about), but agreed that locality is a good point in favor of job stealing.

Cheers, Jarkko

_____

From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Monday, April 06, 2009 1:44 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Sun, Apr 5, 2009 at 9:40 PM, Jarkko Lempiainen <al...@gm...> wrote:

There isn't a "whole bunch of magic" outside the task that you would need to do with the first option. The task just adds another function to the job queue which does the work after the wait. For the second option it is the job queue's task to schedule the tasks for optimal throughput, so for a wait it could, for example, favor the tasks that are blocking the waiting task from proceeding.

That's a whole bunch of magic to me. If you want pervasive parallelism (and Mr Amdahl says we do), you're going to need sync points all over the place. Currently we're not doing that because we don't *need* to do it yet, but if you have 16-32 threads to keep busy, you bet you're going to need to spawn off tasks all over the place (and usually sync immediately afterwards). Having to manually split all of these things up into lots of tiny tasks is a big issue IMO - it's tedious, error prone, and adds more overhead (the cost of starting a task still isn't free - so doing it, say, 5-10x as often is a big deal).

With job stealing from other threads you don't get global prioritization, so one thread may be executing lower priority tasks and not working towards optimal throughput.

Why do you actually need global prioritization? After all, you need to finish everything by the end of the frame anyway, so why would you care whether task A gets executed before task B, as long as you never get any bubbles where a core is idle?
All you care about is maximal throughput, maximal locality (especially true for in-order CPUs) and minimal contention/overhead, and for that a giant shared queue is sub-optimal for two reasons:

* It's inherently non-scalable. If you have lots of threads reading from the same queue you'll get contention. Even with a relatively modest number of threads this starts becoming an issue. Work stealing has negligible contention, because someone has to be unlucky enough to try to steal from the bottom of your stack at the same time as you are running out of tasks to do (so you're fighting for the very last item in the stack). That's extremely unlikely to happen, and it's even more unlikely for two threads to try to steal from the same stack - whereas a shared work queue is constantly being hit by multiple threads.

* You lose locality. With a work-stealing approach you get good locality because you run on the same thread the task was spawned on, and tasks spawned close together in time get run close together in time, meaning good cache coherency. If tasks are spread out over multiple cores, and over time, you'll get less coherency.

-- Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862 |
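To illustrate the structure Sebastian is describing (this is not code from any poster - it's a deliberately simplified, lock-based sketch; a production version would typically use a lock-free Chase-Lev-style deque so the owner's path stays cheap):

    #include <deque>
    #include <windows.h>

    struct Task { void (*fn)(void*); void* arg; };

    // Per-worker deque: the owner pushes/pops at the back (LIFO, good locality),
    // thieves steal from the front (FIFO, so they grab the oldest work).
    class WorkStealingDeque
    {
    public:
        WorkStealingDeque()  { InitializeCriticalSection(&lock_); }
        ~WorkStealingDeque() { DeleteCriticalSection(&lock_); }

        void Push(Task* t)              // owner only
        {
            EnterCriticalSection(&lock_);
            tasks_.push_back(t);
            LeaveCriticalSection(&lock_);
        }
        Task* Pop()                     // owner only
        {
            EnterCriticalSection(&lock_);
            Task* t = NULL;
            if (!tasks_.empty()) { t = tasks_.back(); tasks_.pop_back(); }
            LeaveCriticalSection(&lock_);
            return t;
        }
        Task* Steal()                   // called by other workers
        {
            EnterCriticalSection(&lock_);
            Task* t = NULL;
            if (!tasks_.empty()) { t = tasks_.front(); tasks_.pop_front(); }
            LeaveCriticalSection(&lock_);
            return t;
        }
    private:
        CRITICAL_SECTION  lock_;        // owner and thieves only collide when the
        std::deque<Task*> tasks_;       // deque is nearly empty, as noted above
    };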
|
From: <asy...@gm...> - 2009-04-04 21:07:14
|
Well, maybe hyperthreading was not a good comparison, but I think you got my point on how I'd like to hide the stalls.

Generally speaking, of course, all sorts of dependencies should be eliminated; the question is whether it is reasonable to do so given a way to hide the latencies those dependencies generate. But eliminating them only makes the situation even better, that's for sure.

The importance of an efficient job system grows together with the number of cores you have in the system. With that many cores available, the problem of speed turns into "how do I deliver the work I need to do to all the execution units available (so that they are all busy)" rather than how to make it run faster on one execution unit. The only solution to that problem I know of is to fragment your work. And the more cores you have, the more fragmentation you need to keep all your 1024 cores busy, which in the end makes the efficiency of the job system even more important - don't you agree?

Alexander.

2009/4/4 Nicholas "Indy" Ray <ar...@gm...>
> On Sat, Apr 4, 2009 at 11:09 AM, <asy...@gm...> wrote:
> > Blocking is only part of the problem (I used it for demonstration purposes basically); on top of that you have all sorts of IO and task dependencies.
>
> These are all variants on blocking; it's best to eliminate these dependencies.
>
> > You can compare my approach to hyperthreading in the hardware world, which is becoming more and more popular these days (Sun's UltraSparc, Intel's Larrabee, and GPUs are also using that scheme as far as I'm aware).
>
> I can't speak on UltraSparc, but neither Larrabee nor GPUs use hyperthreading as the main form of threading (or afaik at all). They are both data parallel architectures primarily. Additionally, hyperthreading is built to fill bubbles in the pipeline and does so rather well, because the threads don't have to worry about any sort of context switch, as the threading is implemented in hardware and multiple states can actually be active at the same time.
>
> > Hyperthreading switches virtual threads when the currently executed thread needs to wait on something (like a memory access to finish). Is it needed? Not if you lay out your data in a way where you don't have memory access latencies, but you usually do have them, because in many cases eliminating latencies is very difficult/time consuming/completely impossible.
>
> Keep in mind that there is 0 context switch overhead in hyperthreading. Additionally, eliminating cache stalls is far more difficult than blocking IO and the like.
>
> > I'm not saying that one should not plan his code for an asynchronous execution scenario, but the delays due to all sorts of blocking conditions can be hidden, like it happens in hyperthreading.
>
> Additionally hyperthreading still requires threads to be designed in a way that is low on data contention. And if you already have worker tasks that can do such, it's generally best to create OS threads and just let them run on separate threads or hyperthreaded. But still ensure that the main thread does no waiting. In a well designed system, thread switches are a non-issue as they just shouldn't happen very often (or at all in the case of consoles), so I don't think it's worthwhile worrying too much about the amount of time it takes to do a thread switch.
>
> Indy
-- Regards, Alexander Karnakov |
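As a rough illustration of the "fragment your work" point (the helper names below are hypothetical, not from Alexander's package): a big loop can be cut into grain-sized tasks so that there are always enough of them to keep every worker busy.

    struct Task;
    struct JobQueue;                                      // assumed queue type
    Task* MakeRangeTask(void (*body)(int, int, void*),    // assumed helper: wraps a
                        int begin, int end, void* ctx);   //  call over [begin, end)
    void Push(JobQueue& q, Task* t);                      // assumed: hands task to workers

    // Split [begin, end) into chunks of 'grain' iterations and queue one task per
    // chunk. A smaller grain means more fragmentation, which makes it easier to
    // keep many cores busy - at the cost of more per-task overhead.
    void ParallelFor(JobQueue& q, int begin, int end, int grain,
                     void (*body)(int, int, void*), void* ctx)
    {
        for (int i = begin; i < end; i += grain)
        {
            int chunkEnd = (i + grain < end) ? (i + grain) : end;
            Push(q, MakeRangeTask(body, i, chunkEnd, ctx));
        }
    }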
|
From: Nicholas \Indy\ R. <ar...@gm...> - 2009-04-04 22:57:24
|
On Sat, Apr 4, 2009 at 2:07 PM, <asy...@gm...> wrote:
> Well, maybe hyperthreading was not a good comparison, but I think you got my point on how I'd like to hide the stalls.

My point wasn't that the comparison itself was bad, but that the approach is fundamentally flawed on its own.

> Generally speaking, of course, all sorts of dependencies should be eliminated; the question is whether it is reasonable to do so given a way to hide the latencies those dependencies generate. But eliminating them only makes the situation even better, that's for sure.

In game development in particular, all of those dependencies should already be eliminated; there should be no waits in an existing, well-coded game loop.

> The importance of an efficient job system grows together with the number of cores you have in the system.

Your system is hardly a proper job system; a better job system doesn't need to store a stack or other thread state, and should never context switch until the work is done. That is the job of a threading system, which should remain separate.

> The only solution to that problem I know of is to fragment your work. And the more cores you have, the more fragmentation you need to keep all your 1024 cores busy, which in the end makes the efficiency of the job system even more important - don't you agree?

I agree, but again I think an entire cooperative thread system layered atop the OS thread system is not the proper way to distribute jobs. Jobs should be designed in a way that context switching during any particular job would give no benefit, and the existing hyperthreading implementations should be used to fill in the bubbles in the pipeline.

> You are right that this benchmark is far from a real-life scenario, but there should be a way to estimate the speed. How about comparing the worst case scenario for the cooperative task approach with the best case one for winthreads?

The problem here is that any benchmark of context switch time between fibers and OS threads will be in favor of fibers, because they have less overhead - but they also provide less, such as the ability to run on multiple CPUs. In the ideal case for a game (ignoring the OS and other apps on the computer) there should be zero thread switches, and in that case it doesn't matter how long a switch takes, as it should just never happen. That's like spending your time optimizing the MP3 decoder when you are only using Ogg anyway.

Indy |
|
From: Jarkko L. <al...@gm...> - 2009-04-05 00:52:58
|
You could also not have a wait for dependent tasks within a task, but rather have the job queue dependency system take care of it by splitting the function into two tasks. Or you could implement the wait by pulling tasks from the job queue until the dependent tasks have been completed. There are no context switches needed in either case, though.

_____

From: Sebastian Sylvan [mailto:seb...@gm...]
Sent: Sunday, April 05, 2009 3:31 AM
To: Game Development Algorithms
Subject: Re: [Algorithms] General purpose task parallel threading approach

On Sun, Apr 5, 2009 at 1:10 AM, Jarkko Lempiainen <al...@gm...> wrote:

I think you really need to target both task and data level parallelism to make your game engine truly scalable for many-core architectures. I don't see why you would need to consider the cost of context switching in relation to the cost of a task execution though, since if you have a job queue, each worker thread just pulls tasks from the queue without switching context.

The problem is when you have a task spawn N other tasks and then wait for their results: the core running that worker needs to do something until those sub-tasks have completed. In practice, I do think that this kind of synchronisation (specifically the fork-and-join model) will be extremely common in the future, which means that the context switch needs to be fast (Windows 7's user mode scheduling may be useful here - in fact I believe that's what Microsoft's native concurrency runtime uses, while the .NET one is backed by thread pools, I believe).

-- Sebastian Sylvan
+44(0)7857-300802
UIN: 44640862 |
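For the fork-and-join case Sebastian raises, one way to keep the worker busy during the join - essentially Jarkko's "pull tasks from the job queue until the dependent tasks have completed" option - looks roughly like this (JobQueue, TryPop, Execute and Spawn are assumed names, and the busy-wait on the counter is only a sketch):

    #include <windows.h>

    struct Task;
    struct JobQueue;
    Task* TryPop(JobQueue& q);             // assumed: returns NULL if nothing is ready
    void  Execute(Task* t);                // assumed: runs the task to completion
    void  Spawn(JobQueue& q, Task* t,      // assumed: the worker that finishes 't'
                volatile LONG* joinCount); //  decrements *joinCount with InterlockedDecrement

    // Fork N children, then "wait" by helping to drain the queue instead of
    // blocking the OS thread. Note that the helping runs nested on this worker's
    // stack, which is exactly the recursive execution discussed earlier.
    void ForkAndJoin(JobQueue& q, Task** children, int n)
    {
        volatile LONG remaining = n;
        for (int i = 0; i < n; ++i)
            Spawn(q, children[i], &remaining);

        while (remaining > 0)
        {
            if (Task* t = TryPop(q))
                Execute(t);
            else
                YieldProcessor();          // nothing runnable yet; spin politely
        }
    }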