Re: [Algorithms] General purpose task parallel threading approach
Brought to you by:
vexxed72
|
From: <asy...@gm...> - 2009-04-04 16:35:39
|
Of course there is a tradeoff if you decide to use tasks as small as 32 bytes (though the sample I provided can do a printf with such small tasks). User code I've worked with required on average 1 - 4 kilobytes. I also provided optional automatic guardpage safety mechanism to catch stack overflows, and a way to handle library calls (that's why printf works from 32byte task). You are right about full thread context state though, but you don't have to capture full state to make cooperation work in user space, but just small fraction of it. Also keep in mind that according to windows x86 abi sse registers (and 75% of other registers) are not callee preserved, that's why a user space task state requires about 20 bytes and sse will still work (as well as mmx and fpu code). Alexander. 2009/4/4 Jon Watte <jw...@gm...> > asy...@gm... wrote: > > My target goal for a task switch cost is about 1-2 hundred of cycles, > > which is about 10-15 millions of task switches per second on a 3ghz cpu. > > You should also compare thread spawning cost (unless you preallocate > > all of them) and a memory overhead (for windows threads it is minimum > > 4k for stack size), but if you preallocate, you'd do it for the > > maximum stack size. Creation cost in my system is just a malloc (which > > you can also override with a faster allocator), and minimum allowed > > task is 32 bytes (on x86). > > > > If you're using exception handling, or stack buffer overflow, or even > certain standard C library functions, a 32-byte stack will get you into > trouble. Also, you want your thread/task stacks to be page aligned, so > that a thread overflow will actually be caught by a guard page. And, > yes, you'd either pre-allocate your maximum number of blocking threads, > or you would allocate more threads on demand when you started blocking > and ran out of runnable threads. Note: in a system with guard page > security for each thread stack, you need 8 kB/thread minimum, plus the > register state (which is about 512 bytes if you care about SSE). This is > no matter whether you're using fibers or threads. Trying to find out > whether an innocuous library call you're making is going to overwrite > your malloc-ed stack or not is too unpredictable to rely on IMO -- I > prefer the safety of a guard page. > > I don't think a full thread context (stack pointer, registers, pc) can > be done in a hundred cycles on x86; at least not if they aren't already > all in L1 on the local core. The amount of data you need to shuffle and > the memory latencies involved seem to contra-indicate such a low switch > time. Getting as close as you can is not a bad idea, though :-) > > Sincerely, > > jw > > > > ------------------------------------------------------------------------------ > _______________________________________________ > GDAlgorithms-list mailing list > GDA...@li... > https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list > Archives: > http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list > -- Regards, Alexander Karnakov |