Re: [Algorithms] General purpose task parallel threading approach
From: Jon W. <jw...@gm...> - 2009-04-04 18:18:18
asy...@gm... wrote:
> I believe that from cpu point of view kernel scheduler and user-space
> scheduler are in the same situation here in regard to cpu cache usage.
> Yes, you need to keep your data close to cache, avoid task/thread
> migration and avoid cache conflicts. I don't see why a user space
> version of task scheduler can't be as good as kernel side one, since
> the kernel and userspace scheduler have the same amount of information
> in order to take a best decision, or am I wrong here ?
> And as I mentioned - in user space you need to save/restore muuch
> smaller state, so you have initial speed benefit here.

What I'm saying is that, in the specific benchmark you posted, you heavily skewed the measurement in favor of user-space cooperative threads (fibers, coroutines, whatever you want to call them). The reason is that the cache footprint of that particular single-threaded, small-task, user-space switch is tiny.

When using the kernel to switch threads, more code is executed and more data is touched. For example, threads provide TLS, which fibers do not, along with a whole host of other thread-specific state. In addition, entering the kernel has some overhead for the system call and parameter validation. Clearly, the overhead of switching real threads in this case is higher.

However, when you start writing a real application, where you have a large number of different tasks, where the tasks are not as small or as tightly coupled, and where you start seeing tasks get moved between cores, the difference between a fiber scheduler and the kernel thread scheduler becomes smaller. That's why I said I think that, under higher load, the thread scheduler will not degrade at the same rate as the fiber scheduler.

Of course a fiber scheduler (if implemented competently, at least) will be faster than the thread scheduler, because it does less: no kernel entry, no TLS or other thread context, no pre-emption, etc. The question is: once you've added back all the things you end up needing for a real application (and a TLS-like context switch may be part of that), how much faster is your fiber system, and what are you giving up?

For example, we've had a user-space fiber system for our server stuff for many years, not unlike what you describe -- we have fiber-specific locks, waiting for I/O schedules another fiber, and so on. We've found that fiber switching brings its own problems, especially around cancellation. If you want to stop an operation in progress (say, the user canceled, or the connection broke while you were waiting for I/O), the fiber "wait" operation has to return an error, and that error has to wind its way out to the base of the request while backing out state along the way. In later incarnations, we use exception handling for this: the fiber "wait" operation (which can wait on I/O, synchronization, etc.) throws an exception when the fiber/operation is canceled.

I guess I'm just a little jaded, having seen a large number of user-space thread packages over the last 25 years (the Mac threads, Java 1 "green" threads, Win32 fibers, UNIX fibers built on longjmp, etc.), and I think the benefit/cost trade-off is not as clear as you seem to want it to be. Yes, there are cases where fibers are the right solution, but there are also cases where they aren't. If you want to use a particular measurement (such as context switches per second) as your criterion, then it's important to measure actual, real-world workloads to make an informed decision.
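To make the "tiny cache footprint" point concrete, here is roughly the kind of micro-benchmark we're talking about: a minimal fiber ping-pong sketch using the POSIX ucontext API (one of the UNIX fiber flavors). The loop count and stack size are arbitrary choices of mine, and note that glibc's swapcontext also saves/restores the signal mask with a system call, so a hand-rolled switcher or Win32 fibers would be cheaper still:

    // fiber_bench.cpp -- minimal fiber ping-pong using POSIX ucontext.
    // Build (Linux): g++ -O2 fiber_bench.cpp -o fiber_bench
    // (add -lrt for clock_gettime on older glibc)
    #include <ucontext.h>
    #include <time.h>
    #include <cstdio>

    static ucontext_t main_ctx, fiber_ctx;
    static const long N = 1000000;              // round trips; arbitrary

    static void fiber_fn() {
        // Each swapcontext back to main is one user-space "context
        // switch": it saves/restores callee-saved registers and the
        // stack pointer and touches very little other memory -- that
        // is the tiny cache footprint discussed above.
        for (long i = 0; i < N; ++i)
            swapcontext(&fiber_ctx, &main_ctx);
    }

    int main() {
        static char stack[64 * 1024];           // fiber stack; size is a guess
        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = stack;
        fiber_ctx.uc_stack.ss_size = sizeof stack;
        fiber_ctx.uc_link = &main_ctx;          // where fiber_fn returns to
        makecontext(&fiber_ctx, fiber_fn, 0);

        timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; ++i)
            swapcontext(&main_ctx, &fiber_ctx); // ping; fiber_fn pongs back
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per round trip (two switches)\n", ns / N);
        return 0;
    }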
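For the kernel-thread side of the same measurement, the classic trick is to bounce a byte between two threads over a pair of pipes, which forces exactly the system-call entry and kernel thread switch I described. Again just a sketch, with error checking omitted:

    // thread_bench.cpp -- kernel-thread counterpart to the fiber ping-pong.
    // Build: g++ -O2 thread_bench.cpp -pthread -o thread_bench
    #include <pthread.h>
    #include <unistd.h>
    #include <time.h>
    #include <cstdio>

    static const long N = 100000;    // round trips; arbitrary
    static int ping[2], pong[2];     // ping: main->worker, pong: worker->main

    static void* worker(void*) {
        char c;
        for (long i = 0; i < N; ++i) {
            read(ping[0], &c, 1);    // block until main writes
            write(pong[1], &c, 1);   // wake main up again
        }
        return 0;
    }

    int main() {
        pipe(ping);
        pipe(pong);
        pthread_t t;
        pthread_create(&t, 0, worker, 0);

        timespec t0, t1;
        char c = 'x';
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; ++i) {
            write(ping[1], &c, 1);   // syscall + kernel thread switch...
            read(pong[0], &c, 1);    // ...and back again
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, 0);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per round trip (two switches, four syscalls)\n", ns / N);
        return 0;
    }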
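As an example of "adding things back": the moment real code needs per-task state, a fiber system grows something like fiber-local storage, and the scheduler pays for it on every switch. A sketch, using C++11's thread_local for brevity; all the names here (Fiber, FiberLocals, fiber_getspecific, switch_to) are invented for illustration, not a real API:

    struct FiberLocals {
        void* slots[16];             // analogous to TLS slots; 16 is arbitrary
    };

    struct Fiber {
        // ... saved register context / stack pointer lives here ...
        FiberLocals locals;
    };

    // With a single scheduling thread this could be a plain global. Once
    // the fiber scheduler itself runs on several OS threads, it has to
    // become thread-local, so every fiber switch now pays for a TLS
    // access on top of the raw register swap.
    static thread_local Fiber* current_fiber = 0;

    inline void* fiber_getspecific(int slot) {
        return current_fiber->locals.slots[slot];
    }
    inline void fiber_setspecific(int slot, void* value) {
        current_fiber->locals.slots[slot] = value;
    }

    // The scheduler wraps the raw switch so the "TLS" follows the fiber:
    inline void switch_to(Fiber* next) {
        current_fiber = next;
        // raw_context_switch(next);  // actual register/stack swap here
    }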
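And to illustrate the cancellation point, a sketch of the exception-based approach we ended up with; FiberCancelled and the shape of fiber_wait/handle_request are illustrative, not our production interfaces:

    #include <stdexcept>

    struct FiberCancelled : std::runtime_error {
        FiberCancelled() : std::runtime_error("fiber cancelled") {}
    };

    struct Fiber {
        bool cancelled;
        Fiber() : cancelled(false) {}
        // ... context, stack, etc. ...
    };

    // Every blocking primitive (I/O, locks, events) funnels through a
    // wait like this. If the fiber was cancelled while it slept, we do
    // not hand back an error code to be checked in every frame between
    // here and the request's base; we throw, and C++ stack unwinding
    // (with RAII guards releasing locks and rolling back partial state)
    // backs the operation out for us.
    void fiber_wait(Fiber& self /*, what to wait on */) {
        // suspend_and_run_another_fiber(self);  // resumes here later
        if (self.cancelled)
            throw FiberCancelled();
    }

    void handle_request(Fiber& self) {
        try {
            fiber_wait(self);        // e.g. wait for a read to complete
            // ... more work, more waits ...
        } catch (const FiberCancelled&) {
            // Back at the base of the request with everything already
            // unwound; just drop the request.
        }
    }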
Sincerely, jw