Re: [Algorithms] General purpose task parallel threading approach
From: Jon W. <jw...@gm...> - 2009-04-04 18:18:18
asy...@gm... wrote:
> I believe that from cpu point of view kernel scheduler and user-space
> scheduler are in the same situation here in regard to cpu cache usage.
> Yes, you need to keep your data close to cache, avoid task/thread
> migration and avoid cache conflicts. I don't see why a user space
> version of task scheduler can't be as good as kernel side one, since
> the kernel and userspace scheduler have the same amount of information
> in order to take a best decision, or am I wrong here ?
> And as I mentioned - in user space you need to save/restore muuch
> smaller state, so you have initial speed benefit here.

What I'm saying is that, in the specific benchmark you posted, you heavily skewed the measurement in favor of user-space cooperative threads (fibers, coroutines, whatever you want to call them). The reason is that the cache footprint of that particular single-threaded, small-task, user-space switch is tiny.

When using the kernel to switch threads, more code is executed and more data is touched. For example, threads provide TLS, which fibers do not, along with a whole host of other thread-specific state. In addition, entering the kernel has some overhead for the system call and parameter validation. Clearly, the overhead of switching real threads in this case is higher.

However, when you start writing a real application, where you have a large number of different tasks, where the tasks are not as small or as tightly coupled, and where you start seeing tasks get moved between cores, the difference between a fiber scheduler and the kernel thread scheduler becomes smaller. That's why I said I think that, under higher load, the thread scheduler will not degrade at the same rate as the fiber scheduler.

Of course a fiber scheduler (if implemented competently, at least) will be faster than the thread scheduler, because it does less: no kernel entry, no TLS or other thread context, no pre-emption, etc. The question is: once you've added back all the things you end up needing for a real application (and a TLS-like context switch may be part of that), how much faster is your fiber system, and what are you giving up?

For example, we've had a user-space fiber system for our server stuff for many years, not unlike what you describe -- we have fiber-specific locks, waiting for I/O schedules another fiber, and so on. We've found that fiber switching brings its own problems, especially around cancellation. If you want to stop an operation in progress (say, the user canceled, or the connection broke while you were waiting for I/O), the fiber "wait" operation has to return an error, and that error has to wind its way out to the base of the request while backing out state along the way. In later incarnations, we use exception handling for this: the fiber "wait" operation (which can wait on I/O, synchronization, etc.) throws an exception when the fiber/operation is canceled.

I guess I'm just a little jaded, having seen a large number of user-space thread packages over the last 25 years (the Mac threads, Java 1 "green" threads, Win32 fibers, UNIX fibers built on longjmp, etc.), and I think the benefit/cost trade-off is not as clear as you seem to want it to be. Yes, there are cases where fibers are the right solution, but there are also cases where they aren't. If you want to use a particular measurement (such as context switches per second) as your criterion, then it's important to measure actual, real-world workloads to make an informed decision.
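To make the "tiny cache footprint" point concrete, here is roughly the kind of micro-benchmark we're talking about: a minimal fiber ping-pong sketch using the POSIX ucontext API (one of the UNIX fiber flavors). The loop count and stack size are arbitrary choices of mine, and note that glibc's swapcontext also saves/restores the signal mask with a system call, so a hand-rolled switcher or Win32 fibers would be cheaper still:

    // fiber_bench.cpp -- minimal fiber ping-pong using POSIX ucontext.
    // Build (Linux): g++ -O2 fiber_bench.cpp -o fiber_bench
    // (add -lrt for clock_gettime on older glibc)
    #include <ucontext.h>
    #include <time.h>
    #include <cstdio>

    static ucontext_t main_ctx, fiber_ctx;
    static const long N = 1000000;              // round trips; arbitrary

    static void fiber_fn() {
        // Each swapcontext back to main is one user-space "context
        // switch": it saves/restores callee-saved registers and the
        // stack pointer and touches very little other memory -- that
        // is the tiny cache footprint discussed above.
        for (long i = 0; i < N; ++i)
            swapcontext(&fiber_ctx, &main_ctx);
    }

    int main() {
        static char stack[64 * 1024];           // fiber stack; size is a guess
        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = stack;
        fiber_ctx.uc_stack.ss_size = sizeof stack;
        fiber_ctx.uc_link = &main_ctx;          // where fiber_fn returns to
        makecontext(&fiber_ctx, fiber_fn, 0);

        timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; ++i)
            swapcontext(&main_ctx, &fiber_ctx); // ping; fiber_fn pongs back
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per round trip (two switches)\n", ns / N);
        return 0;
    }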
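For the kernel-thread side of the same measurement, the classic trick is to bounce a byte between two threads over a pair of pipes, which forces exactly the system-call entry and kernel thread switch I described. Again just a sketch, with error checking omitted:

    // thread_bench.cpp -- kernel-thread counterpart to the fiber ping-pong.
    // Build: g++ -O2 thread_bench.cpp -pthread -o thread_bench
    #include <pthread.h>
    #include <unistd.h>
    #include <time.h>
    #include <cstdio>

    static const long N = 100000;    // round trips; arbitrary
    static int ping[2], pong[2];     // ping: main->worker, pong: worker->main

    static void* worker(void*) {
        char c;
        for (long i = 0; i < N; ++i) {
            read(ping[0], &c, 1);    // block until main writes
            write(pong[1], &c, 1);   // wake main up again
        }
        return 0;
    }

    int main() {
        pipe(ping);
        pipe(pong);
        pthread_t t;
        pthread_create(&t, 0, worker, 0);

        timespec t0, t1;
        char c = 'x';
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; ++i) {
            write(ping[1], &c, 1);   // syscall + kernel thread switch...
            read(pong[0], &c, 1);    // ...and back again
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, 0);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per round trip (two switches, four syscalls)\n", ns / N);
        return 0;
    }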
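As an example of "adding things back": the moment real code needs per-task state, a fiber system grows something like fiber-local storage, and the scheduler pays for it on every switch. A sketch, using C++11's thread_local for brevity; all the names here (Fiber, FiberLocals, fiber_getspecific, switch_to) are invented for illustration, not a real API:

    struct FiberLocals {
        void* slots[16];             // analogous to TLS slots; 16 is arbitrary
    };

    struct Fiber {
        // ... saved register context / stack pointer lives here ...
        FiberLocals locals;
    };

    // With a single scheduling thread this could be a plain global. Once
    // the fiber scheduler itself runs on several OS threads, it has to
    // become thread-local, so every fiber switch now pays for a TLS
    // access on top of the raw register swap.
    static thread_local Fiber* current_fiber = 0;

    inline void* fiber_getspecific(int slot) {
        return current_fiber->locals.slots[slot];
    }
    inline void fiber_setspecific(int slot, void* value) {
        current_fiber->locals.slots[slot] = value;
    }

    // The scheduler wraps the raw switch so the "TLS" follows the fiber:
    inline void switch_to(Fiber* next) {
        current_fiber = next;
        // raw_context_switch(next);  // actual register/stack swap here
    }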
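And to illustrate the cancellation point, a sketch of the exception-based approach we ended up with; FiberCancelled and the shape of fiber_wait/handle_request are illustrative, not our production interfaces:

    #include <stdexcept>

    struct FiberCancelled : std::runtime_error {
        FiberCancelled() : std::runtime_error("fiber cancelled") {}
    };

    struct Fiber {
        bool cancelled;
        Fiber() : cancelled(false) {}
        // ... context, stack, etc. ...
    };

    // Every blocking primitive (I/O, locks, events) funnels through a
    // wait like this. If the fiber was cancelled while it slept, we do
    // not hand back an error code to be checked in every frame between
    // here and the request's base; we throw, and C++ stack unwinding
    // (with RAII guards releasing locks and rolling back partial state)
    // backs the operation out for us.
    void fiber_wait(Fiber& self /*, what to wait on */) {
        // suspend_and_run_another_fiber(self);  // resumes here later
        if (self.cancelled)
            throw FiberCancelled();
    }

    void handle_request(Fiber& self) {
        try {
            fiber_wait(self);        // e.g. wait for a read to complete
            // ... more work, more waits ...
        } catch (const FiberCancelled&) {
            // Back at the base of the request with everything already
            // unwound; just drop the request.
        }
    }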
Sincerely, jw