From: Dave O. <ol...@us...> - 2002-04-18 18:03:49
DEFER_PREEMPTION(), ENABLE_PREEMPTION() API overview.

I'm looking at implementing an API that can be used by applications such as DB2 or Oracle to hold off task preemption while the application is holding a critical shared resource. These resources will usually be in shared memory that is mapped among several tasks. The application wants to acquire the resource, operate on it, and release it quickly so that other tasks sharing that resource can acquire it.

A task should hold off preemption in this fashion only for short periods of time. It would be bad for a misbehaving application to defer preemption for long periods; this could constitute a denial of service (DOS) attack on the computing system. This may be addressed in the design of the API by requiring users of the API to be privileged and hence "trusted" to be "well behaved". A preferred approach would implement policies that limit the extent to which the API can be abused, as this would make the API usable by a wider variety of applications. More on this later.

The API would defer involuntary preemption, both in user mode and, for preemptable kernels, in kernel mode. The processor running a task with deferred preemption could still do "voluntary" preemption, to wait for I/O completion, page faults, etc. That processor can still service interrupts, system calls, exceptions, etc.

The requests to defer preemption and to reenable it must be fast operations, so they need to be done without system calls, just as user-level locks avoid system calls. So some mechanism other than a system call is needed to communicate these requests to the kernel. More on that later.

---------------------------------------------------------------------------
Why is this API a good idea?

There are developers who have expressed "extreme distaste" for this concept. The early effort will be to implement a prototype and run benchmarks to demonstrate its benefits, if any.
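The intended usage pattern might look roughly like this. This is only a sketch: the proposal stores the deferral request somewhere the kernel can read (a %gs bit or a shared cache line, discussed below), so here a plain counter variable stands in for that state, and the nesting-by-counter behavior is an assumption of mine, not a settled part of the API.

```c
#include <stdint.h>

/* Sketch only: preempt_defer_count stands in for whatever per-task state
 * the kernel would actually inspect. Non-zero means "please defer my
 * involuntary preemption"; a counter lets deferral requests nest. */
static volatile uint32_t preempt_defer_count;

static inline void DEFER_PREEMPTION(void)  { preempt_defer_count++; }
static inline void ENABLE_PREEMPTION(void) { preempt_defer_count--; }

/* A typical critical section: preemption is held off only while the
 * shared resource is held, and only briefly. */
static int shared_resource;

static void update_resource(int v)
{
    DEFER_PREEMPTION();      /* no system call: just a memory write */
    shared_resource = v;     /* short critical section */
    ENABLE_PREEMPTION();     /* again just a memory write */
}
```

The key property is that both calls are ordinary stores, cheap enough to wrap around every acquisition of a hot user-level lock.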
The prototype will be "cleaned up" only after it demonstrates performance benefits.

This API will likely benefit primarily really large, "threaded" applications on really large platforms. "Threaded" here doesn't necessarily mean using Posix threads, or any specific threads package; the application may even be implemented as a set of tasks sharing an mmaped data region. Oracle, DB2, and possibly Domino make interesting studies. The costs/benefits of this API will also be measured on smaller platforms, including a monoprocessor, to determine if there is any benefit there, or at least to ensure the API has negligible cost.

Without a DEFER_PREEMPTION() API, one thread in the application could acquire a user-level lock and then be preempted for an undetermined period while holding that lock. Other threads in the application could then wait for long periods on that lock. If the application is using simple spin locks, those other threads could be burning CPU time. The new futex locks introduced in 2.5 would not burn CPU time in this way; however, even with futexes, other tasks in the application would block in the kernel. A preemption of one thread can have the practical effect of indirectly preempting other threads. The application will not be able to make forward progress, and will pay a higher overhead cost in context switches.

---------------------------------------------------------------------------
DOS prevention policies.

The most straightforward DOS prevention method would be to require users of this API to be privileged. Such privileged applications would then be "trusted" to be "well behaved". This restricts the set of applications that could use the API. A potential advantage of requiring privilege is that DEFER_PREEMPT() could mean "mandatory preemption deferral": the kernel would NEVER involuntarily preempt a task that has requested preemption deferral.
The alternative is to implement policies that restrict the extent to which an application can abuse the API. DEFER_PREEMPT() would be treated only as "advisory preemption deferral", meaning that the task could be preempted anyway.

An easy policy to implement would be to never allow a task to defer more than "N" consecutive involuntary preemptions. Or, we could allow the task to defer only a certain "percentage" of involuntary context switches within a given time span. These counters could be combined with other penalties applied to the task's scheduling priority.

Another approach would be to give the task only a short reprieve from its preemption. When the kernel first tries to preempt the task and notices that preemption is deferred, it would set a timer for some short period of time and allow the task to continue running. When the timer expires, if the task is still running, it would be preempted regardless of its preemption deferral advisory.

---------------------------------------------------------------------------
Communicating between the application and the kernel

A task in the application must be able to request a preemption deferral without making a system call. For the prototype implementation, the application will set a bit in the IA32 %gs register, and the Linux scheduler will test this bit and defer the task's preemption if it is set. This does break the modify_ldt() system call, but for a prototype it should work well enough.

A better, architecture-independent approach would use a cache line in a memory page that is shared between the task in user mode and the kernel. The page would be non-pageable, always present, so the kernel can examine it without faulting. A non-zero cache line would indicate preemption deferral. This would allow the application to have nested preemption deferral requests. Ideally, this shared page would be a user read/writeable per-task page. This might be difficult to implement when several tasks are sharing VM.
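Putting the two ideas above together — the scheduler reading a user-writable cache line, and the "at most N consecutive deferrals" advisory policy — the kernel-side check could be sketched roughly as follows. All names, the value of N, and the struct layout are invented here for illustration; this is not actual scheduler code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the advisory-deferral check. */
#define MAX_CONSECUTIVE_DEFERRALS 3   /* "N" from the policy above */

struct task_defer_state {
    volatile uint32_t *defer_line;    /* user-writable shared cache line */
    int consecutive_deferrals;        /* kernel-private abuse counter */
};

/* Called at the point where the scheduler would involuntarily preempt
 * the task. Returns true if the task gets to keep running this time. */
static bool preemption_deferred(struct task_defer_state *t)
{
    if (*t->defer_line == 0) {
        t->consecutive_deferrals = 0; /* task is not asking to defer */
        return false;
    }
    if (t->consecutive_deferrals >= MAX_CONSECUTIVE_DEFERRALS)
        return false;                 /* advisory only: preempt anyway */
    t->consecutive_deferrals++;
    return true;
}
```

The timer-reprieve variant would replace the counter test with arming a short kernel timer on the first deferred preemption and preempting unconditionally when it fires.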
Another way might be to let tasks that are sharing VM also share the same page to communicate with the kernel. Each task would have its own cache line within that page. But does each task know which cache line it should use?

Yet another variation would have the tasks that are sharing VM also use the same cache line in that page. This means a preemption deferral by one task would also defer preemption for all the other tasks sharing VM with it, which would probably mean that all the tasks are deferring preemption most of the time. This doesn't sound like a good idea.

Suggestions on a page-sharing scheme would be appreciated.
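The cache-line-per-task variant could be laid out as below, assuming 4 KB pages and 64-byte lines (both illustrative). One possible answer to "which line should each task use?" — my assumption, not part of the proposal — is that the kernel hands each task a distinct slot index when the task registers for the API:

```c
#include <stdint.h>

/* Illustrative layout: one shared page, one cache line per task. */
#define SHARED_PAGE_SIZE 4096
#define CACHE_LINE_SIZE  64
#define DEFER_SLOTS      (SHARED_PAGE_SIZE / CACHE_LINE_SIZE)

static uint8_t shared_page[SHARED_PAGE_SIZE];  /* stand-in for the mapped page */

/* Each task touches only the first word of its own line, so tasks never
 * false-share and one task's deferral cannot affect another's. */
static volatile uint32_t *defer_word(int slot)
{
    return (volatile uint32_t *)&shared_page[slot * CACHE_LINE_SIZE];
}
```

At 64 bytes per line, one page covers 64 tasks; a VM-sharing group with more tasks than that would need additional pages, which is part of why the sharing scheme is still an open question.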