From: Dave O. <ol...@us...> - 2002-04-18 18:03:49
DEFER_PREEMPTION(), ENABLE_PREEMPTION() API overview.

I'm looking at implementing an API that can be used by applications such as DB2 or Oracle to hold off task preemption while the application is holding a critical shared resource. These resources will usually be in shared memory that is mapped among several tasks. The application wants to acquire the resource, operate on it, and release it quickly so that other tasks sharing that resource can acquire it.

A task should hold off preemption in this fashion only for short periods of time. It would be bad for a misbehaving application to defer preemption for long periods; this could constitute a denial of service (DOS) attack on the computing system. This may be addressed in the design of the API by requiring users of the API to be privileged and hence "trusted" to be "well behaved". A preferred approach would implement policies that limit the extent to which the API can be abused, as this would make the API usable by a wider variety of applications. More on this later.

The API would defer involuntary preemption, both in user mode and, for preemptable kernels, in kernel mode. The processor running a task with deferred preemption could still do "voluntary" preemption, to wait for I/O completion, page faults, etc. That processor can still service interrupts, system calls, exceptions, etc.

The requests to defer preemption and to reenable it must be fast operations, so they need to be done without system calls, just as user-level locks avoid system calls. So some mechanism other than a system call is needed to communicate these requests to the kernel. More on that later.

---------------------------------------------------------------------------
Why is this API a good idea?

There are developers who have expressed "extreme distaste" for this concept. The early effort will be to implement a prototype and run benchmarks to demonstrate its benefits, if any.
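The intended usage pattern might look roughly like this. This is only a sketch: the proposal stores the deferral request somewhere the kernel can read (a %gs bit or a shared cache line, discussed below), so here a plain counter variable stands in for that state, and the nesting-by-counter behavior is an assumption of mine, not a settled part of the API.

```c
#include <stdint.h>

/* Sketch only: preempt_defer_count stands in for whatever per-task state
 * the kernel would actually inspect. Non-zero means "please defer my
 * involuntary preemption"; a counter lets deferral requests nest. */
static volatile uint32_t preempt_defer_count;

static inline void DEFER_PREEMPTION(void)  { preempt_defer_count++; }
static inline void ENABLE_PREEMPTION(void) { preempt_defer_count--; }

/* A typical critical section: preemption is held off only while the
 * shared resource is held, and only briefly. */
static int shared_resource;

static void update_resource(int v)
{
    DEFER_PREEMPTION();      /* no system call: just a memory write */
    shared_resource = v;     /* short critical section */
    ENABLE_PREEMPTION();     /* again just a memory write */
}
```

The key property is that both calls are ordinary stores, cheap enough to wrap around every acquisition of a hot user-level lock.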
The prototype will be "cleaned up" only after it demonstrates performance benefits.

This API will likely benefit primarily really large, "threaded" applications on really large platforms. "Threaded" here doesn't necessarily mean using Posix threads, or any specific threads package; the application may even be implemented as a set of tasks sharing an mmaped data region. Oracle, DB2, and possibly Domino make interesting studies. The costs/benefits of this API will also be measured on smaller platforms, including a monoprocessor, to determine if there is any benefit there, or at least to ensure the API has negligible cost.

Without a DEFER_PREEMPTION() API, one thread in the application could acquire a user-level lock and then be preempted for an undetermined period while holding that lock. Other threads in the application could then wait for long periods on that lock. If the application is using simple spin locks, those other threads could be burning CPU time. The new futex locks introduced in 2.5 would not burn CPU time in this way; however, even with futexes, other tasks in the application would block in the kernel. A preemption of one thread can have the practical effect of indirectly preempting other threads. The application will not be able to make forward progress, and will pay a higher overhead cost in context switches.

---------------------------------------------------------------------------
DOS prevention policies.

The most straightforward DOS prevention method would be to require users of this API to be privileged. Such privileged applications would then be "trusted" to be "well behaved". This restricts the set of applications that could use the API. A potential advantage of requiring privilege is that DEFER_PREEMPT() could mean "mandatory preemption deferral": the kernel would NEVER involuntarily preempt a task that has requested preemption deferral.
The alternative is to implement policies that restrict the extent to which an application can abuse the API. DEFER_PREEMPT() would be treated only as "advisory preemption deferral", meaning that the task could be preempted anyway.

An easy policy to implement would be to never allow a task to defer more than "N" consecutive involuntary preemptions. Or, we could allow the task to defer only a certain "percentage" of involuntary context switches within a given time span. These counters could be combined with other penalties applied to the task's scheduling priority.

Another approach would be to give the task only a short reprieve from its preemption. When the kernel first tries to preempt the task and notices that preemption is deferred, it would set a timer for some short period of time and allow the task to continue running. When the timer expires, if the task is still running, it would be preempted regardless of its preemption deferral advisory.

---------------------------------------------------------------------------
Communicating between the application and the kernel

A task in the application must be able to request a preemption deferral without making a system call. For the prototype implementation, the application will set a bit in the IA32 %gs register, and the Linux scheduler will test this bit and defer the task's preemption if it is set. This does break the modify_ldt() system call, but for a prototype it should work well enough.

A better, architecture-independent approach would use a cache line in a memory page that is shared between the task in user mode and the kernel. The page would be non-pageable, always present, so the kernel can examine it without faulting. A non-zero cache line would indicate preemption deferral. This would allow the application to have nested preemption deferral requests. Ideally, this shared page would be a user read/writeable per-task page. This might be difficult to implement when several tasks are sharing VM.
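Putting the two ideas above together — the scheduler reading a user-writable cache line, and the "at most N consecutive deferrals" advisory policy — the kernel-side check could be sketched roughly as follows. All names, the value of N, and the struct layout are invented here for illustration; this is not actual scheduler code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the advisory-deferral check. */
#define MAX_CONSECUTIVE_DEFERRALS 3   /* "N" from the policy above */

struct task_defer_state {
    volatile uint32_t *defer_line;    /* user-writable shared cache line */
    int consecutive_deferrals;        /* kernel-private abuse counter */
};

/* Called at the point where the scheduler would involuntarily preempt
 * the task. Returns true if the task gets to keep running this time. */
static bool preemption_deferred(struct task_defer_state *t)
{
    if (*t->defer_line == 0) {
        t->consecutive_deferrals = 0; /* task is not asking to defer */
        return false;
    }
    if (t->consecutive_deferrals >= MAX_CONSECUTIVE_DEFERRALS)
        return false;                 /* advisory only: preempt anyway */
    t->consecutive_deferrals++;
    return true;
}
```

The timer-reprieve variant would replace the counter test with arming a short kernel timer on the first deferred preemption and preempting unconditionally when it fires.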
Another way might be to let tasks that are sharing VM also share the same page to communicate with the kernel. Each task would have its own cache line within that page. But does each task know which cache line it should use?

Yet another variation would have the tasks that are sharing VM also use the same cache line in that page. This means a preemption deferral by one task would also defer preemption for all the other tasks sharing VM with it, which would probably mean that all the tasks are deferring preemption most of the time. This doesn't sound like a good idea.

Suggestions on a page-sharing scheme would be appreciated.
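The cache-line-per-task variant could be laid out as below, assuming 4 KB pages and 64-byte lines (both illustrative). One possible answer to "which line should each task use?" — my assumption, not part of the proposal — is that the kernel hands each task a distinct slot index when the task registers for the API:

```c
#include <stdint.h>

/* Illustrative layout: one shared page, one cache line per task. */
#define SHARED_PAGE_SIZE 4096
#define CACHE_LINE_SIZE  64
#define DEFER_SLOTS      (SHARED_PAGE_SIZE / CACHE_LINE_SIZE)

static uint8_t shared_page[SHARED_PAGE_SIZE];  /* stand-in for the mapped page */

/* Each task touches only the first word of its own line, so tasks never
 * false-share and one task's deferral cannot affect another's. */
static volatile uint32_t *defer_word(int slot)
{
    return (volatile uint32_t *)&shared_page[slot * CACHE_LINE_SIZE];
}
```

At 64 bytes per line, one page covers 64 tasks; a VM-sharing group with more tasks than that would need additional pages, which is part of why the sharing scheme is still an open question.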