I think you are talking about two different
problems: one is to make sure the kernel
scheduler does not preempt a process that
holds some kind of resource in user land;
and the second is how to do that fast.
Some OSes have of course built in what
you are suggesting, a write to some location
in user space that is used by the kernel
scheduler to make decisions about preempting.
One thing to note is that there may be groups
of processes, say P1/P2/P3 and Pa/Pb/Pc where
both groups want to tell the kernel scheduler
that they should not be preempted *within
the group*. But, it should be okay to preempt
a process from either group for a member of
the other, or any other process in the system
... and in fact, that decision should be left
to the kernel scheduler based on policies of
the two groups and other processes, their
priorities etc. P1/P2/P3 and Pa/Pb/Pc could
be created thru something like clone(CLONE_SCHED),
so that the kernel scheduler knows they belong
to a "scheduling group"; irrespective of
whether they are also CLONE_VM and/or using
Of course, conceptually this might make a
lot of sense, but implementation-wise, this
might be very complex ... and probably that's
why other OSes cop out and do the easy thing
of just blocking any scheduling activity on
the cpu. Have you thought about the above
approach and decided it is just too costly?
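
To make the group idea concrete, here is a minimal sketch of the
preemption check. It is only an illustration under assumptions:
CLONE_SCHED is not an existing clone() flag, and struct task,
sched_group, defer and may_preempt() are hypothetical names, not
kernel code. A plain priority comparison stands in for whatever
policy the scheduler would really apply.

```c
#include <stdbool.h>

/* Hypothetical per-task state; none of these names exist in the kernel. */
struct task {
    int  sched_group;  /* 0 = no group, otherwise an id shared by the group */
    int  priority;     /* larger value = more important */
    bool defer;        /* task asked not to be preempted within its group */
};

/* Decide whether 'next' may involuntarily preempt 'cur'. */
static bool may_preempt(const struct task *cur, const struct task *next)
{
    bool same_group = cur->sched_group != 0 &&
                      cur->sched_group == next->sched_group;

    /* The deferral request only holds off members of the same group. */
    if (same_group && cur->defer)
        return false;

    /* Everything else is left to normal policy; here, plain priority. */
    return next->priority > cur->priority;
}
```

With this, P2 cannot push P1 off the cpu while P1 has defer set, but
Pa (or any unrelated process) still can if policy says so.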
PS - Btw, I ran into this issue a year back
while trying to scale apache/specweb as an
experiment ... I played with using
sched_setscheduler(SCHED_FIFO) at thread
creation time ...
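
For reference, the real call takes a pid, a policy and a parameter
block. A minimal sketch of switching the calling thread to SCHED_FIFO
follows; become_fifo() is a hypothetical helper name, and picking the
lowest FIFO priority is an assumption made for the example, not what
the experiment necessarily used.

```c
#include <errno.h>
#include <sched.h>

/* Switch the calling thread to SCHED_FIFO at the lowest FIFO priority.
 * Returns 0 on success, or the errno value on failure (typically EPERM
 * when the caller lacks the privilege to use real-time policies). */
static int become_fifo(void)
{
    struct sched_param sp = { 0 };
    int prio = sched_get_priority_min(SCHED_FIFO);

    if (prio == -1)
        return errno;
    sp.sched_priority = prio;

    /* pid 0 means "the calling thread" on Linux. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        return errno;
    return 0;
}
```

A SCHED_FIFO task is not time-sliced against SCHED_OTHER tasks, which
is why it sidesteps the preemption problem, at the cost of needing
privilege.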
--- Dave Olien <oliendm@...> wrote:
> DEFER_PREEMPTION(), ENABLE_PREEMPTION() API overview.
> I'm looking at implementing an API that can be used
> by applications
> such as DB2 or Oracle, to hold off task preemption
> while that application is
> holding a critical shared resource. These resources
> will usually be
> in shared memory that is mapped among several tasks.
> The application wants
> to acquire the resource, operate on that resource,
> and release it quickly
> so that other tasks sharing that resource can
> acquire it.
> A task should hold off preemption in this fashion
> for only short
> periods of time. It would be bad for a misbehaving
> application to defer
> preemption for long periods of time. This could
> constitute a denial
> of service (DOS) attack on the computing system.
> This may be
> addressed in the design of the API, either by
> requiring users of the
> API to be privileged and hence "trusted" to be "well
> A preferred approach would implement policies that
> limit the extent that
> the API can be abused, as this would make the API
> useable by a wider variety
> of applications. More on this later.
> The API would defer involuntary preemption, both in
> user mode
> and in kernel mode for preemptable kernels. The
> processor running
> a task with deferred preemption could still do
> "voluntary" preemption,
> to wait for I/O completion, page faults, etc. That
> can still service interrupts, system calls,
> exceptions, etc.
> The request to defer preemption, and reenabling
> preemption must be
> fast operations, so they need to be done without
> system calls, just
> as user-level locks want to avoid system calls.
> So, some
> mechanism other than a system call, is needed to
> communicate these
> requests to the kernel. More on that later.
> Why is this API a good idea?
> There are developers who have expressed "extreme
> distaste" for
> this concept. The early effort will be to implement
> a prototype and run
> benchmarks to demonstrate its benefits, if any. The
> prototype will
> be "cleaned up" only after the prototype
> demonstrates performance benefits.
> This API will likely benefit primarily really large,
> "threaded" applications,
> on really large platforms. "Threaded" here doesn't
> necessarily mean
> using Posix threads, or any specific threads
> package. They may even
> be implemented as a set of tasks that are sharing an
> mmaped data region.
> Oracle, DB2, and possibly Domino make interesting
> studies. The
> costs/benefits of this API will be measured on
> smaller platforms also,
> including a monoprocessor, to determine if there is
> any benefit there, or
> at least ensure the API has negligible cost.
> Without a DEFER_PREEMPTION() API, one thread in the
> application could
> acquire a user-level lock, and then be preempted for
> an undetermined
> period while holding that lock. Other threads in
> the application
> could then spend long periods waiting for that
> lock. If
> the application is using simple spin locks, those
> other threads could
> be burning CPU time.
> The new futex locks introduced in 2.5 would not burn
> cpu time in this
> way. However, even with futex, other tasks in the
> application would
> block in the kernel. A preemption of one thread
> could have the practical
> effect of indirectly preempting other threads. The
> application will not
> be able to make forward progress, and will pay a
> higher overhead cost in
> context switches.
> DOS prevention policies.
> The most straightforward DOS prevention method
> would be to require
> users of this API to be privileged. Such privileged
> applications would
> then be "trusted" to be "well behaved". This
> restricts the set
> of applications that could use the API. A potential
> of requiring privilege might be that the
> DEFER_PREEMPT() could mean
> "mandatory preemption deferral". The kernel would
> NEVER involuntarily
> preempt a task that has requested preemption
> The alternative is to implement policies that
> restrict the extent an
> application can abuse the API. The DEFER_PREEMPT
> would be treated
> only as "advisory preemption deferral", meaning that
> the task could
> be preempted anyway.
> An easy policy to implement would be to never allow
> a task to defer
> more than "N" consecutive involuntary preemptions.
> Or, we could
> allow the task to defer only a certain "percentage"
> of involuntary
> context switches within a given time span. These
> counters could
> be combined with other penalties applied to the
> task's scheduling
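
A minimal sketch of the "N consecutive deferrals" policy described
above. The names struct defer_state and preempt_now(), and the value
3 for N, are assumptions made for illustration and are not part of
the proposal.

```c
#include <stdbool.h>

#define MAX_CONSECUTIVE_DEFERRALS 3   /* the "N" above; value is arbitrary */

/* Hypothetical per-task bookkeeping kept by the scheduler. */
struct defer_state {
    int consecutive;   /* deferrals honored since the last real preemption */
};

/* Called when the scheduler wants to preempt the task involuntarily.
 * Returns true if the preemption should go ahead now. */
static bool preempt_now(struct defer_state *st, bool defer_requested)
{
    if (defer_requested && st->consecutive < MAX_CONSECUTIVE_DEFERRALS) {
        st->consecutive++;      /* honor the advisory request */
        return false;
    }
    st->consecutive = 0;        /* preempting resets the budget */
    return true;
}
```

The fourth consecutive request is ignored, so a misbehaving task can
stretch its time on the cpu by at most N scheduling decisions.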
> Another approach would be to give the task only a
> short reprieve
> of its preemption. When the kernel first tries to
> preempt the
> task, and notices that preemption is deferred, it
> would set a timer
> for some short period of time, and allow the task to
> running. When the timer expires, if the task is
> still running,
> it would be preempted regardless of its preemption
> deferral advisory.
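
The short-reprieve policy above could be modeled roughly as follows.
This is a sketch under assumptions: time is counted in abstract
scheduler ticks, and struct reprieve, reprieve_check() and the
two-tick grace period are hypothetical, chosen only to show the
control flow.

```c
#include <stdbool.h>
#include <stdint.h>

#define REPRIEVE_TICKS 2u   /* length of the grace period; arbitrary */

/* Hypothetical per-task timer state. */
struct reprieve {
    bool     armed;      /* a grace period is currently running */
    uint64_t deadline;   /* tick at which the grace period ends */
};

/* Called each time the kernel would like to preempt the task.
 * Returns true if the task must be preempted now. */
static bool reprieve_check(struct reprieve *r, uint64_t now,
                           bool defer_requested)
{
    if (!defer_requested) {          /* no deferral: preempt as usual */
        r->armed = false;
        return true;
    }
    if (!r->armed) {                 /* first attempt: start the timer */
        r->armed = true;
        r->deadline = now + REPRIEVE_TICKS;
        return false;
    }
    if (now < r->deadline)           /* still inside the grace period */
        return false;

    r->armed = false;                /* timer expired: ignore the advisory */
    return true;
}
```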
> Communicating between the application and the kernel
> A task in the application must be able to request a
> preemption deferral
> without making a system call. For the prototype
> the application will set a bit in the IA32 %gs
> register. The linux
> scheduler will test this bit, and defer the task's
> preemption if it is set.
> This does break the modify_ldt() system call. But
> for a prototype,
> it should work well enough.
> A better, architecture independent approach would
> use a cache line
> in a memory page that is shared between the task in
> user mode, and the
> kernel. The page would be non-pageable, always
> present, so the kernel
> can examine it without faulting. A non-zero cache
> line would indicate
> preemption deferral. This would allow the
> application to have nested
> preemption deferral requests.
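
The nested-request idea above amounts to a counter in the shared
cache line. Below is a minimal user-space sketch, assuming a 64-byte
line and using an ordinary struct in place of the pinned shared page.
The layout, the function names and preemption_deferred() (standing in
for the test the scheduler would make) are all hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the shared cache line. In the proposal it would live in
 * a non-pageable page visible to both the task and the kernel. */
struct defer_line {
    _Alignas(64) atomic_uint depth;   /* 0 = preemptible; 64 is an assumed line size */
};

/* User side: enter a region where preemption should be deferred. */
static void defer_preemption(struct defer_line *l)
{
    atomic_fetch_add(&l->depth, 1);
}

/* User side: leave the region; the outermost call re-enables preemption. */
static void enable_preemption(struct defer_line *l)
{
    atomic_fetch_sub(&l->depth, 1);
}

/* Kernel side (simulated): a non-zero line means "please defer". */
static bool preemption_deferred(struct defer_line *l)
{
    return atomic_load(&l->depth) != 0;
}
```

Because the line holds a depth rather than a flag, a library that
takes a lock while its caller already holds one does not re-enable
preemption early.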
> Ideally, this shared page would be a user
> read/writeable per-task page.
> This might be difficult to implement when several
> tasks are sharing VM.
> Another way might be to let tasks that are sharing
> VM to also share
> the same page to communicate with the kernel. Each
> would have its own cache line within that page.
> But, does each
> task know which cache line it should use? Yet
> another variation
=== message truncated ===