Read-Copy Mutual-Exclusion

Table of Contents

  1. OVERVIEW
  2. MOTIVATION
  3. DESCRIPTION
  4. BENEFITS OF READ-COPY MECHANISM
  5. RESTRICTIONS FOR USE OF READ-COPY MECHANISM
  6. APPLICATIONS OF THIS MECHANISM
  7. OVERVIEW OF INTERFACE
  8. DESIGN RULES FOR kmem_deferred_alloc_embedded()
  9. ADDITIONAL DESIGN RULES TO PROTECT LIST OF ELEMENTS
  10. EXAMPLE OF ELIMINATING TABLE-LOOKUP MUTEX
  11. EXAMPLE OF rc_callback_wrapper()
  12. EXAMPLE OF kmem_deferred_free_embedded()
  13. DOCUMENTATION OF INTERFACES
  14. CPUS WITH WEAK MEMORY BARRIERS

AVAILABILITY

These primitives are available in ptx4.x. Additional interfaces will be available in ptx4.3.

OVERVIEW

Straightforward, well-understood mutual-exclusion mechanisms have served quite well for over a decade. However, times are changing, and computer architectures are changing with them. These changes have caused tried-and-true locking primitives to become increasingly expensive when compared to the cost of ordinary instructions, and provide motivation for development and use of alternative locking mechanisms.

As usual, there is no free lunch. Higher-performance locking mechanisms tend to be more specialized than the more familiar general-purpose mechanisms. One useful specialization stems from the observation that many data structures are accessed much more frequently than they are modified. These data structures could benefit from locking mechanisms that reduce the overhead for accessing the structure, while possibly increasing the cost of modifying the structure.

One such locking mechanism is the reader-writer spinlock. This mechanism gives readers better cache locality (and writers worse cache locality), thereby speeding up readers (and slowing down writers). This mechanism is described in detail in a separate document.

Another specialized locking mechanism is the read-copy ``lock'' (actually an update discipline). This mechanism reduces the readers' locking overheads to zero. Since the readers are doing absolutely nothing to protect their accesses, the writers must take careful steps to ensure that their updates will allow the readers to safely access the data at any time. The read-copy update discipline mechanism provides primitives that allow writers to easily take such careful steps.

The overhead of these primitives is relatively low, but is considerably greater than that of a simple spinlock. Therefore, like the reader-writer spinlock, this mechanism improves performance only when the protected data structure is read much more frequently than written. In addition, the readers must be able to tolerate using stale data. Although this restriction sounds severe, it is satisfied quite often. One example is a routing table in a network protocol, which is updated at most every minute or so, but may be accessed several thousand times per second, and which can tolerate stale routing information. Even in the unforgiving realm of numerical analysis, there are algorithms that can tolerate stale data -- so-called ``chaotic relaxation'' is one example. Other example applications of this mechanism are listed in the APPLICATIONS OF THIS MECHANISM section.

The remainder of this document describes the motivation for the read-copy update mechanism (e.g., why it is so important to eliminate lock operations), the PTX4.0 implementation, its benefits and drawbacks, applications of this mechanism, and some example usages, followed by a detailed description of the interface that is present in PTX4.0. The examples and interface are specialized for the PTX kernel, but the mechanism can be adapted to a variety of other environments, including user processes and other kernels.

MOTIVATION

CPU speeds have been increasing by a factor of from 1.5x to 2x per year for more than a decade. However, interconnect speeds (printed circuit boards and busses) have only been increasing by roughly 15 percent per year over the same period.

To illustrate this effect, consider the number of simple instructions that can be executed in the time required to acquire and release a simple in-kernel spinlock for Sequent's Symmetry Model-B and the NUMA-Q "Scorpion" system containing Xeon processors. The Model B shipped in mid-1988, gets about 5 MIPS, and requires about 10 microseconds to acquire and release a spinlock. The Scorpion shipped in late 1998, gets about 540 MIPS, and requires about 4 microseconds to acquire and release a spinlock.

In the 10 years between the Model B and the Scorpion, the CPU speed has increased more than 100-fold (a 1.6x increase per year), while the lock speed has increased by only 2.5x (a 1.1x increase per year). The number of instructions that can be executed in the time required to acquire and release a lock has risen from 50 for the Model B to over 2,000 for the Scorpion. If a lock is acquired every 200 instructions, then the locking overhead has increased from about 20% to over 90% over this same 10-year period. This vividly illustrates why reducing the number of lock operations can be just as important as reducing lock contention.
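
These overhead figures follow directly from the ratio of lock cost to useful work. A rough worked calculation, assuming one lock round trip per 200 instructions:

    Model B:   10 us/lock x   5 MIPS =   50 instruction-times per lock
               overhead =   50 / (200 +   50) = 20%
    Scorpion:   4 us/lock x 540 MIPS = 2160 instruction-times per lock
               overhead = 2160 / (200 + 2160) = 91%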

This increase in overhead provides the motivation for constructing mutual-exclusion schemes that do not rely so heavily on communication over the relatively slow interconnects between the relatively fast CPUs.

Note that the MIPS ratings given above assume a low miss rate from the on-chip cache. An analysis of the effect of cache misses would uncover a similar increasing penalty for data movement between memory and caches. This is a subject for another document. Note, however, that low-end machines that trade limited scaling for low latency have lower miss penalties than do higher-end machines such as ours. Larger miss penalties result in locks having higher overhead on high-end machines than on machines with the low latencies made possible by limited scaling.

The read-copy update mechanism allows locking operations to be eliminated on read accesses to the data it protects.

DESCRIPTION

This section describes the read-copy mechanism as implemented on PTX4.0. The principles underlying the mechanism can be applied to other environments, such as other kernels, real-time systems, and even application programs. These underlying principles are documented in a PDCS'98 paper. The rest of this section gives an overview of these principles. This description assumes a non-preemptive kernel; however, a preemptive kernel can be handled easily.

The basis of the read-copy mechanism is the use of the execution history of each CPU to demonstrate that a given destructive operation can be safely performed at a particular time. The mechanism is currently used in four general ways: (1) as an update discipline that allows readers to access a changing data structure without performing any synchronization operations, (2) to guarantee existence of data structures referenced by global pointers, thereby avoiding complex race conditions that would otherwise occur when freeing up these data structures, (3) to reduce deadlock-avoidance complexity, and (4) as a barrier-like synchronization mechanism that allows one process to exclude other processes without requiring these other processes to execute any expensive locking primitives or atomic instructions.

The key point is that all kernel data structures are manipulated only by the kernel context of a user process or by an interrupt routine. Neither the idle loop nor a user process running in user mode may manipulate kernel data structures. The idle loop, user-mode execution, and (by special dispensation) a context switch are therefore ``quiescent states'': a CPU in a quiescent state cannot interfere with manipulation of in-kernel data or be interfered with by manipulation of in-kernel data. Of course, a given CPU could switch out of a quiescent state at any time, but such a CPU would only be able to access or modify shared data that was globally accessible at or after that time. Therefore, if a given process removes all globally-accessible references to a particular data structure, then blocks until all CPUs pass through a quiescent state (referring to a summary of CPU execution history to determine when this occurs), it can manipulate that data structure secure in the knowledge that no other CPU can interfere. Likewise, if a given process makes a change to a global variable, then blocks until all CPUs have passed through a quiescent state, it can be sure that all the other CPUs are now aware of the change, even if that variable is not protected by any sort of mutex.

The core of the read-copy mechanism is a primitive named rc_callback() that allows a callback to be scheduled once all CPUs have passed through a quiescent state.

An example use of this mechanism is the deletion of a data structure that is still being used by processes and interrupt routines running on other CPUs. To delete the data structure, simply do the following:

  1. Remove all globally-accessible pointers to the data structure, so that no new kernel thread of execution can gain a reference to it.

  2. Use rc_callback() to schedule a callback that performs the actual deletion once all CPUs have passed through a quiescent state.

This procedure will allow all kernel threads of execution that might still be using the data structure to terminate before actually deleting the structure. Note that spinlocks or reference counters could be used to get the same effect, but that these techniques would impose overhead on everything accessing the data structure.

The rc_callback() primitive takes a pointer to an rc_callback_t control block, which may be allocated via the rc_alloc_callback() primitive and freed via the rc_free_callback() primitive. The rc_alloc_callback() primitive fills in the rc_callback_t with the address of the function to be called back and the values of two arguments.

When the function is invoked, it will be passed a pointer to the rc_callback_t and the two arguments. The function may reuse the rc_callback_t by rescheduling itself; otherwise, it must free the rc_callback_t up. The rc_set_func() primitive can be used as a shorthand for freeing up one rc_callback_t and immediately allocating another. The reason that the two arguments are passed as separate parameters, even though they are present in the rc_callback_t, is to protect layered products from any changes in the rc_callback_t format.
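
For example, a minimal sketch of a callback function that does its work and then releases its own control block (my_callback and my_cleanup are illustrative names, not part of the interface):

    void
    my_callback(rc_callback_t *rc, void *arg1, void *arg2)
    {
        /* arg1 and arg2 are the values given to rc_alloc_callback(). */
        my_cleanup(arg1, arg2);    /* hypothetical cleanup work. */

        /* Not rescheduling this callback, so release its control block. */
        rc_free_callback(rc);
    }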

An RC_MEMSYNC() primitive is used to maintain memory consistency on CPUs that can reorder memory reads and writes (note however that the existing spinlock primitives and the rc_callback() primitive act as implicit RC_MEMSYNC()s). This primitive must be used between the act of filling in the fields internal to a given structure and the act of creating a globally-accessible pointer to that structure. Otherwise, an aggressive super-scalar CPU might reorder the memory write to the global pointer before some of the writes that initialized the structure itself. This reordering could allow other CPUs to access the structure before it was fully initialized, which could significantly reduce the longevity of the kernel.

RC_RDPROTECT() and RC_RDUNPROTECT() primitives are used to indicate sections of code that are relying on the read-copy mechanism. These generate no code in .std kernels, but generate debug code in .mfg kernels that checks for proper nesting.

The preceding primitives form the interface to the core read-copy mechanism, and are described in detail later in this document. There are two facilities layered on top of this core mechanism: the deferred-free facility and the callback-wrapper facility.

The deferred-free primitives provide an efficient and easy-to-use interface to the rc_callback core primitive that allows the caller to specify that a given data structure be freed after all CPUs (which might still be referencing the structure) have passed through a quiescent state. The workhorse of this interface is kmem_deferred_free(). The example data structure deletion given above would use kmem_deferred_free() as follows:
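
A minimal sketch, assuming the structure pointed to by p has already been unlinked from all global lists and that the caller is permitted to block:

    kmem_deferred_free((void *)p, sizeof(*p), NULL,
                       kma_alloc_defer_item(KM_SLEEP));

The structure is then freed automatically once all CPUs have passed through a quiescent state.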

The kmem_deferred_free() primitive requires some additional memory to keep track of all data structures waiting for a deferred free. This memory may be unioned with fields in the data structure that are no longer needed once the deferred free is in progress, or may be allocated separately. The kma_alloc_defer_item() and kma_defer_free_item() primitives allow this memory to be allocated and freed independently of the data structure, while the kmem_deferred_alloc_embedded() and kmem_deferred_free_embedded() primitives allow it to be allocated along with the structure itself. If the additional memory is to be unioned with fields in the data structure, kma_alloc_defer_item_size() should be used to verify that the fields are large enough.

The callback-wrapper facility consists of the rc_callback_wrapper() function, which may be specified as a callback to the rc_callback() primitive. The rc_callback_wrapper() function takes another function as its first parameter and that function's single parameter as its second parameter. This allows functions such as v_sema() to be conveniently invoked using the read-copy callback mechanism.

So where did the name ``read-copy'' come from? It turns out that this is how the mechanism is used to modify (rather than simply delete) a data structure while allowing readers full concurrent access to that data structure. The procedure is as follows:

  1. Allocate new memory for the modified data structure.

  2. Copy the contents of the old data structure to the new.

  3. Modify the new data structure as needed.

  4. Update all pointers that currently point to the old copy to instead point to the new copy.

  5. Do a deferred free of the old copy.

In other words, readers can continue reading while updaters make new copies, hence ``read-copy''. Use of the mechanism in this manner is sometimes called a ``read-copy lock''; this nomenclature was stolen from the ``reader-writer lock''. This analogy with spinlock primitives can be useful, but it is better to think of the read-copy mechanism as enabling the read-copy update procedure shown above than to risk confusing the mechanism with explicit locking.
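
To make the procedure concrete, the following is a minimal sketch for a hypothetical doubly-linked element type; elem_t, elem_mutex, SPLELEM, oldp, newp, and new_value are illustrative names only, error handling is omitted, and the DESIGN RULES sections below give a fuller treatment:

    elem_t *newp;
    spl_t s;

    s = p_lock(&elem_mutex, SPLELEM);
    newp = (elem_t *)kmem_deferred_alloc_embedded(sizeof(*oldp), KM_SLEEP);
    *newp = *oldp;              /* steps 1 and 2: allocate and copy. */
    newp->data = new_value;     /* step 3: modify the new copy. */
    RC_MEMSYNC();               /* commit the fields before publishing. */
    newp->prev->next = newp;    /* step 4: point readers at the new copy. */
    newp->next->prev = newp;
    /* step 5: free the old copy once all CPUs pass a quiescent state. */
    kmem_deferred_free_embedded(oldp, sizeof(*oldp), NULL);
    v_lock(&elem_mutex, s);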

BENEFITS OF READ-COPY MECHANISM

  1. Requires execution of fewer explicit mutex operations to protect accesses to shared data structures. In many cases, no explicit lock operations are required. This results in less contention and lower CPU overhead. Of course, modification of shared data structures still requires locking operations. However, all the well-known techniques (such as hashing) may be used to reduce the lock contention incurred when modifying shared data.

  2. Requires execution of fewer special instructions that enforce sequential consistency (e.g., instructions taking the ``lock'' prefix on the Intel 80x86 instruction-set architecture). These instructions result in CPU pipeline flushes, resulting in degraded performance, especially on the newer super-scalar microprocessors containing multiple instruction pipelines.

  3. In code not subject to blocking preemption, requires fewer interrupt-level manipulation operations when accessing shared data structures. These operations often require interaction with relatively slow external interrupt-distribution hardware, and are frequently required even on uniprocessors (in which case the read-copy update mechanism can be said to have negative overhead with respect to traditional uniprocessor algorithms). In PTX, interrupt-level manipulation operations are usually combined with the locking operations and are still usually required when modifying shared data structures.

    An example of this occurs when a data structure is modified at interrupt level. If the interrupt code can use the read-copy mechanism, then any process-level code that reads the data structure no longer needs to exclude the update code in the interrupt routine. Therefore, the process-level code no longer needs to block interrupts for the duration of its access.

    In environments allowing preemption, interrupt-level manipulations or operations that disable preemption may still be required.

  4. This mechanism is not as susceptible to deadlock as are explicit lock operations. The reduced susceptibility to deadlock results in reduction in or elimination of code that would otherwise be required to detect and repair or to avoid deadlocks, which in turn reduces software development and maintenance costs.

RESTRICTIONS FOR USE OF READ-COPY MECHANISM

As noted in the overview, high-performance locking mechanisms like rclock are specialized, and can therefore be used only in certain situations. This section lists these restrictions along with metrics that can help to determine when it is beneficial to use the rclock mechanism.

  1. Writers cannot exclude readers. This can be an advantage in cases where there may be bursts of very heavy write activity, causing conventional locking primitives to slow down or entirely lock out read activity. However, readers must be excluded in those cases where writers cannot make their modifications in such a way that each intermediate state appears consistent to a reader. The read-copy mechanism is unable to provide this form of exclusion.

  2. Readers cannot exclude writers. Readers may therefore see stale data. Therefore, algorithms employing the rclock mechanism must tolerate stale data, or must explicitly flag stale data (see Pugh's skip-list papers for ways of accomplishing this).

  3. There is a significant latency imposed by the primitives making up the write side of the read-copy mechanism, even in the absence of contending readers. This latency averages about 25 milliseconds on lightly-loaded systems. In other words, there will be about a 25-millisecond delay between the time that rc_callback() is invoked and the time that the rclock subsystem invokes the corresponding callback.

    This figure assumes a 10-millisecond clock interrupt as well as short in-kernel code paths. Varying the clock frequency will change the average delay. Long in-kernel code paths will increase the average delay. Note that speeding up the clock (thus decreasing the average delay) will in turn decrease the number of requests satisfied by a single quiescent-state detection, which in turn can degrade performance.

    Note that the kmem_deferred_free() family of primitives shields you from the rc_callback() latency as long as you have enough memory to handle 25 milliseconds worth of deletions from your data structure.

  4. The write side of the read-copy update mechanism can be stalled by bugs in the kernel that result in long-running loops. For example, if a buggy kernel driver spins for one second, there can be up to a one-second latency imposed on the write-side primitives. Note that this latency does not consume CPU time. PTX 4.0.1 and 4.1 (as well as later OS versions) have a version of the rclock mechanism that issues warnings to the console when such stalls occur. In even later versions of the OS, these warnings include a short-form stack trace of the overly long code segment.

  5. In cases where it is permissible to exclude readers, the read-to-write ratio must be rather high in order for the read-copy mechanism to have a performance benefit (in cases where it is not permissible to exclude readers, read-copy is the only game in town). The exact breakeven ratio will vary depending on the details of the situation, but, as a rule of thumb, the read-copy update mechanism will almost certainly provide a performance benefit when the read-to-write ratio is N-to-1, where N is the number of CPUs. If there are a large number of read-copy updates per unit time, the overhead of each update is amortized, so that read-copy update can provide a performance benefit with read-to-write ratios as low as 1.10.

    The reason for this wide range is that the primitives making up the read-copy mechanism make heavy use of batching techniques that allow overhead to be spread over multiple concurrent requests.

  6. The use of the read-copy mechanism for operations that modify existing data structures sometimes requires that the modification process be restructured into read-copy form. This restructuring is necessary in cases where the modification cannot be made without the data structure passing through inconsistent intermediate states.

    The read-copy form requires that a new data structure be allocated and copied from the old one. The new data structure is then modified as required. The old data structure is then replaced with the newly modified one. Finally, the old data structure is deleted when it is safe to do so.

    Such restructuring will likely require the addition of memory allocation, which can fail. These failures must be correctly handled, either by blocking until memory is available, aborting the modification with an error, preallocating the memory, or using some mechanism to retry the modification at a later time.

    In some cases, the complexity added by this restructuring precludes use of the read-copy mechanism.

    Note that modifications that do not leave the data structure in inconsistent intermediate states do not require this sort of restructuring. Examples include "blind" writes to single aligned bytes, words, or longs, or atomic updates of single aligned bytes, words, longs, or (on Pentiums and better) 8-byte quantities.

  7. The read-copy mechanism may not be used to protect access to a data structure across a blocking operation. This same restriction also applies to spinlocks.

    Therefore, data structures containing semaphores are not candidates for the read-copy mechanism unless the semaphores can be guaranteed to have no processes sleeping on them at the time of deletion. (It is possible to create a mechanism analogous to the read-copy mechanism where CPUs are replaced by processes and where a different set of quiescent states is selected so that this new mechanism can tolerate blocking. Such a mechanism was created as part of ptx4.5, and is used to protect per-process state such as credentials, current directory, and root directory.)

  8. A read-copy update requires the updater to locate all globally accessible pointers to the structure being updated. Local temporary variables need not be located, as the read-copy callback will delay until all such temporaries are no longer in use.

  9. Readers cannot block while accessing a data structure protected by the rclock mechanism. If a reader blocks, it must restart the access from the beginning, as the data structure may well have undergone arbitrary changes in the meantime. In principle, it is possible to recast the rclock mechanism into a per-process form that would tolerate reader blocking, such as was done to protect per-process state in ptx4.5.

Given all of these restrictions, it might seem that the read-copy mechanism is useful only in a few esoteric situations. In fact, these situations are surprisingly common. The following section lists some areas where the read-copy mechanism is useful and lists some conditions that must be satisfied for it to be used.

APPLICATIONS OF THIS MECHANISM

The following are a few specific examples of applications of this mechanism.

  1. Routing tables in computer communications protocol implementations. Here, the readers look up routes in the table for the packets leaving the system. The writers apply routing updates to the table. With the advent of fibrechannel, disk I/O must often be routed just like traditional comms traffic. Read-copy update may therefore be used to route disk-I/O traffic through fibrechannel networks.

  2. Address-resolution tables in computer communications protocol implementations. Readers look up link-level addresses for the packets leaving the system and writers apply address-resolution updates to the table.

  3. Device state tables. Readers access and possibly modify the individual entries, while writers may add and delete entries. If readers can modify the individual entries, then there will need to be some sort of mutex on the individual entries.

  4. Tables whose entries can be more efficiently deallocated in batches than one at a time (for example, by spreading locking overhead over more than one entry).

  5. Phased state change. A data structure that has an intermediate state between each pair of existing states can be switched to the intermediate state, then switched to the destination state once there are no more threads that have reason to assume that the data structure is still in the initial state.

    This is used in ptx/CLUSTERS when starting the recovery phase. When a lock-manager instance determines that recovery is necessary, it sets a flag and registers a callback with the read-copy mechanism. Once the callback is invoked, the lock-manager can be sure that all subsequent processing will be aware that a recovery is in progress. The alternative is to protect all code checking or setting the flag with an explicit (and expensive) mutex.

  6. Existence guarantees. This is used to more simply handle races between use of a service and teardown of that service. This approach is used in TCP/IP to handle the case where an IP protocol (such as TCP, UDP, or, of more practical interest, ICMP) is shut down while packets are still being received and processed. Each such protocol has a pointer to its private data structures, and these data structures must be freed up as part of the shutdown process. Clearly, it would be disastrous if a packet, received just before shutdown, was still being processed when the relevant data structures were freed. This disaster could be prevented using global locks or reference counts on each and every packet reception, but this would result in cache thrashing.

    Instead, we first NULL out the pointer to the private data structures so that any subsequent packets will be properly rejected. Then, the read-copy mechanism is used to post a callback when all current kernel threads of execution -- including those currently processing recently-received packets -- have terminated. At this point, it is safe to delete the private data structures.

    Existence guarantees are also used to ensure that all CPUs are finished using a data structure before deleting it. This case can occur during processing of a Unix close() system call. The kernel context of the process doing the close can set a flag indicating that the data structure is no longer available. The mechanism may then be used to wait until all currently-executing kernel threads of control (which may not have seen the new value of the flag) have terminated, at which point it is safe to delete the data structure.

  7. Poor-man's garbage collection. A function that returns a pointer to a dynamically-allocated structure could relieve the caller of the responsibility for freeing the structure by doing a deferred deletion before returning. It is not yet clear whether this is a useful tool or a shameless kludge.

  8. Simplify locking design when freeing a structure. It is sometimes convenient to hold a structure's lock throughout the entire deletion process. However, one problem with this approach is that if some other CPU attempts to acquire the lock just after the deleting CPU acquires it, the second CPU can end up holding the lock just after the element is returned to the freelist, where it might be reused for some unrelated purpose arbitrarily quickly. Doing a deferred rather than an immediate deletion is one way to resolve this problem -- the element could not be returned to the freelist until the second CPU was finished looking at it. This would allow the deleting CPU to set a flag in the structure indicating that it was no longer valid.

    This technique is used in TCP 4.2.0 and later to avoid deadlocks in STREAMS accept/t_accept processing. It is also used in TCP 4.2.0-4.4.x to avoid socket_peer deadlocks in STREAMS open processing.

Cases 1, 2, and 3 will have read-side code with the following structure when coded using Dynix/PTX spinlock primitives:

    s = p_lock(&my_lock, MY_SPL);
    /* look something up. */
    v_lock(&my_lock, s);
    /* use the item looked up. */

This situation offers the most natural opportunity for use of these primitives. The information is used outside of the lock, and therefore may be stale. In contrast, code with the following structure:

    s = p_lock(&my_lock, MY_SPL);
    /* look something up. */
    /* use the item looked up. */
    v_lock(&my_lock, s);

is probably not a candidate for use of these primitives, as this code guarantees that the information is up-to-date while it is being used. As noted in the previous section, the locking design would have to be restructured before read-copy update could be used.

OVERVIEW OF INTERFACE

The easiest-to-use members of the read-copy update interface are the kmem_deferred_free_embedded() family of functions. These functions take a pointer to a structure that is to be freed, but which may still be in use by other CPUs. The structure must have been removed from any global lists so that any CPU not currently referencing it will have no way of gaining a fresh reference to it in the future. The kmem_deferred_free_embedded() functions place the structure on a list where it is kept until all CPUs have passed through the idle loop, user code, a context switch, the beginning of a system call, or a trap from user space. At this point, all threads of kernel execution that were in progress at the time of the kmem_deferred_free_embedded() call have terminated. Since any new threads are unable to form a new reference to the structure, it may now be safely deleted.

The other commonly-used member is the rc_callback_wrapper() function that allows a process to wake itself up after all CPUs can be guaranteed to see a global state change.

Example uses of both types of interface appear in the following sections.

DESIGN RULES FOR kmem_deferred_alloc_embedded()

Applying the kmem_deferred_free_embedded() function to a data structure that is linked together with many other data structures can be an exercise in frustration if done in an ad-hoc manner. An effective way to manage the complexity is to restrict the pointer-manipulation to a few functions as described below. This approach can also do much to simplify existing code, independent of the use of the read-copy update mechanism.

The first thing to note is that use of kmem_deferred_free_embedded() assumes that updates are much less frequent than accesses. Therefore, the code that performs the updates usually need not be highly optimized, and the function-call overhead incurred by grouping the needed pointer manipulations into a few functions is not a concern.

The second thing to note is that a common mistake in the use of kmem_deferred_free_embedded() is to treat a structure slated for deletion as if it were the most up-to-date copy. The best way to guard against this is to maintain a flag in the structure that indicates that it is slated for deletion and to keep a pointer to its replacement, as shown in the following examples.

An ``xxx'' structure that supports the read-copy mechanism might therefore be defined as follows:

    typedef struct xxx xxx_t;
    struct xxx {
        xxx_t  *xxx_next;
        xxx_t  *xxx_prev;
        long   xxx_flags;
        xxx_t  *xxx_replacement;
        long   *xxx_commonctr;    /* example field to be shared among */
                                  /* all versions of an object. */

        /* other fields as needed, e.g... */

        long   xxx_address;       /* used to identify xxx. */
        lock_t xxx_lock;          /* used to lock individual xxx_t. */
    };

    #define XXX_DEFDEL 0x4000     /* Deletion in progress. */
    #define XXX_DEFDUP 0x8000     /* Deletion of duplicate. */
    #define SPLXXX     SPLHI      /* For example... */

    xxx_t  xxx_head;              /* Global list header. */
    lock_t xxx_mutex;             /* Global update mutex. */

Implementations in languages such as C++ or SmallTalk that support a more natural expression of object properties should of course use a more object-oriented approach. In either case, some algorithms may be structured so as to allow the xxx_next pointer to be used in place of the xxx_replacement pointer.

Allocation of an ``xxx'' structure should be performed by a function of the form:

    xxx_t *
    xxx_alloc(xxx_t *oldp, int flag)
    {
        xxx_t *newp;
        long *commonctr = NULL;

        if (oldp == NULL) {
            /*
             * There is no old version, so allocate
             * all fields that are common to all
             * versions.
             */
            commonctr = kmem_zalloc(sizeof(*commonctr), flag);
            if (commonctr == NULL) {
                return (NULL);
            }
        }
        newp = kmem_deferred_zalloc_embedded(sizeof(xxx_t), flag);
        if (newp == NULL) {
            /* Avoid leaking the common fields allocated above. */
            if (commonctr != NULL) {
                kmem_free(commonctr, sizeof(*commonctr));
            }
            return (NULL);
        }
        if (oldp != NULL) {
            INSIST((oldp->xxx_flags & XXX_DEFDEL) == 0,
                   "xxx_alloc: update of deleted item!");
            /*
             *+ Internal software error.  Caller attempted
             *+ to modify a structure already slated for
             *+ deletion.  Corrective action: change
             *+ code to modify the current version.
             *+ This may involve eliminating a race
             *+ condition.
             */
            /* Also, caller must hold xxx_mutex. */
            *newp = *oldp;
            oldp->xxx_flags |= XXX_DEFDEL | XXX_DEFDUP;
            oldp->xxx_replacement = newp;
        } else {
            newp->xxx_commonctr = commonctr;
            /* initialize nonzero fields of newp. */
        }
        return (newp);
    }

Once a structure has been allocated, the calling code will likely modify some of the fields as needed. If the structure is to replace an existing one, the following function would be used:

    void
    xxx_replace(xxx_t *newp, xxx_t *oldp)
    {
        /* Caller must hold xxx_mutex. */

        /* Commit to memory before pointer update. */
        RC_MEMSYNC();

        /*
         * Update all pointers to the old version to now
         * point to the new version.
         */
        if (newp->xxx_next->xxx_prev == oldp) {
            newp->xxx_next->xxx_prev = newp;
        }
        if (newp->xxx_prev->xxx_next == oldp) {
            newp->xxx_prev->xxx_next = newp;
        }
        kmem_deferred_free_embedded(oldp, sizeof(*oldp), xxx_cleanup);
    }

Another function named xxx_insert() would be needed to insert an entirely new structure into the lists. A fourth function named xxx_delete() would be used to remove a structure without replacing it with a new copy. Its form would be similar to that of xxx_replace().

    void
    xxx_insert(xxx_t *newp)
    {
        /* Caller must hold xxx_mutex. */

        /* Link into all relevant lists. */
        newp->xxx_prev = &xxx_head;
        newp->xxx_next = xxx_head.xxx_next;

        /* Commit to memory before making newp globally accessible. */
        RC_MEMSYNC();

        xxx_head.xxx_next->xxx_prev = newp;
        xxx_head.xxx_next = newp;
    }

    void
    xxx_delete(xxx_t *oldp)
    {
        /* Caller must hold xxx_mutex. */

        /* Link out of all relevant lists. */
        oldp->xxx_prev->xxx_next = oldp->xxx_next;
        oldp->xxx_next->xxx_prev = oldp->xxx_prev;

        /* Mark as deleted. */
        INSIST((oldp->xxx_flags & XXX_DEFDEL) == 0,
               "xxx_delete: deletion of deleted item!");
        /*
         *+ Internal software error.  Caller attempted to
         *+ delete a structure already slated for deletion.
         *+ Corrective action: change code to modify the
         *+ current version.  This may involve eliminating
         *+ a race condition.
         */
        oldp->xxx_flags |= XXX_DEFDEL;

        /*
         * Delete the structure when safe to do so.
         * Also does an implicit RC_MEMSYNC().
         */
        kmem_deferred_free_embedded(oldp, sizeof(*oldp), xxx_cleanup);
    }

If the structure has pointers to other structures which themselves may need to be freed, then an xxx_cleanup() function may be used to handle this. This example has a field xxx_commonctr that is common to all versions of a given structure. Therefore, this field must be freed only when the object is deleted -- not when a version is deleted after being replaced by a newer version. If the structure did not have any such fields, then the xxx_cleanup passed to the call to kmem_deferred_free_embedded() above would be replaced with NULL in order to omit any cleanup processing.

    int
    xxx_cleanup(void *ptr, size_t nbytes)
    {
        xxx_t *xp = (xxx_t *)ptr;

        if (!(xp->xxx_flags & XXX_DEFDUP)) {
            /*
             * This is not an old duplicate, so we
             * must free up all fields common to
             * all versions.
             */
            kmem_free(xp->xxx_commonctr, sizeof(*xp->xxx_commonctr));

            /*
             * Insert code here to free up other
             * fields common to all versions.
             */
        }

        /*
         * Insert code here to free up any other structures
         * pointed to by this one that are not referenced
         * by other structures.
         */

        /*
         * Since we did not free up the structure itself,
         * return 0 to tell the caller to do it.
         */
        return (0);
    }

To illustrate, the following code updates the structure pointed to by xp from within a process context:

    s = p_lock(&xxx_mutex, SPLXXX);
    /*
     * If the initial search for xp was done under the lock, then
     * this "while" loop would not be necessary; see below.
     */
    while (xp->xxx_replacement != NULL) {
        xp = xp->xxx_replacement;
    }
    newp = xxx_alloc(xp, KM_SLEEP);
    /* insert code here to modify newp's fields as appropriate. */
    xxx_replace(newp, xp);
    v_lock(&xxx_mutex, s);

An entirely new structure would be added as follows:

    s = p_lock(&xxx_mutex, SPLXXX);
    newp = xxx_alloc(NULL, KM_SLEEP);
    /* insert code here to initialize newp's fields. */
    xxx_insert(newp);
    v_lock(&xxx_mutex, s);

An existing structure would be deleted (without replacement) as follows:

    s = p_lock(&xxx_mutex, SPLXXX);
    /*
     * If the initial search for xp was done under the lock, then
     * this "while" loop would not be necessary; see below.
     */
    while (xp->xxx_replacement != NULL) {
        xp = xp->xxx_replacement;
    }
    xxx_delete(xp);
    v_lock(&xxx_mutex, s);

The while loop in both the replacement and deletion cases is not needed if the list traversal that results in the xp pointer is conducted under the lock. For example, if the lookup and deletion code were combined, the following would result:

    s = p_lock(&xxx_mutex, SPLXXX);
    xp = xxx_lookup(xxx_address);
    if (xp == NULL) {
        v_lock(&xxx_mutex, s);
        /*
         * Do whatever is necessary to report failure and
         * return control to the caller.
         */
    }
    xxx_delete(xp);
    v_lock(&xxx_mutex, s);

Since the lock is held while looking up the pointer, it is not possible for the pointer to become obsolete.

The code for xxx_lookup would be as follows:

    xxx_t *
    xxx_lookup(long xxx_address)
    {
        xxx_t *xp;

        for (xp = xxx_head.xxx_next; xp != &xxx_head; xp = xp->xxx_next) {
            if (xp->xxx_address == xxx_address) {
                return (xp);
            }
        }
        return (NULL);
    }

Note that the caller is responsible for locking. As noted earlier, updaters would use xxx_mutex to lock. Readers would just use RC_RDPROTECT() and RC_RDUNPROTECT() as follows:

    RC_RDPROTECT();
    xp = xxx_lookup(xxx_address);
    if (xp != NULL) {
        /* Do something with xp. */
    }
    RC_RDUNPROTECT();

ADDITIONAL DESIGN RULES TO PROTECT LIST OF ELEMENTS

The rclock discipline is often used to protect a list of elements, but each element itself may need to be protected by a normal spinlock. In this case, the following code would be used to find the element and lock it:

    RC_RDPROTECT();
    xp = xxx_lookup(xxx_address);
    if (xp != NULL) {
        s = p_lock(&(xp->xxx_lock), SPLXXX);
        if (xp->xxx_flags & XXX_DEFDEL) {
            /*
             * handle case where item is deleted,
             * possibly by retrying the search.
             */
        } else {
            /* Do something with xp. */
        }
        v_lock(&(xp->xxx_lock), s);
    }
    RC_RDUNPROTECT();

In particular, the following code would delete an item:

    s = p_lock(&xxx_mutex, SPLXXX);
    xp = xxx_lookup(xxx_address);
    if (xp != NULL) {
        (void)p_lock(&(xp->xxx_lock), SPLXXX);
        ASSERT_DEBUG(!(xp->xxx_flags & XXX_DEFDEL),
                     "item deleted without xxx_mutex");
        xxx_delete(xp);
        v_lock(&(xp->xxx_lock), SPLXXX);
    }
    v_lock(&xxx_mutex, s);

A new structure would be added as follows:

    s = p_lock(&xxx_mutex, SPLXXX);
    newp = xxx_alloc(NULL, KM_SLEEP);
    /* insert code here to modify newp's fields as appropriate. */
    (void)p_lock(&(newp->xxx_lock), SPLXXX);
    xxx_insert(newp);
    v_lock(&(newp->xxx_lock), SPLXXX);
    v_lock(&xxx_mutex, s);

Note that it is not necessary to use the read-copy primitives in order to update a single element in place, as the xp->xxx_lock may be used to guard such updates.

EXAMPLE OF ELIMINATING TABLE-LOOKUP MUTEX

It is fairly common to have a list structure protected by a global mutex whose individual elements are protected by a per-element mutex. In most cases, searches of the list are much more common than are insertions or deletions.

For example, the following code gives (simplistic) search and delete functions for a circular doubly-linked list of elements:

    lock_t hash_lock;

    struct element {
        struct element *next;
        struct element *prev;
        lock_t element_lock;
        int address;
        int data;
        int deleted;                  /* used later... */
    };

    struct element head[NHASH];       /* waste a bit of space... */

    #define hashit(x) ((x) % NHASH)

    /*
     * Find element with specified address, lock it, and return
     * a pointer to it.  Return NULL if no such address.
     * Drop the global lock if rlsgbl is set.
     * If element not found, drop all locks.
     */
    struct element *
    search(int addr, spl_t *s, int rlsgbl)
    {
        struct element *p;
        struct element *p0;

        p0 = &head[hashit(addr)];
        *s = p_lock(&hash_lock, SPLXXX);
        p = p0->next;
        while (p != p0) {
            if (p->address == addr) {
                (void)p_lock(&(p->element_lock), SPLXXX);
                if (rlsgbl) {
                    v_lock(&hash_lock, SPLXXX);
                }
                return (p);
            }
            p = p->next;
        }
        v_lock(&hash_lock, *s);
        return ((struct element *)NULL);
    }

    /*
     * Delete the specified element from the table.  Element
     * must be locked, global lock must -not- be held.  Drops
     * spl to that specified when releasing per-element lock.
     */
    void
    delete(struct element *p, spl_t s)
    {
        int addr;
        spl_t junkspl;

        ASSERT_DEBUG(!p->deleted, "Element already deleted");

        /* Try to get global lock. */
        if (cp_lock(&hash_lock, SPLXXX) == CPLOCKFAIL) {
            /*
             * Someone else has the global lock.
             * To avoid deadlock, we must release the
             * element lock and acquire the locks in
             * the proper order.  In the meantime, someone
             * may have deleted the element, so we must
             * check.  We must do an explicit search, as
             * we cannot trust our pointer to survive
             * unless we have one of the locks held.
             * Don't bother coding inline, as this
             * should be a rather rare case.  In fact,
             * if efficiency is a concern, the body
             * of this if-statement should be coded
             * out-of-line.
             */
            addr = p->address;
            v_lock(&(p->element_lock), s);
            if ((p = search(addr, &junkspl, 0)) == NULL) {
                /* Someone else deleted us. */
                return;
            }
            /*
             * The element is still there and we have
             * all the locks we need.
             */
        }

        /*
         * We now have all needed locks, so
         * delete the element from the list, release
         * locks, free up the element, and return.
         */
        p->next->prev = p->prev;
        p->prev->next = p->next;
        v_lock(&(p->element_lock), SPLXXX);
        v_lock(&hash_lock, s);
        kmem_free(p, sizeof(*p));
        return;
    }

This code has lots of lock round-trips and has some rather ugly deadlock-avoidance code. Compare it to the following code, which uses kmem_deferred_free_embedded():

    /*
     * Find element with specified address, lock it, and return
     * a pointer to it.  Return NULL if no such address.
     */
    struct element *
    search(int addr, spl_t *s)
    {
        struct element *p;
        struct element *p0;

        p0 = &head[hashit(addr)];
        RC_RDPROTECT();
        p = p0->next;
        while (p != p0) {
            if ((p->address == addr) && !p->deleted) {
                *s = p_lock(&(p->element_lock), SPLXXX);
                RC_RDUNPROTECT();
                return (p);
            }
            p = p->next;
        }
        RC_RDUNPROTECT();
        return ((struct element *)NULL);
    }

    /*
     * Delete the specified element from the table.  Element
     * must be locked.  Drops spl to that specified when
     * releasing per-element lock.
     */
    void
    delete(struct element *p, spl_t s)
    {
        /*
         * Get global lock.  There is no possible deadlock,
         * since the global lock is now always acquired
         * after the per-element lock.
         */
        (void)p_lock(&hash_lock, SPLXXX);

        /*
         * Delete the element from the list, release
         * locks, free up the element, and return.
         */
        p->deleted = 1;
        p->next->prev = p->prev;
        p->prev->next = p->next;
        v_lock(&(p->element_lock), SPLXXX);
        v_lock(&hash_lock, s);
        kmem_deferred_free_embedded(p, sizeof(*p), NULL);
        return;
    }

There is one less lock round-trip in search(), and the code for delete() is much simpler because there is now no possibility of deadlock. Of course, in order to use kmem_deferred_free_embedded(), the objects must have been allocated using either kmem_deferred_alloc_embedded() or kmem_deferred_zalloc_embedded().

EXAMPLE OF rc_callback_wrapper()

This function can be used to cause a process to block until all currently active kernel threads have completed (either blocked, returned to user, or returned to the idle loop).

    static sema_t my_sema;
    rc_callback_t *rp;
    extern bool_t global_flag;

    init_sema(&my_sema, 0);
    rp = rc_alloc_callback(rc_callback_wrapper, (void *)v_sema,
                           (void *)&my_sema, KM_SLEEP);
    global_flag = 1;
    rc_callback(rp);
    p_sema(&my_sema, PRSWAIT);    /* If not signalable. */

    /*
     * At this point, we know that all processes or interrupt
     * routines executing in the kernel started after we set
     * global_flag to one.
     */

    rc_free_callback(rp);

Note that an explicit RC_MEMSYNC() is not needed because an implicit one is performed as a result of calling rc_callback(). This will have the effect of committing the assignment to global_flag. Note that there must be some sort of mutex or convention guarding the assignment to global_flag. An example of a convention is that the global_flag can be modified only by this process. Such a convention should of course be well-documented.

EXAMPLE OF kmem_deferred_free_embedded()

Consider a linked list:

    struct element {
        struct element *next;
        int data;
        int morestuff[3];
    };

    struct element *head;

Suppose that one process occasionally deletes the first element from the list:

    struct element *p;

    p = head;
    head = p->next;
    kmem_free(p, sizeof(*p));    /* erases the element. */

Suppose that another process concurrently scans the list:

    struct element *p;

    for (p = head; p != NULL; p = p->next) {
        do_something_with(p);
    }

This code is of course not safe on multiprocessors. One way to make it safe is to use a global lock named, say, list_lock.

First process:

    struct element *p;
    spl_t s;

    s = p_lock(&list_lock, SPLXXX);
    p = head;
    head = p->next;
    v_lock(&list_lock, s);
    kmem_free(p, sizeof(*p));    /* erases the element. */

Second process:

    struct element *p;
    spl_t s;

    s = p_lock(&list_lock, SPLXXX);
    for (p = head; p != NULL; p = p->next) {
        do_something_with(p);
    }
    v_lock(&list_lock, s);

This approach adds considerable mutex overhead, particularly in the common case where the list is short (after all, if the list were often long, some other data structure would have been chosen).

The first process's code may be rewritten as follows to avoid this overhead:

    struct element *p;

    p = head;
    head = p->next;
    kmem_deferred_free_embedded(p, sizeof(*p), NULL);

where kmem_deferred_free_embedded() is a function that uses the read-copy update mechanism to free the element only when it is safe to do so. The memory must have been allocated with either kmem_deferred_alloc_embedded() or kmem_deferred_zalloc_embedded(), both of which allocate the extra space that kmem_deferred_free_embedded() needs to keep track of the structure until it can safely be deleted.

If there can be several processes removing elements from the list concurrently, a lock must still be used:

    struct element *p;
    spl_t s;

    s = p_lock(&list_lock, SPLXXX);
    p = head;
    head = p->next;
    v_lock(&list_lock, s);
    kmem_deferred_free_embedded(p, sizeof(*p), NULL);

Note that the second process's code need not be modified in any manner; hence, the process scanning the list sees zero overhead from use of the mechanism. Nonetheless, additional statements should be added to ease debugging. In a .std kernel, these statements generate no code. In a .mfg kernel, they add some debugging code:

    struct element *p;

    RC_RDPROTECT();
    for (p = head; p != NULL; p = p->next) {
        do_something_with(p);
    }
    RC_RDUNPROTECT();

In addition, these statements alert the human reading the code to the fact that a read-copy update is protecting the code. In operating systems that permit preemption of kernel code, these macros would expand into code that suppresses preemption, most likely by disabling interrupts.

The following code can be used to insert an element into the list:

    struct element *p;
    spl_t s;

    p = (struct element *)
        kmem_deferred_alloc_embedded(sizeof(*p), KM_SLEEP);
    p->data = 1;
    s = p_lock(&list_lock, SPLXXX);
    p->next = head;
    RC_MEMSYNC();
    head = p;
    v_lock(&list_lock, s);

The RC_MEMSYNC() is needed to prevent readers from seeing the head pointer being updated in memory before the p->next field is filled in. The reason that we have not needed RC_MEMSYNC() in the past is that v_lock() acts as an implicit RC_MEMSYNC() due to the xchgb instruction that it uses to release the lock. (An older version of v_lock() used a simple store; the change from movb to xchgb is needed for correct operation on 80486 CPUs.) RC_MEMSYNC() should be used after filling in the fields of a structure in preparation for planting the first globally-accessible pointer to that structure. CPUs with weak memory barriers, such as the DEC Alpha, require special handling.

The p_lock/v_lock pair is needed to prevent multiple concurrent insertions and to prevent concurrent insertion and deletion. Again, processes (or interrupt routines) reading the list need not use p_lock/v_lock.

As noted earlier, any memory freed via kmem_deferred_free_embedded() must have been allocated with either kmem_deferred_alloc_embedded() or kmem_deferred_zalloc_embedded(). When this is inconvenient, kmem_deferred_free() may be used instead. For example, the deletion code could be rewritten as follows:

    struct element *p;
    spl_t s;

    s = p_lock(&list_lock, SPLXXX);
    p = head;
    head = p->next;
    v_lock(&list_lock, s);
    kmem_deferred_free(p, sizeof(*p), NULL,
                       kma_alloc_defer_item(KM_SLEEP));

Note that the call to kma_alloc_defer_item() allocates memory. If a huge number of processes were doing this sort of thing simultaneously when the system was out of memory, they would deadlock (they would be unable to release the item pointed to by p until kma_alloc_defer_item() returned, which it could not until some memory was freed).

If this sort of deadlock is an issue, the kma_defer_item_t may be statically allocated within the item pointed to by p. If the array named morestuff were never used while the item was in the process of deletion, the following code could be used to overlay the kma_defer_item_t on top of morestuff:

    struct element *p;
    spl_t s;

    s = p_lock(&list_lock, SPLXXX);
    p = head;
    head = p->next;
    v_lock(&list_lock, s);
    ASSERT_DEBUG(sizeof(p->morestuff) >= kma_alloc_defer_item_size(),
                 "p->morestuff not large enough");
    kmem_deferred_free(p, sizeof(*p), NULL,
                       (kma_defer_item_t *)p->morestuff);

Note that the ASSERT_DEBUG() protects against size changes in a way that may be safely used in layered products.

Nevertheless, this overlaying can result in fastpatches or rerolls of layered products if sizes do change. A way to work around this is to use kmem_deferred_alloc_embedded() instead of kmem_alloc(). The former allocates the requested amount of memory as well as enough space for a kma_defer_item_t. The code that adds items to the list would be as follows:

    struct element *p;
    spl_t s;

    p = (struct element *)
        kmem_deferred_alloc_embedded(sizeof(*p), KM_SLEEP);
    p->data = 1;
    s = p_lock(&list_lock, SPLXXX);
    p->next = head;
    RC_MEMSYNC();
    head = p;
    v_lock(&list_lock, s);

The code that deletes items would be as follows:

    struct element *p;
    spl_t s;

    s = p_lock(&list_lock, SPLXXX);
    p = head;
    head = p->next;
    v_lock(&list_lock, s);
    kmem_deferred_free_embedded(p, sizeof(*p), NULL);

Using kmem_deferred_free_embedded() on storage allocated via kmem_alloc() is a bad idea, as is using kmem_free() on storage allocated via kmem_deferred_alloc_embedded(). Debugging versions of kmem_alloc(), such as the one in the manufacturing kernel in ptx4.1 and later, may be used to detect such errors.

DOCUMENTATION OF INTERFACES

  1. kma_alloc_defer_item
  2. kma_alloc_defer_item_size
  3. kmem_deferred_alloc_embedded
  4. kmem_deferred_free
  5. kmem_deferred_free_embedded
  6. kmem_immediate_free_embedded (New in ptx4.5)
  7. kmem_deferred_zalloc_embedded
  8. If you need kmem_struct_deferred_free()...
  9. rc_alloc_callback
  10. rc_callback
  11. rc_callback_wrapper
  12. rc_free_callback
  13. rc_set_defer
  14. rc_set_func
  15. RC_MEMSYNC
  16. RC_RDMEMSYNC
  17. RC_RDPROTECT
  18. RC_RDUNPROTECT

kma_alloc_defer_item

    kma_defer_item_t *
    kma_alloc_defer_item(int flag)

Allocate a struct used to track a deferred kmem_free. These structs are freed automatically at the time that the deferred kmem_free is done. The flag may be KM_SLEEP to allow blocking or KM_NOSLEEP to allow a NULL return if low on memory.

In addition, if KM_SLEEP is specified, no locks or gates may be held and the SPL must be at SPL0.

kma_alloc_defer_item_size

    size_t
    kma_alloc_defer_item_size()

Returns the size of a kma_defer_item_t. This can be used in layered products to verify that a given part of a structure may be safely overlaid with a kma_defer_item_t.

kmem_deferred_alloc_embedded

    void *
    kmem_deferred_alloc_embedded(size_t nbytes, int flag)

Allocate the specified amount of memory, leaving enough additional room at the end for a kma_defer_item_t. The returned pointer must be freed via either kmem_deferred_free_embedded() or kmem_immediate_free_embedded(). The latter is useful when backing out of a memory-allocation failure during early boot (e.g., early autoconf).

kmem_deferred_free

    void
    kmem_deferred_free(void *ptr, size_t nbytes,
                       int (*func)(void *ptr, size_t nbytes),
                       kma_defer_item_t *kp)

Free up the memory pointed to by ptr of size nbytes, given an already-allocated kma_defer_item_t. If the caller can block, the following is a convenient way to obtain the memory:

    kmem_deferred_free((void *)p, (size_t)sizeof(*p), NULL,
                       kma_alloc_defer_item(KM_SLEEP));

However, this means that memory must be allocated in order to free it. This can cause deadlocks in low-memory situations. See kmem_deferred_free_embedded() for one way of handling this situation more gracefully. Another way of handling this situation is to have a section of the struct being freed that may be overlaid and clobbered by the read-copy update mechanism. The kma_defer_item_t will be freed by the read-copy lock mechanism (but of course not if it is part of the struct being freed).

If the func parameter is NULL, the actual freeing will be via kmem_free(), and will occur once all CPUs have been observed outside of the kernel (or inside the idle loop or in a context switch).

The func parameter should be used when p points to a complex linked structure. The pointer p will be passed to the function, which can then free up any structures hanging off it. This function may also free up the structure pointed to by p, but it must return a nonzero value to indicate that it has done so. Otherwise, the structure pointed to by p will be freed via kmem_free() as noted above.

kmem_deferred_free_embedded

    void
    kmem_deferred_free_embedded(void *ptr, size_t nbytes,
                                int (*func)(void *ptr, size_t nbytes))

Free up the memory pointed to by ptr of size nbytes. The memory must have been allocated via kmem_deferred_alloc_embedded().

If func is NULL, the actual freeing will be via kmem_free(), and will occur once all CPUs have been observed outside of the kernel (or inside the idle loop or in a context switch). The function pointed to by func is allowed to do any needed processing that must happen at the time of the actual freeing, just as for kmem_deferred_free().

kmem_immediate_free_embedded

    void
    kmem_immediate_free_embedded(void *ptr, size_t nbytes,
                                 int (*func)(void *ptr, size_t nbytes))

Free up the memory pointed to by ptr of size nbytes. The memory must have been allocated via kmem_deferred_alloc_embedded().

The chip interrupts (sti/cli) must be enabled on entry to this function, e.g., no gates may be held.

If func is NULL, the actual freeing will be via kmem_free(). Unlike kmem_deferred_free_embedded(), the free will occur immediately. Therefore, you would normally only use kmem_immediate_free_embedded() for backing out of a multi-part structure allocation due to memory-allocation failure.
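
For example, a minimal sketch of backing out of a two-part allocation failure (struct foo, struct bar, and the error-return convention are illustrative, not part of the interface):

    struct foo *a;
    struct bar *b;

    a = (struct foo *)kmem_deferred_alloc_embedded(sizeof(*a), KM_NOSLEEP);
    if (a == NULL) {
        return (ENOMEM);
    }
    b = (struct bar *)kmem_deferred_alloc_embedded(sizeof(*b), KM_NOSLEEP);
    if (b == NULL) {
        /* a was never made globally accessible, so free it immediately. */
        kmem_immediate_free_embedded((void *)a, sizeof(*a), NULL);
        return (ENOMEM);
    }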

The function pointed to by func is allowed to do any needed processing that must happen at the time of the actual freeing, just as for kmem_deferred_free(). This is provided for convenience only, as the caller could just as well manually invoke the function before invoking kmem_immediate_free_embedded().

This primitive is new in ptx4.5.

kmem_deferred_zalloc_embedded

    void *
    kmem_deferred_zalloc_embedded(size_t nbytes, int flag)

This operates the same as kmem_deferred_alloc_embedded(), except that it also zeroes the newly-allocated memory.

If you need kmem_struct_deferred_free()...

There is no kmem_struct_deferred_free(). This is because you can use kmem_deferred_free() to get the same effect easily. Instead of the (mythical):

    kmem_struct_deferred_free((void *)p, my_kmem_struct_pool, NULL,
                              kma_alloc_defer_item(KM_SLEEP));

use the following:

    kmem_deferred_free((void *)p, sizeof(*p), my_kmem_struct_free,
                       kma_alloc_defer_item(KM_SLEEP));

where my_kmem_struct_free() is defined as follows:

    /*ARGSUSED*/
    int
    my_kmem_struct_free(void *ptr, size_t nbytes)
    {
        /*
         * Note that kmem_deferred_free() -does- use nbytes
         * in order to determine whether the kma_defer_item_t
         * is embedded in your structure.  So you need to pass
         * the correct value in, even though my_kmem_struct_free
         * does not use it.
         */
        kmem_struct_free(my_kmem_struct_pool, ptr);

        /*
         * Indicate that we have freed the structure.
         * Note that the kma_defer_item_t will be unconditionally
         * freed by the caller (unless you embedded it as one
         * of the fields in your structure).
         */
        return (1);
    }

Note that the kmem_deferred_free() call includes a memory allocation. This will not be convenient outside of process context. One way to handle this is to allocate the kma_defer_item_t at the same time that you allocate your structure. Another, and probably better, way to handle this is to increase the size of your structure to allow for the kma_defer_item_t.

Use the following procedure to safely increase your structure's size in a layered product (where a Base-OS change to sizeof(kma_defer_item_t) would otherwise force your product into an unscheduled release):

  1. Instead of passing sizeof(my_struct_t) to kmem_struct_init(), pass in (roundup(sizeof(my_struct_t), sizeof(void *)) + kma_alloc_defer_item_size()). This will safely increase the size of your structure to accommodate a kma_defer_item_t.
  2. Instead of passing sizeof(*p) to kmem_deferred_free(), pass in (roundup(sizeof(my_struct_t), sizeof(void *)) + kma_alloc_defer_item_size()). This is important! If you fail to make this change, you will get a panic in a .mfg kernel and strange memory-corruption panics in .std kernels!
  3. Instead of passing kma_alloc_defer_item(KM_SLEEP) as the last argument to kmem_deferred_free(), pass in:

         /*CASTOK*/ (kma_defer_item_t *)(((long)p) +
                     roundup(sizeof(my_struct_t), sizeof(void *)))

     Those who maintain this code after you will thank you for wrapping this expression in a macro or function; a sketch of such a wrapper appears after the example calls below.
This allows you to avoid the extra allocation of the kma_defer_item_t and save a bit of precious KVA.

The kmem_struct_init() code would look as follows:

    kmem_struct_init(roundup(sizeof(my_struct_t), sizeof(void *)) +
                     kma_alloc_defer_item_size(),
                     alignment,
                     ...

The call to kmem_deferred_free() would appear as follows:

    kmem_deferred_free((void *)p,
                       (roundup(sizeof(my_struct_t), sizeof(void *)) +
                        kma_alloc_defer_item_size()),
                       my_kmem_struct_free,
                       /*CASTOK*/(kma_defer_item_t *)
                       (((long)p) +
                        roundup(sizeof(my_struct_t), sizeof(void *))));

The my_kmem_struct_free() function would be defined as before.
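If you take this approach, a small wrapper keeps the size and offset arithmetic in one place. The following is a minimal sketch; MY_STRUCT_ALLOC_SIZE and my_struct_defer_item() are hypothetical names, not part of the Base-OS interfaces:

    /* Allocation size: the structure plus an appended kma_defer_item_t. */
    #define MY_STRUCT_ALLOC_SIZE \
            (roundup(sizeof(my_struct_t), sizeof(void *)) + \
             kma_alloc_defer_item_size())

    /* Locate the kma_defer_item_t appended to the structure. */
    #define my_struct_defer_item(p) \
            (/*CASTOK*/(kma_defer_item_t *) \
             (((long)(p)) + roundup(sizeof(my_struct_t), sizeof(void *))))

The kmem_struct_init() and kmem_deferred_free() calls shown above then become kmem_struct_init(MY_STRUCT_ALLOC_SIZE, alignment, ...) and kmem_deferred_free((void *)p, MY_STRUCT_ALLOC_SIZE, my_kmem_struct_free, my_struct_defer_item(p)), respectively.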

It is possible to do deferred kmem_struct_free()s by using rc_callback() directly, but more code is required to properly recover or dispose of the rc_callback_t structures.

It is also possible to use this same general technique in many other situations.

rc_alloc_callback

rc_callback_t * rc_alloc_callback(void (*func)(rc_callback_t *rc, void *arg1, void *arg2), void *arg1, void *arg2, int flag) This function allocates a callback structure for later use by rc_callback().

func is the function that is to be called back, arg1 and arg2 are the arguments to be passed to it, and flag is passed to kmem_alloc() to specify blocking (KM_SLEEP) or no blocking (KM_NOSLEEP) if short on memory.

In addition, if KM_SLEEP is specified, no locks or gates may be held and the SPL must be at SPL0.

See the code in the file kma_defer.c in PTX4.0 for an example usage.

rc_callback

void rc_callback(rc_callback_t *rc) This function registers a callback that has been previously allocated by rc_alloc_callback(). The function will be invoked after all currently-active kernel threads (interrupts and kernel execution on behalf of processes) have finished.

The chip interrupts (sti/cli) must be enabled on entry to this function, e.g., no gates may be held.

See the code in the file kma_defer.c in PTX4.0 for an example usage.
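As a minimal sketch of the allocate/register/free lifecycle (my_state_retire() and my_state_retire_later() are hypothetical, and the old state is assumed to have already been unlinked from any reader-visible structure):

    /* Callback handler: runs after a quiescent period has elapsed. */
    void
    my_state_retire(rc_callback_t *rc, void *arg1, void *arg2)
    {
        /* No CPU can still hold a reference to the old state. */
        kmem_free(arg1, (size_t)(long)arg2);
        rc_free_callback(rc);   /* legal from within the handler */
    }

    /* Update side: process context, no gates held. */
    void
    my_state_retire_later(void *old_state, size_t nbytes)
    {
        rc_callback_t *rcp;

        rcp = rc_alloc_callback(my_state_retire, old_state,
                                (void *)(long)nbytes, KM_SLEEP);
        rc_callback(rcp);       /* fires once all active threads finish */
    }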

rc_callback_wrapper

void rc_callback_wrapper(rc_callback_t *rc, void *arg1, void *arg2) This function allows you to register a callback that does not know about rc_callback_t arguments. See the example for how to use it (rc_callback_wrapper() is normally passed to rc_alloc_callback() as an argument). This can be used to cause a v_sema() to be executed once a change of state has been seen by all CPUs, as shown in the example.

rc_free_callback

void rc_free_callback(rc_callback_t *rp) This function frees up a callback structure that is no longer needed. It is illegal to free up a callback that is currently registered -- you must wait until the callback is processed. It is legal for the callback function to free up the callback structure.

See the code in the file kma_defer.c in PTX4.0 for an example usage.

rc_set_defer

void rc_set_defer(rc_callback_t *rp, bool_t defer) This function allows the rc_callback handler to defer itself if it has consumed too much CPU time. Note that if rc_set_defer() is called from within the handler, it will not take effect until the subsequent invocation. This is because the handler is within its rights to delete the rc_callback_t, which means that the rc_intr() function cannot refer to the rc_callback_t after the handler completes.

Use the same function to clear deferral -- the defer argument is set to one in order to enable deferral and set to zero in order to clear deferral. The handler returns CALLBACK_DONE to signal completion, or CALLBACK_DEFER to be reinvoked at a later time.

rc_set_func

void rc_set_func(rc_callback_t *rp, void (*func)(rc_callback_t *rc, void *arg1, void *arg2), void *arg1, void *arg2) This function allows the function and arguments of an existing callback to be changed. This is used to replace an rc_free_callback()/rc_alloc_callback() pair in order to reuse an existing rc_callback_t. The caller must ensure that two rc_set_func()s don't happen to the same callback at the same time, typically by making sure that no other CPU has a reference to the callback. It is also illegal to invoke rc_set_func() on an rc_callback_t that is currently registered with the read-copy update mechanism.

See the code in the file kma_defer.c in PTX4.0 for an example usage.
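As a sketch of the intended reuse pattern (my_callback_reuse() and the handler passed to it are hypothetical):

    /*
     * Retarget a previously-completed callback and register it again,
     * avoiding an rc_free_callback()/rc_alloc_callback() pair.  The
     * caller knows that rcp is not currently registered and that no
     * other CPU holds a reference to it.
     */
    void
    my_callback_reuse(rc_callback_t *rcp,
                      void (*handler)(rc_callback_t *, void *, void *),
                      void *arg)
    {
        rc_set_func(rcp, handler, arg, (void *)NULL);
        rc_callback(rcp);
    }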

RC_MEMSYNC

RC_MEMSYNC() Force a memory-system synchronization point. This is required for systems (such as the 80486) which do not support strict sequential consistency. On such systems, reads and writes may be executed out of order, which could potentially cause data elements to be linked into data structures before they are completely initialized, even though the source code specifies all initialization operations before the link-in operations. Placing an RC_MEMSYNC() invocation between the initialization and link-in operations will ensure that the initialization is completed before the link-in starts.

RC_RDMEMSYNC

RC_RDMEMSYNC() Force a read-side memory-system synchronization point. This is required for certain styles of read-copy-update usage on CPUs such as the DEC Alpha that do not support a strong memory barrier. Although the easiest way to adapt a read-copy update algorithm to such a CPU uses RC_RDMEMSYNC(), there are more efficient methods.

RC_RDPROTECT

RC_RDPROTECT() Begin a section of code that manipulates data structures protected by the read-copy update mechanism. The section is terminated by a RC_RDUNPROTECT() invocation. RC_RDPROTECTs may be nested. It is illegal to block while under the influence of RC_RDPROTECT.

In kernels (such as PTX) that do not support kernel preemption, RC_RDPROTECT() serves documentation and debugging purposes only.

RC_RDUNPROTECT

RC_RDUNPROTECT() End a section of code that manipulates data protected by the read-copy mechanism.
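For example, a lock-free search of a list protected by this discipline would be bracketed as follows. This is a sketch only; foo_t is the hypothetical element type (with next and addr fields) used in the other examples in this document:

    /* Reader: holds no locks and must not block between the brackets. */
    foo_t *
    foo_lookup(foo_t *head, int addr)
    {
        foo_t *p;

        RC_RDPROTECT();
        for (p = head; p != NULL; p = p->next) {
            if (p->addr == addr) {
                break;
            }
        }
        RC_RDUNPROTECT();
        return (p);
    }

As discussed earlier, the caller must be prepared for the returned element to contain stale data.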

CPUS WITH WEAK MEMORY BARRIERS

Although read-copy update has been designed to handle weak memory consistency models, it does assume that a strong memory-barrier operation is available. Such an operation is not available on certain CPUs including DEC's Alpha. Lack of a strong memory-barrier operation causes the following read-copy update code fragments to fail:

    /* Insert new after pred.  Caller has initialized *new. */
    void
    insert(foo_t *pred, foo_t *new)
    {
        P_GATE(mymutex);
        new->next = pred->next;
        RC_MEMSYNC();
        pred->next = new;
        V_GATE(mymutex);
    }

    /* Search. */
    foo_t *
    search(foo_t *head, int addr)
    {
        foo_t *p = head;

        while (p != NULL) {
            if (p->addr == addr) {
                return (p);
            }
            p = p->next;
        }
        return (NULL);
    }

The problem is that the MB and WMB instructions do not force ordering of invalidation operations. Instead, they simply:
  1. Force all invalidations received by the current CPU from other CPUs and DMA devices to be processed.

  2. Force all current write buffers from the current CPU to be sent out to the "coherency point" before any subsequent writes from that CPU. MB and WMB do not wait for completion of the invalidations corresponding to the current write buffers.
This means that the search code in the previous example does not work. For example, some other CPU might see the changed value of pred->next, but see the old (garbage) value of new->next. Even though the RC_MEMSYNC() forced the new value of new->next out of the CPU before the new value of pred->next, it did not force all invalidations of the old value of new->next to complete before all invalidations of the old value of pred->next. Therefore, if new->next took a long time to invalidate, some other CPU might see the garbage value for it after the new value for pred->next. This could lead to data corruption, a panic, or both.

The following are some ways to fix this code:

  1. Come up with some sort of Alpha PAL procedure that has the desired effect of waiting for all outstanding invalidations to complete. My contacts who know about the Alpha tell me that this is not possible, that the bus protocol does not let the CPU in on this secret.
  2. Insert RC_RDMEMSYNC() operations into the search() function as follows:

         /* Search. */
         foo_t *
         slow_search(foo_t *head, int addr)
         {
             foo_t *p = head;

             while (p != NULL) {
                 RC_RDMEMSYNC();
                 if (p->addr == addr) {
                     return (p);
                 }
                 p = p->next;
             }
             return (NULL);
         }

     This code is functionally correct, but can perform very poorly on Alpha CPUs, since it forces gratuitous memory-barrier instructions on the searching CPU. If the linked list were very long, it might be cheaper to use explicit locking (although in some existing code, use of explicit locking would cause deadlocks).

    A more desirable approach would place little or no overhead on the search code.

  3. Replace the RC_MEMSYNC() in the insert() function with a primitive that does a "global MB shootdown", that is, sends each CPU a high-priority interrupt, then spins until each CPU has executed at least one MB instruction. The code for the insert() function would then look something like the following:

         /* Insert new after pred.  Caller has initialized *new. */
         void
         insertMB(foo_t *pred, foo_t *new)
         {
             P_GATE(mymutex);
             new->next = pred->next;
             ShootdownMB();
             pred->next = new;
             V_GATE(mymutex);
         }

     This approach allows the search() function to be used unchanged, thereby retaining the full speed of the PTX/x86 implementation on the search side.

  4. The need for a special ShootdownMB() primitive can be dispensed with by observing that (given the current PTX implementation) a quiescent period also implies a global MB shootdown. Therefore, the RC_MEMSYNC() in the insert() function can be replaced with a call that blocks for the duration of a quiescent period. Such a call can easily be constructed from the read-copy update primitives:

         /* Callback handler: wake up the waiting updater. */
         /*ARGSUSED*/
         void
         rc_wait_quiescent_period_wake(rc_callback_t *rc, void *arg1, void *arg2)
         {
             v_sema((sema_t *)arg1);
         }

         /* Block until a full quiescent period has elapsed. */
         void
         rc_wait_quiescent_period()
         {
             rc_callback_t *rcp;
             sema_t *sp;

             sp = (sema_t *)kmem_alloc(sizeof(*sp), KM_SLEEP);
             init_sema(sp, 0);
             rcp = rc_alloc_callback(rc_wait_quiescent_period_wake,
                                     (void *)sp,
                                     (void *)NULL,
                                     KM_SLEEP);
             rc_callback(rcp);
             (void)p_sema(sp, PRSWAIT);
             rc_free_callback(rcp);
             kmem_free(sp, sizeof(*sp));
         }

    The insertMB() function would then be as follows:

         /* Insert new after pred.  Caller has initialized *new. */
         void
         insertMB(foo_t *pred, foo_t *new)
         {
             P_GATE(mymutex);
             new->next = pred->next;
             rc_wait_quiescent_period();
             pred->next = new;
             V_GATE(mymutex);
         }

     One problem with this approach from a PTX viewpoint is that insertMB() cannot be called from interrupt context. This is not a severe restriction here, since PTX is unlikely ever to run on an Alpha, and Digital UNIX does almost all of its work in process/thread context. However, any OS that expects to run on an Alpha and that also does significant work in interrupt context would need to implement ShootdownMB() so that blocking is unnecessary.

    Another problem is that the blocking approach throttles the update rate. Each update is blocked for the duration of a quiescent period. Therefore, OSes such as Digital UNIX that do most of their work in the process/thread context might consider one of the following approaches.

  5. Take the approach outlined above, but use hashing or a similar technique to partition the data. Then separate locks may be used for each part of the data, and multiple updates may proceed in parallel.

     This can greatly speed up updates, but only in cases where multiple threads are updating the data structure simultaneously and where the hash function has been carefully chosen. If both search speed and simplicity are paramount, and the update code always runs in process context, this is most likely the method of choice. (A sketch of this approach appears at the end of this list.)

  6. Batch updates. The idea is to use a conservatively locked supplementary data structure to hold items waiting to be inserted into the main data structure. Items must wait in the supplementary data structure for a full quiescent period to pass (or for a global MB shootdown) before they can be linked into the main data structure. The supplementary data structure is invisible to searching processes.

     Updating processes must examine both the main and the supplementary data structures when making modifications. Conflicting updates must be resolved in a way that is not prone to indefinitely postponing the update. This can be done by maintaining three supplementary data structures (similar to the three lists kept by the read-copy update mechanism itself). Alternatively, a single supplementary list could be maintained, with generation numbers used to determine when a given item could move from the supplementary to the main list.

     One downside of this approach is that it becomes much more difficult for a caller of insert() to predict when a change will be visible to callers of search(). In most cases, this will not be important, but some algorithms may require a callback or a wakeup mechanism. In contrast, the PTX implementation is able to guarantee that any search starting after the completion of an update() will see that update.

    Another downside is that some rather complex machinery is required to manipulate the supplementary lists. Although it may be possible to abstract much of this machinery out into common primitives, the great variety of searches performed with read-copy update make it unlikely that all of it could be pulled out.

  7. One final approach (due to Wayne Cardoza) is to impose a small amount of overhead onto the search() function. The idea is to define an invalid value for each field touched by the search() function. The update() function can then maintain a pool of invalidated foo_t objects that can safely be inserted into the list accessed by search().

    The search() function would check the values as it advances to each item. If one or more is invalid, it refetches the pointer and tries again. The search() function is not required to check any value that it is not going to use. New items must wait for a full quiescent period (or, alternatively, for a global MB shootdown) after being initialized before they can be added to the pool. Using this approach, the search() and update() functions would appear as follows:

         /*
          * Insert an element with the given address after pred.  The
          * new element comes from the pool of pre-invalidated foo_t's.
          */
         void
         insert_inv(foo_t *pred, int addr)
         {
             foo_t *newinv;

             newinv = new_foo_inv(KM_SLEEP);
             P_GATE(mymutex);
             newinv->addr = addr;        /* searchers may still see INVALID_ADDR */
             newinv->next = pred->next;  /* ...or INVALID_POINTER, and retry */
             pred->next = newinv;
             V_GATE(mymutex);
         }

         /* Search. */
         foo_t *
         search_inv(foo_t *head, int addr)
         {
             foo_t *p = head;
             foo_t *q = NULL;

             while (p != NULL) {
                 if ((p->addr == INVALID_ADDR) ||
                     (p->next == INVALID_POINTER)) {
                     if (q == NULL) {
                         p = head;
                     } else {
                         p = q->next;
                     }

                     /*
                      * Spin waiting for the memory system to get
                      * the new value out to this CPU.  The semantics
                      * of the "mb" ("memory barrier") instruction
                      * guarantee at most two passes through this loop.
                      */
                     asm("mb");
                     continue;
                 }
                 if (p->addr == addr) {
                     RC_MEMSYNC();       /* if needed. */
                     return (p);
                 }
                 q = p;
                 p = p->next;
             }
             return (NULL);
         }

         /*
          * Get a new, invalidated foo_t.  A non-toy implementation
          * would maintain per-CPU lists of these, most likely
          * leveraging kmem_struct functionality.
          */
         foo_t *
         new_foo_inv(int flags)
         {
             int i;
             foo_t *p = NULL;
             spl_t s;

             for (;;) {
                 s = p_lock(&nfi_mutex, SPLHI);
                 if (nfi_head != NULL) {
                     p = nfi_head;
                     nfi_head = p->nfi_next;    /* Can't use next! */
                     v_lock(&nfi_mutex, s);
                     return (p);
                 } else if (nfiw_head == NULL) {
                     for (i = 0; i < NFI_TARGET; i++) {
                         p = (foo_t *)kmem_alloc(sizeof *p, flags);
                         if (p == NULL) {
                             break;
                         }
                         p->next = INVALID_POINTER;
                         p->addr = INVALID_ADDR;
                         p->nfi_generation = nfi_gen;
                         p->nfi_next = nfiw_head;
                         nfiw_head = p;
                     }
                     if (nfiw_head != NULL) {
                         rc_callback(nfiw_rc);
                     }
                     if (!CANSLEEP(flags)) {
                         v_lock(&nfi_mutex, s);
                         return (NULL);
                     }
                     p_sema_v_lock(&nfi_sema, PZERO + 1, &nfi_mutex);
                 } else {
                     p_sema_v_lock(&nfi_sema, PZERO + 1, &nfi_mutex);
                 }
             }
             /*NOTREACHED*/
         }

         /*
          * Called at the completion of a quiescent period that
          * started after the above call to rc_callback().
          */
         void
         new_foo_inv_callback()
         {
             spl_t s;

             s = p_lock(&nfi_mutex, SPLHI);
             INSIST(nfiw_head != NULL, "inconsistent state!");
             nfi_head = nfiw_head;
             nfiw_head = NULL;
             vall_sema(&nfi_sema);
             v_lock(&nfi_mutex, s);
         }

     The RC_MEMSYNC() in search_inv() is only required if the caller is going to use fields in the item that are not already checked by search_inv() via the invalid-value trick. (Since the caller does not have access to the predecessor item that search_inv() tracks in the "q" pointer, it cannot make use of the invalid-value trick itself, except by making additional calls to search_inv().)

     This approach allows insertions to proceed at full speed. The rate at which insertions can be applied may be tuned by adjusting the value of NFI_TARGET. However, this approach does impose some overhead on the search_inv() function. This overhead will be negligible in the case shown here, since the cost of two extra compares is way down in the noise. However, searches that examine a large number of fields in each item could be significantly slowed.
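Returning to approach 5 above, the following is a minimal sketch of hash-partitioned updates. The bucket layout, FOO_HASH(), and the per-bucket gates are assumptions for illustration; rc_wait_quiescent_period() is the blocking primitive constructed in approach 4:

    /* Hypothetical hash-partitioned table: one update gate per bucket. */
    #define FOO_NBUCKETS    64
    #define FOO_HASH(addr)  ((addr) & (FOO_NBUCKETS - 1))

    typedef struct foo_bucket {
        gate_t  fb_gate;    /* serializes updates to this bucket (type assumed) */
        foo_t   *fb_head;   /* list searched lock-free by readers */
    } foo_bucket_t;

    foo_bucket_t foo_table[FOO_NBUCKETS];

    /*
     * Insert new into its bucket.  Caller has initialized *new.
     * Updates to different buckets wait out their quiescent periods
     * in parallel, so the throttling affects only the per-bucket
     * update rate.
     */
    void
    foo_insert(foo_t *new)
    {
        foo_bucket_t *fbp = &foo_table[FOO_HASH(new->addr)];

        P_GATE(fbp->fb_gate);
        new->next = fbp->fb_head;
        rc_wait_quiescent_period();
        fbp->fb_head = new;
        V_GATE(fbp->fb_gate);
    }

Searchers simply pass foo_table[FOO_HASH(addr)].fb_head to the unchanged search() function.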