Given that I have heard only one request, and have not heard why the
priority locking was needed, I would be inclined to leave the burden on the
user. If there are lots of cases where this is needed, then perhaps a
primitive is needed. However, given that this -is- a passenger-side
airbag, it seems like the real solution would be to redesign to decrease
contention, then stop using numalock_t.
From: Hubertus Franke/Watson/IBM@...
To: Paul McKenney/Beaverton/IBM@...
Date: 08/01/2001 08:55 AM
Subject: Re: [Lse-tech] Priority access to NUMA-aware locks
This seems like a good and clean solution, while the other ones raise the
issues of side-impact airbags (now standard in many cars), to use your
analogy.
One remaining question, though: does it make sense to package them up as
a single object with two lock and two release functions, or to put the
extra burden on the user so that he understands that such locks are not
cheap?
Hubertus Franke
Enterprise Linux Group (Mgr), Linux Technology Center (Member, Scalability),
OS-PIC (Chair)
(w) 914-945-2003  (fax) 914-945-4425  TL: 862-2003
Paul McKenney/Beaverton/IBM@... on 08/01/2001
Sent by: lse-tech-admin@...
To: npollitt@..., hawkes@...
Subject: [Lse-tech] Priority access to NUMA-aware locks
At OLS, Nick passed on a question from John as to whether the NUMA-aware
locks (ported from ptx by Jack Vogel and Swami Sivasubramanian) could
support high-priority access, so that a high-priority request would get the
lock next, rather than having to wait for its turn.
It turns out that there are a number of ways to tweak the algorithm to
support this, but the ones that I have thought of so far either greatly
increase the size of the lock or result in spinning on the lock word
itself. (If you really want to see a summary of these approaches, skip to
the end of this note.)
A more straightforward approach is to pair a numalock_t with a spinlock_t:
numalock_t my_numalock __cacheline_aligned = NUMA_LOCK_UNLOCKED;
spinlock_t my_spinlock __cacheline_aligned = SPIN_LOCK_UNLOCKED;
	/* Low-priority lock acquisition and release: */
	numa_lock(&my_numalock);
	spin_lock(&my_spinlock);
	/* critical section */
	spin_unlock(&my_spinlock);
	numa_unlock(&my_numalock);

	/* High-priority lock acquisition and release: */
	spin_lock(&my_spinlock);
	/* critical section */
	spin_unlock(&my_spinlock);
If all acquisitions are low-priority acquisitions, access to the underlying
spinlock_t is ordered by the numa_lock() primitive. There is at most one
CPU spinning on the spinlock_t, so starvation is prevented and performance
is increased (since numa_lock() holds the lock on a given node for a
bit before allowing it to move to the next node). There is a little extra
overhead due to acquiring the spinlock_t, but this lock's cache line will
normally already be in the right place due to the action of the numalock_t.
If a few of the acquisitions are high-priority acquisitions, the
high-priority task will spin directly on the spinlock_t without having to
wait for all the tasks queued on the numalock_t. Because the low-priority
code path releases the spinlock_t first, this high-priority task will be
the next to get the lock.
If most of the acquisitions are high-priority acquisitions, you should
work to reduce lock contention instead of using numalock_t. Of course, in
all cases, reducing lock contention is the right thing to do -- numalock_t
should be thought of as a passenger-side airbag that prevents undue injury
in case of a collision. The right thing to do is of course to avoid the
collision in the first place by redesigning to reduce lock contention.
PS. For masochists who want to know the other approaches...
1. Reserve some of the numalock_t bits to indicate the presence of
high-priority tasks. You need enough bits to count the number of CPUs, so
that a 32-bit system would need to reserve five bits, limiting the system
to 27 CPUs. But there needs to be some extra state to indicate which of
several high-priority requesters should get the lock next. The simplest
approach is to use another variable, which, after simplifying, degenerates
into the preferred solution given above.
2. As in #1, but reserve half the bits for high-priority accesses, so
that a CPU sets one of two bits to indicate high vs. low priority. But
this limits a 32-bit system to 16 CPUs and a 64-bit system to 32 CPUs, and
also complicates the algorithm. The preferred solution seems, well,
preferable.
3. Expand the numalock_t to include a variable for high-priority tasks to
spin on in addition to the existing bitmask. But this results in cacheline
bouncing (AKA cache thrashing) if there is more than one high-priority
task. This can be solved by placing the bitmask and the variable in
separate cache lines, which degenerates into the preferred solution given
above.