From: Hubertus F. <fr...@wa...> - 2002-02-27 16:58:15
On Wed, Feb 27, 2002 at 07:38:34PM +0300, Joshua MacDonald wrote:
> Hi Hubertus,
>
> I have a question for you about these semaphores. I took a glance at
> <linux/ulocks.h> to try and find out how large an object the
> user-space lock is. I see that you have it written like this:
>
> typedef struct ulock_t {
>     unsigned long status;
>     unsigned long type;
>     unsigned long counters[ULOCK_STAT_COUNTERS];
>     char pad[SMP_CACHE_BYTES -
>              (ULOCK_STAT_COUNTERS+2)*sizeof(unsigned long)];
> } ulock_t ____cacheline_aligned;
>
> I would like to suggest that you offer a version of the ulock_t that
> is as small as possible, so that the user can make use of the entire
> cacheline rather than waste it on padding.
>
> The reason I'm interested is that I have written a concurrent,
> cache-optimized skip list and I did all of my testing using the
> standard Linux spinlocks. Those are 4 bytes per lock. I use one lock
> per cacheline-sized node of the data structure. If you can get your
> locks down to one or two words that would be really interesting, since
> spinlocks don't work terribly well in user space. I would really like
> to be able to use this data structure outside of the kernel, and your
> locks might just solve my problem, but only if they are small enough.
>
> Do you see my point? You can find my latest skiplist code at:

I have seen the light. Seriously, I initially had it as you said, but
Linus strongly recommended cacheline-sized objects to avoid false sharing
on anything other than the lock word. However, there should be no problem
whatsoever bringing it back down to two words:

> typedef struct ulock_t {
>     unsigned long status;
>     unsigned long qcount;   /* this will change instead of type */
> }

The counters are only there for lock statistics, and the padding is only
there to fill the cacheline (it might be obsolete anyway). What can be
done is to make the ____cacheline_aligned a configuration option; nothing
in the system relies on it. I'll see how I'll do this.

Also, this can be brought down to one word if we don't need demand-based
kernel object allocation and/or the multiple-queue requirement requested
by Ben LaHaise. Rusty, how would your pinning-based approach be affected
by this? Comments?

-- Hubertus

> http://sourceforge.net/projects/skiplist
>
> I have put test results up on that site as well, but those tests were
> made using spinlocks at user level! In other words, I don't really
> believe my results are meaningful.
>
> (And let me warn you that there's a bug and I haven't uploaded the
> latest version yet...)
>
> -josh
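A minimal sketch of what making the alignment a configuration option could
look like (the CONFIG_ULOCK_CACHEALIGN name and the user-space fallback for
SMP_CACHE_BYTES are illustrative assumptions, not part of the posted patch):

    #ifndef SMP_CACHE_BYTES
    #define SMP_CACHE_BYTES 64              /* user-space fallback; kernel builds define this */
    #endif

    #ifdef CONFIG_ULOCK_CACHEALIGN          /* hypothetical config option */
    #define ULOCK_ALIGN __attribute__((aligned(SMP_CACHE_BYTES)))
    #else
    #define ULOCK_ALIGN                     /* bare two-word lock; caller decides about padding */
    #endif

    typedef struct ulock_s {
        unsigned long status;               /* lock word, updated atomically in user space */
        unsigned long qcount;               /* number of wait queues, replaces 'type' */
    } ULOCK_ALIGN ulock_t;

With the option off, the structure is exactly the two words above, and a
data structure like the skip list can pack a lock next to its node data
inside a single cacheline.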
From: Hubertus F. <fr...@wa...> - 2002-03-01 02:01:20
On Thu, Feb 28, 2002 at 04:24:22PM -0800, Richard Henderson wrote:
> On Wed, Feb 27, 2002 at 10:53:11AM -0500, Hubertus Franke wrote:
> > As stated above, I allocate a kernel object <kulock_t> on demand and
> > hash it. This way I don't have to pin any user address. What does
> > everybody think about the merit of this approach versus the pinning
> > approach?
> [...]
> > In your case, can the lock be allocated at different
> > virtual addresses in the various address spaces?
>
> I think this is a relatively important feature. It may not be
> possible to use the same virtual address in different processes.
>
> r~

I think so too. However, let me point out that Linus's initial
recommendation of a handle, comprised of a kernel pointer and a signature,
also has that property. Just pointing out the merits of the various
approaches.

-- Hubertus
From: Hubertus F. <fr...@wa...> - 2002-03-02 14:08:16
On Fri, Mar 01, 2002 at 09:07:18PM +0100, Martin Wirth wrote:
> Hubertus Franke wrote:
>
> > So, it works and is correct. Hope that clarifies it; if not, let me know.
> > Interestingly enough, this scheme doesn't work for spinning locks.
> > Go to lib/ulocks.c [lines 91-133]; I have integrated this dead code there
> > to show how not to do it. A similar analysis as above shows that this
> > approach wouldn't work. You need cmpxchg for these scenarios (more or
> > less).
>
> You are right, I falsely assumed the initial state to be [1,1].
>
> But as mentioned in your README, your approach is currently not able to
> manage signal handling correctly. You have to ignore all non-fatal signals
> by using ERESTARTSYS, and a SIGKILL sent to one of the processes may
> corrupt your user-kernel synchronisation.
>
> I don't think a user space semaphore implementation is acceptable until it
> provides (signal-) interruptability and timeouts. But maybe you have some
> idea how to manage this.
>
> Martin Wirth

As for the signal handling and ERESTARTSYS: the user code on the slow path
can check the return code and has two choices, (a) re-enter the kernel and
wait, or (b) correct the status word (it is still counted as a waiting
process) and return 0. I have chosen (a) and automated it. I have tried
sending signals (e.g. SIGIO) and it seems to work fine: the signal is
handled in user space and control returns to the kernel section. (b) is
feasible but requires a bit more exception work in the routine.

As for the corruption case: there is flat-out (almost) nothing you can do.
For instance, if a process grabs the lock and exits, then another process
that tries to acquire it hits a deadlock. At least on the mailing list,
most people feel that you are dead in the water anyway.

I have been thinking about the problem of corruption. The owner of the lock
could register its pid in the lock area (2nd word). That still leaves
non-atomic windows open and is a half-assed solution. But more
conceptually, if your process dies while holding a lock that protects some
shared data, there is no guarantee you can make about the data itself. In
that kind of scenario, why try to continue as if nothing happened? It seems
the general consensus on the list is that your app is toast at that point.

-- Hubertus
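A minimal user-space sketch of option (a), assuming a hypothetical
ulock_kernel_wait() wrapper for the slow-path syscall that fails with EINTR
when a signal interrupts the sleep (names are illustrative, not taken from
the posted package):

    #include <errno.h>

    struct ulock { unsigned long status; unsigned long qcount; };  /* minimal stand-in */

    /* hypothetical wrapper around the slow-path "wait" syscall; returns -1
       with errno == EINTR when a signal interrupted the sleep */
    extern int ulock_kernel_wait(struct ulock *lock);

    /* Option (a): the interrupted task is still counted as a waiter, so it
       can simply re-enter the kernel and keep waiting after the handler ran. */
    int ulock_wait_restart(struct ulock *lock)
    {
        int rc;

        do {
            rc = ulock_kernel_wait(lock);
        } while (rc < 0 && errno == EINTR);
        return rc;
    }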
From: Paul J. <pj...@en...> - 2002-03-03 22:13:51
On Sat, 2 Mar 2002, Hubertus Franke wrote:
> But more conceptually, if your process dies while holding a lock ...
> your app is toast at that point.

Without application-specific knowledge, certainly.

Is there some way one could support a hook, to enable an application to
register a handler that could clean up, for those apps that found that
worthwhile?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
From: Hubertus F. <fr...@wa...> - 2002-03-04 16:07:16
On Sun, Mar 03, 2002 at 02:13:45PM -0800, Paul Jackson wrote:
> On Sat, 2 Mar 2002, Hubertus Franke wrote:
> > But more conceptually, if your process dies while holding a lock ...
> > your app is toast at that point.
>
> Without application-specific knowledge, certainly.
>
> Is there some way one could support a hook, to enable
> an application to register a handler that could clean
> up, for those apps that found that worthwhile?

I think that's a good idea. I'll think it through and see what Rusty
thinks.

> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <pj...@sg...> 1.650.933.1373
From: Rusty R. <ru...@ru...> - 2002-03-05 07:08:52
On Sun, 3 Mar 2002 14:13:45 -0800 Paul Jackson <pj...@en...> wrote:
> On Sat, 2 Mar 2002, Hubertus Franke wrote:
> > But more conceptually, if your process dies while holding a lock ...
> > your app is toast at that point.
>
> Without application-specific knowledge, certainly.
>
> Is there some way one could support a hook, to enable
> an application to register a handler that could clean
> up, for those apps that found that worthwhile?

If you want this, use fcntl locks (see TDB). If you don't tell the kernel
what you are doing (which is the reason these locks are fast), it cannot
clean up for you.

One could conceive of a solution where a process told the kernel "here is
where I keep my lock states: if I die, clean up". Of course, there's a race
there too. IMHO, given that the lock is protecting something which is left
in an unknown state, this is something which would require serious testing
to be proven worthwhile.

Hope that helps,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
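For readers unfamiliar with the alternative Rusty mentions, here is a
minimal sketch of fcntl record locking, which the kernel does release
automatically when the holder dies. This is the standard POSIX API, not
part of the futex patches; the helper name is illustrative. TDB (the
trivial database used by Samba) builds its locking on this mechanism.

    #include <fcntl.h>
    #include <unistd.h>

    /* Take an exclusive record lock on the first byte of an already-open
       file. If the owning process exits or is killed, the kernel drops the
       lock for it -- the cleanup property asked about above, paid for with
       a syscall on every acquire and release. */
    int lock_first_byte(int fd)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 1,
        };

        return fcntl(fd, F_SETLKW, &fl);    /* blocks until the lock is granted */
    }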
From: Hubertus F. <fr...@wa...> - 2002-03-02 14:50:11
On Wed, Feb 27, 2002 at 11:24:17AM +1100, Rusty Russell wrote:
> OK. New implementation. Compiles (PPC), but due to personal reasons,
> UNTESTED. Thanks especially to Hubertus for his prior work and feedback.
>
> 1) Turned into syscalls.
> 2) Added arch-specific sys_futex_region() to enable mutexes on a region,
>    and give a chance to detect lack of support via -ENOSYS.
> 3) Just a single atomic_t variable for mutex (thanks Paul M).
>    - This means -ENOSYS on SMP 386s, as we need cmpxchg.
>    - We can no longer do generic semaphores: mutexes only.
>    - Requires arch __update_count atomic op.

I don't see that; you can use atomic_inc on 386s...

> 4) Current valid args to sys_futex are -1 and +1: we can implement
>    other lock types by using other values later (eg. rw locks).

(see below)

> Size of futex.c dropped from 244 to 161 lines, so it's clearly 40% better!
>
> Get your complaints in now...

(plenty below :-)

> Rusty.
> --
> Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Rusty, as indicated I think there is quite some merit to this solution, and
it certainly cuts down on some of the complexity that is inherent in what I
submitted. First, what I like is that by pinning the page down you avoid
the extra indirection that I went through with the kulock_ref_t objects.
There are, however, some limitations to your approach.

(a) The hashing is too simple. As a consequence there are two limitations:
    - The total number of locks that can be waited for is
      (1 << USEM_HASHBITS); that is not nearly enough (as you pointed out).
    - This has to be bumped up to roughly the number of tasks in the
      system. Since each wait_queue_head_t element is 3 words and
      MAX_PID = 32K, this ends up at about 400K of memory. That would be
      the worst-case scenario, namely that every task in the system is
      waiting on a different lock (hey, shit happens).

More serious, though, is the following.

(b) There is no provision for what to do when there is a hash collision.
    This seems so rudimentary that I assume I am missing something; in case
    not: if <page1,offset1> and <page2,offset2> both result in the same
    hash value, then they will both be waiting on the same wait queue,
    which is incorrect. A solution would be to introduce hash chaining, as
    I have done through my access vector cache, and allocate the waitqueue
    objects from the cachep object and link them through it (sketched
    below). An alternative is to provide a sufficiently large table as
    described in (a), on collision go for a secondary hash function, and if
    that fails search for an empty bucket. In the current situation, I
    think the second solution might be more appropriate.

(c) Your implementation could easily be expanded to include Ben's
    suggestion of multiple queues (which I already implement in my
    submission) by including the queue index in the hashing mechanism.
    Since you are rewriting the wait routine, the problem can be
    circumvented if we can settle for mutexes, semaphores, and rwlocks
    only. Ben, what's your take? In rwlocks you signal continuation of the
    next class of holders: you wake up one writer or the next set of
    readers, or, based on some other parameter, if the first candidate you
    find is a reader you walk the entire list and wake up all readers.
    Lots of stuff can be done here. What you then need is to expand your
    wait structure to indicate why you are sleeping: waiting for read or
    waiting for write. What do you think?

(d) If (b) is magically solved, then I have no take yet regarding cleanup.
    If not, then given (b), your cleanup will become expensive and you will
    end up with a solution similar to mine, namely a callback on vm-area
    destruction and some reference counting.

(e) In your solution you use cmpxchg throughout, which keeps you from
    utilizing the atomic_inc/dec functions on i386. I have based my mutex
    implementation in user space on atomic_inc; spinning versions require
    cmpxchg, though. It might be useful to differentiate these cases and
    provide at least non-spinning versions for i386.

(f) Why get away from semaphores and stick only to mutexes? Isn't there
    some value in providing counting semaphores? Effectively, mutexes are a
    subclass of semaphores with a max count of 1.

(g) I don't understand the business of rechecking in the kernel. In your
    case it comes cheap, as you have pinned the page already; in general
    that's true, since the page was just touched, so the pinning itself is
    reasonably cheap. Nevertheless, I am concerned that you now need
    intrinsic knowledge of how the lock word is used in user space. For
    instance, I use it in different fashions depending on whether it is a
    semaphore or an rwlock. Why can't we keep that separate from the kernel
    function of waiting on an event that is signaled along with the lock
    word? The content of the lock word should not come into play in the
    kernel. If you want that, you use spinning locks.

(h) How do you deal with signals, e.g. SIGIO or something like it?

Overall, pinning down the page (assuming that not a lot of tasks are
waiting on a large number of queues at the same time) is acceptable, and it
cuts out one additional level of indirection that is present in my
submission.

-- Hubertus Franke (fr...@wa...)
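A rough sketch of the chaining idea in (b), with hypothetical names
(futex_waiter, futex_bucket) and per-bucket locking omitted for brevity;
this is not code from either posted patch. The point is that each sleeper
records which <page,offset> it waits on, so colliding locks can share a
bucket without sharing wake-ups:

    #include <linux/list.h>
    #include <linux/wait.h>
    #include <linux/mm.h>

    #define FUTEX_HASHBITS 8

    struct futex_waiter {                   /* one entry per sleeping task */
        struct list_head list;
        struct page *page;                  /* pinned page holding the lock word */
        unsigned int offset;                /* offset of the lock word in that page */
        wait_queue_head_t wq;
    };

    static struct list_head futex_bucket[1 << FUTEX_HASHBITS];

    static inline unsigned int futex_hash(struct page *page, unsigned int offset)
    {
        return (((unsigned long)page + offset) >> 2) & ((1 << FUTEX_HASHBITS) - 1);
    }

    /* Wake only the first waiter whose <page,offset> matches; other locks
       that happen to hash into the same bucket are not disturbed. */
    static void futex_wake_one(struct page *page, unsigned int offset)
    {
        struct list_head *pos;

        list_for_each(pos, &futex_bucket[futex_hash(page, offset)]) {
            struct futex_waiter *w = list_entry(pos, struct futex_waiter, list);

            if (w->page == page && w->offset == offset) {
                wake_up(&w->wq);
                break;
            }
        }
    }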
From: Rusty R. <ru...@ru...> - 2002-03-04 03:58:52
On Sat, 2 Mar 2002 09:50:31 -0500 Hubertus Franke <fr...@wa...> wrote:
> On Wed, Feb 27, 2002 at 11:24:17AM +1100, Rusty Russell wrote:
> > - We can no longer do generic semaphores: mutexes only.
> > - Requires arch __update_count atomic op.
>
> I don't see that; you can use atomic_inc on 386s...

__update_count needed to be able to do the decrement from 0 or from the
count, whichever was greater. I have eliminated this in the new patch
(patch III).

> (a) The hashing is too simple. As a consequence there are two limitations:
>     - The total number of locks that can be waited for is
>       (1 << USEM_HASHBITS); that is not nearly enough (as you pointed out).
>     - This has to be bumped up to roughly the number of tasks in the
>       system.

OK. We dealt with hash collisions by sharing waitqueues. This works, but
means we have to do wake-all. My new patch uses its own queues: we still
share queues, but we only wake the first one waiting on our counter.

> (c) Your implementation could easily be expanded to include Ben's
>     suggestion of multiple queues (which I already implement in my
>     submission) by including the queue index in the hashing mechanism.

Hmmm... it should be pretty easy to extend: the queues can be shared.

> (f) Why get away from semaphores and stick only to mutexes? Isn't there
>     some value in providing counting semaphores? Effectively, mutexes are
>     a subclass of semaphores with a max count of 1.

Yes, but counting semaphores need two variables, or cmpxchg. Mutexes need
neither (proof by implementation 8). While general counting semaphores may
be useful, the fact that they are not implemented in the kernel suggests
they are not worth paying extra for.

> (g) I don't understand the business of rechecking in the kernel. In your
>     case it comes cheap, as you have pinned the page already; in general
>     that's true, since the page was just touched, so the pinning itself is
>     reasonably cheap. Nevertheless, I am concerned that you now need
>     intrinsic knowledge of how the lock word is used in user space. For
>     instance, I use it in different fashions depending on whether it is a
>     semaphore or an rwlock.

This is a definite advantage: my old fd-based code never looked at the
userspace counter; it had a kernel semaphore to sleep on for each userspace
lock. OTOH, this involves kernel allocation, with the complexity that
brings.

> (h) How do you deal with signals, e.g. SIGIO or something like it?

They're interruptible, so you'll get -ERESTARTSYS. Should be fine.

> Overall, pinning down the page (assuming that not a lot of tasks are
> waiting on a large number of queues at the same time) is acceptable, and
> it cuts out one additional level of indirection that is present in my
> submission.

I agree. While originally reluctant, when it's one page per sleeping
process it's rather neat.

Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
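To illustrate why a counting semaphore wants cmpxchg where a mutex can get
by with a plain atomic decrement: the semaphore's fast path must only
succeed if the count was still positive, which takes a compare-and-swap
loop. A minimal user-space sketch, using the GCC __sync builtin as a
stand-in for the cmpxchg under discussion (not code from either patch):

    /* Returns 1 if the semaphore was acquired on the fast path, 0 if the
       caller must enter the kernel and sleep. */
    static int sem_trydown(int *count)
    {
        int old;

        do {
            old = *count;
            if (old <= 0)
                return 0;               /* contended: go to the slow path */
        } while (!__sync_bool_compare_and_swap(count, old, old - 1));

        return 1;                       /* got it without a syscall */
    }

A mutex only has to distinguish "free" from "contended", so a blind
decrement plus a kernel-side fix-up suffices; a counting semaphore must
preserve the exact count, hence the loop.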
From: Hubertus F. <fr...@wa...> - 2002-03-04 16:51:49
On Mon, Mar 04, 2002 at 12:30:40AM +1100, Rusty Russell wrote:
> On Sat, 2 Mar 2002 09:50:31 -0500
> Hubertus Franke <fr...@wa...> wrote:

I am pretty much ready to support Rusty's latest patch. I think much of the
stuff I mentioned has been addressed, the hash collision elimination being
the biggest item. The page pinning is a great idea, and I see it as a great
advantage over what I tried to push.

> > On Wed, Feb 27, 2002 at 11:24:17AM +1100, Rusty Russell wrote:
> > > - We can no longer do generic semaphores: mutexes only.
> > > - Requires arch __update_count atomic op.
> >
> > I don't see that; you can use atomic_inc on 386s...
>
> __update_count needed to be able to do the decrement from 0 or from the
> count, whichever was greater. I have eliminated this in the new patch
> (patch III).
>
> > (a) The hashing is too simple. As a consequence there are two limitations:
> >     - The total number of locks that can be waited for is
> >       (1 << USEM_HASHBITS); that is not nearly enough (as you pointed out).
> >     - This has to be bumped up to roughly the number of tasks in the
> >       system.
>
> OK. We dealt with hash collisions by sharing waitqueues. This works, but
> means we have to do wake-all. My new patch uses its own queues: we still
> share queues, but we only wake the first one waiting on our counter.
>
> > (c) Your implementation could easily be expanded to include Ben's
> >     suggestion of multiple queues (which I already implement in my
> >     submission) by including the queue index in the hashing mechanism.
>
> Hmmm... it should be pretty easy to extend: the queues can be shared.

Correct. Should we bring that into the interface already, i.e. expand the
syscall with an additional parameter identifying the queue?

> > (f) Why get away from semaphores and stick only to mutexes? Isn't there
> >     some value in providing counting semaphores? Effectively, mutexes
> >     are a subclass of semaphores with a max count of 1.
>
> Yes, but counting semaphores need two variables, or cmpxchg. Mutexes need
> neither (proof by implementation 8). While general counting semaphores may
> be useful, the fact that they are not implemented in the kernel suggests
> they are not worth paying extra for.

Not exactly: they need one atomic word in user space and a semaphore in the
kernel, which, as you stated, uses an atomic count plus sleepers.

> > (g) I don't understand the business of rechecking in the kernel. In your
> >     case it comes cheap, as you have pinned the page already; in general
> >     that's true, since the page was just touched, so the pinning itself
> >     is reasonably cheap. Nevertheless, I am concerned that you now need
> >     intrinsic knowledge of how the lock word is used in user space. For
> >     instance, I use it in different fashions depending on whether it is
> >     a semaphore or an rwlock.
>
> This is a definite advantage: my old fd-based code never looked at the
> userspace counter; it had a kernel semaphore to sleep on for each
> userspace lock. OTOH, this involves kernel allocation, with the complexity
> that brings.

??? Kernel allocation is only required under contention. If you take a look
at how I use the atomic word for semaphores and for rwlock_t, it's
different. If you recheck in the kernel, you need to know how this word is
used; in my approach I simply allocate two queues on demand.

I am OK with only supplying semaphores and rwlocks if there is a general
consensus about it. Nevertheless, I think the multi-queue approach is
somewhat more elegant and expandable.

> > (h) How do you deal with signals, e.g. SIGIO or something like it?
>
> They're interruptible, so you'll get -ERESTARTSYS. Should be fine.

That's what I did too, but some folks pointed out that they might want to
interrupt a waiting task, so that it can get back into user space and fail
with EAGAIN or so, rather than automatically restart the syscall.

> > Overall, pinning down the page (assuming that not a lot of tasks are
> > waiting on a large number of queues at the same time) is acceptable, and
> > it cuts out one additional level of indirection that is present in my
> > submission.
>
> I agree. While originally reluctant, when it's one page per sleeping
> process it's rather neat.

Yipp, most applications that take advantage of this will allocate their
locks in a shared region anyway, i.e. there will be more than one lock per
page.

> Cheers!
> Rusty.
> --

As I said above, I am in favor of the general infrastructure you have in
place now. Can we hash out the rwlock issues again? They can still be
folded into a single (logical) queue, and you wake up either the next
writer or the next set of readers, or all readers... We need a means to
bring that through the API; I'll think about it (one possible wake-up
policy is sketched below). Great job, Rusty, pulling all these issues
together into a comprehensive patch. The few remaining issues are gravy :-)

> Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

-- Hubertus
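One possible shape for the rwlock wake-up policy mentioned above, assuming
a per-lock (logical) FIFO queue of waiters that records whether each
sleeper wants read or write access; the names are illustrative and locking
of the queue is omitted:

    #include <linux/list.h>
    #include <linux/wait.h>

    enum rw_want { WANT_READ, WANT_WRITE };

    struct rw_waiter {
        struct list_head list;              /* FIFO order of arrival */
        enum rw_want want;
        wait_queue_head_t wq;
    };

    /* On release: if a writer is at the head, wake just that writer;
       otherwise wake the run of readers queued ahead of the first writer. */
    static void rwlock_wake_next(struct list_head *queue)
    {
        struct list_head *pos;

        list_for_each(pos, queue) {
            struct rw_waiter *w = list_entry(pos, struct rw_waiter, list);

            if (w->want == WANT_WRITE) {
                if (pos == queue->next)
                    wake_up(&w->wq);        /* writer at the head gets the lock */
                break;                      /* readers behind a writer keep waiting */
            }
            wake_up(&w->wq);                /* reader ahead of any writer */
        }
    }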
From: Rusty R. <ru...@ru...> - 2002-03-05 04:38:12
On Mon, 4 Mar 2002 11:51:43 -0500 Hubertus Franke <fr...@wa...> wrote:
> > This is a definite advantage: my old fd-based code never looked at the
> > userspace counter; it had a kernel semaphore to sleep on for each
> > userspace lock. OTOH, this involves kernel allocation, with the
> > complexity that brings.
>
> ??? Kernel allocation is only required under contention. If you take a
> look at how I use the atomic word for semaphores and for rwlock_t, it's
> different. If you recheck in the kernel, you need to know how this word
> is used; in my approach I simply allocate two queues on demand.

Yes, but no allocation is even better than on-demand allocation, if you can
get away with it.

> I am OK with only supplying semaphores and rwlocks if there is a general
> consensus about it. Nevertheless, I think the multi-queue approach is
> somewhat more elegant and expandable.

Well, it's pretty easy to allow other values than -1/1 later as required.
I don't think the switch() overhead will be measurable for the first dozen
lock types 8).

> > > (h) How do you deal with signals, e.g. SIGIO or something like it?
> >
> > They're interruptible, so you'll get -ERESTARTSYS. Should be fine.
>
> That's what I did too, but some folks pointed out that they might want to
> interrupt a waiting task, so that it can get back into user space and fail
> with EAGAIN or so, rather than automatically restart the syscall.

Sorry, my bad. I use -EINTR, so they don't get restarted, for exactly that
reason (eg. implementing timeouts on mutexes).

Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
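A user-space sketch of the timeout idea Rusty alludes to, relying on the
wait syscall returning -EINTR instead of restarting; futex_down_slow() is
an assumed wrapper name, not an interface from the patch:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>
    #include <errno.h>

    /* assumed wrapper around the "wait" side of the lock syscall; returns
       -1 with errno == EINTR when a signal interrupts the sleep */
    extern int futex_down_slow(int *uaddr);

    static void timeout_handler(int sig) { (void)sig; }    /* only interrupts the syscall */

    /* Returns 0 if the lock was acquired, -1 with errno == EINTR on timeout. */
    int futex_down_timeout(int *uaddr, unsigned int seconds)
    {
        struct sigaction sa;
        int rc;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = timeout_handler;    /* note: no SA_RESTART */
        sigaction(SIGALRM, &sa, NULL);

        alarm(seconds);
        rc = futex_down_slow(uaddr);
        alarm(0);
        return rc;
    }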
From: Hubertus F. <fr...@wa...> - 2002-02-27 22:08:39
To follow up on this: I have posted the latest version under

    http://prdownloads.sourceforge.net/lse/ulocks-2.4.17.tar.bz2

- I followed the suggestion below to minimize the ulock_t data structure.
  To do so, go to ./make.setup and comment out the -DLOCK_STATISTICS flag;
  this brings ulock_t down to 2 words (otherwise 28 bytes).
- The ulockflex program will allocate all locks on cacheline boundaries
  regardless of their size (sketched below).
- I fixed a problem in ulockflex that was introduced in the last version.
- I now have a number of queues instead of an explicit lock type, so the
  kernel only knows about the number of queues to maintain.

-- Hubertus Franke

On Wed, Feb 27, 2002 at 08:33:07PM +0300, Joshua MacDonald wrote:
> On Wed, Feb 27, 2002 at 11:58:42AM -0500, Hubertus Franke wrote:
> > On Wed, Feb 27, 2002 at 07:38:34PM +0300, Joshua MacDonald wrote:
> > > Hi Hubertus,
> > >
> > > I have a question for you about these semaphores. I took a glance at
> > > <linux/ulocks.h> to try and find out how large an object the
> > > user-space lock is. I see that you have it written like this:
> > >
> > > typedef struct ulock_t {
> > >     unsigned long status;
> > >     unsigned long type;
> > >     unsigned long counters[ULOCK_STAT_COUNTERS];
> > >     char pad[SMP_CACHE_BYTES -
> > >              (ULOCK_STAT_COUNTERS+2)*sizeof(unsigned long)];
> > > } ulock_t ____cacheline_aligned;
> > >
> > > I would like to suggest that you offer a version of the ulock_t that
> > > is as small as possible, so that the user can make use of the entire
> > > cacheline rather than waste it on padding.
> > >
> > > [...]
> > >
> > > Do you see my point? You can find my latest skiplist code at:
> >
> > I have seen the light. Seriously, I initially had it as you said, but
> > Linus strongly recommended cacheline-sized objects to avoid false
> > sharing on anything other than the lock word. However, there should be
> > no problem whatsoever bringing that back down to 2 words.
>
> Sounds good. There is nothing to prevent the declaration of a ulock_t
> variable from using __cacheline_aligned, so I think Linus is off on this
> one.
>
> I'd like to test this out sometime. The SLPC code uses a lot of CPP
> trickery to configure itself with various different read/write or
> exclusive locking packages, so it's fairly easy to port to a new locking
> primitive.
>
> -josh
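A minimal sketch of allocating locks on cacheline boundaries regardless of
their size, as ulockflex is described as doing; the 64-byte constant and
the helper name are assumptions (the real program presumably uses the
build machine's SMP_CACHE_BYTES):

    #include <stdlib.h>

    #define CACHE_LINE_SIZE 64      /* assumed line size for the target machine */

    /* Give each of n locks its own cache line, independent of sizeof(ulock_t),
       so that contention on one lock never causes false sharing on another. */
    void *alloc_lock_array(size_t n)
    {
        void *p = NULL;

        if (posix_memalign(&p, CACHE_LINE_SIZE, n * CACHE_LINE_SIZE) != 0)
            return NULL;
        return p;                   /* lock i lives at (char *)p + i * CACHE_LINE_SIZE */
    }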