From: Andrew M. <ak...@zi...> - 2001-11-16 20:26:25
Matthew Wilcox is the new owner of fs/locks.c. He'll be interested.

About six months back we had a _big_ problem with Apache throughput. On 8-way x86, Apache throughput almost halved because someone removed the BKL from a path in the file locking code. Apache uses flock()-based synchronisation, and removing the BKL had turned a short spin into a semaphore schedule(), which hurt big-time. I did a bunch of maintenance work against fs/locks.c at the time to set things back right. IIRC I moved the BKL to a lower level in the flock() codepath.

At one point I did have a super-scalable implementation which used a new per-inode spinlock for the exclusion. It worked and was just fine. But Linus and I agreed that it was a larger-than-necessary change, that sys_flock() contention was not a likely scenario, and that sticking with the BKL approach was a safer path.

FWIW, the super-scalable flocking patch against 2.4.0-test10 is at
http://www.zip.com.au/~akpm/threaded-locks-sem.patch

Rick Lindsley wrote:
>
> Thanks for all of your responses. Yes, -fsdevel is probably the right
> place to finish this discussion, but I wanted to start it here in
> lse because it's actually SMP related.
>
> A file-lock-intensive benchmark brought to my attention that the BKL is
> currently used to guard i_flock. Without arguing about the merits of
> this particular benchmark, it seems to me that simply from inspection,
> replacing the BKL here would be a good thing. A per-inode spinlock
> would give better granularity than a global one, which causes
> blockage across the system on every lock attempt by any process.
> I've given some thought to how to improve on that, and come up with:
>
>   a) reducing use of kernel_flag elsewhere
>   b) replacing kernel_flag with another global spinlock
>   c) replacing kernel_flag with a global read/write lock
>   d) replacing kernel_flag with a new lock in struct inode
>   e) revisiting the algorithm, and all locking associated therein
>
> a) is far more work than necessary to fix this problem. b) through d)
> are all possibilities, but since this hasn't shown up before, I'd
> conclude that all the contention this benchmark is seeing really is
> centered right around i_flock. My hunch is that the best solution is
> d), but it's possible that c) could actually provide "enough"
> improvement to allow d) to be postponed. Unfortunately, c) may
> introduce more trouble than it's worth, because in this particular
> example I suspect that i_flock is NOT read-mostly, write-occasionally.
> Upgrading from a read lock to a write lock can't be done atomically,
> so what you gain in performance you may lose in "supportability" as
> the code grows in complexity.
>
> Both b) and c) cause serialization across every cpu in the system by
> using a global lock, but d) would cause serialization *per inode* and
> thus almost guarantee less contention. Assuming, of course, that mucking
> with the inode structure doesn't cause too many other ripples, which is
> why I asked the question. Doing e) almost certainly puts it into the
> 2.5 timeframe, but not 100% certainly, I suppose. Before I dig too deep
> into some test patches I thought I'd test the waters among the folks
> here in LSE.
>
> It's good to hear that the inode is being redesigned for 2.5; a
> spinlock (or two) which guards elements of the inode structure would be
> very helpful in the new design. If there were one to usurp here I'd
> include that in my options, but all we have is semaphores right now.
>
> Rick
>
> _______________________________________________
> Lse-tech mailing list
> Lse...@li...
> https://lists.sourceforge.net/lists/listinfo/lse-tech