From: john s. <joh...@us...> - 2002-01-29 02:07:19
|
Here again is my implementation of MCS locks. What is new this time is that I've added a nodeless extension to the lock, which can be used as a drop-in replacement for spinlocks. I did this by serializing access to a single bit lock with an MCS lock, keeping node allocation on the stack of nmcs_lock(). This causes some additional overhead, but simplifies the usage. Various performance numbers follow.

Zero-contention overhead, calculated from a userspace test (tight lock/unlock loop):

    10000000 spin lock/unlocks: 2406865us
    10000000 mcs lock/unlocks:  3071824us
    10000000 nmcs lock/unlocks: 4244046us
    mcs performance hit:  127.627599%
    nmcs performance hit: 176.330870%

High-contention performance (5sec w/ 8 threads):

    spin lock: 593051 acquisitions
    mcs lock:  5500393 acquisitions
    nmcs lock: 3642682 acquisitions

Also attached is a patch which globally replaces all spinlocks in the kernel with nodeless MCS locks. Using hackbench as a quick and dirty benchmark, I got the following numbers on an ~8-way (4 cpu + HT) box.

2.4.17 vanilla:
    hackbench 10:  Time: 3.693    Time: 3.050    Time: 3.206
    hackbench 25:  Time: 15.831   Time: 20.074   Time: 22.694
    hackbench 50:  Time: 96.058   Time: 84.713   Time: 94.403
    hackbench 75:  Time: 203.603  Time: 184.666  Time: 218.726

2.4.17+nmcs+spinlocks-replaced:
    hackbench 10:  Time: 3.300    Time: 4.280    Time: 4.305
    hackbench 25:  Time: 16.185   Time: 18.584   Time: 15.028
    hackbench 50:  Time: 72.532   Time: 71.361   Time: 75.763
    hackbench 75:  Time: 153.873  Time: 165.528  Time: 148.546

The patches have been tested on i386 only. I'll start playing w/ ia64 soon, and a patch for proper write ordering will most likely follow.

thanks
-john
|
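[The nodeless scheme described above can be sketched roughly as follows. This is a minimal illustration based only on the description in the mail, not the patch itself: identifiers, struct layout, and the use of GCC __sync builtins in place of the kernel's xchg/cmpxchg macros are all assumptions. The MCS queue hands out FIFO access to a single bit lock, and the MCS node lives on nmcs_lock()'s own stack, which is what buys the drop-in interface at the cost of the extra overhead measured above.]

```c
#include <stddef.h>

/* Sketch of the nodeless-MCS idea described above (assumed names and
 * layout, not the patch's own).  __sync builtins stand in for the
 * kernel's xchg/cmpxchg. */

typedef struct mcsnode {
    struct mcsnode *next;
    volatile int flag;
} mcsnode_t;

typedef struct {
    mcsnode_t *tail;        /* MCS queue tail (the mcslock_t) */
    volatile int flag;      /* the single bit lock */
} nmcslock_t;

static void mcs_lock(mcsnode_t **tail, mcsnode_t *node)
{
    mcsnode_t *before;

    node->next = NULL;
    before = __sync_lock_test_and_set(tail, node);  /* atomic xchg */
    if (before != NULL) {               /* queue was non-empty: wait */
        node->flag = 1;
        before->next = node;
        while (node->flag)              /* spin on our own node */
            ;
    }
}

static void mcs_unlock(mcsnode_t **tail, mcsnode_t *node)
{
    if (node->next == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        if (__sync_bool_compare_and_swap(tail, node, NULL))
            return;
        while (node->next == NULL)      /* successor is mid-enqueue */
            ;
    }
    node->next->flag = 0;               /* hand the lock to successor */
}

/* Nodeless wrappers: the MCS queue serializes access to the bit lock,
 * and the node lives on this function's stack, so callers need not
 * supply one. */
static void nmcs_lock(nmcslock_t *lock)
{
    mcsnode_t node;                     /* node allocated on our stack */

    mcs_lock(&lock->tail, &node);
    while (lock->flag)                  /* wait out the current holder */
        ;
    lock->flag = 1;                     /* we now own the bit lock */
    mcs_unlock(&lock->tail, &node);     /* node may safely go out of scope */
}

static void nmcs_unlock(nmcslock_t *lock)
{
    lock->flag = 0;
}
```

[Note the trade-off visible here: acquisition order is still FIFO via the queue, but waiters in nmcs_lock() all spin on the shared lock->flag rather than on a per-CPU node, which is the point Kiran raises later in the thread.]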
From: Andi K. <ak...@su...> - 2002-01-29 07:33:19
|
> > The patches have been tested on i386 only. I'll start playing w/ ia64
> > soon, and a patch for proper write ordering will most likely follow.

Interesting numbers. Unfortunately they're slower for the uncontended case, so just s/spin_lock/mcs_lock/ is probably not a solution. Did you also check them on 2-cpu boxes?

It seems for the runqueue lock there is already a better solution in the form of one of the multiqueue schedulers, but it may be interesting to use them for one of the not-yet-cracked heavily contended locks like the lru_list_lock, pagecache_lock (when the tux patch is not used) or dcache_lock (when RCU dcache is not used), and perhaps the BKL (heavily used by e.g. ext2).

-Andi
|
From: john s. <joh...@us...> - 2002-01-29 19:00:39
|
> Interesting numbers. Unfortunately they're slower for the uncontended case,so
> just s/spin_lock/mcs_lock/ is probably not a solution.

Oh, indeed! I just did the global replacement as a test to make sure the lock wasn't broken. As you suggest below, more surgical replacement of locks that are suffering from NUMA-induced starvation or very high contention would be a much wiser usage. As for the low-contention performance, the lock wizards at IBM are working on it. Hopefully more to follow soon.

> Did you also check them on 2cpu boxes ?

No, but I can get some numbers for you if you'd like. I'd imagine they won't be too favorable (due to low contention), but that's probably useful nonetheless.

> It seems for the runqueue lock there is already a better solution
> in form of one the multiqueue schedulers, but it may be interesting
> to use them for one of the not-yet-cracked heavy contended locks
> like the lru_list_lock, pagecache_lock (when the tux patch is not used) or
> dcache_lock (when RCU dcache is not used) and perhaps BKL (heavily used by
> e.g. ext2)

Thanks for the suggestions. I'll try replacing a few of those and see how it helps. If anyone has suggestions for benchmarks or workloads they'd like to see, I'd be interested in hearing them. I'll probably give AIM a whirl w/ lockmeter and see what comes of it.

Thanks again for the feedback!
-john
|
From: Momchil V. <ve...@fa...> - 2002-01-29 08:13:36
|
>>>>> "john" == john stultz <joh...@us...> writes:

john> static inline void mcs_lock(mcslock_t* lock, mcsnode_t* instance)
john> {
john> 	mcsnode_t* before;
john> 	instance->next = NULL;
john> 	before = xchg((mcsnode_t**)lock,instance);
john> 	if (before != NULL) {
john> 		instance->flag = 1;
john> 		before->next = instance;
john> 		while(instance->flag){barrier();rep_nop();}

Why is the barrier() here needed?

john> 	}
john> }
|
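[For readers without the 2.4 headers handy: barrier() and rep_nop() in the quoted spin loop are, in i386 kernels of that era, roughly a compiler memory clobber and a `rep; nop` (PAUSE) spin-wait hint. The following is an approximation of those definitions, not the exact kernel source, and is guarded so it also compiles on non-x86.]

```c
/* Approximate 2.4-era helpers from the quoted spin loop (sketch, not
 * the exact kernel source).  barrier() stops gcc from caching memory
 * values across it; rep_nop() issues the PAUSE hint, which eases
 * spin-wait pressure on the pipeline and is a plain nop on pre-PAUSE
 * processors. */

#define barrier() __asm__ __volatile__("" : : : "memory")

static inline void rep_nop(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep; nop");   /* PAUSE instruction */
#endif
}
```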
From: john s. <joh...@us...> - 2002-01-29 19:21:42
|
> john> 		while(instance->flag){barrier();rep_nop();}
>
> Why is the barrier() here needed ?

It's not. I had seen it used in another lock I had been playing with and had figured it was necessary for non-i386 arch issues. But the fact that flag is volatile should really take care of it. I'll yank it in the next release.

Thanks,
-john
|
From: Ravikiran G T. <ki...@in...> - 2002-01-30 08:50:39
|
On Mon, Jan 28, 2002 at 06:02:04PM -0800, john stultz wrote:
> +static inline void nmcs_unlock(nmcslock_t* lock)
> +{
> +	lock->flag = 0;
> +}

Wouldn't a barrier() be required after lock->flag = 0? Or even an mb()???

How about publishing nmcs numbers for different numbers of processors (2, 4, 6, 8 or more)? Just want to see if this implementation has a near-flat characteristic similar to that of plain mcs locks :). Comparisons with varying lengths of critical sections would also be nice to look at.

While the nmcs implementation seems to guarantee FIFO ordering, spinning on a locally accessible memory location is compromised (although not entirely?)... might not be good for NUMA (compared to plain mcs locks)???

Regards,
Kiran
--
Ravikiran G Thirumalai <ki...@in...>
Linux Technology Center, IBM Software Labs, Bangalore.
|
From: john s. <joh...@us...> - 2002-01-30 19:26:54
|
On Wed, 2002-01-30 at 00:48, Ravikiran G Thirumalai wrote:
> On Mon, Jan 28, 2002 at 06:02:04PM -0800, john stultz wrote:
> > +static inline void nmcs_unlock(nmcslock_t* lock)
> > +{
> > +	lock->flag = 0;
> > +}
>
> Wouldn't a barrier() be required after lock->flag = 0 ? or even a mb()???

On i386, memory writes are ordered, so not yet. However, for other architectures memory ordering will be needed; I'm just not quite there yet.

> How about publishing nmcs numbers for difft no of processors? (2, 4, 6, 8 or
> more). Just want to see if this implementation has a near flat charecteristic
> similar to that of plain mcs locks :). Comparisons with varying lengths of
> critical sections would also be nice to look at.

I'm working on this. More to come soon.

thanks
-john
|
From: Ravikiran G T. <ki...@in...> - 2002-01-31 06:51:55
|
On Wed, Jan 30, 2002 at 11:21:40AM -0800, john stultz wrote:
> On Wed, 2002-01-30 at 00:48, Ravikiran G Thirumalai wrote:
> > On Mon, Jan 28, 2002 at 06:02:04PM -0800, john stultz wrote:
> > > +static inline void nmcs_unlock(nmcslock_t* lock)
> > > +{
> > > +	lock->flag = 0;
> > > +}
> >
> > Wouldn't a barrier() be required after lock->flag = 0 ? or even a mb()???
>
> On i386, memory writes are ordered, so not yet. However, for other
> architectures memory ordering will be needed, I'm just not quite there
> yet.

As far as my understanding goes, barrier() is there to prevent the compiler from optimising code by reordering instructions. If you notice, the barrier() implementation is arch-independent and compiler-specific. The purpose of locking would be defeated if instructions within the CS were optimised by the compiler to be executed outside the CS (or vice versa). The compile (and write memory) barrier is taken care of during nmcs_lock by the xchg macro in mcs_lock. But with nmcs_unlock (which is inline), it is safe to have a compile barrier after lock->flag = 0 (IMHO). Yes, I have noticed that the flag var is declared volatile, but I am not sure if that would prevent gcc from reordering instructions. Please correct me if I have missed something.

Yep, on 386 machines, memory reads and writes are program-ordered. But on 486 and above, they are "processor ordered": writes are in program order, but reads can pass buffered writes. (What if a read within the CS is reordered by the processor after nmcs_unlock?) Also, note that cmpxchg is not present on the 386.

This is all my understanding... pls correct me if I am wrong. All in all, IMO, the nodeless mcs algo is cool.

Regards,
Kiran
--
Ravikiran G Thirumalai <ki...@in...>
Linux Technology Center, IBM Software Labs, Bangalore.
|
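[Kiran's suggested fix amounts to something like the sketch below (a minimal illustration, with the struct layout assumed from the posted snippet): a compiler barrier adjacent to the flag store, so gcc cannot move critical-section loads and stores across the unlock even when it is inlined. Whether the barrier belongs before or after the store depends on which code motion you are guarding against; bracketing the store, as here, is the conservative choice. Note that a compiler barrier constrains only gcc, not the CPU: the read-passing-buffered-write reordering Kiran raises would need a real fence instruction on processors that allow it.]

```c
/* Sketch of nmcs_unlock with compiler barriers, per the suggestion
 * above.  Layout assumed from the posted snippet; barrier() is the
 * classic empty asm with a "memory" clobber.  This only constrains the
 * compiler -- hardware ordering would need an actual fence. */

#define barrier() __asm__ __volatile__("" : : : "memory")

typedef struct {
    volatile int flag;      /* the single bit lock */
} nmcslock_t;

static inline void nmcs_unlock(nmcslock_t *lock)
{
    barrier();              /* keep CS accesses before the release */
    lock->flag = 0;
    barrier();              /* and stop later code migrating above it */
}
```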
From: john s. <joh...@us...> - 2002-01-31 21:07:37
|
> Yes, I have noticed that flag var is declared volatile, but I am not sure
> if that would prevent gcc from reordering instructions. Please correct me
> if I have missed something.

Sounds like a reasonable enough argument. I don't really know if it being volatile is enough either, so unless someone wants to pipe up and tell it like it is, I'll toss in the barrier just to be sure.

> Yep, on 386 machines, memory reads and writes are program ordered. But on
> 486 and above, they are "processor ordered". Writes are in program order,
> but reads can pass buffered writes (What if a read within CS is reorderd by
> the processor after nmcs_unlock?). Also, note that cmpxchg is not present on
> 386.
>
> This is all my understanding ... pls correct me if I am wrong.

It seems far better than mine. I'll talk to a few people and see about the read-ordering issue you suggest. As for the 386 not having cmpxchg: I know there are a lot of Sequent folks on this list, but I don't know of anyone running Linux on a 386 SMP box :) It would be interesting to be corrected on that, though.

I'll look forward to your feedback on the next revision, where I'm going to try not to just stick my head in the sand wrt memory ordering.

thanks
-john
|