
Snorgers lock algorithm suggestion

2007-10-13
2013-03-22
  • Stefan Israelsson Tampe

    Step 1 (Partly Finished). Implement and verify a deadlock-free lock algorithm based on an ordering of the processes. This has some limited use for cases where only throughput is important or where lock conflicts can be neglected.

    Algorithm.
    Consider P threads th(1), ..., th(P).
    Assume each thread's proc(i) has a unique order, or priority, o(i).
    Consider a LOCK(proc,lock) primitive, a FIRST_LOCK(proc,lock) primitive and an UNLOCK(proc,lock) primitive. The usage in code follows this rule:

retry:
  FIRST_LOCK(proc,lock1)
    .... some operations that don't change state, e.g. reading global
    variables and writing to local ones ....

    if (LOCK(proc,lock2) == CLOBBERED) { UNLOCK(proc,lock1); goto retry; }
      .... some operations that don't change state,
      e.g. reading global variables and writing to local ones ....

      if (LOCK(proc,lock3) == CLOBBERED) { UNLOCK(proc,lock2); UNLOCK(proc,lock1); goto retry; }
         etc...
         when the lock section has finished and all operations can succeed we unlock:

      UNLOCK(proc,lock3)
    UNLOCK(proc,lock2)
  UNLOCK(proc,lock1)
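
    To make the rule concrete, here is a minimal compilable C rendering of the same pattern. The names snorg_first_lock / snorg_lock / snorg_unlock and the proc_t / slock_t types are illustrative assumptions, not an existing API; a possible sketch of their internals is given further down.

typedef struct proc  proc_t;   /* per-thread bookkeeping structure        */
typedef struct slock slock_t;  /* a snorgers lock                         */

enum lock_result { NOT_CLOBBERED, CLOBBERED };

enum lock_result snorg_first_lock(proc_t *p, slock_t *l); /* never CLOBBERED */
enum lock_result snorg_lock(proc_t *p, slock_t *l);
void             snorg_unlock(proc_t *p, slock_t *l);

void critical_section(proc_t *proc, slock_t *lock1, slock_t *lock2, slock_t *lock3)
{
    for (;;) {
        snorg_first_lock(proc, lock1);        /* clears the clobber flag   */
        /* ... read-only work: read globals, write locals ... */

        if (snorg_lock(proc, lock2) == CLOBBERED) {
            snorg_unlock(proc, lock1);
            continue;                         /* retry from the beginning  */
        }
        /* ... more read-only work ... */

        if (snorg_lock(proc, lock3) == CLOBBERED) {
            snorg_unlock(proc, lock2);
            snorg_unlock(proc, lock1);
            continue;
        }

        /* all locks held: now the state-changing operations may run      */
        /* ... commit ...                                                  */

        snorg_unlock(proc, lock3);
        snorg_unlock(proc, lock2);
        snorg_unlock(proc, lock1);
        return;
    }
}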

    One can discuss a version where you retry from the clobbered lock only, which can be an improvement. Any thoughts on that are a later issue, but basically a list of gotos has to be stored on the proc structure, and when clobbering a lock one has to mark which lock was clobbered. This is a clear improvement.

    When a lock is taken:
      If the lock is free, acquire it.
      If the lock is taken:
        If the lock holder has lower priority:
          Clobber the lock holder by setting a clob flag on the proc that holds the lock.
          Put the requester on the waiting list in one of the following ways:
            a) only the first and second waiter, in priority order
            b) the whole list, in priority order
          Also, a wake-up signal is sent to the clobbered thread so that it reapplies for
          the lock it is sleeping on and can be clobbered correctly.

        If the lock holder has higher priority:
          If the clob flag of the proc is set,
            return with CLOBBERED
          else
            proceed as in the previous case, but do not clobber.

      If the lock is not taken, simply take the lock.

    FIRST_LOCK clears the clobber flag on the proc and therefore never returns CLOBBERED.

    When the lock is successfully taken, NOT_CLOBBERED is of course returned.

    At the release of the lock in UNLOCK, the waiting list is checked:
      a) Wake up all waiters, letting the first waiter get the lock; the rest will
         reapply for the lock and hence be sorted the same way.
      b) Just wake up the first waiter and hand it the lock.
    (A C sketch of LOCK and UNLOCK along these lines is given below.)
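
    The following is a hedged sketch of what LOCK/UNLOCK internals could look like in C with POSIX threads, following variant a): on unlock, all waiters are woken and re-apply. One global mutex guards all lock bookkeeping (later comments in this thread discuss exactly such a global lock), and the condition-variable broadcast stands in for the explicit priority-ordered waiting list; a higher-priority waiter that loses the race simply clobbers the new holder. Every name here (snorg_lock, proc_t, clob, waiting_on, ...) is an illustrative assumption, not the actual implementation.

#include <pthread.h>
#include <stddef.h>

typedef struct slock slock_t;

typedef struct proc {
    int      prio;          /* unique priority o(i)                           */
    int      clob;          /* set when a higher-priority thread clobbers us  */
    slock_t *waiting_on;    /* lock we are currently sleeping on, if any      */
} proc_t;

struct slock {
    proc_t        *owner;   /* NULL when free                                 */
    pthread_cond_t wake;    /* waiters sleep here                             */
};

enum lock_result { NOT_CLOBBERED, CLOBBERED };

static pthread_mutex_t g = PTHREAD_MUTEX_INITIALIZER;  /* global bookkeeping lock */

void slock_init(slock_t *l)
{
    l->owner = NULL;
    pthread_cond_init(&l->wake, NULL);
}

enum lock_result snorg_lock(proc_t *p, slock_t *l)
{
    enum lock_result r = NOT_CLOBBERED;

    pthread_mutex_lock(&g);
    for (;;) {
        if (l->owner == NULL) {                 /* free: take it                */
            l->owner = p;
            break;
        }
        if (l->owner->prio < p->prio) {         /* holder has lower priority    */
            l->owner->clob = 1;                 /* clobber it                   */
            if (l->owner->waiting_on != NULL)   /* if it sleeps on another lock, */
                pthread_cond_broadcast(&l->owner->waiting_on->wake); /* wake it
                                                   so it reapplies and notices
                                                   its clob flag                */
        } else if (p->clob) {                   /* we were clobbered ourselves  */
            r = CLOBBERED;                      /* caller backs out and retries */
            break;
        }
        p->waiting_on = l;                      /* sleep until UNLOCK wakes us  */
        pthread_cond_wait(&l->wake, &g);
        p->waiting_on = NULL;
    }
    pthread_mutex_unlock(&g);
    return r;
}

enum lock_result snorg_first_lock(proc_t *p, slock_t *l)
{
    p->clob = 0;              /* the first lock clears the clobber flag; since  */
    return snorg_lock(p, l);  /* we hold no locks yet, nobody can clobber us,   */
}                             /* so this never returns CLOBBERED                */

void snorg_unlock(proc_t *p, slock_t *l)
{
    (void)p;
    pthread_mutex_lock(&g);
    l->owner = NULL;
    pthread_cond_broadcast(&l->wake);  /* variant a): wake everyone, they reapply */
    pthread_mutex_unlock(&g);
}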

    Note i): 
    Option a) is cheaper if we clobber often compared to how often we add waiters, or if the number of waiters is small.
    We will probably try to implement b) in the end, with a careful algorithm combining trees and arrays.

    Note ii): 
    * If thread i waits on thread j, we denote this by i -> j.
    * A deadlock means that we have a circular chain i -> j -> ... -> i.
    * At a deadlock this implies that o(i) < o(j) < ... < o(i) (impossible),
        or there is a situation where o(i) > o(j) but the clobbering mechanism failed.

    * Now, when i tried to take its lock, either j held the lock or another thread k
      with o(k) > o(i) held the lock.

      + If j held the lock and
        - was alive, it would be clobbered when it got stuck.
        - was sleeping, it would be woken up by the algorithm and then become
          clobbered when it got stuck.
      + Finally, if another thread k with o(i) < o(k) held the lock, then in case
        a) all waiters are woken up, so when j gets the lock i will reapply for the lock,
           and this case is already verified.
        b) because the waiters are sorted by priority and o(i) > o(j), i is before j and the
           algorithm works for this final case.

    Hence the algorithm is deadlock-free.
    QED

    Note iii):
    We actually designed the algorithm not to target threads with real priorities; this means that we will let the tasks work as long as they can without prematurely preempting them. Still, there are cases where postponing the preemption would be better, but we have at least addressed the common case where we mainly have one interaction between two threads, which is always deadlock-free and should not be preempted.

    Note iv)
    The drawback of this algorithm is that threads with low priority will be punished if a lock is popular. That can lead to serious problems if threads are related to users; you can simply get sluggish behavior if you got priority 1. Step 2 will try to address this problem.

    As always the dev(e)l is in the details, but the above algorithm has been successfully implemented with (I'm sorry) spin-locks and at most 250 threads. A more mutexish solution is going to be the next effort.

     
    • Stefan Israelsson Tampe

      First of all, a version that shows the idea is in the CVS. This version uses a schedule-intensive approach, but a more mutexish
      version is probably possible. Now, trying to implement a sleep/wake-up algorithm revealed that you typically would like to
      have a deadlock-free version to start with in order to make this happen without being a computer wizard.

      Reason for complexity:
      When we build a deadlock-free lock, you will typically take a simpler lock, do something and then release it. When working inside this simpler lock you sometimes need to wake up the lock holder of the snorgers lock, which is perhaps sleeping on another
      snorgers lock. The safe approach would be to take the simple lock on that snorgers lock as well and then do the wake-up.
      But now you will get problems with a standard lock due to deadlock; the solution is to use the CVS version of a
      scheduling lock, or a spinning version of it. This will be tried later.
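
      As an illustration of where the complexity comes from, here is a rough sketch of that wake-up, assuming each snorgers lock carries a simple inner lock and each proc records which snorgers lock it is sleeping on. All names and fields here are assumptions for illustration only.

typedef struct simple_lock simple_lock_t;  /* the scheduling/spinning simple lock */
typedef struct slock slock_t;

typedef struct proc {
    slock_t *sleeping_on;      /* snorgers lock this proc is sleeping on, if any */
} proc_t;

struct slock {
    simple_lock_t *inner;      /* the "simple" lock protecting this snorgers lock */
    proc_t        *owner;
};

void simple_lock(simple_lock_t *);
void simple_unlock(simple_lock_t *);
void wake_up(proc_t *);

/* Called while already holding a->inner: wake the holder of snorgers lock a,
 * which may be sleeping on another snorgers lock b.  Taking b->inner is the
 * second simple lock that makes a standard mutex unsafe here (deadlock risk),
 * hence the need for the scheduling or spinning variant of the simple lock. */
void wake_snorgers_holder(slock_t *a)
{
    proc_t  *holder = a->owner;
    slock_t *b      = holder ? holder->sleeping_on : NULL;

    if (b != NULL) {
        simple_lock(b->inner);
        wake_up(holder);
        simple_unlock(b->inner);
    }
}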

      Note that this adds complexity, and the time for taking this lock would be about 4x that of a standard mutex, meaning roughly 400 ns spent on synchronization towards main memory for one lock. Of course, if you seldom take this lock and the work done inside the lock is much larger than the work of taking and releasing it, it might still be reasonable to use this kind of lock. So there might be a use case for it.

      It is suggested to implement this lock as a last effort and, for now, to direct non-user-thread applications that want a deadlock-free locking version to the CVS kind or a spinning version of it.

      The real outcome of last week's work is that combining user threads and kernel threads is a very interesting idea. It really looks like it is possible to have a deadlock-free snorgers lock that has low overhead and excellent lock contention behavior. The downside is that you will need the usual cooperation of the application, such as doing schedule operations so as not to stall other processes, and you will need to have many threads to schedule when a thread starts to stall waiting for locks. Hence the logical next effort will be to implement a combined threading model.

      The reason for the good lock contention behavior is that when you have user threads you can start migrating threads to the same
      kernel thread, and the suggested mechanism would then perform close to a unithreaded implementation. So in a non-contended situation
      you will take advantage of all cores or processors, and in the contended case you will get the same performance as a single-threaded
      application using a state model, which actually is fantastic.

      Migration can be done in many ways.

      It is probably so that as we approach contention, at some point the number of threads waiting on a lock grows dramatically. So by monitoring the number of waiters of a lock you can start migrating when this number is above a certain threshold. This will be done by scanning the waiters, counting how many are waiting per kernel thread, and then sorting the result. All kernel threads that are within a
      certain distance of the maximum will be considered for migration, and the selection will be the one that has maximal kernel-thread priority (a global priority attached to a kernel thread). The next waiters arriving at the lock will be migrated to the selected thread. There will be some tunable variables here, but in the end we will try to supply sane defaults. When the waiting list is below a certain threshold the migration will stop. Also, the migration time will be noted. If there are kernel threads with a low
      number of user threads they will ask for threads. These threads are picked from a pool of random-migratable threads at a certain rate; a thread is random-migratable if it has gone x ns since its last migration. It is suspected that with this system, monitoring the total number of waiters, you will see a cyclic pattern. On top of this you can attach a global control algorithm that allows a cyclic pattern but keeps it from becoming a significant overhead. This global control could be a system administrator.
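
      A hedged sketch of the waiter-scanning part of this policy follows; the thresholds, type names and the pick_migration_target function are assumptions for illustration, not the actual implementation.

#define MIGRATE_HIGH 16   /* start migrating above this many waiters            */
#define MIGRATE_LOW   4   /* stop migrating below this many (checked by caller) */
#define NEAR_MAX      2   /* "within a certain distance" of the maximum         */

typedef struct kthread { int id; int prio; } kthread_t;  /* id assumed < 256 here */

/* Returns the kernel thread new waiters should be migrated to, or NULL if the
 * lock is not contended enough.  waiter_kt[i] is the kernel thread of waiter i;
 * all_kt[0..nkt-1] are the kernel threads in the system. */
kthread_t *pick_migration_target(kthread_t **waiter_kt, int nwaiters,
                                 kthread_t **all_kt, int nkt)
{
    if (nwaiters < MIGRATE_HIGH)
        return NULL;

    /* count how many waiters each kernel thread contributes */
    int count[256] = {0};
    for (int i = 0; i < nwaiters; i++)
        count[waiter_kt[i]->id]++;

    int max = 0;
    for (int k = 0; k < nkt; k++)
        if (count[all_kt[k]->id] > max)
            max = count[all_kt[k]->id];

    /* among kernel threads within NEAR_MAX of the maximum, pick the one
       with the highest kernel-thread priority                           */
    kthread_t *best = NULL;
    for (int k = 0; k < nkt; k++) {
        kthread_t *kt = all_kt[k];
        if (count[kt->id] >= max - NEAR_MAX &&
            (best == NULL || kt->prio > best->prio))
            best = kt;
    }
    return best;
}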

      Why is this possible? Well, you take a global lock. Assuming you have many threads, you make the time spent under the global lock as
      short as possible, which can be done by targeting the use case of getting the lock directly or having a waiting list of only a few threads.

      This will be next week's effort.

       
    • Stefan Israelsson Tampe

      The code to try out a combined user-space and kernel-space
      version of a snorgers lock is shaping up. It compiles, but
      being an abstractly minded person with dyslectic tendencies,
      50% of the time is spent on bug hunting. Hence next week
      will be devoted to putting myself in the shoes of Sherlock
      Holmes and hunting down the time thieves. Anyway, this is
      actually a very intellectually stimulating part of
      programming and can preferably be enjoyed with a glass of
      fine whiskey.

      One thing that is not going to be explored right now is
      to what degree migrating a user thread from one kernel
      thread to another can be beneficial. One obvious reason is
      that if all user threads are on one kernel thread, there
      will be no need to lock any objects at all. This case
      should not be neglected, because it is not uncommon that
      your code cannot gain anything from being multithreaded
      across many cores, due to memory bandwidth issues, lock
      contention issues, and of course overhead issues.

      The other case to notice is that if you code carefully,
      which I really try to, in many cases you will never need to
      release a taken lock and redo from the beginning if the lock
      holder belongs to the same kernel thread.

      To make this work, assume that a thread takes a sequence of
      locks atomically, and the same for releasing them; we bring
      these operations into one lock operation process, denoted
      lop. A thread will then issue a locking lop, then schedule;
      when the thread is run again it will do the essential work,
      issue an unlocking lop and then schedule.
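
      A hedged sketch of how a user thread would use such a lop: one lop takes the whole sequence of locks, one lop releases them, and the thread yields to the user-space scheduler around each lop. The names lop_lock, lop_unlock and uthread_schedule are illustrative assumptions, not an existing API.

typedef struct lop   lop_t;     /* a lock operation process                 */
typedef struct slock slock_t;   /* a snorgers lock                          */

lop_t *lop_lock(slock_t **locks, int n);   /* issue a locking lop           */
void   lop_unlock(lop_t *lp);              /* issue the matching unlock lop */
void   uthread_schedule(void);             /* yield to the user scheduler   */

void do_transaction(slock_t **locks, int n)
{
    lop_t *lp = lop_lock(locks, n);  /* issue the locking lop               */
    uthread_schedule();              /* run again once all locks are held
                                        (or sleep, see the next paragraph)  */

    /* ... the essential work under the locks ... */

    lop_unlock(lp);                  /* issue the unlocking lop             */
    uthread_schedule();
}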

      Behind the scenes we must take care of the situation where
      one of the locks in a lop fails. When this happens, not
      only does the current thread doing the lop go to sleep, but
      all threads that execute a locking lop after this lop have
      to go to sleep as well. Doing this ensures that only one
      lop per kernel thread can partially succeed, and no
      deadlock sequence within a sequence of user threads is
      possible.
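
      A minimal sketch of the bookkeeping this implies, assuming a per-kernel-thread flag; the names are illustrative only.

typedef struct kthread {
    int lops_frozen;    /* set once a lop on this kernel thread has stalled */
} kthread_t;

/* Called by the user-space scheduler when a user thread issues a locking lop.
 * Returns 1 if the thread may proceed, 0 if it must be put to sleep. */
int admit_locking_lop(kthread_t *kt, int got_all_locks)
{
    if (kt->lops_frozen)       /* an earlier lop already stalled here:       */
        return 0;              /* later locking lops must sleep too          */
    if (!got_all_locks) {      /* this lop stalled: freeze later lops so at  */
        kt->lops_frozen = 1;   /* most one lop per kernel thread is ever     */
        return 0;              /* partially done                             */
    }
    return 1;
}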

       
    • Stefan Israelsson Tampe

      The current week has been hectic. But the mixed-threading
      endeavour is shaping up, and the overhead of doing locking in both user space and kernel space can now be read off.

      Last week was a serious bug week, and as a consequence my family and I have to comb our hair (splitting every hair) for the next 14 days. That, my friend, is real debugging for men with hair on their chests, as we say back home.

      Anyway, being able to factor out many locking operations into one locking operation holding a global lock gives about 100 ns of overhead per lock in the non-contended case. About 40 ns is spent inside the global lock per lock.

      It remains to make a more scalable solution for the handling of waiters and to introduce kernel threads as well. This is prepared for and should not be a big obstacle. My next effort will instead be to reduce the overhead.

      There is one important optimization that can be done.
      If we check a lock and it is locked, with the lock holder being another thread on the same kernel thread, we do not need a global lock; hence in practice you could reduce the time spent under the global lock at contention. It would also be possible to classify user threads as contended and non-contended, handling the non-contended threads as usual, while the contended locks could be managed so that we do not
      need to restart the lock operations according to the previous discussion.
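
      A hedged sketch of the same-kernel-thread check; the field names and the try_local_fast_path function are assumptions for illustration, only showing the shape of the test, not the actual implementation.

typedef struct kthread kthread_t;
typedef struct uthread { kthread_t *kt; } uthread_t;   /* user thread and its kernel thread */
typedef struct slock   { uthread_t *owner; } slock_t;  /* a snorgers lock                   */

/* Returns nonzero if the lock request can be handled locally, without the
 * global lock, because the current holder runs on our own kernel thread. */
int try_local_fast_path(uthread_t *self, slock_t *l)
{
    uthread_t *owner = l->owner;

    if (owner != NULL && owner->kt == self->kt) {
        /* both threads are cooperatively scheduled on the same kernel
           thread, so waiting/clobbering can be handled with plain,
           kernel-thread-local bookkeeping here                        */
        return 1;
    }
    return 0;   /* fall back to the globally locked slow path */
}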

      The basic strategy to handle the classes would be:
      if a contended thread tries to take a lock that is already held by
      a non-contended thread, then you may clobber that thread. If
      a contended thread tries to take a lock that is held by a
      contended thread, and it is not the first one, then freeze all lock-acquire lops for contended threads.

       

