From: Hubertus F. <fr...@wa...> - 2002-02-27 16:58:15
On Wed, Feb 27, 2002 at 07:38:34PM +0300, Joshua MacDonald wrote:
> Hi Hubertus,
>
> I have a question for you about these semaphores. I took a glance at
> <linux/ulocks.h> to try and find out how large an object the
> user-space lock is. I see that you have it written like this:
>
> typedef struct ulock_t {
>     unsigned long status;
>     unsigned long type;
>     unsigned long counters[ULOCK_STAT_COUNTERS];
>     char pad[SMP_CACHE_BYTES -
>              (ULOCK_STAT_COUNTERS+2)*sizeof(unsigned long)];
> } ulock_t ____cacheline_aligned;
>
> I would like to suggest that you offer a version of the ulock_t that
> is as small as possible, so that the user can make use of the entire
> cacheline rather than waste it on padding.
>
> The reason I'm interested is that I have written a concurrent,
> cache-optimized skip list and I did all of my testing using the
> standard Linux spinlocks. Those are 4 bytes per lock. I use one lock
> per cacheline-sized node of the data structure. If you can get your
> locks down to one or two words that would be really interesting, since
> spinlocks don't work terribly well in user space. I would really like
> to be able to use this data structure outside of the kernel, and your
> locks might just solve my problem, but only if they are small enough.
>
> Do you see my point? You can find my latest skiplist code at:

I have seen the light. Seriously, I initially had it as you said, but
Linus strongly recommended cacheline-sized objects to avoid false sharing
on anything other than the lock word. However, there should be no problem
whatsoever bringing it back down to two words:

> typedef struct ulock_t {
>     unsigned long status;
>     unsigned long qcount;   /* this will change instead of type */
> }

The counters are only there for lock statistics, and the padding is only
there to fill the cacheline (it might be obsolete anyway). What can be
done is to make the ____cacheline_aligned a configuration option; nothing
in the system relies on it. I'll see how I'll do this.

Also, this can be brought down to one word if we don't need demand-based
kernel object allocation and/or the multiple-queue requirement requested
by Ben LaHaise. Rusty, how would your pinning-based approach be affected
by this? Comments?

-- Hubertus

> http://sourceforge.net/projects/skiplist
>
> I have put test results up on that site as well, but those tests were
> made using spinlocks at user level! In other words, I don't really
> believe my results are meaningful.
>
> (And let me warn you that there's a bug and I haven't uploaded the
> latest version yet...)
>
> -josh
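A minimal sketch of what making the alignment a configuration option could
look like (the CONFIG_ULOCK_CACHEALIGN name and the user-space fallback for
SMP_CACHE_BYTES are illustrative assumptions, not part of the posted patch):

    #ifndef SMP_CACHE_BYTES
    #define SMP_CACHE_BYTES 64              /* user-space fallback; kernel builds define this */
    #endif

    #ifdef CONFIG_ULOCK_CACHEALIGN          /* hypothetical config option */
    #define ULOCK_ALIGN __attribute__((aligned(SMP_CACHE_BYTES)))
    #else
    #define ULOCK_ALIGN                     /* bare two-word lock; caller decides about padding */
    #endif

    typedef struct ulock_s {
        unsigned long status;               /* lock word, updated atomically in user space */
        unsigned long qcount;               /* number of wait queues, replaces 'type' */
    } ULOCK_ALIGN ulock_t;

With the option off, the structure is exactly the two words above, and a
data structure like the skip list can pack a lock next to its node data
inside a single cacheline.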
From: Hubertus F. <fr...@wa...> - 2002-03-01 02:01:20
On Thu, Feb 28, 2002 at 04:24:22PM -0800, Richard Henderson wrote:
> On Wed, Feb 27, 2002 at 10:53:11AM -0500, Hubertus Franke wrote:
> > As stated above, I allocate a kernel object <kulock_t> on demand and
> > hash it. This way I don't have to pin any user address. What does
> > everybody think about the merit of this approach versus the pinning
> > approach?
> [...]
> > In your case, can the lock be allocated at different
> > virtual addresses in the various address spaces?
>
> I think this is a relatively important feature. It may not be
> possible to use the same virtual address in different processes.
>
> r~

I think so too. However, let me point out that Linus's initial
recommendation of a handle, comprised of a kernel pointer and a signature,
also has that property. Just pointing out the merits of the various
approaches.

-- Hubertus
From: Hubertus F. <fr...@wa...> - 2002-03-02 14:08:16
On Fri, Mar 01, 2002 at 09:07:18PM +0100, Martin Wirth wrote:
> Hubertus Franke wrote:
>
> > So, it works and is correct. Hope that clarifies it; if not, let me know.
> > Interestingly enough, this scheme doesn't work for spinning locks.
> > Go to lib/ulocks.c [lines 91-133]; I have integrated this dead code there
> > to show how not to do it. A similar analysis as above shows that this
> > approach wouldn't work. You need cmpxchg for these scenarios (more or
> > less).
>
> You are right, I falsely assumed the initial state to be [1,1].
>
> But as mentioned in your README, your approach is currently not able to
> manage signal handling correctly. You have to ignore all non-fatal signals
> by using ERESTARTSYS, and a SIGKILL sent to one of the processes may
> corrupt your user-kernel synchronisation.
>
> I don't think a user space semaphore implementation is acceptable until it
> provides (signal-) interruptability and timeouts. But maybe you have some
> idea how to manage this.
>
> Martin Wirth

As for the signal handling and ERESTARTSYS: the user code on the slow path
can check the return code and has two choices, (a) re-enter the kernel and
wait, or (b) correct the status word (it is still counted as a waiting
process) and return 0. I have chosen (a) and automated it. I have tried
sending signals (e.g. SIGIO) and it seems to work fine: the signal is
handled in user space and control returns to the kernel section. (b) is
feasible but requires a bit more exception work in the routine.

As for the corruption case: there is flat-out (almost) nothing you can do.
For instance, if a process grabs the lock and exits, then another process
that tries to acquire it hits a deadlock. At least on the mailing list,
most people feel that you are dead in the water anyway.

I have been thinking about the problem of corruption. The owner of the lock
could register its pid in the lock area (2nd word). That still leaves
non-atomic windows open and is a half-assed solution. But more
conceptually, if your process dies while holding a lock that protects some
shared data, there is no guarantee you can make about the data itself. In
that kind of scenario, why try to continue as if nothing happened? It seems
the general consensus on the list is that your app is toast at that point.

-- Hubertus
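A minimal user-space sketch of option (a), assuming a hypothetical
ulock_kernel_wait() wrapper for the slow-path syscall that fails with EINTR
when a signal interrupts the sleep (names are illustrative, not taken from
the posted package):

    #include <errno.h>

    struct ulock { unsigned long status; unsigned long qcount; };  /* minimal stand-in */

    /* hypothetical wrapper around the slow-path "wait" syscall; returns -1
       with errno == EINTR when a signal interrupted the sleep */
    extern int ulock_kernel_wait(struct ulock *lock);

    /* Option (a): the interrupted task is still counted as a waiter, so it
       can simply re-enter the kernel and keep waiting after the handler ran. */
    int ulock_wait_restart(struct ulock *lock)
    {
        int rc;

        do {
            rc = ulock_kernel_wait(lock);
        } while (rc < 0 && errno == EINTR);
        return rc;
    }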
From: Paul J. <pj...@en...> - 2002-03-03 22:13:51
On Sat, 2 Mar 2002, Hubertus Franke wrote:
> But more conceptually, if your process dies while holding a lock ...
> your app is toast at that point.

Without application-specific knowledge, certainly.

Is there some way one could support a hook, to enable an application to
register a handler that could clean up, for those apps that found that
worthwhile?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj...@sg...> 1.650.933.1373
From: Hubertus F. <fr...@wa...> - 2002-03-04 16:07:16
On Sun, Mar 03, 2002 at 02:13:45PM -0800, Paul Jackson wrote:
> On Sat, 2 Mar 2002, Hubertus Franke wrote:
> > But more conceptually, if your process dies while holding a lock ...
> > your app is toast at that point.
>
> Without application-specific knowledge, certainly.
>
> Is there some way one could support a hook, to enable
> an application to register a handler that could clean
> up, for those apps that found that worthwhile?

I think that's a good idea. I'll think it through and see what Rusty
thinks.

> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <pj...@sg...> 1.650.933.1373
From: Rusty R. <ru...@ru...> - 2002-03-05 07:08:52
On Sun, 3 Mar 2002 14:13:45 -0800 Paul Jackson <pj...@en...> wrote:
> On Sat, 2 Mar 2002, Hubertus Franke wrote:
> > But more conceptually, if your process dies while holding a lock ...
> > your app is toast at that point.
>
> Without application-specific knowledge, certainly.
>
> Is there some way one could support a hook, to enable
> an application to register a handler that could clean
> up, for those apps that found that worthwhile?

If you want this, use fcntl locks (see TDB). If you don't tell the kernel
what you are doing (which is the reason these locks are fast), it cannot
clean up for you.

One could conceive of a solution where a process told the kernel "here is
where I keep my lock states: if I die, clean up". Of course, there's a race
there too. IMHO, given that the lock is protecting something which is left
in an unknown state, this is something which would require serious testing
to be proven worthwhile.

Hope that helps,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
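For readers unfamiliar with the alternative Rusty mentions, here is a
minimal sketch of fcntl record locking, which the kernel does release
automatically when the holder dies. This is the standard POSIX API, not
part of the futex patches; the helper name is illustrative. TDB (the
trivial database used by Samba) builds its locking on this mechanism.

    #include <fcntl.h>
    #include <unistd.h>

    /* Take an exclusive record lock on the first byte of an already-open
       file. If the owning process exits or is killed, the kernel drops the
       lock for it -- the cleanup property asked about above, paid for with
       a syscall on every acquire and release. */
    int lock_first_byte(int fd)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 1,
        };

        return fcntl(fd, F_SETLKW, &fl);    /* blocks until the lock is granted */
    }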
From: Hubertus F. <fr...@wa...> - 2002-03-02 14:50:11
On Wed, Feb 27, 2002 at 11:24:17AM +1100, Rusty Russell wrote:
> OK. New implementation. Compiles (PPC), but due to personal reasons,
> UNTESTED. Thanks especially to Hubertus for his prior work and feedback.
>
> 1) Turned into syscalls.
> 2) Added arch-specific sys_futex_region() to enable mutexes on a region,
>    and give a chance to detect lack of support via -ENOSYS.
> 3) Just a single atomic_t variable for mutex (thanks Paul M).
>    - This means -ENOSYS on SMP 386s, as we need cmpxchg.
>    - We can no longer do generic semaphores: mutexes only.
>    - Requires arch __update_count atomic op.

I don't see that; you can use atomic_inc on 386s...

> 4) Current valid args to sys_futex are -1 and +1: we can implement
>    other lock types by using other values later (eg. rw locks).

(see below)

> Size of futex.c dropped from 244 to 161 lines, so it's clearly 40% better!
>
> Get your complaints in now...

(plenty below :-)

> Rusty.
> --
> Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Rusty, as indicated I think there is quite some merit to this solution, and
it certainly cuts down on some of the complexity that is inherent in what I
submitted. First, what I like is that by pinning the page down you avoid
the extra indirection that I went through with the kulock_ref_t objects.
There are, however, some limitations to your approach.

(a) The hashing is too simple. As a consequence there are two limitations:
    - The total number of locks that can be waited for is
      (1 << USEM_HASHBITS); that is not nearly enough (as you pointed out).
    - This has to be bumped up to roughly the number of tasks in the
      system. Since each wait_queue_head_t element is 3 words and
      MAX_PID = 32K, this ends up at about 400K of memory. That would be
      the worst-case scenario, namely that every task in the system is
      waiting on a different lock (hey, shit happens).

More serious, though, is the following.

(b) There is no provision for what to do when there is a hash collision.
    This seems so rudimentary that I assume I am missing something; in case
    not: if <page1,offset1> and <page2,offset2> both result in the same
    hash value, then they will both be waiting on the same wait queue,
    which is incorrect. A solution would be to introduce hash chaining, as
    I have done through my access vector cache, and allocate the waitqueue
    objects from the cachep object and link them through it (sketched
    below). An alternative is to provide a sufficiently large table as
    described in (a), on collision go for a secondary hash function, and if
    that fails search for an empty bucket. In the current situation, I
    think the second solution might be more appropriate.

(c) Your implementation could easily be expanded to include Ben's
    suggestion of multiple queues (which I already implement in my
    submission) by including the queue index in the hashing mechanism.
    Since you are rewriting the wait routine, the problem can be
    circumvented if we can settle for mutexes, semaphores, and rwlocks
    only. Ben, what's your take? In rwlocks you signal continuation of the
    next class of holders: you wake up one writer or the next set of
    readers, or, based on some other parameter, if the first candidate you
    find is a reader you walk the entire list and wake up all readers.
    Lots of stuff can be done here. What you then need is to expand your
    wait structure to indicate why you are sleeping: waiting for read or
    waiting for write. What do you think?

(d) If (b) is magically solved, then I have no take yet regarding cleanup.
    If not, then given (b), your cleanup will become expensive and you will
    end up with a solution similar to mine, namely a callback on vm-area
    destruction and some reference counting.

(e) In your solution you use cmpxchg throughout, which keeps you from
    utilizing the atomic_inc/dec functions on i386. I have based my mutex
    implementation in user space on atomic_inc; spinning versions require
    cmpxchg, though. It might be useful to differentiate these cases and
    provide at least non-spinning versions for i386.

(f) Why get away from semaphores and stick only to mutexes? Isn't there
    some value in providing counting semaphores? Effectively, mutexes are a
    subclass of semaphores with a max count of 1.

(g) I don't understand the business of rechecking in the kernel. In your
    case it comes cheap, as you have pinned the page already; in general
    that's true, since the page was just touched, so the pinning itself is
    reasonably cheap. Nevertheless, I am concerned that you now need
    intrinsic knowledge of how the lock word is used in user space. For
    instance, I use it in different fashions depending on whether it is a
    semaphore or an rwlock. Why can't we keep that separate from the kernel
    function of waiting on an event that is signaled along with the lock
    word? The content of the lock word should not come into play in the
    kernel. If you want that, you use spinning locks.

(h) How do you deal with signals, e.g. SIGIO or something like it?

Overall, pinning down the page (assuming that not a lot of tasks are
waiting on a large number of queues at the same time) is acceptable, and it
cuts out one additional level of indirection that is present in my
submission.

-- Hubertus Franke (fr...@wa...)
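A rough sketch of the chaining idea in (b), with hypothetical names
(futex_waiter, futex_bucket) and per-bucket locking omitted for brevity;
this is not code from either posted patch. The point is that each sleeper
records which <page,offset> it waits on, so colliding locks can share a
bucket without sharing wake-ups:

    #include <linux/list.h>
    #include <linux/wait.h>
    #include <linux/mm.h>

    #define FUTEX_HASHBITS 8

    struct futex_waiter {                   /* one entry per sleeping task */
        struct list_head list;
        struct page *page;                  /* pinned page holding the lock word */
        unsigned int offset;                /* offset of the lock word in that page */
        wait_queue_head_t wq;
    };

    static struct list_head futex_bucket[1 << FUTEX_HASHBITS];

    static inline unsigned int futex_hash(struct page *page, unsigned int offset)
    {
        return (((unsigned long)page + offset) >> 2) & ((1 << FUTEX_HASHBITS) - 1);
    }

    /* Wake only the first waiter whose <page,offset> matches; other locks
       that happen to hash into the same bucket are not disturbed. */
    static void futex_wake_one(struct page *page, unsigned int offset)
    {
        struct list_head *pos;

        list_for_each(pos, &futex_bucket[futex_hash(page, offset)]) {
            struct futex_waiter *w = list_entry(pos, struct futex_waiter, list);

            if (w->page == page && w->offset == offset) {
                wake_up(&w->wq);
                break;
            }
        }
    }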
From: Rusty R. <ru...@ru...> - 2002-03-04 03:58:52
On Sat, 2 Mar 2002 09:50:31 -0500 Hubertus Franke <fr...@wa...> wrote:
> On Wed, Feb 27, 2002 at 11:24:17AM +1100, Rusty Russell wrote:
> > - We can no longer do generic semaphores: mutexes only.
> > - Requires arch __update_count atomic op.
>
> I don't see that; you can use atomic_inc on 386s...

__update_count needed to be able to do the decrement from 0 or from the
count, whichever was greater. I have eliminated this in the new patch
(patch III).

> (a) The hashing is too simple. As a consequence there are two limitations:
>     - The total number of locks that can be waited for is
>       (1 << USEM_HASHBITS); that is not nearly enough (as you pointed out).
>     - This has to be bumped up to roughly the number of tasks in the
>       system.

OK. We dealt with hash collisions by sharing waitqueues. This works, but
means we have to do wake-all. My new patch uses its own queues: we still
share queues, but we only wake the first one waiting on our counter.

> (c) Your implementation could easily be expanded to include Ben's
>     suggestion of multiple queues (which I already implement in my
>     submission) by including the queue index in the hashing mechanism.

Hmmm... it should be pretty easy to extend: the queues can be shared.

> (f) Why get away from semaphores and stick only to mutexes? Isn't there
>     some value in providing counting semaphores? Effectively, mutexes are
>     a subclass of semaphores with a max count of 1.

Yes, but counting semaphores need two variables, or cmpxchg. Mutexes need
neither (proof by implementation 8). While general counting semaphores may
be useful, the fact that they are not implemented in the kernel suggests
they are not worth paying extra for.

> (g) I don't understand the business of rechecking in the kernel. In your
>     case it comes cheap, as you have pinned the page already; in general
>     that's true, since the page was just touched, so the pinning itself is
>     reasonably cheap. Nevertheless, I am concerned that you now need
>     intrinsic knowledge of how the lock word is used in user space. For
>     instance, I use it in different fashions depending on whether it is a
>     semaphore or an rwlock.

This is a definite advantage: my old fd-based code never looked at the
userspace counter; it had a kernel semaphore to sleep on for each userspace
lock. OTOH, this involves kernel allocation, with the complexity that
brings.

> (h) How do you deal with signals, e.g. SIGIO or something like it?

They're interruptible, so you'll get -ERESTARTSYS. Should be fine.

> Overall, pinning down the page (assuming that not a lot of tasks are
> waiting on a large number of queues at the same time) is acceptable, and
> it cuts out one additional level of indirection that is present in my
> submission.

I agree. While originally reluctant, when it's one page per sleeping
process it's rather neat.

Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
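To illustrate why a counting semaphore wants cmpxchg where a mutex can get
by with a plain atomic decrement: the semaphore's fast path must only
succeed if the count was still positive, which takes a compare-and-swap
loop. A minimal user-space sketch, using the GCC __sync builtin as a
stand-in for the cmpxchg under discussion (not code from either patch):

    /* Returns 1 if the semaphore was acquired on the fast path, 0 if the
       caller must enter the kernel and sleep. */
    static int sem_trydown(int *count)
    {
        int old;

        do {
            old = *count;
            if (old <= 0)
                return 0;               /* contended: go to the slow path */
        } while (!__sync_bool_compare_and_swap(count, old, old - 1));

        return 1;                       /* got it without a syscall */
    }

A mutex only has to distinguish "free" from "contended", so a blind
decrement plus a kernel-side fix-up suffices; a counting semaphore must
preserve the exact count, hence the loop.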
From: Hubertus F. <fr...@wa...> - 2002-03-04 16:51:49
On Mon, Mar 04, 2002 at 12:30:40AM +1100, Rusty Russell wrote:
> On Sat, 2 Mar 2002 09:50:31 -0500
> Hubertus Franke <fr...@wa...> wrote:

I am pretty much ready to support Rusty's latest patch. I think much of the
stuff I mentioned has been addressed, the hash collision elimination being
the biggest item. The page pinning is a great idea, and I see it as a great
advantage over what I tried to push.

> > On Wed, Feb 27, 2002 at 11:24:17AM +1100, Rusty Russell wrote:
> > > - We can no longer do generic semaphores: mutexes only.
> > > - Requires arch __update_count atomic op.
> >
> > I don't see that; you can use atomic_inc on 386s...
>
> __update_count needed to be able to do the decrement from 0 or from the
> count, whichever was greater. I have eliminated this in the new patch
> (patch III).
>
> > (a) The hashing is too simple. As a consequence there are two limitations:
> >     - The total number of locks that can be waited for is
> >       (1 << USEM_HASHBITS); that is not nearly enough (as you pointed out).
> >     - This has to be bumped up to roughly the number of tasks in the
> >       system.
>
> OK. We dealt with hash collisions by sharing waitqueues. This works, but
> means we have to do wake-all. My new patch uses its own queues: we still
> share queues, but we only wake the first one waiting on our counter.
>
> > (c) Your implementation could easily be expanded to include Ben's
> >     suggestion of multiple queues (which I already implement in my
> >     submission) by including the queue index in the hashing mechanism.
>
> Hmmm... it should be pretty easy to extend: the queues can be shared.

Correct. Should we bring that into the interface already, i.e. expand the
syscall with an additional parameter identifying the queue?

> > (f) Why get away from semaphores and stick only to mutexes? Isn't there
> >     some value in providing counting semaphores? Effectively, mutexes
> >     are a subclass of semaphores with a max count of 1.
>
> Yes, but counting semaphores need two variables, or cmpxchg. Mutexes need
> neither (proof by implementation 8). While general counting semaphores may
> be useful, the fact that they are not implemented in the kernel suggests
> they are not worth paying extra for.

Not exactly: they need one atomic word in user space and a semaphore in the
kernel, which, as you stated, uses an atomic count plus sleepers.

> > (g) I don't understand the business of rechecking in the kernel. In your
> >     case it comes cheap, as you have pinned the page already; in general
> >     that's true, since the page was just touched, so the pinning itself
> >     is reasonably cheap. Nevertheless, I am concerned that you now need
> >     intrinsic knowledge of how the lock word is used in user space. For
> >     instance, I use it in different fashions depending on whether it is
> >     a semaphore or an rwlock.
>
> This is a definite advantage: my old fd-based code never looked at the
> userspace counter; it had a kernel semaphore to sleep on for each
> userspace lock. OTOH, this involves kernel allocation, with the complexity
> that brings.

??? Kernel allocation is only required under contention. If you take a look
at how I use the atomic word for semaphores and for rwlock_t, it's
different. If you recheck in the kernel, you need to know how this word is
used; in my approach I simply allocate two queues on demand.

I am OK with only supplying semaphores and rwlocks if there is a general
consensus about it. Nevertheless, I think the multi-queue approach is
somewhat more elegant and expandable.

> > (h) How do you deal with signals, e.g. SIGIO or something like it?
>
> They're interruptible, so you'll get -ERESTARTSYS. Should be fine.

That's what I did too, but some folks pointed out that they might want to
interrupt a waiting task, so that it can get back into user space and fail
with EAGAIN or so, rather than automatically restart the syscall.

> > Overall, pinning down the page (assuming that not a lot of tasks are
> > waiting on a large number of queues at the same time) is acceptable, and
> > it cuts out one additional level of indirection that is present in my
> > submission.
>
> I agree. While originally reluctant, when it's one page per sleeping
> process it's rather neat.

Yipp, most applications that take advantage of this will allocate their
locks in a shared region anyway, i.e. there will be more than one lock per
page.

> Cheers!
> Rusty.
> --

As I said above, I am in favor of the general infrastructure you have in
place now. Can we hash out the rwlock issues again? They can still be
folded into a single (logical) queue, and you wake up either the next
writer or the next set of readers, or all readers... We need a means to
bring that through the API; I'll think about it (one possible wake-up
policy is sketched below). Great job, Rusty, pulling all these issues
together into a comprehensive patch. The few remaining issues are gravy :-)

> Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

-- Hubertus
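One possible shape for the rwlock wake-up policy mentioned above, assuming
a per-lock (logical) FIFO queue of waiters that records whether each
sleeper wants read or write access; the names are illustrative and locking
of the queue is omitted:

    #include <linux/list.h>
    #include <linux/wait.h>

    enum rw_want { WANT_READ, WANT_WRITE };

    struct rw_waiter {
        struct list_head list;              /* FIFO order of arrival */
        enum rw_want want;
        wait_queue_head_t wq;
    };

    /* On release: if a writer is at the head, wake just that writer;
       otherwise wake the run of readers queued ahead of the first writer. */
    static void rwlock_wake_next(struct list_head *queue)
    {
        struct list_head *pos;

        list_for_each(pos, queue) {
            struct rw_waiter *w = list_entry(pos, struct rw_waiter, list);

            if (w->want == WANT_WRITE) {
                if (pos == queue->next)
                    wake_up(&w->wq);        /* writer at the head gets the lock */
                break;                      /* readers behind a writer keep waiting */
            }
            wake_up(&w->wq);                /* reader ahead of any writer */
        }
    }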
From: Rusty R. <ru...@ru...> - 2002-03-05 04:38:12
On Mon, 4 Mar 2002 11:51:43 -0500 Hubertus Franke <fr...@wa...> wrote:
> > This is a definite advantage: my old fd-based code never looked at the
> > userspace counter; it had a kernel semaphore to sleep on for each
> > userspace lock. OTOH, this involves kernel allocation, with the
> > complexity that brings.
>
> ??? Kernel allocation is only required under contention. If you take a
> look at how I use the atomic word for semaphores and for rwlock_t, it's
> different. If you recheck in the kernel, you need to know how this word
> is used; in my approach I simply allocate two queues on demand.

Yes, but no allocation is even better than on-demand allocation, if you can
get away with it.

> I am OK with only supplying semaphores and rwlocks if there is a general
> consensus about it. Nevertheless, I think the multi-queue approach is
> somewhat more elegant and expandable.

Well, it's pretty easy to allow other values than -1/1 later as required.
I don't think the switch() overhead will be measurable for the first dozen
lock types 8).

> > > (h) How do you deal with signals, e.g. SIGIO or something like it?
> >
> > They're interruptible, so you'll get -ERESTARTSYS. Should be fine.
>
> That's what I did too, but some folks pointed out that they might want to
> interrupt a waiting task, so that it can get back into user space and fail
> with EAGAIN or so, rather than automatically restart the syscall.

Sorry, my bad. I use -EINTR, so they don't get restarted, for exactly that
reason (eg. implementing timeouts on mutexes).

Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
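A user-space sketch of the timeout idea Rusty alludes to, relying on the
wait syscall returning -EINTR instead of restarting; futex_down_slow() is
an assumed wrapper name, not an interface from the patch:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>
    #include <errno.h>

    /* assumed wrapper around the "wait" side of the lock syscall; returns
       -1 with errno == EINTR when a signal interrupts the sleep */
    extern int futex_down_slow(int *uaddr);

    static void timeout_handler(int sig) { (void)sig; }    /* only interrupts the syscall */

    /* Returns 0 if the lock was acquired, -1 with errno == EINTR on timeout. */
    int futex_down_timeout(int *uaddr, unsigned int seconds)
    {
        struct sigaction sa;
        int rc;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = timeout_handler;    /* note: no SA_RESTART */
        sigaction(SIGALRM, &sa, NULL);

        alarm(seconds);
        rc = futex_down_slow(uaddr);
        alarm(0);
        return rc;
    }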
From: Hubertus F. <fr...@wa...> - 2002-02-27 22:08:39
To follow up on this: I have posted the latest version under

    http://prdownloads.sourceforge.net/lse/ulocks-2.4.17.tar.bz2

- I followed the suggestion below to minimize the ulock_t data structure.
  To do so, go to ./make.setup and comment out the -DLOCK_STATISTICS flag;
  this brings ulock_t down to 2 words (otherwise 28 bytes).
- The ulockflex program will allocate all locks on cacheline boundaries
  regardless of their size (sketched below).
- I fixed a problem in ulockflex that was introduced in the last version.
- I now have a number of queues instead of an explicit lock type, so the
  kernel only knows about the number of queues to maintain.

-- Hubertus Franke

On Wed, Feb 27, 2002 at 08:33:07PM +0300, Joshua MacDonald wrote:
> On Wed, Feb 27, 2002 at 11:58:42AM -0500, Hubertus Franke wrote:
> > On Wed, Feb 27, 2002 at 07:38:34PM +0300, Joshua MacDonald wrote:
> > > Hi Hubertus,
> > >
> > > I have a question for you about these semaphores. I took a glance at
> > > <linux/ulocks.h> to try and find out how large an object the
> > > user-space lock is. I see that you have it written like this:
> > >
> > > typedef struct ulock_t {
> > >     unsigned long status;
> > >     unsigned long type;
> > >     unsigned long counters[ULOCK_STAT_COUNTERS];
> > >     char pad[SMP_CACHE_BYTES -
> > >              (ULOCK_STAT_COUNTERS+2)*sizeof(unsigned long)];
> > > } ulock_t ____cacheline_aligned;
> > >
> > > I would like to suggest that you offer a version of the ulock_t that
> > > is as small as possible, so that the user can make use of the entire
> > > cacheline rather than waste it on padding.
> > >
> > > [...]
> > >
> > > Do you see my point? You can find my latest skiplist code at:
> >
> > I have seen the light. Seriously, I initially had it as you said, but
> > Linus strongly recommended cacheline-sized objects to avoid false
> > sharing on anything other than the lock word. However, there should be
> > no problem whatsoever bringing that back down to 2 words.
>
> Sounds good. There is nothing to prevent the declaration of a ulock_t
> variable from using __cacheline_aligned, so I think Linus is off on this
> one.
>
> I'd like to test this out sometime. The SLPC code uses a lot of CPP
> trickery to configure itself with various different read/write or
> exclusive locking packages, so it's fairly easy to port to a new locking
> primitive.
>
> -josh
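A minimal sketch of allocating locks on cacheline boundaries regardless of
their size, as ulockflex is described as doing; the 64-byte constant and
the helper name are assumptions (the real program presumably uses the
build machine's SMP_CACHE_BYTES):

    #include <stdlib.h>

    #define CACHE_LINE_SIZE 64      /* assumed line size for the target machine */

    /* Give each of n locks its own cache line, independent of sizeof(ulock_t),
       so that contention on one lock never causes false sharing on another. */
    void *alloc_lock_array(size_t n)
    {
        void *p = NULL;

        if (posix_memalign(&p, CACHE_LINE_SIZE, n * CACHE_LINE_SIZE) != 0)
            return NULL;
        return p;                   /* lock i lives at (char *)p + i * CACHE_LINE_SIZE */
    }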