From: Andrea A. <an...@qu...> - 2008-05-07 23:45:17
|
On Thu, May 08, 2008 at 09:28:38AM +1000, Benjamin Herrenschmidt wrote: > > On Thu, 2008-05-08 at 00:44 +0200, Andrea Arcangeli wrote: > > > > Please note, we can't allow a thread to be in the middle of > > zap_page_range while mmu_notifier_register runs. > > You said yourself that mmu_notifier_register can be as slow as you > want ... what about you use stop_machine for it ? I'm not even joking > here :-) We can put a cap of time + a cap of vmas. It's not important if it fails; in the only useful case we know of, it won't be slow at all. The failure can happen because the cap of time or the cap of vmas triggers, or because there's a vmalloc shortage. We handle the failure in userland of course. There are zillions of allocations needed anyway, any one of them can fail, so this isn't a new fail path, it's the same fail path that always existed before mmu_notifiers existed. I can't possibly see how adding a new global lock that forces all truncates to be serialized against each other, practically eliminating the need of the i_mmap_lock, could be superior to an approach that doesn't cause any overhead to the VM at all, and only requires kvm to pay an additional cost at startup. Furthermore, the only reason I had to implement mm_lock was to fix the invalidate_range_start/end model; if we go with only invalidate_page and invalidate_pages called inside the PT lock and we use the PT lock to serialize, we don't need a mm_lock anymore and no new lock from the VM either. I tried to push for that, but everyone else wanted invalidate_range_start/end. I did the only thing possible: make invalidate_range_start safe, to make everyone happy without slowing down the VM. |
From: Linus T. <tor...@li...> - 2008-05-08 00:04:40
|
On Wed, 7 May 2008, Christoph Lameter wrote: > > Multiple vmas may share the same mapping or refer to the same anonymous > vma. The above code will deadlock since we may take some locks multiple > times. Ok, so that actually _is_ a problem. It would be easy enough to also add just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's suggestion of just using stop-machine is actually the right one just because it's _so_ simple. (That said, we're not running out of vm flags yet, and if we were, we could just add another word. We're already wasting that space right now on 64-bit by calling it "unsigned long"). Linus |
From: Robin H. <ho...@sg...> - 2008-05-08 00:52:51
|
On Wed, May 07, 2008 at 05:03:30PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Christoph Lameter wrote: > > > > Multiple vmas may share the same mapping or refer to the same anonymous > > vma. The above code will deadlock since we may take some locks multiple > > times. > > Ok, so that actually _is_ a problem. It would be easy enough to also add > just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing > a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's > suggestion of just using stop-machine is actually the right one just > because it's _so_ simple. Also, stop-machine will not work if we come to the conclusion that i_mmap_lock and anon_vma->lock need to be sleepable locks. Thanks, Robin Holt |
From: Robin H. <ho...@sg...> - 2008-05-08 00:38:34
|
On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote: > On Wed, 7 May 2008, Andrea Arcangeli wrote: > > > > I think the spinlock->rwsem conversion is ok under config option, as > > you can see I complained myself to various of those patches and I'll > > take care they're in a mergeable state the moment I submit them. What > > XPMEM requires are different semantics for the methods, and we never > > had to do any blocking I/O during vmtruncate before, now we have to. > > I really suspect we don't really have to, and that it would be better to > just fix the code that does that. That fix is going to be fairly difficult. I will argue impossible. First, a little background. SGI allows one large numa-link connected machine to be broken into separate single-system images which we call partitions. XPMEM allows, at its most extreme, one process on one partition to grant access to a portion of its virtual address range to processes on another partition. Those processes can then fault pages and directly share the memory. In order to invalidate the remote page table entries, we need to message (uses XPC) to the remote side. The remote side needs to acquire the importing process's mmap_sem and call zap_page_range(). Between the messaging and the acquiring of a sleeping lock, I would argue this will require sleeping locks in the path prior to the mmu_notifier invalidate_* callouts(). On a side note, we currently have XPMEM working on x86_64 SSI, and ia64 cross-partition. We are in the process of getting XPMEM working on x86_64 cross-partition in support of UV. Thanks, Robin Holt |
From: Linus T. <tor...@li...> - 2008-05-08 00:56:21
|
On Wed, 7 May 2008, Robin Holt wrote: > > In order to invalidate the remote page table entries, we need to message > (uses XPC) to the remote side. The remote side needs to acquire the > importing process's mmap_sem and call zap_page_range(). Between the > messaging and the acquiring of a sleeping lock, I would argue this will > require sleeping locks in the path prior to the mmu_notifier invalidate_* > callouts(). You simply will *have* to do it without locally holding all the MM spinlocks. Because quite frankly, slowing down all the normal VM stuff for some really esoteric hardware simply isn't acceptable. We just don't do it. So what is it that actually triggers one of these events? The most obvious solution is to just queue the affected pages while holding the spinlocks (perhaps locking them locally), and then handling all the stuff that can block after releasing things. That's how we normally do these things, and it works beautifully, without making everything slower. Sometimes we go to extremes, and actually break the locks and restart (ugh), and it gets ugly, but even that tends to be preferable to using the wrong locking. The thing is, spinlocks really kick ass. Yes, they restrict what you can do within them, but if 99.99% of all work is non-blocking, then the really odd rare blocking case is the one that needs to accommodate, not the rest. Linus |
From: Christoph L. <cla...@sg...> - 2008-05-08 00:56:11
|
On Wed, 7 May 2008, Linus Torvalds wrote: > On Wed, 7 May 2008, Christoph Lameter wrote: > > > > Multiple vmas may share the same mapping or refer to the same anonymous > > vma. The above code will deadlock since we may take some locks multiple > > times. > > Ok, so that actually _is_ a problem. It would be easy enough to also add > just a flag to the vma (VM_MULTILOCKED), which is still cleaner than doing > a vmalloc and a whole sort thing, but if this is really rare, maybe Ben's > suggestion of just using stop-machine is actually the right one just > because it's _so_ simple. Set the vma flag when we locked it and then skip when we find it locked, right? This would be in addition to the global lock? stop-machine would work for KVM since it's a once-in-a-guest-OS-lifetime thing. But GRU, KVM and eventually Infiniband need the ability to attach in a reasonable timeframe without causing major hiccups for other processes. > (That said, we're not running out of vm flags yet, and if we were, we > could just add another word. We're already wasting that space right now on > 64-bit by calling it "unsigned long"). We sure have enough flags. |
From: Linus T. <tor...@li...> - 2008-05-08 01:07:51
|
On Wed, 7 May 2008, Christoph Lameter wrote: > > Set the vma flag when we locked it and then skip when we find it locked > right? This would be in addition to the global lock? Yes. And clear it before unlocking (and again, testing if it's already clear - you mustn't unlock twice, so you must only unlock when the bit was set). You also (obviously) need to have something that guarantees that the lists themselves are stable over the whole sequence, but I assume you already have mmap_sem for reading (since you'd need it anyway just to follow the list). And if you have it for writing, it can obviously *act* as the global lock, since it would already guarantee mutual exclusion on that mm->mmap list. Linus |
From: Linus T. <tor...@li...> - 2008-05-08 01:03:33
|
On Thu, 8 May 2008, Andrea Arcangeli wrote: > Hi Andrew, > > On Wed, May 07, 2008 at 03:59:14PM -0700, Andrew Morton wrote: > > CPU0: CPU1: > > > > spin_lock(global_lock) > > spin_lock(a->lock); spin_lock(b->lock); > ================== mmu_notifier_register() If mmu_notifier_register() takes the global lock, it cannot happen here. It will be blocked (by CPU0), so there's no way it can then cause an ABBA deadlock. It will be released when CPU0 has taken *all* the locks it needed to take. > What we can do is to replace the mm_lock with a > spin_lock(&global_lock) only if all places that takes i_mmap_lock NO! You replace mm_lock() with the sequence that Andrew gave you (and I described): spin_lock(&global_lock) .. get all locks UNORDERED .. spin_unlock(&global_lock) and you're now done. You have your "mm_lock()" (which still needs to be renamed - it should be a "mmu_notifier_lock()" or something like that), but you don't need the insane sorting. At most you apparently need a way to recognize duplicates (so that you don't deadlock on yourself), which looks like a simple bit-per-vma. The global lock doesn't protect any data structures itself - it just protects two of these mm_lock() functions from ABBA'ing on each other! Linus |
From: Andrea A. <an...@qu...> - 2008-05-08 01:26:52
|
On Wed, May 07, 2008 at 06:02:49PM -0700, Linus Torvalds wrote: > You replace mm_lock() with the sequence that Andrew gave you (and I > described): > > spin_lock(&global_lock) > .. get all locks UNORDERED .. > spin_unlock(&global_lock) > > and you're now done. You have your "mm_lock()" (which still needs to be > renamed - it should be a "mmu_notifier_lock()" or something like that), > but you don't need the insane sorting. At most you apparently need a way > to recognize duplicates (so that you don't deadlock on yourself), which > looks like a simple bit-per-vma. > > The global lock doesn't protect any data structures itself - it just > protects two of these mm_lock() functions from ABBA'ing on each other! I thought the thing to remove was the "get all locks". I didn't realize the major problem was only the sorting of the array. I'll add the global lock; it's worth it, as it drops the worst case number of steps by log(65536) times. Furthermore, surely two concurrent mmu_notifier_lock calls will run faster, as it'll decrease the cacheline collisions. Since you asked to call it mmu_notifier_lock, I'll also move it to mmu_notifier.[ch] as a consequence. |
From: Peter Z. <a.p...@ch...> - 2008-05-09 18:37:35
|
On Thu, 2008-05-08 at 09:11 -0700, Linus Torvalds wrote: > > On Thu, 8 May 2008, Linus Torvalds wrote: > > > > Also, we'd need to make it > > > > unsigned short flag:1; > > > > _and_ change spinlock_types.h to make the spinlock size actually match the > > required size (right now we make it an "unsigned int slock" even when we > > actually only use 16 bits). > > Btw, this is an issue only on 32-bit x86, because on 64-bit one we already > have the padding due to the alignment of the 64-bit pointers in the > list_head (so there's already empty space there). > > On 32-bit, the alignment of list-head is obviously just 32 bits, so right > now the structure is "perfectly packed" and doesn't have any empty space. > But that's just because the spinlock is unnecessarily big. > > (Of course, if anybody really uses NR_CPUS >= 256 on 32-bit x86, then the > structure really will grow. That's a very odd configuration, though, and > not one I feel we really need to care about). Another possibility, would something like this work?

/*
 * null out the begin function, no new begin calls can be made
 */
rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL);

/*
 * lock/unlock all rmap locks in any order - this ensures that any
 * pending start() will have its end() function called.
 */
mm_barrier(mm);

/*
 * now that no new start() call can be made and all start()/end() pairs
 * are complete we can remove the notifier.
 */
mmu_notifier_remove(mm, my_notifier);

This requires a mmu_notifier instance per attached mm, and that __mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain the function. But I think it's enough to ensure that: for each start, an end will be called. It can however happen that end is called without start - but we could handle that, I think. |
From: Andrea A. <an...@qu...> - 2008-05-09 18:55:49
|
On Fri, May 09, 2008 at 08:37:29PM +0200, Peter Zijlstra wrote: > Another possibility, would something like this work? > > > /* > * null out the begin function, no new begin calls can be made > */ > rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL); > > /* > * lock/unlock all rmap locks in any order - this ensures that any > * pending start() will have its end() function called. > */ > mm_barrier(mm); > > /* > * now that no new start() call can be made and all start()/end() pairs > * are complete we can remove the notifier. > */ > mmu_notifier_remove(mm, my_notifier); > > > This requires a mmu_notifier instance per attached mm and that > __mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain > the function. > > But I think it's enough to ensure that: > > for each start an end will be called We don't need that, it's perfectly ok if start is called but end is not, it's ok to unregister in the middle as I guarantee ->release is called before mmu_notifier_unregister returns (if ->release is needed at all, not the case for KVM/GRU). Unregister is already solved with srcu/rcu without any additional complication as we don't need the guarantee that for each start an end will be called. > It can however happen that end is called without start - but we could > handle that I think. The only reason mm_lock() was introduced is to solve "register", to guarantee that for each end there was a start. We can't handle end called without start in the driver. The reason the driver must be prevented from registering in the middle of start/end is that if it ever happens, the driver has no way to know it must stop the secondary mmu page faults from calling get_user_pages and instantiating sptes/secondary tlbs on pages that will be freed as soon as zap_page_range starts. |
From: Peter Z. <a.p...@ch...> - 2008-05-09 19:05:27
|
On Fri, 2008-05-09 at 20:55 +0200, Andrea Arcangeli wrote: > On Fri, May 09, 2008 at 08:37:29PM +0200, Peter Zijlstra wrote: > > Another possibility, would something like this work? > > > > > > /* > > * null out the begin function, no new begin calls can be made > > */ > > rcu_assign_pointer(my_notifier.invalidate_start_begin, NULL); > > > > /* > > * lock/unlock all rmap locks in any order - this ensures that any > > * pending start() will have its end() function called. > > */ > > mm_barrier(mm); > > > > /* > > * now that no new start() call can be made and all start()/end() pairs > > * are complete we can remove the notifier. > > */ > > mmu_notifier_remove(mm, my_notifier); > > > > > > This requires a mmu_notifier instance per attached mm and that > > __mmu_notifier_invalidate_range_start() uses rcu_dereference() to obtain > > the function. > > > > But I think it's enough to ensure that: > > > > for each start an end will be called > > We don't need that, it's perfectly ok if start is called but end is > not, it's ok to unregister in the middle as I guarantee ->release is > called before mmu_notifier_unregister returns (if ->release is needed > at all, not the case for KVM/GRU). > > Unregister is already solved with srcu/rcu without any additional > complication as we don't need the guarantee that for each start an end > will be called. > > > It can however happen that end is called without start - but we could > > handle that I think. > > The only reason mm_lock() was introduced is to solve "register", to > guarantee that for each end there was a start. We can't handle end > called without start in the driver. > > The reason the driver must be prevented from registering in the middle of > start/end is that if it ever happens, the driver has no way to know it > must stop the secondary mmu page faults from calling get_user_pages and > instantiating sptes/secondary tlbs on pages that will be freed as soon as > zap_page_range starts. Right - then I got it backwards.
Never mind me then.. |
From: Andrea A. <an...@qu...> - 2008-05-08 01:34:54
|
Sorry for not having completely answered to this. I initially thought stop_machine could work when you mentioned it, but I don't think it can, even after removing xpmem's block-inside-mmu-notifier-method requirements. For stop_machine to solve this (besides being slower and potentially not safer, as running stop_machine in a loop isn't nice), we'd need to prevent preemption in between invalidate_range_start/end. I think there are two ways: 1) add a global lock around mm_lock to remove the sorting 2) remove invalidate_range_start/end, nuke mm_lock as a consequence of it, and replace all three with invalidate_pages issued inside the PT lock, one invalidation for each 512 pte_t modified, so serialization against get_user_pages becomes trivial, but this will not be ok at all for SGI as it increases their invalidation frequency a lot For KVM both ways are almost the same. I'll implement 1 now then we'll see... |
From: Nick P. <nic...@ya...> - 2008-05-13 12:07:00
|
On Thursday 08 May 2008 10:38, Robin Holt wrote: > On Wed, May 07, 2008 at 02:36:57PM -0700, Linus Torvalds wrote: > > On Wed, 7 May 2008, Andrea Arcangeli wrote: > > > I think the spinlock->rwsem conversion is ok under config option, as > > > you can see I complained myself to various of those patches and I'll > > > take care they're in a mergeable state the moment I submit them. What > > > XPMEM requires are different semantics for the methods, and we never > > > had to do any blocking I/O during vmtruncate before, now we have to. > > > > I really suspect we don't really have to, and that it would be better to > > just fix the code that does that. > > That fix is going to be fairly difficult. I will argue impossible. > > First, a little background. SGI allows one large numa-link connected > machine to be broken into separate single-system images which we call > partitions. > > XPMEM allows, at its most extreme, one process on one partition to > grant access to a portion of its virtual address range to processes on > another partition. Those processes can then fault pages and directly > share the memory. > > In order to invalidate the remote page table entries, we need to message > (uses XPC) to the remote side. The remote side needs to acquire the > importing process's mmap_sem and call zap_page_range(). Between the > messaging and the acquiring of a sleeping lock, I would argue this will > require sleeping locks in the path prior to the mmu_notifier invalidate_* > callouts(). Why do you need to take mmap_sem in order to shoot down pagetables of the process? It would be nice if this can just be done without sleeping. |
From: Nick P. <nic...@ya...> - 2008-05-13 12:14:33
|
On Thursday 08 May 2008 11:34, Andrea Arcangeli wrote: > Sorry for not having completely answered to this. I initially thought > stop_machine could work when you mentioned it, but I don't think it > can, even after removing xpmem's block-inside-mmu-notifier-method requirements. > > For stop_machine to solve this (besides being slower and potentially > not safer, as running stop_machine in a loop isn't nice), we'd need > to prevent preemption in between invalidate_range_start/end. > > I think there are two ways: > > 1) add a global lock around mm_lock to remove the sorting > > 2) remove invalidate_range_start/end, nuke mm_lock as a consequence of > it, and replace all three with invalidate_pages issued inside the > PT lock, one invalidation for each 512 pte_t modified, so > serialization against get_user_pages becomes trivial, but this will > not be ok at all for SGI as it increases their invalidation > frequency a lot This is what I suggested to begin with before this crazy locking was developed to handle these corner cases... because I wanted the locking to match with the tried and tested Linux core mm/ locking rather than introducing this new idea. I don't see why you're bending over so far backwards to accommodate this GRU thing that we don't even have numbers for and could actually potentially be batched up in other ways (eg. using mmu_gather or a mmu_gather-like idea). The bare-essential, matches-with-Linux-mm mmu notifiers patch of yours that I first saw was pretty elegant and nice. The idea that "only one solution must go in and handle everything perfectly" is stupid, because it is quite obvious that the sleeping invalidate idea is just an order of magnitude or two more complex than the simple atomic invalidates needed by you. We should and could easily have had that code upstream long ago :( I'm not saying we ignore the sleeping or batching cases, but we should introduce the ideas slowly and carefully and assess the pros and cons of each step along the way.
> > For KVM both ways are almost the same. > > I'll implement 1 now then we'll see... |
From: Robin H. <ho...@sg...> - 2008-05-13 15:32:38
|
On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote: > On Thursday 08 May 2008 10:38, Robin Holt wrote: > > In order to invalidate the remote page table entries, we need to message > > (uses XPC) to the remote side. The remote side needs to acquire the > > importing process's mmap_sem and call zap_page_range(). Between the > > messaging and the acquiring a sleeping lock, I would argue this will > > require sleeping locks in the path prior to the mmu_notifier invalidate_* > > callouts(). > > Why do you need to take mmap_sem in order to shoot down pagetables of > the process? It would be nice if this can just be done without > sleeping. We are trying to shoot down page tables of a different process running on a different instance of Linux running on Numa-link connected portions of the same machine. The messaging is clearly going to require sleeping. Are you suggesting we need to rework XPC communications to not require sleeping? I think that is going to be impossible since the transfer engine requires a sleeping context. Additionally, the call to zap_page_range expects to have the mmap_sem held. I suppose we could use something other than zap_page_range and atomically clear the process page tables. Doing that will not alleviate the need to sleep for the messaging to the other partitions. Thanks, Robin |
From: Nick P. <np...@su...> - 2008-05-14 04:11:25
|
On Tue, May 13, 2008 at 10:32:38AM -0500, Robin Holt wrote: > On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote: > > On Thursday 08 May 2008 10:38, Robin Holt wrote: > > > In order to invalidate the remote page table entries, we need to message > > > (uses XPC) to the remote side. The remote side needs to acquire the > > > importing process's mmap_sem and call zap_page_range(). Between the > > > messaging and the acquiring of a sleeping lock, I would argue this will > > > require sleeping locks in the path prior to the mmu_notifier invalidate_* > > > callouts(). > > > > Why do you need to take mmap_sem in order to shoot down pagetables of > > the process? It would be nice if this can just be done without > > sleeping. > > We are trying to shoot down page tables of a different process running > on a different instance of Linux running on Numa-link connected portions > of the same machine. Right. You can zap page tables without sleeping, if you're careful. I don't know that we quite do that for anonymous pages at the moment, but it should be possible with a bit of thought, I believe. > The messaging is clearly going to require sleeping. Are you suggesting > we need to rework XPC communications to not require sleeping? I think > that is going to be impossible since the transfer engine requires a > sleeping context. I guess that you have found a way to perform TLB flushing within coherent domains over the numalink interconnect without sleeping. I'm sure it would be possible to send similar messages between non-coherent domains. So yes, I'd much rather rework such a highly specialized system to fit in closer with Linux than rework Linux to fit with these machines (and apparently slow everyone else down). > Additionally, the call to zap_page_range expects to have the mmap_sem > held. I suppose we could use something other than zap_page_range and > atomically clear the process page tables. zap_page_range does not expect to have mmap_sem held. I think for anon pages it is always called with mmap_sem held; however, try_to_unmap_anon is not (although it expects the page lock to be held, I think we should be able to avoid that). > Doing that will not alleviate > the need to sleep for the messaging to the other partitions. No, but I'd venture to guess that it is not impossible to implement even on your current hardware (maybe a firmware update is needed)? |
From: Benjamin H. <be...@ke...> - 2008-05-14 05:44:58
|
On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote: > I don't see why you're bending over so far backwards to accommodate > this GRU thing that we don't even have numbers for and could actually > potentially be batched up in other ways (eg. using mmu_gather or > mmu_gather-like idea). I agree, we're better off generalizing the mmu_gather batching instead... I had some never-finished patches to use the mmu_gather for pretty much everything except single page faults, tho various subtle differences between archs and lack of time caused me to let them gather dust and not finish them... I can try to dig some of that out when I'm back from my current travel, though it's probably worth re-doing from scratch now. Ben. |
From: Nick P. <np...@su...> - 2008-05-14 06:06:11
|
On Tue, May 13, 2008 at 10:43:59PM -0700, Benjamin Herrenschmidt wrote: > > On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote: > > I don't see why you're bending over so far backwards to accommodate > > this GRU thing that we don't even have numbers for and could actually > > potentially be batched up in other ways (eg. using mmu_gather or > > mmu_gather-like idea). > > I agree, we're better off generalizing the mmu_gather batching > instead... Well, the first thing would be just to get rid of the whole start/end idea, which completely departs from the standard Linux system of clearing ptes, then flushing TLBs, then freeing memory. The onus would then be on GRU to come up with some numbers to justify batching, and a patch which works nicely with the rest of the Linux mm. And yes, mmu-gather is *the* obvious first choice of places to look if one wanted batching hooks. > I had some never-finished patches to use the mmu_gather for pretty much > everything except single page faults, tho various subtle differences > between archs and lack of time caused me to let them gather dust and > not finish them... > > I can try to dig some of that out when I'm back from my current travel, > though it's probably worth re-doing from scratch now. I always liked the idea as you know. But I don't think that should be mixed in with the first iteration of the mmu notifiers patch anyway. GRU actually can work without batching, but there is simply some (unquantified to me) penalty for not batching it. I think it is far better to put in a clean and simple piece of working functionality first. The idea that we have to unload some monster be-all-and-end-all solution onto mainline in a single go seems counterproductive to me. |
From: Jack S. <st...@sg...> - 2008-05-14 13:16:08
|
On Tue, May 13, 2008 at 10:43:59PM -0700, Benjamin Herrenschmidt wrote: > On Tue, 2008-05-13 at 22:14 +1000, Nick Piggin wrote: > > I don't see why you're bending over so far backwards to accommodate > > this GRU thing that we don't even have numbers for and could actually > > potentially be batched up in other ways (eg. using mmu_gather or > > mmu_gather-like idea). > > I agree, we're better off generalizing the mmu_gather batching > instead... Unfortunately, we are at least several months away from being able to provide numbers to justify batching - assuming it is really needed. We need large systems running real user workloads. I wish we had that available right now, but we don't. It also depends on what you mean by "no batching". If you mean that the notifier gets called for each pte that is removed from the page table, then the overhead is clearly very high for some operations. Consider the unmap of a very large object. A TLB flush per page will be too costly. However, something based on the mmu_gather seems like it should provide exactly what is needed to do efficient flushing of the TLB. The GRU does not require that it be called in a sleepable context. As long as the notifier callout provides the mmu_gather and vaddr range being flushed, the GRU can efficiently do the rest. > > I had some never-finished patches to use the mmu_gather for pretty much > everything except single page faults, tho various subtle differences > between archs and lack of time caused me to let them gather dust and > not finish them... > > I can try to dig some of that out when I'm back from my current travel, > though it's probably worth re-doing from scratch now. > > Ben. > -- jack |
From: Robin H. <ho...@sg...> - 2008-05-14 11:26:22
|
On Wed, May 14, 2008 at 06:11:22AM +0200, Nick Piggin wrote: > On Tue, May 13, 2008 at 10:32:38AM -0500, Robin Holt wrote: > > On Tue, May 13, 2008 at 10:06:44PM +1000, Nick Piggin wrote: > > > On Thursday 08 May 2008 10:38, Robin Holt wrote: > > > > In order to invalidate the remote page table entries, we need to message > > > > (uses XPC) to the remote side. The remote side needs to acquire the > > > > importing process's mmap_sem and call zap_page_range(). Between the > > > > messaging and the acquiring of a sleeping lock, I would argue this will > > > > require sleeping locks in the path prior to the mmu_notifier invalidate_* > > > > callouts(). > > > > > > Why do you need to take mmap_sem in order to shoot down pagetables of > > > the process? It would be nice if this can just be done without > > > sleeping. > > > > We are trying to shoot down page tables of a different process running > > on a different instance of Linux running on Numa-link connected portions > > of the same machine. > > Right. You can zap page tables without sleeping, if you're careful. I > don't know that we quite do that for anonymous pages at the moment, but it > should be possible with a bit of thought, I believe. > > > > The messaging is clearly going to require sleeping. Are you suggesting > > we need to rework XPC communications to not require sleeping? I think > > that is going to be impossible since the transfer engine requires a > > sleeping context. > > I guess that you have found a way to perform TLB flushing within coherent > domains over the numalink interconnect without sleeping. I'm sure it would > be possible to send similar messages between non-coherent domains. I assume by coherent domains, you are actually talking about system images. Our memory coherence domain on the 3700 family is 512 processors on 128 nodes. On the 4700 family, it is 16,384 processors on 4096 nodes.
We extend a "Read-Exclusive" mode beyond the coherence domain so any processor is able to read any cacheline on the system. We also provide uncached access for certain types of memory beyond the coherence domain. For the other partitions, the exporting partition does not know at what virtual address the imported pages are mapped. The pages are frequently mapped in a different order by the MPI library to help with MPI collective operations. For the exporting side to do those TLB flushes, we would need to replicate all that importing information back to the exporting side. Additionally, the hardware that does the TLB flushing is protected by a spinlock on each system image. We would need to change that simple spinlock into a type of hardware lock that would work (on 3700) outside the processor's coherence domain. The only way to do that is to use uncached addresses with our Atomic Memory Operations which do the cmpxchg at the memory controller. The uncached accesses are an order of magnitude or more slower. > So yes, I'd much rather rework such a highly specialized system to fit in > closer with Linux than rework Linux to fit with these machines (and > apparently slow everyone else down). But it isn't that we are having a problem adapting to just the hardware. One of the limiting factors is Linux on the other partition. > > Additionally, the call to zap_page_range expects to have the mmap_sem > > held. I suppose we could use something other than zap_page_range and > > atomically clear the process page tables. > > zap_page_range does not expect to have mmap_sem held. I think for anon > pages it is always called with mmap_sem, however try_to_unmap_anon is > not (although it expects page lock to be held, I think we should be able > to avoid that). zap_page_range calls unmap_vmas which walks to vma->next. Are you saying that can be walked without grabbing the mmap_sem at least for reading? I feel my understanding of list management and locking completely shifting.
> > Doing that will not alleviate the need to sleep for the messaging to
> > the other partitions.
>
> No, but I'd venture to guess that is not impossible to implement even
> on your current hardware (maybe a firmware update is needed)?

Are you suggesting the sending side would not need to sleep, or the
receiving side? Assuming you meant the sender, it spins waiting for the
remote side to acknowledge the invalidate request? We place the data into
a previously agreed-upon buffer and send an interrupt. At this point, we
would need to start spinning and waiting for completion. Let's assume we
never run out of buffer space.

The receiving side receives an interrupt. The interrupt currently wakes
an XPC thread to do the work of transferring and delivering the message
to XPMEM. The transfer of the data, which XPC does using the BTE engine,
takes up to 28 seconds to time out (a hardware timeout before raising an
error), and the BTE code automatically retries on certain types of
failure. We currently need to grab semaphores which _MAY_ be able to be
reworked into other types of locks.

Thanks, Robin |
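The handshake described above — deposit data in an agreed-upon buffer, raise an interrupt, spin until the receiver acknowledges — can be modeled in miniature. This is an illustrative single-threaded sketch with invented names (`mailbox`, `send_invalidate`, `service_interrupt`), not XPC's actual interface; the sender's busy-wait is represented by polling `send_done()`:

```c
#include <stdatomic.h>

static int mailbox;          /* the previously agreed-upon buffer */
static atomic_int req;       /* the "interrupt" line, raised by the sender */
static atomic_int ack;       /* completion flag, set by the receiver */

/* Sender: deposit the invalidate request, then raise the interrupt.
 * On real hardware the sender would now spin on the ack. */
static void send_invalidate(int data)
{
    mailbox = data;
    atomic_store(&req, 1);
}

/* One iteration of the sender's spin loop: burns CPU, never sleeps. */
static int send_done(void)
{
    return atomic_load(&ack);
}

/* Receiver: the interrupt handler transfers and delivers the message
 * (the part XPC currently hands off to a sleeping thread), then
 * acknowledges so the sender's spin loop can exit. Returns -1 if no
 * request is pending. */
static int service_interrupt(void)
{
    int data;
    if (!atomic_load(&req))
        return -1;
    data = mailbox;
    atomic_store(&ack, 1);
    return data;
}
```

The open question in the thread is exactly the `service_interrupt()` half: whether the transfer-and-deliver step can be done without sleeping, given the BTE engine's long timeouts and retries.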
From: Linus T. <tor...@li...> - 2008-05-14 15:18:58
|
On Wed, 14 May 2008, Robin Holt wrote:
>
> Are you suggesting the sending side would not need to sleep or the
> receiving side?

One thing to realize is that most of the time (read: pretty much *always*)
when we have the problem of wanting to sleep inside a spinlock, the
solution is actually to just move the sleeping to outside the lock, and
then have something else that serializes things.

That way, the core code (protected by the spinlock, and in all the hot
paths) doesn't sleep, but the special-case code (that wants to sleep) can
have some other model of serialization that allows sleeping, and that
includes as a small part the spinlocked region.

I do not know how XPMEM actually works, or how you use it, but it
seriously sounds like that is how things *should* work. And yes, that
probably means that the mmu notifiers as they are now are simply not
workable: they'd need to be moved up so that they are inside the mmap
semaphore but not the spinlocks.

Can it be done? I don't know. But I do know that I'm unlikely to accept a
noticeable slowdown in some very core code for a case that affects about
0.00001% of the population. In other words, I think you *have* to do it.

Linus |
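The pattern Linus describes — only non-sleeping bookkeeping under the spinlock, with the sleeping work moved outside it — is commonly implemented by detaching the pending work under the lock and processing it after the lock is dropped. A minimal userspace sketch (all names invented; an `atomic_flag` stands in for the kernel spinlock):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct work { int id; struct work *next; };

static atomic_flag lock = ATOMIC_FLAG_INIT; /* the hot-path "spinlock" */
static struct work *pending;                /* protected by lock */

static void spin_lock(void)   { while (atomic_flag_test_and_set(&lock)) ; }
static void spin_unlock(void) { atomic_flag_clear(&lock); }

/* Hot path: queue work under the spinlock; never sleeps. */
static void queue_work(int id)
{
    struct work *w = malloc(sizeof(*w));
    w->id = id;
    spin_lock();
    w->next = pending;
    pending = w;
    spin_unlock();
}

/* Slow path: detach the whole list under the spinlock, then drop the
 * lock and process the private batch. Sleeping (messaging a remote
 * partition, waiting for acks) is fine here because no spinlock is
 * held any more. Returns the number of items processed. */
static int process_pending(void)
{
    struct work *batch, *next;
    int n = 0;

    spin_lock();
    batch = pending;
    pending = NULL;
    spin_unlock();

    for (; batch; batch = next) {
        next = batch->next;
        /* sleeping work would go here */
        free(batch);
        n++;
    }
    return n;
}
```

The "something else that serializes things" is then whatever ordering the slow path imposes on its own batches, not the spinlock itself.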
From: Robin H. <ho...@sg...> - 2008-05-14 16:22:25
|
On Wed, May 14, 2008 at 08:18:21AM -0700, Linus Torvalds wrote:
>
> On Wed, 14 May 2008, Robin Holt wrote:
> >
> > Are you suggesting the sending side would not need to sleep or the
> > receiving side?
>
> One thing to realize is that most of the time (read: pretty much *always*)
> when we have the problem of wanting to sleep inside a spinlock, the
> solution is actually to just move the sleeping to outside the lock, and
> then have something else that serializes things.
>
> That way, the core code (protected by the spinlock, and in all the hot
> paths) doesn't sleep, but the special case code (that wants to sleep) can
> have some other model of serialization that allows sleeping, and that
> includes as a small part the spinlocked region.
>
> I do not know how XPMEM actually works, or how you use it, but it
> seriously sounds like that is how things *should* work. And yes, that
> probably means that the mmu-notifiers as they are now are simply not
> workable: they'd need to be moved up so that they are inside the mmap
> semaphore but not the spinlocks.

We are in the process of attempting this now. Unfortunately for SGI,
Christoph is on vacation right now, so we have been trying to work it
internally. We are looking at two possible methods: in one, we add a
callout to the TLB flush paths for both the mmu_gather and flush_tlb_page
locations; in the other, we place a specific callout, separate from the
gather callouts, in the paths we are concerned with. We will look at both
more carefully before posting.

In either implementation, not all call paths would require the stall to
ensure data integrity. Would it be acceptable to always put a sleepable
stall in even if the code path did not require the pages be unwritable
prior to continuing? If we did that, I would be freed from having a pool
of invalidate threads ready for XPMEM to use for that work. Maybe there
is a better way, but the sleeping requirement we would have on the
threads makes most options seem unworkable.
Thanks, Robin |
From: Linus T. <tor...@li...> - 2008-05-14 16:57:10
|
On Wed, 14 May 2008, Robin Holt wrote:
>
> Would it be acceptable to always put a sleepable stall in even if the
> code path did not require the pages be unwritable prior to continuing?
> If we did that, I would be freed from having a pool of invalidate
> threads ready for XPMEM to use for that work. Maybe there is a better
> way, but the sleeping requirement we would have on the threads makes
> most options seem unworkable.

I'm not understanding the question. If you can do your management outside
of the spinlocks, then you can obviously do whatever you want, including
sleeping.

It's changing the existing spinlocks to be sleepable that is not
acceptable, because it's such a performance problem.

Linus |
From: Christoph L. <cla...@sg...> - 2008-05-14 17:57:17
|
On Wed, 14 May 2008, Linus Torvalds wrote:
> One thing to realize is that most of the time (read: pretty much *always*)
> when we have the problem of wanting to sleep inside a spinlock, the
> solution is actually to just move the sleeping to outside the lock, and
> then have something else that serializes things.

The problem is that the code in rmap.c, try_to_unmap() and friends, loops
over reverse maps after taking a spinlock. The mm_struct is only known
after the rmap has been accessed. This means *inside* the spinlock. That
is why I tried to convert the locks used to scan the reverse maps to
semaphores. If that is done, then one can indeed do the callouts outside
of atomic contexts.

> Can it be done? I don't know. But I do know that I'm unlikely to accept a
> noticeable slowdown in some very core code for a case that affects about
> 0.00001% of the population. In other words, I think you *have* to do it.

With larger numbers of processors, semaphores make a lot of sense, since
the holdoff times on spinlocks will increase. If we go to sleep, then the
processor can do something useful instead of hogging a cacheline. A rw
lock there can also increase concurrency during reclaim, especially if
the anon_vma chains and the number of address spaces mapping a page are
high. |