From: Andrea A. <an...@qu...> - 2008-04-23 17:24:48
|
On Wed, Apr 23, 2008 at 06:26:29PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
> > I guess I have to prepare another patchset then?

Apologies for my previous not too polite comment in answer to the above, but I thought this double patchset was over now that you converged on #v12 and obsoleted EMM, and after the last private discussions. There's nothing personal here on my side, just a bit of general frustration on this matter. I appreciate all the great contributions from you, most recently your idea to use sort(), but I can't see any possible benefit or justification anymore for keeping two patchsets floating around given we already converged on mmu-notifier-core, and given it's almost certain mmu-notifier-core will go into -mm in time for 2.6.26.

Let's put it this way: if I fail to merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my entire patchset and leave maintainership to you, so you can move 1/N to N/N and remove the mm_lock-sem patch (everything else can remain the same as it's all orthogonal, so changing the order is a matter of minutes). |
From: Christoph L. <cla...@sg...> - 2008-04-23 18:09:35
|
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> Implement unregister but it's not reliable, only ->release is reliable.

Why is there still the hlist stuff being used for the mmu notifier list? And why is this still unsafe? Are there cases in which you do not take the reverse map locks or mmap_sem while traversing the notifier list?

This hope for inclusion without proper review (first for .25, now for .26) seems to interfere with the patch cleanup work and cause delay after delay for getting the patch ready. On what basis do you think that there is a chance of any of these patches making it into 2.6.26, given that this patchset has never been vetted in Andrew's tree? |
From: Christoph L. <cla...@sg...> - 2008-04-23 18:19:40
|
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> The only way to avoid failing because of vmalloc space shortage or
> oom, would be to provide a O(N*N) fallback. But one that can't be
> interrupted by sigkill! sigkill interruption was ok in #v12 because we
> didn't rely on mmu_notifier_unregister to succeed. So it avoided any
> DoS but it still can't provide any reliable unregister.

If unregister fails then the driver should not detach from the address space immediately but wait until ->release is called. That may be a possible solution. It will be rare that the unregister fails. |
From: Andrea A. <an...@qu...> - 2008-04-23 18:25:19
|
On Wed, Apr 23, 2008 at 11:19:26AM -0700, Christoph Lameter wrote:
> If unregister fails then the driver should not detach from the address
> space immediately but wait until ->release is called. That may be
> a possible solution. It will be rare that the unregister fails.

This is the current idea, exactly. Unless we find a way to replace mm_lock with something else, I don't see a way to make mmu_notifier_unregister reliable without wasting ram. |
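For illustration, a minimal driver-side sketch of this "defer to ->release" idea; my_ctx, my_zap_all_sptes() and my_free_ctx() are hypothetical stand-ins, not actual KVM/GRU code:

struct my_ctx {
	struct mmu_notifier notifier;	/* embedded in the driver context */
	/* ... driver state (secondary mappings, etc.) ... */
};

/* ->release must drop every secondary mapping before the mm's pages are freed. */
static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_ctx *ctx = container_of(mn, struct my_ctx, notifier);

	my_zap_all_sptes(ctx);
	my_free_ctx(ctx);	/* teardown deferred here if unregister failed */
}

static void my_detach(struct my_ctx *ctx, struct mm_struct *mm)
{
	if (mmu_notifier_unregister(&ctx->notifier, mm)) {
		/*
		 * Rare failure (e.g. -ENOMEM in mm_lock): stay registered
		 * and let ->release do the teardown when the mm exits.
		 */
		return;
	}
	my_zap_all_sptes(ctx);
	my_free_ctx(ctx);
}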
From: Andrea A. <an...@qu...> - 2008-04-23 18:19:43
|
On Wed, Apr 23, 2008 at 11:09:35AM -0700, Christoph Lameter wrote:
> Why is there still the hlist stuff being used for the mmu notifier list?
> And why is this still unsafe?

What's the problem with hlist? It saves 8 bytes for each mm_struct; you should be using it too instead of list.

> There are cases in which you do not take the reverse map locks or mmap_sem
> while traversing the notifier list?

There aren't.

> This hope for inclusion without proper review (first for .25 now for .26)
> seems to interfere with the patch cleanup work and cause delay after delay
> for getting the patch ready. On what basis do you think that there is a
> chance of any of these patches making it into 2.6.26 given that this
> patchset has never been vetted in Andrew's tree?

Let's say I try to be optimistic and hope the right thing will happen, given this is like a new driver that can't hurt anybody but KVM and GRU if there's any bug. But in my view what interferes with proper review for .26 is the endless discussions we're doing ;). |
From: Christoph L. <cla...@sg...> - 2008-04-23 18:21:43
|
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> will go in -mm in time for 2.6.26. Let's put it this way, if I fail to
> merge mmu-notifier-core into 2.6.26 I'll voluntarily give up my entire
> patchset and leave maintainership to you so you move 1/N to N/N and
> remove mm_lock-sem patch (everything else can remain the same as it's
> all orthogonal so changing the order is a matter of minutes).

No, I really want you to do this. I have no interest in a takeover in the future and have done the EMM stuff only because I saw no other way forward. I just want this to be done the right way for all parties, with patches that are nice and mergeable. |
From: Andrea A. <an...@qu...> - 2008-04-23 18:34:28
|
On Wed, Apr 23, 2008 at 11:21:49AM -0700, Christoph Lameter wrote:
> No I really want you to do this. I have no interest in a takeover in the

Ok, if you want me to do this, I definitely prefer the core to go in now. It's so much easier to concentrate on two problems at different times than to attack both problems at the same time, given they're mostly completely orthogonal. Given we already solved one problem, I'd like to close it before concentrating on the second.

I already told you it's in my interest to support XPMEM too. For example, it was me who noticed we couldn't possibly remove the can_sleep parameter from invalidate_range without altering the locking, as vmas were unstable outside of one of the three core VM locks. That finding resulted in much bigger patches than we hoped (as Andrew had previously sort of predicted) and you did all the great work to develop those.

For my part, once the converged part is in, it'll be a lot easier to fully concentrate on the rest. My main focus right now is to produce a mmu-notifier-core that is entirely bug free for .26. |
From: Christoph L. <cla...@sg...> - 2008-04-23 18:27:23
|
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> On Wed, Apr 23, 2008 at 11:09:35AM -0700, Christoph Lameter wrote:
> > Why is there still the hlist stuff being used for the mmu notifier list?
> > And why is this still unsafe?
>
> What's the problem with hlist, it saves 8 bytes for each mm_struct,
> you should be using it too instead of list.

List heads in mm_struct and in the mmu_notifier struct seemed more consistent. We have no hash list after all.

> > There are cases in which you do not take the reverse map locks or mmap_sem
> > while traversing the notifier list?
>
> There aren't.

There is a potential issue in move_ptes where you call invalidate_range_end after dropping i_mmap_sem, whereas my patches did the opposite. Does mmap_sem save you there? |
From: Andrea A. <an...@qu...> - 2008-04-23 18:37:14
|
On Wed, Apr 23, 2008 at 11:27:21AM -0700, Christoph Lameter wrote:
> There is a potential issue in move_ptes where you call
> invalidate_range_end after dropping i_mmap_sem whereas my patches did the
> opposite. Mmap_sem saves you there?

Yes, there's really no risk of races in this area after introducing mm_lock; any place that mangles ptes and doesn't hold any of the three locks is buggy anyway. I appreciate the audit work (I also did it and couldn't find bugs, but the more eyes the better). |
From: Christoph L. <cla...@sg...> - 2008-04-23 18:46:33
|
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> Yes, there's really no risk of races in this area after introducing
> mm_lock, any place that mangles over ptes and doesn't hold any of the
> three locks is buggy anyway. I appreciate the audit work (I also did
> it and couldn't find bugs but the more eyes the better).

I guess I would need to merge some patches together somehow to be able to review them properly like I did before <sigh>. I have not reviewed the latest code completely. |
From: Andrea A. <an...@qu...> - 2008-04-23 22:19:29
|
On Wed, Apr 23, 2008 at 06:37:13PM +0200, Andrea Arcangeli wrote:
> I'm afraid if you don't want to worst-case unregister with ->release
> you need to have a better idea than my mm_lock and personally I can't
> see any other way than mm_lock to ensure not to miss range_begin...

But wait: mmu_notifier_register absolutely requires mm_lock to ensure that when kvm->arch.mmu_notifier_invalidate_range_count is zero (large variable name, it'll get shorter, but this is to explain), really no cpu is in the middle of a range_begin/end critical section. That's why we have to take all the mm locks. But we couldn't care less if we unregister in the middle; unregister only needs to be sure that no cpu could possibly still be using the ram of the notifier allocated by the driver before returning.

So I'll implement unregister in O(1) and without ram allocations using srcu, and that'll fix all issues with unregister. It'll return "void" to make it crystal clear it can't fail. It turns out unregister will make life easier for kvm as well, mostly by simplifying the teardown of the /dev/kvm closure.

Given this can be considered a bugfix to mmu_notifier_unregister I'll apply it to 1/N and I'll release a new mmu-notifier-core patch for you to review before I resend to Andrew before Saturday.

Thanks! |
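For illustration, the srcu-based unregister boils down to something like the sketch below (names follow the patch later in this thread; the real #v14-test1 version additionally guarantees ->release is called exactly once, which this sketch omits):

/* Sketch only: O(1), allocation-free, cannot fail. */
void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
{
	spin_lock(&mm->mmu_notifier_mm->unregister_lock);
	if (!hlist_unhashed(&mn->hlist))
		hlist_del_rcu(&mn->hlist);	/* unhash, no memory needed */
	spin_unlock(&mm->mmu_notifier_mm->unregister_lock);

	/*
	 * Wait for every srcu_read_lock()ed walker of the notifier list
	 * to drop out before the caller may free the mmu_notifier memory.
	 */
	synchronize_srcu(&mm->mmu_notifier_mm->srcu);
}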
From: Christoph L. <cla...@sg...> - 2008-04-29 01:28:04
|
On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
> Frankly I've absolutely no idea why rcu is needed in all rmap code
> when walking the page->mapping. Definitely the PG_locked is taken so
> there's no way page->mapping could possibly go away under the rmap
> code, hence the anon_vma can't go away as it's queued in the vma, and
> the vma has to go away before the page is zapped out of the pte.

zap_pte_range can race with the rmap code and it does not take the page lock. The page may not go away since a refcount was taken, but the mapping can go away. Without RCU you have no guarantee that the anon_vma still exists when you take the lock. How long were you away from VM development?

> Now the double atomic op may not be horrible when not contented, as it
> works on the same cacheline but with cacheline bouncing with
> contention it sounds doubly horrible than a single cacheline bounce
> and I don't see the point of it as you can't use rcu anyways, so you
> can't possibly take advantage of whatever microoptimization done over
> the original locking.

Cachelines are acquired for exclusive use for a minimum duration. Multiple atomic operations can be performed after a cacheline becomes exclusive without danger of bouncing. |
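For readers following along: the race being described is the one the rmap walkers guard against with the rcu_read_lock()/page_mapped() dance. Sketched from memory below, roughly what page_lock_anon_vma() does, not an exact quote of any particular tree:

/* Sketch: stabilize page->mapping's anon_vma without holding the page lock. */
static struct anon_vma *page_lock_anon_vma(struct page *page)
{
	struct anon_vma *anon_vma;
	unsigned long anon_mapping;

	rcu_read_lock();
	anon_mapping = (unsigned long) page->mapping;
	if (!(anon_mapping & PAGE_MAPPING_ANON))
		goto out;
	if (!page_mapped(page))
		goto out;	/* last pte may be gone, anon_vma could be stale */

	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	/*
	 * Safe even if the anon_vma was freed meanwhile: SLAB_DESTROY_BY_RCU
	 * guarantees the memory is still a struct anon_vma under rcu_read_lock.
	 */
	spin_lock(&anon_vma->lock);
	return anon_vma;	/* caller unlocks and drops rcu_read_lock */
out:
	rcu_read_unlock();
	return NULL;
}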
From: Hugh D. <hu...@ve...> - 2008-04-29 11:00:14
|
On Tue, 29 Apr 2008, Andrea Arcangeli wrote: > > My point of view is that there was no rcu when I wrote that code, yet > there was no reference count and yet all locking looks still exactly > the same as I wrote it. There's even still the page_table_lock to > serialize threads taking the mmap_sem in read mode against the first > vma->anon_vma = anon_vma during the page fault. > > Frankly I've absolutely no idea why rcu is needed in all rmap code > when walking the page->mapping. Definitely the PG_locked is taken so > there's no way page->mapping could possibly go away under the rmap > code, hence the anon_vma can't go away as it's queued in the vma, and > the vma has to go away before the page is zapped out of the pte. [I'm scarcely following the mmu notifiers to-and-fro, which seems to be in good hands, amongst faster thinkers than me: who actually need and can test this stuff. Don't let me slow you down; but I can quickly clarify on this history.] No, the locking was different as you had it, Andrea: there was an extra bitspin lock, carried over from the pte_chains days (maybe we changed the name, maybe we disagreed over the name, I forget), which mainly guarded the page->mapcount. I thought that was one lock more than we needed, and eliminated it in favour of atomic page->mapcount in 2.6.9. Here's the relevant extracts from ChangeLog-2.6.9: [PATCH] rmaplock: PageAnon in mapping First of a batch of five patches to eliminate rmap's page_map_lock, replace its trylocking by spinlocking, and use anon_vma to speed up swapoff. Patches updated from the originals against 2.6.7-mm7: nothing new so I won't spam the list, but including Manfred's SLAB_DESTROY_BY_RCU fixes, and omitting the unuse_process mmap_sem fix already in 2.6.8-rc3. This patch: Replace the PG_anon page->flags bit by setting the lower bit of the pointer in page->mapping when it's anon_vma: PAGE_MAPPING_ANON bit. We're about to eliminate the locking which kept the flags and mapping in synch: it's much easier to work on a local copy of page->mapping, than worry about whether flags and mapping are in synch (though I imagine it could be done, at greater cost, with some barriers). [PATCH] rmaplock: kill page_map_lock The pte_chains rmap used pte_chain_lock (bit_spin_lock on PG_chainlock) to lock its pte_chains. We kept this (as page_map_lock: bit_spin_lock on PG_maplock) when we moved to objrmap. But the file objrmap locks its vma tree with mapping->i_mmap_lock, and the anon objrmap locks its vma list with anon_vma->lock: so isn't the page_map_lock superfluous? Pretty much, yes. The mapcount was protected by it, and needs to become an atomic: starting at -1 like page _count, so nr_mapped can be tracked precisely up and down. The last page_remove_rmap can't clear anon page mapping any more, because of races with page_add_rmap; from which some BUG_ONs must go for the same reason, but they've served their purpose. vmscan decisions are naturally racy, little change there beyond removing page_map_lock/unlock. But to stabilize the file-backed page->mapping against truncation while acquiring i_mmap_lock, page_referenced_file now needs page lock to be held even for refill_inactive_zone. There's a similar issue in acquiring anon_vma->lock, where page lock doesn't help: which this patch pretends to handle, but actually it needs the next. Roughly 10% cut off lmbench fork numbers on my 2*HT*P4. Must confess my testing failed to show the races even while they were knowingly exposed: would benefit from testing on racier equipment. 
[PATCH] rmaplock: SLAB_DESTROY_BY_RCU With page_map_lock gone, how to stabilize page->mapping's anon_vma while acquiring anon_vma->lock in page_referenced_anon and try_to_unmap_anon? The page cannot actually be freed (vmscan holds reference), but however much we check page_mapped (which guarantees that anon_vma is in use - or would guarantee that if we added suitable barriers), there's no locking against page becoming unmapped the instant after, then anon_vma freed. It's okay to take anon_vma->lock after it's freed, so long as it remains a struct anon_vma (its list would become empty, or perhaps reused for an unrelated anon_vma: but no problem since we always check that the page located is the right one); but corruption if that memory gets reused for some other purpose. This is not unique: it's liable to be problem whenever the kernel tries to approach a structure obliquely. It's generally solved with an atomic reference count; but one advantage of anon_vma over anonmm is that it does not have such a count, and it would be a backward step to add one. Therefore... implement SLAB_DESTROY_BY_RCU flag, to guarantee that such a kmem_cache_alloc'ed structure cannot get freed to other use while the rcu_read_lock is held i.e. preempt disabled; and use that for anon_vma. Fix concerns raised by Manfred: this flag is incompatible with poisoning and destructor, and kmem_cache_destroy needs to synchronize_kernel. I hope SLAB_DESTROY_BY_RCU may be useful elsewhere; but though it's safe for little anon_vma, I'd be reluctant to use it on any caches whose immediate shrinkage under pressure is important to the system. [PATCH] rmaplock: mm lock ordering With page_map_lock out of the way, there's no need for page_referenced and try_to_unmap to use trylocks - provided we switch anon_vma->lock and mm->page_table_lock around in anon_vma_prepare. Though I suppose it's possible that we'll find that vmscan makes better progress with trylocks than spinning - we're free to choose trylocks again if so. Try to update the mm lock ordering documentation in filemap.c. But I still find it confusing, and I've no idea of where to stop. So add an mm lock ordering list I can understand to rmap.c. [The fifth patch was about using anon_vma in swapoff, not relevant here.] So, going back to what you wrote: holding the page lock there is not enough to prevent the struct anon_vma going away beneath us. Hugh |
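For illustration of the SLAB_DESTROY_BY_RCU guarantee Hugh describes, the anon_vma cache setup looks roughly like this (sketch, not an exact quote of mm/rmap.c):

static struct kmem_cache *anon_vma_cachep;

void __init anon_vma_init(void)
{
	/*
	 * SLAB_DESTROY_BY_RCU: a freed anon_vma may be immediately reused
	 * for another anon_vma, but the slab page is only handed back to
	 * the page allocator after an RCU grace period. A reader holding
	 * rcu_read_lock() can therefore still take anon_vma->lock on an
	 * object it reached obliquely via page->mapping, provided it
	 * rechecks afterwards that it found what it was looking for.
	 */
	anon_vma_cachep = kmem_cache_create("anon_vma", sizeof(struct anon_vma),
			0, SLAB_DESTROY_BY_RCU | SLAB_PANIC, NULL);
}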
From: Andrea A. <an...@qu...> - 2008-04-29 13:35:47
|
Hi Hugh!!

On Tue, Apr 29, 2008 at 11:49:11AM +0100, Hugh Dickins wrote:
> [I'm scarcely following the mmu notifiers to-and-fro, which seems
> to be in good hands, amongst faster thinkers than me: who actually
> need and can test this stuff. Don't let me slow you down; but I
> can quickly clarify on this history.]

Still, I think it'd be great if you could review mmu-notifier-core v14. You and Nick are the core VM maintainers, so it'd be great to hear any feedback about it. I think it's fairly easy to classify the patch as obviously safe as long as mmu notifiers are disarmed. Here's a link for your convenience:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core

> No, the locking was different as you had it, Andrea: there was an extra
> bitspin lock, carried over from the pte_chains days (maybe we changed
> the name, maybe we disagreed over the name, I forget), which mainly
> guarded the page->mapcount. I thought that was one lock more than we
> needed, and eliminated it in favour of atomic page->mapcount in 2.6.9.

Thanks a lot for the explanation! |
From: Andrea A. <an...@qu...> - 2008-04-24 06:49:49
|
On Thu, Apr 24, 2008 at 12:19:28AM +0200, Andrea Arcangeli wrote: > /dev/kvm closure. Given this can be a considered a bugfix to > mmu_notifier_unregister I'll apply it to 1/N and I'll release a new I'm not sure anymore this can be considered a bugfix given how large change this resulted in the locking and register/unregister/release behavior. Here a full draft patch for review and testing. Works great with KVM so far at least... - mmu_notifier_register has to run on current->mm or on get_task_mm() (in the later case it can mmput after mmu_notifier_register returns) - mmu_notifier_register in turn can't race against mmu_notifier_release as that runs in exit_mmap after the last mmput - mmu_notifier_unregister can run at any time, even after exit_mmap completed. No mm_count pin is required, it's taken automatically by register and released by unregister - mmu_notifier_unregister serializes against all mmu notifiers with srcu, and it serializes especially against a concurrent mmu_notifier_unregister with a mix of a spinlock and SRCU - the spinlock let us keep track who run first between mmu_notifier_unregister and mmu_notifier_release, this makes life much easier for the driver to handle as the driver is then guaranteed that ->release will run. - The first that runs executes ->release method as well after dropping the spinlock but before releasing the srcu lock - it was unsafe to unpin the module count from ->release, as release itself has to run the 'ret' instruction to return back to the mmu notifier code - the ->release method is mandatory as it has to run before the pages are freed to zap all existing sptes - the one that arrives second between mmu_notifier_unregister and mmu_notifier_register waits the first with srcu As said this is a much larger change than I hoped, but as usual it can only affect KVM/GRU/XPMEM if something is wrong with this. I don't exclude we'll have to backoff to the previous mm_users model. The main issue with taking a mm_users pin is that filehandles associated with vmas aren't closed by exit() if the mm_users is pinned (that simply leaks ram with kvm). It looks more correct not to relay on the mm_users being >0 only in mmu_notifier_register. The other big change is that ->release is mandatory and always called by the first between mmu_notifier_unregister or mmu_notifier_release. Both mmu_notifier_unregister and mmu_notifier_release are slow paths so taking a spinlock there is no big deal. Impact when the mmu notifiers are disarmed is unchanged. The interesting part of the kvm patch to test this change is below. After this last bit KVM patch status is almost final if this new mmu notifier update is remotely ok, I've another one that does the locking change to remove the page pin. 
+static void kvm_free_vcpus(struct kvm *kvm); +/* This must zap all the sptes because all pages will be freed then */ +static void kvm_mmu_notifier_release(struct mmu_notifier *mn, + struct mm_struct *mm) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + BUG_ON(mm != kvm->mm); + kvm_free_pit(kvm); + kfree(kvm->arch.vpic); + kfree(kvm->arch.vioapic); + kvm_free_vcpus(kvm); + kvm_free_physmem(kvm); + if (kvm->arch.apic_access_page) + put_page(kvm->arch.apic_access_page); +} + +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { + .release = kvm_mmu_notifier_release, + .invalidate_page = kvm_mmu_notifier_invalidate_page, + .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end, + .clear_flush_young = kvm_mmu_notifier_clear_flush_young, +}; + struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } @@ -3899,13 +3967,12 @@ static void kvm_free_vcpus(struct kvm *kvm) void kvm_arch_destroy_vm(struct kvm *kvm) { - kvm_free_pit(kvm); - kfree(kvm->arch.vpic); - kfree(kvm->arch.vioapic); - kvm_free_vcpus(kvm); - kvm_free_physmem(kvm); - if (kvm->arch.apic_access_page) - put_page(kvm->arch.apic_access_page); + /* + * kvm_mmu_notifier_release() will be called before + * mmu_notifier_unregister returns, if it didn't run + * already. + */ + mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm); kfree(kvm); } Let's call this mmu notifier #v14-test1. Signed-off-by: Andrea Arcangeli <an...@qu...> Signed-off-by: Nick Piggin <np...@su...> Signed-off-by: Christoph Lameter <cla...@sg...> diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1050,6 +1050,27 @@ unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); +/* + * mm_lock will take mmap_sem writably (to prevent all modifications + * and scanning of vmas) and then also takes the mapping locks for + * each of the vma to lockout any scans of pagetables of this address + * space. This can be used to effectively holding off reclaim from the + * address space. + * + * mm_lock can fail if there is not enough memory to store a pointer + * array to all vmas. + * + * mm_lock and mm_unlock are expensive operations that may take a long time. 
+ */ +struct mm_lock_data { + spinlock_t **i_mmap_locks; + spinlock_t **anon_vma_locks; + size_t nr_i_mmap_locks; + size_t nr_anon_vma_locks; +}; +extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); +extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); + extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -19,6 +19,7 @@ #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) struct address_space; +struct mmu_notifier_mm; #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS typedef atomic_long_t mm_counter_t; @@ -225,6 +226,9 @@ #ifdef CONFIG_CGROUP_MEM_RES_CTLR struct mem_cgroup *mem_cgroup; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct mmu_notifier_mm *mmu_notifier_mm; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,251 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/mm_types.h> + +struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER +#include <linux/srcu.h> + +struct mmu_notifier_mm { + struct hlist_head list; + struct srcu_struct srcu; + /* to serialize mmu_notifier_unregister against mmu_notifier_release */ + spinlock_t unregister_lock; +}; + +struct mmu_notifier_ops { + /* + * Called after all other threads have terminated and the executing + * thread is the only remaining execution thread. There are no + * users of the mm_struct remaining. + * + * If the methods are implemented in a module, the module + * can't be unloaded until release() is called. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called after the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed by the Linux pte + * because the old page hasn't been freed yet. If required + * set_page_dirty has to be called internally to this method. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem is held and/or + * the semaphores protecting the reverse maps. Both functions + * may sleep. The subsystem must guarantee that no additional + * references to the pages in the range established between + * the call to invalidate_range_start() and the matching call + * to invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be permitted + * by the driver or the driver may exclude other invalidation + * from proceeding by blocking on new invalidate_range_start() + * callback that overlap invalidates that are already in + * progress. 
Either way the establishment of sptes to the + * range can only be allowed if all invalidate_range_stop() + * function have been called. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end(). If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * droppped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem). + * 3. 
No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(mm->mmu_notifier_mm); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ + mm->mmu_notifier_mm = NULL; +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_mm_destroy(mm); +} + +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify 
ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -53,6 +53,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/mmu_notifier.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -362,6 +363,7 @@ if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } @@ -395,6 +397,7 @@ BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mmu_notifier_mm_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,3 +193,7 @@ config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -194,7 +194,7 @@ if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include <linux/rmap.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/mmu_context.h> #include <asm/cacheflush.h> @@ -214,7 +215,9 @@ spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include <linux/mempolicy.h> #include <linux/cpuset.h> #include <linux/mutex.h> +#include <linux/mmu_notifier.h> #include <asm/page.h> #include <asm/pgtable.h> @@ -799,6 +800,7 @@ BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -819,6 +821,7 @@ } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include <linux/init.h> #include <linux/writeback.h> #include <linux/memcontrol.h> +#include <linux/mmu_notifier.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -596,6 +597,7 @@ unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -603,25 +605,39 @@ * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. 
*/ + ret = 0; if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { if (!vma->anon_vma) - return 0; + goto out; } - if (is_vm_hugetlb_page(vma)) - return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (unlikely(is_vm_hugetlb_page(vma))) { + ret = copy_hugetlb_page_range(dst_mm, src_mm, vma); + goto out; + } + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); +out: + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -825,7 +841,9 @@ unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -876,6 +894,7 @@ } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1463,10 +1482,11 @@ { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1474,6 +1494,7 @@ if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1675,7 +1696,7 @@ * seen in the presence of one thread doing SMC and another * thread doing COW. 
*/ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,9 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/vmalloc.h> +#include <linux/sort.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -2038,6 +2041,7 @@ /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2242,3 +2246,144 @@ return 0; } + +static int mm_lock_cmp(const void *a, const void *b) +{ + unsigned long _a = (unsigned long)*(spinlock_t **)a; + unsigned long _b = (unsigned long)*(spinlock_t **)b; + + cond_resched(); + if (_a < _b) + return -1; + if (_a > _b) + return 1; + return 0; +} + +static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, + int anon) +{ + struct vm_area_struct *vma; + size_t i = 0; + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if (anon) { + if (vma->anon_vma) + locks[i++] = &vma->anon_vma->lock; + } else { + if (vma->vm_file && vma->vm_file->f_mapping) + locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + } + } + + if (!i) + goto out; + + sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + +out: + return i; +} + +static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 1); +} + +static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 0); +} + +static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +{ + spinlock_t *last = NULL; + size_t i; + + for (i = 0; i < nr; i++) + /* Multiple vmas may use the same lock. */ + if (locks[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) locks[i]); + last = locks[i]; + if (lock) + spin_lock(last); + else + spin_unlock(last); + } +} + +static inline void __mm_lock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 1); +} + +static inline void __mm_unlock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 0); +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. The holder + * must not hold any mm related lock. A single task can't take more + * than one mm lock in a row or it would deadlock. 
+ */ +int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) +{ + spinlock_t **anon_vma_locks, **i_mmap_locks; + + down_write(&mm->mmap_sem); + if (mm->map_count) { + anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!anon_vma_locks)) { + up_write(&mm->mmap_sem); + return -ENOMEM; + } + + i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!i_mmap_locks)) { + up_write(&mm->mmap_sem); + vfree(anon_vma_locks); + return -ENOMEM; + } + + data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); + data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + + if (data->nr_anon_vma_locks) { + __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); + data->anon_vma_locks = anon_vma_locks; + } else + vfree(anon_vma_locks); + + if (data->nr_i_mmap_locks) { + __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); + data->i_mmap_locks = i_mmap_locks; + } else + vfree(i_mmap_locks); + } + return 0; +} + +static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +{ + __mm_unlock(locks, nr); + vfree(locks); +} + +/* avoid memory allocations for mm_unlock to prevent deadlock */ +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) +{ + if (mm->map_count) { + if (data->nr_anon_vma_locks) + mm_unlock_vfree(data->anon_vma_locks, + data->nr_anon_vma_locks); + if (data->i_mmap_locks) + mm_unlock_vfree(data->i_mmap_locks, + data->nr_i_mmap_locks); + } + up_write(&mm->mmap_sem); +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,241 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <cla...@sg...> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/mmu_notifier.h> +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/err.h> +#include <linux/srcu.h> +#include <linux/rcupdate.h> +#include <linux/sched.h> + +/* + * This function can't run concurrently against mmu_notifier_register + * or any other mmu notifier method. mmu_notifier_register can only + * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is + * zero). All other tasks of this mm already quit so they can't invoke + * mmu notifiers anymore. This can run concurrently only against + * mmu_notifier_unregister and it serializes against it with the + * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go + * away from under us as the exit_mmap holds a mm_count pin itself. + * + * The ->release method can't allow the module to be unloaded, the + * module can only be unloaded after mmu_notifier_unregister run. This + * is because the release method has to run the ret instruction to + * return back here, and so it can't allow the ret instruction to be + * freed. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_lock(&mm->mmu_notifier_mm->unregister_lock); + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { + mn = hlist_entry(mm->mmu_notifier_mm->list.first, + struct mmu_notifier, + hlist); + /* + * We arrived before mmu_notifier_unregister so + * mmu_notifier_unregister will do nothing else than + * to wait ->release to finish and + * mmu_notifier_unregister to return. 
+ */ + hlist_del_init(&mn->hlist); + /* + * if ->release runs before mmu_notifier_unregister it + * must be handled as it's the only way for the driver + * to flush all existing sptes before the pages in the + * mm are freed. + */ + spin_unlock(&mm->mmu_notifier_mm->unregister_lock); + /* SRCU will block mmu_notifier_unregister */ + mn->ops->release(mn, mm); + spin_lock(&mm->mmu_notifier_mm->unregister_lock); + } + spin_unlock(&mm->mmu_notifier_mm->unregister_lock); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + /* + * Wait ->release if mmu_notifier_unregister run list_del_rcu. + * srcu can't go away from under us because one mm_count is + * hold by exit_mmap. + */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending if the mapping previously + * existed or not. + */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0, srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely like + * with get_task_mm(). mmput can be called after mmu_notifier_register + * returns. mmu_notifier_unregister must be always called to + * unregister the notifier. mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int... [truncated message content] |
From: Andrea A. <an...@qu...> - 2008-04-29 15:31:54
|
On Mon, Apr 28, 2008 at 06:28:06PM -0700, Christoph Lameter wrote:
> On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
>
> > Frankly I've absolutely no idea why rcu is needed in all rmap code
> > when walking the page->mapping. Definitely the PG_locked is taken so
> > there's no way page->mapping could possibly go away under the rmap
> > code, hence the anon_vma can't go away as it's queued in the vma, and
> > the vma has to go away before the page is zapped out of the pte.
>
> zap_pte_range can race with the rmap code and it does not take the page
> lock. The page may not go away since a refcount was taken but the mapping
> can go away. Without RCU you have no guarantee that the anon_vma is
> existing when you take the lock.

There's some room for improvement, like using down_read_trylock: if that succeeds we don't need to increase the refcount and we can keep the rcu_read_lock held instead. Secondly, we don't need to increase the refcount in fork() when we queue the vma-copy in the anon_vma. You should init the refcount to 1 when the anon_vma is allocated, remove the atomic_inc from all code (except when down_read_trylock fails) and then change anon_vma_unlink to:

	up_write(&anon_vma->sem);
	if (empty)
		put_anon_vma(anon_vma);

While the down_read_trylock surely won't help in AIM, the second change will reduce the overhead in the VM core fast paths a bit by avoiding all refcounting changes, checking list_empty the same way the current code does. I really like how I designed the garbage collection through list_empty; that's efficient and I'd like to keep it.

I however doubt this will bring us back to the same performance as the current spinlock version, as the real overhead should come out of overscheduling in down_write at anon_vma_link. Here an initially spinning lock would help, but that's a gray area: it greatly depends on timings, and on very large systems where a cacheline wait with many cpus forking at the same time takes more than scheduling a semaphore, it may not slow down performance that much. So I think the only way is a configuration option to switch the locking at compile time; XPMEM will then depend on that option being on. I don't see a big deal, and this guarantees embedded isn't screwed up by totally unnecessary locks on UP. |
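Spelled out as a full function, the proposed anon_vma_unlink() change would look roughly like the sketch below (illustrative only, against the rwsem-based anon_vma from the XPMEM patches; put_anon_vma() is the hypothetical helper named above, dropping the initial reference set at allocation):

void anon_vma_unlink(struct vm_area_struct *vma)
{
	struct anon_vma *anon_vma = vma->anon_vma;
	int empty;

	if (!anon_vma)
		return;

	down_write(&anon_vma->sem);
	list_del(&vma->anon_vma_node);
	/* Garbage collection still keys off list_empty, as today. */
	empty = list_empty(&anon_vma->head);
	up_write(&anon_vma->sem);

	if (empty)
		put_anon_vma(anon_vma);	/* drops the refcount initialized to 1 */
}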
From: Robin H. <ho...@sg...> - 2008-04-24 09:52:42
|
I am not certain of this, but it seems like this patch leaves things in a somewhat asymetric state. At the very least, I think that asymetry should be documented in the comments of either mmu_notifier.h or .c. Before I do the first mmu_notifier_register, all places that test for mm_has_notifiers(mm) will return false and take the fast path. After I do some mmu_notifier_register()s and their corresponding mmu_notifier_unregister()s, The mm_has_notifiers(mm) will return true and the slow path will be taken. This, despite all registered notifiers having unregistered. It seems to me the work done by mmu_notifier_mm_destroy should really be done inside the mm_lock()/mm_unlock area of mmu_unregister and mm_notifier_release when we have removed the last entry. That would give the users job the same performance after they are done using the special device that they had prior to its use. On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote: ... > diff --git a/mm/memory.c b/mm/memory.c > --- a/mm/memory.c > +++ b/mm/memory.c ... > @@ -603,25 +605,39 @@ > * readonly mappings. The tradeoff is that copy_page_range is more > * efficient than faulting. > */ > + ret = 0; > if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { > if (!vma->anon_vma) > - return 0; > + goto out; > } > > - if (is_vm_hugetlb_page(vma)) > - return copy_hugetlb_page_range(dst_mm, src_mm, vma); > + if (unlikely(is_vm_hugetlb_page(vma))) { > + ret = copy_hugetlb_page_range(dst_mm, src_mm, vma); > + goto out; > + } > > + if (is_cow_mapping(vma->vm_flags)) > + mmu_notifier_invalidate_range_start(src_mm, addr, end); > + > + ret = 0; I don't think this is needed. ... > +/* avoid memory allocations for mm_unlock to prevent deadlock */ > +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) > +{ > + if (mm->map_count) { > + if (data->nr_anon_vma_locks) > + mm_unlock_vfree(data->anon_vma_locks, > + data->nr_anon_vma_locks); > + if (data->i_mmap_locks) I think you really want data->nr_i_mmap_locks. ... > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c > new file mode 100644 > --- /dev/null > +++ b/mm/mmu_notifier.c ... > +/* > + * This function can't run concurrently against mmu_notifier_register > + * or any other mmu notifier method. mmu_notifier_register can only > + * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is > + * zero). All other tasks of this mm already quit so they can't invoke > + * mmu notifiers anymore. This can run concurrently only against > + * mmu_notifier_unregister and it serializes against it with the > + * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go > + * away from under us as the exit_mmap holds a mm_count pin itself. > + * > + * The ->release method can't allow the module to be unloaded, the > + * module can only be unloaded after mmu_notifier_unregister run. This > + * is because the release method has to run the ret instruction to > + * return back here, and so it can't allow the ret instruction to be > + * freed. > + */ The second paragraph of this comment seems extraneous. ... > + /* > + * Wait ->release if mmu_notifier_unregister run list_del_rcu. > + * srcu can't go away from under us because one mm_count is > + * hold by exit_mmap. > + */ These two sentences don't make any sense to me. ... 
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) > +{ > + int before_release = 0, srcu; > + > + BUG_ON(atomic_read(&mm->mm_count) <= 0); > + > + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); > + spin_lock(&mm->mmu_notifier_mm->unregister_lock); > + if (!hlist_unhashed(&mn->hlist)) { > + hlist_del_rcu(&mn->hlist); > + before_release = 1; > + } > + spin_unlock(&mm->mmu_notifier_mm->unregister_lock); > + if (before_release) > + /* > + * exit_mmap will block in mmu_notifier_release to > + * guarantee ->release is called before freeing the > + * pages. > + */ > + mn->ops->release(mn, mm); I am not certain about the need to do the release callout when the driver has already told this subsystem it is done. For XPMEM, this callout would immediately return. I would expect it to be the same or GRU. Thanks, Robin |
From: Robin H. <ho...@sg...> - 2008-04-29 15:50:51
|
> I however doubt this will bring us back to the same performance of the
> current spinlock version, as the real overhead should come out of
> overscheduling in down_write ai anon_vma_link. Here an initially
> spinning lock would help but that's gray area, it greatly depends on
> timings, and on very large systems where a cacheline wait with many
> cpus forking at the same time takes more than scheduling a semaphore
> may not slowdown performance that much. So I think the only way is a
> configuration option to switch the locking at compile time, then XPMEM
> will depend on that option to be on, I don't see a big deal and this
> guarantees embedded isn't screwed up by totally unnecessary locks on UP.

You have said this continually about a CONFIG option. I am unsure how that could be achieved. Could you provide a patch?

Thanks,
Robin |
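Purely as an illustration of the kind of compile-time switch being discussed (hypothetical CONFIG_ANON_VMA_SEM name and wrapper macros, not an actual patch from this thread):

/* mm/Kconfig (sketch):
 *
 *	config ANON_VMA_SEM
 *		bool "Sleepable anon_vma lock (needed by sleeping mmu notifier users)"
 *		default n
 */

/* include/linux/rmap.h (sketch): callers use one wrapper, the lock type
 * behind it is chosen at compile time. */
#ifdef CONFIG_ANON_VMA_SEM
#define anon_vma_lock(av)	down_write(&(av)->sem)
#define anon_vma_unlock(av)	up_write(&(av)->sem)
#else
#define anon_vma_lock(av)	spin_lock(&(av)->lock)
#define anon_vma_unlock(av)	spin_unlock(&(av)->lock)
#endif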
From: Andrea A. <an...@qu...> - 2008-04-24 15:39:47
|
On Thu, Apr 24, 2008 at 04:51:12AM -0500, Robin Holt wrote: > It seems to me the work done by mmu_notifier_mm_destroy should really > be done inside the mm_lock()/mm_unlock area of mmu_unregister and There's no mm_lock/unlock for mmu_unregister anymore. That's the whole point of using srcu so it becomes reliable and quick. > mm_notifier_release when we have removed the last entry. That would > give the users job the same performance after they are done using the > special device that they had prior to its use. That's not feasible. Otherwise mmu_notifier_mm will go away at any time under both _release from exit_mmap and under _unregister too. exit_mmap holds an mm_count implicit, so freeing mmu_notifier_mm after the last mmdrop makes it safe. mmu_notifier_unregister also holds the mm_count because mm_count was pinned by mmu_notifier_register. That solves the issue with mmu_notifier_mm going away from under mmu_notifier_unregister and _release and that's why it can only be freed after mm_count == 0. There's at least one small issue I noticed so far, that while _release don't need to care about _register, but _unregister definitely need to care about _register. I've to take the mmap_sem in addition or in replacement of the unregister_lock. The srcu_read_lock can also likely moved just before releasing the unregister_lock but that's just a minor optimization to make the code more strict. > On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote: > ... > > diff --git a/mm/memory.c b/mm/memory.c > > --- a/mm/memory.c > > +++ b/mm/memory.c > ... > > @@ -603,25 +605,39 @@ > > * readonly mappings. The tradeoff is that copy_page_range is more > > * efficient than faulting. > > */ > > + ret = 0; > > if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { > > if (!vma->anon_vma) > > - return 0; > > + goto out; > > } > > > > - if (is_vm_hugetlb_page(vma)) > > - return copy_hugetlb_page_range(dst_mm, src_mm, vma); > > + if (unlikely(is_vm_hugetlb_page(vma))) { > > + ret = copy_hugetlb_page_range(dst_mm, src_mm, vma); > > + goto out; > > + } > > > > + if (is_cow_mapping(vma->vm_flags)) > > + mmu_notifier_invalidate_range_start(src_mm, addr, end); > > + > > + ret = 0; > > I don't think this is needed. It's not needed right, but I thought it was cleaner if they all use "ret" after I had to change the code at the end of the function. Anyway I'll delete this to make the patch shorter and only change the minimum, agreed. > ... > > +/* avoid memory allocations for mm_unlock to prevent deadlock */ > > +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) > > +{ > > + if (mm->map_count) { > > + if (data->nr_anon_vma_locks) > > + mm_unlock_vfree(data->anon_vma_locks, > > + data->nr_anon_vma_locks); > > + if (data->i_mmap_locks) > > I think you really want data->nr_i_mmap_locks. Indeed. It never happens that there are zero vmas with filebacked mappings, this is why this couldn't be triggered in practice, thanks! > The second paragraph of this comment seems extraneous. ok removed. > > + /* > > + * Wait ->release if mmu_notifier_unregister run list_del_rcu. > > + * srcu can't go away from under us because one mm_count is > > + * hold by exit_mmap. > > + */ > > These two sentences don't make any sense to me. Well that was a short explanation of why the mmu_notifier_mm structure can only be freed after the last mmdrop, which is what you asked at the top. I'll try to rephrase. 
> > +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> > +{
> > +	int before_release = 0, srcu;
> > +
> > +	BUG_ON(atomic_read(&mm->mm_count) <= 0);
> > +
> > +	srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
> > +	spin_lock(&mm->mmu_notifier_mm->unregister_lock);
> > +	if (!hlist_unhashed(&mn->hlist)) {
> > +		hlist_del_rcu(&mn->hlist);
> > +		before_release = 1;
> > +	}
> > +	spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
> > +	if (before_release)
> > +		/*
> > +		 * exit_mmap will block in mmu_notifier_release to
> > +		 * guarantee ->release is called before freeing the
> > +		 * pages.
> > +		 */
> > +		mn->ops->release(mn, mm);
>
> I am not certain about the need to do the release callout when the driver
> has already told this subsystem it is done. For XPMEM, this callout
> would immediately return. I would expect it to be the same for GRU.

The point is that you don't want to run it twice. And without this you
would have to serialize against ->release yourself in the driver. It's
much more convenient if you know that ->release will be called just
once, and before mmu_notifier_unregister returns. It could be called by
_release even after you're already inside _unregister: _release may
reach the spinlock before _unregister, and you won't notice the
difference. Solving this race in the driver looked too complex, so I'd
rather solve it once inside the mmu notifier code to be sure.

Missing a release event is fatal because all sptes have to be dropped
before _release returns. The requirement is the same for _unregister:
all sptes have to be dropped before it returns.

->release should be able to sleep as long as it wants, even with only
1/N applied. exit_mmap can sleep too, no problem. You can't unregister
inside ->release, first of all because the 'ret' instruction must still
be allocated to return to the mmu notifier code. |
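To make the "called exactly once" rule concrete, a hedged sketch of the
driver-side pattern this implies follows. Every my_* name is invented
for illustration; only mmu_notifier_register/_unregister, the ops table
and the ->release callback are from the patchset under discussion.

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/slab.h>

/* All my_* identifiers below are hypothetical driver code. */
struct my_registration {
	struct mmu_notifier mn;
	struct mm_struct *mm;
};

static void my_drop_all_sptes(struct my_registration *reg)
{
	/* tear down every secondary pte/tlb entry held for reg->mm */
}

static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	/*
	 * Runs exactly once, either from exit_mmap or from
	 * mmu_notifier_unregister; all sptes must be gone before returning.
	 */
	my_drop_all_sptes(container_of(mn, struct my_registration, mn));
}

static const struct mmu_notifier_ops my_ops = {
	.release = my_release,
};

static struct my_registration *my_attach(struct mm_struct *mm)
{
	struct my_registration *reg = kzalloc(sizeof(*reg), GFP_KERNEL);

	if (!reg)
		return NULL;
	reg->mm = mm;
	reg->mn.ops = &my_ops;
	if (mmu_notifier_register(&reg->mn, mm)) {
		kfree(reg);
		return NULL;
	}
	return reg;
}

static void my_detach(struct my_registration *reg)
{
	/*
	 * If exit_mmap's ->release already ran, unregister finds the hlist
	 * entry unhashed and skips the second callout; otherwise it calls
	 * ->release itself before returning.
	 */
	mmu_notifier_unregister(&reg->mn, reg->mm);
	kfree(reg);		/* safe: ->release has completed by now */
}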
From: Andrea A. <an...@qu...> - 2008-04-29 16:04:00
|
On Tue, Apr 29, 2008 at 10:50:30AM -0500, Robin Holt wrote:
> You have said this continually about a CONFIG option. I am unsure how
> that could be achieved. Could you provide a patch?

I'm busy with the reserved-ram patch against 2.6.25 and the latest
kvm.git, which is moving from pages to pfns for pci passthrough (that
change will also remove the page pin with mmu notifiers). Unfortunately
reserved-ram bugs out again in blk-settings.c on real hardware. The fix
I pushed into .25 for it works when booting kvm (that's how I tested
it), but on real hardware the sata b_pfn happens to be 1 page less than
the result of the min comparison, and I'll have to figure out what
happens (only the .24 code works on real hardware... at least my fix is
surely better than the previous .25-pre code). I have other people
waiting on reserved-ram to be working, so once I've finished, I'll do
the optimization to anon-vma (at least the removal of the unnecessary
atomic_inc from fork) and add the config option.

Christoph, if you've interest in evolving anon-vma-sem and i_mmap_sem
yourself in this direction, you're very welcome to go ahead while I
finish sorting out reserved-ram. If you do, please let me know so we
don't duplicate effort, and it'd be absolutely great if the patches
could be incremental with #v14 so I can merge them trivially later and
upload a new patchset once you're finished (the only outstanding fix you
have to apply on top of #v14, already integrated in my patchset, is the
i_mmap_sem deadlock fix I posted, which I'm sure you've already applied
on top of #v14 before doing any more development on it).

Thanks! |
From: Andrea A. <an...@qu...> - 2008-05-07 15:00:28
|
On Tue, Apr 29, 2008 at 06:03:40PM +0200, Andrea Arcangeli wrote:
> Christoph if you've interest in evolving anon-vma-sem and i_mmap_sem
> yourself in this direction, you're very welcome to go ahead while I

In case you didn't notice this already, for a further explanation of why
semaphores run slower for small critical sections, and why the
conversion from spinlock to rwsem should happen under a config option,
see the "AIM7 40% regression with 2.6.26-rc1" thread. |
From: Andrea A. <an...@qu...> - 2008-04-22 14:10:31
|
# HG changeset patch
# User Andrea Arcangeli <an...@qu...>
# Date 1208872186 -7200
# Node ID 3c804dca25b15017b22008647783d6f5f3801fa9
# Parent  ea87c15371b1bd49380c40c3f15f1c7ca4438af5
Fix ia64 compilation failure because of common code include bug.

Signed-off-by: Andrea Arcangeli <an...@qu...>

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/cpumask.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
|
From: Christoph L. <cla...@sg...> - 2008-04-22 20:22:49
|
Looks like this is not complete. There are numerous .h files missing,
which means that various structs are undefined (fs.h and rmap.h are
needed, for example), which leads to surprises when dereferencing fields
of these structs. It seems that mm_types.h is expected to be included
only in certain contexts. Could you make sure to include all necessary
.h files? Or add some docs to clarify the situation here. |
From: Andrea A. <an...@qu...> - 2008-04-22 22:43:55
|
On Tue, Apr 22, 2008 at 01:22:55PM -0700, Christoph Lameter wrote:
> Looks like this is not complete. There are numerous .h files missing which
> means that various structs are undefined (fs.h and rmap.h are needed
> f.e.) which leads to surprises when dereferencing fields of these struct.
>
> It seems that mm_types.h is expected to be included only in certain
> contexts. Could you make sure to include all necessary .h files? Or add
> some docs to clarify the situation here.

Robin, what other changes did you need to compile? I only did that one
because I didn't hear any more feedback from you after I sent that
patch, so I assumed it was enough. |
From: Robin H. <ho...@sg...> - 2008-04-22 23:07:52
|
On Wed, Apr 23, 2008 at 12:43:52AM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:22:55PM -0700, Christoph Lameter wrote:
> > Looks like this is not complete. There are numerous .h files missing which
> > means that various structs are undefined (fs.h and rmap.h are needed
> > f.e.) which leads to surprises when dereferencing fields of these struct.
> >
> > It seems that mm_types.h is expected to be included only in certain
> > contexts. Could you make sure to include all necessary .h files? Or add
> > some docs to clarify the situation here.
>
> Robin, what other changes did you need to compile? I only did that one
> because I didn't hear any more feedback from you after I sent that
> patch, so I assumed it was enough.

It was perfect. Nothing else was needed.

Thanks, Robin |