From: Linus T. <tor...@li...> - 2008-05-14 18:27:38
|
On Wed, 14 May 2008, Christoph Lameter wrote:
> The problem is that the code in rmap.c try_to_unmap() and friends loops
> over reverse maps after taking a spinlock. The mm_struct is only known
> after the rmap has been accessed. This means *inside* the spinlock.

So you queue them. That's what we do with things like the dirty bit. We need to hold various spinlocks to look up pages, but then we can't actually call the filesystem with the spinlock held. Converting a spinlock to a waiting lock for things like that is simply not acceptable. You have to work with the system.

Yeah, there's only a single bit worth of information on whether a page is dirty or not, so "queueing" that information is trivial (it's just the return value from "page_mkclean_file()"). Some things are harder than others, and I suspect you need some kind of "gather" structure to queue up all the vma's that can be affected.

But it sounds like for the case of rmap, the approach is:

 - the page lock is the higher-level "sleeping lock" (which makes sense, since this is very close to an IO event, and that is what the page lock is generally used for). But hey, it could be anything else - maybe you have some other even bigger lock to allow you to handle lots of pages in one go.

 - with that lock held, you do the whole rmap dance (which requires spinlocks) and gather up the vma's and the struct mm's involved.

 - outside the spinlocks you then do whatever it is you need to do.

This doesn't sound all that different from TLB shoot-down in SMP, and the "mmu_gather" structure. Now, admittedly we can do the TLB shoot-down while holding the spinlocks, but if we couldn't, that's how we'd still do it: it would get more involved (because we'd need to guarantee that the gather can hold *all* the pages - right now we can just flush in the middle if we need to), but it wouldn't be all that fundamentally different.

And no, I really haven't even wanted to look at what XPMEM really needs to do, so maybe the above doesn't work for you, and you have other issues. I'm just pointing you in a general direction, not trying to say "this is exactly how to get there".

Linus |
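[Sketch: the "gather" pattern described above, in rough C. The struct and the xpmem_invalidate_mm() call are hypothetical placeholders, not code from this thread; a real implementation would also have to hold a reference on each mm and deal with the array filling up, the way mmu_gather flushes mid-way.]

	#define GATHER_MAX 64

	struct xpmem_gather {
		struct mm_struct *mms[GATHER_MAX];	/* hypothetical */
		int nr;
	};

	/* Phase 1: under the rmap spinlocks, only record which mm's are affected. */
	static void gather_mm(struct xpmem_gather *g, struct mm_struct *mm)
	{
		/* a real implementation would also take a reference on mm here */
		if (g->nr < GATHER_MAX)
			g->mms[g->nr++] = mm;
	}

	/*
	 * Phase 2: after the spinlocks are dropped (the page lock is still held),
	 * do the work that may sleep, e.g. messaging a remote partition.
	 */
	static void flush_gather(struct xpmem_gather *g, struct page *page)
	{
		int i;

		for (i = 0; i < g->nr; i++)
			xpmem_invalidate_mm(g->mms[i], page);	/* placeholder; may sleep */
		g->nr = 0;
	}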
From: Christoph L. <cla...@sg...> - 2008-05-17 01:38:26
|
Implementation of what Linus suggested: defer the XPMEM processing until after the locks are dropped. Allow immediate action by GRU/KVM.

This patch implements callbacks for device drivers that establish external references to pages aside from the Linux rmaps. Those drivers either:

1. Do not take a refcount on pages that are mapped from devices. They have TLB-cache-like handling and must be able to flush external references from atomic contexts. These devices do not need to provide the _sync methods.

2. Do take a refcount on pages mapped externally. These are handled by marking pages as to be invalidated in atomic contexts. Invalidation may be started by the driver. A _sync variant for the individual or range unmap is called when we are back in a nonatomic context. At that point the device must complete the removal of external references and drop its refcount.

With the mm notifier it is possible for the device driver to release external references after the page references are removed from a process that made them available. With the notifier it becomes possible to get pages unpinned on request and thus avoid issues that come with having a large number of pinned pages.

A device driver must subscribe to a process using

	mm_notifier_register(struct mm_notifier *, struct mm_struct *)

The VM will then perform callbacks for operations that unmap or change permissions of pages in that address space. When the process terminates, first the ->release method is called to remove all pages still mapped to the process. Before the mm_struct is freed, the ->destroy() method is called, which should dispose of the mm_notifier structure.

The following callbacks exist:

invalidate_range(notifier, mm_struct *, from, to)
	Invalidate a range of addresses. The invalidation is not required to complete immediately.

invalidate_range_sync(notifier, mm_struct *, from, to)
	This is called after some invalidate_range callouts. The driver may only return when the invalidation of the references is completed. The callback is only called from non-atomic contexts. There is no need to provide this callback if the driver can remove references in an atomic context.

invalidate_page(notifier, mm_struct *, struct page *page, unsigned long address)
	Invalidate references to a particular page. The driver may defer the invalidation.

invalidate_page_sync(notifier, mm_struct *, struct page *page)
	Called after one or more invalidate_page() callbacks. The callback must only return when the external references have been removed. The callback does not need to be provided if the driver can remove references in atomic contexts.

[NOTE] The invalidate_page_sync() callback is weird because it is called for every notifier that supports the invalidate_page_sync() callback if a page has PageNotifier() set. The driver must determine in an efficient way that the page is not of interest. This is because we do not have the mm context after we have dropped the rmap list lock. Drivers incrementing the refcount must set and clear PageNotifier appropriately when establishing and/or dropping a refcount! [These conditions are similar to the rmap notifier that was introduced in my V7 of the mmu_notifier.]

There is no support for an aging callback. A device driver may simply set the reference bit on the Linux pte when the external mapping is referenced if such support is desired.

The patch is provisional. All functions are inlined for now. They should be wrapped like in Andrea's series.
Its probably good to have Andrea review this if we actually decide to go this route since he is pretty good as detecting issues with complex lock interactions in the vm. mmu notifiers V7 was rejected by Andrew because of the strange asymmetry in invalidate_page_sync() (at that time called rmap notifier) and we are reintroducing that now in a light weight order to be able to defer freeing until after the rmap spinlocks have been dropped. Jack tested this with the GRU. Signed-off-by: Christoph Lameter <cla...@sg...> --- fs/hugetlbfs/inode.c | 2 include/linux/mm_types.h | 3 include/linux/page-flags.h | 3 include/linux/rmap.h | 161 +++++++++++++++++++++++++++++++++++++++++++++ kernel/fork.c | 4 + mm/Kconfig | 4 + mm/filemap_xip.c | 2 mm/fremap.c | 2 mm/hugetlb.c | 3 mm/memory.c | 38 ++++++++-- mm/mmap.c | 3 mm/mprotect.c | 3 mm/mremap.c | 5 + mm/rmap.c | 11 ++- 14 files changed, 234 insertions(+), 10 deletions(-) Index: linux-2.6/kernel/fork.c =================================================================== --- linux-2.6.orig/kernel/fork.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/kernel/fork.c 2008-05-16 16:06:26.000000000 -0700 @@ -386,6 +386,9 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; +#ifdef CONFIG_MM_NOTIFIER + mm->mm_notifier = NULL; +#endif return mm; } @@ -418,6 +421,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mm_notifier_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); Index: linux-2.6/mm/filemap_xip.c =================================================================== --- linux-2.6.orig/mm/filemap_xip.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/filemap_xip.c 2008-05-16 16:06:26.000000000 -0700 @@ -189,6 +189,7 @@ __xip_unmap (struct address_space * mapp /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); @@ -197,6 +198,7 @@ __xip_unmap (struct address_space * mapp } } spin_unlock(&mapping->i_mmap_lock); + mm_notifier_invalidate_page_sync(page); } /* Index: linux-2.6/mm/fremap.c =================================================================== --- linux-2.6.orig/mm/fremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/fremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -214,7 +214,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mm_notifier_invalidate_range(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mm_notifier_invalidate_range_sync(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); Index: linux-2.6/mm/hugetlb.c =================================================================== --- linux-2.6.orig/mm/hugetlb.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/hugetlb.c 2008-05-16 17:50:31.000000000 -0700 @@ -14,6 +14,7 @@ #include <linux/mempolicy.h> #include <linux/cpuset.h> #include <linux/mutex.h> +#include <linux/rmap.h> #include <asm/page.h> #include <asm/pgtable.h> @@ -843,6 +844,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); @@ -864,6 +866,7 @@ void unmap_hugepage_range(struct vm_area spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); __unmap_hugepage_range(vma, start, end); spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + mm_notifier_invalidate_range_sync(vma->vm_mm, start, end); } } Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/memory.c 2008-05-16 16:06:26.000000000 -0700 @@ -527,6 +527,7 @@ copy_one_pte(struct mm_struct *dst_mm, s */ if (is_cow_mapping(vm_flags)) { ptep_set_wrprotect(src_mm, addr, src_pte); + mm_notifier_invalidate_range(src_mm, addr, addr + PAGE_SIZE); pte = pte_wrprotect(pte); } @@ -649,6 +650,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -664,17 +666,30 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. 
+ */ + if (is_cow_mapping(vma->vm_flags)) + mm_notifier_invalidate_range_sync(src_mm, vma->vm_start, end); + + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -913,6 +928,7 @@ unsigned long unmap_vmas(struct mmu_gath } tlb_finish_mmu(*tlbp, tlb_start, start); + mm_notifier_invalidate_range(vma->vm_mm, tlb_start, start); if (need_resched() || (i_mmap_lock && spin_needbreak(i_mmap_lock))) { @@ -951,8 +967,10 @@ unsigned long zap_page_range(struct vm_a tlb = tlb_gather_mmu(mm, 0); update_hiwater_rss(mm); end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); - if (tlb) + if (tlb) { tlb_finish_mmu(tlb, address, end); + mm_notifier_invalidate_range(mm, address, end); + } return end; } @@ -1711,7 +1729,6 @@ static int do_wp_page(struct mm_struct * */ page_table = pte_offset_map_lock(mm, pmd, address, &ptl); - page_cache_release(old_page); if (!pte_same(*page_table, orig_pte)) goto unlock; @@ -1729,6 +1746,7 @@ static int do_wp_page(struct mm_struct * if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = NULL; goto unlock; } @@ -1774,6 +1792,7 @@ gotten: * thread doing COW. */ ptep_clear_flush(vma, address, page_table); + mm_notifier_invalidate_page(mm, old_page, address); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1787,10 +1806,13 @@ gotten: if (new_page) page_cache_release(new_page); - if (old_page) - page_cache_release(old_page); unlock: pte_unmap_unlock(page_table, ptl); + if (old_page) { + mm_notifier_invalidate_page_sync(old_page); + page_cache_release(old_page); + } + if (dirty_page) { if (vma->vm_file) file_update_time(vma->vm_file); Index: linux-2.6/mm/mmap.c =================================================================== --- linux-2.6.orig/mm/mmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -1759,6 +1759,8 @@ static void unmap_region(struct mm_struc free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? 
next->vm_start: 0); tlb_finish_mmu(tlb, start, end); + mm_notifier_invalidate_range(mm, start, end); + mm_notifier_invalidate_range_sync(mm, start, end); } /* @@ -2048,6 +2050,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mm_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); Index: linux-2.6/mm/mprotect.c =================================================================== --- linux-2.6.orig/mm/mprotect.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/mprotect.c 2008-05-16 16:06:26.000000000 -0700 @@ -21,6 +21,7 @@ #include <linux/syscalls.h> #include <linux/swap.h> #include <linux/swapops.h> +#include <linux/rmap.h> #include <asm/uaccess.h> #include <asm/pgtable.h> #include <asm/cacheflush.h> @@ -132,6 +133,7 @@ static void change_protection(struct vm_ change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable); } while (pgd++, addr = next, addr != end); flush_tlb_range(vma, start, end); + mm_notifier_invalidate_range(vma->vm_mm, start, end); } int @@ -211,6 +213,7 @@ success: hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mm_notifier_invalidate_range_sync(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; Index: linux-2.6/mm/mremap.c =================================================================== --- linux-2.6.orig/mm/mremap.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/mm/mremap.c 2008-05-16 16:06:26.000000000 -0700 @@ -18,6 +18,7 @@ #include <linux/highmem.h> #include <linux/security.h> #include <linux/syscalls.h> +#include <linux/rmap.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -74,6 +75,7 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start = old_addr; if (vma->vm_file) { /* @@ -100,6 +102,7 @@ static void move_ptes(struct vm_area_str spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); arch_enter_lazy_mmu_mode(); + mm_notifier_invalidate_range(mm, old_addr, old_end); for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE, new_pte++, new_addr += PAGE_SIZE) { if (pte_none(*old_pte)) @@ -116,6 +119,8 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + + mm_notifier_invalidate_range_sync(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/rmap.c 2008-05-16 16:06:26.000000000 -0700 @@ -52,6 +52,9 @@ #include <asm/tlbflush.h> +struct mm_notifier *mm_notifier_page_sync; +DECLARE_RWSEM(mm_notifier_page_sync_sem); + struct kmem_cache *anon_vma_cachep; /* This must be called under the mmap_sem. 
*/ @@ -458,6 +461,7 @@ static int page_mkclean_one(struct page flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -502,8 +506,8 @@ int page_mkclean(struct page *page) ret = 1; } } + mm_notifier_invalidate_page_sync(page); } - return ret; } EXPORT_SYMBOL_GPL(page_mkclean); @@ -725,6 +729,7 @@ static int try_to_unmap_one(struct page /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -855,6 +860,7 @@ static void try_to_unmap_cluster(unsigne /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte); + mm_notifier_invalidate_page(mm, page, address); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -1013,8 +1019,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + mm_notifier_invalidate_page_sync(page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; } - Index: linux-2.6/include/linux/rmap.h =================================================================== --- linux-2.6.orig/include/linux/rmap.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/rmap.h 2008-05-16 18:32:52.000000000 -0700 @@ -133,4 +133,165 @@ static inline int page_mkclean(struct pa #define SWAP_AGAIN 1 #define SWAP_FAIL 2 +#ifdef CONFIG_MM_NOTIFIER + +struct mm_notifier_ops { + void (*invalidate_range)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_sync)(struct mm_notifier *mn, struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_page)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page, unsigned long addr); + void (*invalidate_page_sync)(struct mm_notifier *mn, struct mm_struct *mm, + struct page *page); + void (*release)(struct mm_notifier *mn, struct mm_struct *mm); + void (*destroy)(struct mm_notifier *mn, struct mm_struct *mm); +}; + +struct mm_notifier { + struct mm_notifier_ops *ops; + struct mm_struct *mm; + struct mm_notifier *next; + struct mm_notifier *next_page_sync; +}; + +extern struct mm_notifier *mm_notifier_page_sync; +extern struct rw_semaphore mm_notifier_page_sync_sem; + +/* + * Must hold mmap_sem when calling mm_notifier_register. + */ +static inline void mm_notifier_register(struct mm_notifier *mn, + struct mm_struct *mm) +{ + mn->mm = mm; + mn->next = mm->mm_notifier; + rcu_assign_pointer(mm->mm_notifier, mn); + if (mn->ops->invalidate_page_sync) { + down_write(&mm_notifier_page_sync_sem); + mn->next_page_sync = mm_notifier_page_sync; + mm_notifier_page_sync = mn; + up_write(&mm_notifier_page_sync_sem); + } +} + +/* + * Invalidate remote references in a particular address range + */ +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_range(mn, mm, start, end); +} + +/* + * Invalidate remote references in a particular address range. + * Can sleep. Only return if all remote references have been removed. 
+ */ +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + if (mn->ops->invalidate_range_sync) + mn->ops->invalidate_range_sync(mn, mm, start, end); +} + +/* + * Invalidate remote references to a page + */ +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long addr) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->invalidate_page(mn, mm, page, addr); +} + +/* + * Invalidate remote references to a partioular page. Only return + * if all references have been removed. + * + * Note: This is an expensive function since it is not clear at the time + * of call to which mm_struct() the page belongs.. It walks through the + * mmlist and calls the mmu notifier ops for each address space in the + * system. At some point this needs to be optimized. + */ +static inline void mm_notifier_invalidate_page_sync(struct page *page) +{ + struct mm_notifier *mn; + + if (!PageNotifier(page)) + return; + + down_read(&mm_notifier_page_sync_sem); + + for (mn = mm_notifier_page_sync; mn; mn = mn->next_page_sync) + if (mn->ops->invalidate_page_sync) + mn->ops->invalidate_page_sync(mn, mn->mm, page); + + up_read(&mm_notifier_page_sync_sem); +} + +/* + * Invalidate all remote references before shutdown + */ +static inline void mm_notifier_release(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + for (mn = rcu_dereference(mm->mm_notifier); mn; + mn = rcu_dereference(mn->next)) + mn->ops->release(mn, mm); +} + +/* + * Release resources before freeing mm_struct. + */ +static inline void mm_notifier_destroy(struct mm_struct *mm) +{ + struct mm_notifier *mn; + + while (mm->mm_notifier) { + mn = mm->mm_notifier; + mm->mm_notifier = mn->next; + if (mn->ops->invalidate_page_sync) { + struct mm_notifier *m; + + down_write(&mm_notifier_page_sync_sem); + + if (mm_notifier_page_sync != mn) { + for (m = mm_notifier_page_sync; m; m = m->next_page_sync) + if (m->next_page_sync == mn) + break; + + m->next_page_sync = mn->next_page_sync; + } else + mm_notifier_page_sync = mn->next_page_sync; + + up_write(&mm_notifier_page_sync_sem); + } + mn->ops->destroy(mn, mm); + } +} +#else +static inline void mm_notifier_invalidate_range(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm, + unsigned long start, unsigned long end) {} +static inline void mm_notifier_invalidate_page(struct mm_struct *mm, + struct page *page, unsigned long address) {} +static inline void mm_notifier_invalidate_page_sync(struct page *page) {} +static inline void mm_notifier_release(struct mm_struct *mm) {} +static inline void mm_notifier_destroy(struct mm_struct *mm) {} +#endif + #endif /* _LINUX_RMAP_H */ Index: linux-2.6/mm/Kconfig =================================================================== --- linux-2.6.orig/mm/Kconfig 2008-05-16 11:28:50.000000000 -0700 +++ linux-2.6/mm/Kconfig 2008-05-16 16:06:26.000000000 -0700 @@ -205,3 +205,7 @@ config NR_QUICK config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MM_NOTIFIER + def_bool y + Index: linux-2.6/include/linux/mm_types.h =================================================================== --- linux-2.6.orig/include/linux/mm_types.h 2008-05-16 11:28:49.000000000 -0700 +++ 
linux-2.6/include/linux/mm_types.h 2008-05-16 16:06:26.000000000 -0700 @@ -244,6 +244,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MM_NOTIFIER + struct mm_notifier *mm_notifier; +#endif }; #endif /* _LINUX_MM_TYPES_H */ Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/include/linux/page-flags.h 2008-05-16 16:06:26.000000000 -0700 @@ -93,6 +93,7 @@ enum pageflags { PG_mappedtodisk, /* Has blocks allocated on-disk */ PG_reclaim, /* To be reclaimed asap */ PG_buddy, /* Page is free, on buddy lists */ + PG_notifier, /* Call notifier when page is changed/unmapped */ #ifdef CONFIG_IA64_UNCACHED_ALLOCATOR PG_uncached, /* Page has been mapped as uncached */ #endif @@ -173,6 +174,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk) PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim) PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */ +PAGEFLAG(Notifier, notifier); + #ifdef CONFIG_HIGHMEM /* * Must use a macro here due to header dependency issues. page_zone() is not Index: linux-2.6/fs/hugetlbfs/inode.c =================================================================== --- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-05-16 11:28:49.000000000 -0700 +++ linux-2.6/fs/hugetlbfs/inode.c 2008-05-16 16:06:55.000000000 -0700 @@ -442,6 +442,8 @@ hugetlb_vmtruncate_list(struct prio_tree __unmap_hugepage_range(vma, vma->vm_start + v_offset, vma->vm_end); + mm_notifier_invalidate_range_sync(vma->vm_mm, + vma->vm_start + v_offset, vma->vm_end); } } |
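[Sketch: how a driver of type 1 above (no page refcounts, able to flush external references from atomic context) might plug into the interface added by this patch. Only struct mm_notifier, struct mm_notifier_ops and mm_notifier_register() come from the patch; struct my_context and the my_* functions are hypothetical placeholders.]

	#include <linux/rmap.h>		/* mm_notifier API as added by this patch */
	#include <linux/sched.h>	/* current */
	#include <linux/slab.h>		/* kfree */

	struct my_context {
		struct mm_notifier notifier;
		/* ... device state: external TLB handles, etc. ... */
	};

	static void my_invalidate_range(struct mm_notifier *mn, struct mm_struct *mm,
					unsigned long start, unsigned long end)
	{
		/* Flush external TLB entries covering [start, end); must not sleep. */
		my_flush_tlb_range(mn, start, end);	/* placeholder */
	}

	static void my_invalidate_page(struct mm_notifier *mn, struct mm_struct *mm,
				       struct page *page, unsigned long addr)
	{
		my_flush_tlb_range(mn, addr, addr + PAGE_SIZE);	/* placeholder */
	}

	static void my_release(struct mm_notifier *mn, struct mm_struct *mm)
	{
		/* Process is exiting: drop all external mappings of this mm. */
		my_flush_all(mn);	/* placeholder */
	}

	static void my_destroy(struct mm_notifier *mn, struct mm_struct *mm)
	{
		kfree(container_of(mn, struct my_context, notifier));
	}

	static struct mm_notifier_ops my_ops = {
		.invalidate_range	= my_invalidate_range,
		.invalidate_page	= my_invalidate_page,
		.release		= my_release,
		.destroy		= my_destroy,
		/* no *_sync methods: references are removed in atomic context */
	};

	/* Caller must hold current->mm->mmap_sem, as mm_notifier_register() requires. */
	static void my_attach(struct my_context *ctx)
	{
		ctx->notifier.ops = &my_ops;
		mm_notifier_register(&ctx->notifier, current->mm);
	}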
From: Nick P. <np...@su...> - 2008-05-15 07:57:46
|
On Wed, May 14, 2008 at 06:26:25AM -0500, Robin Holt wrote: > On Wed, May 14, 2008 at 06:11:22AM +0200, Nick Piggin wrote: > > > > I guess that you have found a way to perform TLB flushing within coherent > > domains over the numalink interconnect without sleeping. I'm sure it would > > be possible to send similar messages between non coherent domains. > > I assume by coherent domains, your are actually talking about system > images. Yes > Our memory coherence domain on the 3700 family is 512 processors > on 128 nodes. On the 4700 family, it is 16,384 processors on 4096 nodes. > We extend a "Read-Exclusive" mode beyond the coherence domain so any > processor is able to read any cacheline on the system. We also provide > uncached access for certain types of memory beyond the coherence domain. Yes, I understand the basics. > For the other partitions, the exporting partition does not know what > virtual address the imported pages are mapped. The pages are frequently > mapped in a different order by the MPI library to help with MPI collective > operations. > > For the exporting side to do those TLB flushes, we would need to replicate > all that importing information back to the exporting side. Right. Or the exporting side could be passed tokens that it tracks itself, rather than virtual addresses. > Additionally, the hardware that does the TLB flushing is protected > by a spinlock on each system image. We would need to change that > simple spinlock into a type of hardware lock that would work (on 3700) > outside the processors coherence domain. The only way to do that is to > use uncached addresses with our Atomic Memory Operations which do the > cmpxchg at the memory controller. The uncached accesses are an order > of magnitude or more slower. I'm not sure if you're thinking about what I'm thinking of. With the scheme I'm imagining, all you will need is some way to raise an IPI-like interrupt on the target domain. The IPI target will have a driver to handle the interrupt, which will determine the mm and virtual addresses which are to be invalidated, and will then tear down those page tables and issue hardware TLB flushes within its domain. On the Linux side, I don't see why this can't be done. > > So yes, I'd much rather rework such highly specialized system to fit in > > closer with Linux than rework Linux to fit with these machines (and > > apparently slow everyone else down). > > But it isn't that we are having a problem adapting to just the hardware. > One of the limiting factors is Linux on the other partition. In what way is the Linux limiting? > > > Additionally, the call to zap_page_range expects to have the mmap_sem > > > held. I suppose we could use something other than zap_page_range and > > > atomically clear the process page tables. > > > > zap_page_range does not expect to have mmap_sem held. I think for anon > > pages it is always called with mmap_sem, however try_to_unmap_anon is > > not (although it expects page lock to be held, I think we should be able > > to avoid that). > > zap_page_range calls unmap_vmas which walks to vma->next. Are you saying > that can be walked without grabbing the mmap_sem at least readably? Oh, I get that confused because of the mixed up naming conventions there: unmap_page_range should actually be called zap_page_range. But at any rate, yes we can easily zap pagetables without holding mmap_sem. > I feel my understanding of list management and locking completely > shifting. 
FWIW, mmap_sem isn't held to protect vma->next there anyway, because at that point the vmas are detached from the mm's rbtree and linked list. But sure, in that particular path it is held for other reasons. > > > Doing that will not alleviate > > > the need to sleep for the messaging to the other partitions. > > > > No, but I'd venture to guess that is not impossible to implement even > > on your current hardware (maybe a firmware update is needed)? > > Are you suggesting the sending side would not need to sleep or the > receiving side? Assuming you meant the sender, it spins waiting for the > remote side to acknowledge the invalidate request? We place the data > into a previously agreed upon buffer and send an interrupt. At this > point, we would need to start spinning and waiting for completion. > Let's assume we never run out of buffer space. How would you run out of buffer space if it is synchronous? > The receiving side receives an interrupt. The interrupt currently wakes > an XPC thread to do the work of transfering and delivering the message > to XPMEM. The transfer of the data which XPC does uses the BTE engine > which takes up to 28 seconds to timeout (hardware timeout before raising > and error) and the BTE code automatically does a retry for certain > types of failure. We currently need to grab semaphores which _MAY_ > be able to be reworked into other types of locks. Sure, you obviously would need to rework your code because it's been written with the assumption that it can sleep. What is XPMEM exactly anyway? I'd assumed it is a Linux driver. |
From: Robin H. <ho...@sg...> - 2008-05-15 11:01:46
|
We are pursuing Linus' suggestion currently. This discussion is completely unrelated to that work.

On Thu, May 15, 2008 at 09:57:47AM +0200, Nick Piggin wrote:
> I'm not sure if you're thinking about what I'm thinking of. With the
> scheme I'm imagining, all you will need is some way to raise an IPI-like
> interrupt on the target domain. The IPI target will have a driver to
> handle the interrupt, which will determine the mm and virtual addresses
> which are to be invalidated, and will then tear down those page tables
> and issue hardware TLB flushes within its domain. On the Linux side,
> I don't see why this can't be done.

We would need to deposit the payload into a central location to do the invalidate, correct? That central location would either need to be indexed by physical cpuid (65536 possible currently, UV will push that up much higher) or some sort of global id, which is difficult because remote partitions can reboot, giving you a different view of the machine, and running partitions would need to be updated. Alternatively, that central location would need to be protected by a global lock or atomic-type operation, but a majority of the machine does not have coherent access to other partitions, so they would need to use uncached operations. Essentially, take away from this paragraph that it is going to be really slow or really large.

Then we need to deposit the information needed to do the invalidate.

Lastly, we would need to interrupt. Unfortunately, here we have a thundering herd. There could be up to 16256 processors interrupting the same processor. That will be a lot of work. It will need to look up the mm (without grabbing any sleeping locks in either xpmem or the kernel) and do the tlb invalidates. Unfortunately, the sending side is not free to continue (in most cases) until it knows that the invalidate is completed. So it will need to spin waiting for a completion signal, which could be as simple as an uncached word. But how will it handle the possible failure of the other partition? How will it detect that failure and recover? A timeout value could be difficult to gauge because the other side may be off doing a considerable amount of work and may just be backed up.

> Sure, you obviously would need to rework your code because it's been
> written with the assumption that it can sleep.

It is an assumption based upon some of the kernel functions we call doing things like grabbing mutexes or rw_sems. That pushes back to us. I think the kernel's locking is perfectly reasonable. The problem we run into is we are trying to get from one context in one kernel to a different context in another, and the in-between piece needs to be sleepable.

> What is XPMEM exactly anyway? I'd assumed it is a Linux driver.

XPMEM allows one process to make a portion of its virtual address range directly addressable by another process with the appropriate access. The other process can be on other partitions. As long as Numa-link allows access to the memory, we can make it available. Userland has an advantage in that the kernel entrance/exit code contains memory errors, so we can contain hardware failures (in most cases) to only needing to terminate a user program and not lose the partition. The kernel enjoys no such fault containment, so it cannot safely reference memory directly.

Thanks,
Robin |
From: Avi K. <av...@qu...> - 2008-05-15 11:12:39
|
Robin Holt wrote:
> Then we need to deposit the information needed to do the invalidate.
>
> Lastly, we would need to interrupt. Unfortunately, here we have a
> thundering herd. There could be up to 16256 processors interrupting the
> same processor. That will be a lot of work. It will need to look up the
> mm (without grabbing any sleeping locks in either xpmem or the kernel)
> and do the tlb invalidates.
>

You don't need to interrupt every time. Place your data in a queue (you do support rmw operations, right?) and interrupt. Invalidates from other processors will see that the queue hasn't been processed yet and skip the interrupt.

--
error compiling committee.c: too many arguments to function |
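[Sketch of the queue-plus-flag scheme Avi suggests, with placeholder primitives (enqueue_request, send_remote_interrupt, process_all_queued_requests, queue_not_empty); the atomic_t flag stands in for whatever cross-partition rmw operation the hardware provides.]

	/*
	 * One of these lives in memory the exporting partition can update with
	 * remote atomics and the importing partition can read locally.
	 */
	struct invalidate_queue {
		atomic_t	pending;	/* 0 = target has drained the queue */
		/* ... ring buffer of (mm id, start, end) invalidate requests ... */
	};

	/* Sender side: queue the request; only interrupt if the target is idle. */
	static void queue_invalidate(struct invalidate_queue *q, unsigned long mm_id,
				     unsigned long start, unsigned long end)
	{
		enqueue_request(q, mm_id, start, end);		/* placeholder */
		if (atomic_cmpxchg(&q->pending, 0, 1) == 0)
			send_remote_interrupt(q);		/* placeholder IPI-like op */
		/*
		 * else: an interrupt is already outstanding and the handler
		 * will see this request when it drains the queue.
		 */
	}

	/* Receiver side (interrupt handler): drain everything, then clear pending. */
	static void invalidate_irq_handler(struct invalidate_queue *q)
	{
		do {
			atomic_set(&q->pending, 0);
			process_all_queued_requests(q);		/* placeholder */
		} while (queue_not_empty(q));			/* re-check to close the race */
	}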
From: Christoph L. <cla...@sg...> - 2008-05-15 17:34:18
|
On Thu, 15 May 2008, Nick Piggin wrote:
> Oh, I get that confused because of the mixed up naming conventions
> there: unmap_page_range should actually be called zap_page_range. But
> at any rate, yes we can easily zap pagetables without holding mmap_sem.

How is that synchronized with code that walks the same pagetable? These walks may not hold mmap_sem either. I would expect that one could only remove a portion of the pagetable where we have some sort of guarantee that no accesses occur. So the prior removal of the vma ensures that? |
From: Nick P. <np...@su...> - 2008-05-15 23:52:26
|
On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> On Thu, 15 May 2008, Nick Piggin wrote:
>
> > Oh, I get that confused because of the mixed up naming conventions
> > there: unmap_page_range should actually be called zap_page_range. But
> > at any rate, yes we can easily zap pagetables without holding mmap_sem.
>
> How is that synchronized with code that walks the same pagetable. These
> walks may not hold mmap_sem either. I would expect that one could only
> remove a portion of the pagetable where we have some sort of guarantee
> that no accesses occur. So the removal of the vma prior ensures that?

I don't really understand the question. If you remove the pte and invalidate the TLBs on the remote image's process (importing the page), then it can of course try to refault the page in because its vma is still there. But you catch that refault in your driver, which can prevent the page from being faulted back in. |
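[One possible shape of "catching the refault in the driver", with hypothetical names; the idea is the same invalidate-in-progress counter that Andrea's mmu-notifier-core patch elsewhere in this thread describes for invalidate_range_begin/end.]

	struct import_ctx {
		atomic_t	invalidate_count;	/* nonzero while an invalidate is in flight */
		spinlock_t	map_lock;
	};

	/* Invalidate side, driven by the exporting partition / mmu notifier. */
	static void import_invalidate_start(struct import_ctx *ctx)
	{
		atomic_inc(&ctx->invalidate_count);
		spin_lock(&ctx->map_lock);
		tear_down_external_mapping(ctx);	/* placeholder */
		spin_unlock(&ctx->map_lock);
	}

	static void import_invalidate_end(struct import_ctx *ctx)
	{
		atomic_dec(&ctx->invalidate_count);
	}

	/* Fault side: the importing process refaults because its vma still exists. */
	static int import_fault(struct import_ctx *ctx, unsigned long addr)
	{
		int ret = -EAGAIN;	/* caller retries while invalidation is in progress */

		spin_lock(&ctx->map_lock);
		if (atomic_read(&ctx->invalidate_count) == 0)
			ret = establish_external_mapping(ctx, addr);	/* placeholder */
		spin_unlock(&ctx->map_lock);
		return ret;
	}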
From: Andrea A. <an...@qu...> - 2008-05-07 14:38:23
|
# HG changeset patch # User Andrea Arcangeli <an...@qu...> # Date 1210115135 -7200 # Node ID 58f716ad4d067afb6bdd1b5f7042e19d854aae0d # Parent 0621238970155f8ff2d60ca4996dcdd470f9c6ce i_mmap_rwsem The conversion to a rwsem allows notifier callbacks during rmap traversal for files. A rw style lock also allows concurrent walking of the reverse map so that multiple processors can expire pages in the same memory area of the same process. So it increases the potential concurrency. Signed-off-by: Andrea Arcangeli <an...@qu...> Signed-off-by: Christoph Lameter <cla...@sg...> diff --git a/Documentation/vm/locking b/Documentation/vm/locking --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -66,7 +66,7 @@ expand_stack(), it is hard to come up wi expand_stack(), it is hard to come up with a destructive scenario without having the vmlist protection in this case. -The page_table_lock nests with the inode i_mmap_lock and the kmem cache +The page_table_lock nests with the inode i_mmap_sem and the kmem cache c_spinlock spinlocks. This is okay, since the kmem code asks for pages after dropping c_spinlock. The page_table_lock also nests with pagecache_lock and pagemap_lru_lock spinlocks, and no code asks for memory with these locks diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -69,7 +69,7 @@ static void huge_pmd_share(struct mm_str if (!vma_shareable(vma, addr)) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) { if (svma == vma) continue; @@ -94,7 +94,7 @@ static void huge_pmd_share(struct mm_str put_page(virt_to_page(spte)); spin_unlock(&mm->page_table_lock); out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -454,10 +454,10 @@ static int hugetlb_vmtruncate(struct ino pgoff = offset >> PAGE_SHIFT; i_size_write(inode, offset); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); if (!prio_tree_empty(&mapping->i_mmap)) hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); truncate_hugepages(inode, offset); return 0; } diff --git a/fs/inode.c b/fs/inode.c --- a/fs/inode.c +++ b/fs/inode.c @@ -210,7 +210,7 @@ void inode_init_once(struct inode *inode INIT_LIST_HEAD(&inode->i_devices); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); rwlock_init(&inode->i_data.tree_lock); - spin_lock_init(&inode->i_data.i_mmap_lock); + init_rwsem(&inode->i_data.i_mmap_sem); INIT_LIST_HEAD(&inode->i_data.private_list); spin_lock_init(&inode->i_data.private_lock); INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap); diff --git a/include/linux/fs.h b/include/linux/fs.h --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -502,7 +502,7 @@ struct address_space { unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ - spinlock_t i_mmap_lock; /* protect tree, count, list */ + struct rw_semaphore i_mmap_sem; /* protect tree, count, list */ unsigned int truncate_count; /* Cover race condition with truncate */ unsigned long nrpages; /* number of total pages */ pgoff_t writeback_index;/* writeback starts here */ diff --git a/include/linux/mm.h b/include/linux/mm.h --- 
a/include/linux/mm.h +++ b/include/linux/mm.h @@ -735,7 +735,7 @@ struct zap_details { struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ - spinlock_t *i_mmap_lock; /* For unmap_mapping_range: */ + struct rw_semaphore *i_mmap_sem; /* For unmap_mapping_range: */ unsigned long truncate_count; /* Compare vm_truncate_count */ }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -297,12 +297,12 @@ static int dup_mmap(struct mm_struct *mm atomic_dec(&inode->i_writecount); /* insert tmp into the share list, just after mpnt */ - spin_lock(&file->f_mapping->i_mmap_lock); + down_write(&file->f_mapping->i_mmap_sem); tmp->vm_truncate_count = mpnt->vm_truncate_count; flush_dcache_mmap_lock(file->f_mapping); vma_prio_tree_add(tmp, mpnt); flush_dcache_mmap_unlock(file->f_mapping); - spin_unlock(&file->f_mapping->i_mmap_lock); + up_write(&file->f_mapping->i_mmap_sem); } /* diff --git a/mm/filemap.c b/mm/filemap.c --- a/mm/filemap.c +++ b/mm/filemap.c @@ -61,16 +61,16 @@ generic_file_direct_IO(int rw, struct ki /* * Lock ordering: * - * ->i_mmap_lock (vmtruncate) + * ->i_mmap_sem (vmtruncate) * ->private_lock (__free_pte->__set_page_dirty_buffers) * ->swap_lock (exclusive_swap_page, others) * ->mapping->tree_lock * * ->i_mutex - * ->i_mmap_lock (truncate->unmap_mapping_range) + * ->i_mmap_sem (truncate->unmap_mapping_range) * * ->mmap_sem - * ->i_mmap_lock + * ->i_mmap_sem * ->page_table_lock or pte_lock (various, mainly in memory.c) * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock) * @@ -87,7 +87,7 @@ generic_file_direct_IO(int rw, struct ki * ->sb_lock (fs/fs-writeback.c) * ->mapping->tree_lock (__sync_single_inode) * - * ->i_mmap_lock + * ->i_mmap_sem * ->anon_vma.lock (vma_adjust) * * ->anon_vma.lock diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -178,7 +178,7 @@ __xip_unmap (struct address_space * mapp if (!page) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { mm = vma->vm_mm; address = vma->vm_start + @@ -198,7 +198,7 @@ __xip_unmap (struct address_space * mapp page_cache_release(page); } } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -206,13 +206,13 @@ asmlinkage long sys_remap_file_pages(uns } goto out; } - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); flush_dcache_mmap_lock(mapping); vma->vm_flags |= VM_NONLINEAR; vma_prio_tree_remove(vma, &mapping->i_mmap); vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear); flush_dcache_mmap_unlock(mapping); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } mmu_notifier_invalidate_range_start(mm, start, start + size); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -814,7 +814,7 @@ void __unmap_hugepage_range(struct vm_ar struct page *page; struct page *tmp; /* - * A page gathering list, protected by per file i_mmap_lock. The + * A page gathering list, protected by per file i_mmap_sem. The * lock is used to avoid list corruption from multiple unmapping * of the same page since we are using page->lru. */ @@ -864,9 +864,9 @@ void unmap_hugepage_range(struct vm_area * do nothing in this case. 
*/ if (vma->vm_file) { - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); __unmap_hugepage_range(vma, start, end); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); } } @@ -1111,7 +1111,7 @@ void hugetlb_change_protection(struct vm BUG_ON(address >= end); flush_cache_range(vma, address, end); - spin_lock(&vma->vm_file->f_mapping->i_mmap_lock); + down_write(&vma->vm_file->f_mapping->i_mmap_sem); spin_lock(&mm->page_table_lock); for (; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -1126,7 +1126,7 @@ void hugetlb_change_protection(struct vm } } spin_unlock(&mm->page_table_lock); - spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock); + up_write(&vma->vm_file->f_mapping->i_mmap_sem); flush_tlb_range(vma, start, end); } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -874,7 +874,7 @@ unsigned long unmap_vmas(struct vm_area_ unsigned long tlb_start = 0; /* For tlb_finish_mmu */ int tlb_start_valid = 0; unsigned long start = start_addr; - spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; + struct rw_semaphore *i_mmap_sem = details? details->i_mmap_sem: NULL; int fullmm; struct mmu_gather *tlb; struct mm_struct *mm = vma->vm_mm; @@ -920,8 +920,8 @@ unsigned long unmap_vmas(struct vm_area_ tlb_finish_mmu(tlb, tlb_start, start); if (need_resched() || - (i_mmap_lock && spin_needbreak(i_mmap_lock))) { - if (i_mmap_lock) { + (i_mmap_sem && rwsem_needbreak(i_mmap_sem))) { + if (i_mmap_sem) { tlb = NULL; goto out; } @@ -1829,7 +1829,7 @@ unwritable_page: /* * Helper functions for unmap_mapping_range(). * - * __ Notes on dropping i_mmap_lock to reduce latency while unmapping __ + * __ Notes on dropping i_mmap_sem to reduce latency while unmapping __ * * We have to restart searching the prio_tree whenever we drop the lock, * since the iterator is only valid while the lock is held, and anyway @@ -1848,7 +1848,7 @@ unwritable_page: * can't efficiently keep all vmas in step with mapping->truncate_count: * so instead reset them all whenever it wraps back to 0 (then go to 1). * mapping->truncate_count and vma->vm_truncate_count are protected by - * i_mmap_lock. + * i_mmap_sem. 
* * In order to make forward progress despite repeatedly restarting some * large vma, note the restart_addr from unmap_vmas when it breaks out: @@ -1898,7 +1898,7 @@ again: restart_addr = zap_page_range(vma, start_addr, end_addr - start_addr, details); - need_break = need_resched() || spin_needbreak(details->i_mmap_lock); + need_break = need_resched() || rwsem_needbreak(details->i_mmap_sem); if (restart_addr >= end_addr) { /* We have now completed this vma: mark it so */ @@ -1912,9 +1912,9 @@ again: goto again; } - spin_unlock(details->i_mmap_lock); + up_write(details->i_mmap_sem); cond_resched(); - spin_lock(details->i_mmap_lock); + down_write(details->i_mmap_sem); return -EINTR; } @@ -2008,9 +2008,9 @@ void unmap_mapping_range(struct address_ details.last_index = hba + hlen - 1; if (details.last_index < details.first_index) details.last_index = ULONG_MAX; - details.i_mmap_lock = &mapping->i_mmap_lock; + details.i_mmap_sem = &mapping->i_mmap_sem; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); /* Protect against endless unmapping loops */ mapping->truncate_count++; @@ -2025,7 +2025,7 @@ void unmap_mapping_range(struct address_ unmap_mapping_range_tree(&mapping->i_mmap, &details); if (unlikely(!list_empty(&mapping->i_mmap_nonlinear))) unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } EXPORT_SYMBOL(unmap_mapping_range); diff --git a/mm/migrate.c b/mm/migrate.c --- a/mm/migrate.c +++ b/mm/migrate.c @@ -211,12 +211,12 @@ static void remove_file_migration_ptes(s if (!mapping) return; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) remove_migration_pte(vma, old, new); - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); } /* diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -189,7 +189,7 @@ error: } /* - * Requires inode->i_mapping->i_mmap_lock + * Requires inode->i_mapping->i_mmap_sem */ static void __remove_shared_vm_struct(struct vm_area_struct *vma, struct file *file, struct address_space *mapping) @@ -217,9 +217,9 @@ void unlink_file_vma(struct vm_area_stru if (file) { struct address_space *mapping = file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); __remove_shared_vm_struct(vma, file, mapping); - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); } } @@ -445,7 +445,7 @@ static void vma_link(struct mm_struct *m mapping = vma->vm_file->f_mapping; if (mapping) { - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); vma->vm_truncate_count = mapping->truncate_count; } anon_vma_lock(vma); @@ -455,7 +455,7 @@ static void vma_link(struct mm_struct *m anon_vma_unlock(vma); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mm->map_count++; validate_mm(mm); @@ -542,7 +542,7 @@ again: remove_next = 1 + (end > next-> mapping = file->f_mapping; if (!(vma->vm_flags & VM_NONLINEAR)) root = &mapping->i_mmap; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (importer && vma->vm_truncate_count != next->vm_truncate_count) { /* @@ -626,7 +626,7 @@ again: remove_next = 1 + (end > next-> if (anon_vma) spin_unlock(&anon_vma->lock); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); if (remove_next) { if (file) { @@ -2068,7 +2068,7 @@ void exit_mmap(struct mm_struct *mm) /* Insert vm structure into process list 
sorted by address * and into the inode's i_mmap tree. If vm_file is non-NULL - * then i_mmap_lock is taken here. + * then i_mmap_sem is taken here. */ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) { diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -88,7 +88,7 @@ static void move_ptes(struct vm_area_str * and we propagate stale pages into the dst afterward. */ mapping = vma->vm_file->f_mapping; - spin_lock(&mapping->i_mmap_lock); + down_write(&mapping->i_mmap_sem); if (new_vma->vm_truncate_count && new_vma->vm_truncate_count != vma->vm_truncate_count) new_vma->vm_truncate_count = 0; @@ -120,7 +120,7 @@ static void move_ptes(struct vm_area_str pte_unmap_nested(new_pte - 1); pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) - spin_unlock(&mapping->i_mmap_lock); + up_write(&mapping->i_mmap_sem); mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -24,7 +24,7 @@ * inode->i_alloc_sem (vmtruncate_range) * mm->mmap_sem * page->flags PG_locked (lock_page) - * mapping->i_mmap_lock + * mapping->i_mmap_sem * anon_vma->lock * mm->page_table_lock or pte_lock * zone->lru_lock (in mark_page_accessed, isolate_lru_page) @@ -373,14 +373,14 @@ static int page_referenced_file(struct p * The page lock not only makes sure that page->mapping cannot * suddenly be NULLified by truncation, it makes sure that the * structure at mapping cannot be freed and reused yet, - * so we can safely take mapping->i_mmap_lock. + * so we can safely take mapping->i_mmap_sem. */ BUG_ON(!PageLocked(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); /* - * i_mmap_lock does not stabilize mapcount at all, but mapcount + * i_mmap_sem does not stabilize mapcount at all, but mapcount * is more likely to be accurate if we note it after spinning. */ mapcount = page_mapcount(page); @@ -403,7 +403,7 @@ static int page_referenced_file(struct p break; } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return referenced; } @@ -490,12 +490,12 @@ static int page_mkclean_file(struct addr BUG_ON(PageAnon(page)); - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { if (vma->vm_flags & VM_SHARED) ret += page_mkclean_one(page, vma); } - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } @@ -930,7 +930,7 @@ static int try_to_unmap_file(struct page unsigned long max_nl_size = 0; unsigned int mapcount; - spin_lock(&mapping->i_mmap_lock); + down_read(&mapping->i_mmap_sem); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { ret = try_to_unmap_one(page, vma, migration); if (ret == SWAP_FAIL || !page_mapped(page)) @@ -967,7 +967,6 @@ static int try_to_unmap_file(struct page mapcount = page_mapcount(page); if (!mapcount) goto out; - cond_resched_lock(&mapping->i_mmap_lock); max_nl_size = (max_nl_size + CLUSTER_SIZE - 1) & CLUSTER_MASK; if (max_nl_cursor == 0) @@ -989,7 +988,6 @@ static int try_to_unmap_file(struct page } vma->vm_private_data = (void *) max_nl_cursor; } - cond_resched_lock(&mapping->i_mmap_lock); max_nl_cursor += CLUSTER_SIZE; } while (max_nl_cursor <= max_nl_size); @@ -1001,7 +999,7 @@ static int try_to_unmap_file(struct page list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) vma->vm_private_data = NULL; out: - spin_unlock(&mapping->i_mmap_lock); + up_read(&mapping->i_mmap_sem); return ret; } |
From: Andrea A. <an...@qu...> - 2008-05-07 14:38:55
|
# HG changeset patch
# User Andrea Arcangeli <an...@qu...>
# Date 1210096013 -7200
# Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c
# Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
mmu-notifier-core

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In the GRU case there's no actual secondary pte and there's only a secondary tlb, because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case for KVM, where a secondary tlb miss will walk sptes in hardware and will refill the secondary tlb transparently to software if the corresponding spte is present).

The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte, because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page being freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp-safe logic we have in zap_page_range, which can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU, so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it spares the code that tears down the mappings of the secondary MMU from implementing a logic like tlb_gather in zap_page_range, which would require many IPIs to flush other cpu tlbs for each fixed number of sptes unmapped.

To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect), the secondary MMU mappings will be invalidated, and the next secondary-mmu page fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establish an updated spte or secondary-tlb-mapping on the copied page. Or it will set up a readonly spte or readonly tlb mapping if it's a guest read, if it calls get_user_pages with write=0. This is just an example.

This allows any page pointed to by any pte (and in turn visible in the primary CPU MMU) to be mapped into a secondary MMU (be it a pure tlb like GRU, or a full MMU with both sptes and a secondary tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence the need to schedule in XPMEM code to send the invalidate to the remote node, while there is no need to schedule in kvm/gru as it's an immediate event, like invalidating a primary-mmu pte).

At least for KVM, without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't doing anything with "mm".
This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken in virtual address order. The order is critical to avoid mm_lock(mm1) and mm_lock(mm2) running concurrently to trigger lock inversion deadlocks. 3) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_lock may not allocate the required vmalloc space. See the comment on top of mm_lock() implementation for the worst case memory requirements. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. Signed-off-by: Andrea Arcangeli <an...@qu...> Signed-off-by: Nick Piggin <np...@su...> Signed-off-by: Christoph Lameter <cla...@sg...> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/include/linux/list.h b/include/linux/list.h --- a/include/linux/list.h +++ b/include/linux/list.h @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis * or hlist_del_rcu(), running on this same list. * However, it is perfectly legal to run concurrently with * the _rcu list-traversal primitives, such as - * hlist_for_each_entry(). + * hlist_for_each_entry_rcu(). 
*/ static inline void hlist_del_rcu(struct hlist_node *n) { @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct if (!hlist_unhashed(n)) { __hlist_del(n); INIT_HLIST_NODE(n); + } +} + +/** + * hlist_del_init_rcu - deletes entry from hash list with re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on entry does return true after this. It is + * useful for RCU based read lockfree traversal if the writer side + * must know if the list entry is still hashed or already unhashed. + * + * In particular, it means that we can not poison the forward pointers + * that may still be used for walking the hash list and we can only + * zero the pprev pointer so list_unhashed() will return true after + * this. + * + * The caller must take whatever precautions are necessary (such as + * holding appropriate locks) to avoid racing with another + * list-mutation primitive, such as hlist_add_head_rcu() or + * hlist_del_rcu(), running on this same list. However, it is + * perfectly legal to run concurrently with the _rcu list-traversal + * primitives, such as hlist_for_each_entry_rcu(). + */ +static inline void hlist_del_init_rcu(struct hlist_node *n) +{ + if (!hlist_unhashed(n)) { + __hlist_del(n); + n->pprev = NULL; } } diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1084,6 +1084,15 @@ extern int install_special_mapping(struc unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); +struct mm_lock_data { + spinlock_t **i_mmap_locks; + spinlock_t **anon_vma_locks; + size_t nr_i_mmap_locks; + size_t nr_anon_vma_locks; +}; +extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); +extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); + extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include <linux/rbtree.h> #include <linux/rwsem.h> #include <linux/completion.h> +#include <linux/cpumask.h> #include <asm/page.h> #include <asm/mmu.h> @@ -19,6 +20,7 @@ #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1)) struct address_space; +struct mmu_notifier_mm; #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS typedef atomic_long_t mm_counter_t; @@ -235,6 +237,9 @@ struct mm_struct { struct file *exe_file; unsigned long num_exe_file_vmas; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct mmu_notifier_mm *mmu_notifier_mm; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,282 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/mm_types.h> +#include <linux/srcu.h> + +struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER + +/* + * The mmu notifier_mm structure is allocated and installed in + * mm->mmu_notifier_mm inside the mm_lock() protected critical section + * and it's released only when mm_count reaches zero in mmdrop(). 
+ */ +struct mmu_notifier_mm { + /* all mmu notifiers registered in this mm are queued in this list */ + struct hlist_head list; + /* srcu structure for this mm */ + struct srcu_struct srcu; + /* to serialize the list modifications and hlist_unhashed */ + spinlock_t lock; +}; + +struct mmu_notifier_ops { + /* + * Called either by mmu_notifier_unregister or when the mm is + * being destroyed by exit_mmap, always before all pages are + * freed. This can run concurrently with other mmu notifier + * methods (the ones invoked outside the mm context) and it + * should tear down all secondary mmu mappings and freeze the + * secondary mmu. If this method isn't implemented you have to + * be sure that nothing could possibly write to the pages + * through the secondary mmu by the time the last thread with + * tsk->mm == mm exits. + * + * As a side note: the pages freed after ->release returns could + * be immediately reallocated by the gart at an alias physical + * address with a different cache model, so if ->release isn't + * implemented because all _software_ driven memory accesses + * through the secondary mmu are terminated by the time the + * last thread of this mm quits, you also have to be sure that + * speculative _hardware_ operations can't allocate dirty + * cachelines in the cpu that could not be snooped and made + * coherent with the other read and write operations happening + * through the gart alias address, thus leading to memory + * corruption. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called when the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. Both functions + * may sleep. The subsystem must guarantee that no additional + * references are taken to the pages in the range established + * between the call to invalidate_range_start() and the + * matching call to invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_start/end for the whole duration of the + * invalidate_range_start/end critical section. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end().
If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * dropped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem). + * 3. No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(mm->mmu_notifier_mm); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern int __mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void
mmu_notifier_mm_init(struct mm_struct *mm) +{ + mm->mmu_notifier_mm = NULL; +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_mm_destroy(mm); +} + +/* + * These two macros will some day replace ptep_clear_flush. + * ptep_clear_flush is implemented as a macro itself, so this is also + * implemented as a macro until ptep_clear_flush is converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can later be removed. + */ +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/include/linux/srcu.h b/include/linux/srcu.h --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -27,6 +27,8 @@ #ifndef _LINUX_SRCU_H #define _LINUX_SRCU_H +#include <linux/mutex.h> + struct srcu_struct_array { int c[2]; }; diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -53,6 +53,7 @@ #include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/mmu_notifier.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -385,6 +386,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } @@ -417,6 +419,7 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); mm_free_pgd(mm); destroy_context(mm); + mmu_notifier_mm_destroy(mm); free_mm(mm); } EXPORT_SYMBOL_GPL(__mmdrop); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -205,3 +205,6 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + bool diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c
b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include <linux/rmap.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/mmu_context.h> #include <asm/cacheflush.h> @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include <linux/mempolicy.h> #include <linux/cpuset.h> #include <linux/mutex.h> +#include <linux/mmu_notifier.h> #include <asm/page.h> #include <asm/pgtable.h> @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include <linux/init.h> #include <linux/writeback.h> #include <linux/memcontrol.h> +#include <linux/mmu_notifier.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds unsigned long next; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int ret; /* * Don't copy ptes where a page fault will fill them correctly. @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. 
+ */ + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + + ret = 0; dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(src_pgd)) continue; - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, - vma, addr, next)) - return -ENOMEM; + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd, + vma, addr, next))) { + ret = -ENOMEM; + break; + } } while (dst_pgd++, src_pgd++, addr = next, addr != end); - return 0; + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); + return ret; } static unsigned long zap_pte_range(struct mmu_gather *tlb, @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1541,10 +1562,11 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1552,6 +1574,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1753,7 +1776,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. 
*/ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,9 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/vmalloc.h> +#include <linux/sort.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -2048,6 +2051,7 @@ void exit_mmap(struct mm_struct *mm) /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2255,3 +2259,194 @@ int install_special_mapping(struct mm_st return 0; } + +static int mm_lock_cmp(const void *a, const void *b) +{ + unsigned long _a = (unsigned long)*(spinlock_t **)a; + unsigned long _b = (unsigned long)*(spinlock_t **)b; + + cond_resched(); + if (_a < _b) + return -1; + if (_a > _b) + return 1; + return 0; +} + +static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, + int anon) +{ + struct vm_area_struct *vma; + size_t i = 0; + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if (anon) { + if (vma->anon_vma) + locks[i++] = &vma->anon_vma->lock; + } else { + if (vma->vm_file && vma->vm_file->f_mapping) + locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + } + } + + if (!i) + goto out; + + sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + +out: + return i; +} + +static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 1); +} + +static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 0); +} + +static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +{ + spinlock_t *last = NULL; + size_t i; + + for (i = 0; i < nr; i++) + /* Multiple vmas may use the same lock. */ + if (locks[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) locks[i]); + last = locks[i]; + if (lock) + spin_lock(last); + else + spin_unlock(last); + } +} + +static inline void __mm_lock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 1); +} + +static inline void __mm_unlock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 0); +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. + * + * The caller must take the mmap_sem in read or write mode before + * calling mm_lock(). The caller isn't allowed to release the mmap_sem + * until mm_unlock() returns. + * + * While mm_lock() itself won't strictly require the mmap_sem in write + * mode to be safe, in order to block all operations that could modify + * pagetables and free pages without need of altering the vma layout + * (for example populate_range() with nonlinear vmas) the mmap_sem + * must be taken in write mode by the caller. + * + * A single task can't take more than one mm_lock in a row or it would + * deadlock. + * + * The sorting is needed to avoid lock inversion deadlocks if two + * tasks run mm_lock at the same time on different mm that happen to + * share some anon_vmas/inodes but mapped in different order. + * + * mm_lock and mm_unlock are expensive operations that may have to + * take thousand of locks. 
Thanks to sort() the complexity is + * O(N*log(N)) where N is the number of VMAs in the mm. The max number + * of vmas is defined in /proc/sys/vm/max_map_count. + * + * mm_lock() can fail if memory allocation fails. The worst case + * vmalloc allocation required is 2*max_map_count*sizeof(spinlock_t *), + * so around 1Mbyte, but in practice it'll be much less because + * normally there won't be max_map_count vmas allocated in the task + * that runs mm_lock(). + * + * The vmalloc memory allocated by mm_lock is stored in the + * mm_lock_data structure that must be allocated by the caller and it + * must be later passed to mm_unlock that will free it after using it. + * Allocating the mm_lock_data structure on the stack is fine because + * it's only a couple of bytes in size. + * + * If mm_lock() returns -ENOMEM no memory has been allocated and the + * mm_lock_data structure can be freed immediately, and mm_unlock must + * not be called. + */ +int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) +{ + spinlock_t **anon_vma_locks, **i_mmap_locks; + + if (mm->map_count) { + anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!anon_vma_locks)) + return -ENOMEM; + + i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!i_mmap_locks)) { + vfree(anon_vma_locks); + return -ENOMEM; + } + + /* + * When mm_lock_sort_anon_vma/i_mmap returns zero it + * means there's no lock to take and so we can free + * the array here without waiting mm_unlock. mm_unlock + * will do nothing if nr_i_mmap/anon_vma_locks is + * zero. + */ + data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); + data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + + if (data->nr_anon_vma_locks) { + __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); + data->anon_vma_locks = anon_vma_locks; + } else + vfree(anon_vma_locks); + + if (data->nr_i_mmap_locks) { + __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); + data->i_mmap_locks = i_mmap_locks; + } else + vfree(i_mmap_locks); + } + return 0; +} + +static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +{ + __mm_unlock(locks, nr); + vfree(locks); +} + +/* + * mm_unlock doesn't require any memory allocation and it won't fail. + * + * The mmap_sem cannot be released until mm_unlock returns. + * + * All memory has been previously allocated by mm_lock and it'll be + * all freed before returning. Only after mm_unlock returns, the + * caller is allowed to free and forget the mm_lock_data structure. + * + * mm_unlock runs in O(N) where N is the max number of VMAs in the + * mm. The max number of vmas is defined in + * /proc/sys/vm/max_map_count. + */ +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) +{ + if (mm->map_count) { + if (data->nr_anon_vma_locks) + mm_unlock_vfree(data->anon_vma_locks, + data->nr_anon_vma_locks); + if (data->nr_i_mmap_locks) + mm_unlock_vfree(data->i_mmap_locks, + data->nr_i_mmap_locks); + } +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,292 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <cla...@sg...> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ */ + +#include <linux/mmu_notifier.h> +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/err.h> +#include <linux/srcu.h> +#include <linux/rcupdate.h> +#include <linux/sched.h> + +/* + * This function can't run concurrently with mmu_notifier_register + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with + * vmtruncate. This serializes against mmu_notifier_unregister with + * the mmu_notifier_mm->lock in addition to SRCU and it serializes + * against the other mmu notifiers with SRCU. struct mmu_notifier_mm + * can't go away from under us as exit_mmap holds an mm_count pin + * itself. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + int srcu; + + spin_lock(&mm->mmu_notifier_mm->lock); + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { + mn = hlist_entry(mm->mmu_notifier_mm->list.first, + struct mmu_notifier, + hlist); + /* + * We arrived before mmu_notifier_unregister, so + * mmu_notifier_unregister will do nothing other than + * wait for ->release to finish before returning. + */ + hlist_del_init_rcu(&mn->hlist); + /* + * SRCU here will block mmu_notifier_unregister until + * ->release returns. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * if ->release runs before mmu_notifier_unregister it + * must be handled as it's the only way for the driver + * to flush all existing sptes and stop the driver + * from establishing any more sptes before all the + * pages in the mm are freed. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + spin_lock(&mm->mmu_notifier_mm->lock); + } + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * synchronize_srcu here prevents mmu_notifier_release from + * returning to exit_mmap (which would proceed to free all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * + * The mmu_notifier_mm can't go away from under us because one + * mm_count is held by exit_mmap. + */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending on whether the mapping + * previously existed or not.
+ */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0, srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int srcu; + + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); +} + +static int do_mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm, + int take_mmap_sem) +{ + struct mm_lock_data data; + struct mmu_notifier_mm * mmu_notifier_mm; + int ret; + + BUG_ON(atomic_read(&mm->mm_users) <= 0); + + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + ret = init_srcu_struct(&mmu_notifier_mm->srcu); + if (unlikely(ret)) + goto out_kfree; + + if (take_mmap_sem) + down_write(&mm->mmap_sem); + ret = mm_lock(mm, &data); + if (unlikely(ret)) + goto out_cleanup; + + if (!mm_has_notifiers(mm)) { + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; + } + atomic_inc(&mm->mm_count); + + /* + * Serialize the update against mmu_notifier_unregister. A + * side note: mmu_notifier_release can't run concurrently with + * us because we hold the mm_users pin (either implicitly as + * current->mm or explicitly with get_task_mm() or similar). + * We can't race against any other mmu notifiers either thanks + * to mm_lock(). + */ + spin_lock(&mm->mmu_notifier_mm->lock); + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); + spin_unlock(&mm->mmu_notifier_mm->lock); + + mm_unlock(mm, &data); +out_cleanup: + if (take_mmap_sem) + up_write(&mm->mmap_sem); + if (mmu_notifier_mm) + cleanup_srcu_struct(&mmu_notifier_mm->srcu); +out_kfree: + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); +out: + BUG_ON(atomic_read(&mm->mm_users) <= 0); + return ret; +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. 
Must also ensure mm_users can't go down + * to zero while this runs to avoid races with mmu_notifier_release, + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register + * returns. mmu_notifier_unregister must always be called to + * unregister the notifier. mm_count is automatically pinned to allow + * mmu_notifier_unregister to safely run at any time later, before or + * after exit_mmap. ->release will always be called before exit_mmap + * frees the pages. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 1); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* + * Same as mmu_notifier_register but here the caller must hold the + * mmap_sem in write mode. + */ +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + return do_mmu_notifier_register(mn, mm, 0); +} +EXPORT_SYMBOL_GPL(__mmu_notifier_register); + +/* this is called after the last mmu_notifier_unregister() has returned */ +void __mmu_notifier_mm_destroy(struct mm_struct *mm) +{ + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list)); + cleanup_srcu_struct(&mm->mmu_notifier_mm->srcu); + kfree(mm->mmu_notifier_mm); + mm->mmu_notifier_mm = LIST_POISON1; /* debug */ +} + +/* + * This releases the mm_count pin automatically and frees the mm + * structure if it was the last user of it. It serializes against + * running mmu notifiers with SRCU and against mmu_notifier_unregister + * with the unregister lock + SRCU. All sptes must be dropped before + * calling mmu_notifier_unregister. ->release or any other notifier + * method may be invoked concurrently with mmu_notifier_unregister, + * and only after mmu_notifier_unregister has returned are we guaranteed + * that ->release or any other method can't run anymore. + */ +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + spin_lock(&mm->mmu_notifier_mm->lock); + if (!hlist_unhashed(&mn->hlist)) { + int srcu; + + hlist_del_rcu(&mn->hlist); + + /* + * SRCU here will force exit_mmap to wait for ->release to finish + * before freeing the pages. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* + * exit_mmap will block in mmu_notifier_release to + * guarantee ->release is called before freeing the + * pages. + */ + if (mn->ops->release) + mn->ops->release(mn, mm); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + } else + spin_unlock(&mm->mmu_notifier_mm->lock); + + /* + * Wait for any running method to finish, of course including + * ->release if it was run by mmu_notifier_release instead of us.
+ */ + synchronize_srcu(&mm->mmu_notifier_mm->srcu); + + BUG_ON(atomic_read(&mm->mm_count) <= 0); + + mmdrop(mm); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -21,6 +21,7 @@ #include <linux/syscalls.h> #include <linux/swap.h> #include <linux/swapops.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/pgtable.h> #include <asm/cacheflush.h> @@ -198,10 +199,12 @@ success: dirty_accountable = 1; } + mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -18,6 +18,7 @@ #include <linux/highmem.h> #include <linux/security.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start; + old_start = old_addr; + mmu_notifier_invalidate_range_start(vma->vm_mm, + old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -49,6 +49,7 @@ #include <linux/module.h> #include <linux/kallsyms.h> #include <linux/memcontrol.h> +#include <linux/mmu_notifier.h> #include <asm/tlbflush.h> @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + } else if (ptep_clear_flush_young_notify(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + entry = ptep_clear_flush_notify(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young_notify(vma, address, pte)))) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young_notify(vma, address, pte)) continue; /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) |
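To make the callback API above concrete, here is a minimal sketch of how a GRU-style driver (one that takes no page refcount and can flush its external TLB from atomic context) might plug into it. Only mmu_notifier_register() and the mmu_notifier/mmu_notifier_ops structures come from the patch; every my_* name is hypothetical and stands in for device-specific code.

    #include <linux/mmu_notifier.h>
    #include <linux/sched.h>
    #include <linux/mm.h>

    /* Hypothetical per-device state; not part of the patch. */
    struct my_gru_ctx {
    	struct mmu_notifier mn;
    	/* ... device specific secondary tlb state ... */
    };

    static void my_gru_invalidate_page(struct mmu_notifier *mn,
    				   struct mm_struct *mm,
    				   unsigned long address)
    {
    	struct my_gru_ctx *ctx = container_of(mn, struct my_gru_ctx, mn);

    	/* may run in atomic context: flush the single external tlb entry */
    	my_gru_flush_tlb(ctx, address, address + PAGE_SIZE);
    }

    static void my_gru_invalidate_range_start(struct mmu_notifier *mn,
    					  struct mm_struct *mm,
    					  unsigned long start,
    					  unsigned long end)
    {
    	struct my_gru_ctx *ctx = container_of(mn, struct my_gru_ctx, mn);

    	/* one flush covers the whole range that is about to be unmapped */
    	my_gru_flush_tlb(ctx, start, end);
    }

    static void my_gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
    {
    	struct my_gru_ctx *ctx = container_of(mn, struct my_gru_ctx, mn);

    	/* the address space is going away: drop every external reference */
    	my_gru_flush_all(ctx);
    }

    static const struct mmu_notifier_ops my_gru_notifier_ops = {
    	.release		= my_gru_release,
    	.invalidate_page	= my_gru_invalidate_page,
    	.invalidate_range_start	= my_gru_invalidate_range_start,
    	/* no invalidate_range_end: no refcount is held on the pages */
    };

    static int my_gru_attach(struct my_gru_ctx *ctx)
    {
    	ctx->mn.ops = &my_gru_notifier_ops;
    	/* may fail with -ENOMEM because of the vmalloc done by mm_lock() */
    	return mmu_notifier_register(&ctx->mn, current->mm);
    }

A KVM-style driver that does pin pages would additionally implement clear_flush_young and drop its sptes (and the pin) from invalidate_range_start/end, as the mmu_notifier_ops comments in the patch describe.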
From: Andrew M. <ak...@li...> - 2008-05-07 20:03:02
|
On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli <an...@qu...> wrote: > # HG changeset patch > # User Andrea Arcangeli <an...@qu...> > # Date 1210096013 -7200 > # Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c > # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 > mmu-notifier-core > > ... > > --- a/include/linux/list.h > +++ b/include/linux/list.h > @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis > * or hlist_del_rcu(), running on this same list. > * However, it is perfectly legal to run concurrently with > * the _rcu list-traversal primitives, such as > - * hlist_for_each_entry(). > + * hlist_for_each_entry_rcu(). > */ > static inline void hlist_del_rcu(struct hlist_node *n) > { > @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct > if (!hlist_unhashed(n)) { > __hlist_del(n); > INIT_HLIST_NODE(n); > + } > +} > + > +/** > + * hlist_del_init_rcu - deletes entry from hash list with re-initialization > + * @n: the element to delete from the hash list. > + * > + * Note: list_unhashed() on entry does return true after this. It is Should that be "does" or "does not". "does", I suppose. It should refer to hlist_unhashed() The term "on entry" is a bit ambiguous - we normally use that as shorthand to mean "on entry to the function". So I'll change this to > + * Note: hlist_unhashed() on the node returns true after this. It is OK? <oh, that was copied-and-pasted from similarly errant comments in that file> > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -10,6 +10,7 @@ > #include <linux/rbtree.h> > #include <linux/rwsem.h> > #include <linux/completion.h> > +#include <linux/cpumask.h> OK, unrelated bugfix ;) > --- a/include/linux/srcu.h > +++ b/include/linux/srcu.h > @@ -27,6 +27,8 @@ > #ifndef _LINUX_SRCU_H > #define _LINUX_SRCU_H > > +#include <linux/mutex.h> And another. Fair enough. |
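The behaviour Andrew is getting at is easiest to see side by side: hlist_unhashed() tests n->pprev, hlist_del_init_rcu() zeroes only pprev (leaving next intact for concurrent RCU readers), while hlist_del_rcu() poisons pprev. A small illustrative sketch, not kernel source; my_item/my_remove are made-up names:

    #include <linux/list.h>
    #include <linux/spinlock.h>

    struct my_item {
    	struct hlist_node node;
    };

    static void my_remove(struct my_item *item, spinlock_t *lock)
    {
    	spin_lock(lock);
    	/* pprev becomes NULL, next is left valid for RCU readers */
    	hlist_del_init_rcu(&item->node);
    	spin_unlock(lock);
    	/* from here on hlist_unhashed(&item->node) returns true */
    }

    static void my_remove_if_still_hashed(struct my_item *item, spinlock_t *lock)
    {
    	spin_lock(lock);
    	if (!hlist_unhashed(&item->node))	/* skip if my_remove() already ran */
    		hlist_del_rcu(&item->node);	/* pprev is poisoned, next kept */
    	spin_unlock(lock);
    }

This is exactly the pattern the mmu notifier patch relies on in __mmu_notifier_release() versus mmu_notifier_unregister().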
From: Rik v. R. <ri...@re...> - 2008-05-07 17:36:50
|
On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli <an...@qu...> wrote: > Signed-off-by: Andrea Arcangeli <an...@qu...> > Signed-off-by: Nick Piggin <np...@su...> > Signed-off-by: Christoph Lameter <cla...@sg...> Acked-by: Rik van Riel <ri...@re...> -- All rights reversed. |
From: Robin H. <ho...@sg...> - 2008-05-07 16:01:23
|
You can drop this patch. This turned out to be a race in xpmem. It "appeared" as if it were a race in get_task_mm, but it really is not. The current->mm field is cleared under the task_lock and the task_lock is grabbed by get_task_mm. I have been testing your v15 version without this patch and have not encountered the problem again (now that I fixed my xpmem race). Thanks, Robin On Wed, May 07, 2008 at 04:35:52PM +0200, Andrea Arcangeli wrote: > # HG changeset patch > # User Andrea Arcangeli <an...@qu...> > # Date 1210115127 -7200 > # Node ID c5badbefeee07518d9d1acca13e94c981420317c > # Parent e20917dcc8284b6a07cfcced13dda4cbca850a9c > get_task_mm |
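For readers following along, the serialization Robin is relying on looks roughly like the sketch below (simplified; the real get_task_mm() lives in kernel/fork.c and also refuses to return the mm of kernel threads): exit_mm() clears task->mm only under task_lock(), and get_task_mm() reads it under the same lock and pins it before dropping the lock.

    #include <linux/sched.h>

    /* Simplified sketch of the pattern, not the exact kernel source. */
    struct mm_struct *my_get_task_mm(struct task_struct *task)
    {
    	struct mm_struct *mm;

    	task_lock(task);		/* serializes with exit_mm() clearing task->mm */
    	mm = task->mm;
    	if (mm)
    		atomic_inc(&mm->mm_users);	/* pin so the mm can't be freed after unlock */
    	task_unlock(task);
    	return mm;			/* caller drops the pin with mmput() */
    }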
From: Andrea A. <an...@qu...> - 2008-05-07 16:20:06
|
On Wed, May 07, 2008 at 10:59:48AM -0500, Robin Holt wrote: > You can drop this patch. > > This turned out to be a race in xpmem. It "appeared" as if it were a > race in get_task_mm, but it really is not. The current->mm field is > cleared under the task_lock and the task_lock is grabbed by get_task_mm. 100% agreed, I'll nuke it as it really seems to be a noop. > I have been testing your v15 version without this patch and have not > encountered the problem again (now that I fixed my xpmem race). Great. About your other deadlock, I'm curious if my deadlock fix for the i_mmap_sem patch helped. That was crashing kvm with a VM with 2G in swap plus a swaphog allocating and freeing another 2G of swap in a loop. I couldn't reproduce any other problem with KVM since I fixed that bit, regardless of whether I apply only mmu-notifier-core (the 2.6.26 version) or the full patchset (post 2.6.26). |
From: Andrew M. <ak...@li...> - 2008-05-07 20:08:08
|
On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli <an...@qu...> wrote: > # HG changeset patch > # User Andrea Arcangeli <an...@qu...> > # Date 1210096013 -7200 > # Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c > # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8 > mmu-notifier-core The patch looks OK to me. The proposal is that we sneak this into 2.6.26. Are there any sufficiently-serious objections to this? The patch will be a no-op for 2.6.26. This is all rather unusual. For the record, could we please review the reasons for wanting to do this? Thanks. |
From: Linus T. <tor...@li...> - 2008-05-07 20:31:00
|
On Wed, 7 May 2008, Andrew Morton wrote: > > The patch looks OK to me. As far as I can tell, authorship has been destroyed by at least two of the patches (ie Christoph seems to be the author, but Andrea seems to have dropped that fact). > The proposal is that we sneak this into 2.6.26. Are there any > sufficiently-serious objections to this? Yeah, too late and no upside. That "locking" code is also too ugly to live, at least without some serious arguments for why it has to be done that way. Sorting the locks? In a vmalloc'ed area? And calling this something innocuous like "mm_lock()"? Hell no. That code needs some serious re-thinking. Linus |
From: Andrea A. <an...@qu...> - 2008-05-07 21:58:35
|
On Wed, May 07, 2008 at 01:30:39PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Andrew Morton wrote: > > > > The patch looks OK to me. > > As far as I can tell, authorship has been destroyed by at least two of the > patches (ie Christoph seems to be the author, but Andrea seems to have > dropped that fact). I can't follow this, please be more specific. About the patches I merged from Christoph, I didn't touch them at all (except for fixing a kernel-crashing bug in them plus some reject fixes). Initially I didn't even add a signed-off-by: andrea, and I only had the signed-off-by: christoph. But then he said I had to add my signed-off-by too, while I thought at most an acked-by was required. So if I got any attribution on Christoph's work it's only because he explicitly requested it as it was passing through my maintenance line. In any case, all patches except mmu-notifier-core are irrelevant in this context and I'm entirely fine with giving Christoph the whole attribution of the whole patchset, including the whole mmu-notifier-core where most of the code is mine. We had many discussions with Christoph, Robin and Jack, but I can assure you nobody had a single problem with regard to attribution. About all patches except mmu-notifier-core: Christoph, Robin and everyone (especially myself) agree those patches can't yet be merged in 2.6.26. With regard to the post-2.6.26 material, I think adding a config option to make the change at compile time is ok. And there's no other way to deal with it in a clean way, as vmtruncate has to tear down pagetables, and if the i_mmap_lock is a spinlock there's no way to notify secondary mmus about it, if the ->invalidate_range_start method has to allocate an skb, send it through the network and wait for I/O completion with schedule(). > Yeah, too late and no upside. No upside for all the people setting CONFIG_KVM=n, true, but no downside for them either - that's the important fact! And for all the people setting CONFIG_KVM!=n, I should provide some background here. KVM MM development is halted without this; that includes: paging, ballooning, tlb flushing at large, pci-passthrough removing the page pin as a whole, etc... Everyone on kvm-devel talks about mmu-notifiers; check the last VT-d patch from Intel, where Antony (IBM/qemu/kvm) wonders how to handle things without mmu notifiers (mlock whatever). Rusty agreed so strongly that we had to get mmu notifiers into 2.6.26 that he went as far as writing his own ultrasimple mmu notifier implementation - unfortunately too simple, as invalidate_range_start was missing, and without it we can't remove the page pinning and avoid doing spte=invalid;tlbflush;unpin for every group of sptes released. And without mm_lock invalidate_range_start can't be implemented in a generic way (to work for GRU/XPMEM too). > That "locking" code is also too ugly to live, at least without some > serious arguments for why it has to be done that way. Sorting the locks? > In a vmalloc'ed area? And calling this something innocuous like > "mm_lock()"? Hell no. That's only invoked in mmu_notifier_register; mm_lock is explicitly documented as a heavyweight function. In the KVM case it's only called when a VM is created, and that's irrelevant cpu cost compared to the time it takes the OS to boot in the VM... (especially without real mode emulation, with direct NPT-like secondary-mmu paging).
mm_lock solved the fundamental race in the range_start/end invalidation model (that will allow GRU to do a single tlb flush for the whole range that is going to be freed by zap_page_range/unmap_vmas/whatever). Christoph merged mm_lock in his EMM versions of mmu notifiers moments after I released it; I think he wouldn't have done it if there was a better way. > That code needs some serious re-thinking. Even if you're totally right: with Nick's mmu notifiers, Rusty's mmu notifiers, my original mmu notifiers, Christoph's first version of my mmu notifiers, my new mmu notifiers, Christoph's EMM version of my new mmu notifiers, my latest mmu notifiers, all the people making suggestions and testing the code and needing the code badly, and further patches waiting for inclusion during 2.6.27 in this area, it must be obvious to everyone that there's zero chance this code won't evolve over time towards perfection. But we can't wait for it to be perfect before we start using it, or we're screwed. Even if it's entirely broken this will allow kvm development to continue and then we'll fix it (but don't worry, it works great at runtime and there are no race conditions; Jack and Robin are also using it with zero problems with GRU and XPMEM, just in case the KVM testing going great isn't enough). Furthermore the API has been frozen for months, and everyone agrees with all the fundamental blocks in the mmu-notifier-core patch (to be complete, Christoph would like to replace invalidate_page with an invalidate_range_start/end pair, but that's a minor detail). And most importantly, we need something in now, regardless of which API. We can handle a change of API totally fine later. mm_lock() is not even part of the mmu notifier API; it's just an internal implementation detail, so whatever problem it has, or whatever better name we can find, isn't a high priority right now. If you suggest a better name now I'll fix it up immediately. I hope the mm_lock name and whatever signed-off-by errors in patches after mmu-notifier-core won't really be why this doesn't go in. Thanks a lot for your time reviewing, even if it wasn't as positive as I hoped, Andrea |
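The race Andrea keeps referring to is the one the patch description spells out: a page returned by get_user_pages inside an invalidate_range_start/end critical section may be freed with no further ->invalidate_page call, so the driver must refuse to instantiate secondary mappings while an invalidate is in flight. One possible driver-side shape of that rule is sketched below; every my_* name is hypothetical, only the notifier callbacks' signatures come from the patch.

    #include <linux/mmu_notifier.h>
    #include <linux/spinlock.h>
    #include <linux/mm.h>

    struct my_secondary_mmu {
    	struct mmu_notifier mn;
    	spinlock_t lock;
    	int invalidate_depth;	/* > 0 while inside invalidate_range_start/end */
    };

    static void my_invalidate_range_start(struct mmu_notifier *mn,
    				      struct mm_struct *mm,
    				      unsigned long start, unsigned long end)
    {
    	struct my_secondary_mmu *smmu =
    		container_of(mn, struct my_secondary_mmu, mn);

    	spin_lock(&smmu->lock);
    	smmu->invalidate_depth++;		/* block new secondary mappings */
    	my_drop_sptes(smmu, start, end);	/* hypothetical teardown + tlb flush */
    	spin_unlock(&smmu->lock);
    }

    static void my_invalidate_range_end(struct mmu_notifier *mn,
    				    struct mm_struct *mm,
    				    unsigned long start, unsigned long end)
    {
    	struct my_secondary_mmu *smmu =
    		container_of(mn, struct my_secondary_mmu, mn);

    	spin_lock(&smmu->lock);
    	smmu->invalidate_depth--;		/* faults may map pages again */
    	spin_unlock(&smmu->lock);
    }

    /* secondary mmu fault path: called after get_user_pages returned a page */
    static int my_map_page(struct my_secondary_mmu *smmu,
    		       unsigned long address, struct page *page)
    {
    	int ret = -EAGAIN;

    	spin_lock(&smmu->lock);
    	if (!smmu->invalidate_depth) {
    		my_establish_spte(smmu, address, page);	/* hypothetical */
    		ret = 0;
    	}
    	spin_unlock(&smmu->lock);
    	return ret;	/* on -EAGAIN the caller retries get_user_pages */
    }

Because the field is a depth count rather than a flag, concurrent invalidates of different ranges nest correctly, which is why the patch description speaks of an atomic counter increased in range_begin and decreased in range_end.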
From: Linus T. <tor...@li...> - 2008-05-07 22:12:04
|
On Wed, 7 May 2008, Andrea Arcangeli wrote: > > As far as I can tell, authorship has been destroyed by at least two of the > > patches (ie Christoph seems to be the author, but Andrea seems to have > > dropped that fact). > > I can't follow this, please be more specific. The patches were sent to lkml without *any* indication that you weren't actually the author. So if Andrew had merged them, they would have been merged as yours. > > That "locking" code is also too ugly to live, at least without some > > serious arguments for why it has to be done that way. Sorting the locks? > > In a vmalloc'ed area? And calling this something innocuous like > > "mm_lock()"? Hell no. > > That's only invoked in mmu_notifier_register, mm_lock is explicitly > documented as heavyweight function. Is that an excuse for UTTER AND TOTAL CRAP? Linus |
From: Andrea A. <an...@qu...> - 2008-05-07 22:27:58
|
On Wed, May 07, 2008 at 03:11:10PM -0700, Linus Torvalds wrote: > > > On Wed, 7 May 2008, Andrea Arcangeli wrote: > > > > As far as I can tell, authorship has been destroyed by at least two of the > > > patches (ie Christoph seems to be the author, but Andrea seems to have > > > dropped that fact). > > > > I can't follow this, please be more specific. > > The patches were sent to lkml without *any* indication that you weren't > actually the author. > > So if Andrew had merged them, they would have been merged as yours. I rechecked and I guarantee that the patches where Christoph isn't listed are developed by myself and he didn't write a single line on them. In any case I expect Christoph to review (he's CCed) and to point me to any attribution error. The only mistake I did once in that area was to give too _few_ attribution to myself and he asked me to add myself in the signed-off so I added myself by Christoph own request, but be sure I didn't remove him! |
From: Linus T. <tor...@li...> - 2008-05-07 23:04:14
|
On Thu, 8 May 2008, Andrea Arcangeli wrote: > > I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself and he didn't write a single line on > them. How long have you been doing kernel development? How about you read SubmittingPatches a few times before you show just how clueless you are? Hint: look for the string that says "From:". Also look at the section that talks about "summary phrase". You got it all wrong, and you don't even seem to realize that you got it wrong, even when I told you. Linus |
From: Roland D. <rd...@ci...> - 2008-05-07 22:31:36
|
> I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself and he didn't write a single line on > them. In any case I expect Christoph to review (he's CCed) and to > point me to any attribution error. The only mistake I did once in that > area was to give too _few_ attribution to myself and he asked me to > add myself in the signed-off so I added myself by Christoph own > request, but be sure I didn't remove him! I think the point you're missing is that any patches written by Christoph need a line like From: Christoph Lameter <cla...@sg...> at the top of the body so that Christoph becomes the author when it is committed into git. The Signed-off-by: line needs to be preserved too of course, but it is not sufficient by itself. - R. |
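In other words, the convention Roland describes is that a forwarded patch carries its author in the first line of the mail body, ahead of the changelog, with the handler's sign-off appended below the author's. Using the addresses already seen in this thread, the body of such a mail would start like this (the description line is only a placeholder):

    From: Christoph Lameter <cla...@sg...>

    [one-paragraph patch description]

    Signed-off-by: Christoph Lameter <cla...@sg...>
    Signed-off-by: Andrea Arcangeli <an...@qu...>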
From: Andrea A. <an...@qu...> - 2008-05-07 22:37:35
|
On Thu, May 08, 2008 at 12:27:58AM +0200, Andrea Arcangeli wrote: > I rechecked and I guarantee that the patches where Christoph isn't > listed are developed by myself and he didn't write a single line on > them. In any case I expect Christoph to review (he's CCed) and to > point me to any attribution error. The only mistake I did once in that > area was to give too _few_ attribution to myself and he asked me to > add myself in the signed-off so I added myself by Christoph own > request, but be sure I didn't remove him! By PM (guess he's scared to post to this thread ;) Chris is telling me that what you mean, perhaps, is that I should add a From: Christoph in the body of the email if the first signed-off-by is from Christoph, to indicate that the first signoff was by him and the patch in turn was started by him. I thought the order of the signoffs was enough, but if that From was mandatory and missing, then if there's any error it obviously wasn't intentional, especially given I only left a signed-off-by: christoph on his patches until he asked me to add my signoff too. Correcting it is trivial given I carefully ordered the signoffs so that the author is at the top of the signoff list. At least for mmu-notifier-core given I obviously am the original author of that code, I hope the From: of the email was enough even if an additional From: andrea was missing in the body. Also you can be sure that Christoph and especially Robin (XPMEM) will be more than happy if all patches with Christoph at the top of the signed-off-by are merged in 2.6.26 despite there not being a From: christoph at the top of the body ;). So I don't see a big deal here... |
From: Linus T. <tor...@li...> - 2008-05-07 23:40:02
|
On Thu, 8 May 2008, Andrea Arcangeli wrote: > > At least for mmu-notifier-core given I obviously am the original > author of that code, I hope the From: of the email was enough even if > an additional From: andrea was missing in the body. Ok, this whole series of patches has just been such a disaster that I'm (a) disgusted that _anybody_ sent an Acked-by: for any of it, and (b) that I'm still looking at it at all, but I am. And quite frankly, the more I look, and the more answers from you I get, the less I like it. And I didn't like it that much to start with, as you may have noticed. You say that "At least for mmu-notifier-core given I obviously am the original author of that code", but that is not at all obvious either. One of the reasons I stated that authorship seems to have been thrown away is very much exactly in that first mmu-notifier-core patch: + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <cla...@sg...> so I would very strongly dispute that it's "obvious" that you are the original author of the code there. So there was a reason why I said that I thought authorship had been lost somewhere along the way. Linus |
From: Andrea A. <an...@qu...> - 2008-05-07 22:39:11
|
On Wed, May 07, 2008 at 03:31:08PM -0700, Roland Dreier wrote: > I think the point you're missing is that any patches written by > Christoph need a line like > > From: Christoph Lameter <cla...@sg...> > > at the top of the body so that Christoph becomes the author when it is > committed into git. The Signed-off-by: line needs to be preserved too > of course, but it is not sufficient by itself. Ok, so I see the problem Linus is referring to now (I received the hint by PM too). I thought the order of the signed-off-by lines was relevant; it clearly isn't, or we're wasting space ;) |
From: Linus T. <tor...@li...> - 2008-05-07 23:04:14
|
On Thu, 8 May 2008, Andrea Arcangeli wrote: > > Ok, so I see the problem Linus is referring to now (I received the hint > by PM too). I thought the order of the signed-off-by lines was relevant; > it clearly isn't, or we're wasting space ;) The order of the signed-offs is somewhat relevant, but no, sign-offs don't mean authorship. See the rules for sign-off: you can sign off on another person's patches, even if they didn't sign off on them themselves. That's clause (b) in particular. So yes, quite often you'd _expect_ the first sign-off to match the author, but that's a correlation, not a causal relationship. Linus |