From: Andrea A. <an...@qu...> - 2008-04-22 14:10:41
|
Hello, This is the latest and greatest version of the mmu notifier patch #v13. Changes are mainly in mm_lock, which now uses sort() as suggested by Christoph. This reduces the complexity from O(N**2) to O(N*log(N)). I folded the mm_lock functionality together with the mmu-notifier-core 1/12 patch to make it self-contained. I recommend merging 1/12 into -mm/mainline ASAP. Lack of mmu notifiers is holding off KVM development. We are going to rework the way pages are mapped and unmapped to work with pure pfns for pci passthrough without the use of page pinning, and we can't do that without mmu notifiers. This is not just a performance matter. KVM/GRU and AFAICT Quadrics are all covered by applying the single 1/12 patch that shall be shipped with 2.6.26. The risk of breakage from applying 1/12 is zero, both when MMU_NOTIFIER=y and when it's =n, so it shouldn't be delayed further. XPMEM support comes with the later patches 2-12; the risk for those patches is >0, and this is why the mmu-notifier-core is numbered 1/12 and not 12/12. Some are simple and can go in immediately, but not all are so simple. 2-12/12 are posted as usual for review by the VM developers so that Robin can keep testing them on XPMEM, and they can be merged later without any downside (they're mostly orthogonal to 1/12). |
From: Andrea A. <an...@qu...> - 2008-04-22 14:10:30
|
# HG changeset patch # User Andrea Arcangeli <an...@qu...> # Date 1208870142 -7200 # Node ID ea87c15371b1bd49380c40c3f15f1c7ca4438af5 # Parent fb3bc9942fb78629d096bd07564f435d51d86e5f Core of mmu notifiers. Signed-off-by: Andrea Arcangeli <an...@qu...> Signed-off-by: Nick Piggin <np...@su...> Signed-off-by: Christoph Lameter <cla...@sg...> diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1050,6 +1050,27 @@ unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); +/* + * mm_lock will take mmap_sem writably (to prevent all modifications + * and scanning of vmas) and then also takes the mapping locks for + * each of the vma to lockout any scans of pagetables of this address + * space. This can be used to effectively holding off reclaim from the + * address space. + * + * mm_lock can fail if there is not enough memory to store a pointer + * array to all vmas. + * + * mm_lock and mm_unlock are expensive operations that may take a long time. + */ +struct mm_lock_data { + spinlock_t **i_mmap_locks; + spinlock_t **anon_vma_locks; + size_t nr_i_mmap_locks; + size_t nr_anon_vma_locks; +}; +extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); +extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); + extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -225,6 +225,9 @@ #ifdef CONFIG_CGROUP_MEM_RES_CTLR struct mem_cgroup *mem_cgroup; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct hlist_head mmu_notifier_list; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,229 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include <linux/list.h> +#include <linux/spinlock.h> +#include <linux/mm_types.h> + +struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER + +struct mmu_notifier_ops { + /* + * Called after all other threads have terminated and the executing + * thread is the only remaining execution thread. There are no + * users of the mm_struct remaining. + */ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* + * clear_flush_young is called after the VM is + * test-and-clearing the young/accessed bitflag in the + * pte. This way the VM will provide proper aging to the + * accesses to the page through the secondary MMUs and not + * only to the ones through the Linux pte. + */ + int (*clear_flush_young)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * Before this is invoked any secondary MMU is still ok to + * read/write to the page previously pointed by the Linux pte + * because the old page hasn't been freed yet. If required + * set_page_dirty has to be called internally to this method. + */ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* + * invalidate_range_start() and invalidate_range_end() must be + * paired and are called only when the mmap_sem is held and/or + * the semaphores protecting the reverse maps. Both functions + * may sleep. 
The subsystem must guarantee that no additional + * references to the pages in the range established between + * the call to invalidate_range_start() and the matching call + * to invalidate_range_end(). + * + * Invalidation of multiple concurrent ranges may be permitted + * by the driver or the driver may exclude other invalidation + * from proceeding by blocking on new invalidate_range_start() + * callback that overlap invalidates that are already in + * progress. Either way the establishment of sptes to the + * range can only be allowed if all invalidate_range_stop() + * function have been called. + * + * invalidate_range_start() is called when all pages in the + * range are still mapped and have at least a refcount of one. + * + * invalidate_range_end() is called when all pages in the + * range have been unmapped and the pages have been freed by + * the VM. + * + * The VM will remove the page table entries and potentially + * the page between invalidate_range_start() and + * invalidate_range_end(). If the page must not be freed + * because of pending I/O or other circumstances then the + * invalidate_range_start() callback (or the initial mapping + * by the driver) must make sure that the refcount is kept + * elevated. + * + * If the driver increases the refcount when the pages are + * initially mapped into an address space then either + * invalidate_range_start() or invalidate_range_end() may + * decrease the refcount. If the refcount is decreased on + * invalidate_range_start() then the VM can free pages as page + * table entries are removed. If the refcount is only + * droppped on invalidate_range_end() then the driver itself + * will drop the last refcount but it must take care to flush + * any secondary tlb before doing the final free on the + * page. Pages will no longer be referenced by the linux + * address space but may still be referenced by sptes until + * the last refcount is dropped. + */ + void (*invalidate_range_start)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); + void (*invalidate_range_end)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +/* + * The notifier chains are protected by mmap_sem and/or the reverse map + * semaphores. Notifier chains are only changed when all reverse maps and + * the mmap_sem locks are taken. + * + * Therefore notifier chains can only be traversed when either + * + * 1. mmap_sem is held. + * 2. One of the reverse map locks is held (i_mmap_sem or anon_vma->sem). + * 3. 
No other concurrent thread can access the list (release) + */ +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +static inline int mm_has_notifiers(struct mm_struct *mm) +{ + return unlikely(!hlist_empty(&mm->mmu_notifier_list)); +} + +extern int mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern int mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void __mmu_notifier_release(struct mm_struct *mm); +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address); +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end); +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end); + + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_release(mm); +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + return __mmu_notifier_clear_flush_young(mm, address); + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_page(mm, address); +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_start(mm, start, end); +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + if (mm_has_notifiers(mm)) + __mmu_notifier_invalidate_range_end(mm, start, end); +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ + INIT_HLIST_HEAD(&mm->mmu_notifier_list); +} + +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ +({ \ + pte_t __pte; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __pte = ptep_clear_flush(___vma, ___address, __ptep); \ + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ + __pte; \ +}) + +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ +({ \ + int __young; \ + struct vm_area_struct *___vma = __vma; \ + unsigned long ___address = __address; \ + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \ + ___address); \ + __young; \ +}) + +#else /* CONFIG_MMU_NOTIFIER */ + +static inline void mmu_notifier_release(struct mm_struct *mm) +{ +} + +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + return 0; +} + +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ +} + +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ +} + +static inline void mmu_notifier_mm_init(struct mm_struct *mm) +{ +} + +#define ptep_clear_flush_young_notify ptep_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -53,6 +53,7 @@ 
#include <linux/tty.h> #include <linux/proc_fs.h> #include <linux/blkdev.h> +#include <linux/mmu_notifier.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -362,6 +363,7 @@ if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_mm_init(mm); return mm; } diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,3 +193,7 @@ config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -194,7 +194,7 @@ if (pte) { /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); diff --git a/mm/fremap.c b/mm/fremap.c --- a/mm/fremap.c +++ b/mm/fremap.c @@ -15,6 +15,7 @@ #include <linux/rmap.h> #include <linux/module.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/mmu_context.h> #include <asm/cacheflush.h> @@ -214,7 +215,9 @@ spin_unlock(&mapping->i_mmap_lock); } + mmu_notifier_invalidate_range_start(mm, start, start + size); err = populate_range(mm, vma, start, size, pgoff); + mmu_notifier_invalidate_range_end(mm, start, start + size); if (!err && !(flags & MAP_NONBLOCK)) { if (unlikely(has_write_lock)) { downgrade_write(&mm->mmap_sem); diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -14,6 +14,7 @@ #include <linux/mempolicy.h> #include <linux/cpuset.h> #include <linux/mutex.h> +#include <linux/mmu_notifier.h> #include <asm/page.h> #include <asm/pgtable.h> @@ -799,6 +800,7 @@ BUG_ON(start & ~HPAGE_MASK); BUG_ON(end & ~HPAGE_MASK); + mmu_notifier_invalidate_range_start(mm, start, end); spin_lock(&mm->page_table_lock); for (address = start; address < end; address += HPAGE_SIZE) { ptep = huge_pte_offset(mm, address); @@ -819,6 +821,7 @@ } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range_end(mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -51,6 +51,7 @@ #include <linux/init.h> #include <linux/writeback.h> #include <linux/memcontrol.h> +#include <linux/mmu_notifier.h> #include <asm/pgalloc.h> #include <asm/uaccess.h> @@ -611,6 +612,9 @@ if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_start(src_mm, addr, end); + dst_pgd = pgd_offset(dst_mm, addr); src_pgd = pgd_offset(src_mm, addr); do { @@ -621,6 +625,11 @@ vma, addr, next)) return -ENOMEM; } while (dst_pgd++, src_pgd++, addr = next, addr != end); + + if (is_cow_mapping(vma->vm_flags)) + mmu_notifier_invalidate_range_end(src_mm, + vma->vm_start, end); + return 0; } @@ -825,7 +834,9 @@ unsigned long start = start_addr; spinlock_t *i_mmap_lock = details? 
details->i_mmap_lock: NULL; int fullmm = (*tlbp)->fullmm; + struct mm_struct *mm = vma->vm_mm; + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr); for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) { unsigned long end; @@ -876,6 +887,7 @@ } } out: + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr); return start; /* which is now the end (or restart) address */ } @@ -1463,10 +1475,11 @@ { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); + mmu_notifier_invalidate_range_start(mm, start, end); pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); @@ -1474,6 +1487,7 @@ if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier_invalidate_range_end(mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); @@ -1675,7 +1689,7 @@ * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -26,6 +26,9 @@ #include <linux/mount.h> #include <linux/mempolicy.h> #include <linux/rmap.h> +#include <linux/vmalloc.h> +#include <linux/sort.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -2038,6 +2041,7 @@ /* mm's last user has gone, and its about to be pulled down */ arch_exit_mmap(mm); + mmu_notifier_release(mm); lru_add_drain(); flush_cache_mm(mm); @@ -2242,3 +2246,143 @@ return 0; } + +static int mm_lock_cmp(const void *a, const void *b) +{ + cond_resched(); + if ((unsigned long)*(spinlock_t **)a < + (unsigned long)*(spinlock_t **)b) + return -1; + else if (a == b) + return 0; + else + return 1; +} + +static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, + int anon) +{ + struct vm_area_struct *vma; + size_t i = 0; + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if (anon) { + if (vma->anon_vma) + locks[i++] = &vma->anon_vma->lock; + } else { + if (vma->vm_file && vma->vm_file->f_mapping) + locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock; + } + } + + if (!i) + goto out; + + sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + +out: + return i; +} + +static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 1); +} + +static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, + spinlock_t **locks) +{ + return mm_lock_sort(mm, locks, 0); +} + +static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +{ + spinlock_t *last = NULL; + size_t i; + + for (i = 0; i < nr; i++) + /* Multiple vmas may use the same lock. */ + if (locks[i] != last) { + BUG_ON((unsigned long) last > (unsigned long) locks[i]); + last = locks[i]; + if (lock) + spin_lock(last); + else + spin_unlock(last); + } +} + +static inline void __mm_lock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 1); +} + +static inline void __mm_unlock(spinlock_t **locks, size_t nr) +{ + mm_lock_unlock(locks, nr, 0); +} + +/* + * This operation locks against the VM for all pte/vma/mm related + * operations that could ever happen on a certain mm. This includes + * vmtruncate, try_to_unmap, and all page faults. The holder + * must not hold any mm related lock. 
A single task can't take more + * than one mm lock in a row or it would deadlock. + */ +int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) +{ + spinlock_t **anon_vma_locks, **i_mmap_locks; + + down_write(&mm->mmap_sem); + if (mm->map_count) { + anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!anon_vma_locks)) { + up_write(&mm->mmap_sem); + return -ENOMEM; + } + + i_mmap_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count); + if (unlikely(!i_mmap_locks)) { + up_write(&mm->mmap_sem); + vfree(anon_vma_locks); + return -ENOMEM; + } + + data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); + data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); + + if (data->nr_anon_vma_locks) { + __mm_lock(anon_vma_locks, data->nr_anon_vma_locks); + data->anon_vma_locks = anon_vma_locks; + } else + vfree(anon_vma_locks); + + if (data->nr_i_mmap_locks) { + __mm_lock(i_mmap_locks, data->nr_i_mmap_locks); + data->i_mmap_locks = i_mmap_locks; + } else + vfree(i_mmap_locks); + } + return 0; +} + +static void mm_unlock_vfree(spinlock_t **locks, size_t nr) +{ + __mm_unlock(locks, nr); + vfree(locks); +} + +/* avoid memory allocations for mm_unlock to prevent deadlock */ +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) +{ + if (mm->map_count) { + if (data->nr_anon_vma_locks) + mm_unlock_vfree(data->anon_vma_locks, + data->nr_anon_vma_locks); + if (data->i_mmap_locks) + mm_unlock_vfree(data->i_mmap_locks, + data->nr_i_mmap_locks); + } + up_write(&mm->mmap_sem); +} diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,130 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * Copyright (C) 2008 SGI + * Christoph Lameter <cla...@sg...> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/mmu_notifier.h> +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/err.h> + +/* + * No synchronization. This function can only be called when only a single + * process remains that performs teardown. + */ +void __mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + + while (unlikely(!hlist_empty(&mm->mmu_notifier_list))) { + mn = hlist_entry(mm->mmu_notifier_list.first, + struct mmu_notifier, + hlist); + hlist_del(&mn->hlist); + if (mn->ops->release) + mn->ops->release(mn, mm); + } +} + +/* + * If no young bitflag is supported by the hardware, ->clear_flush_young can + * unmap the address and return 1 or 0 depending if the mapping previously + * existed or not. 
+ */ +int __mmu_notifier_clear_flush_young(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + int young = 0; + + hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) { + if (mn->ops->clear_flush_young) + young |= mn->ops->clear_flush_young(mn, mm, address); + } + + return young; +} + +void __mmu_notifier_invalidate_page(struct mm_struct *mm, + unsigned long address) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + + hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) { + if (mn->ops->invalidate_page) + mn->ops->invalidate_page(mn, mm, address); + } +} + +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + + hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) { + if (mn->ops->invalidate_range_start) + mn->ops->invalidate_range_start(mn, mm, start, end); + } +} + +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + struct mmu_notifier *mn; + struct hlist_node *n; + + hlist_for_each_entry(mn, n, &mm->mmu_notifier_list, hlist) { + if (mn->ops->invalidate_range_end) + mn->ops->invalidate_range_end(mn, mm, start, end); + } +} + +/* + * Must not hold mmap_sem nor any other VM related lock when calling + * this registration function. + */ +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + struct mm_lock_data data; + int ret; + + ret = mm_lock(mm, &data); + if (unlikely(ret)) + goto out; + hlist_add_head(&mn->hlist, &mm->mmu_notifier_list); + mm_unlock(mm, &data); +out: + return ret; +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +/* + * mm_users can't go down to zero while mmu_notifier_unregister() + * runs or it can race with ->release. So a mm_users pin must + * be taken by the caller (if mm can be different from current->mm). 
+ */ +int mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + struct mm_lock_data data; + int ret; + + BUG_ON(!atomic_read(&mm->mm_users)); + + ret = mm_lock(mm, &data); + if (unlikely(ret)) + goto out; + hlist_del(&mn->hlist); + mm_unlock(mm, &data); +out: + return ret; +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); diff --git a/mm/mprotect.c b/mm/mprotect.c --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -21,6 +21,7 @@ #include <linux/syscalls.h> #include <linux/swap.h> #include <linux/swapops.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/pgtable.h> #include <asm/cacheflush.h> @@ -198,10 +199,12 @@ dirty_accountable = 1; } + mmu_notifier_invalidate_range_start(mm, start, end); if (is_vm_hugetlb_page(vma)) hugetlb_change_protection(vma, start, end, vma->vm_page_prot); else change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable); + mmu_notifier_invalidate_range_end(mm, start, end); vm_stat_account(mm, oldflags, vma->vm_file, -nrpages); vm_stat_account(mm, newflags, vma->vm_file, nrpages); return 0; diff --git a/mm/mremap.c b/mm/mremap.c --- a/mm/mremap.c +++ b/mm/mremap.c @@ -18,6 +18,7 @@ #include <linux/highmem.h> #include <linux/security.h> #include <linux/syscalls.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/cacheflush.h> @@ -74,7 +75,11 @@ struct mm_struct *mm = vma->vm_mm; pte_t *old_pte, *new_pte, pte; spinlock_t *old_ptl, *new_ptl; + unsigned long old_start; + old_start = old_addr; + mmu_notifier_invalidate_range_start(vma->vm_mm, + old_start, old_end); if (vma->vm_file) { /* * Subtle point from Rajesh Venkatasubramanian: before @@ -116,6 +121,7 @@ pte_unmap_unlock(old_pte - 1, old_ptl); if (mapping) spin_unlock(&mapping->i_mmap_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end); } #define LATENCY_LIMIT (64 * PAGE_SIZE) diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -49,6 +49,7 @@ #include <linux/module.h> #include <linux/kallsyms.h> #include <linux/memcontrol.h> +#include <linux/mmu_notifier.h> #include <asm/tlbflush.h> @@ -287,7 +288,7 @@ if (vma->vm_flags & VM_LOCKED) { referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + } else if (ptep_clear_flush_young_notify(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -456,7 +457,7 @@ pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush(vma, address, pte); + entry = ptep_clear_flush_notify(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -717,14 +718,14 @@ * skipped over this mm) then we should reactivate it. */ if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { + (ptep_clear_flush_young_notify(vma, address, pte)))) { ret = SWAP_FAIL; goto out_unmap; } /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -849,12 +850,12 @@ page = vm_normal_page(vma, address, *pte); BUG_ON(!page || PageAnon(page)); - if (ptep_clear_flush_young(vma, address, pte)) + if (ptep_clear_flush_young_notify(vma, address, pte)) continue; /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) |
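To make the registration side of the API above concrete, here is a minimal sketch of how a secondary-MMU driver might hook into it. It is illustrative only: the my_smmu structure, its callbacks and my_smmu_attach() are hypothetical names, not part of the patch; the sketch uses only the struct mmu_notifier, mmu_notifier_ops and mmu_notifier_register() interfaces introduced in 1/12 (clear_flush_young and invalidate_page are omitted for brevity; the core checks each hook before calling it).

/*
 * Illustrative sketch only: a hypothetical secondary-MMU driver using
 * the API from patch 1/12.
 */
#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct my_smmu {
	struct mmu_notifier mn;	/* embedded, recovered via container_of() */
	/* ... driver-private shadow pagetable state would live here ... */
};

static void my_smmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_smmu *smmu = container_of(mn, struct my_smmu, mn);

	/* the mm is going away: tear down every secondary mapping */
	(void)smmu;
}

static void my_smmu_invalidate_range_start(struct mmu_notifier *mn,
					   struct mm_struct *mm,
					   unsigned long start,
					   unsigned long end)
{
	/* stop establishing sptes in [start, end) and flush existing ones */
}

static void my_smmu_invalidate_range_end(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end)
{
	/* primary ptes are gone and pages freed; sptes may be rebuilt now */
}

static const struct mmu_notifier_ops my_smmu_ops = {
	.release		= my_smmu_release,
	.invalidate_range_start	= my_smmu_invalidate_range_start,
	.invalidate_range_end	= my_smmu_invalidate_range_end,
};

/* called with no mm-related locks held, e.g. from the driver's open path */
static int my_smmu_attach(void)
{
	struct my_smmu *smmu = kzalloc(sizeof(*smmu), GFP_KERNEL);
	int err;

	if (!smmu)
		return -ENOMEM;
	smmu->mn.ops = &my_smmu_ops;
	/* can fail with -ENOMEM if mm_lock() can't vmalloc its lock arrays */
	err = mmu_notifier_register(&smmu->mn, current->mm);
	if (err)
		kfree(smmu);
	return err;
}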
From: Andrea A. <an...@qu...> - 2008-04-22 15:15:52
|
On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote: > Andrea Arcangeli a écrit : >> + >> +static int mm_lock_cmp(const void *a, const void *b) >> +{ >> + cond_resched(); >> + if ((unsigned long)*(spinlock_t **)a < >> + (unsigned long)*(spinlock_t **)b) >> + return -1; >> + else if (a == b) >> + return 0; >> + else >> + return 1; >> +} >> + > This compare function looks unusual... > It should work, but sort() could be faster if the > if (a == b) test had a chance to be true eventually... Hmm, are you saying my mm_lock_cmp won't return 0 if a==b? > static int mm_lock_cmp(const void *a, const void *b) > { > unsigned long la = (unsigned long)*(spinlock_t **)a; > unsigned long lb = (unsigned long)*(spinlock_t **)b; > > cond_resched(); > if (la < lb) > return -1; > if (la > lb) > return 1; > return 0; > } If your intent is to use the assumption that there are going to be few equal entries, you should have used likely(la > lb) to signal it's rarely going to return zero or gcc is likely free to do whatever it wants with the above. Overall that function is such a slow path that this is going to be lost in the noise. My suggestion would be to defer microoptimizations like this after 1/12 will be applied to mainline. Thanks! |
From: Avi K. <av...@qu...> - 2008-04-22 15:24:31
|
Andrea Arcangeli wrote: > On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote: > >> Andrea Arcangeli a écrit : >> >>> + >>> +static int mm_lock_cmp(const void *a, const void *b) >>> +{ >>> + cond_resched(); >>> + if ((unsigned long)*(spinlock_t **)a < >>> + (unsigned long)*(spinlock_t **)b) >>> + return -1; >>> + else if (a == b) >>> + return 0; >>> + else >>> + return 1; >>> +} >>> + >>> >> This compare function looks unusual... >> It should work, but sort() could be faster if the >> if (a == b) test had a chance to be true eventually... >> > > Hmm, are you saying my mm_lock_cmp won't return 0 if a==b? > > You need to compare *a to *b (at least, that's what you're doing for the < case). -- error compiling committee.c: too many arguments to function |
From: Eric D. <da...@co...> - 2008-04-22 16:40:34
|
Andrea Arcangeli a écrit : > On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote: > >> Andrea Arcangeli a écrit : >> >>> + >>> +static int mm_lock_cmp(const void *a, const void *b) >>> +{ >>> + cond_resched(); >>> + if ((unsigned long)*(spinlock_t **)a < >>> + (unsigned long)*(spinlock_t **)b) >>> + return -1; >>> + else if (a == b) >>> + return 0; >>> + else >>> + return 1; >>> +} >>> + >>> >> This compare function looks unusual... >> It should work, but sort() could be faster if the >> if (a == b) test had a chance to be true eventually... >> > > Hmm, are you saying my mm_lock_cmp won't return 0 if a==b? > I am saying your intent was probably to test else if ((unsigned long)*(spinlock_t **)a == (unsigned long)*(spinlock_t **)b) return 0; Because a and b are pointers to the data you want to compare. You need to dereference them. >> static int mm_lock_cmp(const void *a, const void *b) >> { >> unsigned long la = (unsigned long)*(spinlock_t **)a; >> unsigned long lb = (unsigned long)*(spinlock_t **)b; >> >> cond_resched(); >> if (la < lb) >> return -1; >> if (la > lb) >> return 1; >> return 0; >> } >> > > If your intent is to use the assumption that there are going to be few > equal entries, you should have used likely(la > lb) to signal it's > rarely going to return zero or gcc is likely free to do whatever it > wants with the above. Overall that function is such a slow path that > this is going to be lost in the noise. My suggestion would be to defer > microoptimizations like this after 1/12 will be applied to mainline. > > Thanks! > > Hum, it's not a micro-optimization, but a bug fix. :) Sorry if it was not clear |
From: Andrea A. <an...@qu...> - 2008-04-22 16:46:15
|
On Tue, Apr 22, 2008 at 05:37:38PM +0200, Eric Dumazet wrote: > I am saying your intent was probably to test > > else if ((unsigned long)*(spinlock_t **)a == > (unsigned long)*(spinlock_t **)b) > return 0; Indeed... > Hum, it's not a micro-optimization, but a bug fix. :) The good thing is that even if this bug led to a system crash, it would still be zero risk for everybody who isn't actively using KVM/GRU with mmu notifiers. The important thing is that this patch has zero risk of introducing regressions into the kernel, both when enabled and disabled; it's like a new driver. I'll shortly resend 1/12 and likely 12/12 for theoretical correctness. For now you can go ahead testing with this patch as it'll work fine despite the bug (if that weren't the case I would have noticed already ;). |
From: Eric D. <da...@co...> - 2008-04-22 17:22:36
|
Andrea Arcangeli a écrit : > + > +static int mm_lock_cmp(const void *a, const void *b) > +{ > + cond_resched(); > + if ((unsigned long)*(spinlock_t **)a < > + (unsigned long)*(spinlock_t **)b) > + return -1; > + else if (a == b) > + return 0; > + else > + return 1; > +} > + This compare function looks unusual... It should work, but sort() could be faster if the if (a == b) test had a chance to be true eventually... static int mm_lock_cmp(const void *a, const void *b) { unsigned long la = (unsigned long)*(spinlock_t **)a; unsigned long lb = (unsigned long)*(spinlock_t **)b; cond_resched(); if (la < lb) return -1; if (la > lb) return 1; return 0; } |
From: Christoph L. <cla...@sg...> - 2008-04-22 20:19:28
|
Thanks for adding most of my enhancements. But 1. There is no real need for invalidate_page(). Can be done with invalidate_start/end. Needlessly complicates the API. One of the objections by Andrew was that there were multiple callbacks that perform similar functions. 2. The locks that are used are later changed to semaphores. This is f.e. true for mm_lock / mm_unlock. The diffs will be smaller if the lock conversion is done first and then mm_lock is introduced. The way the patches are structured means that reviewers cannot review the final version of mm_lock etc etc. The lock conversion needs to come first. 3. As noted by Eric and also contained in private post from yesterday by me: The cmp function needs to retrieve the value before doing comparisons which is not done for the == of a and b. |
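As a rough illustration of point 1 (not a proposed patch, and glossing over the locking changes it would require at the real call sites), dropping invalidate_page() would mean bracketing the pte teardown with a one-page range instead:

/*
 * Hypothetical sketch only: the single-page callout expressed as a
 * degenerate range. It assumes a context where the range callbacks
 * are allowed to sleep, i.e. outside the pte spinlock, which is
 * exactly the call-site restructuring this point implies.
 */
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <asm/pgtable.h>

static pte_t clear_pte_with_range_notify(struct vm_area_struct *vma,
					 unsigned long address, pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t pteval;

	mmu_notifier_invalidate_range_start(mm, address, address + PAGE_SIZE);
	pteval = ptep_clear_flush(vma, address, ptep);
	mmu_notifier_invalidate_range_end(mm, address, address + PAGE_SIZE);

	return pteval;
}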
From: Robin H. <ho...@sg...> - 2008-04-22 20:32:26
|
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote: > Thanks for adding most of my enhancements. But > > 1. There is no real need for invalidate_page(). Can be done with > invalidate_start/end. Needlessly complicates the API. One > of the objections by Andrew was that there were multiple > callbacks that perform similar functions. While I agree with that reading of Andrew's email about invalidate_page, I think the GRU hardware makes a strong enough case to justify the two separate callouts. Due to the GRU hardware, we can ensure that invalidate_page terminates all pending GRU faults (including faults that are just beginning) and can therefore be completed without needing any locking. The invalidate_page() callout gets turned into a GRU flush instruction and we return. Because invalidate_range_start() leaves the page table information available, we cannot use a single-page _start to mimic that functionality. Therefore, there is a documented case justifying the separate callouts. I agree the case is fairly weak, but it does exist. Given Andrea's unwillingness to move and Jack's documented case, it is my opinion the most likely compromise is to leave in the invalidate_page() callout. Thanks, Robin |
From: Andrea A. <an...@qu...> - 2008-04-24 17:41:52
|
On Thu, Apr 24, 2008 at 05:39:43PM +0200, Andrea Arcangeli wrote: > There's at least one small issue I noticed so far, that while _release > don't need to care about _register, but _unregister definitely need to > care about _register. I've to take the mmap_sem in addition or in In the end the best is to use the spinlock around those list_add/list_del they all run in O(1) with the hlist and they take a few asm insn. This also avoids to take the mmap_sem in exit_mmap, at exit_mmap time nobody should need to use mmap_sem anymore, it might work but this looks cleaner. The lock is dynamically allocated only when the notifiers are registered, so the few bytes taken by it aren't relevant. A full new update will some become visible here: http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/ Please have a close look again. Your help is extremely appreciated and very helpful as usual! Thanks a lot. diff -urN xxx/include/linux/mmu_notifier.h xx/include/linux/mmu_notifier.h --- xxx/include/linux/mmu_notifier.h 2008-04-24 19:41:15.000000000 +0200 +++ xx/include/linux/mmu_notifier.h 2008-04-24 19:38:37.000000000 +0200 @@ -15,7 +15,7 @@ struct hlist_head list; struct srcu_struct srcu; /* to serialize mmu_notifier_unregister against mmu_notifier_release */ - spinlock_t unregister_lock; + spinlock_t lock; }; struct mmu_notifier_ops { diff -urN xxx/mm/memory.c xx/mm/memory.c --- xxx/mm/memory.c 2008-04-24 19:41:15.000000000 +0200 +++ xx/mm/memory.c 2008-04-24 19:38:37.000000000 +0200 @@ -605,16 +605,13 @@ * readonly mappings. The tradeoff is that copy_page_range is more * efficient than faulting. */ - ret = 0; if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) { if (!vma->anon_vma) - goto out; + return 0; } - if (unlikely(is_vm_hugetlb_page(vma))) { - ret = copy_hugetlb_page_range(dst_mm, src_mm, vma); - goto out; - } + if (is_vm_hugetlb_page(vma)) + return copy_hugetlb_page_range(dst_mm, src_mm, vma); if (is_cow_mapping(vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); @@ -636,7 +633,6 @@ if (is_cow_mapping(vma->vm_flags)) mmu_notifier_invalidate_range_end(src_mm, vma->vm_start, end); -out: return ret; } diff -urN xxx/mm/mmap.c xx/mm/mmap.c --- xxx/mm/mmap.c 2008-04-24 19:41:15.000000000 +0200 +++ xx/mm/mmap.c 2008-04-24 19:38:37.000000000 +0200 @@ -2381,7 +2381,7 @@ if (data->nr_anon_vma_locks) mm_unlock_vfree(data->anon_vma_locks, data->nr_anon_vma_locks); - if (data->i_mmap_locks) + if (data->nr_i_mmap_locks) mm_unlock_vfree(data->i_mmap_locks, data->nr_i_mmap_locks); } diff -urN xxx/mm/mmu_notifier.c xx/mm/mmu_notifier.c --- xxx/mm/mmu_notifier.c 2008-04-24 19:41:15.000000000 +0200 +++ xx/mm/mmu_notifier.c 2008-04-24 19:31:23.000000000 +0200 @@ -24,22 +24,16 @@ * zero). All other tasks of this mm already quit so they can't invoke * mmu notifiers anymore. This can run concurrently only against * mmu_notifier_unregister and it serializes against it with the - * unregister_lock in addition to RCU. struct mmu_notifier_mm can't go - * away from under us as the exit_mmap holds a mm_count pin itself. - * - * The ->release method can't allow the module to be unloaded, the - * module can only be unloaded after mmu_notifier_unregister run. This - * is because the release method has to run the ret instruction to - * return back here, and so it can't allow the ret instruction to be - * freed. + * mmu_notifier_mm->lock in addition to RCU. 
struct mmu_notifier_mm + * can't go away from under us as exit_mmap holds a mm_count pin + * itself. */ void __mmu_notifier_release(struct mm_struct *mm) { struct mmu_notifier *mn; int srcu; - srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); - spin_lock(&mm->mmu_notifier_mm->unregister_lock); + spin_lock(&mm->mmu_notifier_mm->lock); while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) { mn = hlist_entry(mm->mmu_notifier_mm->list.first, struct mmu_notifier, @@ -52,23 +46,28 @@ */ hlist_del_init(&mn->hlist); /* + * SRCU here will block mmu_notifier_unregister until + * ->release returns. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + /* * if ->release runs before mmu_notifier_unregister it * must be handled as it's the only way for the driver - * to flush all existing sptes before the pages in the - * mm are freed. + * to flush all existing sptes and stop the driver + * from establishing any more sptes before all the + * pages in the mm are freed. */ - spin_unlock(&mm->mmu_notifier_mm->unregister_lock); - /* SRCU will block mmu_notifier_unregister */ mn->ops->release(mn, mm); - spin_lock(&mm->mmu_notifier_mm->unregister_lock); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + spin_lock(&mm->mmu_notifier_mm->lock); } - spin_unlock(&mm->mmu_notifier_mm->unregister_lock); - srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); /* - * Wait ->release if mmu_notifier_unregister run list_del_rcu. - * srcu can't go away from under us because one mm_count is - * hold by exit_mmap. + * Wait ->release if mmu_notifier_unregister is running it. + * The mmu_notifier_mm can't go away from under us because one + * mm_count is hold by exit_mmap. */ synchronize_srcu(&mm->mmu_notifier_mm->srcu); } @@ -177,11 +176,19 @@ goto out_unlock; } INIT_HLIST_HEAD(&mm->mmu_notifier_mm->list); - spin_lock_init(&mm->mmu_notifier_mm->unregister_lock); + spin_lock_init(&mm->mmu_notifier_mm->lock); } atomic_inc(&mm->mm_count); + /* + * Serialize the update against mmu_notifier_unregister. A + * side note: mmu_notifier_release can't run concurrently with + * us because we hold the mm_users pin (either implicitly as + * current->mm or explicitly with get_task_mm() or similar). + */ + spin_lock(&mm->mmu_notifier_mm->lock); hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list); + spin_unlock(&mm->mmu_notifier_mm->lock); out_unlock: mm_unlock(mm, &data); out: @@ -215,23 +222,32 @@ BUG_ON(atomic_read(&mm->mm_count) <= 0); - srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); - spin_lock(&mm->mmu_notifier_mm->unregister_lock); + spin_lock(&mm->mmu_notifier_mm->lock); if (!hlist_unhashed(&mn->hlist)) { hlist_del_rcu(&mn->hlist); before_release = 1; } - spin_unlock(&mm->mmu_notifier_mm->unregister_lock); if (before_release) /* + * SRCU here will force exit_mmap to wait ->release to finish + * before freeing the pages. + */ + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu); + spin_unlock(&mm->mmu_notifier_mm->lock); + if (before_release) { + /* * exit_mmap will block in mmu_notifier_release to * guarantee ->release is called before freeing the * pages. */ mn->ops->release(mn, mm); - srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + srcu_read_unlock(&mm->mmu_notifier_mm->srcu, srcu); + } - /* wait any running method to finish, including ->release */ + /* + * Wait any running method to finish, of course including + * ->release if it was run by mmu_notifier_relase instead of us. 
+ */ synchronize_srcu(&mm->mmu_notifier_mm->srcu); BUG_ON(atomic_read(&mm->mm_count) <= 0); |
From: Andrea A. <an...@qu...> - 2008-04-22 22:35:47
|
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote: > 3. As noted by Eric and also contained in private post from yesterday by > me: The cmp function needs to retrieve the value before > doing comparisons which is not done for the == of a and b. I retrieved the value, which is why mm_lock works perfectly on #v13 as well as #v12. It's not mandatory to ever return 0, so it won't produce any runtime error (there is a bugcheck for wrong sort ordering in my patch just in case it would generate any runtime error, and it never did, or I would have noticed before submission), which is why I didn't need to release any hotfix yet and I'm waiting a bit longer to get more comments before sending an update to clean up that bit. Mentioning this as the third and last point I guess shows how strong your arguments are against merging my mmu-notifier-core now, so in the end making that cosmetic error paid off somehow. I'll send an update in any case to Andrew way before Saturday, so hopefully we'll finally get mmu-notifiers-core merged before next week. Also I'm not updating my mmu-notifier-core patch anymore except for strict bugfixes, so don't worry about any more cosmetic bugs being introduced while optimizing the code, as happened this time. The only other change I did has been to move mmu_notifier_unregister at the end of the patchset after getting more questions about its reliability and I documented a bit the rmmod requirements for ->release. we'll think later if it makes sense to add it, nobody's using it anyway. |
From: Robin H. <ho...@sg...> - 2008-04-26 13:17:33
|
On Thu, Apr 24, 2008 at 07:41:45PM +0200, Andrea Arcangeli wrote: > A full new update will some become visible here: > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/ I grabbed these and built them. Only change needed was another include. After that, everything built fine and xpmem regression tests ran through the first four sets. The fifth is the oversubscription test which trips my xpmem bug. This is as good as the v12 runs from before. Since this include and the one for mm_types.h both are build breakages for ia64, I think you need to apply your ia64_cpumask and the following (possibly as a single patch) first or in your patch 1. Without that, ia64 doing a git-bisect could hit a build failure. Index: mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h =================================================================== --- mmu_v14_pre3_xpmem_v003_v1.orig/include/linux/srcu.h 2008-04-26 06:41:54.000000000 -0500 +++ mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h 2008-04-26 07:01:17.292071827 -0500 @@ -27,6 +27,8 @@ #ifndef _LINUX_SRCU_H #define _LINUX_SRCU_H +#include <linux/mutex.h> + struct srcu_struct_array { int c[2]; }; |
From: Andrea A. <an...@qu...> - 2008-04-26 14:04:05
|
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote: > Since this include and the one for mm_types.h both are build breakages > for ia64, I think you need to apply your ia64_cpumask and the following > (possibly as a single patch) first or in your patch 1. Without that, > ia64 doing a git-bisect could hit a build failure. Agreed, so it doesn't risk to break ia64 compilation, thanks for the great XPMEM feedback! Also note, I figured out that mmu_notifier_release can actually run concurrently against other mmu notifiers in case there's a vmtruncate (->release could already run concurrently if invoked by _unregister, the only guarantee is that ->release will be called one time and only one time and that no mmu notifier will ever run after _unregister returns). In short I can't keep the list_del_init in _release and I need a list_del_init_rcu instead to fix this minor issue. So this won't really make much difference after all. I'll release #v14 with all this after a bit of kvm testing with it... diff --git a/include/linux/list.h b/include/linux/list.h --- a/include/linux/list.h +++ b/include/linux/list.h @@ -755,6 +755,14 @@ static inline void hlist_del_init(struct } } +static inline void hlist_del_init_rcu(struct hlist_node *n) +{ + if (!hlist_unhashed(n)) { + __hlist_del(n); + n->pprev = NULL; + } +} + /** * hlist_replace_rcu - replace old entry by new one * @old : the element to be replaced diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -22,7 +22,10 @@ struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is * being destroyed by exit_mmap, always before all pages are - * freed. It's mandatory to implement this method. + * freed. It's mandatory to implement this method. This can + * run concurrently to other mmu notifier methods and it + * should teardown all secondary mmu mappings and freeze the + * secondary mmu. */ void (*release)(struct mmu_notifier *mn, struct mm_struct *mm); diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -19,12 +19,13 @@ /* * This function can't run concurrently against mmu_notifier_register - * or any other mmu notifier method. mmu_notifier_register can only - * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is - * zero). All other tasks of this mm already quit so they can't invoke - * mmu notifiers anymore. This can run concurrently only against - * mmu_notifier_unregister and it serializes against it with the - * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers + * in parallel despite there's no task using this mm anymore, through + * the vmas outside of the exit_mmap context, like with + * vmtruncate. This serializes against mmu_notifier_unregister with + * the mmu_notifier_mm->lock in addition to SRCU and it serializes + * against the other mmu notifiers with SRCU. struct mmu_notifier_mm * can't go away from under us as exit_mmap holds a mm_count pin * itself. */ @@ -44,7 +45,7 @@ void __mmu_notifier_release(struct mm_st * to wait ->release to finish and * mmu_notifier_unregister to return. */ - hlist_del_init(&mn->hlist); + hlist_del_init_rcu(&mn->hlist); /* * SRCU here will block mmu_notifier_unregister until * ->release returns. 
@@ -185,6 +186,8 @@ int mmu_notifier_register(struct mmu_not * side note: mmu_notifier_release can't run concurrently with * us because we hold the mm_users pin (either implicitly as * current->mm or explicitly with get_task_mm() or similar). + * We can't race against any other mmu notifiers either thanks + * to mm_lock(). */ spin_lock(&mm->mmu_notifier_mm->lock); hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); |
From: Andrea A. <an...@qu...> - 2008-04-27 12:27:24
|
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote: > the first four sets. The fifth is the oversubscription test which trips > my xpmem bug. This is as good as the v12 runs from before. Now that mmu-notifier-core #v14 seems finished and hopefully will appear in 2.6.26 ;), I started exercising the kvm-mmu-notifier code more, with the full patchset applied and not only mmu-notifier-core. I soon found the full patchset has a swap deadlock bug. Then I tried without using kvm (so with the mmu notifier disarmed) and I could still reproduce the crashes. After grabbing a few stack traces I tracked it down to a bug in the i_mmap_lock->i_mmap_sem conversion. If your oversubscription means swapping, you should retest with this applied on top of the #v14 i_mmap_sem patch, as it would eventually deadlock with all tasks allocating memory in D state without this. Now the full patchset is as rock solid as with only mmu-notifier-core applied. It has been swapping a 2G memhog on top of a 3G VM with 2G of ram for the last few hours without a problem. Everything is working great with KVM at least. Talking about post 2.6.26: the refcount with rcu in the anon-vma conversion seems unnecessary and may explain part of the AIM slowdown too. The rest looks ok and probably we should switch the code to a compile-time decision between rwlock and rwsem (so obsoleting the current spinlock). diff --git a/mm/rmap.c b/mm/rmap.c --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1008,7 +1008,7 @@ static int try_to_unmap_file(struct page list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) vma->vm_private_data = NULL; out: - up_write(&mapping->i_mmap_sem); + up_read(&mapping->i_mmap_sem); return ret; } |
From: Robin H. <ho...@sg...> - 2008-04-22 23:07:21
|
> The only other change I did has been to move mmu_notifier_unregister > at the end of the patchset after getting more questions about its > reliability and I documented a bit the rmmod requirements for > ->release. we'll think later if it makes sense to add it, nobody's > using it anyway. XPMEM is using it. GRU will be as well (probably already does). |
From: Christoph L. <cla...@sg...> - 2008-04-22 23:20:29
|
On Wed, 23 Apr 2008, Andrea Arcangeli wrote: > I'll send an update in any case to Andrew way before Saturday so > hopefully we'll finally get mmu-notifiers-core merged before next > week. Also I'm not updating my mmu-notifier-core patch anymore except > for strict bugfixes so don't worry about any more cosmetical bugs > being introduced while optimizing the code like it happened this time. I guess I have to prepare another patchset then? |
From: Jack S. <st...@sg...> - 2008-04-23 00:28:53
|
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote: > > The only other change I did has been to move mmu_notifier_unregister > > at the end of the patchset after getting more questions about its > > reliability and I documented a bit the rmmod requirements for > > ->release. we'll think later if it makes sense to add it, nobody's > > using it anyway. > > XPMEM is using it. GRU will be as well (probably already does). Yeppp. The GRU driver unregisters the notifier when all GRU mappings are unmapped. I could make it work either way - either with or without an unregister function. However, unregister is the most logical action to take when all mappings have been destroyed. --- jack |
From: Christoph L. <cla...@sg...> - 2008-04-28 20:40:46
|
On Sun, 27 Apr 2008, Andrea Arcangeli wrote: > Talking about post 2.6.26: the refcount with rcu in the anon-vma > conversion seems unnecessary and may explain part of the AIM slowdown > too. The rest looks ok and probably we should switch the code to a > compile-time decision between rwlock and rwsem (so obsoleting the > current spinlock). You are going to take a semaphore in an rcu section? Guess you did not activate all debugging options while testing? I was not aware that you can take a sleeping lock from a non-preemptible context. |
From: Andrea A. <an...@qu...> - 2008-04-29 00:10:50
|
On Mon, Apr 28, 2008 at 01:34:11PM -0700, Christoph Lameter wrote: > On Sun, 27 Apr 2008, Andrea Arcangeli wrote: > > > Talking about post 2.6.26: the refcount with rcu in the anon-vma > > conversion seems unnecessary and may explain part of the AIM slowdown > > too. The rest looks ok and probably we should switch the code to a > > compile-time decision between rwlock and rwsem (so obsoleting the > > current spinlock). > > You are going to take a semaphore in an rcu section? Guess you did not > activate all debugging options while testing? I was not aware that you can > take a sleeping lock from a non-preemptible context. I'd hoped to discuss this topic after mmu-notifier-core was already merged, but let's do it anyway. My point of view is that there was no rcu when I wrote that code, there was no reference count either, and yet all the locking still looks exactly the same as when I wrote it. There's even still the page_table_lock to serialize threads taking the mmap_sem in read mode against the first vma->anon_vma = anon_vma assignment during the page fault. Frankly, I've absolutely no idea why rcu is needed in all the rmap code when walking page->mapping. PG_locked is definitely taken, so there's no way page->mapping could possibly go away under the rmap code, hence the anon_vma can't go away as it's queued in the vma, and the vma has to go away before the page is zapped out of the pte. So there are some possible scenarios: 1) my original anon_vma code was buggy because it didn't take rcu_read_lock() and somebody fixed it (I tend to exclude this) 2) somebody saw a race that doesn't exist and didn't bother to document it other than with this obscure comment * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. I tend to exclude this too, as the VM folks are too smart for it to be the case. 3) somebody did some microoptimization using rcu, and we can surely undo that microoptimization to get back to my original code, which didn't need rcu even though it worked exactly the same; that is going to be cheaper to use with semaphores than doubling the number of locked ops for every lock instruction. Now the double atomic op may not be horrible when uncontended, as it works on the same cacheline, but with cacheline bouncing under contention it sounds twice as bad as a single cacheline bounce, and I don't see the point of it: you can't use rcu anyway, so you can't possibly take advantage of whatever microoptimization was done over the original locking. |
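For what the "compile-time decision between rwlock and rwsem" quoted above could look like, here is a rough sketch; the config symbol and the anon_vma_lock_* wrappers are hypothetical names invented for illustration, not existing kernel interfaces:

/*
 * Hypothetical sketch: select the anon_vma lock type at compile time.
 * CONFIG_ANON_VMA_RWSEM and the wrapper names below are made up.
 */
#include <linux/rwsem.h>
#include <linux/spinlock.h>

#ifdef CONFIG_ANON_VMA_RWSEM			/* sleepable, for XPMEM-style users */
typedef struct rw_semaphore anon_vma_lock_t;
#define anon_vma_lock_init(l)		init_rwsem(l)
#define anon_vma_lock_read(l)		down_read(l)
#define anon_vma_unlock_read(l)		up_read(l)
#define anon_vma_lock_write(l)		down_write(l)
#define anon_vma_unlock_write(l)	up_write(l)
#else						/* non-sleeping default */
typedef rwlock_t anon_vma_lock_t;
#define anon_vma_lock_init(l)		rwlock_init(l)
#define anon_vma_lock_read(l)		read_lock(l)
#define anon_vma_unlock_read(l)		read_unlock(l)
#define anon_vma_lock_write(l)		write_lock(l)
#define anon_vma_unlock_write(l)	write_unlock(l)
#endif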
From: Andrea A. <an...@qu...> - 2008-04-23 13:47:06
|
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote: > > The only other change I did has been to move mmu_notifier_unregister > > at the end of the patchset after getting more questions about its > > reliability and I documented a bit the rmmod requirements for > > ->release. we'll think later if it makes sense to add it, nobody's > > using it anyway. > > XPMEM is using it. GRU will be as well (probably already does). XPMEM requires more patches anyway. Note that in previous email you told me you weren't using it. I think GRU can work fine on 2.6.26 without mmu_notifier_unregister, like KVM too. You've simply to unpin the module count in ->release. The most important bit is that you've to do that anyway in case mmu_notifier_unregister fails (and it can fail because of vmalloc space shortage because somebody loaded some framebuffer driver or whatever). |
From: Robin H. <ho...@sg...> - 2008-04-23 14:48:34
|
On Wed, Apr 23, 2008 at 03:36:19PM +0200, Andrea Arcangeli wrote: > On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote: > > > The only other change I did has been to move mmu_notifier_unregister > > > at the end of the patchset after getting more questions about its > > > reliability and I documented a bit the rmmod requirements for > > > ->release. we'll think later if it makes sense to add it, nobody's > > > using it anyway. > > > > XPMEM is using it. GRU will be as well (probably already does). > > XPMEM requires more patches anyway. Note that in previous email you > told me you weren't using it. I think GRU can work fine on 2.6.26 I said I could test without it. It is needed for the final version. It also makes the API consistent. What you are proposing is equivalent to having a file you can open but never close. This whole discussion seems ludicrous. You could refactor the code to get the sorted list of locks, pass that list into mm_lock to do the locking, do the register/unregister, then pass the same list into mm_unlock. If the allocation fails, you could fall back to the older slower method of repeatedly scanning the lists and acquiring locks in ascending order. > without mmu_notifier_unregister, like KVM too. You've simply to unpin > the module count in ->release. The most important bit is that you've > to do that anyway in case mmu_notifier_unregister fails (and it can If you are not going to provide the _unregister callout you need to change the API so I can scan the list of notifiers to see if my structures are already registered. We register our notifier structure at device open time. If we receive a _release callout, we mark our structure as unregistered. At device close time, if we have not been unregistered, we call _unregister. If you take away _unregister, I have an xpmem kernel structure in use _AFTER_ the device is closed with no indication that the process is using it. In that case, I need to get an extra reference to the module in my device open method and hold that reference until the _release callout. Additionally, if the users program reopens the device, I need to scan the mmu_notifiers list to see if this tasks notifier is already registered. I view _unregister as essential. Did I miss something? Thanks, Robin |
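A minimal sketch of the open/release/close lifecycle Robin describes (names are hypothetical placeholders, not XPMEM code; real code would also need its own locking around the flag and, per the comment in patch 1/12, an mm_users pin if mm can differ from current->mm):

/*
 * Hypothetical sketch of the lifecycle described above. my_dev and its
 * helpers are placeholders; only the mmu_notifier calls are from 1/12.
 */
#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct my_dev {
	struct mmu_notifier mn;
	struct mm_struct *mm;
	int released;		/* set by ->release, checked at device close */
};

static void my_dev_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_dev *dev = container_of(mn, struct my_dev, mn);

	/* flush all secondary mappings; the core has already unhashed the
	 * notifier, so the close path must not call unregister again */
	dev->released = 1;
}

static const struct mmu_notifier_ops my_dev_ops = {
	.release = my_dev_release,
};

static int my_dev_open(struct my_dev *dev)	/* device open time */
{
	dev->mn.ops = &my_dev_ops;
	dev->released = 0;
	dev->mm = current->mm;
	return mmu_notifier_register(&dev->mn, dev->mm);
}

static void my_dev_close(struct my_dev *dev)	/* device close time */
{
	if (!dev->released)
		/* in this series _unregister can itself fail with -ENOMEM,
		 * which is the reliability concern being debated here */
		mmu_notifier_unregister(&dev->mn, dev->mm);
}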
From: Andrea A. <an...@qu...> - 2008-04-23 16:00:11
|
On Wed, Apr 23, 2008 at 09:47:47AM -0500, Robin Holt wrote:
> It also makes the API consistent. What you are proposing is equivalent
> to having a file you can open but never close.

That's not entirely true: you can close the file just fine by killing the tasks, leading to an mmput. From a user perspective in KVM terms it won't make a difference, as /dev/kvm will remain open and it'll pin the module count until the kvm task is killed anyway; I assume it's similar for GRU. Until I had the idea of how to implement an mm_lock to ensure mmu_notifier_register couldn't miss a running invalidate_range_begin, it wasn't even possible to implement an mmu_notifier_unregister (see the EMM patches), and it looked like you were ok with an API that missed _unregister...

> This whole discussion seems ludicrous. You could refactor the code to get
> the sorted list of locks, pass that list into mm_lock to do the locking,
> do the register/unregister, then pass the same list into mm_unlock.

Correct, but it would keep the vmalloc ram pinned during the runtime. There's no reason to keep that ram allocated per-VM while the VM runs; we only need it during startup and teardown.

> If the allocation fails, you could fall back to the older slower method
> of repeatedly scanning the lists and acquiring locks in ascending order.

Correct, I already thought about that. This is exactly why I'm deferring it for later! Otherwise, perfectionism not needed for KVM/GRU will keep indefinitely delaying the part that has already converged and that is enough for KVM and GRU (and for this specific bit, actually enough for XPMEM as well). We can add a second version, mm_lock_slow, to use in mmu_notifier_unregister if mm_lock fails, with N^2 complexity, later, after the mmu-notifier-core is merged into mainline.

> If you are not going to provide the _unregister callout you need to change
> the API so I can scan the list of notifiers to see if my structures are
> already registered.

As said, 1/N isn't enough for XPMEM anyway. 1/N has to include only the absolute minimum, zero-risk stuff that is enough for both KVM and GRU.

> We register our notifier structure at device open time. If we receive a
> _release callout, we mark our structure as unregistered. At device close
> time, if we have not been unregistered, we call _unregister. If you
> take away _unregister, I have an xpmem kernel structure in use _AFTER_
> the device is closed with no indication that the process is using it.
> In that case, I need to get an extra reference to the module in my device
> open method and hold that reference until the _release callout.

Yes exactly, but you have to do that anyway in case mmu_notifier_unregister fails because some driver already allocated all the vmalloc space (even x86-64 doesn't have an indefinite amount of vmalloc, since vmalloc sits at the end of the address space), unless we have an N^2 fallback. But the N^2 fallback would make the code more easily DoSable and unkillable, so if I were an admin I'd rather have to quickly kill -9 a task in O(N) than have to wait for some syscall that runs in O(N^2) to complete before the task quits. So a fallback to a slower algorithm isn't necessarily what will really happen after 2.6.26 is released; we'll see. Relying on ->release for the module unpin sounds preferable, and it's certainly the only reliable way to unregister that we'll provide in 2.6.26.
> Additionally, if the user's program reopens the device, I need to scan the
> mmu_notifiers list to see if this task's notifier is already registered.

But you don't need to browse the list for this: keep a flag in your structure next to the mmu_notifier struct, set the bitflag after mmu_notifier_register returns, and clear the bitflag after ->release runs or after mmu_notifier_unregister returns success. What's the big deal in tracking whether you have to call mmu_notifier_register a second time or not? Or you can create a new structure every time somebody asks to reattach.

> I view _unregister as essential. Did I miss something?

We can add it later, and we can keep discussing what the best model to implement it is for as long as you want after 2.6.26 is released with mmu-notifier-core, so GRU/KVM are done. It's unlikely KVM will use mmu_notifier_unregister anyway, as we need it attached for the whole lifetime of the task, and only for the lifetime of the task.

This is the patch to add it; as you can see it's entirely orthogonal, backwards compatible with the previous API, and it doesn't duplicate or rewrite any code. Don't worry, any kernel after 2.6.26 will have unregister, but we can't focus on this for 2.6.26. We can also consider making mmu_notifier_register safe against double calls on the same structure, but again that's not something we should be doing in 1/N and it can be done later in a backwards compatible way (plus we're perfectly fine with the API having non-backwards-compatible changes as long as 2.6.26 can work for us).

---------------------------------

Implement unregister, but it's not reliable; only ->release is reliable.

Signed-off-by: Andrea Arcangeli <an...@qu...>

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -119,6 +119,8 @@
 extern int mmu_notifier_register(struct mmu_notifier *mn,
				  struct mm_struct *mm);
+extern int mmu_notifier_unregister(struct mmu_notifier *mn,
+				   struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
					     unsigned long address);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -106,3 +106,29 @@
 	return ret;
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_register);
+
+/*
+ * mm_users can't go down to zero while mmu_notifier_unregister()
+ * runs or it can race with ->release. So a mm_users pin must
+ * be taken by the caller (if mm can be different from current->mm).
+ *
+ * This function can fail (for example during out of memory conditions
+ * or after a vmalloc virtual range shortage), so the only reliable way
+ * to unregister is to wait for release() to be called.
+ */
+int mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+	struct mm_lock_data data;
+	int ret;
+
+	BUG_ON(!atomic_read(&mm->mm_users));
+
+	ret = mm_lock(mm, &data);
+	if (unlikely(ret))
+		goto out;
+	hlist_del(&mn->hlist);
+	mm_unlock(mm, &data);
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
|
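A possible caller of the unregister patch above, again using the hypothetical struct my_ctx from the earlier sketches: at device close time the driver tries mmu_notifier_unregister(), and if that fails (mm_lock() can fail under vmalloc/oom pressure, as the patch comment says) it leaves the notifier in place and lets ->release do the final teardown and module_put. Locking against a concurrent ->release is omitted for brevity; this is a sketch, not driver code from the thread.

/*
 * Hypothetical close-time path: prefer mmu_notifier_unregister(), fall
 * back to waiting for ->release when it fails.  The caller is assumed
 * to hold an mm_users pin if ctx->mm may differ from current->mm.
 */
static void my_device_close(struct my_ctx *ctx)
{
	if (!test_bit(MY_CTX_REGISTERED, &ctx->flags))
		return;		/* ->release already ran and unpinned the module */

	if (mmu_notifier_unregister(&ctx->mn, ctx->mm)) {
		/*
		 * Unregister failed (e.g. vmalloc shortage in mm_lock):
		 * keep the module pinned and let ->release clean up when
		 * the mm finally goes away.
		 */
		return;
	}

	clear_bit(MY_CTX_REGISTERED, &ctx->flags);
	module_put(THIS_MODULE);
}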
From: Andrea A. <an...@qu...> - 2008-04-23 16:26:28
|
On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
> I guess I have to prepare another patchset then?

If you want to embarrass yourself three times in a row, go ahead ;). I thought two failed takeovers were enough.
|
From: Christoph L. <cla...@sg...> - 2008-04-23 18:15:21
|
On Wed, 23 Apr 2008, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:20:35PM -0700, Christoph Lameter wrote:
> > I guess I have to prepare another patchset then?
>
> If you want to embarrass yourself three times in a row, go ahead ;). I
> thought two failed takeovers were enough.

Takeover? I'd be happy if I did not have to deal with this issue. These patches were necessary because you were not listening to feedback, plus there is the issue that your patchsets were not easy to review or diff against. I had to merge several patches to get to a useful patch. You have always picked up lots of stuff from my patchsets. Lots of work that could have been avoided by proper patchsets in the first place.
|
From: Andrea A. <an...@qu...> - 2008-04-23 16:37:10
|
On Tue, Apr 22, 2008 at 07:28:49PM -0500, Jack Steiner wrote:
> The GRU driver unregisters the notifier when all GRU mappings
> are unmapped. I could make it work either way - either with or without
> an unregister function. However, unregister is the most logical
> action to take when all mappings have been destroyed.

This is true for KVM as well: unregister would be the most logical action to take when the kvm device is closed and the vm destroyed.

However, we can't implement mm_lock in O(N*log(N)) without triggering RAM allocations, and the size of those RAM allocations is unknown at the time unregister runs (they also depend on the max_nr_vmas sysctl). So on second thought, not even passing the array from register to unregister would solve it (unless we allocate max_nr_vmas entries and block the sysctl from raising max_nr_vmas until every pending unregister has run). That's clearly unacceptable.

The only way to avoid failing because of vmalloc space shortage or oom would be to provide an O(N*N) fallback, but one that can't be interrupted by sigkill! sigkill interruption was ok in #v12 because we didn't rely on mmu_notifier_unregister to succeed, so it avoided any DoS, but it still couldn't provide a reliable unregister.

So in the end, unregistering with kill -9 leading to ->release in O(1) sounds like the safer solution for the long term. You can't loop when unregister fails and still pretend your module has no deadlocks. Yes, waiting for ->release adds a bit of complexity, but I think it's worth it, and there haven't been any brilliant ideas yet on how to avoid both the O(N*N) complexity and the allocations in mmu_notifier_unregister. Until such an idea materializes we'll stick with ->release in O(1) as the only safe unregister, so we guarantee the admin will be in control of his hardware in O(1) with kill -9, no matter if /dev/kvm and /dev/gru are owned by sillyuser.

I'm afraid that if you don't want to worst-case unregister with ->release, you need a better idea than my mm_lock, and personally I can't see any other way than mm_lock to ensure not to miss range_begin...

All the above is in 2.6.27 context (for 2.6.26 ->release is the way, even if such an idea were to materialize).
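For concreteness, here is roughly what the allocation-free O(N*N) fallback discussed above could look like: instead of vmalloc'ing an array of lock pointers and sort()ing it, repeatedly scan the vma list and take the not-yet-held lock with the lowest address, so locks are still acquired in a global order without any allocation. The name mm_lock_slow is an assumption, the field names match the 2.6.25-era i_mmap_lock/anon_vma->lock spinlocks, and the sketch ignores the lockdep handling the real mm_lock needs; it is an illustration of the idea, not code from the patchset.

/*
 * Allocation-free O(vmas^2) locking sketch: take every i_mmap_lock and
 * anon_vma lock of the address space in increasing pointer order by
 * rescanning the vma list once per lock taken.  Duplicate locks shared
 * by several vmas are skipped naturally because only addresses strictly
 * greater than the last lock taken are considered.
 * The caller is assumed to hold down_write(&mm->mmap_sem).
 */
static void mm_lock_slow(struct mm_struct *mm)
{
	unsigned long last = 0;

	for (;;) {
		struct vm_area_struct *vma;
		spinlock_t *next = NULL;

		for (vma = mm->mmap; vma; vma = vma->vm_next) {
			spinlock_t *candidate[2] = { NULL, NULL };
			int i;

			if (vma->vm_file && vma->vm_file->f_mapping)
				candidate[0] = &vma->vm_file->f_mapping->i_mmap_lock;
			if (vma->anon_vma)
				candidate[1] = &vma->anon_vma->lock;

			for (i = 0; i < 2; i++) {
				spinlock_t *l = candidate[i];

				if (!l || (unsigned long) l <= last)
					continue;
				if (!next || l < next)
					next = l;
			}
		}
		if (!next)
			break;		/* every lock is already held */
		spin_lock(next);	/* real code needs nested-lock lockdep handling */
		last = (unsigned long) next;
	}
}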