From: Andrea A. <an...@qu...> - 2008-01-13 16:24:43
|
Hello, This patch is last version of a basic implementation of the mmu notifiers. In short when the linux VM decides to free a page, it will unmap it from the linux pagetables. However when a page is mapped not just by the regular linux ptes, but also from the shadow pagetables, it's currently unfreeable by the linux VM. This patch allows the shadow pagetables to be dropped and the page to be freed after that, if the linux VM decides to unmap the page from the main ptes because it wants to swap out the page. In my basic initial patch I only track the tlb flushes which should be the minimum required to have a nice linux-VM controlled swapping behavior of the KVM gphysical memory. The shadow-ptes works much like a TLB, so the same way we flush the tlb after clearing the ptes, we should also issue the mmu_notifier invalidate_page/range/release methods. Quadrics needs much more than that to optimize things to preload secondary-ptes after main-pte instantiation (to prevent page faults) and other similar optimizations, but it's easy to add more methods to the below code to fit their needs if the basic infrastructure is ok. Furthermore the best would be to use self-modifying code to implement the notifiers as a nop inst when disarmed but such an optimization is a low priority for now. This follows the model of Avi's original patch, however I guess it would also be possible to track when the VM shrink_cache methods wants to free a certain host-page_t instead of tracking when the tlb is flushed. Not sure if other ways are feasible or better, but the below is the most important bit we need for KVM to swap nicely with minimal overhead to the host kernel even if KVM is unused. About the locking perhaps I'm underestimating it, but by following the TLB flushing analogy, by simply clearing the shadow ptes (with kvm mmu_lock spinlock to avoid racing with other vcpu spte accesses of course) and flushing the shadow-pte after clearing the main linux pte, it should be enough to serialize against shadow-pte page faults that would call into get_user_pages. Flushing the host TLB before or after the shadow-ptes shouldn't matter. Comments welcome... especially from SGI/IBM/Quadrics and all other potential users of this functionality. Patch is running on my desktop and swapping out a 3G non-linux VM, smp guest and smp host as I write this, it seems to work fine (only tested with SVM so far). If you're interested in the KVM usage of this feature, in the below link you can see how it enables full KVM swapping using the core linux kernel virtual memory algorithms using the KVM rmap (like pte-chains) to unmap shadow-ptes in addition to the mm/rmap.c for the regular ptes. http://marc.info/?l=kvm-devel&m=120023119710198&w=2 This is a zero-risk patch when unused. So I hope we can quickly concentrate on finalize the interface and merge it ASAP. It's mostly a matter of deciding if this is the right way of getting notification of VM mapping invalidations, mapping instantiations for secondary TLBs/MMUs/RDMA/Shadow etc... There are also certain details I'm uncertain about, like passing 'mm' to the lowlevel methods, my KVM usage of the invalidate_page() notifier for example only uses 'mm' for a BUG_ON for example: void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm, [..] BUG_ON(mm != kvm->mm); [..] BTW, if anybody needs 'mm' the mm could also be stored in the container of the 'mn', incidentally like in KVM case. containerof is an efficient and standard way to do things. 
OTOH I kept passing down the 'mm' like below just in case others don't have 'mm' saved in the container for other reasons like we have in struct kvm, and they prefer to get it through registers/stack. In practice it won't make any measurable difference, it's mostly a style issue. Signed-off-by: Andrea Arcangeli <an...@qu...> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -86,6 +86,7 @@ do { \ pte_t __pte; \ __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \ flush_tlb_page(__vma, __address); \ + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \ __pte; \ }) #endif diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -13,6 +13,7 @@ #include <linux/debug_locks.h> #include <linux/mm_types.h> #include <linux/security.h> +#include <linux/mmu_notifier.h> struct mempolicy; struct anon_vma; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -219,6 +219,10 @@ struct mm_struct { /* aio bits */ rwlock_t ioctx_list_lock; struct kioctx *ioctx_list; + +#ifdef CONFIG_MMU_NOTIFIER + struct hlist_head mmu_notifier; /* MMU notifier list */ +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,53 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include <linux/list.h> +#include <linux/mm_types.h> + +#ifdef CONFIG_MMU_NOTIFIER + +struct mmu_notifier; + +struct mmu_notifier_ops { + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + void (*invalidate_range)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +extern void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn); +extern void mmu_notifier_release(struct mm_struct *mm); + +#define mmu_notifier(function, mm, args...) \ + do { \ + struct mmu_notifier *__mn; \ + struct hlist_node *__n; \ + \ + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \ + if (__mn->ops->function) \ + __mn->ops->function(__mn, mm, args); \ + } while (0) + +#else /* CONFIG_MMU_NOTIFIER */ + +#define mmu_notifier_register(mn, mm) do {} while(0) +#define mmu_notifier_unregister(mn) do {} while (0) +#define mmu_notifier_release(mm) do {} while (0) + +#define mmu_notifier(function, mm, args...) 
\ + do { } while (0) + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,3 +193,7 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -753,6 +753,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier(invalidate_range, mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -889,6 +889,7 @@ unsigned long zap_page_range(struct vm_a end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); if (tlb) tlb_finish_mmu(tlb, address, end); + mmu_notifier(invalidate_range, mm, address, end); return end; } @@ -1358,6 +1359,7 @@ int remap_pfn_range(struct vm_area_struc if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range, mm, end-PAGE_ALIGN(size), end); return err; } EXPORT_SYMBOL(remap_pfn_range); @@ -1452,6 +1454,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range, mm, end-size, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1747,6 +1747,7 @@ static void unmap_region(struct mm_struc free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); tlb_finish_mmu(tlb, start, end); + mmu_notifier(invalidate_range, mm, start, end); } /* @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + mmu_notifier_release(mm); /* * Walk the list again, actually closing and freeing it, diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,35 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/mmu_notifier.h> +#include <linux/module.h> + +void mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + struct hlist_node *n, *tmp; + + hlist_for_each_entry_safe(mn, n, tmp, &mm->mmu_notifier, hlist) { + if (mn->ops->release) + mn->ops->release(mn, mm); + hlist_del(&mn->hlist); + } +} + +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + hlist_add_head(&mn->hlist, &mm->mmu_notifier); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +void mmu_notifier_unregister(struct mmu_notifier *mn) +{ + hlist_del(&mn->hlist); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); |
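For illustration, here is a minimal sketch (not part of the posted patch) of how a hypothetical secondary-MMU driver could plug into the interface above, following the container_of pattern described in the mail. The my_mmu structure and all my_* names are invented for the example; only the mmu_notifier API itself comes from the patch.

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

/* Hypothetical driver state embedding the notifier, KVM-style. */
struct my_mmu {
	struct mm_struct *mm;
	struct mmu_notifier mn;
	/* ... secondary/shadow pagetable state would live here ... */
};

static void my_mmu_invalidate_page(struct mmu_notifier *mn,
				   struct mm_struct *mm,
				   unsigned long address)
{
	/* Recover the container; 'mm' is then only needed for sanity checks. */
	struct my_mmu *m = container_of(mn, struct my_mmu, mn);

	BUG_ON(mm != m->mm);
	/* Drop the secondary mapping of 'address' and flush the secondary TLB. */
}

static void my_mmu_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_mmu *m = container_of(mn, struct my_mmu, mn);
	/* The address space is going away: tear down all secondary mappings. */
}

static const struct mmu_notifier_ops my_mmu_ops = {
	.release	 = my_mmu_release,
	.invalidate_page = my_mmu_invalidate_page,
	/* .invalidate_range left NULL: the mmu_notifier() macro skips NULL methods. */
};

static void my_mmu_attach(struct my_mmu *m, struct mm_struct *mm)
{
	m->mm = mm;
	m->mn.ops = &my_mmu_ops;
	mmu_notifier_register(&m->mn, mm);
}

Once registered, the mmu_notifier(invalidate_page/invalidate_range, ...) calls the patch adds to ptep_clear_flush(), zap_page_range(), unmap_region() and friends will invoke these methods whenever the corresponding user mappings are torn down.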
From: Benjamin H. <be...@ke...> - 2008-01-13 21:12:22
|
On Sun, 2008-01-13 at 17:24 +0100, Andrea Arcangeli wrote: > Hello, > > This patch is last version of a basic implementation of the mmu > notifiers. > > In short when the linux VM decides to free a page, it will unmap it > from the linux pagetables. However when a page is mapped not just by > the regular linux ptes, but also from the shadow pagetables, it's > currently unfreeable by the linux VM. > > This patch allows the shadow pagetables to be dropped and the page to > be freed after that, if the linux VM decides to unmap the page from > the main ptes because it wants to swap out the page. Another potential user of that I can see is the DRM. Nowadays, graphic cards essentially have an MMU on chip, and can do paging. It would be nice to be able to map user objects in them without having to lock them down using your callback to properly mark them cast out on the card. Ben. |
From: Christoph L. <cla...@sg...> - 2008-01-14 20:02:37
|
On Sun, 13 Jan 2008, Andrea Arcangeli wrote: > About the locking perhaps I'm underestimating it, but by following the > TLB flushing analogy, by simply clearing the shadow ptes (with kvm > mmu_lock spinlock to avoid racing with other vcpu spte accesses of > course) and flushing the shadow-pte after clearing the main linux pte, > it should be enough to serialize against shadow-pte page faults that > would call into get_user_pages. Flushing the host TLB before or after > the shadow-ptes shouldn't matter. Hmmm... In most of the callsites we hold a writelock on mmap_sem right? > Comments welcome... especially from SGI/IBM/Quadrics and all other > potential users of this functionality. > There are also certain details I'm uncertain about, like passing 'mm' > to the lowlevel methods, my KVM usage of the invalidate_page() > notifier for example only uses 'mm' for a BUG_ON for example: Passing mm is fine as long as mmap_sem is held. > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h > --- a/include/asm-generic/pgtable.h > +++ b/include/asm-generic/pgtable.h > @@ -86,6 +86,7 @@ do { \ > pte_t __pte; \ > __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \ > flush_tlb_page(__vma, __address); \ > + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \ > __pte; \ > }) > #endif Hmmm... this is ptep_clear_flush? What about the other uses of flush_tlb_page in asm-generic/pgtable.h and related uses in arch code? (would help if your patches would mention the function name in the diff headers) > +#define mmu_notifier(function, mm, args...) \ > + do { \ > + struct mmu_notifier *__mn; \ > + struct hlist_node *__n; \ > + \ > + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \ > + if (__mn->ops->function) \ > + __mn->ops->function(__mn, mm, args); \ > + } while (0) Does this have to be inline? ptep_clear_flush will become quite big |
From: Benjamin H. <be...@ke...> - 2008-01-15 04:28:49
|
On Mon, 2008-01-14 at 12:02 -0800, Christoph Lameter wrote: > On Sun, 13 Jan 2008, Andrea Arcangeli wrote: > > > About the locking perhaps I'm underestimating it, but by following the > > TLB flushing analogy, by simply clearing the shadow ptes (with kvm > > mmu_lock spinlock to avoid racing with other vcpu spte accesses of > > course) and flushing the shadow-pte after clearing the main linux pte, > > it should be enough to serialize against shadow-pte page faults that > > would call into get_user_pages. Flushing the host TLB before or after > > the shadow-ptes shouldn't matter. > > Hmmm... In most of the callsites we hold a writelock on mmap_sem right? Not in unmap_mapping_range() afaik. > > Comments welcome... especially from SGI/IBM/Quadrics and all other > > potential users of this functionality. > > > There are also certain details I'm uncertain about, like passing 'mm' > > to the lowlevel methods, my KVM usage of the invalidate_page() > > notifier for example only uses 'mm' for a BUG_ON for example: > > Passing mm is fine as long as mmap_sem is held. Passing mm is always a good idea, regardless of the mmap_sem, it can be useful for lots of other things :-) > > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h > > --- a/include/asm-generic/pgtable.h > > +++ b/include/asm-generic/pgtable.h > > @@ -86,6 +86,7 @@ do { \ > > pte_t __pte; \ > > __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \ > > flush_tlb_page(__vma, __address); \ > > + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \ > > __pte; \ > > }) > > #endif > > Hmmm... this is ptep_clear_flush? What about the other uses of > flush_tlb_page in asm-generic/pgtable.h and related uses in arch code? > (would help if your patches would mention the function name in the diff > headers) Note that last I looked, a lot of these were stale. Might be time to resume my spring/summer cleaning of page table accessors... > > +#define mmu_notifier(function, mm, args...) \ > > + do { \ > > + struct mmu_notifier *__mn; \ > > + struct hlist_node *__n; \ > > + \ > > + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \ > > + if (__mn->ops->function) \ > > + __mn->ops->function(__mn, mm, args); \ > > + } while (0) > > Does this have to be inline? ptep_clear_flush will become quite big > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to maj...@kv.... For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"do...@kv..."> em...@kv... </a> |
From: Andrea A. <an...@qu...> - 2008-01-15 12:44:51
|
On Mon, Jan 14, 2008 at 12:02:42PM -0800, Christoph Lameter wrote: > Hmmm... In most of the callsites we hold a writelock on mmap_sem right? Not in all, like Marcelo pointed out in kvm-devel, so the lowlevel locking can't relay on the VM locks. About your request to schedule in the mmu notifier methods this is not feasible right now, the notifier is often called with the pte spinlocks held. I wonder if you can simply post/queue an event like a softirq/pdflush. > Passing mm is fine as long as mmap_sem is held. mmap_sem is not held, but don't worry "mm" can't go away under the mmu notifier, so it's ok. It's just that the KVM methods never uses "mm" at all (containerof translates the struct mmu_notifier to a struct kvm, and there is the mm in kvm->mm too). Perhaps others don't save the "mm" in their container where the mmu_notifier is embedded into, so I left mm as parameter to the methods. > Hmmm... this is ptep_clear_flush? What about the other uses of > flush_tlb_page in asm-generic/pgtable.h and related uses in arch code? This is not necessarily a 1:1 relationship with the tlb flushes. Otherwise they'd be the tlb-notifiers not the mmu-notifiers. The other methods in the pgtable.h are not dropping an user page from the "mm". That's the invalidate case right now. Other methods will not call into invalidate_page, but you're welcome to add other methods and call them from other ptep_* functions if you're interested about being notified about more than just the invalidates of the "mm". Is invalidate_page/range a clear enough method name to explain when the ptes and tlb entries have been dropped for such page/range mapped in userland in that address/range? > (would help if your patches would mention the function name in the diff > headers) my patches uses git diff defaults I guess, and they mention the function name in all other places, it's just git isn't smart enough there to catch the function name in that single place, it's ok. > > +#define mmu_notifier(function, mm, args...) \ > > + do { \ > > + struct mmu_notifier *__mn; \ > > + struct hlist_node *__n; \ > > + \ > > + hlist_for_each_entry(__mn, __n, &(mm)->mmu_notifier, hlist) \ > > + if (__mn->ops->function) \ > > + __mn->ops->function(__mn, mm, args); \ > > + } while (0) > > Does this have to be inline? ptep_clear_flush will become quite big Inline makes the patch smaller and it avoids a call in the common case that the mmu_notifier list is empty. Perhaps I could add a: if (unlikely(!list_empty(&(mm)->mmu_notifier)) { ... } so gcc could offload the internal block in a cold-icache region of .text. I think at least an unlikely(!list_empty(&(mm)->mmu_notifier)) check has to be inline. Currently there isn't such check because I'm unsure if it really makes sense. The idea is that if you really care to optimize this you'll use self-modifying code to turn a nop into a call when a certain method is armed. That's an extreme optimization though, current code shouldn't be measurable already when disarmed. |
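To make the inline-versus-out-of-line trade-off concrete, the variant being discussed could look roughly like the sketch below. It is not code from this thread, and it assumes the first version of the patch where mm->mmu_notifier is a plain hlist_head (with an hlist the emptiness test is hlist_empty(), not list_empty()):

/* include/linux/mmu_notifier.h: only the cheap test stays inline. */
extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
					   unsigned long address);

static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
						unsigned long address)
{
	if (unlikely(!hlist_empty(&mm->mmu_notifier)))
		__mmu_notifier_invalidate_page(mm, address);
}

/* mm/mmu_notifier.c: the cold list walk, one out-of-line function per method. */
void __mmu_notifier_invalidate_page(struct mm_struct *mm,
				    unsigned long address)
{
	struct mmu_notifier *mn;
	struct hlist_node *n;

	hlist_for_each_entry(mn, n, &mm->mmu_notifier, hlist)
		if (mn->ops->invalidate_page)
			mn->ops->invalidate_page(mn, mm, address);
}

ptep_clear_flush() then only grows by a test and a conditional call, which is what both Christoph's icache concern and Andrea's unlikely() suggestion are aiming at.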
From: Benjamin H. <be...@ke...> - 2008-01-15 20:19:13
|
On Tue, 2008-01-15 at 13:44 +0100, Andrea Arcangeli wrote: > On Mon, Jan 14, 2008 at 12:02:42PM -0800, Christoph Lameter wrote: > > Hmmm... In most of the callsites we hold a writelock on mmap_sem right? > > Not in all, like Marcelo pointed out in kvm-devel, so the lowlevel > locking can't relay on the VM locks. > > About your request to schedule in the mmu notifier methods this is not > feasible right now, the notifier is often called with the pte > spinlocks held. I wonder if you can simply post/queue an event like a > softirq/pdflush. Do you have cases where it's -not- called with the PTE lock held ? Ben. |
From: Andrea A. <an...@qu...> - 2008-01-16 01:06:27
|
On Wed, Jan 16, 2008 at 07:18:53AM +1100, Benjamin Herrenschmidt wrote: > Do you have cases where it's -not- called with the PTE lock held ? For invalidate_page, no, because currently it's only called next to the ptep_get_and_clear that modifies the pte and requires the pte lock. invalidate_range/release are called w/o the pte lock held. |
From: Rik v. R. <ri...@re...> - 2008-01-16 17:43:06
|
On Sun, 13 Jan 2008 17:24:18 +0100 Andrea Arcangeli <an...@qu...> wrote: > In my basic initial patch I only track the tlb flushes which should be > the minimum required to have a nice linux-VM controlled swapping > behavior of the KVM gphysical memory. I have a vaguely related question on KVM swapping. Do page accesses inside KVM guests get propagated to the host OS, so Linux can choose a reasonable page for eviction, or is the pageout of KVM guest pages essentially random? -- All rights reversed. |
From: Izik E. <iz...@qu...> - 2008-01-16 17:57:24
|
Rik van Riel wrote: > On Sun, 13 Jan 2008 17:24:18 +0100 > Andrea Arcangeli <an...@qu...> wrote: > > >> In my basic initial patch I only track the tlb flushes which should be >> the minimum required to have a nice linux-VM controlled swapping >> behavior of the KVM gphysical memory. >> > > I have a vaguely related question on KVM swapping. > > Do page accesses inside KVM guests get propagated to the host > OS, so Linux can choose a reasonable page for eviction, or is > the pageout of KVM guest pages essentially random? > > Right now, when KVM removes a pte from the shadow cache, it marks the page that pte pointed to as accessed. That was a good solution until the mmu notifiers, because the pages were pinned and couldn't be swapped to disk. So now it will have to do something more sophisticated, or at least mark as accessed every page pointed to by a pte that gets inserted into the shadow cache.... |
From: Andrea A. <an...@cp...> - 2008-01-17 16:23:27
|
On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote: > Rik van Riel wrote: >> On Sun, 13 Jan 2008 17:24:18 +0100 >> Andrea Arcangeli <an...@qu...> wrote: >> >> >>> In my basic initial patch I only track the tlb flushes which should be >>> the minimum required to have a nice linux-VM controlled swapping >>> behavior of the KVM gphysical memory. >> >> I have a vaguely related question on KVM swapping. >> >> Do page accesses inside KVM guests get propagated to the host >> OS, so Linux can choose a reasonable page for eviction, or is >> the pageout of KVM guest pages essentially random? Right, selection of the guest OS pages to swap is partly random but wait: _only_ for the long-cached and hot spte entries. It's certainly not entirely random. As the shadow-cache is a bit dynamic, every new instantiated spte will refresh the PG_referenced bit in follow_page already (through minor faults). not-present fault of swapped non-present sptes, can trigger minor faults from swapcache too and they'll refresh young regular ptes. > right now when kvm remove pte from the shadow cache, it mark as access the > page that this pte pointed to. Yes: the referenced bit in the mmu-notifier invalidate case isn't useful because it's set right before freeing the page. > it was a good solution untill the mmut notifiers beacuse the pages were > pinned and couldnt be swapped to disk It probably still makes sense for sptes removed because of other reasons (not mmu notifier invalidates). > so now it will have to do something more sophisticated or at least mark as > access every page pointed by pte > that get insrted to the shadow cache.... I think that should already be the case, see the mark_page_accessed in follow_page, isn't FOLL_TOUCH set, isn't it? The only thing we clearly miss is a logic that refreshes the PG_referenced bitflag for "hot" sptes that remains instantiated and cached for a long time. For regular linux ptes this is done by the cpu through the young bitflag. But note that not all architectures have the young bitflag support in hardware! So I suppose the swapping of the KVM task, is like the swapping any other task but on an alpha CPU. It works good enough in practice even if we clearly have room for further optimizations in this area (like there would be on archs w/o young bit updated in hardware too). To refresh the PG_referenced bit for long lived hot sptes, I think the easiest solution is to chain the sptes in a lru, and to start dropping them when memory pressure start. We could drop one spte every X pages collected by the VM. So the "age" time factor depends on the VM velocity and we totally avoid useless shadow page faults when there's no VM pressure. When VM pressure increases, the kvm non-present fault will then take care to refresh the PG_referenced bit. This should solve the aging-issue for long lived and hot sptes. This should improve the responsiveness of the guest OS during "initial" swap pressure (after the initial swap pressure, the working set finds itself in ram again). So it should avoid some swapout/swapin not required jitter during the initial swap. I see this mostly as a kvm internal optimization, not strictly related to the mmu notifiers though. |
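Purely as an illustration of the "chain the sptes in a lru" idea sketched above, and not actual KVM code: all structure and helper names below (kvm_spte_lru, kvm_spte_entry, kvm_spte_touch, kvm_zap_cold_sptes, kvm_zap_spte, the kvm->spte_lru field) are invented; only kvm->mmu_lock is referenced elsewhere in the thread. The shadow-MMU side could keep its cached sptes on a per-VM list and drop the coldest ones on demand:

/* Hypothetical per-VM aging state, protected by kvm->mmu_lock. */
struct kvm_spte_entry {
	struct list_head lru;
	u64 *sptep;			/* the shadow pte this entry tracks */
};

struct kvm_spte_lru {
	struct list_head lru;		/* hot entries at the head, cold at the tail */
	unsigned long nr_cached;
};

/* Called from the hypothetical shadow fault path whenever an spte is
 * (re)instantiated, so recently used entries migrate to the head. */
static void kvm_spte_touch(struct kvm *kvm, struct kvm_spte_entry *e)
{
	spin_lock(&kvm->mmu_lock);
	list_move(&e->lru, &kvm->spte_lru.lru);
	spin_unlock(&kvm->mmu_lock);
}

/* Drop up to 'nr' of the coldest sptes.  The next guest access to one of
 * them takes a lightweight vmexit, goes through follow_page() and
 * refreshes the referenced/young state, which is the aging feedback
 * described in the mail. */
static unsigned long kvm_zap_cold_sptes(struct kvm *kvm, unsigned long nr)
{
	unsigned long zapped = 0;

	spin_lock(&kvm->mmu_lock);
	while (nr-- && !list_empty(&kvm->spte_lru.lru)) {
		struct kvm_spte_entry *e;

		e = list_entry(kvm->spte_lru.lru.prev,
			       struct kvm_spte_entry, lru);
		list_del(&e->lru);
		kvm_zap_spte(kvm, e);	/* hypothetical: clears the shadow pte */
		kvm->spte_lru.nr_cached--;
		zapped++;
	}
	spin_unlock(&kvm->mmu_lock);
	return zapped;
}

Dropping "one spte every X pages collected by the VM" then reduces to calling something like kvm_zap_cold_sptes() from whatever memory-pressure hook ends up being used.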
From: Izik E. <iz...@qu...> - 2008-01-17 18:31:19
|
Andrea Arcangeli wrote: > On Wed, Jan 16, 2008 at 07:48:06PM +0200, Izik Eidus wrote: > >> Rik van Riel wrote: >> >>> On Sun, 13 Jan 2008 17:24:18 +0100 >>> Andrea Arcangeli <an...@qu...> wrote: >>> >>> >>> >>>> In my basic initial patch I only track the tlb flushes which should be >>>> the minimum required to have a nice linux-VM controlled swapping >>>> behavior of the KVM gphysical memory. >>>> >>> I have a vaguely related question on KVM swapping. >>> >>> Do page accesses inside KVM guests get propagated to the host >>> OS, so Linux can choose a reasonable page for eviction, or is >>> the pageout of KVM guest pages essentially random? >>> > > Right, selection of the guest OS pages to swap is partly random but > wait: _only_ for the long-cached and hot spte entries. It's certainly > not entirely random. > > As the shadow-cache is a bit dynamic, every new instantiated spte will > refresh the PG_referenced bit in follow_page already (through minor > faults). not-present fault of swapped non-present sptes, can trigger > minor faults from swapcache too and they'll refresh young regular > ptes. > > >> right now when kvm remove pte from the shadow cache, it mark as access the >> page that this pte pointed to. >> > > Yes: the referenced bit in the mmu-notifier invalidate case isn't > useful because it's set right before freeing the page. > > >> it was a good solution untill the mmut notifiers beacuse the pages were >> pinned and couldnt be swapped to disk >> > > It probably still makes sense for sptes removed because of other > reasons (not mmu notifier invalidates). > agree > >> so now it will have to do something more sophisticated or at least mark as >> access every page pointed by pte >> that get insrted to the shadow cache.... >> > > I think that should already be the case, see the mark_page_accessed in > follow_page, isn't FOLL_TOUCH set, isn't it? > yes you are right FOLL_TOUCH is set. > The only thing we clearly miss is a logic that refreshes the > PG_referenced bitflag for "hot" sptes that remains instantiated and > cached for a long time. For regular linux ptes this is done by the cpu > through the young bitflag. But note that not all architectures have > the young bitflag support in hardware! So I suppose the swapping of > the KVM task, is like the swapping any other task but on an alpha > CPU. It works good enough in practice even if we clearly have room for > further optimizations in this area (like there would be on archs w/o > young bit updated in hardware too). > > To refresh the PG_referenced bit for long lived hot sptes, I think the > easiest solution is to chain the sptes in a lru, and to start dropping > them when memory pressure start. We could drop one spte every X pages > collected by the VM. So the "age" time factor depends on the VM > velocity and we totally avoid useless shadow page faults when there's > no VM pressure. When VM pressure increases, the kvm non-present fault > will then take care to refresh the PG_referenced bit. This should > solve the aging-issue for long lived and hot sptes. This should > improve the responsiveness of the guest OS during "initial" swap > pressure (after the initial swap pressure, the working set finds > itself in ram again). So it should avoid some swapout/swapin not > required jitter during the initial swap. I see this mostly as a kvm > internal optimization, not strictly related to the mmu notifiers > though. 
> Ohh, I like it, this is a clever solution, and I guess the cost of the vmexits won't be too high if it's not too aggressive.... |
From: Andrea A. <an...@qu...> - 2008-01-17 19:33:20
|
On Thu, Jan 17, 2008 at 08:21:16PM +0200, Izik Eidus wrote: > Ohh, I like it, this is a clever solution, and I guess the cost of the > vmexits won't be too high if it's > not too aggressive.... Yes, and especially during swapping, the system isn't usually CPU bound. The idea is to pay with some vmexit minor faults when the CPU tends to be idle, to reduce the amount of swapouts. I say swapouts and not swapins because it will mostly help avoid writing out swapcache to disk for no good reason. Swapins already have a chance not to generate any read-I/O if the removed spte is really hot. To make this work we still need notification from the VM about memory pressure, and perhaps the slab shrinker method is enough even if it has a coarse granularity. Freeing sptes during memory pressure also converges with the objective of releasing pinned slab memory so that the spte cache can grow more freely (the 4k PAGE_SIZE for 0-order page defrag philosophy will also appreciate that). There are lots of details to figure out in a good implementation, but the basic idea converges on two fairly important fronts. |
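A sketch of how the slab-shrinker route mentioned above might be wired up, based on the register_shrinker() interface as it exists around 2.6.24. The kvm_mmu_nr_cached_sptes() and kvm_mmu_zap_cold_sptes() helpers are invented stand-ins for code that would walk the VMs and drop their coldest sptes (for instance along the lines of the LRU sketch shown earlier in the thread):

#include <linux/mm.h>		/* struct shrinker, register_shrinker() */

/* Hypothetical helpers, e.g. iterating over all registered VMs. */
extern unsigned long kvm_mmu_nr_cached_sptes(void);
extern unsigned long kvm_mmu_zap_cold_sptes(unsigned long nr);

static int kvm_mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	if (nr_to_scan)
		kvm_mmu_zap_cold_sptes(nr_to_scan);
	/* Report how many droppable sptes remain so the VM can gauge pressure. */
	return kvm_mmu_nr_cached_sptes();
}

static struct shrinker kvm_mmu_shrinker = {
	.shrink = kvm_mmu_shrink,
	.seeks	= DEFAULT_SEEKS,
};

/* register_shrinker(&kvm_mmu_shrinker) at module init,
 * unregister_shrinker(&kvm_mmu_shrinker) at module exit. */

This keeps the "free sptes under memory pressure" policy entirely inside KVM while reusing the kernel's existing coarse-grained pressure notification, which is what the paragraph above is getting at.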
From: Andrea A. <an...@qu...> - 2008-01-21 12:52:03
|
On Thu, Jan 17, 2008 at 08:32:52PM +0100, Andrea Arcangeli wrote: > To make this work we still need notification from the VM about memory > pressure [..] Ok I thought some more at the aging issue of the hot kvm pages (to prevent the guest-OS very-hot working set to be swapped out). So I now hooked a age_page mmu notifier in the page_referenced mkold path. This way when the linux pte is marked old, we can also drop the spte. This way we give the guest-OS a whole round scan of the inactive list in order to generate a vmexit minor fault by touching the hot page. The very-lightweight vmexit will call into follow_page again that I accordingly changed to mark the pte young (which is nicer because it truly simulates what a regular access through the virtual address would do). For direct-io it makes no difference and this way the next time page_referenced runs it'll find the pte young again and it'll mark the pte old again and in turn it'll call ->age_page again that will drop the spte again, etc... So the working set will be sticky in ram and it won't generate spurious swapouts (this is the theory at least). It works well in practice so far but I don't have hard numbers myself (I just implemented what I think is a quite effective aging strategy to do a not random page replacement on the very hot guest-OS working set). In absence of memory pressure (or with little pressure) there will be no age_page calls at all and the spte cache can grow freely without any vmexit. This provides peak performance in absence of memory pressure. This keeps the VM aging decision in the VM instead of having an lru of sptes to collect. The lru of sptes to collect would still be interesting for the shrinker method though (similar to dcache/inode lru etc..). This update also adds some locking so multiple subsystems can register/unregister for the notifiers at any time (something that had to be handled by design with external serialization before and effectively it was a bit fragile). BTW, when MMU_NOTIFIER=n the kernel compile spawns a warning in memory.c about two unused variables, not sure if it worth hiding it given I suppose most people will have MMU_NOTIFIER=y. One easy way to avoid the warning is to move the mmu_notifier call out of line and to have one function per notifier (which was suggested by Christoph already as an icache optimization). But this implementation keeps the patch smaller and quicker to improve for now... I'd like to know if this could be possibly merged soon and what I need to change to make this happen. Thanks! 
The kvm side of this can be found here: http://marc.info/?l=kvm-devel&m=120091930324366&w=2 http://marc.info/?l=kvm-devel&m=120091906724000&w=2 http://marc.info/?l=kvm-devel&m=120091939024572&w=2 Signed-off-by: Andrea Arcangeli <an...@qu...> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -44,8 +44,10 @@ ({ \ int __young; \ __young = ptep_test_and_clear_young(__vma, __address, __ptep); \ - if (__young) \ + if (__young) { \ flush_tlb_page(__vma, __address); \ + mmu_notifier(age_page, (__vma)->vm_mm, __address); \ + } \ __young; \ }) #endif @@ -86,6 +88,7 @@ do { \ pte_t __pte; \ __pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep); \ flush_tlb_page(__vma, __address); \ + mmu_notifier(invalidate_page, (__vma)->vm_mm, __address); \ __pte; \ }) #endif diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -10,6 +10,7 @@ #include <linux/rbtree.h> #include <linux/rwsem.h> #include <linux/completion.h> +#include <linux/mmu_notifier.h> #include <asm/page.h> #include <asm/mmu.h> @@ -219,6 +220,10 @@ struct mm_struct { /* aio bits */ rwlock_t ioctx_list_lock; struct kioctx *ioctx_list; + +#ifdef CONFIG_MMU_NOTIFIER + struct mmu_notifier_head mmu_notifier; /* MMU notifier list */ +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,79 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include <linux/list.h> +#include <linux/spinlock.h> + +#ifdef CONFIG_MMU_NOTIFIER + +struct mmu_notifier; + +struct mmu_notifier_ops { + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + void (*age_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + void (*invalidate_range)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +struct mmu_notifier_head { + struct hlist_head head; + rwlock_t lock; +}; + +struct mmu_notifier { + struct hlist_node hlist; + const struct mmu_notifier_ops *ops; +}; + +#include <linux/mm_types.h> + +extern void mmu_notifier_register(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_unregister(struct mmu_notifier *mn, + struct mm_struct *mm); +extern void mmu_notifier_release(struct mm_struct *mm); + +static inline void mmu_notifier_head_init(struct mmu_notifier_head *mnh) +{ + INIT_HLIST_HEAD(&mnh->head); + rwlock_init(&mnh->lock); +} + +#define mmu_notifier(function, mm, args...) \ + do { \ + struct mmu_notifier *__mn; \ + struct hlist_node *__n; \ + \ + if (unlikely(!hlist_empty(&(mm)->mmu_notifier.head))) { \ + read_lock(&(mm)->mmu_notifier.lock); \ + hlist_for_each_entry(__mn, __n, \ + &(mm)->mmu_notifier.head, \ + hlist) \ + if (__mn->ops->function) \ + __mn->ops->function(__mn, \ + mm, \ + args); \ + read_unlock(&(mm)->mmu_notifier.lock); \ + } \ + } while (0) + +#else /* CONFIG_MMU_NOTIFIER */ + +#define mmu_notifier_register(mn, mm) do {} while(0) +#define mmu_notifier_unregister(mn, mm) do {} while (0) +#define mmu_notifier_release(mm) do {} while (0) +#define mmu_notifier_head_init(mmh) do {} while (0) + +#define mmu_notifier(function, mm, args...) 
\ + do { } while (0) + +#endif /* CONFIG_MMU_NOTIFIER */ + +#endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/kernel/fork.c b/kernel/fork.c --- a/kernel/fork.c +++ b/kernel/fork.c @@ -359,6 +359,7 @@ static struct mm_struct * mm_init(struct if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; + mmu_notifier_head_init(&mm->mmu_notifier); return mm; } free_mm(mm); diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,3 +193,7 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS + +config MMU_NOTIFIER + def_bool y + bool "MMU notifier, for paging KVM/RDMA" diff --git a/mm/Makefile b/mm/Makefile --- a/mm/Makefile +++ b/mm/Makefile @@ -30,4 +30,5 @@ obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_SMP) += allocpercpu.o obj-$(CONFIG_QUICKLIST) += quicklist.o +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o diff --git a/mm/hugetlb.c b/mm/hugetlb.c --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -753,6 +753,7 @@ void __unmap_hugepage_range(struct vm_ar } spin_unlock(&mm->page_table_lock); flush_tlb_range(vma, start, end); + mmu_notifier(invalidate_range, mm, start, end); list_for_each_entry_safe(page, tmp, &page_list, lru) { list_del(&page->lru); put_page(page); diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -889,6 +889,7 @@ unsigned long zap_page_range(struct vm_a end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details); if (tlb) tlb_finish_mmu(tlb, address, end); + mmu_notifier(invalidate_range, mm, address, end); return end; } @@ -950,7 +951,8 @@ struct page *follow_page(struct vm_area_ if ((flags & FOLL_WRITE) && !pte_dirty(pte) && !PageDirty(page)) set_page_dirty(page); - mark_page_accessed(page); + if (!pte_young(pte)) + set_pte_at(mm, address, ptep, pte_mkyoung(pte)); } unlock: pte_unmap_unlock(ptep, ptl); @@ -1317,7 +1319,7 @@ int remap_pfn_range(struct vm_area_struc { pgd_t *pgd; unsigned long next; - unsigned long end = addr + PAGE_ALIGN(size); + unsigned long start = addr, end = addr + PAGE_ALIGN(size); struct mm_struct *mm = vma->vm_mm; int err; @@ -1358,6 +1360,7 @@ int remap_pfn_range(struct vm_area_struc if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range, mm, start, end); return err; } EXPORT_SYMBOL(remap_pfn_range); @@ -1441,7 +1444,7 @@ int apply_to_page_range(struct mm_struct { pgd_t *pgd; unsigned long next; - unsigned long end = addr + size; + unsigned long start = addr, end = addr + size; int err; BUG_ON(addr >= end); @@ -1452,6 +1455,7 @@ int apply_to_page_range(struct mm_struct if (err) break; } while (pgd++, addr = next, addr != end); + mmu_notifier(invalidate_range, mm, start, end); return err; } EXPORT_SYMBOL_GPL(apply_to_page_range); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1747,6 +1747,7 @@ static void unmap_region(struct mm_struc free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS, next? next->vm_start: 0); tlb_finish_mmu(tlb, start, end); + mmu_notifier(invalidate_range, mm, start, end); } /* @@ -2043,6 +2044,7 @@ void exit_mmap(struct mm_struct *mm) vm_unacct_memory(nr_accounted); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(tlb, 0, end); + mmu_notifier_release(mm); /* * Walk the list again, actually closing and freeing it, diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c new file mode 100644 --- /dev/null +++ b/mm/mmu_notifier.c @@ -0,0 +1,44 @@ +/* + * linux/mm/mmu_notifier.c + * + * Copyright (C) 2008 Qumranet, Inc. 
+ * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + */ + +#include <linux/mmu_notifier.h> +#include <linux/module.h> + +void mmu_notifier_release(struct mm_struct *mm) +{ + struct mmu_notifier *mn; + struct hlist_node *n, *tmp; + + if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) { + read_lock(&mm->mmu_notifier.lock); + hlist_for_each_entry_safe(mn, n, tmp, + &mm->mmu_notifier.head, hlist) { + if (mn->ops->release) + mn->ops->release(mn, mm); + hlist_del(&mn->hlist); + } + read_unlock(&mm->mmu_notifier.lock); + } +} + +void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) +{ + write_lock(&mm->mmu_notifier.lock); + hlist_add_head(&mn->hlist, &mm->mmu_notifier.head); + write_unlock(&mm->mmu_notifier.lock); +} +EXPORT_SYMBOL_GPL(mmu_notifier_register); + +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm) +{ + write_lock(&mm->mmu_notifier.lock); + hlist_del(&mn->hlist); + write_unlock(&mm->mmu_notifier.lock); +} +EXPORT_SYMBOL_GPL(mmu_notifier_unregister); |
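On the KVM side (see the kvm-devel links in the message above for the real code), a ->age_page implementation could be as small as the hypothetical sketch below. The kvm_zap_sptes_for_hva() helper and the name of the embedded notifier field are invented; kvm->mm and kvm->mmu_lock are mentioned earlier in the thread.

static void kvm_mmu_notifier_age_page(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long address)
{
	struct kvm *kvm = container_of(mn, struct kvm, mmu_notifier);

	BUG_ON(mm != kvm->mm);

	spin_lock(&kvm->mmu_lock);
	/* Hypothetical helper: drop any shadow pte mapping this host
	 * virtual address.  The next guest access takes a lightweight
	 * vmexit, goes through follow_page() and re-marks the linux pte
	 * young, so genuinely hot pages keep escaping page_referenced(). */
	kvm_zap_sptes_for_hva(kvm, address);
	spin_unlock(&kvm->mmu_lock);
}

Hooking something like this through the new age_page notifier in ptep_clear_flush_young() is what keeps the guest's hot working set from looking unreferenced to the host VM.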
From: Rik v. R. <ri...@re...> - 2008-01-22 02:21:35
|
On Mon, 21 Jan 2008 13:52:04 +0100 Andrea Arcangeli <an...@qu...> wrote: > Signed-off-by: Andrea Arcangeli <an...@qu...> Reviewed-by: Rik van Riel <ri...@re...> -- All rights reversed. |
From: Avi K. <av...@qu...> - 2008-01-22 14:12:31
|
Andrea Arcangeli wrote: > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h > --- a/include/asm-generic/pgtable.h > +++ b/include/asm-generic/pgtable.h > @@ -44,8 +44,10 @@ > ({ \ > int __young; \ > __young = ptep_test_and_clear_young(__vma, __address, __ptep); \ > - if (__young) \ > + if (__young) { \ > flush_tlb_page(__vma, __address); \ > + mmu_notifier(age_page, (__vma)->vm_mm, __address); \ > + } \ > __young; \ > }) > I think that unconditionally doing __young |= mmu_notifier(test_and_clear_young, ...); allows hardware with accessed bits more control over what is going on. -- error compiling committee.c: too many arguments to function |
From: Andrea A. <an...@qu...> - 2008-01-22 14:43:55
|
On Tue, Jan 22, 2008 at 04:12:34PM +0200, Avi Kivity wrote: > Andrea Arcangeli wrote: >> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h >> --- a/include/asm-generic/pgtable.h >> +++ b/include/asm-generic/pgtable.h >> @@ -44,8 +44,10 @@ >> ({ \ >> int __young; \ >> __young = ptep_test_and_clear_young(__vma, __address, __ptep); \ >> - if (__young) \ >> + if (__young) { \ >> flush_tlb_page(__vma, __address); \ >> + mmu_notifier(age_page, (__vma)->vm_mm, __address); \ >> + } \ >> __young; \ >> }) >> > > I think that unconditionally doing > > __young |= mmu_notifier(test_and_clear_young, ...); > > allows hardware with accessed bits more control over what is going on. Agreed, likely it'll have to be mmu_notifier_age_page(). |
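A possible shape for such an mmu_notifier_age_page() helper, sketched against the v2 structures above (mmu_notifier_head with its rwlock) and assuming the method is changed to return int; this is not code from the thread:

/* Walk the notifiers, let each secondary MMU test-and-clear its own
 * accessed state, and OR the results so ptep_clear_flush_young() can do
 * __young |= mmu_notifier_age_page(...). */
int mmu_notifier_age_page(struct mm_struct *mm, unsigned long address)
{
	struct mmu_notifier *mn;
	struct hlist_node *n;
	int young = 0;

	if (unlikely(!hlist_empty(&mm->mmu_notifier.head))) {
		read_lock(&mm->mmu_notifier.lock);
		hlist_for_each_entry(mn, n, &mm->mmu_notifier.head, hlist)
			if (mn->ops->age_page)
				young |= mn->ops->age_page(mn, mm, address);
		read_unlock(&mm->mmu_notifier.lock);
	}
	return young;
}

That way a secondary MMU with real accessed bits gets a say in the aging decision instead of only being told after the fact, which is the point Avi is making.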
From: Christoph L. <cla...@sg...> - 2008-01-23 19:48:44
|
On Wed, 23 Jan 2008, Andrea Arcangeli wrote: > On Wed, Jan 23, 2008 at 04:52:47AM -0600, Robin Holt wrote: > > But 100 callouts holding spinlocks will not work for our implementation > > and even if the callouts are made with spinlocks released, we would very > > strongly prefer a single callout which messages the range to the other > > side. > > But you take the physical address and turn into mm+va with your rmap... The remote mm+va or a local mm+va? |
From: Robin H. <ho...@sg...> - 2008-01-23 19:58:35
|
On Wed, Jan 23, 2008 at 11:48:43AM -0800, Christoph Lameter wrote: > On Wed, 23 Jan 2008, Andrea Arcangeli wrote: > > > On Wed, Jan 23, 2008 at 04:52:47AM -0600, Robin Holt wrote: > > > But 100 callouts holding spinlocks will not work for our implementation > > > and even if the callouts are made with spinlocks released, we would very > > > strongly prefer a single callout which messages the range to the other > > > side. > > > > But you take the physical address and turn into mm+va with your rmap... > > The remote mm+va or a local mm+va? To be more complete, the phys is pointing to an xpmem_segment+va and the xpmem_segment points to the mm. The seg describes a window into the source process's virtual address space. Seems somewhat analogous to the Xen grant, but I do not know. Thanks, Robin |
From: Christoph L. <cla...@sg...> - 2008-01-23 20:27:44
|
On Wed, 23 Jan 2008, Andrea Arcangeli wrote: > You want to be able to tell the mmu_notifier that you want the flush > repeated without locks later? Sorry but then if you're always going to > set the bitflag unconditionally, why don't you simply implement a > second notifier in addition of my current ->invalidate_page (like > ->after_invalidate_page). Because there is no mm_struct available at that point. So we cannot do a callback based on the mmu_ops in that fasion. We would have to build a list of notifiers while scanning the reverse maps. > We can then implement a method in rmap.c for you to call to do the > final freeing of the page (pagecache/swapcache won't be collected > unless it's a truncate, as long as you keep it pinned and you > certainly don't want to wait a second round of lru scan before freeing > the page after you release the external reference, so you may need to > call this method before returning from the The page count is elevated because of the remote pte so the page is effectively pinned. > ->after_invalidate_page). Infact I can call that method for you in the > notifier implementation itself after all ->after_invalidate_pages have > been called. (of course only if at least one of them was implemented > and not-null) Ok. > > As an example of thousands, we currently have one customer job that > > has 16880 processors all with the same physical page faulted into their > > address space. The way XPMEM is currently structured, there is fan-out of > > that PFN information so we would not need to queue up that many messages, > > but it would still be considerable. Our upcoming version of the hardware > > will potentially make this fanout worse because we are going to allow > > even more fine-grained divisions of the machine to help with memory > > error containment. > > Well as long as you send these messages somewhat serially and you > don't pretend to allocate all packets at once it should be ok. Perhaps > you should preallocate all packets statically and serialize the access > to the pool with a lock. > > What I'd like to stress to be sure it's crystal clear, is that in the > mm/rmap.c path GFP_KERNEL==GFP_ATOMIC, infact both are = PF_MEMALLOC = > TIF_MEMDIE = if mempool is empty it will crash. The argument that you > need to sleep to allocate memory with GFP_KERNEL is totally bogus. If > that's the only reason, you don't need to sleep at all. alloc_pages > will not invoke the VM when called inside the VM, it will grab ram > from PF_MEMALLOC instead. At most it will schedule so the only benefit > would be lower -rt latency in the end. If you are holding a lock then you have to use GFP_ATOMIC and the number of GFP_ATOMIC allocs is limited. PF_MEMALLOC does not do reclaim so we are in trouble if too many allocs occur. > > We have a counter associated with a pfn that indicates when the pfn is no > > longer referenced by other partitions. This counter triggers changing of > > memory protections so any subsequent access to this page will result in > > a memory error on the remote partition (this should be an illegal case). > > As long as you keep a reference on the page too, you don't risk > any corruption by flushing after. There are still dirty bit issues. > The window that you must close with that bitflag is the request coming > from the remote node to map the page after the linux pte has been > cleared. If you map the page in a remote node after the linux pte has > been cleared ->invalidate_page won't be called again because the page > will look unmapped in the linux VM. 
Now invalidate_page will clear the > bitflag, so the map requests will block. But where exactly you know > that the linux pte has been cleared so you can "unblock" the map > requests? If a page is not mapped by some linux pte, mm/rmap.c will > never be called and this is why any notification in mm/rmap.c should > track the "address space" and not the "physical page". The subsystem needs to establish proper locking for that case. > In effect you don't care less about the address space of the task in > the master node, so IMHO you're hooking your ->invalidate_page(page) > (instead of my ->invalidate_page(mm, address)) in the very wrong > place. You should hook it in mm/vmscan.c shrink-list so it will be > invoked regardless if the pte is mapped or not. Then your model that Then page migration and other uses of try_to_unmap wont get there. Also the page lock is an item that helps with serialization of new faults. > If you work the "pages" you should stick to pages and to stay away > from mm/rmap.c and ignore whatever is mapped in the master address > space of the task. mm/rmap.c only deals with ptes/sptes and other > _virtual-tracked_ mappings. It also deals f.e. with page dirty status. |
From: Avi K. <av...@qu...> - 2008-01-24 06:03:29
|
Christoph Lameter wrote: > On Wed, 23 Jan 2008, Robin Holt wrote: > > >>> That won't work for kvm. If we have a hundred virtual machines, that means >>> 99 no-op notifications. >>> >> But 100 callouts holding spinlocks will not work for our implementation >> and even if the callouts are made with spinlocks released, we would very >> strongly prefer a single callout which messages the range to the other >> side. >> > > > Andrea wont have 99 no op notifications. You will have one notification to > the kvm subsystem (since there needs to be only one register operation > for a subsystem that wants to get notifications). What do you do there is > up to kvm. If you want to call some function 99 times then you are free to > do that. > What I need is a list of (mm, va) that map the page. kvm doesn't have access to that, export notifiers do. It seems reasonable that export notifier do that rmap walk since they are part of core mm, not kvm. Alternatively, kvm can change its internal rmap structure to be page based instead of (mm, hva) based. The problem here is to size this thing, as we don't know in advance (when the kvm module is loaded) whether 0% or 100% (or some value in between) of system memory will be used for kvm. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Andrea A. <an...@qu...> - 2008-01-24 12:26:21
|
On Thu, Jan 24, 2008 at 07:56:43AM +0200, Avi Kivity wrote: > What I need is a list of (mm, va) that map the page. kvm doesn't have > access to that, export notifiers do. It seems reasonable that export > notifier do that rmap walk since they are part of core mm, not kvm. Yes. Like said in earlier email we could ignore the slowdown and duplicate the mm/rmap.c code inside kvm, but that looks a bad layering violation and it's unnecessary, dirty and suboptimal IMHO. > Alternatively, kvm can change its internal rmap structure to be page based > instead of (mm, hva) based. The problem here is to size this thing, as we > don't know in advance (when the kvm module is loaded) whether 0% or 100% > (or some value in between) of system memory will be used for kvm. Another issue is that for things like the page sharing driver, it's more convenient to be able to know exactly which "sptes" belongs to a certain userland mapping, and only that userland mapping (not all others mappings of the physical page). So if the rmap becomes page based, it'd be nice to still be able to find the "mm" associated with that certain spte pointer to skip all sptes in the other "mm" during the invalidate. |
From: Avi K. <av...@qu...> - 2008-01-24 12:34:38
|
Andrea Arcangeli wrote: > On Thu, Jan 24, 2008 at 07:56:43AM +0200, Avi Kivity wrote: > >> What I need is a list of (mm, va) that map the page. kvm doesn't have >> access to that, export notifiers do. It seems reasonable that export >> notifier do that rmap walk since they are part of core mm, not kvm. >> > > Yes. Like said in earlier email we could ignore the slowdown and > duplicate the mm/rmap.c code inside kvm, but that looks a bad layering > violation and it's unnecessary, dirty and suboptimal IMHO. > > Historical note: old kvm versions (like the what will eventually ship in 2.6.24) have a page-based rmap (hooking the rmap list off page->private). We changed that to an mm based rmap since page->private is not available when kvm maps general userspace memory. >> Alternatively, kvm can change its internal rmap structure to be page based >> instead of (mm, hva) based. The problem here is to size this thing, as we >> don't know in advance (when the kvm module is loaded) whether 0% or 100% >> (or some value in between) of system memory will be used for kvm. >> > > Another issue is that for things like the page sharing driver, it's > more convenient to be able to know exactly which "sptes" belongs to a > certain userland mapping, and only that userland mapping (not all > others mappings of the physical page). So if the rmap becomes page > based, it'd be nice to still be able to find the "mm" associated with > that certain spte pointer to skip all sptes in the other "mm" during > the invalidate. > You also need the mm (or rather, the kvm structure, but they have a 1:1 relationship) to be able to lock and maintain the shadow structures properly. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Andrea A. <an...@qu...> - 2008-01-24 15:43:01
|
On Wed, Jan 23, 2008 at 12:27:47PM -0800, Christoph Lameter wrote: > There are still dirty bit issues. Yes, but no big issues given ->invalidate_page is fully capable of running set_page_dirty if needed. > > The window that you must close with that bitflag is the request coming > > from the remote node to map the page after the linux pte has been > > cleared. If you map the page in a remote node after the linux pte has > > been cleared ->invalidate_page won't be called again because the page > > will look unmapped in the linux VM. Now invalidate_page will clear the > > bitflag, so the map requests will block. But where exactly you know > > that the linux pte has been cleared so you can "unblock" the map > > requests? If a page is not mapped by some linux pte, mm/rmap.c will > > never be called and this is why any notification in mm/rmap.c should > > track the "address space" and not the "physical page". > > The subsystem needs to establish proper locking for that case. How? Your answer was to have the subsystem fault handler wait for PG_exported to return ON... when later you told me the subsystem fault handler is the thing supposed to set PG_exported ON again... Perhaps you really could invent proper locking to make your #v1 workable somehow, but I haven't seen a sign of it yet. In fact I'm not so sure everything will be race-free with invalidate_page_after (given you intend to call it outside the PT lock, so concurrent linux minor faults can happen in parallel with your invalidate_page_after), but at least it has a better chance to work without having to invent much new complex locking. > It also deals f.e. with page dirty status. I think you should consider whether you can also build an rmap per-mm like KVM does, and index it by the virtual address like KVM does. |
From: Christoph L. <cla...@sg...> - 2008-01-24 20:07:50
|
On Thu, 24 Jan 2008, Andrea Arcangeli wrote: > I think you should consider if you can also build a rmap per-MM like > KVM does and index it by the virtual address like KVM does. Yes we have that. If we have that then we do not need the mmu_notifier. We could call it with a page parameter and then walk the KVM or XPmem reverse map to directly find all the ptes we need to clear. There is no need then to add a new field to the mm_struct. |
From: Avi K. <av...@qu...> - 2008-01-25 06:35:54
|
Christoph Lameter wrote: > On Thu, 24 Jan 2008, Andrea Arcangeli wrote: > > >> I think you should consider if you can also build a rmap per-MM like >> KVM does and index it by the virtual address like KVM does. >> > > Yes we have that. > > If we have that then we do not need the mmu_notifier. > We could call it with a page parameter and then walk the KVM or XPmem > reverse map to directly find all the ptes we need to clear. There is no > need then to add a new field to the mm_struct. > The reason the new field is needed is because the Linux mm does not understand the secondary pte format and zapping protocol. Locating the secondary ptes is just a part of the problem. -- Any sufficiently difficult bug is indistinguishable from a feature. |