From: Andrea A. <an...@qu...> - 2008-01-13 13:33:07
|
Hi everyone,

So far KVM swapping has been a limited feature. Depending on the workload, huge chunks of the anonymous memory simulating the guest physical memory could get pinned and remain unswappable for extended periods of time. Whenever a spte maps a host physical page, KVM has to pin the page to prevent it from being swapped out. The page could still be unmapped from the Linux VM ptes and go into swapcache, but the boosted reference count (due to the spte pointing to the host physical page) would prevent the page from being freed (and rightfully so). The big difference is that the mmu notifier patch now allows KVM to know when the main Linux VM wants to unmap a certain host physical page. When that happens, KVM now makes sure to release all sptes and to drop the reference count of the page, so the page can finally be swapped out for real in any case. This way the KVM task can now be swapped out fully and at any time, regardless of the guest OS activity and regardless of the size of the readonly shadow-pte cache generated by the guest OS.

The last test I ran on this code was running two VMs on a dual-core SVM host, SMP guest (4 vcpus). The Linux VM was ~400M, the other VM was 3G. The host system has 2G of ram + 4G of swap.

Starting a heavy VM job, both VMs are swapped out quite nicely (one VM was running my oom deadlock testcase for the linux-mm oom patches, the other was playing a youtube video):

  andrea 9742 57.7 2.5 588536 50104
  andrea 9809 69.3 9.4 3211172 182448

After sigstopping both and running the same heavy VM job again I get:

  andrea 9742 42.6 0.0 588536 644
  andrea 9809 48.2 0.0 3211172 848

So when sigstopped, less than 1M of rss remains allocated in ram. After sigcont, and after killing the heavy VM job (which released lots of ram and swap), both VMs gracefully restart with only swapins firing in the host:

  andrea 9742 57.6 2.0 588536 39308
  andrea 9809 61.6 61.4 3211172 1186256

No idea why the non-Linux VM after a while grows back to a 1G working set despite only a single youtube playback running in the guest... ;). The Linux VM, OTOH, has only a 39M working set when idling in the oom loops.

The host must be compiled with CONFIG_MMU_NOTIFIER=y of course, or this won't work.

Here is the patch to kvm.git. (There's some room for optimization in doing a single tlb flush in unmap_spte for all sptes pointing to the page, or even more aggressively for the whole range in the invalidate_range case; but invalidate_range isn't an interesting path for kvm, so I guess it's not worth optimizing in the short/mid term. By optimizing the invalidate_page case we may halve the number of tlb flushes for some common cases. I leave it for later; the swapping is heavily I/O bound anyway, so a few more IPIs on an SMP host shouldn't be very measurable, and on a UP host it makes no difference to flush multiple times in practice.)
Signed-off-by: Andrea Arcangeli <an...@qu...> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 4086080..c527d7d 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -18,6 +18,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on ARCH_SUPPORTS_KVM && EXPERIMENTAL select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 324ff9a..103c270 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -532,6 +532,36 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) kvm_flush_remote_tlbs(kvm); } +static void unmap_spte(struct kvm *kvm, u64 *spte) +{ + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); + get_page(page); + rmap_remove(kvm, spte); + set_shadow_pte(spte, shadow_trap_nonpresent_pte); + kvm_flush_remote_tlbs(kvm); + __free_page(page); +} + +void kvm_rmap_unmap_gfn(struct kvm *kvm, gfn_t gfn) +{ + unsigned long *rmapp; + u64 *spte, *curr_spte; + + spin_lock(&kvm->mmu_lock); + gfn = unalias_gfn(kvm, gfn); + rmapp = gfn_to_rmap(kvm, gfn); + + spte = rmap_next(kvm, rmapp, NULL); + while (spte) { + BUG_ON(!(*spte & PT_PRESENT_MASK)); + rmap_printk("rmap_swap_page: spte %p %llx\n", spte, *spte); + curr_spte = spte; + spte = rmap_next(kvm, rmapp, spte); + unmap_spte(kvm, curr_spte); + } + spin_unlock(&kvm->mmu_lock); +} + #ifdef MMU_DEBUG static int is_empty_shadow_page(u64 *spt) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8a90403..e9a3f6e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3159,6 +3159,36 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) free_page((unsigned long)vcpu->arch.pio_data); } +static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) +{ + return container_of(mn, struct kvm, mmu_notifier); +} + +void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + gfn_t gfn = hva_to_gfn(kvm, address); + BUG_ON(mm != kvm->mm); + if (gfn == -1UL) + return; + kvm_rmap_unmap_gfn(kvm, gfn); +} + +void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + for (; start < end; start += PAGE_SIZE) + kvm_mmu_notifier_invalidate_page(mn, mm, start); +} + +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { + .invalidate_range = kvm_mmu_notifier_invalidate_range, + .invalidate_page = kvm_mmu_notifier_invalidate_page, +}; + struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); @@ -3167,6 +3197,7 @@ struct kvm *kvm_arch_create_vm(void) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops; return kvm; } diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index d6db0de..feacd77 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -404,6 +404,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu); int kvm_mmu_setup(struct kvm_vcpu *vcpu); void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); +void kvm_rmap_unmap_gfn(struct kvm *kvm, gfn_t gfn); int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); void kvm_mmu_zap_all(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 2714068..85da7fa 100644 --- 
a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -117,6 +117,7 @@ struct kvm { struct kvm_io_bus pio_bus; struct kvm_vm_stat stat; struct kvm_arch arch; + struct mmu_notifier mmu_notifier; }; /* The guest did something we don't support. */ @@ -163,6 +164,7 @@ int kvm_arch_set_memory_region(struct kvm *kvm, struct kvm_memory_slot old, int user_alloc); gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn); +gfn_t hva_to_gfn(struct kvm *kvm, unsigned long addr); struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); void kvm_release_page_clean(struct page *page); void kvm_release_page_dirty(struct page *page); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4295623..8f1dd86 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -165,6 +165,7 @@ static struct kvm *kvm_create_vm(void) kvm->mm = current->mm; atomic_inc(&kvm->mm->mm_count); + mmu_notifier_register(&kvm->mmu_notifier, kvm->mm); spin_lock_init(&kvm->mmu_lock); kvm_io_bus_init(&kvm->pio_bus); mutex_init(&kvm->lock); @@ -454,6 +455,23 @@ static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) return (slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE); } +gfn_t hva_to_gfn(struct kvm *kvm, unsigned long addr) +{ + int i; + + for (i = 0; i < kvm->nmemslots; i++) { + struct kvm_memory_slot *memslot = &kvm->memslots[i]; + unsigned long start = memslot->userspace_addr; + unsigned long end = start + (memslot->npages << PAGE_SHIFT); + + if (addr >= start && addr < end) { + gfn_t gfn_offset = (addr - start) >> PAGE_SHIFT; + return memslot->base_gfn + gfn_offset; + } + } + return -1UL; +} + /* * Requires current->mm->mmap_sem to be held */ And here a compatibility patch to kvm-userland so the external module still compile and runs with older kernels w/o MMU_NOTIFIER patch applied. Signed-off-by: Andrea Arcangeli <an...@qu...> diff --git a/kernel/external-module-compat.h b/kernel/external-module-compat.h index 67b9cc4..34ef0a5 100644 --- a/kernel/external-module-compat.h +++ b/kernel/external-module-compat.h @@ -17,6 +17,28 @@ #include <linux/hrtimer.h> #include <asm/bitops.h> +#ifndef CONFIG_MMU_NOTIFIER +struct mmu_notifier; + +struct mmu_notifier_ops { + void (*release)(struct mmu_notifier * mn, + struct mm_struct *mm); + void (*invalidate_page)(struct mmu_notifier * mn, + struct mm_struct *mm, + unsigned long address); + void (*invalidate_range)(struct mmu_notifier * mn, + struct mm_struct *mm, + unsigned long start, unsigned long end); +}; + +struct mmu_notifier { + const struct mmu_notifier_ops *ops; +}; +#define mmu_notifier_register(mn, mm) do {} while(0) +#define mmu_notifier_unregister(mn) do {} while (0) +#define mmu_notifier_release(mm) do {} while (0) +#endif + /* * 2.6.16 does not have GFP_NOWAIT */ Here another patch for kvm-userland where I can't see symmetry between the lack of atomic_inc despite mmdrop is still run. I can't possibly see how this is supposedly not required when compiled into the kernel vs external module. Either atomic_inc is needed in both or none. Even if I'm right still this bug wasn't destabilizing because atomic_inc_and_dec only fires once in the overflow check, so it shouldn't matter to run one mmdrop more than needed, but it's good for correctness. 
Signed-off-by: Andrea Arcangeli <an...@qu...> diff --git a/kernel/hack-module.awk b/kernel/hack-module.awk index 5187c96..884bc50 100644 --- a/kernel/hack-module.awk +++ b/kernel/hack-module.awk @@ -33,8 +33,6 @@ vmx_load_host_state = 0 } -/atomic_inc\(&kvm->mm->mm_count\);/ { $0 = "//" $0 } - /^\t\.fault = / { fcn = gensub(/,/, "", "g", $3) $0 = "\t.VMA_OPS_FAULT(fault) = VMA_OPS_FAULT_FUNC(" fcn ")," I'll post the mmu-notifiers patch (required in the host kernel to run the above) separately in CC with more mailing lists because that's not KVM code at all and we hope to get it merged in the mainline kernel soon after getting feedback on the interface from the other users of the mmu notifiers. Thanks! Andrea |
From: Anthony L. <an...@co...> - 2008-01-13 15:02:35
|
Andrea Arcangeli wrote: > Hi everyone, > > So far KVM swapping has been a limited feature. Depending on the > workloads huge chunks of the anonymous memory simulating the guest > physical memory could get pinned and unswappable for extended periods > of time. Whenever a spte mapps a host physical page, KVM has to pin > the page to prevent it to be swapped out. The page could still be > unmapped from the Linux VM ptes, it could go in swapcache, but the > boosted reference count (due to the spte pointing to the host physical > page) would prevent the page to be freed (and rightfully so). The big > difference is that the mmu notifier patch now allows KVM to know when > the the main Linux VM wants to unmap a certain host physical > page. When that happens KVM now make sure to release all sptes and to > drop the reference count of the page, so the page can finally be > swapped out for real in any case. This way the KVM task can now be > swapped out fully and at any time regardless of the guest OS activity > and regardless the size of the readonly shadow-pte cache generated by > the guest-OS. > > Last test I run on this code was to run two VM on dual core SVM host, > SMP guest (4 vcpus). The linux vm was ~400M the other VM was 3G. Host > system has 2G ram + 4G swap. > > Starting a heavy VM job both VM are swapped out quite nicely (one VM > was running my oom deadlock testcase for the linux-mm oom patches, the > other was playing a youtube video): > > andrea 9742 57.7 2.5 588536 50104 > andrea 9809 69.3 9.4 3211172 182448 > > After sigstopping both and running the same heavy VM job again I get: > > andrea 9742 42.6 0.0 588536 644 > andrea 9809 48.2 0.0 3211172 848 > > So when sigstopped less than 1M of rss remains allocated in ram. > > After sigcont and after killing the heavy VM job (that released lots > of ram and swap) both VM gracefully restarts with only swapins firing > in the host: > > andrea 9742 57.6 2.0 588536 39308 > andrea 9809 61.6 61.4 3211172 1186256 > Very cool! > No idea why the non-linux VM after a while grows back to a 1G working > set despite a single youtube playback is playing in the guest... ;), > the linux vm OTOH has only a 39M working set when idling in the oom > loops. > Perhaps the non-Linux guest has a page scrubber that runs while the system is otherwise idle. Regards, Anthony Liguori > Host must be compiled with CONFIG_MMU_NOTIFIERS=y of course, or this > won't work. > > Here the patch to kvm.git (there's some room for optimization in doing > a single tlb flush in the unmap_spte for all sptes pointing to the > page, or even more aggressively for the whole range in the > invalidate_range case, but the invalidate_range isn't an interesting > path for kvm so I guess not worth optimizing in the short/mid term, > but by optimizing the invalidate_page case we may halve the number of > tlb flushes for some common case. I leave it for later, the swapping > is heavily I/O bound anyway so a some more ipi in smp host shouldn't > be very measurable (on UP host it makes no difference to flush > multiple times in practice). 
> > Signed-off-by: Andrea Arcangeli <an...@qu...> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 4086080..c527d7d 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -18,6 +18,7 @@ config KVM > tristate "Kernel-based Virtual Machine (KVM) support" > depends on ARCH_SUPPORTS_KVM && EXPERIMENTAL > select PREEMPT_NOTIFIERS > + select MMU_NOTIFIER > select ANON_INODES > ---help--- > Support hosting fully virtualized guest machines using hardware > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 324ff9a..103c270 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -532,6 +532,36 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) > kvm_flush_remote_tlbs(kvm); > } > > +static void unmap_spte(struct kvm *kvm, u64 *spte) > +{ > + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); > + get_page(page); > + rmap_remove(kvm, spte); > + set_shadow_pte(spte, shadow_trap_nonpresent_pte); > + kvm_flush_remote_tlbs(kvm); > + __free_page(page); > +} > + > +void kvm_rmap_unmap_gfn(struct kvm *kvm, gfn_t gfn) > +{ > + unsigned long *rmapp; > + u64 *spte, *curr_spte; > + > + spin_lock(&kvm->mmu_lock); > + gfn = unalias_gfn(kvm, gfn); > + rmapp = gfn_to_rmap(kvm, gfn); > + > + spte = rmap_next(kvm, rmapp, NULL); > + while (spte) { > + BUG_ON(!(*spte & PT_PRESENT_MASK)); > + rmap_printk("rmap_swap_page: spte %p %llx\n", spte, *spte); > + curr_spte = spte; > + spte = rmap_next(kvm, rmapp, spte); > + unmap_spte(kvm, curr_spte); > + } > + spin_unlock(&kvm->mmu_lock); > +} > + > #ifdef MMU_DEBUG > static int is_empty_shadow_page(u64 *spt) > { > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 8a90403..e9a3f6e 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -3159,6 +3159,36 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) > free_page((unsigned long)vcpu->arch.pio_data); > } > > +static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) > +{ > + return container_of(mn, struct kvm, mmu_notifier); > +} > + > +void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address) > +{ > + struct kvm *kvm = mmu_notifier_to_kvm(mn); > + gfn_t gfn = hva_to_gfn(kvm, address); > + BUG_ON(mm != kvm->mm); > + if (gfn == -1UL) > + return; > + kvm_rmap_unmap_gfn(kvm, gfn); > +} > + > +void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + for (; start < end; start += PAGE_SIZE) > + kvm_mmu_notifier_invalidate_page(mn, mm, start); > +} > + > +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { > + .invalidate_range = kvm_mmu_notifier_invalidate_range, > + .invalidate_page = kvm_mmu_notifier_invalidate_page, > +}; > + > struct kvm *kvm_arch_create_vm(void) > { > struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); > @@ -3167,6 +3197,7 @@ struct kvm *kvm_arch_create_vm(void) > return ERR_PTR(-ENOMEM); > > INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); > + kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops; > > return kvm; > } > diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h > index d6db0de..feacd77 100644 > --- a/include/asm-x86/kvm_host.h > +++ b/include/asm-x86/kvm_host.h > @@ -404,6 +404,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu); > int kvm_mmu_setup(struct kvm_vcpu *vcpu); > void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); > > +void kvm_rmap_unmap_gfn(struct kvm *kvm, gfn_t gfn); > int 
kvm_mmu_reset_context(struct kvm_vcpu *vcpu); > void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); > void kvm_mmu_zap_all(struct kvm *kvm); > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 2714068..85da7fa 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -117,6 +117,7 @@ struct kvm { > struct kvm_io_bus pio_bus; > struct kvm_vm_stat stat; > struct kvm_arch arch; > + struct mmu_notifier mmu_notifier; > }; > > /* The guest did something we don't support. */ > @@ -163,6 +164,7 @@ int kvm_arch_set_memory_region(struct kvm *kvm, > struct kvm_memory_slot old, > int user_alloc); > gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn); > +gfn_t hva_to_gfn(struct kvm *kvm, unsigned long addr); > struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 4295623..8f1dd86 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -165,6 +165,7 @@ static struct kvm *kvm_create_vm(void) > > kvm->mm = current->mm; > atomic_inc(&kvm->mm->mm_count); > + mmu_notifier_register(&kvm->mmu_notifier, kvm->mm); > spin_lock_init(&kvm->mmu_lock); > kvm_io_bus_init(&kvm->pio_bus); > mutex_init(&kvm->lock); > @@ -454,6 +455,23 @@ static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) > return (slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE); > } > > +gfn_t hva_to_gfn(struct kvm *kvm, unsigned long addr) > +{ > + int i; > + > + for (i = 0; i < kvm->nmemslots; i++) { > + struct kvm_memory_slot *memslot = &kvm->memslots[i]; > + unsigned long start = memslot->userspace_addr; > + unsigned long end = start + (memslot->npages << PAGE_SHIFT); > + > + if (addr >= start && addr < end) { > + gfn_t gfn_offset = (addr - start) >> PAGE_SHIFT; > + return memslot->base_gfn + gfn_offset; > + } > + } > + return -1UL; > +} > + > /* > * Requires current->mm->mmap_sem to be held > */ > > > > And here a compatibility patch to kvm-userland so the external module > still compile and runs with older kernels w/o MMU_NOTIFIER patch applied. > > Signed-off-by: Andrea Arcangeli <an...@qu...> > > diff --git a/kernel/external-module-compat.h b/kernel/external-module-compat.h > index 67b9cc4..34ef0a5 100644 > --- a/kernel/external-module-compat.h > +++ b/kernel/external-module-compat.h > @@ -17,6 +17,28 @@ > #include <linux/hrtimer.h> > #include <asm/bitops.h> > > +#ifndef CONFIG_MMU_NOTIFIER > +struct mmu_notifier; > + > +struct mmu_notifier_ops { > + void (*release)(struct mmu_notifier * mn, > + struct mm_struct *mm); > + void (*invalidate_page)(struct mmu_notifier * mn, > + struct mm_struct *mm, > + unsigned long address); > + void (*invalidate_range)(struct mmu_notifier * mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end); > +}; > + > +struct mmu_notifier { > + const struct mmu_notifier_ops *ops; > +}; > +#define mmu_notifier_register(mn, mm) do {} while(0) > +#define mmu_notifier_unregister(mn) do {} while (0) > +#define mmu_notifier_release(mm) do {} while (0) > +#endif > + > /* > * 2.6.16 does not have GFP_NOWAIT > */ > > > Here another patch for kvm-userland where I can't see symmetry between > the lack of atomic_inc despite mmdrop is still run. I can't possibly > see how this is supposedly not required when compiled into the kernel > vs external module. Either atomic_inc is needed in both or none. 
Even > if I'm right still this bug wasn't destabilizing because > atomic_inc_and_dec only fires once in the overflow check, so it > shouldn't matter to run one mmdrop more than needed, but it's good for > correctness. > > Signed-off-by: Andrea Arcangeli <an...@qu...> > > diff --git a/kernel/hack-module.awk b/kernel/hack-module.awk > index 5187c96..884bc50 100644 > --- a/kernel/hack-module.awk > +++ b/kernel/hack-module.awk > @@ -33,8 +33,6 @@ > vmx_load_host_state = 0 > } > > -/atomic_inc\(&kvm->mm->mm_count\);/ { $0 = "//" $0 } > - > /^\t\.fault = / { > fcn = gensub(/,/, "", "g", $3) > $0 = "\t.VMA_OPS_FAULT(fault) = VMA_OPS_FAULT_FUNC(" fcn ")," > > > I'll post the mmu-notifiers patch (required in the host kernel to run > the above) separately in CC with more mailing lists because that's not > KVM code at all and we hope to get it merged in the mainline kernel > soon after getting feedback on the interface from the other users of > the mmu notifiers. > > Thanks! > Andrea > > ------------------------------------------------------------------------- > Check out the new SourceForge.net Marketplace. > It's the best place to buy or sell services for > just about anything Open Source. > http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace > _______________________________________________ > kvm-devel mailing list > kvm...@li... > https://lists.sourceforge.net/lists/listinfo/kvm-devel > |
From: Marcelo T. <ma...@kv...> - 2008-01-14 13:43:07
|
Hi Andrea, > path for kvm so I guess not worth optimizing in the short/mid term, > but by optimizing the invalidate_page case we may halve the number of > tlb flushes for some common case. I leave it for later, the swapping > is heavily I/O bound anyway so a some more ipi in smp host shouldn't > be very measurable (on UP host it makes no difference to flush > multiple times in practice). > > Signed-off-by: Andrea Arcangeli <an...@qu...> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 4086080..c527d7d 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -18,6 +18,7 @@ config KVM > tristate "Kernel-based Virtual Machine (KVM) support" > depends on ARCH_SUPPORTS_KVM && EXPERIMENTAL > select PREEMPT_NOTIFIERS > + select MMU_NOTIFIER > select ANON_INODES > ---help--- > Support hosting fully virtualized guest machines using hardware > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 324ff9a..103c270 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -532,6 +532,36 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) > kvm_flush_remote_tlbs(kvm); > } > > +static void unmap_spte(struct kvm *kvm, u64 *spte) > +{ > + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); > + get_page(page); > + rmap_remove(kvm, spte); > + set_shadow_pte(spte, shadow_trap_nonpresent_pte); > + kvm_flush_remote_tlbs(kvm); > + __free_page(page); > +} > + > +void kvm_rmap_unmap_gfn(struct kvm *kvm, gfn_t gfn) > +{ > + unsigned long *rmapp; > + u64 *spte, *curr_spte; > + > + spin_lock(&kvm->mmu_lock); > + gfn = unalias_gfn(kvm, gfn); > + rmapp = gfn_to_rmap(kvm, gfn); The alias and memslot maps are protected only by mmap_sem, so you should make kvm_set_memory_region/set_memory_alias grab the mmu spinlock in addition to mmap_sem in write mode. kvm_mmu_zap_all() grabs the mmu lock.. that should probably move up into the caller. > + > + spte = rmap_next(kvm, rmapp, NULL); > + while (spte) { > + BUG_ON(!(*spte & PT_PRESENT_MASK)); > + rmap_printk("rmap_swap_page: spte %p %llx\n", spte, *spte); > + curr_spte = spte; > + spte = rmap_next(kvm, rmapp, spte); > + unmap_spte(kvm, curr_spte); > + } > + spin_unlock(&kvm->mmu_lock); > +} > + > #ifdef MMU_DEBUG > static int is_empty_shadow_page(u64 *spt) > { > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 8a90403..e9a3f6e 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -3159,6 +3159,36 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) > free_page((unsigned long)vcpu->arch.pio_data); > } > > +static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) > +{ > + return container_of(mn, struct kvm, mmu_notifier); > +} > + > +void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long address) > +{ > + struct kvm *kvm = mmu_notifier_to_kvm(mn); > + gfn_t gfn = hva_to_gfn(kvm, address); > + BUG_ON(mm != kvm->mm); > + if (gfn == -1UL) > + return; > + kvm_rmap_unmap_gfn(kvm, gfn); And then you also need to cover "hva_to_gfn()" to happen under the lock. 
> +} > + > +void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end) > +{ > + for (; start < end; start += PAGE_SIZE) > + kvm_mmu_notifier_invalidate_page(mn, mm, start); > +} > + > +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { > + .invalidate_range = kvm_mmu_notifier_invalidate_range, > + .invalidate_page = kvm_mmu_notifier_invalidate_page, > +}; > + > struct kvm *kvm_arch_create_vm(void) > { > struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); > @@ -3167,6 +3197,7 @@ struct kvm *kvm_arch_create_vm(void) > return ERR_PTR(-ENOMEM); > > INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); > + kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops; > > return kvm; > } > diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h > index d6db0de..feacd77 100644 > --- a/include/asm-x86/kvm_host.h > +++ b/include/asm-x86/kvm_host.h > @@ -404,6 +404,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu); > int kvm_mmu_setup(struct kvm_vcpu *vcpu); > void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); > > +void kvm_rmap_unmap_gfn(struct kvm *kvm, gfn_t gfn); > int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); > void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); > void kvm_mmu_zap_all(struct kvm *kvm); > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 2714068..85da7fa 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -117,6 +117,7 @@ struct kvm { > struct kvm_io_bus pio_bus; > struct kvm_vm_stat stat; > struct kvm_arch arch; > + struct mmu_notifier mmu_notifier; > }; > > /* The guest did something we don't support. */ > @@ -163,6 +164,7 @@ int kvm_arch_set_memory_region(struct kvm *kvm, > struct kvm_memory_slot old, > int user_alloc); > gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn); > +gfn_t hva_to_gfn(struct kvm *kvm, unsigned long addr); > struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); > void kvm_release_page_clean(struct page *page); > void kvm_release_page_dirty(struct page *page); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 4295623..8f1dd86 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -165,6 +165,7 @@ static struct kvm *kvm_create_vm(void) > > kvm->mm = current->mm; > atomic_inc(&kvm->mm->mm_count); > + mmu_notifier_register(&kvm->mmu_notifier, kvm->mm); > spin_lock_init(&kvm->mmu_lock); > kvm_io_bus_init(&kvm->pio_bus); > mutex_init(&kvm->lock); > @@ -454,6 +455,23 @@ static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) > return (slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE); > } > > +gfn_t hva_to_gfn(struct kvm *kvm, unsigned long addr) > +{ > + int i; > + > + for (i = 0; i < kvm->nmemslots; i++) { > + struct kvm_memory_slot *memslot = &kvm->memslots[i]; > + unsigned long start = memslot->userspace_addr; > + unsigned long end = start + (memslot->npages << PAGE_SHIFT); > + > + if (addr >= start && addr < end) { > + gfn_t gfn_offset = (addr - start) >> PAGE_SHIFT; > + return memslot->base_gfn + gfn_offset; > + } > + } > + return -1UL; > +} > + > /* > * Requires current->mm->mmap_sem to be held > */ > > > > And here a compatibility patch to kvm-userland so the external module > still compile and runs with older kernels w/o MMU_NOTIFIER patch applied. 
> > Signed-off-by: Andrea Arcangeli <an...@qu...> > > diff --git a/kernel/external-module-compat.h b/kernel/external-module-compat.h > index 67b9cc4..34ef0a5 100644 > --- a/kernel/external-module-compat.h > +++ b/kernel/external-module-compat.h > @@ -17,6 +17,28 @@ > #include <linux/hrtimer.h> > #include <asm/bitops.h> > > +#ifndef CONFIG_MMU_NOTIFIER > +struct mmu_notifier; > + > +struct mmu_notifier_ops { > + void (*release)(struct mmu_notifier * mn, > + struct mm_struct *mm); > + void (*invalidate_page)(struct mmu_notifier * mn, > + struct mm_struct *mm, > + unsigned long address); > + void (*invalidate_range)(struct mmu_notifier * mn, > + struct mm_struct *mm, > + unsigned long start, unsigned long end); > +}; > + > +struct mmu_notifier { > + const struct mmu_notifier_ops *ops; > +}; > +#define mmu_notifier_register(mn, mm) do {} while(0) > +#define mmu_notifier_unregister(mn) do {} while (0) > +#define mmu_notifier_release(mm) do {} while (0) > +#endif > + > /* > * 2.6.16 does not have GFP_NOWAIT > */ > > > Here another patch for kvm-userland where I can't see symmetry between > the lack of atomic_inc despite mmdrop is still run. I can't possibly > see how this is supposedly not required when compiled into the kernel > vs external module. Either atomic_inc is needed in both or none. Even > if I'm right still this bug wasn't destabilizing because > atomic_inc_and_dec only fires once in the overflow check, so it > shouldn't matter to run one mmdrop more than needed, but it's good for > correctness. > > Signed-off-by: Andrea Arcangeli <an...@qu...> > > diff --git a/kernel/hack-module.awk b/kernel/hack-module.awk > index 5187c96..884bc50 100644 > --- a/kernel/hack-module.awk > +++ b/kernel/hack-module.awk > @@ -33,8 +33,6 @@ > vmx_load_host_state = 0 > } > > -/atomic_inc\(&kvm->mm->mm_count\);/ { $0 = "//" $0 } > - > /^\t\.fault = / { > fcn = gensub(/,/, "", "g", $3) > $0 = "\t.VMA_OPS_FAULT(fault) = VMA_OPS_FAULT_FUNC(" fcn ")," > > > I'll post the mmu-notifiers patch (required in the host kernel to run > the above) separately in CC with more mailing lists because that's not > KVM code at all and we hope to get it merged in the mainline kernel > soon after getting feedback on the interface from the other users of > the mmu notifiers. > > Thanks! > Andrea > > ------------------------------------------------------------------------- > Check out the new SourceForge.net Marketplace. > It's the best place to buy or sell services for > just about anything Open Source. > http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace > _______________________________________________ > kvm-devel mailing list > kvm...@li... > https://lists.sourceforge.net/lists/listinfo/kvm-devel |
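Concretely, the locking change Marcelo asks for amounts to something like the following sketch; Andrea's revised patch further down the thread does essentially this inside __kvm_set_memory_region():

	/* Sketch: install the updated slot under mmu_lock so the mmu notifier
	 * path, which takes mmu_lock but not necessarily mmap_sem, never sees
	 * a half-updated memslot array. */
	spin_lock(&kvm->mmu_lock);
	if (mem->slot >= kvm->nmemslots)
		kvm->nmemslots = mem->slot + 1;
	*memslot = new;
	spin_unlock(&kvm->mmu_lock);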
From: Andrea A. <an...@qu...> - 2008-01-14 14:06:54
|
On Mon, Jan 14, 2008 at 11:45:39AM -0200, Marcelo Tosatti wrote:
> The alias and memslot maps are protected only by mmap_sem, so you

Yes, they are already protected, and furthermore in write mode.

> should make kvm_set_memory_region/set_memory_alias grab the mmu spinlock
> in addition to mmap_sem in write mode.

The mmu notifiers already hold the mmap_sem in read mode, so I don't see why I should add the mmu_lock around memslots. The mmu_lock AFAICS is only needed to serialize with other vcpu faults when updating the sptes, and I already take it there.

> And then you also need to cover "hva_to_gfn()" to happen under the lock.

hva_to_gfn only requires the mmap_sem in read mode, and that's already taken implicitly before the mmu notifiers are called. |
From: Avi K. <av...@qu...> - 2008-01-14 14:09:25
|
Marcelo Tosatti wrote: >> >> +static void unmap_spte(struct kvm *kvm, u64 *spte) >> +{ >> + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); >> + get_page(page); >> + rmap_remove(kvm, spte); >> + set_shadow_pte(spte, shadow_trap_nonpresent_pte); >> + kvm_flush_remote_tlbs(kvm); >> + __free_page(page); >> +} >> + >> +void kvm_rmap_unmap_gfn(struct kvm *kvm, gfn_t gfn) >> +{ >> + unsigned long *rmapp; >> + u64 *spte, *curr_spte; >> + >> + spin_lock(&kvm->mmu_lock); >> + gfn = unalias_gfn(kvm, gfn); >> + rmapp = gfn_to_rmap(kvm, gfn); >> > > The alias and memslot maps are protected only by mmap_sem, so you > should make kvm_set_memory_region/set_memory_alias grab the mmu spinlock > in addition to mmap_sem in write mode. > > kvm_mmu_zap_all() grabs the mmu lock.. that should probably move up into > the caller. > > Aren't mmu notifiers called with mmap_sem held for read? Maybe not from the swap path? -- error compiling committee.c: too many arguments to function |
From: Avi K. <av...@qu...> - 2008-01-20 15:15:59
|
Andrea Arcangeli wrote: > On Tue, Jan 15, 2008 at 05:57:03PM +0200, Avi Kivity wrote: > >> It's the same hva for two different gpas. Same functionality as the alias, >> but with less data structures. >> > > Ok but if this is already supposed to work, then I need to at least > change kvm_hva_to_rmapp not to stop when it find the first match but > to keep going to find other memslots in the same hva range. It's not > just one rmapp for each hva, but multiple rmapp for each hva. > Yes, it's supposed to work (we can't prevent userspace from doing it). -- error compiling committee.c: too many arguments to function |
From: Andrea A. <an...@qu...> - 2008-01-14 14:24:58
|
On Mon, Jan 14, 2008 at 04:09:03PM +0200, Avi Kivity wrote: > Marcelo Tosatti wrote: >>> +static void unmap_spte(struct kvm *kvm, u64 *spte) >>> +{ >>> + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> >>> PAGE_SHIFT); >>> + get_page(page); >>> + rmap_remove(kvm, spte); >>> + set_shadow_pte(spte, shadow_trap_nonpresent_pte); >>> + kvm_flush_remote_tlbs(kvm); >>> + __free_page(page); >>> +} >>> + >>> +void kvm_rmap_unmap_gfn(struct kvm *kvm, gfn_t gfn) >>> +{ >>> + unsigned long *rmapp; >>> + u64 *spte, *curr_spte; >>> + >>> + spin_lock(&kvm->mmu_lock); >>> + gfn = unalias_gfn(kvm, gfn); >>> + rmapp = gfn_to_rmap(kvm, gfn); >>> >> >> The alias and memslot maps are protected only by mmap_sem, so you >> should make kvm_set_memory_region/set_memory_alias grab the mmu spinlock >> in addition to mmap_sem in write mode. >> >> kvm_mmu_zap_all() grabs the mmu lock.. that should probably move up into >> the caller. >> >> > > Aren't mmu notifiers called with mmap_sem held for read? > > Maybe not from the swap path? Good point, the swap path isn't covered by the mmap_sem, so Marcelo's right I need to fixup the locking a bit. |
From: Avi K. <av...@qu...> - 2008-01-14 15:47:15
|
Andrea Arcangeli wrote:
>> Aren't mmu notifiers called with mmap_sem held for read?
>>
>> Maybe not from the swap path?
>
> Good point, the swap path isn't covered by the mmap_sem, so Marcelo's
> right I need to fixup the locking a bit.

One idea I had at one time was to add ->lock() and ->unlock() callbacks to mmu notifiers that would be called in a sleepy context. But that seems heavy-handed. Maybe it can be fixed in some clever way with rcu, or with a rwlock around the memory slot map.

--
error compiling committee.c: too many arguments to function |
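For the record, the rwlock variant Avi mentions would look roughly like this. This is a hypothetical sketch only: the memslots_lock field and the function name are made up, and the patch Andrea posts later in the thread ends up reusing kvm->mmu_lock instead of adding a new lock.

	/* Hypothetical sketch: protect kvm->memslots[] with a rwlock so the
	 * mmu notifier path, which may run without mmap_sem, can walk the
	 * slots safely.  Not what the actual patch does. */
	static void sketch_unmap_hva(struct kvm *kvm, unsigned long hva)
	{
		read_lock(&kvm->memslots_lock);		/* made-up field */
		/* ... look up the memslot covering hva, zap the rmapped sptes ... */
		read_unlock(&kvm->memslots_lock);
	}

	/* Writers (kvm_set_memory_region / set_memory_alias) would then wrap
	 * the memslot and alias updates in write_lock(&kvm->memslots_lock) /
	 * write_unlock(&kvm->memslots_lock). */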
From: Avi K. <av...@qu...> - 2008-01-22 13:37:54
|
Andrea Arcangeli wrote: > On Sun, Jan 20, 2008 at 05:16:03PM +0200, Avi Kivity wrote: > >> Yes, it's supposed to work (we can't prevent userspace from doing it). >> > > Hmm, I think we already prevent it, so I don't think I need to update > my swap code until the below is removed. > > /* Check for overlaps */ > r = -EEXIST; > for (i = 0; i < KVM_MEMORY_SLOTS; ++i) { > struct kvm_memory_slot *s = &kvm->memslots[i]; > > if (s == memslot) > continue; > if (!((base_gfn + npages <= s->base_gfn) || > (base_gfn >= s->base_gfn + s->npages))) > goto out_free; > } > > Actually, this checks against slots with the overlapping gfns. Aliases have different gfns but same hvas. -- error compiling committee.c: too many arguments to function |
From: Andrea A. <an...@qu...> - 2008-01-22 14:56:35
|
On Tue, Jan 22, 2008 at 03:37:59PM +0200, Avi Kivity wrote:
> Andrea Arcangeli wrote:
>> On Sun, Jan 20, 2008 at 05:16:03PM +0200, Avi Kivity wrote:
>>
>>> Yes, it's supposed to work (we can't prevent userspace from doing it).
>>>
>>
>> Hmm, I think we already prevent it, so I don't think I need to update
>> my swap code until the below is removed.
>>
>> 	/* Check for overlaps */
>> 	r = -EEXIST;
>> 	for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
>> 		struct kvm_memory_slot *s = &kvm->memslots[i];
>>
>> 		if (s == memslot)
>> 			continue;
>> 		if (!((base_gfn + npages <= s->base_gfn) ||
>> 		      (base_gfn >= s->base_gfn + s->npages)))
>> 			goto out_free;
>> 	}
>>
>
> Actually, this checks against slots with the overlapping gfns. Aliases
> have different gfns but same hvas.

Hmm, not sure I follow: do you mean I need to change something? Aliases live in a different dimension, and I can't see how my code could ever notice or care about aliases (they have to be translated to a real hva-memslot-backed gfn before calling get_user_pages). All I care about is finding the right rmap structures for each hva. The above snippet should be enough to guarantee that there can only be one rmap structure for each hva, so when I checked yesterday, that looked like enough to show my kvm_hva_to_rmapp implementation is already correct. |
From: Avi K. <av...@qu...> - 2008-01-15 15:57:23
|
Andrea Arcangeli wrote:
> On Tue, Jan 15, 2008 at 04:40:19PM +0200, Avi Kivity wrote:
>
>> Note that we can now replace aliases with slots; simply map the same hva
>> range to two different gpa ranges (using a kvm-private memslot for backward
>> compat).
>>
>
> Currently the host physical page is a 1:1 mapping with the hva. Not
> sure how to alias the same host physical page on two different gfn by
> using two different hva. I still see an use for the alias.
>

It's the same hva for two different gpas. Same functionality as the alias, but with fewer data structures.

--
error compiling committee.c: too many arguments to function |
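To make "same hva for two different gpas" concrete, here is a rough userspace sketch. It is not from this thread: it assumes the user-allocated memory API (KVM_SET_USER_MEMORY_REGION), and the slot numbers, addresses and sizes are made up for illustration.

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/* Register one host buffer (hva) at two guest physical ranges.  With
	 * the gfn overlap check relaxed, this could replace the alias code. */
	static int map_hva_at_two_gpas(int vm_fd, void *hva, __u64 size)
	{
		struct kvm_userspace_memory_region fb = {
			.slot            = 1,
			.guest_phys_addr = 0xe0000000,	/* e.g. a framebuffer BAR */
			.memory_size     = size,
			.userspace_addr  = (unsigned long)hva,
		};
		struct kvm_userspace_memory_region vga = {
			.slot            = 2,
			.guest_phys_addr = 0xa0000,	/* e.g. the legacy VGA window */
			.memory_size     = size,
			.userspace_addr  = (unsigned long)hva,	/* same hva as above */
		};

		if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &fb) < 0)
			return -1;
		return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &vga);
	}

Avi gives the same scenario concretely later in the thread (an 8MB framebuffer slot plus a 32KB VGA window backed by the same hva), which is why the swap code eventually has to handle more than one rmap chain per hva.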
From: Andrea A. <an...@qu...> - 2008-01-21 11:37:50
|
On Sun, Jan 20, 2008 at 05:16:03PM +0200, Avi Kivity wrote:
> Yes, it's supposed to work (we can't prevent userspace from doing it).

Hmm, I think we already prevent it, so I don't think I need to update my swap code until the below is removed.

	/* Check for overlaps */
	r = -EEXIST;
	for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
		struct kvm_memory_slot *s = &kvm->memslots[i];

		if (s == memslot)
			continue;
		if (!((base_gfn + npages <= s->base_gfn) ||
		      (base_gfn >= s->base_gfn + s->npages)))
			goto out_free;
	}

|
From: Avi K. <av...@qu...> - 2008-01-21 17:34:28
|
Andrea Arcangeli wrote: > On Sun, Jan 20, 2008 at 05:16:03PM +0200, Avi Kivity wrote: > >> Yes, it's supposed to work (we can't prevent userspace from doing it). >> > > Hmm, I think we already prevent it, so I don't think I need to update > my swap code until the below is removed. > > /* Check for overlaps */ > r = -EEXIST; > for (i = 0; i < KVM_MEMORY_SLOTS; ++i) { > struct kvm_memory_slot *s = &kvm->memslots[i]; > > if (s == memslot) > continue; > if (!((base_gfn + npages <= s->base_gfn) || > (base_gfn >= s->base_gfn + s->npages))) > goto out_free; > } > Right. We will have to eventually remove it (to support aliases on non-x86), but no hurry now. -- error compiling committee.c: too many arguments to function |
From: Andrea A. <an...@qu...> - 2008-01-14 17:45:15
|
On Mon, Jan 14, 2008 at 05:43:58PM +0200, Avi Kivity wrote: > heavy handed. Maybe it can be fixed in some clever way with rcu or with a > rwlock around the memory slot map. Ok, first the alias looked superflous so I just dropped it (the whole point of the alias is to never call get_user_pages in the aliased hva, so the notifiers should never trigger in those alias gfn ranges). And the code was doing hva->gfn->rmap, while I'm now doing hva->rmap directly to avoid two loops over the memslot instead of only one. To serialize hva_to_rmapp mmap_sem is taken (in any mode) or the mmu_lock is taken, and it's enough to serialize the insertion with the mmu_lock and setting userspace_addr to 0 before insertion if the buffer isn't allocated by userland. Setting userspace_addr atomically will make the memslot visible. This is a new version of the kvm.git patch. Please Marcelo let me know if you see more problems, thanks a lot for the review! Signed-off-by: Andrea Arcangeli <an...@qu...> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 4086080..c527d7d 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -18,6 +18,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on ARCH_SUPPORTS_KVM && EXPERIMENTAL select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 324ff9a..189f3e1 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -532,6 +532,38 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) kvm_flush_remote_tlbs(kvm); } +static void unmap_spte(struct kvm *kvm, u64 *spte) +{ + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); + get_page(page); + rmap_remove(kvm, spte); + set_shadow_pte(spte, shadow_trap_nonpresent_pte); + kvm_flush_remote_tlbs(kvm); + __free_page(page); +} + +void kvm_rmap_unmap_hva(struct kvm *kvm, unsigned long hva) +{ + unsigned long *rmapp; + u64 *spte, *curr_spte; + + spin_lock(&kvm->mmu_lock); + rmapp = kvm_hva_to_rmapp(kvm, hva); + if (!rmapp) + goto out_unlock; + + spte = rmap_next(kvm, rmapp, NULL); + while (spte) { + BUG_ON(!(*spte & PT_PRESENT_MASK)); + rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte); + curr_spte = spte; + spte = rmap_next(kvm, rmapp, spte); + unmap_spte(kvm, curr_spte); + } +out_unlock: + spin_unlock(&kvm->mmu_lock); +} + #ifdef MMU_DEBUG static int is_empty_shadow_page(u64 *spt) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8a90403..271ce12 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3159,6 +3159,33 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) free_page((unsigned long)vcpu->arch.pio_data); } +static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) +{ + return container_of(mn, struct kvm, mmu_notifier); +} + +void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + BUG_ON(mm != kvm->mm); + kvm_rmap_unmap_hva(kvm, address); +} + +void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + for (; start < end; start += PAGE_SIZE) + kvm_mmu_notifier_invalidate_page(mn, mm, start); +} + +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { + .invalidate_range = kvm_mmu_notifier_invalidate_range, + .invalidate_page = kvm_mmu_notifier_invalidate_page, +}; + struct kvm 
*kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); @@ -3167,6 +3194,7 @@ struct kvm *kvm_arch_create_vm(void) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops; return kvm; } @@ -3219,14 +3247,20 @@ int kvm_arch_set_memory_region(struct kvm *kvm, */ if (!user_alloc) { if (npages && !old.rmap) { - memslot->userspace_addr = do_mmap(NULL, 0, - npages * PAGE_SIZE, - PROT_READ | PROT_WRITE, - MAP_SHARED | MAP_ANONYMOUS, - 0); - - if (IS_ERR((void *)memslot->userspace_addr)) - return PTR_ERR((void *)memslot->userspace_addr); + unsigned long userspace_addr; + + userspace_addr = do_mmap(NULL, 0, + npages * PAGE_SIZE, + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, + 0); + if (IS_ERR((void *)userspace_addr)) + return PTR_ERR((void *)userspace_addr); + + /* set userspace_addr atomically for kvm_hva_to_rmapp */ + spin_lock(&kvm->mmu_lock); + memslot->userspace_addr = userspace_addr; + spin_unlock(&kvm->mmu_lock); } else { if (!old.user_alloc && old.rmap) { int ret; diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index d6db0de..522028b 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -404,6 +404,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu); int kvm_mmu_setup(struct kvm_vcpu *vcpu); void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); +void kvm_rmap_unmap_hva(struct kvm *kvm, unsigned long hva); int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); void kvm_mmu_zap_all(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 2714068..eae8734 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -117,6 +117,7 @@ struct kvm { struct kvm_io_bus pio_bus; struct kvm_vm_stat stat; struct kvm_arch arch; + struct mmu_notifier mmu_notifier; }; /* The guest did something we don't support. */ @@ -163,6 +164,7 @@ int kvm_arch_set_memory_region(struct kvm *kvm, struct kvm_memory_slot old, int user_alloc); gfn_t unalias_gfn(struct kvm *kvm, gfn_t gfn); +unsigned long *kvm_hva_to_rmapp(struct kvm *kvm, unsigned long addr); struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn); void kvm_release_page_clean(struct page *page); void kvm_release_page_dirty(struct page *page); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4295623..a67e38f 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -165,6 +165,7 @@ static struct kvm *kvm_create_vm(void) kvm->mm = current->mm; atomic_inc(&kvm->mm->mm_count); + mmu_notifier_register(&kvm->mmu_notifier, kvm->mm); spin_lock_init(&kvm->mmu_lock); kvm_io_bus_init(&kvm->pio_bus); mutex_init(&kvm->lock); @@ -298,7 +299,15 @@ int __kvm_set_memory_region(struct kvm *kvm, memset(new.rmap, 0, npages * sizeof(*new.rmap)); new.user_alloc = user_alloc; - new.userspace_addr = mem->userspace_addr; + /* + * hva_to_rmmap() serialzies with the mmu_lock and to be + * safe it has to ignore memslots with !user_alloc && + * !userspace_addr. 
+ */ + if (user_alloc) + new.userspace_addr = mem->userspace_addr; + else + new.userspace_addr = 0; } /* Allocate page dirty bitmap if needed */ @@ -311,14 +320,18 @@ int __kvm_set_memory_region(struct kvm *kvm, memset(new.dirty_bitmap, 0, dirty_bytes); } + spin_lock(&kvm->mmu_lock); if (mem->slot >= kvm->nmemslots) kvm->nmemslots = mem->slot + 1; *memslot = new; + spin_unlock(&kvm->mmu_lock); r = kvm_arch_set_memory_region(kvm, mem, old, user_alloc); if (r) { + spin_lock(&kvm->mmu_lock); *memslot = old; + spin_unlock(&kvm->mmu_lock); goto out_free; } @@ -454,6 +467,28 @@ static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn) return (slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE); } +/* if mmap_sem isn't taken, it can be safely called with only the mmu_lock */ +unsigned long *kvm_hva_to_rmapp(struct kvm *kvm, unsigned long addr) +{ + int i; + + for (i = 0; i < kvm->nmemslots; i++) { + struct kvm_memory_slot *memslot = &kvm->memslots[i]; + unsigned long start = memslot->userspace_addr; + unsigned long end = start + (memslot->npages << PAGE_SHIFT); + + /* mmu_lock protects userspace_addr */ + if (!start) + continue; + + if (addr >= start && addr < end) { + gfn_t gfn_offset = (addr - start) >> PAGE_SHIFT; + return &memslot->rmap[gfn_offset]; + } + } + return NULL; +} + /* * Requires current->mm->mmap_sem to be held */ |
From: Avi K. <av...@qu...> - 2008-01-15 14:40:16
|
Andrea Arcangeli wrote: > On Mon, Jan 14, 2008 at 05:43:58PM +0200, Avi Kivity wrote: > >> heavy handed. Maybe it can be fixed in some clever way with rcu or with a >> rwlock around the memory slot map. >> > > Ok, first the alias looked superflous so I just dropped it (the whole > point of the alias is to never call get_user_pages in the aliased hva, > so the notifiers should never trigger in those alias gfn ranges). > > Right, good point. Note that we can now replace aliases with slots; simply map the same hva range to two different gpa ranges (using a kvm-private memslot for backward compat). > And the code was doing hva->gfn->rmap, while I'm now doing hva->rmap > directly to avoid two loops over the memslot instead of only one. To > serialize hva_to_rmapp mmap_sem is taken (in any mode) or the mmu_lock > is taken, and it's enough to serialize the insertion with the mmu_lock > and setting userspace_addr to 0 before insertion if the buffer isn't > allocated by userland. Setting userspace_addr atomically will make the > memslot visible. > Can we do the locking change as a separate patch? -- error compiling committee.c: too many arguments to function |
From: Andrea A. <an...@qu...> - 2008-01-15 15:52:54
|
On Tue, Jan 15, 2008 at 04:40:19PM +0200, Avi Kivity wrote:
> Note that we can now replace aliases with slots; simply map the same hva
> range to two different gpa ranges (using a kvm-private memslot for backward
> compat).

Currently the host physical page is in a 1:1 mapping with the hva. I'm not sure how to alias the same host physical page at two different gfns by using two different hvas, so I still see a use for the alias.

> Can we do the locking change as a separate patch?

Sure, I will split it. |
From: Andrea A. <an...@qu...> - 2008-01-22 17:18:07
|
On Tue, Jan 22, 2008 at 06:17:38PM +0200, Avi Kivity wrote: > There can be more than one rmapp per hva. Real world example: > > memslot 1: gfn range 0xe000000 - 0xe0800000 @ hva 0x10000000 (8MB > framebuffer) > memslot 2: gfn range 0xa0000 - 0xa8000 @ hva 0x10000000 (32KB VGA window) > > If the guest accesses gfn 0xa0000 through one gva, and gfn 0xe0000000 > through a second gva, then you will have two rmap chains for hva > 0x10000000. > > This doesn't happen today because we use the alias mechanism in qemu, but > we don't forbid it either. Ok then it's better to be sure there will not be problems if alias are removed. This should work fine with multiple rmap chains per hva too. diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 4086080..c527d7d 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -18,6 +18,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on ARCH_SUPPORTS_KVM && EXPERIMENTAL select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index cb62ef6..a025fde 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -532,6 +532,57 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) kvm_flush_remote_tlbs(kvm); } +static void unmap_spte(struct kvm *kvm, u64 *spte) +{ + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); + get_page(page); + rmap_remove(kvm, spte); + set_shadow_pte(spte, shadow_trap_nonpresent_pte); + kvm_flush_remote_tlbs(kvm); + __free_page(page); +} + +static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp) +{ + u64 *spte, *curr_spte; + + spte = rmap_next(kvm, rmapp, NULL); + while (spte) { + BUG_ON(!(*spte & PT_PRESENT_MASK)); + rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte); + curr_spte = spte; + spte = rmap_next(kvm, rmapp, spte); + unmap_spte(kvm, curr_spte); + } +} + +void kvm_unmap_hva(struct kvm *kvm, unsigned long hva) +{ + int i; + + /* + * If mmap_sem isn't taken, we can look the memslots with only + * the mmu_lock by skipping over the slots with userspace_addr == 0. 
+ */ + spin_lock(&kvm->mmu_lock); + for (i = 0; i < kvm->nmemslots; i++) { + struct kvm_memory_slot *memslot = &kvm->memslots[i]; + unsigned long start = memslot->userspace_addr; + unsigned long end; + + /* mmu_lock protects userspace_addr */ + if (!start) + continue; + + end = start + (memslot->npages << PAGE_SHIFT); + if (hva >= start && hva < end) { + gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT; + kvm_unmap_rmapp(kvm, &memslot->rmap[gfn_offset]); + } + } + spin_unlock(&kvm->mmu_lock); +} + #ifdef MMU_DEBUG static int is_empty_shadow_page(u64 *spt) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8f94a0b..5c445dd 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3167,6 +3167,35 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) free_page((unsigned long)vcpu->arch.pio_data); } +static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) +{ + return container_of(mn, struct kvm, mmu_notifier); +} + +void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + BUG_ON(mm != kvm->mm); + kvm_unmap_hva(kvm, address); +} + +void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, unsigned long end) +{ + for (; start < end; start += PAGE_SIZE) + kvm_mmu_notifier_invalidate_page(mn, mm, start); +} + +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { + .invalidate_range = kvm_mmu_notifier_invalidate_range, + .invalidate_page = kvm_mmu_notifier_invalidate_page, + /* age page will drop the spte so follow_page will set the young bit */ + .age_page = kvm_mmu_notifier_invalidate_page, +}; + struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); @@ -3175,6 +3204,7 @@ struct kvm *kvm_arch_create_vm(void) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops; return kvm; } diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index d6db0de..f13d1c3 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -404,6 +404,7 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu); int kvm_mmu_setup(struct kvm_vcpu *vcpu); void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); +void kvm_unmap_hva(struct kvm *kvm, unsigned long hva); int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); void kvm_mmu_zap_all(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ea4764b..51c9bb8 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -118,6 +118,7 @@ struct kvm { struct kvm_io_bus pio_bus; struct kvm_vm_stat stat; struct kvm_arch arch; + struct mmu_notifier mmu_notifier; }; /* The guest did something we don't support. */ diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4295623..9e229ac 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -165,6 +165,7 @@ static struct kvm *kvm_create_vm(void) kvm->mm = current->mm; atomic_inc(&kvm->mm->mm_count); + mmu_notifier_register(&kvm->mmu_notifier, kvm->mm); spin_lock_init(&kvm->mmu_lock); kvm_io_bus_init(&kvm->pio_bus); mutex_init(&kvm->lock); |
From: Andrea A. <an...@qu...> - 2008-01-22 20:04:03
|
This last update works against mmu-notifiers #v4; it makes the accessed bitflag in the spte visible to the Linux VM, so the host gets accurate working set detection without requiring vmexits.

Signed-off-by: Andrea Arcangeli <an...@qu...>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 4086080..c527d7d 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -18,6 +18,7 @@ config KVM
        tristate "Kernel-based Virtual Machine (KVM) support"
        depends on ARCH_SUPPORTS_KVM && EXPERIMENTAL
        select PREEMPT_NOTIFIERS
+       select MMU_NOTIFIER
        select ANON_INODES
        ---help---
          Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c85b904..adb20de 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -532,6 +532,110 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
        kvm_flush_remote_tlbs(kvm);
 }
 
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+       struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+       get_page(page);
+       rmap_remove(kvm, spte);
+       set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+       kvm_flush_remote_tlbs(kvm);
+       __free_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+       u64 *spte, *curr_spte;
+
+       spte = rmap_next(kvm, rmapp, NULL);
+       while (spte) {
+               BUG_ON(!(*spte & PT_PRESENT_MASK));
+               rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+               curr_spte = spte;
+               spte = rmap_next(kvm, rmapp, spte);
+               kvm_unmap_spte(kvm, curr_spte);
+       }
+}
+
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+       int i;
+
+       /*
+        * If mmap_sem isn't taken, we can look the memslots with only
+        * the mmu_lock by skipping over the slots with userspace_addr == 0.
+        */
+       spin_lock(&kvm->mmu_lock);
+       for (i = 0; i < kvm->nmemslots; i++) {
+               struct kvm_memory_slot *memslot = &kvm->memslots[i];
+               unsigned long start = memslot->userspace_addr;
+               unsigned long end;
+
+               /* mmu_lock protects userspace_addr */
+               if (!start)
+                       continue;
+
+               end = start + (memslot->npages << PAGE_SHIFT);
+               if (hva >= start && hva < end) {
+                       gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+                       kvm_unmap_rmapp(kvm, &memslot->rmap[gfn_offset]);
+               }
+       }
+       spin_unlock(&kvm->mmu_lock);
+}
+
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+       u64 *spte;
+       int young = 0;
+
+       spte = rmap_next(kvm, rmapp, NULL);
+       while (spte) {
+               int _young;
+               u64 _spte = *spte;
+               BUG_ON(!(_spte & PT_PRESENT_MASK));
+               _young = _spte & PT_ACCESSED_MASK;
+               if (_young) {
+                       young = !!_young;
+                       set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK);
+               }
+               spte = rmap_next(kvm, rmapp, spte);
+       }
+       return young;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+{
+       int i;
+       int young = 0;
+
+       /*
+        * If mmap_sem isn't taken, we can look the memslots with only
+        * the mmu_lock by skipping over the slots with userspace_addr == 0.
+        */
+       spin_lock(&kvm->mmu_lock);
+       for (i = 0; i < kvm->nmemslots; i++) {
+               struct kvm_memory_slot *memslot = &kvm->memslots[i];
+               unsigned long start = memslot->userspace_addr;
+               unsigned long end;
+
+               /* mmu_lock protects userspace_addr */
+               if (!start)
+                       continue;
+
+               end = start + (memslot->npages << PAGE_SHIFT);
+               if (hva >= start && hva < end) {
+                       gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+                       young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]);
+               }
+       }
+       spin_unlock(&kvm->mmu_lock);
+
+       if (young)
+               kvm_flush_remote_tlbs(kvm);
+
+       return young;
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f94a0b..35bb114 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3167,6 +3167,43 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
        free_page((unsigned long)vcpu->arch.pio_data);
 }
 
+static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
+{
+       return container_of(mn, struct kvm, mmu_notifier);
+}
+
+void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
+                                     struct mm_struct *mm,
+                                     unsigned long address)
+{
+       struct kvm *kvm = mmu_notifier_to_kvm(mn);
+       BUG_ON(mm != kvm->mm);
+       kvm_unmap_hva(kvm, address);
+}
+
+void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+                                      struct mm_struct *mm,
+                                      unsigned long start, unsigned long end)
+{
+       for (; start < end; start += PAGE_SIZE)
+               kvm_mmu_notifier_invalidate_page(mn, mm, start);
+}
+
+int kvm_mmu_notifier_age_page(struct mmu_notifier *mn,
+                             struct mm_struct *mm,
+                             unsigned long address)
+{
+       struct kvm *kvm = mmu_notifier_to_kvm(mn);
+       BUG_ON(mm != kvm->mm);
+       return kvm_age_hva(kvm, address);
+}
+
+static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+       .invalidate_range = kvm_mmu_notifier_invalidate_range,
+       .invalidate_page = kvm_mmu_notifier_invalidate_page,
+       .age_page = kvm_mmu_notifier_age_page,
+};
+
 struct kvm *kvm_arch_create_vm(void)
 {
        struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
@@ -3175,6 +3212,7 @@ struct kvm *kvm_arch_create_vm(void)
                return ERR_PTR(-ENOMEM);
 
        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+       kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
 
        return kvm;
 }
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index d6db0de..18496e0 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -404,6 +404,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu);
 int kvm_mmu_setup(struct kvm_vcpu *vcpu);
 void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte);
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
+int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
 void kvm_mmu_zap_all(struct kvm *kvm);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ea4764b..51c9bb8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -118,6 +118,7 @@ struct kvm {
        struct kvm_io_bus pio_bus;
        struct kvm_vm_stat stat;
        struct kvm_arch arch;
+       struct mmu_notifier mmu_notifier;
 };
 
 /* The guest did something we don't support. */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4295623..b5454d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -165,6 +165,7 @@ static struct kvm *kvm_create_vm(void)
 
        kvm->mm = current->mm;
        atomic_inc(&kvm->mm->mm_count);
+       mmu_notifier_register(&kvm->mmu_notifier, kvm->mm);
        spin_lock_init(&kvm->mmu_lock);
        kvm_io_bus_init(&kvm->pio_bus);
        mutex_init(&kvm->lock);
|
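The core of the aging path above is a test-and-clear of the accessed bit in each spte (kvm_age_rmapp), so the guest's own memory accesses feed the host's working-set scan without a single vmexit. The following stand-alone sketch models just that operation; the names, bit positions and the userspace harness are illustrative stand-ins, not the real KVM or linux-mm code:

#include <stdint.h>
#include <stdio.h>

/* Illustrative model of spte aging: bit positions are placeholders, not the
 * real PT_PRESENT_MASK / PT_ACCESSED_MASK values. */
#define SPTE_PRESENT  (1ull << 0)
#define SPTE_ACCESSED (1ull << 5)

/* Test-and-clear the accessed bit, the same operation kvm_age_rmapp() does on
 * a real spte: report "young" if the guest touched the page since the last
 * scan, and clear the bit so the next scan starts fresh. */
static int age_spte(uint64_t *spte)
{
        int young = !!(*spte & SPTE_ACCESSED);

        if (young)
                *spte &= ~SPTE_ACCESSED;
        return young;
}

int main(void)
{
        uint64_t spte = SPTE_PRESENT | SPTE_ACCESSED;   /* guest touched the page */

        printf("first scan:  young=%d\n", age_spte(&spte));  /* prints 1 */
        printf("second scan: young=%d\n", age_spte(&spte));  /* prints 0: idle since */
        return 0;
}

In the real patch the hardware sets the accessed bit when the guest touches the page, and kvm_age_hva() additionally flushes the remote TLBs when any spte was found young, since clearing the bit without a flush could let later accesses go unnoticed.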
From: Avi K. <av...@qu...> - 2008-01-22 16:17:40
|
Andrea Arcangeli wrote:
> On Tue, Jan 22, 2008 at 03:37:59PM +0200, Avi Kivity wrote:
>
>> Andrea Arcangeli wrote:
>>
>>> On Sun, Jan 20, 2008 at 05:16:03PM +0200, Avi Kivity wrote:
>>>
>>>> Yes, it's supposed to work (we can't prevent userspace from doing it).
>>>>
>>> Hmm, I think we already prevent it, so I don't think I need to update
>>> my swap code until the below is removed.
>>>
>>>     /* Check for overlaps */
>>>     r = -EEXIST;
>>>     for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
>>>             struct kvm_memory_slot *s = &kvm->memslots[i];
>>>
>>>             if (s == memslot)
>>>                     continue;
>>>             if (!((base_gfn + npages <= s->base_gfn) ||
>>>                   (base_gfn >= s->base_gfn + s->npages)))
>>>                     goto out_free;
>>>     }
>>>
>> Actually, this checks against slots with overlapping gfns. Aliases
>> have different gfns but the same hvas.
>>
> Hmm, not sure I follow; do you mean I need to change something?
> Aliases live in a different dimension, and I can't see how my code
> could ever notice or care about aliases (they have to be translated to
> a real hva-memslot-backed gfn before calling get_user_pages). All I
> care about is finding the right rmap structures for each hva. The
> above snippet should be enough to guarantee that there can only be one
> rmap structure for each hva, so when I checked yesterday that looked
> enough to prove my kvm_hva_to_rmapp implementation already correct.
>
There can be more than one rmapp per hva. Real world example:

  memslot 1: gfn range 0xe0000000 - 0xe0800000 @ hva 0x10000000 (8MB framebuffer)
  memslot 2: gfn range 0xa0000 - 0xa8000 @ hva 0x10000000 (32KB VGA window)

If the guest accesses gfn 0xa0000 through one gva, and gfn 0xe0000000 through a
second gva, then you will have two rmap chains for hva 0x10000000. This doesn't
happen today because we use the alias mechanism in qemu, but we don't forbid it
either.

--
error compiling committee.c: too many arguments to function
|
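A tiny self-contained sketch of the layout Avi describes, showing why an hva-indexed lookup cannot stop at the first hit. The struct and values are simplified stand-ins, not the actual KVM memslot definitions:

#include <stdio.h>

/* Simplified stand-in for a memslot: guest-physical base, size in bytes, and
 * the userspace virtual address (hva) the slot is backed by. */
struct slot {
        const char   *name;
        unsigned long base_gpa;
        unsigned long size;
        unsigned long userspace_addr;
};

static struct slot slots[] = {
        { "framebuffer", 0xe0000000UL, 8 << 20,  0x10000000UL },
        { "vga window",  0x000a0000UL, 32 << 10, 0x10000000UL },
};

int main(void)
{
        unsigned long hva = 0x10000000UL;
        int i, matches = 0;

        /* the same hva falls inside both slots, so a lookup that returns
         * only the first match would miss the second rmap chain */
        for (i = 0; i < 2; i++) {
                struct slot *s = &slots[i];
                if (hva >= s->userspace_addr && hva < s->userspace_addr + s->size) {
                        printf("hva %#lx hits %s (gpa %#lx)\n",
                               hva, s->name, s->base_gpa + (hva - s->userspace_addr));
                        matches++;
                }
        }
        printf("%d rmap chains would need unmapping\n", matches);
        return 0;
}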
From: Andrea A. <an...@qu...> - 2008-01-15 16:09:38
|
On Tue, Jan 15, 2008 at 05:57:03PM +0200, Avi Kivity wrote:
> It's the same hva for two different gpas. Same functionality as the alias,
> but with fewer data structures.

Ok, but if this is already supposed to work, then I need to at least change
kvm_hva_to_rmapp not to stop when it finds the first match, but to keep going
and find the other memslots covering the same hva range. It's not just one
rmapp for each hva, but potentially multiple rmapps for each hva.
|
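That is in fact the shape the 22 Jan patch earlier in the thread ends up with: there is no single-result kvm_hva_to_rmapp() any more; kvm_unmap_hva() and kvm_age_hva() walk every memslot and handle each matching rmapp without breaking out of the loop. A minimal sketch of that control flow, using simplified hypothetical types rather than the real KVM structures:

/* Sketch only: simplified stand-ins for the KVM types.  The point is the
 * control flow -- no early return, every slot covering the hva is visited
 * and its rmap chain is handed to the caller's handler (unmap or age). */
struct sketch_slot {
        unsigned long userspace_addr;   /* 0 means "slot not populated yet" */
        unsigned long npages;
        unsigned long *rmap;            /* one rmap head per page in the slot */
};

#define SKETCH_PAGE_SHIFT 12

void for_each_rmapp_of_hva(struct sketch_slot *slots, int nslots,
                           unsigned long hva,
                           void (*handler)(unsigned long *rmapp))
{
        int i;

        for (i = 0; i < nslots; i++) {
                struct sketch_slot *s = &slots[i];
                unsigned long start = s->userspace_addr;
                unsigned long end = start + (s->npages << SKETCH_PAGE_SHIFT);

                if (!start)
                        continue;               /* skip unpopulated slots */
                if (hva >= start && hva < end)
                        handler(&s->rmap[(hva - start) >> SKETCH_PAGE_SHIFT]);
                /* no break: another slot may cover the same hva */
        }
}

With this shape, both rmap chains from the framebuffer/VGA-window example above get processed for the shared hva, which is exactly the behaviour the alias-free double mapping requires.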