From: Mohammed G. <m.g...@gm...> - 2008-04-27 16:39:47
|
Hi, I was trying to compile the code from kvm-userspace.git against the kernel from the kvm.git tree. I ran configure --with-patched-kernel and I got this error:

../libkvm/libkvm.h:385: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'kvm_get_cr8'
main.c: In function 'enter_32':
main.c:374: error: variable 'regs' has initializer but incomplete type
main.c:375: error: unknown field 'rsp' specified in initializer
main.c:375: warning: excess elements in struct initializer
main.c:375: warning: (near initialization for 'regs')
main.c:376: error: unknown field 'rip' specified in initializer
main.c:376: warning: excess elements in struct initializer
main.c:376: warning: (near initialization for 'regs')
main.c:377: error: unknown field 'rflags' specified in initializer
main.c:377: warning: excess elements in struct initializer
main.c:377: warning: (near initialization for 'regs')
main.c:374: error: storage size of 'regs' isn't known
main.c:379: error: variable 'sregs' has initializer but incomplete type
main.c:380: error: unknown field 'cs' specified in initializer
main.c:380: error: extra brace group at end of initializer
main.c:380: error: (near initialization for 'sregs')
main.c:380: warning: excess elements in struct initializer
main.c:380: warning: (near initialization for 'sregs')
main.c:381: error: unknown field 'ds' specified in initializer
main.c:381: error: extra brace group at end of initializer
main.c:381: error: (near initialization for 'sregs')
main.c:381: warning: excess elements in struct initializer
main.c:381: warning: (near initialization for 'sregs')
main.c:382: error: unknown field 'es' specified in initializer
main.c:382: error: extra brace group at end of initializer
main.c:382: error: (near initialization for 'sregs')
main.c:382: warning: excess elements in struct initializer
main.c:382: warning: (near initialization for 'sregs')
main.c:383: error: unknown field 'fs' specified in initializer
main.c:383: error: extra brace group at end of initializer
main.c:383: error: (near initialization for 'sregs')
main.c:383: warning: excess elements in struct initializer
main.c:383: warning: (near initialization for 'sregs')
main.c:384: error: unknown field 'gs' specified in initializer
main.c:384: error: extra brace group at end of initializer
main.c:384: error: (near initialization for 'sregs')
main.c:384: warning: excess elements in struct initializer
main.c:384: warning: (near initialization for 'sregs')
main.c:385: error: unknown field 'ss' specified in initializer
main.c:385: error: extra brace group at end of initializer
main.c:385: error: (near initialization for 'sregs')
main.c:385: warning: excess elements in struct initializer
main.c:385: warning: (near initialization for 'sregs')
main.c:387: error: unknown field 'tr' specified in initializer
main.c:387: error: extra brace group at end of initializer
main.c:387: error: (near initialization for 'sregs')
main.c:387: warning: excess elements in struct initializer
main.c:387: warning: (near initialization for 'sregs')
main.c:388: error: unknown field 'ldt' specified in initializer
main.c:388: error: extra brace group at end of initializer
main.c:388: error: (near initialization for 'sregs')
main.c:388: warning: excess elements in struct initializer
main.c:388: warning: (near initialization for 'sregs')
main.c:389: error: unknown field 'gdt' specified in initializer
main.c:389: error: extra brace group at end of initializer
main.c:389: error: (near initialization for 'sregs')
main.c:389: warning: excess elements in struct initializer
main.c:389: warning: (near initialization for 'sregs')
main.c:390: error: unknown field 'idt' specified in initializer
main.c:390: error: extra brace group at end of initializer
main.c:390: error: (near initialization for 'sregs')
main.c:390: warning: excess elements in struct initializer
main.c:390: warning: (near initialization for 'sregs')
main.c:391: error: unknown field 'cr0' specified in initializer
main.c:391: warning: excess elements in struct initializer
main.c:391: warning: (near initialization for 'sregs')
main.c:392: error: unknown field 'cr3' specified in initializer
main.c:392: warning: excess elements in struct initializer
main.c:392: warning: (near initialization for 'sregs')
main.c:393: error: unknown field 'cr4' specified in initializer
main.c:393: warning: excess elements in struct initializer
main.c:393: warning: (near initialization for 'sregs')
main.c:394: error: unknown field 'efer' specified in initializer
main.c:394: warning: excess elements in struct initializer
main.c:394: warning: (near initialization for 'sregs')
main.c:395: error: unknown field 'apic_base' specified in initializer
main.c:395: warning: excess elements in struct initializer
main.c:395: warning: (near initialization for 'sregs')
main.c:396: error: unknown field 'interrupt_bitmap' specified in initializer
main.c:396: error: extra brace group at end of initializer
main.c:396: error: (near initialization for 'sregs')
main.c:396: warning: excess elements in struct initializer
main.c:396: warning: (near initialization for 'sregs')
main.c:379: error: storage size of 'sregs' isn't known
main.c:379: warning: unused variable 'sregs'
main.c:374: warning: unused variable 'regs'
main.c: In function 'do_create_vcpu':
main.c:418: error: storage size of 'regs' isn't known
main.c:418: warning: unused variable 'regs'
make[1]: *** [main.o] Error 1
make[1]: Leaving directory `/home/mohd/code/kvm-userspace/user'
make: *** [user] Error 2 |
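The 'unknown field' and 'incomplete type' diagnostics above are what gcc emits when main.c initializes struct kvm_regs / struct kvm_sregs with designated initializers but the include path resolves to a <linux/kvm.h> that lacks the x86 register layout, so the usual suspect is a stale or mismatched header rather than the test code itself. A hypothetical sketch of the kind of initializer the line numbers point at (the constants and the helper name are assumptions, not the actual user/main.c source):

/* sketch only: assumes a <linux/kvm.h> matching kvm.git is on the include path */
#include <linux/kvm.h>

static void enter_32_sketch(void)
{
	struct kvm_regs regs = {
		.rsp    = 0x80000,	/* fails with "unknown field 'rsp'" if the header is stale */
		.rip    = 0x100000,
		.rflags = 0x2,
	};
	struct kvm_sregs sregs = {
		.cs = { .base = 0, .limit = 0xffffffff, .selector = 0x10 },
		/* ds/es/fs/gs/ss, tr, ldt, gdt, idt, cr0/cr3/cr4, efer, apic_base and
		 * interrupt_bitmap follow in the real code, matching main.c:380-396 */
	};
	(void)regs;
	(void)sregs;
}

If every one of those designated initializers is rejected, it is worth checking which kvm.h the --with-patched-kernel configure run actually picked up.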
From: Avi K. <av...@qu...> - 2008-04-27 15:39:30
|
Linus, please pull the kvm updates for 2.6.26 from the repo and branch git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git kvm-updates-2.6.26 The changes include new host architectures (s390, ia64, and ppc 44x), paravirtualized kernel support, support for the new AMD NPT and Intel VPID hardware virtualization extensions, large host page support, scalability improvements, a PIT model, a performance tracing system (kvmtrace), as well as the usual guest support improvement and minor speedups. Note that a few changes to s390 and ppc arch code are included; these have been acked by the respective maintainers. Al Viro (1): KVM: kill file->f_count abuse in kvm Alexander Graf (1): KVM: Implement dummy values for MSR_PERF_STATUS Amit Shah (1): KVM: Add stat counter for hypercalls Andrea Arcangeli (1): KVM: Disable pagefaults during copy_from_user_inatomic() Anthony Liguori (1): KVM: MMU: Don't assume struct page for x86 Avi Kivity (33): KVM: x86 emulator: add support for group decoding KVM: x86 emulator: group decoding for group 1A KVM: x86 emulator: Group decoding for group 3 KVM: x86 emulator: Group decoding for groups 4 and 5 KVM: x86 emulator: add group 7 decoding KVM: Only x86 has pio KVM: x86 emulator: group decoding for group 1 instructions KVM: MMU: Decouple mmio from shadow page tables KVM: Limit vcpu mmap size to one page on non-x86 KVM: Add API to retrieve the number of supported vcpus per vm KVM: Increase vcpu count to 16 KVM: Add API for determining the number of supported memory slots KVM: Increase the number of user memory slots per vm KVM: Use x86's segment descriptor struct instead of private definition KVM: Prefix control register accessors with kvm_ to avoid namespace pollution KVM: VMX: Don't adjust tsc offset forward KVM: Remove pointless desc_ptr #ifdef KVM: Provide unlocked version of emulator_write_phys() KVM: MMU: Set the accessed bit on non-speculative shadow ptes KVM: Move some x86 specific constants and structures to include/asm-x86 KVM: MMU: Introduce and use spte_to_page() KVM: no longer EXPERIMENTAL KVM: VMX: Add module option to disable flexpriority KVM: Free apic access page on vm destruction KVM: MMU: Only mark_page_accessed() if the page was accessed by the guest KVM: Register ioctl range KVM: s390: Stub out kvmtrace KVM: ia64: Stub out kvmtrace KVM: Rename VCPU_MP_STATE_* to KVM_MP_STATE_* KVM: SVM: force a new asid when initializing the vmcb KVM: x86 emulator: initialize src.val and dst.val for register operands KVM: x86 emulator: fix smsw and lmsw with a memory operand KVM: x86 emulator: fix lea to really get the effective address Carsten Otte (4): s390: KVM preparation: provide hook to enable pgstes in user pagetable KVM: s390: interrupt subsystem, cpu timer, waitpsw KVM: s390: API documentation s390: KVM guest: detect when running on kvm Christian Borntraeger (10): KVM: kvm.h: __user requires compiler.h s390: KVM preparation: host memory management changes for s390 kvm s390: KVM preparation: address of the 64bit extint parm in lowcore KVM: s390: sie intercept handling KVM: s390: intercepts for privileged instructions KVM: s390: interprocessor communication via sigp KVM: s390: intercepts for diagnose instructions KVM: s390: add kvm to kconfig on s390 KVM: s390: update maintainers s390: KVM guest: virtio device support, and kvm hypercalls Dong, Eddie (2): KVM: MMU: Update shadow ptes on partial guest pte writes KVM: MMU: Simplify hash table indexing Feng (Eric) Liu (1): KVM: Add trace markers Feng(Eric) Liu (1): KVM: Add kvm trace userspace interface 
Glauber Costa (3): x86: allow machine_crash_shutdown to be replaced x86: make native_machine_shutdown non-static x86: KVM guest: disable clock before rebooting. Glauber de Oliveira Costa (2): KVM: paravirtualized clocksource: host part x86: KVM guest: paravirtualized clocksource Harvey Harrison (7): KVM: x86 emulator: add ad_mask static inline KVM: x86 emulator: make register_address, address_mask static inlines KVM: x86 emulator: make register_address_increment and JMP_REL static inlines KVM: x86 emulator: fix sparse warnings in x86_emulate.c KVM: SVM: make iopm_base static KVM: sparse fixes for kvm/x86.c KVM: replace remaining __FUNCTION__ occurances Heiko Carstens (4): KVM: s390: arch backend for the kvm kernel module KVM: s390: Fix incorrect return value KVM: s390: rename stfl to kvm_stfl KVM: s390: Improve pgste accesses Hollis Blanchard (6): KVM: Use CONFIG_PREEMPT_NOTIFIERS around struct preempt_notifier KVM: Rename debugfs_dir to kvm_debugfs_dir ppc: Export tlb_44x_hwater for KVM KVM: ppc: Add DCR access information to struct kvm_run KVM: Add MAINTAINERS entry for PowerPC KVM KVM: ppc: PowerPC 440 KVM implementation Izik Eidus (5): KVM: MMU: fix dirty bit setting when removing write permissions KVM: x86: add functions to get the cpl of vcpu KVM: x86: hardware task switching support KVM: MMU: allow the vm to shrink the kvm mmu shadow caches KVM: add vm refcounting Jan Engelhardt (1): KVM: constify function pointer tables Joerg Roedel (27): KVM: make EFER_RESERVED_BITS configurable for architecture code KVM: align valid EFER bits with the features of the host system KVM: VMX: unifdef the EFER specific code KVM: allow access to EFER in 32bit KVM KVM: SVM: move feature detection to hardware setup code KVM: SVM: add detection of Nested Paging feature KVM: SVM: add module parameter to disable Nested Paging KVM: export information about NPT to generic x86 code KVM: MMU: make the __nonpaging_map function generic KVM: export the load_pdptrs() function to modules KVM: MMU: add TDP support to the KVM MMU KVM: SVM: add support for Nested Paging KVM: SVM: let init_vmcb() take struct vcpu_svm as parameter KVM: SVM: allocate the MSR permission map per VCPU KVM: SVM: enable LBR virtualization KVM: detect if VCPU triple faults KVM: function declaration parameter name cleanup KVM: SVM: indent svm_set_cr4 with tabs instead of spaces KVM: SVM: align shadow CR4.MCE with host KVM: SVM: add intercept for machine check exception KVM: SVM: do not intercept task switch with NPT KVM: SVM: sync TPR value to V_TPR field in the VMCB KVM: export kvm_lapic_set_tpr() to modules KVM: SVM: sync V_TPR with LAPIC.TPR if CR8 write intercept is disabled KVM: SVM: disable CR8 intercept when tpr is not masking interrupts KVM: SVM: remove now obsolete FIXME comment KVM: SVM: remove selective CR0 comment Marcelo Tosatti (13): KVM: MMU: ignore zapped root pagetables KVM: MMU: large page support KVM: add basic paravirt support x86: KVM guest: add basic paravirt support KVM: MMU: hypercall based pte updates and TLB flushes x86: KVM guest: hypercall based pte updates and TLB flushes x86: KVM guest: hypercall batching KVM: MMU: unify slots_lock usage KVM: MMU: prepopulate guest pages after write-protecting KVM: hlt emulation should take in-kernel APIC/PIT timers into account KVM: add ioctls to save/store mpstate KVM: fix kvm_vcpu_kick vs __vcpu_run race KVM: MMU: kvm_pv_mmu_op should not take mmap_sem Ryan Harper (1): KVM: VMX: fix typo in VMX header define Sheng Yang (5): KVM: VMX: Enable Virtual Processor Identification (VPID) 
KVM: In kernel PIT model KVM: Add save/restore supporting of in kernel PIT KVM: Add reset support for in kernel PIT KVM: VMX: Enable MSR Bitmap feature Xiantao Zhang (17): KVM: Use kzalloc to avoid allocating kvm_regs from kernel stack KVM: ia64: Prepare some structure and routines for kvm use KVM: ia64: Add header files for kvm/ia64 KVM: ia64: Add kvm arch-specific core code for kvm/ia64 KVM: ia64: Add header files for kvm/ia64 KVM: ia64: VMM module interfaces KVM: ia64: Add TLB virtulization support KVM: ia64: Add interruption vector table for vmm KVM: ia64: Add mmio decoder for kvm/ia64 KVM: ia64: Add trampoline for guest/host mode switch KVM: ia64: Add processor virtulization support KVM: ia64: Add optimization for some virtulization faults KVM: ia64: Generate offset values for assembly code use KVM: ia64: Add guest interruption injection support KVM: ia64: Add kvm sal/pal virtulization support KVM: ia64: Enable kvm build for ia64 KVM: ia64: Add a guide about how to create kvm guests on ia64 Documentation/ia64/kvm.txt | 82 ++ Documentation/ioctl-number.txt | 2 + Documentation/powerpc/kvm_440.txt | 41 + Documentation/s390/kvm.txt | 125 ++ MAINTAINERS | 17 + arch/ia64/Kconfig | 3 + arch/ia64/Makefile | 1 + arch/ia64/kvm/Kconfig | 49 + arch/ia64/kvm/Makefile | 61 + arch/ia64/kvm/asm-offsets.c | 251 ++++ arch/ia64/kvm/kvm-ia64.c | 1806 +++++++++++++++++++++++++++++ arch/ia64/kvm/kvm_fw.c | 500 ++++++++ arch/ia64/kvm/kvm_minstate.h | 273 +++++ arch/ia64/kvm/lapic.h | 25 + arch/ia64/kvm/misc.h | 93 ++ arch/ia64/kvm/mmio.c | 341 ++++++ arch/ia64/kvm/optvfault.S | 918 +++++++++++++++ arch/ia64/kvm/process.c | 970 ++++++++++++++++ arch/ia64/kvm/trampoline.S | 1038 +++++++++++++++++ arch/ia64/kvm/vcpu.c | 2163 +++++++++++++++++++++++++++++++++++ arch/ia64/kvm/vcpu.h | 740 ++++++++++++ arch/ia64/kvm/vmm.c | 66 ++ arch/ia64/kvm/vmm_ivt.S | 1424 +++++++++++++++++++++++ arch/ia64/kvm/vti.h | 290 +++++ arch/ia64/kvm/vtlb.c | 636 ++++++++++ arch/powerpc/Kconfig | 1 + arch/powerpc/Kconfig.debug | 3 + arch/powerpc/Makefile | 1 + arch/powerpc/kernel/asm-offsets.c | 28 + arch/powerpc/kvm/44x_tlb.c | 224 ++++ arch/powerpc/kvm/44x_tlb.h | 91 ++ arch/powerpc/kvm/Kconfig | 42 + arch/powerpc/kvm/Makefile | 15 + arch/powerpc/kvm/booke_guest.c | 615 ++++++++++ arch/powerpc/kvm/booke_host.c | 83 ++ arch/powerpc/kvm/booke_interrupts.S | 436 +++++++ arch/powerpc/kvm/emulate.c | 760 ++++++++++++ arch/powerpc/kvm/powerpc.c | 436 +++++++ arch/s390/Kconfig | 14 + arch/s390/Makefile | 2 +- arch/s390/kernel/early.c | 4 + arch/s390/kernel/setup.c | 14 +- arch/s390/kernel/vtime.c | 1 + arch/s390/kvm/Kconfig | 46 + arch/s390/kvm/Makefile | 14 + arch/s390/kvm/diag.c | 67 ++ arch/s390/kvm/gaccess.h | 274 +++++ arch/s390/kvm/intercept.c | 216 ++++ arch/s390/kvm/interrupt.c | 592 ++++++++++ arch/s390/kvm/kvm-s390.c | 685 +++++++++++ arch/s390/kvm/kvm-s390.h | 64 + arch/s390/kvm/priv.c | 323 ++++++ arch/s390/kvm/sie64a.S | 47 + arch/s390/kvm/sigp.c | 288 +++++ arch/s390/mm/pgtable.c | 65 +- arch/x86/Kconfig | 19 + arch/x86/kernel/Makefile | 2 + arch/x86/kernel/crash.c | 3 +- arch/x86/kernel/kvm.c | 248 ++++ arch/x86/kernel/kvmclock.c | 187 +++ arch/x86/kernel/reboot.c | 13 +- arch/x86/kernel/setup_32.c | 6 + arch/x86/kernel/setup_64.c | 7 + arch/x86/kvm/Kconfig | 13 +- arch/x86/kvm/Makefile | 6 +- arch/x86/kvm/i8254.c | 611 ++++++++++ arch/x86/kvm/i8254.h | 63 + arch/x86/kvm/irq.c | 18 + arch/x86/kvm/irq.h | 3 + arch/x86/kvm/kvm_svm.h | 2 + arch/x86/kvm/lapic.c | 35 +- arch/x86/kvm/mmu.c | 672 +++++++++-- arch/x86/kvm/mmu.h | 6 
+ arch/x86/kvm/paging_tmpl.h | 86 +- arch/x86/kvm/segment_descriptor.h | 29 - arch/x86/kvm/svm.c | 352 +++++-- arch/x86/kvm/svm.h | 3 + arch/x86/kvm/tss.h | 59 + arch/x86/kvm/vmx.c | 278 ++++- arch/x86/kvm/vmx.h | 10 +- arch/x86/kvm/x86.c | 897 +++++++++++++-- arch/x86/kvm/x86_emulate.c | 285 +++-- drivers/s390/Makefile | 2 +- drivers/s390/kvm/Makefile | 9 + drivers/s390/kvm/kvm_virtio.c | 338 ++++++ include/asm-ia64/gcc_intrin.h | 12 + include/asm-ia64/kvm.h | 205 ++++- include/asm-ia64/kvm_host.h | 524 +++++++++ include/asm-ia64/kvm_para.h | 29 + include/asm-ia64/processor.h | 63 + include/asm-powerpc/kvm.h | 53 +- include/asm-powerpc/kvm_asm.h | 55 + include/asm-powerpc/kvm_host.h | 152 +++ include/asm-powerpc/kvm_para.h | 37 + include/asm-powerpc/kvm_ppc.h | 88 ++ include/asm-powerpc/mmu-44x.h | 2 + include/asm-s390/Kbuild | 1 + include/asm-s390/kvm.h | 41 +- include/asm-s390/kvm_host.h | 234 ++++ include/asm-s390/kvm_para.h | 150 +++ include/asm-s390/kvm_virtio.h | 53 + include/asm-s390/lowcore.h | 15 +- include/asm-s390/mmu.h | 1 + include/asm-s390/mmu_context.h | 8 +- include/asm-s390/pgtable.h | 93 ++- include/asm-s390/setup.h | 1 + include/asm-x86/kvm.h | 41 + include/asm-x86/kvm_host.h | 99 ++- include/asm-x86/kvm_para.h | 55 + include/asm-x86/reboot.h | 2 + include/linux/kvm.h | 130 +++- include/linux/kvm_host.h | 59 +- include/linux/kvm_para.h | 11 +- include/linux/kvm_types.h | 2 + include/linux/sched.h | 2 + kernel/fork.c | 2 +- mm/rmap.c | 7 +- virt/kvm/kvm_main.c | 230 ++++- virt/kvm/kvm_trace.c | 276 +++++ 119 files changed, 23723 insertions(+), 638 deletions(-) create mode 100644 Documentation/ia64/kvm.txt create mode 100644 Documentation/powerpc/kvm_440.txt create mode 100644 Documentation/s390/kvm.txt create mode 100644 arch/ia64/kvm/Kconfig create mode 100644 arch/ia64/kvm/Makefile create mode 100644 arch/ia64/kvm/asm-offsets.c create mode 100644 arch/ia64/kvm/kvm-ia64.c create mode 100644 arch/ia64/kvm/kvm_fw.c create mode 100644 arch/ia64/kvm/kvm_minstate.h create mode 100644 arch/ia64/kvm/lapic.h create mode 100644 arch/ia64/kvm/misc.h create mode 100644 arch/ia64/kvm/mmio.c create mode 100644 arch/ia64/kvm/optvfault.S create mode 100644 arch/ia64/kvm/process.c create mode 100644 arch/ia64/kvm/trampoline.S create mode 100644 arch/ia64/kvm/vcpu.c create mode 100644 arch/ia64/kvm/vcpu.h create mode 100644 arch/ia64/kvm/vmm.c create mode 100644 arch/ia64/kvm/vmm_ivt.S create mode 100644 arch/ia64/kvm/vti.h create mode 100644 arch/ia64/kvm/vtlb.c create mode 100644 arch/powerpc/kvm/44x_tlb.c create mode 100644 arch/powerpc/kvm/44x_tlb.h create mode 100644 arch/powerpc/kvm/Kconfig create mode 100644 arch/powerpc/kvm/Makefile create mode 100644 arch/powerpc/kvm/booke_guest.c create mode 100644 arch/powerpc/kvm/booke_host.c create mode 100644 arch/powerpc/kvm/booke_interrupts.S create mode 100644 arch/powerpc/kvm/emulate.c create mode 100644 arch/powerpc/kvm/powerpc.c create mode 100644 arch/s390/kvm/Kconfig create mode 100644 arch/s390/kvm/Makefile create mode 100644 arch/s390/kvm/diag.c create mode 100644 arch/s390/kvm/gaccess.h create mode 100644 arch/s390/kvm/intercept.c create mode 100644 arch/s390/kvm/interrupt.c create mode 100644 arch/s390/kvm/kvm-s390.c create mode 100644 arch/s390/kvm/kvm-s390.h create mode 100644 arch/s390/kvm/priv.c create mode 100644 arch/s390/kvm/sie64a.S create mode 100644 arch/s390/kvm/sigp.c create mode 100644 arch/x86/kernel/kvm.c create mode 100644 arch/x86/kernel/kvmclock.c create mode 100644 arch/x86/kvm/i8254.c create mode 100644 
arch/x86/kvm/i8254.h delete mode 100644 arch/x86/kvm/segment_descriptor.h create mode 100644 arch/x86/kvm/tss.h create mode 100644 drivers/s390/kvm/Makefile create mode 100644 drivers/s390/kvm/kvm_virtio.c create mode 100644 include/asm-ia64/kvm_host.h create mode 100644 include/asm-ia64/kvm_para.h create mode 100644 include/asm-powerpc/kvm_asm.h create mode 100644 include/asm-powerpc/kvm_host.h create mode 100644 include/asm-powerpc/kvm_para.h create mode 100644 include/asm-powerpc/kvm_ppc.h create mode 100644 include/asm-s390/kvm_host.h create mode 100644 include/asm-s390/kvm_para.h create mode 100644 include/asm-s390/kvm_virtio.h create mode 100644 virt/kvm/kvm_trace.c |
From: Avi K. <av...@qu...> - 2008-04-27 15:15:55
|
The preliminary KVM Forum 2008 agenda at http://kforum.qumranet.com/KVMForum/agenda.php has been updated, with a few more new presentations added. Be sure to check them out. A few more sessions are still in the pipeline; the page will be updated in a few days. If you haven't done so already, please register! There is a big "Register Now" button on the page. Early bird registration ends May 1. [Speakers: if we've misspelled your name or if you'd like to change your slot, let us know] -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. |
From: SourceForge.net <no...@so...> - 2008-04-27 14:02:47
|
Bugs item #1952730, was opened at 2008-04-27 15:02
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1952730&group_id=180599
Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Gordon Beck (gordon_beck)
Assigned to: Nobody/Anonymous (nobody)
Summary: keysym lookups for East European languages

Initial Comment:
I am assuming this site is for kvm-60-3.fc8.src.rpm. The Hungarian keyboard - along with others - has several keys which provide three characters rather than the standard two, e.g. n provides n, N and }. The third symbol is acquired using a meta or Level3 key. Gnome allows the mapping of any of the control/alt/windows keys for this purpose. However, the mapping is to ISO_Level3_Shift, which is missing from vnc_keysym.h. Additionally, Hungarian requires the use of odoubleacute and udoubleacute (latin2 extensions), both of which are missing.

./qemu/vnc_keysym.h
205a206,211
> /* latin 2 extensions */
> { "Odoubleacute", 0x1d5},
> { "odoubleacute", 0x1f5},
> { "Udoubleacute", 0x1db},
> { "udoubleacute", 0x1fb},
>
217a224
> {"ISO_Level3_Shift", 0xfe03}, /* ISO_Level3 */

The use of these requires additional changes to ./qemu/keymaps/modifiers and ./qemu/keymaps/hu ... but those can easily be changed locally, whereas the above require code changes and a recompile. I will enter the proposed config changes as a separate bug. I guess this is for any version for Linux, but specifically Fedora 8 i386. If this should instead be a bug against qemu please say.

----------------------------------------------------------------------
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1952730&group_id=180599 |
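For readers unfamiliar with vnc_keysym.h: the entries proposed above live in a simple name-to-keysym table that the VNC keyboard code consults when parsing keymap files, and a name that is absent from the table is silently dropped. A rough sketch of that lookup (the struct and function names here are assumptions for illustration, not necessarily QEMU's):

/* illustration only: a name -> X11 keysym table like vnc_keysym.h */
#include <string.h>
#include <stddef.h>

typedef struct {
	const char *name;
	int keysym;
} name2keysym_t;

static const name2keysym_t name2keysym[] = {
	/* latin 2 extensions proposed in this report */
	{ "odoubleacute",     0x1f5 },
	{ "udoubleacute",     0x1fb },
	{ "ISO_Level3_Shift", 0xfe03 },
	{ NULL, 0 },
};

static int lookup_keysym(const char *name)
{
	const name2keysym_t *p;

	for (p = name2keysym; p->name != NULL; p++)
		if (strcmp(p->name, name) == 0)
			return p->keysym;
	return 0;	/* unknown name: the keymap entry is ignored, which is the reported bug */
}

Once the two latin2 codepoints and ISO_Level3_Shift are present in the table, the hu keymap and the modifiers file can reference them without further source changes.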
From: Avi K. <av...@qu...> - 2008-04-27 13:45:39
|
Avi Kivity wrote: > Michal Ludvig wrote: >> Hi, >> >> I've experienced a kernel Oops on 2.6.24 with kvm 66 on AMD in 64bit >> mode while starting up WinXP: >> >> kvm: emulating exchange as write >> Unable to handle kernel NULL pointer dereference at 0000000000000000 >> RIP: >> [<ffffffff883a7a5a>] :kvm:x86_emulate_insn+0x3fa/0x4240 >> PGD 7658d067 PUD 242a6067 PMD 0 >> Oops: 0002 [1] SMP >> CPU 0 >> Modules linked in: bridge llc reiserfs tun kvm_amd kvm nfs nfsd lockd >> nfs_acl auth_rpcgss sunrpc exportfs w83627ehf hwmon_vid autofs4 >> snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device iptable_filter >> ip_tables ip6table_filter ip6_tables x_tables af_packet ipv6 fuse >> ext2 loop snd_hda_intel snd_pcm snd_timer k8temp i2c_nforce2 i2c_core >> hwmon sr_mod button cdrom snd soundcore snd_page_alloc forcedeth sg >> floppy linear sd_mod ehci_hcd ohci_hcd usbcore dm_snapshot edd dm_mod >> fan sata_nv pata_amd libata scsi_mod thermal processor >> Pid: 3139, comm: qemu-system-x86 Not tainted 2.6.24-mludvig #1 >> RIP: 0010:[<ffffffff883a7a5a>] [<ffffffff883a7a5a>] >> :kvm:x86_emulate_insn+0x3fa/0x4240 >> RSP: 0018:ffff8100609fdc18 EFLAGS: 00010246 >> RAX: 000000008001003b RBX: 0000000000000000 RCX: 0000000000000000 >> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8100609fe000 >> RBP: ffff8100609ff320 R08: ffff8100609ff3c0 R09: 0000000000000006 >> R10: 0000000000000002 R11: 0000000000000000 R12: ffff8100609ff368 >> R13: ffff8100609ff3c0 R14: ffffffff883be600 R15: 0000000001971353 >> FS: 0000000040804950(0063) GS:ffffffff8053e000(0000) >> knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >> CR2: 0000000000000000 CR3: 000000006bb74000 CR4: 00000000000006e0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 >> Process qemu-system-x86 (pid: 3139, threadinfo ffff8100609fc000, task >> ffff8100794d5680) >> Stack: 00000000609fdc74 0000000012187318 000000000ea5f068 >> ffff8100609ff3c0 >> ffff8100609fdc94 ffffffff8839c9e0 0000000000000000 ffff8100609fe000 >> ffff8100609ff320 0000000000000000 0000000000000000 0000000000000000 >> Call Trace: >> [<ffffffff8839c9e0>] :kvm:kvm_get_cs_db_l_bits+0x20/0x40 >> [<ffffffff8839dd2f>] :kvm:emulate_instruction+0x1bf/0x340 >> [<ffffffff883c1a22>] :kvm_amd:emulate_on_interception+0x12/0x60 >> [<ffffffff883a11d9>] :kvm:kvm_arch_vcpu_ioctl_run+0x169/0x6d0 >> [<ffffffff8839c14c>] :kvm:kvm_vcpu_ioctl+0x41c/0x440 >> [<ffffffff802305f3>] __wake_up+0x43/0x70 >> [<ffffffff803374c1>] __up_read+0x21/0xb0 >> [<ffffffff802586ec>] futex_wake+0xcc/0xf0 >> [<ffffffff80259559>] do_futex+0x129/0xbb0 >> [<ffffffff8022e7bd>] __dequeue_entity+0x3d/0x50 >> [<ffffffff8839b925>] :kvm:kvm_vm_ioctl+0x85/0x200 >> [<ffffffff802ab10f>] do_ioctl+0x2f/0xa0 >> [<ffffffff802ab3a0>] vfs_ioctl+0x220/0x2d0 >> [<ffffffff802ab4e1>] sys_ioctl+0x91/0xb0 >> [<ffffffff8020bcae>] system_call+0x7e/0x83 >> >> >> Code: 66 89 02 e9 ee fc ff ff 48 8b 95 88 00 00 00 48 8b 45 78 88 >> RIP [<ffffffff883a7a5a>] :kvm:x86_emulate_insn+0x3fa/0x4240 >> RSP <ffff8100609fdc18> >> CR2: 0000000000000000 >> ---[ end trace d358bab3f035112e ]--- >> >> The host is still alive but the XP guest is locked up in a boot screen. >> >> > > Please mail me (privately) an 'objdump -Sr x86_emulate.o' from the > kernel directory, so we can see where this happened. > Ok. Please try with the attached patch and let us know. Also repeat without the patch, so we can be sure it is easily reproducible. 
-- error compiling committee.c: too many arguments to function |
From: Avi K. <av...@qu...> - 2008-04-27 13:00:40
|
Jerone Young wrote:
> This is a relic of the big userspace refactoring, but today libkvm should not include settings from the test suite. This patch resolves this and removes the overwriting of settings from the main config.mak with test suite settings.

applied, thanks.

-- error compiling committee.c: too many arguments to function |
From: Avi K. <av...@qu...> - 2008-04-27 12:58:47
|
Gerd Hoffmann wrote:
> Hi folks,
>
> My first attempt to send out a patch series with git ...
>
> The patches fix the kvm paravirt clocksource code to be compatible with
> xen and they also factor out some code which can be shared into separate
> source files used by both kvm and xen.

The patches look good, but please copy Jeremy and virtualization@ for patches which touch things outside kvm. It's perhaps better to reverse the order: first fix kvm to be compatible, then merge the Xen and kvm implementations into a single one.

-- error compiling committee.c: too many arguments to function |
From: Avi K. <av...@qu...> - 2008-04-27 12:53:06
|
Michal Ludvig wrote: > Hi, > > I've experienced a kernel Oops on 2.6.24 with kvm 66 on AMD in 64bit > mode while starting up WinXP: > > kvm: emulating exchange as write > Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: > [<ffffffff883a7a5a>] :kvm:x86_emulate_insn+0x3fa/0x4240 > PGD 7658d067 PUD 242a6067 PMD 0 > Oops: 0002 [1] SMP > CPU 0 > Modules linked in: bridge llc reiserfs tun kvm_amd kvm nfs nfsd lockd > nfs_acl auth_rpcgss sunrpc exportfs w83627ehf hwmon_vid autofs4 > snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device iptable_filter > ip_tables ip6table_filter ip6_tables x_tables af_packet ipv6 fuse ext2 > loop snd_hda_intel snd_pcm snd_timer k8temp i2c_nforce2 i2c_core hwmon > sr_mod button cdrom snd soundcore snd_page_alloc forcedeth sg floppy > linear sd_mod ehci_hcd ohci_hcd usbcore dm_snapshot edd dm_mod fan > sata_nv pata_amd libata scsi_mod thermal processor > Pid: 3139, comm: qemu-system-x86 Not tainted 2.6.24-mludvig #1 > RIP: 0010:[<ffffffff883a7a5a>] [<ffffffff883a7a5a>] > :kvm:x86_emulate_insn+0x3fa/0x4240 > RSP: 0018:ffff8100609fdc18 EFLAGS: 00010246 > RAX: 000000008001003b RBX: 0000000000000000 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8100609fe000 > RBP: ffff8100609ff320 R08: ffff8100609ff3c0 R09: 0000000000000006 > R10: 0000000000000002 R11: 0000000000000000 R12: ffff8100609ff368 > R13: ffff8100609ff3c0 R14: ffffffff883be600 R15: 0000000001971353 > FS: 0000000040804950(0063) GS:ffffffff8053e000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 0000000000000000 CR3: 000000006bb74000 CR4: 00000000000006e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process qemu-system-x86 (pid: 3139, threadinfo ffff8100609fc000, task > ffff8100794d5680) > Stack: 00000000609fdc74 0000000012187318 000000000ea5f068 ffff8100609ff3c0 > ffff8100609fdc94 ffffffff8839c9e0 0000000000000000 ffff8100609fe000 > ffff8100609ff320 0000000000000000 0000000000000000 0000000000000000 > Call Trace: > [<ffffffff8839c9e0>] :kvm:kvm_get_cs_db_l_bits+0x20/0x40 > [<ffffffff8839dd2f>] :kvm:emulate_instruction+0x1bf/0x340 > [<ffffffff883c1a22>] :kvm_amd:emulate_on_interception+0x12/0x60 > [<ffffffff883a11d9>] :kvm:kvm_arch_vcpu_ioctl_run+0x169/0x6d0 > [<ffffffff8839c14c>] :kvm:kvm_vcpu_ioctl+0x41c/0x440 > [<ffffffff802305f3>] __wake_up+0x43/0x70 > [<ffffffff803374c1>] __up_read+0x21/0xb0 > [<ffffffff802586ec>] futex_wake+0xcc/0xf0 > [<ffffffff80259559>] do_futex+0x129/0xbb0 > [<ffffffff8022e7bd>] __dequeue_entity+0x3d/0x50 > [<ffffffff8839b925>] :kvm:kvm_vm_ioctl+0x85/0x200 > [<ffffffff802ab10f>] do_ioctl+0x2f/0xa0 > [<ffffffff802ab3a0>] vfs_ioctl+0x220/0x2d0 > [<ffffffff802ab4e1>] sys_ioctl+0x91/0xb0 > [<ffffffff8020bcae>] system_call+0x7e/0x83 > > > Code: 66 89 02 e9 ee fc ff ff 48 8b 95 88 00 00 00 48 8b 45 78 88 > RIP [<ffffffff883a7a5a>] :kvm:x86_emulate_insn+0x3fa/0x4240 > RSP <ffff8100609fdc18> > CR2: 0000000000000000 > ---[ end trace d358bab3f035112e ]--- > > The host is still alive but the XP guest is locked up in a boot screen. > > Please mail me (privately) an 'objdump -Sr x86_emulate.o' from the kernel directory, so we can see where this happened. -- error compiling committee.c: too many arguments to function |
From: Avi K. <av...@qu...> - 2008-04-27 12:50:25
|
Marcelo Tosatti wrote:
> Valgrind caught this:
>
> ==11754== Conditional jump or move depends on uninitialised value(s)
> ==11754==    at 0x50C9BC: kvm_create_pit (libkvm-x86.c:153)
> ==11754==    by 0x50CA7F: kvm_arch_create (libkvm-x86.c:178)
> ==11754==    by 0x50AB31: kvm_create (libkvm.c:383)
> ==11754==    by 0x4EE691: kvm_qemu_create_context (qemu-kvm.c:616)
> ==11754==    by 0x412031: main (vl.c:9653)

Applied, thanks. Isn't valgrind great?

-- error compiling committee.c: too many arguments to function |
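The report says a branch inside kvm_create_pit() depends on memory that was never written. A minimal hypothetical sketch of that class of bug and the usual shape of the fix (the struct and field names are illustrative, not the actual libkvm code):

#include <stdlib.h>
#include <string.h>

struct kvm_context_sketch {
	int pit_in_kernel;	/* hypothetical flag */
};

static int kvm_create_pit_sketch(struct kvm_context_sketch *kvm)
{
	if (kvm->pit_in_kernel)	/* the kind of branch valgrind flags when the field was never set */
		return 0;
	return -1;
}

int main(void)
{
	struct kvm_context_sketch *kvm = malloc(sizeof(*kvm));	/* malloc leaves the flag uninitialised */

	if (!kvm)
		return 1;
	memset(kvm, 0, sizeof(*kvm));	/* usual fix: zero-init (or calloc) before the first test */
	kvm_create_pit_sketch(kvm);
	free(kvm);
	return 0;
}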
From: Andrea A. <an...@qu...> - 2008-04-27 12:27:24
|
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
> the first four sets. The fifth is the oversubscription test which trips
> my xpmem bug. This is as good as the v12 runs from before.

Now that mmu-notifier-core #v14 seems finished and hopefully will appear in 2.6.26 ;), I started exercising the kvm-mmu-notifier code more with the full patchset applied, and not only with mmu-notifier-core. I soon found the full patchset has a swap deadlock bug. Then I tried without using kvm (so with the mmu notifier disarmed) and I could still reproduce the crashes. After grabbing a few stack traces I tracked it down to a bug in the i_mmap_lock->i_mmap_sem conversion.

If your oversubscription test means swapping, you should retest with this applied on top of the #v14 i_mmap_sem patch; without it the test would eventually deadlock with all tasks allocating memory in D state. Now the full patchset is as rock solid as with only mmu-notifier-core applied. It's been swapping a 2G memhog on top of a 3G VM with 2G of ram for the last few hours without a problem. Everything is working great with KVM at least.

Talking about post 2.6.26: the refcount with rcu in the anon-vma conversion seems unnecessary and may explain part of the AIM slowdown too. The rest looks ok and probably we should switch the code to a compile-time decision between rwlock and rwsem (so obsoleting the current spinlock).

diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1008,7 +1008,7 @@ static int try_to_unmap_file(struct page
 	list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
 						shared.vm_set.list)
 		vma->vm_private_data = NULL;
 out:
-	up_write(&mapping->i_mmap_sem);
+	up_read(&mapping->i_mmap_sem);
 	return ret;
 }
|
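The one-liner above pairs the release with the way the lock was taken: try_to_unmap_file() presumably holds i_mmap_sem for read, and releasing it with up_write() corrupts the semaphore state, so a later writer (and everyone queued behind it) blocks forever. A generic sketch of the imbalance, not the actual rmap code:

#include <linux/rwsem.h>

static DECLARE_RWSEM(i_mmap_sem_sketch);

static void buggy_path(void)
{
	down_read(&i_mmap_sem_sketch);
	/* ... walk the nonlinear vma list ... */
	up_write(&i_mmap_sem_sketch);	/* BUG: releases a write lock that was never taken;
					 * the rwsem count is now wrong, a later down_write()
					 * can sleep forever, and every task that then needs
					 * the lock ends up stuck in D state */
}

static void fixed_path(void)
{
	down_read(&i_mmap_sem_sketch);
	/* ... walk the nonlinear vma list ... */
	up_read(&i_mmap_sem_sketch);	/* matches the down_read() above */
}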
From: Avi K. <av...@qu...> - 2008-04-27 09:47:15
|
Yang, Sheng wrote:
> From 592b7855a88266fa19505f0d51fe12ec0eadfa62 Mon Sep 17 00:00:00 2001
> From: Sheng Yang <she...@in...>
> Date: Fri, 25 Apr 2008 22:14:06 +0800
> Subject: [PATCH 8/8] KVM: VMX: Enable EPT feature for KVM
>

Scratch my earlier __direct_map comment.

> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 3dbedf1..7a8640a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1177,8 +1177,15 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
>  			return -ENOMEM;
>  		}
>
> -		table[index] = __pa(new_table->spt) | PT_PRESENT_MASK
> -			| PT_WRITABLE_MASK | shadow_user_mask;
> +		if (shadow_user_mask)
> +			table[index] = __pa(new_table->spt)
> +				| PT_PRESENT_MASK | PT_WRITABLE_MASK
> +				| shadow_user_mask;
> +		else
> +			table[index] = __pa(new_table->spt)
> +				| PT_PRESENT_MASK | PT_WRITABLE_MASK
> +				| shadow_x_mask;

Why not do, unconditionally:

+		table[index] = __pa(new_table->spt) | PT_PRESENT_MASK
+			| PT_WRITABLE_MASK | shadow_user_mask | shadow_x_mask;

?

-- error compiling committee.c: too many arguments to function |
From: Avi K. <av...@qu...> - 2008-04-27 09:40:39
|
Avi Kivity wrote:
> I propose moving the kvm lists to vger.kernel.org, for the following
> benefits:
>
> - better spam control
> - faster service (I see significant lag with the sourceforge lists)
> - no ads appended to the end of each email
>
> If no objections are raised, and if the vger postmasters agree, I will
> mass subscribe the current subscribers so that there will be no
> service interruption.

Since no objections were raised, we'll start to get this rolling.

-- error compiling committee.c: too many arguments to function |
From: Avi K. <av...@qu...> - 2008-04-27 09:38:20
|
Yang, Sheng wrote:
> From 239f38236196c2321989c64d7c61ff28490b3f00 Mon Sep 17 00:00:00 2001
> From: Sheng Yang <she...@in...>
> Date: Fri, 25 Apr 2008 21:13:50 +0800
> Subject: [PATCH 4/8] KVM: MMU: Add EPT support
>
> Enable kvm_set_spte() to generate EPT entries.
>
> @@ -1155,7 +1178,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
>  	}
>
>  	table[index] = __pa(new_table->spt) | PT_PRESENT_MASK
> -		| PT_WRITABLE_MASK | PT_USER_MASK;
> +		| PT_WRITABLE_MASK | shadow_user_mask;
>  }

Shouldn't we have shadow_x_mask here as well?

[and for non-EPT, shadow_accessed_mask and shadow_dirty_mask, but that's a different story].

-- error compiling committee.c: too many arguments to function |
From: Andrea A. <an...@qu...> - 2008-04-27 03:05:22
|
On Sat, Apr 26, 2008 at 08:54:23PM -0500, Anthony Liguori wrote: > Avi can correct me if I'm wrong, but I don't think the consensus of that > discussion was that we're going to avoid putting mmio pages in the rmap. My first impression on that discussion was that pci-passthrough mmio can't be swapped, can't require write throttling etc.. ;). From a linux VM pagetable point of view rmap on mmio looks weird. However thinking some more, it's not like in the linux kernel where write protect through rmap is needed only for write-throttling MAP_SHARED which clearly is strictly RAM, for sptes we need it for every cr3 touch too to trap pagetable updates (think ioremap done by guest kernel). So I think Avi's take that we need rmap for everything mapped by sptes is probably the only feasible way to go. > Practically speaking, replacing: > > + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> > PAGE_SHIFT); > + get_page(page); > > > With: > > unsigned long pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; > kvm_get_pfn(pfn); > > Results in exactly the same code except the later allows mmio pfns in the > rmap. So ignoring the whole mmio thing, using accessors that are already > there and used elsewhere seems like a good idea :-) Agreed especially at the light of the above. I didn't actually touch that function for a while (I clearly wrote it before we started moving the kvm mmu code from page to pfn), and it was still safe to use to test the locking of the mmu notifier methods. My current main focus in the last few days was to get the locking right against the last mmu notifier code #v14 ;). Now that I look into it more closely, the get_page/put_page are unnecessary by now (it was necessary with the older patches that didn't implement range_begin and that relied on page pinning). Not just in that function, but all reference counting inside kvm is now entirely useless and can be removed. NOTE: it is safe to flush the tlb outside the mmu_lock if done inside the mmu_notifier methods. But only mmu notifiers can defer the tlb flush after releasing mmu_lock because the page can't be freed by the VM until we return. All other kvm code must instead definitely flush the tlb inside the mmu_lock, otherwise when the mmu notifier code runs, it will see the spte nonpresent and so the mmu notifier code will do nothing (it will not wait kvm to drop the mmu_lock before allowing the main linux VM to free the page). The tlb flush must happen before the page is freed, and doing it inside mmu_lock everywhere (except in mmu-notifier contex where it can be done after releasing mmu_lock) guarantees it. The positive side of the tradeoff of having to do the tlb flush inside the mmu_lock, is that KVM can now safely zap and unmap as many sptes at it wants and do a single tlb flush at the end. The pages can't be freed as long as the mmu_lock is hold (this is why the tlb flush has to be done inside the mmu_lock). This model reduces heavily the tlb flush frequency for large spte-mangling, and tlb flushes here are quite expensive because of ipis. > I appreciate the desire to minimize changes, but taking a lock on return > seems to take that to a bit of an extreme. It seems like a simple thing to > fix though, no? I agree it needs to be rewritten as a cleaner fix but probably in a separate patch (which has to be incremental as that code will reject on the mmu notifier patch). I didn't see as a big issue however to apply my quick fix first and cleanup with an incremental update. > I see. 
It seems a little strange to me as a KVM guest isn't really tied to > the current mm. It seems like the net effect of this is that we are now > tying a KVM guest to an mm. > > For instance, if you create a guest, but didn't assign any memory to it, > you could transfer the fd to another process and then close the fd (without > destroying the guest). The other process then could assign memory to it > and presumably run the guest. Passing the anon kvm vm fd through unix sockets to another task is exactly why we need things like ->release not dependent on fd->release vma->vm_file->release ordering in the do_exit path to teardown the VM. The guest itself is definitely tied to a "mm", the guest runs using get_user_pages and get_user_pages is meaningless without an mm. But the fd where we run the ioctl isn't tied to the mm, it's just an fd that can be passed across tasks with unix sockets. > With your change, as soon as the first process exits, the guest will be > destroyed. I'm not sure this behavioral difference really matters but it > is a behavioral difference. The guest-mode of the cpu, can't run safely on any task but the one with the "mm" tracked by the mmu notifiers and where the memory is allocated from. The sptes points to the memory allocated in that "mm". It's definitely memory-corrupting to leave any spte established when the last thread of that "mm" exists as the memory supposedly pointed by the orphaned sptes will go immediately in the freelist and reused by the kernel. Keep in mind that there's no page pin on the memory pointed by the sptes. The ioctl of the qemu userland could run in any other task with a mm different than the one of the guest and ->release allows this to work fine without memory corruption and without requiring page pinning. As far a I can tell your example explains why we need this fix ;). Here an updated patch that passes my swap test (the only missing thing is the out_lock cleanup). Signed-off-by: Andrea Arcangeli <an...@qu...> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 8d45fab..ce3251c 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 2ad6f54..330eaed 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -663,6 +663,101 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) account_shadowed(kvm, gfn); } +static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp) +{ + u64 *spte, *curr_spte; + int need_tlb_flush = 0; + + spte = rmap_next(kvm, rmapp, NULL); + while (spte) { + BUG_ON(!(*spte & PT_PRESENT_MASK)); + rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte); + curr_spte = spte; + spte = rmap_next(kvm, rmapp, spte); + rmap_remove(kvm, curr_spte); + set_shadow_pte(curr_spte, shadow_trap_nonpresent_pte); + need_tlb_flush = 1; + } + return need_tlb_flush; +} + +int kvm_unmap_hva(struct kvm *kvm, unsigned long hva) +{ + int i; + int need_tlb_flush = 0; + + /* + * If mmap_sem isn't taken, we can look the memslots with only + * the mmu_lock by skipping over the slots with userspace_addr == 0. 
+ */ + for (i = 0; i < kvm->nmemslots; i++) { + struct kvm_memory_slot *memslot = &kvm->memslots[i]; + unsigned long start = memslot->userspace_addr; + unsigned long end; + + /* mmu_lock protects userspace_addr */ + if (!start) + continue; + + end = start + (memslot->npages << PAGE_SHIFT); + if (hva >= start && hva < end) { + gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT; + need_tlb_flush |= kvm_unmap_rmapp(kvm, + &memslot->rmap[gfn_offset]); + } + } + + return need_tlb_flush; +} + +static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp) +{ + u64 *spte; + int young = 0; + + spte = rmap_next(kvm, rmapp, NULL); + while (spte) { + int _young; + u64 _spte = *spte; + BUG_ON(!(_spte & PT_PRESENT_MASK)); + _young = _spte & PT_ACCESSED_MASK; + if (_young) { + young = !!_young; + set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK); + } + spte = rmap_next(kvm, rmapp, spte); + } + return young; +} + +int kvm_age_hva(struct kvm *kvm, unsigned long hva) +{ + int i; + int young = 0; + + /* + * If mmap_sem isn't taken, we can look the memslots with only + * the mmu_lock by skipping over the slots with userspace_addr == 0. + */ + for (i = 0; i < kvm->nmemslots; i++) { + struct kvm_memory_slot *memslot = &kvm->memslots[i]; + unsigned long start = memslot->userspace_addr; + unsigned long end; + + /* mmu_lock protects userspace_addr */ + if (!start) + continue; + + end = start + (memslot->npages << PAGE_SHIFT); + if (hva >= start && hva < end) { + gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT; + young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]); + } + } + + return young; +} + #ifdef MMU_DEBUG static int is_empty_shadow_page(u64 *spt) { @@ -1200,6 +1295,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) int r; int largepage = 0; pfn_t pfn; + int mmu_seq; down_read(¤t->mm->mmap_sem); if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) { @@ -1207,6 +1303,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) largepage = 1; } + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); + /* implicit mb(), we'll read before PT lock is unlocked */ pfn = gfn_to_pfn(vcpu->kvm, gfn); up_read(¤t->mm->mmap_sem); @@ -1217,6 +1315,11 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) } spin_lock(&vcpu->kvm->mmu_lock); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count))) + goto out_unlock; + smp_rmb(); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq)) + goto out_unlock; kvm_mmu_free_some_pages(vcpu); r = __direct_map(vcpu, v, write, largepage, gfn, pfn, PT32E_ROOT_LEVEL); @@ -1224,6 +1327,11 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) return r; + +out_unlock: + spin_unlock(&vcpu->kvm->mmu_lock); + kvm_release_pfn_clean(pfn); + return 0; } @@ -1355,6 +1463,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, int r; int largepage = 0; gfn_t gfn = gpa >> PAGE_SHIFT; + int mmu_seq; ASSERT(vcpu); ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa)); @@ -1368,6 +1477,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, gfn &= ~(KVM_PAGES_PER_HPAGE-1); largepage = 1; } + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); + /* implicit mb(), we'll read before PT lock is unlocked */ pfn = gfn_to_pfn(vcpu->kvm, gfn); up_read(¤t->mm->mmap_sem); if (is_error_pfn(pfn)) { @@ -1375,12 +1486,22 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, return 1; } spin_lock(&vcpu->kvm->mmu_lock); + if 
(unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count))) + goto out_unlock; + smp_rmb(); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq)) + goto out_unlock; kvm_mmu_free_some_pages(vcpu); r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK, largepage, gfn, pfn, TDP_ROOT_LEVEL); spin_unlock(&vcpu->kvm->mmu_lock); return r; + +out_unlock: + spin_unlock(&vcpu->kvm->mmu_lock); + kvm_release_pfn_clean(pfn); + return 0; } static void nonpaging_free(struct kvm_vcpu *vcpu) @@ -1643,11 +1764,11 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, int r; u64 gpte = 0; pfn_t pfn; - - vcpu->arch.update_pte.largepage = 0; + int mmu_seq; + int largepage; if (bytes != 4 && bytes != 8) - return; + goto out_lock; /* * Assume that the pte write on a page table of the same type @@ -1660,7 +1781,7 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, if ((bytes == 4) && (gpa % 4 == 0)) { r = kvm_read_guest(vcpu->kvm, gpa & ~(u64)7, &gpte, 8); if (r) - return; + goto out_lock; memcpy((void *)&gpte + (gpa % 8), new, 4); } else if ((bytes == 8) && (gpa % 8 == 0)) { memcpy((void *)&gpte, new, 8); @@ -1670,23 +1791,35 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, memcpy((void *)&gpte, new, 4); } if (!is_present_pte(gpte)) - return; + goto out_lock; gfn = (gpte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; + largepage = 0; down_read(¤t->mm->mmap_sem); if (is_large_pte(gpte) && is_largepage_backed(vcpu, gfn)) { gfn &= ~(KVM_PAGES_PER_HPAGE-1); - vcpu->arch.update_pte.largepage = 1; + largepage = 1; } + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); + /* implicit mb(), we'll read before PT lock is unlocked */ pfn = gfn_to_pfn(vcpu->kvm, gfn); up_read(¤t->mm->mmap_sem); - if (is_error_pfn(pfn)) { - kvm_release_pfn_clean(pfn); - return; - } + if (is_error_pfn(pfn)) + goto out_release_and_lock; + + spin_lock(&vcpu->kvm->mmu_lock); + BUG_ON(!is_error_pfn(vcpu->arch.update_pte.pfn)); vcpu->arch.update_pte.gfn = gfn; vcpu->arch.update_pte.pfn = pfn; + vcpu->arch.update_pte.largepage = largepage; + vcpu->arch.update_pte.mmu_seq = mmu_seq; + return; + +out_release_and_lock: + kvm_release_pfn_clean(pfn); +out_lock: + spin_lock(&vcpu->kvm->mmu_lock); } void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, @@ -1711,7 +1844,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes); - spin_lock(&vcpu->kvm->mmu_lock); kvm_mmu_free_some_pages(vcpu); ++vcpu->kvm->stat.mmu_pte_write; kvm_mmu_audit(vcpu, "pre pte write"); @@ -1790,11 +1922,11 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, } } kvm_mmu_audit(vcpu, "post pte write"); - spin_unlock(&vcpu->kvm->mmu_lock); if (!is_error_pfn(vcpu->arch.update_pte.pfn)) { kvm_release_pfn_clean(vcpu->arch.update_pte.pfn); vcpu->arch.update_pte.pfn = bad_pfn; } + spin_unlock(&vcpu->kvm->mmu_lock); } int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 156fe10..4ac73a6 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -263,6 +263,12 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page, pfn = vcpu->arch.update_pte.pfn; if (is_error_pfn(pfn)) return; + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count))) + return; + smp_rmb(); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != + 
vcpu->arch.update_pte.mmu_seq)) + return; kvm_get_pfn(pfn); mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0, gpte & PT_DIRTY_MASK, NULL, largepage, gpte_to_gfn(gpte), @@ -380,6 +386,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, int r; pfn_t pfn; int largepage = 0; + int mmu_seq; pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code); kvm_mmu_audit(vcpu, "pre page fault"); @@ -413,6 +420,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, largepage = 1; } } + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); + /* implicit mb(), we'll read before PT lock is unlocked */ pfn = gfn_to_pfn(vcpu->kvm, walker.gfn); up_read(¤t->mm->mmap_sem); @@ -424,6 +433,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, } spin_lock(&vcpu->kvm->mmu_lock); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count))) + goto out_unlock; + smp_rmb(); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq)) + goto out_unlock; kvm_mmu_free_some_pages(vcpu); shadow_pte = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault, largepage, &write_pt, pfn); @@ -439,6 +453,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, spin_unlock(&vcpu->kvm->mmu_lock); return write_pt; + +out_unlock: + spin_unlock(&vcpu->kvm->mmu_lock); + kvm_release_pfn_clean(pfn); + return 0; } static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0ce5563..a026cb7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -27,6 +27,7 @@ #include <linux/module.h> #include <linux/mman.h> #include <linux/highmem.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/msr.h> @@ -3859,15 +3860,173 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) free_page((unsigned long)vcpu->arch.pio_data); } +static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) +{ + struct kvm_arch *kvm_arch; + kvm_arch = container_of(mn, struct kvm_arch, mmu_notifier); + return container_of(kvm_arch, struct kvm, arch); +} + +static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + int need_tlb_flush; + + /* + * When ->invalidate_page runs, the linux pte has been zapped + * already but the page is still allocated until + * ->invalidate_page returns. So if we increase the sequence + * here the kvm page fault will notice if the spte can't be + * established because the page is going to be freed. If + * instead the kvm page fault establishes the spte before + * ->invalidate_page runs, kvm_unmap_hva will release it + * before returning. + + * No need of memory barriers as the sequence increase only + * need to be seen at spin_unlock time, and not at spin_lock + * time. + * + * Increasing the sequence after the spin_unlock would be + * unsafe because the kvm page fault could then establish the + * pte after kvm_unmap_hva returned, without noticing the page + * is going to be freed. 
+ */ + atomic_inc(&kvm->arch.mmu_notifier_seq); + spin_lock(&kvm->mmu_lock); + need_tlb_flush = kvm_unmap_hva(kvm, address); + spin_unlock(&kvm->mmu_lock); + + /* we've to flush the tlb before the pages can be freed */ + if (need_tlb_flush) + kvm_flush_remote_tlbs(kvm); + +} + +static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + int need_tlb_flush = 0; + + /* + * The count increase must become visible at unlock time as no + * spte can be established without taking the mmu_lock and + * count is also read inside the mmu_lock critical section. + */ + atomic_inc(&kvm->arch.mmu_notifier_count); + + spin_lock(&kvm->mmu_lock); + for (; start < end; start += PAGE_SIZE) + need_tlb_flush |= kvm_unmap_hva(kvm, start); + spin_unlock(&kvm->mmu_lock); + + /* we've to flush the tlb before the pages can be freed */ + if (need_tlb_flush) + kvm_flush_remote_tlbs(kvm); +} + +static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + /* + * + * This sequence increase will notify the kvm page fault that + * the page that is going to be mapped in the spte could have + * been freed. + * + * There's also an implicit mb() here in this comment, + * provided by the last PT lock taken to zap pagetables, and + * that the read side has to take too in follow_page(). The + * sequence increase in the worst case will become visible to + * the kvm page fault after the spin_lock of the last PT lock + * of the last PT-lock-protected critical section preceeding + * invalidate_range_end. So if the kvm page fault is about to + * establish the spte inside the mmu_lock, while we're freeing + * the pages, it will have to backoff and when it retries, it + * will have to take the PT lock before it can check the + * pagetables again. And after taking the PT lock it will + * re-establish the pte even if it will see the already + * increased sequence number before calling gfn_to_pfn. + */ + atomic_inc(&kvm->arch.mmu_notifier_seq); + /* + * The sequence increase must be visible before count + * decrease. The page fault has to read count before sequence + * for this write order to be effective. 
+ */ + wmb(); + atomic_dec(&kvm->arch.mmu_notifier_count); + BUG_ON(atomic_read(&kvm->arch.mmu_notifier_count) < 0); +} + +static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + int young; + + spin_lock(&kvm->mmu_lock); + young = kvm_age_hva(kvm, address); + spin_unlock(&kvm->mmu_lock); + + if (young) + kvm_flush_remote_tlbs(kvm); + + return young; +} + +static void kvm_free_vcpus(struct kvm *kvm); +/* This must zap all the sptes because all pages will be freed then */ +static void kvm_mmu_notifier_release(struct mmu_notifier *mn, + struct mm_struct *mm) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + BUG_ON(mm != kvm->mm); + + kvm_destroy_common_vm(kvm); + + kvm_free_pit(kvm); + kfree(kvm->arch.vpic); + kfree(kvm->arch.vioapic); + kvm_free_vcpus(kvm); + kvm_free_physmem(kvm); + if (kvm->arch.apic_access_page) + put_page(kvm->arch.apic_access_page); +} + +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { + .release = kvm_mmu_notifier_release, + .invalidate_page = kvm_mmu_notifier_invalidate_page, + .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start, + .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end, + .clear_flush_young = kvm_mmu_notifier_clear_flush_young, +}; + struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } @@ -3899,13 +4058,12 @@ static void kvm_free_vcpus(struct kvm *kvm) void kvm_arch_destroy_vm(struct kvm *kvm) { - kvm_free_pit(kvm); - kfree(kvm->arch.vpic); - kfree(kvm->arch.vioapic); - kvm_free_vcpus(kvm); - kvm_free_physmem(kvm); - if (kvm->arch.apic_access_page) - put_page(kvm->arch.apic_access_page); + /* + * kvm_mmu_notifier_release() will be called before + * mmu_notifier_unregister returns, if it didn't run + * already. 
+ */ + mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm); kfree(kvm); } diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index 9d963cd..7b8deea 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -13,6 +13,7 @@ #include <linux/types.h> #include <linux/mm.h> +#include <linux/mmu_notifier.h> #include <linux/kvm.h> #include <linux/kvm_para.h> @@ -247,6 +248,7 @@ struct kvm_vcpu_arch { gfn_t gfn; /* presumed gfn during guest pte update */ pfn_t pfn; /* pfn corresponding to that gfn */ int largepage; + int mmu_seq; } update_pte; struct i387_fxsave_struct host_fx_image; @@ -314,6 +316,10 @@ struct kvm_arch{ struct page *apic_access_page; gpa_t wall_clock; + + struct mmu_notifier mmu_notifier; + atomic_t mmu_notifier_seq; + atomic_t mmu_notifier_count; }; struct kvm_vm_stat { @@ -434,6 +440,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu); int kvm_mmu_setup(struct kvm_vcpu *vcpu); void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); +int kvm_unmap_hva(struct kvm *kvm, unsigned long hva); +int kvm_age_hva(struct kvm *kvm, unsigned long hva); int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); void kvm_mmu_zap_all(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 4e16682..f089edc 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -267,6 +267,7 @@ void kvm_arch_check_processor_compat(void *rtn); int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); void kvm_free_physmem(struct kvm *kvm); +void kvm_destroy_common_vm(struct kvm *kvm); struct kvm *kvm_arch_create_vm(void); void kvm_arch_destroy_vm(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f095b73..4beae7a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -231,15 +231,19 @@ void kvm_free_physmem(struct kvm *kvm) kvm_free_physmem_slot(&kvm->memslots[i], NULL); } -static void kvm_destroy_vm(struct kvm *kvm) +void kvm_destroy_common_vm(struct kvm *kvm) { - struct mm_struct *mm = kvm->mm; - spin_lock(&kvm_lock); list_del(&kvm->vm_list); spin_unlock(&kvm_lock); kvm_io_bus_destroy(&kvm->pio_bus); kvm_io_bus_destroy(&kvm->mmio_bus); +} + +static void kvm_destroy_vm(struct kvm *kvm) +{ + struct mm_struct *mm = kvm->mm; + kvm_arch_destroy_vm(kvm); mmdrop(mm); } |
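For readers trying to follow the ordering argument in the comment blocks above, here is a schematic condensation of the protocol rather than literal kernel code: mmu_notifier_seq, mmu_notifier_count, mmu_lock and the pfn helpers are the names from the patch, but the two functions below are reduced to their synchronization skeleton.

/* Invalidate side (mmu notifier callbacks), reduced to the ordering that
 * matters: count goes up before the sptes are zapped; seq goes up and
 * count back down only once the range has been fully invalidated. */
static void invalidate_side(struct kvm *kvm, unsigned long start,
                            unsigned long end)
{
        atomic_inc(&kvm->arch.mmu_notifier_count);  /* visible at unlock time */
        spin_lock(&kvm->mmu_lock);
        /* ... zap the sptes covering [start, end) ... */
        spin_unlock(&kvm->mmu_lock);

        /* ... the caller frees the pages here ... */

        atomic_inc(&kvm->arch.mmu_notifier_seq);    /* force faults to retry */
        wmb();                                      /* seq before count, as above */
        atomic_dec(&kvm->arch.mmu_notifier_count);
}

/* Page-fault side: sample seq before gfn_to_pfn(), then re-check both
 * counters under mmu_lock before establishing the spte. */
static int fault_side(struct kvm_vcpu *vcpu, gfn_t gfn)
{
        int mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq);
        pfn_t pfn = gfn_to_pfn(vcpu->kvm, gfn);     /* implicit mb() via PT lock */

        spin_lock(&vcpu->kvm->mmu_lock);
        if (atomic_read(&vcpu->kvm->arch.mmu_notifier_count))
                goto out_unlock;                    /* an invalidate is in flight */
        smp_rmb();                                  /* count before seq, pairs with wmb() */
        if (atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq)
                goto out_unlock;                    /* pfn may point at a freed page */
        /* ... safe to establish the spte for pfn here ... */
        spin_unlock(&vcpu->kvm->mmu_lock);
        return 1;

out_unlock:
        spin_unlock(&vcpu->kvm->mmu_lock);
        kvm_release_pfn_clean(pfn);
        return 0;                                   /* back off; the fault is retried */
}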
From: Anthony L. <an...@co...> - 2008-04-27 01:54:23
Andrea Arcangeli wrote: > On Sat, Apr 26, 2008 at 01:59:23PM -0500, Anthony Liguori wrote: > >>> +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte) >>> +{ >>> + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> >>> PAGE_SHIFT); >>> + get_page(page); >>> >>> >> You should not assume a struct page exists for any given spte. Instead, use >> kvm_get_pfn() and kvm_release_pfn_clean(). >> > > Last email from muli@ibm in my inbox argues it's useless to build rmap > on mmio regions, so the above is more efficient so put_page runs > directly on the page without going back and forth between spte -> pfn > -> page -> pfn -> page in a single function. > Avi can correct me if I'm wrong, but I don't think the consensus of that discussion was that we're going to avoid putting mmio pages in the rmap. Practically speaking, replacing: + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); + get_page(page); With: unsigned long pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; kvm_get_pfn(pfn); Results in exactly the same code except the later allows mmio pfns in the rmap. So ignoring the whole mmio thing, using accessors that are already there and used elsewhere seems like a good idea :-) > Certainly if we start building rmap on mmio regions we'll have to > change that. > > >> Perhaps I just have a weak stomach but I am uneasy having a function that >> takes a lock on exit. I walked through the logic and it doesn't appear to >> be wrong but it also is pretty clear that you could defer the acquisition >> of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the >> update_pte assignment into kvm_mmu_pte_write. >> > > I agree out_lock is an uncommon exit path, the problem is that the > code was buggy, and I tried to fix it with the smallest possible > change and that resulting in an out_lock. That section likely need a > refactoring, all those update_pte fields should be at least returned > by the function guess_.... but I tried to reduce the changes to make > the issue more readable, I didn't want to rewrite certain functions > just to take a spinlock a few instructions ahead. > I appreciate the desire to minimize changes, but taking a lock on return seems to take that to a bit of an extreme. It seems like a simple thing to fix though, no? >> Why move the destruction of the vm to the MMU notifier unregister hook? >> Does anything else ever call mmu_notifier_unregister that would implicitly >> destroy the VM? >> > > mmu notifier ->release can run at anytime before the filehandle is > closed. ->release has to zap all sptes and freeze the mmu (hence all > vcpus) to prevent any further page fault. After ->release returns all > pages are freed (we'll never relay on the page pin to avoid the > rmap_remove put_page to be a relevant unpin event). So the idea is > that I wanted to maintain the same ordering of the current code in the > vm destroy event, I didn't want to leave a partially shutdown VM on > the vmlist. If the ordering is entirely irrelevant and the > kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then > I can avoid changes to kvm_main.c but I doubt. > > I've done it in a way that archs not needing mmu notifiers like s390 > can simply add the kvm_destroy_common_vm at the top of their > kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke > kvm_destroy_common_vm in the ->release of the mmu notifiers. 
> > This will ensure that everything will be ok regardless if exit_mmap is > called before/after exit_files, and it won't make a whole lot of > difference anymore, if the driver fd is pinned through vmas->vm_file > released in exit_mmap or through the task filedescriptors relased in > exit_files etc... Infact this allows to call mmu_notifier_unregister > at anytime later after the task has already been killed, without any > trouble (like if the mmu notifier owner isn't registering in > current->mm but some other tasks mm). > I see. It seems a little strange to me as a KVM guest isn't really tied to the current mm. It seems like the net effect of this is that we are now tying a KVM guest to an mm. For instance, if you create a guest, but didn't assign any memory to it, you could transfer the fd to another process and then close the fd (without destroying the guest). The other process then could assign memory to it and presumably run the guest. With your change, as soon as the first process exits, the guest will be destroyed. I'm not sure this behavioral difference really matters but it is a behavioral difference. Regards, Anthony Liguori > ------------------------------------------------------------------------- > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference > Don't miss this year's exciting event. There's still time to save $100. > Use priority code J8TL2D2. > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone > _______________________________________________ > kvm-devel mailing list > kvm...@li... > https://lists.sourceforge.net/lists/listinfo/kvm-devel > |
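To make the suggestion concrete: kvm_get_pfn() and kvm_release_pfn_clean() are the existing accessors being referred to, and the body below is the kvm_unmap_spte() posted earlier in the thread. The rewrite is a sketch of the proposed replacement, not part of any posted patch.

/* kvm_unmap_spte() reworked along the lines suggested above: hold the
 * reference by pfn rather than struct page, so mmio pfns without a
 * struct page can live in the rmap too. */
static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
{
        pfn_t pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;

        kvm_get_pfn(pfn);                       /* pin across the zap */
        rmap_remove(kvm, spte);
        set_shadow_pte(spte, shadow_trap_nonpresent_pte);
        kvm_flush_remote_tlbs(kvm);
        kvm_release_pfn_clean(pfn);             /* drop the temporary reference */
}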
From: Andrea A. <an...@qu...> - 2008-04-27 00:20:19
On Sat, Apr 26, 2008 at 01:59:23PM -0500, Anthony Liguori wrote: >> +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte) >> +{ >> + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> >> PAGE_SHIFT); >> + get_page(page); >> > > You should not assume a struct page exists for any given spte. Instead, use > kvm_get_pfn() and kvm_release_pfn_clean(). Last email from muli@ibm in my inbox argues it's useless to build rmap on mmio regions, so the above is more efficient so put_page runs directly on the page without going back and forth between spte -> pfn -> page -> pfn -> page in a single function. Certainly if we start building rmap on mmio regions we'll have to change that. > Perhaps I just have a weak stomach but I am uneasy having a function that > takes a lock on exit. I walked through the logic and it doesn't appear to > be wrong but it also is pretty clear that you could defer the acquisition > of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the > update_pte assignment into kvm_mmu_pte_write. I agree out_lock is an uncommon exit path, the problem is that the code was buggy, and I tried to fix it with the smallest possible change and that resulting in an out_lock. That section likely need a refactoring, all those update_pte fields should be at least returned by the function guess_.... but I tried to reduce the changes to make the issue more readable, I didn't want to rewrite certain functions just to take a spinlock a few instructions ahead. > Worst case, you pass 4 more pointer arguments here and, take the spin lock, > and then depending on the result of mmu_guess_page_from_pte_write, update > vcpu->arch.update_pte. Yes that was my same idea, but that's left for a later patch. Fixing this bug mixed with the mmu notifier patch was perhaps excessive already ;). > Why move the destruction of the vm to the MMU notifier unregister hook? > Does anything else ever call mmu_notifier_unregister that would implicitly > destroy the VM? mmu notifier ->release can run at anytime before the filehandle is closed. ->release has to zap all sptes and freeze the mmu (hence all vcpus) to prevent any further page fault. After ->release returns all pages are freed (we'll never relay on the page pin to avoid the rmap_remove put_page to be a relevant unpin event). So the idea is that I wanted to maintain the same ordering of the current code in the vm destroy event, I didn't want to leave a partially shutdown VM on the vmlist. If the ordering is entirely irrelevant and the kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then I can avoid changes to kvm_main.c but I doubt. I've done it in a way that archs not needing mmu notifiers like s390 can simply add the kvm_destroy_common_vm at the top of their kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke kvm_destroy_common_vm in the ->release of the mmu notifiers. This will ensure that everything will be ok regardless if exit_mmap is called before/after exit_files, and it won't make a whole lot of difference anymore, if the driver fd is pinned through vmas->vm_file released in exit_mmap or through the task filedescriptors relased in exit_files etc... Infact this allows to call mmu_notifier_unregister at anytime later after the task has already been killed, without any trouble (like if the mmu notifier owner isn't registering in current->mm but some other tasks mm). |
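For the arch split described above, a sketch of the s390-style arrangement (kvm_destroy_common_vm() called directly from kvm_arch_destroy_vm(), since there is no ->release notifier). This hunk is illustrative and not part of the posted patches.

/* Hypothetical kvm_arch_destroy_vm() for an arch without mmu notifiers:
 * the common teardown runs first, then whatever the arch frees today. */
void kvm_arch_destroy_vm(struct kvm *kvm)
{
        kvm_destroy_common_vm(kvm);     /* unlink from vm_list, destroy io buses */
        /* ... existing arch-specific teardown (vcpus, physmem, ...) ... */
}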
From: Avi K. <av...@qu...> - 2008-04-26 21:02:05
Mohammed Gamal wrote:
>> Why do that unconditionally, instead of only when in a big-real-mode state?
>
> Is big-real-mode the only state where we have problems?

In general, we need to emulate whenever we are in a VT-unfriendly state, where that is defined as guest state that fails the guest state checks defined by section 22.3.1 of volume 3B of the Intel Software Developer's Manual, "checks on the guest state area", when that state is legal in a real processor.

To date, we have encountered only two instances of such VT-unfriendly states:

- "big real mode", where segment limits are not exactly 0xffff
- protected mode transitions, where cs.rpl != ss.rpl for a brief while

There may well be more, as we remove the various hacks currently in place, and as we expand the envelope to support hybrid 16/32 bit guests like Windows 3.1 and Windows 95.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
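As an illustration of the two cases listed above (not code from the tree): a validity check along these lines could gate an emulation path. kvm_segment, vmx_get_segment() and X86_CR0_PE are existing kvm/x86 names, while guest_state_vt_friendly() itself is hypothetical.

/* Illustrative only: encode the two known VT-unfriendly conditions. */
static bool guest_state_vt_friendly(struct kvm_vcpu *vcpu)
{
        struct kvm_segment cs, ss;

        vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
        vmx_get_segment(vcpu, &ss, VCPU_SREG_SS);

        if (!(vcpu->arch.cr0 & X86_CR0_PE)) {
                /* real mode: VT insists on 0xffff segment limits, which
                 * "big real mode" guests violate */
                if (cs.limit != 0xffff || ss.limit != 0xffff)
                        return false;
        } else {
                /* protected mode transitions: cs.rpl must equal ss.rpl */
                if ((cs.selector & 3) != (ss.selector & 3))
                        return false;
        }
        return true;
}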
From: Anthony L. <ali...@us...> - 2008-04-26 19:42:40
Andrea Arcangeli wrote: > Hello everyone, > > here it is the mmu notifier #v14. > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/ > > Please everyone involved review and (hopefully ;) ack that this is > safe to go in 2.6.26, the most important is to verify that this is a > noop when disarmed regardless of MMU_NOTIFIER=y or =n. > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core > > I'll be sending that patch to Andrew inbox. > > Signed-off-by: Andrea Arcangeli <an...@qu...> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 8d45fab..ce3251c 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -21,6 +21,7 @@ config KVM > tristate "Kernel-based Virtual Machine (KVM) support" > depends on HAVE_KVM > select PREEMPT_NOTIFIERS > + select MMU_NOTIFIER > select ANON_INODES > ---help--- > Support hosting fully virtualized guest machines using hardware > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 2ad6f54..853087a 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -663,6 +663,108 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) > account_shadowed(kvm, gfn); > } > > +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte) > +{ > + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); > + get_page(page); > You should not assume a struct page exists for any given spte. Instead, use kvm_get_pfn() and kvm_release_pfn_clean(). > static void nonpaging_free(struct kvm_vcpu *vcpu) > @@ -1643,11 +1771,11 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, > int r; > u64 gpte = 0; > pfn_t pfn; > - > - vcpu->arch.update_pte.largepage = 0; > + int mmu_seq; > + int largepage; > > if (bytes != 4 && bytes != 8) > - return; > + goto out_lock; > > /* > * Assume that the pte write on a page table of the same type > @@ -1660,7 +1788,7 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, > if ((bytes == 4) && (gpa % 4 == 0)) { > r = kvm_read_guest(vcpu->kvm, gpa & ~(u64)7, &gpte, 8); > if (r) > - return; > + goto out_lock; > memcpy((void *)&gpte + (gpa % 8), new, 4); > } else if ((bytes == 8) && (gpa % 8 == 0)) { > memcpy((void *)&gpte, new, 8); > @@ -1670,23 +1798,35 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, > memcpy((void *)&gpte, new, 4); > } > if (!is_present_pte(gpte)) > - return; > + goto out_lock; > gfn = (gpte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; > > + largepage = 0; > down_read(¤t->mm->mmap_sem); > if (is_large_pte(gpte) && is_largepage_backed(vcpu, gfn)) { > gfn &= ~(KVM_PAGES_PER_HPAGE-1); > - vcpu->arch.update_pte.largepage = 1; > + largepage = 1; > } > + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); > + /* implicit mb(), we'll read before PT lock is unlocked */ > pfn = gfn_to_pfn(vcpu->kvm, gfn); > up_read(¤t->mm->mmap_sem); > > - if (is_error_pfn(pfn)) { > - kvm_release_pfn_clean(pfn); > - return; > - } > + if (is_error_pfn(pfn)) > + goto out_release_and_lock; > + > + spin_lock(&vcpu->kvm->mmu_lock); > + BUG_ON(!is_error_pfn(vcpu->arch.update_pte.pfn)); > vcpu->arch.update_pte.gfn = gfn; > vcpu->arch.update_pte.pfn = pfn; > + vcpu->arch.update_pte.largepage = largepage; > + vcpu->arch.update_pte.mmu_seq = mmu_seq; > + return; > + > +out_release_and_lock: > + kvm_release_pfn_clean(pfn); > +out_lock: > + spin_lock(&vcpu->kvm->mmu_lock); > } > Perhaps I just have a weak stomach but I am uneasy having a 
function that takes a lock on exit. I walked through the logic and it doesn't appear to be wrong but it also is pretty clear that you could defer the acquisition of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the update_pte assignment into kvm_mmu_pte_write. > void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, > @@ -1711,7 +1851,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, > > pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); > mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes); > Worst case, you pass 4 more pointer arguments here and, take the spin lock, and then depending on the result of mmu_guess_page_from_pte_write, update vcpu->arch.update_pte. > @@ -3899,13 +4037,12 @@ static void kvm_free_vcpus(struct kvm *kvm) > > void kvm_arch_destroy_vm(struct kvm *kvm) > { > - kvm_free_pit(kvm); > - kfree(kvm->arch.vpic); > - kfree(kvm->arch.vioapic); > - kvm_free_vcpus(kvm); > - kvm_free_physmem(kvm); > - if (kvm->arch.apic_access_page) > - put_page(kvm->arch.apic_access_page); > + /* > + * kvm_mmu_notifier_release() will be called before > + * mmu_notifier_unregister returns, if it didn't run > + * already. > + */ > + mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm); > kfree(kvm); > } > Why move the destruction of the vm to the MMU notifier unregister hook? Does anything else ever call mmu_notifier_unregister that would implicitly destroy the VM? Regards, Anthony Liguori |
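To make the suggested refactoring concrete, here is one way it could look. The out-parameter signature is an assumption for illustration, and the elided bodies stand for code already in the posted patch.

/* mmu_guess_page_from_pte_write() returns its guesses instead of writing
 * vcpu->arch.update_pte itself, so the caller takes mmu_lock in one
 * obvious place.  Sketch only. */
static int mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
                                         const u8 *new, int bytes,
                                         gfn_t *gfn, pfn_t *pfn,
                                         int *largepage, int *mmu_seq)
{
        /* ... same body as today, but filling *gfn/*pfn/*largepage/*mmu_seq
         * and returning 0 on success, -1 if there is nothing to prefetch ... */
        return -1;
}

void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
                       const u8 *new, int bytes)
{
        gfn_t gfn;
        pfn_t pfn;
        int largepage, mmu_seq;
        int guessed;

        guessed = !mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes,
                                                 &gfn, &pfn, &largepage,
                                                 &mmu_seq);

        spin_lock(&vcpu->kvm->mmu_lock);        /* single, explicit acquisition */
        if (guessed) {
                vcpu->arch.update_pte.gfn = gfn;
                vcpu->arch.update_pte.pfn = pfn;
                vcpu->arch.update_pte.largepage = largepage;
                vcpu->arch.update_pte.mmu_seq = mmu_seq;
        }
        /* ... rest of kvm_mmu_pte_write() unchanged ... */
        spin_unlock(&vcpu->kvm->mmu_lock);
}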
From: Anthony L. <an...@co...> - 2008-04-26 17:56:49
Ulrich Drepper wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Ryan Harper wrote: > >> @@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum) >> >> void kvm_init_new_ap(int cpu, CPUState *env) >> { >> + pthread_mutex_lock(&vcpu_mutex); >> pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env); >> - /* FIXME: wait for thread to spin up */ >> - usleep(200); >> + pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex); >> + pthread_mutex_unlock(&vcpu_mutex); >> > > And something is very wrong here. The pattern for using a condvar is > > 1 take mutex > > 2 check condition > > 3 if condition is not fulfilled > 3a call cond_wait > 3b when returning, go back to step 2 > > 4 unlock mutex > > > Anything else is buggy. > > So, either your condvar use is wrong or you don't really want a condvar > in the first place. I haven't checked the code. > A flag is needed in the vcpu structure that indicates whether the vcpu spun up or not. This is what the while () condition should be. This is needed as the thread may spin up before it gets to the pthread_cond_wait() in which case the signal happens when noone is waiting on it. The other reason a while () is usually needed is that cond_signal may not wake up the right thread so it's necessary to check whether you really have something to do. Not really a problem here but the former race is. A condvar is definitely the right thing to use here. Regards, Anthony Liguori > - -- > ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.7 (GNU/Linux) > > iD8DBQFIE2CD2ijCOnn/RHQRAs4kAJ40kbWjNJAzj2gGdbo/sSxZTx5b0ACglbis > kw7ST4eJK9CXhNbjKphNsUo= > =ISaC > -----END PGP SIGNATURE----- > > ------------------------------------------------------------------------- > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference > Don't miss this year's exciting event. There's still time to save $100. > Use priority code J8TL2D2. > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone > _______________________________________________ > kvm-devel mailing list > kvm...@li... > https://lists.sourceforge.net/lists/listinfo/kvm-devel > |
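Concretely, the pattern being described looks like the following. The names (vcpu_mutex, vcpu_cond, created) are placeholders rather than the actual qemu-kvm fields, and the vcpu setup/run bodies are elided.

#include <pthread.h>

static pthread_mutex_t vcpu_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t vcpu_cond = PTHREAD_COND_INITIALIZER;
static int created;                     /* protected by vcpu_mutex */

static void *ap_main_loop(void *arg)
{
        /* ... per-vcpu setup ... */
        pthread_mutex_lock(&vcpu_mutex);
        created = 1;                     /* record the event under the mutex ... */
        pthread_cond_signal(&vcpu_cond); /* ... so the wakeup can never be lost */
        pthread_mutex_unlock(&vcpu_mutex);
        /* ... run the vcpu ... */
        return NULL;
}

void kvm_init_new_ap(pthread_t *thread, void *env) /* simplified signature */
{
        pthread_mutex_lock(&vcpu_mutex);
        created = 0;
        pthread_create(thread, NULL, ap_main_loop, env);
        while (!created)                 /* re-check: spurious wakeups, early signal */
                pthread_cond_wait(&vcpu_cond, &vcpu_mutex);
        pthread_mutex_unlock(&vcpu_mutex);
}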
From: Ulrich D. <dr...@re...> - 2008-04-26 17:04:16
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Ryan Harper wrote: > @@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum) > > void kvm_init_new_ap(int cpu, CPUState *env) > { > + pthread_mutex_lock(&vcpu_mutex); > pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env); > - /* FIXME: wait for thread to spin up */ > - usleep(200); > + pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex); > + pthread_mutex_unlock(&vcpu_mutex); And something is very wrong here. The pattern for using a condvar is 1 take mutex 2 check condition 3 if condition is not fulfilled 3a call cond_wait 3b when returning, go back to step 2 4 unlock mutex Anything else is buggy. So, either your condvar use is wrong or you don't really want a condvar in the first place. I haven't checked the code. - -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) iD8DBQFIE2CD2ijCOnn/RHQRAs4kAJ40kbWjNJAzj2gGdbo/sSxZTx5b0ACglbis kw7ST4eJK9CXhNbjKphNsUo= =ISaC -----END PGP SIGNATURE----- |
From: Ulrich D. <dr...@re...> - 2008-04-26 16:59:28
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Ryan Harper wrote: > + /* block until cond_wait occurs */ > + pthread_mutex_lock(&vcpu_mutex); > + /* now we can signal */ > + pthread_cond_signal(&qemu_vcpuup_cond); > + pthread_mutex_unlock(&vcpu_mutex); It is not necessary to take the mutex for a condvar if you're just waking a waiter. This just unnecessarily slows things down. - -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (GNU/Linux) iD8DBQFIE19D2ijCOnn/RHQRAvLqAJ9j+7ICPHpB/gCthCelDgYn5poGMgCfaZVy +tE5zOuxqlBMUR7Fgufw/wY= =5j4s -----END PGP SIGNATURE----- |
From: LINkeR <li...@in...> - 2008-04-26 16:56:12
I resend this, because It's could be helpful for another users. >From David Mair <dm...@ma...>: I saw your message on the kvm-devel list and I think I can advise you. I don't yet have a functioning subscription to the list so replying there is not simple for me. If this works out for you please post a message to the kvm-devel list with the solution you use (no need to mention me). To be honest, the problem isn't really a kvm or qemu one, it's an Ethernet issue I think. One solution is to create a bridge interface on the host, add your public facing NIC to the bridge, create a tap interface and attach it to the same bridge then use that tap interface for the guest you want to assign a public IP to. Assuming Linux: * You'll need a bridge-utils package or you'll need to build it * Assuming your host's public interface is eth0, then as root: # brctl addbr br0 # brctl addif br0 eth0 # tunctl -u <owner> -t qtap0 # brctl addif br0 qtap0 Replace eth0 with the real interface name for your host's public facing interface. Replace <owner> with the username that will be starting the guest. Replace qtap0 with a name you prefer to use for the tap interface. Now you can start the guest with the following -net option: qemu <other guest argument> -net [vlan=n,]ifname=qtap0,script=no Replace qtap0 with whatever you used for the tunctl command above. If you are using a vlan other than zero for the guest NIC then include the vlan=n option replacing n with the vlan number. You can use script=yes and modify /etc/qemu-ifup to perform the bridge setup command above but I prefer to use my own and setup the bridge and tap at init time then just have the guest use the preset bridge. You can now just use the relevant static IP address as-is on the guest because the guest's NIC is on the public facing cabling. Good luck, I think you should be able to solve this fairly easily. -- ------------------------------------------------------------------------- ICQ: 147137905 Skype: linker83 |
From: Andrea A. <an...@qu...> - 2008-04-26 16:47:08
Hello everyone, here it is the mmu notifier #v14. http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/ Please everyone involved review and (hopefully ;) ack that this is safe to go in 2.6.26, the most important is to verify that this is a noop when disarmed regardless of MMU_NOTIFIER=y or =n. http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core I'll be sending that patch to Andrew inbox. Signed-off-by: Andrea Arcangeli <an...@qu...> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 8d45fab..ce3251c 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 2ad6f54..853087a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -663,6 +663,108 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) account_shadowed(kvm, gfn); } +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte) +{ + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); + get_page(page); + rmap_remove(kvm, spte); + set_shadow_pte(spte, shadow_trap_nonpresent_pte); + kvm_flush_remote_tlbs(kvm); + put_page(page); +} + +static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp) +{ + u64 *spte, *curr_spte; + + spte = rmap_next(kvm, rmapp, NULL); + while (spte) { + BUG_ON(!(*spte & PT_PRESENT_MASK)); + rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte); + curr_spte = spte; + spte = rmap_next(kvm, rmapp, spte); + kvm_unmap_spte(kvm, curr_spte); + } +} + +void kvm_unmap_hva(struct kvm *kvm, unsigned long hva) +{ + int i; + + /* + * If mmap_sem isn't taken, we can look the memslots with only + * the mmu_lock by skipping over the slots with userspace_addr == 0. + */ + for (i = 0; i < kvm->nmemslots; i++) { + struct kvm_memory_slot *memslot = &kvm->memslots[i]; + unsigned long start = memslot->userspace_addr; + unsigned long end; + + /* mmu_lock protects userspace_addr */ + if (!start) + continue; + + end = start + (memslot->npages << PAGE_SHIFT); + if (hva >= start && hva < end) { + gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT; + kvm_unmap_rmapp(kvm, &memslot->rmap[gfn_offset]); + } + } +} + +static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp) +{ + u64 *spte; + int young = 0; + + spte = rmap_next(kvm, rmapp, NULL); + while (spte) { + int _young; + u64 _spte = *spte; + BUG_ON(!(_spte & PT_PRESENT_MASK)); + _young = _spte & PT_ACCESSED_MASK; + if (_young) { + young = !!_young; + set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK); + } + spte = rmap_next(kvm, rmapp, spte); + } + return young; +} + +int kvm_age_hva(struct kvm *kvm, unsigned long hva) +{ + int i; + int young = 0; + + /* + * If mmap_sem isn't taken, we can look the memslots with only + * the mmu_lock by skipping over the slots with userspace_addr == 0. 
+ */ + spin_lock(&kvm->mmu_lock); + for (i = 0; i < kvm->nmemslots; i++) { + struct kvm_memory_slot *memslot = &kvm->memslots[i]; + unsigned long start = memslot->userspace_addr; + unsigned long end; + + /* mmu_lock protects userspace_addr */ + if (!start) + continue; + + end = start + (memslot->npages << PAGE_SHIFT); + if (hva >= start && hva < end) { + gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT; + young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]); + } + } + spin_unlock(&kvm->mmu_lock); + + if (young) + kvm_flush_remote_tlbs(kvm); + + return young; +} + #ifdef MMU_DEBUG static int is_empty_shadow_page(u64 *spt) { @@ -1200,6 +1302,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) int r; int largepage = 0; pfn_t pfn; + int mmu_seq; down_read(¤t->mm->mmap_sem); if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) { @@ -1207,6 +1310,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) largepage = 1; } + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); + /* implicit mb(), we'll read before PT lock is unlocked */ pfn = gfn_to_pfn(vcpu->kvm, gfn); up_read(¤t->mm->mmap_sem); @@ -1217,6 +1322,11 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) } spin_lock(&vcpu->kvm->mmu_lock); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count))) + goto out_unlock; + smp_rmb(); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq)) + goto out_unlock; kvm_mmu_free_some_pages(vcpu); r = __direct_map(vcpu, v, write, largepage, gfn, pfn, PT32E_ROOT_LEVEL); @@ -1224,6 +1334,11 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) return r; + +out_unlock: + spin_unlock(&vcpu->kvm->mmu_lock); + kvm_release_pfn_clean(pfn); + return 0; } @@ -1355,6 +1470,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, int r; int largepage = 0; gfn_t gfn = gpa >> PAGE_SHIFT; + int mmu_seq; ASSERT(vcpu); ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa)); @@ -1368,6 +1484,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, gfn &= ~(KVM_PAGES_PER_HPAGE-1); largepage = 1; } + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); + /* implicit mb(), we'll read before PT lock is unlocked */ pfn = gfn_to_pfn(vcpu->kvm, gfn); up_read(¤t->mm->mmap_sem); if (is_error_pfn(pfn)) { @@ -1375,12 +1493,22 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, return 1; } spin_lock(&vcpu->kvm->mmu_lock); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count))) + goto out_unlock; + smp_rmb(); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq)) + goto out_unlock; kvm_mmu_free_some_pages(vcpu); r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK, largepage, gfn, pfn, TDP_ROOT_LEVEL); spin_unlock(&vcpu->kvm->mmu_lock); return r; + +out_unlock: + spin_unlock(&vcpu->kvm->mmu_lock); + kvm_release_pfn_clean(pfn); + return 0; } static void nonpaging_free(struct kvm_vcpu *vcpu) @@ -1643,11 +1771,11 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, int r; u64 gpte = 0; pfn_t pfn; - - vcpu->arch.update_pte.largepage = 0; + int mmu_seq; + int largepage; if (bytes != 4 && bytes != 8) - return; + goto out_lock; /* * Assume that the pte write on a page table of the same type @@ -1660,7 +1788,7 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, if ((bytes == 4) && (gpa % 4 == 0)) { r = kvm_read_guest(vcpu->kvm, gpa & ~(u64)7, &gpte, 8); if (r) - return; + goto 
out_lock; memcpy((void *)&gpte + (gpa % 8), new, 4); } else if ((bytes == 8) && (gpa % 8 == 0)) { memcpy((void *)&gpte, new, 8); @@ -1670,23 +1798,35 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, memcpy((void *)&gpte, new, 4); } if (!is_present_pte(gpte)) - return; + goto out_lock; gfn = (gpte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; + largepage = 0; down_read(¤t->mm->mmap_sem); if (is_large_pte(gpte) && is_largepage_backed(vcpu, gfn)) { gfn &= ~(KVM_PAGES_PER_HPAGE-1); - vcpu->arch.update_pte.largepage = 1; + largepage = 1; } + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); + /* implicit mb(), we'll read before PT lock is unlocked */ pfn = gfn_to_pfn(vcpu->kvm, gfn); up_read(¤t->mm->mmap_sem); - if (is_error_pfn(pfn)) { - kvm_release_pfn_clean(pfn); - return; - } + if (is_error_pfn(pfn)) + goto out_release_and_lock; + + spin_lock(&vcpu->kvm->mmu_lock); + BUG_ON(!is_error_pfn(vcpu->arch.update_pte.pfn)); vcpu->arch.update_pte.gfn = gfn; vcpu->arch.update_pte.pfn = pfn; + vcpu->arch.update_pte.largepage = largepage; + vcpu->arch.update_pte.mmu_seq = mmu_seq; + return; + +out_release_and_lock: + kvm_release_pfn_clean(pfn); +out_lock: + spin_lock(&vcpu->kvm->mmu_lock); } void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, @@ -1711,7 +1851,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes); - spin_lock(&vcpu->kvm->mmu_lock); kvm_mmu_free_some_pages(vcpu); ++vcpu->kvm->stat.mmu_pte_write; kvm_mmu_audit(vcpu, "pre pte write"); @@ -1790,11 +1929,11 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, } } kvm_mmu_audit(vcpu, "post pte write"); - spin_unlock(&vcpu->kvm->mmu_lock); if (!is_error_pfn(vcpu->arch.update_pte.pfn)) { kvm_release_pfn_clean(vcpu->arch.update_pte.pfn); vcpu->arch.update_pte.pfn = bad_pfn; } + spin_unlock(&vcpu->kvm->mmu_lock); } int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 156fe10..4ac73a6 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -263,6 +263,12 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page, pfn = vcpu->arch.update_pte.pfn; if (is_error_pfn(pfn)) return; + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count))) + return; + smp_rmb(); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != + vcpu->arch.update_pte.mmu_seq)) + return; kvm_get_pfn(pfn); mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0, gpte & PT_DIRTY_MASK, NULL, largepage, gpte_to_gfn(gpte), @@ -380,6 +386,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, int r; pfn_t pfn; int largepage = 0; + int mmu_seq; pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code); kvm_mmu_audit(vcpu, "pre page fault"); @@ -413,6 +420,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, largepage = 1; } } + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); + /* implicit mb(), we'll read before PT lock is unlocked */ pfn = gfn_to_pfn(vcpu->kvm, walker.gfn); up_read(¤t->mm->mmap_sem); @@ -424,6 +433,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, } spin_lock(&vcpu->kvm->mmu_lock); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count))) + goto out_unlock; + smp_rmb(); + if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq)) + goto out_unlock; 
kvm_mmu_free_some_pages(vcpu); shadow_pte = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault, largepage, &write_pt, pfn); @@ -439,6 +453,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, spin_unlock(&vcpu->kvm->mmu_lock); return write_pt; + +out_unlock: + spin_unlock(&vcpu->kvm->mmu_lock); + kvm_release_pfn_clean(pfn); + return 0; } static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0ce5563..860559a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -27,6 +27,7 @@ #include <linux/module.h> #include <linux/mman.h> #include <linux/highmem.h> +#include <linux/mmu_notifier.h> #include <asm/uaccess.h> #include <asm/msr.h> @@ -3859,15 +3860,152 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) free_page((unsigned long)vcpu->arch.pio_data); } +static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn) +{ + struct kvm_arch *kvm_arch; + kvm_arch = container_of(mn, struct kvm_arch, mmu_notifier); + return container_of(kvm_arch, struct kvm, arch); +} + +static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + /* + * When ->invalidate_page runs, the linux pte has been zapped + * already but the page is still allocated until + * ->invalidate_page returns. So if we increase the sequence + * here the kvm page fault will notice if the spte can't be + * established because the page is going to be freed. If + * instead the kvm page fault establishes the spte before + * ->invalidate_page runs, kvm_unmap_hva will release it + * before returning. + + * No need of memory barriers as the sequence increase only + * need to be seen at spin_unlock time, and not at spin_lock + * time. + * + * Increasing the sequence after the spin_unlock would be + * unsafe because the kvm page fault could then establish the + * pte after kvm_unmap_hva returned, without noticing the page + * is going to be freed. + */ + atomic_inc(&kvm->arch.mmu_notifier_seq); + spin_lock(&kvm->mmu_lock); + kvm_unmap_hva(kvm, address); + spin_unlock(&kvm->mmu_lock); +} + +static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + + /* + * The count increase must become visible at unlock time as no + * spte can be established without taking the mmu_lock and + * count is also read inside the mmu_lock critical section. + */ + atomic_inc(&kvm->arch.mmu_notifier_count); + + spin_lock(&kvm->mmu_lock); + for (; start < end; start += PAGE_SIZE) + kvm_unmap_hva(kvm, start); + spin_unlock(&kvm->mmu_lock); +} + +static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long start, + unsigned long end) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + /* + * + * This sequence increase will notify the kvm page fault that + * the page that is going to be mapped in the spte could have + * been freed. + * + * There's also an implicit mb() here in this comment, + * provided by the last PT lock taken to zap pagetables, and + * that the read side has to take too in follow_page(). The + * sequence increase in the worst case will become visible to + * the kvm page fault after the spin_lock of the last PT lock + * of the last PT-lock-protected critical section preceeding + * invalidate_range_end. 
So if the kvm page fault is about to + * establish the spte inside the mmu_lock, while we're freeing + * the pages, it will have to backoff and when it retries, it + * will have to take the PT lock before it can check the + * pagetables again. And after taking the PT lock it will + * re-establish the pte even if it will see the already + * increased sequence number before calling gfn_to_pfn. + */ + atomic_inc(&kvm->arch.mmu_notifier_seq); + /* + * The sequence increase must be visible before count + * decrease. The page fault has to read count before sequence + * for this write order to be effective. + */ + wmb(); + atomic_dec(&kvm->arch.mmu_notifier_count); + BUG_ON(atomic_read(&kvm->arch.mmu_notifier_count) < 0); +} + +static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + return kvm_age_hva(kvm, address); +} + +static void kvm_free_vcpus(struct kvm *kvm); +/* This must zap all the sptes because all pages will be freed then */ +static void kvm_mmu_notifier_release(struct mmu_notifier *mn, + struct mm_struct *mm) +{ + struct kvm *kvm = mmu_notifier_to_kvm(mn); + BUG_ON(mm != kvm->mm); + + kvm_destroy_common_vm(kvm); + + kvm_free_pit(kvm); + kfree(kvm->arch.vpic); + kfree(kvm->arch.vioapic); + kvm_free_vcpus(kvm); + kvm_free_physmem(kvm); + if (kvm->arch.apic_access_page) + put_page(kvm->arch.apic_access_page); +} + +static const struct mmu_notifier_ops kvm_mmu_notifier_ops = { + .release = kvm_mmu_notifier_release, + .invalidate_page = kvm_mmu_notifier_invalidate_page, + .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start, + .invalidate_range_end = kvm_mmu_notifier_invalidate_range_end, + .clear_flush_young = kvm_mmu_notifier_clear_flush_young, +}; + struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } @@ -3899,13 +4037,12 @@ static void kvm_free_vcpus(struct kvm *kvm) void kvm_arch_destroy_vm(struct kvm *kvm) { - kvm_free_pit(kvm); - kfree(kvm->arch.vpic); - kfree(kvm->arch.vioapic); - kvm_free_vcpus(kvm); - kvm_free_physmem(kvm); - if (kvm->arch.apic_access_page) - put_page(kvm->arch.apic_access_page); + /* + * kvm_mmu_notifier_release() will be called before + * mmu_notifier_unregister returns, if it didn't run + * already. 
+ */ + mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm); kfree(kvm); } diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index 9d963cd..f07e321 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -13,6 +13,7 @@ #include <linux/types.h> #include <linux/mm.h> +#include <linux/mmu_notifier.h> #include <linux/kvm.h> #include <linux/kvm_para.h> @@ -247,6 +248,7 @@ struct kvm_vcpu_arch { gfn_t gfn; /* presumed gfn during guest pte update */ pfn_t pfn; /* pfn corresponding to that gfn */ int largepage; + int mmu_seq; } update_pte; struct i387_fxsave_struct host_fx_image; @@ -314,6 +316,10 @@ struct kvm_arch{ struct page *apic_access_page; gpa_t wall_clock; + + struct mmu_notifier mmu_notifier; + atomic_t mmu_notifier_seq; + atomic_t mmu_notifier_count; }; struct kvm_vm_stat { @@ -434,6 +440,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu); int kvm_mmu_setup(struct kvm_vcpu *vcpu); void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); +void kvm_unmap_hva(struct kvm *kvm, unsigned long hva); +int kvm_age_hva(struct kvm *kvm, unsigned long hva); int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); void kvm_mmu_zap_all(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 4e16682..f089edc 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -267,6 +267,7 @@ void kvm_arch_check_processor_compat(void *rtn); int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); void kvm_free_physmem(struct kvm *kvm); +void kvm_destroy_common_vm(struct kvm *kvm); struct kvm *kvm_arch_create_vm(void); void kvm_arch_destroy_vm(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f095b73..4beae7a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -231,15 +231,19 @@ void kvm_free_physmem(struct kvm *kvm) kvm_free_physmem_slot(&kvm->memslots[i], NULL); } -static void kvm_destroy_vm(struct kvm *kvm) +void kvm_destroy_common_vm(struct kvm *kvm) { - struct mm_struct *mm = kvm->mm; - spin_lock(&kvm_lock); list_del(&kvm->vm_list); spin_unlock(&kvm_lock); kvm_io_bus_destroy(&kvm->pio_bus); kvm_io_bus_destroy(&kvm->mmio_bus); +} + +static void kvm_destroy_vm(struct kvm *kvm) +{ + struct mm_struct *mm = kvm->mm; + kvm_arch_destroy_vm(kvm); mmdrop(mm); } As usual you also need the kvm-mmu-notifier-lock patch to read the memslots with only the mmu_lock. 
Signed-off-by: Andrea Arcangeli <an...@qu...> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c7ad235..8be6551 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3871,16 +3871,23 @@ int kvm_arch_set_memory_region(struct kvm *kvm, */ if (!user_alloc) { if (npages && !old.rmap) { + unsigned long userspace_addr; + down_write(¤t->mm->mmap_sem); - memslot->userspace_addr = do_mmap(NULL, 0, - npages * PAGE_SIZE, - PROT_READ | PROT_WRITE, - MAP_SHARED | MAP_ANONYMOUS, - 0); + userspace_addr = do_mmap(NULL, 0, + npages * PAGE_SIZE, + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, + 0); up_write(¤t->mm->mmap_sem); - if (IS_ERR((void *)memslot->userspace_addr)) - return PTR_ERR((void *)memslot->userspace_addr); + if (IS_ERR((void *)userspace_addr)) + return PTR_ERR((void *)userspace_addr); + + /* set userspace_addr atomically for kvm_hva_to_rmapp */ + spin_lock(&kvm->mmu_lock); + memslot->userspace_addr = userspace_addr; + spin_unlock(&kvm->mmu_lock); } else { if (!old.user_alloc && old.rmap) { int ret; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 6a52c08..97bcc8d 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -342,7 +342,15 @@ int __kvm_set_memory_region(struct kvm *kvm, memset(new.rmap, 0, npages * sizeof(*new.rmap)); new.user_alloc = user_alloc; - new.userspace_addr = mem->userspace_addr; + /* + * hva_to_rmmap() serialzies with the mmu_lock and to be + * safe it has to ignore memslots with !user_alloc && + * !userspace_addr. + */ + if (user_alloc) + new.userspace_addr = mem->userspace_addr; + else + new.userspace_addr = 0; } if (npages && !new.lpage_info) { int largepages = npages / KVM_PAGES_PER_HPAGE; @@ -374,14 +382,18 @@ int __kvm_set_memory_region(struct kvm *kvm, memset(new.dirty_bitmap, 0, dirty_bytes); } + spin_lock(&kvm->mmu_lock); if (mem->slot >= kvm->nmemslots) kvm->nmemslots = mem->slot + 1; *memslot = new; + spin_unlock(&kvm->mmu_lock); r = kvm_arch_set_memory_region(kvm, mem, old, user_alloc); if (r) { + spin_lock(&kvm->mmu_lock); *memslot = old; + spin_unlock(&kvm->mmu_lock); goto out_free; } |
From: Anthony L. <an...@co...> - 2008-04-26 15:20:14
Avi Kivity wrote:
> Anthony Liguori wrote:
>> The second stage is to use a loop of x86_emulate() to run all 16-bit
>> code (instead of using vm86 mode). This will allow us to support
>> guests that use big real mode.
>
> Why do that unconditionally, instead of only when in a big-real-mode
> state?

It can be. It's probably easier from a development perspective to make it unconditional and then to flush out all the necessary instructions.

Regards,

Anthony Liguori
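A sketch of what such a loop might look like (not the actual patch series): emulate_instruction() is the existing kvm x86 emulator entry point, while the handler itself and the state predicate are hypothetical.

/* Emulate until the guest is back in a VT-friendly state, then resume
 * hardware virtualization.  Illustrative only. */
static int handle_invalid_guest_state(struct kvm_vcpu *vcpu,
                                      struct kvm_run *kvm_run)
{
        while (!guest_state_vt_friendly(vcpu)) {
                int err = emulate_instruction(vcpu, kvm_run, 0, 0, 0);

                if (err != EMULATE_DONE)
                        return 0;        /* punt: mmio or unhandled instruction */
                if (signal_pending(current) || need_resched())
                        break;           /* give the host a chance to run */
        }
        return 1;
}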
From: <ext...@li...> - 2008-04-26 14:24:21
Hi Glauber, sorry for late reply. well, I'm no longer able to reproduce the problem in the same way (with backtraces etc) as before, but anyways enabling paravirt_clock with or without Your patches on SMP guest still causes troubles: on my phenom machine, the kernel hangs after printing "PCI-GART - No AMD northbridge found." on intel machine normal system boot seems to be terribly slow taking tens of seconds between steps and later hangs, using init=/bin/sh seems to work though. I'm wondering how can I get the backtraces I was getting before, I have soft CPU lockup debugging enabled, what else could help? cheers n. On Thu, 24 Apr 2008, Glauber Costa wrote: > Just saw Gerd's patches flying around. Can anyone that is able to > reproduce this problem (a subgroup of human beings that does not > include me) test it with them applied? > > If it still fails, please let me know asap > > -- > Glauber Costa. > "Free as in Freedom" > http://glommer.net > > "The less confident you are, the more serious you have to act." > > |