From: Anthony L. <ali...@us...> - 2007-08-27 17:26:35
This patchset refactors KVM's paravirtualization support to better support guest SMP and cross-platform migration. It also introduces a bare-bones KVM paravirt_ops implementation. I've tested this on VT and it works nicely. I'm having trouble getting a bzImage to boot on SVM, so I haven't been able to test there yet.
From: Anthony L. <ali...@us...> - 2007-08-27 16:05:41
Since a hypercall may span two pages and is a gva, we need a function to write
to a gva that may span multiple pages. emulator_write_phys() seems like the
logical choice for this.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c
index d154487..ebcc5c9 100644
--- a/drivers/kvm/kvm_main.c
+++ b/drivers/kvm/kvm_main.c
@@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr,
 			     unsigned int bytes,
 			     struct kvm_vcpu *vcpu)
 {
-	pr_unimpl(vcpu, "emulator_write_std: addr %lx n %d\n", addr, bytes);
-	return X86EMUL_UNHANDLEABLE;
+	const void *data = val;
+
+	while (bytes) {
+		gpa_t gpa = vcpu->mmu.gva_to_gpa(vcpu, addr);
+		unsigned offset = addr & (PAGE_SIZE-1);
+		unsigned tocopy = min(bytes, (unsigned)PAGE_SIZE - offset);
+		unsigned long pfn;
+		struct page *page;
+		void *page_virt;
+
+		if (gpa == UNMAPPED_GVA)
+			return X86EMUL_PROPAGATE_FAULT;
+		pfn = gpa >> PAGE_SHIFT;
+		page = gfn_to_page(vcpu->kvm, pfn);
+		if (!page)
+			return X86EMUL_UNHANDLEABLE;
+		page_virt = kmap_atomic(page, KM_USER0);
+
+		memcpy(page_virt + offset, data, tocopy);
+
+		kunmap_atomic(page_virt, KM_USER0);
+		mark_page_dirty(vcpu->kvm, addr >> PAGE_SHIFT);
+
+		bytes -= tocopy;
+		data += tocopy;
+		addr += tocopy;
+	}
+
+	return X86EMUL_CONTINUE;
 }
 
 static struct kvm_io_device *vcpu_find_mmio_dev(struct kvm_vcpu *vcpu,
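[Editor's note: the page-boundary arithmetic in the loop above can be exercised in plain user space. The sketch below is not KVM code — write_spanning() and the flat mem array are hypothetical stand-ins for the gva_to_gpa()/kmap_atomic() machinery — but it follows the same offset/tocopy computation:]

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Copy `bytes` bytes into a flat "guest memory" array one page-sized
 * chunk at a time, mirroring the offset/tocopy arithmetic in
 * emulator_write_std(): the first iteration copies only up to the next
 * page boundary, later iterations start page-aligned. */
static void write_spanning(unsigned char *mem, unsigned long addr,
			   const void *val, unsigned bytes)
{
	const unsigned char *data = val;

	while (bytes) {
		unsigned offset = addr & (PAGE_SIZE - 1);
		unsigned tocopy = bytes < PAGE_SIZE - offset
				      ? bytes : PAGE_SIZE - offset;

		memcpy(mem + addr, data, tocopy);
		bytes -= tocopy;
		data  += tocopy;
		addr  += tocopy;
	}
}
```

Writing the 3-byte hypercall instruction at the last byte of a page splits the copy into a 1-byte and a 2-byte chunk, which is exactly the case motivating the patch.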
From: Avi K. <av...@qu...> - 2007-08-27 16:08:52
Anthony Liguori wrote:
> This patch refactors the current hypercall infrastructure to better support live
> migration and SMP. It eliminates the hypercall page by trapping the UD
> exception that would occur if you used the wrong hypercall instruction for the
> underlying architecture and replacing it with the right one lazily.
>
> It also introduces the infrastructure to probe for hypercall available via
> CPUID leaves 0x40000002 and 0x40000003.
>
> A fall-out of this patch is that the unhandled hypercalls no longer trap to
> userspace. There is very little reason though to use a hypercall to communicate
> with userspace as PIO or MMIO can be used. There is no code in tree that uses
> userspace hypercalls.
>

Allowing userspace to handle hypercalls means that we can have block and net
drivers in the kernel or userspace, reflecting user privileges and
performance/flexibility tradeoffs. I think that's an important feature to have.

>  void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
>  {
>  	int i;
> @@ -1632,6 +1575,12 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
>  	vcpu->regs[VCPU_REGS_RBX] = 0;
>  	vcpu->regs[VCPU_REGS_RCX] = 0;
>  	vcpu->regs[VCPU_REGS_RDX] = 0;
> +
> +	if ((function & 0xFFFF0000) == 0x40000000) {
> +		emulate_paravirt_cpuid(vcpu, function);
> +		goto out;
> +	}
> +

Hmm. Suppose we expose kvm capabilities to host userspace instead, and let the
host userspace decide which features to expose to the guest via the regular
cpuid emulation? That allows the qemu command line to control the feature set.

>
> +static int ud_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
> +{
> +	int er;
> +
> +	er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0);
> +
> +	/* we should only succeed here in the case of hypercalls which
> +	   cannot generate an MMIO event.  MMIO means that the emulator
> +	   is mistakenly allowing an instruction that should generate
> +	   a UD fault so it's a bug. */
> +	BUG_ON(er == EMULATE_DO_MMIO);

It's a guest-triggerable bug; one vcpu can be issuing ud2-in-a-loop while the
other replaces the instruction with something that does mmio.

> +
> +#define KVM_ENOSYS ENOSYS

A real number (well, an integer, not a real) here please. I know that ENOSYS
isn't going to change soon, but this file defines the kvm abi and I'd like it
to be as independent as possible.

Let's start it at 1000 so that spurious "return 1"s or "return -1"s don't get
translated into valid error numbers.

I really like the simplification to the guest/host interface that this patch
brings.

--
Any sufficiently difficult bug is indistinguishable from a feature.
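[Editor's note: the point about decoupling the ABI error numbers is easy to illustrate — with a private value well away from 0 and ±1, a stray "return -1" can never be mistaken for a real error code. In the sketch below, the value 1000 comes from the suggestion above, while do_hypercall() is a hypothetical stand-in for the nr switch in kvm_emulate_hypercall():]

```c
#include <assert.h>

/* ABI-private error number, deliberately far from errno values and
 * from the 0/1/-1 returns that sloppy code paths tend to produce. */
#define KVM_ENOSYS 1000

/* Toy dispatcher standing in for the switch (nr) in the patch. */
static long do_hypercall(unsigned long nr)
{
	switch (nr) {
	case 0:
		return 0;		/* a hypothetical implemented call */
	default:
		return -KVM_ENOSYS;	/* unknown hypercall number */
	}
}
```

A guest seeing -1000 knows the host rejected the call; a host bug returning -1 stays clearly distinguishable from every defined error.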
From: Anthony L. <ali...@us...> - 2007-08-27 17:29:15
On Mon, 2007-08-27 at 19:06 +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> >  void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
> >  {
> >  	int i;
> > @@ -1632,6 +1575,12 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
> >  	vcpu->regs[VCPU_REGS_RBX] = 0;
> >  	vcpu->regs[VCPU_REGS_RCX] = 0;
> >  	vcpu->regs[VCPU_REGS_RDX] = 0;
> > +
> > +	if ((function & 0xFFFF0000) == 0x40000000) {
> > +		emulate_paravirt_cpuid(vcpu, function);
> > +		goto out;
> > +	}
> > +
>
> Hmm. Suppose we expose kvm capabilities to host userspace instead, and
> let the host userspace decide which features to expose to the guest via
> the regular cpuid emulation? That allows the qemu command line to
> control the feature set.

That's a little awkward with something like NPT. Some CPUID features should be
masked by the availability of NPT support in hardware and whether kernelspace
supports NPT. To set these feature bits in userspace, you would have to query
kernelspace anyway. It would be nice to be consistent with the other cpuid
flags though.

There is somewhat of a similar issue with migration here. If we have an
initial set of PV features and down the road, add more, it may be desirable to
enable those features depending on what your network looks like.

Yeah, I think I agree that userspace is the right place to set these bits.

> > +static int ud_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
> > +{
> > +	int er;
> > +
> > +	er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0);
> > +
> > +	/* we should only succeed here in the case of hypercalls which
> > +	   cannot generate an MMIO event.  MMIO means that the emulator
> > +	   is mistakenly allowing an instruction that should generate
> > +	   a UD fault so it's a bug. */
> > +	BUG_ON(er == EMULATE_DO_MMIO);
>
> It's a guest-triggerable bug; one vcpu can be issuing ud2-in-a-loop
> while the other replaces the instruction with something that does mmio.

Good catch! I'll update.

> > +
> > +#define KVM_ENOSYS ENOSYS
>
> A real number (well, an integer, not a real) here please. I know that
> ENOSYS isn't going to change soon, but this file defines the kvm abi and
> I'd like it to be as independent as possible.
>
> Let's start it at 1000 so that spurious "return 1"s or "return -1"s
> don't get translated into valid error numbers.

Okay.

> I really like the simplification to the guest/host interface that this
> patch brings.

Thanks.

Regards,

Anthony Liguori
From: Anthony L. <ali...@us...> - 2007-08-27 16:25:35
This patch refactors the current hypercall infrastructure to better support live migration and SMP. It eliminates the hypercall page by trapping the UD exception that would occur if you used the wrong hypercall instruction for the underlying architecture and replacing it with the right one lazily. It also introduces the infrastructure to probe for hypercall available via CPUID leaves 0x40000002 and 0x40000003. A fall-out of this patch is that the unhandled hypercalls no longer trap to userspace. There is very little reason though to use a hypercall to communicate with userspace as PIO or MMIO can be used. There is no code in tree that uses userspace hypercalls. Signed-off-by: Anthony Liguori <ali...@us...> diff --git a/drivers/kvm/kvm.h b/drivers/kvm/kvm.h index a42a6f3..38c0eaf 100644 --- a/drivers/kvm/kvm.h +++ b/drivers/kvm/kvm.h @@ -46,6 +46,7 @@ #define KVM_MAX_CPUID_ENTRIES 40 #define DE_VECTOR 0 +#define UD_VECTOR 6 #define NM_VECTOR 7 #define DF_VECTOR 8 #define TS_VECTOR 10 @@ -316,9 +317,6 @@ struct kvm_vcpu { unsigned long cr0; unsigned long cr2; unsigned long cr3; - gpa_t para_state_gpa; - struct page *para_state_page; - gpa_t hypercall_gpa; unsigned long cr4; unsigned long cr8; u64 pdptrs[4]; /* pae */ @@ -587,7 +585,9 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int kvm_mmu_load(struct kvm_vcpu *vcpu); void kvm_mmu_unload(struct kvm_vcpu *vcpu); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run); +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu); static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u32 error_code) diff --git a/drivers/kvm/kvm_main.c b/drivers/kvm/kvm_main.c index ebcc5c9..996d0ee 100644 --- a/drivers/kvm/kvm_main.c +++ b/drivers/kvm/kvm_main.c @@ -1294,51 +1294,49 @@ int kvm_emulate_halt(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_emulate_halt); -int kvm_hypercall(struct kvm_vcpu *vcpu, struct kvm_run *run) +int kvm_emulate_hypercall(struct 
kvm_vcpu *vcpu) { - unsigned long nr, a0, a1, a2, a3, a4, a5, ret; + unsigned long nr, a0, a1, a2, a3, ret; kvm_arch_ops->cache_regs(vcpu); - ret = -KVM_EINVAL; -#ifdef CONFIG_X86_64 - if (is_long_mode(vcpu)) { - nr = vcpu->regs[VCPU_REGS_RAX]; - a0 = vcpu->regs[VCPU_REGS_RDI]; - a1 = vcpu->regs[VCPU_REGS_RSI]; - a2 = vcpu->regs[VCPU_REGS_RDX]; - a3 = vcpu->regs[VCPU_REGS_RCX]; - a4 = vcpu->regs[VCPU_REGS_R8]; - a5 = vcpu->regs[VCPU_REGS_R9]; - } else -#endif - { - nr = vcpu->regs[VCPU_REGS_RBX] & -1u; - a0 = vcpu->regs[VCPU_REGS_RAX] & -1u; - a1 = vcpu->regs[VCPU_REGS_RCX] & -1u; - a2 = vcpu->regs[VCPU_REGS_RDX] & -1u; - a3 = vcpu->regs[VCPU_REGS_RSI] & -1u; - a4 = vcpu->regs[VCPU_REGS_RDI] & -1u; - a5 = vcpu->regs[VCPU_REGS_RBP] & -1u; + + nr = vcpu->regs[VCPU_REGS_RAX]; + a0 = vcpu->regs[VCPU_REGS_RBX]; + a1 = vcpu->regs[VCPU_REGS_RCX]; + a2 = vcpu->regs[VCPU_REGS_RDX]; + a3 = vcpu->regs[VCPU_REGS_RSI]; + + if (!is_long_mode(vcpu)) { + nr &= ~1u; + a0 &= ~1u; + a1 &= ~1u; + a2 &= ~1u; + a3 &= ~1u; } + switch (nr) { default: - run->hypercall.nr = nr; - run->hypercall.args[0] = a0; - run->hypercall.args[1] = a1; - run->hypercall.args[2] = a2; - run->hypercall.args[3] = a3; - run->hypercall.args[4] = a4; - run->hypercall.args[5] = a5; - run->hypercall.ret = ret; - run->hypercall.longmode = is_long_mode(vcpu); - kvm_arch_ops->decache_regs(vcpu); - return 0; + ret = -KVM_ENOSYS; + break; } vcpu->regs[VCPU_REGS_RAX] = ret; kvm_arch_ops->decache_regs(vcpu); - return 1; + return 0; +} +EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); + +int kvm_fix_hypercall(struct kvm_vcpu *vcpu) +{ + char instruction[3]; + + kvm_arch_ops->cache_regs(vcpu); + kvm_arch_ops->patch_hypercall(vcpu, instruction); + if (emulator_write_std(vcpu->rip, instruction, 3, vcpu) + != X86EMUL_CONTINUE) + return -EFAULT; + + return 0; } -EXPORT_SYMBOL_GPL(kvm_hypercall); static u64 mk_cr_64(u64 curr_cr, u32 new_val) { @@ -1406,75 +1404,6 @@ void realmode_set_cr(struct kvm_vcpu *vcpu, int cr, unsigned long 
val, } } -/* - * Register the para guest with the host: - */ -static int vcpu_register_para(struct kvm_vcpu *vcpu, gpa_t para_state_gpa) -{ - struct kvm_vcpu_para_state *para_state; - hpa_t para_state_hpa, hypercall_hpa; - struct page *para_state_page; - unsigned char *hypercall; - gpa_t hypercall_gpa; - - printk(KERN_DEBUG "kvm: guest trying to enter paravirtual mode\n"); - printk(KERN_DEBUG ".... para_state_gpa: %08Lx\n", para_state_gpa); - - /* - * Needs to be page aligned: - */ - if (para_state_gpa != PAGE_ALIGN(para_state_gpa)) - goto err_gp; - - para_state_hpa = gpa_to_hpa(vcpu, para_state_gpa); - printk(KERN_DEBUG ".... para_state_hpa: %08Lx\n", para_state_hpa); - if (is_error_hpa(para_state_hpa)) - goto err_gp; - - mark_page_dirty(vcpu->kvm, para_state_gpa >> PAGE_SHIFT); - para_state_page = pfn_to_page(para_state_hpa >> PAGE_SHIFT); - para_state = kmap(para_state_page); - - printk(KERN_DEBUG ".... guest version: %d\n", para_state->guest_version); - printk(KERN_DEBUG ".... size: %d\n", para_state->size); - - para_state->host_version = KVM_PARA_API_VERSION; - /* - * We cannot support guests that try to register themselves - * with a newer API version than the host supports: - */ - if (para_state->guest_version > KVM_PARA_API_VERSION) { - para_state->ret = -KVM_EINVAL; - goto err_kunmap_skip; - } - - hypercall_gpa = para_state->hypercall_gpa; - hypercall_hpa = gpa_to_hpa(vcpu, hypercall_gpa); - printk(KERN_DEBUG ".... 
hypercall_hpa: %08Lx\n", hypercall_hpa); - if (is_error_hpa(hypercall_hpa)) { - para_state->ret = -KVM_EINVAL; - goto err_kunmap_skip; - } - - printk(KERN_DEBUG "kvm: para guest successfully registered.\n"); - vcpu->para_state_page = para_state_page; - vcpu->para_state_gpa = para_state_gpa; - vcpu->hypercall_gpa = hypercall_gpa; - - mark_page_dirty(vcpu->kvm, hypercall_gpa >> PAGE_SHIFT); - hypercall = kmap_atomic(pfn_to_page(hypercall_hpa >> PAGE_SHIFT), - KM_USER1) + (hypercall_hpa & ~PAGE_MASK); - kvm_arch_ops->patch_hypercall(vcpu, hypercall); - kunmap_atomic(hypercall, KM_USER1); - - para_state->ret = 0; -err_kunmap_skip: - kunmap(para_state_page); - return 0; -err_gp: - return 1; -} - int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata) { u64 data; @@ -1588,12 +1517,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data) case MSR_IA32_MISC_ENABLE: vcpu->ia32_misc_enable_msr = data; break; - /* - * This is the 'probe whether the host is KVM' logic: - */ - case MSR_KVM_API_MAGIC: - return vcpu_register_para(vcpu, data); - default: pr_unimpl(vcpu, "unhandled wrmsr: 0x%x\n", msr); return 1; @@ -1620,6 +1543,26 @@ void kvm_resched(struct kvm_vcpu *vcpu) } EXPORT_SYMBOL_GPL(kvm_resched); +static void emulate_paravirt_cpuid(struct kvm_vcpu *vcpu, u32 function) +{ + u32 signature[3]; + + switch (function & 0xFFFF) { + case 2: + memcpy(signature, "LinuxPVLinux", 12); + vcpu->regs[VCPU_REGS_RAX] = 0; + vcpu->regs[VCPU_REGS_RBX] = signature[0]; + vcpu->regs[VCPU_REGS_RCX] = signature[1]; + vcpu->regs[VCPU_REGS_RDX] = signature[2]; + break; + case 3: + vcpu->regs[VCPU_REGS_RAX] + = (1 << KVM_PARA_FEATURE_NOP_IO_DELAY); + break; + } + +} + void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) { int i; @@ -1632,6 +1575,12 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) vcpu->regs[VCPU_REGS_RBX] = 0; vcpu->regs[VCPU_REGS_RCX] = 0; vcpu->regs[VCPU_REGS_RDX] = 0; + + if ((function & 0xFFFF0000) == 0x40000000) { + emulate_paravirt_cpuid(vcpu, function); + 
goto out; + } + best = NULL; for (i = 0; i < vcpu->cpuid_nent; ++i) { e = &vcpu->cpuid_entries[i]; @@ -1652,6 +1601,7 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu) vcpu->regs[VCPU_REGS_RCX] = best->ecx; vcpu->regs[VCPU_REGS_RDX] = best->edx; } +out: kvm_arch_ops->decache_regs(vcpu); kvm_arch_ops->skip_emulated_instruction(vcpu); } diff --git a/drivers/kvm/svm.c b/drivers/kvm/svm.c index cc674bf..cb9b05b 100644 --- a/drivers/kvm/svm.c +++ b/drivers/kvm/svm.c @@ -476,7 +476,8 @@ static void init_vmcb(struct vmcb *vmcb) INTERCEPT_DR5_MASK | INTERCEPT_DR7_MASK; - control->intercept_exceptions = 1 << PF_VECTOR; + control->intercept_exceptions = (1 << PF_VECTOR) | + (1 << UD_VECTOR); control->intercept = (1ULL << INTERCEPT_INTR) | @@ -958,6 +959,24 @@ static int pf_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) return 0; } +static int ud_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) +{ + int er; + + er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0); + + /* we should only succeed here in the case of hypercalls which + cannot generate an MMIO event. MMIO means that the emulator + is mistakenly allowing an instruction that should generate + a UD fault so it's a bug. 
*/ + BUG_ON(er == EMULATE_DO_MMIO); + + if (er == EMULATE_FAIL) + inject_ud(&svm->vcpu); + + return 1; +} + static int nm_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { svm->vmcb->control.intercept_exceptions &= ~(1 << NM_VECTOR); @@ -1024,7 +1043,8 @@ static int vmmcall_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run) { svm->next_rip = svm->vmcb->save.rip + 3; skip_emulated_instruction(&svm->vcpu); - return kvm_hypercall(&svm->vcpu, kvm_run); + kvm_emulate_hypercall(&svm->vcpu); + return 1; } static int invalid_op_interception(struct vcpu_svm *svm, @@ -1218,6 +1238,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm, [SVM_EXIT_WRITE_DR3] = emulate_on_interception, [SVM_EXIT_WRITE_DR5] = emulate_on_interception, [SVM_EXIT_WRITE_DR7] = emulate_on_interception, + [SVM_EXIT_EXCP_BASE + UD_VECTOR] = ud_interception, [SVM_EXIT_EXCP_BASE + PF_VECTOR] = pf_interception, [SVM_EXIT_EXCP_BASE + NM_VECTOR] = nm_interception, [SVM_EXIT_INTR] = nop_on_interception, @@ -1665,7 +1686,6 @@ svm_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) hypercall[0] = 0x0f; hypercall[1] = 0x01; hypercall[2] = 0xd9; - hypercall[3] = 0xc3; } static void svm_check_processor_compat(void *rtn) diff --git a/drivers/kvm/vmx.c b/drivers/kvm/vmx.c index d63e82e..1840802 100644 --- a/drivers/kvm/vmx.c +++ b/drivers/kvm/vmx.c @@ -163,6 +163,13 @@ static inline int is_no_device(u32 intr_info) (INTR_TYPE_EXCEPTION | NM_VECTOR | INTR_INFO_VALID_MASK); } +static inline int is_invalid_opcode(u32 intr_info) +{ + return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK | + INTR_INFO_VALID_MASK)) == + (INTR_TYPE_EXCEPTION | UD_VECTOR | INTR_INFO_VALID_MASK); +} + static inline int is_external_interrupt(u32 intr_info) { return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK)) @@ -304,7 +311,7 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu) { u32 eb; - eb = 1u << PF_VECTOR; + eb = (1u << PF_VECTOR) | (1u << UD_VECTOR); if 
(!vcpu->fpu_active) eb |= 1u << NM_VECTOR; if (vcpu->guest_debug.enabled) @@ -543,6 +550,15 @@ static void vmx_inject_gp(struct kvm_vcpu *vcpu, unsigned error_code) INTR_INFO_VALID_MASK); } +static void vmx_inject_ud(struct kvm_vcpu *vcpu) +{ + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, + GP_VECTOR | + INTR_TYPE_EXCEPTION | + INTR_INFO_DELIEVER_CODE_MASK | + INTR_INFO_VALID_MASK); +} + /* * Swap MSR entry in host/guest MSR entry array. */ @@ -1697,6 +1713,21 @@ static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) return 1; } + if (is_invalid_opcode(intr_info)) { + er = emulate_instruction(vcpu, kvm_run, 0, 0); + + /* we should only succeed here in the case of hypercalls which + cannot generate an MMIO event. MMIO means that the emulator + is mistakenly allowing an instruction that should generate + a UD fault so it's a bug. */ + BUG_ON(er == EMULATE_DO_MMIO); + + if (er == EMULATE_FAIL) + vmx_inject_ud(vcpu); + + return 1; + } + error_code = 0; rip = vmcs_readl(GUEST_RIP); if (intr_info & INTR_INFO_DELIEVER_CODE_MASK) @@ -1799,7 +1830,6 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall) hypercall[0] = 0x0f; hypercall[1] = 0x01; hypercall[2] = 0xc1; - hypercall[3] = 0xc3; } static int handle_cr(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) @@ -1983,7 +2013,8 @@ static int handle_halt(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) static int handle_vmcall(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run) { skip_emulated_instruction(vcpu); - return kvm_hypercall(vcpu, kvm_run); + kvm_emulate_hypercall(vcpu); + return 1; } /* diff --git a/drivers/kvm/x86_emulate.c b/drivers/kvm/x86_emulate.c index 7439b34..1767983 100644 --- a/drivers/kvm/x86_emulate.c +++ b/drivers/kvm/x86_emulate.c @@ -1273,19 +1273,37 @@ twobyte_insn: u16 size; unsigned long address; - case 2: /* lgdt */ - rc = read_descriptor(ctxt, ops, src.ptr, - &size, &address, op_bytes); + case 0: /* vmcall */ + if (modrm_mod != 3 || modrm_rm != 1) + goto cannot_emulate; 
+ + rc = kvm_fix_hypercall(ctxt->vcpu); if (rc) goto done; - realmode_lgdt(ctxt->vcpu, size, address); + + kvm_emulate_hypercall(ctxt->vcpu); break; - case 3: /* lidt */ + case 2: /* lgdt */ rc = read_descriptor(ctxt, ops, src.ptr, &size, &address, op_bytes); if (rc) goto done; - realmode_lidt(ctxt->vcpu, size, address); + realmode_lgdt(ctxt->vcpu, size, address); + break; + case 3: /* lidt/vmmcall */ + if (modrm_mod == 3 && modrm_rm == 1) { + rc = kvm_fix_hypercall(ctxt->vcpu); + if (rc) + goto done; + kvm_emulate_hypercall(ctxt->vcpu); + } else { + rc = read_descriptor(ctxt, ops, src.ptr, + &size, &address, + op_bytes); + if (rc) + goto done; + realmode_lidt(ctxt->vcpu, size, address); + } break; case 4: /* smsw */ if (modrm_mod != 3) diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h index 3b29256..e92ea3e 100644 --- a/include/linux/kvm_para.h +++ b/include/linux/kvm_para.h @@ -1,73 +1,85 @@ #ifndef __LINUX_KVM_PARA_H #define __LINUX_KVM_PARA_H -/* - * Guest OS interface for KVM paravirtualization - * - * Note: this interface is totally experimental, and is certain to change - * as we make progress. - */ +#include <asm/processor.h> -/* - * Per-VCPU descriptor area shared between guest and host. Writable to - * both guest and host. Registered with the host by the guest when - * a guest acknowledges paravirtual mode. - * - * NOTE: all addresses are guest-physical addresses (gpa), to make it - * easier for the hypervisor to map between the various addresses. - */ -struct kvm_vcpu_para_state { - /* - * API version information for compatibility. If there's any support - * mismatch (too old host trying to execute too new guest) then - * the host will deny entry into paravirtual mode. Any other - * combination (new host + old guest and new host + new guest) - * is supposed to work - new host versions will support all old - * guest API versions. 
- */ - u32 guest_version; - u32 host_version; - u32 size; - u32 ret; +#define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" - /* - * The address of the vm exit instruction (VMCALL or VMMCALL), - * which the host will patch according to the CPU model the - * VM runs on: - */ - u64 hypercall_gpa; +static inline long kvm_hypercall0(unsigned int nr) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr)); + return ret; +} -} __attribute__ ((aligned(PAGE_SIZE))); +static inline long kvm_hypercall1(unsigned int nr, unsigned long p1) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr), "b"(p1)); + return ret; +} -#define KVM_PARA_API_VERSION 1 +static inline long kvm_hypercall2(unsigned int nr, unsigned long p1, + unsigned long p2) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr), "b"(p1), "c"(p2)); + return ret; +} -/* - * This is used for an RDMSR's ECX parameter to probe for a KVM host. - * Hopefully no CPU vendor will use up this number. This is placed well - * out of way of the typical space occupied by CPU vendors' MSR indices, - * and we think (or at least hope) it wont be occupied in the future - * either. - */ -#define MSR_KVM_API_MAGIC 0x87655678 +static inline long kvm_hypercall3(unsigned int nr, unsigned long p1, + unsigned long p2, unsigned long p3) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr), "b"(p1), "c"(p2), "d"(p3)); + return ret; +} -#define KVM_EINVAL 1 +static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, + unsigned long p2, unsigned long p3, + unsigned long p4) +{ + long ret; + asm volatile(KVM_HYPERCALL + : "=a"(ret) + : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)); + return ret; +} -/* - * Hypercall calling convention: - * - * Each hypercall may have 0-6 parameters. - * - * 64-bit hypercall index is in RAX, goes from 0 to __NR_hypercalls-1 - * - * 64-bit parameters 1-6 are in the standard gcc x86_64 calling convention - * order: RDI, RSI, RDX, RCX, R8, R9. 
- * - * 32-bit index is EBX, parameters are: EAX, ECX, EDX, ESI, EDI, EBP. - * (the first 3 are according to the gcc regparm calling convention) - * - * No registers are clobbered by the hypercall, except that the - * return value is in RAX. - */ -#define __NR_hypercalls 0 +static inline int kvm_para_available(void) +{ + unsigned int eax, ebx, ecx, edx; + char signature[13]; + + cpuid(0x40000002, &eax, &ebx, &ecx, &edx); + memcpy(signature + 0, &ebx, 4); + memcpy(signature + 4, &ecx, 4); + memcpy(signature + 8, &edx, 4); + signature[12] = 0; + + if (strcmp(signature, "LinuxPVLinux") == 0) + return 1; + + return 0; +} + +static inline int kvm_para_has_feature(unsigned int feature) +{ + if (cpuid_eax(0x40000003) & (1UL << feature)) + return 1; + return 0; +} + +#define KVM_PARA_FEATURE_NOP_IO_DELAY 0 + +#define KVM_ENOSYS ENOSYS #endif |
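[Editor's note: the guest-side probe in kvm_para_available() above just reassembles a 12-byte string from the three 32-bit registers returned for leaf 0x40000002. The user-space sketch below mimics that logic with hand-packed register values; signature_matches() is an illustrative helper, not part of the patch, and it assumes a little-endian host, as the x86-only kernel code does:]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rebuild the 12-byte hypervisor signature from the ebx/ecx/edx values
 * a CPUID of leaf 0x40000002 would return, exactly as
 * kvm_para_available() does, and compare against "LinuxPVLinux". */
static int signature_matches(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
	char signature[13];

	memcpy(signature + 0, &ebx, 4);
	memcpy(signature + 4, &ecx, 4);
	memcpy(signature + 8, &edx, 4);
	signature[12] = '\0';

	return strcmp(signature, "LinuxPVLinux") == 0;
}
```

On a little-endian machine the register values for "Linu", "xPVL", "inux" pack to 0x756e694c, 0x4c565078, 0x78756e69 respectively, so those constants stand in for a real CPUID result.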
From: Avi K. <av...@qu...> - 2007-08-27 16:27:38
Anthony Liguori wrote:
> Since a hypercall may span two pages and is a gva, we need a function to write
> to a gva that may span multiple pages. emulator_write_phys() seems like the
> logical choice for this.
>
> @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr,
>  			     unsigned int bytes,
>  			     struct kvm_vcpu *vcpu

I think that emulator_write_emulated(), except for being awkwardly named,
should do the job. We have enough APIs.

But! We may not overwrite the hypercall instruction while a vcpu may be
executing, since there's no atomicity guarantee for code fetch. We have to be
out of guest mode while writing that insn.

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: Anthony L. <ali...@us...> - 2007-08-27 18:09:51
On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> > Since a hypercall may span two pages and is a gva, we need a function to write
> > to a gva that may span multiple pages. emulator_write_phys() seems like the
> > logical choice for this.
> >
> > @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr,
> >  			     unsigned int bytes,
> >  			     struct kvm_vcpu *vcpu
>
> I think that emulator_write_emulated(), except for being awkwardly
> named, should do the job. We have enough APIs.

I'll drop this patch. I mistakenly thought emulator_write_emulated() took a
GPA. Thanks!

Regards,

Anthony Liguori

> But! We may not overwrite the hypercall instruction while a vcpu may be
> executing, since there's no atomicity guarantee for code fetch. We have
> to be out of guest mode while writing that insn.
From: Anthony L. <ali...@us...> - 2007-08-27 19:29:34
On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> > Since a hypercall may span two pages and is a gva, we need a function to write
> > to a gva that may span multiple pages. emulator_write_phys() seems like the
> > logical choice for this.
> >
> > @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr,
> >  			     unsigned int bytes,
> >  			     struct kvm_vcpu *vcpu
>
> I think that emulator_write_emulated(), except for being awkwardly
> named, should do the job. We have enough APIs.
>
> But! We may not overwrite the hypercall instruction while a vcpu may be
> executing, since there's no atomicity guarantee for code fetch. We have
> to be out of guest mode while writing that insn.

Hrm, good catch.

How can we get out of guest mode given SMP guest support?

Regards,

Anthony Liguori
From: Anthony L. <ali...@us...> - 2007-08-27 17:24:53
diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index f952493..ceacc66 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -237,6 +237,13 @@ config VMI at the moment), by linking the kernel to a GPL-ed ROM module provided by the hypervisor. +config KVM_GUEST + bool "KVM paravirt-ops support" + depends on PARAVIRT + help + This option enables various optimizations for running under the KVM + hypervisor. + config ACPI_SRAT bool default y diff --git a/arch/i386/kernel/Makefile b/arch/i386/kernel/Makefile index 9d33b00..6308fcf 100644 --- a/arch/i386/kernel/Makefile +++ b/arch/i386/kernel/Makefile @@ -42,6 +42,7 @@ obj-$(CONFIG_K8_NB) += k8.o obj-$(CONFIG_MGEODE_LX) += geode.o obj-$(CONFIG_VMI) += vmi.o vmiclock.o +obj-$(CONFIG_KVM_GUEST) += kvm.o obj-$(CONFIG_PARAVIRT) += paravirt.o obj-y += pcspeaker.o diff --git a/arch/i386/kernel/kvm.c b/arch/i386/kernel/kvm.c new file mode 100644 index 0000000..26585a4 --- /dev/null +++ b/arch/i386/kernel/kvm.c @@ -0,0 +1,46 @@ +/* + * KVM paravirt_ops implementation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. 
+ * + * Copyright IBM Corporation, 2007 + * Authors: Anthony Liguori <ali...@us...> + */ + +#include <linux/kernel.h> +#include <linux/kvm_para.h> + +/* + * No need for any "IO delay" on KVM + */ +static void kvm_io_delay(void) +{ +} + +static __init void paravirt_setup(void) +{ + paravirt_ops.name = "KVM"; + + if (kvm_para_has_feature(KVM_PARA_FEATURE_NOP_IO_DELAY)) + paravirt_ops.io_delay = kvm_io_delay; + + paravirt_ops.paravirt_enabled = 1; +} + +void __init kvm_guest_init(void) +{ + if (kvm_para_available()) + paravirt_setup(); +} diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h index e92ea3e..16e51a1 100644 --- a/include/linux/kvm_para.h +++ b/include/linux/kvm_para.h @@ -3,6 +3,14 @@ #include <asm/processor.h> +#ifdef CONFIG_KVM_GUEST +void __init kvm_guest_init(void); +#else +static inline void kvm_guest_init(void) +{ +} +#endif + #define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1" static inline long kvm_hypercall0(unsigned int nr) diff --git a/init/main.c b/init/main.c index d3bcb3b..b224905 100644 --- a/init/main.c +++ b/init/main.c @@ -55,6 +55,7 @@ #include <linux/pid_namespace.h> #include <linux/device.h> #include <linux/kthread.h> +#include <linux/kvm_para.h> #include <asm/io.h> #include <asm/bugs.h> @@ -569,6 +570,7 @@ asmlinkage void __init start_kernel(void) } sort_main_extable(); trap_init(); + kvm_guest_init(); rcu_init(); init_IRQ(); pidhash_init(); |
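[Editor's note: the hook-swapping that paravirt_setup() performs can be sketched without any kernel plumbing. In the sketch below, struct pv_ops and the call counter are hypothetical stand-ins for the real paravirt_ops structure; only the feature test and field assignments mirror the patch:]

```c
#include <assert.h>

#define KVM_PARA_FEATURE_NOP_IO_DELAY 0

/* Toy stand-in for the kernel's paravirt_ops structure. */
struct pv_ops {
	const char *name;
	void (*io_delay)(void);
	int paravirt_enabled;
};

static int native_delays;

/* Default hook: a bare-metal kernel pokes port 0x80 here; we just
 * count calls so the test can observe which hook ran. */
static void native_io_delay(void) { native_delays++; }

/* Under KVM the port write is pointless, so the hook becomes a no-op. */
static void kvm_io_delay(void) { }

static struct pv_ops paravirt_ops = { "native", native_io_delay, 0 };

/* Mirror of paravirt_setup(); `features` plays the role of the
 * leaf-0x40000003 EAX value returned by the host. */
static void paravirt_setup(unsigned int features)
{
	paravirt_ops.name = "KVM";

	if (features & (1u << KVM_PARA_FEATURE_NOP_IO_DELAY))
		paravirt_ops.io_delay = kvm_io_delay;

	paravirt_ops.paravirt_enabled = 1;
}
```

After setup with the feature bit set, every io_delay() call through the ops table falls through to the no-op, which is the entire optimization the patch enables.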
From: Avi K. <av...@qu...> - 2007-08-27 17:29:21
Anthony Liguori wrote:
> On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> Since a hypercall may span two pages and is a gva, we need a function to write
>>> to a gva that may span multiple pages. emulator_write_phys() seems like the
>>> logical choice for this.
>>>
>>> @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr,
>>>  			     unsigned int bytes,
>>>  			     struct kvm_vcpu *vcpu
>>>
>> I think that emulator_write_emulated(), except for being awkwardly
>> named, should do the job. We have enough APIs.
>>
>> But! We may not overwrite the hypercall instruction while a vcpu may be
>> executing, since there's no atomicity guarantee for code fetch. We have
>> to be out of guest mode while writing that insn.
>
> Hrm, good catch.
>
> How can we get out of guest mode given SMP guest support?

kvm_flush_remote_tlbs() is something that can be generalized. Basically, you
set a bit in each vcpu and send an IPI to take them out.

But that's deadlock prone and complex. Maybe you can just take kvm->lock, zap
the mmu and then flush tlbs, and patch the instruction at your leisure, as no
vcpu will be able to map memory until the lock is released.

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: Anthony L. <ali...@us...> - 2007-08-27 17:39:57
On Mon, 2007-08-27 at 20:26 +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> > On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote:
> >> Anthony Liguori wrote:
> >>> Since a hypercall may span two pages and is a gva, we need a function to write
> >>> to a gva that may span multiple pages. emulator_write_phys() seems like the
> >>> logical choice for this.
> >>>
> >>> @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr,
> >>>  			     unsigned int bytes,
> >>>  			     struct kvm_vcpu *vcpu
> >>>
> >> I think that emulator_write_emulated(), except for being awkwardly
> >> named, should do the job. We have enough APIs.
> >>
> >> But! We may not overwrite the hypercall instruction while a vcpu may be
> >> executing, since there's no atomicity guarantee for code fetch. We have
> >> to be out of guest mode while writing that insn.
> >
> > Hrm, good catch.
> >
> > How can we get out of guest mode given SMP guest support?
>
> kvm_flush_remote_tlbs() is something that can be generalized.
> Basically, you set a bit in each vcpu and send an IPI to take them out.
>
> But that's deadlock prone and complex. Maybe you can just take
> kvm->lock, zap the mmu and then flush tlbs, and patch the instruction at
> your leisure, as no vcpu will be able to map memory until the lock is
> released.

This works for shadow paging but not necessarily with NPT.

Do code fetches really not respect atomic writes? We could switch to a 32-bit
atomic operation and that should result in no worse than the code being
patched twice.

Regards,

Anthony Liguori
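[Editor's note: for reference, the two encodings being patched between — vmcall (0f 01 c1) and vmmcall (0f 01 d9), as emitted by the svm/vmx patch_hypercall hooks in the patch — differ only in their final byte, so a single byte store would suffice *if* byte stores were fetch-atomic; whether they are is exactly the open question in this subthread. The patch_hypercall() below is an illustrative user-space helper, not the kernel function:]

```c
#include <assert.h>
#include <string.h>

static const unsigned char vmcall[3]  = { 0x0f, 0x01, 0xc1 };	/* Intel VT */
static const unsigned char vmmcall[3] = { 0x0f, 0x01, 0xd9 };	/* AMD SVM  */

/* Rewrite a hypercall site in place.  The two encodings share their
 * first two bytes, so only insn[2] ever changes. */
static void patch_hypercall(unsigned char *insn, int use_vmmcall)
{
	insn[2] = use_vmmcall ? 0xd9 : 0xc1;
}
```

The shared prefix is what makes the "patch lazily on #UD" scheme attractive: a guest built with either encoding is always at most one byte away from the one its host needs.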
From: Avi K. <av...@qu...> - 2007-08-27 17:49:41
|
Anthony Liguori wrote:
> On Mon, 2007-08-27 at 20:26 +0300, Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> On Mon, 2007-08-27 at 18:45 +0300, Avi Kivity wrote:
>>>> Anthony Liguori wrote:
>>>>> Since a hypercall may span two pages and is a gva, we need a function to write
>>>>> to a gva that may span multiple pages.  emulator_write_phys() seems like the
>>>>> logical choice for this.
>>>>>
>>>>> @@ -962,8 +962,35 @@ static int emulator_write_std(unsigned long addr,
>>>>>                                  unsigned int bytes,
>>>>>                                  struct kvm_vcpu *vcpu
>>>> I think that emulator_write_emulated(), except for being awkwardly
>>>> named, should do the job.  We have enough APIs.
>>>>
>>>> But! We may not overwrite the hypercall instruction while a vcpu may be
>>>> executing, since there's no atomicity guarantee for code fetch.  We have
>>>> to be out of guest mode while writing that insn.
>>>
>>> Hrm, good catch.
>>>
>>> How can we get out of guest mode given SMP guest support?
>>
>> kvm_flush_remote_tlbs() is something that can be generalized.
>> Basically, you set a bit in each vcpu and send an IPI to take them out.
>>
>> But that's deadlock-prone and complex.  Maybe you can just take
>> kvm->lock, zap the mmu and then flush tlbs, and patch the instruction at
>> your leisure, as no vcpu will be able to map memory until the lock is
>> released.
>
> This works for shadow paging but not necessarily with NPT.

NPT will have tlb and 2nd-level pagetable flushing (to support memory
unplug, ballooning, guest swapping, and the like), so it's safe to assume
that it will continue to work.

> Do code fetches really not respect atomic writes?  We could switch to a
> 32-bit atomic operation and that should result in no worse than the code
> being patched twice.

I think they don't respect it (the data cache and insn cache and fetch
pipeline are very separated for performance reasons, so it's not
unexpected).  I think the Intel manual has some guidelines on
self-modifying code.

-- 
Any sufficiently difficult bug is indistinguishable from a feature. |
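The 32-bit atomic store Anthony proposes can be illustrated in user space. The vmcall (0f 01 c1) and vmmcall (0f 01 d9) encodings are real three-byte instructions, but the single-nop padding and helper name are assumptions for this sketch — and, per Avi's point about code fetch, even an atomic store only bounds the damage to a vcpu executing the stale instruction once.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* 3-byte hypercall encodings plus one nop pad byte, so the whole
 * replacement fits in a single aligned 32-bit store. */
static const uint8_t vmcall_pad[4]  = { 0x0f, 0x01, 0xc1, 0x90 }; /* Intel */
static const uint8_t vmmcall_pad[4] = { 0x0f, 0x01, 0xd9, 0x90 }; /* AMD */

/* Hypothetical helper: replace the instruction with one 32-bit store.
 * 'slot' must be 4-byte aligned for the store to be atomic on x86. */
static void patch_insn(_Atomic uint32_t *slot, const uint8_t bytes[4])
{
	uint32_t word;

	memcpy(&word, bytes, sizeof(word));
	atomic_store(slot, word);
}
```

A concurrent code fetch that already consumed the old bytes simply re-executes the wrong-vendor hypercall, traps #UD again, and gets patched a second time — hence "no worse than the code being patched twice."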
From: Alexey E. <al...@gm...> - 2007-08-27 17:49:17
|
Anthony Liguori wrote:
> I've tested this on VT and it works nicely.

Excuse me, but you said several months ago something about "kvm-xen",
which doesn't require VT CPUs.

Will you publish or merge that technology?  (as I don't have a
VT-capable CPU at home)

=============================================================
Anyway, I'm very happy to hear about KVM adding paravirtualization to
the portfolio.  Now the features are close to Xen's.

Namely, Xen has:

  Linux host 32/64-bit
  Guests 32/64-bit
  Full virtualization requires VT (guests: any)
  Para-virtualization works on any x86 processor (guests: Linux)
  Host/guest SMP

KVM will also have all of the above now, with your patch.  But will it
be able to work on non-VT CPUs?

-- 
-Alexey Eremenko "Technologov" |
From: Anthony L. <ali...@us...> - 2007-08-28 17:26:15
|
On Mon, 2007-08-27 at 20:49 +0300, Alexey Eremenko wrote:
> Anthony Liguori wrote:
>> I've tested this on VT and it works nicely.
>
> Excuse me, but you said several months ago something about "kvm-xen",
> which doesn't require VT CPUs.
>
> Will you publish or merge that technology?  (as I don't have a
> VT-capable CPU at home)

Rusty Russell is working on something better called kvm-lite.  He's
giving a talk at this week's KVM forum and I suspect that we'll see
patches soon.

Regards,

Anthony Liguori

> =============================================================
> Anyway, I'm very happy to hear about KVM adding paravirtualization to
> the portfolio.  Now the features are close to Xen's.
>
> Namely, Xen has:
>
>   Linux host 32/64-bit
>   Guests 32/64-bit
>   Full virtualization requires VT (guests: any)
>   Para-virtualization works on any x86 processor (guests: Linux)
>   Host/guest SMP
>
> KVM will also have all of the above now, with your patch.  But will it
> be able to work on non-VT CPUs?
>
> --
> -Alexey Eremenko "Technologov"
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> kvm-devel mailing list
> kvm...@li...
> https://lists.sourceforge.net/lists/listinfo/kvm-devel |
From: Rusty R. <ru...@ru...> - 2007-08-28 18:13:20
|
On Mon, 2007-08-27 at 10:16 -0500, Anthony Liguori wrote:
> This patch refactors the current hypercall infrastructure to better support live
> migration and SMP.  It eliminates the hypercall page by trapping the UD
> exception that would occur if you used the wrong hypercall instruction for the
> underlying architecture and replacing it with the right one lazily.

It also reduces the number of hypercall args, which you don't mention
here.

> +	er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0);
> +
> +	/* we should only succeed here in the case of hypercalls which
> +	   cannot generate an MMIO event.  MMIO means that the emulator
> +	   is mistakenly allowing an instruction that should generate
> +	   a UD fault so it's a bug. */
> +	BUG_ON(er == EMULATE_DO_MMIO);

This seems... unwise.  Firstly, we know our emulator is incomplete.
Secondly, an SMP guest can exploit this to crash the host.

(Code is in two places.)

> +#define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1"

A nice big comment would be nice here, I think.  Note that this is big
enough for both "int $0x1f" and "sysenter", so I'm happy.

Cheers,
Rusty. |
From: Anthony L. <an...@co...> - 2007-08-29 05:51:18
|
On Wed, 2007-08-29 at 04:12 +1000, Rusty Russell wrote:
> On Mon, 2007-08-27 at 10:16 -0500, Anthony Liguori wrote:
>> This patch refactors the current hypercall infrastructure to better support live
>> migration and SMP.  It eliminates the hypercall page by trapping the UD
>> exception that would occur if you used the wrong hypercall instruction for the
>> underlying architecture and replacing it with the right one lazily.
>
> It also reduces the number of hypercall args, which you don't mention
> here.

Oh yes, sorry.

>> +	er = emulate_instruction(&svm->vcpu, kvm_run, 0, 0);
>> +
>> +	/* we should only succeed here in the case of hypercalls which
>> +	   cannot generate an MMIO event.  MMIO means that the emulator
>> +	   is mistakenly allowing an instruction that should generate
>> +	   a UD fault so it's a bug. */
>> +	BUG_ON(er == EMULATE_DO_MMIO);
>
> This seems... unwise.  Firstly, we know our emulator is incomplete.
> Secondly, an SMP guest can exploit this to crash the host.

This code is gone in v2.

> (Code is in two places.)
>
>> +#define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1"

Good point.

> A nice big comment would be nice here, I think.  Note that this is big
> enough for both "int $0x1f" and "sysenter", so I'm happy.

I need to add a comment somewhere mentioning that if you patch with
something less than 3 bytes, then you should pad with nops, but the
hypervisor must treat the whole instruction (including the padding) as
atomic (that is, regardless of hypercall size, eip += 3) or you run the
risk of breakage during migration.

Regards,

Anthony Liguori |
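The contract Anthony describes — always advance eip by the full three bytes, whatever was actually patched in — can be sketched with a user-space stand-in for the #UD fixup path. The opcode bytes are the real vmcall/vmmcall encodings; the function name and return convention are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HYPERCALL_LEN 3	/* both vmcall and vmmcall are 3 bytes */

static const uint8_t vmcall[HYPERCALL_LEN]  = { 0x0f, 0x01, 0xc1 };
static const uint8_t vmmcall[HYPERCALL_LEN] = { 0x0f, 0x01, 0xd9 };

/* Hypothetical #UD handler: if the faulting bytes are either vendor's
 * hypercall, rewrite them to the native encoding and report a fixed
 * advance of HYPERCALL_LEN so migrated guests stay in sync even if a
 * shorter sequence plus nop padding were ever patched in.
 * Returns 0 if the instruction is not a known hypercall. */
static size_t fixup_hypercall(uint8_t *insn, const uint8_t *native)
{
	if (memcmp(insn, vmcall, HYPERCALL_LEN) &&
	    memcmp(insn, vmmcall, HYPERCALL_LEN))
		return 0;	/* genuine #UD: reinject to the guest */
	memcpy(insn, native, HYPERCALL_LEN);
	return HYPERCALL_LEN;	/* eip += 3, padding included */
}
```

A guest that migrates from VT to SVM mid-stream then simply re-faults on its first vmcall and gets rewritten to vmmcall, which is the lazy patching described in the patch description.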
From: Rusty R. <ru...@ru...> - 2007-08-28 18:31:52
|
On Mon, 2007-08-27 at 10:16 -0500, Anthony Liguori wrote:
> @@ -569,6 +570,7 @@ asmlinkage void __init start_kernel(void)
> 	}
> 	sort_main_extable();
> 	trap_init();
> +	kvm_guest_init();
> 	rcu_init();
> 	init_IRQ();
> 	pidhash_init();

Hi Anthony,

This placement seems arbitrary.  Why not earlier from setup_arch, or as
a normal initcall?

Rusty. |
From: Anthony L. <an...@co...> - 2007-08-29 05:53:25
|
On Wed, 2007-08-29 at 04:31 +1000, Rusty Russell wrote:
> On Mon, 2007-08-27 at 10:16 -0500, Anthony Liguori wrote:
>> @@ -569,6 +570,7 @@ asmlinkage void __init start_kernel(void)
>> 	}
>> 	sort_main_extable();
>> 	trap_init();
>> +	kvm_guest_init();
>> 	rcu_init();
>> 	init_IRQ();
>> 	pidhash_init();
>
> Hi Anthony,
>
> This placement seems arbitrary.  Why not earlier from setup_arch, or as
> a normal initcall?

The placement is important if we wish to have a paravirt_ops hook for
the interrupt controller.  This is the latest possible spot we can do
it.  A comment is probably appropriate here.

Regards,

Anthony Liguori |
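kvm_guest_init() presumably begins by probing the hypervisor CPUID range mentioned in the patch description (leaves 0x40000002 and 0x40000003). The signature layout below is an assumption for illustration — the patchset's actual ABI may differ — and the cpuid results are passed in as a struct so the sketch can run outside a KVM guest.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/* Hypothetical detection: the hypervisor-reserved CPUID range starts at
 * leaf 0x40000000, where eax reports the highest supported leaf and
 * ebx/ecx/edx carry a 12-byte vendor signature.  The "KVMKVMKVM"
 * signature and the minimum-leaf check are assumptions for this sketch. */
static bool kvm_paravirt_present(const struct cpuid_regs *base)
{
	char sig[12];

	memcpy(sig + 0, &base->ebx, 4);
	memcpy(sig + 4, &base->ecx, 4);
	memcpy(sig + 8, &base->edx, 4);

	return memcmp(sig, "KVMKVMKVM\0\0\0", 12) == 0 &&
	       base->eax >= 0x40000003;
}
```

Only after this probe succeeds would it make sense to install paravirt_ops hooks, which is why the call has to sit before init_IRQ() as discussed above.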