From: Glauber C. <gc...@re...> - 2008-05-15 14:09:59

We separate kqemu_init() into a part that depends on env and another that
does not. The latter can be initialized earlier.

---
 exec.c                |  3 +++
 kqemu.c               | 10 +++++++---
 target-i386/helper2.c |  2 +-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/exec.c b/exec.c
index 5384460..dfedfc3 100644
--- a/exec.c
+++ b/exec.c
@@ -334,6 +334,9 @@ void exec_init(void)
     code_gen_ptr = code_gen_buffer;
     page_init();
     io_mem_init();
+#ifdef USE_KQEMU
+    kqemu_start();
+#endif
 }

 void cpu_exec_init(CPUState *env)
diff --git a/kqemu.c b/kqemu.c
index 88592ee..0e38d52 100644
--- a/kqemu.c
+++ b/kqemu.c
@@ -159,7 +159,7 @@ static void kqemu_update_cpuid(CPUState *env)
        accelerated code */
 }

-int kqemu_init(CPUState *env)
+int kqemu_start(void)
 {
     struct kqemu_init init;
     int ret, version;
@@ -238,13 +238,17 @@ int kqemu_init(CPUState *env)
         kqemu_fd = KQEMU_INVALID_FD;
         return -1;
     }
-    kqemu_update_cpuid(env);
-    env->kqemu_enabled = kqemu_allowed;
     nb_pages_to_flush = 0;
     nb_ram_pages_to_update = 0;
     return 0;
 }

+void kqemu_init_env(CPUState *env)
+{
+    kqemu_update_cpuid(env);
+    env->kqemu_enabled = kqemu_allowed;
+}
+
 void kqemu_flush_page(CPUState *env, target_ulong addr)
 {
 #if defined(DEBUG)
diff --git a/target-i386/helper2.c b/target-i386/helper2.c
index 6cf218f..1c0fcdb 100644
--- a/target-i386/helper2.c
+++ b/target-i386/helper2.c
@@ -113,7 +113,7 @@ CPUX86State *cpu_x86_init(const char *cpu_model)
     }
     cpu_reset(env);
 #ifdef USE_KQEMU
-    kqemu_init(env);
+    kqemu_init_env(env);
 #endif
     return env;
 }
--
1.5.5
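The resulting init ordering can be summarized in a compressed sketch using
the function names from the patch; the surrounding code here is elided and
approximated, not literal QEMU source:

    /* Sketch of the init order after the patch; cpu_alloc() is a
     * stand-in for the real per-CPU allocation. */
    void exec_init(void)            /* machine-wide, runs once, no CPU yet */
    {
        /* ... page_init(), io_mem_init() ... */
    #ifdef USE_KQEMU
        kqemu_start();              /* env-independent: open device, etc. */
    #endif
    }

    CPUX86State *cpu_x86_init(const char *cpu_model)   /* per CPU, later */
    {
        CPUX86State *env = cpu_alloc();
        cpu_reset(env);
    #ifdef USE_KQEMU
        kqemu_init_env(env);        /* env-dependent: cpuid, enable flag */
    #endif
        return env;
    }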
From: Anthony L. <an...@co...> - 2008-05-15 14:09:58

Chris Wright wrote:
> * Anthony Liguori (an...@co...) wrote:
>> From a quick look, I suspect that the number of wildly off TSC
>> calibrations correspond to the VMs that are misbehaving. I think this
>> may mean that we have to re-examine the tsc delta computation.
>>
>> 10_serial.log:time.c: Detected 1995.038 MHz processor.
>> 11_serial.log:time.c: Detected 2363.195 MHz processor.
>> 12_serial.log:time.c: Detected 2492.675 MHz processor.
>> 13_serial.log:time.c: Detected 1995.061 MHz processor.
>> 14_serial.log:time.c: Detected 1994.917 MHz processor.
>> 15_serial.log:time.c: Detected 4100.735 MHz processor.
>> 16_serial.log:time.c: Detected 2075.800 MHz processor.
>> 17_serial.log:time.c: Detected 2674.350 MHz processor.
>> 18_serial.log:time.c: Detected 1995.002 MHz processor.
>> 19_serial.log:time.c: Detected 1994.978 MHz processor.
>> 1_serial.log:time.c: Detected 4384.310 MHz processor.
>
> Is this with pinning? We at least know we're losing small bits on
> migration. From my measurements it's ~3000 (outliers are 10-20k).
>
> Also, what happens if you roll back to kvm-userspace 7f5c4d15ece5?
>
> I'm using this:
>
> diff -up arch/x86/kvm/svm.c~svm arch/x86/kvm/svm.c
> --- arch/x86/kvm/svm.c~svm    2008-04-16 19:49:44.000000000 -0700
> +++ arch/x86/kvm/svm.c        2008-05-14 23:44:18.000000000 -0700
> @@ -621,6 +621,13 @@ static void svm_free_vcpu(struct kvm_vcp
>      kmem_cache_free(kvm_vcpu_cache, svm);
>  }
>
> +static void svm_tsc_update(void *arg)
> +{
> +    struct vcpu_svm *svm = arg;
> +    rdtscll(svm->vcpu.arch.host_tsc);
> +
> +}
> +
>  static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  {
>      struct vcpu_svm *svm = to_svm(vcpu);
> @@ -633,6 +640,9 @@ static void svm_vcpu_load(struct kvm_vcp
>       * Make sure that the guest sees a monotonically
>       * increasing TSC.
>       */
> +    if (vcpu->cpu != -1)
> +        smp_call_function_single(vcpu->cpu, svm_tsc_update,
> +                                 svm, 0, 1);

I like this approach because of its simplicity, although the IPI is not
wonderful. I was also thinking of using cpu_clock() to take a timestamp on
vcpu_put, then on vcpu_load, take another timestamp and use the cyc2ns
conversion to try and estimate the elapsed tsc ticks on the new cpu.

Regards,

Anthony Liguori

>      rdtscll(tsc_this);
>      delta = vcpu->arch.host_tsc - tsc_this;
>      svm->vmcb->control.tsc_offset += delta;
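A rough sketch of the cyc2ns-style alternative Anthony mentions; cpu_clock()
and tsc_khz are real kernel interfaces of this era, but the bookkeeping
fields (put_ns) and the adjust helper are hypothetical illustration:

    /* Sketch only: estimate TSC ticks elapsed while the vcpu was off-cpu,
     * instead of re-reading the old CPU's TSC with an IPI. */
    static void sketch_vcpu_put(struct kvm_vcpu *vcpu)
    {
        vcpu->arch.put_ns = cpu_clock(smp_processor_id()); /* ns timestamp */
        rdtscll(vcpu->arch.host_tsc);
    }

    static void sketch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
    {
        u64 elapsed_ns = cpu_clock(cpu) - vcpu->arch.put_ns;
        /* ticks = ns * kHz / 1e6 */
        u64 elapsed_tsc = elapsed_ns * tsc_khz / 1000000ULL;
        u64 tsc_this;

        rdtscll(tsc_this);
        /* offset so the guest still sees a monotonic TSC;
         * adjust_tsc_offset() is a made-up name for the vmcb update */
        adjust_tsc_offset(vcpu, vcpu->arch.host_tsc + elapsed_tsc - tsc_this);
    }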
From: Glauber C. <gc...@re...> - 2008-05-15 14:09:58

This patch introduces QEMUAccel, a placeholder for function pointers that
aims at helping qemu to abstract accelerators such as kqemu and kvm
(actually, the 'accelerator' name was proposed by avi kivity, since he loves
referring to kvm that way).

To begin with, the accelerator is given the opportunity to register a
cpu_interrupt function, to be called after the raw cpu_interrupt. This has
the side effect of, for the kqemu accelerator, calling kqemu_cpu_interrupt
every time, which didn't use to happen. But looking at the code, this seems
safe to me.

This patch applies on raw qemu.

---
 block-raw-posix.c |  5 -----
 exec-all.h        | 18 +++++++++++++++++-
 exec.c            |  2 ++
 kqemu.c           | 27 +++++++++++++++++----------
 vl.c              |  6 +-----
 5 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/block-raw-posix.c b/block-raw-posix.c
index 6b0009e..61c23ba 100644
--- a/block-raw-posix.c
+++ b/block-raw-posix.c
@@ -250,11 +250,6 @@ static void aio_signal_handler(int signum)
     if (env) {
         /* stop the currently executing cpu because a timer occured */
         cpu_interrupt(env, CPU_INTERRUPT_EXIT);
-#ifdef USE_KQEMU
-        if (env->kqemu_enabled) {
-            kqemu_cpu_interrupt(env);
-        }
-#endif
     }
 #endif
 }
diff --git a/exec-all.h b/exec-all.h
index 8c32858..7b2d97d 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -578,6 +578,23 @@ static inline target_ulong get_phys_addr_code(CPUState *env1, target_ulong addr)
 }
 #endif

+typedef struct QEMUAccel {
+    void (*cpu_interrupt)(CPUState *env);
+} QEMUAccel;
+
+extern QEMUAccel *current_accel;
+
+static inline void register_qemu_accel(QEMUAccel *accel)
+{
+    current_accel = accel;
+}
+
+static inline void accel_cpu_interrupt(CPUState *env)
+{
+    if (current_accel && current_accel->cpu_interrupt)
+        current_accel->cpu_interrupt(env);
+}
+
 #ifdef USE_KQEMU
 #define KQEMU_MODIFY_PAGE_MASK (0xff & ~(VGA_DIRTY_FLAG | CODE_DIRTY_FLAG))
@@ -587,7 +604,6 @@ void kqemu_flush_page(CPUState *env, target_ulong addr);
 void kqemu_flush(CPUState *env, int global);
 void kqemu_set_notdirty(CPUState *env, ram_addr_t ram_addr);
 void kqemu_modify_page(CPUState *env, ram_addr_t ram_addr);
-void kqemu_cpu_interrupt(CPUState *env);
 void kqemu_record_dump(void);

 static inline int kqemu_is_ok(CPUState *env)
diff --git a/exec.c b/exec.c
index dfedfc3..73360d3 100644
--- a/exec.c
+++ b/exec.c
@@ -1256,6 +1256,8 @@ void cpu_interrupt(CPUState *env, int mask)
         tb_reset_jump_recursive(tb);
         resetlock(&interrupt_lock);
     }
+
+    accel_cpu_interrupt(env);
 }

 void cpu_reset_interrupt(CPUState *env, int mask)
diff --git a/kqemu.c b/kqemu.c
index 0e38d52..f875e0e 100644
--- a/kqemu.c
+++ b/kqemu.c
@@ -159,6 +159,8 @@ static void kqemu_update_cpuid(CPUState *env)
        accelerated code */
 }

+QEMUAccel kqemu_accel;
+
 int kqemu_start(void)
 {
     struct kqemu_init init;
@@ -240,6 +242,7 @@ int kqemu_start(void)
     }
     nb_pages_to_flush = 0;
     nb_ram_pages_to_update = 0;
+    register_qemu_accel(&kqemu_accel);
     return 0;
 }

@@ -249,6 +252,20 @@ void kqemu_init_env(CPUState *env)
 {
     kqemu_update_cpuid(env);
     env->kqemu_enabled = kqemu_allowed;
 }

+void kqemu_cpu_interrupt(CPUState *env)
+{
+#if defined(_WIN32) && KQEMU_VERSION >= 0x010101
+    /* cancelling the I/O request causes KQEMU to finish executing the
+       current block and successfully returning. */
+    CancelIo(kqemu_fd);
+#endif
+}
+
+QEMUAccel kqemu_accel = {
+    .cpu_interrupt = kqemu_cpu_interrupt,
+};
+
+
 void kqemu_flush_page(CPUState *env, target_ulong addr)
 {
 #if defined(DEBUG)
@@ -906,14 +923,4 @@ int kqemu_cpu_exec(CPUState *env)
     }
     return 0;
 }
-
-void kqemu_cpu_interrupt(CPUState *env)
-{
-#if defined(_WIN32) && KQEMU_VERSION >= 0x010101
-    /* cancelling the I/O request causes KQEMU to finish executing the
-       current block and successfully returning. */
-    CancelIo(kqemu_fd);
-#endif
-}
-
 #endif
diff --git a/vl.c b/vl.c
index 5999b37..26c1677 100644
--- a/vl.c
+++ b/vl.c
@@ -239,6 +239,7 @@ struct drive_opt {
 static CPUState *cur_cpu;
 static CPUState *next_cpu;
 static int event_pending = 1;
+QEMUAccel *current_accel;

 #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR)
@@ -1199,11 +1200,6 @@ static void host_alarm_handler(int host_signum)
     if (env) {
         /* stop the currently executing cpu because a timer occured */
         cpu_interrupt(env, CPU_INTERRUPT_EXIT);
-#ifdef USE_KQEMU
-        if (env->kqemu_enabled) {
-            kqemu_cpu_interrupt(env);
-        }
-#endif
     }
     event_pending = 1;
 }
--
1.5.5
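To see how a second accelerator would plug into this, here is a hypothetical
registration for KVM using only the hooks the patch defines; the kvm_*
callback name and body are made up for illustration:

    /* Hypothetical second user of the QEMUAccel hooks above. */
    static void kvm_accel_cpu_interrupt(CPUState *env)
    {
        /* kick the vcpu out of guest mode, e.g. signal its thread */
    }

    static QEMUAccel kvm_accel = {
        .cpu_interrupt = kvm_accel_cpu_interrupt,
    };

    /* during startup, once KVM is known to be usable: */
    /* register_qemu_accel(&kvm_accel); */

After registration, every cpu_interrupt() call reaches the accelerator via
accel_cpu_interrupt() with no #ifdef in the generic code.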
From: Glauber C. <gc...@re...> - 2008-05-15 14:09:58

Hi guys,

This is a new version of the QEMUAccel work. To start with, I decided to
keep the name for now. We don't have that many functions that are not
cpu-related to justify splitting the structure into many, and this is one of
the less confusing names we came up with.

The code I'm posting is tested with kqemu for both i386 and x86_64, and it
works. So, if you guys feel like it, I can say it's ready for inclusion
(which obviously does not mean bug-free). It is not complete, however: there
are still some pieces of kqemu code that do not work, especially the
interrupt code in cpu-exec.c, which relies on the tricky longjmp.

Comments are very welcome.
From: Anthony L. <an...@co...> - 2008-05-15 13:57:07

Avi Kivity wrote:
> Anthony Liguori wrote:
>> FWIW, virtio-net is much better with my patches applied.
>
> The can_receive patches?
>
> Again, I'm not opposed to them in principle, I just think that if they
> help that this points at a virtio deficiency. Virtio should never
> leave the rx queue empty. Consider the case where the virtio queue
> isn't tied to a socket buffer, but directly to hardware.

For RX performance:

  right now             [ 3]  0.0-10.0 sec  1016 MBytes    852 Mbits/sec
  revert tap hack       [ 3]  0.0-10.0 sec   564 MBytes    473 Mbits/sec
  all patches applied   [ 3]  0.0-10.0 sec  1.17 GBytes   1.01 Gbits/sec
  drop lots of packets  [ 3]  0.0-10.0 sec  1.05 GBytes    905 Mbits/sec

The last patch is not in my series, but it basically makes the ring size 512
and drops packets when we run out of descriptors. That was to validate that
we're not hiding a virtio deficiency.

The reason I want to buffer packets is that it avoids having to deal with
tuning. For vringfd/vmdq, we'll have to make sure to get the tuning right
though.

Regards,

Anthony Liguori
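A minimal sketch of the "drop when out of descriptors" experiment described
in the last data point; the ring type and helpers here are simplified
stand-ins, not the real virtio or tap code:

    /* Stand-in for the experiment: a 512-entry rx ring that drops packets
     * instead of buffering them in the host when no descriptor is free. */
    #define RX_RING_SIZE 512

    struct rx_ring;                                   /* opaque, hypothetical */
    int  ring_has_free_desc(struct rx_ring *ring);    /* hypothetical helper */
    void ring_push(struct rx_ring *ring, const void *pkt, unsigned len);
    extern unsigned long rx_dropped;

    static void rx_packet(struct rx_ring *ring, const void *pkt, unsigned len)
    {
        if (!ring_has_free_desc(ring)) {
            rx_dropped++;   /* dropping exposes whether the guest keeps
                               the ring replenished, instead of hiding it
                               behind host-side buffering */
            return;
        }
        ring_push(ring, pkt, len);
    }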
From: Laurent V. <Lau...@bu...> - 2008-05-15 13:52:13

This patch is the kernel part of the "batch writes to MMIO" patch. It
introduces the ioctl interface to define MMIO zones that are allowed to be
delayed. Inside a zone, we can define sub-parts that must not be delayed. If
an MMIO can be delayed, it is stored in a ring buffer which is common for
all VCPUs.

Signed-off-by: Laurent Vivier <Lau...@bu...>
---
 arch/x86/kvm/x86.c         | 172 ++++++++++++++++++++++++++++++++++++++++++++
 include/asm-x86/kvm.h      |   7 ++
 include/asm-x86/kvm_host.h |  23 ++++++
 include/linux/kvm.h        |  16 ++++
 virt/kvm/kvm_main.c        |   3 +
 5 files changed, 221 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dab3d4f..930986b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1518,6 +1518,103 @@ out:
     return r;
 }

+static struct kvm_delayed_mmio_zone *kvm_mmio_find_zone(struct kvm *kvm,
+                                                        u64 addr, u32 size)
+{
+    int i;
+    struct kvm_delayed_mmio_zone *zone;
+
+    for (i = 0; i < kvm->arch.nb_mmio_zones; i++) {
+        zone = &kvm->arch.mmio_zone[i];
+
+        /* (addr,size) is fully included in
+         * (zone->addr, zone->size)
+         */
+
+        if (zone->addr <= addr &&
+            addr + size <= zone->addr + zone->size)
+            return zone;
+    }
+    return NULL;
+}
+
+static struct kvm_excluded_mmio_zone *
+kvm_mmio_find_excluded(struct kvm_delayed_mmio_zone *zone, u64 addr, u32 size)
+{
+    static struct kvm_excluded_mmio_zone *excluded;
+    int i;
+
+    addr -= zone->addr;
+    for (i = 0; i < zone->nb_excluded_zones; i++) {
+        excluded = &zone->excluded[i];
+
+        if ((excluded->offset <= addr &&
+             addr < excluded->offset + excluded->size) ||
+            (excluded->offset < addr + size &&
+             addr + size <= excluded->offset + excluded->size))
+            return excluded;
+    }
+    return NULL;
+}
+
+static int kvm_is_delayed_mmio(struct kvm *kvm, u64 addr, u32 size)
+{
+    struct kvm_delayed_mmio_zone *zone;
+    struct kvm_excluded_mmio_zone *excluded;
+
+    zone = kvm_mmio_find_zone(kvm, addr, size);
+    if (zone == NULL)
+        return 0;           /* not a delayed MMIO address */
+
+    excluded = kvm_mmio_find_excluded(zone, addr, size);
+    return excluded == NULL;
+}
+
+static int kvm_vm_ioctl_set_mmio(struct kvm *kvm,
+                                 struct kvm_mmio_zone *zone)
+{
+    struct kvm_delayed_mmio_zone *z;
+
+    if (zone->is_delayed &&
+        kvm->arch.nb_mmio_zones >= KVM_MAX_DELAYED_MMIO_ZONE)
+        return -ENOMEM;
+
+    if (zone->is_delayed) {
+
+        /* already defined ? */
+
+        if (kvm_mmio_find_zone(kvm, zone->addr, 1) ||
+            kvm_mmio_find_zone(kvm, zone->addr + zone->size - 1, 1))
+            return 0;
+
+        z = &kvm->arch.mmio_zone[kvm->arch.nb_mmio_zones];
+        z->addr = zone->addr;
+        z->size = zone->size;
+        kvm->arch.nb_mmio_zones++;
+        return 0;
+    }
+
+    /* exclude some parts of the delayed MMIO zone */
+
+    z = kvm_mmio_find_zone(kvm, zone->addr, zone->size);
+    if (z == NULL)
+        return -EINVAL;
+
+    if (z->nb_excluded_zones >= KVM_MAX_EXCLUDED_MMIO_ZONE)
+        return -ENOMEM;
+
+    if (kvm_mmio_find_excluded(z, zone->addr, 1) ||
+        kvm_mmio_find_excluded(z, zone->addr + zone->size - 1, 1))
+        return 0;
+
+    z->excluded[z->nb_excluded_zones].offset = zone->addr - z->addr;
+    z->excluded[z->nb_excluded_zones].size = zone->size;
+    z->nb_excluded_zones++;
+
+    return 0;
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
                        unsigned int ioctl, unsigned long arg)
 {
@@ -1671,6 +1768,18 @@ long kvm_arch_vm_ioctl(struct file *filp,
         r = 0;
         break;
     }
+    case KVM_SET_MMIO: {
+        struct kvm_mmio_zone zone;
+        r = -EFAULT;
+        if (copy_from_user(&zone, argp, sizeof zone))
+            goto out;
+        r = -ENXIO;
+        r = kvm_vm_ioctl_set_mmio(kvm, &zone);
+        if (r)
+            goto out;
+        r = 0;
+        break;
+    }
     default:
         ;
     }
@@ -2706,6 +2815,52 @@ static void vapic_exit(struct kvm_vcpu *vcpu)
     mark_page_dirty(vcpu->kvm, apic->vapic_addr >> PAGE_SHIFT);
 }

+static int batch_mmio(struct kvm_vcpu *vcpu)
+{
+    struct kvm_batch *batch = vcpu->kvm->arch.batch;
+    spinlock_t *lock = &vcpu->kvm->arch.batch_lock;
+    int next;
+
+    /* check if this MMIO can be delayed */
+
+    if (!kvm_is_delayed_mmio(vcpu->kvm,
+                             vcpu->mmio_phys_addr, vcpu->mmio_size))
+        return 0;
+
+    /* check if ring is full
+     * we have no lock on "first"
+     * as it can only increase we can only have
+     * a false "full".
+     */
+
+    spin_lock(lock);
+
+    /* last is the first free entry
+     * check if we don't meet the first used entry
+     * there is always one unused entry in the buffer
+     */
+
+    next = (batch->last + 1) % KVM_MAX_BATCH;
+    if (next == batch->first) {
+        /* full */
+        spin_unlock(lock);
+        return 0;
+    }
+
+    /* batch it */
+
+    /* copy data in first free entry of the ring */
+
+    batch->mmio[batch->last].phys_addr = vcpu->mmio_phys_addr;
+    batch->mmio[batch->last].len = vcpu->mmio_size;
+    memcpy(batch->mmio[batch->last].data, vcpu->mmio_data, vcpu->mmio_size);
+    batch->last = next;
+
+    spin_unlock(lock);
+
+    return 1;
+}
+
 static int __vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 {
     int r;
@@ -2857,6 +3012,11 @@ again:
         goto again;
     }

+    if (!r &&
+        vcpu->mmio_is_write && kvm_run->exit_reason == KVM_EXIT_MMIO
+        && !need_resched() && batch_mmio(vcpu))
+        goto again;
+
 out:
     up_read(&vcpu->kvm->slots_lock);
     if (r > 0) {
@@ -3856,12 +4016,22 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 struct kvm *kvm_arch_create_vm(void)
 {
     struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+    struct page *page;

     if (!kvm)
         return ERR_PTR(-ENOMEM);

+    page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+    if (!page) {
+        kfree(kvm);
+        return ERR_PTR(-ENOMEM);
+    }
+
     INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+    kvm->arch.batch_lock = __SPIN_LOCK_UNLOCKED(batch_lock);
+    kvm->arch.batch = (struct kvm_batch *)page_address(page);
+
     return kvm;
 }

@@ -3902,6 +4072,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
         put_page(kvm->arch.apic_access_page);
     if (kvm->arch.ept_identity_pagetable)
         put_page(kvm->arch.ept_identity_pagetable);
+    if (kvm->arch.batch)
+        free_page((unsigned long)kvm->arch.batch);
     kfree(kvm);
 }

diff --git a/include/asm-x86/kvm.h b/include/asm-x86/kvm.h
index 6f18408..3c4a611 100644
--- a/include/asm-x86/kvm.h
+++ b/include/asm-x86/kvm.h
@@ -209,6 +209,13 @@ struct kvm_pit_state {
     struct kvm_pit_channel_state channels[3];
 };

+struct kvm_mmio_zone {
+    __u8  is_delayed;
+    __u8  pad[3];
+    __u32 size;
+    __u64 addr;
+};
+
 #define KVM_TRC_INJ_VIRQ      (KVM_TRC_HANDLER + 0x02)
 #define KVM_TRC_REDELIVER_EVT (KVM_TRC_HANDLER + 0x03)
 #define KVM_TRC_PEND_INTR     (KVM_TRC_HANDLER + 0x04)
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 1466c3f..df42cdb 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -26,6 +26,7 @@
 #define KVM_PRIVATE_MEM_SLOTS 4

 #define KVM_PIO_PAGE_OFFSET 1
+#define KVM_MMIO_PAGE_OFFSET 2

 #define CR3_PAE_RESERVED_BITS ((X86_CR3_PWT | X86_CR3_PCD) - 1)
 #define CR3_NONPAE_RESERVED_BITS ((PAGE_SIZE-1) & ~(X86_CR3_PWT | X86_CR3_PCD))
@@ -293,6 +294,21 @@ struct kvm_mem_alias {
     gfn_t target_gfn;
 };

+#define KVM_MAX_DELAYED_MMIO_ZONE 10
+#define KVM_MAX_EXCLUDED_MMIO_ZONE 10
+
+struct kvm_excluded_mmio_zone {
+    u32 offset;
+    u32 size;
+};
+
+struct kvm_delayed_mmio_zone {
+    u64 addr;
+    u32 size;
+    u32 nb_excluded_zones;
+    struct kvm_excluded_mmio_zone excluded[KVM_MAX_EXCLUDED_MMIO_ZONE];
+};
+
 struct kvm_arch{
     int naliases;
     struct kvm_mem_alias aliases[KVM_ALIAS_SLOTS];
@@ -317,6 +333,13 @@ struct kvm_arch{
     struct page *ept_identity_pagetable;
     bool ept_identity_pagetable_done;
+
+    /* MMIO batch */
+
+    spinlock_t batch_lock;
+    struct kvm_batch *batch;
+    int nb_mmio_zones;
+    struct kvm_delayed_mmio_zone mmio_zone[KVM_MAX_DELAYED_MMIO_ZONE];
 };

 struct kvm_vm_stat {
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index a281afe..b57010d 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -173,6 +173,21 @@ struct kvm_run {
     };
 };

+struct kvm_mmio {
+    __u64 phys_addr;
+    __u32 len;
+    __u32 pad;
+    __u8  data[8];
+};
+
+struct kvm_batch {
+    __u32 first, last;
+    struct kvm_mmio mmio[0];
+};
+
+#define KVM_MAX_BATCH ((PAGE_SIZE - sizeof(struct kvm_batch)) / \
+                       sizeof(struct kvm_mmio))
+
 /* for KVM_TRANSLATE */
 struct kvm_translation {
     /* in */
@@ -371,6 +386,7 @@ struct kvm_trace_rec {
 #define KVM_CREATE_PIT    _IO(KVMIO, 0x64)
 #define KVM_GET_PIT       _IOWR(KVMIO, 0x65, struct kvm_pit_state)
 #define KVM_SET_PIT       _IOR(KVMIO, 0x66, struct kvm_pit_state)
+#define KVM_SET_MMIO      _IOW(KVMIO, 0x67, struct kvm_mmio_zone)

 /*
  * ioctls for vcpu fds
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 64ed402..c8f1bdf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -824,6 +824,8 @@ static int kvm_vcpu_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 #ifdef CONFIG_X86
     else if (vmf->pgoff == KVM_PIO_PAGE_OFFSET)
         page = virt_to_page(vcpu->arch.pio_data);
+    else if (vmf->pgoff == KVM_MMIO_PAGE_OFFSET)
+        page = virt_to_page(vcpu->kvm->arch.batch);
 #endif
     else
         return VM_FAULT_SIGBUS;
@@ -1230,6 +1232,7 @@ static long kvm_dev_ioctl(struct file *filp,
         r = PAGE_SIZE;     /* struct kvm_run */
 #ifdef CONFIG_X86
         r += PAGE_SIZE;    /* pio data page */
+        r += PAGE_SIZE;    /* mmio batch page */
 #endif
         break;
     case KVM_TRACE_ENABLE:
--
1.5.2.4
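The ring-full test in batch_mmio() relies on the classic one-empty-slot
convention: with N slots, at most N-1 entries are in flight, so "full"
((last + 1) % N == first) never collides with "empty" (first == last). A
tiny self-contained illustration of the same first/last arithmetic, in
plain C and independent of the kernel structures:

    #include <stdio.h>

    #define N 4   /* 4 slots => at most 3 queued entries */

    int main(void)
    {
        unsigned first = 0, last = 0, i;

        for (i = 0; i < 6; i++) {               /* try to enqueue 6 items */
            unsigned next = (last + 1) % N;
            if (next == first) {                /* full: fall back to exit */
                printf("enqueue %u: ring full\n", i);
                continue;
            }
            last = next;                        /* commit the entry */
            printf("enqueue %u: ok (first=%u last=%u)\n", i, first, last);
        }
        return 0;
    }

Items 0-2 succeed and 3-5 report full, matching the kernel-side behavior of
returning 0 so the MMIO takes the normal (non-batched) exit path.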
From: Laurent V. <Lau...@bu...> - 2008-05-15 13:51:39

This patch is the userspace part of the "batch writes to MMIO" patch. It
defines delayed MMIO zones using kvm_set_mmio() (for VGA and e1000), and it
empties the ring buffer and processes the MMIO accesses.

Signed-off-by: Laurent Vivier <Lau...@bu...>
---
 libkvm/libkvm-x86.c  | 18 ++++++++++++++++++
 libkvm/libkvm.c      | 13 +++++++++++++
 libkvm/libkvm.h      |  2 ++
 qemu/hw/cirrus_vga.c |  2 ++
 qemu/hw/e1000.c      |  8 ++++++++
 qemu/hw/vga.c        |  4 ++++
 qemu/qemu-kvm.c      |  6 ++++++
 qemu/qemu-kvm.h      |  2 ++
 8 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/libkvm/libkvm-x86.c b/libkvm/libkvm-x86.c
index d46fdcc..911e079 100644
--- a/libkvm/libkvm-x86.c
+++ b/libkvm/libkvm-x86.c
@@ -391,6 +391,24 @@ int kvm_set_pit(kvm_context_t kvm, struct kvm_pit_state *s)

 #endif

+int kvm_set_mmio(kvm_context_t kvm,
+                 uint8_t is_delayed, uint64_t addr, uint32_t size)
+{
+    struct kvm_mmio_zone zone;
+    int r;
+
+    zone.is_delayed = is_delayed;
+    zone.addr = addr;
+    zone.size = size;
+
+    r = ioctl(kvm->vm_fd, KVM_SET_MMIO, &zone);
+    if (r == -1) {
+        r = -errno;
+        perror("kvm_set_mmio");
+    }
+    return r;
+}
+
 void kvm_show_code(kvm_context_t kvm, int vcpu)
 {
 #define SHOW_CODE_LEN 50
diff --git a/libkvm/libkvm.c b/libkvm/libkvm.c
index d1e95a4..b891630 100644
--- a/libkvm/libkvm.c
+++ b/libkvm/libkvm.c
@@ -861,6 +861,9 @@ int kvm_run(kvm_context_t kvm, int vcpu)
     int r;
     int fd = kvm->vcpu_fd[vcpu];
     struct kvm_run *run = kvm->run[vcpu];
+#if defined(__x86_64__) || defined(__i386__)
+    struct kvm_batch *batch = (void *)run + 2 * PAGE_SIZE;
+#endif

 again:
     if (!kvm->irqchip_in_kernel)
@@ -879,6 +882,16 @@ again:

     post_kvm_run(kvm, vcpu);

+#if defined(__x86_64__) || defined(__i386__)
+    while (batch->first != batch->last) {
+        kvm->callbacks->mmio_write(kvm->opaque,
+                batch->mmio[batch->first].phys_addr,
+                &batch->mmio[batch->first].data[0],
+                batch->mmio[batch->first].len);
+        batch->first = (batch->first + 1) % KVM_MAX_BATCH;
+    }
+#endif
+
     if (r == -1) {
         r = handle_io_window(kvm);
         goto more;
diff --git a/libkvm/libkvm.h b/libkvm/libkvm.h
index 31c0d59..1f453e1 100644
--- a/libkvm/libkvm.h
+++ b/libkvm/libkvm.h
@@ -448,6 +448,8 @@ int kvm_get_dirty_pages_range(kvm_context_t kvm, unsigned long phys_addr,
                               unsigned long end_addr, void *buf, void *opaque,
                               int (*cb)(unsigned long start, unsigned long len,
                                         void *bitmap, void *opaque));
+int kvm_set_mmio(kvm_context_t kvm,
+                 uint8_t is_delayed, uint64_t addr, uint32_t size);

 /*!
  * \brief Create a memory alias
diff --git a/qemu/hw/cirrus_vga.c b/qemu/hw/cirrus_vga.c
index 2c4aeec..4ef8085 100644
--- a/qemu/hw/cirrus_vga.c
+++ b/qemu/hw/cirrus_vga.c
@@ -3291,6 +3291,8 @@ static void cirrus_init_common(CirrusVGAState * s, int device_id, int is_pci)
                                            cirrus_vga_mem_write, s);
     cpu_register_physical_memory(isa_mem_base + 0x000a0000, 0x20000,
                                  vga_io_memory);
+    if (kvm_enabled())
+        qemu_kvm_set_mmio(1, isa_mem_base + 0x000a0000, 0x20000);

     s->sr[0x06] = 0x0f;
     if (device_id == CIRRUS_ID_CLGD5446) {
diff --git a/qemu/hw/e1000.c b/qemu/hw/e1000.c
index 0728539..d223631 100644
--- a/qemu/hw/e1000.c
+++ b/qemu/hw/e1000.c
@@ -26,6 +26,7 @@
 #include "hw.h"
 #include "pci.h"
 #include "net.h"
+#include "qemu-kvm.h"

 #include "e1000_hw.h"
@@ -938,6 +939,13 @@ e1000_mmio_map(PCIDevice *pci_dev, int region_num,
     d->mmio_base = addr;
     cpu_register_physical_memory(addr, PNPMMIO_SIZE, d->mmio_index);
+
+    if (kvm_enabled()) {
+        qemu_kvm_set_mmio(1, addr, PNPMMIO_SIZE);
+        qemu_kvm_set_mmio(0, addr + E1000_TCTL, 4);
+        qemu_kvm_set_mmio(0, addr + E1000_TDT, 4);
+        qemu_kvm_set_mmio(0, addr + E1000_ICR, 4);
+    }
 }

 static int
diff --git a/qemu/hw/vga.c b/qemu/hw/vga.c
index 3a49573..844c2a7 100644
--- a/qemu/hw/vga.c
+++ b/qemu/hw/vga.c
@@ -2257,6 +2257,8 @@ void vga_init(VGAState *s)
     vga_io_memory = cpu_register_io_memory(0, vga_mem_read, vga_mem_write, s);
     cpu_register_physical_memory(isa_mem_base + 0x000a0000, 0x20000,
                                  vga_io_memory);
+    if (kvm_enabled())
+        qemu_kvm_set_mmio(1, isa_mem_base + 0x000a0000, 0x20000);
 }

 /* Memory mapped interface */
@@ -2332,6 +2334,8 @@ static void vga_mm_init(VGAState *s, target_phys_addr_t vram_base,
     cpu_register_physical_memory(ctrl_base, 0x100000, s_ioport_ctrl);
     s->bank_offset = 0;
     cpu_register_physical_memory(vram_base + 0x000a0000, 0x20000,
                                  vga_io_memory);
+    if (kvm_enabled())
+        qemu_kvm_set_mmio(1, vram_base + 0x000a0000, 0x20000);
 }

 int isa_vga_init(DisplayState *ds, uint8_t *vga_ram_base,
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index e522269..9f03ab1 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -955,3 +955,9 @@ void kvm_mutex_lock(void)
     pthread_mutex_lock(&qemu_mutex);
     cpu_single_env = NULL;
 }
+
+int qemu_kvm_set_mmio(int is_delayed,
+                      target_phys_addr_t addr, unsigned int size)
+{
+    return kvm_set_mmio(kvm_context, is_delayed, addr, size);
+}
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index 21606e9..37d4d11 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -75,6 +75,8 @@ int handle_tpr_access(void *opaque, int vcpu,
 void kvm_tpr_vcpu_start(CPUState *env);

 int qemu_kvm_get_dirty_pages(unsigned long phys_addr, void *buf);
+int qemu_kvm_set_mmio(int is_delayed,
+                      target_phys_addr_t addr, unsigned int size);

 void qemu_kvm_system_reset_request(void);
--
1.5.2.4
From: Laurent V. <Lau...@bu...> - 2008-05-15 13:51:31

These two patches allow batching writes to MMIO. When the kernel has to send
MMIO writes to userspace, it stores them in memory until it has to pass the
hand to userspace for another reason. This avoids having too many context
switches on operations that can wait. These patches introduce an ioctl() to
define the MMIO allowed to be delayed.

I made some benchmarks with iperf and e1000 (average of 10 runs):

                WITH PATCH    WITHOUT PATCH
  throughput    257.2 MB/s    193.7 MB/s       33% faster

I've measured host_state_reload on a WinXP boot:

                WITH PATCH    WITHOUT PATCH
                561397        739708           24% less

I've measured host_state_reload on a VGA text scroll:

                WITH PATCH    WITHOUT PATCH
                3976242       13779849         70% less...

[PATCH 1/2] kvm: Batch writes to MMIO - kernel part
[PATCH 2/2] kvm-userspace: Batch writes to MMIO - userspace part

Signed-off-by: Laurent Vivier <Lau...@bu...>
From: Jan K. <jan...@si...> - 2008-05-15 13:07:53

These warnings continued to bug me (while scanning for my own mess).

Signed-off-by: Jan Kiszka <jan...@si...>
---
 qemu/migration.c |  7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: b/qemu/migration.c
===================================================================
--- a/qemu/migration.c
+++ b/qemu/migration.c
@@ -834,8 +834,8 @@ static int migrate_incoming_fd(int fd)

     size = qemu_get_be32(f);
     if (size != phys_ram_size) {
-        fprintf(stderr, "migration: memory size mismatch: recv %u mine %u\n",
-                size, phys_ram_size);
+        fprintf(stderr, "migration: memory size mismatch: recv %u mine %llu\n",
+                size, (unsigned long long)phys_ram_size);
         return MIG_STAT_DST_MEM_SIZE_MISMATCH;
     }

@@ -1090,7 +1090,8 @@ void do_info_migration(void)
         term_printf("Transfer rate %3.1f mb/s\n",
                     (double)s->bps / (1024 * 1024));
         term_printf("Iteration %d\n", s->iteration);
-        term_printf("Transferred %d/%d pages\n", s->updated_pages, phys_ram_size >> TARGET_PAGE_BITS);
+        term_printf("Transferred %d/%llu pages\n", s->updated_pages,
+                    (unsigned long long)phys_ram_size >> TARGET_PAGE_BITS);
         if (s->iteration)
             term_printf("Last iteration found %d dirty pages\n", s->last_updated_pages);
     } else {
From: Jan K. <jan...@si...> - 2008-05-15 13:05:15

Dead code since it was introduced. Is it planned to use it in the near
future? Then I would suggest putting it under #if 0 for now. Otherwise,
please pick up this patch.

Signed-off-by: Jan Kiszka <jan...@si...>
---
 qemu/kvm-tpr-opt.c | 13 -------------
 1 file changed, 13 deletions(-)

Index: b/qemu/kvm-tpr-opt.c
===================================================================
--- a/qemu/kvm-tpr-opt.c
+++ b/qemu/kvm-tpr-opt.c
@@ -83,19 +83,6 @@ static void write_byte_virt(CPUState *en
     stb_phys(map_addr(&sregs, virt, NULL), b);
 }

-static uint32_t get_bios_map(CPUState *env, unsigned *perms)
-{
-    uint32_t v;
-    struct kvm_sregs sregs;
-
-    kvm_get_sregs(kvm_context, env->cpu_index, &sregs);
-
-    for (v = -4096u; v != 0; v -= 4096)
-        if (map_addr(&sregs, v, perms) == 0xe0000)
-            return v;
-    return -1u;
-}
-
 struct vapic_bios {
     char signature[8];
     uint32_t virt_base;
From: Laurent V. <Lau...@bu...> - 2008-05-15 12:20:49

On Thursday 15 May 2008 at 15:04 +0300, Avi Kivity wrote:
> Daniel P. Berrange wrote:
>> On Thu, May 15, 2008 at 11:04:47AM +0300, Avi Kivity wrote:
>>> Daniel P. Berrange wrote:
>>>> With this kind of syntax, now tools generating config files need to make
>>>> up unique names for each drive. So you'll probably end up with them just
>>>> naming things based on the class name + a number appended.
>>>
>>> I would hope that tools don't have to resort to reading and writing
>>> these config files. Usually a management system would prefer storing
>>> parameters in its own database, and writing a temporary config file just
>>> to pass the data seems awkward. I would much prefer to see the command
>>> line and monitor retain full control over every configurable parameter.
>>
>> I expect that libvirt will create config files - it is only a matter of
>> time before we hit the command line ARGV length limits - particularly
>> with the -net and -drive syntax. People are already requesting that we
>> support guests with > 16 disks, and > 8 network cards, so command lines
>> get very long.
>
> What are those limits, btw? ISTR 10240 words, but how many chars?

ARG_MAX - _SC_ARG_MAX
    The maximum length of the arguments to the exec(3) family of
    functions. Must not be less than _POSIX_ARG_MAX (4096).

getconf ARG_MAX
131072

And from a configure log I have:

checking the maximum length of command line arguments: 98304

Regards,
Laurent
--
------------- Lau...@bu... ---------------
"The best way to predict the future is to invent it."
- Alan Kay
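For completeness, the same limit can be queried programmatically; a small
self-contained example using standard POSIX sysconf, nothing QEMU-specific:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Same value `getconf ARG_MAX` reports: the byte limit on the
         * exec() argument list plus environment, not a word count. */
        long arg_max = sysconf(_SC_ARG_MAX);
        printf("ARG_MAX = %ld\n", arg_max);
        return 0;
    }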
From: Avi K. <av...@qu...> - 2008-05-15 12:04:28

Daniel P. Berrange wrote:
> On Thu, May 15, 2008 at 11:04:47AM +0300, Avi Kivity wrote:
>> Daniel P. Berrange wrote:
>>> With this kind of syntax, now tools generating config files need to make
>>> up unique names for each drive. So you'll probably end up with them just
>>> naming things based on the class name + a number appended.
>>
>> I would hope that tools don't have to resort to reading and writing
>> these config files. Usually a management system would prefer storing
>> parameters in its own database, and writing a temporary config file just
>> to pass the data seems awkward. I would much prefer to see the command
>> line and monitor retain full control over every configurable parameter.
>
> I expect that libvirt will create config files - it is only a matter of
> time before we hit the command line ARGV length limits - particularly
> with the -net and -drive syntax. People already requesting that we support
> guests with > 16 disks, and > 8 network cards so command lines get very
> long.

What are those limits, btw? ISTR 10240 words, but how many chars?

> I wouldn't write out the config file to disk though - I'd just send it
> on the fly on stdin, eg 'qemu -config -' to tell it to read the config
> on its stdin.

That's fine from my point of view.

--
error compiling committee.c: too many arguments to function
From: Daniel P. B. <ber...@re...> - 2008-05-15 11:55:40

On Thu, May 15, 2008 at 11:04:47AM +0300, Avi Kivity wrote:
> Daniel P. Berrange wrote:
>> With this kind of syntax, now tools generating config files need to make
>> up unique names for each drive. So you'll probably end up with them just
>> naming things based on the class name + a number appended.
>
> I would hope that tools don't have to resort to reading and writing
> these config files. Usually a management system would prefer storing
> parameters in its own database, and writing a temporary config file just
> to pass the data seems awkward. I would much prefer to see the command
> line and monitor retain full control over every configurable parameter.

I expect that libvirt will create config files - it is only a matter of
time before we hit the command line ARGV length limits - particularly with
the -net and -drive syntax. People are already requesting that we support
guests with > 16 disks, and > 8 network cards, so command lines get very
long.

I wouldn't write out the config file to disk though - I'd just send it on
the fly on stdin, e.g. 'qemu -config -' to tell it to read the config on
its stdin.

Regards,
Daniel.
--
|: Red Hat, Engineering, Boston -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
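The "read config from stdin when the path is -" convention Daniel proposes
is a one-liner in most tools; a generic sketch (qemu had no -config option
at this point, so the option itself is the proposal, not existing code):

    #include <stdio.h>
    #include <string.h>

    /* "-" means "use stdin", the common Unix idiom. */
    static FILE *open_config(const char *path)
    {
        if (strcmp(path, "-") == 0)
            return stdin;
        return fopen(path, "r");
    }

    int main(int argc, char **argv)
    {
        FILE *f = open_config(argc > 1 ? argv[1] : "-");
        char line[256];

        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            printf("config line: %s", line);  /* a real tool parses here */
        if (f != stdin)
            fclose(f);
        return 0;
    }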
From: Alexander G. <ag...@su...> - 2008-05-15 11:16:25

Hi,

in the DSDT there are two different ways of defining how an interrupt is
supposed to be routed. Currently we are using the LNKA - LNKD method, which
afaict is for legacy support. The other method is to directly tell the
operating system which APIC pin the device is attached to. We can get that
information from the very same entry the LNKA to LNKD pseudo devices
receive it from.

For now this does not give any obvious improvement. It does leave room for
more advanced mappings, with several IOAPICs that can handle more devices
separately. This might help when we have a lot of devices, as currently all
devices sit on two interrupt lanes. More importantly (for me), though,
Darwin enables the APIC mode unconditionally, so it won't easily run in
legacy mode.

Regards,
Alex

Signed-off-by: Alexander Graf <ag...@su...>
From: Alexander G. <ag...@su...> - 2008-05-15 11:16:25

Hi,

a patch recently introduced PCI device hotplugging. This added pseudo-buses
for every PCI slot, so that each device can be easily ejected at any time.
The ACPI specification recommends the inclusion of a _SUN entry in these,
though, to enable proper indexation of the slots. Afaict Linux does not
need this, but Darwin does.

This patch adds the corresponding _SUN entries to the PCI slot definitions.

Regards,
Alex

Signed-off-by: Alexander Graf <ag...@su...>
From: Avi K. <av...@qu...> - 2008-05-15 11:12:39

Robin Holt wrote:
> Then we need to deposit the information needed to do the invalidate.
>
> Lastly, we would need to interrupt. Unfortunately, here we have a
> thundering herd. There could be up to 16256 processors interrupting the
> same processor. That will be a lot of work. It will need to look up the
> mm (without grabbing any sleeping locks in either xpmem or the kernel)
> and do the tlb invalidates.

You don't need to interrupt every time. Place your data in a queue (you do
support rmw operations, right?) and interrupt. Invalidates from other
processors will see that the queue hasn't been processed yet and skip the
interrupt.

--
error compiling committee.c: too many arguments to function
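A minimal shared-memory sketch of the coalescing Avi describes, using C11
atomics in place of whatever rmw primitive the interconnect actually offers;
the queue and IPI functions are hypothetical scaffolding (declarations
only):

    #include <stdatomic.h>
    #include <stdbool.h>

    struct inval_queue;                 /* hypothetical lock-free queue */
    struct inval_req;
    void queue_push(struct inval_queue *q, struct inval_req *req);
    void drain_and_invalidate(struct inval_queue *q);
    void send_ipi(int target_cpu);

    atomic_bool ipi_pending;            /* one flag per target, shared */

    /* Sender: only the enqueuer that flips pending 0->1 raises the IPI;
     * everyone else just deposits work and returns. */
    void post_invalidate(struct inval_queue *q, struct inval_req *req,
                         int target)
    {
        queue_push(q, req);
        if (!atomic_exchange(&ipi_pending, true))
            send_ipi(target);
    }

    /* Target: clear the flag *before* draining, so a late poster either
     * sees its entry drained or raises a fresh IPI. */
    void ipi_handler(struct inval_queue *q)
    {
        atomic_store(&ipi_pending, false);
        drain_and_invalidate(q);
    }

This turns the 16256-senders thundering herd into at most one outstanding
interrupt per target, at the cost of one atomic rmw per invalidate.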
From: Robin H. <ho...@sg...> - 2008-05-15 11:01:46

We are pursuing Linus' suggestion currently. This discussion is completely
unrelated to that work.

On Thu, May 15, 2008 at 09:57:47AM +0200, Nick Piggin wrote:
> I'm not sure if you're thinking about what I'm thinking of. With the
> scheme I'm imagining, all you will need is some way to raise an IPI-like
> interrupt on the target domain. The IPI target will have a driver to
> handle the interrupt, which will determine the mm and virtual addresses
> which are to be invalidated, and will then tear down those page tables
> and issue hardware TLB flushes within its domain. On the Linux side,
> I don't see why this can't be done.

We would need to deposit the payload into a central location to do the
invalidate, correct? That central location would either need to be indexed
by physical cpuid (65536 possible currently, UV will push that up much
higher) or some sort of global id, which is difficult because remote
partitions can reboot, giving you a different view of the machine, and
running partitions would need to be updated. Alternatively, that central
location would need to be protected by a global lock or atomic type
operation, but a majority of the machine does not have coherent access to
other partitions, so they would need to use uncached operations.
Essentially, take away from this paragraph that it is going to be really
slow or really large.

Then we need to deposit the information needed to do the invalidate.

Lastly, we would need to interrupt. Unfortunately, here we have a
thundering herd. There could be up to 16256 processors interrupting the
same processor. That will be a lot of work. It will need to look up the mm
(without grabbing any sleeping locks in either xpmem or the kernel) and do
the tlb invalidates.

Unfortunately, the sending side is not free to continue (in most cases)
until it knows that the invalidate is completed, so it will need to spin
waiting for a completion signal, which could be as simple as an uncached
word. But how will it handle the possible failure of the other partition?
How will it detect that failure and recover? A timeout value could be
difficult to gauge because the other side may be off doing a considerable
amount of work and may just be backed up.

> Sure, you obviously would need to rework your code because it's been
> written with the assumption that it can sleep.

It is an assumption based upon some of the kernel functions we call doing
things like grabbing mutexes or rw_sems. That pushes back to us. I think
the kernel's locking is perfectly reasonable. The problem we run into is we
are trying to get from one context in one kernel to a different context in
another, and the in-between piece needs to be sleepable.

> What is XPMEM exactly anyway? I'd assumed it is a Linux driver.

XPMEM allows one process to make a portion of its virtual address range
directly addressable by another process with the appropriate access. The
other process can be on other partitions. As long as Numa-link allows
access to the memory, we can make it available. Userland has an advantage
in that the kernel entrance/exit code contains memory errors, so we can
contain hardware failures (in most cases) to only needing to terminate a
user program and not lose the partition. The kernel enjoys no such fault
containment, so it can not safely directly reference memory.

Thanks,
Robin
From: Avi K. <av...@qu...> - 2008-05-15 10:54:45

Bernd Schubert wrote:
> On Thursday 15 May 2008 09:36:41 Avi Kivity wrote:
>> Bernd Schubert wrote:
>>> Hello,
>>>
>>> there is a problem booting 2.6.26-rcX (X=1,2). It stops booting at
>>>
>>> Calibrating delay using timer specific routine.. 4016.92 BogoMIPS
>>> (lpj=8033846)
>>>
>>> The kvm process then takes 100% of my host CPU.
>>>
>>> This is with kvm-67 on an AMD64-X2.
>>>
>>> I'm not yet familiar with kvm and debugging. Will a sysrq+t trace of the
>>> host show something useful? Or will only full git-bisect help?
>>
>> Do you have CONFIG_KVM_GUEST or CONFIG_KVM_CLOCK in your config? If so,
>> this may be a paravirt problem. Try turning them off and let us know.
>
> Thanks, I had both options enabled. Disabling these makes the

Can you check which one causes the trouble? Most likely it's the clock.

--
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-05-15 10:53:38

David S. Ahern wrote:
> Avi Kivity wrote:
>> Not so fast... the patch updates the flood count to 5. Can you check
>> if a lower value still works? Also, whether updating the flood count to
>> 5 (without the rest of the patch) works?
>>
>> Unconditionally bumping the flood count to 5 will likely cause a
>> performance regression on other guests.
>
> I put the flood count back to 3, and the RHEL3 guest performance is even
> better.

Okay, I committed the patch without the flood count == 5.

--
error compiling committee.c: too many arguments to function
From: Yang, S. <she...@in...> - 2008-05-15 10:44:35

From b410060a395356eb4bca3ae31de7acb8c261b3f1 Mon Sep 17 00:00:00 2001
From: Sheng Yang <she...@in...>
Date: Thu, 15 May 2008 18:23:27 +0800
Subject: [PATCH] KVM: Enable NMI Watchdog by PIT source

The NMI watchdog uses LINT0 of the LAPIC to deliver NMIs. It doesn't
disable the PIC after switching to the IOAPIC, but programs LVT0 of every
LAPIC as NMI and then delivers the PIT interrupt to LINT0, so NMIs are
generated at the same frequency as PIT interrupts. This patch emulates
that process and enables the NMI watchdog.

In current KVM we don't in fact connect the PIC to the LAPIC, so the patch
bypasses the PIC and sends the signal directly to the LAPIC.

Signed-off-by: Sheng Yang <she...@in...>
---
 arch/x86/kvm/i8254.c | 16 ++++++++++++++++
 arch/x86/kvm/irq.h   |  1 +
 arch/x86/kvm/lapic.c | 32 ++++++++++++++++++++++++++++----
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 6d6dc6c..7c6ea62 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -563,12 +563,28 @@ void kvm_free_pit(struct kvm *kvm)

 static void __inject_pit_timer_intr(struct kvm *kvm)
 {
+    int i;
+    struct kvm_vcpu *vcpu;
+
     mutex_lock(&kvm->lock);
     kvm_ioapic_set_irq(kvm->arch.vioapic, 0, 1);
     kvm_ioapic_set_irq(kvm->arch.vioapic, 0, 0);
     kvm_pic_set_irq(pic_irqchip(kvm), 0, 1);
     kvm_pic_set_irq(pic_irqchip(kvm), 0, 0);
     mutex_unlock(&kvm->lock);
+
+    /*
+     * For NMI watchdog in IOAPIC mode
+     * After IOAPIC enabled, NMI watchdog programmed LVT0 of lapic as NMI,
+     * then a timer interrupt through IOAPIC and a NMI through PIC to lapic
+     * would be delivered when PIT time up.
+     */
+    for (i = 0; i < KVM_MAX_VCPUS; ++i) {
+        vcpu = kvm->vcpus[i];
+        if (!vcpu)
+            continue;
+        kvm_apic_local_deliver(vcpu, APIC_LVT0);
+    }
 }

 void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h
index 1802134..7066660 100644
--- a/arch/x86/kvm/irq.h
+++ b/arch/x86/kvm/irq.h
@@ -84,6 +84,7 @@ void kvm_timer_intr_post(struct kvm_vcpu *vcpu, int vec);
 void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu);
 void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu);
 void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu);
+int kvm_apic_local_deliver(struct kvm_vcpu *vcpu, int lvt_type);

 int pit_has_pending_timer(struct kvm_vcpu *vcpu);
 int apic_has_pending_timer(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 2273836..62f70a1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -388,6 +388,14 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
         }
         break;

+    case APIC_DM_EXTINT:
+        /*
+         * Should only be called by kvm_apic_local_deliver() with LVT0,
+         * before NMI watchdog was enabled. Already handled by
+         * kvm_apic_accept_pic_intr().
+         */
+        break;
+
     default:
         printk(KERN_ERR "TODO: unsupported delivery mode %x\n",
                delivery_mode);
@@ -753,6 +761,9 @@ static void apic_mmio_write(struct kvm_io_device *this,
     case APIC_LVTTHMR:
     case APIC_LVTPC:
     case APIC_LVT0:
+        if (val == APIC_DM_NMI)
+            apic_debug("Receive NMI setting on APIC_LVT0 "
+                       "for cpu %d\n", apic->vcpu->vcpu_id);
     case APIC_LVT1:
     case APIC_LVTERR:
         /* TODO: Check vector */
@@ -968,12 +979,25 @@ int apic_has_pending_timer(struct kvm_vcpu *vcpu)
     return 0;
 }

-static int __inject_apic_timer_irq(struct kvm_lapic *apic)
+int kvm_apic_local_deliver(struct kvm_vcpu *vcpu, int lvt_type)
 {
-    int vector;
+    struct kvm_lapic *apic = vcpu->arch.apic;
+    int vector, mode, trig_mode;
+    u32 reg;
+
+    if (apic && apic_enabled(apic)) {
+        reg = apic_get_reg(apic, lvt_type);
+        vector = reg & APIC_VECTOR_MASK;
+        mode = reg & APIC_MODE_MASK;
+        trig_mode = reg & APIC_LVT_LEVEL_TRIGGER;
+        return __apic_accept_irq(apic, mode, vector, 1, trig_mode);
+    }
+    return 0;
+}

-    vector = apic_lvt_vector(apic, APIC_LVTT);
-    return __apic_accept_irq(apic, APIC_DM_FIXED, vector, 1, 0);
+static int __inject_apic_timer_irq(struct kvm_lapic *apic)
+{
+    return kvm_apic_local_deliver(apic->vcpu, APIC_LVTT);
 }

 static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
--
1.5.5
From: Yang, S. <she...@in...> - 2008-05-15 10:44:08

From 069c50dca077796101af3eb5890e3fd31a72743f Mon Sep 17 00:00:00 2001
From: Sheng Yang <she...@in...>
Date: Thu, 15 May 2008 18:23:25 +0800
Subject: [PATCH] KVM: VMX: Enable NMI with in-kernel irqchip

Signed-off-by: Sheng Yang <she...@in...>
---
 arch/x86/kvm/vmx.c         | 125 +++++++++++++++++++++++++++++++++++++------
 arch/x86/kvm/vmx.h         |  12 ++++-
 arch/x86/kvm/x86.c         |   1 +
 include/asm-x86/kvm_host.h |   1 +
 4 files changed, 120 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e94a8c3..134c75e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -264,6 +264,12 @@ static inline int cpu_has_vmx_vpid(void)
         SECONDARY_EXEC_ENABLE_VPID);
 }

+static inline int cpu_has_virtual_nmis(void)
+{
+    return (vmcs_config.pin_based_exec_ctrl &
+        PIN_BASED_VIRTUAL_NMIS);
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
     int i;
@@ -1083,7 +1089,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
     u32 _vmentry_control = 0;

     min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
-    opt = 0;
+    opt = PIN_BASED_VIRTUAL_NMIS;
     if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
                 &_pin_based_exec_control) < 0)
         return -EIO;
@@ -2125,6 +2131,13 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu, int irq)
             irq | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK);
 }

+static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
+{
+    vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+            INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);
+    vcpu->arch.nmi_pending = 0;
+}
+
 static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
 {
     int word_index = __ffs(vcpu->arch.irq_summary);
@@ -2648,6 +2661,19 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
     return 1;
 }

+static int handle_nmi_window(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
+{
+    u32 cpu_based_vm_exec_control;
+
+    /* clear pending NMI */
+    cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+    cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_NMI_PENDING;
+    vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+    ++vcpu->stat.nmi_window_exits;
+
+    return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -2658,6 +2684,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu,
     [EXIT_REASON_EXCEPTION_NMI]           = handle_exception,
     [EXIT_REASON_EXTERNAL_INTERRUPT]      = handle_external_interrupt,
     [EXIT_REASON_TRIPLE_FAULT]            = handle_triple_fault,
+    [EXIT_REASON_NMI_WINDOW]              = handle_nmi_window,
     [EXIT_REASON_IO_INSTRUCTION]          = handle_io,
     [EXIT_REASON_CR_ACCESS]               = handle_cr,
     [EXIT_REASON_DR_ACCESS]               = handle_dr,
@@ -2745,17 +2772,52 @@ static void enable_irq_window(struct kvm_vcpu *vcpu)
     vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
 }

+static void enable_nmi_window(struct kvm_vcpu *vcpu)
+{
+    u32 cpu_based_vm_exec_control;
+
+    if (!cpu_has_virtual_nmis())
+        return;
+
+    cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+    cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_NMI_PENDING;
+    vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+}
+
+static int vmx_nmi_enabled(struct kvm_vcpu *vcpu)
+{
+    u32 guest_intr = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+    return !(guest_intr & (GUEST_INTR_STATE_NMI |
+                           GUEST_INTR_STATE_MOV_SS |
+                           GUEST_INTR_STATE_STI));
+}
+
+static int vmx_irq_enabled(struct kvm_vcpu *vcpu)
+{
+    u32 guest_intr = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
+    return (!(guest_intr & (GUEST_INTR_STATE_MOV_SS |
+                            GUEST_INTR_STATE_STI)) &&
+            (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF));
+}
+
+static void enable_intr_window(struct kvm_vcpu *vcpu)
+{
+    if (vcpu->arch.nmi_pending)
+        enable_nmi_window(vcpu);
+    else if (kvm_cpu_has_interrupt(vcpu))
+        enable_irq_window(vcpu);
+}
+
 static void vmx_intr_assist(struct kvm_vcpu *vcpu)
 {
     struct vcpu_vmx *vmx = to_vmx(vcpu);
-    u32 idtv_info_field, intr_info_field;
-    int has_ext_irq, interrupt_window_open;
+    u32 idtv_info_field, intr_info_field, exit_intr_info_field;
     int vector;

     update_tpr_threshold(vcpu);

-    has_ext_irq = kvm_cpu_has_interrupt(vcpu);
     intr_info_field = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+    exit_intr_info_field = vmcs_read32(VM_EXIT_INTR_INFO);
     idtv_info_field = vmx->idt_vectoring_info;
     if (intr_info_field & INTR_INFO_VALID_MASK) {
         if (idtv_info_field & INTR_INFO_VALID_MASK) {
@@ -2763,8 +2825,7 @@ static void vmx_intr_assist(struct kvm_vcpu *vcpu)
             if (printk_ratelimit())
                 printk(KERN_ERR "Fault when IDT_Vectoring\n");
         }
-        if (has_ext_irq)
-            enable_irq_window(vcpu);
+        enable_intr_window(vcpu);
         return;
     }
     if (unlikely(idtv_info_field & INTR_INFO_VALID_MASK)) {
@@ -2774,30 +2835,56 @@ static void vmx_intr_assist(struct kvm_vcpu *vcpu)
             u8 vect = idtv_info_field & VECTORING_INFO_VECTOR_MASK;

             vmx_inject_irq(vcpu, vect);
-            if (unlikely(has_ext_irq))
-                enable_irq_window(vcpu);
+            enable_intr_window(vcpu);
             return;
         }

         KVMTRACE_1D(REDELIVER_EVT, vcpu, idtv_info_field, handler);

-        vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field);
+        /*
+         * SDM 3: 25.7.1.2
+         * Clear bit "block by NMI" before VM entry if a NMI delivery
+         * faulted.
+         */
+        if ((idtv_info_field & VECTORING_INFO_TYPE_MASK)
+            == INTR_TYPE_NMI_INTR && cpu_has_virtual_nmis())
+            vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
+                vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
+                ~GUEST_INTR_STATE_NMI);
+
+        vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, idtv_info_field
+                     & ~INTR_INFO_RESVD_BITS_MASK);
         vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
                 vmcs_read32(VM_EXIT_INSTRUCTION_LEN));

         if (unlikely(idtv_info_field & INTR_INFO_DELIVER_CODE_MASK))
             vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
                 vmcs_read32(IDT_VECTORING_ERROR_CODE));
-        if (unlikely(has_ext_irq))
-            enable_irq_window(vcpu);
+        enable_intr_window(vcpu);
         return;
     }

-    if (!has_ext_irq)
+    if (cpu_has_virtual_nmis()) {
+        /*
+         * SDM 3: 25.7.1.2
+         * Re-set bit "block by NMI" before VM entry if vmexit caused by
+         * a guest IRET fault.
+         */
+        if ((exit_intr_info_field & INTR_INFO_UNBLOCK_NMI) &&
+            (exit_intr_info_field & INTR_INFO_VECTOR_MASK) != 8)
+            vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
+                vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) |
+                GUEST_INTR_STATE_NMI);
+        else if (vcpu->arch.nmi_pending) {
+            if (vmx_nmi_enabled(vcpu))
+                vmx_inject_nmi(vcpu);
+            enable_intr_window(vcpu);
+            return;
+        }
+
+    }
+    if (!kvm_cpu_has_interrupt(vcpu))
         return;
-    interrupt_window_open =
-        ((vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
-         (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0);
-    if (interrupt_window_open) {
+    if (vmx_irq_enabled(vcpu)) {
         vector = kvm_cpu_get_interrupt(vcpu);
         vmx_inject_irq(vcpu, vector);
         kvm_timer_intr_post(vcpu, vector);
@@ -2958,7 +3045,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
         fixup_rmode_irq(vmx);

     vcpu->arch.interrupt_window_open =
-        (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & 3) == 0;
+        (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
+         (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)) == 0;

     asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
     vmx->launched = 1;
@@ -2966,7 +3054,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
     intr_info = vmcs_read32(VM_EXIT_INTR_INFO);

     /* We need to handle NMIs before interrupts are enabled */
-    if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == 0x200) { /* nmi */
+    if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == 0x200 &&
+        (intr_info & INTR_INFO_VALID_MASK)) {
         KVMTRACE_0D(NMI, vcpu, handler);
         asm("int $2");
     }
diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h
index 79d94c6..425a134 100644
--- a/arch/x86/kvm/vmx.h
+++ b/arch/x86/kvm/vmx.h
@@ -40,6 +40,7 @@
 #define CPU_BASED_CR8_LOAD_EXITING              0x00080000
 #define CPU_BASED_CR8_STORE_EXITING             0x00100000
 #define CPU_BASED_TPR_SHADOW                    0x00200000
+#define CPU_BASED_VIRTUAL_NMI_PENDING           0x00400000
 #define CPU_BASED_MOV_DR_EXITING                0x00800000
 #define CPU_BASED_UNCOND_IO_EXITING             0x01000000
 #define CPU_BASED_USE_IO_BITMAPS                0x02000000
@@ -216,7 +217,7 @@ enum vmcs_field {
 #define EXIT_REASON_TRIPLE_FAULT        2

 #define EXIT_REASON_PENDING_INTERRUPT   7
-
+#define EXIT_REASON_NMI_WINDOW          8
 #define EXIT_REASON_TASK_SWITCH         9
 #define EXIT_REASON_CPUID               10
 #define EXIT_REASON_HLT                 12
@@ -251,7 +252,9 @@ enum vmcs_field {
 #define INTR_INFO_VECTOR_MASK           0xff            /* 7:0 */
 #define INTR_INFO_INTR_TYPE_MASK        0x700           /* 10:8 */
 #define INTR_INFO_DELIVER_CODE_MASK     0x800           /* 11 */
+#define INTR_INFO_UNBLOCK_NMI           0x1000          /* 12 */
 #define INTR_INFO_VALID_MASK            0x80000000      /* 31 */
+#define INTR_INFO_RESVD_BITS_MASK       0x7ffff000

 #define VECTORING_INFO_VECTOR_MASK      INTR_INFO_VECTOR_MASK
 #define VECTORING_INFO_TYPE_MASK        INTR_INFO_INTR_TYPE_MASK
@@ -259,9 +262,16 @@ enum vmcs_field {
 #define VECTORING_INFO_VALID_MASK       INTR_INFO_VALID_MASK

 #define INTR_TYPE_EXT_INTR              (0 << 8) /* external interrupt */
+#define INTR_TYPE_NMI_INTR              (2 << 8) /* NMI */
 #define INTR_TYPE_EXCEPTION             (3 << 8) /* processor exception */
 #define INTR_TYPE_SOFT_INTR             (4 << 8) /* software interrupt */

+/* GUEST_INTERRUPTIBILITY_INFO flags. */
+#define GUEST_INTR_STATE_STI            0x00000001
+#define GUEST_INTR_STATE_MOV_SS         0x00000002
+#define GUEST_INTR_STATE_SMI            0x00000004
+#define GUEST_INTR_STATE_NMI            0x00000008
+
 /*
  * Exit Qualifications for MOV for Control Register Access
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8461da4..e537005 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -72,6 +72,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
     { "mmio_exits", VCPU_STAT(mmio_exits) },
     { "signal_exits", VCPU_STAT(signal_exits) },
     { "irq_window", VCPU_STAT(irq_window_exits) },
+    { "nmi_window", VCPU_STAT(nmi_window_exits) },
     { "halt_exits", VCPU_STAT(halt_exits) },
     { "halt_wakeup", VCPU_STAT(halt_wakeup) },
     { "hypercalls", VCPU_STAT(hypercalls) },
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 567d739..28f2e91 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -344,6 +344,7 @@ struct kvm_vcpu_stat {
     u32 mmio_exits;
     u32 signal_exits;
     u32 irq_window_exits;
+    u32 nmi_window_exits;
     u32 halt_exits;
     u32 halt_wakeup;
     u32 request_irq_exits;
--
1.5.5
From: Yang, S. <she...@in...> - 2008-05-15 10:44:08

Hi

Sorry for the update delays, got a cold recently...

No big modification. I dropped the ordinary first patch following Avi's
comment, and fixed a bug in handling host NMIs in vmx_vcpu_run in the
second patch.

The third patch, enabling the NMI watchdog, is *not* meant to be merged. It
was cooked for the test, and it would bring some overhead on interrupt
handling, as well as a regression on some versions of Windows now
(IRQL_NOT_LESS_OR_EQUAL BSOD). But is it necessary to get NMI watchdog
support into KVM? If so, maybe we need an option (or kernel module
parameter) to enable it.

--
Thanks
Yang, Sheng
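If the watchdog support were made opt-in as suggested, the usual mechanism
would be a module parameter; a sketch in the style of other kvm module
parameters of the time — the parameter name is made up and this is not part
of the posted series:

    /* Hypothetical opt-in switch for the NMI watchdog delivery path. */
    static int enable_nmi_watchdog;
    module_param(enable_nmi_watchdog, int, S_IRUGO);
    MODULE_PARM_DESC(enable_nmi_watchdog,
                     "Deliver PIT-driven NMIs to LAPIC LVT0 for guest "
                     "NMI watchdogs");

    /* ...then in __inject_pit_timer_intr(), guard the delivery loop: */
    /* if (enable_nmi_watchdog)
           kvm_apic_local_deliver(vcpu, APIC_LVT0); */

This would avoid the interrupt-handling overhead (and the Windows
regression) for guests that don't need the watchdog.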
From: Yang, S. <she...@in...> - 2008-05-15 10:44:07

From 16680d2556ad065b128412b0f5d81f04de25b3f8 Mon Sep 17 00:00:00 2001
From: Sheng Yang <she...@in...>
Date: Thu, 15 May 2008 09:52:48 +0800
Subject: [PATCH] KVM: IOAPIC/LAPIC: Enable NMI support

Signed-off-by: Sheng Yang <she...@in...>
---
 arch/x86/kvm/lapic.c       |  3 ++-
 arch/x86/kvm/x86.c         |  6 ++++++
 include/asm-x86/kvm_host.h |  4 ++++
 virt/kvm/ioapic.c          | 20 ++++++++++++++++++--
 4 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 7652f88..2273836 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -356,8 +356,9 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
     case APIC_DM_SMI:
         printk(KERN_DEBUG "Ignoring guest SMI\n");
         break;
+
     case APIC_DM_NMI:
-        printk(KERN_DEBUG "Ignoring guest NMI\n");
+        kvm_inject_nmi(vcpu);
         break;

     case APIC_DM_INIT:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dab3d4f..8461da4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -173,6 +173,12 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long addr,
     kvm_queue_exception_e(vcpu, PF_VECTOR, error_code);
 }

+void kvm_inject_nmi(struct kvm_vcpu *vcpu)
+{
+    vcpu->arch.nmi_pending = 1;
+}
+EXPORT_SYMBOL_GPL(kvm_inject_nmi);
+
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
 {
     WARN_ON(vcpu->arch.exception.pending);
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 1466c3f..567d739 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -285,6 +285,8 @@ struct kvm_vcpu_arch {
     struct kvm_vcpu_time_info hv_clock;
     unsigned int time_offset;
     struct page *time_page;
+
+    bool nmi_pending;
 };

 struct kvm_mem_alias {
@@ -512,6 +514,8 @@ void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long cr2,
                            u32 error_code);

+void kvm_inject_nmi(struct kvm_vcpu *vcpu);
+
 void fx_init(struct kvm_vcpu *vcpu);

 int emulator_read_std(unsigned long addr,
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 4232fd7..99a1736 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -146,6 +146,11 @@ static void ioapic_inj_irq(struct kvm_ioapic *ioapic,
     kvm_apic_set_irq(vcpu, vector, trig_mode);
 }

+static void ioapic_inj_nmi(struct kvm_vcpu *vcpu)
+{
+    kvm_inject_nmi(vcpu);
+}
+
 static u32 ioapic_get_delivery_bitmask(struct kvm_ioapic *ioapic, u8 dest,
                                        u8 dest_mode)
 {
@@ -239,8 +244,19 @@ static void ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
             }
         }
         break;
-
-    /* TODO: NMI */
+    case IOAPIC_NMI:
+        for (vcpu_id = 0; deliver_bitmask != 0; vcpu_id++) {
+            if (!(deliver_bitmask & (1 << vcpu_id)))
+                continue;
+            deliver_bitmask &= ~(1 << vcpu_id);
+            vcpu = ioapic->kvm->vcpus[vcpu_id];
+            if (vcpu)
+                ioapic_inj_nmi(vcpu);
+            else
+                ioapic_debug("NMI to vcpu %d failed\n",
+                             vcpu->vcpu_id);
+        }
+        break;
     default:
         printk(KERN_WARNING "Unsupported delivery mode %d\n",
                delivery_mode);
--
1.5.5
From: Bernd S. <bs...@q-...> - 2008-05-15 10:29:32

On Thursday 15 May 2008 09:36:41 Avi Kivity wrote:
> Bernd Schubert wrote:
>> Hello,
>>
>> there is a problem booting 2.6.26-rcX (X=1,2). It stops booting at
>>
>> Calibrating delay using timer specific routine.. 4016.92 BogoMIPS
>> (lpj=8033846)
>>
>> The kvm process then takes 100% of my host CPU.
>>
>> This is with kvm-67 on an AMD64-X2.
>>
>> I'm not yet familiar with kvm and debugging. Will a sysrq+t trace of the
>> host show something useful? Or will only full git-bisect help?
>
> Do you have CONFIG_KVM_GUEST or CONFIG_KVM_CLOCK in your config? If so,
> this may be a paravirt problem. Try turning them off and let us know.

Thanks, I had both options enabled. Disabling these makes the kernel boot
again. But I just ran into another bug:

[17180751.908653] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[17180751.911775] IP: [<ffffffff803dff37>] start_xmit+0x41/0x11a

Going to rebuild with debugging support now to get a better backtrace.

Thanks,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
From: Avi K. <av...@qu...> - 2008-05-15 10:20:20

Alex Williamson wrote:
> Trivial build warning/fixes when the local DEBUG define is enabled.

Applied, thanks.

--
error compiling committee.c: too many arguments to function