From: Anthony L. <an...@co...> - 2008-05-12 21:44:40
Ryan Harper wrote:
> * Anthony Liguori <an...@co...> [2008-05-12 15:05]:
>> Ryan Harper wrote:
>>> I've been digging into some of the instability we see when running
>>> larger numbers of guests at the same time. The test I'm currently using
>>> involves launching 64 1vcpu guests on an 8-way AMD box.
>>
>> Note this is a Barcelona system and therefore should have a
>> fixed-frequency TSC.
>>
>>> With the latest
>>> kvm-userspace git and kvm.git + Gerd's kvmclock fixes, I can launch all
>>> 64 of these 1 second apart,
>>
>> BTW, what if you don't pace-out the startups? Do we still have issues
>> with that?
>
> Do you mean without the 1 second delay or with a longer delay? My
> experience is that the delay helps (fewer hangs), but doesn't solve
> things completely.

So you see problems when using numactl to pin and using a 0-second
delay? The short delay may help reduce the number of CPU migrations,
which would explain your observation. If there are problems when doing
a 0-second delay and numactl, then perhaps it's not just a
cpu-migration issue.

> In svm.c, I do think we account for most of that time since the delta
> calculation will shift the guest time forward to the tsc value read in
> svm_vcpu_load(). We'll still miss the time between fixing the offset
> and when the guest can actually read its tsc.

Yes, which is the duration that the guest isn't scheduled on any
processor and the next time it runs happens to be on a different
processor.

>> A possible way to fix this (that's only valid on a processor with a
>> fixed-frequency TSC) is to take a high-res timestamp on vcpu_put,
>> then on vcpu_load take the delta timestamp since the old TSC was
>> saved, and use the TSC frequency on the new pcpu to calculate the
>> number of elapsed cycles.
>>
>> Assuming a fixed-frequency TSC, and a calibrated TSC across all
>> processors, you could get the same effects by using the VT tsc delta
>> logic. Basically, it always uses the new CPU's TSC unless that would
>> cause the guest to move backwards in time. As long as you have a
>> stable, calibrated TSC, this would work out.
>>
>> Can you try your old patch that did this and see if it fixes the problem?
>
> Yeah, I'll give it a spin.

Thanks,

Anthony Liguori
From: Ryan H. <ry...@us...> - 2008-05-12 21:24:17
* Anthony Liguori <an...@co...> [2008-05-12 15:05]:
> Ryan Harper wrote:
>> I've been digging into some of the instability we see when running
>> larger numbers of guests at the same time. The test I'm currently using
>> involves launching 64 1vcpu guests on an 8-way AMD box.
>
> Note this is a Barcelona system and therefore should have a
> fixed-frequency TSC.
>
>> With the latest
>> kvm-userspace git and kvm.git + Gerd's kvmclock fixes, I can launch all
>> 64 of these 1 second apart,
>
> BTW, what if you don't pace-out the startups? Do we still have issues
> with that?

Do you mean without the 1 second delay or with a longer delay? My
experience is that the delay helps (fewer hangs), but doesn't solve
things completely.

>> and only a handful (1 to 3) end up not
>> making it up. In dmesg on the host, I get a couple messages:
>>
>> [321365.362534] vcpu not ready for apic_round_robin
>>
>> and
>>
>> [321503.023788] Unsupported delivery mode 7
>>
>> Now, the interesting bit for me was when I used numactl to pin the guest
>> to a processor, all of the guests come up with no issues at all. As I
>> looked into it, it means that we're not running any of the vcpu
>> migration code, which on svm is comprised of tsc_offset recalibration
>> and apic migration, and on vmx, a little more per-vcpu work.
>
> Another data point is that -no-kvm-irqchip doesn't make the situation
> better.

Right. Let me clarify: I still see hung guests, but I need to validate
whether I see the apic-related messages in the host or not; I don't
recall for certain.

>> I've convinced myself that svm.c's tsc offset calculation works and
>> handles the migration from cpu to cpu quite well. I added the following
>> snippet to trigger if we ever encountered the case where we migrated to
>> a tsc that was behind:
>>
>>     rdtscll(tsc_this);
>>     delta = vcpu->arch.host_tsc - tsc_this;
>>     old_time = vcpu->arch.host_tsc + svm->vmcb->control.tsc_offset;
>>     new_time = tsc_this + svm->vmcb->control.tsc_offset + delta;
>>     if (new_time < old_time) {
>>         printk(KERN_ERR "ACK! (CPU%d->CPU%d) time goes back %llu\n",
>>                vcpu->cpu, cpu, old_time - new_time);
>>     }
>>     svm->vmcb->control.tsc_offset += delta;
>
> Time will never go backwards, but what can happen is that the TSC
> frequency will slow down. This is because upon VCPU migration, we don't
> account for the time between vcpu_put on the old processor and vcpu_load
> on the new processor. This time then disappears.

In svm.c, I do think we account for most of that time since the delta
calculation will shift the guest time forward to the tsc value read in
svm_vcpu_load(). We'll still miss the time between fixing the offset
and when the guest can actually read its tsc.

> A possible way to fix this (that's only valid on a processor with a
> fixed-frequency TSC) is to take a high-res timestamp on vcpu_put,
> then on vcpu_load take the delta timestamp since the old TSC was saved,
> and use the TSC frequency on the new pcpu to calculate the number of
> elapsed cycles.
>
> Assuming a fixed-frequency TSC, and a calibrated TSC across all
> processors, you could get the same effects by using the VT tsc delta
> logic. Basically, it always uses the new CPU's TSC unless that would
> cause the guest to move backwards in time. As long as you have a
> stable, calibrated TSC, this would work out.
>
> Can you try your old patch that did this and see if it fixes the problem?

Yeah, I'll give it a spin.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ry...@us...
From: Jack S. <st...@sg...> - 2008-05-12 20:01:23
On Fri, May 09, 2008 at 09:32:30PM +0200, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <an...@qu...>
>
> With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
> pages. There are secondary MMUs (with secondary sptes and secondary
> tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
> spte in mmu-notifier context, I mean "secondary pte". In the GRU case
> there's no actual secondary pte and there's only a secondary tlb,
> because the GRU secondary MMU has no knowledge about sptes and every
> secondary tlb miss event in the MMU always generates a page fault that
> has to be resolved by the CPU (this is not the case of KVM, where a
> secondary tlb miss will walk sptes in hardware and it will refill the
> ...

FYI, I applied the patch to a tree that has the GRU driver. All
regression tests passed.

--- jack
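For context, the notifier callbacks being reviewed in this thread roughly take the following shape (names follow the mmu-notifier API as later merged; the kvm-side helpers in this sketch are made up for illustration):

    #include <linux/mmu_notifier.h>

    /* A secondary MMU (KVM, GRU, XPMEM) registers callbacks so the primary
     * MMU can tell it to drop secondary ptes / tlb entries when host ptes
     * are torn down. mn_to_kvm() and kvm_unmap_spte() are hypothetical. */
    static void sketch_invalidate_page(struct mmu_notifier *mn,
                                       struct mm_struct *mm,
                                       unsigned long address)
    {
        /* Zap the spte mapping this host page and flush the secondary
         * tlb before the primary MMU frees the page. */
        kvm_unmap_spte(mn_to_kvm(mn), address);
    }

    static const struct mmu_notifier_ops sketch_ops = {
        .invalidate_page = sketch_invalidate_page,
    };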
From: Anthony L. <an...@co...> - 2008-05-12 19:49:08
Ryan Harper wrote:
> I've been digging into some of the instability we see when running
> larger numbers of guests at the same time. The test I'm currently using
> involves launching 64 1vcpu guests on an 8-way AMD box.

Note this is a Barcelona system and therefore should have a
fixed-frequency TSC.

> With the latest
> kvm-userspace git and kvm.git + Gerd's kvmclock fixes, I can launch all
> 64 of these 1 second apart,

BTW, what if you don't pace-out the startups? Do we still have issues
with that?

> and only a handful (1 to 3) end up not
> making it up. In dmesg on the host, I get a couple messages:
>
> [321365.362534] vcpu not ready for apic_round_robin
>
> and
>
> [321503.023788] Unsupported delivery mode 7
>
> Now, the interesting bit for me was when I used numactl to pin the guest
> to a processor, all of the guests come up with no issues at all. As I
> looked into it, it means that we're not running any of the vcpu
> migration code, which on svm is comprised of tsc_offset recalibration
> and apic migration, and on vmx, a little more per-vcpu work.

Another data point is that -no-kvm-irqchip doesn't make the situation
better.

> I've convinced myself that svm.c's tsc offset calculation works and
> handles the migration from cpu to cpu quite well. I added the following
> snippet to trigger if we ever encountered the case where we migrated to
> a tsc that was behind:
>
>     rdtscll(tsc_this);
>     delta = vcpu->arch.host_tsc - tsc_this;
>     old_time = vcpu->arch.host_tsc + svm->vmcb->control.tsc_offset;
>     new_time = tsc_this + svm->vmcb->control.tsc_offset + delta;
>     if (new_time < old_time) {
>         printk(KERN_ERR "ACK! (CPU%d->CPU%d) time goes back %llu\n",
>                vcpu->cpu, cpu, old_time - new_time);
>     }
>     svm->vmcb->control.tsc_offset += delta;

Time will never go backwards, but what can happen is that the TSC
frequency will slow down. This is because upon VCPU migration, we don't
account for the time between vcpu_put on the old processor and vcpu_load
on the new processor. This time then disappears.

A possible way to fix this (that's only valid on a processor with a
fixed-frequency TSC) is to take a high-res timestamp on vcpu_put, then
on vcpu_load take the delta timestamp since the old TSC was saved, and
use the TSC frequency on the new pcpu to calculate the number of elapsed
cycles.

Assuming a fixed-frequency TSC, and a calibrated TSC across all
processors, you could get the same effects by using the VT tsc delta
logic. Basically, it always uses the new CPU's TSC unless that would
cause the guest to move backwards in time. As long as you have a stable,
calibrated TSC, this would work out.

Can you try your old patch that did this and see if it fixes the problem?

> Noting that vcpu->arch.host_tsc is the tsc of the previous cpu the vcpu
> was running on (see svm_vcpu_put()). This allows me to check if we are
> in fact increasing the guest's view of the tsc. I've not been able to
> trigger this at all when the vcpus are migrating.
>
> As for the apic, the migrate code seems to be rather simple, but I've
> not yet dived in to see if we've got anything racy in there:
>
> lapic.c:
> void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
> {
>     struct kvm_lapic *apic = vcpu->arch.apic;
>     struct hrtimer *timer;
>
>     if (!apic)
>         return;
>
>     timer = &apic->timer.dev;
>     if (hrtimer_cancel(timer))
>         hrtimer_start(timer, timer->expires, HRTIMER_MODE_ABS);
> }

There's a big FIXME in __apic_timer_fn() to make sure the timer runs on
the current "pCPU". As written, it's possible for the timer to fire on a
different pcpu than the current vcpu's, but it wasn't obvious to me that
it would cause problems. Eddie, et al: care to elaborate on what the
TODO was trying to address?

Regards,

Anthony Liguori
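The compensation Anthony describes could look roughly like the following (a sketch only, assuming a fixed-frequency, cross-CPU-calibrated TSC; the last_put_ns field and the helper names are hypothetical, not the actual kvm code):

    /* Sketch: stamp wall-clock time when the vcpu is descheduled. */
    static void sketch_vcpu_put(struct kvm_vcpu *vcpu)
    {
        rdtscll(vcpu->arch.host_tsc);           /* TSC on the old pcpu */
        vcpu->arch.last_put_ns = ktime_to_ns(ktime_get()); /* hypothetical */
    }

    static void sketch_vcpu_load(struct kvm_vcpu *vcpu, struct vmcb *vmcb)
    {
        u64 tsc_this, elapsed_ns, elapsed_cycles, delta;

        rdtscll(tsc_this);                      /* TSC on the new pcpu */
        elapsed_ns = ktime_to_ns(ktime_get()) - vcpu->arch.last_put_ns;
        /* Convert off-cpu wall-clock time into cycles at the new pcpu's
         * TSC rate: cycles = ns * kHz / 10^6. */
        elapsed_cycles = elapsed_ns * tsc_khz;
        do_div(elapsed_cycles, 1000000);
        /* Advance the guest TSC past the old value by the off-cpu time,
         * instead of letting that time silently disappear. */
        delta = vcpu->arch.host_tsc + elapsed_cycles - tsc_this;
        vmcb->control.tsc_offset += delta;
    }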
From: Ryan H. <ry...@us...> - 2008-05-12 19:20:11
I've been digging into some of the instability we see when running
larger numbers of guests at the same time. The test I'm currently using
involves launching 64 1vcpu guests on an 8-way AMD box. With the latest
kvm-userspace git and kvm.git + Gerd's kvmclock fixes, I can launch all
64 of these 1 second apart, and only a handful (1 to 3) end up not
making it up. In dmesg on the host, I get a couple messages:

[321365.362534] vcpu not ready for apic_round_robin

and

[321503.023788] Unsupported delivery mode 7

Now, the interesting bit for me was when I used numactl to pin the guest
to a processor, all of the guests come up with no issues at all. As I
looked into it, it means that we're not running any of the vcpu
migration code, which on svm is comprised of tsc_offset recalibration
and apic migration, and on vmx, a little more per-vcpu work.

I've convinced myself that svm.c's tsc offset calculation works and
handles the migration from cpu to cpu quite well. I added the following
snippet to trigger if we ever encountered the case where we migrated to
a tsc that was behind:

    rdtscll(tsc_this);
    delta = vcpu->arch.host_tsc - tsc_this;
    old_time = vcpu->arch.host_tsc + svm->vmcb->control.tsc_offset;
    new_time = tsc_this + svm->vmcb->control.tsc_offset + delta;
    if (new_time < old_time) {
        printk(KERN_ERR "ACK! (CPU%d->CPU%d) time goes back %llu\n",
               vcpu->cpu, cpu, old_time - new_time);
    }
    svm->vmcb->control.tsc_offset += delta;

Noting that vcpu->arch.host_tsc is the tsc of the previous cpu the vcpu
was running on (see svm_vcpu_put()). This allows me to check if we are
in fact increasing the guest's view of the tsc. I've not been able to
trigger this at all when the vcpus are migrating.

As for the apic, the migrate code seems to be rather simple, but I've
not yet dived in to see if we've got anything racy in there:

lapic.c:
void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
{
    struct kvm_lapic *apic = vcpu->arch.apic;
    struct hrtimer *timer;

    if (!apic)
        return;

    timer = &apic->timer.dev;
    if (hrtimer_cancel(timer))
        hrtimer_start(timer, timer->expires, HRTIMER_MODE_ABS);
}

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ry...@us...
From: SourceForge.net <no...@so...> - 2008-05-12 17:58:27
Bugs item #1962539, was opened at 2008-05-12 19:58
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1962539&group_id=180599

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request, not
just the latest update.

Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Pascal Terjan (pterjan)
Assigned to: Nobody/Anonymous (nobody)
Summary: Remaining svn conflict marker in qemu/Makefile.target

Initial Comment:
[pterjan@celeste kvm-68]$ grep -r "<<<<<<<" qemu
qemu/Makefile.target:<<<<<<< HEAD:qemu/Makefile.target

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1962539&group_id=180599
From: Marcelo T. <mto...@re...> - 2008-05-12 16:36:53
On Sat, May 10, 2008 at 10:12:02AM +0800, Yang, Sheng wrote:
>> Did you have kvm.git commit 8ae6dc90ac84d9734e343210c8ec709f50cd9d89
>> when testing this?
>>
>> I believe it should fix that issue, because "ps->inject_pending" won't
>> be set by kvm_pit_timer_intr_post() if the IRQ is masked. Please correct
>> me if I'm wrong.
>
> Oh, sorry, I missed that commit. But... it just solved half of the
> problem. LAPIC suffered from it as well, and the current HLT emulation
> still didn't work... And I can't find something like inject_pending in
> the LAPIC timer.
>
> I have to say, I think my method is more precise, direct and
> efficient... It also can be extended easily if we get more clock
> sources (though I don't think this will happen in the near future...).
> In fact, I think taking care of pending counts is some kind of *wrong
> concept*... We should take care of the window, or of when the increment
> of the pending counter happened, CMIIW. And it has nothing to do with
> the current counter number (yeah, I realized it after seeing the hlt
> behaviour in XP, not before ;) ).

Sheng,

The problem is that you don't want to emulate hlt if you have a pending
timer _and_ the guest is accepting events.

So for example if there are two apic timers pending, you inject one of
them, the guest executes hlt, and we end up in vcpu_block().

Does this work for you?

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 7652f88..d41e34c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -961,7 +961,7 @@ int apic_has_pending_timer(struct kvm_vcpu *vcpu)
 {
     struct kvm_lapic *lapic = vcpu->arch.apic;
 
-    if (lapic)
+    if (lapic && apic_enabled(lapic) && apic_lvt_enabled(lapic, APIC_LVTT))
         return atomic_read(&lapic->timer.pending);
 
     return 0;
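In other words, a guest hlt should only put the vcpu to sleep when no timer interrupt is already pending while the guest can accept it; otherwise the vcpu must keep running. A rough sketch of that decision (kvm_should_block() and guest_accepting_events() are hypothetical names for illustration; the real logic is spread across vcpu_block() and the apic code):

    /* Sketch: should a guest hlt actually block this vcpu? */
    static int kvm_should_block(struct kvm_vcpu *vcpu)
    {
        /* A pending timer tick that the guest would accept immediately
         * means hlt emulation should not enter the blocked state. */
        if (apic_has_pending_timer(vcpu) && guest_accepting_events(vcpu))
            return 0;
        return 1;
    }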
From: Gerry R. <gr...@ve...> - 2008-05-12 16:07:57
I would like to expose some FXO PCI cards to my Asterisk guest under
KVM. How can I do this? I see where maybe you can hotplug some nic or
storage devices, but what about other PCI devices like my FXO cards? Is
there some kernel command line option that could do this?

Regards,
Gerry
From: Avi K. <av...@qu...> - 2008-05-12 15:44:56
Anthony Liguori wrote:
>> Yes. The loop was a (perhaps premature) optimization that is now
>> totally unnecessary, unless I'm missing something quite large.
>
> It used to be that kvm_eat_signal() selected after consuming as many
> signals as possible while only sleeping once. That's why there's a
> combination of sleeping and polling.

Yes.

> Now the VCPU threads never select, so the whole loop can be simplified
> to a single sigtimedwait() that always blocks.
>
> In reality, I don't think sigtimedwait() is really needed/useful for
> VCPUs anymore. We only use it to catch SIG_IPI, and we only use
> SIG_IPI to break out of sleeping. I don't see any reason why we
> couldn't switch over to using a file descriptor for notification (or a
> pthread condition).

How would you stop a vcpu running in guest mode?

> At the very least, we could just select() on nothing and allow SIG_IPI
> to break us out of the select.

sigtimedwait() (or just sigwait, now) doesn't require the signal to be
delivered, so it's faster. If there's nothing to select, why call
select()?

-- 
error compiling committee.c: too many arguments to function
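The point about sigwait() not requiring delivery can be seen in a minimal standalone example: the thread blocks the signal, so it is consumed synchronously by sigwait() and no handler frame is ever set up (SIGUSR1 stands in for qemu-kvm's SIG_IPI here; compile with -pthread):

    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>

    #define SIG_IPI SIGUSR1   /* stand-in; qemu-kvm defines its own SIG_IPI */

    int main(void)
    {
        sigset_t set;
        int sig;

        /* Block SIG_IPI so it stays pending instead of being delivered. */
        sigemptyset(&set);
        sigaddset(&set, SIG_IPI);
        pthread_sigmask(SIG_BLOCK, &set, NULL);

        raise(SIG_IPI);       /* simulates another thread's pthread_kill() */

        /* Consume it synchronously; no signal handler ever runs, which
         * is why this is cheaper than real signal delivery. */
        sigwait(&set, &sig);
        printf("got signal %d\n", sig);
        return 0;
    }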
From: Anthony L. <an...@co...> - 2008-05-12 15:11:58
Avi Kivity wrote:
>> Now the VCPU threads never select, so the whole loop can be simplified
>> to a single sigtimedwait() that always blocks.
>>
>> In reality, I don't think sigtimedwait() is really needed/useful for
>> VCPUs anymore. We only use it to catch SIG_IPI, and we only use
>> SIG_IPI to break out of sleeping. I don't see any reason why we
>> couldn't switch over to using a file descriptor for notification (or
>> a pthread condition).
>
> How would you stop a vcpu running in guest mode?

Yeah, I forgot about that.

>> At the very least, we could just select() on nothing and allow
>> SIG_IPI to break us out of the select.
>
> sigtimedwait() (or just sigwait, now) doesn't require the signal to be
> delivered, so it's faster.

Yeah, sigtimedwait() is probably the right thing to do since we have to
use a signal for the IPI.

Regards,

Anthony Liguori

> If there's nothing to select, why call select()?
From: Anthony L. <an...@co...> - 2008-05-12 14:31:47
Avi Kivity wrote:
> Jan Kiszka wrote:
>>> Given that with the iothread we spend very little time processing
>>> signals in vcpu threads, maybe it's better to drop the loop completely.
>>> The common case is zero or one pending signals. The uncommon case of
>>> two or more pending signals will be handled by the KVM_RUN ioctl
>>> returning immediately with -EINTR (i.e. in the outer loop).
>>
>> You mean
>>
>> static void kvm_main_loop_wait(CPUState *env, int timeout)
>> {
>>     pthread_mutex_unlock(&qemu_mutex);
>>     kvm_eat_signal(env, timeout);
>>     pthread_mutex_lock(&qemu_mutex);
>>     cpu_single_env = env;
>>
>>     vcpu_info[env->cpu_index].signalled = 0;
>> }
>>
>> ?
>
> Yes. The loop was a (perhaps premature) optimization that is now
> totally unnecessary, unless I'm missing something quite large.

It used to be that kvm_eat_signal() selected after consuming as many
signals as possible while only sleeping once. That's why there's a
combination of sleeping and polling.

Now the VCPU threads never select, so the whole loop can be simplified
to a single sigtimedwait() that always blocks.

In reality, I don't think sigtimedwait() is really needed/useful for
VCPUs anymore. We only use it to catch SIG_IPI, and we only use SIG_IPI
to break out of sleeping. I don't see any reason why we couldn't switch
over to using a file descriptor for notification (or a pthread
condition). At the very least, we could just select() on nothing and
allow SIG_IPI to break us out of the select.

Regards,

Anthony Liguori

> Oh. There used to be a bug where we didn't check for a pending signal
> before the first guest entry, so this would add a lot of latency
> (effectively making the bug window much larger). That was only closed
> in 2.6.24 (by 7e66f350).
From: Avi K. <av...@qu...> - 2008-05-12 11:56:41
Jan Kiszka wrote:
>> Given that with the iothread we spend very little time processing
>> signals in vcpu threads, maybe it's better to drop the loop completely.
>> The common case is zero or one pending signals. The uncommon case of
>> two or more pending signals will be handled by the KVM_RUN ioctl
>> returning immediately with -EINTR (i.e. in the outer loop).
>
> You mean
>
> static void kvm_main_loop_wait(CPUState *env, int timeout)
> {
>     pthread_mutex_unlock(&qemu_mutex);
>     kvm_eat_signal(env, timeout);
>     pthread_mutex_lock(&qemu_mutex);
>     cpu_single_env = env;
>
>     vcpu_info[env->cpu_index].signalled = 0;
> }
>
> ?

Yes. The loop was a (perhaps premature) optimization that is now
totally unnecessary, unless I'm missing something quite large.

Oh. There used to be a bug where we didn't check for a pending signal
before the first guest entry, so this would add a lot of latency
(effectively making the bug window much larger). That was only closed
in 2.6.24 (by 7e66f350).

-- 
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-05-12 11:44:34
Jan Kiszka wrote:
> Avi Kivity wrote:
>> Jan Kiszka wrote:
>>> Resetting guests used to be racy, deadlock-prone, or simply broken
>>> (for SMP). This patch fixes the issues, following Marcelo's suggestion
>>> to consolidate the reset activity in the I/O thread. All vcpus are
>>> cleanly stopped before the emulated hardware is reset, and
>>> kvm_arch_cpu_reset is introduced and invoked to ensure that non-boot
>>> cpus are put into the right state on x86. Note that other arches may
>>> need to look into this service as well to get SMP reset right.
>>
>> Hmm. This means that reset is executed asynchronously of the calling
>> cpu.
>
> Nope, the calling cpu is put in the stopped state after informing the
> I/O thread about the reset request. We just cannot control when the
> request is handled with respect to other cpus, but that's not different
> from real life, I guess.

That's true for triple faults, but not for ordinary keyboard controller
resets. All you do there is call main_loop_break(). But I think that's
fine; on a real machine the keyboard controller reset is processed by a
separate processor (the keyboard controller), which has nonzero latency.

-- 
error compiling committee.c: too many arguments to function
From: Jan K. <jan...@we...> - 2008-05-12 11:42:10
Avi Kivity wrote:
> Jan Kiszka wrote:
>> Having both kvm_eat_signal and kvm_eat_signals makes the code harder to
>> read. Moreover, given the single caller of kvm_eat_signals, there is no
>> real reason to keep it in a separate function.
>
> Given that with the iothread we spend very little time processing
> signals in vcpu threads, maybe it's better to drop the loop completely.
> The common case is zero or one pending signals. The uncommon case of
> two or more pending signals will be handled by the KVM_RUN ioctl
> returning immediately with -EINTR (i.e. in the outer loop).

You mean

static void kvm_main_loop_wait(CPUState *env, int timeout)
{
    pthread_mutex_unlock(&qemu_mutex);
    kvm_eat_signal(env, timeout);
    pthread_mutex_lock(&qemu_mutex);
    cpu_single_env = env;

    vcpu_info[env->cpu_index].signalled = 0;
}

?

Jan
From: Jan K. <jan...@we...> - 2008-05-12 11:39:00
Avi Kivity wrote:
> Jan Kiszka wrote:
>> Resetting guests used to be racy, deadlock-prone, or simply broken
>> (for SMP). This patch fixes the issues, following Marcelo's suggestion
>> to consolidate the reset activity in the I/O thread. All vcpus are
>> cleanly stopped before the emulated hardware is reset, and
>> kvm_arch_cpu_reset is introduced and invoked to ensure that non-boot
>> cpus are put into the right state on x86. Note that other arches may
>> need to look into this service as well to get SMP reset right.
>
> Hmm. This means that reset is executed asynchronously of the calling
> cpu.

Nope, the calling cpu is put in the stopped state after informing the
I/O thread about the reset request. We just cannot control when the
request is handled with respect to other cpus, but that's not different
from real life, I guess.

Jan
From: Jan K. <jan...@we...> - 2008-05-12 11:35:04
Hi,

before going wild with my idea, I would like to collect some comments on
this approach:

While doing first kernel debugging with my debug register patches for
kvm, I quickly ran into the 4-breakpoints-only limitation that comes
from the fact that we blindly map software to hardware breakpoints.
Unhandy, simply suboptimal. Also, having 4 breakpoint slots hard-coded
in the generic interface is not fair to arches that may support more.
Moreover, we do not support watchpoints although this would easily be
feasible. But if we supported watchpoints (via debug registers on x86),
we would need to break out of the 4-slot limitation even earlier.

In short, I came to the conclusion that a rewrite of the KVM_DEBUG_GUEST
interface is required.

Why do we set breakpoints in the kernel? Why not simply catch all debug
traps, insert software breakpoint ops into the guest code, and handle
all this stuff as normal debuggers do? And the hardware breakpoints
should just be pushed through the kernel interface like ptrace does.

The new KVM_DEBUG_GUEST interface I currently have in mind would look
like this:

#define KVM_DBGGUEST_ENABLE     0x01
#define KVM_DBGGUEST_SINGLESTEP 0x02

struct kvm_debug_guest {
    __u32 control;
    struct kvm_debug_guest_arch arch;
};

Setting KVM_DBGGUEST_ENABLE would forward all debug-related traps to
userspace first, which can then decide to handle or re-inject them.
KVM_DBGGUEST_SINGLESTEP would work as before.

And the extension for x86 would look like this:

struct kvm_debug_guest_arch {
    __u32 use_hw_breakpoints;
    __u64 debugreg[8];
};

If use_hw_breakpoints is non-zero, KVM would completely overwrite the
guest's debug registers with the content of debugreg, giving full
control of this feature to the host-side debugger (faking the content
of debug registers, effectively disabling them for the guest - as we
now do all the time).

Questions:
- Does anyone see traps and pitfalls in this approach?
- May I replace the existing interface with this one, or am I
  overlooking some use case that already worked with the current code
  so that ABI compatibility is required (most debug stuff should have
  been simply broken so far, also due to bugs in userland)?

Jan
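To make the proposal concrete, a host-side debugger might arm a hardware breakpoint through the suggested interface roughly like this (the struct layout is the proposal above; the DR7 encoding and the ioctl invocation are illustrative assumptions, not an existing API):

    #include <sys/ioctl.h>
    /* kvm_debug_guest / KVM_DBGGUEST_* as proposed above */

    int set_hw_breakpoint(int vcpu_fd, unsigned long long addr)
    {
        struct kvm_debug_guest dbg = { .control = KVM_DBGGUEST_ENABLE };

        dbg.arch.use_hw_breakpoints = 1;
        dbg.arch.debugreg[0] = addr;  /* DR0: linear address to trap on */
        dbg.arch.debugreg[7] = 0x1;   /* DR7: L0 enable; len/type 0 = execute */

        /* KVM would overwrite the guest's debug registers with these
         * values and forward the resulting #DB traps to userspace. */
        return ioctl(vcpu_fd, KVM_DEBUG_GUEST, &dbg);
    }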
From: Avi K. <av...@qu...> - 2008-05-12 11:33:43
Jan Kiszka wrote:
> Having both kvm_eat_signal and kvm_eat_signals makes the code harder to
> read. Moreover, given the single caller of kvm_eat_signals, there is no
> real reason to keep it in a separate function.

Given that with the iothread we spend very little time processing
signals in vcpu threads, maybe it's better to drop the loop completely.
The common case is zero or one pending signals. The uncommon case of
two or more pending signals will be handled by the KVM_RUN ioctl
returning immediately with -EINTR (i.e. in the outer loop).

-- 
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-05-12 11:30:53
Jan Kiszka wrote:
> As suggested by Anthony, this patch encapsulates the sequence "save
> cpu_single_env, temporarily drop qemu_mutex, restore cpu_single_env"
> for condition variables in a helper function. It also adds a safety
> check to the open-coded kvm_mutex_lock that the caller is not a vcpu
> thread (as kvm_mutex_unlock clears cpu_single_env).

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-05-12 11:29:23
Jan Kiszka wrote:
> Resetting guests used to be racy, deadlock-prone, or simply broken (for
> SMP). This patch fixes the issues, following Marcelo's suggestion to
> consolidate the reset activity in the I/O thread. All vcpus are cleanly
> stopped before the emulated hardware is reset, and kvm_arch_cpu_reset
> is introduced and invoked to ensure that non-boot cpus are put into the
> right state on x86. Note that other arches may need to look into this
> service as well to get SMP reset right.

Hmm. This means that reset is executed asynchronously of the calling
cpu. Well, in a real machine a keyboard controller reset is asynchronous
as well, so I guess that's not an issue.

-- 
error compiling committee.c: too many arguments to function
From: Jan K. <jan...@we...> - 2008-05-12 10:51:04
Refreshed version of the earlier posted patch, now also adding a safety
check about the correct vm state to kvm_load_registers.

Some monitor commands as well as the vm_stop() issued by the gdbstub on
external interruption so far deadlock on vcpu locks in the kernel. The
patch below resolves the issue by temporarily or permanently stopping
all vcpu threads before issuing the related KVM IOCTLs. At this chance,
the patch also removes redundant checks for kvm_enabled. Among other
things, the patch now allows to break into the BIOS or other guest code
spinning in the vcpu and to use things like "info cpus" in the monitor.

Signed-off-by: Jan Kiszka <jan...@si...>
---
 qemu/gdbstub.c  |   12 ++++--------
 qemu/monitor.c  |    3 +--
 qemu/qemu-kvm.c |   41 +++++++++++++++++++++++++++--------------
 qemu/vl.c       |    2 +-
 4 files changed, 33 insertions(+), 25 deletions(-)

Index: b/qemu/monitor.c
===================================================================
--- a/qemu/monitor.c
+++ b/qemu/monitor.c
@@ -286,8 +286,7 @@ static CPUState *mon_get_cpu(void)
         mon_set_cpu(0);
     }
 
-    if (kvm_enabled())
-        kvm_save_registers(mon_cpu);
+    kvm_save_registers(mon_cpu);
 
     return mon_cpu;
 }
Index: b/qemu/qemu-kvm.c
===================================================================
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -137,18 +137,6 @@ static int pre_kvm_run(void *opaque, int
     return 0;
 }
 
-void kvm_load_registers(CPUState *env)
-{
-    if (kvm_enabled())
-        kvm_arch_load_regs(env);
-}
-
-void kvm_save_registers(CPUState *env)
-{
-    if (kvm_enabled())
-        kvm_arch_save_regs(env);
-}
-
 int kvm_cpu_exec(CPUState *env)
 {
     int r;
@@ -276,6 +264,26 @@ static void kvm_vm_state_change_handler(
         pause_all_threads();
 }
 
+void kvm_load_registers(CPUState *env)
+{
+    assert(!vm_running);
+
+    if (kvm_enabled())
+        kvm_arch_load_regs(env);
+}
+
+void kvm_save_registers(CPUState *env)
+{
+    if (!kvm_enabled())
+        return;
+
+    if (vm_running)
+        pause_all_threads();
+    kvm_arch_save_regs(env);
+    if (vm_running)
+        resume_all_threads();
+}
+
 static void update_regs_for_sipi(CPUState *env)
 {
     kvm_arch_update_regs_for_sipi(env);
@@ -764,7 +772,7 @@ int kvm_qemu_init_env(CPUState *cenv)
 int kvm_update_debugger(CPUState *env)
 {
     struct kvm_debug_guest dbg;
-    int i;
+    int i, r;
 
     dbg.enabled = 0;
     if (env->nb_breakpoints || env->singlestep_enabled) {
@@ -775,7 +783,12 @@ int kvm_update_debugger(CPUState *env)
         }
         dbg.singlestep = env->singlestep_enabled;
     }
-    return kvm_guest_debug(kvm_context, env->cpu_index, &dbg);
+    if (vm_running)
+        pause_all_threads();
+    r = kvm_guest_debug(kvm_context, env->cpu_index, &dbg);
+    if (vm_running)
+        resume_all_threads();
+    return r;
 }
Index: b/qemu/vl.c
===================================================================
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -7225,12 +7225,12 @@ void vm_stop(int reason)
 {
     if (vm_running) {
         cpu_disable_ticks();
-        vm_running = 0;
         if (reason != 0) {
             if (vm_stop_cb) {
                 vm_stop_cb(vm_stop_opaque, reason);
             }
         }
+        vm_running = 0;
         vm_state_notify(0);
     }
 }
Index: b/qemu/gdbstub.c
===================================================================
--- a/qemu/gdbstub.c
+++ b/qemu/gdbstub.c
@@ -904,8 +904,7 @@ static int gdb_handle_packet(GDBState *s
         addr = strtoull(p, (char **)&p, 16);
 #if defined(TARGET_I386)
         env->eip = addr;
-        if (kvm_enabled())
-            kvm_load_registers(env);
+        kvm_load_registers(env);
 #elif defined (TARGET_PPC)
         env->nip = addr;
 #elif defined (TARGET_SPARC)
@@ -928,8 +927,7 @@ static int gdb_handle_packet(GDBState *s
         addr = strtoull(p, (char **)&p, 16);
 #if defined(TARGET_I386)
         env->eip = addr;
-        if (kvm_enabled())
-            kvm_load_registers(env);
+        kvm_load_registers(env);
 #elif defined (TARGET_PPC)
         env->nip = addr;
 #elif defined (TARGET_SPARC)
@@ -973,8 +971,7 @@ static int gdb_handle_packet(GDBState *s
         }
         break;
     case 'g':
-        if (kvm_enabled())
-            kvm_save_registers(env);
+        kvm_save_registers(env);
         reg_size = cpu_gdb_read_registers(env, mem_buf);
         memtohex(buf, mem_buf, reg_size);
         put_packet(s, buf);
@@ -984,8 +981,7 @@ static int gdb_handle_packet(GDBState *s
         len = strlen(p) / 2;
         hextomem((uint8_t *)registers, p, len);
         cpu_gdb_write_registers(env, mem_buf, len);
-        if (kvm_enabled())
-            kvm_load_registers(env);
+        kvm_load_registers(env);
         put_packet(s, "OK");
         break;
     case 'm':
From: Jan K. <jan...@we...> - 2008-05-12 10:50:06
Having both kvm_eat_signal and kvm_eat_signals makes the code harder to
read. Moreover, given the single caller of kvm_eat_signals, there is no
real reason to keep it in a separate function.

Signed-off-by: Jan Kiszka <jan...@we...>
---
 qemu/qemu-kvm.c |   11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

Index: b/qemu/qemu-kvm.c
===================================================================
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -210,11 +210,12 @@ static int kvm_eat_signal(CPUState *env,
     return ret;
 }
 
-
-static void kvm_eat_signals(CPUState *env, int timeout)
+static void kvm_main_loop_wait(CPUState *env, int timeout)
 {
     int r = 0;
 
+    pthread_mutex_unlock(&qemu_mutex);
+
     while (kvm_eat_signal(env, 0))
         r = 1;
     if (!r && timeout) {
@@ -223,14 +224,10 @@ static void kvm_eat_signals(CPUState *en
         while (kvm_eat_signal(env, 0))
             ;
     }
-}
 
-static void kvm_main_loop_wait(CPUState *env, int timeout)
-{
-    pthread_mutex_unlock(&qemu_mutex);
-    kvm_eat_signals(env, timeout);
     pthread_mutex_lock(&qemu_mutex);
     cpu_single_env = env;
+
     vcpu_info[env->cpu_index].signalled = 0;
 }
From: Jan K. <jan...@we...> - 2008-05-12 10:49:39
Resetting guests used to be racy, deadlock-prone, or simply broken (for
SMP). This patch fixes the issues, following Marcelo's suggestion to
consolidate the reset activity in the I/O thread. All vcpus are cleanly
stopped before the emulated hardware is reset, and kvm_arch_cpu_reset is
introduced and invoked to ensure that non-boot cpus are put into the
right state on x86. Note that other arches may need to look into this
service as well to get SMP reset right.

Moreover, a safety check is added to pause/resume_all_threads to ensure
that they are only invoked in I/O thread context.

Signed-off-by: Jan Kiszka <jan...@si...>
---
 qemu/qemu-kvm-ia64.c    |    4 ++++
 qemu/qemu-kvm-powerpc.c |    4 ++++
 qemu/qemu-kvm-x86.c     |   16 ++++++++++++++++
 qemu/qemu-kvm.c         |   34 +++++++++++++++++-----------------
 qemu/qemu-kvm.h         |    1 +
 qemu/vl.c               |    6 +-----
 6 files changed, 43 insertions(+), 22 deletions(-)

Index: b/qemu/qemu-kvm-ia64.c
===================================================================
--- a/qemu/qemu-kvm-ia64.c
+++ b/qemu/qemu-kvm-ia64.c
@@ -61,3 +61,7 @@ int kvm_arch_try_push_interrupts(void *o
 void kvm_arch_update_regs_for_sipi(CPUState *env)
 {
 }
+
+void kvm_arch_cpu_reset(CPUState *env)
+{
+}
Index: b/qemu/qemu-kvm-powerpc.c
===================================================================
--- a/qemu/qemu-kvm-powerpc.c
+++ b/qemu/qemu-kvm-powerpc.c
@@ -213,3 +213,7 @@ int handle_powerpc_dcr_write(int vcpu, u
 
     return 0; /* XXX ignore failed DCR ops */
 }
+
+void kvm_arch_cpu_reset(CPUState *env)
+{
+}
Index: b/qemu/qemu-kvm-x86.c
===================================================================
--- a/qemu/qemu-kvm-x86.c
+++ b/qemu/qemu-kvm-x86.c
@@ -671,3 +671,19 @@ int handle_tpr_access(void *opaque, int
     kvm_tpr_access_report(cpu_single_env, rip, is_write);
     return 0;
 }
+
+void kvm_arch_cpu_reset(CPUState *env)
+{
+    struct kvm_mp_state mp_state = { .mp_state = KVM_MP_STATE_UNINITIALIZED };
+
+    kvm_arch_load_regs(env);
+    if (env->cpu_index != 0) {
+        if (kvm_irqchip_in_kernel(kvm_context))
+            kvm_set_mpstate(kvm_context, env->cpu_index, &mp_state);
+        else {
+            env->interrupt_request &= ~CPU_INTERRUPT_HARD;
+            env->hflags |= HF_HALTED_MASK;
+            env->exception_index = EXCP_HLT;
+        }
+    }
+}
Index: b/qemu/qemu-kvm.c
===================================================================
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -32,8 +32,6 @@ kvm_context_t kvm_context;
 
 extern int smp_cpus;
 
-static int qemu_kvm_reset_requested;
-
 pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
 pthread_cond_t qemu_vcpu_cond = PTHREAD_COND_INITIALIZER;
@@ -53,7 +51,6 @@ struct vcpu_info {
     int signalled;
     int stop;
    int stopped;
-    int reload_regs;
     int created;
 } vcpu_info[256];
 
@@ -251,6 +248,8 @@ static void pause_all_threads(void)
 {
     int i;
 
+    assert(!cpu_single_env);
+
     for (i = 0; i < smp_cpus; ++i) {
         vcpu_info[i].stop = 1;
         pthread_kill(vcpu_info[i].thread, SIG_IPI);
     }
@@ -263,6 +262,8 @@ static void resume_all_threads(void)
 {
     int i;
 
+    assert(!cpu_single_env);
+
     for (i = 0; i < smp_cpus; ++i) {
         vcpu_info[i].stop = 0;
         vcpu_info[i].stopped = 0;
@@ -307,15 +308,18 @@ static void setup_kernel_sigmask(CPUStat
     kvm_set_signal_mask(kvm_context, env->cpu_index, &set);
 }
 
-void qemu_kvm_system_reset_request(void)
+void qemu_kvm_system_reset(void)
 {
     int i;
 
-    for (i = 0; i < smp_cpus; ++i) {
-        vcpu_info[i].reload_regs = 1;
-        pthread_kill(vcpu_info[i].thread, SIG_IPI);
-    }
+    pause_all_threads();
+
     qemu_system_reset();
+
+    for (i = 0; i < smp_cpus; ++i)
+        kvm_arch_cpu_reset(vcpu_info[i].env);
+
+    resume_all_threads();
 }
 
 static int kvm_main_loop_cpu(CPUState *env)
@@ -348,11 +352,6 @@ static int kvm_main_loop_cpu(CPUState *e
         kvm_cpu_exec(env);
         env->interrupt_request &= ~CPU_INTERRUPT_EXIT;
         kvm_main_loop_wait(env, 0);
-        if (info->reload_regs) {
-            info->reload_regs = 0;
-            if (env->cpu_index == 0) /* ap needs to be placed in INIT */
-                kvm_arch_load_regs(env);
-        }
     }
     pthread_mutex_unlock(&qemu_mutex);
     return 0;
@@ -535,10 +534,8 @@ int kvm_main_loop(void)
             break;
         else if (qemu_powerdown_requested())
             qemu_system_powerdown();
-        else if (qemu_reset_requested()) {
-            pthread_kill(vcpu_info[0].thread, SIG_IPI);
-            qemu_kvm_reset_requested = 1;
-        }
+        else if (qemu_reset_requested())
+            qemu_kvm_system_reset();
     }
 
     pause_all_threads();
@@ -647,6 +644,9 @@ static int kvm_halt(void *opaque, int vc
 
 static int kvm_shutdown(void *opaque, int vcpu)
 {
+    /* stop the current vcpu from going back to guest mode */
+    vcpu_info[cpu_single_env->cpu_index].stopped = 1;
+
     qemu_system_reset_request();
     return 1;
 }
Index: b/qemu/qemu-kvm.h
===================================================================
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -59,6 +59,7 @@ void kvm_arch_post_kvm_run(void *opaque,
 int kvm_arch_has_work(CPUState *env);
 int kvm_arch_try_push_interrupts(void *opaque);
 void kvm_arch_update_regs_for_sipi(CPUState *env);
+void kvm_arch_cpu_reset(CPUState *env);
 
 CPUState *qemu_kvm_cpu_env(int index);
 
Index: b/qemu/vl.c
===================================================================
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -7302,11 +7302,7 @@ void qemu_system_reset_request(void)
     }
     if (cpu_single_env)
         cpu_interrupt(cpu_single_env, CPU_INTERRUPT_EXIT);
-#ifdef USE_KVM
-    if (kvm_allowed)
-        if (!no_reboot)
-            qemu_kvm_system_reset_request();
-#endif
+    main_loop_break();
 }
 
 void qemu_system_shutdown_request(void)
From: Jan K. <jan...@we...> - 2008-05-12 10:49:32
As suggested by Anthony, this patch encapsulates the sequence "save
cpu_single_env, temporarily drop qemu_mutex, restore cpu_single_env" for
condition variables in a helper function. It also adds a safety check to
the open-coded kvm_mutex_lock that the caller is not a vcpu thread (as
kvm_mutex_unlock clears cpu_single_env).

Signed-off-by: Jan Kiszka <jan...@we...>
---
 qemu/qemu-kvm.c |   29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

Index: b/qemu/qemu-kvm.c
===================================================================
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -12,6 +12,7 @@ int kvm_allowed = 1;
 int kvm_irqchip = 1;
 int kvm_pit = 1;
 
+#include <assert.h>
 #include <string.h>
 #include "hw/hw.h"
 #include "sysemu.h"
@@ -65,6 +66,14 @@ static inline unsigned long kvm_get_thre
     return syscall(SYS_gettid);
 }
 
+static void qemu_cond_wait(pthread_cond_t *cond)
+{
+    CPUState *env = cpu_single_env;
+
+    pthread_cond_wait(cond, &qemu_mutex);
+    cpu_single_env = env;
+}
+
 CPUState *qemu_kvm_cpu_env(int index)
 {
     return vcpu_info[index].env;
@@ -246,11 +255,8 @@ static void pause_all_threads(void)
         vcpu_info[i].stop = 1;
         pthread_kill(vcpu_info[i].thread, SIG_IPI);
     }
-    while (!all_threads_paused()) {
-        CPUState *env = cpu_single_env;
-        pthread_cond_wait(&qemu_pause_cond, &qemu_mutex);
-        cpu_single_env = env;
-    }
+    while (!all_threads_paused())
+        qemu_cond_wait(&qemu_pause_cond);
 }
 
 static void resume_all_threads(void)
@@ -372,7 +378,7 @@ static void *ap_main_loop(void *_env)
 
     /* and wait for machine initialization */
     while (!qemu_system_ready)
-        pthread_cond_wait(&qemu_system_cond, &qemu_mutex);
+        qemu_cond_wait(&qemu_system_cond);
     pthread_mutex_unlock(&qemu_mutex);
 
     kvm_main_loop_cpu(env);
@@ -384,7 +390,7 @@ void kvm_init_new_ap(int cpu, CPUState *
 
     pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
     while (vcpu_info[cpu].created == 0)
-        pthread_cond_wait(&qemu_vcpu_cond, &qemu_mutex);
+        qemu_cond_wait(&qemu_vcpu_cond);
 }
 
 int kvm_init_ap(void)
@@ -892,8 +898,6 @@ void qemu_kvm_aio_wait_start(void)
 
 void qemu_kvm_aio_wait(void)
 {
-    CPUState *cpu_single = cpu_single_env;
-
     if (!cpu_single_env) {
         if (io_thread_sigfd != -1) {
             fd_set rfds;
@@ -910,10 +914,8 @@ void qemu_kvm_aio_wait(void)
                 sigfd_handler((void *)(unsigned long)io_thread_sigfd);
         }
         qemu_aio_poll();
-    } else {
-        pthread_cond_wait(&qemu_aio_cond, &qemu_mutex);
-        cpu_single_env = cpu_single;
-    }
+    } else
+        qemu_cond_wait(&qemu_aio_cond);
 }
 
 void qemu_kvm_aio_wait_end(void)
@@ -939,6 +941,7 @@ void kvm_cpu_destroy_phys_mem(target_phy
 
 void kvm_mutex_unlock(void)
 {
+    assert(!cpu_single_env);
     pthread_mutex_unlock(&qemu_mutex);
 }
From: Avi K. <av...@qu...> - 2008-05-12 08:10:43
Zhang, Xiantao wrote:
> Could you help to try the attached one?

Sure, applied, thanks.

-- 
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-05-12 07:54:00
Yang, Sheng wrote:
> On Friday 09 May 2008 23:49:13 Avi Kivity wrote:
>> Yang, Sheng wrote:
>>> From 4942a5c35c97e5edb6fe1303e04fb86f25cac345 Mon Sep 17 00:00:00 2001
>>> From: Sheng Yang <she...@in...>
>>> Date: Thu, 8 May 2008 16:00:57 +0800
>>> Subject: [PATCH 3/4] KVM: VMX: Enable NMI with in-kernel irqchip
>>>
>>> static void kvm_do_inject_irq(struct kvm_vcpu *vcpu)
>>> {
>>>     int word_index = __ffs(vcpu->arch.irq_summary);
>>> @@ -2146,9 +2159,11 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
>>>         /*
>>>          * Interrupts blocked. Wait for unblock.
>>>          */
>>> -        cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
>>> +        cpu_based_vm_exec_control |=
>>> +            CPU_BASED_VIRTUAL_INTR_PENDING;
>>>     else
>>> -        cpu_based_vm_exec_control &= ~CPU_BASED_VIRTUAL_INTR_PENDING;
>>> +        cpu_based_vm_exec_control &=
>>> +            ~CPU_BASED_VIRTUAL_INTR_PENDING;
>>
>> This seems spurious.
>
> Sorry, seems I was too anxious to keep it in hand... I will check it
> more carefully in the future.
>
>>> /* We need to handle NMIs before interrupts are enabled */
>>> -    if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == 0x200) { /* nmi */
>>> +    if ((intr_info & INTR_INFO_INTR_TYPE_MASK) == 0x200) {
>>>         KVMTRACE_0D(NMI, vcpu, handler);
>>> -        asm("int $2");
>>> +        if (!cpu_has_virtual_nmis())
>>> +            asm("int $2");
>>>     }
>>> }
>>
>> That's a host nmi. So does PIN_BASED_VIRTUAL_NMI mean NMIs are
>> handled like unacked host interrupts?
>
> Not exactly. No host NMI here if Virtual NMI is set. Copied from SDM 3B,
> table 20-5:
>
> "If this control (Virtual NMIs) is 1, NMIs are never blocked and the
> “blocking by NMI” bit (bit 3) in the interruptibility-state field
> indicates “virtual-NMI blocking” (see Table 20-3). This control also
> interacts with the “NMI-window exiting” VM-execution control (see
> Section 20.6.2)."

I still don't understand. What does "NMIs are never blocked" mean? What
happens if an NMI occurs while in guest mode? Obviously we don't want it
to be delivered to the guest.

-- 
error compiling committee.c: too many arguments to function