You can subscribe to this list here.
2006 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
(33) |
Nov
(325) |
Dec
(320) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2007 |
Jan
(484) |
Feb
(438) |
Mar
(407) |
Apr
(713) |
May
(831) |
Jun
(806) |
Jul
(1023) |
Aug
(1184) |
Sep
(1118) |
Oct
(1461) |
Nov
(1224) |
Dec
(1042) |
2008 |
Jan
(1449) |
Feb
(1110) |
Mar
(1428) |
Apr
(1643) |
May
(682) |
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
From: Gerd v. E. <li...@eg...> - 2008-04-18 21:28:08
|
Hi Marcelo, > > > http://www.mail-archive.com/kvm...@li.../msg14732.html > > > > I tried it this evening with kvm 66 - which should include your patch, > > right? > > No its not included. The issue is being worked on. my bad, sorry. Now I know I really have that patch: qemu-kvm hangs :( I was trying kvm 66 with only the patch listed above applied on an otherwise perfectly working vm with virtio_blk root partition: Last line of the booting kernel in my vnc window: Serial: 8250/16550 driver $Revision 1.90... (you know the rest) an strace of the qemu-kvm gave the following in rapid succession: clock_gettime(CLOCK_MONOTONIC, {2565, 306799672}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 307065342}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 307354930}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 307618803}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 307886312}) = 0 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 33000000}}, NULL) = 0 rt_sigtimedwait([USR1 USR2 ALRM IO], {si_signo=SIGALRM, si_code=SI_TIMER, si_pid=0, si_uid=0, si_value={int=0, ptr=0}}, 0xbfe5af88, 8) = 14 rt_sigaction(SIGALRM, NULL, {0x804d8f8, ~[KILL STOP RTMIN RT_1], 0}, 8) = 0 select(12, [6 11], [], [], {0, 0}) = 0 (Timeout) select(0, [], NULL, NULL, {0, 0}) = 0 (Timeout) clock_gettime(CLOCK_MONOTONIC, {2565, 342895116}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 343164113}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 343454002}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 343716804}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 343980012}) = 0 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 33000000}}, NULL) = 0 rt_sigtimedwait([USR1 USR2 ALRM IO], {si_signo=SIGALRM, si_code=SI_TIMER, si_pid=0, si_uid=0, si_value={int=0, ptr=0}}, 0xbfe5af88, 8) = 14 rt_sigaction(SIGALRM, NULL, {0x804d8f8, ~[KILL STOP RTMIN RT_1], 0}, 8) = 0 select(12, [6 11], [], [], {0, 0}) = 0 (Timeout) select(0, [], NULL, NULL, {0, 0}) = 0 (Timeout) clock_gettime(CLOCK_MONOTONIC, {2565, 379035364}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 379307884}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 379589434}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 379919100}) = 0 clock_gettime(CLOCK_MONOTONIC, {2565, 380183834}) = 0 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 33000000}}, NULL) = 0 rt_sigtimedwait([USR1 USR2 ALRM IO], {si_signo=SIGALRM, si_code=SI_TIMER, si_pid=0, si_uid=0, si_value={int=0, ptr=0}}, 0xbfe5af88, 8) = 14 rt_sigaction(SIGALRM, NULL, {0x804d8f8, ~[KILL STOP RTMIN RT_1], 0}, 8) = 0 select(12, [6 11], [], [], {0, 0}) = 0 (Timeout) select(0, [], NULL, NULL, {0, 0}) = 0 (Timeout) ... Hope that helps. Kind regards, Gerd -- Address (better: trap) for people I really don't want to get mail from: ja...@ca... |
From: Protti, D. J <dui...@in...> - 2008-04-18 20:18:32
|
Hi Uri, The method you propose in fact doesn't work (tested with KVM 65) at least for a Windows XP as guest. After performing steps from 1 to 7 with no errors: - In step 8, the VM in question is already loaded and its user interface is showed in the X windows (as mentioned a Windows XP in my tests) - After step 9, the VM seems to be unstopped (no more '[Stopped]' title in the X window caption) but in fact it doesn't runs. The X window appears to respond to mouse events, i.e. the "Press Ctrl-Alt to exit grab" message appears on mouse click, but the Windows XP interface does not respond. Also, top command shows near 0% CPU usage for the qemu process, so it seems that the Windows XP is not put to run after 'cont' in qemu monitor. It is supposed that this should work? Or this type of guest does not support these operations? Thanks, Duilio Protti Intel Corporation |
From: Marcelo T. <mto...@re...> - 2008-04-18 17:43:05
|
On Fri, Apr 18, 2008 at 10:18:33AM -0500, Anthony Liguori wrote: > >Sleeping in the context of vcpu's is extremely bad (eg virtio-block > >blocks in write() throttling which kills performance). It should wait > >on IO completions instead (qemu-kvm.c creates a pthread "waitqueue" to > >resolve that issue). > > > >Other than that looks fine to me, will give it a try. > > > > FWIW, I'm not getting wonderful results in KVM. It's hard to tell > though because time seems wildly inaccurate (even with kvm clock in the > guest). The time issue appears unrelated to this set of patches. Oh, you won't get completion signals on the aio eventfd. You might want to try the select-with-timeout() stuff. Will submit that with proper signalfd emulation shortly. |
From: Avi K. <av...@qu...> - 2008-04-18 16:33:08
|
Anthony Liguori wrote: > Right now, not specifying the -aio option is equivalent to your proposed > -aio auto. > > I guess I should include an info aio to let the user know what type of > aio they are using. We can add selection criteria later but > semantically, not specifying an explicit -aio option allows QEMU to > choose whichever one it thinks is best. > > For the majority of deployments posix aio should be sufficient. The few that need something else can use Linux aio. Of course, a managed environment can use Linux aio unconditionally if knows the kernel has all the needed goodies. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Avi K. <av...@qu...> - 2008-04-18 16:28:09
|
Glauber de Oliveira Costa wrote: > Hi, > > I've got some qemu crashes while trying to passthrough an ide device > to a kvm guest. After some investigation, it turned out that > register_ioport_{read/write} will abort on errors instead of returning > a meaningful error. > > However, even if we do return an error, the asynchronous nature of pci > config space mapping updates makes it a little bit hard to treat. > > This series of patches basically treats errors in the mapping functions in > the pci layer. If anything goes wrong, we unregister the pci device, unmapping > any mappings that happened to be sucessfull already. > > After these patches are applied, a lot of warnings appears. And, you know, > everytime there is a warning, god kills a kitten. But I'm not planning on > touching the other pieces of qemu code for this until we set up (or not) in > this solution > > Comments are very welcome, specially from qemu folks (since it is a bit invasive) > > Have you considered, instead of rolling back the changes you already made before the failure, to have a function which checks if an ioport registration will be successful? This may simplify the code. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Avi K. <av...@qu...> - 2008-04-18 16:24:38
|
Ryan Harper wrote: > From: Ryan Harper <ry...@us...> > > Rather than faking up some geometry, allow the backend to push the disk > geometry via virtio pci config option. Keep the old geo code around for > compatibility. > > Applied, thanks. > struct virtio_blk_config > { > uint64_t capacity; > uint32_t size_max; > uint32_t seg_max; > + uint16_t cylinders; > + uint8_t heads; > + uint8_t sectors; > }; > > I packed the structure here to avoid gcc surprises on odd architectures. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Anthony L. <ali...@us...> - 2008-04-18 16:23:53
|
Nguyen Anh Quynh wrote: >> >> The thinking is to eliminate the need to hijack the boot sector when using >> the -kernel option. >> > > I see, but does that offer any advantage over the current approach? > You no longer have to specify a -hda option when using -kernel. Regards, Anthony Liguori > Thanks, > Q > |
From: Jamie L. <ja...@sh...> - 2008-04-18 16:22:22
|
Anthony Liguori wrote: > >I'm of the view that '-aio auto' would be a really good option - and > >when it's proven itself, it should be the default. It could work on > >all QEMU hosts: it would pick synchronous IO when there is nothing else. > > Right now, not specifying the -aio option is equivalent to your proposed > -aio auto. > > I guess I should include an info aio to let the user know what type of > aio they are using. We can add selection criteria later but > semantically, not specifying an explicit -aio option allows QEMU to > choose whichever one it thinks is best. Great. I guess the next step is to add selection criteria, otherwise a million Wikis will tell everyone to use '-aio linux' :-) Do you know what the selection criteria should be - or is there a document/paper somewhere which says (ideally from benchmarks)? I'm interested for an unrelated project using AIO - so I'm willing to help get this right to some extent. -- Jamie |
From: Avi K. <av...@qu...> - 2008-04-18 16:13:51
|
Alex Davis wrote: > Host software: > Linux 2.6.24.4 > KVM 65 (I am using the kernel modules from this release). > X11 7.2 from Xorg > SDL 1.2.13 > GCC 4.1.1 > Glibc 2.4 > > Host hardware: > Asus P5B Deluxe (P965 chipset based) motherboard > 4 GB RAM > Intel E6700 CPU > > Guest software: > Slackware 12.0 installed from CD-ROM. > > Command used to first KVM instance: > /usr/local/bin/qemu-system-x86_64 -hda /spare/vdisk1.img -cdrom /dev/cdrom -boot c -m 384 -net > nic,macaddr=DE:AD:BE:EF:11:29 -net tap,ifname=tap0,script=no & > > Command used to start second KVM instance: > /usr/local/bin/qemu-system-x86_64 -hda /spare/vdisk2.img -cdrom /dev/cdrom -boot c -m 384 -net > nic,macaddr=DE:AD:BE:EF:11:30 -net tap,ifname=tap1,script=no & > > tap0 and tap1 are bridged on the host. The guest OS was installed on /spare/vdisk1.img, > which was initially created by /usr/local/bin/qemu-img create -f qcow /spare/vdisk.img 10G > After the guest installation completed, vdisk1 was copied to vdisk2. > > The second instance always stops after printing > Checking if the processor honours the WP bit even in supervisor mode... Ok. > It stays hung until I press the return key in the first instance; sometimes clicking in another X > window will wake it up as well. > > This is a test machine so I can test patches (almost) at will. > > Strange. Does pinning each guest to a different cpu help (use 'taskset 1 qemu ... vdisk1.img & ', taskset 2 qemu ... vdisk2.img) -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Avi K. <av...@qu...> - 2008-04-18 16:09:45
|
Joerg Roedel wrote: > There is not selective cr0 intercept bug. The code in the comment sets the > CR0.PG bit. But KVM sets the CR4.PG bit for SVM always to implement the paged > real mode. So the 'mov %eax,%cr0' instruction does not change the CR0.PG bit. > Selective CR0 intercepts only occur when a bit is actually changed. So its the > right behavior that there is no intercept on this instruction. > > Applied, thanks. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Avi K. <av...@qu...> - 2008-04-18 16:08:26
|
Joerg Roedel wrote: > This patch series implements optimizations to the CR8 intercept handling in > SVM. With these patches applied CR8 reads are not intercepted anymore. The > writes to CR8 are only intercepted if the TPR masks interrupts. This > significantly reduces the number of total CR8 intercepts when running Windows > 64 bit versions. Some quick numbers: > > Boot and shudown of Vista 64: > > Without these patches: ~38.000.000 CR8 writes intercepted > With these patches: ~38.000 CR8 writes intercepted > > Applied all, thanks. Good patchset. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Avi K. <av...@qu...> - 2008-04-18 16:03:38
|
be...@il... wrote: > This patch for PCI passthrough devices enables a guest to access a device's > memory mapped I/O regions directly, without requiring the host to trap and > emulate every MMIO access. > > Updated from last version: we create a memory slot for each MMIO region of the > guest's devices, and then use the /sys/bus/pci/.../resource# mapping to find the > hfn for that MMIO region. The kernel part and the userspace part of this > patchset apply to Amit's pv-dma tree. Tested on a Lenovo M57p with an e1000 NIC > assigned directly to an FC8 guest. > > Comments are appreciated. > I see no support for cache attributes in the page attributes table or mtrr. I guess for most devices this will work (as they will be set as uncachable by the mtrrs), but for display cards we'd need to set the vram as write-combining to get reasonable performance. This requires mtrr and pat emulation in kvm so we detect the guest's intentions. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Avi K. <av...@qu...> - 2008-04-18 15:57:32
|
be...@il... wrote: > From: Ben-Ami Yassour <be...@il...> > > Signed-off-by: Ben-Ami Yassour <be...@il...> > Signed-off-by: Muli Ben-Yehuda <mu...@il...> > --- > libkvm/libkvm.c | 24 ++++++++---- > qemu/hw/pci-passthrough.c | 89 +++++++++++---------------------------------- > qemu/hw/pci-passthrough.h | 2 + > 3 files changed, 40 insertions(+), 75 deletions(-) > > diff --git a/libkvm/libkvm.c b/libkvm/libkvm.c > index de91328..8c02af9 100644 > --- a/libkvm/libkvm.c > +++ b/libkvm/libkvm.c > @@ -400,7 +400,7 @@ void *kvm_create_userspace_phys_mem(kvm_context_t kvm, unsigned long phys_start, > { > int r; > int prot = PROT_READ; > - void *ptr; > + void *ptr = NULL; > struct kvm_userspace_memory_region memory = { > .memory_size = len, > .guest_phys_addr = phys_start, > @@ -410,16 +410,24 @@ void *kvm_create_userspace_phys_mem(kvm_context_t kvm, unsigned long phys_start, > if (writable) > prot |= PROT_WRITE; > > - ptr = mmap(NULL, len, prot, MAP_ANONYMOUS | MAP_SHARED, -1, 0); > - if (ptr == MAP_FAILED) { > - fprintf(stderr, "create_userspace_phys_mem: %s", strerror(errno)); > - return 0; > - } > + if (len > 0) { > + ptr = mmap(NULL, len, prot, MAP_ANONYMOUS | MAP_SHARED, -1, 0); > + if (ptr == MAP_FAILED) { > + fprintf(stderr, "create_userspace_phys_mem: %s", > + strerror(errno)); > + return 0; > + } > > - memset(ptr, 0, len); > + memset(ptr, 0, len); > + } > > memory.userspace_addr = (unsigned long)ptr; > - memory.slot = get_free_slot(kvm); > + > + if (len > 0) > + memory.slot = get_free_slot(kvm); > + else > + memory.slot = get_slot(phys_start); > + > r = ioctl(kvm->vm_fd, KVM_SET_USER_MEMORY_REGION, &memory); > if (r == -1) { > fprintf(stderr, "create_userspace_phys_mem: %s", strerror(errno)); > This looks like support for zero-length memory slots? Why is it needed? It needs to be in a separate patch. > diff --git a/qemu/hw/pci-passthrough.c b/qemu/hw/pci-passthrough.c > index 7ffcc7b..a5894d9 100644 > --- a/qemu/hw/pci-passthrough.c > +++ b/qemu/hw/pci-passthrough.c > @@ -25,18 +25,6 @@ typedef __u64 resource_size_t; > extern kvm_context_t kvm_context; > extern FILE *logfile; > > -CPUReadMemoryFunc *pt_mmio_read_cb[3] = { > - pt_mmio_readb, > - pt_mmio_readw, > - pt_mmio_readl > -}; > - > -CPUWriteMemoryFunc *pt_mmio_write_cb[3] = { > - pt_mmio_writeb, > - pt_mmio_writew, > - pt_mmio_writel > -}; > - > There's at least one use case for keeping mmio in userspace: reverse-engineering a device driver. So if it doesn't cause too much trouble, please keep this an option. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Anthony L. <an...@co...> - 2008-04-18 15:54:03
|
Yang, Sheng wrote: > On Friday 18 April 2008 21:30:14 Anthony Liguori wrote: > >> Yang, Sheng wrote: >> >>> @@ -1048,17 +1071,18 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, >>> u64 *shadow_pte, >>> * whether the guest actually used the pte (in order to detect >>> * demand paging). >>> */ >>> - spte = PT_PRESENT_MASK | PT_DIRTY_MASK; >>> + spte = shadow_base_present_pte | shadow_dirty_mask; >>> if (!speculative) >>> pte_access |= PT_ACCESSED_MASK; >>> if (!dirty) >>> pte_access &= ~ACC_WRITE_MASK; >>> - if (!(pte_access & ACC_EXEC_MASK)) >>> - spte |= PT64_NX_MASK; >>> - >>> - spte |= PT_PRESENT_MASK; >>> + if (pte_access & ACC_EXEC_MASK) { >>> + if (shadow_x_mask) >>> + spte |= shadow_x_mask; >>> + } else if (shadow_nx_mask) >>> + spte |= shadow_nx_mask; >>> >> This looks like it may be a bug. The old behavior sets NX if >> (pte_access & ACC_EXEC_MASK). The new behavior unconditionally sets NX >> and never sets PRESENT. Also, the if (shadow_x_mas k) checks are >> unnecessary. spte |= 0 is a nop. >> > > Thanks for the comment! I realized two judgments of shadow_nx/x_mask is > unnecessary... In fact, the correct behavior is either set shadow_x_mask or > shadow_nx_mask, may be there is a better approach for this. The logic assured > by program itself is always safer. But I will remove the redundant code at > first. > > But I don't think it's a bug. The old behavior set NX if (!(pte_access & > ACC_EXEC_MASK)), the same as the new one. The new behavior sets NX regardless of whether (pte_access & ACC_EXEC_MASK). Is the desired change to unconditionally set NX? > And I also curious about the > PRESENT bit. You see, the PRESENT bit was set at the beginning of the code, > and I really don't know why the duplicate one exists there... > Looking at the code, you appear to be right. In the future, I think you should separate any cleanups (like removing the redundant setting of PRESENT) into a separate patch and stick to just programmatic changes of PT_USER_MASK => shadow_user_mask, etc. in this patch. That makes it a lot easier to review correctness. Regards, Anthony Liguori >>> if (pte_access & ACC_USER_MASK) >>> - spte |= PT_USER_MASK; >>> + spte |= shadow_user_mask; >>> if (largepage) >>> spte |= PT_PAGE_SIZE_MASK; >>> > > |
From: Avi K. <av...@qu...> - 2008-04-18 15:51:38
|
be...@il... wrote: > From: Ben-Ami Yassour <be...@il...> > > Signed-off-by: Ben-Ami Yassour <be...@il...> > Signed-off-by: Muli Ben-Yehuda <mu...@il...> > --- > arch/x86/kvm/mmu.c | 59 +++++++++++++++++++++++++++++-------------- > arch/x86/kvm/paging_tmpl.h | 19 +++++++++---- > include/linux/kvm_host.h | 2 +- > virt/kvm/kvm_main.c | 17 +++++++++++- > 4 files changed, 69 insertions(+), 28 deletions(-) > > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 078a7f1..c89029d 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -112,6 +112,8 @@ static int dbg = 1; > #define PT_FIRST_AVAIL_BITS_SHIFT 9 > #define PT64_SECOND_AVAIL_BITS_SHIFT 52 > > +#define PT_SHADOW_IO_MARK (1ULL << PT_FIRST_AVAIL_BITS_SHIFT) > + > Please rename this PT_SHADOW_MMIO_MASK. > #define VALID_PAGE(x) ((x) != INVALID_PAGE) > > #define PT64_LEVEL_BITS 9 > @@ -237,6 +239,9 @@ static int is_dirty_pte(unsigned long pte) > > static int is_rmap_pte(u64 pte) > { > + if (pte & PT_SHADOW_IO_MARK) > + return false; > + > return is_shadow_present_pte(pte); > } > Why avoid rmap on mmio pages? Sure it's unnecessary work, but having less cases improves overall reliability. You can use pfn_valid() in gfn_to_pfn() and kvm_release_pfn_*() to conditionally update the page refcounts. -- Any sufficiently difficult bug is indistinguishable from a feature. |
From: Hollis B. <ho...@us...> - 2008-04-18 15:40:16
|
Signed-off-by: Hollis Blanchard <ho...@us...> diff --git a/qemu/qemu-kvm-powerpc.c b/qemu/qemu-kvm-powerpc.c --- a/qemu/qemu-kvm-powerpc.c +++ b/qemu/qemu-kvm-powerpc.c @@ -72,7 +72,6 @@ for (i = 0;i < 32; i++){ regs.gpr[i] = env->gpr[i]; - regs.fpr[i] = env->fpr[i]; } rc = kvm_set_regs(kvm_context, env->cpu_index, ®s); @@ -113,7 +112,6 @@ for (i = 0;i < 32; i++){ env->gpr[i] = regs.gpr[i]; - env->fpr[i] = regs.fpr[i]; } } |
From: Anthony L. <an...@co...> - 2008-04-18 15:25:26
|
Guillaume Thouvenin wrote: > On Fri, 18 Apr 2008 08:23:07 -0500 > Anthony Liguori <an...@co...> wrote: > > > >> This doesn't seem right. You should have been able to break out of the >> emulator long before encountering an out instruction. The next >> instruction you encounter should be a mov instruction. Are you sure >> you're updating eip correctly? >> > > I think that eip is updated correctly but you're right, I think that > the condition to stop emulation is not well implemented. I emulate a > lot of mov instructions and I remain blocked in the emulation loop > until I reach the "out" instruction. The loop is the following: > > [...] > cs_rpl = vmcs_read16(GUEST_CS_SELECTOR) & SELECTOR_RPL_MASK; > ss_rpl = vmcs_read16(GUEST_SS_SELECTOR) & SELECTOR_RPL_MASK; > > while (cs_rpl != ss_rpl) { > if (emulate_instruction(vcpu, NULL, 0,0, 0) == EMULATE_FAIL) { > printk(KERN_INFO "%s: emulation of 0x%x failed\n", > __FUNCTION__, > vcpu->arch.emulate_ctxt.decode.b); > return -1; > } > cs_rpl = vmcs_read16(GUEST_CS_SELECTOR) & SELECTOR_RPL_MASK; > ss_rpl = vmcs_read16(GUEST_SS_SELECTOR) & SELECTOR_RPL_MASK; > } > printk(KERN_INFO "%s: VMX friendly state recovered\n", __FUNCTION__); > // I never reach this point > > Maybe CS and SS selector are not well updated. I will add trace to see > their values before and after the emulation. > I'd prefer you not do an emulate_instruction loop at all. Just emulate one instruction on vmentry failure and let VT tell you what instructions you need to emulate. It's only four instructions so I don't think the performance is going to matter. Take a look at the patch I posted previously. Regards, Anthony Liguori > Regards, > Guillaume > |
From: Anthony L. <ali...@us...> - 2008-04-18 15:23:57
|
Jamie Lokier wrote: >> I've basically got a choice of making libvirt always ad '-aio linux' >> or never add it at all. My inclination is to the latter since it is >> compatible with existing QEMU which has no -aio option. Presumably >> '-aio linux' is intended to provide some performance benefit so it'd >> be nice to use it. If we can't express some criteria under which it >> should be turned on, I can't enable it; where as if you can express >> some criteria, then QEMU should apply them automatically. >> > > I'm of the view that '-aio auto' would be a really good option - and > when it's proven itself, it should be the default. It could work on > all QEMU hosts: it would pick synchronous IO when there is nothing else. > Right now, not specifying the -aio option is equivalent to your proposed -aio auto. I guess I should include an info aio to let the user know what type of aio they are using. We can add selection criteria later but semantically, not specifying an explicit -aio option allows QEMU to choose whichever one it thinks is best. Regards, Anthony Liguori |
From: Anthony L. <ali...@us...> - 2008-04-18 15:20:29
|
Nguyen Anh Quynh wrote: >> You no longer have to specify a -hda option when using -kernel. >> > > Without -hda, how can we load disk image? Or you mean you only want to > test the kernel? > Right. You may be booting from NFS, iSCSI, or something like that. Regards, Anthony Liguori > Thanks, > Q > |
From: Anthony L. <ali...@us...> - 2008-04-18 15:19:51
|
Marcelo Tosatti wrote: > On Thu, Apr 17, 2008 at 02:26:52PM -0500, Anthony Liguori wrote: > >> This patch introduces a Linux-aio backend that is disabled by default. To >> use this backend effectively, the user should disable caching and select >> it with the appropriate -aio option. For instance: >> >> qemu-system-x86_64 -drive foo.img,cache=off -aio linux >> >> There's no universal way to asynchronous wait with linux-aio. At some point, >> signals were added to signal completion. More recently, and eventfd interface >> was added. This patch relies on the later. >> >> We try hard to detect whether the right support is available in configure to >> avoid compile failures. >> > > >> + do { >> + err = io_submit(aio_ctxt_id, 1, iocbs); >> + } while (err == -1 && errno == EINTR); >> + >> + if (err != 1) { >> + fprintf(stderr, "failed to submit aio request: %m\n"); >> + exit(1); >> + } >> + >> + outstanding_requests++; >> + >> + return &aiocb->common; >> +} >> + >> +static void la_wait(void) >> +{ >> + main_loop_wait(10); >> +} >> > > Sleeping in the context of vcpu's is extremely bad (eg virtio-block > blocks in write() throttling which kills performance). It should wait > on IO completions instead (qemu-kvm.c creates a pthread "waitqueue" to > resolve that issue). > > Other than that looks fine to me, will give it a try. > FWIW, I'm not getting wonderful results in KVM. It's hard to tell though because time seems wildly inaccurate (even with kvm clock in the guest). The time issue appears unrelated to this set of patches. Regards, Anthony Liguori |
From: Yang, S. <she...@in...> - 2008-04-18 15:12:03
|
On Friday 18 April 2008 21:30:14 Anthony Liguori wrote: > Yang, Sheng wrote: > > @@ -1048,17 +1071,18 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, > > u64 *shadow_pte, > > * whether the guest actually used the pte (in order to detect > > * demand paging). > > */ > > - spte = PT_PRESENT_MASK | PT_DIRTY_MASK; > > + spte = shadow_base_present_pte | shadow_dirty_mask; > > if (!speculative) > > pte_access |= PT_ACCESSED_MASK; > > if (!dirty) > > pte_access &= ~ACC_WRITE_MASK; > > - if (!(pte_access & ACC_EXEC_MASK)) > > - spte |= PT64_NX_MASK; > > - > > - spte |= PT_PRESENT_MASK; > > + if (pte_access & ACC_EXEC_MASK) { > > + if (shadow_x_mask) > > + spte |= shadow_x_mask; > > + } else if (shadow_nx_mask) > > + spte |= shadow_nx_mask; > > This looks like it may be a bug. The old behavior sets NX if > (pte_access & ACC_EXEC_MASK). The new behavior unconditionally sets NX > and never sets PRESENT. Also, the if (shadow_x_mas k) checks are > unnecessary. spte |= 0 is a nop. Thanks for the comment! I realized two judgments of shadow_nx/x_mask is unnecessary... In fact, the correct behavior is either set shadow_x_mask or shadow_nx_mask, may be there is a better approach for this. The logic assured by program itself is always safer. But I will remove the redundant code at first. But I don't think it's a bug. The old behavior set NX if (!(pte_access & ACC_EXEC_MASK)), the same as the new one. And I also curious about the PRESENT bit. You see, the PRESENT bit was set at the beginning of the code, and I really don't know why the duplicate one exists there... > > > if (pte_access & ACC_USER_MASK) > > - spte |= PT_USER_MASK; > > + spte |= shadow_user_mask; > > if (largepage) > > spte |= PT_PAGE_SIZE_MASK; -- Thanks Yang, Sheng > > Regards, > > Anthony Liguori |
From: Marcelo T. <mto...@re...> - 2008-04-18 15:06:53
|
On Thu, Apr 17, 2008 at 02:26:52PM -0500, Anthony Liguori wrote: > This patch introduces a Linux-aio backend that is disabled by default. To > use this backend effectively, the user should disable caching and select > it with the appropriate -aio option. For instance: > > qemu-system-x86_64 -drive foo.img,cache=off -aio linux > > There's no universal way to asynchronous wait with linux-aio. At some point, > signals were added to signal completion. More recently, and eventfd interface > was added. This patch relies on the later. > > We try hard to detect whether the right support is available in configure to > avoid compile failures. > + do { > + err = io_submit(aio_ctxt_id, 1, iocbs); > + } while (err == -1 && errno == EINTR); > + > + if (err != 1) { > + fprintf(stderr, "failed to submit aio request: %m\n"); > + exit(1); > + } > + > + outstanding_requests++; > + > + return &aiocb->common; > +} > + > +static void la_wait(void) > +{ > + main_loop_wait(10); > +} Sleeping in the context of vcpu's is extremely bad (eg virtio-block blocks in write() throttling which kills performance). It should wait on IO completions instead (qemu-kvm.c creates a pthread "waitqueue" to resolve that issue). Other than that looks fine to me, will give it a try. |
From: Nguyen A. Q. <aq...@gm...> - 2008-04-18 15:02:55
|
On 4/18/08, Anthony Liguori <ali...@us...> wrote: > Nguyen Anh Quynh wrote: > > > > > > > > > The thinking is to eliminate the need to hijack the boot sector when > using > > > the -kernel option. > > > > > > > > > > I see, but does that offer any advantage over the current approach? > > > > > > You no longer have to specify a -hda option when using -kernel. Without -hda, how can we load disk image? Or you mean you only want to test the kernel? Thanks, Q |
From: H. P. A. <hp...@zy...> - 2008-04-18 14:54:11
|
Anthony Liguori wrote: > Nguyen Anh Quynh wrote: >>> >>> The thinking is to eliminate the need to hijack the boot sector when >>> using >>> the -kernel option. >>> >> >> I see, but does that offer any advantage over the current approach? >> > > You no longer have to specify a -hda option when using -kernel. > Plus, you don't have funny side effects if you do -- and I suspect there is ad hoc code in the disk driver which can be removed. -hpa |
From: Gerd H. <kr...@re...> - 2008-04-18 14:26:30
|
Jeremy Fitzhardinge wrote: > Gerd Hoffmann wrote: >> Wall clock is off a few hours though. Oops. >> >> I think the way wall clock and system clock work together in xen (Jeremy >> correct me if I'm wrong) is that the wall clock specifies the point in >> time where the system clock started going. As kvm fills in host system >> time into the guest system time fields the guest wall clock fields >> should be filled with the host boot time timestamp I'd say. >> > > Yes. The wallclock field in the shared info structure is the wallclock > time at boot; you compute the current time by adding the system > timestamp to it. System time changes are effected by retroactively > changing the boot time of the machine, though that can also change > because of suspend/resume/migrate. > > In general the kernel only reads the wallclock time at boot, and then > maintains it for itself from then on. I think. Thanks. I'm looking at the guest side of the issue right now, trying to identify common code, and while doing so noticed that xen does the version-check-loop in both get_time_values_from_xen(void) and xen_clocksource_read(void), and I can't see any obvious reason for that. The loop in xen_clocksource_read(void) is not needed IMHO. Can I drop it? cheers, Gerd -- http://kraxel.fedorapeople.org/xenner/ |