From: David S. A. <da...@ci...> - 2008-04-21 04:31:12
I added the traces and captured data over another apparent lockup of the guest. This seems to be representative of the sequence (pid/vcpu removed):

(+4776)   VMEXIT     [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+   0)   PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+3632)   VMENTRY
(+4552)   VMEXIT     [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+   0)   PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
(+54928)  VMENTRY
(+4568)   VMEXIT     [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+   0)   PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+   0)   PTE_WRITE  [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
(+8432)   VMENTRY
(+3936)   VMEXIT     [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+   0)   PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+   0)   PTE_WRITE  [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+13832)  VMENTRY
(+5768)   VMEXIT     [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+   0)   PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+3712)   VMENTRY
(+4576)   VMEXIT     [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+   0)   PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
(+   0)   PTE_WRITE  [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
(+65216)  VMENTRY
(+4232)   VMEXIT     [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+   0)   PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+   0)   PTE_WRITE  [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
(+8640)   VMENTRY
(+3936)   VMEXIT     [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+   0)   PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+   0)   PTE_WRITE  [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+14160)  VMENTRY

I can forward a more complete time snippet if you'd like. The vcpu0 and corresponding vcpu1 files have 85,000 total lines; compressed, the files total ~500k. I did not see the FLOODED trace come out during this sample, though I did bump the count from 3 to 4 as you suggested.

Correlating rip addresses to the 2.4 kernel:

    c0160d00-c0161290 = page_referenced

It looks like the event is kscand running through the pages. I suspected this some time ago and tried tweaking the kscand_work_percent sysctl variable. It appeared to lower the peak of the spikes, but maybe I imagined it. I believe lowering that value makes kscand wake up more often but do less work (page scanning) each time it is awakened.

david

Avi Kivity wrote:
> Can you add a trace at mmu_guess_page_from_pte_write(), right before "if
> (is_present_pte(gpte))"? I'm interested in gpa and gpte. Also a trace
> at kvm_mmu_pte_write(), where it sets flooded = 1 (hmm, try to increase
> the 3 to 4 in the line right above that, maybe the fork detector is
> misfiring).
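For orientation, the "fork detector" Avi refers to is the pte-write flood heuristic in kvm_mmu_pte_write(). A minimal sketch of its shape, reconstructed from the mmu.c of that era (field names approximate, shown for context only):

	/* Sketch (approximate, not a verbatim excerpt): repeated emulated
	 * writes to the same guest page-table page, with the shadow pte
	 * never accessed in between, look like the guest rewriting its page
	 * tables (e.g. at fork), so the page is zapped from the shadow
	 * ("flooded") instead of being tracked write-by-write. */
	if (gfn == vcpu->arch.last_pt_write_gfn
	    && !last_updated_pte_accessed(vcpu)) {
		++vcpu->arch.last_pt_write_count;
		if (vcpu->arch.last_pt_write_count >= 3)	/* the "3 to 4" knob above */
			flooded = 1;
	} else {
		vcpu->arch.last_pt_write_gfn = gfn;
		vcpu->arch.last_pt_write_count = 1;
	}

Raising the threshold from 3 to 4, as suggested above, makes the detector tolerate one more back-to-back write before declaring a flood.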
From: Nguyen A. Q. <aq...@gm...> - 2008-04-21 03:36:42
Hmm, the last patch includes a binary. So please take this patch instead.

Thanks,
Q

# diffstat linuxboot1.diff
 Makefile             |   13 ++++-
 linuxboot/Makefile   |   40 +++++++++++++++
 linuxboot/boot.S     |   54 +++++++++++++++++++++
 linuxboot/farvar.h   |  130 +++++++++++++++++++++++++++++++++++++++++++++++++++
 linuxboot/rom.c      |  104 ++++++++++++++++++++++++++++++++++++++++
 linuxboot/signrom.c  |  128 ++++++++++++++++++++++++++++++++++++++++++++++++++
 linuxboot/util.h     |   69 +++++++++++++++++++++++++++
 qemu/Makefile        |    3 -
 qemu/Makefile.target |    2
 qemu/hw/linuxboot.c  |   39 +++++++++++++++
 qemu/hw/pc.c         |   22 +++++++-
 qemu/hw/pc.h         |    5 +
 12 files changed, 600 insertions(+), 9 deletions(-)

On Mon, Apr 21, 2008 at 12:33 PM, Nguyen Anh Quynh <aq...@gm...> wrote:
> Forget to say that this patch is against kvm-66.
>
> Thanks,
> Q
>
> On Mon, Apr 21, 2008 at 12:32 PM, Nguyen Anh Quynh <aq...@gm...> wrote:
> > Hi,
> >
> > This should be submitted upstream (but not to the kvm-devel list); this
> > is only test code that I want to quickly send out for comments. In case
> > it looks OK, I will send it upstream later.
> >
> > Inspired by extboot and conversations with Anthony and HPA, this
> > linuxboot option ROM is a simple option ROM that intercepts int19 in
> > order to execute the Linux setup code. This approach eliminates the need
> > to manipulate the boot sector for this purpose.
> >
> > To test it, just load a Linux kernel with your KVM/QEMU image using the
> > -kernel option in the normal way.
> >
> > I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, with an
> > Ubuntu 8.04 guest.
> >
> > Thanks,
> > Quynh
> >
> > # diffstat linuxboot1.diff
> >  Makefile             |   13 ++++-
> >  linuxboot/Makefile   |   40 +++++++++++++++
> >  linuxboot/boot.S     |   54 +++++++++++++++++++++
> >  linuxboot/farvar.h   |  130 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  linuxboot/rom.c      |  104 ++++++++++++++++++++++++++++++++++++++++
> >  linuxboot/signrom    |binary
> >  linuxboot/signrom.c  |  128 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  linuxboot/util.h     |   69 +++++++++++++++++++++++++++
> >  qemu/Makefile        |    3 -
> >  qemu/Makefile.target |    2
> >  qemu/hw/linuxboot.c  |   39 +++++++++++++++
> >  qemu/hw/pc.c         |   22 +++++++-
> >  qemu/hw/pc.h         |    5 +
> >  13 files changed, 600 insertions(+), 9 deletions(-)
From: Nguyen A. Q. <aq...@gm...> - 2008-04-21 03:33:48
Forget to say that this patch is against kvm-66.

Thanks,
Q

On Mon, Apr 21, 2008 at 12:32 PM, Nguyen Anh Quynh <aq...@gm...> wrote:
> Hi,
>
> This should be submitted upstream (but not to the kvm-devel list); this
> is only test code that I want to quickly send out for comments. In case
> it looks OK, I will send it upstream later.
>
> Inspired by extboot and conversations with Anthony and HPA, this
> linuxboot option ROM is a simple option ROM that intercepts int19 in
> order to execute the Linux setup code. This approach eliminates the need
> to manipulate the boot sector for this purpose.
>
> To test it, just load a Linux kernel with your KVM/QEMU image using the
> -kernel option in the normal way.
>
> I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, with an
> Ubuntu 8.04 guest.
>
> Thanks,
> Quynh
>
> # diffstat linuxboot1.diff
>  Makefile             |   13 ++++-
>  linuxboot/Makefile   |   40 +++++++++++++++
>  linuxboot/boot.S     |   54 +++++++++++++++++++++
>  linuxboot/farvar.h   |  130 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  linuxboot/rom.c      |  104 ++++++++++++++++++++++++++++++++++++++++
>  linuxboot/signrom    |binary
>  linuxboot/signrom.c  |  128 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  linuxboot/util.h     |   69 +++++++++++++++++++++++++++
>  qemu/Makefile        |    3 -
>  qemu/Makefile.target |    2
>  qemu/hw/linuxboot.c  |   39 +++++++++++++++
>  qemu/hw/pc.c         |   22 +++++++-
>  qemu/hw/pc.h         |    5 +
>  13 files changed, 600 insertions(+), 9 deletions(-)
From: Nguyen A. Q. <aq...@gm...> - 2008-04-21 03:32:43
Hi,

This should be submitted upstream (but not to the kvm-devel list); this is only test code that I want to quickly send out for comments. In case it looks OK, I will send it upstream later.

Inspired by extboot and conversations with Anthony and HPA, this linuxboot option ROM is a simple option ROM that intercepts int19 in order to execute the Linux setup code. This approach eliminates the need to manipulate the boot sector for this purpose.

To test it, just load a Linux kernel with your KVM/QEMU image using the -kernel option in the normal way.

I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, with an Ubuntu 8.04 guest.

Thanks,
Quynh

# diffstat linuxboot1.diff
 Makefile             |   13 ++++-
 linuxboot/Makefile   |   40 +++++++++++++++
 linuxboot/boot.S     |   54 +++++++++++++++++++++
 linuxboot/farvar.h   |  130 +++++++++++++++++++++++++++++++++++++++++++++++++++
 linuxboot/rom.c      |  104 ++++++++++++++++++++++++++++++++++++++++
 linuxboot/signrom    |binary
 linuxboot/signrom.c  |  128 ++++++++++++++++++++++++++++++++++++++++++++++++++
 linuxboot/util.h     |   69 +++++++++++++++++++++++++++
 qemu/Makefile        |    3 -
 qemu/Makefile.target |    2
 qemu/hw/linuxboot.c  |   39 +++++++++++++++
 qemu/hw/pc.c         |   22 +++++++-
 qemu/hw/pc.h         |    5 +
 13 files changed, 600 insertions(+), 9 deletions(-)
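To make the int19 interception concrete: int 0x19 is the BIOS bootstrap-loader vector, so an option ROM that rewrites that entry in the real-mode interrupt vector table gets control when the BIOS would otherwise read a boot sector. A hedged sketch of the idea follows — this is not the submitted boot.S/rom.c code, and it assumes a freestanding real-mode build where the IVT is addressable at linear address 0:

	/* Illustrative sketch only, not the actual patch. */
	#include <stdint.h>

	struct ivt_entry {
		uint16_t offset;	/* handler offset within segment */
		uint16_t segment;	/* handler segment */
	};

	static void hook_int19(uint16_t rom_seg, uint16_t handler_off)
	{
		/* The real-mode IVT lives at 0000:0000. */
		struct ivt_entry *ivt = (struct ivt_entry *)0;

		/* After this, the BIOS "boot the OS" call (int 0x19) enters
		 * the ROM's handler, which can place the kernel's setup code
		 * in memory and jump to it instead of loading a boot sector. */
		ivt[0x19].segment = rom_seg;
		ivt[0x19].offset  = handler_off;
	}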
From: Javier G. G. <ja...@gu...> - 2008-04-21 00:31:50
On Sunday 20 April 2008, Avi Kivity wrote:
> Also, I'd presume that those that need 10K IOPS and above will not place
> their high throughput images on a filesystem; rather on a separate SAN LUN.

I think that too; but that LUN would still be accessed by the VMs via one of these IO emulation layers, right? Or maybe you're advocating using the SAN initiator in the VM instead of the host?

--
Javier
From: Marcelo T. <mto...@re...> - 2008-04-20 23:57:32
On Sun, Apr 20, 2008 at 02:16:52PM +0300, Avi Kivity wrote:
> > The iperf numbers are pretty good. Performance of UP guests increases
> > slightly, but SMP is quite significant.
>
> I expect you're seeing contention induced by memcpy()s and inefficient
> emulation. With the dma api, I expect the benefit will drop.

You still have to memcpy() with the dma api. Even with vringfd the kernel->user copy has to be performed under the global mutex protection, the difference being that several packets can be copied per syscall instead of only one.

> I think many parts are missing (or maybe, I missed them). You need to
> lock the qemu internals (there are many read-mostly qemu caches
> scattered around the code), lock against hotplug, etc.

Yes, there are some parts missing, such as the bh list and hotplug as you mention.

> For pure cpu emulation, there is a ton of work to be done: protecting
> the translator as well as making the translated code smp safe.

I now believe there is a lot of work (which was not clear before). I am not particularly interested in getting real emulation to be multithreaded. Anyway, the lack of multithreading in qemu emulation should not be a blocker for these patches to get in, since these are infrastructural changes.

> I think that QemuDevice makes sense, and that we want this long term,
> but that we first need to improve efficiency (which reduces cpu
> utilization _and_ improves scalability) rather than look at scalability
> alone (which is much harder in addition to the drawback of not reducing
> cpu utilization).

Will complete the QEMUDevice+splitlock patchset, keep it up to date, and test it under a wider variety of workloads.

Thanks.
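A minimal sketch of the per-device locking idea being discussed — the names here are illustrative, not the patchset's actual structures. The point is that ioport/iomem dispatch takes the owning device's lock rather than the single global mutex, so vcpu threads touching different devices stop serializing against each other:

	/* Illustrative sketch, assuming invented names. */
	#include <pthread.h>
	#include <stdint.h>

	typedef struct QEMUDevice {
		pthread_mutex_t lock;	/* serializes access to this device only */
		void *opaque;		/* device-private state */
	} QEMUDevice;

	typedef uint32_t (*ioport_read_fn)(void *opaque, uint32_t addr);

	static uint32_t qdev_ioport_read(QEMUDevice *dev, ioport_read_fn fn,
					 uint32_t addr)
	{
		uint32_t val;

		pthread_mutex_lock(&dev->lock);	/* per-device, not global */
		val = fn(dev->opaque, addr);
		pthread_mutex_unlock(&dev->lock);
		return val;
	}

As the thread notes, this only covers device state; qemu-internal caches, the bh list, hotplug, and the translator would all still need their own protection.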
From: Jamie L. <ja...@sh...> - 2008-04-20 23:39:11
Avi Kivity wrote:
> > Does that mean "for the majority of deployments, the slow version is
> > sufficient. The few that care about performance can use Linux AIO?"
>
> In essence, yes. s/slow/slower/ and s/performance/ultimate block device
> performance/.
>
> Many deployments don't care at all about block device performance; they
> care mostly about networking performance.

That's interesting. I'd have expected block device performance to be important for most things, for the same reason that disk performance is (well, reasonably) important for non-virtual machines. But as you say next:

> > I'm under the impression that the entire and only point of Linux AIO
> > is that it's faster than POSIX AIO on Linux.
>
> It is. I estimate posix aio adds a few microseconds above linux aio per
> I/O request, when using O_DIRECT. Assuming 10 microseconds, you will
> need 10,000 I/O requests per second per vcpu to have a 10% performance
> difference. That's definitely rare.

Oh, I didn't realise the difference was so small. At such a tiny difference, I'm wondering why Linux AIO exists at all, as it complicates the kernel rather a lot. I can see the theoretical appeal, but if the performance difference is so marginal, I'm surprised it's in there. I'm also surprised the glibc implementation of AIO using ordinary threads is so close to it. And then, I'm wondering why use AIO at all: it suggests QEMU would run about as fast doing synchronous I/O in a few dedicated I/O threads.

> > Does that mean "a managed environment can have some code which checks
> > the host kernel version + filesystem type holding the VM image, to
> > conditionally enable Linux AIO?" (Since if you care about
> > performance, which is the sole reason for using Linux AIO, you
> > wouldn't want to enable Linux AIO on any host in your cluster where it
> > will trash performance.)
>
> Either that, or mandate that all hosts use a filesystem and kernel which
> provide the necessary performance. Take ovirt for example, which
> provides the entire hypervisor environment, and so can guarantee this.
>
> Also, I'd presume that those that need 10K IOPS and above will not place
> their high throughput images on a filesystem; rather on a separate SAN LUN.

Does the separate LUN make any difference? I thought O_DIRECT on a filesystem was meant to be pretty close to block device performance. I base this on messages here and there which say swapping to a file is about as fast as swapping to a block device nowadays.

Thanks for your useful remarks, btw. There doesn't seem to be a lot of good info about Linux AIO around.

--
Jamie
From: Anthony L. <ali...@us...> - 2008-04-20 19:32:12
Blue Swirl wrote:
> On 4/19/08, Anthony Liguori <an...@co...> wrote:
>
> Well, the IOVector part and bdrv_readv look OK, except for the heavy
> mallocing involved.

I don't think that in practice malloc is going to have any sort of performance impact. If it does, it's easy enough to implement a small object allocator for common, small vector sizes.

> I'm not so sure about the DMA side and how everything fits together
> for zero-copy IO. For example, do we still need explicit translation
> at some point?

I'm thinking that zero copy will be implemented by setting the map and unmap functions to NULL by default (instead of to the PCI read/write functions). Then the bus can decide whether copy functions are needed. I'll send an updated patch series tomorrow that includes this functionality.

Regards,

Anthony Liguori
From: Avi K. <av...@qu...> - 2008-04-20 18:47:16
Jamie Lokier wrote:
> Avi Kivity wrote:
> > For the majority of deployments posix aio should be sufficient. The few
> > that need something else can use Linux aio.
>
> Does that mean "for the majority of deployments, the slow version is
> sufficient. The few that care about performance can use Linux AIO?"

In essence, yes. s/slow/slower/ and s/performance/ultimate block device performance/.

Many deployments don't care at all about block device performance; they care mostly about networking performance.

> I'm under the impression that the entire and only point of Linux AIO
> is that it's faster than POSIX AIO on Linux.

It is. I estimate posix aio adds a few microseconds above linux aio per I/O request, when using O_DIRECT. Assuming 10 microseconds, you will need 10,000 I/O requests per second per vcpu to have a 10% performance difference. That's definitely rare.

> Does that mean "a managed environment can have some code which checks
> the host kernel version + filesystem type holding the VM image, to
> conditionally enable Linux AIO?" (Since if you care about
> performance, which is the sole reason for using Linux AIO, you
> wouldn't want to enable Linux AIO on any host in your cluster where it
> will trash performance.)

Either that, or mandate that all hosts use a filesystem and kernel which provide the necessary performance. Take ovirt for example, which provides the entire hypervisor environment, and so can guarantee this.

Also, I'd presume that those that need 10K IOPS and above will not place their high-throughput images on a filesystem, but rather on a separate SAN LUN.

> Just wondering.

Hope this clarifies.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
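Spelling out the arithmetic behind that estimate (using Avi's own assumed figures): 10 µs of extra overhead per request × 10,000 requests/s per vcpu = 0.1 s of added CPU time per second of wall time, i.e. about a 10% hit. At a more typical 1,000 requests/s the same overhead would cost roughly 1%.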
From: Jamie L. <ja...@sh...> - 2008-04-20 15:49:46
Avi Kivity wrote:
> For the majority of deployments posix aio should be sufficient. The few
> that need something else can use Linux aio.

Does that mean "for the majority of deployments, the slow version is sufficient; the few that care about performance can use Linux AIO"?

I'm under the impression that the entire and only point of Linux AIO is that it's faster than POSIX AIO on Linux.

> Of course, a managed environment can use Linux aio unconditionally if it
> knows the kernel has all the needed goodies.

Does that mean "a managed environment can have some code which checks the host kernel version + filesystem type holding the VM image, to conditionally enable Linux AIO"? (Since if you care about performance, which is the sole reason for using Linux AIO, you wouldn't want to enable Linux AIO on any host in your cluster where it will trash performance.)

Just wondering.

Thanks,

--
Jamie
From: Yang, S. <she...@in...> - 2008-04-20 13:46:35
On Friday 18 April 2008 23:54:04 Anthony Liguori wrote:
> Yang, Sheng wrote:
> > On Friday 18 April 2008 21:30:14 Anthony Liguori wrote:
> >> Yang, Sheng wrote:
> >>> @@ -1048,17 +1071,18 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu,
> >>> u64 *shadow_pte,
> >>>  	 * whether the guest actually used the pte (in order to detect
> >>>  	 * demand paging).
> >>>  	 */
> >>> -	spte = PT_PRESENT_MASK | PT_DIRTY_MASK;
> >>> +	spte = shadow_base_present_pte | shadow_dirty_mask;
> >>>  	if (!speculative)
> >>>  		pte_access |= PT_ACCESSED_MASK;
> >>>  	if (!dirty)
> >>>  		pte_access &= ~ACC_WRITE_MASK;
> >>> -	if (!(pte_access & ACC_EXEC_MASK))
> >>> -		spte |= PT64_NX_MASK;
> >>> -
> >>> -	spte |= PT_PRESENT_MASK;
> >>> +	if (pte_access & ACC_EXEC_MASK) {
> >>> +		if (shadow_x_mask)
> >>> +			spte |= shadow_x_mask;
> >>> +	} else if (shadow_nx_mask)
> >>> +		spte |= shadow_nx_mask;
> >>
> >> This looks like it may be a bug. The old behavior sets NX if
> >> (pte_access & ACC_EXEC_MASK). The new behavior unconditionally sets NX
> >> and never sets PRESENT. Also, the if (shadow_x_mask) checks are
> >> unnecessary. spte |= 0 is a nop.
> >
> > Thanks for the comment! I realized the two checks of shadow_nx/x_mask
> > are unnecessary... In fact, the correct behavior is to set either
> > shadow_x_mask or shadow_nx_mask; maybe there is a better approach for
> > this. Logic assured by the program itself is always safer. But I will
> > remove the redundant code first.
> >
> > But I don't think it's a bug. The old behavior set NX if (!(pte_access &
> > ACC_EXEC_MASK)), the same as the new one.
>
> The new behavior sets NX regardless of whether (pte_access &
> ACC_EXEC_MASK). Is the desired change to unconditionally set NX?

Oh, I may see the point... shadow_x_mask != shadow_nx_mask.

The old behavior was:

	if (!(pte_access & ACC_EXEC_MASK))
		spte |= PT64_NX_MASK;

The new behavior is:

	if (pte_access & ACC_EXEC_MASK) {
		spte |= shadow_x_mask;
	} else
		spte |= shadow_nx_mask;

For the current (shadow paging) behavior, kvm_arch_init() has:

	kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK,
			PT_DIRTY_MASK, PT64_NX_MASK, 0);

which means shadow_nx_mask = PT64_NX_MASK and shadow_x_mask = 0 (NX means not executable, and X means executable). In patch 5/6, EPT has:

	kvm_mmu_set_mask_ptes(0ull, VMX_EPT_FAKE_ACCESSED_MASK,
			VMX_EPT_FAKE_DIRTY_MASK, 0ull, VMX_EPT_EXECUTABLE_MASK);

which means shadow_nx_mask = 0 and shadow_x_mask = VMX_EPT_EXECUTABLE_MASK.

So, when shadow paging is enabled and (!(pte_access & ACC_EXEC_MASK)), spte |= shadow_nx_mask = PT64_NX_MASK (nothing changes when the condition is not satisfied). When EPT is enabled and (pte_access & ACC_EXEC_MASK), spte |= shadow_x_mask = VMX_EPT_EXECUTABLE_MASK (nothing changes when the condition is not satisfied). They are two different, mutually exclusive bits. Maybe there is a better way to make their meaning clearer...

> > And I am also curious about the PRESENT bit. You see, the PRESENT bit
> > was set at the beginning of the code, and I really don't know why the
> > duplicate one exists there...
>
> Looking at the code, you appear to be right. In the future, I think you
> should separate any cleanups (like removing the redundant setting of
> PRESENT) into a separate patch and stick to just programmatic changes of
> PT_USER_MASK => shadow_user_mask, etc. in this patch. That makes it a
> lot easier to review correctness.

Thanks for the advice; it's important to separate the cleanups. I will get it done more properly next time.

--
Thanks
Yang, Sheng

> Regards,
>
> Anthony Liguori
>
> >>>  	if (pte_access & ACC_USER_MASK)
> >>> -		spte |= PT_USER_MASK;
> >>> +		spte |= shadow_user_mask;
> >>>  	if (largepage)
> >>>  		spte |= PT_PAGE_SIZE_MASK;
From: Avi K. <av...@qu...> - 2008-04-20 11:20:29
Marcelo Tosatti wrote:
> Introduce QEMUDevice, making the ioport/iomem->device relationship visible.
>
> At the moment it only contains a lock, but could be extended.
>
> With it the following is possible:
> - vcpu's can read/write via ioports/iomem while the iothread is working on
>   some unrelated device, or just copying data from the kernel.
> - vcpu's can read/write via ioports/iomem to different devices simultaneously.
>
> This patchset is only a proof-of-concept kind of thing, so only serial + raw
> image are supported.
>
> Tried two benchmarks, iperf and tiobench. With tiobench the reported latency
> is significantly lower (20%+), but throughput with IDE is only slightly
> higher.
>
> Expect to see larger improvements with a higher-performing IO scheme (SCSI
> still buggy, looking at it).
>
> The iperf numbers are pretty good. Performance of UP guests increases
> slightly but SMP is quite significant.

I expect you're seeing contention induced by memcpy()s and inefficient emulation. With the dma api, I expect the benefit will drop.

> Note that workloads with multiple busy devices (such as databases, web
> servers) should be the real winners.
>
> What is the feeling on this? It's not _that_ intrusive and can be easily
> NOP'ed out for QEMU.

I think many parts are missing (or maybe, I missed them). You need to lock the qemu internals (there are many read-mostly qemu caches scattered around the code), lock against hotplug, etc.

For pure cpu emulation, there is a ton of work to be done: protecting the translator as well as making the translated code smp safe.

I think that QemuDevice makes sense, and that we want this long term, but that we first need to improve efficiency (which reduces cpu utilization _and_ improves scalability) rather than look at scalability alone (which is much harder, in addition to the drawback of not reducing cpu utilization).

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: Avi K. <av...@qu...> - 2008-04-20 10:37:00
David Abrahams wrote:
>> Versions of kvm producing this sort of output are common in
>> archaeological digs. Please try a more recent release.
>
> Well, I'll try Hardy Heron soon enough, I suppose. It's due out in 2
> weeks.
>
> I'm sure you understand that most people can't afford to rebuild all
> their important software so that it stays on the bleeding edge. Have
> you considered getting more recent versions of kvm into the updates or
> backports repositories of major distros? I'm not really sure how much
> influence you can have over such things; I'm just asking.

That's up to the distro maintainers, or concerned users (who may either volunteer work or apply pressure).

>>>> What HAL do you see in device manager?
>>>
>>> "Standard PC"
>>
>> This HAL does not support SMP. You need the "ACPI Multiprocessor PC"
>> HAL or some such.
>
> And how would I get that HAL set up?

Follow http://kvm.qumranet.com/kvmwiki/Windows_ACPI_Workaround, substituting your desired HAL for "Standard PC".

>> Unless you have a recent Intel processor, the combination of SMP and
>> Windows XP will give noticeably lower performance. I recommend sticking
>> with uniprocessor in such cases.
>
> I have a Core Duo; isn't that recent enough?

No, this feature is present only on some of the Core 2s, IIRC.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: Avi K. <av...@qu...> - 2008-04-20 10:33:16
Muli Ben-Yehuda wrote:
> > Why avoid rmap on mmio pages? Sure it's unnecessary work, but
> > having less cases improves overall reliability.
>
> The rmap functions already have a check to bail out if the pte is not
> an rmap pte, so in that sense, we aren't adding a new case for the
> code to handle, just adding direct MMIO ptes to the existing list of
> non-rmap ptes.

I'm worried about the huge chain of direct_mmio parameters passed to functions, the impact on the audit code (at the end of mmu.c), and the poor souls who debug the mmu.

> > You can use pfn_valid() in gfn_to_pfn() and kvm_release_pfn_*() to
> > conditionally update the page refcounts.
>
> Since rmap isn't useful for direct MMIO ptes, doesn't it make more
> sense to "bail out" early rather than in the bowels of the rmap code?

It does, from a purist point of view (which also favors explicit parameters a la direct_mmio rather than indirect parameters like pfn_valid()), but I'm looking from the practical point of view now.

With mmu notifiers, we don't need to hold the refcount at all. So presuming we drop the refcounting code completely, are any changes actually necessary here?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
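A minimal sketch of the pfn_valid() approach being suggested here — illustrative only; the real kvm helpers of the period may differ in detail:

	/* Sketch: only adjust page refcounts for pfns backed by a struct
	 * page. Direct-MMIO pfns (device memory) have no struct page, so
	 * pfn_valid() is false for them and they are left untouched —
	 * no extra direct_mmio parameter needs to be threaded through. */
	void kvm_release_pfn_clean(pfn_t pfn)
	{
		if (pfn_valid(pfn))
			put_page(pfn_to_page(pfn));
	}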
From: Avi K. <av...@qu...> - 2008-04-20 08:34:27
Marcelo Tosatti wrote:
> kvm_pv_mmu_op should not take mmap_sem. All gfn_to_page() callers down
> in the MMU processing will take it if necessary, so as it is it can
> deadlock.
>
> Apparently a leftover from the days before slots_lock.

Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: Avi K. <av...@qu...> - 2008-04-20 07:55:46
Anthony Liguori wrote:
> I'd prefer you not do an emulate_instruction loop at all. Just
> emulate one instruction on vmentry failure and let VT tell you what
> instructions you need to emulate.
>
> It's only four instructions so I don't think the performance is going
> to matter. Take a look at the patch I posted previously.

Once we remove the other VT realmode hacks, we may need more instructions emulated. Consider for example changing to real mode without reloading fs and gs; this will cause all real mode code to be emulated.

However, there's no need to do everything at once; the loop can certainly be added later when we have a proven need for it.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: Avi K. <av...@qu...> - 2008-04-20 07:49:10
Marcelo Tosatti wrote:
> Now that threads are spun up before machine->init(), clearing
> of HF_HALTED_MASK for the irqchip-in-kernel case needs to be moved
> to actual vcpu startup.

Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: Avi K. <av...@qu...> - 2008-04-20 07:42:31
Hollis Blanchard wrote:

Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: Avi K. <av...@qu...> - 2008-04-20 07:39:26
Hollis Blanchard wrote:
> Don't allow building as a module (asm-offsets dependencies).
>
> Also, automatically select KVM_BOOKE_HOST until we better separate the guest
> and host layers.
>
>  arch/powerpc/kvm/Kconfig |   11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)

Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: Blue S. <bla...@gm...> - 2008-04-20 06:42:13
On 4/19/08, Anthony Liguori <an...@co...> wrote:
> Blue Swirl wrote:
> > On 4/17/08, Anthony Liguori <ali...@us...> wrote:
> > > Yes, the vector version of packet receive is tough. I'll take a look at
> > > your patch. Basically, you need to associate a set of RX vectors with each
> > > VLANClientState and then when it comes time to deliver a packet to the VLAN,
> > > before calling fd_read, see if there is an RX vector available for the
> > > client.
> > >
> > > In the case of tap, I want to optimize further and do the initial readv()
> > > to one of the clients' RX buffers and then copy that RX buffer to the rest of
> > > the clients if necessary.
> >
> > The vector versions should also help SLIRP to add IP and Ethernet
> > headers to the incoming packets.
>
> Yeah, I'm hoping that with my posted linux-aio interface, I can add vector
> support since linux-aio has a proper asynchronous vector function.
>
> Are we happy with the DMA API? If so, we should commit it now so we can
> start adding proper vector interfaces for net/block.

Well, the IOVector part and bdrv_readv look OK, except for the heavy mallocing involved.

I'm not so sure about the DMA side and how everything fits together for zero-copy IO. For example, do we still need explicit translation at some point?
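A hedged sketch of the RX-vector delivery scheme Anthony describes in the quote above — field and function names are illustrative, not qemu's actual API: each client may post a receive vector, and the VLAN delivery path consults it before falling back to the flat fd_read() copy:

	/* Illustrative only. */
	#include <stdint.h>
	#include <string.h>
	#include <sys/uio.h>

	typedef struct VLANClientState {
		struct iovec *rx_vec;	/* posted receive buffers, NULL if none */
		int rx_vec_len;
		void (*fd_read)(void *opaque, const uint8_t *buf, int size);
		void *opaque;
	} VLANClientState;

	static void vlan_deliver_packet(VLANClientState *vc,
					const uint8_t *buf, size_t size)
	{
		if (vc->rx_vec) {
			/* scatter the packet into the client's posted vector */
			for (int i = 0; i < vc->rx_vec_len && size > 0; i++) {
				size_t n = size < vc->rx_vec[i].iov_len
					 ? size : vc->rx_vec[i].iov_len;
				memcpy(vc->rx_vec[i].iov_base, buf, n);
				buf += n;
				size -= n;
			}
		} else {
			vc->fd_read(vc->opaque, buf, (int)size);
		}
	}

For tap, the further optimization mentioned above would do the initial readv() directly into one client's posted vector and copy to the other clients only when the VLAN has more than one member.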
From: Liu, E. E <eri...@in...> - 2008-04-20 05:38:35
Christian Ehrhardt wrote:
> Liu, Eric E wrote:
> > Hollis Blanchard wrote:
> > > On Wednesday 16 April 2008 01:45:34 Liu, Eric E wrote: [...]
> > > Actually... we could have kvmtrace itself insert the metadata, so
> > > there would be no chance of it being overwritten in the kernel
> > > buffers. The header could be written in tip_open_output(), and
> > > update fs_size accordingly.
> >
> > Yes, letting kvmtrace insert the metadata is more reasonable.
>
> I wanted to note that the kvmtrace tool should not need to know
> everything about the data format. I'm thinking of e.g. changing kernel
> implementations that change endianness, or even flags we don't yet know
> about but might need in the future.
>
> What about adding another debugfs entry the kernel can use to expose
> the "kvmtrace-metadata" defined by the kernel implementation?
> The kvmtrace tool could then use that to build up the record, using
> one entry for kernel-defined metadata and another to add any metadata
> defined by the kvmtrace tool itself.
>
> What about this one:
>
> struct metadata {
> 	u32 kmagic; /* stores kernel-defined metadata read from the debugfs entry */
> 	u32 umagic; /* stores userspace-tool-defined metadata */
> 	u32 extra;  /* redundant; only used to fit into a record. */
> }
>
> That should give us the flexibility to keep the format if we get more
> metadata requirements in the future.

Yes, maybe we will need metadata to indicate changing kernel implementations in the future, but adding a debugfs entry does not seem like a good approach. What about defining similar metadata in the kernel rather than in userland, and writing it into the rchan the first time we add trace data? Then we don't need the kvmtrace tool to insert the metadata again. Like this:

struct kvm_trace_metadata {
	u32 kmagic; /* stores kernel-defined metadata */
	u64 extra;  /* used to fit into a record. */
}
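A minimal sketch of the kernel-side idea Eric proposes — emitting the metadata record into the relay channel once, before any trace records, so userspace never has to synthesize it. The function and magic-constant names here are hypothetical, not the merged kvmtrace code:

	/* Hypothetical sketch. */
	#include <linux/relay.h>
	#include <linux/types.h>

	struct kvm_trace_metadata {
		u32 kmagic;	/* kernel-defined format magic */
		u64 extra;	/* padding so the header fills one record slot */
	};

	#define KVM_TRC_KMAGIC	0x12345678	/* hypothetical value */

	static void kvm_trace_write_metadata(struct rchan *chan)
	{
		struct kvm_trace_metadata meta = {
			.kmagic = KVM_TRC_KMAGIC,
			.extra	= 0,
		};

		/* relay_write() copies the record into the current sub-buffer,
		 * so readers see it before the first real trace record. */
		relay_write(chan, &meta, sizeof(meta));
	}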
From: Alex D. <ale...@ya...> - 2008-04-20 00:08:31
--- On Sat, 4/19/08, Marcelo Tosatti <mto...@re...> wrote:

> From: Marcelo Tosatti <mto...@re...>
> Subject: Re: [kvm-devel] Second KVM process hangs eating 80-100% CPU on host during startup
> To: "Alex Davis" <ale...@ya...>
> Cc: av...@qu..., kvm...@li...
> Date: Saturday, April 19, 2008, 7:11 PM
>
> On Sat, Apr 19, 2008 at 03:47:31PM -0700, Alex Davis wrote:
> > --- On Fri, 4/18/08, Avi Kivity <av...@qu...> wrote:
> > > From: Avi Kivity <av...@qu...>
> > > Subject: Re: [kvm-devel] Second KVM process hangs eating 80-100% CPU on host during startup
> > [snip]
> >
> > I tried booting the guest with 'lpj=10682525' to work around the
> > calibrate_delay issue, but that gave me:
> >
> > [    0.004100] ENABLING IO-APIC IRQs
> > [    0.004100] ..TIMER: vector=0x31 apic1=0 pin1=0 apic2=-1 pin2=-1
> > [    0.004100] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > [    0.004100] ...trying to set up timer (IRQ0) through the 8259A ... failed.
> > [    0.004100] ...trying to set up timer as Virtual Wire IRQ ... failed.
> > [    0.004100] ...trying to set up timer as ExtINT IRQ ... failed :(.
> > [    0.004100] Kernel panic - not syncing: IO-APIC + timer doesn't work!
> > Boot with apic=debug and send a report. Then try booting with the
> > 'noapic' option.
> >
> > Booting with 'apic=debug' gives these additional lines:
> > [    0.004100] Getting VERSION: 50014
> > [    0.004100] Getting VERSION: 50014
> > [    0.004100] Getting ID: 0
> > [    0.004100] Getting LVT0: 700
> > [    0.004100] Getting LVT1: 10000
>
> Hi Alex,
>
> Can you please try the following.
>
> KVM: PIT: make last_injected_time per-guest
>
> Otherwise multiple guests use the same variable and boom.
>
> Also use kvm_vcpu_kick() to make sure that if a timer triggers on
> a different CPU the event won't be missed.
>
> Signed-off-by: Marcelo Tosatti <mto...@re...>
>
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index 2852dd1..5697ad2 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -200,10 +200,8 @@ int __pit_timer_fn(struct kvm_kpit_state *ps)
>
>  	atomic_inc(&pt->pending);
>  	smp_mb__after_atomic_inc();
> -	if (vcpu0 && waitqueue_active(&vcpu0->wq)) {
> -		vcpu0->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> -		wake_up_interruptible(&vcpu0->wq);
> -	}
> +	if (vcpu0)
> +		kvm_vcpu_kick(vcpu0);
>
>  	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
>  	pt->scheduled = ktime_to_ns(pt->timer.expires);
> @@ -572,7 +570,6 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
>  	struct kvm_pit *pit = vcpu->kvm->arch.vpit;
>  	struct kvm *kvm = vcpu->kvm;
>  	struct kvm_kpit_state *ps;
> -	static unsigned long last_injected_time;
>
>  	if (vcpu && pit) {
>  		ps = &pit->pit_state;
> @@ -582,11 +579,11 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
>  		 * 2. Last interrupt was accepted or waited for too long time*/
>  		if (atomic_read(&ps->pit_timer.pending) &&
>  		    (ps->inject_pending ||
> -		    (jiffies - last_injected_time
> +		    (jiffies - ps->last_injected_time
>  			>= KVM_MAX_PIT_INTR_INTERVAL))) {
>  			ps->inject_pending = 0;
>  			__inject_pit_timer_intr(kvm);
> -			last_injected_time = jiffies;
> +			ps->last_injected_time = jiffies;
>  		}
>  	}
>  }
> diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
> index e63ef38..db25c2a 100644
> --- a/arch/x86/kvm/i8254.h
> +++ b/arch/x86/kvm/i8254.h
> @@ -35,6 +35,7 @@ struct kvm_kpit_state {
>  	struct mutex lock;
>  	struct kvm_pit *pit;
>  	bool inject_pending; /* if inject pending interrupts */
> +	unsigned long last_injected_time;
>  };
>
>  struct kvm_pit {

Problem(s) solved. Everything is working now. I can now boot both with and without 'lpj='. The BogoMIPS are also being calculated correctly in secondary guests without 'lpj='. I'll play with it some more just to make sure, then I'll close the original bug.

Thanks, Marcelo et al.
From: Marcelo T. <mto...@re...> - 2008-04-19 23:08:06
On Sat, Apr 19, 2008 at 03:47:31PM -0700, Alex Davis wrote:
> --- On Fri, 4/18/08, Avi Kivity <av...@qu...> wrote:
> > From: Avi Kivity <av...@qu...>
> > Subject: Re: [kvm-devel] Second KVM process hangs eating 80-100% CPU on host during startup
> [snip]
>
> I tried booting the guest with 'lpj=10682525' to work around the
> calibrate_delay issue, but that gave me:
>
> [    0.004100] ENABLING IO-APIC IRQs
> [    0.004100] ..TIMER: vector=0x31 apic1=0 pin1=0 apic2=-1 pin2=-1
> [    0.004100] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> [    0.004100] ...trying to set up timer (IRQ0) through the 8259A ... failed.
> [    0.004100] ...trying to set up timer as Virtual Wire IRQ ... failed.
> [    0.004100] ...trying to set up timer as ExtINT IRQ ... failed :(.
> [    0.004100] Kernel panic - not syncing: IO-APIC + timer doesn't work!
> Boot with apic=debug and send a report. Then try booting with the
> 'noapic' option.
>
> Booting with 'apic=debug' gives these additional lines:
> [    0.004100] Getting VERSION: 50014
> [    0.004100] Getting VERSION: 50014
> [    0.004100] Getting ID: 0
> [    0.004100] Getting LVT0: 700
> [    0.004100] Getting LVT1: 10000

Hi Alex,

Can you please try the following.

KVM: PIT: make last_injected_time per-guest

Otherwise multiple guests use the same variable and boom.

Also use kvm_vcpu_kick() to make sure that if a timer triggers on
a different CPU the event won't be missed.

Signed-off-by: Marcelo Tosatti <mto...@re...>

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 2852dd1..5697ad2 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -200,10 +200,8 @@ int __pit_timer_fn(struct kvm_kpit_state *ps)

 	atomic_inc(&pt->pending);
 	smp_mb__after_atomic_inc();
-	if (vcpu0 && waitqueue_active(&vcpu0->wq)) {
-		vcpu0->arch.mp_state = KVM_MP_STATE_RUNNABLE;
-		wake_up_interruptible(&vcpu0->wq);
-	}
+	if (vcpu0)
+		kvm_vcpu_kick(vcpu0);

 	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
 	pt->scheduled = ktime_to_ns(pt->timer.expires);
@@ -572,7 +570,6 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
 	struct kvm_pit *pit = vcpu->kvm->arch.vpit;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_kpit_state *ps;
-	static unsigned long last_injected_time;

 	if (vcpu && pit) {
 		ps = &pit->pit_state;
@@ -582,11 +579,11 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
 		 * 2. Last interrupt was accepted or waited for too long time*/
 		if (atomic_read(&ps->pit_timer.pending) &&
 		    (ps->inject_pending ||
-		    (jiffies - last_injected_time
+		    (jiffies - ps->last_injected_time
 			>= KVM_MAX_PIT_INTR_INTERVAL))) {
 			ps->inject_pending = 0;
 			__inject_pit_timer_intr(kvm);
-			last_injected_time = jiffies;
+			ps->last_injected_time = jiffies;
 		}
 	}
 }
diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
index e63ef38..db25c2a 100644
--- a/arch/x86/kvm/i8254.h
+++ b/arch/x86/kvm/i8254.h
@@ -35,6 +35,7 @@ struct kvm_kpit_state {
 	struct mutex lock;
 	struct kvm_pit *pit;
 	bool inject_pending; /* if inject pending interrupts */
+	unsigned long last_injected_time;
 };

 struct kvm_pit {
From: Alex D. <ale...@ya...> - 2008-04-19 22:47:30
--- On Fri, 4/18/08, Avi Kivity <av...@qu...> wrote:

> From: Avi Kivity <av...@qu...>
> Subject: Re: [kvm-devel] Second KVM process hangs eating 80-100% CPU on host during startup

[snip]

I tried booting the guest with 'lpj=10682525' to work around the calibrate_delay issue, but that gave me:

[    0.004100] ENABLING IO-APIC IRQs
[    0.004100] ..TIMER: vector=0x31 apic1=0 pin1=0 apic2=-1 pin2=-1
[    0.004100] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[    0.004100] ...trying to set up timer (IRQ0) through the 8259A ... failed.
[    0.004100] ...trying to set up timer as Virtual Wire IRQ ... failed.
[    0.004100] ...trying to set up timer as ExtINT IRQ ... failed :(.
[    0.004100] Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the 'noapic' option.

Booting with 'apic=debug' gives these additional lines:

[    0.004100] Getting VERSION: 50014
[    0.004100] Getting VERSION: 50014
[    0.004100] Getting ID: 0
[    0.004100] Getting LVT0: 700
[    0.004100] Getting LVT1: 10000