From: Christoph L. <cla...@sg...> - 2008-04-22 20:28:21
|
On Tue, 22 Apr 2008, Andrea Arcangeli wrote: > My patch order and API backward compatible extension over the patchset > is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support > XPMEM as well. KVM/GRU won't notice any difference once the support > for XPMEM is added, but even if the API would completely change in > 2.6.27, that's still better than no functionality at all in 2.6.26. Please redo the patchset with the right order. To my knowledge there is no chance of this getting merged for 2.6.26. |
From: Christoph L. <cla...@sg...> - 2008-04-22 20:26:07
|
Doing the right patch ordering would have avoided this patch and allowed better review. |
From: Christoph L. <cla...@sg...> - 2008-04-22 20:25:04
|
Why are the subjects all screwed up? They are the first line of the description instead of the subject line of my patches. |
From: Christoph L. <cla...@sg...> - 2008-04-22 20:24:15
|
Reverts a part of an earlier patch. Why isn't this merged into 1 of 12? |
From: Christoph L. <cla...@sg...> - 2008-04-22 20:23:15
|
Missing signoff by you. |
From: David S. A. <da...@ci...> - 2008-04-22 20:23:08
|
I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:

1. before vcpu->arch.mmu.page_fault()
2. after vcpu->arch.mmu.page_fault()
3. after mmu_topup_memory_caches()
4. after emulate_instruction()

So the deltas in the trace reports show:
- cycles required for arch.mmu.page_fault (tracer 2)
- cycles required for mmu_topup_memory_caches() (tracer 3)
- cycles required for emulate_instruction() (tracer 4)

I captured trace data for ~5 seconds during one of the usual events (again, this time it was due to kscand in the guest). I ran the formatted trace data through an awk script to summarize:

TSC cycles           tracer2   tracer3   tracer4
0 - 10,000:           295067    213251    115873
10,001 - 25,000:        7682      1004     98336
25,001 - 50,000:         201        15        36
50,001 - 100,000:     100655         0        10
> 100,000:               117         0        15

This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly 5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it took longer than 50,000 cycles. The page_fault function getting run is paging64_page_fault. mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times, most of them relatively quickly.

Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few host processes could interrupt it.

david

Avi Kivity wrote:
> David S. Ahern wrote:
>> I added the traces and captured data over another apparent lockup of the guest.
>> This seems to be representative of the sequence (pid/vcpu removed).
>>
>> (+4776) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+3632) VMENTRY
>> (+4552) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
>> (+ 54928) VMENTRY
>
> Can you oprofile the host to see where the 54K cycles are spent?
>
>> (+4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
>> (+8432) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
>> (+ 13832) VMENTRY
>>
>> (+5768) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+3712) VMENTRY
>> (+4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
>
> This indeed has the accessed bit clear.
>
>> (+ 65216) VMENTRY
>> (+4232) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
>
> This has the accessed bit set and the user bit clear, and the pte pointing
> at the previous pte_write gpa. Looks like a kmap_atomic().
>
>> (+8640) VMENTRY
>> (+3936) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
>> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
>> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
>> (+ 14160) VMENTRY
>>
>> I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
>> vcpu1 files have 85000 total lines and compressed the files total ~500k.
>>
>> I did not see the FLOODED trace come out during this sample though I did bump
>> the count from 3 to 4 as you suggested.
>
> Bumping the count was supposed to remove the flooding...
>
>> Correlating rip addresses to the 2.4 kernel:
>>
>> c0160d00-c0161290 = page_referenced
>>
>> It looks like the event is kscand running through the pages. I suspected this
>> some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
>> appeared to lower the peak of the spikes, but maybe I imagined it. I believe
>> lowering that value makes kscand wake up more often but do less work (page
>> scanning) each time it is awakened.
>
> What does 'top' in the guest show (perhaps sorted by total cpu time rather
> than instantaneous usage)?
>
> What host kernel are you running? How many host cpus? |
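The awk script that produced the table above is not shown in the thread. For readers who want to reproduce the bucketing, here is a minimal standalone C sketch of the same histogram; the bucket boundaries come from the table, while all the names and the hard-coded sample deltas are invented for illustration:

#include <stdio.h>

#define NBUCKETS 5

static const unsigned long bucket_limit[NBUCKETS - 1] = {
	10000, 25000, 50000, 100000
};

static const char *bucket_name[NBUCKETS] = {
	"0 - 10,000", "10,001 - 25,000", "25,001 - 50,000",
	"50,001 - 100,000", "> 100,000"
};

/* one histogram row per delta: tracer2, tracer3, tracer4 */
static unsigned long histogram[3][NBUCKETS];

/* Account one TSC delta (0 = tracer2, 1 = tracer3, 2 = tracer4). */
static void account_delta(int tracer, unsigned long long cycles)
{
	int i;

	for (i = 0; i < NBUCKETS - 1; i++)
		if (cycles <= bucket_limit[i])
			break;
	histogram[tracer][i]++;
}

int main(void)
{
	/* In the real run the deltas would be parsed from the formatted trace file. */
	account_delta(0, 4776);
	account_delta(0, 54928);

	for (int i = 0; i < NBUCKETS; i++)
		printf("%-18s %8lu %8lu %8lu\n", bucket_name[i],
		       histogram[0][i], histogram[1][i], histogram[2][i]);
	return 0;
}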
From: Christoph L. <cla...@sg...> - 2008-04-22 20:22:49
|
Looks like this is not complete. There are numerous .h files missing, which means that various structs are undefined (fs.h and rmap.h are needed, for example), which leads to surprises when dereferencing fields of these structs. It seems that mm_types.h is expected to be included only in certain contexts. Could you make sure to include all necessary .h files? Or add some docs to clarify the situation here. |
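For readers unfamiliar with why a missing header only bites at certain use sites, here is a small illustration (not taken from the patch) of the difference between passing around a forward-declared struct and dereferencing its fields:

/* A forward declaration is enough to pass pointers around, but any field
 * access needs the full definition, so a missing #include (e.g. fs.h for
 * struct address_space) only blows up at the point of dereference. */
struct address_space;				/* forward declaration */

static inline int mapping_is_null(struct address_space *mapping)
{
	return !mapping;			/* fine: no field access */
}

/*
 * static inline unsigned long mapping_pages(struct address_space *mapping)
 * {
 *	return mapping->nrpages;	<-- "dereferencing pointer to incomplete
 * }					    type" unless the defining header was
 *					    included first
 */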
From: Alberto T. <al...@by...> - 2008-04-22 20:22:06
|
I was wondering if anyone could reproduce my problem. If it is reproducible, then I'll file a bug. I am using e1000 ethernet adapters on Windows 2003 and Linux guests. The line to set it up is something like this:

-net nic,vlan=1,macaddr=00:ff:21:cf:91:01,model=e1000 \
-net tap,vlan=1,ifname=tap.br1.91.1

On Linux, this works just fine. However, on Windows 2003, the mac address for the device is reported as 00:ff:ff:ff:ff:ff and the packets carry this mac address as well. The corresponding tap device has the correct IP address, however. This problem is definitely tied to using Windows 2003 with an e1000 device. If I use the rtl8139 device, Windows reports the correct mac address. When booting the same VM with a Linux bootable CD and the e1000 device, Linux reports the correct mac address as set in the qemu command. It's the combination of Windows 2003 and the e1000 device that causes the problem. Has anyone else seen this problem? Thanks in advance.

--
Alberto Treviño
al...@by... |
From: Christoph L. <cla...@sg...> - 2008-04-22 20:19:28
|
Thanks for adding most of my enhancements. But:

1. There is no real need for invalidate_page(). Can be done with invalidate_start/end. Needlessly complicates the API. One of the objections by Andrew was that there were multiple callbacks that perform similar functions.

2. The locks that are used are later changed to semaphores. This is, for example, true for mm_lock / mm_unlock. The diffs will be smaller if the lock conversion is done first and then mm_lock is introduced. The way the patches are structured means that reviewers cannot review the final version of mm_lock etc. The lock conversion needs to come first.

3. As noted by Eric, and also contained in a private post from yesterday by me: the cmp function needs to retrieve the values before doing comparisons, which is not done for the == of a and b. |
From: Robin H. <ho...@sg...> - 2008-04-22 19:42:21
|
On Tue, Apr 22, 2008 at 08:43:35PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
> > 1) invalidate_page: You retain an invalidate_page() callout. I believe
> > we have progressed that discussion to the point that it requires some
> > direction for Andrew, Linus, or somebody in authority. The basics
> > of the difference distill down to no expected significant performance
> > difference between the two. The invalidate_page() callout potentially
> > can simplify GRU code. It does provide a more complex api for the
> > users of mmu_notifier which, IIRC, Christoph had interpretted from one
> > of Andrew's earlier comments as being undesirable. I vaguely recall
> > that sentiment as having been expressed.
>
> invalidate_page as demonstrated in KVM pseudocode doesn't change the
> locking requirements, and it has the benefit of reducing the window of
> time the secondary page fault has to be masked and at the same time
> _halves_ the number of _hooks_ in the VM every time the VM deal with
> single pages (example: do_wp_page hot path). As long as we can't fully
> converge because of point 3, it'd rather keep invalidate_page to be
> better. But that's by far not a priority to keep.

Christoph, Jack and I just discussed invalidate_page(). I don't think the point Andrew was making is that compelling in this circumstance. The code has changed fairly remarkably.

Would you have any objection to putting it back into your patch/agreeing to it remaining in Andrea's patch? If not, I think we can put this issue aside until Andrew gets out of the merge window and can decide it. Either way, the patches become much more similar with this in.

Thanks,
Robin |
From: Daniel P. B. <ber...@re...> - 2008-04-22 19:26:40
|
On Tue, Apr 22, 2008 at 02:26:45PM -0300, Marcelo Tosatti wrote: > On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote: > > Avi Kivity wrote: > > >Anthony Liguori wrote: > > >> > > >>I think we need to decide what we want to target in terms of upper > > >>limits. > > >> > > >>With a bridge or two, we can probably easily do 128. > > >> > > >>If we really want to push things, I think we should do a PCI based > > >>virtio controller. I doubt a large number of PCI devices is ever > > >>going to perform very well b/c of interrupt sharing and some of the > > >>assumptions in virtio_pci. > >> > > >>If we implement a controller, we can use a single interrupt, but > > >>multiplex multiple notifications on that single interrupt. We can > > >>also be more aggressive about using shared memory instead of PCI > > >>config space which would reduce the overall number of exits. > > We should increase the number of interrupt lines, perhaps to 16. > > Using shared memory to avoid exits sounds very good idea. > > > >>We could easily support a very large number of devices this way. But > > >>again, what do we want to target for now? > > > > > >I think that for networking we should keep things as is. I don't see > > >anybody using 100 virtual NICs. > > The target was along the lines of 20 nics + 80 disks. Dan? I've already had people ask for ability to as many as 64 disks and 32 nics with Xen, so to my mind, the more we support the better. 100's if possible. Dan. -- |: Red Hat, Engineering, Boston -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| |
From: Anthony L. <ali...@us...> - 2008-04-22 19:11:27
|
Marcelo Tosatti wrote: > On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote: > >> Anthony Liguori wrote: >> >>> This patch changes virtio devices to be multi-function devices whenever >>> possible. This increases the number of virtio devices we can support now by >>> a factor of 8. >>> >>> With this patch, I've been able to launch a guest with either 220 disks or 220 >>> network adapters. >>> >>> >>> >> Does this play well with hotplug? Perhaps we need to allocate a new >> device on hotplug. >> >> (certainly if we have a device with one function, which then gets >> converted to a multifunction device) >> > > Would have to change the hotplug code to handle functions... > BTW, I've never been that convinced that hotplugging devices is as useful as people make it out to be. I also think that's particularly true when it comes to hot adding/removing very large numbers of disks. I think if you created all virtio devices as multifunction devices, but didn't add additional functions until you ran out of PCI slots, it would be a pretty acceptable solution. Hotplug works just as it does today until you get much higher than 32 devices. Even then, hotplug still works with most of your devices (until you hit the absolute maximum number of devices of course). Regards, Anthony Liguori > It sounds less hacky to just extend the PCI slots instead of (ab)using > multiple functions per-slot. > |
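For context on what "multi-function" costs on the device side: PCI signals that a slot hosts more than one function through bit 7 of the Header Type register at config offset 0x0E, and the guest then probes devfn 1..7 of that slot. The helper below is a hypothetical sketch of that single step (the macro names are defined locally here), not the actual qemu patch:

#include <stdint.h>

#define PCI_HEADER_TYPE		0x0e	/* standard config-space offset */
#define PCI_HEADER_TYPE_MULTI	0x80	/* bit 7: more functions behind this slot */

/* Hypothetical helper: mark function 0 of a slot as multi-function so the
 * guest keeps probing functions 1..7 instead of stopping after function 0. */
static void pci_config_set_multifunction(uint8_t *config)
{
	config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI;
}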
From: Ian K. <bl...@bl...> - 2008-04-22 19:01:03
|
Avi Kivity wrote: > For mass storage, we should follow the SCSI model with a single device > serving multiple disks, similar to what you suggest. Not sure if the > device should have a single queue or one queue per disk. Don't you just end up re-implementing SCSI then, at which point you might as well stick with a 'fake' SCSI device in the guest? |
From: Anthony L. <an...@co...> - 2008-04-22 18:52:15
|
>>> For mass storage, we should follow the SCSI model with a single device
>>> serving multiple disks, similar to what you suggest. Not sure if the
>>> device should have a single queue or one queue per disk.
>>
>> My latest thought is to do a virtio-based virtio controller.
>
> Why do you dislike multiple disks per virtio-blk controller? As
> mentioned this seems a natural way forward.

Logically speaking, virtio is a bus. virtio supports all of the features of a bus (discovery, hot add, hot remove). Right now, we map virtio devices directly onto the PCI bus. The problem we're trying to address is limitations of the PCI bus. We have a couple of options:

1) add a virtio device that supports multiple disks. We need to reinvent hotplug within this device.
2) add a new PCI virtio transport that supports multiple virtio-blk devices within a single PCI slot
3) add a generic PCI virtio transport that supports multiple virtio devices within a single PCI slot
4) add a generic virtio "bridge" that supports multiple virtio devices within a single virtio device.

#4 may seem strange, but it's no different from a PCI-to-PCI bridge. I like #4 the most, but #2 is probably the most practical.

Regards,

Anthony Liguori |
From: Andrea A. <an...@qu...> - 2008-04-22 18:43:36
|
On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote: > 1) invalidate_page: You retain an invalidate_page() callout. I believe > we have progressed that discussion to the point that it requires some > direction for Andrew, Linus, or somebody in authority. The basics > of the difference distill down to no expected significant performance > difference between the two. The invalidate_page() callout potentially > can simplify GRU code. It does provide a more complex api for the > users of mmu_notifier which, IIRC, Christoph had interpretted from one > of Andrew's earlier comments as being undesirable. I vaguely recall > that sentiment as having been expressed. invalidate_page as demonstrated in KVM pseudocode doesn't change the locking requirements, and it has the benefit of reducing the window of time the secondary page fault has to be masked and at the same time _halves_ the number of _hooks_ in the VM every time the VM deal with single pages (example: do_wp_page hot path). As long as we can't fully converge because of point 3, it'd rather keep invalidate_page to be better. But that's by far not a priority to keep. > 2) Range callout names: Your range callouts are invalidate_range_start > and invalidate_range_end whereas Christoph's are start and end. I do not > believe this has been discussed in great detail. I know I have expressed > a preference for your names. I admit to having failed to follow up on > this issue. I certainly believe we could come to an agreement quickly > if pressed. I think using ->start ->end is a mistake, think when we later add mprotect_range_start/end. Here too I keep the better names only because we can't converge on point 3 (the API will eventually change, like every other kernel interal API, even core things like __free_page have been mostly obsoleted). > 3) The structure of the patch set: Christoph's upcoming release orders > the patches so the prerequisite patches are seperately reviewable > and each file is only touched by a single patch. Additionally, that Each file touched by a single patch? I doubt... The split is about the same, the main difference is the merge ordering, I always had the zero risk part at the head, he moved it at the tail when he incorporated #v12 into his patchset. > allows mmu_notifiers to be introduced as a single patch with sleeping > functionality from its inception and an API which remains unchanged. > Your patch set, however, introduces one API, then turns around and > changes that API. Again, the desire to make it an unchanging API was > expressed by, IIRC, Andrew. This does represent a risk to XPMEM as > the non-sleeping API may become entrenched and make acceptance of the > sleeping version less acceptable. > > Can we agree upon this list of issues? This is a kernel internal API, so it will definitely change over time. It's nothing close to a syscall. Also note: the API is obviously defined in mmu_notifier.h and none of the 2-12 patches touches mmu_notifier.h. So the extension of the method semantics is 100% backwards compatible. My patch order and API backward compatible extension over the patchset is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support XPMEM as well. KVM/GRU won't notice any difference once the support for XPMEM is added, but even if the API would completely change in 2.6.27, that's still better than no functionality at all in 2.6.26. |
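A rough sketch of the hook-count argument being made here, with extern declarations standing in for the real mmu_notifier hooks and PTE helpers (none of this is the actual patch code):

/* Illustration only: stub declarations, not the patchset's API. */
#define PAGE_SIZE 4096UL

void mmu_notifier_invalidate_range_start(unsigned long start, unsigned long end);
void mmu_notifier_invalidate_range_end(unsigned long start, unsigned long end);
void mmu_notifier_invalidate_page(unsigned long address);
void write_protect_one_pte(unsigned long address);

/* Range-only API: a single-page update in a hot path (do_wp_page-like)
 * has to bracket the PTE write with two notifier calls. */
void wp_page_with_range_api(unsigned long address)
{
	mmu_notifier_invalidate_range_start(address, address + PAGE_SIZE);
	write_protect_one_pte(address);
	mmu_notifier_invalidate_range_end(address, address + PAGE_SIZE);
}

/* With invalidate_page(): one call after the update, so the window in which
 * the secondary (KVM/GRU) fault handler must be held off is shorter. */
void wp_page_with_page_api(unsigned long address)
{
	write_protect_one_pte(address);
	mmu_notifier_invalidate_page(address);
}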
From: Alberto T. <al...@by...> - 2008-04-22 18:40:23
|
Thanks for all those who work on KVM. It is a wonderful product and I have been very impressed with its features, performance, and the level of activity in this project. Back in February a bug was filed. I've been hit by this bug as well, but there hasn't been much activity with it in the last little bit. I wanted to know if anyone had a fix for it, or a workaround (other than using IDE), or whether it was on someone's radar. Here is a link to the bug: http://sourceforge.net/tracker/index.php?func=detail&aid=1895893&group_id=180599&atid=893831 Thanks in advance. -- Alberto Treviño al...@by... |
From: Robin H. <ho...@sg...> - 2008-04-22 18:22:12
|
I believe the differences between your patch set and Christoph's need to be understood and a compromise approach agreed upon. Those differences, as I understand them, are:

1) invalidate_page: You retain an invalidate_page() callout. I believe we have progressed that discussion to the point that it requires some direction for Andrew, Linus, or somebody in authority. The basics of the difference distill down to no expected significant performance difference between the two. The invalidate_page() callout potentially can simplify GRU code. It does provide a more complex api for the users of mmu_notifier which, IIRC, Christoph had interpreted from one of Andrew's earlier comments as being undesirable. I vaguely recall that sentiment as having been expressed.

2) Range callout names: Your range callouts are invalidate_range_start and invalidate_range_end whereas Christoph's are start and end. I do not believe this has been discussed in great detail. I know I have expressed a preference for your names. I admit to having failed to follow up on this issue. I certainly believe we could come to an agreement quickly if pressed.

3) The structure of the patch set: Christoph's upcoming release orders the patches so the prerequisite patches are separately reviewable and each file is only touched by a single patch. Additionally, that allows mmu_notifiers to be introduced as a single patch with sleeping functionality from its inception and an API which remains unchanged. Your patch set, however, introduces one API, then turns around and changes that API. Again, the desire to make it an unchanging API was expressed by, IIRC, Andrew. This does represent a risk to XPMEM, as the non-sleeping API may become entrenched and make acceptance of the sleeping version less acceptable.

Can we agree upon this list of issues?

Thank you,
Robin Holt |
From: Glauber C. <gc...@re...> - 2008-04-22 17:59:57
|
Gerd Hoffmann wrote:
> Jeremy Fitzhardinge wrote:
>> Xen could change the parameters in the instant after get_time_values().
>> That change could be as a result of suspend-resume, so the parameters
>> and the tsc could be wildly different.
>
> Ah, ok, forgot the rdtsc in the picture. With that in mind I fully
> agree that the loop is needed. I think kvm guests can even hit that one
> with the vcpu migrating to a different physical cpu, so we better handle
> it correctly ;)

It's probably not needed for kvm, since we update everything every time we get scheduled in on the host side, which would cover the case of migration between physical cpus. But it's probably okay to do it to get a common denominator with xen, if needed. |
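For reference, the loop being discussed has the classic seqcount shape: the hypervisor makes the version field odd while it rewrites the time parameters, and the guest rereads until it sees the same even version before and after its reads, so a suspend/resume or migration between reading the parameters and executing rdtsc cannot mix values from two updates. A userspace-flavoured sketch, with a field layout loosely modelled on vcpu_time_info and deliberately simplified scaling math:

#include <stdint.h>
#include <x86intrin.h>			/* __rdtsc() */

/* Illustration only: field names follow the xen/kvm vcpu_time_info idea,
 * but the layout, helpers and math here are not the real guest code. */
struct time_params {
	volatile uint32_t version;	/* odd while the hypervisor is updating */
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

static uint64_t read_guest_clock(const struct time_params *p)
{
	uint32_t version;
	uint64_t delta, ns;

	do {
		version = p->version;
		__sync_synchronize();	/* parameter reads must not pass the version read */
		delta = __rdtsc() - p->tsc_timestamp;
		if (p->tsc_shift >= 0)
			delta <<= p->tsc_shift;
		else
			delta >>= -p->tsc_shift;
		/* real code does this multiply in more than 64 bits to avoid overflow */
		ns = p->system_time + ((delta * p->tsc_to_system_mul) >> 32);
		__sync_synchronize();
		/* retry if we started mid-update or an update slipped in between */
	} while ((version & 1) || version != p->version);

	return ns;
}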
From: Amit S. <ami...@qu...> - 2008-04-22 17:29:10
|
* On Sunday 13 Apr 2008 14:06:27 Avi Kivity wrote: > Amit Shah wrote: > > Passthrough devices are host machine PCI devices which have > > been handed off to the guest. Handle interrupts from these > > devices and route them to the appropriate guest irq lines. > > The userspace provides us with the necessary information > > via the ioctls. > > > > The guest IRQ numbers can change dynamically, so we have an > > additional ioctl that keeps track of those changes in userspace > > and notifies us whenever that happens. > > > > It is expected the kernel driver for the passthrough device > > is removed before passing it on to the guest. > > > > > > +/* > > + * Used to find a registered host PCI device (a "passthrough" device) > > + * during interrupts or EOI > > + */ > > +static struct kvm_pci_pt_dev_list * > > +find_pci_pt_dev(struct list_head *head, > > + struct kvm_pci_pt_info *pv_pci_info, int irq, int source) > > +{ > > + struct list_head *ptr; > > + struct kvm_pci_pt_dev_list *match; > > + > > + list_for_each(ptr, head) { > > + match = list_entry(ptr, struct kvm_pci_pt_dev_list, list); > > + > > + switch (source) { > > + case KVM_PT_SOURCE_IRQ: > > + /* > > + * Used to find a registered host device > > + * during interrupt context on host > > + */ > > + if (match->pt_dev.host.irq == irq) > > + return match; > > + break; > > + case KVM_PT_SOURCE_IRQ_ACK: > > + /* > > + * Used to find a registered host device when > > + * the guest acks an interrupt > > + */ > > + if (match->pt_dev.guest.irq == irq) > > + return match; > > + break; > > + } > > + } > > + return NULL; > > +} > > This would be better as two separate functions. Also, locking? For pvdma, there will be two more cases. Very similar functions for essentially looking up an entry in the same list. Locking will be supported soon. > > +static irqreturn_t > > +kvm_pci_pt_dev_intr(int irq, void *dev_id) > > Please don't split declarations unnecessarily. Fixed. > > +{ > > + struct kvm_pci_pt_dev_list *match; > > + struct kvm *kvm = (struct kvm *) dev_id; > > + > > + if (!test_bit(irq, pt_irq_handled)) > > + return IRQ_NONE; > > + > > + if (test_bit(irq, pt_irq_pending)) > > + return IRQ_HANDLED; > > Will the interrupt not fire immediately after this returns? Hmm. This is just an optimisation so that we don't have to look up the list each time to find out which assigned device it is and (re)injecting the interrupt. Also we avoid the (TODO) getting/releasing locks which will be needed for the list lookup. Disabling interrupts for PCI devices isn't a good idea even if we don't support shared interrupts. Any other ideas to avoid this from happening? > > + match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL, > > + irq, KVM_PT_SOURCE_IRQ); > > + if (!match) > > + return IRQ_NONE; > > + > > + /* Not possible to detect if the guest uses the PIC or the > > + * IOAPIC. So set the bit in both. The guest will ignore > > + * writes to the unused one. > > + */ > > + kvm_ioapic_set_irq(kvm->arch.vioapic, match->pt_dev.guest.irq, 1); > > + kvm_pic_set_irq(pic_irqchip(kvm), match->pt_dev.guest.irq, 1); > > A function that calls both the apic and the pic is better, as it will be > easier to port. Done. 
> > + set_bit(irq, pt_irq_pending); > > + return IRQ_HANDLED; > > +} > > + > > +/* Ack the irq line for a passthrough device */ > > +void > > +kvm_pci_pt_ack_irq(struct kvm *kvm, int vector) > > +{ > > + int irq; > > + struct kvm_pci_pt_dev_list *match; > > + > > + irq = get_eoi_gsi(kvm->arch.vioapic, vector); > > + match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL, > > + irq, KVM_PT_SOURCE_IRQ_ACK); > > + if (!match) > > + return; > > + if (test_bit(match->pt_dev.host.irq, pt_irq_pending)) { > > + kvm_ioapic_set_irq(kvm->arch.vioapic, irq, 0); > > + kvm_pic_set_irq(pic_irqchip(kvm), irq, 0); > > This is dangerous with smp guests, if we aren't careful with the > ordering the interrupt may fire again and be forwarded to the other > vcpu. We need to call this before we redeliver interrupts. The 'pending' bitmap ensures we don't inject an interrupt that hasn't been ack'ed. Once the locking is in place, this shouldn't be a worry. > > + clear_bit(match->pt_dev.host.irq, pt_irq_pending); > > + } > > +} ... > > @@ -1671,6 +1836,30 @@ long kvm_arch_vm_ioctl(struct file *filp, > > r = 0; > > break; > > } > > + case KVM_ASSIGN_PCI_PT_DEV: { > > + struct kvm_pci_passthrough_dev pci_pt_dev; > > + > > + r = -EFAULT; > > + if (copy_from_user(&pci_pt_dev, argp, sizeof pci_pt_dev)) > > + goto out; > > + > > + r = kvm_vm_ioctl_pci_pt_dev(kvm, &pci_pt_dev); > > + if (r) > > + goto out; > > + break; > > + } > > + case KVM_UPDATE_PCI_PT_IRQ: { > > + struct kvm_pci_passthrough_dev pci_pt_dev; > > + > > + r = -EFAULT; > > + if (copy_from_user(&pci_pt_dev, argp, sizeof pci_pt_dev)) > > + goto out; > > + > > + r = kvm_vm_ioctl_pci_pt_irq(kvm, &pci_pt_dev); > > + if (r) > > + goto out; > > + break; > > + } > > Could these not be merged? Userspace will need to store some extra info. Easily fixable though. > > diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h > > index 781fc87..c4eb804 100644 > > --- a/include/asm-x86/kvm_host.h > > +++ b/include/asm-x86/kvm_host.h > > @@ -296,6 +296,18 @@ struct kvm_mem_alias { > > gfn_t target_gfn; > > }; > > > > +/* Some definitions for passthrough'ed devices */ > > "passthrough'ed" and especially "pt" are awkward. How about "assigned"? Fixed. > > +#define KVM_PT_SOURCE_IRQ 1 > > +#define KVM_PT_SOURCE_IRQ_ACK 2 > > + > > +/* This list is to store the guest bus:device:function and host > > + * bus:device:function mapping for passthrough'ed devices. > > + */ > > ? Some left-over bits from the pvdma implementation. Updated. I'll send out the new patch out soon. (The git tree already contains the updates.) Amit. |
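One possible shape for the "two separate functions" suggestion above, sketched against the structures quoted in the patch (the helper names are invented, and locking is still left to the caller):

static struct kvm_pci_pt_dev_list *
find_pt_dev_by_host_irq(struct list_head *head, int irq)
{
	struct kvm_pci_pt_dev_list *match;

	/* host irq namespace: used from the host interrupt handler */
	list_for_each_entry(match, head, list)
		if (match->pt_dev.host.irq == irq)
			return match;
	return NULL;
}

static struct kvm_pci_pt_dev_list *
find_pt_dev_by_guest_irq(struct list_head *head, int irq)
{
	struct kvm_pci_pt_dev_list *match;

	/* guest irq namespace: used when the guest acks an interrupt */
	list_for_each_entry(match, head, list)
		if (match->pt_dev.guest.irq == irq)
			return match;
	return NULL;
}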
From: Marcelo T. <mto...@re...> - 2008-04-22 17:23:44
|
On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote: > Avi Kivity wrote: > >Anthony Liguori wrote: > >> > >>I think we need to decide what we want to target in terms of upper > >>limits. > >> > >>With a bridge or two, we can probably easily do 128. > >> > >>If we really want to push things, I think we should do a PCI based > >>virtio controller. I doubt a large number of PCI devices is ever > >>going to perform very well b/c of interrupt sharing and some of the > >>assumptions in virtio_pci. >> > >>If we implement a controller, we can use a single interrupt, but > >>multiplex multiple notifications on that single interrupt. We can > >>also be more aggressive about using shared memory instead of PCI > >>config space which would reduce the overall number of exits. We should increase the number of interrupt lines, perhaps to 16. Using shared memory to avoid exits sounds very good idea. > >>We could easily support a very large number of devices this way. But > >>again, what do we want to target for now? > > > >I think that for networking we should keep things as is. I don't see > >anybody using 100 virtual NICs. The target was along the lines of 20 nics + 80 disks. Dan? > >For mass storage, we should follow the SCSI model with a single device > >serving multiple disks, similar to what you suggest. Not sure if the > >device should have a single queue or one queue per disk. > > My latest thought it to do a virtio-based virtio controller. Why do you dislike multiple disks per virtio-blk controller? As mentioned this seems a natural way forward. |
From: Eric D. <da...@co...> - 2008-04-22 17:22:36
|
Andrea Arcangeli wrote:
> +
> +static int mm_lock_cmp(const void *a, const void *b)
> +{
> +	cond_resched();
> +	if ((unsigned long)*(spinlock_t **)a <
> +	    (unsigned long)*(spinlock_t **)b)
> +		return -1;
> +	else if (a == b)
> +		return 0;
> +	else
> +		return 1;
> +}
> +

This compare function looks unusual...
It should work, but sort() could be faster if the if (a == b) test had a chance to be true eventually...

static int mm_lock_cmp(const void *a, const void *b)
{
	unsigned long la = (unsigned long)*(spinlock_t **)a;
	unsigned long lb = (unsigned long)*(spinlock_t **)b;

	cond_resched();
	if (la < lb)
		return -1;
	if (la > lb)
		return 1;
	return 0;
} |
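For context, a comparator like this is handed to the kernel's heapsort in lib/sort.c so the collected lock pointers can always be taken in the same global order. A sketch of such a call site follows; the wrapper function is illustrative, not the patch's actual mm_lock() code:

#include <linux/sort.h>

/* Illustrative wrapper: sort the array of vma lock pointers with the
 * comparator above. A NULL swap_func makes sort() use its generic swap. */
static void sort_lock_array(spinlock_t **locks, unsigned long nr)
{
	sort(locks, nr, sizeof(spinlock_t *), mm_lock_cmp, NULL);
}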
From: Soren H. <so...@ub...> - 2008-04-22 16:52:14
|
Hi guys. I'm trying to figure out what's going on with this bug:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217815

The short version of the problem is that it seems that if the console is left alone for an extended period of time, everything seems to stall until something (moving the mouse around, pressing a key, whatever) awakens it again. It usually shows itself when you choose the "Encrypted LVM" option in our installer (this process wipes the drive, which is a rather lengthy process), since that's probably the only place where you'd leave the console alone for a while, while still getting some UI feedback (and suddenly a lack of feedback, obviously).

It started when I backported this to the kvm version in our archive:

commit d2668b3fd41f88c18a7f9c4f1d024f0e5d9f64cf
Author: Marcelo Tosatti <mto...@re...>
Date: Wed Apr 2 20:20:14 2008 -0300
Subject: kvm: qemu: separate thread for IO handling

While trying to solve this problem, I noticed that that commit was just one of a set of three patches. Applying the other two:

commit 1743ef816b6cd22d100ccb80e542b8ca19c75392
Author: Marcelo Tosatti <mto...@re...>
Date: Wed Apr 2 20:20:15 2008 -0300
Subject: kvm: qemu: add function to handle signals

commit d84f71afaafec49e0ab3aa7a33518df04c14f38a
Author: Marcelo Tosatti <mto...@re...>
Date: Wed Apr 2 20:20:16 2008 -0300
Subject: kvm: qemu: notify IO thread of pending bhs

...makes it take a bit longer before it happens, but it's still very much reproducible. Reverting those changes fixes it completely. We've tried with kvm 66, which also exhibits this behaviour, so I'm fairly confident I didn't mess up the patch while backporting it. In case you're interested, the backported patch is here:

http://people.ubuntu.com/~soren/virtio_hang.patch

The latter two commits applied without changes (with a bit of fuzz, though). I'm hoping one of you guys could give me a hint (or perhaps even a patch)?

--
Soren Hansen | Virtualisation specialist | Ubuntu Server Team
Canonical Ltd. | http://www.ubuntu.com/ |
From: Andrea A. <an...@qu...> - 2008-04-22 16:46:15
|
On Tue, Apr 22, 2008 at 05:37:38PM +0200, Eric Dumazet wrote: > I am saying your intent was probably to test > > else if ((unsigned long)*(spinlock_t **)a == > (unsigned long)*(spinlock_t **)b) > return 0; Indeed... > Hum, it's not a micro-optimization, but a bug fix. :) The good thing is that even if this bug would lead to a system crash, it would be still zero risk for everybody that isn't using KVM/GRU actively with mmu notifiers. The important thing is that this patch has zero risk to introduce regressions into the kernel, both when enabled and disabled, it's like a new driver. I'll shortly resend 1/12 and likely 12/12 for theoretical correctness. For now you can go ahead testing with this patch as it'll work fine despite of the bug (if it wasn't the case I would have noticed already ;). |
From: Eric D. <da...@co...> - 2008-04-22 16:40:34
|
Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>
>> Andrea Arcangeli wrote:
>>
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> +	cond_resched();
>>> +	if ((unsigned long)*(spinlock_t **)a <
>>> +	    (unsigned long)*(spinlock_t **)b)
>>> +		return -1;
>>> +	else if (a == b)
>>> +		return 0;
>>> +	else
>>> +		return 1;
>>> +}
>>> +
>>>
>> This compare function looks unusual...
>> It should work, but sort() could be faster if the
>> if (a == b) test had a chance to be true eventually...
>>
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

I am saying your intent was probably to test

	else if ((unsigned long)*(spinlock_t **)a ==
		 (unsigned long)*(spinlock_t **)b)
		return 0;

Because a and b are pointers to the data you want to compare. You need to dereference them.

>> static int mm_lock_cmp(const void *a, const void *b)
>> {
>>	unsigned long la = (unsigned long)*(spinlock_t **)a;
>>	unsigned long lb = (unsigned long)*(spinlock_t **)b;
>>
>>	cond_resched();
>>	if (la < lb)
>>		return -1;
>>	if (la > lb)
>>		return 1;
>>	return 0;
>> }
>>
>
> If your intent is to use the assumption that there are going to be few
> equal entries, you should have used likely(la > lb) to signal it's
> rarely going to return zero or gcc is likely free to do whatever it
> wants with the above. Overall that function is such a slow path that
> this is going to be lost in the noise. My suggestion would be to defer
> microoptimizations like this after 1/12 will be applied to mainline.
>
> Thanks!

Hum, it's not a micro-optimization, but a bug fix. :) Sorry if it was not clear |
From: Anthony L. <an...@co...> - 2008-04-22 16:31:12
|
Avi Kivity wrote:
> Anthony Liguori wrote:
>>
>> I think we need to decide what we want to target in terms of upper
>> limits.
>>
>> With a bridge or two, we can probably easily do 128.
>>
>> If we really want to push things, I think we should do a PCI based
>> virtio controller. I doubt a large number of PCI devices is ever
>> going to perform very well b/c of interrupt sharing and some of the
>> assumptions in virtio_pci.
>>
>> If we implement a controller, we can use a single interrupt, but
>> multiplex multiple notifications on that single interrupt. We can
>> also be more aggressive about using shared memory instead of PCI
>> config space which would reduce the overall number of exits.
>>
>> We could easily support a very large number of devices this way. But
>> again, what do we want to target for now?
>
> I think that for networking we should keep things as is. I don't see
> anybody using 100 virtual NICs.
>
> For mass storage, we should follow the SCSI model with a single device
> serving multiple disks, similar to what you suggest. Not sure if the
> device should have a single queue or one queue per disk.

My latest thought is to do a virtio-based virtio controller. We could avoid creating one in QEMU unless we detect an abnormally large number of disks or something.

Regards,

Anthony Liguori |