From: Ryan H. <ry...@us...> - 2008-05-07 22:03:28

I've been playing around with smp guests on a couple of amd systems and I've also seen some of the smp/locking issues, which led me to dig into some of the tsc code. In svm_vcpu_load(), the tsc_offset calculation will generate a massively large tsc_offset if we're switching cpus and tsc_this is ahead of the host_tsc value (delta would normally be negative, but since it's unsigned, we get a huge positive number).

svm_vcpu_load()
...
        rdtscll(tsc_this);
        delta = vcpu->arch.host_tsc - tsc_this;
        svm->vmcb->control.tsc_offset += delta;

This is handled a little differently on Intel (in vmx.c) where there is a check:

        if (tsc_this < vcpu->arch.host_tsc)
                /* do delta and new offset calc */

This check makes sense to me in that we only need to ensure that we don't go backwards, which means that unless the new cpu is behind the current vcpu's host_tsc, we can skip a new delta calculation. If the check doesn't make sense then we'll need to do the math with s64s. The attached patch fixed the case where an idle guest was live-locked.

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ry...@us...

diffstat output:
 svm.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

Signed-off-by: Ryan Harper <ry...@us...>
---
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5528121..c919ddd 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -685,8 +685,14 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		 * increasing TSC.
 		 */
 		rdtscll(tsc_this);
-		delta = vcpu->arch.host_tsc - tsc_this;
-		svm->vmcb->control.tsc_offset += delta;
+		/* we only need to adjust this if the old tsc was ahead
+		 * also, we'll generate a massively large u64 value if
+		 * tsc_this is less than host_tsc because of unsigned math
+		 */
+		if (tsc_this < vcpu->arch.host_tsc) {
+			delta = vcpu->arch.host_tsc - tsc_this;
+			svm->vmcb->control.tsc_offset += delta;
+		}
 		vcpu->cpu = cpu;
 		kvm_migrate_apic_timer(vcpu);
 	}
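As a minimal illustration of the unsigned-wraparound problem described above (a standalone C sketch, not KVM code; the TSC values are made up):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t host_tsc = 1000;   /* TSC sampled on the old CPU */
    uint64_t tsc_this = 1500;   /* TSC read on the new CPU, i.e. ahead */
    uint64_t offset = 0;

    /* Unconditional subtraction: the "negative" result wraps to ~2^64. */
    uint64_t delta = host_tsc - tsc_this;
    printf("unguarded delta = %llu\n", (unsigned long long)delta);

    /* Guarded form: only adjust when the new CPU is behind, i.e. when
     * the guest would otherwise see its TSC go backwards. */
    if (tsc_this < host_tsc)
        offset += host_tsc - tsc_this;
    printf("guarded offset  = %llu\n", (unsigned long long)offset);
    return 0;
}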
From: Andrea A. <an...@qu...> - 2008-05-07 21:58:35

On Wed, May 07, 2008 at 01:30:39PM -0700, Linus Torvalds wrote:
>
> On Wed, 7 May 2008, Andrew Morton wrote:
> >
> > The patch looks OK to me.
>
> As far as I can tell, authorship has been destroyed by at least two of the
> patches (ie Christoph seems to be the author, but Andrea seems to have
> dropped that fact).

I can't follow this, please be more specific. About the patches I merged from Christoph, I didn't touch them at all (except for fixing a kernel crashing bug in them plus some reject fixes). Initially I didn't even add a signed-off-by: andrea, and I only had the signed-off-by: christoph. But then he said I had to add my signed-off-by too, while I thought at most an acked-by was required. So if I got any attribution on Christoph's work it's only because he explicitly requested it as it was passing through my maintenance line.

In any case, all patches except mmu-notifier-core are irrelevant in this context and I'm entirely fine to give Christoph the attribution for the whole patchset, including the whole mmu-notifier-core where most of the code is mine. We had many discussions with Christoph, Robin and Jack, but I can assure you nobody had a single problem with regard to attribution.

About all patches except mmu-notifier-core: Christoph, Robin and everyone (especially myself) agrees those patches can't yet be merged in 2.6.26.

With regard to the post-2.6.26 material, I think adding a config option to make the change at compile time is ok. And there's no other way to deal with it in a clean way, as vmtruncate has to tear down pagetables, and if the i_mmap_lock is a spinlock there's no way to notify secondary mmus about it, if the ->invalidate_range_start method has to allocate an skb, send it through the network and wait for I/O completion with schedule().

> Yeah, too late and no upside.

No upside for all the people setting CONFIG_KVM=n, true, but no downside for them either, and that's the important fact! And for all the people setting CONFIG_KVM!=n, I should provide some background here. KVM MM development is halted without this; that includes: paging, ballooning, tlb flushing at large, pci-passthrough removing page pinning as a whole, etc. Everyone on kvm-devel talks about mmu-notifiers; check the last VT-d patch from Intel where Anthony (IBM/qemu/kvm) wonders how to handle things without mmu notifiers (mlock or whatever). Rusty agreed we had to get mmu notifiers into 2.6.26 so much that he went as far as writing his own ultrasimple mmu notifier implementation, unfortunately too simple, as invalidate_range_start was missing and without it we can't remove the page pinning and avoid doing spte=invalid;tlbflush;unpin for every group of sptes released. And without mm_lock, invalidate_range_start can't be implemented in a generic way (to work for GRU/XPMEM too).

> That "locking" code is also too ugly to live, at least without some
> serious arguments for why it has to be done that way. Sorting the locks?
> In a vmalloc'ed area? And calling this something innocuous like
> "mm_lock()"? Hell no.

That's only invoked in mmu_notifier_register; mm_lock is explicitly documented as a heavyweight function. In the KVM case it's only called when a VM is created, which is irrelevant cpu cost compared to the time it takes the OS to boot in the VM (especially without real mode emulation with direct NPT-like secondary-mmu paging).

mm_lock solved the fundamental race in the range_start/end invalidation model (that will allow GRU to do a single tlb flush for the whole range that is going to be freed by zap_page_range/unmap_vmas/whatever). Christoph merged mm_lock into his EMM versions of mmu notifiers moments after I released it; I think he wouldn't have done that if there was a better way.

> That code needs some serious re-thinking.

Even if you're totally right: with Nick's mmu notifiers, Rusty's mmu notifiers, my original mmu notifiers, Christoph's first version of my mmu notifiers, my new mmu notifiers, Christoph's EMM version of my new mmu notifiers, my latest mmu notifiers, and all the people making suggestions, testing the code and needing the code badly, and further patches waiting for inclusion during 2.6.27 in this area, it must be obvious to everyone that there's zero chance this code won't evolve over time towards perfection, but we can't wait for it to be perfect before starting to use it or we're screwed. Even if it's entirely broken, this will allow kvm development to continue and then we'll fix it (but don't worry, it works great at runtime and there are no race conditions; Jack and Robin are also using it with zero problems with GRU and XPMEM, just in case the KVM testing going great isn't enough).

Furthermore, the API has been frozen for almost months; everyone agrees with all the fundamental blocks in the mmu-notifier-core patch (to be complete, Christoph would like to replace invalidate_page with an invalidate_range_start/end, but that's a minor detail). And most important, we need something in now, regardless of which API. We can handle a change of API totally fine later. mm_lock() is not even part of the mmu notifier API, it's just an internal implementation detail, so whatever problem it has, or whatever better name we can find, isn't a high priority right now. If you suggest a better name now I'll fix it up immediately.

I hope the mm_lock name and whatever signed-off-by error in the patches after mmu-notifier-core won't really be why this doesn't go in. Thanks a lot for your time to review, even if it wasn't as positive as I hoped,

Andrea
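To make the range_start/end model being argued over a little more concrete, here is a schematic sketch of the bracketing idea (illustrative names and signatures only; this is not the proposed kernel API, just the shape of what a secondary-MMU driver such as GRU would do):

/* Schematic only: a secondary MMU drops its mappings before the primary
 * page tables for [start, end) are torn down, and flushes its TLB once
 * per range instead of once per page. */
struct secondary_mmu {
	void (*drop_range)(unsigned long start, unsigned long end);
	void (*flush_tlb)(void);
};

static void range_start(struct secondary_mmu *smmu,
			unsigned long start, unsigned long end)
{
	/* Remove the secondary mappings up front ... */
	smmu->drop_range(start, end);
	/* ... and flush once for the whole range (the single-flush win). */
	smmu->flush_tlb();
}

static void range_end(struct secondary_mmu *smmu,
		      unsigned long start, unsigned long end)
{
	/* Nothing to do in this simple sketch; a driver that blocks its
	 * secondary faults in range_start would re-enable them here. */
}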
From: Anthony L. <ali...@us...> - 2008-05-07 21:42:07

The current logic of the can_receive handler is to allow packets whenever the receiver is disabled or when there are descriptors available in the ring.

I think the logic ought to be to allow packets whenever the receiver is enabled and there are descriptors available in the ring.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/qemu/hw/e1000.c b/qemu/hw/e1000.c
index 0728539..01f8983 100644
--- a/qemu/hw/e1000.c
+++ b/qemu/hw/e1000.c
@@ -520,8 +520,8 @@ e1000_can_receive(void *opaque)
 {
     E1000State *s = opaque;
 
-    return (!(s->mac_reg[RCTL] & E1000_RCTL_EN) ||
-            s->mac_reg[RDH] != s->mac_reg[RDT]);
+    return ((s->mac_reg[RCTL] & E1000_RCTL_EN) &&
+            s->mac_reg[RDH] != s->mac_reg[RDT]);
 }
 
 static void
From: Linus T. <tor...@li...> - 2008-05-07 21:37:33

On Wed, 7 May 2008, Andrea Arcangeli wrote:
>
> I think the spinlock->rwsem conversion is ok under config option, as
> you can see I complained myself to various of those patches and I'll
> take care they're in a mergeable state the moment I submit them. What
> XPMEM requires are different semantics for the methods, and we never
> had to do any blocking I/O during vmtruncate before, now we have to.

I really suspect we don't really have to, and that it would be better to just fix the code that does that.

> Please ignore all patches but mmu-notifier-core. I regularly forward
> _only_ mmu-notifier-core to Andrew, that's the only one that is in
> merge-ready status, everything else is just so XPMEM can test and we
> can keep discussing it to bring it in a mergeable state like
> mmu-notifier-core already is.

The thing is, I didn't like that one *either*. I thought it was the biggest turd in the series (and by "biggest", I literally mean "most lines of turd-ness" rather than necessarily "ugliest per se").

I literally think that mm_lock() is an unbelievable piece of utter and horrible CRAP. There's simply no excuse for code like that.

If you want to avoid the deadlock from taking multiple locks in order, but there is really just a single operation that needs it, there's a really really simple solution. And that solution is *not* to sort the whole damn f*cking list in a vmalloc'ed data structure prior to locking! Damn.

No, the simple solution is to just make up a whole new upper-level lock, and get that lock *first*. You can then take all the multiple locks at a lower level in any order you damn well please. And yes, it's one more lock, and yes, it serializes stuff, but:

 - that code had better not be critical anyway, because if it was, then the whole "vmalloc+sort+lock+vunmap" sh*t was wrong _anyway_

 - parallelism is overrated: it doesn't matter one effing _whit_ if something is a hundred times more parallel, if it's also a hundred times *SLOWER*.

So dang it, flush the whole damn series down the toilet and either forget the thing entirely, or re-do it sanely.

And here's an admission that I lied: it wasn't *all* clearly crap. I did like one part, namely list_del_init_rcu(), but that one should have been in a separate patch. I'll happily apply that one.

		Linus
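A minimal userspace pthread sketch of the scheme described above (not kernel code; names are illustrative): one coarse outer lock taken first means no two threads can be inside the multi-lock section at the same time, so the order of the inner locks no longer matters.

#include <pthread.h>

#define NOBJ 16

static pthread_mutex_t outer = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t inner[NOBJ];

static void locks_init(void)
{
	int i;

	for (i = 0; i < NOBJ; i++)
		pthread_mutex_init(&inner[i], NULL);
}

static void lock_all(void)
{
	int i;

	pthread_mutex_lock(&outer);	/* serializes the rare operation */
	for (i = 0; i < NOBJ; i++)	/* any order is now deadlock-free */
		pthread_mutex_lock(&inner[i]);
}

static void unlock_all(void)
{
	int i;

	for (i = NOBJ - 1; i >= 0; i--)
		pthread_mutex_unlock(&inner[i]);
	pthread_mutex_unlock(&outer);
}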
From: Andrea A. <an...@qu...> - 2008-05-07 21:26:49

On Wed, May 07, 2008 at 01:56:23PM -0700, Linus Torvalds wrote:
> This also looks very debatable indeed. The only performance numbers quoted
> are:
>
> > This results in f.e. the Aim9 brk performance test to got down by 10-15%.
>
> which just seems like a total disaster.
>
> The whole series looks bad, in fact. Lack of authorship, bad single-line

Glad you agree.

Note that the fact the whole series looks bad is _exactly_ why I couldn't let Christoph keep going with mmu-notifier-core at the very end of his patchset. I had to move it at the top to have a chance to get the KVM and GRU requirements merged in 2.6.26.

I think the spinlock->rwsem conversion is ok under config option, as you can see I complained myself to various of those patches and I'll take care they're in a mergeable state the moment I submit them. What XPMEM requires are different semantics for the methods, and we never had to do any blocking I/O during vmtruncate before, now we have to. And I don't see a problem in making the conversion from spinlock->rwsem only if CONFIG_XPMEM=y as I doubt XPMEM works on anything but ia64.

Please ignore all patches but mmu-notifier-core. I regularly forward _only_ mmu-notifier-core to Andrew, that's the only one that is in merge-ready status, everything else is just so XPMEM can test and we can keep discussing it to bring it in a mergeable state like mmu-notifier-core already is.
From: Linus T. <tor...@li...> - 2008-05-07 20:56:41

On Wed, 7 May 2008, Andrea Arcangeli wrote:
>
> Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
> traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
> allows the calling of sleeping functions from reverse map traversal as
> needed for the notifier callbacks. It includes possible concurrency.

This also looks very debatable indeed. The only performance numbers quoted are:

> This results in f.e. the Aim9 brk performance test to got down by 10-15%.

which just seems like a total disaster.

The whole series looks bad, in fact. Lack of authorship, bad single-line description, and the code itself sucks so badly that it's not even funny.

NAK NAK NAK. All of it. It stinks.

		Linus
From: Linus T. <tor...@li...> - 2008-05-07 20:31:00

On Wed, 7 May 2008, Andrew Morton wrote:
>
> The patch looks OK to me.

As far as I can tell, authorship has been destroyed by at least two of the patches (ie Christoph seems to be the author, but Andrea seems to have dropped that fact).

> The proposal is that we sneak this into 2.6.26. Are there any
> sufficiently-serious objections to this?

Yeah, too late and no upside.

That "locking" code is also too ugly to live, at least without some serious arguments for why it has to be done that way. Sorting the locks? In a vmalloc'ed area? And calling this something innocuous like "mm_lock()"? Hell no.

That code needs some serious re-thinking.

		Linus
From: Andrew M. <ak...@li...> - 2008-05-07 20:08:08

On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli <an...@qu...> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <an...@qu...>
> # Date 1210096013 -7200
> # Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c
> # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
> mmu-notifier-core

The patch looks OK to me.

The proposal is that we sneak this into 2.6.26. Are there any sufficiently-serious objections to this?

The patch will be a no-op for 2.6.26.

This is all rather unusual. For the record, could we please review the reasons for wanting to do this?

Thanks.
From: Andrew M. <ak...@li...> - 2008-05-07 20:03:02

On Wed, 07 May 2008 16:35:51 +0200 Andrea Arcangeli <an...@qu...> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <an...@qu...>
> # Date 1210096013 -7200
> # Node ID e20917dcc8284b6a07cfcced13dda4cbca850a9c
> # Parent 5026689a3bc323a26d33ad882c34c4c9c9a3ecd8
> mmu-notifier-core
>
> ...
>
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis
>   * or hlist_del_rcu(), running on this same list.
>   * However, it is perfectly legal to run concurrently with
>   * the _rcu list-traversal primitives, such as
> - * hlist_for_each_entry().
> + * hlist_for_each_entry_rcu().
>   */
>  static inline void hlist_del_rcu(struct hlist_node *n)
>  {
> @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct
>  	if (!hlist_unhashed(n)) {
>  		__hlist_del(n);
>  		INIT_HLIST_NODE(n);
> +	}
> +}
> +
> +/**
> + * hlist_del_init_rcu - deletes entry from hash list with re-initialization
> + * @n: the element to delete from the hash list.
> + *
> + * Note: list_unhashed() on entry does return true after this. It is

Should that be "does" or "does not"? "does", I suppose.

It should refer to hlist_unhashed().

The term "on entry" is a bit ambiguous - we normally use that as shorthand to mean "on entry to the function".

So I'll change this to

> + * Note: hlist_unhashed() on the node returns true after this. It is

OK?

<oh, that was copied-and-pasted from similarly errant comments in that file>

> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -10,6 +10,7 @@
>  #include <linux/rbtree.h>
>  #include <linux/rwsem.h>
>  #include <linux/completion.h>
> +#include <linux/cpumask.h>

OK, unrelated bugfix ;)

> --- a/include/linux/srcu.h
> +++ b/include/linux/srcu.h
> @@ -27,6 +27,8 @@
>  #ifndef _LINUX_SRCU_H
>  #define _LINUX_SRCU_H
>
> +#include <linux/mutex.h>

And another. Fair enough.
From: Anthony L. <an...@co...> - 2008-05-07 19:22:24

Avi Kivity wrote:
> Anthony Liguori wrote:
>
>>> What should be done for unmodified guests where there is no PV driver in
>>> the guest? Would a call to mlock() from
>>> qemu/hw/pci-passthrough.c/add_pci_passthrough_device() be a reasonable
>>> thing to do?
>>
>> Yup. The idea is to ensure that the memory is always present,
>> without necessarily taking a reference to it. This allows for memory
>> reclaiming which should allow for things like NUMA page migration.
>> We can't swap of course but that doesn't mean reclamation isn't useful.
>
> I don't think we can do page migration with VT-d. You need to be able
> to detect whether the page has been changed by dma after you've copied
> it but before you changed the pte, but VT-d doesn't allow that AFAICT.

Hrm, I would have to look at the VT-d but I suspect you're right. That's unfortunate.

That means mlock() isn't sufficient. It also means that the VMAs can't be updated while the guest is running. Is there any way to lock a vma region such that things like madvise/mmap(MAP_FIXED) will always fail?

Regards,

Anthony Liguori
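For reference, this is roughly what the mlock() approach discussed above would look like; a minimal hedged sketch, not the qemu/hw/pci-passthrough.c code, with the function and variable names made up for illustration:

#include <sys/mman.h>
#include <stdio.h>

/* Pin the guest's RAM so the pages stay resident for device DMA.
 * mlock() keeps pages present but, as noted above, does not prevent
 * the mapping itself from being changed or migrated. */
static int pin_guest_ram(void *guest_ram_base, size_t guest_ram_size)
{
	if (mlock(guest_ram_base, guest_ram_size) < 0) {
		perror("mlock");
		return -1;
	}
	return 0;
}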
From: Gerd H. <kr...@re...> - 2008-05-07 18:45:24

Marcelo Tosatti wrote:
> On Thu, Apr 24, 2008 at 10:37:04AM +0200, Gerd Hoffmann wrote:
>> Hi folks,
>>
>> My first attempt to send out a patch series with git ...
>>
>> The patches fix the kvm paravirt clocksource code to be compatible with
>> xen and they also factor out some code which can be shared into a
>> separate source files used by both kvm and xen.
>
> The issue with SMP guests is still present. Booting with "nohz=off" resolves it.
>
> Same symptoms as before, apic_timer_fn for one of the vcpu's is ticking way slower
> than the remaining ones:
>
> [root@localhost ~]# cat /proc/timer_stats | grep apic
>    391,  4125 qemu-system-x86  apic_mmio_write (apic_timer_fn)
>   2103,  4126 qemu-system-x86  apic_mmio_write (apic_timer_fn)
>   1896,  4127 qemu-system-x86  apic_mmio_write (apic_timer_fn)
>   1857,  4128 qemu-system-x86  apic_mmio_write (apic_timer_fn)
>
> Let me know what else is needed, or any patches to try.

Ok folks, here is the band aid fix for testing from the odd bugs department. Goes on top of the four patches of this series. A real, clean solution is TBD. Tomorrow I hope (some urgent private problems are in the queue too ...).

Problem is the per-cpu area for cpu 0 has two locations in memory, one before and one after pda initialization. kvmclock registers the first due to being initialized quite early, and the paravirt clock for cpu 0 stops seeing updates once the pda setup is done. Which makes the TSC effectively the base for timekeeping (instead of using the TSC for millisecond delta adjustments only). Secondary CPUs work as intended. This obviously screws up timekeeping on SMP guests, especially on hosts with unstable TSC.

happy testing,
  Gerd

--
[root@localhost ~]# dmesg | grep _clock
kvm_register_clock: cpu 0 at 0:798601 (boot)
kvm_clock_read: cpu 0 at 0:140b601 (pda)
kvm_register_clock: cpu 1 at 0:1415601
From: Rik v. R. <ri...@re...> - 2008-05-07 18:12:50

On Wed, 07 May 2008 16:35:54 +0200
Andrea Arcangeli <an...@qu...> wrote:

> Signed-off-by: Christoph Lameter <cla...@sg...>
> Signed-off-by: Andrea Arcangeli <an...@qu...>

Acked-by: Rik van Riel <ri...@re...>

--
All rights reversed.
From: Anthony L. <ali...@us...> - 2008-05-07 18:10:39
|
In the final patch of this series, we rely on a VLAN client's fd_can_read method to avoid dropping packets. Unfortunately, virtio's fd_can_read method is not very accurate at the moment. This patch addresses this. It also generates a notification to the IO thread when more RX packets become available. If we say we can't receive a packet because no RX buffers are available, this may result in the tap file descriptor not being select()'d. Without notifying the IO thread, we may have to wait until the select() times out before we can receive a packet (even if there is one pending). This particular change makes RX performance very consistent. Signed-off-by: Anthony Liguori <ali...@us...> diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index 8d26832..5538979 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -14,6 +14,7 @@ #include "virtio.h" #include "net.h" #include "qemu-timer.h" +#include "qemu-kvm.h" /* from Linux's virtio_net.h */ @@ -60,11 +61,14 @@ typedef struct VirtIONet VirtQueue *rx_vq; VirtQueue *tx_vq; VLANClientState *vc; - int can_receive; QEMUTimer *tx_timer; int tx_timer_active; } VirtIONet; +/* TODO + * - we could suppress RX interrupt if we were so inclined. + */ + static VirtIONet *to_virtio_net(VirtIODevice *vdev) { return (VirtIONet *)vdev; @@ -88,15 +92,24 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev) static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq) { - VirtIONet *n = to_virtio_net(vdev); - n->can_receive = 1; + /* We now have RX buffers, signal to the IO thread to break out of the + select to re-poll the tap file descriptor */ + if (kvm_enabled()) + qemu_kvm_notify_work(); } static int virtio_net_can_receive(void *opaque) { VirtIONet *n = opaque; - return (n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK) && n->can_receive; + if (n->rx_vq->vring.avail == NULL || + !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)) + return 0; + + if (n->rx_vq->vring.avail->idx == n->rx_vq->last_avail_idx) + return 0; + + return 1; } static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) @@ -106,15 +119,8 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) struct virtio_net_hdr *hdr; int offset, i; - /* FIXME: the drivers really need to set their status better */ - if (n->rx_vq->vring.avail == NULL) { - n->can_receive = 0; - return; - } - if (virtqueue_pop(n->rx_vq, &elem) == 0) { - /* wait until the guest adds some rx bufs */ - n->can_receive = 0; + fprintf(stderr, "virtio_net: this should not happen\n"); return; } @@ -209,9 +215,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) n->vdev.update_config = virtio_net_update_config; n->vdev.get_features = virtio_net_get_features; - n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx); + n->rx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_rx); n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx); - n->can_receive = 0; memcpy(n->mac, nd->macaddr, 6); n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive, virtio_net_can_receive, n); |
From: Anthony L. <ali...@us...> - 2008-05-07 18:10:21
|
Normally, tap always reads packets and simply lets the client drop them if it cannot receive them. For virtio-net, this results in massive packet loss and about an 80% performance loss in TCP throughput. This patch modifies qemu_send_packet() to only deliver a packet to a VLAN client if it doesn't have a fd_can_read method or the fd_can_read method indicates that it can receive packets. We also return a status of whether any clients were able to receive the packet. If no clients were able to receive a packet, we buffer the packet until a client indicates that it can receive packets again. This patch also modifies the tap code to only read from the tap fd if at least one client on the VLAN is able to receive a packet. Finally, this patch changes the tap code to drain all possible packets from the tap device when the tap fd is readable. Signed-off-by: Anthony Liguori <ali...@us...> diff --git a/qemu/net.h b/qemu/net.h index 13daa27..dfdf9af 100644 --- a/qemu/net.h +++ b/qemu/net.h @@ -29,7 +29,7 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan, IOCanRWHandler *fd_can_read, void *opaque); int qemu_can_send_packet(VLANClientState *vc); -void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size); +int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size); void qemu_handler_true(void *opaque); void do_info_network(void); diff --git a/qemu/vl.c b/qemu/vl.c index 03051da..28f8d1d 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -3754,10 +3754,11 @@ int qemu_can_send_packet(VLANClientState *vc1) return 0; } -void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) +int qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) { VLANState *vlan = vc1->vlan; VLANClientState *vc; + int ret = -EAGAIN; #if 0 printf("vlan %d send:\n", vlan->id); @@ -3765,9 +3766,14 @@ void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size) #endif for(vc = vlan->first_client; vc != NULL; vc = vc->next) { if (vc != vc1) { - vc->fd_read(vc->opaque, buf, size); + if (!vc->fd_can_read || vc->fd_can_read(vc->opaque)) { + vc->fd_read(vc->opaque, buf, size); + ret = 0; + } } } + + return ret; } #if defined(CONFIG_SLIRP) @@ -3970,6 +3976,8 @@ typedef struct TAPState { VLANClientState *vc; int fd; char down_script[1024]; + char buf[4096]; + int size; } TAPState; static void tap_receive(void *opaque, const uint8_t *buf, int size) @@ -3985,24 +3993,70 @@ static void tap_receive(void *opaque, const uint8_t *buf, int size) } } +static int tap_can_send(void *opaque) +{ + TAPState *s = opaque; + VLANClientState *vc; + int can_receive = 0; + + /* Check to see if any of our clients can receive a packet */ + for (vc = s->vc->vlan->first_client; vc; vc = vc->next) { + /* Skip ourselves */ + if (vc == s->vc) + continue; + + if (!vc->fd_can_read) { + /* no fd_can_read handler, they always can receive */ + can_receive = 1; + } else + can_receive = vc->fd_can_read(vc->opaque); + + /* Once someone can receive, we try to send a packet */ + if (can_receive) + break; + } + + return can_receive; +} + static void tap_send(void *opaque) { TAPState *s = opaque; - uint8_t buf[4096]; - int size; + /* First try to send any buffered packet */ + if (s->size > 0) { + int err; + + /* If noone can receive the packet, buffer it */ + err = qemu_send_packet(s->vc, s->buf, s->size); + if (err == -EAGAIN) + return; + } + + /* Read packets until we hit EAGAIN */ + do { #ifdef __sun__ - struct strbuf sbuf; - int f = 0; - sbuf.maxlen = sizeof(buf); - sbuf.buf = buf; - size = getmsg(s->fd, NULL, 
&sbuf, &f) >=0 ? sbuf.len : -1; + struct strbuf sbuf; + int f = 0; + sbuf.maxlen = sizeof(s->buf); + sbuf.buf = s->buf; + s->size = getmsg(s->fd, NULL, &sbuf, &f) >=0 ? sbuf.len : -1; #else - size = read(s->fd, buf, sizeof(buf)); + s->size = read(s->fd, s->buf, sizeof(s->buf)); #endif - if (size > 0) { - qemu_send_packet(s->vc, buf, size); - } + + if (s->size == -1 && errno == EINTR) + continue; + + if (s->size > 0) { + int err; + + /* If noone can receive the packet, buffer it */ + err = qemu_send_packet(s->vc, s->buf, s->size); + if (err == -EAGAIN) + break; + } + } while (s->size > 0); } /* fd support */ @@ -4016,7 +4070,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd) return NULL; s->fd = fd; s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s); - qemu_set_fd_handler2(s->fd, NULL, tap_send, NULL, s); + qemu_set_fd_handler2(s->fd, tap_can_send, tap_send, NULL, s); snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd); return s; } |
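The flow-control idea in the patch above, delivering only when the consumer says it can receive and otherwise buffering a single pending packet, can be sketched generically as follows (illustrative C, not the QEMU code; names are made up):

#include <string.h>

struct consumer {
	int  (*can_read)(void *opaque);
	void (*read)(void *opaque, const unsigned char *buf, int len);
	void *opaque;
};

static unsigned char pending[4096];	/* assumes len <= sizeof(pending) */
static int pending_len;			/* 0 means nothing buffered */

static int deliver(struct consumer *c, const unsigned char *buf, int len)
{
	if (c->can_read && !c->can_read(c->opaque)) {
		/* Consumer has no room: buffer and report EAGAIN-style. */
		memcpy(pending, buf, len);
		pending_len = len;
		return -1;
	}
	c->read(c->opaque, buf, len);
	return 0;
}

/* Called when the consumer signals it has room again. */
static void kick(struct consumer *c)
{
	if (pending_len && deliver(c, pending, pending_len) == 0)
		pending_len = 0;
}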
From: Anthony L. <ali...@us...> - 2008-05-07 18:10:21

We should check that the first element is the size we expect instead of just casting blindly.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/qemu/hw/virtio-blk.c b/qemu/hw/virtio-blk.c
index 3af36db..048285a 100644
--- a/qemu/hw/virtio-blk.c
+++ b/qemu/hw/virtio-blk.c
@@ -56,8 +56,6 @@ struct virtio_blk_outhdr
     uint32_t ioprio;
     /* Sector (ie. 512 byte offset) */
     uint64_t sector;
-    /* Where to put reply. */
-    uint64_t id;
 };
 
 #define VIRTIO_BLK_S_OK		0
@@ -94,6 +92,17 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 	off_t off;
 	int i;
 
+	if (elem.out_num < 1 || elem.in_num < 1) {
+	    fprintf(stderr, "virtio-blk missing headers\n");
+	    exit(1);
+	}
+
+	if (elem.out_sg[0].iov_len != sizeof(*out) ||
+	    elem.in_sg[elem.in_num - 1].iov_len != sizeof(*in)) {
+	    fprintf(stderr, "virtio-blk header not in correct element\n");
+	    exit(1);
+	}
+
 	out = (void *)elem.out_sg[0].iov_base;
 	in = (void *)elem.in_sg[elem.in_num - 1].iov_base;
 	off = out->sector;
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index f727b14..5ac5089 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -125,6 +125,11 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
         return;
     }
 
+    if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
+	fprintf(stderr, "virtio-net header not in first element\n");
+	exit(1);
+    }
+
     hdr = (void *)elem.in_sg[0].iov_base;
     hdr->flags = 0;
     hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
@@ -197,6 +202,11 @@ void virtio_net_poll(void)
             continue;
         }
 
+	if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
+	    fprintf(stderr, "virtio-net header not in first element\n");
+	    exit(1);
+	}
+
         hdr = (void *)elem.in_sg[0].iov_base;
         hdr->flags = 0;
         hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
From: Anthony L. <ali...@us...> - 2008-05-07 18:10:06
|
While it has served us well, it is long overdue that we eliminate the virtio-net tap hack. It turns out that zero-copy has very little impact on performance. The tap hack was gaining such a significant performance boost not because of zero-copy, but because it avoided dropping packets on receive which is apparently a significant problem with the tap implementation in QEMU. Signed-off-by: Anthony Liguori <ali...@us...> diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h index 73e3918..c284bf1 100644 --- a/qemu/hw/pc.h +++ b/qemu/hw/pc.h @@ -155,7 +155,6 @@ void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd); /* virtio-net.c */ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn); -void virtio_net_poll(void); /* virtio-blk.h */ void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device, diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c index e2db49c..a26d04e 100644 --- a/qemu/hw/virtio-net.c +++ b/qemu/hw/virtio-net.c @@ -13,7 +13,6 @@ #include "virtio.h" #include "net.h" -#include "pc.h" #include "qemu-timer.h" /* from Linux's virtio_net.h */ @@ -62,15 +61,10 @@ typedef struct VirtIONet VirtQueue *tx_vq; VLANClientState *vc; int can_receive; - int tap_fd; - struct VirtIONet *next; - int do_notify; QEMUTimer *tx_timer; int tx_timer_active; } VirtIONet; -static VirtIONet *VirtIONetHead = NULL; - static VirtIONet *to_virtio_net(VirtIODevice *vdev) { return (VirtIONet *)vdev; @@ -105,7 +99,6 @@ static int virtio_net_can_receive(void *opaque) return (n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK) && n->can_receive; } -/* -net user receive function */ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) { VirtIONet *n = opaque; @@ -149,92 +142,6 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size) virtio_notify(&n->vdev, n->rx_vq); } -/* -net tap receive handler */ -void virtio_net_poll(void) -{ - VirtIONet *vnet; - int len; - fd_set rfds; - struct timeval tv; - int max_fd = -1; - VirtQueueElement elem; - struct virtio_net_hdr *hdr; - int did_notify; - - FD_ZERO(&rfds); - tv.tv_sec = 0; - tv.tv_usec = 0; - - while (1) { - - // Prepare the set of device to select from - for (vnet = VirtIONetHead; vnet; vnet = vnet->next) { - - if (vnet->tap_fd == -1) - continue; - - vnet->do_notify = 0; - //first check if the driver is ok - if (!virtio_net_can_receive(vnet)) - continue; - - /* FIXME: the drivers really need to set their status better */ - if (vnet->rx_vq->vring.avail == NULL) { - vnet->can_receive = 0; - continue; - } - - FD_SET(vnet->tap_fd, &rfds); - if (max_fd < vnet->tap_fd) max_fd = vnet->tap_fd; - } - - if (select(max_fd + 1, &rfds, NULL, NULL, &tv) <= 0) - break; - - // Now check who has data pending in the tap - for (vnet = VirtIONetHead; vnet; vnet = vnet->next) { - - if (!FD_ISSET(vnet->tap_fd, &rfds)) - continue; - - if (virtqueue_pop(vnet->rx_vq, &elem) == 0) { - vnet->can_receive = 0; - continue; - } - - if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) { - fprintf(stderr, "virtio-net header not in first element\n"); - exit(1); - } - - hdr = (void *)elem.in_sg[0].iov_base; - hdr->flags = 0; - hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE; -again: - len = readv(vnet->tap_fd, &elem.in_sg[1], elem.in_num - 1); - if (len == -1) { - if (errno == EINTR || errno == EAGAIN) - goto again; - else - fprintf(stderr, "reading network error %d", len); - } - virtqueue_push(vnet->rx_vq, &elem, sizeof(*hdr) + len); - vnet->do_notify = 1; - } - - /* signal other side */ - did_notify = 0; - for (vnet = VirtIONetHead; vnet; vnet = 
vnet->next) - if (vnet->do_notify) { - virtio_notify(&vnet->vdev, vnet->rx_vq); - did_notify++; - } - if (!did_notify) - break; - } - -} - /* TX */ static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq) { @@ -313,12 +220,6 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn) memcpy(n->mac, nd->macaddr, 6); n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive, virtio_net_can_receive, n); - n->tap_fd = hack_around_tap(n->vc->vlan->first_client); - if (n->tap_fd != -1) { - n->next = VirtIONetHead; - //push the device on top of the list - VirtIONetHead = n; - } n->tx_timer = qemu_new_timer(rt_clock, virtio_net_tx_timer, n); n->tx_timer_active = 0; diff --git a/qemu/vl.c b/qemu/vl.c index a1aa270..03051da 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -3970,15 +3970,8 @@ typedef struct TAPState { VLANClientState *vc; int fd; char down_script[1024]; - int no_poll; } TAPState; -static int tap_read_poll(void *opaque) -{ - TAPState *s = opaque; - return (!s->no_poll); -} - static void tap_receive(void *opaque, const uint8_t *buf, int size) { TAPState *s = opaque; @@ -4012,22 +4005,6 @@ static void tap_send(void *opaque) } } -int hack_around_tap(void *opaque) -{ - VLANClientState *vc = opaque; - TAPState *ts = vc->opaque; - - if (vc->fd_read != tap_receive) - return -1; - - if (ts) { - ts->no_poll = 1; - return ts->fd; - } - - return -1; -} - /* fd support */ static TAPState *net_tap_fd_init(VLANState *vlan, int fd) @@ -4038,10 +4015,8 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd) if (!s) return NULL; s->fd = fd; - s->no_poll = 0; - enable_sigio_timer(fd); s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s); - qemu_set_fd_handler2(s->fd, tap_read_poll, tap_send, NULL, s); + qemu_set_fd_handler2(s->fd, NULL, tap_send, NULL, s); snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd); return s; } @@ -7417,10 +7392,7 @@ void main_loop_wait(int timeout) slirp_select_poll(&rfds, &wfds, &xfds); } #endif - virtio_net_poll(); - qemu_aio_poll(); - if (vm_running) { qemu_run_timers(&active_timers[QEMU_TIMER_VIRTUAL], qemu_get_clock(vm_clock)); |
From: Anthony L. <ali...@us...> - 2008-05-07 18:10:04
|
We're pretty sloppy in virtio right now about phys_ram_base assumptions. This patch is an incremental step between what we have today and a full blown DMA API. I backported the DMA API but the performance impact was not acceptable to me. There's only a slight performance impact with this particular patch. Since we're no longer assuming guest physical memory is contiguous, we need a more complex way to validate the memory regions than just checking if it's within ram_size. Signed-off-by: Anthony Liguori <ali...@us...> diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c index 6a50001..a4c9d10 100644 --- a/qemu/hw/virtio.c +++ b/qemu/hw/virtio.c @@ -58,6 +58,48 @@ /* virt queue functions */ +static void *virtio_map_gpa(target_phys_addr_t addr, size_t size) +{ + ram_addr_t off; + target_phys_addr_t addr1; + + off = cpu_get_physical_page_desc(addr); + if ((off & ~TARGET_PAGE_MASK) != IO_MEM_RAM) { + fprintf(stderr, "virtio DMA to IO ram\n"); + exit(1); + } + + off = (off & TARGET_PAGE_MASK) | (addr & ~TARGET_PAGE_MASK); + + for (addr1 = addr + TARGET_PAGE_SIZE; + addr1 < TARGET_PAGE_ALIGN(addr + size); + addr1 += TARGET_PAGE_SIZE) { + ram_addr_t off1; + + off1 = cpu_get_physical_page_desc(addr1); + if ((off1 & ~TARGET_PAGE_MASK) != IO_MEM_RAM) { + fprintf(stderr, "virtio DMA to IO ram\n"); + exit(1); + } + + off1 = (off1 & TARGET_PAGE_MASK) | (addr1 & ~TARGET_PAGE_MASK); + + if (off1 != (off + (addr1 - addr))) { + fprintf(stderr, "discontigous virtio memory\n"); + exit(1); + } + } + + return phys_ram_base + off; +} + +static size_t virtqueue_size(int num) +{ + return TARGET_PAGE_ALIGN((sizeof(VRingDesc) * num) + + (sizeof(VRingAvail) + sizeof(uint16_t) * num)) + + (sizeof(VRingUsed) + sizeof(VRingUsedElem) * num); +} + static void virtqueue_init(VirtQueue *vq, void *p) { vq->vring.desc = p; @@ -127,9 +169,6 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem) do { struct iovec *sg; - if ((vq->vring.desc[i].addr + vq->vring.desc[i].len) > ram_size) - errx(1, "Guest sent invalid pointer"); - if (vq->vring.desc[i].flags & VRING_DESC_F_WRITE) sg = &elem->in_sg[elem->in_num++]; else @@ -137,7 +176,9 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem) /* Grab the first descriptor, and check it's OK. */ sg->iov_len = vq->vring.desc[i].len; - sg->iov_base = phys_ram_base + vq->vring.desc[i].addr; + sg->iov_base = virtio_map_gpa(vq->vring.desc[i].addr, sg->iov_len); + if (sg->iov_base == NULL) + errx(1, "Invalid mapping\n"); /* If we've got too many, that implies a descriptor loop. */ if ((elem->in_num + elem->out_num) > vq->vring.num) @@ -198,9 +239,10 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val) vdev->vq[vdev->queue_sel].pfn = val; if (pa == 0) { virtio_reset(vdev); - } else if (pa < (ram_size - TARGET_PAGE_SIZE)) { - virtqueue_init(&vdev->vq[vdev->queue_sel], phys_ram_base + pa); - /* FIXME if pa == 0, deal with device tear down */ + } else { + size_t size = virtqueue_size(vdev->vq[vdev->queue_sel].vring.num); + virtqueue_init(&vdev->vq[vdev->queue_sel], + virtio_map_gpa(pa, size)); } break; case VIRTIO_PCI_QUEUE_SEL: |
From: Andrea A. <an...@qu...> - 2008-05-07 17:57:15

On Wed, May 07, 2008 at 01:39:43PM -0400, Rik van Riel wrote:
> Would it be an idea to merge them into one, so the first patch
> introduces the right conventions directly?

The only reason this isn't merged into one is that it requires non-obvious (though not difficult) changes to the core VM code. I wanted to keep an obviously safe approach for 2.6.26. The other conventions are only needed by XPMEM, and XPMEM can't work without all the other patches anyway.
From: Rik v. R. <ri...@re...> - 2008-05-07 17:52:14

On Wed, 07 May 2008 16:35:55 +0200
Andrea Arcangeli <an...@qu...> wrote:

> Signed-off-by: Christoph Lameter <cla...@sg...>
> Signed-off-by: Andrea Arcangeli <an...@qu...>

Acked-by: Rik van Riel <ri...@re...>

--
All rights reversed.
From: Rik v. R. <ri...@re...> - 2008-05-07 17:40:44

On Wed, 07 May 2008 16:35:53 +0200
Andrea Arcangeli <an...@qu...> wrote:

> # HG changeset patch
> # User Andrea Arcangeli <an...@qu...>
> # Date 1210115129 -7200
> # Node ID d60d200565abde6a8ed45271e53cde9c5c75b426
> # Parent c5badbefeee07518d9d1acca13e94c981420317c
> invalidate_page outside PT lock
>
> Moves all mmu notifier methods outside the PT lock (first and not last
> step to make them sleep capable).

This patch appears to undo some of the changes made by patch 01/11.

Would it be an idea to merge them into one, so the first patch introduces the right conventions directly?

--
All rights reversed.
From: Rik v. R. <ri...@re...> - 2008-05-07 17:36:50

On Wed, 07 May 2008 16:35:51 +0200
Andrea Arcangeli <an...@qu...> wrote:

> Signed-off-by: Andrea Arcangeli <an...@qu...>
> Signed-off-by: Nick Piggin <np...@su...>
> Signed-off-by: Christoph Lameter <cla...@sg...>

Acked-by: Rik van Riel <ri...@re...>

--
All rights reversed.
From: Anthony L. <ali...@us...> - 2008-05-07 17:16:04
|
This patch adds compatibility code so that we can use signalfd() within QEMU. signalfd() provides a mechanism to receive signal notification through a file descriptor. This is very useful in eliminating the signal/select race condition. If signalfd() isn't available, we spawn a thread that uses sigwaitinfo() to emulate it. Signed-off-by: Anthony Liguori <ali...@us...> diff --git a/qemu/kvm-compatfd.c b/qemu/kvm-compatfd.c index 1b030ba..b1311e2 100644 --- a/qemu/kvm-compatfd.c +++ b/qemu/kvm-compatfd.c @@ -15,6 +15,97 @@ #include "qemu-kvm.h" #include <sys/syscall.h> +#include <pthread.h> + +struct sigfd_compat_info +{ + sigset_t mask; + int fd; +}; + +static void *sigwait_compat(void *opaque) +{ + struct sigfd_compat_info *info = opaque; + int err; + + sigprocmask(SIG_BLOCK, &info->mask, NULL); + + do { + siginfo_t siginfo; + + err = sigwaitinfo(&info->mask, &siginfo); + if (err == -1 && errno == EINTR) + continue; + + if (err > 0) { + char buffer[128]; + size_t offset = 0; + + memcpy(buffer, &err, sizeof(err)); + while (offset < sizeof(buffer)) { + ssize_t len; + + len = write(info->fd, buffer + offset, + sizeof(buffer) - offset); + if (len == -1 && errno == EINTR) + continue; + + if (len <= 0) { + err = -1; + break; + } + + offset += len; + } + } + } while (err >= 0); + + return NULL; +} + +static int kvm_signalfd_compat(const sigset_t *mask) +{ + pthread_attr_t attr; + pthread_t tid; + struct sigfd_compat_info *info; + int fds[2]; + + info = malloc(sizeof(*info)); + if (info == NULL) { + errno = ENOMEM; + return -1; + } + + if (pipe(fds) == -1) { + free(info); + return -1; + } + + memcpy(&info->mask, mask, sizeof(*mask)); + info->fd = fds[1]; + + pthread_attr_init(&attr); + pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); + + pthread_create(&tid, &attr, sigwait_compat, info); + + pthread_attr_destroy(&attr); + + return fds[0]; +} + +int kvm_signalfd(const sigset_t *mask) +{ +#if defined(SYS_signalfd) + int ret; + + ret = syscall(SYS_signalfd, -1, mask, _NSIG / 8); + if (!(ret == -1 && errno == ENOSYS)) + return ret; +#endif + + return kvm_signalfd_compat(mask); +} int kvm_eventfd(int *fds) { diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index 8fa3c1b..a0dd4a8 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -10,6 +10,8 @@ #include "cpu.h" +#include <signal.h> + int kvm_main_loop(void); int kvm_qemu_init(void); int kvm_qemu_create_context(void); @@ -79,6 +81,16 @@ int handle_powerpc_dcr_read(int vcpu, uint32_t dcrn, uint32_t *data); int handle_powerpc_dcr_write(int vcpu,uint32_t dcrn, uint32_t data); #endif +#if !defined(SYS_signalfd) +struct signalfd_siginfo { + uint32_t ssi_signo; + uint8_t pad[124]; +}; +#else +#include <linux/signalfd.h> +#endif + +int kvm_signalfd(const sigset_t *mask); int kvm_eventfd(int *fds); #define ALIGN(x, y) (((x)+(y)-1) & ~((y)-1)) |
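A standalone illustration of the signalfd() pattern the patch above relies on (this uses the modern <sys/signalfd.h> glibc wrapper, which the patch deliberately avoids by calling the raw syscall): block the signal, then receive it as ordinary data on a file descriptor that select()/read() can wait on, which removes the signal-vs-select race.

#include <sys/signalfd.h>
#include <signal.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
	sigset_t mask;
	struct signalfd_siginfo info;
	int sfd;

	sigemptyset(&mask);
	sigaddset(&mask, SIGUSR1);
	sigprocmask(SIG_BLOCK, &mask, NULL);	/* must block it first */

	sfd = signalfd(-1, &mask, 0);
	if (sfd == -1) {
		perror("signalfd");
		return 1;
	}

	/* A read() (or a select() on sfd) returns when SIGUSR1 arrives. */
	if (read(sfd, &info, sizeof(info)) == sizeof(info))
		printf("got signal %u\n", info.ssi_signo);
	return 0;
}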
From: Anthony L. <ali...@us...> - 2008-05-07 16:58:43
|
This patch reworks the IO thread to use signalfd() instead of sigtimedwait(). This will eliminate the need to use SIGIO everywhere. Since v2, I've fixed a nasty bug in qemu_kvm_aio_wait(). We can't use main_loop_wait() to sleep if it's at all possible we're being called from a handler in main_loop_wait() (which is the case with qemu_kvm_aio_wait()). Signed-off-by: Anthony Liguori <ali...@us...> diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index 7134e56..492c3c4 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -17,6 +17,7 @@ int kvm_pit = 1; #include "sysemu.h" #include "qemu-common.h" #include "console.h" +#include "block.h" #include "qemu-kvm.h" #include <libkvm.h> @@ -36,18 +37,11 @@ pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER; pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER; pthread_cond_t qemu_vcpu_cond = PTHREAD_COND_INITIALIZER; pthread_cond_t qemu_system_cond = PTHREAD_COND_INITIALIZER; +pthread_cond_t qemu_pause_cond = PTHREAD_COND_INITIALIZER; __thread struct vcpu_info *vcpu; static int qemu_system_ready; -struct qemu_kvm_signal_table { - sigset_t sigset; - sigset_t negsigset; -}; - -static struct qemu_kvm_signal_table io_signal_table; -static struct qemu_kvm_signal_table vcpu_signal_table; - #define SIG_IPI (SIGRTMIN+4) struct vcpu_info { @@ -64,6 +58,7 @@ struct vcpu_info { pthread_t io_thread; static int io_thread_fd = -1; +static int io_thread_sigfd = -1; static inline unsigned long kvm_get_thread_id(void) { @@ -172,37 +167,23 @@ static int has_work(CPUState *env) return kvm_arch_has_work(env); } -static int kvm_process_signal(int si_signo) -{ - struct sigaction sa; - - switch (si_signo) { - case SIGUSR2: - pthread_cond_signal(&qemu_aio_cond); - break; - case SIGALRM: - case SIGIO: - sigaction(si_signo, NULL, &sa); - sa.sa_handler(si_signo); - break; - } - - return 1; -} - -static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, - int timeout) +static int kvm_eat_signal(CPUState *env, int timeout) { struct timespec ts; int r, e, ret = 0; siginfo_t siginfo; + sigset_t waitset; ts.tv_sec = timeout / 1000; ts.tv_nsec = (timeout % 1000) * 1000000; - r = sigtimedwait(&waitset->sigset, &siginfo, &ts); + sigemptyset(&waitset); + sigaddset(&waitset, SIG_IPI); + + r = sigtimedwait(&waitset, &siginfo, &ts); if (r == -1 && (errno == EAGAIN || errno == EINTR) && !timeout) return 0; e = errno; + pthread_mutex_lock(&qemu_mutex); if (env && vcpu) cpu_single_env = vcpu->env; @@ -211,12 +192,12 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, exit(1); } if (r != -1) - ret = kvm_process_signal(siginfo.si_signo); + ret = 1; if (env && vcpu_info[env->cpu_index].stop) { vcpu_info[env->cpu_index].stop = 0; vcpu_info[env->cpu_index].stopped = 1; - qemu_kvm_notify_work(); + pthread_cond_signal(&qemu_pause_cond); } pthread_mutex_unlock(&qemu_mutex); @@ -227,14 +208,13 @@ static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env, static void kvm_eat_signals(CPUState *env, int timeout) { int r = 0; - struct qemu_kvm_signal_table *waitset = &vcpu_signal_table; - while (kvm_eat_signal(waitset, env, 0)) + while (kvm_eat_signal(env, 0)) r = 1; if (!r && timeout) { - r = kvm_eat_signal(waitset, env, timeout); + r = kvm_eat_signal(env, timeout); if (r) - while (kvm_eat_signal(waitset, env, 0)) + while (kvm_eat_signal(env, 0)) ; } } @@ -266,12 +246,8 @@ static void pause_all_threads(void) vcpu_info[i].stop = 1; pthread_kill(vcpu_info[i].thread, SIG_IPI); } - while (!all_threads_paused()) { - 
pthread_mutex_unlock(&qemu_mutex); - kvm_eat_signal(&io_signal_table, NULL, 1000); - pthread_mutex_lock(&qemu_mutex); - cpu_single_env = NULL; - } + while (!all_threads_paused()) + pthread_cond_wait(&qemu_pause_cond, &qemu_mutex); } static void resume_all_threads(void) @@ -310,6 +286,12 @@ static void setup_kernel_sigmask(CPUState *env) { sigset_t set; + sigemptyset(&set); + sigaddset(&set, SIGUSR2); + sigaddset(&set, SIGIO); + sigaddset(&set, SIGALRM); + sigprocmask(SIG_BLOCK, &set, NULL); + sigprocmask(SIG_BLOCK, NULL, &set); sigdelset(&set, SIG_IPI); @@ -346,7 +328,7 @@ static int kvm_main_loop_cpu(CPUState *env) cpu_single_env = env; while (1) { while (!has_work(env)) - kvm_main_loop_wait(env, 10); + kvm_main_loop_wait(env, 1000); if (env->interrupt_request & CPU_INTERRUPT_HARD) env->hflags &= ~HF_HALTED_MASK; if (!kvm_irqchip_in_kernel(kvm_context) && info->sipi_needed) @@ -394,18 +376,6 @@ static void *ap_main_loop(void *_env) return NULL; } -static void qemu_kvm_init_signal_table(struct qemu_kvm_signal_table *sigtab) -{ - sigemptyset(&sigtab->sigset); - sigfillset(&sigtab->negsigset); -} - -static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum) -{ - sigaddset(&sigtab->sigset, signum); - sigdelset(&sigtab->negsigset, signum); -} - void kvm_init_new_ap(int cpu, CPUState *env) { pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env); @@ -414,27 +384,12 @@ void kvm_init_new_ap(int cpu, CPUState *env) pthread_cond_wait(&qemu_vcpu_cond, &qemu_mutex); } -static void qemu_kvm_init_signal_tables(void) -{ - qemu_kvm_init_signal_table(&io_signal_table); - qemu_kvm_init_signal_table(&vcpu_signal_table); - - kvm_add_signal(&io_signal_table, SIGIO); - kvm_add_signal(&io_signal_table, SIGALRM); - kvm_add_signal(&io_signal_table, SIGUSR2); - - kvm_add_signal(&vcpu_signal_table, SIG_IPI); - - sigprocmask(SIG_BLOCK, &io_signal_table.sigset, NULL); -} - int kvm_init_ap(void) { #ifdef TARGET_I386 kvm_tpr_opt_setup(); #endif qemu_add_vm_change_state_handler(kvm_vm_state_change_handler, NULL); - qemu_kvm_init_signal_tables(); signal(SIG_IPI, sig_ipi_handler); return 0; @@ -468,6 +423,61 @@ void qemu_kvm_notify_work(void) fprintf(stderr, "failed to notify io thread\n"); } +static int received_signal; + +/* QEMU relies on periodically breaking out of select via EINTR to poll for IO + and timer signals. Since we're now using a file descriptor to handle + signals, select() won't be interrupted by a signal. We need to forcefully + break the select() loop when a signal is received hence + kvm_check_received_signal(). */ + +int kvm_check_received_signal(void) +{ + if (received_signal) { + received_signal = 0; + return 1; + } + + return 0; +} + +/* If we have signalfd, we mask out the signals we want to handle and then + * use signalfd to listen for them. We rely on whatever the current signal + * handler is to dispatch the signals when we receive them. 
+ */ + +static void sigfd_handler(void *opaque) +{ + int fd = (unsigned long)opaque; + struct signalfd_siginfo info; + struct sigaction action; + ssize_t len; + + while (1) { + do { + len = read(fd, &info, sizeof(info)); + } while (len == -1 && errno == EINTR); + + if (len == -1 && errno == EAGAIN) + break; + + if (len != sizeof(info)) { + printf("read from sigfd returned %ld: %m\n", len); + return; + } + + sigaction(info.ssi_signo, NULL, &action); + if (action.sa_handler) + action.sa_handler(info.ssi_signo); + + if (info.ssi_signo == SIGUSR2) { + pthread_cond_signal(&qemu_aio_cond); + } + } + + received_signal = 1; +} + /* Used to break IO thread out of select */ static void io_thread_wakeup(void *opaque) { @@ -487,17 +497,15 @@ static void io_thread_wakeup(void *opaque) offset += len; } -} -/* - * The IO thread has all signals that inform machine events - * blocked (io_signal_table), so it won't get interrupted - * while processing in main_loop_wait(). - */ + received_signal = 1; +} int kvm_main_loop(void) { int fds[2]; + sigset_t mask; + int sigfd; io_thread = pthread_self(); qemu_system_ready = 1; @@ -511,15 +519,31 @@ int kvm_main_loop(void) (void *)(unsigned long)fds[0]); io_thread_fd = fds[1]; - pthread_mutex_unlock(&qemu_mutex); + + sigemptyset(&mask); + sigaddset(&mask, SIGIO); + sigaddset(&mask, SIGALRM); + sigaddset(&mask, SIGUSR2); + sigprocmask(SIG_BLOCK, &mask, NULL); + + sigfd = kvm_signalfd(&mask); + if (sigfd == -1) { + fprintf(stderr, "failed to create signalfd\n"); + return -errno; + } + + fcntl(sigfd, F_SETFL, O_NONBLOCK); + + qemu_set_fd_handler2(sigfd, NULL, sigfd_handler, NULL, + (void *)(unsigned long)sigfd); pthread_cond_broadcast(&qemu_system_cond); + io_thread_sigfd = sigfd; + cpu_single_env = NULL; + while (1) { - kvm_eat_signal(&io_signal_table, NULL, 1000); - pthread_mutex_lock(&qemu_mutex); - cpu_single_env = NULL; - main_loop_wait(0); + main_loop_wait(1000); if (qemu_shutdown_requested()) break; else if (qemu_powerdown_requested()) @@ -528,7 +552,6 @@ int kvm_main_loop(void) pthread_kill(vcpu_info[0].thread, SIG_IPI); qemu_kvm_reset_requested = 1; } - pthread_mutex_unlock(&qemu_mutex); } pause_all_threads(); @@ -891,10 +914,21 @@ void qemu_kvm_aio_wait(void) CPUState *cpu_single = cpu_single_env; if (!cpu_single_env) { - pthread_mutex_unlock(&qemu_mutex); - kvm_eat_signal(&io_signal_table, NULL, 1000); - pthread_mutex_lock(&qemu_mutex); - cpu_single_env = NULL; + if (io_thread_sigfd != -1) { + fd_set rfds; + int ret; + + FD_ZERO(&rfds); + FD_SET(io_thread_sigfd, &rfds); + + /* this is a rare case where we do want to hold qemu_mutex + * while sleeping. We cannot allow anything else to run + * right now. 
*/ + ret = select(io_thread_sigfd + 1, &rfds, NULL, NULL, NULL); + if (ret > 0 && FD_ISSET(io_thread_sigfd, &rfds)) + sigfd_handler((void *)(unsigned long)io_thread_sigfd); + } + qemu_aio_poll(); } else { pthread_cond_wait(&qemu_aio_cond, &qemu_mutex); cpu_single_env = cpu_single; @@ -921,3 +955,14 @@ void kvm_cpu_destroy_phys_mem(target_phys_addr_t start_addr, { kvm_destroy_phys_mem(kvm_context, start_addr, size); } + +void kvm_mutex_unlock(void) +{ + pthread_mutex_unlock(&qemu_mutex); +} + +void kvm_mutex_lock(void) +{ + pthread_mutex_lock(&qemu_mutex); + cpu_single_env = NULL; +} diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index a0dd4a8..df573ec 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -12,6 +12,8 @@ #include <signal.h> +#include <signal.h> + int kvm_main_loop(void); int kvm_qemu_init(void); int kvm_qemu_create_context(void); @@ -111,4 +113,28 @@ extern kvm_context_t kvm_context; #define qemu_kvm_pit_in_kernel() (0) #endif +void kvm_mutex_unlock(void); +void kvm_mutex_lock(void); + +static inline void kvm_sleep_begin(void) +{ + if (kvm_enabled()) + kvm_mutex_unlock(); +} + +static inline void kvm_sleep_end(void) +{ + if (kvm_enabled()) + kvm_mutex_lock(); +} + +int kvm_check_received_signal(void); + +static inline int kvm_received_signal(void) +{ + if (kvm_enabled()) + return kvm_check_received_signal(); + return 0; +} + #endif diff --git a/qemu/vl.c b/qemu/vl.c index 3fcf6b6..541aacc 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -7271,6 +7271,23 @@ void qemu_register_boot_set(QEMUBootSetHandler *func) qemu_boot_set_handler = func; } +static int qemu_select(int max_fd, fd_set *rfds, fd_set *wfds, fd_set *xfds, + struct timeval *tv) +{ + int ret; + + /* KVM holds a mutex while QEMU code is running, we need hooks to + release the mutex whenever QEMU code sleeps. */ + + kvm_sleep_begin(); + + ret = select(max_fd, rfds, wfds, xfds, tv); + + kvm_sleep_end(); + + return ret; +} + void main_loop_wait(int timeout) { IOHandlerRecord *ioh; @@ -7342,11 +7359,12 @@ void main_loop_wait(int timeout) } } - tv.tv_sec = 0; #ifdef _WIN32 + tv.tv_sec = 0; tv.tv_usec = 0; #else - tv.tv_usec = timeout * 1000; + tv.tv_sec = timeout / 1000; + tv.tv_usec = (timeout % 1000) * 1000; #endif #if defined(CONFIG_SLIRP) if (slirp_inited) { @@ -7354,7 +7372,7 @@ void main_loop_wait(int timeout) } #endif moreio: - ret = select(nfds + 1, &rfds, &wfds, &xfds, &tv); + ret = qemu_select(nfds + 1, &rfds, &wfds, &xfds, &tv); if (ret > 0) { IOHandlerRecord **pioh; int more = 0; @@ -7383,7 +7401,7 @@ void main_loop_wait(int timeout) } else pioh = &ioh->next; } - if (more) + if (more && !kvm_received_signal()) goto moreio; } #if defined(CONFIG_SLIRP) |
From: Anthony L. <ali...@us...> - 2008-05-07 16:56:55
|
QEMU is rather aggressive about exhausting the wait period when selecting. This is fine when the wait period is low and when there is significant delays in-between selects as it improves IO throughput. With the IO thread, there is a very small delay between selects and our wait period for select is very large. This patch changes main_loop_wait to only select once before doing the various other things in the main loop. This generally improves responsiveness of things like SDL but also improves individual file descriptor throughput quite dramatically. Signed-off-by: Anthony Liguori <ali...@us...> diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c index 492c3c4..cc8f292 100644 --- a/qemu/qemu-kvm.c +++ b/qemu/qemu-kvm.c @@ -423,24 +423,6 @@ void qemu_kvm_notify_work(void) fprintf(stderr, "failed to notify io thread\n"); } -static int received_signal; - -/* QEMU relies on periodically breaking out of select via EINTR to poll for IO - and timer signals. Since we're now using a file descriptor to handle - signals, select() won't be interrupted by a signal. We need to forcefully - break the select() loop when a signal is received hence - kvm_check_received_signal(). */ - -int kvm_check_received_signal(void) -{ - if (received_signal) { - received_signal = 0; - return 1; - } - - return 0; -} - /* If we have signalfd, we mask out the signals we want to handle and then * use signalfd to listen for them. We rely on whatever the current signal * handler is to dispatch the signals when we receive them. @@ -474,8 +456,6 @@ static void sigfd_handler(void *opaque) pthread_cond_signal(&qemu_aio_cond); } } - - received_signal = 1; } /* Used to break IO thread out of select */ @@ -497,8 +477,6 @@ static void io_thread_wakeup(void *opaque) offset += len; } - - received_signal = 1; } int kvm_main_loop(void) diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h index df573ec..21606e9 100644 --- a/qemu/qemu-kvm.h +++ b/qemu/qemu-kvm.h @@ -128,13 +128,4 @@ static inline void kvm_sleep_end(void) kvm_mutex_lock(); } -int kvm_check_received_signal(void); - -static inline int kvm_received_signal(void) -{ - if (kvm_enabled()) - return kvm_check_received_signal(); - return 0; -} - #endif diff --git a/qemu/vl.c b/qemu/vl.c index 4e25366..a1aa270 100644 --- a/qemu/vl.c +++ b/qemu/vl.c @@ -7381,23 +7381,18 @@ void main_loop_wait(int timeout) slirp_select_fill(&nfds, &rfds, &wfds, &xfds); } #endif - moreio: ret = qemu_select(nfds + 1, &rfds, &wfds, &xfds, &tv); if (ret > 0) { IOHandlerRecord **pioh; - int more = 0; for(ioh = first_io_handler; ioh != NULL; ioh = ioh->next) { if (!ioh->deleted && ioh->fd_read && FD_ISSET(ioh->fd, &rfds)) { ioh->fd_read(ioh->opaque); - if (!ioh->fd_read_poll || ioh->fd_read_poll(ioh->opaque)) - more = 1; - else + if (!(ioh->fd_read_poll && ioh->fd_read_poll(ioh->opaque))) FD_CLR(ioh->fd, &rfds); } if (!ioh->deleted && ioh->fd_write && FD_ISSET(ioh->fd, &wfds)) { ioh->fd_write(ioh->opaque); - more = 1; } } @@ -7411,8 +7406,6 @@ void main_loop_wait(int timeout) } else pioh = &ioh->next; } - if (more && !kvm_received_signal()) - goto moreio; } #if defined(CONFIG_SLIRP) if (slirp_inited) { |
From: Anthony L. <ali...@us...> - 2008-05-07 16:56:14

The select() in the IO thread may wait a long time before rebuilding the fd set. Whenever we do something that changes the fd set, we should interrupt the IO thread.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/qemu/vl.c b/qemu/vl.c
index 1192759..e9f0ca4 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -260,6 +260,16 @@ static int event_pending = 1;
 
 #define TFR(expr) do { if ((expr) != -1) break; } while (errno == EINTR)
 
+/* KVM runs the main loop in a separate thread.  If we update one of the lists
+ * that are polled before or after select(), we need to make sure to break out
+ * of the select() to ensure the new item is serviced.
+ */
+static void main_loop_break(void)
+{
+    if (kvm_enabled())
+	qemu_kvm_notify_work();
+}
+
 void decorate_application_name(char *appname, int max_len)
 {
     if (kvm_enabled())
@@ -5680,6 +5690,7 @@ int qemu_set_fd_handler2(int fd,
         ioh->opaque = opaque;
         ioh->deleted = 0;
     }
+    main_loop_break();
     return 0;
 }
 
@@ -7606,8 +7617,7 @@ void qemu_bh_schedule(QEMUBH *bh)
     if (env) {
         cpu_interrupt(env, CPU_INTERRUPT_EXIT);
     }
-    if (kvm_enabled())
-	qemu_kvm_notify_work();
+    main_loop_break();
 }
 
 void qemu_bh_cancel(QEMUBH *bh)