From: Avi K. <av...@qu...> - 2008-05-04 11:19:31
Anthony Liguori wrote:
> Avi Kivity wrote:
>> Well, one user (me) has made this mistake, several times.
>
> I guess it's usage patterns. I'm pretty religious about using -snapshot
> unless I have a very specific reason not to. I have never encountered
> this problem myself.

Most users cannot use -snapshot for their workloads.

>>> FWIW, the whole override thing for Xen has been an endless source of
>>> pain. It's very difficult (if not impossible) to accurately
>>> determine if someone else is using the disk.
>>
>> What's wrong with the standard file locking API? Of course it won't
>> stop non-qemu apps from accessing it, but that's unlikely anyway.
>
> Xen tries to be very smart about determining whether devices are mounted
> somewhere else or not.

I'm not talking about being too smart. Just an flock().

>>> Also, it tends to confuse people trying to do something legitimate
>>> more often than helping someone doing something stupid.
>>
>> -drive exclusive=off (or share=yes)
>
> The problem I have is that the default policy gets very complicated. At
> first thought, I would say it's fine as long as exclusive=off was the
> default for using -snapshot or using raw images. However, if you create
> a VM with a qcow image using -snapshot, and then create another one
> without using snapshot, you're boned.

Well then, default to exclusive=on. If you're using -snapshot you can add
the extra parameter as well.

> What we really need is a global configuration file so that individual
> users can select these defaults according to what makes sense for them.
>
> In the mean time, I think the policy vs. mechanism argument strongly
> suggests that exclusive=off should be the default (not to mention
> maintaining backwards compatibility).

The problem is that this is bad for users.

--
error compiling committee.c: too many arguments to function
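For reference, the flock()-based exclusion Avi is suggesting is only a few
lines. A minimal sketch, assuming a hypothetical open_disk_image() helper
(the name and error handling are illustrative, not actual qemu code):

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Open a disk image and take an advisory exclusive lock on it.
 * Returns the fd on success, or -1 if the image is already locked
 * by another process (or if the open itself fails). */
static int open_disk_image(const char *path, int shared)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;

    /* LOCK_NB makes the call fail immediately instead of blocking,
     * so a second instance can report "image in use" and bail out.
     * The lock is released automatically when the fd is closed. */
    if (!shared && flock(fd, LOCK_EX | LOCK_NB) < 0) {
        fprintf(stderr, "%s: image is in use by another process\n", path);
        close(fd);
        return -1;
    }

    return fd;
}

An exclusive=off (or share=yes) drive option would simply pass shared=1 and
skip the flock() call. Non-qemu applications are not stopped by an advisory
lock, which is the limitation Avi concedes above.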
From: iMil <im...@im...> - 2008-05-04 11:19:08
Hi,

Since I upgraded my ubuntu machine to 8.04, /usr/local/bin/qemu-system-x86_64
segfaults when starting with -net tap,ifname=tap0 flags. Of course, it's been
recompiled.

$ sudo /usr/local/bin/qemu-system-x86_64 /data/virt/netbsd.img -net nic,macaddr=00:56:01:02:03:04 -net tap,ifname=tap0,script=/etc/qemu-ifup
Segmentation fault

I see the same behaviour with any NIC, and for kvm from -62 to -67 (I tested
each version). Downgrading to kvm-61 fixes the problem. It seems like there's
a similar bug report here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=476469
The VM also starts correctly with qemu.

$ modinfo kvm-intel
filename:       /lib/modules/2.6.24-16-generic/extra/kvm-intel.ko
license:        GPL
author:         Qumranet
version:        kvm-67
srcversion:     2E2C88C6F09E216FDAA6797
depends:        kvm
vermagic:       2.6.24-16-generic SMP mod_unload 586
parm:           bypass_guest_pf:bool
parm:           enable_vpid:bool
parm:           flexpriority_enabled:bool

$ uname -a
Linux tatooine 2.6.24-16-generic #1 SMP Thu Apr 10 13:23:42 UTC 2008 i686 GNU/Linux

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
stepping        : 6
cpu MHz         : 1596.000
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 3736.56
clflush size    : 64

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
stepping        : 6
cpu MHz         : 1596.000
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 3733.37
clflush size    : 64

Interesting packages:

ii  bridge-utils   1.2-2
ii  iproute        20071016-2ubuntu1
ii  libc6          2.7-10ubuntu3

Hope this helps, regards

----------------------------------------
Emile "iMil" Heitor <im...@ho...>            _
http://gcu-squad.org     ASCII ribbon campaign ( )
                          - against HTML email  X
                                    & vCards   / \
From: Avi K. <av...@qu...> - 2008-05-04 09:06:15
Anthony Liguori wrote:
> This patch reworks the IO thread to use signalfd() instead of sigtimedwait().
> This will eliminate the need to use SIGIO everywhere. In this version of the
> patch, we use signalfd() when it's available. When it isn't available, we
> instead use a pipe() that is written to in each signal handler.
>
> I've tested Windows and Linux guests with SMP without seeing any obvious
> regressions.

Please split the signalfd() emulation into a separate (preparatory) patch.
Also, we need to detect signalfd() at run time as well as compile time, since
qemu may be compiled on a different machine than it is run on.

> +/* If we don't have signalfd, we don't mask out the signals we want to receive.
> + * To avoid the signal/select race, we use a pipe() that we write to from the
> + * signal handler. As a consequence, we save off the signal handler to perform
> + * dispatch.
> + */

We can keep the signals blocked, but run the signalfd emulation in a separate
thread (where it can dequeue signals using sigwait() as an added bonus). This
will reduce the differences between the two modes at the expense of increased
signalfd() emulation complexity, which I think is a good tradeoff. We can move
the signalfd emulation into a separate file in order to improve readability.

--
error compiling committee.c: too many arguments to function
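The pipe()-based fallback described in the quoted comment is the classic
self-pipe trick. A minimal sketch of the technique (function and variable
names are illustrative assumptions, not the actual qemu-kvm code):

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static int sigpipe_fds[2]; /* [0] polled by select(), [1] written by handlers */

/* Signal handler: push the signal number through the pipe so the main
 * select() loop wakes up and can dispatch the signal synchronously.
 * write() is async-signal-safe; a failed write on a full pipe is
 * harmless because the reader drains everything that is pending. */
static void pipe_signal_handler(int signo)
{
    unsigned char b = (unsigned char)signo;
    (void)write(sigpipe_fds[1], &b, 1);
}

static int setup_signal_pipe(int signo)
{
    int fl;

    if (pipe(sigpipe_fds) < 0)
        return -1;

    /* Non-blocking ends so a full pipe can never wedge a signal handler. */
    fl = fcntl(sigpipe_fds[0], F_GETFL);
    fcntl(sigpipe_fds[0], F_SETFL, fl | O_NONBLOCK);
    fl = fcntl(sigpipe_fds[1], F_GETFL);
    fcntl(sigpipe_fds[1], F_SETFL, fl | O_NONBLOCK);

    signal(signo, pipe_signal_handler);
    return sigpipe_fds[0]; /* caller adds this fd to its select() set */
}

Avi's alternative keeps the signals blocked and has a dedicated thread
dequeue them with sigwait() and feed the same pipe, so the main loop looks
identical whether a real signalfd() is available or not.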
From: Avi K. <av...@qu...> - 2008-05-04 08:41:18
Karl Rister wrote:
> Hi
>
> I have been trying to do some testing of a large number of guests (72) on a
> big multi-node IBM box (8 sockets, 32 cores, 128GB) and I am having various
> issues with the guests. I can get the guests to boot, but then I start to
> have problems. Some guests appear to stall doing I/O and some become
> unresponsive and spin their single vcpu at 100%.

One of the problems with these large boxes is that their TSCs are not synced
across sockets; you may be hitting related issues. Can you try configuring
the guests not to use the tsc?

Also, if you are running on an old host kernel, you won't have
smp_call_function_single() and there will be many broadcast IPIs. Please use
a recent host kernel (kvm.git is best, though a bit bleeding edge).

--
error compiling committee.c: too many arguments to function
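A quick host-side way to confirm the cross-socket TSC problem is to bounce a
thread between two cpus and watch for the counter going backwards, which is
what a guest observes when its vcpu migrates. A sketch, assuming purely for
illustration that cpus 0 and 1 sit on different sockets:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t last = 0, now;
    cpu_set_t set;
    int i;

    for (i = 0; i < 100000; i++) {
        /* alternate between cpu 0 and cpu 1 */
        CPU_ZERO(&set);
        CPU_SET(i & 1, &set);
        if (sched_setaffinity(0, sizeof(set), &set) < 0)
            return 1;

        now = rdtsc();
        if (now < last)
            printf("TSC went backwards by %llu cycles\n",
                   (unsigned long long)(last - now));
        last = now;
    }
    return 0;
}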
From: Avi K. <av...@qu...> - 2008-05-04 08:17:12
Hollis Blanchard wrote:
> Avi, please apply these patches to the kvm-userspace repository. I've
> submitted the device emulation patches (UIC and PCI) to qemu-devel, but
> have received no response.

Applied all, thanks.

> Thinking ahead to qemu integration, many of these should be folded into a
> single "Bamboo board" patch, but e.g. the device emulation patches are
> logically separate. Do you track qemu patches for upstream integration like
> you do for the kernel? Do you want me to keep these split-out patches
> locally? Unsplitting them later would be a pain...

It's better to keep the patches split. As you point out, folding patches
together is easy but splitting them later is hard.

I don't keep a qemu patch queue; whether to take patches from
kvm-userspace.git or rediff against upstream qemu is a decision that is best
taken on a case-by-case basis, if and when qemu upstream becomes receptive to
these patches.

--
error compiling committee.c: too many arguments to function
From: Avi K. <av...@qu...> - 2008-05-04 08:13:26
Anthony Liguori wrote:
> Anthony Liguori wrote:
>> While it has served us well, it is long overdue that we eliminate the
>> virtio-net tap hack. It turns out that zero-copy has very little
>> impact on performance. The tap hack was gaining such a significant
>> performance boost not because of zero-copy, but because it avoided
>> dropping packets on receive, which is apparently a significant problem
>> with the tap implementation in QEMU.
>
> FWIW, attached is a pretty straightforward zero-copy patch. What's
> interesting is that I see no change in throughput using this patch.
> The CPU is pegged at 100% during the iperf run. Since we're still
> using small MTUs, this isn't surprising. Copying a 1500 byte packet
> that we have to bring into the cache anyway doesn't seem significant.
> I think zero-copy will be more important with GSO though.

Zero copy is important when the guest is zero copy, and when we are not doing
any extra copying on the host. This doesn't fit the way we benchmark. I
expect zero copy to show improvements on things like apachebench (with a file
size > 50K) with an external client. The improvements will also show up on
SMP, where the likelihood of the copy happening on the wrong cpu increases.

--
error compiling committee.c: too many arguments to function
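The receive path being discussed boils down to scattering each packet
straight from the tap fd into the guest-posted buffers with readv(), instead
of reading into a bounce buffer and copying. A minimal sketch of that idea,
with an illustrative function name (Anthony's actual patch appears elsewhere
in this thread):

#include <errno.h>
#include <sys/uio.h>
#include <unistd.h>

/* Scatter one packet from the tap fd directly into the guest's posted
 * RX buffers, skipping the intermediate copy. iov[0] is assumed to
 * hold the virtio net header, which the host fills in itself, so the
 * payload lands in iov[1..iovcnt-1].
 * Returns the payload length, 0 if no packet was pending, -1 on error. */
static ssize_t tap_receive_zerocopy(int tap_fd, struct iovec *iov, int iovcnt)
{
    ssize_t len;

    do {
        len = readv(tap_fd, &iov[1], iovcnt - 1);
    } while (len < 0 && errno == EINTR);

    if (len < 0)
        return errno == EAGAIN ? 0 : -1;

    return len;
}

With a 1500 byte MTU the payload still has to be brought into the cache to be
checksummed and delivered, which is consistent with Anthony's observation
that the copy it saves is not the bottleneck.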
From: Avi K. <av...@qu...> - 2008-05-04 07:56:34
Marcelo Tosatti wrote:
> Add three PCI bridges to support 128 slots.
>
> Changes since v1:
> - Remove I/O address range "support" (so standard PCI I/O space is used).
> - Verify that there are no special quirks for the 82801 PCI bridge.
> - Introduce separate flat IRQ mapping function for non-SPARC targets.

I've cooled off on the 128 slot stuff, mainly because most real hosts don't
have them. An unusual configuration will likely lead to problems, as most
guest OSes and workloads will not have been tested thoroughly with it:

- It requires a large number of interrupts, which are difficult to provide,
  and which it is hard to ensure all OSes support. MSI is relatively new.
- If only a few interrupts are available, then each interrupt requires
  scanning a large number of queues.

If we are to do this, then we need better tests than "80 disks show up".

The alternative approach of having the virtio block device control up to 16
disks allows having those 80 disks with just 5 slots (and 5 interrupts). This
is similar to the way traditional SCSI controllers behave, and so should not
surprise the guest OS.

--
error compiling committee.c: too many arguments to function
From: Izik E. <iz...@qu...> - 2008-05-04 07:39:56
Avi Kivity wrote:
> Bob Moran wrote:
>> The http://kvm.qumranet.com/kvmwiki/FAQ section 3 Q13, re. widescreen
>> resolution in KVM, refers me to
>> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/13557, which
>> describes a patch to be applied.
>>
>> I am not familiar with patch application and unsure where to find the
>> file to patch. Any help would be appreciated.
>
> The patch is included in kvm-62, so if you use that (or any more recent
> release) you should have the functionality included.
>
> In case you still encounter problems, let us know.

Doesn't it only work with -std-vga?

--
woof.
From: Avi K. <av...@qu...> - 2008-05-04 06:14:24
Bob Moran wrote:
> The http://kvm.qumranet.com/kvmwiki/FAQ section 3 Q13, re. widescreen
> resolution in KVM, refers me to
> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/13557, which
> describes a patch to be applied.
>
> I am not familiar with patch application and unsure where to find the
> file to patch. Any help would be appreciated.

The patch is included in kvm-62, so if you use that (or any more recent
release) you should have the functionality included.

In case you still encounter problems, let us know.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: Bob M. <bm...@md...> - 2008-05-04 05:23:11
The http://kvm.qumranet.com/kvmwiki/FAQ section 3 Q13, re. widescreen
resolution in KVM, refers me to
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/13557, which describes
a patch to be applied.

I am not familiar with patch application and unsure where to find the file to
patch. Any help would be appreciated.

Bob Moran
From: Anthony L. <an...@co...> - 2008-05-04 04:03:45
Anthony Liguori wrote:
> While it has served us well, it is long overdue that we eliminate the
> virtio-net tap hack. It turns out that zero-copy has very little impact on
> performance. The tap hack was gaining such a significant performance boost
> not because of zero-copy, but because it avoided dropping packets on
> receive, which is apparently a significant problem with the tap
> implementation in QEMU.

FWIW, attached is a pretty straightforward zero-copy patch. What's
interesting is that I see no change in throughput using this patch. The CPU
is pegged at 100% during the iperf run. Since we're still using small MTUs,
this isn't surprising. Copying a 1500 byte packet that we have to bring into
the cache anyway doesn't seem significant. I think zero-copy will be more
important with GSO though.

Regards,

Anthony Liguori

> Patches 3 and 4 in this series address the packet dropping issue and the
> net result is a 25% boost in RX performance even in the absence of
> zero-copy.
>
> Also worth mentioning is that this makes merging virtio into upstream QEMU
> significantly easier.
>
> Signed-off-by: Anthony Liguori <ali...@us...>
From: Anthony L. <ali...@us...> - 2008-05-04 02:48:09
In the final patch of this series, we rely on a VLAN client's fd_can_read
method to avoid dropping packets. Unfortunately, virtio's fd_can_read method
is not very accurate at the moment. This patch addresses this.

It also generates a notification to the IO thread when more RX packets become
available. If we say we can't receive a packet because no RX buffers are
available, this may result in the tap file descriptor not being select()'d.
Without notifying the IO thread, we may have to wait until the select() times
out before we can receive a packet (even if there is one pending). This
particular change makes RX performance very consistent.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index 8d26832..5538979 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -14,6 +14,7 @@
 #include "virtio.h"
 #include "net.h"
 #include "qemu-timer.h"
+#include "qemu-kvm.h"
 
 /* from Linux's virtio_net.h */
 
@@ -60,11 +61,14 @@ typedef struct VirtIONet
     VirtQueue *rx_vq;
     VirtQueue *tx_vq;
     VLANClientState *vc;
-    int can_receive;
     QEMUTimer *tx_timer;
     int tx_timer_active;
 } VirtIONet;
 
+/* TODO
+ * - we could suppress RX interrupt if we were so inclined.
+ */
+
 static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 {
     return (VirtIONet *)vdev;
@@ -88,15 +92,24 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
 
 static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 {
-    VirtIONet *n = to_virtio_net(vdev);
-    n->can_receive = 1;
+    /* We now have RX buffers, signal to the IO thread to break out of the
+       select to re-poll the tap file descriptor */
+    if (kvm_enabled())
+        qemu_kvm_notify_work();
 }
 
 static int virtio_net_can_receive(void *opaque)
 {
     VirtIONet *n = opaque;
 
-    return (n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK) && n->can_receive;
+    if (n->rx_vq->vring.avail == NULL ||
+        !(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
+        return 0;
+
+    if (n->rx_vq->vring.avail->idx == n->rx_vq->last_avail_idx)
+        return 0;
+
+    return 1;
 }
 
 static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
@@ -106,15 +119,8 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
     struct virtio_net_hdr *hdr;
     int offset, i;
 
-    /* FIXME: the drivers really need to set their status better */
-    if (n->rx_vq->vring.avail == NULL) {
-        n->can_receive = 0;
-        return;
-    }
-
     if (virtqueue_pop(n->rx_vq, &elem) == 0) {
-        /* wait until the guest adds some rx bufs */
-        n->can_receive = 0;
+        fprintf(stderr, "virtio_net: this should not happen\n");
         return;
     }
 
@@ -209,9 +215,8 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     n->vdev.update_config = virtio_net_update_config;
     n->vdev.get_features = virtio_net_get_features;
-    n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
+    n->rx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_rx);
     n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx);
-    n->can_receive = 0;
     memcpy(n->mac, nd->macaddr, 6);
     n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
                                  virtio_net_can_receive, n);
From: Anthony L. <ali...@us...> - 2008-05-04 02:47:53
Normally, tap always reads packets and simply lets the client drop them if it
cannot receive them. For virtio-net, this results in massive packet loss and
about an 80% performance loss in TCP throughput.

This patch modifies qemu_send_packet() to only deliver a packet to a VLAN
client if it doesn't have an fd_can_read method, or if the fd_can_read method
indicates that it can receive packets. We also return a status of whether any
clients were able to receive the packet. If no clients were able to receive a
packet, we buffer the packet until a client indicates that it can receive
packets again.

This patch also modifies the tap code to only read from the tap fd if at
least one client on the VLAN is able to receive a packet. Finally, this patch
changes the tap code to drain all possible packets from the tap device when
the tap fd is readable.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..dfdf9af 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -29,7 +29,7 @@ VLANClientState *qemu_new_vlan_client(VLANState *vlan,
                                       IOCanRWHandler *fd_can_read,
                                       void *opaque);
 int qemu_can_send_packet(VLANClientState *vc);
-void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
+int qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
 void qemu_handler_true(void *opaque);
 
 void do_info_network(void);
diff --git a/qemu/vl.c b/qemu/vl.c
index b8ce485..74c34b6 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -3750,10 +3750,11 @@ int qemu_can_send_packet(VLANClientState *vc1)
     return 0;
 }
 
-void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
+int qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
 {
     VLANState *vlan = vc1->vlan;
     VLANClientState *vc;
+    int ret = -EAGAIN;
 
 #if 0
     printf("vlan %d send:\n", vlan->id);
@@ -3761,9 +3762,14 @@ void qemu_send_packet(VLANClientState *vc1, const uint8_t *buf, int size)
 #endif
     for(vc = vlan->first_client; vc != NULL; vc = vc->next) {
         if (vc != vc1) {
-            vc->fd_read(vc->opaque, buf, size);
+            if (!vc->fd_can_read || vc->fd_can_read(vc->opaque)) {
+                vc->fd_read(vc->opaque, buf, size);
+                ret = 0;
+            }
         }
     }
+
+    return ret;
 }
 
 #if defined(CONFIG_SLIRP)
@@ -3966,6 +3972,8 @@ typedef struct TAPState {
     VLANClientState *vc;
     int fd;
     char down_script[1024];
+    char buf[4096];
+    int size;
 } TAPState;
 
 static void tap_receive(void *opaque, const uint8_t *buf, int size)
@@ -3981,24 +3989,70 @@ static void tap_receive(void *opaque, const uint8_t *buf, int size)
     }
 }
 
+static int tap_can_send(void *opaque)
+{
+    TAPState *s = opaque;
+    VLANClientState *vc;
+    int can_receive = 0;
+
+    /* Check to see if any of our clients can receive a packet */
+    for (vc = s->vc->vlan->first_client; vc; vc = vc->next) {
+        /* Skip ourselves */
+        if (vc == s->vc)
+            continue;
+
+        if (!vc->fd_can_read) {
+            /* no fd_can_read handler, they always can receive */
+            can_receive = 1;
+        } else
+            can_receive = vc->fd_can_read(vc->opaque);
+
+        /* Once someone can receive, we try to send a packet */
+        if (can_receive)
+            break;
+    }
+
+    return can_receive;
+}
+
 static void tap_send(void *opaque)
 {
     TAPState *s = opaque;
-    uint8_t buf[4096];
-    int size;
 
+    /* First try to send any buffered packet */
+    if (s->size > 0) {
+        int err;
+
+        /* If noone can receive the packet, buffer it */
+        err = qemu_send_packet(s->vc, s->buf, s->size);
+        if (err == -EAGAIN)
+            return;
+    }
+
+    /* Read packets until we hit EAGAIN */
+    do {
 #ifdef __sun__
-    struct strbuf sbuf;
-    int f = 0;
-    sbuf.maxlen = sizeof(buf);
-    sbuf.buf = buf;
-    size = getmsg(s->fd, NULL, &sbuf, &f) >=0 ? sbuf.len : -1;
+        struct strbuf sbuf;
+        int f = 0;
+        sbuf.maxlen = sizeof(s->buf);
+        sbuf.buf = s->buf;
+        s->size = getmsg(s->fd, NULL, &sbuf, &f) >=0 ? sbuf.len : -1;
 #else
-    size = read(s->fd, buf, sizeof(buf));
+        s->size = read(s->fd, s->buf, sizeof(s->buf));
 #endif
-    if (size > 0) {
-        qemu_send_packet(s->vc, buf, size);
-    }
+
+        if (s->size == -1 && errno == EINTR)
+            continue;
+
+        if (s->size > 0) {
+            int err;
+
+            /* If noone can receive the packet, buffer it */
+            err = qemu_send_packet(s->vc, s->buf, s->size);
+            if (err == -EAGAIN)
+                break;
+        }
+    } while (s->size > 0);
 }
 
 /* fd support */
@@ -4012,7 +4066,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
         return NULL;
     s->fd = fd;
     s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
-    qemu_set_fd_handler2(s->fd, NULL, tap_send, NULL, s);
+    qemu_set_fd_handler2(s->fd, tap_can_send, tap_send, NULL, s);
     snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd);
     return s;
 }
From: Anthony L. <ali...@us...> - 2008-05-04 02:47:45
QEMU is rather aggressive about exhausting the wait period when selecting.
This is fine when the wait period is low and when there are significant
delays in between selects, as it improves IO throughput. With the IO thread,
there is a very small delay between selects and our wait period for select is
very large.

This patch changes main_loop_wait to only select once before doing the
various other things in the main loop. This generally improves responsiveness
of things like SDL, but also improves individual file descriptor throughput
quite dramatically.

This patch relies on my io-thread-timerfd.patch.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 0c7f49f..31c7ca7 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -401,24 +401,6 @@ void qemu_kvm_notify_work(void)
     pthread_kill(io_thread, SIGUSR1);
 }
 
-static int received_signal;
-
-/* QEMU relies on periodically breaking out of select via EINTR to poll for IO
-   and timer signals. Since we're now using a file descriptor to handle
-   signals, select() won't be interrupted by a signal. We need to forcefully
-   break the select() loop when a signal is received hence
-   kvm_check_received_signal(). */
-
-int kvm_check_received_signal(void)
-{
-    if (received_signal) {
-        received_signal = 0;
-        return 1;
-    }
-
-    return 0;
-}
-
 #if defined(SYS_signalfd)
 #if !defined(HAVE_signalfd)
 #include <linux/signalfd.h>
@@ -466,8 +448,6 @@ static void sigfd_handler(void *opaque)
         if (info.ssi_signo == SIGUSR2)
             pthread_cond_signal(&qemu_aio_cond);
     }
-
-    received_signal = 1;
 }
 
 static int setup_signal_handlers(int nr_signals, ...)
@@ -576,8 +556,6 @@ static void sigfd_handler(void *opaque)
         if (signo == SIGUSR2)
             pthread_cond_signal(&qemu_aio_cond);
     }
-
-    received_signal = 1;
 }
 
 static int setup_signal_handlers(int nr_signals, ...)
diff --git a/qemu/qemu-kvm.h b/qemu/qemu-kvm.h
index bcab82c..5109c64 100644
--- a/qemu/qemu-kvm.h
+++ b/qemu/qemu-kvm.h
@@ -112,13 +112,4 @@ static inline void kvm_sleep_end(void)
     kvm_mutex_lock();
 }
 
-int kvm_check_received_signal(void);
-
-static inline int kvm_received_signal(void)
-{
-    if (kvm_enabled())
-        return kvm_check_received_signal();
-    return 0;
-}
-
 #endif
diff --git a/qemu/vl.c b/qemu/vl.c
index 1192759..bcf893f 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -7936,23 +7936,18 @@ void main_loop_wait(int timeout)
         slirp_select_fill(&nfds, &rfds, &wfds, &xfds);
     }
 #endif
- moreio:
     ret = qemu_select(nfds + 1, &rfds, &wfds, &xfds, &tv);
     if (ret > 0) {
         IOHandlerRecord **pioh;
-        int more = 0;
 
         for(ioh = first_io_handler; ioh != NULL; ioh = ioh->next) {
             if (!ioh->deleted && ioh->fd_read && FD_ISSET(ioh->fd, &rfds)) {
                 ioh->fd_read(ioh->opaque);
-                if (!ioh->fd_read_poll || ioh->fd_read_poll(ioh->opaque))
-                    more = 1;
-                else
+                if (!(ioh->fd_read_poll && ioh->fd_read_poll(ioh->opaque)))
                     FD_CLR(ioh->fd, &rfds);
             }
             if (!ioh->deleted && ioh->fd_write && FD_ISSET(ioh->fd, &wfds)) {
                 ioh->fd_write(ioh->opaque);
-                more = 1;
             }
         }
 
@@ -7966,8 +7961,6 @@ void main_loop_wait(int timeout)
             } else
                 pioh = &ioh->next;
         }
-        if (more && !kvm_received_signal())
-            goto moreio;
     }
 #if defined(CONFIG_SLIRP)
     if (slirp_inited) {
From: Anthony L. <ali...@us...> - 2008-05-04 02:47:45
While it has served us well, it is long overdue that we eliminate the
virtio-net tap hack. It turns out that zero-copy has very little impact on
performance. The tap hack was gaining such a significant performance boost
not because of zero-copy, but because it avoided dropping packets on receive,
which is apparently a significant problem with the tap implementation in QEMU.

Patches 3 and 4 in this series address the packet dropping issue and the net
result is a 25% boost in RX performance even in the absence of zero-copy.

Also worth mentioning is that this makes merging virtio into upstream QEMU
significantly easier.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/qemu/hw/pc.h b/qemu/hw/pc.h
index 57d2123..f5157bd 100644
--- a/qemu/hw/pc.h
+++ b/qemu/hw/pc.h
@@ -154,7 +154,6 @@ void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd);
 
 /* virtio-net.c */
 PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn);
-void virtio_net_poll(void);
 
 /* virtio-blk.h */
 void *virtio_blk_init(PCIBus *bus, uint16_t vendor, uint16_t device,
diff --git a/qemu/hw/virtio-net.c b/qemu/hw/virtio-net.c
index f727b14..8d26832 100644
--- a/qemu/hw/virtio-net.c
+++ b/qemu/hw/virtio-net.c
@@ -13,7 +13,6 @@
 
 #include "virtio.h"
 #include "net.h"
-#include "pc.h"
 #include "qemu-timer.h"
 
 /* from Linux's virtio_net.h */
@@ -62,15 +61,10 @@ typedef struct VirtIONet
     VirtQueue *tx_vq;
     VLANClientState *vc;
     int can_receive;
-    int tap_fd;
-    struct VirtIONet *next;
-    int do_notify;
     QEMUTimer *tx_timer;
     int tx_timer_active;
 } VirtIONet;
 
-static VirtIONet *VirtIONetHead = NULL;
-
 static VirtIONet *to_virtio_net(VirtIODevice *vdev)
 {
     return (VirtIONet *)vdev;
@@ -105,7 +99,6 @@ static int virtio_net_can_receive(void *opaque)
     return (n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK) && n->can_receive;
 }
 
-/* -net user receive function */
 static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
 {
     VirtIONet *n = opaque;
@@ -144,87 +137,6 @@ static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
     virtio_notify(&n->vdev, n->rx_vq);
 }
 
-/* -net tap receive handler */
-void virtio_net_poll(void)
-{
-    VirtIONet *vnet;
-    int len;
-    fd_set rfds;
-    struct timeval tv;
-    int max_fd = -1;
-    VirtQueueElement elem;
-    struct virtio_net_hdr *hdr;
-    int did_notify;
-
-    FD_ZERO(&rfds);
-    tv.tv_sec = 0;
-    tv.tv_usec = 0;
-
-    while (1) {
-
-        // Prepare the set of device to select from
-        for (vnet = VirtIONetHead; vnet; vnet = vnet->next) {
-
-            if (vnet->tap_fd == -1)
-                continue;
-
-            vnet->do_notify = 0;
-            //first check if the driver is ok
-            if (!virtio_net_can_receive(vnet))
-                continue;
-
-            /* FIXME: the drivers really need to set their status better */
-            if (vnet->rx_vq->vring.avail == NULL) {
-                vnet->can_receive = 0;
-                continue;
-            }
-
-            FD_SET(vnet->tap_fd, &rfds);
-            if (max_fd < vnet->tap_fd) max_fd = vnet->tap_fd;
-        }
-
-        if (select(max_fd + 1, &rfds, NULL, NULL, &tv) <= 0)
-            break;
-
-        // Now check who has data pending in the tap
-        for (vnet = VirtIONetHead; vnet; vnet = vnet->next) {
-
-            if (!FD_ISSET(vnet->tap_fd, &rfds))
-                continue;
-
-            if (virtqueue_pop(vnet->rx_vq, &elem) == 0) {
-                vnet->can_receive = 0;
-                continue;
-            }
-
-            hdr = (void *)elem.in_sg[0].iov_base;
-            hdr->flags = 0;
-            hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
-again:
-            len = readv(vnet->tap_fd, &elem.in_sg[1], elem.in_num - 1);
-            if (len == -1) {
-                if (errno == EINTR || errno == EAGAIN)
-                    goto again;
-                else
-                    fprintf(stderr, "reading network error %d", len);
-            }
-            virtqueue_push(vnet->rx_vq, &elem, sizeof(*hdr) + len);
-            vnet->do_notify = 1;
-        }
-
-        /* signal other side */
-        did_notify = 0;
-        for (vnet = VirtIONetHead; vnet; vnet = vnet->next)
-            if (vnet->do_notify) {
-                virtio_notify(&vnet->vdev, vnet->rx_vq);
-                did_notify++;
-            }
-        if (!did_notify)
-            break;
-    }
-
-}
-
 /* TX */
 static void virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
 {
@@ -303,12 +215,6 @@ PCIDevice *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
     memcpy(n->mac, nd->macaddr, 6);
     n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
                                  virtio_net_can_receive, n);
-    n->tap_fd = hack_around_tap(n->vc->vlan->first_client);
-    if (n->tap_fd != -1) {
-        n->next = VirtIONetHead;
-        //push the device on top of the list
-        VirtIONetHead = n;
-    }
 
     n->tx_timer = qemu_new_timer(vm_clock, virtio_net_tx_timer, n);
     n->tx_timer_active = 0;
diff --git a/qemu/vl.c b/qemu/vl.c
index bcf893f..b8ce485 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -3966,15 +3966,8 @@ typedef struct TAPState {
     VLANClientState *vc;
     int fd;
     char down_script[1024];
-    int no_poll;
 } TAPState;
 
-static int tap_read_poll(void *opaque)
-{
-    TAPState *s = opaque;
-    return (!s->no_poll);
-}
-
 static void tap_receive(void *opaque, const uint8_t *buf, int size)
 {
     TAPState *s = opaque;
@@ -4008,22 +4001,6 @@ static void tap_send(void *opaque)
     }
 }
 
-int hack_around_tap(void *opaque)
-{
-    VLANClientState *vc = opaque;
-    TAPState *ts = vc->opaque;
-
-    if (vc->fd_read != tap_receive)
-        return -1;
-
-    if (ts) {
-        ts->no_poll = 1;
-        return ts->fd;
-    }
-
-    return -1;
-}
-
 /* fd support */
 
 static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
@@ -4034,10 +4011,8 @@ static TAPState *net_tap_fd_init(VLANState *vlan, int fd)
     if (!s)
         return NULL;
     s->fd = fd;
-    s->no_poll = 0;
-    enable_sigio_timer(fd);
     s->vc = qemu_new_vlan_client(vlan, tap_receive, NULL, s);
-    qemu_set_fd_handler2(s->fd, tap_read_poll, tap_send, NULL, s);
+    qemu_set_fd_handler2(s->fd, NULL, tap_send, NULL, s);
     snprintf(s->vc->info_str, sizeof(s->vc->info_str), "tap: fd=%d", fd);
     return s;
 }
@@ -7972,10 +7947,7 @@ void main_loop_wait(int timeout)
         slirp_select_poll(&rfds, &wfds, &xfds);
     }
 #endif
-    virtio_net_poll();
-
     qemu_aio_poll();
-
     if (vm_running) {
         qemu_run_timers(&active_timers[QEMU_TIMER_VIRTUAL],
                         qemu_get_clock(vm_clock));
From: Aurelien J. <aur...@au...> - 2008-05-03 23:45:33
On Wed, Apr 30, 2008 at 05:04:19PM -0300, Glauber Costa wrote:
> There is no reason why the i386 and x86_64 code for rdtsc should be
> different. Unify them.

This makes the generated i386 assembly code far more complex (21 instructions
instead of 5).

> ---
>  cpu-all.h |   11 +----------
>  1 files changed, 1 insertions(+), 10 deletions(-)
>
> diff --git a/cpu-all.h b/cpu-all.h
> index 2a2b197..1c9e2a3 100644
> --- a/cpu-all.h
> +++ b/cpu-all.h
> @@ -930,16 +930,7 @@ static inline int64_t cpu_get_real_ticks(void)
>      return ((int64_t)h << 32) | l;
>  }
>
> -#elif defined(__i386__)
> -
> -static inline int64_t cpu_get_real_ticks(void)
> -{
> -    int64_t val;
> -    asm volatile ("rdtsc" : "=A" (val));
> -    return val;
> -}
> -
> -#elif defined(__x86_64__)
> +#elif defined(__i386__) || defined(__x86_64__)
>
>  static inline int64_t cpu_get_real_ticks(void)
>  {
> --
> 1.5.0.6

--
.''`.  Aurelien Jarno            | GPG: 1024D/F1BCDB73
: :' : Debian developer          | Electrical Engineer
`. `'  au...@de...                | aur...@au...
  `-   people.debian.org/~aurel32 | www.aurel32.net
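The size difference Aurelien measures comes from the operand constraints. On
i386 the "=A" constraint maps a 64-bit value directly onto the edx:eax pair
that rdtsc writes, so no further code is needed; the unified version builds
the result from two 32-bit halves, and on a 32-bit target that shift/or is
compiled as multi-instruction 64-bit arithmetic. For comparison (the unified
body is not quoted in the mail; this sketch assumes the usual two-register
form):

#include <stdint.h>

/* i386-only: "=A" binds the 64-bit result to edx:eax, which is
 * exactly where rdtsc leaves it, so this compiles to little more
 * than the rdtsc instruction itself. */
static inline int64_t cpu_get_real_ticks_i386(void)
{
    int64_t val;
    asm volatile ("rdtsc" : "=A" (val));
    return val;
}

/* Unified form: correct on both i386 and x86_64, but on i386 the
 * widening shift/or below is what inflates the generated code. */
static inline int64_t cpu_get_real_ticks_unified(void)
{
    uint32_t low, high;
    asm volatile ("rdtsc" : "=a" (low), "=d" (high));
    return ((int64_t)high << 32) | low;
}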
From: Alexander G. <ag...@su...> - 2008-05-03 21:03:29
On May 2, 2008, at 7:35 PM, Marcelo Tosatti wrote:
> Add 3 PCI bridges to the ACPI table:
> - Move IRQ routing, slot device and GPE processing to separate files
>   which can be included from acpi-dsdt.dsl.
> - Add _SUN methods to every slot device so as to avoid collisions
>   in OS handling.
> - Fix copy&paste typo in slot devices 8/9 and 24/25.
>
> This table breaks PCI hotplug for older userspace, hopefully not an
> issue (trivial enough to upgrade the BIOS).
>
> Signed-off-by: Marcelo Tosatti <mto...@re...>
>
> Index: kvm-userspace.pci3/bios/acpi-dsdt.dsl
> ===================================================================
> --- kvm-userspace.pci3.orig/bios/acpi-dsdt.dsl
> +++ kvm-userspace.pci3/bios/acpi-dsdt.dsl
> @@ -208,218 +208,29 @@ DefinitionBlock (
>              Name (_HID, EisaId ("PNP0A03"))
>              Name (_ADR, 0x00)
>              Name (_UID, 1)
> -            Name(_PRT, Package() {
> -                /* PCI IRQ routing table, example from ACPI 2.0a
>                     specification, section 6.2.8.1 */
> -                /* Note: we provide the same info as the PCI routing
> -                   table of the Bochs BIOS */
> -
> -                // PCI Slot 0
> -                Package() {0x0000ffff, 0, LNKD, 0},
> -                Package() {0x0000ffff, 1, LNKA, 0},
> -                Package() {0x0000ffff, 2, LNKB, 0},
> -                Package() {0x0000ffff, 3, LNKC, 0},

[ ... snip ... ]

> -                // PCI Slot 31
> -                Package() {0x001fffff, 0, LNKC, 0},
> -                Package() {0x001fffff, 1, LNKD, 0},
> -                Package() {0x001fffff, 2, LNKA, 0},
> -                Package() {0x001fffff, 3, LNKB, 0},
> -            })
> +
> +            Include ("acpi-irq-routing.dsl")
>
>              OperationRegion(PCST, SystemIO, 0xae00, 0x08)
>              Field (PCST, DWordAcc, NoLock, WriteAsZeros)
> -            {
>                  PCIU, 32,
>                  PCID, 32,
> -            }
> -
> +            }

Are these whitespace patches supposed to be here?

>              OperationRegion(SEJ, SystemIO, 0xae08, 0x04)
>              Field (SEJ, DWordAcc, NoLock, WriteAsZeros)
>              {
>                  B0EJ, 32,
>              }
>
> +            Device (S0) { // Slot 0
> +                Name (_ADR, 0x00000000)
> +                Method (_EJ0,1) {
> +                    Store(0x1, B0EJ)
> +                    Return (0x0)
> +                }
> +            }
> +

I'm having trouble understanding the semantics of the Sx devices here. What
is this S0, S1 and S2 device? Maybe different names would make everything
more understandable.

>              Device (S1) { // Slot 1
>                  Name (_ADR, 0x00010000)
>                  Method (_EJ0,1) {
> @@ -436,28 +247,70 @@ DefinitionBlock (
>                  }
>              }
>
> -            Device (S3) { // Slot 3
> +            Device (S3) { // Slot 3, PCI-to-PCI bridge

This device could be called BRI1 for example. That would make reading the
DSDT a lot easier.

>                  Name (_ADR, 0x00030000)
> -                Method (_EJ0,1) {
> -                    Store (0x8, B0EJ)
> -                    Return (0x0)
> +                Include ("acpi-irq-routing.dsl")
> +
> +                OperationRegion(PCST, SystemIO, 0xae0c, 0x08)
> +                Field (PCST, DWordAcc, NoLock, WriteAsZeros)
> +                {
> +                    PCIU, 32,
> +                    PCID, 32,
>                  }
> +
> +                OperationRegion(SEJ, SystemIO, 0xae14, 0x04)
> +                Field (SEJ, DWordAcc, NoLock, WriteAsZeros)
> +                {
> +                    B1EJ, 32,
> +                }
> +
> +                Name (SUN1, 30)
> +                Alias (\_SB.PCI0.S3.B1EJ, BEJ)
> +                Include ("acpi-pci-slots.dsl")

[ ... snip ... ]

>              Method(_L05) {
>                  Return(0x01)
> Index: kvm-userspace.pci3/bios/acpi-hotplug-gpe.dsl
> ===================================================================
> --- /dev/null
> +++ kvm-userspace.pci3/bios/acpi-hotplug-gpe.dsl
> @@ -0,0 +1,257 @@
> +    /* Up status */
> +    If (And(UP, 0x1)) {
> +        Notify(S0, 0x1)
> +    }

While this is proper syntax, I prefer the way Fabrice wrote the tables. Most
of his entries were one-lined, even though they wouldn't end up like that
when getting decompiled. In this case I'd vote for something like:

    If (And(UP, 0x1)) { Notify(S0, 0x1) }

which makes things easier to read again. The same goes for a lot of code
below that chunk.

> +
> +    If (And(UP, 0x2)) {
> +        Notify(S1, 0x1)
> +    }
> +

[ ... snip ... ]

> Index: kvm-userspace.pci3/bios/acpi-pci-slots.dsl
> ===================================================================
> --- /dev/null
> +++ kvm-userspace.pci3/bios/acpi-pci-slots.dsl
> @@ -0,0 +1,385 @@
> +    Device (S0) { // Slot 0
> +        Name (_ADR, 0x00000000)
> +        Method (_EJ0,1) {

Hmm ... I never assumed anything could be wrong here, but doesn't that 1 mean
there is one argument to the method? From the ACPI Specification:

    Method(_EJ0, 1){ //Hot docking support
    //Arg0: 0=insert, 1=eject

So we aren't using this information? What else do we use? Sorry if I missed
something.

> +            Store(0x1, BEJ)
> +            Return (0x0)
> +        }
> +        Method(_SUN) {
> +            Add (SUN1, 0, Local0)
> +            Return (Local0)
> +        }
> +    }

Same comment here. I don't like copy&paste code that goes over a lot of
lines. Can't you simply do some helper methods that do what _EJ0 and _SUN do
in a generic manner and Return that? I'd imagine something like:

    Device (S0) { // Slot 0
        Name (_ADR, 0x00000000)
        Method (_EJ0,1) {
            Return( GEJ0(0x1) )
        }
        Method(_SUN) {
            Return( GSUN(0) )
        }
    }

This looks way easier to read to me and keeps generic things generic and not
copy&pasted.

Nevertheless this is a nice approach, which will definitely show that we need
to think about interrupt routing properly ;-).

Alex
From: Alexander G. <ag...@su...> - 2008-05-03 20:29:38
On May 2, 2008, at 5:35 PM, Marcelo Tosatti wrote:
> On Fri, May 02, 2008 at 04:55:24PM +0200, Alexander Graf wrote:
>> Hi,
>>
>> in the DSDT there are two different ways of defining how an interrupt
>> is supposed to be routed. Currently we are using the LNKA - LNKD
>> method, which afaict is for legacy support.
>> The other method is to directly tell the operating system which APIC
>> pin the device is attached to. We can get that information from the
>> very same entry the LNKA to LNKD pseudo devices receive it from.
>>
>> For now this does not give any obvious improvement. It does leave
>> room for more advanced mappings, with several IOAPICs that can handle
>> more devices separately. This might help when we have a lot of
>> devices, as currently all devices sit on two interrupt lanes.
>>
>> More importantly (for me) though, Darwin enables the APIC mode
>> unconditionally, so it won't easily run in legacy mode.
>
> Hi Alexander,

Hi Marcelo,

> I'm just about to resend the patchset to add 3 PCI bridges, which
> already adds the _SUN method appropriately. Please rebase the APRT
> patch on top of that.

Sure, unfortunately I probably won't find the time to do so (or even have a
closer look at your patches) until the end of this week.

Alex

> Thanks!
From: Adrian B. <bu...@ke...> - 2008-05-03 20:28:11
Commit c45a6816c19dee67b8f725e6646d428901a6dc24 (virtio: explicit
advertisement of driver features) and commit
e976a2b997fc4ad70ccc53acfe62811c4aaec851 (s390: KVM guest: virtio device
support, and kvm hypercalls) don't like each other:

<-- snip -->

...
  CC      drivers/s390/kvm/kvm_virtio.o
/home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/s390/kvm/kvm_virtio.c:224: error: unknown field 'feature' specified in initializer
/home/bunk/linux/kernel-2.6/git/linux-2.6/drivers/s390/kvm/kvm_virtio.c:224: warning: initialization from incompatible pointer type
make[3]: *** [drivers/s390/kvm/kvm_virtio.o] Error 1

<-- snip -->

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out of the darkness.
There had been need of rain for many days.
"Only a promise," Lao Er said.
                                     Pearl S. Buck - Dragon Seed
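The clash is presumably (an assumption, the mail does not quote the code)
that the first commit replaced the per-bit 'feature' callback in struct
virtio_config_ops with a get_features() hook, while the s390 code merged in
the second commit still initializes the old member. A compilable sketch of
that shape of breakage and its fix:

#include <stdint.h>

struct virtio_device; /* opaque stub, for illustration only */

/* Post-c45a6816 shape (assumed): the per-bit feature callback is gone,
 * replaced by a hook returning the whole feature bitmap at once. */
struct virtio_config_ops {
    uint32_t (*get_features)(struct virtio_device *vdev);
};

static uint32_t kvm_get_features(struct virtio_device *vdev)
{
    (void)vdev; /* a real implementation reads the host's feature bits */
    return 0;
}

/* Initializing a removed member, e.g. ".feature = kvm_feature", is what
 * produces "unknown field 'feature' specified in initializer"; switching
 * the initializer to the new hook resolves it: */
static const struct virtio_config_ops kvm_vq_configspace_ops = {
    .get_features = kvm_get_features,
};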
From: Andrea A. <an...@qu...> - 2008-05-03 12:00:05
I updated the mmu notifier patch. There was a problem in shutting down the vm
in mmu_notifier_release instead of waiting for the last filp->release,
because all vcpus will be freed and the filp->private_data will point to
already freed memory when the vcpu fd is closed.

It seems we may really not need to do anything in the mmu notifier release
method, because for us the shadow pagetables are meaningless if no guest can
run, and the ->release method is only invoked when all tasks with
current->mm == kvm->mm have already quit. After that the guest can't possibly
run anymore, and the ioctl becomes useless too. So I changed the code to only
invalidate the root of the spte radix tree in every vcpu, just for debugging.
If any guest attempts to run after mmu notifier release runs, we'll then
notice. No spte can be established after ->release returns.

Probably ->release shouldn't be mandatory to implement, but from a different
point of view it may also pay off to make all methods mandatory to implement,
as a micro-optimization to avoid the null pointer check before invoking the
notifier (and in the future to also fail registration if the API is extended
and a module isn't updated, to decrease the risk of runtime failure).

Signed-off-by: Andrea Arcangeli <an...@qu...>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8d45fab..ce3251c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
     tristate "Kernel-based Virtual Machine (KVM) support"
     depends on HAVE_KVM
     select PREEMPT_NOTIFIERS
+    select MMU_NOTIFIER
     select ANON_INODES
     ---help---
       Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3d769c3..978da9b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -651,6 +651,101 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
         account_shadowed(kvm, gfn);
 }
 
+static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+    u64 *spte, *curr_spte;
+    int need_tlb_flush = 0;
+
+    spte = rmap_next(kvm, rmapp, NULL);
+    while (spte) {
+        BUG_ON(!(*spte & PT_PRESENT_MASK));
+        rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+        curr_spte = spte;
+        spte = rmap_next(kvm, rmapp, spte);
+        rmap_remove(kvm, curr_spte);
+        set_shadow_pte(curr_spte, shadow_trap_nonpresent_pte);
+        need_tlb_flush = 1;
+    }
+    return need_tlb_flush;
+}
+
+int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+    int i;
+    int need_tlb_flush = 0;
+
+    /*
+     * If mmap_sem isn't taken, we can look the memslots with only
+     * the mmu_lock by skipping over the slots with userspace_addr == 0.
+     */
+    for (i = 0; i < kvm->nmemslots; i++) {
+        struct kvm_memory_slot *memslot = &kvm->memslots[i];
+        unsigned long start = memslot->userspace_addr;
+        unsigned long end;
+
+        /* mmu_lock protects userspace_addr */
+        if (!start)
+            continue;
+
+        end = start + (memslot->npages << PAGE_SHIFT);
+        if (hva >= start && hva < end) {
+            gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+            need_tlb_flush |= kvm_unmap_rmapp(kvm,
+                            &memslot->rmap[gfn_offset]);
+        }
+    }
+
+    return need_tlb_flush;
+}
+
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+    u64 *spte;
+    int young = 0;
+
+    spte = rmap_next(kvm, rmapp, NULL);
+    while (spte) {
+        int _young;
+        u64 _spte = *spte;
+        BUG_ON(!(_spte & PT_PRESENT_MASK));
+        _young = _spte & PT_ACCESSED_MASK;
+        if (_young) {
+            young = !!_young;
+            set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK);
+        }
+        spte = rmap_next(kvm, rmapp, spte);
+    }
+    return young;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+{
+    int i;
+    int young = 0;
+
+    /*
+     * If mmap_sem isn't taken, we can look the memslots with only
+     * the mmu_lock by skipping over the slots with userspace_addr == 0.
+     */
+    for (i = 0; i < kvm->nmemslots; i++) {
+        struct kvm_memory_slot *memslot = &kvm->memslots[i];
+        unsigned long start = memslot->userspace_addr;
+        unsigned long end;
+
+        /* mmu_lock protects userspace_addr */
+        if (!start)
+            continue;
+
+        end = start + (memslot->npages << PAGE_SHIFT);
+        if (hva >= start && hva < end) {
+            gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+            young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]);
+        }
+    }
+
+    return young;
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
@@ -1189,6 +1284,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
     int r;
     int largepage = 0;
     pfn_t pfn;
+    int mmu_seq;
 
     down_read(&current->mm->mmap_sem);
     if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) {
@@ -1196,6 +1292,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
         largepage = 1;
     }
 
+    mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq);
+    /* implicit mb(), we'll read before PT lock is unlocked */
     pfn = gfn_to_pfn(vcpu->kvm, gfn);
     up_read(&current->mm->mmap_sem);
@@ -1206,6 +1304,11 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
     }
 
     spin_lock(&vcpu->kvm->mmu_lock);
+    if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count)))
+        goto out_unlock;
+    smp_rmb();
+    if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq))
+        goto out_unlock;
     kvm_mmu_free_some_pages(vcpu);
     r = __direct_map(vcpu, v, write, largepage, gfn, pfn,
              PT32E_ROOT_LEVEL);
@@ -1213,6 +1316,11 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 
     return r;
+
+out_unlock:
+    spin_unlock(&vcpu->kvm->mmu_lock);
+    kvm_release_pfn_clean(pfn);
+    return 0;
 }
 
@@ -1230,9 +1338,9 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
     int i;
     struct kvm_mmu_page *sp;
 
-    if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
-        return;
     spin_lock(&vcpu->kvm->mmu_lock);
+    if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
+        goto out;
     if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
         hpa_t root = vcpu->arch.mmu.root_hpa;
 
@@ -1240,9 +1348,7 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
         --sp->root_count;
         if (!sp->root_count && sp->role.invalid)
             kvm_mmu_zap_page(vcpu->kvm, sp);
-        vcpu->arch.mmu.root_hpa = INVALID_PAGE;
-        spin_unlock(&vcpu->kvm->mmu_lock);
-        return;
+        goto out_invalid;
     }
     for (i = 0; i < 4; ++i) {
         hpa_t root = vcpu->arch.mmu.pae_root[i];
@@ -1256,8 +1362,10 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
         }
         vcpu->arch.mmu.pae_root[i] = INVALID_PAGE;
     }
-    spin_unlock(&vcpu->kvm->mmu_lock);
+out_invalid:
     vcpu->arch.mmu.root_hpa = INVALID_PAGE;
+out:
+    spin_unlock(&vcpu->kvm->mmu_lock);
 }
 
 static void mmu_alloc_roots(struct kvm_vcpu *vcpu)
@@ -1340,6 +1448,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
     int r;
     int largepage = 0;
     gfn_t gfn = gpa >> PAGE_SHIFT;
+    int mmu_seq;
 
     ASSERT(vcpu);
     ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
@@ -1353,6 +1462,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
         gfn &= ~(KVM_PAGES_PER_HPAGE-1);
         largepage = 1;
     }
+    mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq);
+    /* implicit mb(), we'll read before PT lock is unlocked */
     pfn = gfn_to_pfn(vcpu->kvm, gfn);
     up_read(&current->mm->mmap_sem);
     if (is_error_pfn(pfn)) {
@@ -1360,12 +1471,22 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
         return 1;
     }
     spin_lock(&vcpu->kvm->mmu_lock);
+    if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count)))
+        goto out_unlock;
+    smp_rmb();
+    if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq))
+        goto out_unlock;
     kvm_mmu_free_some_pages(vcpu);
     r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK,
              largepage, gfn, pfn, kvm_x86_ops->get_tdp_level());
     spin_unlock(&vcpu->kvm->mmu_lock);
 
     return r;
+
+out_unlock:
+    spin_unlock(&vcpu->kvm->mmu_lock);
+    kvm_release_pfn_clean(pfn);
+    return 0;
 }
 
 static void nonpaging_free(struct kvm_vcpu *vcpu)
@@ -1621,18 +1742,20 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
     return !!(spte && (*spte & shadow_accessed_mask));
 }
 
-static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-                      const u8 *new, int bytes)
+static int mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+                     const u8 *new, int bytes,
+                     gfn_t *_gfn, pfn_t *_pfn,
+                     int *_mmu_seq, int *_largepage)
 {
     gfn_t gfn;
     int r;
     u64 gpte = 0;
     pfn_t pfn;
-
-    vcpu->arch.update_pte.largepage = 0;
+    int mmu_seq;
+    int largepage;
 
     if (bytes != 4 && bytes != 8)
-        return;
+        return 0;
 
     /*
      * Assume that the pte write on a page table of the same type
@@ -1645,7 +1768,7 @@ static int mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
     if ((bytes == 4) && (gpa % 4 == 0)) {
         r = kvm_read_guest(vcpu->kvm, gpa & ~(u64)7, &gpte, 8);
         if (r)
-            return;
+            return 0;
         memcpy((void *)&gpte + (gpa % 8), new, 4);
     } else if ((bytes == 8) && (gpa % 8 == 0)) {
         memcpy((void *)&gpte, new, 8);
@@ -1655,23 +1778,30 @@ static int mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
         memcpy((void *)&gpte, new, 4);
     }
     if (!is_present_pte(gpte))
-        return;
+        return 0;
     gfn = (gpte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 
+    largepage = 0;
     down_read(&current->mm->mmap_sem);
     if (is_large_pte(gpte) && is_largepage_backed(vcpu, gfn)) {
         gfn &= ~(KVM_PAGES_PER_HPAGE-1);
-        vcpu->arch.update_pte.largepage = 1;
+        largepage = 1;
     }
+    mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq);
+    /* implicit mb(), we'll read before PT lock is unlocked */
     pfn = gfn_to_pfn(vcpu->kvm, gfn);
     up_read(&current->mm->mmap_sem);
 
-    if (is_error_pfn(pfn)) {
+    if (unlikely(is_error_pfn(pfn))) {
         kvm_release_pfn_clean(pfn);
-        return;
+        return 0;
     }
-    vcpu->arch.update_pte.gfn = gfn;
-    vcpu->arch.update_pte.pfn = pfn;
+
+    *_gfn = gfn;
+    *_pfn = pfn;
+    *_mmu_seq = mmu_seq;
+    *_largepage = largepage;
+    return 1;
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -1694,9 +1824,24 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
     int npte;
     int r;
 
+    int update_pte;
+    gfn_t gpte_gfn;
+    pfn_t pfn;
+    int mmu_seq;
+    int largepage;
+
     pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
-    mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes);
+    update_pte = mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes,
+                           &gpte_gfn, &pfn,
+                           &mmu_seq, &largepage);
     spin_lock(&vcpu->kvm->mmu_lock);
+    if (update_pte) {
+        BUG_ON(!is_error_pfn(vcpu->arch.update_pte.pfn));
+        vcpu->arch.update_pte.gfn = gpte_gfn;
+        vcpu->arch.update_pte.pfn = pfn;
+        vcpu->arch.update_pte.mmu_seq = mmu_seq;
+        vcpu->arch.update_pte.largepage = largepage;
+    }
     kvm_mmu_free_some_pages(vcpu);
     ++vcpu->kvm->stat.mmu_pte_write;
     kvm_mmu_audit(vcpu, "pre pte write");
@@ -1775,11 +1920,11 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
         }
     }
     kvm_mmu_audit(vcpu, "post pte write");
-    spin_unlock(&vcpu->kvm->mmu_lock);
     if (!is_error_pfn(vcpu->arch.update_pte.pfn)) {
         kvm_release_pfn_clean(vcpu->arch.update_pte.pfn);
         vcpu->arch.update_pte.pfn = bad_pfn;
     }
+    spin_unlock(&vcpu->kvm->mmu_lock);
 }
 
 int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 156fe10..4ac73a6 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -263,6 +263,12 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page,
     pfn = vcpu->arch.update_pte.pfn;
     if (is_error_pfn(pfn))
         return;
+    if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count)))
+        return;
+    smp_rmb();
+    if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) !=
+             vcpu->arch.update_pte.mmu_seq))
+        return;
     kvm_get_pfn(pfn);
     mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0,
              gpte & PT_DIRTY_MASK, NULL, largepage, gpte_to_gfn(gpte),
@@ -380,6 +386,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
     int r;
     pfn_t pfn;
     int largepage = 0;
+    int mmu_seq;
 
     pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
     kvm_mmu_audit(vcpu, "pre page fault");
@@ -413,6 +420,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
             largepage = 1;
         }
     }
+    mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq);
+    /* implicit mb(), we'll read before PT lock is unlocked */
     pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
     up_read(&current->mm->mmap_sem);
 
@@ -424,6 +433,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
     }
 
     spin_lock(&vcpu->kvm->mmu_lock);
+    if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_count)))
+        goto out_unlock;
+    smp_rmb();
+    if (unlikely(atomic_read(&vcpu->kvm->arch.mmu_notifier_seq) != mmu_seq))
+        goto out_unlock;
     kvm_mmu_free_some_pages(vcpu);
     shadow_pte = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
                   largepage, &write_pt, pfn);
@@ -439,6 +453,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
     spin_unlock(&vcpu->kvm->mmu_lock);
 
     return write_pt;
+
+out_unlock:
+    spin_unlock(&vcpu->kvm->mmu_lock);
+    kvm_release_pfn_clean(pfn);
+    return 0;
 }
 
 static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 979f983..ceb8dee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -27,6 +27,7 @@
 #include <linux/module.h>
 #include <linux/mman.h>
 #include <linux/highmem.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/uaccess.h>
 #include <asm/msr.h>
@@ -3888,16 +3889,127 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
     free_page((unsigned long)vcpu->arch.pio_data);
 }
 
-struct kvm *kvm_arch_create_vm(void)
+static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 {
-    struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+    struct kvm_arch *kvm_arch;
+    kvm_arch = container_of(mn, struct kvm_arch, mmu_notifier);
+    return container_of(kvm_arch, struct kvm, arch);
+}
 
-    if (!kvm)
-        return ERR_PTR(-ENOMEM);
+static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
+                         struct mm_struct *mm,
+                         unsigned long address)
+{
+    struct kvm *kvm = mmu_notifier_to_kvm(mn);
+    int need_tlb_flush;
 
-    INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+    /*
+     * When ->invalidate_page runs, the linux pte has been zapped
+     * already but the page is still allocated until
+     * ->invalidate_page returns. So if we increase the sequence
+     * here the kvm page fault will notice if the spte can't be
+     * established because the page is going to be freed. If
+     * instead the kvm page fault establishes the spte before
+     * ->invalidate_page runs, kvm_unmap_hva will release it
+     * before returning.
+
+     * No need of memory barriers as the sequence increase only
+     * need to be seen at spin_unlock time, and not at spin_lock
+     * time.
+     *
+     * Increasing the sequence after the spin_unlock would be
+     * unsafe because the kvm page fault could then establish the
+     * pte after kvm_unmap_hva returned, without noticing the page
+     * is going to be freed.
+     */
+    atomic_inc(&kvm->arch.mmu_notifier_seq);
+    spin_lock(&kvm->mmu_lock);
+    need_tlb_flush = kvm_unmap_hva(kvm, address);
+    spin_unlock(&kvm->mmu_lock);
 
-    return kvm;
+    /* we've to flush the tlb before the pages can be freed */
+    if (need_tlb_flush)
+        kvm_flush_remote_tlbs(kvm);
+
+}
+
+static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+                            struct mm_struct *mm,
+                            unsigned long start,
+                            unsigned long end)
+{
+    struct kvm *kvm = mmu_notifier_to_kvm(mn);
+    int need_tlb_flush = 0;
+
+    /*
+     * The count increase must become visible at unlock time as no
+     * spte can be established without taking the mmu_lock and
+     * count is also read inside the mmu_lock critical section.
+     */
+    atomic_inc(&kvm->arch.mmu_notifier_count);
+
+    spin_lock(&kvm->mmu_lock);
+    for (; start < end; start += PAGE_SIZE)
+        need_tlb_flush |= kvm_unmap_hva(kvm, start);
+    spin_unlock(&kvm->mmu_lock);
+
+    /* we've to flush the tlb before the pages can be freed */
+    if (need_tlb_flush)
+        kvm_flush_remote_tlbs(kvm);
+}
+
+static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
+                          struct mm_struct *mm,
+                          unsigned long start,
+                          unsigned long end)
+{
+    struct kvm *kvm = mmu_notifier_to_kvm(mn);
+    /*
+     *
+     * This sequence increase will notify the kvm page fault that
+     * the page that is going to be mapped in the spte could have
+     * been freed.
+     *
+     * There's also an implicit mb() here in this comment,
+     * provided by the last PT lock taken to zap pagetables, and
+     * that the read side has to take too in follow_page(). The
+     * sequence increase in the worst case will become visible to
+     * the kvm page fault after the spin_lock of the last PT lock
+     * of the last PT-lock-protected critical section preceeding
+     * invalidate_range_end. So if the kvm page fault is about to
+     * establish the spte inside the mmu_lock, while we're freeing
+     * the pages, it will have to backoff and when it retries, it
+     * will have to take the PT lock before it can check the
+     * pagetables again. And after taking the PT lock it will
+     * re-establish the pte even if it will see the already
+     * increased sequence number before calling gfn_to_pfn.
+     */
+    atomic_inc(&kvm->arch.mmu_notifier_seq);
+    /*
+     * The sequence increase must be visible before count
+     * decrease. The page fault has to read count before sequence
+     * for this write order to be effective.
+     */
+    wmb();
+    atomic_dec(&kvm->arch.mmu_notifier_count);
+    BUG_ON(atomic_read(&kvm->arch.mmu_notifier_count) < 0);
+}
+
+static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
+                          struct mm_struct *mm,
+                          unsigned long address)
+{
+    struct kvm *kvm = mmu_notifier_to_kvm(mn);
+    int young;
+
+    spin_lock(&kvm->mmu_lock);
+    young = kvm_age_hva(kvm, address);
+    spin_unlock(&kvm->mmu_lock);
+
+    if (young)
+        kvm_flush_remote_tlbs(kvm);
+
+    return young;
 }
 
 static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
@@ -3907,16 +4019,62 @@ static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
     vcpu_put(vcpu);
 }
 
-static void kvm_free_vcpus(struct kvm *kvm)
+static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
+                     struct mm_struct *mm)
 {
+    struct kvm *kvm = mmu_notifier_to_kvm(mn);
     unsigned int i;
 
+    BUG_ON(mm != kvm->mm);
+
     /*
-     * Unpin any mmu pages first.
+     * All tasks with current->mm == mm quit and guest and the
+     * ioctls can only run on tasks with current->mm == mm, so all
+     * shadow pagebles are already meaningless because no guest
+     * can run anymore at this point. We don't really need to, but
+     * we can set the roots invalid here just to be more strict.
      */
     for (i = 0; i < KVM_MAX_VCPUS; ++i)
         if (kvm->vcpus[i])
             kvm_unload_vcpu_mmu(kvm->vcpus[i]);
+}
+
+static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+    .release                = kvm_mmu_notifier_release,
+    .invalidate_page        = kvm_mmu_notifier_invalidate_page,
+    .invalidate_range_start = kvm_mmu_notifier_invalidate_range_start,
+    .invalidate_range_end   = kvm_mmu_notifier_invalidate_range_end,
+    .clear_flush_young      = kvm_mmu_notifier_clear_flush_young,
+};
+
+struct kvm *kvm_arch_create_vm(void)
+{
+    struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+    int err;
+
+    if (!kvm)
+        return ERR_PTR(-ENOMEM);
+
+    INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+
+    kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+    err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+    if (err) {
+        kfree(kvm);
+        return ERR_PTR(err);
+    }
+
+    return kvm;
+}
+
+static void kvm_free_vcpus(struct kvm *kvm)
+{
+    unsigned int i;
+
+    for (i = 0; i < KVM_MAX_VCPUS; ++i)
+        if (kvm->vcpus[i])
+            BUG_ON(kvm->vcpus[i]->arch.mmu.root_hpa !=
+                   INVALID_PAGE);
     for (i = 0; i < KVM_MAX_VCPUS; ++i) {
         if (kvm->vcpus[i]) {
             kvm_arch_vcpu_free(kvm->vcpus[i]);
@@ -3931,6 +4089,12 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
     kvm_free_pit(kvm);
     kfree(kvm->arch.vpic);
     kfree(kvm->arch.vioapic);
+    /*
+     * kvm_mmu_notifier_release() will be called before
+     * mmu_notifier_unregister returns, if it didn't run
+     * already.
+ */ + mmu_notifier_unregister(&kvm->arch.mmu_notifier, kvm->mm); kvm_free_vcpus(kvm); kvm_free_physmem(kvm); if (kvm->arch.apic_access_page) diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h index 1d8cd01..b9a1421 100644 --- a/include/asm-x86/kvm_host.h +++ b/include/asm-x86/kvm_host.h @@ -13,6 +13,7 @@ #include <linux/types.h> #include <linux/mm.h> +#include <linux/mmu_notifier.h> #include <linux/kvm.h> #include <linux/kvm_para.h> @@ -247,6 +248,7 @@ struct kvm_vcpu_arch { gfn_t gfn; /* presumed gfn during guest pte update */ pfn_t pfn; /* pfn corresponding to that gfn */ int largepage; + int mmu_seq; } update_pte; struct i387_fxsave_struct host_fx_image; @@ -317,6 +319,10 @@ struct kvm_arch{ struct page *ept_identity_pagetable; bool ept_identity_pagetable_done; + + struct mmu_notifier mmu_notifier; + atomic_t mmu_notifier_seq; + atomic_t mmu_notifier_count; }; struct kvm_vm_stat { @@ -441,6 +447,8 @@ void kvm_mmu_set_base_ptes(u64 base_pte); void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 dirty_mask, u64 nx_mask, u64 x_mask); +int kvm_unmap_hva(struct kvm *kvm, unsigned long hva); +int kvm_age_hva(struct kvm *kvm, unsigned long hva); int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); void kvm_mmu_zap_all(struct kvm *kvm); |
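At the core of the patch above is a sequence-count retry protocol between the notifier side and the kvm page-fault side: the fault path samples mmu_notifier_seq before calling gfn_to_pfn(), then re-checks under mmu_lock that no invalidate is in flight (mmu_notifier_count is zero) and that the sequence hasn't moved before it installs the spte. The following is a minimal userspace model of that ordering, with C11 seq_cst atomics standing in for the explicit smp_rmb()/wmb() pairing and a pthread mutex standing in for mmu_lock; all names here are illustrative, not the kernel's:

/* Userspace model of the mmu_notifier seq/count retry protocol.
 * Illustrative sketch only, not kernel code. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int notifier_seq;    /* bumped by every invalidate */
static atomic_int notifier_count;  /* > 0 while a range invalidate runs */
static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;

/* models invalidate_range_start: count up, zap under the lock */
static void range_start(void)
{
	/* count must be visible before sptes are zapped under the lock */
	atomic_fetch_add(&notifier_count, 1);
	pthread_mutex_lock(&mmu_lock);
	/* ... zap sptes covering [start, end) here ... */
	pthread_mutex_unlock(&mmu_lock);
	/* ... the caller flushes the TLB before freeing pages ... */
}

/* models invalidate_range_end: seq up, then count down */
static void range_end(void)
{
	/* seq increase must be visible before the count decrease */
	atomic_fetch_add(&notifier_seq, 1);
	atomic_fetch_sub(&notifier_count, 1);
}

/* models the page-fault path: 1 if the mapping was installed, 0 to retry */
static int page_fault(void)
{
	int seq = atomic_load(&notifier_seq);
	/* ... resolve gfn -> pfn here (may sleep, may race) ... */
	pthread_mutex_lock(&mmu_lock);
	if (atomic_load(&notifier_count) ||
	    atomic_load(&notifier_seq) != seq) {
		/* an invalidate ran (or is running) concurrently: back off */
		pthread_mutex_unlock(&mmu_lock);
		return 0;
	}
	/* ... install the spte here: the pfn is still valid ... */
	pthread_mutex_unlock(&mmu_lock);
	return 1;
}

int main(void)
{
	range_start();
	range_end();
	while (!page_fault())
		;	/* a real fault path would re-resolve the pfn */
	puts("mapping installed");
	return 0;
}

The count catches an invalidation that is still running; the sequence catches one that started and completed entirely between the pfn lookup and the lock acquisition, which the count alone would miss.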
From: Jack S. <st...@sg...> - 2008-05-03 11:09:04
|
On Fri, May 02, 2008 at 05:05:03PM +0200, Andrea Arcangeli wrote:
> Hello everyone,
>
> 1/11 is the latest version of the mmu-notifier-core patch.
>
> As usual all later 2-11/11 patches follow, but those aren't meant for
> 2.6.26.
>

Not sure why -mm is different, but I get compile errors w/o the following...

--- jack

Index: linux/mm/mmu_notifier.c
===================================================================
--- linux.orig/mm/mmu_notifier.c	2008-05-02 16:54:52.780576831 -0500
+++ linux/mm/mmu_notifier.c	2008-05-02 16:56:38.817719509 -0500
@@ -16,6 +16,7 @@
 #include <linux/srcu.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
+#include <linux/rculist.h>
 
 /*
  * This function can't run concurrently against mmu_notifier_register
|
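For what it's worth, the failure Jack hits is most likely the -mm split of the RCU list helpers out of <linux/list.h> into <linux/rculist.h>: mmu_notifier.c walks its notifier list with the RCU hlist primitives, which in such trees are only declared in the new header. A minimal sketch of the pattern that breaks without the include follows; the subscriber structure and function names are illustrative, not the mmu_notifier code itself:

/*
 * Illustrative sketch only. The point is that RCU hlist primitives such
 * as hlist_add_head_rcu() live in <linux/rculist.h> in trees where the
 * split has happened, so any file using them must include it.
 */
#include <linux/rculist.h>
#include <linux/spinlock.h>

struct subscriber {
	struct hlist_node hlist;
};

static HLIST_HEAD(subscriber_list);
static DEFINE_SPINLOCK(subscriber_lock);

static void subscriber_add(struct subscriber *s)
{
	spin_lock(&subscriber_lock);
	hlist_add_head_rcu(&s->hlist, &subscriber_list); /* needs rculist.h */
	spin_unlock(&subscriber_lock);
}

Trees where <linux/list.h> still provides the RCU variants compile either way, which would explain why the build only breaks against -mm.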
From: Balaji R. <bal...@gm...> - 2008-05-03 08:27:08
|
On Friday 02 May 2008 12:43:31 am Marcelo Tosatti wrote:

Hi Guillaume,

With your patch applied, the Ubuntu 8.04 live CD fails to boot. It's not
any better with Marcelo's patch on top.

exception 13 (33)
rax 000000000000007f rbx 0000000000800000 rcx 0000000000000000 rdx 0000000000000000
rsi 000000000005a81c rdi 000000000005a820 rsp 00000000fffa97cc rbp 000000000000200c
r8  0000000000000000 r9  0000000000000000 r10 0000000000000000 r11 0000000000000000
r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 r15 0000000000000000
rip 000000000000b02c rflags 00033882
cs 4004 (00040040/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ds 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
es 4004 (00040040/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ss 5881 (00058810/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
fs 3002 (00030020/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
tr 0000 (fffbd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0)
ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
gdt 40920/47
idt 0/ffff
cr0 10 cr2 0 cr3 0 cr4 0 cr8 0
efer 0
code: 10 28 6d 01 28 1e 01 28 6d 01 28 1f 01 28 6d 01 28 73 01 17 --> 0f 28 6d 01 28 74 01 17 0f 17 3b 28 6d 01 28 75 01 17 0f 28 6d 01 28 76 01 17 0f 11 1c 17
Aborted

--
Warm Regards,

Balaji Rao
Dept. of Mechanical Engineering
NITK
|