kvm-devel Mailing List for kernel virtual machine (Page 27)

Brought to you by: avik, mtosatti

kvm-devel — kernel virtual machine development

You can subscribe to this list here.

2006	_Jan	_Feb	_Mar	_Apr	_May	_Jun	_Jul	_Aug	_Sep	_Oct (33)	_Nov (325)	_Dec (320)
2007	_Jan (484)	_Feb (438)	_Mar (407)	_Apr (713)	_May (831)	_Jun (806)	_Jul (1023)	_Aug (1184)	_Sep (1118)	_Oct (1461)	_Nov (1224)	_Dec (1042)
2008	_Jan (1449)	_Feb (1110)	_Mar (1428)	_Apr (1643)	_May (682)	_Jun	_Jul	_Aug	_Sep	_Oct	_Nov	_Dec

Flat | Threaded

<< < 1 .. 25 26 27 28 29 .. 703 > >> (Page 27 of 703)

Re: [kvm-devel] Wiki: Add SMP Count to "Guest Support Status"

From: Avi K. <av...@qu...> - 2008-05-02 10:39:07

Fabian Deutsch wrote:
> Avi Kivity wrote:
>   
>> Fabian Deutsch wrote:
>>     
>>> Hey.
>>>
>>> I've been trying Microsoft Windows 2003 a couple of times. The wiki
>>> tells me that "everything" should work okay. It does, when using -smp 1,
>>> but gets ugly when using -smp 2 or so.
>>>
>>> SO might it be useful, to add the column "smp" to the "Guest Support
>>> Status" Page in the wiki?
>>>   
>>>       
>> SMP Windows work best if you have FlexPriority on your hardware.  What 
>> host cpu are you using?
>>     
>
> In general I am not able to install Microsoft Windows guests when using
> -smp > 1 on the following hardware (and kvm modules+userspace head):
> Intel(R) Xeon(R) CPU           X3210  @ 2.13GHz
>
>   

What do you mean "not able"?  Does the installer hang?  bluescreen? where?

I don't think it has flexpriority, but not sure.  Will add code so 
people can find out.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] [PATCH 2/2] KVM: Handle interrupts for PCI passthrough devices

From: Avi K. <av...@qu...> - 2008-05-02 10:35:32

Amit Shah wrote:
>   
>>> +static irqreturn_t
>>> +kvm_pci_pt_dev_intr(int irq, void *dev_id)
>>>       
>>> +{
>>> +	struct kvm_pci_pt_dev_list *match;
>>> +	struct kvm *kvm = (struct kvm *) dev_id;
>>> +
>>> +	if (!test_bit(irq, pt_irq_handled))
>>> +		return IRQ_NONE;
>>> +
>>> +	if (test_bit(irq, pt_irq_pending))
>>> +		return IRQ_HANDLED;
>>>       
>> Will the interrupt not fire immediately after this returns?
>>     
>
> Hmm. This is just an optimisation so that we don't have to look up the list 
> each time to find out which assigned device it is and (re)injecting the 
> interrupt. Also we avoid the (TODO) getting/releasing locks which will be 
> needed for the list lookup.
>
> Disabling interrupts for PCI devices isn't a good idea even if we don't 
> support shared interrupts. Any other ideas to avoid this from happening?
>
>   

I don't understand.  These are level-triggered interrupts, so if one 
fires and you don't disable it, it will fire again and again.

Seems to me the only choice here is to mask the interrupt at the ioapic 
level, wait until the guest acks the interrupt, then unmask the interrupt.

With the current code, how to the guest interrupt counters and the host 
interrupt counters compare?


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] qemu/kvm seems to take ALSA all for itself?

From: Avi K. <av...@qu...> - 2008-05-02 10:20:01

David Abrahams wrote:
> Jon <iroquoi <at> gmail.com> writes:
>
>   
>> I use:
>>
>> export QEMU_AUDIO_DRV=alsa
>> export QEMU_AUDIO_DAC_FIXED_FREQ=48000
>> export QEMU_AUDIO_ADC_FIXED_FREQ=48000
>> export QEMU_ALSA_DAC_BUFFER_SIZE=16384
>>
>> Buffer size is very important, else it crackles and pops for me.
>>     
>
> Unfortunately with my upgrade to Ubuntu Hardy this has stopped working; I can
> put off the effect by playing a test tone in linux, but Qemu again takes over
> the sound system completely the first time it succeeds in making noise.  Maybe
> this has something to do with the addition of *yet another* audio layer in Hardy
> (PulseAudio?)

What does your /etc/alsa/alsa.conf look like?  Also, please remove any 
user-local alsa configuration files you may have inherited from the 
previous installation.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] kvm-67: kernel panic while booting debian-40r3-i386-businesscard.iso

From: Avi K. <av...@qu...> - 2008-05-02 10:12:30

Jan Luebbe wrote:
> Hi!
>
> I'm preparing kvm-67 for debian. While testing i noticed a problem:
>
> When booting the debian installer from the official CD [1] this problem:
>   

> Call Trace:                     
>  [<c0146d5b>] kmem_cache_create+0x15e/0x410
> Code: c3 57 56 53 89 c6 9c 5f fa 8b 08 83 39 00 74 12 c7 41 0c 01 00 00
> 00 8b 01
>  48 89 01 8b 5c 81 10 eb 07 e8 a5 fb ff ff 89 c3 57 9d <0f> 0d 0b 90 85
> db 74 1b
>  8b 56 10 31 c0 89 d1 c1 e9 02 89 df f3
> EIP: [<c01467be>] kmem_cache_zalloc+0x2a/0x53 SS:ESP 0068:c030ff80
>  <0>Kernel panic - not syncing: Attempted to kill the idle task!
>   
0f 0d 0b                prefetchw (%ebx)

This is an AMD 3Dnow! instruction, which is not supported on Intel 
processors.  I guess the 3Dnow! cpuid bit leaked in via the qemu merge.

I guess two fixes are needed:
- remove the 3Dnow! bit
- add emulation for prefetchw (easy, as it doesn't need to do anything) 
to support live migration from AMD to Intel

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] [PATCH]: Fake MSR_K7 performance counters

From: Avi K. <av...@qu...> - 2008-05-02 10:00:24

Chris Lalancette wrote:
> Chris Lalancette wrote:
>   
>> Attached is a patch that fixes a guest crash when booting older Linux kernels.
>> The problem stems from the fact that we are currently emulating
>> MSR_K7_EVNTSEL[0-3], but not emulating MSR_K7_PERFCTR[0-3].  Because of this,
>> setup_k7_watchdog() in the Linux kernel receives a GPF when it attempts to write
>> into MSR_K7_PERFCTR, which causes an OOPs.
>>
>> The patch fixes it by just "fake" emulating the appropriate MSRs, throwing away
>> the data in the process.  This causes the NMI watchdog to not actually work, but
>> it's not such a big deal in a virtualized environment.
>>
>> Tested by myself on a RHEL-4 guest, and Joerg Roedel on a Windows XP 64-bit guest.
>>     
>
> Avi,
>      Do you mind applying this patch for me (unless you see something wrong with
> it, of course)?
>
>   

Sorry, was behind on my email. Please add a ratelimited printk() if 
nonzero data is written, so that we know that the guest is using 
partially virtualized features.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] State of debug register emulation

From: Avi K. <av...@qu...> - 2008-05-02 09:55:11

Jan Kiszka wrote:
> Avi Kivity wrote:
>   
>> Jan Kiszka wrote:
>>     
>>> This still leaves me with the question how to handle the case when the
>>> host sets and arms some debug registers to debug the guest and the
>>> latter does the same to debug itself. Guest access will be trapped, OK,
>>> but KVM will then have to decide which value should actually be
>>> transfered into the registers. Hmm, does SVM virtualizes all debug
>>> registers, leaving the real ones to the host?
>>>   
>>>       
>> There's no way this can work.  There are still only four debug
>> registers, and the guest and host together can ask for eight different
>> addresses.  It is theoretically doable by hiding all mappings to pages
>> that are debug targets, but it would probably double the kvm code size.
>>
>> A good short-term compomise is to abort if the guest starts using
>> enabling a debug address register.  A better solution might be to place
>> host debug addresses into unused guest debug registers, so that as long
>> as nr_guest_debug + nr_host_debug <= 4, we can still proceed.
>>     
>
> I tried the latter, but we cannot cleanly share DR7 between both users.
>   

I actually think we can, but...

> Thus I'm now going for a prioritized approach: debug register will stop
> to have any effect for the guest as soon as the host starts to use them.
> That's far simpler the implement and also easier to understand for the user.
>
>   

Agreed, having a simple model is preferred here, both from the user's 
point of view and from a code complexity point of view. If you're 
debugging a debugger use plain qemu.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] [PATCH] libkvm: fix physical_memory calculation

From: Avi K. <av...@qu...> - 2008-05-02 09:49:09

Jan Kiszka wrote:
> This looks bogus, but it is so far without practical impact (phys_start
> is always 0 when we do the calculation).
>
> Signed-off-by: Jan Kiszka <jan...@si...>
> ---
>  libkvm/libkvm.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: b/libkvm/libkvm.c
> ===================================================================
> --- a/libkvm/libkvm.c
> +++ b/libkvm/libkvm.c
> @@ -550,7 +550,7 @@ int kvm_register_userspace_phys_mem(kvm_
>  	int r;
>  
>  	if (!kvm->physical_memory)
> -		kvm->physical_memory = userspace_addr - phys_start;
> +		kvm->physical_memory = userspace_addr + phys_start;
>  
>   

I think it's correct. The intent (probably) was that 
kvm->physical_memory[x] would refer to the contents of physical memory 
address x.

In another way, it's incorrect, since nothing guarantees (now) that 
memory is contiguous.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] [PATCH] kvm_show_code for ROM code

From: Avi K. <av...@qu...> - 2008-05-02 09:44:44

Jan Kiszka wrote:
> Userland-located memory is not unconditionally available via
> kvm->physical_memory + guest_address. To let kvm_show_code also dump
> useful information when, e.g., some problem in ROM (BIOS...) occurs,
> this patch tries to obtain the memory content via the mmio_read
> callback. If the callback fails, the code byte is marked as invalid.
>
> This patch also removes the check for protected mode and dumps the code
> in any case - I didn't find the reason for this restriction.
>
>   

Applied, thanks.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] [PATCH] [RESEND] Update kernel/Makefile and remove x86 only entries

From: Avi K. <av...@qu...> - 2008-05-02 09:38:38

Jerone Young wrote:
> This patch removes static x86 entries and makes things work for multiple archs.
>
>   

Applied, thanks,

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] [PATCH] Handle vma regions with no backing page (v3)

From: Avi K. <av...@qu...> - 2008-05-02 09:33:24

Anthony Liguori wrote:
> This patch allows VMA's that contain no backing page to be used for guest
> memory.  This is a drop-in replacement for Ben-Ami's first page in his direct
> mmio series.  Here, we continue to allow mmio pages to be represented in the
> rmap.
>
> Since v1, I've taken into account Andrea's suggestions at using VM_PFNMAP
> instead of VM_IO and changed the BUG_ON to a return of bad_page.
>
> Since v2, I've incorporated comments from Avi about returning bad_page instead
> of NULL and fixed a typo spotted by Muli.
>
>   

Applied, thanks.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] [PATCH] Don't leak EPT identity page table

From: Avi K. <av...@qu...> - 2008-05-02 09:30:03

Anthony Liguori wrote:
> In vmx.c:alloc_identity_pagetable() we grab a reference to the EPT identity
> page table via gfn_to_page().  We never release this reference though.
>
> This patch releases the reference to this page on VM destruction.  I haven't
> tested this with EPT.
>   

Applied, thanks.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] fx_init schedule in atomic

From: Avi K. <av...@qu...> - 2008-05-02 09:28:46

Andrea Arcangeli wrote:
> This make sure not to schedule in atomic during fx_init. I also
> changed the name of fpu_init to fx_finit to avoid duplicating the name
> with fpu_init that is already used in the kernel, this makes grep
> simpler if nothing else.
>
>   

Applied, thanks. Dynamic allocation for the fpu state was introduced in 
2.6.26-rc, right?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

Re: [kvm-devel] State of debug register emulation

From: Jan K. <jan...@si...> - 2008-05-02 08:47:20

Avi Kivity wrote:
> Jan Kiszka wrote:
>> This still leaves me with the question how to handle the case when the
>> host sets and arms some debug registers to debug the guest and the
>> latter does the same to debug itself. Guest access will be trapped, OK,
>> but KVM will then have to decide which value should actually be
>> transfered into the registers. Hmm, does SVM virtualizes all debug
>> registers, leaving the real ones to the host?
>>   
> 
> There's no way this can work.  There are still only four debug
> registers, and the guest and host together can ask for eight different
> addresses.  It is theoretically doable by hiding all mappings to pages
> that are debug targets, but it would probably double the kvm code size.
> 
> A good short-term compomise is to abort if the guest starts using
> enabling a debug address register.  A better solution might be to place
> host debug addresses into unused guest debug registers, so that as long
> as nr_guest_debug + nr_host_debug <= 4, we can still proceed.

I tried the latter, but we cannot cleanly share DR7 between both users.
Thus I'm now going for a prioritized approach: debug register will stop
to have any effect for the guest as soon as the host starts to use them.
That's far simpler the implement and also easier to understand for the user.

A bit work remains, though, to clean up and enhance the DRx support in
KVM. And to test the changes (will contact you, Joerg, regarding SVM
tests). Stay tuned.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux

Re: [kvm-devel] [PATCH] kvm_show_code for ROM code

From: Jan K. <jan...@si...> - 2008-05-02 08:44:47

Avi Kivity wrote:
> Jan Kiszka wrote:
>> Userland-located ROM memory is not available via kvm->physical_memory +
>> guest_address. To let kvm_show_code also dump useful information when
>> some problem in ROM (BIOS...) occurs, this patch first tries to obtain
>> the memory content via the mmio_read callback - maybe not 100% clean,
>> but works at least for the QEMU use case. If the callback complains
>> about the given address, we then fall back to RAM access.
>>
>>   
> 
> kvm->physical_memory is actually broken, since nothing guarantees a 1:1
> (+offset) mapping.
> 
> Why not use ->mmio_read() all the time?  Sure it overloads the
> definition of mmio_read(), but worse things have happened.

That was my first approach as well, but then I became unsure if such an
overloading is acceptable. As it is now:

----------

Userland-located memory is not unconditionally available via
kvm->physical_memory + guest_address. To let kvm_show_code also dump
useful information when, e.g., some problem in ROM (BIOS...) occurs,
this patch tries to obtain the memory content via the mmio_read
callback. If the callback fails, the code byte is marked as invalid.

This patch also removes the check for protected mode and dumps the code
in any case - I didn't find the reason for this restriction.

Signed-off-by: Jan Kiszka <jan...@si...>
---
 libkvm/libkvm-x86.c |   23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

Index: b/libkvm/libkvm-x86.c
===================================================================
--- a/libkvm/libkvm-x86.c
+++ b/libkvm/libkvm-x86.c
@@ -393,14 +393,14 @@ int kvm_set_pit(kvm_context_t kvm, struc
 
 void kvm_show_code(kvm_context_t kvm, int vcpu)
 {
-#define CR0_PE_MASK	(1ULL<<0)
+#define SHOW_CODE_LEN 50
 	int fd = kvm->vcpu_fd[vcpu];
 	struct kvm_regs regs;
 	struct kvm_sregs sregs;
-	int r;
-	unsigned char code[50];
+	int r, n;
 	int back_offset;
-	char code_str[sizeof(code) * 3 + 1];
+	unsigned char code;
+	char code_str[SHOW_CODE_LEN * 3 + 1];
 	unsigned long rip;
 
 	r = ioctl(fd, KVM_GET_SREGS, &sregs);
@@ -408,9 +408,6 @@ void kvm_show_code(kvm_context_t kvm, in
 		perror("KVM_GET_SREGS");
 		return;
 	}
-	if (sregs.cr0 & CR0_PE_MASK)
-		return;
-
 	r = ioctl(fd, KVM_GET_REGS, &regs);
 	if (r == -1) {
 		perror("KVM_GET_REGS");
@@ -420,12 +417,16 @@ void kvm_show_code(kvm_context_t kvm, in
 	back_offset = regs.rip;
 	if (back_offset > 20)
 	    back_offset = 20;
-	memcpy(code, kvm->physical_memory + rip - back_offset, sizeof code);
 	*code_str = 0;
-	for (r = 0; r < sizeof code; ++r) {
-	    	if (r == back_offset)
+	for (n = -back_offset; n < SHOW_CODE_LEN-back_offset; ++n) {
+		if (n == 0)
 			strcat(code_str, " -->");
-		sprintf(code_str + strlen(code_str), " %02x", code[r]);
+		r = kvm->callbacks->mmio_read(kvm->opaque, rip + n, &code, 1);
+		if (r < 0) {
+			strcat(code_str, " xx");
+			continue;
+		}
+		sprintf(code_str + strlen(code_str), " %02x", code);
 	}
 	fprintf(stderr, "code:%s\n", code_str);
 }

[kvm-devel] Release: VMSeed (080430 experimental version)

From: Kuniyasu S. <k.s...@ai...> - 2008-05-02 03:09:09

Dear,

VMSeed(080430 experimental version) is released.
# VMSeed is a growable virtual disk image for VMs.
# VMs are VMware, VirtualBox, VirtualPC, Parallels, Xen, QEMU, KQEMU, and KVM.
# Guest OSes are KNOPPIX(511, 502, and 402), Plan9/Xen-DomU, and NetBSD/Xen-DomU.
HP: http://openlab.jp/oscircular/vmseed/
GuidePDF http://openlab.jp/oscircular/vmseed/VMSeed080430-E.pdf

# This topic will be discussed at the BOF of Ottawa Linux Symposium08.
# http://www.linuxsymposium.org/2008/view_abstract.php?content_key=231

---------------------------------------------------------------------
VMSeed is "an effective virtual disk image(Guest OS) for virtual
machine". The initial virtual disk includes bootloader, kernel and
miniroot only. The other disk image is downloaded from Internet and
saved to the virtual disk. So the virtual disk grows by use of the
guest OS and makes quick launch of applications. The important point
is that it downloads necessary block image only and reduces network
traffic and disk space.

VMSeed is based on "InetBoot" and it is independent of virtual
machine because it is self-organized OS. The current target virtual
machines are VMware, VirtualBox, VirtualPC, Parallels, Xen, QEMU,
KQEMU, and KVM. The current available OSes are KNOPPIX(511, 502, and
402), Plan9 on Xen-DomU, and NetBSD on Xen-DomU.

### Special Features ###
* Small initial virtual disk file.
VMSeed utilizes sparse virtual disk format of each virtual machine.
The initial disk image includes bootloader, kernel and miniroot only.
The root file system is obtained by Internet Virtual Disk.

* Virtual disk grows by use of the Guest OS.
The most parts are obtained via Internet with Internet Virtual Disk
"Trusted HTTP-FUSE CLOOP". The requested disk images are cached on
the local virtual disk and reused.

* The image (Guest OS) is independent of Virtual Machine and applied to many Virtual Machines.
The image (Guest OS) has auto-configuration mechanism and Internet
loopback device. It is self-organized OS which is based o "InetBoot".
Current applied virtual machines are VMware, VirtualBox, VirtualPC,
Parallels, Xen, QEMU, KQEMU, and KVM. Current Guest OS are
KNOPPIX(511, 502, and 402), Plan9 on Xen-DomU , and NetBSD on
Xen-DomU.

* Effective distribution of block images
The block images are downloaded from HTTP servers with GSLB (Global
Server Load Balance). The GSLB selects a suitable site among 3 EU
sites, 3 US sites and 7 Japanese sites.

### Usage ###
The usage is simple.
* Download a target virtual disk for virtual machine.
* Set up virtual machine to boot from the virtual disk file.
* Set up network environment.
* Boot the virtual machine. The GRUB Menu will appear. Select an operating system.

* Cached Block files
The virtual disk grows as use of the Guest OS because downloaded block
files are cached at /knxblock directory. The function prevents
redundant download and makes quick re-boot and re-launch of
applications.

# Personal Update
VMSeed is based on 1CD Linux "KNOPPIX". Most CD bootable OS can not
keep any change of files. KNOPPIX however has a mechanism to keep the
updates on a local storage. It is based on Union File System and keeps
the change of files. It works as COW (Copy On Write) and makes
over-write on the CD image.

### List of available VM and GuestOS ###

|KNOPPIX KNOPPIX KNOPPIX Xen206 Plan9 NetBSD Initial Disk size
| 511 501 402 Dom0 DomU DomU (virtual size is 2GB)
------------------------------------------------------------------------------
VMWare | OK OK OK OK OK OK 33MB
VirtualBox | OK OK OK NG NG NG 68MB
VirtualPC | OK OK OK NG NG NG 102MB
Parallels | OK OK OK OK OK OK 32MB
KVM | OK OK OK OK OK OK 31GB on Sparse FS
QEMU/KQEMU | OK OK OK OK OK OK 31GB on Sparse FS
Xen | OK OK OK NG NG NG 31GB on Sparse FS

### Known Problems ###
* Depend on the situation of server and network.
It is sensitive of network latency and load of the server because
the root file system is mounted by "HTTP-FUSE CLOOP".
The situation may change by rebooting because the load balancer
(GSLB) may select another site

### Download ###
* VMware; vmseed_080430.vmdk, 33MB, MD5: 106ea4fda6f2c692e3c312cc178f5da6
http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/VMWare/vmseed_080430.vmdk
* VirtualBox; vmseed_080430.vdi, 66MB, MD5: d1827c684c6299eb3511ffd9d69dfc02
http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/VirtualBox/vmseed_080430.vdi
* VirtualPC; vmseed_080430.vhd, 100MB, MD5: 11a29cb8220d8c4846af0324130801ba
http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/VirtualPC/vmseed_080430.vhd
* Parallels; vmseed_080430.hdd, 31MB, MD5: 4d97ab5264c8974be814ebacef97364c
http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/Parallels/vmseed_080430.hdd
* Xen/QEMU/KQEMU/KVM;vmseed_080430.tar.gz, 24MB, MD5:7f78cf33f6be8334078e655b3bb2cbf1
The image is created on a Space File System (ext3) and archived with "--space" option of tar.
http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/QEMU-KVM-Xen/vmseed_080430.tar.gz
* Files included VMSeed008430 23MB MD5: 4c216e8672aff195b9cd98ed5224570e
http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/vmseed_080430_files.tar.gz

### Related Works ###
[1] InetBoot: http://openlab.jp/oscircular/inetboot/
[2] LivePC of Moka5: http://www.moka5.com/
[3] Collective Project at Stanford: http://suif.stanford.edu/collective/
[4] MajoPac: http://www.mojopac.com/

------
suzaki

Re: [kvm-devel] problems running many guests

From: Marcelo T. <mto...@re...> - 2008-05-02 00:13:49

On Thu, May 01, 2008 at 06:00:44PM -0500, Karl Rister wrote:
> Hi
> 
> I have been trying to do some testing of a large number of guests (72) on a 
> big multi-node IBM box (8 sockets, 32 cores, 128GB) and I am having various 
> issues with the guests.  I can get the guests to boot, but then I start to 
> have problems.  Some guests appear to stall doing I/O and some become 
> unresponsive and spin their single vcpu at 100%.

Does -no-kvm-irqchip or -no-kvm-pit makes a difference? If not, please
grab kvm_stat --once output when that happens.

Also run "readprofile -r ; readprofile -m System-map-of-guest.map" with the 
host booted with "profile=kvm". Make sure all guests are running the same kernel
image. 

The profiling should be easier to understand if you have 1 guest spinning and 
remaining ones idle.

[kvm-devel] problems running many guests

From: Karl R. <km...@us...> - 2008-05-01 23:01:29

Hi

I have been trying to do some testing of a large number of guests (72) on a 
big multi-node IBM box (8 sockets, 32 cores, 128GB) and I am having various 
issues with the guests.  I can get the guests to boot, but then I start to 
have problems.  Some guests appear to stall doing I/O and some become 
unresponsive and spin their single vcpu at 100%.

Each guest is configured with 1 vcpu and 1000MB of memory.  The single virtual 
disk is backed by a LVM volume.  Both the guest and host are running custom 
kernels.

I have tried kvm-67, kvm-64, and kvm-62 (not functional at all).  I have 
cloned both the kvm and kvm-userspace repositories and am building the tagged 
changesets from each.

Here are a few of the various things I have tried: virtio and emulated devices 
for the nic and disk; mixed virtio and emulated devices; kvm-clock and 
clock=jiffies.

Any help in pinpointing the problem would be appreciated.

Thanks.

-- 
Karl Rister
IBM Linux Performance Team
km...@us...
(512) 838-1553 (t/l 678)

Re: [kvm-devel] qemu/kvm seems to take ALSA all for itself?

From: David A. <da...@bo...> - 2008-05-01 19:50:25

Jon <iroquoi <at> gmail.com> writes:

> 
> I use:
> 
> export QEMU_AUDIO_DRV=alsa
> export QEMU_AUDIO_DAC_FIXED_FREQ=48000
> export QEMU_AUDIO_ADC_FIXED_FREQ=48000
> export QEMU_ALSA_DAC_BUFFER_SIZE=16384
> 
> Buffer size is very important, else it crackles and pops for me.

Unfortunately with my upgrade to Ubuntu Hardy this has stopped working; I can
put off the effect by playing a test tone in linux, but Qemu again takes over
the sound system completely the first time it succeeds in making noise.  Maybe
this has something to do with the addition of *yet another* audio layer in Hardy
(PulseAudio?)

--
Dave Abrahams
Boost Consulting, Inc.
http://boost-consulting.com

Re: [kvm-devel] Protected mode transitions and big real mode... still an issue

From: Marcelo T. <mto...@re...> - 2008-05-01 19:11:10

Hi Guillaume,

On Tue, Apr 29, 2008 at 03:02:36PM +0200, Guillaume Thouvenin wrote:
> Hello,

<snip>

> -hda ~/disk_images/hd_50G.qcow2
> -cdrom /images_iso/openSUSE-10.3-GM-x86_64-mini.iso -boot d -s -m 1024
> 
> exception 13 (33) 
> rax 0000000000000673 rbx 0000000000800000 rcx 0000000000000000 
> rdx 00000000000013ca rsi 0000000000055e1c rdi 0000000000055e1d 
> rsp 00000000fffa0080 rbp 000000000000200b r8 0000000000000000 
> r9  0000000000000000 r10 0000000000000000 r11 0000000000000000 
> r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 
> r15 0000000000000000 rip 000000000000b071 rflags 00033092 
> cs 4004 (00040040/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) 
> ds 4004 (00040040/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) 
> es 00ff (00000ff0/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
> ss ff11 (000ff110/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) 
> fs 3002 (00030020/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) 
> gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) 
> tr 0000 (fffbd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0) 
> ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0) 
> gdt 40920/47 idt 0/ffff cr0 10 cr2 0 cr3 0 cr4 0 cr8 0 efer 0
> code: 17 06 29 4b 01 18 eb 18 a8 25 aa 19 28 4c 01 28 4d 01 01 17 -->
> 0f 17 0f 01 17 0f 17 12 01 17 2c 25 4b 19 21 00 02 17 1a 94 0a 76 67 61
> 3d 30 78 25 78 20 Aborted
> 
> It's strange because handle_vmentry_failure() is not called. I'm trying
> to see where is the problem, any comments are welcome

Not sure if this is the same problem you're seeing, but with your patch
Plan9 triggers:

exception 13 (6b)
rax 0000000000010010 rbx 0000000000000001 rcx 00000000f0012000 rdx 00000000000000a1
rsi 00000000f0101000 rdi 00000000f0009000 rsp 0000000000007bfc rbp 00000000f0001320
r8  0000000000000000 r9  0000000000000000 r10 0000000000000000 r11 0000000000000000
r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 r15 0000000000000000
rip 000000000000023e rflags 00033002
cs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ds 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
es 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ss 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
fs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
tr 0000 (fffbd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0)
ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
gdt 14000/4f
idt 0/3ff
cr0 10010 cr2 0 cr3 12000 cr4 d0 cr8 0 efer 0
code: 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff -->
00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0


The code sequence is:

    8235:       66                      data16
    8236:       0f 22 c0                mov    %eax,%cr0
    8239:       ea 3e 02 00 08 b8 00    ljmp   $0xb8,$0x800023e

So it switches to realmode and then does a ljmp. Problem is that you're
using the segment selector as a GDT index, but in realmode it should be
shifted left by 4 to determine the segment base address. Following patch
makes Plan9 happy.

Other than that, load_segment_descriptor() can return a positive error
on failure, should do a proper check.

Index: kvm/arch/x86/kvm/x86_emulate.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86_emulate.c
+++ kvm/arch/x86/kvm/x86_emulate.c
@@ -1755,7 +1755,10 @@ special_insn:
 			goto cannot_emulate;
 		}
 		sel = insn_fetch(u16, 2, c->eip);
-		if (load_segment_descriptor(ctxt->vcpu, sel, 9, VCPU_SREG_CS) < 0) {
+		if (ctxt->mode == X86EMUL_MODE_REAL) 
+			eip |= (sel << 4);
+		else if (load_segment_descriptor(ctxt->vcpu, sel, 9,
+						 VCPU_SREG_CS) < 0) {
 			DPRINTF("jmp far: Failed to load CS descriptor\n");
 			goto cannot_emulate;
 		}

[kvm-devel] mmu notifier-core v14->v15 diff for review

From: Andrea A. <an...@qu...> - 2008-05-01 18:12:59

Hello everyone,

this is the v14 to v15 difference to the mmu-notifier-core patch. This
is just for review of the difference, I'll post full v15 soon, please
review the diff in the meantime. Lots of those cleanups are thanks to
Andrew review on mmu-notifier-core in v14. He also spotted the
GFP_KERNEL allocation under spin_lock where DEBUG_SPINLOCK_SLEEP
failed to catch it until I enabled PREEMPT (GFP_KERNEL there was
perfectly safe with all patchset applied but not ok if only
mmu-notifier-core was applied). As usual that bug couldn't hurt
anybody unless the mmu notifiers were armed.

I also wrote a proper changelog to the mmu-notifier-core patch that I
will append before the v14->v15 diff:

Subject: mmu-notifier-core

With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In GRU case
there's no actual secondary pte and there's only a secondary tlb
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case of KVM where the a
secondary tlb miss will walk sptes in hardware and it will refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.

Currently we take a page_count pin on every page mapped by sptes, but
that means the pages can't be swapped whenever they're mapped by any
spte because they're part of the guest working set. Furthermore a spte
unmap event can immediately lead to a page to be freed when the pin is
released (so requiring the same complex and relatively slow tlb_gather
smp safe logic we have in zap_page_range and that can be avoided
completely if the spte unmap event doesn't require an unpin of the
page previously mapped in the secondary MMU).

The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
know when the VM is swapping or freeing or doing anything on the
primary MMU so that the secondary MMU code can drop sptes before the
pages are freed, avoiding all page pinning and allowing 100% reliable
swapping of guest physical address space. Furthermore it avoids the
code that teardown the mappings of the secondary MMU, to implement a
logic like tlb_gather in zap_page_range that would require many IPI to
flush other cpu tlbs, for each fixed number of spte unmapped.

To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings
will be invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an
updated spte or secondary-tlb-mapping on the copied page. Or it will
setup a readonly spte or readonly tlb mapping if it's a guest-read, if
it calls get_user_pages with write=0. This is just an example.

This allows to map any page pointed by any pte (and in turn visible in
the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU,
or an full MMU with both sptes and secondary-tlb like the
shadow-pagetable layer with kvm), or a remote DMA in software like
XPMEM (hence needing of schedule in XPMEM code to send the invalidate
to the remote node, while no need to schedule in kvm/gru as it's an
immediate event like invalidating primary-mmu pte).

At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.

Dependencies:

1) Introduces list_del_init_rcu and documents it (fixes a comment for
   list_del_rcu too)

2) mm_lock() to register the mmu notifier when the whole VM isn't
   doing anything with "mm". This allows mmu notifier users to keep
   track if the VM is in the middle of the invalidate_range_begin/end
   critical section with an atomic counter incraese in range_begin and
   decreased in range_end. No secondary MMU page fault is allowed to
   map any spte or secondary tlb reference, while the VM is in the
   middle of range_begin/end as any page returned by get_user_pages in
   that critical section could later immediately be freed without any
   further ->invalidate_page notification (invalidate_range_begin/end
   works on ranges and ->invalidate_page isn't called immediately
   before freeing the page). To stop all page freeing and pagetable
   overwrites the mmap_sem must be taken in write mode and all other
   anon_vma/i_mmap locks must be taken in virtual address order. The
   order is critical to avoid mm_lock(mm1) and mm_lock(mm2) running
   concurrently to trigger lock inversion deadlocks.

3) It'd be a waste to add branches in the VM if nobody could possibly
   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled
   if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
   advantage of mmu notifiers, but this already allows to compile a
   KVM external module against a kernel with mmu notifiers enabled and
   from the next pull from kvm.git we'll start using them. And
   GRU/XPMEM will also be able to continue the development by enabling
   KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code
   to the mainline kernel. Then they can also enable MMU_NOTIFIERS in
   the same way KVM does it (even if KVM=n). This guarantees nobody
   selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.

The mmu_notifier_register call can fail because mm_lock may not
allocate the required vmalloc space. See the comment on top of
mm_lock() implementation for the worst case memory requirements.
Because mmu_notifier_reigster is used when a driver startup, a failure
can be gracefully handled. Here an example of the change applied to
kvm to register the mmu notifiers. Usually when a driver startups
other allocations are required anyway and -ENOMEM failure paths exists
already.

 struct  kvm *kvm_arch_create_vm(void)
 {
        struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+       int err;

        if (!kvm)
                return ERR_PTR(-ENOMEM);

        INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

+       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+       if (err) {
+               kfree(kvm);
+               return ERR_PTR(err);
+       }
+
        return kvm;
 }

mmu_notifier_unregister returns void and it's reliable.

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -739,7 +739,7 @@ static inline void hlist_del(struct hlis
  * or hlist_del_rcu(), running on this same list.
  * However, it is perfectly legal to run concurrently with
  * the _rcu list-traversal primitives, such as
- * hlist_for_each_entry().
+ * hlist_for_each_entry_rcu().
  */
 static inline void hlist_del_rcu(struct hlist_node *n)
 {
@@ -755,6 +755,26 @@ static inline void hlist_del_init(struct
 	}
 }
 
+/**
+ * hlist_del_init_rcu - deletes entry from hash list with re-initialization
+ * @n: the element to delete from the hash list.
+ *
+ * Note: list_unhashed() on entry does return true after this. It is
+ * useful for RCU based read lockfree traversal if the writer side
+ * must know if the list entry is still hashed or already unhashed.
+ *
+ * In particular, it means that we can not poison the forward pointers
+ * that may still be used for walking the hash list and we can only
+ * zero the pprev pointer so list_unhashed() will return true after
+ * this.
+ *
+ * The caller must take whatever precautions are necessary (such as
+ * holding appropriate locks) to avoid racing with another
+ * list-mutation primitive, such as hlist_add_head_rcu() or
+ * hlist_del_rcu(), running on this same list.  However, it is
+ * perfectly legal to run concurrently with the _rcu list-traversal
+ * primitives, such as hlist_for_each_entry_rcu().
+ */
 static inline void hlist_del_init_rcu(struct hlist_node *n)
 {
 	if (!hlist_unhashed(n)) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1050,18 +1050,6 @@ extern int install_special_mapping(struc
 				   unsigned long addr, unsigned long len,
 				   unsigned long flags, struct page **pages);
 
-/*
- * mm_lock will take mmap_sem writably (to prevent all modifications
- * and scanning of vmas) and then also takes the mapping locks for
- * each of the vma to lockout any scans of pagetables of this address
- * space. This can be used to effectively holding off reclaim from the
- * address space.
- *
- * mm_lock can fail if there is not enough memory to store a pointer
- * array to all vmas.
- *
- * mm_lock and mm_unlock are expensive operations that may take a long time.
- */
 struct mm_lock_data {
 	spinlock_t **i_mmap_locks;
 	spinlock_t **anon_vma_locks;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -4,17 +4,24 @@
 #include <linux/list.h>
 #include <linux/spinlock.h>
 #include <linux/mm_types.h>
+#include <linux/srcu.h>
 
 struct mmu_notifier;
 struct mmu_notifier_ops;
 
 #ifdef CONFIG_MMU_NOTIFIER
-#include <linux/srcu.h>
 
+/*
+ * The mmu notifier_mm structure is allocated and installed in
+ * mm->mmu_notifier_mm inside the mm_lock() protected critical section
+ * and it's released only when mm_count reaches zero in mmdrop().
+ */
 struct mmu_notifier_mm {
+	/* all mmu notifiers registerd in this mm are queued in this list */
 	struct hlist_head list;
+	/* srcu structure for this mm */
 	struct srcu_struct srcu;
-	/* to serialize mmu_notifier_unregister against mmu_notifier_release */
+	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
 };
 
@@ -23,8 +30,8 @@ struct mmu_notifier_ops {
 	 * Called either by mmu_notifier_unregister or when the mm is
 	 * being destroyed by exit_mmap, always before all pages are
 	 * freed. It's mandatory to implement this method. This can
-	 * run concurrently to other mmu notifier methods and it
-	 * should teardown all secondary mmu mappings and freeze the
+	 * run concurrently with other mmu notifier methods and it
+	 * should tear down all secondary mmu mappings and freeze the
 	 * secondary mmu.
 	 */
 	void (*release)(struct mmu_notifier *mn,
@@ -43,9 +50,10 @@ struct mmu_notifier_ops {
 
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
-	 * read/write to the page previously pointed by the Linux pte
-	 * because the old page hasn't been freed yet.  If required
-	 * set_page_dirty has to be called internally to this method.
+	 * read/write to the page previously pointed to by the Linux
+	 * pte because the page hasn't been freed yet and it won't be
+	 * freed until this returns. If required set_page_dirty has to
+	 * be called internally to this method.
 	 */
 	void (*invalidate_page)(struct mmu_notifier *mn,
 				struct mm_struct *mm,
@@ -53,20 +61,18 @@ struct mmu_notifier_ops {
 
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
-	 * paired and are called only when the mmap_sem is held and/or
-	 * the semaphores protecting the reverse maps. Both functions
+	 * paired and are called only when the mmap_sem and/or the
+	 * locks protecting the reverse maps are held. Both functions
 	 * may sleep. The subsystem must guarantee that no additional
-	 * references to the pages in the range established between
-	 * the call to invalidate_range_start() and the matching call
-	 * to invalidate_range_end().
+	 * references are taken to the pages in the range established
+	 * between the call to invalidate_range_start() and the
+	 * matching call to invalidate_range_end().
 	 *
-	 * Invalidation of multiple concurrent ranges may be permitted
-	 * by the driver or the driver may exclude other invalidation
-	 * from proceeding by blocking on new invalidate_range_start()
-	 * callback that overlap invalidates that are already in
-	 * progress. Either way the establishment of sptes to the
-	 * range can only be allowed if all invalidate_range_stop()
-	 * function have been called.
+	 * Invalidation of multiple concurrent ranges may be
+	 * optionally permitted by the driver. Either way the
+	 * establishment of sptes is forbidden in the range passed to
+	 * invalidate_range_begin/end for the whole duration of the
+	 * invalidate_range_begin/end critical section.
 	 *
 	 * invalidate_range_start() is called when all pages in the
 	 * range are still mapped and have at least a refcount of one.
@@ -187,6 +193,14 @@ static inline void mmu_notifier_mm_destr
 		__mmu_notifier_mm_destroy(mm);
 }
 
+/*
+ * These two macros will sometime replace ptep_clear_flush.
+ * ptep_clear_flush is impleemnted as macro itself, so this also is
+ * implemented as a macro until ptep_clear_flush will converted to an
+ * inline function, to diminish the risk of compilation failure. The
+ * invalidate_page method over time can be moved outside the PT lock
+ * and these two macros can be later removed.
+ */
 #define ptep_clear_flush_notify(__vma, __address, __ptep)		\
 ({									\
 	pte_t __pte;							\
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -193,7 +193,3 @@ config VIRT_TO_BUS
 config VIRT_TO_BUS
 	def_bool y
 	depends on !ARCH_NO_VIRT_TO_BUS
-
-config MMU_NOTIFIER
-	def_bool y
-	bool "MMU notifier, for paging KVM/RDMA"
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -613,6 +613,12 @@ int copy_page_range(struct mm_struct *ds
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	/*
+	 * We need to invalidate the secondary MMU mappings only when
+	 * there could be a permission downgrade on the ptes of the
+	 * parent mm. And a permission downgrade will only happen if
+	 * is_cow_mapping() returns true.
+	 */
 	if (is_cow_mapping(vma->vm_flags))
 		mmu_notifier_invalidate_range_start(src_mm, addr, end);
 
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2329,7 +2329,36 @@ static inline void __mm_unlock(spinlock_
  * operations that could ever happen on a certain mm. This includes
  * vmtruncate, try_to_unmap, and all page faults. The holder
  * must not hold any mm related lock. A single task can't take more
- * than one mm lock in a row or it would deadlock.
+ * than one mm_lock in a row or it would deadlock.
+ *
+ * The mmap_sem must be taken in write mode to block all operations
+ * that could modify pagetables and free pages without altering the
+ * vma layout (for example populate_range() with nonlinear vmas).
+ *
+ * The sorting is needed to avoid lock inversion deadlocks if two
+ * tasks run mm_lock at the same time on different mm that happen to
+ * share some anon_vmas/inodes but mapped in different order.
+ *
+ * mm_lock and mm_unlock are expensive operations that may have to
+ * take thousand of locks. Thanks to sort() the complexity is
+ * O(N*log(N)) where N is the number of VMAs in the mm. The max number
+ * of vmas is defined in /proc/sys/vm/max_map_count.
+ *
+ * mm_lock() can fail if memory allocation fails. The worst case
+ * vmalloc allocation required is 2*max_map_count*sizeof(spinlock *),
+ * so around 1Mbyte, but in practice it'll be much less because
+ * normally there won't be max_map_count vmas allocated in the task
+ * that runs mm_lock().
+ *
+ * The vmalloc memory allocated by mm_lock is stored in the
+ * mm_lock_data structure that must be allocated by the caller and it
+ * must be later passed to mm_unlock that will free it after using it.
+ * Allocating the mm_lock_data structure on the stack is fine because
+ * it's only a couple of bytes in size.
+ *
+ * If mm_lock() returns -ENOMEM no memory has been allocated and the
+ * mm_lock_data structure can be freed immediately, and mm_unlock must
+ * not be called.
  */
 int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
 {
@@ -2350,6 +2379,13 @@ int mm_lock(struct mm_struct *mm, struct
 			return -ENOMEM;
 		}
 
+		/*
+		 * When mm_lock_sort_anon_vma/i_mmap returns zero it
+		 * means there's no lock to take and so we can free
+		 * the array here without waiting mm_unlock. mm_unlock
+		 * will do nothing if nr_i_mmap/anon_vma_locks is
+		 * zero.
+		 */
 		data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks);
 		data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks);
 
@@ -2374,7 +2410,17 @@ static void mm_unlock_vfree(spinlock_t *
 	vfree(locks);
 }
 
-/* avoid memory allocations for mm_unlock to prevent deadlock */
+/*
+ * mm_unlock doesn't require any memory allocation and it won't fail.
+ *
+ * All memory has been previously allocated by mm_lock and it'll be
+ * all freed before returning. Only after mm_unlock returns, the
+ * caller is allowed to free and forget the mm_lock_data structure.
+ * 
+ * mm_unlock runs in O(N) where N is the max number of VMAs in the
+ * mm. The max number of vmas is defined in
+ * /proc/sys/vm/max_map_count.
+ */
 void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
 {
 	if (mm->map_count) {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -21,12 +21,12 @@
  * This function can't run concurrently against mmu_notifier_register
  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
  * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
- * in parallel despite there's no task using this mm anymore, through
- * the vmas outside of the exit_mmap context, like with
+ * in parallel despite there being no task using this mm any more,
+ * through the vmas outside of the exit_mmap context, such as with
  * vmtruncate. This serializes against mmu_notifier_unregister with
  * the mmu_notifier_mm->lock in addition to SRCU and it serializes
  * against the other mmu notifiers with SRCU. struct mmu_notifier_mm
- * can't go away from under us as exit_mmap holds a mm_count pin
+ * can't go away from under us as exit_mmap holds an mm_count pin
  * itself.
  */
 void __mmu_notifier_release(struct mm_struct *mm)
@@ -41,7 +41,7 @@ void __mmu_notifier_release(struct mm_st
 				 hlist);
 		/*
 		 * We arrived before mmu_notifier_unregister so
-		 * mmu_notifier_unregister will do nothing else than
+		 * mmu_notifier_unregister will do nothing other than
 		 * to wait ->release to finish and
 		 * mmu_notifier_unregister to return.
 		 */
@@ -66,7 +66,11 @@ void __mmu_notifier_release(struct mm_st
 	spin_unlock(&mm->mmu_notifier_mm->lock);
 
 	/*
-	 * Wait ->release if mmu_notifier_unregister is running it.
+	 * synchronize_srcu here prevents mmu_notifier_release to
+	 * return to exit_mmap (which would proceed freeing all pages
+	 * in the mm) until the ->release method returns, if it was
+	 * invoked by mmu_notifier_unregister.
+	 *
 	 * The mmu_notifier_mm can't go away from under us because one
 	 * mm_count is hold by exit_mmap.
 	 */
@@ -144,8 +148,9 @@ void __mmu_notifier_invalidate_range_end
  * Must not hold mmap_sem nor any other VM related lock when calling
  * this registration function. Must also ensure mm_users can't go down
  * to zero while this runs to avoid races with mmu_notifier_release,
- * so mm has to be current->mm or the mm should be pinned safely like
- * with get_task_mm(). mmput can be called after mmu_notifier_register
+ * so mm has to be current->mm or the mm should be pinned safely such
+ * as with get_task_mm(). If the mm is not current->mm, the mm_users
+ * pin should be released by calling mmput after mmu_notifier_register
  * returns. mmu_notifier_unregister must be always called to
  * unregister the notifier. mm_count is automatically pinned to allow
  * mmu_notifier_unregister to safely run at any time later, before or
@@ -155,29 +160,29 @@ int mmu_notifier_register(struct mmu_not
 int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct mm_lock_data data;
+	struct mmu_notifier_mm * mmu_notifier_mm;
 	int ret;
 
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
+	ret = -ENOMEM;
+	mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
+	if (unlikely(!mmu_notifier_mm))
+		goto out;
+
+	ret = init_srcu_struct(&mmu_notifier_mm->srcu);
+	if (unlikely(ret))
+		goto out_kfree;
+
 	ret = mm_lock(mm, &data);
 	if (unlikely(ret))
-		goto out;
+		goto out_cleanup;
 
 	if (!mm_has_notifiers(mm)) {
-		mm->mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm),
-					      GFP_KERNEL);
-		ret = -ENOMEM;
-		if (unlikely(!mm_has_notifiers(mm)))
-			goto out_unlock;
-
-		ret = init_srcu_struct(&mm->mmu_notifier_mm->srcu);
-		if (unlikely(ret)) {
-			kfree(mm->mmu_notifier_mm);
-			mmu_notifier_mm_init(mm);
-			goto out_unlock;
-		}
-		INIT_HLIST_HEAD(&mm->mmu_notifier_mm->list);
-		spin_lock_init(&mm->mmu_notifier_mm->lock);
+		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
+		spin_lock_init(&mmu_notifier_mm->lock);
+		mm->mmu_notifier_mm = mmu_notifier_mm;
+		mmu_notifier_mm = NULL;
 	}
 	atomic_inc(&mm->mm_count);
 
@@ -192,8 +197,14 @@ int mmu_notifier_register(struct mmu_not
 	spin_lock(&mm->mmu_notifier_mm->lock);
 	hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
 	spin_unlock(&mm->mmu_notifier_mm->lock);
-out_unlock:
+
 	mm_unlock(mm, &data);
+out_cleanup:
+	if (mmu_notifier_mm)
+		cleanup_srcu_struct(&mmu_notifier_mm->srcu);
+out_kfree:
+	/* kfree() does nothing if mmu_notifier_mm is NULL */
+	kfree(mmu_notifier_mm);
 out:
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 	return ret;

[kvm-devel] fx_init schedule in atomic

From: Andrea A. <an...@qu...> - 2008-05-01 16:43:29

Hello,

This make sure not to schedule in atomic during fx_init. I also
changed the name of fpu_init to fx_finit to avoid duplicating the name
with fpu_init that is already used in the kernel, this makes grep
simpler if nothing else.

Signed-off-by: Andrea Arcangeli <an...@qu...>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 578a0c1..5398b1c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3701,10 +3702,19 @@ void fx_init(struct kvm_vcpu *vcpu)
 {
 	unsigned after_mxcsr_mask;
 
+	/*
+	 * Touch the fpu the first time in non atomic context as if
+	 * this is the first fpu instruction the exception handler
+	 * will fire before the instruction returns and it'll have to
+	 * allocate ram with GFP_KERNEL.
+	 */
+	if (!used_math())
+		fx_save(&vcpu->arch.host_fx_image);
+
 	/* Initialize guest FPU by resetting ours and saving into guest's */
 	preempt_disable();
 	fx_save(&vcpu->arch.host_fx_image);
-	fpu_init();
+	fx_finit();
 	fx_save(&vcpu->arch.guest_fx_image);
 	fx_restore(&vcpu->arch.host_fx_image);
 	preempt_enable();
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 4baa9c9..b9a1421 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -627,7 +635,7 @@ static inline void fx_restore(struct i387_fxsave_struct *image)
 	asm("fxrstor (%0)":: "r" (image));
 }
 
-static inline void fpu_init(void)
+static inline void fx_finit(void)
 {
 	asm("finit");
 }

Re: [kvm-devel] [PATCH]: Fake MSR_K7 performance counters

From: Chris L. <cla...@re...> - 2008-05-01 14:23:32

Chris Lalancette wrote:
> Attached is a patch that fixes a guest crash when booting older Linux kernels.
> The problem stems from the fact that we are currently emulating
> MSR_K7_EVNTSEL[0-3], but not emulating MSR_K7_PERFCTR[0-3].  Because of this,
> setup_k7_watchdog() in the Linux kernel receives a GPF when it attempts to write
> into MSR_K7_PERFCTR, which causes an OOPs.
> 
> The patch fixes it by just "fake" emulating the appropriate MSRs, throwing away
> the data in the process.  This causes the NMI watchdog to not actually work, but
> it's not such a big deal in a virtualized environment.
> 
> Tested by myself on a RHEL-4 guest, and Joerg Roedel on a Windows XP 64-bit guest.

Avi,
     Do you mind applying this patch for me (unless you see something wrong with
it, of course)?

Thanks,
Chris Lalancette

[kvm-devel] Wanted Urgently

From: Dr R. Hansmond. <han...@ya...> - 2008-05-01 13:53:57

新しいメールアドレスをお知らせします新しいメールアドレス： han...@ya...

I am seeking your cooperation in building a Tourist Hotel or Real Estate in your country.I need an experienced person like you to assist me to set up, develop the project.


Alternative Email:- han...@ya...


Thanks and God bless.

Dr Raymond Hansmond.

- Dr Raymond Hansmond.

Re: [kvm-devel] [PATCH] KVM x86: Handle hypercalls for assigned PCI devices

From: Amit S. <ami...@qu...> - 2008-05-01 13:17:00

On Tuesday 29 April 2008 21:28:51 Amit Shah wrote:
> On Tuesday 29 April 2008 20:14:16 Glauber Costa wrote:
> > Amit Shah wrote:

> > > +	if (find_pci_pt_dev(&vcpu->kvm->arch.pci_pt_dev_head,
> > > +			    &pci_pt_info, 0, KVM_PT_SOURCE_ASSIGN))
> > > +		r++; /* We have assigned the device */
> > > +
> > > +	kunmap(host_page);
> >
> > better use atomic mappings here.
>
> We can't use atomic mappings for guest pages. They can be swapped out.

Actually you were right: there's no sleeping call here after doing the 
mapping. I've updated this call with kmap_atomic.

The other function that uses kmap can't be converted since we continue to map 
several pages in a loop (depending on the length of the DMA region) and hence 
can't use kmap_atomic there.

Re: [kvm-devel] [PATCH] Don't leak EPT identity page table

From: Yang, S. <she...@in...> - 2008-05-01 08:47:51

On Thursday 01 May 2008 04:16:05 Anthony Liguori wrote:
> In vmx.c:alloc_identity_pagetable() we grab a reference to the EPT identity
> page table via gfn_to_page().  We never release this reference though.
>
> This patch releases the reference to this page on VM destruction.  I
> haven't tested this with EPT.
>
> Signed-off-by: Anthony Liguori <ali...@us...>
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 578a0c1..63f46cf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3909,6 +3909,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	kvm_free_physmem(kvm);
>  	if (kvm->arch.apic_access_page)
>  		put_page(kvm->arch.apic_access_page);
> +	if (kvm->arch.ept_identity_pagetable)
> +		put_page(kvm->arch.ept_identity_pagetable);
>  	kfree(kvm);
>  }

Um... I neglected that...Thanks for point it out!

-- 
Thanks
Yang, Sheng

157 messages has been excluded from this view by a project administrator.

Flat | Threaded

<< < 1 .. 25 26 27 28 29 .. 703 > >> (Page 27 of 703)

2006	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct (33)	Nov (325)	Dec (320)
2007	Jan (484)	Feb (438)	Mar (407)	Apr (713)	May (831)	Jun (806)	Jul (1023)	Aug (1184)	Sep (1118)	Oct (1461)	Nov (1224)	Dec (1042)
2008	Jan (1449)	Feb (1110)	Mar (1428)	Apr (1643)	May (682)	Jun	Jul	Aug	Sep	Oct	Nov	Dec