From: Avi K. <av...@qu...> - 2008-05-02 10:39:07
Fabian Deutsch wrote:
> Avi Kivity wrote:
>> Fabian Deutsch wrote:
>>> Hey.
>>>
>>> I've been trying Microsoft Windows 2003 a couple of times. The wiki
>>> tells me that "everything" should work okay. It does, when using -smp 1,
>>> but gets ugly when using -smp 2 or so.
>>>
>>> So might it be useful, to add the column "smp" to the "Guest Support
>>> Status" page in the wiki?
>>>
>> SMP Windows works best if you have FlexPriority on your hardware. What
>> host cpu are you using?
>
> In general I am not able to install Microsoft Windows guests when using
> -smp > 1 on the following hardware (and kvm modules+userspace head):
> Intel(R) Xeon(R) CPU X3210 @ 2.13GHz

What do you mean "not able"? Does the installer hang? bluescreen? where?

I don't think it has flexpriority, but not sure. Will add code so people can
find out.

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
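For reference, one rough way to check from the host is to read the
IA32_VMX_PROCBASED_CTLS2 MSR (0x48b) through the msr module: bit 0 of its
upper half reports whether the "virtualize APIC accesses" control that
FlexPriority builds on can be enabled. The sketch below assumes that MSR
layout and the /dev/cpu/0/msr interface; it is illustrative only, not actual
kvm code.

/* flexcheck.c - illustrative only; needs "modprobe msr" and root. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0 || pread(fd, &val, sizeof(val), 0x48b) != sizeof(val)) {
		perror("IA32_VMX_PROCBASED_CTLS2");
		return 1;	/* MSR absent: no secondary controls at all */
	}
	/* Upper 32 bits are the allowed-1 settings; bit 0 is
	 * "virtualize APIC accesses". */
	printf("virtualize APIC accesses: %s\n",
	       (val >> 32) & 1 ? "available" : "not available");
	close(fd);
	return 0;
}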
From: Avi K. <av...@qu...> - 2008-05-02 10:35:32
Amit Shah wrote:
>>> +static irqreturn_t
>>> +kvm_pci_pt_dev_intr(int irq, void *dev_id)
>>> +{
>>> +	struct kvm_pci_pt_dev_list *match;
>>> +	struct kvm *kvm = (struct kvm *) dev_id;
>>> +
>>> +	if (!test_bit(irq, pt_irq_handled))
>>> +		return IRQ_NONE;
>>> +
>>> +	if (test_bit(irq, pt_irq_pending))
>>> +		return IRQ_HANDLED;
>>>
>> Will the interrupt not fire immediately after this returns?
>
> Hmm. This is just an optimisation so that we don't have to look up the list
> each time to find out which assigned device it is and (re)injecting the
> interrupt. Also we avoid the (TODO) getting/releasing locks which will be
> needed for the list lookup.
>
> Disabling interrupts for PCI devices isn't a good idea even if we don't
> support shared interrupts. Any other ideas to avoid this from happening?

I don't understand. These are level-triggered interrupts, so if one fires and
you don't disable it, it will fire again and again.

Seems to me the only choice here is to mask the interrupt at the ioapic
level, wait until the guest acks the interrupt, then unmask the interrupt.

With the current code, how do the guest interrupt counters and the host
interrupt counters compare?

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
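Roughly, the mask/ack/unmask flow described above could look like the sketch
below; disable_irq_nosync()/enable_irq() stand in for masking at the ioapic,
and the injection and ack-notification helpers are hypothetical names, not
existing kvm interfaces.

/* Sketch of the mask-until-guest-ack idea; the kvm_* helpers are made up. */
static irqreturn_t assigned_dev_intr(int irq, void *dev_id)
{
	struct kvm *kvm = dev_id;

	/*
	 * Level-triggered line: if we return with it still asserted and
	 * unmasked, it fires again immediately.  Mask it on the host side
	 * and let the guest's ack re-enable it.
	 */
	disable_irq_nosync(irq);
	kvm_inject_assigned_irq(kvm, irq);	/* hypothetical */
	return IRQ_HANDLED;
}

/* Hypothetical callback, invoked once the guest EOIs the interrupt. */
static void assigned_dev_ack(struct kvm *kvm, int irq)
{
	enable_irq(irq);
}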
From: Avi K. <av...@qu...> - 2008-05-02 10:20:01
David Abrahams wrote:
> Jon <iroquoi <at> gmail.com> writes:
>
>> I use:
>>
>> export QEMU_AUDIO_DRV=alsa
>> export QEMU_AUDIO_DAC_FIXED_FREQ=48000
>> export QEMU_AUDIO_ADC_FIXED_FREQ=48000
>> export QEMU_ALSA_DAC_BUFFER_SIZE=16384
>>
>> Buffer size is very important, else it crackles and pops for me.
>
> Unfortunately with my upgrade to Ubuntu Hardy this has stopped working; I can
> put off the effect by playing a test tone in linux, but Qemu again takes over
> the sound system completely the first time it succeeds in making noise. Maybe
> this has something to do with the addition of *yet another* audio layer in
> Hardy (PulseAudio?)

What does your /etc/alsa/alsa.conf look like? Also, please remove any
user-local alsa configuration files you may have inherited from the previous
installation.

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
From: Avi K. <av...@qu...> - 2008-05-02 10:12:30
Jan Luebbe wrote:
> Hi!
>
> I'm preparing kvm-67 for debian. While testing i noticed a problem:
>
> When booting the debian installer from the official CD [1] this problem:
>
> Call Trace:
>  [<c0146d5b>] kmem_cache_create+0x15e/0x410
> Code: c3 57 56 53 89 c6 9c 5f fa 8b 08 83 39 00 74 12 c7 41 0c 01 00 00
> 00 8b 01 48 89 01 8b 5c 81 10 eb 07 e8 a5 fb ff ff 89 c3 57 9d <0f> 0d 0b
> 90 85 db 74 1b 8b 56 10 31 c0 89 d1 c1 e9 02 89 df f3
> EIP: [<c01467be>] kmem_cache_zalloc+0x2a/0x53 SS:ESP 0068:c030ff80
> <0>Kernel panic - not syncing: Attempted to kill the idle task!

0f 0d 0b		prefetchw (%ebx)

This is an AMD 3Dnow! instruction, which is not supported on Intel
processors. I guess the 3Dnow! cpuid bit leaked in via the qemu merge.

I guess two fixes are needed:

- remove the 3Dnow! bit
- add emulation for prefetchw (easy, as it doesn't need to do anything) to
  support live migration from AMD to Intel

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
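The prefetchw half of that could be as small as teaching the emulator to
decode 0f 0d and do nothing. A sketch against x86_emulate.c follows; the
exact opcode-table layout and flag names may differ from what is shown here.

/* In the two-byte opcode table: mark 0f 0d (prefetch/prefetchw) as valid. */
	[0x0d] = ImplicitOps | ModRM,

/* In the two-byte decode switch: no side effects, just consume it. */
	case 0x0d:	/* GrpP (prefetch/prefetchw) */
		break;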
From: Avi K. <av...@qu...> - 2008-05-02 10:00:24
Chris Lalancette wrote:
> Chris Lalancette wrote:
>
>> Attached is a patch that fixes a guest crash when booting older Linux
>> kernels. The problem stems from the fact that we are currently emulating
>> MSR_K7_EVNTSEL[0-3], but not emulating MSR_K7_PERFCTR[0-3]. Because of
>> this, setup_k7_watchdog() in the Linux kernel receives a GPF when it
>> attempts to write into MSR_K7_PERFCTR, which causes an OOPs.
>>
>> The patch fixes it by just "fake" emulating the appropriate MSRs, throwing
>> away the data in the process. This causes the NMI watchdog to not actually
>> work, but it's not such a big deal in a virtualized environment.
>>
>> Tested by myself on a RHEL-4 guest, and Joerg Roedel on a Windows XP
>> 64-bit guest.
>
> Avi,
>      Do you mind applying this patch for me (unless you see something wrong
> with it, of course)?

Sorry, was behind on my email.

Please add a ratelimited printk() if nonzero data is written, so that we know
that the guest is using partially virtualized features.

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
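In the wrmsr path that suggestion amounts to something like the fragment
below -- a sketch of the idea, not necessarily the exact patch that went in:

	/* Fake-emulate the K7 perf MSRs: accept and discard the data, but
	 * leave a rate-limited trace if the guest writes anything non-zero,
	 * so use of the partially virtualized feature shows up in dmesg. */
	case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
	case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
		if (data != 0 && printk_ratelimit())
			printk(KERN_INFO "kvm: unhandled perfctr/evntsel wrmsr: "
			       "0x%x data 0x%llx\n", msr, data);
		break;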
From: Avi K. <av...@qu...> - 2008-05-02 09:55:11
Jan Kiszka wrote:
> Avi Kivity wrote:
>> Jan Kiszka wrote:
>>> This still leaves me with the question how to handle the case when the
>>> host sets and arms some debug registers to debug the guest and the
>>> latter does the same to debug itself. Guest access will be trapped, OK,
>>> but KVM will then have to decide which value should actually be
>>> transfered into the registers. Hmm, does SVM virtualize all debug
>>> registers, leaving the real ones to the host?
>>>
>> There's no way this can work. There are still only four debug
>> registers, and the guest and host together can ask for eight different
>> addresses. It is theoretically doable by hiding all mappings to pages
>> that are debug targets, but it would probably double the kvm code size.
>>
>> A good short-term compromise is to abort if the guest starts enabling a
>> debug address register. A better solution might be to place host debug
>> addresses into unused guest debug registers, so that as long as
>> nr_guest_debug + nr_host_debug <= 4, we can still proceed.
>
> I tried the latter, but we cannot cleanly share DR7 between both users.

I actually think we can, but...

> Thus I'm now going for a prioritized approach: debug registers will stop
> to have any effect for the guest as soon as the host starts to use them.
> That's far simpler to implement and also easier to understand for the user.

Agreed, having a simple model is preferred here, both from the user's point
of view and from a code complexity point of view. If you're debugging a
debugger, use plain qemu.

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
From: Avi K. <av...@qu...> - 2008-05-02 09:49:09
Jan Kiszka wrote:
> This looks bogus, but it is so far without practical impact (phys_start
> is always 0 when we do the calculation).
>
> Signed-off-by: Jan Kiszka <jan...@si...>
> ---
>  libkvm/libkvm.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: b/libkvm/libkvm.c
> ===================================================================
> --- a/libkvm/libkvm.c
> +++ b/libkvm/libkvm.c
> @@ -550,7 +550,7 @@ int kvm_register_userspace_phys_mem(kvm_
>  	int r;
>
>  	if (!kvm->physical_memory)
> -		kvm->physical_memory = userspace_addr - phys_start;
> +		kvm->physical_memory = userspace_addr + phys_start;

I think it's correct. The intent (probably) was that kvm->physical_memory[x]
would refer to the contents of physical memory address x.

In another way, it's incorrect, since nothing guarantees (now) that memory is
contiguous.

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
From: Avi K. <av...@qu...> - 2008-05-02 09:44:44
Jan Kiszka wrote:
> Userland-located memory is not unconditionally available via
> kvm->physical_memory + guest_address. To let kvm_show_code also dump
> useful information when, e.g., some problem in ROM (BIOS...) occurs,
> this patch tries to obtain the memory content via the mmio_read
> callback. If the callback fails, the code byte is marked as invalid.
>
> This patch also removes the check for protected mode and dumps the code
> in any case - I didn't find the reason for this restriction.

Applied, thanks.

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
From: Avi K. <av...@qu...> - 2008-05-02 09:38:38
Jerone Young wrote:
> This patch removes static x86 entries and makes things work for multiple
> archs.

Applied, thanks,

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
From: Avi K. <av...@qu...> - 2008-05-02 09:33:24
Anthony Liguori wrote:
> This patch allows VMA's that contain no backing page to be used for guest
> memory. This is a drop-in replacement for Ben-Ami's first page in his direct
> mmio series. Here, we continue to allow mmio pages to be represented in the
> rmap.
>
> Since v1, I've taken into account Andrea's suggestions at using VM_PFNMAP
> instead of VM_IO and changed the BUG_ON to a return of bad_page.
>
> Since v2, I've incorporated comments from Avi about returning bad_page
> instead of NULL and fixed a typo spotted by Muli.

Applied, thanks.

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
From: Avi K. <av...@qu...> - 2008-05-02 09:30:03
Anthony Liguori wrote:
> In vmx.c:alloc_identity_pagetable() we grab a reference to the EPT identity
> page table via gfn_to_page(). We never release this reference though.
>
> This patch releases the reference to this page on VM destruction. I haven't
> tested this with EPT.

Applied, thanks.

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
From: Avi K. <av...@qu...> - 2008-05-02 09:28:46
Andrea Arcangeli wrote:
> This make sure not to schedule in atomic during fx_init. I also
> changed the name of fpu_init to fx_finit to avoid duplicating the name
> with fpu_init that is already used in the kernel, this makes grep
> simpler if nothing else.

Applied, thanks. Dynamic allocation for the fpu state was introduced in
2.6.26-rc, right?

--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
From: Jan K. <jan...@si...> - 2008-05-02 08:47:20
Avi Kivity wrote:
> Jan Kiszka wrote:
>> This still leaves me with the question how to handle the case when the
>> host sets and arms some debug registers to debug the guest and the
>> latter does the same to debug itself. Guest access will be trapped, OK,
>> but KVM will then have to decide which value should actually be
>> transfered into the registers. Hmm, does SVM virtualize all debug
>> registers, leaving the real ones to the host?
>
> There's no way this can work. There are still only four debug
> registers, and the guest and host together can ask for eight different
> addresses. It is theoretically doable by hiding all mappings to pages
> that are debug targets, but it would probably double the kvm code size.
>
> A good short-term compromise is to abort if the guest starts enabling a
> debug address register. A better solution might be to place host debug
> addresses into unused guest debug registers, so that as long as
> nr_guest_debug + nr_host_debug <= 4, we can still proceed.

I tried the latter, but we cannot cleanly share DR7 between both users.
Thus I'm now going for a prioritized approach: debug registers will stop
to have any effect for the guest as soon as the host starts to use them.
That's far simpler to implement and also easier to understand for the user.

A bit of work remains, though, to clean up and enhance the DRx support in
KVM. And to test the changes (will contact you, Joerg, regarding SVM tests).
Stay tuned.

Jan

--
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
From: Jan K. <jan...@si...> - 2008-05-02 08:44:47
Avi Kivity wrote: > Jan Kiszka wrote: >> Userland-located ROM memory is not available via kvm->physical_memory + >> guest_address. To let kvm_show_code also dump useful information when >> some problem in ROM (BIOS...) occurs, this patch first tries to obtain >> the memory content via the mmio_read callback - maybe not 100% clean, >> but works at least for the QEMU use case. If the callback complains >> about the given address, we then fall back to RAM access. >> >> > > kvm->physical_memory is actually broken, since nothing guarantees a 1:1 > (+offset) mapping. > > Why not use ->mmio_read() all the time? Sure it overloads the > definition of mmio_read(), but worse things have happened. That was my first approach as well, but then I became unsure if such an overloading is acceptable. As it is now: ---------- Userland-located memory is not unconditionally available via kvm->physical_memory + guest_address. To let kvm_show_code also dump useful information when, e.g., some problem in ROM (BIOS...) occurs, this patch tries to obtain the memory content via the mmio_read callback. If the callback fails, the code byte is marked as invalid. This patch also removes the check for protected mode and dumps the code in any case - I didn't find the reason for this restriction. Signed-off-by: Jan Kiszka <jan...@si...> --- libkvm/libkvm-x86.c | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) Index: b/libkvm/libkvm-x86.c =================================================================== --- a/libkvm/libkvm-x86.c +++ b/libkvm/libkvm-x86.c @@ -393,14 +393,14 @@ int kvm_set_pit(kvm_context_t kvm, struc void kvm_show_code(kvm_context_t kvm, int vcpu) { -#define CR0_PE_MASK (1ULL<<0) +#define SHOW_CODE_LEN 50 int fd = kvm->vcpu_fd[vcpu]; struct kvm_regs regs; struct kvm_sregs sregs; - int r; - unsigned char code[50]; + int r, n; int back_offset; - char code_str[sizeof(code) * 3 + 1]; + unsigned char code; + char code_str[SHOW_CODE_LEN * 3 + 1]; unsigned long rip; r = ioctl(fd, KVM_GET_SREGS, &sregs); @@ -408,9 +408,6 @@ void kvm_show_code(kvm_context_t kvm, in perror("KVM_GET_SREGS"); return; } - if (sregs.cr0 & CR0_PE_MASK) - return; - r = ioctl(fd, KVM_GET_REGS, ®s); if (r == -1) { perror("KVM_GET_REGS"); @@ -420,12 +417,16 @@ void kvm_show_code(kvm_context_t kvm, in back_offset = regs.rip; if (back_offset > 20) back_offset = 20; - memcpy(code, kvm->physical_memory + rip - back_offset, sizeof code); *code_str = 0; - for (r = 0; r < sizeof code; ++r) { - if (r == back_offset) + for (n = -back_offset; n < SHOW_CODE_LEN-back_offset; ++n) { + if (n == 0) strcat(code_str, " -->"); - sprintf(code_str + strlen(code_str), " %02x", code[r]); + r = kvm->callbacks->mmio_read(kvm->opaque, rip + n, &code, 1); + if (r < 0) { + strcat(code_str, " xx"); + continue; + } + sprintf(code_str + strlen(code_str), " %02x", code); } fprintf(stderr, "code:%s\n", code_str); } |
From: Kuniyasu S. <k.s...@ai...> - 2008-05-02 03:09:09
Dear, VMSeed(080430 experimental version) is released. # VMSeed is a growable virtual disk image for VMs. # VMs are VMware, VirtualBox, VirtualPC, Parallels, Xen, QEMU, KQEMU, and KVM. # Guest OSes are KNOPPIX(511, 502, and 402), Plan9/Xen-DomU, and NetBSD/Xen-DomU. HP: http://openlab.jp/oscircular/vmseed/ GuidePDF http://openlab.jp/oscircular/vmseed/VMSeed080430-E.pdf # This topic will be discussed at the BOF of Ottawa Linux Symposium08. # http://www.linuxsymposium.org/2008/view_abstract.php?content_key=231 --------------------------------------------------------------------- VMSeed is "an effective virtual disk image(Guest OS) for virtual machine". The initial virtual disk includes bootloader, kernel and miniroot only. The other disk image is downloaded from Internet and saved to the virtual disk. So the virtual disk grows by use of the guest OS and makes quick launch of applications. The important point is that it downloads necessary block image only and reduces network traffic and disk space. VMSeed is based on "InetBoot" and it is independent of virtual machine because it is self-organized OS. The current target virtual machines are VMware, VirtualBox, VirtualPC, Parallels, Xen, QEMU, KQEMU, and KVM. The current available OSes are KNOPPIX(511, 502, and 402), Plan9 on Xen-DomU, and NetBSD on Xen-DomU. ### Special Features ### * Small initial virtual disk file. VMSeed utilizes sparse virtual disk format of each virtual machine. The initial disk image includes bootloader, kernel and miniroot only. The root file system is obtained by Internet Virtual Disk. * Virtual disk grows by use of the Guest OS. The most parts are obtained via Internet with Internet Virtual Disk "Trusted HTTP-FUSE CLOOP". The requested disk images are cached on the local virtual disk and reused. * The image (Guest OS) is independent of Virtual Machine and applied to many Virtual Machines. The image (Guest OS) has auto-configuration mechanism and Internet loopback device. It is self-organized OS which is based o "InetBoot". Current applied virtual machines are VMware, VirtualBox, VirtualPC, Parallels, Xen, QEMU, KQEMU, and KVM. Current Guest OS are KNOPPIX(511, 502, and 402), Plan9 on Xen-DomU , and NetBSD on Xen-DomU. * Effective distribution of block images The block images are downloaded from HTTP servers with GSLB (Global Server Load Balance). The GSLB selects a suitable site among 3 EU sites, 3 US sites and 7 Japanese sites. ### Usage ### The usage is simple. * Download a target virtual disk for virtual machine. * Set up virtual machine to boot from the virtual disk file. * Set up network environment. * Boot the virtual machine. The GRUB Menu will appear. Select an operating system. * Cached Block files The virtual disk grows as use of the Guest OS because downloaded block files are cached at /knxblock directory. The function prevents redundant download and makes quick re-boot and re-launch of applications. # Personal Update VMSeed is based on 1CD Linux "KNOPPIX". Most CD bootable OS can not keep any change of files. KNOPPIX however has a mechanism to keep the updates on a local storage. It is based on Union File System and keeps the change of files. It works as COW (Copy On Write) and makes over-write on the CD image. 
### List of available VM and GuestOS ###

            |KNOPPIX KNOPPIX KNOPPIX Xen206  Plan9  NetBSD  Initial Disk size
            |  511     501     402    Dom0    DomU   DomU   (virtual size is 2GB)
------------------------------------------------------------------------------
VMWare      |   OK      OK      OK     OK      OK     OK    33MB
VirtualBox  |   OK      OK      OK     NG      NG     NG    68MB
VirtualPC   |   OK      OK      OK     NG      NG     NG    102MB
Parallels   |   OK      OK      OK     OK      OK     OK    32MB
KVM         |   OK      OK      OK     OK      OK     OK    31GB on Sparse FS
QEMU/KQEMU  |   OK      OK      OK     OK      OK     OK    31GB on Sparse FS
Xen         |   OK      OK      OK     NG      NG     NG    31GB on Sparse FS

### Known Problems ###

* Depends on the situation of the server and network.
  It is sensitive to network latency and load on the server because the root
  file system is mounted by "HTTP-FUSE CLOOP". The situation may change on
  reboot because the load balancer (GSLB) may select another site.

### Download ###

* VMware; vmseed_080430.vmdk, 33MB, MD5: 106ea4fda6f2c692e3c312cc178f5da6
  http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/VMWare/vmseed_080430.vmdk

* VirtualBox; vmseed_080430.vdi, 66MB, MD5: d1827c684c6299eb3511ffd9d69dfc02
  http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/VirtualBox/vmseed_080430.vdi

* VirtualPC; vmseed_080430.vhd, 100MB, MD5: 11a29cb8220d8c4846af0324130801ba
  http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/VirtualPC/vmseed_080430.vhd

* Parallels; vmseed_080430.hdd, 31MB, MD5: 4d97ab5264c8974be814ebacef97364c
  http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/Parallels/vmseed_080430.hdd

* Xen/QEMU/KQEMU/KVM; vmseed_080430.tar.gz, 24MB, MD5: 7f78cf33f6be8334078e655b3bb2cbf1
  The image is created on a Sparse File System (ext3) and archived with the
  "--sparse" option of tar.
  http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/QEMU-KVM-Xen/vmseed_080430.tar.gz

* Files included in VMSeed080430, 23MB, MD5: 4c216e8672aff195b9cd98ed5224570e
  http://knoppix.inetboot.net/archives/linux/oscircular/vmseed080430/vmseed_080430_files.tar.gz

### Related Works ###

[1] InetBoot: http://openlab.jp/oscircular/inetboot/
[2] LivePC of Moka5: http://www.moka5.com/
[3] Collective Project at Stanford: http://suif.stanford.edu/collective/
[4] MajoPac: http://www.mojopac.com/

------
suzaki
From: Marcelo T. <mto...@re...> - 2008-05-02 00:13:49
On Thu, May 01, 2008 at 06:00:44PM -0500, Karl Rister wrote:
> Hi
>
> I have been trying to do some testing of a large number of guests (72) on a
> big multi-node IBM box (8 sockets, 32 cores, 128GB) and I am having various
> issues with the guests. I can get the guests to boot, but then I start to
> have problems. Some guests appear to stall doing I/O and some become
> unresponsive and spin their single vcpu at 100%.

Does -no-kvm-irqchip or -no-kvm-pit make a difference?

If not, please grab kvm_stat --once output when that happens. Also run
"readprofile -r ; readprofile -m System-map-of-guest.map" with the host
booted with "profile=kvm". Make sure all guests are running the same kernel
image.

The profiling should be easier to understand if you have 1 guest spinning and
the remaining ones idle.
From: Karl R. <km...@us...> - 2008-05-01 23:01:29
Hi

I have been trying to do some testing of a large number of guests (72) on a
big multi-node IBM box (8 sockets, 32 cores, 128GB) and I am having various
issues with the guests. I can get the guests to boot, but then I start to
have problems. Some guests appear to stall doing I/O and some become
unresponsive and spin their single vcpu at 100%.

Each guest is configured with 1 vcpu and 1000MB of memory. The single virtual
disk is backed by a LVM volume. Both the guest and host are running custom
kernels. I have tried kvm-67, kvm-64, and kvm-62 (not functional at all). I
have cloned both the kvm and kvm-userspace repositories and am building the
tagged changesets from each.

Here are a few of the various things I have tried: virtio and emulated
devices for the nic and disk; mixed virtio and emulated devices; kvm-clock
and clock=jiffies.

Any help in pinpointing the problem would be appreciated. Thanks.

--
Karl Rister
IBM Linux Performance Team
km...@us...
(512) 838-1553 (t/l 678)
From: David A. <da...@bo...> - 2008-05-01 19:50:25
Jon <iroquoi <at> gmail.com> writes:
>
> I use:
>
> export QEMU_AUDIO_DRV=alsa
> export QEMU_AUDIO_DAC_FIXED_FREQ=48000
> export QEMU_AUDIO_ADC_FIXED_FREQ=48000
> export QEMU_ALSA_DAC_BUFFER_SIZE=16384
>
> Buffer size is very important, else it crackles and pops for me.

Unfortunately with my upgrade to Ubuntu Hardy this has stopped working; I can
put off the effect by playing a test tone in linux, but Qemu again takes over
the sound system completely the first time it succeeds in making noise. Maybe
this has something to do with the addition of *yet another* audio layer in
Hardy (PulseAudio?)

--
Dave Abrahams
Boost Consulting, Inc.
http://boost-consulting.com
From: Marcelo T. <mto...@re...> - 2008-05-01 19:11:10
Hi Guillaume, On Tue, Apr 29, 2008 at 03:02:36PM +0200, Guillaume Thouvenin wrote: > Hello, <snip> > -hda ~/disk_images/hd_50G.qcow2 > -cdrom /images_iso/openSUSE-10.3-GM-x86_64-mini.iso -boot d -s -m 1024 > > exception 13 (33) > rax 0000000000000673 rbx 0000000000800000 rcx 0000000000000000 > rdx 00000000000013ca rsi 0000000000055e1c rdi 0000000000055e1d > rsp 00000000fffa0080 rbp 000000000000200b r8 0000000000000000 > r9 0000000000000000 r10 0000000000000000 r11 0000000000000000 > r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 > r15 0000000000000000 rip 000000000000b071 rflags 00033092 > cs 4004 (00040040/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > ds 4004 (00040040/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > es 00ff (00000ff0/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > ss ff11 (000ff110/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > fs 3002 (00030020/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > tr 0000 (fffbd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0) > ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0) > gdt 40920/47 idt 0/ffff cr0 10 cr2 0 cr3 0 cr4 0 cr8 0 efer 0 > code: 17 06 29 4b 01 18 eb 18 a8 25 aa 19 28 4c 01 28 4d 01 01 17 --> > 0f 17 0f 01 17 0f 17 12 01 17 2c 25 4b 19 21 00 02 17 1a 94 0a 76 67 61 > 3d 30 78 25 78 20 Aborted > > It's strange because handle_vmentry_failure() is not called. I'm trying > to see where is the problem, any comments are welcome Not sure if this is the same problem you're seeing, but with your patch Plan9 triggers: exception 13 (6b) rax 0000000000010010 rbx 0000000000000001 rcx 00000000f0012000 rdx 00000000000000a1 rsi 00000000f0101000 rdi 00000000f0009000 rsp 0000000000007bfc rbp 00000000f0001320 r8 0000000000000000 r9 0000000000000000 r10 0000000000000000 r11 0000000000000000 r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 r15 0000000000000000 rip 000000000000023e rflags 00033002 cs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) ds 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) es 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) ss 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) fs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) tr 0000 (fffbd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0) ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0) gdt 14000/4f idt 0/3ff cr0 10010 cr2 0 cr3 12000 cr4 d0 cr8 0 efer 0 code: 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff --> 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 53 ff 00 f0 The code sequence is: 8235: 66 data16 8236: 0f 22 c0 mov %eax,%cr0 8239: ea 3e 02 00 08 b8 00 ljmp $0xb8,$0x800023e So it switches to realmode and then does a ljmp. Problem is that you're using the segment selector as a GDT index, but in realmode it should be shifted left by 4 to determine the segment base address. Following patch makes Plan9 happy. Other than that, load_segment_descriptor() can return a positive error on failure, should do a proper check. 
Index: kvm/arch/x86/kvm/x86_emulate.c
===================================================================
--- kvm.orig/arch/x86/kvm/x86_emulate.c
+++ kvm/arch/x86/kvm/x86_emulate.c
@@ -1755,7 +1755,10 @@ special_insn:
 			goto cannot_emulate;
 		}
 		sel = insn_fetch(u16, 2, c->eip);
-		if (load_segment_descriptor(ctxt->vcpu, sel, 9, VCPU_SREG_CS) < 0) {
+		if (ctxt->mode == X86EMUL_MODE_REAL)
+			eip |= (sel << 4);
+		else if (load_segment_descriptor(ctxt->vcpu, sel, 9,
+						 VCPU_SREG_CS) < 0) {
 			DPRINTF("jmp far: Failed to load CS descriptor\n");
 			goto cannot_emulate;
 		}
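For reference, the real-mode translation applied above is simply
linear = (segment << 4) + offset; the helper below just makes the arithmetic
concrete (illustrative, not part of the patch):

/* Real-mode segmentation: e.g. 0xf000:0xe05b -> (0xf000 << 4) + 0xe05b = 0xfe05b. */
static inline unsigned long real_mode_linear(u16 sel, unsigned long off)
{
	return ((unsigned long)sel << 4) + off;
}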
From: Andrea A. <an...@qu...> - 2008-05-01 18:12:59
Hello everyone, this is the v14 to v15 difference to the mmu-notifier-core patch. This is just for review of the difference, I'll post full v15 soon, please review the diff in the meantime. Lots of those cleanups are thanks to Andrew review on mmu-notifier-core in v14. He also spotted the GFP_KERNEL allocation under spin_lock where DEBUG_SPINLOCK_SLEEP failed to catch it until I enabled PREEMPT (GFP_KERNEL there was perfectly safe with all patchset applied but not ok if only mmu-notifier-core was applied). As usual that bug couldn't hurt anybody unless the mmu notifiers were armed. I also wrote a proper changelog to the mmu-notifier-core patch that I will append before the v14->v15 diff: Subject: mmu-notifier-core With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're mapped by any spte because they're part of the guest working set. Furthermore a spte unmap event can immediately lead to a page to be freed when the pin is released (so requiring the same complex and relatively slow tlb_gather smp safe logic we have in zap_page_range and that can be avoided completely if the spte unmap event doesn't require an unpin of the page previously mapped in the secondary MMU). The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know when the VM is swapping or freeing or doing anything on the primary MMU so that the secondary MMU code can drop sptes before the pages are freed, avoiding all page pinning and allowing 100% reliable swapping of guest physical address space. Furthermore it avoids the code that teardown the mappings of the secondary MMU, to implement a logic like tlb_gather in zap_page_range that would require many IPI to flush other cpu tlbs, for each fixed number of spte unmapped. To make an example: if what happens on the primary MMU is a protection downgrade (from writeable to wrprotect) the secondary MMU mappings will be invalidated, and the next secondary-mmu-page-fault will call get_user_pages and trigger a do_wp_page through get_user_pages if it called get_user_pages with write=1, and it'll re-establishing an updated spte or secondary-tlb-mapping on the copied page. Or it will setup a readonly spte or readonly tlb mapping if it's a guest-read, if it calls get_user_pages with write=0. This is just an example. 
This allows to map any page pointed by any pte (and in turn visible in the primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an full MMU with both sptes and secondary-tlb like the shadow-pagetable layer with kvm), or a remote DMA in software like XPMEM (hence needing of schedule in XPMEM code to send the invalidate to the remote node, while no need to schedule in kvm/gru as it's an immediate event like invalidating primary-mmu pte). At least for KVM without this patch it's impossible to swap guests reliably. And having this feature and removing the page pin allows several other optimizations that simplify life considerably. Dependencies: 1) Introduces list_del_init_rcu and documents it (fixes a comment for list_del_rcu too) 2) mm_lock() to register the mmu notifier when the whole VM isn't doing anything with "mm". This allows mmu notifier users to keep track if the VM is in the middle of the invalidate_range_begin/end critical section with an atomic counter incraese in range_begin and decreased in range_end. No secondary MMU page fault is allowed to map any spte or secondary tlb reference, while the VM is in the middle of range_begin/end as any page returned by get_user_pages in that critical section could later immediately be freed without any further ->invalidate_page notification (invalidate_range_begin/end works on ranges and ->invalidate_page isn't called immediately before freeing the page). To stop all page freeing and pagetable overwrites the mmap_sem must be taken in write mode and all other anon_vma/i_mmap locks must be taken in virtual address order. The order is critical to avoid mm_lock(mm1) and mm_lock(mm2) running concurrently to trigger lock inversion deadlocks. 3) It'd be a waste to add branches in the VM if nobody could possibly run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of mmu notifiers, but this already allows to compile a KVM external module against a kernel with mmu notifiers enabled and from the next pull from kvm.git we'll start using them. And GRU/XPMEM will also be able to continue the development by enabling KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n). This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n. The mmu_notifier_register call can fail because mm_lock may not allocate the required vmalloc space. See the comment on top of mm_lock() implementation for the worst case memory requirements. Because mmu_notifier_reigster is used when a driver startup, a failure can be gracefully handled. Here an example of the change applied to kvm to register the mmu notifiers. Usually when a driver startups other allocations are required anyway and -ENOMEM failure paths exists already. struct kvm *kvm_arch_create_vm(void) { struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL); + int err; if (!kvm) return ERR_PTR(-ENOMEM); INIT_LIST_HEAD(&kvm->arch.active_mmu_pages); + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops; + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm); + if (err) { + kfree(kvm); + return ERR_PTR(err); + } + return kvm; } mmu_notifier_unregister returns void and it's reliable. 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -21,6 +21,7 @@ config KVM tristate "Kernel-based Virtual Machine (KVM) support" depends on HAVE_KVM select PREEMPT_NOTIFIERS + select MMU_NOTIFIER select ANON_INODES ---help--- Support hosting fully virtualized guest machines using hardware diff --git a/include/linux/list.h b/include/linux/list.h --- a/include/linux/list.h +++ b/include/linux/list.h @@ -739,7 +739,7 @@ static inline void hlist_del(struct hlis * or hlist_del_rcu(), running on this same list. * However, it is perfectly legal to run concurrently with * the _rcu list-traversal primitives, such as - * hlist_for_each_entry(). + * hlist_for_each_entry_rcu(). */ static inline void hlist_del_rcu(struct hlist_node *n) { @@ -755,6 +755,26 @@ static inline void hlist_del_init(struct } } +/** + * hlist_del_init_rcu - deletes entry from hash list with re-initialization + * @n: the element to delete from the hash list. + * + * Note: list_unhashed() on entry does return true after this. It is + * useful for RCU based read lockfree traversal if the writer side + * must know if the list entry is still hashed or already unhashed. + * + * In particular, it means that we can not poison the forward pointers + * that may still be used for walking the hash list and we can only + * zero the pprev pointer so list_unhashed() will return true after + * this. + * + * The caller must take whatever precautions are necessary (such as + * holding appropriate locks) to avoid racing with another + * list-mutation primitive, such as hlist_add_head_rcu() or + * hlist_del_rcu(), running on this same list. However, it is + * perfectly legal to run concurrently with the _rcu list-traversal + * primitives, such as hlist_for_each_entry_rcu(). + */ static inline void hlist_del_init_rcu(struct hlist_node *n) { if (!hlist_unhashed(n)) { diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1050,18 +1050,6 @@ extern int install_special_mapping(struc unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); -/* - * mm_lock will take mmap_sem writably (to prevent all modifications - * and scanning of vmas) and then also takes the mapping locks for - * each of the vma to lockout any scans of pagetables of this address - * space. This can be used to effectively holding off reclaim from the - * address space. - * - * mm_lock can fail if there is not enough memory to store a pointer - * array to all vmas. - * - * mm_lock and mm_unlock are expensive operations that may take a long time. - */ struct mm_lock_data { spinlock_t **i_mmap_locks; spinlock_t **anon_vma_locks; diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -4,17 +4,24 @@ #include <linux/list.h> #include <linux/spinlock.h> #include <linux/mm_types.h> +#include <linux/srcu.h> struct mmu_notifier; struct mmu_notifier_ops; #ifdef CONFIG_MMU_NOTIFIER -#include <linux/srcu.h> +/* + * The mmu notifier_mm structure is allocated and installed in + * mm->mmu_notifier_mm inside the mm_lock() protected critical section + * and it's released only when mm_count reaches zero in mmdrop(). 
+ */ struct mmu_notifier_mm { + /* all mmu notifiers registerd in this mm are queued in this list */ struct hlist_head list; + /* srcu structure for this mm */ struct srcu_struct srcu; - /* to serialize mmu_notifier_unregister against mmu_notifier_release */ + /* to serialize the list modifications and hlist_unhashed */ spinlock_t lock; }; @@ -23,8 +30,8 @@ struct mmu_notifier_ops { * Called either by mmu_notifier_unregister or when the mm is * being destroyed by exit_mmap, always before all pages are * freed. It's mandatory to implement this method. This can - * run concurrently to other mmu notifier methods and it - * should teardown all secondary mmu mappings and freeze the + * run concurrently with other mmu notifier methods and it + * should tear down all secondary mmu mappings and freeze the * secondary mmu. */ void (*release)(struct mmu_notifier *mn, @@ -43,9 +50,10 @@ struct mmu_notifier_ops { /* * Before this is invoked any secondary MMU is still ok to - * read/write to the page previously pointed by the Linux pte - * because the old page hasn't been freed yet. If required - * set_page_dirty has to be called internally to this method. + * read/write to the page previously pointed to by the Linux + * pte because the page hasn't been freed yet and it won't be + * freed until this returns. If required set_page_dirty has to + * be called internally to this method. */ void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm, @@ -53,20 +61,18 @@ struct mmu_notifier_ops { /* * invalidate_range_start() and invalidate_range_end() must be - * paired and are called only when the mmap_sem is held and/or - * the semaphores protecting the reverse maps. Both functions + * paired and are called only when the mmap_sem and/or the + * locks protecting the reverse maps are held. Both functions * may sleep. The subsystem must guarantee that no additional - * references to the pages in the range established between - * the call to invalidate_range_start() and the matching call - * to invalidate_range_end(). + * references are taken to the pages in the range established + * between the call to invalidate_range_start() and the + * matching call to invalidate_range_end(). * - * Invalidation of multiple concurrent ranges may be permitted - * by the driver or the driver may exclude other invalidation - * from proceeding by blocking on new invalidate_range_start() - * callback that overlap invalidates that are already in - * progress. Either way the establishment of sptes to the - * range can only be allowed if all invalidate_range_stop() - * function have been called. + * Invalidation of multiple concurrent ranges may be + * optionally permitted by the driver. Either way the + * establishment of sptes is forbidden in the range passed to + * invalidate_range_begin/end for the whole duration of the + * invalidate_range_begin/end critical section. * * invalidate_range_start() is called when all pages in the * range are still mapped and have at least a refcount of one. @@ -187,6 +193,14 @@ static inline void mmu_notifier_mm_destr __mmu_notifier_mm_destroy(mm); } +/* + * These two macros will sometime replace ptep_clear_flush. + * ptep_clear_flush is impleemnted as macro itself, so this also is + * implemented as a macro until ptep_clear_flush will converted to an + * inline function, to diminish the risk of compilation failure. The + * invalidate_page method over time can be moved outside the PT lock + * and these two macros can be later removed. 
+ */ #define ptep_clear_flush_notify(__vma, __address, __ptep) \ ({ \ pte_t __pte; \ diff --git a/mm/Kconfig b/mm/Kconfig --- a/mm/Kconfig +++ b/mm/Kconfig @@ -193,7 +193,3 @@ config VIRT_TO_BUS config VIRT_TO_BUS def_bool y depends on !ARCH_NO_VIRT_TO_BUS - -config MMU_NOTIFIER - def_bool y - bool "MMU notifier, for paging KVM/RDMA" diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -613,6 +613,12 @@ int copy_page_range(struct mm_struct *ds if (is_vm_hugetlb_page(vma)) return copy_hugetlb_page_range(dst_mm, src_mm, vma); + /* + * We need to invalidate the secondary MMU mappings only when + * there could be a permission downgrade on the ptes of the + * parent mm. And a permission downgrade will only happen if + * is_cow_mapping() returns true. + */ if (is_cow_mapping(vma->vm_flags)) mmu_notifier_invalidate_range_start(src_mm, addr, end); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2329,7 +2329,36 @@ static inline void __mm_unlock(spinlock_ * operations that could ever happen on a certain mm. This includes * vmtruncate, try_to_unmap, and all page faults. The holder * must not hold any mm related lock. A single task can't take more - * than one mm lock in a row or it would deadlock. + * than one mm_lock in a row or it would deadlock. + * + * The mmap_sem must be taken in write mode to block all operations + * that could modify pagetables and free pages without altering the + * vma layout (for example populate_range() with nonlinear vmas). + * + * The sorting is needed to avoid lock inversion deadlocks if two + * tasks run mm_lock at the same time on different mm that happen to + * share some anon_vmas/inodes but mapped in different order. + * + * mm_lock and mm_unlock are expensive operations that may have to + * take thousand of locks. Thanks to sort() the complexity is + * O(N*log(N)) where N is the number of VMAs in the mm. The max number + * of vmas is defined in /proc/sys/vm/max_map_count. + * + * mm_lock() can fail if memory allocation fails. The worst case + * vmalloc allocation required is 2*max_map_count*sizeof(spinlock *), + * so around 1Mbyte, but in practice it'll be much less because + * normally there won't be max_map_count vmas allocated in the task + * that runs mm_lock(). + * + * The vmalloc memory allocated by mm_lock is stored in the + * mm_lock_data structure that must be allocated by the caller and it + * must be later passed to mm_unlock that will free it after using it. + * Allocating the mm_lock_data structure on the stack is fine because + * it's only a couple of bytes in size. + * + * If mm_lock() returns -ENOMEM no memory has been allocated and the + * mm_lock_data structure can be freed immediately, and mm_unlock must + * not be called. */ int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) { @@ -2350,6 +2379,13 @@ int mm_lock(struct mm_struct *mm, struct return -ENOMEM; } + /* + * When mm_lock_sort_anon_vma/i_mmap returns zero it + * means there's no lock to take and so we can free + * the array here without waiting mm_unlock. mm_unlock + * will do nothing if nr_i_mmap/anon_vma_locks is + * zero. + */ data->nr_anon_vma_locks = mm_lock_sort_anon_vma(mm, anon_vma_locks); data->nr_i_mmap_locks = mm_lock_sort_i_mmap(mm, i_mmap_locks); @@ -2374,7 +2410,17 @@ static void mm_unlock_vfree(spinlock_t * vfree(locks); } -/* avoid memory allocations for mm_unlock to prevent deadlock */ +/* + * mm_unlock doesn't require any memory allocation and it won't fail. 
+ * + * All memory has been previously allocated by mm_lock and it'll be + * all freed before returning. Only after mm_unlock returns, the + * caller is allowed to free and forget the mm_lock_data structure. + * + * mm_unlock runs in O(N) where N is the max number of VMAs in the + * mm. The max number of vmas is defined in + * /proc/sys/vm/max_map_count. + */ void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data) { if (mm->map_count) { diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -21,12 +21,12 @@ * This function can't run concurrently against mmu_notifier_register * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap * runs with mm_users == 0. Other tasks may still invoke mmu notifiers - * in parallel despite there's no task using this mm anymore, through - * the vmas outside of the exit_mmap context, like with + * in parallel despite there being no task using this mm any more, + * through the vmas outside of the exit_mmap context, such as with * vmtruncate. This serializes against mmu_notifier_unregister with * the mmu_notifier_mm->lock in addition to SRCU and it serializes * against the other mmu notifiers with SRCU. struct mmu_notifier_mm - * can't go away from under us as exit_mmap holds a mm_count pin + * can't go away from under us as exit_mmap holds an mm_count pin * itself. */ void __mmu_notifier_release(struct mm_struct *mm) @@ -41,7 +41,7 @@ void __mmu_notifier_release(struct mm_st hlist); /* * We arrived before mmu_notifier_unregister so - * mmu_notifier_unregister will do nothing else than + * mmu_notifier_unregister will do nothing other than * to wait ->release to finish and * mmu_notifier_unregister to return. */ @@ -66,7 +66,11 @@ void __mmu_notifier_release(struct mm_st spin_unlock(&mm->mmu_notifier_mm->lock); /* - * Wait ->release if mmu_notifier_unregister is running it. + * synchronize_srcu here prevents mmu_notifier_release to + * return to exit_mmap (which would proceed freeing all pages + * in the mm) until the ->release method returns, if it was + * invoked by mmu_notifier_unregister. + * * The mmu_notifier_mm can't go away from under us because one * mm_count is hold by exit_mmap. */ @@ -144,8 +148,9 @@ void __mmu_notifier_invalidate_range_end * Must not hold mmap_sem nor any other VM related lock when calling * this registration function. Must also ensure mm_users can't go down * to zero while this runs to avoid races with mmu_notifier_release, - * so mm has to be current->mm or the mm should be pinned safely like - * with get_task_mm(). mmput can be called after mmu_notifier_register + * so mm has to be current->mm or the mm should be pinned safely such + * as with get_task_mm(). If the mm is not current->mm, the mm_users + * pin should be released by calling mmput after mmu_notifier_register * returns. mmu_notifier_unregister must be always called to * unregister the notifier. 
mm_count is automatically pinned to allow * mmu_notifier_unregister to safely run at any time later, before or @@ -155,29 +160,29 @@ int mmu_notifier_register(struct mmu_not int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm) { struct mm_lock_data data; + struct mmu_notifier_mm * mmu_notifier_mm; int ret; BUG_ON(atomic_read(&mm->mm_users) <= 0); + ret = -ENOMEM; + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL); + if (unlikely(!mmu_notifier_mm)) + goto out; + + ret = init_srcu_struct(&mmu_notifier_mm->srcu); + if (unlikely(ret)) + goto out_kfree; + ret = mm_lock(mm, &data); if (unlikely(ret)) - goto out; + goto out_cleanup; if (!mm_has_notifiers(mm)) { - mm->mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), - GFP_KERNEL); - ret = -ENOMEM; - if (unlikely(!mm_has_notifiers(mm))) - goto out_unlock; - - ret = init_srcu_struct(&mm->mmu_notifier_mm->srcu); - if (unlikely(ret)) { - kfree(mm->mmu_notifier_mm); - mmu_notifier_mm_init(mm); - goto out_unlock; - } - INIT_HLIST_HEAD(&mm->mmu_notifier_mm->list); - spin_lock_init(&mm->mmu_notifier_mm->lock); + INIT_HLIST_HEAD(&mmu_notifier_mm->list); + spin_lock_init(&mmu_notifier_mm->lock); + mm->mmu_notifier_mm = mmu_notifier_mm; + mmu_notifier_mm = NULL; } atomic_inc(&mm->mm_count); @@ -192,8 +197,14 @@ int mmu_notifier_register(struct mmu_not spin_lock(&mm->mmu_notifier_mm->lock); hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list); spin_unlock(&mm->mmu_notifier_mm->lock); -out_unlock: + mm_unlock(mm, &data); +out_cleanup: + if (mmu_notifier_mm) + cleanup_srcu_struct(&mmu_notifier_mm->srcu); +out_kfree: + /* kfree() does nothing if mmu_notifier_mm is NULL */ + kfree(mmu_notifier_mm); out: BUG_ON(atomic_read(&mm->mm_users) <= 0); return ret; |
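To make the API described above concrete, a minimal consumer might look
roughly like this; the callback signatures and mmu_notifier_register() follow
the patch, while the my_* names and the empty callback bodies are purely
illustrative:

static void my_invalidate_page(struct mmu_notifier *mn,
			       struct mm_struct *mm, unsigned long address)
{
	/* Drop any secondary-MMU mapping of 'address'; the page itself
	 * stays allocated until this returns. */
}

static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	/* Tear down all secondary mappings and freeze the secondary MMU. */
}

static const struct mmu_notifier_ops my_ops = {
	.release	 = my_release,
	.invalidate_page = my_invalidate_page,
};

static struct mmu_notifier my_notifier = { .ops = &my_ops };

static int my_driver_init(void)
{
	/* current->mm must be pinned (we are the task, or via get_task_mm);
	 * -ENOMEM from the internal mm_lock() is propagated to the caller. */
	return mmu_notifier_register(&my_notifier, current->mm);
}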
From: Andrea A. <an...@qu...> - 2008-05-01 16:43:29
Hello,

This make sure not to schedule in atomic during fx_init. I also
changed the name of fpu_init to fx_finit to avoid duplicating the name
with fpu_init that is already used in the kernel, this makes grep
simpler if nothing else.

Signed-off-by: Andrea Arcangeli <an...@qu...>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 578a0c1..5398b1c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3701,10 +3702,19 @@ void fx_init(struct kvm_vcpu *vcpu)
 {
 	unsigned after_mxcsr_mask;
 
+	/*
+	 * Touch the fpu the first time in non atomic context as if
+	 * this is the first fpu instruction the exception handler
+	 * will fire before the instruction returns and it'll have to
+	 * allocate ram with GFP_KERNEL.
+	 */
+	if (!used_math())
+		fx_save(&vcpu->arch.host_fx_image);
+
 	/* Initialize guest FPU by resetting ours and saving into guest's */
 	preempt_disable();
 	fx_save(&vcpu->arch.host_fx_image);
-	fpu_init();
+	fx_finit();
 	fx_save(&vcpu->arch.guest_fx_image);
 	fx_restore(&vcpu->arch.host_fx_image);
 	preempt_enable();
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 4baa9c9..b9a1421 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -627,7 +635,7 @@ static inline void fx_restore(struct i387_fxsave_struct *image)
 	asm("fxrstor (%0)":: "r" (image));
 }
 
-static inline void fpu_init(void)
+static inline void fx_finit(void)
 {
 	asm("finit");
 }
From: Chris L. <cla...@re...> - 2008-05-01 14:23:32
Chris Lalancette wrote:
> Attached is a patch that fixes a guest crash when booting older Linux
> kernels. The problem stems from the fact that we are currently emulating
> MSR_K7_EVNTSEL[0-3], but not emulating MSR_K7_PERFCTR[0-3]. Because of
> this, setup_k7_watchdog() in the Linux kernel receives a GPF when it
> attempts to write into MSR_K7_PERFCTR, which causes an OOPs.
>
> The patch fixes it by just "fake" emulating the appropriate MSRs, throwing
> away the data in the process. This causes the NMI watchdog to not actually
> work, but it's not such a big deal in a virtualized environment.
>
> Tested by myself on a RHEL-4 guest, and Joerg Roedel on a Windows XP
> 64-bit guest.

Avi,
     Do you mind applying this patch for me (unless you see something wrong
with it, of course)?

Thanks,
Chris Lalancette
From: Dr R. Hansmond. <han...@ya...> - 2008-05-01 13:53:57
This is to notify you of my new email address. New email address:
han...@ya...

I am seeking your cooperation in building a Tourist Hotel or Real Estate in
your country. I need an experienced person like you to assist me to set up
and develop the project.

Alternative Email: han...@ya...

Thanks and God bless.
Dr Raymond Hansmond.

- Dr Raymond Hansmond.
From: Amit S. <ami...@qu...> - 2008-05-01 13:17:00
On Tuesday 29 April 2008 21:28:51 Amit Shah wrote:
> On Tuesday 29 April 2008 20:14:16 Glauber Costa wrote:
>> Amit Shah wrote:
>>> +	if (find_pci_pt_dev(&vcpu->kvm->arch.pci_pt_dev_head,
>>> +			    &pci_pt_info, 0, KVM_PT_SOURCE_ASSIGN))
>>> +		r++; /* We have assigned the device */
>>> +
>>> +	kunmap(host_page);
>>
>> better use atomic mappings here.
>
> We can't use atomic mappings for guest pages. They can be swapped out.

Actually you were right: there's no sleeping call here after doing the
mapping. I've updated this call with kmap_atomic.

The other function that uses kmap can't be converted since we continue to map
several pages in a loop (depending on the length of the DMA region) and hence
can't use kmap_atomic there.
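The distinction being drawn is the usual kmap() vs. kmap_atomic() rule: an
atomic mapping is only safe if nothing between map and unmap can sleep, and
only one page is held at a time. A generic sketch of the pattern with the
2008-era API (explicit KM_USER0 slot; not the actual patch):

	void *va = kmap_atomic(host_page, KM_USER0);

	/* ... read/modify the page; no sleeping calls allowed in here ... */

	kunmap_atomic(va, KM_USER0);

	/*
	 * Holding several pages mapped across loop iterations (as in the
	 * DMA-region walk mentioned above) needs kmap()/kunmap() instead,
	 * which may sleep but doesn't occupy a per-CPU fixmap slot.
	 */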
From: Yang, S. <she...@in...> - 2008-05-01 08:47:51
On Thursday 01 May 2008 04:16:05 Anthony Liguori wrote:
> In vmx.c:alloc_identity_pagetable() we grab a reference to the EPT identity
> page table via gfn_to_page(). We never release this reference though.
>
> This patch releases the reference to this page on VM destruction. I
> haven't tested this with EPT.
>
> Signed-off-by: Anthony Liguori <ali...@us...>
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 578a0c1..63f46cf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3909,6 +3909,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	kvm_free_physmem(kvm);
>  	if (kvm->arch.apic_access_page)
>  		put_page(kvm->arch.apic_access_page);
> +	if (kvm->arch.ept_identity_pagetable)
> +		put_page(kvm->arch.ept_identity_pagetable);
>  	kfree(kvm);
>  }

Um... I neglected that... Thanks for pointing it out!

--
Thanks
Yang, Sheng