From: Hollis B. <ho...@us...> - 2008-04-29 15:08:48
On Monday 28 April 2008 16:23:04 Jerone Young wrote:
> +/* This function is to manipulate a cell with multiple values */
> +void dt_cell_multi(void *fdt, char *node_path, char *property,
> + uint32_t *val_array, int size)
> +{
> +
> + int offset;
> + int ret;

Could you please be more careful with your whitespace?

--
Hollis Blanchard
IBM Linux Technology Center
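For comparison, the quoted declaration with conventional kernel whitespace would look something like this (a sketch; only the lines quoted above are shown, the function body is not in the quote):

    /* This function is to manipulate a cell with multiple values */
    void dt_cell_multi(void *fdt, char *node_path, char *property,
                       uint32_t *val_array, int size)
    {
            int offset;
            int ret;
            /* ... */
    }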
From: Joerg R. <joe...@am...> - 2008-04-29 15:05:24
The current KVM x86 exception code handles double and triple faults only for page fault exceptions. This patch extends this detection to every exception that gets queued for the guest.

Signed-off-by: Joerg Roedel <joe...@am...>
Cc: Jan Kiszka <jan...@si...>
---
 arch/x86/kvm/x86.c |   31 +++++++++++++++++--------------
 1 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 578a0c1..c05aa32 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -144,9 +144,21 @@ void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data)
 }
 EXPORT_SYMBOL_GPL(kvm_set_apic_base);

+static void handle_multiple_faults(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->arch.exception.nr != DF_VECTOR) {
+		vcpu->arch.exception.nr = DF_VECTOR;
+		vcpu->arch.exception.error_code = 0;
+	} else
+		set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
+}
+
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
 {
-	WARN_ON(vcpu->arch.exception.pending);
+	if (vcpu->arch.exception.pending) {
+		handle_multiple_faults(vcpu);
+		return;
+	}
 	vcpu->arch.exception.pending = true;
 	vcpu->arch.exception.has_error_code = false;
 	vcpu->arch.exception.nr = nr;
@@ -157,25 +169,16 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long addr,
 			   u32 error_code)
 {
 	++vcpu->stat.pf_guest;
-	if (vcpu->arch.exception.pending) {
-		if (vcpu->arch.exception.nr == PF_VECTOR) {
-			printk(KERN_DEBUG "kvm: inject_page_fault:"
-			       " double fault 0x%lx\n", addr);
-			vcpu->arch.exception.nr = DF_VECTOR;
-			vcpu->arch.exception.error_code = 0;
-		} else if (vcpu->arch.exception.nr == DF_VECTOR) {
-			/* triple fault -> shutdown */
-			set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
-		}
-		return;
-	}
 	vcpu->arch.cr2 = addr;
 	kvm_queue_exception_e(vcpu, PF_VECTOR, error_code);
 }

 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
 {
-	WARN_ON(vcpu->arch.exception.pending);
+	if (vcpu->arch.exception.pending) {
+		handle_multiple_faults(vcpu);
+		return;
+	}
 	vcpu->arch.exception.pending = true;
 	vcpu->arch.exception.has_error_code = true;
 	vcpu->arch.exception.nr = nr;
--
1.5.3.7
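In short, the rule the patch implements (a summary, not code from the patch):

    /* For any exception queued while another is already pending:
     *
     *   nothing pending       -> queue the new exception
     *   pending != DF_VECTOR  -> promote to DF_VECTOR, error code 0
     *   pending == DF_VECTOR  -> set KVM_REQ_TRIPLE_FAULT (shutdown)
     *
     * Unlike real hardware, no distinction is made between benign and
     * contributory exception classes -- any second fault becomes #DF.
     */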
From: Andrea A. <an...@qu...> - 2008-04-29 14:55:09
On Tue, Apr 29, 2008 at 09:32:09AM -0500, Anthony Liguori wrote:
> +		vma = find_vma(current->mm, addr);
> +		if (vma == NULL) {
> +			get_page(bad_page);
> +			return page_to_pfn(bad_page);
> +		}

Here you must check the vm_start address: find_vma only checks addr < vm_end, but there's no guarantee that addr >= vm_start yet.

> +
> +		BUG_ON(!(vma->vm_flags & VM_IO));

For consistency we should return bad_page and not BUG on it. VM_IO and VM_PFNMAP can, in theory, not be set at the same time; otherwise get_user_pages would be buggy in checking against VM_PFNMAP|VM_IO. I doubt anybody isn't setting VM_IO before calling remap_pfn_range, but anyway...

Secondly, the really correct check is against VM_PFNMAP. This is because PFNMAP is set at the same time as vm_pgoff = pfn; VM_IO is not: in theory a driver that uses ->fault instead of remap_pfn_range shouldn't set VM_IO and should only set VM_RESERVED. VM_IO is about keeping gdb/coredump out, as they could mess with the hardware if they read; PFNMAP is about remap_pfn_range having been called, with pgoff pointing to the first pfn mapped at the vm_start address.

Patch is in the right direction, way to go!
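Put together, the check being described would look roughly like this (a sketch against the quoted context, not the committed fix):

    vma = find_vma(current->mm, addr);
    /* find_vma() only guarantees addr < vma->vm_end; also reject
     * addresses below vm_start, and test VM_PFNMAP rather than VM_IO,
     * since VM_PFNMAP is what guarantees that vm_pgoff is the first
     * pfn mapped at vm_start.
     */
    if (vma == NULL || addr < vma->vm_start ||
        !(vma->vm_flags & VM_PFNMAP)) {
            get_page(bad_page);
            return page_to_pfn(bad_page);
    }
    pfn = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;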
From: Glauber C. <gc...@re...> - 2008-04-29 14:51:53
Amit Shah wrote:
> We introduce three hypercalls:
> 1. When the guest wants to check if a particular device is an assigned device
> (this is done once per device by the guest to enable / disable hypercall-
> based translation of addresses)
>
> 2. map: to convert guest physical addresses to host physical addresses to
> pass on to the device for DMA. We also pin the pages thus requested so that
> they're not swapped out.
>
> 3. unmap: to unpin the pages and free any information we might have stored.
>
> Signed-off-by: Amit Shah <ami...@qu...>
> ---
>  arch/x86/kvm/x86.c         |  211 +++++++++++++++++++++++++++++++++++++++++++-
>  include/asm-x86/kvm_host.h |   15 +++
>  include/asm-x86/kvm_para.h |    8 ++
>  3 files changed, 233 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fb9b329..94ee4db 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -24,8 +24,11 @@
>  #include <linux/interrupt.h>
>  #include <linux/kvm.h>
>  #include <linux/fs.h>
> +#include <linux/list.h>
> +#include <linux/pci.h>
>  #include <linux/vmalloc.h>
>  #include <linux/module.h>
> +#include <linux/highmem.h>
>  #include <linux/mman.h>
>  #include <linux/highmem.h>
>
> @@ -76,6 +79,9 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>  	{ "halt_exits", VCPU_STAT(halt_exits) },
>  	{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
>  	{ "hypercalls", VCPU_STAT(hypercalls) },
> +	{ "hypercall_map", VCPU_STAT(hypercall_map) },
> +	{ "hypercall_unmap", VCPU_STAT(hypercall_unmap) },
> +	{ "hypercall_pv_dev", VCPU_STAT(hypercall_pv_dev) },
>  	{ "request_irq", VCPU_STAT(request_irq_exits) },
>  	{ "irq_exits", VCPU_STAT(irq_exits) },
>  	{ "host_state_reload", VCPU_STAT(host_state_reload) },
> @@ -95,9 +101,164 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>  	{ NULL }
>  };
>
> +static struct kvm_pv_dma_map*
> +find_pci_pv_dmap(struct list_head *head, dma_addr_t dma)
> +{

Might be better to prefix those functions with kvm? Even though they are static, it seems to be the current practice.

> +static int pv_map_hypercall(struct kvm_vcpu *vcpu, int npages, gfn_t page_gfn)
> +{
> +	int i, r = 0;
> +	struct page *host_page;
> +	struct scatterlist *sg;
> +	struct kvm_pv_dma_map *dmap;
> +	unsigned long *shared_addr, *hcall_page;
> +
> +	/* We currently don't support dma mappings which have more than
> +	 * PAGE_SIZE/sizeof(unsigned long *) pages
> +	 */
> +	if (!npages || npages > MAX_PVDMA_PAGES) {
> +		printk(KERN_INFO "%s: Illegal number of pages: %d\n",
> +		       __func__, npages);
> +		goto out;
> +	}
> +
> +	host_page = gfn_to_page(vcpu->kvm, page_gfn);

you need mmap_sem held for read to use gfn_to_page.

> +	if (is_error_page(host_page)) {
> +		printk(KERN_INFO "%s: Bad gfn %p\n", __func__,
> +		       (void *)page_gfn);
> +		goto out;
> +	}
> +	hcall_page = shared_addr = kmap(host_page);
> +
> +	/* scatterlist to map guest dma pages into host physical
> +	 * memory -- if they exceed the DMA map limit
> +	 */
> +	sg = kcalloc(npages, sizeof(struct scatterlist), GFP_KERNEL);
> +	if (sg == NULL) {
> +		printk(KERN_INFO "%s: Couldn't allocate memory (sg)\n",
> +		       __func__);
> +		goto out_unmap;
> +	}
> +
> +	/* List to store all guest pages mapped into host. This will
> +	 * be used later to free pages on the host. Think of this as a
> +	 * translation table from guest dma addresses into host dma
> +	 * addresses
> +	 */
> +	dmap = kzalloc(sizeof(*dmap), GFP_KERNEL);
> +	if (dmap == NULL) {
> +		printk(KERN_INFO "%s: Couldn't allocate memory\n",
> +		       __func__);
> +		goto out_unmap_sg;
> +	}
> +
> +	/* FIXME: consider the length of the last page. Guest should
> +	 * send this info.
> +	 */
> +	for (i = 0; i < npages; i++) {
> +		struct page *page;
> +		gpa_t gpa;
> +
> +		gpa = *shared_addr++;
> +		page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);

care for locking here too.

> +		if (is_error_page(page)) {
> +			int j;
> +			printk(KERN_INFO "kvm %s: gpa %p not valid\n",
> +			       __func__, (void *)gpa);
> +
> +			for (j = 0; j < i; j++)
> +				put_page(sg_page(&sg[j]));
> +			goto out_unmap_sg_dmap;
> +		}
> +		prepare_sg_entry(&sg[i], page);
> +		get_page(sg_page(&sg[i]));
> +	}
> +
> +	/* Put this on the dmap_head list, so that we can find it
> +	 * later for the 'free' operation
> +	 */
> +	dmap->sg = sg;
> +	dmap->nents = npages;
> +	list_add(&dmap->list, &vcpu->kvm->arch.pci_pv_dmap_head);
> +
> +	/* FIXME: guest should send the direction */
> +	r = dma_ops->map_sg(NULL, sg, npages, PCI_DMA_BIDIRECTIONAL);
> +	if (r) {
> +		r = npages;
> +		*hcall_page = sg[0].dma_address | (*hcall_page & ~PAGE_MASK);
> +	}
> +
> + out_unmap:
> +	if (!r)
> +		*hcall_page = bad_dma_address;
> +	kunmap(host_page);
> + out:
> +	++vcpu->stat.hypercall_map;
> +	return r;
> + out_unmap_sg_dmap:
> +	kfree(dmap);
> + out_unmap_sg:
> +	kfree(sg);
> +	goto out_unmap;

those backwards goto are very clumsy. Might be better to give it further attention in order to avoid it.

> +}
> +
> +static int free_dmap(struct kvm_pv_dma_map *dmap, struct list_head *head)
> +{
> +	int i;
> +
> +	if (!dmap)
> +		return 1;

that's ugly. it's better to keep the free function with free-like semantics: just a void function that plainly returns if !dmap, and check in the caller.

> +static int
> +pv_mapped_pci_device_hypercall(struct kvm_vcpu *vcpu, gfn_t page_gfn)
> +{
> +	int r = 0;
> +	unsigned long *shared_addr;
> +	struct page *host_page;
> +	struct kvm_pci_pt_info pci_pt_info;
> +
> +	host_page = gfn_to_page(vcpu->kvm, page_gfn);

locking

> +	if (is_error_page(host_page)) {
> +		printk(KERN_INFO "%s: gfn %p not valid\n",
> +		       __func__, (void *)page_gfn);
> +		r = -1;

r = -1 is not really informative. Better use some meaningful error. We can return here, and avoid this goto if we always increment the hypercall counter in the beginning of the function. But this is nitpicking.

> +		goto out;
> +	}
> +	shared_addr = kmap(host_page);
> +	memcpy(&pci_pt_info, shared_addr, sizeof(pci_pt_info));
> +
> +	if (find_pci_pt_dev(&vcpu->kvm->arch.pci_pt_dev_head,
> +			    &pci_pt_info, 0, KVM_PT_SOURCE_ASSIGN))
> +		r++; /* We have assigned the device */
> +
> +	kunmap(host_page);

better use atomic mappings here.
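The two recurring review comments translate into roughly this (a sketch of the idioms being asked for, against the 2.6.25-era APIs, not code from the series):

    /* gfn_to_page() walks the memslots and the process page tables,
     * so callers of this era were expected to hold mmap_sem for read.
     */
    down_read(&current->mm->mmap_sem);
    host_page = gfn_to_page(vcpu->kvm, page_gfn);
    up_read(&current->mm->mmap_sem);

    /* For a short copy like this, an atomic kmap avoids sleeping and
     * the global kmap pool.
     */
    shared_addr = kmap_atomic(host_page, KM_USER0);
    memcpy(&pci_pt_info, shared_addr, sizeof(pci_pt_info));
    kunmap_atomic(shared_addr, KM_USER0);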
From: Joerg R. <joe...@am...> - 2008-04-29 14:42:07
On Tue, Apr 29, 2008 at 03:07:25PM +0200, Jan Kiszka wrote:
> Hi,
>
> looks like we are getting better and better here in hitting yet
> unsupported corner-case features of KVM :). This time our guest fiddles
> with hardware debugging registers, but quickly gets unhappy as they do
> not yet have the expected effect.

KVM is mostly tested with guests that run with paging. So a 16-bit protected mode guest is not tested very well :)

> Joerg, I found your SVM-related patch series in the archive, which does
> not seem to have raised many responses. Is this general direction OK?
> Does it allow self-debugging of guests? But how are conflicts resolved
> if both guest and host need the physical registers (host debugging the
> guest which is debugging itself)?

I sent a patchset in the past to enable guest debugging for SVM, which means debugging the guest from outside using gdb. But I was not able to test these patches because the userspace side of guest debugging is broken in kvm-qemu.

Debugging in the guest should work without problems. The debug registers are switched between guest and host if the guest uses them, so there should be no problems when the guest and the host are both using the debug registers.

> I would try to dig into the VMX side if the general architecture is
> -mostly- clear. [ Sorry, Joerg, someone put the latter type of HW on my
> desk :->. Hope I can once check our stuff against SVM as well! ]

With some debug output from SVM I can better help to debug your problems ;-)

Joerg

--
          | AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System    | Register Court Dresden: HRA 4896
Research  | General Partner authorized to represent:
Center    | AMD Saxony LLC (Wilmington, Delaware, US)
          | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
From: Anthony L. <ali...@us...> - 2008-04-29 14:34:08
This patch allows VMAs that contain no backing page to be used for guest memory. This is a drop-in replacement for Ben-Ami's first page in his direct mmio series. Here, we continue to allow mmio pages to be represented in the rmap.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f095b73..11b26f5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -531,6 +531,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	struct page *page[1];
 	unsigned long addr;
 	int npages;
+	pfn_t pfn;

 	might_sleep();

@@ -543,19 +544,36 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
 				NULL);

-	if (npages != 1) {
-		get_page(bad_page);
-		return page_to_pfn(bad_page);
-	}
+	if (unlikely(npages != 1)) {
+		struct vm_area_struct *vma;
+
+		vma = find_vma(current->mm, addr);
+		if (vma == NULL) {
+			get_page(bad_page);
+			return page_to_pfn(bad_page);
+		}
+
+		BUG_ON(!(vma->vm_flags & VM_IO));

-	return page_to_pfn(page[0]);
+		pfn = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+		BUG_ON(pfn_valid(pfn));
+	} else
+		pfn = page_to_pfn(page[0]);
+
+	return pfn;
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);

 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-	return pfn_to_page(gfn_to_pfn(kvm, gfn));
+	pfn_t pfn;
+
+	pfn = gfn_to_pfn(kvm, gfn);
+	if (pfn_valid(pfn))
+		return pfn_to_page(pfn);
+
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(gfn_to_page);

@@ -568,7 +586,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

 void kvm_release_pfn_clean(pfn_t pfn)
 {
-	put_page(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

@@ -593,21 +612,25 @@ EXPORT_SYMBOL_GPL(kvm_set_page_dirty);

 void kvm_set_pfn_dirty(pfn_t pfn)
 {
-	struct page *page = pfn_to_page(pfn);
-	if (!PageReserved(page))
-		SetPageDirty(page);
+	if (pfn_valid(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+		if (!PageReserved(page))
+			SetPageDirty(page);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);

 void kvm_set_pfn_accessed(pfn_t pfn)
 {
-	mark_page_accessed(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);

 void kvm_get_pfn(pfn_t pfn)
 {
-	get_page(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		get_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_get_pfn);
From: Amit S. <ami...@qu...> - 2008-04-29 14:00:27
On Tuesday 29 April 2008 18:44:23 Andi Kleen wrote:
> Amit Shah <ami...@qu...> writes:
> > diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> > index 388b113..678cafb 100644
> > --- a/arch/x86/kernel/pci-dma.c
> > +++ b/arch/x86/kernel/pci-dma.c
> > @@ -443,6 +443,17 @@ dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
> >  		memset(memory, 0, size);
> >  		if (!mmu) {
> >  			*dma_handle = bus;
> > +			if (unlikely(dma_ops->is_pv_device) &&
> > +			    unlikely(dma_ops->is_pv_device(dev, dev->bus_id))) {
>
> First double unlikely in a condition is useless. Just drop them.
>
> And then ->is_xyz() in a generic vops interface is about as ugly
> and non generic as you can get. dma_alloc_coherent is not performance
> critical, so you should rather change the interface so that ->alloc_coherent
> is always called and the other handlers handle the !mmu case correctly.
> In fact they need that already I guess (e.g. on DMAR there is not really
> a nommu case)

This point came up the last time I sent out the patch; we should do this as well as implement stackable dma_ops (the need for that is evident in the next patch).

Thanks for the observation; this should be the next step.

Amit.
From: Amit S. <ami...@qu...> - 2008-04-29 13:58:44
On Tuesday 29 April 2008 19:01:32 Andi Kleen wrote:
> Amit Shah <ami...@qu...> writes:
> > +const struct dma_mapping_ops *orig_dma_ops;
>
> I suspect real dma ops stacking will need some further thought than
> your simple hacks

Yes; that's something we're planning to do.

> Haven't read further, but to be honest the code doesn't seem to be anywhere
> near merging quality.

I'm basically using these patches to test the PCI passthrough functionality (by which we can assign host PCI devices to a guest OS via KVM). While other methods of handling DMA operations are being worked on (1-1 mapping of the guest in the host address space and virtualization-aware IOMMU translations), this patchset provides a quick way to check if things indeed work.

However, if some version of this patch can be useful upstream, I'll be glad to work on that.

That said, as you point out, we need stackable dma_ops as well as getting rid of the is_pv_device() function in dma_ops, and that's something that can be done right away.

Thanks for the review!

Amit
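For context, "stackable dma_ops" here means wrapping the existing ops and delegating to them rather than replacing them wholesale. A minimal sketch (orig_dma_ops and kvm_is_pv_device are taken from this patchset; kvm_pv_map_sg is a hypothetical name for the hypercall path):

    static const struct dma_mapping_ops *orig_dma_ops;

    /* Delegate to the original ops for normal devices; only DMA for
     * host-assigned (passthrough) devices takes the hypercall path.
     */
    static int pv_map_sg(struct device *hwdev, struct scatterlist *sg,
                         int nents, int direction)
    {
            if (!kvm_is_pv_device(hwdev, hwdev->bus_id))
                    return orig_dma_ops->map_sg(hwdev, sg, nents, direction);

            return kvm_pv_map_sg(hwdev, sg, nents, direction); /* hypothetical */
    }

    static struct dma_mapping_ops pv_dma_ops = {
            .map_sg = pv_map_sg,
            /* the other ops would forward to orig_dma_ops similarly */
    };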
From: Andrea A. <an...@qu...> - 2008-04-29 13:35:47
Hi Hugh!!

On Tue, Apr 29, 2008 at 11:49:11AM +0100, Hugh Dickins wrote:
> [I'm scarcely following the mmu notifiers to-and-fro, which seems
> to be in good hands, amongst faster thinkers than me: who actually
> need and can test this stuff. Don't let me slow you down; but I
> can quickly clarify on this history.]

Still I think it'd be great if you could review mmu-notifier-core v14. You and Nick are the core VM maintainers, so it'd be great to hear any feedback about it. I think it's fairly easy to classify the patch as obviously safe as long as mmu notifiers are disarmed. Here is a link for your convenience.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core

> No, the locking was different as you had it, Andrea: there was an extra
> bitspin lock, carried over from the pte_chains days (maybe we changed
> the name, maybe we disagreed over the name, I forget), which mainly
> guarded the page->mapcount. I thought that was one lock more than we
> needed, and eliminated it in favour of atomic page->mapcount in 2.6.9.

Thanks a lot for the explanation!
From: Andi K. <an...@fi...> - 2008-04-29 13:35:35
Amit Shah <ami...@qu...> writes:

> +
> +static struct page *page;
> +static unsigned long page_gfn;

Bad variable names.

> +
> +const struct dma_mapping_ops *orig_dma_ops;

I suspect real dma ops stacking will need some further thought than your simple hacks

> +
> +	match = find_matching_pt_dev(&pt_devs_head, &pv_pci_info);
> +	if (match) {
> +		r = match->is_pv;
> +		goto out;
> +	}
> +
> +	memcpy(page_address(page), &pv_pci_info, sizeof(pv_pci_info));

Note that on 32bit page_address() might be not mapped.

> +
> +	npages = get_order(size) + 1;

Are you sure that's correct? It looks quite bogus. order is a base-2 logarithm, so normally npages = 1 << order; if you want npages from an order, the correct form is 1 << order.

Haven't read further, but to be honest the code doesn't seem to be anywhere near merging quality.

-Andi
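Concretely (standard kernel semantics, not code from the patch):

    /* get_order() returns log2 of the (rounded-up) page count, so the
     * number of pages is 1 << order.  With 4 KB pages and size = 16384:
     */
    int order = get_order(16384);   /* order = 2                 */
    int right = 1 << order;         /* 4 pages                   */
    int wrong = order + 1;          /* 3 -- not the page count   */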
From: Andi K. <an...@fi...> - 2008-04-29 13:16:09
Amit Shah <ami...@qu...> writes:

> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 388b113..678cafb 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -443,6 +443,17 @@ dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
>  		memset(memory, 0, size);
>  		if (!mmu) {
>  			*dma_handle = bus;
> +			if (unlikely(dma_ops->is_pv_device) &&
> +			    unlikely(dma_ops->is_pv_device(dev, dev->bus_id))) {

First, the double unlikely in a condition is useless. Just drop them.

And then ->is_xyz() in a generic ops interface is about as ugly and non-generic as you can get. dma_alloc_coherent is not performance critical, so you should rather change the interface so that ->alloc_coherent is always called and the other handlers handle the !mmu case correctly. In fact they need that already, I guess (e.g. on DMAR there is not really a nommu case).

-Andi
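A sketch of the restructuring being suggested (illustrative only; the actual interface change would touch every dma_mapping_ops implementation):

    /* Generic path: no is_pv_device() probe and no unlikely() hints.
     * Every dma_mapping_ops implementation provides alloc_coherent and
     * deals with the no-IOMMU (!mmu) case internally.
     */
    if (dma_ops->alloc_coherent)
            return dma_ops->alloc_coherent(dev, size, dma_handle, gfp);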
From: Andi K. <an...@fi...> - 2008-04-29 13:15:59
Amit Shah <ami...@qu...> writes:

> This patchset implements PVDMA for handling DMA requests from
> devices assigned to the guest from the host machine.

You forgot to post a high-level design overview of how this works, what it is good for, what the design trade-offs are, etc. Include that in the first patch.

-Andi
From: Jan K. <jan...@si...> - 2008-04-29 13:12:39
Hi,

looks like we are getting better and better here in hitting yet unsupported corner-case features of KVM :). This time our guest fiddles with hardware debugging registers, but quickly gets unhappy as they do not yet have the expected effect.

Joerg, I found your SVM-related patch series in the archive, which does not seem to have raised many responses. Is this general direction OK? Does it allow self-debugging of guests? But how are conflicts resolved if both guest and host need the physical registers (host debugging the guest which is debugging itself)?

I would try to dig into the VMX side if the general architecture is -mostly- clear. [ Sorry, Joerg, someone put the latter type of HW on my desk :->. Hope I can once check our stuff against SVM as well! ]

Thanks,
Jan

--
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
From: Guillaume T. <gui...@ex...> - 2008-04-29 13:07:13
Hello,

This patch should solve the problem observed during protected mode transitions that appears for example during the installation of openSuse-10.3. Unfortunately there is an issue that crashes kvm-userspace. I'm not sure if it's a problem introduced by the patch or if the patch is good and raises a new issue. Here is what I'm doing:

1) Remove the SS patching that modifies SS_SELECTOR in enter_pmode() to see the vmentry failure.
2) Add the handler that catches the VM-entry failure. It is called handle_vmentry_failure().
3) While CS.RPL != SS.RPL, emulate the instruction.
4) Add the emulation of "ljmp", "mov r, imm", "mov sreg, r/m16" and "mov r/m16, sreg", which have respectively opcode 0xea, 0xb8, 0x8e and 0x8c.

Normally, it should be sufficient to boot openSuse-10.3 because the instructions that need to be emulated are:

0x0000000000046e53: ljmp $0x18,$0x6e18
0x0000000000046e58: mov $0x20,%ax
0x0000000000046e5c: mov %eax,%ds
0x0000000000046e5e: mov %ss,%eax
0x0000000000046e60: and $0xffff,%esp
0x0000000000046e66: shl $0x4,%eax
0x0000000000046e69: add %eax,%esp
0x0000000000046e6b: mov $0x8,%ax
0x0000000000046e6f: mov %eax,%ss

At this point, cs.rpl is equal to ss.rpl. I added traces in handle_vmentry_failure() and also in writeback() to see which instructions are emulated, and I observe:

[82766.614575] Failed vm entry (exit reason 0x21) invalid guest state
[82766.651046] emulation at (46e53) rip 6e13: ea 18 6e 18
[82766.682611] writeback: dst.byte 0
[82766.706180] writeback: dst.ptr 0x0000000000000000
[82766.734890] writeback: dst.val 0x0
[82766.758591] writeback: src.ptr 0x0000000000000000
[82766.790594] writeback: src.val 0x0
[82766.855058] successfully emulated instruction
[82766.882695] Failed vm entry (exit reason 0x21) invalid guest state
[82766.923061] emulation at (46e58) rip 6e18: 66 b8 20 00
[82766.951079] writeback: dst.byte 2
[82766.975074] writeback: dst.ptr 0xffff810324d07400
[82767.003112] writeback: dst.val 0x20
[82767.027100] writeback: src.ptr 0x0000000000006e1a
[82767.059092] writeback: src.val 0x20
[82767.127094] successfully emulated instruction
[82767.151111] Failed vm entry (exit reason 0x21) invalid guest state
[82767.191099] emulation at (46e5c) rip 6e1c: 8e d8 8c d0
[82767.219156] writeback: dst.byte 4
[82767.243118] writeback: dst.ptr 0xffff810324d07418
[82767.275091] writeback: dst.val 0x800000
[82767.299122] writeback: src.ptr 0x0000000000000000
[82767.331106] writeback: src.val 0x20
[82767.395255] successfully emulated instruction
[82767.423135] Failed vm entry (exit reason 0x21) invalid guest state
[82767.459260] emulation at (46e5e) rip 6e1e: 8c d0 81 e4
[82767.491137] writeback: dst.byte 2
[82767.515117] writeback: dst.ptr 0xffff810324d07400
[82767.543138] writeback: dst.val 0x53e1
[82767.567264] writeback: src.ptr 0xffff810324d07410
[82767.599142] writeback: src.val 0x20
[82767.667146] successfully emulated instruction
[82767.691277] Failed vm entry (exit reason 0x21) invalid guest state
[82767.731152] emulation at (46e60) rip 6e20: 81 e4 ff ff
[82767.763136] writeback: dst.byte 0
[82767.783154] writeback: dst.ptr 0x0000000000000000
[82767.815157] writeback: dst.val 0x2004
[82767.839156] writeback: src.ptr 0x0000000000006e22
[82767.871140] writeback: src.val 0xffff
[82767.939170] successfully emulated instruction
[82767.963307] Failed vm entry (exit reason 0x21) invalid guest state
[82768.003174] emulation at (46e66) rip 6e26: c1 e0 04 01
[82768.035153] writeback: dst.byte 0
[82768.055174] writeback: dst.ptr 0x0000000000000000
[82768.087177] writeback: dst.val 0x53e1
[82768.111178] writeback: src.ptr 0x0000000000006e28
[82768.143157] writeback: src.val 0x4
[82768.211151] successfully emulated instruction
[82768.235189] Failed vm entry (exit reason 0x21) invalid guest state
[82768.271311] emulation at (46e69) rip 6e29: 01 c4 66 b8
[82768.303214] writeback: dst.byte 0
[82768.327213] writeback: dst.ptr 0x0000000000000000
[82768.355238] writeback: dst.val 0x2004
[82768.379316] writeback: src.ptr 0xffff810324d07400
[82768.411227] writeback: src.val 0x53e1
[82768.483168] successfully emulated instruction
[82768.507240] Failed vm entry (exit reason 0x21) invalid guest state
[82768.543329] emulation at (46e6b) rip 6e2b: 66 b8 08 00
[82768.575239] writeback: dst.byte 2
[82768.599233] writeback: dst.ptr 0xffff810324d07400
[82768.627257] writeback: dst.val 0x8
[82768.651246] writeback: src.ptr 0x0000000000006e2d
[82768.683245] writeback: src.val 0x8
[82768.751250] successfully emulated instruction
[82768.775331] Failed vm entry (exit reason 0x21) invalid guest state
[82768.815256] emulation at (46e6f) rip 6e2f: 8e d0 8e c0
[82768.843348] writeback: dst.byte 4
[82768.867268] writeback: dst.ptr 0xffff810324d07410
[82768.899204] writeback: dst.val 0x53e1
[82768.923259] writeback: src.ptr 0x0000000000000000
[82768.951351] writeback: src.val 0x8
[82769.019279] successfully emulated instruction

So everything seems OK, but after the emulation of the "mov %eax,%ss" instruction, it seems that cs.rpl == ss.rpl while the guest is still in a VT-unfriendly state, because I have the following error in kvm-userspace:

[guill@enterprise][~/local/kvm-userspace.git/bin]$ ./qemu-system-x86_64 -hda ~/disk_images/hd_50G.qcow2 -cdrom /images_iso/openSUSE-10.3-GM-x86_64-mini.iso -boot d -s -m 1024
exception 13 (33)
rax 0000000000000673 rbx 0000000000800000 rcx 0000000000000000 rdx 00000000000013ca
rsi 0000000000055e1c rdi 0000000000055e1d rsp 00000000fffa0080 rbp 000000000000200b
r8  0000000000000000 r9  0000000000000000 r10 0000000000000000 r11 0000000000000000
r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 r15 0000000000000000
rip 000000000000b071 rflags 00033092
cs 4004 (00040040/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ds 4004 (00040040/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
es 00ff (00000ff0/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ss ff11 (000ff110/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
fs 3002 (00030020/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
tr 0000 (fffbd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0)
ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
gdt 40920/47
idt 0/ffff
cr0 10 cr2 0 cr3 0 cr4 0 cr8 0 efer 0
code: 17 06 29 4b 01 18 eb 18 a8 25 aa 19 28 4c 01 28 4d 01 01 17 --> 0f 17 0f 01 17 0f 17 12 01 17 2c 25 4b 19 21 00 02 17 1a 94 0a 76 67 61 3d 30 78 25 78 20
Aborted

It's strange because handle_vmentry_failure() is not called. I'm trying to see where the problem is; any comments are welcome.

Regards,
Guillaume

 arch/x86/kvm/vmx.c         |   68 +++++++++++++++++++++++++++
 arch/x86/kvm/vmx.h         |    3 +
 arch/x86/kvm/x86.c         |   12 ++--
 arch/x86/kvm/x86_emulate.c |  112 +++++++++++++++++++++++++++++++++++++++++--
 include/asm-x86/kvm_host.h |    4 +
 5 files changed, 190 insertions(+), 9 deletions(-)

---
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 79cdbe8..a0a13b8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1272,7 +1272,9 @@ static void enter_pmode(struct kvm_vcpu *vcpu)
 	fix_pmode_dataseg(VCPU_SREG_GS, &vcpu->arch.rmode.gs);
 	fix_pmode_dataseg(VCPU_SREG_FS, &vcpu->arch.rmode.fs);

+#if 0
 	vmcs_write16(GUEST_SS_SELECTOR, 0);
+#endif
 	vmcs_write32(GUEST_SS_AR_BYTES, 0x93);

 	vmcs_write16(GUEST_CS_SELECTOR,
@@ -2635,6 +2637,66 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	return 1;
 }

+static int invalid_guest_state(struct kvm_vcpu *vcpu,
+			       struct kvm_run *kvm_run, u32 failure_reason)
+{
+	u16 ss, cs;
+	u8 opcodes[4];
+	unsigned long rip = vcpu->arch.rip;
+	unsigned long rip_linear;
+
+	ss = vmcs_read16(GUEST_SS_SELECTOR);
+	cs = vmcs_read16(GUEST_CS_SELECTOR);
+
+	if ((ss & 0x03) != (cs & 0x03)) {
+		int err;
+		rip_linear = rip + vmx_get_segment_base(vcpu, VCPU_SREG_CS);
+		emulator_read_std(rip_linear, (void *)opcodes, 4, vcpu);
+		printk(KERN_INFO "emulation at (%lx) rip %lx: %02x %02x %02x %02x\n",
+		       rip_linear,
+		       rip, opcodes[0], opcodes[1], opcodes[2], opcodes[3]);
+		err = emulate_instruction(vcpu, kvm_run, 0, 0, 0);
+		switch (err) {
+		case EMULATE_DONE:
+			printk(KERN_INFO "successfully emulated instruction\n");
+			return 1;
+		case EMULATE_DO_MMIO:
+			printk(KERN_INFO "mmio?\n");
+			return 0;
+		default:
+			kvm_report_emulation_failure(vcpu, "vmentry failure");
+			break;
+		}
+	}
+
+	kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
+	kvm_run->hw.hardware_exit_reason = failure_reason;
+	return 0;
+}
+
+static int handle_vmentry_failure(struct kvm_vcpu *vcpu,
+				  struct kvm_run *kvm_run,
+				  u32 failure_reason)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+	printk(KERN_INFO "Failed vm entry (exit reason 0x%x) ", failure_reason);
+	switch (failure_reason) {
+	case EXIT_REASON_INVALID_GUEST_STATE:
+		printk("invalid guest state\n");
+		return invalid_guest_state(vcpu, kvm_run, failure_reason);
+	case EXIT_REASON_MSR_LOADING:
+		printk("caused by MSR entry %ld loading.\n", exit_qualification);
+		break;
+	case EXIT_REASON_MACHINE_CHECK:
+		printk("caused by machine check.\n");
+		break;
+	default:
+		printk("reason not known yet!\n");
+		break;
+	}
+	return 0;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume. Otherwise they set the kvm_run parameter to indicate what needs
@@ -2696,6 +2758,12 @@ static int kvm_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 			exit_reason != EXIT_REASON_EPT_VIOLATION))
 		printk(KERN_WARNING "%s: unexpected, valid vectoring info and "
 		       "exit reason is 0x%x\n", __func__, exit_reason);
+
+	if ((exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY)) {
+		exit_reason &= ~VMX_EXIT_REASONS_FAILED_VMENTRY;
+		return handle_vmentry_failure(vcpu, kvm_run, exit_reason);
+	}
+
 	if (exit_reason < kvm_vmx_max_exit_handlers &&
 	    kvm_vmx_exit_handlers[exit_reason])
 		return kvm_vmx_exit_handlers[exit_reason](vcpu, kvm_run);
diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h
index 79d94c6..2cebf48 100644
--- a/arch/x86/kvm/vmx.h
+++ b/arch/x86/kvm/vmx.h
@@ -238,7 +238,10 @@ enum vmcs_field {
 #define EXIT_REASON_IO_INSTRUCTION      30
 #define EXIT_REASON_MSR_READ            31
 #define EXIT_REASON_MSR_WRITE           32
+#define EXIT_REASON_INVALID_GUEST_STATE 33
+#define EXIT_REASON_MSR_LOADING         34
 #define EXIT_REASON_MWAIT_INSTRUCTION   36
+#define EXIT_REASON_MACHINE_CHECK       41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS         44
 #define EXIT_REASON_EPT_VIOLATION       48
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 578a0c1..9e5d687 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3027,8 +3027,8 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	return 0;
 }

-static void get_segment(struct kvm_vcpu *vcpu,
-			struct kvm_segment *var, int seg)
+void get_segment(struct kvm_vcpu *vcpu,
+		 struct kvm_segment *var, int seg)
 {
 	kvm_x86_ops->get_segment(vcpu, var, seg);
 }
@@ -3111,8 +3111,8 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 	return 0;
 }

-static void set_segment(struct kvm_vcpu *vcpu,
-			struct kvm_segment *var, int seg)
+void set_segment(struct kvm_vcpu *vcpu,
+		 struct kvm_segment *var, int seg)
 {
 	kvm_x86_ops->set_segment(vcpu, var, seg);
 }
@@ -3270,8 +3270,8 @@ static int load_segment_descriptor_to_kvm_desct(struct kvm_vcpu *vcpu,
 	return 0;
 }

-static int load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector,
-				   int type_bits, int seg)
+int load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector,
+			    int type_bits, int seg)
 {
 	struct kvm_segment kvm_seg;

diff --git a/arch/x86/kvm/x86_emulate.c b/arch/x86/kvm/x86_emulate.c
index 2ca0838..f6b9dad 100644
--- a/arch/x86/kvm/x86_emulate.c
+++ b/arch/x86/kvm/x86_emulate.c
@@ -138,7 +138,8 @@ static u16 opcode_table[256] = {
 	/* 0x88 - 0x8F */
 	ByteOp | DstMem | SrcReg | ModRM | Mov, DstMem | SrcReg | ModRM | Mov,
 	ByteOp | DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
-	0, ModRM | DstReg, 0, Group | Group1A,
+	DstMem | SrcReg | ModRM | Mov, ModRM | DstReg,
+	DstReg | SrcMem | ModRM | Mov, Group | Group1A,
 	/* 0x90 - 0x9F */
 	0, 0, 0, 0, 0, 0, 0, 0,
 	0, 0, 0, 0, ImplicitOps | Stack, ImplicitOps | Stack, 0, 0,
@@ -152,7 +153,8 @@ static u16 opcode_table[256] = {
 	ByteOp | ImplicitOps | Mov | String, ImplicitOps | Mov | String,
 	ByteOp | ImplicitOps | String, ImplicitOps | String,
 	/* 0xB0 - 0xBF */
-	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	DstReg | SrcImm | Mov, 0, 0, 0, 0, 0, 0, 0,
 	/* 0xC0 - 0xC7 */
 	ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImmByte | ModRM,
 	0, ImplicitOps | Stack, 0, 0,
@@ -168,7 +170,7 @@ static u16 opcode_table[256] = {
 	/* 0xE0 - 0xE7 */
 	0, 0, 0, 0, 0, 0, 0, 0,
 	/* 0xE8 - 0xEF */
-	ImplicitOps | Stack, SrcImm|ImplicitOps, 0, SrcImmByte|ImplicitOps,
+	ImplicitOps | Stack, SrcImm | ImplicitOps, ImplicitOps, SrcImmByte | ImplicitOps,
 	0, 0, 0, 0,
 	/* 0xF0 - 0xF7 */
 	0, 0, 0, 0,
@@ -1511,14 +1513,90 @@ special_insn:
 		break;
 	case 0x88 ... 0x8b:	/* mov */
 		goto mov;
+	case 0x8c: { /* mov r/m, sreg */
+		struct kvm_segment segreg;
+
+		if (c->modrm_mod == 0x3)
+			c->src.val = c->modrm_val;
+
+		switch ( c->modrm_reg ) {
+		case 0:
+			get_segment(ctxt->vcpu, &segreg, VCPU_SREG_ES);
+			break;
+		case 1:
+			get_segment(ctxt->vcpu, &segreg, VCPU_SREG_CS);
+			break;
+		case 2:
+			get_segment(ctxt->vcpu, &segreg, VCPU_SREG_SS);
+			break;
+		case 3:
+			get_segment(ctxt->vcpu, &segreg, VCPU_SREG_DS);
+			break;
+		case 4:
+			get_segment(ctxt->vcpu, &segreg, VCPU_SREG_FS);
+			break;
+		case 5:
+			get_segment(ctxt->vcpu, &segreg, VCPU_SREG_GS);
+			break;
+		default:
+			printk(KERN_INFO "0x8c: Invalid segreg in modrm byte 0x%02x\n",
+			       c->modrm);
+			goto cannot_emulate;
+		}
+		c->dst.val = segreg.selector;
+		c->dst.bytes = 2;
+		c->dst.ptr = (unsigned long *)decode_register(c->modrm_rm, c->regs,
+							      c->d & ByteOp);
+		break;
+	}
 	case 0x8d: /* lea r16/r32, m */
 		c->dst.val = c->modrm_ea;
 		break;
+	case 0x8e: { /* mov seg, r/m16 */
+		uint16_t sel;
+
+		sel = c->src.val;
+		switch ( c->modrm_reg ) {
+		case 0:
+			if (load_segment_descriptor(ctxt->vcpu, sel, 1, VCPU_SREG_ES) < 0)
+				goto cannot_emulate;
+			break;
+		case 1:
+			if (load_segment_descriptor(ctxt->vcpu, sel, 9, VCPU_SREG_CS) < 0)
+				goto cannot_emulate;
+			break;
+		case 2:
+			if (load_segment_descriptor(ctxt->vcpu, sel, 1, VCPU_SREG_SS) < 0)
+				goto cannot_emulate;
+			break;
+		case 3:
+			if (load_segment_descriptor(ctxt->vcpu, sel, 1, VCPU_SREG_DS) < 0)
+				goto cannot_emulate;
+			break;
+		case 4:
+			if (load_segment_descriptor(ctxt->vcpu, sel, 1, VCPU_SREG_FS) < 0)
+				goto cannot_emulate;
+			break;
+		case 5:
+			if (load_segment_descriptor(ctxt->vcpu, sel, 1, VCPU_SREG_GS) < 0)
+				goto cannot_emulate;
+			break;
+		default:
+			printk(KERN_INFO "Invalid segreg in modrm byte 0x%02x\n",
+			       c->modrm);
+			goto cannot_emulate;
+		}
+
+		c->dst.type = OP_NONE;  /* Disable writeback. */
+		break;
+	}
 	case 0x8f: /* pop (sole member of Grp1a) */
 		rc = emulate_grp1a(ctxt, ops);
 		if (rc != 0)
 			goto done;
 		break;
+	case 0xb8: /* mov r, imm */
+		goto mov;
 	case 0x9c: /* pushf */
 		c->src.val = (unsigned long) ctxt->eflags;
 		emulate_push(ctxt);
@@ -1657,6 +1735,34 @@ special_insn:
 		break;
 	}
 	case 0xe9: /* jmp rel */
+		jmp_rel(c, c->src.val);
+		c->dst.type = OP_NONE; /* Disable writeback. */
+		break;
+	case 0xea: /* jmp far */ {
+		uint32_t eip;
+		uint16_t sel;
+
+		switch (c->op_bytes) {
+		case 2:
+			eip = insn_fetch(u16, 2, c->eip);
+			eip = eip & 0x0000FFFF; /* clear upper 16 bits */
+			break;
+		case 4:
+			eip = insn_fetch(u32, 4, c->eip);
+			break;
+		default:
+			DPRINTF("jmp far: Invalid op_bytes\n");
+			goto cannot_emulate;
+		}
+		sel = insn_fetch(u16, 2, c->eip);
+		if (load_segment_descriptor(ctxt->vcpu, sel, 9, VCPU_SREG_CS) < 0) {
+			DPRINTF("jmp far: Failed to load CS descriptor\n");
+			goto cannot_emulate;
+		}
+
+		c->eip = eip;
+		break;
+	}
 	case 0xeb: /* jmp rel short */
 		jmp_rel(c, c->src.val);
 		c->dst.type = OP_NONE; /* Disable writeback. */
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 4baa9c9..7a0846a 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -495,6 +495,10 @@ int emulator_get_dr(struct x86_emulate_ctxt *ctxt, int dr,
 int emulator_set_dr(struct x86_emulate_ctxt *ctxt, int dr,
 		    unsigned long value);

+void set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+int load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector,
+			    int type_bits, int seg);
 int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int reason);

 void kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
From: Hugh D. <hu...@ve...> - 2008-04-29 11:00:14
On Tue, 29 Apr 2008, Andrea Arcangeli wrote:
>
> My point of view is that there was no rcu when I wrote that code, yet
> there was no reference count and yet all locking looks still exactly
> the same as I wrote it. There's even still the page_table_lock to
> serialize threads taking the mmap_sem in read mode against the first
> vma->anon_vma = anon_vma during the page fault.
>
> Frankly I've absolutely no idea why rcu is needed in all rmap code
> when walking the page->mapping. Definitely the PG_locked is taken so
> there's no way page->mapping could possibly go away under the rmap
> code, hence the anon_vma can't go away as it's queued in the vma, and
> the vma has to go away before the page is zapped out of the pte.

[I'm scarcely following the mmu notifiers to-and-fro, which seems to be in good hands, amongst faster thinkers than me: who actually need and can test this stuff. Don't let me slow you down; but I can quickly clarify on this history.]

No, the locking was different as you had it, Andrea: there was an extra bitspin lock, carried over from the pte_chains days (maybe we changed the name, maybe we disagreed over the name, I forget), which mainly guarded the page->mapcount. I thought that was one lock more than we needed, and eliminated it in favour of atomic page->mapcount in 2.6.9. Here are the relevant extracts from ChangeLog-2.6.9:

[PATCH] rmaplock: PageAnon in mapping

First of a batch of five patches to eliminate rmap's page_map_lock, replace its trylocking by spinlocking, and use anon_vma to speed up swapoff.

Patches updated from the originals against 2.6.7-mm7: nothing new so I won't spam the list, but including Manfred's SLAB_DESTROY_BY_RCU fixes, and omitting the unuse_process mmap_sem fix already in 2.6.8-rc3.

This patch: Replace the PG_anon page->flags bit by setting the lower bit of the pointer in page->mapping when it's anon_vma: PAGE_MAPPING_ANON bit. We're about to eliminate the locking which kept the flags and mapping in synch: it's much easier to work on a local copy of page->mapping, than worry about whether flags and mapping are in synch (though I imagine it could be done, at greater cost, with some barriers).

[PATCH] rmaplock: kill page_map_lock

The pte_chains rmap used pte_chain_lock (bit_spin_lock on PG_chainlock) to lock its pte_chains. We kept this (as page_map_lock: bit_spin_lock on PG_maplock) when we moved to objrmap. But the file objrmap locks its vma tree with mapping->i_mmap_lock, and the anon objrmap locks its vma list with anon_vma->lock: so isn't the page_map_lock superfluous?

Pretty much, yes. The mapcount was protected by it, and needs to become an atomic: starting at -1 like page _count, so nr_mapped can be tracked precisely up and down. The last page_remove_rmap can't clear anon page mapping any more, because of races with page_add_rmap; from which some BUG_ONs must go for the same reason, but they've served their purpose.

vmscan decisions are naturally racy, little change there beyond removing page_map_lock/unlock. But to stabilize the file-backed page->mapping against truncation while acquiring i_mmap_lock, page_referenced_file now needs page lock to be held even for refill_inactive_zone. There's a similar issue in acquiring anon_vma->lock, where page lock doesn't help: which this patch pretends to handle, but actually it needs the next.

Roughly 10% cut off lmbench fork numbers on my 2*HT*P4. Must confess my testing failed to show the races even while they were knowingly exposed: would benefit from testing on racier equipment.

[PATCH] rmaplock: SLAB_DESTROY_BY_RCU

With page_map_lock gone, how to stabilize page->mapping's anon_vma while acquiring anon_vma->lock in page_referenced_anon and try_to_unmap_anon?

The page cannot actually be freed (vmscan holds reference), but however much we check page_mapped (which guarantees that anon_vma is in use - or would guarantee that if we added suitable barriers), there's no locking against page becoming unmapped the instant after, then anon_vma freed.

It's okay to take anon_vma->lock after it's freed, so long as it remains a struct anon_vma (its list would become empty, or perhaps reused for an unrelated anon_vma: but no problem since we always check that the page located is the right one); but corruption if that memory gets reused for some other purpose.

This is not unique: it's liable to be problem whenever the kernel tries to approach a structure obliquely. It's generally solved with an atomic reference count; but one advantage of anon_vma over anonmm is that it does not have such a count, and it would be a backward step to add one.

Therefore... implement SLAB_DESTROY_BY_RCU flag, to guarantee that such a kmem_cache_alloc'ed structure cannot get freed to other use while the rcu_read_lock is held i.e. preempt disabled; and use that for anon_vma.

Fix concerns raised by Manfred: this flag is incompatible with poisoning and destructor, and kmem_cache_destroy needs to synchronize_kernel.

I hope SLAB_DESTROY_BY_RCU may be useful elsewhere; but though it's safe for little anon_vma, I'd be reluctant to use it on any caches whose immediate shrinkage under pressure is important to the system.

[PATCH] rmaplock: mm lock ordering

With page_map_lock out of the way, there's no need for page_referenced and try_to_unmap to use trylocks - provided we switch anon_vma->lock and mm->page_table_lock around in anon_vma_prepare. Though I suppose it's possible that we'll find that vmscan makes better progress with trylocks than spinning - we're free to choose trylocks again if so.

Try to update the mm lock ordering documentation in filemap.c. But I still find it confusing, and I've no idea of where to stop. So add an mm lock ordering list I can understand to rmap.c.

[The fifth patch was about using anon_vma in swapoff, not relevant here.]

So, going back to what you wrote: holding the page lock there is not enough to prevent the struct anon_vma going away beneath us.

Hugh
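The lookup idiom that SLAB_DESTROY_BY_RCU enables, roughly (a sketch of the pattern the changelog describes, not the actual rmap code):

    struct anon_vma *anon_vma;
    unsigned long anon_mapping;

    rcu_read_lock();
    anon_mapping = (unsigned long)page->mapping;
    if ((anon_mapping & PAGE_MAPPING_ANON) && page_mapped(page)) {
            anon_vma = (struct anon_vma *)(anon_mapping - PAGE_MAPPING_ANON);
            /* The slab memory cannot be returned to the page allocator
             * while we are inside rcu_read_lock(), so it is safe to
             * take the lock even if the anon_vma was freed and reused
             * for another anon_vma meanwhile; each vma found under the
             * lock is re-checked against the page.
             */
            spin_lock(&anon_vma->lock);
            /* ... walk anon_vma->head ... */
            spin_unlock(&anon_vma->lock);
    }
    rcu_read_unlock();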
From: Mark M. <ma...@re...> - 2008-04-29 10:42:32
The -kernel option generates a new boot sector for the boot drive which jumps directly to the supplied kernel rather than running the standard bootloader. Trivially fix generate_bootsect() to handle the case where we're booting using extboot.

Signed-off-by: Mark McLoughlin <ma...@re...>
---
 qemu/hw/pc.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
index 48a36e0..506ef6b 100644
--- a/qemu/hw/pc.c
+++ b/qemu/hw/pc.c
@@ -405,11 +405,12 @@ static void generate_bootsect(uint32_t gpr[8], uint16_t segs[6], uint16_t ip)
 {
     uint8_t bootsect[512], *p;
     int i;
-    int hda;
+    int hda = extboot_drive;

-    hda = drive_get_index(IF_IDE, 0, 0);
+    if (hda == -1)
+        hda = drive_get_index(IF_IDE, 0, 0);
     if (hda == -1) {
-        fprintf(stderr, "A disk image must be given for 'hda' when booting "
+        fprintf(stderr, "-hda or -drive boot=on must be given when booting "
                 "a Linux kernel\n");
         exit(1);
     }
--
1.5.4.1
From: Amit S. <ami...@qu...> - 2008-04-29 10:37:34
dma_alloc_coherent() doesn't call dma_ops->alloc_coherent in case no IOMMU translations are necessary. However, if the device doing the DMA is a physical device assigned to the guest OS by the host, we need to map all the DMA addresses to the host machine addresses. This is done via hypercalls to the host.

In KVM, with pci passthrough support, we can assign actual devices to the guest OS which need this functionality.

Signed-off-by: Amit Shah <ami...@qu...>
---
 arch/x86/kernel/pci-dma.c     |   11 +++++++++++
 include/asm-x86/dma-mapping.h |    2 ++
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 388b113..678cafb 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -443,6 +443,17 @@ dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
 		memset(memory, 0, size);
 		if (!mmu) {
 			*dma_handle = bus;
+			if (unlikely(dma_ops->is_pv_device) &&
+			    unlikely(dma_ops->is_pv_device(dev, dev->bus_id))) {
+				void *r;
+				r = dma_ops->alloc_coherent(dev, size,
+							    dma_handle, gfp);
+				if (r == NULL) {
+					free_pages((unsigned long)memory,
+						   get_order(size));
+					memory = NULL;
+				}
+			}
 			return memory;
 		}
 	}
diff --git a/include/asm-x86/dma-mapping.h b/include/asm-x86/dma-mapping.h
index a1a4dc7..b9c6a39 100644
--- a/include/asm-x86/dma-mapping.h
+++ b/include/asm-x86/dma-mapping.h
@@ -55,6 +55,8 @@ struct dma_mapping_ops {
 			int direction);
 	int (*dma_supported)(struct device *hwdev, u64 mask);
 	int is_phys;
+	/* Is this a physical device in a paravirtualized guest? */
+	int (*is_pv_device)(struct device *hwdev, const char *name);
 };

 extern const struct dma_mapping_ops *dma_ops;
--
1.5.4.3
From: Amit S. <ami...@qu...> - 2008-04-29 10:37:33
We introduce three hypercalls:

1. When the guest wants to check if a particular device is an assigned device (this is done once per device by the guest to enable / disable hypercall-based translation of addresses)

2. map: to convert guest physical addresses to host physical addresses to pass on to the device for DMA. We also pin the pages thus requested so that they're not swapped out.

3. unmap: to unpin the pages and free any information we might have stored.

Signed-off-by: Amit Shah <ami...@qu...>
---
 arch/x86/kvm/x86.c         |  211 +++++++++++++++++++++++++++++++++++++++++++-
 include/asm-x86/kvm_host.h |   15 +++
 include/asm-x86/kvm_para.h |    8 ++
 3 files changed, 233 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb9b329..94ee4db 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -24,8 +24,11 @@
 #include <linux/interrupt.h>
 #include <linux/kvm.h>
 #include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/pci.h>
 #include <linux/vmalloc.h>
 #include <linux/module.h>
+#include <linux/highmem.h>
 #include <linux/mman.h>
 #include <linux/highmem.h>

@@ -76,6 +79,9 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ "halt_exits", VCPU_STAT(halt_exits) },
 	{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
 	{ "hypercalls", VCPU_STAT(hypercalls) },
+	{ "hypercall_map", VCPU_STAT(hypercall_map) },
+	{ "hypercall_unmap", VCPU_STAT(hypercall_unmap) },
+	{ "hypercall_pv_dev", VCPU_STAT(hypercall_pv_dev) },
 	{ "request_irq", VCPU_STAT(request_irq_exits) },
 	{ "irq_exits", VCPU_STAT(irq_exits) },
 	{ "host_state_reload", VCPU_STAT(host_state_reload) },
@@ -95,9 +101,164 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ NULL }
 };

+static struct kvm_pv_dma_map*
+find_pci_pv_dmap(struct list_head *head, dma_addr_t dma)
+{
+	struct list_head *ptr;
+	struct kvm_pv_dma_map *match;
+
+	list_for_each(ptr, head) {
+		match = list_entry(ptr, struct kvm_pv_dma_map, list);
+		if (match && match->sg[0].dma_address == dma)
+			return match;
+	}
+	return NULL;
+}
+
+static void prepare_sg_entry(struct scatterlist *sg, struct page *page)
+{
+	unsigned int offset, len;
+
+	offset = page_to_phys(page) & ~PAGE_MASK;
+	len = PAGE_SIZE - offset;
+
+	/* FIXME: Use the sg chaining features */
+	sg_set_page(sg, page, len, offset);
+}
+
+static int pv_map_hypercall(struct kvm_vcpu *vcpu, int npages, gfn_t page_gfn)
+{
+	int i, r = 0;
+	struct page *host_page;
+	struct scatterlist *sg;
+	struct kvm_pv_dma_map *dmap;
+	unsigned long *shared_addr, *hcall_page;
+
+	/* We currently don't support dma mappings which have more than
+	 * PAGE_SIZE/sizeof(unsigned long *) pages
+	 */
+	if (!npages || npages > MAX_PVDMA_PAGES) {
+		printk(KERN_INFO "%s: Illegal number of pages: %d\n",
+		       __func__, npages);
+		goto out;
+	}
+
+	host_page = gfn_to_page(vcpu->kvm, page_gfn);
+	if (is_error_page(host_page)) {
+		printk(KERN_INFO "%s: Bad gfn %p\n", __func__,
+		       (void *)page_gfn);
+		goto out;
+	}
+	hcall_page = shared_addr = kmap(host_page);
+
+	/* scatterlist to map guest dma pages into host physical
+	 * memory -- if they exceed the DMA map limit
+	 */
+	sg = kcalloc(npages, sizeof(struct scatterlist), GFP_KERNEL);
+	if (sg == NULL) {
+		printk(KERN_INFO "%s: Couldn't allocate memory (sg)\n",
+		       __func__);
+		goto out_unmap;
+	}
+
+	/* List to store all guest pages mapped into host. This will
+	 * be used later to free pages on the host. Think of this as a
+	 * translation table from guest dma addresses into host dma
+	 * addresses
+	 */
+	dmap = kzalloc(sizeof(*dmap), GFP_KERNEL);
+	if (dmap == NULL) {
+		printk(KERN_INFO "%s: Couldn't allocate memory\n",
+		       __func__);
+		goto out_unmap_sg;
+	}
+
+	/* FIXME: consider the length of the last page. Guest should
+	 * send this info.
+	 */
+	for (i = 0; i < npages; i++) {
+		struct page *page;
+		gpa_t gpa;
+
+		gpa = *shared_addr++;
+		page = gfn_to_page(vcpu->kvm, gpa >> PAGE_SHIFT);
+		if (is_error_page(page)) {
+			int j;
+			printk(KERN_INFO "kvm %s: gpa %p not valid\n",
+			       __func__, (void *)gpa);
+
+			for (j = 0; j < i; j++)
+				put_page(sg_page(&sg[j]));
+			goto out_unmap_sg_dmap;
+		}
+		prepare_sg_entry(&sg[i], page);
+		get_page(sg_page(&sg[i]));
+	}
+
+	/* Put this on the dmap_head list, so that we can find it
+	 * later for the 'free' operation
+	 */
+	dmap->sg = sg;
+	dmap->nents = npages;
+	list_add(&dmap->list, &vcpu->kvm->arch.pci_pv_dmap_head);
+
+	/* FIXME: guest should send the direction */
+	r = dma_ops->map_sg(NULL, sg, npages, PCI_DMA_BIDIRECTIONAL);
+	if (r) {
+		r = npages;
+		*hcall_page = sg[0].dma_address | (*hcall_page & ~PAGE_MASK);
+	}
+
+ out_unmap:
+	if (!r)
+		*hcall_page = bad_dma_address;
+	kunmap(host_page);
+ out:
+	++vcpu->stat.hypercall_map;
+	return r;
+ out_unmap_sg_dmap:
+	kfree(dmap);
+ out_unmap_sg:
+	kfree(sg);
+	goto out_unmap;
+}
+
+static int free_dmap(struct kvm_pv_dma_map *dmap, struct list_head *head)
+{
+	int i;
+
+	if (!dmap)
+		return 1;
+
+	for (i = 0; i < dmap->nents; i++)
+		put_page(sg_page(&dmap->sg[i]));
+
+	if (dma_ops->unmap_sg)
+		dma_ops->unmap_sg(NULL, dmap->sg, dmap->nents,
+				  PCI_DMA_BIDIRECTIONAL);
+	kfree(dmap->sg);
+	list_del(&dmap->list);
+	kfree(dmap);
+
+	return 0;
+}
+
+/* FIXME: the argument passed from guest can be 32-bit. We need 64-bit for
+ * dma_addr_t. Send the dma address in a page (or split in two registers)
+ */
+static int pv_unmap_hypercall(struct kvm_vcpu *vcpu, dma_addr_t dma)
+{
+	struct kvm_pv_dma_map *dmap;
+
+	++vcpu->stat.hypercall_unmap;
+
+	dmap = find_pci_pv_dmap(&vcpu->kvm->arch.pci_pv_dmap_head, dma);
+	return free_dmap(dmap, &vcpu->kvm->arch.pci_pv_dmap_head);
+}
+
 /*
  * Used to find a registered host PCI device (a "passthrough" device)
- * during ioctls, interrupts or EOI
+ * during hypercalls, ioctls or interrupts or EOI
  */
 static struct kvm_pci_pt_dev_list *
 find_pci_pt_dev(struct list_head *head,
@@ -136,6 +297,34 @@ find_pci_pt_dev(struct list_head *head,
 	return NULL;
 }

+static int
+pv_mapped_pci_device_hypercall(struct kvm_vcpu *vcpu, gfn_t page_gfn)
+{
+	int r = 0;
+	unsigned long *shared_addr;
+	struct page *host_page;
+	struct kvm_pci_pt_info pci_pt_info;
+
+	host_page = gfn_to_page(vcpu->kvm, page_gfn);
+	if (is_error_page(host_page)) {
+		printk(KERN_INFO "%s: gfn %p not valid\n",
+		       __func__, (void *)page_gfn);
+		r = -1;
+		goto out;
+	}
+	shared_addr = kmap(host_page);
+	memcpy(&pci_pt_info, shared_addr, sizeof(pci_pt_info));
+
+	if (find_pci_pt_dev(&vcpu->kvm->arch.pci_pt_dev_head,
+			    &pci_pt_info, 0, KVM_PT_SOURCE_ASSIGN))
+		r++; /* We have assigned the device */
+
+	kunmap(host_page);
+ out:
+	++vcpu->stat.hypercall_pv_dev;
+	return r;
+}
+
 static DECLARE_BITMAP(pt_irq_pending, NR_IRQS);
 static DECLARE_BITMAP(pt_irq_handled, NR_IRQS);

@@ -218,6 +407,10 @@ static int kvm_vm_ioctl_pci_pt_dev(struct kvm *kvm,
 		set_bit(pci_pt_dev->host.irq, pt_irq_handled);
 	}
 	list_add(&match->list, &kvm->arch.pci_pt_dev_head);
+
+	printk(KERN_INFO "kvm: Handling hypercalls for device %02x:%02x.%1x\n",
+	       pci_pt_dev->host.busnr, PCI_SLOT(pci_pt_dev->host.devfn),
+	       PCI_FUNC(pci_pt_dev->host.devfn));
 out:
 	return r;
 out_free:
@@ -248,6 +441,7 @@ static void kvm_free_pci_passthrough(struct kvm *kvm)
 {
 	struct list_head *ptr, *ptr2;
 	struct kvm_pci_pt_dev_list *pci_pt_dev;
+	struct kvm_pv_dma_map *dmap;

 	list_for_each_safe(ptr, ptr2, &kvm->arch.pci_pt_dev_head) {
 		pci_pt_dev = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
@@ -257,6 +451,11 @@ static void kvm_free_pci_passthrough(struct kvm *kvm)

 		list_del(&pci_pt_dev->list);
 	}
+
+	list_for_each_safe(ptr, ptr2, &kvm->arch.pci_pv_dmap_head) {
+		dmap = list_entry(ptr, struct kvm_pv_dma_map, list);
+		free_dmap(dmap, &kvm->arch.pci_pv_dmap_head);
+	}
 }

 unsigned long segment_base(u16 selector)
@@ -2672,6 +2871,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 	}

 	switch (nr) {
+	case KVM_PV_DMA_MAP:
+		ret = pv_map_hypercall(vcpu, a0, a1);
+		break;
+	case KVM_PV_DMA_UNMAP:
+		ret = pv_unmap_hypercall(vcpu, a0);
+		break;
+	case KVM_PV_PCI_DEVICE:
+		ret = pv_mapped_pci_device_hypercall(vcpu, a0);
+		break;
 	case KVM_HC_VAPIC_POLL_IRQ:
 		ret = 0;
 		break;
@@ -4059,6 +4267,7 @@ struct kvm *kvm_arch_create_vm(void)

 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.pci_pt_dev_head);
+	INIT_LIST_HEAD(&kvm->arch.pci_pv_dmap_head);

 	return kvm;
 }
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 3b9cb50..d52e44e 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -298,6 +298,17 @@ struct kvm_mem_alias {
 #define KVM_PT_SOURCE_IRQ_ACK 2
 #define KVM_PT_SOURCE_ASSIGN 3

+/* Paravirt DMA: We pin the host-side pages for the GPAs that we get
+ * for the DMA operation. We do a sg_map on the host pages for a DMA
+ * operation on the guest side. We un-pin the pages on the
+ * unmap_hypercall.
+ */
+struct kvm_pv_dma_map {
+	struct list_head list;
+	int nents;
+	struct scatterlist *sg;
+};
+
 /* This list is to store the guest bus:device:function-irq and host
  * bus:device:function-irq mapping for assigned devices.
  */
@@ -319,6 +330,7 @@ struct kvm_arch{
 	 */
 	struct list_head active_mmu_pages;
 	struct list_head pci_pt_dev_head;
+	struct list_head pci_pv_dmap_head;
 	struct kvm_pic *vpic;
 	struct kvm_ioapic *vioapic;
 	struct kvm_pit *vpit;
@@ -366,6 +378,9 @@ struct kvm_vcpu_stat {
 	u32 insn_emulation;
 	u32 insn_emulation_fail;
 	u32 hypercalls;
+	u32 hypercall_map;
+	u32 hypercall_unmap;
+	u32 hypercall_pv_dev;
 };

 struct descriptor_table {
diff --git a/include/asm-x86/kvm_para.h b/include/asm-x86/kvm_para.h
index 5f93b78..e13bf4c 100644
--- a/include/asm-x86/kvm_para.h
+++ b/include/asm-x86/kvm_para.h
@@ -74,6 +74,12 @@ extern void kvmclock_init(void);
  */
 #define KVM_HYPERCALL ".byte 0x0f,0x01,0xc1"

+/* Hypercall numbers */
+#define KVM_PV_UNUSED		0
+#define KVM_PV_DMA_MAP		101
+#define KVM_PV_DMA_UNMAP	102
+#define KVM_PV_PCI_DEVICE	103
+
 /* For KVM hypercalls, a three-byte sequence of either the vmrun or the vmmrun
  * instruction. The hypervisor may replace it with something else but only the
  * instructions are guaranteed to be supported.
@@ -155,6 +161,8 @@ static inline unsigned int kvm_arch_para_features(void)
 	return cpuid_eax(KVM_CPUID_FEATURES);
 }

+/* Max. DMA pages we send from guest to host for mapping */
+#define MAX_PVDMA_PAGES (PAGE_SIZE / sizeof(unsigned long *))
 #endif /* KERNEL */

 /* Stores information for identifying host PCI devices assigned to the
--
1.5.4.3
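To see how the pieces fit, the guest side of the map operation would look roughly like this (a sketch inferred from the interface above; shared_page, pages, npages and dma_handle are illustrative names, the real caller lives in the guest driver of this series):

    /* Fill the shared page with the guest-physical addresses of the
     * pages to map, then ask the host to pin and dma-map them.
     * page_gfn is the guest frame number of the shared page itself.
     */
    unsigned long *slot = page_address(shared_page);
    int i, ret;

    for (i = 0; i < npages; i++)
            slot[i] = page_to_phys(pages[i]);

    ret = kvm_hypercall2(KVM_PV_DMA_MAP, npages, page_gfn);
    if (ret == npages)
            *dma_handle = slot[0];  /* host wrote the dma address back */
    else
            *dma_handle = bad_dma_address;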
From: Amit S. <ami...@qu...> - 2008-04-29 10:37:33
|
We make the dma_mapping_ops structure to point to our structure so that every DMA access goes through us. We make a hypercall for every device that does a DMA operation to find out if it is an assigned device -- so that we can make hypercalls on each DMA access. The result of this hypercall is cached, so that this hypercall is made only once for each device This can be compiled as a module, but that's only used for debugging. It can be compiled into the guest kernel directly without any side effects (if you ignore one error message about the hypercall failing for hard disks). Signed-off-by: Amit Shah <ami...@qu...> --- arch/x86/Kconfig | 8 + arch/x86/kernel/Makefile | 1 + arch/x86/kernel/kvm_pv_dma.c | 391 ++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 400 insertions(+), 0 deletions(-) create mode 100644 arch/x86/kernel/kvm_pv_dma.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index e5790fe..aad16d9 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -392,6 +392,14 @@ config KVM_GUEST This option enables various optimizations for running under the KVM hypervisor. +config KVM_PV_DMA + tristate "KVM paravirtualized DMA access" + ---help--- + Provides support for DMA operations in the guest. A hypercall + is raised to the host to enable devices owned by guest to use + DMA. Select this if compiling a guest kernel and you need + paravirtualized DMA operations. + source "arch/x86/lguest/Kconfig" config PARAVIRT diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index fa19c38..0adb37b 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -82,6 +82,7 @@ obj-$(CONFIG_DEBUG_NX_TEST) += test_nx.o obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o obj-$(CONFIG_KVM_GUEST) += kvm.o obj-$(CONFIG_KVM_CLOCK) += kvmclock.o +obj-$(CONFIG_KVM_PV_DMA) += kvm_pv_dma.o obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o ifdef CONFIG_INPUT_PCSPKR diff --git a/arch/x86/kernel/kvm_pv_dma.c b/arch/x86/kernel/kvm_pv_dma.c new file mode 100644 index 0000000..db83324 --- /dev/null +++ b/arch/x86/kernel/kvm_pv_dma.c @@ -0,0 +1,391 @@ +/* + * KVM guest DMA para-virtualization driver + * + * Copyright (C) 2007, Qumranet, Inc., Amit Shah <ami...@qu...> + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ */ + +#include <asm/page.h> +#include <linux/io.h> +#include <linux/fs.h> +#include <linux/pci.h> +#include <linux/module.h> +#include <linux/version.h> +#include <linux/miscdevice.h> +#include <linux/kvm_para.h> + +MODULE_AUTHOR("Amit Shah"); +MODULE_DESCRIPTION("Implements guest para-virtualized DMA"); +MODULE_LICENSE("GPL"); +MODULE_VERSION("1"); + +#define KVM_DMA_MINOR MISC_DYNAMIC_MINOR + +static struct page *page; +static unsigned long page_gfn; + +const struct dma_mapping_ops *orig_dma_ops; + +#include <linux/list.h> +struct pv_passthrough_dev_list { + struct list_head list; + struct kvm_pci_pt_info pv_pci_info; + int is_pv; +}; +static LIST_HEAD(pt_devs_head); + +static struct pv_passthrough_dev_list* +find_matching_pt_dev(struct list_head *head, + struct kvm_pci_pt_info *pv_pci_info) +{ + struct list_head *ptr; + struct pv_passthrough_dev_list *match; + + list_for_each(ptr, head) { + match = list_entry(ptr, struct pv_passthrough_dev_list, list); + if (match && + (match->pv_pci_info.busnr == pv_pci_info->busnr) && + (match->pv_pci_info.devfn == pv_pci_info->devfn)) + return match; + } + return NULL; +} + +void empty_pt_dev_list(struct list_head *head) +{ + struct pv_passthrough_dev_list *match; + + while (!list_empty(head)) { + match = list_entry(head->next, \ + struct pv_passthrough_dev_list, list); + list_del(&match->list); + } +} + +static int kvm_is_pv_device(struct device *dev, const char *name) +{ + int r; + struct pci_dev *pci_dev; + struct kvm_pci_pt_info pv_pci_info; + struct pv_passthrough_dev_list *match; + + pci_dev = to_pci_dev(dev); + pv_pci_info.busnr = pci_dev->bus->number; + pv_pci_info.devfn = pci_dev->devfn; + + match = find_matching_pt_dev(&pt_devs_head, &pv_pci_info); + if (match) { + r = match->is_pv; + goto out; + } + + memcpy(page_address(page), &pv_pci_info, sizeof(pv_pci_info)); + r = kvm_hypercall1(KVM_PV_PCI_DEVICE, page_gfn); + if (r < 1) { + printk(KERN_INFO "%s: Error doing hypercall!\n", __func__); + r = 0; + goto out; + } + + match = kmalloc(sizeof(struct pv_passthrough_dev_list), GFP_KERNEL); + if (match == NULL) { + printk(KERN_INFO "%s: Out of memory\n", __func__); + r = 0; + goto out; + } + match->pv_pci_info.busnr = pv_pci_info.busnr; + match->pv_pci_info.devfn = pv_pci_info.devfn; + match->is_pv = r; + list_add(&match->list, &pt_devs_head); + out: + return r; +} + +static void *kvm_dma_map(void *vaddr, size_t size, dma_addr_t *dma_handle) +{ + int npages, i; + unsigned long *dma_addr; + dma_addr_t host_addr = bad_dma_address; + + if (page == NULL) + goto out; + + npages = get_order(size) + 1; + dma_addr = page_address(page); + + /* We have to take into consideration the offsets for the + * virtual address provided by the calling + * functions. Currently both, pci_alloc_consistent and + * pci_map_single call this function. We have to change it so + * that we can also pass to the host the offset of the addr in + * the page it is in. + */ + + if (*dma_handle == bad_dma_address) + goto out; + + /* It's not really OK to use dma_handle here, as the IOMMU or + * swiotlb could have mapped it elsewhere. But what's a better + * solution? + */ + *dma_addr++ = *dma_handle; + if (npages > 1) { + /* All of the pages will be contiguous in guest + * physical memory in both, pci_map_consistent and + * pci_map_single cases (see DMA-API.txt) + */ + /* FIXME: we're currently not crossing over to + * multiple pages to be sent to host, in case + * we have a lot of pages that we can't + * accomodate in one page. 
+ */ + for (i = 1; i < min((unsigned long)npages, MAX_PVDMA_PAGES); i++) + *dma_addr++ = virt_to_phys(vaddr + PAGE_SIZE * i); + } + + /* Maybe we need more arguments (we have first two): + * @npages: number of gpas pages in this hypercall + * @page: page we pass to host with all the gpas in them + * @more: are there any more pages coming? + * @offset: offset of the address in the first page + * @direction: direction for the mapping (only for pci_map_single) + */ + npages = kvm_hypercall2(KVM_PV_DMA_MAP, npages, page_gfn); + if (!npages) + host_addr = bad_dma_address; + else + host_addr = *(unsigned long *)page_address(page); + + out: + *dma_handle = host_addr; + if (host_addr == bad_dma_address) + vaddr = NULL; + return vaddr; +} + +static void kvm_dma_unmap(dma_addr_t dma_handle) +{ + kvm_hypercall1(KVM_PV_DMA_UNMAP, dma_handle); + return; +} + +static void *kvm_dma_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, gfp_t gfp) +{ + void *vaddr = NULL; + if ((*dma_handle == bad_dma_address) + || !dma_ops->is_pv_device(dev, dev->bus_id)) + goto out; + + vaddr = bus_to_virt(*(unsigned long *)dma_handle); + vaddr = kvm_dma_map(vaddr, size, dma_handle); + out: + return vaddr; +} + +static void kvm_dma_free_coherent(struct device *dev, size_t size, void *vaddr, + dma_addr_t dma_handle) +{ + kvm_dma_unmap(dma_handle); +} + +static dma_addr_t kvm_dma_map_single(struct device *dev, phys_addr_t paddr, + size_t size, int direction) +{ + dma_addr_t r; + + r = orig_dma_ops->map_single(dev, paddr, size, direction); + + if (r != bad_dma_address && kvm_is_pv_device(dev, dev->bus_id)) + kvm_dma_map(phys_to_virt(paddr), size, &r); + return r; +} + +static inline void kvm_dma_unmap_single(struct device *dev, dma_addr_t addr, + size_t size, int direction) +{ + kvm_dma_unmap(addr); +} + +int kvm_pv_dma_mapping_error(dma_addr_t dma_addr) +{ + if (orig_dma_ops->mapping_error) + return orig_dma_ops->mapping_error(dma_addr); + + printk(KERN_ERR "%s: Unhandled PV DMA operation. 
Report this.\n", + __func__); + return dma_addr == bad_dma_address; +} + +/* like map_single, but doesn't check the device mask */ +dma_addr_t kvm_pv_dma_map_simple(struct device *hwdev, phys_addr_t paddr, + size_t size, int direction) +{ + return orig_dma_ops->map_simple(hwdev, paddr, size, direction); +} + +void kvm_pv_dma_sync_single_for_cpu(struct device *hwdev, + dma_addr_t dma_handle, size_t size, + int direction) +{ + if (orig_dma_ops->sync_single_for_cpu) + orig_dma_ops->sync_single_for_cpu(hwdev, dma_handle, + size, direction); +} + +void kvm_pv_dma_sync_single_for_device(struct device *hwdev, + dma_addr_t dma_handle, size_t size, + int direction) +{ + if (orig_dma_ops->sync_single_for_device) + orig_dma_ops->sync_single_for_device(hwdev, dma_handle, + size, direction); +} + +void kvm_pv_dma_sync_single_range_for_cpu(struct device *hwdev, + dma_addr_t dma_handle, + unsigned long offset, + size_t size, int direction) +{ + if (orig_dma_ops->sync_single_range_for_cpu) + orig_dma_ops->sync_single_range_for_cpu(hwdev, dma_handle, + offset, size, + direction); +} + +void kvm_pv_dma_sync_single_range_for_device(struct device *hwdev, + dma_addr_t dma_handle, + unsigned long offset, + size_t size, int direction) +{ + if (orig_dma_ops->sync_single_range_for_device) + orig_dma_ops->sync_single_range_for_device(hwdev, dma_handle, + offset, size, + direction); +} + +void kvm_pv_dma_sync_sg_for_cpu(struct device *hwdev, + struct scatterlist *sg, int nelems, + int direction) +{ + if (orig_dma_ops->sync_sg_for_cpu) + orig_dma_ops->sync_sg_for_cpu(hwdev, sg, nelems, direction); +} + +void kvm_pv_dma_sync_sg_for_device(struct device *hwdev, + struct scatterlist *sg, int nelems, + int direction) +{ + if (orig_dma_ops->sync_sg_for_device) + orig_dma_ops->sync_sg_for_device(hwdev, sg, nelems, direction); +} + +int kvm_pv_dma_map_sg(struct device *hwdev, struct scatterlist *sg, + int nents, int direction) +{ + return orig_dma_ops->map_sg(hwdev, sg, nents, direction); + printk(KERN_ERR "%s: Unhandled PV DMA operation. Report this.\n", + __func__); + return 0; +} + +void kvm_pv_dma_unmap_sg(struct device *hwdev, + struct scatterlist *sg, int nents, + int direction) +{ + if (orig_dma_ops->unmap_sg) + orig_dma_ops->unmap_sg(hwdev, sg, nents, direction); +} + +int kvm_pv_dma_dma_supported(struct device *hwdev, u64 mask) +{ + if (orig_dma_ops->dma_supported) + return orig_dma_ops->dma_supported(hwdev, mask); + printk(KERN_ERR "%s: Unhandled PV DMA operation. 
Report this.\n", + __func__); + return 0; +} + +static const struct dma_mapping_ops kvm_dma_ops = { + .alloc_coherent = kvm_dma_alloc_coherent, + .free_coherent = kvm_dma_free_coherent, + .map_single = kvm_dma_map_single, + .unmap_single = kvm_dma_unmap_single, + .is_pv_device = kvm_is_pv_device, + + .mapping_error = kvm_pv_dma_mapping_error, + .map_simple = kvm_pv_dma_map_simple, + .sync_single_for_cpu = kvm_pv_dma_sync_single_for_cpu, + .sync_single_for_device = kvm_pv_dma_sync_single_for_device, + .sync_single_range_for_cpu = kvm_pv_dma_sync_single_range_for_cpu, + .sync_single_range_for_device = kvm_pv_dma_sync_single_range_for_device, + .sync_sg_for_cpu = kvm_pv_dma_sync_sg_for_cpu, + .sync_sg_for_device = kvm_pv_dma_sync_sg_for_device, + .map_sg = kvm_pv_dma_map_sg, + .unmap_sg = kvm_pv_dma_unmap_sg, +}; + +static struct file_operations dma_chardev_ops; +static struct miscdevice kvm_dma_dev = { + KVM_DMA_MINOR, + "kvm_dma", + &dma_chardev_ops, +}; + +int __init kvm_pv_dma_init(void) +{ + int r; + + dma_chardev_ops.owner = THIS_MODULE; + if (misc_register(&kvm_dma_dev)) { + printk(KERN_ERR "%s: misc device register failed\n", + __func__); + r = -EBUSY; + goto out; + } + if (!kvm_para_available()) { + printk(KERN_ERR "KVM paravirt support not available\n"); + r = -ENODEV; + goto out_dereg; + } + + /* FIXME: check for hypercall support */ + page = alloc_page(GFP_ATOMIC); + if (page == NULL) { + printk(KERN_ERR "%s: Could not allocate page\n", __func__); + r = -ENOMEM; + goto out_dereg; + } + page_gfn = page_to_pfn(page); + + orig_dma_ops = dma_ops; + dma_ops = &kvm_dma_ops; + + printk(KERN_INFO "KVM PV DMA engine registered\n"); + return 0; + goto out; + goto out_free; + + out_free: + __free_page(page); + out_dereg: + misc_deregister(&kvm_dma_dev); + out: + return r; +} + +static void __exit kvm_pv_dma_exit(void) +{ + dma_ops = orig_dma_ops; + + __free_page(page); + + empty_pt_dev_list(&pt_devs_head); + + misc_deregister(&kvm_dma_dev); +} + +module_init(kvm_pv_dma_init); +module_exit(kvm_pv_dma_exit); -- 1.5.4.3 |
From: Amit S. <ami...@qu...> - 2008-04-29 10:37:33
|
This patchset implements PVDMA for handling DMA requests from devices
assigned to the guest by the host machine. The patches are also available
from

git-pull git://git.kernel.org/pub/scm/linux/kernel/git/amit/kvm.git pvdma

These patches are based on my pci-passthrough tree, which is available from

git-pull git://git.kernel.org/pub/scm/linux/kernel/git/amit/kvm.git

and the userspace from

git-pull git://git.kernel.org/pub/scm/linux/kernel/git/amit/kvm-userspace.git

The first and the third patches in this series are needed on the guest
(with some bits from the 2nd as well). The 2nd patch is meant for the
host kernel.

Amit.
|
From: Jan K. <jan...@si...> - 2008-04-29 10:34:40
|
Joerg Roedel wrote: > On Tue, Apr 29, 2008 at 10:38:41AM +0200, Jan Kiszka wrote: >> Joerg Roedel wrote: >>> Hmm, seems we have to check for DF and triple faults in the >>> kvm_queue_exception functions too. Does the attached patch fix the >>> problem (patch is against kvm-66). >> Thanks, it indeed fixes the warnings (*) and makes KVM issue a reset. But >> then is stumbles and falls probably over some inconsistent system state: >> >> exception 13 (43) >> rax 0000000000000000 rbx 0000000000000000 rcx 0000000000000000 rdx 0000000000000633 >> rsi 0000000000000000 rdi 0000000000000000 rsp 0000000000000000 rbp 0000000000000000 >> r8 0000000000000000 r9 0000000000000000 r10 0000000000000000 r11 0000000000000000 >> r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 r15 0000000000000000 >> rip 000000000000fff0 rflags 00033002 >> cs f000 (000f0000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) >> ds 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) >> es 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) >> ss 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) >> fs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) >> gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) >> tr 0178 (fffbd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0) >> ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0) >> gdt 0/ffff >> idt 0/ffff >> cr0 60000010 cr2 0 cr3 0 cr4 0 cr8 0 efer 0 >> code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 --> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> >> Looks like trying to execute the first instruction after reset is >> already unsuccessful. As the tr selector is non-zero here, I already >> tried a kvm_arch_reset_cpu-hack along the line that sets >> KVM_REQ_TRIPLE_FAULT, but without success. Any idea what to check? > > Its weird to me what triggers the taskswitch. What guest operating It is the guest, looking for a soft-restart (after it detected some other error - now our main problem). > system are you running and what is the qemu/kvm command line to start > the guest? Well, the guest is a proprietary OS of our customer, running in 16-bit protected mode with a lot of segment shuffling. Due to this and also some special hardware emulations, the current test case is not portable. So I'm looking for input on where to dig and what to try. Note that I ran the very same test with -no-kvm, and here we do not get those post-reset GPF (provided that some reset-on-triple-fault patch is applied to avoid the abort(), e.g. [1]). > >> Note that this does not happen when I raise a reset via the monitor. >> >> BTW, kvm_show_code() does not seem to provide correct informations, >> even when I add it right before the first kvm_run(). > > When the guest state is messed up the information may be incorrect. I don't expect the guest state to be messed up right before the very first guest code execution (that's where kvm_show_code() also reported zeros)... :-> > >> (*) There is just a bit noise left behind in the syslog: >> >> kvm_handle_exit: unexpected, valid vectoring info and exit reason is 0x9 > > Reason 0x9 is the taskswitch intercept. > >> kvm: inject_page_fault: double fault > > This is expected from the patch I sent you. For sure. I would just suggest to rethink if a final version should still issue such warnings. 
We basically had the same discussion on qemu-devel around the
reset-on-triple-fault patch (which is unfortunately still not
finalized :-/).

Jan

[1] http://permalink.gmane.org/gmane.comp.emulators.qemu/24475

--
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
|
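For readers new to the thread: "reset on triple fault" concerns the userspace side of KVM. The kernel reports the triple-fault shutdown to userspace as KVM_EXIT_SHUTDOWN, and the open question is whether userspace aborts or resets the guest. A hedged sketch of the latter, with a hypothetical handler name; qemu_system_reset_request() is assumed here as QEMU's reset hook, and this is not qemu-kvm's actual exit loop.

/*
 * Illustrative only: turning a triple fault (reported by the kernel
 * as KVM_EXIT_SHUTDOWN) into a guest reset instead of an abort().
 */
#include <linux/kvm.h>

extern void qemu_system_reset_request(void);	/* QEMU's reset hook */

static int handle_shutdown(struct kvm_run *run)
{
	if (run->exit_reason == KVM_EXIT_SHUTDOWN) {
		qemu_system_reset_request();
		return 1;	/* handled; resume after the reset */
	}
	return 0;		/* not ours; let other handlers decide */
}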
From: David M. <da...@da...> - 2008-04-29 10:30:08
|
From: Avi Kivity <av...@qu...>
Date: Sun, 27 Apr 2008 12:40:29 +0300

> Avi Kivity wrote:
> > I propose moving the kvm lists to vger.kernel.org, for the following
> > benefits:
> >
> > - better spam control
> > - faster service (I see significant lag with the sourceforge lists)
> > - no ads appended to the end of each email
> >
> > If no objections are raised, and if the vger postmasters agree, I will
> > mass subscribe the current subscribers so that there will be no
> > service interruption.
> >
>
> Since no objections were raised, we'll start to get this rolling.

Should I create the list(s) now? If so, please let me know the names
they should have.

Thanks.
|
From: Joerg R. <joe...@am...> - 2008-04-29 10:05:03
|
On Tue, Apr 29, 2008 at 10:38:41AM +0200, Jan Kiszka wrote: > Joerg Roedel wrote: > > Hmm, seems we have to check for DF and triple faults in the > > kvm_queue_exception functions too. Does the attached patch fix the > > problem (patch is against kvm-66). > > Thanks, it indeed fixes the warnings (*) and makes KVM issue a reset. But > then is stumbles and falls probably over some inconsistent system state: > > exception 13 (43) > rax 0000000000000000 rbx 0000000000000000 rcx 0000000000000000 rdx 0000000000000633 > rsi 0000000000000000 rdi 0000000000000000 rsp 0000000000000000 rbp 0000000000000000 > r8 0000000000000000 r9 0000000000000000 r10 0000000000000000 r11 0000000000000000 > r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 r15 0000000000000000 > rip 000000000000fff0 rflags 00033002 > cs f000 (000f0000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > ds 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > es 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > ss 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > fs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0) > tr 0178 (fffbd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0) > ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0) > gdt 0/ffff > idt 0/ffff > cr0 60000010 cr2 0 cr3 0 cr4 0 cr8 0 efer 0 > code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 --> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > Looks like trying to execute the first instruction after reset is > already unsuccessful. As the tr selector is non-zero here, I already > tried a kvm_arch_reset_cpu-hack along the line that sets > KVM_REQ_TRIPLE_FAULT, but without success. Any idea what to check? Its weird to me what triggers the taskswitch. What guest operating system are you running and what is the qemu/kvm command line to start the guest? > Note that this does not happen when I raise a reset via the monitor. > > BTW, kvm_show_code() does not seem to provide correct informations, > even when I add it right before the first kvm_run(). When the guest state is messed up the information may be incorrect. > (*) There is just a bit noise left behind in the syslog: > > kvm_handle_exit: unexpected, valid vectoring info and exit reason is 0x9 Reason 0x9 is the taskswitch intercept. > kvm: inject_page_fault: double fault This is expected from the patch I sent you. Joerg -- | AMD Saxony Limited Liability Company & Co. KG Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany System | Register Court Dresden: HRA 4896 Research | General Partner authorized to represent: Center | AMD Saxony LLC (Wilmington, Delaware, US) | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy |
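For readers decoding the syslog line quoted above: VM-exit reason numbers come from the VMX basic exit reason field. A small illustrative helper follows, using the two values relevant to this thread from kvm's vmx.h (which mirror the Intel SDM); the helper itself is just a sketch, not kvm code.

/* Basic VMX exit reason values, as in arch/x86/kvm/vmx.h */
#define EXIT_REASON_EXCEPTION_NMI	0
#define EXIT_REASON_TASK_SWITCH		9

static const char *vmx_exit_reason_name(unsigned int reason)
{
	switch (reason) {
	case EXIT_REASON_EXCEPTION_NMI:
		return "exception or NMI";
	case EXIT_REASON_TASK_SWITCH:
		return "task switch";	/* the 0x9 seen in the log */
	default:
		return "other";
	}
}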
From: Jan K. <jan...@si...> - 2008-04-29 08:38:35
|
Joerg Roedel wrote: > On Mon, Apr 28, 2008 at 07:35:10PM +0200, Jan Kiszka wrote: >> Hi, >> >> sorry, the test environment is not really reproducible (stock kvm-66, >> yet unpublished NMI support by Sheng Yang and me, special guest), but >> I'm just fishing for some ideas on what may cause the flood of the >> following warning in my kernel log: >> >> ------------[ cut here ]------------ >> WARNING: at /data/kvm-66/kernel/x86.c:180 >> kvm_queue_exception_e+0x30/0x54 [kvm]() >> Modules linked in: ipt_MASQUERADE kvm_intel kvm bridge tun ip6t_LOG >> nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit snd_pcm_oss snd_mixer_oss >> snd_seq snd_seq_device nls_utf8 cifs af_packet ip6t_REJECT xt_tcpudp >> ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter >> ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter >> ip6_tables cpufreq_conservative x_tables cpufreq_userspace >> cpufreq_powersave acpi_cpufreq ipv6 microcode fuse ohci_hcd loop rfcomm >> l2cap wlan_scan_sta ath_rate_sample ath_pci snd_hda_intel wlan pcmcia >> firmware_class hci_usb snd_pcm snd_timer ath_hal(P) sdhci battery >> bluetooth button ohci1394 mmc_core rtc_cmos parport_pc intel_agp >> rtc_core dock ac snd_page_alloc iTCO_wdt ieee1394 sky2 rtc_lib >> yenta_socket parport snd_hwdep snd iTCO_vendor_support i2c_i801 >> rsrc_nonstatic pcmcia_core sg i2c_core soundcore serio_raw joydev >> sha256_generic aes_x86_64 aes_generic cbc dm_crypt crypto_blkcipher >> usbhid hid ff_memless sd_mod ehci_hcd uhci_hcd usbcore dm_snapshot >> dm_mod edd ext3 mbcache jbd fan ata_piix ahci libata scsi_mod thermal >> processor >> Pid: 4718, comm: qemu-system-x86 Tainted: P N >> 2.6.25-rc5-git2-109.8-default #1 >> >> Call Trace: >> [<ffffffff8020d826>] dump_trace+0xc4/0x576 >> [<ffffffff8020dd18>] show_trace+0x40/0x57 >> [<ffffffff8044e341>] _etext+0x72/0x7b >> [<ffffffff80238137>] warn_on_slowpath+0x58/0x80 >> [<ffffffff886e2e05>] :kvm:kvm_queue_exception_e+0x30/0x54 >> [<ffffffff886e3678>] :kvm:kvm_task_switch+0xca/0x20a >> [<ffffffff8870d096>] :kvm_intel:handle_task_switch+0x19/0x1b >> [<ffffffff8870cb1b>] :kvm_intel:kvm_handle_exit+0x7f/0x9c >> [<ffffffff886e51e2>] :kvm:kvm_arch_vcpu_ioctl_run+0x49b/0x686 >> [<ffffffff886e08c9>] :kvm:kvm_vcpu_ioctl+0xf7/0x3ca >> [<ffffffff802ad0ba>] vfs_ioctl+0x2a/0x78 >> [<ffffffff802ad34f>] do_vfs_ioctl+0x247/0x261 >> [<ffffffff802ad3be>] sys_ioctl+0x55/0x77 >> [<ffffffff8020c18a>] system_call_after_swapgs+0x8a/0x8f >> [<00007faed2969267>] >> >> ---[ end trace 5d286714f3c5c50f ]--- > > Hmm, seems we have to check for DF and triple faults in the > kvm_queue_exception functions too. Does the attached patch fix the > problem (patch is against kvm-66). Thanks, it indeed fixes the warnings (*) and makes KVM issue a reset. 
But then it stumbles and falls, probably over some inconsistent system
state:

exception 13 (43)
rax 0000000000000000 rbx 0000000000000000 rcx 0000000000000000 rdx 0000000000000633
rsi 0000000000000000 rdi 0000000000000000 rsp 0000000000000000 rbp 0000000000000000
r8 0000000000000000 r9 0000000000000000 r10 0000000000000000 r11 0000000000000000
r12 0000000000000000 r13 0000000000000000 r14 0000000000000000 r15 0000000000000000
rip 000000000000fff0 rflags 00033002
cs f000 (000f0000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ds 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
es 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
ss 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
fs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
gs 0000 (00000000/0000ffff p 1 dpl 3 db 0 s 1 type 3 l 0 g 0 avl 0)
tr 0178 (fffbd000/00002088 p 1 dpl 0 db 0 s 0 type b l 0 g 0 avl 0)
ldt 0000 (00000000/0000ffff p 1 dpl 0 db 0 s 0 type 2 l 0 g 0 avl 0)
gdt 0/ffff
idt 0/ffff
cr0 60000010 cr2 0 cr3 0 cr4 0 cr8 0 efer 0
code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 --> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Looks like trying to execute the first instruction after reset is
already unsuccessful. As the tr selector is non-zero here, I already
tried a kvm_arch_reset_cpu hack along the line that sets
KVM_REQ_TRIPLE_FAULT, but without success. Any idea what to check?

Note that this does not happen when I raise a reset via the monitor.

BTW, kvm_show_code() does not seem to provide correct information,
even when I add it right before the first kvm_run().

Jan

(*) There is just a bit of noise left behind in the syslog:

kvm_handle_exit: unexpected, valid vectoring info and exit reason is 0x9
kvm: inject_page_fault: double fault
kvm_handle_exit: unexpected, valid vectoring info and exit reason is 0x9
handle_exception: unexpected, vectoring info 0x80000b08 intr info 0x80000b0d

--
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
|
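As a point of reference, the dump above is essentially the architected x86 reset state, which can be checked by hand; the values in the sketch below are copied from the register dump, and the check itself is only illustrative.

/*
 * Sanity check against the architected reset state (Intel SDM vol. 3).
 */
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t cs_base = 0x000f0000;	/* cs f000 (000f0000/...) */
	uint32_t rip     = 0xfff0;
	uint32_t cr0     = 0x60000010;	/* CD | NW | ET: reset value */

	/* cs.base + rip gives the usual x86 reset vector */
	assert(cs_base + rip == 0xffff0);
	(void)cr0;
	/*
	 * What does NOT match the reset state is tr = 0178 with a
	 * non-zero base, i.e. stale task register contents, which fits
	 * the task-switch intercept noise seen in the syslog.
	 */
	return 0;
}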
From: David M. <dm...@ma...> - 2008-04-29 05:09:06
|
Fabian Deutsch wrote:
> Hey.
>
> I've been trying Microsoft Windows 2003 a couple of times. The wiki
> tells me that "everything" should work okay. It does, when using -smp 1,
> but gets ugly when using -smp 2 or so.
>
> So might it be useful to add the column "smp" to the "Guest Support
> Status" page in the wiki?

It's probably worth explaining what you mean by "gets ugly" and worth
describing the environment. I just installed Windows 2003 Enterprise
Edition with -smp 2 and had no problems during or after the install. I
confirmed Windows detected and uses two CPUs. IOW, works for me.

Intel(R) Core(TM)2 CPU T7600 @ 2.33GHz
SuSE kernel 2.6.22.17-0.1-default x86_64
kvm-67
# qemu-system-x86_64 -smp 2 -m 512 -usb -localtime -hda disk.img -boot c

---
David Mair.
|