From: Anthony L. <ali...@us...> - 2008-04-29 19:10:22
This patch allows VMAs that contain no backing page to be used for guest
memory.  This is a drop-in replacement for Ben-Ami's first page in his direct
mmio series.  Here, we continue to allow mmio pages to be represented in the
rmap.

Since v1, I've taken into account Andrea's suggestion to use VM_PFNMAP
instead of VM_IO and changed the BUG_ON to a return of bad_page.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7991a..64e5efe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	struct page *page[1];
 	unsigned long addr;
 	int npages;
+	pfn_t pfn;

 	might_sleep();

@@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
 				NULL);

-	if (npages != 1) {
-		get_page(bad_page);
-		return page_to_pfn(bad_page);
-	}
+	if (unlikely(npages != 1)) {
+		struct vm_area_struct *vma;

-	return page_to_pfn(page[0]);
+		vma = find_vma(current->mm, addr);
+		if (vma == NULL || addr >= vma->vm_start ||
+		    !(vma->vm_flags & VM_PFNMAP)) {
+			get_page(bad_page);
+			return page_to_pfn(bad_page);
+		}
+
+		pfn = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+		BUG_ON(pfn_valid(pfn));
+	} else
+		pfn = page_to_pfn(page[0]);
+
+	return pfn;
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);

 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-	return pfn_to_page(gfn_to_pfn(kvm, gfn));
+	pfn_t pfn;
+
+	pfn = gfn_to_pfn(kvm, gfn);
+	if (pfn_valid(pfn))
+		return pfn_to_page(pfn);
+
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(gfn_to_page);

@@ -569,7 +586,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

 void kvm_release_pfn_clean(pfn_t pfn)
 {
-	put_page(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

@@ -594,21 +612,25 @@ EXPORT_SYMBOL_GPL(kvm_set_page_dirty);

 void kvm_set_pfn_dirty(pfn_t pfn)
 {
-	struct page *page = pfn_to_page(pfn);
-	if (!PageReserved(page))
-		SetPageDirty(page);
+	if (pfn_valid(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+		if (!PageReserved(page))
+			SetPageDirty(page);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);

 void kvm_set_pfn_accessed(pfn_t pfn)
 {
-	mark_page_accessed(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);

 void kvm_get_pfn(pfn_t pfn)
 {
-	get_page(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		get_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_get_pfn);
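For context, the VM_PFNMAP mappings this patch targets are the kind a device
driver creates in its mmap handler via remap_pfn_range().  A minimal sketch of
such a handler, with a made-up device and MYDEV_MMIO_BASE standing in for the
real register window address:

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/*
	 * remap_pfn_range() marks the vma VM_PFNMAP, so get_user_pages()
	 * fails on this range and gfn_to_pfn() above falls through to the
	 * find_vma() path.  MYDEV_MMIO_BASE is hypothetical.
	 */
	return remap_pfn_range(vma, vma->vm_start,
			       MYDEV_MMIO_BASE >> PAGE_SHIFT, size,
			       pgprot_noncached(vma->vm_page_prot));
}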
From: Avi K. <av...@qu...> - 2008-04-29 22:19:21
Anthony Liguori wrote:
> This patch allows VMAs that contain no backing page to be used for guest
> memory.  This is a drop-in replacement for Ben-Ami's first page in his
> direct mmio series.  Here, we continue to allow mmio pages to be
> represented in the rmap.

I like this very much, as it only affects accessors and not the mmu core
itself.

Hollis/Xiantao/Carsten, can you confirm that this approach works for
you?  Carsten, I believe you don't have mmio, but at least this
shouldn't interfere.

> struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
> {
> -	return pfn_to_page(gfn_to_pfn(kvm, gfn));
> +	pfn_t pfn;
> +
> +	pfn = gfn_to_pfn(kvm, gfn);
> +	if (pfn_valid(pfn))
> +		return pfn_to_page(pfn);
> +
> +	return NULL;
> }

You're returning NULL here, not bad_page.

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: Hollis B. <ho...@us...> - 2008-04-29 22:57:27
On Tuesday 29 April 2008 17:17:49 Avi Kivity wrote:
> Anthony Liguori wrote:
>> This patch allows VMAs that contain no backing page to be used for guest
>> memory.  This is a drop-in replacement for Ben-Ami's first page in his
>> direct mmio series.  Here, we continue to allow mmio pages to be
>> represented in the rmap.
>
> I like this very much, as it only affects accessors and not the mmu core
> itself.
>
> Hollis/Xiantao/Carsten, can you confirm that this approach works for
> you?  Carsten, I believe you don't have mmio, but at least this
> shouldn't interfere.

OK, so the idea is to mmap /sys/bus/pci/.../region within the guest RAM
area, and just include it within the normal guest RAM memslot?

How will the IOMMU be programmed?  Wouldn't you still need to register a
special type of memslot for that?

--
Hollis Blanchard
IBM Linux Technology Center
From: Andrea A. <an...@qu...> - 2008-04-30 07:00:40
On Tue, Apr 29, 2008 at 06:12:51PM -0500, Anthony Liguori wrote:
> IIUC PPC correctly, all IO pages have corresponding struct pages.  This
> means that get_user_pages() would succeed and you can reference count
> them?  In this case, we would never take the VM_PFNMAP path.

get_user_pages only works on vmas where only pfns with struct pages can be
mapped, but even if a struct page exists, that doesn't mean get_user_pages
will succeed.  All mmio regions should be marked VM_IO, since reading them
can affect the hardware somehow, and that prevents get_user_pages from
working on them regardless of whether a struct page exists.

> That's independent of this patchset.  For non-aware guests, we'll have to
> pin all of physical memory up front and then create an IOMMU table from
> the pinned physical memory.  For aware guests with a PV DMA window API,
> we'll be able to build that mapping on the fly (enforcing mlock allocation
> limits).

BTW, as far as a Linux guest is concerned, if the PV DMA API hits the mlock
ulimit, the guest will crash.  Nothing checks whether pci_map_single returns
null (the fix would be to throttle the I/O until some other dma completes,
to split the dma into multiple operations if it's an SG entry, and if it
repeatedly fails, to fall back to PIO or return an I/O error if PIO isn't
available).  It can fail if there's lots of weird pci hardware doing rdma at
the same time (for example, see the iommu_arena_alloc retval in
arch/alpha/kernel/pci_iommu.c).  In short, we'll either need ulimit -l
unlimited, or we'll have to define practical limits depending on the guest
driver code and the number of devices using passthrough.

I'll make the reserved-ram patch incremental with these patches; it should
then pick up the right pfn coming from /dev/mem without my page_count == 0
check, and then I only have to fix up the page pinning (so it will likely
also be incremental with the kvm mmu notifier patch, and I can hope to get
something final that removes page pinning for good, not only on mmio regions
that don't have a struct page).

I currently have trouble booting the host with the blk-settings.c change
made in 2.6.25; I thought I had fixed that already... (I did when loading
the host kernel in kvm, but on real hardware it still fails for another
reason).  And Andrew sent me a long email about mmu notifiers, so before I
return to reserved-ram I have to answer him and upload an updated
mmu-notifier patch with certain cleanups he requested.  So go ahead without
waiting for reserved-ram and mmu notifiers; I'll pick up whatever is
available in or outside kvm.git when I'm ready.

Thanks!
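A sketch of the missing check Andrea describes, using the single-argument
pci_dma_mapping_error() of kernels of this era (the recovery path shown is
illustrative only, not an actual driver's):

	dma_addr_t dma;

	dma = pci_map_single(pdev, buf, len, PCI_DMA_TODEVICE);
	if (pci_dma_mapping_error(dma)) {
		/*
		 * Per the discussion: throttle until another dma
		 * completes, split the SG entry into smaller mappings,
		 * fall back to PIO, or fail the request with an I/O
		 * error.  Do not hand the device a bogus address.
		 */
		return -ENOMEM;
	}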
From: Anthony L. <an...@co...> - 2008-04-30 15:43:47
Andrea Arcangeli wrote:
> On Tue, Apr 29, 2008 at 06:12:51PM -0500, Anthony Liguori wrote:
>> IIUC PPC correctly, all IO pages have corresponding struct pages.  This
>> means that get_user_pages() would succeed and you can reference count
>> them?  In this case, we would never take the VM_PFNMAP path.
>
> get_user_pages only works on vmas where only pfns with struct pages can be
> mapped, but even if a struct page exists, that doesn't mean get_user_pages
> will succeed.  All mmio regions should be marked VM_IO, since reading them
> can affect the hardware somehow, and that prevents get_user_pages from
> working on them regardless of whether a struct page exists.

Ah, thanks for the clarification.

Regards,

Anthony Liguori
From: Carsten O. <co...@de...> - 2008-04-30 14:18:53
Avi Kivity wrote:
> Hollis/Xiantao/Carsten, can you confirm that this approach works for
> you?  Carsten, I believe you don't have mmio, but at least this
> shouldn't interfere.

Should work fine on s390, afaics.
From: Anthony L. <an...@co...> - 2008-04-29 22:25:49
Avi Kivity wrote:
> Anthony Liguori wrote:
>> This patch allows VMAs that contain no backing page to be used for guest
>> memory.  This is a drop-in replacement for Ben-Ami's first page in his
>> direct mmio series.  Here, we continue to allow mmio pages to be
>> represented in the rmap.
>
> I like this very much, as it only affects accessors and not the mmu core
> itself.
>
> Hollis/Xiantao/Carsten, can you confirm that this approach works for
> you?  Carsten, I believe you don't have mmio, but at least this
> shouldn't interfere.
>
>> struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
>> {
>> -	return pfn_to_page(gfn_to_pfn(kvm, gfn));
>> +	pfn_t pfn;
>> +
>> +	pfn = gfn_to_pfn(kvm, gfn);
>> +	if (pfn_valid(pfn))
>> +		return pfn_to_page(pfn);
>> +
>> +	return NULL;
>> }
>
> You're returning NULL here, not bad_page.

My thinking was that bad_page indicates that the gfn is invalid.  This is
a different type of error, though.  The problem is that we are trying to
kmap() a page that has no struct page associated with it.  I'm not sure
what the right thing to do here is.

Perhaps we should be replacing consumers of gfn_to_page() with
copy_to_user() instead?

Regards,

Anthony Liguori
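As a sketch, that replacement would look something like this for a
non-atomic consumer (illustrative code, not a specific call site;
kvm_write_guest() is the existing helper that goes through copy_to_user()):

	/* Before: requires a struct page, breaks on direct mmio. */
	page = gfn_to_page(kvm, gfn);
	ptr = kmap(page);
	memcpy(ptr + offset, data, len);
	kunmap(page);
	kvm_release_page_dirty(page);

	/* After: goes through the userspace mapping and fails cleanly
	 * even when the gpa has no struct page behind it. */
	if (kvm_write_guest(kvm, gpa + offset, data, len))
		return -EFAULT;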
From: Avi K. <av...@qu...> - 2008-04-29 22:43:51
Anthony Liguori wrote:
>>> struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
>>> {
>>> -	return pfn_to_page(gfn_to_pfn(kvm, gfn));
>>> +	pfn_t pfn;
>>> +
>>> +	pfn = gfn_to_pfn(kvm, gfn);
>>> +	if (pfn_valid(pfn))
>>> +		return pfn_to_page(pfn);
>>> +
>>> +	return NULL;
>>> }
>>
>> You're returning NULL here, not bad_page.
>
> My thinking was that bad_page indicates that the gfn is invalid.  This is
> a different type of error, though.  The problem is that we are trying to
> kmap() a page that has no struct page associated with it.  I'm not sure
> what the right thing to do here is.

It depends on what's going on.  Does a page table point to mmio?  Or the
glommerclock?

Not sure there is a single answer.

> Perhaps we should be replacing consumers of gfn_to_page() with
> copy_to_user() instead?

Indeed we should.  The problem is access in atomic contexts.  It's easy to
detect failure, but not always easy to handle it.

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: Anthony L. <ali...@us...> - 2008-04-29 22:51:48
Avi Kivity wrote:
> It depends on what's going on.  Does a page table point to mmio?  Or the
> glommerclock?
>
> Not sure there is a single answer.
>
>> Perhaps we should be replacing consumers of gfn_to_page() with
>> copy_to_user() instead?
>
> Indeed we should.  The problem is access in atomic contexts.  It's easy
> to detect failure, but not always easy to handle it.

So I think we should replace it with a rate-limited printk and return
bad_page.  That way the guest can't exploit it, and we'll still hopefully
get printk()s to track down instances of things going bad.

Regards,

Anthony Liguori
From: Avi K. <av...@qu...> - 2008-04-29 22:54:00
Anthony Liguori wrote:
> Avi Kivity wrote:
>> It depends on what's going on.  Does a page table point to mmio?  Or the
>> glommerclock?
>>
>> Not sure there is a single answer.
>>
>>> Perhaps we should be replacing consumers of gfn_to_page() with
>>> copy_to_user() instead?
>>
>> Indeed we should.  The problem is access in atomic contexts.  It's easy
>> to detect failure, but not always easy to handle it.
>
> So I think we should replace it with a rate-limited printk and return
> bad_page.  That way the guest can't exploit it, and we'll still hopefully
> get printk()s to track down instances of things going bad.

Agreed.  Add a stacktrace so we can see what causes the badness.

--
Any sufficiently difficult bug is indistinguishable from a feature.
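A sketch of what the agreed-upon variant might look like in gfn_to_page()
(an illustration of the proposal, not the committed fix):

struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
{
	pfn_t pfn = gfn_to_pfn(kvm, gfn);

	if (pfn_valid(pfn))
		return pfn_to_page(pfn);

	/*
	 * mmio pfn with no struct page: warn with a stacktrace so the
	 * offending call site can be tracked down, but rate-limited so
	 * a guest cannot flood the host log.
	 */
	if (printk_ratelimit()) {
		printk(KERN_WARNING "kvm: no struct page for pfn %lx\n",
		       (unsigned long)pfn);
		dump_stack();
	}
	get_page(bad_page);
	return bad_page;
}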
From: Anthony L. <ali...@us...> - 2008-04-29 23:12:53
Hollis Blanchard wrote:
> On Tuesday 29 April 2008 17:17:49 Avi Kivity wrote:
>> I like this very much, as it only affects accessors and not the mmu core
>> itself.
>>
>> Hollis/Xiantao/Carsten, can you confirm that this approach works for
>> you?  Carsten, I believe you don't have mmio, but at least this
>> shouldn't interfere.
>
> OK, so the idea is to mmap /sys/bus/pci/.../region within the guest RAM
> area, and just include it within the normal guest RAM memslot?

In the case of x86, since the PCI IO region is pretty far away from normal
RAM, we'll probably use a different memory slot, but yes, that's the general
idea.

IIUC PPC correctly, all IO pages have corresponding struct pages.  This
means that get_user_pages() would succeed and you can reference count them?
In this case, we would never take the VM_PFNMAP path.

Is that correct?

> How will the IOMMU be programmed?  Wouldn't you still need to register a
> special type of memslot for that?

That's independent of this patchset.  For non-aware guests, we'll have to
pin all of physical memory up front and then create an IOMMU table from the
pinned physical memory.  For aware guests with a PV DMA window API, we'll be
able to build that mapping on the fly (enforcing mlock allocation limits).

Regards,

Anthony Liguori
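An illustrative sketch of that up-front pinning scheme for non-aware guests;
iommu_map_page() is a made-up placeholder for whatever the real IOMMU
interface turns out to be, and mmap_sem handling is elided for brevity:

static int pin_memslot_for_dma(struct kvm_memory_slot *slot)
{
	unsigned long i;

	for (i = 0; i < slot->npages; i++) {
		unsigned long addr = slot->userspace_addr +
				     (i << PAGE_SHIFT);
		struct page *page;

		/* Pin the page for the lifetime of the IOMMU mapping. */
		if (get_user_pages(current, current->mm, addr, 1, 1, 0,
				   &page, NULL) != 1)
			return -EFAULT;

		/* Hypothetical helper: enter gfn -> pfn into the IOMMU. */
		iommu_map_page(slot->base_gfn + i, page_to_pfn(page));
	}
	return 0;
}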
From: Hollis B. <ho...@us...> - 2008-04-30 15:11:45
On Tuesday 29 April 2008 18:12:51 Anthony Liguori wrote:
> IIUC PPC correctly, all IO pages have corresponding struct pages.  This
> means that get_user_pages() would succeed and you can reference count
> them?  In this case, we would never take the VM_PFNMAP path.
>
> Is that correct?

I think Andrea's answered this, but for the record: I believe ioremap()
will get you struct pages on PPC, but they don't automatically exist.

--
Hollis Blanchard
IBM Linux Technology Center
From: Muli Ben-Y. <mu...@il...> - 2008-04-30 06:09:16
On Tue, Apr 29, 2008 at 02:09:20PM -0500, Anthony Liguori wrote:

> This patch allows VMAs that contain no backing page to be used for guest
> memory.  This is a drop-in replacement for Ben-Ami's first page in his
> direct mmio series.  Here, we continue to allow mmio pages to be
> represented in the rmap.
>
> Since v1, I've taken into account Andrea's suggestion to use VM_PFNMAP
> instead of VM_IO and changed the BUG_ON to a return of bad_page.
>
> Signed-off-by: Anthony Liguori <ali...@us...>
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1d7991a..64e5efe 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>  	struct page *page[1];
>  	unsigned long addr;
>  	int npages;
> +	pfn_t pfn;
>
>  	might_sleep();
>
> @@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>  	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
>  				NULL);
>
> -	if (npages != 1) {
> -		get_page(bad_page);
> -		return page_to_pfn(bad_page);
> -	}
> +	if (unlikely(npages != 1)) {
> +		struct vm_area_struct *vma;
>
> -	return page_to_pfn(page[0]);
> +		vma = find_vma(current->mm, addr);
> +		if (vma == NULL || addr >= vma->vm_start ||
> +		    !(vma->vm_flags & VM_PFNMAP)) {

Isn't the check for addr backwards here?  For the VMA we would like to
find, vma->vm_start <= addr < vma->vm_end.

Cheers,
Muli
From: Anthony L. <an...@co...> - 2008-04-30 12:50:38
Muli Ben-Yehuda wrote:
> On Tue, Apr 29, 2008 at 02:09:20PM -0500, Anthony Liguori wrote:
>
>> This patch allows VMAs that contain no backing page to be used for guest
>> memory.  This is a drop-in replacement for Ben-Ami's first page in his
>> direct mmio series.  Here, we continue to allow mmio pages to be
>> represented in the rmap.
>>
>> Since v1, I've taken into account Andrea's suggestion to use VM_PFNMAP
>> instead of VM_IO and changed the BUG_ON to a return of bad_page.
>>
>> Signed-off-by: Anthony Liguori <ali...@us...>
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 1d7991a..64e5efe 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>>  	struct page *page[1];
>>  	unsigned long addr;
>>  	int npages;
>> +	pfn_t pfn;
>>
>>  	might_sleep();
>>
>> @@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>>  	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
>>  				NULL);
>>
>> -	if (npages != 1) {
>> -		get_page(bad_page);
>> -		return page_to_pfn(bad_page);
>> -	}
>> +	if (unlikely(npages != 1)) {
>> +		struct vm_area_struct *vma;
>>
>> -	return page_to_pfn(page[0]);
>> +		vma = find_vma(current->mm, addr);
>> +		if (vma == NULL || addr >= vma->vm_start ||
>> +		    !(vma->vm_flags & VM_PFNMAP)) {
>
> Isn't the check for addr backwards here?  For the VMA we would like to
> find, vma->vm_start <= addr < vma->vm_end.

Yes it is.  Thanks for spotting that.

Regards,

Anthony Liguori
From: Avi K. <av...@qu...> - 2008-04-30 12:57:06
Muli Ben-Yehuda wrote:
>> @@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>>  	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
>>  				NULL);
>>
>> -	if (npages != 1) {
>> -		get_page(bad_page);
>> -		return page_to_pfn(bad_page);
>> -	}
>> +	if (unlikely(npages != 1)) {
>> +		struct vm_area_struct *vma;
>>
>> -	return page_to_pfn(page[0]);
>> +		vma = find_vma(current->mm, addr);
>> +		if (vma == NULL || addr >= vma->vm_start ||
>> +		    !(vma->vm_flags & VM_PFNMAP)) {
>
> Isn't the check for addr backwards here?  For the VMA we would like to
> find, vma->vm_start <= addr < vma->vm_end.

The code is not trying to find a vma for the address, but a vma for the
address which also has VM_PFNMAP set.  The cases for vma not found, or vma
found but not VM_PFNMAP, are folded together.

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: Andrea A. <an...@qu...> - 2008-04-30 12:52:02
On Wed, Apr 30, 2008 at 11:59:47AM +0300, Avi Kivity wrote:
> The code is not trying to find a vma for the address, but a vma for the
> address which also has VM_PFNMAP set.  The cases for vma not found, or vma
> found but not VM_PFNMAP, are folded together.

Muli's saying the comparison is reversed: change >= to <.
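With that fix applied, the rejection test in gfn_to_pfn() would read (a
sketch of the corrected condition, not the committed patch):

	vma = find_vma(current->mm, addr);
	/*
	 * find_vma() already guarantees addr < vma->vm_end, so
	 * addr < vma->vm_start means addr falls in a hole below the
	 * returned vma, i.e. no vma actually contains addr.
	 */
	if (vma == NULL || addr < vma->vm_start ||
	    !(vma->vm_flags & VM_PFNMAP)) {
		get_page(bad_page);
		return page_to_pfn(bad_page);
	}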
From: Avi K. <av...@qu...> - 2008-04-30 12:52:13
Andrea Arcangeli wrote:
> On Wed, Apr 30, 2008 at 11:59:47AM +0300, Avi Kivity wrote:
>
>> The code is not trying to find a vma for the address, but a vma for the
>> address which also has VM_PFNMAP set.  The cases for vma not found, or
>> vma found but not VM_PFNMAP, are folded together.
>
> Muli's saying the comparison is reversed: change >= to <.

Err, yes.

--
Any sufficiently difficult bug is indistinguishable from a feature.