From: Anthony L. <ali...@us...> - 2008-04-29 19:10:22
This patch allows VMAs that contain no backing page to be used for guest
memory.  This is a drop-in replacement for Ben-Ami's first page in his direct
mmio series.  Here, we continue to allow mmio pages to be represented in the
rmap.

Since v1, I've taken into account Andrea's suggestion to use VM_PFNMAP
instead of VM_IO and changed the BUG_ON to a return of bad_page.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1d7991a..64e5efe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	struct page *page[1];
 	unsigned long addr;
 	int npages;
+	pfn_t pfn;

 	might_sleep();

@@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
 				NULL);

-	if (npages != 1) {
-		get_page(bad_page);
-		return page_to_pfn(bad_page);
-	}
+	if (unlikely(npages != 1)) {
+		struct vm_area_struct *vma;

-	return page_to_pfn(page[0]);
+		vma = find_vma(current->mm, addr);
+		if (vma == NULL || addr >= vma->vm_start ||
+		    !(vma->vm_flags & VM_PFNMAP)) {
+			get_page(bad_page);
+			return page_to_pfn(bad_page);
+		}
+
+		pfn = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+		BUG_ON(pfn_valid(pfn));
+	} else
+		pfn = page_to_pfn(page[0]);
+
+	return pfn;
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);

 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 {
-	return pfn_to_page(gfn_to_pfn(kvm, gfn));
+	pfn_t pfn;
+
+	pfn = gfn_to_pfn(kvm, gfn);
+	if (pfn_valid(pfn))
+		return pfn_to_page(pfn);
+
+	return NULL;
 }
 EXPORT_SYMBOL_GPL(gfn_to_page);

@@ -569,7 +586,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);

 void kvm_release_pfn_clean(pfn_t pfn)
 {
-	put_page(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);

@@ -594,21 +612,25 @@ EXPORT_SYMBOL_GPL(kvm_set_page_dirty);

 void kvm_set_pfn_dirty(pfn_t pfn)
 {
-	struct page *page = pfn_to_page(pfn);
-	if (!PageReserved(page))
-		SetPageDirty(page);
+	if (pfn_valid(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+		if (!PageReserved(page))
+			SetPageDirty(page);
+	}
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_dirty);

 void kvm_set_pfn_accessed(pfn_t pfn)
 {
-	mark_page_accessed(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		mark_page_accessed(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);

 void kvm_get_pfn(pfn_t pfn)
 {
-	get_page(pfn_to_page(pfn));
+	if (pfn_valid(pfn))
+		get_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_get_pfn);
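For context, the VM_PFNMAP mappings this patch targets are the kind a device
driver creates in its mmap handler via remap_pfn_range().  A minimal sketch of
such a handler, with a made-up device and MYDEV_MMIO_BASE standing in for the
real register window address:

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/*
	 * remap_pfn_range() marks the vma VM_PFNMAP, so get_user_pages()
	 * fails on this range and gfn_to_pfn() above falls through to the
	 * find_vma() path.  MYDEV_MMIO_BASE is hypothetical.
	 */
	return remap_pfn_range(vma, vma->vm_start,
			       MYDEV_MMIO_BASE >> PAGE_SHIFT, size,
			       pgprot_noncached(vma->vm_page_prot));
}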
From: Avi K. <av...@qu...> - 2008-04-29 22:19:21
Anthony Liguori wrote:
> This patch allows VMAs that contain no backing page to be used for guest
> memory.  This is a drop-in replacement for Ben-Ami's first page in his
> direct mmio series.  Here, we continue to allow mmio pages to be
> represented in the rmap.

I like this very much, as it only affects accessors and not the mmu core
itself.

Hollis/Xiantao/Carsten, can you confirm that this approach works for
you?  Carsten, I believe you don't have mmio, but at least this
shouldn't interfere.

> struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
> {
> -	return pfn_to_page(gfn_to_pfn(kvm, gfn));
> +	pfn_t pfn;
> +
> +	pfn = gfn_to_pfn(kvm, gfn);
> +	if (pfn_valid(pfn))
> +		return pfn_to_page(pfn);
> +
> +	return NULL;
> }

You're returning NULL here, not bad_page.

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: Hollis B. <ho...@us...> - 2008-04-29 22:57:27
On Tuesday 29 April 2008 17:17:49 Avi Kivity wrote:
> Anthony Liguori wrote:
>> This patch allows VMAs that contain no backing page to be used for guest
>> memory.  This is a drop-in replacement for Ben-Ami's first page in his
>> direct mmio series.  Here, we continue to allow mmio pages to be
>> represented in the rmap.
>
> I like this very much, as it only affects accessors and not the mmu core
> itself.
>
> Hollis/Xiantao/Carsten, can you confirm that this approach works for
> you?  Carsten, I believe you don't have mmio, but at least this
> shouldn't interfere.

OK, so the idea is to mmap /sys/bus/pci/.../region within the guest RAM
area, and just include it within the normal guest RAM memslot?

How will the IOMMU be programmed?  Wouldn't you still need to register a
special type of memslot for that?

--
Hollis Blanchard
IBM Linux Technology Center
From: Andrea A. <an...@qu...> - 2008-04-30 07:00:40
On Tue, Apr 29, 2008 at 06:12:51PM -0500, Anthony Liguori wrote:
> IIUC PPC correctly, all IO pages have corresponding struct pages.  This
> means that get_user_pages() would succeed and you can reference count
> them?  In this case, we would never take the VM_PFNMAP path.

get_user_pages only works on vmas where only pfns with struct pages can be
mapped, but even if a struct page exists, that doesn't mean get_user_pages
will succeed.  All mmio regions should be marked VM_IO, since reading them
can affect the hardware somehow, and that prevents get_user_pages from
working on them regardless of whether a struct page exists.

> That's independent of this patchset.  For non-aware guests, we'll have to
> pin all of physical memory up front and then create an IOMMU table from
> the pinned physical memory.  For aware guests with a PV DMA window API,
> we'll be able to build that mapping on the fly (enforcing mlock allocation
> limits).

BTW, as far as a Linux guest is concerned, if the PV DMA API hits the mlock
ulimit, the guest will crash.  Nothing checks whether pci_map_single returns
null (the fix would be to throttle the I/O until some other dma completes,
to split the dma into multiple operations if it's an SG entry, and if it
repeatedly fails, to fall back to PIO or return an I/O error if PIO isn't
available).  It can fail if there's lots of weird pci hardware doing rdma at
the same time (for example, see the iommu_arena_alloc retval in
arch/alpha/kernel/pci_iommu.c).  In short, we'll either need ulimit -l
unlimited, or we'll have to define practical limits depending on the guest
driver code and the number of devices using passthrough.

I'll make the reserved-ram patch incremental with these patches; it should
then pick up the right pfn coming from /dev/mem without my page_count == 0
check, and then I only have to fix up the page pinning (so it will likely
also be incremental with the kvm mmu notifier patch, and I can hope to get
something final that removes page pinning for good, not only on mmio regions
that don't have a struct page).

I currently have trouble booting the host with the blk-settings.c change
made in 2.6.25; I thought I had fixed that already... (I did when loading
the host kernel in kvm, but on real hardware it still fails for another
reason).  And Andrew sent me a long email about mmu notifiers, so before I
return to reserved-ram I have to answer him and upload an updated
mmu-notifier patch with certain cleanups he requested.  So go ahead without
waiting for reserved-ram and mmu notifiers; I'll pick up whatever is
available in or outside kvm.git when I'm ready.

Thanks!
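A sketch of the missing check Andrea describes, using the single-argument
pci_dma_mapping_error() of kernels of this era (the recovery path shown is
illustrative only, not an actual driver's):

	dma_addr_t dma;

	dma = pci_map_single(pdev, buf, len, PCI_DMA_TODEVICE);
	if (pci_dma_mapping_error(dma)) {
		/*
		 * Per the discussion: throttle until another dma
		 * completes, split the SG entry into smaller mappings,
		 * fall back to PIO, or fail the request with an I/O
		 * error.  Do not hand the device a bogus address.
		 */
		return -ENOMEM;
	}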
From: Anthony L. <an...@co...> - 2008-04-30 15:43:47
Andrea Arcangeli wrote:
> On Tue, Apr 29, 2008 at 06:12:51PM -0500, Anthony Liguori wrote:
>> IIUC PPC correctly, all IO pages have corresponding struct pages.  This
>> means that get_user_pages() would succeed and you can reference count
>> them?  In this case, we would never take the VM_PFNMAP path.
>
> get_user_pages only works on vmas where only pfns with struct pages can be
> mapped, but even if a struct page exists, that doesn't mean get_user_pages
> will succeed.  All mmio regions should be marked VM_IO, since reading them
> can affect the hardware somehow, and that prevents get_user_pages from
> working on them regardless of whether a struct page exists.

Ah, thanks for the clarification.

Regards,

Anthony Liguori
From: Carsten O. <co...@de...> - 2008-04-30 14:18:53
Avi Kivity wrote:
> Hollis/Xiantao/Carsten, can you confirm that this approach works for
> you?  Carsten, I believe you don't have mmio, but at least this
> shouldn't interfere.

Should work fine on s390, afaics.
From: Anthony L. <an...@co...> - 2008-04-29 22:25:49
Avi Kivity wrote:
> Anthony Liguori wrote:
>> This patch allows VMAs that contain no backing page to be used for guest
>> memory.  This is a drop-in replacement for Ben-Ami's first page in his
>> direct mmio series.  Here, we continue to allow mmio pages to be
>> represented in the rmap.
>
> I like this very much, as it only affects accessors and not the mmu core
> itself.
>
> Hollis/Xiantao/Carsten, can you confirm that this approach works for
> you?  Carsten, I believe you don't have mmio, but at least this
> shouldn't interfere.
>
>> struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
>> {
>> -	return pfn_to_page(gfn_to_pfn(kvm, gfn));
>> +	pfn_t pfn;
>> +
>> +	pfn = gfn_to_pfn(kvm, gfn);
>> +	if (pfn_valid(pfn))
>> +		return pfn_to_page(pfn);
>> +
>> +	return NULL;
>> }
>
> You're returning NULL here, not bad_page.

My thinking was that bad_page indicates that the gfn is invalid.  This is
a different type of error, though.  The problem is that we are trying to
kmap() a page that has no struct page associated with it.  I'm not sure
what the right thing to do here is.

Perhaps we should be replacing consumers of gfn_to_page() with
copy_to_user() instead?

Regards,

Anthony Liguori
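As a sketch, that replacement would look something like this for a
non-atomic consumer (illustrative code, not a specific call site;
kvm_write_guest() is the existing helper that goes through copy_to_user()):

	/* Before: requires a struct page, breaks on direct mmio. */
	page = gfn_to_page(kvm, gfn);
	ptr = kmap(page);
	memcpy(ptr + offset, data, len);
	kunmap(page);
	kvm_release_page_dirty(page);

	/* After: goes through the userspace mapping and fails cleanly
	 * even when the gpa has no struct page behind it. */
	if (kvm_write_guest(kvm, gpa + offset, data, len))
		return -EFAULT;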
From: Avi K. <av...@qu...> - 2008-04-29 22:43:51
Anthony Liguori wrote:
>>> struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
>>> {
>>> -	return pfn_to_page(gfn_to_pfn(kvm, gfn));
>>> +	pfn_t pfn;
>>> +
>>> +	pfn = gfn_to_pfn(kvm, gfn);
>>> +	if (pfn_valid(pfn))
>>> +		return pfn_to_page(pfn);
>>> +
>>> +	return NULL;
>>> }
>>
>> You're returning NULL here, not bad_page.
>
> My thinking was that bad_page indicates that the gfn is invalid.  This is
> a different type of error, though.  The problem is that we are trying to
> kmap() a page that has no struct page associated with it.  I'm not sure
> what the right thing to do here is.

It depends on what's going on.  Does a page table point to mmio?  Or the
glommerclock?

Not sure there is a single answer.

> Perhaps we should be replacing consumers of gfn_to_page() with
> copy_to_user() instead?

Indeed we should.  The problem is access in atomic contexts.  It's easy to
detect failure, but not always easy to handle it.

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: Anthony L. <ali...@us...> - 2008-04-29 22:51:48
Avi Kivity wrote:
> It depends on what's going on.  Does a page table point to mmio?  Or the
> glommerclock?
>
> Not sure there is a single answer.
>
>> Perhaps we should be replacing consumers of gfn_to_page() with
>> copy_to_user() instead?
>
> Indeed we should.  The problem is access in atomic contexts.  It's easy
> to detect failure, but not always easy to handle it.

So I think we should replace it with a rate-limited printk and return
bad_page.  That way the guest can't exploit it, and we'll still hopefully
get printk()s to track down instances of things going bad.

Regards,

Anthony Liguori
From: Avi K. <av...@qu...> - 2008-04-29 22:54:00
Anthony Liguori wrote:
> Avi Kivity wrote:
>> It depends on what's going on.  Does a page table point to mmio?  Or the
>> glommerclock?
>>
>> Not sure there is a single answer.
>>
>>> Perhaps we should be replacing consumers of gfn_to_page() with
>>> copy_to_user() instead?
>>
>> Indeed we should.  The problem is access in atomic contexts.  It's easy
>> to detect failure, but not always easy to handle it.
>
> So I think we should replace it with a rate-limited printk and return
> bad_page.  That way the guest can't exploit it, and we'll still hopefully
> get printk()s to track down instances of things going bad.

Agreed.  Add a stacktrace so we can see what causes the badness.

--
Any sufficiently difficult bug is indistinguishable from a feature.
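A sketch of what the agreed-upon variant might look like in gfn_to_page()
(an illustration of the proposal, not the committed fix):

struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
{
	pfn_t pfn = gfn_to_pfn(kvm, gfn);

	if (pfn_valid(pfn))
		return pfn_to_page(pfn);

	/*
	 * mmio pfn with no struct page: warn with a stacktrace so the
	 * offending call site can be tracked down, but rate-limited so
	 * a guest cannot flood the host log.
	 */
	if (printk_ratelimit()) {
		printk(KERN_WARNING "kvm: no struct page for pfn %lx\n",
		       (unsigned long)pfn);
		dump_stack();
	}
	get_page(bad_page);
	return bad_page;
}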
From: Anthony L. <ali...@us...> - 2008-04-29 23:12:53
Hollis Blanchard wrote:
> On Tuesday 29 April 2008 17:17:49 Avi Kivity wrote:
>> I like this very much, as it only affects accessors and not the mmu core
>> itself.
>>
>> Hollis/Xiantao/Carsten, can you confirm that this approach works for
>> you?  Carsten, I believe you don't have mmio, but at least this
>> shouldn't interfere.
>
> OK, so the idea is to mmap /sys/bus/pci/.../region within the guest RAM
> area, and just include it within the normal guest RAM memslot?

In the case of x86, since the PCI IO region is pretty far away from normal
RAM, we'll probably use a different memory slot, but yes, that's the general
idea.

IIUC PPC correctly, all IO pages have corresponding struct pages.  This
means that get_user_pages() would succeed and you can reference count them?
In this case, we would never take the VM_PFNMAP path.

Is that correct?

> How will the IOMMU be programmed?  Wouldn't you still need to register a
> special type of memslot for that?

That's independent of this patchset.  For non-aware guests, we'll have to
pin all of physical memory up front and then create an IOMMU table from the
pinned physical memory.  For aware guests with a PV DMA window API, we'll be
able to build that mapping on the fly (enforcing mlock allocation limits).

Regards,

Anthony Liguori
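An illustrative sketch of that up-front pinning scheme for non-aware guests;
iommu_map_page() is a made-up placeholder for whatever the real IOMMU
interface turns out to be, and mmap_sem handling is elided for brevity:

static int pin_memslot_for_dma(struct kvm_memory_slot *slot)
{
	unsigned long i;

	for (i = 0; i < slot->npages; i++) {
		unsigned long addr = slot->userspace_addr +
				     (i << PAGE_SHIFT);
		struct page *page;

		/* Pin the page for the lifetime of the IOMMU mapping. */
		if (get_user_pages(current, current->mm, addr, 1, 1, 0,
				   &page, NULL) != 1)
			return -EFAULT;

		/* Hypothetical helper: enter gfn -> pfn into the IOMMU. */
		iommu_map_page(slot->base_gfn + i, page_to_pfn(page));
	}
	return 0;
}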
From: Hollis B. <ho...@us...> - 2008-04-30 15:11:45
On Tuesday 29 April 2008 18:12:51 Anthony Liguori wrote:
> IIUC PPC correctly, all IO pages have corresponding struct pages.  This
> means that get_user_pages() would succeed and you can reference count
> them?  In this case, we would never take the VM_PFNMAP path.
>
> Is that correct?

I think Andrea's answered this, but for the record: I believe ioremap()
will get you struct pages on PPC, but they don't automatically exist.

--
Hollis Blanchard
IBM Linux Technology Center
From: Muli Ben-Y. <mu...@il...> - 2008-04-30 06:09:16
On Tue, Apr 29, 2008 at 02:09:20PM -0500, Anthony Liguori wrote:

> This patch allows VMAs that contain no backing page to be used for guest
> memory.  This is a drop-in replacement for Ben-Ami's first page in his
> direct mmio series.  Here, we continue to allow mmio pages to be
> represented in the rmap.
>
> Since v1, I've taken into account Andrea's suggestion to use VM_PFNMAP
> instead of VM_IO and changed the BUG_ON to a return of bad_page.
>
> Signed-off-by: Anthony Liguori <ali...@us...>
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 1d7991a..64e5efe 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>  	struct page *page[1];
>  	unsigned long addr;
>  	int npages;
> +	pfn_t pfn;
>
>  	might_sleep();
>
> @@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>  	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
>  				NULL);
>
> -	if (npages != 1) {
> -		get_page(bad_page);
> -		return page_to_pfn(bad_page);
> -	}
> +	if (unlikely(npages != 1)) {
> +		struct vm_area_struct *vma;
>
> -	return page_to_pfn(page[0]);
> +		vma = find_vma(current->mm, addr);
> +		if (vma == NULL || addr >= vma->vm_start ||
> +		    !(vma->vm_flags & VM_PFNMAP)) {

Isn't the check for addr backwards here?  For the VMA we would like to
find, vma->vm_start <= addr < vma->vm_end.

Cheers,
Muli
From: Anthony L. <an...@co...> - 2008-04-30 12:50:38
Muli Ben-Yehuda wrote:
> On Tue, Apr 29, 2008 at 02:09:20PM -0500, Anthony Liguori wrote:
>
>> This patch allows VMAs that contain no backing page to be used for guest
>> memory.  This is a drop-in replacement for Ben-Ami's first page in his
>> direct mmio series.  Here, we continue to allow mmio pages to be
>> represented in the rmap.
>>
>> Since v1, I've taken into account Andrea's suggestion to use VM_PFNMAP
>> instead of VM_IO and changed the BUG_ON to a return of bad_page.
>>
>> Signed-off-by: Anthony Liguori <ali...@us...>
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 1d7991a..64e5efe 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -532,6 +532,7 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>>  	struct page *page[1];
>>  	unsigned long addr;
>>  	int npages;
>> +	pfn_t pfn;
>>
>>  	might_sleep();
>>
>> @@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>>  	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
>>  				NULL);
>>
>> -	if (npages != 1) {
>> -		get_page(bad_page);
>> -		return page_to_pfn(bad_page);
>> -	}
>> +	if (unlikely(npages != 1)) {
>> +		struct vm_area_struct *vma;
>>
>> -	return page_to_pfn(page[0]);
>> +		vma = find_vma(current->mm, addr);
>> +		if (vma == NULL || addr >= vma->vm_start ||
>> +		    !(vma->vm_flags & VM_PFNMAP)) {
>
> Isn't the check for addr backwards here?  For the VMA we would like to
> find, vma->vm_start <= addr < vma->vm_end.

Yes it is.  Thanks for spotting that.

Regards,

Anthony Liguori
From: Avi K. <av...@qu...> - 2008-04-30 12:57:06
Muli Ben-Yehuda wrote:
>> @@ -544,19 +545,35 @@ pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
>>  	npages = get_user_pages(current, current->mm, addr, 1, 1, 1, page,
>>  				NULL);
>>
>> -	if (npages != 1) {
>> -		get_page(bad_page);
>> -		return page_to_pfn(bad_page);
>> -	}
>> +	if (unlikely(npages != 1)) {
>> +		struct vm_area_struct *vma;
>>
>> -	return page_to_pfn(page[0]);
>> +		vma = find_vma(current->mm, addr);
>> +		if (vma == NULL || addr >= vma->vm_start ||
>> +		    !(vma->vm_flags & VM_PFNMAP)) {
>
> Isn't the check for addr backwards here?  For the VMA we would like to
> find, vma->vm_start <= addr < vma->vm_end.

The code is not trying to find a vma for the address, but a vma for the
address which also has VM_PFNMAP set.  The cases for vma not found, or vma
found but not VM_PFNMAP, are folded together.

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: Andrea A. <an...@qu...> - 2008-04-30 12:52:02
On Wed, Apr 30, 2008 at 11:59:47AM +0300, Avi Kivity wrote:
> The code is not trying to find a vma for the address, but a vma for the
> address which also has VM_PFNMAP set.  The cases for vma not found, or vma
> found but not VM_PFNMAP, are folded together.

Muli's saying the comparison is reversed: change >= to <.
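With that fix applied, the rejection test in gfn_to_pfn() would read (a
sketch of the corrected condition, not the committed patch):

	vma = find_vma(current->mm, addr);
	/*
	 * find_vma() already guarantees addr < vma->vm_end, so
	 * addr < vma->vm_start means addr falls in a hole below the
	 * returned vma, i.e. no vma actually contains addr.
	 */
	if (vma == NULL || addr < vma->vm_start ||
	    !(vma->vm_flags & VM_PFNMAP)) {
		get_page(bad_page);
		return page_to_pfn(bad_page);
	}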
From: Avi K. <av...@qu...> - 2008-04-30 12:52:13
Andrea Arcangeli wrote:
> On Wed, Apr 30, 2008 at 11:59:47AM +0300, Avi Kivity wrote:
>
>> The code is not trying to find a vma for the address, but a vma for the
>> address which also has VM_PFNMAP set.  The cases for vma not found, or
>> vma found but not VM_PFNMAP, are folded together.
>
> Muli's saying the comparison is reversed: change >= to <.

Err, yes.

--
Any sufficiently difficult bug is indistinguishable from a feature.