From: David S. A. <da...@ci...> - 2008-05-17 04:31:29
Avi Kivity wrote:
> Okay, I committed the patch without the flood count == 5.

I've continued testing the RHEL3 guests with the flood count at 3, and I am
right back to where I started. With the patch and the flood count at 3, I had
2 runs totaling around 24 hours that looked really good. Now, I am back to
square one. I guess the short of it is that I am not sure if the patch
resolves this issue or not. If you want to back it out, I can continue to
apply it on my end as I continue testing.

A snapshot of kvm_stat -f 'mmu*' -l is attached for two test runs with the
patch (line wrap is horrible inline). I will work on creating an app that
will stimulate kscand activity similar to what I am seeing.

Also, in a prior e-mail I mentioned guest time advancing rapidly. I've
noticed that with the -no-kvm-pit option the guest time is much better and
typically stays within 3 seconds or so of the host, even through the high
kscand activity, which is one instance of when I've noticed time jumps with
the kernel pit. Yes, this result has been repeatable through 6 or so runs. :-)

david
From: Rusty R. <ru...@ru...> - 2008-05-17 04:02:12
On Friday 16 May 2008 19:28:27 Tomasz Chmielewski wrote:
> Christian Borntraeger schrieb:
> > Hello Rusty,
> >
> > sometimes it is useful to share a disk (e.g. usr). To avoid file system
> > corruption, the disk should be mounted read-only in that case.
>
> Although it is done at a different level here, I wanted to note that
> mounting a filesystem read-only does not necessarily mean the system will
> not try to write to it. This is the case for ext3, for example - when
> mounted ro, the system will still replay the journal and do some writes etc.
>
> The patch, however, should take care of that, too, as it is a completely
> different place where it is made ro.

Note I'm assuming that the host will deny writes. Telling the guest is merely
politeness.

Cheers,
Rusty.
From: Christoph L. <cla...@sg...> - 2008-05-17 01:38:26
Implementation of what Linus suggested: Defer the XPMEM processing until
after the locks are dropped. Allow immediate action by GRU/KVM.
This patch implements callbacks for device drivers that establish external
references to pages aside from the Linux rmaps. Those drivers either:
1. Do not take a refcount on pages that are mapped from devices. They
have TLB-cache-like handling and must be able to flush external references
from atomic contexts. These devices do not need to provide the _sync methods.
2. Do take a refcount on pages mapped externally. These are handled by
marking pages as to be invalidated in atomic contexts. Invalidation
may be started by the driver. A _sync variant for the individual or
range unmap is called when we are back in a nonatomic context. At that
point the device must complete the removal of external references
and drop its refcount.
With the mm notifier it is possible for the device driver to release external
references after the page references are removed from a process that made
them available.
With the notifier it becomes possible to get pages unpinned on request and thus
avoid the issues that come with having a large number of pinned pages.
A device driver must subscribe to a process using
mm_notifier_register(struct mm_notifier *, struct mm_struct *)
The VM will then perform callbacks for operations that unmap or change
permissions of pages in that address space.
When the process terminates, the ->release method is first called to
remove all pages still mapped to the process.
Before the mm_struct is freed the ->destroy() method is called which
should dispose of the mm_notifier structure.
The following callbacks exist:
invalidate_range(notifier, mm_struct *, from, to)
Invalidate a range of addresses. The invalidation is
not required to complete immediately.
invalidate_range_sync(notifier, mm_struct *, from, to)
This is called after some invalidate_range callouts.
The driver may only return when the invalidation of the references
is completed. Callback is only called from non atomic contexts.
There is no need to provide this callback if the driver can remove
references in an atomic context.
invalidate_page(notifier, mm_struct *, struct page *page, unsigned long address)
Invalidate references to a particular page. The driver may
defer the invalidation.
invalidate_page_sync(notifier, mm_struct *, struct page *)
Called after one or more invalidate_page() callbacks. The callback
must only return when the external references have been removed.
The callback does not need to be provided if the driver can remove
references in atomic contexts.
[NOTE] The invalidate_page_sync() callback is weird because it is called for
every notifier that supports the invalidate_page_sync() callback
whenever a page has PageNotifier() set. The driver must determine in an
efficient way whether the page is of interest to it. This is because we no
longer have the mm context after we have dropped the rmap list lock.
Drivers incrementing the refcount must set and clear PageNotifier
appropriately when establishing and/or dropping a refcount!
[These conditions are similar to the rmap notifier that was introduced
in my V7 of the mmu_notifier].
There is no support for an aging callback. A device driver may simply set the
reference bit on the linux pte when the external mapping is referenced if such
support is desired.
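As a purely illustrative aside (not part of Christoph's patch; all driver
names below are hypothetical), a driver of type (1) above, i.e. one that takes
no page refcount and can flush its external TLB from atomic context, might use
the proposed interface roughly as follows. It provides no _sync methods and
does no PageNotifier handling:

#include <linux/rmap.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct my_ctx {
	struct mm_notifier notifier;
	/* ... driver-private external TLB state ... */
};

static void my_invalidate_range(struct mm_notifier *mn, struct mm_struct *mm,
				unsigned long start, unsigned long end)
{
	/* Flush external TLB entries covering [start, end); may run atomically. */
}

static void my_invalidate_page(struct mm_notifier *mn, struct mm_struct *mm,
			       struct page *page, unsigned long addr)
{
	/* Flush the single external mapping of addr. */
}

static void my_release(struct mm_notifier *mn, struct mm_struct *mm)
{
	/* The process is exiting: drop all remaining external references. */
}

static void my_destroy(struct mm_notifier *mn, struct mm_struct *mm)
{
	/* Called before the mm_struct is freed; dispose of the notifier. */
	kfree(container_of(mn, struct my_ctx, notifier));
}

static struct mm_notifier_ops my_ops = {
	.invalidate_range	= my_invalidate_range,
	.invalidate_page	= my_invalidate_page,
	.release		= my_release,
	.destroy		= my_destroy,
};

static int my_attach(void)
{
	struct my_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

	if (!ctx)
		return -ENOMEM;
	ctx->notifier.ops = &my_ops;
	/* The patch requires mmap_sem to be held across registration;
	 * write mode is the safe choice here. */
	down_write(&current->mm->mmap_sem);
	mm_notifier_register(&ctx->notifier, current->mm);
	up_write(&current->mm->mmap_sem);
	return 0;
}

A type (2) driver would additionally provide the _sync methods, hold a
refcount on the externally mapped pages, and set/clear PageNotifier when
establishing and dropping those refcounts, as described above.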
The patch is provisional. All functions are inlined for now. They should be wrapped like
in Andrea's series. It's probably good to have Andrea review this if we actually
decide to go this route, since he is pretty good at detecting issues with complex
lock interactions in the VM. mmu notifiers V7 was rejected by Andrew because of the
strange asymmetry in invalidate_page_sync() (at that time called the rmap notifier), and
we are reintroducing that now in a lightweight form to be able to defer freeing
until after the rmap spinlocks have been dropped.
Jack tested this with the GRU.
Signed-off-by: Christoph Lameter <cla...@sg...>
---
fs/hugetlbfs/inode.c | 2
include/linux/mm_types.h | 3
include/linux/page-flags.h | 3
include/linux/rmap.h | 161 +++++++++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 4 +
mm/Kconfig | 4 +
mm/filemap_xip.c | 2
mm/fremap.c | 2
mm/hugetlb.c | 3
mm/memory.c | 38 ++++++++--
mm/mmap.c | 3
mm/mprotect.c | 3
mm/mremap.c | 5 +
mm/rmap.c | 11 ++-
14 files changed, 234 insertions(+), 10 deletions(-)
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/kernel/fork.c 2008-05-16 16:06:26.000000000 -0700
@@ -386,6 +386,9 @@ static struct mm_struct * mm_init(struct
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
+#ifdef CONFIG_MM_NOTIFIER
+ mm->mm_notifier = NULL;
+#endif
return mm;
}
@@ -418,6 +421,7 @@ void __mmdrop(struct mm_struct *mm)
BUG_ON(mm == &init_mm);
mm_free_pgd(mm);
destroy_context(mm);
+ mm_notifier_destroy(mm);
free_mm(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/mm/filemap_xip.c 2008-05-16 16:06:26.000000000 -0700
@@ -189,6 +189,7 @@ __xip_unmap (struct address_space * mapp
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
+ mm_notifier_invalidate_page(mm, page, address);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
@@ -197,6 +198,7 @@ __xip_unmap (struct address_space * mapp
}
}
spin_unlock(&mapping->i_mmap_lock);
+ mm_notifier_invalidate_page_sync(page);
}
/*
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/mm/fremap.c 2008-05-16 16:06:26.000000000 -0700
@@ -214,7 +214,9 @@ asmlinkage long sys_remap_file_pages(uns
spin_unlock(&mapping->i_mmap_lock);
}
+ mm_notifier_invalidate_range(mm, start, start + size);
err = populate_range(mm, vma, start, size, pgoff);
+ mm_notifier_invalidate_range_sync(mm, start, start + size);
if (!err && !(flags & MAP_NONBLOCK)) {
if (unlikely(has_write_lock)) {
downgrade_write(&mm->mmap_sem);
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/hugetlb.c 2008-05-16 17:50:31.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
#include <linux/mutex.h>
+#include <linux/rmap.h>
#include <asm/page.h>
#include <asm/pgtable.h>
@@ -843,6 +844,7 @@ void __unmap_hugepage_range(struct vm_ar
}
spin_unlock(&mm->page_table_lock);
flush_tlb_range(vma, start, end);
+ mm_notifier_invalidate_range(mm, start, end);
list_for_each_entry_safe(page, tmp, &page_list, lru) {
list_del(&page->lru);
put_page(page);
@@ -864,6 +866,7 @@ void unmap_hugepage_range(struct vm_area
spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
__unmap_hugepage_range(vma, start, end);
spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
+ mm_notifier_invalidate_range_sync(vma->vm_mm, start, end);
}
}
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/mm/memory.c 2008-05-16 16:06:26.000000000 -0700
@@ -527,6 +527,7 @@ copy_one_pte(struct mm_struct *dst_mm, s
*/
if (is_cow_mapping(vm_flags)) {
ptep_set_wrprotect(src_mm, addr, src_pte);
+ mm_notifier_invalidate_range(src_mm, addr, addr + PAGE_SIZE);
pte = pte_wrprotect(pte);
}
@@ -649,6 +650,7 @@ int copy_page_range(struct mm_struct *ds
unsigned long next;
unsigned long addr = vma->vm_start;
unsigned long end = vma->vm_end;
+ int ret;
/*
* Don't copy ptes where a page fault will fill them correctly.
@@ -664,17 +666,30 @@ int copy_page_range(struct mm_struct *ds
if (is_vm_hugetlb_page(vma))
return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+ ret = 0;
dst_pgd = pgd_offset(dst_mm, addr);
src_pgd = pgd_offset(src_mm, addr);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(src_pgd))
continue;
- if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
- vma, addr, next))
- return -ENOMEM;
+ if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
+ vma, addr, next))) {
+ ret = -ENOMEM;
+ break;
+ }
} while (dst_pgd++, src_pgd++, addr = next, addr != end);
- return 0;
+
+ /*
+ * We need to invalidate the secondary MMU mappings only when
+ * there could be a permission downgrade on the ptes of the
+ * parent mm. And a permission downgrade will only happen if
+ * is_cow_mapping() returns true.
+ */
+ if (is_cow_mapping(vma->vm_flags))
+ mm_notifier_invalidate_range_sync(src_mm, vma->vm_start, end);
+
+ return ret;
}
static unsigned long zap_pte_range(struct mmu_gather *tlb,
@@ -913,6 +928,7 @@ unsigned long unmap_vmas(struct mmu_gath
}
tlb_finish_mmu(*tlbp, tlb_start, start);
+ mm_notifier_invalidate_range(vma->vm_mm, tlb_start, start);
if (need_resched() ||
(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
@@ -951,8 +967,10 @@ unsigned long zap_page_range(struct vm_a
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
- if (tlb)
+ if (tlb) {
tlb_finish_mmu(tlb, address, end);
+ mm_notifier_invalidate_range(mm, address, end);
+ }
return end;
}
@@ -1711,7 +1729,6 @@ static int do_wp_page(struct mm_struct *
*/
page_table = pte_offset_map_lock(mm, pmd, address,
&ptl);
- page_cache_release(old_page);
if (!pte_same(*page_table, orig_pte))
goto unlock;
@@ -1729,6 +1746,7 @@ static int do_wp_page(struct mm_struct *
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, entry);
ret |= VM_FAULT_WRITE;
+ old_page = NULL;
goto unlock;
}
@@ -1774,6 +1792,7 @@ gotten:
* thread doing COW.
*/
ptep_clear_flush(vma, address, page_table);
+ mm_notifier_invalidate_page(mm, old_page, address);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
@@ -1787,10 +1806,13 @@ gotten:
if (new_page)
page_cache_release(new_page);
- if (old_page)
- page_cache_release(old_page);
unlock:
pte_unmap_unlock(page_table, ptl);
+ if (old_page) {
+ mm_notifier_invalidate_page_sync(old_page);
+ page_cache_release(old_page);
+ }
+
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/mmap.c 2008-05-16 16:06:26.000000000 -0700
@@ -1759,6 +1759,8 @@ static void unmap_region(struct mm_struc
free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
next? next->vm_start: 0);
tlb_finish_mmu(tlb, start, end);
+ mm_notifier_invalidate_range(mm, start, end);
+ mm_notifier_invalidate_range_sync(mm, start, end);
}
/*
@@ -2048,6 +2050,7 @@ void exit_mmap(struct mm_struct *mm)
/* mm's last user has gone, and its about to be pulled down */
arch_exit_mmap(mm);
+ mm_notifier_release(mm);
lru_add_drain();
flush_cache_mm(mm);
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/mprotect.c 2008-05-16 16:06:26.000000000 -0700
@@ -21,6 +21,7 @@
#include <linux/syscalls.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
#include <asm/cacheflush.h>
@@ -132,6 +133,7 @@ static void change_protection(struct vm_
change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable);
} while (pgd++, addr = next, addr != end);
flush_tlb_range(vma, start, end);
+ mm_notifier_invalidate_range(vma->vm_mm, start, end);
}
int
@@ -211,6 +213,7 @@ success:
hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
else
change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+ mm_notifier_invalidate_range_sync(mm, start, end);
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
return 0;
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/mm/mremap.c 2008-05-16 16:06:26.000000000 -0700
@@ -18,6 +18,7 @@
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -74,6 +75,7 @@ static void move_ptes(struct vm_area_str
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
+ unsigned long old_start = old_addr;
if (vma->vm_file) {
/*
@@ -100,6 +102,7 @@ static void move_ptes(struct vm_area_str
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
arch_enter_lazy_mmu_mode();
+ mm_notifier_invalidate_range(mm, old_addr, old_end);
for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
new_pte++, new_addr += PAGE_SIZE) {
if (pte_none(*old_pte))
@@ -116,6 +119,8 @@ static void move_ptes(struct vm_area_str
pte_unmap_unlock(old_pte - 1, old_ptl);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+
+ mm_notifier_invalidate_range_sync(vma->vm_mm, old_start, old_end);
}
#define LATENCY_LIMIT (64 * PAGE_SIZE)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/rmap.c 2008-05-16 16:06:26.000000000 -0700
@@ -52,6 +52,9 @@
#include <asm/tlbflush.h>
+struct mm_notifier *mm_notifier_page_sync;
+DECLARE_RWSEM(mm_notifier_page_sync_sem);
+
struct kmem_cache *anon_vma_cachep;
/* This must be called under the mmap_sem. */
@@ -458,6 +461,7 @@ static int page_mkclean_one(struct page
flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
+ mm_notifier_invalidate_page(mm, page, address);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
@@ -502,8 +506,8 @@ int page_mkclean(struct page *page)
ret = 1;
}
}
+ mm_notifier_invalidate_page_sync(page);
}
-
return ret;
}
EXPORT_SYMBOL_GPL(page_mkclean);
@@ -725,6 +729,7 @@ static int try_to_unmap_one(struct page
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
+ mm_notifier_invalidate_page(mm, page, address);
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
@@ -855,6 +860,7 @@ static void try_to_unmap_cluster(unsigne
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
+ mm_notifier_invalidate_page(mm, page, address);
/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
@@ -1013,8 +1019,9 @@ int try_to_unmap(struct page *page, int
else
ret = try_to_unmap_file(page, migration);
+ mm_notifier_invalidate_page_sync(page);
+
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
}
-
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/include/linux/rmap.h 2008-05-16 18:32:52.000000000 -0700
@@ -133,4 +133,165 @@ static inline int page_mkclean(struct pa
#define SWAP_AGAIN 1
#define SWAP_FAIL 2
+#ifdef CONFIG_MM_NOTIFIER
+
+struct mm_notifier_ops {
+ void (*invalidate_range)(struct mm_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+ void (*invalidate_range_sync)(struct mm_notifier *mn, struct mm_struct *mm,
+ unsigned long start, unsigned long end);
+ void (*invalidate_page)(struct mm_notifier *mn, struct mm_struct *mm,
+ struct page *page, unsigned long addr);
+ void (*invalidate_page_sync)(struct mm_notifier *mn, struct mm_struct *mm,
+ struct page *page);
+ void (*release)(struct mm_notifier *mn, struct mm_struct *mm);
+ void (*destroy)(struct mm_notifier *mn, struct mm_struct *mm);
+};
+
+struct mm_notifier {
+ struct mm_notifier_ops *ops;
+ struct mm_struct *mm;
+ struct mm_notifier *next;
+ struct mm_notifier *next_page_sync;
+};
+
+extern struct mm_notifier *mm_notifier_page_sync;
+extern struct rw_semaphore mm_notifier_page_sync_sem;
+
+/*
+ * Must hold mmap_sem when calling mm_notifier_register.
+ */
+static inline void mm_notifier_register(struct mm_notifier *mn,
+ struct mm_struct *mm)
+{
+ mn->mm = mm;
+ mn->next = mm->mm_notifier;
+ rcu_assign_pointer(mm->mm_notifier, mn);
+ if (mn->ops->invalidate_page_sync) {
+ down_write(&mm_notifier_page_sync_sem);
+ mn->next_page_sync = mm_notifier_page_sync;
+ mm_notifier_page_sync = mn;
+ up_write(&mm_notifier_page_sync_sem);
+ }
+}
+
+/*
+ * Invalidate remote references in a particular address range
+ */
+static inline void mm_notifier_invalidate_range(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mm_notifier *mn;
+
+ for (mn = rcu_dereference(mm->mm_notifier); mn;
+ mn = rcu_dereference(mn->next))
+ mn->ops->invalidate_range(mn, mm, start, end);
+}
+
+/*
+ * Invalidate remote references in a particular address range.
+ * Can sleep. Only return if all remote references have been removed.
+ */
+static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct mm_notifier *mn;
+
+ for (mn = rcu_dereference(mm->mm_notifier); mn;
+ mn = rcu_dereference(mn->next))
+ if (mn->ops->invalidate_range_sync)
+ mn->ops->invalidate_range_sync(mn, mm, start, end);
+}
+
+/*
+ * Invalidate remote references to a page
+ */
+static inline void mm_notifier_invalidate_page(struct mm_struct *mm,
+ struct page *page, unsigned long addr)
+{
+ struct mm_notifier *mn;
+
+ for (mn = rcu_dereference(mm->mm_notifier); mn;
+ mn = rcu_dereference(mn->next))
+ mn->ops->invalidate_page(mn, mm, page, addr);
+}
+
+/*
+ * Invalidate remote references to a particular page. Only return
+ * if all references have been removed.
+ *
+ * Note: This is an expensive function since it is not clear at the time
+ * of call to which mm_struct the page belongs. It walks through the
+ * mmlist and calls the mmu notifier ops for each address space in the
+ * system. At some point this needs to be optimized.
+ */
+static inline void mm_notifier_invalidate_page_sync(struct page *page)
+{
+ struct mm_notifier *mn;
+
+ if (!PageNotifier(page))
+ return;
+
+ down_read(&mm_notifier_page_sync_sem);
+
+ for (mn = mm_notifier_page_sync; mn; mn = mn->next_page_sync)
+ if (mn->ops->invalidate_page_sync)
+ mn->ops->invalidate_page_sync(mn, mn->mm, page);
+
+ up_read(&mm_notifier_page_sync_sem);
+}
+
+/*
+ * Invalidate all remote references before shutdown
+ */
+static inline void mm_notifier_release(struct mm_struct *mm)
+{
+ struct mm_notifier *mn;
+
+ for (mn = rcu_dereference(mm->mm_notifier); mn;
+ mn = rcu_dereference(mn->next))
+ mn->ops->release(mn, mm);
+}
+
+/*
+ * Release resources before freeing mm_struct.
+ */
+static inline void mm_notifier_destroy(struct mm_struct *mm)
+{
+ struct mm_notifier *mn;
+
+ while (mm->mm_notifier) {
+ mn = mm->mm_notifier;
+ mm->mm_notifier = mn->next;
+ if (mn->ops->invalidate_page_sync) {
+ struct mm_notifier *m;
+
+ down_write(&mm_notifier_page_sync_sem);
+
+ if (mm_notifier_page_sync != mn) {
+ for (m = mm_notifier_page_sync; m; m = m->next_page_sync)
+ if (m->next_page_sync == mn)
+ break;
+
+ m->next_page_sync = mn->next_page_sync;
+ } else
+ mm_notifier_page_sync = mn->next_page_sync;
+
+ up_write(&mm_notifier_page_sync_sem);
+ }
+ mn->ops->destroy(mn, mm);
+ }
+}
+#else
+static inline void mm_notifier_invalidate_range(struct mm_struct *mm,
+ unsigned long start, unsigned long end) {}
+static inline void mm_notifier_invalidate_range_sync(struct mm_struct *mm,
+ unsigned long start, unsigned long end) {}
+static inline void mm_notifier_invalidate_page(struct mm_struct *mm,
+ struct page *page, unsigned long address) {}
+static inline void mm_notifier_invalidate_page_sync(struct page *page) {}
+static inline void mm_notifier_release(struct mm_struct *mm) {}
+static inline void mm_notifier_destroy(struct mm_struct *mm) {}
+#endif
+
#endif /* _LINUX_RMAP_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2008-05-16 11:28:50.000000000 -0700
+++ linux-2.6/mm/Kconfig 2008-05-16 16:06:26.000000000 -0700
@@ -205,3 +205,7 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config MM_NOTIFIER
+ def_bool y
+
Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/include/linux/mm_types.h 2008-05-16 16:06:26.000000000 -0700
@@ -244,6 +244,9 @@ struct mm_struct {
struct file *exe_file;
unsigned long num_exe_file_vmas;
#endif
+#ifdef CONFIG_MM_NOTIFIER
+ struct mm_notifier *mm_notifier;
+#endif
};
#endif /* _LINUX_MM_TYPES_H */
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/include/linux/page-flags.h 2008-05-16 16:06:26.000000000 -0700
@@ -93,6 +93,7 @@ enum pageflags {
PG_mappedtodisk, /* Has blocks allocated on-disk */
PG_reclaim, /* To be reclaimed asap */
PG_buddy, /* Page is free, on buddy lists */
+ PG_notifier, /* Call notifier when page is changed/unmapped */
#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
PG_uncached, /* Page has been mapped as uncached */
#endif
@@ -173,6 +174,8 @@ PAGEFLAG(MappedToDisk, mappedtodisk)
PAGEFLAG(Reclaim, reclaim) TESTCLEARFLAG(Reclaim, reclaim)
PAGEFLAG(Readahead, reclaim) /* Reminder to do async read-ahead */
+PAGEFLAG(Notifier, notifier);
+
#ifdef CONFIG_HIGHMEM
/*
* Must use a macro here due to header dependency issues. page_zone() is not
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c 2008-05-16 11:28:49.000000000 -0700
+++ linux-2.6/fs/hugetlbfs/inode.c 2008-05-16 16:06:55.000000000 -0700
@@ -442,6 +442,8 @@ hugetlb_vmtruncate_list(struct prio_tree
__unmap_hugepage_range(vma,
vma->vm_start + v_offset, vma->vm_end);
+ mm_notifier_invalidate_range_sync(vma->vm_mm,
+ vma->vm_start + v_offset, vma->vm_end);
}
}
From: Marcelo T. <mto...@re...> - 2008-05-16 21:36:41
Hi Anthony,
We're experiencing qemu segfaults when using VNC over high latency
links.
(gdb) bt
#0 0x0000003a8ec838d3 in memcpy () from /lib64/libc.so.6
#1 0x00000000004b9aff in vnc_update_client (opaque=0x3514140) at vnc.c:223
#2 0x000000000040822d in qemu_run_timers (ptimer_head=0x8e9500, current_time=5942450)
at /root/marcelo/kvm-userspace/qemu/vl.c:1112
#3 0x0000000000413208 in main_loop_wait (timeout=1000)
at /root/marcelo/kvm-userspace/qemu/vl.c:7482
#4 0x000000000060de86 in kvm_main_loop ()
at /root/marcelo/kvm-userspace/qemu/qemu-kvm.c:524
#5 0x0000000000413259 in main_loop () at /root/marcelo/kvm-userspace/qemu/vl.c:7506
#6 0x0000000000415d3a in main (argc=21, argv=0x7fff00907dd8)
at /root/marcelo/kvm-userspace/qemu/vl.c:9369
The problem is that sometimes vs->width and vs->height are not updated to
reflect the size allocated for the display memory. If they are larger
than what's allocated, it segfaults:
(gdb) p vs->old_data_h
$22 = 400
(gdb) p vs->old_data_w
$23 = 720
(gdb) p vs->old_data_depth
$24 = 4
(gdb) p vs->height
$20 = 480
(gdb) p vs->width
$21 = 640
(gdb) p vs->depth
$25 = 4
old_data_h, old_data_w and old_data_depth have been saved from the last
vnc_dpy_resize run. The code relies on the client's "set_encodings"
processing to happen _before_ the vnc_update_client() timer triggers,
which might not always be the case.
I have no clue about the correctness of the following, though. What do you
say?
diff --git a/qemu/vnc.c b/qemu/vnc.c
index f6ec5cf..5540677 100644
--- a/qemu/vnc.c
+++ b/qemu/vnc.c
@@ -302,7 +302,7 @@ static void vnc_dpy_resize(DisplayState *ds, int w, int h)
ds->width = w;
ds->height = h;
ds->linesize = w * vs->depth;
- if (vs->csock != -1 && vs->has_resize && size_changed) {
+ if (vs->csock != -1 && size_changed) {
vnc_write_u8(vs, 0); /* msg id */
vnc_write_u8(vs, 0);
vnc_write_u16(vs, 1); /* number of rects */
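For what it's worth, a purely illustrative safeguard (not part of the patch
above, and sketched against the old_data_w/old_data_h/old_data_depth fields
visible in the gdb dump) would be for vnc_update_client() to refuse to touch
old_data while the client geometry is out of sync with the last resize:

/* Hypothetical guard, not Marcelo's fix: true only when the geometry used
 * by vnc_update_client() matches what was allocated at the last
 * vnc_dpy_resize(), so the memcpy() can never run past the buffer. */
static int vnc_geometry_in_sync(VncState *vs)
{
    return vs->width  == vs->old_data_w &&
           vs->height == vs->old_data_h &&
           vs->depth  == vs->old_data_depth;
}

vnc_update_client() could check this before copying and simply re-arm its
timer when the check fails, waiting for the client to catch up with the
resize.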
From: Paul E. M. <pa...@li...> - 2008-05-16 19:10:35
On Fri, May 09, 2008 at 09:32:30PM +0200, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <an...@qu...>
The hlist_del_init_rcu() primitive looks good.
The rest of the RCU code looks fine assuming that "mn->ops->release()"
either does call_rcu() to defer actual removal, or that the actual
removal is deferred until after mmu_notifier_release() returns.
Acked-by: Paul E. McKenney <pa...@li...>
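For illustration only (hypothetical driver names; the mmu_notifier type is the
one from the patch quoted below), one way a driver could satisfy that
assumption is to defer the actual free from its ->release handler with
call_rcu():

struct my_mn {
	struct mmu_notifier mn;
	struct rcu_head rcu;
};

static void my_mn_free(struct rcu_head *head)
{
	kfree(container_of(head, struct my_mn, rcu));
}

static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_mn *my = container_of(mn, struct my_mn, mn);

	/* ... tear down all secondary mappings for this mm ... */

	/* Defer reclamation past a grace period so concurrent RCU
	 * traversals of the notifier list never touch freed memory. */
	call_rcu(&my->rcu, my_mn_free);
}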
> With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
> pages. There are secondary MMUs (with secondary sptes and secondary
> tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
> spte in mmu-notifier context, I mean "secondary pte". In GRU case
> there's no actual secondary pte and there's only a secondary tlb
> because the GRU secondary MMU has no knowledge about sptes and every
> secondary tlb miss event in the MMU always generates a page fault that
> has to be resolved by the CPU (this is not the case for KVM, where a
> secondary tlb miss will walk sptes in hardware and it will refill the
> secondary tlb transparently to software if the corresponding spte is
> present). The same way zap_page_range has to invalidate the pte before
> freeing the page, the spte (and secondary tlb) must also be
> invalidated before any page is freed and reused.
>
> Currently we take a page_count pin on every page mapped by sptes, but
> that means the pages can't be swapped whenever they're mapped by any
> spte because they're part of the guest working set. Furthermore a spte
> unmap event can immediately lead to a page to be freed when the pin is
> released (so requiring the same complex and relatively slow tlb_gather
> smp safe logic we have in zap_page_range and that can be avoided
> completely if the spte unmap event doesn't require an unpin of the
> page previously mapped in the secondary MMU).
>
> The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and
> know when the VM is swapping or freeing or doing anything on the
> primary MMU so that the secondary MMU code can drop sptes before the
> pages are freed, avoiding all page pinning and allowing 100% reliable
> swapping of guest physical address space. Furthermore it avoids having the
> code that tears down the mappings of the secondary MMU implement a
> logic like tlb_gather in zap_page_range that would require many IPIs to
> flush other cpu tlbs for each fixed number of sptes unmapped.
>
> To give an example: if what happens on the primary MMU is a protection
> downgrade (from writeable to wrprotect) the secondary MMU mappings
> will be invalidated, and the next secondary-mmu-page-fault will call
> get_user_pages and trigger a do_wp_page through get_user_pages if it
> called get_user_pages with write=1, and it'll re-establish an
> updated spte or secondary-tlb-mapping on the copied page. Or it will
> setup a readonly spte or readonly tlb mapping if it's a guest-read, if
> it calls get_user_pages with write=0. This is just an example.
>
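As a purely illustrative aside (hypothetical driver code, not part of Andrea's
patch, using the 2.6.26-era get_user_pages() signature), the secondary-MMU
fault path described above could look roughly like this:

static int secondary_mmu_fault(struct mm_struct *mm, unsigned long addr,
			       int write)
{
	struct page *page;
	int ret;

	/* Re-resolve the page; a write fault on a COW mapping goes
	 * through do_wp_page() inside get_user_pages(). */
	down_read(&mm->mmap_sem);
	ret = get_user_pages(current, mm, addr, 1, write, 0, &page, NULL);
	up_read(&mm->mmap_sem);
	if (ret != 1)
		return -EFAULT;

	/* ... install the fresh spte/secondary-tlb entry for 'page' ... */

	put_page(page);
	return 0;
}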
> This allows mapping any page pointed to by any pte (and in turn visible in
> the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU,
> or a full MMU with both sptes and secondary-tlb like the
> shadow-pagetable layer with kvm), or a remote DMA in software like
> XPMEM (hence the need to schedule in XPMEM code to send the invalidate
> to the remote node, while there is no need to schedule in kvm/gru as it's
> an immediate event like invalidating a primary-mmu pte).
>
> At least for KVM without this patch it's impossible to swap guests
> reliably. And having this feature and removing the page pin allows
> several other optimizations that simplify life considerably.
>
> Dependencies:
>
> 1) Introduces list_del_init_rcu and documents it (fixes a comment for
> list_del_rcu too)
>
> 2) mm_take_all_locks() to register the mmu notifier when the whole VM
> isn't doing anything with "mm". This allows mmu notifier users to
> keep track if the VM is in the middle of the
> invalidate_range_begin/end critical section with an atomic counter
> increased in range_begin and decreased in range_end. No secondary
> MMU page fault is allowed to map any spte or secondary tlb
> reference, while the VM is in the middle of range_begin/end as any
> page returned by get_user_pages in that critical section could
> later immediately be freed without any further ->invalidate_page
> notification (invalidate_range_begin/end works on ranges and
> ->invalidate_page isn't called immediately before freeing the
> page). To stop all page freeing and pagetable overwrites the
> mmap_sem must be taken in write mode and all other anon_vma/i_mmap
> locks must be taken too.
>
> 3) It'd be a waste to add branches in the VM if nobody could possibly
> run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
> if CONFIG_KVM=m/y. In the current kernel kvm won't yet take
> advantage of mmu notifiers, but this already allows to compile a
> KVM external module against a kernel with mmu notifiers enabled and
> from the next pull from kvm.git we'll start using them. And
> GRU/XPMEM will also be able to continue the development by enabling
> KVM=m in their config, until they submit all GRU/XPMEM GPLv2 code
> to the mainline kernel. Then they can also enable MMU_NOTIFIERS in
> the same way KVM does it (even if KVM=n). This guarantees nobody
> selects MMU_NOTIFIER=y if KVM and GRU and XPMEM are all =n.
>
> The mmu_notifier_register call can fail because mm_take_all_locks may
> be interrupted by a signal and return -EINTR. Because
> mmu_notifier_register is used at driver startup, a failure can be
> gracefully handled. Here is an example of the change applied to kvm to
> register the mmu notifiers. Usually when a driver starts up, other
> allocations are required anyway and -ENOMEM failure paths exist
> already.
>
> struct kvm *kvm_arch_create_vm(void)
> {
> struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
> + int err;
>
> if (!kvm)
> return ERR_PTR(-ENOMEM);
>
> INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>
> + kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
> + err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
> + if (err) {
> + kfree(kvm);
> + return ERR_PTR(err);
> + }
> +
> return kvm;
> }
>
> mmu_notifier_unregister returns void and it's reliable.
>
> Signed-off-by: Andrea Arcangeli <an...@qu...>
> Signed-off-by: Nick Piggin <np...@su...>
> Signed-off-by: Christoph Lameter <cla...@sg...>
> ---
>
> Full patchset is here:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.26-rc1/mmu-notifier-v17
>
> Thanks!
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -21,6 +21,7 @@ config KVM
> tristate "Kernel-based Virtual Machine (KVM) support"
> depends on HAVE_KVM
> select PREEMPT_NOTIFIERS
> + select MMU_NOTIFIER
> select ANON_INODES
> ---help---
> Support hosting fully virtualized guest machines using hardware
> diff --git a/include/linux/list.h b/include/linux/list.h
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -747,7 +747,7 @@ static inline void hlist_del(struct hlis
> * or hlist_del_rcu(), running on this same list.
> * However, it is perfectly legal to run concurrently with
> * the _rcu list-traversal primitives, such as
> - * hlist_for_each_entry().
> + * hlist_for_each_entry_rcu().
> */
> static inline void hlist_del_rcu(struct hlist_node *n)
> {
> @@ -760,6 +760,34 @@ static inline void hlist_del_init(struct
> if (!hlist_unhashed(n)) {
> __hlist_del(n);
> INIT_HLIST_NODE(n);
> + }
> +}
> +
> +/**
> + * hlist_del_init_rcu - deletes entry from hash list with re-initialization
> + * @n: the element to delete from the hash list.
> + *
> + * Note: list_unhashed() on the node returns true after this. It is
> + * useful for RCU based read lockfree traversal if the writer side
> + * must know if the list entry is still hashed or already unhashed.
> + *
> + * In particular, it means that we can not poison the forward pointers
> + * that may still be used for walking the hash list and we can only
> + * zero the pprev pointer so list_unhashed() will return true after
> + * this.
> + *
> + * The caller must take whatever precautions are necessary (such as
> + * holding appropriate locks) to avoid racing with another
> + * list-mutation primitive, such as hlist_add_head_rcu() or
> + * hlist_del_rcu(), running on this same list. However, it is
> + * perfectly legal to run concurrently with the _rcu list-traversal
> + * primitives, such as hlist_for_each_entry_rcu().
> + */
> +static inline void hlist_del_init_rcu(struct hlist_node *n)
> +{
> + if (!hlist_unhashed(n)) {
> + __hlist_del(n);
> + n->pprev = NULL;
> }
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1067,6 +1067,9 @@ extern struct vm_area_struct *copy_vma(s
> unsigned long addr, unsigned long len, pgoff_t pgoff);
> extern void exit_mmap(struct mm_struct *);
>
> +extern int mm_take_all_locks(struct mm_struct *mm);
> +extern void mm_drop_all_locks(struct mm_struct *mm);
> +
> #ifdef CONFIG_PROC_FS
> /* From fs/proc/base.c. callers must _not_ hold the mm's exe_file_lock */
> extern void added_exe_file_vma(struct mm_struct *mm);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -10,6 +10,7 @@
> #include <linux/rbtree.h>
> #include <linux/rwsem.h>
> #include <linux/completion.h>
> +#include <linux/cpumask.h>
> #include <asm/page.h>
> #include <asm/mmu.h>
>
> @@ -19,6 +20,7 @@
> #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
>
> struct address_space;
> +struct mmu_notifier_mm;
>
> #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
> typedef atomic_long_t mm_counter_t;
> @@ -235,6 +237,9 @@ struct mm_struct {
> struct file *exe_file;
> unsigned long num_exe_file_vmas;
> #endif
> +#ifdef CONFIG_MMU_NOTIFIER
> + struct mmu_notifier_mm *mmu_notifier_mm;
> +#endif
> };
>
> #endif /* _LINUX_MM_TYPES_H */
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> new file mode 100644
> --- /dev/null
> +++ b/include/linux/mmu_notifier.h
> @@ -0,0 +1,279 @@
> +#ifndef _LINUX_MMU_NOTIFIER_H
> +#define _LINUX_MMU_NOTIFIER_H
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/mm_types.h>
> +
> +struct mmu_notifier;
> +struct mmu_notifier_ops;
> +
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +/*
> + * The mmu notifier_mm structure is allocated and installed in
> + * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> + * critical section and it's released only when mm_count reaches zero
> + * in mmdrop().
> + */
> +struct mmu_notifier_mm {
> +	/* all mmu notifiers registered in this mm are queued in this list */
> + struct hlist_head list;
> + /* to serialize the list modifications and hlist_unhashed */
> + spinlock_t lock;
> +};
> +
> +struct mmu_notifier_ops {
> + /*
> + * Called either by mmu_notifier_unregister or when the mm is
> + * being destroyed by exit_mmap, always before all pages are
> + * freed. This can run concurrently with other mmu notifier
> + * methods (the ones invoked outside the mm context) and it
> + * should tear down all secondary mmu mappings and freeze the
> + * secondary mmu. If this method isn't implemented you've to
> + * be sure that nothing could possibly write to the pages
> + * through the secondary mmu by the time the last thread with
> + * tsk->mm == mm exits.
> + *
> + * As side note: the pages freed after ->release returns could
> + * be immediately reallocated by the gart at an alias physical
> + * address with a different cache model, so if ->release isn't
> + * implemented because all _software_ driven memory accesses
> + * through the secondary mmu are terminated by the time the
> + * last thread of this mm quits, you've also to be sure that
> + * speculative _hardware_ operations can't allocate dirty
> + * cachelines in the cpu that could not be snooped and made
> + * coherent with the other read and write operations happening
> + * through the gart alias address, so leading to memory
> + * corruption.
> + */
> + void (*release)(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +
> + /*
> + * clear_flush_young is called after the VM is
> + * test-and-clearing the young/accessed bitflag in the
> + * pte. This way the VM will provide proper aging to the
> + * accesses to the page through the secondary MMUs and not
> + * only to the ones through the Linux pte.
> + */
> + int (*clear_flush_young)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long address);
> +
> + /*
> + * Before this is invoked any secondary MMU is still ok to
> + * read/write to the page previously pointed to by the Linux
> + * pte because the page hasn't been freed yet and it won't be
> + * freed until this returns. If required set_page_dirty has to
> + * be called internally to this method.
> + */
> + void (*invalidate_page)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long address);
> +
> + /*
> + * invalidate_range_start() and invalidate_range_end() must be
> + * paired and are called only when the mmap_sem and/or the
> + * locks protecting the reverse maps are held. The subsystem
> + * must guarantee that no additional references are taken to
> + * the pages in the range established between the call to
> + * invalidate_range_start() and the matching call to
> + * invalidate_range_end().
> + *
> + * Invalidation of multiple concurrent ranges may be
> + * optionally permitted by the driver. Either way the
> + * establishment of sptes is forbidden in the range passed to
> + * invalidate_range_begin/end for the whole duration of the
> + * invalidate_range_begin/end critical section.
> + *
> + * invalidate_range_start() is called when all pages in the
> + * range are still mapped and have at least a refcount of one.
> + *
> + * invalidate_range_end() is called when all pages in the
> + * range have been unmapped and the pages have been freed by
> + * the VM.
> + *
> + * The VM will remove the page table entries and potentially
> + * the page between invalidate_range_start() and
> + * invalidate_range_end(). If the page must not be freed
> + * because of pending I/O or other circumstances then the
> + * invalidate_range_start() callback (or the initial mapping
> + * by the driver) must make sure that the refcount is kept
> + * elevated.
> + *
> + * If the driver increases the refcount when the pages are
> + * initially mapped into an address space then either
> + * invalidate_range_start() or invalidate_range_end() may
> + * decrease the refcount. If the refcount is decreased on
> + * invalidate_range_start() then the VM can free pages as page
> + * table entries are removed. If the refcount is only
> +	 * dropped on invalidate_range_end() then the driver itself
> + * will drop the last refcount but it must take care to flush
> + * any secondary tlb before doing the final free on the
> + * page. Pages will no longer be referenced by the linux
> + * address space but may still be referenced by sptes until
> + * the last refcount is dropped.
> + */
> + void (*invalidate_range_start)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end);
> + void (*invalidate_range_end)(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + unsigned long start, unsigned long end);
> +};
> +
> +/*
> + * The notifier chains are protected by mmap_sem and/or the reverse map
> + * semaphores. Notifier chains are only changed when all reverse maps and
> + * the mmap_sem locks are taken.
> + *
> + * Therefore notifier chains can only be traversed when either
> + *
> + * 1. mmap_sem is held.
> + * 2. One of the reverse map locks is held (i_mmap_lock or anon_vma->lock).
> + * 3. No other concurrent thread can access the list (release)
> + */
> +struct mmu_notifier {
> + struct hlist_node hlist;
> + const struct mmu_notifier_ops *ops;
> +};
> +
> +static inline int mm_has_notifiers(struct mm_struct *mm)
> +{
> + return unlikely(mm->mmu_notifier_mm);
> +}
> +
> +extern int mmu_notifier_register(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +extern int __mmu_notifier_register(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +extern void mmu_notifier_unregister(struct mmu_notifier *mn,
> + struct mm_struct *mm);
> +extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
> +extern void __mmu_notifier_release(struct mm_struct *mm);
> +extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> + unsigned long address);
> +extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> + unsigned long address);
> +extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> + unsigned long start, unsigned long end);
> +extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> + unsigned long start, unsigned long end);
> +
> +static inline void mmu_notifier_release(struct mm_struct *mm)
> +{
> + if (mm_has_notifiers(mm))
> + __mmu_notifier_release(mm);
> +}
> +
> +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
> + unsigned long address)
> +{
> + if (mm_has_notifiers(mm))
> + return __mmu_notifier_clear_flush_young(mm, address);
> + return 0;
> +}
> +
> +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> + unsigned long address)
> +{
> + if (mm_has_notifiers(mm))
> + __mmu_notifier_invalidate_page(mm, address);
> +}
> +
> +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> + unsigned long start, unsigned long end)
> +{
> + if (mm_has_notifiers(mm))
> + __mmu_notifier_invalidate_range_start(mm, start, end);
> +}
> +
> +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> + unsigned long start, unsigned long end)
> +{
> + if (mm_has_notifiers(mm))
> + __mmu_notifier_invalidate_range_end(mm, start, end);
> +}
> +
> +static inline void mmu_notifier_mm_init(struct mm_struct *mm)
> +{
> + mm->mmu_notifier_mm = NULL;
> +}
> +
> +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
> +{
> + if (mm_has_notifiers(mm))
> + __mmu_notifier_mm_destroy(mm);
> +}
> +
> +/*
> + * These two macros will someday replace ptep_clear_flush.
> + * ptep_clear_flush is implemented as a macro itself, so this also is
> + * implemented as a macro until ptep_clear_flush is converted to an
> + * inline function, to diminish the risk of compilation failure. The
> + * invalidate_page method over time can be moved outside the PT lock
> + * and these two macros can be later removed.
> + */
> +#define ptep_clear_flush_notify(__vma, __address, __ptep) \
> +({ \
> + pte_t __pte; \
> + struct vm_area_struct *___vma = __vma; \
> + unsigned long ___address = __address; \
> + __pte = ptep_clear_flush(___vma, ___address, __ptep); \
> + mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \
> + __pte; \
> +})
> +
> +#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
> +({ \
> + int __young; \
> + struct vm_area_struct *___vma = __vma; \
> + unsigned long ___address = __address; \
> + __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
> + __young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
> + ___address); \
> + __young; \
> +})
> +
> +#else /* CONFIG_MMU_NOTIFIER */
> +
> +static inline void mmu_notifier_release(struct mm_struct *mm)
> +{
> +}
> +
> +static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
> + unsigned long address)
> +{
> + return 0;
> +}
> +
> +static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
> + unsigned long address)
> +{
> +}
> +
> +static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> + unsigned long start, unsigned long end)
> +{
> +}
> +
> +static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> + unsigned long start, unsigned long end)
> +{
> +}
> +
> +static inline void mmu_notifier_mm_init(struct mm_struct *mm)
> +{
> +}
> +
> +static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
> +{
> +}
> +
> +#define ptep_clear_flush_young_notify ptep_clear_flush_young
> +#define ptep_clear_flush_notify ptep_clear_flush
> +
> +#endif /* CONFIG_MMU_NOTIFIER */
> +
> +#endif /* _LINUX_MMU_NOTIFIER_H */
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -19,6 +19,7 @@
> */
> #define AS_EIO (__GFP_BITS_SHIFT + 0) /* IO error on async write */
> #define AS_ENOSPC (__GFP_BITS_SHIFT + 1) /* ENOSPC on async write */
> +#define AS_MM_ALL_LOCKS (__GFP_BITS_SHIFT + 2) /* under mm_take_all_locks() */
>
> static inline void mapping_set_error(struct address_space *mapping, int error)
> {
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -26,6 +26,14 @@
> */
> struct anon_vma {
> spinlock_t lock; /* Serialize access to vma list */
> + /*
> + * NOTE: the LSB of the head.next is set by
> + * mm_take_all_locks() _after_ taking the above lock. So the
> + * head must only be read/written after taking the above lock
> + * to be sure to see a valid next pointer. The LSB bit itself
> + * is serialized by a system wide lock only visible to
> + * mm_take_all_locks() (mm_all_locks_mutex).
> + */
> struct list_head head; /* List of private "related" vmas */
> };
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -54,6 +54,7 @@
> #include <linux/tty.h>
> #include <linux/proc_fs.h>
> #include <linux/blkdev.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/pgtable.h>
> #include <asm/pgalloc.h>
> @@ -386,6 +387,7 @@ static struct mm_struct * mm_init(struct
>
> if (likely(!mm_alloc_pgd(mm))) {
> mm->def_flags = 0;
> + mmu_notifier_mm_init(mm);
> return mm;
> }
>
> @@ -418,6 +420,7 @@ void __mmdrop(struct mm_struct *mm)
> BUG_ON(mm == &init_mm);
> mm_free_pgd(mm);
> destroy_context(mm);
> + mmu_notifier_mm_destroy(mm);
> free_mm(mm);
> }
> EXPORT_SYMBOL_GPL(__mmdrop);
> diff --git a/mm/Kconfig b/mm/Kconfig
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -205,3 +205,6 @@ config VIRT_TO_BUS
> config VIRT_TO_BUS
> def_bool y
> depends on !ARCH_NO_VIRT_TO_BUS
> +
> +config MMU_NOTIFIER
> + bool
> diff --git a/mm/Makefile b/mm/Makefile
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,4 +33,5 @@ obj-$(CONFIG_SMP) += allocpercpu.o
> obj-$(CONFIG_SMP) += allocpercpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
> +obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
>
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -188,7 +188,7 @@ __xip_unmap (struct address_space * mapp
> if (pte) {
> /* Nuke the page table entry. */
> flush_cache_page(vma, address, pte_pfn(*pte));
> - pteval = ptep_clear_flush(vma, address, pte);
> + pteval = ptep_clear_flush_notify(vma, address, pte);
> page_remove_rmap(page, vma);
> dec_mm_counter(mm, file_rss);
> BUG_ON(pte_dirty(pteval));
> diff --git a/mm/fremap.c b/mm/fremap.c
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -15,6 +15,7 @@
> #include <linux/rmap.h>
> #include <linux/module.h>
> #include <linux/syscalls.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/mmu_context.h>
> #include <asm/cacheflush.h>
> @@ -214,7 +215,9 @@ asmlinkage long sys_remap_file_pages(uns
> spin_unlock(&mapping->i_mmap_lock);
> }
>
> + mmu_notifier_invalidate_range_start(mm, start, start + size);
> err = populate_range(mm, vma, start, size, pgoff);
> + mmu_notifier_invalidate_range_end(mm, start, start + size);
> if (!err && !(flags & MAP_NONBLOCK)) {
> if (unlikely(has_write_lock)) {
> downgrade_write(&mm->mmap_sem);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -14,6 +14,7 @@
> #include <linux/mempolicy.h>
> #include <linux/cpuset.h>
> #include <linux/mutex.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/page.h>
> #include <asm/pgtable.h>
> @@ -823,6 +824,7 @@ void __unmap_hugepage_range(struct vm_ar
> BUG_ON(start & ~HPAGE_MASK);
> BUG_ON(end & ~HPAGE_MASK);
>
> + mmu_notifier_invalidate_range_start(mm, start, end);
> spin_lock(&mm->page_table_lock);
> for (address = start; address < end; address += HPAGE_SIZE) {
> ptep = huge_pte_offset(mm, address);
> @@ -843,6 +845,7 @@ void __unmap_hugepage_range(struct vm_ar
> }
> spin_unlock(&mm->page_table_lock);
> flush_tlb_range(vma, start, end);
> + mmu_notifier_invalidate_range_end(mm, start, end);
> list_for_each_entry_safe(page, tmp, &page_list, lru) {
> list_del(&page->lru);
> put_page(page);
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -51,6 +51,7 @@
> #include <linux/init.h>
> #include <linux/writeback.h>
> #include <linux/memcontrol.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/pgalloc.h>
> #include <asm/uaccess.h>
> @@ -632,6 +633,7 @@ int copy_page_range(struct mm_struct *ds
> unsigned long next;
> unsigned long addr = vma->vm_start;
> unsigned long end = vma->vm_end;
> + int ret;
>
> /*
> * Don't copy ptes where a page fault will fill them correctly.
> @@ -647,17 +649,33 @@ int copy_page_range(struct mm_struct *ds
> if (is_vm_hugetlb_page(vma))
> return copy_hugetlb_page_range(dst_mm, src_mm, vma);
>
> + /*
> + * We need to invalidate the secondary MMU mappings only when
> + * there could be a permission downgrade on the ptes of the
> + * parent mm. And a permission downgrade will only happen if
> + * is_cow_mapping() returns true.
> + */
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier_invalidate_range_start(src_mm, addr, end);
> +
> + ret = 0;
> dst_pgd = pgd_offset(dst_mm, addr);
> src_pgd = pgd_offset(src_mm, addr);
> do {
> next = pgd_addr_end(addr, end);
> if (pgd_none_or_clear_bad(src_pgd))
> continue;
> - if (copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
> - vma, addr, next))
> - return -ENOMEM;
> + if (unlikely(copy_pud_range(dst_mm, src_mm, dst_pgd, src_pgd,
> + vma, addr, next))) {
> + ret = -ENOMEM;
> + break;
> + }
> } while (dst_pgd++, src_pgd++, addr = next, addr != end);
> - return 0;
> +
> + if (is_cow_mapping(vma->vm_flags))
> + mmu_notifier_invalidate_range_end(src_mm,
> + vma->vm_start, end);
> + return ret;
> }
>
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
> @@ -861,7 +879,9 @@ unsigned long unmap_vmas(struct mmu_gath
> unsigned long start = start_addr;
> spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
> int fullmm = (*tlbp)->fullmm;
> + struct mm_struct *mm = vma->vm_mm;
>
> + mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
> for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
> unsigned long end;
>
> @@ -912,6 +932,7 @@ unsigned long unmap_vmas(struct mmu_gath
> }
> }
> out:
> + mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
> return start; /* which is now the end (or restart) address */
> }
>
> @@ -1544,10 +1565,11 @@ int apply_to_page_range(struct mm_struct
> {
> pgd_t *pgd;
> unsigned long next;
> - unsigned long end = addr + size;
> + unsigned long start = addr, end = addr + size;
> int err;
>
> BUG_ON(addr >= end);
> + mmu_notifier_invalidate_range_start(mm, start, end);
> pgd = pgd_offset(mm, addr);
> do {
> next = pgd_addr_end(addr, end);
> @@ -1555,6 +1577,7 @@ int apply_to_page_range(struct mm_struct
> if (err)
> break;
> } while (pgd++, addr = next, addr != end);
> + mmu_notifier_invalidate_range_end(mm, start, end);
> return err;
> }
> EXPORT_SYMBOL_GPL(apply_to_page_range);
> @@ -1756,7 +1779,7 @@ gotten:
> * seen in the presence of one thread doing SMC and another
> * thread doing COW.
> */
> - ptep_clear_flush(vma, address, page_table);
> + ptep_clear_flush_notify(vma, address, page_table);
> set_pte_at(mm, address, page_table, entry);
> update_mmu_cache(vma, address, entry);
> lru_cache_add_active(new_page);
> diff --git a/mm/mmap.c b/mm/mmap.c
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -26,6 +26,7 @@
> #include <linux/mount.h>
> #include <linux/mempolicy.h>
> #include <linux/rmap.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/uaccess.h>
> #include <asm/cacheflush.h>
> @@ -2048,6 +2049,7 @@ void exit_mmap(struct mm_struct *mm)
>
> /* mm's last user has gone, and its about to be pulled down */
> arch_exit_mmap(mm);
> + mmu_notifier_release(mm);
>
> lru_add_drain();
> flush_cache_mm(mm);
> @@ -2255,3 +2257,152 @@ int install_special_mapping(struct mm_st
>
> return 0;
> }
> +
> +static DEFINE_MUTEX(mm_all_locks_mutex);
> +
> +/*
> + * This operation locks against the VM for all pte/vma/mm related
> + * operations that could ever happen on a certain mm. This includes
> + * vmtruncate, try_to_unmap, and all page faults.
> + *
> + * The caller must take the mmap_sem in write mode before calling
> + * mm_take_all_locks(). The caller isn't allowed to release the
> + * mmap_sem until mm_drop_all_locks() returns.
> + *
> + * mmap_sem in write mode is required in order to block all operations
> + * that could modify pagetables and free pages without need of
> + * altering the vma layout (for example populate_range() with
> + * nonlinear vmas). It's also needed in write mode to avoid new
> + * anon_vmas to be associated with existing vmas.
> + *
> + * A single task can't take more than one mm_take_all_locks() in a row
> + * or it would deadlock.
> + *
> + * The LSB in anon_vma->head.next and the AS_MM_ALL_LOCKS bitflag in
> + * mapping->flags avoid to take the same lock twice, if more than one
> + * vma in this mm is backed by the same anon_vma or address_space.
> + *
> + * We can take all the locks in random order because the VM code
> + * taking i_mmap_lock or anon_vma->lock outside the mmap_sem never
> + * takes more than one of them in a row. Secondly we're protected
> + * against a concurrent mm_take_all_locks() by the mm_all_locks_mutex.
> + *
> + * mm_take_all_locks() and mm_drop_all_locks are expensive operations
> + * that may have to take thousand of locks.
> + *
> + * mm_take_all_locks() can fail if it's interrupted by signals.
> + */
> +int mm_take_all_locks(struct mm_struct *mm)
> +{
> + struct vm_area_struct *vma;
> + int ret = -EINTR;
> +
> + BUG_ON(down_read_trylock(&mm->mmap_sem));
> +
> + mutex_lock(&mm_all_locks_mutex);
> +
> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> + struct file *filp;
> + if (signal_pending(current))
> + goto out_unlock;
> + if (vma->anon_vma && !test_bit(0, (unsigned long *)
> + &vma->anon_vma->head.next)) {
> + /*
> + * The LSB of head.next can't change from
> + * under us because we hold the
> + * global_mm_spinlock.
> + */
> + spin_lock(&vma->anon_vma->lock);
> + /*
> + * We can safely modify head.next after taking
> + * the anon_vma->lock. If some other vma in
> + * this mm shares the same anon_vma we won't
> + * take it again.
> + *
> + * No need of atomic instructions here,
> + * head.next can't change from under us thanks
> + * to the anon_vma->lock.
> + */
> + if (__test_and_set_bit(0, (unsigned long *)
> + &vma->anon_vma->head.next))
> + BUG();
> + }
> +
> + filp = vma->vm_file;
> + if (filp && filp->f_mapping &&
> + !test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) {
> + /*
> + * AS_MM_ALL_LOCKS can't change from under us
> + * because we hold the global_mm_spinlock.
> + *
> + * Operations on ->flags have to be atomic
> + * because even if AS_MM_ALL_LOCKS is stable
> + * thanks to the global_mm_spinlock, there may
> + * be other cpus changing other bitflags in
> + * parallel to us.
> + */
> + if (test_and_set_bit(AS_MM_ALL_LOCKS,
> + &filp->f_mapping->flags))
> + BUG();
> + spin_lock(&filp->f_mapping->i_mmap_lock);
> + }
> + }
> + ret = 0;
> +
> +out_unlock:
> + if (ret)
> + mm_drop_all_locks(mm);
> +
> + return ret;
> +}
> +
> +/*
> + * The mmap_sem cannot be released by the caller until
> + * mm_drop_all_locks() returns.
> + */
> +void mm_drop_all_locks(struct mm_struct *mm)
> +{
> + struct vm_area_struct *vma;
> +
> + BUG_ON(down_read_trylock(&mm->mmap_sem));
> + BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));
> +
> + for (vma = mm->mmap; vma; vma = vma->vm_next) {
> + struct file *filp;
> + if (vma->anon_vma &&
> + test_bit(0, (unsigned long *)
> + &vma->anon_vma->head.next)) {
> + /*
> + * The LSB of head.next can't change to 0 from
> + * under us because we hold the
> + * global_mm_spinlock.
> + *
> + * We must however clear the bitflag before
> + * unlocking the vma so the users using the
> + * anon_vma->head will never see our bitflag.
> + *
> + * No need of atomic instructions here,
> + * head.next can't change from under us until
> + * we release the anon_vma->lock.
> + */
> + if (!__test_and_clear_bit(0, (unsigned long *)
> + &vma->anon_vma->head.next))
> + BUG();
> + spin_unlock(&vma->anon_vma->lock);
> + }
> + filp = vma->vm_file;
> + if (filp && filp->f_mapping &&
> + test_bit(AS_MM_ALL_LOCKS, &filp->f_mapping->flags)) {
> + /*
> + * AS_MM_ALL_LOCKS can't change to 0 from under us
> + * because we hold the global_mm_spinlock.
> + */
> + spin_unlock(&filp->f_mapping->i_mmap_lock);
> + if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
> + &filp->f_mapping->flags))
> + BUG();
> + }
> + }
> +
> + mutex_unlock(&mm_all_locks_mutex);
> +}
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/mmu_notifier.c
> @@ -0,0 +1,276 @@
> +/*
> + * linux/mm/mmu_notifier.c
> + *
> + * Copyright (C) 2008 Qumranet, Inc.
> + * Copyright (C) 2008 SGI
> + * Christoph Lameter <cla...@sg...>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2. See
> + * the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mmu_notifier.h>
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/err.h>
> +#include <linux/rcupdate.h>
> +#include <linux/sched.h>
> +
> +/*
> + * This function can't run concurrently against mmu_notifier_register
> + * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> + * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
> + * in parallel despite there being no task using this mm any more,
> + * through the vmas outside of the exit_mmap context, such as with
> + * vmtruncate. This serializes against mmu_notifier_unregister with
> + * the mmu_notifier_mm->lock in addition to RCU and it serializes
> + * against the other mmu notifiers with RCU. struct mmu_notifier_mm
> + * can't go away from under us as exit_mmap holds an mm_count pin
> + * itself.
> + */
> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> + struct mmu_notifier *mn;
> +
> + spin_lock(&mm->mmu_notifier_mm->lock);
> + while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> + mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> + struct mmu_notifier,
> + hlist);
> + /*
> + * We arrived before mmu_notifier_unregister so
> + * mmu_notifier_unregister will do nothing other than
> + * to wait ->release to finish and
> + * mmu_notifier_unregister to return.
> + */
> + hlist_del_init_rcu(&mn->hlist);
> + /*
> + * RCU here will block mmu_notifier_unregister until
> + * ->release returns.
> + */
> + rcu_read_lock();
> + spin_unlock(&mm->mmu_notifier_mm->lock);
> + /*
> + * if ->release runs before mmu_notifier_unregister it
> + * must be handled as it's the only way for the driver
> + * to flush all existing sptes and stop the driver
> + * from establishing any more sptes before all the
> + * pages in the mm are freed.
> + */
> + if (mn->ops->release)
> + mn->ops->release(mn, mm);
> + rcu_read_unlock();
> + spin_lock(&mm->mmu_notifier_mm->lock);
> + }
> + spin_unlock(&mm->mmu_notifier_mm->lock);
> +
> + /*
> + * synchronize_rcu here prevents mmu_notifier_release to
> + * return to exit_mmap (which would proceed freeing all pages
> + * in the mm) until the ->release method returns, if it was
> + * invoked by mmu_notifier_unregister.
> + *
> + * The mmu_notifier_mm can't go away from under us because one
> + * mm_count is hold by exit_mmap.
> + */
> + synchronize_rcu();
> +}
> +
> +/*
> + * If no young bitflag is supported by the hardware, ->clear_flush_young can
> + * unmap the address and return 1 or 0 depending if the mapping previously
> + * existed or not.
> + */
> +int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
> + unsigned long address)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n;
> + int young = 0;
> +
> + rcu_read_lock();
> + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
> + if (mn->ops->clear_flush_young)
> + young |= mn->ops->clear_flush_young(mn, mm, address);
> + }
> + rcu_read_unlock();
> +
> + return young;
> +}
> +
> +void __mmu_notifier_invalidate_page(struct mm_struct *mm,
> + unsigned long address)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n;
> +
> + rcu_read_lock();
> + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
> + if (mn->ops->invalidate_page)
> + mn->ops->invalidate_page(mn, mm, address);
> + }
> + rcu_read_unlock();
> +}
> +
> +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> + unsigned long start, unsigned long end)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n;
> +
> + rcu_read_lock();
> + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
> + if (mn->ops->invalidate_range_start)
> + mn->ops->invalidate_range_start(mn, mm, start, end);
> + }
> + rcu_read_unlock();
> +}
> +
> +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> + unsigned long start, unsigned long end)
> +{
> + struct mmu_notifier *mn;
> + struct hlist_node *n;
> +
> + rcu_read_lock();
> + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
> + if (mn->ops->invalidate_range_end)
> + mn->ops->invalidate_range_end(mn, mm, start, end);
> + }
> + rcu_read_unlock();
> +}
> +
> +static int do_mmu_notifier_register(struct mmu_notifier *mn,
> + struct mm_struct *mm,
> + int take_mmap_sem)
> +{
> + struct mmu_notifier_mm * mmu_notifier_mm;
> + int ret;
> +
> + BUG_ON(atomic_read(&mm->mm_users) <= 0);
> +
> + ret = -ENOMEM;
> + mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
> + if (unlikely(!mmu_notifier_mm))
> + goto out;
> +
> + if (take_mmap_sem)
> + down_write(&mm->mmap_sem);
> + ret = mm_take_all_locks(mm);
> + if (unlikely(ret))
> + goto out_cleanup;
> +
> + if (!mm_has_notifiers(mm)) {
> + INIT_HLIST_HEAD(&mmu_notifier_mm->list);
> + spin_lock_init(&mmu_notifier_mm->lock);
> + mm->mmu_notifier_mm = mmu_notifier_mm;
> + mmu_notifier_mm = NULL;
> + }
> + atomic_inc(&mm->mm_count);
> +
> + /*
> + * Serialize the update against mmu_notifier_unregister. A
> + * side note: mmu_notifier_release can't run concurrently with
> + * us because we hold the mm_users pin (either implicitly as
> + * current->mm or explicitly with get_task_mm() or similar).
> + * We can't race against any other mmu notifier method either
> + * thanks to mm_take_all_locks().
> + */
> + spin_lock(&mm->mmu_notifier_mm->lock);
> + hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
> + spin_unlock(&mm->mmu_notifier_mm->lock);
> +
> + mm_drop_all_locks(mm);
> +out_cleanup:
> + if (take_mmap_sem)
> + up_write(&mm->mmap_sem);
> + /* kfree() does nothing if mmu_notifier_mm is NULL */
> + kfree(mmu_notifier_mm);
> +out:
> + BUG_ON(atomic_read(&mm->mm_users) <= 0);
> + return ret;
> +}
> +
> +/*
> + * Must not hold mmap_sem nor any other VM related lock when calling
> + * this registration function. Must also ensure mm_users can't go down
> + * to zero while this runs to avoid races with mmu_notifier_release,
> + * so mm has to be current->mm or the mm should be pinned safely such
> + * as with get_task_mm(). If the mm is not current->mm, the mm_users
> + * pin should be released by calling mmput after mmu_notifier_register
> + * returns. mmu_notifier_unregister must be always called to
> + * unregister the notifier. mm_count is automatically pinned to allow
> + * mmu_notifier_unregister to safely run at any time later, before or
> + * after exit_mmap. ->release will always be called before exit_mmap
> + * frees the pages.
> + */
> +int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + return do_mmu_notifier_register(mn, mm, 1);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_register);
> +
> +/*
> + * Same as mmu_notifier_register but here the caller must hold the
> + * mmap_sem in write mode.
> + */
> +int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + return do_mmu_notifier_register(mn, mm, 0);
> +}
> +EXPORT_SYMBOL_GPL(__mmu_notifier_register);
> +
> +/* this is called after the last mmu_notifier_unregister() returned */
> +void __mmu_notifier_mm_destroy(struct mm_struct *mm)
> +{
> + BUG_ON(!hlist_empty(&mm->mmu_notifier_mm->list));
> + kfree(mm->mmu_notifier_mm);
> + mm->mmu_notifier_mm = LIST_POISON1; /* debug */
> +}
> +
> +/*
> + * This releases the mm_count pin automatically and frees the mm
> + * structure if it was the last user of it. It serializes against
> + * running mmu notifiers with RCU and against mmu_notifier_unregister
> + * with the unregister lock + RCU. All sptes must be dropped before
> + * calling mmu_notifier_unregister. ->release or any other notifier
> + * method may be invoked concurrently with mmu_notifier_unregister,
> + * and only after mmu_notifier_unregister returned we're guaranteed
> + * that ->release or any other method can't run anymore.
> + */
> +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> +{
> + BUG_ON(atomic_read(&mm->mm_count) <= 0);
> +
> + spin_lock(&mm->mmu_notifier_mm->lock);
> + if (!hlist_unhashed(&mn->hlist)) {
> + hlist_del_rcu(&mn->hlist);
> +
> + /*
> + * RCU here will force exit_mmap to wait ->release to finish
> + * before freeing the pages.
> + */
> + rcu_read_lock();
> + spin_unlock(&mm->mmu_notifier_mm->lock);
> + /*
> + * exit_mmap will block in mmu_notifier_release to
> + * guarantee ->release is called before freeing the
> + * pages.
> + */
> + if (mn->ops->release)
> + mn->ops->release(mn, mm);
> + rcu_read_unlock();
> + } else
> + spin_unlock(&mm->mmu_notifier_mm->lock);
> +
> + /*
> + * Wait any running method to finish, of course including
> + * ->release if it was run by mmu_notifier_relase instead of us.
> + */
> + synchronize_rcu();
> +
> + BUG_ON(atomic_read(&mm->mm_count) <= 0);
> +
> + mmdrop(mm);
> +}
> +EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -21,6 +21,7 @@
> #include <linux/syscalls.h>
> #include <linux/swap.h>
> #include <linux/swapops.h>
> +#include <linux/mmu_notifier.h>
> #include <asm/uaccess.h>
> #include <asm/pgtable.h>
> #include <asm/cacheflush.h>
> @@ -198,10 +199,12 @@ success:
> dirty_accountable = 1;
> }
>
> + mmu_notifier_invalidate_range_start(mm, start, end);
> if (is_vm_hugetlb_page(vma))
> hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
> else
> change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
> + mmu_notifier_invalidate_range_end(mm, start, end);
> vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
> vm_stat_account(mm, newflags, vma->vm_file, nrpages);
> return 0;
> diff --git a/mm/mremap.c b/mm/mremap.c
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -18,6 +18,7 @@
> #include <linux/highmem.h>
> #include <linux/security.h>
> #include <linux/syscalls.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/uaccess.h>
> #include <asm/cacheflush.h>
> @@ -74,7 +75,11 @@ static void move_ptes(struct vm_area_str
> struct mm_struct *mm = vma->vm_mm;
> pte_t *old_pte, *new_pte, pte;
> spinlock_t *old_ptl, *new_ptl;
> + unsigned long old_start;
>
> + old_start = old_addr;
> + mmu_notifier_invalidate_range_start(vma->vm_mm,
> + old_start, old_end);
> if (vma->vm_file) {
> /*
> * Subtle point from Rajesh Venkatasubramanian: before
> @@ -116,6 +121,7 @@ static void move_ptes(struct vm_area_str
> pte_unmap_unlock(old_pte - 1, old_ptl);
> if (mapping)
> spin_unlock(&mapping->i_mmap_lock);
> + mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
> }
>
> #define LATENCY_LIMIT (64 * PAGE_SIZE)
> diff --git a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -49,6 +49,7 @@
> #include <linux/module.h>
> #include <linux/kallsyms.h>
> #include <linux/memcontrol.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/tlbflush.h>
>
> @@ -287,7 +288,7 @@ static int page_referenced_one(struct pa
> if (vma->vm_flags & VM_LOCKED) {
> referenced++;
> *mapcount = 1; /* break early from loop */
> - } else if (ptep_clear_flush_young(vma, address, pte))
> + } else if (ptep_clear_flush_young_notify(vma, address, pte))
> referenced++;
>
> /* Pretend the page is referenced if the task has the
> @@ -457,7 +458,7 @@ static int page_mkclean_one(struct page
> pte_t entry;
>
> flush_cache_page(vma, address, pte_pfn(*pte));
> - entry = ptep_clear_flush(vma, address, pte);
> + entry = ptep_clear_flush_notify(vma, address, pte);
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
> set_pte_at(mm, address, pte, entry);
> @@ -717,14 +718,14 @@ static int try_to_unmap_one(struct page
> * skipped over this mm) then we should reactivate it.
> */
> if (!migration && ((vma->vm_flags & VM_LOCKED) ||
> - (ptep_clear_flush_young(vma, address, pte)))) {
> + (ptep_clear_flush_young_notify(vma, address, pte)))) {
> ret = SWAP_FAIL;
> goto out_unmap;
> }
>
> /* Nuke the page table entry. */
> flush_cache_page(vma, address, page_to_pfn(page));
> - pteval = ptep_clear_flush(vma, address, pte);
> + pteval = ptep_clear_flush_notify(vma, address, pte);
>
> /* Move the dirty bit to the physical page now the pte is gone. */
> if (pte_dirty(pteval))
> @@ -849,12 +850,12 @@ static void try_to_unmap_cluster(unsigne
> page = vm_normal_page(vma, address, *pte);
> BUG_ON(!page || PageAnon(page));
>
> - if (ptep_clear_flush_young(vma, address, pte))
> + if (ptep_clear_flush_young_notify(vma, address, pte))
> continue;
>
> /* Nuke the page table entry. */
> flush_cache_page(vma, address, pte_pfn(*pte));
> - pteval = ptep_clear_flush(vma, address, pte);
> + pteval = ptep_clear_flush_notify(vma, address, pte);
>
> /* If nonlinear, store the file page offset in the pte. */
> if (page->index != linear_page_index(vma, address))
|
|
From: Jan K. <jan...@we...> - 2008-05-16 16:03:33
|
With this patch applied, kvm is able to ignore breakpoint requests of an
attached gdb frontend so that the latter is motivated to insert soft
breakpoints into the guest code. All we need to do for this is to catch
and forward #BP exceptions, which are now reported by the kernel module.
Along the way, the patch makes vm_stop-on-breakpoint start to work.
Limitations:
- only takes care of x86 so far
- gdbstub state tracking is broken (artificially incrementing
nb_breakpoints won't fly, as we'll have no chance to decrement it),
we need an enhanced interface to the stub
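For context (my note, not part of the patch description): a "soft breakpoint"
is simply an int3 byte (0xcc) written over the first byte of the target
instruction. Once the stub rejects the 'Z0' request, gdb performs that write
itself via ordinary memory-write packets, which on the QEMU side boil down to
something like the following sketch (the wrapper name is illustrative,
cpu_memory_rw_debug is the existing helper):

        static int plant_soft_breakpoint(CPUState *env, target_ulong pc,
                                         uint8_t *saved_insn)
        {
                static const uint8_t int3 = 0xcc;

                /* remember the original byte so gdb can restore it later */
                if (cpu_memory_rw_debug(env, pc, saved_insn, 1, 0) < 0)
                        return -1;
                /* overwrite it; the resulting #BP is what gets forwarded below */
                return cpu_memory_rw_debug(env, pc, (uint8_t *)&int3, 1, 1);
        }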
---
libkvm/kvm-common.h | 2 ++
libkvm/libkvm-x86.c | 16 ++++++++++++++++
libkvm/libkvm.c | 5 -----
libkvm/libkvm.h | 8 +++++++-
qemu/exec.c | 19 +++++++++++++------
qemu/qemu-kvm.c | 22 +++++++++++++---------
6 files changed, 51 insertions(+), 21 deletions(-)
Index: b/libkvm/kvm-common.h
===================================================================
--- a/libkvm/kvm-common.h
+++ b/libkvm/kvm-common.h
@@ -85,4 +85,6 @@ int handle_io_window(kvm_context_t kvm);
int handle_debug(kvm_context_t kvm, int vcpu);
int try_push_interrupts(kvm_context_t kvm);
+int handle_debug(kvm_context_t kvm, int vcpu);
+
#endif
Index: b/libkvm/libkvm-x86.c
===================================================================
--- a/libkvm/libkvm-x86.c
+++ b/libkvm/libkvm-x86.c
@@ -661,3 +661,19 @@ int kvm_disable_tpr_access_reporting(kvm
}
#endif
+
+int handle_debug(kvm_context_t kvm, int vcpu)
+{
+ struct kvm_run *run = kvm->run[vcpu];
+ unsigned long watchpoint = 0;
+ int break_type;
+
+ if (run->debug.arch.exception == 1) {
+ /* TODO: fully process run->debug.arch */
+ break_type = KVM_GDB_BREAKPOINT_HW;
+ } else
+ break_type = KVM_GDB_BREAKPOINT_SW;
+
+ return kvm->callbacks->debug(kvm->opaque, vcpu, break_type,
+ watchpoint);
+}
Index: b/libkvm/libkvm.c
===================================================================
--- a/libkvm/libkvm.c
+++ b/libkvm/libkvm.c
@@ -738,11 +738,6 @@ static int handle_io(kvm_context_t kvm,
return 0;
}
-int handle_debug(kvm_context_t kvm, int vcpu)
-{
- return kvm->callbacks->debug(kvm->opaque, vcpu);
-}
-
int kvm_get_regs(kvm_context_t kvm, int vcpu, struct kvm_regs *regs)
{
return ioctl(kvm->vcpu_fd[vcpu], KVM_GET_REGS, regs);
Index: b/qemu/exec.c
===================================================================
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -1157,6 +1157,12 @@ int cpu_breakpoint_insert(CPUState *env,
#if defined(TARGET_HAS_ICE)
int i;
+ if (kvm_enabled()) {
+ env->nb_breakpoints++;
+ kvm_update_debugger(env);
+ return -ENOSYS;
+ }
+
for(i = 0; i < env->nb_breakpoints; i++) {
if (env->breakpoints[i] == pc)
return 0;
@@ -1166,9 +1172,6 @@ int cpu_breakpoint_insert(CPUState *env,
return -ENOBUFS;
env->breakpoints[env->nb_breakpoints++] = pc;
- if (kvm_enabled())
- kvm_update_debugger(env);
-
breakpoint_invalidate(env, pc);
return 0;
#else
@@ -1182,6 +1185,13 @@ int cpu_breakpoint_remove(CPUState *env,
{
#if defined(TARGET_HAS_ICE)
int i;
+
+ if (kvm_enabled()) {
+ env->nb_breakpoints--;
+ kvm_update_debugger(env);
+ return -ENOSYS;
+ }
+
for(i = 0; i < env->nb_breakpoints; i++) {
if (env->breakpoints[i] == pc)
goto found;
@@ -1192,9 +1202,6 @@ int cpu_breakpoint_remove(CPUState *env,
if (i < env->nb_breakpoints)
env->breakpoints[i] = env->breakpoints[env->nb_breakpoints];
- if (kvm_enabled())
- kvm_update_debugger(env);
-
breakpoint_invalidate(env, pc);
return 0;
#else
Index: b/qemu/qemu-kvm.c
===================================================================
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -58,6 +58,8 @@ pthread_t io_thread;
static int io_thread_fd = -1;
static int io_thread_sigfd = -1;
+static int kvm_debug_stop_requested;
+
static inline unsigned long kvm_get_thread_id(void)
{
return syscall(SYS_gettid);
@@ -517,6 +519,10 @@ int kvm_main_loop(void)
qemu_system_powerdown();
else if (qemu_reset_requested())
qemu_kvm_system_reset();
+ else if (kvm_debug_stop_requested) {
+ kvm_debug_stop_requested = 0;
+ vm_stop(EXCP_DEBUG);
+ }
}
pause_all_threads();
@@ -525,11 +531,12 @@ int kvm_main_loop(void)
return 0;
}
-static int kvm_debug(void *opaque, int vcpu)
+static int kvm_debug(void *opaque, int vcpu, int break_type,
+ uint64_t watchpoint_addr)
{
- CPUState *env = cpu_single_env;
-
- env->exception_index = EXCP_DEBUG;
+ /* TODO: process break_type */
+ kvm_debug_stop_requested = 1;
+ vcpu_info[vcpu].stopped = 1;
return 1;
}
@@ -748,15 +755,12 @@ int kvm_qemu_init_env(CPUState *cenv)
int kvm_update_debugger(CPUState *env)
{
struct kvm_debug_guest dbg;
- int i, r;
+ int r;
dbg.enabled = 0;
if (env->nb_breakpoints || env->singlestep_enabled) {
dbg.enabled = 1;
- for (i = 0; i < 4 && i < env->nb_breakpoints; ++i) {
- dbg.breakpoints[i].enabled = 1;
- dbg.breakpoints[i].address = env->breakpoints[i];
- }
+ memset(dbg.breakpoints, 0, sizeof(dbg.breakpoints));
dbg.singlestep = env->singlestep_enabled;
}
if (vm_running)
Index: b/libkvm/libkvm.h
===================================================================
--- a/libkvm/libkvm.h
+++ b/libkvm/libkvm.h
@@ -25,6 +25,12 @@ int kvm_get_msrs(kvm_context_t, int vcpu
int kvm_set_msrs(kvm_context_t, int vcpu, struct kvm_msr_entry *msrs, int n);
#endif
+#define KVM_GDB_BREAKPOINT_SW 0
+#define KVM_GDB_BREAKPOINT_HW 1
+#define KVM_GDB_WATCHPOINT_WRITE 2
+#define KVM_GDB_WATCHPOINT_READ 3
+#define KVM_GDB_WATCHPOINT_ACCESS 4
+
/*!
* \brief KVM callbacks structure
*
@@ -51,7 +57,7 @@ struct kvm_callbacks {
/// generic memory writes to unmapped memory (For MMIO devices)
int (*mmio_write)(void *opaque, uint64_t addr, uint8_t *data,
int len);
- int (*debug)(void *opaque, int vcpu);
+ int (*debug)(void *opaque, int vcpu, int break_type, uint64_t watchpoint);
/*!
* \brief Called when the VCPU issues an 'hlt' instruction.
*
|
|
From: Jan K. <jan...@we...> - 2008-05-16 16:02:29
|
This adds an arch field to kvm_run.debug, the payload that is returned
to user space on KVM_EXIT_DEBUG guest exits. For x86, this field is now
supposed to report the precise debug exception (#DB or #BP) and the
current state of the debug registers (the latter is not yet
implemented).
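A minimal sketch (mine, not from the patch) of how user space might consume
the new payload after a KVM_EXIT_DEBUG exit; the handler names are made up,
and the dr[] contents are not filled in yet, as noted above:

        struct kvm_run *run = vcpu_run_mapping;        /* mmap'ed kvm_run */

        if (run->exit_reason == KVM_EXIT_DEBUG) {
                if (run->debug.arch.exception == 3)    /* #BP, i.e. int3 */
                        forward_breakpoint_to_gdb(vcpu);
                else                                   /* #DB */
                        handle_hw_debug(vcpu, run->debug.arch.dr);
        }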
---
arch/x86/kvm/vmx.c | 1 +
include/asm-x86/kvm.h | 5 +++++
include/linux/kvm.h | 1 +
3 files changed, 7 insertions(+)
Index: b/arch/x86/kvm/vmx.c
===================================================================
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2282,6 +2282,7 @@ static int handle_exception(struct kvm_v
if ((intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK)) ==
(INTR_TYPE_EXCEPTION | 1)) {
kvm_run->exit_reason = KVM_EXIT_DEBUG;
+ kvm_run->debug.arch.exception = 1;
return 0;
}
kvm_run->exit_reason = KVM_EXIT_EXCEPTION;
Index: b/include/asm-x86/kvm.h
===================================================================
--- a/include/asm-x86/kvm.h
+++ b/include/asm-x86/kvm.h
@@ -230,4 +230,9 @@ struct kvm_pit_state {
#define KVM_TRC_APIC_ACCESS (KVM_TRC_HANDLER + 0x14)
#define KVM_TRC_TDP_FAULT (KVM_TRC_HANDLER + 0x15)
+struct kvm_debug_exit_arch {
+ __u32 exception;
+ __u64 dr[8];
+};
+
#endif
Index: b/include/linux/kvm.h
===================================================================
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -125,6 +125,7 @@ struct kvm_run {
__u64 data_offset; /* relative to kvm_run start */
} io;
struct {
+ struct kvm_debug_exit_arch arch;
} debug;
/* KVM_EXIT_MMIO */
struct {
|
|
From: Jan K. <jan...@we...> - 2008-05-16 16:02:21
|
In order to allow the gdbstub of QEMU to push (soft) breakpoint handling
completely into the gdb frontend, this patch enables guest exits also
for #BP exceptions - in case guest debugging is turned on.
Along with this enhancement, the patch also fixes the flag manipulation
for single-step mode.
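For orientation (illustration only): bit n of the VMX exception bitmap makes
exception vector n trap to the host, so the change below amounts to
intercepting #BP in addition to #DB whenever guest debugging is active:

        #define DB_VECTOR 1     /* #DB - hw breakpoints, single-step */
        #define BP_VECTOR 3     /* #BP - int3 planted by the gdb frontend */

        if (vcpu->guest_debug.enabled)
                eb |= (1u << DB_VECTOR) | (1u << BP_VECTOR);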
---
arch/x86/kvm/vmx.c | 38 +++++++++++++++-----------------------
1 file changed, 15 insertions(+), 23 deletions(-)
Index: b/arch/x86/kvm/vmx.c
===================================================================
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -461,7 +461,7 @@ static void update_exception_bitmap(stru
if (!vcpu->fpu_active)
eb |= 1u << NM_VECTOR;
if (vcpu->guest_debug.enabled)
- eb |= 1u << 1;
+ eb |= (1u << 1) | (1u << 3);
if (vcpu->arch.rmode.active)
eb = ~0;
if (vm_need_ept())
@@ -949,6 +949,7 @@ static int set_guest_debug(struct kvm_vc
{
unsigned long dr7 = 0x400;
int old_singlestep;
+ unsigned long flags;
old_singlestep = vcpu->guest_debug.singlestep;
@@ -969,13 +970,12 @@ static int set_guest_debug(struct kvm_vc
} else
vcpu->guest_debug.singlestep = 0;
- if (old_singlestep && !vcpu->guest_debug.singlestep) {
- unsigned long flags;
-
- flags = vmcs_readl(GUEST_RFLAGS);
+ flags = vmcs_readl(GUEST_RFLAGS);
+ if (vcpu->guest_debug.singlestep)
+ flags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
+ else if (old_singlestep && !vcpu->guest_debug.singlestep)
flags &= ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
- vmcs_writel(GUEST_RFLAGS, flags);
- }
+ vmcs_writel(GUEST_RFLAGS, flags);
update_exception_bitmap(vcpu);
vmcs_writel(GUEST_DR7, dr7);
@@ -2192,14 +2192,6 @@ static void kvm_guest_debug_pre(struct k
set_debugreg(dbg->bp[1], 1);
set_debugreg(dbg->bp[2], 2);
set_debugreg(dbg->bp[3], 3);
-
- if (dbg->singlestep) {
- unsigned long flags;
-
- flags = vmcs_readl(GUEST_RFLAGS);
- flags |= X86_EFLAGS_TF | X86_EFLAGS_RF;
- vmcs_writel(GUEST_RFLAGS, flags);
- }
}
static int handle_rmode_exception(struct kvm_vcpu *vcpu,
@@ -2221,7 +2213,7 @@ static int handle_rmode_exception(struct
static int handle_exception(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- u32 intr_info, error_code;
+ u32 intr_info, ex_no, error_code;
unsigned long cr2, rip;
u32 vect_info;
enum emulation_result er;
@@ -2279,15 +2271,15 @@ static int handle_exception(struct kvm_v
return 1;
}
- if ((intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VECTOR_MASK)) ==
- (INTR_TYPE_EXCEPTION | 1)) {
+ ex_no = intr_info & INTR_INFO_VECTOR_MASK;
+ if (ex_no == 1 || ex_no == 3) {
kvm_run->exit_reason = KVM_EXIT_DEBUG;
- kvm_run->debug.arch.exception = 1;
- return 0;
+ kvm_run->debug.arch.exception = ex_no;
+ } else {
+ kvm_run->exit_reason = KVM_EXIT_EXCEPTION;
+ kvm_run->ex.exception = ex_no;
+ kvm_run->ex.error_code = error_code;
}
- kvm_run->exit_reason = KVM_EXIT_EXCEPTION;
- kvm_run->ex.exception = intr_info & INTR_INFO_VECTOR_MASK;
- kvm_run->ex.error_code = error_code;
return 0;
}
|
|
From: Jan K. <jan...@we...> - 2008-05-16 16:02:04
|
[ Should apply against vanilla QEMU, but not ATM due to ongoing
constructions in gdbstub. ]
This patch prepares the QEMU cpu_watchpoint/breakpoint API so that we can
hook in with KVM and do guest debugging differently (maybe QEMUAccel
should provide appropriate callbacks for this, too). But it also allows
QEMU's debugging features to be extended one day, specifically w.r.t.
different watchpoint types.
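A quick usage sketch of the extended calls (mine, not from the patch; error
handling trimmed):

        /* plain int3 breakpoint - the length is irrelevant for code */
        if (cpu_breakpoint_insert(env, pc, 1, GDB_BREAKPOINT_SW) < 0)
                return -1;

        /* 4-byte write watchpoint; unsupported types now report -ENOSYS
           so the gdbstub can answer with an empty packet */
        if (cpu_watchpoint_insert(env, addr, 4, GDB_WATCHPOINT_WRITE) == -ENOSYS)
                put_packet(s, "");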
---
qemu/cpu-all.h | 14 ++++++++++----
qemu/exec.c | 30 ++++++++++++++++++++----------
qemu/gdbstub.c | 50 +++++++++++++++++++++++++++++---------------------
3 files changed, 59 insertions(+), 35 deletions(-)
Index: b/qemu/cpu-all.h
===================================================================
--- a/qemu/cpu-all.h
+++ b/qemu/cpu-all.h
@@ -758,10 +758,16 @@ extern int code_copy_enabled;
void cpu_interrupt(CPUState *s, int mask);
void cpu_reset_interrupt(CPUState *env, int mask);
-int cpu_watchpoint_insert(CPUState *env, target_ulong addr);
-int cpu_watchpoint_remove(CPUState *env, target_ulong addr);
-int cpu_breakpoint_insert(CPUState *env, target_ulong pc);
-int cpu_breakpoint_remove(CPUState *env, target_ulong pc);
+#define GDB_BREAKPOINT_SW 0
+#define GDB_BREAKPOINT_HW 1
+#define GDB_WATCHPOINT_WRITE 2
+#define GDB_WATCHPOINT_READ 3
+#define GDB_WATCHPOINT_ACCESS 4
+
+int cpu_watchpoint_insert(CPUState *env, target_ulong addr, target_ulong len, int type);
+int cpu_watchpoint_remove(CPUState *env, target_ulong addr, target_ulong len, int type);
+int cpu_breakpoint_insert(CPUState *env, target_ulong pc, target_ulong len, int type);
+int cpu_breakpoint_remove(CPUState *env, target_ulong pc, target_ulong len, int type);
void cpu_single_step(CPUState *env, int enabled);
void cpu_reset(CPUState *s);
Index: b/qemu/exec.c
===================================================================
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -1104,16 +1104,20 @@ static void breakpoint_invalidate(CPUSta
#endif
/* Add a watchpoint. */
-int cpu_watchpoint_insert(CPUState *env, target_ulong addr)
+int cpu_watchpoint_insert(CPUState *env, target_ulong addr, target_ulong len,
+ int type)
{
int i;
+ if (type != GDB_WATCHPOINT_WRITE)
+ return -ENOSYS;
+
for (i = 0; i < env->nb_watchpoints; i++) {
if (addr == env->watchpoint[i].vaddr)
return 0;
}
if (env->nb_watchpoints >= MAX_WATCHPOINTS)
- return -1;
+ return -ENOBUFS;
i = env->nb_watchpoints++;
env->watchpoint[i].vaddr = addr;
@@ -1126,10 +1130,14 @@ int cpu_watchpoint_insert(CPUState *env
}
/* Remove a watchpoint. */
-int cpu_watchpoint_remove(CPUState *env, target_ulong addr)
+int cpu_watchpoint_remove(CPUState *env, target_ulong addr, target_ulong len,
+ int type)
{
int i;
+ if (type != GDB_WATCHPOINT_WRITE)
+ return -ENOSYS;
+
for (i = 0; i < env->nb_watchpoints; i++) {
if (addr == env->watchpoint[i].vaddr) {
env->nb_watchpoints--;
@@ -1138,12 +1146,13 @@ int cpu_watchpoint_remove(CPUState *env,
return 0;
}
}
- return -1;
+ return -ENOENT;
}
/* add a breakpoint. EXCP_DEBUG is returned by the CPU loop if a
breakpoint is reached */
-int cpu_breakpoint_insert(CPUState *env, target_ulong pc)
+int cpu_breakpoint_insert(CPUState *env, target_ulong pc, target_ulong len,
+ int type)
{
#if defined(TARGET_HAS_ICE)
int i;
@@ -1154,7 +1163,7 @@ int cpu_breakpoint_insert(CPUState *env,
}
if (env->nb_breakpoints >= MAX_BREAKPOINTS)
- return -1;
+ return -ENOBUFS;
env->breakpoints[env->nb_breakpoints++] = pc;
if (kvm_enabled())
@@ -1163,12 +1172,13 @@ int cpu_breakpoint_insert(CPUState *env,
breakpoint_invalidate(env, pc);
return 0;
#else
- return -1;
+ return -ENOSYS;
#endif
}
/* remove a breakpoint */
-int cpu_breakpoint_remove(CPUState *env, target_ulong pc)
+int cpu_breakpoint_remove(CPUState *env, target_ulong pc, target_ulong len,
+ int type)
{
#if defined(TARGET_HAS_ICE)
int i;
@@ -1176,7 +1186,7 @@ int cpu_breakpoint_remove(CPUState *env,
if (env->breakpoints[i] == pc)
goto found;
}
- return -1;
+ return -ENOENT;
found:
env->nb_breakpoints--;
if (i < env->nb_breakpoints)
@@ -1188,7 +1198,7 @@ int cpu_breakpoint_remove(CPUState *env,
breakpoint_invalidate(env, pc);
return 0;
#else
- return -1;
+ return -ENOSYS;
#endif
}
Index: b/qemu/gdbstub.c
===================================================================
--- a/qemu/gdbstub.c
+++ b/qemu/gdbstub.c
@@ -882,7 +882,7 @@ static void cpu_gdb_write_registers(CPUS
static int gdb_handle_packet(GDBState *s, CPUState *env, const char *line_buf)
{
const char *p;
- int ch, reg_size, type;
+ int ch, reg_size, type, res;
char buf[4096];
uint8_t mem_buf[4096];
uint32_t *registers;
@@ -1017,21 +1017,20 @@ static int gdb_handle_packet(GDBState *s
if (*p == ',')
p++;
len = strtoull(p, (char **)&p, 16);
- if (type == 0 || type == 1) {
- if (cpu_breakpoint_insert(env, addr) < 0)
- goto breakpoint_error;
- put_packet(s, "OK");
+ switch (type) {
+ case GDB_BREAKPOINT_SW ... GDB_BREAKPOINT_HW:
+ res = cpu_breakpoint_insert(env, addr, len, type);
+ break;
#ifndef CONFIG_USER_ONLY
- } else if (type == 2) {
- if (cpu_watchpoint_insert(env, addr) < 0)
- goto breakpoint_error;
- put_packet(s, "OK");
+ case GDB_WATCHPOINT_WRITE ... GDB_WATCHPOINT_ACCESS:
+ res = cpu_watchpoint_insert(env, addr, len, type);
+ break;
#endif
- } else {
- breakpoint_error:
- put_packet(s, "E22");
+ default:
+ res = -ENOSYS;
+ break;
}
- break;
+ goto answer_bp_packet;
case 'z':
type = strtoul(p, (char **)&p, 16);
if (*p == ',')
@@ -1040,17 +1039,26 @@ static int gdb_handle_packet(GDBState *s
if (*p == ',')
p++;
len = strtoull(p, (char **)&p, 16);
- if (type == 0 || type == 1) {
- cpu_breakpoint_remove(env, addr);
- put_packet(s, "OK");
+ switch (type) {
+ case GDB_BREAKPOINT_SW ... GDB_BREAKPOINT_HW:
+ res = cpu_breakpoint_remove(env, addr, len, type);
+ break;
#ifndef CONFIG_USER_ONLY
- } else if (type == 2) {
- cpu_watchpoint_remove(env, addr);
- put_packet(s, "OK");
+ case GDB_WATCHPOINT_WRITE ... GDB_WATCHPOINT_ACCESS:
+ res = cpu_watchpoint_remove(env, addr, len, type);
+ break;
#endif
- } else {
- goto breakpoint_error;
+ default:
+ res = -ENOSYS;
+ break;
}
+ answer_bp_packet:
+ if (res >= 0)
+ put_packet(s, "OK");
+ else if (res == -ENOSYS)
+ put_packet(s, "");
+ else
+ put_packet(s, "E22");
break;
#ifdef CONFIG_LINUX_USER
case 'q':
|
|
From: Jan K. <jan...@we...> - 2008-05-16 16:01:59
|
This is yet only a proof of concept; more cleanups, generalizations, full
arch support, and some care for ABI consistency are needed. However, this
series allows one to actually _use_ guest debugging with kvm on x86 (Intel
only so far). There is no longer a limit on how many breakpoints can be
used in parallel, and this does not depend on existing (and correct)
hardware breakpoint support. In fact, it breaks my patches for that; I need
to rebase them later and enable hw-breakpoint/watchpoint support this way.

The patch series depends on earlier fixes of mine, namely

 - qemu-kvm: Fix guest resetting v2
   (http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/17442)
 - kvm-qemu: Fix monitor and gdbstub deadlocks v2
   (http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/17444)

Feedback is welcome.

Jan
|
From: Rusty R. <ru...@ru...> - 2008-05-16 14:27:15
|
On Friday 16 May 2008 19:17:03 Christian Borntraeger wrote:
> Hello Rusty,
>
> sometimes it is useful to share a disk (e.g. usr). To avoid file system
> corruption, the disk should be mounted read-only in that case. This patch
> adds a new feature flag that allows the host to specify whether the disk
> should be considered read-only.

Applied, thanks.

Rusty.
|
From: Robin H. <ho...@sg...> - 2008-05-16 11:50:03
|
On Fri, May 16, 2008 at 06:23:06AM -0500, Robin Holt wrote:
> On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
> > On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> > > On Thu, 15 May 2008, Nick Piggin wrote:
> > >
> > > > Oh, I get that confused because of the mixed up naming conventions
> > > > there: unmap_page_range should actually be called zap_page_range. But
> > > > at any rate, yes we can easily zap pagetables without holding mmap_sem.
> > >
> > > How is that synchronized with code that walks the same pagetable. These
> > > walks may not hold mmap_sem either. I would expect that one could only
> > > remove a portion of the pagetable where we have some sort of guarantee
> > > that no accesses occur. So the removal of the vma prior ensures that?
> >
> > I don't really understand the question. If you remove the pte and invalidate
> > the TLBs on the remote image's process (importing the page), then it can
> > of course try to refault the page in because its vma is still there. But
> > you catch that refault in your driver, which can prevent the page from
> > being faulted back in.
>
> I think Christoph's question has more to do with faults that are
> in flight. A recently requested fault could have just released the
> last lock that was holding up the invalidate callout. It would then
> begin messaging back the response PFN, which could still be in flight.
> The invalidate callout would then fire and do the interrupt shoot-down
> while that response was still active (essentially beating the in-flight
> response). The invalidate would clear up nothing, and then the response
> would insert the PFN after it is no longer the correct PFN.

I just looked over XPMEM. I think we could make this work. We already
have a list of active faults which is protected by a simple spinlock. I
would need to nest this lock within another lock protecting our PFN table
(currently it is a mutex), and then the invalidate interrupt handler would
need to mark the fault as invalid (which is also currently there).

I think my sticking points with the interrupt method remain fault
containment and timeout. The inability of the ia64 processor to provide
predictive failures for reads/writes of memory on other partitions prevents
us from being able to contain the failure. I don't think we can get the
information we would need to do the invalidate without introducing fault
containment issues, which has been a continuous area of concern for our
customers.

Thanks,
Robin
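[A rough sketch of the nesting described above, just to make the idea
concrete; all names are hypothetical, the XPMEM internals are not shown in
this thread:]

        /* invalidate path, run from the interrupt-driven shoot-down */
        spin_lock(&seg->pfn_table_lock);        /* today a mutex */
        spin_lock(&seg->active_faults_lock);
        list_for_each_entry(fault, &seg->active_faults, list)
                if (fault_covers(fault, start, end))
                        fault->invalid = 1;     /* in-flight reply gets discarded */
        spin_unlock(&seg->active_faults_lock);
        /* ... tear down the affected PFN table entries ... */
        spin_unlock(&seg->pfn_table_lock);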
|
From: Robin H. <ho...@sg...> - 2008-05-16 11:23:03
|
On Fri, May 16, 2008 at 01:52:03AM +0200, Nick Piggin wrote:
> On Thu, May 15, 2008 at 10:33:57AM -0700, Christoph Lameter wrote:
> > On Thu, 15 May 2008, Nick Piggin wrote:
> >
> > > Oh, I get that confused because of the mixed up naming conventions
> > > there: unmap_page_range should actually be called zap_page_range. But
> > > at any rate, yes we can easily zap pagetables without holding mmap_sem.
> >
> > How is that synchronized with code that walks the same pagetable. These
> > walks may not hold mmap_sem either. I would expect that one could only
> > remove a portion of the pagetable where we have some sort of guarantee
> > that no accesses occur. So the removal of the vma prior ensures that?
>
> I don't really understand the question. If you remove the pte and invalidate
> the TLBs on the remote image's process (importing the page), then it can
> of course try to refault the page in because its vma is still there. But
> you catch that refault in your driver, which can prevent the page from
> being faulted back in.

I think Christoph's question has more to do with faults that are
in flight. A recently requested fault could have just released the
last lock that was holding up the invalidate callout. It would then
begin messaging back the response PFN, which could still be in flight.
The invalidate callout would then fire and do the interrupt shoot-down
while that response was still active (essentially beating the in-flight
response). The invalidate would clear up nothing, and then the response
would insert the PFN after it is no longer the correct PFN.

Thanks,
Robin
|
From: Tomasz C. <ma...@wp...> - 2008-05-16 09:28:40
|
Christian Borntraeger schrieb:
> Hello Rusty,
>
> sometimes it is useful to share a disk (e.g. usr). To avoid file system
> corruption, the disk should be mounted read-only in that case.

Although it is done at a different level here, I wanted to note that
mounting a filesystem read-only does not necessarily mean the system will
not try to write to it. This is the case for ext3, for example - when
mounted ro, the system will still replay the journal and do some writes,
etc.

The patch, however, should take care of that too, as the device is made
read-only at a completely different place.

--
Tomasz Chmielewski
http://wpkg.org
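[To make the point concrete, here is roughly what ext3 does at mount time,
paraphrased from ext3_load_journal() in fs/ext3/super.c of this era and
trimmed; not part of the thread:]

        really_read_only = bdev_read_only(sb->s_bdev);

        if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER) &&
            (sb->s_flags & MS_RDONLY)) {
                printk(KERN_INFO "EXT3-fs: INFO: recovery required "
                       "on readonly filesystem.\n");
                if (really_read_only) {
                        /* the device itself is ro: bail out, do not write */
                        printk(KERN_ERR "EXT3-fs: write access "
                               "unavailable, cannot proceed.\n");
                        return -EROFS;
                }
                printk(KERN_INFO "EXT3-fs: write access will be "
                       "enabled during recovery.\n");
        }

So with only "mount -o ro" the journal is still replayed, while the
set_disk_ro() done by the patch makes bdev_read_only() true and the replay
is refused before it can touch the shared disk.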
|
From: Christian B. <bor...@de...> - 2008-05-16 09:19:17
|
Hello Rusty,
sometimes it is useful to share a disk (e.g. usr). To avoid file system
corruption, the disk should be mounted read-only in that case. This patch
adds a new feature flag that allows the host to specify whether the disk
should be considered read-only.
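[For context, a sketch of my own, not part of this patch: a host device
model would advertise the bit when its backing storage is read-only, along
the lines of the following; the helper name is made up and differs per host:]

        /* host side: expose the disk as read-only to the guest */
        if (backing_is_read_only(disk))
                host_features |= (1 << VIRTIO_BLK_F_RO);

        /* and independently reject any guest write request */

[The guest-side set_disk_ro() below is then the matching courtesy on the
other end.]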
Signed-off-by: Christian Borntraeger <bor...@de...>
---
drivers/block/virtio_blk.c | 6 +++++-
include/linux/virtio_blk.h | 1 +
2 files changed, 6 insertions(+), 1 deletion(-)
Index: kvm/drivers/block/virtio_blk.c
===================================================================
--- kvm.orig/drivers/block/virtio_blk.c
+++ kvm/drivers/block/virtio_blk.c
@@ -260,6 +260,10 @@ static int virtblk_probe(struct virtio_d
if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_TAG, NULL);
+ /* If disk is read-only in the host, the guest should obey */
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))
+ set_disk_ro(vblk->disk, 1);
+
/* Host must always specify the capacity. */
vdev->config->get(vdev, offsetof(struct virtio_blk_config, capacity),
&cap, sizeof(cap));
@@ -325,7 +329,7 @@ static struct virtio_device_id id_table[
static unsigned int features[] = {
VIRTIO_BLK_F_BARRIER, VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX,
- VIRTIO_BLK_F_GEOMETRY,
+ VIRTIO_BLK_F_GEOMETRY, VIRTIO_BLK_F_RO,
};
static struct virtio_driver virtio_blk = {
Index: kvm/include/linux/virtio_blk.h
===================================================================
--- kvm.orig/include/linux/virtio_blk.h
+++ kvm/include/linux/virtio_blk.h
@@ -10,6 +10,7 @@
#define VIRTIO_BLK_F_SIZE_MAX 1 /* Indicates maximum segment size */
#define VIRTIO_BLK_F_SEG_MAX 2 /* Indicates maximum # of segments */
#define VIRTIO_BLK_F_GEOMETRY 4 /* Legacy geometry available */
+#define VIRTIO_BLK_F_RO 5 /* Disk is read-only */
struct virtio_blk_config
{
|
|
From: Gerd H. <kr...@re...> - 2008-05-16 08:09:28
|
Hi,

With xenner I see a very high number of exits for no apparent reason in
the statistics:

kvm stats           :    total    diff
mmu_cache_miss      :     5380       0
mmu_flooded         :     1294       0
mmu_pde_zapped      :    10132       0
mmu_pte_updated     :   122292       0
mmu_pte_write       :   128223       0
mmu_shadow_zapped   :     4742       0
insn_emulation      :   179971    1001
fpu_reload          :      920       0
host_state_reload   :    52192    1065
irq_exits           :    52956    1226
halt_wakeup         :     3302       0
halt_exits          :     1770       0
io_exits            :    41564    1001
exits               : 12209733  454557
tlb_flush           :   514163    2002
pf_guest            :    80554       0
pf_fixed            :   172032       0

Ideas what this might be? An unusual exit reason with no counter, I guess?
Suggestions how to track that one down?

cheers,
  Gerd

--
http://kraxel.fedorapeople.org/xenner/
|
From: Gerd H. <kr...@re...> - 2008-05-16 08:02:18
|
Signed-off-by: Gerd Hoffmann <kr...@re...>
---
arch/x86/Kconfig | 4 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/pvclock.c | 148 +++++++++++++++++++++++++++++++++++++++++++++
include/asm-x86/pvclock.h | 6 ++
4 files changed, 159 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/pvclock.c
create mode 100644 include/asm-x86/pvclock.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fe361ae..deb3049 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -417,6 +417,10 @@ config PARAVIRT
over full virtualization. However, when run without a hypervisor
the kernel is theoretically slower and slightly larger.
+config PARAVIRT_CLOCK
+ bool
+ default n
+
endif
config MEMTEST_BOOTPARAM
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 5e618c3..77807d4 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o
obj-$(CONFIG_KVM_GUEST) += kvm.o
obj-$(CONFIG_KVM_CLOCK) += kvmclock.o
obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o
+obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
new file mode 100644
index 0000000..33e526f
--- /dev/null
+++ b/arch/x86/kernel/pvclock.c
@@ -0,0 +1,148 @@
+/* paravirtual clock -- common code used by kvm/xen
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+*/
+
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <asm/pvclock.h>
+
+/*
+ * These are perodically updated
+ * xen: magic shared_info page
+ * kvm: gpa registered via msr
+ * and then copied here.
+ */
+struct pvclock_shadow_time {
+ u64 tsc_timestamp; /* TSC at last update of time vals. */
+ u64 system_timestamp; /* Time, in nanosecs, since boot. */
+ u32 tsc_to_nsec_mul;
+ int tsc_shift;
+ u32 version;
+};
+
+/*
+ * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
+ * yielding a 64-bit result.
+ */
+static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
+{
+ u64 product;
+#ifdef __i386__
+ u32 tmp1, tmp2;
+#endif
+
+ if (shift < 0)
+ delta >>= -shift;
+ else
+ delta <<= shift;
+
+#ifdef __i386__
+ __asm__ (
+ "mul %5 ; "
+ "mov %4,%%eax ; "
+ "mov %%edx,%4 ; "
+ "mul %5 ; "
+ "xor %5,%5 ; "
+ "add %4,%%eax ; "
+ "adc %5,%%edx ; "
+ : "=A" (product), "=r" (tmp1), "=r" (tmp2)
+ : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
+#elif __x86_64__
+ __asm__ (
+ "mul %%rdx ; shrd $32,%%rdx,%%rax"
+ : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
+#else
+#error implement me!
+#endif
+
+ return product;
+}
+
+static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+{
+ u64 delta = native_read_tsc() - shadow->tsc_timestamp;
+ return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
+}
+
+/*
+ * Reads a consistent set of time-base values from hypervisor,
+ * into a shadow data area.
+ */
+static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
+ struct kvm_vcpu_time_info *src)
+{
+ do {
+ dst->version = src->version;
+ rmb(); /* fetch version before data */
+ dst->tsc_timestamp = src->tsc_timestamp;
+ dst->system_timestamp = src->system_time;
+ dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
+ dst->tsc_shift = src->tsc_shift;
+ rmb(); /* test version after fetching data */
+ } while ((src->version & 1) || (dst->version != src->version));
+
+ return dst->version;
+}
+
+/*
+ * This is our read_clock function. The host puts an tsc timestamp each time
+ * it updates a new time. Without the tsc adjustment, we can have a situation
+ * in which a vcpu starts to run earlier (smaller system_time), but probes
+ * time later (compared to another vcpu), leading to backwards time
+ */
+
+cycle_t pvclock_clocksource_read(struct kvm_vcpu_time_info *src)
+{
+ struct pvclock_shadow_time shadow;
+ unsigned version;
+ cycle_t ret, offset;
+
+ do {
+ version = pvclock_get_time_values(&shadow, src);
+ barrier();
+ offset = pvclock_get_nsec_offset(&shadow);
+ ret = shadow.system_timestamp + offset;
+ barrier();
+ } while (version != src->version);
+
+ return ret;
+}
+
+void pvclock_read_wallclock(struct kvm_wall_clock *wall_clock,
+ struct kvm_vcpu_time_info *vcpu_time,
+ struct timespec *ts)
+{
+ u32 version;
+ u64 delta;
+ struct timespec now;
+
+ /* get wallclock at system boot */
+ do {
+ version = wall_clock->wc_version;
+ rmb(); /* fetch version before time */
+ now.tv_sec = wall_clock->wc_sec;
+ now.tv_nsec = wall_clock->wc_nsec;
+ rmb(); /* fetch time before checking version */
+ } while ((wall_clock->wc_version & 1) || (version != wall_clock->wc_version));
+
+ delta = pvclock_clocksource_read(vcpu_time); /* time since system boot */
+ delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
+
+ now.tv_nsec = do_div(delta, NSEC_PER_SEC);
+ now.tv_sec = delta;
+
+ set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
+}
diff --git a/include/asm-x86/pvclock.h b/include/asm-x86/pvclock.h
new file mode 100644
index 0000000..2b9812f
--- /dev/null
+++ b/include/asm-x86/pvclock.h
@@ -0,0 +1,6 @@
+#include <linux/clocksource.h>
+#include <asm/kvm_para.h>
+cycle_t pvclock_clocksource_read(struct kvm_vcpu_time_info *src);
+void pvclock_read_wallclock(struct kvm_wall_clock *wall,
+ struct kvm_vcpu_time_info *vcpu,
+ struct timespec *ts);
--
1.5.4.1
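[One note from me on scale_delta() above, since the inline assembly is dense:
it performs a 64x32-bit multiply and keeps bits 32..95 of the result, i.e. it
scales the TSC delta by a 32.32 fixed-point factor. A portable sketch of the
same computation, assuming a compiler with a 128-bit integer type; not part
of the patch:]

        static inline u64 scale_delta_portable(u64 delta, u32 mul_frac, int shift)
        {
                if (shift < 0)
                        delta >>= -shift;
                else
                        delta <<= shift;
                /* (delta * mul_frac) >> 32, without losing the high bits */
                return (u64)(((unsigned __int128)delta * mul_frac) >> 32);
        }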
|
|
From: Gerd H. <kr...@re...> - 2008-05-16 08:01:34
|
Signed-off-by: Gerd Hoffmann <kr...@re...>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/kvmclock.c | 86 ++++++++++++++++---------------------------
2 files changed, 33 insertions(+), 54 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index deb3049..b749c85 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -390,6 +390,7 @@ config VMI
config KVM_CLOCK
bool "KVM paravirtualized clock"
select PARAVIRT
+ select PARAVIRT_CLOCK
depends on !(X86_VISWS || X86_VOYAGER)
help
Turning on this option will allow you to run a paravirtualized clock
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 4bc1be5..135a8f7 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -18,6 +18,7 @@
#include <linux/clocksource.h>
#include <linux/kvm_para.h>
+#include <asm/pvclock.h>
#include <asm/arch_hooks.h>
#include <asm/msr.h>
#include <asm/apic.h>
@@ -37,17 +38,9 @@ early_param("no-kvmclock", parse_no_kvmclock);
/* The hypervisor will put information about time periodically here */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct kvm_vcpu_time_info, hv_clock);
-#define get_clock(cpu, field) per_cpu(hv_clock, cpu).field
-
-static inline u64 kvm_get_delta(u64 last_tsc)
-{
- int cpu = smp_processor_id();
- u64 delta = native_read_tsc() - last_tsc;
- return (delta * get_clock(cpu, tsc_to_system_mul)) >> KVM_SCALE;
-}
static struct kvm_wall_clock wall_clock;
-static cycle_t kvm_clock_read(void);
+
/*
* The wallclock is the time of day when we booted. Since then, some time may
* have elapsed since the hypervisor wrote the data. So we try to account for
@@ -55,35 +48,19 @@ static cycle_t kvm_clock_read(void);
*/
unsigned long kvm_get_wallclock(void)
{
- u32 wc_sec, wc_nsec;
- u64 delta;
+ struct kvm_vcpu_time_info *vcpu_time;
struct timespec ts;
- int version, nsec;
int low, high;
low = (int)__pa(&wall_clock);
high = ((u64)__pa(&wall_clock) >> 32);
-
- delta = kvm_clock_read();
-
native_write_msr(MSR_KVM_WALL_CLOCK, low, high);
- do {
- version = wall_clock.wc_version;
- rmb();
- wc_sec = wall_clock.wc_sec;
- wc_nsec = wall_clock.wc_nsec;
- rmb();
- } while ((wall_clock.wc_version != version) || (version & 1));
-
- delta = kvm_clock_read() - delta;
- delta += wc_nsec;
- nsec = do_div(delta, NSEC_PER_SEC);
- set_normalized_timespec(&ts, wc_sec + delta, nsec);
- /*
- * Of all mechanisms of time adjustment I've tested, this one
- * was the champion!
- */
- return ts.tv_sec + 1;
+
+ vcpu_time = &get_cpu_var(hv_clock);
+ pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
+ put_cpu_var(hv_clock);
+
+ return ts.tv_sec;
}
int kvm_set_wallclock(unsigned long now)
@@ -91,28 +68,17 @@ int kvm_set_wallclock(unsigned long now)
return 0;
}
-/*
- * This is our read_clock function. The host puts an tsc timestamp each time
- * it updates a new time. Without the tsc adjustment, we can have a situation
- * in which a vcpu starts to run earlier (smaller system_time), but probes
- * time later (compared to another vcpu), leading to backwards time
- */
static cycle_t kvm_clock_read(void)
{
- u64 last_tsc, now;
- int cpu;
-
- preempt_disable();
- cpu = smp_processor_id();
-
- last_tsc = get_clock(cpu, tsc_timestamp);
- now = get_clock(cpu, system_time);
+ struct kvm_vcpu_time_info *src;
+ cycle_t ret;
- now += kvm_get_delta(last_tsc);
- preempt_enable();
-
- return now;
+ src = &get_cpu_var(hv_clock);
+ ret = pvclock_clocksource_read(src);
+ put_cpu_var(hv_clock);
+ return ret;
}
+
static struct clocksource kvm_clock = {
.name = "kvm-clock",
.read = kvm_clock_read,
@@ -123,13 +89,14 @@ static struct clocksource kvm_clock = {
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};
-static int kvm_register_clock(void)
+static int kvm_register_clock(char *txt)
{
int cpu = smp_processor_id();
int low, high;
low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
-
+ printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
+ cpu, high, low, txt);
return native_write_msr_safe(MSR_KVM_SYSTEM_TIME, low, high);
}
@@ -140,12 +107,20 @@ static void kvm_setup_secondary_clock(void)
* Now that the first cpu already had this clocksource initialized,
* we shouldn't fail.
*/
- WARN_ON(kvm_register_clock());
+ WARN_ON(kvm_register_clock("secondary cpu clock"));
/* ok, done with our trickery, call native */
setup_secondary_APIC_clock();
}
#endif
+#ifdef CONFIG_SMP
+void __init kvm_smp_prepare_boot_cpu(void)
+{
+ WARN_ON(kvm_register_clock("primary cpu clock"));
+ native_smp_prepare_boot_cpu();
+}
+#endif
+
/*
* After the clock is registered, the host will keep writing to the
* registered memory location. If the guest happens to shutdown, this memory
@@ -174,7 +149,7 @@ void __init kvmclock_init(void)
return;
if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
- if (kvm_register_clock())
+ if (kvm_register_clock("boot clock"))
return;
pv_time_ops.get_wallclock = kvm_get_wallclock;
pv_time_ops.set_wallclock = kvm_set_wallclock;
@@ -182,6 +157,9 @@ void __init kvmclock_init(void)
#ifdef CONFIG_X86_LOCAL_APIC
pv_apic_ops.setup_secondary_clock = kvm_setup_secondary_clock;
#endif
+#ifdef CONFIG_SMP
+ smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+#endif
machine_ops.shutdown = kvm_shutdown;
#ifdef CONFIG_KEXEC
machine_ops.crash_shutdown = kvm_crash_shutdown;
--
1.5.4.1
|
|
From: Gerd H. <kr...@re...> - 2008-05-16 08:01:30
|
Signed-off-by: Gerd Hoffmann <kr...@re...>
---
arch/x86/xen/Kconfig | 1 +
arch/x86/xen/time.c | 110 +++++---------------------------------------------
2 files changed, 12 insertions(+), 99 deletions(-)
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 2e641be..3a4f16a 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -5,6 +5,7 @@
config XEN
bool "Xen guest support"
select PARAVIRT
+ select PARAVIRT_CLOCK
depends on X86_32
depends on X86_CMPXCHG && X86_TSC && !(X86_VISWS || X86_VOYAGER)
help
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index c39e1a5..3d5f945 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -13,6 +13,7 @@
#include <linux/clockchips.h>
#include <linux/kernel_stat.h>
+#include <asm/pvclock.h>
#include <asm/xen/hypervisor.h>
#include <asm/xen/hypercall.h>
@@ -30,17 +31,6 @@
static cycle_t xen_clocksource_read(void);
-/* These are perodically updated in shared_info, and then copied here. */
-struct shadow_time_info {
- u64 tsc_timestamp; /* TSC at last update of time vals. */
- u64 system_timestamp; /* Time, in nanosecs, since boot. */
- u32 tsc_to_nsec_mul;
- int tsc_shift;
- u32 version;
-};
-
-static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
-
/* runstate info updated by Xen */
static DEFINE_PER_CPU(struct vcpu_runstate_info, runstate);
@@ -230,95 +220,14 @@ unsigned long xen_cpu_khz(void)
return xen_khz;
}
-/*
- * Reads a consistent set of time-base values from Xen, into a shadow data
- * area.
- */
-static unsigned get_time_values_from_xen(void)
-{
- struct vcpu_time_info *src;
- struct shadow_time_info *dst;
-
- /* src is shared memory with the hypervisor, so we need to
- make sure we get a consistent snapshot, even in the face of
- being preempted. */
- src = &__get_cpu_var(xen_vcpu)->time;
- dst = &__get_cpu_var(shadow_time);
-
- do {
- dst->version = src->version;
- rmb(); /* fetch version before data */
- dst->tsc_timestamp = src->tsc_timestamp;
- dst->system_timestamp = src->system_time;
- dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
- dst->tsc_shift = src->tsc_shift;
- rmb(); /* test version after fetching data */
- } while ((src->version & 1) | (dst->version ^ src->version));
-
- return dst->version;
-}
-
-/*
- * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
- * yielding a 64-bit result.
- */
-static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
-{
- u64 product;
-#ifdef __i386__
- u32 tmp1, tmp2;
-#endif
-
- if (shift < 0)
- delta >>= -shift;
- else
- delta <<= shift;
-
-#ifdef __i386__
- __asm__ (
- "mul %5 ; "
- "mov %4,%%eax ; "
- "mov %%edx,%4 ; "
- "mul %5 ; "
- "xor %5,%5 ; "
- "add %4,%%eax ; "
- "adc %5,%%edx ; "
- : "=A" (product), "=r" (tmp1), "=r" (tmp2)
- : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
-#elif __x86_64__
- __asm__ (
- "mul %%rdx ; shrd $32,%%rdx,%%rax"
- : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
-#else
-#error implement me!
-#endif
-
- return product;
-}
-
-static u64 get_nsec_offset(struct shadow_time_info *shadow)
-{
- u64 now, delta;
- now = native_read_tsc();
- delta = now - shadow->tsc_timestamp;
- return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
-}
-
static cycle_t xen_clocksource_read(void)
{
- struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
+ struct vcpu_time_info *src;
cycle_t ret;
- unsigned version;
-
- do {
- version = get_time_values_from_xen();
- barrier();
- ret = shadow->system_timestamp + get_nsec_offset(shadow);
- barrier();
- } while (version != __get_cpu_var(xen_vcpu)->time.version);
-
- put_cpu_var(shadow_time);
+ src = &get_cpu_var(xen_vcpu)->time;
+ ret = pvclock_clocksource_read((void*)src);
+ put_cpu_var(xen_vcpu);
return ret;
}
@@ -349,9 +258,14 @@ static void xen_read_wallclock(struct timespec *ts)
unsigned long xen_get_wallclock(void)
{
+ const struct shared_info *s = HYPERVISOR_shared_info;
+ struct kvm_wall_clock *wall_clock = (void*)&(s->wc_version);
+ struct vcpu_time_info *vcpu_time;
struct timespec ts;
- xen_read_wallclock(&ts);
+ vcpu_time = &get_cpu_var(xen_vcpu)->time;
+ pvclock_read_wallclock(wall_clock, (void*)vcpu_time, &ts);
+ put_cpu_var(xen_vcpu);
return ts.tv_sec;
}
@@ -576,8 +490,6 @@ __init void xen_time_init(void)
{
int cpu = smp_processor_id();
- get_time_values_from_xen();
-
clocksource_register(&xen_clocksource);
if (HYPERVISOR_vcpu_op(VCPUOP_stop_periodic_timer, cpu, NULL) == 0) {
--
1.5.4.1
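For readers following the conversion: the scale_delta() helper removed
above (its job is now done by the common pvclock code) is plain
fixed-point arithmetic, ((delta << tsc_shift) * tsc_to_nsec_mul) >> 32,
turning a TSC delta into nanoseconds. A minimal portable sketch, for
illustration only; the function name is made up and this is not the
code that replaces it:

#include <stdint.h>

/* Sketch of what scale_delta() computes, without the i386/x86_64 inline
 * assembly (which exists because 32-bit x86 has no cheap 64x32 multiply
 * in plain C). */
static inline uint64_t scale_delta_sketch(uint64_t delta, uint32_t mul_frac,
                                          int shift)
{
        uint64_t lo, hi;

        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;

        lo = (delta & 0xffffffffULL) * mul_frac;   /* low 32 bits * mul_frac  */
        hi = (delta >> 32) * mul_frac;             /* high 32 bits * mul_frac */

        return (lo >> 32) + hi;                    /* == (delta * mul_frac) >> 32 */
}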
|
|
From: Gerd H. <kr...@re...> - 2008-05-16 08:01:30
|
paravirt clock source patches, next round, with a bunch of changes in
the host code according to Avi's review comments and some minor code
tweaks.

cheers,
Gerd |
|
From: Gerd H. <kr...@re...> - 2008-05-16 08:01:29
|
Signed-off-by: Gerd Hoffmann <kr...@re...>
---
arch/x86/kvm/x86.c | 71 ++++++++++++++++++++++++++++++++++++-------
include/asm-x86/kvm_host.h | 1 +
2 files changed, 60 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dab3d4f..7f84467 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -493,7 +493,7 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
{
static int version;
struct kvm_wall_clock wc;
- struct timespec wc_ts;
+ struct timespec now, sys, boot;
if (!wall_clock)
return;
@@ -502,9 +502,18 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
kvm_write_guest(kvm, wall_clock, &version, sizeof(version));
- wc_ts = current_kernel_time();
- wc.wc_sec = wc_ts.tv_sec;
- wc.wc_nsec = wc_ts.tv_nsec;
+ /*
+ * The guest calculates current wall clock time by adding
+ * system time (updated by kvm_write_guest_time below) to the
+ * wall clock specified here. guest system time equals host
+ * system time for us, thus we must fill in host boot time here.
+ */
+ now = current_kernel_time();
+ ktime_get_ts(&sys);
+ boot = ns_to_timespec(timespec_to_ns(&now) - timespec_to_ns(&sys));
+
+ wc.wc_sec = boot.tv_sec;
+ wc.wc_nsec = boot.tv_nsec;
wc.wc_version = version;
kvm_write_guest(kvm, wall_clock, &wc, sizeof(wc));
@@ -513,6 +522,44 @@ static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
kvm_write_guest(kvm, wall_clock, &version, sizeof(version));
}
+static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
+{
+ uint32_t quotient, remainder;
+
+ /* This is NOT what do_div() does ... */
+ __asm__ ( "divl %4"
+ : "=a" (quotient), "=d" (remainder)
+ : "0" (0), "1" (dividend), "r" (divisor) );
+ return quotient;
+}
+
+static void kvm_set_time_scale(uint32_t tsc_khz, struct kvm_vcpu_time_info *hv_clock)
+{
+ uint64_t nsecs = 1000000000LL;
+ int32_t shift = 0;
+ uint64_t tps64;
+ uint32_t tps32;
+
+ tps64 = tsc_khz * 1000LL;
+ while (tps64 > nsecs*2) {
+ tps64 >>= 1;
+ shift--;
+ }
+
+ tps32 = (uint32_t)tps64;
+ while (tps32 <= (uint32_t)nsecs) {
+ tps32 <<= 1;
+ shift++;
+ }
+
+ hv_clock->tsc_shift = shift;
+ hv_clock->tsc_to_system_mul = div_frac(nsecs, tps32);
+
+ pr_debug("%s: tsc_khz %u, tsc_shift %d, tsc_mul %u\n",
+ __FUNCTION__, tsc_khz, hv_clock->tsc_shift,
+ hv_clock->tsc_to_system_mul);
+}
+
static void kvm_write_guest_time(struct kvm_vcpu *v)
{
struct timespec ts;
@@ -523,6 +570,11 @@ static void kvm_write_guest_time(struct kvm_vcpu *v)
if ((!vcpu->time_page))
return;
+ if (unlikely(vcpu->hv_clock_tsc_khz != tsc_khz)) {
+ kvm_set_time_scale(tsc_khz, &vcpu->hv_clock);
+ vcpu->hv_clock_tsc_khz = tsc_khz;
+ }
+
/* Keep irq disabled to prevent changes to the clock */
local_irq_save(flags);
kvm_get_msr(v, MSR_IA32_TIME_STAMP_COUNTER,
@@ -537,21 +589,20 @@ static void kvm_write_guest_time(struct kvm_vcpu *v)
/*
* The interface expects us to write an even number signaling that the
* update is finished. Since the guest won't see the intermediate
- * state, we just write "2" at the end
+ * state, we just increase by 2 at the end.
*/
- vcpu->hv_clock.version = 2;
+ vcpu->hv_clock.version += 2;
shared_kaddr = kmap_atomic(vcpu->time_page, KM_USER0);
memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
- sizeof(vcpu->hv_clock));
+ sizeof(vcpu->hv_clock));
kunmap_atomic(shared_kaddr, KM_USER0);
mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT);
}
-
int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
switch (msr) {
@@ -599,10 +650,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
/* ...but clean it before doing the actual write */
vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);
- vcpu->arch.hv_clock.tsc_to_system_mul =
- clocksource_khz2mult(tsc_khz, 22);
- vcpu->arch.hv_clock.tsc_shift = 22;
-
down_read(¤t->mm->mmap_sem);
vcpu->arch.time_page =
gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 1466c3f..820eef4 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -283,6 +283,7 @@ struct kvm_vcpu_arch {
gpa_t time;
struct kvm_vcpu_time_info hv_clock;
+ unsigned int hv_clock_tsc_khz;
unsigned int time_offset;
struct page *time_page;
};
--
1.5.4.1
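A note on the version handling above, for context: the guest treats the
version field like a sequence counter, rereading the time values until
it observes an even version that did not change across the reads. The
sketch below is modelled on the get_time_values_from_xen() loop that the
Xen patch in this series removes; it is illustrative only, the field
names are simply those of kvm_vcpu_time_info as used here, and rmb()
stands in for the guest's read barrier:

/* Illustrative guest-side reader of the per-vcpu time structure that
 * kvm_write_guest_time() publishes.  Retry while an update may be in
 * flight (odd version) or the version changed underneath us. */
static void read_time_info(const volatile struct kvm_vcpu_time_info *src,
                           struct kvm_vcpu_time_info *dst)
{
        u32 version;

        do {
                version = src->version;
                rmb();                          /* version before data */
                dst->tsc_timestamp     = src->tsc_timestamp;
                dst->system_time       = src->system_time;
                dst->tsc_to_system_mul = src->tsc_to_system_mul;
                dst->tsc_shift         = src->tsc_shift;
                rmb();                          /* data before recheck */
        } while ((version & 1) || (version != src->version));

        dst->version = version;
}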
|
|
From: Gerd H. <kr...@re...> - 2008-05-16 07:47:44
|
Avi Kivity wrote:
>> + struct timespec now,sys,boot;
>
> Add spaces.
Done.
>> +#if 0
>> + /* Hmm, getboottime() isn't exported to modules ... */
>> + getboottime(&boot);
>> +#else
>> + now = current_kernel_time();
>> + ktime_get_ts(&sys);
>> + boot = ns_to_timespec(timespec_to_ns(&now) - timespec_to_ns(&sys));
>> +#endif
>> + wc.wc_sec = boot.tv_sec;
>> + wc.wc_nsec = boot.tv_nsec;
>
> Please drop the #if 0.
Done, and added a comment for the calculation.
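To make the replacement calculation concrete (numbers invented purely
for illustration): if current_kernel_time() reports a wall clock of
1,210,000,000 s and ktime_get_ts() reports 3,600 s of uptime, then

    boot = now - sys = 1,210,000,000 s - 3,600 s = 1,209,996,400 s

is what gets written as the guest's wall clock base. When the guest
later adds its system time, which kvm_write_guest_time() keeps equal to
the host's uptime, it arrives back at 1,209,996,400 s + 3,600 s =
1,210,000,000 s, the host's wall clock.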
>> +static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
>> +{
>> + uint32_t quotient, remainder;
>> +
>> + __asm__ ( "divl %4"
>> + : "=a" (quotient), "=d" (remainder)
>> + : "0" (0), "1" (dividend), "r" (divisor) );
>> + return quotient;
>> +}
>>
>
> do_div()?
No, this one does something else. Already tried to get rid of that one
before ;)
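For anyone else who trips over the difference: do_div(n, base) divides a
full 64-bit value in place and hands back the remainder, while
div_frac() returns the 0.32 fixed-point fraction dividend/divisor, i.e.
(dividend << 32) / divisor, which is why the dividend is loaded into the
high half (edx) and eax is zeroed. A portable sketch, for illustration
only (the inline asm avoids dragging a 64-by-32 libgcc division helper
into the 32-bit build):

#include <stdint.h>

/* Illustration of what div_frac() computes, without the inline asm. */
static uint32_t div_frac_sketch(uint32_t dividend, uint32_t divisor)
{
        /* kvm_set_time_scale() arranges dividend < divisor, so the
         * quotient fits in 32 bits. */
        return (uint32_t)(((uint64_t)dividend << 32) / divisor);
}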
> pr_debug() or something?
Done.
>> + kvm_set_time_scale(tsc_khz, &vcpu->arch.hv_clock);
>>
> What if the tsc frequency changes later on? we need to adjust the
> multiplier, no?
We better do that, yes.
New patch series prepared and tested, will be posted in a moment ...
cheers,
Gerd
--
http://kraxel.fedorapeople.org/xenner/
|
|
From: Jan K. <jan...@we...> - 2008-05-16 07:43:46
|
Jan Kiszka wrote:
> Avi Kivity wrote:
>> [I forgot to do this last weekend, so it's postponed to Saturday]
>>
>> During the upcoming Saturday, the various kvm lists will move to
>> vger.kenel.org. This will improve responsiveness, and reduce spam and
>> advertising.
>>
>> Please subscribe to the lists you are interested in as soon as
>> possible. You can subscribe by sending an email to
>> maj...@vg..., with the following lines in the body:
>>
>> subscribe kvm
>> subscribe kvm-commits
>> subscribe kvm-ia64
>> subscribe kvm-ppc
>
> Will someone take care of creating new gmane.org archives at the same
> time, too? I'm relying on their news services for high-traffic lists
> like kvm's.

(Which means, if nobody does, I would do. Later, I'll "just" need the
mbox archives in order to feed-in all postings up to the subscription
time.)

Jan |
|
From: Jan K. <jan...@we...> - 2008-05-16 07:34:56
|
Avi Kivity wrote:
> [I forgot to do this last weekend, so it's postponed to Saturday]
>
> During the upcoming Saturday, the various kvm lists will move to
> vger.kenel.org. This will improve responsiveness, and reduce spam and
> advertising.
>
> Please subscribe to the lists you are interested in as soon as
> possible. You can subscribe by sending an email to
> maj...@vg..., with the following lines in the body:
>
> subscribe kvm
> subscribe kvm-commits
> subscribe kvm-ia64
> subscribe kvm-ppc

Will someone take care of creating new gmane.org archives at the same
time, too? I'm relying on their news services for high-traffic lists
like kvm's.

Thanks,
Jan |
|
From: Jan K. <jan...@we...> - 2008-05-16 07:21:59
|
Here comes the second revision of the attempt to consolidate
kvm_eat_signal[s]. As suggested, it removes the looping over
kvm_eat_signal and folds everything into kvm_main_loop_wait.
Signed-off-by: Jan Kiszka <jan...@we...>
---
qemu/qemu-kvm.c | 43 ++++++++-----------------------------------
1 file changed, 8 insertions(+), 35 deletions(-)
Index: b/qemu/qemu-kvm.c
===================================================================
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -173,64 +173,37 @@ static int has_work(CPUState *env)
return kvm_arch_has_work(env);
}
-static int kvm_eat_signal(CPUState *env, int timeout)
+static void kvm_main_loop_wait(CPUState *env, int timeout)
{
struct timespec ts;
- int r, e, ret = 0;
+ int r, e;
siginfo_t siginfo;
sigset_t waitset;
+ pthread_mutex_unlock(&qemu_mutex);
+
ts.tv_sec = timeout / 1000;
ts.tv_nsec = (timeout % 1000) * 1000000;
sigemptyset(&waitset);
sigaddset(&waitset, SIG_IPI);
r = sigtimedwait(&waitset, &siginfo, &ts);
- if (r == -1 && (errno == EAGAIN || errno == EINTR) && !timeout)
- return 0;
e = errno;
pthread_mutex_lock(&qemu_mutex);
- if (env && vcpu)
- cpu_single_env = vcpu->env;
- if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {
+
+ if (r == -1 && !(e == EAGAIN || e == EINTR)) {
printf("sigtimedwait: %s\n", strerror(e));
exit(1);
}
- if (r != -1)
- ret = 1;
- if (env && vcpu_info[env->cpu_index].stop) {
+ if (vcpu_info[env->cpu_index].stop) {
vcpu_info[env->cpu_index].stop = 0;
vcpu_info[env->cpu_index].stopped = 1;
pthread_cond_signal(&qemu_pause_cond);
}
- pthread_mutex_unlock(&qemu_mutex);
-
- return ret;
-}
-
-
-static void kvm_eat_signals(CPUState *env, int timeout)
-{
- int r = 0;
-
- while (kvm_eat_signal(env, 0))
- r = 1;
- if (!r && timeout) {
- r = kvm_eat_signal(env, timeout);
- if (r)
- while (kvm_eat_signal(env, 0))
- ;
- }
-}
-
-static void kvm_main_loop_wait(CPUState *env, int timeout)
-{
- pthread_mutex_unlock(&qemu_mutex);
- kvm_eat_signals(env, timeout);
- pthread_mutex_lock(&qemu_mutex);
cpu_single_env = env;
+
vcpu_info[env->cpu_index].signalled = 0;
}
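For reference, the sigtimedwait() pattern that the consolidated
kvm_main_loop_wait() relies on also works in a standalone program. The
sketch below is purely illustrative (SIGUSR1 stands in for qemu-kvm's
SIG_IPI and the function name is made up), but it shows the same
timeout handling: EAGAIN and EINTR mean nothing arrived, anything else
is a real error.

#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Wait up to timeout_ms for SIGUSR1; returns 1 if the signal arrived,
 * 0 on timeout or interruption. */
static int wait_for_ipi(int timeout_ms)
{
        sigset_t waitset;
        siginfo_t siginfo;
        struct timespec ts = {
                .tv_sec  = timeout_ms / 1000,
                .tv_nsec = (timeout_ms % 1000) * 1000000,
        };
        int r;

        sigemptyset(&waitset);
        sigaddset(&waitset, SIGUSR1);
        /* The signal must be blocked for sigtimedwait() to collect it. */
        sigprocmask(SIG_BLOCK, &waitset, NULL);

        r = sigtimedwait(&waitset, &siginfo, &ts);
        if (r == -1 && !(errno == EAGAIN || errno == EINTR)) {
                perror("sigtimedwait");
                exit(1);
        }
        return r != -1;
}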
|