From: Christoph L. <cla...@sg...> - 2008-04-29 01:28:04
On Tue, 29 Apr 2008, Andrea Arcangeli wrote:

> Frankly I've absolutely no idea why rcu is needed in all rmap code
> when walking the page->mapping. Definitely the PG_locked is taken so
> there's no way page->mapping could possibly go away under the rmap
> code, hence the anon_vma can't go away as it's queued in the vma, and
> the vma has to go away before the page is zapped out of the pte.

zap_pte_range can race with the rmap code, and it does not take the page
lock. The page may not go away since a refcount was taken, but the mapping
can go away. Without RCU you have no guarantee that the anon_vma still
exists when you take the lock. How long were you away from VM development?

> Now the double atomic op may not be horrible when not contented, as it
> works on the same cacheline but with cacheline bouncing with
> contention it sounds doubly horrible than a single cacheline bounce
> and I don't see the point of it as you can't use rcu anyways, so you
> can't possibly take advantage of whatever microoptimization done over
> the original locking.

Cachelines are acquired for exclusive use for a minimum duration. Multiple
atomic operations can be performed after a cacheline becomes exclusive
without danger of bouncing.
From: Ulrich D. <dr...@re...> - 2008-04-29 00:46:54
> @@ -369,6 +372,10 @@ static void *ap_main_loop(void *_env)
>      sigfillset(&signals);
>      sigprocmask(SIG_BLOCK, &signals, NULL);
>      kvm_create_vcpu(kvm_context, env->cpu_index);
> +    pthread_mutex_lock(&vcpu_mutex);
> +    vcpu->created = 1;
> +    pthread_cond_signal(&qemu_vcpuup_cond);
> +    pthread_mutex_unlock(&vcpu_mutex);
>      kvm_qemu_init_env(env);
>      kvm_main_loop_cpu(env);
>      return NULL;

Still no locking needed on any CPU we support. The memory access is
atomic, and that's all that counts here. With the mutex taken, the woken
thread immediately runs into a brick wall and has to be put to sleep
again.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
From: Andrea A. <an...@qu...> - 2008-04-29 00:10:50
On Mon, Apr 28, 2008 at 01:34:11PM -0700, Christoph Lameter wrote:
> On Sun, 27 Apr 2008, Andrea Arcangeli wrote:
>
> > Talking about post 2.6.26: the refcount with rcu in the anon-vma
> > conversion seems unnecessary and may explain part of the AIM slowdown
> > too. The rest looks ok and probably we should switch the code to a
> > compile-time decision between rwlock and rwsem (so obsoleting the
> > current spinlock).
>
> You are going to take a semaphore in an rcu section? Guess you did not
> activate all debugging options while testing? I was not aware that you
> can take a sleeping lock from a non-preemptible context.

I'd hoped to discuss this topic after mmu-notifier-core was already
merged, but let's do it anyway.

My point of view is that there was no rcu when I wrote that code, there
was no reference count, and yet all the locking still looks exactly the
same as when I wrote it. There's even still the page_table_lock to
serialize threads taking the mmap_sem in read mode against the first
vma->anon_vma = anon_vma during the page fault.

Frankly I've absolutely no idea why rcu is needed in all rmap code when
walking the page->mapping. Definitely the PG_locked is taken, so there's
no way page->mapping could possibly go away under the rmap code, hence
the anon_vma can't go away as it's queued in the vma, and the vma has to
go away before the page is zapped out of the pte.

So there are some possible scenarios:

1) My original anon_vma code was buggy, not taking the rcu_read_lock(),
   and somebody fixed it (I tend to exclude it).

2) Somebody has seen a race that doesn't exist and didn't bother to
   document it other than with this obscure comment:

       * Getting a lock on a stable anon_vma from a page off the LRU is
       * tricky: page_lock_anon_vma rely on RCU to guard against the races.

   I tend to exclude it too, as VM folks are too smart for this to be
   the case.

3) Somebody did some microoptimization using rcu, and we surely can undo
   that microoptimization to get the code back to my original code that
   didn't need rcu despite working exactly the same, and that is going
   to be cheaper to use with semaphores than doubling the number of
   locked ops for every lock instruction.

Now the double atomic op may not be horrible when not contended, as it
works on the same cacheline, but with cacheline bouncing under contention
it sounds doubly horrible compared to a single cacheline bounce, and I
don't see the point of it as you can't use rcu anyway, so you can't
possibly take advantage of whatever microoptimization was done over the
original locking.
From: David S. A. <da...@ci...> - 2008-04-28 23:45:05
Hi Marcelo:

mmu_recycled is always 0 for this guest -- even after almost 4 hours of
uptime.

Here is a kvm_stat sample where guest time was very high and qemu had 2
processors at 100% on the host. I removed counters where both columns
have 0 value for brevity.

 exits              45937979  758051
 fpu_reload          1416831      87
 halt_exits           112911       0
 halt_wakeup           31771       0
 host_state_reload   2068602     263
 insn_emulation     21601480  365493
 io_exits            1827374    2705
 irq_exits           8934818  285196
 mmio_exits           421674     147
 mmu_cache_miss      4817689   93680
 mmu_flooded         4815273   93680
 mmu_pde_zapped        51344       0
 mmu_prefetch        4817625   93680
 mmu_pte_updated    14803298  270104
 mmu_pte_write      19859863  363785
 mmu_shadow_zapped   4832106   93679
 pf_fixed           32184355  468398
 pf_guest             264138       0
 remote_tlb_flush   10697762  280522
 tlb_flush          10301338  176424

(NOTE: This is for a *5* second sample interval instead of 1, to allow me
to capture the data.)

Here's a sample when the guest is "well-behaved" (system time <10%,
though):

 exits              51502194   97453
 fpu_reload          1421736     227
 halt_exits           138361    1927
 halt_wakeup           33047     117
 host_state_reload   2110190    3740
 insn_emulation     24367441   47260
 io_exits            1874075    2576
 irq_exits          10224702   13333
 mmio_exits           435154    1726
 mmu_cache_miss      5414097   11258
 mmu_flooded         5411548   11243
 mmu_pde_zapped        52851      44
 mmu_prefetch        5414031   11258
 mmu_pte_updated    16854686   29901
 mmu_pte_write      22526765   42285
 mmu_shadow_zapped   5430025   11313
 pf_fixed           36144578   67666
 pf_guest             282794     430
 remote_tlb_flush   12126268   14619
 tlb_flush          11753162   21460

There is definitely a strong correlation between the mmu counters and
high system times in the guest. I am still trying to find out what in the
guest is stimulating it when running on RHEL3; I do not see this same
behavior for an equivalent setup running on RHEL4.

By the way, I added an mmu_prefetch stat in prefetch_page() to count the
number of times the for() loop is hit with PTTYPE == 64, i.e., the number
of times paging64_prefetch_page() is invoked. (I wanted an explicit
counter for this loop, though the info seems to duplicate other entries.)
That counter is listed above. As I mentioned in a prior post, when kscand
kicks in, the change in the mmu_prefetch counter is 20,000+/sec, with
each trip through that function taking 45k+ cycles. kscand is an
instigator shortly after boot; however, kscand is *not* the culprit once
the system has been up for 30-45 minutes. I have started instrumenting
the RHEL3U8 kernel, and for the load I am running, kscand does not walk
the active lists very often once the system is up.

So, to dig deeper on what in the guest is stimulating the mmu, I
collected kvmtrace data for about a 2-minute time interval, which caught
about a 30-second period where guest system time was steady in the 25-30%
range. Summarizing the number of times a RIP appears in a VMEXIT shows
the following high runners:

 count  RIP         RHEL3 symbol
 82549  0xc0140e42  follow_page      [kernel] c0140d90 offset b2
 42532  0xc0144760  handle_mm_fault  [kernel] c01446d0 offset 90
 36826  0xc013da4a  futex_wait       [kernel] c013d870 offset 1da
 29987  0xc0145cd0  zap_pte_range    [kernel] c0145c10 offset c0
 27451  0xc0144018  do_no_page       [kernel] c0143e20 offset 1f8

(The halt entry is removed from the list since that is the ideal scenario
for an exit.) So the RIP correlates to follow_page() for a large
percentage of the VMEXITs.

I wrote an awk script to summarize (histogram style) the TSC cycles
between VMEXIT and VMENTRY for an address. For the first rip, 0xc0140e42,
82,271 times (i.e., almost 100% of the time) the trace shows a delta
between 50k and 100k cycles between the VMEXIT and the subsequent
VMENTRY. Similarly, for the second one, 0xc0144760, 42,403 times (again
almost 100% of the occurrences) the trace shows a delta between 50k and
100k cycles between VMEXIT and VMENTRY. These seem to correlate with the
prefetch_page function in kvm, though I am not 100% positive on that.

I am now investigating the kernel paths leading to those functions.
Any insights would definitely be appreciated.

david

Marcelo Tosatti wrote:
> On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
>> Most of the cycles (~80% of that 54k+) are spent in
>> paging64_prefetch_page():
>>
>>     for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>>         gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>>         pte_gpa += (i+offset) * sizeof(pt_element_t);
>>
>>         r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>>                                   sizeof(pt_element_t));
>>         if (r || is_present_pte(pt))
>>             sp->spt[i] = shadow_trap_nonpresent_pte;
>>         else
>>             sp->spt[i] = shadow_notrap_nonpresent_pte;
>>     }
>>
>> This loop is run 512 times and takes a total of ~45k cycles, or ~88
>> cycles per loop.
>>
>> This function gets run >20,000/sec during some of the kscand loops.
>
> Hi David,
>
> Do you see the mmu_recycled counter increase?
From: Fabian D. <fab...@gm...> - 2008-04-28 23:43:06
Hey.

I've been trying Microsoft Windows 2003 a couple of times. The wiki tells
me that "everything" should work okay. It does when using -smp 1, but
gets ugly when using -smp 2 or so.

So might it be useful to add an "smp" column to the "Guest Support
Status" page in the wiki?

- fabian
From: Anthony L. <ali...@us...> - 2008-04-28 22:30:36
We hold qemu_mutex while machine->init() executes, which issues a VCPU
create. We need to make sure not to return from the VCPU creation until
the VCPU file descriptor is valid, to ensure that APIC creation succeeds.
However, we also need to make sure that the VCPU thread doesn't start
running until machine->init() is complete. This is addressed today
because the VCPU thread tries to grab qemu_mutex before doing anything
interesting. If we release qemu_mutex to wait for VCPU creation, then we
open a window for a race to occur.

This patch introduces two wait conditions. The first lets the VCPU create
code that runs in the IO thread wait for a VCPU to initialize. The second
condition lets the VCPU thread wait for the machine to fully initialize
before running. An added benefit of this patch is that it makes the
dependencies explicit.

Signed-off-by: Anthony Liguori <ali...@us...>

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 78127de..9a9bf59 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -32,8 +32,12 @@ static int qemu_kvm_reset_requested;
 
 pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
+pthread_cond_t qemu_vcpu_cond = PTHREAD_COND_INITIALIZER;
+pthread_cond_t qemu_system_cond = PTHREAD_COND_INITIALIZER;
 __thread struct vcpu_info *vcpu;
 
+static int qemu_system_ready;
+
 struct qemu_kvm_signal_table {
     sigset_t sigset;
     sigset_t negsigset;
@@ -53,6 +57,7 @@ struct vcpu_info {
     int stop;
     int stopped;
     int reload_regs;
+    int created;
 } vcpu_info[256];
 
 pthread_t io_thread;
@@ -324,6 +329,7 @@ static int kvm_main_loop_cpu(CPUState *env)
     struct vcpu_info *info = &vcpu_info[env->cpu_index];
 
     setup_kernel_sigmask(env);
+    pthread_mutex_lock(&qemu_mutex);
 
     if (kvm_irqchip_in_kernel(kvm_context))
         env->hflags &= ~HF_HALTED_MASK;
@@ -370,6 +376,17 @@ static void *ap_main_loop(void *_env)
     sigprocmask(SIG_BLOCK, &signals, NULL);
     kvm_create_vcpu(kvm_context, env->cpu_index);
     kvm_qemu_init_env(env);
+
+    /* signal VCPU creation */
+    pthread_mutex_lock(&qemu_mutex);
+    vcpu->created = 1;
+    pthread_cond_signal(&qemu_vcpu_cond);
+
+    /* and wait for machine initialization */
+    while (!qemu_system_ready)
+        pthread_cond_wait(&qemu_system_cond, &qemu_mutex);
+    pthread_mutex_unlock(&qemu_mutex);
+
     kvm_main_loop_cpu(env);
     return NULL;
 }
@@ -389,8 +406,9 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
 void kvm_init_new_ap(int cpu, CPUState *env)
 {
     pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
-    /* FIXME: wait for thread to spin up */
-    usleep(200);
+
+    while (vcpu_info[cpu].created == 0)
+        pthread_cond_wait(&qemu_vcpu_cond, &qemu_mutex);
 }
 
 static void qemu_kvm_init_signal_tables(void)
@@ -435,7 +453,11 @@ void qemu_kvm_notify_work(void)
 int kvm_main_loop(void)
 {
     io_thread = pthread_self();
+    qemu_system_ready = 1;
     pthread_mutex_unlock(&qemu_mutex);
+
+    pthread_cond_broadcast(&qemu_system_cond);
+
     while (1) {
         kvm_eat_signal(&io_signal_table, NULL, 1000);
         pthread_mutex_lock(&qemu_mutex);
From: SourceForge.net <no...@so...> - 2008-04-28 21:30:43
Bugs item #1952988, was opened at 2008-04-27 18:25
Message generated for change (Comment added) made by mtosatti
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1952988&group_id=180599

Category: kernel
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Carlo Marcelo Arenas Belon (carenas)
Assigned to: Nobody/Anonymous (nobody)
Summary: in-kernel PIT blocks DragonFlyBSD from booting

Initial Comment:
Starting with kvm-64, DragonFlyBSD guests won't boot unless -no-kvm-pit
or -no-kvm-irqchip is used (-no-kvm works as well):

cpu model:   Intel(R) Core(TM)2 CPU
kvm version: kvm-66 (kvm >= 64 fails, kvm <= 63 works)
host kernel: 2.6.24-gentoo-r4 (SMP)
host arch:   x86_64

To replicate, download the DragonFly ISO (it is a livecd) and boot from
it:

    kvm -curses -cdrom dfly-1.12.2_REL.iso -boot d

----------------------------------------------------------------------

Comment By: Marcelo Tosatti (mtosatti)
Date: 2008-04-28 17:30

DragonFlyBSD uses mode 4 of the PIT, which the in-kernel emulation simply
ignores. Mode 4 seems to be similar to one-shot mode, other than the fact
that mode 4 starts counting after the next CLK pulse once programmed,
while mode 1 starts counting immediately, so this _should_ be OK:

http://people.redhat.com/~mtosatti/kvm-i8254-mode4.patch

----------------------------------------------------------------------
From: Jerone Y. <jy...@us...> - 2008-04-28 21:30:00
In 2.6.26, wait is now enabled by default. With this, the /hypervisor
node no longer needs to be identified to enable the guest to go into the
wait state while idle.

Signed-off-by: Jerone Young <jy...@us...>

 qemu/hw/ppc440_bamboo.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/qemu/hw/ppc440_bamboo.c b/qemu/hw/ppc440_bamboo.c
--- a/qemu/hw/ppc440_bamboo.c
+++ b/qemu/hw/ppc440_bamboo.c
@@ -163,7 +163,6 @@ void bamboo_init(ram_addr_t ram_size, in
 	dt_cell(fdt, "/chosen", "linux,initrd-end",
 		(initrd_base + initrd_size));
 	dt_string(fdt, "/chosen", "bootargs", (char *)kernel_cmdline);
-	dt_node(fdt, "/", "hypervisor");
 #endif
 
 	if (kvm_enabled()) {
From: Jerone Y. <jy...@us...> - 2008-04-28 21:29:00
This set of patches contains fixes for the bamboo board model, as well as
providing more functionality for device tree manipulation.

Signed-off-by: Jerone Young <jy...@us...>

 qemu/hw/device_tree.c   | 16 ++++++++++++++++
 qemu/hw/device_tree.h   |  2 ++
 qemu/hw/ppc440_bamboo.c |  5 ++++-
 3 files changed, 22 insertions(+), 1 deletion(-)
From: Jerone Y. <jy...@us...> - 2008-04-28 21:28:59
This patch adds the function dt_cell_multi to allow manipulation of
device tree properties that contain multiple 32-bit values.

Signed-off-by: Jerone Young <jy...@us...>

 qemu/hw/device_tree.c | 16 ++++++++++++++++
 qemu/hw/device_tree.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/qemu/hw/device_tree.c b/qemu/hw/device_tree.c
--- a/qemu/hw/device_tree.c
+++ b/qemu/hw/device_tree.c
@@ -162,6 +162,22 @@ void dt_cell(void *fdt, char *node_path,
     }
 }
 
+/* This function is to manipulate a cell with multiple values */
+void dt_cell_multi(void *fdt, char *node_path, char *property,
+		   uint32_t *val_array, int size)
+{
+    int offset;
+    int ret;
+
+    offset = get_offset_of_node(fdt, node_path);
+    ret = fdt_setprop(fdt, offset, property, val_array, size);
+    if (ret < 0) {
+	printf("Unable to set device tree property '%s'\n",
+	       property);
+	exit(1);
+    }
+}
+
 void dt_string(void *fdt, char *node_path, char *property,
 	       char *string)
 {
diff --git a/qemu/hw/device_tree.h b/qemu/hw/device_tree.h
--- a/qemu/hw/device_tree.h
+++ b/qemu/hw/device_tree.h
@@ -19,6 +19,8 @@ void dump_device_tree_to_file(void *fdt,
 void dump_device_tree_to_file(void *fdt, char *filename);
 void dt_cell(void *fdt, char *node_path, char *property, uint32_t val);
+void dt_cell_multi(void *fdt, char *node_path, char *property,
+		   uint32_t *val_array, int size);
 void dt_string(void *fdt, char *node_path, char *property, char *string);
 void dt_node(void *fdt, char *node_parent_path, char *name);
From: Jerone Y. <jy...@us...> - 2008-04-28 21:26:48
This fixes an issue where the amount of memory is not properly defined in
the device tree. It is currently hardcoded to 144MB, so if you specify a
memory size below the hardcoded size, the guest crashes. This patch now
dynamically sets the device tree memory node to the value specified.

Signed-off-by: Jerone Young <jy...@us...>

 qemu/hw/ppc440_bamboo.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/qemu/hw/ppc440_bamboo.c b/qemu/hw/ppc440_bamboo.c
--- a/qemu/hw/ppc440_bamboo.c
+++ b/qemu/hw/ppc440_bamboo.c
@@ -50,6 +50,7 @@ void bamboo_init(ram_addr_t ram_size, in
 	int i=0, k=0;
 	uint32_t cpu_freq;
 	uint32_t timebase_freq;
+	uint32_t mem_reg_property[] = { 0, 0, ram_size };
 
 	printf("%s: START\n", __func__);
 
@@ -73,6 +74,7 @@ void bamboo_init(ram_addr_t ram_size, in
 		printf("WARNING: %i MB left over memory is ram\n",
 		       bytes_to_mb((int)tmp_ram_size));
 		ram_size -= tmp_ram_size;
+		mem_reg_property[2] = ram_size;
 	}
 
 	/* Setup CPU */
@@ -159,6 +161,8 @@ void bamboo_init(ram_addr_t ram_size, in
 	/* manipulate device tree in memory */
 	dt_cell(fdt, "/cpus/cpu@0", "clock-frequency", cpu_freq);
 	dt_cell(fdt, "/cpus/cpu@0", "timebase-frequency", timebase_freq);
+	dt_cell_multi(fdt, "/memory", "reg", mem_reg_property,
+		      sizeof(mem_reg_property));
 	dt_cell(fdt, "/chosen", "linux,initrd-start", initrd_base);
 	dt_cell(fdt, "/chosen", "linux,initrd-end",
 		(initrd_base + initrd_size));
From: nadim k. <na...@kh...> - 2008-04-28 20:59:20
Hi,

Great work. While playing with kvm-qemu I noticed a few points that might
be of interest:

1/ -loadvm and -snapshot don't work together. It works as if -loadvm
   wasn't passed as an argument.

2/ Two instances of kvm can be passed the same -hda. There is no locking
   whatsoever. This messes things up seriously.

3/ Trying to run 'savevm' in the qemu console when -usb is used results
   in:

       (qemu) savevm scite
       (qemu) exception 13 (0)
       rax 0000000000000000 rbx 0000000000000000
       rcx 0000000000000010 rdx 0000000000000000
       ...

   This is documented, but a warning in the console would be better than
   a crash. If the vm is stopped first, 'savevm' works, but it then
   crashes on 'cont' instead.

4/ If -std-vga is used when doing a 'savevm', 'loadvm' restores a black
   screen. Everything is there, and with some gymnastics (moving a
   window around) the screen is like it should be.

5/ -usbdevice tablet is a must; 'ctl+alt' is just too painful! Is it
   possible to get the same effect (with another system) and still be
   able to 'savevm'?

6/ If you use -usbdevice tablet, the keyboard is first handled by the
   guest OS. In my case I have 'alt F4' close windows in the host OS. If
   I try to close a window in the guest OS with 'alt f4', it closes qemu
   altogether.

7/ On the other hand, mouse events are _not_ handled by the guest OS
   first; i.e., alt-click isn't handled by X but by Windows (in this
   case).

8/ Keyboard input is lost when switching to full screen or back; fixed
   by using 'ctl+alt' twice.

9/ IMHO, the way "versioning" with 'savevm' is done could feel more
   natural:

   first run
   time ------------------------------> stop VM
             ^                             |
             |                             v
        savevm state1       automatically save "HEAD" in -hda

   second run
   time ------------------------------> stop VM
             ^                             |
             |                             v
        loadvm state1       automatically save "HEAD" in -hda

   I believe most people want to save in 'state1', or possibly in
   'state2', but few want to override "HEAD". Automatically overriding
   'state1' feels as wrong as overriding "HEAD". I believe that a -savevm
   option to qemu would be a good idea: if nothing is passed as an
   argument, "HEAD" is used. That would preserve "HEAD" and allow saving
   to a user-defined vm snapshot.

10/ Subscription to the mailing list doesn't seem to work.

Cheers,
Nadim.

# what kvm version you are using:
qemu 0.9.1

# the cpu, arch and host kernel version:
Linux 2.6.24 #9 SMP x86_64 Intel(R) Core(TM)2 Quad CPU @ 2.66GHz
GenuineIntel GNU/Linux

# what guest you are using, including OS type:
Windows XP SP2 (32 I guess)

# the qemu command line you are using to start the guest:
qemu-system-x86_64 -no-acpi -m 1536 -hda overlay.img -boot c -monitor
stdio -usb -smp 2 -std-vga -smb /home/nadim/ -k fr
From: Joerg R. <joe...@am...> - 2008-04-28 20:43:17
On Mon, Apr 28, 2008 at 07:35:10PM +0200, Jan Kiszka wrote:
> Hi,
>
> sorry, the test environment is not really reproducible (stock kvm-66,
> yet unpublished NMI support by Sheng Yang and me, special guest), but
> I'm just fishing for some ideas on what may cause the flood of the
> following warning in my kernel log:
>
> ------------[ cut here ]------------
> WARNING: at /data/kvm-66/kernel/x86.c:180
> kvm_queue_exception_e+0x30/0x54 [kvm]()
> Modules linked in: ipt_MASQUERADE kvm_intel kvm bridge tun ip6t_LOG
> nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit snd_pcm_oss snd_mixer_oss
> snd_seq snd_seq_device nls_utf8 cifs af_packet ip6t_REJECT xt_tcpudp
> ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter
> ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter
> ip6_tables cpufreq_conservative x_tables cpufreq_userspace
> cpufreq_powersave acpi_cpufreq ipv6 microcode fuse ohci_hcd loop rfcomm
> l2cap wlan_scan_sta ath_rate_sample ath_pci snd_hda_intel wlan pcmcia
> firmware_class hci_usb snd_pcm snd_timer ath_hal(P) sdhci battery
> bluetooth button ohci1394 mmc_core rtc_cmos parport_pc intel_agp
> rtc_core dock ac snd_page_alloc iTCO_wdt ieee1394 sky2 rtc_lib
> yenta_socket parport snd_hwdep snd iTCO_vendor_support i2c_i801
> rsrc_nonstatic pcmcia_core sg i2c_core soundcore serio_raw joydev
> sha256_generic aes_x86_64 aes_generic cbc dm_crypt crypto_blkcipher
> usbhid hid ff_memless sd_mod ehci_hcd uhci_hcd usbcore dm_snapshot
> dm_mod edd ext3 mbcache jbd fan ata_piix ahci libata scsi_mod thermal
> processor
> Pid: 4718, comm: qemu-system-x86 Tainted: P N
> 2.6.25-rc5-git2-109.8-default #1
>
> Call Trace:
>  [<ffffffff8020d826>] dump_trace+0xc4/0x576
>  [<ffffffff8020dd18>] show_trace+0x40/0x57
>  [<ffffffff8044e341>] _etext+0x72/0x7b
>  [<ffffffff80238137>] warn_on_slowpath+0x58/0x80
>  [<ffffffff886e2e05>] :kvm:kvm_queue_exception_e+0x30/0x54
>  [<ffffffff886e3678>] :kvm:kvm_task_switch+0xca/0x20a
>  [<ffffffff8870d096>] :kvm_intel:handle_task_switch+0x19/0x1b
>  [<ffffffff8870cb1b>] :kvm_intel:kvm_handle_exit+0x7f/0x9c
>  [<ffffffff886e51e2>] :kvm:kvm_arch_vcpu_ioctl_run+0x49b/0x686
>  [<ffffffff886e08c9>] :kvm:kvm_vcpu_ioctl+0xf7/0x3ca
>  [<ffffffff802ad0ba>] vfs_ioctl+0x2a/0x78
>  [<ffffffff802ad34f>] do_vfs_ioctl+0x247/0x261
>  [<ffffffff802ad3be>] sys_ioctl+0x55/0x77
>  [<ffffffff8020c18a>] system_call_after_swapgs+0x8a/0x8f
>  [<00007faed2969267>]
>
> ---[ end trace 5d286714f3c5c50f ]---

Hmm, seems we have to check for DF and triple faults in the
kvm_queue_exception functions too. Does the attached patch fix the
problem? (The patch is against kvm-66.)

Joerg

--
| AMD Saxony Limited Liability Company & Co. KG
| Operating: Wilschdorfer Landstr. 101, 01109 Dresden, Germany
| System: Register Court Dresden: HRA 4896
| Research: General Partner authorized to represent:
| Center: AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
From: Christoph L. <cla...@sg...> - 2008-04-28 20:40:46
On Sun, 27 Apr 2008, Andrea Arcangeli wrote:

> Talking about post 2.6.26: the refcount with rcu in the anon-vma
> conversion seems unnecessary and may explain part of the AIM slowdown
> too. The rest looks ok and probably we should switch the code to a
> compile-time decision between rwlock and rwsem (so obsoleting the
> current spinlock).

You are going to take a semaphore in an rcu section? Guess you did not
activate all debugging options while testing? I was not aware that you
can take a sleeping lock from a non-preemptible context.
From: SourceForge.net <no...@so...> - 2008-04-28 20:12:29
|
Bugs item #1895893, was opened at 2008-02-18 01:44 Message generated for change (Comment added) made by alfmel You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1895893&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: None Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Technologov (technologov) Assigned to: Nobody/Anonymous (nobody) Summary: KVM-60+ halts, when using SCSI Initial Comment: Host: Intel CPU, F7/x64, KVM-60+ from git (userspace: kvm-60-155-g4422f97, kernelspace: kvm-60-10207-g9ef1f35) When installing Windows XP guest on emulated SCSI disk, KVM lock ups. The Command sent to Qemu/KVM: /usr/local/bin/qemu-system-x86_64 -drive file=/vm/WindowsXP.qcow2,if=scsi,boot=on -m 128 -monitor tcp:localhost:4503,server,nowait -cdrom /isos/windows/WindowsXP-SP2-Home-Pro-Tablet.iso -boot d -name WindowsXP Reproducible: Sometimes. Symptons: -The image during XP setup looks halted/locked, and no progress over 12 hours. -kvm_stat shows zero KVM activity. -Host CPU is 100% busy. -Qemu doesn't responds to any commands (such as alt+f2). 
GNU Debugger shows: (gdb) bt #0 lsi_execute_script (s=0x2bed030) at ../cpu-all.h:848 #1 0x000000000048a2e9 in qcow_aio_write_cb (opaque=0x2c8a050, ret=0) at block-qcow2.c:947 #2 0x000000000041898f in qemu_aio_poll () at /root/git/kvm/qemu/block-raw-posix.c:318 #3 0x000000000040de3c in main_loop_wait (timeout=0) at /root/git/kvm/qemu/vl.c:7822 #4 0x00000000004fd81d in kvm_eat_signals (env=0x2b52400, timeout=0) at /root/git/kvm/qemu/qemu-kvm.c:204 #5 0x00000000004fd859 in kvm_main_loop_wait (env=0x2b52400, timeout=0) at /root/git/kvm/qemu/qemu-kvm.c:211 #6 0x00000000004fe0a6 in kvm_main_loop_cpu (env=0x2b52400) at /root/git/kvm/qemu/qemu-kvm.c:309 #7 0x0000000000410e3d in main (argc=<value optimized out>, argv=0x7fff06235728) at /root/git/kvm/qemu/vl.c:7856 ==================================================== Dmesg shows: apic write: bad size=1 fee00030 Ignoring de-assert INIT to vcpu 0 Ignoring de-assert INIT to vcpu 0 apic write: bad size=1 fee00030 Ignoring de-assert INIT to vcpu 0 Ignoring de-assert INIT to vcpu 0 ...looping forever. -Alexey "Technologov", 18.02.2008. ---------------------------------------------------------------------- Comment By: Alf Mel (alfmel) Date: 2008-04-28 14:11 Message: Logged In: YES user_id=1865908 Originator: NO OK. I've applied the matley patch and your debug patch to KVM 66. I've also been able to reproduce the problem on a raw SCSI disk while installing Windows 2003. You can find the log at: http://mel.byu.edu/kvm-scsi-debug.log.bz2 ---------------------------------------------------------------------- Comment By: Marcelo Tosatti (mtosatti) Date: 2008-04-26 17:45 Message: Logged In: YES user_id=2022487 Originator: NO Alexey, Alberto, I'm unable to reproduce the problem with the Linux driver. The Windows SCSI SCRIPTS is different so that might the reason. The state machine is relatively complex depending on this SCRIPTS code. Please try the following: 1 - Attempt to reproduce the problem with raw disk instead of qcow2. 
2 - Apply matley's patch below, and on top of that, this debug patch: http://people.redhat.com/~mtosatti/lsi-debug-crash.patch And then run qemu-kvm as usual, but redirect stderr output to a file: # qemu-kvm options 2> log-scsi-crash.txt Once the crash happens, there should be a pattern that repeats in this output. With that information its easier to understand what is going on. Thanks. ---------------------------------------------------------------------- Comment By: Alf Mel (alfmel) Date: 2008-04-11 16:46 Message: Logged In: YES user_id=1865908 Originator: NO I've confirmed the problem with KVM-65 as well. I applied the patch but it didn't work; I still experienced lockups. I am trying to install Windows Server 2003 on a SCSI disk and the installation keeps locking up on different parts of the file copy process. I'm using qcow2 disk format. I tried using raw format and it would lock up consistently when formatting the disk. I have tried installing W2K3 at least a dozen times with the same lockups. As part of my configuration, I move the monitor to run on a telnet server. When the lockup occurs, I can't connect to the monitor via telnet. I am also experiencing boot problems with Grub on SCSI disks. I reported the problem on the mailing list: http://article.gmane.org/gmane.comp.emulators.kvm.devel/15884 I don't know if the problems are related. ---------------------------------------------------------------------- Comment By: lanconnected (lanconnected) Date: 2008-04-08 10:17 Message: Logged In: YES user_id=2041746 Originator: NO Applied proposed patch on kvm-65. Windows XP Pro can be installed on scsi disk and boots up, but hangs unpredictably during disk activity. SDL windows can't be closed, kvm can only be killed with kill -9. ---------------------------------------------------------------------- Comment By: Matteo Frigo (matley) Date: 2008-03-30 06:58 Message: Logged In: YES user_id=35769 Originator: NO The bug seems to have nothing to do with Windows. 
You can reproduce the bug in kvm-63 and kvm-64 by creating an empty qcow2 scsi disk and running ``dd if=/dev/sda of=/dev/null bs=1M'' in linux. The patch below seems to fix the problem (at least with linux, I haven't tried Windows). If I understand the AIO layer correctly, scsi_read_data() and scsi_write_data() can be called again before the bdrv_aio_read call returns. If this happens, the original code reissues the same request twice, which is incorrect. The patch increments the read/writer counters before invoking the AIO layer. diff -aur kvm-64.old/qemu/hw/scsi-disk.c kvm-64.new/qemu/hw/scsi-disk.c --- kvm-64.old/qemu/hw/scsi-disk.c 2008-03-26 08:49:35.000000000 -0400 +++ kvm-64.new/qemu/hw/scsi-disk.c 2008-03-30 08:37:25.000000000 -0400 @@ -196,12 +196,12 @@ n = SCSI_DMA_BUF_SIZE / 512; r->buf_len = n * 512; - r->aiocb = bdrv_aio_read(s->bdrv, r->sector, r->dma_buf, n, + r->sector += n; + r->sector_count -= n; + r->aiocb = bdrv_aio_read(s->bdrv, r->sector - n, r->dma_buf, n, scsi_read_complete, r); if (r->aiocb == NULL) scsi_command_complete(r, SENSE_HARDWARE_ERROR); - r->sector += n; - r->sector_count -= n; } static void scsi_write_complete(void * opaque, int ret) @@ -248,12 +248,12 @@ BADF("Data transfer already in progress\n"); n = r->buf_len / 512; if (n) { - r->aiocb = bdrv_aio_write(s->bdrv, r->sector, r->dma_buf, n, + r->sector += n; + r->sector_count -= n; + r->aiocb = bdrv_aio_write(s->bdrv, r->sector - n, r->dma_buf, n, scsi_write_complete, r); if (r->aiocb == NULL) scsi_command_complete(r, SENSE_HARDWARE_ERROR); - r->sector += n; - r->sector_count -= n; } else { /* Invoke completion routine to fetch data from host. */ scsi_write_complete(r, 0); ---------------------------------------------------------------------- Comment By: lanconnected (lanconnected) Date: 2008-03-20 13:23 Message: Logged In: YES user_id=2041746 Originator: NO Can confirm it on kvm-63, 100% reproducible, same symptoms. 
System can be installed and always boots in safe mode, but never boots in normal mode. ACPI/noACPI settings have no influence.

----------------------------------------------------------------------

Comment By: Technologov (technologov)
Date: 2008-02-18 03:21

Message:
Logged In: YES
user_id=1839746
Originator: YES

ps axu:

alexeye 21429 84.2 4.1 296740 166712 pts/4 Rl+ 04:40 16:22 /usr/local/bin/qemu-system-x86_64 -drive file=/vm/WindowsXP.qcow2,if=scsi,boot=on -m 128 -monitor tcp:localhost:4503,server,nowait -cdrom /isos/windows/WindowsXP-SP2-Home-Pro-Tablet.iso -boot c -name WindowsXP-SCSI-manual -no-kvm

Another symptom I forgot to mention: Qemu (both KVM and -no-kvm) cannot be killed by pressing "X" on the SDL window, only by doing Ctrl+C on the console.

Does anyone know what "Rl+" means in the "ps" command output?

-Alexey "Technologov", 18.02.2008.

----------------------------------------------------------------------

Comment By: Technologov (technologov)
Date: 2008-02-18 03:18

Message:
Logged In: YES
user_id=1839746
Originator: YES

Well, the same problem is reproducible with Qemu (-no-kvm): same symptoms.

(gdb) bt
#0  0x000000000048ea9d in cpu_physical_memory_rw (addr=72552, buf=0x7fff26397b70 "???\200", len=4, is_write=0) at /root/git/kvm/qemu/exec.c:2682
#1  0x000000000041b0db in lsi_execute_script (s=0x2bed030) at ../cpu-all.h:848
#2  0x000000000048a2e9 in qcow_aio_write_cb (opaque=0x2bcefa0, ret=0) at block-qcow2.c:947
#3  0x000000000041898f in qemu_aio_poll () at /root/git/kvm/qemu/block-raw-posix.c:318
#4  0x000000000040de3c in main_loop_wait (timeout=10) at /root/git/kvm/qemu/vl.c:7822
#5  0x0000000000410d97 in main (argc=<value optimized out>, argv=0x7fff2639c858) at /root/git/kvm/qemu/vl.c:7926

-Alexey "Technologov", 18.02.2008.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1895893&group_id=180599 |
From: Joerg R. <jo...@8b...> - 2008-04-28 20:06:39
|
On Mon, Apr 28, 2008 at 06:50:22PM +0200, Jan Luebbe wrote: > Hi! > > I'm preparing kvm-67 for debian. While testing i noticed a problem: > > When booting the debian installer from the official CD [1] this problem: > > CPU: L1 I cache: 32K, L1 D cache: 32K > CPU: L2 cache: 2048K > Compat vDSO mapped to ffffe000. > CPU: Intel QEMU Virtual CPU version 0.9.1 stepping 03 > Checking 'hlt' instruction... OK. > ACPI: Core revision 20060707 > invalid opcode: 0000 [#1] > Modules linked in: > CPU: 0 > EIP: 0060:[<c01467be>] Not tainted VLI > EFLAGS: 00010202 (2.6.18-6-486 #1) > EIP is at kmem_cache_zalloc+0x2a/0x53 > eax: 0000000a ebx: c7fe75c0 ecx: c7fe9e00 edx: 000000d0 > esi: c02c50c0 edi: 00000202 ebp: c036bd20 esp: c030ff80 > ds: 007b es: 007b ss: 0068 > Process swapper (pid: 0, ti=c030e000 task=c02bd7a0 task.ti=c030e000) > Stack: 00000004 c028f968 c029c49a c0146d5b 00000004 00000000 00000014 > c029c499 > 00000046 c030ffc4 00000046 00000046 00000000 00000000 00039100 > c0302800 > 003a7007 c01c7e90 00000000 00000000 00000000 c01db3b5 c0378ce8 > c01dcf07 > Call Trace: > [<c0146d5b>] kmem_cache_create+0x15e/0x410 > [<c01c7e90>] acpi_os_create_cache+0x10/0x1c > [<c01db3b5>] acpi_ut_create_caches+0x19/0x93 > [<c01dcf07>] acpi_ut_init_globals+0x5/0x1de > [<c01dc5d1>] acpi_initialize_subsystem+0x1b/0x56 > [<c0323a73>] acpi_early_init+0x45/0xfe > [<c03105f5>] start_kernel+0x26b/0x272 > Code: c3 57 56 53 89 c6 9c 5f fa 8b 08 83 39 00 74 12 c7 41 0c 01 00 00 > 00 8b 01 > 48 89 01 8b 5c 81 10 eb 07 e8 a5 fb ff ff 89 c3 57 9d <0f> 0d 0b 90 85 > db 74 1b > 8b 56 10 31 c0 89 d1 c1 e9 02 89 df f3 > EIP: [<c01467be>] kmem_cache_zalloc+0x2a/0x53 SS:ESP 0068:c030ff80 > <0>Kernel panic - not syncing: Attempted to kill the idle task! I tried to reproduce this on an AMD system with no success. But when looking into the code of kmem_cache_zalloc this looks like a guest state corruption. The guest disables interrupts and the hypervisor reenables them which triggers the BUG() macro. 
Maybe kvmtrace can give a hint which intercept causes this. Joerg |
From: Anthony L. <an...@co...> - 2008-04-28 20:02:54
|
Ryan Harper wrote: > * Avi Kivity <av...@qu...> [2008-04-26 02:23]: > >> Please reuse qemu_mutex for this, no need for a new one. >> > > I'm having a little trouble wrapping my head around all of the locking > here. If I avoid qemu_mutex and use a new one, I've got everything > working. However, attemping to use qemu_mutex is stepping into a pile > of locking crap. I'm sure there is a good reason... > > The current code looks like this: > > Thread1: > main() > kvm_qemu_init() // mutex_lock() > machine->init() > pc_init1() > pc_new_cpu() > cpu_init() > cpu_x86_init() > kvm_init_new_ap() // create vcpu Thread2 > <-- > <-- > <-- > <-- > <-- > <-- > kvm_main_loop() // mutex_unlock() > > Thread2: > ap_main_loop() > /* vcpu init */ > kvm_main_loop_cpu() > kvm_main_loop_wait() // mutex_unlock() on enter, lock on exit > kvm_eat_signals() // mutex_lock() on enter, unlock on exit > <-- > <-- > <-- > The qemu_mutex is meant to ensure that the QEMU code is only ever entered by one thread at a time. QEMU is not thread-safe so this is necessary. It's a little odd because we're taking this lock initially by being called from QEMU code. Here's the basic theory of locking AFAICT: kvm_main_loop_cpu() is the main loop for each VCPU. It must acquire the QEMU mutex since it will call into normal QEMU code to process events. Whenever a VCPU is allowed to run, or when the VCPU is idling, it needs to release the QEMU mutex. The former is done via the post_kvm_run() and pre_kvm_run() hooks. The later is done within kvm_main_loop_wait(). kvm_main_loop_wait() will release the QEMU mutex and call kvm_eat_signals() which calls kvm_eat_signal() which will issue a sigtimedwait(). This is where we actually idle (and why SIGIO is so important right now). We don't want to idle with the QEMU mutex held as that may result in dead lock so this is why we release it here. kvm_eat_signal() has to acquire the lock again in order to dispatch IO events (via kvm_process_signal()). 
Regards, Anthony Liguori > It wedges up in kvm_init_new_ap() if I attempt acquire qemu_mutex. > Quite obvious after I looked at the call trace and discovered > kvm_qemu_init() locking on exit. I see other various functions that > unlock and then lock; I really don't want to wade into this mess... > rather whomever cooked it up should do some cleanup. I tried the > unlock, then re-lock on exit in kvm_init_new_ap() but that also wedged. > > Here is a rework with a new flag in vcpu_info indicating vcpu > creation. Tested this with 64 1VCPU guests booting with 1 second delay, > and single 16-way SMP guest boot. > > |
From: Marcelo T. <mto...@re...> - 2008-04-28 19:25:54
|
On Thu, Apr 24, 2008 at 10:37:04AM +0200, Gerd Hoffmann wrote:
> Hi folks,
>
> My first attempt to send out a patch series with git ...
>
> The patches fix the kvm paravirt clocksource code to be compatible with
> xen and they also factor out some code which can be shared into
> separate source files used by both kvm and xen.

The issue with SMP guests is still present. Booting with "nohz=off" resolves it.

Same symptoms as before: apic_timer_fn for one of the vcpus is ticking way slower than the remaining ones:

[root@localhost ~]# cat /proc/timer_stats | grep apic
  391,  4125 qemu-system-x86  apic_mmio_write (apic_timer_fn)
 2103,  4126 qemu-system-x86  apic_mmio_write (apic_timer_fn)
 1896,  4127 qemu-system-x86  apic_mmio_write (apic_timer_fn)
 1857,  4128 qemu-system-x86  apic_mmio_write (apic_timer_fn)

Let me know what else is needed, or any patches to try. |
From: Jerone Y. <jy...@us...> - 2008-04-28 19:09:18
|
kernel/Makefile |   38 +++++++++++++++++++++++++-------------
1 file changed, 25 insertions(+), 13 deletions(-)

This patch removes static x86 entries and makes things work for multiple archs.

Signed-off-by: Jerone Young <jy...@us...>

diff --git a/kernel/Makefile b/kernel/Makefile
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -1,5 +1,10 @@
 include ../config.mak
 include ../config.mak
+ARCH_DIR=$(ARCH)
+ifneq '$(filter $(ARCH_DIR), x86_64 i386)' ''
+    ARCH_DIR=x86
+endif
+
 KVERREL = $(patsubst /lib/modules/%/build,%,$(KERNELDIR))
 
 DESTDIR=
@@ -18,10 +23,19 @@
 _hack = mv $1 $1.orig && \
 	| sed '/\#include/! s/\blapic\b/l_apic/g' > $1 && rm $1.orig
 
 unifdef = mv $1 $1.orig && \
-	unifdef -DCONFIG_X86 $1.orig > $1; \
+	unifdef -DCONFIG_$(shell echo $(ARCH_DIR)|tr '[:lower:]' '[:upper:]') $1.orig > $1; \
 	[ $$? -le 1 ] && rm $1.orig
 
 hack = $(call _hack,$T/$(strip $1))
+
+ifneq '$(filter $(ARCH_DIR), x86)' ''
+HACK_FILES = kvm_main.c \
+	mmu.c \
+	vmx.c \
+	svm.c \
+	x86.c \
+	irq.h
+endif
 
 all::
 
 # include header priority 1) $LINUX 2) $KERNELDIR 3) include-compat
@@ -49,21 +63,19 @@
 header-sync:
 	rm -rf $T
 	rm -f include/asm
-	ln -sf asm-x86 include/asm
-	ln -sf asm-x86 include-compat/asm
+	ln -sf asm-$(ARCH_DIR) include/asm
+	ln -sf asm-$(ARCH_DIR) include-compat/asm
 
 source-sync:
 	rm -rf $T
 	rsync --exclude='*.mod.c' -R \
-	     "$(LINUX)"/arch/x86/kvm/./*.[ch] \
-	     "$(LINUX)"/virt/kvm/./*.[ch] \
-	     $T/
-	$(call hack, kvm_main.c)
-	$(call hack, mmu.c)
-	$(call hack, vmx.c)
-	$(call hack, svm.c)
-	$(call hack, x86.c)
-	$(call hack, irq.h)
+	     "$(LINUX)"/arch/$(ARCH_DIR)/kvm/./*.[ch] \
+	     "$(LINUX)"/virt/kvm/./*.[ch] \
+	     $T/
+
+	for i in $(HACK_FILES); \
+		do $(call hack, $$i); done
+
 	for i in $$(find $T -type f -printf '%P '); \
 		do cmp -s $$i $T/$$i || cp $T/$$i $$i; done
 	rm -rf $T
@@ -72,7 +84,7 @@
 install:
 	mkdir -p $(DESTDIR)/$(INSTALLDIR)
 	cp *.ko $(DESTDIR)/$(INSTALLDIR)
 	for i in $(ORIGMODDIR)/drivers/kvm/*.ko \
-		 $(ORIGMODDIR)/arch/x86/kvm/*.ko; do \
+		 $(ORIGMODDIR)/arch/$(ARCH_DIR)/kvm/*.ko; do \
 		if [ -f "$$i" ]; then mv "$$i" "$$i.orig"; fi; \
 	done
 	/sbin/depmod -a |
From: Marcelo T. <mto...@re...> - 2008-04-28 18:12:57
|
On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
>
>     for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>         gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>         pte_gpa += (i+offset) * sizeof(pt_element_t);
>
>         r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>                                   sizeof(pt_element_t));
>         if (r || is_present_pte(pt))
>             sp->spt[i] = shadow_trap_nonpresent_pte;
>         else
>             sp->spt[i] = shadow_notrap_nonpresent_pte;
>     }
>
> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
> loop.
>
> This function gets run >20,000/sec during some of the kscand loops.

Hi David,

Do you see the mmu_recycled counter increase? |
From: Andrea A. <an...@qu...> - 2008-04-28 18:10:33
|
On Mon, Apr 28, 2008 at 11:11:56AM -0500, Anthony Liguori wrote: > Here's my thinking as to why we don't want to destroy the VM in the mmu > notifiers ->release method. I don't have a valid use-case for this but my > argument depends on the fact that this is something that should work. > Daemonizing a running VM may be a reasonable use-case. It's useful to wait > to daemonize until you are sure that everything is working correctly so > it's not all that unreasonable to move the daemonize until after the VCPUs > have been launched. > > If you take a running VM, and pause all of the VCPUs, and then issue a > fork() followed by an immediate exit() in the parent process, the child > process should be able to unpause all the VCPUs and the guest should > continue running uninterrupted. Fork shares nothing, the only thing that don't need to be duplicated and copied are the ones that can be marked wrprotect in the pagetables. > From KVM's perspective, issuing the fork() will increment the reference > count of the file descriptor for the VM but otherwise, no real change > should happen. The issue would now be that we must completely flush the > shadow page table cache. In theory, MMU notifiers should do this for us. Let's ignore sptes for now. What you miss for something like this to remotely work is to copy all memslots so an ioctl issued on the parent won't break the child too. If you want to ever support the above you have to change fork() so that fork() will also mmu_notifier_register in the child, copy memslots, patch all kvm->mm, and do all other things needed for this to work. Then yes, it'll work, but not because exit_mmap didn't shutdown the VM of the parent, but because you won't notice anymore that the vm in the parent was _entirely_ shutdown by ->release. > When the parent process exits, this will result in exit_mmap() and will > destroy the KVM guest. This leaves the child process with a file > descriptor that refers to a VM that is no longer valid. 
Rightfully so, as nothing could possibly work anymore in that VM. If it works in the child, it would only be because you copied all memslots in the child, did many other things, and changed the fd to work on a different kvm structure for parent and child. You have to copy the memslots, or the parent couldn't run at the same time as the child.

> Just avoiding destroying the VM in the ->release() method won't fix this
> use-case I don't think. In general, I think we need to think a little more
> about how fork() is handled with respect to mmu notifiers.

This has very little to do with mmu notifiers. Let's make this work first:

	if (fork()) {
		ioctl(delete a memslot)
	} else {
		ioctl(create a memslot)
	}

Those two ops won't invoke mmu notifiers at all, and there's no chance the above will work right now.

In short, you think fork()+exit() should allow the child to keep working with the guest of the parent, when in fact they both need to work on different guests; and if they do, you won't notice ->release of the parent shutting down the parent VM (just in case it runs exit()).

Thanks! |
From: Jan K. <jan...@si...> - 2008-04-28 17:59:21
|
Avi Kivity wrote:
> Marcelo Tosatti wrote:
>> Valgrind caught this:
>>
>> ==11754== Conditional jump or move depends on uninitialised value(s)
>> ==11754==    at 0x50C9BC: kvm_create_pit (libkvm-x86.c:153)
>> ==11754==    by 0x50CA7F: kvm_arch_create (libkvm-x86.c:178)
>> ==11754==    by 0x50AB31: kvm_create (libkvm.c:383)
>> ==11754==    by 0x4EE691: kvm_qemu_create_context (qemu-kvm.c:616)
>> ==11754==    by 0x412031: main (vl.c:9653)
>>
>
> Applied, thanks.  Isn't valgrind great?

Yeah, it is. Reminds me of another warning I recently came across (offsets may vary due to other patches):

==5801== 1 errors in context 1 of 2:
==5801== Conditional jump or move depends on uninitialised value(s)
==5801==    at 0x53F4AE: kvm_register_userspace_phys_mem (libkvm.c:552)
==5801==    by 0x521ACA: kvm_cpu_register_physical_memory (qemu-kvm.c:654)
==5801==    by 0x45FC82: pc_init1 (pc.c:809)
==5801==    by 0x461313: pc_init_pci (pc.c:1149)
==5801==    by 0x43081B: main (vl.c:9845)

This silences valgrind and may even be correct (if I got the code path right):

Signed-off-by: Jan Kiszka <jan...@si...>

--- a/libkvm/libkvm.c
+++ b/libkvm/libkvm.c
@@ -328,9 +328,10 @@ static int kvm_create_default_phys_mem(k
 
 #ifdef KVM_CAP_USER_MEMORY
 	r = ioctl(kvm->fd, KVM_CHECK_EXTENSION, KVM_CAP_USER_MEMORY);
-	if (r > 0)
+	if (r > 0) {
+		kvm->physical_memory = NULL;
 		return 0;
-	else
+	} else
 #endif
 		r = kvm_alloc_kernel_memory(kvm, memory, vm_mem);
 	if (r < 0) |
From: Jan K. <jan...@si...> - 2008-04-28 17:35:32
|
Hi, sorry, the test environment is not really reproducible (stock kvm-66, yet unpublished NMI support by Sheng Yang and me, special guest), but I'm just fishing for some ideas on what may cause the flood of the following warning in my kernel log: ------------[ cut here ]------------ WARNING: at /data/kvm-66/kernel/x86.c:180 kvm_queue_exception_e+0x30/0x54 [kvm]() Modules linked in: ipt_MASQUERADE kvm_intel kvm bridge tun ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device nls_utf8 cifs af_packet ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter ip6_tables cpufreq_conservative x_tables cpufreq_userspace cpufreq_powersave acpi_cpufreq ipv6 microcode fuse ohci_hcd loop rfcomm l2cap wlan_scan_sta ath_rate_sample ath_pci snd_hda_intel wlan pcmcia firmware_class hci_usb snd_pcm snd_timer ath_hal(P) sdhci battery bluetooth button ohci1394 mmc_core rtc_cmos parport_pc intel_agp rtc_core dock ac snd_page_alloc iTCO_wdt ieee1394 sky2 rtc_lib yenta_socket parport snd_hwdep snd iTCO_vendor_support i2c_i801 rsrc_nonstatic pcmcia_core sg i2c_core soundcore serio_raw joydev sha256_generic aes_x86_64 aes_generic cbc dm_crypt crypto_blkcipher usbhid hid ff_memless sd_mod ehci_hcd uhci_hcd usbcore dm_snapshot dm_mod edd ext3 mbcache jbd fan ata_piix ahci libata scsi_mod thermal processor Pid: 4718, comm: qemu-system-x86 Tainted: P N 2.6.25-rc5-git2-109.8-default #1 Call Trace: [<ffffffff8020d826>] dump_trace+0xc4/0x576 [<ffffffff8020dd18>] show_trace+0x40/0x57 [<ffffffff8044e341>] _etext+0x72/0x7b [<ffffffff80238137>] warn_on_slowpath+0x58/0x80 [<ffffffff886e2e05>] :kvm:kvm_queue_exception_e+0x30/0x54 [<ffffffff886e3678>] :kvm:kvm_task_switch+0xca/0x20a [<ffffffff8870d096>] :kvm_intel:handle_task_switch+0x19/0x1b [<ffffffff8870cb1b>] :kvm_intel:kvm_handle_exit+0x7f/0x9c [<ffffffff886e51e2>] 
:kvm:kvm_arch_vcpu_ioctl_run+0x49b/0x686 [<ffffffff886e08c9>] :kvm:kvm_vcpu_ioctl+0xf7/0x3ca [<ffffffff802ad0ba>] vfs_ioctl+0x2a/0x78 [<ffffffff802ad34f>] do_vfs_ioctl+0x247/0x261 [<ffffffff802ad3be>] sys_ioctl+0x55/0x77 [<ffffffff8020c18a>] system_call_after_swapgs+0x8a/0x8f [<00007faed2969267>] ---[ end trace 5d286714f3c5c50f ]--- I'm suspecting that it is the way our guest raises a triple fault in order to initiate a restart. At least it tells us via virtual console that it wants to restart, and those messages start around the same time. So, while waiting for my colleagues to dig out the precise triple-fault code pattern (for a cleaner test case), maybe someone could comment on potential reasons for this warning - or even ways to resolve them. Thanks! Jan -- Siemens AG, Corporate Technology, CT SE 2 Corporate Competence Center Embedded Linux |
From: Ryan H. <ry...@us...> - 2008-04-28 17:16:18
|
* Avi Kivity <av...@qu...> [2008-04-26 02:23]:
>
> Please reuse qemu_mutex for this, no need for a new one.

I'm having a little trouble wrapping my head around all of the locking here. If I avoid qemu_mutex and use a new one, I've got everything working. However, attempting to use qemu_mutex is stepping into a pile of locking crap. I'm sure there is a good reason...

The current code looks like this:

Thread1:
  main()
    kvm_qemu_init()              // mutex_lock()
    machine->init()
      pc_init1()
        pc_new_cpu()
          cpu_init()
            cpu_x86_init()
              kvm_init_new_ap()  // create vcpu Thread2
            <--
          <--
        <--
      <--
    <--
  <--
  kvm_main_loop()                // mutex_unlock()

Thread2:
  ap_main_loop()
    /* vcpu init */
    kvm_main_loop_cpu()
      kvm_main_loop_wait()       // mutex_unlock() on enter, lock on exit
        kvm_eat_signals()        // mutex_lock() on enter, unlock on exit
      <--
    <--
  <--

It wedges up in kvm_init_new_ap() if I attempt to acquire qemu_mutex. Quite obvious after I looked at the call trace and discovered kvm_qemu_init() locking on exit. I see other various functions that unlock and then lock; I really don't want to wade into this mess... rather, whomever cooked it up should do some cleanup. I tried the unlock, then re-lock on exit in kvm_init_new_ap(), but that also wedged.

Here is a rework with a new flag in vcpu_info indicating vcpu creation. Tested this with 64 1-VCPU guests booting with a 1 second delay, and a single 16-way SMP guest boot.

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ry...@us...

diffstat output:
 qemu-kvm.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

Signed-off-by: Ryan Harper <ry...@us...>
---
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 78127de..2768ea5 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -31,7 +31,9 @@ extern int smp_cpus;
 static int qemu_kvm_reset_requested;
 
 pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
+pthread_mutex_t vcpu_mutex = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
+pthread_cond_t qemu_vcpuup_cond = PTHREAD_COND_INITIALIZER;
 __thread struct vcpu_info *vcpu;
 
 struct qemu_kvm_signal_table {
@@ -53,6 +55,7 @@ struct vcpu_info {
     int stop;
     int stopped;
     int reload_regs;
+    int created;
 } vcpu_info[256];
 
 pthread_t io_thread;
@@ -369,6 +372,10 @@ static void *ap_main_loop(void *_env)
     sigfillset(&signals);
     sigprocmask(SIG_BLOCK, &signals, NULL);
     kvm_create_vcpu(kvm_context, env->cpu_index);
+    pthread_mutex_lock(&vcpu_mutex);
+    vcpu->created = 1;
+    pthread_cond_signal(&qemu_vcpuup_cond);
+    pthread_mutex_unlock(&vcpu_mutex);
     kvm_qemu_init_env(env);
     kvm_main_loop_cpu(env);
     return NULL;
@@ -388,9 +395,12 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
 
 void kvm_init_new_ap(int cpu, CPUState *env)
 {
+    pthread_mutex_lock(&vcpu_mutex);
     pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
-    /* FIXME: wait for thread to spin up */
-    usleep(200);
+    while (vcpu_info[cpu].created == 0) {
+        pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex);
+    }
+    pthread_mutex_unlock(&vcpu_mutex);
 }
 
 static void qemu_kvm_init_signal_tables(void) |
From: Anthony L. <ali...@us...> - 2008-04-28 17:01:31
|
Here's my thinking as to why we don't want to destroy the VM in the mmu notifiers ->release method. I don't have a valid use-case for this but my argument depends on the fact that this is something that should work. Daemonizing a running VM may be a reasonable use-case. It's useful to wait to daemonize until you are sure that everything is working correctly so it's not all that unreasonable to move the daemonize until after the VCPUs have been launched. If you take a running VM, and pause all of the VCPUs, and then issue a fork() followed by an immediate exit() in the parent process, the child process should be able to unpause all the VCPUs and the guest should continue running uninterrupted. From KVM's perspective, issuing the fork() will increment the reference count of the file descriptor for the VM but otherwise, no real change should happen. The issue would now be that we must completely flush the shadow page table cache. In theory, MMU notifiers should do this for us. When the parent process exits, this will result in exit_mmap() and will destroy the KVM guest. This leaves the child process with a file descriptor that refers to a VM that is no longer valid. Just avoiding destroying the VM in the ->release() method won't fix this use-case I don't think. In general, I think we need to think a little more about how fork() is handled with respect to mmu notifiers. Regards, Anthony Liguori |