From: Daniel P. B. <ber...@re...> - 2008-04-30 13:56:19
On Wed, Apr 30, 2008 at 07:39:53AM -0600, David S. Ahern wrote:
> Avi Kivity wrote:
> > David S. Ahern wrote:
> >> Another tidbit for you guys as I make my way through various permutations:
> >> I installed the RHEL3 hugemem kernel and the guest behavior is *much* better.
> >> System time still has some regular hiccups that are higher than xen and esx
> >> (e.g., 1 minute samples out of 5 show system time between 10 and 15%), but
> >> overall guest behavior is good with the hugemem kernel.
> >
> > Wait, the amount of info here is overwhelming. Let's stick with the
> > current kernel (32-bit, HIGHMEM4G, right?)
> >
> > Did you get any traces with bypass_guest_pf=0? That may show more info.
>
> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the guest.
> My point in the last email was that the hugemem kernel shows a remarkable
> difference (it uses 3-levels of page tables right?). I was hoping that would
> ring a bell with someone.

IIRC, the RHEL-3 hugemem kernel is using the 4g/4g split patches, which
give userspace and kernelspace their own independent pagetables:

http://lwn.net/Articles/39925/
http://lwn.net/Articles/39283/

Dan.
--
|: Red Hat, Engineering, Boston -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
From: David S. A. <da...@ci...> - 2008-04-30 14:23:29
Yes, the 4G/4G patch and the 64G options are both enabled for the hugemem kernel:

CONFIG_HIGHMEM64G=y
CONFIG_X86_4G=y

Differences between the "standard" kernel and the hugemem kernel:

# diff config-2.4.21-47.ELsmp config-2.4.21-47.ELhugemem
2157,2158c2157,2158
< CONFIG_M686=y
< # CONFIG_MPENTIUMIII is not set
---
> # CONFIG_M686 is not set
> CONFIG_MPENTIUMIII=y
2169c2169
< CONFIG_X86_PGE=y
---
> # CONFIG_X86_PGE is not set
2193c2193
< # CONFIG_X86_4G is not set
---
> CONFIG_X86_4G=y
2365,2366c2365
< CONFIG_M686=y
< CONFIG_X86_PGE=y
---
> CONFIG_MPENTIUMIII=y
2369,2372d2367
< # CONFIG_MXT is not set
< CONFIG_HOTPLUG_PCI=y
< CONFIG_HOTPLUG_PCI_COMPAQ=m
< CONFIG_HOTPLUG_PCI_IBM=m
2373a2369
> CONFIG_X86_4G=y
2377,2379d2372
< # CONFIG_EWRK3 is not set
< CONFIG_UNIX98_PTY_COUNT=2048
< CONFIG_HZ=512
2382a2376,2383
> # CONFIG_MXT is not set
> CONFIG_HOTPLUG_PCI=y
> CONFIG_HOTPLUG_PCI_COMPAQ=m
> CONFIG_HOTPLUG_PCI_IBM=m
> # CONFIG_EWRK3 is not set
> CONFIG_UNIX98_PTY_COUNT=2048
> CONFIG_DEBUG_BUGVERBOSE=y
> # CONFIG_PNPBIOS is not set

Avi:

CentOS releases:
http://isoredirect.centos.org/centos/3/isos/i386/

I am running RHEL3.8, which I do not see listed. Also, I'll need to work on a
stock install and try to capture some kind of workload that exhibits the
problem. It will be a couple of days.

david
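A positional diff of two kernel config files reports options that merely moved within the file (for example CONFIG_HOTPLUG_PCI above) alongside options whose value really changed. A minimal sketch of a value-based comparison follows. The two sample strings are a small excerpt typed in from the diff above, not the full config files, and the helper names `parse_config` and `config_changes` are made up for illustration.

```python
def parse_config(text):
    """Map option name -> value; '# CONFIG_X is not set' is recorded as 'n'."""
    suffix = " is not set"
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            opts[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            opts[key] = value
    return opts


def config_changes(old, new):
    """Return {option: (old_value, new_value)} for options whose value differs."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


# Excerpt only: a few options taken from the diff in this message.
SMP = """\
CONFIG_M686=y
# CONFIG_MPENTIUMIII is not set
CONFIG_X86_PGE=y
# CONFIG_X86_4G is not set
CONFIG_HOTPLUG_PCI=y
"""

HUGEMEM = """\
# CONFIG_M686 is not set
CONFIG_MPENTIUMIII=y
# CONFIG_X86_PGE is not set
CONFIG_X86_4G=y
CONFIG_HOTPLUG_PCI=y
"""

changes = config_changes(parse_config(SMP), parse_config(HUGEMEM))
for option, (before, after) in changes.items():
    print(f"{option}: {before} -> {after}")
```

On this excerpt the unchanged CONFIG_HOTPLUG_PCI entry drops out, leaving the processor type, PGE and 4G/4G options as the only differences.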
From: Avi K. <av...@qu...> - 2008-04-23 08:07:37
David S. Ahern wrote:
> kvm_stat -1 is practically impossible to time correctly to get a good snippet.

I've added a --log option to get vmstat-like output. I've also added
--fields to select which fields are of interest, to avoid the need for
280-column displays. That's now pushed to kvm-userspace.git.

Example:

./kvm_stat -f 'mmu.*|pf.*|remote.*' -l

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
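To show what a field pattern such as 'mmu.*|pf.*|remote.*' selects, here is a small sketch of regex-based field filtering. It is not taken from kvm_stat itself: the counter names in `FIELDS` are only a sample, and matching from the start of each name is an assumption about how the option behaves.

```python
import re

# Sample counter names for illustration; the real list depends on the kvm version.
FIELDS = [
    "exits",
    "mmu_pte_write",
    "mmu_flooded",
    "pf_fixed",
    "pf_guest",
    "remote_tlb_flush",
    "irq_exits",
]


def select_fields(pattern, names):
    """Keep the names that the regular expression matches from their start."""
    rx = re.compile(pattern)
    return [name for name in names if rx.match(name)]


selected = select_fields(r"mmu.*|pf.*|remote.*", FIELDS)
print(selected)
```

With this sample list only the MMU, page-fault and remote-flush counters are kept, which is what narrows the display down from 280 columns.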
From: Avi K. <av...@qu...> - 2008-04-23 16:04:56
David S. Ahern wrote:
> I've continued drilling down with the tracers to answer that question. I have
> done runs with tracers in paging64_page_fault and it showed the overhead is with
> the fetch() function. On my last run the tracers are in paging64_fetch() as follows:
>
> 1. after is_present_pte() check
> 2. before kvm_mmu_get_page()
> 3. after kvm_mmu_get_page()
> 4. after if (!metaphysical) {}
>
> The delta between 2 and 3 shows the cycles to run kvm_mmu_get_page(). The delta
> between 3 and 4 shows the cycles to run kvm_read_guest_atomic(), if it is run.
> Tracer1 dumps vcpu->arch.last_pt_write_count (a carryover from when the new
> tracers were in paging64_page_fault); tracer2 dumps the level, metaphysical and
> access variables; tracer5 dumps value in shadow_ent.
>
> A representative trace sample is:
>
> (+ 4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
> (+ 2664) PAGE_FAULT1 [ write_count = 0 ]
> (+ 472) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 50416) PAGE_FAULT3
> (+ 472) PAGE_FAULT4
> (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9276d043 ]
> (+ 1528) VMENTRY
> (+ 4992) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2296) PAGE_FAULT1 [ write_count = 0 ]
> (+ 816) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 4176d363 ]
> (+ 6424) VMENTRY
> (+ 3864) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+ 2496) PAGE_FAULT1 [ write_count = 1 ]
> (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 10248) VMENTRY
> (+ 4744) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2408) PAGE_FAULT1 [ write_count = 2 ]
> (+ 760) PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809043 ]
> (+ 1240) VMENTRY
> (+ 4624) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
> (+ 2512) PAGE_FAULT1 [ write_count = 0 ]
> (+ 496) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 48664) PAGE_FAULT3
> (+ 472) PAGE_FAULT4
> (+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9272d043 ]
> (+ 1576) VMENTRY
>
> So basically every 4th trip through the fetch function it runs
> kvm_mmu_get_page(). A summary of the entire trace file shows this function
> rarely executes in less than 50,000 cycles. Also, vcpu->arch.last_pt_write_count
> is always 0 when the high cycles are hit.

Ah! The flood detector is not seeing the access through the
kmap_atomic() pte, because that access has gone through the emulator.
last_updated_pte_accessed(vcpu) will never return true.

Can you verify that last_updated_pte_accessed(vcpu) indeed always
returns false?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
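The per-tracer deltas in a trace like the one quoted above can be summarized with a short script. The sketch below assumes each "(+ N)" value is the cycle count since the previous record, which is how David reads the delta between tracers 2 and 3. The sample is the first exit from the quoted trace, and the function names are made up for illustration.

```python
import re

# First exit from the trace quoted in this message.
TRACE = """\
(+ 4576) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
(+ 2664) PAGE_FAULT1 [ write_count = 0 ]
(+ 472) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
(+ 50416) PAGE_FAULT3
(+ 472) PAGE_FAULT4
(+ 856) PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9276d043 ]
(+ 1528) VMENTRY
"""

LINE_RE = re.compile(r"\(\+\s*(\d+)\)\s+(\w+)")


def parse(trace):
    """Return a list of (event_name, cycles_since_previous_record)."""
    events = []
    for line in trace.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            events.append((m.group(2), int(m.group(1))))
    return events


def exit_costs(events):
    """Total cycles from each VMEXIT record to the following VMENTRY."""
    costs, running = [], None
    for name, delta in events:
        if name == "VMEXIT":
            running = 0
        elif running is not None:
            running += delta
            if name == "VMENTRY":
                costs.append(running)
                running = None
    return costs


events = parse(TRACE)
# The delta on PAGE_FAULT3 covers tracer 2 -> tracer 3, i.e. kvm_mmu_get_page().
get_page_cycles = [delta for name, delta in events if name == "PAGE_FAULT3"]
totals = exit_costs(events)
share = get_page_cycles[0] / totals[0]
print(get_page_cycles, totals, f"{share:.0%}")
```

For this one exit, kvm_mmu_get_page() accounts for 50416 of the 56408 cycles spent between VMEXIT and VMENTRY, roughly 89%.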
From: David S. A. <da...@ci...> - 2008-04-23 16:39:33
Avi Kivity wrote:
>
> Ah! The flood detector is not seeing the access through the
> kmap_atomic() pte, because that access has gone through the emulator.
> last_updated_pte_accessed(vcpu) will never return true.
>
> Can you verify that last_updated_pte_accessed(vcpu) indeed always
> returns false?
>

It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump
the return code of last_updated_pte_accessed(vcpu), i.e.:

    pte_access = last_updated_pte_accessed(vcpu);
    KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler);

A sample:

(+ 4488) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
(+ 2480) PAGE_FAULT1 [ write_count = 0 ]
(+ 424) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
(+ 51672) PAGE_FAULT3
(+ 472) PAGE_FAULT4
(+ 704) PAGE_FAULT5 [ shadow_ent = 0x80000001 2dfb5043 ]
(+ 1496) VMENTRY
(+ 4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 2352) PAGE_FAULT1 [ write_count = 0 ]
(+ 728) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ]
(+ 0) PTE_ACCESS [ pte_access = 1 ]
(+ 6864) VMENTRY
(+ 3896) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+ 2376) PAGE_FAULT1 [ write_count = 1 ]
(+ 720) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
(+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 0) PTE_ACCESS [ pte_access = 0 ]
(+ 12344) VMENTRY
(+ 4688) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 2416) PAGE_FAULT1 [ write_count = 2 ]
(+ 792) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409043 ]
(+ 1128) VMENTRY
(+ 4512) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
(+ 2448) PAGE_FAULT1 [ write_count = 0 ]
(+ 448) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
(+ 51520) PAGE_FAULT3
(+ 432) PAGE_FAULT4
(+ 696) PAGE_FAULT5 [ shadow_ent = 0x80000001 2df5a043 ]
(+ 1480) VMENTRY

david
From: Avi K. <av...@qu...> - 2008-04-26 06:21:14
David S. Ahern wrote:
> Avi Kivity wrote:
>
>> Ah! The flood detector is not seeing the access through the
>> kmap_atomic() pte, because that access has gone through the emulator.
>> last_updated_pte_accessed(vcpu) will never return true.
>>
>> Can you verify that last_updated_pte_accessed(vcpu) indeed always
>> returns false?
>>
>
> It returns both true and false. I added a tracer to kvm_mmu_pte_write() to dump
> the rc of last_updated_pte_accessed(vcpu). ie.,
> pte_access = last_updated_pte_accessed(vcpu);
> KVMTRACE_1D(PTE_ACCESS, vcpu, (u32) pte_access, handler);
>
> A sample:
>
> (+ 4488) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+ 2480) PAGE_FAULT1 [ write_count = 0 ]
> (+ 424) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 51672) PAGE_FAULT3
> (+ 472) PAGE_FAULT4
> (+ 704) PAGE_FAULT5 [ shadow_ent = 0x80000001 2dfb5043 ]
> (+ 1496) VMENTRY
> (+ 4568) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2352) PAGE_FAULT1 [ write_count = 0 ]
> (+ 728) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41fb5363 ]
> (+ 0) PTE_ACCESS [ pte_access = 1 ]
> (+ 6864) VMENTRY
> (+ 3896) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+ 2376) PAGE_FAULT1 [ write_count = 1 ]
> (+ 720) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409041 ]
> (+ 0) PTE_WRITE [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 0) PTE_ACCESS [ pte_access = 0 ]
> (+ 12344) VMENTRY
> (+ 4688) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+ 0) PAGE_FAULT [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+ 2416) PAGE_FAULT1 [ write_count = 2 ]
> (+ 792) PAGE_FAULT5 [ shadow_ent = 0x00000001 91409043 ]
> (+ 1128) VMENTRY
> (+ 4512) VMEXIT [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+ 0) PAGE_FAULT [ errorcode = 0x0000000b, virt = 0x00000000 fffb63b0 ]
> (+ 2448) PAGE_FAULT1 [ write_count = 0 ]
> (+ 448) PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
> (+ 51520) PAGE_FAULT3
> (+ 432) PAGE_FAULT4
> (+ 696) PAGE_FAULT5 [ shadow_ent = 0x80000001 2df5a043 ]
> (+ 1480) VMENTRY
>

Strange... there should be at least two pte_access = 0 traces in there
before flooding can occur, according to my reading of the code. The
counter needs to go up to 3 somehow.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
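The counting argument can be made concrete with a toy model. This is not KVM's code. It assumes a simplified rule: a write to the same guest page-table frame as the previous write, with no access to the last updated pte in between, bumps a counter; any other write restarts the counter at 1; a count of 3 means flooding. Only the threshold of 3 comes from the message above. The class and its names are hypothetical.

```python
FLOOD_THRESHOLD = 3  # "the counter needs to go up to 3", per the message above


class FloodDetector:
    """Toy model of a write-flood counter; an assumption, not KVM's implementation."""

    def __init__(self):
        self.last_gfn = None
        self.count = 0

    def pte_write(self, gfn, pte_accessed):
        """Record one guest pte write; return True once the threshold is reached."""
        if gfn == self.last_gfn and not pte_accessed:
            self.count += 1
        else:
            self.last_gfn = gfn
            self.count = 1
        return self.count >= FLOOD_THRESHOLD


# The two PTE_WRITE / PTE_ACCESS pairs from the quoted trace: (gpa, pte_access).
WRITES = [(0x9DB4, 1), (0x9DB0, 0)]

detector = FloodDetector()
results = [detector.pte_write(gpa >> 12, bool(accessed)) for gpa, accessed in WRITES]
print(results, detector.count)

# Under this model, only a further write with pte_access = 0 would trip the detector.
third = detector.pte_write(0x9DB4 >> 12, False)
print(third)
```

Under these assumptions the two traced writes leave the counter at 2, which matches the write_count values in the trace, and a second pte_access = 0 write would be needed to reach 3.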
From: David S. A. <da...@ci...> - 2008-04-24 17:25:03
What is the rip (GUEST_RIP) value in the VMEXIT trace output? Is that the
current instruction pointer for the guest?

I take it the virt in the PAGE_FAULT trace output is the virtual address the
guest was referencing when the page fault occurred. What I don't understand (one
of many things really) is what the 0xfffb63b0 corresponds to in the guest. Any
ideas?

Also, the expensive page fault occurs on errorcode = 0x0000000b (PAGE_FAULT
trace data). What does the 4th bit in 0xb mean? Bit 0 set means
PFERR_PRESENT_MASK is set, and bit 1 means PT_WRITABLE_MASK. What is bit 3?

david
From: Avi K. <av...@qu...> - 2008-04-26 06:43:42
David S. Ahern wrote:
> What is the rip (GUEST_RIP) value in the VMEXIT trace output? Is that the
> current instruction pointer for the guest?

Yes.

> I take it the virt in the PAGE_FAULT trace output is the virtual address the
> guest was referencing when the page fault occurred. What I don't understand (one
> of many things really) is what the 0xfffb63b0 corresponds to in the guest. Any
> ideas?

I'm pretty sure it is the kmap_atomic() pte. The guest wants to update a
pte (call it pte1), which is in HIGHMEM, so it doesn't have a permanent
mapping for it. It calls kmap_atomic(), which sets up another pte (pte2,
two writes), and then accesses pte1 through pte2.

> Also, the expensive page fault occurs on errorcode = 0x0000000b (PAGE_FAULT
> trace data). What does the 4th bit in 0xb mean? bit 0 set means
> PFERR_PRESENT_MASK is set, and bit 1 means PT_WRITABLE_MASK. What is bit 3?

Bit 3 is the reserved bit, which means the shadow pte has an illegal bit
combination. kvm sets up vmx to forward non-present page faults (bit 0
clear) directly to the guest, so it needs some other pattern to get a
trapping fault.

IOW, there are two types of non-present shadow ptes in kvm: trapping ones
(where we don't know what the guest pte looks like) and nontrapping ones
(where we know the guest pte is not present, so we forward the fault
directly to the guest). The first type is encoded with the reserved bit
and present bit set, the second with both of them clear.

You can disable this trickery using the bypass_guest_pf module parameter.
It should be useful to try it; we'll see the forwarded faults as well.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
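The two error codes seen in the traces can be decoded bit by bit. The sketch below uses the x86 page-fault error code layout (bit 0 present, bit 1 write, bit 2 user, bit 3 reserved bit set, bit 4 instruction fetch). The helper names are made up for illustration, and `is_trapping_marker` only encodes the rule described above (present and reserved both set); it is not a KVM function.

```python
# x86 page-fault error code bits, lowest bit first.
PF_BITS = [
    (0, "present"),
    (1, "write"),
    (2, "user"),
    (3, "reserved"),
    (4, "fetch"),
]


def decode_pf_error(code):
    """Return the names of the bits set in a page-fault error code."""
    return [name for bit, name in PF_BITS if code & (1 << bit)]


def is_trapping_marker(code):
    """True when both the present and reserved bits are set (hypothetical helper)."""
    return bool(code & 0x1) and bool(code & 0x8)


for code in (0x0000000B, 0x00000003):
    print(hex(code), decode_pf_error(code), is_trapping_marker(code))
```

So 0xb is a write to a present entry with a reserved bit set, the pattern used here for trapping shadow ptes, while 0x3 is an ordinary write fault on a present, write-protected entry.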