From: David S. A. <da...@ci...> - 2008-04-16 00:15:35
I have been looking at RHEL3 based guests lately, and to say the least the
performance is horrible. Rather than write a long tome on what I've done and
observed, I'd like to find out if anyone has some insights or known problem
areas running 2.4 guests. The short of it is that % system time spikes from
time to time (e.g., on exec of a new process such as running /bin/true).

I do not see the problem running RHEL3 on ESX, and an equivalent VM running
RHEL4 runs fine. That suggests that the 2.4 kernel is doing something in a way
that is not handled efficiently by kvm.

Can someone shed some light on it?

thanks,
david
From: Avi K. <av...@qu...> - 2008-04-16 08:46:40
David S. Ahern wrote:
> I have been looking at RHEL3 based guests lately, and to say the least the
> performance is horrible. Rather than write a long tome on what I've done and
> observed, I'd like to find out if anyone has some insights or known problem
> areas running 2.4 guests. The short of it is that % system time spikes from
> time to time (e.g., on exec of a new process such as running /bin/true).
>
> I do not see the problem running RHEL3 on ESX, and an equivalent VM running
> RHEL4 runs fine. That suggests that the 2.4 kernel is doing something in a
> way that is not handled efficiently by kvm.
>
> Can someone shed some light on it?

It's not something that I test regularly. If you're running a 32-bit kernel,
I'd suspect kmap(), or perhaps false positives from the fork detector.

kvmtrace will probably give enough info to tell exactly what's going on;
'kvm_stat -1' while the badness is happening may also help.

--
error compiling committee.c: too many arguments to function
From: David S. A. <da...@ci...> - 2008-04-17 21:12:56
kvm_stat -1 is practically impossible to time correctly to get a good snippet.

kvmtrace is a fascinating tool. I captured trace data that encompassed one
intense period where the VM appeared to freeze (no terminal response for a
few seconds).

After converting to text I examined an arbitrary section in time (how do you
correlate tsc to unix epoch?), and it shows vcpu0 hammered with interrupts and
vcpu1 hammered with page faults. (I put the representative data below; I can
send the binary or text files if you really want to see them.) All told, over
about a 10-12 second time period the trace text files contain 8426221 lines
and 2051344 of them are PAGE_FAULTs (that's 24% of the text lines, which seems
really high).

david

---------------------------------
vcpu0 data:

0             (+     0) INTR     vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400020536 (+  1712) VMENTRY  vcpu = 0x00000001 pid = 0x000011ea
9968400096784 (+ 76248) VMEXIT   vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0             (+     0) INTR     vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400098576 (+  1792) VMENTRY  vcpu = 0x00000001 pid = 0x000011ea
9968400114528 (+ 15952) VMEXIT   vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0             (+     0) INTR     vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400116328 (+  1800) VMENTRY  vcpu = 0x00000001 pid = 0x000011ea
9968400137216 (+ 20888) VMEXIT   vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
0             (+     0) INTR     vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400138840 (+  1624) VMENTRY  vcpu = 0x00000001 pid = 0x000011ea
9968400209344 (+ 70504) VMEXIT   vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0             (+     0) INTR     vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400211056 (+  1712) VMENTRY  vcpu = 0x00000001 pid = 0x000011ea
9968400226312 (+ 15256) VMEXIT   vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]
0             (+     0) INTR     vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
9968400228040 (+  1728) VMENTRY  vcpu = 0x00000001 pid = 0x000011ea
9968400248688 (+ 20648) VMEXIT   vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7c ]

vcpu1 data:

9968400002032 (+  3808) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c016127f ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400005448 (+  3416) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400009832 (+  4384) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x0000000b, virt = 0x00000000 fffb6f88 ]
9968400071584 (+ 61752) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400075608 (+  4024) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
9968400083528 (+  7920) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400087288 (+  3760) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400097312 (+ 10024) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400103064 (+  5752) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160f9c ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
9968400116624 (+ 13560) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400120424 (+  3800) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160fa1 ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400123856 (+  3432) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400128208 (+  4352) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160dab ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000009, virt = 0x00000000 fffb6d28 ]
9968400183848 (+ 55640) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400188232 (+  4384) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160e4d ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
9968400196160 (+  7928) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400199928 (+  3768) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160e54 ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400209864 (+  9936) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400214984 (+  5120) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160f9c ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
9968400228232 (+ 13248) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400232000 (+  3768) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160fa1 ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
9968400235424 (+  3424) VMENTRY    vcpu = 0x00000000 pid = 0x000011ea
9968400239816 (+  4392) VMEXIT     vcpu = 0x00000000 pid = 0x000011ea [ exitcode = 0x00000000, rip = 0x00000000 c0160dab ]
0             (+     0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000009, virt = 0x00000000 fffb6d30 ]

Avi Kivity wrote:
> David S. Ahern wrote:
>> [...]
>
> It's not something that I test regularly. If you're running a 32-bit
> kernel, I'd suspect kmap(), or perhaps false positives from the fork
> detector.
>
> kvmtrace will probably give enough info to tell exactly what's going on;
> 'kvm_stat -1' while the badness is happening may also help.
From: Avi K. <av...@qu...> - 2008-04-18 08:01:32
David S. Ahern wrote:
> kvm_stat -1 is practically impossible to time correctly to get a good
> snippet.
>
> kvmtrace is a fascinating tool. I captured trace data that encompassed one
> intense period where the VM appeared to freeze (no terminal response for a
> few seconds).
>
> After converting to text I examined an arbitrary section in time (how do you
> correlate tsc to unix epoch?), and it shows vcpu0 hammered with interrupts
> and vcpu1 hammered with page faults. (I put the representative data below; I
> can send the binary or text files if you really want to see them.) All told,
> over about a 10-12 second time period the trace text files contain 8426221
> lines and 2051344 of them are PAGE_FAULTs (that's 24% of the text lines,
> which seems really high).
>
> vcpu1 data:
>
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000009, virt = 0x00000000 fffb6d28 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> 0 (+ 0) PAGE_FAULT vcpu = 0x00000000 pid = 0x000011ea [ errorcode = 0x00000009, virt = 0x00000000 fffb6d30 ]

The pattern here is c0009db4, c0009db0, fffb6xxx, c0009db0. Setting a pte at
c0009db0, accessing the page mapped by the pte, unmapping the pte. Note that
c0009db0 (bits 3:11) == 0x1b6 == fffb6xxx (bits 12:20). That's a kmap_atomic()
+ access + kunmap_atomic() sequence.

The expensive accesses (~50K cycles) seem to be the ones at fffb6xxx. Now
these shouldn't show up at all -- kvm_mmu_pte_write() ought to have set up the
ptes correctly.

Can you add a trace at mmu_guess_page_from_pte_write(), right before "if
(is_present_pte(gpte))"? I'm interested in gpa and gpte. Also a trace at
kvm_mmu_pte_write(), where it sets flooded = 1 (hmm, try to increase the 3 to
4 in the line right above that, maybe the fork detector is misfiring).

> ---------------------------------
> vcpu0 data:
>
> 0             (+     0) INTR     vcpu = 0x00000001 pid = 0x000011ea [ vector = 0x00 ]
> 9968400020536 (+  1712) VMENTRY  vcpu = 0x00000001 pid = 0x000011ea
> 9968400096784 (+ 76248) VMEXIT   vcpu = 0x00000001 pid = 0x000011ea [ exitcode = 0x00000001, rip = 0x00000000 c0154d7a ]
> [...]

Those are probably IPIs due to the kmaps above.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
From: David S. A. <da...@ci...> - 2008-04-21 04:31:12
I added the traces and captured data over another apparent lockup of the
guest. This seems to be representative of the sequence (pid/vcpu removed).

(+  4776) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+  3632) VMENTRY
(+  4552) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+     0) PAGE_FAULT  [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
(+ 54928) VMENTRY
(+  4568) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+     0) PTE_WRITE   [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
(+  8432) VMENTRY
(+  3936) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+     0) PTE_WRITE   [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 13832) VMENTRY

(+  5768) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+  3712) VMENTRY
(+  4576) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+     0) PAGE_FAULT  [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
(+     0) PTE_WRITE   [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]
(+ 65216) VMENTRY
(+  4232) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+     0) PTE_WRITE   [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]
(+  8640) VMENTRY
(+  3936) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+     0) PTE_WRITE   [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 14160) VMENTRY

I can forward a more complete time snippet if you'd like. vcpu0 + corresponding
vcpu1 files have 85000 total lines and compressed the files total ~500k.

I did not see the FLOODED trace come out during this sample though I did bump
the count from 3 to 4 as you suggested.

Correlating rip addresses to the 2.4 kernel:

c0160d00-c0161290 = page_referenced

It looks like the event is kscand running through the pages. I suspected this
some time ago, and tried tweaking the kscand_work_percent sysctl variable. It
appeared to lower the peak of the spikes, but maybe I imagined it. I believe
lowering that value makes kscand wake up more often but do less work (page
scanning) each time it is awakened.

david

Avi Kivity wrote:
> Can you add a trace at mmu_guess_page_from_pte_write(), right before "if
> (is_present_pte(gpte))"? I'm interested in gpa and gpte. Also a trace at
> kvm_mmu_pte_write(), where it sets flooded = 1 (hmm, try to increase the 3
> to 4 in the line right above that, maybe the fork detector is misfiring).
From: Avi K. <av...@qu...> - 2008-04-21 09:20:21
David S. Ahern wrote:
> I added the traces and captured data over another apparent lockup of the
> guest. This seems to be representative of the sequence (pid/vcpu removed).
>
> (+  4776) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+  3632) VMENTRY
> (+  4552) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+     0) PAGE_FAULT  [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
> (+ 54928) VMENTRY

Can you oprofile the host to see where the 54K cycles are spent?

> (+  4568) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+     0) PTE_WRITE   [ gpa = 0x00000000 00009db4 gpte = 0x00000000 41c5d363 ]
> (+  8432) VMENTRY
> (+  3936) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+     0) PTE_WRITE   [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 13832) VMENTRY
>
> (+  5768) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
> (+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+  3712) VMENTRY
> (+  4576) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
> (+     0) PAGE_FAULT  [ errorcode = 0x0000000b, virt = 0x00000000 fffb61d0 ]
> (+     0) PTE_WRITE   [ gpa = 0x00000000 3d5981d0 gpte = 0x00000000 3d55d047 ]

This indeed has the accessed bit clear.

> (+ 65216) VMENTRY
> (+  4232) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
> (+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
> (+     0) PTE_WRITE   [ gpa = 0x00000000 00009db4 gpte = 0x00000000 3d598363 ]

This has the accessed bit set and the user bit clear, and the pte pointing at
the previous pte_write gpa. Looks like a kmap_atomic().

> (+  8640) VMENTRY
> (+  3936) VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
> (+     0) PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
> (+     0) PTE_WRITE   [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
> (+ 14160) VMENTRY
>
> I can forward a more complete time snippet if you'd like. vcpu0 +
> corresponding vcpu1 files have 85000 total lines and compressed the files
> total ~500k.
>
> I did not see the FLOODED trace come out during this sample though I did
> bump the count from 3 to 4 as you suggested.

Bumping the count was supposed to remove the flooding...

> Correlating rip addresses to the 2.4 kernel:
>
> c0160d00-c0161290 = page_referenced
>
> It looks like the event is kscand running through the pages. I suspected
> this some time ago, and tried tweaking the kscand_work_percent sysctl
> variable. It appeared to lower the peak of the spikes, but maybe I imagined
> it. I believe lowering that value makes kscand wake up more often but do
> less work (page scanning) each time it is awakened.

What does 'top' in the guest show (perhaps sorted by total cpu time rather
than instantaneous usage)?

What host kernel are you running? How many host cpus?

--
error compiling committee.c: too many arguments to function
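A quick decode of those two gpte values, for readers who don't juggle x86 pte
flag bits daily (editor's illustration; the bit layout is the standard PAE pte
format, not anything specific to this thread):

    0x3d55d047 -> low flag bits 0x047: P=1 RW=1 US=1 A=0 D=1     (accessed clear)
    0x3d598363 -> low flag bits 0x363: P=1 RW=1 US=0 A=1 D=1 G=1 (accessed set, user clear)

    (pte flag bits: 0=Present, 1=Read/Write, 2=User/Supervisor, 5=Accessed,
     6=Dirty, 8=Global; the frame address starts at bit 12, so 0x3d598363
     maps frame 0x3d598000 -- the page containing the previous pte_write
     gpa 0x3d5981d0, which is how the "pointing at the previous pte_write
     gpa" observation is made.)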
From: David S. A. <da...@ci...> - 2008-04-21 17:07:37
host: 2.6.25-rc8, x86_64, kvm-66
      1 dual-core Xeon(R) CPU 3050 @ 2.13GHz, 6 GB RAM

(This behavior also occurs on a larger server with 2 dual-core Xeon(R) CPU
5140 @ 2.33GHz, 4 GB RAM. Same kernel and kvm versions.)

guest: RHEL3 U8 (2.4.21-47.ELsmp), 2 vcpus, 2 GB RAM

As usual, I waited for a guest "event" -- high system time in the guest which
appears to lock it up. Following the event, kscand was the top CPU user
(cumulative time) in the guest. During the event, 2 qemu threads are pegging
the host CPU at 100%.

Top samples from oprofile (oprofile was started after the freeze started and
stopped when guest response returned):

samples  %        image name    app name    symbol name
171716   35.1350  kvm-intel.ko  kvm_intel   vmx_vcpu_run
45836     9.3786  vmlinux       vmlinux     copy_user_generic_string
39417     8.0652  kvm.ko        kvm         kvm_read_guest_atomic
23604     4.8296  vmlinux       vmlinux     add_preempt_count
22878     4.6811  vmlinux       vmlinux     __smp_call_function_mask
16143     3.3030  kvm.ko        kvm         gfn_to_hva
14648     2.9971  vmlinux       vmlinux     sub_preempt_count
14589     2.9851  kvm.ko        kvm         __gfn_to_memslot
11666     2.3870  kvm.ko        kvm         unalias_gfn
10834     2.2168  kvm.ko        kvm         kvm_mmu_zap_page
10532     2.1550  kvm.ko        kvm         paging64_prefetch_page
6285      1.2860  kvm-intel.ko  kvm_intel   handle_exception
6066      1.2412  kvm.ko        kvm         kvm_arch_vcpu_ioctl_run
4741      0.9701  kvm.ko        kvm         kvm_add_trace
3801      0.7777  vmlinux       vmlinux     __copy_from_user_inatomic
3592      0.7350  vmlinux       vmlinux     follow_page
3326      0.6805  kvm.ko        kvm         mmu_memory_cache_alloc
3317      0.6787  kvm-intel.ko  kvm_intel   kvm_handle_exit
2971      0.6079  kvm.ko        kvm         paging64_page_fault
2777      0.5682  kvm.ko        kvm         paging64_walk_addr
2294      0.4694  kvm.ko        kvm         kvm_mmu_pte_write
2278      0.4661  kvm.ko        kvm         kvm_flush_remote_tlbs
2266      0.4636  kvm-intel.ko  kvm_intel   vmcs_writel
2086      0.4268  kvm.ko        kvm         mmu_set_spte
2041      0.4176  kvm.ko        kvm         kvm_read_guest
1615      0.3304  vmlinux       vmlinux     free_hot_cold_page

david

Avi Kivity wrote:
> Can you oprofile the host to see where the 54K cycles are spent?
> [...]
> What does 'top' in the guest show (perhaps sorted by total cpu time rather
> than instantaneous usage)?
>
> What host kernel are you running? How many host cpus?
From: David S. A. <da...@ci...> - 2008-04-22 20:23:08
I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:

1. before vcpu->arch.mmu.page_fault()
2. after vcpu->arch.mmu.page_fault()
3. after mmu_topup_memory_caches()
4. after emulate_instruction()

So the deltas in the trace reports show:
- cycles required for arch.mmu.page_fault()     (tracer 2)
- cycles required for mmu_topup_memory_caches() (tracer 3)
- cycles required for emulate_instruction()     (tracer 4)

I captured trace data for ~5 seconds during one of the usual events (again,
this time it was due to kscand in the guest). I ran the formatted trace data
through an awk script to summarize:

TSC cycles            tracer2   tracer3   tracer4
0       -  10,000:     295067    213251    115873
10,001  -  25,000:       7682      1004     98336
25,001  -  50,000:        201        15        36
50,001  - 100,000:     100655         0        10
        > 100,000:        117         0        15

This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly
5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it
took longer than 50,000 cycles. The page_fault function getting run is
paging64_page_fault.

mmu_topup_memory_caches() and emulate_instruction() were both run 214,270
times, most of them relatively quickly.

Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few
host processes could interrupt it.

david

Avi Kivity wrote:
> Can you oprofile the host to see where the 54K cycles are spent?
> [...]
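For anyone wanting to reproduce this kind of measurement: the tracers David
describes amount to reading the TSC at each stage of kvm_mmu_page_fault() and
histogramming the deltas. A minimal sketch follows (editor's illustration
against a simplified kvm-66-era kvm_mmu_page_fault(); the real function has
fuller error/MMIO handling, and the trace emission itself is left as a
comment since the PAGE_FAULT1..4 event names are David's own additions):

    int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
    {
            int r;
            enum emulation_result er;
            u64 t0, t1, t2, t3;

            rdtscll(t0);
            r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
            rdtscll(t1);                    /* t1 - t0: "tracer 2" column */
            if (r < 0)
                    return r;
            if (!r)                         /* shadow fault fixed, no emulation */
                    return 1;

            r = mmu_topup_memory_caches(vcpu);
            rdtscll(t2);                    /* t2 - t1: "tracer 3" column */
            if (r)
                    return r;

            er = emulate_instruction(vcpu, vcpu->run, cr2, error_code, 0);
            rdtscll(t3);                    /* t3 - t2: "tracer 4" column */

            /* emit t1-t0, t2-t1, t3-t2 via KVMTRACE_*D events or kvm_stat
             * counters; the error paths above are simplified vs. mainline */
            return er == EMULATE_DONE;
    }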
From: Avi K. <av...@qu...> - 2008-04-23 08:08:28
David S. Ahern wrote:
> I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:
>
> 1. before vcpu->arch.mmu.page_fault()
> 2. after vcpu->arch.mmu.page_fault()
> 3. after mmu_topup_memory_caches()
> 4. after emulate_instruction()
>
> So the deltas in the trace reports show:
> - cycles required for arch.mmu.page_fault()     (tracer 2)
> - cycles required for mmu_topup_memory_caches() (tracer 3)
> - cycles required for emulate_instruction()     (tracer 4)
>
> I captured trace data for ~5 seconds during one of the usual events (again,
> this time it was due to kscand in the guest). I ran the formatted trace data
> through an awk script to summarize:
>
> TSC cycles            tracer2   tracer3   tracer4
> 0       -  10,000:     295067    213251    115873
> 10,001  -  25,000:       7682      1004     98336
> 25,001  -  50,000:        201        15        36
> 50,001  - 100,000:     100655         0        10
>         > 100,000:        117         0        15
>
> This means vcpu->arch.mmu.page_fault() was called 403,722 times in the
> roughly 5-second interval: 295,067 times it took < 10,000 cycles, but
> 100,772 times it took longer than 50,000 cycles. The page_fault function
> getting run is paging64_page_fault.

This does look like the fork detector. Once in every four faults, it triggers
and the fault becomes slow.

100K floods == 100K page tables == 200GB of virtual memory, which seems
excessive. Is this running a forked load like apache, with many processes?
How much memory is on the guest, and is there any memory pressure?

> mmu_topup_memory_caches() and emulate_instruction() were both run 214,270
> times, most of them relatively quickly.
>
> Note: I bumped the scheduling priority of the qemu threads to RR 1 so that
> few host processes could interrupt it.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
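The 200GB figure follows directly from 64-bit page-table geometry (editor's
arithmetic, assuming 4 KiB pages and 512 entries per page-table page):

    512 entries/table x 4 KiB mapped per entry  =  2 MiB of virtual memory per page table
    ~100,000 flooded page tables x 2 MiB       ~=  200 GB of virtual address space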
From: David S. A. <da...@ci...> - 2008-04-23 15:24:03
Avi Kivity wrote:
> Can you oprofile the host to see where the 54K cycles are spent?

I've continued drilling down with the tracers to answer that question. I have
done runs with tracers in paging64_page_fault and it showed the overhead is
with the fetch() function. On my last run the tracers are in paging64_fetch()
as follows:

1. after the is_present_pte() check
2. before kvm_mmu_get_page()
3. after kvm_mmu_get_page()
4. after if (!metaphysical) {}

The delta between 2 and 3 shows the cycles to run kvm_mmu_get_page(). The
delta between 3 and 4 shows the cycles to run kvm_read_guest_atomic(), if it
is run. Tracer 1 dumps vcpu->arch.last_pt_write_count (a carryover from when
the new tracers were in paging64_page_fault); tracer 2 dumps the level,
metaphysical and access variables; tracer 5 dumps the value in shadow_ent.

A representative trace sample is:

(+  4576) VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+     0) PAGE_FAULT   [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
(+  2664) PAGE_FAULT1  [ write_count = 0 ]
(+   472) PAGE_FAULT2  [ level = 2 metaphysical = 0 access 0x00000007 ]
(+ 50416) PAGE_FAULT3
(+   472) PAGE_FAULT4
(+   856) PAGE_FAULT5  [ shadow_ent_val = 0x80000000 9276d043 ]
(+  1528) VMENTRY

(+  4992) VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+     0) PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+  2296) PAGE_FAULT1  [ write_count = 0 ]
(+   816) PAGE_FAULT5  [ shadow_ent_val = 0x00000000 8a809041 ]
(+     0) PTE_WRITE    [ gpa = 0x00000000 00009db4 gpte = 0x00000000 4176d363 ]
(+  6424) VMENTRY

(+  3864) VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+     0) PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+  2496) PAGE_FAULT1  [ write_count = 1 ]
(+   856) PAGE_FAULT5  [ shadow_ent_val = 0x00000000 8a809041 ]
(+     0) PTE_WRITE    [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+ 10248) VMENTRY

(+  4744) VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+     0) PAGE_FAULT   [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+  2408) PAGE_FAULT1  [ write_count = 2 ]
(+   760) PAGE_FAULT5  [ shadow_ent_val = 0x00000000 8a809043 ]
(+  1240) VMENTRY

(+  4624) VMEXIT       [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+     0) PAGE_FAULT   [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
(+  2512) PAGE_FAULT1  [ write_count = 0 ]
(+   496) PAGE_FAULT2  [ level = 2 metaphysical = 0 access 0x00000007 ]
(+ 48664) PAGE_FAULT3
(+   472) PAGE_FAULT4
(+   856) PAGE_FAULT5  [ shadow_ent_val = 0x80000000 9272d043 ]
(+  1576) VMENTRY

So basically every 4th trip through the fetch function it runs
kvm_mmu_get_page(). A summary of the entire trace file shows this function
rarely executes in less than 50,000 cycles. Also, vcpu->arch.last_pt_write_count
is always 0 when the high cycles are hit.

More tidbits:

- The hugepage option seems to have no effect -- the system spikes and
  overhead occur with and without the hugepage option (the data above is with
  it).

- As the guest runs for hours, the intensity of the spikes drops, though they
  still occur regularly and kscand continues to be the primary suspect.
  qemu's RSS tends to the guest's memory allotment of 2GB. Internally, guest
  memory usage runs at ~1GB page cache, 57M buffers, 24M swap, ~800MB for
  processes.

- I have looked at process creation and do not see a strong correlation
  between system time spikes and the number of new processes. So far the only
  correlations seem to be kscand and the amount of memory used, i.e., stock
  RHEL3 with few processes shows tiny spikes, whereas in my tests with 90+
  processes using about 800M plus a continually updating page cache (i.e.,
  moderate IO levels) the spikes are strong and last for seconds.

- Time runs really fast in the guest, gaining several minutes in 24 hours.

I'll download your kvm_stat update and give it a try. When I started this
investigation I was using Christian's kvmstat script, which dumped stats to a
file. Plots of that data did not show a strong correlation with guest system
time.

david
From: David S. A. <da...@ci...> - 2008-04-25 17:34:34
David S. Ahern wrote:
> Avi Kivity wrote:
>> Can you oprofile the host to see where the 54K cycles are spent?

Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():

        for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
                gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
                pte_gpa += (i+offset) * sizeof(pt_element_t);

                r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
                                          sizeof(pt_element_t));
                if (r || is_present_pte(pt))
                        sp->spt[i] = shadow_trap_nonpresent_pte;
                else
                        sp->spt[i] = shadow_notrap_nonpresent_pte;
        }

This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles per
loop.

This function gets run >20,000/sec during some of the kscand loops.

david
From: Avi K. <av...@qu...> - 2008-04-26 06:45:37
David S. Ahern wrote:
> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
>
>         for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>                 gpa_t pte_gpa = gfn_to_gpa(sp->gfn);
>                 pte_gpa += (i+offset) * sizeof(pt_element_t);
>
>                 r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &pt,
>                                           sizeof(pt_element_t));
>                 if (r || is_present_pte(pt))
>                         sp->spt[i] = shadow_trap_nonpresent_pte;
>                 else
>                         sp->spt[i] = shadow_notrap_nonpresent_pte;
>         }
>
> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles
> per loop.
>
> This function gets run >20,000/sec during some of the kscand loops.

We really ought to optimize it. That's second order however. The real fix is
making sure it isn't called so often.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
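On the "optimize it" side, the obvious direction is to replace the 512
individual kvm_read_guest_atomic() calls with a few large reads of the guest
page-table page, which amortizes the gfn_to_hva()/memslot lookups that also
showed high in the earlier oprofile. A rough, untested sketch of that idea
against the loop quoted above (editor's illustration only; 'sp', 'offset' and
the helpers are as in the original, and this is not the fix that was
eventually merged):

        pt_element_t chunk[64];         /* 64 entries per read */
        int i, j, r;

        for (i = 0; i < PT64_ENT_PER_PAGE; i += 64) {
                gpa_t pte_gpa = gfn_to_gpa(sp->gfn) +
                                (i + offset) * sizeof(pt_element_t);

                r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, chunk,
                                          sizeof(chunk));
                for (j = 0; j < 64; ++j)
                        if (r || is_present_pte(chunk[j]))
                                sp->spt[i + j] = shadow_trap_nonpresent_pte;
                        else
                                sp->spt[i + j] = shadow_notrap_nonpresent_pte;
        }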
From: Marcelo T. <mto...@re...> - 2008-04-28 18:12:57
On Fri, Apr 25, 2008 at 11:33:18AM -0600, David S. Ahern wrote:
> Most of the cycles (~80% of that 54k+) are spent in paging64_prefetch_page():
> [...]
> This loop is run 512 times and takes a total of ~45k cycles, or ~88 cycles
> per loop.
>
> This function gets run >20,000/sec during some of the kscand loops.

Hi David,

Do you see the mmu_recycled counter increase?
From: David S. A. <da...@ci...> - 2008-04-28 23:45:05
Hi Marcelo:

mmu_recycled is always 0 for this guest -- even after almost 4 hours of
uptime.

Here is a kvm_stat sample where guest time was very high and qemu had 2
processors at 100% on the host. I removed counters where both columns have 0
value for brevity.

exits               45937979  758051
fpu_reload           1416831      87
halt_exits            112911       0
halt_wakeup            31771       0
host_state_reload    2068602     263
insn_emulation      21601480  365493
io_exits             1827374    2705
irq_exits            8934818  285196
mmio_exits            421674     147
mmu_cache_miss       4817689   93680
mmu_flooded          4815273   93680
mmu_pde_zapped         51344       0
mmu_prefetch         4817625   93680
mmu_pte_updated     14803298  270104
mmu_pte_write       19859863  363785
mmu_shadow_zapped    4832106   93679
pf_fixed            32184355  468398
pf_guest              264138       0
remote_tlb_flush    10697762  280522
tlb_flush           10301338  176424

(NOTE: This is for a *5* second sample interval instead of 1 to allow me to
capture the data.)

Here's a sample when the guest is "well-behaved" (system time < 10%, though):

exits               51502194   97453
fpu_reload           1421736     227
halt_exits            138361    1927
halt_wakeup            33047     117
host_state_reload    2110190    3740
insn_emulation      24367441   47260
io_exits             1874075    2576
irq_exits           10224702   13333
mmio_exits            435154    1726
mmu_cache_miss       5414097   11258
mmu_flooded          5411548   11243
mmu_pde_zapped         52851      44
mmu_prefetch         5414031   11258
mmu_pte_updated     16854686   29901
mmu_pte_write       22526765   42285
mmu_shadow_zapped    5430025   11313
pf_fixed            36144578   67666
pf_guest              282794     430
remote_tlb_flush    12126268   14619
tlb_flush           11753162   21460

There is definitely a strong correlation between the mmu counters and high
system times in the guest. I am still trying to find out what in the guest is
stimulating it when running on RHEL3; I do not see this same behavior for an
equivalent setup running on RHEL4.

By the way, I added an mmu_prefetch stat in prefetch_page() to count the
number of times the for() loop is hit with PTTYPE == 64; i.e., the number of
times paging64_prefetch_page() is invoked. (I wanted an explicit counter for
this loop, though the info seems to duplicate other entries.) That counter is
listed above. As I mentioned in a prior post, when kscand kicks in the change
in the mmu_prefetch counter is 20,000+/sec, with each trip through that
function taking 45k+ cycles.

kscand is an instigator shortly after boot; however, kscand is *not* the
culprit once the system has been up for 30-45 minutes. I have started
instrumenting the RHEL3U8 kernel, and for the load I am running kscand does
not walk the active lists very often once the system is up.

So, to dig deeper on what in the guest is stimulating the mmu, I collected
kvmtrace data for about a 2-minute time interval which caught about a
30-second period where guest system time was steady in the 25-30% range.
Summarizing the number of times a RIP appears in a VMEXIT shows the following
high runners:

count   RIP         RHEL3 symbol
82549   0xc0140e42  follow_page      [kernel] c0140d90 offset b2
42532   0xc0144760  handle_mm_fault  [kernel] c01446d0 offset 90
36826   0xc013da4a  futex_wait       [kernel] c013d870 offset 1da
29987   0xc0145cd0  zap_pte_range    [kernel] c0145c10 offset c0
27451   0xc0144018  do_no_page       [kernel] c0143e20 offset 1f8

(The halt entry is removed from the list since that is the ideal scenario for
an exit.)

So the RIP correlates to follow_page() for a large percentage of the VMEXITs.

I wrote an awk script to summarize (histogram style) the TSC cycles between
VMEXIT and VMENTRY for an address. For the first rip, 0xc0140e42, 82,271 times
(i.e., almost 100% of the time) the trace shows a delta between 50k and 100k
cycles between the VMEXIT and the subsequent VMENTRY. Similarly for the second
one, 0xc0144760, 42,403 times (again almost 100% of the occurrences) the trace
shows a delta between 50k and 100k cycles between VMEXIT and VMENTRY. These
seem to correlate with the prefetch_page function in kvm, though I am not 100%
positive on that.

I am now investigating the kernel paths leading to those functions. Any
insights would definitely be appreciated.

david

Marcelo Tosatti wrote:
> Do you see the mmu_recycled counter increase?
From: David S. A. <da...@ci...> - 2008-04-30 04:18:21
Another tidbit for you guys as I make my way through various permutations:

I installed the RHEL3 hugemem kernel and the guest behavior is *much* better.
System time still has some regular hiccups that are higher than xen and esx
(e.g., 1-minute samples out of 5 show system time between 10 and 15%), but
overall guest behavior is good with the hugemem kernel.

One side effect I've noticed is that I cannot restart the RHEL3 guest running
the hugemem kernel in successive attempts. The guest has 2 vcpus and qemu
shows one thread at 100% cpu. If I recall correctly, kvm_stat shows a large
amount of tlb_flushes (like millions in a 5-second sample). The scenario is:

1. start the guest running the hugemem kernel,
2. shut it down,
3. restart the guest.

During step 3 it hangs, but at random points. Removing kvm/kvm-intel has no
effect -- the guest still hangs on the restart. Rebooting the host clears the
problem. Alternatively, during the hang on a restart I can kill the guest, and
then on restart choose the normal 32-bit smp kernel and the guest boots just
fine. At this point I can shut down the guest and restart with the hugemem
kernel and it boots just fine.

david

David S. Ahern wrote:
> mmu_recycled is always 0 for this guest -- even after almost 4 hours of
> uptime.
> [...]
> I am now investigating the kernel paths leading to those functions. Any
> insights would definitely be appreciated.
From: Avi K. <av...@qu...> - 2008-04-30 13:02:11
David S. Ahern wrote:
> Another tidbit for you guys as I make my way through various permutations:
>
> I installed the RHEL3 hugemem kernel and the guest behavior is *much*
> better. System time still has some regular hiccups that are higher than xen
> and esx (e.g., 1-minute samples out of 5 show system time between 10 and
> 15%), but overall guest behavior is good with the hugemem kernel.

Wait, the amount of info here is overwhelming. Let's stick with the current
kernel (32-bit, HIGHMEM4G, right?).

Did you get any traces with bypass_guest_pf=0? That may show more info.

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: David S. A. <da...@ci...> - 2008-04-30 13:45:22
Avi Kivity wrote:
> Wait, the amount of info here is overwhelming. Let's stick with the current
> kernel (32-bit, HIGHMEM4G, right?).
>
> Did you get any traces with bypass_guest_pf=0? That may show more info.

My preference is to stick with the "standard", 32-bit RHEL3 kernel in the
guest.

My point in the last email was that the hugemem kernel shows a remarkable
difference (it uses 3 levels of page tables, right?). I was hoping that would
ring a bell with someone.

Adding bypass_guest_pf=0 did not improve the situation. Did you want anything
particular with that setting -- like a RIP summary or a summary of exit-entry
cycles?

david
From: Avi K. <av...@qu...> - 2008-04-30 13:51:32
David S. Ahern wrote:
> My preference is to stick with the "standard", 32-bit RHEL3 kernel in the
> guest.

Me too. I would like to see all reasonable guests supported well, without
performance issues, and not have to tell the user which kernel to use.

> My point in the last email was that the hugemem kernel shows a remarkable
> difference (it uses 3 levels of page tables, right?). I was hoping that
> would ring a bell with someone.

From the traces I saw I think the standard kernel is pae as well. Can you
verify? I think it's CONFIG_HIGHMEM4G (instead of CONFIG_HIGHMEM64G) but that
option may be different for such an old kernel.

> Adding bypass_guest_pf=0 did not improve the situation. Did you want
> anything particular with that setting -- like a RIP summary or a summary of
> exit-entry cycles?

I asked for this thinking bypass_guest_pf may help show more information. But
thinking a bit more, it will not.

I think I do know what the problem is. I will try it out. Is there a free
clone (like centos) available somewhere?

--
Any sufficiently difficult bug is indistinguishable from a feature.
From: Avi K. <av...@qu...> - 2008-05-11 12:32:05
Attachments:
prevent-kscand-flooding.patch
|
Avi Kivity wrote: > > I asked for this thinking bypass_guest_pf may help show more > information. But thinking a bit more, it will not. > > I think I do know what the problem is. I will try it out. Is there a > free clone (like CentOS) available somewhere? This patch tracks down emulated accesses to speculated ptes and marks them as accessed, preventing the flooding on CentOS 3.1. Unfortunately it also causes a host oops midway through the boot process. I believe the oops is merely exposed by the patch, not caused by it. -- error compiling committee.c: too many arguments to function |
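For context, the "flooding" referred to here is KVM's shadow-MMU write-flood heuristic: when the guest keeps writing to a page that KVM is shadowing as a page table, KVM assumes the page has been recycled for data and unshadows ("zaps") it, and RHEL3's kscand thread, which sweeps page tables testing and clearing accessed bits, can look exactly like such a flood, so shadow pages get torn down and rebuilt continuously. The C program below is only a toy model of that heuristic to make the later "flood count" discussion easier to follow; the struct and field names are invented and do not come from KVM or from the attached patch, and the threshold of 3 simply mirrors the default flood count mentioned later in the thread.

/*
 * Toy model of a shadow-MMU write-flood heuristic, for illustration only.
 * Names are invented; this is not the attached patch and not KVM code.
 */
#include <stdio.h>

#define FLOOD_THRESHOLD 3   /* emulated writes tolerated before unshadowing */

struct shadow_page {
    unsigned long gfn;      /* guest frame shadowed as a page table */
    int write_count;        /* emulated writes since the page was last
                               used for translation */
    int zapped;             /* 1 once the shadow page has been torn down */
};

/* Called (in this model) for every emulated guest write that lands on a
 * shadowed page table page, e.g. kscand clearing accessed bits. */
static void emulate_pt_write(struct shadow_page *sp)
{
    if (sp->zapped)
        return;
    if (++sp->write_count >= FLOOD_THRESHOLD) {
        sp->zapped = 1;     /* unshadowed; rebuilding it later is expensive */
        printf("gfn %lu zapped after %d emulated writes\n",
               sp->gfn, sp->write_count);
    }
}

int main(void)
{
    struct shadow_page sp = { .gfn = 42 };

    /* One kscand-style pass over the ptes in this page table page is
     * enough to trip the heuristic. */
    for (int i = 0; i < 4; i++)
        emulate_pt_write(&sp);

    return 0;
}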
From: Avi K. <av...@qu...> - 2008-05-11 13:36:42
Attachments:
prevent-kscand-flooding.patch
|
Avi Kivity wrote: > Avi Kivity wrote: >> >> I asked for this thinking bypass_guest_pf may help show more >> information. But thinking a bit more, it will not. >> >> I think I do know what the problem is. I will try it out. Is there >> a free clone (like CentOS) available somewhere? > > This patch tracks down emulated accesses to speculated ptes and marks > them as accessed, preventing the flooding on CentOS 3.1. > Unfortunately it also causes a host oops midway through the boot process. > > I believe the oops is merely exposed by the patch, not caused by it. > It was caused by the patch; please try the updated one, attached. -- error compiling committee.c: too many arguments to function |
From: David S. A. <da...@ci...> - 2008-05-13 03:49:22
|
That does the trick with kscand. Do you have recommendations for clock source settings? For example, in my test case for this patch the guest gained 73 seconds (ahead of real time) after only 3 hours, 5 min of uptime. thanks, david Avi Kivity wrote: > Avi Kivity wrote: >> Avi Kivity wrote: >>> >>> I asked for this thinking bypass_guest_pf may help show more >>> information. But thinking a bit more, it will not. >>> >>> I think I do know what the problem is. I will try it out. Is there >>> a free clone (like CentOS) available somewhere? >> >> This patch tracks down emulated accesses to speculated ptes and marks >> them as accessed, preventing the flooding on CentOS 3.1. >> Unfortunately it also causes a host oops midway through the boot process. >> >> I believe the oops is merely exposed by the patch, not caused by it. >> > > It was caused by the patch; please try the updated one, attached. |
From: Avi K. <av...@qu...> - 2008-05-13 07:25:38
|
David S. Ahern wrote: > That does the trick with kscand. > > Not so fast... the patch updates the flood count to 5. Can you check if a lower value still works? Also, whether updating the flood count to 5 (without the rest of the patch) works? Unconditionally bumping the flood count to 5 will likely cause a performance regression on other guests. While I was able to see excessive flooding, I couldn't reproduce your kscand problem. Running /bin/true always returned immediately for me. > Do you have recommendations for clock source settings? For example, in my > test case for this patch the guest gained 73 seconds (ahead of real > time) after only 3 hours, 5 min of uptime. > The kernel is trying to correlate the TSC and the PIT, which isn't going to work. Try disabling the TSC: set edx.bit4=0 for cpuid.eax=1 in qemu-kvm-x86.c do_cpuid_ent(). -- error compiling committee.c: too many arguments to function |
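To make the CPUID suggestion concrete: the TSC feature flag is bit 4 of EDX in CPUID leaf 1, so hiding the TSC means clearing that bit wherever qemu-kvm builds the leaf-1 entry (do_cpuid_ent() in qemu-kvm-x86.c in that tree). The program below is only a stand-alone sketch of the masking; the cpuid_entry struct and the sample EDX value are made up for the demo and are not the real qemu-kvm types.

#include <stdint.h>
#include <stdio.h>

/* CPUID leaf 1 reports the TSC in EDX bit 4. */
#define CPUID_1_EDX_TSC (1u << 4)

/* Stand-in for the cpuid entry qemu-kvm fills in; the real type differs. */
struct cpuid_entry {
    uint32_t function;
    uint32_t eax, ebx, ecx, edx;
};

/* The tweak Avi describes, applied where the leaf-1 entry is constructed. */
static void hide_tsc(struct cpuid_entry *e)
{
    if (e->function == 1)
        e->edx &= ~CPUID_1_EDX_TSC;   /* guest stops seeing a TSC */
}

int main(void)
{
    /* Arbitrary example feature word with the TSC bit set. */
    struct cpuid_entry leaf1 = { .function = 1, .edx = 0xbfebfbffu };

    hide_tsc(&leaf1);
    printf("leaf 1 edx after masking: 0x%08x (TSC bit %s)\n",
           (unsigned)leaf1.edx,
           (leaf1.edx & CPUID_1_EDX_TSC) ? "set" : "clear");
    return 0;
}

With the bit clear, the 2.4 guest should fall back to PIT-only timekeeping instead of trying to discipline one clock against the other.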
From: David S. A. <da...@ci...> - 2008-05-14 20:36:13
|
Avi Kivity wrote: > Not so fast... the patch updates the flood count to 5. Can you check > if a lower value still works? Also, whether updating the flood count to > 5 (without the rest of the patch) works? > > Unconditionally bumping the flood count to 5 will likely cause a > performance regression on other guests. I put the flood count back to 3, and the RHEL3 guest performance is even better. > > While I was able to see excessive flooding, I couldn't reproduce your > kscand problem. Running /bin/true always returned immediately for me. That was a poor attempt at finding a simple, minimal re-create. The use case I am investigating has over 500 processes/threads with a base memory consumption around 1GB. I was finding it nearly impossible to put together a generic re-create of the problem for you to use in your investigations on CentOS. Thanks for the patch. david |
From: Avi K. <av...@qu...> - 2008-05-15 10:53:38
|
David S. Ahern wrote: > Avi Kivity wrote: > >> Not so fast... the patch updates the flood count to 5. Can you check >> if a lower value still works? Also, whether updating the flood count to >> 5 (without the rest of the patch) works? >> >> Unconditionally bumping the flood count to 5 will likely cause a >> performance regression on other guests. >> > > I put the flood count back to 3, and the RHEL3 guest performance is even > better. > > Okay, I committed the patch without the flood count == 5. -- error compiling committee.c: too many arguments to function |
From: David S. A. <da...@ci...> - 2008-05-17 04:31:29
Attachments:
kvm-stats-rhel3
|
Avi Kivity wrote: > > Okay, I committed the patch without the flood count == 5. > I've continued testing the RHEL3 guests with the flood count at 3, and I am right back to where I started. With the patch and the flood count at 3, I had 2 runs totaling around 24 hours that looked really good. Now, I am back to square one. I guess the short of it is that I am not sure if the patch resolves this issue or not. If you want to back it out, I can continue to apply it on my end as I continue testing. A snapshot of kvm_stat -f 'mmu*' -l is attached for two test runs with the patch (line wrap is horrible inline). I will work on creating an app that will stimulate kscand activity similar to what I am seeing. Also, in a prior e-mail I mentioned guest time advancing rapidly. I've noticed that with the -no-kvm-pit option the guest time is much better and typically stays within 3 seconds or so of the host, even through the high kscand activity, which is one case where I've noticed time jumps with the kernel PIT. Yes, this result has been repeatable through 6 or so runs. :-) david |
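As a starting point for that reproducer, a kscand stressor might look something like the sketch below: fork a batch of workers that each map some anonymous memory and keep touching every page, so the guest's kscand thread always has freshly set accessed bits to scan and clear. The process count and per-process size are arbitrary knobs chosen for illustration, not numbers taken from this thread.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NPROC 32                        /* worker processes to fork */
#define CHUNK (32UL * 1024 * 1024)      /* anonymous memory per worker */
#define PAGE  4096UL

static void worker(void)
{
    char *buf = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        _exit(1);

    for (;;) {                          /* touch every page, forever */
        for (unsigned long off = 0; off < CHUNK; off += PAGE)
            buf[off]++;
        usleep(100 * 1000);             /* give kscand time to sweep */
    }
}

int main(void)
{
    for (int i = 0; i < NPROC; i++) {
        pid_t pid = fork();
        if (pid == 0)
            worker();                   /* never returns */
        if (pid < 0)
            perror("fork");
    }
    while (wait(NULL) > 0)              /* parent just parks here */
        ;
    return 0;
}

Scaling NPROC and CHUNK toward the 500-process, roughly 1GB footprint described earlier should get the test closer to the real workload's page-table churn.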