From: Anthony L. <an...@co...> - 2008-04-23 15:47:46
Avi Kivity wrote:
> Anthony Liguori wrote:
>> The ne2k is pretty mmio heavy. You should be able to observe a boost
>> with something like iperf (guest=>host) I would think, if this is a
>> real savings.
>
> If we're just improving ne2k, the complexity isn't worth it. We have
> two better nics which are widely supported in guests.

Sure, but the ne2k should be pretty close to an optimal synthetic
benchmark for batching. If we don't see a big improvement with it,
we're probably not going to see a big improvement with anything. As
you point out, though, the converse isn't necessarily true.

Regards,

Anthony Liguori
From: Robin H. <ho...@sg...> - 2008-04-23 15:45:34
On Wed, Apr 23, 2008 at 03:44:27PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:14:26PM -0700, Christoph Lameter wrote:
>> We want a full solution and this kind of patching makes the patches
>> difficult to review because later patches revert earlier ones.
>
> I know you would rather see KVM development stalled for more months
> than get a partial solution now that already covers KVM and GRU with
> the same API that XPMEM will also use later. It's very unfair of you
> to stall other people's development when what you need has stronger
> requirements and can't be merged immediately. This is especially true
> given it was publicly stated that XPMEM never passed all regression
> tests anyway, so you can't possibly be in such

XPMEM has passed all regression tests using your version 12 notifiers.
I have a bug in xpmem which shows up on our 8x oversubscription tests,
but that is clearly my bug to figure out. Unfortunately it only shows
up on a 128-processor machine, so I have 1024 stack traces to sort
through each time it fails. That does take a bit of time and a lot of
concentration.

> a hurry like we are; we can't progress without this. In fact we can

SGI is under an equally strict timeline. We really needed the sleeping
version in 2.6.26. We may still be able to get this accepted by vendor
distros if we make 2.6.27.

Thanks,
Robin
From: David S. A. <da...@ci...> - 2008-04-23 15:24:03
>> Avi Kivity wrote:
>>> David S. Ahern wrote:
>>>> I added the traces and captured data over another apparent lockup
>>>> of the guest. This seems to be representative of the sequence
>>>> (pid/vcpu removed).
>>>>
>>>> (+ 4776)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
>>>> (+    0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
>>>> (+ 3632)  VMENTRY
>>>> (+ 4552)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
>>>> (+    0)  PAGE_FAULT  [ errorcode = 0x0000000b, virt = 0x00000000 fffb61c8 ]
>>>> (+54928)  VMENTRY
>>>
>>> Can you oprofile the host to see where the 54K cycles are spent?

I've continued drilling down with the tracers to answer that question.
I have done runs with tracers in paging64_page_fault, and they showed
the overhead is in the fetch() function. On my last run the tracers
were in paging64_fetch() as follows:

1. after the is_present_pte() check
2. before kvm_mmu_get_page()
3. after kvm_mmu_get_page()
4. after the if (!metaphysical) {} block

The delta between 2 and 3 shows the cycles to run kvm_mmu_get_page().
The delta between 3 and 4 shows the cycles to run
kvm_read_guest_atomic(), if it is run. Tracer 1 dumps
vcpu->arch.last_pt_write_count (a carryover from when the new tracers
were in paging64_page_fault); tracer 2 dumps the level, metaphysical
and access variables; tracer 5 dumps the value in shadow_ent.

A representative trace sample is:

(+ 4576)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+    0)  PAGE_FAULT  [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
(+ 2664)  PAGE_FAULT1 [ write_count = 0 ]
(+  472)  PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
(+50416)  PAGE_FAULT3
(+  472)  PAGE_FAULT4
(+  856)  PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9276d043 ]
(+ 1528)  VMENTRY
(+ 4992)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610e7 ]
(+    0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 2296)  PAGE_FAULT1 [ write_count = 0 ]
(+  816)  PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ]
(+    0)  PTE_WRITE   [ gpa = 0x00000000 00009db4 gpte = 0x00000000 4176d363 ]
(+ 6424)  VMENTRY
(+ 3864)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c01610ee ]
(+    0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db0 ]
(+ 2496)  PAGE_FAULT1 [ write_count = 1 ]
(+  856)  PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809041 ]
(+    0)  PTE_WRITE   [ gpa = 0x00000000 00009db0 gpte = 0x00000000 00000000 ]
(+10248)  VMENTRY
(+ 4744)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016127c ]
(+    0)  PAGE_FAULT  [ errorcode = 0x00000003, virt = 0x00000000 c0009db4 ]
(+ 2408)  PAGE_FAULT1 [ write_count = 2 ]
(+  760)  PAGE_FAULT5 [ shadow_ent_val = 0x00000000 8a809043 ]
(+ 1240)  VMENTRY
(+ 4624)  VMEXIT      [ exitcode = 0x00000000, rip = 0x00000000 c016104a ]
(+    0)  PAGE_FAULT  [ errorcode = 0x0000000b, virt = 0x00000000 fffb6950 ]
(+ 2512)  PAGE_FAULT1 [ write_count = 0 ]
(+  496)  PAGE_FAULT2 [ level = 2 metaphysical = 0 access 0x00000007 ]
(+48664)  PAGE_FAULT3
(+  472)  PAGE_FAULT4
(+  856)  PAGE_FAULT5 [ shadow_ent_val = 0x80000000 9272d043 ]
(+ 1576)  VMENTRY

So basically every fourth trip through the fetch function it runs
kvm_mmu_get_page(). A summary of the entire trace file shows this
function rarely executes in less than 50,000 cycles. Also,
vcpu->arch.last_pt_write_count is always 0 when the high cycle counts
are hit.

More tidbits:

- The hugepage option seems to have no effect: the system spikes and
  overhead occur with and without the hugepage option (the above data
  is with it).
- As the guest runs for hours, the intensity of the spikes drops,
  though they still occur regularly, and kscand continues to be the
  primary suspect. qemu's RSS tends toward the guest's memory
  allotment of 2GB. Internal guest memory usage runs at ~1GB page
  cache, 57M buffers, 24M swap, and ~800MB for processes.
- I have looked at process creation and do not see a strong
  correlation between system-time spikes and the number of new
  processes. So far the only correlations are with kscand and the
  amount of memory used; i.e., stock RHEL3 with few processes shows
  tiny spikes, whereas in my tests with 90+ processes using about 800M
  plus a continually updating page cache (i.e., moderate IO levels)
  the spikes are strong and last for seconds.
- Time runs really fast in the guest, gaining several minutes in 24
  hours.

I'll download your kvm_stat update and give it a try. When I started
this investigation I was using Christian's kvmstat script, which
dumped stats to a file. Plots of that data did not show a strong
correlation with guest system time.

david
From: Avi K. <av...@qu...> - 2008-04-23 15:16:38
Anthony Liguori wrote:
> The ne2k is pretty mmio heavy. You should be able to observe a boost
> with something like iperf (guest=>host) I would think, if this is a
> real savings.

If we're just improving ne2k, the complexity isn't worth it. We have
two better nics which are widely supported in guests.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
From: Anthony L. <an...@co...> - 2008-04-23 15:10:48
Avi Kivity wrote:
> Laurent Vivier wrote:
>
> That's the lovely animation. You can time XP startup, or look at
> system time in 'top' once it stabilizes. Also compare total
> host_state_reloads to boot in kvm_stat (each costs 5 usec, after
> discount).

Win2k's boot-up animation (the moving bar) is done in VGA planar mode.
You should be able to time the difference in startup.

>> and in groups of 10 with XP when we move a window.
>
> That's probably programming the cirrus blitter, also a candidate for
> batching, but probably not measurable. Maybe one of those Windows
> graphics benchmarks which draw tons of rectangles very fast and cause
> epilepsy seizures.
>
> I guess the most interesting case is e1000, if we need to do more
> than one mmio to start transmitting.

The ne2k is pretty mmio heavy. You should be able to observe a boost
with something like iperf (guest=>host) I would think, if this is a
real savings.

Regards,

Anthony Liguori
From: Avi K. <av...@qu...> - 2008-04-23 15:08:24
Laurent Vivier wrote:
>> This breaks ordering on smp guests. batch_data needs to be a kvm
>> thing, not a vcpu thing, and locked, of course.
>
> - is ordering between vcpus important when we already delay
>   operations?

Yes. Ordering is very different from delaying. Also, we will only
delay writes which don't matter (those that simply latch something
into a register, which is later read).

> - using vcpu avoids the lock

Correctness beats performance.

> - Why are PIO (pio_data) vcpu things and not kvm things, then?

pio batching is used for just a single instruction that can work with
batches (ins/outs), so a guest can't expect to see partial output.
With mmio, you can have:

1. mmio
2. raise flag in shared memory
3. other vcpu sees flag, does mmio
4. spin-wait for ack from other vcpu
5. mmio

You can't do that with ins/outs.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
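The five-step sequence above is exactly what per-vcpu queueing would
reorder. A minimal sketch of the hazard (an editor's illustration, not
code from the thread; mmio_reg and shared_flag are hypothetical names,
and memory barriers are omitted):

#include <linux/io.h>
#include <asm/processor.h>

/* vcpu0: the mmio write sits in a per-vcpu queue, invisible to
 * userspace, but the flag in ordinary RAM is visible at once. */
static void vcpu0_path(void __iomem *mmio_reg, volatile int *shared_flag)
{
	writel(0x1, mmio_reg);	/* queued, not yet delivered */
	*shared_flag = 1;	/* plain memory write, visible immediately */
}

/* vcpu1: sees the flag and issues its own mmio, which reaches the
 * device model BEFORE vcpu0's earlier write: inverted ordering. */
static void vcpu1_path(void __iomem *mmio_reg, volatile int *shared_flag)
{
	while (!*shared_flag)
		cpu_relax();
	writel(0x2, mmio_reg);	/* delivered ahead of vcpu0's write */
}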
From: Laurent V. <Lau...@bu...> - 2008-04-23 14:59:23
On Wednesday 23 April 2008 at 17:31 +0300, Avi Kivity wrote:
> Laurent Vivier wrote:
> > This patch is the kernel part of the "batch writes to MMIO" patch.
> >
> > When the kernel has to send MMIO writes to userspace, it stores
> > them in memory until it has to pass control to userspace for
> > another reason. This avoids too many context switches for
> > operations that can wait.
> >
> > WARNING: this breaks compatibility with the old userspace part.
> >
> > Signed-off-by: Laurent Vivier <Lau...@bu...>
> > ---
> >  arch/x86/kvm/x86.c         |   21 +++++++++++++++++++++
> >  include/asm-x86/kvm_host.h |    2 ++
> >  include/linux/kvm.h        |   10 +++++++++-
> >  virt/kvm/kvm_main.c        |    3 +++
> >  4 files changed, 35 insertions(+), 1 deletions(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 0ce5563..3881056 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2942,8 +2942,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
> >  		kvm_x86_ops->decache_regs(vcpu);
> >  	}
> >
> > +batch:
> >  	r = __vcpu_run(vcpu, kvm_run);
> >
> > +	if (!r && vcpu->mmio_is_write &&
> > +	    kvm_run->exit_reason == KVM_EXIT_MMIO &&
> > +	    kvm_run->batch_count < KVM_MAX_BATCH) {
> > +		struct kvm_batch *batch = vcpu->arch.batch_data;
> > +		int i = kvm_run->batch_count++;
> > +
> > +		batch[i].phys_addr = vcpu->mmio_phys_addr;
> > +		batch[i].len = vcpu->mmio_size;
> > +		memcpy(batch[i].data, vcpu->mmio_data, batch[i].len);
>
> This breaks ordering on smp guests. batch_data needs to be a kvm
> thing, not a vcpu thing, and locked, of course.

- is ordering between vcpus important when we already delay
  operations?
- using vcpu avoids the lock
- Why are PIO (pio_data) vcpu things and not kvm things, then?

> Also, you don't want to queue writes which trigger I/O since that
> will affect latency. Userspace should tell the kernel which mmio
> addresses are queueable.

I agree (but in my first patch it was easier to ignore this).

> > +
> > +		goto batch;
>
> If you move this to within __vcpu_run, you won't need to loop. Maybe
> the best place is where we actually decide it's an mmio write.

I agree.

> You also need to flush the queue each time you have an in-kernel mmio
> write. For example:
>
>  vcpu0                   vcpu1
>
>  mmio write (queued)
>                          apic write -> IPI
>                          doesn't see effects of write

I agree.

> > +	}
> >  out:
> >  	if (vcpu->sigset_active)
> >  		sigprocmask(SIG_SETMASK, &sigsaved, NULL);
> > @@ -3830,6 +3843,13 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
> >  	}
> >  	vcpu->arch.pio_data = page_address(page);
> >
> > +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> > +	if (!page) {
> > +		r = -ENOMEM;
> > +		goto fail;
> > +	}
> > +	vcpu->arch.batch_data = page_address(page);
> > +
> >  	r = kvm_mmu_create(vcpu);
> >  	if (r < 0)
> >  		goto fail_free_pio_data;
> > @@ -3857,6 +3877,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
> >  	kvm_mmu_destroy(vcpu);
> >  	up_read(&vcpu->kvm->slots_lock);
> >  	free_page((unsigned long)vcpu->arch.pio_data);
> > +	free_page((unsigned long)vcpu->arch.batch_data);
> >  }
> >
> >  #define CR3_PAE_RESERVED_BITS ((X86_CR3_PWT | X86_CR3_PCD) - 1)
> >  #define CR3_NONPAE_RESERVED_BITS ((PAGE_SIZE-1) & ~(X86_CR3_PWT | X86_CR3_PCD))
> > @@ -255,6 +256,7 @@ struct kvm_vcpu_arch {
> >  	gva_t mmio_fault_cr2;
> >  	struct kvm_pio_request pio;
> >  	void *pio_data;
> > +	void *batch_data;
>
> It's an array of structs, no? So it shouldn't be a void *.

Yes (I love cut&paste...).

> > +#define KVM_MAX_BATCH (PAGE_SIZE / sizeof(struct kvm_batch))
> > +struct kvm_batch {
> > +	__u64 phys_addr;
> > +	__u32 len;
> > +	__u8 data[8];
> > +};
>
> Size is 24 on 64-bit and 20 on 32-bit. Please pad (after len, so data
> is nicely aligned).

OK.

Thank you,
Laurent
--
------------- Lau...@bu... ---------------
"The best way to predict the future is to invent it."
- Alan Kay
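For reference, the padded layout Avi asks for might look like this (an
editor's sketch, not taken from a later posting; the pad field name is
an assumption):

/* With four bytes of padding after len, sizeof(struct kvm_batch) is
 * 24 on both 32-bit and 64-bit builds, and data[] is 8-byte aligned. */
struct kvm_batch {
	__u64 phys_addr;
	__u32 len;
	__u32 pad;		/* hypothetical field name */
	__u8  data[8];
};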
From: Robin H. <ho...@sg...> - 2008-04-23 14:48:34
On Wed, Apr 23, 2008 at 03:36:19PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
>>> The only other change I did has been to move
>>> mmu_notifier_unregister to the end of the patchset after getting
>>> more questions about its reliability, and I documented a bit the
>>> rmmod requirements for ->release. We'll think later whether it
>>> makes sense to add it; nobody's using it anyway.
>>
>> XPMEM is using it. GRU will be as well (probably already does).
>
> XPMEM requires more patches anyway. Note that in a previous email you
> told me you weren't using it. I think GRU can work fine on 2.6.26

I said I could test without it. It is needed for the final version. It
also makes the API consistent. What you are proposing is equivalent to
having a file you can open but never close.

This whole discussion seems ludicrous. You could refactor the code to
get the sorted list of locks, pass that list into mm_lock to do the
locking, do the register/unregister, then pass the same list into
mm_unlock. If the allocation fails, you could fall back to the older,
slower method of repeatedly scanning the lists and acquiring locks in
ascending order.

> without mmu_notifier_unregister, like KVM too. You simply have to
> unpin the module count in ->release. The most important bit is that
> you have to do that anyway in case mmu_notifier_unregister fails
> (and it can

If you are not going to provide the _unregister callout, you need to
change the API so I can scan the list of notifiers to see if my
structures are already registered.

We register our notifier structure at device open time. If we receive
a _release callout, we mark our structure as unregistered. At device
close time, if we have not been unregistered, we call _unregister. If
you take away _unregister, I have an xpmem kernel structure in use
_AFTER_ the device is closed, with no indication that the process is
using it. In that case, I need to take an extra reference on the
module in my device open method and hold that reference until the
_release callout. Additionally, if the user's program reopens the
device, I need to scan the mmu_notifiers list to see if this task's
notifier is already registered.

I view _unregister as essential. Did I miss something?

Thanks,
Robin
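A rough sketch of the open/release/close lifecycle Robin describes,
against the mmu-notifier API under discussion (an illustration under
stated assumptions: the xpmem_data type, helper names, and error
handling are invented, and locking is omitted):

#include <linux/mmu_notifier.h>
#include <linux/module.h>

struct xpmem_data {			/* hypothetical per-open state */
	struct mmu_notifier mn;
	bool registered;		/* cleared by ->release */
};

static void xpmem_mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct xpmem_data *xd = container_of(mn, struct xpmem_data, mn);

	xd->registered = false;		/* mm is dying; don't unregister later */
	module_put(THIS_MODULE);	/* drop the pin taken at open */
}

static const struct mmu_notifier_ops xpmem_mn_ops = {
	.release = xpmem_mn_release,
	/* invalidate callbacks omitted from this sketch */
};

static int xpmem_open(struct xpmem_data *xd, struct mm_struct *mm)
{
	int ret;

	xd->mn.ops = &xpmem_mn_ops;
	__module_get(THIS_MODULE);	/* stay loaded until release or close */
	ret = mmu_notifier_register(&xd->mn, mm);
	if (ret) {
		module_put(THIS_MODULE);
		return ret;
	}
	xd->registered = true;
	return 0;
}

static void xpmem_close(struct xpmem_data *xd, struct mm_struct *mm)
{
	if (xd->registered) {		/* ->release never fired */
		mmu_notifier_unregister(&xd->mn, mm);
		module_put(THIS_MODULE);
	}
}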
From: Avi K. <av...@qu...> - 2008-04-23 14:44:09
Laurent Vivier wrote:
> On Wednesday 23 April 2008 at 17:05 +0300, Avi Kivity wrote:
>> Laurent Vivier wrote:
>>> These two patches allow batching writes to MMIO.
>>>
>>> When the kernel has to send MMIO writes to userspace, it stores
>>> them in memory until it has to pass control to userspace for
>>> another reason. This avoids too many context switches for
>>> operations that can wait.
>>
>> Did you obtain any measurable performance benefit?
>
> Well, the problem is how to measure it. Really, I don't know.
>
> But when I add traces I see MMIO writes are batched: in groups of 170
> (the max in a page) at the beginning,

That's the lovely animation. You can time XP startup, or look at
system time in 'top' once it stabilizes. Also compare total
host_state_reloads to boot in kvm_stat (each costs 5 usec, after
discount).

> and in groups of 10 with XP when we move a window.

That's probably programming the cirrus blitter, also a candidate for
batching, but probably not measurable. Maybe one of those Windows
graphics benchmarks which draw tons of rectangles very fast and cause
epilepsy seizures.

I guess the most interesting case is e1000, if we need to do more than
one mmio to start transmitting.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
From: Avi K. <av...@qu...> - 2008-04-23 14:34:46
Laurent Vivier wrote:
> This patch is the kernel part of the "batch writes to MMIO" patch.
>
> When the kernel has to send MMIO writes to userspace, it stores them
> in memory until it has to pass control to userspace for another
> reason. This avoids too many context switches for operations that
> can wait.
>
> WARNING: this breaks compatibility with the old userspace part.
>
> Signed-off-by: Laurent Vivier <Lau...@bu...>
> ---
>  arch/x86/kvm/x86.c         |   21 +++++++++++++++++++++
>  include/asm-x86/kvm_host.h |    2 ++
>  include/linux/kvm.h        |   10 +++++++++-
>  virt/kvm/kvm_main.c        |    3 +++
>  4 files changed, 35 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0ce5563..3881056 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2942,8 +2942,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>  		kvm_x86_ops->decache_regs(vcpu);
>  	}
>
> +batch:
>  	r = __vcpu_run(vcpu, kvm_run);
>
> +	if (!r && vcpu->mmio_is_write &&
> +	    kvm_run->exit_reason == KVM_EXIT_MMIO &&
> +	    kvm_run->batch_count < KVM_MAX_BATCH) {
> +		struct kvm_batch *batch = vcpu->arch.batch_data;
> +		int i = kvm_run->batch_count++;
> +
> +		batch[i].phys_addr = vcpu->mmio_phys_addr;
> +		batch[i].len = vcpu->mmio_size;
> +		memcpy(batch[i].data, vcpu->mmio_data, batch[i].len);

This breaks ordering on smp guests. batch_data needs to be a kvm
thing, not a vcpu thing, and locked, of course.

Also, you don't want to queue writes which trigger I/O since that will
affect latency. Userspace should tell the kernel which mmio addresses
are queueable.

> +
> +		goto batch;

If you move this to within __vcpu_run, you won't need to loop. Maybe
the best place is where we actually decide it's an mmio write.

You also need to flush the queue each time you have an in-kernel mmio
write. For example:

 vcpu0                   vcpu1

 mmio write (queued)
                         apic write -> IPI
                         doesn't see effects of write

> +	}
>  out:
>  	if (vcpu->sigset_active)
>  		sigprocmask(SIG_SETMASK, &sigsaved, NULL);
> @@ -3830,6 +3843,13 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
>  	}
>  	vcpu->arch.pio_data = page_address(page);
>
> +	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	if (!page) {
> +		r = -ENOMEM;
> +		goto fail;
> +	}
> +	vcpu->arch.batch_data = page_address(page);
> +
>  	r = kvm_mmu_create(vcpu);
>  	if (r < 0)
>  		goto fail_free_pio_data;
> @@ -3857,6 +3877,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>  	kvm_mmu_destroy(vcpu);
>  	up_read(&vcpu->kvm->slots_lock);
>  	free_page((unsigned long)vcpu->arch.pio_data);
> +	free_page((unsigned long)vcpu->arch.batch_data);
>  }
>
>  #define CR3_PAE_RESERVED_BITS ((X86_CR3_PWT | X86_CR3_PCD) - 1)
>  #define CR3_NONPAE_RESERVED_BITS ((PAGE_SIZE-1) & ~(X86_CR3_PWT | X86_CR3_PCD))
> @@ -255,6 +256,7 @@ struct kvm_vcpu_arch {
>  	gva_t mmio_fault_cr2;
>  	struct kvm_pio_request pio;
>  	void *pio_data;
> +	void *batch_data;

It's an array of structs, no? So it shouldn't be a void *.

> +#define KVM_MAX_BATCH (PAGE_SIZE / sizeof(struct kvm_batch))
> +struct kvm_batch {
> +	__u64 phys_addr;
> +	__u32 len;
> +	__u8 data[8];
> +};

Size is 24 on 64-bit and 20 on 32-bit. Please pad (after len, so data
is nicely aligned).

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
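A kvm-wide, locked queue along the lines suggested here might look
like the following sketch (illustrative only; the kvm_batch_queue
type, its lock, and the helper name are assumptions, not part of the
posted patch):

#include <linux/spinlock.h>
#include <linux/string.h>

/* One queue per VM rather than per vcpu, protected by a spinlock;
 * in-kernel mmio paths (e.g. the apic) would flush it before writing. */
struct kvm_batch_queue {
	spinlock_t lock;
	u32 count;
	struct kvm_batch entries[KVM_MAX_BATCH];
};

/* Returns true if queued; false means the caller must exit to
 * userspace (queue full, or the address was not marked queueable). */
static bool kvm_queue_mmio_write(struct kvm_batch_queue *q,
				 u64 phys_addr, const void *data, u32 len)
{
	bool queued = false;

	spin_lock(&q->lock);
	if (q->count < KVM_MAX_BATCH) {
		struct kvm_batch *b = &q->entries[q->count++];

		b->phys_addr = phys_addr;
		b->len = len;
		memcpy(b->data, data, len);
		queued = true;
	}
	spin_unlock(&q->lock);
	return queued;
}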
From: Laurent V. <Lau...@bu...> - 2008-04-23 14:23:39
On Wednesday 23 April 2008 at 17:05 +0300, Avi Kivity wrote:
> Laurent Vivier wrote:
>> These two patches allow batching writes to MMIO.
>>
>> When the kernel has to send MMIO writes to userspace, it stores them
>> in memory until it has to pass control to userspace for another
>> reason. This avoids too many context switches for operations that
>> can wait.
>
> Did you obtain any measurable performance benefit?

Well, the problem is how to measure it. Really, I don't know.

But when I add traces I see MMIO writes are batched: in groups of 170
(the max in a page) at the beginning, and in groups of 10 with XP when
we move a window.

So all comments are welcome...

>> WARNING: this breaks compatibility with the old userspace part.
>
> So it's just an RFC :)

Yes... to get some comments on how to manage this :)

Laurent
--
------------- Lau...@bu... ---------------
"The best way to predict the future is to invent it."
- Alan Kay
From: Avi K. <av...@qu...> - 2008-04-23 14:09:26
Laurent Vivier wrote:
> These two patches allow batching writes to MMIO.
>
> When the kernel has to send MMIO writes to userspace, it stores them
> in memory until it has to pass control to userspace for another
> reason. This avoids too many context switches for operations that
> can wait.

Did you obtain any measurable performance benefit?

> WARNING: this breaks compatibility with the old userspace part.

So it's just an RFC :)

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
From: Avi K. <av...@qu...> - 2008-04-23 13:59:38
The preliminary KVM Forum 2008 agenda is now online at

  http://kforum.qumranet.com/KVMForum/agenda.php

If you haven't done so already, please register! There is a big
"Register Now" button on the page. Early bird registration ends May 1.

Note we have a few more sessions in the pipeline; the page will be
updated in a few days.

[Speakers: if we've misspelled your name or if you'd like to change
your slot, let us know]

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
From: Nguyen A. Q. <aq...@gm...> - 2008-04-23 13:49:36
On 4/23/08, H. Peter Anvin <hp...@zy...> wrote:
> Nguyen Anh Quynh wrote:
>> Hi,
>>
>> I am thinking about combining this ROM with extboot. Both ROMs are
>> about "booting", so I think that is reasonable. We would then have
>> only one ROM that supports both external boot and Linux boot.
>>
>> Is that desirable or not?
>
> Does it make the code simpler and easier to understand? If not, then
> I would say no.

Actually, no. I must say that it looks ugly to me. The fact is that
the linuxboot ROM and the extboot ROM do different things; they only
overlap in that they both intercept int 19. I very much like the Unix
principle of "do one thing, and do it well", so I prefer to have two
independent ROMs: one to boot Linux, and one to do external boot. I
see no problem in having more than one ROM, and these two can be
loaded into memory independently. So I think I agree with Peter here.

Thanks,
Q
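For context, "intercepting int 19" means replacing the
bootstrap-loader entry in the real-mode interrupt vector table during
option ROM initialization. A rough sketch of the idea (an illustration
only, not either ROM's actual code; rom_int19_handler is a
hypothetical symbol, and real-mode far-pointer details are glossed
over):

#include <stdint.h>

/* The real-mode IVT lives at physical address 0; each entry is an
 * offset:segment pair.  Vector 0x19 is the BIOS bootstrap loader. */
struct ivt_entry {
	uint16_t offset;
	uint16_t segment;
};

#define IVT ((volatile struct ivt_entry *)0)

static struct ivt_entry old_int19;	/* saved so the handler can chain */

extern void rom_int19_handler(void);	/* hypothetical asm entry point */

/* Called from the option ROM's init entry, before the BIOS boots. */
void hook_int19(uint16_t rom_segment)
{
	old_int19.offset  = IVT[0x19].offset;
	old_int19.segment = IVT[0x19].segment;
	IVT[0x19].offset  = (uint16_t)(uintptr_t)rom_int19_handler;
	IVT[0x19].segment = rom_segment;
}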
From: Andrea A. <an...@qu...> - 2008-04-23 13:47:06
On Tue, Apr 22, 2008 at 06:07:27PM -0500, Robin Holt wrote:
>> The only other change I did has been to move mmu_notifier_unregister
>> to the end of the patchset after getting more questions about its
>> reliability, and I documented a bit the rmmod requirements for
>> ->release. We'll think later whether it makes sense to add it;
>> nobody's using it anyway.
>
> XPMEM is using it. GRU will be as well (probably already does).

XPMEM requires more patches anyway. Note that in a previous email you
told me you weren't using it. I think GRU can work fine on 2.6.26
without mmu_notifier_unregister, like KVM too. You simply have to
unpin the module count in ->release. The most important bit is that
you have to do that anyway in case mmu_notifier_unregister fails (and
it can fail because of vmalloc space shortage, because somebody loaded
some framebuffer driver or whatever).
From: Andrea A. <an...@qu...> - 2008-04-23 13:44:24
On Tue, Apr 22, 2008 at 04:14:26PM -0700, Christoph Lameter wrote:
> We want a full solution and this kind of patching makes the patches
> difficult to review because later patches revert earlier ones.

I know you would rather see KVM development stalled for more months
than get a partial solution now that already covers KVM and GRU with
the same API that XPMEM will also use later. It's very unfair of you
to stall other people's development when what you need has stronger
requirements and can't be merged immediately. This is especially true
given it was publicly stated that XPMEM never passed all regression
tests anyway, so you can't possibly be in such a hurry like we are; we
can't progress without this. In fact we can, but it would be a huge
effort, it would run _slower_, and it would all need to be deleted
once mmu notifiers are in.

Note that the only patch you can avoid with your approach is
mm_lock-rwsem; given that's software-developed and not
human-developed, I don't see a big deal of wasted effort. The main
difference is the ordering. Most of the code is orthogonal, so there's
not much to revert.
From: Nguyen A. Q. <aq...@gm...> - 2008-04-23 13:41:21
On 4/22/08, Anthony Liguori <ali...@us...> wrote:
> Nguyen Anh Quynh wrote:
>> Hi,
>>
>> This should be submitted to upstream (but not to the kvm-devel
>> list), but this is only test code that I want to quickly send out
>> for comments. In case it looks OK, I will send it upstream later.
>>
>> Inspired by extboot and conversations with Anthony and HPA, this
>> linuxboot option ROM is a simple option ROM that intercepts int19
>> in order to execute the Linux setup code. This approach eliminates
>> the need to manipulate the boot sector for this purpose.
>>
>> To test it, just load a Linux kernel with your KVM/QEMU image using
>> the -kernel option in the normal way.
>>
>> I successfully compiled and tested it with kvm-66 on Ubuntu 7.10,
>> guest Ubuntu 8.04.
>
> For the next rounds, could you actually rebase against upstream QEMU
> and submit to qemu-devel? One of Paul Brook's objections to extboot
> had historically been that it wasn't easily sharable with other
> architectures. With a C version, it seems more reasonable now to do
> that.

OK.

> Make sure you remove all the old linux boot code too within QEMU,
> along with the -hda checks.

Sure.

Thanks,
Q
From: Nguyen A. Q. <aq...@gm...> - 2008-04-23 13:40:47
On 4/22/08, Alexander Graf <al...@cs...> wrote:
> I believe that's the way to go. If you have spare time on your hands,
> feel free to integrate my multiboot patches as well.

OK, that looks straightforward enough.

Thanks,
Q

> On Apr 22, 2008, at 11:07 AM, Nguyen Anh Quynh wrote:
>> Hi,
>>
>> I am thinking about combining this ROM with extboot. Both ROMs are
>> about "booting", so I think that is reasonable. We would then have
>> only one ROM that supports both external boot and Linux boot.
>>
>> Is that desirable or not?
>>
>> Thanks,
>> Quynh
>>
>> On 4/21/08, Nguyen Anh Quynh <aq...@gm...> wrote:
>>> Hmm, the last patch includes a binary. So please take this patch
>>> instead.
>>>
>>> Thanks,
>>> Q
>>>
>>> # diffstat linuxboot1.diff
>>>  Makefile             |   13 ++++-
>>>  linuxboot/Makefile   |   40 +++++++++++++++
>>>  linuxboot/boot.S     |   54 +++++++++++++++++++++
>>>  linuxboot/farvar.h   |  130 ++++++++++++++++++++++++++++++++++++++++
>>>  linuxboot/rom.c      |  104 ++++++++++++++++++++++++++++++++++++++++
>>>  linuxboot/signrom.c  |  128 ++++++++++++++++++++++++++++++++++++++++
>>>  linuxboot/util.h     |   69 +++++++++++++++++++++++++++
>>>  qemu/Makefile        |    3 -
>>>  qemu/Makefile.target |    2
>>>  qemu/hw/linuxboot.c  |   39 +++++++++++++++
>>>  qemu/hw/pc.c         |   22 +++++++-
>>>  qemu/hw/pc.h         |    5 +
>>>  12 files changed, 600 insertions(+), 9 deletions(-)
>>>
>>> On Mon, Apr 21, 2008 at 12:33 PM, Nguyen Anh Quynh <aq...@gm...> wrote:
>>>> Forgot to say that this patch is against kvm-66.
>>>>
>>>> Thanks,
>>>> Q
>>>>
>>>> On Mon, Apr 21, 2008 at 12:32 PM, Nguyen Anh Quynh <aq...@gm...> wrote:
>>>>> Hi,
>>>>>
>>>>> This should be submitted to upstream (but not to the kvm-devel
>>>>> list), but this is only test code that I want to quickly send
>>>>> out for comments. In case it looks OK, I will send it upstream
>>>>> later.
>>>>>
>>>>> Inspired by extboot and conversations with Anthony and HPA, this
>>>>> linuxboot option ROM is a simple option ROM that intercepts
>>>>> int19 in order to execute the Linux setup code. This approach
>>>>> eliminates the need to manipulate the boot sector for this
>>>>> purpose.
>>>>>
>>>>> To test it, just load a Linux kernel with your KVM/QEMU image
>>>>> using the -kernel option in the normal way.
>>>>>
>>>>> I successfully compiled and tested it with kvm-66 on Ubuntu
>>>>> 7.10, guest Ubuntu 8.04.
>>>>>
>>>>> Thanks,
>>>>> Quynh
>>>>>
>>>>> # diffstat linuxboot1.diff
>>>>>  Makefile             |   13 ++++-
>>>>>  linuxboot/Makefile   |   40 +++++++++++++++
>>>>>  linuxboot/boot.S     |   54 +++++++++++++++++++++
>>>>>  linuxboot/farvar.h   |  130 ++++++++++++++++++++++++++++++++++++++++
>>>>>  linuxboot/rom.c      |  104 ++++++++++++++++++++++++++++++++++++++++
>>>>>  linuxboot/signrom    |binary
>>>>>  linuxboot/signrom.c  |  128 ++++++++++++++++++++++++++++++++++++++++
>>>>>  linuxboot/util.h     |   69 +++++++++++++++++++++++++++
>>>>>  qemu/Makefile        |    3 -
>>>>>  qemu/Makefile.target |    2
>>>>>  qemu/hw/linuxboot.c  |   39 +++++++++++++++
>>>>>  qemu/hw/pc.c         |   22 +++++++-
>>>>>  qemu/hw/pc.h         |    5 +
>>>>>  13 files changed, 600 insertions(+), 9 deletions(-)
From: Andrea A. <an...@qu...> - 2008-04-23 13:33:53
On Tue, Apr 22, 2008 at 01:30:53PM -0700, Christoph Lameter wrote:
> One solution would be to separate the invalidate_page() callout into
> a patch at the very end that can be omitted. AFAICT there is no
> compelling reason to have this callback, and it complicates the API
> for the device driver writers. Not having this callback makes the
> way that mmu notifiers are called from the VM uniform, which is a
> desirable goal.

I agree that the invalidate_page optimization can be moved to a
separate patch. That will be a patch that definitely alters the API in
a non-backwards-compatible way (unlike 2-12 in my #v13, which are all
backwards compatible in terms of the mmu notifier API).

invalidate_page is beneficial to both mmu notifier users, and a bit
beneficial to the do_wp_page users too. So there's no point in
removing it from my mmu-notifier-core as long as mmu-notifier-core is
1/N in my patchset and N/N in your patchset; the differences caused by
that ordering difference are a bigger change than invalidate_page
existing or not.

As I expected, invalidate_page provided significant benefits (not just
to GRU but to KVM too) without altering the locking scheme at all.
This is because the page fault handler has to notice if begin->end
both run anyway after follow_page/get_user_pages. So it's a no-brainer
to keep, and my approach will avoid a non-backwards-compatible
breakage of the API, IMHO. Not a big deal: nobody can care if the API
changes (it will definitely change eventually; it's a kernel-internal
one), but given I already have invalidate_page in my patch, there's no
reason to remove it as long as mmu-notifier-core remains N/N in your
patchset.
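For readers following along, the callback shapes being debated are
roughly these (an editor's sketch of the API as proposed at the time;
exact signatures varied between patchset revisions):

/* Sketch of the two notification shapes under discussion. */
struct mmu_notifier_ops_sketch {
	/* single page: one atomic callback, called with the pte lock
	 * held, so implementations must not sleep */
	void (*invalidate_page)(struct mmu_notifier *mn,
				struct mm_struct *mm,
				unsigned long address);

	/* range: begin/end brackets around the teardown; a
	 * secondary-mmu page fault racing with the teardown can detect
	 * the open window and retry after _end runs */
	void (*invalidate_range_start)(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end);
	void (*invalidate_range_end)(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start,
				     unsigned long end);
};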
From: Laurent V. <Lau...@bu...> - 2008-04-23 13:10:40
This patch is the kernel part of the "batch writes to MMIO" patch.

When the kernel has to send MMIO writes to userspace, it stores them
in memory until it has to pass control to userspace for another
reason. This avoids too many context switches for operations that can
wait.

WARNING: this breaks compatibility with the old userspace part.

Signed-off-by: Laurent Vivier <Lau...@bu...>
---
 arch/x86/kvm/x86.c         |   21 +++++++++++++++++++++
 include/asm-x86/kvm_host.h |    2 ++
 include/linux/kvm.h        |   10 +++++++++-
 virt/kvm/kvm_main.c        |    3 +++
 4 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ce5563..3881056 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2942,8 +2942,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		kvm_x86_ops->decache_regs(vcpu);
 	}
 
+batch:
 	r = __vcpu_run(vcpu, kvm_run);
 
+	if (!r && vcpu->mmio_is_write &&
+	    kvm_run->exit_reason == KVM_EXIT_MMIO &&
+	    kvm_run->batch_count < KVM_MAX_BATCH) {
+		struct kvm_batch *batch = vcpu->arch.batch_data;
+		int i = kvm_run->batch_count++;
+
+		batch[i].phys_addr = vcpu->mmio_phys_addr;
+		batch[i].len = vcpu->mmio_size;
+		memcpy(batch[i].data, vcpu->mmio_data, batch[i].len);
+
+		goto batch;
+	}
 out:
 	if (vcpu->sigset_active)
 		sigprocmask(SIG_SETMASK, &sigsaved, NULL);
@@ -3830,6 +3843,13 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	}
 	vcpu->arch.pio_data = page_address(page);
 
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page) {
+		r = -ENOMEM;
+		goto fail;
+	}
+	vcpu->arch.batch_data = page_address(page);
+
 	r = kvm_mmu_create(vcpu);
 	if (r < 0)
 		goto fail_free_pio_data;
@@ -3857,6 +3877,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	kvm_mmu_destroy(vcpu);
 	up_read(&vcpu->kvm->slots_lock);
 	free_page((unsigned long)vcpu->arch.pio_data);
+	free_page((unsigned long)vcpu->arch.batch_data);
 }
 
 struct kvm *kvm_arch_create_vm(void)
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 9d963cd..2824652 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -26,6 +26,7 @@
 
 #define KVM_PRIVATE_MEM_SLOTS 4
 #define KVM_PIO_PAGE_OFFSET 1
+#define KVM_MMIO_PAGE_OFFSET 2
 
 #define CR3_PAE_RESERVED_BITS ((X86_CR3_PWT | X86_CR3_PCD) - 1)
 #define CR3_NONPAE_RESERVED_BITS ((PAGE_SIZE-1) & ~(X86_CR3_PWT | X86_CR3_PCD))
@@ -255,6 +256,7 @@ struct kvm_vcpu_arch {
 	gva_t mmio_fault_cr2;
 	struct kvm_pio_request pio;
 	void *pio_data;
+	void *batch_data;
 
 	struct kvm_queued_exception {
 		bool pending;
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index a281afe..cf0d266 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -94,7 +94,8 @@ struct kvm_run {
 	__u32 exit_reason;
 	__u8 ready_for_interrupt_injection;
 	__u8 if_flag;
-	__u8 padding2[2];
+	__u8 batch_count;
+	__u8 padding2;
 
 	/* in (pre_kvm_run), out (post_kvm_run) */
 	__u64 cr8;
@@ -173,6 +174,13 @@ struct kvm_run {
 	};
 };
 
+#define KVM_MAX_BATCH (PAGE_SIZE / sizeof(struct kvm_batch))
+struct kvm_batch {
+	__u64 phys_addr;
+	__u32 len;
+	__u8 data[8];
+};
+
 /* for KVM_TRANSLATE */
 struct kvm_translation {
 	/* in */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3cb4cc..b2234b3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -796,6 +796,8 @@ static int kvm_vcpu_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 #ifdef CONFIG_X86
 	else if (vmf->pgoff == KVM_PIO_PAGE_OFFSET)
 		page = virt_to_page(vcpu->arch.pio_data);
+	else if (vmf->pgoff == KVM_MMIO_PAGE_OFFSET)
+		page = virt_to_page(vcpu->arch.batch_data);
 #endif
 	else
 		return VM_FAULT_SIGBUS;
@@ -1214,6 +1216,7 @@ static long kvm_dev_ioctl(struct file *filp,
 		r = PAGE_SIZE;     /* struct kvm_run */
 #ifdef CONFIG_X86
 		r += PAGE_SIZE;    /* pio data page */
+		r += PAGE_SIZE;    /* mmio batch page */
 #endif
 		break;
 	case KVM_TRACE_ENABLE:
-- 
1.5.2.4
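On the userspace side, the extra page added here would be reached
through the existing vcpu mmap. Roughly (an editor's sketch assuming
the patch's layout of run page, then pio page, then batch page; the
helper name is invented):

#include <linux/kvm.h>		/* the patched header with struct kvm_batch */
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <stddef.h>

/* Map the vcpu region and locate the batch page behind the kvm_run
 * page and the pio page; KVM_GET_VCPU_MMAP_SIZE reports the total
 * size, which this patch grows by one page. */
static struct kvm_batch *map_batch_page(int kvm_fd, int vcpu_fd,
					long page_size)
{
	long size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
	void *run;

	if (size < 3 * page_size)
		return NULL;	/* kernel side lacks the batch page */

	run = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   vcpu_fd, 0);
	if (run == MAP_FAILED)
		return NULL;

	return (struct kvm_batch *)((char *)run + 2 * page_size);
}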
From: Laurent V. <Lau...@bu...> - 2008-04-23 13:09:14
These two patches allow batching writes to MMIO.

When the kernel has to send MMIO writes to userspace, it stores them
in memory until it has to pass control to userspace for another
reason. This avoids too many context switches for operations that can
wait.

WARNING: this breaks compatibility with the old userspace part.

[PATCH 1/2] kvm: Batch writes to MMIO - kernel part
[PATCH 2/2] kvm-userspace: Batch writes to MMIO - userspace part

Signed-off-by: Laurent Vivier <Lau...@bu...>
From: Laurent V. <Lau...@bu...> - 2008-04-23 13:09:13
This patch is the userspace part of the "batch writes to MMIO" patch.

When the kernel has to send MMIO writes to userspace, it stores them
in memory until it has to pass control to userspace for another
reason. This avoids too many context switches for operations that can
wait.

Signed-off-by: Laurent Vivier <Lau...@bu...>
---
 libkvm/libkvm.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/libkvm/libkvm.c b/libkvm/libkvm.c
index 329f29f..be74477 100644
--- a/libkvm/libkvm.c
+++ b/libkvm/libkvm.c
@@ -864,6 +864,10 @@ int kvm_run(kvm_context_t kvm, int vcpu)
 	int r;
 	int fd = kvm->vcpu_fd[vcpu];
 	struct kvm_run *run = kvm->run[vcpu];
+#if defined(__x86_64__) || defined(__i386__)
+	struct kvm_batch *batch = (void *)run + 2 * PAGE_SIZE;
+	int i;
+#endif
 
 again:
 	if (!kvm->irqchip_in_kernel)
@@ -882,6 +886,19 @@ again:
 
 	post_kvm_run(kvm, vcpu);
 
+#if defined(__x86_64__) || defined(__i386__)
+	for (i = 0; i < run->batch_count; i++) {
+		if ((batch[i].phys_addr > 0xa0000-4 &&
+		     batch[i].phys_addr <= 0xa0000) && batch[i].len == 3)
+			continue;
+		kvm->callbacks->mmio_write(kvm->opaque,
+					   batch[i].phys_addr,
+					   &batch[i].data[0], batch[i].len);
+
+	}
+	run->batch_count = 0;
+#endif
+
 	if (r == -1) {
 		r = handle_io_window(kvm);
 		goto more;
-- 
1.5.2.4
From: Christian B. <bor...@de...> - 2008-04-23 12:47:31
On Wednesday, 23 April 2008, Avi Kivity wrote:
>>> #define VIRTIO_CONFIG_S_FAILED 0x80
>>>
>>> #ifdef __KERNEL__
>>> -struct virtio_device;
>>> +#include <linux/virtio.h>
>>
>> I just realized that this breaks make headers_check, as we don't
>> export virtio.h (and we don't want to export it, as it relies on
>> scatterlist.h).
>
> It's guarded by a #ifdef __KERNEL__, so it should be alright.

Yes, you are right. Thanks.
From: Avi K. <av...@qu...> - 2008-04-23 12:42:29
Christian Borntraeger wrote:
> On Tuesday, 22 April 2008, Rusty Russell wrote:
> [...]
>> diff -r a098f19a6da5 include/linux/virtio_config.h
>> --- a/include/linux/virtio_config.h	Sun Apr 20 14:41:02 2008 +1000
>> +++ b/include/linux/virtio_config.h	Sun Apr 20 15:07:45 2008 +1000
>> @@ -16,7 +16,7 @@
>>  #define VIRTIO_CONFIG_S_FAILED 0x80
>>
>>  #ifdef __KERNEL__
>> -struct virtio_device;
>> +#include <linux/virtio.h>
>
> I just realized that this breaks make headers_check, as we don't
> export virtio.h (and we don't want to export it, as it relies on
> scatterlist.h).

It's guarded by a #ifdef __KERNEL__, so it should be alright.

--
Do not meddle in the internals of kernels, for they are subtle and
quick to panic.
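For readers unfamiliar with the mechanism: exported headers are
processed with __KERNEL__ undefined (headers_install runs unifdef), so
a guarded include never reaches userspace. A minimal sketch of the
pattern (illustrative file and macro names):

/* include/linux/example.h: sketch of the guard pattern */

#define EXAMPLE_FLAG 0x80		/* exported: userspace sees this */

#ifdef __KERNEL__
/* Only kernel builds reach this point.  'make headers_install' runs
 * unifdef with __KERNEL__ undefined, stripping this block, so a
 * kernel-only include such as <linux/scatterlist.h> is safe here. */
#include <linux/scatterlist.h>
#endif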
From: <ext...@li...> - 2008-04-23 11:20:36
My fault. Of course I couldn't find it: I was looking into kvm.git and
not kvm-userspace.git :-/ But anyway, it works properly now. Thanks a
lot!

n.

On Wed, 23 Apr 2008, Avi Kivity wrote:
> ext...@li... wrote:
>> Hi Avi,
>> thanks for the info. Can you tell me which commit caused the problem
>> so I can revert it in my testing sources? I don't see any reverts in
>> git (yet).
>
> 6bb0805aeabd5c6ef5408f57c7da5ca6385dd0f5.
>
> --
> Do not meddle in the internals of kernels, for they are subtle and
> quick to panic.