From: Avi K. <av...@qu...> - 2008-04-22 16:27:24
Anthony Liguori wrote:
> I think we need to decide what we want to target in terms of upper
> limits.
>
> With a bridge or two, we can probably easily do 128.
>
> If we really want to push things, I think we should do a PCI based
> virtio controller.  I doubt a large number of PCI devices is ever
> going to perform very well b/c of interrupt sharing and some of the
> assumptions in virtio_pci.
>
> If we implement a controller, we can use a single interrupt, but
> multiplex multiple notifications on that single interrupt.  We can
> also be more aggressive about using shared memory instead of PCI
> config space which would reduce the overall number of exits.
>
> We could easily support a very large number of devices this way.  But
> again, what do we want to target for now?

I think that for networking we should keep things as is. I don't see
anybody using 100 virtual NICs.

For mass storage, we should follow the SCSI model with a single device
serving multiple disks, similar to what you suggest. Not sure if the
device should have a single queue or one queue per disk.

--
error compiling committee.c: too many arguments to function

From: H. P. A. <hp...@zy...> - 2008-04-22 16:23:13
Nguyen Anh Quynh wrote:
> Hi,
>
> I am thinking about comibing this ROM with the extboot. Both two ROM
> are about "booting", so I think that is reasonable. So we will have
> only 1 ROM that supports both external boot and Linux boot.
>
> Is that desirable or not?

Does it make the code simpler and easier to understand?  If not, then I
would say no.

	-hpa

From: Hollis B. <ho...@us...> - 2008-04-22 16:22:57
On Tuesday 22 April 2008 06:22:48 Avi Kivity wrote:
> Rusty Russell wrote:
> > [Christian, Hollis, how much is this ABI breakage going to hurt you?]
> >
> > A recent proposed feature addition to the virtio block driver revealed
> > some flaws in the API, in particular how easy it is to break big
> > endian machines.
> >
> > The virtio config space was originally chosen to be little-endian,
> > because we thought the config might be part of the PCI config space
> > for virtio_pci.  It's actually a separate mmio region, so that
> > argument holds little water; as only x86 is currently using the virtio
> > mechanism, we can change this (but must do so now, before the
> > impending s390 and ppc merges).
>
> This will probably annoy Hollis which has guests that can go both ways.

Rusty and I have discussed it. Ultimately, this just takes us from a
cross-architecture endianness definition to a per-architecture
definition. Anyways, we've already fallen into this situation with the
virtio ring data itself, so we're really saying "same endianness as the
ring".

--
Hollis Blanchard
IBM Linux Technology Center

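To make the two options being discussed concrete -- a config space fixed
as little-endian everywhere versus one in guest-native ("same endianness
as the ring") byte order -- here is a small hypothetical C sketch. It is
not from the thread or from the virtio code; the helper names are
invented, and le32toh() is the glibc conversion (kernel code would use
le32_to_cpu() instead).

    #include <stdint.h>
    #include <endian.h>   /* le32toh(); kernel code would use le32_to_cpu() */

    /* Option A: config fields defined as little-endian on every
     * architecture.  Big-endian guests must byte-swap on access. */
    static inline uint32_t cfg_read32_le(const volatile uint32_t *field)
    {
        return le32toh(*field);
    }

    /* Option B: config fields in guest-native order, matching the ring.
     * No swapping anywhere, but the layout is per-architecture. */
    static inline uint32_t cfg_read32_native(const volatile uint32_t *field)
    {
        return *field;
    }

The trade-off the thread describes falls directly out of the sketch:
option A needs a swap in every big-endian guest driver, option B needs
none but makes the on-"wire" layout depend on the guest architecture.
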
From: Anthony L. <an...@co...> - 2008-04-22 16:18:43
Marcelo Tosatti wrote:
>> Maybe require explicit device/function assignment on the command line?
>> It will be managed anyway.
>
> ACPI does support hotplugging of individual functions inside slots,
> not sure how well does Linux (and other OSes) support that.. should be
> transparent though.

I think we need to decide what we want to target in terms of upper
limits.

With a bridge or two, we can probably easily do 128.

If we really want to push things, I think we should do a PCI based
virtio controller.  I doubt a large number of PCI devices is ever going
to perform very well b/c of interrupt sharing and some of the
assumptions in virtio_pci.

If we implement a controller, we can use a single interrupt, but
multiplex multiple notifications on that single interrupt.  We can also
be more aggressive about using shared memory instead of PCI config
space which would reduce the overall number of exits.

We could easily support a very large number of devices this way.  But
again, what do we want to target for now?

Regards,

Anthony Liguori

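Purely as an illustration of the "single interrupt, multiplexed
notifications" idea: the controller layout, structure, and names below
are invented for this sketch and are not real virtio or QEMU interfaces.
A guest-side handler could drain a shared-memory bitmap of pending
queues instead of taking one interrupt line (or PCI function) per
device.

    #include <stdint.h>

    #define MAX_VDEVS 256

    /* Imagined controller-to-guest shared memory: bit N set means
     * virtual device N has pending work. */
    struct vctrl_shared {
        volatile uint64_t pending[MAX_VDEVS / 64];
    };

    /* One ISR for every device behind the (hypothetical) controller:
     * atomically grab and clear each bitmap word, then service every
     * device whose bit was set. */
    static void vctrl_isr(struct vctrl_shared *sh,
                          void (*service)(unsigned dev))
    {
        for (unsigned w = 0; w < MAX_VDEVS / 64; w++) {
            uint64_t bits = __atomic_exchange_n(&sh->pending[w], 0,
                                                __ATOMIC_ACQ_REL);
            while (bits) {
                unsigned bit = (unsigned)__builtin_ctzll(bits);
                bits &= bits - 1;           /* clear lowest set bit */
                service(w * 64 + bit);      /* run that device's queue */
            }
        }
    }

The design point is that adding a device then costs only a bit of shared
memory rather than another interrupt line or PCI function, which is what
makes "a very large number of devices" plausible.
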
From: Javier G. <ja...@gu...> - 2008-04-22 15:57:54
On Tue, Apr 22, 2008 at 3:10 AM, Avi Kivity <av...@qu...> wrote:
> I'm rooting for btrfs myself.

but could btrfs (when stable) work for migration?

i'm curious about OCFS2 performance on this kind of load... when i
manage to sell the idea of a KVM cluster i'd like to know if i should
try first EVMS-HA (cluster LV's) or OCFS (cluster FS)

--
Javier

From: Jamie L. <ja...@sh...> - 2008-04-22 15:37:21
Avi Kivity wrote:
> > And video streaming on some embedded devices with no MMU!  (Due to the
> > page cache heuristics working poorly with no MMU, sustained reliable
> > streaming is managed with O_DIRECT and the app managing cache itself
> > (like a database), and that needs AIO to keep the request queue busy.
> > At least, that's the theory.)
>
> Could use threads as well, no?

Perhaps.  This raises another point about AIO vs. threads:

If I submit sequential O_DIRECT reads with aio_read(), will they enter
the device read queue in the same order, and reach the disk in that
order (allowing for reordering when worthwhile by the elevator)?

With threads this isn't guaranteed and scheduling makes it quite likely
to issue the parallel synchronous reads out of order, and for them to
reach the disk out of order because the elevator doesn't see them
simultaneously.

With AIO (non-Glibc! (and non-kthreads)) it might be better at keeping
the intended issue order, I'm not sure.

It is highly desirable: O_DIRECT streaming performance depends on
avoiding seeks (no reordering) and on keeping the request queue
non-empty (no gap).

I read a man page for some other unix, describing AIO as better than
threaded parallel reads for reading tape drives because of this (tape
seeks are very expensive).  But the rest of the man page didn't say
anything more.  Unfortunately I don't remember where I read it.  I have
no idea whether AIO submission order is nearly always preserved in
general, or expected to be.

> It's me at fault here.  I just assumed that because it's easy to do aio
> in a thread pool efficiently, that's what glibc does.
>
> Unfortunately the code does some ridiculous things like not service
> multiple requests on a single fd in parallel.  I see absolutely no
> reason for it (the code says "fight for resources").

Ouch.  Perhaps that relates to my thought above, about multiple requests
to the same file causing seek storms when thread scheduling is unlucky?

> So my comments only apply to linux-aio vs a sane thread pool.  Sorry for
> spreading confusion.

Thanks.  I thought you'd measured it :-)

> It could and should.  It probably doesn't.
>
> A simple thread pool implementation could come within 10% of Linux aio
> for most workloads.  It will never be "exactly", but for small numbers
> of disks, close enough.

I would wait for benchmark results for I/O patterns like sequential
reading and writing, because of potential for seeks caused by request
reordering, before being confident of that.

> > Hmm.  Thanks.  I may consider switching to XFS now....
>
> I'm rooting for btrfs myself.

In the unlikely event they backport btrfs to kernel 2.4.26-uc0, I'll be
happy to give it a try!  :-)

-- Jamie

From: Jamie L. <ja...@sh...> - 2008-04-22 15:36:29
Avi Kivity wrote:
> > Perhaps.  This raises another point about AIO vs. threads:
> >
> > If I submit sequential O_DIRECT reads with aio_read(), will they enter
> > the device read queue in the same order, and reach the disk in that
> > order (allowing for reordering when worthwhile by the elevator)?
>
> Yes, unless the implementation in the kernel (or glibc) is threaded.
>
> > With threads this isn't guaranteed and scheduling makes it quite
> > likely to issue the parallel synchronous reads out of order, and for
> > them to reach the disk out of order because the elevator doesn't see
> > them simultaneously.
>
> If the disk is busy, it doesn't matter.  The requests will queue and the
> elevator will sort them out.  So it's just the first few requests that
> may get to disk out of order.

There are two cases where it matters to a read-streaming app:

1. The disk isn't busy with anything else, and maximum streaming
   performance is desired.

2. The disk is busy with unrelated things, but you're using I/O
   priorities to give the streaming app near-absolute priority.  Then
   you need to maintain overlapped streaming requests, otherwise the
   disk is given to a lower-priority I/O.  If that happens often, you
   lose; priority is ineffective.  Because one of the streaming requests
   is usually being serviced, the elevator has similar limitations as
   for a disk which is not busy with anything else.

> I haven't considered tape, but this is a good point indeed.  I expect it
> doesn't make much of a difference for a loaded disk.

Yes, as long as it's loaded with unrelated requests at the same I/O
priority, the elevator has time to sort requests and hide thread
scheduling artifacts.

Btw, regarding QEMU: QEMU gets requests _after_ sorting by the guest's
elevator, then submits them to the host's elevator.  If the guest and
host elevators are both configured 'anticipatory', do the anticipatory
delays add up?

-- Jamie

From: Marcelo T. <mto...@re...> - 2008-04-22 15:32:59
On Tue, Apr 22, 2008 at 05:51:51PM +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> > Avi Kivity wrote:
> >> Anthony Liguori wrote:
> >>> This patch changes virtio devices to be multi-function devices
> >>> whenever possible.  This increases the number of virtio devices we
> >>> can support now by a factor of 8.
> >>>
> >>> With this patch, I've been able to launch a guest with either 220
> >>> disks or 220 network adapters.
> >>
> >> Does this play well with hotplug?  Perhaps we need to allocate a new
> >> device on hotplug.
> >
> > Probably not.  I imagine you can only hotplug devices, not individual
> > functions?
>
> It sounds reasonable to expect so.  ACPI has objects for devices, not
> functions (IIRC).

So what I dislike about multifunction devices is the fact that a single
slot shares an IRQ, and that special code is required in the QEMU
drivers (virtio guest capability might not always be present).

I don't see any need for using them if we can extend PCI slots...

> Maybe require explicit device/function assignment on the command line?
> It will be managed anyway.

ACPI does support hotplugging of individual functions inside slots,
not sure how well does Linux (and other OSes) support that.. should be
transparent though.

From: Robin H. <ho...@sg...> - 2008-04-22 15:26:03
Andrew,

Could we get direction/guidance from you as regards the
invalidate_page() callout of Andrea's patch set versus the
invalidate_range_start/invalidate_range_end callout pairs of
Christoph's patchset?  This is only in the context of the __xip_unmap,
do_wp_page, page_mkclean_one, and try_to_unmap_one call sites.

On Tue, Apr 22, 2008 at 03:48:47PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
> > I am a little confused about the value of the seq_lock versus a simple
> > atomic, but I assumed there is a reason and left it at that.
>
> There's no value for anything but get_user_pages (get_user_pages takes
> its own lock internally though).  I preferred to explain it as a
> seqlock because it was simpler for reading, but I totally agree in the
> final implementation it shouldn't be a seqlock.  My code was meant to
> be pseudo-code only.  It doesn't even need to be atomic ;).

Unless there is additional locking in your fault path, I think it does
need to be atomic.

> > I don't know what you mean by "it'd" run slower and what you mean by
> > "armed and disarmed".
>
> 1) when armed the time-window where the kvm-page-fault would be
> blocked would be a bit larger without invalidate_page for no good
> reason

But that is a distinction without a difference.  In the _start/_end
case, kvm's fault handler will not have any _DIRECT_ blocking, but
get_user_pages() had certainly better block waiting for some other lock
to prevent the process's pages being refaulted.  I am no VM expert, but
that seems like it is critical to having a consistent virtual address
space.

Effectively, you have a delay on the kvm fault handler beginning when
either invalidate_page() is entered or invalidate_range_start() is
entered, until when the _CALLER_ of the invalidate* method has
unlocked.  That time will remain essentially identical for either case.
I would argue you would be hard pressed to even measure the difference.

> 2) if you were to remove invalidate_page when disarmed the VM could
> would need two branches instead of one in various places

Those branches are conditional upon there being list entries.  That
check should be extremely cheap.  The vast majority of cases will have
no registered notifiers.  The second check for the _end callout will be
from cpu cache.

> I don't want to waste cycles if not wasting them improves performance
> both when armed and disarmed.

In summary, I think we have narrowed down the case of no registered
notifiers to being infinitesimal, and the case of registered notifiers
to being a distinction without a difference.

> > When I was discussing this difference with Jack, he reminded me that
> > the GRU, due to its hardware, does not have any race issues with the
> > invalidate_page callout simply doing the tlb shootdown and not modifying
> > any of its internal structures.  He then put a caveat on the discussion
> > that _either_ method was acceptable as far as he was concerned.  The real
> > issue is getting a patch in that satisfies all needs and not whether
> > there is a seperate invalidate_page callout.
>
> Sure, we have that patch now, I'll send it out in a minute, I was just
> trying to explain why it makes sense to have an invalidate_page too
> (which remains the only difference by now), removing it would be a
> regression on all sides, even if a minor one.

I think GRU is the only compelling case I have heard for having the
invalidate_page separate.

In the case of the GRU, the hardware enforces a lifetime of the
invalidate which covers all in-progress faults, including ones where
the hardware is informed after the flush of a PTE.  In all cases, once
the GRU invalidate instruction is issued, all active requests are
invalidated.  Future faults will be blocked in get_user_pages().
Without that special feature of the hardware, I don't think any code
simplification exists.  I, of course, reserve the right to be wrong.

I believe the argument against a separate invalidate_page() callout was
Christoph's interpretation of Andrew's comments.  I am not certain
Andrew was aware of these special aspects of the GRU hardware and
whether that had been factored into the discussion at that point in
time.

Thanks,
Robin

From: Avi K. <av...@qu...> - 2008-04-22 15:24:31
Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>> Andrea Arcangeli wrote:
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> +	cond_resched();
>>> +	if ((unsigned long)*(spinlock_t **)a <
>>> +	    (unsigned long)*(spinlock_t **)b)
>>> +		return -1;
>>> +	else if (a == b)
>>> +		return 0;
>>> +	else
>>> +		return 1;
>>> +}
>>> +
>>
>> This compare function looks unusual...
>> It should work, but sort() could be faster if the
>> if (a == b) test had a chance to be true eventually...
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

You need to compare *a to *b (at least, that's what you're doing for
the < case).

--
error compiling committee.c: too many arguments to function

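For reference, a standalone, compilable illustration of the fix being
pointed out: compare the pointed-to addresses (*a and *b), not the array
slots themselves.  This is only a userspace sketch; cond_resched() and
the spinlock type from the actual patch are left out.

    #include <stdio.h>
    #include <stdlib.h>

    /* qsort()-style comparator over an array of pointers: compare the
     * pointed-to addresses (*a vs *b), not the slots (a vs b). */
    static int lock_ptr_cmp(const void *a, const void *b)
    {
        unsigned long la = (unsigned long)*(void * const *)a;
        unsigned long lb = (unsigned long)*(void * const *)b;

        if (la < lb)
            return -1;
        if (la > lb)
            return 1;
        return 0;
    }

    int main(void)
    {
        int x, y;
        void *locks[3] = { &y, &x, &y };

        qsort(locks, 3, sizeof(locks[0]), lock_ptr_cmp);
        /* duplicates now sit next to each other and compare equal */
        printf("%p %p %p\n", locks[0], locks[1], locks[2]);
        return 0;
    }

With the original `a == b` test the duplicate entries would never
compare equal, because the two array slots holding them are distinct
even when they point at the same lock.
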
From: Jamie L. <ja...@sh...> - 2008-04-22 15:23:32
Avi Kivity wrote:
> Anthony Liguori wrote:
> >> If I submit sequential O_DIRECT reads with aio_read(), will they enter
> >> the device read queue in the same order, and reach the disk in that
> >> order (allowing for reordering when worthwhile by the elevator)?
> >
> > There's no guarantee that any sort of order will be preserved by AIO
> > requests.  The same is true with writes.  This is what fdsync is for,
> > to guarantee ordering.
>
> I believe he'd like a hint to get good scheduling, not a guarantee.
> With a thread pool if the threads are scheduled out of order, so are
> your requests.
> If the elevator doesn't plug the queue, the first few requests may
> not be optimally sorted.

That's right.  Then they tend to settle to a good order.  But any delay
in scheduling one of the threads, or a signal received by one of them,
can make it lose order briefly, making the streaming stutter as the
disk performs a few local seeks until it settles to good order again.

You can mitigate the disruption in various ways:

1. If all threads share an "offset" variable, and each reads and
   increments it atomically just prior to calling pread(), that helps,
   especially at the start (a minimal sketch follows this message).
   (If threaded I/O is used for QEMU disk emulation, I would suggest
   doing that, in the more general form of popping a request from
   QEMU's internal shared queue at the last moment.)

2. Using more threads helps keep it sustained, at the cost of more
   wasted I/O when there's a cancellation (changed mind), and more
   memory.

However, AIO, in principle (if not implementations...) could be better
at keeping the suggested I/O order than threads, without special
tricks.

-- Jamie

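A minimal sketch of the "shared offset" trick from point 1 above,
assuming a file descriptor already opened with O_DIRECT and per-thread
buffers aligned elsewhere (the names and structure are mine, not from
the thread).  Each worker claims the next chunk atomically immediately
before calling pread(), so the kernel sees the first requests in offset
order even if the threads start in an arbitrary order.

    #define _XOPEN_SOURCE 700
    #include <pthread.h>
    #include <stdatomic.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)              /* 1 MiB, a multiple of the block size */

    static _Atomic long long next_offset;   /* shared by all reader threads */
    static int fd;                          /* opened elsewhere with O_DIRECT */

    static void *reader(void *aligned_buf)
    {
        for (;;) {
            /* claim the next sequential chunk at the last possible moment */
            long long off = atomic_fetch_add(&next_offset, (long long)CHUNK);
            ssize_t n = pread(fd, aligned_buf, CHUNK, (off_t)off);
            if (n <= 0)
                break;                   /* EOF or error ends this worker */
            /* ... hand the buffer to the consumer here ... */
        }
        return NULL;
    }

This only controls the order in which pread() calls are issued; once two
threads are inside the kernel simultaneously, ordering is back in the
elevator's hands, which is the residual gap AIO is claimed to close.
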
From: Marcelo T. <mto...@re...> - 2008-04-22 15:16:49
On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
> > This patch changes virtio devices to be multi-function devices whenever
> > possible.  This increases the number of virtio devices we can support
> > now by a factor of 8.
> >
> > With this patch, I've been able to launch a guest with either 220 disks
> > or 220 network adapters.
>
> Does this play well with hotplug?  Perhaps we need to allocate a new
> device on hotplug.
>
> (certainly if we have a device with one function, which then gets
> converted to a multifunction device)

Would have to change the hotplug code to handle functions... It sounds
less hacky to just extend the PCI slots instead of (ab)using multiple
functions per-slot.

From: Anthony L. <ali...@us...> - 2008-04-22 15:16:45
Ryan Harper wrote:
> * Anthony Liguori <ali...@us...> [2008-04-22 09:16]:
>> This patch changes virtio devices to be multi-function devices whenever
>> possible.  This increases the number of virtio devices we can support
>> now by a factor of 8.
>>
>> With this patch, I've been able to launch a guest with either 220 disks
>> or 220 network adapters.
>
> Have you confirmed that the network devices show up?  I was playing
> around with some of the limits last night and while it is easy to get
> QEMU to create the adapters, so far I've only had a guest see 29 pci
> nics (e1000).

Yup, I had an eth219

Regards,

Anthony Liguori

From: Andrea A. <an...@qu...> - 2008-04-22 15:15:52
On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
> Andrea Arcangeli wrote:
>> +
>> +static int mm_lock_cmp(const void *a, const void *b)
>> +{
>> +	cond_resched();
>> +	if ((unsigned long)*(spinlock_t **)a <
>> +	    (unsigned long)*(spinlock_t **)b)
>> +		return -1;
>> +	else if (a == b)
>> +		return 0;
>> +	else
>> +		return 1;
>> +}
>> +
>
> This compare function looks unusual...
> It should work, but sort() could be faster if the
> if (a == b) test had a chance to be true eventually...

Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

> static int mm_lock_cmp(const void *a, const void *b)
> {
> 	unsigned long la = (unsigned long)*(spinlock_t **)a;
> 	unsigned long lb = (unsigned long)*(spinlock_t **)b;
>
> 	cond_resched();
> 	if (la < lb)
> 		return -1;
> 	if (la > lb)
> 		return 1;
> 	return 0;
> }

If your intent is to use the assumption that there are going to be few
equal entries, you should have used likely(la > lb) to signal it's
rarely going to return zero, or gcc is likely free to do whatever it
wants with the above.  Overall that function is such a slow path that
this is going to be lost in the noise.  My suggestion would be to defer
micro-optimizations like this until after 1/12 is applied to mainline.

Thanks!

From: Ryan H. <ry...@us...> - 2008-04-22 15:15:35
* Anthony Liguori <ali...@us...> [2008-04-22 09:16]:
> This patch changes virtio devices to be multi-function devices whenever
> possible.  This increases the number of virtio devices we can support
> now by a factor of 8.
>
> With this patch, I've been able to launch a guest with either 220 disks
> or 220 network adapters.

Have you confirmed that the network devices show up?  I was playing
around with some of the limits last night and while it is easy to get
QEMU to create the adapters, so far I've only had a guest see 29 pci
nics (e1000).

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ry...@us...

From: Jamie L. <ja...@sh...> - 2008-04-22 15:12:30
Anthony Liguori wrote:
> > Perhaps.  This raises another point about AIO vs. threads:
> >
> > If I submit sequential O_DIRECT reads with aio_read(), will they enter
> > the device read queue in the same order, and reach the disk in that
> > order (allowing for reordering when worthwhile by the elevator)?
>
> There's no guarantee that any sort of order will be preserved by AIO
> requests.  The same is true with writes.  This is what fdsync is for, to
> guarantee ordering.

You misunderstand.  I'm not talking about guarantees, I'm talking about
expectations for the performance effect.

Basically, to do performant streaming reads with O_DIRECT you need two
things:

1. Overlap at least 2 requests, so the device is kept busy.

2. Requests sent to the disk in a good order, which is usually (but not
   always) sequential offset order.

The kernel does this itself with buffered reads, doing readahead.  It
works very well, unless you have other problems caused by readahead.

With O_DIRECT, an application has to do the equivalent of readahead
itself to get performant streaming.  If the app uses two threads calling
pread(), it's hard to ensure the kernel even _sees_ the first two calls
in sequential offset order.  You spawn two threads, and then both
threads call pread() with non-deterministic scheduling.  The problem
starts before even entering the kernel.

Then, depending on I/O scheduling in the kernel, it might send the less
good pread() to the disk immediately, then later a backward head seek
and the other one.  The elevator cannot fix this: it doesn't have enough
information, unless it adds artificial delays.  But artificial delays
may harm too; it's not optimal.

After that, the two threads tend to call pread() in the best order
provided there are no scheduling conflicts, but are easily disrupted by
other tasks, especially on SMP (one reading thread per CPU, so when one
of them is descheduled, the other continues and issues a request in the
'wrong' order.)

With AIO, even though you can't be sure what the kernel does, you can be
sure the kernel receives aio_read() calls in the exact order which is
most likely to perform well.  Application knowledge of its access
pattern is passed along better.

As I've said, I saw a man page which described why this makes AIO
superior to using threads for reading tapes on that OS.  So it's not a
completely spurious point.

This has nothing to do with guarantees.

-- Jamie

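A bare-bones sketch of the kernel-AIO submission path under discussion,
using Linux libaio's io_submit() rather than glibc's thread-based POSIX
AIO.  Error handling is omitted and the queue depth of two is only
illustrative; the point is that both O_DIRECT reads are handed to the
kernel in a single call, already in offset order.

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <libaio.h>            /* link with -laio */
    #include <fcntl.h>
    #include <stdlib.h>

    #define CHUNK (1 << 20)

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        io_context_t ctx = 0;
        struct iocb cb[2], *cbs[2] = { &cb[0], &cb[1] };
        struct io_event ev[2];
        void *buf[2];

        (void)argc;
        io_setup(8, &ctx);
        for (int i = 0; i < 2; i++) {
            posix_memalign(&buf[i], 4096, CHUNK);      /* O_DIRECT alignment */
            io_prep_pread(&cb[i], fd, buf[i], CHUNK, (long long)i * CHUNK);
        }
        io_submit(ctx, 2, cbs);    /* kernel sees both reads, in offset order */
        io_getevents(ctx, 2, 2, ev, NULL);
        io_destroy(ctx);
        return 0;
    }

A real streaming loop would keep refilling completed slots with the next
sequential offset, which preserves the same property: the submission
order is decided by one thread, not by scheduler luck.
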
From: Luca T. <kro...@gm...> - 2008-04-22 15:12:23
On Tue, Apr 22, 2008 at 4:15 PM, Anthony Liguori <ali...@us...> wrote:
> This patch changes virtio devices to be multi-function devices whenever
> possible.  This increases the number of virtio devices we can support now by
> a factor of 8.
[...]
> diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
> index 9100bb1..9ea14d3 100644
> --- a/qemu/hw/virtio.c
> +++ b/qemu/hw/virtio.c
> @@ -405,9 +405,18 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
>      PCIDevice *pci_dev;
>      uint8_t *config;
>      uint32_t size;
> +    static int devfn = 7;
> +
> +    if ((devfn % 8) == 7)
> +        devfn = -1;
> +    else
> +        devfn++;

This code looks strange... devfn should be passed to virtio_init_pci by
the virtio-{net,blk} init functions, no?

Luca

From: Avi K. <av...@qu...> - 2008-04-22 15:06:02
Anthony Liguori wrote:
>> If I submit sequential O_DIRECT reads with aio_read(), will they enter
>> the device read queue in the same order, and reach the disk in that
>> order (allowing for reordering when worthwhile by the elevator)?
>
> There's no guarantee that any sort of order will be preserved by AIO
> requests.  The same is true with writes.  This is what fdsync is for,
> to guarantee ordering.

I believe he'd like a hint to get good scheduling, not a guarantee.
With a thread pool, if the threads are scheduled out of order, so are
your requests.

If the elevator doesn't plug the queue, the first few requests may not
be optimally sorted.

--
error compiling committee.c: too many arguments to function

From: Anthony L. <ali...@us...> - 2008-04-22 15:05:30
Nguyen Anh Quynh wrote:
> Hi,
>
> This should be submitted to upstream (but not to kvm-devel list), but
> this is only the test code that I want to quickly send out for
> comments.  In case it looks OK, I will send it to upstream later.
>
> Inspired by extboot and conversations with Anthony and HPA, this
> linuxboot option ROM is a simple option ROM that intercepts int19 in
> order to execute linux setup code.  This approach eliminates the need
> to manipulate the boot sector for this purpose.
>
> To test it, just load linux kernel with your KVM/QEMU image using
> -kernel option in normal way.
>
> I succesfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
> Ubuntu 8.04.

For the next rounds, could you actually rebase against upstream QEMU and
submit to qemu-devel?  One of Paul Brook's objections to extboot had
historically been that it wasn't easily sharable with other
architectures.  With a C version, it seems more reasonable now to do
that.

Make sure you remove all the old linux boot code too within QEMU along
with the -hda checks.

Regards,

Anthony Liguori

> Thanks,
> Quynh
>
> # diffstat linuxboot1.diff
>  Makefile             |   13 ++++-
>  linuxboot/Makefile   |   40 +++++++++++++++
>  linuxboot/boot.S     |   54 +++++++++++++++++++++
>  linuxboot/farvar.h   |  130 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  linuxboot/rom.c      |  104 ++++++++++++++++++++++++++++++++++++++++
>  linuxboot/signrom    |binary
>  linuxboot/signrom.c  |  128 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  linuxboot/util.h     |   69 +++++++++++++++++++++++++++
>  qemu/Makefile        |    3 -
>  qemu/Makefile.target |    2
>  qemu/hw/linuxboot.c  |   39 +++++++++++++++
>  qemu/hw/pc.c         |   22 +++++++-
>  qemu/hw/pc.h         |    5 +
>  13 files changed, 600 insertions(+), 9 deletions(-)

From: Avi K. <av...@qu...> - 2008-04-22 15:03:48
Jamie Lokier wrote:
> Avi Kivity wrote:
>>> And video streaming on some embedded devices with no MMU!  (Due to the
>>> page cache heuristics working poorly with no MMU, sustained reliable
>>> streaming is managed with O_DIRECT and the app managing cache itself
>>> (like a database), and that needs AIO to keep the request queue busy.
>>> At least, that's the theory.)
>>
>> Could use threads as well, no?
>
> Perhaps.  This raises another point about AIO vs. threads:
>
> If I submit sequential O_DIRECT reads with aio_read(), will they enter
> the device read queue in the same order, and reach the disk in that
> order (allowing for reordering when worthwhile by the elevator)?

Yes, unless the implementation in the kernel (or glibc) is threaded.

> With threads this isn't guaranteed and scheduling makes it quite
> likely to issue the parallel synchronous reads out of order, and for
> them to reach the disk out of order because the elevator doesn't see
> them simultaneously.

If the disk is busy, it doesn't matter.  The requests will queue and the
elevator will sort them out.  So it's just the first few requests that
may get to disk out of order.

> With AIO (non-Glibc! (and non-kthreads)) it might be better at
> keeping the intended issue order, I'm not sure.
>
> It is highly desirable: O_DIRECT streaming performance depends on
> avoiding seeks (no reordering) and on keeping the request queue
> non-empty (no gap).
>
> I read a man page for some other unix, describing AIO as better than
> threaded parallel reads for reading tape drives because of this (tape
> seeks are very expensive).  But the rest of the man page didn't say
> anything more.  Unfortunately I don't remember where I read it.  I
> have no idea whether AIO submission order is nearly always preserved
> in general, or expected to be.

I haven't considered tape, but this is a good point indeed.  I expect it
doesn't make much of a difference for a loaded disk.

>> It's me at fault here.  I just assumed that because it's easy to do aio
>> in a thread pool efficiently, that's what glibc does.
>>
>> Unfortunately the code does some ridiculous things like not service
>> multiple requests on a single fd in parallel.  I see absolutely no
>> reason for it (the code says "fight for resources").
>
> Ouch.  Perhaps that relates to my thought above, about multiple
> requests to the same file causing seek storms when thread scheduling
> is unlucky?

My first thought on seeing this is that it relates to a deficiency on
older kernels servicing multiple requests on a single fd (i.e. a
per-file lock).  I don't know if such a deficiency ever existed, though.

>> It could and should.  It probably doesn't.
>>
>> A simple thread pool implementation could come within 10% of Linux aio
>> for most workloads.  It will never be "exactly", but for small numbers
>> of disks, close enough.
>
> I would wait for benchmark results for I/O patterns like sequential
> reading and writing, because of potential for seeks caused by request
> reordering, before being confident of that.

I did have measurements (and a test rig) at a previous job (where I did
a lot of I/O work); IIRC the performance of a tuned thread pool was not
far behind aio, both for seeks and sequential.  It was a while back
though.

--
error compiling committee.c: too many arguments to function

From: Laurent V. <Lau...@bu...> - 2008-04-22 15:02:56
On Tuesday 22 April 2008 at 08:50 -0500, Anthony Liguori wrote:
> Nguyen Anh Quynh wrote:
> > Hi,
> >
> > This should be submitted to upstream (but not to kvm-devel list), but
> > this is only the test code that I want to quickly send out for
> > comments.  In case it looks OK, I will send it to upstream later.
> >
> > Inspired by extboot and conversations with Anthony and HPA, this
> > linuxboot option ROM is a simple option ROM that intercepts int19 in
> > order to execute linux setup code.  This approach eliminates the need
> > to manipulate the boot sector for this purpose.
> >
> > To test it, just load linux kernel with your KVM/QEMU image using
> > -kernel option in normal way.
> >
> > I succesfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
> > Ubuntu 8.04.
>
> For the next rounds, could you actually rebase against upstream QEMU and
> submit to qemu-devel?  One of Paul Brook's objections to extboot had
> historically been that it wasn't easily sharable with other
> architectures.  With a C version, it seems more reasonable now to do that.

Moreover, add a binary version of the ROM to the pc-bios directory: it
avoids needing a cross-compiler to build the ROM on non-x86
architectures.

Regards,
Laurent

> Make sure you remove all the old linux boot code too within QEMU along
> with the -hda checks.
>
> Regards,
>
> Anthony Liguori
>
> > Thanks,
> > Quynh
> >
> > # diffstat linuxboot1.diff
> >  Makefile             |   13 ++++-
> >  linuxboot/Makefile   |   40 +++++++++++++++
> >  linuxboot/boot.S     |   54 +++++++++++++++++++++
> >  linuxboot/farvar.h   |  130 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  linuxboot/rom.c      |  104 ++++++++++++++++++++++++++++++++++++++++
> >  linuxboot/signrom    |binary
> >  linuxboot/signrom.c  |  128 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  linuxboot/util.h     |   69 +++++++++++++++++++++++++++
> >  qemu/Makefile        |    3 -
> >  qemu/Makefile.target |    2
> >  qemu/hw/linuxboot.c  |   39 +++++++++++++++
> >  qemu/hw/pc.c         |   22 +++++++-
> >  qemu/hw/pc.h         |    5 +
> >  13 files changed, 600 insertions(+), 9 deletions(-)

--
------------- Lau...@bu... ---------------
"The best way to predict the future is to invent it." - Alan Kay

From: Anthony L. <an...@co...> - 2008-04-22 14:53:28
Jamie Lokier wrote:
> Avi Kivity wrote:
>>> And video streaming on some embedded devices with no MMU!  (Due to the
>>> page cache heuristics working poorly with no MMU, sustained reliable
>>> streaming is managed with O_DIRECT and the app managing cache itself
>>> (like a database), and that needs AIO to keep the request queue busy.
>>> At least, that's the theory.)
>>
>> Could use threads as well, no?
>
> Perhaps.  This raises another point about AIO vs. threads:
>
> If I submit sequential O_DIRECT reads with aio_read(), will they enter
> the device read queue in the same order, and reach the disk in that
> order (allowing for reordering when worthwhile by the elevator)?

There's no guarantee that any sort of order will be preserved by AIO
requests.  The same is true with writes.  This is what fdsync is for,
to guarantee ordering.

Regards,

Anthony Liguori

From: Avi K. <av...@qu...> - 2008-04-22 14:52:03
Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> This patch changes virtio devices to be multi-function devices whenever
>>> possible.  This increases the number of virtio devices we can
>>> support now by a factor of 8.
>>>
>>> With this patch, I've been able to launch a guest with either 220
>>> disks or 220 network adapters.
>>
>> Does this play well with hotplug?  Perhaps we need to allocate a new
>> device on hotplug.
>
> Probably not.  I imagine you can only hotplug devices, not individual
> functions?

It sounds reasonable to expect so.  ACPI has objects for devices, not
functions (IIRC).

Maybe require explicit device/function assignment on the command line?
It will be managed anyway.

--
error compiling committee.c: too many arguments to function

From: Anthony L. <an...@co...> - 2008-04-22 14:46:38
Avi Kivity wrote:
> Anthony Liguori wrote:
>> This patch changes virtio devices to be multi-function devices whenever
>> possible.  This increases the number of virtio devices we can support
>> now by a factor of 8.
>>
>> With this patch, I've been able to launch a guest with either 220 disks
>> or 220 network adapters.
>
> Does this play well with hotplug?  Perhaps we need to allocate a new
> device on hotplug.
>
> (certainly if we have a device with one function, which then gets
> converted to a multifunction device)

Probably not.  I imagine you can only hotplug devices, not individual
functions?

Regards,

Anthony Liguori

From: Avi K. <av...@qu...> - 2008-04-22 14:43:09
Marcelo Tosatti wrote:
> Otherwise multiple guests use the same variable and boom.
>
> Also use kvm_vcpu_kick() to make sure that if a timer triggers on
> a different CPU the event won't be missed.

Applied, thanks.

--
error compiling committee.c: too many arguments to function