From: Chris L. <ch...@ex...> - 2005-10-08 16:32:20
At the moment, the performance of UBD is disappointing. For instance, here is a trivial benchmark on a quiet and reasonably modern P4 server with 3Ware 8xxx RAID (this is linux 2.6.13.3, but I haven't spent any effort on tuning the IO subsystem; in particular, the results below are for the default, anticipatory, IO scheduler, but I don't expect this choice to make much difference in this test environment):

# time dd if=/dev/zero of=testfile bs=1M count=32
32+0 records in
32+0 records out
33554432 bytes transferred in 0.118687 seconds (282713683 bytes/sec)

real    0m0.120s
user    0m0.000s
sys     0m0.120s

# time rm testfile

real    0m0.027s
user    0m0.000s
sys     0m0.028s

... while the same experiment on a 2.6.13.3 UML kernel (booting with parameters ubd0s=filesystem mem=128M) running on the same hardware gives the following less impressive results:

# time dd if=/dev/zero of=testfile bs=1M count=32
32+0 records in
32+0 records out
33554432 bytes transferred in 0.761825 seconds (44044809 bytes/sec)

real    0m0.770s
user    0m0.000s
sys     0m0.240s

# time rm testfile

real    1m20.808s
user    0m0.000s
sys     0m0.120s

(in this case the filesystem is from Debian `sarge', but there's no reason this should make a significant difference). Note that in both cases I've run the test several times until the timings stabilise a bit.

Note that if you mount the host filesystem `sync', the cost of creating the file rises a bit but the cost of unlinking it doesn't change that much. Similarly, if you configure the UBD device without O_SYNC, things get a bit quicker, but ``I like my data and I want to keep it'', so this isn't an attractive option. (Though see my comment about write barriers from earlier in the week.)

I should say also at this stage that no benchmark is meaningful without context, and the best benchmarks are those based on a specific application which you actually want to run on the system. Equally, I don't think the above pair of commands is an unreasonable thing to expect to run reasonably swiftly on modern hardware, and it's an interesting test case because the gap between host and virtual machine performance is so large.

Matters can be improved a bit by rewriting the UBD code to accept whole requests at once, use scatter/gather IO (readv/writev) on the host, and allow more than one outstanding request at a time. Here's a patch against 2.6.12.5 which does this:

  http://ex-parrot.com/~chris/tmp/20051008/ubd-sgio-2.6.12.5.patch

but note that it's very much a work-in-progress (it also removes support for COW and isn't anywhere near sufficiently well-tested for real use):

# time dd if=/dev/zero of=testfile bs=1M count=32
32+0 records in
32+0 records out
33554432 bytes transferred in 0.753967 seconds (44503843 bytes/sec)

real    0m0.763s
user    0m0.000s
sys     0m0.320s

# time rm testfile

real    0m1.026s
user    0m0.000s
sys     0m0.140s

Note that, while this comes closer to acceptable performance, the rm is still getting on for two orders of magnitude slower than on the host. Jeff Dike also has an AIO reimplementation of UBD in the works, but I haven't had a chance to look at it yet.

I'm not really sure why this simple test is so slow. A limitation of the existing UBD implementation is that it issues requests to the host serially; Jeff's AIO reimplementation will fix this, and I had a go at making my implementation multithreaded (N threads submitting up to N simultaneous IO operations to the host).
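To illustrate what I mean by passing whole requests down and using scatter/gather IO, here is a rough sketch of the host side of things. This is not code from the patch above, and the struct and function names are invented, but it shows the shape of it: one lseek() plus one readv() or writev() per block-layer request, rather than a separate read() or write() per chunk.

/* Rough sketch only, not the code from the patch: the io thread gets one
 * whole block-layer request (a base offset plus a list of segments) and
 * submits it to the host with a single lseek() followed by one readv()
 * or writev(), instead of a separate read()/write() per chunk. */

#include <sys/uio.h>
#include <unistd.h>
#include <errno.h>

#define MAX_SEGS 64                     /* arbitrary, for the sketch */

struct io_req {                         /* invented name */
        int fd;                         /* host file backing the ubd */
        int is_write;                   /* 0 = read, 1 = write */
        off_t offset;                   /* byte offset of the request */
        int nr_segs;                    /* number of segments */
        struct iovec segs[MAX_SEGS];    /* the guest buffers */
};

/* Returns 0 on success or -errno on failure; a real version would also
 * have to cope with short reads/writes by retrying the remainder. */
static int submit_request(struct io_req *req)
{
        ssize_t n;

        if (lseek(req->fd, req->offset, SEEK_SET) == (off_t) -1)
                return -errno;

        do {
                n = req->is_write
                        ? writev(req->fd, req->segs, req->nr_segs)
                        : readv(req->fd, req->segs, req->nr_segs);
        } while (n < 0 && errno == EINTR);

        return n < 0 ? -errno : 0;
}

The real driver obviously has to do rather more than this (mapping the request, reporting completion and errors back to the block layer, and so on), but that's the basic idea.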
In fact this doesn't make a lot of difference to the above test, and the concurrency is a bit messy (it makes the write barriers case harder, in particular), so I haven't investigated in detail.

Here is another simple test. The code in

  http://caesious.beasts.org/~chris/tmp/20051008/ioperformance.tar.gz

will issue random seeks and randomly-sized reads/writes/synchronous writes against a file or block device. This gives a measure of the effective seek time and transfer rate of a block device (with or without an overlying filesystem). Basic usage is as follows:

# tar xzf ioperformance.tar.gz
# cd ioperformance
# make rate
# dd if=/dev/zero of=300M bs=1M count=300
# ./rate writesync 300 > results
[ ... this takes a little while ... ]

That gives a text file whose first column is the size of each IO operation and whose second column is the duration of that operation. Obviously there will be a distribution of durations for any given size of (in this case) write operation, so there are two further scripts in the distribution, bsmedian and bspercentile, which compute respectively the median duration of operations of a given size and the nth percentile points of the distribution for each size.

Here are some sample results, again comparing the host with 2.6.12.5 UML kernels:

  http://ex-parrot.com/~chris/tmp/20051008/host-vs-uml-io-results.png

Note that while stock UBD in 2.6.12.5 is about an order of magnitude slower than the host kernel for writes of any significant size, a more effective implementation can match the host's performance pretty well. This is as we expect, since the cost of issuing the writes in UML is pretty small compared to the actual cost of the disk operations themselves. (As an aside, I don't really understand the plateau in the curve above for writes smaller than 64KB. This may be an effect of stripe size in the RAID array, but I'm not sure.)

Thoughts / comments?
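In case it's useful, here is a very cut-down sketch of the sort of thing `rate' does in its synchronous-write mode. This isn't the code from the tarball, and the names and constants are made up, but it shows where the two columns in the results file come from: pick a random offset and a random size, do a synchronous write, and print the size and how long it took.

/* Cut-down sketch of a tool like `rate' in synchronous-write mode; not
 * the code from the tarball.  Seek to a random offset in the target,
 * write a randomly-sized buffer with O_SYNC, and report the size and
 * duration of each operation. */

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define MAX_IO (256 * 1024)     /* largest single write, in bytes */
#define NR_OPS 1000             /* number of operations to time */

int main(int argc, char *argv[])
{
        struct timeval t0, t1;
        char *buf;
        off_t filesize;
        int fd, i;

        if (argc != 3) {
                fprintf(stderr, "usage: %s FILE MEGABYTES\n", argv[0]);
                return 1;
        }

        filesize = (off_t)atol(argv[2]) * 1024 * 1024;

        fd = open(argv[1], O_WRONLY | O_SYNC);
        if (fd == -1) {
                perror("open");
                return 1;
        }

        buf = malloc(MAX_IO);
        memset(buf, 0xaa, MAX_IO);

        for (i = 0; i < NR_OPS; ++i) {
                size_t len = 512 + (size_t)(rand() % (MAX_IO - 512));
                off_t where = rand() % (filesize - (off_t)len);
                double dt;

                gettimeofday(&t0, NULL);
                if (lseek(fd, where, SEEK_SET) == (off_t)-1
                    || write(fd, buf, len) != (ssize_t)len) {
                        perror("write");
                        return 1;
                }
                gettimeofday(&t1, NULL);

                dt = (t1.tv_sec - t0.tv_sec)
                        + (t1.tv_usec - t0.tv_usec) / 1e6;

                /* first column: size of the operation; second: duration */
                printf("%lu %f\n", (unsigned long)len, dt);
        }

        free(buf);
        close(fd);
        return 0;
}

-- 
Commitment can be best illustrated by a breakfast of ham and eggs.
The chicken was involved, the pig was committed.    (unknown origin)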