From: Goswin v. B. <gos...@we...> - 2008-06-16 09:35:35

Miklos Szeredi <mi...@sz...> writes:

>> >> I guess I am asking: "Would it"? Since I have mmap()ed the file
>> >> (read-only) the kernel would not have copied the data into a
>> >> user-space buffer - and any data referenced from it would come
>> >> directly from the page cache.
>> >
>> > What I'm always wondering, why are people so obsessed with eliminating
>> > memory copies...
>>
>> Because they are an easy starting point in eliminating waste.
>
> It only *seems* easy. It's actually very difficult to integrate this
> into the API so that it satisfies everyone's requirements, and doesn't
> make the interface an ugly mess at the same time.
>
>> > The goal is not to minimize memory copies, but to maximize
>> > performance. And one does *not* necessarily imply the other.
>> >
>> > Guess why "cp" doesn't use mmap to read and write files? The simple
>> > reason is that it would be slower than read() and write(), even though
>> > the latter obviously involves more memory copies.
>>
>> Because mmap does not mean zero copy. Actually it would change nothing
>> in the number of copies if you do
>>
>>     src = mmap(src_fd);
>>     dst = mmap(dst_fd);
>>     memcpy(dst, src, size);
>
> No, that's still one less copy than read(src_fd, buf); write(dst_fd,
> buf), since there's no intermediate buffer between the source and the
> destination caches.
>
> And the same is true for
>
>     src = mmap(src_fd);
>     while (...) {
>         write(dst_fd, src + offset, size);
>     }

That is certainly a saner solution. Still not as good as splice().

>> On the other hand that would page fault every page for src and dst and
>> read the page from the file.
>
> Yes, there's a page fault, but...
>
>> So you would not only read the source file but also the destination
>> file.
>
> ... you are wrong on this. For "cp" the destination doesn't exist so
> there's no I/O happening there at all.
>
> And for my example there's not even a page fault for the write side.
> Yet, it's still slower.

You can't mmap the dst without creating a sparse file of the proper
size first. And when memcpy then writes the first data into a target
page, a page fault occurs and the kernel maps in a page from the
destination file. The FS will see that the file is sparse and use an
empty page, but it still goes through most of the motions.

> The overhead is actually just from the administrative part of the page
> fault (i.e. setting up the page tables). That can actually mean a
> significant slowdown compared to copying the whole page. Memory
> copies can be very fast, and they aren't necessarily constrained by
> main memory bandwidth since modern processors can have big CPU caches
> where such copies fit nicely without ever hitting the main memory.

What I don't get is why there is no mechanism in the kernel for a
page-aligned (in address and size) memcpy that just shares the pages.
Or have I just not found it yet? It might be slower for a single page,
but certainly for 128k or more just copying the physical page addresses
must be faster than copying the contents. For read-only buffers that
should give a good boost.

>> Not to mention that error handling with mmap is a nightmare.
>>
>> The way to get rid of the copying in cp would be to use sendfile(), if
>> only that would work on non-sockets. Recently the kernel has added
>> splice support, which does exactly that. But that is pretty kernel
>> specific and cp is an old tool that has to run everywhere.
>
> Yeah, splice is nice. And "cp" could detect if kernel supports splice
> and fall back to read/write if not. But it's probably irrelevant,
> since the speed is mostly limited by disk bandwidth and not memory
> copies.
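
The detect-and-fall-back part could look roughly like the sketch below
(hypothetical and untested; copy_fd is a made-up name and error
handling is mostly elided). splice() needs a pipe at one end, so the
data is staged through one:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    static void copy_fd(int src_fd, int dst_fd)
    {
        char buf[128 * 1024];
        ssize_t n = -1;
        int p[2];

        if (pipe(p) == 0) {
            /* file -> pipe -> file: the pages move through the
               page cache without ever being copied to user space */
            while ((n = splice(src_fd, NULL, p[1], NULL,
                               sizeof(buf), SPLICE_F_MOVE)) > 0)
                splice(p[0], NULL, dst_fd, NULL, n, SPLICE_F_MOVE);
            close(p[0]);
            close(p[1]);
            if (n == 0)
                return;                 /* EOF, copy complete */
            if (errno != EINVAL && errno != ENOSYS)
                return;                 /* real I/O error, give up */
        }

        /* old kernel: classic loop, one extra copy through buf */
        while ((n = read(src_fd, buf, sizeof(buf))) > 0)
            write(dst_fd, buf, n);
    }

Even where the kernel still copies internally, the data on the splice
path at least never crosses into user space.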

For the last half year I have been working on a storage cluster for the
University of Karlsruhe. The systems have rather more I/O capability
than your average desktop, yet even there a single
'dd if=/dev/zero of=/dev/disk bs=1M' can't write fast enough: a single
core can't process enough I/O to fill the disk. Same with a single cp.

>> > I refuse to hear about zero-copy schemes until someone actually does
>> > some performance measurements and shows that it would be a significant
>> > improvement in real life situations, that cannot be achieved without
>> > messing with the API.
>>
>> When you want to achieve throughputs of 300-600MiB/s then just copying
>> that much data around becomes a real burden.
>
> Err, attaching a little program I wrote to test this: it gets the
> buffer size and the number of memcpy() iterations to do. Here's the
> result I get on a 1.8GHz CoreDuo laptop:
>
> > ./memcpy 262144 65536
> 16384MBytes copied in 2.65382seconds = 6173.75MiB/s
>
> Fuse uses 128k buffers, for which I get a slightly worse throughput of
> about 6070MiB/s.

500MiB/s with 4 copies for every buffer means 2GiB/s of copying, so
about 30% of the cpu time is spent on copying. Now assume the system
is at 100% cpu, doing something else in the remaining 70% (like parsing
metadata in fuse, and running the application itself). Reducing the
cpu load by 30% could give the application 30% more time.

The problem also might not even be that the cpu or memory bus hits its
limit, but that you add a delay to every operation. A memcpy does take
time, so the response time of your fs goes down if copies can be
avoided.

> So I'm not yet scared *at all* about the performance impact of memory
> copies in fuse :)

I'm not scared either, but I would like to avoid them. I patched the
nfs handling to avoid one memcpy in the kernel and got it to go from
40MiB/s to >200MiB/s.

> Come on people, do some measurements to prove your point! Theoretical
> arguments don't matter. If you can't measure the great improvements
> zero-copy will bring, then sorry, I'm not interested.

Give me a patch for fuse to support zero copy and I will test it for
you.

> Miklos

MfG
        Goswin
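
PS: Miklos' actual test program is not attached in the archive, but a
minimal reconstruction matching the usage and output format quoted
above might look like this (my guess, unverified):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    /* Usage: ./memcpy <bufsize> <iterations>
       memcpy()s a buffer of <bufsize> bytes <iterations> times and
       reports the achieved throughput. */
    int main(int argc, char *argv[])
    {
        size_t size;
        long count, i;
        char *src, *dst;
        struct timeval start, end;
        double secs, mib;

        if (argc < 3)
            return 1;
        size = atol(argv[1]);
        count = atol(argv[2]);
        src = malloc(size);
        dst = malloc(size);

        memset(src, 0, size);           /* fault the pages in */
        memset(dst, 0, size);

        gettimeofday(&start, NULL);
        for (i = 0; i < count; i++)
            memcpy(dst, src, size);
        gettimeofday(&end, NULL);

        secs = (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1e6;
        mib = (double) size * count / (1024 * 1024);
        printf("%gMBytes copied in %gseconds = %gMiB/s\n",
               mib, secs, mib / secs);
        return 0;
    }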