From: Goswin v. B. <gos...@we...> - 2008-06-16 09:35:35

Miklos Szeredi <mi...@sz...> writes:

>> >> I guess I am asking: "Would it"? Since I have mmap()ed the file
>> >> (read-only) the kernel would not have copied the data into a
>> >> user-space buffer - and any data referenced from it would come
>> >> directly from the page cache.
>> >
>> > What I'm always wondering, why are people so obsessed with eliminating
>> > memory copies...
>>
>> Because they are an easy starting point in eliminating waste.
>
> It only *seems* easy. It's actually very difficult to integrate this
> into the API so that it satisfies everyone's requirements, and doesn't
> make the interface an ugly mess at the same time.
>
>> > The goal is not to minimize memory copies, but to maximize
>> > performance. And one does *not* necessarily imply the other.
>> >
>> > Guess why "cp" doesn't use mmap to read and write files? The simple
>> > reason is that it would be slower than read() and write(), even though
>> > the latter obviously involves more memory copies.
>>
>> Because mmap does not mean zero copy. Actually it would change nothing
>> in the number of copies if you do
>>
>>     src = mmap(src_fd);
>>     dst = mmap(dst_fd);
>>     memcpy(dst, src, size);
>
> No, that's still one less copy than read(src_fd, buf); write(dst_fd,
> buf), since there's no intermediate buffer between the source and the
> destination caches.
>
> And the same is true for
>
>     src = mmap(src_fd);
>     while (...) {
>         write(dst_fd, src + offset, size);
>     }

That is certainly a saner solution. Still not as good as splice().

>> On the other hand that would page fault every page for src and dst and
>> read the page from the file.
>
> Yes, there's a page fault, but...
>
>> So you would not only read the source file but also the destination
>> file.
>
> ... you are wrong on this. For "cp" the destination doesn't exist so
> there's no I/O happening there at all.
>
> And for my example there's not even a page fault for the write side.
> Yet, it's still slower.

You can't mmap the dst without creating a sparse file of the proper
size first. And when memcpy then writes the first data into a target
page, a page fault occurs and the kernel maps in a page from the
destination file. The FS will see that the file is sparse and use an
empty page, but it still goes through most of the motions.

> The overhead is actually just from the administrative part of the page
> fault (i.e. setting up the page tables). That can actually mean a
> significant slowdown compared to copying the whole page. Memory
> copies can be very fast, and they aren't necessarily constrained by
> main memory bandwidth since modern processors can have big CPU caches
> where such copies fit nicely without ever hitting the main memory.

What I don't get is why there is no mechanism in the kernel for a
page-aligned (in address and size) memcpy that just shares the pages.
Or have I just not found it yet? It might be slower for a single page,
but certainly for 128k or more just copying the physical page addresses
must be faster than copying the contents. For read-only buffers that
should give a good boost.

>> Not to mention that error handling with mmap is a nightmare.
>>
>> The way to get rid of the copying in cp would be to use sendfile(), if
>> only that would work on non-sockets. Recently the kernel has added
>> splice support, which does exactly that. But that is pretty kernel
>> specific and cp is an old tool that has to run everywhere.
>
> Yeah, splice is nice. And "cp" could detect if kernel supports splice
> and fall back to read/write if not. But it's probably irrelevant,
> since the speed is mostly limited by disk bandwidth and not memory
> copies.
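
The detect-and-fall-back part could look roughly like the sketch below
(hypothetical and untested; copy_fd is a made-up name and error
handling is mostly elided). splice() needs a pipe at one end, so the
data is staged through one:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    static void copy_fd(int src_fd, int dst_fd)
    {
        char buf[128 * 1024];
        ssize_t n = -1;
        int p[2];

        if (pipe(p) == 0) {
            /* file -> pipe -> file: the pages move through the
               page cache without ever being copied to user space */
            while ((n = splice(src_fd, NULL, p[1], NULL,
                               sizeof(buf), SPLICE_F_MOVE)) > 0)
                splice(p[0], NULL, dst_fd, NULL, n, SPLICE_F_MOVE);
            close(p[0]);
            close(p[1]);
            if (n == 0)
                return;                 /* EOF, copy complete */
            if (errno != EINVAL && errno != ENOSYS)
                return;                 /* real I/O error, give up */
        }

        /* old kernel: classic loop, one extra copy through buf */
        while ((n = read(src_fd, buf, sizeof(buf))) > 0)
            write(dst_fd, buf, n);
    }

Even where the kernel still copies internally, the data on the splice
path at least never crosses into user space.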

For the last half year I have been working on a storage cluster for the
University of Karlsruhe. The systems have rather more I/O capability
than your average desktop, yet even there a single
'dd if=/dev/zero of=/dev/disk bs=1M' can't write fast enough: a single
core can't process enough I/O to fill the disk. Same with a single cp.

>> > I refuse to hear about zero-copy schemes until someone actually does
>> > some performance measurements and shows that it would be a significant
>> > improvement in real life situations, that cannot be achieved without
>> > messing with the API.
>>
>> When you want to achieve throughputs of 300-600MiB/s then just copying
>> that much data around becomes a real burden.
>
> Err, attaching a little program I wrote to test this: it gets the
> buffer size and the number of memcpy() iterations to do. Here's the
> result I get on a 1.8GHz CoreDuo laptop:
>
> > ./memcpy 262144 65536
> 16384MBytes copied in 2.65382seconds = 6173.75MiB/s
>
> Fuse uses 128k buffers, for which I get a slightly worse throughput of
> about 6070MiB/s.

500MiB/s with 4 copies for every buffer means 2GiB/s of copying, so
about 30% of the cpu time is spent on copying. Now assume the system
is at 100% cpu, doing something else in the remaining 70% (like parsing
metadata in fuse, and running the application itself). Reducing the
cpu load by 30% could give the application 30% more time.

The problem also might not even be that the cpu or memory bus hits its
limit, but that you add a delay to every operation. A memcpy does take
time, so the response time of your fs goes down if copies can be
avoided.

> So I'm not yet scared *at all* about the performance impact of memory
> copies in fuse :)

I'm not scared either, but I would like to avoid them. I patched the
nfs handling to avoid one memcpy in the kernel and got it to go from
40MiB/s to >200MiB/s.

> Come on people, do some measurements to prove your point! Theoretical
> arguments don't matter. If you can't measure the great improvements
> zero-copy will bring, then sorry, I'm not interested.

Give me a patch for fuse to support zero copy and I will test it for
you.

> Miklos

MfG
        Goswin
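
PS: Miklos' actual test program is not attached in the archive, but a
minimal reconstruction matching the usage and output format quoted
above might look like this (my guess, unverified):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    /* Usage: ./memcpy <bufsize> <iterations>
       memcpy()s a buffer of <bufsize> bytes <iterations> times and
       reports the achieved throughput. */
    int main(int argc, char *argv[])
    {
        size_t size;
        long count, i;
        char *src, *dst;
        struct timeval start, end;
        double secs, mib;

        if (argc < 3)
            return 1;
        size = atol(argv[1]);
        count = atol(argv[2]);
        src = malloc(size);
        dst = malloc(size);

        memset(src, 0, size);           /* fault the pages in */
        memset(dst, 0, size);

        gettimeofday(&start, NULL);
        for (i = 0; i < count; i++)
            memcpy(dst, src, size);
        gettimeofday(&end, NULL);

        secs = (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1e6;
        mib = (double) size * count / (1024 * 1024);
        printf("%gMBytes copied in %gseconds = %gMiB/s\n",
               mib, secs, mib / secs);
        return 0;
    }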