From: Miklos S. <mi...@sz...> - 2008-06-13 14:33:27
> >> I guess I am asking: "Would it"? Since I have mmap()ed the file
> >> (read-only)
> >> the kernel would not have copied the data into a user-space buffer - and
> >> any data referenced from it would come directly from the page cache.
> >
> > What I'm always wondering, why are people so obsessed with eliminating
> > memory copies...
>
> Because they are an easy starting point in eliminating waste.

It only *seems* easy.  It's actually very difficult to integrate this
into the API so that it satisfies everyone's requirements, and doesn't
make the interface an ugly mess at the same time.

> > The goal is not to minimize memory copies, but to maximize
> > performance.  And one does *not* necessarily imply the other.
> >
> > Guess why "cp" doesn't use mmap to read and write files?  The simple
> > reason is that it would be slower than read() and write(), even though
> > the latter obviously involves more memory copies.
>
> Because mmap does not mean zero copy. Actually it would change nothing
> in the number of copies if you do
>
>   src = mmap(src_fd);
>   dst = mmap(dst_fd);
>   memcpy(dst, src, size);

No, that's still one less copy than read(src_fd, buf); write(dst_fd,
buf), since there's no intermediate buffer between the source and the
destination caches.  And the same is true for

  src = mmap(src_fd);
  while (...) {
      write(dst_fd, src + offset, size);
  }

> On the other hand that would page fault every page for src and dst and
> read the page from the file.

Yes, there's a page fault, but...

> So you would not only read the source file but also the destination
> file.

... you are wrong on this.  For "cp" the destination doesn't exist so
there's no I/O happening there at all.  And for my example there's not
even a page fault for the write side.

Yet, it's still slower.  The overhead is actually just from the
administrative part of the page fault (i.e. setting up the page
tables).  That can actually mean a significant slowdown compared to
copying the whole page.
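
For anyone who wants to measure the two approaches side by side, here
is a minimal sketch of both copy loops.  The function names
(write_all, copy_read_write, copy_mmap_write) are made up for
illustration and are not taken from "cp" or from fuse.  It assumes a
regular source file small enough to map in one piece, and it does not
handle the SIGBUS that a truncated mapping can raise.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes, retrying on short
 * writes and on EINTR.  Returns 0 on success, -1 on error. */
int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Variant 1: read() into a user-space buffer, then write() it out.
 * Two copies per chunk: page cache -> buf, then buf -> page cache. */
int copy_read_write(int src_fd, int dst_fd, char *buf, size_t bufsize)
{
    assert(bufsize > 0);
    for (;;) {
        ssize_t n = read(src_fd, buf, bufsize);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return 0;   /* end of file */
        if (write_all(dst_fd, buf, (size_t) n) < 0)
            return -1;
    }
}

/* Variant 2: mmap() the source and write() straight from the mapping.
 * One copy per chunk, but source pages are faulted in on first access. */
int copy_mmap_write(int src_fd, int dst_fd, size_t chunk)
{
    struct stat st;
    char *src;
    size_t size, off;
    int res = 0;

    assert(chunk > 0);
    if (fstat(src_fd, &st) < 0)
        return -1;
    size = (size_t) st.st_size;
    if (size == 0)
        return 0;       /* mmap() rejects zero-length mappings */
    src = mmap(NULL, size, PROT_READ, MAP_PRIVATE, src_fd, 0);
    if (src == MAP_FAILED)
        return -1;
    for (off = 0; off < size && res == 0; off += chunk) {
        size_t len = size - off < chunk ? size - off : chunk;
        res = write_all(dst_fd, src + off, len);
    }
    munmap(src, size);
    return res;
}
```

Wrapping each call in a pair of gettimeofday() calls on the same input
file would be one way to compare them; no particular result is implied
here.
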
Memory copies can be very fast, and they aren't necessarily
constrained by main memory bandwidth since modern processors can have
big CPU caches where such copies fit nicely without ever hitting the
main memory.

> Not to mention that error handling with mmap is a nightmare.
>
> The way to get rid of the copying in cp would be to use sendfile(), if
> only that would work on non sockets. Recently the kernel has added
> splice support which does exactly that. But that is pretty kernel
> specific and cp is an old tool that has to run everywhere.

Yeah, splice is nice.  And "cp" could detect if kernel supports splice
and fall back to read/write if not.  But it's probably irrelevant,
since the speed is mostly limited by disk bandwidth and not memory
copies.

> > I refuse to hear about zero-copy schemes until someone actually does
> > some performance measurements and shows that it would be a significant
> > improvement in real life situations, that cannot be achieved without
> > messing with the API.
>
> When you want to achieve throughputs of 300-600MiB/s then just copying
> that much data around becomes a real burden.

Err, attaching a little program I wrote to test this: it gets the
buffer size and the number of memcpy() iterations to do.  Here's the
result I get on a 1.8GHz CoreDuo laptop:

  > ./memcpy 262144 65536
  16384MBytes copied in 2.65382seconds = 6173.75MiB/s

Fuse uses 128k buffers, for which I get a slightly worse throughput of
about 6070MiB/s.  So I'm not yet scared *at all* about the performance
impact of memory copies in fuse :)

Come on people, do some measurements to prove your point!  Theoretical
arguments don't matter.  If you can't measure the great improvements
zero-copy will bring, then sorry, I'm not interested.
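
Miklos

---
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
    int numiter;
    int bufsize;
    char *buf1;
    char *buf2;
    struct timeval start;
    struct timeval end;
    double elapsed;     /* not "difftime": that is also a <time.h> function */
    int i;

    if (argc != 3) {
        fprintf(stderr, "usage: %s bufsize numiter\n", argv[0]);
        return 1;
    }
    bufsize = atoi(argv[1]);
    numiter = atoi(argv[2]);
    if (bufsize <= 0 || numiter <= 0) {
        fprintf(stderr, "bufsize and numiter must be positive\n");
        return 1;
    }

    buf1 = malloc(bufsize);
    buf2 = malloc(bufsize);
    if (!buf1 || !buf2) {
        fprintf(stderr, "failed to allocate memory\n");
        return 1;
    }
    /* Touch both buffers so the source holds defined data and the
       first-touch page faults happen before the timed loop. */
    memset(buf1, 0, bufsize);
    memset(buf2, 1, bufsize);

    /* Time numiter copies of bufsize bytes from buf2 to buf1. */
    gettimeofday(&start, NULL);
    for (i = 0; i < numiter; i++)
        memcpy(buf1, buf2, bufsize);
    gettimeofday(&end, NULL);

    elapsed = end.tv_sec - start.tv_sec;
    elapsed += ((double) end.tv_usec - (double) start.tv_usec) / 1000000.0;

    /* Sizes are divided by 1048576, so both figures are in MiB. */
    printf("%gMBytes copied in %gseconds = %gMiB/s\n",
           (double) bufsize * numiter / 1048576,
           elapsed,
           (double) bufsize * numiter / elapsed / 1048576);

    free(buf1);
    free(buf2);
    return 0;
}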
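
The "detect splice and fall back to read/write" idea from the message
above could look roughly like the sketch below.  It is an
illustration only, not what "cp" does: the function names
(copy_fallback, copy_splice) and the CHUNK size are assumptions, it is
Linux-specific, and it only falls back when the very first splice()
call reports EINVAL or ENOSYS.  Since splice() requires a pipe on one
side, the data goes file -> pipe -> file.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* splice() is a Linux-specific extension */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (128 * 1024)  /* assumed chunk size, chosen for illustration */

/* Fallback path: plain read()/write() through a user-space buffer. */
int copy_fallback(int src_fd, int dst_fd)
{
    static char buf[CHUNK];

    for (;;) {
        ssize_t n = read(src_fd, buf, sizeof(buf));
        ssize_t done = 0;

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return 0;   /* end of file */
        while (done < n) {
            ssize_t w = write(dst_fd, buf + done, (size_t) (n - done));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += w;
        }
    }
}

/* Copy src_fd to dst_fd with splice(), using a pipe as the in-kernel
 * intermediary.  Returns 0 on success, -1 on error. */
int copy_splice(int src_fd, int dst_fd)
{
    int p[2];
    int res = -1;
    int first = 1;

    if (pipe(p) < 0)
        return copy_fallback(src_fd, dst_fd);

    for (;;) {
        ssize_t n = splice(src_fd, NULL, p[1], NULL, CHUNK, SPLICE_F_MOVE);

        if (n < 0 && first && (errno == EINVAL || errno == ENOSYS)) {
            /* Not supported here, and nothing has been consumed yet. */
            res = copy_fallback(src_fd, dst_fd);
            break;
        }
        if (n < 0)
            break;
        if (n == 0) {
            res = 0;    /* end of file */
            break;
        }
        first = 0;
        /* Drain everything that was just queued in the pipe. */
        while (n > 0) {
            ssize_t w = splice(p[0], NULL, dst_fd, NULL, (size_t) n,
                               SPLICE_F_MOVE);
            if (w <= 0)
                goto out;
            n -= w;
        }
    }
out:
    close(p[0]);
    close(p[1]);
    return res;
}
```

Whether this is any faster than the plain loop is exactly the kind of
thing that would have to be measured on real files; the sketch makes
no claim either way.
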