From: Miklos S. <mi...@sz...> - 2008-06-14 09:59:52
> > a) where the filesystem gets its data from and what the user of the
> > filesystem does with the data
>
> The data will come entirely from block devices, as in a typical
> filesystem (like ext3 or whatever). The device will be a RAID device
> (hardware or software) which would be sourcing its data through other
> block devices - primarily SCSI, iSCSI or Infiniband.
>
> The user of the filesystem will principally be Apache - which will be
> sending the files out over TCP. Apache uses sendfile to do this. The
> TCP is offloaded through a TOE - which uses sendfile (et al) to
> bus-master the block directly out of the page cache to the network.
> The CPU literally never touches or moves the data. A few other users
> of the data are some other (non-TCP) network stacks and file export
> (via Infiniband) transports which use the same zero-copy SG
> directly-out-of-page-cache method (a la sendfile).

What sort of data is this, small files or large files?

Why aren't you using a normal high-performance filesystem on top of
those block devices?

Are you aware of the fact that context switches between the caller
(Apache) and the filesystem daemon can be a significant performance
issue with fuse? This usually dominates CPU usage, not memory copies,
and is even harder to eliminate.

> > b) some performance data (bandwidth, CPU usage) with the current
> > fuse setup
>
> The hardware that will ultimately run this actual setup isn't
> available yet. I also need to know the optimal way to handle the I/O
> with FUSE as-is (like the "read" vs. mmapped "readpage" stuff I was
> asking about before).

Mmap is almost always the wrong answer to performance problems, because
setting up the memory mapping is going to be far slower than a memory
copy. And it wouldn't even eliminate the memory copy from the device's
page cache to the filesystem's page cache.

I'd say it's impossible to design a solution without actually having a
means to test it out.
We may come up with some perfect zero-copy solution using splice() or
some other mechanism, and yet it may still be irrelevant because of
some other performance limitation.

> > c) what changes you propose to improve the performance, and how
> > much you expect the performance to improve (preferably with a
> > prototype and actual measurements).
>
> I'm less in the phase of proposing changes than of understanding
> FUSE's capabilities and how it works - as well as coming a bit up a
> learning curve on some VFS stuff.

That's cool :)

> What I need is just the basic ability to do zero-copy readpage
> support to a block device - just like other filesystems (like ext3)
> do.

Look at splice(). It's the most promising interface for this sort of
thing, and it might be made usable on the fuse device (it won't work
now; it would need additional code in the fuse kernel module). But I'm
not familiar enough with that interface to say for sure whether it
will work or not.

I'd also suggest that you look at some solutions not involving fuse.
Being in userspace is nice and all that, but it will never have the
same performance as an in-kernel solution.

Miklos