Re: [fuse-devel] RFC: Read/Write I/Os to block devices - Zero-Copy


I think I specifically touched on (and agreed with) most of your points in 
a later message:

A. Kernel XOR/Crypto/checksum/ECC w/ or wo/ hardware assist

B. The concept of ext2 having the first "x" blocks cached - and saving 
those context switches.

Some new points you mentioned:

A. Preallocation - I didn't think of this - indeed, it is 
filesystem-specific. I believe that the "cache" mechanism I am currently 
proposing would work well with this - if you populate the cache with those 
"pre-allocated" blocks, writes will be able to use them. Perhaps the only 
addition would be for the kernel to be able to "cache" the update to 
the file size when you have in fact written within those 
(preallocated/cached) blocks, but beyond the current EOF.

BTW - This also fits logically into the schema of my filesystem - and, I am 
sure, most others (i.e. I allocate blocks, but maybe don't immediately 
write all the way to EOF).
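
A minimal sketch of that idea, assuming a per-file table that maps file blocks to already-allocated device blocks. The names extent_cache, cache_try_write and the 4096-byte block size are made up for illustration and are not existing FUSE or kernel interfaces. A write that lands entirely on cached blocks is accepted without calling user space, and a write past the current EOF only marks the cached file size as dirty so it can be reported later:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096u
#define NO_BLOCK   UINT64_MAX          /* marks a file block with no cached mapping */

/* Hypothetical per-file cache filled in by the user-space filesystem. */
struct extent_cache {
    uint64_t *phys;        /* phys[i] = device block backing file block i */
    size_t    nblocks;     /* number of entries in phys[] */
    uint64_t  size;        /* file size as last agreed with user space */
    bool      size_dirty;  /* size grew in-kernel, user space not yet told */
};

/*
 * Returns true if the write [off, off + len) can be served purely from
 * cached (possibly preallocated) blocks.  If it extends the file, the
 * cached size is updated and flagged dirty.  Returns false if any block
 * is missing, meaning user space has to allocate first.
 */
bool cache_try_write(struct extent_cache *c, uint64_t off, uint64_t len)
{
    assert(c != NULL);
    if (len == 0)
        return true;
    if (len > UINT64_MAX - off)
        return false;                  /* offset arithmetic would overflow */

    uint64_t first = off / BLOCK_SIZE;
    uint64_t last  = (off + len - 1) / BLOCK_SIZE;
    if (last >= c->nblocks)
        return false;

    for (uint64_t b = first; b <= last; b++)
        if (c->phys[b] == NO_BLOCK)
            return false;

    if (off + len > c->size) {
        c->size = off + len;
        c->size_dirty = true;
    }
    return true;
}
```

A caller would fall back to the normal FUSE write path whenever cache_try_write() returns false.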

B. >> Doesn't it suffice to do that on flush() and fsync() but not on 
fdatasync()?

Yes, you are correct.
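
A rough sketch of that rule, assuming the kernel side keeps a cached timestamp per file. The names cached_times, note_kernel_write and take_mtime_for_sync are hypothetical. The timestamp is recorded on every in-kernel write, but it is only handed to user space on flush() and fsync(), never on fdatasync():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sync_kind { SYNC_FLUSH, SYNC_FSYNC, SYNC_FDATASYNC };

/* Hypothetical per-file state kept while writes bypass user space. */
struct cached_times {
    int64_t mtime_ns;     /* time of the most recent in-kernel write */
    bool    mtime_dirty;  /* user space has not seen mtime_ns yet */
};

/* Called for every write that the kernel handled on its own. */
void note_kernel_write(struct cached_times *t, int64_t now_ns)
{
    t->mtime_ns = now_ns;
    t->mtime_dirty = true;
}

/*
 * Returns true if this sync operation has to pass st_mtime to user
 * space, storing the value in *mtime_out and clearing the dirty flag.
 * fdatasync() leaves the timestamp pending for a later flush or fsync.
 */
bool take_mtime_for_sync(struct cached_times *t, enum sync_kind kind,
                         int64_t *mtime_out)
{
    assert(t != NULL && mtime_out != NULL);
    if (!t->mtime_dirty || kind == SYNC_FDATASYNC)
        return false;
    *mtime_out = t->mtime_ns;
    t->mtime_dirty = false;
    return true;
}
```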

C. General I/Os (not just block devices):
> If this is done with normal FDs then the libfuse can just transform
> them into read/write calls for the actual data and ignore the cache
> fills (or also emulate that).

Yes, true - however - I have been focusing on readpage() and writepage() - 
mostly because these seem to be the underlying mechanism used to implement 
read() and write() (in almost all cases), as well as sendfile(). I am not 
sure about splice() - I will have to look into it. read() and write() 
themselves, however, would copy the data, which is what I am trying to 
avoid. Perhaps directly calling readpage()/writepage() of the target 
device would do the trick in my situation, while read() or write() would 
be correct in others (like a network socket?). readpage()/writepage() only 
work for things like block devices, but a general read() or write() would 
work in a broader sense.

Perhaps splice() would easily work the same way!
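
For reference, a user-space sketch of what a splice()-based transfer between two arbitrary file descriptors could look like on Linux, with the read()/write() emulation mentioned above as the fallback. splice() needs a pipe on one side, so the data goes in_fd -> pipe -> out_fd. The helper names copy_fd and copy_readwrite are made up for this example, and whether the kernel really avoids every copy depends on the file types involved:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback path: ordinary read()/write() loop through a user-space buffer. */
static ssize_t copy_readwrite(int in_fd, int out_fd, size_t len)
{
    char buf[65536];
    size_t done = 0;

    while (done < len) {
        size_t want = len - done < sizeof buf ? len - done : sizeof buf;
        ssize_t n = read(in_fd, buf, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;                          /* end of input */
        for (ssize_t w = 0; w < n; ) {
            ssize_t m = write(out_fd, buf + w, (size_t)(n - w));
            if (m < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            w += m;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/*
 * Copy up to len bytes from in_fd to out_fd, both at their current file
 * offsets.  Returns the number of bytes copied (short at EOF) or -1.
 */
ssize_t copy_fd(int in_fd, int out_fd, size_t len)
{
    int p[2];
    size_t done = 0;
    ssize_t ret = -1;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(p) < 0)
        return copy_readwrite(in_fd, out_fd, len);

    while (done < len) {
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len - done, SPLICE_F_MOVE);
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EINVAL || errno == ENOSYS)) {
            /* in_fd cannot be spliced; the pipe is empty, so emulate. */
            ssize_t r = copy_readwrite(in_fd, out_fd, len - done);
            if (r < 0)
                goto out;
            done += (size_t)r;
            break;
        }
        if (n < 0)
            goto out;
        if (n == 0)
            break;                          /* end of input */

        /* Drain exactly n bytes from the pipe into out_fd. */
        for (ssize_t left = n; left > 0; ) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)left,
                               SPLICE_F_MOVE);
            if (m < 0 && errno == EINTR)
                continue;
            if (m <= 0) {
                /* out_fd cannot be spliced to: move the piped bytes by hand. */
                if (copy_readwrite(p[0], out_fd, (size_t)left) != left)
                    goto out;
                break;
            }
            left -= m;
        }
        done += (size_t)n;
    }
    ret = (ssize_t)done;
out:
    close(p[0]);
    close(p[1]);
    return ret;
}
```

The same call shape works whether the descriptors are block devices, regular files or sockets, which is the appeal of splice() over a readpage()/writepage()-only approach.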

I don't know about anyone else - but I am really liking the possibilities 
here - i.e. the ability to do very high-performance user-space filesystem I/O!

-BKG

gos...@we... wrote on 06/19/2008 09:43:02 AM:

> Miklos Szeredi <mi...@sz...> writes:
> 
> >> FUSE [seems to have been] written primarily with the concept of
> >> user-space "translation" filesystems - i.e. things like gmailfs and
> >> sshfs - i.e. doing translation of data, involving user-space
> >> processes, and other things that would be cumbersome to implement in
> >> kernel-space - and even if they were - would have to communicate with
> >> a whole bunch of user-space code anyway (like SSH, etc.)
> >> 
> >> It seems as though a lot of people (like myself, NTFS-3g,
> >> ext2inuserspace, etc.) are now trying to use it for "traditional"
> >> filesystems which are "traditional" in the sense that they are
> >> filesystems which are backed by block devices. Examples would be
> >> [user-space implementations of] ext[234], NTFS (3g) or ZFS. Examples
> >> would NOT include things like gmailfs or sshfs.
> >
> > ZFS does checksumming of blocks I think, so even though it's backed by
> > a block device (or block devices) it has to process data that passes
> > through it.  This applies to compressed files on NTFS as well.
> 
> And I would very much like to pass the checksumming of blocks to the
> kernel's async crypto engine. If no hardware support is there then the
> generic_* driver will do it in software and not much is gained. But
> if you have hardware this frees a lot of cpu time.
> 
> This also applies for the striping that uses XOR for parity.
> 
> >> However, FUSE is not really intended to be optimized as normal
> >> in-kernel filesystems in this respect - i.e. user<->kernel
> >> translation, context switching, copying, etc.
> >> 
> >> 1. This implementation is geared primarily for "traditional" type
> >> filesystems - i.e. ones backed by block devices.
> >> 
> >> 2. The patch aims to accomplish the following: Let user-space code do
> >> the "heavy-lifting" - i.e. the "logic" of the filesystem is
> >> implemented in normal FUSE code - handling of the VFS functions, etc.
> >> Just like FUSE does today. User-space code implements almost the
> >> entire filesystem.  HOWEVER when it comes to the actual I/O - get the
> >> user-space code out of the way and let the kernel code take over.
> >> This provides the following optimizations:
> >> 
> >>         a. Reducing context switching - user-space code may be avoided
> >> in most cases during normal I/O (read/write/readpage, etc)
> >
> > This helps only with I/O on large files which are read infrequently
> > and thus do not get cached (or are too large to be cached).  I know
> > that this does apply in your case, I'm just noting that this is not a
> > universal solution to all problems :)
> 
> Why? On the first read for an ext2 filesystem the cache gets set up
> for the first 20 or so blocks. Even a 64k file will save context
> switches.
> 
> > Also when write is growing a file (which is by far the most common
> > mode of operation), the userspace code has to do the block allocation.
> > So unless some trickery (fallocate()) is used, this won't get rid of
> > interaction with userspace.
> 
> The kernel ext2/3 code can do preallocation. Doing that in userspace
> would still require some feedback when a block is then used but that
> could run in parallel with the kernel writing to the cache block
> address.
> 
> > With writes there's also the question of st_mtime update.  Currently
> > updating the timestamp is the responsibility of the userspace part, so
> > if writing is moved to the kernel, this issue needs to be addressed as
> > well.
> 
> Doesn't it suffice to do that on flush() and fsync() but not on
> fdatasync()?
> 
> >>         b. Allow the filesystem to "redirect" their I/Os to the
> >> underlying block devices. (I have re-included some code snippets below
> >> of how they do this). This would be optimum for zero-copy I/O.
> >
> > Why just block devices?  One common use of fuse is to do some
> > transformation on a normal filesystem, and sometimes the actual data
> > is not involved in the transformation.  So having fuse perform I/O
> > directly on the underlying file is also a feature that is often asked
> > for.
> 
> ACK. I would prefer if this would work with any file descriptor. In my
> mind splice() should be used somehow.
> 
> > If we are doing some sort of zero-copy thing, I'd really not like to
> > limit it to just block devices.
> >
> > Also it would be important to have an API that can easily be emulated
> > with legacy kernel support, so filesystems using the new zero-copy
> > interface are not forced to implement two kinds of APIs for backward
> > compatibility (and compatibility with other OS's than Linux).
> 
> If this is done with normal FDs then the libfuse can just transform
> them into read/write calls for the actual data and ignore the cache
> fills (or also emulate that).
> 
> > Miklos
> 
> 
> One thing I'm missing is a feedback method for errors. Let's stick with
> the ZFS example from above. Say you have a raidX chunk and one disk
> fails. Then the userspace should be told about read errors, fetch the
> parity block, reconstruct the missing block, do any repair work on the
> FS and return the proper data to the reader.
> 
> For the checksumming it would also be nice to do that on read. There
> would have to be a callback for whenever a block is read to fire off
> the in kernel checksumming and to OK the read block.
> 
> MfG
>         Goswin
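
On the raidX error case in the quoted message: the reconstruction step that user space would run after being told about a read error can be sketched as below, assuming plain single-parity XOR striping (as in RAID-5). This is not ZFS's actual raidz code, and the function name xor_reconstruct is made up for illustration. The missing block is simply the XOR of every surviving block in the stripe, parity included:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Rebuild one block of a single-parity stripe.
 *   blocks[]   : nblocks pointers, data blocks plus the parity block
 *   missing    : index of the block that could not be read (its pointer
 *                is never dereferenced)
 *   out        : receives block_size reconstructed bytes
 */
void xor_reconstruct(uint8_t *out, const uint8_t *const *blocks,
                     size_t nblocks, size_t missing, size_t block_size)
{
    assert(out != NULL && blocks != NULL && missing < nblocks);

    for (size_t i = 0; i < block_size; i++)
        out[i] = 0;

    for (size_t b = 0; b < nblocks; b++) {
        if (b == missing)
            continue;                   /* skip the failed block */
        for (size_t i = 0; i < block_size; i++)
            out[i] ^= blocks[b][i];
    }
}
```

The same routine also computes the parity block itself when "missing" is set to the parity index.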