From: <Bra...@sc...> - 2008-06-19 14:09:46
I think I specifically touched on (and agreed with) most of your points in a later message:

A. Kernel XOR/crypto/checksum/ECC, with or without hardware assist.
B. The concept of ext2 having the first "x" blocks cached, and saving those context switches.

Some new points you mentioned:

A. Preallocation - I didn't think of this. Indeed, it is filesystem-specific. I believe that the "cache" mechanism I am proposing would work well with this: if you populate the cache with those preallocated blocks, writes will be able to use them. Perhaps the only addition would be for the kernel to be able to "cache" the update to the file size when you have written within those (preallocated/cached) blocks but beyond the current EOF. BTW - this also fits logically into the scheme of my filesystem, and I am sure most others (i.e. I allocate blocks, but maybe don't immediately write all the way to EOF).

B. On the st_mtime update:

>> Doesn't it suffice to do that on flush() and fsync() but not on fdatasync()?

Yes, you are correct.

C. General I/Os (not just block devices):

> If this is done with normal FDs then the libfuse can just transform
> them into read/write calls for the actual data and ignore the cache
> fills (or also emulate that).

Yes, true. However, I have been focusing on readpage() and writepage(), mostly because this seems to be the underlying mechanism used to implement read() and write() (in almost all cases) as well as sendfile(). I am not sure about splice() - I will have to look into it. read() and write() themselves, however, would copy the data, which is what I am trying to avoid. Perhaps directly calling readpage()/writepage() of the target device would do the trick in my situation, while read() or write() would be correct in others (like a network socket?). readpage()/writepage() only work for things like block devices, whereas a general read() or write() would work in a broader sense. Perhaps splice() would easily work the same way!
I don't know about anyone else, but I am really liking the possibilities here - i.e. the ability to do very high-performance user-space filesystem I/O!

-BKG

gos...@we... wrote on 06/19/2008 09:43:02 AM:

> Miklos Szeredi <mi...@sz...> writes:
>
> >> FUSE [seems to have been] written primarily with the concept of user-space
> >> "translation" filesystems - i.e. things like gmailfs and sshfs - i.e.
> >> doing translation of data, involving user-space processes, and other
> >> things that would be cumbersome to implement in kernel-space - and even if
> >> they were - would have to communicate with a whole bunch of user-space
> >> code anyway (like SSH, etc.)
> >>
> >> It seems as though a lot of people (like myself, NTFS-3g, ext2inuserspace,
> >> etc.) are now trying to use it for "traditional" filesystems which are
> >> "traditional" in the sense that they are filesystems which are backed by
> >> block devices. Examples would be [user-space implementations of] ext[234],
> >> NTFS (3g) or ZFS. Examples would NOT include things like gmailfs or sshfs.
> >
> > ZFS does checksumming of blocks I think, so even though it's backed by
> > a block device (or block devices) it has to process data that passes
> > through it. This applies to compressed files on NTFS as well.
>
> And I would very much like to pass the checksumming of blocks to the
> kernel's async crypto engine. If no hardware support is there then the
> generic_* driver will do it in software and not much is gained. But
> if you have hardware this frees a lot of CPU time.
>
> This also applies for the striping that uses XOR for parity.
>
> >> However, FUSE is not really intended to be optimized as normal in-kernel
> >> filesystems in this respect - i.e. user<->kernel translation, context
> >> switching, copying, etc.
> >>
> >> 1. This implementation is geared primarily for "traditional" type
> >> filesystems - i.e. ones backed by block devices.
> >>
> >> 2.
> >> The patch aims to accomplish the following: Let
> >> user-space code do the "heavy-lifting" - i.e. the "logic" of the
> >> filesystem is implemented in normal FUSE code - handling of the VFS
> >> functions, etc. Just like FUSE does today. User-space code implements
> >> almost the entire filesystem. HOWEVER when it comes to the actual I/O -
> >> get the user-space code out of the way and let the kernel code take over.
> >> This provides the following optimizations:
> >>
> >> a. Reducing context switching - user-space code may be avoided in
> >> most cases during normal I/O (read/write/readpage, etc)
> >
> > This helps only with I/O on large files which are read infrequently
> > and thus do not get cached (or are too large to be cached). I know
> > that this does apply in your case, I'm just noting that this is not a
> > universal solution to all problems :)
>
> Why? On the first read for an ext2 filesystem the cache gets set up
> for the first 20 or so blocks. Even a 64k file will save context
> switches.
>
> > Also when write is growing a file (which is by far the most common
> > mode of operation), the userspace code has to do the block allocation.
> > So unless some trickery (fallocate()) is used, this won't get rid of
> > interaction with userspace.
>
> The kernel ext2/3 code can do preallocation. Doing that in userspace
> would still require some feedback when a block is then used but that
> could run in parallel with the kernel writing to the cache block
> address.
>
> > With writes there's also the question of st_mtime update. Currently
> > updating the timestamp is the responsibility of the userspace part, so
> > if writing is moved to the kernel, this issue needs to be addressed as
> > well.
>
> Doesn't it suffice to do that on flush() and fsync() but not on
> fdatasync()?
>
> >> b. Allow the filesystem to "redirect" their I/Os to the underlying
> >> block devices. (I have re-included some code snippets below of how they
> >> do this).
> >> This would be optimum for zero-copy I/O.
> >
> > Why just block devices? One common use of fuse is to do some
> > transformation on a normal filesystem, and sometimes the actual data
> > is not involved in the transformation. So having fuse perform I/O
> > directly on the underlying file is also a feature that is often asked
> > for.
>
> ACK. I would prefer if this would work with any file descriptor. In my
> mind splice() should be used somehow.
>
> > If we are doing some sort of zero-copy thing, I'd really not like to
> > limit it to just block devices.
> >
> > Also it would be important to have an API that can easily be emulated
> > with legacy kernel support, so filesystems using the new zero-copy
> > interface are not forced to implement two kinds of APIs for backward
> > compatibility (and compatibility with other OS's than Linux).
>
> If this is done with normal FDs then the libfuse can just transform
> them into read/write calls for the actual data and ignore the cache
> fills (or also emulate that).
>
> > Miklos
>
> One thing I'm missing is a feedback method for errors. Let's stick with
> the ZFS example from above. Say you have a raidX chunk and one disk
> fails. Then the userspace should be told about read errors, fetch the
> parity block, reconstruct the missing block, do any repair work on the
> FS and return the proper data to the reader.
>
> For the checksumming it would also be nice to do that on read. There
> would have to be a callback for whenever a block is read to fire off
> the in-kernel checksumming and to OK the read block.
>
> MfG
> Goswin
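The error-feedback scenario at the end of the quoted message (one disk of a parity-protected chunk fails, the missing block is rebuilt from the parity block, and every block is checksummed before it is handed to the reader) can be sketched in user space as follows. This is only an illustration of the logic, under loud assumptions: a single XOR parity block per stripe, CRC-32 from zlib as a stand-in for whatever checksum the filesystem really uses, and hypothetical names xor_blocks and read_stripe. It says nothing about what the kernel callback interface would look like.

```python
import zlib


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same size")
    acc = 0
    for b in blocks:
        acc ^= int.from_bytes(b, "little")
    return acc.to_bytes(size, "little")


def read_stripe(blocks, parity, checksums):
    """Return the verified data blocks of one stripe.

    blocks    -- data blocks as read from disk; None marks a failed read
    parity    -- XOR of all data blocks in the stripe
    checksums -- expected CRC-32 of each data block (stand-in checksum)
    """
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) > 1:
        raise IOError("more than one block lost; single parity cannot recover")

    result = list(blocks)
    if missing:
        # XOR of the parity block and the surviving blocks yields the lost one.
        survivors = [b for b in blocks if b is not None]
        result[missing[0]] = xor_blocks(survivors + [parity])

    # "OK the read block": verify every block, including a rebuilt one.
    for i, b in enumerate(result):
        if zlib.crc32(b) != checksums[i]:
            raise IOError(f"checksum mismatch in block {i}")
    return result
```

In the proposed design the XOR and the checksum would be the parts handed to the kernel (or to hardware), while the decision about what to do on a failed read or a mismatch stays with the user-space filesystem.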