From: <Bra...@sc...> - 2008-06-17 15:19:29
So I've been thinking about this a lot, and have come up with what I believe is a pretty cool idea. I'm not completely there yet, but here's a rough sketch:

FUSE [seems to have been] written primarily with user-space "translation" filesystems in mind - i.e. things like gmailfs and sshfs - filesystems that translate data, involve user-space processes, and do other things that would be cumbersome to implement in kernel space - and which, even if they were in-kernel, would have to communicate with a whole bunch of user-space code anyway (like SSH, etc.)

It seems as though a lot of people (like myself, NTFS-3g, ext2-in-userspace, etc.) are now trying to use it for "traditional" filesystems - "traditional" in the sense that they are backed by block devices. Examples would be [user-space implementations of] ext[234], NTFS-3g or ZFS. Examples would NOT include things like gmailfs or sshfs. However, FUSE is not really optimized for this the way normal in-kernel filesystems are - i.e. user<->kernel translation, context switching, copying, etc.

1. This implementation is geared primarily for "traditional" type filesystems - i.e. ones backed by block devices.

2. The implementation aims to accomplish the following: Let user-space code do the "heavy lifting" - i.e. the "logic" of the filesystem is implemented in normal FUSE code - handling of the VFS functions, etc., just like FUSE does today. User-space code implements almost the entire filesystem. HOWEVER, when it comes to the actual I/O, get the user-space code out of the way and let the kernel code take over. This provides the following optimizations:

a. Reduced context switching - user-space code may be avoided in most cases during normal I/O (read/write/readpage, etc.)

b. Allowing the filesystem to "redirect" its I/Os to the underlying block devices. (I have re-included some code snippets below of how other filesystems do this.) This would be optimal for zero-copy I/O.
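To make point (b) concrete, here's a rough user-space model of what "redirecting" a readpage to the block device could look like. Everything here - the toy device, sizes, and function names like read_full_page and demo_get_block - is illustrative only, not the actual kernel API; it just shows the shape of the get_block translation that block_read_full_page relies on:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only - NOT kernel code.  Models how a
 * block_read_full_page-style helper hands I/O straight to the block
 * device once the filesystem supplies a get_block translation. */

#define BLOCK_SIZE      512
#define PAGE_BYTES      4096
#define BLOCKS_PER_PAGE (PAGE_BYTES / BLOCK_SIZE)

/* A toy "block device": a flat array of blocks. */
static unsigned char fake_bdev[64][BLOCK_SIZE];

/* The filesystem's only job: translate a file-relative block number
 * into a device block number (the role of my_get_block below). */
typedef int (*get_block_fn)(unsigned long file_block,
                            unsigned long *dev_block);

/* Model of block_read_full_page(): for each block covering the page,
 * ask the filesystem where it lives, then copy straight from the
 * device into the page. */
static int read_full_page(unsigned long page_index, unsigned char *page,
                          get_block_fn get_block)
{
    for (int i = 0; i < BLOCKS_PER_PAGE; i++) {
        unsigned long dev_block;
        unsigned long file_block = page_index * BLOCKS_PER_PAGE + i;
        if (get_block(file_block, &dev_block) != 0)
            return -1;                 /* translation failed */
        memcpy(page + i * BLOCK_SIZE, fake_bdev[dev_block], BLOCK_SIZE);
    }
    return 0;
}

/* Example translation: a trivial layout where file data simply starts
 * at a fixed device offset. */
static int demo_get_block(unsigned long file_block, unsigned long *dev_block)
{
    *dev_block = file_block + 8;       /* data starts at device block 8 */
    return 0;
}
```

Note that the generic helper never needs to know anything about the filesystem's on-disk format - the entire filesystem-specific part is the one tiny translation function.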
IMPLEMENTATION:

So FUSE and the user-space code basically act as they do today.

"Step one" allows readpage to directly call the kernel's block_read_full_page() to complete a request. This is the heart (or often the *sole*) thing that "traditional" kernel filesystems need to do for a readpage(). The readpage() calls this with a pointer to a function (within the filesystem) that helps the filesystem locate the block and block device. From there, the kernel takes care of the I/O (see details at bottom of message). This can be mmap, zero-copy from the page cache, etc. It just "redirects" the filesystem I/O to the block device. Very, very simple. This would give us all the zero-copy semantics.

"Step two" would aim to reduce or eliminate a lot of the user-space code execution and context switching involved during I/O operations. This would effectively be done by implementing a "fuse_get_block" (kernel-space) function (which would be passed to, and called by, block_read_full_page - see the code snippet at the bottom of the message for detail). A very basic stub of this is inherently required by step one - but a "fuller" implementation would:

1. Have access to a "cached" or "prefetched" block map - optionally provided by the user-space code, possibly when the file was opened. It would be a list of the block/blockdev pairs in the file, generated by the user code and made available to the kernel code. It may also be a *partial* list, or it may not exist at all. This would require an API to create the map or "attach" the data to kernel space.

2. Require a FUSE user-space callback for the fuse_get_block function. If a requested block was not in the cache (or the cache did not exist), the kernel would use this callback to have user space resolve the block/blockdev.

3. Once it has the block/blockdev, return it and let the kernel continue with block_read_full_page to service the I/O (in the zero-copy, generic block-device fashion described above).
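The cache-plus-fallback logic of steps 1-3 can be sketched as a small user-space model. Again, this is illustrative only - struct names, MAP_SIZE, and the simulated user-space round trip are all assumptions, not a real kernel/FUSE interface:

```c
#include <assert.h>

/* Illustrative sketch only - NOT kernel code.  Models the "step two"
 * fuse_get_block logic: consult a per-file block-map cache first, and
 * only fall back to a (simulated) user-space callback on a miss. */

#define MAP_SIZE 16
#define UNMAPPED ((long)-1)

struct block_map {
    long dev_block[MAP_SIZE];   /* UNMAPPED = not yet cached */
    int  userspace_calls;       /* how many misses went to user space */
};

/* Stand-in for the round trip to the FUSE daemon's get_block handler.
 * In the real design this is the expensive path we want to avoid. */
static long userspace_get_block(struct block_map *map,
                                unsigned long file_block)
{
    map->userspace_calls++;
    return (long)file_block + 100;  /* pretend layout: offset by 100 */
}

/* The kernel-side helper: a cache hit avoids user space entirely, and
 * a miss populates the cache so later hits stay in the kernel. */
static long fuse_get_block(struct block_map *map, unsigned long file_block)
{
    if (file_block >= MAP_SIZE)
        return UNMAPPED;
    if (map->dev_block[file_block] == UNMAPPED)
        map->dev_block[file_block] = userspace_get_block(map, file_block);
    return map->dev_block[file_block];
}
```

The key property is visible in the counter: repeated reads of the same block only cross the user/kernel boundary once.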
Example:

So VFS calls would get serviced through normal means - except for readpage (and writepage), which [appear] to be the actual heart of all the I/Os.

If a block was in the cache, it would look like:

  readpage()                                      [kernel/syscall]
    block_read_full_page(page, fuse_get_block)    [FUSE kernel]
      fuse_get_block(): block found in cache      [kernel]
    hand I/O to block driver                      [kernel]

If the block was NOT in the cache, it would look like:

  readpage()                                      [kernel/syscall]
    block_read_full_page(page, fuse_get_block)    [FUSE kernel]
      fuse_get_block(): cache miss                [kernel]
        callback to user space to get block       [userspace]
        block returned - now in cache             [kernel]
    hand I/O to block driver                      [kernel]

Note that, as I said before, the user-space code has the option of populating the cache at any point in time - i.e. only when needed, or when the file is opened, etc. - or populating just a portion of it. This would be fantastic for a filesystem like ext2, for example, where the first "x" blocks of the file are stored in the "normal" inode record (I forget the exact term). When you opened, or even stat'ed, a file you'd have to read this record anyway - so you might as well jam this data into the cache. (It's tiny.) From that point onward, any readpage/writepage would be handled entirely in kernel space - unless the file was larger than those first "x" blocks, at which point you'd hit uncached data and get vectored through the user-space get_block callback.

No - I haven't started implementing it yet - but am fairly close to starting. I just wanted to run it by people for feedback and sanity-checking first. Any comments?
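The ext2 idea above - pre-loading the cache from the block pointers that live right in the inode record - can also be sketched in user-space C. The inode layout here is a toy (though the 12 direct pointers do match ext2's on-disk inode); struct and function names are made up for illustration:

```c
#include <assert.h>

/* Illustrative sketch only - NOT kernel code.  Models pre-populating
 * the block-map cache at open() time from an ext2-style inode whose
 * first few block pointers live in the inode record itself. */

#define DIRECT_BLOCKS 12    /* ext2 keeps 12 direct pointers in the inode */
#define MAP_SIZE      32
#define UNMAPPED      ((long)-1)

struct toy_inode {
    long direct[DIRECT_BLOCKS]; /* device block numbers, read at open/stat */
};

struct map_cache {
    long dev_block[MAP_SIZE];
};

/* User space already read the inode record to service open()/stat(),
 * so handing the direct pointers to the kernel cache is "free". */
static void prefill_cache(struct map_cache *c, const struct toy_inode *ino)
{
    for (int i = 0; i < MAP_SIZE; i++)
        c->dev_block[i] = UNMAPPED;
    for (int i = 0; i < DIRECT_BLOCKS; i++)
        c->dev_block[i] = ino->direct[i];
}

/* Returns 1 if readpage for this block can stay entirely in kernel
 * space; 0 means we'd vector through the user-space callback. */
static int block_is_cached(const struct map_cache *c, unsigned long b)
{
    return b < MAP_SIZE && c->dev_block[b] != UNMAPPED;
}
```

So for small files (within the direct blocks), every readpage after open would be a cache hit and never leave the kernel; only larger files would ever trigger the user-space callback.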
Thanks,
-BKG

------ Original message on optimizing FUSE for block-device access -----

All I really need is for readpage() to do what every other filesystem does:

  block_read_full_page(page, my_get_block);

And for my_get_block to [effectively] do what it does for a normal block device:

  bh->b_bdev = I_BDEV(inode);
  bh->b_blocknr = iblock;
  set_buffer_mapped(bh);

Where "inode" and "iblock" would come from my user-space code, via a "new" FUSE request. I believe this would allow FUSE filesystems to "redirect" I/Os to other block devices - allowing the system to use the same page cache and zero-copy semantics for readpage, sendfile (and probably splice, though I don't know much about it). This is basically what other filesystems appear to do - using their own "my_get_block" function to translate the iblock and inode values. Outside of my own use, other filesystems backed by "normal" block-device stores (such as "ext2-in-userspace") would benefit from this.

Does this sound correct?

-BKG