From: Teijo H. <th...@we...> - 2014-01-22 21:37:12
|
Hi Miklos, thanks very much for taking the time to respond. I've attached the kernel patch to this email, I didn't bother with the libfuse patch as that is trivial (it simply adds another command-line option & puts the bit into the capability flag). The kernel patch adds a field "pid_daemon" to "struct fuse_conn" (fuse_i.h) and populates it automatically at mount time with the pid of the FUSE daemon (inode.c). The main work is done in file.c. I've applied the short-circuit patch to fuse_send_read & fuse_send_write only, so currently, the "short-circuit" option also requires the use of the "direct_io" option (which is fine, as any caching is done by the next layer down anyway, e.g. NFS). With respect to the kernel looking up a file descriptor from user-land (fi->fh), this is what happens with normal read/write syscalls anyway. Only the short-circuit option attaches semantic meaning to fi->fh, the user can pass whatever they want (just like with read/write syscalls). If fi->fh is not a valid, open file descriptor inside the FUSE daemon, the client simply gets EIO. I'd be more than happy to test Maxim Patlasov write patch, I've been following progress on that too. It is a large change set though. With respect to using the slice options, to me it seemed the actual number of calls into user-space are the performance killer (rather that the amount of data that needs to be copied). This particular performance problem seems to be CPU-bound. This is shown by getting higher throughput when using larger block sizes. The data that needs to be copied is the same, however the number of context switches decreases. Again, thanks very much for taking the time to look at this. Teijo Holzer FUSE read/write short-circuit kernel patch ========================================== diff -uprN /usr/src/linux-source-3.8.0-33/fs/fuse/file.c fs/fuse/file.c --- /usr/src/linux-source-3.8.0-33/fs/fuse/file.c 2013-11-11 15:58:19.000000000 +1300 +++ fs/fuse/file.c 2014-01-23 10:02:07.712889000 +1300 @@ -15,6 +15,9 @@ #include <linux/module.h> #include <linux/compat.h> #include <linux/swap.h> +#include <linux/pid.h> +#include <linux/pid_namespace.h> +#include <linux/fdtable.h> static const struct file_operations fuse_direct_io_file_operations; @@ -491,8 +494,55 @@ void fuse_read_fill(struct fuse_req *req req->out.args[0].size = count; } +static struct file *fuse_get_file_ptr(struct fuse_file *ff, struct task_struct **task) +{ + struct file *file = NULL; + *task = NULL; + + if (!ff || !ff->fc || !ff->fc->pid_daemon || !ff->fh) + { + printk(KERN_INFO "fuse_get_file_ptr: no ff/ff->fc/ff->fc->pid_daemon/ff->fh\n"); + return file; + } + + rcu_read_lock(); + + *task = pid_task(find_get_pid(ff->fc->pid_daemon), PIDTYPE_PID); + if (!(*task)) + { + printk(KERN_INFO "fuse_get_file_ptr: no task for %u\n", ff->fc->pid_daemon); + rcu_read_unlock(); + return file; + } + + get_task_struct(*task); + task_lock(*task); + + if ((*task)->files) + file = fcheck_files((*task)->files, ff->fh); + + rcu_read_unlock(); + task_unlock(*task); + + return file; +} + +static ssize_t fuse_do_vfs_read(struct fuse_req *req, struct fuse_file *ff, loff_t pos, size_t count, const char __user *buf) +{ + ssize_t ret = -1; + struct task_struct *task = NULL; + struct file *file = fuse_get_file_ptr(ff, &task); + if (!file || !buf) + printk(KERN_INFO "fuse_do_vfs_read: no file %p or buf %p, please use direct_io\n", file, buf); + else + ret = vfs_read(file, (char __user *)buf, count, &pos); + if (task) + put_task_struct(task); + return ret; +} + static size_t fuse_send_read(struct fuse_req *req, struct file *file, - loff_t pos, size_t count, fl_owner_t owner) + loff_t pos, size_t count, fl_owner_t owner, const char __user *buf) { struct fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; @@ -504,6 +554,10 @@ static size_t fuse_send_read(struct fuse inarg->read_flags |= FUSE_READ_LOCKOWNER; inarg->lock_owner = fuse_lock_owner_id(fc, owner); } + + if (fc && fc->pid_daemon) + return fuse_do_vfs_read(req, ff, pos, count, buf); + fuse_request_send(fc, req); return req->out.args[0].size; } @@ -555,7 +609,7 @@ static int fuse_readpage(struct file *fi req->out.argpages = 1; req->num_pages = 1; req->pages[0] = page; - num_read = fuse_send_read(req, file, pos, count, NULL); + num_read = fuse_send_read(req, file, pos, count, NULL, NULL); err = req->out.h.error; fuse_put_request(fc, req); @@ -744,8 +798,22 @@ static void fuse_write_fill(struct fuse_ req->out.args[0].value = outarg; } +static size_t fuse_do_vfs_write(struct fuse_req *req, struct fuse_file *ff, loff_t pos, size_t count, const char __user *buf) +{ + ssize_t ret = -1; + struct task_struct *task = NULL; + struct file *file = fuse_get_file_ptr(ff, &task); + if (!file || !buf) + printk(KERN_INFO "fuse_do_vfs_write: no file %p or buf %p, please use direct_io\n", file, buf); + else + ret = vfs_write(file, buf, count, &pos); + if (task) + put_task_struct(task); + return ret; +} + static size_t fuse_send_write(struct fuse_req *req, struct file *file, - loff_t pos, size_t count, fl_owner_t owner) + loff_t pos, size_t count, fl_owner_t owner, const char __user *buf) { struct fuse_file *ff = file->private_data; struct fuse_conn *fc = ff->fc; @@ -757,6 +825,10 @@ static size_t fuse_send_write(struct fus inarg->write_flags |= FUSE_WRITE_LOCKOWNER; inarg->lock_owner = fuse_lock_owner_id(fc, owner); } + + if (fc && fc->pid_daemon) + return fuse_do_vfs_write(req, ff, pos, count, buf); + fuse_request_send(fc, req); return req->misc.write.out.size; } @@ -784,7 +856,7 @@ static size_t fuse_send_write_pages(stru for (i = 0; i < req->num_pages; i++) fuse_wait_on_page_writeback(inode, req->pages[i]->index); - res = fuse_send_write(req, file, pos, count, NULL); + res = fuse_send_write(req, file, pos, count, NULL, NULL); offset = req->page_offset; count = res; @@ -1087,9 +1159,9 @@ ssize_t fuse_direct_io(struct file *file } if (write) - nres = fuse_send_write(req, file, pos, nbytes, owner); + nres = fuse_send_write(req, file, pos, nbytes, owner, buf); else - nres = fuse_send_read(req, file, pos, nbytes, owner); + nres = fuse_send_read(req, file, pos, nbytes, owner, buf); fuse_release_user_pages(req, !write); if (req->out.h.error) { diff -uprN /usr/src/linux-source-3.8.0-33/fs/fuse/fuse_i.h fs/fuse/fuse_i.h --- /usr/src/linux-source-3.8.0-33/fs/fuse/fuse_i.h 2013-02-19 12:58:34.000000000 +1300 +++ fs/fuse/fuse_i.h 2014-01-23 09:56:51.765414000 +1300 @@ -528,6 +528,9 @@ struct fuse_conn { /** Read/write semaphore to hold when accessing sb. */ struct rw_semaphore killsb; + + /** FUSE daemon pid for FUSE_SHORTCIRCUIT option */ + u32 pid_daemon; }; static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb) diff -uprN /usr/src/linux-source-3.8.0-33/fs/fuse/inode.c fs/fuse/inode.c --- /usr/src/linux-source-3.8.0-33/fs/fuse/inode.c 2013-02-19 12:58:34.000000000 +1300 +++ fs/fuse/inode.c 2014-01-23 09:59:15.180539000 +1300 @@ -582,6 +582,7 @@ void fuse_conn_init(struct fuse_conn *fc fc->reqctr = 0; fc->blocked = 1; fc->attr_version = 1; + fc->pid_daemon = 0; get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key)); } EXPORT_SYMBOL_GPL(fuse_conn_init); @@ -846,6 +847,11 @@ static void process_init_reply(struct fu if (arg->minor >= 17) { if (!(arg->flags & FUSE_FLOCK_LOCKS)) fc->no_flock = 1; + // TODO: Add the following magic number to fuse.h as FUSE_SHORTCIRCUIT + if (arg->flags & (1 << 31)) + fc->pid_daemon = current->group_leader->pid; + else + fc->pid_daemon = 0; } else { if (!(arg->flags & FUSE_POSIX_LOCKS)) fc->no_flock = 1; On 23/01/14 05:33, Miklos Szeredi wrote: > On Tue, Jan 21, 2014 at 9:40 PM, Teijo Holzer <th...@we...> wrote: >> Hi Jean-Pierre, >> >> > How does the user-space file system gets called >>> when the client issues a read or write ? >> >> it doesn't. The short-circuit option is meant to provide open/close callbacks >> into the FUSE daemon but bypass read/write altogether for performance reasons. >> >> This option is useful when there is only a need for doing file path translation >> on open, but no need for the read/write paths to enter the FUSE daemon. > > Teijo, > > First, it'd be nice if we could see the actual patch. > > My main objection to this thing is the interface strangulation it > involves. For example using fi->fh in the kernel as an actual file > descriptor is completely bogus. To the kernel the "fh" is an opaque > identifier, it should only ever be used by the userspace filesystem, > never by the kernel. > > We do have an interface for passing file descriptors into the library > and that is fuse_bufvec. For example we could use > fuse_lowlevel_notify_store() to tell the kernel that (part of) a file > is equivalent to (part of) another one. > > Anyway, I'm not convinced that this optimization is actually needed. > I see the write performance numbers, but they are not surprising and > Maxim Patlasov has been working on that for a long time. The issue > here is is delaying writes, like any other filesystem, so that the > kernel can coalesce small writes into big requests. I'll try to put > that patchset in a testable form into git, and then you can see if the > numbers are good enough. > > And then there's still room for optimization in the splice > implementation, which in theory could allow zero copy reads and single > copy writes. > > Thanks, > Miklos > |