From: VenkataRao <ven...@gm...> - 2011-10-27 21:03:17
|
Hi,

I'm fairly new to FUSE development. Our file-system implementation requires the buffer address and file offset of every read and write request from libfuse to be aligned to 512 bytes.

We are implementing a file system backed by a hard disk. The file system opens the disk device in O_DIRECT mode, and O_DIRECT requires the buffer and offset of every read and write request to be aligned to sector boundaries.

Is there any way, such as a mount option, to have the buffer addresses and file offsets aligned to 512 bytes when a read or write request reaches the file system from libfuse? If so, could you please point me to how to do it?

Thanks for your attention and time.

Regards,
VenkataRao.
|
From: Alain S. <asp...@gm...> - 2011-10-30 23:13:30
|
On Thu, Oct 27, 2011 at 11:03 PM, VenkataRao <ven...@gm...> wrote:
> Hi,
>
> I'm kind of new to this fuse development and our file-system
> implementation requires the buffer address & file offsets to all read
> & write requests from lib-fuse to be aligned with 512 bytes.

Not aligned to a page boundary, i.e. 4k, instead? Just asking.

> Is there any way that I can get these buffer address and file offsets
> aligned with 512 bytes when a read or write request comes to the
> file-system from the lib-fuse, like a mount option or some thing else.

I don't know. Probably not.
But I have the same requirement myself, and I use "read before write".

If I get a write of 4 bytes at offset 520, I read the 512-byte block at offset 512, write the 4 bytes at the appropriate place in it (local offset 8 = 520 - 512), and then write the block back to its place on disk.

--
Alain Spineux | aspineux gmail com
Monitor your iT & Backups | http://www.magikmon.com
Free Backup front-end | http://www.magikmon.com/mksbackup
Your email 100% available | http://www.emailgency.com
|
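The "read before write" approach described above can be sketched roughly as follows. This is only an illustration, not code from any poster: the name `rmw_write`, the fixed `SECTOR` size of 512, and the choice to zero-fill whatever a short read leaves unfilled are all assumptions. In the O_DIRECT case the descriptor would be the one opened on the disk device; the sketch itself only relies on the block buffer and the block offsets being sector-aligned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR 512 /* assumed sector size */

/* Hypothetical helper: write len bytes at an arbitrary offset using only
 * sector-sized, sector-aligned reads and writes.
 * Returns 0 on success or a negative errno value. */
static int rmw_write(int fd, const char *data, size_t len, off_t off)
{
    void *blk;
    int err;

    assert(fd >= 0 && data != NULL && off >= 0);

    err = posix_memalign(&blk, SECTOR, SECTOR); /* returns a positive errno */
    if (err != 0)
        return -err;

    while (len > 0) {
        off_t base = off - (off % SECTOR);   /* start of the enclosing sector */
        size_t local = (size_t)(off - base); /* e.g. 520 - 512 = 8 */
        size_t n = SECTOR - local;           /* room left in this sector */
        ssize_t got, put;

        if (n > len)
            n = len;

        /* Read the existing sector so untouched bytes are preserved. */
        got = pread(fd, blk, SECTOR, base);
        if (got < 0) {
            err = errno;
            free(blk);
            return -err;
        }
        /* Assumption: a short read means end of file, so zero-fill the rest. */
        memset((char *)blk + got, 0, SECTOR - (size_t)got);

        memcpy((char *)blk + local, data, n); /* the extra copy */

        put = pwrite(fd, blk, SECTOR, base);
        if (put != SECTOR) {
            err = (put < 0) ? errno : EIO;
            free(blk);
            return -err;
        }

        data += n;
        off += (off_t)n;
        len -= n;
    }

    free(blk);
    return 0;
}
```

For the example in the message, `rmw_write(fd, buf, 4, 520)` reads the sector at 512, patches bytes 8 to 11 of it, and writes the sector back. A write that covers a whole sector could skip the read; the sketch leaves that optimization out for brevity.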
From: VenkataRao <ven...@gm...> - 2011-11-01 17:41:57
|
Please see below for my responses.

On Sun, Oct 30, 2011 at 4:13 PM, Alain Spineux <asp...@gm...> wrote:
> Not aligned with a page boundary, aka 4k instead ? Just asking ?

[Venkata]: We are using a 2.6 or later kernel, and under Linux 2.6 alignment to 512-byte boundaries suffices, so I think we do not need 4k boundaries. With older versions (e.g. 2.4) we would need 4K boundaries.

> I don't know. Probably not
> But I have the same requirement myself, I use "read before write".
>
> If I get a write of 4bytes to offset 520 then I read 512 bytes of
> block at offset 512
> I write the 4bytes at the appropriate place (local offset 8 = 520-512) and
> then write back the block to appropriate place.

[Venkata]: I can also do read-before-write, but it involves an additional copy, and that hurts file-system performance.
|
From: Alain S. <asp...@gm...> - 2011-11-03 22:54:09
|
On Tue, Nov 1, 2011 at 6:41 PM, VenkataRao <ven...@gm...> wrote:
> [Venkata]: I also can do read-before-write, but there will be an additional
> copy involved and due to this the file-system performance gets affected.

I don't think you have a choice.
You could maintain a local cache to speed up further writes to the same block.
You must read it at least the first time!

--
Alain Spineux | aspineux gmail com
|
From: Miklos S. <mi...@sz...> - 2011-11-08 18:58:08
|
>>> On Thu, Oct 27, 2011 at 11:03 PM, VenkataRao
>>> <ven...@gm...> wrote:
>>> > Hi,
>>> >
>>> > I'm kind of new to this fuse development and our file-system
>>> > implementation requires the buffer address & file offsets to all read
>>> > & write requests from lib-fuse to be aligned with 512 bytes.

Following patch should fix the read buffer alignment. Fixing the write buffer alignment properly would be more involved.

You could also try the zero copy interfaces introduced in the latest git version. I'm not sure it will work with O_DIRECT files though, but that would definitely be worth fixing.

Thanks,
Miklos

diff --git a/lib/fuse.c b/lib/fuse.c
index db638ec..f1b2a97 100644
--- a/lib/fuse.c
+++ b/lib/fuse.c
@@ -1698,10 +1698,10 @@ int fuse_fs_read_buf(struct fuse_fs *fs, const char *path,
 		if (buf == NULL)
 			return -ENOMEM;
 
-		mem = malloc(size);
-		if (mem == NULL) {
+		res = -posix_memalign(&mem, getpagesize(), size);
+		if (res != 0) {
 			free(buf);
-			return -ENOMEM;
+			return res;
 		}
 		*buf = FUSE_BUFVEC_INIT(size);
 		buf->buf[0].mem = mem;
@@ -1777,9 +1777,8 @@ int fuse_fs_write_buf(struct fuse_fs *fs, const char *path,
 		    !(buf->buf[0].flags & FUSE_BUF_IS_FD)) {
 			flatbuf = &buf->buf[0];
 		} else {
-			res = -ENOMEM;
-			mem = malloc(size);
-			if (mem == NULL)
+			res = -posix_memalign(&mem, getpagesize(), size);
+			if (res != 0)
 				goto out;
 
 			tmp.buf[0].mem = mem;
|
From: Goswin v. B. <gos...@we...> - 2011-11-19 22:57:52
|
Miklos Szeredi <mi...@sz...> writes:

> Following patch should fix the read buffer alignment. Fixing the write
> buffer alignment properly would be more involved.

For write buffers, the buffer needs to start at a 512/4K boundary minus the size of the header, so that the payload following the header is aligned. Unfortunately there is no posix_memalign() equivalent that allocates memory at an offset from an alignment. But it isn't that hard (although a bit wasteful) to allocate a bigger buffer and ignore the first X bytes.

Alternatively, wouldn't it be possible to have two buffers, one for the header and one for the payload data, and use readv() to fill them both atomically?

> You could also try the zero copy interfaces introduced in the latest git
> version. I'm not sure it will work with O_DIRECT files though, but that
> would definitely be worth fixing.

Should work.

MfG
Goswin
|
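The "allocate a bigger buffer and ignore the first X bytes" idea from the message above might look like the sketch below. The function name `alloc_payload_aligned` and its parameters are made up for illustration, and nothing here is taken from libfuse. It over-allocates with posix_memalign() and returns a start address chosen so that the byte right after the header falls on the requested boundary.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for a header of hdr_len bytes followed
 * by payload_len bytes of payload, such that the payload (returned pointer +
 * hdr_len) is aligned to `align`. `align` must be a power of two and a
 * multiple of sizeof(void *), as posix_memalign() requires.
 * *base receives the pointer that must later be passed to free().
 * Returns the address where the message (header first) should be read,
 * or NULL on allocation failure. */
static void *alloc_payload_aligned(size_t hdr_len, size_t payload_len,
                                   size_t align, void **base)
{
    size_t pad;
    void *mem;

    assert(base != NULL);
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Bytes to skip at the front so that pad + hdr_len is a multiple of
     * align. These are the wasted "first X bytes". */
    pad = (align - (hdr_len % align)) % align;

    if (posix_memalign(&mem, align, pad + hdr_len + payload_len) != 0)
        return NULL;

    *base = mem;
    return (char *)mem + pad;
}
```

With an illustrative 80-byte header and 512-byte alignment, 432 bytes are skipped, the header occupies bytes 432 to 511 of the allocation, and the payload starts at byte 512. The header size is an input here, not a claim about any particular protocol version.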
From: Miklos S. <mi...@sz...> - 2011-11-25 18:24:45
|
On Sat, Nov 19, 2011 at 11:57 PM, Goswin von Brederlow <gos...@we...> wrote:
> For write buffers the buffer needs to be aligned to 512/4K - the size of
> the header. Unfortunately there is no posix_memalign() equivalent to
> allocate memory at an offset to alignment. But it isn't that hard
> (although a bit wasteful) to allocate a bigger buffer and ignore the
> first X bytes.
>
> Alternatively wouldn't it be possible to have 2 buffers, one for the
> header, one for payload data, and use readv() to fill them both
> atomically?

Yes, those are possibilities. The problem is that the parsing of the message is done on a different layer from the reading of the message, which means it's pretty difficult to cleanly add those alignment constraints to the message-reading part.

But yes, it's doable.

Thanks,
Miklos
|
From: Goswin v. B. <gos...@we...> - 2011-11-26 04:02:28
|
Miklos Szeredi <mi...@sz...> writes:

> Yes, those are possibilities. The problem is, the parsing of the
> message is done on a different layer as the reading of the message.
> Which means it's pretty difficult to cleanly add those alignment
> constraints to the message reading part.
>
> But yes, it's doable.

I know the code. :)

I already started on a patch for this a while back, but then I got concerned about the atomicity of readv(). The manpage says readv/writev are atomic with the exception noted in pipe(7), which I think means this part:

    O_NONBLOCK disabled, n > PIPE_BUF
        The write is nonatomic: the data given to write(2) may be
        interleaved with write(2)s by other process; the write(2)
        blocks until n bytes have been written.

What isn't clear is whether that applies only if a pipe is involved or to any readv/writev operation. Well, the only concern here would be the behaviour of /dev/fuse.

But if you say they will be atomic, then let's try that. I will see if I can dig out the patch, then update and complete it this weekend.

MfG
Goswin
|
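The two-buffer readv() idea discussed above can be illustrated with a small sketch. The name `read_split` and its parameters are hypothetical, and the sketch says nothing about how /dev/fuse itself behaves; it only shows a single readv() call scattering the first hdr_len bytes into one buffer and the remainder into a separately allocated, aligned payload buffer.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: read one message from fd with a single readv().
 * The first hdr_len bytes land in hdr; everything after that lands in a
 * newly allocated buffer aligned to `align` (a power of two and a multiple
 * of sizeof(void *)). On success *payload must be released with free().
 * Returns the total number of bytes read, or -1 on error. */
static ssize_t read_split(int fd, void *hdr, size_t hdr_len,
                          void **payload, size_t payload_cap, size_t align)
{
    struct iovec iov[2];
    void *mem;
    ssize_t n;

    assert(hdr != NULL && payload != NULL);

    if (posix_memalign(&mem, align, payload_cap) != 0)
        return -1;

    iov[0].iov_base = hdr;   /* header goes here */
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = mem;   /* payload goes here, already aligned */
    iov[1].iov_len = payload_cap;

    n = readv(fd, iov, 2);   /* one syscall fills both buffers in order */
    if (n < 0) {
        free(mem);
        return -1;
    }

    *payload = mem;
    return n;
}
```

The caller has to know the header length before issuing the read, which is the layering problem raised earlier in the thread: the code that reads the message does not parse it.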
From: Goswin v. B. <gos...@we...> - 2011-11-27 17:25:31
|
Miklos Szeredi <mi...@sz...> writes:

> Yes, those are possibilities. The problem is, the parsing of the
> message is done on a different layer as the reading of the message.
> Which means it's pretty difficult to cleanly add those alignment
> constraints to the message reading part.
>
> But yes, it's doable.

I updated my git clone and tried to update my old patch for alignable and stealable buffers, but there has been quite a bit of bitrot. Looking at the splice interface, I wonder whether it is even worth doing this at all now. Let's compare the two:

1) alignable and stealable buffers

 - New functions: fuse_session_receive_bufv, fuse_session_process_bufv,
                  fuse_ll_receive_bufv, fuse_ll_process_bufv,
                  set_alignment, steal_buffer
 - New fields in the session and request structures
 - Rewrite fuse_loop, fuse_loop_mt and the worker structure

2) splice (no changes to libfuse)

 - Set splice mode
 - Allocate buffer yourself
 - Call fuse_buf_copy

Having aligned buffers in libfuse would save at least 2 syscalls per write request, but the changes needed seem to be a bit much right now.

MfG
Goswin
|
From: Miklos S. <mi...@sz...> - 2011-12-05 12:25:14
|
On Sun, Nov 27, 2011 at 6:25 PM, Goswin von Brederlow <gos...@we...> wrote:
> Having aligned buffers in libfuse would save at least 2 syscalls per
> write request but the changes needed seem to be a bit much right now.

Yes, the extra syscalls will hurt some workloads, but mainly because they will be used on all requests not just the big write ones. For big writes the overhead of the two extra syscalls is negligible.

So I believe the right direction is to eliminate the overhead for small requests while keeping the advantages of the splice interface for handling big writes.

Thanks,
Miklos
|
From: Goswin v. B. <gos...@we...> - 2011-12-05 19:22:48
|
Miklos Szeredi <mi...@sz...> writes:

> Yes, the extra syscalls will hurt some workloads, but mainly because
> they will be used on all requests not just the big write ones. For
> big writes the overhead of the two extra syscalls is negligible.
>
> So I believe the right direction is to eliminate the overhead for
> small requests while keeping the advantages of the splice interface
> for handling big writes.

But how? By the point where you know the request type and length, you have already spliced the request and data into a pipe, and then it is too late: you need the extra syscall to get the data out of the pipe again.

MfG
Goswin
|
From: Miklos S. <mi...@sz...> - 2011-12-06 13:53:00
|
On Mon, Dec 5, 2011 at 8:22 PM, Goswin von Brederlow <gos...@we...> wrote:
> But how? At the point where you do know the request type and length you
> have already spliced the request and data into a pipe. And then it is
> too late. You need the extra syscall to get the data out of the pipe
> again.

One idea is to allow attaching multiple device file descriptors to the mount: one on which only small requests are queued and one on which only large requests are queued. Then the device instance with the small requests is read directly, while the device instance with the large requests is spliced into a pipe.

I think there might be better ways to do this, but I haven't thought about it much.

Thanks,
Miklos
|
From: Goswin v. B. <gos...@we...> - 2011-12-07 09:09:37
|
Miklos Szeredi <mi...@sz...> writes:

> One idea is to allow attaching multiple device file descriptors to the
> mount: one on which only small requests are queued and one on which
> only large requests. Then the device instance with the small
> requests is read directly, while the device instance with the large
> request is spliced into a pipe.
>
> I think there might be better ways to do this but haven't thought about it much.

I don't know how feasible this is, but could a request be split up into a command fd and a data fd? The main thread would open the command fd, and each thread would request its own data fd over the command fd. Then, when a thread reads a request from the command fd (and it is a big one), the kernel would dump the data to the data fd of the requesting thread.

With this, fuse wouldn't even have to splice the data into an extra pipe but could just pass the data fd to the callback as is.

MfG
Goswin
|
From: Miklos S. <mi...@sz...> - 2011-12-07 14:50:26
|
Goswin von Brederlow <gos...@we...> writes:

> I don't know how feasable this is, but: Could request be split up into a
> command fd and data fd? The main thread would open the command fd and
> each thread would request it own data fd over the command fd. Then when
> a thread reads a request from the command fd (and it is a big one) the
> kernel would dump the data to the data fd for the requesting thread.
>
> With this fuse wouldn't even have to splice the data into an extra pipe
> but could just pass the data fd to the callback as is.

The kernel should not need to care about threads and such. It should not contain knowledge about the implementation of libfuse.

So slightly modifying your idea: the kernel can attach a data fd (basically the read end of a pipe) to the request and dump the data into that pipe.

The question is: how to attach the data fd to the request? File descriptor passing over unix domain sockets (SCM_RIGHTS) comes to mind. But changing the device interface into a socket interface is not quite trivial.

Passing the file descriptor in the fuse_write_in header would be possible, but that's really messy.

An ioctl() to request that the data be dumped onto the given pipe is perhaps the cleanest way to do this. E.g.

    struct fuse_data_request {
            u64 unique;
            int fd;
    };

    static void do_write()
    {
            struct fuse_data_request fdr;
            int pip[2];
            ...
            pipe(pip);
            fdr.unique = req->unique;
            fdr.fd = pip[1];
            ioctl(fuse_dev, FUSE_IOCTL_GETDATA, &fdr);
            /* data available in pip[0] */
            ...
    }

Thanks,
Miklos
|
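For readers unfamiliar with the SCM_RIGHTS mechanism mentioned above, here is a minimal, generic sketch of passing a file descriptor over a Unix-domain socket. It is unrelated to the FUSE device interface, which is not a socket; the helper names `send_fd` and `recv_fd` are made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned as struct cmsghdr requires. */
union fd_ctl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send fd over the Unix-domain socket `sock`.
 * Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char dummy = 'F'; /* at least one byte of real data must accompany it */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *c;

    assert(sock >= 0 && fd >= 0);
    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive a descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *c;
    int fd;

    assert(sock >= 0);
    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file description as the sender's descriptor, so the read end of a pipe passed this way can be read by the receiving side.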
From: Goswin v. B. <gos...@we...> - 2011-12-08 10:39:37
|
Miklos Szeredi <mi...@sz...> writes:

> So slightly modifying your idea: the kernel can attach a data fd
> (basically the read end of a pipe) to the request and dump the data into
> that pipe.

How expensive would opening and closing all those pipes be?

> An ioctl() to request that the data be dumped onto the given pipe is
> perhaps the cleanest way to do this. E.g.
>
>     struct fuse_data_request {
>             u64 unique;
>             int fd;
>     };
>
>     static void do_write()
>     {
>             struct fuse_data_request fdr;
>             int pip[2];
>             ...
>             pipe(pip);
>             fdr.unique = req->unique;
>             fdr.fd = pip[1];
>             ioctl(fuse_dev, FUSE_IOCTL_GETDATA, &fdr);
>             /* data available in pip[0] */
>             ...
>     }

Plus a feature negotiation to tell the kernel to send writes (above a certain size) without payload. It would add another syscall, but given the size of the data this is meant for, that is probably negligible.

As a plus, this could easily allow write requests over 128k without libfuse having to waste HUGE buffers on trivial requests. And you could reuse the pipes.

MfG
Goswin
|
From: Goswin v. B. <gos...@we...> - 2011-12-08 10:54:00
|
Miklos Szeredi <mi...@sz...> writes:

> An ioctl() to request that the data be dumped onto the given pipe is
> perhaps the cleanest way to do this. E.g.
>
>     struct fuse_data_request {
>             u64 unique;
>             int fd;
>     };
>
>     static void do_write()
>     {
>             struct fuse_data_request fdr;
>             int pip[2];
>             ...
>             pipe(pip);
>             fdr.unique = req->unique;
>             fdr.fd = pip[1];
>             ioctl(fuse_dev, FUSE_IOCTL_GETDATA, &fdr);
>             /* data available in pip[0] */
>             ...
>     }

One more thing I forgot.

Does the FD have to be a pipe? Think of an overlay filesystem. Why not pass the FD of the underlying file:

    struct fuse_data_request {
            u64 unique;
            u64 offset;
            int fd;
    };

    static void do_write(...)
    {
            struct fuse_data_request fdr;
            fdr.unique = req->unique;
            fdr.offset = off;
            fdr.fd = fi->fh;
            ioctl(fuse_dev, FUSE_IOCTL_GETDATA, &fdr);
    }

Plus some error handling.

MfG
Goswin
|
From: Miklos S. <mi...@sz...> - 2011-12-08 11:23:21
|
Goswin von Brederlow <gos...@we...> writes:

> One more thing I forgot.
>
> Does the FD have to be a pipe? Think of an overlay filesystem. Why not
> pass the FD of the underlying file:
>
>     struct fuse_data_request {
>             u64 unique;
>             u64 offset;
>             int fd;
>     };
>
>     static void do_write(...)
>     {
>             struct fuse_data_request fdr;
>             fdr.unique = req->unique;
>             fdr.offset = off;
>             fdr.fd = fi->fh;
>             ioctl(fuse_dev, FUSE_IOCTL_GETDATA, &fdr);
>     }

Right, and observe how that ioctl is almost like a splice() now. The only problem is that the request must be transferred from the device with one syscall; otherwise there's nothing to identify the parts of the message, so they cannot be connected up.

But if we manage to drop that requirement, everything becomes much easier.

So what about introducing a new mode of operation where multiple fuse device fds may be assigned to the filesystem, and when a read is attempted on one of them, the request is assigned to that particular device instance and the rest of the request can be transferred with an arbitrary number of syscalls?

In other words, just drop the thread-safety requirement from the device fd. This is similar to your first proposal, but now there is no "command fd" and "data fd", which would again have the problem of having to connect the pieces of the request up somehow.

I like this because

 a) it doesn't require any special method (ioctls are to be avoided when possible),

 b) it's pretty close to what is currently done, so no heavy modification is needed on either the kernel or the userspace side.

And then, to allow optimizing write() requests, there could be an INIT flag saying that for certain messages (e.g. write, setxattr) only the header is returned in the first read(), while the rest of the request stays connected to that instance.

So the write would just become:

    splice(dev_fd_instance, NULL, pip[1], NULL, arg->size, 0);
    /* data is available from pip[0] */

And at some stage, if it looks like a worthwhile optimization, the fuse device could itself acquire pipe-like properties, so that splicing from it to any destination would become possible. But that's not trivial at all, and I'm not entirely sure it's even worth doing.

The only remaining question is how to create a new device instance. One idea would be:

    new_dev_fd_inst = open("/dev/fuse", ...);
    ioctl(new_dev_fd_inst, FUSE_IOCTL_CONNECT_DEV, orig_dev_fd);

Is there a better way to do this that doesn't involve ioctls?

Thanks,
Miklos
|
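Once the payload sits in a pipe, as in the splice() line above, it still has to be moved to its destination. The sketch below shows that second step with the existing Linux splice(2) call, moving bytes from the read end of a pipe into a file at a given offset. The function name `pipe_to_file` is made up, and an ordinary pipe stands in for the proposed device behaviour, which is only an idea in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move len bytes from the read end of a pipe into
 * out_fd starting at file offset off, without copying the data through a
 * user-space buffer. out_fd must not be opened with O_APPEND.
 * Returns 0 on success, -1 on error or if the pipe runs dry early. */
static int pipe_to_file(int pipe_rd, int out_fd, off_t off, size_t len)
{
    loff_t pos = off; /* splice() advances this as it writes */

    assert(pipe_rd >= 0 && out_fd >= 0 && off >= 0);

    while (len > 0) {
        ssize_t n = splice(pipe_rd, NULL, out_fd, &pos, len, SPLICE_F_MOVE);

        if (n <= 0) /* error, or end of data before len bytes arrived */
            return -1;
        len -= (size_t)n;
    }
    return 0;
}
```

This is the extra syscall Goswin refers to earlier in the thread: one call gets the request into the pipe and another gets the data out again.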
From: Goswin v. B. <gos...@we...> - 2011-12-08 17:30:21
|
Miklos Szeredi <mi...@sz...> writes:

> Right, and observe how that ioctl is almost like a splice() now. The
> only problem is that the request must be transferred from the device
> with one syscall otherwise there's nothing to identify parts of the
> message and so it cannot be connected up.
>
> But if we manage to drop that requirement everything becomes much
> easier.

Requiring that the whole block of data be transferred as one should make things easier. But I wouldn't think that it should be a problem. Options:

1) add offset and size to the ioctl and call it multiple times to get chunks of the payload

2) add an iovec structure like (p)writev uses to allow writing out the payload in fragments.

Probably many more.

> So what about introducing a new mode of operation where there may be
> multiple fuse device fd's assigned to the filesystem and when a read is
> attempted on one of them the request is assigned to that particular
> device instance and the rest of the request can be transferred with an
> arbitrary number of syscalls.

You mean you change nothing in the protocol and libfuse may simply do (in pseudocode):

    read(fuse_fd, req, sizeof(req));
    if (req->cmd == WRITE) {
            splice(fuse_fd, 0, my_pipe, 0, req->len, flags);
            ...
    }

> In other words, just drop the thread safety requirement from the device
> fd. This is similar to your first proposal, but there's now no "command
> fd" and "data fd" which would again have problems with having to connect
> the pieces of the request up somehow.

If you do this with ioctl() then you don't need multiple FDs, nor do you have to give up thread safety. The "unique" in the fuse_data_request would take care of connecting the payload with the request you care about. I think that would require the least amount of change.

> I like this because
>
> a) it doesn't require any special method (ioctl's are to be avoided
> when possible),
>
> b) it's pretty close to what is currently done, so no heavy
> modification is needed on either the kernel or the userspace side.

Not everyone uses the libfuse loop function, and if you have your own multithreaded loop then you need to rewrite it to have one FD per thread. You need to keep backward compatibility, so the feature has to be negotiated on startup. But I think you already have that well in mind. :)

> And then to allow optimizing the write() requests there could be an INIT
> flag saying that for certain messages (e.g. write, setxattr) only return
> the header in the first read() but connect up the rest of the
> request with that instance.
>
> So the write would just become:
>
>     splice(dev_fd_instance, NULL, pip[1], NULL, arg->size, 0);
>     /* data is available from pip[0] */
>
> And at some stage if it looks like a worthwhile optimization the fuse
> device could itself acquire pipe-like properties, so splicing from it to
> any destination would become possible. But that's not trivial at all,
> and not entirely sure it's even worth doing.

I still think it might be worth it to support small inlined writes and large external ones. Your way with ioctl() would make that simple to implement. You tell the kernel the buffer size for request + payload. If a request has more payload than would fit in the buffer, the kernel sends just the request, with a flag set to indicate that the payload has to be retrieved separately via ioctl(). That way small requests (e.g. < 4k payload) would use just a single read() and large requests read() + ioctl().

I'm not sure at what size the break-even point would be, but splicing 200 bytes for a setattr through a pipe can't be faster than memcpy()ing them.

> The only question remaining is how to create a new device instance. One
> idea would be:
>
>     new_dev_fd_inst = open("/dev/fuse", ...);
>     ioctl(new_dev_fd_inst, FUSE_IOCTL_CONNECT_DEV, orig_dev_fd);
>
> Is there a better way to do this not involving ioctls?

Can you catch the dup() syscall to make the new FD a new instance connected to the same fs? Or have a dup()-like ioctl to return a new FD:

    ioctl(new_dev_fd_inst, FUSE_DUP_FD);

MfG
Goswin
|