From: Anand A. <av...@zr...> - 2007-05-25 18:39:47

Miklos,

First of all, thank you SO MUCH for keeping the fuse project in such a good state! I admire the way it is being managed. I am one of the authors of GlusterFS (www.gluster.org), a distributed cluster filesystem which uses FUSE for its client. I have a few questions:

1. Shared writable mmap. I saw your patches for this on linux-kernel on Feb 28th, but I don't see the fixes in the mainline kernel yet. (Are they present in -mm?) Neither do I see them in the fuse CVS (obviously, since they depend on complementary changes in mm/?). When can one expect mmap() to work smoothly over files opened with O_RDWR?

2. fuse_writepages. This, I presume, will help improve write performance for filesystems. Does this mean that if an application does write(fd,buf,131072) then the filesystem gets the entire 128kb, or even that if an application does multiple write(fd,buf,4k) calls they would get aggregated to some extent and the filesystem gets an aggregated chunk? Also, same as the previous question: when will fuse_writepages be available, at least in CVS? Is it available in some other private repository of yours? I'm very anxious for this!

3. Is there any way to give the filesystem access to the inode's i_generation (instead of fuse sending only fuse_ino_t)? In glusterfs, the glusterfs server uses an underlying filesystem for storing files and folders. The glusterfs server can detect inode number recycling (reiserfs does very aggressive inode number recycling), but the glusterfs client (based on fuse) cannot express whether the inode number sent by the fuse kernel module is the latest or an older generation. I also understand that the generation sent by the filesystem to the fuse kernel module is currently not being used; correct me if I'm wrong.

4. The readahead and channel size peak at 128kb. By changing a few things in the kernel module (fuse_conn->bdi.ra_pages) I was able to increase this to a bigger readahead value (of course by increasing the channel size too). Why is there a 128kb hard limit? What is the side effect of having ra_pages beyond VM_MAX_READAHEAD pages? Can this lead to any issues?

5. How can a fuse based filesystem detect, during the read() and write() operations, whether the application did an fcntl(fd,F_SETFL, O_DIRECT) on an fd which was not opened with O_DIRECT? It would be convenient if fi.flags were set with direct_io for all file based operations.

6. (Unrelated to fuse) Is the VM's readahead careful enough not to read ahead into locked regions (and hence cause a wrong 'block' in case mandatory locks are enabled)? Taking this to the next level, is there a trick for a fuse based filesystem to detect which part of a read() request belongs to a valid application request and how much comes from the VM's readahead logic (to ensure it doesn't accidentally and wrongly get blocked in a mandatory locked region held by another client machine)? I understand that disabling the kernel's readahead is one way around this, but the kernel's readahead tremendously increases performance by drastically reducing context switches.

thanks!
avati

--
Anand V. Avati

From: Miklos S. <mi...@sz...> - 2007-05-28 13:28:48

> 1. Shared writable mmap. I saw your patches for this on linux-kernel on Feb 28th, but I don't see the fixes in the mainline kernel yet. (Are they present in -mm?)

No.

> Neither do I see them in the fuse CVS (obviously, since they depend on complementary changes in mm/?).

Yes, it depends on this:

  http://lkml.org/lkml/2007/5/10/161

which is not in -mm yet.

> When can one expect mmap() to work smoothly over files opened with O_RDWR?

Are there specific applications which you know to be broken because of the lack of writable mmap? Or is it just an item you want to check off?

> 2. fuse_writepages. This, I presume, will help improve write performance for filesystems. Does this mean that if an application does write(fd,buf,131072) then the filesystem gets the entire 128kb, or even that if an application does multiple write(fd,buf,4k) calls they would get aggregated to some extent and the filesystem gets an aggregated chunk?

No, unfortunately that doesn't work with fuse. The reason is that all file attributes, including size and modification time, are accounted in the userspace filesystem, not in the kernel. You can imagine the confusion this could cause if writes are not immediately propagated to userspace.

> Also, same as the previous question: when will fuse_writepages be available, at least in CVS? Is it available in some other private repository of yours? I'm very anxious for this!

Why?

> 3. Is there any way to give the filesystem access to the inode's i_generation (instead of fuse sending only fuse_ino_t)? In glusterfs, the glusterfs server uses an underlying filesystem for storing files and folders. The glusterfs server can detect inode number recycling (reiserfs does very aggressive inode number recycling), but the glusterfs client (based on fuse) cannot express whether the inode number sent by the fuse kernel module is the latest or an older generation. I also understand that the generation sent by the filesystem to the fuse kernel module is currently not being used; correct me if I'm wrong.

It's not quite clear to me what your requirements are wrt. the generation number.

In the low level library interface, ->lookup(), ->mknod(), etc. can supply a generation number to the kernel, which is used to set i_generation. That value is not actually used by the VFS, only for NFS exporting, which is only supported by the out-of-tree fuse module.

> 4. The readahead and channel size peak at 128kb. By changing a few things in the kernel module (fuse_conn->bdi.ra_pages) I was able to increase this to a bigger readahead value (of course by increasing the channel size too). Why is there a 128kb hard limit? What is the side effect of having ra_pages beyond VM_MAX_READAHEAD pages? Can this lead to any issues?

I don't really know. Actually there's some current effort to improve the readahead algorithm, so you may be better off asking on linux-fsdevel and/or Fengguang Wu in particular.

> 5. How can a fuse based filesystem detect, during the read() and write() operations, whether the application did an fcntl(fd,F_SETFL, O_DIRECT) on an fd which was not opened with O_DIRECT? It would be convenient if fi.flags were set with direct_io for all file based operations.

fcntl(O_DIRECT) should return EINVAL. If not, then that's a bug. O_DIRECT is not supported by FUSE, and I don't see any reason to add support. Don't confuse this with fuse_file_info->direct_io.

> 6. (Unrelated to fuse) Is the VM's readahead careful enough not to read ahead into locked regions (and hence cause a wrong 'block' in case mandatory locks are enabled)? Taking this to the next level, is there a trick for a fuse based filesystem to detect which part of a read() request belongs to a valid application request and how much comes from the VM's readahead logic (to ensure it doesn't accidentally and wrongly get blocked in a mandatory locked region held by another client machine)? I understand that disabling the kernel's readahead is one way around this, but the kernel's readahead tremendously increases performance by drastically reducing context switches.

Hmm, interesting question. ->readpages() doesn't receive any hint about which part of the read is from the application and which is speculative.

Possibly the right solution would be to disable readahead for those files which currently have mandatory locks. But this is not currently possible either.

Miklos

From: Anand A. <av...@zr...> - 2007-05-28 14:04:10

> > When can one expect mmap() to work smoothly over files opened with O_RDWR?
>
> Are there specific applications which you know to be broken because of the lack of writable mmap? Or is it just an item you want to check off?

apt-get upgrade is unhappy, since it mmaps a file immediately after a seek+write (to 'set' the file size to an N byte hole which is then mmapped and filled).

> > 2. fuse_writepages. This, I presume, will help improve write performance for filesystems. Does this mean that if an application does write(fd,buf,131072) then the filesystem gets the entire 128kb, or even that if an application does multiple write(fd,buf,4k) calls they would get aggregated to some extent and the filesystem gets an aggregated chunk?
>
> No, unfortunately that doesn't work with fuse. The reason is that all file attributes, including size and modification time, are accounted in the userspace filesystem, not in the kernel. You can imagine the confusion this could cause if writes are not immediately propagated to userspace.
>
> > Also, same as the previous question: when will fuse_writepages be available, at least in CVS? Is it available in some other private repository of yours? I'm very anxious for this!
>
> Why?

Because currently a 128kb write() arrives as 32 4kb write()s. I'm assuming that with writepages a 128kb write() will result in a 128kb write() to userspace as well. This is purely for the sake of performance; it increases the performance multifold.

> > 3. Is there any way to give the filesystem access to the inode's i_generation (instead of fuse sending only fuse_ino_t)? In glusterfs, the glusterfs server uses an underlying filesystem for storing files and folders. The glusterfs server can detect inode number recycling (reiserfs does very aggressive inode number recycling), but the glusterfs client (based on fuse) cannot express whether the inode number sent by the fuse kernel module is the latest or an older generation. I also understand that the generation sent by the filesystem to the fuse kernel module is currently not being used; correct me if I'm wrong.
>
> It's not quite clear to me what your requirements are wrt. the generation number.
>
> In the low level library interface, ->lookup(), ->mknod(), etc. can supply a generation number to the kernel, which is used to set i_generation. That value is not actually used by the VFS, only for NFS exporting, which is only supported by the out-of-tree fuse module.

The idea is similar to NFS. In the case you mention, fuse is in the NFS server. I am talking about the situation where fuse is on the client, the server is a different machine, and the server's inode numbers are directly mapped to fuse. If some other application deletes the file and creates a new one (recycling the inode number), then the fuse client is clueless. Well, the right way is for the filesystem to somehow detect that the inode got reused, since managing inodes is the filesystem's responsibility, but if fuse were to do it, I would get a chance to be lazy ;)

thanks,
avati

--
Anand V. Avati

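The "32 4kb write()s" figure above can be reproduced with a small model. This is only a sketch: it assumes 4 KiB pages and assumes each request to userspace is cut at page-aligned boundaries; `split_write` and `PAGE_SIZE` are hypothetical names, not part of the FUSE API.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, matching the "32 4kb write()s" figure


def split_write(offset, length, max_request=PAGE_SIZE):
    """Return the (offset, size) requests one write() is split into.

    Model assumption: a request carries at most max_request bytes and never
    crosses a max_request-aligned boundary in the file.
    """
    requests = []
    end = offset + length
    while offset < end:
        # Stop at the next aligned boundary, or at the end of the write.
        boundary = (offset // max_request + 1) * max_request
        size = min(boundary, end) - offset
        requests.append((offset, size))
        offset += size
    return requests


# A 128kb write with page-sized requests versus one large request.
per_page = split_write(0, 131072)
single = split_write(0, 131072, max_request=131072)
print(len(per_page), len(single))
```

In this model the cost of the page-sized path is the number of round trips to userspace per write() call, which is what the performance complaint is about.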
From: Miklos S. <mi...@sz...> - 2007-05-28 14:33:45

> > > 2. fuse_writepages. This, I presume, will help improve write performance for filesystems. Does this mean that if an application does write(fd,buf,131072) then the filesystem gets the entire 128kb, or even that if an application does multiple write(fd,buf,4k) calls they would get aggregated to some extent and the filesystem gets an aggregated chunk?
> >
> > No, unfortunately that doesn't work with fuse. The reason is that all file attributes, including size and modification time, are accounted in the userspace filesystem, not in the kernel. You can imagine the confusion this could cause if writes are not immediately propagated to userspace.
> >
> > > Also, same as the previous question: when will fuse_writepages be available, at least in CVS? Is it available in some other private repository of yours? I'm very anxious for this!
> >
> > Why?
>
> Because currently a 128kb write() arrives as 32 4kb write()s. I'm assuming that with writepages a 128kb write() will result in a 128kb write() to userspace as well. This is purely for the sake of performance; it increases the performance multifold.

Getting a 128kb write request from a 128kb write() is possible in theory (it doesn't have the same problems as aggregating multiple write() calls). But it also requires some new kernel infrastructure. The "new aops" patchset currently being submitted will provide a base for that infrastructure, which will hopefully follow shortly after this patchset is accepted.

> > > 3. Is there any way to give the filesystem access to the inode's i_generation (instead of fuse sending only fuse_ino_t)? In glusterfs, the glusterfs server uses an underlying filesystem for storing files and folders. The glusterfs server can detect inode number recycling (reiserfs does very aggressive inode number recycling), but the glusterfs client (based on fuse) cannot express whether the inode number sent by the fuse kernel module is the latest or an older generation. I also understand that the generation sent by the filesystem to the fuse kernel module is currently not being used; correct me if I'm wrong.
> >
> > It's not quite clear to me what your requirements are wrt. the generation number.
> >
> > In the low level library interface, ->lookup(), ->mknod(), etc. can supply a generation number to the kernel, which is used to set i_generation. That value is not actually used by the VFS, only for NFS exporting, which is only supported by the out-of-tree fuse module.
>
> The idea is similar to NFS. In the case you mention, fuse is in the NFS server. I am talking about the situation where fuse is on the client, the server is a different machine, and the server's inode numbers are directly mapped to fuse. If some other application deletes the file and creates a new one (recycling the inode number), then the fuse client is clueless. Well, the right way is for the filesystem to somehow detect that the inode got reused, since managing inodes is the filesystem's responsibility, but if fuse were to do it, I would get a chance to be lazy ;)

So how should fuse help?

Miklos

From: Anand A. <av...@zr...> - 2007-05-28 15:59:31

> > The idea is similar to NFS. In the case you mention, fuse is in the NFS server. I am talking about the situation where fuse is on the client, the server is a different machine, and the server's inode numbers are directly mapped to fuse. If some other application deletes the file and creates a new one (recycling the inode number), then the fuse client is clueless. Well, the right way is for the filesystem to somehow detect that the inode got reused, since managing inodes is the filesystem's responsibility, but if fuse were to do it, I would get a chance to be lazy ;)
>
> So how should fuse help?

Consider this sequence of events. client1 is a client machine which has a fuse client. client2 is another machine with a fuse client. server is a common server on which both client1 and client2 mount a fuse based shared network filesystem.

client1:lookup request file1 ==> server
    <= lookup reply file1, inode=1000, generation=1

client2:rm file1 ==> server
    <= success

client2:create file2 ==> server
    <= FD, inode=1000 (got re-used)

client1:lookup request file2 ==> server
    <= lookup reply file2, inode=1000, generation=2

client1:open inode=1000

At this point, if the client were behaving correctly (like an NFS client), it would specify the generation as well, and the filesystem would know exactly which inode the call refers to, even though both use the same inode number. If it was the old generation, the filesystem could handle it correctly by failing the call; if the generation number was 2, it could let the call go through.

One argument is that it is the filesystem's responsibility to ensure the inode number is not recycled. But if fuse is being used as a shared network filesystem, there is another argument.

Just like NFS, the client would 'remember' an inode and try to refer to it again. The server needs a way to know whether the client is referring to the deleted inode or the new inode. Since fuse uses the generation mechanism to work correctly as an NFS server, it also needs to send back the generation number to work 'like' an NFS client.

The amount of state the server needs to maintain to know which client has looked up which inode number (so as not to reuse it) is extremely expensive, and if fuse were just to send back the generation number it would be of great help.

Also, it is hard to assume that whenever an access on an inode comes in, it always refers to the generation replied latest, because the request could already have been queued before the lookup reply with the new generation was sent. Therefore there is currently no way for an FS to know which inode generation a call is referring to.

Does this argument hold for reasoning that fuse should send the generation number to the filesystem on every call?

thanks,
avati

--
Anand V. Avati

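The server-side check argued for in this message can be sketched as a toy model. It assumes requests carry an (inode, generation) pair and that a stale pair is rejected with ESTALE, as an NFS server would do; the `Server` class and its methods are hypothetical and are not GlusterFS or FUSE code.

```python
import errno


class Server:
    """Toy server that recycles inode numbers and bumps a generation on reuse."""

    def __init__(self):
        self.ino_by_name = {}   # file name -> inode number
        self.generation = {}    # inode number -> generation of its latest file
        self.free_inos = []     # inode numbers available for recycling
        self.next_ino = 1000

    def create(self, name):
        if self.free_inos:
            ino = self.free_inos.pop()      # aggressive recycling
        else:
            ino = self.next_ino
            self.next_ino += 1
        self.generation[ino] = self.generation.get(ino, 0) + 1
        self.ino_by_name[name] = ino
        return ino, self.generation[ino]

    def lookup(self, name):
        ino = self.ino_by_name[name]
        return ino, self.generation[ino]

    def unlink(self, name):
        self.free_inos.append(self.ino_by_name.pop(name))

    def open(self, ino, gen):
        # Reject a handle whose inode is gone or whose generation is outdated.
        in_use = ino in self.ino_by_name.values()
        if not in_use or self.generation.get(ino) != gen:
            return -errno.ESTALE
        return 0


server = Server()
server.create("file1")
old_handle = server.lookup("file1")   # client1 sees inode=1000, generation=1
server.unlink("file1")                # client2: rm file1
server.create("file2")                # client2: inode 1000 reused, generation 2
new_handle = server.lookup("file2")   # client1 sees inode=1000, generation=2
stale = server.open(*old_handle)
fresh = server.open(*new_handle)
```

With the generation in every request the server needs no per-client state: one integer per inode is enough to tell the deleted file from its replacement.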
From: Miklos S. <mi...@sz...> - 2007-05-31 17:54:29

> Consider this sequence of events. client1 is a client machine which has a fuse client. client2 is another machine with a fuse client. server is a common server on which both client1 and client2 mount a fuse based shared network filesystem.
>
> client1:lookup request file1 ==> server
>     <= lookup reply file1, inode=1000, generation=1
>
> client2:rm file1 ==> server
>     <= success
>
> client2:create file2 ==> server
>     <= FD, inode=1000 (got re-used)
>
> client1:lookup request file2 ==> server
>     <= lookup reply file2, inode=1000, generation=2
>
> client1:open inode=1000
>
> At this point, if the client were behaving correctly (like an NFS client), it would specify the generation as well, and the filesystem would know exactly which inode the call refers to, even though both use the same inode number. If it was the old generation, the filesystem could handle it correctly by failing the call; if the generation number was 2, it could let the call go through.
>
> One argument is that it is the filesystem's responsibility to ensure the inode number is not recycled. But if fuse is being used as a shared network filesystem, there is another argument.
>
> Just like NFS, the client would 'remember' an inode and try to refer to it again. The server needs a way to know whether the client is referring to the deleted inode or the new inode. Since fuse uses the generation mechanism to work correctly as an NFS server, it also needs to send back the generation number to work 'like' an NFS client.
>
> The amount of state the server needs to maintain to know which client has looked up which inode number (so as not to reuse it) is extremely expensive, and if fuse were just to send back the generation number it would be of great help.

So how about just remembering the generation in the client, in userspace?

Yes, it's slightly wasteful to store the generation in userspace when the kernel already stores it, but this would have the big advantage of not needing any API modifications.

Miklos

From: Anand A. <av...@zr...> - 2007-06-01 05:31:28

> So how about just remembering the generation in the client, in userspace?
>
> Yes, it's slightly wasteful to store the generation in userspace when the kernel already stores it, but this would have the big advantage of not needing any API modifications.

There are cases where you cannot do this correctly. Consider the following sequence of events.

client1:lookup request file1 ==> server
    <= lookup reply file1, inode=1000, generation=1

client2:rm file1 ==> server
    <= success

client2:create file2 ==> server
    <= FD, inode=1000 (got re-used)

client1:lookup request file2 ==> server
    <= lookup reply file2, inode=1000, generation=2

client1:open inode=1000

In the above case, you do not know whether the last open was posted by the kernel after the inode generation number got updated, or whether the open was already posted before you replied to file2's lookup. Is there a way to detect this? If this situation can be handled, I'm happy to remember and manage the generation number from userspace itself.

thanks,
avati

--
Anand V. Avati

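The ambiguity described here can be illustrated with a small model of the userspace side. It is a sketch under stated assumptions: the client keeps one remembered generation per inode number, requests carry only the inode number, and the "intended" generation is known only to the kernel. `UserspaceClient` and `replay` are hypothetical names.

```python
class UserspaceClient:
    """Remembers the most recent generation replied for each inode number."""

    def __init__(self):
        self.gen_by_ino = {}

    def on_lookup_reply(self, ino, gen):
        self.gen_by_ino[ino] = gen

    def on_open(self, ino):
        # Only the inode number arrives, so the latest generation is a guess.
        return self.gen_by_ino[ino]


def replay(history):
    """Replay events in the order userspace handles them.

    Events are ("lookup_reply", ino, gen) or ("open", ino, intended_gen).
    intended_gen is what the kernel meant; userspace never sees it.
    """
    client = UserspaceClient()
    seen, outcome = [], []
    for event in history:
        if event[0] == "lookup_reply":
            client.on_lookup_reply(event[1], event[2])
        else:
            _, ino, intended = event
            seen.append(("open", ino))                    # visible to userspace
            outcome.append((intended, client.on_open(ino)))
    return seen, outcome


# The open was queued by the kernel before file2's lookup reply (meant gen 1)...
queued_early = [("lookup_reply", 1000, 1), ("lookup_reply", 1000, 2), ("open", 1000, 1)]
# ...or after it (meant gen 2). Userspace handles both in the same order.
queued_late = [("lookup_reply", 1000, 1), ("lookup_reply", 1000, 2), ("open", 1000, 2)]

seen_early, outcome_early = replay(queued_early)
seen_late, outcome_late = replay(queued_late)
```

In both histories userspace observes the same request, so the remembered generation is wrong whenever the open was queued early; nothing in the request lets the filesystem tell the two cases apart.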
From: Erich F. <ef...@gm...> - 2007-06-01 08:38:55

Hello & Szia Miklos,

I've seen a comment from Miklos in this thread in the archive, on which I'd be happy to get more explanation: opening a file with O_DIRECT is not supported in FUSE and there's not much interest in adding it.

What is the reason for that, and why is it different from mounting a FUSE filesystem with -o direct_io? When mounting with -o direct_io we're bypassing the pagecache anyway, AFAIK, so this is quite equivalent to O_DIRECT, isn't it? The only real difference is that data doesn't go directly from the application's user space buffer to the disk, but traverses the kernel with the help of FUSE.

Thanks,
best regards,
Erich