From: sage w. <sa...@ne...> - 2005-07-28 05:35:37
|
Hi all, I've run into trouble with the mmap() versus direct_io thing: I want them both! From what I gather it's not practical to actually do mmap() correctly with a userspace file system without using the VFS page cache (and thus not using direct_io), but I'm hoping it's possible to come up with some sort of compromise. Basically, I need direct_io because I'm working on a distributed file system and need to be able to intelligently manage buffer cache consistency across multiple clients. Because FUSE doesn't let me force write-thru in the VFS page cache, or to force the kernel to flush dirty buffers, or selectively disallow caching on a per-file basis, I need to put my buffer cache in userspace (and turn off the kernel's page cache). At the same time, I need mmap() because I want to be able to use the file system to execute programs and run gcc and other normalish things that break without mmap() (even just read-only mmap()). Although there are apparently some fundamental problems with doing a robust mmap() on a userspace fs "right," I think I only really need read-only mmap() to get by, and would be willing to cut corners with consistency. Like, if mmap() used the page cache, but regular file I/O didn't (i.e. direct_io excluded mmap()), that would probably be fine. In one thread someone suggested just disabling the DIRECT_IO flag check on mmap() might work, but it didn't seem to do the trick for me (w/ 2.3.0). Also, now that I think about it, I remember somebody mentioning that you can somehow enable/disable direct_io on a per-file basis... is that true? If so, is there a way to tell from the userspace side of things whether a file is being mmap()'d or not (and thus whether we can safely enable direct_io)? It doesn't seem like it'd be that complicated to do some sort of usually_direct_io type mode that still allowed mmap(), but I'm not familiar enough with the FUSE stuff to really know. Is it possible? Difficult? 
Or should I be approaching this from some other direction? Thanks so much! sage |
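For context, the kind of mapping that program loaders and tools like gcc depend on is an ordinary read-only, private mmap(). Below is a minimal userspace sketch of such a call. `map_readonly` is a hypothetical helper, not part of FUSE. In the FUSE 2.3.0 source quoted later in this thread, fuse_file_mmap() returns -ENODEV for this kind of call when the filesystem is mounted with direct_io.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only and private, the way a
 * program loader maps text pages. Returns NULL on failure (including an
 * empty file, which cannot be mapped) and stores the length in *len. */
static void *map_readonly(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    /* PROT_READ + MAP_PRIVATE: pages are only ever read, never written back. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)st.st_size;
    return p;
}
```

Only the read side of the page cache is exercised here. That is why a read-only compromise is enough for executing programs.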
From: Miklos S. <mi...@sz...> - 2005-07-28 11:40:07
|
> I've run into trouble with the mmap() versus direct_io thing: I want them > both! From what I gather it's not practical to actually do mmap() > correctly with a userspace file system without using the VFS page cache > (and thus not using direct_io), but I'm hoping it's possible to come up > with some sort of compromise. > > Basically, I need direct_io because I'm working on a distributed file > system and need to be able to intelligently manage buffer cache > consistency across multiple clients. Because FUSE doesn't let me force > write-thru in the VFS page cache, or to force the kernel to flush dirty > buffers, or selectively disallow caching on a per-file basis, I need to > put my buffer cache in userspace (and turn off the kernel's page cache). FUSE always does write-through. No dirty buffers ever accumulate. > At the same time, I need mmap() because I want to be able to use the file > system to execute programs and run gcc and other normalish things that > break without mmap() (even just read-only mmap()). > > Although there are apparently some fundamental problems with doing a > robust mmap() on a userspace fs "right," I think I only really need > read-only mmap() to get by, and would be willing to cut corners with > consistency. Like, if mmap() used the page cache, but regular file I/O > didn't (i.e. direct_io excluded mmap()), that would probably be fine. In > one thread someone suggested just disabling the DIRECT_IO flag check on > mmap() might work, but it didn't seem to do the trick for me (w/ 2.3.0). > > Also, now that I think about it, I remember somebody mentioning that you > can somehow enable/disable direct_io on a per-file basis... is that true? That is the plan. Next release will have this. > If so, is there a way to tell from the userspace side of things whether a > file is being mmap()'d or not (and thus whether we can safely enable > direct_io)? 
> > It doesn't seem like it'd be that complicated to do some sort of > usually_direct_io type mode that still allowed mmap(), but I'm not > familiar enough with the FUSE stuff to really know. Is it possible? > Difficult? Or should I be approaching this from some other direction? I think you should. Do you know when the cache needs to be invalidated? Currently you can do that by opening and closing the file you want the cache to be purged for. Miklos |
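Here is a minimal sketch of the open-and-close trick Miklos describes. It assumes, as he says, that opening a file through the mount is what makes FUSE drop that file's cached pages. `kick_page_cache` is a hypothetical name.

The path must be the one under the FUSE mount point. Also note a deadlock risk: if the filesystem daemon itself calls this while running single-threaded, the resulting OPEN request would wait on the very thread that issued it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: open and immediately close `path`, relying on the
 * behaviour described above (cache purged on open). Returns 0 on success
 * or a negative errno value. */
static int kick_page_cache(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;
    if (close(fd) < 0)
        return -errno;
    return 0;
}
```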
From: sage w. <sa...@ne...> - 2005-07-28 20:04:44
|
On Thu, 28 Jul 2005, Miklos Szeredi wrote: > FUSE always does write-through. No dirty buffers ever accumulate. Ok, that avoids half the problem... >> Also, now that I think about it, I remember somebody mentioning that you >> can somehow enable/disable direct_io on a per-file basis... is that true? > > That is the plan. Next release will have this. Okay. Although in order for this to fully solve my problem there'd need to be a way to tell if a given file is being mmap()'d, and to selectively disable it. Doesn't sound very elegant... > I think you should. Do you know when the cache needs to be > invalidated? Currently you can do that by opening and closing the file > you want the cache to be purged for. Are you suggesting that the FUSE user process open and close the file to kick the kernel? And that would really flush pages even though another process has the file open the whole time? It's not so much that I need to periodically purge all pages, it's that I need to force all reads to be synchronous for some indefinite period. When processes on different nodes have a file open for both reading and writing, all reads and writes have to go to the server to get correct behavior. I suppose after every read operation completes I could kick the kernel into purging pages...? sage |
From: Miklos S. <mi...@sz...> - 2005-07-28 20:27:42
|
> >> Also, now that I think about it, I remember somebody mentioning that you > >> can somehow enable/disable direct_io on a per-file basis... is that true? > > > > That is the plan. Next release will have this. > > Okay. Although in order for this to fully solve my problem there'd need > to be a way to tell if a given file is being mmap()'d, and to selectively > disable it. Doesn't sound very elegant... No, it doesn't. > > I think you should. Do you know when the cache needs to be > > invalidated? Currently you can do that by opening and closing the file > > you want the cache to be purged for. > > Are you suggesting that the FUSE user process open and close the file to > kick the kernel? And that would really flush pages even though another > process has the file open the whole time? > > It's not so much that I need to periodically purge all pages, it's that I > need to force all reads to be synchronous for some indefinite period. > When processes on different nodes have a file open for both reading and > writing, all reads and writes have to go to the server to get correct > behavior. I suppose after every read operation completes I could kick > the kernel into purging pages...? No, what I meant was that you could have a file change notification from the server to all clients having the file open for reading. Then these clients could do a cache flush. Wouldn't that work? Miklos |
From: sage w. <sa...@ne...> - 2005-07-28 20:55:26
|
On Thu, 28 Jul 2005, Miklos Szeredi wrote: > No, what I meant, was you could have a file change notification from > the server to all clients having the file open for reading. Then > these clients could do a cache flush. Wouldn't that work? Almost, a 'file change' notification won't work because of the delay. During that period new data may have been written but the reader may still be using cached data from the page cache. To get real consistency, caches need to be purged and caching disabled _before_ writing starts, and then caching needs to stay disabled (with reads and writes synchronous) until there's no longer a mix of readers/writers. Even if my FUSE module could say "purge pages, now!", it would need to do that after every read in order to effectively disable caching. direct_io is a more graceful way to accomplish that, but then I lose mmap(). I'm willing to fudge the consistency to make mmap() work, since in practice modification of files that are being executed doesn't really happen. But I want proper consistency the rest of the time. So actually, if the per-file direct_io in the next FUSE version will let you turn on/off direct_io for open files at will (i.e. at any random point after the file is already open, via some callback mechanism, upon revocation of caching capability by the server), then I think that would solve my problem--that's exactly what the userspace buffer cache is currently doing. mmap() would work normally unless for some reason another node tried to write to the file and I have to enable direct_io on the file... which shouldn't happen under normal workloads. Is that how the per-file direct_io thing is going to work? Via a callback of some sort? sage |
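The revocation rule sage describes can be stated as a small predicate. This is a sketch only: `struct node_open` and `may_cache` are hypothetical, not part of FUSE. The sketch also treats two writers on different nodes the same as a reader/writer mix, which is an assumption beyond the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node view of who has a file open, as a server might track it. */
struct node_open {
    int  node_id;
    bool reading;
    bool writing;
};

/* Caching is safe unless some node writes while a *different* node also
 * has the file open (for reading or writing). */
static bool may_cache(const struct node_open *opens, size_t n)
{
    assert(opens != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!opens[i].writing)
            continue;
        for (size_t j = 0; j < n; j++)
            if (opens[j].node_id != opens[i].node_id &&
                (opens[j].reading || opens[j].writing))
                return false;
    }
    return true;
}
```

Readers on many nodes, or any mix confined to one node, keep caching enabled. The first cross-node writer revokes it.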
From: Miklos S. <mi...@sz...> - 2005-07-29 08:49:12
|
> Almost, a 'file change' notification won't work because of the delay. > During that period new data may have been written but the reader may still > be using cached data from the page cache. To get real consistency, caches > need to be purged and caching disabled _before_ writing starts, and then > caching needs to stay disabled (with reads and writes synchronous) until > there's no longer a mix of readers/writers. I don't think that disabling the cache gives you any more consistency guarantees in "time topology". So it's a quantitative improvement over cache flushing, rather than a qualitative. Am I missing something? > Even if my FUSE module could say "purge pages, now!", it would need to do > that after every read in order to effectively disable caching. direct_io > is a more graceful way to accomplish that, but then I lose mmap(). I'm > willing to fudge the consistency to make mmap() work, since in practice > modification of files that are being executed doesn't really happen. But > I want proper consistency the rest of the time. > > So actually, if the per-file direct_io in the next FUSE version will let > you turn on/off direct_io for open files at will (i.e. at any random point > after the file is already open, via some callback mechanism, upon > revocation of caching capability by the server), then I think that would > solve my problem--that's exactly what the userspace buffer cache is > currently doing. mmap() would work normally unless for some reason > another node tried to write to the file and I have to enable direct_io on > the file... which shouldn't happen under normal workloads. > > Is that how the per-file direct_io thing is going to work? Via a > callback of some sort? No. It will be a flag returned from the OPEN request. So you can't change it while the file is open. Miklos |
From: sage w. <sa...@ne...> - 2005-07-29 16:08:33
|
On Fri, 29 Jul 2005, Miklos Szeredi wrote: >> Almost, a 'file change' notification won't work because of the delay. >> During that period new data may have been written but the reader may still >> be using cached data from the page cache. To get real consistency, caches >> need to be purged and caching disabled _before_ writing starts, and then >> caching needs to stay disabled (with reads and writes synchronous) until >> there's no longer a mix of readers/writers. > > I don't think that disabling the cache gives you any more consistency > guarantees in "time topology". So it's a quantitative improvement > over cache flushing, rather than a qualitative. Am I missing > something? It's easiest to see if you consider an outside communications channel (although in reality any metadata operations are "outside" because they don't involve the page cache). Say client1 and client2 both have a file open. Client1 writes something, and then tells client2 he's done. Client2 reads it. If caching is enabled, client2 may read old or new data, depending on the relative speeds of network links, how quickly 'file change' messages are processed, etc. The only way (well, simplest way) to get correct behavior is to disable caching when there is a mix of readers and writers (on different nodes). (Or make the server wait until all caches are invalidated before acknowledging the write, but that's abysmally slow.) That guarantees that a read that begins after a write completed will return correct data, which gives you the same behavior you expect with two processes on the same machine. (If the read/write calls overlap it's still ambiguous--also what you expect with POSIX.) >> Is that how the per-file direct_io thing is going to work? Via a >> callback of some sort? > > No. It will be a flag returned from the OPEN request. So you can't > change it while the file is open. Ok. 
In that case, I think the easiest approach might be to try to tweak FUSE's mmap() so that it will still work well enough to keep most users happy when (global) direct_io is enabled. Somebody mentioned that simply taking out the DIRECT_IO check in the FUSE mmap function might do the trick, but that didn't seem to work for me. Do you think it will be much more complicated than that? Or can you point me toward the relevant functions? It's not ideal, but ultimately being able to correctly manage consistency for most files (ones that aren't mmap()'d) is good enough! Thanks so much! sage |
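The client1/client2 scenario in sage's message can be modelled in a few lines. This is a toy model, not FUSE code: `struct server`, `struct client`, `client_write` and `client_read` are hypothetical, and the "cache" stands in for the kernel page cache on client2's node. With caching on, client2 keeps returning the old bytes until some invalidation arrives. With caching off (the direct_io case), a read issued after the completed write sees the new data.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model: one server copy of a small file and one client-side cache. */
struct server { char data[16]; };

struct client {
    bool use_cache;      /* false models direct_io: every read goes to the server */
    bool cache_valid;
    char cache[16];
};

/* Write-through: the server sees the new data as soon as the write completes. */
static void client_write(struct server *s, const char *buf)
{
    assert(s != NULL && buf != NULL);
    strncpy(s->data, buf, sizeof s->data - 1);
    s->data[sizeof s->data - 1] = '\0';
}

static const char *client_read(struct client *c, const struct server *s)
{
    if (!c->use_cache)
        return s->data;                 /* synchronous read from the server */
    if (!c->cache_valid) {
        memcpy(c->cache, s->data, sizeof c->cache);
        c->cache_valid = true;
    }
    return c->cache;                    /* stale until someone invalidates it */
}
```

Usage: client2 reads "old" and caches it. Client1 then writes "new" and tells client2 out of band. A caching client2 still reads "old", while a non-caching client reads "new".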
From: Miklos S. <mi...@sz...> - 2005-07-29 17:03:00
|
> >> Almost, a 'file change' notification won't work because of the delay. > >> During that period new data may have been written but the reader may still > >> be using cached data from the page cache. To get real consistency, caches > >> need to be purged and caching disabled _before_ writing starts, and then > >> caching needs to stay disabled (with reads and writes synchronous) until > >> there's no longer a mix of readers/writers. > > > > I don't think that disabling the cache gives you any more consistency > > guarantees in "time topology". So it's a quantitative improvement > > over cache flushing, rather than a qualitative. Am I missing > > something? > > It's easiest to see if you consider an outside communications channel > (although in reality any metadata operations are "outside" because they > don't involve the page cache). Say client1 and client2 both have a file > open. Client1 writes something, and then tells client2 he's done. > Client2 reads it. If caching is enabled, client2 may read old or new > data, depending on the relative speeds of network links, how quickly 'file > change' messages are processed, etc. The only way (well, simplest way) to > get correct behavior is to disable caching when there is a mix of readers > and writers (on different nodes). (Or make the server wait until all > caches are invalidated before acknowledging the write, but that's > abysmally slow.) That guarantees that a read that begins after a write > completed will return correct data, which gives you the same behavior you > expect with two processes on the same machine. (If the read/write calls > overlap it's still ambiguous--also what you expect with POSIX.) OK, I see the problem better now. > >> Is that how the per-file direct_io thing is going to work? Via a > >> callback of some sort? > > > > No. It will be a flag returned from the OPEN request. So you can't > > change it while the file is open. > > Ok. 
In that case, I think the easiest approach might be to try to tweak > FUSE's mmap() so that it will still work well enough to keep most users > happy when (global) direct_io is enabled. Somebody mentioned that simply > taking out the DIRECT_IO check in the FUSE mmap function might do the > trick, but that didn't seem to work for me. Do you think it will be much > more complicated than that? Or can you point me toward the relevant > functions? Well, removing the check _should_ work. I can't see why it doesn't. There's no other difference between the two modes of operation in the mmap path. > It's not ideal, but ultimately being able to correctly manage consistency > for most files (ones that aren't mmap()'d) is good enough! It would be nice to keep the local consistency that the page cache gives you for memory maps. A "read-through" mode would do it nicely. Not sure if it's worth the effort though. Miklos |
From: sage w. <sa...@ne...> - 2005-07-29 17:25:04
|
On Fri, 29 Jul 2005, Miklos Szeredi wrote: > Well, removing the check _should_ work. I can't see why it doesn't. > There's no other difference between the two modes of operation in the > mmap path. I commented out the DIRECT_IO check in fuse_file_mmap() in kernel/file.c (that's the check you mean, right?). Here's what I see when I try to execute something:

unique: 6483, opcode: LOOKUP (1), nodeid: 1, insize: 48
LOOKUP /fakesyn
NODEID: 2
unique: 6483, error: 0 (Success), outsize: 136
unique: 6484, opcode: OPEN (14), nodeid: 2, insize: 48
OPEN[11] flags: 0x0
unique: 6484, error: 0 (Success), outsize: 32
unique: 6485, opcode: RELEASE (18), nodeid: 2, insize: 56
RELEASE[11] flags: 0x0
unique: 6486, opcode: OPEN (14), nodeid: 2, insize: 48
OPEN[12] flags: 0x8000
unique: 6486, error: 0 (Success), outsize: 32
unique: 6487, opcode: READ (15), nodeid: 2, insize: 64
READ[12] 80 bytes from 0
READ[12] 80 bytes
unique: 6487, error: 0 (Success), outsize: 96
unique: 6488, opcode: RELEASE (18), nodeid: 2, insize: 56
RELEASE[12] flags: 0x8000
unique: 6485, error: 0 (Success), outsize: 16
unique: 6488, error: 0 (Success), outsize: 16

and in my shell,

# mnt/tester
-bash: mnt/tester: Bad address
# strace -f mnt/tester
strace: exec: Bad address
execve("mnt/tester", ["mnt/tester"], [/* 25 vars */]) = 0

I thought the slow return of the first close() might be the issue, but I get basically the same thing with -s:

unique: 6484, opcode: OPEN (14), nodeid: 2, insize: 48
OPEN[11] flags: 0x0
unique: 6484, error: 0 (Success), outsize: 32
unique: 6485, opcode: RELEASE (18), nodeid: 2, insize: 56
RELEASE[11] flags: 0x0
unique: 6485, error: 0 (Success), outsize: 16
unique: 6486, opcode: OPEN (14), nodeid: 2, insize: 48
OPEN[12] flags: 0x8000
unique: 6486, error: 0 (Success), outsize: 32
unique: 6487, opcode: READ (15), nodeid: 2, insize: 64
READ[12] 80 bytes from 0
READ[12] 80 bytes
unique: 6487, error: 0 (Success), outsize: 96
unique: 6488, opcode: RELEASE (18), nodeid: 2, insize: 56
RELEASE[12] flags: 0x8000
unique: 6488, error: 0 (Success), outsize: 16

sage |
From: Miklos S. <mi...@sz...> - 2005-07-29 18:47:18
|
> I commented out the DIRECT_IO check in fuse_file_mmap() in kernel/file.c > (that's the check you mean, right?). Here's what I see when I try to > execute something: Did you do 'rmmod fuse; modprobe fuse'? Miklos |
From: sage w. <sa...@ne...> - 2005-07-29 19:01:18
|
On Fri, 29 Jul 2005, Miklos Szeredi wrote: >> I commented out the DIRECT_IO check in fuse_file_mmap() in kernel/file.c >> (that's the check you mean, right?). Here's what I see when I try to >> execute something: > > Did you do 'rmmod fuse; modprobe fuse'? Yeah. Actually, after putting in some printk's, it looks like fuse_file_mmap() isn't being called at all when direct_io is enabled (nothing printed). With direct_io off, I get 1 2 3 (as expected).

static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct inode *inode = file->f_dentry->d_inode;
	struct fuse_conn *fc = get_fuse_conn(inode);

	printk("mmap 1\n");
	//if (fc->flags & FUSE_DIRECT_IO)
	//	return -ENODEV;
	printk("mmap 2\n");
	if ((vma->vm_flags & VM_SHARED)) {
		if ((vma->vm_flags & VM_WRITE)) {
			/* print before returning, otherwise this is unreachable */
			printk("flags & VM_WRITE\n");
			return -ENODEV;
		} else
			vma->vm_flags &= ~VM_MAYWRITE;
	}
	printk("mmap 3\n");
	return generic_file_mmap(file, vma);
}
|
From: Miklos S. <mi...@sz...> - 2005-07-29 19:16:39
|
> > > > Did you do 'rmmod fuse; modprobe fuse'? > > Yeah. Actually, after putting in some printk's, it looks like > fuse_file_mmap() isn't being called at all when direct_io is enabled > (nothing printed). With direct_io off, I get 1 2 3 (as expected). And which version of FUSE is it? In CVS this has changed, but then the if(FUSE_DIRECT_IO) thing wouldn't be there. I'm totally confused. Miklos |
From: sage w. <sa...@ne...> - 2005-07-29 19:24:05
|
On Fri, 29 Jul 2005, Miklos Szeredi wrote: >>> Did you do 'rmmod fuse; modprobe fuse'? >> >> Yeah. Actually, after putting in some printk's, it looks like >> fuse_file_mmap() isn't being called at all when direct_io is enabled >> (nothing printed). With direct_io off, I get 1 2 3 (as expected). > > And which version of FUSE is it? In CVS this has changed, but then > the if(FUSE_DIRECT_IO) thing wouldn't be there. > > I'm totally confused.
2.3.0.

vapre:fuse-2.3.0 12:19 PM $ grep -rn DIRECT_IO .
./kernel/fuse_i.h:102:#define FUSE_DIRECT_IO (1 << 3)
./kernel/file.c:592: if (fc->flags & FUSE_DIRECT_IO)
./kernel/file.c:618: if (fc->flags & FUSE_DIRECT_IO) {
./kernel/file.c:636: //if (fc->flags & FUSE_DIRECT_IO)
./kernel/file.c:656: if (fc->flags & FUSE_DIRECT_IO)
./kernel/inode.c:328: OPT_DIRECT_IO,
./kernel/inode.c:344: {OPT_DIRECT_IO, "direct_io"},
./kernel/inode.c:406: case OPT_DIRECT_IO:
./kernel/inode.c:407: d->flags |= FUSE_DIRECT_IO;
./kernel/inode.c:442: if (fc->flags & FUSE_DIRECT_IO)

Maybe the problem is that fuse_file_read does the direct_io read instead of generic_file_read (line ~592), and subsequently doesn't populate the page cache? Should I try pulling the latest from CVS instead? sage |
From: Miklos S. <mi...@sz...> - 2005-07-30 09:26:10
|
> Maybe the problem is that fuse_file_read does the direct_io read instead > of generic_file_read (line ~592), and subsequently doesn't populate the > page cache? No, mmap (or rather memory access after mmap) will populate the page cache via the address_space_operations::readpage() method. Finally got over my extreme laziness and actually tried out this thing. And to my utter amazement, it really didn't work ;) After some investigation it turns out that it fails on the !current->mm check in fuse_get_user_pages(). So it seems it's not yet doing any memory mapping, but some sort of tricky read that fuse_direct_io() can't handle (so in a way you were right). The following patch should implement a sort of read-through behavior: each read will refresh all pages even if they were previously cached. Not really tested (it compiles, and didn't crash the kernel immediately). Is this something like what you need? Miklos

Index: file.c
===================================================================
RCS file: /cvsroot/fuse/fuse/kernel/file.c,v
retrieving revision 1.75
diff -u -r1.75 file.c
--- file.c	12 May 2005 14:56:34 -0000	1.75
+++ file.c	30 Jul 2005 09:20:12 -0000
@@ -583,6 +583,24 @@
 	return res;
 }
 
+static void clear_pages_uptodate(struct address_space *mapping, off_t pos,
+				 size_t count)
+{
+	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
+	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t i;
+
+	for (i = start; i <= end; i++) {
+		struct page *page = find_get_page(mapping, i);
+		if (page) {
+			lock_page(page);
+			ClearPageUptodate(page);
+			unlock_page(page);
+			page_cache_release(page);
+		}
+	}
+}
+
 static ssize_t fuse_file_read(struct file *file, char __user *buf,
 			      size_t count, loff_t *ppos)
 {
@@ -604,8 +622,10 @@
 		return generic_file_read(file, buf, count, ppos);
 	}
 #else
-	else
+	else {
+		clear_pages_uptodate(inode->i_mapping, *ppos, count);
 		return generic_file_read(file, buf, count, ppos);
+	}
 #endif
 }
 
|
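The index arithmetic in clear_pages_uptodate() can be checked in isolation. Below is a userspace sketch; it makes three assumptions:

- Pages are 4 KiB; SKETCH_PAGE_SHIFT stands in for PAGE_CACHE_SHIFT.
- Offsets are 64-bit, like loff_t.
- `page_range` is a hypothetical helper.

It adds one guard the patch as posted does not have. If a zero-length read at offset 0 ever reaches this path, `pos + count - 1` wraps to the maximum unsigned value, and the loop bound would become enormous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Compute the first and last page-cache index touched by a read of
 * `count` bytes at offset `pos`. Returns 0 (and leaves the outputs
 * untouched) for an empty range, 1 otherwise. */
static int page_range(uint64_t pos, size_t count,
                      uint64_t *first, uint64_t *last)
{
    assert(first != NULL && last != NULL);

    if (count == 0)
        return 0;              /* avoids pos + count - 1 wrapping */
    *first = pos >> SKETCH_PAGE_SHIFT;
    *last = (pos + count - 1) >> SKETCH_PAGE_SHIFT;
    return 1;
}
```

For example, a 2-byte read at offset 4095 straddles a page boundary and touches pages 0 and 1.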
From: sage w. <sa...@ne...> - 2005-07-31 19:07:55
|
Ah, yes, I think that does the trick! The only downside is that I'll eat memory filling page cache pages that will never be used (except by mmap). Not sure how to elegantly avoid that. This solves our problem well enough for the time being, though. Thanks so much! sage |
From: Miklos S. <mi...@sz...> - 2005-08-01 08:37:47
|
> Ah, yes, I think that does the trick! The only downside is that I'll eat > memory filling page cache pages that will never be used (except by mmap). > Not sure how to elegantly avoid that. We could try to free the pages after the read. There's a function that does this in mm/truncate.c: invalidate_mapping_pages(). Unfortunately it's not exported to modules (and neither the pagevec_xx functions it uses), so it would need to be reimplemented. Not terribly difficult though ;) Miklos |
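Miklos's suggestion is to reimplement invalidate_mapping_pages() inside the module. For comparison only, userspace has a standard knob with a similar effect: posix_fadvise() with POSIX_FADV_DONTNEED. On Linux, this hint is handled by invalidating the clean, unmapped cached pages of the file in the given range.

Below is a minimal sketch; `drop_cached_range` is a hypothetical wrapper. The call is advisory, not a guarantee. Pages that are currently mapped (the ones mmap needs) are left alone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: ask the kernel to drop cached pages for a byte
 * range of an open file (len == 0 means "to end of file"). Note that
 * posix_fadvise() returns an error number directly (0 on success) rather
 * than setting errno. */
static int drop_cached_range(int fd, off_t offset, off_t len)
{
    assert(fd >= 0);
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```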