From: Nix <ni...@es...> - 2006-01-20 10:57:11
On Fri, 20 Jan 2006, Miklos Szeredi whispered secretively:
>> Now obviously a file with a given inode number can't change to a
>> different type; I'm handling that (it's easy, since the inode numbers
>> I'm handing out are fakes valid only until umount() anyway). But if I
>> roll a bunch of files to a different revision, their contents will
>> change, and right now I have no way of informing the VFS of that so that
>> the inode can be ditched from the page cache. (Obviously I can fsync()
>> the file to get *dirty* blocks from the page cache
>
> FUSE does write-through, so there are never any dirty blocks. The
Ah, right, good.
> fsync() method is only interesting if you do buffering in the
> filesystem itself.
(which of course I do, it being backed by PostgreSQL. This is one of the
few unresolvable POSIX-noncompliances I'm stuck with: I can't force sync
*off* selectively, and whether it's on or off isn't up to me but up to
the database server. The other two major noncompliances are that I can't
give a decent indication of free disk space, and I can't handle
distributed locking if I have multiple FUSE daemons on different
machines talking to the same database. Only the latter is a problem FUSE
could potentially fix, by letting the userspace FS know when locks are
taken out, and I'm avoiding that mess until I get enough of the fs
together to test it... it's a very minor thing, really.)
> FUSE already does have this: the keep_cache flag in the OPEN reply.
Aha! I missed this. A much better place to put it, too. :)
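To make the keep_cache idea concrete, here is a minimal sketch of how a filesystem might decide the flag on each open. It assumes nothing about the real daemon: `struct open_reply` is a hypothetical stand-in for the single keep_cache bit of the FUSE open reply, and `struct vfile`, `vfile_roll()` and `vfile_open()` are names invented for illustration. The kernel's cached pages are treated as trustworthy only if no roll has happened since the previous open.

```c
/* Hypothetical stand-in for the one bit of the FUSE open reply that
   matters here; not the real FUSE structure. */
struct open_reply {
    unsigned int keep_cache : 1;
};

/* Hypothetical per-file state kept by the filesystem daemon. */
struct vfile {
    int cache_valid;   /* nonzero if cached pages still match the contents */
};

/* A version roll changes the contents behind the kernel's back, so any
   pages cached from earlier opens must be treated as stale. */
void vfile_roll(struct vfile *f)
{
    f->cache_valid = 0;
}

/* On open: let the kernel keep its cache only if nothing was rolled
   since the last open. After this open the cache is refilled from the
   current contents, so it is valid again until the next roll. */
void vfile_open(struct vfile *f, struct open_reply *reply)
{
    reply->keep_cache = f->cache_valid ? 1 : 0;
    f->cache_valid = 1;
}
```

With a zero-initialised `struct vfile`, the first open reports keep_cache as 0, later opens report 1, and the first open after `vfile_roll()` reports 0 again. Leaving keep_cache at 0 everywhere is the safe default described above; this sketch only shows where the flag would be raised if the extra caching were wanted.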
>> The idea is that the fsync() and fsyncdir() methods can say `my job
>> is done; please forget about this inode, as its contents are now
>> known to be invalid'. It would obviously be invalid to say this for
>> any inode that has been open()ed and not release()d yet: only a
>> moron would try to version-roll an open file and not expect to have
>> caching problems,
>
> Well, then what's the problem? Fuse already _does_ throw away the
> cache on each open() by default.
Woo. No problem, I misread the results of a testcase.
>> (oh, and one last question: the inode generation number has to increment
>> whenever an inum is replaced by an inum with the same number and a
>> different type, right? And can never go down?)
>
> Generation is only used by the NFS export code.
I remember it being added for use by that: what I wasn't sure of was if
its use had spread to other parts of the kernel.
> The rule is that the
> inum:generation pair must be a unique identifier for each "file" for
> the filesystem's lifetime.
Well, that's easy :)
> Inum must be unique for all the currently existing "files", but they
> may be reused after a file has been deleted.
I use 64-bit file identifiers internally and am enforcing exactly that
rule already in the translation layer down to 32-bit inums.
(I assume you mean inums can be reused after a file has been deleted
*and release() has been called for all open()s*.)
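The uniqueness rule can be illustrated with a small sketch of a translation layer from 64-bit file identifiers to 32-bit inums. This is not the actual implementation: `struct inum_table`, `inum_get()`, `inum_put()` and the tiny fixed table size are all assumptions made for illustration. An inum is the slot index plus one, and each slot's generation is bumped every time the slot is handed to a new file, so no inum:generation pair is ever given to two different files.

```c
#include <assert.h>
#include <stdint.h>

#define INUM_SLOTS 4   /* deliberately tiny; a real table would grow */

struct inum_slot {
    uint64_t file_id;     /* internal 64-bit identifier */
    uint32_t generation;  /* bumped each time the slot is reassigned */
    int in_use;           /* nonzero while the file exists or is open */
};

struct inum_table {
    struct inum_slot slot[INUM_SLOTS];
};

/* Return the inum for file_id and store its generation. An existing
   mapping is returned unchanged; otherwise a free slot is claimed and
   its generation incremented. Returns 0 if the table is full. */
uint32_t inum_get(struct inum_table *t, uint64_t file_id, uint32_t *generation)
{
    int free_slot = -1;
    for (int i = 0; i < INUM_SLOTS; i++) {
        if (t->slot[i].in_use && t->slot[i].file_id == file_id) {
            *generation = t->slot[i].generation;
            return (uint32_t)i + 1;
        }
        if (!t->slot[i].in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return 0;
    struct inum_slot *s = &t->slot[free_slot];
    s->file_id = file_id;
    s->generation++;
    s->in_use = 1;
    *generation = s->generation;
    return (uint32_t)free_slot + 1;
}

/* Call only once the file is deleted and every open has been released;
   after that the inum may be reused with a new generation. */
void inum_put(struct inum_table *t, uint32_t inum)
{
    assert(inum >= 1 && inum <= INUM_SLOTS);
    t->slot[inum - 1].in_use = 0;
}
```

In this sketch a file that is freed and later looked up again gets a fresh generation. A filesystem that wants a rolled-back file to reappear with its old inum and generation would instead have to remember the retired pair and never hand it to any other file.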
> A "file" in this case is the object that exists between creation
> (open(O_CREAT), mkdir(), mknod(), symlink()) and deletion (no more
> links, open files and memory mappings).
Well, in fact an inode's lifetime is discontinuous under version
control, and merely unlinking it completely and closing it doesn't mean
it's gone: so you can do this (in very crude pseudo-C):
    a = open("blah", O_CREAT | O_WRONLY, 0644);  /* gets inum A */
    write(a, "foo", strlen("foo"));
    close(a);
    gettimeofday(&time, NULL);          /* timestamp taken once `foo' is written */
    unlink("blah");
    a = open("blah", O_CREAT | O_WRONLY, 0644);  /* gets inum B */
    close(a);
    unlink("blah");
    recant_roll_to_time("blah", time);  /* roll back to `time' */
        /* under the covers this does an fsetxattr()
           to get the message to the filesystem and
           carry out the roll, and forces the next
           open() of this file to set keep_cache to zero */
    a = open("blah", O_RDWR);           /* inum A rises from the dead, same inum
                                           and generation counter */
    read(a, buf, 3);                    /* buf now contains `foo' again */
Thankfully that's acceptable under the limits you've given. (I took some
care to make sure that it worked in the presence of hard links,
concurrent writes, and so on. Rolling back and forwards is not specified
by POSIX so I can be as strange as I like there. ;) )
> Inums and generation can be allocated in arbitrary ways (they can go
> up/down/whatever, there can be a global generation counter or a local
> one for each inum) as long as the above uniqueness rules are observed.
Excellent!
FUSE's design has shown a great degree of precognition. I needed a
lowlevel API working on an inode-by-inode basis, so you introduced one
in FUSE 2.4 about a week *before* I started thinking about this
filesystem: I needed 64-bit fhs and it was useful to record their
number, so you introduced it within a day of my thinking of the
need... without my even needing to mention it!
So, um, where do you get your magic telepathy helmet and TARDIS from? :)
--
`Logic and human nature don't seem to mix very well,
unfortunately.' --- Velvet Wood