At 14:38 12/08/02, Richard Russon wrote:
>Ages ago you split up my data runs code into two pieces.
>Check that the operation is legitimate, then actually perform the
>operation.
>
>That's fair enough since we can't recover if something goes wrong.
Yup.
>If we were to write to an MFT record, can we bypass this? If things
>go wrong can we just tell the cache that the page is invalid?
>Is this a legal thing to do? Is it sensible?
Pages can be invalidated, yes. The easiest form of invalidation is
"ClearPageUptodate(page); ClearPageDirty(page);". Note that such
invalidation must happen under page lock protection. But it is bad to use
this method. There are other methods to discard a page: some discard the
contents, some preserve them, and some actually truncate the file. I am
sure we will find a function that does just what you suggest.
Clearing the uptodate flag means that on the next (read) access the page
will be read back in via ->readpage. [We just need to take care of the
buffers in the page; they will have to be either thrown away or at least
marked not uptodate, too. But this assumes we will have buffers, which is
not carved in stone because buffers cause a lot of complications...]
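Roughly what I mean, as an untested sketch (using the page flag macros
above; the buffer helper names are from memory and depend on the kernel
version, so treat them as placeholders):

	/* Brute-force invalidation of a page and its buffers. */
	lock_page(page);
	if (page_has_buffers(page)) {
		struct buffer_head *bh, *head;

		bh = head = page_buffers(page);
		do {
			/* Buffers must not stay dirty or uptodate either,
			   or the VM would still write them out / trust
			   their contents. */
			clear_buffer_dirty(bh);
			clear_buffer_uptodate(bh);
		} while ((bh = bh->b_this_page) != head);
	}
	ClearPageDirty(page);
	ClearPageUptodate(page);
	unlock_page(page);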
It is a legal thing to do in general. However, it is not sensible for mft
record pages, for these reasons:
1) There are 4 or more records in a page. Invalidating the page will throw
away the other 3 or more records as well. What if they were perfectly
valid, but dirty, and not written to disk yet? We would lose their changes.
Oops...
2) The record we are writing to may have been written to previously but the
contents may not have been written to disk yet. Example: the user creates a
file, then writes 1kiB, then writes another 1kiB. We edit the mft record
three times: first marking it in use and cleaning it up for reuse, adding a
new file name attribute, etc.; the second time we extend the run list of
the attribute by 1kiB; and the third time we extend it by another 1kiB. If
the first write succeeds but the second one fails, and we invalidate the
page, we lose the changes from the first write. We may even lose the file
creation step itself. That would mean we have an allocated mft record in
$MFT/$Bitmap, we have clusters allocated in $Bitmap/$DATA for the attribute
data, and we have a directory entry for the file, but the mft record for
the file is marked not in use (we never wrote the file creation/extension
to disk after all and just threw away the page), so we end up with a lost
mft record, a (few) lost clusters, and a stale directory entry. Oops...
The easiest way to recover from a failed write within the write functions
will be to roll back what we changed. Just think "journalling fs" without
the journal. The write function needs to keep the old and the changed state
in some fashion (perhaps just have variables like "BOOL size_is_changed"
and "s64 old_size", and on error we then do "if (size_is_changed)
resize(old_size);"; you get the idea). Or we could go fully journalled, but
I would rather not get involved with journalling yet... Non-journalled
writes are difficult enough.
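Just to make the pattern concrete, an untested sketch (the two helpers
change_attribute_size() and update_mapping_pairs() are made up here and
only stand in for whatever the real write steps end up being):

	static int ntfs_do_write(ntfs_inode *ni, s64 old_size, s64 new_size)
	{
		BOOL size_is_changed = FALSE;
		int err;

		err = change_attribute_size(ni, new_size);	/* step 1 */
		if (err)
			goto err_out;
		size_is_changed = TRUE;
		err = update_mapping_pairs(ni);			/* step 2 */
		if (!err)
			return 0;
	err_out:
		/* Roll back so the on-disk metadata stays self-consistent. */
		if (size_is_changed)
			change_attribute_size(ni, old_size);
		return err;
	}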
>(Obviously before starting we'd have to make sure the whole page is
>uptodate, not just the record).
For $MFT/$DATA (and in fact for all MST-protected attributes) we will need
to have either the whole page uptodate or nothing at all. We may actually
end up having completely separate codepaths for MST-protected vs
non-protected attributes. Certainly for writepage it looks like we are
heading down distinct routes. The way we perform the i/o is just too
different. We will actually need to violate a lot of VM/buffer layer
semantics to get MST protection to work the way I would like it to work.
Should it turn out that this is impossible because the VM/buffer layer
interacts with us badly, it will cause data corruption; we will notice and
then we will have to do some kludges to get this all to work. But let's get
normal attribute i/o working first, then worry about MST-protected i/o...
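For illustration, this is roughly the shape of the per-page MST
de-protection on the read side (the loop is only a sketch; it assumes the
post_read_mst_fixup() helper and that the page covers whole mft records,
which is exactly why a partially uptodate page is useless to us):

	u8 *kaddr = kmap(page);
	unsigned int i, recs = PAGE_CACHE_SIZE / vol->mft_record_size;

	for (i = 0; i < recs; i++)
		/* The fixups span the complete record, so every sector of
		   every record in the page must have been read in. */
		post_read_mst_fixup((NTFS_RECORD*)(kaddr +
				i * vol->mft_record_size),
				vol->mft_record_size);
	kunmap(page);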
I hope I will manage to get ntfs_write_block() hammered out tonight so some
of the problems that occur will be easier to explain.
It is all about page vs buffer state. A page can be bufferless (e.g. for
resident attributes), in which case the state of the page (dirty, uptodate,
etc.) is definitive. However, when there are buffers, the buffers are
definitive and the page state is only a hint to the overall state. Note the
hint has rules; for example it is not legal to have dirty buffers against a
clean page. But it is quite legal to have a dirty page where only some
buffers are dirty; then only the dirty ones need to be written out and the
rest are ignored. [This occurs for small writes via generic_file_write(),
which uses ntfs_prepare/commit_write(). It cannot occur for mmap()ped
writing as that only operates on whole pages.]
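As a sketch of that last point (write_this_buffer() is just a stand-in for
the real i/o submission, and I am glossing over locking):

	struct buffer_head *bh, *head;

	if (page_has_buffers(page) && PageDirty(page)) {
		bh = head = page_buffers(page);
		do {
			if (buffer_dirty(bh)) {
				/* Modified data lives here, write it out. */
				clear_buffer_dirty(bh);
				write_this_buffer(bh);
			}
			/* Clean, uptodate buffers are simply skipped. */
		} while ((bh = bh->b_this_page) != head);
	}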
Anton
--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/