At 13:55 04/07/01, Richard Russon wrote:
>I've taken a careful look at all the actions we need to perform to write
>to an NTFS volume. I've tried to break everything down into atomic
>pieces.
>
>The simplest example is touch_file(). We have to update the $MFT entry
>and the Index Entry. We also need to touch the $MFT, to show that it has
>been altered.
Does NT really update the times of the MFT record for $MFT every single
time the MFT is written to?!? It seems crazy... You could say this is
putting some very heavy wear on the disk sectors containing the first mft
record...
>So, touch_file() looks like:
>
>    touch_file()
>        update_file()           # update file
>        finalise()              # update $MFT
>
>    update_file()
>        Update $MFT entry       # update file's entry
>        Update Index entry
>
>    finalise()
>        update_file ($MFT)      # update timestamp
>        Copy $MFT to $MFTMirr
>
>The finalise will be needed after every update.
Apart from the nomenclature, which I don't like (but that's a detail...), I
agree in general. However, I disagree about finalise(). IMO
update_file($MFT) and "Copy $MFT to $MFTMirr" should be one and the same
thing:
- When we are writing to $MFT anywhere between offset 0 and 4 * mft record
  size, we have to write the data to $MFTMirr as well.
- The best solution is IMO to put this into the low level write routines
  themselves; in fact, the lower we go the better for transparency of
  $MFTMirr updates. After all, the update of $MFTMirr should be in the same
  transaction as the update of $MFT.
- Since $MFTMirr is a byte-wise copy of $MFT's first four mft records,
  rather than a functional copy, it is not a problem to have the copy done
  really deep inside the write routines, maybe in (the yet to be written)
  ntfs_write_page() or something similar.
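
As a rough illustration of mirroring inside the low level write path, here
is a minimal sketch in C. Everything in it is an assumption made for the
example: the names mft_write() and struct volume, the 1024-byte record size,
and the use of in-memory buffers in place of real disk I/O. It only shows
the offset check that decides which bytes also go to the mirror.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All names and sizes below are hypothetical, for illustration only. */
#define MFT_RECORD_SIZE  1024u  /* assumed mft record size in bytes */
#define MIRRORED_RECORDS 4u     /* $MFTMirr copies the first four records */
#define MIRR_BYTES       (MIRRORED_RECORDS * MFT_RECORD_SIZE)

struct volume {
    unsigned char mft[16 * MFT_RECORD_SIZE]; /* stand-in for $MFT data */
    unsigned char mftmirr[MIRR_BYTES];       /* stand-in for $MFTMirr data */
};

/*
 * Write len bytes from buf at byte offset ofs of $MFT. Any part of the
 * write that falls inside the first four records is copied byte-wise to
 * $MFTMirr as well, so callers never have to think about the mirror.
 * Returns 0 on success, -1 if the range lies outside the $MFT buffer.
 */
int mft_write(struct volume *vol, size_t ofs, const void *buf, size_t len)
{
    assert(vol != NULL && buf != NULL);

    if (ofs > sizeof(vol->mft) || len > sizeof(vol->mft) - ofs)
        return -1;

    memcpy(vol->mft + ofs, buf, len);

    if (ofs < MIRR_BYTES) {
        /* Clamp to the mirrored region; the tail past it is not copied. */
        size_t n = len;
        if (n > MIRR_BYTES - ofs)
            n = MIRR_BYTES - ofs;
        memcpy(vol->mftmirr + ofs, vol->mft + ofs, n);
    }
    return 0;
}
```

In a real driver both writes would have to be part of one transaction, which
this sketch does not attempt to model.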
>Indeed if we copy NT exactly, then after every file READ, we need to
>update the access times of the file, i.e. The $MFT and its mirror will
>need updating.
Are you sure about this? AFAIK, NT will only update access times after a
time delta has passed since the last update (I do not know how large the
time delta is admittedly), and even then, only if access time updates are
enabled in the registry (which they are by default, but setting the right
registry key will deactivate the updates). - We will of course have a
(re)mount option to disable access time updates and hence beat NT since it
will be per mounted volume and not for the whole driver. (-;
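
The two conditions above (a per-volume switch plus a minimum time delta) can
be sketched in C as follows. This is only a sketch under stated assumptions:
struct mount_opts, struct inode_times and maybe_update_atime() are made-up
names, times are plain seconds, and the delta is a parameter because the
value NT uses is not known here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-volume options; not taken from any real driver. */
struct mount_opts {
    bool noatime;          /* set by a (re)mount option, per volume */
    long long atime_delta; /* assumed minimum seconds between updates */
};

/* Hypothetical in-memory copy of a file's access time, in seconds. */
struct inode_times {
    long long atime;
};

/*
 * Decide whether a read at time 'now' should update the access time.
 * Returns true if atime was changed, meaning the mft record and index
 * entry would have to be marked dirty; returns false if nothing changed.
 */
bool maybe_update_atime(const struct mount_opts *opts,
                        struct inode_times *it, long long now)
{
    assert(opts != NULL && it != NULL);

    if (opts->noatime)
        return false;               /* updates disabled for this volume */
    if (now - it->atime < opts->atime_delta)
        return false;               /* too soon since the last update */

    it->atime = now;
    return true;
}
```

With this shape a read that arrives inside the delta causes no metadata
write at all, so $MFT and its mirror are left alone.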
>The actual touch is broken into update & finalise to prevent infinite
>loops when touching the $MFT.
Infinite loops are something we definitely have to watch for carefully
during the driver design/implementation as NTFS is full of dangers
involving them due to having all the metadata being files...
>Another simple case is append_file().
>
>    append_file()
>        write_data()            # write data to disk
>        touch_file()            # update timestamp
>
>    write_data()
>        Write data              # put data on disk
>        Update $Bitmap          # set bits
>        touch_file($Bitmap)     # update timestamp
Update $Bitmap AFTER writing the data?!? I think I am hearing voices and
they are all screaming "RACE condition"...
My suggestion would be:

    append_file()
    {
        if (!write_attribute($DATA,...))
            fail();
        touch_file();
        touch_file($MFT);
    }

    write_attribute()
    {
        ... sees that it has to extend ...
        if (!resize_attribute())
            fail();
        if (!inline: write_data_into_attribute &
                     update attribute record &
                     lock_inode_and_mft_record while doing this) {
            resize_attribute(back to old size);
            fail();
        }
    }

    resize_attribute()
    {
        inline: calculate nr_clusters needed;
        if (!allocate_clusters())
            fail();
        lock_both_inode_and_mft_record_for_write(separate?);
        if (!do_page_cache_magic_to_add_data_to_pages()) {
            unlock_inode_and_mft_record();
            deallocate_clusters();
            fail();
        }
        inline: update_attribute record (run list, etc)
        unlock_both_inode_and_mft_record(separate?);
    }

    allocate_clusters()
    {
        lock_bitmap();
        inline: allocate_clusters_and_create_run_list_of_them;
        unlock_bitmap();
        touch_file(bitmap);
        return run_list;
    }

Note: I haven't written the above in optimized form, it's just conceptual.
Otherwise, for example, allocate_clusters() would not necessarily do the
locking this way...
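
To illustrate the ordering argued for here (claim the clusters in the bitmap
first, and give them back if a later step fails), a minimal C sketch
follows. It is not driver code: the 64-cluster bitmap, the bit order, and
the exact signatures of allocate_clusters() and deallocate_clusters() are
assumptions, and the comments only mark where the bitmap lock would be
held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CLUSTERS 64u /* hypothetical tiny volume */

/* One bit per cluster; a set bit means "in use" (bit order is assumed). */
struct bitmap {
    unsigned char bits[NR_CLUSTERS / 8];
};

static bool bit_test(const struct bitmap *bm, size_t i)
{
    return (bm->bits[i / 8] >> (i % 8)) & 1u;
}

static void bit_set(struct bitmap *bm, size_t i)
{
    bm->bits[i / 8] |= (unsigned char)(1u << (i % 8));
}

static void bit_clear(struct bitmap *bm, size_t i)
{
    bm->bits[i / 8] &= (unsigned char)~(1u << (i % 8));
}

/*
 * Claim 'count' free clusters and store their numbers in out[].
 * All or nothing: if the volume runs out, every bit set so far is
 * cleared again. The caller is assumed to hold the bitmap lock.
 * Returns 0 on success, -1 on failure.
 */
int allocate_clusters(struct bitmap *bm, size_t count, size_t *out)
{
    size_t found = 0;

    assert(bm != NULL && out != NULL);
    for (size_t i = 0; i < NR_CLUSTERS && found < count; i++) {
        if (!bit_test(bm, i)) {
            bit_set(bm, i);
            out[found++] = i;
        }
    }
    if (found < count) {
        while (found > 0)
            bit_clear(bm, out[--found]);
        return -1;
    }
    return 0;
}

/* Give clusters back, e.g. when a later step of the resize fails. */
void deallocate_clusters(struct bitmap *bm, const size_t *clusters,
                         size_t count)
{
    assert(bm != NULL && clusters != NULL);
    for (size_t i = 0; i < count; i++)
        bit_clear(bm, clusters[i]);
}
```

Because the bits are set before any data is written, nobody else can be
handed the same clusters while the write is still in flight.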
>The code we have tries to do everything itself. If we can break it down
>into packets of work, two things happen.
>We can mimic, possibly even USE the log file,
Yes.
>and second we could queue and coalesce similar requests.
No. This already happens at two levels, so we really do not need to do it
and in fact can't do it usefully AFAICS:
1) At the level of the block devices, in ll_rw_block and friends,
   consecutive accesses are already batched/coalesced.
2) The page cache buffers multiple reads/writes and only invokes NTFS when
   we really need to do a read or write (basically the stuff in the address
   space operations).
In the meantime we are never invoked and we don't care... The VFS rules. (-;
>Any thoughts / comments?
As above. Sorry it took so long to reply but I have been very busy in the
lab...
Anton
--
"Nothing succeeds like success." - Alexandre Dumas
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/