Hi,
What do you think about using delayed allocation for NTFS? Here is my idea,
let me know what you think...
The idea below is based on the simplest thing I could think of, i.e. there
is no form of ENOSPC handling in prepare/commit_write.
Ok so here goes:
ntfs_prepare_write()
====================
if (page is uptodate)
        return success;
// For non-uptodate pages:
if (there is an allocation for the destination buffers) {
        map any partially overlapping buffers;
        read them synchronously from the backing store;
} else {
        // There is no allocation, i.e. hole or data extension
        zero any partially overlapping buffers;
}
return success;
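
To make the "partially overlapping buffers" rule concrete, here is a minimal user-space sketch in C. It is a toy model and not kernel code: PAGE_SZ, BLOCK_SZ, enum buf_action and model_prepare_write() are names made up for illustration, and deciding per buffer whether there is an allocation is an assumption about how the rule above would be applied.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SZ   4096u                 /* assumed page size */
#define BLOCK_SZ  512u                  /* assumed buffer (block) size */
#define NR_BUFS   (PAGE_SZ / BLOCK_SZ)

/* What the toy prepare_write would do with one buffer of the page. */
enum buf_action { BUF_SKIP, BUF_READ, BUF_ZERO };

/*
 * Hypothetical model of the prepare_write decision for a page that is NOT
 * uptodate. The write covers bytes [from, to) of the page. has_alloc[i]
 * says whether buffer i is backed by allocated clusters.
 *
 * Only buffers that the write touches partially need any work: those are
 * read if allocated, zeroed otherwise. Untouched buffers and buffers that
 * are completely overwritten are skipped.
 */
static void model_prepare_write(unsigned from, unsigned to,
                                const bool has_alloc[NR_BUFS],
                                enum buf_action out[NR_BUFS])
{
    assert(from <= to && to <= PAGE_SZ);

    for (unsigned i = 0; i < NR_BUFS; i++) {
        unsigned start = i * BLOCK_SZ;
        unsigned end = start + BLOCK_SZ;
        bool touched = from < to && from < end && to > start;
        bool fully_covered = from <= start && to >= end;

        if (!touched || fully_covered)
            out[i] = BUF_SKIP;
        else
            out[i] = has_alloc[i] ? BUF_READ : BUF_ZERO;
    }
}
```

For example, a write to bytes 100..999 touches buffers 0 and 1 only partially, so an allocated buffer 0 would be read and an unallocated buffer 1 would be zeroed, while a write to bytes 0..1023 needs no read or zeroing at all.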
ntfs_commit_write()
===================
mark buffers uptodate, but leave them unmapped;
update i_size if necessary;
set_page_dirty();
return success;
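
The commit step can be modelled the same way. Again this is only a user-space sketch under assumptions: struct page_model, model_commit_write() and the page_off parameter (the file offset of the start of the page) are hypothetical names, not anything from the driver.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SZ   4096u                 /* assumed page size */
#define BLOCK_SZ  512u                  /* assumed buffer (block) size */
#define NR_BUFS   (PAGE_SZ / BLOCK_SZ)

/* Toy stand-in for a page and its buffers. */
struct page_model {
    bool uptodate[NR_BUFS];
    bool mapped[NR_BUFS];   /* never set here: mapping is left to writepage */
    bool dirty;
};

/*
 * Hypothetical model of the commit step for a write to bytes [from, to) of
 * the page that starts at file offset page_off: mark the touched buffers
 * uptodate without mapping them, grow the file size if the write extends
 * past it, and mark the page dirty.
 */
static void model_commit_write(struct page_model *pg, long long *i_size,
                               long long page_off, unsigned from, unsigned to)
{
    assert(from <= to && to <= PAGE_SZ);
    if (from == to)
        return;             /* empty write: nothing to commit */

    for (unsigned i = 0; i < NR_BUFS; i++) {
        unsigned start = i * BLOCK_SZ;
        unsigned end = start + BLOCK_SZ;

        if (from < end && to > start)
            pg->uptodate[i] = true;
    }
    if (page_off + to > *i_size)
        *i_size = page_off + to;
    pg->dirty = true;
}
```

No allocation or mapping happens in this step, which is the whole point of the scheme: the buffers stay unmapped until writeback.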
ntfs_writepage() -- analogous to the current ntfs_readpage()
================
- For non-resident, uncompressed attributes, map all *uptodate* buffers,
allocating if necessary, and finally write them (i.e. only write uptodate
buffers; if the page is uptodate, write all buffers). (If the attribute is
MST protected, we need to get fully exclusive access to the page,
pre_write_mst protect the data, then have our async io completion handler
post_write_mst deprotect the data again before unlocking the page.)
- For non-resident, compressed attributes, compress the data chunk, allocate
space if necessary, and finally write it out to the backing store.
- For resident attributes, may need to convert to non-resident if size has
grown too big, then go to the non-resident, uncompressed case above. If
size is still small enough, just copy data to the mft record and mark that
dirty for later write out (we could force a synchronous write if desired).
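
The three cases above amount to a small dispatch on the attribute type. A minimal sketch in C, assuming made-up names throughout (enum wp_path, struct attr_model, model_writepage_path() and the max_resident field are illustrative only and do not exist in the driver):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which write-out path the toy writepage would take. */
enum wp_path {
    WP_COPY_TO_MFT,          /* resident and still small enough */
    WP_CONVERT_THEN_WRITE,   /* resident but grown too big */
    WP_WRITE_BUFFERS,        /* non-resident, uncompressed */
    WP_MST_WRITE_BUFFERS,    /* as above, with MST protect/deprotect */
    WP_COMPRESS_AND_WRITE,   /* non-resident, compressed */
};

/* Toy description of the attribute being written. */
struct attr_model {
    bool resident;
    bool compressed;
    bool mst_protected;
    size_t size;             /* attribute size in bytes after the write */
    size_t max_resident;     /* assumed room for the value in the mft record */
};

/* Hypothetical dispatch mirroring the three bullet points above. */
static enum wp_path model_writepage_path(const struct attr_model *a)
{
    assert(a != NULL);

    if (a->resident)
        return a->size <= a->max_resident ? WP_COPY_TO_MFT
                                          : WP_CONVERT_THEN_WRITE;
    if (a->compressed)
        return WP_COMPRESS_AND_WRITE;
    return a->mst_protected ? WP_MST_WRITE_BUFFERS : WP_WRITE_BUFFERS;
}
```

In the WP_CONVERT_THEN_WRITE case the real code would convert the attribute to non-resident and then continue down the non-resident, uncompressed path.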
Conclusions
===========
The above proposal has the problem of permitting overallocation: nothing
stops the user from dirtying more data than the volume has free space for.
We could just have a stupid check like "if (NVolENOSPC(vol)) return -ENOSPC;"
if we want, so that once we notice we are actually out of space all further
writes will be stopped. We would do a simple NVolSetENOSPC(vol) when we
notice we are out of space, and the cluster deallocator would do a
NVolClearENOSPC(vol).
We could of course do more complicated accounting to make sure we will
never overallocate but that would make everything a lot more complicated...
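
As a rough illustration of the per volume flag, here is a user-space sketch in C using C11 atomics. The NVol*ENOSPC() names above are only proposed; struct vol_model and the model_*() functions below are hypothetical stand-ins, and the free_clusters counter is a toy that is not safe for concurrent use.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy volume: a sticky "out of space" flag plus a free cluster counter. */
struct vol_model {
    atomic_bool enospc;     /* plays the role of the proposed ENOSPC flag */
    long free_clusters;     /* toy counter, single-threaded use only */
};

/* Cheap check at write(2) time: refuse new writes once the flag is set. */
static int model_write_check(struct vol_model *vol)
{
    return atomic_load(&vol->enospc) ? -ENOSPC : 0;
}

/* Allocation at writeback time: set the flag when we run out of space. */
static int model_alloc_clusters(struct vol_model *vol, long count)
{
    assert(count > 0);
    if (count > vol->free_clusters) {
        atomic_store(&vol->enospc, true);
        return -ENOSPC;
    }
    vol->free_clusters -= count;
    return 0;
}

/* Deallocation: space came back, so let writes through again. */
static void model_free_clusters(struct vol_model *vol, long count)
{
    assert(count > 0);
    vol->free_clusters += count;
    atomic_store(&vol->enospc, false);
}
```

Note that the flag only stops further writes after the first failed allocation; data dirtied before that point can still exceed the free space, which is exactly the overallocation trade-off being discussed.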
Basically my proposal makes ntfs_writepage() the "workhorse" and keeps
prepare/commit_write() as simple as possible. At the same time this should
speed up writes a lot, as we are not slowed down by allocations at write(2)
time and allocate on vm writeback/sync instead.
Does this make sense or have I missed something important? Any
better/alternative ideas are welcome!
What do you think about the necessity for free space accounting? Can we do
just none? Should we do a simple ENOSPC per volume flag? Or do we really
have to do full accounting to ensure we never overallocate?
The advantage of the delayed allocation is it allows easier handling of
compressed files - there we cannot know how much space we will need as the
page cache data size is not equal to the data written out to disk. We need
to compress the data before we know. And we don't want to compress every
time prepare/commit_write are called, otherwise byte by byte writes would
_really_ suck performance wise.
But even for delayed allocation as above, if we decide to do full
accounting of free space to prevent overallocation, we have a problem with
compressed files. For extension of files or filling in of holes, we could
just assume the data would not compress at all and charge that much in the
accounting. But this becomes more complicated on overwrite, as the new data
is likely to compress better or worse than the existing data, so we may need
to allocate more space even when just overwriting, and we have no way of
telling until we have compressed the data. This is why I suggest not
to do accounting at all. Perhaps the ENOSPC flag would be useful as a
sanity check though, so an application can't trash the machine by just
writing to a full partition while we are unable to write out the dirty
data...
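
To show the arithmetic behind the accounting problem, here is a small sketch in C. All names (clusters_for(), worst_case_charge(), overwrite_shortfall()) are made up for illustration, and it ignores how NTFS actually lays out compression blocks; it only captures the "assume nothing compresses" charge and the shortfall that only becomes known after compressing.

```c
#include <assert.h>
#include <stddef.h>

/* Number of clusters needed to hold `bytes` bytes (rounding up). */
static size_t clusters_for(size_t bytes, size_t cluster_size)
{
    assert(cluster_size > 0);
    return bytes / cluster_size + (bytes % cluster_size != 0);
}

/*
 * Charge for extending a file or filling a hole: assume the data does not
 * compress at all, so charge the full uncompressed size.
 */
static size_t worst_case_charge(size_t bytes, size_t cluster_size)
{
    return clusters_for(bytes, cluster_size);
}

/*
 * Extra clusters an overwrite needs, which is only known once the new data
 * has been compressed. Returns 0 if the new data fits in the old allocation.
 */
static size_t overwrite_shortfall(size_t old_alloc_clusters,
                                  size_t new_compressed_bytes,
                                  size_t cluster_size)
{
    size_t need = clusters_for(new_compressed_bytes, cluster_size);

    return need > old_alloc_clusters ? need - old_alloc_clusters : 0;
}
```

With 4096 byte clusters, a 64 KiB extension would be charged 16 clusters up front, whereas an overwrite whose new data compresses to 20000 bytes needs 5 clusters and so falls one cluster short of an existing 4 cluster allocation, something that cannot be known at write(2) time.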
/me finishes in hope of stimulating some discussion or at least getting
people's opinions...
Best regards,
Anton
--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/