Anton Altaparmakov wrote:
> Hi,
>
> btw. Jeff, your IbuFS supports delayed allocation, doesn't it? Can I
> have a copy of the source to look at? XFS seems horrendously complex and
> reimplements almost the entire fs/buffer.c and mm/filemap.c to make
> it work, and that seems just a touch over the top. I am not aware of any
> other FS on Linux doing delayed allocation at present...
Nope, it's just a basic extent-based fs at this point.
Andrew Morton posted delayed-alloc patches for ext2 and/or ext3 a while
ago, so that's something to look at. I think there were problems with
the patch itself and also VM problems which meant that it wasn't very
useful in his tests.
>
> At 01:46 20/07/02, Jeff Garzik wrote:
>
>> * You want to reserve space via some sort of internal accounting, to
>> properly handle ENOSPC
>
>
> ok. That is going to be interesting to get right. Perhaps we just need
> to over-account, so that we may start returning -ENOSPC when in fact
> there is space left. Would that be acceptable? Any ideas welcome...
I'm missing a bit of information (or, more likely, have forgotten it):
why can't we accurately account for free space?
At worst, you could schedule_task a process that scans the free block
bitmap, and block any processes that want to allocate new blocks until
(a) free blocks > 0 or (b) the keventd-launched task ends.
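A rough sketch of that idea (ntfs_recount_free_blocks() and
free_blocks() are made-up names here; tq_struct, schedule_task and
wait_event are the real 2.4 interfaces):

    #include <linux/fs.h>
    #include <linux/sched.h>
    #include <linux/tqueue.h>

    static DECLARE_WAIT_QUEUE_HEAD(free_space_wq);
    static volatile int scan_done;
    static struct tq_struct scan_task;

    /* Runs in keventd context: recount free blocks from the bitmap. */
    static void scan_free_bitmap(void *data)
    {
            struct super_block *sb = data;

            ntfs_recount_free_blocks(sb);   /* made-up name */
            scan_done = 1;
            wake_up(&free_space_wq);
    }

    /* Allocator side: kick keventd and sleep until (a) or (b). */
    static int wait_for_free_blocks(struct super_block *sb)
    {
            scan_done = 0;
            INIT_TQUEUE(&scan_task, scan_free_bitmap, sb);
            schedule_task(&scan_task);
            wait_event(free_space_wq, free_blocks(sb) > 0 || scan_done);
            return free_blocks(sb) > 0 ? 0 : -ENOSPC;
    }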
>> * You probably want to either have a pool of pages, as a temporary
>> backing store, or, grab an extra reference to the page passed to
>> prepare/commit_write. For the second, I think you'll want to unlock
>> the page ASAP, and lock it again later when you start doing I/O on it.
>
>
> If we get an extra reference, is the vm ever going to call our writepage
> (e.g. under memory pressure)? If yes and it doesn't consider the extra
> reference as "page busy" then that would work nicely.
Better ask a VM expert. I am just guessing: it should call writepage
as long as the page is on the dirty list, regardless of how many people
have references to the page. Remember, the page is unlocked when the VM
would want to do the writeback. Keeping a reference just means we won't
lose the page to general page allocation/freeing.
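If it does work that way, the commit_write side could look something
like this (just a sketch; ntfs_delay_list_add() is a made-up name for
whatever bookkeeping you end up with):

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>

    static int ntfs_commit_write(struct file *file, struct page *page,
                                 unsigned from, unsigned to)
    {
            /* Pin the page so it can't be freed out from under us
             * while allocation is delayed; the caller unlocks it. */
            page_cache_get(page);
            ntfs_delay_list_add(page);      /* made-up name */
            set_page_dirty(page);
            return 0;
    }

    /* Later, when the blocks are finally allocated: */
    static void ntfs_write_delayed_page(struct page *page)
    {
            lock_page(page);
            /* ... allocate blocks and start the I/O here ... */
            UnlockPage(page);
            page_cache_release(page);       /* drop our pin */
    }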
>> * I think compression is better done outside the write path, to keep
>> it simple and fast. Maybe something in ntfsprogs could talk to an
>> ioctl(2) which compresses an open(2)'d file descriptor.
>
>
> The problem is what happens with already compressed files?
You don't have to convert 100% of the file data, just the blocks being
updated. NTFS TNG should already support files which contain mixed
compressed and uncompressed runs (I thought??)
If you want to compress in place -- go for it. My only objection is
that it's an unnecessary feature in the filesystem's fast path. If you
want to put compression support in there and deal with the complexity,
more power to you :) It really comes down to the maintainer's call --
do I want the bloat and complexity in the kernel or userspace? The
upside is the obvious convenience of putting yet another feature in the
kernel :)
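For the userspace route, all ntfsprogs would need is something along
these lines (entirely illustrative -- no such ioctl exists today, and
the number is made up):

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define NTFS_IOC_COMPRESS _IO('N', 0x42)  /* hypothetical */

    int ntfs_compress_file(const char *path)
    {
            int ret, fd = open(path, O_RDWR);

            if (fd < 0)
                    return -1;
            /* The kernel side would walk the runlist and compress
             * in place, outside the normal write path. */
            ret = ioctl(fd, NTFS_IOC_COMPRESS);
            close(fd);
            return ret;
    }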
>> The bigger advantage to delayed allocation is the allocation of
>> contiguous runs.
>
>
> The way my cluster allocator works gives you that, too, except that it
> doesn't work well if more than one file is being written to at once.
> Then the two get interleaved with one cluster each in the worst case
> (depends on write(2) chunk size and on cluster size). But I have thought
> of reserving space on allocations so that this doesn't happen, but I
> haven't tried that yet, so I have no idea how well it would work... And
> I don't know how to determine how much space to reserve in advance. A
> static value seems daft; something that scales with write activity
> would be nice. Obviously delayed allocation would be best...
As long as you allocate-on-flush, you have a single point where you need
to actually allocate the new blocks, and you can properly interleave there.
But that brings up mmap issues, which also touches on some of your
comments [which would be further down, if I hadn't edited :)].
It will be worth testing, once ->writepage is implemented, where the
best places are to start writing out delayed-alloc pages.
Your entry points are: ->commit_write, ->flush, ->writepage with
PageLaunder bit set [2.4 specific!], and ->writepage called from kupdated.
And IMHO, your proper points for doing actual I/O are:
1) ->writepage with PageLaunder bit, because the system is saying "get
rid of this page as fast as possible, I need to free it". [you'll need
to find the 2.5 equivalent of the ->writepage call for when the page
needs evicting.] Basically write the page, ignoring delayed alloc, as
fast as possible.
2) ->flush, typically the last close of the fd, but in general the point
where you want to allocate new blocks for delayed I/O pages, and start
their writeout by dirtying them.
3) some internal timeout occurs, and the page hasn't been written to
disk in N seconds. If we are delaying kupdated's flushes, we have to do
them ourselves.
I think (and VM experts may disagree, it's a question to ask them) that
you should have ->writepage delay the allocation if it's called from
kupdated.
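Put together, the policy might look roughly like this (a sketch, not
working code: ntfs_alloc_and_write_page() is made up, and "called from
kupdated" has no clean test in 2.4, so called_from_kupdated() stands in
for whatever mechanism you invent):

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>

    static int ntfs_writepage(struct page *page)
    {
            /* Case 1: the VM wants this page freed NOW.  Allocate
             * immediately, ignoring delayed alloc, and write it out. */
            if (PageLaunder(page))
                    return ntfs_alloc_and_write_page(page); /* made up */

            /* Case 3 lives on a timer elsewhere; here we only decide
             * whether a kupdated-driven call should be deferred. */
            if (called_from_kupdated()) {   /* stand-in, see above */
                    set_page_dirty(page);   /* keep it on the dirty list */
                    UnlockPage(page);
                    return 0;
            }

            return ntfs_alloc_and_write_page(page);
    }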
So anyway, mmap and ENOSPC. I think the ext2 behavior here is sane, and
can be followed even in a delayed-alloc system. If you are adding new
blocks to a file via mmap(2), clearly you aren't going to see ENOSPC. I
think you will get a segfault, actually. ext2 marks the page with
PageError if it can't allocate blocks. That's all you can really do
when called from ->writepage. With prepare/commit_write, the expected
behavior is to return a value that will be passed straight to userspace,
in this case ENOSPC.
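In code, those two error paths might look like this (sketch;
ntfs_reserve_blocks() and ntfs_alloc_blocks() are made-up names):

    /* ->prepare_write: the error goes straight back to write(2). */
    static int ntfs_prepare_write(struct file *file, struct page *page,
                                  unsigned from, unsigned to)
    {
            if (ntfs_reserve_blocks(page->mapping->host, from, to) < 0)
                    return -ENOSPC;         /* userspace sees this */
            /* ... map/read buffers as usual ... */
            return 0;
    }

    /* ->writepage error path: nobody to return an error to,
     * so do what ext2 does. */
            if (ntfs_alloc_blocks(page) < 0) {
                    SetPageError(page);     /* the mmap writer faults */
                    UnlockPage(page);
                    return 0;
            }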
Overall, your ->writepage implementation will likely be quite different
from ->prepare/commit_write. Even though they kinda sorta do the same
thing :)
>> * FWIW the last close of a file triggers a flush, which might be a
>> good point for actually allocating blocks
>
>
> Odd. I opened a file, wrote to it, exited the editor, did "cat file" and
> only a few seconds later did the kernel blow up (due to the missing
> writepage). Admittedly this may be an artefact of the missing
> writepage... We will find out when it is implemented. (-:
Is flush implemented? If not, it would probably fall back to writepage...
> If you delay the allocation how do you know that that is a hole you are
> writing into as opposed to data? Or do we always have to check for this
> kind of thing at prepare_write time? And if we do check, and say we
> decrement the amount of available space to account for the new data,
> what happens if someone writes to the same page again? Our check would
> again return that the page is inside a hole, and we would account for it
> _again_, so we would be over-accounting and run out of space quickly...
As long as you have a list of pages for which allocation is delayed, you
should be fine. No additional bits necessary. If it isn't on the list,
it isn't accounted for.
But since we have the nasty job of writing tons of zeros to disk when
sparse-file support is absent, I would say that you need a totally
separate accounting of zero pages delayed, and factor that into the
overall free-blocks number.
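Concretely, something like this (all names are made up; the point is
only that a page reserves space the first time it goes on the list,
and never twice):

    #include <linux/list.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct ntfs_delayed_page {
            struct list_head link;
            struct page *page;
    };

    /* Called at prepare_write time for a write into a hole. */
    static int ntfs_reserve_for_page(ntfs_inode *ni, struct page *page)
    {
            struct ntfs_delayed_page *dp;

            spin_lock(&ni->delay_lock);
            if (ntfs_page_on_delay_list(ni, page)) {
                    /* Already listed => already accounted for. */
                    spin_unlock(&ni->delay_lock);
                    return 0;
            }
            /* First write into this hole: account for it exactly once.
             * Delayed zero pages for the non-sparse case would go in a
             * separate tally, added to this reservation. */
            if (ntfs_reserve_clusters(ni->vol, 1) < 0) {
                    spin_unlock(&ni->delay_lock);
                    return -ENOSPC;
            }
            dp = kmalloc(sizeof(*dp), GFP_ATOMIC);
            if (!dp) {
                    ntfs_unreserve_clusters(ni->vol, 1);
                    spin_unlock(&ni->delay_lock);
                    return -ENOMEM;
            }
            dp->page = page;
            list_add(&dp->link, &ni->delayed_pages);
            spin_unlock(&ni->delay_lock);
            return 0;
    }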
I feel like I'm rambling, so I'll shut up now. I hope I made some sense :)
Jeff