From: FUJITA Tomonori <tomof@...>
Subject: Re: [Iscsitarget-devel] Write caching - subjective summary
Date: Wed, 27 Oct 2004 23:48:04 +0900
> > 1) make write back as an option here, as a patch, not be merged.
> > 2) try to implement support for WRITE FUA, SYNCHRONIZE_CACHE (this is
> > useful for both write through and write back)
> > 3) pursue a true write back solution with order preserve. or extra write
> > redo log. implement a new io handler.
>
> Tomorrow (maybe day after tomorrow), I'll post reasons why it is to
> implement write-back, which disk drives provide, though I've already
> said some here. And I'll also suggest a possible way to implement it.
Let's discuss write-back-reorder cache first, because it is easier
than write-back.
1. fileio
If we go with fileio, we can implement a write-back-reorder cache
with a one-line change to fileio_sync(), as we saw before.
With the change, as soon as IET receives the data and copies it to
the page cache, it tells the initiator that the write command has
finished. The dirty pages are written to disk later by something like
the bdflush daemon, and the write order may be changed aggressively.
This looks like the write-back-reorder cache that disk drives provide.
However, it is far more harmful to file systems.
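To make the risk concrete, here is a toy model of a write-back-reorder
cache. It is only a sketch, not IET or kernel code: the class
ReorderingCache, its methods, and the LBA numbers are all invented for
illustration, and "lowest LBA first" stands in for whatever order a
real flusher picks.

```python
class ReorderingCache:
    """Toy write-back cache: acknowledges at once, flushes later in LBA order."""

    def __init__(self):
        self.dirty = {}   # lba -> data, acknowledged but not yet on disk
        self.disk = {}    # lba -> data, safely stored

    def write(self, lba, data):
        self.dirty[lba] = data
        return "GOOD"     # the initiator is told the write has finished

    def flush(self, limit=None):
        """Write up to `limit` dirty blocks, lowest LBA first (not arrival order)."""
        order = sorted(self.dirty)[:limit]
        for lba in order:
            self.disk[lba] = self.dirty.pop(lba)
        return order

    def crash(self):
        """Power loss: everything still dirty is gone."""
        lost = sorted(self.dirty)
        self.dirty.clear()
        return lost


cache = ReorderingCache()
cache.write(900, "journal data")    # written first by the file system
cache.write(100, "commit record")   # written second, depends on the first
flushed = cache.flush(limit=1)      # [100]: the commit record goes out first
lost = cache.crash()                # [900]: the data it commits never arrives
```

After the crash the disk holds a commit record for data that was never
written, although both writes were acknowledged. The more dirty data
the cache may hold, the more such writes can be lost at once.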
I think that many people combine a write-back-reorder cache with file
systems (like XFS) that cannot handle it properly. But most of them
have never found their file systems corrupted. This is because the
amount of (write-back-reorder) disk cache is very small (typically
several megabytes), so there is little chance that a file system gets
badly corrupted.
However, IET with the change uses a huge amount of memory as disk
cache. With 2.4 kernels, you can dirty almost all of the page cache. A
large amount of clean page cache (for reading) is useful. But if 800
MB of dirty page cache is lost due to a crash, your file system
probably cannot survive.
Another problem with the change is, I think, that dirty pages are kept
for a longer time. I'm not sure how a disk cache works, but I guess
that a disk drive does not keep dirty data in its cache for 30
seconds, as the Linux kernel does.
Now you understand that the change to fileio_sync() is far more
dangerous than write-back-reorder cache, which your disk drives use.
The bad news is that it is impossible to solve this problem. Limiting
the amount of dirty page cache requires modifying the Linux kernel,
which is unacceptable.
We need more freedom to control the behavior of the page cache (i.e.
the VM system) per device, as a kind of QoS. Possibly such features
will be implemented in the Linux kernel, because some people think
that they are important for clients (initiators) in IP-SAN.
For now we have no choice but to accept the riskier write-back-reorder
cache. I don't like it because I feel like I'm going to release a
more dangerous product than similar ones on the market.
However, I also understand that you are responsible for what you do,
and we are not responsible for how people use IET. So if you guys like
the write-back-reorder cache, it is totally OK and will be implemented.
Secondly, we need to implement SCSI tag attributes, that is, ordered,
head of queue, and ACA. IET ignores them now. Implementing only the
ordered tag, which is important for file systems, is not
difficult. However, treating all tags correctly is not so easy, I
guess.
I always feel that I have to implement this.
Note that file systems that can handle a write-back cache use ordered
tags, though IET ignores them now. This may or may not break your file
system. It depends on how the file system uses ordered tags. I know
that ext3 can handle a write-back cache, but I'm not sure about others.
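The ordered-tag rule described above can be sketched as a small
scheduler. This is a simplified illustration, not IET code: the
function schedule(), the (tag, lba) tuples, and the by_lba() reorder
policy are invented here, and head of queue and ACA are left out. A
SIMPLE command may be reordered with its neighbours, but an ORDERED
command runs only after every command received before it and before
every command received after it.

```python
def schedule(commands, reorder):
    """Return an execution order for a list of (tag, lba) commands.

    SIMPLE commands between two ORDERED commands may be rearranged by
    `reorder`; an ORDERED command acts as a barrier.
    """
    result, batch = [], []
    for cmd in commands:
        tag, _lba = cmd
        if tag == "ORDERED":
            result.extend(reorder(batch))  # everything received earlier runs first
            batch = []
            result.append(cmd)             # then the ordered command itself
        else:
            batch.append(cmd)
    result.extend(reorder(batch))          # trailing SIMPLE commands
    return result


def by_lba(batch):
    """Example reorder policy: sort by LBA, as an elevator might."""
    return sorted(batch, key=lambda cmd: cmd[1])


received = [("SIMPLE", 90), ("SIMPLE", 10), ("ORDERED", 50),
            ("SIMPLE", 70), ("SIMPLE", 20)]
executed = schedule(received, by_lba)
# executed == [("SIMPLE", 10), ("SIMPLE", 90), ("ORDERED", 50),
#              ("SIMPLE", 20), ("SIMPLE", 70)]
```

A target that ignores the tags behaves as if every command were
SIMPLE, so the command at LBA 50 could run before or after any of the
others.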
Thirdly, we need the SYNCHRONIZE CACHE command. It is easy to implement.
Fourthly, we need tree-structured configuration files, as we
discussed. With the current format, adding per-LUN settings would
be messy.
Maybe we also need FUA bit support, though I'm not sure whether file
systems use it.
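For a file-backed LUN, SYNCHRONIZE CACHE and the FUA bit could both be
mapped to a flush of the backing file. The following user-space sketch
shows the idea with os.pwrite() and os.fsync(). The names handle_write()
and synchronize_cache() are hypothetical, not IET functions, and
fsync() flushes the whole file, which is coarser than FUA strictly
requires.

```python
import os


def handle_write(fd, offset, data, fua=False):
    """Store WRITE data; with FUA set, reach stable storage before returning."""
    written = os.pwrite(fd, data, offset)  # data lands in the page cache
    if fua:
        os.fsync(fd)                       # assumption: whole-file flush stands in for FUA
    return written


def synchronize_cache(fd):
    """SYNCHRONIZE CACHE: flush everything written so far to stable storage."""
    os.fsync(fd)
```

With handlers like these, an initiator that sends SYNCHRONIZE CACHE at
its commit points gets the durability it expects even when ordinary
writes are acknowledged from the page cache.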
2. blockio
I'm too tired to write more. Generally speaking, blockio can
completely control how data is written, without modifications to the
Linux kernel. But it needs lots of work to implement this.
If someone is interested in this approach, I'll explain more.