Hi Andy, Dieter,
I've had some correspondence with Bartlomiej Zolnierkiewicz, who maintains
the Linux IDE drivers, to help understand this better. What I write below
incorporates his comments and what I learned from him.
On Tue, 6 Apr 2004, Andy Isaacson wrote:
> On Mon, Apr 05, 2004 at 12:38:24PM -0500, Bruce Allen wrote:
> > On Mon, 5 Apr 2004, Dieter Stueken wrote:
> > > in principle yes, but using hdc, the kernels block buffering is involved.
> > > Thus when you try to replace 512 bytes within some 4k block, the kernel
> > > will read all 8 sectors first, modify the content of the block and will
> > > write it back again. Unfortunately this must fail, as you end up with a
> > > read error from the broken sector. You may be successful if, and only if
> > > the data you write perfectly matches a 4k block and the kernel is smart
> > > enough to recognize, that no reading is needed, as the whole block
> > > gets replaced.
> >
> > Interesting. I thought that when you reference /dev/hdc, there is no
> > 'file system' involved, which is where I thought that the 4kB block size
> > came from.
>
> There is no "file system", but there is a buffer cache associated with
> the inode for /dev/hdc. (Notably, it's independent of the inodes for
> the files *in the filesystem* on /dev/hdc, so you can't "prefetch .so
> files to speed up boot" by reading blocks from /dev/hdc.)
>
> You can verify this by watching "vmstat 1" output as you read from
> /dev/hdc. If it were a raw device like BSD's /dev/rwd0c, every read
> request would go directly to the disk and the buffer memory amount would
> be unaffected; but Linux buffers those reads, so you'll see the "cache"
> column grow as /dev/hdc is read from.
Very nice experiment! I tried it, and you are absolutely right. Reading
and writing directly to /dev/hda does NOT bypass the buffer cache.
What's more, if you try to read a single sector, you'll see the buffer
count jump by 4 blocks. Since vmstat reports in units of 1 kB blocks, this
shows that the buffer cache uses a 4 kB = 8 sector blocksize.
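For anyone who wants to repeat the experiment without eyeballing vmstat,
here is a rough Python sketch. It assumes the "Buffers:" line of
/proc/meminfo tracks the same quantity vmstat shows, and the function
names are my own. Other disk activity, or a sector that is already
cached, will skew the number, so treat the result as indicative only.

```python
def buffers_kb(meminfo_text):
    """Return the value of the 'Buffers:' line (in kB) from /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("Buffers:"):
            return int(line.split()[1])
    raise ValueError("no Buffers line found")


def current_buffers_kb():
    """Read /proc/meminfo and return the current buffer size in kB."""
    with open("/proc/meminfo") as f:
        return buffers_kb(f.read())


def buffer_growth_kb(device, sector_size=512):
    """Read one sector through the normal (buffered) device node and
    report how much the buffer count grew, in kB."""
    before = current_buffers_kb()
    with open(device, "rb", buffering=0) as dev:
        dev.read(sector_size)
    return current_buffers_kb() - before
```

Run as root on an otherwise idle machine, something like
buffer_growth_kb("/dev/hda") would be expected to report about 4 if the
observation above holds.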
The 'raw block size' used in the IDE drivers themselves is indeed 1 kB
(two 512-byte sectors), so in fact it's simply not possible to read or
write a single sector. As Dieter had speculated, if you try to over-write
a single 512-byte sector, the OS will first have to read in the *other*
512-byte sector that is part of that logical block, then modify the 1 kB
block in memory, then write out the two 512-byte sectors to disk together.
This is independent of the buffering of /dev/hda by the buffer cache that
Andy describes.
If you use the normal /dev/hda device, then the fact that the kernel
buffer cache uses a 4kB blocksize means that the smallest number of
sectors you can over-write (if just one is bad) is 8. Dieter, next time
you have a pending sector at a known LBA, you should be able to confirm
this by demonstrating that reading nearby sectors fails because they are
part of the 4kB kernel buffer block containing that bad LBA.
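To work out which neighbours to test, here is a small sketch of the
arithmetic. It assumes 512-byte sectors, a 4 kB buffer block, and that
blocks are aligned to the start of the whole-disk device (/dev/hda, not
a partition). The function name is made up for illustration.

```python
SECTOR_SIZE = 512   # bytes per sector (assumed)
BLOCK_SIZE = 4096   # buffer cache block size observed above


def shared_block_sectors(lba, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return (first, last) LBA of the buffer block containing the given LBA."""
    per_block = block_size // sector_size      # 8 sectors per 4 kB block
    first = (lba // per_block) * per_block     # round down to block start
    return first, first + per_block - 1
```

For example, a bad sector at LBA 1000003 would share its block with LBAs
1000000 through 1000007, so buffered reads of any of those eight sectors
would be expected to fail.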
Now, to get around the buffering that Andy describes, you need to pass
the O_DIRECT flag to open(2). Read about it with 'man 2 open'. This forces
reads and writes to go 'direct to the device', bypassing the buffer cache.
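As a rough illustration, here is a minimal Python sketch of a
single-sector read using O_DIRECT. It is Linux-only, the function name is
hypothetical, and it assumes a 512-byte transfer is acceptable to the
device. O_DIRECT has alignment requirements on the buffer, offset and
length that vary by kernel and device, so the open or read may fail with
EINVAL; a page-aligned anonymous mmap is used here as the buffer.

```python
import mmap
import os

SECTOR_SIZE = 512  # bytes per sector (assumed)


def read_sector_direct(path, lba, sector_size=SECTOR_SIZE):
    """Read one sector at the given LBA with O_DIRECT, bypassing the buffer cache."""
    # An anonymous mmap is page-aligned, which satisfies O_DIRECT's
    # buffer alignment requirement.
    buf = mmap.mmap(-1, sector_size)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        os.lseek(fd, lba * sector_size, os.SEEK_SET)
        nread = os.readv(fd, [buf])   # read directly into the aligned buffer
        return bytes(buf[:nread])
    finally:
        os.close(fd)
        buf.close()
```

Something like read_sector_direct("/dev/hda", 1000003) would then either
return the sector's contents or raise OSError if the drive reports a read
error for that LBA.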
Bart also said to "read help to CONFIG_RAW_DRIVER. It says OBSOLETE :),
maybe try to use O_DIRECT (man 2 open)."
Perhaps we ought to write to the 'dd' maintainers and ask if they can add
an extra flag (on Unix systems that support bypassing the buffer cache)
which passes O_DIRECT when dd opens a device. Andy, Dieter, do you think
this makes sense? We could also just write our own 'short and sweet'
sector reallocation tool.
> I don't know what raw(1) does, but I would hazard a guess that it
> duplicates the BSD behavior, allowing read(2) and write(2) to directly
> invoke the "read a block from disk" routines.
I guess it must use this O_DIRECT flag.
Cheers,
Bruce