Re: [Aoetools-discuss] vblade performance issues
From: kelsey h. <kh...@dr...> - 2007-07-31 22:22:45
kelsey hudson wrote:
> However, it seems that the AoE initiator in the kernel is good at doing
> region write-combining, because the writes I see coming from vblade (via
> strace) are always 4096 bytes, regardless of what the initial block size
> is. Writes which are a multiple of the page size shouldn't trigger a
> page read on write().

I've narrowed this down even further. It appears that there is an alignment issue when an AoE device is partitioned. If I write blocks directly to etherd/e0.0, for instance, the block writes are aligned with pages and I can stream at full speed(*) to the underlying device. If I add a partition table and write to the partition, the writes are offset by 512 bytes, which is not a multiple of the page size (and, incidentally, is the exact size of the x86 boot sector that holds the partition table). That is to say, if I write blocks directly to etherd/e0.0p1, the writes cause page cache reads every time.

I have discovered that this alignment error causes every write() on the initiator to straddle two cache pages on the target. So, even if the cache reads don't happen on the initiator, they do happen on the target. This would also explain the slightly slower reads.

(*) In this case I'm saturating my gigabit ethernet interface on a write.

Now, here's where I'm a bit confused about this whole issue. On a standard hard disk, the boot sector always lives in cylinder 0. Due to x86 BIOS stupidity, this boot sector is "limited" to 512 bytes in size. However, the boot sector effectively occupies the entire cylinder (because usable partitions can only begin at cylinder 1). There is typically inaccessible space at the end of cylinder 0 (and this is OK). The cylinder size on my virtual disk is 8225280 bytes (roughly 8 MB, though not an exact multiple of the page size: it is 2008 pages of 4096 bytes plus 512 bytes). If the first partition began at an exact 8 MB offset, it would be properly aligned with a cache page.

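To make the alignment arithmetic concrete, here is a minimal Python sketch. It assumes a 4096-byte page, 512-byte sectors, and the conventional 255-head, 63-sector geometry (which yields the 8225280-byte cylinder quoted above). The helper names are made up for illustration and are not part of vblade or the aoe driver.

```python
PAGE_SIZE = 4096    # assumption: typical x86 page size
SECTOR_SIZE = 512   # assumption: classic 512-byte sectors


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return the page indices covered by a write of `length` bytes at byte `offset`."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return list(range(first, last + 1))


def is_page_aligned(offset, page_size=PAGE_SIZE):
    """True if `offset` falls exactly on a page boundary."""
    return offset % page_size == 0


# A 4096-byte write at offset 0 fits in exactly one page.
aligned = pages_touched(0, 4096)        # [0]

# The same write shifted by one 512-byte sector straddles two pages.
shifted = pages_touched(512, 4096)      # [0, 1]

# Assumed geometry: 255 heads * 63 sectors per track * 512 bytes per sector.
cylinder_bytes = 255 * 63 * SECTOR_SIZE             # 8225280
cylinder_remainder = cylinder_bytes % PAGE_SIZE     # 512

print("aligned write touches pages:", aligned)
print("shifted write touches pages:", shifted)
print("cylinder size:", cylinder_bytes, "remainder mod page:", cylinder_remainder)
print("8 MiB offset page-aligned?", is_page_aligned(8 * 1024 * 1024))
```

Under those assumptions, a page-sized write shifted by one sector always touches two pages, and a partition starting at cylinder 1 would sit 512 bytes past a page boundary, whereas a true 8 MiB offset would be aligned.
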
Somewhere along the line (I'm guessing in the AoE kernel driver, somewhere between the write() done to the virtual device and the write() done to the physical device), there is an off-by-one (block/512b) error which is causing this misalignment.

I haven't yet taken a look at the code for the AoE module. Perhaps one of the Coraid folks who is more familiar with the code would be better able to spot this than I am (although I'll take a look and let you know what I find, it will take me quite a while to get familiar with the code). Once I (or the Coraid folks) correct this error, the change will absolutely not be compatible with the old on-disk format, because everything will be shifted by the corresponding number of bytes.

Am I off-base in making these assumptions? Please advise.

Thanks,
-Kelsey