Re: [Aoetools-discuss] Throughput for raw AoE device versus filesystem
From: Adi K. <ad...@cg...> - 2011-07-06 13:59:35
Hi!

> >> What filesystem do you use? XFS is known to be the recommended
> >> filesystem for AoE.
>
> > Actually I think this could be due to RAID block sizes: most AoE
> > implementations assume a block size of 512 bytes. If you're using a
> > Linux software RAID5 with a default chunk size of 512K and you're using
> > 4 disks, a single "block" has 3*512K block size. This is what has to be
> > written when changing data in a file, for example.
> > mkfs.ext4 or mkfs.xfs respect those block sizes, stride sizes, stripe
> > width and so on (see man pages) when the information is available
> > (which is not the case when creating a file system on an AoE device).
>
> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
> single block device.
> The RAID itself is a RAID6 configuration, using default settings.
> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.

Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot
see them.

> It seems I get the same filesystem settings if I create the filesystem
> right on the LVM volume,
> or if I create it on the AoE volume.

Hmmm... that means that the controller does not expose its chunk size to
the operating system. The most important parameters here are:

* stride = number of blocks on one RAID disk (i.e. chunk size / block size)
* stripe-width = number of strides that make up one full data stripe of
  the RAID

Could you try to create the file system with
"-E stride=16,stripe-width=16*(N-2)", where N is the number of disks in
the array?

There are plenty of sites out there about finding good parameters for mkfs
and RAID (like http://www.altechnative.net/?p=96 or
http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example).

> I have iostat running continually, and I have seen that "massive read"
> problem earlier.

The "problem" with AoE (or whatever intermediate network protocol you use:
iSCSI, FCoE, ...) is that it needs to force writes to happen.
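For a setup like yours, the stride/stripe-width arithmetic can be sketched
like this (the 6-disk count, 64K strip size, 4K filesystem block size, and
the device path are assumed example values, not necessarily your actual
configuration):

```shell
# Assumed values: 64K controller strip size, 4K filesystem block, 6-disk RAID6.
CHUNK_KB=64
BLOCK_KB=4
NDISKS=6
STRIDE=$((CHUNK_KB / BLOCK_KB))          # 64/4 = 16 blocks per chunk
STRIPE_WIDTH=$((STRIDE * (NDISKS - 2)))  # RAID6 uses 2 disks for parity: 16*4 = 64
# The mkfs invocation this yields (device path is a placeholder):
echo mkfs.ext4 -E stride=$STRIDE,stripe-width=$STRIPE_WIDTH /dev/etherd/e0.0
```

The echo is there so you can inspect the command before actually running
mkfs on a device.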
The Linux kernel tries to infer the physical layout of the underlying disk,
at least from the file system layout on disk, and tries to write one
"physical block" at a time. ("blockdev --report /dev/sdX" reports what the
kernel thinks the physical layout looks like.)

Let's assume you have 6 disks in RAID6: 4 disks contain data and the chunk
size is 64K. That means one "physical block" has a size of 4*64K = 256K.
The file systems you created had a block size of 4K, so if AoE forces the
kernel to commit every 4K, the RAID controller needs to read 256K, update
4K, recalculate the checksums and write 256K again. This is what is behind
the "massive read" issue.

Write rates should improve by creating the file system with the correct
stride size and stripe width. But there are other factors as well:

* You're using LVM (which is an excellent tool). You need to create your
  physical volumes with parameters that fit your RAID too, that is, use
  "--dataalignmentoffset" and "--dataalignment". (The issue with LVM is
  that it exports "physical extents" which need to be aligned to your
  RAID's stripe boundaries. For testing purposes you might start without
  LVM and try to align and export the filesystem via AoE first. That way
  you get better reference numbers for further experiments.)
* For real-world scenarios it might be a better idea to recreate the RAID
  with a smaller chunk size. This of course depends on what kind of files
  you intend to store on that RAID. You should try to fit an average file
  in more than just one "physical block"...

> However, when I'm doing these tests, I have a bare minimum of reads,
> it's mostly all writes.

As mentioned above: this is due to the controller "hiding" the real disk
operations.

Hope this helps... and please send back results!

--
Adi
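P.S. The "--dataalignment" point can be sketched as follows (the 4-data-disk
count, 64K chunk size, and device path are assumed example values):

```shell
# Assumed RAID6 geometry: 6 disks total, 4 data disks, 64K chunks.
NDATA=4
CHUNK_KB=64
STRIPE_KB=$((NDATA * CHUNK_KB))   # full data stripe = 4*64K = 256K
# Align LVM physical extents to a full stripe (device path is a placeholder):
echo pvcreate --dataalignment ${STRIPE_KB}k /dev/etherd/e0.0
```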