Thread: [Aoetools-discuss] Throughput for raw AoE device versus filesystem
From: Torbjørn T. <tor...@tr...> - 2011-07-05 15:31:14
I'm setting up an AoE-based SAN, and I'm not quite sure I've reached a good
performance level.

I can read and write the raw AoE device (/dev/etherd/*) at more or less
line speed on my 1-gig Ethernet adapters. This means I'm seeing I/O rates of
100 to 120 MB/s when using dd or something similar. However, when I put a
filesystem on there, I'm seeing rates of 55 to 70 MB/s. I've tested mostly
with rsync, cp or dd, but I tried bonnie and saw much the same results. I've
been testing mostly with ext4, but I saw pretty much the same performance
with ext3.

Since I'm testing sequential reads and writes, I was expecting the filesystem
performance to be closer to line speed than what I'm seeing now. Thanks to
aoetools-discuss, I think I've got a pretty good configuration going, with
MTU at 9000, flow control on the switch and some kernel tuning. Since I'm
seeing line speed when using the device directly, I guess this means that
the configuration is more or less okay.

What kind of performance are you guys seeing on your filesystems when using
1-gig Ethernet adapters?

--
Vennlig hilsen
Torbjørn Thorsen
Utvikler / driftstekniker
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
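[Editorial note: one common pitfall with dd-style filesystem benchmarks is that the page cache absorbs the writes, so the reported rate can be higher or lower than the real device throughput depending on when flushing happens. A minimal sketch of a more comparable test, using conv=fdatasync so the timing includes flushing data to the device; the path is a placeholder, not from the thread:]

```shell
# Write 64 MiB through the filesystem, counting the final fdatasync in the
# elapsed time, so the result is comparable to a raw-device dd.
dd if=/dev/zero of=/tmp/aoe-fs-test.img bs=1M count=64 conv=fdatasync 2>&1 |
    tail -n 1                       # dd's summary line: bytes, seconds, MB/s
stat -c %s /tmp/aoe-fs-test.img     # confirm all 64 MiB landed on disk
rm -f /tmp/aoe-fs-test.img
```

For reads, dropping the cache first (echo 3 > /proc/sys/vm/drop_caches, as root) avoids measuring the cache instead of the disk.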
From: Tracy R. <tr...@ul...> - 2011-07-05 18:29:06
On Tue, Jul 05, 2011 at 05:03:40PM +0200, Torbjørn Thorsen spake thusly:
> I'm setting up a AoE-based SAN, and I'm not quite sure I've reached a
> good performance level.
>
> I can read and write the raw AoE device (/dev/etherd/*) at more or
> less line-speed on my 1gig Ethernet adapters.
> This means I'm seeing I/O rates of 100 to 120 MB/s when using dd or
> something similar.

This is in line with what I get also. Sounds like your performance level is
as expected (very good).

> However, when I put a filesystem on there, I'm seeing rates of 55 to 70 MB/s.
> I've tested mostly by using rsync, cp or dd, but I tried bonnie and
> saw much the same results.

Yep. You are most likely running into physical limitations of the disk.

> Since I'm seeing line-speed when using the device directly, I guess this means
> that the configuration is more or less okay.

Yep.

> What kind of performance are you guys seeing on your filesystems when
> using 1gig Ethernet adapters ?

The speed of the network is not nearly as important as the speed of the disk
hardware. I get performance similar to yours when doing streaming
reads/writes to at least two disks. A single 7200 rpm drive can typically do
70 MB/s, so you usually need to gang up at least two of these in a mirror or
stripe. Many more smaller disks are necessary for higher IOPS. Fortunately,
this is a problem completely independent of AoE, so lots of people know how
to solve it.

These days I deploy SuperMicro 24-bay 2.5" servers stuffed full of 10k RPM
disks. This seems to get me the most reasonable bang/buck while providing
the kind of IOPS I need to run databases, mail servers, etc. The giant/cheap
2T disks you can buy these days are great for archival and backup storage,
but for actual data processing the advice has been the same for many years:
throw lots of spindles at the problem.

--
Tracy Reed Digital signature attached for your safety.
Copilotco Professionally Managed PCI Compliant Secure Hosting
866-MY-COPILOT x101 http://copilotco.com
From: Torbjørn T. <tor...@tr...> - 2011-07-06 09:17:55
2011/7/5 Tracy Reed <tr...@ul...>:
> On Tue, Jul 05, 2011 at 05:03:40PM +0200, Torbjørn Thorsen spake thusly:
>> I can read and write the raw AoE device (/dev/etherd/*) at more or
>> less line-speed on my 1gig Ethernet adapters.
>> This means I'm seeing I/O rates of 100 to 120 MB/s when using dd or
>> something similar.
>
> This is in line with what I get also. Sounds like your performance level is as
> expected (very good).
>
>> However, when I put a filesystem on there, I'm seeing rates of 55 to 70 MB/s.
>> I've tested mostly by using rsync, cp or dd, but I tried bonnie and
>> saw much the same results.
>
> Yep. You are most likely running into physical limitations of the disk.

I should have mentioned that the AoE device is backed by a RAID setup that
is able to write well above 120 MB/s. If I mount the same filesystem
locally, on the server, bonnie tells me it's able to do sequential writes at
~370 MB/s.

If I write straight to the AoE device, I can get the expected line speed of
the network, around ~110 MB/s:

dd if=/dev/zero of=/dev/etherd/e1.1 bs=1M

However, when mounting a filesystem and copying a file onto the AoE device,
I only see about ~70 MB/s. This leads me to think that the performance
degradation I'm seeing is related to the filesystem or the network. Of
course, I wouldn't expect a filesystem to give the same performance as the
raw device, but I didn't expect to see a ~25% hit in performance, especially
when doing a sequential write.

--
Vennlig hilsen
Torbjørn Thorsen
Utvikler / driftstekniker
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
From: Gabor G. <go...@di...> - 2011-07-11 20:15:21
On Mon, Jul 11, 2011 at 12:16:44PM +0200, Torbjørn Thorsen wrote:
> With this setup, I get the same ~70 MB/s I have been fighting with for
> a while now.
> It seems curious to me that I get ~70 MB/s seemingly no matter what changes
> I make to the configuration, so I'm beginning to suspect my testing
> method is broken.

Try ext2. More advanced file systems need to sync from time to time to
ensure your data is safe. Since the AoE protocol does not support barriers,
and AFAIK support for the FLUSH ATA command was never implemented, the
client kernel can do just one thing: stop sending new commands, wait until
all pending commands finish, and really-really hope that the server did
commit the data to disk, even though it got no indication that it did so.
This means that most file system operations (especially those involving
metadata) will insert "gaps" into the data stream.

So when you're using a file system, you will never be able to reach the
performance of the raw device or the network. If you have many clients, then
the fact that one of them can't saturate the server is probably not that
important. If the performance of a single client is important, then try
iSCSI.

Gabor
From: Tracy R. <tr...@ul...> - 2011-07-13 04:53:16
On Mon, Jul 11, 2011 at 10:15:12PM +0200, Gabor Gombas spake thusly:
> Since the AoE protocol does not support barriers, and AFAIK support for the
> FLUSH ATA command was never implemented, the client kernel can do just one
> thing: stop sending new commands, wait until all pending commands finish, and
> really-really hope that the server did commit the data to disk, even if it
> got no indication to do so.

Are there any plans to fix this? Is it even technically possible? It seems
that Coraid would want to remove any doubt about using AoE for "enterprise"
use.

This article has a good explanation of the history of write barriers in
Linux: http://lwn.net/Articles/283161/. My understanding is that until
recently Red Hat had been turning off write barriers in the kernel anyway,
yet people still ran their journalling filesystems, databases, etc. just
fine. RHEL6 seems to have write barriers enabled for all filesystems that
support them.

A sync should ensure consistency as long as the backing disk system actually
gets the data onto disk during the sync, right? My datacenter has had power
issues lately, so I am paying careful attention to this sort of thing.

--
Tracy Reed Digital signature attached for your safety.
Copilotco Professionally Managed PCI Compliant Secure Hosting
866-MY-COPILOT x101 http://copilotco.com
From: Adi K. <ad...@cg...> - 2011-07-06 10:09:53
Hi!

>> I should have mentioned that the AoE device is backed by a RAID setup
>> that is able to write well above 120 MB/s.
>> If I mount the same filesystem locally, on the server, bonnie tells me
>> it's able to do sequential writes at ~370 MB/s.
>> [SNIP]
>> Of course, I wouldn't expect a filesystem to give the same performance
>> as the raw device, but I didn't expect to see a ~25% hit in performance,
>> especially when doing a sequential write.
> What filesystem do you use? XFS is known to be the recommended
> filesystem for AoE.

Actually I think this could be due to RAID block sizes: most AoE
implementations assume a block size of 512 bytes. If you're using a Linux
software RAID5 with a default chunk size of 512K and you're using 4 disks, a
single "block" has a 3*512K block size. This is what has to be written when
changing data in a file, for example. mkfs.ext4 and mkfs.xfs respect those
block sizes, stride sizes, stripe widths and so on (see the man pages) when
the information is available -- which is not the case when creating a file
system on an AoE device.

Checking if you're hit by this is quite simple: install dstat or iostat on
the server exporting the volume. Run your benchmark and watch the output of
dstat/iostat: if you see massive reads while writing, congrats, you found
the root cause. To improve things a little, create the file system on the
server that is exporting the AoE targets. To improve them even more --
especially with RAID5 and RAID6 -- choose a smaller chunk size.

I'd be glad if you could post back some numbers... :-)

On a side note: linear performance isn't what counts when using network
storage. You'd better measure IOPS (input/output operations per second). I
use fio for benchmarks, which lets you define your I/O patterns to (kind of)
fit real-world usage.

-- Adi
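[Editorial note: the stride/stripe-width values discussed above follow from simple arithmetic: stride is the RAID chunk size divided by the filesystem block size, and stripe-width is stride times the number of data disks. A minimal sketch using the 4-disk RAID5 / 512K chunk example from this message; /dev/XXX is a placeholder:]

```shell
# Derive ext4 stride/stripe-width from RAID geometry (values assumed from
# the RAID5 example above; all sizes in KiB).
CHUNK_KB=512                          # RAID chunk size
BLOCK_KB=4                            # ext4 block size
DISKS=4                               # total disks in the RAID5 array
DATA_DISKS=$((DISKS - 1))             # RAID5: one disk's worth of parity
STRIDE=$((CHUNK_KB / BLOCK_KB))
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))
echo "mkfs.ext4 -E stride=$STRIDE,stripe-width=$STRIPE_WIDTH /dev/XXX"
# -> mkfs.ext4 -E stride=128,stripe-width=384 /dev/XXX
```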
From: Torbjørn T. <tor...@tr...> - 2011-07-06 12:41:19
On Wed, Jul 6, 2011 at 11:51, Adi Kriegisch <ad...@cg...> wrote:
> Hi!
[SNIP]
>> What filesystem do you use? XFS is known to be the recommended
>> filesystem for AoE.
> Actually I think this could be due to RAID block sizes: most AoE
> implementations assume a block size of 512Byte. If you're using a linux
> software RAID5 with a default chunk size of 512K and you're using 4 disks,
> a single "block" has 3*512K block size. This is what has to be written when
> changing data in a file for example.
> mkfs.ext4 or mkfs.xfs respects those block sizes, stride sizes, stripe
> width and so on (see man pages) when the information is available (which is
> not the case when creating a file system on an AoE device.

I'm using an LSI MegaRAID SAS 9280 RAID controller that is exposing a single
block device. The RAID itself is a RAID6 configuration, using default
settings. MegaCLI says that the virtual drive has a "Strip Size" of 64KB.
The virtual device from the RAID controller is used as a physical volume for
LVM, and the exported AoE devices are LVM logical volumes cut from this
physical volume.

It seems I get the same filesystem settings whether I create the filesystem
right on the LVM volume or on the AoE volume.

Creating it on the server, mkfs.ext4 says:

root@storage01:~# mkfs.ext4 /dev/aoepool0/aoetest
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1310720 inodes, 5242880 blocks

Creating it on the client, mkfs.ext4 says:

root@xen08:/home/torbjorn# mkfs.ext4 /dev/etherd/e7.1
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1310720 inodes, 5242880 blocks

Using both of these filesystems on the client, I end up with pretty much the
same transfer rate of about ~70 MB/s. Using it on the server, that is,
mounting the LVM volume directly, I get the much preferable ~370 MB/s.

> To check if you're hit by this is quite simple: install dstat or iostat on
> the server exporting the volume. Run your benchmark and watch the output of
> dstat/iostat: if you experience massive reads while writing, congrats, you
> found the root cause.
[SNIP]
> I'd be glad if you could post back some numbers... :-)

I have iostat running continually, and I have seen that "massive read"
problem earlier. However, when I'm doing these tests, there is a bare
minimum of reads; it's mostly all writes. The "%util" column from iostat is
mostly around ~10%, while at some intervals peaking towards 100%. I'm
guessing there is some cache flushing going on when I'm seeing those spikes.
This is on the server; the client chugs stably along at ~70 MB/s.

> On a side note: linear performance isn't what is counting when using
> network storage. You better measure iops (input/output operations per
> second). I use fio for benchmarks which lets you define your I/O patterns
> to (kind of) fit real world usage.
>
> -- Adi
>
> ------------------------------------------------------------------------------
> All of the data generated in your IT infrastructure is seriously valuable.
> Why? It contains a definitive record of application performance, security
> threats, fraudulent activity, and more. Splunk takes this data and makes
> sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-d2d-c2
> _______________________________________________
> Aoetools-discuss mailing list
> Aoe...@li...
> https://lists.sourceforge.net/lists/listinfo/aoetools-discuss

--
Vennlig hilsen
Torbjørn Thorsen
Utvikler / driftstekniker
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
From: Adi K. <ad...@cg...> - 2011-07-06 13:59:35
Hi!

[SNIP]
> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
> single block device.
> The RAID itself is a RAID6 configuration, using default settings.
> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.

Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot
see them.

> It seems I get the same filesystem settings if I create the filesystem
> right on the LVM volume, or if I create it on the AoE volume.

Hmmm... that means that the controller does not expose its chunk size to the
operating system. The most important parameters here are:
* stride = number of file system blocks per RAID chunk (chunk-size/block-size)
* stripe-width = number of strides in one data stripe of the RAID

Could you try to create the file system with
"-E stride=16,stripe-width=16*(N-2)", where N is the number of disks in the
array? There are plenty of sites out there about finding good parameters for
mkfs and RAID (like http://www.altechnative.net/?p=96 or
http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example).

> I have iostat running continually, and I have seen that "massive read"
> problem earlier.

The "problem" with AoE (or whatever intermediate network protocol -- iSCSI,
FCoE, ... -- you use) is that it needs to force writes to happen. The Linux
kernel tries to assume the physical layout of the underlying disk, at least
by using the file system layout on disk, and tries to write one "physical
block" at a time. (blockdev --report /dev/sdX reports what the kernel thinks
the physical layout looks like.)

Let's assume you have 6 disks in RAID6: 4 disks contain data and the chunk
size is 64K, so one "physical block" has a size of 4*64K = 256K. The file
systems you created had a block size of 4K -- so in case AoE forces the
kernel to commit every 4K, the RAID controller needs to read 256K, update
4K, recalculate checksums and write 256K again. This is what is behind the
"massive read" issue.

Write rates should improve by creating the file system with the correct
stride size and stripe width. But there are other factors as well:
* You're using LVM (which is an excellent tool). You need to create your
  physical volumes with parameters that fit your RAID too. That is, use
  "--dataalignmentoffset" and "--dataalignment". (The issue with LVM is that
  it exports "physical extents" which need to be aligned to your RAID's
  stripe boundaries. For testing purposes you might start without LVM and
  try to align and export the filesystem via AoE first. That way you get
  better reference numbers for further experiments.)
* For real-world scenarios it might be a better idea to recreate the RAID
  with a smaller chunk size. This -- of course -- depends on what kind of
  files you intend to store on that RAID. You should try to fit an average
  file in more than just one "physical block"...

> However, when I'm doing these tests, I have a bare minimum of reads,
> it's mostly all writes.

As mentioned above: this is due to the controller "hiding" real disk
operations away.

Hope this helps... and please send back results!

-- Adi
From: Torbjørn T. <tor...@tr...> - 2011-07-06 15:57:12
2011/7/6 Adi Kriegisch <ad...@cg...>:
> Hi!
[SNIP]
>> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
>> single block device.
>> The RAID itself is a RAID6 configuration, using default settings.
>> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.
> Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot
> see them.

I'm not too happy about this either. My intention from the start was to get
the RAID controller to just expose the disks, and let Linux handle the RAID
side of things. However, I was unsuccessful in convincing the RAID
controller to do so.

> Could you try to create the file system with "-E stride=16,stripe-width=16*(N-2)"
> where N is the number of disks in the array?

The RAID setup is 5 disks, so I guess that means 3 for data and 2 for
parity. I created the filesystem as you suggested; the resulting output from
mkfs was:

root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=16 blocks, Stripe width=48 blocks
1310720 inodes, 5242880 blocks

I then mounted the newly created filesystem on the server and gave it a run
with bonnie. Bonnie reported a sequential writing rate of ~225 MB/s, down
from ~370 MB/s with the default settings. When I exported it using AoE, the
throughput on the client was ~60 MB/s, down from ~70 MB/s.

So these particular settings for the filesystem don't seem to be right on
the money, but I guess it's a matter of tuning them. I didn't see a massive
increase in read operations with these settings, but I guess there was a bit
more read activity going on.

[SNIP]

I haven't investigated this level of detail in storage before, so this is
the first time I'm tuning a system like this for production. I'll read up
and try to see if I can't get all these settings to align.

Thanks, I appreciate the help from you and all the others who have been very
helpful here on aoetools-discuss.

What I'm not quite understanding is how exporting a device via AoE would
introduce new alignment problems or similar. When I can write to the local
filesystem at ~370 MB/s, what kind of problem is introduced by using AoE or
another network storage solution?

I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw the
exact same ~70 MB/s throughput there, so I guess this isn't related to AoE
in itself.

--
Vennlig hilsen
Torbjørn Thorsen
Utvikler / driftstekniker
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
From: Adi K. <ad...@cg...> - 2011-07-06 16:51:37
Hi!

>>> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
>>> single block device.
[SNIP]

> I'm not too happy about this either.
> My intention from the start was to get the RAID controller to just
> expose the disks, and let Linux handle the RAID side of things.
> However, I was unsuccessful in convincing the RAID controller to do so.

Too bad... I'd prefer a Linux software RAID too... By the way, there are
hardware RAID management tools available for Linux. You probably want to
check out http://hwraid.le-vert.net/wiki.

> The RAID setup is 5 disks, so I guess that means 3 for data and 2 for parity.

Correct.

> I created the filesystem as you suggested, the resulting output from mkfs was:
> root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest
[SNIP]
> I then mounted the newly created filesystem on the server and gave it
> a run with bonnie.
> Bonnie reported a sequential writing rate of ~225 MB/s, down from ~370 MB/s
> with the default settings.
>
> When I exported it using AoE, the throughput on the client was ~60
> MB/s, down from ~70 MB/s.

The values you used are correct for 3 data disks with a 64K chunk size.
Probably this issue is related to a misalignment of LVM. LVM adds a header
which has a default size of 192K -- that would perfectly match your RAID:
3*64K = 192K... but the default "physical extent" size does not match your
RAID: 4MB cannot be divided by 192K: (4*1024)/192 = 21.333. That means your
LVM chunks aren't properly aligned -- and I doubt you can align them,
because the physical extent size needs to be a power of two and > 1K, while
being aligned with the RAID means being divisible by 192. The only way out
could be to change the number of disks in the array to 4 or 6. :-(

Could you just once try to use the raw device with the above stride and
stripe-width values (without LVM in between)?

> Thanks, I appreciate the help from you and all the others
> who have been very helpful here on aoetools-discuss.

You're welcome! And thank you very much for always reporting back the
results.

> What I'm not quite understanding is how exporting a device via AoE
> would introduce new alignment problems or similar.
> When I can write to the local filesystem at ~370 MB/s, what kind of
> problem is introduced by using AoE or other network storage solution ?
>
> I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw the
> exact same ~70 MB/s throughput there, so I guess this isn't related to
> AoE in itself.

There are two root causes for these issues:
* SAN protocols force a "commit" of unwritten data -- be it a "sync", direct
  I/O or whatever -- way more often than local disks, for the sake of data
  integrity. (Actually, write barriers should be enabled for all those AoE
  devices -- especially with newer kernels.)
* AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits
  into a jumbo frame). So all I/O is aligned around this size. When using a
  filesystem like ext4 or xfs one can influence the block sizes by creating
  the file system properly.

And now for some ASCII art. Let's say a simple hard disk has the following
physical blocks:

+----+----+----+----+----+----+----+----+----+----+-..-+
|  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | .. |
+----+----+----+----+----+----+----+----+----+----+-..-+

Then a RAID5 with a chunk size of 2 hard disk blocks, consisting of 4 disks,
looks like this (D1 1-2 means disk 1, blocks 1 and 2):

+---------+---------+---------+---------+---------+-..-+
| D1 1-2  | D2 1-2  | D3 1-2  | D4 1-2  | D1 3-4  | .. |
+---------+---------+---------+---------+---------+-..-+
\------------ DATA -----------/\-PARITY-/
 \                                      /
  \--------- RAID block 1 -------------/ \-------- ..

One data block of this RAID can only be written as a whole. So whenever only
one bit within that block changes, the whole block has to be written again
(because the checksum is only valid for the block as a whole). Now imagine
you have an LVM header that is half the size of a RAID block: it will fill
the first half of the block, and the first LVM volume will then fill the
rest of the first block, plus some more blocks and a half one at the end.
Write operations are not aligned then and cause massive rewrites in the
backend.

From my point of view there are several ways to find the root cause of the
issues:
* try a different RAID level (like 10 or so)
* (re)-try to export the disks to Linux as JBODs
* try different filesystem and lvm parameters (actually you'd better write
  a script for that... ;-)

And, let us know about the results!

Thanks,
Adi
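[Editorial note: the misalignment claim above is easy to verify with shell arithmetic. A minimal sketch using the thread's numbers (5-disk RAID6, 64K chunks, default 4 MiB LVM physical extents):]

```shell
# Does a default 4 MiB LVM physical extent align with a 3*64K full stripe?
CHUNK_KB=64
DATA_DISKS=3                             # 5 disks in RAID6 -> 3 data disks
STRIPE_KB=$((DATA_DISKS * CHUNK_KB))
PE_KB=$((4 * 1024))                      # default LVM physical extent: 4 MiB
echo "full stripe: ${STRIPE_KB}K, PE: ${PE_KB}K"
echo "remainder: $((PE_KB % STRIPE_KB))K"    # non-zero -> extents drift out
                                             # of stripe alignment
```

With 4 or 6 disks the data stripe becomes 128K or 256K, both of which divide 4 MiB evenly, which is why changing the disk count is suggested above.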
From: Tracy R. <tr...@ul...> - 2011-07-06 17:45:09
On Wed, Jul 06, 2011 at 06:51:07PM +0200, Adi Kriegisch spake thusly:
> (actually write barriers should be enabled for all those AoE devices --
> especially with newer kernels.)

How?

> One data block of this RAID can only be written at once. So whenever only
> one bit within that block changes, the whole block has to be written again

The alignment issues at every layer of the storage system have always been
my biggest hassle in dealing with SANs.

--
Tracy Reed Digital signature attached for your safety.
Copilotco Professionally Managed PCI Compliant Secure Hosting
866-MY-COPILOT x101 http://copilotco.com
From: Adi K. <ad...@cg...> - 2011-07-07 08:40:14
Hi!

>> (actually write barriers should be enabled for all those AoE devices --
>> especially with newer kernels.)
>
> How?

The default behavior depends on the kernel version and the vendor (Red Hat
is said to disable barrier support for local file systems on recent
kernels). Between 2.6.31 and 2.6.33 most/all devices gained proper barrier
support (which of course made disk access in most/all cases slower). In case
barrier support for the underlying device is available, the mount option
"barrier" can be used to enable or disable it. You can, for example, disable
barrier support with this command:

mount -o remount,barrier=0 /mount/point

For mounting file systems over a SAN protocol like AoE or iSCSI I'd strongly
recommend using write barriers. Due to the higher latency of those
protocols, ending up with a broken filesystem and lost data is way more
likely.

>> One data block of this RAID can only be written at once. So whenever only
>> one bit within that block changes, the whole block has to be written again
>
> The alignment issues at every layer of the storage system have always been
> my biggest hassle in dealing with SANs.

Sigh. Yeah... it is not so easy to deal with that. I'm struggling myself
from time to time. ;-) Probably time to write a complete tutorial on how to
deal with alignment?! -- any volunteers?? :-)

-- Adi
From: Gabor G. <go...@di...> - 2011-07-07 19:42:08
|
On Wed, Jul 06, 2011 at 06:51:07PM +0200, Adi Kriegisch wrote: > * AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits > into a jumbo frame). So all I/O is aligned around this size. When using a > filesystem like ext4 or xfs one can influence the block sizes by creating > the file system properly. No, AoE has no block size. It will cram as many sectors as it can into a packet; e.g. if the MTU is 9000, then 17 sectors fit inside it, which does not play well with any kind of alignment. [...] > From my point of view there are several ways to find the root cause of the > issues: > * try a different RAID level (like 10 or so) > * (re)-try to export the disks to Linux as JBODs. > * try different filesystem and lvm parameters (actually you better write a > script for that... ;-) And if you insist on using parity RAID (i.e. RAID5 or RAID6), then make sure the number of data disks is a power of two. That makes computing various alignments much easier. Gabor |
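Gabor's 17-sector figure can be reproduced with a little arithmetic. The 22-byte header size assumed below (10-byte AoE common header plus 12-byte ATA argument section) comes from the AoE specification, not from the thread:

```shell
# Whole 512-byte sectors carried by one AoE ATA frame at a given MTU.
# 22 = AoE common header (10 B) + ATA section (12 B); assumption per AoEr11.
mtu=9000
sectors=$(( (mtu - 22) / 512 ))
echo "$sectors"   # 17 -- an odd count, so requests straddle 4K boundaries
```

At the standard MTU of 1500 the same formula gives 2 sectors per frame.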
From: Jesse B. <bec...@ma...> - 2011-07-07 20:47:09
|
On Thu, Jul 07, 2011 at 03:41:59PM -0400, Gabor Gombas wrote: >On Wed, Jul 06, 2011 at 06:51:07PM +0200, Adi Kriegisch wrote: > >> * AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits >> into a jumbo frame). So all I/O is aligned around this size. When using a >> filesystem like ext4 or xfs one can influence the block sizes by creating >> the file system properly. > >No, AoE has no block size. It will cram as many sectors as it can into a >packet; e.g. if the MTU is 9000, then 17 sectors fit inside it, which >does not play well with any kind of alignment. So perhaps there's something to be gained from artificially lowering the MTU? >> > From my point of view there are several ways to find the root cause of the >> issues: >> * try a different RAID level (like 10 or so) >> * (re)-try to export the disks to Linux as JBODs. >> * try different filesystem and lvm parameters (actually you better write a >> script for that... ;-) > >And if you insist on using parity RAID (i.e. RAID5 or RAID6), then make >sure the number of data disks is a power of two. That makes computing >various alignments much easier. > >Gabor -- Jesse Becker NHGRI Linux support (Digicon Contractor) |
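Jesse's suggestion can be made concrete. Assuming ~22 bytes of AoE headers per frame (an assumption taken from the AoE spec, not from this thread), an MTU of 16*512 + 22 = 8214 bytes lets each full frame carry exactly 16 sectors, i.e. a power-of-two 8 KiB payload. The interface name is a placeholder and any actual gain is untested:

```shell
# Hypothetical tuning: 16 sectors * 512 B + 22 B of AoE headers = 8214-byte MTU.
# Both the NIC and the switch must accept this frame size.
ip link set dev eth1 mtu 8214
ip link show eth1    # verify the new MTU took effect
```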
From: Torbjørn T. <tor...@tr...> - 2011-07-11 10:16:54
|
2011/7/6 Adi Kriegisch <ad...@cg...>: > Hi! > >> >> I'm using an LSI MegaRAID SAS 9280 RAID controller that is exposing a >> >> single block device. >> >> The RAID itself is a RAID6 configuration, using default settings. >> >> MegaCLI says that the virtual drive has a "Strip Size" of 64KB. >> > Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot >> > see them. >> > >> >> I'm not too happy about this either. >> My intention at the start was to get the RAID controller to just >> expose the disks, >> and let Linux handle the RAID side of things. >> However, I was unsuccessful in convincing the RAID controller to do so. > Too bad... I'd prefer a Linux software RAID too... > btw. there are hw-raid management tools available for linux. You probably > want to check out http://hwraid.le-vert.net/wiki. > Unfortunately, there doesn't seem to be any free or open tool available for the line of cards I'm using. http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS >> > Could you try to create the file system with "-E stride=16,stripe-width=16*(N-2)" >> > where N is the number of disks in the array. There are plenty of sites out >> > there about finding good parameters for mkfs and RAID (like >> > http://www.altechnative.net/?p=96 or >> > http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example). >> > >> >> The RAID setup is 5 disks, so I guess that means 3 for data and 2 for parity. > correct. > >> I created the filesystem as you suggested; the resulting output from mkfs was: >> root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest > [SNIP] >> I then mounted the newly created filesystem on the server and gave it >> a run with bonnie. >> Bonnie reported a sequential writing rate of ~225 MB/s, down from ~370 MB/s with >> the default settings. >> >> When I exported it using AoE, the throughput on the client was ~60 >> MB/s, down from ~70 MB/s. > The values you used are correct for 3 data disks with 64K chunk size. 
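For reference, the stride and stripe-width quoted above follow mechanically from the chunk size and disk count; a sketch of the arithmetic, using the 5-disk RAID6 with 64K chunks discussed here:

```shell
chunk_kb=64    # RAID "Strip Size" reported by MegaCLI
block_kb=4     # ext4 block size
n_disks=5      # RAID6 -> two disks' worth of parity per stripe
parity=2

stride=$(( chunk_kb / block_kb ))              # one chunk, in fs blocks
stripe_width=$(( stride * (n_disks - parity) ))  # one full data stripe

echo "mkfs.ext4 -E stride=${stride},stripe-width=${stripe_width}"
```

This prints the same `stride=16,stripe-width=48` used in the test above.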
> Probably this issue is related to a misalignment of LVM. LVM adds a header > which has a default size of 192K -- that would perfectly match your > RAID: 3*64K = 192K... > but the default "physical extent" size does not match your RAID: 4MB cannot > be divided by 192K: (4*1024)/192 = 21.333. That means your LVM chunks > aren't properly aligned -- I doubt you can align them, because the > physical extent size needs to be a power of two and > 1K, and to be aligned > with the RAID it must be divisible by 192... The only way could be to change the > number of disks in the array to 4 or 6. :-( > Could you just once try to use the raw device with the above used stride > and stripe-width values? (without LVM in between) > I've reinstalled the server, so that I can easily try different configurations on the RAID controller. However, none of the settings I have tried goes any faster than 70 MB/s. I've tried adjusting the stripe size and creating filesystems accordingly, but I haven't seen any improvements in throughput. In my latest test, the RAID volume is just a simple 2-disk stripe. This volume is then exported directly with AoE, no LVM or mdadm. With this test I hoped to eliminate any problem related to having the RAID controller generate parity for unaligned writes. However, I'm still seeing writes of ~70 MB/s. I also tested the network with iperf, and iperf said it could copy at ~960 Mbit/s, as expected. >> Thanks, I appreciate the help from you and all the others >> who have been very helpful here on aoetools-discuss. > You're welcome! And thank you very much for always reporting back the > results. > >> What I'm not quite understanding is how exporting a device via AoE >> would introduce new alignment problems or similar. >> When I can write to the local filesystem at ~370 MB/s, what kind of >> problem is introduced by using AoE or another network storage solution ? 
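One way to sidestep the LVM header misalignment Adi describes, rather than changing the disk count, is to tell LVM where its data area may start. LVM2's pvcreate has a --dataalignment option for this; the device name below is a placeholder and this is only a sketch, assuming the 192 KiB (3 x 64 KiB) full stripe from the discussion above:

```shell
# Align the start of the LVM data area to a full 192 KiB RAID stripe so that
# physical extents begin on a stripe boundary, regardless of the header size
# (device path is hypothetical).
pvcreate --dataalignment 192k /dev/sdb

# pe_start should now be a multiple of 192 KiB:
pvs -o +pe_start /dev/sdb
```

Note the 4 MB extent size itself still isn't a multiple of 192K, so extent boundaries after the first will drift; this only fixes where the data area starts.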
>> >> I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw the >> exact same ~70 MB/s throughput there, so I guess this isn't related to >> AoE in itself. > There are two root causes for these issues: > * SAN protocols force a "commit" of unwritten data, be it a "sync", direct > i/o or whatever, way more often than local disks -- for the sake of data > integrity. (actually write barriers should be enabled for all those AoE > devices -- especially with newer kernels.) I guess this is different from doing everything with "sync" enabled, though? If I mount the filesystem with the "sync" option, I get a different throughput. > * AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits > into a jumbo frame). So all I/O is aligned around this size. When using a > filesystem like ext4 or xfs one can influence the block sizes by creating > the file system properly. > > And now for some ascii art: > let's say a simple hard disk has the following physical blocks:
> +----+----+----+----+----+----+----+----+----+----+-..-+
> |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | .. |
> +----+----+----+----+----+----+----+----+----+----+-..-+
>
> then a raid 5 with a chunk size of 2 harddisk blocks consisting of 4 disks
> looks like this (D1 1-2 means disk1 blocks 1 and 2):
> +---------+---------+---------+---------+---------+-..-+
> | D1 1-2  | D2 1-2  | D3 1-2  | D4 1-2  | D1 3-4  | .. |
> +---------+---------+---------+---------+---------+-..-+
>  \----------- DATA -----------/\-PARITY-/
>  \------------ RAID block 1 ------------/\------- ..
>
> One data block of this RAID can only be written at once. So whenever only > one bit within that block changes, the whole block has to be written again > (because the checksum is only valid for the block as a whole). 
> > Now imagine you have an lvm header that has half of the size of a > RAID block: it will fill the first half of the block, and the first lvm > volume will then fill the rest of the first block plus some more blocks and > a half at the end. Write operations are not aligned then and cause massive > rewrites in the backend. > > From my point of view there are several ways to find the root cause of the > issues: > * try a different RAID level (like 10 or so) > * (re)-try to export the disks to Linux as JBODs. > * try different filesystem and lvm parameters (actually you better write a > script for that... ;-) > > And, let us know about the results! > Thanks, > Adi > Thank you for that very thorough explanation, I've just learned a lot about I/O and alignment. As I mentioned, I have tried different configurations, trying to avoid any source of alignment issues. My last attempt has no parity in the RAID setup; the virtual device from the controller is partitioned and exported via AoE. With this setup, I get the same ~70 MB/s I have been fighting with for a while now. It seems curious to me that I get ~70 MB/s seemingly no matter what changes I make to the configuration, so I'm beginning to suspect my testing method is broken. -- Vennlig hilsen Torbjørn Thorsen Utvikler / driftstekniker Trollweb Solutions AS - Professional Magento Partner www.trollweb.no |
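Given the suspicion about the testing method, one sanity check is to make every run pay the same cache costs; otherwise the page cache can hide or inflate differences between configurations. A sketch, with placeholder paths and sizes:

```shell
# Write test: conv=fdatasync makes dd include the final flush in its timing,
# so the reported rate isn't inflated by data still sitting in the page cache.
dd if=/dev/zero of=/mnt/aoetest/bench.img bs=1M count=4096 conv=fdatasync

# Read test: drop the page cache first so reads actually traverse AoE
# instead of being served from RAM (requires root).
sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/aoetest/bench.img of=/dev/null bs=1M
```

The test file should be a few times larger than RAM for the read numbers to mean anything.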