Thread: [Aoetools-discuss] Throughput for raw AoE device versus filesystem
From: Torbjørn T. <tor...@tr...> - 2011-07-05 15:31:14
I'm setting up an AoE-based SAN, and I'm not quite sure I've reached a good
performance level.

I can read and write the raw AoE device (/dev/etherd/*) at more or less
line speed on my 1-gig Ethernet adapters. This means I'm seeing I/O rates of
100 to 120 MB/s when using dd or something similar. However, when I put a
filesystem on there, I'm seeing rates of 55 to 70 MB/s. I've tested mostly
with rsync, cp or dd, but I tried bonnie and saw much the same results. I've
been testing mostly with ext4, but I saw pretty much the same performance
with ext3.

Since I'm testing sequential reads and writes, I was expecting the filesystem
performance to be closer to line speed than what I'm seeing now. Thanks to
aoetools-discuss, I think I've got a pretty good configuration going, with
MTU at 9000, flow control on the switch and some kernel tuning. Since I'm
seeing line speed when using the device directly, I guess this means that
the configuration is more or less okay.

What kind of performance are you guys seeing on your filesystems when using
1-gig Ethernet adapters?

--
Vennlig hilsen
Torbjørn Thorsen
Utvikler / driftstekniker
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
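[Editorial note: one common pitfall with dd-style filesystem benchmarks is that the page cache absorbs the writes, so the reported rate can be higher or lower than the real device throughput depending on when flushing happens. A minimal sketch of a more comparable test, using conv=fdatasync so the timing includes flushing data to the device; the path is a placeholder, not from the thread:]

```shell
# Write 64 MiB through the filesystem, counting the final fdatasync in the
# elapsed time, so the result is comparable to a raw-device dd.
dd if=/dev/zero of=/tmp/aoe-fs-test.img bs=1M count=64 conv=fdatasync 2>&1 |
    tail -n 1                       # dd's summary line: bytes, seconds, MB/s
stat -c %s /tmp/aoe-fs-test.img     # confirm all 64 MiB landed on disk
rm -f /tmp/aoe-fs-test.img
```

For reads, dropping the cache first (echo 3 > /proc/sys/vm/drop_caches, as root) avoids measuring the cache instead of the disk.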
From: Tracy R. <tr...@ul...> - 2011-07-05 18:29:06
On Tue, Jul 05, 2011 at 05:03:40PM +0200, Torbjørn Thorsen spake thusly:
> I'm setting up a AoE-based SAN, and I'm not quite sure I've reached a
> good performance level.
>
> I can read and write the raw AoE device (/dev/etherd/*) at more or
> less line-speed on my 1gig Ethernet adapters.
> This means I'm seeing I/O rates of 100 to 120 MB/s when using dd or
> something similar.

This is in line with what I get also. Sounds like your performance level is
as expected (very good).

> However, when I put a filesystem on there, I'm seeing rates of 55 to 70 MB/s.
> I've tested mostly by using rsync, cp or dd, but I tried bonnie and
> saw much the same results.

Yep. You are most likely running into physical limitations of the disk.

> Since I'm seeing line-speed when using the device directly, I guess this means
> that the configuration is more or less okay.

Yep.

> What kind of performance are you guys seeing on your filesystems when
> using 1gig Ethernet adapters ?

The speed of the network is not nearly as important as the speed of the disk
hardware. I get performance similar to yours when doing streaming
reads/writes to at least two disks. A single 7200 rpm drive can typically do
70 MB/s, so you usually need to gang up at least two of these in a mirror or
stripe. Many more smaller disks are necessary for higher IOPS. Fortunately,
this is a problem completely independent of AoE, so lots of people know how
to solve it.

These days I deploy SuperMicro 24-bay 2.5" servers stuffed full of 10k RPM
disks. This seems to get me the most reasonable bang/buck while providing
the kind of IOPS I need to run databases, mail servers, etc. The giant/cheap
2T disks you can buy these days are great for archival and backup storage,
but for actual data processing the advice has been the same for many years:
throw lots of spindles at the problem.

--
Tracy Reed Digital signature attached for your safety.
Copilotco Professionally Managed PCI Compliant Secure Hosting
866-MY-COPILOT x101 http://copilotco.com
From: Torbjørn T. <tor...@tr...> - 2011-07-06 09:17:55
2011/7/5 Tracy Reed <tr...@ul...>:
> On Tue, Jul 05, 2011 at 05:03:40PM +0200, Torbjørn Thorsen spake thusly:
>> I can read and write the raw AoE device (/dev/etherd/*) at more or
>> less line-speed on my 1gig Ethernet adapters.
>> This means I'm seeing I/O rates of 100 to 120 MB/s when using dd or
>> something similar.
>
> This is in line with what I get also. Sounds like your performance level is as
> expected (very good).
>
>> However, when I put a filesystem on there, I'm seeing rates of 55 to 70 MB/s.
>> I've tested mostly by using rsync, cp or dd, but I tried bonnie and
>> saw much the same results.
>
> Yep. You are most likely running into physical limitations of the disk.

I should have mentioned that the AoE device is backed by a RAID setup that
is able to write well above 120 MB/s. If I mount the same filesystem
locally, on the server, bonnie tells me it's able to do sequential writes at
~370 MB/s.

If I write straight to the AoE device, I can get the expected line speed of
the network, around ~110 MB/s:

dd if=/dev/zero of=/dev/etherd/e1.1 bs=1M

However, when mounting a filesystem and copying a file onto the AoE device,
I only see about ~70 MB/s. This leads me to think that the performance
degradation I'm seeing is related to the filesystem or the network. Of
course, I wouldn't expect a filesystem to give the same performance as the
raw device, but I didn't expect to see a ~25% hit in performance, especially
when doing a sequential write.

--
Vennlig hilsen
Torbjørn Thorsen
Utvikler / driftstekniker
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
From: Gabor G. <go...@di...> - 2011-07-11 20:15:21
On Mon, Jul 11, 2011 at 12:16:44PM +0200, Torbjørn Thorsen wrote:
> With this setup, I get the same ~70 MB/s I have been fighting with for
> a while now.
> It seems curious to me that I get ~70 MB/s seemingly no matter what changes
> I make to the configuration, so I'm beginning to suspect my testing
> method is broken.

Try ext2. More advanced file systems need to sync from time to time to
ensure your data is safe. Since the AoE protocol does not support barriers,
and AFAIK support for the FLUSH ATA command was never implemented, the
client kernel can do just one thing: stop sending new commands, wait until
all pending commands finish, and really-really hope that the server did
commit the data to disk, even though it got no indication that it did so.
This means that most file system operations (especially those involving
metadata) will insert "gaps" into the data stream.

So when you're using a file system, you will never be able to reach the
performance of the raw device or the network. If you have many clients, then
the fact that one of them can't saturate the server is probably not that
important. If the performance of a single client is important, then try
iSCSI.

Gabor
From: Tracy R. <tr...@ul...> - 2011-07-13 04:53:16
On Mon, Jul 11, 2011 at 10:15:12PM +0200, Gabor Gombas spake thusly:
> Since the AoE protocol does not support barriers, and AFAIK support for the
> FLUSH ATA command was never implemented, the client kernel can do just one
> thing: stop sending new commands, wait until all pending commands finish, and
> really-really hope that the server did commit the data to disk, even if it
> got no indication to do so.

Are there any plans to fix this? Is it even technically possible? It seems
that Coraid would want to remove any doubt about using AoE for "enterprise"
use.

This article has a good explanation of the history of write barriers in
Linux: http://lwn.net/Articles/283161/. My understanding is that until
recently Red Hat had been turning off write barriers in the kernel anyway,
yet people still ran their journalling filesystems, databases, etc. just
fine. RHEL6 seems to have write barriers enabled for all filesystems that
support them.

A sync should ensure consistency as long as the backing disk system actually
gets the data onto disk during the sync, right? My datacenter has had power
issues lately, so I am paying careful attention to this sort of thing.

--
Tracy Reed Digital signature attached for your safety.
Copilotco Professionally Managed PCI Compliant Secure Hosting
866-MY-COPILOT x101 http://copilotco.com
From: Adi K. <ad...@cg...> - 2011-07-06 10:09:53
Hi!

>> I should have mentioned that the AoE device is backed by a RAID setup
>> that is able to write well above 120 MB/s.
>> If I mount the same filesystem locally, on the server, bonnie tells me
>> it's able to do sequential writes at ~370 MB/s.
>> [SNIP]
>> Of course, I wouldn't expect a filesystem to give the same performance
>> as the raw device, but I didn't expect to see a ~25% hit in performance,
>> especially when doing a sequential write.
> What filesystem do you use? XFS is known to be the recommended
> filesystem for AoE.

Actually I think this could be due to RAID block sizes: most AoE
implementations assume a block size of 512 bytes. If you're using a Linux
software RAID5 with a default chunk size of 512K and you're using 4 disks, a
single "block" has a 3*512K block size. This is what has to be written when
changing data in a file, for example. mkfs.ext4 and mkfs.xfs respect those
block sizes, stride sizes, stripe widths and so on (see the man pages) when
the information is available -- which is not the case when creating a file
system on an AoE device.

Checking if you're hit by this is quite simple: install dstat or iostat on
the server exporting the volume. Run your benchmark and watch the output of
dstat/iostat: if you see massive reads while writing, congrats, you found
the root cause. To improve things a little, create the file system on the
server that is exporting the AoE targets. To improve them even more --
especially with RAID5 and RAID6 -- choose a smaller chunk size.

I'd be glad if you could post back some numbers... :-)

On a side note: linear performance isn't what counts when using network
storage. You'd better measure IOPS (input/output operations per second). I
use fio for benchmarks, which lets you define your I/O patterns to (kind of)
fit real-world usage.

-- Adi
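[Editorial note: the stride/stripe-width values discussed above follow from simple arithmetic: stride is the RAID chunk size divided by the filesystem block size, and stripe-width is stride times the number of data disks. A minimal sketch using the 4-disk RAID5 / 512K chunk example from this message; /dev/XXX is a placeholder:]

```shell
# Derive ext4 stride/stripe-width from RAID geometry (values assumed from
# the RAID5 example above; all sizes in KiB).
CHUNK_KB=512                          # RAID chunk size
BLOCK_KB=4                            # ext4 block size
DISKS=4                               # total disks in the RAID5 array
DATA_DISKS=$((DISKS - 1))             # RAID5: one disk's worth of parity
STRIDE=$((CHUNK_KB / BLOCK_KB))
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))
echo "mkfs.ext4 -E stride=$STRIDE,stripe-width=$STRIPE_WIDTH /dev/XXX"
# -> mkfs.ext4 -E stride=128,stripe-width=384 /dev/XXX
```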
From: Torbjørn T. <tor...@tr...> - 2011-07-06 12:41:19
On Wed, Jul 6, 2011 at 11:51, Adi Kriegisch <ad...@cg...> wrote:
> Hi!
[SNIP]
>> What filesystem do you use? XFS is known to be the recommended
>> filesystem for AoE.
> Actually I think this could be due to RAID block sizes: most AoE
> implementations assume a block size of 512Byte. If you're using a linux
> software RAID5 with a default chunk size of 512K and you're using 4 disks,
> a single "block" has 3*512K block size. This is what has to be written when
> changing data in a file for example.
> mkfs.ext4 or mkfs.xfs respects those block sizes, stride sizes, stripe
> width and so on (see man pages) when the information is available (which is
> not the case when creating a file system on an AoE device.

I'm using an LSI MegaRAID SAS 9280 RAID controller that is exposing a single
block device. The RAID itself is a RAID6 configuration, using default
settings. MegaCLI says that the virtual drive has a "Strip Size" of 64KB.
The virtual device from the RAID controller is used as a physical volume for
LVM, and the exported AoE devices are LVM logical volumes cut from this
physical volume.

It seems I get the same filesystem settings whether I create the filesystem
right on the LVM volume or on the AoE volume.

Creating it on the server, mkfs.ext4 says:

root@storage01:~# mkfs.ext4 /dev/aoepool0/aoetest
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1310720 inodes, 5242880 blocks

Creating it on the client, mkfs.ext4 says:

root@xen08:/home/torbjorn# mkfs.ext4 /dev/etherd/e7.1
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
1310720 inodes, 5242880 blocks

Using both of these filesystems on the client, I end up with pretty much the
same transfer rate of about ~70 MB/s. Using it on the server, that is,
mounting the LVM volume directly, I get the much preferable ~370 MB/s.

> To check if you're hit by this is quite simple: install dstat or iostat on
> the server exporting the volume. Run your benchmark and watch the output of
> dstat/iostat: if you experience massive reads while writing, congrats, you
> found the root cause.
[SNIP]
> I'd be glad if you could post back some numbers... :-)

I have iostat running continually, and I have seen that "massive read"
problem earlier. However, when I'm doing these tests, there is a bare
minimum of reads; it's mostly all writes. The "%util" column from iostat is
mostly around ~10%, while at some intervals peaking towards 100%. I'm
guessing there is some cache flushing going on when I'm seeing those spikes.
This is on the server; the client chugs stably along at ~70 MB/s.

> On a side note: linear performance isn't what is counting when using
> network storage. You better measure iops (input/output operations per
> second). I use fio for benchmarks which lets you define your I/O patterns
> to (kind of) fit real world usage.
>
> -- Adi
>
> ------------------------------------------------------------------------------
> All of the data generated in your IT infrastructure is seriously valuable.
> Why? It contains a definitive record of application performance, security
> threats, fraudulent activity, and more. Splunk takes this data and makes
> sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-d2d-c2
> _______________________________________________
> Aoetools-discuss mailing list
> Aoe...@li...
> https://lists.sourceforge.net/lists/listinfo/aoetools-discuss

--
Vennlig hilsen
Torbjørn Thorsen
Utvikler / driftstekniker
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
From: Adi K. <ad...@cg...> - 2011-07-06 13:59:35
Hi!

[SNIP]
> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
> single block device.
> The RAID itself is a RAID6 configuration, using default settings.
> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.

Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot
see them.

> It seems I get the same filesystem settings if I create the filesystem
> right on the LVM volume, or if I create it on the AoE volume.

Hmmm... that means that the controller does not expose its chunk size to the
operating system. The most important parameters here are:
* stride = number of file system blocks per RAID chunk (chunk-size/block-size)
* stripe-width = number of strides in one data stripe of the RAID

Could you try to create the file system with
"-E stride=16,stripe-width=16*(N-2)", where N is the number of disks in the
array? There are plenty of sites out there about finding good parameters for
mkfs and RAID (like http://www.altechnative.net/?p=96 or
http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example).

> I have iostat running continually, and I have seen that "massive read"
> problem earlier.

The "problem" with AoE (or whatever intermediate network protocol -- iSCSI,
FCoE, ... -- you use) is that it needs to force writes to happen. The Linux
kernel tries to assume the physical layout of the underlying disk, at least
by using the file system layout on disk, and tries to write one "physical
block" at a time. (blockdev --report /dev/sdX reports what the kernel thinks
the physical layout looks like.)

Let's assume you have 6 disks in RAID6: 4 disks contain data and the chunk
size is 64K, so one "physical block" has a size of 4*64K = 256K. The file
systems you created had a block size of 4K -- so in case AoE forces the
kernel to commit every 4K, the RAID controller needs to read 256K, update
4K, recalculate checksums and write 256K again. This is what is behind the
"massive read" issue.

Write rates should improve by creating the file system with the correct
stride size and stripe width. But there are other factors as well:
* You're using LVM (which is an excellent tool). You need to create your
  physical volumes with parameters that fit your RAID too. That is, use
  "--dataalignmentoffset" and "--dataalignment". (The issue with LVM is that
  it exports "physical extents" which need to be aligned to your RAID's
  stripe boundaries. For testing purposes you might start without LVM and
  try to align and export the filesystem via AoE first. That way you get
  better reference numbers for further experiments.)
* For real-world scenarios it might be a better idea to recreate the RAID
  with a smaller chunk size. This -- of course -- depends on what kind of
  files you intend to store on that RAID. You should try to fit an average
  file in more than just one "physical block"...

> However, when I'm doing these tests, I have a bare minimum of reads,
> it's mostly all writes.

As mentioned above: this is due to the controller "hiding" real disk
operations away.

Hope this helps... and please send back results!

-- Adi
From: Torbjørn T. <tor...@tr...> - 2011-07-06 15:57:12
2011/7/6 Adi Kriegisch <ad...@cg...>:
> Hi!
[SNIP]
>> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
>> single block device.
>> The RAID itself is a RAID6 configuration, using default settings.
>> MegaCLI says that the virtual drive has a "Strip Size" of 64KB.
> Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot
> see them.

I'm not too happy about this either. My intention from the start was to get
the RAID controller to just expose the disks, and let Linux handle the RAID
side of things. However, I was unsuccessful in convincing the RAID
controller to do so.

> Could you try to create the file system with "-E stride=16,stripe-width=16*(N-2)"
> where N is the number of disks in the array?

The RAID setup is 5 disks, so I guess that means 3 for data and 2 for
parity. I created the filesystem as you suggested; the resulting output from
mkfs was:

root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=16 blocks, Stripe width=48 blocks
1310720 inodes, 5242880 blocks

I then mounted the newly created filesystem on the server and gave it a run
with bonnie. Bonnie reported a sequential writing rate of ~225 MB/s, down
from ~370 MB/s with the default settings. When I exported it using AoE, the
throughput on the client was ~60 MB/s, down from ~70 MB/s.

So these particular settings for the filesystem don't seem to be right on
the money, but I guess it's a matter of tuning them. I didn't see a massive
increase in read operations with these settings, but I guess there was a bit
more read activity going on.

[SNIP]

I haven't investigated this level of detail in storage before, so this is
the first time I'm tuning a system like this for production. I'll read up
and try to see if I can't get all these settings to align.

Thanks, I appreciate the help from you and all the others who have been very
helpful here on aoetools-discuss.

What I'm not quite understanding is how exporting a device via AoE would
introduce new alignment problems or similar. When I can write to the local
filesystem at ~370 MB/s, what kind of problem is introduced by using AoE or
another network storage solution?

I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw the
exact same ~70 MB/s throughput there, so I guess this isn't related to AoE
in itself.

--
Vennlig hilsen
Torbjørn Thorsen
Utvikler / driftstekniker
Trollweb Solutions AS - Professional Magento Partner
www.trollweb.no
From: Adi K. <ad...@cg...> - 2011-07-06 16:51:37
Hi!

>>> I'm using a LSI MegaRAID SAS 9280 RAID controller that is exposing a
>>> single block device.
[SNIP]

> I'm not too happy about this either.
> My intention from the start was to get the RAID controller to just
> expose the disks, and let Linux handle the RAID side of things.
> However, I was unsuccessful in convincing the RAID controller to do so.

Too bad... I'd prefer a Linux software RAID too... By the way, there are
hardware RAID management tools available for Linux. You probably want to
check out http://hwraid.le-vert.net/wiki.

> The RAID setup is 5 disks, so I guess that means 3 for data and 2 for parity.

Correct.

> I created the filesystem as you suggested, the resulting output from mkfs was:
> root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest
[SNIP]
> I then mounted the newly created filesystem on the server and gave it
> a run with bonnie.
> Bonnie reported a sequential writing rate of ~225 MB/s, down from ~370 MB/s
> with the default settings.
>
> When I exported it using AoE, the throughput on the client was ~60
> MB/s, down from ~70 MB/s.

The values you used are correct for 3 data disks with a 64K chunk size.
Probably this issue is related to a misalignment of LVM. LVM adds a header
which has a default size of 192K -- that would perfectly match your RAID:
3*64K = 192K... but the default "physical extent" size does not match your
RAID: 4MB cannot be divided by 192K: (4*1024)/192 = 21.333. That means your
LVM chunks aren't properly aligned -- and I doubt you can align them,
because the physical extent size needs to be a power of two and > 1K, while
being aligned with the RAID means being divisible by 192. The only way out
could be to change the number of disks in the array to 4 or 6. :-(

Could you just once try to use the raw device with the above stride and
stripe-width values (without LVM in between)?

> Thanks, I appreciate the help from you and all the others
> who have been very helpful here on aoetools-discuss.

You're welcome! And thank you very much for always reporting back the
results.

> What I'm not quite understanding is how exporting a device via AoE
> would introduce new alignment problems or similar.
> When I can write to the local filesystem at ~370 MB/s, what kind of
> problem is introduced by using AoE or other network storage solution ?
>
> I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw the
> exact same ~70 MB/s throughput there, so I guess this isn't related to
> AoE in itself.

There are two root causes for these issues:
* SAN protocols force a "commit" of unwritten data -- be it a "sync", direct
  I/O or whatever -- way more often than local disks, for the sake of data
  integrity. (Actually, write barriers should be enabled for all those AoE
  devices -- especially with newer kernels.)
* AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits
  into a jumbo frame). So all I/O is aligned around this size. When using a
  filesystem like ext4 or xfs one can influence the block sizes by creating
  the file system properly.

And now for some ASCII art. Let's say a simple hard disk has the following
physical blocks:

+----+----+----+----+----+----+----+----+----+----+-..-+
|  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | .. |
+----+----+----+----+----+----+----+----+----+----+-..-+

Then a RAID5 with a chunk size of 2 hard disk blocks, consisting of 4 disks,
looks like this (D1 1-2 means disk 1, blocks 1 and 2):

+---------+---------+---------+---------+---------+-..-+
| D1 1-2  | D2 1-2  | D3 1-2  | D4 1-2  | D1 3-4  | .. |
+---------+---------+---------+---------+---------+-..-+
\------------ DATA -----------/\-PARITY-/
 \                                      /
  \--------- RAID block 1 -------------/ \-------- ..

One data block of this RAID can only be written as a whole. So whenever only
one bit within that block changes, the whole block has to be written again
(because the checksum is only valid for the block as a whole). Now imagine
you have an LVM header that is half the size of a RAID block: it will fill
the first half of the block, and the first LVM volume will then fill the
rest of the first block, plus some more blocks and a half one at the end.
Write operations are not aligned then and cause massive rewrites in the
backend.

From my point of view there are several ways to find the root cause of the
issues:
* try a different RAID level (like 10 or so)
* (re)-try to export the disks to Linux as JBODs
* try different filesystem and lvm parameters (actually you'd better write
  a script for that... ;-)

And, let us know about the results!

Thanks,
Adi
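[Editorial note: the misalignment claim above is easy to verify with shell arithmetic. A minimal sketch using the thread's numbers (5-disk RAID6, 64K chunks, default 4 MiB LVM physical extents):]

```shell
# Does a default 4 MiB LVM physical extent align with a 3*64K full stripe?
CHUNK_KB=64
DATA_DISKS=3                             # 5 disks in RAID6 -> 3 data disks
STRIPE_KB=$((DATA_DISKS * CHUNK_KB))
PE_KB=$((4 * 1024))                      # default LVM physical extent: 4 MiB
echo "full stripe: ${STRIPE_KB}K, PE: ${PE_KB}K"
echo "remainder: $((PE_KB % STRIPE_KB))K"    # non-zero -> extents drift out
                                             # of stripe alignment
```

With 4 or 6 disks the data stripe becomes 128K or 256K, both of which divide 4 MiB evenly, which is why changing the disk count is suggested above.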
From: Tracy R. <tr...@ul...> - 2011-07-06 17:45:09
On Wed, Jul 06, 2011 at 06:51:07PM +0200, Adi Kriegisch spake thusly:
> (actually write barriers should be enabled for all those AoE devices --
> especially with newer kernels.)

How?

> One data block of this RAID can only be written at once. So whenever only
> one bit within that block changes, the whole block has to be written again

The alignment issues at every layer of the storage system have always been
my biggest hassle in dealing with SANs.

--
Tracy Reed Digital signature attached for your safety.
Copilotco Professionally Managed PCI Compliant Secure Hosting
866-MY-COPILOT x101 http://copilotco.com
From: Adi K. <ad...@cg...> - 2011-07-07 08:40:14
Hi!

>> (actually write barriers should be enabled for all those AoE devices --
>> especially with newer kernels.)
>
> How?

The default behavior depends on the kernel version and the vendor (Red Hat
is said to disable barrier support for local file systems on recent
kernels). Between 2.6.31 and 2.6.33 most/all devices gained proper barrier
support (which of course made disk access in most/all cases slower). In case
barrier support for the underlying device is available, the mount option
"barrier" can be used to enable or disable it. You can, for example, disable
barrier support with this command:

mount -o remount,barrier=0 /mount/point

For mounting file systems over a SAN protocol like AoE or iSCSI I'd strongly
recommend using write barriers. Due to the higher latency of those
protocols, ending up with a broken filesystem and lost data is way more
likely.

>> One data block of this RAID can only be written at once. So whenever only
>> one bit within that block changes, the whole block has to be written again
>
> The alignment issues at every layer of the storage system have always been
> my biggest hassle in dealing with SANs.

Sigh. Yeah... it is not so easy to deal with that. I'm struggling myself
from time to time. ;-) Probably time to write a complete tutorial on how to
deal with alignment?! -- any volunteers?? :-)

-- Adi
From: Gabor G. <go...@di...> - 2011-07-07 19:42:08
|
On Wed, Jul 06, 2011 at 06:51:07PM +0200, Adi Kriegisch wrote: > * AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits > into a jumbo frame). So all I/O is aligned around this size. When using a > filesystem like ext4 or xfs one can influence the block sizes by creating > the file system properly. No, AoE has no block size. It will cram as many sectors as it can into a packet; e.g. if the MTU is 9000, then 17 sectors fit inside it, which does not play well with any kind of alignment. [...] > From my point of view there are several ways to find the root cause of the > issues: > * try a different RAID level (like 10 or so) > * (re)-try to export the disks to Linux as JBODs. > * try different filesystem and lvm parameters (actually you better write a > script for that... ;-) And if you insist on using parity RAID (i.e. RAID5 or RAID6), then make sure the number of data disks is a power of two. That makes computing various alignments much easier. Gabor |
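Gabor's 17-sector figure can be reproduced with a little arithmetic. The 22-byte header size assumed below (10-byte AoE common header plus 12-byte ATA argument section) comes from the AoE specification, not from the thread:

```shell
# Whole 512-byte sectors carried by one AoE ATA frame at a given MTU.
# 22 = AoE common header (10 B) + ATA section (12 B); assumption per AoEr11.
mtu=9000
sectors=$(( (mtu - 22) / 512 ))
echo "$sectors"   # 17 -- an odd count, so requests straddle 4K boundaries
```

At the standard MTU of 1500 the same formula gives 2 sectors per frame.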
From: Jesse B. <bec...@ma...> - 2011-07-07 20:47:09
|
On Thu, Jul 07, 2011 at 03:41:59PM -0400, Gabor Gombas wrote: >On Wed, Jul 06, 2011 at 06:51:07PM +0200, Adi Kriegisch wrote: > >> * AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits >> into a jumbo frame). So all I/O is aligned around this size. When using a >> filesystem like ext4 or xfs one can influence the block sizes by creating >> the file system properly. > >No, AoE has no block size. It will cram as many sectors as it can into a >packet; e.g. if the MTU is 9000, then 17 sectors fit inside it, which >does not play well with any kind of alignment. So perhaps there's something to be gained from artificially lowering the MTU? >> > From my point of view there are several ways to find the root cause of the >> issues: >> * try a different RAID level (like 10 or so) >> * (re)-try to export the disks to Linux as JBODs. >> * try different filesystem and lvm parameters (actually you better write a >> script for that... ;-) > >And if you insist on using parity RAID (i.e. RAID5 or RAID6), then make >sure the number of data disks is a power of two. That makes computing >various alignments much easier. > >Gabor -- Jesse Becker NHGRI Linux support (Digicon Contractor) |
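Jesse's suggestion can be made concrete. Assuming ~22 bytes of AoE headers per frame (an assumption taken from the AoE spec, not from this thread), an MTU of 16*512 + 22 = 8214 bytes lets each full frame carry exactly 16 sectors, i.e. a power-of-two 8 KiB payload. The interface name is a placeholder and any actual gain is untested:

```shell
# Hypothetical tuning: 16 sectors * 512 B + 22 B of AoE headers = 8214-byte MTU.
# Both the NIC and the switch must accept this frame size.
ip link set dev eth1 mtu 8214
ip link show eth1    # verify the new MTU took effect
```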
From: Torbjørn T. <tor...@tr...> - 2011-07-11 10:16:54
|
2011/7/6 Adi Kriegisch <ad...@cg...>: > Hi! > >> >> I'm using an LSI MegaRAID SAS 9280 RAID controller that is exposing a >> >> single block device. >> >> The RAID itself is a RAID6 configuration, using default settings. >> >> MegaCLI says that the virtual drive has a "Strip Size" of 64KB. >> > Hmmm... too bad, then reads are "hidden" by the controller and Linux cannot >> > see them. >> > >> >> I'm not too happy about this either. >> My intention at the start was to get the RAID controller to just >> expose the disks, >> and let Linux handle the RAID side of things. >> However, I was unsuccessful in convincing the RAID controller to do so. > Too bad... I'd prefer a Linux software RAID too... > btw. there are hw-raid management tools available for linux. You probably > want to check out http://hwraid.le-vert.net/wiki. > Unfortunately, there doesn't seem to be any free or open tool available for the line of cards I'm using. http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS >> > Could you try to create the file system with "-E stride=16,stripe-width=16*(N-2)" >> > where N is the number of disks in the array. There are plenty of sites out >> > there about finding good parameters for mkfs and RAID (like >> > http://www.altechnative.net/?p=96 or >> > http://h1svr.anu.edu.au/wiki/H1DataStorage/MD1220Setup for example). >> > >> >> The RAID setup is 5 disks, so I guess that means 3 for data and 2 for parity. > correct. > >> I created the filesystem as you suggested; the resulting output from mkfs was: >> root@storage01:~# mkfs.ext4 -E stride=16,stripe-width=48 /dev/aoepool0/aoetest > [SNIP] >> I then mounted the newly created filesystem on the server and gave it >> a run with bonnie. >> Bonnie reported a sequential writing rate of ~225 MB/s, down from ~370 MB/s with >> the default settings. >> >> When I exported it using AoE, the throughput on the client was ~60 >> MB/s, down from ~70 MB/s. > The values you used are correct for 3 data disks with 64K chunk size. 
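For reference, the stride and stripe-width quoted above follow mechanically from the chunk size and disk count; a sketch of the arithmetic, using the 5-disk RAID6 with 64K chunks discussed here:

```shell
chunk_kb=64    # RAID "Strip Size" reported by MegaCLI
block_kb=4     # ext4 block size
n_disks=5      # RAID6 -> two disks' worth of parity per stripe
parity=2

stride=$(( chunk_kb / block_kb ))              # one chunk, in fs blocks
stripe_width=$(( stride * (n_disks - parity) ))  # one full data stripe

echo "mkfs.ext4 -E stride=${stride},stripe-width=${stripe_width}"
```

This prints the same `stride=16,stripe-width=48` used in the test above.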
> Probably this issue is related to a misalignment of LVM. LVM adds a header > which has a default size of 192K -- that would perfectly match your > RAID: 3*64K = 192K... > but the default "physical extent" size does not match your RAID: 4MB cannot > be divided by 192K: (4*1024)/192 = 21.333. That means your LVM chunks > aren't properly aligned -- I doubt you can align them, because the > physical extent size needs to be a power of two and > 1K, and to be aligned > with the RAID it must be divisible by 192... The only way could be to change the > number of disks in the array to 4 or 6. :-( > Could you just once try to use the raw device with the above used stride > and stripe-width values? (without LVM in between) > I've reinstalled the server, so that I can easily try different configurations on the RAID controller. However, none of the settings I have tried goes any faster than 70 MB/s. I've tried adjusting the stripe size and creating filesystems accordingly, but I haven't seen any improvements in throughput. In my latest test, the RAID volume is just a simple 2-disk stripe. This volume is then exported directly with AoE, no LVM or mdadm. With this test I hoped to eliminate any problem related to having the RAID controller generate parity for unaligned writes. However, I'm still seeing writes of ~70 MB/s. I also tested the network with iperf, and iperf said it could copy at ~960 Mbit/s, as expected. >> Thanks, I appreciate the help from you and all the others >> who have been very helpful here on aoetools-discuss. > You're welcome! And thank you very much for always reporting back the > results. > >> What I'm not quite understanding is how exporting a device via AoE >> would introduce new alignment problems or similar. >> When I can write to the local filesystem at ~370 MB/s, what kind of >> problem is introduced by using AoE or another network storage solution ? 
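One way to sidestep the LVM header misalignment Adi describes, rather than changing the disk count, is to tell LVM where its data area may start. LVM2's pvcreate has a --dataalignment option for this; the device name below is a placeholder and this is only a sketch, assuming the 192 KiB (3 x 64 KiB) full stripe from the discussion above:

```shell
# Align the start of the LVM data area to a full 192 KiB RAID stripe so that
# physical extents begin on a stripe boundary, regardless of the header size
# (device path is hypothetical).
pvcreate --dataalignment 192k /dev/sdb

# pe_start should now be a multiple of 192 KiB:
pvs -o +pe_start /dev/sdb
```

Note the 4 MB extent size itself still isn't a multiple of 192K, so extent boundaries after the first will drift; this only fixes where the data area starts.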
>> >> I did a quick iSCSI setup using iscsitarget and open-iscsi, but I saw the >> exact same ~70 MB/s throughput there, so I guess this isn't related to >> AoE in itself. > There are two root causes for these issues: > * SAN protocols force a "commit" of unwritten data, be it a "sync", direct > i/o or whatever, way more often than local disks -- for the sake of data > integrity. (actually write barriers should be enabled for all those AoE > devices -- especially with newer kernels.) I guess this is different from doing everything with "sync" enabled, though? If I mount the filesystem with the "sync" option, I get a different throughput. > * AoE (and iSCSI too) uses a block size of 4K (a size that perfectly fits > into a jumbo frame). So all I/O is aligned around this size. When using a > filesystem like ext4 or xfs one can influence the block sizes by creating > the file system properly. > > And now for some ascii art: > let's say a simple hard disk has the following physical blocks:
> +----+----+----+----+----+----+----+----+----+----+-..-+
> |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | .. |
> +----+----+----+----+----+----+----+----+----+----+-..-+
>
> then a raid 5 with a chunk size of 2 harddisk blocks consisting of 4 disks
> looks like this (D1 1-2 means disk1 blocks 1 and 2):
> +---------+---------+---------+---------+---------+-..-+
> | D1 1-2  | D2 1-2  | D3 1-2  | D4 1-2  | D1 3-4  | .. |
> +---------+---------+---------+---------+---------+-..-+
>  \----------- DATA -----------/\-PARITY-/
>  \------------ RAID block 1 ------------/\------- ..
>
> One data block of this RAID can only be written at once. So whenever only > one bit within that block changes, the whole block has to be written again > (because the checksum is only valid for the block as a whole). 
> > Now imagine you have an lvm header that has half of the size of a > RAID block: it will fill the first half of the block, and the first lvm > volume will then fill the rest of the first block plus some more blocks and > a half at the end. Write operations are not aligned then and cause massive > rewrites in the backend. > > From my point of view there are several ways to find the root cause of the > issues: > * try a different RAID level (like 10 or so) > * (re)-try to export the disks to Linux as JBODs. > * try different filesystem and lvm parameters (actually you better write a > script for that... ;-) > > And, let us know about the results! > Thanks, > Adi > Thank you for that very thorough explanation, I've just learned a lot about I/O and alignment. As I mentioned, I have tried different configurations, trying to avoid any source of alignment issues. My last attempt has no parity in the RAID setup; the virtual device from the controller is partitioned and exported via AoE. With this setup, I get the same ~70 MB/s I have been fighting with for a while now. It seems curious to me that I get ~70 MB/s seemingly no matter what changes I make to the configuration, so I'm beginning to suspect my testing method is broken. -- Vennlig hilsen Torbjørn Thorsen Utvikler / driftstekniker Trollweb Solutions AS - Professional Magento Partner www.trollweb.no |
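Given the suspicion about the testing method, one sanity check is to make every run pay the same cache costs; otherwise the page cache can hide or inflate differences between configurations. A sketch, with placeholder paths and sizes:

```shell
# Write test: conv=fdatasync makes dd include the final flush in its timing,
# so the reported rate isn't inflated by data still sitting in the page cache.
dd if=/dev/zero of=/mnt/aoetest/bench.img bs=1M count=4096 conv=fdatasync

# Read test: drop the page cache first so reads actually traverse AoE
# instead of being served from RAM (requires root).
sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/aoetest/bench.img of=/dev/null bs=1M
```

The test file should be a few times larger than RAM for the read numbers to mean anything.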