Lachlan,

 

I like your switchless design.  Fewer things to break.

 

With bonding, you have presented one logical interface (bond0) to aoe/vblade, so multiple interfaces should not be an important consideration.  Definitely check for hardware flow control (ethtool -a) and lower aoe_maxout if needed.
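
For a quick check (eth4 here is a stand-in for each slave NIC in the bond, and the sysfs path assumes your driver build exposes the parameter):

# ethtool -a eth4

# cat /sys/module/aoe/parameters/aoe_maxout

The first shows whether RX/TX pause (flow control) is active on the NIC; the second shows the window of outstanding requests the initiator currently allows.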

 

Version 47 of aoe.ko may be quite old.  The driver I'm running reports version 75:

 

# dmesg | grep aoe

aoe: AoE v75 initialised.

 

If you have a fairly recent driver, you can see more stats via the /sys filesystem, e.g.:

 

# cat /sys/block/etherd\!e1.0/debug

rttavg: 1348 rttdev: 1127

nskbpool: 4

kicked: 1812

maxbcnt: 8704

ref: 0

falloc: 126

ffree: ffff88001ffc25c0

003048b96514:0:8:8

        ssthresh:4

        lost:511430

        taint:0

        r:20920045

        w:511086653

        eth4

falloc: 148

ffree: ffff8800374d2180

003048b96515:0:8:8

        ssthresh:4

        lost:508817

        taint:0

        r:21594568

        w:533599056

        eth5

 

There's a lot of useful info there.  The rttavg/rttdev fields are used for congestion control and show average round-trip times (in my case computed in microseconds).  The RTT is typically on the order of several milliseconds, because AoE measures round-trip as the combination of network and disk latency, not simply network.  If a reply does not arrive within 2*rttavg + 8*rttdev, the request is retransmitted and the "lost" counter is incremented for that interface.  (If the reply is eventually received following the retransmit, however, "lost" is decremented again.)
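
Plugging in the figures above: 2*1348 + 8*1127 = 2696 + 9016 = 11712 microseconds, so on this interface a reply has roughly 11.7ms to arrive before the request is resent.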

 

The "r" and "w" counters are counts of read and write requests, respectively.  From the ratio of lost/(r+w) I can see that about 1 of every 1,000 packets is lost on my AoE storage network.  Empirically, that seems okay.  (When I was having problems with closewait that ratio was closer to 1/100.)

 

You can also watch retransmits in real-time if you "cat" the /dev/etherd/err character pseudo-device:

 

# cat /dev/etherd/err

     retransmit e1.0 oldtag=09cc2022@1237d2026 newtag=09cd2026 s=00188b751c17 d=003048b96514 nout=1

unexpected rsp e1.0    tag=09cc2022@1237d2028 s=003048b96514 d=00188b751c17

 

The tag uniquely identifies an AoE request.  Taking apart the tag, the "09cc2022" part is divided into a sequence number (09cc) and a timestamp (2022) that represents the lower 16 bits of the kernel's "jiffies" counter.  The part following "@" (1237d2026) is the jiffies counter at the time of retransmit (when the original packet is deemed lost).  Only 4 jiffies, about 16ms on my system (HZ=250), elapsed before the retransmit.
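
Decomposed:

tag 09cc2022   ->  seq 0x09cc, ts 0x2022 (low 16 bits of jiffies at send)

@1237d2026     ->  jiffies at retransmit; low 16 bits are 0x2026

elapsed        ->  0x2026 - 0x2022 = 4 jiffies = 4 * (1000/250)ms = 16ms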

 

Immediately following the retransmit event is an unexpected response, which shows that the original lost packet actually arrived in 6 jiffies (24ms).

 

Due to extreme variation in disk response times (often less than a millisecond, sometimes tens of ms), it's hard to avoid the occasional retransmit.  My impression is that the congestion avoidance algorithms are better suited to network latency than to disk latency, since the former tends to be more regular.

 

There are also problems with the aoe driver implementation.  It seems to me the timeout values should take into account the queue length.  If packets are sent one at a time to the target, a timeout of, say, 15ms seems reasonable.  If, however, packets are sent in groups of 20 or 50, they can't all be processed concurrently on a typical storage array.  But the current driver implementation will assign them all the same timeout (15ms in my example), making retransmits very likely for high values of aoe_maxout.
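
To make that concrete (the numbers are hypothetical): if a target services one request per millisecond and the initiator fires off a window of 50, the last request in the window sits queued for about 50ms before it is even touched, more than three times a 15ms timeout, so it gets retransmitted even though nothing was lost.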

 

-Jeff

 

From: Lachlan Evans [mailto:aoetools-discuss@conf.net.au]
Sent: Thursday, March 03, 2011 9:36 PM
To: Jeff Sturm
Cc: aoetools-discuss@lists.sourceforge.net
Subject: Re: [Aoetools-discuss] Down,closewait under load

 

G'day Jeff,

Thanks for your response.

Initiator version?  Looking at the strings output of the aoe.ko file I see version=47, and as far as I can tell the aoetools version is 30.

Thanks for the information regarding the buffers and maxouts, and especially for considering multiple interfaces (which we do have).  Speaking of which, we currently have 4 interfaces bonded using the Linux Ethernet bonding driver in round-robin mode, which we've found to perform quite well, but I do have to ask the question: is there a better way?

We don't have a switch in the equation; the SAN is essentially 4 crossover cables and 8 NICs between two hosts.

Hopefully I'll get approval to persist with some more testing and I'll post my results as soon as I get them.

Cheers,

Lachlan

On 4 March 2011 01:50, Jeff Sturm <jeff.sturm@eprize.com> wrote:

We were plagued by this problem a while ago.  "closewait" status means the driver sees the device but is waiting for the block device to close before automatically revalidating it.

 

What version of the initiator (aoe.ko) are you using?

 

After some inspection of the aoe driver source, I now understand that the RTT calculations do not take into account packets that are permanently lost.  So it is possible for the driver to get into a state in which the network is flooded with request packets, resent after every retransmit timer expiration, while the timer interval is never adjusted.  After aoe_deadsecs seconds elapse (300 by default, IIRC) the device is marked down.

 

The aoe_maxout defaults are flawed.  With a single target (say, e1.1) and a single initiator, the device will be queried for its "buffer count" (returned in the "query config" response) and report, for example, 64.  The aoe initiator then uses this number as the default value of aoe_maxout, and will keep up to 64 requests outstanding to the target before waiting for a response.  Now suppose there are 2 ethernet links from the initiator to the target (multipath).  The aoe initiator will then send up to 2*64, or 128, outstanding requests, which can overwhelm the target.

 

It gets worse than that.  If the shelf has 3 different slots (e.g. e1.1, e1.2, e1.3), the Linux aoe initiator will queue up to 64 requests per slot per interface (3 * 2 * 64 = 384).  And if there are 4 different hosts all connecting to the same target, multiply this by 4 (1,536).  That's far more outstanding requests than the target can safely handle, and intermediate switch buffers are likely to get flooded as well.

 

Here's how we handled it:

 

- Enable hardware flow control on all Ethernet devices carrying AoE traffic.  The usual wisdom with hardware flow control is to leave it off, since TCP has pretty good congestion control.  However, AoE is not TCP, and there is ample evidence that AoE performs better with flow control enabled.
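
On Linux that's typically done with ethtool (eth4 here standing in for each NIC carrying AoE traffic; the NIC and driver must support pause frames):

# ethtool -A eth4 rx on tx on

If there's a switch in the path, its ports need to honor pause frames as well.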

 

- Ensure network buffers are large enough to store outstanding packets.  This is particularly important if you are running jumbo frames.  In our sysctl.conf I have:

 

net.core.rmem_default = 262144

net.core.rmem_max = 16777216

net.core.wmem_default = 262144

net.core.wmem_max = 16777216

 

- Lower the aoe_maxout parameter of the aoe module as much as necessary to preserve the stability of the storage network.  As mentioned above, the default aoe_maxout is obtained by querying the device.  Cut this in half, or less, and run some performance tests.  We've lowered it all the way to 8 without much sacrifice in performance.
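
For example, to try a window of 8 (assuming the aoe devices are not in use while you reload the module):

# modprobe -r aoe

# modprobe aoe aoe_maxout=8

To make it persistent, something like "options aoe aoe_maxout=8" in /etc/modprobe.d/aoe.conf (the filename is just a convention) works on most distributions.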

 

- Buy good network switches, if you haven't done so already.  The network is only as good as its weakest component.  Switches are not a good place to save money, I've found, and not all are made the same.  Try a few different models if you have the luxury.

 

Good luck,

 

-Jeff

 

From: Lachlan Evans [mailto:aoetools-discuss@conf.net.au]
Sent: Thursday, March 03, 2011 12:50 AM
To: aoetools-discuss@lists.sourceforge.net
Subject: [Aoetools-discuss] Down,closewait under load

 

Hi list,

I've encountered a recurring issue where a single AoE device goes into the closewait,down state.  I'm hoping someone here might be able to point me in the right direction to find the underlying cause.

A little about the setup: two hosts, one acting as a SAN, the other as a Xen host.  Both run Debian Squeeze using the Debian-distributed AoE packages.  A 5-disk RAID-6 array is configured using md and LVM on the SAN.  LVM volumes are then exported via AoE using vblade.  There are 5 volumes exported from the SAN to the Xen host:

      e0.0       171.798GB  bond0 up
      e0.1       268.435GB  bond0 up
      e0.2        53.687GB  bond0 up
      e0.3       128.849GB  bond0 up
      e0.4        53.687GB  bond0 up

which are then used by the Windows 2003 Server Xen DomU as its disk devices.

The issue first occurred on February 17th at 19:13, when this was recorded:

Feb 17 19:13:30 vmsrv kernel: [456093.648028] VBD Resize: new size 0

I believe this log entry originates from Xen's VBD driver reporting the change.

At the same time, aoe-stat on the Xen host was displaying:

      e0.0       171.798GB  bond0 up
      e0.1       268.435GB  bond0 up
      e0.2        53.687GB  bond0 up
      e0.3       128.849GB  bond0 closewait,down
      e0.4        53.687GB  bond0 up

Overnight last night it happened again:

Mar  2 20:28:23 vmsrv kernel: [900000.336023] VBD Resize: new size 0

and aoe-stat displaying:

      e0.0       171.798GB  bond0 up
      e0.1       268.435GB  bond0 closewait,down
      e0.2        53.687GB  bond0 up
      e0.3       128.849GB  bond0 up
      e0.4        53.687GB  bond0 up

An aoe-revalidate instantly resolves the issue, but in the meantime the disks are unavailable.
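
(For a single device that's just, e.g., "aoe-revalidate e0.3" run on the Xen host.)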

What leads me to believe that this is a load-related issue is that both occurrences fell within our backup schedule, which generates a large amount of load, particularly on the SAN.  Up until about a month ago we were running a combination of IET+open-iscsi, and the backup schedule (which has not changed since) didn't seem to impact that combination.

Any pointers would be greatly appreciated.

Cheers,

Lachlan