What's the preferred way to set up bonding on an IBM BladeCenter through a pair of D-Link switches?
I've been using active/passive mode with an arp_interval of two seconds to the default route, but this seems to be discouraged by IBM's PDF:
By setting our arp_ip_target to the default route, we should get suitable behaviour: the ARP probes punch through the D-Link, through its external LACP uplinks, through another switch, and then to the core router. However, it seems to freak out when we pull a link (or both) from the outside of the D-Link.
When we pull out the switch or reboot the D-Link, bonding flips the active link to the alternate, eth0. That part is fine.
options bond0 arp_interval=2000 arp_ip_target=10.41.110.2 mode=1 primary=eth1
There are many blades out there; what's best?
The document you reference is out of date. The warning about not using ARP monitoring predates the existence of the "arp_validate" option, which resolves the problem (which is that traffic from multiple blades on the same subnet can fool the ARP monitor even if the switch is unavailable; with arp_validate enabled, the traffic is checked to ensure that it's for the bond in question, not for a different bond).
You don't describe the "freak out" in any detail, but my guess would be that bonding would flip between the slaves at irregular intervals, and connectivity to the outside is intermittent at best. If that's about right, then your problem may very well be the one I describe above, which should be resolved by setting arp_validate=all. Note that if your arp_ip_target is reached via a VLAN configured above bonding, you'll need a fairly recent kernel that includes a fix for this (the fix was done in late 2009).
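For illustration, here is a minimal sketch of the options line from the original question with ARP validation added. It assumes everything else (target address, interval, primary slave) stays exactly as posted, and it has not been tested on this hardware:

```
# Sketch only: the poster's original options plus arp_validate=all.
# All other values are copied unchanged from the question.
options bond0 mode=1 primary=eth1 arp_interval=2000 arp_ip_target=10.41.110.2 arp_validate=all
```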
The other portion of the document, which describes MII monitoring as only monitoring the link between bonding and the switch module itself, is generally accurate. However, many recent switch modules also have a facility generally called "trunk failover," which will cause the switch to set switch ports on the "inside" (blade side) down when a switch port on the "outside" (next hop outbound) goes down. You'd have to check your switch's documentation or configuration screens to determine if this is supported (I don't know of a list, and I'm not as familiar with the D-Link switch modules).
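If the switch module does turn out to support trunk failover, MII monitoring by itself may be sufficient. A sketch of what that could look like follows. It assumes a 100 ms polling interval (an illustrative value, not one taken from this thread) and drops the ARP options, since the ARP and MII monitors are generally not meant to be combined:

```
# Sketch only: MII link monitoring instead of ARP monitoring.
# miimon=100 is an assumed polling interval in milliseconds; it relies on the
# switch taking blade-side ports down when its external uplinks fail.
options bond0 mode=1 primary=eth1 miimon=100
```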
-Jay Vosburgh, firstname.lastname@example.org
Jay, thanks for your thoughtful reply. Perhaps we need an update to the referenced doc at IBM? Your interpretation of "freak out" is accurate: pings usually don't make it, but sometimes do. The ESM/D-Link BladeCenter switches do have a feature under Link Aggregation called "Redundant Switch Failover", with IP-Packet and non-IP-Packet "Distribution Methods" with SrcMac and DestMac options as well.
The help screen on the switch interface shows this: "Failover Mode When all trunk ports are link down or disabled, then all blades will be link down. So total loss of link connectivity on external ports will be detected by the blade NIC driver." … so this looks perfect for what we are trying to do.
Active/passive bonding (without this feature enabled) does seem to detect a loss of the uplink to the default route (when ARP to the default route fails), but it may not have a way to detect when service resumes, since link state alone is not enough to indicate that the inter-switch links further upstream are back?
I'll test this Redundant Switch Failover setting on our dev BladeCenter. It looks like a winner!
Thanks very much! John W.