I’m experiencing an issue where I have a bond with 2 cards connected to different switches (switch 1 and 2), which are in turn connected to another switch (switch 3). I am using a target on the third switch for ARP monitoring, to make sure of end-to-end connectivity.
When everything is plugged in all is well, and when I disconnect switch 1 or 2 from switch 3 the link goes down for a little bit. Unfortunately, the slave card connected to the disconnected switch then keeps flapping between up and down for a few seconds, and sometimes stays either up or down.
I reconfigured my bond to have only 1 NIC in it and tried again, but the same thing happened. I ran tcpdump on both the slave and the bond and could not see any ARP traffic from my target (or any at all, as they were the only things plugged into the switch at the time).
I was wondering if anybody else had seen this, or knows if anything could be fighting with the ARP monitoring to give the up status. I’m using Debian 6 with the backports kernel 3.2.0-0.bpo.2-amd64.
Thanks in advance for any help
Finally worked this out. It was due to another machine using bonding on the same segment with the same target. It seems that the bonding driver will interpret any ARP for that IP (even another host's ARP request) as a reply.
As this was a failover cluster and the interference was from the inactive node, I configured my cluster to take the bonding interfaces down when they were not in use, and the problem cleared up.
Just as an FYI, the "arp_validate" option is meant to handle this case (that of multiple bonds on a network segment having each other's ARPs fool one another into thinking the path to the arp_ip_target is working). With arp_validate enabled, only the ARP traffic from the bond itself (and the replies to it) counts for the purpose of determining link state.
The documentation follows:
Specifies whether or not ARP probes and replies should be
validated in the active-backup mode. This causes the ARP
monitor to examine the incoming ARP requests and replies, and
only consider a slave to be up if it is receiving the
appropriate ARP traffic.
Possible values are:
none or 0: No validation is performed. This is the default.
active or 1: Validation is performed only for the active slave.
backup or 2: Validation is performed only for backup slaves.
all or 3: Validation is performed for all slaves.
For the active slave, the validation checks ARP replies to
confirm that they were generated by an arp_ip_target. Since
backup slaves do not typically receive these replies, the
validation performed for backup slaves is on the ARP request
sent out via the active slave. It is possible that some
switch or network configurations may result in situations
wherein the backup slaves do not receive the ARP requests; in
such a situation, validation of backup slaves must be disabled.
This option is useful in network configurations in which
multiple bonding hosts are concurrently issuing ARPs to one or
more targets beyond a common switch. Should the link between
the switch and target fail (but not the switch itself), the
probe traffic generated by the multiple bonding instances will
fool the standard ARP monitor into considering the links as
still up. Use of the arp_validate option can resolve this, as
the ARP monitor will only consider ARP requests and replies
associated with its own instance of bonding.
This option was added in bonding version 3.1.0.
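To make the quoted rules concrete, here is a rough Python sketch of the accept/reject decision they describe. It is a toy model, not the driver's code: the `Arp` class, the `counts_as_alive` function and the example addresses are all made up for illustration, and the "no validation" branch simply mirrors the behaviour reported earlier in this thread (any ARP seen counts as proof the path is up).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arp:
    """Hypothetical, simplified view of an ARP frame seen on a slave."""
    op: str          # "request" or "reply"
    sender_ip: str
    target_ip: str


def counts_as_alive(frame, role, validate, my_ip, targets):
    """Toy model: would this frame keep the slave marked as up?

    role     -- "active" or "backup" (the slave's current role)
    validate -- "none", "active", "backup" or "all" (arp_validate value)
    my_ip    -- IP address of this bond
    targets  -- set of arp_ip_target addresses
    """
    validated = validate == "all" or validate == role
    if not validated:
        # Assumption taken from the thread: without validation,
        # any ARP traffic is treated as evidence that the link works.
        return True
    if role == "active":
        # Active slave: only a reply generated by an arp_ip_target
        # and addressed to this bond counts.
        return (frame.op == "reply"
                and frame.sender_ip in targets
                and frame.target_ip == my_ip)
    # Backup slave: only this bond's own request, sent out via the
    # active slave and seen again on the backup, counts.
    return (frame.op == "request"
            and frame.sender_ip == my_ip
            and frame.target_ip in targets)


# Example addresses (documentation range, purely illustrative).
MY_IP, PEER_IP, TARGET = "192.0.2.10", "192.0.2.11", "192.0.2.1"
peer_probe = Arp("request", PEER_IP, TARGET)   # another bond probing the target
real_reply = Arp("reply", TARGET, MY_IP)       # the target answering this bond
```

In this model the other host's probe keeps the slave up when `validate` is `"none"` but is ignored with `"all"`, while the genuine reply from the target is accepted either way.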
Thanks for your answer. Unfortunately I was using balance-rr rather than active-backup mode (I should have specified), and the arp_validate option is not applicable in that mode.
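For anyone else hitting this, a quick way to see which mode a bond is actually running in is to read the driver's sysfs files. The sketch below assumes the usual `/sys/class/net/<bond>/bonding/` layout, where the `mode` file holds text such as `balance-rr 0`; the function names are made up for illustration, and the active-backup check reflects only the documentation quoted above, so other driver versions may behave differently.

```python
from pathlib import Path


def read_bond_setting(bond, name, sysfs_root="/sys/class/net"):
    """Return the raw text of one bonding setting, e.g. 'balance-rr 0'."""
    path = Path(sysfs_root) / bond / "bonding" / name
    return path.read_text().strip()


def arp_validate_usable(bond, sysfs_root="/sys/class/net"):
    """True if the bond is in active-backup mode.

    Assumption: per the documentation quoted in this thread,
    arp_validate only applies in active-backup mode.
    """
    fields = read_bond_setting(bond, "mode", sysfs_root).split()
    # The first field is the mode name; the second is its numeric value.
    return bool(fields) and fields[0] == "active-backup"
```

Calling `arp_validate_usable("bond0")` on a balance-rr bond would return `False`, which matches the situation described here.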