Re: [Keepalived-devel] keepalived partition handling and quorum
From: Anders H. <and...@1u...> - 2014-09-26 15:18:14
On 26.09.2014, Howley, Tom wrote:
> So it sounded like this would solve it: "garp_master_delay in keepalived
> should take care of this; it keeps the master re-announcing itself to the
> network routers over and over again."
>
> Although looking at the man page, it suggests that this is just a one-off
> delay before sending a single ARP after the transition to master. Is that
> the actual effect of that setting, or is there an alternative one?

Hi Tom,

the whole set of garp_master_* options relates to this, depending on the exact outage scenario. garp_master_delay and garp_master_repeat repeat the gratuitous ARP reply just after the transition to MASTER, but stop after garp_master_repeat repetitions. garp_master_refresh and garp_master_refresh_repeat periodically repeat the gratuitous ARP reply for as long as the node stays in MASTER state.

A simple example of the scenario I'm considering:

- Certain routing equipment is designed to perform almost every core function in specific ASICs or otherwise dedicated hardware. ARP traffic, being of very low priority, isn't handled by dedicated hardware but by a fairly small CPU, which can usually keep up with the normal amount of ARP traffic and other "minor features".
- Part of the network splits off: the network partitions.
- Your BACKUP node no longer sees an active MASTER and decides to become MASTER itself.
- The node sends out gratuitous ARP replies for every VIP, which may be dozens or hundreds of VIPs.
- Other nodes on the network perform the same action and send out dozens or hundreds of ARP replies as well. As a result, the network temporarily sees a lot of ARP replies.
- The router's CPU is temporarily overloaded and will ignore some of those replies - but you don't know which ones. If those ARP replies are only sent once, the router won't have updated its IP-to-MAC mapping for some IPs and will keep forwarding incoming packets for those addresses to the old MAC address.
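For illustration, the garp_master_* timers described above could be combined in a vrrp_instance like this (interface, VRID, priority and VIP are placeholders, and the timer values are only examples to be tuned to your network):

```
vrrp_instance VI_1 {
    interface eth0              # placeholder interface
    virtual_router_id 51        # placeholder VRID
    priority 100

    # burst: 5s after becoming MASTER, send the gratuitous ARPs again,
    # repeating the set twice in total
    garp_master_delay 5
    garp_master_repeat 2

    # steady state: while MASTER, re-send gratuitous ARPs every 60s
    garp_master_refresh 60
    garp_master_refresh_repeat 1

    virtual_ipaddress {
        192.0.2.10/24           # example VIP
    }
}
```

With hundreds of VIPs, a lower garp_master_refresh interval means more periodic ARP traffic, so the refresh interval is a trade-off between router convergence time and ARP load.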
In our situation, the old MAC address belongs to the node currently not on our live network - so incoming traffic is routed to an offline node, resulting in an outage.

The RFC-recommended way of solving this is a virtual MAC address: your master node always uses a specially formed MAC address for your VIPs. The routers keep forwarding packets to that MAC address, no matter which node is currently hosting it. However, your switches then need to rapidly learn the new port for that MAC address, and this may introduce a different kind of trouble.

Configuring garp_master_delay and garp_master_repeat may help in that situation: the router still experiences the initial "ARP storm", but after garp_master_delay, we repeat our ARP replies for every VIP one or two times, helping the router catch up and update those important ARP entries. However, you just don't know when all entries have been updated; with hundreds of VIPs on a single network, or with some bad luck, the same ARP reply for the same IP address may be ignored two or three times in a row. One approach is to increase garp_master_repeat to some very high value; another is to keep sending out ARP replies at a periodic interval. The latter option is exactly the combination of garp_master_refresh and garp_master_refresh_repeat.

An alternative solution I prefer is NOT to announce all VIPs via ARP, but to announce a single HA IP per balancer and, by means of a BGP announcement, instruct the network to route traffic aimed at a VIP towards that single HA IP. This reduces the amount of ARP traffic dramatically, but that HA IP becomes even more important - so I need to ensure that my routers know about it. garp_master_refresh and garp_master_refresh_repeat do exactly that: they keep re-announcing a single IP per balancer pair.

Another scenario: you're using keepalived for loadbalancing.
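As a sketch, the RFC-style virtual MAC behaviour mentioned above is enabled in keepalived with the use_vmac option; keepalived then brings up a macvlan interface carrying the VRRP virtual MAC (00:00:5e:00:01:VRID). The instance below is illustrative, with placeholder interface, VRID and VIP:

```
vrrp_instance VI_1 {
    interface eth0          # placeholder interface
    virtual_router_id 51    # the VRID also forms the last byte of the virtual MAC
    priority 100

    # use the RFC-defined virtual router MAC address instead of the NIC's own
    # MAC; on failover, routers keep their ARP entries and only the switches
    # need to re-learn which port the MAC is behind
    use_vmac

    virtual_ipaddress {
        192.0.2.10/24       # example VIP
    }
}
```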
One of your realservers has been misconfigured and starts sending out ARP replies for a VIP address during reboot. Your network router sees this and happily forwards all traffic no longer to your loadbalancer, but to that single realserver. With garp_master_refresh and garp_master_refresh_repeat configured, your loadbalancer will continuously send out ARP replies for its VIP addresses and so keep updating the router's ARP mapping table. As a result, the realserver's misconfiguration only takes effect temporarily, and its negative changes are automatically "overwritten". Of course, it would be better to investigate further and fix the misconfiguration, but this way you've preserved uptime and reduced the impact. And with software like arpwatch on your network, you can still see exactly what happened and where you need to fix things.

Hopefully those examples help.

Anders

> Thanks again,
> Tom
>
> -----Original Message-----
> From: Ryan O'Hara [mailto:ro...@re...]
> Sent: 26 September 2014 14:35
> To: Anders Henke
> Cc: kee...@li...
> Subject: Re: [Keepalived-devel] keepalived partition handling and quorum
>
> On Fri, Sep 26, 2014 at 10:51:05AM +0200, Anders Henke wrote:
> > Hi Ryan,
> >
> > Thanks for correcting my sloppy description and use of language when
> > trying to simplify things.
>
> Your original email was a good description of VRRP. I didn't think it was
> sloppy at all.
>
> Ryan
>
> > Of course, VRRP is not specific to balancing; the sentence was a
> > misleadingly concatenated, shortened version of the following:
> > - VRRP ensures an IP address to be present on a network at some node.
> > - The node operating that IP address does whatever it needs to do:
> >   routing/forwarding/loadbalancing IP packets, failover of
> >   (near-)stateless services, ... In the context of keepalived, this
> >   usually is loadbalancing (it doesn't need to be that way; keepalived
> >   may be used without loadbalancing).
> > And of course, the usage of garp_master_refresh isn't the preferred
> > way. For some time now, keepalived has optionally supported the
> > RFC-required usage of a virtual MAC address (use_vmac). However, there
> > are some real-life situations where this falls short, and
> > garp_master_refresh addresses them.
> >
> > Best,
> >
> > Anders
> >
> > On 25.09.2014, Ryan O'Hara wrote:
> > > On Thu, Sep 25, 2014 at 12:49:38PM +0200, Anders Henke wrote:
> > > > On 24.09.2014, Howley, Tom wrote:
> > > > > I'm relatively new to keepalived, so apologies in advance. I have
> > > > > a relatively simple keepalived setup with a single vrrp instance
> > > > > that is managing a single VIP across three nodes. The only point
> > > > > worth noting is that my config is identical across all three
> > > > > nodes, so the IP address is used in the original master election.
> > > > > I'm wondering if VRRP has a concept of quorum handling. I
> > > > > basically want to avoid the scenario where a network partition
> > > > > (which could be isolated to just a multicast failure) results in
> > > > > two nodes of the cluster claiming to hold the VIP. For example,
> > > > > if a node that was master becomes isolated, can I configure it to
> > > > > disassociate the VIP from itself?
> > > >
> > > > VRRP, extremely simplified:
> > > > - Your balancer nodes announce their availability via multicast on
> > > >   your local network. This availability message contains a router
> > > >   ID, a priority and a list of VIPs.
> > >
> > > s/availability/advertisement/
> > >
> > > > - Your balancer nodes listen to those announcements as well. If
> > > >   they don't see an announcement with a higher priority than their
> > > >   own using the same router ID and (optionally) the same list of
> > > >   VIPs, they'll start serving those VIPs; otherwise they'll stop
> > > >   serving them.
> > >
> > > I'd refrain from calling these balancers. If you're using keepalived
> > > with VRRP and IPVS, then yes, they are balancers.
> > > But if you are trying to give an overview of VRRP, they may not be
> > > balancers. VRRP really has nothing to do with load-balancing.
> > >
> > > > Some VRRP implementations also add an additional tie-breaker: when
> > > > multiple nodes use the same parameters (router ID, VIPs, priority),
> > > > the node with the highest IP becomes master.
> > >
> > > This is actually part of the RFC.
> > >
> > > > By design, it's an "if I don't see anyone else trying to do that
> > > > job, I'll do it" idea; there is no such thing as gaining a vote via
> > > > some quorum mechanism or a sophisticated election algorithm between
> > > > multiple nodes.
> > > >
> > > > Whenever your network partitions, you may end up with two or more
> > > > of your balancer nodes claiming to hold the same VIP.
> > > >
> > > > This doesn't need to be a bad thing: probably two of your three
> > > > balancer nodes will see each other and would gain a majority vote,
> > > > but they might be on an isolated part of your network without any
> > > > internet connectivity. So if "both" partitions of your network are
> > > > active, incoming traffic might arrive at a non-redundant, but still
> > > > internet-connected balancer with hopefully some realservers behind
> > > > it, serving incoming requests.
> > > >
> > > > Whenever your network re-unites, those balancer nodes will discover
> > > > each other again; a single node will keep the VIP and the other
> > > > ones will release it.
> > > >
> > > > However, there are some pitfalls involved; for example, the master
> > > > may try to announce itself to the network routers, but those
> > > > initial ARP replies may be ignored by a too-busy router, so your
> > > > network may continue sending traffic to your backup balancer node.
> > > > As the backup balancer has de-configured the VIP from its local
> > > > interfaces, that box will still receive, but also happily ignore,
> > > > that incoming traffic: resulting in non-availability of your
> > > > service.
> > > > Configuring garp_master_delay in keepalived should take care of
> > > > this; it keeps the master re-announcing itself to the network
> > > > routers over and over again.
> > > >
> > > > > I have just tried adding some LVS config, specifying a pool of
> > > > > real servers, so that I now have a script that is invoked if
> > > > > either quorum is lost or regained. So I could possibly use that
> > > > > to do what I want, but it feels like I'm going down the road of
> > > > > hackery.
> > > >
> > > > Quorum in that context is very different from VRRP: it's the amount
> > > > of available realservers, which is unrelated to the amount of
> > > > available balancer nodes.
> > > >
> > > > VRRP just takes care that at least one balancer announces itself to
> > > > the network and distributes incoming traffic to your realservers
> > > > (or a sorry_server).
> > >
> > > VRRP does not load balance.
> > >
> > > Ryan
> > >
> > > > Quorum in keepalived is just an extra. When keepalived sees
> > > > "enough" realserver capacity to serve requests, it'll trigger a
> > > > script with a custom action. When the capacity drops below a
> > > > threshold, keepalived will trigger a (different) script, doing some
> > > > (different) custom action.
> > > >
> > > > What is quorum good for?
> > > > - You may want to trigger custom actions whenever there are "too
> > > >   few" realservers available. A quorum script can notify your
> > > >   monitoring system or trigger a deployment system to add more
> > > >   (virtual) realservers to your network.
> > > > - You may want to announce your balancer not just via ARP, but via
> > > >   a dynamic routing protocol; for example, you may want to serve
> > > >   the same VIP from multiple data centers using anycast. A quorum
> > > >   script can reconfigure your local BGP daemon, withdrawing or
> > > >   adding VIP announcements dynamically, ensuring that requests
> > > >   don't flood a balancer with too few available realservers.
> > > > Anders
> > > > --
> > > > 1&1 Internet AG Expert Systems Architect (IT Operations)
> > > > Brauerstrasse 50 v://49.721.91374.0
> > > > D-76135 Karlsruhe f://49.721.91374.225
> > > >
> > > > Amtsgericht Montabaur HRB 6484
> > > > Vorstand: Ralph Dommermuth, Frank Einhellinger, Robert Hoffmann,
> > > > Andreas Hofmann, Markus Huhn, Hans-Henning Kettler, Uwe Lamnek,
> > > > Jan Oetjen, Christian Würst
> > > > Aufsichtsratsvorsitzender: Michael Scheeren
> > > >
> > > > _______________________________________________
> > > > Keepalived-devel mailing list
> > > > Kee...@li...
> > > > https://lists.sourceforge.net/lists/listinfo/keepalived-devel

--
1&1 Internet AG Expert Systems Architect (IT Operations)
Brauerstrasse 50 v://49.721.91374.0
D-76135 Karlsruhe f://49.721.91374.225

Amtsgericht Montabaur HRB 6484
Vorstand: Ralph Dommermuth, Frank Einhellinger, Robert Hoffmann, Andreas Hofmann, Markus Huhn, Hans-Henning Kettler, Uwe Lamnek, Jan Oetjen, Christian Würst
Aufsichtsratsvorsitzender: Michael Scheeren