Re: [Keepalived-devel] Keepalive holddown timer vs interval?

On Mon, 2021-07-19 at 15:51 +0000, Jeremy Guthrie wrote:
> We are seeing an issue where a single second disconnect or packet loss appears
> to cause keepalived to go into fault and die.  We'd like to actually run with
> an 'advert_int' of 2 seconds with a '10 second hold down timer' meaning we
> have to lose five(5) hellos instead.  Is there a way to do that?   Right now
> it appears that we have only to lose a single hello and VRRP wants to fail
> over which is too fragile.  We are trying to find the configs to allow for
> that.
> 
> Any thoughts?

Jeremy,

This is a very old list that you have sent this message to. The current list
is kee...@gr... .

You don't say which version of keepalived you are using; generally newer
versions are more reliable than older versions. The current version is v2.2.2.

> We are seeing an issue where a single second disconnect or packet loss appears
> to cause keepalived to go into fault and die.  

What is happening here depends on what you are doing. If the network cable is
disconnected from an interface that is being used by keepalived (or indeed if
the interface is downed by command - e.g. ip link set XXX down), then any VRRP
instance tracking that interface will go to fault state until the interface
comes back up again. Once the interface comes back up, the VRRP instance will go
to BACKUP state, and then if it is the highest priority instance and nopreempt
is not set, then after three advert intervals plus a bit (I'll explain later) it
will take over as master.

I am not sure what you mean by "die". Does the VRRP instance remain in fault
state after the interface is restored?

> We'd like to actually run with an 'advert_int' of 2 seconds with a '10 second
> hold down timer' meaning we have to lose five(5) hellos instead.  Is there a
> way to do that?

The simple answer is NO. The VRRP RFCs are specific about how long a backup will
wait before it takes over as master. For VRRPv3 this is 3 * advert_int + (256 -
priority) / 256 * advert_int. For VRRPv2 it is 3 * advert_int + (256 - priority)
/ 256 seconds. So the timeout is always somewhere between 3 and 4 advert
intervals, and the higher the priority of the VRRP instance the closer the
timeout is to 3 advert intervals.
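
To make the arithmetic concrete, here is a minimal sketch of the two formulas
above. The function name, the argument names and the example priorities are
purely illustrative; this is not keepalived's own code, just the calculation
written out.

```python
def master_down_interval(advert_int, priority, version=3):
    """Seconds a backup waits without hearing an advert before taking over.

    Illustrative helper only (not part of keepalived). advert_int is in
    seconds; priority is the VRRP priority of the backup instance.
    """
    if not 1 <= priority <= 255:
        raise ValueError("priority must be between 1 and 255")
    # The "bit" added to 3 * advert_int: higher priority means a smaller skew.
    skew = (256 - priority) / 256
    if version == 3:
        # VRRPv3 scales the skew by the advert interval.
        skew *= advert_int
    elif version != 2:
        raise ValueError("version must be 2 or 3")
    # For VRRPv2 the skew is simply a fraction of one second.
    return 3 * advert_int + skew


if __name__ == "__main__":
    # advert_int of 2 seconds, as in the question; priorities are examples.
    for version in (2, 3):
        for priority in (100, 254):
            timeout = master_down_interval(2, priority, version)
            print(f"VRRPv{version} priority {priority}: {timeout:.3f}s")
```

With an advert_int of 2 seconds and a priority of 100, for example, this gives
a timeout of roughly 7.2 seconds for VRRPv3 and roughly 6.6 seconds for VRRPv2,
so the backup waits for about three missed adverts, not one and not five.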

Since the RFCs are explicit about the calculation of 3 * advert_int plus a bit,
the 3 * is hard coded, and not configurable. Technically the 3 * could be
replaced with a configurable parameter, but then you would not be running VRRP!

>    Right now it appears that we have only to lose a single hello and VRRP
> wants to fail over which is too fragile.

This is not the case. An interface going down will cause an immediate transition
to fault state, but the loss of a single advert will NOT cause a backup VRRP
instance to take over as master.

>   We are trying to find the configs to allow for that.
> 
There are none since keepalived does not support anything other than 3 *
advert_int + a bit, and always immediately transitions to FAULT state if the
interface it is using goes down.

If you could give more details about what is happening in respect of "single
second disconnect or packet loss" and also "go into fault and die" then we might
be able to make some more concrete suggestions.

I hope this helps,

Quentin Armitage