Re: [Keepalived-devel] Keepalive holddown timer vs interval?
From: Quentin A. <qu...@ar...> - 2021-07-19 17:29:06
On Mon, 2021-07-19 at 15:51 +0000, Jeremy Guthrie wrote:
> We are seeing an issue where a single second disconnect or packet loss appears
> to cause keepalived to go into fault and die. We'd like to actually run with
> an 'advert_int' of 2 seconds with a '10 second hold down timer' meaning we
> have to lose five(5) hellos instead. Is there a way to do that? Right now
> it appears that we have only to lose a single hello and VRRP wants to fail
> over which is too fragile. We are trying to find the configs to allow for
> that.
>
> Any thoughts?

Jeremy,

This is a very old list that you have sent this message to. The current list is kee...@gr... .

You don't say which version of keepalived you are using; generally newer versions are more reliable than older versions. The current version is v2.2.2.

> We are seeing an issue where a single second disconnect or packet loss appears
> to cause keepalived to go into fault and die.

What is happening here depends on what you are doing. If the network cable is disconnected from an interface that is being used by keepalived (or indeed if the interface is downed by command - e.g. ip link set XXX down), then any VRRP instance tracking that interface will go to fault state until the interface comes back up again. Once the interface comes back up, the VRRP instance will go to BACKUP state, and then if it is the highest priority instance and nopreempt is not set, then after three advert intervals plus a bit (I'll explain later) it will take over as master.

I am not sure what you mean by "die". Does the VRRP instance remain in fault state after the interface is restored?

> We'd like to actually run with an 'advert_int' of 2 seconds with a '10 second
> hold down timer' meaning we have to lose five(5) hellos instead. Is there a
> way to do that?

The simple answer is NO. The VRRP RFCs are specific about how long a backup will wait before it takes over as master. For VRRPv3 this is 3 * advert_int + (256 - priority) / 256 * advert_int.
For VRRPv2 it is 3 * advert_int + (256 - priority) / 256 (in seconds). So the timeout is always somewhere between 3 and 4 advert intervals, and the higher the priority of the VRRP instance the closer the timeout is to 3 advert intervals.

Since the RFCs are explicit about the calculation of 3 * advert_int plus a bit, the 3 * is hard coded, and not configurable. Technically the 3 * could be replaced with a configurable parameter, but then you would not be running VRRP!

> Right now it appears that we have only to lose a single hello and VRRP
> wants to fail over which is too fragile.

This is not the case. An interface going down will cause an immediate transition to fault state, but the loss of a single advert will NOT cause a backup vrrp to take over as master.

> We are trying to find the configs to allow for that.
>

There are none since keepalived does not support anything other than 3 * advert_int + a bit, and always immediately transitions to FAULT state if the interface it is using goes down.

If you could give more details about what is happening in respect of "single second disconnect or packet loss" and also "go into fault and die" then we might be able to make some more concrete suggestions.

I hope this helps,
Quentin Armitage
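
For anyone wanting to check the numbers, the two timeout formulas quoted in the reply can be sketched in a few lines of Python. This is only the arithmetic, not keepalived's own code: the function name master_down_interval and its parameters are made up for illustration, and the accepted priority range of 1 to 255 is an assumption.

```python
def master_down_interval(advert_int: float, priority: int, version: int = 3) -> float:
    """Seconds a backup waits without hearing an advert before taking over.

    Illustrative helper only (hypothetical name); it mirrors the formulas
    quoted in the reply, not keepalived's internal timer code.
    """
    if not 1 <= priority <= 255:  # assumed valid range for this sketch
        raise ValueError("priority must be between 1 and 255")
    skew = (256 - priority) / 256
    if version == 3:
        # VRRPv3: the skew scales with the advert interval
        skew *= advert_int
    elif version != 2:
        # VRRPv2 keeps the skew as a plain fraction of a second
        raise ValueError("version must be 2 or 3")
    return 3 * advert_int + skew


if __name__ == "__main__":
    # The case from the question: advert_int of 2 seconds, example priority 100
    print(master_down_interval(2, 100, version=3))  # 7.21875
    print(master_down_interval(2, 100, version=2))  # 6.609375
```

With an advert_int of 2 seconds the result stays between 6 and 8 seconds whatever the priority, which is the "between 3 and 4 advert intervals" described above.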