OS : SUSE 64-bit
Changeset : 7997 ( 5.1.FC)
Setup : 5 nodes (3 controllers and 2 payloads) with headless feature enabled & 1PBE with 10K objects
Cluster reset happened during headless as CLMNA faulted due to healthCheckcallbackTimeout
Invoked headless by killing the Active, followed by the Standby and Spare controllers,
maintaining a gap of 6 sec between controller reboots.
After a couple of failovers, CLMNA faulted on PL-4 and PL-5 due to healthCheckcallbackTimeout, and a cluster reset happened.
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO Performing failover of 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' (SU failover count: 1)
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO 'safComp=CLMNA,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO 'safComp=CLMNA,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: ER safComp=CLMNA,safSu=PL-4,safSg=NoRed,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 132111, SupervisionTime = 60
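For readability, the probation timeout in the first log line above is reported in nanoseconds. A minimal sketch that extracts the value from that line and converts it; the regular expression is an assumption based only on the format of this one line, not on any documented OpenSAF log format:

```python
import re

# Log line copied verbatim from the PL-4 syslog excerpt above.
line = ("Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO SU failover "
        "probation timer started (timeout: 1200000000000 ns)")

# ASSUMPTION: the pattern is inferred from this single line only.
match = re.search(r"timeout: (\d+) ns", line)
timeout_ns = int(match.group(1))

# Convert nanoseconds to whole seconds and minutes.
timeout_s = timeout_ns // 1_000_000_000
timeout_min = timeout_s // 60
print(f"{timeout_s} s = {timeout_min} min")
```

So the SU failover probation timer in this run was 1200 seconds, i.e. 20 minutes.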
Notes:
The system clocks differ between the nodes.
With respect to PL-4 (Sep 10 17:52:46 SCALE_SLOT-74), the corresponding times on the other systems are:
Sep 27 18:46:53: SC-1
Oct 03 10:02:54: SC-2
Oct 03 10:26:44: SC-3
Sep 10 17:54:46: PL-5
No syslog was logged on the controllers during the above time.
Syslogs of SC-1, SC-2, SC-3, PL-4 and PL-5 are attached.
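To correlate the attached syslogs, the clock offsets listed in the Notes can be turned into a small conversion helper. This is only a sketch: the helper name `to_node_time` and the placeholder year are assumptions (syslog timestamps carry no year, and the offsets for these dates do not depend on it); the timestamps themselves are the ones given above.

```python
from datetime import datetime

ASSUMED_YEAR = 2000  # ASSUMPTION: placeholder only, syslog has no year field
FMT = "%Y %b %d %H:%M:%S"

def parse(stamp):
    """Parse a syslog-style 'Mon DD HH:MM:SS' stamp using the placeholder year."""
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", FMT)

# Reference instant on PL-4 and the matching local times from the Notes.
reference = parse("Sep 10 17:52:46")
same_instant = {
    "SC-1": "Sep 27 18:46:53",
    "SC-2": "Oct 03 10:02:54",
    "SC-3": "Oct 03 10:26:44",
    "PL-5": "Sep 10 17:54:46",
}

# Offset of each node's clock relative to PL-4.
offsets = {node: parse(stamp) - reference for node, stamp in same_instant.items()}

def to_node_time(node, pl4_stamp):
    """Hypothetical helper: map a PL-4 timestamp to the given node's local clock."""
    return (parse(pl4_stamp) + offsets[node]).strftime("%b %d %H:%M:%S")

for node, delta in offsets.items():
    print(node, delta)
```

For example, `to_node_time("SC-1", "Sep 10 17:52:46")` gives `Sep 27 18:46:53`, which is the window to search in the SC-1 syslog.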
Diff: