#2025 Cluster reset happened during headless as CLMNA faulted due to healthCheckcallbackTimeout

future
unassigned
nobody
None
defect
clm
-
major
2016-09-20
2016-09-12
Ritu Raj
No

Environment details

OS : SUSE 64-bit
Changeset : 7997 (5.1.FC)
Setup : 5 nodes (3 controllers and 2 payloads) with headless feature enabled and 1PBE with 10K objects

Summary :

Cluster reset happened during headless as CLMNA faulted due to healthCheckcallbackTimeout

Steps followed & Observed behaviour

  1. Invoked headless by killing the Active controller, followed by the Standby and Spare controllers,
    maintaining a gap of 6 sec between controller reboots.

  2. After a couple of failovers, CLMNA faulted on PL-4 and PL-5 due to healthCheckcallbackTimeout, and a cluster reset happened.

Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO Performing failover of 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' (SU failover count: 1)
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO 'safComp=CLMNA,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO 'safComp=CLMNA,safSu=PL-4,safSg=NoRed,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: ER safComp=CLMNA,safSu=PL-4,safSg=NoRed,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: Rebooting OpenSAF NodeId = 132111 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 132111, SupervisionTime = 60
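
For anyone scanning the attached syslogs for the same fault, the following is a minimal sketch of how the osafamfnd lines quoted above could be picked apart. The regular expressions are written against the exact wording of these log lines only; the function names are hypothetical and not part of OpenSAF.

```python
import re

# Matches the NO-level "faulted due to" line quoted in this ticket.
# Assumption: other OpenSAF versions may word this line differently.
FAULT_RE = re.compile(
    r"'(?P<comp>safComp=[^']+)' faulted due to '(?P<reason>[^']+)' : "
    r"Recovery is '(?P<recovery>[^']+)'"
)

# Matches the probation timer line, which reports its timeout in nanoseconds.
TIMEOUT_RE = re.compile(r"timeout: (?P<ns>\d+) ns")


def parse_fault(line):
    """Return component DN, fault reason and recovery, or None if no match."""
    m = FAULT_RE.search(line)
    return m.groupdict() if m else None


def probation_timeout_seconds(line):
    """Return the SU failover probation timeout in seconds, or None."""
    m = TIMEOUT_RE.search(line)
    return int(m.group("ns")) / 1_000_000_000 if m else None
```

Applied to the lines above, this reports CLMNA on PL-4 with reason healthCheckcallbackTimeout and recovery suFailover, and a probation timeout of 1200 seconds (20 minutes).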

Notes:

  1. The system clocks are not synchronized across the nodes.
    Relative to PL-4 (Sep 10 17:52:46 SCALE_SLOT-74), the corresponding times on the other systems are:
    Sep 27 18:46:53: SC-1
    Oct 03 10:02:54: SC-2
    Oct 03 10:26:44: SC-3
    Sep 10 17:54:46: PL-5
    No syslog entries were logged on the controllers during the above time.

  2. Syslogs of SC-1, SC-2, SC-3, PL-4 and PL-5 are attached.

  3. clmnd traces were not enabled.
5 Attachments
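
Because the node clocks differ, correlating the attached syslogs requires translating timestamps between nodes. A minimal sketch using the reference times from note 1 is shown below. The year is an assumption (syslog timestamps carry none), the helper names are hypothetical, and `%b` parsing assumes an English/C locale.

```python
from datetime import datetime

YEAR = 2016  # assumption: the syslog format omits the year


def parse_ts(text):
    """Parse a syslog-style timestamp such as 'Sep 10 17:52:46'."""
    return datetime.strptime(f"{YEAR} {text}", "%Y %b %d %H:%M:%S")


# Local clock reading on each node at the same instant (from note 1).
REFERENCE = {
    "PL-4": parse_ts("Sep 10 17:52:46"),
    "SC-1": parse_ts("Sep 27 18:46:53"),
    "SC-2": parse_ts("Oct 03 10:02:54"),
    "SC-3": parse_ts("Oct 03 10:26:44"),
    "PL-5": parse_ts("Sep 10 17:54:46"),
}


def offset_from_pl4(node):
    """Return how far the node's clock runs ahead of PL-4's clock."""
    return REFERENCE[node] - REFERENCE["PL-4"]


def to_node_time(pl4_text, node):
    """Translate a PL-4 syslog timestamp into the given node's local time."""
    return parse_ts(pl4_text) + offset_from_pl4(node)
```

For example, `to_node_time("Sep 10 17:52:46", "PL-5")` gives the PL-5 local time of the PL-4 fault, which is the point to look for in the PL-5 syslog.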

Discussion

  • Ritu Raj - 2016-09-12
    • summary: Cluster reset happend during headless as CLMNA faulted due to csiSetcallbackTimeout --> Cluster reset happend during headless as CLMNA faulted due to healthCheckcallbackTimeout
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -4,13 +4,13 @@
     Setup : 5 nodes ( 3 controllers and 2 payloads with headless feature enabled & 1PBE with 10K objects
    
     #Summary :
    -Cluster reset happend  during headless as CLMNA  faulted due to csiSetcallbackTimeout 
    +Cluster reset happend  during headless as CLMNA  faulted due to healthCheckcallbackTimeout
    
     #Steps followed & Observed behaviour
    
     1. Invoked headless by killing Active followed by Standby and Spare Controller,
         maintaining gap of 6 sec between controller reboot
    
    -2. After couple of failover, CLMNA faulted on PL-4 and PL-5 due to csiSetcallbackTimeout, and cluster reset happened.
    +2. After couple of failover, CLMNA faulted on PL-4 and PL-5 due to healthCheckcallbackTimeout, and cluster reset happened.
    
     Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO SU failover probation timer started (timeout: 1200000000000 ns)
     Sep 10 17:52:46 SCALE_SLOT-74 osafamfnd[12421]: NO Performing failover of 'safSu=PL-4,safSg=NoRed,safApp=OpenSAF' (SU failover count: 1)
    
     
  • Anders Widell - 2016-09-20
    • Milestone: 4.7.2 --> 5.0.2
     
  • Anders Widell - 2017-04-03
    • Milestone: 5.0.2 --> future
     
