#2950 clmd: crash after split brain event

future
accepted
nobody
None
defect
unknown
-
major
False
2019-07-23
2018-10-30
Gary Lee
No

After applying [#2935] so that one SC is kept up after a split brain, clmd sometimes crashes:

2018-10-30 22:04:38.926 SC-1 osafimmnd[211]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
2018-10-30 22:04:38.926 SC-1 osafimmpbed: NO Update epoch 4 committing with ccbId:10000002a/4294967338
2018-10-30 22:04:39.699 SC-1 osafclmd[275]: ER saImmOiImplementerSet failed rc: 6, exiting
2018-10-30 22:04:39.701 SC-1 osafamfnd[304]: ER safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
2018-10-30 22:04:39.701 SC-1 osafamfnd[304]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

CLMD trace:

<143>1 2018-10-31T10:38:11.036839+11:00 SC-1 osafclmd 275 osafclmd [meta sequenceId="20598"] 275:clm/clmd/clms_imm.cc:820 >> clms_retry_pending_rtupdates
<143>1 2018-10-31T10:38:11.036848+11:00 SC-1 osafclmd 275 osafclmd [meta sequenceId="20599"] 275:clm/clmd/clms_imm.cc:823 << clms_retry_pending_rtupdates: Implementerset yet to happen, try later
<143>1 2018-10-31T10:38:11.036861+11:00 SC-1 osafclmd 275 osafclmd [meta sequenceId="20600"] 275:clm/clmd/clms_main.cc:490 TR There is an IMM task to be tried again. setting poll time out to 500
<143>1 2018-10-31T10:38:11.099564+11:00 SC-1 osafclmd 275 osafclmd [meta sequenceId="20601"] 278:mds/mds_dt_trans.c:755 >> mdtm_process_poll_recv_data_tcp
<139>1 2018-10-31T10:38:11.09987+11:00 SC-1 osafclmd 275 osafclmd [meta sequenceId="20602"] 600:clm/clmd/clms_imm.cc:2771 ER saImmOiImplementerSet failed rc: 6, exiting

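The rc: 6 returned by saImmOiImplementerSet is SA_AIS_ERR_TRY_AGAIN. Increasing the maximum waiting time (max_waiting_time_ms in clms_imm.cc) from 60 to 120 seconds appears to fix the issue: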

diff --git a/src/clm/clmd/clms_imm.cc b/src/clm/clmd/clms_imm.cc
index 017607d..cea4755 100644
--- a/src/clm/clmd/clms_imm.cc
+++ b/src/clm/clmd/clms_imm.cc
@@ -42,7 +42,7 @@ static uint32_t clms_lock_send_no_start_cbk(CLMS_CLUSTER_NODE *nodeop);
 static const SaVersionT immVersion = {'A', 2, 1};

 const unsigned int sleep_delay_ms = 500;
-const unsigned int max_waiting_time_ms = 60 * 1000; /* 60 seconds */
+const unsigned int max_waiting_time_ms = 120 * 1000; /* 120 seconds */

 /**
  * Initialize the track response patricia tree for the node
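
To illustrate why doubling max_waiting_time_ms can matter, here is a minimal sketch of a bounded retry on TRY_AGAIN. It is not the clmd implementation: the helper retry_on_try_again, the Rc enum, and the injectable sleep function are hypothetical names introduced for illustration, and the loop shape is an assumption. Only the two constants (500 ms delay, 60 s versus 120 s budget) come from the diff above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Stand-in for the two SaAisErrorT values relevant here
// (SA_AIS_OK = 1, SA_AIS_ERR_TRY_AGAIN = 6). Hypothetical type, not the SAF header.
enum class Rc { kOk = 1, kTryAgain = 6 };

struct RetryResult {
  Rc rc;                   // last return code seen
  unsigned int attempts;   // number of times the operation was called
  unsigned int waited_ms;  // total time spent sleeping between attempts
};

// Default sleep: really block the calling thread.
inline void real_sleep_ms(unsigned int ms) {
  std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}

// Hypothetical helper: call `op` until it stops returning kTryAgain or the
// accumulated sleep time reaches max_waiting_time_ms.
inline RetryResult retry_on_try_again(
    const std::function<Rc()>& op, unsigned int sleep_delay_ms,
    unsigned int max_waiting_time_ms,
    const std::function<void(unsigned int)>& sleep_fn = real_sleep_ms) {
  assert(sleep_delay_ms > 0);  // a zero delay would never exhaust the budget
  RetryResult r{op(), 1, 0};
  while (r.rc == Rc::kTryAgain && r.waited_ms < max_waiting_time_ms) {
    sleep_fn(sleep_delay_ms);
    r.waited_ms += sleep_delay_ms;
    r.rc = op();
    ++r.attempts;
  }
  return r;  // still kTryAgain here means the budget ran out
}
```

Under these assumptions, an operation that only succeeds about 75 seconds after the first attempt (for example, while IMM is still settling after the split brain) exhausts a 60-second budget and is reported as a failure, whereas a 120-second budget lets it complete.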

Related

Tickets: #2935

Discussion

  • Gary Lee - 2019-01-09
    • Milestone: 5.19.01 --> 5.19.03

  • Gary Lee - 2019-03-26
    • Milestone: 5.19.03 --> 5.19.06

  • Gary Lee - 2019-07-23
    • Milestone: 5.19.07 --> future
