Setup:
Changeset- 7436
Version - opensaf 5.0
4 nodes configured with single PBE and a load of 30K objects
Steps performed:
During switchover, on the quiesced controller the error message "Failed to stop cluster tracking" is observed in the syslog
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafamfd[6197]: NO ROLE SWITCH Active --> Quiesced
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafimmnd[6138]: NO Implementer (applier) connected: 9499 (@OpenSafImmReplicatorA) <13773, 2010f>
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafntfimcnd[17992]: NO Started
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafamfd[6197]: ER Failed to stop cluster tracking 12
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafamfd[6197]: ER ClmTrack stop failed
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafrded[6108]: NO RDE role set to QUIESCED
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafimmnd[6138]: NO Implementer (applier) connected: 9500 (@OpenSafImmReplicatorB) <0, 2020f>
Apr 7 04:12:43 SLES-64BIT-SLOT1 osafimmnd[6138]: NO Implementer disconnected 9490 <26, 2010f> (safAmfService)
Please upload the logs.
log files of Active and Standby controller attached.
From the analysis, it looks that the req reaches to new act clmd, which rejects clm stop:
Apr 12 16:51:35.924476 osafclmd [22505:clms_evt.c:1618] >> clms_process_mbx
Apr 12 16:51:35.924486 osafclmd [22505:clms_evt.c:1585] >> process_api_evt
Apr 12 16:51:35.924491 osafclmd [22505:clms_evt.c:1236] >> proc_track_stop_msg
Apr 12 16:51:35.924495 osafclmd [22505:clms_util.c:0113] >> clms_node_get_by_id
Apr 12 16:51:35.924499 osafclmd [22505:clms_util.c:0121] TR Node found 131599
Apr 12 16:51:35.924503 osafclmd [22505:clms_util.c:0124] << clms_node_get_by_id
Apr 12 16:51:35.924507 osafclmd [22505:clms_evt.c:1239] TR Node id = 131599
Apr 12 16:51:35.924515 osafclmd [22505:clms_evt.c:1255] IN Client 19 didn't subscribe for track start
Apr 12 16:51:35.924520 osafclmd [22505:clms_mds.c:1484] >> clms_mds_msg_send
Apr 12 16:51:35.924618 osafclmd [22505:clms_mds.c:1517] << clms_mds_msg_send
Apr 12 16:51:35.924640 osafclmd [22505:clms_evt.c:1289] << proc_track_stop_msg
Apr 12 16:51:35.924646 osafclmd [22505:clms_evt.c:1601] << process_api_evt
Apr 12 16:51:35.924651 osafclmd [22505:clms_evt.c:1667] << clms_process_mbx
I tested on CS 7324 and this isuue is not reproducible. This CS 7324 is just before first patch of #79 series.
So, is it because of #79 series patches ?
Last edit: Nagendra Kumar 2016-04-13
I also got below message during failover but this message is not frequent.
Apr 13 18:28:08 SLES-64BIT-SLOT1 osafsmfd[2228]: ER saClmClusterNodeGet failed, rc=SA_AIS_ERR_NOT_EXIST (12)
Apr 13 18:28:08 SLES-64BIT-SLOT1 osafsmfd[2228]: WA SMFND UP failed
Hi,
I tested this case, with CS #7326 and this issue is not comming, but it is comming with CS #7327. So, this issue exists at:
changeset: 7327:4ed808a6af5b
user: Hans Nordeback hans.nordeback@ericsson.com
date: Tue Mar 15 13:59:58 2016 +0100
summary: clmd: Wait for cold sync to complete before sending message based checkpoints [#1701]
Please check.
Related
Tickets:
#1701Try to reproduce the fault with clmd trace active. Before the correction in CS #7327 there were problem with message based checkpoints arriving before cold sync completed. The patch fixes this problem.
The issue comes in second switchover.
SC-2:
Apr 14 16:44:07 PM_SC-2 osafamfnd[31337]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Apr 14 16:44:07 PM_SC-2 osafamfnd[31337]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Apr 14 16:44:07 PM_SC-2 osafamfd[31327]: NO Controller switch over initiated
Apr 14 16:44:07 PM_SC-2 osafamfd[31327]: NO ROLE SWITCH Active --> Quiesced
Apr 14 16:44:07 PM_SC-2 osafamfd[31327]: ER Failed to stop cluster tracking 12
Apr 14 16:44:07 PM_SC-2 osafamfd[31327]: ER ClmTrack stop failed
Apr 14 16:44:07 PM_SC-2 osafrded[31244]: NO RDE role set to QUIESCED
Apr 14 16:44:07 PM_SC-2 osafimmnd[31274]: NO Implementer disconnected 21 <12, 2020f> (safAmfService)
Apr 14 16:44:07 PM_SC-2 osafimmnd[31274]: NO Implementer (applier) connected: 29 (@safAmfService2020f) <12, 2020f>
Apr 14 16:44:08 PM_SC-2 osafimmnd[31274]: NO Implementer disconnected 20 <0, 2010f> (@safAmfService2010f)
Apr 14 16:44:08 PM_SC-2 osafimmnd[31274]: NO Implementer connected: 30 (safAmfService) <0, 2010f>
Apr 14 16:44:08 PM_SC-2 osafamfd[31327]: NO Switching Quiesced --> StandBy
Apr 14 16:44:08 PM_SC-2 osafrded[31244]: NO RDE role set to STANDBY
Apr 14 16:44:08 PM_SC-2 osafamfd[31327]: NO Controller switch over done
Please find clmd traces attached.
I did one more switchover and the problem is repeated with next subsequent switchover.
Please find the traces with 3 switchover done:
The logs on SC-1 in 3rd switchover:
Apr 14 16:48:21 PM_SC-1 osafamfnd[22254]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 14 16:48:21 PM_SC-1 osafamfnd[22254]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 14 16:48:21 PM_SC-1 osafamfd[22244]: NO Controller switch over initiated
Apr 14 16:48:21 PM_SC-1 osafamfd[22244]: NO ROLE SWITCH Active --> Quiesced
Apr 14 16:48:21 PM_SC-1 osafamfd[22244]: ER Failed to stop cluster tracking 12
Apr 14 16:48:21 PM_SC-1 osafamfd[22244]: ER ClmTrack stop failed
Apr 14 16:48:21 PM_SC-1 osafrded[22164]: NO RDE role set to QUIESCED
Apr 14 16:48:22 PM_SC-1 osafimmnd[22194]: NO Implementer disconnected 30 <16, 2010f> (safAmfService)
Apr 14 16:48:22 PM_SC-1 osafimmnd[22194]: NO Implementer (applier) connected: 38 (@safAmfService2010f) <16, 2010f>
Apr 14 16:48:22 PM_SC-1 osafimmnd[22194]: NO Implementer disconnected 29 <0, 202
changeset: 7534:f1f6c25f150b
branch: opensaf-5.0.x
parent: 7532:f80647c1a0af
user: Anders Widellanders.widell@ericsson.com
date: Sun Apr 24 02:10:41 2016 +0530
summary: clm: revert incorrect fix of #1701 - i.e. fix 1738, 1762, 1777
changeset: 7535:546086f9daa2
tag: tip
parent: 7533:fea799faf484
user: Anders Widellanders.widell@ericsson.com
date: Sun Apr 24 02:10:41 2016 +0530
summary: clm: revert incorrect fix of #1701 - i.e. fix 1738, 1762, 1777
changeset: 7534:f1f6c25f150b
branch: opensaf-5.0.x
parent: 7532:f80647c1a0af
user: Anders Widellanders.widell@ericsson.com
date: Sun Apr 24 02:10:41 2016 +0530
summary: clm: revert incorrect fix of #1701 - i.e. fix 1738, 1762, 1777
changeset: 7535:546086f9daa2
tag: tip
parent: 7533:fea799faf484
user: Anders Widellanders.widell@ericsson.com
date: Sun Apr 24 02:10:41 2016 +0530
summary: clm: revert incorrect fix of #1701 - i.e. fix 1738, 1762, 1777
The only change related to checkpointing was introduced in the fix of ticket 1701.
So, given that 1738, 1762 is not reproduced if 1701 is removed and given that this ticket also has the same pattern, i have reverted 1701 patch as a fix for the 3 tickets 1738, 1726 and 1777.
Note: I had used an incorrect .hgrc while pushing this patch, the user name above should be mine.