Menu

#1738 ClmTrack stop failed error message observed in the syslog during switchover

5.0.FC
fixed
None
defect
clm
d
5.0.FC
minor
2016-04-24
2016-04-07
Ritu Raj
No

Setup:
Changeset- 7436
Version - opensaf 5.0
4 nodes configured with single PBE and a load of 30K objects

Steps performed:
During switchover, on the quiesced controller the error message "Failed to stop cluster tracking" is observed in the syslog

Apr 7 04:12:42 SLES-64BIT-SLOT1 osafamfd[6197]: NO ROLE SWITCH Active --> Quiesced
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafimmnd[6138]: NO Implementer (applier) connected: 9499 (@OpenSafImmReplicatorA) <13773, 2010f>
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafntfimcnd[17992]: NO Started
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafamfd[6197]: ER Failed to stop cluster tracking 12
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafamfd[6197]: ER ClmTrack stop failed

Apr 7 04:12:42 SLES-64BIT-SLOT1 osafrded[6108]: NO RDE role set to QUIESCED
Apr 7 04:12:42 SLES-64BIT-SLOT1 osafimmnd[6138]: NO Implementer (applier) connected: 9500 (@OpenSafImmReplicatorB) <0, 2020f>
Apr 7 04:12:43 SLES-64BIT-SLOT1 osafimmnd[6138]: NO Implementer disconnected 9490 <26, 2010f> (safAmfService)

Discussion

  • Nagendra Kumar

    Nagendra Kumar - 2016-04-11

    Please upload the logs.

     
  • Ritu Raj

    Ritu Raj - 2016-04-11

    log files of Active and Standby controller attached.

     
  • Nagendra Kumar

    Nagendra Kumar - 2016-04-12
    • Component: unknown --> clm
    • Part: - --> d
     
  • Nagendra Kumar

    Nagendra Kumar - 2016-04-12

    From the analysis, it looks that the req reaches to new act clmd, which rejects clm stop:

    Apr 12 16:51:35.924476 osafclmd [22505:clms_evt.c:1618] >> clms_process_mbx
    Apr 12 16:51:35.924486 osafclmd [22505:clms_evt.c:1585] >> process_api_evt
    Apr 12 16:51:35.924491 osafclmd [22505:clms_evt.c:1236] >> proc_track_stop_msg
    Apr 12 16:51:35.924495 osafclmd [22505:clms_util.c:0113] >> clms_node_get_by_id
    Apr 12 16:51:35.924499 osafclmd [22505:clms_util.c:0121] TR Node found 131599
    Apr 12 16:51:35.924503 osafclmd [22505:clms_util.c:0124] << clms_node_get_by_id
    Apr 12 16:51:35.924507 osafclmd [22505:clms_evt.c:1239] TR Node id = 131599
    Apr 12 16:51:35.924515 osafclmd [22505:clms_evt.c:1255] IN Client 19 didn't subscribe for track start
    Apr 12 16:51:35.924520 osafclmd [22505:clms_mds.c:1484] >> clms_mds_msg_send
    Apr 12 16:51:35.924618 osafclmd [22505:clms_mds.c:1517] << clms_mds_msg_send
    Apr 12 16:51:35.924640 osafclmd [22505:clms_evt.c:1289] << proc_track_stop_msg
    Apr 12 16:51:35.924646 osafclmd [22505:clms_evt.c:1601] << process_api_evt
    Apr 12 16:51:35.924651 osafclmd [22505:clms_evt.c:1667] << clms_process_mbx

     
  • Ritu Raj

    Ritu Raj - 2016-04-12
    • Milestone: 4.6.2 --> 5.0.FC
     
  • Nagendra Kumar

    Nagendra Kumar - 2016-04-13
    • Version: --> 5.0.FC
     
  • Nagendra Kumar

    Nagendra Kumar - 2016-04-13

    I tested on CS 7324 and this isuue is not reproducible. This CS 7324 is just before first patch of #79 series.
    So, is it because of #79 series patches ?

     

    Last edit: Nagendra Kumar 2016-04-13
  • Ritu Raj

    Ritu Raj - 2016-04-13

    I also got below message during failover but this message is not frequent.

    Apr 13 18:28:08 SLES-64BIT-SLOT1 osafsmfd[2228]: ER saClmClusterNodeGet failed, rc=SA_AIS_ERR_NOT_EXIST (12)
    Apr 13 18:28:08 SLES-64BIT-SLOT1 osafsmfd[2228]: WA SMFND UP failed

     
  • Nagendra Kumar

    Nagendra Kumar - 2016-04-14

    Hi,
    I tested this case, with CS #7326 and this issue is not comming, but it is comming with CS #7327. So, this issue exists at:
    changeset: 7327:4ed808a6af5b
    user: Hans Nordeback hans.nordeback@ericsson.com
    date: Tue Mar 15 13:59:58 2016 +0100
    summary: clmd: Wait for cold sync to complete before sending message based checkpoints [#1701]

    Please check.

     

    Related

    Tickets: #1701

  • Hans Nordebäck

    Hans Nordebäck - 2016-04-14

    Try to reproduce the fault with clmd trace active. Before the correction in CS #7327 there were problem with message based checkpoints arriving before cold sync completed. The patch fixes this problem.

     
  • Nagendra Kumar

    Nagendra Kumar - 2016-04-14

    The issue comes in second switchover.
    SC-2:
    Apr 14 16:44:07 PM_SC-2 osafamfnd[31337]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
    Apr 14 16:44:07 PM_SC-2 osafamfnd[31337]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
    Apr 14 16:44:07 PM_SC-2 osafamfd[31327]: NO Controller switch over initiated
    Apr 14 16:44:07 PM_SC-2 osafamfd[31327]: NO ROLE SWITCH Active --> Quiesced
    Apr 14 16:44:07 PM_SC-2 osafamfd[31327]: ER Failed to stop cluster tracking 12
    Apr 14 16:44:07 PM_SC-2 osafamfd[31327]: ER ClmTrack stop failed
    Apr 14 16:44:07 PM_SC-2 osafrded[31244]: NO RDE role set to QUIESCED
    Apr 14 16:44:07 PM_SC-2 osafimmnd[31274]: NO Implementer disconnected 21 <12, 2020f> (safAmfService)
    Apr 14 16:44:07 PM_SC-2 osafimmnd[31274]: NO Implementer (applier) connected: 29 (@safAmfService2020f) <12, 2020f>
    Apr 14 16:44:08 PM_SC-2 osafimmnd[31274]: NO Implementer disconnected 20 <0, 2010f> (@safAmfService2010f)
    Apr 14 16:44:08 PM_SC-2 osafimmnd[31274]: NO Implementer connected: 30 (safAmfService) <0, 2010f>
    Apr 14 16:44:08 PM_SC-2 osafamfd[31327]: NO Switching Quiesced --> StandBy
    Apr 14 16:44:08 PM_SC-2 osafrded[31244]: NO RDE role set to STANDBY
    Apr 14 16:44:08 PM_SC-2 osafamfd[31327]: NO Controller switch over done

    Please find clmd traces attached.

     
  • Nagendra Kumar

    Nagendra Kumar - 2016-04-14

    I did one more switchover and the problem is repeated with next subsequent switchover.
    Please find the traces with 3 switchover done:
    The logs on SC-1 in 3rd switchover:
    Apr 14 16:48:21 PM_SC-1 osafamfnd[22254]: NO Assigning 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
    Apr 14 16:48:21 PM_SC-1 osafamfnd[22254]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' STANDBY to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
    Apr 14 16:48:21 PM_SC-1 osafamfd[22244]: NO Controller switch over initiated
    Apr 14 16:48:21 PM_SC-1 osafamfd[22244]: NO ROLE SWITCH Active --> Quiesced
    Apr 14 16:48:21 PM_SC-1 osafamfd[22244]: ER Failed to stop cluster tracking 12
    Apr 14 16:48:21 PM_SC-1 osafamfd[22244]: ER ClmTrack stop failed
    Apr 14 16:48:21 PM_SC-1 osafrded[22164]: NO RDE role set to QUIESCED
    Apr 14 16:48:22 PM_SC-1 osafimmnd[22194]: NO Implementer disconnected 30 <16, 2010f> (safAmfService)
    Apr 14 16:48:22 PM_SC-1 osafimmnd[22194]: NO Implementer (applier) connected: 38 (@safAmfService2010f) <16, 2010f>
    Apr 14 16:48:22 PM_SC-1 osafimmnd[22194]: NO Implementer disconnected 29 <0, 202

     
  • Mathi Naickan

    Mathi Naickan - 2016-04-23

    changeset: 7534:f1f6c25f150b
    branch: opensaf-5.0.x
    parent: 7532:f80647c1a0af
    user: Anders Widellanders.widell@ericsson.com
    date: Sun Apr 24 02:10:41 2016 +0530
    summary: clm: revert incorrect fix of #1701 - i.e. fix 1738, 1762, 1777

    changeset: 7535:546086f9daa2
    tag: tip
    parent: 7533:fea799faf484
    user: Anders Widellanders.widell@ericsson.com
    date: Sun Apr 24 02:10:41 2016 +0530
    summary: clm: revert incorrect fix of #1701 - i.e. fix 1738, 1762, 1777

     
  • Mathi Naickan

    Mathi Naickan - 2016-04-23
    • status: unassigned --> fixed
    • assigned_to: Mathi Naickan
     
  • Mathi Naickan

    Mathi Naickan - 2016-04-23

    changeset: 7534:f1f6c25f150b
    branch: opensaf-5.0.x
    parent: 7532:f80647c1a0af
    user: Anders Widellanders.widell@ericsson.com
    date: Sun Apr 24 02:10:41 2016 +0530
    summary: clm: revert incorrect fix of #1701 - i.e. fix 1738, 1762, 1777

    changeset: 7535:546086f9daa2
    tag: tip
    parent: 7533:fea799faf484
    user: Anders Widellanders.widell@ericsson.com
    date: Sun Apr 24 02:10:41 2016 +0530
    summary: clm: revert incorrect fix of #1701 - i.e. fix 1738, 1762, 1777

     
  • Mathi Naickan

    Mathi Naickan - 2016-04-24

    The only change related to checkpointing was introduced in the fix of ticket 1701.
    So, given that 1738, 1762 is not reproduced if 1701 is removed and given that this ticket also has the same pattern, i have reverted 1701 patch as a fix for the 3 tickets 1738, 1726 and 1777.

    Note: I had used an incorrect .hgrc while pushing this patch, the user name above should be mine.

     

Log in to post a comment.