
#2265 clm: clmd coredump

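Milestone: 5.0.2
Status: fixed
Owner: Praveen
Labels: None
Type: defect
Component: clm
Part: d
Severity: major
Updated: 2017-02-10
Created: 2017-01-16
Creator: Hung Nguyen
Private: No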

Jan 11 10:36:23 SC-2 osafclmd[14467]: ER Node is NULL,problem with the database.
Jan 11 10:36:23 SC-2 osafclmd[14467]: ../../../../../../../opensaf/osaf/services/saf/clmsv/clms/clms_mbcsv.c:467: ckpt_proc_node_rec: Assertion '0' failed.
Jan 11 10:36:23 SC-2 osafamfnd[14497]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Jan 11 10:36:23 SC-2 osafamfnd[14497]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Jan 11 10:36:23 SC-2 osafamfnd[14497]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Jan 11 10:36:23 SC-2 opensaf_reboot: Rebooting local node; timeout=60

Related

Tickets: #2265
Wiki: ChangeLog-5.0.2
Wiki: ChangeLog-5.1.1

Discussion

  • Praveen

    Praveen - 2017-01-19

    Hi Hung,

    Please share traces/logs and steps to reproduce.

    Thanks,
    Praveen

     
  • Hung Nguyen

    Hung Nguyen - 2017-01-23

    Hi,
    Here's the syslog; tracing was not enabled.

     
  • Praveen

    Praveen - 2017-01-23
    • status: unassigned --> assigned
    • assigned_to: Praveen
    • Milestone: 5.2.FC --> 5.0.2
     
  • Praveen

    Praveen - 2017-01-23

    Hi Hung,

    Thanks for the logs. I am going through them.

    Thanks,
    Praveen

     
  • Praveen

    Praveen - 2017-01-25

    Hi,
    Syslogs in systemlogs.tgz indicate that the cluster was coming up with SC-1, SC-2 and PL-3, and that some CCB operations were initiated while SC-2 and PL-3 were still joining.
    If the CCB operations are related to scale-out, there is a very thin window of time in which this issue can occur. I widened this window by adding some sleeps and reproduced the issue. Logs, traces and the configuration file to add a payload to the CLM cluster are attached in clm_issue.tgz. I have not used the scale-out script, but I think the issue can be reproduced with it as well.

    Steps to reproduce:
    1) Bring the first controller up with the attached imm.xml. It contains all the MW objects for PL-3 except the CLM node object.
    2) While the standby controller is coming up and its CLMS is reading the cluster information from IMM, add the PL-3 configuration with: immcfg -f pl_3.xml. The active CLMS will not checkpoint this node to the standby CLMS, because the standby is not yet visible via MBCSV.
    3) While the standby is encoding the MBCSV request for cold sync, modify an attribute of CLM node PL-3 with:
    immcfg -a saClmNodeLockCallbackTimeout=50000 safNode=PL-3,safCluster=myClmCluster
    4) Since the standby CLMS is now visible, the active will try to send this runtime update to the standby. PL-3 was added at runtime after the standby had read the configuration from IMM, so the standby asserts because it cannot find PL-3.
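    The failing sequence in steps 1-4 can be modelled in a few lines of C. This is a hypothetical, heavily simplified sketch: `node_db`, `db_add`, `db_find`, `standby_apply_node_update` and `replay_race` are illustrative names, not OpenSAF functions, and nodes are identified by short names instead of full DNs. Where the real standby asserts (clms_mbcsv.c:467), the sketch returns -1 so the failure can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NODES 8

/* Hypothetical, simplified node table; not the real CLMS data structures. */
typedef struct {
    const char *names[MAX_NODES];
    size_t count;
} node_db;

/* Add a node, as a CLMS does when it reads the cluster configuration from IMM. */
static bool db_add(node_db *db, const char *name) {
    if (db->count == MAX_NODES)
        return false;
    db->names[db->count++] = name;
    return true;
}

/* Return the stored name, or NULL when the node is unknown. */
static const char *db_find(const node_db *db, const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < db->count; i++)
        if (strcmp(db->names[i], name) == 0)
            return db->names[i];
    return NULL;
}

/* Standby handling of an async node-record checkpoint. The real code logs
 * "Node is NULL, problem with the database" and asserts; we return -1. */
static int standby_apply_node_update(const node_db *db, const char *name) {
    if (db_find(db, name) == NULL)
        return -1;
    return 0;
}

/* Replays the reproduction steps with hypothetical node names. */
static int replay_race(void) {
    node_db active = {0}, standby = {0};

    /* Step 1/2: both controllers know SC-1 and SC-2; the standby reads IMM now. */
    db_add(&active, "SC-1");
    db_add(&active, "SC-2");
    db_add(&standby, "SC-1");
    db_add(&standby, "SC-2");

    /* Step 2: immcfg -f pl_3.xml reaches only the active (standby not visible). */
    db_add(&active, "PL-3");

    /* Steps 3/4: the modify on PL-3 is sent as an async update once the
     * standby becomes visible, but the standby never learned about PL-3. */
    if (db_find(&active, "PL-3") == NULL)
        return 1; /* not expected: the active always has PL-3 here */
    return standby_apply_node_update(&standby, "PL-3");
}
```

    Calling `replay_race()` returns -1: the active knows PL-3, but the standby's table was filled from IMM before PL-3 was created, so the async update has no record to apply to.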

    Solution: One possible solution is that the active should not send async updates until cold sync has completed. Another is that the standby CLMS should ignore async update requests until cold sync has completed, since it will receive the updated state in the cold sync messages. This needs to be evaluated.
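    The second option (the standby ignores async updates until cold sync completes) can be sketched as below. The names (`standby_state`, `cold_sync_done`, `standby_on_async_update`, `standby_on_cold_sync`) are hypothetical and this is not the committed fix; it only illustrates why dropping early async updates is safe when the cold sync response carries the full node list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NODES 8

/* Hypothetical standby CLMS state; not the real CLMS control block. */
typedef struct {
    const char *names[MAX_NODES]; /* nodes the standby knows about */
    size_t count;
    bool cold_sync_done;          /* set once the cold sync response is processed */
} standby_state;

typedef enum { UPDATE_APPLIED, UPDATE_IGNORED, UPDATE_UNKNOWN_NODE } update_result;

static bool has_node(const standby_state *s, const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0)
            return true;
    return false;
}

/* Async update handler: before cold sync completes, drop the update instead
 * of asserting; the cold sync response will carry the current state. */
static update_result standby_on_async_update(const standby_state *s, const char *name) {
    if (!s->cold_sync_done)
        return UPDATE_IGNORED;
    if (!has_node(s, name))
        return UPDATE_UNKNOWN_NODE; /* after cold sync this is a real inconsistency */
    return UPDATE_APPLIED;
}

/* Cold sync handler: replace the node table with the active's full list. */
static void standby_on_cold_sync(standby_state *s, const char *const *names, size_t n) {
    assert(n <= MAX_NODES);
    for (size_t i = 0; i < n; i++)
        s->names[i] = names[i];
    s->count = n;
    s->cold_sync_done = true;
}
```

    The first option is the mirror image on the active side: the active would check a per-peer "cold sync completed" flag before encoding an async update, and the standby would then learn about PL-3, with its latest attributes, from the cold sync itself.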

    Thanks,
    Praveen

     
  • Praveen

    Praveen - 2017-01-27
    • status: assigned --> accepted
     
  • Praveen

    Praveen - 2017-02-03
    • status: accepted --> review
     
  • Praveen

    Praveen - 2017-02-10
    • status: review --> fixed
     
