#816 CLM causes cluster restart when unknown node tries to join

4.4.1
fixed
None
defect
clm
d
4.4
major
2014-08-18
2014-03-21
Hans Feldt
No

When an unconfigured node tries to join an existing 4.4 CLM cluster, the osafclmd process segfaults. After failover, the new active osafclmd also segfaults, and we get a cluster restart.

Mar 21 14:06:12 SC-1 local0.err osafclmd[418]: ER CLM NodeName: 'PL-6' doesn't match entry in imm.xml. Specify a correct node name in/etc/opensaf/node_name
Mar 21 14:06:12 SC-1 local0.notice osafamfnd[441]: NO 'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Mar 21 14:06:12 SC-1 local0.err osafamfnd[441]: ER safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Mar 21 14:06:12 SC-1 local0.crit osafamfnd[441]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

Mar 21 14:06:35 SC-2 local0.notice osafamfd[431]: NO Node 'SC-1' left the cluster
Mar 21 14:06:37 SC-2 local0.err osafclmd[415]: ER CLM NodeName: 'PL-6' doesn't match entry in imm.xml. Specify a correct node name in/etc/opensaf/node_name
Mar 21 14:06:37 SC-2 local0.notice osafamfnd[439]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Mar 21 14:06:37 SC-2 local0.err osafamfnd[439]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Mar 21 14:06:37 SC-2 local0.crit osafamfnd[439]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60

The log entry is also wrong. Its level, ER, is wrong: this need not be an error, since it also happens during scale-out, when a new node is added. It should be a notice. The text itself is also incorrect, since the condition is normally unrelated to imm.xml or the contents of node_name.

I suggest the following log instead: "NO '<RDN value>' is not a configured cluster node"
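As an illustration only, the intended behaviour could be sketched as below. This is not the attached patch or the actual osafclmd code: the node table, `is_configured_node`, `format_unconfigured_msg`, and `handle_join_request` are hypothetical names, and plain `syslog()` stands in for whatever logging wrapper the daemon uses. The point is that an unknown node is rejected with a notice-level message instead of being treated as a fault.

```c
#include <stdio.h>
#include <string.h>
#include <syslog.h>

/* Hypothetical stand-in for the configured CLM node table. */
static const char *configured_nodes[] = { "SC-1", "SC-2", "PL-3" };

/* Returns 1 if rdn names a configured node, 0 otherwise. */
static int is_configured_node(const char *rdn)
{
    size_t n = sizeof(configured_nodes) / sizeof(configured_nodes[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(configured_nodes[i], rdn) == 0)
            return 1;
    }
    return 0;
}

/* Formats the proposed message into buf; returns snprintf's result. */
static int format_unconfigured_msg(char *buf, size_t len, const char *rdn)
{
    return snprintf(buf, len, "NO '%s' is not a configured cluster node", rdn);
}

/* Handles a join request: returns 0 if accepted, -1 if rejected. */
static int handle_join_request(const char *rdn)
{
    char msg[256];

    if (rdn == NULL)
        return -1;

    if (!is_configured_node(rdn)) {
        /* Reject gracefully at notice level; never dereference a
         * missing node entry. */
        format_unconfigured_msg(msg, sizeof(msg), rdn);
        syslog(LOG_NOTICE, "%s", msg);
        return -1;
    }
    return 0;
}
```

With this table, `handle_join_request("PL-6")` returns -1 and logs the notice, while `handle_join_request("SC-1")` returns 0.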

This is a regression; it works with 4.3.

Patch attached with proposed solution.

2 Attachments

Related

Wiki: ChangeLog-4.4.1

Discussion

  • Mathi Naickan - 2014-03-21
    • status: unassigned --> accepted
    • assigned_to: Mathi Naickan
    • Part: - --> d
    • Priority: critical --> major
    • Milestone: future --> 4.4.1
     
  • Mathi Naickan - 2014-03-21

    Hi,

    I have attached a patch (I will send it out separately when I get my hands on the latest review tool). Meanwhile, please find my comments below.
    1) Regarding the log string and its relevance to /etc/opensaf/node_name:

    /etc/opensaf/node_name is a user-exposed configuration file.
    It contains the RDN value of the CLM node name.
    (a) When the OpenSAF cluster configuration is pre-provisioned using the OpenSAF IMM tools,
    /etc/opensaf/node_name should contain one of the values specified
    in nodes.cfg when the imm.xml was generated.
    (b) When OpenSAF cluster nodes are added dynamically at runtime,
    /etc/opensaf/node_name should contain the RDN value.
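To make the role of the node_name file concrete, here is a minimal sketch of reading the RDN value from such a file. It assumes the file holds the name on its first line; `read_node_name` is a hypothetical helper, not an OpenSAF function, and the real startup scripts may read the file differently.

```c
#include <stdio.h>
#include <string.h>

/* Reads the first line of the file at path into buf and strips the
 * trailing newline. Returns 0 on success, -1 if the file cannot be
 * read or the name is empty. (Hypothetical helper for illustration.) */
static int read_node_name(const char *path, char *buf, size_t len)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    if (fgets(buf, (int)len, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Cut the string at the first CR or LF, if any. */
    buf[strcspn(buf, "\r\n")] = '\0';
    return buf[0] != '\0' ? 0 : -1;
}
```

A caller would pass "/etc/opensaf/node_name" as the path and then compare the result against the configured CLM node names.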

    So, to the end user, the log message should convey its relationship to the node_name file in some form. I have nevertheless changed the log level to notice and reworded the message, taking your input into account, in the attached patch.

    2) Regarding the use case:

    I think it is primarily a case of non-existent configuration, and also a case of invalid configuration.
    In the cluster expansion case, I think the expansion logic should first update the cluster configuration, because otherwise the node startup will still be seen as a failed attempt by the "unconfigured" node.

    Note: Since there is a way around the situation, I have changed the priority to major. I will send a formal review request later, but please use the attached patch (your patch, modified with my impressions).

     
  • Mathi Naickan - 2014-03-28
    • status: accepted --> review
     
  • Mathi Naickan - 2014-04-14
    • status: review --> fixed
     
