#816 CLM causes cluster restart when unknown node tries to join

Milestone: 4.4.1

Status: fixed

Owner: Mathi Naickan

Labels: None

Type: defect

Component: clm

Part: d

Version: 4.4

Priority: major

Blocker:

Updated: 2014-08-18

Created: 2014-03-21

Creator: Hans Feldt

Private: No

When an unconfigured node tries to join an existing 4.4 CLM cluster, the osafclmd process segfaults. After failover, the new active osafclmd also segfaults, and we get a cluster restart.

Mar 21 14:06:12 SC-1 local0.err osafclmd[418]: ER CLM NodeName: 'PL-6' doesn't match entry in imm.xml. Specify a correct node name in/etc/opensaf/node_name
Mar 21 14:06:12 SC-1 local0.notice osafamfnd[441]: NO 'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Mar 21 14:06:12 SC-1 local0.err osafamfnd[441]: ER safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Mar 21 14:06:12 SC-1 local0.crit osafamfnd[441]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60

Mar 21 14:06:35 SC-2 local0.notice osafamfd[431]: NO Node 'SC-1' left the cluster
Mar 21 14:06:37 SC-2 local0.err osafclmd[415]: ER CLM NodeName: 'PL-6' doesn't match entry in imm.xml. Specify a correct node name in/etc/opensaf/node_name
Mar 21 14:06:37 SC-2 local0.notice osafamfnd[439]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Mar 21 14:06:37 SC-2 local0.err osafamfnd[439]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Mar 21 14:06:37 SC-2 local0.crit osafamfnd[439]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60

The log entry is also wrong. It has the wrong level, ER. This does not have to be an error: it would happen during scale-out, when adding a new node. It should be a notice. The text itself is also incorrect, since the situation is normally not related to imm.xml or the contents of node_name.

I suggest the following log instead: "NO '<RDN value>' is not a configured cluster node"
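
A minimal sketch of the behaviour suggested above, written in Python for brevity rather than the C of osafclmd, and not the attached patch: a join request from a node that is not in the configured set is refused with a non-error log line instead of crashing the service. The names `handle_join_request` and `configured_nodes`, and the boolean return value, are hypothetical.

```python
import logging

logger = logging.getLogger("clm_sketch")


def handle_join_request(rdn_value, configured_nodes):
    """Return True if the node may join, False if it is not configured.

    Hypothetical stand-in for the CLM join path; not OpenSAF code.
    configured_nodes maps an RDN value to a per-node state dict.
    """
    node = configured_nodes.get(rdn_value)
    if node is None:
        # An unknown node is an expected situation (for example during
        # scale-out), so log below error level and refuse the join.
        # Python's logging has no NOTICE level, so INFO is used here.
        logger.info("NO '%s' is not a configured cluster node", rdn_value)
        return False
    # Known node: mark it as a cluster member.
    node["member"] = True
    return True
```

The point of the sketch is only the ordering: check for the missing entry first, log, and return, so that no later code touches a node record that does not exist.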

This is a regression; it works with 4.3.

Patch attached with proposed solution.

2 Attachments

816_clm.patch

clm-1

Mathi Naickan - 2014-03-21

status: unassigned --> accepted

assigned_to: Mathi Naickan

Part: - --> d

Priority: critical --> major

Milestone: future --> 4.4.1

Mathi Naickan - 2014-03-21

Hi,

I have attached a patch (I will send it out separately when I get my hands on the latest review tool). However, please find my comments below.
1) W.r.t. the log string and its relevance to /etc/opensaf/node_name:

/etc/opensaf/node_name is a user-exposed configuration file.
The node_name file contains the RDN value of the CLM node name.
(a) When the OpenSAF cluster configuration is pre-provisioned using the OpenSAF IMM tools:
the /etc/opensaf/node_name file should contain one of the values specified
in nodes.cfg while generating the imm.xml.
(b) When OpenSAF cluster nodes are dynamically added at runtime:
the /etc/opensaf/node_name file should contain the RDN value.

So, to the end user, the log message should convey the relationship with the node_name file in some form. I have changed the log level to notice, though, and reworded the message with your inputs in the attached patch.
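
To illustrate the relationship between the node_name file and the configured node names, here is a minimal Python sketch. It assumes the file holds a single line with the RDN value; the helper names `read_node_name` and `is_configured`, and the list of configured names passed in, are hypothetical and not part of OpenSAF.

```python
from pathlib import Path


def read_node_name(path="/etc/opensaf/node_name"):
    """Return the RDN value stored in the node_name file, without whitespace."""
    return Path(path).read_text().strip()


def is_configured(node_name, configured_names):
    """True if node_name matches one of the configured CLM node names."""
    return node_name in set(configured_names)
```

For example, with configured names `["SC-1", "SC-2"]`, a node whose file contains `PL-6` would not be considered configured, which is the situation reported in this ticket.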

2) W.r.t. the use case

I think it is primarily a case of non-existent configuration and also a case of invalid configuration.
In the cluster expansion case, I think the expansion logic should first update the cluster configuration, because otherwise the node startup will still be seen as a failed attempt by an "unconfigured" node.

Note: Since there is a way around the situation, I have changed the priority to major. I will send a formal review request later, but please use the attached patch (I modified your patch with my impressions).

Mathi Naickan - 2014-03-28

https://sourceforge.net/p/opensaf/mailman/opensaf-devel/thread/patchbomb.1395882818%40ubuntu/#msg32147440

Mathi Naickan - 2014-03-28

status: accepted --> review

Mathi Naickan - 2014-04-14

status: review --> fixed

Mathi Naickan - 2014-04-14

[staging:92d926]
[staging:00fba3]
[staging:1a15ac]
[staging:30ba84]
[staging:cc869f]
[staging:8bca8f]

Related

Commit: [00fba3]
Commit: [1a15ac]
Commit: [30ba84]
Commit: [8bca8f]
Commit: [92d926]
Commit: [cc869f]
