When an unconfigured node tries to join an existing 4.4 CLM cluster, the osafclmd process segfaults; after failover, the new active osafclmd segfaults as well, and the result is a cluster restart.
Mar 21 14:06:12 SC-1 local0.err osafclmd[418]: ER CLM NodeName: 'PL-6' doesn't match entry in imm.xml. Specify a correct node name in/etc/opensaf/node_name
Mar 21 14:06:12 SC-1 local0.notice osafamfnd[441]: NO 'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Mar 21 14:06:12 SC-1 local0.err osafamfnd[441]: ER safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Mar 21 14:06:12 SC-1 local0.crit osafamfnd[441]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Mar 21 14:06:35 SC-2 local0.notice osafamfd[431]: NO Node 'SC-1' left the cluster
Mar 21 14:06:37 SC-2 local0.err osafclmd[415]: ER CLM NodeName: 'PL-6' doesn't match entry in imm.xml. Specify a correct node name in/etc/opensaf/node_name
Mar 21 14:06:37 SC-2 local0.notice osafamfnd[439]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Mar 21 14:06:37 SC-2 local0.err osafamfnd[439]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Mar 21 14:06:37 SC-2 local0.crit osafamfnd[439]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
The log entry is also wrong. It uses the wrong severity level, ER: this does not have to be an error, since it would happen during scale-out (adding a new node), so it should be notice. The message text is also incorrect, since the situation is normally not related to imm.xml or the contents of node_name.
I suggest the following log message instead: "NO '<RDN value>' is not a configured cluster node"
This is a regression; it works with 4.3.
A patch with the proposed solution is attached.
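For context, the crash pattern reduces to a missing NULL check when the joining node's name is not found among the configured CLM nodes. Below is a minimal, self-contained sketch of the defensive handling the patch amounts to; the lookup table and function names are illustrative, not the actual osafclmd code:

#include <stdio.h>
#include <string.h>
#include <syslog.h>

/* Illustrative stand-in for the CLM nodes configured in imm.xml. */
static const char *configured_nodes[] = { "SC-1", "SC-2", "PL-3" };

/* Hypothetical lookup; the real daemon searches its node database. */
static const char *find_configured_node(const char *name)
{
	for (size_t i = 0;
	     i < sizeof(configured_nodes) / sizeof(configured_nodes[0]); i++) {
		if (strcmp(configured_nodes[i], name) == 0)
			return configured_nodes[i];
	}
	return NULL;
}

static int handle_node_join(const char *node_name)
{
	const char *node = find_configured_node(node_name);

	if (node == NULL) {
		/* Expected during scale-out: log at notice level and
		 * reject the join instead of dereferencing NULL. */
		syslog(LOG_NOTICE,
		       "NO '%s' is not a configured cluster node", node_name);
		return -1;
	}
	syslog(LOG_NOTICE, "NO node '%s' joined the cluster", node);
	return 0;
}

int main(void)
{
	openlog("clmd-sketch", LOG_PID, LOG_LOCAL0);
	handle_node_join("PL-6");	/* unconfigured: rejected, no crash */
	handle_node_join("PL-3");	/* configured: accepted */
	closelog();
	return 0;
}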
Hi,
I have attached a patch (I will send it out separately once I get my hands on the latest review tool). In the meantime, please find my comments:
1) Regarding the log string and its relevance to /etc/opensaf/node_name:
/etc/opensaf/node_name is a user-exposed configuration file.
The node_name file contains the RDN value of the CLM node name.
(a) When the OpenSAF cluster configuration is pre-provisioned using the OpenSAF IMM tools, /etc/opensaf/node_name should contain one of the values specified in nodes.cfg when generating imm.xml.
(b) When OpenSAF cluster nodes are added dynamically at runtime, /etc/opensaf/node_name should contain the RDN value.
So, to the end user, the log message should convey its relationship with the node_name file in some form. I have nevertheless changed the log level to notice and reworded the message with your input in the attached patch.
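As an illustration of point (1), here is a small sketch of reading the RDN value from the node_name file; this is hypothetical code, not the actual parsing in OpenSAF:

#include <stdio.h>
#include <string.h>

/* Read the CLM node RDN value from the node_name file, stripping the
 * trailing newline. Returns 0 on success, -1 on failure. */
static int read_node_name(const char *path, char *buf, size_t len)
{
	FILE *fp = fopen(path, "r");

	if (fp == NULL || fgets(buf, (int)len, fp) == NULL) {
		if (fp != NULL)
			fclose(fp);
		return -1;
	}
	buf[strcspn(buf, "\n")] = '\0';
	fclose(fp);
	return 0;
}

int main(void)
{
	char name[256];

	if (read_node_name("/etc/opensaf/node_name", name, sizeof(name)) == 0)
		printf("CLM node RDN: '%s'\n", name);
	else
		printf("could not read node_name\n");
	return 0;
}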
2) Regarding the use case:
I think it is primarily a case of non-existent configuration, and also a case of invalid configuration.
In the cluster-expansion case, I think the expansion logic should first update the cluster configuration, because otherwise the node startup will still be seen as a failed attempt by the "unconfigured" node.
Note: Since there is a workaround for this situation, I have changed the priority to major. I will send a formal review request later, but in the meantime please use the attached patch (I modified your patch with my impressions).
https://sourceforge.net/p/opensaf/mailman/opensaf-devel/thread/patchbomb.1395882818%40ubuntu/#msg32147440
[staging:92d926]
[staging:00fba3]
[staging:1a15ac]
[staging:30ba84]
[staging:cc869f]
[staging:8bca8f]