OS : Suse 64bit
Changeset : 7997 ( 5.1.FC)
A controller is able to join the cluster with an invalid node_name
Opensaf status:
fos1:/opt/goahead/tetware/opensaffire/suites/avsv/api/suites # /etc/init.d/opensafd status
safSISU=safSu=PL-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
OpenSAF should come up with only SC-1 / SC-2, since imm.xml was generated with:
./immxml-clustersize -s 2 -p 2
./immxml-configure
I had a discussion with Ritu. Tagging this ticket as a discussion topic and assigning it to immtools.
The issue can be reproduced as below:
Generate imm.xml for 4 nodes, with the names set to SC-1, SC-2, PL-3, PL-4 in nodes.cfg:
SC SC-1 SC-1
SC SC-2 SC-2
PL PL-3 PL-3
PL PL-4 PL-4
Now, start the first node with node_name set to PL-4. OpenSAF comes up fine.
Since nodes.cfg is exposed to the end user, I guess Ritu is questioning the need for the first column in nodes.cfg, i.e. the differentiation based on 'SC' versus 'PL'.
This could be discussed further.
I want to add this one too:
So, if we start the second node, SC-2, it fails to join the cluster,
and both nodes go for reboot.
Finally, after the reboot, when the node joins back:
Syslog of SC-2:
Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: ER Failed to find candidate for new IMMND coordinator (ScAbsenceAllowed:0 RulingEpoch:0
Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: ER Active IMMD has to restart the IMMSv. All IMMNDs will restart
Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: NO Cluster failed to load => IMMDs will not exit.
Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: NO MDS event from svc_id 25 (change:4, dest:564114851160080)
Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: IN Added IMMND node with dest 564114851160080
Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: IN Added IMMND node with dest 565216431636496
Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: WA Error returned from processing message err:0 msg-type:14
Sep 20 17:27:18 TestBed-R2 osafimmnd[27372]: ER IMMND forced to restart on order from IMMD, exiting
Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: NO MDS event from svc_id 25 (change:4, dest:565216431636496)
Sep 20 17:27:18 TestBed-R2 osafamfnd[27422]: NO 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF' component restart probation timer started (timeout: 60000000000 ns)
Sep 20 17:27:18 TestBed-R2 osafamfnd[27422]: NO Restarting a component of 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
Sep 20 17:27:18 TestBed-R2 osafamfnd[27422]: NO 'safComp=IMMND,safSu=SC-2,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart
.............
Sep 20 17:27:23 TestBed-R2 osafclmd[27402]: NO ERR_INVALID_PARAM: Implementer safClmService already set for this handle when trying to set safClmService
Sep 20 17:27:23 TestBed-R2 osafclmd[27402]: ER saImmOiImplementerSet failed, rc = 7
Sep 20 17:27:23 TestBed-R2 osafamfnd[27422]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 20 17:27:23 TestBed-R2 osafamfnd[27422]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 20 17:27:23 TestBed-R2 osafamfnd[27422]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
Sep 20 17:27:23 TestBed-R2 opensaf_reboot: Rebooting local node; timeout=60
Syslog of the first node:
Sep 20 17:28:10 TestBed-R1 osafimmnd[31481]: ER No IMMD service => cluster restart, exiting
Sep 20 17:28:10 TestBed-R1 osafamfnd[30949]: NO Restarting a component of 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' (comp restart count: 2)
Sep 20 17:28:10 TestBed-R1 osafamfnd[30949]: NO 'safComp=IMMND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
Sep 20 17:28:10 TestBed-R1 osafntfimcnd[31487]: NO saImmOiDispatch() Fail SA_AIS_ERR_BAD_HANDLE (9)
Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: NO Node 'SC-2' left the cluster
Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: safSu=SC-2,safSg=2N,safApp=OpenSAF OperState ENABLED => DISABLED
Sep 20 17:28:10 TestBed-R1 opensaf_reboot: Rebooting local node; timeout=60
Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: safSu=SC-2,safSg=2N,safApp=OpenSAF PresenceState INSTANTIATED => UNINSTANTIATED
Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: safSu=SC-2,safSg=2N,safApp=OpenSAF ReadinessState IN_SERVICE => OUT_OF_SERVICE
Hi,
I haven't worked a lot with nodes.cfg, but as far as I know, the first column tells whether a node is a system controller or a payload. Based on the first column, the immxml tools know which template to use.
The second column is the AMF node name.
The third column is the CLM node name.
The AMF and CLM node names don't need to be the same.
If you declare a system controller with the node name PL-3, then the node with node name PL-3 is a system controller.
Node names don't need to start with SC or PL. They can be any name.
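To make the column layout concrete, here is a minimal sketch that labels the three fields of each nodes.cfg line. The function name describe_nodes_cfg is hypothetical and not part of the immxml tools, and the sketch assumes the fields are separated by whitespace, as in the examples in this ticket.

```shell
#!/bin/sh
# Sketch only: label the three whitespace-separated columns of a nodes.cfg file.
# describe_nodes_cfg is a hypothetical helper, not part of the immxml tools.
describe_nodes_cfg() {
    # $1: path to a nodes.cfg-style file
    awk 'NF >= 3 {
        # Column 1 selects the role; any other value is reported as unknown.
        role = ($1 == "SC") ? "system controller" : (($1 == "PL") ? "payload" : "unknown")
        # Column 2 is the AMF node name, column 3 the CLM node name.
        printf "%s: AMF node=%s CLM node=%s\n", role, $2, $3
    }' "$1"
}
```

For example, describe_nodes_cfg /usr/share/opensaf/immxml/nodes.cfg would print one labelled line per configured node, which shows that the role comes only from the first column and not from the node name itself.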
I think the discussion got sidetracked by the use of the PL string in nodes.cfg.
On the first node in the OpenSAF cluster, the following info is filled in in the OpenSAF cfg files.
cat /usr/share/opensaf/immxml/nodes.cfg
SC node-1 node-1
SC node-2 node-2
PL node-3 node-3
PL node-4 node-4
PL node-5 node-5
PL node-6 node-6
cat /etc/opensaf/slot_id
1
cat /etc/opensaf/node_name
node-3
cat /etc/opensaf/node_type
controller
-> opensafd starts successfully, but with the following status output:
safSISU=safSu=node-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)
-> After about 5 minutes, the node went for reboot with the following output:
Nov 1 12:31:22 CONTROLLER-1 osaffmd[3945]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Activation timer supervision expired: no ACTIVE assignment received within the time limit, OwnNodeId = 131343, SupervisionTime = 60
Nov 1 12:31:22 CONTROLLER-1 opensaf_reboot: Rebooting local node; timeout=60
Observed behavior :
If the user mistakenly populates node_name with a payload's node name and starts the opensafd script, the user is not informed about the misconfiguration. The node reboots continuously, because opensafd is enabled in the runlevel by default during RPM installation.
Expected behavior :
One of fms / imm / amf should detect that the node_name used when bringing up the node is intended for a payload, not for a controller. More importantly, the node should not go for reboot.
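The kind of detection meant here could look roughly like the sketch below. It is only an illustration and not an existing OpenSAF check: check_node_role is a hypothetical name, the sketch assumes that node_name holds the name from the third (CLM) column of nodes.cfg, and it assumes that node_type holds either "controller" (as shown above) or "payload".

```shell
#!/bin/sh
# Hypothetical pre-start sanity check; not an existing OpenSAF script.
# Assumption: the node_name file holds the name found in the third (CLM)
# column of nodes.cfg, and node_type holds "controller" or "payload".
check_node_role() {
    # $1: nodes.cfg path, $2: node_name file, $3: node_type file
    name=$(cat "$2")
    type=$(cat "$3")
    # Role recorded in nodes.cfg for this name (empty if the name is not listed).
    role=$(awk -v n="$name" '$3 == n { print $1; exit }' "$1")
    case "$type:$role" in
        controller:SC|payload:PL)
            echo "OK: $name is configured as $type"
            return 0 ;;
        *:)
            echo "ERROR: $name is not listed in $1" >&2
            return 1 ;;
        *)
            echo "ERROR: node_type is '$type' but $name is listed as '$role' in nodes.cfg" >&2
            return 1 ;;
    esac
}
```

With the files shown above (node_name set to node-3, node_type set to controller), this check would return non-zero and print an error instead of letting the node start and later reboot. Whether such a check belongs in the opensafd script or in one of the services is exactly the open question of this ticket.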
Hi Srikanth,
The immxml tool is used for creating the first basic IMM XML database for starting OpenSAF.
As I remember, the immxml tools use the first column (SC/PL) to choose the SC or PL template when creating the imm.xml file.
From my point of view, if a node is misconfigured, a node reboot is a reasonable recovery action.
You have written that the node should not reboot when the node misconfiguration is detected.
What do you expect to happen with OpenSAF on the affected node: should it stop, or continue working as a payload?
BR,
Zoran
-----Original Message-----
From: Srikanth R [mailto:rwpq68@users.sf.net]
Sent: 1 November 2016 08:26
To: [opensaf:tickets] 2052@tickets.opensaf.p.re.sf.net
Subject: [opensaf:tickets] #2052 immtools: SC/PL field in nodes.cfg is not used
** [tickets:#2052] immtools: SC/PL field in nodes.cfg is not used**
Status: unassigned
Milestone: 5.0.2
Created: Tue Sep 20, 2016 09:41 AM UTC by Ritu Raj
Last Updated: Tue Sep 20, 2016 05:49 PM UTC
Owner: nobody
Zoran,
Node reboot recovery should be used when the system cannot otherwise recover from the observed fault. For a fault like amfd crashing, a node reboot is appropriate. But in the current scenario, the same configuration still exists after the reboot, so the node goes for reboot again, because opensafd is enabled in the runlevel by default.
If the system has the same environment after the reboot, then rebooting doesn't help the user or the system recover from a misconfiguration, or even from a fault.