#2052 immtools: SC/PL field in nodes.cfg is not used

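Milestone: future
Status: unassigned
Owner: nobody
Labels: None
Type: discussion
Component: osaf
Part: -
Version: 5.1.FC
Severity: minor
Updated: 2016-11-01
Created: 2016-09-20
Creator: Ritu Raj
Private: No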

Environment details

OS : Suse 64bit
Changeset : 7997 ( 5.1.FC)

Summary

Controller able to join with invalid node_name

Steps followed & Observed behaviour

  1. Mistakenly configured the controller's node_name as PL-3; all other configuration files were properly installed and updated, apart from /etc/opensaf/node_name.
  2. Brought up OpenSAF; OpenSAF was still able to come up with the misconfigured node_name.

Opensaf status:
fos1:/opt/goahead/tetware/opensaffire/suites/avsv/api/suites # /etc/init.d/opensafd status
safSISU=safSu=PL-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
saAmfSISUHAState=ACTIVE(1)

Expected

OpenSAF should come up only with SC-1 / SC-2, because the imm.xml was generated with:
./immxml-clustersize -s 2 -p 2
./immxml-configure

Discussion

  • Mathi Naickan

    Mathi Naickan - 2016-09-20
    • summary: Controller able to join with invalid node_name --> immtools: SC/PL field in nodes.cfg is not used
    • Type: defect --> discussion
     
  • Mathi Naickan

    Mathi Naickan - 2016-09-20

    Had a discussion with Ritu; tagging this ticket as a discussion topic and assigning it to immtools.

    The issue can be reproduced as below:
    Generate imm.xml for 4 nodes with names set to SC-1, SC-2, PL-3, PL-4 in nodes.cfg:

    SC SC-1 SC-1
    SC SC-2 SC-2
    PL PL-3 PL-3
    PL PL-4 PL-4

    Now, start the first node with node_name set to PL-4. OpenSAF comes up fine.

    Since nodes.cfg is exposed to the end user, I guess Ritu is questioning the need for the first column in nodes.cfg, i.e. the differentiation based on 'SC' versus 'PL'.

    This could be discussed further.

     
  • Ritu Raj

    Ritu Raj - 2016-09-20

    I want to add this too:
    If we then start the second node, SC-2, it fails to join the cluster,
    and both nodes go for reboot.
    Finally, after the reboot, when the nodes join back:

    SC-2 joins with the "ACTIVE" role and the first node (PL-3) joins as "QUIESCED".

    Syslog of SC-2:
    Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: ER Failed to find candidate for new IMMND coordinator (ScAbsenceAllowed:0 RulingEpoch:0
    Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: ER Active IMMD has to restart the IMMSv. All IMMNDs will restart
    Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: NO Cluster failed to load => IMMDs will not exit.
    Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: NO MDS event from svc_id 25 (change:4, dest:564114851160080)
    Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: IN Added IMMND node with dest 564114851160080
    Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: IN Added IMMND node with dest 565216431636496
    Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: WA Error returned from processing message err:0 msg-type:14
    Sep 20 17:27:18 TestBed-R2 osafimmnd[27372]: ER IMMND forced to restart on order from IMMD, exiting
    Sep 20 17:27:18 TestBed-R2 osafimmd[27361]: NO MDS event from svc_id 25 (change:4, dest:565216431636496)
    Sep 20 17:27:18 TestBed-R2 osafamfnd[27422]: NO 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF' component restart probation timer started (timeout: 60000000000 ns)
    Sep 20 17:27:18 TestBed-R2 osafamfnd[27422]: NO Restarting a component of 'safSu=SC-2,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
    Sep 20 17:27:18 TestBed-R2 osafamfnd[27422]: NO 'safComp=IMMND,safSu=SC-2,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart

    .............

    Sep 20 17:27:23 TestBed-R2 osafclmd[27402]: NO ERR_INVALID_PARAM: Implementer safClmService already set for this handle when trying to set safClmService
    Sep 20 17:27:23 TestBed-R2 osafclmd[27402]: ER saImmOiImplementerSet failed, rc = 7
    Sep 20 17:27:23 TestBed-R2 osafamfnd[27422]: NO 'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
    Sep 20 17:27:23 TestBed-R2 osafamfnd[27422]: ER safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
    Sep 20 17:27:23 TestBed-R2 osafamfnd[27422]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131599, SupervisionTime = 60
    Sep 20 17:27:23 TestBed-R2 opensaf_reboot: Rebooting local node; timeout=60

    Syslog of the first node:
    Sep 20 17:28:10 TestBed-R1 osafimmnd[31481]: ER No IMMD service => cluster restart, exiting
    Sep 20 17:28:10 TestBed-R1 osafamfnd[30949]: NO Restarting a component of 'safSu=PL-3,safSg=NoRed,safApp=OpenSAF' (comp restart count: 2)
    Sep 20 17:28:10 TestBed-R1 osafamfnd[30949]: NO 'safComp=IMMND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
    Sep 20 17:28:10 TestBed-R1 osafntfimcnd[31487]: NO saImmOiDispatch() Fail SA_AIS_ERR_BAD_HANDLE (9)
    Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: NO Node 'SC-2' left the cluster
    Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: safSu=SC-2,safSg=2N,safApp=OpenSAF OperState ENABLED => DISABLED
    Sep 20 17:28:10 TestBed-R1 opensaf_reboot: Rebooting local node; timeout=60
    Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
    Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: safSu=SC-2,safSg=2N,safApp=OpenSAF PresenceState INSTANTIATED => UNINSTANTIATED
    Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
    Sep 20 17:28:10 TestBed-R1 osafamfd[30935]: safSu=SC-2,safSg=2N,safApp=OpenSAF ReadinessState IN_SERVICE => OUT_OF_SERVICE

     
  • Zoran Milinkovic

    Hi,

    I haven't worked much with nodes.cfg, but as far as I know, the first column tells whether a node is a system controller or a payload. Based on the first column, the immxml tools know which template to use.
    The second column is the AMF node name.
    The third column is the CLM node name.

    The AMF and CLM node names don't need to be the same.
    If you configure a system controller with the node name PL-3, then the node named PL-3 is a system controller.
    Node names don't need to start with SC or PL; they can be any name.
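
    As an illustration of the three-column layout described above, here is a minimal Python sketch that reads nodes.cfg-style text. It is not part of the immxml tools; the names NodeEntry and parse_nodes_cfg are hypothetical, and ignoring anything after the third column is an assumption.

```python
from typing import List, NamedTuple


class NodeEntry(NamedTuple):
    node_type: str  # first column: "SC" (system controller) or "PL" (payload)
    amf_name: str   # second column: AMF node name
    clm_name: str   # third column: CLM node name


def parse_nodes_cfg(text: str) -> List[NodeEntry]:
    """Parse nodes.cfg-style text into entries (hypothetical helper)."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError(f"line {lineno}: expected 3 columns, got {len(fields)}")
        # Assumption: any columns beyond the third are ignored.
        entries.append(NodeEntry(*fields[:3]))
    return entries


# The role comes from the first column, not from the spelling of the name:
entries = parse_nodes_cfg("SC PL-3 PL-3\nPL worker-a clm-worker-a\n")
controllers = [e.amf_name for e in entries if e.node_type == "SC"]
print(controllers)  # ['PL-3']
```

    In this sketch a node named PL-3 counts as a controller because its first column is SC, which matches the point that node names can be arbitrary.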

     
  • Anders Widell

    Anders Widell - 2016-09-20
    • Milestone: 4.7.2 --> 5.0.2
     
  • Srikanth R

    Srikanth R - 2016-11-01

    I think the discussion got sidetracked by the use of the PL string in nodes.cfg.

    On the first node in the OpenSAF cluster, the following info is filled in the OpenSAF cfg files.

    cat /usr/share/opensaf/immxml/nodes.cfg
    SC node-1 node-1
    SC node-2 node-2
    PL node-3 node-3
    PL node-4 node-4
    PL node-5 node-5
    PL node-6 node-6

    cat /etc/opensaf/slot_id
    1

    cat /etc/opensaf/node_name
    node-3
    cat /etc/opensaf/node_type
    controller

    -> opensafd starts successfully, but with the following output:
    safSISU=safSu=node-3\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
    saAmfSISUHAState=ACTIVE(1)

    -> After a gap of 5 minutes, the node went for reboot with the following output:

    Nov 1 12:31:22 CONTROLLER-1 osaffmd[3945]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Activation timer supervision expired: no ACTIVE assignment received within the time limit, OwnNodeId = 131343, SupervisionTime = 60
    Nov 1 12:31:22 CONTROLLER-1 opensaf_reboot: Rebooting local node; timeout=60

    Observed behavior:

    If the user mistakenly populates node_name with a payload's node_name and starts the opensafd script, the user is not informed about the misconfiguration. The node reboots continuously, as opensafd is enabled in the runlevel by default during RPM installation.

    Expected behavior:

    One of fms / imm / amf should detect that the node_name used at bring-up is intended for a payload, not for a controller. More importantly, the node should not go for reboot.
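
    The kind of detection asked for here could look roughly like the Python sketch below. It is only an illustration, not an existing OpenSAF check: check_node_identity is a hypothetical name, it assumes nodes.cfg is readable on the node, it assumes node_name may match either the AMF or the CLM column, and it assumes "payload" is the node_type value that corresponds to PL ("controller" is the value shown in this ticket).

```python
from typing import Optional

# Assumed mapping from the nodes.cfg first column to the node_type value.
# "controller" appears in the ticket; "payload" is an assumption.
EXPECTED_NODE_TYPE = {"SC": "controller", "PL": "payload"}


def check_node_identity(nodes_cfg: str, node_name: str, node_type: str) -> Optional[str]:
    """Return a description of the misconfiguration, or None if consistent."""
    for line in nodes_cfg.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        cfg_type, amf_name, clm_name = fields[:3]
        # Assumption: node_name may match either the AMF or the CLM column.
        if node_name in (amf_name, clm_name):
            if EXPECTED_NODE_TYPE.get(cfg_type) != node_type:
                return (f"node_name '{node_name}' is listed as {cfg_type} in nodes.cfg, "
                        f"but node_type is '{node_type}'")
            return None
    return f"node_name '{node_name}' is not listed in nodes.cfg"


# Values taken from the configuration files shown in this comment.
NODES_CFG = """\
SC node-1 node-1
SC node-2 node-2
PL node-3 node-3
PL node-4 node-4
PL node-5 node-5
PL node-6 node-6
"""

problem = check_node_identity(NODES_CFG, "node-3", "controller")
print(problem)
```

    With the values above, the function returns a message instead of None, which a start script could report to the user rather than letting the node proceed to a reboot.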

     
    • Zoran Milinkovic

      Hi Srikanth,

      The immxml tool is used for creating the first basic IMM XML database for starting OpenSAF.
      As I remember, the immxml tools use the first column (SC/PL) to pick the SC or PL template when creating the imm.xml file.

      From my point of view, if a node is misconfigured, a node reboot is a reasonable recovery action.

      You wrote that the node should not reboot when the misconfiguration is detected.
      What do you expect to happen with OpenSAF on the affected node? Should it stop, or continue working as a payload?

      BR,
      Zoran

  • Srikanth R

    Srikanth R - 2016-11-01

    Zoran,

    Node reboot recovery should be used when the system cannot recover from the observed fault. For a fault like amfd crashing, node reboot is appropriate. But in the current scenario, the same configuration exists after the reboot, and the node goes for reboot again, as opensafd is enabled in the runlevel by default.

    If the system has the same environment after the reboot, then rebooting doesn't help the user or the system recover from a misconfiguration, or even from a fault.

    My expectation is that the node shouldn't go for reboot, and opensafd should either keep running in a suspended state or be stopped. This issue mainly affects newcomers. Rebooting a node for a misconfiguration when starting OpenSAF doesn't look good.
    
     
  • Anders Widell

    Anders Widell - 2017-04-03
    • Milestone: 5.0.2 --> future
     
