#2112 amfd: multiple SUs incorrectly assigned to single node

5.0.2
fixed
None
defect
amf
d
major
2016-11-10
2016-10-11
Gary Lee
No

Multiple SUs are assigned to a single node after SC absence.

To reproduce:

0) load nwayactive demo
1) stop SCs
2) restart SCs

The following is observed:

root@SC-1:~# immlist safSu=SU4,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode SA_NAME_T safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

root@SC-1:~# immlist safSu=SU2,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode SA_NAME_T safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

SU2 is indeed hosted on PL-4, but SU4 was hosted on one of the SCs before the restart and should not be mapped to PL-4.

Operations on SU4 will lead to a crash of amfnd on PL-4.
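
The symptom above can be checked mechanically once the saAmfSUHostedByNode value of each SU has been collected (for example by hand from the immlist output). The following is only a minimal sketch: the helper name `nodes_with_multiple_sus` and the dictionary layout are hypothetical and not part of OpenSAF, and it assumes, as in this demo, that each SU is expected on its own node.

```python
from collections import defaultdict


def nodes_with_multiple_sus(hosted_by):
    """Return {node: [su, ...]} for every node that hosts more than one SU.

    hosted_by maps an SU DN to its saAmfSUHostedByNode value
    (hypothetical layout, filled in by hand from immlist output).
    """
    by_node = defaultdict(list)
    for su, node in hosted_by.items():
        by_node[node].append(su)
    return {node: sorted(sus) for node, sus in by_node.items() if len(sus) > 1}


# The two values shown in the ticket description.
observed = {
    "safSu=SU4,safSg=AmfDemo,safApp=AmfDemo2": "safAmfNode=PL-4,safAmfCluster=myAmfCluster",
    "safSu=SU2,safSg=AmfDemo,safApp=AmfDemo2": "safAmfNode=PL-4,safAmfCluster=myAmfCluster",
}

duplicates = nodes_with_multiple_sus(observed)
for node, sus in duplicates.items():
    print(node, "hosts", sus)
```

A non-empty result for this demo is the condition reported in this ticket.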

Related

Wiki: ChangeLog-5.0.2
Wiki: ChangeLog-5.1.1

Discussion

  • Gary Lee

    Gary Lee - 2016-10-11
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -17,3 +17,5 @@
     saAmfSUHostedByNode                                SA_NAME_T    safAmfNode=PL-4,safAmfCluster=myAmfCluster (42) 
    
     SU4 is indeed assigned to PL-4, but SU2 was assigned to one of the SCs and is not assigned to PL-4.
    +
    +Operations on SU4 will lead to a crash of amfnd on PL-4.
    
     
  • Gary Lee

    Gary Lee - 2016-10-11
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -16,6 +16,6 @@
     ...
     saAmfSUHostedByNode                                SA_NAME_T    safAmfNode=PL-4,safAmfCluster=myAmfCluster (42) 
    
    -SU4 is indeed assigned to PL-4, but SU2 was assigned to one of the SCs and is not assigned to PL-4.
    +SU2 is indeed assigned to PL-4, but SU4 was assigned to one of the SCs and is not assigned to PL-4.
    
     Operations on SU4 will lead to a crash of amfnd on PL-4.
    
     
  • Minh Hon Chau

    Minh Hon Chau - 2016-10-12
    • status: unassigned --> assigned
    • assigned_to: Minh Hon Chau
     
  • Minh Hon Chau

    Minh Hon Chau - 2016-10-13

    The attached file is an amfd trace that shows how one node came to be mapped to two SUs.
    Before headless, the node <-> SU map was:
    SU1: PL3, SU2: PL4, SU3: PL5, SU4: SC1, SU5: SC2

    After headless, the nodes were read from IMM and the initialization phase ran in reversed order, so the node <-> SU map changed to:
    SU5: PL3, SU4: PL4, SU3: PL5, SU2: SC1, SU1: SC2

    In the headless sync phase, the node of SU2 was updated to PL4, because SU2 was actually mapped to PL4 before headless. The result is SU2: PL4 and SU4: PL4.

    The problem has two causes: the order in which SUs were read from IMM was reversed, and saAmfSUHostedByNode was empty, which made amfd pick the node for each SU arbitrarily.
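
The sequence described above can be replayed with a small simulation. This is only an illustrative sketch, not amfd code: `initial_mapping` and `apply_headless_sync` are hypothetical helpers, positional pairing stands in for amfd choosing a node when saAmfSUHostedByNode is empty, and only the SU2 update mentioned in the comment is applied during sync.

```python
def initial_mapping(sus, nodes):
    """Pair each SU with a node by position.

    Hypothetical simplification of amfd picking a node for an SU
    whose saAmfSUHostedByNode is empty.
    """
    return dict(zip(sus, nodes))


def apply_headless_sync(mapping, synced):
    """Return a copy of mapping with the synced SU -> node entries applied."""
    updated = dict(mapping)
    updated.update(synced)
    return updated


nodes = ["PL3", "PL4", "PL5", "SC1", "SC2"]

# Before headless: SUs were read in ascending order.
before = initial_mapping(["SU1", "SU2", "SU3", "SU4", "SU5"], nodes)

# After headless: the SU order is reversed, so every pairing shifts.
after_restart = initial_mapping(["SU5", "SU4", "SU3", "SU2", "SU1"], nodes)

# Headless sync restores SU2 to the node it really runs on (PL4),
# but SU4 keeps the PL4 entry it received during initialization.
after_sync = apply_headless_sync(after_restart, {"SU2": before["SU2"]})

shared = sorted(su for su, node in after_sync.items() if node == "PL4")
print("SUs mapped to PL4 after sync:", shared)
```

The sync step corrects one entry without clearing the stale one, which is how a single node ends up with two SUs.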

    A solution could be either:
    (1) Make both the active and standby amfd become implementer/applier early, so that saAmfSUHostedByNode is read properly after headless.
    Or (2) Just before reading the headless sync information, have amfd read saAmfSUHostedByNode again and update the SU. At that point saAmfSUHostedByNode should be non-empty, whereas the value mapped during the initialization phase is not reliable.

     
  • Praveen

    Praveen - 2016-10-13

    Hi Minh,
    Since this is a defect, I think as of now we can take approach (2).
    For (1), an enhancement can be raised for 5.2 FC so that it will go through proper testing post 5.2 FC tag.

    Thanks,
    Praveen

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-10-13

    Hi Praveen

    I have tried (2). I thought it would work, but it doesn't.
    The reason is that after avd_su_config_get(), the saAmfSUHostedByNode of every SU has already been written to IMM with the incorrect value. So reading saAmfSUHostedByNode in the headless sync phase still gives an incorrect mapping.

    Any ideas?

    Thanks,
    Minh
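
The failure of approach (2) described above can be illustrated with a toy model in which a plain dictionary stands in for IMM. This is a sketch under stated assumptions: `config_get_su_sketch` is a hypothetical stand-in for the configuration read, it pairs SUs with nodes by position, and it writes the chosen node back to the store, which is the behaviour the comment attributes to amfd after avd_su_config_get().

```python
def config_get_su_sketch(imm, sus, nodes):
    """Fill every empty saAmfSUHostedByNode and write the choice back.

    imm is a dict standing in for IMM (hypothetical model, not the IMM API).
    """
    for su, node in zip(sus, nodes):
        if not imm.get(su):
            imm[su] = node  # the write-back makes the guess persistent


# Mapping that was in effect before headless (from the example above).
true_mapping = {"SU1": "PL3", "SU2": "PL4", "SU3": "PL5", "SU4": "SC1", "SU5": "SC2"}

# After headless the attribute is empty for every SU.
imm = {su: "" for su in true_mapping}

# SUs are read in reversed order, so the guesses differ from the old mapping.
config_get_su_sketch(imm, ["SU5", "SU4", "SU3", "SU2", "SU1"],
                     ["PL3", "PL4", "PL5", "SC1", "SC2"])

# Approach (2): read the attribute again just before headless sync.
reread = dict(imm)
mismatched = sorted(su for su in reread if reread[su] != true_mapping[su])
print("SUs whose re-read node is wrong:", mismatched)
```

Because the guess has already been stored, a second read returns the same guess rather than the pre-headless value.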

     
    • Minh Hon Chau

      Minh Hon Chau - 2016-10-14

      For (2), I also need to remove the IMM update of saAmfSUHostedByNode in avd_susi_recreate().

       
  • Praveen

    Praveen - 2016-10-14

    Hi Minh,
    Function avd_susi_recreate() is called after the configuration has been read from IMM. In this function, the second for loop, over SUSI_list, should overwrite the node name even if it was set wrongly when the SU was created in config_get_su() from IMM.
    The same function has two for loops, one over SUs and the other over SUSIs. In the SU loop we do not set su->saAmfSUHostedByNode = node->name. Since an SU can be instantiated but unassigned on a node, the first loop should also update the node name.
    If an SU happened to be unassigned when the system became headless, su->saAmfSUHostedByNode will not be updated. I think even a single unassigned SU can cause a problem.

    Thanks,
    Praveen
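
The point about the two loops can be sketched as follows. This is not the real avd_susi_recreate(): `recreate_from_sync` and its arguments are hypothetical, and the sketch only assumes what the comment states, namely that the sync information from a node carries a list of SUs and a list of SUSIs.

```python
def recreate_from_sync(hosted_by, node, su_list, susi_list, update_in_su_loop):
    """Return an updated SU -> node map built from one node's sync data.

    su_list:   SUs instantiated on the node (assigned or not).
    susi_list: (su, si) pairs for SUs that hold an assignment.
    update_in_su_loop: whether the first loop also sets the node name.
    """
    updated = dict(hosted_by)
    # First loop: every SU the node reports.
    for su in su_list:
        if update_in_su_loop:
            updated[su] = node
    # Second loop: only SUs that have at least one SI assignment.
    for su, _si in susi_list:
        updated[su] = node
    return updated


# SU2 runs on PL4 but holds no assignment; initialization guessed SC1.
stale = {"SU2": "SC1"}

without_fix = recreate_from_sync(stale, "PL4", ["SU2"], [], update_in_su_loop=False)
with_fix = recreate_from_sync(stale, "PL4", ["SU2"], [], update_in_su_loop=True)
print(without_fix["SU2"], with_fix["SU2"])
```

With only the SUSI loop updating the node name, an instantiated but unassigned SU keeps the stale value.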

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-10-17

    Hi Praveen,

    After headless, config_get_su() has already made a wrong mapping, and su_add_to_model() has also written this wrong mapping to IMM.
    Once config_get_su() has run after headless, the only correct mapping is in the sync information that amfd processes in avd_susi_recreate(). But this is also where the problem arises, with one node being mapped to two different SUs (as in the earlier example). We cannot rely completely on avd_susi_recreate() to update the mapping, since some of the nodes could still be down.

    The attached file is a patch for approach (2). It removes the mapping update in avd_susi_recreate(), which avoids the duplicated mapping that causes the coredump. But the SU/node mapping is then already different from what it was before headless.

    Could this shuffle of the SU/node mapping after headless cause any other problems?

    Thanks,
    Minh

     
  • Praveen

    Praveen - 2016-10-19

    Hi Minh,

    I am going through the patch.

    Thanks,
    Praveen

     
  • Praveen

    Praveen - 2016-10-21

    Hi Minh,
    I have gone through the patch. The SU mapping must be the same before and after the headless state. I was trying a totally different approach but could not finish it (attached is a crude patch, adjust.patch). I think we are left only with your other suggestion, i.e. approach (1). Please go ahead with that and publish the patch.
    Thanks,
    Praveen

     
  • Praveen

    Praveen - 2016-10-21

    adjust.patch

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-10-24

    Thanks Praveen, I will publish the patch for approach (1).

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-10-24
    • status: assigned --> review
     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-10

    Pushed for 5.1 and default:

    changeset: 8301:a25d5d50b01a
    changeset: 8300:773643625dc6

    Could it happen with 5.0?

     
  • Gary Lee

    Gary Lee - 2016-11-10
    • Milestone: 5.1.1 --> 5.0.2
     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-10

    Pushed to the 5.0 branch:

    changeset: 8302:6557805ec604
    branch: opensaf-5.0.x

     
  • Minh Hon Chau

    Minh Hon Chau - 2016-11-10
    • status: review --> fixed
     