#1180 NTF: Add support for cloud resilience feature

5.0.FC
fixed
nobody
None
enhancement
ntf
-
major
2016-03-22
2014-10-20
elunlen
No

The NTF service shall be able to recover if both SC nodes go down at the same time. This is not possible today; a cluster restart is needed.
NOTE: This also applies to the LOG service. [#1179]

Related

Tickets: #1179
Tickets: #1180
Tickets: #1641
Tickets: #1707
Wiki: ChangeLog-5.0.0
Wiki: NEWS-5.0.0

Discussion

  • Minh Hon Chau

    Minh Hon Chau - 2014-11-14

    I have a patch implementing the "INVALID HANDLE" idea, which is similar to #1179.
    Comments are welcome.

    --- Brief explanation ---
    The MDS events (Normal View) delivered to the NTF Agent were observed in the following cases:

    • Start ntfsubscribe
      Nov 06 05:32:44 PL-3 ntfsubscribe: NO mds_cb_info->info.svc_evt.i_change 3 (NCSMDS_UP)

    • Stop SC-1, failover
      Nov 06 05:33:09 PL-3 ntfsubscribe: NO mds_cb_info->info.svc_evt.i_change 1 (NCSMDS_NO_ACTIVE)
      Nov 06 05:33:09 PL-3 ntfsubscribe: NO NTFS down

    • Active NTF Server is on SC-2
      Nov 06 05:33:10 PL-3 ntfsubscribe: NO mds_cb_info->info.svc_evt.i_change 2 (NCSMDS_NEW_ACTIVE)
      Nov 06 05:33:10 PL-3 ntfsubscribe: NO MSG from NTFS NCSMDS_NEW_ACTIVE/UP

    • Stop SC-2
      Nov 06 05:33:41 PL-3 ntfsubscribe: NO mds_cb_info->info.svc_evt.i_change 1 (NCSMDS_NO_ACTIVE)
      Nov 06 05:33:41 PL-3 ntfsubscribe: NO NTFS down

    • No Active NTF Server
      Nov 06 05:33:41 PL-3 ntfsubscribe: NO mds_cb_info->info.svc_evt.i_change 4 (NCSMDS_DOWN)
      Nov 06 05:33:41 PL-3 ntfsubscribe: NO NTFS down

    • Start SC-1 again, Active NTF Server is on SC-1
      Nov 06 05:34:11 PL-3 ntfsubscribe: NO mds_cb_info->info.svc_evt.i_change 2 (NCSMDS_NEW_ACTIVE)
      Nov 06 05:34:11 PL-3 ntfsubscribe: NO MSG from NTFS NCSMDS_NEW_ACTIVE/UP

    • Restart the cluster, start ntfsubscribe, then stop only SC-2: no MDS event is received

    So @ntfa_ntfsv_state_t is introduced to track the server state based on MDS events.
    State handling:

    • The initial value is NTFA_NTFSV_NONE
    • When an NTF client starts, the Agent receives NCSMDS_UP and sets @ntfa_ntfsv_state_t to NTFA_NTFSV_UP
      • In state NTFA_NTFSV_UP, all APIs function normally if the handle is valid
    • If the active SC then goes down, the Agent receives NCSMDS_NO_ACTIVE and sets @ntfa_ntfsv_state_t to NTFA_NTFSV_NO_ACTIVE
      • In state NTFA_NTFSV_NO_ACTIVE, any API call returns TRY_AGAIN
      • If NCSMDS_NEW_ACTIVE arrives afterwards, set @ntfa_ntfsv_state_t to NTFA_NTFSV_UP
        • All APIs function normally with a valid handle
      • Else if NCSMDS_DOWN arrives, set @ntfa_ntfsv_state_t to NTFA_NTFSV_DOWN
        • In state NTFA_NTFSV_DOWN:
          • Return TRY_AGAIN for saNtfInitialize
          • Return OK for saNtfFinalize
          • Return BAD_HANDLE for all other APIs
    • In state NTFA_NTFSV_DOWN, if one of the SCs starts again, the Agent receives NCSMDS_NEW_ACTIVE
      • "Recovery" could be done at this point, but it is not implemented yet
      • So set @ntfa_ntfsv_state_t to NTFA_NTFSV_UP for now. All APIs function normally if the handle is valid.
      • Any API call with an "old" handle receives BAD_HANDLE. The current implementation already does this, because the APIs call ncshm_take_hdl() to look up the handle record.
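
The state handling above can be sketched as a small, self-contained C model. This is only an illustration of the described transitions, not the code in the patch: the `sketch_*` and `SKETCH_*` names are hypothetical stand-ins for the real MDS and SAF definitions, the event numbers are copied from the `i_change` values in the log excerpt, and the return code for NTFA_NTFSV_NONE is an assumption because the description does not cover that state.

```c
#include <stdbool.h>

/* Stand-in for the MDS change events; numeric values are taken from the
 * i_change values in the syslog excerpt. Hypothetical names. */
typedef enum {
    SKETCH_MDS_NO_ACTIVE = 1,
    SKETCH_MDS_NEW_ACTIVE = 2,
    SKETCH_MDS_UP = 3,
    SKETCH_MDS_DOWN = 4
} sketch_mds_change_t;

/* Server states as named in the description. */
typedef enum {
    NTFA_NTFSV_NONE,
    NTFA_NTFSV_UP,
    NTFA_NTFSV_NO_ACTIVE,
    NTFA_NTFSV_DOWN
} ntfa_ntfsv_state_t;

/* Stand-ins for the SAF return codes (hypothetical names). */
typedef enum {
    SKETCH_RC_OK,
    SKETCH_RC_TRY_AGAIN,
    SKETCH_RC_BAD_HANDLE
} sketch_rc_t;

/* The three API classes that the description treats differently. */
typedef enum {
    SKETCH_API_INITIALIZE, /* saNtfInitialize */
    SKETCH_API_FINALIZE,   /* saNtfFinalize */
    SKETCH_API_OTHER       /* every other NTF API */
} sketch_api_t;

/* Next server state for an MDS event. In the described scheme the new
 * state is fully determined by the event itself. */
ntfa_ntfsv_state_t sketch_next_state(ntfa_ntfsv_state_t cur,
                                     sketch_mds_change_t evt)
{
    switch (evt) {
    case SKETCH_MDS_UP:
    case SKETCH_MDS_NEW_ACTIVE:
        return NTFA_NTFSV_UP;
    case SKETCH_MDS_NO_ACTIVE:
        return NTFA_NTFSV_NO_ACTIVE;
    case SKETCH_MDS_DOWN:
        return NTFA_NTFSV_DOWN;
    }
    return cur; /* unknown event: keep the current state */
}

/* Return code an API call would get in a given server state. */
sketch_rc_t sketch_api_rc(ntfa_ntfsv_state_t st, sketch_api_t api,
                          bool handle_valid)
{
    switch (st) {
    case NTFA_NTFSV_UP:
        /* saNtfInitialize creates a handle, so it needs none. */
        if (api == SKETCH_API_INITIALIZE || handle_valid)
            return SKETCH_RC_OK;
        return SKETCH_RC_BAD_HANDLE;
    case NTFA_NTFSV_NO_ACTIVE:
        return SKETCH_RC_TRY_AGAIN;
    case NTFA_NTFSV_DOWN:
        if (api == SKETCH_API_INITIALIZE)
            return SKETCH_RC_TRY_AGAIN;
        if (api == SKETCH_API_FINALIZE)
            return SKETCH_RC_OK;
        return SKETCH_RC_BAD_HANDLE;
    case NTFA_NTFSV_NONE:
        break;
    }
    /* Assumption: no server seen yet is treated like "try again". */
    return SKETCH_RC_TRY_AGAIN;
}
```

Feeding the observed event sequence 3, 1, 2, 1, 4, 2 through `sketch_next_state`, starting from NTFA_NTFSV_NONE, walks the states UP, NO_ACTIVE, UP, NO_ACTIVE, DOWN, UP, which matches the log excerpt.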
     
  • Minh Hon Chau

    Minh Hon Chau - 2014-11-17
    • status: unassigned --> accepted
    • assigned_to: elunlen --> Minh Hon Chau
     
  • Mathi Naickan

    Mathi Naickan - 2014-11-18

    Please note, we have been discussing this in the TLC and have not yet agreed upon the solution. I am still waiting for confirmation from Anders Widell on some open questions around the solution, if not the use case.

     
  • Minh Hon Chau

    Minh Hon Chau - 2015-02-18
    • Milestone: 4.6.FC --> future
     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: future --> 5.0.FC
     
  • Minh Hon Chau

    Minh Hon Chau - 2015-12-23
    • summary: NTF: Ntf service shall be able to recover if both SC nodes goes down --> NTF: Add support for cloud resilience feature
     
  • Mathi Naickan

    Mathi Naickan - 2015-12-23

    One major criterion for this feature is that it must be configurable, i.e., the user should be able to turn the feature on and off.

     
  • Minh Hon Chau

    Minh Hon Chau - 2015-12-23
    • status: accepted --> review
     
  • elunlen

    elunlen - 2016-03-22
    • status: review --> fixed
    • assigned_to: Minh Hon Chau --> nobody
     
  • elunlen

    elunlen - 2016-03-22

    changeset: 7342:7c969b351068
    tag: tip
    user: Minh Hon Chau <minh.chau@dektech.com.au>
    date: Tue Mar 22 09:50:14 2016 +0100
    summary: NTF: Add new README file for description of cloud resilience support [#1180] V2

    rev: 7c969b3510681a7e5a30096fb70553cb30e6a067

    changeset: 7341:e3814be9e4cc
    user: Minh Hon Chau <minh.chau@dektech.com.au>
    date: Tue Mar 22 09:49:47 2016 +0100
    summary: NTF: Add tests for NTF cloud resilience feature [#1180] V2

    rev: e3814be9e4cc3dba5f52db18672e63909afc87ed

    changeset: 7340:81190bce2e01
    user: Minh Hon Chau <minh.chau@dektech.com.au>
    date: Tue Mar 22 09:49:29 2016 +0100
    summary: NTF: Add wrapper for usage of NTF API in ntftools to handle TRY_AGAIN [#1180]

    rev: 81190bce2e01c80fbf97b11c4593ba542ce8b087

    changeset: 7339:1b6ced612cdd
    user: Minh Hon Chau <minh.chau@dektech.com.au>
    date: Tue Mar 22 09:49:04 2016 +0100
    summary: NTF: Add support cloud resilience for NTF Agent [#1180] V3

    rev: 1b6ced612cdd3b26dc1a2bf5df51beb4777b01cc

    changeset: 7338:fc2b1ecfb6b0
    user: Minh Hon Chau <minh.chau@dektech.com.au>
    date: Tue Mar 22 09:48:52 2016 +0100
    summary: NTF: Add support cloud resilience for NTF libs common [#1180]

    rev: fc2b1ecfb6b0145ef7abb9c20eda52f44798f6ac

     

    Related

    Tickets: #1180

