#3277 ntf: Discarded notifications accumulation causing standby controller reboot during cold sync

5.21.09
fixed
None
defect
ntf
d
5.21.06
major
False
2021-09-14
2021-08-02
No

The Ntf service accumulates a large number of discarded notifications (around 200,000) and checkpoints them to the standby Ntf when the standby comes up in cold sync. The standby Ntf takes more than 40 seconds to process them. During this time the active Ntf receives a few notifications and checkpoints their information to the standby Ntf as async updates, each of which is a synchronous call with a timeout of 1 second. Because the standby Ntf is busy processing the cold sync, it does not process the async updates, so the active Ntf keeps timing out at 1-second intervals more than 40 times (i.e. for more than 40 seconds).
During this time, the standby Clmd sends an NtfInitialize request to the active Ntf and times out 4 times (40 seconds). Amf then times out on the CSI (the CSI timeout is 40 seconds) and reboots the node that is coming up.
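The timing in the description can be checked with a little arithmetic. The sketch below is only an illustration, not OpenSAF code: the function and constant names are hypothetical, and the 10-second NtfInitialize timeout is derived from "4 times (40 seconds)" rather than stated directly in the ticket.

```c
/* Values quoted in (or derived from) the ticket description. */
enum {
  COLD_SYNC_BUSY_SEC = 40,       /* standby Ntf is busy for at least this long */
  ASYNC_UPDATE_TIMEOUT_SEC = 1,  /* timeout of one checkpoint (async update) call */
  NTF_INIT_TIMEOUT_SEC = 10,     /* derived: 4 timeouts in 40 seconds */
  CSI_TIMEOUT_SEC = 40           /* Amf CSI timeout */
};

/* Number of back-to-back timeouts a caller sees while its peer is busy. */
int timeouts_while_busy(int busy_sec, int timeout_sec) {
  return busy_sec / timeout_sec;
}

/* 1 if the busy period lasts at least as long as the CSI timeout, else 0. */
int csi_deadline_missed(int busy_sec, int csi_timeout_sec) {
  return busy_sec >= csi_timeout_sec;
}
```

With the ticket's numbers, `timeouts_while_busy(COLD_SYNC_BUSY_SEC, ASYNC_UPDATE_TIMEOUT_SEC)` gives 40 checkpoint timeouts, `timeouts_while_busy(COLD_SYNC_BUSY_SEC, NTF_INIT_TIMEOUT_SEC)` gives 4 NtfInitialize timeouts, and `csi_deadline_missed(COLD_SYNC_BUSY_SEC, CSI_TIMEOUT_SEC)` is 1, which matches the reboot described above.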

The root cause is that Ntf loses the down event of a subscriber and never removes the subscriber information, so the number of discarded notifications grows each time a notification is sent.
The down event can be missed because the system is low on memory, because the down event could not be posted to the mailbox, and so on. We don't know the real root cause, but discarded notifications can accumulate only in such cases.
We could reproduce it; please see the steps to reproduce.
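The accumulation can be pictured with a small model. This is a minimal sketch under stated assumptions, not the osafntfd data structures: `subscriber_model`, `on_subscriber_down` and `send_notification` are hypothetical names, and a single counter stands in for the discarded-notification list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-subscriber state held by the Ntf server. */
typedef struct {
  bool alive;        /* is the subscriber process still running? */
  bool registered;   /* does the server still hold its subscription? */
  size_t discarded;  /* notifications that could not be delivered */
} subscriber_model;

/* The down event reaches the server: subscription and backlog are dropped. */
void on_subscriber_down(subscriber_model *s) {
  s->registered = false;
  s->discarded = 0;
}

/* Deliver one notification; an undeliverable one is counted as discarded. */
void send_notification(subscriber_model *s) {
  if (!s->registered) {
    return; /* nobody to deliver to, nothing is recorded */
  }
  if (!s->alive) {
    s->discarded++; /* subscriber is gone but still registered */
  }
}
```

If the subscriber dies and `on_subscriber_down` is never called, every later `send_notification` adds one discarded entry, so 200,000 sends leave 200,000 entries to checkpoint during cold sync. When the down event is handled, the count stays at zero.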

Steps to reproduce:

  1. Comment out the clientRemoveMDS() call in the proc_ntfa_updn_mds_msg() function in ntfs_evt.c.
  2. Subscribe to the ntf service using ntfsubscribe.
  3. Send notifications using ntfsend (ntfsend -s 1 --notificationType=0x4000 --additionalText=TEXT --repeatSends=200000).
  4. While ntfsend is running, kill the ntfsubscribe process.
  5. Start the standby and look for the discarded notifications in the osafntfd trace file.

Related

Tickets: #3277
Wiki: ChangeLog-5.21.09

Discussion

  • Mohan  Kanakam

    Mohan Kanakam - 2021-08-02
    • status: unassigned --> assigned
     
  • Mohan  Kanakam

    Mohan Kanakam - 2021-08-03
    • status: assigned --> review
     
  • Mohan  Kanakam

    Mohan Kanakam - 2021-08-11

    The comments from Thanh are incorporated in the patch attached (Discardnotification_v3.patch).

     
    • Thanh Nguyen

      Thanh Nguyen - 2021-08-17

      Hello Mohan,

      Ack from me.
      Best Regards,
      Thanh

      From: Mohan Kanakam via Opensaf-tickets opensaf-tickets@lists.sourceforge.net
      Sent: Thursday, 12 August 2021 12:32 AM
      To: opensaf-tickets@lists.sourceforge.net
      Cc: Mohan Kanakam mohan-hasoln@users.sourceforge.net
      Subject: [tickets] [opensaf:tickets] #3277 ntf: Discarded notifications accumulation causing standby controller reboot during cold sync

      The comments from Thanh are incorporated in the patch attached (Discardnotification_v3.patch).

      Attachments:


      [tickets:#3277]https://sourceforge.net/p/opensaf/tickets/3277/ ntf: Discarded notifications accumulation causing standby controller reboot during cold sync

      Status: review
      Milestone: 5.21.10
      Created: Mon Aug 02, 2021 01:31 PM UTC by Mohan Kanakam
      Last Updated: Tue Aug 03, 2021 09:12 AM UTC
      Owner: Mohan Kanakam



  • Minh Hon Chau

    Minh Hon Chau - 2021-08-18

    commit c0f7603a4a7354d30099898d005bf474b78e3d6e
    Author: Mohan mohan@hasolutions.in
    Date: Wed Aug 18 18:59:52 2021 +1000

    NTF: Delete discarded notifications when send fails twice [#3277]
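The commit subject suggests a policy along the following lines. This is a hedged sketch, not the contents of the patch: `discard_state`, `record_send_result` and `MAX_SEND_FAILURES` are hypothetical names, and counting consecutive failures (rather than total failures) is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SEND_FAILURES 2 /* "fails twice", per the commit subject */

/* Hypothetical per-subscriber bookkeeping. */
typedef struct {
  size_t discarded;      /* discarded notifications currently kept */
  unsigned failed_sends; /* consecutive failed send attempts (assumption) */
} discard_state;

/*
 * Record the outcome of one send attempt towards a subscriber.
 * Returns true when the discarded notifications were deleted.
 */
bool record_send_result(discard_state *st, bool send_ok) {
  if (send_ok) {
    st->failed_sends = 0; /* subscriber is reachable again */
    return false;
  }
  st->failed_sends++;
  if (st->failed_sends >= MAX_SEND_FAILURES) {
    st->discarded = 0; /* stop keeping a backlog for a dead subscriber */
    st->failed_sends = 0;
    return true;
  }
  st->discarded++; /* first failure: keep the notification as discarded */
  return false;
}
```

Under this policy the backlog for an unreachable subscriber never grows past one entry, so there is no large list to checkpoint to the standby during cold sync.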
    
     
  • Minh Hon Chau

    Minh Hon Chau - 2021-08-18
    • status: review --> fixed
     
