The Ntf service accumulates a large number of discarded notifications (around 200,000) and checkpoints them to the Standby Ntf during cold sync while the standby is coming up. The Standby Ntf takes more than 40 seconds to process them. During this time, the Active Ntf receives a few notifications and checkpoints their information (async updates) to the Standby Ntf through a synchronous call with a timeout of 1 second. Because the Standby Ntf is busy processing the cold sync, it does not process the async updates, and the Active Ntf keeps timing out at 1-second intervals more than 40 times (i.e. for more than 40 seconds).
Meanwhile, the Standby Clmd sends an NtfInitialize request to the Active Ntf and times out 4 times (40 seconds). Amf then times out on the CSI (CSI timeout of 40 seconds) and reboots the node that is coming up.
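The timing above can be sketched as a small model. This is only an illustration, not OpenSAF code: the function name is hypothetical, the 10-second NtfInitialize timeout is inferred from "4 times (40 seconds)", and the model assumes the Active Ntf retries back to back for the whole cold sync.

```python
import math

# Values taken from the ticket text; the model itself is hypothetical.
ASYNC_UPDATE_TIMEOUT_S = 1     # Active Ntf -> Standby Ntf checkpoint call
NTF_INITIALIZE_TIMEOUT_S = 10  # inferred: 4 timeouts reported as 40 seconds
AMF_CSI_TIMEOUT_S = 40         # Amf CSI timeout


def cold_sync_outcome(cold_sync_seconds: float) -> dict:
    """Estimate what other components observe while the standby is busy.

    Assumes the Standby Ntf handles nothing else until cold sync finishes.
    """
    blocked = cold_sync_seconds
    return {
        "async_update_timeouts": math.floor(blocked / ASYNC_UPDATE_TIMEOUT_S),
        "ntf_initialize_timeouts": math.floor(blocked / NTF_INITIALIZE_TIMEOUT_S),
        "amf_reboots_node": blocked >= AMF_CSI_TIMEOUT_S,
    }
```

Under these assumptions, any cold sync that keeps the Standby Ntf busy for 40 seconds or more ends with Amf rebooting the node, which matches the behaviour reported here.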
The root cause is that Ntf loses the down event of a subscriber and therefore never removes the subscriber information, so the number of discarded notifications grows each time a notification is sent.
The down event can be missed, for example, because the system is low on memory or because the down event cannot be posted to the mailbox. We don't know exactly why the down event was lost, but discarded notifications can accumulate only in such cases.
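A minimal sketch of the accumulation, assuming a much simplified server model. The class and method names are hypothetical and do not correspond to the real Ntf data structures. It shows that a stale subscriber entry adds one discarded record per notification, and that handling the down event prevents this.

```python
from dataclasses import dataclass, field


@dataclass
class NtfServerModel:
    """Hypothetical, simplified stand-in for the Active Ntf state."""
    subscribers: set = field(default_factory=set)
    discarded: list = field(default_factory=list)  # (subscriber_id, notification_id)

    def subscribe(self, sub_id: int) -> None:
        self.subscribers.add(sub_id)

    def on_subscriber_down(self, sub_id: int) -> None:
        # Normal path: the down event removes the subscriber and its records.
        self.subscribers.discard(sub_id)
        self.discarded = [d for d in self.discarded if d[0] != sub_id]

    def send(self, notification_id: int, alive: set) -> None:
        # A subscriber that is still registered but no longer alive cannot
        # receive the notification, so a discarded record is kept for it.
        for sub_id in self.subscribers:
            if sub_id not in alive:
                self.discarded.append((sub_id, notification_id))

    def cold_sync_records(self) -> int:
        # Number of discarded records the standby would have to process.
        return len(self.discarded)
```

In this model, a subscriber that dies without its down event being handled adds one record per notification sent, so the cold sync payload grows without bound until the stale entry is removed.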
We were able to reproduce it; please see the steps to reproduce.
Steps to reproduce:
The comments from Thanh are incorporated in the patch attached (Discardnotification_v3.patch).
Hello Mohan,
Ack from me.
Best Regards,
Thanh
From: Mohan Kanakam via Opensaf-tickets opensaf-tickets@lists.sourceforge.net
Sent: Thursday, 12 August 2021 12:32 AM
To: opensaf-tickets@lists.sourceforge.net
Cc: Mohan Kanakam mohan-hasoln@users.sourceforge.net
Subject: [tickets] [opensaf:tickets] #3277 ntf: Discarded notifications accumulation causing standby controller reboot during cold sync
[tickets:#3277] https://sourceforge.net/p/opensaf/tickets/3277/ ntf: Discarded notifications accumulation causing standby controller reboot during cold sync
Status: review
Milestone: 5.21.10
Created: Mon Aug 02, 2021 01:31 PM UTC by Mohan Kanakam
Last Updated: Tue Aug 03, 2021 09:12 AM UTC
Owner: Mohan Kanakam
commit c0f7603a4a7354d30099898d005bf474b78e3d6e
Author: Mohan mohan@hasolutions.in
Date: Wed Aug 18 18:59:52 2021 +1000