#458 Dtmd flickering resulted in node reboot

future
unassigned
nobody
None
enhancement
dtm
-
4.3 GA
major
2015-07-15
2013-06-14
No

Jun 13 17:42:00 PL-3 osafimmnd[5650]: NO Implementer disconnected 54 <0, 2020f(down)> (safAmfService)
Jun 13 17:42:00 PL-3 osafimmnd[5650]: NO Implementer disconnected 43 <0, 2020f(down)> (safSmfService)
Jun 13 17:42:00 PL-3 osafimmnd[5650]: NO Implementer disconnected 44 <0, 2020f(down)> (safMsgGrpService)
Jun 13 17:42:00 PL-3 osafimmnd[5650]: NO Implementer disconnected 45 <0, 2020f(down)> (safCheckPointService)
Jun 13 17:42:00 PL-3 osafimmnd[5650]: NO Implementer disconnected 47 <0, 2020f(down)> (safLckService)
Jun 13 17:42:00 PL-3 osafimmnd[5650]: NO Implementer disconnected 46 <0, 2020f(down)> (safEvtService)
Jun 13 17:42:02 PL-3 osafimmnd[5650]: NO No IMMD service => cluster restart
Jun 13 17:42:03 PL-3 osafamfnd[3391]: NO 'safComp=IMMND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
Jun 13 17:42:03 PL-3 osafimmnd[5982]: Started
Jun 13 17:42:03 PL-3 osafdtmd[3320]: NO Lost contact with 'SC-2'
Jun 13 17:42:04 PL-3 osafdtmd[3320]: NO Established contact with 'SC-2'
Jun 13 17:42:05 PL-3 osafamfnd[3391]: ER AMF director unexpectedly crashed
Jun 13 17:42:05 PL-3 osafamfnd[3391]: Rebooting OpenSAF NodeId = 131855 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received
Jun 13 17:42:05 PL-3 osafdtmd[3320]: NO Lost contact with 'SC-1'
Jun 13 17:42:05 PL-3 opensaf_reboot: Rebooting local node
Jun 13 17:42:06 PL-3 osafdtmd[3320]: NO Established contact with 'SC-1'
Jun 13 17:42:06 PL-3 osafimmnd[5982]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Jun 13 17:42:07 PL-3 osafimmnd[5982]: WA Resending introduce-me - problems with MDS ?
Jun 13 17:42:07 PL-3 osafimmnd[5982]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SER
=============================================

This looks like an MDS problem.

DTM lost the connection to SC-2 and re-established it within 1 second:

Jun 13 17:42:03 PL-3 osafdtmd[3320]: NO Lost contact with 'SC-2'
Jun 13 17:42:04 PL-3 osafdtmd[3320]: NO Established contact with 'SC-2'

AMFND then receives a down event because it has lost contact with both the active and the standby AMFD, and as a result AMFND reboots the blade:

Jun 13 17:42:05 PL-3 osafamfnd[3391]: ER AMF director unexpectedly crashed
Jun 13 17:42:05 PL-3 osafamfnd[3391]: Rebooting OpenSAF NodeId = 131855 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received
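
The short lost/established cycles described above can be picked out of a syslog file with a small script. The following is a minimal sketch, not part of OpenSAF: the function name `find_flaps`, the 2-second threshold, and the hard-coded year are assumptions made for illustration (syslog timestamps carry no year, and `%b` parsing assumes an English locale). The regular expression is based only on the osafdtmd lines quoted in this ticket.

```python
import re
from datetime import datetime, timedelta

# Matches the osafdtmd contact messages in the form quoted in this ticket.
CONTACT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+osafdtmd\[\d+\]:\s+NO\s+"
    r"(?P<event>Lost|Established) contact with '(?P<peer>[^']+)'"
)


def find_flaps(lines, max_gap=timedelta(seconds=2), year=2013):
    """Return (peer, lost_at, established_at) for each loss that recovered within max_gap."""
    lost_at = {}  # peer name -> time of the most recent unmatched "Lost contact"
    flaps = []
    for line in lines:
        m = CONTACT_RE.match(line)
        if not m:
            continue
        # Syslog omits the year, so the caller supplies one (assumption).
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        peer = m["peer"]
        if m["event"] == "Lost":
            lost_at[peer] = ts
        elif peer in lost_at:
            start = lost_at.pop(peer)
            if ts - start <= max_gap:
                flaps.append((peer, start, ts))
    return flaps


sample = [
    "Jun 13 17:42:03 PL-3 osafdtmd[3320]: NO Lost contact with 'SC-2'",
    "Jun 13 17:42:04 PL-3 osafdtmd[3320]: NO Established contact with 'SC-2'",
    "Jun 13 17:42:05 PL-3 osafdtmd[3320]: NO Lost contact with 'SC-1'",
    "Jun 13 17:42:06 PL-3 osafdtmd[3320]: NO Established contact with 'SC-1'",
]

for peer, lost, back in find_flaps(sample):
    print(peer, (back - lost).total_seconds())
```

Run against the four osafdtmd lines from the log above, this reports one flap for SC-2 and one for SC-1, each lasting one second.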

Discussion

  • Hans Feldt

    Hans Feldt - 2013-06-17

    Classic split brain. You should adjust dtm to the worst-case network latency (TIPC link tolerance) and have redundant communication to the other controller.

     
  • Anders Bjornerstedt

    • Type: defect --> enhancement
     
