#457 Dtm: standby joins as active after restart in a 70 node setup

future
unassigned
nobody
None
enhancement
dtm
-
4.3
major
False
2017-08-28
2013-06-14
No

After analyzing the logs, the following is observed:

Slot-1 is active and slot-2 is standby.

  1. IMMND is killed on slot-2

Jun 11 21:29:46 SLES-64BIT-SLOT2 osafamfnd[3750]: NO 'safComp=IMMND,safSu=SC-2,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'

  2. The active IMMD detects this and discards the slot-2 IMMND (pid 3668)

Jun 11 15:54:02 SLES-64BIT-SLOT1 osafimmnd[3746]: NO Global discard node received for nodeId:2020f pid:3668

  3. The new IMMND on slot-2 requests sync

Jun 11 21:29:46 SLES-64BIT-SLOT2 osafimmnd[7315]: Started

Jun 11 15:54:03 SLES-64BIT-SLOT1 osafimmd[3736]: NO Node 2020f request sync sync-pid:7315 epoch:0

  4. IMMD is killed on slot-2; the recovery action is nodeFailfast, so slot-2 reboots

Jun 11 21:29:49 SLES-64BIT-SLOT2 osafamfnd[3750]: ER safComp=IMMD,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Jun 11 21:29:49 SLES-64BIT-SLOT2 osafamfnd[3750]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast
Jun 11 21:29:49 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node

  5. After coming up, slot-2 takes the active role even though slot-1 is still active (a simplified sketch of this decision follows the log excerpt below)

Jun 11 21:30:22 SLES-64BIT-SLOT2 osafrded[2095]: NO Peer not available => Active role
Jun 11 21:30:23 SLES-64BIT-SLOT2 osaffmd[2108]: Started
Jun 11 21:30:23 SLES-64BIT-SLOT2 osafimmd[2117]: Started
Jun 11 21:30:23 SLES-64BIT-SLOT2 osafimmnd[2127]: Started
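
For illustration, here is a minimal sketch of the kind of timeout-driven role decision rded makes at startup. This is not the actual OpenSAF rded code: peer_discovered() and the polling interval are hypothetical stand-ins. The point is that the role is decided from whatever has been discovered when the timeout expires, so a dtm connection that comes up late is indistinguishable from a missing peer:

    /* Hypothetical sketch, NOT the real rded implementation. */
    #include <stdbool.h>
    #include <time.h>

    enum ha_role { ROLE_ACTIVE, ROLE_STANDBY };

    /* Stand-in: true once a reply from a peer RDE has arrived. */
    extern bool peer_discovered(void);

    enum ha_role decide_initial_role(unsigned timeout_sec)
    {
        time_t deadline = time(NULL) + timeout_sec;
        struct timespec interval = { 0, 100 * 1000 * 1000 }; /* 100 ms */

        while (time(NULL) < deadline) {
            if (peer_discovered())
                return ROLE_STANDBY;    /* a peer answered; assume it is
                                         * (or becomes) active */
            nanosleep(&interval, NULL); /* wait and retry discovery */
        }

        /* "Peer not available => Active role": slot-2 concluded it was
         * alone because the dtm link to SC-1 was not up yet. */
        return ROLE_ACTIVE;
    }

In a 70-node setup, transport discovery can take longer than the default timeout, which is exactly the window this ticket describes.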

  6. After getting the active role, the slot-2 IMMND becomes coordinator and the node starts loading

Jun 11 21:30:23 SLES-64BIT-SLOT2 osafimmnd[2127]: NO This IMMND is now the NEW Coord

  7. After some time, a connection is established between the two nodes

Jun 11 21:30:23 SLES-64BIT-SLOT2 osafdtmd[2077]: NO Established contact with 'SC-1'
Jun 11 15:54:39 SLES-64BIT-SLOT1 osafdtmd[3696]: NO Established contact with 'SC-2'

  8. Once connected, the loading event reaches the active IMMD on slot-1. The IMMND-up event was never received there, because the connection between the two nodes had not yet been established when that IMMND came up.

Jun 11 15:54:42 SLES-64BIT-SLOT1 osafimmd[3736]: WA Wrong PID 0 != 2127

  9. AMFD tries to re-initialize with IMM: IMMND returns BAD_HANDLE when AMFD issues another request on a handle whose previous synchronous call has not yet completed (a hedged re-initialization sketch follows the log excerpt below).

Jun 11 15:54:49 SLES-64BIT-SLOT1 osafamfd[3815]: NO Re-initializing with IMM
Jun 11 15:54:49 SLES-64BIT-SLOT1 osafimmnd[3746]: WA IMMND - Client Node Get Failed for cli_hdl 85899477263
Jun 11 15:54:49 SLES-64BIT-SLOT1 osafamfd[3815]: ER saImmOiImplementerSet failed 14
Jun 11 15:54:49 SLES-64BIT-SLOT1 osafamfd[3815]: ER exiting since avd_imm_impl_set failed
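
For reference, a minimal sketch of this re-initialization path using the standard SA Forum IMM OI API (saImmOiFinalize, saImmOiInitialize_2, saImmOiImplementerSet). The NULL callback table, the version, and the implementer name "safAmfService" are illustrative assumptions, not AMFD's exact code:

    #include <stdio.h>
    #include <saImmOi.h>

    static SaAisErrorT reinit_with_imm(SaImmOiHandleT *oi_handle)
    {
        SaVersionT version = { 'A', 2, 1 };
        SaAisErrorT rc;

        /* The old handle is already unusable; ignore the result. */
        (void)saImmOiFinalize(*oi_handle);

        rc = saImmOiInitialize_2(oi_handle, NULL, &version);
        if (rc != SA_AIS_OK) {
            fprintf(stderr, "saImmOiInitialize_2 failed %d\n", rc);
            return rc;
        }

        /* This is the call seen failing with 14 (SA_AIS_ERR_EXIST in
         * SaAisErrorT), plausibly because the previous implementer
         * registration had not yet been discarded by IMMND. */
        rc = saImmOiImplementerSet(*oi_handle,
                (SaImmOiImplementerNameT)"safAmfService");
        if (rc != SA_AIS_OK)
            fprintf(stderr, "saImmOiImplementerSet failed %d\n", rc);

        return rc;
    }

When this retry also fails, AMFD has no usable OI handle and exits, as seen in the last log line above.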

Conclusion:

MDS on slot-2 established contact with slot-1 only after loading had already been initiated in IMMND; because of this late connection, slot-2 took the active role.

2 Attachments

Discussion

  • Neelakanta Reddy

    • summary: Dtm: stndby joins as active after restart in a 70 node setup --> Dtm: standby joins as active after restart in a 70 node setup
     
  • Hans Feldt - 2013-06-17

    Apart from possible bugs and improvements in dtm, the following are relevant:
    * Configure RDE_DISCOVER_PEER_TIMEOUT with a value larger than the default of 2 seconds (a sketch follows below)
    * Startup fencing: https://sourceforge.net/p/opensaf/tickets/64/
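
    A sketch of the first suggestion; the file location and the unit of the value are assumptions (check the rde.conf shipped with your release):

        # In rde.conf (typically under /etc/opensaf/; path assumed).
        # Unit assumed to be milliseconds, matching the 2 s default above.
        export RDE_DISCOVER_PEER_TIMEOUT=10000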

     
  • Anders Bjornerstedt

    • Type: defect --> enhancement
     
  • A V Mahesh (AVM)

    • assigned_to: A V Mahesh (AVM) --> nobody
    • Blocker: --> False
     
