Setup:
SLES 11 Physical machine
Changeset: 7997, 5.1 FC
2 controllers and 2 payloads with the headless feature enabled.
2N application (AmfDemo) with 3 SUs.
Issue:
amfd asserts on the controllers on every reboot after an initial split-brain scenario has occurred.
Steps performed:
-> Initially brought up four nodes; all of them joined the cluster.
-> Brought up the 2N application successfully, with SUs hosted on SC-1, SC-2 and PL-3.
-> Performed some operations on the AMF objects, then left the cluster idle.
-> After a gap of 2 weeks, an MDS down event was generated on both controllers because the cable(s) were momentarily unplugged. This led to a split-brain scenario.
Sep 24 21:36:40 SLES-SLOT1 osafimmd[2729]: NO MDS event from svc_id 25 (change:3, dest:565214187380752)
Sep 24 21:36:40 SLES-SLOT1 kernel: [1297950.833811] TIPC: Established link <1.1.1:em1-1.1.2:em1> on network plane A
Sep 24 21:36:40 SLES-SLOT1 osafrded[2710]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131343, SupervisionTime = 60
Sep 26 00:00:01 SLES-SLOT2 osafrded[2715]: NO Got peer info request from node 0x2010f with role ACTIVE
Sep 26 00:00:01 SLES-SLOT2 osafrded[2715]: Rebooting OpenSAF NodeId = 0 EE Name = No EE Mapped, Reason: Split-brain detected, OwnNodeId = 131599, SupervisionTime = 60
-> As the headless feature is enabled, the payloads did not reboot.
-> Once the controllers rejoined the payloads, amfd asserted on the rebooted controller and the controllers rebooted again.
Sep 24 21:39:27 SLES-SLOT1 osafamfd[2772]: NO Received node_up from 2010f: msg_id 1
Sep 24 21:39:27 SLES-SLOT1 osafamfd[2772]: siass.cc:953: avd_susi_recreate: Assertion 'su' failed.
Sep 24 21:39:27 SLES-SLOT1 osafamfnd[2782]: WA AMF director unexpectedly crashed
Sep 24 21:39:27 SLES-SLOT1 osafamfnd[2782]: WA AMF director unexpectedly crashed
Sep 24 21:39:27 SLES-SLOT1 osafamfnd[2782]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received, OwnNodeId = 131343, SupervisionTime = 60
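The logs above print node IDs in two notations: decimal (OwnNodeId = 131343) and hex (0x2010f, 2010f). A small conversion helps correlate them. The sketch below is only a reading aid; the helper names are hypothetical and not part of OpenSAF.

```python
def node_id_to_hex(node_id: int) -> str:
    """Render a decimal node ID the way the hex log lines print it."""
    return format(node_id, "x")


def hex_to_node_id(text: str) -> int:
    """Parse a hex node ID, with or without the 0x prefix."""
    return int(text, 16)


# OwnNodeId values taken from the osafrded lines above.
for own_node_id in (131343, 131599):
    print(own_node_id, "->", "0x" + node_id_to_hex(own_node_id))
```

So the 0x2010f in the "peer info request" line on SLES-SLOT2 and the 2010f in the node_up line both refer to SLES-SLOT1 (OwnNodeId = 131343), while 131599 (SLES-SLOT2) is 0x2020f.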
Below is the backtrace:
No symbol table info available.
No symbol table info available.
__assertion=0x517d01 "su") at sysf_def.c:281
No locals.
su = 0x0
__FUNCTION__ = "avd_susi_recreate"
susi = 0x0
node = 0x7bfdf0
susi_state = 0x0
su_state = 0x7f1d200055a0
__PRETTY_FUNCTION__ = "SaAisErrorT avd_susi_recreate(AVSV_N2D_ND_SISU_STATE_MSG_INFO*)"
n2d_msg = 0x7f1d20008ec0
i = 0
queue_size = 4
queue_evt = 0x7a9b60
act_amfnd_node_up_count = 1
found_state_info = true
__FUNCTION__ = "avd_process_state_info_queue"
avnd = 0x7bf380
n2d_msg = 0x7f1d20004b30
rc = 1
sync_nd_size = 4
act_nd = true
__FUNCTION__ = "avd_node_up_evh"
__FUNCTION__ = "process_event"
pollretval = 1
cb = 0x75cba0 <_control_block>
evt = 0x7f1d20008880
mbx_fd = {raise_obj = 11, rmv_obj = 12}
error = SA_AIS_OK
polltmo = -1
term_fd = 17
__FUNCTION__ = "main_loop"
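The frame headers are missing from the backtrace above, but the `__FUNCTION__` locals still show the call path. A minimal sketch for recovering it, assuming the frames are listed innermost first (as gdb normally prints them); the `call_chain` helper is hypothetical:

```python
import re

# The __FUNCTION__ lines copied from the backtrace above, in the order they appear.
BACKTRACE_LOCALS = '''\
__FUNCTION__ = "avd_susi_recreate"
__FUNCTION__ = "avd_process_state_info_queue"
__FUNCTION__ = "avd_node_up_evh"
__FUNCTION__ = "process_event"
__FUNCTION__ = "main_loop"
'''


def call_chain(text):
    """Return function names outermost first, assuming innermost-first input."""
    names = re.findall(r'__FUNCTION__ = "([^"]+)"', text)
    return list(reversed(names))


print(" -> ".join(call_chain(BACKTRACE_LOCALS)))
```

Read this way, the assertion on `su` (su = 0x0) is hit in avd_susi_recreate while avd_node_up_evh is processing the queued state info messages (queue_size = 4).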
Suggested recovery:
During a split-brain scenario, payloads should be ordered to reboot even when the headless feature is enabled.
I will try to reproduce it on the latest release and will let you know whether it is reproducible.
I have tested such scenarios before, though, and did not run into this situation.
Thanks
Mohan
High Availability Solutions(www.GetHighAvailability.com)
Hi Guys,
I tried to reproduce the issue on the latest OpenSAF release, but I did not succeed.
I have a setup of 2 controllers and 1 payload with the headless feature enabled.
I also uploaded the 2N application with 3 SUs on 3 different nodes.
I triggered the split-brain scenario to check for the issue.
In my case, after the split brain, the nodes rejoined successfully and the application started successfully on the controllers.
The reported issue is not observed in the latest release.
So, can I close this ticket?
Thanks
Mohan(www.GetHighAvailability.com)