The problem in this ticket occurs in the same scenario as the one reported in #2416.
After the SC absence period, amfd hits an osafassert(), which causes a coredump, and the problem happens repeatedly.
One of the patches for #2416 tried to call IMM sync as soon as possible, and that works fine with a small cluster (5 nodes). But in a large cluster of about 75 nodes, the change to the IMM sync calls has almost no effect.
In #2416, a problem was seen, on the assumption of unreliable IMM sync calls, in which after the SC absence period amfd had 3 assignments for a 2N SG: 2 STANDBY SUSIs and 1 ACTIVE SUSI. It was fixed by commit "amfd: Add iteration to failover all absent assignments [#2416]" (refer to: https://sourceforge.net/p/opensaf/tickets/2416/#f83b)
Another variant of the problem, caused by unreliable IMM calls before both SCs go down, is that amfd can end up with both SUs holding ACTIVE assignments, which leads to the assert. So far this problem has only been seen in a large cluster.
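To make the two inconsistent states concrete (2 STANDBY + 1 ACTIVE in #2416, and 2 ACTIVE in this ticket), here is a minimal sketch of a consistency check on the assignments read back after SC absence. This is not amfd code: HaState, Assignment, Verdict and CheckTwoNAssignments are hypothetical names used only for illustration, and the attached fix may handle the situation differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for amfd's data; not the real SUSI types.
enum class HaState { kActive, kStandby, kOther };

struct Assignment {
  std::string su;  // name of the SU holding the assignment
  HaState state;   // HA state recovered after the SC absence period
};

enum class Verdict {
  kConsistent,          // at most one ACTIVE and one STANDBY
  kMultipleActive,      // the variant described in this ticket
  kMultipleStandby,     // the variant described in #2416
  kTooManyAssignments   // more than two assignments of any kind
};

// Classify the recovered assignments of one SI in a 2N SG.
// A 2N SG is expected to have at most one ACTIVE and one STANDBY assignment.
Verdict CheckTwoNAssignments(const std::vector<Assignment>& susis) {
  std::size_t active = 0, standby = 0, other = 0;
  for (const auto& a : susis) {
    switch (a.state) {
      case HaState::kActive:  ++active;  break;
      case HaState::kStandby: ++standby; break;
      case HaState::kOther:   ++other;   break;
    }
  }
  // Internal sanity check: every assignment was counted exactly once.
  assert(active + standby + other == susis.size());

  if (active > 1) return Verdict::kMultipleActive;
  if (standby > 1) return Verdict::kMultipleStandby;
  if (susis.size() > 2) return Verdict::kTooManyAssignments;
  return Verdict::kConsistent;
}
```

The idea is that a caller could run such a check before the failover of absent assignments and take a recovery path for any verdict other than kConsistent, instead of reaching the assert.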
Details of coredump:
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/opensaf/osafamfd'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f784279b0c7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: zypper install opensaf-amf-director-debuginfo-5.2.0-469.0.6128a2d.sle12.x86_64
(gdb) bt full
#0 0x00007f784279b0c7 in raise () from /lib64/libc.so.6
        No symbol table info available.
#1 0x00007f784279c478 in abort () from /lib64/libc.so.6
        No symbol table info available.
#2 0x00007f78435fdf4e in __osafassert_fail (__file=<optimized out>, __line=<optimized out>, __func=<optimized out>, __assertion=<optimized out>) at ../../opensaf/src/base/sysf_def.c:286
        No locals.
#3 0x00007f78445671e8 in avd_sg_2n_act_susi (sg=<optimized out>, stby_susi=stby_susi@entry=0x7ffeef034998, cb=0x7f78447f2e80 <_control_block>) at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:596
        susi = <optimized out>
        a_susi_2 = 0x7f7845e0d0c0
        s_susi_1 = 0x7f7845e0d0c0
        su_2 = <optimized out>
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        s_susi_2 = 0x7f7845e2a030
        a_susi = 0x0
        a_susi_1 = 0x7f7845e2a030
        s_susi = 0x0
        su_1 = 0x7f7845d69e60
#4 0x00007f784456d5d6 in SG_2N::node_fail (this=0x7f7845d5f4f0, cb=0x7f78447f2e80 <_control_block>, su=0x7f7845d69e60) at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:3402
        a_susi = <optimized out>
        s_susi = 0x7f7845d69a68
        o_su = <optimized out>
        flag = <optimized out>
        __FUNCTION__ = "node_fail"
        su_ha_state = <optimized out>
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
#5 0x00007f784455de1a in AVD_SG::failover_absent_assignment (this=0x7f7845d5f4f0) at ../../opensaf/src/amf/amfd/sg.cc:2307
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "failover_absent_assignment"
        failed_su = 0x7f7845d69e60
#6 0x00007f7844514125 in avd_cluster_tmr_init_evh (cb=0x7f78447f2e80 <_control_block>, evt=<optimized out>) at ../../opensaf/src/amf/amfd/cluster.cc:103
        i_sg = 0x7f7845d5f4f0
        __for_range = @0x7f7845ca2a90: {db = {_M_t = { _M_impl = {<std::allocator<std::_Rb_tree_node<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> = {<No data fields>}, <No data fields>}, _M_key_compare = {<std::binary_function<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool>> = {<No data fields>}, <No data fields>}, _M_header = {_M_color = std::_S_red, _M_parent = 0x7f7845d515e0, _M_left = 0x7f7845d03ed0, _M_right = 0x7f7845d81580}, _M_node_count = 28}}}}
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "avd_cluster_tmr_init_evh"
        su = 0x0
        node = <optimized out>
#7 0x00007f784453ca2c in process_event (cb_now=0x7f78447f2e80 <_control_block>, evt=0x7f78340013d0) at ../../opensaf/src/amf/amfd/main.cc:775
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "process_event"
#8 0x00007f78444f6abe in main_loop () at ../../opensaf/src/amf/amfd/main.cc:691
        pollretval = <optimized out>
        evt = 0x7f78340013d0
        polltmo = 0
        term_fd = 24
        cb = 0x7f78447f2e80 <_control_block>
        error = <optimized out>
        old_sync_state = AVD_STBY_OUT_OF_SYNC
#9 main (argc=<optimized out>, argv=<optimized out>) at ../../opensaf/src/amf/amfd/main.cc:848
        No locals.
Tickets: #2416
Wiki: ChangeLog-5.17.07
Wiki: ChangeLog-5.17.11
Attach log/trace.
Attached a debug patch that simulates the IMM sync calls not working as expected; it helps to reproduce the problem.
Attached preliminary fix
Hi Minh,
What are the steps to reproduce after applying the patch 2477_rep.diff?
Thanks,
Praveen
Hi Praveen,
Steps:
amf-adm unlock safSu=SU1,safSg=AmfDemoTwon,safApp=AmfDemoTwon
amf-adm unlock safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon
amf-adm unlock safSu=SU2,safSg=AmfDemoTwon,safApp=AmfDemoTwon
amf-adm unlock safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon
amf-adm unlock safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon
echo 1 > /root/2477
stop SC1
stop SC2
start SC1
start SC2
echo 0 > /root/2477
Note that I stop the SCs abruptly by killing the SC containers.
Thanks,
Minh
Also note that this is documented as a limitation in the AMF PR doc, as quoted below, so this ticket qualifies as an Enhancement (#2416 could have been one as well):
2.2.11.3 Limitations
• Possible loss of RTA updates and SI assignment messages
If both SCs go down abruptly (SCs are immediately powered-off for instance), AMFD could fail to update RTA to IMM, the SI assignment messages sent from AMFND could not reach to AMFD, or vice versa. In such cases, recovery could be impossible, applications may have inappropriate assignment states.
Hi Minh,
I agree that we need to avoid rebooting the controllers, but I am not sure that avoiding the assert is the right way to do it. Let me check.
Thanks
-Nagu
Attached 2 amfd trace files: one for the cyclic reboot problem without the patch, the other with the patch applied.
commits
release:[29e88c0b052910c6fd660e8b5adca80535832edc]
develop:[02a87b05c3a210c55cce005cc7b892bbd3a6e8e9]