#2477 amfd: Cyclic reboot after SC absence period (in large cluster)

Milestone: 5.17.07

Status: fixed

Owner: nobody

Labels: assignment failover during stop of both SC, 2416

Type: defect

Component: amf

Part: d

Version:

Priority: major

Blocker: True

Updated: 2017-07-27

Created: 2017-06-02

Creator: Minh Hon Chau

Private: No

The problem in this ticket occurs in the same scenario as the one reported in #2416.

After the SC absence period, amfd hits osafassert() and dumps core, and the problem then happens repeatedly.

One of the patches for #2416 tried to call IMM sync as early as possible, and that works fine with a small cluster (5 nodes). In a large cluster of about 75 nodes, however, the change to the IMM sync calls has almost no effect.

In #2416, a problem was seen, under the assumption of unreliable IMM sync calls, in which after the SC absence period amfd had 3 assignments for a 2N SG: 2 STANDBY SUSIs and 1 ACTIVE SUSI. It was fixed by the commit "amfd: Add iteration to failover all absent assignments [#2416]" (refer to: https://sourceforge.net/p/opensaf/tickets/2416/#f83b)

Another variant of the problem of unreliable IMM calls before both SCs go down is that amfd can end up with both SUs having ACTIVE assignments, which leads to the assert. So far this problem has only been seen in a large cluster.
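
The two inconsistent states described above can be illustrated with a minimal sketch in Python. This is not the amfd code and not the actual fix; the names Susi, HaState and check_2n_assignments are hypothetical, and the only assumption is the 2N rule that at most one SU of the SG holds ACTIVE assignments and at most one holds STANDBY assignments. A check along these lines would report the bad state instead of asserting on it:

```python
from dataclasses import dataclass
from enum import Enum


class HaState(Enum):
    ACTIVE = "ACTIVE"
    STANDBY = "STANDBY"


@dataclass(frozen=True)
class Susi:
    """One SU-SI assignment, as it might be read back after SC absence (hypothetical model)."""
    su: str
    si: str
    ha_state: HaState


def check_2n_assignments(susis):
    """Return a list of problems found in the assignments of one 2N SG.

    An empty list means the set is consistent with the assumed 2N rule:
    at most one SU with ACTIVE assignments and at most one with STANDBY.
    """
    active_sus = sorted({s.su for s in susis if s.ha_state is HaState.ACTIVE})
    standby_sus = sorted({s.su for s in susis if s.ha_state is HaState.STANDBY})

    problems = []
    if len(active_sus) > 1:
        problems.append("multiple ACTIVE SUs: " + ", ".join(active_sus))
    if len(standby_sus) > 1:
        problems.append("multiple STANDBY SUs: " + ", ".join(standby_sus))
    return problems


# State from this ticket: both SUs of the 2N SG are ACTIVE.
both_active = [
    Susi("SU1", "SI1", HaState.ACTIVE),
    Susi("SU2", "SI1", HaState.ACTIVE),
]

# State from #2416: 1 ACTIVE SUSI and 2 STANDBY SUSIs.
two_standby = [
    Susi("SU1", "SI1", HaState.ACTIVE),
    Susi("SU2", "SI1", HaState.STANDBY),
    Susi("SU3", "SI1", HaState.STANDBY),
]

for name, state in (("both_active", both_active), ("two_standby", two_standby)):
    print(name, check_2n_assignments(state))
```

In both cases the function returns a non-empty problem list rather than aborting, which is the kind of information a recovery path would need before deciding which assignment to fail over.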

Details of coredump:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/opensaf/osafamfd'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f784279b0c7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: zypper install opensaf-amf-director-debuginfo-5.2.0-469.0.6128a2d.sle12.x86_64
(gdb) bt full
#0  0x00007f784279b0c7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f784279c478 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f78435fdf4e in __osafassert_fail (__file=<optimized out>, __line=<optimized out>, __func=<optimized out>, 
    __assertion=<optimized out>) at ../../opensaf/src/base/sysf_def.c:286
No locals.
#3  0x00007f78445671e8 in avd_sg_2n_act_susi (sg=<optimized out>, stby_susi=stby_susi@entry=0x7ffeef034998, cb=0x7f78447f2e80 <_control_block>)
    at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:596
        susi = <optimized out>
        a_susi_2 = 0x7f7845e0d0c0
        s_susi_1 = 0x7f7845e0d0c0
        su_2 = <optimized out>
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        s_susi_2 = 0x7f7845e2a030
        a_susi = 0x0
        a_susi_1 = 0x7f7845e2a030
        s_susi = 0x0
        su_1 = 0x7f7845d69e60
#4  0x00007f784456d5d6 in SG_2N::node_fail (this=0x7f7845d5f4f0, cb=0x7f78447f2e80 <_control_block>, su=0x7f7845d69e60)
    at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:3402
        a_susi = <optimized out>
        s_susi = 0x7f7845d69a68
        o_su = <optimized out>
        flag = <optimized out>
        __FUNCTION__ = "node_fail"
        su_ha_state = <optimized out>
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
#5  0x00007f784455de1a in AVD_SG::failover_absent_assignment (this=0x7f7845d5f4f0) at ../../opensaf/src/amf/amfd/sg.cc:2307
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "failover_absent_assignment"
        failed_su = 0x7f7845d69e60
#6  0x00007f7844514125 in avd_cluster_tmr_init_evh (cb=0x7f78447f2e80 <_control_block>, evt=<optimized out>)
    at ../../opensaf/src/amf/amfd/cluster.cc:103
        i_sg = 0x7f7845d5f4f0
        __for_range = @0x7f7845ca2a90: {db = {_M_t = {
              _M_impl = {<std::allocator<std::_Rb_tree_node<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> = {<No data fields>}, <No data fields>}, 
                _M_key_compare = {<std::binary_function<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool>> = {<No data fields>}, <No data fields>}, _M_header = {_M_color = std::_S_red, 
                  _M_parent = 0x7f7845d515e0, _M_left = 0x7f7845d03ed0, _M_right = 0x7f7845d81580}, _M_node_count = 28}}}}
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "avd_cluster_tmr_init_evh"
        su = 0x0
        node = <optimized out>
#7  0x00007f784453ca2c in process_event (cb_now=0x7f78447f2e80 <_control_block>, evt=0x7f78340013d0) at ../../opensaf/src/amf/amfd/main.cc:775
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "process_event"
#8  0x00007f78444f6abe in main_loop () at ../../opensaf/src/amf/amfd/main.cc:691
        pollretval = <optimized out>
        evt = 0x7f78340013d0
        polltmo = 0
        term_fd = 24
        cb = 0x7f78447f2e80 <_control_block>
        error = <optimized out>
        old_sync_state = AVD_STBY_OUT_OF_SYNC
#9  main (argc=<optimized out>, argv=<optimized out>) at ../../opensaf/src/amf/amfd/main.cc:848
No locals.

Minh Hon Chau - 2017-06-02

Attached log/trace.
Also attached is a debug patch that simulates the IMM sync calls not working as expected; it helps to reproduce the problem.

2477.tgz

2477_rep.diff


Minh Hon Chau - 2017-06-02

Attached preliminary fix

2477_pre_fix.diff


Minh Hon Chau - 2017-06-02

status: assigned --> review

Praveen - 2017-06-05

Hi Minh,

What are the steps to reproduce after applying the patch 2477_rep.diff?

Thanks,
Praveen


Minh Hon Chau - 2017-06-05

Hi Praveen,

Steps:

amf-adm unlock safSu=SU1,safSg=AmfDemoTwon,safApp=AmfDemoTwon
amf-adm unlock safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon
amf-adm unlock safSu=SU2,safSg=AmfDemoTwon,safApp=AmfDemoTwon
amf-adm unlock safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon
amf-adm unlock safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon

echo 1 > /root/2477
stop SC1
stop SC2
start SC1
start SC2
echo 0 > /root/2477

Note that I stop each SC abruptly by killing the container of the SC.

Thanks,
Minh


Nagendra Kumar - 2017-06-07

Also note that this is documented as a limitation in the AMF PR doc, as quoted below, so this ticket qualifies as an enhancement (#2416 could have as well):
2.2.11.3 Limitations
• Possible loss of RTA updates and SI assignment messages
If both SCs go down abruptly (SCs are immediately powered-off for instance), AMFD could fail to update RTA to IMM, the SI assignment messages sent from AMFND could not reach to AMFD, or vice versa. In such cases, recovery could be impossible, applications may have inappropriate assignment states.


Nagendra Kumar - 2017-06-07

Hi Minh,
I agree that we need to avoid rebooting the controllers, but I am not sure that avoiding the assert is the way to do it. Let me check.

Thanks
-Nagu


Minh Hon Chau - 2017-06-09

Attached 2 amfd trace files: one shows the cyclic reboot problem without the patch, the other is a trace with the patch.

log_ref.tgz


Anders Widell - 2017-07-01

Milestone: 5.17.06 --> 5.17.08

Minh Hon Chau - 2017-07-07

status: review --> fixed

assigned_to: Minh Hon Chau --> nobody

Minh Hon Chau - 2017-07-07

commits
release:[29e88c0b052910c6fd660e8b5adca80535832edc]
develop:[02a87b05c3a210c55cce005cc7b892bbd3a6e8e9]

Related

Commit: [02a87b]
Commit: [29e88c]

