#2477 amfd: Cyclic reboot after SC absence period (in large cluster)

5.17.07
fixed
nobody
defect
amf
d
major
True
2017-07-27
2017-06-02
No

The problem in this ticket occurs in the same scenario as the one reported in #2416.

After an SC absence period, amfd hits osafassert() and dumps core, and the problem happens repeatedly.

One of the patches for #2416 tried to call IMM sync as soon as possible, and that works fine in a small cluster (5 nodes). In a large cluster of about 75 nodes, however, the changed IMM sync calls have almost no effect.

In #2416, a problem was seen, under the assumption of unreliable IMM sync calls, in which after the SC absence period amfd had 3 assignments for a 2N SG: 2 STANDBY SUSIs and 1 ACTIVE SUSI. It was fixed by the commit "amfd: Add iteration to failover all absent assignments [#2416]" (refer to: https://sourceforge.net/p/opensaf/tickets/2416/#f83b)

Another variant of the problem, caused by unreliable IMM calls before both SCs go down, is that amfd can end up with both SUs holding ACTIVE assignments, which leads to the assert. So far this problem has only been seen in a large cluster.
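
To make the two inconsistent states concrete, here is a minimal sketch that classifies the assignments recorded for one SI of a 2N SG. It is not the amfd code: `HaState`, `Assignment`, `TwoNCheck` and `check_2n` are hypothetical names used only for illustration, and the real data structures in sg_2n_fsm.cc are different.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT the real amfd types.
enum class HaState { Active, Standby };

struct Assignment {
  std::string su;  // name of the SU holding the assignment
  HaState state;   // HA state read back after the SC absence period
};

enum class TwoNCheck { Consistent, ExtraStandby, DualActive };

// Classify the assignments of one SI in a 2N SG. A 2N SG is expected to
// have at most one ACTIVE and at most one STANDBY assignment per SI.
TwoNCheck check_2n(const std::vector<Assignment>& susis) {
  int active = 0;
  int standby = 0;
  for (const auto& s : susis) {
    if (s.state == HaState::Active) {
      ++active;
    } else {
      ++standby;
    }
  }
  if (active > 1) return TwoNCheck::DualActive;     // variant in this ticket
  if (standby > 1) return TwoNCheck::ExtraStandby;  // variant seen in #2416
  return TwoNCheck::Consistent;
}
```

In this sketch, two ACTIVE entries give `DualActive` (the state that ends in the assert here), while one ACTIVE plus two STANDBY entries give `ExtraStandby` (the state described for #2416).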

Details of coredump:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/opensaf/osafamfd'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f784279b0c7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: zypper install opensaf-amf-director-debuginfo-5.2.0-469.0.6128a2d.sle12.x86_64
(gdb) bt full
#0  0x00007f784279b0c7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f784279c478 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f78435fdf4e in __osafassert_fail (__file=<optimized out>, __line=<optimized out>, __func=<optimized out>, 
    __assertion=<optimized out>) at ../../opensaf/src/base/sysf_def.c:286
No locals.
#3  0x00007f78445671e8 in avd_sg_2n_act_susi (sg=<optimized out>, stby_susi=stby_susi@entry=0x7ffeef034998, cb=0x7f78447f2e80 <_control_block>)
    at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:596
        susi = <optimized out>
        a_susi_2 = 0x7f7845e0d0c0
        s_susi_1 = 0x7f7845e0d0c0
        su_2 = <optimized out>
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        s_susi_2 = 0x7f7845e2a030
        a_susi = 0x0
        a_susi_1 = 0x7f7845e2a030
        s_susi = 0x0
        su_1 = 0x7f7845d69e60
#4  0x00007f784456d5d6 in SG_2N::node_fail (this=0x7f7845d5f4f0, cb=0x7f78447f2e80 <_control_block>, su=0x7f7845d69e60)
    at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:3402
        a_susi = <optimized out>
        s_susi = 0x7f7845d69a68
        o_su = <optimized out>
        flag = <optimized out>
        __FUNCTION__ = "node_fail"
        su_ha_state = <optimized out>
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
#5  0x00007f784455de1a in AVD_SG::failover_absent_assignment (this=0x7f7845d5f4f0) at ../../opensaf/src/amf/amfd/sg.cc:2307
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "failover_absent_assignment"
        failed_su = 0x7f7845d69e60
#6  0x00007f7844514125 in avd_cluster_tmr_init_evh (cb=0x7f78447f2e80 <_control_block>, evt=<optimized out>)
    at ../../opensaf/src/amf/amfd/cluster.cc:103
        i_sg = 0x7f7845d5f4f0
        __for_range = @0x7f7845ca2a90: {db = {_M_t = {
              _M_impl = {<std::allocator<std::_Rb_tree_node<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> = {<No data fields>}, <No data fields>}, 
                _M_key_compare = {<std::binary_function<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool>> = {<No data fields>}, <No data fields>}, _M_header = {_M_color = std::_S_red, 
                  _M_parent = 0x7f7845d515e0, _M_left = 0x7f7845d03ed0, _M_right = 0x7f7845d81580}, _M_node_count = 28}}}}
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "avd_cluster_tmr_init_evh"
        su = 0x0
        node = <optimized out>
#7  0x00007f784453ca2c in process_event (cb_now=0x7f78447f2e80 <_control_block>, evt=0x7f78340013d0) at ../../opensaf/src/amf/amfd/main.cc:775
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "process_event"
#8  0x00007f78444f6abe in main_loop () at ../../opensaf/src/amf/amfd/main.cc:691
        pollretval = <optimized out>
        evt = 0x7f78340013d0
        polltmo = 0
        term_fd = 24
        cb = 0x7f78447f2e80 <_control_block>
        error = <optimized out>
        old_sync_state = AVD_STBY_OUT_OF_SYNC
#9  main (argc=<optimized out>, argv=<optimized out>) at ../../opensaf/src/amf/amfd/main.cc:848
No locals.

Related

Tickets: #2416
Wiki: ChangeLog-5.17.07
Wiki: ChangeLog-5.17.11

Discussion

  • Minh Hon Chau

    Minh Hon Chau - 2017-06-02

    Attached log/trace.
    Also attached is a debug patch that simulates IMM sync calls not working as expected; it helps to reproduce the problem.

     
  • Minh Hon Chau

    Minh Hon Chau - 2017-06-02

    Attached preliminary fix

     
  • Minh Hon Chau

    Minh Hon Chau - 2017-06-02
    • status: assigned --> review
     
  • Praveen

    Praveen - 2017-06-05

    Hi Minh,

    What are the steps to reproduce after applying the patch 2477_rep.diff?

    Thanks,
    Praveen

     
  • Minh Hon Chau

    Minh Hon Chau - 2017-06-05

    Hi Praveen,

    Steps:

    amf-adm unlock safSu=SU1,safSg=AmfDemoTwon,safApp=AmfDemoTwon
    amf-adm unlock safSu=SU4,safSg=AmfDemoTwon,safApp=AmfDemoTwon
    amf-adm unlock safSu=SU2,safSg=AmfDemoTwon,safApp=AmfDemoTwon
    amf-adm unlock safSu=SU3,safSg=AmfDemoTwon,safApp=AmfDemoTwon
    amf-adm unlock safSu=SU5,safSg=AmfDemoTwon,safApp=AmfDemoTwon

    echo 1 > /root/2477
    stop SC1
    stop SC2
    start SC1
    start SC2
    echo 0 > /root/2477

    Note that I stop the SCs abruptly by killing the SC containers.

    Thanks,
    Minh

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-06-07

    Also note that this is documented as a limitation in the AMF PR doc, as quoted below, so this ticket qualifies as an enhancement (as #2416 could have as well):
    2.2.11.3 Limitations
    • Possible loss of RTA updates and SI assignment messages
    If both SCs go down abruptly (SCs are immediately powered-off for instance), AMFD could fail to update RTA to IMM, the SI assignment messages sent from AMFND could not reach to AMFD, or vice versa. In such cases, recovery could be impossible, applications may have inappropriate assignment states.

     
  • Nagendra Kumar

    Nagendra Kumar - 2017-06-07

    Hi Minh,
    I agree that we need to avoid rebooting the controllers, but I am not sure about doing that by avoiding the assert. Let me check.

    Thanks
    -Nagu

     
  • Minh Hon Chau

    Minh Hon Chau - 2017-06-09

    Attached 2 amfd trace files: one for the cyclic reboot problem without the patch, the other with the patch.

     
  • Anders Widell

    Anders Widell - 2017-07-01
    • Milestone: 5.17.06 --> 5.17.08
     
  • Minh Hon Chau

    Minh Hon Chau - 2017-07-07
    • status: review --> fixed
    • assigned_to: Minh Hon Chau --> nobody
     
