Menu

#122 TIPC link down

None
closed
1
2018-05-22
2017-02-17
No

Hi All,

I have HA pair, and i observed tipc link lost event was not received by standby machine. I got this problem

on ACTIVE machine :

Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308014] TIPC: Resetting link <1.1.2:bond0-1.1.1:bond0>, peer not responding
Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308021] TIPC: Lost link <1.1.2:bond0-1.1.1:bond0> on network plane A
Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308026] TIPC: Lost contact with <1.1.1>
Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Node Down event for node id 2010f:
Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Done Locking applications on node id:2010f ret val:0
Jan 6 16:45:01 ffm-sbc-2b osafclmd[4963]: NO Node 131343 went down. Not sending track callback for agents on that node
Jan 6 16:45:01 ffm-sbc-2b osafclmd: Last message 'NO Node 131343 went ' repeated 5 times, suppressed by syslog-ng on ffm-sbc-2b.mydomain.com
Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Current role: ACTIVE
Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131599, SupervisionTime = 60
Jan 6 16:45:01 ffm-sbc-2b osafamfd[4986]: NO Node 'SC-1' left the cluster
Jan 6 16:45:01 ffm-sbc-2b osafimmd[4910]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Global discard node received for nodeId:2010f pid:5047
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected 121 <0, 2010f(down)> (MsgQueueService131343)
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected 120 <0, 2010f(down)> (@safAmfService2010f)
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer connected: 122 (MsgQueueService131343) <592, 2020f>
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer locally disconnected. Marking it as doomed 122 <592, 2020f> (MsgQueueService131343)
Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected 122 <592, 2020f> (MsgQueueService131343)
Jan 6 16:45:01 ffm-sbc-2b opensaf_reboot: No lock is in progress going to process further...

On standby machine :

I observed rcu_bh_state, and kernel stack dumo when TIPC lost link was occured on ACTIVE machine and after 6 sec we got link lost message on standby machine.

Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.520060] INFO: rcu_bh_state detected stall on CPU 0 (t=0 jiffies)
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] sending NMI to all CPUs:
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] NMI backtrace for cpu 0
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] CPU 0
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Modules linked in: nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate zlib_deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support button container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor thermal_sys [last unloaded: ipt_PORTMAP]
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042]
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Pid: 0, comm: swapper Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RIP: 0010:[<ffffffff81007f51>] [<ffffffff81007f51>] native_read_tsc+0x2/0xf
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RSP: 0018:ffff88043ee03db0 EFLAGS: 00000803
Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RAX: 0000000037185395 RBX: 00000000000003e9 RCX: 0000000000000001
Jan 6 16:45:07 ffm-sbc-2a osafimmd[5035]: WA IMMND DOWN on active controller f2 detected at standby immd!! f1. Possible failover
Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Done Locking applications on node id:2020f ret val:0
Jan 6 16:45:07 ffm-sbc-2a opensaf_recovery: Control interface status:0 Role:STANDBY
Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Current role: STANDBY
Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RDX: 0000000000bf0977 RSI: 0000000000000002 RDI: 0000000000032bd4
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RBP: 0000000000032bd4 R08: 0000000000000000 R09: ffffffff819232b0
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R10: 7fffffffffffffff R11: 7fffffffffffffff R12: 0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R13: ffffffff819232b0 R14: 0000000000000001 R15: ffffffff81065c28
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] FS: 0000000000000000(0000) GS:ffff88043ee00000(0000) knlGS:0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CR2: 000000000069e034 CR3: 0000000001805000 CR4: 00000000000406f0
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Process swapper (pid: 0, threadinfo ffffffff81800000, task ffffffff8180d020)
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Stack:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] ffffffff81200eb5 ffffffff81200f44 00000000000003e9 0000000000001000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002 ffffffff819232b0 ffffffff81017698 7fffffffffffffff
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002 0000000000000002 ffffffff81017fdf 0000000000000001
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Call Trace:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200eb5>] paravirt_read_tsc+0x5/0x8
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200f44>] delay_tsc+0x1d/0x68
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017698>] native_safe_apic_wait_icr_idle+0x27/0x32
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017fdf>] default_send_IPI_dest_field.constprop.0+0x19/0x4d
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8101804b>] default_send_IPI_mask_sequence_phys+0x38/0x6a
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8101815e>] arch_trigger_all_cpu_backtrace+0x4d/0x7b
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8109801b>] check_cpu_stall+0x66/0xdb
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff810980aa>]
rcu_pending+0x1a/0x10a
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8109852c>] rcu_check_callbacks+0x9d/0xae
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8104c56c>] update_process_times+0x31/0x63
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81065c92>] tick_sched_timer+0x6a/0x90
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105b872>] __run_hrtimer+0xa4/0x148
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105c08e>] hrtimer_interrupt+0xdb/0x19a
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017768>] smp_apic_timer_interrupt+0x6e/0x80
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff813efcde>] apic_timer_interrupt+0x6e/0x80
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81239655>] intel_idle+0xdd/0x104
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81304773>] cpuidle_idle_call+0xda/0x169
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81001200>] cpu_idle+0x51/0x95
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193db0f>] start_kernel+0x388/0x393
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193d3af>] x86_64_start_kernel+0xcf/0xdc
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Code: 74 03 e6 80 c3 e6 ed c3 bf 8e 21 00 00 e9 ba 8f 1f 00 c3 90 90 90 40 88 f8 e6 70 e4 71 c3 40 88 f0 e6 70 40 88 f8 e6 71 c3 0f 31
Jan 6 16:45:07 ffm-sbc-2a kernel[3167216.524042]: c1 48 89 d0 48 c1 e0 20 48 09 c8 c3 41 56 41 55 41 54 55 53

.
.
.
.
.
.

Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] NMI backtrace for cpu 31
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CPU 31
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Modules linked in: nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter ip_set_hash_ipip xt_set ip_set_hash_ip deflate zlib_deflate ctr twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support button container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor thermal_sys [last unloaded: ipt_PORTMAP]
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734]
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Pid: 0, comm: kworker/0:1 Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RIP: 0010:[<ffffffff81239624>] [<ffffffff81239624>] intel_idle+0xac/0x104
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RSP: 0018:ffff880425e33eb8 EFLAGS: 00000046
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RAX: 0000000000000030 RBX: 0000000000000010 RCX: 0000000000000001
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RDX: 0000000000000000 RSI: ffff880425e33fd8 RDI: ffffffff81810500
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RBP: 0000000000000030 R08: 000000000000006d R09: 0000000000034b10
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R10: ffff88083eded830 R11: ffff88083eded830 R12: 149739342cb2ca49
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R13: 0000000000000004 R14: 000000000000001f R15: 0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] FS: 0000000000000000(0000) GS:ffff88083ede0000(0000) knlGS:0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CR2: 00007fe89df67120 CR3: 0000000001805000 CR4: 00000000000406e0
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Process kworker/0:1 (pid: 0, threadinfo ffff880425e32000, task ffff880425e30580)
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Stack:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685484] TIPC: Resetting link <1.1.1:bond0-1.1.2:bond0>, requested by peer
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685487] TIPC: Lost link <1.1.1:bond0-1.1.2:bond0> on network plane A
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685491] TIPC: Lost contact with <1.1.2>
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000 000000000cdd3a47 0000000000000000 000000000cdd3a47
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] ffff880425e33fd8 0000001f3edf8970 ffff88083edf8970 ffff88083edf8b00
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000 ffffffff81304773 ffffffff819232b0 ffff880425e33fd8
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>] cpuidle_idle_call+0xda/0x169
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>] cpu_idle+0x51/0x95
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Code: 28 e0 ff ff 80 e2 08 75 22 31 d2 48 83 c0 10 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 48 89 e8 0f 01 c9 <e8> 3f 6e e2 ff 4c 29 e0 48 89 c7 e8 10 ae e0 ff 48 69 e8 40 42
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace:
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>] cpuidle_idle_call+0xda/0x169
Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>] cpu_idle+0x51/0x95
Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooting remote node in the absence of PLM with custom handling through 62.115.30.49
Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooting peer node...
Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooted peer node!

can someone help me on it. Please tell me why this issue occured.

Thanks
~Sumit Gemini

Discussion

  • Erik Hugne

    Erik Hugne - 2017-02-19

    As Ying already said on the mailing list, the root cause of the lost link must be the rcu stall on cpu0
    It seems to be in an idle state when the IPI is sent that detects the stall, which is weird..
    You are running a paravirtualized setup, have you observed this more than once? is it always the same backtrace? Is the microcode loaded for the cpu?

     
    • Sumit Gemini

      Sumit Gemini - 2017-02-19
      Post awaiting moderation.
    • Sumit Gemini

      Sumit Gemini - 2017-02-20

      Hi Eric,

      This stall issue was observed first time. The product is session border controller(SBC), and we are configured TIPC along with opensf.
      But problem is stall issue occurred on standby machine, and active machine was rebooted by standby machine almost 6 second later using opensf functionality.

      Could you tell me why an ideal machine got to rcu_bh_state because problem is we don't have any cpu or memory related log too when this issue occured.

      And yes, we are not using VM machine here, we are actually using real hardwares( HA pair)

      Thanks
      ~Sumit Gemini

       
      • Erik Hugne

        Erik Hugne - 2017-02-20

        No idea..
        Your best bet is to contact your distro provider for support.

        On 20 Feb 2017 06:45, "Sumit Gemini" sumitgemini@users.sf.net wrote:

        Hi Eric,

        This stall issue was observed first time. The product is session border
        controller(SBC), and we are configured TIPC along with opensf.
        But problem is stall issue occurred on standby machine, and active machine
        was rebooted by standby machine almost 6 second later using opensf
        functionality.

        Could you tell me why an ideal machine got to rcu_bh_state because problem
        is we don't have any cpu or memory related log too when this issue occured.

        And yes, we are not using VM machine here, we are actually using real
        hardwares( HA pair)

        Thanks
        ~Sumit Gemini


        Status: open
        Group:
        Labels: tipc rcu_bh_state
        Created: Fri Feb 17, 2017 08:50 AM UTC by Sumit Gemini
        Last Updated: Sun Feb 19, 2017 11:46 AM UTC
        Owner: Erik Hugne

        Hi All,

        I have HA pair, and i observed tipc link lost event was not received by
        standby machine. I got this problem

        on ACTIVE machine :

        Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308014] TIPC: Resetting link
        <1.1.2:bond0-1.1.1:bond0>, peer not responding
        Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308021] TIPC: Lost link
        <1.1.2:bond0-1.1.1:bond0> on network plane A
        Jan 6 16:45:00 ffm-sbc-2b kernel: [3341017.308026] TIPC: Lost contact
        with <1.1.1>
        Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Node Down event for node id
        2010f:
        Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Done Locking applications on
        node id:2010f ret val:0
        Jan 6 16:45:01 ffm-sbc-2b osafclmd[4963]: NO Node 131343 went down. Not
        sending track callback for agents on that node
        Jan 6 16:45:01 ffm-sbc-2b osafclmd: Last message 'NO Node 131343 went '
        repeated 5 times, suppressed by syslog-ng on ffm-sbc-2b.mydomain.com
        Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: NO Current role: ACTIVE
        Jan 6 16:45:01 ffm-sbc-2b osaffmd[4898]: Rebooting OpenSAF NodeId =
        131343 EE Name = , Reason: Received Node Down for peer controller,
        OwnNodeId = 131599, SupervisionTime = 60
        Jan 6 16:45:01 ffm-sbc-2b osafamfd[4986]: NO Node 'SC-1' left the cluster
        Jan 6 16:45:01 ffm-sbc-2b osafimmd[4910]: WA IMMD lost contact with peer
        IMMD (NCSMDS_RED_DOWN)
        Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Global discard node
        received for nodeId:2010f pid:5047
        Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected
        121 <0, 2010f(down)> (MsgQueueService131343)
        Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected
        120 <0, 2010f(down)> (@safAmfService2010f)
        Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer connected: 122
        (MsgQueueService131343) <592, 2020f>
        Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer locally
        disconnected. Marking it as doomed 122 <592, 2020f> (MsgQueueService131343)
        Jan 6 16:45:01 ffm-sbc-2b osafimmnd[4922]: NO Implementer disconnected
        122 <592, 2020f> (MsgQueueService131343)
        Jan 6 16:45:01 ffm-sbc-2b opensaf_reboot: No lock is in progress going to
        process further...

        On standby machine :

        I observed rcu_bh_state, and kernel stack dumo when TIPC lost link was
        occured on ACTIVE machine and after 6 sec we got link lost message on
        standby machine.

        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.520060] INFO: rcu_bh_state
        detected stall on CPU 0 (t=0 jiffies)
        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] sending NMI to all
        CPUs:
        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] NMI backtrace for cpu 0
        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] CPU 0
        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Modules linked in:
        nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit
        ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port
        xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle
        ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter
        ip_set_hash_ipip xt_set ip_set_hash_ip deflate zlib_deflate ctr
        twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic
        cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null
        af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables
        iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding
        binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat
        nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf
        edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp
        nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas
        sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support button
        container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor
        thermal_sys [last unloaded: ipt_PORTMAP]
        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042]
        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] Pid: 0, comm: swapper
        Tainted: P 3.1.10-gb17-default #1 Intel Corporation S2600CO/S2600CO
        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RIP: 0010:
        [<ffffffff81007f51>] [<ffffffff81007f51>] native_read_tsc+0x2/0xf
        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RSP:
        0018:ffff88043ee03db0 EFLAGS: 00000803
        Jan 6 16:45:06 ffm-sbc-2a kernel: [3167216.524042] RAX: 0000000037185395
        RBX: 00000000000003e9 RCX: 0000000000000001
        Jan 6 16:45:07 ffm-sbc-2a osafimmd[5035]: WA IMMND DOWN on active
        controller f2 detected at standby immd!! f1. Possible failover
        Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Done Locking applications on
        node id:2020f ret val:0
        Jan 6 16:45:07 ffm-sbc-2a opensaf_recovery: Control interface status:0
        Role:STANDBY
        Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: NO Current role: STANDBY
        Jan 6 16:45:07 ffm-sbc-2a osaffmd[5023]: Rebooting OpenSAF NodeId =
        131599 EE Name = , Reason: Received Node Down for peer controller,
        OwnNodeId = 131343, SupervisionTime = 60
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RDX: 0000000000bf0977
        RSI: 0000000000000002 RDI: 0000000000032bd4
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] RBP: 0000000000032bd4
        R08: 0000000000000000 R09: ffffffff819232b0
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R10: 7fffffffffffffff
        R11: 7fffffffffffffff R12: 0000000000000000
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] R13: ffffffff819232b0
        R14: 0000000000000001 R15: ffffffff81065c28
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] FS:
        0000000000000000(0000) GS:ffff88043ee00000(0000) knlGS:0000000000000000
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CS: 0010 DS: 0000 ES:
        0000 CR0: 000000008005003b
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] CR2: 000000000069e034
        CR3: 0000000001805000 CR4: 00000000000406f0
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR0: 0000000000000000
        DR1: 0000000000000000 DR2: 0000000000000000
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] DR3: 0000000000000000
        DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Process swapper (pid:
        0, threadinfo ffffffff81800000, task ffffffff8180d020)
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Stack:
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] ffffffff81200eb5
        ffffffff81200f44 00000000000003e9 0000000000001000
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002
        ffffffff819232b0 ffffffff81017698 7fffffffffffffff
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] 0000000000000002
        0000000000000002 ffffffff81017fdf 0000000000000001
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Call Trace:
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200eb5>]
        paravirt_read_tsc+0x5/0x8
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81200f44>]
        delay_tsc+0x1d/0x68
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017698>]
        native_safe_apic_wait_icr_idle+0x27/0x32
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017fdf>]

        default_send_IPI_dest_field.constprop.0+0x19/0x4d Jan 6 16:45:07
        ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8101804b>]
        default_send_IPI_mask_sequence_phys+0x38/0x6a Jan 6 16:45:07 ffm-sbc-2a
        kernel: [3167216.524042] [<ffffffff8101815e>]
        arch_trigger_all_cpu_backtrace+0x4d/0x7b Jan 6 16:45:07 ffm-sbc-2a kernel:
        [3167216.524042] [<ffffffff8109801b>] check_cpu_stall+0x66/0xdb Jan 6
        16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff810980aa>]

        rcu_pending+0x1a/0x10a
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8109852c>]
        rcu_check_callbacks+0x9d/0xae
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8104c56c>]
        update_process_times+0x31/0x63
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81065c92>]
        tick_sched_timer+0x6a/0x90
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105b872>]
        __run_hrtimer+0xa4/0x148
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8105c08e>]
        hrtimer_interrupt+0xdb/0x19a
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81017768>]
        smp_apic_timer_interrupt+0x6e/0x80
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff813efcde>]
        apic_timer_interrupt+0x6e/0x80
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81239655>]
        intel_idle+0xdd/0x104
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81304773>]
        cpuidle_idle_call+0xda/0x169
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff81001200>]
        cpu_idle+0x51/0x95
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193db0f>]
        start_kernel+0x388/0x393
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] [<ffffffff8193d3af>]
        x86_64_start_kernel+0xcf/0xdc
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.524042] Code: 74 03 e6 80 c3
        e6 ed c3 bf 8e 21 00 00 e9 ba 8f 1f 00 c3 90 90 90 40 88 f8 e6 70 e4 71 c3
        40 88 f0 e6 70 40 88 f8 e6 71 c3 0f 31
        Jan 6 16:45:07 ffm-sbc-2a kernel[3167216.524042]: c1 48 89 d0 48 c1 e0 20
        48 09 c8 c3 41 56 41 55 41 54 55 53

        .
        .
        .
        .
        .
        .

        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] NMI backtrace for cpu
        31
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CPU 31
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Modules linked in:
        nf_conntrack_netlink af_packet xt_sharedlimit xt_hashlimit
        ip_set_hash_ipport ip_set_hash_ipportip xt_NOTRACK ip_set_bitmap_port
        xt_sctp nf_conntrack_ipv6 nf_defrag_ipv6 xt_CT arpt_mangle
        ip_set_hash_ipnet xt_NFLOG nfnetlink_log ipt_ULOG xt_limit xt_hashcounter
        ip_set_hash_ipip xt_set ip_set_hash_ip deflate zlib_deflate ctr
        twofish_x86_64 twofish_common camellia serpent blowfish cast5 des_generic
        cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic md5 crypto_null
        af_key iptable_mangle ip_set nfnetlink arptable_filter arp_tables
        iptable_raw iptable_nat tipc xt_tcpudp xt_state xt_pkttype bonding
        binfmt_misc iptable_filter ip6table_filter ip6_tables nf_nat_ftp nf_nat
        nf_conntrack_ftp nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables x_tables mperf
        edd ipmi_devintf ipmi_si ipmi_msghandler nf_conntrack_proto_sctp
        nf_conntrack sctp 8021q garp stp llc gb_sys usb_storage ioatdma ixgbe uas
        sg igb iTCO_wdt wmi i2c_i801 pcspkr mdio iTCO_vendor_support button
        container dca ipv6 autofs4 usbhid megasr(P) ehci_hcd usbcore processor
        thermal_sys [last unloaded: ipt_PORTMAP]
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734]
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Pid: 0, comm:
        kworker/0:1 Tainted: P 3.1.10-gb17-default #1 Intel Corporation
        S2600CO/S2600CO
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RIP: 0010:
        [<ffffffff81239624>] [<ffffffff81239624>] intel_idle+0xac/0x104
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RSP:
        0018:ffff880425e33eb8 EFLAGS: 00000046
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RAX: 0000000000000030
        RBX: 0000000000000010 RCX: 0000000000000001
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RDX: 0000000000000000
        RSI: ffff880425e33fd8 RDI: ffffffff81810500
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] RBP: 0000000000000030
        R08: 000000000000006d R09: 0000000000034b10
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R10: ffff88083eded830
        R11: ffff88083eded830 R12: 149739342cb2ca49
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] R13: 0000000000000004
        R14: 000000000000001f R15: 0000000000000000
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] FS:
        0000000000000000(0000) GS:ffff88083ede0000(0000) knlGS:0000000000000000
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CS: 0010 DS: 0000 ES:
        0000 CR0: 000000008005003b
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] CR2: 00007fe89df67120
        CR3: 0000000001805000 CR4: 00000000000406e0
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR0: 0000000000000000
        DR1: 0000000000000000 DR2: 0000000000000000
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] DR3: 0000000000000000
        DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Process kworker/0:1
        (pid: 0, threadinfo ffff880425e32000, task ffff880425e30580)
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167216.830734] Stack:
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685484] TIPC: Resetting link
        <1.1.1:bond0-1.1.2:bond0>, requested by peer
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685487] TIPC: Lost link
        <1.1.1:bond0-1.1.2:bond0> on network plane A
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.685491] TIPC: Lost contact
        with <1.1.2>
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000
        000000000cdd3a47 0000000000000000 000000000cdd3a47
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] ffff880425e33fd8
        0000001f3edf8970 ffff88083edf8970 ffff88083edf8b00
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] 0000000000000000
        ffffffff81304773 ffffffff819232b0 ffff880425e33fd8
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace:
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>]
        cpuidle_idle_call+0xda/0x169
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>]
        cpu_idle+0x51/0x95
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Code: 28 e0 ff ff 80
        e2 08 75 22 31 d2 48 83 c0 10 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff
        ff a8 08 75 08 b1 01 48 89 e8 0f 01 c9 <e8> 3f 6e e2 ff 4c 29 e0 48 89 c7
        e8 10 ae e0 ff 48 69 e8 40 42
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] Call Trace:
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81304773>]
        cpuidle_idle_call+0xda/0x169
        Jan 6 16:45:07 ffm-sbc-2a kernel: [3167225.687214] [<ffffffff81001200>]
        cpu_idle+0x51/0x95
        Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooting remote node in the
        absence of PLM with custom handling through 62.115.30.49
        Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooting peer node...
        Jan 6 16:45:08 ffm-sbc-2a opensaf_reboot: Rebooted peer node!

        can someone help me on it. Please tell me why this issue occured.

        Thanks
        ~Sumit Gemini


        Sent from sourceforge.net because you indicated interest in
        https://sourceforge.net/p/tipc/bugs/122/

        To unsubscribe from further messages, please visit
        https://sourceforge.net/auth/subscriptions/

         
  • Sumit Gemini

    Sumit Gemini - 2017-02-19

    Hi Eric,

    This stall issue was observed first time. The product is session border controller(SBC), and we are configured TIPC along with opensf.
    But problem is stall issue occurred on standby machine, and active machine was rebooted by standby machine almost 6 second later using opensf functionality.

    Could you tell me why an ideal machine got to rcu_bh_state because problem is we don't have any cpu or memory related log too when this issue occured.

    And yes, we are not using VM machine here, we are actually using real hardwares( HA pair)

    Thanks
    ~Sumit Gemini

     
  • Jon Paul Maloy

    Jon Paul Maloy - 2018-05-22
    • status: open --> closed
    • Group: -->
     
  • Jon Paul Maloy

    Jon Paul Maloy - 2018-05-22

    No more information about this. Fixed by code updates?

     

Log in to post a comment.