Changeset: 4.6.M0 - 6009:b2ddaa23aae4
When starting ~50 Linux containers, IMMD coredumps, resulting in a cluster reset.
Communication is TCP.
The dtmd.conf configuration is:
DTM_SOCK_SND_RCV_BUF_SIZE=65536
DTM_CLUSTER_ID=1
DTM_NODE_IP=172.17.1.42
DTM_MCAST_ADDR=224.0.0.6
The IMM sync batch size has been reduced to 4096:
opensafImm=opensafImm,safApp=safImmService
Name                     Type         Value(s)
========================================================================
opensafImmSyncBatchSize  SA_UINT32_T  4096 (0x1000)
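For context, DTM_SOCK_SND_RCV_BUF_SIZE=65536 presumably ends up as the per-socket send/receive buffer size handed to the kernel, along these lines (a minimal sketch, not the actual osafdtmd code; the helper name is made up):

    /* Sketch only, not the real osafdtmd code: apply the configured
     * DTM_SOCK_SND_RCV_BUF_SIZE to one TCP socket. The kernel may round
     * or cap the requested value. */
    #include <sys/socket.h>

    static int apply_dtm_buf_size(int sock_fd, int buf_size /* e.g. 65536 */)
    {
        if (setsockopt(sock_fd, SOL_SOCKET, SO_SNDBUF,
                       &buf_size, sizeof(buf_size)) != 0)
            return -1;
        return setsockopt(sock_fd, SOL_SOCKET, SO_RCVBUF,
                          &buf_size, sizeof(buf_size));
    }

A 64 KiB send buffer fills quickly during a sync burst towards ~50 nodes, which is where the non-blocking send behaviour discussed further down comes in.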
When node PL-51 joins the cluster, the following messages are seen in the syslog:
Oct 6 00:35:57 SC-1 osafdtmd[1028]: NO Established contact with 'PL-51'
Oct 6 00:35:57 SC-1 osafimmd[1063]: NO Extended intro from node 2330f
Oct 6 00:35:57 SC-1 osafimmd[1063]: NO Node 2330f request sync sync-pid:79 epoch:0
Oct 6 00:35:58 SC-1 osafimmnd[1072]: NO Announce sync, epoch:292
Oct 6 00:35:58 SC-1 osafimmnd[1072]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Oct 6 00:35:58 SC-1 osafimmnd[1072]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Oct 6 00:35:58 SC-1 osafimmd[1063]: NO Successfully announced sync. New ruling epoch:292
Oct 6 00:35:58 SC-1 osafimmloadd: NO Sync starting
Oct 6 00:36:00 SC-1 osafimmd[1063]: MDTM unsent message is more!=200
Oct 6 00:36:00 SC-1 osafimmnd[1072]: WA Director Service in NOACTIVE state - fevs replies pending:9 fevs highest processed:20037
Oct 6 00:36:00 SC-1 osafamfnd[1143]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Oct 6 00:36:00 SC-1 osafamfnd[1143]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Oct 6 00:36:00 SC-1 osafamfnd[1143]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Oct 6 00:36:00 SC-1 opensaf_reboot: Rebooting local node; timeout=60
Oct 6 00:36:00 SC-1 osafimmnd[1072]: NO No IMMD service => cluster restart, exiting
A coredump is generated:
core_1412555760.osafimmd.1063
The IMMD crashes inside MDS BCAST send.
Information is needed about exactly which branch & changeset this was
executed with. There have been some recent fixes in MDS on the 4.5 and
default branches. A TIPC fix/patch may also be relevant.
The TIPC fix is not related to this TCP MDS BCAST send. The issue is seen when OpenSAF is running in a Docker container setup, while ~50 payloads are joining the cluster.
The changeset is provided above.
This doesn't look like an OpenSAF/MDS/IMM issue.
In case an intra-node send() fails, MDS allows recovery from a temporary network problem by queuing up to 200 messages. In this case the network had not recovered by the time 200 messages had accumulated in the unsent queue, so MDS did an intentional assert, assuming the network issue may not be recoverable.
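To make that concrete, here is a minimal sketch of the queue-and-assert behaviour described above (not the actual MDS code; the 200 limit matches the "MDTM unsent message is more!=200" log, all names are made up):

    /* Sketch of the described MDS/TCP behaviour, not the real implementation:
     * a failed non-blocking send parks the message in an unsent queue, and once
     * MDTM_MAX_UNSENT (assumed 200, matching the log above) messages pile up,
     * the process aborts on purpose. */
    #include <errno.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <sys/socket.h>

    #define MDTM_MAX_UNSENT 200

    struct unsent_queue {
        size_t count;                /* messages currently parked */
        /* ... message storage elided ... */
    };

    static int tcp_try_send(int fd, const void *buf, size_t len,
                            struct unsent_queue *q)
    {
        ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);
        if (n == (ssize_t)len)
            return 0;                /* fully handed to the kernel */

        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;               /* hard error, caller decides what to do */

        /* Kernel send buffer full (or partial send): park for a later retry. */
        q->count++;                  /* enqueue of the payload itself elided */

        if (q->count >= MDTM_MAX_UNSENT)
            abort();                 /* the "temporary" problem never cleared:
                                      * intentional crash, as in the coredump above */
        return 1;                    /* queued, will be retried later */
    }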
Aha, I have seen this one before. This is a behavior difference between MDS/TCP and MDS/TIPC. With TIPC we get flow control by having a blocking send; in the TCP case we obviously do not. Any kind of bursty send would trigger this, for example a LOG burst of async messages.
Why can't send be blocking in the MDS/TCP case and this queue removed?
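To illustrate the suggestion (a sketch only, not a proposed patch; the helpers are made up and it assumes the MDS/TCP socket is currently non-blocking): leaving the socket blocking would make the kernel send buffer the flow-control point, much like TIPC's blocking send.

    /* Sketch: clearing O_NONBLOCK makes send() itself apply back-pressure,
     * so no user-space unsent queue and no 200-message limit would be needed. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/socket.h>

    static int make_send_blocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);
    }

    /* With a blocking socket a bursty sender is simply paced by TCP: send()
     * returns only once the kernel has accepted the data. */
    static ssize_t blocking_send_all(int fd, const char *buf, size_t len)
    {
        size_t off = 0;
        while (off < len) {
            ssize_t n = send(fd, buf + off, len - off, MSG_NOSIGNAL);
            if (n < 0)
                return -1;           /* real error (peer gone, etc.) */
            off += (size_t)n;
        }
        return (ssize_t)off;
    }

The obvious trade-off is that the whole MDS send path could then stall behind one slow or dead receiver, which is presumably why the non-blocking send plus unsent queue was chosen in the first place.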
I wonder if I wrote a ticket on this...
https://sourceforge.net/p/opensaf/tickets/607/
Duplicate of https://sourceforge.net/p/opensaf/tickets/607/
With the patch provided in #607 I can pass the 45-container limitation and get up to 67 containers before the next problem appears, which I am investigating.
At around 60 nodes, while nodes are still joining the cluster, other nodes seem to leave the cluster. Some immnd coredumps are seen on payloads, and there is one segfault of immnd on the controller. But this is a subject for a different ticket where I have done more investigation.