#1157 MDS: IMMD coredumps in MDS BCAST send (TCP with MCAST_ADDR)

Milestone: never
Status: duplicate
Owner: nobody
Labels: None
Type: defect
Component: mds
Part: -
Severity: major
Updated: 2015-11-02
Created: 2014-10-07
Private: No

Changeset: 4.6.M0 - 6009:b2ddaa23aae4
When starting ~50 Linux containers, IMMD coredumps, resulting in a cluster reset.
Communication is TCP.
The dtmd.conf configuration is:

DTM_SOCK_SND_RCV_BUF_SIZE=65536
DTM_CLUSTER_ID=1
DTM_NODE_IP=172.17.1.42
DTM_MCAST_ADDR=224.0.0.6

The IMM sync batch size (opensafImmSyncBatchSize) was reduced to 4096:

opensafImm=opensafImm,safApp=safImmService
Name                                               Type         Value(s)
========================================================================
opensafImmSyncBatchSize                            SA_UINT32_T  4096 (0x1000)
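
A note on DTM_SOCK_SND_RCV_BUF_SIZE above: it presumably ends up as the send and receive buffer sizes of the MDS/TCP socket. The following is a minimal sketch of how such a value is typically applied with the standard setsockopt() socket options; it is an illustration under that assumption, not the actual osafdtmd code.

#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

int main(void)
{
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) {
        perror("socket");
        return EXIT_FAILURE;
    }

    int buf_size = 65536;  /* DTM_SOCK_SND_RCV_BUF_SIZE from dtmd.conf above */

    /* Request 64 KiB kernel send/receive buffers; Linux may adjust
     * (and doubles) the requested value, see socket(7). */
    if (setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &buf_size, sizeof(buf_size)) < 0 ||
        setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &buf_size, sizeof(buf_size)) < 0) {
        perror("setsockopt");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}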

When node PL-51 joins the cluster, the following messages are seen in the syslog:

Oct  6 00:35:57 SC-1 osafdtmd[1028]: NO Established contact with 'PL-51'
Oct  6 00:35:57 SC-1 osafimmd[1063]: NO Extended intro from node 2330f
Oct  6 00:35:57 SC-1 osafimmd[1063]: NO Node 2330f request sync sync-pid:79 epoch:0 
Oct  6 00:35:58 SC-1 osafimmnd[1072]: NO Announce sync, epoch:292
Oct  6 00:35:58 SC-1 osafimmnd[1072]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Oct  6 00:35:58 SC-1 osafimmnd[1072]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Oct  6 00:35:58 SC-1 osafimmd[1063]: NO Successfully announced sync. New ruling epoch:292
Oct  6 00:35:58 SC-1 osafimmloadd: NO Sync starting
Oct  6 00:36:00 SC-1 osafimmd[1063]:  MDTM unsent message is more!=200
Oct  6 00:36:00 SC-1 osafimmnd[1072]: WA Director Service in NOACTIVE state - fevs replies pending:9 fevs highest processed:20037
Oct  6 00:36:00 SC-1 osafamfnd[1143]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Oct  6 00:36:00 SC-1 osafamfnd[1143]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Oct  6 00:36:00 SC-1 osafamfnd[1143]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Oct  6 00:36:00 SC-1 opensaf_reboot: Rebooting local node; timeout=60
Oct  6 00:36:00 SC-1 osafimmnd[1072]: NO No IMMD service => cluster restart, exiting

A coredump is generated:
core_1412555760.osafimmd.1063

1 Attachment

Discussion

  • Anders Bjornerstedt

    • summary: IMMD coredump --> IMMD coredumps in MDS BCAST send
    • status: unassigned --> needinfo
    • Component: imm --> mds
    • Part: d --> -
    • Milestone: 4.6.FC --> 4.5.0
     
  • Anders Bjornerstedt

    The IMMD crashes inside MDS BCAST send.

    Information is needed about exactly which branch & changeset this was
    executed with. There have been some fixes recently in MDS on the 4.5
    and default branches. A TIPC fix/patch may also be relevant.

     
    • A V Mahesh (AVM)

      The TIPC fix is not related to this TCP MDS BCAST send. The issue is seen when OpenSAF is running in a Docker container setup, while ~50 payloads are joining the cluster.

       
  • Anders Bjornerstedt

    Changeset is provided.

     
  • Anders Bjornerstedt

    • status: needinfo --> unassigned
     
  • A V Mahesh (AVM)

    • summary: IMMD coredumps in MDS BCAST send --> MDS: IMMD coredumps in MDS BCAST send (TCP with MCAST_ADDR)
     
  • A V Mahesh (AVM)

    This doesn't look like an OpenSAF/MDS/IMM issue.

    If an intra-node send() fails, MDS allows a temporary network problem to recover by queuing up to 200 messages. In this case the network had not recovered by the time 200 messages had accumulated in the unsent queue, so MDS did an intentional assert, assuming the network issue might not be recoverable (see the first sketch after the discussion).

     
  • Hans Feldt

    Hans Feldt - 2014-10-07

    Aha, I have seen this one before. This is a behavior difference between MDS/TCP and MDS/TIPC. With TIPC we get flow control by having a blocking send; in this case obviously not. Any kind of bursty send would trigger this, for example a LOG burst of async messages.

    Why can't send be blocking in the MDS/TCP case and this queue removed? (A sketch of such a blocking send follows the discussion.)

    I wonder if I did write some ticket on this...

     
  • A V Mahesh (AVM)

    • status: unassigned --> duplicate
     
  • Adrian Szwej

    Adrian Szwej - 2014-10-07

    With the patch provided in #607 I can pass the 45-container limitation and get up to 67 containers before the next problem appears, which I am investigating.
    Around 60 nodes, while nodes are still joining the cluster, other nodes seem to leave the cluster. Some immnd coredumps are seen on payloads, and one segfault of immnd on the controller. But that is the subject of a different ticket, where I have done more investigation.

     
  • Anders Widell

    Anders Widell - 2015-11-02
    • Milestone: 4.5.0 --> never
     

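Below is a minimal sketch of the MDS/TCP unsent-queue behaviour described in A V Mahesh's comment above, assuming a non-blocking socket and the 200-message cap seen in the syslog. The names (mdtm_tcp_send, unsent_count, MAX_UNSENT_MSGS) are hypothetical; this is an illustration of the described behaviour, not the actual MDS code.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

#define MAX_UNSENT_MSGS 200   /* the "more!=200" limit seen in the syslog */

static int unsent_count;      /* stand-in for the real unsent-message queue */

void mdtm_tcp_send(int sd, const void *buf, size_t len)
{
    /* Non-blocking send: a transient failure parks the message in an
     * unsent queue instead of blocking the caller. */
    if (send(sd, buf, len, MSG_DONTWAIT) < 0 &&
        (errno == EAGAIN || errno == EWOULDBLOCK)) {
        unsent_count++;       /* real code would enqueue the message here */

        /* If the peer does not drain the connection before 200 messages
         * pile up, the condition is treated as non-recoverable and the
         * process aborts, producing a core dump like the one attached. */
        if (unsent_count >= MAX_UNSENT_MSGS) {
            fprintf(stderr, "MDTM unsent message is more!=%d\n",
                    MAX_UNSENT_MSGS);
            abort();
        }
    }
}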
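
For comparison, here is a sketch of the blocking-send alternative Hans Feldt asks about, where a slow receiver exerts backpressure on the sender (much like the flow control MDS gets from TIPC's blocking send) instead of messages accumulating in an unsent queue. Again, send_all() is a hypothetical illustration, not a proposed patch to MDS.

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

int send_all(int sd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        /* No MSG_DONTWAIT: send() blocks until the kernel send buffer has
         * room, so a slow peer throttles the sender instead of triggering
         * the unsent-queue limit. */
        ssize_t n = send(sd, p, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted by a signal, retry */
            return -1;         /* real error: the connection is gone */
        }
        p += n;
        len -= (size_t)n;      /* account for partial writes */
    }

    return 0;
}

The obvious tradeoff is that a blocking send can stall the MDS sender thread if a single peer stops reading, which is presumably why the non-blocking send with an unsent queue exists in the MDS/TCP path.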