
#1072 Sync stop after few payload nodes joining the cluster (TCP)

Milestone: future
Status: unassigned
Owner: nobody
Type: enhancement
Component: mds
Part: -
Version: 4.3
Severity: major
Last Updated: 2017-08-28
Created: 2014-09-12

Communication is MDS over TCP. The cluster is 2+3; the scenario is:
Start the SCs; start the first payload; wait for sync; start the second payload; wait for sync; start the third payload. The third one fails, or sometimes it might be the fourth.

There is a problem getting more than 2-3 payloads synchronized, due to a consistently triggered bug.

The following is triggered in the loading immnd, causing the joining node to time out and fail to start up.

Sep 6 6:58:02.096550 osafimmnd [502:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_SEARCHNEXT (17) from 2020f
Sep 6 6:58:02.096575 osafimmnd [502:immnd_evt.c:1443] >> immnd_evt_proc_search_next
Sep 6 6:58:02.096613 osafimmnd [502:immnd_evt.c:1454] T2 SEARCH NEXT, Look for id:1664
Sep 6 6:58:02.096641 osafimmnd [502:ImmModel.cc:1366] T2 ERR_TRY_AGAIN: Too many pending incoming fevs messages (> 16) rejecting sync iteration next request
Sep 6 6:58:02.096725 osafimmnd [502:immnd_evt.c:1676] << immnd_evt_proc_search_next
Sep 6 6:58:03.133230 osafimmnd [502:immnd_proc.c:1980] IN Sync Phase-3: step:540

I have managed to overcome this bug temporarily with the following patch:

+++ b/osaf/libs/common/immsv/include/immsv_api.h        Sat Sep 06 08:38:16 2014 +0000
@@ -70,7 +70,7 @@

 /*Max # of outstanding fevs messages towards director.*/
 /*Note max-max is 255. cb->fevs_replies_pending is an uint8_t*/
-#define IMMSV_DEFAULT_FEVS_MAX_PENDING 16
+#define IMMSV_DEFAULT_FEVS_MAX_PENDING 255

 #define IMMSV_MAX_OBJECTS 10000
 #define IMMSV_MAX_ATTRIBUTES 128

Related

Tickets: #1072

Discussion

  • Anders Bjornerstedt

    • Version: --> 4.3
    • Milestone: 4.6.FC --> 4.3.3
     
  • Anders Bjornerstedt

    • status: unassigned --> invalid
     
  • Anders Bjornerstedt

The symptoms indicate a performance problem with the setup of resources vs
load for this test.

The test & setup manages to get sync (plus presumably other traffic?) to overload fevs.

OpenSAF currently has no overload protection or load regulation, so
overloading the system will inevitably cause degradation of service.
Here this results in the failure of a sync.
This should lead to the joining payload retrying the sync.

If the 3rd/4th payload always fails to join, it means this resource and
load configuration cannot support more than 2 SCs and 2 payloads.

     
    • Adrian Szwej

      Adrian Szwej - 2014-09-15

I don't think it is a performance problem.
There is nothing indicating CPU load, memory, or IO bandwidth pressure.
A simple node join seems to trigger some "logical" bug.
There is no application, just pure OpenSAF.

I am now experimenting with different MDS configuration options and MDS buffer settings, together with MTU 9000, just to see if there is any difference in triggering this bug.

OpenSAF is running inside containers, meaning there is no virtualization overhead.

Could you give me a hint as to what could cause the outstanding messages to reach 16?
E.g. could message loss or a timing issue lead to this?
I have more nodes configured than are actually joining at the moment, around 10. But I am bringing them into the cluster one by one.

       
      • Anders Bjornerstedt

Well, a hint is that you managed to bypass the problem (temporarily) by increasing a queue size.

The error:
Sep 6 6:58:02.096641 osafimmnd [502:ImmModel.cc:1366] T2 ERR_TRY_AGAIN: Too many pending incoming fevs messages (> 16) rejecting sync iteration next request
is very rarely seen, but can happen when the fevs turnaround latency cannot keep up with the rate of generated traffic.

So the question for you is simply why this happens in your setup and with your traffic,
or whether there is anything else unusual about your setup or traffic.
If the only IMM traffic is sync traffic, then it is really strange.

Again, this is a rare problem (in fact no one has complained about this before, that I can recall) and it involves a mechanism that
has been there since the start of OpenSAF.

If the same problem had popped up in testing of 4.5, it would indicate some introduced problem.
But no one has reported any problem like this.

        /AndersBj



  • Adrian Szwej

    Adrian Szwej - 2014-09-15

I had 1 controller and 4 payloads up and running.
Normally "Messages pending" stays at 2 and sometimes goes up to 3-4.
I was bringing the 5th payload up and down around 10-15 times:
while ( true ); do /etc/init.d/opensafd stop && /etc/init.d/opensafd start; done

    tail -f /var/log/opensaf/osafimmnd | grep "Messages pending:"
    Sep 15 21:12:50.691919 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:2
    Sep 15 21:12:50.724038 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:2
    Sep 15 21:12:50.957123 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:2
    Sep 15 21:12:50.961528 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:3
    Sep 15 21:12:51.215563 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:2
    Sep 15 21:12:52.785945 osafimmnd [368:immnd_evt.c:2674] TR Messages pending:2
    Sep 15 21:12:52.799428 osafimmnd [368:immnd_evt.c:2674] TR Messages pending:2
    Sep 15 21:12:57.923195 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:2
    Sep 15 21:12:58.355613 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:3
    Sep 15 21:12:58.369637 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:5
    Sep 15 21:12:58.372522 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:6
    Sep 15 21:12:58.394801 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:8
    Sep 15 21:12:58.458708 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:10
    Sep 15 21:12:58.470905 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:12
    Sep 15 21:12:58.480655 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:14
    Sep 15 21:12:58.484411 osafimmnd [368:immnd_evt.c:0960] TR Messages pending:16

Once this happens, it does not help to terminate the 5th payload.
A minute or so later a cluster reset is triggered:
osafimmnd [738:immnd_mds.c:0573] TR Resetting fevs replies pending to zero.

     
  • Adrian Szwej

    Adrian Szwej - 2014-09-15

I have also tried the following variations:
Larger MDS buffers:

    export MDS_SOCK_SND_RCV_BUF_SIZE=126976 
    DTM_SOCK_SND_RCV_BUF_SIZE=126976
    

    Longer keep alive settings

    OpenSAF build 4.5

    MTU 9000
    veth4e51 Link encap:Ethernet HWaddr aa:a6:f0:5f:0f:82
    UP BROADCAST RUNNING MTU:9000 Metric:1
    --
    veth76a4 Link encap:Ethernet HWaddr 9a:ea:07:f4:be:55
    UP BROADCAST RUNNING MTU:9000 Metric:1
    --
    vethb5f5 Link encap:Ethernet HWaddr 22:98:e3:39:32:34
    UP BROADCAST RUNNING MTU:9000 Metric:1
    --
    vethb9e3 Link encap:Ethernet HWaddr d2:ec:18:c4:f9:2d
    UP BROADCAST RUNNING MTU:9000 Metric:1
    --
    vethd703 Link encap:Ethernet HWaddr 3e:a0:49:c0:f0:73
    UP BROADCAST RUNNING MTU:9000 Metric:1
    --
    vethf736 Link encap:Ethernet HWaddr 4e:c4:6e:74:fc:03
    UP BROADCAST RUNNING MTU:9000 Metric:1

Ping during sync between containers shows a latency of 0.250-0.500 ms.

The result is the same.
I can provoke the problem by cycling start/stop of the 6th OpenSAF instance in a Linux container:

    while ( true ); do /etc/init.d/opensafd stop && /etc/init.d/opensafd start; done
    
     
    • A V Mahesh (AVM)

Some time back I brought up 30 nodes with the TCP transport without any issue. At that time, in addition to increasing the MDS buffers (MDS_SOCK_SND_RCV_BUF_SIZE & DTM_SOCK_SND_RCV_BUF_SIZE), I also increased wmem_max & rmem_max; you could also give that a try:

      sysctl -w net.core.wmem_max=33554432
      sysctl -w net.core.rmem_max=33554432
      sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
      sysctl -w net.ipv4.tcp_wmem="4096 87380 33554432"
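These sysctl -w settings are lost on reboot; to persist the same limits, the values can be appended to /etc/sysctl.conf and applied with sysctl -p. This is standard Linux sysctl behavior, not OpenSAF-specific:

```shell
# Append to /etc/sysctl.conf so the limits survive a reboot,
# then apply with: sysctl -p
cat >> /etc/sysctl.conf <<'EOF'
net.core.wmem_max = 33554432
net.core.rmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 87380 33554432
EOF
```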

       
    • Anders Bjornerstedt

Instead of blindly changing other configuration parameters, please first try to find out what the PROBLEM is.
Go back to OpenSAF defaults on all settings, except IMMSV_FEVS_MAX_PENDING, which you had
increased to 255 (the maximum possible).

You said you had "managed to overcome the performance issue temporarily" by this increase to 255.
What does that mean?
Do you still get the problem after some time, or not, with only that change?

How much traffic are you generating?
Not counting sync traffic here; I mean YOUR application traffic.
Do you have zero traffic?
Obviously it is possible to generate too much traffic on ANY configuration, and you will end up with
symptoms like the ones you see.

If the problem appears "fixed" by the 255 (maximum) setting, try reducing IMMSV_FEVS_MAX_PENDING
down again by 50%, from 255 (the current maximum possible) to 128.
Test this for some time and see if you have a stable system.
If stable, repeat: i.e. reduce again by 50%, test again, etc., until you get to a level where the problem re-appears.
Then double the value back up to the lowest level where it appeared to be stable.

This would solve the problem if the cause is that your setup has more VARIANCE in latency,
more "bursty" traffic, or more chunky scheduling of execution for the containers/processors/processes/threads.
If that is the case then the problem is not traffic overload, but that you indeed need some buffers to be larger
to avoid the extremes of the variance cutting you off.

      /AndersBj


      • Adrian Szwej

        Adrian Szwej - 2014-09-18

It is the IMMD that is crashing, causing the messages to become pending.
I am attaching the core dump and the immnd and immd trace files from SC-1, where 7 nodes join one by one. When PL-8 joins, the IMMD core dumps.

The code used was changeset 5828:df7bef2079b1 + the change of IMMSV_DEFAULT_FEVS_MAX_PENDING to 255.

         
        • Adrian Szwej

          Adrian Szwej - 2014-09-18
          #0  0x00007fe7eba49bb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
          #1  0x00007fe7eba4cfc8 in __GI_abort () at abort.c:89
          #2  0x00007fe7eba42a76 in __assert_fail_base (fmt=0x7fe7ebb94370 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7fe7ec463b27 "0", 
              file=file@entry=0x7fe7ec4691df "mds_dt_trans.c", line=line@entry=94, 
              function=function@entry=0x7fe7ec4692a0 <__PRETTY_FUNCTION__.10222> "mds_mdtm_queue_add_unsent_msg") at assert.c:92
          #3  0x00007fe7eba42b22 in __GI___assert_fail (assertion=assertion@entry=0x7fe7ec463b27 "0", file=file@entry=0x7fe7ec4691df "mds_dt_trans.c", 
              line=line@entry=94, function=function@entry=0x7fe7ec4692a0 <__PRETTY_FUNCTION__.10222> "mds_mdtm_queue_add_unsent_msg") at assert.c:101
          #4  0x00007fe7ec449e3d in mds_mdtm_queue_add_unsent_msg (tcp_buffer=tcp_buffer@entry=0x7fff623f3df0 "", bufflen=bufflen@entry=108) at mds_dt_trans.c:94
          #5  0x00007fe7ec44a5b8 in mds_mdtm_unsent_queue_add_send (tcp_buffer=tcp_buffer@entry=0x7fff623f3df0 "", bufflen=bufflen@entry=108) at mds_dt_trans.c:153
          #6  0x00007fe7ec44b05f in mds_mdtm_send_tcp (req=0x7fff623f3fe0) at mds_dt_trans.c:593
          #7  0x00007fe7ec4541e8 in mcm_msg_encode_full_or_flat_and_send (pri=<optimized out>, xch_id=<optimized out>, snd_type=<optimized out>, 
              dest_vdest_id=<optimized out>, adest=<optimized out>, svc_cb=<optimized out>, to_svc_id=<optimized out>, to_msg=<optimized out>, to=<optimized out>)
              at mds_c_sndrcv.c:1516
          #8  mds_mcm_send_msg_enc (to=<optimized out>, svc_cb=svc_cb@entry=0x18806a0, to_msg=to_msg@entry=0x7fff623f41d0, to_svc_id=to_svc_id@entry=25, 
              dest_vdest_id=<optimized out>, req=req@entry=0x7fff623f4270, xch_id=xch_id@entry=0, dest=568511936069707, pri=pri@entry=MDS_SEND_PRIORITY_MEDIUM)
              at mds_c_sndrcv.c:1086
          #9  0x00007fe7ec4576db in mcm_pvt_process_svc_bcast_common (env_hdl=<optimized out>, fr_svc_id=fr_svc_id@entry=24, to_msg=..., to_svc_id=to_svc_id@entry=25, 
              req=req@entry=0x7fff623f4270, scope=NCSMDS_SCOPE_NONE, pri=pri@entry=MDS_SEND_PRIORITY_MEDIUM, flag=flag@entry=0 '\000') at mds_c_sndrcv.c:3882
          #10 0x00007fe7ec458195 in mcm_pvt_normal_svc_bcast (pri=MDS_SEND_PRIORITY_MEDIUM, scope=<optimized out>, req=0x7fff623f4270, to_svc_id=25, msg=<optimized out>, 
              fr_svc_id=24, env_hdl=<optimized out>) at mds_c_sndrcv.c:3734
          #11 mds_mcm_send (info=0x7fff623f4320) at mds_c_sndrcv.c:790
          #12 mds_send (info=info@entry=0x7fff623f4320) at mds_c_sndrcv.c:386
          #13 0x00007fe7ec4521c8 in ncsmds_api (svc_to_mds_info=svc_to_mds_info@entry=0x7fff623f4320) at mds_papi.c:104
          #14 0x000000000040d482 in immd_mds_bcast_send (cb=cb@entry=0x629360 <_immd_cb>, evt=evt@entry=0x7fff623f4420, to_svc=to_svc@entry=NCSMDS_SVC_ID_IMMND)
              at immd_mds.c:765
          #15 0x00000000004054bd in immd_evt_proc_fevs_req (cb=cb@entry=0x629360 <_immd_cb>, evt=evt@entry=0x7fff623f4630, sinfo=sinfo@entry=0x7fe7e4001ad0, 
              deallocate=deallocate@entry=false) at immd_evt.c:314
          #16 0x0000000000406e56 in immd_evt_proc_sync_fevs_base (cb=cb@entry=0x629360 <_immd_cb>, sinfo=sinfo@entry=0x7fe7e4001ad0, evt=0x7fe7e4001990, 
              evt=0x7fe7e4001990) at immd_evt.c:1930
          #17 0x0000000000407f57 in immd_process_evt () at immd_evt.c:164
          #18 0x0000000000402781 in main (argc=<optimized out>, argv=<optimized out>) at immd_main.c:291
          
           

          Last edit: Adrian Szwej 2014-09-18
        • Anders Bjornerstedt

Hi Adrian,

I have re-opened the ticket and changed the component to MDS.
The MDS responsible may be able to diagnose the cause just from the
core dump.

I have not checked the MDS backlog for any older ticket
documenting similar symptoms.

https://sourceforge.net/p/opensaf/tickets/search/?q=status%3A%28unassigned+accepted+assigned+review%29+AND+_component%3A%28mds+dtm%29

I will leave that to the MDS responsible.

/AndersBj


  • Anders Bjornerstedt

    • status: invalid --> unassigned
    • Component: imm --> mds
    • Part: nd --> -
     
  • Anders Bjornerstedt

The communication blockage, it turns out, is due to the IMMD crashing.
The IMMD crashes on an assert in the MDS library in mds_send (bcast variant).

     
  • Anders Bjornerstedt

    • assigned_to: Anders Bjornerstedt --> nobody
     
  • Adrian Szwej

    Adrian Szwej - 2014-09-18

Without the patch there is no core dump, but a timeout after three minutes.
Then the IMMD exits. I am providing traces.

     
  • Adrian Szwej

    Adrian Szwej - 2014-10-01

I have now applied the patch for #1032 on top of 4.6 changeset 5969:ead18326c13b:
[devel] [PATCH 1 of 1] mds: use correct buff-length to distinguish mcast or multi-unicast [#1036]

This patch does not resolve the problem.

The SC-1 immnd gets the TRY_AGAIN message about too many outstanding messages.
PL-3 to PL-6 join without problems.
PL-7, which is the node causing this condition, has the following entries in the trace log:

    Oct  1 18:25:45.749109 osafimmnd [472:immnd_mds.c:0127] >> immnd_mds_register 
    Oct  1 18:25:45.749505 osafimmnd [472:immnd_mds.c:0192] T2 cb->node_id:2070f
    Oct  1 18:25:45.749525 osafimmnd [472:immnd_mds.c:0194] << immnd_mds_register 
    Oct  1 18:25:45.749557 osafimmnd [472:immnd_main.c:0238] << immnd_initialize 
    Oct  1 18:25:45.850504 osafimmnd [472:ImmModel.cc:3381] << protocol43Allowed 
    Oct  1 18:25:45.850601 osafimmnd [472:immnd_proc.c:1626] T5 tmout:100 ste:1 ME:0 RE:0 crd:0 rim:FROM_FILE 4.3A:0 2Pbe:0 VetA/B: 0/0 othsc:0/0
    Oct  1 18:25:45.850631 osafimmnd [472:immnd_proc.c:0393] TR First immnd_introduceMe, sending pbeEnabled:3 WITH params
    Oct  1 18:25:45.850653 osafimmnd [472:immnd_proc.c:0413] TR Possibly extended intro from this IMMND pbeEnabled: 3  dirsize:22
    Oct  1 18:25:45.951519 osafimmnd [472:immnd_proc.c:0393] TR First immnd_introduceMe, sending pbeEnabled:3 WITH params
    Oct  1 18:25:45.951618 osafimmnd [472:immnd_proc.c:0413] TR Possibly extended intro from this IMMND pbeEnabled: 3  dirsize:22
    

It keeps looping for a long time on the last two messages.

     

    Related

    Tickets: #1036

    • A V Mahesh (AVM)

      On 10/2/2014 12:09 AM, Adrian Szwej wrote:
      I have now applied patch for #1032 ontop of 4.6 changeset 5969:ead18326c13b.

      You mean [#1036] ?

      [devel] [PATCH 1 of 1] mds: use correct buff-length to distinguish
      mcast or multi-unicast [#1036]
      This patch does not resolve the problem.

This patch is not related to TCP; it is exclusively for TIPC.
Please provide the following, for me to reproduce the problem:

1. Reproducible steps
2. The dtmd.conf file
3. imm.xml configuration details (it seems you prepared a 70-node configuration)
4. Your system buffer info; check the link below to get the data from your nodes:

  http://www.cyberciti.biz/faq/linux-tcp-tuning/

       


      • Adrian Szwej

        Adrian Szwej - 2014-10-03

Hi Mahesh,
Yes, I meant #1036. I was instructed to test this patch to see if it helps.
BR

DTMD config:
DTM_NODE_IP=172.17.0.109
DTM_MCAST_ADDR=224.0.0.6

imm.xml:
Default generated, 7-70 nodes. It does not matter; it is reproducible with around 6-8 nodes. immnd tracing seems to trigger the fault earlier.

I am attaching the sysctl -a settings from a container, since the values on that web page do not exist inside a container.

         
        • A V Mahesh (AVM)

          On 10/3/2014 12:11 PM, Adrian Szwej wrote:

          Yes; I meant #1036. I got instruction to test this patch to see if it help.

This bug fix is exclusively for TIPC, so it has no effect on TCP in any
manner.

DTMD config:
DTM_NODE_IP=172.17.0.109
DTM_MCAST_ADDR=224.0.0.6

It is news to me that you are using TCP multicast; I was testing with TCP broadcast (that means DTM_MCAST_ADDR= is empty).
OK, I will test with TCP multicast and try to reproduce with your configuration.

Can you please attach one dtmd.conf file? I would like to see the other changes as well.

           
          • Adrian Szwej

            Adrian Szwej - 2014-10-03

It does not work for me with an empty DTM_MCAST_ADDR.
The payload node just loops with:

Oct  3 19:08:42.162880 osafimmnd [3275:immnd_proc.c:0393] TR First immnd_introduceMe, sending pbeEnabled:3 WITH params
Oct  3 19:08:42.163181 osafimmnd [3275:immnd_proc.c:0413] TR Possibly extended intro from this IMMND pbeEnabled: 3  dirsize:22

So I set it as above. Those two are the only settings I have for DTMD;
the other settings are default.

There is a missing README that dtmd.conf points users to:
# See the file osaf/services/infrastructure/dtm/README for more configuration options.

So it is not that easy to figure out how to configure OpenSAF.

             
            • A V Mahesh (AVM)

It seems a very fundamental TCP cluster bring-up with broadcast is not working for you, so let us start from a basic configuration.

1) Please make sure all of your nodes are in the same subnet,

say like:
SC-1 : 192.168.56.101 slot -1
SC-2 : 192.168.56.102 slot -2
PL-3 : 192.168.56.103 slot -3
PL-4 : 192.168.56.104 slot -4
PL-5 : 192.168.56.105 slot -5
PL-6 : 192.168.56.106 slot -6
PL-7 : 192.168.56.107 slot -7
......

And in /etc/opensaf/nid.conf make sure "export MDS_TRANSPORT=TCP".

If you are running multiple setups, please change "DTM_CLUSTER_ID=7" to a different number for each.

Please also check that firewalls are disabled for broadcasting in your network.

Apart from this, don't change anything in the code or configuration of OpenSAF, and let me know the status.

Or simply share the following from all nodes:

1) ifconfig output of all nodes
2) dtmd.conf of all nodes
3) nid.conf of all nodes
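Putting the checklist above into the two files it mentions, a minimal per-node configuration would look roughly like this (a sketch using the SC-1 example values from the checklist; everything else stays at defaults):

```shell
# /etc/opensaf/dtmd.conf  (SC-1 in the example addressing scheme)
DTM_NODE_IP=192.168.56.101
DTM_CLUSTER_ID=7    # pick a different id per cluster on a shared network

# /etc/opensaf/nid.conf
export MDS_TRANSPORT=TCP
```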

              -AVM

               

              Last edit: A V Mahesh (AVM) 2014-10-06
              • Adrian Szwej

                Adrian Szwej - 2014-10-06

I am running OpenSAF in Docker containers:

* one cluster
* I don't have any iptables rules
* I can reach the internet from my containers
* I can multicast to other nodes in my network
                

                All containers are connected to docker0 bridge: inet addr:172.17.42.1

                bridge name bridge id       STP enabled interfaces
                docker0     8000.56847afe9799   no      veth7350
                                            veth7b9b
                                            veth8e99
                                            veth96d3
                                            vetha7aa
                                            vethef63
                
                docker0   Link encap:Ethernet  HWaddr 56:84:7a:fe:97:99  
                          inet addr:172.17.42.1  Bcast:0.0.0.0  Mask:255.255.0.0
                          inet6 addr: fe80::5484:7aff:fefe:9799/64 Scope:Link
                          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                          RX packets:25945 errors:0 dropped:0 overruns:0 frame:0
                          TX packets:13869 errors:0 dropped:0 overruns:0 carrier:0
                          collisions:0 txqueuelen:0 
                          RX bytes:5501361 (5.2 MiB)  TX bytes:65819518 (62.7 MiB)
                

Containers have addresses
172.17.0.1 - 172.17.0.150.
But notice the broadcast address of 0.0.0.0, created by default by Docker. Could that be an issue?

                eth0      Link encap:Ethernet  HWaddr 66:39:65:d4:7f:0a  
                          inet addr:172.17.0.105  Bcast:0.0.0.0  Mask:255.255.0.0
                          inet6 addr: fe80::6439:65ff:fed4:7f0a/64 Scope:Link
                          UP BROADCAST RUNNING  MTU:1500  Metric:1
                          RX packets:9280 errors:0 dropped:0 overruns:0 frame:0
                          TX packets:10565 errors:0 dropped:0 overruns:0 carrier:0
                          collisions:0 txqueuelen:1000 
                          RX bytes:721938 (721.9 KB)  TX bytes:9934296 (9.9 MB)
                

You can instantiate my containers by pulling from my repo:

                https://registry.hub.docker.com/u/adrianszwej/opensaf/

                NODE=SC-1 && docker run --privileged -t --name $NODE -h $NODE -v /home/adrian/sharedfs:/etc/opensaf/sharedfs -i adrianszwej/opensaf:4.6-debian-7.6 /bin/bash
                

I am building OpenSAF with the following:

                ./configure --enable-imm-pbe --disable-ais-plm --disable-ais-msg --disable-ais-lck --disable-ais-evt --disable-rpm-target
                
                 
          • Hans Feldt

            Hans Feldt - 2014-10-07

What is TCP broadcast? I have never heard of that...

My guess is that DTM_MCAST_ADDR allows you to specify the UDP multicast address to be used for discovery. In Adrian's case there is no broadcast address on the eth0 interface in each container, so he has to specify a multicast address instead of using the one from the interface.

             

