Menu

JGroups agents disconnecting and never reconnecting again

Help
2013-05-21
2013-05-22
  • Sebastiano Vigna

    We are trying to use JGroups to coordinate a number of crawling agents. We are meeting a strange problem: after a few hours of exchange of messages (about 60 million messages are sent and received correctly), our agents do not recognize themselves anymore and their views remain disconnected.

    This is our configuration string:

    UDP(ip_mcast=true;mcast_addr=228.8.8.8;bind_addr=192.168.0.22;ip_ttl=32):PING(timeout=7000):MERGE2:FD_ALL(timeout=60000;interval=10000):pbcast.NAKACK:UNICAST:pbcast.GMS

    To exclude communication problems, we ran two test agents s2a and s2b on the same machine. After a few hours, the agents seems stuck in a cyclic try for merging their views, which however never happens (the logs below continue, endlessly, with the same NACKACK -> NACKACK cycle).

    We must add that the machine is under a significant load. In a realistic configuration, we would have as many threads as cores doing CPU-bound work, and several hundreds of low-priority threads doing I/O. In our tests, we had just a few dozen I/O threads through.

    We are including the logs from the two agents, hoping that they ring a bell.

    We have also performed test with the same configuration without a significant load and things work fine, so we suspect some misconfiguration from our part. Note that three threads threads of our agents are at maximum priority.

    We are interested in suggestions for further testing to isolate the problem.

    Thank you for any help!

    =================s2a log================
    12:55:14,056 1480 DEBUG [main] o.j.p.p.GMS - changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
    12:55:25,195 12619 DEBUG [main] o.j.p.UDP - creating sockets
    12:55:25,195 12619 DEBUG [main] o.j.p.UDP - sockets will use interface 192.168.0.22
    12:55:25,201 12625 DEBUG [main] o.j.p.UDP - socket information:
    12:55:25,215 12639 DEBUG [main] o.j.p.UDP - created unicast receiver thread
    12:55:25,216 12640 DEBUG [main] o.j.p.UDP - created multicast receiver thread
    12:55:32,220 19644 DEBUG [main] o.j.p.p.GMS - initial_mbrs are []
    12:55:32,220 19644 DEBUG [main] o.j.p.p.GMS - no initial members discovered: creating group as first member
    12:55:32,231 19655 DEBUG [main] o.j.p.p.NAKACK -
    12:55:32,232 19656 DEBUG [main] o.j.p.p.GMS - s2a: view is [s2a|0] [s2a]
    12:55:32,237 19661 DEBUG [main] o.j.p.p.GMS - s2a:
    12:55:32,238 19662 DEBUG [main] o.j.p.p.GMS - created group (first member). My view is [s2a|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
    12:55:40,132 27556 DEBUG [OOB-2,local,s2a] o.j.p.p.GMS - s2a: got merge request from s2b, merge_id=s2b::1, mbrs=[s2a]
    12:55:40,133 27557 WARN [OOB-1,local,s2a] o.j.p.p.NAKACK - s2a: dropped message from s2b (not in xmit_table), keys are [s2a], view=[s2a|0] [s2a]
    12:55:40,134 27558 DEBUG [OOB-2,local,s2a] o.j.p.p.GMS - s2a: fetched all digests for [s2a]
    12:55:40,135 27559 DEBUG [OOB-2,local,s2a] o.j.p.p.GMS - s2a: sending merge response=GmsHeader[MERGE_RSP]: view=[s2a|0] [s2a], digest=s2a: [0 : 0 (0)], merge_id=s2b::1
    12:55:40,149 27573 DEBUG [Incoming-2,local,s2a] o.j.p.p.NAKACK -
    12:55:40,149 27573 DEBUG [Incoming-2,local,s2a] o.j.p.p.GMS - s2a: view is MergeView::[s2b|1] [s2b, s2a], subgroups=[[s2a|0] [s2a], [s2b|0] [s2b]]
    12:55:40,154 27578 DEBUG [Incoming-2,local,s2a] o.j.p.p.GMS - s2a:
    17:55:24,201 18011625 DEBUG [ViewHandler,local,s2a] o.j.p.p.GMS - suspected members=[s2a], suspected_mbrs=[s2a]
    17:55:24,221 18011645 DEBUG [Incoming-2,local,s2a] o.j.p.p.GMS - s2a: view is [s2b|2] [s2b]
    17:55:24,221 18011645 WARN [Incoming-2,local,s2a] o.j.p.p.GMS - s2a: not member of view [s2b|2] [s2b]; discarding it
    17:55:34,002 18021426 DEBUG [OOB-2,local,s2a] o.j.p.p.GMS - s2a: got merge request from s2b, merge_id=s2b::2, mbrs=[s2a]
    17:55:34,003 18021427 DEBUG [OOB-2,local,s2a] o.j.p.p.GMS - s2a: fetched all digests for [s2a]
    17:55:34,003 18021427 DEBUG [OOB-2,local,s2a] o.j.p.p.GMS - s2a: sending merge response=GmsHeader[MERGE_RSP]: view=[s2b|1] [s2a], digest=s2b: [0 : 3 (3)], s2a: [0 : 2 (2)], merge_id=s2b::2
    17:55:34,011 18021435 DEBUG [Incoming-1,local,s2a] o.j.p.p.NAKACK -
    17:55:34,011 18021435 DEBUG [Incoming-1,local,s2a] o.j.p.p.GMS - s2a: view is MergeView::[s2b|3] [s2b, s2a], subgroups=[[s2b|1] [s2a], [s2b|2] [s2b]]
    17:55:38,875 18026299 WARN [Incoming-5,local,s2a] o.j.p.p.GMS - s2a: failed to collect all ACKs (expected=2) for view MergeView::[s2b|3] [s2b, s2a], subgroups=[[s2b|1] [s2a], [s2b|2] [s2b]] after 2000ms, missing ACKs from [s2b]
    18:06:32,605 18680029 DEBUG [Incoming-4,local,s2a] o.j.p.p.GMS - s2a: view is [s2b|4] [s2b]
    18:06:32,605 18680029 WARN [Incoming-4,local,s2a] o.j.p.p.GMS - s2a: not member of view [s2b|4] [s2b]; discarding it
    18:06:32,655 18680079 DEBUG [ViewHandler,local,s2a] o.j.p.p.GMS - suspected members=[s2a], suspected_mbrs=[s2a]
    18:06:33,548 18680972 DEBUG [OOB-1,local,s2a] o.j.p.p.GMS - s2a: got merge request from s2b, merge_id=s2b::3, mbrs=[s2a]
    18:06:33,549 18680973 DEBUG [OOB-1,local,s2a] o.j.p.p.GMS - s2a: fetched all digests for [s2a]
    18:06:33,549 18680973 DEBUG [OOB-1,local,s2a] o.j.p.p.GMS - s2a: sending merge response=GmsHeader[MERGE_RSP]: view=[s2b|3] [s2a], digest=s2b: [0 : 7 (7)], s2a: [0 : 4 (4)], merge_id=s2b::3
    18:06:33,554 18680978 DEBUG [Incoming-14,local,s2a] o.j.p.p.NAKACK -
    18:06:33,556 18680980 DEBUG [Incoming-14,local,s2a] o.j.p.p.GMS - s2a: view is MergeView::[s2b|5] [s2b, s2a], subgroups=[[s2b|3] [s2a], [s2b|4] [s2b]]
    18:15:08,777 19196201 DEBUG [Incoming-12,local,s2a] o.j.p.p.GMS - s2a: view is [s2b|6] [s2b]
    18:15:08,778 19196202 WARN [Incoming-12,local,s2a] o.j.p.p.GMS - s2a: not member of view [s2b|6] [s2b]; discarding it
    18:15:08,828 19196252 DEBUG [ViewHandler,local,s2a] o.j.p.p.GMS - suspected members=[s2a], suspected_mbrs=[s2a]
    18:15:08,834 19196258 DEBUG [OOB-1,local,s2a] o.j.p.p.GMS - s2a: got merge request from s2b, merge_id=s2b::4, mbrs=[s2a]
    18:15:08,835 19196259 DEBUG [OOB-1,local,s2a] o.j.p.p.GMS - s2a: fetched all digests for [s2a]
    18:15:08,835 19196259 DEBUG [OOB-1,local,s2a] o.j.p.p.GMS - s2a: sending merge response=GmsHeader[MERGE_RSP]: view=[s2b|5] [s2a], digest=s2b: [0 : 9 (9)], s2a: [0 : 6 (6)], merge_id=s2b::4
    18:15:21,637 19209061 WARN [OOB-1,local,s2a] o.j.p.p.GMS - s2a: merge is already in progress
    18:15:21,637 19209061 DEBUG [OOB-1,local,s2a] o.j.p.p.GMS - merge response=GmsHeader[MERGE_RSP]: view=null, digest=null, merge_id=s2b::5, merge_rejected=true
    18:15:34,843 19222267 DEBUG [OOB-2,local,s2a] o.j.p.p.GMS - s2a: got merge request from s2b, merge_id=s2b::6, mbrs=[s2a]
    18:15:34,843 19222267 DEBUG [OOB-2,local,s2a] o.j.p.p.GMS - s2a: fetched all digests for [s2a]
    18:15:34,843 19222267 DEBUG [OOB-2,local,s2a] o.j.p.p.GMS - s2a: sending merge response=GmsHeader[MERGE_RSP]: view=[s2b|5] [s2a], digest=s2b: [0 : 12 (12)], s2a: [0 : 7 (7)], merge_id=s2b::6
    18:15:34,851 19222275 DEBUG [Incoming-20,local,s2a] o.j.p.p.NAKACK -
    ...

    ======================s2b log================================
    12:55:18,890 1479 DEBUG [main] o.j.p.p.GMS - changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
    12:55:30,452 13041 DEBUG [main] o.j.p.UDP - creating sockets
    12:55:30,452 13041 DEBUG [main] o.j.p.UDP - sockets will use interface 192.168.0.22
    12:55:30,455 13044 DEBUG [main] o.j.p.UDP - socket information:
    12:55:30,463 13052 DEBUG [main] o.j.p.UDP - created unicast receiver thread
    12:55:30,463 13052 DEBUG [main] o.j.p.UDP - created multicast receiver thread
    12:55:30,475 13064 DEBUG [main] o.j.p.p.GMS - initial_mbrs are [own_addr=s2a, view id=null, is_server=false, is_coord=false, logical_name=s2a, physical_addrs=192.168.0.22:48644]
    12:55:30,476 13065 DEBUG [main] o.j.p.p.GMS - election results: {}
    12:55:30,487 13076 DEBUG [main] o.j.p.p.NAKACK -
    12:55:30,487 13076 DEBUG [main] o.j.p.p.GMS - s2b: view is [s2b|0] [s2b]
    12:55:30,493 13082 DEBUG [main] o.j.p.p.GMS - s2b:
    12:55:30,494 13083 DEBUG [main] o.j.p.p.GMS - created group (first member). My view is [s2b|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
    12:55:40,061 22650 DEBUG [Timer-2,local,s2b] o.j.p.MERGE2 - s2b found different views : [s2a|0], [s2b|0]; sending up MERGE event with merge participants [s2a, s2b].
    12:55:40,115 22704 DEBUG [ViewHandler,local,s2b] o.j.p.p.GMS - determining merge leader from [s2a, s2b]
    12:55:40,116 22705 DEBUG [ViewHandler,local,s2b] o.j.p.p.GMS - I (s2b) will be the leader. Starting the merge task for [s2a, s2b]
    12:55:40,118 22707 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - s2b: sending MERGE_REQ to [s2b, s2a]
    12:55:40,130 22719 DEBUG [OOB-2,local,s2b] o.j.p.p.GMS - s2b: got merge request from s2b, merge_id=s2b::1, mbrs=[s2b]
    12:55:40,132 22721 DEBUG [OOB-2,local,s2b] o.j.p.p.GMS - s2b: fetched all digests for [s2b]
    12:55:40,133 22722 DEBUG [OOB-2,local,s2b] o.j.p.p.GMS - s2b: sending merge response=GmsHeader[MERGE_RSP]: view=[s2b|0] [s2b], digest=s2b: [0 : 0 (0)], merge_id=s2b::1
    12:55:40,135 22724 DEBUG [OOB-1,local,s2b] o.j.p.p.GMS - s2b: got merge response from s2b, merge_id=s2b::1, merge data is sender=s2b, view=[s2b|0] [s2b], digest=s2b: [0 : 0 (0)]
    12:55:40,135 22724 WARN [OOB-2,local,s2b] o.j.p.p.NAKACK - s2b: dropped message from s2a (not in xmit_table), keys are [s2b], view=[s2b|0] [s2b]
    12:55:40,140 22729 DEBUG [OOB-2,local,s2b] o.j.p.p.GMS - s2b: got merge response from s2a, merge_id=s2b::1, merge data is sender=s2a, view=[s2a|0] [s2a], digest=s2a: [0 : 0 (0)]
    12:55:40,141 22730 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - s2b: collected 2 merge response(s) in 23 ms
    12:55:40,142 22731 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - Merge leader s2b: consolidated view=MergeView::[s2b|1] [s2b, s2a], subgroups=[[s2a|0] [s2a], [s2b|0] [s2b]]
    12:55:40,143 22732 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - s2b: sending merge view [s2b|1] to coordinators [s2b, s2a]
    12:55:40,144 22733 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - s2b: merge leader completed merge task
    12:55:40,145 22734 DEBUG [Incoming-1,local,s2b] o.j.p.p.NAKACK -
    12:55:40,146 22735 DEBUG [Incoming-1,local,s2b] o.j.p.p.GMS - s2b: view is MergeView::[s2b|1] [s2b, s2a], subgroups=[[s2a|0] [s2a], [s2b|0] [s2b]]
    17:55:24,198 18006787 DEBUG [ViewHandler,local,s2b] o.j.p.p.GMS - new=[], suspected=[s2a], leaving=[], new view: [s2b|2] [s2b]
    17:55:24,200 18006789 DEBUG [Incoming-1,local,s2b] o.j.p.p.GMS - s2b: view is [s2b|2] [s2b]
    17:55:24,361 18006950 DEBUG [Incoming-1,local,s2b] o.j.p.p.NAKACK - removed s2a from xmit_table (not member anymore)
    17:55:33,950 18016539 DEBUG [Timer-2,local,s2b] o.j.p.MERGE2 - s2b found different views : [s2b|1], [s2b|2]; sending up MERGE event with merge participants [s2a, s2b].
    17:55:34,001 18016590 DEBUG [ViewHandler,local,s2b] o.j.p.p.GMS - determining merge leader from [s2a, s2b]
    17:55:34,001 18016590 DEBUG [ViewHandler,local,s2b] o.j.p.p.GMS - I (s2b) will be the leader. Starting the merge task for [s2a, s2b]
    17:55:34,002 18016591 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - s2b: sending MERGE_REQ to [s2b, s2a]
    17:55:34,003 18016592 DEBUG [OOB-2,local,s2b] o.j.p.p.GMS - s2b: got merge request from s2b, merge_id=s2b::2, mbrs=[s2b]
    17:55:34,003 18016592 WARN [OOB-1,local,s2b] o.j.p.p.NAKACK - s2b: dropped message from s2a (not in xmit_table), keys are [s2b], view=[s2b|2] [s2b]
    17:55:34,003 18016592 DEBUG [OOB-2,local,s2b] o.j.p.p.GMS - s2b: fetched all digests for [s2b]
    17:55:34,004 18016593 DEBUG [OOB-2,local,s2b] o.j.p.p.GMS - s2b: sending merge response=GmsHeader[MERGE_RSP]: view=[s2b|2] [s2b], digest=s2b: [0 : 3 (3)], merge_id=s2b::2
    17:55:34,004 18016593 DEBUG [OOB-2,local,s2b] o.j.p.p.GMS - s2b: got merge response from s2b, merge_id=s2b::2, merge data is sender=s2b, view=[s2b|2] [s2b], digest=s2b: [0 : 3 (3)]
    17:55:34,004 18016593 DEBUG [OOB-1,local,s2b] o.j.p.p.GMS - s2b: got merge response from s2a, merge_id=s2b::2, merge data is sender=s2a, view=[s2b|1] [s2a], digest=s2b: [0 : 3 (3)], s2a: [0 : 2 (2)]
    17:55:34,004 18016593 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - s2b: collected 2 merge response(s) in 3 ms
    17:55:34,006 18016595 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - Merge leader s2b: consolidated view=MergeView::[s2b|3] [s2b, s2a], subgroups=[[s2b|1] [s2a], [s2b|2] [s2b]]
    17:55:34,006 18016595 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - s2b: sending merge view [s2b|3] to coordinators [s2b, s2a]
    17:55:34,007 18016596 DEBUG [MergeTask,local,s2b] o.j.p.p.GMS - s2b: merge leader completed merge task
    17:55:34,010 18016599 DEBUG [Incoming-3,local,s2b] o.j.p.p.NAKACK -

     
    • Bela Ban

      Bela Ban - 2013-05-21

      This configuration can't possibly work as STABLE is missing, and there's no flow control either ! Where did you get it from ? I suggest use udp.xml from the JGroups distro.
      Is this really the config you used ? Running millions of messages should cause an OutOfMemoryException, as STABLE doesn't garbage collect messages ! Didn't you notice the memory increase ?

       
  • Sebastiano Vigna

    Ahem. I was sure that there was something big missing (this line of configuration dates years, so we cannot really trace back to the original writer).

    We run with -Xmx30G, so it takes a while to detected memory leaks ;). We'll have a look at udp.xml and see what's missing. Thanks!

     
  • Bela Ban

    Bela Ban - 2013-05-22

    I suggest measure memory increase before you change anything to confirm that this is the culprit.
    Also copy a config file from the version of JGroups you use and then make modifications. Using an old config file with a new release will most probably lead to incorrect results.
    Cheers,

     

Log in to post a comment.