Re: [jgroups-users] Odd RPC behavior
From: Bela B. <be...@ya...> - 2014-06-11 05:47:15
On 10/06/14 20:38, Jim Thomas wrote:
> Thanks Bela.
>
> I was able to confirm that the long delay was due to a lack of message
> traffic. As soon as I created a frequent 'heartbeat' message the
> problem went away. I suppose I was expecting the default clustering to
> be in contact much more often by default. I suspect I will need to
> shorten the average gossip time somewhat; 50 seconds is an eternity for
> my system. Ideally I'd like re-transmission of missed multicast
> messages to happen within seconds.
>
> I have a handful of multicast RPC calls for which it is very important
> that they propagate through the system in a timely manner. Also I'm
> using the distributed map, which I assume will have the same issue. I
> guess I can use RSVP on those and broadcast an empty 'heartbeat'
> message if a multicast RPC call times out, which will cause NAKACK2 to
> kick in.

Why an empty heartbeat message? Just mark the important messages as RSVP
and JGroups will (under the covers) do this for you.
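Untested sketch of what that could look like with RpcDispatcher, assuming
RSVP has been added to your stack; the method name ("handleUpdate"), the
payload type and the dispatcher variable are just placeholders:

import org.jgroups.Address;
import org.jgroups.Message;
import org.jgroups.blocks.MethodCall;
import org.jgroups.blocks.RequestOptions;
import org.jgroups.blocks.ResponseMode;
import org.jgroups.blocks.RpcDispatcher;
import org.jgroups.util.RspList;
import java.util.Collection;

public class RsvpRpcSketch {
    // Multicast RPC to all members (dests == null) that only returns once
    // the underlying message has been received by every current member, or
    // the 5000 ms timeout has expired. Requires RSVP in the protocol stack.
    static RspList<Object> importantCall(RpcDispatcher disp, String payload) throws Exception {
        RequestOptions opts = new RequestOptions(ResponseMode.GET_ALL, 5000)
            .setFlags(Message.Flag.RSVP);
        MethodCall call = new MethodCall("handleUpdate",          // placeholder method
                                         new Object[]{payload},
                                         new Class[]{String.class});
        Collection<Address> dests = null;                         // null = all members
        return disp.callRemoteMethods(dests, call, opts);
    }
}

The same flag can be set on the RequestOptions you already pass to
callRemoteMethodsWithFuture().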
> My clusters will be small for the foreseeable future (less than 20
> nodes), so I'm somewhat intrigued by the idea of a different protocol
> that will perform better for me on multicast messages. Do you suppose
> that it might be feasible for me to try to resurrect the SMACK protocol?

SMACK [1] was removed some time ago. I would suggest using RSVP rather
than SMACK. Even if you have to mark all messages as RSVP, that's still
better than resurrecting SMACK, as you'd use the same config as everybody
else. Tagging all messages as RSVP is more or less SMACK.

[1] http://grepcode.com/file/repo1.maven.org/maven2/org.jgroups/jgroups/2.11.1.Final/org/jgroups/protocols/SMACK.java

> Thanks,
>
> JT
>
> On Tue, Jun 10, 2014 at 12:01 AM, Bela Ban <be...@ya...> wrote:
>
> Hi Jim,
>
> when a unicast message gets dropped, it will get retransmitted within
> 2 * UNICAST3.xmit_interval ms, as UNICAST3 uses positive acks.
>
> However, for multicast messages, NAKACK2 uses *negative acks*, i.e. the
> receivers ask the senders for missing messages. If you multicast
> messages A1 and A2, but A2 is dropped, nobody will know that A actually
> did send message #2 until A sends A3, or STABLE kicks in and does
> message reconciliation.
>
> The reason for using negative acks for multicasts is to prevent ack
> flooding; I used to have a SMACK protocol in JGroups some time ago, but
> for large clusters it generated too many acks.
>
> Now, to solve your problem, you could add RSVP [1] to the stack and
> mark some messages/RPCs as RSVP.
>
> Ideally, this would be done after a *batch of work*, as RSVP is costly,
> especially in large clusters. See [1] for details.
>
> Alternatively, reduce STABLE.desired_avg_gossip, but this will cause
> constant traffic from all members to the coordinator, which I don't
> think is a good idea.
>
> [1] http://www.jgroups.org/manual/html/user-channel.html#RsvpSection
>
> On 06/06/14 03:35, Jim Thomas wrote:
> > I'm using a muxed RPC on Android with JGroups 3.4.4, presently with
> > two nodes. I'm doing a 30 second periodic
> > callRemoteMethodsWithFuture(null ...) from node 1, and occasionally
> > the call does not go through on node 2 until the next (of the same)
> > call is sent. So what I see is:
> >
> > T    N1            N2
> > 0    rpc1 fc1      rpc1
> > 30   rpc2          nothing received
> > 60   rpc3 fc2,fc3  rpc2 rpc3 (receive one call right after the other)
> > 90   rpc4 fc4      rpc4
> >
> > The future callbacks always show success=true and suspected=false.
> > On the call options I set the timeout to 1000 (1 sec, right?) but I
> > don't get any timeout behavior as far as I can tell.
> >
> > The channels are carrying frequent unreliable traffic and infrequent
> > RPC traffic, but the RPC calls of other methods seem to be going
> > through reliably.
> >
> > I was getting similar behavior of missed calls on the remote node
> > when I was using callRemoteMethods with GET_NONE.
> >
> > This is over wifi, so I can see that maybe a message could be lost,
> > but this seems more frequent than I'd expect. But I would expect the
> > message to be resent long before the next RPC call.
> >
> > I do have RPC calls back and forth, but I thought I had avoided
> > deadlock. It seems to me that if this were the case I'd see the same
> > problem on the local as well as the remote node, and it would happen
> > most of the time. I'd also expect it not to happen here, since this
> > is the first message in the chain of activity.
> >
> > Here is my config:
> >
> > <config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> >         xmlns="urn:org:jgroups"
> >         xsi:schemaLocation="urn:org:jgroups
> >                             http://www.jgroups.org/schema/JGroups-3.3.xsd">
> >
> >   <UDP
> >     enable_diagnostics="true"
> >     ip_mcast="true"
> >     ip_ttl="${jgroups.udp.ip_ttl:8}"
> >     loopback="true"
> >     max_bundle_size="1400"
> >     max_bundle_timeout="5"
> >     mcast_port="${jgroups.udp.mcast_port:45588}"
> >     mcast_recv_buf_size="200K"
> >     mcast_send_buf_size="200K"
> >     oob_thread_pool.enabled="true"
> >     oob_thread_pool.keep_alive_time="5000"
> >     oob_thread_pool.max_threads="8"
> >     oob_thread_pool.min_threads="1"
> >     oob_thread_pool.queue_enabled="false"
> >     oob_thread_pool.queue_max_size="100"
> >     oob_thread_pool.rejection_policy="discard"
> >     thread_naming_pattern="cl"
> >     thread_pool.enabled="true"
> >     thread_pool.keep_alive_time="5000"
> >     thread_pool.max_threads="8"
> >     thread_pool.min_threads="2"
> >     thread_pool.queue_enabled="true"
> >     thread_pool.queue_max_size="10000"
> >     thread_pool.rejection_policy="discard"
> >     timer.keep_alive_time="3000"
> >     timer.max_threads="10"
> >     timer.min_threads="4"
> >     timer.queue_max_size="500"
> >     timer_type="new3"
> >     tos="8"
> >     ucast_recv_buf_size="200K"
> >     ucast_send_buf_size="200K" />
> >
> >   <PING />
> >
> >   <MERGE2
> >     max_interval="30000"
> >     min_interval="10000" />
> >
> >   <FD_SOCK />
> >
> >   <FD_ALL />
> >
> >   <VERIFY_SUSPECT timeout="1500" />
> >
> >   <BARRIER />
> >
> >   <pbcast.NAKACK2
> >     discard_delivered_msgs="true"
> >     max_msg_batch_size="500"
> >     use_mcast_xmit="false"
> >     xmit_interval="500"
> >     xmit_table_max_compaction_time="30000"
> >     xmit_table_msgs_per_row="2000"
> >     xmit_table_num_rows="100" />
> >
> >   <UNICAST3
> >     conn_expiry_timeout="0"
> >     max_msg_batch_size="500"
> >     xmit_interval="500"
> >     xmit_table_max_compaction_time="60000"
> >     xmit_table_msgs_per_row="2000"
> >     xmit_table_num_rows="100" />
> >
> >   <pbcast.STABLE
> >     desired_avg_gossip="50000"
> >     max_bytes="4M"
> >     stability_delay="1000" />
> >
> >   <pbcast.GMS
> >     join_timeout="3000"
> >     print_local_addr="true"
> >     view_bundling="true" />
> >
> >   <FRAG frag_size="1000" />
> >
> >   <pbcast.STATE_TRANSFER />
> >
> >   <CENTRAL_LOCK num_backups="2" />
> >
> > </config>
> >
> > Any ideas?
> >
> > Thanks,
> >
> > JT
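To make the "batch of work" point above concrete: since messages from one
sender are delivered in FIFO order, it is enough to flag only the last
message of a batch as RSVP; once that one has been received by all members,
everything sent before it has arrived as well. Rough, untested sketch
(payloads and channel wiring are placeholders):

import org.jgroups.JChannel;
import org.jgroups.Message;
import java.util.List;

public class RsvpBatchSketch {
    // Send a batch of cluster-wide messages; only the last one is flagged
    // RSVP, so send() blocks on it until all current members have received
    // it (and, due to FIFO ordering, the rest of the batch as well).
    // Requires RSVP in the protocol stack.
    static void sendBatch(JChannel ch, List<String> batch) throws Exception {
        for (int i = 0; i < batch.size(); i++) {
            Message msg = new Message(null, batch.get(i));  // null dest = multicast
            if (i == batch.size() - 1)
                msg.setFlag(Message.Flag.RSVP);
            ch.send(msg);
        }
    }
}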
--
Bela Ban, JGroups lead (http://www.jgroups.org)