Re: [jgroups-users] Odd RPC behavior
From: Jim T. <jim...@no...> - 2014-06-11 15:24:40
"Why an empty heartbeat message ? Just mark the important messages as RSVP and JGroups will (under the cover) do this for you." Ok it was not clear to me from reading the documentation on RSVP that it would automatically re-transmit, it is only mentioned in the RSVP configuration section. I got the impression that RSVP would only notify me when the messages were all received or timeout, not automatically re-transmit missed messages. All of my multicast rpc calls are GET_NONE so I would not block my threads waiting for stragglers to reply. So if I were to setup RSVP with a 250ms resend interval and a 2000ms timeout then I would have to lose several packets for the rpc call not to go through. Also since my RPC calls don't wait for a return I suppose I should set throw_exception_on_timeout to false and ack_on_delivery to false. In this case would my callRemoteMethods block for up to 2000ms or does it still return immediately. Thanks, JT On Tue, Jun 10, 2014 at 10:47 PM, Bela Ban <be...@ya...> wrote: > > > On 10/06/14 20:38, Jim Thomas wrote: > > Thanks Bela. > > > > I was able to confirm that the long delay was due to a lack of message > > traffic. As soon as I created a frequent 'heartbeat' message the > > problem went away. I suppose I was expecting the default clustering to > > be in contact much more often by default. I suspect I will need to > > shorten the average gossip time somewhat, 50 seconds is an eternity for > > my system. Ideally I'd like re-transmission of missed multicast > > messages to happen within seconds. > > > > I have a handful of multicast RPC calls I'm using that are very > > important that they propagate through the system in a timely manner. > > Also I'm using the distributed map which I assume will have the same > > issue. I guess I can use RSVP on those and broadcast an empty > > 'heartbeat' message if a multicast rpc call times out which will cause > > NAKACK2 to kick in. > > > Why an empty heartbeat message ? Just mark the important messages as > RSVP and JGroups will (under the cover) do this for you. > > > > My clusters will be small for the foreseeable future (less than 20 > > nodes) so I'm somewhat intrigued by the idea of a different protocol > > that will perform better for me on multicast messages. Do you suppose > > that it might be feasible for me to try to resurrect the SMAK protocol? > > SMACK [1] was removed some time ago. I would suggest use RSVP rather > than SMACK. Even if you have to mark all messages as RSVP, that's still > better than resurrecting SMACK as you'd use the same config as everybody > else. Tagging all messages as RSVP is more or less SMACK. > > [1] > > http://grepcode.com/file/repo1.maven.org/maven2/org.jgroups/jgroups/2.11.1.Final/org/jgroups/protocols/SMACK.java > > > > Thanks, > > > > JT > > > > > > On Tue, Jun 10, 2014 at 12:01 AM, Bela Ban <be...@ya... > > <mailto:be...@ya...>> wrote: > > > > Hi Jim, > > > > when a unicast message gets dropped, it will get retransmitted > within 2 > > * UNICAST3.xmit_interval ms, as UNICAST3 uses positive acks. > > > > However, for multicast messages, NAKACK2 uses *negative acks*, ie. > the > > receivers ask the senders for missing messages. If you mulicast > messages > > A1 and A2, but A2 is dropped, nobody will know that A actually did > send > > message #2 until A sends A3, or STABLE kicks in and does message > > reconciliation. 
On Tue, Jun 10, 2014 at 10:47 PM, Bela Ban <be...@ya...> wrote:
>
> On 10/06/14 20:38, Jim Thomas wrote:
> > Thanks Bela.
> >
> > I was able to confirm that the long delay was due to a lack of message
> > traffic. As soon as I created a frequent 'heartbeat' message the
> > problem went away. I suppose I was expecting the default clustering to
> > be in contact much more often by default. I suspect I will need to
> > shorten the average gossip time somewhat; 50 seconds is an eternity for
> > my system. Ideally I'd like retransmission of missed multicast messages
> > to happen within seconds.
> >
> > I have a handful of multicast RPC calls where it is very important that
> > they propagate through the system in a timely manner. Also I'm using
> > the distributed map, which I assume will have the same issue. I guess I
> > can use RSVP on those and broadcast an empty 'heartbeat' message if a
> > multicast RPC call times out, which will cause NAKACK2 to kick in.
>
> Why an empty heartbeat message? Just mark the important messages as
> RSVP and JGroups will (under the cover) do this for you.
>
> > My clusters will be small for the foreseeable future (less than 20
> > nodes), so I'm somewhat intrigued by the idea of a different protocol
> > that would perform better for me on multicast messages. Do you suppose
> > it might be feasible for me to try to resurrect the SMACK protocol?
>
> SMACK [1] was removed some time ago. I would suggest using RSVP rather
> than SMACK. Even if you have to mark all messages as RSVP, that's still
> better than resurrecting SMACK, as you'd use the same config as everybody
> else. Tagging all messages as RSVP is more or less SMACK.
>
> [1] http://grepcode.com/file/repo1.maven.org/maven2/org.jgroups/jgroups/2.11.1.Final/org/jgroups/protocols/SMACK.java
>
> > Thanks,
> >
> > JT
> >
> > On Tue, Jun 10, 2014 at 12:01 AM, Bela Ban <be...@ya...> wrote:
> >
> > Hi Jim,
> >
> > when a unicast message gets dropped, it will get retransmitted within
> > 2 * UNICAST3.xmit_interval ms, as UNICAST3 uses positive acks.
> >
> > However, for multicast messages, NAKACK2 uses *negative acks*, i.e. the
> > receivers ask the senders for missing messages. If you multicast
> > messages A1 and A2, but A2 is dropped, nobody will know that A actually
> > did send message #2 until A sends A3, or STABLE kicks in and does
> > message reconciliation.
> >
> > The reason for using neg acks for multicasts is to prevent ack
> > flooding; I used to have a SMACK protocol in JGroups some time ago, but
> > for large clusters it generated too many acks.
> >
> > Now, to solve your problem, you could add RSVP [1] to the stack and
> > mark some messages/RPCs as RSVP.
> >
> > Ideally, this would be done after a *batch of work*, as RSVP is costly,
> > especially in large clusters. See [1] for details.
> >
> > Alternatively, reduce the STABLE.desired_avg_gossip, but this will
> > cause constant traffic from all members to the coord, which I don't
> > think is a good idea.
> >
> > [1] http://www.jgroups.org/manual/html/user-channel.html#RsvpSection
> >
> > On 06/06/14 03:35, Jim Thomas wrote:
> > > I'm using a muxed RPC on Android with JGroups 3.4.4, presently with
> > > two nodes. I'm doing a 30-second periodic
> > > callRemoteMethodsWithFuture(null ...) from node 1, and occasionally
> > > the call does not go through on node 2 until the next (of the same)
> > > call is sent. So what I see is:
> > >
> > > T     N1              N2
> > > 0     rpc1 fc1        rpc1
> > > 30    rpc2            nothing received
> > > 60    rpc3 fc2,fc3    rpc2 rpc3 (one call right after the other)
> > > 90    rpc4 fc4        rpc4
> > >
> > > The future callbacks always show success=true and suspected=false.
> > > On the call options I set the timeout to 1000 (1 sec, right?) but I
> > > don't get any timeout behavior as far as I can tell.
> > >
> > > The channels are carrying frequent unreliable traffic and infrequent
> > > RPC traffic, but the RPC calls of other methods seem to be going
> > > through reliably.
> > >
> > > I was getting similar behavior of missed calls on the remote node
> > > when I was using callRemoteMethods with GET_NONE.
> > >
> > > This is over wifi, so I can see that maybe a message could be lost,
> > > but this seems more frequent than I'd expect. And I would expect the
> > > message to be resent long before the next RPC call.
> > >
> > > I do have RPC calls back and forth, but I thought I had avoided
> > > deadlock. It seems to me that if that were the case I'd see the same
> > > problem on the local as well as the remote node, and it would happen
> > > most of the time. I'd also expect it not to happen here, since this
> > > is the first message in the chain of activity.
> > >
> > > Here is my config:
> > >
> > > <config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> > >         xmlns="urn:org:jgroups"
> > >         xsi:schemaLocation="urn:org:jgroups
> > >             http://www.jgroups.org/schema/JGroups-3.3.xsd">
> > >
> > >     <UDP
> > >         enable_diagnostics="true"
> > >         ip_mcast="true"
> > >         ip_ttl="${jgroups.udp.ip_ttl:8}"
> > >         loopback="true"
> > >         max_bundle_size="1400"
> > >         max_bundle_timeout="5"
> > >         mcast_port="${jgroups.udp.mcast_port:45588}"
> > >         mcast_recv_buf_size="200K"
> > >         mcast_send_buf_size="200K"
> > >         oob_thread_pool.enabled="true"
> > >         oob_thread_pool.keep_alive_time="5000"
> > >         oob_thread_pool.max_threads="8"
> > >         oob_thread_pool.min_threads="1"
> > >         oob_thread_pool.queue_enabled="false"
> > >         oob_thread_pool.queue_max_size="100"
> > >         oob_thread_pool.rejection_policy="discard"
> > >         thread_naming_pattern="cl"
> > >         thread_pool.enabled="true"
> > >         thread_pool.keep_alive_time="5000"
> > >         thread_pool.max_threads="8"
> > >         thread_pool.min_threads="2"
> > >         thread_pool.queue_enabled="true"
> > >         thread_pool.queue_max_size="10000"
> > >         thread_pool.rejection_policy="discard"
> > >         timer.keep_alive_time="3000"
> > >         timer.max_threads="10"
> > >         timer.min_threads="4"
> > >         timer.queue_max_size="500"
> > >         timer_type="new3"
> > >         tos="8"
> > >         ucast_recv_buf_size="200K"
> > >         ucast_send_buf_size="200K" />
> > >
> > >     <PING />
> > >
> > >     <MERGE2
> > >         max_interval="30000"
> > >         min_interval="10000" />
> > >
> > >     <FD_SOCK />
> > >
> > >     <FD_ALL />
> > >
> > >     <VERIFY_SUSPECT timeout="1500" />
> > >
> > >     <BARRIER />
> > >
> > >     <pbcast.NAKACK2
> > >         discard_delivered_msgs="true"
> > >         max_msg_batch_size="500"
> > >         use_mcast_xmit="false"
> > >         xmit_interval="500"
> > >         xmit_table_max_compaction_time="30000"
> > >         xmit_table_msgs_per_row="2000"
> > >         xmit_table_num_rows="100" />
> > >
> > >     <UNICAST3
> > >         conn_expiry_timeout="0"
> > >         max_msg_batch_size="500"
> > >         xmit_interval="500"
> > >         xmit_table_max_compaction_time="60000"
> > >         xmit_table_msgs_per_row="2000"
> > >         xmit_table_num_rows="100" />
> > >
> > >     <pbcast.STABLE
> > >         desired_avg_gossip="50000"
> > >         max_bytes="4M"
> > >         stability_delay="1000" />
> > >
> > >     <pbcast.GMS
> > >         join_timeout="3000"
> > >         print_local_addr="true"
> > >         view_bundling="true" />
> > >
> > >     <FRAG frag_size="1000" />
> > >
> > >     <pbcast.STATE_TRANSFER />
> > >
> > >     <CENTRAL_LOCK num_backups="2" />
> > >
> > > </config>
> > >
> > > Any ideas?
> > >
> > > Thanks,
> > >
> > > JT
> >
> > --
> > Bela Ban, JGroups lead (http://www.jgroups.org)
>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
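For completeness, a sketch of where RSVP might slot into the config quoted
above. The placement mirrors the stock udp.xml shipped with 3.4, which puts
RSVP near the top of the stack; the attribute values echo the 250 ms resend
and 2 s timeout discussed at the top of this message rather than the
defaults (resend_interval=2000, timeout=10000), so treat them as a starting
point rather than a recommendation.

    <FRAG frag_size="1000" />

    <RSVP
        resend_interval="250"
        timeout="2000"
        throw_exception_on_timeout="false"
        ack_on_delivery="false" />

    <pbcast.STATE_TRANSFER />

    <CENTRAL_LOCK num_backups="2" />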