Re: [jgroups-users] Odd RPC behavior
From: Bela B. <be...@ya...> - 2014-06-10 07:01:38
Hi Jim,

when a unicast message gets dropped, it will get retransmitted within 2 * UNICAST3.xmit_interval ms, as UNICAST3 uses positive acks. For multicast messages, however, NAKACK2 uses *negative acks*, i.e. the receivers ask the senders for missing messages. If you multicast messages A1 and A2, but A2 is dropped, nobody will know that A actually did send message #2 until A sends A3, or until STABLE kicks in and does message reconciliation. The reason for using negative acks for multicasts is to prevent ack flooding; I used to have a SMACK protocol in JGroups some time ago, but for large clusters it generated too many acks.

Now, to solve your problem, you could add RSVP [1] to the stack and mark some messages/RPCs as RSVP. Ideally, this would be done after a *batch of work*, as RSVP is costly, especially in large clusters. See [1] for details.

Alternatively, reduce STABLE.desired_avg_gossip, but this will cause constant traffic from all members to the coordinator, which I don't think is a good idea.

[1] http://www.jgroups.org/manual/html/user-channel.html#RsvpSection
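To make that concrete, here is a minimal sketch (untested; RSVP's position in the stack and the attribute values are only examples, and disp/"refresh" below stand in for your own MuxRpcDispatcher and method). RSVP should sit above the reliable protocols (NAKACK2/UNICAST3), so in your config you could add it e.g. after pbcast.STABLE:

    <RSVP timeout="10000" resend_interval="2000" />

Then flag the important RPCs as RSVP:

    import org.jgroups.Message;
    import org.jgroups.blocks.MethodCall;
    import org.jgroups.blocks.RequestOptions;
    import org.jgroups.blocks.ResponseMode;

    // RSVP-flagged call: the send blocks until all current members have
    // ack'ed delivery of the underlying message, or the RSVP timeout expires
    RequestOptions opts = new RequestOptions(ResponseMode.GET_ALL, 1000)
            .setFlags(Message.Flag.RSVP);

    // disp is assumed to be your (Mux)RpcDispatcher; "refresh" is a
    // placeholder for your actual no-arg method
    disp.callRemoteMethodsWithFuture(null,
            new MethodCall("refresh", null, null), opts);

With this, a dropped multicast is retransmitted as part of the call itself, instead of waiting for the next message or for STABLE.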
On 06/06/14 03:35, Jim Thomas wrote:
> I'm using a muxed RPC on Android with JGroups 3.4.4, presently with two
> nodes. I'm doing a 30 second periodic callRemoteMethodsWithFuture(null
> ...) from node 1 and occasionally the call does not go through on node
> 2 until the next (of the same) call is sent. So what I see is:
>
> T    N1             N2
> 0    rpc1 fc1       rpc1
> 30   rpc2           nothing received
> 60   rpc3 fc2,fc3   rpc2 rpc3 (receive one call right after the other)
> 90   rpc4 fc4       rpc4
>
> The future callbacks always show success=true and suspected=false. On
> the call options I set the timeout to 1000 (1 sec, right?) but I don't
> get any timeout behavior as far as I can tell.
>
> The channels are carrying frequent unreliable traffic and infrequent RPC
> traffic, but the RPC calls of other methods seem to be going through
> reliably.
>
> I was getting similar behavior of missed calls on the remote node when I
> was using callRemoteMethods with GET_NONE.
>
> This is over wifi, so I can see that maybe a message could be lost, but
> this seems more frequent than I'd expect. In any case, I would expect
> the message to be resent long before the next RPC call.
>
> I do have RPC calls back and forth, but I thought I had avoided deadlock.
> It seems to me that if this were the case, I'd see the same problem on
> the local as well as the remote node, and it would happen most of the
> time. I'd also expect it not to happen here, since this is the first
> message in the chain of activity.
>
> Here is my config:
>
> <config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>         xmlns="urn:org:jgroups"
>         xsi:schemaLocation="urn:org:jgroups
>                             http://www.jgroups.org/schema/JGroups-3.3.xsd">
>
>   <UDP
>       enable_diagnostics="true"
>       ip_mcast="true"
>       ip_ttl="${jgroups.udp.ip_ttl:8}"
>       loopback="true"
>       max_bundle_size="1400"
>       max_bundle_timeout="5"
>       mcast_port="${jgroups.udp.mcast_port:45588}"
>       mcast_recv_buf_size="200K"
>       mcast_send_buf_size="200K"
>       oob_thread_pool.enabled="true"
>       oob_thread_pool.keep_alive_time="5000"
>       oob_thread_pool.max_threads="8"
>       oob_thread_pool.min_threads="1"
>       oob_thread_pool.queue_enabled="false"
>       oob_thread_pool.queue_max_size="100"
>       oob_thread_pool.rejection_policy="discard"
>       thread_naming_pattern="cl"
>       thread_pool.enabled="true"
>       thread_pool.keep_alive_time="5000"
>       thread_pool.max_threads="8"
>       thread_pool.min_threads="2"
>       thread_pool.queue_enabled="true"
>       thread_pool.queue_max_size="10000"
>       thread_pool.rejection_policy="discard"
>       timer.keep_alive_time="3000"
>       timer.max_threads="10"
>       timer.min_threads="4"
>       timer.queue_max_size="500"
>       timer_type="new3"
>       tos="8"
>       ucast_recv_buf_size="200K"
>       ucast_send_buf_size="200K" />
>
>   <PING />
>
>   <MERGE2
>       max_interval="30000"
>       min_interval="10000" />
>
>   <FD_SOCK />
>
>   <FD_ALL />
>
>   <VERIFY_SUSPECT timeout="1500" />
>
>   <BARRIER />
>
>   <pbcast.NAKACK2
>       discard_delivered_msgs="true"
>       max_msg_batch_size="500"
>       use_mcast_xmit="false"
>       xmit_interval="500"
>       xmit_table_max_compaction_time="30000"
>       xmit_table_msgs_per_row="2000"
>       xmit_table_num_rows="100" />
>
>   <UNICAST3
>       conn_expiry_timeout="0"
>       max_msg_batch_size="500"
>       xmit_interval="500"
>       xmit_table_max_compaction_time="60000"
>       xmit_table_msgs_per_row="2000"
>       xmit_table_num_rows="100" />
>
>   <pbcast.STABLE
>       desired_avg_gossip="50000"
>       max_bytes="4M"
>       stability_delay="1000" />
>
>   <pbcast.GMS
>       join_timeout="3000"
>       print_local_addr="true"
>       view_bundling="true" />
>
>   <FRAG frag_size="1000" />
>
>   <pbcast.STATE_TRANSFER />
>
>   <CENTRAL_LOCK num_backups="2" />
>
> </config>
>
> Any ideas?
>
> Thanks,
>
> JT

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)