[javagroups-users] Crash detection takes too long and one more question

Dear list members,

I am creating a failover application with the following stack:

UDP(bind_addr={0};mcast_addr={1};mcast_port={2};ip_ttl=32;\
mcast_send_buf_size=150000;mcast_recv_buf_size=80000):\
PING(timeout=1250;num_initial_members=2):\
FD_SOCK:\
FD(timeout=1000;max_tries=2):\
VERIFY_SUSPECT(timeout=500):\
pbcast.NAKACK(gc_lag=50;retransmit_timeout=300):\
UNICAST(timeout=1000):\
pbcast.STABLE(desired_avg_gossip=5000):\
FRAG(frag_size=4096;down_thread=false;up_thread=false):\
pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;\
shun=false;print_local_addr=true):\
pbcast.STATE_TRANSFER
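
The {0}, {1} and {2} placeholders are filled in before the stack string is
handed to the channel. A minimal sketch of that step, assuming
java.text.MessageFormat does the substitution (the class name StackProps,
the shortened template and the address/port values are made up for
illustration). The port is passed as a String because MessageFormat would
otherwise apply locale-specific digit grouping to a number.

```java
import java.text.MessageFormat;

// Hypothetical helper: fills the {0},{1},{2} placeholders of a stack template.
class StackProps {
    // Shortened template; the real stack continues with PING, FD_SOCK, etc.
    static final String TEMPLATE =
        "UDP(bind_addr={0};mcast_addr={1};mcast_port={2};ip_ttl=32)";

    static String fill(String bindAddr, String mcastAddr, String mcastPort) {
        // All arguments are Strings, so they are inserted verbatim.
        return MessageFormat.format(TEMPLATE, bindAddr, mcastAddr, mcastPort);
    }

    public static void main(String[] args) {
        // Example values only; replace with the real interface and group.
        System.out.println(fill("192.168.0.10", "228.8.8.8", "45566"));
    }
}
```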

The nodes are on a 100 Mb switched network, and multicast is supported.

First question:

It works just fine most of the time. Crashed members are usually
detected after about 2 seconds, but sometimes detection takes up to
50 seconds! How can this happen?
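
As a rough sanity check of the usual figure, here is a small sketch of the
nominal delay implied by the settings above. It assumes that FD suspects a
member after max_tries unanswered heartbeats of timeout ms each and that
VERIFY_SUSPECT then adds its own timeout; this is a simplification (FD_SOCK
may report a closed socket sooner), and the class and method names are made
up for illustration.

```java
// Hypothetical estimate of the nominal crash detection delay.
class DetectionEstimate {
    // Assumption: delay ~= FD timeout * max_tries + VERIFY_SUSPECT timeout.
    static long nominalMillis(long fdTimeout, int fdMaxTries, long verifyTimeout) {
        return fdTimeout * fdMaxTries + verifyTimeout;
    }

    public static void main(String[] args) {
        // Values from the stack: FD(timeout=1000;max_tries=2),
        // VERIFY_SUSPECT(timeout=500).
        System.out.println(nominalMillis(1000, 2, 500) + " ms");
    }
}
```

Under these assumptions the settings give about 2.5 seconds, which is in
line with the usual case but nowhere near 50 seconds.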

Second question:

I am using DistributedHashTable to store information across the network.
When I try to "remove" an entry from the table, the calling thread hangs
there forever and never continues (a deadlock). To work around the
problem I call _remove (the local remove) on each member, which is ugly.
Is this a known issue?

Thanks in advance for your kind help.