Hi,
We are having an issue using TCP_NIO and the TCPPING discovery protocol in
JGroups 2.8 GA. We have two servers that use the JGroups DistributedHashtable
class to share cluster state. The DistributedHashtable.Notification method
contentsSet() (state transfer) is called during server startup ONLY when each
server's TCPPING "port_range" is set to 3 or less. This was not an issue when
we used JGroups 2.6.1. Below is the expected call stack (only seen when
port_range<=3):
SDCluster.contentsSet(Map) line: 1161 <<< implements DistributedHashtable.Notification
SDDistributedHashtable(DistributedHashtable)._putAll(Map) line: 436
SDDistributedHashtable(DistributedHashtable).setState(InputStream) line: 691
MessageDispatcher$ProtocolAdapter.handleUpEvent(Event) line: 787
MessageDispatcher$ProtocolAdapter.up(Event) line: 849
JChannel.up(Event) line: 1413
ProtocolStack.up(Event) line: 829
STREAMING_STATE_TRANSFER.connectToStateProvider(STREAMING_STATE_TRANSFER$StateHeader) line: 526
STREAMING_STATE_TRANSFER.handleStateRsp(STREAMING_STATE_TRANSFER$StateHeader) line: 465
STREAMING_STATE_TRANSFER.up(Event) line: 230
FRAG2.up(Event) line: 188
FC.up(Event) line: 470
VIEW_SYNC.up(Event) line: 173
GMS.up(Event) line: 890
AUTH.up(Event) line: 143
STABLE.up(Event) line: 236
UNICAST.handleDataReceived(Address, long, long, boolean, Message) line: 582
UNICAST.up(Event) line: 275
NAKACK.up(Event) line: 692
VERIFY_SUSPECT.up(Event) line: 132
FD.up(Event) line: 259
FD_SOCK.up(Event) line: 269
MERGE2(Protocol).up(Event) line: 340
TCPPING(Discovery).up(Event) line: 277
TCP_NIO(TP).passMessageUp(Message, boolean, boolean, boolean) line: 953
TP.access$100(TP, Message, boolean, boolean, boolean) line: 53
TP$IncomingPacket.handleMyMessage(Message, boolean) line: 1457
TP$IncomingPacket.run() line: 1439
ThreadPoolExecutor$Worker.runTask(Runnable) line: 650
ThreadPoolExecutor$Worker.run() line: 675
Thread.run() line: 595
And here is our jgroups protocol configuration (for each server):
<TCP_NIO
    bind_port="7800"
    loopback="true"
    discard_incompatible_packets="true"
    max_bundle_size="64000"
    max_bundle_timeout="30"
    enable_bundling="true"
    oob_thread_pool.min_threads="20"
    oob_thread_pool.max_threads="30"
    reader_threads="3"
    writer_threads="3"
    processor_threads="5"
    processor_minThreads="5"
    processor_maxThreads="5"
    processor_queueSize="100"/>
<TCPPING timeout="5000"
    initial_hosts="139.185.17.80,139.185.17.82"
    port_range="3"
    num_initial_members="2"/>
<MERGE2 max_interval="100000"
    min_interval="20000"/>
<FD_SOCK/>
<FD timeout="20000" max_tries="5"/>
<VERIFY_SUSPECT timeout="1500"/>
<pbcast.NAKACK
    max_xmit_size="60000" use_mcast_xmit="false" gc_lag="0"
    retransmit_timeout="100,200,300,600,1200,2400,4800"
    discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
    max_bytes="400000"/>
<pbcast.GMS print_local_addr="false" join_timeout="3000"
    view_bundling="true"/>
<FC max_credits="2000000"
    min_threshold="0.10"/>
<FRAG2 frag_size="60000"/>
<pbcast.STREAMING_STATE_TRANSFER/>
When using JGroups 2.8 GA, we also repeatedly get the following error
messages (one for each port in the TCPPING port range):
ERROR failed sending message to 139.185.17.80:7803 (117 bytes):
java.lang.Exception: connection to 139.185.17.80:7803 could not be established
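To make the set of addresses behind those errors concrete, here is a minimal sketch that lists the host:port pairs a port-range scan would try. It is not JGroups code: ProbeTargets and enumerate are hypothetical names, and it assumes that every initial host is scanned from bind_port (7800) and that the range is inclusive (start port through start port + port_range), which is at least consistent with port 7803 showing up in the error above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JGroups: lists the addresses a
// TCPPING-style port-range scan would probe under the stated assumptions.
class ProbeTargets {

    // Assumption: ports startPort..startPort+portRange (inclusive) are tried per host.
    static List<String> enumerate(List<String> hosts, int startPort, int portRange) {
        List<String> targets = new ArrayList<String>();
        for (String host : hosts) {
            for (int port = startPort; port <= startPort + portRange; port++) {
                targets.add(host + ":" + port);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Values taken from the configuration posted above.
        List<String> hosts = Arrays.asList("139.185.17.80", "139.185.17.82");
        List<String> targets = enumerate(hosts, 7800, 3);
        for (String target : targets) {
            System.out.println(target);
        }
        System.out.println(targets.size() + " candidate addresses");
    }
}
```

Under these assumptions the number of candidate addresses grows as hosts x (port_range + 1), so a larger port_range means more connection attempts, most of them to ports where nothing is listening.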
I also see the following dropped-message warning on one of the servers when
the second server comes online:
2010-03-10 10:20:55,496 TRACE message is , headers are MsgDisp: , dest_mbrs=, NAKACK: , TCP_NIO:
2010-03-10 10:20:55,511 TRACE barney-59011: received hollyrock-46328#1
2010-03-10 10:20:55,511 WARN barney-59011: dropped message from hollyrock-46328 (not in xmit_table), keys are , view=
Any help would be appreciated.
Thanks,
Ryan
Increasing the TCPPING timeout seems to have resolved the problem.
TCPPING(timeout=5000;port_range=3;...) -- works
TCPPING(timeout=10000;port_range=10;...) -- works
TCPPING(timeout=30000;port_range=20;...) -- works
Increasing the TCPPING timeout works only if one server is allowed to start up
completely before the second server is started. If both servers are started at
almost the same time, TCPPING does not seem to work even with the increased
timeout: neither server "sees" the other. TCPPING initial host discovery seems
to have a timeout problem.