Alex Litvak - 2012-04-18

Hello,

I run 4 JGroups nodes on the network.

This worked in the dev environment, but after I moved it to QA, some nodes can't
join the groups.

lhqesched001: - no lhqeetl007!

lhqeetl005: - no lhqeetl007!

lhqeetl007:

lhqeetl008: - no lhqeetl007!

When I shut down the lhqesched001 node:

lhqeetl005: - added lhqeetl007

lhqeetl007:

lhqeetl008: - added lhqeetl007

Do you know what the problem is? Is anything wrong with my protocol settings?

Any help is really appreciated!

Thank you,

Alex Litvak


JGroups version: jgroups-2.12.1.3.Final

Protocol stack defined in the code:

// Transport: TCP carries the cluster traffic
protocols.add(new TCP()
    .setValue("bind_addr", InetAddress.getByName(bindAddr))
    .setValue("bind_port", bindPort)
    .setValue("loopback", false)
);

// Built here, but not passed to any protocol in this excerpt
List<IpAddress> initial_hosts = new ArrayList<IpAddress>();
initial_hosts.add(new IpAddress(InetAddress.getByName(settingBean.destAddress), destPort));

// Discovery: MPING finds members via IP multicast
protocols.add(new MPING()
    .setValue("timeout", 1000)
    .setValue("bind_addr", InetAddress.getByName(bindAddr))
    .setValue("mcast_addr", InetAddress.getByName(settingBean.mcastAddress))
    .setValue("mcast_port", settingBean.mcastPort)
    .setValue("ip_ttl", 20)
    .setValue("num_initial_members", settingBean.numInitialMembersMcast)
);

protocols.add(new MERGE2()
    .setValue("min_interval", 10000)
    .setValue("max_interval", 30000)
);

// Failure detection
protocols.add(new FD_SOCK());
protocols.add(new FD()
    .setValue("timeout", 1500)
    .setValue("max_tries", 3)
);
protocols.add(new VERIFY_SUSPECT()
    .setValue("timeout", 1500)
);

protocols.add(new BARRIER());

// Reliable delivery and message stability
protocols.add(new NAKACK()
    .setValue("use_mcast_xmit", false)
    .setValue("gc_lag", 0)
    .setValue("discard_delivered_msgs", true)
);
protocols.add(new UNICAST());
protocols.add(new STABLE()
    .setValue("stability_delay", 1000)
    .setValue("desired_avg_gossip", 50000)
);

// Group membership
protocols.add(new GMS()
    .setValue("print_local_addr", true)
    .setValue("join_timeout", 3000)
    .setValue("view_bundling", true)
);

// Flow control
protocols.add(new UFC()
    .setValue("max_credits", 2000000)
    .setValue("min_threshold", 0.4)
);
protocols.add(new MFC()
    .setValue("max_credits", 2000000)
    .setValue("min_threshold", 0.4)
);

protocols.add(new FRAG2()
    .setValue("frag_size", 60000)
);

protocols.add(new STREAMING_STATE_TRANSFER());
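
Since MPING in the stack above relies on IP multicast for discovery, one thing that could be checked independently of JGroups is whether multicast datagrams actually travel between the QA hosts. The following is only a minimal sketch using the plain JDK: the class name McastProbe, the 30 second wait and the command-line layout are my own assumptions, and the TTL of 20 is copied from ip_ttl above. It is probably best run on a spare port, so that it doesn't pick up traffic from a running cluster.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JGroups.
class McastProbe {

    /** Returns true if the given literal address is an IP multicast address. */
    static boolean isMulticast(String addr) {
        try {
            return InetAddress.getByName(addr).isMulticastAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("cannot resolve " + addr, e);
        }
    }

    // Usage: java McastProbe listen|send <bind_addr> <mcast_addr> <mcast_port>
    public static void main(String[] args) throws Exception {
        if (args.length != 4 || !isMulticast(args[2])) {
            System.err.println("usage: McastProbe listen|send <bind_addr> <mcast_addr> <mcast_port>");
            return;
        }
        boolean send = args[0].equals("send");
        InetAddress bind = InetAddress.getByName(args[1]);
        InetAddress group = InetAddress.getByName(args[2]);
        int port = Integer.parseInt(args[3]);

        // The sender uses an ephemeral port (0); the listener binds the multicast port.
        try (MulticastSocket socket = new MulticastSocket(send ? 0 : port)) {
            socket.setInterface(bind);   // use the interface owning bind_addr
            socket.setTimeToLive(20);    // same value as ip_ttl in the stack
            if (send) {
                byte[] payload = ("probe from " + bind.getHostAddress())
                        .getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(payload, payload.length, group, port));
                System.out.println("sent probe to " + args[2] + ":" + port);
            } else {
                socket.joinGroup(group);
                socket.setSoTimeout(30000); // arbitrary 30 s wait
                DatagramPacket in = new DatagramPacket(new byte[256], 256);
                try {
                    socket.receive(in);
                    String text = new String(in.getData(), 0, in.getLength(),
                            StandardCharsets.UTF_8);
                    System.out.println("received '" + text + "' from "
                            + in.getAddress().getHostAddress());
                } catch (SocketTimeoutException e) {
                    System.out.println("nothing received within 30 s");
                }
            }
        }
    }
}
```

One way to use it would be to start it in listen mode on lhqeetl007 and in send mode on lhqesched001, then swap the roles. A probe that arrives in one direction only would suggest a network issue rather than a protocol setting.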


Configuration


4 virtual machines.

lhqesched001

lhqeetl005

lhqeetl007

lhqeetl008

3 Clusters:

cluster TestCluster1

cluster TestCluster2

cluster QSchedulerManagerCluster7711

1. Machine lhqesched001 (10.12.60.151) settings:

cluster TestCluster1: bind_addr=/10.12.60.151, bind_port=7806, mcast_addr=/228.8.8.8, mcast_port=7606

cluster TestCluster2: bind_addr=/10.12.60.151, bind_port=7808, mcast_addr=/228.8.8.8, mcast_port=7608

cluster QSchedulerManagerCluster7711: bind_addr=/10.12.60.151, bind_port=7710, mcast_addr=/228.8.8.8, mcast_port=7711

2. Machine lhqeetl005 (10.12.60.85) settings:

cluster TestCluster1: bind_addr=/10.12.60.85, bind_port=7816, mcast_addr=/228.8.8.8, mcast_port=7606

cluster TestCluster2: bind_addr=/10.12.60.85, bind_port=7818, mcast_addr=/228.8.8.8, mcast_port=7608

3. Machine lhqeetl007 (10.12.60.87) settings:

cluster TestCluster1: bind_addr=/10.12.60.87, bind_port=7816, mcast_addr=/228.8.8.8, mcast_port=7606

cluster TestCluster2: bind_addr=/10.12.60.87, bind_port=7818, mcast_addr=/228.8.8.8, mcast_port=7608

4. Machine lhqeetl008 (10.12.60.88) settings:

cluster TestCluster1: bind_addr=/10.12.60.88, bind_port=7816, mcast_addr=/228.8.8.8, mcast_port=7606

cluster TestCluster2: bind_addr=/10.12.60.88, bind_port=7818, mcast_addr=/228.8.8.8, mcast_port=7608
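
To compare the per-machine settings above at a glance, here is a small sketch in plain Java (no JGroups dependency) that groups the endpoints by cluster and lists the distinct bind ports in use. The names ClusterSettings, Endpoint, fromPost and bindPortsByCluster are made up for illustration; the values are copied from the list above. It only summarizes the data and does not by itself say whether a difference between hosts matters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper for eyeballing the settings; not part of JGroups.
class ClusterSettings {

    /** One cluster endpoint as listed in the post. */
    static final class Endpoint {
        final String host;
        final String cluster;
        final int bindPort;
        final int mcastPort;

        Endpoint(String host, String cluster, int bindPort, int mcastPort) {
            this.host = host;
            this.cluster = cluster;
            this.bindPort = bindPort;
            this.mcastPort = mcastPort;
        }
    }

    /** The settings exactly as given for the four machines. */
    static List<Endpoint> fromPost() {
        return Arrays.asList(
            new Endpoint("lhqesched001", "TestCluster1", 7806, 7606),
            new Endpoint("lhqesched001", "TestCluster2", 7808, 7608),
            new Endpoint("lhqesched001", "QSchedulerManagerCluster7711", 7710, 7711),
            new Endpoint("lhqeetl005", "TestCluster1", 7816, 7606),
            new Endpoint("lhqeetl005", "TestCluster2", 7818, 7608),
            new Endpoint("lhqeetl007", "TestCluster1", 7816, 7606),
            new Endpoint("lhqeetl007", "TestCluster2", 7818, 7608),
            new Endpoint("lhqeetl008", "TestCluster1", 7816, 7606),
            new Endpoint("lhqeetl008", "TestCluster2", 7818, 7608));
    }

    /** Maps each cluster name to the sorted set of bind ports used by its members. */
    static Map<String, Set<Integer>> bindPortsByCluster(List<Endpoint> endpoints) {
        Map<String, Set<Integer>> result = new TreeMap<>();
        for (Endpoint e : endpoints) {
            result.computeIfAbsent(e.cluster, k -> new TreeSet<>()).add(e.bindPort);
        }
        return result;
    }

    public static void main(String[] args) {
        bindPortsByCluster(fromPost()).forEach((cluster, ports) ->
            System.out.println(cluster + " bind ports: " + ports));
    }
}
```

For the values above, TestCluster1 and TestCluster2 each end up with two bind ports, because lhqesched001 uses 7806/7808 while the other three machines use 7816/7818.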

I started the test apps in this order: lhqesched001, lhqeetl005, lhqeetl007,
lhqeetl008.


Logs:


From lhqesched001 log:

2012-04-18 12:02:27,691 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:02:28,753 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:02:29,802 INFO (JgRpcBase.java:60) - Cluster
QSchedulerManagerCluster7711, new view:

2012-04-18 12:03:04,060 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:03:04,247 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:03:51,794 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:03:51,996 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

!!! no lhqeetl007 !!!

2012-04-18 12:04:32,130 ^C


From lhqeetl005 log:

2012-04-18 12:03:04,078 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:03:04,250 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:03:51,795 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:03:51,996 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:04:32,153 DEBUG (NAKACK.java:1116) - removed lhqesched001-59970
from xmit_table (not member anymore)

2012-04-18 12:04:32,156 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:04:32,482 DEBUG (NAKACK.java:1116) - removed lhqesched001-27884
from xmit_table (not member anymore)

2012-04-18 12:04:32,483 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

!!! lhqeetl007-49202 and lhqeetl007-53393 were added to the view after
lhqesched001 was shut down !!!

2012-04-18 12:04:33,483 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:04:34,095 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:04:48,170 ^C


From lhqeetl007 log:

2012-04-18 12:03:30,341 DEBUG (ClientGmsImpl.java:84) - initial_mbrs are ,
is_server=true, is_coord=false, logical_name=lhqeetl005-11124,
physical_addrs=10.12.60.85:7818]

2012-04-18 12:03:30,341 DEBUG (ClientGmsImpl.java:308) - election results:
{64ee3330-af61-141d-06c0-3549e8ada868=1}

2012-04-18 12:03:30,341 DEBUG (ClientGmsImpl.java:136) - sending
handleJoin(lhqeetl007-49202) to 64ee3330-af61-141d-06c0-3549e8ada868

2012-04-18 12:03:30,345 WARN (TP.java:1209) - lhqeetl007-49202: no physical
address for 64ee3330-af61-141d-06c0-3549e8ada868, dropping message

2012-04-18 12:03:33,348 WARN (ClientGmsImpl.java:145) - join(lhqeetl007-49202)
sent to 64ee3330-af61-141d-06c0-3549e8ada868 timed out (after 3000 ms),
retrying

2012-04-18 12:03:33,350 DEBUG (ClientGmsImpl.java:136) - sending
handleJoin(lhqeetl007-49202) to 64ee3330-af61-141d-06c0-3549e8ada868

2012-04-18 12:03:36,151 WARN (TP.java:1209) - lhqeetl007-49202: no physical
address for 64ee3330-af61-141d-06c0-3549e8ada868, dropping message

2012-04-18 12:03:36,351 WARN (ClientGmsImpl.java:145) - join(lhqeetl007-49202)
sent to 64ee3330-af61-141d-06c0-3549e8ada868 timed out (after 3000 ms),
retrying

2012-04-18 12:03:36,352 DEBUG (ClientGmsImpl.java:84) - initial_mbrs are ,
is_server=true, is_coord=false, logical_name=lhqeetl005-11124,
physical_addrs=10.12.60.85:7818]

2012-04-18 12:03:36,352 DEBUG (ClientGmsImpl.java:308) - election results:
{64ee3330-af61-141d-06c0-3549e8ada868=1}

2012-04-18 12:03:54,369 DEBUG (ClientGmsImpl.java:84) - initial_mbrs are ,
is_server=true, is_coord=false, logical_name=lhqeetl008-11390,
physical_addrs=10.12.60.88:7818]

2012-04-18 12:03:54,369 DEBUG (ClientGmsImpl.java:308) - election results:
{64ee3330-af61-141d-06c0-3549e8ada868=1}

2012-04-18 12:03:54,369 DEBUG (ClientGmsImpl.java:136) - sending
handleJoin(lhqeetl007-49202) to 64ee3330-af61-141d-06c0-3549e8ada868

!!! no lhqesched001 !!!

2012-04-18 12:04:33,503 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:04:33,619 DEBUG (ClientGmsImpl.java:84) - initial_mbrs are ,
is_server=true, is_coord=false, logical_name=lhqeetl008-30012,
physical_addrs=10.12.60.88:7816]

2012-04-18 12:04:33,619 DEBUG (ClientGmsImpl.java:308) - election results:
{f2e73d14-b5eb-6eb6-cd09-96821af8b741=1}

2012-04-18 12:04:33,619 DEBUG (ClientGmsImpl.java:136) - sending
handleJoin(lhqeetl007-53393) to f2e73d14-b5eb-6eb6-cd09-96821af8b741

2012-04-18 12:04:33,619 WARN (TP.java:1209) - lhqeetl007-53393: no physical
address for f2e73d14-b5eb-6eb6-cd09-96821af8b741, dropping message

!!! no lhqesched001 !!!

2012-04-18 12:04:34,106 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:04:48,196 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:04:48,526 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:04:56,450 ^C
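
The repeated "no physical address for ..., dropping message" warnings in the lhqeetl007 log above are easier to follow when counted per target UUID. A rough sketch of such a count in plain Java is below. The class and method names are hypothetical, and the regular expression assumes the warning text looks exactly as in this excerpt (including the line wrap between "physical" and "address").

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper; not part of JGroups.
class DroppedJoinScan {

    // "\\s+" tolerates the line wrap between "physical" and "address" in the excerpt.
    private static final Pattern NO_PHYSICAL_ADDR =
        Pattern.compile("no physical\\s+address for ([0-9a-fA-F-]{36})");

    /** Counts "no physical address" warnings per target UUID, in order of first appearance. */
    static Map<String, Integer> countByTarget(String log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = NO_PHYSICAL_ADDR.matcher(log);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The three warnings from the lhqeetl007 log, copied as they appear in the post.
        String excerpt =
            "2012-04-18 12:03:30,345 WARN (TP.java:1209) - lhqeetl007-49202: no physical\n"
          + "address for 64ee3330-af61-141d-06c0-3549e8ada868, dropping message\n"
          + "2012-04-18 12:03:36,151 WARN (TP.java:1209) - lhqeetl007-49202: no physical\n"
          + "address for 64ee3330-af61-141d-06c0-3549e8ada868, dropping message\n"
          + "2012-04-18 12:04:33,619 WARN (TP.java:1209) - lhqeetl007-53393: no physical\n"
          + "address for f2e73d14-b5eb-6eb6-cd09-96821af8b741, dropping message\n";
        countByTarget(excerpt).forEach((uuid, n) ->
            System.out.println(uuid + " dropped " + n + " time(s)"));
    }
}
```

In this excerpt the UUIDs that show up are the ones named in the "election results" lines, that is, the members lhqeetl007 tries to send its join request to.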


From lhqeetl008 log:

2012-04-18 12:03:51,813 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:03:52,003 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:04:32,157 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:04:32,483 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

!!! lhqeetl007-49202 and lhqeetl007-53393 were added to the view after
lhqesched001 was shut down !!!

2012-04-18 12:04:33,483 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:04:34,095 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:04:48,194 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:04:48,522 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:04:56,526 INFO (JgRpcBase.java:60) - Cluster TestCluster2, new
view:

2012-04-18 12:04:56,889 INFO (JgRpcBase.java:60) - Cluster TestCluster1, new
view:

2012-04-18 12:05:05,394 ^C