[jgroups-users] Nodes unable to join cluster when one node is hung
From: Ramakrishna K. <ra...@gm...> - 2012-12-14 23:32:38
JGroups experts,

We are facing a problem where nodes are unable to join the cluster when one node is hung.

Nodes A, B, and C are in a cluster. C has crashed/hung: it was still accepting network connections but not responding. The view has changed and now includes only members A and B. When a node D then tries to join, inside JChannel.connect, it performs discovery, connects to the hung node C, and never returns from there. We have set the GMS join timeout, but it is not taking effect.

Is there any way to avoid this? One hung member can cause discovery issues and affect the entire cluster.

We have verified from a heap dump that this thread is in fact reading from the socket connected to hung member C. We are using a gossip router for discovery, and A is our gossip router.

The stack trace showing where it hangs is below.

Thanks,
ramky

"main" prio=10 tid=0x000000005910e000 nid=0x2202 runnable [0x0000000040e7e000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(Unknown Source)
    at com.sun.net.ssl.internal.ssl.InputRecord.readFully(Unknown Source)
    at com.sun.net.ssl.internal.ssl.InputRecord.read(Unknown Source)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(Unknown Source)
    - locked <0x00000000c2f4a630> (a java.lang.Object)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
    - locked <0x00000000c2f4a5f0> (a java.lang.Object)
    at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecord(Unknown Source)
    at com.sun.net.ssl.internal.ssl.AppOutputStream.write(Unknown Source)
    - locked <0x00000000c2f4afb8> (a com.sun.net.ssl.internal.ssl.AppOutputStream)
    at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
    at java.io.BufferedOutputStream.flush(Unknown Source)
    - locked <0x00000000c2f2d250> (a java.io.BufferedOutputStream)
    at java.io.DataOutputStream.flush(Unknown Source)
    at org.jgroups.blocks.TCPConnectionMap$TCPConnection.sendLocalAddress(TCPConnectionMap.java:548)
    at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:397)
    at org.jgroups.blocks.TCPConnectionMap$Mapper.getConnection(TCPConnectionMap.java:785)
    at org.jgroups.blocks.TCPConnectionMap.send(TCPConnectionMap.java:174)
    at org.jgroups.protocols.TCP.send(TCP.java:56)
    at org.jgroups.protocols.BasicTCP.sendUnicast(BasicTCP.java:86)
    at org.jgroups.protocols.TP.sendToSingleMember(TP.java:1306)
    at org.jgroups.protocols.TP.doSend(TP.java:1299)
    at org.jgroups.protocols.TP.send(TP.java:1285)
    at org.jgroups.protocols.TP.down(TP.java:1143)
    at org.jgroups.protocols.Discovery.sendDiscoveryRequest(Discovery.java:276)
    at org.jgroups.protocols.Discovery.findMembers(Discovery.java:216)
    at org.jgroups.protocols.Discovery.findInitialMembers(Discovery.java:197)
    at org.jgroups.protocols.Discovery.down(Discovery.java:527)
    at org.jgroups.protocols.MERGE2.down(MERGE2.java:181)
    at org.jgroups.protocols.FD.down(FD.java:308)
    at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:80)
    at org.jgroups.protocols.pbcast.NAKACK.down(NAKACK.java:539)
    at org.jgroups.protocols.UNICAST2.down(UNICAST2.java:531)
    at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:328)
    at org.jgroups.protocols.AUTH.down(AUTH.java:180)
    at org.jgroups.protocols.pbcast.ClientGmsImpl.findInitialMembers(ClientGmsImpl.java:227)
    at org.jgroups.protocols.pbcast.ClientGmsImpl.joinInternal(ClientGmsImpl.java:74)
    at org.jgroups.protocols.pbcast.ClientGmsImpl.join(ClientGmsImpl.java:38)
    at org.jgroups.protocols.pbcast.GMS.down(GMS.java:941)
    at org.jgroups.protocols.FlowControl.down(FlowControl.java:351)
    at org.jgroups.protocols.FlowControl.down(FlowControl.java:351)
    at org.jgroups.protocols.FRAG2.down(FRAG2.java:147)
    at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
    at org.jgroups.JChannel.down(JChannel.java:729)
    at org.jgroups.JChannel.connect(JChannel.java:291)
    - locked <0x00000000c2ee56f0> (a org.jgroups.JChannel)
    at org.jgroups.JChannel.connect(JChannel.java:262)
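
To illustrate the failure mode in the trace (a peer that accepts the TCP connection but never sends anything, leaving the caller stuck in socketRead0), here is a minimal sketch using only plain JDK sockets. It is not JGroups code and does not use any JGroups setting. The class name HungPeerDemo, the helper readTimesOut, and the timeout values are made up for the example. The "hung" peer is simulated by a ServerSocket that never calls accept(), relying on the OS to complete the TCP handshake from the listen backlog. The sketch only shows that a blocking read with no SO_TIMEOUT has nothing to interrupt it, whereas a read timeout set on the socket bounds the wait. As far as I know the same SO_TIMEOUT also applies to reads made during an SSL handshake, but that is not exercised here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class HungPeerDemo {

    /**
     * Connects to a peer and attempts one blocking read.
     * Returns true if the read gave up because of SO_TIMEOUT,
     * false if the peer sent a byte or closed the connection.
     * (Hypothetical helper, for illustration only.)
     */
    static boolean readTimesOut(InetAddress host, int port,
                                int connectTimeoutMs, int readTimeoutMs) throws IOException {
        try (Socket sock = new Socket()) {
            // Bounds only the TCP connect; a hung peer may still accept.
            sock.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            // Bounds every subsequent blocking read on this socket.
            // With the default of 0, read() would block indefinitely.
            sock.setSoTimeout(readTimeoutMs);
            InputStream in = sock.getInputStream();
            try {
                in.read();
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the hung node: the socket is listening, so connects
        // succeed, but the application never accepts or writes anything.
        try (ServerSocket hung = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            long start = System.nanoTime();
            boolean timedOut = readTimesOut(hung.getInetAddress(), hung.getLocalPort(), 2000, 500);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("timedOut=" + timedOut + " elapsedMs=" + elapsedMs);
        }
    }
}
```

Judging from the trace, the join appears to block below GMS, inside the transport's connection setup. That would be consistent with the GMS join timeout not taking effect, though this is an inference from the stack and not something confirmed against the JGroups source.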