Re: [jgroups-users] Nodes unable to join cluster when one node is hung
From: Bela B. <be...@ya...> - 2012-12-15 09:11:43
Which version of JGroups? What's your XML config?

On 12/15/12 12:32 AM, Ramakrishna Kandula wrote:
>
> Jgroups experts,
>
> We are facing a problem where the nodes are not able to join when one node is
> hung.
>
> Nodes A, B, C are in cluster. C has crashed/hung .. it was accepting network
> connection but not responding. The view has changed and now includes only
> members A, B.
> Now when a node D is trying to join, inside JChannel.connect, it was doing a
> discovery and connected to this hung node C and not coming out of there. We
> have set the GMS join timeout but it is not taking effect.
>
> Is there any way to avoid this as one hung member can cause discovery issues
> and affect the entire cluster? We have verified by looking at the heap dump
> that this thread is in fact reading from the socket connected to hung member C.
>
> We are using gossip router for discovery and A is our gossip router. The stack
> trace of where it is hanging is attached below.
>
> Thanks,
> ramky
>
> "main" prio=10 tid=0x000000005910e000 nid=0x2202 runnable [0x0000000040e7e000]
>    java.lang.Thread.State: RUNNABLE
>     at java.net.SocketInputStream.socketRead0(Native Method)
>     at java.net.SocketInputStream.read(Unknown Source)
>     at com.sun.net.ssl.internal.ssl.InputRecord.readFully(Unknown Source)
>     at com.sun.net.ssl.internal.ssl.InputRecord.read(Unknown Source)
>     at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(Unknown Source)
>     - locked <0x00000000c2f4a630> (a java.lang.Object)
>     at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
>     - locked <0x00000000c2f4a5f0> (a java.lang.Object)
>     at com.sun.net.ssl.internal.ssl.SSLSocketImpl.writeRecord(Unknown Source)
>     at com.sun.net.ssl.internal.ssl.AppOutputStream.write(Unknown Source)
>     - locked <0x00000000c2f4afb8> (a com.sun.net.ssl.internal.ssl.AppOutputStream)
>     at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>     at java.io.BufferedOutputStream.flush(Unknown Source)
>     - locked <0x00000000c2f2d250> (a java.io.BufferedOutputStream)
>     at java.io.DataOutputStream.flush(Unknown Source)
>     at org.jgroups.blocks.TCPConnectionMap$TCPConnection.sendLocalAddress(TCPConnectionMap.java:548)
>     at org.jgroups.blocks.TCPConnectionMap$TCPConnection.<init>(TCPConnectionMap.java:397)
>     at org.jgroups.blocks.TCPConnectionMap$Mapper.getConnection(TCPConnectionMap.java:785)
>     at org.jgroups.blocks.TCPConnectionMap.send(TCPConnectionMap.java:174)
>     at org.jgroups.protocols.TCP.send(TCP.java:56)
>     at org.jgroups.protocols.BasicTCP.sendUnicast(BasicTCP.java:86)
>     at org.jgroups.protocols.TP.sendToSingleMember(TP.java:1306)
>     at org.jgroups.protocols.TP.doSend(TP.java:1299)
>     at org.jgroups.protocols.TP.send(TP.java:1285)
>     at org.jgroups.protocols.TP.down(TP.java:1143)
>     at org.jgroups.protocols.Discovery.sendDiscoveryRequest(Discovery.java:276)
>     at org.jgroups.protocols.Discovery.findMembers(Discovery.java:216)
>     at org.jgroups.protocols.Discovery.findInitialMembers(Discovery.java:197)
>     at org.jgroups.protocols.Discovery.down(Discovery.java:527)
>     at org.jgroups.protocols.MERGE2.down(MERGE2.java:181)
>     at org.jgroups.protocols.FD.down(FD.java:308)
>     at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:80)
>     at org.jgroups.protocols.pbcast.NAKACK.down(NAKACK.java:539)
>     at org.jgroups.protocols.UNICAST2.down(UNICAST2.java:531)
>     at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:328)
>     at org.jgroups.protocols.AUTH.down(AUTH.java:180)
>     at org.jgroups.protocols.pbcast.ClientGmsImpl.findInitialMembers(ClientGmsImpl.java:227)
>     at org.jgroups.protocols.pbcast.ClientGmsImpl.joinInternal(ClientGmsImpl.java:74)
>     at org.jgroups.protocols.pbcast.ClientGmsImpl.join(ClientGmsImpl.java:38)
>     at org.jgroups.protocols.pbcast.GMS.down(GMS.java:941)
>     at org.jgroups.protocols.FlowControl.down(FlowControl.java:351)
>     at org.jgroups.protocols.FlowControl.down(FlowControl.java:351)
>     at org.jgroups.protocols.FRAG2.down(FRAG2.java:147)
>     at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1025)
>     at org.jgroups.JChannel.down(JChannel.java:729)
>     at org.jgroups.JChannel.connect(JChannel.java:291)
>     - locked <0x00000000c2ee56f0> (a org.jgroups.JChannel)
>     at org.jgroups.JChannel.connect(JChannel.java:262)
>

--
Bela Ban, JGroups lead (http://www.jgroups.org)
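
For anyone trying to reproduce the symptom outside JGroups: the trace shows the joining thread blocked in SocketInputStream.socketRead0 while the SSL handshake waits for a reply from a peer that accepted the connection but never answers. A blocking read on a java.net.Socket with no SO_TIMEOUT waits until data arrives, which would be consistent with the join timeout not taking effect. The sketch below uses plain java.net only and is not JGroups code; the class name HungPeerDemo, the method readTimesOut and the timeout values are made up for illustration. It stands in for the hung node with a ServerSocket that never calls accept() and shows that Socket.setSoTimeout bounds the read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of JGroups.
public class HungPeerDemo {

    /** Returns true if the read gave up with a timeout instead of blocking. */
    static boolean readTimesOut(int connectTimeoutMs, int readTimeoutMs) throws IOException {
        // Stand-in for the hung node: the OS completes the TCP handshake from
        // the listen backlog, but the application never accepts or replies.
        try (ServerSocket hung = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket sock = new Socket()) {
            sock.connect(new InetSocketAddress(hung.getInetAddress(), hung.getLocalPort()),
                         connectTimeoutMs);
            // A value of 0 (the default) means the read blocks indefinitely.
            sock.setSoTimeout(readTimeoutMs);
            InputStream in = sock.getInputStream();
            try {
                in.read(); // no data will ever arrive from the "hung" peer
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the read was bounded by SO_TIMEOUT
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(1000, 500));
    }
}
```

Note that the connect timeout alone does not help in this scenario, because the connection to the hung peer is established successfully; only the subsequent read hangs. Whether and how the JGroups version in use lets you set a read timeout on its TCP connections depends on the version and configuration, hence the questions above.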