Re: [jgroups-users] Nodes are disconnected after a few minutes
From: Bela B. <be...@ya...> - 2013-02-04 11:41:16
On 2/4/13 9:59 AM, Mattias Jiderhamn wrote:
> JGroups seems to find the correct IP address anyway, and an explicit
> -Djgroups.bind_addr makes no difference.

OK

> The machines only have 1 (dual stack) NIC, but it may be worth noting
> they are virtual servers.
> If it picked the wrong IP/NIC/stack, it wouldn't work the first 4 ½
> minutes either, would it? So what can happen to make it stop working?

A lot of things: firewalls, misconfiguration, different versions, broken
NICs, misbehaving switches, an IGMP snooping bug in the firmware, etc.

> It looks like using TCP works, but long term UDP would be preferable.

+1 on UDP

> Is there any part of JGroups where I should increase the logging, to
> understand better what is happening? Or is network traffic analysis the
> only way?

I'd enable TRACE for org.jgroups.protocols.PING and
org.jgroups.protocols.pbcast.GMS. Well, actually, if this stops working
after 4.5 minutes, then simply enable TRACE for org.jgroups. Try running
Draw; it doesn't generate a lot of traffic unless you draw something.

Which version is this with?

> </Mattias>
>
> ----- Original Message -----
> Subject: Re: [jgroups-users] Nodes are disconnected after a few minutes
> Date: Sat, 02 Feb 2013 14:49:50 +0100
> From: Bela Ban <be...@ya...>
>
> You may need to pick a valid NIC. JGroups tries to find a good one, but
> in the worst case, it picks a NIC to network-1 on one box and a NIC to a
> different network-2 on the other box. Or it could pick the VPN tunnel.
>
> To do that, use -Djgroups.bind_addr=1.2.3.4, where 1.2.3.4 is the IP
> address of the given node.
>
> On 2/2/13 8:44 AM, Mattias Jiderhamn wrote:
> > Trying to set up our first JGroups cluster with two nodes, using UDP
> > with pretty much default udp.xml. Everything seems to work fine, but only
> > for about four and a half minutes. Then the failure detection starts
> > suspecting nodes, and they end up in individual views.
> >
> > When using FD_SOCK + FD_ALL, the FD_ALL seems to simply stop receiving
> > heartbeats. I can see in the logs that they are sent, but I'm getting
> > "haven't received a heartbeat from xyz-45609 for 11046 ms, adding it to
> > suspect list".
> >
> > I tried switching from FD_SOCK + FD_ALL to FD_SOCK + FD which seems to
> > have resulted in FD_SOCK failing instead, after the same amount of time.
> > Error message this time is "peer xyz-18917 closed socket (eof)"
> >
> > Switching to FD only, it seems to me like the heartbeats continue
> > working and I get no suspects or view changes - but the actual messages
> > are not received!?
> >
> > Where should I be looking for the problem - in the config or in the
> > network?
>

-- 
Bela Ban, JGroups lead (http://www.jgroups.org)
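
For reference, the advice above to enable TRACE for org.jgroups could be done programmatically along these lines. This is only a rough sketch, and it assumes JGroups is logging through java.util.logging (i.e. no log4j on the classpath); with log4j the equivalent would be raising the org.jgroups logger to TRACE in the log4j configuration instead. The class name JGroupsTrace and the method enableTrace are made up for illustration and are not part of JGroups. Level.ALL is used so that the sketch does not depend on which JDK level JGroups maps "trace" to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of JGroups.
class JGroupsTrace {
    // java.util.logging only holds loggers weakly, so keep strong references
    // to the ones configured here to stop them being collected and reset.
    private static final List<Logger> PINNED = new ArrayList<>();

    // Opens up the named logger (and everything below it) to all levels
    // and attaches a console handler that does not filter anything out.
    static Logger enableTrace(String loggerName) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(Level.ALL);

        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL); // the handler has its own threshold (INFO by default)
        logger.addHandler(handler);

        // Do not also pass records to the root handlers, to avoid duplicate lines.
        logger.setUseParentHandlers(false);

        PINNED.add(logger);
        return logger;
    }

    public static void main(String[] args) {
        // Assumption: this runs before the channel is created, so that
        // discovery and membership traffic is logged from the start.
        enableTrace("org.jgroups");
    }
}
```

To narrow the output down to discovery and membership only, the same call could be made with "org.jgroups.protocols.PING" and "org.jgroups.protocols.pbcast.GMS" instead of the whole "org.jgroups" tree.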