Thread: [Javagroups-development] FD_SOCK...
From: Ilan G. <ila...@el...> - 2006-02-10 18:12:29
Bela et al.,

I set up my platform with 2.2.9.1 and put some new-member creation load on it (launching 10 JGroups member JVMs at a time on a total of about 5 machines, getting to at least 100 members in the group).

Some of my machines simply froze (they were replying to ping, but that's about it; no remote reboot was possible). After a restart, system messages indicated that the kernel had reported out-of-memory problems and had killed Java processes... This put the system in an interesting state. Once I restarted my machines (and killed the JVMs on all but one of the surviving machines), I tried to start a new member. It tried to contact the old (dead) coordinator, because the one surviving JVM that I had forgotten to kill still thought this was the coordinator (monitored by FD_SOCK) and sent its identity to the new member.

I quickly looked at FD_SOCK and I have a few questions:

1. It seems that down() in FD_SOCK does not handle Event.SUSPECT. Does that mean that if another member is suspecting the member FD_SOCK is monitoring, FD_SOCK will just ignore the suspicion?

2. I had the impression that FD_SOCK does not take advantage of the symmetric nature of the TCP connection. Does the monitored (server) member use the TCP connection to monitor the monitoring (client) member?

3. I feel strongly that some heartbeat message should be sent on these idle FD_SOCK TCP connections to detect router or server failure... The incurred network cost can't be significant in a 'normal' system that is already using network resources. This would prevent indefinite lockups like the one I experienced (requiring a complete shutdown of ALL machines in a platform... not always easy, especially when the platform still provides degraded service despite a partial failure).

I'm going to do more tests under 'normal' load and see if the problem occurs again.

Thanks,
Ilan
From: Bela B. <be...@ya...> - 2006-02-15 08:26:28
Ilan Ginzburg wrote:
> Bela et al.,
>
> I set up my platform with 2.2.9.1 and put some new-member creation load
> on it (launching 10 JGroups member JVMs at a time on a total of about
> 5 machines, getting to at least 100 members in the group).
>
> Some of my machines simply froze (they were replying to ping, but that's
> about it; no remote reboot was possible). After a restart, system messages
> indicated that the kernel had reported out-of-memory problems and had
> killed Java processes... This put the system in an interesting state.
>
> Once I restarted my machines (and killed the JVMs on all but one of the
> surviving machines), I tried to start a new member. It tried to contact
> the old (dead) coordinator, because the one surviving JVM that I had
> forgotten to kill still thought this was the coordinator (monitored by
> FD_SOCK) and sent its identity to the new member.
>
> I quickly looked at FD_SOCK and I have a few questions:
>
> 1. It seems that down() in FD_SOCK does not handle Event.SUSPECT. Does
> that mean that if another member is suspecting the member FD_SOCK is
> monitoring, FD_SOCK will just ignore the suspicion?

SUSPECT messages will *never* be received from *above*, always from *below*, so we don't need to handle them in down(), but in up(). In general, however, we will ignore SUSPECT messages and only handle VIEW messages. That's because a SUSPECT doesn't necessarily lead to a new VIEW (VERIFY_SUSPECT might drop it).

> 2. I had the impression that FD_SOCK does not take advantage of the
> symmetric nature of the TCP connection. Does the monitored (server)
> member use the TCP connection to monitor the monitoring (client) member?

No. Failure detection is *not* tied to the transport; it is a separate aspect. Besides, the TCP transport might close connections when they're idle, so we cannot rely on a connection close at the transport to indicate that a member has crashed.

> 3. I feel strongly that some heartbeat message should be sent on these
> idle FD_SOCK TCP connections to detect router or server failure... The
> incurred network cost can't be significant in a 'normal' system that is
> already using network resources. This would prevent indefinite lockups
> like the one I experienced (requiring a complete shutdown of ALL machines
> in a platform... not always easy, especially when the platform still
> provides degraded service despite a partial failure).

This leads to a problem though: at what interval do you send the heartbeat messages? That interval could be too small, or too large, but it will never be ideal.

You should be able to achieve that by simply adding FD on top of FD_SOCK, so you have both failure detection protocols in one stack. The interval of FD should then be high. I have never tried this out though...

--
Bela Ban
Lead JGroups / JBossCache
callto://belaban
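A minimal sketch of what "FD on top of FD_SOCK" could look like with the old plain-text stack configuration. The property values are illustrative placeholders rather than tuned recommendations; in the string format protocols are listed bottom-up, so placing FD after FD_SOCK puts FD above it in the running stack, with a long FD timeout so FD_SOCK remains the primary detector:

import org.jgroups.JChannel;

public class FdOnTopOfFdSock {
    public static void main(String[] args) throws Exception {
        String props =
            "UDP(mcast_addr=228.8.8.8;mcast_port=45566;ip_ttl=32):" +
            "PING(timeout=2000;num_initial_members=3):" +
            "FD_SOCK:" +                        // fast detection of crashed neighbours via the TCP ring
            "FD(timeout=30000;max_tries=3):" +  // slow heartbeat catches hung hosts and dead routers
            "VERIFY_SUSPECT(timeout=1500):" +
            "pbcast.NAKACK(retransmit_timeout=600,1200,2400,4800):" +
            "UNICAST(timeout=600,1200,2400):" +
            "pbcast.STABLE(desired_avg_gossip=20000):" +
            "pbcast.GMS(join_timeout=5000;shun=false;print_local_addr=true)";

        JChannel channel = new JChannel(props);
        channel.connect("fd-test-group");
        System.out.println("Connected, view: " + channel.getView());
        channel.close();
    }
}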
From: Ilan G. <ila...@el...> - 2006-02-15 09:40:58
Bela Ban wrote:
> SUSPECT messages will *never* be received from *above*, always from
> *below*, so we don't need to handle them in down(), but in up().

My mistake; I later discovered the ring structure of FD_SOCK anyway...

>> 2. I had the impression that FD_SOCK does not take advantage of the
>> symmetric nature of the TCP connection. Does the monitored (server)
>> member use the TCP connection to monitor the monitoring (client) member?
>
> No. Failure detection is *not* tied to the transport; it is a separate
> aspect.

I wasn't thinking about the transport but about the TCP connection FD_SOCK opens. In the FD_SOCK ring where A connects to B, which connects to C, which connects to A, if the connection between A and B breaks for some reason, A will suspect B but B will not suspect A.

>> 3. I feel strongly that some heartbeat message should be sent on these
>> idle FD_SOCK TCP connections to detect router or server failure... The
>> incurred network cost can't be significant in a 'normal' system that is
>> already using network resources. This would prevent indefinite lockups
>> like the one I experienced (requiring a complete shutdown of ALL machines
>> in a platform... not always easy, especially when the platform still
>> provides degraded service despite a partial failure).
>
> This leads to a problem though: at what interval do you send the heartbeat
> messages? That interval could be too small, or too large, but it will
> never be ideal.
> You should be able to achieve that by simply adding FD on top of FD_SOCK,
> so you have both failure detection protocols in one stack. The interval
> of FD should then be high.
> I have never tried this out though...

I wasn't aware I could get away with adding FD on top of FD_SOCK, and because I don't have a lot of time to spend on understanding and hacking JGroups, I did a quick "fix": I added a thread that sends bytes on the FD_SOCK link from A to B, with B sending a byte back to A when it receives one. Every time A sends a byte it also checks when the last reply from B was received, and if that reply is too old, A considers that there is a problem (handled the same way as detecting a broken FD_SOCK TCP connection). I added two configuration parameters to FD_SOCK (the number of milliseconds between sending ping bytes, and the number of milliseconds since the last pong reply before deciding something went wrong). I'll send you a patch (is this list the place?) once I'm happy enough with the result, although I don't think it'll have universal appeal ;-)

Ilan
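In outline, the keepalive described above could look roughly like the sketch below, assuming a plain java.net.Socket for the FD_SOCK link. This is not the actual patch; the class and parameter names (pingIntervalMs, maxSilenceMs) are made up for illustration:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class SocketKeepalive implements Runnable {

    private final Socket sock;             // the already-open FD_SOCK connection to the monitored member
    private final long pingIntervalMs;     // how often to send a ping byte
    private final long maxSilenceMs;       // how long without a pong before suspecting the peer
    private volatile long lastPongTime = System.currentTimeMillis();
    private volatile boolean running = true;

    public SocketKeepalive(Socket sock, long pingIntervalMs, long maxSilenceMs) {
        this.sock = sock;
        this.pingIntervalMs = pingIntervalMs;
        this.maxSilenceMs = maxSilenceMs;
    }

    /** Called by the reader thread whenever a pong byte arrives from the peer. */
    public void pongReceived() {
        lastPongTime = System.currentTimeMillis();
    }

    public void stop() {
        running = false;
    }

    public void run() {
        try {
            OutputStream out = sock.getOutputStream();
            while (running) {
                out.write(0);              // ping byte; the peer echoes one byte back
                out.flush();
                if (System.currentTimeMillis() - lastPongTime > maxSilenceMs) {
                    suspectPeer();         // same handling as a broken FD_SOCK connection
                    return;
                }
                Thread.sleep(pingIntervalMs);
            }
        } catch (Exception e) {            // a write failure or interruption also counts as a broken link
            suspectPeer();
        }
    }

    /** Peer side: echo each received ping byte so the sender's timer is refreshed. */
    public static void echoLoop(Socket sock) throws IOException {
        InputStream in = sock.getInputStream();
        OutputStream out = sock.getOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
            out.flush();
        }
    }

    private void suspectPeer() {
        // In FD_SOCK this would translate into broadcasting a SUSPECT
        // message for the monitored member; here it is just a placeholder.
        System.err.println("Peer " + sock.getInetAddress() + " considered dead");
    }
}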
From: Roman R. <rro...@ac...> - 2006-02-15 11:42:45
>>> 3. I feel strongly that some heartbeat message should be sent on these
>>> idle FD_SOCK TCP connections to detect router or server failure... The
>>> incurred network cost can't be significant in a 'normal' system that is
>>> already using network resources. This would prevent indefinite lockups
>>> like the one I experienced (requiring a complete shutdown of ALL machines
>>> in a platform... not always easy, especially when the platform still
>>> provides degraded service despite a partial failure).
>>
>> This leads to a problem though: at what interval do you send the heartbeat
>> messages? That interval could be too small, or too large, but it will
>> never be ideal.
>> You should be able to achieve that by simply adding FD on top of FD_SOCK,
>> so you have both failure detection protocols in one stack. The interval
>> of FD should then be high.
>> I have never tried this out though...
>
> I wasn't aware I could get away with adding FD on top of FD_SOCK, and
> because I don't have a lot of time to spend on understanding and hacking
> JGroups, I did a quick "fix": I added a thread that sends bytes on the
> FD_SOCK link from A to B, with B sending a byte back to A when it receives
> one. Every time A sends a byte it also checks when the last reply from B
> was received, and if that reply is too old, A considers that there is a
> problem (handled the same way as detecting a broken FD_SOCK TCP
> connection). I added two configuration parameters to FD_SOCK (the number
> of milliseconds between sending ping bytes, and the number of milliseconds
> since the last pong reply before deciding something went wrong). I'll send
> you a patch (is this list the place?) once I'm happy enough with the
> result, although I don't think it'll have universal appeal ;-)

A long time ago I implemented the same approach, because FD was firing false alarms under heavy load (100% CPU) and I thought this might be a better solution. However, after extensive testing I found that FD usually works better than a "patched" FD_SOCK, so I never committed the code to CVS.

Hope this helps.

Roman
From: Ilan G. <ila...@el...> - 2006-02-15 11:50:28
Roman Rokytskyy wrote:
> A long time ago I implemented the same approach, because FD was firing
> false alarms under heavy load (100% CPU) and I thought this might be a
> better solution. However, after extensive testing I found that FD usually
> works better than a "patched" FD_SOCK, so I never committed the code to
> CVS.

If your system is CPU-bound I can understand that getting a TCP message through is harder than getting a UDP packet through.

In my production environment, CPU usage might be low on all machines yet I still get false alarms using FD (though maybe some other network traffic is going through the switch).

Unless I find some obvious problem, I'll deploy my patched FD_SOCK and I'll send feedback here in a few months once I get a better feel for how it behaves for real.

Thanks for your feedback,
Ilan
From: Bela B. <be...@ya...> - 2006-02-15 12:07:50
Okay, why don't you do that. I'd be very interested, though, in knowing why FD is causing false alarms when the cluster is idle! Have you scrutinized the logs on the switch, e.g. to see whether UDP datagrams were dropped at some point in time?

Ilan Ginzburg wrote:
> Roman Rokytskyy wrote:
>> A long time ago I implemented the same approach, because FD was firing
>> false alarms under heavy load (100% CPU) and I thought this might be a
>> better solution. However, after extensive testing I found that FD usually
>> works better than a "patched" FD_SOCK, so I never committed the code to
>> CVS.
>
> If your system is CPU-bound I can understand that getting a TCP message
> through is harder than getting a UDP packet through.
>
> In my production environment, CPU usage might be low on all machines yet
> I still get false alarms using FD (though maybe some other network traffic
> is going through the switch).
>
> Unless I find some obvious problem, I'll deploy my patched FD_SOCK and
> I'll send feedback here in a few months once I get a better feel for how
> it behaves for real.
>
> Thanks for your feedback,
> Ilan

--
Bela Ban
Lead JGroups / JBossCache
callto://belaban
From: Roman R. <rro...@ac...> - 2006-02-15 12:08:45
> If your system is CPU-bound I can understand that getting a TCP message
> through is harder than getting a UDP packet through.

It was not related to TCP or UDP. The issue with FD is simple: when the FD thread did not receive enough CPU "quanta" to run, it eventually noticed that the last heartbeat had been received longer ago than the timeout and issued a SUSPECT message. That message went into a queue that already contained other messages, FD was unable to process them in time, so the next SUSPECT was issued... and so on. I used UDP as the transport, but the same behavior could be seen with TCP (on Windows, Linux and Solaris; however, that was more than 2 years ago, so things could have changed since then).

> In my production environment, CPU usage might be low on all machines yet
> I still get false alarms using FD (though maybe some other network traffic
> is going through the switch).
>
> Unless I find some obvious problem, I'll deploy my patched FD_SOCK and
> I'll send feedback here in a few months once I get a better feel for how
> it behaves for real.

Test your system's behavior under high load. My normal operation mode was low-CPU; however, under some conditions (CPU load, network load, etc.) FD or the "patched" FD_SOCK were not able to process heartbeats, the group was split into two or more smaller ones that started to work independently, then MERGE2 merged them back together, and so on. So even if you expect the system to work under low load, test its behavior under very high load too; you might discover undesirable behavior during such peaks.

We just had such a problem in a very large production system (though it uses TIBCO Rendezvous): under load the nodes lost each other, the system tried to recover and failed, the RV buffers overflowed, and in the end there was a big bang. We added two more CPUs and the problems went away.

Roman
From: Bela B. <be...@ya...> - 2006-02-15 12:05:55
Ilan Ginzburg wrote:
>> No. Failure detection is *not* tied to the transport; it is a separate
>> aspect.
>
> I wasn't thinking about the transport but about the TCP connection
> FD_SOCK opens. In the FD_SOCK ring where A connects to B, which connects
> to C, which connects to A, if the connection between A and B breaks for
> some reason, A will suspect B but B will not suspect A.

That is as designed. Any member X will always only suspect the member to its right. So in A - B - C, A will suspect B, B will suspect C, and C will suspect A. This is unidirectional, not bidirectional.

> I wasn't aware I could get away with adding FD on top of FD_SOCK, and
> because I don't have a lot of time to spend on understanding and hacking
> JGroups, I did a quick "fix": I added a thread that sends bytes on the
> FD_SOCK link from A to B, with B sending a byte back to A when it receives
> one. Every time A sends a byte it also checks when the last reply from B
> was received, and if that reply is too old, A considers that there is a
> problem (handled the same way as detecting a broken FD_SOCK TCP
> connection). I added two configuration parameters to FD_SOCK (the number
> of milliseconds between sending ping bytes, and the number of milliseconds
> since the last pong reply before deciding something went wrong). I'll send
> you a patch (is this list the place?) once I'm happy enough with the
> result, although I don't think it'll have universal appeal ;-)

I'm not sure I want to mix the static failure detection of the socket connection with heartbeating; to do that more elegantly, simply add FD to the stack.

--
Bela Ban
Lead JGroups / JBossCache
callto://belaban
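A small sketch of the unidirectional ring selection described above, assuming the current view is an ordered member list; this is not the actual FD_SOCK source, just the idea that each member monitors only the next member in the view, wrapping around at the end:

import java.util.Arrays;
import java.util.List;

public class RingNeighbour {

    // Return the member that 'self' should monitor: the next one in the view,
    // wrapping around, or null if there is nobody else to monitor.
    static <T> T pingDest(List<T> members, T self) {
        int idx = members.indexOf(self);
        if (idx < 0 || members.size() < 2)
            return null;
        return members.get((idx + 1) % members.size());
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("A", "B", "C");
        for (String m : view)
            System.out.println(m + " monitors " + pingDest(view, m));
        // Prints: A monitors B, B monitors C, C monitors A -- never the reverse direction
    }
}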