javagroups-users Mailing List for JGroups
Brought to you by: belaban
From: Questions/problems r. to u. J. <jav...@li...> - 2023-09-12 11:09:27
FYI, I just released 5.3. Details here: http://belaban.blogspot.com/2023/09/jgroups-53-released.html Cheers -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-11-22 01:29:38
Thanks for the confirmation Bela, I was easily able to modify the Infinispan configuration to include the FORK [1] and the application is no longer prone to random OOMs during bootstrap. Cheers, Johnathan [1] <config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.0.xsd"> ... <FORK> <fork-stacks> <fork-stack id="hijack-stack"/> </fork-stacks> </FORK> <FRAG3 /> </config> if (cacheManager.getTransport() instanceof JGroupsTransport) { JGroupsTransport jGroupsTransport = (JGroupsTransport) cacheManager.getTransport(); ProtocolStack stack = jGroupsTransport.getChannel().getProtocolStack(); Class<? extends Protocol> neighborProtocol = stack.findProtocol(FRAG2.class) != null ? FRAG2.class : FRAG3.class; channel = new ForkChannel(jGroupsTransport.getChannel(), "hijack-stack", "lead-hijacker", false, ProtocolStack.Position.ABOVE, neighborProtocol); On Fri, 21 Oct 2022, 16:40 Questions/problems related to using JGroups via javagroups-users, <jav...@li...> wrote: > Hi Jonathan > > yes, the best solution is to define FORK in the JGroups section of the > Infinispan configuration, as you mentioned. > > If you don't control the configuration, then it gets a bit more > tricky... you could (*before sending any brodcast traffic*) insert FORK > dynamically into every JChannel instance. > To do that, you need to get the JChannel; IIRC, via (paraphrased) > cache.getExtendedCache().getRpcManager().getTransport(), downcast it to > JGroupsTransport, then call getChannel(). > > Once you have JChannel, call > channel.getProtocolStack().insertProtocolAtTop(new FORK(...)); > Hope this helps, > > [1] http://www.jgroups.org/manual5/index.html#ForkChannel > > On 21.10.22 16:02, Questions/problems related to using JGroups wrote: > > Hi all, > > > > We have run into an interesting race condition when attempting to use a > > fork channel in our application, we more or less follow what Bela wrote > > here > > > http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html > < > http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html> > . > > > > When the cluster comes up and receives views the node broadcasts > > information about itself over the fork channel so the cluster knows what > > each node is capable of handling. > > > > Unfortunately from what I can see there is a race condition on bootstrap > > where the JGroups stack is started and receives a fork message before > > the fork is inserted into the stack (fork not present in stack trace) > > [1] which results in garbage / unknown data passing through the > > Infinispan marshaller.. if you are lucky enough it will read an > > extremely large int and try and allocate that into a byte array > > resulting in the JVM to throw an OOM or NegativeArraySizeException > > > > I believe one possible solution is to define the fork inside the > > jgroups.xml which is used to create the initial jgroups stack which > > would hopefully discard fork channel messages (until the message > > listener is registered) and not pass them up the stack resulting in > > undefined behaviour. > > > > I had a look at the fork documentation but there are not many examples, > > does my possible solution seem feasible or does someone have alternative > > solutions? I am currently looking through the Infinispan code to see if > > there is any way to decorate jgroups before it starts. 
> > > > Thanks in advance, > > > > Johnathan > > > > [1] > > > > 2022-10-20 10:42:28,515 ERROR [jgroups-89,service-2] > > (org.infinispan.CLUSTER) ISPN000474: Error processing request 0@service-2 > > java.lang.NegativeArraySizeException: -436207616 > > at > > > org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:904) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:891) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:715) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:358) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:192) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:221) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport.processRequest(JGroupsTransport.java:1361) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1301) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:130) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.lambda$up$0(JGroupsTransport.java:1450) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at org.jgroups.util.MessageBatch.forEach(MessageBatch.java:318) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1450) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at org.jgroups.JChannel.up(JChannel.java:796) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:903) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.FRAG3.up(FRAG3.java:187) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:297) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.UNICAST3.deliverBatch(UNICAST3.java:1071) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:886) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.UNICAST3.handleBatchReceived(UNICAST3.java:852) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:501) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:689) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:197) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at 
org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.TP.passBatchUp(TP.java:1408) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.passBatchUp(MaxOneThreadPerSender.java:284) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > org.jgroups.util.SubmitToThreadPool$BatchHandler.run(SubmitToThreadPool.java:136) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.run(MaxOneThreadPerSender.java:273) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > > at > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > > at java.lang.Thread.run(Thread.java:829) ~[?:?] > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > > -- > Bela Ban | http://www.jgroups.org > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > |
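Pulling the pieces of this thread together, a minimal sketch of the pattern under discussion, assuming JGroups 4.x and a main-channel configuration that already declares <FORK> with a "hijack-stack" fork stack as above (the class name, config file name, cluster name and payload below are illustrative only, not part of the actual application):

    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.fork.ForkChannel;

    public class ForkChannelSketch {
        public static void main(String[] args) throws Exception {
            // Main channel; its config already contains <FORK> with fork stack "hijack-stack",
            // so fork messages that arrive early are handled by FORK rather than being passed
            // up to the application (or, in the scenario above, to Infinispan's marshaller).
            JChannel main = new JChannel("jgroups.xml");

            // Attach a fork channel to the pre-declared fork stack
            ForkChannel fork = new ForkChannel(main, "hijack-stack", "capability-channel");
            fork.setReceiver(new ReceiverAdapter() {
                @Override public void receive(Message msg) {
                    System.out.println("capabilities from " + msg.getSrc() + ": " + msg.getObject());
                }
            });

            main.connect("demo-cluster");
            fork.connect("ignored"); // the cluster name passed to a fork channel is ignored

            // Broadcast this node's capabilities only after both channels are connected
            fork.send(new Message(null, "node-capabilities"));
        }
    }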
From: Questions/problems r. to u. J. <jav...@li...> - 2022-10-21 14:39:08
Hi Jonathan yes, the best solution is to define FORK in the JGroups section of the Infinispan configuration, as you mentioned. If you don't control the configuration, then it gets a bit more tricky... you could (*before sending any brodcast traffic*) insert FORK dynamically into every JChannel instance. To do that, you need to get the JChannel; IIRC, via (paraphrased) cache.getExtendedCache().getRpcManager().getTransport(), downcast it to JGroupsTransport, then call getChannel(). Once you have JChannel, call channel.getProtocolStack().insertProtocolAtTop(new FORK(...)); Hope this helps, [1] http://www.jgroups.org/manual5/index.html#ForkChannel On 21.10.22 16:02, Questions/problems related to using JGroups wrote: > Hi all, > > We have run into an interesting race condition when attempting to use a > fork channel in our application, we more or less follow what Bela wrote > here > http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html <http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html> . > > When the cluster comes up and receives views the node broadcasts > information about itself over the fork channel so the cluster knows what > each node is capable of handling. > > Unfortunately from what I can see there is a race condition on bootstrap > where the JGroups stack is started and receives a fork message before > the fork is inserted into the stack (fork not present in stack trace) > [1] which results in garbage / unknown data passing through the > Infinispan marshaller.. if you are lucky enough it will read an > extremely large int and try and allocate that into a byte array > resulting in the JVM to throw an OOM or NegativeArraySizeException > > I believe one possible solution is to define the fork inside the > jgroups.xml which is used to create the initial jgroups stack which > would hopefully discard fork channel messages (until the message > listener is registered) and not pass them up the stack resulting in > undefined behaviour. > > I had a look at the fork documentation but there are not many examples, > does my possible solution seem feasible or does someone have alternative > solutions? I am currently looking through the Infinispan code to see if > there is any way to decorate jgroups before it starts. 
> > Thanks in advance, > > Johnathan > > [1] > > 2022-10-20 10:42:28,515 ERROR [jgroups-89,service-2] > (org.infinispan.CLUSTER) ISPN000474: Error processing request 0@service-2 > java.lang.NegativeArraySizeException: -436207616 > at > org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:904) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:891) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:715) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:358) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:192) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:221) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport.processRequest(JGroupsTransport.java:1361) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1301) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:130) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.lambda$up$0(JGroupsTransport.java:1450) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at org.jgroups.util.MessageBatch.forEach(MessageBatch.java:318) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1450) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at org.jgroups.JChannel.up(JChannel.java:796) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:903) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.FRAG3.up(FRAG3.java:187) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:297) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.UNICAST3.deliverBatch(UNICAST3.java:1071) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:886) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.UNICAST3.handleBatchReceived(UNICAST3.java:852) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:501) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:689) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:197) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > 
at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.TP.passBatchUp(TP.java:1408) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.passBatchUp(MaxOneThreadPerSender.java:284) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > org.jgroups.util.SubmitToThreadPool$BatchHandler.run(SubmitToThreadPool.java:136) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.run(MaxOneThreadPerSender.java:273) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?] > at java.lang.Thread.run(Thread.java:829) ~[?:?] > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban | http://www.jgroups.org |
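A hedged sketch of the dynamic alternative mentioned above, i.e. pushing FORK onto the top of an existing channel's stack before any fork traffic is sent (the helper class is invented for illustration; depending on the JGroups version, init()/start() may also have to be called on the inserted protocol, so treat this as a starting point rather than a drop-in solution):

    import org.jgroups.JChannel;
    import org.jgroups.protocols.FORK;
    import org.jgroups.stack.ProtocolStack;

    public final class ForkInserter {
        /** Adds FORK at the top of the channel's stack if it is not already configured. */
        public static void ensureFork(JChannel ch) throws Exception {
            ProtocolStack stack = ch.getProtocolStack();
            if (stack.findProtocol(FORK.class) != null)
                return; // FORK already present, e.g. declared in the XML config
            // Do this before connecting, or at least before any fork messages can arrive
            stack.insertProtocolAtTop(new FORK());
        }
    }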
From: Questions/problems r. to u. J. <jav...@li...> - 2022-10-21 14:02:42
Hi all, We have run into an interesting race condition when attempting to use a fork channel in our application, we more or less follow what Bela wrote here http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html . When the cluster comes up and receives views the node broadcasts information about itself over the fork channel so the cluster knows what each node is capable of handling. Unfortunately from what I can see there is a race condition on bootstrap where the JGroups stack is started and receives a fork message before the fork is inserted into the stack (fork not present in stack trace) [1] which results in garbage / unknown data passing through the Infinispan marshaller.. if you are lucky enough it will read an extremely large int and try and allocate that into a byte array resulting in the JVM to throw an OOM or NegativeArraySizeException I believe one possible solution is to define the fork inside the jgroups.xml which is used to create the initial jgroups stack which would hopefully discard fork channel messages (until the message listener is registered) and not pass them up the stack resulting in undefined behaviour. I had a look at the fork documentation but there are not many examples, does my possible solution seem feasible or does someone have alternative solutions? I am currently looking through the Infinispan code to see if there is any way to decorate jgroups before it starts. Thanks in advance, Johnathan [1] 2022-10-20 10:42:28,515 ERROR [jgroups-89,service-2] (org.infinispan.CLUSTER) ISPN000474: Error processing request 0@service-2 java.lang.NegativeArraySizeException: -436207616 at org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:904) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:891) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:715) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:358) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:192) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:221) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processRequest(JGroupsTransport.java:1361) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1301) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:130) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.lambda$up$0(JGroupsTransport.java:1450) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.jgroups.util.MessageBatch.forEach(MessageBatch.java:318) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1450) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.jgroups.JChannel.up(JChannel.java:796) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:903) 
~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.FRAG3.up(FRAG3.java:187) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:297) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.UNICAST3.deliverBatch(UNICAST3.java:1071) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:886) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.UNICAST3.handleBatchReceived(UNICAST3.java:852) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:501) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:689) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:197) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.TP.passBatchUp(TP.java:1408) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.passBatchUp(MaxOneThreadPerSender.java:284) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.util.SubmitToThreadPool$BatchHandler.run(SubmitToThreadPool.java:136) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.run(MaxOneThreadPerSender.java:273) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?] at java.lang.Thread.run(Thread.java:829) ~[?:?] |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-10-17 14:06:13
On 17.10.22 14:30, Questions/problems related to using JGroups wrote: > Hi, > > I see that the suspect() method is only called on the coordinator node > in my cluster when another node disappears; no other nodes in the > cluster get the message. Is this the expected behavior? The docs don't > say one way or the other: > > http://www.jgroups.org/javadoc4/org/jgroups/MembershipListener.html#suspect(org.jgroups.Address) <http://www.jgroups.org/javadoc4/org/jgroups/MembershipListener.html#suspect(org.jgroups.Address)> Yes, this is correct: coordinator (or newly promoted coords) get this callback only. > We're using jgroups 4.1.8 currently. Note that the suspect() / unsuspect() were removed in 5.0. We wrote code, way back with > jgroups 3.4, where every node needs to know if a member left the view > suspect or not. IMO there's better ways of doing this, as suspect() is only an indication and may never result in a view change. If a member P needs to leave gracefully, I'd have P broadcast a LEAVING_GRACEFULLY message to all, before it leaves. Everyone receives this message and caches it. When the actual view change arrives, members part of the previous view but not the current view, and *not* part of the cache, crashed. All others left gracefully. The cache needs to be adjusted on every view change. > The nodes use this to compare the current cluster size > to a "stable" size that changes more slowly (changes to match cluster > size right away if nodes leave gracefully since that means a human > caused it on purpose). So either suspect() was called on every node back > then or we really missed something during testing. It was seven years > ago so we could have definitely missed it. > > Can send more information if you'd like it, but is this the way it's > supposed to work? Is there anything else I can do have suspect() called > on each node? > > (If not, I currently think my best option is to have non-coordinator > nodes treat any decrease in view size as suspect until, after the view > change, the coordinator can send them a message telling them that a node > left gracefully and that the current actual size is "stable.") > > Thank you, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban | http://www.jgroups.org |
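A sketch of the pattern Bela describes, assuming JGroups 4.x (the marker payload, class name and logging are illustrative): a member broadcasts a marker before closing its channel, every member caches the sender, and on the next view change the departed members are classified as graceful or crashed.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import org.jgroups.Address;
    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class GracefulLeaveTracker extends ReceiverAdapter {
        private final Set<Address> leavingGracefully = ConcurrentHashMap.newKeySet();
        private volatile View lastView;

        /** Call on a member right before it closes its channel. */
        public static void announceLeave(JChannel ch) throws Exception {
            ch.send(new Message(null, "LEAVING_GRACEFULLY"));
        }

        @Override public void receive(Message msg) {
            if ("LEAVING_GRACEFULLY".equals(msg.getObject()))
                leavingGracefully.add(msg.getSrc());
        }

        @Override public void viewAccepted(View newView) {
            if (lastView != null) {
                for (Address member : lastView.getMembers()) {
                    if (newView.containsMember(member))
                        continue; // still in the cluster
                    boolean graceful = leavingGracefully.remove(member);
                    System.out.println(member + (graceful ? " left gracefully" : " crashed (no leave announcement)"));
                }
            }
            // Adjust the cache on every view change: keep only announcements from current members
            leavingGracefully.retainAll(newView.getMembers());
            lastView = newView;
        }
    }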
From: Questions/problems r. to u. J. <jav...@li...> - 2022-10-17 12:58:03
Hi, I see that the suspect() method is only called on the coordinator node in my cluster when another node disappears; no other nodes in the cluster get the message. Is this the expected behavior? The docs don't say one way or the other: http://www.jgroups.org/javadoc4/org/jgroups/MembershipListener.html#suspect(org.jgroups.Address) We're using jgroups 4.1.8 currently. We wrote code, way back with jgroups 3.4, where every node needs to know if a member left the view suspect or not. The nodes use this to compare the current cluster size to a "stable" size that changes more slowly (changes to match cluster size right away if nodes leave gracefully since that means a human caused it on purpose). So either suspect() was called on every node back then or we really missed something during testing. It was seven years ago so we could have definitely missed it. Can send more information if you'd like it, but is this the way it's supposed to work? Is there anything else I can do have suspect() called on each node? (If not, I currently think my best option is to have non-coordinator nodes treat any decrease in view size as suspect until, after the view change, the coordinator can send them a message telling them that a node left gracefully and that the current actual size is "stable.") Thank you, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-10 10:25:09
On 06.06.22 22:15, Questions/problems related to using JGroups wrote: > Although, looking at this again, I think we might not be talking about > the same setup. From this: > > > > On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > [....] > > > > > Right, and what they want is some way to fully remove a node > from a > > cluster. I.e. the cluster stops trying to contact that address. > > > Then you would have to remove the 130 node from the old cluster's > initial_hosts (TCPPING) and TCP's logical address cache. Either by > restarting, or by programmatically removing it. This can get > complex > quickly though, as you'd have to maintain a list of ports per > cluster. > > > Each cluster is separate from all the others, so I don't know what I > would need to keep in this list or why a cluster would need it. Referring to my previous email: if you use FILE_PING, each cluster has a _separate_ directory (the cluster name) under which the discovery info is stored. > If a cluster has A/B/C/D in it, and the code sees that D leaves the cluster > without going suspect first, can I programmatically do these? For TCP, it's complicated, but doable. Among other things you'd have to: - Close all TCP connections to D - Close all connections to D in UNICAST3, too - Remove D's info from the address cache (contents: 'probe.sh uuids') - Remove D's information from all instances of TCPPING (initial_hosts and dynamic_hosts) Again, using a dynamic discovery protocol such as FILE_PING makes more sense here. > - set new initial_hosts on the existing TCPPING protocol in my stack to > include only A/B/C > - access the logical address cache and remove the address Yes, but this is not enough (see above). > I mean, I know I can hack the TCPPING again, but didn't know that would > have any effect on the existing channel and members. I don't know > offhand how to access the address cache, which I think is all I'm > missing to experiment with this. Pseudo code: TP tp=channel.getProtocolStack().getTransport(); LazyRemovalCache cache=tp.getLogicalAddressCache(); cache.remove(address, true); // force removal > If I can do the above then I think that > solves the issue -- if a suspect member leaves the view I won't do > anything, because we want to keep trying it in case it was disconnected > and reconnected. But if a member leaves gracefully and the above is all > I need to make the cluster forget about it, that's great and means we > wouldn't have to change any startup features for the customers. > > Thanks again, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-10 09:47:52
Hi Bobby On 06.06.22 21:08, Questions/problems related to using JGroups wrote: > Hi again Bela et al, > > We've finally come back to this issue after not working on the product > for a while. I'm keeping the context all below, but the short version > was that we use TCPPING and, if someone removes a node with address X > and, later, starts a new cluster that includes the address, the old > cluster keeps trying to find its lost buddy at X. Right, and I suggested using a dynamic discovery protocol, *not* TCPPING. > We're still back on v4.1.8 and I wanted to ask if the suggestion below, > i.e. use TCPGOSSIP or FILE_PING (this is for in-house deployments on > their own networks) is the most appropriate, and if there would be any > benefit for this particular issue by moving to v5.X? There are loads of benefits by moving to 5.x :-) But, specifically to this case, only the ability to have multiple discovery protocols in the same stack would be beneficial here. I guess MULTI_PING in 4.x might do the same job though... > The way they run > things now is to put host:port info for each node in a file and then > start the applications, which read that file to set initial hosts. So > FILE_PING might be the best for them so that we don't need to have any > new processes running. Yes, the benefits/drawbacks of FILE_PING are + No additional process needed + All processes access a shared dir, e.g. on NFS - NFS adds overhead (but only for discovery) + The discovery info is human-readable, and can thus be modified manually (if needed) > Thanks, > Bobby > > On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > > > On 25.05.21 18:59, Questions/problems related to using JGroups wrote: > > On Tue, May 25, 2021 at 10:30 AM Questions/problems related to using > > JGroups via javagroups-users > <jav...@li... > <mailto:jav...@li...> > > <mailto:jav...@li... > <mailto:jav...@li...>>> wrote: > > > > Hi Bobby > > apologies for the delay! > > > > > > No problem -- thanks for looking. > > > > > > You cannot have the old cluster's initial_hosts be > 128,129,130 and the > > new one has the overlapping range 130,131. > > > > > > That's the problem. The customer has lots of nodes, clusters that > grow > > and shrink, and they're going to reuse the same IP addresses > eventually. > > > Then using TCPPING for the discovery is the wrong solution; it is > designed for a static cluster with a fixed and known membership. > > For the above requirements, I'd rather recommend: > * A dynamic discovery mechanism (TCPGOSSIP, FILE_PING, GOOGLE_PING etc) > * Emphemeral ports > * A new (different) cluster name for each new cluster that is started > > > > The old cluster will try to contact 130 (e.g. trying to > merge), thereby > > send its information to 130. > > > > > > Right, and what they want is some way to fully remove a node from a > > cluster. I.e. the cluster stops trying to contact that address. > > > Then you would have to remove the 130 node from the old cluster's > initial_hosts (TCPPING) and TCP's logical address cache. Either by > restarting, or by programmatically removing it. This can get complex > quickly though, as you'd have to maintain a list of ports per cluster. > > The first solution above is much better IMO. > > > > What is it you're trying to achieve? > > > > > > Simply to take a node out of a cluster when it's not needed, then > later > > reuse the address of that node with a different cluster. 
If I > change the > > cluster names (same port though) then I still get constant > warnings, like: > > JGRP000012: discarded message from different cluster <old> (our > cluster > > is <new>). Sender was <some addr> > > > > We can suggest that they restart the cluster after removing a > node, but > > I don't know if that will work for them. I'll also try using > different > > ports for different clusters and see how that works for them. > > That will certainly work, but - again - you'd have to maintain ports > numbers for each cluster. Registration service? Excel spreadsheet? > > > > Given the size of the company in question, I can see that it > might be hard to > > coordinate that and eventually they'll get back in the same > situation > > where a previously used address is being used again with the same > port > > it used the last time. > > Right. So I have to come back to my suggestion of not using TCPPING! > Cheers, > > > > Thanks, > > Bobby > > > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > <mailto:jav...@li...> > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > <https://lists.sourceforge.net/lists/listinfo/javagroups-users> > > > > -- > Bela Ban | http://www.jgroups.org <http://www.jgroups.org> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > <mailto:jav...@li...> > https://lists.sourceforge.net/lists/listinfo/javagroups-users > <https://lists.sourceforge.net/lists/listinfo/javagroups-users> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban | http://www.jgroups.org |
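For illustration, a sketch of what swapping the discovery protocol could look like with the programmatic style of configuration used elsewhere in this thread (JGroups 4.x; the shared directory, port and the heavily abbreviated protocol list are placeholders, not the product's actual stack):

    import org.jgroups.JChannel;
    import org.jgroups.protocols.FD_ALL;
    import org.jgroups.protocols.FILE_PING;
    import org.jgroups.protocols.MERGE3;
    import org.jgroups.protocols.TCP;
    import org.jgroups.protocols.UNICAST3;
    import org.jgroups.protocols.VERIFY_SUSPECT;
    import org.jgroups.protocols.pbcast.GMS;
    import org.jgroups.protocols.pbcast.NAKACK2;
    import org.jgroups.protocols.pbcast.STABLE;

    public class FilePingStack {
        public static JChannel create(String sharedDir) throws Exception {
            // FILE_PING replaces TCPPING: discovery data lives in a shared directory
            // (one subdirectory per cluster name), so no static initial_hosts list is needed
            return new JChannel(
                new TCP().setValue("bind_port", 7800),
                new FILE_PING().setValue("location", sharedDir)
                               .setValue("remove_old_coords_on_view_change", true),
                new MERGE3(),
                new FD_ALL(),
                new VERIFY_SUSPECT(),
                new NAKACK2().setValue("use_mcast_xmit", false),
                new UNICAST3(),
                new STABLE(),
                new GMS());
        }
    }

With this kind of setup, each cluster only needs a distinct cluster name; the shared directory takes care of discovery, so reused IP addresses no longer have to be tracked in a static host list.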
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-06 20:40:05
Although, looking at this again, I think we might not be talking about the same setup. From this: > > On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li...> > wrote: > >> [....] >> >> > >> > Right, and what they want is some way to fully remove a node from a >> > cluster. I.e. the cluster stops trying to contact that address. >> >> >> Then you would have to remove the 130 node from the old cluster's >> initial_hosts (TCPPING) and TCP's logical address cache. Either by >> restarting, or by programmatically removing it. This can get complex >> quickly though, as you'd have to maintain a list of ports per cluster. > > Each cluster is separate from all the others, so I don't know what I would need to keep in this list or why a cluster would need it. If a cluster has A/B/C/D in it, and the code sees that D leaves the cluster without going suspect first, can I programmatically do these? - set new initial_hosts on the existing TCPPING protocol in my stack to include only A/B/C - access the logical address cache and remove the address I mean, I know I can hack the TCPPING again, but didn't know that would have any effect on the existing channel and members. I don't know offhand how to access the address cache, which I think is all I'm missing to experiment with this. If I can do the above then I think that solves the issue -- if a suspect member leaves the view I won't do anything, because we want to keep trying it in case it was disconnected and reconnected. But if a member leaves gracefully and the above is all I need to make the cluster forget about it, that's great and means we wouldn't have to change any startup features for the customers. Thanks again, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-06 19:35:18
Hi again Bela et al, We've finally come back to this issue after not working on the product for a while. I'm keeping the context all below, but the short version was that we use TCPPING and, if someone removes a node with address X and, later, starts a new cluster that includes the address, the old cluster keeps trying to find its lost buddy at X. We're still back on v4.1.8 and I wanted to ask if the suggestion below, i.e. use TCPGOSSIP or FILE_PING (this is for in-house deployments on their own networks) is the most appropriate, and if there would be any benefit for this particular issue by moving to v5.X? The way they run things now is to put host:port info for each node in a file and then start the applications, which read that file to set initial hosts. So FILE_PING might be the best for them so that we don't need to have any new processes running. Thanks, Bobby On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > > > On 25.05.21 18:59, Questions/problems related to using JGroups wrote: > > On Tue, May 25, 2021 at 10:30 AM Questions/problems related to using > > JGroups via javagroups-users <jav...@li... > > <mailto:jav...@li...>> wrote: > > > > Hi Bobby > > apologies for the delay! > > > > > > No problem -- thanks for looking. > > > > > > You cannot have the old cluster's initial_hosts be 128,129,130 and > the > > new one has the overlapping range 130,131. > > > > > > That's the problem. The customer has lots of nodes, clusters that grow > > and shrink, and they're going to reuse the same IP addresses eventually. > > > Then using TCPPING for the discovery is the wrong solution; it is > designed for a static cluster with a fixed and known membership. > > For the above requirements, I'd rather recommend: > * A dynamic discovery mechanism (TCPGOSSIP, FILE_PING, GOOGLE_PING etc) > * Emphemeral ports > * A new (different) cluster name for each new cluster that is started > > > > The old cluster will try to contact 130 (e.g. trying to merge), > thereby > > send its information to 130. > > > > > > Right, and what they want is some way to fully remove a node from a > > cluster. I.e. the cluster stops trying to contact that address. > > > Then you would have to remove the 130 node from the old cluster's > initial_hosts (TCPPING) and TCP's logical address cache. Either by > restarting, or by programmatically removing it. This can get complex > quickly though, as you'd have to maintain a list of ports per cluster. > > The first solution above is much better IMO. > > > > What is it you're trying to achieve? > > > > > > Simply to take a node out of a cluster when it's not needed, then later > > reuse the address of that node with a different cluster. If I change the > > cluster names (same port though) then I still get constant warnings, > like: > > JGRP000012: discarded message from different cluster <old> (our cluster > > is <new>). Sender was <some addr> > > > > We can suggest that they restart the cluster after removing a node, but > > I don't know if that will work for them. I'll also try using different > > ports for different clusters and see how that works for them. > > That will certainly work, but - again - you'd have to maintain ports > numbers for each cluster. Registration service? Excel spreadsheet? 
> > > > Given the size of the company in question, I can see that it might be > hard to > > coordinate that and eventually they'll get back in the same situation > > where a previously used address is being used again with the same port > > it used the last time. > > Right. So I have to come back to my suggestion of not using TCPPING! > Cheers, > > > > Thanks, > > Bobby > > > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > > > > -- > Bela Ban | http://www.jgroups.org > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-11-22 15:27:41
Hi again, Thanks for this -- seems like an obvious answer but I wanted to be sure. On Mon, Nov 22, 2021 at 3:43 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > > If they use this as a kind of health service ping, why don't they use a > different port? > I think it's a security tool, but I don't know much about it. AFAIK it attempts to connect to any port in use no matter what is running there to check for problems. Cheers, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-11-22 08:43:46
On 18.11.21 13:49, Questions/problems related to using JGroups wrote: > Hi, > > One of our customers has a security tool that constantly tries to > connect to each server, so the jgroups logs have this happening every ~4 > few seconds: > > 2021-10-28 05:18:35 org.jgroups.protocols.TCP warn WARN: JGRP000006: > failed accepting connection from peer > java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at > org.jgroups.blocks.cs.TcpConnection.readPeerAddress(TcpConnection.java:245) > at org.jgroups.blocks.cs.TcpConnection.<init>(TcpConnection.java:53) > at org.jgroups.blocks.cs.TcpServer$Acceptor.handleAccept(TcpServer.java:126) > at org.jgroups.blocks.cs.TcpServer$Acceptor.run(TcpServer.java:111) > at java.lang.Thread.run(Thread.java:748) > > Can that affect the performance of jgroups? I don't think so, this causes an additional TCP connection to be (half-)established, but it will be torn down immediately, so a bit of processing. If they use this as a kind of health service ping, why don't they use a different port? > I see it on all nodes, but > one of their nodes, a primary database, when under load sometimes > doesn't see view changes that the other nodes see until a minute or more > later. That should be unrelated... > Since the above happens on *every* node I'd think it's unrelated > but wanted to check. I know it's kind of a *qualitative* question, sorry. > > This is with jgroups 4.1.8.Final using a TCP stack. Can get you the full > channel creation info if it helps. > > Thanks, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-11-18 13:46:54
Hi, One of our customers has a security tool that constantly tries to connect to each server, so the jgroups logs have this happening every ~4 seconds: 2021-10-28 05:18:35 org.jgroups.protocols.TCP warn WARN: JGRP000006: failed accepting connection from peer java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at org.jgroups.blocks.cs.TcpConnection.readPeerAddress(TcpConnection.java:245) at org.jgroups.blocks.cs.TcpConnection.<init>(TcpConnection.java:53) at org.jgroups.blocks.cs.TcpServer$Acceptor.handleAccept(TcpServer.java:126) at org.jgroups.blocks.cs.TcpServer$Acceptor.run(TcpServer.java:111) at java.lang.Thread.run(Thread.java:748) Can that affect the performance of jgroups? I see it on all nodes, but one of their nodes, a primary database, when under load sometimes doesn't see view changes that the other nodes see until a minute or more later. Since the above happens on *every* node I'd think it's unrelated but wanted to check. I know it's kind of a *qualitative* question, sorry. This is with jgroups 4.1.8.Final using a TCP stack. I can get you the full channel creation info if it helps. Thanks, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-09-29 10:21:24
I'm considering removing support for JMX in 5.2 [1]. Is anyone using JMX at all to obtain info about a running JGroups system? I've used probe.sh for quite a while now and haven't really used JMX in a long time. Feedback welcome! [1] https://issues.redhat.com/browse/JGRP-2572 -- Bela Ban | http://www.jgroups.org |
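For context, the kind of JMX usage that would disappear with JGRP-2572: programmatic registration of a channel and its protocols as MBeans, roughly as in the sketch below (JGroups 4.x; the domain and cluster name are placeholders).

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;

    import org.jgroups.JChannel;
    import org.jgroups.jmx.JmxConfigurator;

    public class JmxExample {
        public static void main(String[] args) throws Exception {
            JChannel ch = new JChannel(); // default stack
            ch.connect("jmx-demo");
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Last argument = true also registers every protocol in the stack,
            // making their attributes and operations visible in jconsole and friends
            JmxConfigurator.registerChannel(ch, server, "jgroups", ch.getClusterName(), true);
        }
    }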
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-26 07:26:10
On 25.05.21 18:59, Questions/problems related to using JGroups wrote: > On Tue, May 25, 2021 at 10:30 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > Hi Bobby > apologies for the delay! > > > No problem -- thanks for looking. > > > You cannot have the old cluster's initial_hosts be 128,129,130 and the > new one has the overlapping range 130,131. > > > That's the problem. The customer has lots of nodes, clusters that grow > and shrink, and they're going to reuse the same IP addresses eventually. Then using TCPPING for the discovery is the wrong solution; it is designed for a static cluster with a fixed and known membership. For the above requirements, I'd rather recommend: * A dynamic discovery mechanism (TCPGOSSIP, FILE_PING, GOOGLE_PING etc) * Emphemeral ports * A new (different) cluster name for each new cluster that is started > The old cluster will try to contact 130 (e.g. trying to merge), thereby > send its information to 130. > > > Right, and what they want is some way to fully remove a node from a > cluster. I.e. the cluster stops trying to contact that address. Then you would have to remove the 130 node from the old cluster's initial_hosts (TCPPING) and TCP's logical address cache. Either by restarting, or by programmatically removing it. This can get complex quickly though, as you'd have to maintain a list of ports per cluster. The first solution above is much better IMO. > What is it you're trying to achieve? > > > Simply to take a node out of a cluster when it's not needed, then later > reuse the address of that node with a different cluster. If I change the > cluster names (same port though) then I still get constant warnings, like: > JGRP000012: discarded message from different cluster <old> (our cluster > is <new>). Sender was <some addr> > > We can suggest that they restart the cluster after removing a node, but > I don't know if that will work for them. I'll also try using different > ports for different clusters and see how that works for them. That will certainly work, but - again - you'd have to maintain ports numbers for each cluster. Registration service? Excel spreadsheet? > Given the size of the company in question, I can see that it might be hard to > coordinate that and eventually they'll get back in the same situation > where a previously used address is being used again with the same port > it used the last time. Right. So I have to come back to my suggestion of not using TCPPING! Cheers, > Thanks, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-25 17:59:07
On Tue, May 25, 2021 at 10:30 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > Hi Bobby > apologies for the delay! > No problem -- thanks for looking. > > You cannot have the old cluster's initial_hosts be 128,129,130 and the > new one has the overlapping range 130,131. > That's the problem. The customer has lots of nodes, clusters that grow and shrink, and they're going to reuse the same IP addresses eventually. > > The old cluster will try to contact 130 (e.g. trying to merge), thereby > send its information to 130. > Right, and what they want is some way to fully remove a node from a cluster. I.e. the cluster stops trying to contact that address. > > What is it you're trying to achieve? > Simply to take a node out of a cluster when it's not needed, then later reuse the address of that node with a different cluster. If I change the cluster names (same port though) then I still get constant warnings, like: JGRP000012: discarded message from different cluster <old> (our cluster is <new>). Sender was <some addr> We can suggest that they restart the cluster after removing a node, but I don't know if that will work for them. I'll also try using different ports for different clusters and see how that works for them. Given the size of the company in question, I can see that it might be hard to coordinate that and eventually they'll get back in the same situation where a previously used address is being used again with the same port it used the last time. Thanks, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-25 14:30:37
Hi Bobby apologies for the delay! You cannot have the old cluster's initial_hosts be 128,129,130 and the new one has the overlapping range 130,131. The old cluster will try to contact 130 (e.g. trying to merge), thereby send its information to 130. Depending on traffic patterns, everbody will know everyone's else's address, or not. For example, it could be that 128 and 130 know everyone else, but 129 and 131 don't know each other. In the former case, there will be a merge to {128,129,130,131}. In the latter case, members will fail to talk to other members, as they don't have the other members in their logical address cache. If the old cluster didn't have 130 in its initial_hosts, everything would be fine. What is it you're trying to achieve? If you're trying to start a new cluster, then either give it a new cluster name and/or a new set of (unused) ports. Both cluster names and ports could be dished out by a server accessible to all. Cheers On 04.05.21 21:00, Questions/problems related to using JGroups wrote: > On Tue, May 4, 2021 at 1:53 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > [...] > > > > 3. Later they use a node with the same address to join a different > > cluster with the same name. > > Can you post an example? Note that discovery requests from different > clusters are discarded. > > > Sure, but in summary: I can't reuse an IP address after it's already > been in a cluster. The customer is trying to run separate clusters, but > the address of a node in one of them was previously in a different one, > and that is causing problems. > > My config is programmatic; I've included it below. We use a custom > authentication class. When authenticate() is called it will output the > source and the response it's returning. I've set the jgroups logging to > DEBUG level; my application only logs the initial_hosts it sets and the > authentication calls. The member addresses end in: 128, 129, 130, and 131. > > 1. I start a cluster with (started in this order) 128, 129, 130. Each of > them has all three of those addresses in initial_hosts. > > 2. I shut down the application running on 130. The logs for 128 and 129 > have "*** stopping application on .130" in them right before this. > > 3. I start an application on 131 that has 130/131 in initial_hosts. > > 4. I start a new application on the node with the 130 address. It has > 130 and 131 in initial hosts. The logs on 128 and 129 have "*** new > application on .130 starting and will join new cluster with .131" in > them to show when it happens. > > About a minute later, the errors start showing up. The 128 application > is trying to connect to the one running on 130 even though that one had > previously shut down and left the cluster. The new one on 130 doesn't > let it join, and there are merge views repeating with warning messages > throughout. There is a merge view change every minute or so in the > original cluster (128/129). > > The stack we create (comments and text changes for sharing): > > public JChannel createJChannel() throws Exception { > Logger logger = <...> > logger.log(Level.DEBUG, "Creating default JChannel."); > List<Protocol> stack = new ArrayList<>(); > final Protocol tcp = new TCP() > // bind_addr will be same address, e.g. 
.128, .129, etc > that we use in initial_hosts > .setValue("bind_addr", > InetAddress.getByName(getBindingAddress())) > .setValue("bind_port", bindingPort) > .setValue("thread_pool_min_threads", 1) > .setValue("thread_pool_keep_alive_time", 5000) > .setValue("send_buf_size", 640000) > .setValue("sock_conn_timeout", 300) > .setValue("recv_buf_size", 5000000); > // some optional things we could add to tcp removed. not used > in this example > stack.add(tcp); > stack.add(new TCPPING() > // the parseHostList method will output the list for this > example at ERROR level > .setValue("initial_hosts", parseHostList()) > .setValue("send_cache_on_join", true) > .setValue("port_range", 0)); > stack.add(new MERGE3() > .setValue("min_interval", 10000) > .setValue("max_interval", 30000)); > FD_ALL fdAll = new FD_ALL(); > final long jgroupsTimeout = <> > fdAll.setValue("timeout", jgroupsTimeout); > final long maxInterval = jgroupsTimeout / 3L; // to have ~3 > heartbeats before going suspect. <jira number removed> > if (maxInterval < fdAll.getInterval()) { > logger.log(Level.WARN, "......."); > fdAll.setValue("interval", maxInterval); > } > stack.add(fdAll); > stack.add(new VERIFY_SUSPECT() > .setValue("timeout", 1500)); > stack.add(new BARRIER()); > if (getBoolean(<an application property>)) { > logger.debug("adding jgroups asym encryption"); > stack.add(new ASYM_ENCRYPT() > .setValue("sym_keylength", 128) > .setValue("sym_algorithm", "AES/CBC/PKCS5Padding") > .setValue("sym_iv_length", 16) > .setValue("asym_keylength", 2048) > .setValue("asym_algorithm", "RSA") > .setValue("change_key_on_leave", true)); > } > stack.add(new NAKACK2() > .setValue("use_mcast_xmit", false)); > stack.add(new UNICAST3()); > stack.add(new STABLE() > .setValue("desired_avg_gossip", 50000) > .setValue("max_bytes", 4000000)); > // protocol will log auth request source and response > stack.add(createAuthProtocol()); > stack.add(new GMS() > .setValue("join_timeout", 3000)); > stack.add(new MFC() > .setValue("max_credits", 2000000) > .setValue("min_credits", 800000)); > stack.add(new FRAG2()); > stack.add(new STATE_TRANSFER()); > return new JChannel(stack); > } > > Thanks again, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-17 07:39:08
|
This should be somewhere in TCP_NIO2: <TCP_NIO2 max_length="5M".../> Use 'ant make-schema' to generate a schema from the sources. On 14.05.21 23:24, Questions/problems related to using JGroups wrote: > I am trying to understand how i adjust my configuration to take > advantage of the fix for JGRP-2523 > <https://issues.redhat.com/browse/JGRP-2523> but am unable to figure out > what element the new attribute should be applied to in my configuration. > > From what i can tell, the new attribute isn't anywhere in any of he > .xsd's so I am unable to even create a channel if my config tries to use it. > > Here is my sample config: > > <!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml --> > <config xmlns="urn:org:jgroups" > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" > xsi:schemaLocation="urn:org:jgroups > http://www.jgroups.org/schema/jgroups-4.0.xsd"> > <TCP_NIO2 > recv_buf_size="${tcp.recv_buf_size:128K}" > send_buf_size="${tcp.send_buf_size:128K}" > max_bundle_size="64K" > sock_conn_timeout="1000" > > thread_pool.enabled="true" > thread_pool.min_threads="1" > thread_pool.max_threads="10" > thread_pool.keep_alive_time="5000"/> > > <CENTRAL_LOCK /> > > <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING > location="${HA_JGROUPS_DIR}" > remove_old_coords_on_view_change="true"/> > <MERGE3 max_interval="30000" > min_interval="10000"/> > <FD_SOCK/> > <FD timeout="3000" max_tries="3" /> > <VERIFY_SUSPECT timeout="1500" /> > <BARRIER /> > <pbcast.NAKACK2 use_mcast_xmit="false" > discard_delivered_msgs="true"/> > <UNICAST3 /> > <!-- > When a new node joins a cluster, initial message broadcast doesn't > necessarily seem > to arrive. Using a shorter cycles in the STABLE protocol makes the > cluster recognize > this dropped transmission and cause a retransmission. > --> > <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" > max_bytes="4M"/> > <pbcast.GMS print_local_addr="true" join_timeout="3000" > view_bundling="true" > max_join_attempts="5"/> > <MFC max_credits="2M" > min_threshold="0.4"/> > <FRAG2 frag_size="60K" /> > <pbcast.STATE_TRANSFER /> > <!-- pbcast.FLUSH /--> > </config> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
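A hedged sketch of applying that attribute without regenerating the schema: load the XML above (with max_length left out), then set the attribute on the transport programmatically. That setValue() accepts "max_length" with a plain byte count is an assumption to verify against your JGroups version; the config file name and cluster name are placeholders.

import org.jgroups.JChannel;
import org.jgroups.protocols.TP;

public class SetMaxLength {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel("jgroups.xml");            // the config shown above, minus max_length
        TP transport = ch.getProtocolStack().getTransport();  // TCP_NIO2 in this stack
        transport.setValue("max_length", 5_000_000);          // roughly "5M"; max_length lives on the transport
        ch.connect("jenkins-ha");                             // placeholder cluster name
    }
}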
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-14 21:24:56
|
I am trying to understand how I adjust my configuration to take advantage of the fix for JGRP-2523 [https://issues.redhat.com/browse/JGRP-2523] but am unable to figure out what element the new attribute should be applied to in my configuration. From what I can tell, the new attribute isn't anywhere in any of the .xsd's so I am unable to even create a channel if my config tries to use it. Here is my sample config: <!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml --> <config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.0.xsd"> <TCP_NIO2 recv_buf_size="${tcp.recv_buf_size:128K}" send_buf_size="${tcp.send_buf_size:128K}" max_bundle_size="64K" sock_conn_timeout="1000" thread_pool.enabled="true" thread_pool.min_threads="1" thread_pool.max_threads="10" thread_pool.keep_alive_time="5000"/> <CENTRAL_LOCK /> <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING location="${HA_JGROUPS_DIR}" remove_old_coords_on_view_change="true"/> <MERGE3 max_interval="30000" min_interval="10000"/> <FD_SOCK/> <FD timeout="3000" max_tries="3" /> <VERIFY_SUSPECT timeout="1500" /> <BARRIER /> <pbcast.NAKACK2 use_mcast_xmit="false" discard_delivered_msgs="true"/> <UNICAST3 /> <!-- When a new node joins a cluster, initial message broadcast doesn't necessarily seem to arrive. Using shorter cycles in the STABLE protocol makes the cluster recognize this dropped transmission and cause a retransmission. --> <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M"/> <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" max_join_attempts="5"/> <MFC max_credits="2M" min_threshold="0.4"/> <FRAG2 frag_size="60K" /> <pbcast.STATE_TRANSFER /> <!-- pbcast.FLUSH /--> </config> |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-04 19:07:38
|
On Tue, May 4, 2021 at 1:53 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > [...] > > > > 3. Later they use a node with the same address to join a different > > cluster with the same name. > > Can you post an example? Note that discovery requests from different > clusters are discarded. > Sure, but in summary: I can't reuse an IP address after it's already been in a cluster. The customer is trying to run separate clusters, but the address of a node in one of them was previously in a different one, and that is causing problems. My config is programmatic; I've included it below. We use a custom authentication class. When authenticate() is called it will output the source and the response it's returning. I've set the jgroups logging to DEBUG level; my application only logs the initial_hosts it sets and the authentication calls. The member addresses end in: 128, 129, 130, and 131. 1. I start a cluster with (started in this order) 128, 129, 130. Each of them has all three of those addresses in initial_hosts. 2. I shut down the application running on 130. The logs for 128 and 129 have "*** stopping application on .130" in them right before this. 3. I start an application on 131 that has 130/131 in initial_hosts. 4. I start a new application on the node with the 130 address. It has 130 and 131 in initial hosts. The logs on 128 and 129 have "*** new application on .130 starting and will join new cluster with .131" in them to show when it happens. About a minute later, the errors start showing up. The 128 application is trying to connect to the one running on 130 even though that one had previously shut down and left the cluster. The new one on 130 doesn't let it join, and there are merge views repeating with warning messages throughout. There is a merge view change every minute or so in the original cluster (128/129). The stack we create (comments and text changes for sharing): public JChannel createJChannel() throws Exception { Logger logger = <...> logger.log(Level.DEBUG, "Creating default JChannel."); List<Protocol> stack = new ArrayList<>(); final Protocol tcp = new TCP() // bind_addr will be same address, e.g. .128, .129, etc that we use in initial_hosts .setValue("bind_addr", InetAddress.getByName(getBindingAddress())) .setValue("bind_port", bindingPort) .setValue("thread_pool_min_threads", 1) .setValue("thread_pool_keep_alive_time", 5000) .setValue("send_buf_size", 640000) .setValue("sock_conn_timeout", 300) .setValue("recv_buf_size", 5000000); // some optional things we could add to tcp removed. not used in this example stack.add(tcp); stack.add(new TCPPING() // the parseHostList method will output the list for this example at ERROR level .setValue("initial_hosts", parseHostList()) .setValue("send_cache_on_join", true) .setValue("port_range", 0)); stack.add(new MERGE3() .setValue("min_interval", 10000) .setValue("max_interval", 30000)); FD_ALL fdAll = new FD_ALL(); final long jgroupsTimeout = <> fdAll.setValue("timeout", jgroupsTimeout); final long maxInterval = jgroupsTimeout / 3L; // to have ~3 heartbeats before going suspect. 
<jira number removed> if (maxInterval < fdAll.getInterval()) { logger.log(Level.WARN, "......."); fdAll.setValue("interval", maxInterval); } stack.add(fdAll); stack.add(new VERIFY_SUSPECT() .setValue("timeout", 1500)); stack.add(new BARRIER()); if (getBoolean(<an application property>)) { logger.debug("adding jgroups asym encryption"); stack.add(new ASYM_ENCRYPT() .setValue("sym_keylength", 128) .setValue("sym_algorithm", "AES/CBC/PKCS5Padding") .setValue("sym_iv_length", 16) .setValue("asym_keylength", 2048) .setValue("asym_algorithm", "RSA") .setValue("change_key_on_leave", true)); } stack.add(new NAKACK2() .setValue("use_mcast_xmit", false)); stack.add(new UNICAST3()); stack.add(new STABLE() .setValue("desired_avg_gossip", 50000) .setValue("max_bytes", 4000000)); // protocol will log auth request source and response stack.add(createAuthProtocol()); stack.add(new GMS() .setValue("join_timeout", 3000)); stack.add(new MFC() .setValue("max_credits", 2000000) .setValue("min_credits", 800000)); stack.add(new FRAG2()); stack.add(new STATE_TRANSFER()); return new JChannel(stack); } Thanks again, Bobby |
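For context, a minimal sketch of the lifecycle around a channel built this way; the node and cluster names are placeholders, and the stub at the bottom merely stands in for the createJChannel() shown above:

import org.jgroups.JChannel;

public class NodeBootstrap {
    public static void main(String[] args) throws Exception {
        JChannel channel = createJChannel();
        channel.setName("node-130");            // optional logical name (placeholder)
        channel.connect("cluster-A");           // the cluster name that both clusters end up reusing
        try {
            // ... application traffic ...
        } finally {
            channel.close();                    // regular shutdown / graceful leave, as in step 2
        }
    }

    // Stand-in for the createJChannel() defined in the post above.
    static JChannel createJChannel() throws Exception {
        return new JChannel();                  // default stack; replace with the TCP stack above
    }
}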
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-04 05:53:29
|
On 03.05.21 17:02, Questions/problems related to using JGroups wrote: > Hi again, > > Thanks for this. I have more information from the customer now, and see > that the problem they're having isn't due to incorrect host information > at startup like I thought. The setup to reproduce is pretty simple, and > I understand their point that it doesn't look like user error. > > 1. Set up cluster A/B/C (A is coordinator). > 2. At some point they don't need C in the cluster anymore and shut down > the application there. It's a regular shutdown, not going suspect first. > We use JChannel#close and then exit. OK > 3. Later they use a node with the same address to join a different > cluster with the same name. Can you post an example? Note that discovery requests from different clusters are discarded. > When C starts it only has D's address, and cluster D/C. > > After the above, the A/B cluster is getting a merge view change every > ~minute, always including only A/B in the view. The log on A is also > filling with: > JGRP000032: <A>: no physical address for <D>, dropping message > Because it's a merge view, we do extra processing to handle potential > rejoin cases, which causes a couple other warnings every minute. > > I also see every ~minute that A tries to authorize itself with C. C's > log has messages from our custom AuthToken class. > > > If I use a different cluster for C/D that avoids a lot of the issues. > There are no longer view changes and warnings in the first cluster, but > the new one D/C has this in C's log constantly: > JGRP000012: discarded message from different cluster <old> (our cluster > is <new>). Sender was <A> > > That will help them some, but it's a large organization and they have a > lot of clusters, since we thought it would be ok to reuse the name as > long as the addresses weren't shared. Is there anything we can do to > make a cluster forget a member that has left gracefully? You lost me early in your description of the case... can you post a simple example, with 2 configs including TCPPING? In general, I recommend separating the sets of {TCP.bind_addr, TCPPING.initial_hosts) cleanly for each cluster, plus including *all* of the members of a cluster in TCPPING.initial_hosts. If you can't do that, then look into using a dynamic discovery mechanism. Cheers > Thanks, > Bobby > > > > On Tue, Apr 6, 2021 at 7:46 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > You can always change the list of initial hosts in TCPPING > programmatically, via getInitialHosts() / setInitialHosts(). > > Detecting that an address is wrong is outside the scope of JGroups, and > should be done (IMO) by your application, e.g. at > config/installation/startup time. > > This can of course be arbitrarily difficult, e.g. > * See if a symbolic name resolves correctly > * Check if a host is pingable > > You could also disallow a user from entering hostnames/IP addresses > him/herself directly and instead generate them yourself, e.g. by > recording all hosts on which an installation was performed and using > this as initial_hosts. > > You could also think of adding a protocol which checks (in init() or > start()) that the hostnames/addresses in TCPPING.initial_hosts resolve, > and possibly ping all entries before starting the stack. > > On a related note, take a look at [1] (added in 4.2.12): it skips > unresolved/unresolvable entries until an entry finally does resolve. 
> > Hope this helps, > > [1] https://issues.redhat.com/browse/JGRP-2535 > <https://issues.redhat.com/browse/JGRP-2535> > > On 05.04.21 22:50, Questions/problems related to using JGroups wrote: > > Hi, > > > > Our product uses the TCP stack with jgroups 4.1.8. It gets set up > by end > > users through a configuration file that contains (among other > things), a > > list of IP addresses for a node to connect to when joining a > cluster. We > > set this for TCPPING.initial_hosts. > > > > If they have a wrong address at startup they end up > getting JGRP000032 > > warnings filling the logs. For instance, the following leads to logs > > filling on two nodes, one of which was set up correctly: > > > > 1. Start cluster A/B. A is the coordinator. > > 2. Start a one-node cluster C. > > 3. On node D, include addresses for D and B in the initial hosts > list > > and attempt to join. > > 4. D will join C for a cluster C/D and, obviously, not join A/B > since it > > didn't attempt to connect to the coordinator. > > > > After this, the logs for D will fill with: > > WARN: JGRP000032: <D>: no physical address for <A>, dropping message > > > > ...and B logs will fill with: > > WARN: JGRP000032: <B>: no physical address for <C>, dropping message > > > > I know this is a setup error on the user's side, but was > wondering if > > there's anything we could add programmatically to stop it. For > instance, > > when they see the logs on X filling up with messages about Y in > another > > cluster, is there something we could do to tell X to forget Y > exists? > > It's not enough just to stop/fix/start that cluster, as (in the > case of > > A/B above) the cluster that was started correctly could be > showing this > > problem. For some customers, getting a maintenance window to shut > down > > all related clusters and restart them is a problem. > > > > For that matter, is there anything programmatically we could do to > > detect that this is happening? Besides parsing the jgroups logging > > output I mean. > > > > Thank you, > > Bobby > > > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > <mailto:jav...@li...> > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > <https://lists.sourceforge.net/lists/listinfo/javagroups-users> > > > > -- > Bela Ban | http://www.jgroups.org <http://www.jgroups.org> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > <mailto:jav...@li...> > https://lists.sourceforge.net/lists/listinfo/javagroups-users > <https://lists.sourceforge.net/lists/listinfo/javagroups-users> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
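A small sketch for verifying the "separate the sets cleanly" recommendation at runtime, using the getInitialHosts() accessor mentioned in the quoted reply. The element type of the returned collection varies across JGroups releases, so entries are treated as opaque objects here, and the helper's class and method names are made up:

import java.util.Collection;
import org.jgroups.JChannel;
import org.jgroups.protocols.TCPPING;

public class InitialHostsDump {
    // Prints what TCPPING will actually contact, so disjointness between clusters can be checked.
    public static void dump(JChannel ch) {
        TCPPING ping = ch.getProtocolStack().findProtocol(TCPPING.class);
        if (ping == null) {
            System.out.println("no TCPPING in this stack");
            return;
        }
        Collection<?> hosts = ping.getInitialHosts();
        hosts.forEach(h -> System.out.println("initial host: " + h));
    }
}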
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-03 15:03:51
|
Hi again, Thanks for this. I have more information from the customer now, and see that the problem they're having isn't due to incorrect host information at startup like I thought. The setup to reproduce is pretty simple, and I understand their point that it doesn't look like user error. 1. Set up cluster A/B/C (A is coordinator). 2. At some point they don't need C in the cluster anymore and shut down the application there. It's a regular shutdown, not going suspect first. We use JChannel#close and then exit. 3. Later they use a node with the same address to join a different cluster with the same name. When C starts it only has D's address, and forms cluster D/C. After the above, the A/B cluster is getting a merge view change every ~minute, always including only A/B in the view. The log on A is also filling with: JGRP000032: <A>: no physical address for <D>, dropping message Because it's a merge view, we do extra processing to handle potential rejoin cases, which causes a couple other warnings every minute. I also see every ~minute that A tries to authorize itself with C. C's log has messages from our custom AuthToken class. If I use a different cluster for C/D that avoids a lot of the issues. There are no longer view changes and warnings in the first cluster, but the new one D/C has this in C's log constantly: JGRP000012: discarded message from different cluster <old> (our cluster is <new>). Sender was <A> That will help them some, but it's a large organization and they have a lot of clusters, since we thought it would be ok to reuse the name as long as the addresses weren't shared. Is there anything we can do to make a cluster forget a member that has left gracefully? Thanks, Bobby On Tue, Apr 6, 2021 at 7:46 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > You can always change the list of initial hosts in TCPPING > programmatically, via getInitialHosts() / setInitialHosts(). > > Detecting that an address is wrong is outside the scope of JGroups, and > should be done (IMO) by your application, e.g. at > config/installation/startup time. > > This can of course be arbitrarily difficult, e.g. > * See if a symbolic name resolves correctly > * Check if a host is pingable > > You could also disallow a user from entering hostnames/IP addresses > him/herself directly and instead generate them yourself, e.g. by > recording all hosts on which an installation was performed and using > this as initial_hosts. > > You could also think of adding a protocol which checks (in init() or > start()) that the hostnames/addresses in TCPPING.initial_hosts resolve, > and possibly ping all entries before starting the stack. > > On a related note, take a look at [1] (added in 4.2.12): it skips > unresolved/unresolvable entries until an entry finally does resolve. > > Hope this helps, > > [1] https://issues.redhat.com/browse/JGRP-2535 > > On 05.04.21 22:50, Questions/problems related to using JGroups wrote: > > Hi, > > > > Our product uses the TCP stack with jgroups 4.1.8. It gets set up by end > > users through a configuration file that contains (among other things), a > > list of IP addresses for a node to connect to when joining a cluster. We > > set this for TCPPING.initial_hosts. > > > > If they have a wrong address at startup they end up getting JGRP000032 > > warnings filling the logs. For instance, the following leads to logs > > filling on two nodes, one of which was set up correctly: > > > > 1. Start cluster A/B. A is the coordinator. > > 2. 
Start a one-node cluster C. > > 3. On node D, include addresses for D and B in the initial hosts list > > and attempt to join. > > 4. D will join C for a cluster C/D and, obviously, not join A/B since it > > didn't attempt to connect to the coordinator. > > > > After this, the logs for D will fill with: > > WARN: JGRP000032: <D>: no physical address for <A>, dropping message > > > > ...and B logs will fill with: > > WARN: JGRP000032: <B>: no physical address for <C>, dropping message > > > > I know this is a setup error on the user's side, but was wondering if > > there's anything we could add programmatically to stop it. For instance, > > when they see the logs on X filling up with messages about Y in another > > cluster, is there something we could do to tell X to forget Y exists? > > It's not enough just to stop/fix/start that cluster, as (in the case of > > A/B above) the cluster that was started correctly could be showing this > > problem. For some customers, getting a maintenance window to shut down > > all related clusters and restart them is a problem. > > > > For that matter, is there anything programmatically we could do to > > detect that this is happening? Besides parsing the jgroups logging > > output I mean. > > > > Thank you, > > Bobby > > > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > > > > -- > Bela Ban | http://www.jgroups.org > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-04-06 11:44:59
|
You can always change the list of initial hosts in TCPPING programmatically, via getInitialHosts() / setInitialHosts(). Detecting that an address is wrong is outside the scope of JGroups, and should be done (IMO) by your application, e.g. at config/installation/startup time. This can of course be arbitrarily difficult, e.g. * See if a symbolic name resolves correctly * Check if a host is pingable You could also disallow a user from entering hostnames/IP addresses him/herself directly and instead generate them yourself, e.g. by recording all hosts on which an installation was performed and using this as initial_hosts. You could also think of adding a protocol which checks (in init() or start()) that the hostnames/addresses in TCPPING.initial_hosts resolve, and possibly ping all entries before starting the stack. On a related note, take a look at [1] (added in 4.2.12): it skips unresolved/unresolvable entries until an entry finally does resolve. Hope this helps, [1] https://issues.redhat.com/browse/JGRP-2535 On 05.04.21 22:50, Questions/problems related to using JGroups wrote: > Hi, > > Our product uses the TCP stack with jgroups 4.1.8. It gets set up by end > users through a configuration file that contains (among other things), a > list of IP addresses for a node to connect to when joining a cluster. We > set this for TCPPING.initial_hosts. > > If they have a wrong address at startup they end up getting JGRP000032 > warnings filling the logs. For instance, the following leads to logs > filling on two nodes, one of which was set up correctly: > > 1. Start cluster A/B. A is the coordinator. > 2. Start a one-node cluster C. > 3. On node D, include addresses for D and B in the initial hosts list > and attempt to join. > 4. D will join C for a cluster C/D and, obviously, not join A/B since it > didn't attempt to connect to the coordinator. > > After this, the logs for D will fill with: > WARN: JGRP000032: <D>: no physical address for <A>, dropping message > > ...and B logs will fill with: > WARN: JGRP000032: <B>: no physical address for <C>, dropping message > > I know this is a setup error on the user's side, but was wondering if > there's anything we could add programmatically to stop it. For instance, > when they see the logs on X filling up with messages about Y in another > cluster, is there something we could do to tell X to forget Y exists? > It's not enough just to stop/fix/start that cluster, as (in the case of > A/B above) the cluster that was started correctly could be showing this > problem. For some customers, getting a maintenance window to shut down > all related clusters and restart them is a problem. > > For that matter, is there anything programmatically we could do to > detect that this is happening? Besides parsing the jgroups logging > output I mean. > > Thank you, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
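A sketch of the "protocol which checks in init()" idea from the reply above. The protocol name, the attribute name and the [port] stripping are assumptions; the host list is supplied as the protocol's own attribute, to be set to the same comma-separated string as TCPPING.initial_hosts:

import java.net.InetAddress;
import org.jgroups.annotations.Property;
import org.jgroups.stack.Protocol;

// Fails fast at stack init time if any configured hostname does not resolve.
public class CHECK_HOSTS extends Protocol {

    @Property(description="Comma-separated hostnames (TCPPING.initial_hosts style) to verify before starting")
    protected String hosts_to_check = "";

    @Override
    public void init() throws Exception {
        for (String entry : hosts_to_check.split(",")) {
            String host = entry.trim();
            if (host.isEmpty())
                continue;
            int bracket = host.indexOf('[');      // strip an optional [port] suffix
            if (bracket > 0)
                host = host.substring(0, bracket);
            InetAddress.getByName(host);          // throws UnknownHostException -> channel won't start
        }
    }
}

Pinging each resolved entry (the second check suggested above) could be added in start() with InetAddress.isReachable(), with the usual caveat that ICMP may be blocked on some networks.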
From: Questions/problems r. to u. J. <jav...@li...> - 2021-04-05 21:43:18
|
Hi, Our product uses the TCP stack with jgroups 4.1.8. It gets set up by end users through a configuration file that contains (among other things), a list of IP addresses for a node to connect to when joining a cluster. We set this for TCPPING.initial_hosts. If they have a wrong address at startup they end up getting JGRP000032 warnings filling the logs. For instance, the following leads to logs filling on two nodes, one of which was set up correctly: 1. Start cluster A/B. A is the coordinator. 2. Start a one-node cluster C. 3. On node D, include addresses for D and B in the initial hosts list and attempt to join. 4. D will join C for a cluster C/D and, obviously, not join A/B since it didn't attempt to connect to the coordinator. After this, the logs for D will fill with: WARN: JGRP000032: <D>: no physical address for <A>, dropping message ...and B logs will fill with: WARN: JGRP000032: <B>: no physical address for <C>, dropping message I know this is a setup error on the user's side, but was wondering if there's anything we could add programmatically to stop it. For instance, when they see the logs on X filling up with messages about Y in another cluster, is there something we could do to tell X to forget Y exists? It's not enough just to stop/fix/start that cluster, as (in the case of A/B above) the cluster that was started correctly could be showing this problem. For some customers, getting a maintenance window to shut down all related clusters and restart them is a problem. For that matter, is there anything programmatically we could do to detect that this is happening? Besides parsing the jgroups logging output I mean. Thank you, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2020-11-18 15:55:57
|
But a member won't be able to connect (JChannel.connect(cluster)), so what's the point? This will fail! On 18.11.20 4:38 pm, Questions/problems related to using JGroups wrote: > So the environment where we deploy the nodes is very unreliable. > Network switches or links could be down and that is okay since we test > to make sure all the nodes in the system can handle it. > > An example of this happening in production is when the network is down > but a node is coming up due to a system power recovery or a complete > system wide reboot. > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban, JGroups lead (http://www.jgroups.org) |