javagroups-users Mailing List for JGroups
Brought to you by: belaban
From: Questions/problems r. to u. J. <jav...@li...> - 2023-09-12 11:09:27
FYI, I just released 5.3. Details here: http://belaban.blogspot.com/2023/09/jgroups-53-released.html Cheers -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-11-22 01:29:38
Thanks for the confirmation Bela, I was easily able to modify the Infinispan configuration to include the FORK [1] and the application is no longer prone to random OOMs during bootstrap. Cheers, Johnathan [1] <config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.0.xsd"> ... <FORK> <fork-stacks> <fork-stack id="hijack-stack"/> </fork-stacks> </FORK> <FRAG3 /> </config> if (cacheManager.getTransport() instanceof JGroupsTransport) { JGroupsTransport jGroupsTransport = (JGroupsTransport) cacheManager.getTransport(); ProtocolStack stack = jGroupsTransport.getChannel().getProtocolStack(); Class<? extends Protocol> neighborProtocol = stack.findProtocol(FRAG2.class) != null ? FRAG2.class : FRAG3.class; channel = new ForkChannel(jGroupsTransport.getChannel(), "hijack-stack", "lead-hijacker", false, ProtocolStack.Position.ABOVE, neighborProtocol); On Fri, 21 Oct 2022, 16:40 Questions/problems related to using JGroups via javagroups-users, <jav...@li...> wrote: > Hi Jonathan > > yes, the best solution is to define FORK in the JGroups section of the > Infinispan configuration, as you mentioned. > > If you don't control the configuration, then it gets a bit more > tricky... you could (*before sending any brodcast traffic*) insert FORK > dynamically into every JChannel instance. > To do that, you need to get the JChannel; IIRC, via (paraphrased) > cache.getExtendedCache().getRpcManager().getTransport(), downcast it to > JGroupsTransport, then call getChannel(). > > Once you have JChannel, call > channel.getProtocolStack().insertProtocolAtTop(new FORK(...)); > Hope this helps, > > [1] http://www.jgroups.org/manual5/index.html#ForkChannel > > On 21.10.22 16:02, Questions/problems related to using JGroups wrote: > > Hi all, > > > > We have run into an interesting race condition when attempting to use a > > fork channel in our application, we more or less follow what Bela wrote > > here > > > http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html > < > http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html> > . > > > > When the cluster comes up and receives views the node broadcasts > > information about itself over the fork channel so the cluster knows what > > each node is capable of handling. > > > > Unfortunately from what I can see there is a race condition on bootstrap > > where the JGroups stack is started and receives a fork message before > > the fork is inserted into the stack (fork not present in stack trace) > > [1] which results in garbage / unknown data passing through the > > Infinispan marshaller.. if you are lucky enough it will read an > > extremely large int and try and allocate that into a byte array > > resulting in the JVM to throw an OOM or NegativeArraySizeException > > > > I believe one possible solution is to define the fork inside the > > jgroups.xml which is used to create the initial jgroups stack which > > would hopefully discard fork channel messages (until the message > > listener is registered) and not pass them up the stack resulting in > > undefined behaviour. > > > > I had a look at the fork documentation but there are not many examples, > > does my possible solution seem feasible or does someone have alternative > > solutions? I am currently looking through the Infinispan code to see if > > there is any way to decorate jgroups before it starts. 
> > > > Thanks in advance, > > > > Johnathan > > > > [1] > > > > 2022-10-20 10:42:28,515 ERROR [jgroups-89,service-2] > > (org.infinispan.CLUSTER) ISPN000474: Error processing request 0@service-2 > > java.lang.NegativeArraySizeException: -436207616 > > at > > > org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:904) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:891) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:715) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:358) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:192) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:221) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport.processRequest(JGroupsTransport.java:1361) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1301) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:130) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.lambda$up$0(JGroupsTransport.java:1450) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at org.jgroups.util.MessageBatch.forEach(MessageBatch.java:318) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1450) > ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > > at org.jgroups.JChannel.up(JChannel.java:796) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:903) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.FRAG3.up(FRAG3.java:187) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:297) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.UNICAST3.deliverBatch(UNICAST3.java:1071) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:886) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.UNICAST3.handleBatchReceived(UNICAST3.java:852) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:501) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:689) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:197) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at 
org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.stack.Protocol.up(Protocol.java:338) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at org.jgroups.protocols.TP.passBatchUp(TP.java:1408) > > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.passBatchUp(MaxOneThreadPerSender.java:284) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > org.jgroups.util.SubmitToThreadPool$BatchHandler.run(SubmitToThreadPool.java:136) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.run(MaxOneThreadPerSender.java:273) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > > at > > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > ~[?:?] > > at > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > ~[?:?] > > at java.lang.Thread.run(Thread.java:829) ~[?:?] > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > > -- > Bela Ban | http://www.jgroups.org > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > |
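Pulling the pieces of this thread together, a minimal sketch of the pattern under discussion, assuming JGroups 4.x and a main-channel configuration that already declares <FORK> with a "hijack-stack" fork stack as above (the class name, config file name, cluster name and payload below are illustrative only, not part of the actual application):

    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.fork.ForkChannel;

    public class ForkChannelSketch {
        public static void main(String[] args) throws Exception {
            // Main channel; its config already contains <FORK> with fork stack "hijack-stack",
            // so fork messages that arrive early are handled by FORK rather than being passed
            // up to the application (or, in the scenario above, to Infinispan's marshaller).
            JChannel main = new JChannel("jgroups.xml");

            // Attach a fork channel to the pre-declared fork stack
            ForkChannel fork = new ForkChannel(main, "hijack-stack", "capability-channel");
            fork.setReceiver(new ReceiverAdapter() {
                @Override public void receive(Message msg) {
                    System.out.println("capabilities from " + msg.getSrc() + ": " + msg.getObject());
                }
            });

            main.connect("demo-cluster");
            fork.connect("ignored"); // the cluster name passed to a fork channel is ignored

            // Broadcast this node's capabilities only after both channels are connected
            fork.send(new Message(null, "node-capabilities"));
        }
    }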
From: Questions/problems r. to u. J. <jav...@li...> - 2022-10-21 14:39:08
Hi Jonathan yes, the best solution is to define FORK in the JGroups section of the Infinispan configuration, as you mentioned. If you don't control the configuration, then it gets a bit more tricky... you could (*before sending any brodcast traffic*) insert FORK dynamically into every JChannel instance. To do that, you need to get the JChannel; IIRC, via (paraphrased) cache.getExtendedCache().getRpcManager().getTransport(), downcast it to JGroupsTransport, then call getChannel(). Once you have JChannel, call channel.getProtocolStack().insertProtocolAtTop(new FORK(...)); Hope this helps, [1] http://www.jgroups.org/manual5/index.html#ForkChannel On 21.10.22 16:02, Questions/problems related to using JGroups wrote: > Hi all, > > We have run into an interesting race condition when attempting to use a > fork channel in our application, we more or less follow what Bela wrote > here > http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html <http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html> . > > When the cluster comes up and receives views the node broadcasts > information about itself over the fork channel so the cluster knows what > each node is capable of handling. > > Unfortunately from what I can see there is a race condition on bootstrap > where the JGroups stack is started and receives a fork message before > the fork is inserted into the stack (fork not present in stack trace) > [1] which results in garbage / unknown data passing through the > Infinispan marshaller.. if you are lucky enough it will read an > extremely large int and try and allocate that into a byte array > resulting in the JVM to throw an OOM or NegativeArraySizeException > > I believe one possible solution is to define the fork inside the > jgroups.xml which is used to create the initial jgroups stack which > would hopefully discard fork channel messages (until the message > listener is registered) and not pass them up the stack resulting in > undefined behaviour. > > I had a look at the fork documentation but there are not many examples, > does my possible solution seem feasible or does someone have alternative > solutions? I am currently looking through the Infinispan code to see if > there is any way to decorate jgroups before it starts. 
> > Thanks in advance, > > Johnathan > > [1] > > 2022-10-20 10:42:28,515 ERROR [jgroups-89,service-2] > (org.infinispan.CLUSTER) ISPN000474: Error processing request 0@service-2 > java.lang.NegativeArraySizeException: -436207616 > at > org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:904) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:891) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:715) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:358) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:192) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:221) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport.processRequest(JGroupsTransport.java:1361) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1301) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:130) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.lambda$up$0(JGroupsTransport.java:1450) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at org.jgroups.util.MessageBatch.forEach(MessageBatch.java:318) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1450) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] > at org.jgroups.JChannel.up(JChannel.java:796) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:903) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.FRAG3.up(FRAG3.java:187) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:297) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.UNICAST3.deliverBatch(UNICAST3.java:1071) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:886) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.UNICAST3.handleBatchReceived(UNICAST3.java:852) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:501) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:689) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:197) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > 
at org.jgroups.stack.Protocol.up(Protocol.java:338) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at org.jgroups.protocols.TP.passBatchUp(TP.java:1408) > ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.passBatchUp(MaxOneThreadPerSender.java:284) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > org.jgroups.util.SubmitToThreadPool$BatchHandler.run(SubmitToThreadPool.java:136) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.run(MaxOneThreadPerSender.java:273) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?] > at java.lang.Thread.run(Thread.java:829) ~[?:?] > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban | http://www.jgroups.org |
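A hedged sketch of the dynamic alternative mentioned above, i.e. pushing FORK onto the top of an existing channel's stack before any fork traffic is sent (the helper class is invented for illustration; depending on the JGroups version, init()/start() may also have to be called on the inserted protocol, so treat this as a starting point rather than a drop-in solution):

    import org.jgroups.JChannel;
    import org.jgroups.protocols.FORK;
    import org.jgroups.stack.ProtocolStack;

    public final class ForkInserter {
        /** Adds FORK at the top of the channel's stack if it is not already configured. */
        public static void ensureFork(JChannel ch) throws Exception {
            ProtocolStack stack = ch.getProtocolStack();
            if (stack.findProtocol(FORK.class) != null)
                return; // FORK already present, e.g. declared in the XML config
            // Do this before connecting, or at least before any fork messages can arrive
            stack.insertProtocolAtTop(new FORK());
        }
    }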
From: Questions/problems r. to u. J. <jav...@li...> - 2022-10-21 14:02:42
Hi all, We have run into an interesting race condition when attempting to use a fork channel in our application, we more or less follow what Bela wrote here http://belaban.blogspot.com/2013/08/how-to-hijack-jgroups-channel-inside.html . When the cluster comes up and receives views the node broadcasts information about itself over the fork channel so the cluster knows what each node is capable of handling. Unfortunately from what I can see there is a race condition on bootstrap where the JGroups stack is started and receives a fork message before the fork is inserted into the stack (fork not present in stack trace) [1] which results in garbage / unknown data passing through the Infinispan marshaller.. if you are lucky enough it will read an extremely large int and try and allocate that into a byte array resulting in the JVM to throw an OOM or NegativeArraySizeException I believe one possible solution is to define the fork inside the jgroups.xml which is used to create the initial jgroups stack which would hopefully discard fork channel messages (until the message listener is registered) and not pass them up the stack resulting in undefined behaviour. I had a look at the fork documentation but there are not many examples, does my possible solution seem feasible or does someone have alternative solutions? I am currently looking through the Infinispan code to see if there is any way to decorate jgroups before it starts. Thanks in advance, Johnathan [1] 2022-10-20 10:42:28,515 ERROR [jgroups-89,service-2] (org.infinispan.CLUSTER) ISPN000474: Error processing request 0@service-2 java.lang.NegativeArraySizeException: -436207616 at org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:904) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.readUnknown(GlobalMarshaller.java:891) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:715) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:358) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:192) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:221) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processRequest(JGroupsTransport.java:1361) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1301) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:130) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.lambda$up$0(JGroupsTransport.java:1450) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.jgroups.util.MessageBatch.forEach(MessageBatch.java:318) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1450) ~[infinispan-core-11.0.1.Final.jar:11.0.1.Final] at org.jgroups.JChannel.up(JChannel.java:796) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:903) 
~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.FRAG3.up(FRAG3.java:187) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:297) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.UNICAST3.deliverBatch(UNICAST3.java:1071) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.UNICAST3.removeAndDeliver(UNICAST3.java:886) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.UNICAST3.handleBatchReceived(UNICAST3.java:852) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:501) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:689) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:197) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.stack.Protocol.up(Protocol.java:338) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.protocols.TP.passBatchUp(TP.java:1408) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.passBatchUp(MaxOneThreadPerSender.java:284) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.util.SubmitToThreadPool$BatchHandler.run(SubmitToThreadPool.java:136) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at org.jgroups.util.MaxOneThreadPerSender$BatchHandlerLoop.run(MaxOneThreadPerSender.java:273) ~[jgroups-4.2.1.Final.jar:4.2.1.Final] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?] at java.lang.Thread.run(Thread.java:829) ~[?:?] |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-10-17 14:06:13
On 17.10.22 14:30, Questions/problems related to using JGroups wrote: > Hi, > > I see that the suspect() method is only called on the coordinator node > in my cluster when another node disappears; no other nodes in the > cluster get the message. Is this the expected behavior? The docs don't > say one way or the other: > > http://www.jgroups.org/javadoc4/org/jgroups/MembershipListener.html#suspect(org.jgroups.Address) <http://www.jgroups.org/javadoc4/org/jgroups/MembershipListener.html#suspect(org.jgroups.Address)> Yes, this is correct: coordinator (or newly promoted coords) get this callback only. > We're using jgroups 4.1.8 currently. Note that the suspect() / unsuspect() were removed in 5.0. We wrote code, way back with > jgroups 3.4, where every node needs to know if a member left the view > suspect or not. IMO there's better ways of doing this, as suspect() is only an indication and may never result in a view change. If a member P needs to leave gracefully, I'd have P broadcast a LEAVING_GRACEFULLY message to all, before it leaves. Everyone receives this message and caches it. When the actual view change arrives, members part of the previous view but not the current view, and *not* part of the cache, crashed. All others left gracefully. The cache needs to be adjusted on every view change. > The nodes use this to compare the current cluster size > to a "stable" size that changes more slowly (changes to match cluster > size right away if nodes leave gracefully since that means a human > caused it on purpose). So either suspect() was called on every node back > then or we really missed something during testing. It was seven years > ago so we could have definitely missed it. > > Can send more information if you'd like it, but is this the way it's > supposed to work? Is there anything else I can do have suspect() called > on each node? > > (If not, I currently think my best option is to have non-coordinator > nodes treat any decrease in view size as suspect until, after the view > change, the coordinator can send them a message telling them that a node > left gracefully and that the current actual size is "stable.") > > Thank you, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban | http://www.jgroups.org |
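A sketch of the pattern Bela describes, assuming JGroups 4.x (the marker payload, class name and logging are illustrative): a member broadcasts a marker before closing its channel, every member caches the sender, and on the next view change the departed members are classified as graceful or crashed.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import org.jgroups.Address;
    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class GracefulLeaveTracker extends ReceiverAdapter {
        private final Set<Address> leavingGracefully = ConcurrentHashMap.newKeySet();
        private volatile View lastView;

        /** Call on a member right before it closes its channel. */
        public static void announceLeave(JChannel ch) throws Exception {
            ch.send(new Message(null, "LEAVING_GRACEFULLY"));
        }

        @Override public void receive(Message msg) {
            if ("LEAVING_GRACEFULLY".equals(msg.getObject()))
                leavingGracefully.add(msg.getSrc());
        }

        @Override public void viewAccepted(View newView) {
            if (lastView != null) {
                for (Address member : lastView.getMembers()) {
                    if (newView.containsMember(member))
                        continue; // still in the cluster
                    boolean graceful = leavingGracefully.remove(member);
                    System.out.println(member + (graceful ? " left gracefully" : " crashed (no leave announcement)"));
                }
            }
            // Adjust the cache on every view change: keep only announcements from current members
            leavingGracefully.retainAll(newView.getMembers());
            lastView = newView;
        }
    }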
From: Questions/problems r. to u. J. <jav...@li...> - 2022-10-17 12:58:03
Hi, I see that the suspect() method is only called on the coordinator node in my cluster when another node disappears; no other nodes in the cluster get the message. Is this the expected behavior? The docs don't say one way or the other: http://www.jgroups.org/javadoc4/org/jgroups/MembershipListener.html#suspect(org.jgroups.Address) We're using jgroups 4.1.8 currently. We wrote code, way back with jgroups 3.4, where every node needs to know if a member left the view suspect or not. The nodes use this to compare the current cluster size to a "stable" size that changes more slowly (changes to match cluster size right away if nodes leave gracefully since that means a human caused it on purpose). So either suspect() was called on every node back then or we really missed something during testing. It was seven years ago so we could have definitely missed it. Can send more information if you'd like it, but is this the way it's supposed to work? Is there anything else I can do have suspect() called on each node? (If not, I currently think my best option is to have non-coordinator nodes treat any decrease in view size as suspect until, after the view change, the coordinator can send them a message telling them that a node left gracefully and that the current actual size is "stable.") Thank you, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-10 10:25:09
On 06.06.22 22:15, Questions/problems related to using JGroups wrote: > Although, looking at this again, I think we might not be talking about > the same setup. From this: > > > > On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > [....] > > > > > Right, and what they want is some way to fully remove a node > from a > > cluster. I.e. the cluster stops trying to contact that address. > > > Then you would have to remove the 130 node from the old cluster's > initial_hosts (TCPPING) and TCP's logical address cache. Either by > restarting, or by programmatically removing it. This can get > complex > quickly though, as you'd have to maintain a list of ports per > cluster. > > > Each cluster is separate from all the others, so I don't know what I > would need to keep in this list or why a cluster would need it. Referring to my previous email: if you use FILE_PING, each cluster has a _separate_ directory (the cluster name) under which the discovery info is stored. > If a cluster has A/B/C/D in it, and the code sees that D leaves the cluster > without going suspect first, can I programmatically do these? For TCP, it's complicated, but doable. Among other things you'd have to: - Close all TCP connections to D - Close all connections to D in UNICAST3, too - Remove D's info from the address cache (contents: 'probe.sh uuids') - Remove D's information from all instances of TCPPING (initial_hosts and dynamic_hosts) Again, using a dynamic discovery protocol such as FILE_PING makes more sense here. > - set new initial_hosts on the existing TCPPING protocol in my stack to > include only A/B/C > - access the logical address cache and remove the address Yes, but this is not enough (see above). > I mean, I know I can hack the TCPPING again, but didn't know that would > have any effect on the existing channel and members. I don't know > offhand how to access the address cache, which I think is all I'm > missing to experiment with this. Pseudo code: TP tp=channel.getProtocolStack().getTransport(); LazyRemovalCache cache=tp.getLogicalAddressCache(); cache.remove(address, true); // force removal > If I can do the above then I think that > solves the issue -- if a suspect member leaves the view I won't do > anything, because we want to keep trying it in case it was disconnected > and reconnected. But if a member leaves gracefully and the above is all > I need to make the cluster forget about it, that's great and means we > wouldn't have to change any startup features for the customers. > > Thanks again, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-10 09:47:52
Hi Bobby On 06.06.22 21:08, Questions/problems related to using JGroups wrote: > Hi again Bela et al, > > We've finally come back to this issue after not working on the product > for a while. I'm keeping the context all below, but the short version > was that we use TCPPING and, if someone removes a node with address X > and, later, starts a new cluster that includes the address, the old > cluster keeps trying to find its lost buddy at X. Right, and I suggested using a dynamic discovery protocol, *not* TCPPING. > We're still back on v4.1.8 and I wanted to ask if the suggestion below, > i.e. use TCPGOSSIP or FILE_PING (this is for in-house deployments on > their own networks) is the most appropriate, and if there would be any > benefit for this particular issue by moving to v5.X? There are loads of benefits by moving to 5.x :-) But, specifically to this case, only the ability to have multiple discovery protocols in the same stack would be beneficial here. I guess MULTI_PING in 4.x might do the same job though... > The way they run > things now is to put host:port info for each node in a file and then > start the applications, which read that file to set initial hosts. So > FILE_PING might be the best for them so that we don't need to have any > new processes running. Yes, the benefits/drawbacks of FILE_PING are + No additional process needed + All processes access a shared dir, e.g. on NFS - NFS adds overhead (but only for discovery) + The discovery info is human-readable, and can thus be modified manually (if needed) > Thanks, > Bobby > > On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > > > On 25.05.21 18:59, Questions/problems related to using JGroups wrote: > > On Tue, May 25, 2021 at 10:30 AM Questions/problems related to using > > JGroups via javagroups-users > <jav...@li... > <mailto:jav...@li...> > > <mailto:jav...@li... > <mailto:jav...@li...>>> wrote: > > > > Hi Bobby > > apologies for the delay! > > > > > > No problem -- thanks for looking. > > > > > > You cannot have the old cluster's initial_hosts be > 128,129,130 and the > > new one has the overlapping range 130,131. > > > > > > That's the problem. The customer has lots of nodes, clusters that > grow > > and shrink, and they're going to reuse the same IP addresses > eventually. > > > Then using TCPPING for the discovery is the wrong solution; it is > designed for a static cluster with a fixed and known membership. > > For the above requirements, I'd rather recommend: > * A dynamic discovery mechanism (TCPGOSSIP, FILE_PING, GOOGLE_PING etc) > * Emphemeral ports > * A new (different) cluster name for each new cluster that is started > > > > The old cluster will try to contact 130 (e.g. trying to > merge), thereby > > send its information to 130. > > > > > > Right, and what they want is some way to fully remove a node from a > > cluster. I.e. the cluster stops trying to contact that address. > > > Then you would have to remove the 130 node from the old cluster's > initial_hosts (TCPPING) and TCP's logical address cache. Either by > restarting, or by programmatically removing it. This can get complex > quickly though, as you'd have to maintain a list of ports per cluster. > > The first solution above is much better IMO. > > > > What is it you're trying to achieve? > > > > > > Simply to take a node out of a cluster when it's not needed, then > later > > reuse the address of that node with a different cluster. 
If I > change the > > cluster names (same port though) then I still get constant > warnings, like: > > JGRP000012: discarded message from different cluster <old> (our > cluster > > is <new>). Sender was <some addr> > > > > We can suggest that they restart the cluster after removing a > node, but > > I don't know if that will work for them. I'll also try using > different > > ports for different clusters and see how that works for them. > > That will certainly work, but - again - you'd have to maintain ports > numbers for each cluster. Registration service? Excel spreadsheet? > > > > Given the size of the company in question, I can see that it > might be hard to > > coordinate that and eventually they'll get back in the same > situation > > where a previously used address is being used again with the same > port > > it used the last time. > > Right. So I have to come back to my suggestion of not using TCPPING! > Cheers, > > > > Thanks, > > Bobby > > > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > <mailto:jav...@li...> > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > <https://lists.sourceforge.net/lists/listinfo/javagroups-users> > > > > -- > Bela Ban | http://www.jgroups.org <http://www.jgroups.org> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > <mailto:jav...@li...> > https://lists.sourceforge.net/lists/listinfo/javagroups-users > <https://lists.sourceforge.net/lists/listinfo/javagroups-users> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban | http://www.jgroups.org |
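For illustration, a sketch of what swapping the discovery protocol could look like with the programmatic style of configuration used elsewhere in this thread (JGroups 4.x; the shared directory, port and the heavily abbreviated protocol list are placeholders, not the product's actual stack):

    import org.jgroups.JChannel;
    import org.jgroups.protocols.FD_ALL;
    import org.jgroups.protocols.FILE_PING;
    import org.jgroups.protocols.MERGE3;
    import org.jgroups.protocols.TCP;
    import org.jgroups.protocols.UNICAST3;
    import org.jgroups.protocols.VERIFY_SUSPECT;
    import org.jgroups.protocols.pbcast.GMS;
    import org.jgroups.protocols.pbcast.NAKACK2;
    import org.jgroups.protocols.pbcast.STABLE;

    public class FilePingStack {
        public static JChannel create(String sharedDir) throws Exception {
            // FILE_PING replaces TCPPING: discovery data lives in a shared directory
            // (one subdirectory per cluster name), so no static initial_hosts list is needed
            return new JChannel(
                new TCP().setValue("bind_port", 7800),
                new FILE_PING().setValue("location", sharedDir)
                               .setValue("remove_old_coords_on_view_change", true),
                new MERGE3(),
                new FD_ALL(),
                new VERIFY_SUSPECT(),
                new NAKACK2().setValue("use_mcast_xmit", false),
                new UNICAST3(),
                new STABLE(),
                new GMS());
        }
    }

With this kind of setup, each cluster only needs a distinct cluster name; the shared directory takes care of discovery, so reused IP addresses no longer have to be tracked in a static host list.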
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-06 20:40:05
Although, looking at this again, I think we might not be talking about the same setup. From this: > > On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li...> > wrote: > >> [....] >> >> > >> > Right, and what they want is some way to fully remove a node from a >> > cluster. I.e. the cluster stops trying to contact that address. >> >> >> Then you would have to remove the 130 node from the old cluster's >> initial_hosts (TCPPING) and TCP's logical address cache. Either by >> restarting, or by programmatically removing it. This can get complex >> quickly though, as you'd have to maintain a list of ports per cluster. > > Each cluster is separate from all the others, so I don't know what I would need to keep in this list or why a cluster would need it. If a cluster has A/B/C/D in it, and the code sees that D leaves the cluster without going suspect first, can I programmatically do these? - set new initial_hosts on the existing TCPPING protocol in my stack to include only A/B/C - access the logical address cache and remove the address I mean, I know I can hack the TCPPING again, but didn't know that would have any effect on the existing channel and members. I don't know offhand how to access the address cache, which I think is all I'm missing to experiment with this. If I can do the above then I think that solves the issue -- if a suspect member leaves the view I won't do anything, because we want to keep trying it in case it was disconnected and reconnected. But if a member leaves gracefully and the above is all I need to make the cluster forget about it, that's great and means we wouldn't have to change any startup features for the customers. Thanks again, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-06 19:35:18
Hi again Bela et al, We've finally come back to this issue after not working on the product for a while. I'm keeping the context all below, but the short version was that we use TCPPING and, if someone removes a node with address X and, later, starts a new cluster that includes the address, the old cluster keeps trying to find its lost buddy at X. We're still back on v4.1.8 and I wanted to ask if the suggestion below, i.e. use TCPGOSSIP or FILE_PING (this is for in-house deployments on their own networks) is the most appropriate, and if there would be any benefit for this particular issue by moving to v5.X? The way they run things now is to put host:port info for each node in a file and then start the applications, which read that file to set initial hosts. So FILE_PING might be the best for them so that we don't need to have any new processes running. Thanks, Bobby On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > > > On 25.05.21 18:59, Questions/problems related to using JGroups wrote: > > On Tue, May 25, 2021 at 10:30 AM Questions/problems related to using > > JGroups via javagroups-users <jav...@li... > > <mailto:jav...@li...>> wrote: > > > > Hi Bobby > > apologies for the delay! > > > > > > No problem -- thanks for looking. > > > > > > You cannot have the old cluster's initial_hosts be 128,129,130 and > the > > new one has the overlapping range 130,131. > > > > > > That's the problem. The customer has lots of nodes, clusters that grow > > and shrink, and they're going to reuse the same IP addresses eventually. > > > Then using TCPPING for the discovery is the wrong solution; it is > designed for a static cluster with a fixed and known membership. > > For the above requirements, I'd rather recommend: > * A dynamic discovery mechanism (TCPGOSSIP, FILE_PING, GOOGLE_PING etc) > * Emphemeral ports > * A new (different) cluster name for each new cluster that is started > > > > The old cluster will try to contact 130 (e.g. trying to merge), > thereby > > send its information to 130. > > > > > > Right, and what they want is some way to fully remove a node from a > > cluster. I.e. the cluster stops trying to contact that address. > > > Then you would have to remove the 130 node from the old cluster's > initial_hosts (TCPPING) and TCP's logical address cache. Either by > restarting, or by programmatically removing it. This can get complex > quickly though, as you'd have to maintain a list of ports per cluster. > > The first solution above is much better IMO. > > > > What is it you're trying to achieve? > > > > > > Simply to take a node out of a cluster when it's not needed, then later > > reuse the address of that node with a different cluster. If I change the > > cluster names (same port though) then I still get constant warnings, > like: > > JGRP000012: discarded message from different cluster <old> (our cluster > > is <new>). Sender was <some addr> > > > > We can suggest that they restart the cluster after removing a node, but > > I don't know if that will work for them. I'll also try using different > > ports for different clusters and see how that works for them. > > That will certainly work, but - again - you'd have to maintain ports > numbers for each cluster. Registration service? Excel spreadsheet? 
> > > > Given the size of the company in question, I can see that it might be > hard to > > coordinate that and eventually they'll get back in the same situation > > where a previously used address is being used again with the same port > > it used the last time. > > Right. So I have to come back to my suggestion of not using TCPPING! > Cheers, > > > > Thanks, > > Bobby > > > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > > > > -- > Bela Ban | http://www.jgroups.org > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-11-22 15:27:41
Hi again, Thanks for this -- seems like an obvious answer but I wanted to be sure. On Mon, Nov 22, 2021 at 3:43 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > > If they use this as a kind of health service ping, why don't they use a > different port? > I think it's a security tool, but I don't know much about it. AFAIK it attempts to connect to any port in use no matter what is running there to check for problems. Cheers, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-11-22 08:43:46
On 18.11.21 13:49, Questions/problems related to using JGroups wrote: > Hi, > > One of our customers has a security tool that constantly tries to > connect to each server, so the jgroups logs have this happening every ~4 > few seconds: > > 2021-10-28 05:18:35 org.jgroups.protocols.TCP warn WARN: JGRP000006: > failed accepting connection from peer > java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at > org.jgroups.blocks.cs.TcpConnection.readPeerAddress(TcpConnection.java:245) > at org.jgroups.blocks.cs.TcpConnection.<init>(TcpConnection.java:53) > at org.jgroups.blocks.cs.TcpServer$Acceptor.handleAccept(TcpServer.java:126) > at org.jgroups.blocks.cs.TcpServer$Acceptor.run(TcpServer.java:111) > at java.lang.Thread.run(Thread.java:748) > > Can that affect the performance of jgroups? I don't think so, this causes an additional TCP connection to be (half-)established, but it will be torn down immediately, so a bit of processing. If they use this as a kind of health service ping, why don't they use a different port? > I see it on all nodes, but > one of their nodes, a primary database, when under load sometimes > doesn't see view changes that the other nodes see until a minute or more > later. That should be unrelated... > Since the above happens on *every* node I'd think it's unrelated > but wanted to check. I know it's kind of a *qualitative* question, sorry. > > This is with jgroups 4.1.8.Final using a TCP stack. Can get you the full > channel creation info if it helps. > > Thanks, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-11-18 13:46:54
Hi, One of our customers has a security tool that constantly tries to connect to each server, so the jgroups logs have this happening every ~4 seconds: 2021-10-28 05:18:35 org.jgroups.protocols.TCP warn WARN: JGRP000006: failed accepting connection from peer java.io.EOFException at java.io.DataInputStream.readFully(DataInputStream.java:197) at org.jgroups.blocks.cs.TcpConnection.readPeerAddress(TcpConnection.java:245) at org.jgroups.blocks.cs.TcpConnection.<init>(TcpConnection.java:53) at org.jgroups.blocks.cs.TcpServer$Acceptor.handleAccept(TcpServer.java:126) at org.jgroups.blocks.cs.TcpServer$Acceptor.run(TcpServer.java:111) at java.lang.Thread.run(Thread.java:748) Can that affect the performance of jgroups? I see it on all nodes, but one of their nodes, a primary database, when under load sometimes doesn't see view changes that the other nodes see until a minute or more later. Since the above happens on *every* node I'd think it's unrelated but wanted to check. I know it's kind of a *qualitative* question, sorry. This is with jgroups 4.1.8.Final using a TCP stack. I can get you the full channel creation info if it helps. Thanks, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-09-29 10:21:24
I'm considering removing support for JMX in 5.2 [1]. Is anyone using JMX at all to obtain info about a running JGroups system? I've used probe.sh for quite a while now and haven't really used JMX in a long time. Feedback welcome! [1] https://issues.redhat.com/browse/JGRP-2572 -- Bela Ban | http://www.jgroups.org |
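For context, the kind of JMX usage that would disappear with JGRP-2572: programmatic registration of a channel and its protocols as MBeans, roughly as in the sketch below (JGroups 4.x; the domain and cluster name are placeholders).

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;

    import org.jgroups.JChannel;
    import org.jgroups.jmx.JmxConfigurator;

    public class JmxExample {
        public static void main(String[] args) throws Exception {
            JChannel ch = new JChannel(); // default stack
            ch.connect("jmx-demo");
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Last argument = true also registers every protocol in the stack,
            // making their attributes and operations visible in jconsole and friends
            JmxConfigurator.registerChannel(ch, server, "jgroups", ch.getClusterName(), true);
        }
    }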
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-26 07:26:10
On 25.05.21 18:59, Questions/problems related to using JGroups wrote: > On Tue, May 25, 2021 at 10:30 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > Hi Bobby > apologies for the delay! > > > No problem -- thanks for looking. > > > You cannot have the old cluster's initial_hosts be 128,129,130 and the > new one has the overlapping range 130,131. > > > That's the problem. The customer has lots of nodes, clusters that grow > and shrink, and they're going to reuse the same IP addresses eventually. Then using TCPPING for the discovery is the wrong solution; it is designed for a static cluster with a fixed and known membership. For the above requirements, I'd rather recommend: * A dynamic discovery mechanism (TCPGOSSIP, FILE_PING, GOOGLE_PING etc) * Emphemeral ports * A new (different) cluster name for each new cluster that is started > The old cluster will try to contact 130 (e.g. trying to merge), thereby > send its information to 130. > > > Right, and what they want is some way to fully remove a node from a > cluster. I.e. the cluster stops trying to contact that address. Then you would have to remove the 130 node from the old cluster's initial_hosts (TCPPING) and TCP's logical address cache. Either by restarting, or by programmatically removing it. This can get complex quickly though, as you'd have to maintain a list of ports per cluster. The first solution above is much better IMO. > What is it you're trying to achieve? > > > Simply to take a node out of a cluster when it's not needed, then later > reuse the address of that node with a different cluster. If I change the > cluster names (same port though) then I still get constant warnings, like: > JGRP000012: discarded message from different cluster <old> (our cluster > is <new>). Sender was <some addr> > > We can suggest that they restart the cluster after removing a node, but > I don't know if that will work for them. I'll also try using different > ports for different clusters and see how that works for them. That will certainly work, but - again - you'd have to maintain ports numbers for each cluster. Registration service? Excel spreadsheet? > Given the size of the company in question, I can see that it might be hard to > coordinate that and eventually they'll get back in the same situation > where a previously used address is being used again with the same port > it used the last time. Right. So I have to come back to my suggestion of not using TCPPING! Cheers, > Thanks, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-25 17:59:07
On Tue, May 25, 2021 at 10:30 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > Hi Bobby > apologies for the delay! > No problem -- thanks for looking. > > You cannot have the old cluster's initial_hosts be 128,129,130 and the > new one has the overlapping range 130,131. > That's the problem. The customer has lots of nodes, clusters that grow and shrink, and they're going to reuse the same IP addresses eventually. > > The old cluster will try to contact 130 (e.g. trying to merge), thereby > send its information to 130. > Right, and what they want is some way to fully remove a node from a cluster. I.e. the cluster stops trying to contact that address. > > What is it you're trying to achieve? > Simply to take a node out of a cluster when it's not needed, then later reuse the address of that node with a different cluster. If I change the cluster names (same port though) then I still get constant warnings, like: JGRP000012: discarded message from different cluster <old> (our cluster is <new>). Sender was <some addr> We can suggest that they restart the cluster after removing a node, but I don't know if that will work for them. I'll also try using different ports for different clusters and see how that works for them. Given the size of the company in question, I can see that it might be hard to coordinate that and eventually they'll get back in the same situation where a previously used address is being used again with the same port it used the last time. Thanks, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-25 14:30:37
Hi Bobby apologies for the delay! You cannot have the old cluster's initial_hosts be 128,129,130 and the new one has the overlapping range 130,131. The old cluster will try to contact 130 (e.g. trying to merge), thereby send its information to 130. Depending on traffic patterns, everbody will know everyone's else's address, or not. For example, it could be that 128 and 130 know everyone else, but 129 and 131 don't know each other. In the former case, there will be a merge to {128,129,130,131}. In the latter case, members will fail to talk to other members, as they don't have the other members in their logical address cache. If the old cluster didn't have 130 in its initial_hosts, everything would be fine. What is it you're trying to achieve? If you're trying to start a new cluster, then either give it a new cluster name and/or a new set of (unused) ports. Both cluster names and ports could be dished out by a server accessible to all. Cheers On 04.05.21 21:00, Questions/problems related to using JGroups wrote: > On Tue, May 4, 2021 at 1:53 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > [...] > > > > 3. Later they use a node with the same address to join a different > > cluster with the same name. > > Can you post an example? Note that discovery requests from different > clusters are discarded. > > > Sure, but in summary: I can't reuse an IP address after it's already > been in a cluster. The customer is trying to run separate clusters, but > the address of a node in one of them was previously in a different one, > and that is causing problems. > > My config is programmatic; I've included it below. We use a custom > authentication class. When authenticate() is called it will output the > source and the response it's returning. I've set the jgroups logging to > DEBUG level; my application only logs the initial_hosts it sets and the > authentication calls. The member addresses end in: 128, 129, 130, and 131. > > 1. I start a cluster with (started in this order) 128, 129, 130. Each of > them has all three of those addresses in initial_hosts. > > 2. I shut down the application running on 130. The logs for 128 and 129 > have "*** stopping application on .130" in them right before this. > > 3. I start an application on 131 that has 130/131 in initial_hosts. > > 4. I start a new application on the node with the 130 address. It has > 130 and 131 in initial hosts. The logs on 128 and 129 have "*** new > application on .130 starting and will join new cluster with .131" in > them to show when it happens. > > About a minute later, the errors start showing up. The 128 application > is trying to connect to the one running on 130 even though that one had > previously shut down and left the cluster. The new one on 130 doesn't > let it join, and there are merge views repeating with warning messages > throughout. There is a merge view change every minute or so in the > original cluster (128/129). > > The stack we create (comments and text changes for sharing): > > public JChannel createJChannel() throws Exception { > Logger logger = <...> > logger.log(Level.DEBUG, "Creating default JChannel."); > List<Protocol> stack = new ArrayList<>(); > final Protocol tcp = new TCP() > // bind_addr will be same address, e.g. 
.128, .129, etc > that we use in initial_hosts > .setValue("bind_addr", > InetAddress.getByName(getBindingAddress())) > .setValue("bind_port", bindingPort) > .setValue("thread_pool_min_threads", 1) > .setValue("thread_pool_keep_alive_time", 5000) > .setValue("send_buf_size", 640000) > .setValue("sock_conn_timeout", 300) > .setValue("recv_buf_size", 5000000); > // some optional things we could add to tcp removed. not used > in this example > stack.add(tcp); > stack.add(new TCPPING() > // the parseHostList method will output the list for this > example at ERROR level > .setValue("initial_hosts", parseHostList()) > .setValue("send_cache_on_join", true) > .setValue("port_range", 0)); > stack.add(new MERGE3() > .setValue("min_interval", 10000) > .setValue("max_interval", 30000)); > FD_ALL fdAll = new FD_ALL(); > final long jgroupsTimeout = <> > fdAll.setValue("timeout", jgroupsTimeout); > final long maxInterval = jgroupsTimeout / 3L; // to have ~3 > heartbeats before going suspect. <jira number removed> > if (maxInterval < fdAll.getInterval()) { > logger.log(Level.WARN, "......."); > fdAll.setValue("interval", maxInterval); > } > stack.add(fdAll); > stack.add(new VERIFY_SUSPECT() > .setValue("timeout", 1500)); > stack.add(new BARRIER()); > if (getBoolean(<an application property>)) { > logger.debug("adding jgroups asym encryption"); > stack.add(new ASYM_ENCRYPT() > .setValue("sym_keylength", 128) > .setValue("sym_algorithm", "AES/CBC/PKCS5Padding") > .setValue("sym_iv_length", 16) > .setValue("asym_keylength", 2048) > .setValue("asym_algorithm", "RSA") > .setValue("change_key_on_leave", true)); > } > stack.add(new NAKACK2() > .setValue("use_mcast_xmit", false)); > stack.add(new UNICAST3()); > stack.add(new STABLE() > .setValue("desired_avg_gossip", 50000) > .setValue("max_bytes", 4000000)); > // protocol will log auth request source and response > stack.add(createAuthProtocol()); > stack.add(new GMS() > .setValue("join_timeout", 3000)); > stack.add(new MFC() > .setValue("max_credits", 2000000) > .setValue("min_credits", 800000)); > stack.add(new FRAG2()); > stack.add(new STATE_TRANSFER()); > return new JChannel(stack); > } > > Thanks again, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-17 07:39:08
|
This should be somewhere in TCP_NIO2: <TCP_NIO2 max_length="5M".../> Use 'ant make-schema' to generate a schema from the sources. On 14.05.21 23:24, Questions/problems related to using JGroups wrote: > I am trying to understand how i adjust my configuration to take > advantage of the fix for JGRP-2523 > <https://issues.redhat.com/browse/JGRP-2523> but am unable to figure out > what element the new attribute should be applied to in my configuration. > > From what i can tell, the new attribute isn't anywhere in any of he > .xsd's so I am unable to even create a channel if my config tries to use it. > > Here is my sample config: > > <!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml --> > <config xmlns="urn:org:jgroups" > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" > xsi:schemaLocation="urn:org:jgroups > http://www.jgroups.org/schema/jgroups-4.0.xsd"> > <TCP_NIO2 > recv_buf_size="${tcp.recv_buf_size:128K}" > send_buf_size="${tcp.send_buf_size:128K}" > max_bundle_size="64K" > sock_conn_timeout="1000" > > thread_pool.enabled="true" > thread_pool.min_threads="1" > thread_pool.max_threads="10" > thread_pool.keep_alive_time="5000"/> > > <CENTRAL_LOCK /> > > <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING > location="${HA_JGROUPS_DIR}" > remove_old_coords_on_view_change="true"/> > <MERGE3 max_interval="30000" > min_interval="10000"/> > <FD_SOCK/> > <FD timeout="3000" max_tries="3" /> > <VERIFY_SUSPECT timeout="1500" /> > <BARRIER /> > <pbcast.NAKACK2 use_mcast_xmit="false" > discard_delivered_msgs="true"/> > <UNICAST3 /> > <!-- > When a new node joins a cluster, initial message broadcast doesn't > necessarily seem > to arrive. Using a shorter cycles in the STABLE protocol makes the > cluster recognize > this dropped transmission and cause a retransmission. > --> > <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" > max_bytes="4M"/> > <pbcast.GMS print_local_addr="true" join_timeout="3000" > view_bundling="true" > max_join_attempts="5"/> > <MFC max_credits="2M" > min_threshold="0.4"/> > <FRAG2 frag_size="60K" /> > <pbcast.STATE_TRANSFER /> > <!-- pbcast.FLUSH /--> > </config> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
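A hedged sketch of applying that attribute without regenerating the schema: load the XML above (with max_length left out), then set the attribute on the transport programmatically. That setValue() accepts "max_length" with a plain byte count is an assumption to verify against your JGroups version; the config file name and cluster name are placeholders.

import org.jgroups.JChannel;
import org.jgroups.protocols.TP;

public class SetMaxLength {
    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel("jgroups.xml");            // the config shown above, minus max_length
        TP transport = ch.getProtocolStack().getTransport();  // TCP_NIO2 in this stack
        transport.setValue("max_length", 5_000_000);          // roughly "5M"; max_length lives on the transport
        ch.connect("jenkins-ha");                             // placeholder cluster name
    }
}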
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-14 21:24:56
|
I am trying to understand how I adjust my configuration to take advantage of the fix for JGRP-2523 [https://issues.redhat.com/browse/JGRP-2523] but am unable to figure out what element the new attribute should be applied to in my configuration. From what I can tell, the new attribute isn't anywhere in any of the .xsd's so I am unable to even create a channel if my config tries to use it. Here is my sample config: <!-- HA setup; may be overridden with $JENKINS_HOME/jgroups.xml --> <config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-4.0.xsd"> <TCP_NIO2 recv_buf_size="${tcp.recv_buf_size:128K}" send_buf_size="${tcp.send_buf_size:128K}" max_bundle_size="64K" sock_conn_timeout="1000" thread_pool.enabled="true" thread_pool.min_threads="1" thread_pool.max_threads="10" thread_pool.keep_alive_time="5000"/> <CENTRAL_LOCK /> <com.cloudbees.jenkins.ha.singleton.CHMOD_FILE_PING location="${HA_JGROUPS_DIR}" remove_old_coords_on_view_change="true"/> <MERGE3 max_interval="30000" min_interval="10000"/> <FD_SOCK/> <FD timeout="3000" max_tries="3" /> <VERIFY_SUSPECT timeout="1500" /> <BARRIER /> <pbcast.NAKACK2 use_mcast_xmit="false" discard_delivered_msgs="true"/> <UNICAST3 /> <!-- When a new node joins a cluster, initial message broadcast doesn't necessarily seem to arrive. Using shorter cycles in the STABLE protocol makes the cluster recognize this dropped transmission and cause a retransmission. --> <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M"/> <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true" max_join_attempts="5"/> <MFC max_credits="2M" min_threshold="0.4"/> <FRAG2 frag_size="60K" /> <pbcast.STATE_TRANSFER /> <!-- pbcast.FLUSH /--> </config> |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-04 19:07:38
|
On Tue, May 4, 2021 at 1:53 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > [...] > > > > 3. Later they use a node with the same address to join a different > > cluster with the same name. > > Can you post an example? Note that discovery requests from different > clusters are discarded. > Sure, but in summary: I can't reuse an IP address after it's already been in a cluster. The customer is trying to run separate clusters, but the address of a node in one of them was previously in a different one, and that is causing problems. My config is programmatic; I've included it below. We use a custom authentication class. When authenticate() is called it will output the source and the response it's returning. I've set the jgroups logging to DEBUG level; my application only logs the initial_hosts it sets and the authentication calls. The member addresses end in: 128, 129, 130, and 131. 1. I start a cluster with (started in this order) 128, 129, 130. Each of them has all three of those addresses in initial_hosts. 2. I shut down the application running on 130. The logs for 128 and 129 have "*** stopping application on .130" in them right before this. 3. I start an application on 131 that has 130/131 in initial_hosts. 4. I start a new application on the node with the 130 address. It has 130 and 131 in initial hosts. The logs on 128 and 129 have "*** new application on .130 starting and will join new cluster with .131" in them to show when it happens. About a minute later, the errors start showing up. The 128 application is trying to connect to the one running on 130 even though that one had previously shut down and left the cluster. The new one on 130 doesn't let it join, and there are merge views repeating with warning messages throughout. There is a merge view change every minute or so in the original cluster (128/129). The stack we create (comments and text changes for sharing): public JChannel createJChannel() throws Exception { Logger logger = <...> logger.log(Level.DEBUG, "Creating default JChannel."); List<Protocol> stack = new ArrayList<>(); final Protocol tcp = new TCP() // bind_addr will be same address, e.g. .128, .129, etc that we use in initial_hosts .setValue("bind_addr", InetAddress.getByName(getBindingAddress())) .setValue("bind_port", bindingPort) .setValue("thread_pool_min_threads", 1) .setValue("thread_pool_keep_alive_time", 5000) .setValue("send_buf_size", 640000) .setValue("sock_conn_timeout", 300) .setValue("recv_buf_size", 5000000); // some optional things we could add to tcp removed. not used in this example stack.add(tcp); stack.add(new TCPPING() // the parseHostList method will output the list for this example at ERROR level .setValue("initial_hosts", parseHostList()) .setValue("send_cache_on_join", true) .setValue("port_range", 0)); stack.add(new MERGE3() .setValue("min_interval", 10000) .setValue("max_interval", 30000)); FD_ALL fdAll = new FD_ALL(); final long jgroupsTimeout = <> fdAll.setValue("timeout", jgroupsTimeout); final long maxInterval = jgroupsTimeout / 3L; // to have ~3 heartbeats before going suspect. 
<jira number removed> if (maxInterval < fdAll.getInterval()) { logger.log(Level.WARN, "......."); fdAll.setValue("interval", maxInterval); } stack.add(fdAll); stack.add(new VERIFY_SUSPECT() .setValue("timeout", 1500)); stack.add(new BARRIER()); if (getBoolean(<an application property>)) { logger.debug("adding jgroups asym encryption"); stack.add(new ASYM_ENCRYPT() .setValue("sym_keylength", 128) .setValue("sym_algorithm", "AES/CBC/PKCS5Padding") .setValue("sym_iv_length", 16) .setValue("asym_keylength", 2048) .setValue("asym_algorithm", "RSA") .setValue("change_key_on_leave", true)); } stack.add(new NAKACK2() .setValue("use_mcast_xmit", false)); stack.add(new UNICAST3()); stack.add(new STABLE() .setValue("desired_avg_gossip", 50000) .setValue("max_bytes", 4000000)); // protocol will log auth request source and response stack.add(createAuthProtocol()); stack.add(new GMS() .setValue("join_timeout", 3000)); stack.add(new MFC() .setValue("max_credits", 2000000) .setValue("min_credits", 800000)); stack.add(new FRAG2()); stack.add(new STATE_TRANSFER()); return new JChannel(stack); } Thanks again, Bobby |
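For context, a minimal sketch of the lifecycle around a channel built this way; the node and cluster names are placeholders, and the stub at the bottom merely stands in for the createJChannel() shown above:

import org.jgroups.JChannel;

public class NodeBootstrap {
    public static void main(String[] args) throws Exception {
        JChannel channel = createJChannel();
        channel.setName("node-130");            // optional logical name (placeholder)
        channel.connect("cluster-A");           // the cluster name that both clusters end up reusing
        try {
            // ... application traffic ...
        } finally {
            channel.close();                    // regular shutdown / graceful leave, as in step 2
        }
    }

    // Stand-in for the createJChannel() defined in the post above.
    static JChannel createJChannel() throws Exception {
        return new JChannel();                  // default stack; replace with the TCP stack above
    }
}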
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-04 05:53:29
|
On 03.05.21 17:02, Questions/problems related to using JGroups wrote: > Hi again, > > Thanks for this. I have more information from the customer now, and see > that the problem they're having isn't due to incorrect host information > at startup like I thought. The setup to reproduce is pretty simple, and > I understand their point that it doesn't look like user error. > > 1. Set up cluster A/B/C (A is coordinator). > 2. At some point they don't need C in the cluster anymore and shut down > the application there. It's a regular shutdown, not going suspect first. > We use JChannel#close and then exit. OK > 3. Later they use a node with the same address to join a different > cluster with the same name. Can you post an example? Note that discovery requests from different clusters are discarded. > When C starts it only has D's address, and cluster D/C. > > After the above, the A/B cluster is getting a merge view change every > ~minute, always including only A/B in the view. The log on A is also > filling with: > JGRP000032: <A>: no physical address for <D>, dropping message > Because it's a merge view, we do extra processing to handle potential > rejoin cases, which causes a couple other warnings every minute. > > I also see every ~minute that A tries to authorize itself with C. C's > log has messages from our custom AuthToken class. > > > If I use a different cluster for C/D that avoids a lot of the issues. > There are no longer view changes and warnings in the first cluster, but > the new one D/C has this in C's log constantly: > JGRP000012: discarded message from different cluster <old> (our cluster > is <new>). Sender was <A> > > That will help them some, but it's a large organization and they have a > lot of clusters, since we thought it would be ok to reuse the name as > long as the addresses weren't shared. Is there anything we can do to > make a cluster forget a member that has left gracefully? You lost me early in your description of the case... can you post a simple example, with 2 configs including TCPPING? In general, I recommend separating the sets of {TCP.bind_addr, TCPPING.initial_hosts) cleanly for each cluster, plus including *all* of the members of a cluster in TCPPING.initial_hosts. If you can't do that, then look into using a dynamic discovery mechanism. Cheers > Thanks, > Bobby > > > > On Tue, Apr 6, 2021 at 7:46 AM Questions/problems related to using > JGroups via javagroups-users <jav...@li... > <mailto:jav...@li...>> wrote: > > You can always change the list of initial hosts in TCPPING > programmatically, via getInitialHosts() / setInitialHosts(). > > Detecting that an address is wrong is outside the scope of JGroups, and > should be done (IMO) by your application, e.g. at > config/installation/startup time. > > This can of course be arbitrarily difficult, e.g. > * See if a symbolic name resolves correctly > * Check if a host is pingable > > You could also disallow a user from entering hostnames/IP addresses > him/herself directly and instead generate them yourself, e.g. by > recording all hosts on which an installation was performed and using > this as initial_hosts. > > You could also think of adding a protocol which checks (in init() or > start()) that the hostnames/addresses in TCPPING.initial_hosts resolve, > and possibly ping all entries before starting the stack. > > On a related note, take a look at [1] (added in 4.2.12): it skips > unresolved/unresolvable entries until an entry finally does resolve. 
> > Hope this helps, > > [1] https://issues.redhat.com/browse/JGRP-2535 > <https://issues.redhat.com/browse/JGRP-2535> > > On 05.04.21 22:50, Questions/problems related to using JGroups wrote: > > Hi, > > > > Our product uses the TCP stack with jgroups 4.1.8. It gets set up > by end > > users through a configuration file that contains (among other > things), a > > list of IP addresses for a node to connect to when joining a > cluster. We > > set this for TCPPING.initial_hosts. > > > > If they have a wrong address at startup they end up > getting JGRP000032 > > warnings filling the logs. For instance, the following leads to logs > > filling on two nodes, one of which was set up correctly: > > > > 1. Start cluster A/B. A is the coordinator. > > 2. Start a one-node cluster C. > > 3. On node D, include addresses for D and B in the initial hosts > list > > and attempt to join. > > 4. D will join C for a cluster C/D and, obviously, not join A/B > since it > > didn't attempt to connect to the coordinator. > > > > After this, the logs for D will fill with: > > WARN: JGRP000032: <D>: no physical address for <A>, dropping message > > > > ...and B logs will fill with: > > WARN: JGRP000032: <B>: no physical address for <C>, dropping message > > > > I know this is a setup error on the user's side, but was > wondering if > > there's anything we could add programmatically to stop it. For > instance, > > when they see the logs on X filling up with messages about Y in > another > > cluster, is there something we could do to tell X to forget Y > exists? > > It's not enough just to stop/fix/start that cluster, as (in the > case of > > A/B above) the cluster that was started correctly could be > showing this > > problem. For some customers, getting a maintenance window to shut > down > > all related clusters and restart them is a problem. > > > > For that matter, is there anything programmatically we could do to > > detect that this is happening? Besides parsing the jgroups logging > > output I mean. > > > > Thank you, > > Bobby > > > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > <mailto:jav...@li...> > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > <https://lists.sourceforge.net/lists/listinfo/javagroups-users> > > > > -- > Bela Ban | http://www.jgroups.org <http://www.jgroups.org> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > <mailto:jav...@li...> > https://lists.sourceforge.net/lists/listinfo/javagroups-users > <https://lists.sourceforge.net/lists/listinfo/javagroups-users> > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
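A small sketch for verifying the "separate the sets cleanly" recommendation at runtime, using the getInitialHosts() accessor mentioned in the quoted reply. The element type of the returned collection varies across JGroups releases, so entries are treated as opaque objects here, and the helper's class and method names are made up:

import java.util.Collection;
import org.jgroups.JChannel;
import org.jgroups.protocols.TCPPING;

public class InitialHostsDump {
    // Prints what TCPPING will actually contact, so disjointness between clusters can be checked.
    public static void dump(JChannel ch) {
        TCPPING ping = ch.getProtocolStack().findProtocol(TCPPING.class);
        if (ping == null) {
            System.out.println("no TCPPING in this stack");
            return;
        }
        Collection<?> hosts = ping.getInitialHosts();
        hosts.forEach(h -> System.out.println("initial host: " + h));
    }
}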
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-03 15:03:51
|
Hi again, Thanks for this. I have more information from the customer now, and see that the problem they're having isn't due to incorrect host information at startup like I thought. The setup to reproduce is pretty simple, and I understand their point that it doesn't look like user error. 1. Set up cluster A/B/C (A is coordinator). 2. At some point they don't need C in the cluster anymore and shut down the application there. It's a regular shutdown, not going suspect first. We use JChannel#close and then exit. 3. Later they use a node with the same address to join a different cluster with the same name. When C starts it only has D's address, and forms cluster D/C. After the above, the A/B cluster is getting a merge view change every ~minute, always including only A/B in the view. The log on A is also filling with: JGRP000032: <A>: no physical address for <D>, dropping message Because it's a merge view, we do extra processing to handle potential rejoin cases, which causes a couple other warnings every minute. I also see every ~minute that A tries to authorize itself with C. C's log has messages from our custom AuthToken class. If I use a different cluster for C/D that avoids a lot of the issues. There are no longer view changes and warnings in the first cluster, but the new one D/C has this in C's log constantly: JGRP000012: discarded message from different cluster <old> (our cluster is <new>). Sender was <A> That will help them some, but it's a large organization and they have a lot of clusters, since we thought it would be ok to reuse the name as long as the addresses weren't shared. Is there anything we can do to make a cluster forget a member that has left gracefully? Thanks, Bobby On Tue, Apr 6, 2021 at 7:46 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote: > You can always change the list of initial hosts in TCPPING > programmatically, via getInitialHosts() / setInitialHosts(). > > Detecting that an address is wrong is outside the scope of JGroups, and > should be done (IMO) by your application, e.g. at > config/installation/startup time. > > This can of course be arbitrarily difficult, e.g. > * See if a symbolic name resolves correctly > * Check if a host is pingable > > You could also disallow a user from entering hostnames/IP addresses > him/herself directly and instead generate them yourself, e.g. by > recording all hosts on which an installation was performed and using > this as initial_hosts. > > You could also think of adding a protocol which checks (in init() or > start()) that the hostnames/addresses in TCPPING.initial_hosts resolve, > and possibly ping all entries before starting the stack. > > On a related note, take a look at [1] (added in 4.2.12): it skips > unresolved/unresolvable entries until an entry finally does resolve. > > Hope this helps, > > [1] https://issues.redhat.com/browse/JGRP-2535 > > On 05.04.21 22:50, Questions/problems related to using JGroups wrote: > > Hi, > > > > Our product uses the TCP stack with jgroups 4.1.8. It gets set up by end > > users through a configuration file that contains (among other things), a > > list of IP addresses for a node to connect to when joining a cluster. We > > set this for TCPPING.initial_hosts. > > > > If they have a wrong address at startup they end up getting JGRP000032 > > warnings filling the logs. For instance, the following leads to logs > > filling on two nodes, one of which was set up correctly: > > > > 1. Start cluster A/B. A is the coordinator. > > 2. 
Start a one-node cluster C. > > 3. On node D, include addresses for D and B in the initial hosts list > > and attempt to join. > > 4. D will join C for a cluster C/D and, obviously, not join A/B since it > > didn't attempt to connect to the coordinator. > > > > After this, the logs for D will fill with: > > WARN: JGRP000032: <D>: no physical address for <A>, dropping message > > > > ...and B logs will fill with: > > WARN: JGRP000032: <B>: no physical address for <C>, dropping message > > > > I know this is a setup error on the user's side, but was wondering if > > there's anything we could add programmatically to stop it. For instance, > > when they see the logs on X filling up with messages about Y in another > > cluster, is there something we could do to tell X to forget Y exists? > > It's not enough just to stop/fix/start that cluster, as (in the case of > > A/B above) the cluster that was started correctly could be showing this > > problem. For some customers, getting a maintenance window to shut down > > all related clusters and restart them is a problem. > > > > For that matter, is there anything programmatically we could do to > > detect that this is happening? Besides parsing the jgroups logging > > output I mean. > > > > Thank you, > > Bobby > > > > > > > > _______________________________________________ > > javagroups-users mailing list > > jav...@li... > > https://lists.sourceforge.net/lists/listinfo/javagroups-users > > > > -- > Bela Ban | http://www.jgroups.org > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > |
From: Questions/problems r. to u. J. <jav...@li...> - 2021-04-06 11:44:59
|
You can always change the list of initial hosts in TCPPING programmatically, via getInitialHosts() / setInitialHosts(). Detecting that an address is wrong is outside the scope of JGroups, and should be done (IMO) by your application, e.g. at config/installation/startup time. This can of course be arbitrarily difficult, e.g. * See if a symbolic name resolves correctly * Check if a host is pingable You could also disallow a user from entering hostnames/IP addresses him/herself directly and instead generate them yourself, e.g. by recording all hosts on which an installation was performed and using this as initial_hosts. You could also think of adding a protocol which checks (in init() or start()) that the hostnames/addresses in TCPPING.initial_hosts resolve, and possibly ping all entries before starting the stack. On a related note, take a look at [1] (added in 4.2.12): it skips unresolved/unresolvable entries until an entry finally does resolve. Hope this helps, [1] https://issues.redhat.com/browse/JGRP-2535 On 05.04.21 22:50, Questions/problems related to using JGroups wrote: > Hi, > > Our product uses the TCP stack with jgroups 4.1.8. It gets set up by end > users through a configuration file that contains (among other things), a > list of IP addresses for a node to connect to when joining a cluster. We > set this for TCPPING.initial_hosts. > > If they have a wrong address at startup they end up getting JGRP000032 > warnings filling the logs. For instance, the following leads to logs > filling on two nodes, one of which was set up correctly: > > 1. Start cluster A/B. A is the coordinator. > 2. Start a one-node cluster C. > 3. On node D, include addresses for D and B in the initial hosts list > and attempt to join. > 4. D will join C for a cluster C/D and, obviously, not join A/B since it > didn't attempt to connect to the coordinator. > > After this, the logs for D will fill with: > WARN: JGRP000032: <D>: no physical address for <A>, dropping message > > ...and B logs will fill with: > WARN: JGRP000032: <B>: no physical address for <C>, dropping message > > I know this is a setup error on the user's side, but was wondering if > there's anything we could add programmatically to stop it. For instance, > when they see the logs on X filling up with messages about Y in another > cluster, is there something we could do to tell X to forget Y exists? > It's not enough just to stop/fix/start that cluster, as (in the case of > A/B above) the cluster that was started correctly could be showing this > problem. For some customers, getting a maintenance window to shut down > all related clusters and restart them is a problem. > > For that matter, is there anything programmatically we could do to > detect that this is happening? Besides parsing the jgroups logging > output I mean. > > Thank you, > Bobby > > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users > -- Bela Ban | http://www.jgroups.org |
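A sketch of the "protocol which checks in init()" idea from the reply above. The protocol name, the attribute name and the [port] stripping are assumptions; the host list is supplied as the protocol's own attribute, to be set to the same comma-separated string as TCPPING.initial_hosts:

import java.net.InetAddress;
import org.jgroups.annotations.Property;
import org.jgroups.stack.Protocol;

// Fails fast at stack init time if any configured hostname does not resolve.
public class CHECK_HOSTS extends Protocol {

    @Property(description="Comma-separated hostnames (TCPPING.initial_hosts style) to verify before starting")
    protected String hosts_to_check = "";

    @Override
    public void init() throws Exception {
        for (String entry : hosts_to_check.split(",")) {
            String host = entry.trim();
            if (host.isEmpty())
                continue;
            int bracket = host.indexOf('[');      // strip an optional [port] suffix
            if (bracket > 0)
                host = host.substring(0, bracket);
            InetAddress.getByName(host);          // throws UnknownHostException -> channel won't start
        }
    }
}

Pinging each resolved entry (the second check suggested above) could be added in start() with InetAddress.isReachable(), with the usual caveat that ICMP may be blocked on some networks.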
From: Questions/problems r. to u. J. <jav...@li...> - 2021-04-05 21:43:18
|
Hi, Our product uses the TCP stack with jgroups 4.1.8. It gets set up by end users through a configuration file that contains (among other things), a list of IP addresses for a node to connect to when joining a cluster. We set this for TCPPING.initial_hosts. If they have a wrong address at startup they end up getting JGRP000032 warnings filling the logs. For instance, the following leads to logs filling on two nodes, one of which was set up correctly: 1. Start cluster A/B. A is the coordinator. 2. Start a one-node cluster C. 3. On node D, include addresses for D and B in the initial hosts list and attempt to join. 4. D will join C for a cluster C/D and, obviously, not join A/B since it didn't attempt to connect to the coordinator. After this, the logs for D will fill with: WARN: JGRP000032: <D>: no physical address for <A>, dropping message ...and B logs will fill with: WARN: JGRP000032: <B>: no physical address for <C>, dropping message I know this is a setup error on the user's side, but was wondering if there's anything we could add programmatically to stop it. For instance, when they see the logs on X filling up with messages about Y in another cluster, is there something we could do to tell X to forget Y exists? It's not enough just to stop/fix/start that cluster, as (in the case of A/B above) the cluster that was started correctly could be showing this problem. For some customers, getting a maintenance window to shut down all related clusters and restart them is a problem. For that matter, is there anything programmatically we could do to detect that this is happening? Besides parsing the jgroups logging output I mean. Thank you, Bobby |
From: Questions/problems r. to u. J. <jav...@li...> - 2020-11-18 15:55:57
|
But a member won't be able to connect (JChannel.connect(cluster)), so what's the point? This will fail! On 18.11.20 4:38 pm, Questions/problems related to using JGroups wrote: > So the environment where we deploy the nodes is very unreliable. > Network switches or links could be down and that is okay since we test > to make sure all the nodes in the system can handle it. > > An example of this happening in production is when the network is down > but a node is coming up due to a system power recovery or a complete > system wide reboot. > > > _______________________________________________ > javagroups-users mailing list > jav...@li... > https://lists.sourceforge.net/lists/listinfo/javagroups-users -- Bela Ban, JGroups lead (http://www.jgroups.org) |