Thread: [jgroups-users] removing unknown address from cluster? JGRP000032
From: Questions/problems r. to u. J. <jav...@li...> - 2021-04-05 21:43:18
Hi,

Our product uses the TCP stack with jgroups 4.1.8. It gets set up by end users through a configuration file that contains (among other things) a list of IP addresses for a node to connect to when joining a cluster. We set this for TCPPING.initial_hosts.

If they have a wrong address at startup, they end up with JGRP000032 warnings filling the logs. For instance, the following leads to logs filling on two nodes, one of which was set up correctly:

1. Start cluster A/B. A is the coordinator.
2. Start a one-node cluster C.
3. On node D, include addresses for D and B in the initial hosts list and attempt to join.
4. D will join C for a cluster C/D and, obviously, not join A/B, since it didn't attempt to connect to the coordinator.

After this, the logs for D will fill with:
WARN: JGRP000032: <D>: no physical address for <A>, dropping message

...and B's logs will fill with:
WARN: JGRP000032: <B>: no physical address for <C>, dropping message

I know this is a setup error on the user's side, but was wondering if there's anything we could add programmatically to stop it. For instance, when they see the logs on X filling up with messages about Y in another cluster, is there something we could do to tell X to forget Y exists? It's not enough just to stop/fix/start that cluster, as (in the case of A/B above) the cluster that was started correctly could be showing this problem. For some customers, getting a maintenance window to shut down all related clusters and restart them is a problem.

For that matter, is there anything we could do programmatically to detect that this is happening? Besides parsing the jgroups logging output, I mean.

Thank you,
Bobby

From: Questions/problems r. to u. J. <jav...@li...> - 2021-04-06 11:44:59
You can always change the list of initial hosts in TCPPING programmatically, via getInitialHosts() / setInitialHosts().

Detecting that an address is wrong is outside the scope of JGroups, and should be done (IMO) by your application, e.g. at config/installation/startup time. This can of course be arbitrarily difficult, e.g.
* See if a symbolic name resolves correctly
* Check if a host is pingable

You could also disallow users from entering hostnames/IP addresses directly and instead generate them yourself, e.g. by recording all hosts on which an installation was performed and using this as initial_hosts.

You could also think of adding a protocol which checks (in init() or start()) that the hostnames/addresses in TCPPING.initial_hosts resolve, and possibly pings all entries before starting the stack.

On a related note, take a look at [1] (added in 4.2.12): it skips unresolved/unresolvable entries until an entry finally does resolve.

Hope this helps,

[1] https://issues.redhat.com/browse/JGRP-2535

On 05.04.21 22:50, Questions/problems related to using JGroups wrote:
> [...]

--
Bela Ban | http://www.jgroups.org
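For reference, a minimal sketch of that programmatic update, using the getInitialHosts()/setInitialHosts() accessors mentioned above. The channel handling, addresses and port are assumptions for illustration:

import java.util.Arrays;
import java.util.List;
import org.jgroups.JChannel;
import org.jgroups.PhysicalAddress;
import org.jgroups.protocols.TCPPING;
import org.jgroups.stack.IpAddress;

public final class InitialHostsUtil {
    private InitialHostsUtil() {}

    // Replace TCPPING's initial_hosts on a running channel once a
    // corrected host list is known; discovery uses the new list from
    // then on.
    public static void updateInitialHosts(JChannel channel) throws Exception {
        TCPPING ping = channel.getProtocolStack().findProtocol(TCPPING.class);
        List<PhysicalAddress> hosts = Arrays.asList(
                new IpAddress("192.168.1.128", 7800),   // hypothetical members
                new IpAddress("192.168.1.129", 7800));
        ping.setInitialHosts(hosts);
    }
}
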
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-03 15:03:51
Hi again,

Thanks for this. I have more information from the customer now, and see that the problem they're having isn't due to incorrect host information at startup like I thought. The setup to reproduce is pretty simple, and I understand their point that it doesn't look like user error.

1. Set up cluster A/B/C (A is coordinator).
2. At some point they don't need C in the cluster anymore and shut down the application there. It's a regular shutdown, not going suspect first. We use JChannel#close and then exit.
3. Later they use a node with the same address to join a different cluster with the same name. When C starts it only has D's address, and forms cluster D/C.

After the above, the A/B cluster is getting a merge view change every ~minute, always including only A/B in the view. The log on A is also filling with:
JGRP000032: <A>: no physical address for <D>, dropping message

Because it's a merge view, we do extra processing to handle potential rejoin cases, which causes a couple of other warnings every minute.

I also see every ~minute that A tries to authorize itself with C. C's log has messages from our custom AuthToken class.

If I use a different cluster name for C/D, that avoids a lot of the issues. There are no longer view changes and warnings in the first cluster, but the new one, D/C, has this in C's log constantly:
JGRP000012: discarded message from different cluster <old> (our cluster is <new>). Sender was <A>

That will help them some, but it's a large organization and they have a lot of clusters, since we thought it would be ok to reuse the name as long as the addresses weren't shared. Is there anything we can do to make a cluster forget a member that has left gracefully?

Thanks,
Bobby

On Tue, Apr 6, 2021 at 7:46 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote:
> [...]

From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-04 05:53:29
On 03.05.21 17:02, Questions/problems related to using JGroups wrote:
> Hi again,
>
> Thanks for this. I have more information from the customer now, and see
> that the problem they're having isn't due to incorrect host information
> at startup like I thought. The setup to reproduce is pretty simple, and
> I understand their point that it doesn't look like user error.
>
> 1. Set up cluster A/B/C (A is coordinator).
> 2. At some point they don't need C in the cluster anymore and shut down
> the application there. It's a regular shutdown, not going suspect first.
> We use JChannel#close and then exit.
OK
> 3. Later they use a node with the same address to join a different
> cluster with the same name.
Can you post an example? Note that discovery requests from different
clusters are discarded.
> When C starts it only has D's address, and forms cluster D/C.
>
> After the above, the A/B cluster is getting a merge view change every
> ~minute, always including only A/B in the view. The log on A is also
> filling with:
> JGRP000032: <A>: no physical address for <D>, dropping message
> Because it's a merge view, we do extra processing to handle potential
> rejoin cases, which causes a couple other warnings every minute.
>
> I also see every ~minute that A tries to authorize itself with C. C's
> log has messages from our custom AuthToken class.
>
>
> If I use a different cluster name for C/D, that avoids a lot of the issues.
> There are no longer view changes and warnings in the first cluster, but
> the new one D/C has this in C's log constantly:
> JGRP000012: discarded message from different cluster <old> (our cluster
> is <new>). Sender was <A>
>
> That will help them some, but it's a large organization and they have a
> lot of clusters, since we thought it would be ok to reuse the name as
> long as the addresses weren't shared. Is there anything we can do to
> make a cluster forget a member that has left gracefully?
You lost me early in your description of the case... can you post a
simple example, with 2 configs including TCPPING?
In general, I recommend separating the sets of {TCP.bind_addr,
TCPPING.initial_hosts} cleanly for each cluster, plus including *all* of
the members of a cluster in TCPPING.initial_hosts.
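For illustration (addresses and ports are made up), cleanly separated clusters would look like:

  cluster "one", nodes .128/.129:
    TCP.bind_addr = 192.168.1.128 (or .129 on the other node)
    TCP.bind_port = 7800
    TCPPING.initial_hosts = 192.168.1.128[7800],192.168.1.129[7800]

  cluster "two", nodes .130/.131 -- different port, no shared addresses:
    TCP.bind_addr = 192.168.1.130 (or .131)
    TCP.bind_port = 7900
    TCPPING.initial_hosts = 192.168.1.130[7900],192.168.1.131[7900]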
If you can't do that, then look into using a dynamic discovery mechanism.
Cheers
> Thanks,
> Bobby
> [...]
--
Bela Ban | http://www.jgroups.org
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-04 21:00
On Tue, May 4, 2021 at 1:53 AM Questions/problems related to using JGroups
via javagroups-users <jav...@li...> wrote:
> [...]
>
>
> > 3. Later they use a node with the same address to join a different
> > cluster with the same name.
>
> Can you post an example? Note that discovery requests from different
> clusters are discarded.
>
Sure, but in summary: I can't reuse an IP address after it's already been
in a cluster. The customer is trying to run separate clusters, but the
address of a node in one of them was previously in a different one, and
that is causing problems.
My config is programmatic; I've included it below. We use a custom
authentication class. When authenticate() is called it will output the
source and the response it's returning. I've set the jgroups logging to
DEBUG level; my application only logs the initial_hosts it sets and the
authentication calls. The member addresses end in: 128, 129, 130, and 131.
1. I start a cluster with (started in this order) 128, 129, 130. Each of
them has all three of those addresses in initial_hosts.
2. I shut down the application running on 130. The logs for 128 and 129
have "*** stopping application on .130" in them right before this.
3. I start an application on 131 that has 130/131 in initial_hosts.
4. I start a new application on the node with the 130 address. It has 130
and 131 in initial hosts. The logs on 128 and 129 have "*** new application
on .130 starting and will join new cluster with .131" in them to show when
it happens.
About a minute later, the errors start showing up. The 128 application is
trying to contact the member that previously ran on 130, even though that
member shut down and left the cluster. The new application on 130 doesn't
let it join, and merge views repeat with warning messages throughout.
There is a merge view change every minute or so in the original cluster
(128/129).
The stack we create (comments and text changes for sharing):
public JChannel createJChannel() throws Exception {
    Logger logger = <...>
    logger.log(Level.DEBUG, "Creating default JChannel.");
    List<Protocol> stack = new ArrayList<>();
    final Protocol tcp = new TCP()
            // bind_addr will be the same address, e.g. .128, .129, etc. that we use in initial_hosts
            .setValue("bind_addr", InetAddress.getByName(getBindingAddress()))
            .setValue("bind_port", bindingPort)
            .setValue("thread_pool_min_threads", 1)
            .setValue("thread_pool_keep_alive_time", 5000)
            .setValue("send_buf_size", 640000)
            .setValue("sock_conn_timeout", 300)
            .setValue("recv_buf_size", 5000000);
    // some optional things we could add to tcp removed; not used in this example
    stack.add(tcp);
    stack.add(new TCPPING()
            // the parseHostList method will output the list for this example at ERROR level
            .setValue("initial_hosts", parseHostList())
            .setValue("send_cache_on_join", true)
            .setValue("port_range", 0));
    stack.add(new MERGE3()
            .setValue("min_interval", 10000)
            .setValue("max_interval", 30000));
    FD_ALL fdAll = new FD_ALL();
    final long jgroupsTimeout = <>
    fdAll.setValue("timeout", jgroupsTimeout);
    final long maxInterval = jgroupsTimeout / 3L; // to have ~3 heartbeats before going suspect. <jira number removed>
    if (maxInterval < fdAll.getInterval()) {
        logger.log(Level.WARN, ".......");
        fdAll.setValue("interval", maxInterval);
    }
    stack.add(fdAll);
    stack.add(new VERIFY_SUSPECT()
            .setValue("timeout", 1500));
    stack.add(new BARRIER());
    if (getBoolean(<an application property>)) {
        logger.debug("adding jgroups asym encryption");
        stack.add(new ASYM_ENCRYPT()
                .setValue("sym_keylength", 128)
                .setValue("sym_algorithm", "AES/CBC/PKCS5Padding")
                .setValue("sym_iv_length", 16)
                .setValue("asym_keylength", 2048)
                .setValue("asym_algorithm", "RSA")
                .setValue("change_key_on_leave", true));
    }
    stack.add(new NAKACK2()
            .setValue("use_mcast_xmit", false));
    stack.add(new UNICAST3());
    stack.add(new STABLE()
            .setValue("desired_avg_gossip", 50000)
            .setValue("max_bytes", 4000000));
    // protocol will log auth request source and response
    stack.add(createAuthProtocol());
    stack.add(new GMS()
            .setValue("join_timeout", 3000));
    stack.add(new MFC()
            .setValue("max_credits", 2000000)
            .setValue("min_credits", 800000));
    stack.add(new FRAG2());
    stack.add(new STATE_TRANSFER());
    return new JChannel(stack);
}
Thanks again,
Bobby
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-25 14:30:37
Hi Bobby
apologies for the delay!
You cannot have the old cluster's initial_hosts be 128,129,130 while the
new one has the overlapping range 130,131. The old cluster will try to
contact 130 (e.g. trying to merge), thereby sending its information to 130.

Depending on traffic patterns, everybody will know everyone else's
address, or not. For example, it could be that 128 and 130 know everyone
else, but 129 and 131 don't know each other.

In the former case, there will be a merge to {128,129,130,131}. In the
latter case, members will fail to talk to other members, as they don't
have the other members in their logical address cache.
If the old cluster didn't have 130 in its initial_hosts, everything
would be fine.
What is it you're trying to achieve?
If you're trying to start a new cluster, then either give it a new
cluster name and/or a new set of (unused) ports. Both cluster names and
ports could be dished out by a server accessible to all.
Cheers
On 04.05.21 21:00, Questions/problems related to using JGroups wrote:
> [...]
--
Bela Ban | http://www.jgroups.org
From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-25 17:59:07
On Tue, May 25, 2021 at 10:30 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote:
> Hi Bobby
> apologies for the delay!

No problem -- thanks for looking.

> You cannot have the old cluster's initial_hosts be 128,129,130 while the
> new one has the overlapping range 130,131.

That's the problem. The customer has lots of nodes, clusters that grow and shrink, and they're going to reuse the same IP addresses eventually.

> The old cluster will try to contact 130 (e.g. trying to merge), thereby
> sending its information to 130.

Right, and what they want is some way to fully remove a node from a cluster, i.e. the cluster stops trying to contact that address.

> What is it you're trying to achieve?

Simply to take a node out of a cluster when it's not needed, then later reuse the address of that node with a different cluster. If I change the cluster names (same port though), then I still get constant warnings, like:
JGRP000012: discarded message from different cluster <old> (our cluster is <new>). Sender was <some addr>

We can suggest that they restart the cluster after removing a node, but I don't know if that will work for them. I'll also try using different ports for different clusters and see how that works for them. Given the size of the company in question, I can see that it might be hard to coordinate that, and eventually they'll get back in the same situation where a previously used address is being used again with the same port it used the last time.

Thanks,
Bobby

From: Questions/problems r. to u. J. <jav...@li...> - 2021-05-26 07:26:10
On 25.05.21 18:59, Questions/problems related to using JGroups wrote:
> That's the problem. The customer has lots of nodes, clusters that grow
> and shrink, and they're going to reuse the same IP addresses eventually.

Then using TCPPING for discovery is the wrong solution; it is designed
for a static cluster with a fixed and known membership. For the above
requirements, I'd rather recommend:
* A dynamic discovery mechanism (TCPGOSSIP, FILE_PING, GOOGLE_PING etc.)
* Ephemeral ports
* A new (different) cluster name for each new cluster that is started

> Right, and what they want is some way to fully remove a node from a
> cluster, i.e. the cluster stops trying to contact that address.

Then you would have to remove the 130 node from the old cluster's
initial_hosts (TCPPING) and TCP's logical address cache, either by
restarting or by programmatically removing it. This can get complex
quickly though, as you'd have to maintain a list of ports per cluster.
The first solution above is much better IMO.

> We can suggest that they restart the cluster after removing a node, but
> I don't know if that will work for them. I'll also try using different
> ports for different clusters and see how that works for them.

That will certainly work, but - again - you'd have to maintain port
numbers for each cluster. Registration service? Excel spreadsheet?

> Given the size of the company in question, I can see that it might be
> hard to coordinate that, and eventually they'll get back in the same
> situation where a previously used address is being used again with the
> same port it used the last time.

Right. So I have to come back to my suggestion of not using TCPPING!

Cheers,

--
Bela Ban | http://www.jgroups.org
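To make the first recommendation concrete: in the programmatic stack posted earlier in the thread, the TCPPING block could be swapped for FILE_PING. A sketch only -- the shared directory path is made up, and remove_all_data_on_view_change is one optional cleanup knob:

import org.jgroups.protocols.FILE_PING;

// Replaces the stack.add(new TCPPING()...) block from createJChannel().
// All members must be able to read/write the shared directory (e.g. NFS).
stack.add(new FILE_PING()
        .setValue("location", "/mnt/shared/jgroups")
        .setValue("remove_all_data_on_view_change", true));
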
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-06 19:35:18
Hi again Bela et al,

We've finally come back to this issue after not working on the product for a while. I'm keeping the context all below, but the short version was that we use TCPPING and, if someone removes a node with address X and, later, starts a new cluster that includes that address, the old cluster keeps trying to find its lost buddy at X.

We're still back on v4.1.8 and I wanted to ask if the suggestion below, i.e. use TCPGOSSIP or FILE_PING (this is for in-house deployments on their own networks), is the most appropriate, and if there would be any benefit for this particular issue in moving to v5.x.

The way they run things now is to put host:port info for each node in a file and then start the applications, which read that file to set initial hosts. So FILE_PING might be the best for them, so that we don't need to have any new processes running.

Thanks,
Bobby

On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote:
> [...]

From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-10 09:47:52
Hi Bobby

On 06.06.22 21:08, Questions/problems related to using JGroups wrote:
> We've finally come back to this issue after not working on the product
> for a while. I'm keeping the context all below, but the short version
> was that we use TCPPING and, if someone removes a node with address X
> and, later, starts a new cluster that includes that address, the old
> cluster keeps trying to find its lost buddy at X.

Right, and I suggested using a dynamic discovery protocol, *not* TCPPING.

> We're still back on v4.1.8 and I wanted to ask if the suggestion below,
> i.e. use TCPGOSSIP or FILE_PING (this is for in-house deployments on
> their own networks), is the most appropriate, and if there would be any
> benefit for this particular issue in moving to v5.x.

There are loads of benefits to moving to 5.x :-) But, specific to this case, only the ability to have multiple discovery protocols in the same stack would be beneficial here. I guess MULTI_PING in 4.x might do the same job though...

> The way they run things now is to put host:port info for each node in a
> file and then start the applications, which read that file to set
> initial hosts. So FILE_PING might be the best for them, so that we
> don't need to have any new processes running.

Yes. The benefits/drawbacks of FILE_PING are:
+ No additional process needed
+ All processes access a shared dir, e.g. on NFS
- NFS adds overhead (but only for discovery)
+ The discovery info is human-readable, and can thus be modified manually (if needed)

> [...]

--
Bela Ban | http://www.jgroups.org
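Pictorially, the shared directory is organized per cluster name, so separate clusters never touch each other's discovery data. The layout below is illustrative only; the path is made up and exact file naming varies by JGroups version:

/mnt/shared/jgroups/          <- FILE_PING.location
    clusterA/                 <- one subdirectory per cluster name
        <member>.list         <- one entry per member: logical name, UUID,
        ...                      physical address, coordinator flag
    clusterB/
        <member>.list
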
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-06 20:40:05
Although, looking at this again, I think we might not be talking about the same setup. From this:

On Wed, May 26, 2021 at 3:26 AM Questions/problems related to using JGroups via javagroups-users <jav...@li...> wrote:
> [...]
>
> > Right, and what they want is some way to fully remove a node from a
> > cluster, i.e. the cluster stops trying to contact that address.
>
> Then you would have to remove the 130 node from the old cluster's
> initial_hosts (TCPPING) and TCP's logical address cache, either by
> restarting or by programmatically removing it. This can get complex
> quickly though, as you'd have to maintain a list of ports per cluster.

Each cluster is separate from all the others, so I don't know what I would need to keep in this list or why a cluster would need it. If a cluster has A/B/C/D in it, and the code sees that D leaves the cluster without going suspect first, can I programmatically do these?
- set new initial_hosts on the existing TCPPING protocol in my stack to include only A/B/C
- access the logical address cache and remove the address

I mean, I know I can hack the TCPPING again, but didn't know that would have any effect on the existing channel and members. I don't know offhand how to access the address cache, which I think is all I'm missing to experiment with this. If I can do the above then I think that solves the issue -- if a suspect member leaves the view I won't do anything, because we want to keep trying it in case it was disconnected and reconnected. But if a member leaves gracefully and the above is all I need to make the cluster forget about it, that's great and means we wouldn't have to change any startup features for the customers.

Thanks again,
Bobby

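The "left gracefully vs. went suspect" distinction described above can be prototyped with the standard MembershipListener callbacks. A sketch, with the class name and bookkeeping invented:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.jgroups.Address;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

// Record suspicions so that a later view change can tell a graceful
// leave (member gone, never suspected) from a crash (suspected first).
public class LeaveTracker extends ReceiverAdapter {
    private final Set<Address> suspected = ConcurrentHashMap.newKeySet();
    private volatile View lastView;

    @Override
    public void suspect(Address mbr) {
        suspected.add(mbr);
    }

    @Override
    public void viewAccepted(View view) {
        if (lastView != null) {
            for (Address left : View.leftMembers(lastView, view)) {
                if (!suspected.remove(left)) {
                    // Left gracefully: candidate for the forget/cleanup
                    // steps discussed in the next message.
                }
            }
        }
        lastView = view;
    }
}

It would be installed with channel.setReceiver(new LeaveTracker()) before connecting.
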
From: Questions/problems r. to u. J. <jav...@li...> - 2022-06-10 10:25:09
On 06.06.22 22:15, Questions/problems related to using JGroups wrote:
> Each cluster is separate from all the others, so I don't know what I
> would need to keep in this list or why a cluster would need it.

Referring to my previous email: if you use FILE_PING, each cluster has a _separate_ directory (the cluster name) under which the discovery info is stored.

> If a cluster has A/B/C/D in it, and the code sees that D leaves the
> cluster without going suspect first, can I programmatically do these?

For TCP, it's complicated, but doable. Among other things, you'd have to:
- Close all TCP connections to D
- Close all connections to D in UNICAST3, too
- Remove D's info from the logical address cache (contents: 'probe.sh uuids')
- Remove D's information from all instances of TCPPING (initial_hosts and dynamic_hosts)

Again, using a dynamic discovery protocol such as FILE_PING makes more sense here.

> - set new initial_hosts on the existing TCPPING protocol in my stack to
>   include only A/B/C
> - access the logical address cache and remove the address

Yes, but this is not enough (see above).

> I don't know offhand how to access the address cache, which I think is
> all I'm missing to experiment with this.

Pseudo code:

TP tp = channel.getProtocolStack().getTransport();
LazyRemovalCache cache = tp.getLogicalAddressCache();
cache.remove(address, true); // force removal

> [...]

--
Bela Ban | http://www.jgroups.org
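Bela's pseudo code, filled out as a compilable helper. Note this covers only the address-cache step of the checklist above; the class and method names are invented:

import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.PhysicalAddress;
import org.jgroups.protocols.TP;
import org.jgroups.util.LazyRemovalCache;

public final class AddressCacheUtil {
    private AddressCacheUtil() {}

    // Force-remove a departed member's entry from the transport's
    // logical address cache, so the stack stops resolving its UUID
    // to the stale physical address.
    public static void forgetPhysicalAddress(JChannel channel, Address member) {
        TP tp = channel.getProtocolStack().getTransport();
        LazyRemovalCache<Address, PhysicalAddress> cache = tp.getLogicalAddressCache();
        cache.remove(member, true); // true == force removal
    }
}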