If TCP is used for transport together with TCP_USER_TIMEOUT, a node that leaves the cluster because of a brief network outage does not come back into the cluster automatically.
For example, if TCP_USER_TIMEOUT is set to 1500 ms and the network outage on the link lasts 2000 ms, the node never comes back into the cluster.
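To make the timing concrete, here is a minimal sketch of how TCP_USER_TIMEOUT is set on a Linux socket, plus a deliberately simplified model of the 1500 ms / 2000 ms case. The function names are hypothetical and are not taken from DTM; the model assumes data is in flight for the whole outage, which is an assumption, not a statement about how DTM behaves.

```python
import socket


def make_socket_with_user_timeout(timeout_ms: int) -> socket.socket:
    """Create a TCP socket and set TCP_USER_TIMEOUT where the platform has it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_USER_TIMEOUT is Linux-specific; its value is given in milliseconds.
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
    return sock


def connection_dropped(outage_ms: int, user_timeout_ms: int) -> bool:
    """Simplified model (assumption): unacknowledged data is pending for the
    whole outage, so the connection is dropped once the outage lasts at least
    as long as the user timeout."""
    return outage_ms >= user_timeout_ms


if __name__ == "__main__":
    sock = make_socket_with_user_timeout(1500)
    sock.close()
    # The case from this ticket: 2000 ms outage against a 1500 ms timeout.
    print(connection_dropped(outage_ms=2000, user_timeout_ms=1500))
```

Under this model, any outage shorter than the timeout leaves the connection up, so the node never leaves the cluster and rediscovery is not needed.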
I believe DTM sends broadcast (or multicast) messages on the network for a while after it has started, to discover other nodes on the network. It then stops doing this, which would explain why it fails to reconnect after a network disturbance.
A solution could be:
* The node with the lowest node_id will never stop broadcasting the discovery messages.
* A node which is connected with another node with a lower node_id will never broadcast discovery messages.
* The node with the lowest node_id will inform all the other connected nodes about the topology of the cluster - in particular, if a new node has appeared.
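The first two rules of this proposal can be sketched as a single local decision. This is only an illustration of the suggestion, not existing DTM code; `should_broadcast` and its parameters are hypothetical names.

```python
from typing import Iterable


def should_broadcast(own_node_id: int, connected_node_ids: Iterable[int]) -> bool:
    """Return True if this node should keep sending discovery messages.

    A node stays silent as soon as it is connected to any node with a lower
    node_id. A node with no lower-id peer (including an isolated node) keeps
    broadcasting, so it behaves as the lowest node of its partition.
    """
    return all(own_node_id < peer_id for peer_id in connected_node_ids)


if __name__ == "__main__":
    print(should_broadcast(1, {2, 3}))  # lowest node in the cluster
    print(should_broadcast(2, {1, 3}))  # connected to lower node 1
    print(should_broadcast(2, set()))   # cut off from every other node
```

With this rule, a node that loses all its connections starts broadcasting again on its own, which is the case that currently fails after the outage.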
I don't think Alex is talking about the initial discovery process (topology node discovery), but in any case we can configure a very large value of DTM_INI_DIS_TIMEOUT_SECS in dtm.conf to verify.
If I set DTM_INI_DIS_TIMEOUT_SECS to 5000 s, the nodes do relearn each other and come back into the cluster.
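For reference, the setting used for this check would look roughly like the fragment below. The KEY=VALUE form is an assumption about the configuration file syntax; only the parameter name and the value of 5000 seconds come from this thread.

```
# Keep initial discovery running for 5000 seconds (value used for the test above)
DTM_INI_DIS_TIMEOUT_SECS=5000
```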
commit 3ac6c452d30d2814f1704af578617f2a90f439b7
Author: Alex Jones <alex.jones@genband.com>
Date: Tue Aug 15 11:36:41 2017 -0400