Hi Folks,
We were able to deploy a DB cluster on the FreeBSD platform in HA mode (4 DB nodes and 4 connector nodes, including 1 GTM master, 1 GTM slave and 2 GTM proxies), and the system runs fine. When we fail the connector node (the GTM master), failover to the GTM slave succeeds, but the cluster then stops serving DB queries and throws the following error:
ERROR: Failed to get pooled connections
Even a restart/reboot of the DB cluster does not help, although every node's status is Running. On the GTM master node (which moved from slave to master state on failover), a few of the pooler-port TCP connections are moving to the TIME_WAIT state.
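For reference, the TIME_WAIT sockets can be seen with a plain socket listing on that node (pooler port 15921 is taken from the configuration below):

    netstat -an | grep 15921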
The cluster state is:
su -l pgxl -c "/mps/db_pgxl/bin/pgxc_ctl 'monitor all'"
/usr/bin/bash
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Reading configuration using /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash --home /var/mps/db_pgxl/pgxl/pgxc_ctl_home --configuration /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl.conf
Finished to read configuration.
PGXC_CTL START *
Current directory: /var/mps/db_pgxl/pgxl/pgxc_ctl_home
Running: gtm master
Running: gtm proxy gtm2
Running: coordinator master coord1
Running: coordinator master coord3
Running: datanode master datanode1
Running: datanode master datanode2
Any help is appreciated.
Thanks,
Srinivas
select * from pgxc_node;
node_name | node_type | node_port | node_host | nodeis_primary | nodeis_preferred | node_id
-----------+-----------+-----------+---------------+----------------+------------------+------------
coord1 | C | 11921 | 10.102.31.167 | f | f | 1885696643
coord3 | C | 11921 | 10.102.31.206 | f | f | 1638403545
datanode2 | D | 12921 | 10.102.31.202 | f | f | -905831925
datanode1 | D | 12921 | 10.102.31.171 | t | f | 888802358
The coord1 IP is not updated, even though pgxc_ctl.conf has been updated with the failover IP as below:
coordMasterServers=( 10.102.31.168 none 10.102.31.206 )
coordPorts=( 11921 -1 11921 )
poolerPorts=( 15921 -1 15921 )
coordMasterDirs=( /var/mps/db_pgxl/data/auto/nodes/coord none /var/mps/db_pgxl/data/auto/nodes/coord )
coordSlaveServers=( none none none )
coordSlavePorts=( none none 11921 )
coordSlavePoolerPorts=( none none 15921 )
coordSlaveDirs=( none none none )
FreeBSD is not a supported platform at this stage. But you're welcome to test and submit patches to fix things.
Regarding this issue, did you use the pgxc_ctl failover command to fail over the coordinator? Can you attach the output of that command? You could possibly fix this by manually running ALTER NODE to change the coordinator's IP address in pgxc_node.
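For example, from one of the surviving coordinators, something along these lines (adjust names, hosts and ports to match your configuration):

    /mps/db_pgxl/bin/psql -U mpsroot mpsdb -p 11921 -h 10.102.31.206 -c "ALTER NODE coord1 WITH (HOST = '10.102.31.168', PORT = 11921)"
    /mps/db_pgxl/bin/psql -U mpsroot mpsdb -p 11921 -h 10.102.31.206 -c "SELECT pgxc_pool_reload()"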
Thanks,
Pavan
Hi Pavan,
Thanks for your response in spite of your busy schedule. Can you please guide me on how to fix these issues?
Here is my complete story from scratch again. Looking forward to your guidance.
I have reconfigured the setup and tested again. After failing over the GTM master, and even after a restart of the cluster, I still get the following error when executing a query:
ERROR: Failed to get pooled connections
The steps I performed during the test:
PG XL Initial configuration:
=====================================================
GTM Master and coordinator(coord1) master IP: 10.102.31.167
GTM Slave and coordinator(coord1) slave IP: 10.102.31.168
Coordinator(coord2,coord3) Master with GTM_Proxy : 10.102.31.203, 10.102.31.206 --> No Slaves configured for these
Datanode(datanode1,datanode2) Master IP: 10.102.31.170, 10.102.31.201
Datanode(datanode1,datanode2) Slave IP: 10.102.31.171, 10.102.31.202
I removed the running slave datanodes and coordinators with the pgxc_ctl tool via a script. I have observed that if I shut down a node and then execute pgxc_ctl commands, those commands hang until the node is brought up again. Is this the expected behaviour?
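For reference, the removals were done with pgxc_ctl commands along these lines (node names from the setup above; the exact syntax may differ by version):

    /mps/db_pgxl/bin/pgxc_ctl "remove datanode slave datanode1 clean"
    /mps/db_pgxl/bin/pgxc_ctl "remove datanode slave datanode2 clean"
    /mps/db_pgxl/bin/pgxc_ctl "remove coordinator slave coord1 clean"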
Before failover of the GTM master (10.102.31.167), the pgxc_node state was as below:
===========================================================================
/mps/db_pgxl/bin/psql -U mpsroot mpsdb -p 11921 -h 10.102.31.167
psql (PGXL 9.2.0, based on PG 9.2.4 (Postgres-XL 9.2.0))
Type "help" for help.
mpsdb=> select * from pgxc_node;
node_name | node_type | node_port | node_host | nodeis_primary | nodeis_preferred | node_id
-----------+-----------+-----------+---------------+----------------+------------------+------------
coord1 | C | 11921 | 10.102.31.167 | f | f | 1885696643
datanode1 | D | 12921 | 10.102.31.170 | t | f | 888802358
datanode2 | D | 12921 | 10.102.31.202 | f | f | -905831925
(3 rows)
Here are the outputs of the GTM and coordinator failover commands:
===========================================================================================
bash-2.05b# su -l pgxl -c "sh /mps/failover_gtm.sh coord1"
Script to failover GTM and Coordinator.
`/mps/db_pgxl/bin/pgxc_ctl "failover gtm " \'
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Reading configuration using /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash --home /var/mps/db_pgxl/pgxl/pgxc_ctl_home --configuration /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl.conf
Finished to read configuration.
/usr/bin/bash
* PGXC_CTL START **
Current directory: /var/mps/db_pgxl/pgxl/pgxc_ctl_home
Failover gtm
Running "gtm_ctl promote -Z gtm -D /var/mps/db_pgxl/data/auto/nodes/gtm"
Actual Command: ssh pgxl@10.102.31.168 "( gtm_ctl promote -Z gtm -D /var/mps/db_pgxl/data/auto/nodes/gtm ) > /tmp/ns-mgmt-system_STDOUT_24874_0 2>&1" < /dev/null > /dev/null 2>&1
Bring remote stdout: scp pgxl@10.102.31.168:/tmp/ns-mgmt-system_STDOUT_24874_0 /tmp/STDOUT_24874_1 > /dev/null 2>&1
Updating gtm.conf at 10.102.31.168:/var/mps/db_pgxl/data/auto/nodes/gtm
Actual Command: ssh pgxl@10.102.31.168 "( cat >> /var/mps/db_pgxl/data/auto/nodes/gtm/gtm.conf ) > /tmp/ns-mgmt-system_STDOUT_24874_3 2>&1" < /tmp/STDIN_24874_2 > /dev/null 2>&1
Bring remote stdout: scp pgxl@10.102.31.168:/tmp/ns-mgmt-system_STDOUT_24874_3 /tmp/STDOUT_24874_4 > /dev/null 2>&1
`/mps/db_pgxl/bin/pgxc_ctl "reconnect gtm_proxy all " \'
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Reading configuration using /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash --home /var/mps/db_pgxl/pgxl/pgxc_ctl_home --configuration /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl.conf
Finished to read configuration.
/usr/bin/bash
* PGXC_CTL START **
Current directory: /var/mps/db_pgxl/pgxl/pgxc_ctl_home
ERROR: gtm proxy is not configured.
`/mps/db_pgxl/bin/pgxc_ctl "stop gtm_proxy all " \'
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Reading configuration using /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash --home /var/mps/db_pgxl/pgxl/pgxc_ctl_home --configuration /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl.conf
Finished to read configuration.
/usr/bin/bash
* PGXC_CTL START **
Current directory: /var/mps/db_pgxl/pgxl/pgxc_ctl_home
Stopping all the gtm proxies.
ERROR: GTM Proxy is not configured.
`/mps/db_pgxl/bin/pgxc_ctl "start gtm_proxy all " \'
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Reading configuration using /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash --home /var/mps/db_pgxl/pgxl/pgxc_ctl_home --configuration /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl.conf
Finished to read configuration.
/usr/bin/bash
* PGXC_CTL START **
Current directory: /var/mps/db_pgxl/pgxl/pgxc_ctl_home
Starting all the gtm proxies.
ERROR: GTM Proxy is not configured.
`/mps/db_pgxl/bin/pgxc_ctl "failover coordinator coord1 " \'
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Reading configuration using /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash --home /var/mps/db_pgxl/pgxl/pgxc_ctl_home --configuration /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl.conf
Finished to read configuration.
ERROR: Failed to get pooled connections
ERROR: Failed to get pooled connections
/usr/bin/bash
* PGXC_CTL START **
Current directory: /var/mps/db_pgxl/pgxl/pgxc_ctl_home
Failover coordinators.
Failover the coordinator coord1.
Failover coordinator coord1 using GTM itself
Actual Command: ssh pgxl@10.102.31.168 "( pg_ctl promote -Z coordinator -D /var/mps/db_pgxl/data/auto/nodes/coord ) > /tmp/ns-mgmt-system_STDOUT_24987_0 2>&1" < /dev/null > /dev/null 2>&1
Bring remote stdout: scp pgxl@10.102.31.168:/tmp/ns-mgmt-system_STDOUT_24987_0 /tmp/STDOUT_24987_1 > /dev/null 2>&1
pg_ctl: cannot promote server; server is not in standby mode
Actual Command: ssh pgxl@10.102.31.168 "( pg_ctl restart -Z coordinator -D /var/mps/db_pgxl/data/auto/nodes/coord -w -o -i; sleep 1 ) > /tmp/ns-mgmt-system_STDOUT_24987_2 2>&1" < /dev/null > /dev/null 2>&1
Bring remote stdout: scp pgxl@10.102.31.168:/tmp/ns-mgmt-system_STDOUT_24987_2 /tmp/STDOUT_24987_3 > /dev/null 2>&1
ALTER NODE
pgxc_pool_reload
t
(1 row)
EXECUTE DIRECT
pgxc_pool_reload
t
(1 row)
Done.
----------- failover GTM and coordinator done -------
Failover happened, but I found that pgxc_node was not synced with what was on the old GTM master 10.102.31.167. To fix this, I used ALTER/DROP NODE commands through the current GTM master, 10.102.31.168.
======================================================================
On the GTM slave, which has now taken over as GTM master, I executed ALTER NODE as you recommended:
/mps/db_pgxl/bin/psql -U mpsroot mpsdb -p 11921 -h 10.102.31.168
psql (PGXL 9.2.0, based on PG 9.2.4 (Postgres-XL 9.2.0))
Type "help" for help.
mpsdb=> select * from pgxc_node;
node_name | node_type | node_port | node_host | nodeis_primary | nodeis_preferred | node_id
-----------+-----------+-----------+---------------+----------------+------------------+-------------
coord2 | C | 11921 | 10.102.31.203 | f | f | -1197102633
coord3 | C | 11921 | 10.102.31.206 | f | f | 1638403545
datanode1 | D | 12921 | 10.102.31.170 | t | f | 888802358
datanode2 | D | 12921 | 10.102.31.201 | f | f | -905831925
coord1 | C | 11921 | 10.102.31.168 | f | f | 1885696643
(5 rows)
mpsdb=# drop node coord2;
DROP NODE
mpsdb=# drop node coord3;
DROP NODE
mpsdb=# select * from pgxc_node;
node_name | node_type | node_port | node_host | nodeis_primary | nodeis_preferred | node_id
-----------+-----------+-----------+---------------+----------------+------------------+------------
datanode1 | D | 12921 | 10.102.31.170 | t | f | 888802358
datanode2 | D | 12921 | 10.102.31.201 | f | f | -905831925
coord1 | C | 11921 | 10.102.31.168 | f | f | 1885696643
mpsdb=# ALTER NODE datanode2 WITH (HOST='10.102.31.202', PORT=12921);
ALTER NODE
mpsdb=# select * from db_cluster ;
ERROR: Failed to get pooled connections
mpsdb=# select * from db_cluster ;
ERROR: Failed to get pooled connections
mpsdb=# select * from pgxc_node;
node_name | node_type | node_port | node_host | nodeis_primary | nodeis_preferred | node_id
-----------+-----------+-----------+---------------+----------------+------------------+------------
datanode1 | D | 12921 | 10.102.31.170 | t | f | 888802358
coord1 | C | 11921 | 10.102.31.168 | f | f | 1885696643
datanode2 | D | 12921 | 10.102.31.202 | f | f | -905831925
(3 rows)
mpsdb=# SELECT pgxc_pool_reload();
pgxc_pool_reload
t
(1 row)
mpsdb=# select * from pgxc_node;
node_name | node_type | node_port | node_host | nodeis_primary | nodeis_preferred | node_id
-----------+-----------+-----------+---------------+----------------+------------------+------------
datanode1 | D | 12921 | 10.102.31.170 | t | f | 888802358
coord1 | C | 11921 | 10.102.31.168 | f | f | 1885696643
datanode2 | D | 12921 | 10.102.31.202 | f | f | -905831925
(3 rows)
mpsdb=# select * from db_cluster ;
ERROR: Failed to get pooled connections
Thanks,
Srinivas
On Wed, Aug 3, 2016 at 11:08 AM, srinivasa reddy <ksreddy543@users.sf.net> wrote:
Gosh, you are using 9.2? That's obsolete, hasn't received any attention in the last 2 years, and is not even supported. I would suggest downloading 9.5r1.2 from http://www.postgres-xl.org/download/ and trying again.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Pavan,
As per your suggestion, I have upgraded to 9.5r1.2 and tested the scenarios on the FreeBSD 8.4 platform. No issues were observed when I fail over the individual master and slave datanodes, but the problem appears when I fail over the GTM node together with the coordinator.
DB Setup details
DBNodeName | master IP | Slave IP
datanode1 | 10.102.31.170 | 10.102.31.171
datanode2 | 10.102.31.201 | 10.102.31.202
coord1 | 10.102.31.167 | 10.102.31.168
I have executed the following script to fail over to the slave GTM and coordinator:
# Capture each pgxc_ctl step's output so it can be echoed back.
n=$(/mps/db_pgxl/bin/pgxc_ctl "failover gtm")
echo "$n"
n=$(/mps/db_pgxl/bin/pgxc_ctl "reconnect gtm_proxy all")
echo "$n"
n=$(/mps/db_pgxl/bin/pgxc_ctl "stop gtm_proxy all")
echo "$n"
n=$(/mps/db_pgxl/bin/pgxc_ctl "start gtm_proxy all")
echo "$n"
# $1 is the coordinator to fail over (e.g. coord1)
n=$(/mps/db_pgxl/bin/pgxc_ctl "failover coordinator $1")
echo "$n"
echo "----------- failover GTM and coordinator done -------"
After the script execution, the 'monitor all' command hangs while listing the datanode status:
su -l pgxl -c "/mps/db_pgxl/bin/pgxc_ctl 'monitor all'"
/usr/bin/bash
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Installing pgxc_ctl_bash script as /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash.
Reading configuration using /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl_bash --home /var/mps/db_pgxl/pgxl/pgxc_ctl_home --configuration /var/mps/db_pgxl/pgxl/pgxc_ctl_home/pgxc_ctl.conf
Finished reading configuration.
* PGXC_CTL START **
Current directory: /var/mps/db_pgxl/pgxl/pgxc_ctl_home
Running: gtm master
Running: coordinator master coord1
# Other actions tried:
Rebooted all the database nodes; every node's status is Running.
# Connected to the DB through the coordinator master (which was earlier the slave node hosting the GTM and coordinator):
/mps/db_pgsql/bin/psql -U postgres mpsdb -p 11921 -h 10.102.31.168
psql (9.4.5, server 9.5.3 (Postgres-XL 9.5r1.2))
WARNING: psql major version 9.4, server major version 9.5.
Some psql features might not work.
Type "help" for help.
mpsdb=# select * from db_nodes;
ERROR: Failed to get pooled connections
HINT: This may happen because one or more nodes are currently unreachable, either because of node or network failure.
Its also possible that the target node may have hit the connection limit or the pooler is configured with low connections.
Please check if all nodes are running fine and also review max_connections and max_pool_size configuration parameters
mpsdb=# select * from pgxc_node;
node_name | node_type | node_port | node_host | nodeis_primary | nodeis_preferred | node_id
-----------+-----------+-----------+---------------+----------------+------------------+------------
datanode1 | D | 12921 | 10.102.31.170 | t | f | 888802358
datanode2 | D | 12921 | 10.102.31.202 | f | f | -905831925
coord1 | C | 11921 | 10.102.31.168 | f | f | 1885696643
(3 rows)
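Per the hint above, the pooler limits can also be checked from the coordinator, with something like the following (illustrative only; SHOW reads local settings, so it does not need pooled connections):

    /mps/db_pgsql/bin/psql -U postgres mpsdb -p 11921 -h 10.102.31.168 -c "SHOW max_pool_size"
    /mps/db_pgsql/bin/psql -U postgres mpsdb -p 11921 -h 10.102.31.168 -c "SHOW max_connections"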
Tried connecting to the datanodes through psql and saw the following error:
mpsdb=# \q
bash-2.05b# /mps/db_pgsql/bin/psql -U postgres mpsdb -p 12921 -h 10.102.31.170
psql: FATAL: Could not obtain a transaction ID from GTM. The GTM might have failed or lost connectivity
bash-2.05b# /mps/db_pgsql/bin/psql -U postgres mpsdb -p 12921 -h 10.102.31.202
psql: FATAL: Could not obtain a transaction ID from GTM. The GTM might have failed or lost connectivity
Could you please suggest some pointers on how to proceed?
Thanks,
Srinivas
I have found the issue. On GTM failover, the gtm_host setting in each datanode's postgresql.conf is not updated to the GTM standby's IP. Is this a known bug?
As a workaround, we are updating the gtm_host information through a script.
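Roughly, the workaround does the following on each datanode (a sketch only; the datanode data directory path is an assumption here, and on some versions a reload may be enough instead of a restart):

    #!/bin/sh
    # Point every datanode at the promoted GTM and restart it so it reconnects.
    NEW_GTM_HOST=10.102.31.168
    DATANODES="10.102.31.170 10.102.31.202"
    DNDIR=/var/mps/db_pgxl/data/auto/nodes/datanode   # assumed datanode data directory

    for host in $DATANODES; do
        # rewrite gtm_host in postgresql.conf on the datanode (FreeBSD sed needs -i '')
        ssh pgxl@$host "sed -i '' \"s|^#*gtm_host.*|gtm_host = '$NEW_GTM_HOST'|\" $DNDIR/postgresql.conf"
        # restart the datanode so it picks up the new GTM host
        ssh pgxl@$host "pg_ctl restart -Z datanode -D $DNDIR -w"
    done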