The version of Postgres-XL (PG XL) we are using is...
/Releases/Version_9.5r1/postgres-xl-9.5r1.4.tar.gz
We were running a two-node cluster on Amazon AWS (EC2 instances), but we have run into some issues. The whole cluster is now unusable and we are struggling to find answers in the documentation. The error we now get is...
Invalid operation: Could not begin transaction on data node...
...and we can see that datanode1 is not available, but we cannot get it functional again no matter what we try. In addition, we have noted a series of issues while trying to use the cluster.
Some odd behaviour we noticed originally was when we tried to create a copy of a replicated table using CTAS. First I tried specifying the distribution method...
CREATE TABLE reporting.mike_messing_product DISTRIBUTE BY REPLICATION AS SELECT * FROM reporting.dim_product
...this produced the error...
Amazon Invalid operation: Couldn't resolve SQueue race condition after 10 tries;
...next I tried without specifying the distribution method...
CREATE TABLE reporting.mike_messing_product AS SELECT * FROM reporting.dim_product
...but now we got the error...
Amazon Invalid operation: Failed to get pooled connections;
Note: originally this was a two-node cluster, but we used the pgxc_ctl utility to extend it by two nodes. The second error happened after the two additional nodes were added.
Next we tried a hack where we ran the CTAS on a single node with plans to alter it afterwards...
EXECUTE DIRECT ON (datanode2) 'CREATE TABLE reporting.mike_messing AS SELECT * FROM reporting.dim_online_store';
...this created the table on that node only, but we could not alter or query it. The coordinators did not recognise the table, as they had no visibility of it because it resided only on the one node. We thought this might happen, but we were getting desperate. You might want to make it so that EXECUTE DIRECT cannot execute DDL.
The second error pointed to a configuration problem, as "pooled connections" suggested an issue with the pooler ports. On looking at the configuration, we see that the pooler ports are not configured as we expected. Even though we had manually entered specific and unique port values, several of them are NULL and others have been set to values different from the ones we specified. I have raised a separate issue for this.
Looks like this "Invalid operation: Failed to get pooled connections;" is your root cause. Either the same port numbers are being used (they need to be different even on different nodes), or the EC2 instances are not able to connect to each other. By default, EC2 instances are pretty locked down, so you need to make sure that whatever ports you are using are open between them.
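
The two checks suggested above (no port number reused anywhere in the cluster, and every port reachable from the other instances) can be sketched in Python. This is only a rough illustration: the labels, port numbers and function names below are hypothetical and are not taken from the poster's configuration or from any Postgres-XL tooling.

```python
import socket
from collections import defaultdict


def find_port_conflicts(ports):
    """Return (duplicates, unset) for a mapping of label -> port.

    `ports` maps a hypothetical label such as "datanode1.pooler" to a
    port number, or to None when the value is missing (NULL).
    """
    by_port = defaultdict(list)
    unset = []
    for label, port in ports.items():
        if port is None:
            unset.append(label)
        else:
            by_port[port].append(label)
    # Keep only the ports that are claimed by more than one label.
    duplicates = {p: sorted(labels) for p, labels in by_port.items() if len(labels) > 1}
    return duplicates, sorted(unset)


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable host.
        return False


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    configured = {
        "coord1.port": 5432,
        "coord1.pooler": 20010,
        "datanode1.port": 15432,
        "datanode1.pooler": 20012,
        "datanode2.port": 15432,   # same number as datanode1.port
        "datanode2.pooler": None,  # missing value
    }
    duplicates, unset = find_port_conflicts(configured)
    print("duplicate ports:", duplicates)
    print("unset ports:", unset)
```

Running `can_connect(host, port)` from each instance against every other instance's ports would show whether the EC2 security group rules allow the traffic. A `False` result only says the connection failed, not whether the cause is a closed port, a firewall rule or a stopped process.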