From: ZhangJulian <jul...@ou...> - 2014-11-14 09:11:03
======Scenario 1: To reproduce it, set up a 4-coordinator/4-datanode (4C4D) cluster and connect to coord1:
create table t1(id int) distribute by hash(id);
insert into t1 select generate_series(1, 1000);
ALTER TABLE t1 DELETE NODE (datanode4);
\! pgxc_ctl remove datanode master datanode4
select pgxc_pool_reload(); ==> then psql to the other coordinators and run select pgxc_pool_reload(); on each
create table t2(id int) distribute by hash(id);
ERROR: cache lookup failed for node 16390
======Scenario 2: To work around it:
create table t1(id int) distribute by hash(id);
insert into t1 select generate_series(1, 1000);
ALTER TABLE t1 DELETE NODE (datanode4); ==> psql to the other coordinators before running "pgxc_ctl remove datanode"
\! pgxc_ctl remove datanode master datanode4
select pgxc_pool_reload(); ==> run select pgxc_pool_reload() on the other, already-connected coordinators
create table t2(id int) distribute by hash(id);
Success!!
======After debugging,
in the function:
Datum
pgxc_pool_reload(PG_FUNCTION_ARGS)
{
......
/* No need to reload, node information is consistent */
if (PoolManagerCheckConnectionInfo())
{
/* Release the lock on pooler */
PoolManagerLock(false);
PG_RETURN_BOOL(true);
}
......
/* Signal other sessions to reconnect to pooler */
ReloadConnInfoOnBackends();
......
}
In Scenario 1, newly connected sessions already see the updated cluster status values, for example NumDataNodes. PoolManagerCheckConnectionInfo() therefore returns true, and ReloadConnInfoOnBackends() never gets a chance to run.
That is why Scenario 2 works around the problem: the old sessions know the cluster has changed, so they can signal the other backends (the backends on coord2-4 for the session on coord1) via ReloadConnInfoOnBackends().
======The possible fix.
Remove the short path taken when PoolManagerCheckConnectionInfo() succeeds, so that the whole reload process runs every time a user executes "select pgxc_pool_reload()".
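Concretely, the change could look like the sketch below. This is a guess at the shape of the patch, not a tested diff; the "......" elisions follow the excerpt quoted earlier.

```c
Datum
pgxc_pool_reload(PG_FUNCTION_ARGS)
{
    ......
    /*
     * The PoolManagerCheckConnectionInfo() short path is removed here:
     * even when the calling session's node information looks consistent,
     * fall through so that ReloadConnInfoOnBackends() always runs and
     * stale backends on every coordinator are told to reconnect.
     */
    ......
    /* Signal other sessions to reconnect to pooler */
    ReloadConnInfoOnBackends();
    ......
}
```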
Please advise!
Thanks
Julian