From: Zhou L. <zh...@gm...> - 2014-11-24 09:33:31
In the second scenario, the cause is that the coordinator sends the
unregister command to the gtm proxy twice:
UnregisterGTM(GTM_PGXCNodeType type)
{
    int ret;

    CheckConnection();
    if (!conn)
        return EOF;
    ret = node_unregister(conn, type, PGXCNodeName);
    /* If something went wrong, retry once */
    if (ret < 0)
    {
        CloseGTM();
        InitGTM();
        if (conn)
            ret = node_unregister(conn, type, PGXCNodeName);
    }
    ...
}
On the gtm proxy, Recovery_PGXCNodeUnregister returns success for the first
unregister command, but GTM returns failure because it has restarted. The
coordinator therefore sends the second unregister command. This time
Recovery_PGXCNodeUnregister fails on the proxy, presumably because the node
is already unregistered there, and the siglongjmp happens.
A patch is attached to fix this. It may not be the best approach, so please
advise.
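
To illustrate the failure mode (this is not the attached patch), here is a
minimal sketch of a toy node registry. NodeRegistry, registry_add,
registry_remove_strict and registry_remove_idempotent are hypothetical names
used only for illustration; they are not PGXC functions. The strict variant
fails on a repeated unregister, as described above, while the idempotent
variant treats "already unregistered" as success so that a client retry is
harmless.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 8
#define NAME_LEN  64

/* Toy stand-in for the proxy's list of registered nodes (hypothetical). */
typedef struct {
    char names[MAX_NODES][NAME_LEN];
    bool used[MAX_NODES];
} NodeRegistry;

/* Register a node. Returns 0 on success, -1 if the name is too long
 * or the registry is full. */
int registry_add(NodeRegistry *reg, const char *name)
{
    assert(reg != NULL && name != NULL);
    if (strlen(name) >= NAME_LEN)
        return -1;
    for (int i = 0; i < MAX_NODES; i++) {
        if (!reg->used[i]) {
            strcpy(reg->names[i], name);
            reg->used[i] = true;
            return 0;
        }
    }
    return -1;
}

/* Strict unregister: returns -1 when the node is not registered, so a
 * repeated unregister of the same node is reported as an error. */
int registry_remove_strict(NodeRegistry *reg, const char *name)
{
    assert(reg != NULL && name != NULL);
    for (int i = 0; i < MAX_NODES; i++) {
        if (reg->used[i] && strcmp(reg->names[i], name) == 0) {
            reg->used[i] = false;
            return 0;
        }
    }
    return -1;
}

/* Idempotent unregister: "already gone" counts as success, so the
 * coordinator's retry does not turn into an error on the proxy. */
int registry_remove_idempotent(NodeRegistry *reg, const char *name)
{
    (void) registry_remove_strict(reg, name);
    return 0;
}
```

With a zero-initialized NodeRegistry, adding "coord1" and calling
registry_remove_strict twice gives 0 and then -1, while calling
registry_remove_idempotent twice gives 0 both times. Whether silently
accepting a duplicate unregister is acceptable in the real proxy is a
design question for the list.
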
2014-11-17 17:54 GMT+08:00 Zhou Liang <zh...@gm...>:
> I found that the gtm proxy has some problems after GTM restarts. My
> environment is RHEL 7 and PGXC 1.2.
>
> 1. gtm_proxy doesn't handle SIGPIPE, so gtm_proxy sometimes goes down as
> well when GTM goes down.
> Steps to reproduce:
> PGXC stop gtm master
> Stop GTM master
> waiting for server to shut down.... done
> server stopped
> PGXC
> PGXC monitor all
> Not running: gtm master
> Running: gtm proxy gtm_pxy1
> Running: gtm proxy gtm_pxy2
> Running: coordinator master coord1
> Running: coordinator master coord2
> Running: datanode master datanode1
> Running: datanode master datanode2
> PGXC
> PGXC monitor all
> Not running: gtm master
> Not running: gtm proxy gtm_pxy1
> Not running: gtm proxy gtm_pxy2
> Running: coordinator master coord1
> Running: coordinator master coord2
> Running: datanode master datanode1
> Running: datanode master datanode2
> -----------------------------------------------------
> Program received signal SIGPIPE, Broken pipe.
> 0x00007ff3cf3aa75b in send () from /lib64/libpthread.so.0
> (gdb) bt
> #0 0x00007ff3cf3aa75b in send () from /lib64/libpthread.so.0
> #1 0x0000000000403c31 in gtmpqSendSome (conn=0x7ff3c80028f0, len=21) at fe-misc.c:753
> #2 0x0000000000403daf in gtmpqFlush (conn=0x7ff3c80028f0) at fe-misc.c:848
> #3 0x000000000042331c in GTMProxy_ThreadMain (argp=0x1ccbb60) at proxy_main.c:1526
> #4 0x00000000004281db in GTMProxy_ThreadMainWrapper (argp=0x1ccbb60) at proxy_thread.c:319
> #5 0x00007ff3cf3a3df3 in start_thread () from /lib64/libpthread.so.0
> #6 0x00007ff3cf0d13dd in clone () from /lib64/libc.so.6
>
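
As a small illustration of what ignoring SIGPIPE changes, the sketch below
uses plain POSIX signal(), pipe() and write() rather than the proxy's own
code. It assumes a pipe with no reader as a stand-in for the broken GTM
connection, and write_with_sigpipe_ignored is a hypothetical helper name.
With SIGPIPE ignored, the write fails with EPIPE instead of killing the
process, so the caller has to check the return value and handle the error.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Writes msg to a pipe whose read end is already closed, with SIGPIPE
 * ignored. Returns the errno of the failed write, 0 if the write
 * unexpectedly succeeds, or -1 if the setup fails.
 * Note: this changes the SIGPIPE disposition for the whole process. */
int write_with_sigpipe_ignored(const char *msg)
{
    int fds[2];

    assert(msg != NULL);
    if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
        return -1;
    if (pipe(fds) != 0)
        return -1;

    close(fds[0]);                     /* no reader: the pipe is "broken" */

    ssize_t n = write(fds[1], msg, strlen(msg));
    int err = (n < 0) ? errno : 0;     /* EPIPE instead of a fatal signal */

    close(fds[1]);
    return err;
}
```

Calling write_with_sigpipe_ignored("x") is expected to return EPIPE on a
POSIX system. The same reasoning applies to send() on a socket, which is
where the backtrace above shows the signal being raised.
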
> 2. I tried using pqsignal(SIGPIPE, SIG_IGN) to ignore SIGPIPE, but then
> another problem occurs. Steps:
> PGXC stop gtm master
> PGXC start gtm master // because GTM restarted, the gtm proxy will fail to
> process unregister commands from a coordinator or datanode
> PGXC stop coordinator master coord1 // when coord1 stops, it sends the
> unregister command 'C', then 'X', to the gtm proxy
> PGXC start coordinator master coord1
> PGXC Psql postgres
> Selected coord1.
> psql (PGXC , based on PG 9.3.4)
> Type "help" for help.
> postgres=# \d
> WARNING: Xid is invalid.
> WARNING: Xid is invalid.
> WARNING: Xid is invalid.
> ERROR: GTM error, could not obtain snapshot XID = 0
> --------------------------------------
> static void
> ProcessPGXCNodeCommand(GTMProxy_ConnectionInfo *conninfo, GTM_Conn *gtm_conn,
>                        GTM_MessageType mtype, StringInfo message)
> { ....
>     /* Unregister Node also on Proxy */
>     if (Recovery_PGXCNodeUnregister(cmd_data.cd_reg.type,
>                                     cmd_data.cd_reg.nodename,
>                                     false,
>                                     conninfo->con_port->sock))
>     {
>         /*
>          * The gtm proxy is processing unregister command 'C' here. If
>          * Recovery_PGXCNodeUnregister fails, this ERROR leads to siglongjmp.
>          */
>         ereport(ERROR,
>                 (EINVAL,
>                  errmsg("Failed to Unregister node")));
>     }
>     ...
> }
> After this siglongjmp, the gtm proxy goes on processing the old unregister
> command 'C'; Recovery_PGXCNodeUnregister fails again, and it siglongjmps
> again. This cycle repeats, so the gtm proxy can't handle other connections,
> which blocks coordinators and datanodes from getting an XID.
>
> 3. If unregistering the proxy fails because GTM is restarting, this
> siglongjmp happens and the gtm proxy doesn't stop properly, as follows:
> /*
>  * Unregister Proxy on GTM
>  */
> static void
> UnregisterProxy(void)
> { ...
> failed:
>     elog(ERROR, "can not Unregister Proxy on GTM"); /* ERROR leads to siglongjmp */
> }
>
> These scenarios are rare, but can anyone fix the problem?
>
>
>
> --
> Thanks
> Zhou Liang
>
--
Thanks
Zhou Liang