(April 6th, 2011, Koichi Suzuki)
Now we have GTM-Standby, which backs up all the current GTM status in a synchronous way and fails over when the master GTM fails. We now need a mechanism in GTM-Proxy to reconnect to the new GTM in such a case.
When GTM-Standby takes over from the failed GTM, all the status has already been copied to the Standby, so GTM-Proxy should simply disconnect from the old GTM, reconnect to the new one, and register itself. If the last command to the old GTM did not get a response, the command can be reissued to the new GTM to get a correct response.
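A rough outline of this sequence in C is shown below; every function and variable name here is a hypothetical placeholder for illustration, not the actual GTM client API:

extern void proxy_disconnect_gtm(void);                      /* hypothetical helpers */
extern int  proxy_connect_gtm(const char *host, int port);
extern int  proxy_register(void);
extern int  proxy_resend_last_command(void);
extern int  last_command_unanswered;

static int
failover_to_new_gtm(const char *new_host, int new_port)
{
    proxy_disconnect_gtm();                        /* drop the connection to the failed GTM */

    if (proxy_connect_gtm(new_host, new_port) < 0) /* connect to the promoted GTM */
        return -1;

    if (proxy_register() < 0)                      /* register this proxy with the new GTM */
        return -1;

    /* If the last command sent to the old GTM never got a response, reissue it. */
    if (last_command_unanswered)
        return proxy_resend_last_command();

    return 0;
}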
Because the reconnect itself will be triggered by the XCM module, xcwatcher, and the monitoring agent, this document provides an initial design for how to implement the reconnect in GTM-Proxy.
A reconnect can be initiated by invoking gtm_ctl with a new command "reconnect". The syntax will be as follows:
gtm_ctl -S gtm_proxy reconnect -D dir -o "-s xxx -t xxx"
where the -D option specifies gtm_proxy's working directory, which must be the same directory it was started with. -s and -t specify the address and the port number of the new GTM.
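For example, if gtm_proxy was started with working directory /data/gtm_proxy and the new GTM listens on 192.168.1.20, port 6666 (both values purely illustrative), the invocation would be:

gtm_ctl -S gtm_proxy reconnect -D /data/gtm_proxy -o "-s 192.168.1.20 -t 6666"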
Within gtm_ctl, these options are backed up to the gtm_proxy.opts file in gtm_proxy's working directory, merged with the existing options. Then gtm_ctl prepares the gtm_proxy_sighup.opt file (the file name may change) to indicate that a reconnect is requested, and issues a SIGHUP signal to the gtm_proxy.
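The following is a minimal sketch of this step, assuming a one-line file format, a known gtm_proxy PID, and the hypothetical function name request_reconnect(); the real gtm_ctl code may differ:

#include <stdio.h>
#include <signal.h>
#include <sys/types.h>

static int
request_reconnect(pid_t proxy_pid, const char *proxy_datadir,
                  const char *new_host, int new_port)
{
    char  path[1024];
    FILE *fp;

    /* Record the new GTM address where the SIGHUP handler will look for it. */
    snprintf(path, sizeof(path), "%s/gtm_proxy_sighup.opt", proxy_datadir);
    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    fprintf(fp, "-s %s -t %d\n", new_host, new_port);
    fclose(fp);

    /* Tell the running gtm_proxy to act on the file. */
    return kill(proxy_pid, SIGHUP);
}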
Gtm_proxy's SIGHUP signal handler will check gtm_proxy_sighup.opt, determine that it should reconnect to the new GTM (at present, this is the only SIGHUP action), and update a flag to indicate the reconnect. Such a flag can be stored in a thread-specific structure like:
typedef struct thread_interrupt {
    GTM_RWLock  ti_lock;        /* protects should_lock */
    bool        should_lock;    /* set when the thread should reconnect to the new GTM */
} thread_interrupt;

thread_interrupt    thread_interrupts[xx];
where xx is the number of created threads.
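A minimal sketch of how the SIGHUP handling path could set this flag for every worker thread is given below; pthread_rwlock_t stands in for GTM_RWLock, and the thread-count bookkeeping is an assumption:

#include <pthread.h>
#include <stdbool.h>

typedef struct thread_interrupt {
    pthread_rwlock_t ti_lock;      /* stands in for GTM_RWLock in this sketch */
    bool             should_lock;  /* true: reconnect to the new GTM */
} thread_interrupt;

extern thread_interrupt thread_interrupts[];   /* one entry per worker thread */
extern int              num_worker_threads;

/* Called from the SIGHUP handling path once gtm_proxy_sighup.opt has been
 * read and found to request a reconnect; marks every worker thread. */
static void
mark_threads_for_reconnect(void)
{
    int i;

    for (i = 0; i < num_worker_threads; i++)
    {
        pthread_rwlock_wrlock(&thread_interrupts[i].ti_lock);
        thread_interrupts[i].should_lock = true;
        pthread_rwlock_unlock(&thread_interrupts[i].ti_lock);
    }
}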
Each thread checks this flag before it sends commands to GTM. If an error is detected while receiving a response from GTM, that is also the time to check this structure. If the flag is not set, the thread can wait for a little while and recheck it.
If "should_lock" bit is set, then the thread disconnects current connection to (old) GTM and reconnect to the new one and can continue service to coordinator or datanode.