
GTM-Standby

Koichi Suzuki

Problem

With the HA_Support branch, GTM-Proxy successfully registers itself with GTM, but the datanode fails to register.

The format of the registration message is as follows:

   if (gtmpqPutMsgStart('C', true, conn) ||
       gtmpqPutInt(MSG_NODE_REGISTER, sizeof (GTM_MessageType), conn) ||
       gtmpqPutnchar((char *)&type, sizeof(GTM_PGXCNodeType), conn) ||
       gtmpqPutnchar((char *)&nodenum, sizeof(GTM_PGXCNodeId), conn) ||
       gtmpqPutInt(strlen(host), sizeof (GTM_StrLen), conn) ||
       gtmpqPutnchar(host, strlen(host), conn) ||
       gtmpqPutnchar((char *)&port, sizeof(GTM_PGXCNodePort), conn) ||
       gtmpqPutnchar((char *)&proxynum, sizeof(GTM_PGXCNodeId), conn) ||
       gtmpqPutInt(strlen(datafolder), sizeof (GTM_StrLen), conn) ||
       gtmpqPutnchar(datafolder, strlen(datafolder), conn) ||
       gtmpqPutInt(status, sizeof(GTM_PGXCNodeStatus), conn))
       goto send_failed;

Compared with the non-standby GTM, two fields were added:

1) host, including its length indicator, and 2) status.

In GTM-Proxy, this message is handled by the function ProcessPGXCNodeCommand(). Unlike the original version, it then tries to convert the IP address of the peer (datanode/coordinator) into the host name using getaddrinfo(). Somehow, the host information sent with the above command is not consumed in GTM-Proxy.

I should look into this in more detail.

Another bug

node_get_local_addr() needs to initialize its return value. The return value is stored into the caller's area, and if it is not initialized properly, the caller may (depending on what its own variable happened to contain) regard this as an error.
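To illustrate the pattern, here is a minimal sketch. It is not the real node_get_local_addr(); the function name, its parameters, and the 0 / -1 result convention are all assumptions made for the example. The point is only that the result written into the caller's area is set on every path, starting with an explicit initialization.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a function that reports its result through
 * a caller-supplied int. The signature is invented for this sketch.
 */
void get_local_addr_sketch(const char *addr, char *buf, size_t buflen, int *rc)
{
    assert(buf != NULL && rc != NULL);

    *rc = 0;                    /* initialize first: success unless a check fails */

    if (buflen == 0) {
        *rc = -1;               /* no room for even the terminator */
        return;
    }
    buf[0] = '\0';

    if (addr == NULL || strlen(addr) >= buflen) {
        *rc = -1;               /* nothing to copy, or it would not fit */
        return;
    }
    strcpy(buf, addr);          /* safe: length was checked above */
}
```

With this shape, a caller that passes in an uninitialized int still sees 0 after a successful call, whatever the variable held before.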

Yet still...

Somehow, the length of the host name embedded in the MSG_NODE_REGISTER message is not sent to GTM-ACT correctly.

At last

Yes, there was a fault in GTM-Proxy. I found that GTM-Proxy did not receive the MSG_NODE_REGISTER message members in the correct order and did not proxy them to GTM in the correct order. I fixed all of this, and GTM-Proxy now works fine.
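The ordering requirement can be shown with a small sketch. This is not the GTM-Proxy code: the Buf and NodeReg types and the put_bytes/get_bytes/write_reg/read_reg helpers are hypothetical, only a subset of the registration fields is used, and integers are copied in host byte order for simplicity. What it shows is that the reader must consume the fields in exactly the order the writer emitted them, including the length that precedes the host string.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size message buffer; not a GTM type. */
typedef struct {
    unsigned char data[256];
    size_t len;                 /* bytes written so far */
    size_t pos;                 /* read cursor */
} Buf;

/* Hypothetical subset of the registration fields. */
typedef struct {
    int32_t nodenum;
    char    host[64];
    int32_t port;
} NodeReg;

static int put_bytes(Buf *b, const void *src, size_t n)
{
    if (n > sizeof(b->data) - b->len)
        return -1;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

static int get_bytes(Buf *b, void *dst, size_t n)
{
    if (n > b->len - b->pos)
        return -1;
    memcpy(dst, b->data + b->pos, n);
    b->pos += n;
    return 0;
}

/* Writer: nodenum, host length, host bytes, port. */
int write_reg(Buf *b, const NodeReg *r)
{
    int32_t hostlen;

    assert(b != NULL && r != NULL);
    hostlen = (int32_t) strlen(r->host);
    return (put_bytes(b, &r->nodenum, sizeof r->nodenum) ||
            put_bytes(b, &hostlen, sizeof hostlen) ||
            put_bytes(b, r->host, (size_t) hostlen) ||
            put_bytes(b, &r->port, sizeof r->port)) ? -1 : 0;
}

/* Reader: must consume the fields in the same order as the writer. */
int read_reg(Buf *b, NodeReg *r)
{
    int32_t hostlen;

    assert(b != NULL && r != NULL);
    if (get_bytes(b, &r->nodenum, sizeof r->nodenum) ||
        get_bytes(b, &hostlen, sizeof hostlen))
        return -1;
    if (hostlen < 0 || (size_t) hostlen >= sizeof r->host)
        return -1;              /* reject lengths that cannot fit */
    if (get_bytes(b, r->host, (size_t) hostlen))
        return -1;
    r->host[hostlen] = '\0';
    return get_bytes(b, &r->port, sizeof r->port) ? -1 : 0;
}
```

If the reader skipped the host length and host bytes, every later field would be decoded from the wrong offset, which matches the kind of symptom described above.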

GTM-Standby Again

I hoped that GTM-Standby would then work fine. It didn't. GTM-Standby crashed with a core dump. The crash is caused by dump_transactioninfo_elog(), which prints the backup received from GTM-ACT to the log.

I thought this was not just a bug in this function but might be caused by a message being sent or parsed incorrectly by GTM or GTM-Standby. I examined the response to the MSG_TXN_GXID_LIST message. In my test environment, the gti_thread_id value is parsed as 140380929935104. Since this should be a thread ID from GTM-ACT/GTM-Proxy, the value is quite unusual. I should review the GTM-ACT code that receives this message, parses it, and constructs the reply, then compare it with the parsing done in GTM-Standby.

Yes, there was a wrong implementation in gtm_serialize.c and gtm_serialize_debug.c. In gtm_serialize.c, sn_xip is treated as an integer. In fact, it is the address of a GlobalTransactionId array, and the number of elements is given by sn_xcnt. In gtm_serialize_debug.c, on the other hand, sn_xip is treated as a pointer to GlobalTransactionId. Because an address valid only in GTM-ACT was exported to GTM-Standby, this caused the error. --> Code fixed for the test.
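A minimal sketch of the idea behind the fix, under stated assumptions: the SnapshotSketch struct, the Xid typedef (a 32-bit stand-in for GlobalTransactionId, whose real width is not given here), and the function names are invented for illustration and do not reproduce gtm_serialize.c. The count is written first, then the array elements themselves, and the receiver rebuilds the array in its own memory instead of reusing the sender's address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Xid;           /* stand-in for GlobalTransactionId (assumed width) */

typedef struct {
    uint32_t sn_xcnt;           /* number of elements in sn_xip */
    Xid     *sn_xip;            /* array of transaction ids */
} SnapshotSketch;

size_t snapshot_serialized_size(const SnapshotSketch *s)
{
    return sizeof s->sn_xcnt + (size_t) s->sn_xcnt * sizeof(Xid);
}

/* Write the count, then the elements. The pointer value is never written. */
size_t snapshot_serialize(const SnapshotSketch *s, unsigned char *buf, size_t buflen)
{
    size_t need, arr;

    assert(s != NULL && buf != NULL);
    need = snapshot_serialized_size(s);
    arr = (size_t) s->sn_xcnt * sizeof(Xid);
    if (buflen < need)
        return 0;               /* 0 means "did not fit" */
    memcpy(buf, &s->sn_xcnt, sizeof s->sn_xcnt);
    if (arr > 0)
        memcpy(buf + sizeof s->sn_xcnt, s->sn_xip, arr);
    return need;
}

/* Read the count, then allocate and fill a fresh array on the receiver. */
int snapshot_deserialize(SnapshotSketch *s, const unsigned char *buf, size_t buflen)
{
    size_t arr;

    assert(s != NULL && buf != NULL);
    if (buflen < sizeof s->sn_xcnt)
        return -1;
    memcpy(&s->sn_xcnt, buf, sizeof s->sn_xcnt);
    arr = (size_t) s->sn_xcnt * sizeof(Xid);
    if (buflen - sizeof s->sn_xcnt < arr)
        return -1;              /* truncated message */
    s->sn_xip = NULL;
    if (arr > 0) {
        s->sn_xip = malloc(arr);
        if (s->sn_xip == NULL)
            return -1;
        memcpy(s->sn_xip, buf + sizeof s->sn_xcnt, arr);
    }
    return 0;
}
```

The caller of snapshot_deserialize() owns the rebuilt array and releases it with free().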

I also found that coordcount and datanodecount can be zero, in which case the current code calls malloc() with size zero. This returns some address that is later passed to free(). The address is "readable", so it can be harmful too. --> Code fixed for the test.
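One way to avoid the zero-size allocation is sketched below. The helper names and the int element type are hypothetical and are not taken from the GTM source. A zero count maps to NULL, so no code can mistake the result for a usable array, and free(NULL) is defined by the C standard to do nothing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: return a zeroed array of count ints, or NULL when count is 0. */
int *alloc_node_array(size_t count)
{
    if (count == 0)
        return NULL;            /* never call malloc(0) */
    return calloc(count, sizeof(int));
}

/* Hypothetical counterpart: a zero count must always pair with NULL. */
void free_node_array(int *nodes, size_t count)
{
    assert(count != 0 || nodes == NULL);
    free(nodes);                /* free(NULL) is a no-op */
}
```

Code that walks the array then only needs to loop from 0 to count, which naturally skips the NULL case.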

