[Postgres-xc-developers] Coordinator/Datanode crash detection

Hello,

I found that "select 1" does not work to detect a datanode/coordinator
crash correctly when gtm/gtm_proxy crashes.  When gtm/gtm_proxy
crashes, "select 1" returns an error, and the monitoring program (HA
middleware or another operation support program) concludes that the
coordinator/datanode has crashed, which is wrong.

So we need another means to detect that a coordinator/datanode is
running while gtm/gtm_proxy has crashed.  One solution would be to make
"select 1" not return an error in that situation.  In this case, we
may need yet another means to detect whether the coordinator/datanode
itself has crashed.  It could be very complicated and could expose a
very inconsistent view.  I think a cleaner solution is to provide a
"watchdog" that tells whether the server loop is running and is ready
to accept connections.  I understand this would be a duplicate
implementation in the case of PostgreSQL itself, but it is needed for
XC.  I also understand that this could conflict if PG itself
implements a similar feature.  This kind of risk is found in many
other places in XC, and I believe a watchdog timer is a good solution
for monitoring a coordinator/datanode independently of the gtm status.
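
To illustrate the distinction a monitor would need to make, here is a
minimal sketch in Python.  It is not the proposed watchdog itself
(that would live inside the server); it only approximates "the server
loop is accepting connections" from the outside with a plain TCP
connect, and treats a failed "select 1" separately.  The names
accepts_connections, classify_node and run_select_1 are hypothetical,
and run_select_1 stands for whatever client call the monitoring
program already uses.

```python
import socket
from typing import Callable


def accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Assumption: an accepted connection is used as a rough proxy for
    "the server loop is running"; it says nothing about gtm.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False


def classify_node(host: str, port: int,
                  run_select_1: Callable[[], bool]) -> str:
    """Classify a coordinator/datanode without blaming it for a gtm failure.

    run_select_1 is a hypothetical callable supplied by the monitor; it
    returns True when "select 1" succeeds and False (or raises) otherwise.
    """
    if not accepts_connections(host, port):
        return "node_down"
    try:
        query_ok = run_select_1()
    except Exception:
        query_ok = False
    # The node answers on its port, so a failing query points elsewhere
    # (for example gtm/gtm_proxy) rather than at a node crash.
    return "healthy" if query_ok else "node_up_query_failed"
```

With this split, HA middleware would fail over a node only on
"node_down" and would check gtm/gtm_proxy on "node_up_query_failed".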

Any feedback?
----------
Koichi Suzuki