From: Koichi S. <koi...@gm...> - 2013-02-21 10:28:44
Hello,

I found that "select 1" does not work correctly as a way to detect a datanode/coordinator crash when gtm/gtm_proxy has crashed. When gtm/gtm_proxy crashes, "select 1" returns an error, and the monitoring program (HA middleware or another operation support program) concludes that the coordinator/datanode has crashed, which is wrong. So we need another means to detect that a coordinator/datanode is running while gtm/gtm_proxy has crashed.

One solution would be to make "select 1" not return an error. In that case, we may need another means to detect whether the coordinator/datanode has crashed. It could be very complicated and could expose a very inconsistent view.

I think a cleaner solution is to provide a "watchdog" that tells whether the server loop is running and ready to accept connections. I understand this would be a duplicate implementation in the case of PostgreSQL itself, but it is needed for XC. I also understand that this could conflict if PG itself implements a similar feature. This kind of risk is found in many other places in XC, and I believe a watchdog timer is a good solution for monitoring coordinator/datanode independently of gtm status.

Any feedback?

----------
Koichi Suzuki
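
To illustrate the distinction a monitoring program would need to draw, here is a minimal sketch in Python. It is not the proposed watchdog and not XC code: it only approximates "the server loop is accepting connections" with a plain TCP connect, and the function names, status strings, and the `run_select_1` callback are all hypothetical. The point is only that the liveness check is separate from the "select 1" result, so a gtm/gtm_proxy failure is not reported as a node crash.

```python
import socket
from typing import Callable


def port_accepts_connections(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Assumption: a successful connect is used as a rough stand-in for
    "the server loop is running and ready to accept connections".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify_node(host: str, port: int, run_select_1: Callable[[], bool]) -> str:
    """Classify a coordinator/datanode using two independent checks.

    run_select_1 is a hypothetical callback supplied by the monitor; it
    should return True when "select 1" succeeds and False when it errors.
    """
    if not port_accepts_connections(host, port):
        # Nothing is listening: treat the node itself as down.
        return "node_down"
    if run_select_1():
        return "healthy"
    # The node accepts connections but the query fails, which is what
    # happens when gtm/gtm_proxy is unavailable.
    return "node_up_query_failing"
```

With this split, a monitor would fail over a node only on "node_down" and would look at gtm/gtm_proxy when it sees "node_up_query_failing".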