From: Koichi S. <koi...@gm...> - 2013-03-08 02:52:03
I have not seen any reactions to this. Again, we need to detect whether a coordinator/datanode is running even when gtm is down. "select 1" or "select now()" does not work for this purpose (it does work for a log-shipping slave, though).

I'd like to start with the watchdog patch I submitted last July, attached just in case. It includes a watchdog for gtm and gtm proxies, which may not be needed for now.

An alternative is simply to test whether a connection made with one of the PQ* functions succeeds. Connecting involves a bit of handling at the server, so it could be used to detect whether the server accepts connections.

Please understand this is specific to XC, not to PG.

Any input is welcome.

Regards;
----------
Koichi Suzuki

2013/2/21 Koichi Suzuki <koi...@gm...>:
> Hello,
>
> I found that "select 1" does not work to detect a datanode/coordinator
> crash correctly when gtm/gtm_proxy crashes. When gtm/gtm_proxy
> crashes, "select 1" returns an error, and the monitoring program (HA
> middleware or another operation support program) determines that the
> coordinator/datanode has crashed, which is wrong.
>
> So we need another means to detect that the coordinator/datanode is
> running but gtm/gtm_proxy has crashed. One solution would be to make
> "select 1" not return an error. In that case, we may need another
> means to detect whether the coordinator/datanode has crashed. It could
> be very complicated and could expose a very inconsistent view. I think
> a cleaner solution is to provide a "watchdog" to tell that the server
> loop is running and is ready to accept connections. I understand this
> duplicates an implementation in PostgreSQL itself, but it is needed
> for XC. I also understand that this could conflict if PG itself
> implements a similar feature. This kind of risk is found in many other
> places in XC, and I believe a watchdog timer is a good solution for
> monitoring the coordinator/datanode independently of gtm status.
>
> Any feedback?
> ----------
> Koichi Suzuki
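
For illustration only, the connection-based check mentioned in the reply (test whether the server accepts connections, independent of gtm) could be sketched roughly as below. This is not the watchdog patch and it does not use libpq: it is a plain TCP probe written in Python, swapped in to show the idea. The function name, host, and port are hypothetical. It only shows that the listening socket accepts a connection; a check built on the PQ* functions would also exercise the server's connection startup handling.

```python
import socket


def accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it runs no SQL, so the result does not depend
    on whether a query such as "select 1" would succeed.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timeout)
        # when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder address; substitute the coordinator/datanode to monitor.
    print(accepts_connections("127.0.0.1", 5432))
```

A monitoring program could call such a probe periodically and treat only repeated failures as a crash of the coordinator/datanode, keeping the gtm/gtm_proxy check separate.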