The pooler process doesn't exit after I execute the "stop all" command.
When I attach gdb to the process, I find it stopped at this place in poolmgr.c:
2368 retval = select(nfds + 1, &rfds, NULL, NULL, NULL);
2369 if (shutdown_requested)
If the server_id does not change, the select will wait forever.
The routine then never gets a chance to check "shutdown_requested".
Does the select call need a timeout?
The Syslogger process has the same problem.
Thanks a lot for the report. I thought select would return when the
pooler receives a signal. Did you check whether the pooler process
receives and handles a signal on the "stop all" command, which in
turn runs pg_ctl stop? If so, before adding a timeout, we need to find
out when the signal cannot be handled correctly.
I'm afraid Syslogger may have the same issue.
Any more inputs/ideas on this?
Thank you again.
Koichi Suzuki
2014-07-21 17:50 GMT+09:00 peace zone peacezone@users.sf.net:
Related
Bugs: #487
Thank you for your response!
The pooler process catches the SIGTERM that the "stop all" command in pgxc_ctl sends,
and then only sets shutdown_requested to true. The pooler process exits only when shutdown_requested = true.
In this situation the select doesn't know that the signal has arrived.
Look at this code in
1. PoolManagerInit
static void
pooler_die(SIGNAL_ARGS)
{
    shutdown_requested = true;
}
SIGTERM interrupts the select system call, and select then returns -1 with errno = EINTR.
So poolmgr can notice the signal only IF POOLMGR IS WAITING IN THE SYSTEM CALL.
This means the issue can happen when the signal is caught before select is called and the poolmgr has no connection.
I recommend adding a timeout or some other stricter logic.
I don't think Syslogger has this kind of problem. Why do you think it does?
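As one possible form of "stricter logic", the sketch below closes the window between checking the flag and entering the wait by keeping SIGTERM blocked and letting POSIX pselect unblock it atomically. This is only an illustration of the general technique, not a proposed patch: `on_sigterm` and `wait_readable_race_free` are hypothetical names, and the single descriptor stands in for whatever set the pooler actually watches.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>

static volatile sig_atomic_t shutdown_requested = 0;

/* Hypothetical handler: it only sets the flag, as pooler_die does. */
static void
on_sigterm(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/*
 * Wait until fd is readable or SIGTERM has been handled.
 * Returns 1 when fd is readable, 0 on shutdown, -1 on error.
 */
static int
wait_readable_race_free(int fd)
{
    sigset_t    block, orig, wait_mask;
    int         result;

    /* Block SIGTERM so it cannot be delivered between the check and the wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    if (sigprocmask(SIG_BLOCK, &block, &orig) < 0)
        return -1;

    /* Mask used only while sleeping in pselect(): SIGTERM is deliverable there. */
    wait_mask = orig;
    sigdelset(&wait_mask, SIGTERM);

    for (;;)
    {
        fd_set  rfds;
        int     retval;

        if (shutdown_requested)
        {
            result = 0;
            break;
        }

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        /* A pending SIGTERM is delivered here and pselect() fails with EINTR. */
        retval = pselect(fd + 1, &rfds, NULL, NULL, NULL, &wait_mask);
        if (retval > 0)
        {
            result = 1;
            break;
        }
        if (retval < 0 && errno != EINTR)
        {
            result = -1;
            break;
        }
    }

    /* Restore the caller's signal mask. */
    sigprocmask(SIG_SETMASK, &orig, NULL);
    return result;
}
```

Compared with a timeout, this approach needs no periodic wakeups, but it depends on the handler being installed for a signal that is blocked everywhere outside the wait.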
Thank you for your response!
I found that the pooler and the syslogger were still alive after "stop all": the pooler was stopped at select and the syslogger was stopped at poll. I forgot to dump the stacks of the two processes, but I think the problem may be the same. I will dump the stacks next time.
AFAIK the syslogger uses the Latch mechanism, which is carefully designed not to lose an event regardless of timing. The Latch mechanism uses a pipe and poll.
We might also need to consider the possibility that the postmaster is not sending the signal.
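To make the "pipe and poll" idea concrete, here is a minimal sketch of the general self-pipe technique. It is not the Latch implementation and does not use its API; `selfpipe_init`, `wakeup_handler` and `wait_for_wakeup` are hypothetical names for this example. The point it illustrates is that a byte written by the signal handler stays in the pipe, so a signal that arrives before poll is entered still wakes the next poll.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

/* selfpipe[0] is polled by the main loop, selfpipe[1] is written by the handler. */
static int selfpipe[2] = {-1, -1};

/* Hypothetical handler: records the event as a byte in the pipe. */
static void
wakeup_handler(int signo)
{
    int     save_errno = errno;
    char    b = 0;
    ssize_t n;

    (void) signo;
    /* write() is async-signal-safe; a full pipe means a wakeup is already queued. */
    n = write(selfpipe[1], &b, 1);
    (void) n;
    errno = save_errno;
}

/* Create the pipe and make both ends non-blocking. Returns 0 on success. */
static int
selfpipe_init(void)
{
    int i;

    if (pipe(selfpipe) < 0)
        return -1;
    for (i = 0; i < 2; i++)
    {
        int flags = fcntl(selfpipe[i], F_GETFL);

        if (flags < 0 || fcntl(selfpipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    return 0;
}

/* Returns 1 if a wakeup was recorded, 0 on timeout, -1 on error. */
static int
wait_for_wakeup(int timeout_ms)
{
    struct pollfd   pfd;
    char            buf[16];
    int             rc;

    pfd.fd = selfpipe[0];
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* After EINTR the handler's byte is already in the pipe, so the retry returns at once. */
    do
    {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return rc;

    /* Drain the pipe so the next wait blocks until a new event arrives. */
    while (read(selfpipe[0], buf, sizeof(buf)) > 0)
        ;
    return 1;
}
```

A real event loop would poll the pipe together with its other descriptors. The sketch waits on the pipe alone to keep the example short.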
I suppose this is available at 9.3, written by Heikki. Yes, it is
a very useful infrastructure to "pull" events. I agree that we should
consider such a situation.
Koichi Suzuki
2014-07-23 10:25 GMT+09:00 cbx pgxccx@users.sf.net:
I found a problem with the autovacuum process.
The code in question is in the routine AutoVacLauncherMain:
if (sigsetjmp(local_sigjmp_buf, 1) != 0) -- A
{ ... }
rebuild_database_list(InvalidOid); -- B
If there is an error in rebuild_database_list, the routine jumps back to A, and a deadlock happens. This occurred when I executed "stop all" in pgxc_ctl: all the other processes had exited except the logger and autovacuum.
The logger didn't exit because autovacuum kept generating logs.
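The control flow described here can be sketched in isolation. The following is not the autovacuum.c code, and the contents of the real recovery block are elided in the excerpt above, so everything inside the `if` is an assumption. `failing_step` is a hypothetical stand-in for a call at B that always raises an error, and `launcher_sketch` shows why recovery at A needs an exit condition: without one, control falls through to B again after every jump.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf local_sigjmp_buf;
static volatile sig_atomic_t shutdown_requested = 0;
static volatile int attempts = 0;

/* Hypothetical stand-in for a step at B that always fails. */
static void
failing_step(void)
{
    attempts++;
    /* Error path: jump back to the recovery point A. */
    siglongjmp(local_sigjmp_buf, 1);
}

/* Returns the number of times B was attempted before giving up. */
static int
launcher_sketch(int max_attempts)
{
    if (sigsetjmp(local_sigjmp_buf, 1) != 0)    /* A */
    {
        /*
         * Error recovery. Assumed exit check: without it, every failure
         * at B would come back here and then retry B indefinitely.
         */
        if (shutdown_requested || attempts >= max_attempts)
            return attempts;
    }

    failing_step();                             /* B */
    return attempts;
}
```

For example, `launcher_sketch(3)` retries B three times and then returns, while setting `shutdown_requested` beforehand makes it stop after the first failure.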
Stack of the autovacuum process when it generates the error:
0 GetSnapshotDataCoordinator (snapshot=0xcb4240 <CurrentSnapshotData>) at procarray.c:3058
1 0x0000000000730b65 in GetPGXCSnapshotData (snapshot=0xcb4240 <CurrentSnapshotData>) at procarray.c:2837
2 0x000000000072f0df in GetSnapshotData (snapshot=0xcb4240 <CurrentSnapshotData>) at procarray.c:1411
3 0x000000000089f3b7 in GetTransactionSnapshot () at snapmgr.c:180
4 0x00000000006e85b9 in get_database_list () at autovacuum.c:1860
5 0x00000000006e7592 in rebuild_database_list (newdb=0) at autovacuum.c:976
6 0x00000000006e6ea7 in AutoVacLauncherMain (argc=0, argv=0x0) at autovacuum.c:586
7 0x00000000006e6b5b in StartAutoVacLauncher () at autovacuum.c:391
8 0x00000000006f5cda in reaper (postgres_signal_arg=17) at postmaster.c:2750
9 <signal handler called>
10 0x00007fedecb65b43 in __select_nocancel () from /lib64/libc.so.6
11 0x00000000006f406d in ServerLoop () at postmaster.c:1662
12 0x00000000006f3975 in PostmasterMain (argc=5, argv=0x15aced0) at postmaster.c:1369
13 0x000000000065a9f9 in main (argc=5, argv=0x15aced0) at main.c:206