Postgres-XC / Bugs / #487 The pooler process sometimes doesn't exit after executing "stop all"

Koichi Suzuki - 2014-07-21

Thanks a lot for the report. I thought select would return when
the pooler receives a signal. Did you check whether the pooler process
receives and handles a signal from the "stop all" command, which in
turn runs pg_ctl stop? If so, before adding a timeout, we need to find
out when the signal cannot be handled correctly.

I'm afraid Syslogger may have the same issue.

Any more inputs/ideas on this?

Thank you again.

Koichi Suzuki

2014-07-21 17:50 GMT+09:00 peace zone peacezone@users.sf.net:

[bugs:#487] The pooler process sometimes doesn't exit after executing "stop all"

Status: open
Group: 1.2 Dev Q
Labels: pool process
Created: Mon Jul 21, 2014 08:50 AM UTC by peace zone
Last Updated: Mon Jul 21, 2014 08:50 AM UTC
Owner: nobody

The pooler process doesn't exit after I execute the "stop all" command.

When I attach gdb to the process, I find it stopped at this place in poolmgr.c (lines 2368-2369):

    retval = select(nfds + 1, &rfds, NULL, NULL, NULL);
    if (shutdown_requested)

When the server_id is not changed, the select will wait forever.
The routine has no chance to check "shutdown_requested".

Should a timeout be added to the select call?

The Syslogger process has the same problem.
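
For illustration, the timeout idea asked about above could look roughly like the following plain C sketch. It is not the poolmgr.c code: `install_sigterm_handler`, `wait_readable_or_shutdown`, and the one-second timeout are assumptions made up for this example. The point is only that a bounded select lets the loop recheck the flag even if the signal arrived just before the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

/* Set by the signal handler, like shutdown_requested in the report. */
static volatile sig_atomic_t shutdown_requested = 0;

static void
pooler_die(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Hypothetical helper: install pooler_die for SIGTERM. Returns 0 on success. */
static int
install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = pooler_die;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/*
 * Hypothetical helper: wait until fd is readable or shutdown is requested.
 * Returns 1 if fd is readable, 0 if shutdown was requested, -1 on error.
 */
static int
wait_readable_or_shutdown(int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;)
    {
        fd_set         rfds;
        struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };  /* assumed value */
        int            rc;

        if (shutdown_requested)
            return 0;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        /* A signal landing between the check above and this call is
         * noticed at the latest when the timeout expires. */
        rc = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (rc > 0)
            return 1;
        if (rc < 0 && errno != EINTR)
            return -1;
        /* rc == 0 (timeout) or EINTR: loop and recheck the flag. */
    }
}
```

The cost of this approach is a periodic wakeup and a shutdown delay of up to one timeout period.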


peace zone - 2014-07-21

Thank you for your response!

The pooler process catches the SIGTERM that the "stop all" command in pgxc_ctl sends,
and the handler only sets shutdown_requested to true. The pooler process exits only when shutdown_requested is true.

In this situation, select doesn't know that the signal is coming.

Look at this code:

1. PoolManagerInit

    pqsignal(SIGINT, pooler_die);
    pqsignal(SIGTERM, pooler_die);
    pqsignal(SIGQUIT, pooler_quickdie);

2. pooler_die

    static void
    pooler_die(SIGNAL_ARGS)
    {
        shutdown_requested = true;
    }

- cbx - 2014-07-22
  
  SIGTERM interrupts the select system call, and select then returns -1 with errno = EINTR.
  So poolmgr can notice the signal only IF POOLMGR IS WAITING IN THE SYSTEM CALL.
  
  This means the issue can happen when the signal is caught before select is called and the poolmgr has no connections.
  I recommend adding a timeout or some other stricter logic.
  
  I don't think Syslogger has this kind of problem. Why do you think it does?
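
As one possible reading of "other strict logic", here is a hedged sketch that closes the window between checking the flag and entering the wait by keeping SIGTERM blocked and letting pselect unblock it atomically. This is plain POSIX C, not Postgres-XC code; `install_sigterm_handler`, `block_sigterm`, and `wait_readable_or_shutdown` are hypothetical names for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t shutdown_requested = 0;

static void
pooler_die(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/* Hypothetical helper: install pooler_die for SIGTERM. Returns 0 on success. */
static int
install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = pooler_die;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/*
 * Hypothetical helper: block SIGTERM during normal execution and store in
 * *wait_mask the mask to use while waiting (SIGTERM deliverable).
 */
static int
block_sigterm(sigset_t *wait_mask)
{
    sigset_t block;

    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    if (sigprocmask(SIG_BLOCK, &block, wait_mask) != 0)
        return -1;
    sigdelset(wait_mask, SIGTERM);
    return 0;
}

/* Returns 1 if fd is readable, 0 if shutdown was requested, -1 on error. */
static int
wait_readable_or_shutdown(int fd, const sigset_t *wait_mask)
{
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;)
    {
        fd_set rfds;
        int    rc;

        /* SIGTERM is blocked here, so the flag cannot change between
         * this check and the start of the wait. */
        if (shutdown_requested)
            return 0;

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        /* pselect swaps in wait_mask atomically; a pending SIGTERM is
         * delivered now and the call fails with EINTR. */
        rc = pselect(fd + 1, &rfds, NULL, NULL, NULL, wait_mask);
        if (rc > 0)
            return 1;
        if (rc < 0 && errno != EINTR)
            return -1;
    }
}
```

Unlike a timeout, this variant needs no periodic wakeup, but it requires every wait in the process to follow the same blocking discipline.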
  
  - peace zone - 2014-07-22
    
    Thank you for your response!
    
  - peace zone - 2014-07-22
    
    I found that the pooler and the syslogger were still alive after "stop all": the pooler was stopped in select and the syslogger in poll. I forgot to dump the stacks of the two processes, but I think the problem may be the same. I will dump the stacks next time.
    

cbx - 2014-07-23

AFAIK the syslogger uses the Latch mechanism, which is carefully designed not to lose an event regardless of timing. The Latch mechanism uses a pipe and poll.

We might need to consider the possibility that the postmaster is not sending the signal.
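
To show why a pipe plus poll does not lose a wakeup, here is a simplified self-pipe sketch in plain C. It is only an illustration of the general technique and is not the PostgreSQL Latch API; `init_wakeup_pipe`, `install_sigterm_handler`, and `wait_for_shutdown` are hypothetical names. Because the handler writes a byte into the pipe, a signal that arrives before poll is entered still leaves the pipe readable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int wakeup_pipe[2] = { -1, -1 };
static volatile sig_atomic_t shutdown_requested = 0;

static void
on_sigterm(int signo)
{
    int saved_errno = errno;

    (void) signo;
    shutdown_requested = 1;
    /* write() is async-signal-safe; a full pipe (EAGAIN) is harmless
     * because one pending byte is enough to wake the waiter. */
    if (write(wakeup_pipe[1], "x", 1) < 0)
    {
        /* ignored on purpose */
    }
    errno = saved_errno;
}

/* Create the pipe and make both ends non-blocking. Returns 0 on success. */
static int
init_wakeup_pipe(void)
{
    if (pipe(wakeup_pipe) != 0)
        return -1;
    for (int i = 0; i < 2; i++)
    {
        int flags = fcntl(wakeup_pipe[i], F_GETFL);

        if (flags < 0 ||
            fcntl(wakeup_pipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    return 0;
}

static int
install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/*
 * Returns 1 if a shutdown wakeup arrived, 0 on timeout, -1 on error.
 * timeout_ms < 0 means wait without a time limit.
 */
static int
wait_for_shutdown(int timeout_ms)
{
    struct pollfd pfd = { .fd = wakeup_pipe[0], .events = POLLIN };

    assert(wakeup_pipe[0] >= 0);    /* init_wakeup_pipe() must run first */

    for (;;)
    {
        char buf[16];
        int  rc = poll(&pfd, 1, timeout_ms);

        if (rc < 0)
        {
            if (errno == EINTR)
                continue;           /* the byte is in the pipe; poll again */
            return -1;
        }
        if (rc == 0)
            return 0;

        /* Drain the pipe so the next wait blocks again. */
        while (read(wakeup_pipe[0], buf, sizeof(buf)) > 0)
            ;
        if (shutdown_requested)
            return 1;
    }
}
```

In a real server the same poll call would also watch the other descriptors of interest; the sketch watches only the wakeup pipe to keep the idea visible.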

- Koichi Suzuki - 2014-07-23
  
  I suppose this is available in 9.3, written by Heikki. Yes, it is
  a very useful infrastructure to "pull" events. I agree that we should
  consider such a situation.
  
  Koichi Suzuki
  
- peace zone - 2014-08-04
  
  I found a problem with the autovacuum process.
  
  The code in question is in the routine AutoVacLauncherMain:
  
  if (sigsetjmp(local_sigjmp_buf, 1) != 0) -- A
  { ... }
  
  rebuild_database_list(InvalidOid); -- B
  
  If there is an error in rebuild_database_list, the routine jumps to A, and a deadlock happens. This occurred when I executed "stop all" in pgxc_ctl: all the other processes had exited except the logger and autovacuum.
  The logger didn't exit because autovacuum was still generating logs.
  
  Stack of the autovacuum process that generates the error:
  #0  GetSnapshotDataCoordinator (snapshot=0xcb4240 <CurrentSnapshotData>) at procarray.c:3058
  #1  0x0000000000730b65 in GetPGXCSnapshotData (snapshot=0xcb4240 <CurrentSnapshotData>) at procarray.c:2837
  #2  0x000000000072f0df in GetSnapshotData (snapshot=0xcb4240 <CurrentSnapshotData>) at procarray.c:1411
  #3  0x000000000089f3b7 in GetTransactionSnapshot () at snapmgr.c:180
  #4  0x00000000006e85b9 in get_database_list () at autovacuum.c:1860
  #5  0x00000000006e7592 in rebuild_database_list (newdb=0) at autovacuum.c:976
  #6  0x00000000006e6ea7 in AutoVacLauncherMain (argc=0, argv=0x0) at autovacuum.c:586
  #7  0x00000000006e6b5b in StartAutoVacLauncher () at autovacuum.c:391
  #8  0x00000000006f5cda in reaper (postgres_signal_arg=17) at postmaster.c:2750
  #9  <signal handler called>
  #10 0x00007fedecb65b43 in __select_nocancel () from /lib64/libc.so.6
  #11 0x00000000006f406d in ServerLoop () at postmaster.c:1662
  #12 0x00000000006f3975 in PostmasterMain (argc=5, argv=0x15aced0) at postmaster.c:1369
  #13 0x000000000065a9f9 in main (argc=5, argv=0x15aced0) at main.c:206
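
The A/B control flow described in this comment can be reproduced in isolation with the small C sketch below. It only shows how an error raised at B lands back at A and then falls through to B again; it does not reproduce the deadlock itself or any autovacuum code. `rebuild_list_stub`, `launcher_sketch`, and the `max_recoveries` bound are hypothetical and exist only so that the example terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>

static sigjmp_buf local_sigjmp_buf;
static int attempts = 0;

/* Hypothetical stand-in for a call at B that always reports an error
 * by jumping back to the recovery point. */
static void
rebuild_list_stub(void)
{
    attempts++;
    siglongjmp(local_sigjmp_buf, 1);
}

/*
 * Returns the number of times the recovery block (A) was entered before
 * the artificial bound stopped the loop.
 */
static int
launcher_sketch(int max_recoveries)
{
    /* volatile: modified between sigsetjmp and siglongjmp */
    volatile int recoveries = 0;

    assert(max_recoveries > 0);

    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
    {
        /* A: error recovery block */
        recoveries++;
        if (recoveries >= max_recoveries)
            return recoveries;      /* bound added for the example only */
        /* falling out of this block reaches B again */
    }

    /* B: fails every time in this sketch */
    rebuild_list_stub();

    return 0;                       /* not reached */
}
```

Without the artificial bound, a call at B that keeps failing would re-enter A indefinitely, which is one way such a process could stay alive during shutdown.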
  