
#487 The pooler process sometimes doesn't exit after executing "stop all"

Group: 1.2 Dev Q
Status: open
Owner: nobody
Priority: 6
Updated: 2014-07-23
Created: 2014-07-21
Creator: peace zone
Private: No

The pooler process doesn't exit after I execute the "stop all" command.

When I attach gdb to the process, I find it is stopped at this place in poolmgr.c:

2368 retval = select(nfds + 1, &rfds, NULL, NULL, NULL);
2369 if (shutdown_requested)

When the server_id is not changed, the select will wait forever.
The routine has no chance to check "shutdown_requested".

Does the select call need a timeout?

The Syslogger process has the same problem.

Related

Bugs: #487

Discussion

  • Koichi Suzuki

    Koichi Suzuki - 2014-07-21

    Thanks a lot for the report. I thought select would return when
    the pooler receives a signal. Did you check whether the pooler process
    receives and handles a signal with the "stop all" command, which in
    turn runs pg_ctl stop? If so, before adding a timeout, we need to find
    out when the signal cannot be handled correctly.

    I'm afraid Syslogger may have the same issue.

    Any more inputs/ideas on this?

    Thank you again.

    Koichi Suzuki

    2014-07-21 17:50 GMT+09:00 peace zone peacezone@users.sf.net:


    [bugs:#487] The pooler process sometimes doesn't exit after executing
    "stop all"

    Status: open
    Group: 1.2 Dev Q
    Labels: pool process
    Created: Mon Jul 21, 2014 08:50 AM UTC by peace zone
    Last Updated: Mon Jul 21, 2014 08:50 AM UTC
    Owner: nobody


  • peace zone

    peace zone - 2014-07-21

    Thank you for your response!

    The pooler process catches the SIGTERM that the "stop all" command in pgxc_ctl sends,
    and then only sets shutdown_requested to true. The pooler process exits only when shutdown_requested = true.

    In this situation the select doesn't know that the signal has arrived.

    Look at this code:

    1. PoolManagerInit

    pqsignal(SIGINT, pooler_die);
    pqsignal(SIGTERM, pooler_die);
    pqsignal(SIGQUIT, pooler_quickdie);

    2. pooler_die

    static void
    pooler_die(SIGNAL_ARGS)
    {
        shutdown_requested = true;
    }
     
    • cbx

      cbx - 2014-07-22

      SIGTERM interrupts the select system call, and select then returns -1 with errno = EINTR.
      So poolmgr can notice the signal IF POOLMGR IS WAITING IN THE SYSTEM CALL.

      This means the issue can happen when the signal is caught before select is called and the poolmgr has no connections.
      I recommend adding a timeout or some other stricter logic.

      I don't think Syslogger has this kind of problem. Why do you think it does?
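
To make the race described above concrete, here is a minimal sketch of the timeout approach. It is not the poolmgr.c code: `demo_die`, `wait_readable_or_shutdown` and the timeout value are hypothetical names chosen for illustration, and plain `signal()`-style handlers stand in for `pqsignal`. The flag is checked before every `select` call, and the timeout bounds how long a signal that lands between the check and the `select` can go unnoticed.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical stand-in for the pooler's flag; written only by the handler. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Hypothetical handler, playing the role of pooler_die. */
static void
demo_die(int signo)
{
    (void) signo;
    shutdown_requested = 1;
}

/*
 * Wait until fd is readable or a shutdown has been requested.
 * Returns 1 on shutdown, 0 if fd is readable, -1 on a select() error.
 */
static int
wait_readable_or_shutdown(int fd, long timeout_ms)
{
    for (;;)
    {
        fd_set          rfds;
        struct timeval  tv;
        int             retval;

        /* A signal handled before select() was entered is noticed here. */
        if (shutdown_requested)
            return 1;

        /* select() may modify both arguments, so rebuild them each time. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        retval = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (retval > 0)
            return 0;
        if (retval < 0 && errno != EINTR)
            return -1;
        /* Timeout or EINTR: go around and re-check the flag. */
    }
}
```

With a NULL timeout and no check before the call, a SIGTERM delivered just before `select` would leave the process blocked until some descriptor became readable. The timeout does not remove that window, it only limits how long the process can sit in it; `pselect` with a signal mask is the usual way to close the window completely.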

       
      • peace zone

        peace zone - 2014-07-22

        Thank you for your response!

         
      • peace zone

        peace zone - 2014-07-22

        I found that the pooler and the syslogger were still alive after the "stop all". The pooler was stopped at select and the syslogger was stopped at poll. I forgot to dump the stacks of the two processes, but I think the problem may be the same. I will dump the stacks next time.

         
  • cbx

    cbx - 2014-07-23

    AFAIK syslogger uses the Latch mechanism, which is carefully designed not to lose an event regardless of timing. The Latch mechanism uses a pipe and poll.

    We might need to consider the possibility that postmaster isn't sending the signal.
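
For readers unfamiliar with the pipe-and-poll idea mentioned above, here is a minimal sketch of the general self-pipe technique. It is not PostgreSQL's latch implementation; `selfpipe`, `wakeup_handler`, `init_selfpipe` and `wait_for_wakeup` are hypothetical names. The handler writes a byte into a non-blocking pipe, so a signal that arrives before `poll` is entered still leaves the read end readable and the wakeup is not lost.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical self-pipe: [0] is polled, [1] is written by the handler. */
static int selfpipe[2] = {-1, -1};

/* Hypothetical handler: turns the signal into a readable byte. */
static void
wakeup_handler(int signo)
{
    int  save_errno = errno;
    char c = 0;

    (void) signo;
    /* write() is async-signal-safe; a full pipe means a wakeup is already pending. */
    if (write(selfpipe[1], &c, 1) < 0)
    {
        /* nothing to do */
    }
    errno = save_errno;
}

/* Create the pipe and make both ends non-blocking. Returns 0 on success. */
static int
init_selfpipe(void)
{
    int i;

    if (pipe(selfpipe) < 0)
        return -1;
    for (i = 0; i < 2; i++)
    {
        int flags = fcntl(selfpipe[i], F_GETFL);

        if (flags < 0 || fcntl(selfpipe[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    return 0;
}

/* Returns 1 if a wakeup was pending or arrived, 0 on timeout, -1 on error. */
static int
wait_for_wakeup(int timeout_ms)
{
    struct pollfd pfd;
    char          buf[16];
    int           rc;

    pfd.fd = selfpipe[0];
    pfd.events = POLLIN;
    pfd.revents = 0;

    for (;;)
    {
        rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0 && errno == EINTR)
            continue;           /* interrupted: the byte is already in the pipe */
        if (rc <= 0)
            return rc;          /* 0 = timeout, -1 = error */

        /* Drain the pipe so the next wait blocks again. */
        while (read(selfpipe[0], buf, sizeof(buf)) > 0)
            ;
        return 1;
    }
}
```

A caller would run `init_selfpipe()` once, install `wakeup_handler` for the signals of interest, and add `selfpipe[0]` to whatever descriptor set it already waits on.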

     
    • Koichi Suzuki

      Koichi Suzuki - 2014-07-23

      I suppose this is available at 9.3, written by Heikki. Yes, it is
      very useful infrastructure to "pull" events. I agree that we should
      consider such a situation.


      Koichi Suzuki


    • peace zone

      peace zone - 2014-08-04

      I found a problem with the autovacuum process.

      The code here is in the routine AutoVacLauncherMain:

      if (sigsetjmp(local_sigjmp_buf, 1) != 0) -- A
      { ... }

      rebuild_database_list(InvalidOid); -- B

      If there is an error in rebuild_database_list, the routine jumps back to A, and a deadlock happens. This happened when I executed "stop all" in pgxc_ctl: all the other processes had exited except the logger and autovacuum.
      The logger didn't exit because autovacuum was still generating logs.

      Stack at the point where autovacuum generates the error:
      0 GetSnapshotDataCoordinator (snapshot=0xcb4240 <CurrentSnapshotData>) at procarray.c:3058
      1 0x0000000000730b65 in GetPGXCSnapshotData (snapshot=0xcb4240 <CurrentSnapshotData>) at procarray.c:2837
      2 0x000000000072f0df in GetSnapshotData (snapshot=0xcb4240 <CurrentSnapshotData>) at procarray.c:1411
      3 0x000000000089f3b7 in GetTransactionSnapshot () at snapmgr.c:180
      4 0x00000000006e85b9 in get_database_list () at autovacuum.c:1860
      5 0x00000000006e7592 in rebuild_database_list (newdb=0) at autovacuum.c:976
      6 0x00000000006e6ea7 in AutoVacLauncherMain (argc=0, argv=0x0) at autovacuum.c:586
      7 0x00000000006e6b5b in StartAutoVacLauncher () at autovacuum.c:391
      8 0x00000000006f5cda in reaper (postgres_signal_arg=17) at postmaster.c:2750
      9 <signal handler called>
      10 0x00007fedecb65b43 in __select_nocancel () from /lib64/libc.so.6
      11 0x00000000006f406d in ServerLoop () at postmaster.c:1662
      12 0x00000000006f3975 in PostmasterMain (argc=5, argv=0x15aced0) at postmaster.c:1369
      13 0x000000000065a9f9 in main (argc=5, argv=0x15aced0) at main.c:206

       
