[ naviserver-Feature Requests-1156875 ] Watchdog process restarts failed server

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Feature Requests item #1156875, was opened at 2005-03-04 20:03
Message generated for change (Settings changed) made by vasiljevic
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=719009&aid=1156875&group_id=130646

Category: None
Group: None
>Status: Closed
>Resolution: Accepted
Priority: 5
Submitted By: Zoran Vasiljevic (vasiljevic)
Assigned to: Zoran Vasiljevic (vasiljevic)
Summary: Watchdog process restarts failed server

Initial Comment:
We have been using this for quite some time and it 
proved extremely useful. 
We doublefork the nsd process and make the first 
forked instance control the second. The first one 
(the watchdog) reacts on exit codes and signals 
caught during the watch and correspondingly restarts 
the second instance (the worker). 

Also, we have added the the "-restart" option to 
the "ns_shutdown" command. This just sends the 
SIGINT to the worker process. The watchdog is 
handling this signal and respawns the worker 
automatically. 

During operation, the watchdog logs events and their 
cause  into the system log file. This looks like: 
Feb 28 04:00:05 Develop nsd[19400]: worker: started. 
Mar  1 04:00:13 Develop nsd[4475]: watchdog: worker 
19400 exited (2). 
Mar  1 04:00:15 Develop nsd[21290]: worker: started. 
Mar  1 04:00:18 Develop nsd[14705]: watchdog: worker 
19399 exited (2). 
Mar  1 04:00:20 Develop nsd[21300]: worker: started. 

We have done all the changes with "--enable-watchdog" 
so anybody who needs this feature will have to compile 
with this option.  

----------------------------------------------------------------------

>Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-17 18:51

Message:
Logged In: YES 
user_id=95086

Added to cvs head. 

----------------------------------------------------------------------

Comment By: Stephen Deasey (sdeasey)
Date: 2005-03-17 12:46

Message:
Logged In: YES 
user_id=87254

Oh, I thought you were going to commit this.  Sure, revert
the exit() stuff and commit away, or I'll get to it at the
wekend.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-17 12:23

Message:
Logged In: YES 
user_id=95086

Now, what? You go change/remove _exit and commit? 
Other things are ok for me. 
Or should I commit? 

----------------------------------------------------------------------

Comment By: Stephen Deasey (sdeasey)
Date: 2005-03-17 12:02

Message:
Logged In: YES 
user_id=87254

Calling _exit()  was an attempt to clarify the intent of the
code.  But I didn't pay attention to the fact that the main
function is pluggable, so I guess that's a no-go.

4 hours of retry time before giving up sounds almost like
'inifinite'...  Looks like we can get something suitable
without adding extra switches, so we can commit this and
tweak the timing later.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-15 14:54

Message:
Logged In: YES 
user_id=95086

Stephen, 
why do you call _exit(0) instead of returning 
control to caller routine?  

Also.. 
  I don't want to go overboard with command line switches, but 
  an option to specify not to give up trying to restart the 
  server would be nice.  If you have that, you also need to 
  turn off the restart timeout doubling at a certain point.  I 
  don't know what the cleanest way to do that is. 

Restart time doubling is limited to 64 seconds (insane, should 
rather be 60 seconds, heh) so it should not double after 
reaching 60 secs. After that time we constantly wait 60 secs 
between restarts. 
With the server retry of 16 times and we will wait 1, 2, 4, 8,  
16, 32,  64 (i.e. total of 127) seconds and after that 8 times 64  
secs (= 512 total ) secs and eventually abort.  
All together, that will be 640 secs = about 10 minutes 
wait/retry. After that it will exit. I think if we extend retry count 
from 16 to 256 it will be pretty enough (about 4 hours) where 
one can fix the failure cause. Otherwise, I see syntactic 
problems of how to actually specify the retries parameter. 
Of curse, you can do something like 
  bin/nsd -w -r 16 

but the "-r"  (restart) would really have no meaning w/o -w 
and I do not like this very much indeed. I'm really for bumping 
the retry count higher (127 or 256 tries). 

----------------------------------------------------------------------

Comment By: Vlad Seryakov (seryakov)
Date: 2005-03-15 01:50

Message:
Logged In: YES 
user_id=184124

Last time addition:
Can we make syslog function public, Ns_Syslog?
Also Tcl interface to it would be usefull as well, below is
syslog from my nssys module, just a suggestion, no rush with
this:

static struct syslog_data {
   int opened;
   int facility;
   int options;
   char ident[32];
   Tcl_HashTable *priorities;
   Tcl_HashTable *facilities;
} syslog_data;

/* Syslog initialization */
void SyslogInit()
{
   memset(&syslog_data,0,sizeof(syslog_data));
   syslog_data.options = LOG_PID;
   syslog_data.facility = LOG_USER;
   if((argv0 = Tcl_GetVar(interp,"argv0",TCL_GLOBAL_ONLY)))

strncpy(syslog_data.ident,argv0,sizeof(syslog_data.ident)-1);
   syslog_data.facilities = (Tcl_HashTable
*)Tcl_Alloc(sizeof(Tcl_HashTable));
   Tcl_InitHashTable(syslog_data.facilities,TCL_STRING_KEYS);
   AddEntry(syslog_data.facilities,"auth",LOG_AUTH);
   AddEntry(syslog_data.facilities,"authpriv",LOG_AUTHPRIV);
   AddEntry(syslog_data.facilities,"cron",LOG_CRON);
   AddEntry(syslog_data.facilities,"daemon",LOG_DAEMON);
   AddEntry(syslog_data.facilities,"kernel",LOG_KERN);
   AddEntry(syslog_data.facilities,"lpr",LOG_LPR);
   AddEntry(syslog_data.facilities,"mail",LOG_MAIL);
   AddEntry(syslog_data.facilities,"news",LOG_NEWS);
   AddEntry(syslog_data.facilities,"syslog",LOG_SYSLOG);
   AddEntry(syslog_data.facilities,"user",LOG_USER);
   AddEntry(syslog_data.facilities,"uucp",LOG_UUCP);
   AddEntry(syslog_data.facilities,"local0",LOG_LOCAL0);
   AddEntry(syslog_data.facilities,"local1",LOG_LOCAL1);
   AddEntry(syslog_data.facilities,"local2",LOG_LOCAL2);
   AddEntry(syslog_data.facilities,"local3",LOG_LOCAL3);
   AddEntry(syslog_data.facilities,"local4",LOG_LOCAL4);
   AddEntry(syslog_data.facilities,"local5",LOG_LOCAL5);
   AddEntry(syslog_data.facilities,"local6",LOG_LOCAL6);
   AddEntry(syslog_data.facilities,"local7",LOG_LOCAL7);
   syslog_data.priorities = (Tcl_HashTable
*)Tcl_Alloc(sizeof(Tcl_HashTable));
   Tcl_InitHashTable(syslog_data.priorities,TCL_STRING_KEYS);
   AddEntry(syslog_data.priorities,"emerg",LOG_EMERG);
   AddEntry(syslog_data.priorities,"alert",LOG_ALERT);
   AddEntry(syslog_data.priorities,"crit",LOG_CRIT);
   AddEntry(syslog_data.priorities,"err",LOG_ERR);
   AddEntry(syslog_data.priorities,"error",LOG_ERR);
   AddEntry(syslog_data.priorities,"warning",LOG_WARNING);
   AddEntry(syslog_data.priorities,"notice",LOG_NOTICE);
   AddEntry(syslog_data.priorities,"info",LOG_INFO);
   AddEntry(syslog_data.priorities,"debug",LOG_DEBUG);
}

/*
 * sys_log
 *
 * Usage:
 *   sys_log ?-facility f -options o -ident i? priority message
 *
 *  facility    - kernel, cron, authpriv, mail, local0,
local1, daemon, local2,
 *                news, local3, local4, local5, local6,
syslog, local7, auth, uucp, lpr, user
 *  options     - list with any of { CONS NDELAY PERROR PID
ODELAY NOWAIT }
 *  ident       - ident is prepended to every message, and
is typically the program name
 *  priority    - info, alert, emerg, err, notice, warning,
error, crit, debug
 *
 * Returns:
 *   nothing
 */
static int SysLog(ClientData data,Tcl_Interp *interp,int
argc,char **argv)
{
    int i,priority = 0;
    Tcl_HashEntry *entry;

    if(argc < 1) {
      Tcl_AppendResult(interp,argv[0]," ?-facility f
-options o -ident i? priority message",0);
      return TCL_ERROR;
    }
    for(i = 1;i < argc-1;) {
      if(!strcmp(argv[i],"-facility")) {
        entry =
Tcl_FindHashEntry(syslog_data.facilities,argv[i+1]);
        if(!entry) {
          Tcl_AppendResult(interp,"wrong facility
",argv[i+1],": available facilities: ",NULL);
          ListHash(interp,syslog_data.facilities);
          return TCL_ERROR;
        }
        syslog_data.facility = (int)Tcl_GetHashValue(entry);
        closelog();
        syslog_data.opened = 0;
        i += 2;
        continue;
      }
      if(!strcmp(argv[i],"-options")) {
        syslog_data.options = 0;
        if(strstr(argv[i+1],"CONS")) syslog_data.options |=
LOG_CONS;
        if(strstr(argv[i+1],"NDELAY")) syslog_data.options
|= LOG_NDELAY;
        if(strstr(argv[i+1],"PERROR")) syslog_data.options
|= LOG_PERROR;
        if(strstr(argv[i+1],"PID")) syslog_data.options |=
LOG_PID;
        if(strstr(argv[i+1],"ODELAY")) syslog_data.options
|= LOG_ODELAY;
        if(strstr(argv[i+1],"NOWAIT")) syslog_data.options
|= LOG_NOWAIT;
        closelog();
        syslog_data.opened = 0;
        i += 2;
        continue;
      }
      if(!strcmp(argv[i],"-ident")) {
        memset(syslog_data.ident,0,sizeof(syslog_data.ident));

strncpy(syslog_data.ident,argv[i+1],sizeof(syslog_data.ident)-1);
        closelog();
        syslog_data.opened = 0;
        i += 2;
        continue;
      }
      break;
    }
    if(i < argc) {
      entry = Tcl_FindHashEntry(syslog_data.priorities,argv[i]);
      if(!entry) {
        Tcl_AppendResult(interp,"wrong level ",argv[i],":
available levels: ",NULL);
        ListHash(interp,syslog_data.priorities);
        return TCL_ERROR;
      }
      priority = (int)Tcl_GetHashValue(entry);
      i++;
    }
    if(i < argc) {
      if(!syslog_data.opened) {

openlog(syslog_data.ident,syslog_data.options,syslog_data.facility);
        syslog_data.opened = 1;
      }
      syslog(priority,argv[i]);
      return TCL_OK;
    }
    return TCL_OK;
}

----------------------------------------------------------------------

Comment By: Stephen Deasey (sdeasey)
Date: 2005-03-15 00:44

Message:
Logged In: YES 
user_id=87254

Oops, looks like I forgot to attach the patch...  Here it is.

Yes, the server will attempt to restart a number of times
before giving up.  I think this is as appropriate for config
errors as it is for others, then there doesn't need to be a
dual-standard fatal error.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-14 19:12

Message:
Logged In: YES 
user_id=95086

I'd rather wait for Stephens code and see what's done there 
and then make final changes and commit. It should not 
take long. 

----------------------------------------------------------------------

Comment By: Vlad Seryakov (seryakov)
Date: 2005-03-14 19:07

Message:
Logged In: YES 
user_id=184124

Then i do not see any reason not committing it, we can have
discussions about it for another couple of years, but until
we have it and use it, it is just discussions. We aready
agreed on having it in the core.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-14 19:00

Message:
Logged In: YES 
user_id=95086

Stephen, 
You mind uploading your patch changes so I can peek into it? 
I think we should soon settle on some solution and move on. 

Vlad, 
The patch already does that. After 16 unsuccessfull restarts 
the watchdog exits. Between every restart we wait 1, 2, 4, 8,  
16, 32, .... 16384 seconds and then exit. This would be: 32767 
seconds or about 9 hours.  

----------------------------------------------------------------------

Comment By: Vlad Seryakov (seryakov)
Date: 2005-03-14 15:53

Message:
Logged In: YES 
user_id=184124

I would say, repeat configured number of time and then exit.
This way all socket related problems will be cleared after
couple of restartes, config problems will make server exit .

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-14 15:41

Message:
Logged In: YES 
user_id=95086

Well, because I could not differentiate between config 
and later runtime errors, I thought its wiser to abort 
rather to repeatedly restart broken server. 
One possibility is to add different Ns_Fatal call like: 

  Ns_FatalEx(int exitcode, char *fmt, ...); 

Or? 

----------------------------------------------------------------------

Comment By: Stephen Deasey (sdeasey)
Date: 2005-03-14 15:12

Message:
Logged In: YES 
user_id=87254

Are you sure Ns_Fatal should not restart the server?  Many
of the fatal errors are caused by bad configuration and I
can see why you might want the server to exit immediately. 
But there are also runtime fatal errors, and here I'd expect
the server to be restarted.

Some config errors are due to external factors like missing
directories or file system permissions and it would be nice
for the server to come back up as soon as the issue resolved
itself.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-14 10:52

Message:
Logged In: YES 
user_id=95086

Calling watchdog before or after chroot has no real 
implications I believe. 
Also, when the server exits with Ns_Fatal, it hsould not be 
restarted, I will look into your changes today. 

----------------------------------------------------------------------

Comment By: Stephen Deasey (sdeasey)
Date: 2005-03-11 07:27

Message:
Logged In: YES 
user_id=87254

Oh yeah, and I moved the watchdog stuff eaven earlier,
before the prebind and chroot.  Hmm, now that I think about
it it's only the prebind that really needs to happen after
the watchdog is started, to ensure the sockets are always in
a sane state...

----------------------------------------------------------------------

Comment By: Stephen Deasey (sdeasey)
Date: 2005-03-11 07:17

Message:
Logged In: YES 
user_id=87254

Am I reading this right?  It looks like if the server calls
Ns_Fatal() it will exit with code 1, but  0 and 1 are
treated as OK exit codes and the server will not be restarted.

I've attached a patch which changes the above, fixes the pid
problem in a different way because I completely forgot you
already posted below about that small glitch, use
Ns_ParseObjv(), adds the -w switch, and a couple of small
name changes.

I don't want to go overboard with command line switches, but
an option to specify not to give up trying to restart the
server would be nice.  If you have that, you also need to
turn off the restart timeout doubling at a certain point.  I
don't know what the cleanest way to do that is.

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-08 21:09

Message:
Logged In: YES 
user_id=95086

This is correct. As soon as we agree on some 
implementation, I will put the -i processing 
and avoid starting watchdog for inittab starts. 

----------------------------------------------------------------------

Comment By: Vlad Seryakov (seryakov)
Date: 2005-03-08 20:48

Message:
Logged In: YES 
user_id=184124

There still should be possibility to run nsd from inittab,
so when -i switch is given, no watchdog should be running,
let /sbin/init to handle restarts

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-08 17:19

Message:
Logged In: YES 
user_id=95086

small glich: 
After applying the patch, change nsmain.c 
from: 
    nsconf.pid = serverPid: 
to 
    nsconf.pid = getpid(); 

otherwise the pid file will contain the bogus 
server pid if the watchdog restarted it later. 

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-08 16:26

Message:
Logged In: YES 
user_id=95086

Another try... 
Important to note: SIGKILL (signal 9) cannot be handled 
hence if somebody kills the watchdog with SIGKILL, the 
server will be left lingering w/o the watchdog. This is  
important to know. I do not see any possibility how to 
recover in such cases (i.e. how to stop the server). 

Apart from that, all objections from Stephen are taken 
into account. Please try again. A new copy of watchdog.patch 
file is attached. 

----------------------------------------------------------------------

Comment By: Vlad Seryakov (seryakov)
Date: 2005-03-08 06:20

Message:
Logged In: YES 
user_id=184124

I just found this code in my old server sources, i just
chnaged internal name to Ns/NS, used to run pretty stable.

-------------------------------------------------------------------

#define NS_EXIT 99

static void NsTerminate(int sig)
{
    printf("NS[%d] signal %d received...",getpid(),sig);
    // Kill the server with the same signal
    if(nsPid > 0) kill(nsPid,sig);

    // Exit in case of fatal signal
    if(sig != SIGHUP) {
      printf("NS[%d] terminating...",getpid());
      exit(0);
    }
    // Reassign signal handler
    signal(sig,NsTerminate);
}

void NsWatchdog(int argc, char *argv[], char *envp[])
{
    int failcount = 0;
    time_t start;
    int status;
    pid_t pid;

    signal(SIGTTOU,SIG_IGN);
    signal(SIGTTIN,SIG_IGN);
    signal(SIGTSTP,SIG_IGN);
    signal(SIGPIPE,SIG_IGN);
    signal(SIGQUIT,SIG_IGN);
    signal(SIGHUP,NsTerminate);
    signal(SIGINT,NsTerminate);
    signal(SIGTERM,NsTerminate);

    // Go background
    if((pid = fork())) {
      if(pid < 0) err_logger("warpConfigure: fork:
%s",strerror(errno));
       exit(0);
    }
    setsid();

    for(;;) {

        // Execute the real server
        nsPid = fork();
        // Child, continue as server
        if(nsPid == 0) {
          exit(nsMain(argc, argv, ServerInit));
        }

        /* parent, behaves like a guardian */
        time(&start);
        printf("NS[%d] server process started",getpid());
        pid = waitpid(-1, &status, 0);
        if(WIFEXITED(status))
          printf("NS[%d] child process exited with status
%d",pid,WEXITSTATUS(status));
        else
        if(WIFSIGNALED(status))
          printf("NS[%d] child process exited due to signal
%d",pid,WTERMSIG(status));
        else
          printf("NS[%d] child process exited", pid);

        // Special exit code
        if(WIFEXITED(status)) {
          if(WEXITSTATUS(status) == NS_EXIT) {
            printf("NS[%d] child configuration error,
exiting",getpid());
            exit(0);
          } else
          if(WEXITSTATUS(status) == SIGHUP) {
          } else {
            if(time(0) - start < 10) failcount++;
            if(failcount == 10) {
              printf("Exiting due to repeated, frequent
failures");
              exit(1);
            }
          }
        }
        sleep(3);
    }
}

----------------------------------------------------------------------

Comment By: Stephen Deasey (sdeasey)
Date: 2005-03-08 05:23

Message:
Logged In: YES 
user_id=87254

NsWatchdog() is called after the server drops root privs, so
both the watchdog and the server run as the defined user. 
What happens if the server dies and on restart needs to
rebind privileged ports?

A ps listing shows two running processes, parent and child.
  If I kill either one, the watchdog dies, the server
continues to run.
  If I kill -9 the parent, the child continues to run.
  If I kill -9 the child, the server is restarted.
Something seems not quite right here...

I'm a bit confused about how the code works.  For example,
NsWatchdog() seems to ignore all of it's arguments?  Here's
the code which calls it:

if (mode == 0) {
        i = ns_fork();
        if (i < 0) {
            Ns_Fatal("nsmain: fork() failed: '%s'",
strerror(errno));
        }
        if (i > 0) {
            return 0;
        }
        setsid();
        i = NsWatchdog(argc, argv, initProc);
        if (i < 0) {
            Ns_Fatal("nsmain: watchdog failed: '%s'",
strerror(errno));
        }
        if (i > 0) {
            return 0;
        }
        nsconf.pid = getpid();
    }

NsWatchdog() says it returns the worker pid, it also sets
the global variable wpid.  The code above ignores the
returned value, and the global variable, and instead calls
getpid()...  The variable 'i' is existing code, but still
somehow doesn't seem suitable...

Could that Ns_Fatal() above be moved into the NsWatchdog()
function?  I think maybe some comment is needed here.  The
code structure is very like that just above where ns_fork()
is called, but this function will return *multiple* times,
right?  This is kind of suprising, and an extra twist on the
already confusing fork() semantics (or maybe it's just me
who gets confused by fork...).  A return value of 0 here is
a 'request for orderly shutdown', right?

How about some more logging to syslog?  For example,
distinguish between start and restart.  Mention when the
MAX_SLEEP_PERIOD has been reached, etc.

Couple of small things:

Can we refer to 'the server', rather than 'the worker'? 
Worker and Watchdog begin with 'w', and so does the global
variable wpid...  Maybe serverPid?

NsWatchdog is a static function, it doesn't need the Ns prefix.

In NsWatchdog(), the variable 'run' should be something like
'nretries', 'nap' should be something more like
'retrySeconds' and MAX_SLEEP_PERIOD should be MAX_RETRY_SECONDS.

The comment for WaitForWorker() is misslabeled.

Should SigHandler() be called WatchdogSigHandler()?

----------------------------------------------------------------------

Comment By: Vlad Seryakov (seryakov)
Date: 2005-03-07 00:03

Message:
Logged In: YES 
user_id=184124

Looks good to me

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-05 13:06

Message:
Logged In: YES 
user_id=95086

Ah, correction: 
The restart option sends SIGINT to the worker process 
which causes the watchdog to restart it.  
And, the patchfile is now attached! 

----------------------------------------------------------------------

Comment By: Zoran Vasiljevic (vasiljevic)
Date: 2005-03-05 13:03

Message:
Logged In: YES 
user_id=95086

Here is the patch. 
I have added "-restart" option to "ns_shutdown". It is rather
clumsy to parse but should do. We should rewrite this with
your args parsing routine. 
The restart option sends SIGTERM to the worker process
which causes the watchdog to restart it. 

----------------------------------------------------------------------

Comment By: Stephen Deasey (sdeasey)
Date: 2005-03-05 00:46

Message:
Logged In: YES 
user_id=87254

I don't think this has to hide behind a config option.  It's
either a good idea or it's not.  Sounds good to me.  Is
there a patch?  I'm wondering about some of the
implementation details...

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=719009&aid=1156875&group_id=130646

[ naviserver-Feature Requests-1156875 ] Watchdog process restarts failed server

NaviServer, a high performance web server written in C and Tcl

[ naviserver-Feature Requests-1156875 ] Watchdog process restarts failed server