From: SourceForge.net <no...@so...> - 2005-03-17 17:51:54
|
Feature Requests item #1156875, was opened at 2005-03-04 20:03 Message generated for change (Settings changed) made by vasiljevic You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=719009&aid=1156875&group_id=130646 Category: None Group: None >Status: Closed >Resolution: Accepted Priority: 5 Submitted By: Zoran Vasiljevic (vasiljevic) Assigned to: Zoran Vasiljevic (vasiljevic) Summary: Watchdog process restarts failed server Initial Comment: We have been using this for quite some time and it proved extremely useful. We doublefork the nsd process and make the first forked instance control the second. The first one (the watchdog) reacts on exit codes and signals caught during the watch and correspondingly restarts the second instance (the worker). Also, we have added the the "-restart" option to the "ns_shutdown" command. This just sends the SIGINT to the worker process. The watchdog is handling this signal and respawns the worker automatically. During operation, the watchdog logs events and their cause into the system log file. This looks like: Feb 28 04:00:05 Develop nsd[19400]: worker: started. Mar 1 04:00:13 Develop nsd[4475]: watchdog: worker 19400 exited (2). Mar 1 04:00:15 Develop nsd[21290]: worker: started. Mar 1 04:00:18 Develop nsd[14705]: watchdog: worker 19399 exited (2). Mar 1 04:00:20 Develop nsd[21300]: worker: started. We have done all the changes with "--enable-watchdog" so anybody who needs this feature will have to compile with this option. ---------------------------------------------------------------------- >Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-17 18:51 Message: Logged In: YES user_id=95086 Added to cvs head. ---------------------------------------------------------------------- Comment By: Stephen Deasey (sdeasey) Date: 2005-03-17 12:46 Message: Logged In: YES user_id=87254 Oh, I thought you were going to commit this. Sure, revert the exit() stuff and commit away, or I'll get to it at the wekend. ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-17 12:23 Message: Logged In: YES user_id=95086 Now, what? You go change/remove _exit and commit? Other things are ok for me. Or should I commit? ---------------------------------------------------------------------- Comment By: Stephen Deasey (sdeasey) Date: 2005-03-17 12:02 Message: Logged In: YES user_id=87254 Calling _exit() was an attempt to clarify the intent of the code. But I didn't pay attention to the fact that the main function is pluggable, so I guess that's a no-go. 4 hours of retry time before giving up sounds almost like 'inifinite'... Looks like we can get something suitable without adding extra switches, so we can commit this and tweak the timing later. ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-15 14:54 Message: Logged In: YES user_id=95086 Stephen, why do you call _exit(0) instead of returning control to caller routine? Also.. I don't want to go overboard with command line switches, but an option to specify not to give up trying to restart the server would be nice. If you have that, you also need to turn off the restart timeout doubling at a certain point. I don't know what the cleanest way to do that is. Restart time doubling is limited to 64 seconds (insane, should rather be 60 seconds, heh) so it should not double after reaching 60 secs. After that time we constantly wait 60 secs between restarts. With the server retry of 16 times and we will wait 1, 2, 4, 8, 16, 32, 64 (i.e. total of 127) seconds and after that 8 times 64 secs (= 512 total ) secs and eventually abort. All together, that will be 640 secs = about 10 minutes wait/retry. After that it will exit. I think if we extend retry count from 16 to 256 it will be pretty enough (about 4 hours) where one can fix the failure cause. Otherwise, I see syntactic problems of how to actually specify the retries parameter. Of curse, you can do something like bin/nsd -w -r 16 but the "-r" (restart) would really have no meaning w/o -w and I do not like this very much indeed. I'm really for bumping the retry count higher (127 or 256 tries). ---------------------------------------------------------------------- Comment By: Vlad Seryakov (seryakov) Date: 2005-03-15 01:50 Message: Logged In: YES user_id=184124 Last time addition: Can we make syslog function public, Ns_Syslog? Also Tcl interface to it would be usefull as well, below is syslog from my nssys module, just a suggestion, no rush with this: static struct syslog_data { int opened; int facility; int options; char ident[32]; Tcl_HashTable *priorities; Tcl_HashTable *facilities; } syslog_data; /* Syslog initialization */ void SyslogInit() { memset(&syslog_data,0,sizeof(syslog_data)); syslog_data.options = LOG_PID; syslog_data.facility = LOG_USER; if((argv0 = Tcl_GetVar(interp,"argv0",TCL_GLOBAL_ONLY))) strncpy(syslog_data.ident,argv0,sizeof(syslog_data.ident)-1); syslog_data.facilities = (Tcl_HashTable *)Tcl_Alloc(sizeof(Tcl_HashTable)); Tcl_InitHashTable(syslog_data.facilities,TCL_STRING_KEYS); AddEntry(syslog_data.facilities,"auth",LOG_AUTH); AddEntry(syslog_data.facilities,"authpriv",LOG_AUTHPRIV); AddEntry(syslog_data.facilities,"cron",LOG_CRON); AddEntry(syslog_data.facilities,"daemon",LOG_DAEMON); AddEntry(syslog_data.facilities,"kernel",LOG_KERN); AddEntry(syslog_data.facilities,"lpr",LOG_LPR); AddEntry(syslog_data.facilities,"mail",LOG_MAIL); AddEntry(syslog_data.facilities,"news",LOG_NEWS); AddEntry(syslog_data.facilities,"syslog",LOG_SYSLOG); AddEntry(syslog_data.facilities,"user",LOG_USER); AddEntry(syslog_data.facilities,"uucp",LOG_UUCP); AddEntry(syslog_data.facilities,"local0",LOG_LOCAL0); AddEntry(syslog_data.facilities,"local1",LOG_LOCAL1); AddEntry(syslog_data.facilities,"local2",LOG_LOCAL2); AddEntry(syslog_data.facilities,"local3",LOG_LOCAL3); AddEntry(syslog_data.facilities,"local4",LOG_LOCAL4); AddEntry(syslog_data.facilities,"local5",LOG_LOCAL5); AddEntry(syslog_data.facilities,"local6",LOG_LOCAL6); AddEntry(syslog_data.facilities,"local7",LOG_LOCAL7); syslog_data.priorities = (Tcl_HashTable *)Tcl_Alloc(sizeof(Tcl_HashTable)); Tcl_InitHashTable(syslog_data.priorities,TCL_STRING_KEYS); AddEntry(syslog_data.priorities,"emerg",LOG_EMERG); AddEntry(syslog_data.priorities,"alert",LOG_ALERT); AddEntry(syslog_data.priorities,"crit",LOG_CRIT); AddEntry(syslog_data.priorities,"err",LOG_ERR); AddEntry(syslog_data.priorities,"error",LOG_ERR); AddEntry(syslog_data.priorities,"warning",LOG_WARNING); AddEntry(syslog_data.priorities,"notice",LOG_NOTICE); AddEntry(syslog_data.priorities,"info",LOG_INFO); AddEntry(syslog_data.priorities,"debug",LOG_DEBUG); } /* * sys_log * * Usage: * sys_log ?-facility f -options o -ident i? priority message * * facility - kernel, cron, authpriv, mail, local0, local1, daemon, local2, * news, local3, local4, local5, local6, syslog, local7, auth, uucp, lpr, user * options - list with any of { CONS NDELAY PERROR PID ODELAY NOWAIT } * ident - ident is prepended to every message, and is typically the program name * priority - info, alert, emerg, err, notice, warning, error, crit, debug * * Returns: * nothing */ static int SysLog(ClientData data,Tcl_Interp *interp,int argc,char **argv) { int i,priority = 0; Tcl_HashEntry *entry; if(argc < 1) { Tcl_AppendResult(interp,argv[0]," ?-facility f -options o -ident i? priority message",0); return TCL_ERROR; } for(i = 1;i < argc-1;) { if(!strcmp(argv[i],"-facility")) { entry = Tcl_FindHashEntry(syslog_data.facilities,argv[i+1]); if(!entry) { Tcl_AppendResult(interp,"wrong facility ",argv[i+1],": available facilities: ",NULL); ListHash(interp,syslog_data.facilities); return TCL_ERROR; } syslog_data.facility = (int)Tcl_GetHashValue(entry); closelog(); syslog_data.opened = 0; i += 2; continue; } if(!strcmp(argv[i],"-options")) { syslog_data.options = 0; if(strstr(argv[i+1],"CONS")) syslog_data.options |= LOG_CONS; if(strstr(argv[i+1],"NDELAY")) syslog_data.options |= LOG_NDELAY; if(strstr(argv[i+1],"PERROR")) syslog_data.options |= LOG_PERROR; if(strstr(argv[i+1],"PID")) syslog_data.options |= LOG_PID; if(strstr(argv[i+1],"ODELAY")) syslog_data.options |= LOG_ODELAY; if(strstr(argv[i+1],"NOWAIT")) syslog_data.options |= LOG_NOWAIT; closelog(); syslog_data.opened = 0; i += 2; continue; } if(!strcmp(argv[i],"-ident")) { memset(syslog_data.ident,0,sizeof(syslog_data.ident)); strncpy(syslog_data.ident,argv[i+1],sizeof(syslog_data.ident)-1); closelog(); syslog_data.opened = 0; i += 2; continue; } break; } if(i < argc) { entry = Tcl_FindHashEntry(syslog_data.priorities,argv[i]); if(!entry) { Tcl_AppendResult(interp,"wrong level ",argv[i],": available levels: ",NULL); ListHash(interp,syslog_data.priorities); return TCL_ERROR; } priority = (int)Tcl_GetHashValue(entry); i++; } if(i < argc) { if(!syslog_data.opened) { openlog(syslog_data.ident,syslog_data.options,syslog_data.facility); syslog_data.opened = 1; } syslog(priority,argv[i]); return TCL_OK; } return TCL_OK; } ---------------------------------------------------------------------- Comment By: Stephen Deasey (sdeasey) Date: 2005-03-15 00:44 Message: Logged In: YES user_id=87254 Oops, looks like I forgot to attach the patch... Here it is. Yes, the server will attempt to restart a number of times before giving up. I think this is as appropriate for config errors as it is for others, then there doesn't need to be a dual-standard fatal error. ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-14 19:12 Message: Logged In: YES user_id=95086 I'd rather wait for Stephens code and see what's done there and then make final changes and commit. It should not take long. ---------------------------------------------------------------------- Comment By: Vlad Seryakov (seryakov) Date: 2005-03-14 19:07 Message: Logged In: YES user_id=184124 Then i do not see any reason not committing it, we can have discussions about it for another couple of years, but until we have it and use it, it is just discussions. We aready agreed on having it in the core. ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-14 19:00 Message: Logged In: YES user_id=95086 Stephen, You mind uploading your patch changes so I can peek into it? I think we should soon settle on some solution and move on. Vlad, The patch already does that. After 16 unsuccessfull restarts the watchdog exits. Between every restart we wait 1, 2, 4, 8, 16, 32, .... 16384 seconds and then exit. This would be: 32767 seconds or about 9 hours. ---------------------------------------------------------------------- Comment By: Vlad Seryakov (seryakov) Date: 2005-03-14 15:53 Message: Logged In: YES user_id=184124 I would say, repeat configured number of time and then exit. This way all socket related problems will be cleared after couple of restartes, config problems will make server exit . ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-14 15:41 Message: Logged In: YES user_id=95086 Well, because I could not differentiate between config and later runtime errors, I thought its wiser to abort rather to repeatedly restart broken server. One possibility is to add different Ns_Fatal call like: Ns_FatalEx(int exitcode, char *fmt, ...); Or? ---------------------------------------------------------------------- Comment By: Stephen Deasey (sdeasey) Date: 2005-03-14 15:12 Message: Logged In: YES user_id=87254 Are you sure Ns_Fatal should not restart the server? Many of the fatal errors are caused by bad configuration and I can see why you might want the server to exit immediately. But there are also runtime fatal errors, and here I'd expect the server to be restarted. Some config errors are due to external factors like missing directories or file system permissions and it would be nice for the server to come back up as soon as the issue resolved itself. ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-14 10:52 Message: Logged In: YES user_id=95086 Calling watchdog before or after chroot has no real implications I believe. Also, when the server exits with Ns_Fatal, it hsould not be restarted, I will look into your changes today. ---------------------------------------------------------------------- Comment By: Stephen Deasey (sdeasey) Date: 2005-03-11 07:27 Message: Logged In: YES user_id=87254 Oh yeah, and I moved the watchdog stuff eaven earlier, before the prebind and chroot. Hmm, now that I think about it it's only the prebind that really needs to happen after the watchdog is started, to ensure the sockets are always in a sane state... ---------------------------------------------------------------------- Comment By: Stephen Deasey (sdeasey) Date: 2005-03-11 07:17 Message: Logged In: YES user_id=87254 Am I reading this right? It looks like if the server calls Ns_Fatal() it will exit with code 1, but 0 and 1 are treated as OK exit codes and the server will not be restarted. I've attached a patch which changes the above, fixes the pid problem in a different way because I completely forgot you already posted below about that small glitch, use Ns_ParseObjv(), adds the -w switch, and a couple of small name changes. I don't want to go overboard with command line switches, but an option to specify not to give up trying to restart the server would be nice. If you have that, you also need to turn off the restart timeout doubling at a certain point. I don't know what the cleanest way to do that is. ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-08 21:09 Message: Logged In: YES user_id=95086 This is correct. As soon as we agree on some implementation, I will put the -i processing and avoid starting watchdog for inittab starts. ---------------------------------------------------------------------- Comment By: Vlad Seryakov (seryakov) Date: 2005-03-08 20:48 Message: Logged In: YES user_id=184124 There still should be possibility to run nsd from inittab, so when -i switch is given, no watchdog should be running, let /sbin/init to handle restarts ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-08 17:19 Message: Logged In: YES user_id=95086 small glich: After applying the patch, change nsmain.c from: nsconf.pid = serverPid: to nsconf.pid = getpid(); otherwise the pid file will contain the bogus server pid if the watchdog restarted it later. ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-08 16:26 Message: Logged In: YES user_id=95086 Another try... Important to note: SIGKILL (signal 9) cannot be handled hence if somebody kills the watchdog with SIGKILL, the server will be left lingering w/o the watchdog. This is important to know. I do not see any possibility how to recover in such cases (i.e. how to stop the server). Apart from that, all objections from Stephen are taken into account. Please try again. A new copy of watchdog.patch file is attached. ---------------------------------------------------------------------- Comment By: Vlad Seryakov (seryakov) Date: 2005-03-08 06:20 Message: Logged In: YES user_id=184124 I just found this code in my old server sources, i just chnaged internal name to Ns/NS, used to run pretty stable. ------------------------------------------------------------------- #define NS_EXIT 99 static void NsTerminate(int sig) { printf("NS[%d] signal %d received...",getpid(),sig); // Kill the server with the same signal if(nsPid > 0) kill(nsPid,sig); // Exit in case of fatal signal if(sig != SIGHUP) { printf("NS[%d] terminating...",getpid()); exit(0); } // Reassign signal handler signal(sig,NsTerminate); } void NsWatchdog(int argc, char *argv[], char *envp[]) { int failcount = 0; time_t start; int status; pid_t pid; signal(SIGTTOU,SIG_IGN); signal(SIGTTIN,SIG_IGN); signal(SIGTSTP,SIG_IGN); signal(SIGPIPE,SIG_IGN); signal(SIGQUIT,SIG_IGN); signal(SIGHUP,NsTerminate); signal(SIGINT,NsTerminate); signal(SIGTERM,NsTerminate); // Go background if((pid = fork())) { if(pid < 0) err_logger("warpConfigure: fork: %s",strerror(errno)); exit(0); } setsid(); for(;;) { // Execute the real server nsPid = fork(); // Child, continue as server if(nsPid == 0) { exit(nsMain(argc, argv, ServerInit)); } /* parent, behaves like a guardian */ time(&start); printf("NS[%d] server process started",getpid()); pid = waitpid(-1, &status, 0); if(WIFEXITED(status)) printf("NS[%d] child process exited with status %d",pid,WEXITSTATUS(status)); else if(WIFSIGNALED(status)) printf("NS[%d] child process exited due to signal %d",pid,WTERMSIG(status)); else printf("NS[%d] child process exited", pid); // Special exit code if(WIFEXITED(status)) { if(WEXITSTATUS(status) == NS_EXIT) { printf("NS[%d] child configuration error, exiting",getpid()); exit(0); } else if(WEXITSTATUS(status) == SIGHUP) { } else { if(time(0) - start < 10) failcount++; if(failcount == 10) { printf("Exiting due to repeated, frequent failures"); exit(1); } } } sleep(3); } } ---------------------------------------------------------------------- Comment By: Stephen Deasey (sdeasey) Date: 2005-03-08 05:23 Message: Logged In: YES user_id=87254 NsWatchdog() is called after the server drops root privs, so both the watchdog and the server run as the defined user. What happens if the server dies and on restart needs to rebind privileged ports? A ps listing shows two running processes, parent and child. If I kill either one, the watchdog dies, the server continues to run. If I kill -9 the parent, the child continues to run. If I kill -9 the child, the server is restarted. Something seems not quite right here... I'm a bit confused about how the code works. For example, NsWatchdog() seems to ignore all of it's arguments? Here's the code which calls it: if (mode == 0) { i = ns_fork(); if (i < 0) { Ns_Fatal("nsmain: fork() failed: '%s'", strerror(errno)); } if (i > 0) { return 0; } setsid(); i = NsWatchdog(argc, argv, initProc); if (i < 0) { Ns_Fatal("nsmain: watchdog failed: '%s'", strerror(errno)); } if (i > 0) { return 0; } nsconf.pid = getpid(); } NsWatchdog() says it returns the worker pid, it also sets the global variable wpid. The code above ignores the returned value, and the global variable, and instead calls getpid()... The variable 'i' is existing code, but still somehow doesn't seem suitable... Could that Ns_Fatal() above be moved into the NsWatchdog() function? I think maybe some comment is needed here. The code structure is very like that just above where ns_fork() is called, but this function will return *multiple* times, right? This is kind of suprising, and an extra twist on the already confusing fork() semantics (or maybe it's just me who gets confused by fork...). A return value of 0 here is a 'request for orderly shutdown', right? How about some more logging to syslog? For example, distinguish between start and restart. Mention when the MAX_SLEEP_PERIOD has been reached, etc. Couple of small things: Can we refer to 'the server', rather than 'the worker'? Worker and Watchdog begin with 'w', and so does the global variable wpid... Maybe serverPid? NsWatchdog is a static function, it doesn't need the Ns prefix. In NsWatchdog(), the variable 'run' should be something like 'nretries', 'nap' should be something more like 'retrySeconds' and MAX_SLEEP_PERIOD should be MAX_RETRY_SECONDS. The comment for WaitForWorker() is misslabeled. Should SigHandler() be called WatchdogSigHandler()? ---------------------------------------------------------------------- Comment By: Vlad Seryakov (seryakov) Date: 2005-03-07 00:03 Message: Logged In: YES user_id=184124 Looks good to me ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-05 13:06 Message: Logged In: YES user_id=95086 Ah, correction: The restart option sends SIGINT to the worker process which causes the watchdog to restart it. And, the patchfile is now attached! ---------------------------------------------------------------------- Comment By: Zoran Vasiljevic (vasiljevic) Date: 2005-03-05 13:03 Message: Logged In: YES user_id=95086 Here is the patch. I have added "-restart" option to "ns_shutdown". It is rather clumsy to parse but should do. We should rewrite this with your args parsing routine. The restart option sends SIGTERM to the worker process which causes the watchdog to restart it. ---------------------------------------------------------------------- Comment By: Stephen Deasey (sdeasey) Date: 2005-03-05 00:46 Message: Logged In: YES user_id=87254 I don't think this has to hide behind a config option. It's either a good idea or it's not. Sounds good to me. Is there a patch? I'm wondering about some of the implementation details... ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=719009&aid=1156875&group_id=130646 |