From: Ramesh B. <ram...@or...> - 2013-07-30 10:26:57
Nagu:

Hans N might be pointing to the chance of file-operation calls hanging (esp. when some inconsistency happens with NFS). Just a guess, let Hans N confirm it.

Thanks,
Ramesh.

On 7/30/2013 3:27 PM, Nagendra Kumar wrote:
> Hi,
>
>>> regarding what can "hang" in the child part, e.g close of file descriptors.
> When can this happen? After fork is successful, this shouldn't happen. Can you please provide an example?
>
> Thanks
> -Nagu
>
> -----Original Message-----
> From: Hans Nordebäck [mailto:han...@er...]
> Sent: 30 July 2013 12:53
> To: Nagendra Kumar
> Cc: Hans Nordebäck; ope...@li...; Praveen Malviya; Ramesh Babu Betham; Hans Feldt
> Subject: Re: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child process takes too long time before exec (#514)
>
> Hi Nagu, regarding what can "hang" in the child part, e.g close of file descriptors.
> /BR HansN
>
> On 07/30/13 09:01, Hans Nordebäck wrote:
>> Hi Nagu,
>>
>> On 07/30/13 08:54, Nagendra Kumar wrote:
>>> Hi Hans N,
>>>
>>>>> 1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new
>>>>> environment variable where value is used as input to alarm, if not
>>>>> set it is default 2 seconds.
>>> Do we have some place holder for this variable for configuration and
>>> are we going to add it in README for information.
>> Perhaps the name isn't the best, but it should be handled as the other
>> env variable I guess, e.g. "AVND_PM_MONITORING_RATE", etc.
>>>>> if the child "hangs" before exec this extra coredump should give
>>>>> information where/what is wrong.
>>> This means that fork hangs, am I right ? If yes, then dump is not
>>> going to provide any information as it is a system call, it can only
>>> show, it hangs in fork.
>> I don't think fork hangs as the parent part continues and later, with
>> the help of ncs_exec_mod_hdlr, the parent detects that the child or
>> "exec" has timed out,
>> 10 sec in this case. But in this case the exec has not been performed.
>>>>> After exec, it will work as usual
>>> This confirms that we are only targeting fork to debug.
>> Yes, the extra core dump will help troubleshooting.
>> /BR HansN
>>> Thanks
>>> -Nagu
>>>
>>> -----Original Message-----
>>> From: Hans Nordebäck [mailto:han...@er...]
>>> Sent: 30 July 2013 11:57
>>> To: Nagendra Kumar
>>> Cc: ope...@li...; Praveen Malviya; Ramesh Babu
>>> Betham; Hans Feldt; Hans Nordebäck
>>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child
>>> process takes too long time before exec (#514)
>>>
>>> Hi Nagu,
>>>
>>> 1. OPENSAF_CHILD_EXEC_TIME_TOLERANCE is the name of a new environment
>>> variable where value is used as input to alarm, if not set it is
>>> default 2 seconds.
>>> 2. Yes you are right, in this particular case it is set to 10 sec,
>>> that's why the env. variable above can be set.
>>> 3. This alarm is just an additional precaution, at no extra cost, to
>>> check the child part before the exec. After exec
>>> it will work as usual but if the child "hangs" before exec
>>> this extra coredump should give information where/what is wrong.
>>>
>>> /BR HansN
>>>
>>> -----Original Message-----
>>> From: Nagendra Kumar [mailto:nag...@or...]
>>> Sent: 30 July 2013 07:11
>>> To: Hans Nordebäck; Praveen Malviya; Hans Feldt; Ramesh Babu Betham
>>> Cc: ope...@li...
>>> Subject: RE: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child
>>> process takes too long time before exec (#514)
>>>
>>> Hi Hans N,
>>> For my understanding, can you please provide the below
>>> information:
>>>
>>> 1. I can't find OPENSAF_CHILD_EXEC_TIME_TOLERANCE in opensaf
>>> source code.
>>> 2. I hope the child process is hung for more than
>>> saAmfCtDefClcCliTimeout resulting in CLC time out. Am I right?
>>> 3. Even we add assert in child process and we get core dump, but
>>> it may not give any information as it got delayed because of
>>> system issue. Are we targeting, which system call the child
>>> process is hung?
>>>
>>> Thanks
>>> -Nagu
>>>
>>> -----Original Message-----
>>> From: Hans Nordeback [mailto:han...@er...]
>>> Sent: 22 July 2013 17:07
>>> To: Nagendra Kumar; Praveen Malviya; han...@er...; Ramesh
>>> Babu Betham
>>> Cc: ope...@li...
>>> Subject: [PATCH 1 of 1] leap: ncs_os_process_execute_timed child
>>> process takes too long time before exec (#514)
>>>
>>> osaf/libs/core/leap/os_defs.c | 27 +++++++++++++++++++++++++++
>>> 1 files changed, 27 insertions(+), 0 deletions(-)
>>>
>>>
>>> amfnd calls ncs_os_process_execute_timed and the child process takes
>>> too long time before exec, (10 sec timeout). An alarm is set in the
>>> ncs_os_process_execute_timed child process. If timed out a core dump
>>> will be produced to be able to trouble shoot.
>>>
>>> diff --git a/osaf/libs/core/leap/os_defs.c b/osaf/libs/core/leap/os_defs.c
>>> --- a/osaf/libs/core/leap/os_defs.c
>>> +++ b/osaf/libs/core/leap/os_defs.c
>>> @@ -65,6 +65,15 @@ bool gl_ncs_atomic_mtx_initialise = fals
>>>   * description of SOCK_CLOEXEC. */
>>>  static pthread_mutex_t s_cloexec_mutex = PTHREAD_MUTEX_INITIALIZER;
>>> +/*
>>> + * ALRM signal is used to detect if child process takes too long time before exec.
>>> + *
>>> + * @param sig
>>> + */
>>> +static void sigalrm_handler(int sig) {
>>> +        abort();
>>> +}
>>>  /***************************************************************************
>>>   *
>>>   * uns64
>>> @@ -999,6 +1008,22 @@ uint32_t ncs_os_process_execute_timed(NC
>>>          osaf_mutex_lock_ordie(&s_cloexec_mutex);
>>>          if ((pid = fork()) == 0) {
>>> +                unsigned int alarm_time_sec;
>>> +                char* alarm_time;
>>> +
>>> +                if (signal(SIGALRM, sigalrm_handler) == SIG_ERR) {
>>> +                        LOG_ER("signal ALRM failed: %s", strerror(errno));
>>> +                }
>>> +                if ((alarm_time = getenv("OPENSAF_CHILD_EXEC_TIME_TOLERANCE")) != NULL) {
>>> +                        alarm_time_sec = strtol(alarm_time, NULL, 0);
>>> +                }
>>> +                else {
>>> +                        // default alarm timeout 2 seconds
>>> +                        alarm_time_sec = 2;
>>> +                }
>>> +
>>> +                alarm(alarm_time_sec);
>>> +
>>>                  /*
>>>                  ** Make sure forked processes have default scheduling class
>>>                  ** independent of the callers scheduling class.
>>> @@ -1054,6 +1079,8 @@ uint32_t ncs_os_process_execute_timed(NC
>>>                  }
>>>  #endif
>>> +                alarm(0);
>>> +
>>>                  /* child part */
>>>                  if (execvp(req->i_script, req->i_argv) == -1) {
>>>                          syslog(LOG_ERR, "%s: execvp '%s' failed - %s", __FUNCTION__, req->i_script, strerror(errno));
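
For reference, the mechanism discussed in this thread (arm an alarm in the forked child, cancel it just before exec, abort if it fires) can be sketched in isolation. This is only a minimal illustration under stated assumptions, not the OpenSAF code: parse_tolerance_sec, run_guarded, on_sigalrm and stall_forever are hypothetical names, and stall_forever merely stands in for a pre-exec step that never returns (such as a close() stuck on NFS). Unlike the patch, the sketch parses base 10 only and falls back to the 2-second default for invalid or non-positive values, since alarm(0) would cancel the guard instead of arming it.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEFAULT_TOLERANCE_SEC 2u /* default taken from the patch description */

/* Hypothetical helper: turn the environment value into seconds.
 * Missing, malformed, or non-positive input yields the default. */
static unsigned int parse_tolerance_sec(const char *text)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0')
        return DEFAULT_TOLERANCE_SEC;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0 || value > INT_MAX)
        return DEFAULT_TOLERANCE_SEC;

    return (unsigned int)value;
}

/* abort() is async-signal-safe; it raises SIGABRT so that a core dump
 * can be written where the system allows one. */
static void on_sigalrm(int sig)
{
    (void)sig;
    abort();
}

/* Hypothetical stand-in for a pre-exec step that hangs. */
static void stall_forever(void)
{
    for (;;)
        pause();
}

/* Fork; in the child, arm the alarm, run the optional pre-exec work,
 * cancel the alarm, then exec argv[0]. The parent returns the child's
 * wait status, or -1 if fork or waitpid fails. */
static int run_guarded(unsigned int tolerance_sec,
                       void (*pre_exec_work)(void),
                       char *const argv[])
{
    struct sigaction sa;
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigalrm;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGALRM, &sa, NULL) == 0)
            alarm(tolerance_sec);

        if (pre_exec_work != NULL)
            pre_exec_work();

        alarm(0); /* pre-exec part finished in time: cancel the guard */
        execvp(argv[0], argv);
        _exit(127); /* reached only if execvp failed */
    }

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return status;
}
```

A caller would typically pass parse_tolerance_sec(getenv("OPENSAF_CHILD_EXEC_TIME_TOLERANCE")) as the first argument. If the pre-exec work stalls, the parent sees a wait status with WIFSIGNALED set and WTERMSIG equal to SIGABRT, which distinguishes "the child never reached exec" from a timeout of the executed script itself.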