If I start a process in async mode and the system clock then rolls backwards, STAFProc never calls waitpid() on its child process. The process remains a zombie forever, and the process query command never reports that the process has finished. The ProcessMonitorThread stays permanently blocked in STAFEventSemWait(). Tested on staf-3.4.14/FreeBSD 10, staf-3.4.18/FreeBSD 10, and staf-3.4.12/Ubuntu 13.10.
Example with STAF-3.4.18 on FreeBSD 10:
[root@tom ~]# date 1748
Mon Dec 22 17:48:00 UTC 2014
[root@tom ~]# staf local process start command date parms 0000 async
Response
--------
31
[root@tom ~]# staf local process query handle 31
Response
--------
Handle : 31
Handle Name : <None>
Title : <None>
Workload : <None>
Shell : <None>
Command : date
Parms : 0000
Workdir : <None>
Focus : Background
User Name : <None>
Key : <None>
PID : 7395
Start Mode : Async
Start Date-Time: 20141222-17:48:05
End Date-Time : <None>
Return Code : <None>
[root@tom ~]# ps -axp 7395
PID TT STAT TIME COMMAND
7395 - Z 0:00.00 <defunct>
Notice that there is no Return Code and no End Date-Time in the process query output, and there never will be. Here is a stack trace of the thread that I believe to be responsible:
(gdb) bt
#0 _umtx_op_err ()
at /scratch/jenkins/workspace/bluestorm-main/SpectraBSD/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:37
#1 0x00000008016d8e8e in _thr_umtx_timedwait_uint (mtx=0x800947170,
id=<value optimized out>, clockid=<value optimized out>,
abstime=<value optimized out>, shared=0)
at /scratch/jenkins/workspace/bluestorm-main/SpectraBSD/lib/libthr/thread/thr_umtx.c:212
#2 0x00000008016e0d2e in cond_wait_common (cond=<value optimized out>,
mutex=<value optimized out>, abstime=0x7fffde1ee4a8, cancel=1)
at /scratch/jenkins/workspace/bluestorm-main/SpectraBSD/lib/libthr/thread/thr_cond.c:255
#3 0x0000000800b6be0d in STAFEventSemWait ()
from /usr/local/staf/lib/libSTAF.so
#4 0x0000000800b6c249 in STAFEventSem::wait ()
from /usr/local/staf/lib/libSTAF.so
#5 0x0000000800c08c97 in ProcessMonitorThread ()
from /usr/local/staf/lib/libSTAF.so
#6 0x0000000800b9ced9 in STAFThreadManager::workerThread ()
from /usr/local/staf/lib/libSTAF.so
#7 0x0000000800b9ce0d in STAFThreadManager::callWorkerThread ()
from /usr/local/staf/lib/libSTAF.so
#8 0x0000000800b6ee99 in RealSTAFThread () from /usr/local/staf/lib/libSTAF.so
#9 0x00000008016d74a5 in thread_start (curthread=0x80482b400)
at /scratch/jenkins/workspace/bluestorm-main/SpectraBSD/lib/libthr/thread/thr_create.c:284
#10 0x0000000000000000 in ?? ()
I'm not sure of the exact sequence of events that leads to the lockup, but I have a patch that fixes the bug. My patch converts STAFEventSemWait to use a monotonic clock instead of a realtime clock. Monotonic clocks should generally be used when comparing two time values taken during the lifetime of the same process, because a monotonic clock is not affected when the system date/time is changed.
Note that STAFMutexSemRequest appears to suffer from the same bug, but I haven't tested it yet.
Thanks for providing this patch. I'll take a look at it and test it on all the Unix operating systems that STAF supports (and look at making similar changes to stafif/unix/STAFMutexSem.cpp).
It looks like CLOCK_MONOTONIC is not supported on HP-UX or Mac OS X (as discussed in "C/C++ tip: How to measure elapsed real time for benchmarking" at http://nadeausoftware.com/articles/2012/04/c_c_tip_how_measure_elapsed_real_time_benchmarking). But I think it would be good to use a monotonic clock on the Unix operating systems that support it and to fall back to the real-time clock on those that don't.
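As a rough sketch of that fallback idea (not the actual STAF change), a helper can prefer CLOCK_MONOTONIC where the platform headers define it and otherwise read the real-time clock through gettimeofday(). The function name wait_clock_now is hypothetical, and testing for the CLOCK_MONOTONIC macro is an assumed detection method; a real build might select the clock per operating system instead.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: read the current time from the preferred clock.
 * Returns 1 if the value came from the monotonic clock,
 * 0 if it came from the real-time (wall) clock. */
int wait_clock_now(struct timespec *now)
{
    assert(now != NULL);

#if defined(CLOCK_MONOTONIC)
    /* Preferred: not affected by changes to the system date/time. */
    if (clock_gettime(CLOCK_MONOTONIC, now) == 0)
        return 1;
#endif

    /* Fallback for platforms without a monotonic clock. */
    struct timeval tv;
    gettimeofday(&tv, NULL);
    now->tv_sec  = tv.tv_sec;
    now->tv_nsec = tv.tv_usec * 1000L;
    return 0;
}
```

Whichever clock is chosen, any timed wait whose deadline is built from this value has to be measured against the same clock. On the fallback path the wait is still exposed to backward clock changes.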
Attached a cvs diff of the changes (monotonic_clock.diff). Continued to use a real-time clock on HP-UX, Mac OS X, and z/OS as these operating systems don't support monotonic clocks. Had to add the rt library to OS_LIBS in the STAF build makefiles for all Linux and Solaris operating systems in order to be able to use clock_gettime().
Made similar changes to stafif/unix/STAFMutexSem.cpp (in addition to stafif/unix/STAFEventSem.cpp). Also, updated section "2.3.5. How do I change the system date/time to a prior date/time via a PROCESS START request?" in the STAF FAQ.
These changes have been checked into cvs if you'd like to test them. This fix will be in the next release of STAF (V3.4.21) planned for the end of March 2015.