From: Oleg N. <ol...@re...> - 2010-01-20 20:36:03
|
Hi. We found some problems with strace's cleanup() logic during the testing, see below. Please bear in mind I am not familiar with strace's code ;) ------------------------------------------------------------------------------ strace.c:trace() 2470 if (tcp->flags & TCB_BPTSET) { 2471 /* 2472 * One example is a breakpoint inherited from 2473 * parent through fork (). 2474 */ 2475 if (clearbpt(tcp) < 0) /* Pretty fatal */ { 2476 droptcb(tcp); 2477 cleanup(); 2478 return -1; 2479 } 2480 } Afaics, it is not right to do cleanup() if clearbpt() fails. Yes, the "Pretty fatal" comment is correct, but we can have other processes to continue tracing. IOW, suppose "strace -c" traces the process P which fork's child C. Now droptcb(C) can fail just because C is SIGKILLED and therefore PTRACE_GETREGS/PTRACE_SETREGS fails. strace shouldn't "panic" in this case imho, we should ignore ESRCH like ptrace_restart() does. Actually, I don't really understand what cleanup() buys us if it is called when something goes wrong. Can't strace just exit and rely on implicit detach in kernel? Except cleanup() should probably send CONT/TERM before exiting. Yes, we can leave the tracee in the "bad" state, but I do not see how cleanup() can help. For example, it does not try to clearbpt() if TCB_BPTSET. In any case, cleanup() itself have problems which can lead to the hang, please see below. ------------------------------------------------------------------------------ detach() assumes that if TCB_STARTUP is set, the tracee must have the pending SIGSTOP queued by attach and thus my_tgkill(SIGSTOP) is not needed. This is mostly correct, but if cleanup() is the caller we can't trust TCB_STARTUP. cleanup: if (tcp->flags & TCB_ATTACHED) detach(tcp, 0); else { kill(tcp->pid, SIGCONT); kill(tcp->pid, SIGTERM); } If cleanup() finds a TCB_ATTACHED sub-thread before TCB_STARTUP sub-thread X it sends SIGCONT to the thread group. This steals the pending SIGSTOP and wakes up the TCB_STARTUP tracee X. After that detach()->wait4(X) can hang "forever" until X exits or stops. For example, X can sleep in sys_pause() and strace hangs in wait4(X). Probably TCB_ATTACHED should be per-process, not per thread. Or perhaps cleanup() should be changed to detach() sub-threads first, and only then the main thread which has TCB_ATTACHED. ------------------------------------------------------------------------------ There is another reason why I think cleanup() should detach sub threads first. The main thread can be zombie, in this case my_tgkill(SIGSTOP) can't help and wait4() can't return unless we detach and reap all sub-threads. ------------------------------------------------------------------------------ I can't understand why detach()->wait4() can't hang if TCB_STARTUP is set and the previous ptrace(PTRACE_DETACH) succeeded. ------------------------------------------------------------------------------ Finally, the exact scenario which we had during the testing. To simplify, the tracee does: fork(); // both parent and child run he code below for (0 .. NUM_THREADS) pthread_create(); raise(SIGFPE); it runs under "strace -f". We have 2 thread groups, G1 and G2. G2's main thread does raise(SIGFPE), this kills (the kernel sends SIGKILL to all threads) a thread T in G2 and clearbpt(T) fails. strace calls cleanup(). cleanup() tries to detach() all traced theads, from G1 and G2. It detaches some of them, then it sees G1's main thread with TCB_ATTACHED set and sends SIGCONT to G1. This wakes up a TCB_STARTUP thread X from G1, X resumes and sleeps in do_futex(). strace does detach(X) and hangs in wait4(X) (remember, detach() trusts TCB_STARTUP and doesn't send SIGSTOP). Now, wait4(X) hangs forever because X waits for FUTEX_WAKE from another thread in this thread group, and that thread sleep in TASK_TRACED. Oleg. |