[Dpcl-develop] DPCL and SIGCHLD/SIGTRAP
From: Steve C. <sl...@sg...> - 2005-05-02 15:34:32
Greetings, everyone. I need to bounce a DPCL concern off the experts
on this board. The recent Dyninst 4.2.1 release has exposed a stability
issue with the Hybrid version of DPCL. Code that used to be seemingly
innocuous with Dyninst 4.1.1 has proven to be incompatible with the
newer, improved process and/or thread control that Dyninst 4.2.1 provides.
The problematic code is in 'main.C' of the CommDaemon (dpcld) and it
involves the unblocking of SIGCHLD and SIGTRAP signals. To be sure, this
is just an issue with the University Dyninst (_DYNINST) implementation,
but it is potentially destabilizing for the Open|SpeedShop project at
SGI (Silicon Graphics).
Following are some comments from Matt Legendre and Drew Bernat from
the Dyninst Group when informed that the Hybrid version of DPCL registers
for and unblocks SIGCHLD and SIGTRAP signals headed for the mutatee. In
the case of SIGCHLD, DPCL makes process control dicey by doing a 'waitpid'
in its 'sigchild_handler' as well as a 'waitpid' in its shared memory
manager code. By accepting SIGTRAP signals for the mutatee, DPCL also is
asking for trouble by interfering with Dyninst's trap-based instrumentation.
Frankly, I think this code is something left over from the original Hybrid
'creation' effort and DPCL (University Dyninst version only) should have
been changed to forget about SIGCHLD and SIGTRAP some time ago. But my
confidence is real shaky when I say this. Thus I am posting here for some
reinforcement(s).
Comments and/or reactions welcome.
Steve Collins, SGI Compilers/Tools
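[Editor's sketch] One possible reading of "forget about SIGCHLD and SIGTRAP" is that the daemon simply installs no handlers for those two signals. The C fragment below is not DPCL code; the function names leave_child_signals_alone and has_default_disposition are hypothetical, and it only uses the standard POSIX sigaction() call to restore and inspect the default dispositions.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not taken from DPCL): restore the default
 * disposition of SIGCHLD and SIGTRAP so that this process installs
 * no handler for either signal.
 * Returns 0 on success, -1 if sigaction() fails. */
int leave_child_signals_alone(void)
{
    static const int signals[] = { SIGCHLD, SIGTRAP };
    struct sigaction sa;
    size_t i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_DFL;
    sigemptyset(&sa.sa_mask);

    for (i = 0; i < sizeof signals / sizeof signals[0]; i++) {
        if (sigaction(signals[i], &sa, NULL) != 0)
            return -1;
    }
    return 0;
}

/* Hypothetical check: returns 1 if 'sig' currently has the default
 * disposition, 0 if not, -1 if the query itself fails. */
int has_default_disposition(int sig)
{
    struct sigaction current;

    if (sigaction(sig, NULL, &current) != 0)
        return -1;
    return current.sa_handler == SIG_DFL;
}
```

Whether this alone would be enough for the Hybrid daemon is an open question; the sketch only shows the mechanics of dropping a handler.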
Drew Bernat writes:
> Dyninst uses signals for:
>
> 1) Trap-based instrumentation if we can't fit a jump in.
> 2) Discovering the completion of an inferior RPC (forcing code to run in
> the mutatee)
> 3) Discovering loads of new shared libraries (a trap in dlopen)
> 4) On Linux, discovery of when several system calls are executed.
> 5) Keeping track of process state (paused/running)
> and 6) discovering when a mutatee exits.
>
>
> The big one is trap-based instrumentation, followed closely by tracking
> process state. I'll let Matt fill in more details.
Matt Legendre writes:
> I think Drew caught all of the big places where Dyninst makes use of
> waitpid; it's a fundamental part of the ptrace debugging interface on
> Linux, and we call it frequently: any time BPatch::waitForStatusChange or
> BPatch::pollForStatusChange is called, any time instrumentation is
> inserted, any time the process is stopped, any time we try to read from
> or write to its address space, and any time a fork/exec happens.
> If DPCL is also calling waitpid frequently, then we stand a good chance of
> having DPCL get one of our signal events, or we get one of DPCL's.
> If we get an event generated by DPCL that we don't recognize, it's likely
> to be silently dropped (SIGTRAP), forwarded back to the process
> (SIGPROF), or handled as if we caused it (SIGCHLD).
> If DPCL picks up one of our events, a lot of things could happen:
> * We'll fail to execute instrumentation and incorrectly execute part of
> the program (if we miss a SIGTRAP from trap-based instrumentation).
> Fortunately, the use of trap-based instrumentation is rare.
> * We won't know about a new shared library that's been loaded (a SIGTRAP
> generated from dlopen). We won't generate parse data for this library
> or be able to instrument it, but the app will continue to run fine. We may
> not have seen that yet because not too many applications use dlopen.
> * We'll miss certain system calls that are being executed. I don't think
> this is a frequently used feature of Dyninst. We'll miss things like exec
> system calls, which the test applications might not be doing.
> * We'll miss the mutatee when it exits, which is what we're seeing now.
> Dyninst (as it currently stands) isn't going to change the process status
> to 'exited' until it sees this event.
> * If DPCL is also calling ptrace(PTRACE_CONT, ...) when we expect the
> process to be stopped, that's going to cause us to lose track of the
> process state and will probably cause certain operations (like inserting
> instrumentation) to start failing until the two sync back up.
> Now most of these aren't critical-fail-on-every-run errors, which is
> probably why we didn't see them before, but they're still unacceptable
> from a stability standpoint. Unfortunately, I don't have a good
> suggestion for fixing this. Working around this from a pure Dyninst
> standpoint would be incredibly difficult.
>
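[Editor's sketch] The competing waitpid calls described above can be reproduced in a few lines of C. This is an illustration, not DPCL or Dyninst code, and the function name second_waiter_loses_event is hypothetical. It assumes SIGCHLD has its default disposition. A child's exit status can be collected only once, so whichever caller reaches waitpid first consumes the event and the other gets ECHILD.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: fork a child that exits at once, then let two
 * independent "consumers" call waitpid() on it in turn.
 * Returns 1 if the first call reaped the child and the second failed
 * with ECHILD, 0 otherwise, and -1 if fork() itself failed. */
int second_waiter_loses_event(void)
{
    int status = 0;
    pid_t first, second;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(7);                /* child: exit immediately */

    /* First consumer (think: a SIGCHLD handler) collects the exit event. */
    first = waitpid(child, &status, 0);

    /* Second consumer (think: the library that owns process control)
     * finds that the event is already gone. */
    errno = 0;
    second = waitpid(child, &status, 0);

    return first == child && second == -1 && errno == ECHILD;
}
```

In a real tool the two calls would sit in different components and race, so which side loses the event would vary from run to run.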
Steve Collins writes:
>The basic DPCL signal handler for SIGCHLD does a 'waitpid'. But it has
>always done that and things worked, at least with Dyninst 4.1.1; maybe
>4.2.1 has rendered the DPCL SIGCHLD handler 'bad'.
Drew Bernat writes:
> Dyninst is designed with the assumption that it is the only thing
> consuming signals from the child;
> as a result, when a child dies we _will_ get a SIGCHLD from it so that
> we can clean up. I'm not surprised that some things may have worked in
> the past, but it's an error case that we explicitly don't test here. As
> an example, the internal call to terminateProc() was hanging with 4.1.1
> because we didn't get informed of the child dying; now it's pause(), but
> the root cause is the same.
> We can patch pause() to operate correctly, but there will still be
> problems when we don't catch a process dying.
> I'm bouncing this to Matt, the signals expert. It looks like DPCL is
> calling waitpid, which is just plain bad news. And it's forwarding
> signals, which is worse.
> The biggest problem is going to be Dyninst process control. However,
> _particularly_ on Linux (cruddy debugging interface), we really need as
> much information as we can get. That includes waitpid(), unfortunately.
> It makes sense as a monolithic structure, but not if it's handing
> process control to Dyninst. Problem is, Dyninst gives you process
> control along with instrumentation.
> As a note, this was likely to break the other way in Dyninst 5.0. One
> of the upcoming features is an internally multithreaded Dyninst library
> so that the user doesn't have to call "pollForStatusEvents" all the
> time; this means that Dyninst would probably catch the DPCL signals
> rather than the reverse.
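[Editor's sketch] The "only thing consuming signals from the child" assumption amounts to having a single place that calls waitpid and turns the raw status into an event for everyone else. The C fragment below is a minimal illustration of that idea, not Dyninst's implementation; the enum child_event and the function poll_child are hypothetical names.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical event kinds reported by the single polling routine. */
enum child_event { EV_NONE, EV_EXITED, EV_KILLED, EV_STOPPED, EV_ERROR };

/* Hypothetical sole consumer of wait events for 'pid'. It polls without
 * blocking and classifies whatever status it collects. On EV_EXITED,
 * *detail is the exit code; on EV_KILLED or EV_STOPPED, it is the
 * signal number. */
enum child_event poll_child(pid_t pid, int *detail)
{
    int status = 0;
    pid_t r = waitpid(pid, &status, WNOHANG | WUNTRACED);

    if (r == 0)
        return EV_NONE;          /* nothing has changed yet */
    if (r < 0)
        return EV_ERROR;         /* e.g. ECHILD: someone else reaped it */

    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return EV_EXITED;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);
        return EV_KILLED;
    }
    if (WIFSTOPPED(status)) {
        *detail = WSTOPSIG(status);
        return EV_STOPPED;
    }
    return EV_ERROR;
}
```

If a second component also called waitpid on the same child, this routine could see EV_ERROR in place of the exit event, which matches the missed-exit symptom discussed in this thread.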
Steve Collins wrote:
> Drew -
> Oh yeah, I think we can safely assume it is just the SIGCHLD signal.
> They use the handler to 'ptrace(PT_CONTINUE)' the mutatee.
>