Re: [Dpcl-develop] DPCL and SIGCHLD/SIGTRAP
From: Dave W. <dwo...@us...> - 2005-05-02 22:29:41
Steve
How often are you encountering this in running DPCL with Dyninst 4.2.1?
We spent a little time on this today, but it needs more investigation
before we understand the scope of the problem and the possible solutions.
We did look at the call to waitpid in the shared memory code. The code in
this area is trying to handle the case where dpcld is attempting to obtain
a shared memory lock which the mutatee is holding. Normally obtaining the
lock should not be a problem, and the waitpid call should not be issued.
Where DPCL gets into trouble is when the mutatee raises any signal. Since
the mutatee is under ptrace control with DPCL being the controlling
program, the mutatee gets suspended. If the signal is raised and the
mutatee suspended while the shared memory lock is held, then DPCL will not
be able to obtain that lock. The shared memory code retries the request to
get the lock 100 times. If it cannot obtain the lock after 100 retries, it
assumes that the mutatee is stopped for some reason. So, the shared memory
code issues a waitpid for the pid of the mutatee holding the lock and
evaluates the status returned by the waitpid call. If that status
indicates that the process is stopped, then the shared memory code sends
the mutatee process a SIGCONT in order to resume the process, since
presumably, the mutatee would continue processing, clear the shared memory
lock, and allow the dpcld shared memory code to get the lock.
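
The retry-then-resume logic described above could be sketched roughly as follows. This is not the DPCL source: `shm_lock_t`, `try_lock`, `acquire_with_resume`, the 1 ms pause between attempts, and the `WUNTRACED | WNOHANG` flags are all assumptions made for illustration. Only the 100-retry limit, the waitpid on the lock holder, and the SIGCONT come from the description.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MAX_RETRIES 100 /* retry count stated in the description above */

/* Hypothetical lock word living in shared memory: 0 = free, 1 = held. */
typedef struct { atomic_int held; } shm_lock_t;

/* One non-blocking attempt; returns 1 if the lock was obtained. */
int try_lock(shm_lock_t *l) {
    int expected = 0;
    return atomic_compare_exchange_strong(&l->held, &expected, 1);
}

/*
 * Returns  1 if the lock was obtained,
 *          0 if retries ran out, the holder was found stopped, and
 *            SIGCONT was sent to it,
 *         -1 if retries ran out and the holder was not reported stopped.
 */
int acquire_with_resume(shm_lock_t *l, pid_t holder) {
    struct timespec pause = {0, 1000000L}; /* 1 ms, an assumed interval */
    for (int i = 0; i < MAX_RETRIES; i++) {
        if (try_lock(l))
            return 1;
        nanosleep(&pause, NULL);
    }
    /* This waitpid is the problematic step: it consumes the holder's
     * stop status, whoever else was waiting for it. */
    int status = 0;
    pid_t r = waitpid(holder, &status, WUNTRACED | WNOHANG);
    if (r == holder && WIFSTOPPED(status)) {
        kill(holder, SIGCONT);
        return 0;
    }
    return -1;
}
```

A caller would presumably retry the acquisition after a return of 0, once the resumed mutatee has had a chance to release the lock.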
If the mutatee got suspended because of, for instance, execution of a trap
instruction placed by Dyninst and the resulting SIGTRAP signal, then
when the shared memory code issues the waitpid, it will incorrectly collect
and handle the status for that signal, and Dyninst will of course never get
the SIGTRAP notification.
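
The reason the notification is lost is that a stop status is handed to only one waitpid caller. A minimal sketch of that behavior follows. It uses a plain SIGSTOP and `WUNTRACED` as a stand-in for a ptrace trap stop, so it runs without tracing privileges; the function name `second_waiter_misses_stop` is hypothetical, and the two waitpid calls merely play the roles of the shared memory code and Dyninst.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the second waiter found nothing to collect after the
 * first waiter consumed the stop, 0 if it still saw an event, and
 * -1 on error.
 */
int second_waiter_misses_stop(void) {
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        raise(SIGSTOP); /* stand-in for stopping at a trap */
        _exit(0);
    }

    /* First waiter (playing the shared memory code) collects the stop. */
    int status = 0;
    if (waitpid(child, &status, WUNTRACED) != child || !WIFSTOPPED(status)) {
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
        return -1;
    }

    /* Second waiter (playing Dyninst) polls for the same event. */
    int status2 = 0;
    pid_t r = waitpid(child, &status2, WUNTRACED | WNOHANG);

    /* Clean up: resume the child and reap its exit. */
    kill(child, SIGCONT);
    waitpid(child, NULL, 0);

    return r == 0 ? 1 : 0;
}
```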
We talked about a solution, which may not be the right approach and needs
to be investigated further. If the shared memory code were to retry
obtaining the lock for an extended period of time and was unable to do
so, it might assume that the mutatee has somehow hung and will
never release the lock. In that case, the shared memory code may force the
lock to a cleared state so that it can then obtain it and continue with its
processing. This has the risk that if the mutatee is only running slowly,
the mutatee may update shared memory structures that it thinks it holds
the lock for when it does not, and those structures may be corrupted,
resulting in a mutatee and/or dpcld crash, neither of which is desirable.
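
A rough sketch of the force-clear idea, to make the risk concrete. Nothing here is taken from DPCL: `shm_lock_t`, `acquire_or_force`, the millisecond timeout parameter, and the 1 ms polling interval are assumptions for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <time.h>

/* Hypothetical lock word living in shared memory: 0 = free, 1 = held. */
typedef struct { atomic_int held; } shm_lock_t;

/*
 * Returns 1 if the lock was obtained normally,
 *         2 if it was obtained only after being forced clear,
 *         0 if another party took it right after the forced clear.
 */
int acquire_or_force(shm_lock_t *l, long timeout_ms) {
    struct timespec pause = {0, 1000000L}; /* poll roughly every 1 ms */
    int expected;

    for (long waited = 0; waited < timeout_ms; waited++) {
        expected = 0;
        if (atomic_compare_exchange_strong(&l->held, &expected, 1))
            return 1;
        nanosleep(&pause, NULL);
    }

    /* Timed out: assume the holder is hung and clear the lock.
     * If the holder is merely slow, it still believes it owns the lock,
     * and both sides may now write the protected structures. */
    atomic_store(&l->held, 0);

    expected = 0;
    return atomic_compare_exchange_strong(&l->held, &expected, 1) ? 2 : 0;
}
```

The lock word alone cannot distinguish a hung holder from a slow one, which is exactly the corruption risk described above.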
We also need to look at the other daemon calls to waitpid to see what the
intent of the code in those areas is.
Dave
Steve Collins <sl...@sg...>
Sent by: dpc...@li...
05/02/2005 11:34 AM
To: dpc...@li...
cc: sl...@sg..., Dave Wootton/Poughkeepsie/IBM@IBMUS, leg...@cs...,
ja...@cs..., be...@cs..., per...@sg...
Subject: [Dpcl-develop] DPCL and SIGCHLD/SIGTRAP
Greetings, everyone. I need to bounce a DPCL concern off the experts
on this board. The recent Dyninst 4.2.1 release has exposed a stability
issue with the Hybrid version of DPCL. Code that used to be seemingly
innocuous with Dyninst 4.1.1 has proven to be incompatible with the
newer, improved process and/or thread control that Dyninst 4.2.1
provides.
The problematic code is in 'main.C' of the CommDaemon (dpcld) and it
involves the unblocking of SIGCHLD and SIGTRAP signals. To be sure this
is just an issue with the University Dyninst (_DYNINST) implementation,
but it is potentially destabilizing for the Open|SpeedShop project at
SGI (Silicon Graphics).
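
For readers unfamiliar with the mechanism, changing whether a process has SIGCHLD and SIGTRAP blocked is typically done with `sigprocmask`. The sketch below is not the code from 'main.C', which is not shown in this thread; the helper names `set_child_signals_blocked` and `is_blocked` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Block (blocked != 0) or unblock (blocked == 0) SIGCHLD and SIGTRAP
 * for the calling thread. Returns 0 on success, -1 on error. */
int set_child_signals_blocked(int blocked) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    sigaddset(&set, SIGTRAP);
    return sigprocmask(blocked ? SIG_BLOCK : SIG_UNBLOCK, &set, NULL);
}

/* Returns 1 if signo is currently blocked, 0 if not, -1 on error. */
int is_blocked(int signo) {
    sigset_t cur;
    /* With a NULL new set, sigprocmask only reports the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, signo);
}
```

Leaving both signals blocked in the daemon would be one way to "forget about" them, though whether that is sufficient here is the open question of this thread.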
Following are some comments from Matt Legendre and Drew Bernat from
the Dyninst Group when informed that the Hybrid version of DPCL registers
for and unblocks SIGCHLD and SIGTRAP signals headed for the mutatee. In
the case of SIGCHLD, DPCL makes process control dicey by doing a 'waitpid'
in its 'sigchild_handler' as well as a 'waitpid' in its shared memory
manager code. By accepting SIGTRAP signals for the mutatee, DPCL also is
asking for trouble by interfering with Dyninst's trap-based
instrumentation.
Frankly, I think this code is something left over from the original Hybrid
'creation' effort and DPCL (University Dyninst version only) should have
been changed to forget about SIGCHLD and SIGTRAP some time ago. But my
confidence is real shaky when I say this. Thus I am posting here for some
reinforcement(s).
Comments and/or reactions welcome.
Steve Collins, SGI Compilers/Tools
Drew Bernat writes:
> Dyninst uses signals for:
>
> 1) Trap-based instrumentation if we can't fit a jump in.
> 2) Discovering the completion of an inferior RPC (forcing code to run in
> the mutatee)
> 3) Discovering loads of new shared libraries (a trap in dlopen)
> 4) On Linux, discovery of when several system calls are executed.
> 5) Keeping track of process state (paused/running)
> and 6) discovering when a mutatee exits.
>
>
> The big one is trap-based instrumentation, followed closely by tracking
> process state. I'll let Matt fill in more details.
Matt Legendre writes:
> I think Drew caught all of the big places where Dyninst makes use of
> waitpid; it's a fundamental part of the ptrace debugging interface on
> Linux. And we call it frequently: anytime BPatch::waitForStatusChange or
> BPatch::pollForStatusChange is called, anytime instrumentation is
> inserted, anytime the process is stopped, anytime we try to read/write
> from its address space, or anytime a fork/exec happens.
> If DPCL is also calling waitpid frequently, then we stand a good chance of
> having DPCL get one of our signal events, or we get one of DPCL's.
> If we get an event generated by DPCL that we don't recognize it's likely
> to be silently dropped (SIGTRAP), forwarded back to the process
> (SIGPROF), or handled as if we caused it (SIGCHLD).
> If DPCL picks up one of our events, a lot of things could happen:
> * We'll fail to execute instrumentation and incorrectly execute part of
> the program (if we miss a SIGTRAP from trap-based instrumentation).
> Fortunately, the use of trap based instrumentation is rare.
> * We won't know about a new shared library that's been loaded (a SIGTRAP
> generated from dlopen). We won't generate parse data for this library
> or be able to instrument it, but the app will continue to run fine. We may
> not have seen that yet because not too many applications use dlopen.
> * Miss certain system calls that are being executed. I don't think this
> is a frequently used feature of Dyninst. We'll miss things like exec
> system calls, which the test applications might not be doing.
> * Missing the mutatee when it exits, which is what we're seeing now.
> Dyninst (as it currently stands) isn't going to change the process status
> to 'exited' until it sees this event.
> * If DPCL is also calling ptrace(PTRACE_CONT, ...) when we expect the
> process to be stopped, that's going to cause us to lose track of the
> process state and will probably cause certain operations (like inserting
> instrumentation) to start failing until the two sync back up.
> Now most of these aren't critical-fail-on-every-run errors, which is
> probably why we didn't see them before, but they're still unacceptable
> from a stability standpoint. Unfortunately, I don't have a good
> suggestion for fixing this. Working around this from a pure Dyninst
> standpoint would be incredibly difficult.
>
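
To illustrate why every waitpid result matters to a single consumer, here is a minimal sketch of decoding one status into the kinds of events discussed in the quoted comments above (exit, death by signal, trap stop, other stop). It is not Dyninst code; `child_event_t` and `classify_status` are hypothetical names, and the real library distinguishes far more cases.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical event kinds a debugger-style tool might derive from
 * a waitpid status. */
typedef enum {
    EV_EXITED,     /* child called exit: the mutatee is gone */
    EV_KILLED,     /* child was terminated by a signal */
    EV_TRAP_STOP,  /* stopped on SIGTRAP: trap instrumentation, dlopen, ... */
    EV_OTHER_STOP, /* stopped on some other signal */
    EV_UNKNOWN
} child_event_t;

/* Decode a status value filled in by waitpid. */
child_event_t classify_status(int status) {
    if (WIFEXITED(status))
        return EV_EXITED;
    if (WIFSIGNALED(status))
        return EV_KILLED;
    if (WIFSTOPPED(status))
        return WSTOPSIG(status) == SIGTRAP ? EV_TRAP_STOP : EV_OTHER_STOP;
    return EV_UNKNOWN;
}
```

If a second party calls waitpid on the same child, any one of these events can be delivered to the wrong consumer, which then has no basis for interpreting it.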
Steve Collins writes:
> The basic DPCL signal handler for SIGCHLD does a 'waitpid'. But it has
> always done that and things worked, at least with Dyninst 4.1.1; maybe
> 4.2.1 has rendered the DPCL SIGCHLD handler 'bad'.
Drew Bernat writes:
> Dyninst is designed with the
> assumption that it is the only thing consuming signals from the child;
> as a result, when a child dies we _will_ get a SIGCHLD from it so that
> we can clean up. I'm not surprised that some things may have worked in
> the past, but it's an error case that we explicitly don't test here. As
> an example, the internal call to terminateProc() was hanging with 4.1.1
> because we didn't get informed of the child dying; now it's pause(), but
> the root cause is the same.
> We can patch pause() to operate correctly, but there will still be
> problems when we don't catch a process dying.
> I'm bouncing this to Matt, the signals expert. It looks like DPCL is
> calling waitpid, which is just plain bad news. And it's forwarding
> signals, which is worse.
> The biggest problem is going to be Dyninst process control. However,
> _particularly_ on Linux (cruddy debugging interface), we really need as
> much information as we can get. That includes waitpid(), unfortunately.
> It makes sense as a monolithic structure, but not if it's handing
> process control to Dyninst. Problem is, Dyninst gives you process
> control along with instrumentation.
> As a note, this was likely to break the other way in Dyninst 5.0. One
> of the upcoming features is an internally multithreaded Dyninst library
> so that the user doesn't have to call "pollForStatusEvents" all the
> time; this means that Dyninst would probably catch the DPCL signals
> rather than the reverse.
Steve Collins wrote:
> Drew -
> Oh yeah, I think we can safely assume it is just the SIGCHLD signal.
> They use the handler to 'ptrace (PT_CONTINUE)' the mutatee.
>