Re: [BProc] Re: oops 3.2.6 w/sigbypass on 2.4.20

On Tue, Dec 09, 2003 at 05:05:31PM -0500, Nicholas Henke wrote:
> On Tue, 2003-12-09 at 17:00, Nicholas Henke wrote:
> > On Mon, 2003-12-08 at 22:25, Nicholas Henke wrote:
> > > Hey Erik~
> > > 	Our largest cluster started crashing every 2 days or so, and we finally
> > > got the attached oops and ksymoops output from it. I traced the bug to
> > > kernel/ghost.c:805 ghost_update_status::g->last_response = jiffies;  I
> > > also included the objdump output for our bproc.o, as that is what I used
> > > to decode the assembly to the C function.
> > > 
> > > -- But then again I may be wrong in my tracing. It appears that
> > > something happens to tsk->bproc.ghost and accessing last_response is a
> > > BadThing. I could really use any ideas or help you may have :)
> > 
> > And the oopsing continues. Attached is the output in /var/log/messages
> > and the ksymoops output. I will be tracing through the code to make sure
> > the tracebacks are sane, but I could really use some help.
> 
> Oops ( no pun ...) I ran ksymoops with the wrong ksyms file, attached is
> the proper one.

Hrm.  There's a lot of craziness in there.  Are all these oopses from
the master node?  I'm going to assume yes for now.  Also, can I
presume that you're not mixing master and slave nodes?  In other
words, you're not running a bpslave on the same machine where you're
running a bpmaster.

From what I've seen so far, I strongly suspect that there's a kernel
stack overflow happening somewhere.  Some of the backtraces you're
sending are very long, which isn't a good sign.

Some evidence that could support this theory is that the "ghost"
pointer seems to have gotten set to 0x1, which is clearly an invalid
pointer.  That is what made the "g->last_response = jiffies;" line explode.
I don't think there's any plausible way of the pointer getting set
like that except for some kind of random corruption.  Another clue is
that the backtraces are including calls to both ghost and masq stuff.
ghost stuff is fine on the front end but masq stuff should *never* be
called on the front end at all.  The only way that that would happen
is if a process's masq pointer (in task_struct) got set to something
non-zero.  Both of these live right at the end of task_struct so
they'll be the first to get clobbered by a stack overflow.
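
To make the layout argument concrete, here is a rough userspace sketch
(not BProc or kernel code).  It assumes the 2.4 i386 arrangement where
task_struct sits at the low end of the same 8 KB block as the kernel
stack, and the stack grows down towards it.  The names fake_task,
fake_task_union, push_bytes and free_stack_bytes are made up for
illustration, and the 1024 bytes of "other fields" is an arbitrary
placeholder for the rest of task_struct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_AREA 8192 /* assumed: task_struct + kernel stack share 8 KB */

/* Hypothetical stand-in for task_struct; only the tail fields matter. */
struct fake_task {
    char other_fields[1024]; /* placeholder for everything else */
    void *ghost;             /* assumed to sit at the very end ... */
    void *masq;              /* ... as described in the text above */
};

/* task_struct at the low end of the block, stack in the rest of it. */
union fake_task_union {
    struct fake_task task;
    unsigned char stack[STACK_AREA];
};

/* Bytes the stack can use before it starts eating into task. */
static size_t free_stack_bytes(void)
{
    return STACK_AREA - sizeof(struct fake_task);
}

/*
 * Simulate stack usage: the stack starts at the top of the block and
 * grows downward, so "depth" bytes of use touch the top "depth" bytes.
 */
static void push_bytes(union fake_task_union *u, size_t depth,
                       unsigned char fill)
{
    assert(depth <= STACK_AREA);
    memset(u->stack + STACK_AREA - depth, fill, depth);
}
```

With a zeroed fake_task_union, push_bytes(&u, free_stack_bytes(), 1)
leaves both pointers NULL.  Going one byte deeper makes masq non-NULL
while ghost is still intact, and going deeper still reaches ghost.
That's the kind of garbage value I'd expect an overflow to leave behind.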

Do you have CONFIG_DEBUG_STACKOVERFLOW turned on?  It might let us
know if that's what's going on here.  If it is, it will give us a
trace at the time of overflow - not some time later when BProc tries
to use the corrupted data - and that would be very useful.
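
For reference, the general idea behind that kind of check can be
sketched in userspace as below.  This is not the actual kernel source.
It assumes an 8 KB aligned stack block with task_struct at the bottom,
and the 1 KB warning margin and the function names are assumptions of
mine for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_AREA  8192u /* assumed size and alignment of the stack block */
#define WARN_MARGIN 1024u /* hypothetical "getting too close" threshold */

/*
 * Because the block is aligned to its size, masking the stack pointer
 * gives its offset from the bottom of the block.  Whatever lies between
 * that offset and the end of task_struct is the space still free.
 */
static size_t stack_bytes_left(uintptr_t sp, size_t task_size)
{
    size_t offset = (size_t)(sp & (STACK_AREA - 1));

    return offset > task_size ? offset - task_size : 0;
}

/* Nonzero when the stack is within WARN_MARGIN bytes of task_struct. */
static int stack_nearly_full(uintptr_t sp, size_t task_size)
{
    return stack_bytes_left(sp, task_size) < WARN_MARGIN;
}
```

The point of running a test like this at a well-defined place (for
example on interrupt entry) is that the warning fires while the deep
call chain is still on the stack, so the trace shows who used the space.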

- Erik