From: <er...@he...> - 2003-12-10 17:53:54
On Tue, Dec 09, 2003 at 05:05:31PM -0500, Nicholas Henke wrote:
> On Tue, 2003-12-09 at 17:00, Nicholas Henke wrote:
> > On Mon, 2003-12-08 at 22:25, Nicholas Henke wrote:
> > > Hey Erik~
> > > Our largest cluster started crashing every 2 days or so, and we finally
> > > got the attached oops and ksymoops output from it. I traced the bug to
> > > kernel/ghost.c:805 ghost_update_status::g->last_response = jiffies; I
> > > also included the objdump output for our bproc.o, as that is what I used
> > > to decode the assembly to the C function.
> > >
> > > -- But then again I may be wrong in my tracing. It appears that
> > > something happens to tsk->bproc.ghost and accessing last_response is a
> > > BadThing. I could really use any ideas or help you may have :)
> >
> > And the oopsing continues. Attached is the output in /var/log/messages
> > and the ksymoops output. I will be tracing through the code to make sure
> > the tracebacks are sane, but I could really use some help.
>
> Oops ( no pun ...) I ran ksymoops with the wrong ksyms file, attached is
> the proper one.

Hrm. There's a lot of craziness in there. Are all these oopses from
the master node? I'm going to assume yes for now. Also, can I presume
that you're not mixing master and slave nodes? In other words, you're
not running a bpslave on the same machine where you're running a
bpmaster.

From what I've seen so far, I strongly suspect that there's a kernel
stack overflow happening somewhere. Some of the backtraces you're
sending are very long, which isn't a good sign.

Some evidence that supports this theory is that the "ghost" pointer
seems to have gotten set to 0x1, which is clearly an invalid pointer.
That is what made the "g->last_response = jiffies;" line blow up. I
don't think there's any plausible way for the pointer to get set like
that except for some kind of random corruption.

Another clue is that the backtraces include calls to both ghost and
masq stuff. Ghost stuff is fine on the front end, but masq stuff should
*never* be called on the front end at all. The only way that would
happen is if a process's masq pointer (in task_struct) got set to
something non-zero. Both of these pointers live right at the end of
task_struct, so they'll be the first to get clobbered by a stack
overflow.

Do you have CONFIG_DEBUG_STACKOVERFLOW turned on? It might let us know
if that's what's going on here. If it is, it will give us a trace at
the time of the overflow, not some time later when BProc tries to use
the corrupted data, and that would be very useful.

- Erik
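
To illustrate why the tail of task_struct would be hit first, here is a rough user-space sketch. It assumes the 2.4-era x86 layout, where the task structure sits at the low end of an 8 KB per-task area and the kernel stack grows down toward it from the high end. Everything named toy_* is hypothetical and made up for this example; the 1024-byte filler and the two pointer fields only stand in for the real task_struct and its ghost and masq pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_STACK_AREA 8192 /* assumption: 8 KB per-task area */

/* Hypothetical stand-in for task_struct; only the tail matters here. */
struct toy_task {
    unsigned char other_fields[1024]; /* everything before the tail */
    void *ghost;                      /* stand-in for tsk->bproc.ghost */
    void *masq;                       /* stand-in for the masq pointer */
};

/* The task structure occupies the low addresses of the area; the stack
 * starts at the high end and grows down toward it. */
union toy_task_area {
    struct toy_task task;
    unsigned char stack[TOY_STACK_AREA];
};

/* Fill the top `depth` bytes of the area with 0x01, as if the stack had
 * grown that deep. */
static void toy_use_stack(union toy_task_area *area, size_t depth)
{
    assert(depth <= TOY_STACK_AREA);
    memset(area->stack + (TOY_STACK_AREA - depth), 0x01, depth);
}

/* Return 1 if any of the n bytes starting at p is non-zero. */
static int toy_any_nonzero(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] != 0)
            return 1;
    return 0;
}

/* Return 1 if the ghost/masq fields at the tail were overwritten. */
static int toy_tail_clobbered(const union toy_task_area *area)
{
    size_t tail = offsetof(struct toy_task, ghost);
    return toy_any_nonzero(area->stack + tail, sizeof(struct toy_task) - tail);
}

/* Return 1 if the fields before the tail were overwritten. */
static int toy_head_clobbered(const union toy_task_area *area)
{
    return toy_any_nonzero(area->stack, offsetof(struct toy_task, ghost));
}
```

Starting from a zeroed area, a stack depth of TOY_STACK_AREA - sizeof(struct toy_task) leaves the structure untouched. One byte more and toy_tail_clobbered() reports damage while toy_head_clobbered() still returns 0, so in this toy layout the fields at the end of the structure are the first to go.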