From: <er...@he...> - 2003-12-10 19:06:00
On Wed, Dec 10, 2003 at 01:23:27PM -0500, Nicholas Henke wrote:
> On Wed, 2003-12-10 at 12:51, er...@he... wrote:
> >
> > Hrm. There's a lot of craziness in there. Are all these oopses from
> > the master node? I'm going to assume yes for now. Also, can I
> > presume that you're not mixing master and slave nodes? In other
> > words, you're not running a bpslave on the same machine where you're
> > running a bpmaster.
>
> No strangeness in the setup; all are actual hardware, and the master is
> a different node from the slave.
>
> > From what I've seen so far, I strongly suspect that there's a kernel
> > stack overflow happening somewhere. Some of the backtraces you're
> > sending are very long, which isn't a good sign.
>
> I would put money on that; it makes sense, especially given the oddness
> we have been seeing lately in the BProc backtraces.
>
> > Some evidence that could support this theory is that the "ghost"
> > pointer seems to have gotten set to 0x1, which is clearly an invalid
> > pointer. That's why the "g->last_response = jiffies;" line exploded.
> > I don't think there's any plausible way for the pointer to get set
> > like that except for some kind of random corruption. Another clue is
> > that the backtraces include calls to both ghost and masq code.
> > Ghost code is fine on the front end, but masq code should *never* be
> > called on the front end at all. The only way that would happen is if
> > a process's masq pointer (in task_struct) got set to something
> > non-zero. Both of these live right at the end of task_struct, so
> > they'll be the first to get clobbered by a stack overflow.
> >
> > Do you have CONFIG_DEBUG_STACKOVERFLOW turned on? It might let us
> > know if that's what's going on here. If it is, it will give us a
> > trace at the time of overflow - not some time later when BProc tries
> > to use the corrupted data - and that would be very useful.
> Darn -- well, I actually had turned that off, as it was oopsing all
> over the place. I think I emailed you about that, or at least we were
> talking about reducing the size of the things on the kernel stack.
> See the message from Sept. 17, Subject: "2.4.20 oops".
>
> So, what can I do? I can turn that option back on and see if the
> oopses are the same as in September.

Oh yeah, I remember that conversation. I would turn it back on. If
it's pretty easy to get it to dump a traceback, then I'd try to get it
to do so without BProc in the loop. The Sept. 17 traceback just showed
BProc doing a read and then gobs of network stuff above it. It seems
likely that you could cause the same thing with ttcp or some other
network stress tool. If that happens, then I think it's certain that
some other code is eating the stack and it's not BProc.

The other thing you could do, if you think there are false positives,
would be to make the check a little less sensitive -- maybe change the
1024 to 512 or something like that in arch/i386/kernel/irq.c:585. I
still think there's a problem if that makes it go away.

- Erik