From: Nicholas H. <he...@se...> - 2003-12-10 18:23:33
On Wed, 2003-12-10 at 12:51, er...@he... wrote:
>
> Hrm. There's a lot of craziness in there. Are all these oopses from
> the master node? I'm going to assume yes for now. Also, can I
> presume that you're not mixing master and slave nodes? In other
> words, you're not running a bpslave on the same machine where you're
> running a bpmaster.

No strangeness in the setup: all are actual hardware, and the master is
a different node from the slave.

>
> From what I've seen so far, I strongly suspect that there's a kernel
> stack overflow happening somewhere. Some of the backtraces you're
> sending are very long which isn't a good sign.

I would put money on that. It makes sense -- especially given the
oddness we have been seeing lately in the bproc backtraces.

>
> Some evidence that could support this theory is that the "ghost"
> pointer seems to have gotten set to 0x1 which is clearly an invalid
> pointer. That led to the "g->last_response = jiffies;" line exploded.
> I don't think there's any plausible way of the pointer getting set
> like that except for some kind of random corruption. Another clue is
> that the back traces are including calls to both ghost and masq stuff.
> ghost stuff is fine on the front end but masq stuff should *never* be
> called on the front end at all. The only way that that would happen
> is if a process's masq pointer (in task_struct) got set to something
> non-zero. Both of these live right at the end of task_struct so
> they'll be the first to get clobbered by a stack overflow.
>
> Do you have CONFIG_DEBUG_STACKOVERFLOW turned on? It might let us
> know if that's what's going on here. If it is, it will give us a
> trace at the time of overflow - not some time later when BProc tries
> to use the corrupted data - and that would be very useful.

Darn -- well, I actually had turned that off, as it was oopsing all over
the place. I think I emailed you about that, or at least we were talking
about reducing the size of the things on the kernel stack. See the
message from Sept 17, subject: 2.4.20 oops.

So, what can I do? I can turn that option back on and see whether the
oopses are the same as in Sept.

Thanks for the help!

Nic
--
Nicholas Henke
Penguin Herder & Linux Cluster System Programmer
Liniac Project - Univ. of Pennsylvania
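
For anyone following the thread, here is a minimal user-space sketch of why
fields at the end of task_struct would be the first casualties of a kernel
stack overflow. It assumes the 2.4-era layout in which the task structure sits
at the bottom of a fixed-size region shared with the kernel stack, and the
stack grows downward toward it. The struct, its field names, the 8 KB size and
the use_stack() helper are all illustrative stand-ins, not BProc or kernel
code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed size of the per-task region (8 KB is an assumption for this sketch). */
#define STACK_REGION 8192

/* Hypothetical stand-in for task_struct: two pointers at the very end. */
struct fake_task {
    char  other_fields[1024]; /* everything that precedes the two pointers */
    void *ghost;              /* illustrative: second-to-last field */
    void *masq;               /* illustrative: last field */
};

/* The task structure and its kernel stack share one region. */
union fake_task_union {
    struct fake_task task;
    unsigned char    stack[STACK_REGION];
};

/*
 * Simulate 'depth' bytes of stack usage. The stack starts at the top of the
 * region and grows down, so deep usage eventually writes over the task
 * structure from its end backwards. Returns how many bytes of the task
 * structure were overwritten (0 if the stack stayed within bounds).
 */
static size_t use_stack(union fake_task_union *u, size_t depth)
{
    size_t room = STACK_REGION - sizeof(struct fake_task);

    assert(u != NULL);
    if (depth > STACK_REGION)
        depth = STACK_REGION;

    /* Fill the used part of the stack with a recognisable junk pattern. */
    memset(u->stack + STACK_REGION - depth, 0x01, depth);

    return depth > room ? depth - room : 0;
}
```

In this sketch, overrunning the available room by one pointer's worth of bytes
turns masq into a non-NULL junk value while ghost is still intact; one more
pointer's worth corrupts ghost too, and the earlier fields are untouched. That
is consistent with seeing both a bogus ghost pointer and unexpected masq calls
on the front end, though it does not prove an overflow is the cause.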