Re: 2.6.20 instability

On Fri, Feb 09, 2007 at 10:32:49AM +0100, Manuel Lauss wrote:
> On Fri, Feb 09, 2007 at 05:34:24PM +0900, Paul Mundt wrote:
> > On Thu, Feb 08, 2007 at 12:02:47PM +0100, Manuel Lauss wrote:
> > > On Thu, Feb 08, 2007 at 05:26:23PM +0900, Paul Mundt wrote:
> > > > Assuming no luck and you hit the same problem again, but at a different
> > > > offset, and it's repeatable, can you walk the page tables and print out
> > > > the pgprot encoding as well as the corresponding pfn << PAGE_SHIFT for
> > > > each remapped PTE?
> > > 
> > > That may take a (long) while, I don't know yet where to start :)
> > > 
> > Try this..
> 
> Thank you.
> 
> Curiously, with the dumping enabled the oops did not occur
> the first 3 times. Strange, but after another reset it is
> as oops-happy as ever.
> 
> This is the log of the first few *working* cases:
> http://mlau.at/sh/sh4-ptedumper.txt
> 
> And here from the oopsing ones:
> http://mlau.at/sh/sh4-ptedumper-2.txt
> http://mlau.at/sh/sh4-ptedumper-3.txt
> 
> I'll try to find something with the JTAG probe next.
> 
The page tables look perfectly sane, so that's quite interesting. At
least we know it's not a problem of a bogus page table mapping, or pgprot
oddities.

The next thing to try would be to pre-fault the translation and make sure
it's in the TLB, so we don't take a TLB miss for the dying page. That
tells us whether the issue persists for that page or is bumped to the
next one. You're clearly on a system that has the old-style PTEA, which
might be something else to look at (it's also set in the
update_mmu_cache() path). The fact that this fault happens at a fixed
location suggests that it's not actually having a problem faulting in
the translation, so I would imagine you're not going to find much in the
TLB miss case.

Once you've pre-faulted the page, please try to dump the whole page
and see if it manages to blow up at that same location. You can set up
the translation with:

	pte_t entry = pfn_pte(0x18000000 >> PAGE_SHIFT, pgprot);
	update_mmu_cache(NULL, 0xc0600000, entry);

or you can of course just hack something stupid into the dump code to
pre-fault and bail out early (i.e., a dummy read for every PTE).

The next question would be what register you have at 0x18000ff0, whether
you can use 16 or 32-bit reads, and if so, whether it's the same location
that blows up. The fact it worked the first few times and it's an
uncached mapping almost suggests a timing problem, oddly.