Here's a perfect example of a lost fix. I spent
2.5 days doing nothing but trying to fix this bug,
plus some unknown amount of time devising a
workaround for it. Finally, today (20020227), I
found the bug, and found out that someone else had
already fixed it.
I'm working on Linux on an embedded 8260 box. This
box has no CPU_FTR_HPTE_TABLE feature, which I
take to mean that it doesn't have hardware hashing
of the page table entries. The bug is in the
MMU_init_hw function in arch/ppc/mm/ppc_mmu.c. It
has been in the kernel since 2.4.9.2; earlier
kernels don't have it.
Here it is:
    if ((cur_cpu_spec[0]->cpu_features &
         CPU_FTR_HPTE_TABLE) == 0)
        return;
The problem is the return. Before returning, the
routine must patch the hash_page function so that
it just returns without doing anything. The code
that does this appears later in the function, in
an else clause:
    else {
        Hash_end = 0;
        /*
         * Put a blr (procedure return) instruction at the
         * start of hash_page, since we can still get DSI
         * exceptions on a 603.
         */
        hash_page[0] = 0x4e800020;
        flush_icache_range((unsigned long) &hash_page[0],
                           (unsigned long) &hash_page[1]);
    }
The assembly code for hash_page arbitrarily assigns
a 256K-byte hash table at 0x185000, right in the
middle of the kernel. That location is supposed to
be patched later by the MMU_init_hw function, but
it only gets patched if the CPU in question has
the hardware hashing feature. So the CPU takes
page faults as part of normal virtual memory
operation and hash_page gets called, but rather
than returning like it is supposed to, it trashes
the 256K-byte block at 0x185000, 8 bytes at a
time. The system then starts crashing in strange
ways, which makes the bug very difficult to track
down.
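To make the required change concrete, here is a
rough userland C model of it. This is a sketch,
not the actual kernel patch. The names
hash_page_model, stub_out_hash_page,
mmu_init_hw_model and demo, the value given to
CPU_FTR_HPTE_TABLE, and the placeholder instruction
words are all made up for illustration. Only the
feature test and the blr opcode 0x4e800020 come
from the kernel code quoted above. The real fix in
the development tree may well be structured
differently.

```c
/* Userland model only: this is NOT kernel code. */
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit value, chosen just for this sketch. */
#define CPU_FTR_HPTE_TABLE 0x1u
/* blr opcode, as used in the else clause quoted above. */
#define PPC_BLR 0x4e800020u

/*
 * Stand-in for the first instruction words of hash_page.
 * The initial contents are placeholders (PowerPC nop).
 */
uint32_t hash_page_model[2] = { 0x60000000u, 0x60000000u };

/* Patch the model so that "hash_page" returns at once. */
void stub_out_hash_page(void)
{
    hash_page_model[0] = PPC_BLR;
    /* The kernel would also call flush_icache_range() here. */
}

/*
 * Model of the early exit in MMU_init_hw, with the patch
 * applied before the return instead of being skipped.
 */
void mmu_init_hw_model(uint32_t cpu_features)
{
    if ((cpu_features & CPU_FTR_HPTE_TABLE) == 0) {
        stub_out_hash_page();
        return;
    }
    /* Hash table setup for CPUs with the feature goes here. */
}

/* Usage: a CPU without the feature gets a stubbed hash_page. */
void demo(void)
{
    mmu_init_hw_model(0);
    assert(hash_page_model[0] == PPC_BLR);
}
```

The point of the model is only the ordering: the
blr patch has to happen on the path taken by CPUs
without CPU_FTR_HPTE_TABLE, before the early
return, rather than in a branch that such CPUs
never reach.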
If you look in the 2.4.* development tree, this has
already been fixed. But the fix isn't in the 2.4.*
public kernels downloadable from ftp.kernel.org.
All the experts in the newsgroups had no helpful
advice, other than claiming it was flaky hardware.
Only through Herculean effort did I track down
this already-fixed bug. The whole thing is
disgusting.