Hello,

I have encountered a crash in the dom0 kernel while booting a domU from an AOE device. I have not seen such crashes when booting from local partitions, LVM volumes, or loopback file systems, nor when doing repeated I/O to these AOE devices. As the call trace shows, the crash is in the xenolinux kernel. The crash is reliably reproducible.

I am currently using Xen 3.0.1, but I saw the same thing happen on 3.0.2 some time back. If time permits, I can try to reproduce it on the latest Xen builds.

The domU's disks look like this:
'phy:/dev/etherd/e0.4,sda1,w'
'phy:/dev/etherd/e1.4,sda2,w'

Inside the domU, sda1 is used as the root device and sda2 as swap.

The AOE setup consists of vblade servers running on the server machine, which export some disks over AOE. The dom0 instance in question is a client of this AOE server. It has the 'aoe' module loaded, and the aoe-tools version is 10.
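For reference, the relevant part of the domU configuration looks roughly like the sketch below (Xen domU config files use Python syntax). Only the two disk strings are taken from my setup; the root line is an assumption that simply restates that sda1 is the root device.

```python
# Sketch of the relevant domU config entries.
# The two disk strings are the ones listed above; the root line is illustrative.
disk = [
    'phy:/dev/etherd/e0.4,sda1,w',  # AOE device seen by the domU as sda1 (root)
    'phy:/dev/etherd/e1.4,sda2,w',  # AOE device seen by the domU as sda2 (swap)
]
root = '/dev/sda1 ro'  # assumption: a typical root line, not copied from my config
```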

The stack trace of the crash is as follows:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
c012cc32
*pde = ma 8da99067 pa 32e99067
*pte = ma 00000000 pa 55555000
Oops: 0002 [#1]
SMP
Modules linked in: ipt_physdev iptable_filter ip_tables aoe bridge nfs lockd ppdev vmnet vmmon sg parport_pc lp parport autofs4 sunrpc af_packet binfmt_misc dm_mirror dm_multipath video thermal processor fan button battery ac ipv6 md ohci1394 ieee1394 uhci_hcd intel_agp agpgart i2c_i801 i2c_core pci_hotplug snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc e1000 floppy unix sd_mod aacraid scsi_mod ext3 jbd dm_mod
CPU:    0
EIP:    0061:[<c012cc32>]    Tainted: P      VLI
EFLAGS: 00010012   (2.6.12.6-xen)
EIP is at run_timer_softirq+0xa2/0x1c0
eax: 00000000   ebx: 00000000   ecx: f33dbe00   edx: c03f3f0c
esi: 00000100   edi: c26deda0   ebp: 00000000   esp: c03f3ef8
ds: 007b   es: 007b   ss: 0069
Process swapper (pid: 0, threadinfo=c03f2000 task=c0369fc0)
Stack: 00000000 c03f3f7c 00000100 c01438a0 c03f2000 f33dbe00 c0449260 20000000
       00000011 c03ecda8 c0420ea0 00000000 c0127ee6 c03ecda8 0000000a c03f2000
       00000001 00000000 00000000 c0128005 00000000 fbf7e000 c010ef32 c0105a00
Call Trace:
 [<c01438a0>] handle_IRQ_event+0x60/0xb0
 [<c0127ee6>] __do_softirq+0x96/0x130
 [<c0128005>] do_softirq+0x85/0xa0
 [<c010ef32>] do_IRQ+0x22/0x30
 [<c0105a00>] evtchn_do_upcall+0x90/0x100
 [<c010a88c>] hypervisor_callback+0x2c/0x34
 [<c01082aa>] xen_idle+0x4a/0xa0
 [<c0108369>] cpu_idle+0x69/0xb0
 [<c03f49fa>] start_kernel+0x1ca/0x220
 [<c03f4370>] unknown_bootoption+0x0/0x1f0
Code: 00 8b 53 04 8d 6c 24 14 8b 44 24 14 89 69 04 89 4c 24 14 89 50 04 89 02 89 5b 04 89 5e 0c eb 66 8b 51 04 8b 01 8b 69 14 8b 59 18 <89> 50 04 89 02 c7 41 04 00 02 20 00 c7 01 00 01 10 00 89 4f 08
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 (XEN) Domain 0 shutdown: rebooting machine.
(XEN) Reboot disabled on cmdline: require manual reset



Before the crash, I get some warnings on the serial console that look like the following:

Uninitialised timer!
This is just a warning.  Your computer is OK
function=0xc02344b0, data=0xf1b9d460

But I guess these have nothing to do with the crash.
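
Still, to check which timer callback those warnings refer to, the function address can be matched against the kernel's symbol table. Below is a minimal sketch, assuming a System.map-style file with one "address type name" entry per line. The helper names are my own and purely illustrative.

```python
import bisect


def load_symbols(lines):
    """Parse System.map-style lines ('c0100000 T name') into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def resolve(symbols, addr):
    """Return (name, offset) for the last symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, addr - base


def resolve_from_file(path, addr):
    """Convenience wrapper: read a symbol file from disk and resolve addr."""
    with open(path) as f:
        return resolve(load_symbols(f), addr)
```

For example, resolve_from_file() could be called with the System.map that matches the running 2.6.12.6-xen kernel (its location depends on the installation) and the address 0xc02344b0 from the warning.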


I also captured the AOE traffic with tcpdump when the crash occurred. Nothing looked unusual to me; the packets simply stopped flowing after the AOE client dom0 crashed. There is also no problem with the AOE servers: after a reboot I can use the same AOE devices again (apart from the inconsistent file system). My earlier attempts at adding printk calls to the AOE driver source did not reveal anything helpful either.

Please let me know if any bug fixes have been made in recent versions in the area where this crash occurs (EIP is in run_timer_softirq, with handle_IRQ_event at the top of the call trace). Any other suggestions for tackling the problem are welcome.

Thanks,

--
Jayesh