Hi,

My company uses UML (User-Mode Linux) to simulate the Linux OS of our embedded systems for off-target testing. However, during one particular, complex test, a UML virtual machine hangs in roughly one out of every 4 runs. I have spent a few weeks debugging it, but so far without any breakthrough.

Symptom:

One of the virtual machines hangs (there are 5 in total, connected together with uml_switch). Its console does not respond to key input (not even a blank line appears).

Backtrace:

After it hangs, I attach gdb to it and usually get this backtrace (long lines are cut off):

#28 0x080b5ed0 in check_poison_obj (cachep=0x27c4e5a0, objp=0x27c9d000) at /cc/4

#29 0x080b71d9 in cache_alloc_debugcheck_after (cachep=0x27c4e5a0, flags=208, ob

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/mm/slab.c:3072

#30 0x080b7547 in kmem_cache_alloc (cachep=0x27c4e5a0, flags=0) at /cc/4gbts/oss

#31 0x080bfddf in getname (filename=0x619e1f <Address 0x619e1f out of bounds>) a

#32 0x080b955b in do_sys_open (dfd=-100, filename=0x619e1f <Address 0x619e1f out

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/fs/open.c:1086

#33 0x080b9623 in sys_open (filename=0x619e1f <Address 0x619e1f out of bounds>,

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/fs/open.c:1113

#34 0x0805f120 in handle_syscall (r=0x27365374) at /cc/4gbts/oss_1/target/uml/li

#35 0x0806d1a0 in handle_trap (pid=17303, regs=0x27365374, local_using_sysemu=0)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/skas/proce

#36 0x0806d63e in userspace (regs=0x27365374) at /cc/4gbts/oss_1/target/uml/linu

#37 0x0805c800 in fork_handler () at include2/asm/thread_info.h:49

#38 0x00000000 in ?? ()

Debugging I have done so far:

1. If I single-step, execution never leaves the function userspace. If I set a breakpoint at schedule, it is never hit, so no scheduling happens.

2. If I set a breakpoint at do_IRQ, I get the following backtrace, which shows that the process is still handling alarm (timer) signals:

#0  do_IRQ (irq=0, regs=0x1) at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/include/asm-generic/irq_regs.h:33

#1  0x0805dc3b in timer_handler (sig=26, regs=0x26f8ef0c)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/kernel/time.c:29

#2  0x0806aaaf in real_alarm_handler (sc=0x1) at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/signal.c:94

#3  0x0806ad3e in unblock_signals () at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/signal.c:277

#4  0x0806d7b4 in userspace (regs=0x27054354)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/skas/process.c:450

#5  0x0805c8de in fork_handler () at include2/asm/thread_info.h:49

#6  0x00000000 in ?? ()

3. If I set a breakpoint at sigio_handler, I get a backtrace which shows that the process still receives SIGIO when a key is pressed:

#0  sigio_handler (sig=29, regs=0x8592c6c) at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/kernel/irq.c:80

#1  0x0806aa2f in sig_handler_common (sig=29, sc=0x8592d28)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/signal.c:49

#2  0x0806aa72 in sig_handler (sig=29, sc=0x8592d28)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/signal.c:81

#3  0x0806ab95 in handle_signal (sig=-1207980044, sc=0x8592d28)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/signal.c:158

#4  0x0806c647 in hard_handler (sig=29)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/sys-i386/signal.c:12

#5  <signal handler called>

#6  check_poison_obj (cachep=0x27c4e5a0, objp=0x27c9d000) at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/mm/slab.c:1854

#7  0x080b7469 in cache_alloc_debugcheck_after (cachep=0x27c4e5a0, flags=208, objp=0x27c9d000, caller=0x27c4e56b)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/mm/slab.c:3072

#8  0x080b77d7 in kmem_cache_alloc (cachep=0x27c4e5a0, flags=0)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/mm/slab.c:3475

#9  0x080c006f in getname (filename=0x619e1f <Address 0x619e1f out of bounds>)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/fs/namei.c:146

#10 0x080b97eb in do_sys_open (dfd=-100, filename=0x619e1f <Address 0x619e1f out of bounds>, flags=0, mode=438)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/fs/open.c:1086

#11 0x080b98b3 in sys_open (filename=0x619e1f <Address 0x619e1f out of bounds>, flags=0, mode=438)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/fs/open.c:1113

#12 0x0805f200 in handle_syscall (r=0x27054354)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/kernel/skas/syscall.c:35

#13 0x0806d2b4 in handle_trap (pid=11408, regs=0x27054354, local_using_sysemu=0)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/skas/process.c:210

#14 0x0806d785 in userspace (regs=0x27054354)

    at /cc/4gbts/oss_1/target/uml/linux-2.6.26.7/src/arch/um/os-Linux/skas/process.c:439

#15 0x0805c8de in fork_handler () at include2/asm/thread_info.h:49

#16 0x00000000 in ?? ()

4. To find out why the alarm signal does not trigger a schedule, I used a helper function to print the current status from gdb:

(gdb) p get_current()

$2 = (void *) 0x0

(gdb) p *get_current_thread()

$3 = {task = 0x0, exec_domain = 0x0, flags = 4, cpu = 0, preempt_count = -257, addr_limit = {seg = 0}, restart_block = {

    fn = 0, {futex = {uaddr = 0x0, val = 0, flags = 0, bitset = 0, time = 0}, nanosleep = {index = 0, rmtp = 0x0,

        expires = 0}}}, real_thread = 0x8592000}

(gdb) p currentstatus()

$4 = 0x85944c0 "PreemtCount: -257,HardIrqCount:268369920, softIrqCount:65024, irqCount:268434944, In interrupt:268434944\n"

At this point I am stuck. My guess is that the problem lies in the interrupt handling, which is hard for me to dig into further. Can someone help me?

Thanks,

Mars Zhao