On Fri, Jul 11, 2008 at 1:02 AM, Jeff Dike <jdike@addtoit.com> wrote:
> On Thu, Jul 10, 2008 at 10:25:29AM +0800, Jiaying Zhang wrote:
> > Do you have any thought about what the problem might be?
> > Thanks a lot!
>
> Yeah, my first thought is that your code is buggy.
>
> Since 2.6.25 seems OK, you can bisect between then and now to see
> either what caused the bug or what is triggering the crash on an
> existing bug.

The 2.6.24 kernels are OK, but I have seen this problem with every
2.6.25 kernel I have tried. A lot changed between 2.6.24 and 2.6.25,
and I am not sure which change leads to this problem.

> The other thing you can do is gdb the UML and see if gdb gives you a
> better stack trace.

Here is the trace from running UML under gdb:

Program received signal SIGTERM, Terminated.
0xb7fff410 in ?? ()
(gdb) bt
#0  0xb7fff410 in ?? ()
#1  0x08323afc in cpu0_irqstack ()
#2  0xfffffffe in ?? ()
#3  0x0000000f in ?? ()
#4  0x464850c6 in kill () from /lib/tls/i686/cmov/libc.so.6
#5  0x0806624b in os_dump_core () at arch/um/os-Linux/util.c:92
#6  0x08059703 in panic_exit (self=0x83254f4, unused1=0, unused2=0x8340a80) at arch/um/kernel/um_arch.c:233
#7  0x080849d0 in notifier_call_chain (nl=0x0, val=0, v=0x8340a80, nr_to_call=0, nr_calls=0x0)
    at kernel/notifier.c:70
#8  0x08084a72 in __atomic_notifier_call_chain (nh=0x8340a60, val=0, v=0x8340a80, nr_to_call=-1,
    nr_calls=0x0) at kernel/notifier.c:159
#9  0x08084a89 in atomic_notifier_call_chain (nh=0x8340a60, val=0, v=0x8340a80) at kernel/notifier.c:168
#10 0x0807116f in panic (fmt=0x82d2039 "Kernel mode fault at addr 0x%lx, ip 0x%lx") at kernel/panic.c:101
#11 0x080594c1 in segv (fi={error_code = 6, cr2 = 98596, trap_no = 14}, ip=136845739, is_user=0,
    regs=0x8323c6c) at arch/um/kernel/trap.c:206
#12 0x080592a0 in segv_handler (sig=11, regs=0x8323c6c) at arch/um/kernel/trap.c:152
#13 0x0806537b in sig_handler_common (sig=11, sc=0x8323d24) at arch/um/os-Linux/signal.c:48
#14 0x080653b8 in sig_handler (sig=11, sc=0x8323d24) at arch/um/os-Linux/signal.c:80
#15 0x080654dd in handle_signal (sig=<value optimized out>, sc=0x8323d24) at arch/um/os-Linux/signal.c:157
#16 0x08066ebf in hard_handler (sig=11) at arch/um/os-Linux/sys-i386/signal.c:12
#17 <signal handler called>
#18 __down_interruptible (sem=0x9f68978) at include/linux/list.h:50
#19 0x0828091a in __down_failed_interruptible () at arch/um/sys-i386/../../x86/lib/semaphore_32.S:63
#20 0x08220a89 in ddsnap_create (target=0xa829080, argc=4, argv=0x9f6f290)
    at include/asm/arch/semaphore_32.h:120
#21 0x0821b160 in dm_table_add_target (t=0x9f6f178, type=0xa82414c "ddsnap", start=165497564, len=204800,
    params=0xa82415c "/dev/ubdc") at drivers/md/dm-table.c:772

It looks like the crash happens inside __down_interruptible.
I checked the semaphore passed to __down_interruptible under gdb
and found that it was corrupted:
(gdb) f 18
#18 __down_interruptible (sem=0x9f68d08) at include/linux/list.h:50
50              prev->next = new;
(gdb) p sem
$15 = (struct semaphore *) 0x9f68d08
(gdb) p *sem
$16 = {count = {counter = -268435295}, sleepers = 4, wait = {lock = {raw_lock = {<No data fields>}}, task_list = {
      next = 0x9f68d5c, prev = 0x18124}}}

But the semaphore looks correct in ddsnap_create, just before the call to down_interruptible:
(gdb) f 20
#20 0x082209fd in ddsnap_create (target=0xa829080, argc=4, argv=0x9f733a8) at include/asm/arch/semaphore_32.h:120
120             __asm__ __volatile__(
(gdb) p info->identify_sem
$28 = {count = {counter = -1}, sleepers = 0, wait = {lock = {raw_lock = {<No data fields>}}, task_list = {
      next = 0x9f0ca14, prev = 0x9f0ca14}}}

I found that, starting with the 2.6.25 kernel, the declaration of
__down_failed_interruptible changed from fastcall to extern asmregparm.
Could that be related to this problem?