Thread: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

[uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jiaying Z. <jia...@go...> - 2008-07-03 07:53:59

Hello,

I found that since the 2.6.25 kernels, UML crashes when it calls down() on a
semaphore with a zero counter. Here is some example code.

static struct semaphore test_sem;
static DECLARE_WAIT_QUEUE_HEAD(sleep_queue);

static int testfunc(void *unused)
{
        /* sleep for a short period, then up the semaphore */
        interruptible_sleep_on_timeout(&sleep_queue, 5 * HZ);
        up(&test_sem);
        return 0;
}

static int parent_func(unsigned argc, char **argv)
{
        sema_init(&test_sem, 0); /* init semaphore with zero counter */
        /* create a thread that will up the semaphore (its argument is unused) */
        kernel_thread(testfunc, NULL, CLONE_FILES);
        /* SHOULD wait here until testfunc ups the semaphore */
        return down_interruptible(&test_sem);
}

Our kernel module uses this kind of code to synchronize different kernel
threads. It runs fine on a real machine and on old UML kernels, but crashes
on 2.6.25.4 UML. I tried the latest 2.6.25.9 kernel and still saw the same
problem. It seems to have something to do with UML's signal handling. Does
anyone know which changes in the 2.6.25 UML code may cause the problem?
Thanks a lot!

Jiaying

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jeff D. <jd...@ad...> - 2008-07-03 13:56:55

On Thu, Jul 03, 2008 at 12:53:46AM -0700, Jiaying Zhang wrote:
> I found since 2.6.25 kernels, uml crashes when it calls down() on a
> semaphore with
> zero counter.

What's the stack trace?

Can you bisect it?

		Jeff

-- 
Work email - jdike at linux dot intel dot com

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Mattia D. <mal...@li...> - 2008-07-20 15:20:37

On Fri, Jul 18, 2008 at 04:53:42PM -0400, Jeff Dike wrote:
> On Thu, Jul 17, 2008 at 12:55:09PM +0800, Jiaying Zhang wrote:
> > The patch below solves the 2.6.25 uml crash problem for me. Looks like the
> > problem should be away in 2.6.26 kernel because down_interruptible has
> > changed to the C code since 2.6.26. But I got kernel panic while booting
> > the 2.6.26 kernel :(.
> > 
> > --- linux-2.6.25.4/lib/semaphore-sleepers.c     2008-05-15
> > 23:00:12.000000000 +0800
> > +++ linux-2.6.25.4-new/lib/semaphore-sleepers.c 2008-07-17
> > 12:20:47.000000000 +0800
> > @@ -48,12 +48,12 @@
> >   *    we cannot lose wakeup events.
> >   */
> > 
> > -void __up(struct semaphore *sem)
> > +asmregparm void __up(struct semaphore *sem)
> >  {
> >         wake_up(&sem->wait);
> >  }
> 
> You continue to ignore a few important facts:
> 
>     1 - There are a ton of semaphores in UML
>     2 - They all work, except for yours
>     Therefore, a patch which changes all semaphores across all
> architectures for which asmregparm has meaning can't possibly be the
> correct fix.
> 
> However, you might have treated this as an important clue, and looked
> at whether your broken semaphore has a different set of declarations
> in force than those in the rest of the kernel.

Jeff,
it's not entirely clear to me why, but that patch fixes a segfault that
I experience when booting uml 2.6.25 built with gcc-4.3 on a 2.6.25
host (I also applied your ICE workaround patch).
I'm booting a debian sid image that I usually run before uploading the
new uml package in debian.
I've got no fancy modules written by me and the segfault is 100%
reproducible with that debian image (a different image -a gentoo-
doesn't crash).

I'll provide more info tomorrow, I'll try to further trace the crash
with gdb.

cheers
-- 
mattia
:wq!

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jiaying Z. <jia...@go...> - 2008-07-04 01:06:36

The stack trace isn't very helpful. Here it is.

EIP: 0073:[<d84156c5>] CPU: 0 Not tainted ESP: 007b:0be3ea78 EFLAGS: 00210206
    Not tainted
EAX: 0be548d8 EBX: 08325b54 ECX: 08325b58 EDX: 0be548cc
ESI: 00000001 EDI: 080598c6 EBP: 0be3ea98 DS: 007b ES: 007b
08323b6c:  [<0806a718>] show_regs+0xc4/0xc9
08323b98:  [<080594b3>] segv+0x20e/0x226
08323c3c:  [<080592a0>] segv_handler+0x4f/0x54
08323c5c:  [<0806537b>] sig_handler_common+0x63/0x72
08323cd4:  [<080653b8>] sig_handler+0x2e/0x3e
08323cec:  [<080654dd>] handle_signal+0x4d/0x7a
08323d0c:  [<08066ebf>] hard_handler+0xf/0x14
08323d1c:  [<b7fff420>] 0xb7fff420

Kernel panic - not syncing: Kernel mode fault at addr 0xd84156c5, ip 0xd84156c5

EIP: 0073:[<40146334>] CPU: 0 Not tainted ESP: 007b:bfaf2378 EFLAGS: 00200246
    Not tainted
EAX: ffffffda EBX: 00000003 ECX: c134fd09 EDX: 08050368
ESI: 4002b7c0 EDI: 40029180 EBP: bfaf24c8 DS: 007b ES: 007b
08323ad8:  [<0806a718>] show_regs+0xc4/0xc9
08323b04:  [<080596ed>] panic_exit+0x23/0x39
08323b18:  [<080849d0>] notifier_call_chain+0x21/0x4d
08323b38:  [<08084a72>] __atomic_notifier_call_chain+0x17/0x19
08323b54:  [<08084a89>] atomic_notifier_call_chain+0x15/0x17
08323b70:  [<0807116f>] panic+0x4f/0xd1
08323b8c:  [<080594c1>] segv+0x21c/0x226
08323c3c:  [<080592a0>] segv_handler+0x4f/0x54
08323c5c:  [<0806537b>] sig_handler_common+0x63/0x72
08323cd4:  [<080653b8>] sig_handler+0x2e/0x3e
08323cec:  [<080654dd>] handle_signal+0x4d/0x7a
08323d0c:  [<08066ebf>] hard_handler+0xf/0x14
08323d1c:  [<b7fff420>] 0xb7fff420

Segmentation fault

Jiaying

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jeff D. <jd...@ad...> - 2008-07-20 15:44:40

On Mon, Jul 21, 2008 at 12:20:22AM +0900, Mattia Dongili wrote:
> it's not entirely clear to me why, but that patch fixes a segfault that
> I experience when booting uml 2.6.25 built with gcc-4.3 on a 2.6.25
> host (I also applied your ICE workaround patch).

Hmmm, get a stack trace from it and let's see what's going on.

Presumably, you're not doing kernel development, just building a stock UML?

	    	       	     Jeff

-- 
Work email - jdike at linux dot intel dot com

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jiaying Z. <jia...@go...> - 2008-07-10 02:25:41

Hi Jeff,

Do you have any thought about what the problem might be?
Thanks a lot!

Jiaying

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Mattia D. <mal...@li...> - 2008-07-21 12:54:39

On Sun, Jul 20, 2008 at 11:44:20AM -0400, Jeff Dike wrote:
> On Mon, Jul 21, 2008 at 12:20:22AM +0900, Mattia Dongili wrote:
> > it's not entirely clear to me why, but that patch fixes a segfault that
> > I experience when booting uml 2.6.25 built with gcc-4.3 on a 2.6.25
> > host (I also applied your ICE workaround patch).
> 
> Hmmm, get a stack trace from it and let's see what's going on.
> 
> Presumably, you're not doing kernel development, just building a stock UML?

Nope, I'm not doing kernel development on it; it's a stock UML. The added
patches are just small customizations for Debian:
http://svn.debian.org/viewsvn/pkg-uml/trunk/src/user-mode-linux/debian/patches/
patch #1 is not used, #2 and #3 are trivial changes. #4 is the gcc-4.3
ICE workaround and #5 is Jiaying's patch we are discussing.

The configuration is this:
http://svn.debian.org/viewsvn/pkg-uml/trunk/src/user-mode-linux/config.i386?rev=310&view=markup
on top of this I enabled the debug info to be built:
CONFIG_PRINTK_TIME=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_INFO=y
CONFIG_FRAME_POINTER=y
I also just reconfirmed that with Jiaying's patch it doesn't happen.

Program received signal SIGILL, Illegal instruction.
0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0x080702e5 in __wake_up_common (q=0x16d50e88, mode=3, nr_exclusive=1, sync=0, key=0x0) at kernel/sched.c:4145
#2  0x08070323 in __wake_up_locked (q=0x16d50e88, mode=3) at kernel/sched.c:4174
#3  0x082556da in __down (sem=0x16d50e80) at lib/semaphore-sleepers.c:88
#4  0x0825401a in __down_failed () at arch/um/sys-i386/../../x86/lib/semaphore_32.S:42
#5  0x081072b7 in flush_commit_list (s=0x16dfba00, jl=0x16d50e80, flushall=1) at include/asm/arch/semaphore_32.h:99
#6  0x081077a3 in flush_async_commits (work=0x18936124) at fs/reiserfs/journal.c:3507
#7  0x08082a24 in run_workqueue (cwq=0x16ee9080) at kernel/workqueue.c:276
#8  0x08082cdf in worker_thread (__cwq=0x16ee9080) at kernel/workqueue.c:321
#9  0x0808538f in kthread (_create=0x17c679b4) at kernel/kthread.c:80
#10 0x08068f2b in run_kernel_thread (fn=0x8085347 <kthread>, arg=0x17c679b4, jmp_ptr=0x16e28bb4)
    at arch/um/os-Linux/process.c:267
#11 0x0805ae87 in new_thread_handler () at arch/um/kernel/process.c:151
#12 0x00000000 in ?? ()
(gdb) l
178      * area at compile-time..
179      */
180     static __always_inline void * __constant_c_memset(void * s, unsigned long c, size_t count)
181     {
182     int d0, d1;
183     __asm__ __volatile__(
184             "rep ; stosl\n\t"
185             "testb $2,%b3\n\t"
186             "je 1f\n\t"
187             "stosw\n"
(gdb) up
#1  0x080702e5 in __wake_up_common (q=0x16d50e88, mode=3, nr_exclusive=1, sync=0, key=0x0) at kernel/sched.c:4145
4145                    if (curr->func(curr, mode, sync, key) &&
(gdb) print *curr
$3 = {flags = 255, private = 0x0, func = 0, task_list = {next = 0x16d50e88, prev = 0x0}}

It looks like there is no func here...

(gdb) l
4140            wait_queue_t *curr, *next;
4141
4142            list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
4143                    unsigned flags = curr->flags;
4144
4145                    if (curr->func(curr, mode, sync, key) &&
4146                                    (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
4147                            break;
4148            }
4149    }
(gdb) up
#2  0x08070323 in __wake_up_locked (q=0x16d50e88, mode=3) at kernel/sched.c:4174
4174            __wake_up_common(q, mode, 1, 0, NULL);
(gdb) l
4169    /*
4170     * Same as __wake_up but called with the spinlock in wait_queue_head_t held.
4171     */
4172    void __wake_up_locked(wait_queue_head_t *q, unsigned int mode)
4173    {
4174            __wake_up_common(q, mode, 1, 0, NULL);
4175    }
4176
4177    /**
4178     * __wake_up_sync - wake up threads blocked on a waitqueue.
(gdb) up
#3  0x082556da in __down (sem=0x16d50e80) at lib/semaphore-sleepers.c:88
88              wake_up_locked(&sem->wait);
(gdb) l
83
84                      spin_lock_irqsave(&sem->wait.lock, flags);
85                      tsk->state = TASK_UNINTERRUPTIBLE;
86              }
87              remove_wait_queue_locked(&sem->wait, &wait);
88              wake_up_locked(&sem->wait);
89              spin_unlock_irqrestore(&sem->wait.lock, flags);
90              tsk->state = TASK_RUNNING;
91      }
92
(gdb) print *sem
$4 = {count = {counter = 5833}, sleepers = 0, wait = {lock = {raw_lock = {<No data fields>}}, task_list = {next = 0xf,
      prev = 0xf}}}
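
The gdb output above shows func = 0 in *curr and frame #0 at address
0x00000000, which is what an indirect call through a null function pointer
would look like. As a rough illustration only, here is a userspace model of
such a queue walk with a guard added. The names wait_entry, count_wake and
wake_up_checked are hypothetical, the list is simplified to a singly linked
one, and nothing here is the kernel's actual code.

```c
/* Userspace model only; not kernel code. All names are hypothetical. */
#include <assert.h>
#include <stddef.h>

struct wait_entry;
typedef int (*wait_func_t)(struct wait_entry *w, unsigned mode);

struct wait_entry {
    unsigned flags;
    wait_func_t func;         /* a null value here mirrors "func = 0" in the gdb print */
    struct wait_entry *next;  /* simplified: singly linked, NULL-terminated */
};

/* Trivial callback that reports one woken waiter. */
int count_wake(struct wait_entry *w, unsigned mode)
{
    (void)w;
    (void)mode;
    return 1;
}

/*
 * Walk the queue and invoke each callback.
 * Returns the number of callbacks that reported a wakeup, or -1 as soon
 * as an entry with a null callback is found. Without the guard, the call
 * would transfer control to address 0.
 */
int wake_up_checked(struct wait_entry *head, unsigned mode)
{
    struct wait_entry *curr;
    int woken = 0;

    for (curr = head; curr != NULL; curr = curr->next) {
        if (curr->func == NULL)
            return -1;
        woken += curr->func(curr, mode);
    }
    return woken;
}
```

The guard only reports the symptom. In the session above the semaphore
itself prints as garbage (task_list next and prev are both 0xf), so the
entry being walked was probably never a real wait queue entry.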

Any other useful information I could provide?
-- 
mattia
:wq!

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jeff D. <jd...@ad...> - 2008-07-10 17:02:25

On Thu, Jul 10, 2008 at 10:25:29AM +0800, Jiaying Zhang wrote:
> Do you have any thought about what the problem might be?
> Thanks a lot!

Yeah, my first thought is that your code is buggy.

Since 2.6.25 seems OK, you can bisect between then and now to see
either what caused the bug or what is triggering the crash on an
existing bug.

The other thing you can do is gdb the UML and see if gdb gives you a
better stack trace.

			Jeff

-- 
Work email - jdike at linux dot intel dot com

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Mattia D. <mal...@li...> - 2008-08-02 05:54:20

On Mon, Jul 21, 2008 at 09:54:26PM +0900, Mattia Dongili wrote:
> On Sun, Jul 20, 2008 at 11:44:20AM -0400, Jeff Dike wrote:
> > On Mon, Jul 21, 2008 at 12:20:22AM +0900, Mattia Dongili wrote:
> > > it's not entirely clear to me why, but that patch fixes a segfault that
> > > I experience when booting uml 2.6.25 built with gcc-4.3 on a 2.6.25
> > > host (I also applied your ICE workaround patch).
> > 
> > Hmmm, get a stack trace from it and let's see what's going on.

Hi Jeff,
FWIW I can't reproduce this on 2.6.26.

cheers
-- 
mattia
:wq!

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jeff D. <jd...@ad...> - 2008-08-04 16:44:48

On Sat, Aug 02, 2008 at 02:54:08PM +0900, Mattia Dongili wrote:
> FWIW I can't reproduce this on 2.6.26.

Thanks for letting me know.  Too bad it's still a mystery though.

       	   	      Jeff

-- 
Work email - jdike at linux dot intel dot com

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jiaying Z. <jia...@go...> - 2008-07-14 09:07:06

On Fri, Jul 11, 2008 at 1:02 AM, Jeff Dike <jd...@ad...> wrote:

> On Thu, Jul 10, 2008 at 10:25:29AM +0800, Jiaying Zhang wrote:
> > Do you have any thought about what the problem might be?
> > Thanks a lot!
>
> Yeah, my first thought is that your code is buggy.
>
> Since 2.6.25 seems OK, you can bisect between then and now to see
> either what caused the bug or what is triggering the crash on an
> existing bug.


The 2.6.24 kernels are OK, but I have seen this problem with all of the
2.6.25 kernels I have tried. There have been a lot of changes between
2.6.24 kernels and 2.6.25 kernels. I am not sure which one may lead
to this problem.


> The other thing you can do is gdb the UML and see if gdb gives you a
> better stack trace.
>

Here is the trace from gdb uml.

Program received signal SIGTERM, Terminated.
0xb7fff410 in ?? ()
(gdb) bt
#0  0xb7fff410 in ?? ()
#1  0x08323afc in cpu0_irqstack ()
#2  0xfffffffe in ?? ()
#3  0x0000000f in ?? ()
#4  0x464850c6 in kill () from /lib/tls/i686/cmov/libc.so.6
#5  0x0806624b in os_dump_core () at arch/um/os-Linux/util.c:92
#6  0x08059703 in panic_exit (self=0x83254f4, unused1=0, unused2=0x8340a80) at arch/um/kernel/um_arch.c:233
#7  0x080849d0 in notifier_call_chain (nl=0x0, val=0, v=0x8340a80, nr_to_call=0, nr_calls=0x0)
    at kernel/notifier.c:70
#8  0x08084a72 in __atomic_notifier_call_chain (nh=0x8340a60, val=0, v=0x8340a80, nr_to_call=-1,
    nr_calls=0x0) at kernel/notifier.c:159
#9  0x08084a89 in atomic_notifier_call_chain (nh=0x8340a60, val=0, v=0x8340a80) at kernel/notifier.c:168
#10 0x0807116f in panic (fmt=0x82d2039 "Kernel mode fault at addr 0x%lx, ip 0x%lx") at kernel/panic.c:101
#11 0x080594c1 in segv (fi={error_code = 6, cr2 = 98596, trap_no = 14}, ip=136845739, is_user=0,
    regs=0x8323c6c) at arch/um/kernel/trap.c:206
#12 0x080592a0 in segv_handler (sig=11, regs=0x8323c6c) at arch/um/kernel/trap.c:152
#13 0x0806537b in sig_handler_common (sig=11, sc=0x8323d24) at arch/um/os-Linux/signal.c:48
#14 0x080653b8 in sig_handler (sig=11, sc=0x8323d24) at arch/um/os-Linux/signal.c:80
#15 0x080654dd in handle_signal (sig=<value optimized out>, sc=0x8323d24) at arch/um/os-Linux/signal.c:157
#16 0x08066ebf in hard_handler (sig=11) at arch/um/os-Linux/sys-i386/signal.c:12
#17 <signal handler called>
#18 __down_interruptible (sem=0x9f68978) at include/linux/list.h:50
#19 0x0828091a in __down_failed_interruptible () at arch/um/sys-i386/../../x86/lib/semaphore_32.S:63
#20 0x08220a89 in ddsnap_create (target=0xa829080, argc=4, argv=0x9f6f290)
    at include/asm/arch/semaphore_32.h:120
#21 0x0821b160 in dm_table_add_target (t=0x9f6f178, type=0xa82414c "ddsnap", start=165497564, len=204800,
    params=0xa82415c "/dev/ubdc") at drivers/md/dm-table.c:772

Looks like the problem happens when __down_interruptible is called.
I checked the semaphore passed to __down_interruptible under gdb
and found it was corrupted:
(gdb) f 18
#18 __down_interruptible (sem=0x9f68d08) at include/linux/list.h:50
50              prev->next = new;
(gdb) p sem
$15 = (struct semaphore *) 0x9f68d08
(gdb) p *sem
$16 = {count = {counter = -268435295}, sleepers = 4, wait = {lock = {raw_lock = {<No data fields>}}, task_list = {
      next = 0x9f68d5c, prev = 0x18124}}}

But the semaphore looks correct before calling down_interruptible:
(gdb) f 20
#20 0x082209fd in ddsnap_create (target=0xa829080, argc=4, argv=0x9f733a8) at include/asm/arch/semaphore_32.h:120
120             __asm__ __volatile__(
(gdb) p info->identify_sem
$28 = {count = {counter = -1}, sleepers = 0, wait = {lock = {raw_lock = {<No data fields>}}, task_list = {
      next = 0x9f0ca14, prev = 0x9f0ca14}}}
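
For readers comparing the two prints: in a circular doubly linked list of
the kind used for task_list, an empty head has next and prev both pointing
at the head itself, which would fit the second print if 0x9f0ca14 is the
address of task_list (the message does not show that address, so this is an
assumption). The sketch below is a userspace model with hypothetical helper
names. list_add_between performs the same four pointer updates as the
kernel's __list_add(), including the "prev->next = new" statement that
frame #18 stops on.

```c
/* Userspace model of a circular doubly linked list; helper names are hypothetical. */
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An empty head points at itself in both directions. */
void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/*
 * Insert entry between two known neighbours. The last statement writes
 * through prev, so a garbage prev pointer faults exactly there.
 */
void list_add_between(struct list_head *entry,
                      struct list_head *prev,
                      struct list_head *next)
{
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
}

/* Checks only the head's own fields, so it is safe even if they are garbage. */
int list_head_is_empty(const struct list_head *h)
{
    return h->next == h && h->prev == h;
}

/* Dereferences the neighbours: only call this when next/prev are valid pointers. */
int list_links_consistent(const struct list_head *h)
{
    return h->next->prev == h && h->prev->next == h;
}
```

A head whose next and prev differ from each other and from its own address,
as in the first print, fails the empty check, and walking it would follow
pointers such as 0x18124 that are unlikely to be mapped.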

I found that in the 2.6.25 kernel, the declaration of
__down_failed_interruptible changed from fastcall to extern asmregparm.
Could this be related to the problem?

Jiaying

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jeff D. <jd...@ad...> - 2008-07-14 14:46:23

On Mon, Jul 14, 2008 at 05:06:49PM +0800, Jiaying Zhang wrote:
> The 2.6.24 kernels are OK, but I have seen this problem with all of the
> 2.6.25 kernels I have tried. There have been a lot of changes between
> 2.6.24 kernels and 2.6.25 kernels. I am not sure which one may lead
> to this problem.

So bisect it.

> Looks like the problem happens when __down_interruptible is called.
> I checked the semaphore passed to __down_interruptible under gdb
> and found it was corrupted:
> (gdb) f 18
> #18 __down_interruptible (sem=0x9f68d08) at include/linux/list.h:50
> 50              prev->next = new;
> (gdb) p sem
> $15 = (struct semaphore *) 0x9f68d08
> (gdb) p *sem
> $16 = {count = {counter = -268435295}, sleepers = 4, wait = {lock =
> {raw_lock = {<No data fields>}}, task_list = {
>       next = 0x9f68d5c, prev = 0x18124}}}
> 
> But the semaphore looks correct before calling down_interruptible:

What's the problem with debugging this, then?  You step through the
code starting when the semaphore is good and see exactly when it gets
corrupted.

				Jeff

-- 
Work email - jdike at linux dot intel dot com

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jiaying Z. <jia...@go...> - 2008-07-16 09:52:44

On Mon, Jul 14, 2008 at 10:46 PM, Jeff Dike <jd...@ad...> wrote:

> On Mon, Jul 14, 2008 at 05:06:49PM +0800, Jiaying Zhang wrote:
> > The 2.6.24 kernels are OK, but I have seen this problem with all of the
> > 2.6.25 kernels I have tried. There have been a lot of changes between
> > 2.6.24 kernels and 2.6.25 kernels. I am not sure which one may lead
> > to this problem.
>
> So bisect it.

The problem seems to be related to the fastcall removal changes introduced
in the 2.6.25 kernels. I found that the problem starts with commit
82f74e7159749cc511ebf5954a7b9ea6ad634949: "x86: unify
include/asm-x86/linkage_[32|64].h".
After that, several commits related to __down_interruptible were checked
in, but they did not fix the crash I saw.
In particular, I thought commit d50efc6c40620b2e11648cac64ebf4a824e40382
("x86: fix UML and -regparm=3") would solve the problem, because it adds
the asmregparm macro, which is the same as fastcall, and uses the macro for
the __down_failed_interruptible declaration. Unfortunately, I tried that
version of the git code and saw the same problem.

> > Looks like the problem happens when __down_interruptible is called.
> > I checked the semaphore passed to __down_interruptible under gdb
> > and found it was corrupted:
> > (gdb) f 18
> > #18 __down_interruptible (sem=0x9f68d08) at include/linux/list.h:50
> > 50              prev->next = new;
> > (gdb) p sem
> > $15 = (struct semaphore *) 0x9f68d08
> > (gdb) p *sem
> > $16 = {count = {counter = -268435295}, sleepers = 4, wait = {lock =
> > {raw_lock = {<No data fields>}}, task_list = {
> >       next = 0x9f68d5c, prev = 0x18124}}}
> >
> > But the semaphore looks correct before calling down_interruptible:
>
> What's the problem with debugging this, then?  You step through the
> code starting when the semaphore is good and see exactly when it gets
> corrupted.
>

Yes. It looks like the corruption happens when __down_failed_interruptible()
calls __down_interruptible(), and it has something to do with 2.6.25's x86
gcc attribute changes.
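
As background on the attribute in question: on 32-bit x86, GCC's regparm(3)
function attribute passes up to three integer arguments in EAX, EDX and ECX
instead of on the stack. Caller and callee must agree on this. If a
hand-written assembly caller puts a pointer in EAX while the C function was
compiled to read its argument from the stack, the function would see an
unrelated stack word as its pointer. Whether that is what happens in this
crash is not established here. The sketch below only shows the matched case,
with the attribute on both the declaration and the definition. SKETCH_REGPARM,
sketch_sem and sketch_down_slowpath are hypothetical names for illustration,
not kernel identifiers.

```c
/* Illustration only; all names are hypothetical, not kernel identifiers. */
#include <assert.h>

/* Register-passing convention on 32-bit x86 with GCC or Clang, a no-op elsewhere. */
#if defined(__i386__) && defined(__GNUC__)
#define SKETCH_REGPARM __attribute__((regparm(3)))
#else
#define SKETCH_REGPARM
#endif

struct sketch_sem {
    int count;
    int sleepers;
};

/* Declaration as seen by callers, for example the C prototype used by a stub. */
SKETCH_REGPARM int sketch_down_slowpath(struct sketch_sem *sem);

/*
 * The definition carries the same attribute, so both sides agree on where
 * the sem pointer is passed. Dropping the attribute on only one side is
 * the kind of mismatch discussed in the message above.
 */
SKETCH_REGPARM int sketch_down_slowpath(struct sketch_sem *sem)
{
    sem->sleepers++;
    return sem->count;
}
```

When both sides are C, the compiler keeps them consistent through the shared
prototype. An assembly stub gets no such check, so its register usage has to
be kept in step with the C declaration by hand.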

Jiaying

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jiaying Z. <jia...@go...> - 2008-07-17 04:55:19

The patch below solves the 2.6.25 UML crash for me. It looks like the
problem should go away in the 2.6.26 kernel, because down_interruptible has
been implemented in C since 2.6.26. But I got a kernel panic while booting
the 2.6.26 kernel :(.

--- linux-2.6.25.4/lib/semaphore-sleepers.c     2008-05-15 23:00:12.000000000 +0800
+++ linux-2.6.25.4-new/lib/semaphore-sleepers.c 2008-07-17 12:20:47.000000000 +0800
@@ -48,12 +48,12 @@
  *    we cannot lose wakeup events.
  */

-void __up(struct semaphore *sem)
+asmregparm void __up(struct semaphore *sem)
 {
        wake_up(&sem->wait);
 }

-void __sched __down(struct semaphore *sem)
+asmregparm void __sched __down(struct semaphore *sem)
 {
        struct task_struct *tsk = current;
        DECLARE_WAITQUEUE(wait, tsk);
@@ -90,7 +90,7 @@ void __sched __down(struct semaphore *se
        tsk->state = TASK_RUNNING;
 }

-int __sched __down_interruptible(struct semaphore *sem)
+asmregparm int __sched __down_interruptible(struct semaphore *sem)
 {
        int retval = 0;
        struct task_struct *tsk = current;
@@ -153,7 +153,7 @@ int __sched __down_interruptible(struct
  * single "cmpxchg" without failure cases,
  * but then it wouldn't work on a 386.
  */
-int __down_trylock(struct semaphore *sem)
+asmregparm int __down_trylock(struct semaphore *sem)
 {
        int sleepers;
        unsigned long flags;

Jiaying

Re: [uml-devel] 2.6.25 uml kernel crashes when it calls down() on a semaphore with zero counter

From: Jeff D. <jd...@ad...> - 2008-07-18 20:53:53

On Thu, Jul 17, 2008 at 12:55:09PM +0800, Jiaying Zhang wrote:
> The patch below solves the 2.6.25 uml crash problem for me. Looks like the
> problem should be away in 2.6.26 kernel because down_interruptible has
> changed to the C code since 2.6.26. But I got kernel panic while booting
> the 2.6.26 kernel :(.
> 
> --- linux-2.6.25.4/lib/semaphore-sleepers.c     2008-05-15
> 23:00:12.000000000 +0800
> +++ linux-2.6.25.4-new/lib/semaphore-sleepers.c 2008-07-17
> 12:20:47.000000000 +0800
> @@ -48,12 +48,12 @@
>   *    we cannot lose wakeup events.
>   */
> 
> -void __up(struct semaphore *sem)
> +asmregparm void __up(struct semaphore *sem)
>  {
>         wake_up(&sem->wait);
>  }

You continue to ignore a few important facts:

    1 - There are a ton of semaphores in UML
    2 - They all work, except for yours
    Therefore, a patch which changes all semaphores across all
architectures for which asmregparm has meaning can't possibly be the
correct fix.

However, you might have treated this as an important clue, and looked
at whether your broken semaphore has a different set of declarations
in force than those in the rest of the kernel.

   	      	       Jeff

-- 
Work email - jdike at linux dot intel dot com