#154 kernel BUG @ ipc/semc:1931

v1.9.3
closed-fixed
IPC (12)
5
2008-10-19
2008-04-14
John Hughes
No

Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU: 0
EIP: 0060:[<c01d447c>] Not tainted VLI
EFLAGS: 00010202 (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001 ebx: 0005800a ecx: 00000002 edx: e59f3f88
esi: e59f3f88 edi: 00030e83 ebp: f721fe64 esp: f721fe44
ds: 007b es: 007b ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
[<c010694f>] show_stack+0x7f/0xa0
[<c0106b04>] show_registers+0x164/0x220
[<c0106e94>] die+0xf4/0x1c0
[<c0107015>] do_trap+0xb5/0xc0
[<c01072cc>] do_invalid_op+0xbc/0xd0
[<c01065a3>] error_code+0x2b/0x30
[<c02614f1>] ripc_semexit+0x31/0x50
[<c0256fb3>] svr_ripc_semexit+0xa3/0x100
[<c020abb3>] icssvr_daemon+0x2f3/0xab0
[<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55

Related

Bugs: #1

Discussion

  • John Hughes

    John Hughes - 2008-04-14

    Logged In: YES
    user_id=166336
    Originator: YES

    Here's the code:

    static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array *sma)
    {
    int nsems, i;
    struct sem_semundo *un, **unp;

    BUG_ON(sem_checkid(sma,semid));

     
  • Roger Tsang

    Roger Tsang - 2008-04-15

    Logged In: YES
    user_id=1246761
    Originator: NO

    Maybe semid changed before exit_sem() got sem_lock(). Try attached patch.
    File Added: ipc_sem.c.patch

     
  • Roger Tsang

    Roger Tsang - 2008-04-15
     
  • Nobody/Anonymous

    Logged In: NO

    I'm sorry Roger, I don't get the point of your patch.

    I suppose the relevant bit is:

    @@ -2027,7 +2025,7 @@ namesvr_semexit_go:
    continue;
    }

    - __ssi_semexit(semid, current->tgid, sma);
    + __ssi_semexit(u->semid, current->tgid, sma);
    }
    }

    so if semid has been changed to be bad, or if u->semid was bad and has changed to be good we won't panic. I can't see how either of these conditions can happen.

    Also the call path that seems to be causing problems seems to be (from the trace above):

    [client node]
    exit_sem
    cli_ripc_semexit

    [server node]
    [...]
    svr_ripc_semexit
    ripc_semexit
    ssi_semexit
    __ssi_semexit

    and your patch touches the

    exit_sem
    __ssi_semexit

    path.

    I must admit I'm pretty suprised to see that the client/server stuff is being used - I thought everything was staying node-local.

    Maybe the fix is simply to bail out of __ssi_semexit if check_semid doesn't match? Couldn't it just indicate a sem_exit/IPC_RMID collision?

     
  • John Hughes

    John Hughes - 2008-04-16

    Logged In: YES
    user_id=166336
    Originator: YES

    Here's another example, this time it was going through the local exit_sem path:

    ------------[ cut here ]------------
    kernel BUG at ipc/sem.c:1913!
    invalid operand: 0000 [#1]
    SMP
    Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
    CPU: 0
    EIP: 0060:[<c01d3f29>] Not tainted VLI
    EFLAGS: 00210202 (2.6.11-jh-1)
    EIP is at exit_sem+0x229/0x2b0
    eax: 00000001 ebx: c597e808 ecx: 00000001 edx: c597e808
    esi: 000e800c edi: cbf682e0 ebp: d76fce6c esp: d76fcdd0
    ds: 007b es: 007b ss: 0068
    Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
    Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80 d1e12ee4
    d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4 defe7380
    0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c c0125456
    Call Trace:
    [<c010694f>] show_stack+0x7f/0xa0
    [<c0106b04>] show_registers+0x164/0x220
    [<c0106e94>] die+0xf4/0x1c0
    [<c0107015>] do_trap+0xb5/0xc0
    [<c01072cc>] do_invalid_op+0xbc/0xd0
    [<c01065a3>] error_code+0x2b/0x30
    [<c012a319>] do_exit+0xb9/0x3b0
    [<c012a68c>] do_group_exit+0x3c/0xb0
    [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
    [<c01057c4>] do_signal+0x74/0x140
    [<c0105917>] do_notify_resume+0x87/0x8c
    [<c0105a86>] work_notifysig+0x13/0x15
    Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89 1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07 8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

    So it's not to do with local/remote semaphores.

     

    Related

    Bugs: #1

  • John Hughes

    John Hughes - 2008-04-17

    Logged In: YES
    user_id=166336
    Originator: YES

    Another BUG in the semaphore code - may indicate the underlying cause of the problem?

    It's trying to unlock a lock that isn't locked.

    ------------[ cut here ]------------
    kernel BUG at include/asm/spinlock.h:112!
    invalid operand: 0000 [#1]
    SMP
    Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
    CPU: 0
    EIP: 0060:[<c046290b>] Not tainted VLI
    EFLAGS: 00010202 (2.6.11-jh-1)
    EIP is at _spin_unlock+0x1b/0x30
    eax: 00000001 ebx: c0750140 ecx: c0750101 edx: f7e12e08
    esi: f70c2400 edi: c0753360 ebp: f7032f10 esp: f7032f10
    ds: 007b es: 007b ss: 0068
    Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
    Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5
    02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003
    f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec
    Call Trace:
    [<c010694f>] show_stack+0x7f/0xa0
    [<c0106b04>] show_registers+0x164/0x220
    [<c0106e94>] die+0xf4/0x1c0
    [<c0107015>] do_trap+0xb5/0xc0
    [<c01072cc>] do_invalid_op+0xbc/0xd0
    [<c01065a3>] error_code+0x2b/0x30
    [<c01cecbb>] ipc_unlock+0xb/0x10
    [<c01ce77e>] ipc_drop_locks+0x1e/0x40
    [<c0261dd5>] ripc_drop_locks+0x45/0x60
    [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
    [<c020abb3>] icssvr_daemon+0x2f3/0xab0
    [<c01023a5>] kernel_thread_helper+0x5/0x10
    Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90

     

    Related

    Bugs: #1

  • John Hughes

    John Hughes - 2008-04-17

    Logged In: YES
    user_id=166336
    Originator: YES

    Well, since "ripc_drop_locks" is for shared memory not semaphores it's probably a different bug.

     
  • Roger Tsang

    Roger Tsang - 2008-04-17

    Logged In: YES
    user_id=1246761
    Originator: NO

    Can this be reproduced in the original compiled kernel from the latest binary release? Apache uses IPC semaphores and have not run into this bug on UP/SMP.

     
  • John Hughes

    John Hughes - 2008-04-21

    Logged In: YES
    user_id=166336
    Originator: YES

    I'm having some difficulty reproducing this problem after a reboot.

    I've hacked some debugging printf's into the kernel I'm using and will add any new info when/if I find it.

     
  • John Hughes

    John Hughes - 2008-05-30

    Logged In: YES
    user_id=166336
    Originator: YES

    Well well well. I can now reproduce this bug - launch totem (gnome movie player) on an .mp3 file, quit totem - crash!

    In fact I've seen this bug before - the first time on a 2.6.10 based kernel.

    Here's the trace I got from 2.6.10:

    Unable to handle kernel NULL pointer dereference at virtual address 00000004
    printing eip:
    c01d4b6d
    *pde = 00000000
    Oops: 0000 [#1]
    SMP
    Modules linked in: radeon button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
    CPU: 1
    EIP: 0060:[<c01d4b6d>] Not tainted VLI
    EFLAGS: 00210246 (2.6.10-ssi-1.9.2-jh-3)
    EIP is at ssi_semexit+0x3d/0xd0
    eax: f5005188 ebx: 00060b37 ecx: 00000000 edx: f665d2c0
    esi: f5005188 edi: f6b48b80 ebp: f6d5fdcc esp: f6d5fdb0
    ds: 007b es: 007b ss: 0068
    Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
    Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80 f6d5fe5c
    c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000 f6d5fe0c
    f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80 f6d5fe48
    Call Trace:
    [<c010671f>] show_stack+0x7f/0xa0
    [<c01068c4>] show_registers+0x164/0x230
    [<c0106c74>] die+0xf4/0x1c0
    [<c011f56d>] do_page_fault+0x48d/0x689
    [<c0106383>] error_code+0x2b/0x30
    [<c01d474e>] exit_sem+0x15e/0x190
    [<c012a619>] do_exit+0x159/0x4f0
    [<c012aa7a>] do_group_exit+0x3a/0xc0
    [<c0135163>] get_signal_to_deliver+0x233/0x360
    [<c0105590>] do_signal+0x70/0x150
    [<c01056c7>] do_notify_resume+0x57/0x8c
    [<c0105866>] work_notifysig+0x13/0x15
    Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d 50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7 04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34

    Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
    due to oops @ 0xc01d4b6d
    eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0
    esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d
    ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246
    xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
    [1]kdb>
    Stack traceback for pid 396087
    0xf7e21250 396087 1 1 1 R 0xf7e21430 *totem
    EBP EIP Function (args)
    0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88, 0x5d, 0xf720c660)
    0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84, 0xf7e21718)
    0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
    0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000, 0xf6d5f000)
    0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8, 0xf6d5ffc4, 0x0, 0x200282)
    0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010, 0xb71e37b0)
    0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
    0xc0105866 work_notifysig+0x13
    [1]kdb>

     

    Related

    Bugs: #1

  • John Hughes

    John Hughes - 2008-06-02

    Logged In: YES
    user_id=166336
    Originator: YES

    Here's the sequence of operations that causes the crash: Totem makes a semaphore, ups and downs it a few times; then removes it and recreates it; carries on upping and downing.

    When totem exits it tries to undo the ops on the 1st semaphore - but the sequence is now that of the 2nd one.

    Heres the output of some debugging printks I stuck in my kernel:

    sys_semget: key=56a4d5 nsems=1 flags=3b0
    ... totem creates the semaphore

    cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
    ... so we ask the nameserver for it'd ID

    ipcname_getid newid=360448, create=1
    ... it doesn't exist so we must create it

    sem_buildid (id=0, seq=11) = 360448
    ... create is now done

    semctl_down: IPC_RMID 360448
    ... now totem deletes the semaphore

    freeary id=360448

    cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver

    sys_semget: key=56a4d5 nsems=1 flags=3b0
    ... totem re-creates the semaphore

    cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
    ... we ask for it's ID

    ipcname_getid newid=393216, create=1
    ... it doesn't exist so we must re-create it

    sem_buildid (id=0, seq=12) = 393216
    ... create done

    ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but we've got the wrong sequence.

    ------------[ cut here ]------------
    kernel BUG at ipc/sem.c:1937!
    invalid operand: 0000 [#1]

     

    Related

    Bugs: #1

  • John Hughes

    John Hughes - 2008-06-02

    Logged In: YES
    user_id=166336
    Originator: YES

    Can it be as simple as that? Look at the code in freeary:

    #ifdef CONFIG_SSI
    for (un = sma->undo; un;) {
    u = un;
    un = u->id_next;
    kfree(u);
    }
    #else

    sma->undo is left pointing to free'd memory.

     
  • John Hughes

    John Hughes - 2008-06-02

    Logged In: YES
    user_id=166336
    Originator: YES

    Nah, that's not the bug - sma gets freed before freeary returns, so who cares if it has dangling pointers.

     
  • John Hughes

    John Hughes - 2008-06-03

    Logged In: YES
    user_id=166336
    Originator: YES

    Ok, here's the bug:

    Someone creates a semaphore

    Process (A) operates on it, creating a sem_undo structure for itself (on it's own node) and a sem_semundo for the semaphore (on the semaphore's node).

    Process (B), (where A !=B) removes the semaphore, cleaning up the sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

    Someone creates a new semaphore that happens to get the same index (but a different sequence) from the original semaphore.
    Process (A) exits - when we try to clean up its sem_undo structure sem_checkid fails because the sequence numbers don't match.

    Attached test program that crashes the system.

     
  • John Hughes

    John Hughes - 2008-06-03

    Crash node with semaphore fun

     
  • John Hughes

    John Hughes - 2008-06-03

    Logged In: YES
    user_id=166336
    Originator: YES

    File Added: semcrash.c

     
  • John Hughes

    John Hughes - 2008-06-03

    Logged In: YES
    user_id=166336
    Originator: YES

    (of course in my message below I meant:

    "Process (B), (where A != B) removes the semaphore, cleaning up the
    sem_semundo for the semaphore, BUT NOT THE sem_undo for process A."

     
  • John Hughes

    John Hughes - 2008-06-04

    Logged In: YES
    user_id=166336
    Originator: YES

    Ok, the patch is pretty simple - if sem_checkid fails it's not a bug, it's just a stale undo operation, which can be ignored.

    (This will screw up if the semaphore sequence number wraps around during the lifetime of a process that uses semaphores that are removed and recreated, but the sequence number can go up to 2^(32-15) so I doubt that will happen often).

    File Added: sem.c.patch

     
  • John Hughes

    John Hughes - 2008-06-04

    Patch to fix crash

     
  • John Hughes

    John Hughes - 2008-06-04

    Logged In: YES
    user_id=166336
    Originator: YES

    Actually it doesn't matter if wrapping sequence numbers give us a collision - the proc sem_undo only gets done if the semaphore has a matching sem_semundo.

     
  • John Hughes

    John Hughes - 2008-10-19

    Fixed in cvs

     
  • John Hughes

    John Hughes - 2008-10-19
    • assigned_to: nobody --> hughesj
    • status: open --> closed-fixed
     

Log in to post a comment.