#155 unlocking unlocked lock in ripc_drop_locks

Milestone: v1.9.3
Status: closed-fixed
Owner: nobody
Labels: IPC (12)
Priority: 5
Updated: 2010-03-13
Created: 2008-04-17
Creator: John Hughes
Private: No

------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU: 0
EIP: 0060:[<c046290b>] Not tainted VLI
EFLAGS: 00010202 (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001 ebx: c0750140 ecx: c0750101 edx: f7e12e08
esi: f70c2400 edi: c0753360 ebp: f7032f10 esp: f7032f10
ds: 007b es: 007b ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5
02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003
f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec
Call Trace:
[<c010694f>] show_stack+0x7f/0xa0
[<c0106b04>] show_registers+0x164/0x220
[<c0106e94>] die+0xf4/0x1c0
[<c0107015>] do_trap+0xb5/0xc0
[<c01072cc>] do_invalid_op+0xbc/0xd0
[<c01065a3>] error_code+0x2b/0x30
[<c01cecbb>] ipc_unlock+0xb/0x10
[<c01ce77e>] ipc_drop_locks+0x1e/0x40
[<c0261dd5>] ripc_drop_locks+0x45/0x60
[<c0258708>] svr_ripc_drop_locks+0x58/0xb0
[<c020abb3>] icssvr_daemon+0x2f3/0xab0
[<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90

Related

Bugs: #1

Discussion

  • John Hughes
    2008-04-17

    Frankly I can't see how this is supposed to work. In abort_rmid (cluster/ssi/ipc/ipcshm_svr.c around line 418) we have:

    nl = NSC_NODELIST_COPY(svp->shm_nodelist);
    [...]
    while ((node = NSC_NODELIST_GET_NEXT(&cookie, nl))
           != CLUSTERNODE_INVAL) {
            if (node == this_node)
                    continue;
            ret = RIPC_DROP_LOCKS(node, &rval, id);
    }

    Which will call ipc_drop_locks on the remote node, but who has the lock? And why?

     
  • John Hughes
    2008-04-17

    Well, it turned out to be pretty easy to duplicate. I wrote a silly test program to create, attach and destroy a segment.

    Node 1                  Node 2

    shmget
    shmat
                            shmget
                            shmat
                            shmctl (IPC_RMID)

    BUG!
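
    (A rough sketch of what such a test program might look like; the key, size and
    timing here are assumptions, not the attached source:)

        #include <stdio.h>
        #include <sys/ipc.h>
        #include <sys/shm.h>
        #include <unistd.h>

        int main(void)
        {
                /* assumed key and size; the real test program may differ */
                int id = shmget(0x7b, 4096, IPC_CREAT | 0666);
                if (id < 0) { perror("shmget"); return 1; }

                if (shmat(id, NULL, 0) == (void *)-1) { perror("shmat"); return 1; }

                sleep(5);       /* give the other node time to attach */

                /* destroying the segment while the other node is still
                 * attached is what triggers the BUG there */
                if (shmctl(id, IPC_RMID, NULL) < 0) { perror("shmctl"); return 1; }
                return 0;
        }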

     
  • John Hughes
    2008-04-18

     
    Attachments
  • John Hughes
    2008-04-18

    Well, a simple patch seems to be to set the "id" to zero in the call to ipc_drop_locks in ripc_drop_locks. That means we continue to call up(&ids->sem) (which we need) but not ipc_unlock (which crashes us immediately).
    Patch attached.
    File Added: ripc_drop_locks.patch
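
    (A rough sketch of the idea, based on the description above; the attached
    ripc_drop_locks.patch is the authoritative version:)

        struct shmid_kernel *shp = shm_cli_get(id);
        if (shp)
                shp->shm_perm.mode &= ~SHM_LOCK_DEST;
        /* pass 0 as the id so ipc_drop_locks skips ipc_unlock() but
         * still does up(&ids->sem) */
        ipc_drop_locks(0, (struct kern_ipc_perm *)shp, &shm_ids, 1);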

     
  • Roger Tsang
    2008-04-19

    Race at shm_destroy() going to the wrong svrnode. Try this patch.
    File Added: ipc_shm.c.patch

     
  • Roger Tsang
    2008-04-19

     
    Attachments
  • Roger Tsang
    2008-04-19

    Updated patch: the sys_shmget() client did not register with the server.
    File Added: ipc_shm.c.patch

     
  • John Hughes
    2008-04-22

    test program

     
    Attachments
  • John Hughes
    2008-04-22

    I've just tested with current CVS (2008-04-22), which as far as I can tell includes ipc_shm.c.patch (IPC_SHM_DESTROY_FIX is defined in linux.h), and I get the same crash.

    To duplicate the problem, on a two-node system:

    onnode 1 ./shmcrash 123 100 &
    sleep 1
    onnode 2 ./shmcrash 123 120

    Node 2 will crash in ipc_unlock when the shmcrash process on node 1 destroys the shared memory segment.

    Source for shmcrash program attached.

    File Added: shmcrash.c

     
  • Roger Tsang
    2008-04-23

    Taking a closer look at shm_destroy(), it looks like ssi_shm_cleanup() did not handle -ERMID. It's kinda late at night, so I haven't tested either your shmcrash or this patch, but it makes sense.
    File Added: IPC_SHM_DESTROY_FIX.patch

     
  • Roger Tsang
    2008-04-23

    for CVS (2008-4-22)

     
  • Roger Tsang
    2008-05-23

    • status: open --> open-accepted
     
  • Roger Tsang
    2008-05-24

    • status: open-accepted --> closed-fixed
     
  • Roger Tsang
    2008-05-24

    Latest patch in CVS.

     
  • John Hughes
    2008-06-08

    Using a kernel build from CVS as of 2008-06-07 I see the same problem - crashes in ipc_unlock called from ripc_drop_locks.

    Here's my analysis of the problem:

    1. There is an rpc "RIPC_DROP_LOCKS(node, rval, id)" which calls ripc_drop_locks on another node.

    2. ripc_drop_locks, which runs in the context of an rpc server on a different node from the process that called RIPC_DROP_LOCKS, does:

    struct shmid_kernel *shp = shm_cli_get(id);
    if (shp)
            shp->shm_perm.mode &= ~SHM_LOCK_DEST;
    ipc_drop_locks(id, (struct kern_ipc_perm *)shp, &shm_ids, 1);

    (id is obviously nonzero here or shm_cli_get would do nothing interesting)

    ipc_drop_locks does:

    if (id > 0)
            ipc_unlock(perm);
    if (table)
            up(&ids->sem);

    (id is nonzero so we must call ipc_unlock)

    and ipc_unlock does:

    spin_unlock(&perm->lock);
    rcu_read_unlock();

    So ipc_unlock expects that the calling process will have the spinlock, but ripc_drop_locks *CANNOT* have the spinlock - it is operating in the context of the rpc server, and spinlocks can't be held across sleeps, and the rpc server obviously sleeps between rpc's.

    My hacky fix is to set id to zero in the call to ipc_drop_locks. An equivalent fix would be to replace the call to ipc_drop_locks by directly calling up(&ids->sem);
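
    (Illustration only, not a tested patch: since ripc_drop_locks passes &shm_ids
    as the table, the direct equivalent of that last sentence would be:)

        /* was: ipc_drop_locks(id, (struct kern_ipc_perm *)shp, &shm_ids, 1); */
        up(&shm_ids.sem);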

     
  • John Hughes
    2008-06-08

    • status: closed-fixed --> open-fixed
     
  • John Hughes
    2008-10-20

    Fixed in latest CVS

     
  • Roger Tsang
    2010-03-13

    • status: open-fixed --> closed-fixed