#148 vproc_hold_movement oops

Milestone: v1.9.1
Status: closed-fixed
Owner: nobody
Priority: 5
Updated: 2008-01-02
Created: 2007-12-02
Creator: Roger Tsang
Private: No

<4>procfs: impossible type (25)<7>Assertion failed! vp != ((void *)0), cluster/ssi/vproc/dvp_vpops.c, vpop_report_state, line=1369
<1>Unable to handle kernel NULL pointer dereference at virtual address 0000000c
<1> printing eip:
<4>c029226c
<1>*pde = 00000000
<1>Oops: 0000 [#1]
<4>SMP
<4>Modules linked in: loop nfsd tun ipt_REJECT ipt_state ipt_multiport iptable_filter ipt_MASQUERADE iptable_nat ip_conntrack ip_tables softdog nls_iso8859_1 nls_cp437 vfat fat usb_storage binfmt_misc uhci_hcd ehci_hcd usbcore floppy drbd via_rhine sk98lin r8169 forcedeth dm_mod
<4>CPU: 1
<4>EIP: 0060:[<c029226c>] Not tainted VLI
<4>EFLAGS: 00010296 (2.6.11-ssi5.31)
<4>EIP is at vproc_hold_movement+0xc/0x1f0
<4>eax: 00000000 ebx: 00000000 ecx: c04c3a10 edx: 00000000
<4>esi: 00000001 edi: c4c6d400 ebp: f7d43b6c esp: f7d43b0c
<4>ds: 007b es: 007b ss: 0068
<4>Process child_reaper (pid: 2, threadinfo=f7d42000 task=f7d41630)
<4>Stack: 00000000 f7d43b30 c0136f7c f536df2c 00000001 00000000 00000000 00000000
<4> c04c3a10 f7d43b58 c011b6c1 f536df2c 00000001 00000000 00000000 c04c3a10
<4> c04c3a0c 00000001 00000286 f7d43b84 c011b738 00000000 00000001 c4c6d400
<4>Call Trace:
<4> [<c0104eff>] show_stack+0x7f/0xa0
<4> [<c01050a6>] show_registers+0x166/0x230
<4> [<c0105446>] die+0xf6/0x1c0
<4> [<c011850d>] do_page_fault+0x45d/0x652
<4> [<c0104b5f>] error_code+0x2b/0x30
<4> [<c0296f82>] pvpop_report_state+0x32/0x690
<4> [<c029e191>] vpop_report_state+0x1b1/0x3c0
<4> [<c01214fa>] release_task+0x17a/0x1c0
<4> [<c01227b6>] wait_task_zombie+0xe6/0x240
<4> [<c0122ddb>] pproc_reap+0x29b/0x380
<4> [<c0293704>] pvpop_reap+0x204/0x500
<4> [<c0292e7d>] dpvproc_nocldwait_async_handler+0x13d/0x2f0
<4> [<c02771a5>] async_cleanup_task_structs+0x55/0x90
<4> [<c02b5005>] initproc_postroot_init+0x145/0x230
<4> [<c027d872>] ssisys_cluster_initproc+0x12/0x20
<4> [<c027bd7b>] do_ssisys+0x9b/0x1f0
<4> [<c027bf1e>] sys_ssisys+0x4e/0x70
<4> [<c0103fc5>] sysenter_past_esp+0x52/0x75
<4>Code: 89 42 08 c9 c3 8d 76 00 8d bc 27 00 00 00 00 55 89 e5 c9 c3 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 57 56 53 83 ec 54 8b 45 08 <8b> 58 0c 8d 93 c4 00 00 00 89 d0 89 55 b0 e8 f1 a6 1b 00 8b 4d
<4>
[1]kdb> bt
Stack traceback for pid 2
0xf7d41630 2 0 1 1 R 0xf7d41800 *child_reaper
EBP EIP Function (args)
0xf7d43b6c 0xc029226c vproc_hold_movement+0xc (0x0, 0x0, 0xc047ad88, 0x292, 0xf7d43ba4)
0xf7d43c00 0xc0296f82 pvpop_report_state+0x32 (0x0, 0xc4c6d400, 0xf7d43c54, 0x0, 0x1)
0xf7d43c48 0xc029e191 vpop_report_state+0x1b1 (0xc4c6d400, 0x11, 0x0, 0x1, 0x0)
0xf7d43c84 0xc01214fa release_task+0x17a (0xe51239f0, 0x0, 0xf7d43cb8, 0x0, 0x0)
0xf7d43cc8 0xc01227b6 wait_task_zombie+0xe6 (0xe51239f0, 0x0, 0x0, 0xf7d43e4c, 0xf7d43e50)
0xf7d43d18 0xc0122ddb pproc_reap+0x29b (0xe51239f0, 0x0, 0xf7d43e4c, 0xf7d43e50, 0x313d3)
0xf7d43e28 0xc0293704 pvpop_reap+0x204 (0xcfc34000, 0xffffffff, 0x20, 0x313d3, 0xf7d43e4c)
0xf7d43efc 0xc0292e7d dpvproc_nocldwait_async_handler+0x13d (0xc6a4c218, 0xf7d42000, 0xf7d42000, 0xf7d41630, 0x8)
0xf7d43f18 0xc02771a5 async_cleanup_task_structs+0x55 (0xf7d41630, 0x0, 0x40000001, 0x0, 0xc02b4eb0)
0xf7d43f58 0xc02b5005 initproc_postroot_init+0x145
0xf7d43f60 0xc027d872 ssisys_cluster_initproc+0x12
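
The faulting address can be cross-checked against the `Code:` bytes in the oops. The marked bytes `<8b> 58 0c` are a 32-bit `mov` load whose ModRM byte 0x58 selects ebx as destination and eax plus an 8-bit displacement as source, i.e. `mov 0xc(%eax),%ebx`. With eax = 00000000 the load touches address 0000000c, which matches the reported fault. Below is a minimal sketch of that decoding in C. The struct and function names are illustrative only (they are not taken from the kernel or from OpenSSI), and the decoder deliberately handles just this one addressing form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register indices as encoded in ModRM: eax=0 ecx=1 edx=2 ebx=3 esp=4 ebp=5 esi=6 edi=7 */

/* Illustrative (hypothetical) result type for a decoded "mov disp8(base),dst". */
struct mov_load {
    int dst;      /* destination register (ModRM reg field) */
    int base;     /* base register (ModRM rm field) */
    int8_t disp;  /* signed 8-bit displacement */
};

/*
 * Decode opcode 0x8b (mov r32, r/m32) for mod == 01 (base + disp8) without
 * a SIB byte. Returns 0 on success, -1 for any other encoding.
 */
static int decode_mov_load(const uint8_t *insn, struct mov_load *out)
{
    uint8_t modrm;

    assert(insn != NULL && out != NULL);
    if (insn[0] != 0x8b)
        return -1;
    modrm = insn[1];
    if ((modrm >> 6) != 1 || (modrm & 7) == 4)  /* rm == 4 means a SIB byte follows */
        return -1;
    out->dst = (modrm >> 3) & 7;
    out->base = modrm & 7;
    out->disp = (int8_t)insn[2];
    return 0;
}

/* Effective address of the load, given the register values from the oops. */
static uint32_t fault_address(const struct mov_load *m, const uint32_t regs[8])
{
    return regs[m->base] + (uint32_t)(int32_t)m->disp;
}
```

Fed the bytes `8b 58 0c` and the register dump above (eax = 0), `fault_address()` yields 0xc. The preceding bytes `8b 45 08` decode the same way to `mov 0x8(%ebp),%eax`, a load of the first stack argument, which kdb also shows as 0x0 for vproc_hold_movement.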

Related

Bugs: #1

Discussion

  • Roger Tsang
    2007-12-02

    Logged In: YES
    user_id=1246761
    Originator: YES

    Oops on a dual-core AMD Opteron in vproc_hold_movement(), caused by a NULL vp returned by tnc_locate_vproc_pid(). The pid is at its origin node, so it should be in the vproc hash, but tnc_locate_vproc_pid() does not find it there.
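
The failure mode described in this comment (a hash lookup that returns NULL, followed by an unchecked dereference) can be illustrated with a small user-space sketch. Everything here is hypothetical: `toy_vproc`, `toy_locate()` and `toy_report_state()` are stand-ins invented for illustration and do not reflect the real OpenSSI structures or the signature of tnc_locate_vproc_pid(). The sketch only shows the defensive pattern of checking the lookup result before use; it is not the fix that was applied to this bug.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_HASH_SIZE 16

/* Hypothetical stand-in for a vproc entry; not the OpenSSI layout. */
struct toy_vproc {
    int pid;
    int state;
    struct toy_vproc *next;  /* chain within one hash bucket */
};

static struct toy_vproc *toy_hash[TOY_HASH_SIZE];

static unsigned toy_bucket(int pid)
{
    return (unsigned)pid % TOY_HASH_SIZE;
}

static void toy_insert(struct toy_vproc *vp)
{
    assert(vp != NULL);
    vp->next = toy_hash[toy_bucket(vp->pid)];
    toy_hash[toy_bucket(vp->pid)] = vp;
}

/* Returns the entry for pid, or NULL if the pid is not in the hash. */
static struct toy_vproc *toy_locate(int pid)
{
    struct toy_vproc *vp;

    for (vp = toy_hash[toy_bucket(pid)]; vp != NULL; vp = vp->next)
        if (vp->pid == pid)
            return vp;
    return NULL;
}

/* Checks the lookup result instead of dereferencing it blindly. */
static int toy_report_state(int pid, int new_state)
{
    struct toy_vproc *vp = toy_locate(pid);

    if (vp == NULL)
        return -ESRCH;  /* entry missing: report an error, do not oops */
    vp->state = new_state;
    return 0;
}
```

A check like this only turns the crash into an error return. It does not explain why an entry that should be in the hash is missing, which is the actual question raised in the comment.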

     
  • Roger Tsang
    2007-12-02

    Logged In: YES
    user_id=1246761
    Originator: YES

    Is the assertion in vproc_origin_inform_nodedown_done() related to the oops on node 3?

    ng over master from node 3.
    <4>Node 3 has gone down!!!
    <7>Assertion failed! surrogate_origin_node == this_node, cluster/ssi/vproc/nd_origin.c, vproc_origin_inform_nodedown_done, line=289
    <4>passed the first scan in ipcname_pull_data
    <4>num_objects[MSG] = 0
    <4>num_objects[SEM] = 2
    <4>num_objects[SHM] = 9
    <4>ipcnameserver ready completed

     
  • Roger Tsang
    2007-12-06

    Logged In: YES
    user_id=1246761
    Originator: YES

    I am not using the new ATOMIC_VPROC_REFCNT code.

    This was also reported to occur during InfiniBand IPC bring-up (with 1.9.3). Does that mean the bug can be reproduced?

     
  • Roger Tsang
    2007-12-11

    Logged In: YES
    user_id=1246761
    Originator: YES

    The latest check-in, guarded by #ifdef VPROC_HASH_LIST, includes an SMP bug fix for possible vproc hash corruption caused by a duplicate vproc release.
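
To illustrate why a duplicate release can corrupt a hash chain, and one common way to make removal safe to repeat, here is a minimal user-space sketch. It is an assumption-laden illustration, not the VPROC_HASH_LIST code: `toy_node`, `toy_hash_add()` and `toy_hash_del()` are hypothetical names, and the locking that an SMP kernel would need around these operations is omitted. The idea is that a node records whether it is still linked, so a second removal becomes a no-op instead of writing through stale pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hash-chain node; pprev is NULL while the node is unlinked. */
struct toy_node {
    struct toy_node *next;
    struct toy_node **pprev;  /* address of the pointer that points at us */
    int pid;
};

/* Link n at the front of the chain rooted at *head. */
static void toy_hash_add(struct toy_node **head, struct toy_node *n)
{
    assert(head != NULL && n != NULL && n->pprev == NULL);
    n->next = *head;
    if (*head != NULL)
        (*head)->pprev = &n->next;
    *head = n;
    n->pprev = head;
}

/*
 * Unlink n from its chain. Returns 1 if the node was removed, 0 if it was
 * already unlinked, so a duplicate release cannot touch the chain again.
 */
static int toy_hash_del(struct toy_node *n)
{
    assert(n != NULL);
    if (n->pprev == NULL)
        return 0;
    *n->pprev = n->next;
    if (n->next != NULL)
        n->next->pprev = n->pprev;
    n->next = NULL;
    n->pprev = NULL;  /* mark as unlinked */
    return 1;
}
```

Without the `pprev == NULL` guard, a second call would write through pointers that may by then belong to other entries, which is the kind of corruption that could make a later lookup miss an entry that should be present.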

     
  • Roger Tsang
    2007-12-11

    Logged In: YES
    user_id=1246761
    Originator: YES

    I cannot reproduce this bug, and I am not using an IB interconnect. Post your oops if you can reproduce it with the new VPROC_HASH_LIST code.

     
  • Roger Tsang
    2007-12-19

    • status: open --> open-fixed
     
  • Roger Tsang
    2007-12-27

    • milestone: 782904 --> v1.9.1
     
  • Roger Tsang
    2007-12-27

    Logged In: YES
    user_id=1246761
    Originator: YES

    The original oops was produced on 2.0.0pre1, but the affected code dates back to 1.9.1.

     
  • Roger Tsang
    2008-01-02

    • status: open-fixed --> closed-fixed