Hi,

I managed to trigger an oops on the init node.  There is also an assertion failure in vproc_origin_inform_nodedown_done on the takeover init node.

Remember from my previous message that I managed to unstick pidof on node 3?  Well, this time another pidof process crashed node 3.

The vproc_release_movement assertion failures probably occurred while I was compiling some code on the cluster earlier.  I was running the compilation on node 1 with load leveling turned on.

Roger


Node 3
-----------
<7>Assertion failed! mlp->ml_shr_count >= 0, cluster/ssi/vproc/dvp_lock.c, vproc_release_movement, line=138
<7>Assertion failed! mlp->ml_shr_count >= 0, cluster/ssi/vproc/dvp_lock.c, vproc_release_movement, line=138
<7>Assertion failed! mlp->ml_shr_count >= 0, cluster/ssi/vproc/dvp_lock.c, vproc_release_movement, line=138
<7>Assertion failed! mlp->ml_shr_count >= 0, cluster/ssi/vproc/dvp_lock.c, vproc_release_movement, line=138
<7>Assertion failed! mlp->ml_shr_count >= 0, cluster/ssi/vproc/dvp_lock.c, vproc_release_movement, line=138
<7>Assertion failed! mlp->ml_shr_count >= 0, cluster/ssi/vproc/dvp_lock.c, vproc_release_movement, line=138
<7>Assertion failed! mlp->ml_shr_count >= 0, cluster/ssi/vproc/dvp_lock.c, vproc_release_movement, line=138
<6>eth1: link up, 100Mbps, full-duplex, lpa 0xC5E1
<6>bonding: bond0: backup interface eth1 is now up
<6>bonding: bond0: backup interface eth1 is now down
<6>bonding: bond0: backup interface eth1 is now up
<6>bonding: bond0: backup interface eth1 is now down
<4>eth2: network connection down
<6>bonding: bond0: link status down for active interface eth2, disabling it
<6>bonding: bond0: making interface eth1 the new active one.
<6>bonding: bond0: eth1 is up and now the active interface
<6>nm_add_node: Node 2 added
<4>eth2: network connection up using port A
<4>    speed:           1000
<4>    autonegotiation: yes
<4>    duplex mode:     full
<4>    flowctrl:        symmetric
<4>    role:            slave
<4>    irq moderation:  dynamic (2000 ints/sec)
<4>    scatter-gather:  enabled
<4>    tx-checksum:     enabled
<4>    rx-checksum:     enabled
<4>    rx-polling:      enabled
<6>bonding: bond0: backup interface eth2 is now up
<6>bonding: bond0: backup interface eth2 is now down
<4>Node 2 has gone down!!!
<6>drbd: No DRBD connection to node 2 defined.
<6>drbd: No DRBD connection to node 2 defined.
<6>drbd: No DRBD connection to node 2 defined.
<2>EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
<2>EXT2-fs error (device ram0): ext2_readdir: bad page in #2
<3>drbd3: PingAck did not arrive in time.
<6>drbd3: drbd3_asender [232219]: cstate Connected --> NetworkFailure
<6>drbd3: asender terminated
<6>drbd3: drbd3_receiver [197162]: cstate NetworkFailure --> BrokenPipe
<3>drbd3: short read expecting header on sock: r=-512
<6>drbd3: worker terminated
<6>drbd3: drbd3_receiver [197162]: cstate BrokenPipe --> Unconnected
<6>drbd3: Connection lost.
<6>drbd3: drbd3_receiver [197162]: cstate Unconnected --> WFConnection
<6>drbd3: drbd3_receiver [197162]: cstate WFConnection --> WFReportParams
<6>drbd3: Handshake successful: DRBD Network Protocol version 74
<6>drbd3: Connection established.
<6>drbd3: I am(P): 1:00000158:00000029:00002f69:00000050:10
<6>drbd3: Peer(S): 1:00000158:00000029:00002f68:00000050:01
<6>drbd3: drbd3_receiver [197162]: cstate WFReportParams --> WFBitMapS
<6>drbd3: Primary/Unknown --> Primary/Secondary
<6>drbd3: drbd3_receiver [197162]: cstate WFBitMapS --> SyncSource
<6>drbd3: Resync started as SyncSource (need to sync 0 KB [0 bits set]).
<6>drbd3: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
<6>drbd3: drbd3_receiver [197162]: cstate SyncSource --> Connected
<1>Unable to handle kernel paging request at virtual address 004056ad
<1> printing eip:
<4>c0185ab8
<1>*pde = 00000000
<1>Oops: 0000 [#1]
<4>Modules linked in: ipt_MASQUERADE loop nfsd exportfs tun ipt_REJECT ipt_state ipt_multiport iptable_filter iptable_nat ip_conntrack ip_tables binfmt_misc ehci_hcd usbcore floppy drbd bonding sata_via libata sk98lin r8169 via_rhine dm_mod
<4>CPU:    0
<4>EIP:    0060:[<c0185ab8>]    Not tainted VLI
<4>EFLAGS: 00010246   (2.6.10-bk7-ssi22)
<4>EIP is at task_dumpable+0x8/0x20
<4>eax: 0040564d   ebx: c6a7de1c   ecx: c6a7de74   edx: 00000000
<4>esi: c6a7de00   edi: cbe570dc   ebp: c5263dbc   esp: c5263dbc
<4>ds: 007b   es: 007b   ss: 0068
<4>Process pidof (pid: 204142, threadinfo=c5262000 task=d1519550)
<4>Stack: c5263e2c c0214598 0040564d 00000000 c03e5f68 f6ebe144 d1519550 c051a4c0
<4>       00000002 f6ebe144 00000001 00000002 00000000 c5263e1c c0276297 f6ebe144
<4>       00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000
<4>Call Trace:
<4> [<c0104a8f>] show_stack+0x7f/0xa0
<4> [<c0104c25>] show_registers+0x155/0x220
<4> [<c0104fac>] die+0xcc/0x190
<4> [<c011682d>] do_page_fault+0x46d/0x66b
<4> [<c010470b>] error_code+0x2b/0x30
<4> [<c0214598>] pvpop_procfs_getattr+0x148/0x240
<4> [<c0185ced>] pid_revalidate+0x5d/0x120
<4> [<c016582f>] do_lookup+0x5f/0xa0
<4> [<c0165a00>] link_path_walk+0x190/0xca0
<4> [<c0166765>] path_lookup+0x75/0x130
<4> [<c0166eda>] open_namei+0x8a/0x5e0
<4> [<c01566aa>] filp_open+0x3a/0x60
<4> [<c0156aa5>] sys_open+0x55/0xa0
<4> [<c0103c55>] sysenter_past_esp+0x52/0x75
<4>Code: 24 08 8b 45 0c 89 44 24 04 8b 45 08 89 04 24 e8 ff fd ff ff c9 c3 8d b6 00 00 00 00 8d bc 27 00 00 00 00 55 31 d2 89 e5 8b 45 08 <8b> 40 60 85 c0 74 0a 0f b6 90 3c 01 00 00 83 e2 01 5d 89 d0 c3
<4>
kdb>
kdb> bt
Stack traceback for pid 204142
0xd1519550   204142   204141  1    0   R  0xd1519710 *pidof
EBP        EIP        Function (args)
0xc5263dbc 0xc0185ab8 task_dumpable+0x8 (0x40564d, 0x0, 0xc03e5f68, 0xf6ebe144, 0xd1519550)
0xc5263e2c 0xc0214598 pvpop_procfs_getattr+0x148 (0xc6a7de00, 0x0, 0x0, 0xc5263e50, 0xc5263e54)
0xc5263e68 0xc0185ced pid_revalidate+0x5d (0xcbe570dc, 0xc5263f58, 0x1, 0xf1bab00d, 0x5cfdca64)
0xc5263e88 0xc016582f do_lookup+0x5f (0xc5263f58, 0xc5263ed4, 0xc5263ecc, 0x0, 0xf7d806b4)
0xc5263ef0 0xc0165a00 link_path_walk+0x190 (0x1, 0x1, 0xc5263f58, 0x0)
0xc5263f08 0xc0166765 path_lookup+0x75 (0xc5263fc4, 0xc03d9c50, 0x6, 0xe, 0xb)
0xc5263f40 0xc0166eda open_namei+0x8a (0xf1bab000, 0x1, 0x1b6, 0xc5263f58, 0xf7d806b4)
0xc5263f9c 0xc01566aa filp_open+0x3a (0xf1bab000, 0x0, 0x1b6, 0xbffffbf0, 0x0)
0xc5263fbc 0xc0156aa5 sys_open+0x55
           0xc0103c55 sysenter_past_esp+0x52
kdb>


Node 1
-----------
Taking over master from node 3.
Node 3 has gone down!!!
write handler down off 140545363 len 147
Assertion failed! surrogate_origin_node == this_node, cluster/ssi/vproc/nd_origin.c, vproc_origin_inform_nodedown_done, line=289
passed the first scan in ipcname_pull_data
num_objects[MSG] = 0
num_objects[SEM] = 4
num_objects[SHM] = 13